# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When it is infeasible to store the whole binary stream, DGIM can estimate the number of 1-bits in the most recent window. In this exercise you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write the code and answer the questions below.

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
# Your code here, you can add cells if necessary
import time

window_size = 2000
max_same_bucket = 2
time_loc = 10000

buckets = []

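def merge():
    # Buckets are stored oldest-first. Scan from newest to oldest: whenever more
    # than max_same_bucket buckets share a size, merge the two oldest of them
    # into one bucket of twice the size, keeping the more recent timestamp.
    for i in range(len(buckets)-1,max_same_bucket-1,-1):
        if buckets[i]['bit_sum']==buckets[i-max_same_bucket]['bit_sum']:
            buckets[i-max_same_bucket]['timestamp']=buckets[i-max_same_bucket+1]['timestamp']
            buckets[i-max_same_bucket]['bit_sum']+=buckets[i-max_same_bucket+1]['bit_sum']
            del buckets[i-max_same_bucket+1]

def delete_expire(timestamp):
    # Drop the oldest bucket once its timestamp has left the window.
    if len(buckets)>0 and timestamp-window_size>=buckets[0]['timestamp']:
        del buckets[0]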

def DGIM():
    bit_sum = 0
    start_time = time.time()
    with open("../input/coding2/stream_data.txt") as f:
        for i in range(time_loc):
            tmp = f.read(2)
            if tmp:
                delete_expire(i+1)
                if int(tmp.strip('\t'))==1:
                    new_bucket = {"timestamp":i+1,"bit_sum":1}
                    buckets.append(new_bucket)
                    merge()
    for i in range(len(buckets)):
        bit_sum+=buckets[i]['bit_sum']
    # Only part of the oldest bucket may still be inside the window, so count half of it.
    if buckets:
        bit_sum-=buckets[0]['bit_sum']/2
    end_time = time.time()
    return bit_sum, end_time-start_time

bit_sum, cost_time = DGIM()
print("number of 1-bits: {}".format(int(bit_sum)))
print("cost time: {}".format(cost_time))

2. Write a function that exactly counts the number of 1-bits in the current window, and compare its running time and space usage with those of the DGIM algorithm.

In [None]:
# Your code here, you can add cells if necessary
import time

window_size = 2000
max_same_bucket = 2
time_loc = 10000

def count_precise():
    bit_sum = 0
    start_time = time.time()
    with open("../input/coding2/stream_data.txt") as f:
        beg_loc = 0 if time_loc<=window_size else 2*(time_loc-window_size)
        f.seek(beg_loc)
        for i in range(time_loc if time_loc<=window_size else window_size):
            tmp = f.read(2)
            if tmp and int(tmp.strip('\t'))==1:
                bit_sum+=1
    end_time = time.time()
    return bit_sum, end_time-start_time

bit_sum, cost_time = count_precise()
print("actual number of 1-bits: {}".format(bit_sum))
print("actual cost time: {}".format(cost_time))

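DGIM takes somewhat more time here because it processes every bit up to *time_loc* and maintains its buckets, whereas the exact count seeks directly to the start of the window and reads only the last *window_size* entries of the file. However, DGIM needs far less space: for a window of N bits it keeps only O(log N) buckets instead of the window itself, and it can answer queries online as the stream arrives. The difference between the ground truth and the DGIM result is also small; with at most two buckets of each size, the error is bounded by 50% of the true count.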
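
The space and accuracy trade-off can be checked on synthetic data. The following is a minimal sketch, not the notebook's implementation: the function name `dgim_vs_exact` and the random input are assumptions made for illustration, and it keeps at most two buckets of each size.

```python
import math
import random
from collections import deque

def dgim_vs_exact(bits, window_size):
    """Return (DGIM estimate, exact count, number of buckets) for the last window."""
    buckets = []                        # oldest first: [timestamp of newest 1-bit, size]
    window = deque(maxlen=window_size)  # the exact method has to keep every bit
    for t, bit in enumerate(bits, start=1):
        window.append(bit)
        # Expire the oldest bucket once its newest 1-bit leaves the window.
        if buckets and buckets[0][0] <= t - window_size:
            del buckets[0]
        if bit == 1:
            buckets.append([t, 1])
            i = len(buckets) - 1
            # While three buckets share a size, merge the two oldest of them.
            while i >= 2 and buckets[i][1] == buckets[i - 2][1]:
                buckets[i - 2] = [buckets[i - 1][0], 2 * buckets[i][1]]
                del buckets[i - 1]
                i -= 2
    estimate = sum(size for _, size in buckets)
    if buckets:
        estimate -= buckets[0][1] / 2   # only part of the oldest bucket is in the window
    return estimate, sum(window), len(buckets)

rng = random.Random(0)  # fixed seed; synthetic stream, not stream_data.txt
bits = [rng.randint(0, 1) for _ in range(10000)]
estimate, exact, n_buckets = dgim_vs_exact(bits, window_size=1000)
print("DGIM estimate: {}, exact: {}".format(estimate, exact))
print("state kept: {} buckets vs 1000 stored bits".format(n_buckets))
```

For a window of 1000 bits the bucket list never grows beyond 20 entries (two per power of two up to 512), while the exact counter has to hold all 1000 bits.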

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this exercise you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, then its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.
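
As a small illustration of this layout, the Jaccard similarity of two documents can be computed directly from their 0/1 rows. The sketch below uses two made-up five-shingle rows rather than the real file, and the helper name `jaccard` and the choice to return 0.0 for two empty documents are assumptions.

```python
def jaccard(row_a, row_b):
    """Jaccard similarity of two documents given as equal-length 0/1 rows."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)    # shingles in both documents
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)   # shingles in at least one
    return both / either if either else 0.0

# Hypothetical rows: shingles {0, 1, 3} and {0, 3, 4}
doc_a = [1, 1, 0, 1, 0]
doc_b = [1, 0, 0, 1, 1]
print(jaccard(doc_a, doc_b))  # 2 shared shingles out of 4 distinct ones: 0.5
```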

### Your task

Use the minhash algorithm to create a signature for each document, and find the 'most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands *b* into which the signature matrix is divided. Recommended value: b > n // 10.
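
To see why a longer signature helps, note that the fraction of positions where two minhash signatures agree is an estimate of the Jaccard similarity, and its noise shrinks as *n* grows. The sketch below checks this on two made-up shingle sets using explicit random permutations; all helper names and the toy sets are assumptions, not part of the assignment.

```python
import random

def random_permutations(n_shingles, n, seed=0):
    """Return n random permutations of the shingle indices 0..n_shingles-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n):
        perm = list(range(n_shingles))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash_signature(shingles, perms):
    # perm[s] is the position of shingle s after permuting; the minhash value
    # is the smallest position among the shingles the document contains.
    return [min(perm[s] for s in shingles) for perm in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = set(range(0, 60))    # hypothetical document with shingles 0..59
doc_b = set(range(20, 80))   # hypothetical document with shingles 20..79
true_sim = len(doc_a & doc_b) / len(doc_a | doc_b)  # 40 / 80 = 0.5

for n in (20, 100, 500):
    perms = random_permutations(200, n)
    est = estimate_jaccard(minhash_signature(doc_a, perms),
                           minhash_signature(doc_b, perms))
    print("n = {:3d}: estimate {:.3f}, true {:.3f}".format(n, est, true_sim))
```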

In [None]:
# Your code here, you can add cells if necessary
import pandas as pd
import random
import numpy as np

minhash_num = 100
bands_num = 20

data=pd.read_csv("../input/coding2/docs_for_lsh.csv",index_col='doc_id').to_numpy().transpose()


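def generate_sig():
    # data holds one row per shingle and one column per document.
    num_files = data.shape[1]
    sig=np.zeros((minhash_num,num_files),dtype=int)
    for i in range(minhash_num):
        # Shuffle the shingle rows; the index of the first 1 in each column
        # is that document's minhash value under this permutation.
        sig[i,:]=np.random.permutation(data).argmax(axis=0)
    return sig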
    
def find_most_similar():
    # Brute-force search for the pair of documents whose signatures agree
    # in the most positions.
    num_files = data.shape[1]  # data holds one column per document
    sim_x,sim_y=0,0
    max_jac_sim = 0
    for i in range(num_files):
        for j in range(i+1,num_files):
            cur_sim=0
            for k in range(minhash_num):
                if signature[k][i]==signature[k][j]:
                    cur_sim+=1
            if cur_sim>max_jac_sim:
                max_jac_sim=cur_sim
                sim_x=i
                sim_y=j
    print(sim_x)
    print(sim_y)
                    
#print(data.shape[0])
signature=generate_sig()
#for i in range(minhash_num):
    #generate_one_sig()
    #print("finish",i)
#print(signature)
#find_most_similar()


Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
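
One way to reason about this tip is the standard banding analysis: with *b* bands of *r* rows each, two documents with similarity *s* share a bucket in at least one band with probability 1 - (1 - s^r)^b, and the curve rises steeply around s ≈ (1/b)^(1/r). The sketch below tabulates this for every split of n = 100; the helper names are hypothetical.

```python
def candidate_probability(s, b, r):
    """Probability that a pair with similarity s agrees on all rows of at least one band."""
    return 1 - (1 - s ** r) ** b

def approx_threshold(b, r):
    """Approximate similarity at which the candidate probability rises steeply."""
    return (1 / b) ** (1 / r)

def band_options(n):
    """All (b, r) pairs with b * r == n."""
    return [(b, n // b) for b in range(1, n + 1) if n % b == 0]

def closest_to(target, n):
    """The (b, r) split of n whose approximate threshold is nearest to target."""
    return min(band_options(n), key=lambda br: abs(approx_threshold(*br) - target))

for b, r in band_options(100):
    print("b = {:3d}, r = {:3d}: threshold ~ {:.3f}, P(candidate | s = 0.8) = {:.4f}".format(
        b, r, approx_threshold(b, r), candidate_probability(0.8, b, r)))
print("closest to 0.8:", closest_to(0.8, 100))
```

For n = 100 the split whose threshold lies nearest to 0.8 is b = 10, r = 10 (threshold about 0.79). A split such as b = 20, r = 5 has a lower threshold of about 0.55, so pairs with s > 0.8 are almost certain to become candidates, at the cost of more false candidates that have to be filtered afterwards.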

In [None]:
# Your code here, you can add cells if necessary
import hashlib


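def LSH(m,b,r):
    # m is the signature matrix (rows: minhash functions, columns: documents),
    # split into bands of r rows each. Documents whose signatures agree on
    # every row of a band fall into the same bucket for that band.
    hash_bucket={}
    beg, end=0,r
    cnt=0  # index of the current band
    while end<=m.shape[0]:
        cnt+=1
        for col in range(m.shape[1]):
            hashfc= hashlib.md5()
            s=str(m[beg: beg + r, col])
            hashfc.update(s.encode())
            tag=hashfc.hexdigest()
            # Key on (tag, band) so that equal slices in different bands do not collide.
            if (tag,cnt) not in hash_bucket:
                hash_bucket[(tag,cnt)]=[col]
            elif col not in hash_bucket[(tag,cnt)]:
                hash_bucket[(tag,cnt)].append(col)
        beg+=r
        end+=r
    return hash_bucket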
        

row=minhash_num//bands_num
#print(row,bands_num)
signature=np.array(signature)
hash_bucket=LSH(signature,bands_num,row)
#print(hash_bucket)
res={}
query=0
for (key,i) in hash_bucket:
    if query in hash_bucket[(key,i)]:
        for j in hash_bucket[(key,i)]:
            if j not in res:
                res[j]=1
            else:
                res[j]+=1
res_order=sorted(res.items(),key=lambda x:x[1],reverse=True)
x=0
outputs=[]
for i in range(len(res_order)):
    if x>=30:
        break
    if res_order[i][0]!=query:
        x+=1
        outputs.append(res_order[i][0])
    
#print(len(res))
print("Predicted 30 most similar files:")
print(outputs)

In [None]:
from sklearn.metrics import jaccard_score
ground_truth=[]
file0=data[:,0]
# Iterate over the candidates actually found; there may be fewer than 30.
for doc in outputs:
    filex=data[:,doc]
    ground_truth.append(jaccard_score(file0,filex))
print("Corresponding jaccard score:")
print(ground_truth)