# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the whole binary stream is infeasible, DGIM can estimate the number of 1-bits in the most recent window. In this exercise you are given *stream_data.txt* (a binary stream) and need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.
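
For orientation, the textbook form of DGIM can be sketched in a few lines. This is a minimal sketch, not the implementation used in this notebook: the class name `SimpleDGIM`, its methods, and the synthetic random stream are all assumptions made for illustration. It keeps at most two buckets of each power-of-two size and counts only half of the oldest bucket when estimating.

```python
from collections import deque
import random


class SimpleDGIM:
    """Textbook-style DGIM sketch (hypothetical helper for illustration)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        # Each bucket is [end_timestamp, size]; the newest bucket is on the left.
        self.buckets = deque()

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft([self.time, 1])
        # While three buckets share a size, merge the two oldest of them.
        # The merged bucket keeps the newer end timestamp and doubles in size.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2
            del self.buckets[i + 2]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may lie inside the window: count half of it.
        return total - self.buckets[-1][1] // 2


# Synthetic stream (an assumption; the assignment uses stream_data.txt instead).
random.seed(0)
bits = [random.randint(0, 1) for _ in range(5000)]
dgim = SimpleDGIM(window_size=1000)
for b in bits:
    dgim.add(b)
exact = sum(bits[-1000:])
print("estimate:", dgim.estimate(), "exact:", exact)
```

Because at least one 1-bit of the oldest bucket is always inside the window, the estimate from this variant is off by at most 50% of the true count.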

In [None]:
import numpy as np

In [None]:
stream=np.loadtxt('../input/coding2/stream_data.txt')

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
class DGIM():
    def __init__(self,window_size):
        self.bucket_list = {}
        self.window_size=window_size
        self.current_time=0
        self.bucket=np.array([])
    def sort_buck_list(self):
        # Tidy the buckets: merge whenever a size has more than two buckets
        nowkey=0
        while nowkey in self.bucket_list:
            if len(self.bucket_list[nowkey])>2:
                timestamp=self.bucket_list[nowkey][1] # use the end time as the merged bucket's timestamp
                
                self.bucket_list[nowkey]=self.bucket_list[nowkey][2:]
                
                if not nowkey+1 in self.bucket_list:
                    self.bucket_list[nowkey+1]=np.array([timestamp])
                else:
                    self.bucket_list[nowkey+1]=np.append(self.bucket_list[nowkey+1],timestamp)
            nowkey=nowkey+1
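    def filt_buck_list(self):
        # Filter the buckets: drop those that have fallen outside the window
        for key in self.bucket_list:
            self.bucket_list[key]=self.bucket_list[key][self.bucket_list[key]>=0]
        # Rebuild the list of bucket sizes from scratch so repeated calls do not accumulate
        self.bucket=np.array([])
        for key,value in self.bucket_list.items():
            for i in range(len(value)):
                self.bucket= np.append(self.bucket,2**key)
        if self.current_time>self.window_size and len(self.bucket)>0:
            self.bucket[-1]=self.bucket[-1]/2 # count only half of the last (oldest, largest) bucket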
        
    def stream_input(self,bit):
        # Read one new bit from the stream
        if self.current_time<self.window_size:
            if bit==1: 
                if not 0 in self.bucket_list:
                    self.bucket_list[0]=np.array([self.current_time])
                else:
                    self.bucket_list[0]=np.append(self.bucket_list[0],self.current_time)
                
               
                
        else:
            for keys in self.bucket_list:
                self.bucket_list[keys]=self.bucket_list[keys]-1
            if bit==1:
                if not 0 in self.bucket_list:
                    self.bucket_list[0]=np.array([self.window_size-1])
                else:
                    self.bucket_list[0]=np.append(self.bucket_list[0],self.window_size-1)
                
        self.sort_buck_list()
       
        self.current_time=self.current_time+1    
               
        
        



    def query(self):
        
        return self.bucket

2. Write a function that counts the number of 1-bits in the current window exactly, and compare its running time and space usage with those of the DGIM algorithm.

In [None]:

timestamp=0 # initial timestamp (start index of the window)
windowlength=1000 # window length
dgim=DGIM(windowlength)
for i,bit in enumerate(stream[0:timestamp+windowlength]):
    dgim.stream_input(bit)


dgim.filt_buck_list()


import time
time_start=time.perf_counter()
print("dgim:",np.sum(dgim.query()))

time_end=time.perf_counter()
print('DGIM query time (s):',time_end-time_start)
print('DGIM space (number of buckets):',len(dgim.query()))
time_start=time.perf_counter()
vsum=0
for v in range(windowlength):
    vsum=vsum+stream[timestamp+v]
print("sum:",vsum)

time_end=time.perf_counter()
print('exact sum time (s):',time_end-time_start)
print('exact sum space (number of bits):',windowlength)
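
The exact count can also be maintained incrementally as the stream flows, which makes the space trade-off explicit. The sketch below is one possible baseline, assuming a hypothetical helper class `ExactWindowCounter` and a synthetic random stream: it answers each query in constant time but must keep every bit of the window, whereas DGIM keeps only a logarithmic number of buckets.

```python
from collections import deque
import random


class ExactWindowCounter:
    """Exact count of 1-bits in the last `window_size` bits (hypothetical helper)."""

    def __init__(self, window_size):
        # The deque stores every bit of the window, so memory grows with window_size.
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        bit = int(bit)
        # When the window is full, the leftmost bit is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones


# Synthetic stream (an assumption; the assignment uses stream_data.txt instead).
random.seed(1)
bits = [random.randint(0, 1) for _ in range(3000)]
counter = ExactWindowCounter(window_size=1000)
for b in bits:
    counter.add(b)
print("exact count in the last 1000 bits:", counter.count())
```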

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this exercise you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle numbered **i**, the corresponding row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.

### Your task

Use the MinHash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity. 
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
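
The reason the signature length *n* matters is that the fraction of minhash values on which two documents agree estimates their Jaccard similarity, and the estimate tightens as *n* grows. The sketch below illustrates this on two synthetic documents; the vectors, the helper names `jaccard` and `minhash_signature`, and the choice of 500 permutations are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_shingles = 200

# Two synthetic binary shingle vectors (made up for this illustration).
doc_a = rng.random(num_shingles) < 0.3
doc_b = doc_a.copy()
flip = rng.random(num_shingles) < 0.1
doc_b[flip] = ~doc_b[flip]


def jaccard(x, y):
    """Exact Jaccard similarity of two boolean vectors."""
    union = np.logical_or(x, y).sum()
    return np.logical_and(x, y).sum() / union if union else 1.0


def minhash_signature(doc, permutations):
    """For each permutation, record the position of the first shingle the document contains."""
    return np.array([np.argmax(doc[p]) for p in permutations])


n = 500  # signature length; larger n gives a tighter estimate
permutations = [rng.permutation(num_shingles) for _ in range(n)]
sig_a = minhash_signature(doc_a, permutations)
sig_b = minhash_signature(doc_b, permutations)

estimate = np.mean(sig_a == sig_b)
print("exact Jaccard:   ", jaccard(doc_a, doc_b))
print("MinHash estimate:", estimate)
```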

In [None]:
import numpy as np

In [None]:
data = (np.loadtxt("../input/coding2/docs_for_lsh.csv",delimiter=',',skiprows=1))

In [None]:
# Preprocessing: drop the first column (doc_id) and transpose so rows are shingles and columns are documents
data1=data[:,1:]
data1.shape
data1=data1.T
data1.shape

In [None]:
from tqdm import tqdm

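def signature(data,nums):
    # Build the MinHash signature matrix: one row per permutation, one column per document
    h,w=data.shape
    
    signature =np.zeros((nums,w))
    # Decreasing weights h..1 make argmax return the first row that holds a 1
    weight=np.arange(h,0,-1)
    weight=weight[:,np.newaxis]
    for i in tqdm(range(nums)):
        # Randomly permute the shingle rows
        permute=np.random.permutation(h)
        rand_mat=data[permute,:]
        rand_mat=weight*rand_mat
        
        # The row index of the first 1 in each column is that document's minhash value
        pos=np.argmax(rand_mat,axis=0)
        signature[i,:]=pos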
        
        
        
    return signature




In [None]:
# Build signatures of length 80
a=signature(data1[:,:],80)
print(a.shape)

In [None]:

def lsh(data,bucketsize):
    # Hash each band: weight its rows by powers of two, sum them, and take the result modulo 400

    all_table=[]
    num_feat,num_file=data.shape
    num_buck=int(num_feat/bucketsize)
    powers_of_two = 1 << np.arange(bucketsize - 1, -1, -1)
    for i in range(0,num_feat,bucketsize):
        table={}
        now_row=data[i:i+bucketsize,:]
        
        index = (powers_of_two.dot(now_row))%400
        for data_idx,idx in enumerate(index):
            if idx not in table:
                # If no list yet exists for this bin, assign the bin an empty list.
                table[idx] = []
            table[idx].append(data_idx)
        all_table.append(table)
    return all_table

In [None]:
# Build all band tables: band size = 5 rows, 16 bands in total
c=lsh(a,5)
print(a.shape)
print(len(c))


Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tip: You can adjust your parameters so that documents with similarity *s > 0.8* hash into the same bucket.
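
One way to choose the parameters is to look at the standard banding formula: with *b* bands of *r* rows each, two documents with Jaccard similarity *s* share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below tabulates this for a signature of length 80; the function name `candidate_probability` and the candidate values of *r* are assumptions made for illustration.

```python
def candidate_probability(s, rows_per_band, bands):
    """Probability that two documents with Jaccard similarity s agree on at least one band."""
    return 1.0 - (1.0 - s ** rows_per_band) ** bands


n = 80  # signature length
for r in (4, 5, 8, 10):
    b = n // r
    # (1/b)^(1/r) approximates the similarity at which the curve rises most steeply.
    threshold = (1.0 / b) ** (1.0 / r)
    print(f"r={r:2d} b={b:2d} threshold~{threshold:.2f} "
          f"P(s=0.8)={candidate_probability(0.8, r, b):.3f} "
          f"P(s=0.3)={candidate_probability(0.3, r, b):.3f}")
```

Larger *r* pushes the threshold toward 1 and filters out more dissimilar pairs, at the cost of missing some pairs just above the target similarity.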

In [None]:
sim_score=np.zeros(a.shape[1])
#print(sim_score[0])
for table in c:
    for score,pos in table.items():
        #print(score,pos)
        if 0 in pos:
            for p in pos:
                sim_score[p]=sim_score[p]+1

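sim_score[0]=-1 # exclude document 0 itself, even when other documents tie with its score
ind=np.argsort(sim_score)

# Column indices of the 30 highest scores, best first; data[top_30,0] gives the matching doc_id values
top_30=ind[-30:][::-1]
search_data=data1[:,0]
print("indices of the top 30 similar documents:",top_30)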

In [None]:
from sklearn.metrics import jaccard_score
print('top 30 most similar documents:')
print("index  similarity")
for idx in top_30:
    print(idx,jaccard_score(search_data,data1[:,idx]))
    
