# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the entire binary stream is infeasible, DGIM can estimate the number of 1-bits in the most recent window. In this assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the problems below.
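The bucket idea behind DGIM can be sketched on an in-memory list of bits. This is a minimal illustration, not the reference solution: the class name `DGIMSketch`, the `max_same` parameter (at most two buckets per size here), and the synthetic stream are all assumptions made for the example.

```python
class DGIMSketch:
    """Hypothetical minimal DGIM counter for a sliding window of bits."""

    def __init__(self, window, max_same=2):
        self.window = window
        self.max_same = max_same  # assumed cap on buckets of equal size
        self.time = 0
        self.buckets = []  # newest first: [timestamp of newest 1-bit, size]

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest bit has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Too many buckets of one size: merge the two oldest of that size,
        # then check whether the merge overflowed the next size.
        i = 0
        while (i + self.max_same < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + self.max_same][1]):
            oldest = self.buckets.pop(i + self.max_same)
            self.buckets[i + self.max_same - 1][1] += oldest[1]
            i += self.max_same - 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only half of the oldest bucket is counted, since it may straddle
        # the window boundary.
        return total - self.buckets[-1][1] / 2


# Synthetic stand-in for the real stream (assumption for illustration).
bits = [1 if (i * i) % 7 < 3 else 0 for i in range(3000)]
sketch = DGIMSketch(window=1000)
for bit in bits:
    sketch.add(bit)
exact = sum(bits[-1000:])
print(sketch.estimate(), exact)
```

With at most two buckets per size, the estimate should stay within 50% of the exact count, while only a logarithmic number of buckets is kept.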

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
# the current time is time_location (3000 here); timestamps start from 1
import time
 
bucket_n = [] #bucket list
n_max_bucket = 3 # or any other int
size_window = 1000
time_location = 3000

def Is_due(time_now):
    # drop the oldest bucket once its timestamp has fallen out of the window
    if len(bucket_n)>0 and bucket_n[0]['timestamp']<=time_now-size_window:
        del bucket_n[0]
        
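def Merge():
    # bucket_n is ordered oldest first; scan from the newest bucket backwards
    for i in range(len(bucket_n)-1,n_max_bucket-1,-1):
        # n_max_bucket+1 buckets share one size: merge the two oldest of them
        if bucket_n[i]['bit_sum']==bucket_n[i-n_max_bucket]['bit_sum']:
            bucket_n[i-n_max_bucket]['bit_sum']+=bucket_n[i-n_max_bucket+1]['bit_sum']
            # the merged bucket keeps the more recent timestamp
            bucket_n[i-n_max_bucket]['timestamp']=bucket_n[i-n_max_bucket+1]['timestamp']
            del bucket_n[i-n_max_bucket+1]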
            
def Count_bit():
    bit_sum=0
    flag_half=1
    start_time=time.time()
    with open('../input/coding2/stream_data.txt', 'r') as f:
        for i in range(time_location):
            temp = f.read(2)[:1]  # one bit plus its newline; '' at end of file
            if temp:
                Is_due(i+1)
                if temp == '1':
                    bucket={"timestamp":i+1,"bit_sum":1}
                    bucket_n.append(bucket)
                    Merge()
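    # no bucket means no 1-bit in the window; avoid indexing an empty list
    if len(bucket_n)==0:
        return 0,time.time()-start_time
    for i in range(len(bucket_n)):
        bit_sum+=bucket_n[i]['bit_sum']
    # count only half of the oldest bucket, which may straddle the window edge
    bit_sum-=bucket_n[0]['bit_sum']/2
    return bit_sum,time.time()-start_time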

bit_sum,bit_time = Count_bit()
print(int(bit_sum),bit_time)

2. Write a function that counts the number of 1-bits in the current window exactly, and compare its running time and space usage with those of the DGIM algorithm.

In [None]:
def Count_act():
    bit_sum=0
    start_time=time.time()
    with open('../input/coding2/stream_data.txt', 'r') as f:
        f.seek(0 if time_location<=size_window else 2*(time_location-size_window))
        for i in range(time_location if time_location<=size_window else size_window):
            temp=f.read(2)[:1]  # one bit plus its newline; '' at end of file
            if temp == '1':
                bit_sum+=1
    return bit_sum,time.time()-start_time

bit_act_sum,bit_act_time=Count_act()
print(bit_act_sum,bit_act_time)
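To make the space comparison concrete, here is a sketch of an exact counter for a true stream, where seeking back into a file is not possible. It must keep every bit of the window, so its memory grows linearly with the window size, whereas DGIM keeps only a logarithmic number of buckets. The function name `exact_window_counts` and the toy input are assumptions for illustration.

```python
from collections import deque


def exact_window_counts(bits, window):
    """Yield the exact number of 1-bits in the last `window` bits after each arrival."""
    recent = deque()  # stores every bit currently inside the window
    ones = 0
    for bit in bits:
        recent.append(bit)
        ones += bit
        if len(recent) > window:
            ones -= recent.popleft()  # the bit that just left the window
        yield ones


counts = list(exact_window_counts([1, 0, 1, 1, 0, 1], window=3))
print(counts)  # [1, 1, 2, 2, 2, 2]
```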


### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, then the corresponding row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the problems below.
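With this 0/1 encoding, the Jaccard similarity of two documents is the number of shingles they share divided by the number of shingles that appear in either. A small sketch on made-up rows (the helper name `jaccard` and the toy vectors are assumptions; returning 0.0 for two empty documents is a convention chosen here):

```python
import numpy as np


def jaccard(a, b):
    """Jaccard similarity of two 0/1 shingle vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two empty documents
    return float(np.logical_and(a, b).sum() / union)


doc_a = [1, 1, 0, 1, 0]
doc_b = [1, 0, 0, 1, 1]
print(jaccard(doc_a, doc_b))  # 0.5: two shared shingles out of four in total
```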

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands *b* that divide the signature matrix. Recommended value: b > n // 10.
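The choice of these parameters can be guided by the standard banding formula: with *b* bands of *r* rows each (so n = b * r), two documents with Jaccard similarity *s* share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below tabulates that probability; the function name `candidate_probability` and the values b = 14, r = 5 are illustrative assumptions, not recommended settings.

```python
def candidate_probability(s, b, r):
    """Probability that two documents with similarity s collide in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


# Illustrative setting: 14 bands of 5 rows, i.e. a signature of length 70.
for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(s, round(candidate_probability(s, b=14, r=5), 3))
```

Increasing *r* makes each band stricter, which pushes the similarity threshold up; increasing *b* gives more chances to collide, which pushes it down.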

In [None]:

import numpy as np
import pandas as pd
from tqdm import tqdm

data = pd.read_csv('../input/coding2/docs_for_lsh.csv')
data = np.array(data)
data = data[:,1:]

def signature(data):
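    n = 70  # signature length: number of minhash permutations

    row = data.shape[0]  # number of documents
    col = data.shape[1]  # number of shingles
    seq = np.arange(col)
    # one random permutation of the shingle indices per minhash function
    persig = np.zeros((n,col))
    for i in range(n):
        persig[i,:] = np.random.permutation(seq)
        
    # sig[i,j] is the minhash of document j under permutation i
    # (-1 if the document contains no shingle)
    sig = -1*np.ones((n,row))

    for i in tqdm(range(n)):
        h = persig[i,:]
        for j in range(row):
            present = h[data[j,:]==1]
            if present.size > 0:
                sig[i,j] = present.min()
    
    return sig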

sig = signature(data)
#print(sig.shape())
print(sig)
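The signature matrix is useful because the fraction of rows on which two columns agree estimates the Jaccard similarity of the corresponding documents. The following is a vectorized sketch of that idea on toy data; the function names, the seeded generator, and the toy matrix are assumptions, and the names are chosen so they do not overwrite the notebook's `sig`.

```python
import numpy as np


def minhash_signatures(data, n, seed=0):
    """Return an (n, n_docs) signature matrix using n random permutations."""
    rng = np.random.default_rng(seed)
    n_docs, n_shingles = data.shape
    signatures = np.empty((n, n_docs), dtype=int)
    for i in range(n):
        perm = rng.permutation(n_shingles)
        # Smallest permuted index among the shingles each document contains;
        # documents without shingles get the sentinel value n_shingles.
        signatures[i] = np.where(data == 1, perm, n_shingles).min(axis=1)
    return signatures


def estimated_similarity(signatures, a, b):
    """Fraction of minhash values on which documents a and b agree."""
    return float(np.mean(signatures[:, a] == signatures[:, b]))


toy = np.array([
    [1, 1, 0, 1, 0, 0],  # document 0
    [1, 1, 0, 1, 0, 0],  # document 1, identical to document 0
    [0, 0, 1, 0, 1, 1],  # document 2, disjoint from document 0
])
toy_sig = minhash_signatures(toy, n=50)
print(estimated_similarity(toy_sig, 0, 1), estimated_similarity(toy_sig, 0, 2))
```

Identical documents always agree on every row and disjoint documents never agree; for partially overlapping documents the estimate improves as *n* grows.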


In [None]:
import random
import hashlib

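def minHash(sigMatrix, b, r):
    # Split the signature matrix into b bands of r rows each and hash every
    # document's band to a bucket; documents sharing a bucket are candidates.
    hashBuckets = {}

    n_docs = sigMatrix.shape[1]
    # never read past the end of the signature matrix
    n_bands = min(b, sigMatrix.shape[0] // r)
    for bandNum in range(n_bands):
        begin = bandNum * r
        for colNum in range(n_docs):
            band = str(sigMatrix[begin: begin + r, colNum])
            # include the band index so that equal values in different bands
            # do not end up in the same bucket
            tag = hashlib.md5((str(bandNum) + ':' + band).encode()).hexdigest()
            hashBuckets.setdefault(tag, []).append(colNum)

    return hashBuckets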

hashBucket = minHash(sig,14,5)
print(len(hashBucket))

Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tip: You can adjust your parameters so that documents with similarity *s > 0.8* are hashed into the same bucket.
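One way to validate the LSH output is to rank every document by its exact Jaccard similarity to the query and compare that list with the LSH candidates. The following is a brute-force sketch on toy data; the function name `top_k_by_jaccard` and the toy matrix are assumptions, and the input is expected to be the 0/1 shingle matrix with the id column already removed.

```python
import numpy as np


def top_k_by_jaccard(data, query, k):
    """Return the k rows most similar to row `query` as (index, similarity) pairs."""
    data = np.asarray(data, dtype=bool)
    inter = (data & data[query]).sum(axis=1)
    union = (data | data[query]).sum(axis=1)
    # Rows with an empty union get similarity 0 instead of a division by zero.
    sims = np.divide(inter, union, out=np.zeros(len(data)), where=union > 0)
    # Stable sort keeps the lower row index first among equal similarities.
    order = [i for i in np.argsort(-sims, kind="stable") if i != query]
    return [(int(i), float(sims[i])) for i in order[:k]]


toy_docs = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
print(top_k_by_jaccard(toy_docs, query=0, k=2))  # [(1, 1.0), (2, 0.5)]
```

On 200 shingle columns this exhaustive comparison is cheap, so it can serve as ground truth when tuning *b* and *r*.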

In [None]:
from sklearn.metrics import jaccard_score

row = sig.shape[1]
cnt = np.zeros(row)
queryCol = 0
# count, for every document, the number of buckets it shares with the query
for key in hashBucket:
    if queryCol in hashBucket[key]:
        for i in hashBucket[key]:
            cnt[i] += 1
cnt[queryCol] = -1  # exclude the query document itself

sorted_cnt = np.argsort(cnt)[::-1]  # most shared buckets first

top30 = list(sorted_cnt[:30])

print('top 30 with jaccard:')
for idx in top30:
    print(idx,jaccard_score(data.T[:,0],data.T[:,idx]))