# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When it is infeasible to store the whole binary stream, DGIM can estimate the number of 1-bits in the current window. In this coding exercise, you're given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
import time

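# Merge buckets whenever three buckets share the same size.
# Buckets are [timestamp, exponent] pairs, oldest first; a bucket holds 2**exponent ones.
def drop(arr):
    for i in range(len(arr)-1, 1, -1):
        if arr[i][1]==arr[i-2][1]:
            # Merge the two oldest of the three into arr[i-2]. The merged bucket keeps
            # the older timestamp here; textbook DGIM would take the newer one:
            # arr[i-2][0] = arr[i-1][0]
            arr[i-2][1] += 1
            del arr[i-1]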

# get the data stream
with open("/kaggle/input/coding2/stream_data.txt",'r') as data:
    f = data.readline()
act_list = f.strip().split('\t')

# my DGIM
stamp = 0
N = 1000
time_list = []
time1 = time.time()
num = 0
for i in act_list:
    stamp += 1
    if int(i)==0:
        continue
    else:        
        if len(time_list) > 0 and time_list[0][0] + N <= stamp:
            del time_list[0]
        time_list.append([stamp, 0])
        drop(time_list)
# Estimate: sum the sizes of all buckets, then subtract half of the oldest bucket.
for i in time_list:
    num += 2**(i[1])
num -= 2**(time_list[0][1]-1)
print("Execution time of my DGIM: ",time.time()-time1)
print("The buckets: ",time_list)
print("The number of 1-bits: ",num)
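
For comparison, here is a minimal sketch of DGIM as it is usually described in textbooks, where each bucket is stamped with the position of its most recent 1 and a merged bucket keeps the newer of the two timestamps. The class name `DGIM` and its methods are illustrative assumptions, not part of the assignment, and the sketch stores bucket sizes directly rather than exponents.

```python
class DGIM:
    """Approximate count of 1-bits in the last `window` bits of a stream (sketch)."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        # Buckets are [end_timestamp, size] pairs, oldest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if bit != 1:
            return
        self.buckets.append([self.time, 1])
        # Keep at most two buckets per size: merge the two oldest of any three.
        i = len(self.buckets) - 1
        while i >= 2 and self.buckets[i][1] == self.buckets[i - 2][1]:
            newer = self.buckets[i - 1]
            self.buckets[i - 2] = [newer[0], 2 * newer[1]]  # newer timestamp, doubled size
            del self.buckets[i - 1]
            i -= 2  # the merged bucket may now be the third of the next size

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Count only half of the oldest bucket, which may straddle the window edge.
        return total - self.buckets[0][1] // 2


counter = DGIM(window=1000)
for bit in [1, 0, 1, 1, 0, 1]:
    counter.add(bit)
print(counter.estimate())
```

Checking for expiry on every incoming bit, including zeros, keeps the oldest bucket from lingering after its last 1 has left the window.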

2. Write a function that accurately counts the number of 1-bits in the current window, and compare its running time and space with those of the DGIM algorithm.

In [None]:
import time

# get the data stream
with open("/kaggle/input/coding2/stream_data.txt",'r') as data:
    f = data.readline()
act_list = f.strip().split('\t')

# get the actual number of 1-bits in the windows
num = 0
N = 1000
time1 = time.time()
for i in range(len(act_list) - N, len(act_list)):
    if int(act_list[i])==1:
        num +=1
print("Execution time of actual count: ",time.time()-time1)
print("Actual number of 1-bits:",num)

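Running time: The actual count (0.00051 s) is faster than my DGIM (0.07903 s). I infer the reason is that DGIM spends more time merging buckets; in addition, the DGIM loop processes every bit of the stream, while the actual count only scans the last 1000 bits.

Space: DGIM needs O(log² N) bits (O(log N) buckets of O(log N) bits each), which is less than the N bits of the actual count; this is its advantage.

Relative error: |316-391|/391 = 19.2% < 50%, which is within the DGIM error bound.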
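
The arithmetic behind the comparison can be reproduced with a short sketch. The helper names are hypothetical, and the space figure is only a rough upper bound that assumes at most two buckets per size, each storing a timestamp modulo N and a size exponent.

```python
import math

def relative_error(estimate, exact):
    """Relative error of an estimate against the exact count."""
    return abs(estimate - exact) / exact

def dgim_bits_upper_bound(window):
    """Rough bit budget: two buckets per size, each with a timestamp and an exponent."""
    timestamp_bits = math.ceil(math.log2(window))   # timestamps stored modulo the window
    n_sizes = math.floor(math.log2(window)) + 1     # sizes 1, 2, 4, ... up to the window
    exponent_bits = max(1, math.ceil(math.log2(n_sizes)))
    return 2 * n_sizes * (timestamp_bits + exponent_bits)

N = 1000
print(f"relative error: {relative_error(316, 391):.1%}")
print(f"DGIM upper bound: {dgim_bits_upper_bound(N)} bits, exact window: {N} bits")
```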

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding exercise, you're given *docs_for_lsh.csv*, where the documents are processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns, where column 'doc_id' represents the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, then the corresponding row has value 1 in column **'i'**; otherwise it is 0. You need to implement the LSH algorithm and answer the questions below.

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
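
To get a feel for how *n* and *b* interact, the standard banding analysis can be sketched as follows: with *b* bands of *r* rows each, two documents with Jaccard similarity *s* share at least one band with probability 1 - (1 - s^r)^b, and the similarity at which this curve rises steeply is roughly (1/b)^(1/r). The function names and the choice n = 100 are illustrative assumptions.

```python
def candidate_probability(s, rows, bands):
    """Probability that two documents with similarity s agree on at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approx_threshold(rows, bands):
    """Approximate similarity at which the S-curve rises steeply."""
    return (1.0 / bands) ** (1.0 / rows)

n = 100  # assumed signature length for this illustration
for bands in (10, 20, 25, 50):
    rows = n // bands
    print(bands, rows,
          round(approx_threshold(rows, bands), 3),
          round(candidate_probability(0.8, rows, bands), 3))
```

More bands with fewer rows each lowers the threshold, which catches more true pairs at the cost of more false candidates to verify.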

In [None]:
# # read the shingles of the documents and store them in the list rows after processing.
# import csv

# with open('/kaggle/input/coding2/docs_for_lsh.csv','r') as csvfile:
#     reader = csv.reader(csvfile)
#     rows_ori = [row for row in reader]
#     rows = []
#     del rows_ori[0]
#     for i in range(0,len(rows_ori)):
#         row = rows_ori[i]
#         row = [int(x) for x in row]
#         del row[0]
#         rows.append(row)

In [None]:
# # define the hash function
# import random

# # generate minhash function
# def get_minhash(factor, constant, size, x):
#     return (factor*x+constant)%size

# # generate hash function for LSH
# def hash1(x,length):
#     s = 0;
#     for i in range(len(x)):
#         s = ((int(s)<<2)+(int(x[i])>>4))^(int(x[i])<<10)
#     s = s % length
#     if s<0:
#         s += length
#     return s

# # generate specific number of minhash functions and return the list of parameters(prime number) of functions.
# # parameters are randomly generated.
# def get_hashlist(sigsize=100):
#     hlist=[]
#     num1 = [3,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,233,239,241,251,257,263,269,271,277,281,283,293]
#     num2 = [97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,179,181,191,193,197,199,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409]
#     for i in range(sigsize):
#         k1 = random.randint(0,len(num1)-1)
#         k2 = random.randint(0,len(num2)-1)
#         factor=num1[k1]
#         constant=num2[k2]
#         hlist.append((factor,constant))
#     return hlist 

# hlist=get_hashlist(150)

In [None]:
# # generate my signature matrix
# import os
# import pandas as pd

# # for each document shingles, generate signature list of the document
# def get_signature(shingle, hashlist):
#     signature=[]
#     size=len(shingle)
#     sigsize = len(hashlist)
#     for i in range(sigsize):
#         sig=float("inf")
#         for k in range(0,size):
#             if shingle[k]==1:
#                 value = get_minhash(hashlist[i][0], hashlist[i][1], size, k)
#                 if value<sig:
#                     sig=value               
#             else:
#                 pass
#         signature.append(sig)
#     return signature


# # estimate the Jaccard similarity as the fraction of matching signature positions
# def jaccard_sig(signature1, signature2):
#     k=0
#     for i in range(len(signature1)):
#         if signature1[i]==signature2[i]:
#             k+=1;
#     k = k/len(signature1)
#     return k


# sigall=[]
# if os.path.exists('./mysignature_test1.csv'):
#     #get signatures from preprocess file
#     with open('./mysignature_test1.csv','r') as csvfile:
#         reader = csv.reader(csvfile)
#         rows_ori = [row for row in reader]
#         del rows_ori[0]
#         for i in range(0,len(rows_ori)):
#             row = rows_ori[i]
#             row = [int(x) for x in row]
#             del row[0]
#             sigall.append(row)
# else:
#     # generate signatures and store them as file
#     for i in range(len(rows)):
#         sigall.append(get_signature(rows[i],hlist))
#     mysig = pd.DataFrame(data=sigall)
#     mysig.to_csv('./mysignature_test1.csv')
#     print("finish")
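
The signature step can be illustrated on a toy example. This is a sketch under assumptions, not the notebook's exact code: the helper names are hypothetical, and the multiplier of each hash h(x) = (a*x + c) % m is restricted to values coprime with m so that every hash is a permutation of the row indices. Every factor in the `num1` list above is an odd prime other than 5, so the same property holds there for m = 200.

```python
import math
import random

def make_hash_params(n_hashes, modulus, seed=0):
    """Pick (a, c) pairs for h(x) = (a*x + c) % modulus, with a coprime to modulus."""
    rng = random.Random(seed)
    coprime = [a for a in range(1, modulus) if math.gcd(a, modulus) == 1]
    return [(rng.choice(coprime), rng.randrange(modulus)) for _ in range(n_hashes)]

def minhash_signature(row, hash_params):
    """Signature of one binary row: per hash, the smallest hashed index of a 1."""
    modulus = len(row)
    ones = [k for k, bit in enumerate(row) if bit == 1]
    # An all-zero row gets infinity in every position.
    return [min(((a * k + c) % modulus for k in ones), default=float("inf"))
            for a, c in hash_params]

params = make_hash_params(n_hashes=50, modulus=8)
row_a = [1, 1, 0, 1, 0, 0, 1, 0]
row_b = [1, 1, 0, 1, 0, 0, 0, 0]
sig_a = minhash_signature(row_a, params)
sig_b = minhash_signature(row_b, params)
agreement = sum(x == y for x, y in zip(sig_a, sig_b)) / len(params)
print(agreement)  # fraction of matching positions, an estimate of the Jaccard similarity
```

Linear hashes over such a small index range are only a rough stand-in for random permutations, so the estimate can deviate noticeably from the true similarity.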

In [None]:
# # LSH
# import csv
# import random
# import os
# from operator import itemgetter
# import numpy as np

# def choose_similiar(signature, bands=10, row=10, bucsize=5000000,s=0.8):
#     candidates={}
#     for i in range(bands):
#         buckets=[]
#         for j in range(bucsize):
#             buckets.append([])
#         for k in range(len(signature)):
#             band0=signature[k]
#             band1=band0[i*row:(i+1)*row]
#             band="".join('%s' %l for l in band1)
#             buckets[hash1(band,bucsize)].append(k)
#         #judge the similarity if they are in the same bucket for at least one band
#         for item in buckets:
#             if len(item)>1:
#                 for i1 in range(len(item)):
#                     for j1 in range(i1+1,len(item)):
#                         pair = (item[i1],item[j1])
#                         if pair not in candidates:
#                             A = signature[item[i1]]
#                             B = signature[item[j1]]
#                             sim = jaccard_sig(A, B)
#                             if sim >= s:
#                                  candidates[pair] = sim
#         print("band-finish",i) #hint print
#     #ordered by the similarity
#     sort = sorted(candidates.items(),key=itemgetter(1), reverse=True)
#     return candidates, sort
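
The banding step can also be sketched with a dictionary keyed by the band contents, which avoids allocating a fixed number of bucket lists and a custom string hash. This swaps in a different bucketing technique from the cell above; the function names and the toy signatures are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands, rows):
    """Return pairs of document indices that agree on every row of at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc, sig in enumerate(signatures):
            # The band itself is the bucket key, so only identical bands collide.
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc)
        for docs in buckets.values():
            candidates.update(combinations(docs, 2))
    return candidates

def signature_similarity(a, b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

toy = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 9, 9],
    [7, 8, 3, 4, 5, 0],
    [0, 0, 0, 0, 0, 0],
]
pairs = lsh_candidates(toy, bands=3, rows=2)
for i, j in sorted(pairs):
    print((i, j), round(signature_similarity(toy[i], toy[j]), 2))
```

Candidate pairs still need to be checked against the similarity threshold, as the cell above does with `jaccard_sig`.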


Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (except document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
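
The Jaccard similarity used for validation is the number of shingles two documents share divided by the number of shingles that appear in either one. A minimal pure-Python sketch follows; the function name is hypothetical, and returning 0.0 for two all-zero rows is a choice made here. For 0/1 rows it should agree with `jaccard_score` in its default binary mode, which is worth confirming against the linked documentation.

```python
def jaccard_binary(a, b):
    """Jaccard similarity of two equal-length 0/1 rows."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0

# Two toy rows sharing 10 shingles, with 11 shingles in their union.
row_a = [1] * 10 + [1, 0]
row_b = [1] * 10 + [0, 0]
print(jaccard_binary(row_a, row_b))  # 10/11
```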

In [None]:
# # Your code here, you can add cells if necessary
# from sklearn.metrics import jaccard_score

# # c,s=choose_similiar(sigall)
# # for item in s:
# #     if item[0][0]==0:
# #         print(item)
# # print(s)

# #calculate the actual result
# same_act=[]
# for k in range(len(rows)):
#     same_act.append([jaccard_score(rows[0],rows[k]),k])
# same_act.sort()
# same_act.reverse()
# print("The actually 30 most similiar document ids", same_act[1:31])

The actually 30 most similiar document ids [[0.9090909090909091, 91300], [0.9090909090909091, 89833], [0.9090909090909091, 32681], [0.8333333333333334, 89825], [0.8333333333333334, 84306], [0.8333333333333334, 81480], [0.8333333333333334, 81379], [0.8333333333333334, 62080], [0.8333333333333334, 40298], [0.8333333333333334, 39310], [0.8333333333333334, 28910], [0.8333333333333334, 20854], [0.8181818181818182, 99370], [0.8181818181818182, 84520], [0.8181818181818182, 81289], [0.8181818181818182, 73681], [0.8181818181818182, 72156], [0.8181818181818182, 69724], [0.8181818181818182, 68730], [0.8181818181818182, 67032], [0.8181818181818182, 58852], [0.8181818181818182, 58694], [0.8181818181818182, 52076], [0.8181818181818182, 48131], [0.8181818181818182, 46220], [0.8181818181818182, 39784], [0.8181818181818182, 26980], [0.8181818181818182, 23585], [0.8181818181818182, 2575], [0.8181818181818182, 1331]]

**Since Kaggle and Jupyter Notebook can't keep running for a long time, I ran the same code on a server to get the following result.**

My result: ((0, 67032), 0.91),((0, 89833), 0.91),((0, 23585), 0.9),((0, 2575), 0.89),((0, 48131), 0.89),((0, 32681), 0.89),((0, 91300), 0.89),((0, 39784), 0.86),((0, 1331), 0.85),((0, 28910), 0.85),((0, 78531), 0.84),((0, 58852), 0.83),((0, 69724), 0.83),((0, 62080), 0.83),((0, 66125), 0.83),((0, 72156), 0.82),((0, 52076), 0.82),((0, 58694), 0.82),((0, 61193), 0.82),((0, 44663), 0.82),((0, 40298), 0.81),((0, 81480), 0.81),((0, 20854), 0.81),((0, 72078), 0.81),((0, 14134), 0.81),((0, 26261), 0.81),((0, 81289), 0.8),((0, 89825), 0.8),((0, 81379), 0.8),((0, 39310), 0.8).

The intersection with the actual top 30 contains 23 documents; 78531, 66125, 61193, 44663, 72078, 14134 and 26261 are misjudged. Except for 14134 (0.42 similarity), the actual similarity of each misjudged document is 0.75, which suggests that my method works but its accuracy can be improved.