# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the full binary stream is infeasible, DGIM can estimate the number of 1-bits in the most recent window. In this assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write your code and answer the questions below.
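As a point of reference, the bucket mechanics of DGIM can be sketched compactly. This is a minimal illustration, not the solution used in this notebook: the class name `DGIM` and its methods `add` and `count` are hypothetical, and it assumes the common variant that allows at most two buckets of each size.

```python
class DGIM:
    """Approximate count of 1-bits among the last `window_size` bits (sketch)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        # Each bucket is [timestamp of its most recent 1-bit, size], newest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its timestamp has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # While three buckets share a size, merge the two oldest of them.
        # The merged bucket keeps the newer timestamp and doubles in size.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2
            del self.buckets[i + 2]
            i += 1

    def count(self):
        # All buckets count in full except the oldest, which counts as half.
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2


counter = DGIM(window_size=8)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    counter.add(bit)
print("estimated 1-bits in the last 8 bits:", counter.count())
```

Because only the oldest bucket can straddle the window boundary, the estimate is off by at most half of that bucket's size, which keeps the relative error within 50% for this variant.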

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
# Your code here, you can add cells if necessary


# the function method
import math
import time
import numpy as np
filename = "../input/coding2/stream_data.txt"

buckets = {}
window_size = 1000
current_time = 0
update_time = 1000# no larger than the window_size
update_index = 0

keysnum = int(math.log(window_size, 2)) + 1
key_list = list()

# initialize the buckets
for i in range(keysnum):
    key = int(math.pow(2, i))
    key_list.append(key)
    buckets[key] = list()


def Update_buckets(inputdict, klist, numkeys):
    for key in klist:
        if len(inputdict[key]) > 2:
            inputdict[key].pop(0)
            tstamp = inputdict[key].pop(0)
            if key != klist[-1]:
                inputdict[key * 2].append(tstamp)
        else:
            break

def OutputResult(inputdict, klist, window_size):
    cnt = 0
    firststamp = 0

    for key in klist:
        if len(inputdict[key]) > 0:
            firststamp = inputdict[key][0]

    for key in klist:
        for tstamp in inputdict[key]:
            if tstamp != firststamp:
                cnt += key
            else:
                cnt += 0.5 * key
    print ("The estimated number of 1 s by DGIM in the last %d bits: %d" % (window_size, cnt))

with open(filename, 'r') as rfile:
    start_time = time.time()
    while True:
        char = rfile.read(1)
        #print ("char", char)
        
        if not char:# no more input
            OutputResult(buckets, key_list, window_size)
            print ("The end of the document.")
            break
        
        #print ("buckets.keys()", buckets.keys())
        if char == "1" or char == "0" :
            current_time = (current_time + 1) % window_size
            update_index = (update_index + 1) % update_time
            for k in buckets.keys():
                # drop any bucket whose timestamp has left the window
                # (rebuild the list instead of removing items while iterating over it)
                buckets[k] = [t for t in buckets[k] if t != current_time]
            if update_index == 0:
                OutputResult(buckets, key_list, window_size)

        if char == "1":# add it to the buckets
            buckets[1].append(current_time)
            Update_buckets(buckets, key_list, keysnum)
        
        

    end_time = time.time()
    time_spent_DGIM = end_time - start_time
    print ("time_spent_DGIM",  time_spent_DGIM)

2. Write a function that counts the number of 1-bits in the current window exactly, and compare its running time and space usage with those of the DGIM algorithm.
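For comparison, an exact counter has to remember every bit in the window. The following is a minimal sketch under that assumption; `ExactWindowCounter` is a hypothetical name, and it keeps a running sum so that each update costs constant time while the memory grows linearly with the window size.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1-bits among the last `window_size` bits (sketch)."""

    def __init__(self, window_size):
        # A bounded deque discards its oldest element automatically when full.
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        # Subtract the bit that is about to fall out of the window.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones


exact = ExactWindowCounter(window_size=8)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    exact.add(bit)
print("exact 1-bits in the last 8 bits:", exact.count())
```

The trade-off is space: this stores all N bits of the window, whereas DGIM stores only a logarithmic number of buckets at the cost of an approximate answer.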

In [None]:
# Your code here, you can add cells if necessary

# the normal method 
import time
import numpy as np
filename = "../input/coding2/stream_data.txt"
window_size = 1000
current_time = 0
updateindex = 0

sum_bit = 0
keep_num = np.zeros((window_size))

with open(filename, 'r') as rfile:
    start_time = time.time()
    while True:
        char = rfile.read(1)
        
        if not char:# no more input
            # count the bits of the final, partial window directly from the buffer
            print ("The number of 1 s in the last %d bits: %d" % (current_time, np.sum(keep_num[:current_time])))
            print ("The end of the document.")
            break 
        if current_time==0 :
            sum_bit = 0

        if char == "1" or char == "0" :
            # shift the window by one position and store the newest bit at index 0
            keep_num = np.roll(keep_num, 1)
            keep_num[0] = int(char)
            current_time = (current_time + 1) 


        if current_time == window_size :
            sum_bit = 0
            for i in range( window_size ):
                sum_bit = sum_bit + keep_num[i]
            print ("The number of 1 s in the last %d bits: %d" % (window_size, sum_bit))
            current_time = 0

    end_time = time.time()
    time_spent_NORMAL = end_time - start_time
    print ("time_spent_NORMAL", time_spent_NORMAL)

In [None]:
print ("time_spent_NORMAL =", time_spent_NORMAL , "s , while time_spent_DGIM =", time_spent_DGIM, "s ")

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.
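To make the data layout concrete, the Jaccard similarity of two documents can be computed directly from their 0/1 rows. This is a small sketch on made-up rows, not on the real file; `jaccard_binary` is a hypothetical helper name.

```python
def jaccard_binary(row_a, row_b):
    """Jaccard similarity of two equal-length 0/1 rows (sketch)."""
    # Shingles present in both documents.
    both = sum(1 for a, b in zip(row_a, row_b) if a == 1 and b == 1)
    # Shingles present in at least one document.
    either = sum(1 for a, b in zip(row_a, row_b) if a == 1 or b == 1)
    return both / either if either else 0.0


# Two toy documents over six shingle columns (illustrative values only).
doc_a = [1, 1, 0, 0, 1, 0]
doc_b = [1, 0, 1, 0, 1, 0]
print("Jaccard similarity:", jaccard_binary(doc_a, doc_b))
```

Here the two rows share two shingles out of four that appear in either row, so the similarity is 2 / 4.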

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
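One way to reason about *n* and *b* is the standard banding analysis. Assuming each band has r = n / b rows and two documents become candidates only when some band matches exactly, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve; the function names are hypothetical, and *r* is notation introduced here rather than in the assignment.

```python
def candidate_probability(s, rows_per_band, bands):
    """Probability that a pair with similarity s matches in at least one band."""
    return 1.0 - (1.0 - s ** rows_per_band) ** bands


def approximate_threshold(rows_per_band, bands):
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows_per_band)


# Example: a signature of length 45 split into 9 bands of 5 rows each.
rows_per_band, bands = 5, 9
print("approximate threshold:", approximate_threshold(rows_per_band, bands))
for s in (0.3, 0.5, 0.8, 0.9):
    print(s, candidate_probability(s, rows_per_band, bands))
```

More bands with fewer rows each lower the threshold and catch more pairs at the cost of more false positives; fewer, longer bands do the opposite. Note that this analysis does not apply directly if band values are hashed in a way that lets different bands collide.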

In [None]:
# Your code here, you can add cells if necessary

# input the document 
import pandas as pd
import numpy as np
df=pd.read_csv('../input/coding2/docs_for_lsh.csv',sep=',') 
print (df.head())
print (df.tail())


In [None]:
print ("df['doc_id'] ", df['doc_id'])
doc_num = df.shape[0]  # one row per document; the columns are doc_id plus the shingles
print ("doc_num", doc_num)
# print ("doc_num", df['198'][999998])

In [None]:
# the min_hashing function
def min_hashing(df,  sig_length):
    doc_num = df.shape[0]
    print ("doc_num", doc_num)
    shingle_num = df.shape[1] - 1
    print ("shingle_num", shingle_num)
    signature = np.zeros((sig_length, doc_num))
    for i in range(sig_length):
        index = np.arange(shingle_num)
        np.random.shuffle(index)
        #print ("index", index)
        for j in range(doc_num):  # j_th document
            if np.mod(j, 500000) == 0 :
                print ("j / 500000", j / 500000)
            for k in range(shingle_num): 
                tmp_index = index[k]
                tmp_search = df[str(tmp_index)][j]
                if tmp_search == 1:
                    signature[i, j] = k
                    #print ("k", k)
                    break
        print ("i in range(sig_length)", i)
        print ("signature[i,:]", signature[i,:])
    return signature

# the hashing for the band
def hash_band(signature, element_band, hash_interval):
    sig_length = signature.shape[0]
    doc_num = signature.shape[1]
    band_num = int (sig_length / element_band)
    print (band_num)
    hash_result = np.zeros((band_num, doc_num))
    
    for i in range(band_num):
        print ("band_num = ", i)
        for j in range(doc_num):
            tmp_signature = signature[i * element_band:i * element_band + element_band, j ]
            tmp_sum = np.sum(tmp_signature)
            tmp_hash_result = int(tmp_sum / hash_interval )
#             print ("tmp_signature", tmp_signature, "j = ", j )
#             print ("tmp_hash_result", tmp_hash_result)
            hash_result [i,j] = tmp_hash_result
    return hash_result


In [None]:
sig_length = 45  # signature length n (number of minhash functions)
element_band = 5  # rows per band, giving sig_length / element_band = 9 bands
hash_interval = 3  # bin width used when hashing the sum of each band

# establish the signature
signature = min_hashing(df,  sig_length)
print ("signature",signature )
print ("signature",signature[:, 0] )

# get the band hashing result
hash_result =  hash_band(signature, element_band, hash_interval)
print ("hash_result", hash_result)

In [None]:
def get_similar_document(hash_result, compare_index, top_k ):
    search_num = len(compare_index)
    similar_document_matrix = np.zeros((search_num, top_k ))
    for i in range(search_num):
        compare_doc = compare_index[i]
        tmp_index = hash_result[:, compare_doc]
        compare_matrix = np.zeros_like(hash_result)
        for j in range(compare_matrix.shape[1]):
            compare_matrix[:, j] = (hash_result[:, j]  == tmp_index)
            #print ("compare_matrix[:, j]", compare_matrix[:, j], " j = ", j)
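        # calculate the estimated similarity: the fraction of bands whose hash values match
        Jaccard_sim = np.sum(compare_matrix, axis = 0) / compare_matrix.shape[0]
#         print ("Jaccard_sim.shape", Jaccard_sim.shape)
        sim_doc = Jaccard_sim.argsort()[::-1]
        # exclude the query document itself; with ties it is not guaranteed to sort first
        sim_doc = sim_doc[sim_doc != compare_doc]
#         print ("sim_doc.shape", sim_doc.shape)
        print (sim_doc[0:top_k])
        print (Jaccard_sim[sim_doc[0:top_k]])
        similar_document_matrix[i,:] = sim_doc[0:top_k]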
        print (" compare_doc = ", compare_doc ,  " similar_document_matrix = ", similar_document_matrix[i,:])
    return similar_document_matrix


In [None]:
# the result
compare_index = [0, 3] # searching document 
top_k = 30 
similar_document_matrix = get_similar_document(hash_result, compare_index , top_k)


In [None]:
# examine the result
from sklearn import metrics
data = (np.loadtxt("../input/coding2/docs_for_lsh.csv", delimiter=',', skiprows=1, usecols=range(1,201)))


result = np.zeros((top_k))
compare_doc = 0
for i in range(top_k):
    compare = int (similar_document_matrix[compare_doc][i])
    #print ("compare", compare)
    result[i] = metrics.jaccard_score(data[compare,:],data[compare_doc,:])
    print(metrics.jaccard_score(data[compare,:],data[compare_doc,:]))

print ("mean effect = ", np.mean(result))

Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.
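The reason such a validation is meaningful is that the fraction of positions where two minhash signatures agree estimates the Jaccard similarity of the underlying shingle sets. The sketch below checks this on two made-up sets using only the standard library; all function names are hypothetical, and it uses explicit random permutations rather than hash functions.

```python
import random


def make_permutations(num_shingles, n, seed=0):
    """Return n random permutations; perm[s] is the rank of shingle s."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n):
        perm = list(range(num_shingles))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash_signature(shingle_set, permutations):
    """Smallest rank of any shingle in the (non-empty) set, per permutation."""
    return [min(perm[s] for s in shingle_set) for perm in permutations]


def signature_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)


# Two toy documents over 200 shingle ids (illustrative values only).
doc_a = set(range(0, 60))
doc_b = set(range(30, 90))
perms = make_permutations(num_shingles=200, n=500)
estimate = signature_similarity(minhash_signature(doc_a, perms),
                                minhash_signature(doc_b, perms))
print("estimated:", estimate, "exact:", exact_jaccard(doc_a, doc_b))
```

A longer signature narrows the gap between the estimate and the exact value, which is one reason to choose *n* well above the minimum.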

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
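A common way to put similar documents into the same bucket is to use each band of the signature, as a whole, as the bucket key, so that only identical bands collide. The sketch below illustrates that idea on made-up signatures; `candidate_pairs` is a hypothetical helper, and this differs from hashing a band by the sum of its values.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands):
    """Return pairs of doc ids whose signatures agree on at least one band.

    signatures: dict mapping doc id -> list of minhash values (equal lengths).
    """
    sig_length = len(next(iter(signatures.values())))
    rows = sig_length // bands
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # The tuple of band values is the bucket key: exact match only.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            pairs.update(combinations(sorted(docs), 2))
    return pairs


# Toy signatures of length 4, split into 2 bands of 2 rows each.
toy_signatures = {
    0: [1, 2, 3, 4],
    1: [1, 2, 9, 9],
    2: [7, 7, 7, 7],
}
print(candidate_pairs(toy_signatures, bands=2))
```

Only the candidate pairs then need an exact similarity check, which avoids comparing every document against every other one.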

In [None]:
# Your code here, you can add cells if necessary

# the result
compare_index = [0]
top_k = 30 
similar_document_matrix = get_similar_document(hash_result, compare_index , top_k)
# print ("similar_document_matrix", similar_document_matrix)

In [None]:
# examine the result
from sklearn import metrics
# data = (np.loadtxt("../input/coding2/docs_for_lsh.csv", delimiter=',', skiprows=1, usecols=range(1,201)))

compare_doc = 0
for i in range(top_k):
    compare = int (similar_document_matrix[compare_doc][i])
    result[i] = metrics.jaccard_score(data[compare,:],data[0,:])
    print(metrics.jaccard_score(data[compare,:],data[0,:]))

print ("mean effect = ", np.mean(result))