# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When it is infeasible to store the entire binary stream, DGIM can estimate the number of 1-bits in the most recent window. In this assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

First, we preprocess the input file so that each bit sits on its own line:

In [None]:
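with open('clean.txt','w') as clean_f:
    with open('../input/coding2/stream_data.txt') as f:
        # only the first line is read; the stream is assumed to be one tab-separated line
        line = f.readline()
        # strip the trailing newline so the last bit is not followed by a blank line
        cur_bits_list = line.strip().split('\t')
        for bit in cur_bits_list:
            line_clean = bit + '\n'
            clean_f.write(line_clean)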

In [None]:
# Your code here, you can add cells if necessary
import math
import time

buckets = []
windowsize = 1000

In [None]:
current_location = 5000

In [None]:
def check_over_windowsize(time_now):
    # buckets[0] is the oldest bucket; drop it once its most recent 1-bit has left the window
    if len(buckets)>0 and buckets[0]['timestamp']<=time_now-windowsize:
        del buckets[0]

In [None]:
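def merge_buckets():
    # scan from the newest bucket (end of the list) towards the oldest
    for i in range(len(buckets)-1,1,-1):
        if buckets[i]['bit_sum']==buckets[i-2]['bit_sum']:
            # three buckets of the same size: merge the two oldest ones,
            # keeping the more recent timestamp of the pair
            buckets[i-2]['bit_sum']+=buckets[i-1]['bit_sum']
            buckets[i-2]['timestamp']=buckets[i-1]['timestamp']
            del buckets[i-1]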

In [None]:
def DGIM():
    # start from an empty state so that re-running this cell does not reuse old buckets
    buckets.clear()
    # read the file
    with open('clean.txt') as f:
        # read from the start up to current_location
        for i in range(current_location):
            line = f.readline()
            if not line:
                # reached the end of the file
                break
            check_over_windowsize(i+1)
            # a 1-bit opens a new bucket of size 1
            if line.strip('\n') == "1":
                bucket = {"timestamp":i+1,"bit_sum":1}
                buckets.append(bucket)
                merge_buckets()
    if len(buckets) == 0:
        return 0
    # sum all buckets, counting only half of the oldest one
    total_sum = 0
    for i in range(len(buckets)):
        total_sum+=buckets[i]['bit_sum']
    total_sum-=buckets[0]['bit_sum']/2
    return total_sum
                

In [None]:
start_time_DGIM = time.time()
DGIM_sum = DGIM()
end_time_DGIM = time.time()
print("DGIM method total time:",end_time_DGIM - start_time_DGIM)
print("DGIM sum:",DGIM_sum)
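
For a quick sanity check without *stream_data.txt*, the same bucket logic can be exercised on a synthetic stream. The following is a minimal sketch, not the notebook's implementation: the `DGIMCounter` class, its method names, and the random test stream are all assumptions made for illustration. It keeps the newest bucket first and subtracts half of the oldest bucket using integer division.

```python
import random


class DGIMCounter:
    """Illustrative DGIM counter (hypothetical helper, not part of the assignment)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        # Each bucket is [timestamp_of_most_recent_1, size]; newest bucket first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Allow at most two buckets per size: merge the two oldest of any three.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                # The merged bucket keeps the newer timestamp and doubles in size.
                self.buckets[i + 1][1] *= 2
                del self.buckets[i + 2]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may lie inside the window: count half of it.
        return total - self.buckets[-1][1] // 2


random.seed(0)
stream = [random.randint(0, 1) for _ in range(5000)]
counter = DGIMCounter(window_size=1000)
for bit in stream:
    counter.add(bit)

exact = sum(stream[-1000:])
estimate = counter.estimate()
print("estimate:", estimate, "exact:", exact, "buckets:", len(counter.buckets))
```

The estimate should stay within 50% of the exact count, which is the usual DGIM error bound, while the number of buckets stays small.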

2. Write a function that accurately counts the number of 1-bits in the current window, and compare its running time and space usage with those of the DGIM algorithm.

In [None]:
# Your code here, you can add cells if necessary
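def count_exact_results():
    total_num = 0
    with open('clean.txt') as f:
        # every line of clean.txt is one bit plus '\n' (2 bytes), so the window
        # starts at byte offset 2*(current_location - windowsize)
        f.seek(0 if windowsize >= current_location else 2*(current_location - windowsize))
        # read at most windowsize lines, or current_location lines if the stream is shorter
        for i in range(current_location if windowsize >= current_location else windowsize):
            line = f.readline()
            if line and line.strip('\n') == '1':
                total_num += 1
    return total_num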

In [None]:
start_time_exact = time.time()
exact_sum = count_exact_results()
end_time_exact = time.time()
print("exact result's total time:",end_time_exact - start_time_exact)
print("exact sum:",exact_sum)
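
The function above can jump back into *clean.txt*, which a real stream does not allow. An exact counter in a streaming setting has to remember the last N bits itself. A minimal sketch of that idea follows; the `ExactWindowCounter` name and the tiny test stream are assumptions for illustration only.

```python
from collections import deque


class ExactWindowCounter:
    """Exact sliding-window counter that stores every bit of the window."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        # When the deque is full, appending evicts the leftmost (oldest) bit.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit


counter = ExactWindowCounter(window_size=4)
for bit in [1, 0, 1, 1, 0, 1, 0, 0]:
    counter.add(bit)
print(counter.ones)
```

Each update takes constant time, but the memory grows linearly with the window size, which is the cost DGIM avoids.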

Compare the space consumption of the two methods:

In [None]:
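import sys
# getsizeof(list) only counts the list object itself, so add the bucket dicts
DGIM_space = sys.getsizeof(buckets) + sum(sys.getsizeof(b) for b in buckets)
print("DGIM space:",DGIM_space)
# the exact method re-reads clean.txt and holds one line at a time in memory;
# a true streaming version would have to keep all windowsize bits
exact_space = sys.getsizeof("1")
print("exact finding uses space:",exact_space)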

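With a window size of 1000, the exact count is faster and uses less space here. It only reads the last 1000 lines of *clean.txt*, while DGIM processes every bit up to the current position and keeps its bucket list in memory.

However, this relies on being able to re-read the stored stream. In a true streaming setting, the exact method has to keep all N bits of the window, whereas DGIM keeps only O(log N) buckets, so for a large window DGIM needs less space and answers a count query faster than rescanning the window.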
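
The space argument can be made concrete with a rough calculation. The sketch below assumes the textbook DGIM layout: at most two buckets per size, sizes that are powers of two, a timestamp stored modulo N, and the size stored as its exponent. The function name `dgim_bits_upper_bound` is hypothetical, and the result is an estimate rather than a measurement of the notebook's Python objects.

```python
import math


def dgim_bits_upper_bound(window_size):
    """Rough upper bound on DGIM storage in bits (illustrative estimate)."""
    num_sizes = math.floor(math.log2(window_size)) + 1        # sizes 1, 2, 4, ...
    max_buckets = 2 * num_sizes                                # at most two per size
    timestamp_bits = math.ceil(math.log2(window_size))         # timestamp modulo N
    exponent_bits = max(1, math.ceil(math.log2(num_sizes)))    # store log2(size)
    return max_buckets * (timestamp_bits + exponent_bits)


for n in (1000, 10**6, 10**9):
    # Storing the window exactly needs one bit per position, i.e. n bits.
    print(n, "exact bits:", n, "DGIM bits <=", dgim_bits_upper_bound(n))
```

The exact window grows linearly with N, while the DGIM bound grows roughly with the square of log N.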

### Locality Sensitive Hashing

Locality sensitive hashing (LSH) is an efficient algorithm for near-duplicate document detection. In this assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, the corresponding row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity. 
Parameters you need to determine:
1) Length of signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
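
To see how *n* and *b* interact, recall the standard LSH analysis: with *b* bands of *r = n / b* rows each, two documents with Jaccard similarity *s* share at least one band with probability 1 - (1 - s^r)^b, and the curve rises most steeply near s ≈ (1/b)^(1/r). The sketch below evaluates both; the function names are hypothetical and n = 25, b = 5 are example values.

```python
def candidate_probability(s, rows_per_band, num_bands):
    """Probability that two documents with similarity s share at least one band."""
    return 1 - (1 - s ** rows_per_band) ** num_bands


def approx_threshold(rows_per_band, num_bands):
    """Similarity at which the S-curve rises most steeply (approximation)."""
    return (1 / num_bands) ** (1 / rows_per_band)


n, b = 25, 5          # example signature length and number of bands
r = n // b            # rows per band
print("threshold ~", round(approx_threshold(r, b), 3))
for s in (0.2, 0.5, 0.8, 0.9):
    print(s, round(candidate_probability(s, r, b), 3))
```

Increasing *b* (with fewer rows per band) lowers the threshold and admits more candidates; decreasing it does the opposite.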

In [None]:
# Your code here, you can add cells if necessary
signature_length = 25
bands = 5

In [None]:
import pandas as pd
import numpy as np
import random

In [None]:
origin_data = pd.read_csv('../input/coding2/docs_for_lsh.csv')
# print(origin_data)
input_matrix = origin_data.values
print(input_matrix)
print(input_matrix.shape)

In [None]:
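def minhash(input_matrix,signature_length):
    num_docs = input_matrix.shape[0]
    num_shingles = input_matrix.shape[1] - 1  # column 0 is doc_id
    sign = np.zeros((signature_length,num_docs),dtype=int)
    # each row: the signature values produced by one hash function
    for i in range(signature_length):
        a = random.randint(1,10)
        b = random.randint(1,10)
        for m in range(num_docs):
            # sentinel for a document with no shingle among the visited rows
            # (the original loop would never terminate in that case)
            sign[i][m] = num_shingles
            for j in range(num_shingles):
                # visit the shingle rows in a pseudo-random order; 229 is a prime > 200,
                # but the final modulo means this is not an exact permutation
                cur_newline_num = ((j*a+b)%229)%num_shingles
                if input_matrix[m][cur_newline_num+1] == 1:
                    # the first shingle found in this order is the minhash value
                    sign[i][m] = cur_newline_num
                    break
    return sign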
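
The hash `((j*a+b)%229)%200` used above is only an approximation of a random permutation of the 200 shingle rows. As an alternative, here is a minimal sketch that uses true random permutations on documents represented as sets of shingle indices. It is a different formulation from the notebook's function, and the names `minhash_signatures` and `estimated_jaccard` as well as the toy documents are assumptions for illustration.

```python
import random


def minhash_signatures(docs, num_shingles, signature_length, seed=0):
    """docs: list of sets of shingle indices. Returns one signature list per document."""
    rng = random.Random(seed)
    signatures = [[] for _ in docs]
    for _ in range(signature_length):
        perm = list(range(num_shingles))
        rng.shuffle(perm)  # perm[shingle] = position of that shingle under this permutation
        for sig, shingles in zip(signatures, docs):
            # The minhash is the smallest permuted position among the document's shingles.
            # Empty documents get the sentinel value num_shingles.
            sig.append(min((perm[s] for s in shingles), default=num_shingles))
    return signatures


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


docs = [{0, 1, 2, 3}, {0, 1, 2, 3}, {10, 11, 12}]
sigs = minhash_signatures(docs, num_shingles=200, signature_length=25)
print(estimated_jaccard(sigs[0], sigs[1]))  # identical sets agree on every position
print(estimated_jaccard(sigs[0], sigs[2]))  # disjoint sets never agree under a permutation
```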

In [None]:
min_hash_table = minhash(input_matrix,signature_length)
# each row: the values of one hash function for all documents
print(min_hash_table)
print(min_hash_table.shape)

In [None]:
# transpose so that each row is the full signature of one document
min_hash_table_reverse = min_hash_table.T

In [None]:
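def jaccard_calculation(table,line_A,line_B):
    # estimate the Jaccard similarity as the fraction of matching signature values
    A_hash_values = table[line_A]
    B_hash_values = table[line_B]
    # np.float was removed from recent NumPy releases; true division already returns a float
    return np.count_nonzero(A_hash_values == B_hash_values) / len(A_hash_values)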

In [None]:
test_score = jaccard_calculation(min_hash_table_reverse,24,10000)
print(test_score)

Brute-force method (not suitable for big data): compare every pair of documents directly. This takes a very long time.

In [None]:
# best_score = 0
# best_A = 0
# best_B = 0
# for i in range(1000000):
#     for j in range(i+1,1000000):
#         cur_jaccard = jaccard_calculation(min_hash_table_reverse,i,j)
#         if cur_jaccard > best_score:
#             best_score = cur_jaccard
#             best_A = i
#             best_B = j
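
To see why the exhaustive search is impractical, count the comparisons it needs. The following short calculation uses the 1,000,000 documents hard-coded above; the throughput figure is an assumption chosen only to give a sense of scale.

```python
num_docs = 1_000_000                        # number of documents used in the cells above
num_pairs = num_docs * (num_docs - 1) // 2  # every unordered pair is compared once
print(f"{num_pairs:,} pairs")

# Assumed throughput for illustration only; measure it on your own machine.
assumed_pairs_per_second = 1_000_000
days = num_pairs / assumed_pairs_per_second / 86400
print(f"about {days:.1f} days at the assumed rate")
```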

LSH method:

In [None]:
print(min_hash_table_reverse.shape)

In [None]:
# bands is the number of bands; each band covers rows_per_band signature values
rows_per_band = signature_length // bands
H_table_lists = []
for i in range(bands):
    cur = min_hash_table_reverse[:,i*rows_per_band:(i+1)*rows_per_band]
    Hash_table = {}
    cur_line_num = 0
    for j in cur:
        # use the band's values, joined into a string, as the bucket key
        cur_key = ''
        for m in j:
            cur_key += str(m)
            cur_key += ','
        if cur_key in Hash_table:
            Hash_table[cur_key].append(cur_line_num)
        else:
            Hash_table[cur_key] = [cur_line_num]
        cur_line_num += 1
    H_table_lists.append(Hash_table)
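
The banding step can also be tried on a toy example. This is a minimal sketch under assumptions: the helper name `lsh_candidate_pairs` and the three hand-written signatures are hypothetical, and tuples are used as bucket keys instead of joined strings.

```python
from collections import defaultdict


def lsh_candidate_pairs(signatures, num_bands):
    """signatures: list of equal-length lists. Returns a set of candidate (i, j) pairs."""
    rows_per_band = len(signatures[0]) // num_bands
    candidates = set()
    for band in range(num_bands):
        buckets = defaultdict(list)
        start = band * rows_per_band
        for doc_index, sig in enumerate(signatures):
            # A tuple of the band's values is hashable and needs no string building.
            buckets[tuple(sig[start:start + rows_per_band])].append(doc_index)
        for members in buckets.values():
            # Every pair of documents in the same bucket becomes a candidate pair.
            for x in range(len(members)):
                for y in range(x + 1, len(members)):
                    candidates.add((members[x], members[y]))
    return candidates


signatures = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 9, 9, 5, 6],   # shares the first and last band with document 0
    [7, 7, 7, 7, 7, 7],   # shares no band with the others
]
print(lsh_candidate_pairs(signatures, num_bands=3))
```

Only the pair that agrees on a whole band is returned, so the expensive similarity check runs on far fewer pairs than the exhaustive search.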

In [None]:
print(len(Hash_table.keys()))
# print(Hash_table.keys())

In [None]:
# print(Hash_table)

In [None]:
best_score = 0
best_a = 0
best_b = 0
# Final_table = {}
max_flag = 0
for table in H_table_lists:
    for bucket in table.values():
        # take each bucket in turn
        bucket_length = len(bucket)
        for i in range(bucket_length):
            # compare every pair of documents inside the bucket
            line_number_a = bucket[i]
            for j in range(i+1,bucket_length):
                line_number_b = bucket[j]
#                 cur_tuple = (line_number_a,line_number_b)
#                 if cur_tuple not in Final_table.keys():
#                     Final_table[cur_tuple] = 1
#                 else:
#                     Final_table[cur_tuple] += 1
                cur_score = jaccard_calculation(min_hash_table_reverse,line_number_a,line_number_b)
                if cur_score > best_score:
                    best_score = cur_score
                    best_a = line_number_a
                    best_b = line_number_b
                    if best_score == 1:
                        max_flag = 1
                        break
            if max_flag:
                break
        if max_flag:
            break
    if max_flag:
        break

In [None]:
print(best_score)
print(best_a)
print(best_b)

Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
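
For validation, the exact Jaccard similarity of two 0/1 rows is the number of shingles present in both documents divided by the number present in either. The sketch below computes it in plain Python; the function name `exact_jaccard` and the two toy rows are assumptions. On binary rows, `sklearn.metrics.jaccard_score` with its default settings should give the same value.

```python
def exact_jaccard(row_a, row_b):
    """Exact Jaccard similarity of two equal-length 0/1 sequences."""
    intersection = sum(1 for a, b in zip(row_a, row_b) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(row_a, row_b) if a == 1 or b == 1)
    # Two empty documents have no shingles to compare; return 0.0 by convention.
    return intersection / union if union else 0.0


doc_a = [1, 1, 0, 1, 0]
doc_b = [1, 0, 0, 1, 1]
print(exact_jaccard(doc_a, doc_b))  # 2 shared shingles out of 4 present in either row
```

In this notebook the rows would be the shingle columns of `input_matrix`, that is, everything after the `doc_id` column.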

In [None]:
# Your code here, you can add cells if necessary
threshold = 0.8

# same banding as before, but keep only the buckets that contain document 0 (row 0)
rows_per_band = signature_length // bands
H_table_lists = []
for i in range(bands):
    cur = min_hash_table_reverse[:,i*rows_per_band:(i+1)*rows_per_band]
    Hash_table = {}
    cur_line_num = 0
    for j in cur:
        cur_key = ''
        for m in j:
            cur_key += str(m)
            cur_key += ','
        if cur_key in Hash_table:
            # this document shares the band with document 0
            Hash_table[cur_key].append(cur_line_num)
        elif cur_line_num == 0:
            # row 0 is processed first and creates the only bucket of this band
            Hash_table[cur_key] = [cur_line_num]
        cur_line_num += 1
    H_table_lists.append(Hash_table)

In [None]:
# print(Hash_table)

In [None]:
def take_score(elem):
    return elem[0]

In [None]:
best_score_list = []
# collect every document that shares at least one band with document 0
candidates = set()
for table in H_table_lists:
    for bucket in table.values():
        # bucket[0] is document 0 itself; the rest are its candidates
        candidates.update(bucket[1:])
# score each candidate once, so a document found in several bands is not listed twice
for line_number_b in candidates:
    cur_score = jaccard_calculation(min_hash_table_reverse,0,line_number_b)
    best_score_list.append((cur_score,line_number_b))
# keep the 30 highest-scoring (score, row index) tuples
best_score_list.sort(key=take_score,reverse=True)
best_score_list = best_score_list[:30]

Print the 30 documents most similar to document 0:

In [None]:
best_score_list.sort(key=take_score,reverse=True)
print(best_score_list)
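
When only the top few results are needed, the standard library's `heapq.nlargest` avoids sorting the whole list. A small sketch follows; the `(score, row_index)` pairs are made-up values standing in for the scored candidates.

```python
import heapq

# Hypothetical (score, row_index) pairs standing in for the scored candidates.
scored = [(0.92, 17), (0.40, 3), (0.88, 256), (0.96, 42), (0.75, 9)]

# Keep the three pairs with the highest score, best first.
top_three = heapq.nlargest(3, scored, key=lambda pair: pair[0])
print(top_three)
```

With 30 in place of 3, the same call would select the 30 most similar documents from the scored candidates.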