# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When it is infeasible to store the whole incoming binary stream, DGIM can estimate the number of 1-bits in the current window using only a small number of buckets. In this coding assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.
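To make the bucket idea concrete, here is a minimal sketch of DGIM, separate from the solution cells below. The class name `DGIM` and its methods are hypothetical names chosen for illustration. It assumes the standard rule of keeping at most two buckets of each power-of-two size, and it estimates the count as the sum of all bucket sizes minus half of the oldest bucket.

```python
import random


class DGIM:
    """Illustrative DGIM counter (hypothetical helper, not the assignment's reference solution)."""

    def __init__(self, window):
        self.window = window
        self.t = 0            # current timestamp (number of bits seen so far)
        self.buckets = []     # [timestamp of newest 1 in bucket, size], newest bucket first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its timestamp has left the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.t, 1])
        # While three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2   # merged bucket keeps the newer timestamp
            del self.buckets[i + 2]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2


# Compare the estimate with the exact count on a synthetic stream.
rng = random.Random(0)
stream = [rng.randint(0, 1) for _ in range(5000)]
counter = DGIM(window=1000)
for bit in stream:
    counter.add(bit)
print("estimate:", counter.estimate(), "exact:", sum(stream[-1000:]))
```

Because only the oldest bucket can straddle the window boundary, the estimate is off by at most half of that bucket's size, which is where the usual 50% error bound comes from.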

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
size_window = 1000
time_location = 1000  # assume the current window starts at the first character of the stream

def Count_bit_act():
    bit_sum = 0  # running count of 1-bits
    with open('../input/coding2/stream_data.txt', 'r') as f:
        # Jump to the start of the window; the factor 2 assumes each bit is one character followed by one separator
        f.seek(0 if time_location <= size_window else 2 * (time_location - size_window))
        num = f.read()
        num = num.split()
        for i in range(time_location if time_location <= size_window else size_window):
            temp=num[i]
            if temp and int(temp) == 1:
                bit_sum += 1
    return bit_sum

bit_act_sum= Count_bit_act()

print("The number of \"1\" is:",bit_act_sum)

2. Write a function that exactly counts the number of 1-bits in the current window, and compare its running time and memory use with those of the DGIM algorithm.

In [None]:
import time
import psutil
import os

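bucket_n = []  # list of buckets, oldest first
n_max_bucket = 2
size_window = 1000  # assume a window length of 1000
time_location = 1000  # assume the window starts at the first character of the stream

def show_info(start):
    pid = os.getpid()
    # get the pid of the current process
    p = psutil.Process(pid)
    # look up the process by pid, then read its memory usage
    info = p.memory_full_info()
    memory = info.uss/1024
    return memory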

def Count_bit_act():
    bit_sum = 0  # running count of 1-bits
    start_time = time.time()
    start = show_info('start')

    with open('../input/coding2/stream_data.txt', 'r') as f:
        f.seek(0 if time_location <= size_window else 2 * (time_location - size_window))  # jump to the start of the window
        num = f.read()
        num = num.split()
        for i in range(time_location if time_location <= size_window else size_window):
            temp=num[i]
            if temp and int(temp) == 1:
                bit_sum += 1
    end = show_info('end')

    return bit_sum, time.time() - start_time, str(end-start)

def Is_due(time_now):
    if len(bucket_n) > 0 and time_now - size_window == bucket_n[0]['timestamp']:  # the oldest bucket's timestamp equals the current time minus the window size, so it has expired
        del bucket_n[0]

def Merge():
    for i in range(len(bucket_n) - 1, n_max_bucket - 1, -1):
        if bucket_n[i]['bit_sum'] == bucket_n[i - n_max_bucket]['bit_sum']:
            # more than n_max_bucket buckets share this size, so merge the two oldest of them
            bucket_n[i - n_max_bucket]['bit_sum'] += bucket_n[i - n_max_bucket + 1]['bit_sum']
            bucket_n[i - n_max_bucket]['timestamp'] = bucket_n[i - n_max_bucket + 1]['timestamp']
            del bucket_n[i - n_max_bucket + 1]

def Count_bit():
    bit_sum = 0
    start_time = time.time()
    start=show_info('start')

    with open('../input/coding2/stream_data.txt', 'r') as f:
        num = f.read()
        num = num.split()
        for i in range(time_location):
            temp = num[i]  # read the next value from the file
            if temp:
                Is_due(i + 1)  # check whether the oldest bucket has expired
                if int(temp) == 1:
                    bucket = {"timestamp": i + 1, "bit_sum": 1}  # bucket structure
                    bucket_n.append(bucket)
                    Merge()  # merge buckets of the same size
    if len(bucket_n) > 0:  # guard against an empty bucket list (no 1-bits in the window)
        for i in range(len(bucket_n)):
            bit_sum += bucket_n[i]['bit_sum']
        bit_sum -= bucket_n[0]['bit_sum'] / 2  # count only half of the oldest bucket
    bit_sum=int(bit_sum)
    end=show_info('end')

    return bit_sum, time.time() - start_time,str(end-start)


bit_sum, bit_time, bit_space= Count_bit()
bit_act_sum, bit_act_time, bit_act_space = Count_bit_act()

print("The estimated number of \"1\" is:",bit_sum,", the processing time is:",bit_time,"s",", and the space is:",bit_space,"KB")
print("The exact number of \"1\" is:",bit_act_sum,", the processing time is:",bit_act_time,"s",", and the space is:",bit_act_space,"KB")

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, then its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.
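As a small illustration of this layout, the sketch below computes the Jaccard similarity of two documents given as 0/1 rows. The helper name `jaccard` and the toy rows are assumptions made for the example; they are not taken from *docs_for_lsh.csv*.

```python
def jaccard(row_a, row_b):
    """Jaccard similarity of two equal-length 0/1 rows (hypothetical helper)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(row_a, row_b) if a == 1 or b == 1)
    # Two documents with no shingles at all are treated as having similarity 0 here.
    return both / either if either else 0.0


# Toy rows with 6 shingle columns instead of 200.
doc_a = [1, 1, 0, 0, 1, 0]
doc_b = [1, 0, 0, 0, 1, 1]
print(jaccard(doc_a, doc_b))  # 2 shared shingles out of 4 present in either document
```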

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity. 
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands *b* that the signature matrix is divided into. Recommended value: b > n // 10.
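To show why a longer signature helps, here is a minimal minhash sketch on toy data. It assumes each document is a non-empty set of shingle indices; the function names `minhash_signatures` and `signature_similarity` are hypothetical. The fraction of signature positions on which two documents agree is an estimate of their Jaccard similarity, and it becomes less noisy as *n* grows.

```python
import random


def minhash_signatures(docs, num_shingles, n, seed=0):
    """Build length-n minhash signatures from n random permutations (illustrative only)."""
    rng = random.Random(seed)
    signatures = [[] for _ in docs]
    for _ in range(n):
        perm = list(range(num_shingles))
        rng.shuffle(perm)  # perm[s] is the position of shingle s in the shuffled order
        for sig, shingles in zip(signatures, docs):
            # Minhash value: the smallest permuted position among the document's shingles.
            sig.append(min(perm[s] for s in shingles))
    return signatures


def signature_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


docs = [{0, 1, 2, 3}, {0, 1, 2, 4}, {5, 6, 7}]
sigs = minhash_signatures(docs, num_shingles=8, n=50)
print(signature_similarity(sigs[0], sigs[1]))  # estimate of the true Jaccard similarity 3/5
print(signature_similarity(sigs[0], sigs[2]))  # disjoint sets can never share a minhash value
```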

In [None]:
import numpy as np
import csv
import random

row_num = 0
data = []
with open('../input/coding2/docs_for_lsh.csv') as f:
    dataset = csv.reader(f)
    for row in dataset:
        row_num += 1
        if row_num == 1:
            pass
        else:
            data.append(row[1:])
data = np.array(data)
data = data.T

def MinHash(data, b, r):
    '''
    param: data: the ndarray type, for shingles and documents
    param: b: the number of the bands in signature matrix
    param: r: the number of rows in one single band in the signature matrix, b*r stands for the length of signature
    '''
    n = b * r
    signature = []
    for i in range(n):
        permutation = []
        signal_signature = []
        for num in range(1, data.shape[0] + 1):
            permutation.append(num)
        random.shuffle(permutation)
        for j in range(data.shape[1]):
            for k in range(data.shape[0]):
                index = permutation.index(k + 1)
                if data[index][j] == '1':
                    signal_signature.append(k + 1)
                    break
                else:
                    pass
        signature.append(signal_signature)
    return np.array(signature)

b = 10
r = 5
res_signature = MinHash(data, b, r)
print(res_signature)

Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tip: You can adjust your parameters so that documents with similarity *s > 0.8* are hashed into the same bucket.
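One way to choose the parameters is the standard banding analysis: with *b* bands of *r* rows each, two documents with Jaccard similarity *s* share at least one bucket with probability 1 - (1 - s^r)^b, and the similarity at which this curve rises most steeply is roughly (1/b)^(1/r). The sketch below evaluates both quantities; the function names are hypothetical.

```python
def candidate_probability(s, b, r):
    """Probability that two documents with similarity s agree on all rows of at least one band."""
    return 1 - (1 - s ** r) ** b


def approx_threshold(b, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1 / b) ** (1 / r)


# Evaluate the setting used in this notebook's cells (b = 10, r = 5).
b, r = 10, 5
print("approximate threshold:", round(approx_threshold(b, r), 3))
for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(s, round(candidate_probability(s, b, r), 3))
```

Increasing *r* pushes the threshold up (fewer false positives), while increasing *b* pushes it down (fewer false negatives).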

In [None]:
import hashlib
from sklearn.metrics import jaccard_score

def LSH(signature, b, r):
    '''
    param: signature: the ndarray type, signature matrix
    param: b: the number of the bands in signature matrix
    param: r: the number of rows in one single band in the signature matrix, b*r stands for the length of signature
    '''
    length, docnum = signature.shape
    buckets = {}
    start = 0

    for i in range(b):
        for j in range(docnum):
            # Hash the band's r signature values for document j into a bucket key
            signal_band = str(signature[start:start + r, j])
            hashed_band = hashlib.md5(signal_band.encode()).hexdigest()

            if hashed_band not in buckets:
                buckets[hashed_band] = [j]
            elif j not in buckets[hashed_band]:
                buckets[hashed_band].append(j)
        start += r
    return buckets

LSH_table = LSH(res_signature, b, r)
print('LSH over!')

def search(LSH_table, num):
    '''
    param: LSH_table: the dictionary type, LSH buckets
    param: num: the document id to search
    '''
    res = {}
    for key in LSH_table:
        if num in LSH_table[key] and len(LSH_table[key]) != 1:  # skip buckets that contain only the query document
            for docnum in LSH_table[key]:
                if docnum == num:
                    pass
                else:
                    if docnum in res:
                        res[docnum] += 1
                    else:
                        res[docnum] = 1
    return res

result = search(LSH_table,0)
result = sorted(result.items(),key=lambda item:item[1])

nearest_neighbor_num = 30
nearest_neighbor = []
for i in range(len(result) - 1, max(len(result) - nearest_neighbor_num, 0) - 1, -1):   # take the documents that share the most buckets first; stop early if there are fewer than 30 candidates
    nearest_neighbor.append(result[i])
#print('The nearest documents with times in the buckets are {}. '.format(nearest_neighbor))

check_data = data.T

LSH_neighbor = []

for i in range(len(nearest_neighbor)):
    check_doc = nearest_neighbor[i][0]
    score = jaccard_score(check_data[check_doc],check_data[0], pos_label= '1', average = 'binary')
    #print("doc {}'s score with doc 0 is : {}".format(check_doc,score))
    LSH_neighbor.append((check_doc,score))

print('The nearest documents are {}'.format(LSH_neighbor))