# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When it is infeasible to store the whole binary stream, DGIM can estimate the number of 1-bits in the most recent window. In this coding assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.
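
To make the bucket bookkeeping concrete, here is a minimal, self-contained sketch of DGIM on a synthetic stream. The class name `DGIM`, its `add` and `estimate` methods, and the random stream are assumptions made for illustration; they are not part of the assignment files, and the sketch is not the reference solution.

```python
import random


class DGIM:
    """Approximate count of 1-bits in the last `window` positions of a bit stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        # Each bucket is [timestamp of its most recent 1, size]; newest bucket first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # The oldest bucket expires once its timestamp falls outside the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Restore the invariant "at most two buckets of each size".
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                # Three buckets share a size: merge the two oldest,
                # keeping the more recent timestamp of the pair.
                self.buckets[i + 1][1] *= 2
                del self.buckets[i + 2]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only half of the oldest bucket is counted.
        return total - self.buckets[-1][1] / 2


# Synthetic stream (an assumption; the assignment reads stream_data.txt instead).
rng = random.Random(0)
stream = [rng.randint(0, 1) for _ in range(5000)]

window = 1000
dgim = DGIM(window)
for bit in stream:
    dgim.add(bit)

exact = sum(stream[-window:])
approx = dgim.estimate()
print(f"exact={exact}, estimate={approx}, buckets={len(dgim.buckets)}")
```

Because the oldest bucket is the only one that may straddle the window boundary, the estimate should differ from the true count by at most half of that bucket's size, which is the source of DGIM's error bound of at most 50%.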

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

Enter the current time

To make the results easy to save, `current_time` is set directly to 9000 here; in practice it can be read from user input.

In [None]:
#current_time = int(input("Please enter the current time(from 1):"))
current_time = 9000

Approximate solution with the DGIM algorithm

In [None]:
bucket_list = []

def del_bucket(t):
    # t is the current time step
    if len(bucket_list)>0 and t-1000==bucket_list[0]['timestamp']:
        del bucket_list[0]

def merge_bucket():
    for i in range(len(bucket_list)-1,1,-1):
        if bucket_list[i]['size'] == bucket_list[i-2]['size']:
            bucket_list[i-2]['size']+=bucket_list[i-1]['size']
            bucket_list[i-2]['timestamp']=bucket_list[i-1]['timestamp']
            del bucket_list[i-1]
        
def DGIM(current_time):
    count = 0
    
    with open('../input/coding2/stream_data.txt','r') as f:
        data = f.read().split()
        
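    # Update the bucket list according to the DGIM rules
    for i in range(current_time):
        del_bucket(i+1) # drop the oldest (leftmost) bucket if it has expired
        if int(data[i])==1: # the current bit is 1
            bucket = {"timestamp":i+1,"size":1}
            bucket_list.append(bucket)
            merge_bucket() # merge older buckets so that each size occurs once or twice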
    
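    # Compute the final estimate; an empty bucket list means no 1-bits in the window
    if len(bucket_list) == 0:
        return 0
    for i in range(len(bucket_list)):
        count += bucket_list[i]['size']
    count -= bucket_list[0]['size']/2 # the oldest bucket contributes only half of its size
    
    return count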

count = DGIM(current_time)
print("The estimated count by DGIM at current time is:",count)

2. Write a function that counts the exact number of 1-bits in the current window, and compare its running time and space with those of the DGIM algorithm.

Exact solution

In [None]:
def accurate_count(current_time):
    with open('../input/coding2/stream_data.txt','r') as f:
        data=f.read().split()
    left_side = 0 if current_time<=1000 else current_time-1000
    total = 0
    for i in range(current_time if current_time<=1000 else 1000):
        total += int(data[left_side+i])
    return total
acc_count = accurate_count(current_time)
print("The accurate count at current time is:",acc_count)

**The differences in running time and space:**

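For the exact function: it needs to store the whole stream window, so the space is $O(N)$ bits for a window of size $N$. If the window is kept in memory together with a running sum, each arriving bit takes only $O(1)$ time to update the window and the count of 1s; the function above instead rescans the window, which costs $O(N)$ time per query.

For the DGIM algorithm: each stream needs only $O(\log^2 N)$ bits, because there are $O(\log N)$ buckets and each one takes $O(\log N)$ bits to store, and each query takes $O(\log N)$ time to sum the bucket sizes.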
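
The exact side of this comparison can be sketched with a bounded deque that keeps the window in memory and maintains a running sum. The class name `ExactWindowCounter` and the toy input are assumptions for illustration only; the sketch stores every bit of the window, which is exactly the cost DGIM avoids.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1-bits in the last `window` bits: O(window) space, O(1) per update."""

    def __init__(self, window):
        self.bits = deque(maxlen=window)
        self.ones = 0

    def add(self, bit):
        # When the deque is full, appending evicts the oldest bit,
        # so remove its contribution from the running sum first.
        if len(self.bits) == self.bits.maxlen:
            self.ones -= self.bits[0]
        self.bits.append(bit)
        self.ones += bit


counter = ExactWindowCounter(window=4)
for bit in [1, 0, 1, 1, 1, 0]:
    counter.add(bit)
print(counter.ones)  # the last four bits are 1, 1, 1, 0
```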

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, the corresponding row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
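
To see why the signature length *n* matters, the following sketch estimates the Jaccard similarity of two toy documents from the fraction of matching minhash values. The documents, the helper `minhash_signature`, and the deliberately large `n = 2000` are assumptions chosen for illustration; the estimate gets noisier as *n* shrinks.

```python
import random


def minhash_signature(doc, permutations):
    """For each permutation, record the smallest permuted rank among the document's shingles."""
    return [min(perm[s] for s in doc) for perm in permutations]


rng = random.Random(42)
num_shingles = 200  # matches the 200 shingle columns described above
n = 2000            # signature length, deliberately large here to reduce variance

# Each permutation maps a shingle index to its rank in a random ordering.
permutations = []
for _ in range(n):
    ranks = list(range(num_shingles))
    rng.shuffle(ranks)
    permutations.append(ranks)

# Two toy documents given as sets of shingle indices (hypothetical data).
doc_a = set(range(0, 60))
doc_b = set(range(20, 80))

sig_a = minhash_signature(doc_a, permutations)
sig_b = minhash_signature(doc_b, permutations)

true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / n
print(f"true={true_jaccard:.3f}, minhash estimate={estimate:.3f}")
```

For a single random permutation, the probability that two documents receive the same minhash value equals their Jaccard similarity, so the fraction of agreeing signature positions is an unbiased estimate of it.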

Reading and preprocessing the CSV data

In [None]:
import csv
import numpy as np

dataSet =[]
with open("../input/coding2/docs_for_lsh.csv") as f:
    reader = csv.reader(f)
    headers = next(reader)
    for row in reader:
        data_row = row[1:]
        dataSet.append([float(item) for item in data_row])
query_matrix = np.array(dataSet)
#query_matrix = query_matrix[0:10000]
input_matrix = query_matrix.T # transpose so that each column represents a document

Set the parameters: n = 100, b = 20

In [None]:
n = 100
b = 20

Main minhash routines

In [None]:
import random
import hashlib

def sigGen(matrix):
    seqSet = [i for i in range(matrix.shape[0])] # shingle (row) indices not yet drawn: 200
    result = [-1 for i in range(matrix.shape[1])] # one signature entry per document: 1000000; -1 means not yet set
    
    count = 0
    
    while len(seqSet) > 0:
        randomSeq = random.choice(seqSet)
        for i in range(matrix.shape[1]):
            if matrix[randomSeq][i] != 0 and result[i] == -1:
                result[i] = randomSeq
                count += 1
        if count == matrix.shape[1]:
            break
        seqSet.remove(randomSeq)
    
    return result

def sigMatrixGen(input_matrix, n):

    result = []

    for i in range(n):
        sig = sigGen(input_matrix)
        result.append(sig)


    return np.array(result)


def minHash(input_matrix, n, b):

    hashBuckets = {}
    r = int(n / b)
    sigMatrix = sigMatrixGen(input_matrix, n)
    begin, end = 0, r
    count = 0

    while end <= sigMatrix.shape[0]:

        count += 1

        for colNum in range(sigMatrix.shape[1]):
            hashObj = hashlib.md5()
            band = str(sigMatrix[begin: begin + r, colNum]) + str(count)
            hashObj.update(band.encode())
            tag = hashObj.hexdigest()
            if tag not in hashBuckets:
                hashBuckets[tag] = [colNum]
            elif colNum not in hashBuckets[tag]:
                hashBuckets[tag].append(colNum)
        begin += r
        end += r

    return hashBuckets

In [None]:
# Your code here, you can add cells if necessary
hashBuckets = minHash(input_matrix,n,b)
print("To show the correctness of the hashBuckets, here shows its first item:",list(hashBuckets.items())[0])

Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
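
One way to check a parameter choice against this tip is the standard banding analysis: with *r* rows per band, two documents with Jaccard similarity *s* share at least one band with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula; the function name `candidate_probability` is an assumption, and `r = n // b` is taken to match the rows-per-band computation in this notebook.

```python
def candidate_probability(s, n, b):
    """Probability that two documents with Jaccard similarity s agree on at least one band."""
    r = n // b  # rows per band
    return 1 - (1 - s ** r) ** b


n, b = 100, 20  # the values used in this notebook
for s in (0.3, 0.5, 0.8, 0.9):
    print(f"s={s:.1f}  P(candidate)={candidate_probability(s, n, b):.4f}")

# Approximate similarity at which the S-curve rises most steeply.
threshold = (1 / b) ** (1 / (n // b))
print(f"threshold ~ {threshold:.3f}")
```

With n = 100 and b = 20, pairs at s = 0.8 become candidates almost surely, while pairs at s = 0.3 rarely do; raising *r* (fewer bands) pushes the threshold higher at the cost of more false negatives.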

Search for the 30 documents most similar to document 0

Search function

In [None]:
def nn_search(queryCol):
    result = set()
    for key in hashBuckets:
        if queryCol in hashBuckets[key]:
            for i in hashBuckets[key]:
                result.add(i)

    result.remove(queryCol)
    return result

Custom function for computing the similarity

In [None]:
def Jaccard(a,b):
    a = list(a)
    b = list(b)
    n = 0
    d = 0
    for i in range(len(a)):
        if a[i] == b[i] == 1:
            n += 1
            d += 1
        elif not (a[i] == b[i] == 0):
            d += 1
    return n / d if d else 0.0 # two all-zero rows: treat the similarity as 0 instead of dividing by zero

Rank the candidates from the hash buckets with the custom Jaccard similarity function

In [None]:
search_result = nn_search(0)
search_result = list(search_result)
score = []
from sklearn import metrics
for i in range(len(search_result)):
    score.append(Jaccard(query_matrix[0],query_matrix[search_result[i]]))
score = np.array(score)
index = np.argsort(-score)
index = list(index)
result = []
for i in index[0:30]:
    result.append(search_result[i])
print(result)

Validate against the whole dataset with sklearn.metrics.jaccard_score()

In [None]:
score_v = []
from sklearn import metrics
for i in range(query_matrix.shape[0]-1):
    score_v.append(metrics.jaccard_score(query_matrix[0],query_matrix[i+1]))
score_v = np.array(score_v)
index_v = np.argsort(-score_v)
index_v = list(index_v)
result_v = []
for i in index_v[0:30]:
    result_v.append(i+1)
print(result_v)

Compute the hit rate for validation

In [None]:
count = 0
for i in result:
    if i in result_v:
        count += 1
hit_rate = count/30
print(hit_rate)