# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the whole binary stream is infeasible, DGIM can estimate the number of 1-bits in the current window. In this coding assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the problems below.

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
# Your code here, you can add cells if necessary
import math
import time
start_t = time.process_time()

filename = "../input/coding2/stream_data.txt"

container = {}
windowsize = 1000
timestamp = 0
updateinterval = 1000# no larger than the windowsize
updateindex = 0

keysnum = windowsize.bit_length()  # number of bucket sizes 1, 2, 4, ...; avoids float rounding in math.log
keylist = list()
# initialize the container
for i in range(keysnum):
    key = int(math.pow(2, i))
    keylist.append(key)
    container[key] = list()

def UpdateContainer(inputdict, klist, numkeys):
    for key in klist:
        if len(inputdict[key]) > 2:
            inputdict[key].pop(0)
            tstamp = inputdict[key].pop(0)
            if key != klist[-1]:
                inputdict[key * 2].append(tstamp)
        else:
            break

def OutputResult(inputdict, klist, wsize):
    cnt = 0
    firststamp = 0
    for key in klist:
        if len(inputdict[key]) > 0:
            firststamp = inputdict[key][0]
        #for tstamp in inputdict[key]:
            #print ("size of bucket: %d, timestamp: %d" % (key, tstamp))
    for key in klist:
        for tstamp in inputdict[key]:
            if tstamp != firststamp:
                cnt += key
            else:
                cnt += 0.5 * key
    print ("Estimated number of ones in the last %d bits: %d" % (wsize, cnt))

with open(filename, 'r') as sfile:
    while True:
        char = sfile.read(1)
        if not char:# no more input
            break
        timestamp = (timestamp + 1) % windowsize
        for k in container.keys():
            # remove the record that has just fallen out of the window
            # (timestamps are unique, so at most one entry matches)
            if timestamp in container[k]:
                container[k].remove(timestamp)
        if char == "1":# add it to the container
            container[1].append(timestamp)
            UpdateContainer(container, keylist, keysnum)
        updateindex = (updateindex + 1) % updateinterval
        if updateindex == 0:
            OutputResult(container, keylist, windowsize)
            
end_t = time.process_time()
# measure the running time
print ("run time: %s Seconds" % (end_t-start_t))

2. Write a function that counts the number of 1-bits in the current window exactly, and compare its running time and space usage with those of the DGIM algorithm.

In [None]:
# Your code here, you can add cells if necessary
import math
import time
start_t = time.process_time()
filename = "../input/coding2/stream_data.txt"

container = []
windowsize = 1000
updateinterval = 1000# no larger than the windowsize
updateindex = 0

def OutputResult(inputlist, wsize):
    cnt = 0
    for char in inputlist:
        if char=="1":
            cnt += 1
    print ("Accurate number of ones in the last %d bits: %d" % (wsize, cnt))

with open(filename, 'r') as sfile:
    while True:
        char = sfile.read(1)
        if not char:# no more input
            break
        container.append(char)
        if len(container) > windowsize:
            container.pop(0)
        updateindex = (updateindex + 1) % updateinterval
        if updateindex == 0:
            OutputResult(container, windowsize)
            
end_t = time.process_time()
# measure the running time
print ("run time: %s Seconds" % (end_t-start_t))
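
For the space comparison in task 2, it may help to run both counters on a small synthetic stream and look at how much state each one keeps. The sketch below is not the notebook's implementation: `dgim_count` is a hypothetical helper that uses absolute timestamps, and the random stream is an assumption standing in for *stream_data.txt*. It relies on two standard DGIM properties: at most two buckets exist per size, and the estimate (sum of bucket sizes minus half of the oldest bucket) is off by at most 50% of the true count.

```python
import random


def dgim_count(bits, window):
    """Estimate the 1-bits among the last `window` bits.

    Returns (estimate, number_of_buckets). Hypothetical helper for illustration.
    """
    buckets = []  # [timestamp of newest 1 in the bucket, size], newest bucket first
    for t, bit in enumerate(bits):
        # The oldest bucket expires once its timestamp falls out of the window.
        if buckets and buckets[-1][0] <= t - window:
            buckets.pop()
        if bit != 1:
            continue
        buckets.insert(0, [t, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(buckets) and buckets[i][1] == buckets[i + 2][1]:
            buckets[i + 1][1] *= 2  # merged bucket keeps the newer timestamp
            del buckets[i + 2]
            i += 1
    if not buckets:
        return 0.0, 0
    total = sum(size for _, size in buckets)
    return total - buckets[-1][1] / 2, len(buckets)


random.seed(0)  # synthetic stream, assumed for illustration only
stream = [random.randint(0, 1) for _ in range(5000)]
window = 1000

estimate, n_buckets = dgim_count(stream, window)
exact = sum(stream[-window:])  # exact counting must keep all `window` bits

print("DGIM estimate:", estimate, "using", n_buckets, "buckets")
print("Exact count:  ", exact, "using", window, "stored bits")
```

With a window of 1000 bits, DGIM needs at most 2 * 10 = 20 buckets, each holding one timestamp, whereas the exact counter has to store all 1000 bits.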

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, then the corresponding row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the problems below.

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
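
To get a feel for how *n* and *b* interact, one can use the standard banding result: with *b* bands of *r = n // b* rows each, two documents with Jaccard similarity *s* become a candidate pair with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula; `candidate_probability` is a hypothetical helper name, and the values n = 100, b = 20 are simply the ones used in the code cell that follows.

```python
def candidate_probability(s, r, b):
    """Probability that two documents with Jaccard similarity s
    agree on all r rows of at least one of the b bands."""
    return 1 - (1 - s ** r) ** b


n, b = 100, 20
r = n // b  # rows per band

for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f}: P(candidate) = {candidate_probability(s, r, b):.4f}")

# Rough location of the steep part of the S-curve
threshold = (1 / b) ** (1 / r)
print(f"approximate threshold: {threshold:.3f}")
```

Increasing *r* (fewer bands) moves the threshold toward higher similarities, while increasing *b* moves it lower.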

In [None]:
# Your code here, you can add cells if necessary
import itertools
import numpy as np

filename = "../input/coding2/docs_for_lsh.csv"

n = 100      # signature length (number of minhash functions)
b = 20       # number of bands
r = n // b   # rows of the signature matrix per band

# Load the data: one row per document; the first column is doc_id and the
# remaining columns are 0/1 shingle indicators. Assumes numeric doc_id values.
data = np.loadtxt(filename, delimiter=",", skiprows=1)
shingles = data[:, 1:] != 0
row, colu = shingles.shape   # row = number of documents, colu = number of shingles

# Compute the signature matrix: one column per document
matrix = np.empty((n, row), dtype=np.int64)
for k in range(n):
    # A random permutation of the shingle indices acts as one minhash function;
    # the signature value is the smallest permuted index among the document's shingles.
    perm = np.random.permutation(colu)
    matrix[k] = np.where(shingles, perm, colu).min(axis=1)   # colu marks an empty document

# Banding
# relation[i, j] counts band collisions; i and j are row positions in the file
relation = np.zeros((row, row))
for k in range(b):
    bucket = {}
    for i in range(row):
        key = tuple(matrix[k * r:(k + 1) * r, i])
        bucket.setdefault(key, []).append(i)
    # Link every pair of documents whose band hashes collide
    for docs in bucket.values():
        for point in itertools.permutations(docs, 2):
            relation[point] += 1
print("Similarity between documents (number of band collisions per pair; larger means more similar)\n", relation)
upper = np.triu(relation, 1)
sim = np.where(upper == np.max(upper))
print("Most similar document pairs:\n", list(zip(*sim)))

Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (except document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
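
As a lightweight cross-check on the LSH candidates, the exact Jaccard similarity of two 0/1 shingle rows can also be computed directly. The sketch below is a plain-Python illustration: `jaccard` is a hypothetical helper, the two rows are made-up toy data rather than rows from *docs_for_lsh.csv*, and returning 0.0 for two empty documents is an assumed convention.

```python
def jaccard(row_a, row_b):
    """Exact Jaccard similarity of two equally long 0/1 indicator rows."""
    both = sum(1 for x, y in zip(row_a, row_b) if x and y)
    either = sum(1 for x, y in zip(row_a, row_b) if x or y)
    # Assumed convention: two documents without any shingles get similarity 0.0
    return both / either if either else 0.0


# Toy rows for illustration only
doc_a = [1, 1, 0, 1, 0]
doc_b = [1, 0, 0, 1, 1]
print(jaccard(doc_a, doc_b))
```

Documents that collide in many bands should score high under this exact measure; a pair with many collisions but a low exact score suggests the parameters are too permissive.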

In [None]:
# Your code here, you can add cells if necessary
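order = relation[0, :].argsort()[::-1]  # most collisions first
index = order[order != 0][:30]          # skip document 0 itself
print("Collision counts with doc 0:", relation[0, :])
print("The 30 documents closest to doc 0:", index)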