# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the full binary stream is infeasible, DGIM can estimate the number of 1-bits in the most recent window. In this exercise, you are given *stream_data.txt* (a binary stream) and need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.
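
To make the bucket mechanics concrete, here is a minimal sketch of DGIM, not the reference solution for this exercise. The class name `DGIMSketch`, its methods, and the use of absolute (unwrapped) timestamps are assumptions made for illustration, and the stream is assumed to arrive as integers 0 or 1.

```python
class DGIMSketch:
    """Illustrative DGIM counter (hypothetical names, not a library API)."""

    def __init__(self, window_size, max_same=2):
        self.window_size = window_size
        self.max_same = max_same   # allowed number of buckets per size
        self.time = 0              # absolute index of the most recent bit
        self.buckets = []          # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.time += 1
        # The oldest bucket expires once its end timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # While max_same + 1 buckets share a size, merge the two oldest of them
        # into one bucket of double size that keeps the newer end timestamp.
        i, k = 0, self.max_same
        while i + k < len(self.buckets) and self.buckets[i][1] == self.buckets[i + k][1]:
            stamp, size = self.buckets[i + k - 1]
            self.buckets[i + k - 1] = (stamp, 2 * size)
            del self.buckets[i + k]
            i += k - 1

    def estimate(self):
        # Count every bucket fully except the oldest, which counts for half.
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2


counter = DGIMSketch(window_size=1000)
for bit in [1, 1, 1, 1, 1, 1, 1]:
    counter.add(bit)
print(counter.buckets, counter.estimate())
```

After seven consecutive 1-bits the sketch holds buckets of sizes 1, 2 and 4 and reports 5.0 against a true count of 7, which illustrates the "half of the oldest bucket" rule and the resulting error of at most 50%.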

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
# Your code here, you can add cells if necessary
# Read in the data, strip the spaces, and convert it to a list of ints
import time
from math import *

Bucket = {}           # buckets keyed by size: {2^i: [timestamp1, timestamp2, ...], ...}
window_size = 1000
stoptimestamp = 1000  # number of bits to process; e.g. 2000 means only the first 2000 bits of the file are read
same_buckets_num = 2  # max number of buckets of the same size; when exceeded, the two oldest are merged


keylist = []          # store the power of 2, within window_size
for i in range(int(log(window_size,2))+1):
    key = int(pow(2,i))
    keylist.append(key)
    Bucket[key] = []     # create empty Bucket

    

def DGIM(data,Bucket,keylist,window_size,same_buckets_num,stoptimestamp):
    '''
    param: data: the list holding the 0/1 sequence to process
    param: Bucket: the container for buckets
    param: keylist: the list of bucket sizes (powers of 2)
    param: window_size: the window length, i.e. how many of the most recent bits are counted
    param: same_buckets_num: the max number of buckets of the same size; when exceeded, the two oldest are merged
    param: stoptimestamp: the number of bits to process
    '''
    start_time = time.time()
    cnt = 0
    timestamp = 0
    
    for i in range(stoptimestamp):
        # Advance the clock for every incoming bit; timestamps wrap modulo window_size
        # so they stay bounded when stoptimestamp exceeds window_size.
        timestamp = (timestamp + 1) % window_size
        for key in Bucket:
            # A bucket whose timestamp equals the wrapped current timestamp is exactly
            # window_size bits old, so it has left the window; drop it. Rebuilding the
            # list avoids removing items from a list while iterating over it.
            Bucket[key] = [stamp for stamp in Bucket[key] if stamp != timestamp]
                    
        if data[i] == '1':
            Bucket[1].append(timestamp)
            for key in keylist:                              # check how many buckets of each size exist
                if len(Bucket[key]) > same_buckets_num:   # too many buckets of this size: merge the two oldest
                    Bucket[key].pop(0)
                    tmpstamp = Bucket[key].pop(0)
                    if key != keylist[-1]:
                        Bucket[key*2].append(tmpstamp)
                    else:
                        Bucket[key].pop(0)
                else:
                    break
    firststamp = 0                                 # find the timestamp of the oldest bucket (the largest non-empty size)
    for key in keylist:
        if len(Bucket[key]) > 0:
            firststamp = Bucket[key][0]    
        for tmpstamp in Bucket[key]:
            print("size of bucket: {}, with the timestamp: {}".format(key,tmpstamp))
    for key in keylist:
        for tmpstamp in Bucket[key]:
            if tmpstamp != firststamp:            # not the firststamp, we add all keys
                cnt += key
            else:
                cnt += 0.5*key                    # the firststamp, we add half
            
    end_time = time.time()
    return cnt,end_time-start_time
    
with open('../input/coding2/stream_data.txt','r') as f:
    data = f.read().split('\t')
    res, cost_time = DGIM(data,Bucket,keylist,window_size,same_buckets_num,stoptimestamp)
    print("Estimated number of 1s in the last {} bits of the first {} bits: {} (elapsed time: {} s)".format(window_size,stoptimestamp,res,cost_time))


2. Write a function that exactly counts the number of 1-bits in the current window, and compare its running time and space usage with those of the DGIM algorithm.
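
For a true stream, where bits cannot be re-read from a list, one way to keep an exact count is to store the whole window. The sketch below is an illustration under assumptions, not the expected answer: `ExactWindowCounter` is a hypothetical name, and bits are assumed to arrive as integers 0 or 1.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1-bits in the last `window_size` bits (hypothetical helper)."""

    def __init__(self, window_size):
        # Every bit in the window is stored, so space grows linearly with window_size.
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        # When the window is full, the bit about to be evicted leaves the count.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones


exact = ExactWindowCounter(window_size=3)
for bit in [1, 0, 1, 1, 0, 1]:
    exact.add(bit)
print(exact.count())
```

This makes the space trade-off explicit: the exact counter holds all N bits of the window, while DGIM keeps only O(log N) buckets, each described by a timestamp and a size, at the cost of an approximate answer.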

In [None]:
# Your code here, you can add cells if necessary
window_size = 1000
stoptimestamp = 4000


def bruteforce(data,window_size,stoptimestamp):
    '''
    param: data: the list holding the 0/1 sequence to process
    param: window_size: the window length, i.e. how many of the most recent bits are counted
    param: stoptimestamp: the number of bits to process; the window ends here
    '''
    start_time = time.time()
    cnt = 0
    
    for i in range(max(0, stoptimestamp-window_size),stoptimestamp):   # clamp at 0 when fewer than window_size bits were read
        if data[i] == '1':
            cnt += 1
    
    end_time = time.time()
    return cnt, end_time-start_time 
        

with open('../input/coding2/stream_data.txt','r') as f:
    data = f.read().split('\t')
    res, cost_time = bruteforce(data,window_size,stoptimestamp)
    print("Exact number of 1s in the last {} bits of the first {} bits: {} (elapsed time: {} s)".format(window_size,stoptimestamp,res,cost_time))

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this exercise, you are given *docs_for_lsh.csv*, in which each document has been processed into a set of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and columns '0' through '199' each represent a unique shingle. If a document contains shingle **i**, its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.

### Your task

Use the minhash algorithm to create a signature for each document, and find the 'most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
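
When choosing *n* and *b*, it can help to look at the standard banding analysis: with *b* bands of *r* rows each (so n = b * r), two documents with Jaccard similarity *s* share at least one identical band with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve; the function names are hypothetical and the parameter pairs are only examples that satisfy the recommendations above.

```python
def candidate_probability(s, b, r):
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(b, r):
    """Similarity near which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


# Example parameter pairs (b, r); each gives a signature of length n = b * r.
for b, r in [(10, 5), (20, 5), (10, 10)]:
    print("b={}, r={}, n={}: threshold ~ {:.3f}, P(candidate | s=0.8) = {:.3f}".format(
        b, r, b * r, approximate_threshold(b, r), candidate_probability(0.8, b, r)))
```

Increasing *r* pushes the threshold toward higher similarity and reduces false positives, while increasing *b* lowers it and reduces false negatives.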

In [None]:
# Your code here, you can add cells if necessary
import numpy as np
import csv
import random

row_count = 0      # named row_count so the `time` module imported earlier is not shadowed
data = []
with open('../input/coding2/docs_for_lsh.csv') as f:
#with open('../input/newtest/test/test.csv') as f:
    # 1000000 effective data rows; shingle columns are labelled 0-199 (200 in total)
    csvmap = csv.reader(f)
    for row in csvmap:
        row_count += 1
        if row_count == 1:          # skip the header row (column names)
            pass
        else:
            data.append(row[1:])    # skip the first column (doc id)

data = np.array(data)
data = data.T                       # transpose so rows are the 200 shingles and columns are the 1000000 documents
print(data.shape)
print(data)
print('Data preprocessing over!')

Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.
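
As a cross-check for the LSH output, an exact ranking can also be computed directly with NumPy. This is a minimal sketch under assumptions: `jaccard_top_k` is a hypothetical helper, the matrix is assumed to hold 0/1 integers with one column per document (string data would first need `.astype(int)`), and column position is assumed to match the document id, which the source does not confirm.

```python
import numpy as np


def jaccard_top_k(matrix, query_col, k):
    """Rank all other columns by exact Jaccard similarity to `query_col`.

    matrix: 0/1 integer array of shape (num_shingles, num_docs).
    Returns a list of (column_index, similarity) pairs, best first.
    """
    docs = matrix.astype(bool)
    query = docs[:, query_col]
    intersection = (docs & query[:, None]).sum(axis=0)
    union = (docs | query[:, None]).sum(axis=0)
    # Treat a pair of empty documents as similarity 0 to avoid dividing by zero.
    sims = np.divide(intersection, union,
                     out=np.zeros(docs.shape[1]), where=union > 0)
    sims[query_col] = -1.0                       # exclude the query itself
    order = np.argsort(-sims, kind="stable")[:k]
    return [(int(j), float(sims[j])) for j in order]


# Tiny example: 4 shingles (rows) and 4 documents (columns).
toy = np.array([[1, 1, 1, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 1]])
print(jaccard_top_k(toy, query_col=0, k=2))
```

The boolean intermediates are as large as the input matrix, so on a large corpus it may be preferable to score only the candidates that LSH returns.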

Tip: Adjust your parameters so that documents with similarity *s > 0.8* hash into the same bucket.
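
The bucketing step mentioned in the tip can be sketched as follows. This is an illustration under assumptions rather than the required design: `lsh_candidates` is a hypothetical name, the signature matrix is assumed to have shape (b * r, num_docs), and the tuple of *r* minhash values in a band is used directly as the bucket key instead of a separate hash function.

```python
from collections import defaultdict

import numpy as np


def lsh_candidates(signature, b, r, query_col):
    """Return the columns that share at least one identical band with `query_col`."""
    candidates = set()
    for band in range(b):
        rows = signature[band * r:(band + 1) * r]
        buckets = defaultdict(list)
        for col in range(rows.shape[1]):
            # Documents whose r values in this band all match land in the same bucket.
            buckets[tuple(rows[:, col].tolist())].append(col)
        candidates.update(buckets[tuple(rows[:, query_col].tolist())])
    candidates.discard(query_col)
    return candidates


# Toy signature matrix with b = 2 bands of r = 2 rows and 4 documents (columns).
toy_signature = np.array([[1, 1, 7, 5],
                          [2, 2, 7, 5],
                          [3, 9, 3, 6],
                          [4, 9, 4, 6]])
print(lsh_candidates(toy_signature, b=2, r=2, query_col=0))
```

The returned candidates are only likely matches; ranking them by exact Jaccard similarity afterwards filters out false positives.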

In [None]:
# Your code here, you can add cells if necessary
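def Hash(data, b, r):
    '''
    param: data: ndarray of shape (num_shingles, num_docs) holding '0'/'1' strings
    param: b: the number of bands in the signature matrix
    param: r: the number of rows in a single band; b*r is the length of the signature
    return: ndarray of shape (b*r, num_docs), one row per random permutation
    '''
    n = b*r
    signature = []
    
    for i in range(n):
        # trans[row] is the 1-based position of shingle `row` under a random permutation
        trans = list(range(1, data.shape[0]+1))
        random.shuffle(trans)
        # order[k] is the shingle row placed at permuted position k+1; computing it once
        # replaces the repeated linear search trans.index(k+1)
        order = np.argsort(trans)
        
        signal_signature = []
        for j in range(data.shape[1]):
            for k in range(data.shape[0]):
                if data[order[k]][j] == '1':
                    signal_signature.append(k+1)   # first permuted position holding a 1
                    break
            else:
                # document j contains no shingle: store a sentinel larger than any
                # position so that every signature row has the same length
                signal_signature.append(data.shape[0]+1)
                
        signature.append(signal_signature)
    return np.array(signature)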

b = 10
r = 5
res_signature = Hash(data,b,r)
print(res_signature)
