# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When it is infeasible to store the whole binary stream, DGIM can estimate the number of 1-bits in the current window. In this coding assignment, you're given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.
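
As a point of reference for the description above, here is a minimal sketch of the textbook form of DGIM, which keeps at most two buckets of each size. The class name `DGIM`, its methods, and the toy stream are illustrative assumptions and are not part of the assignment files; the solution below uses a configurable `maxBucket` instead of the fixed limit of two.

```python
from collections import deque


class DGIM:
    """Textbook DGIM sketch: at most two buckets of each size (hypothetical helper)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        # Newest bucket on the left; each bucket is [end_timestamp, size].
        self.buckets = deque()

    def update(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end timestamp has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft([self.time, 1])
        # While three buckets share a size, merge the two oldest of the three.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2   # merged bucket keeps the newer timestamp
            del self.buckets[i + 2]
            i += 1                        # the merge may create a triple one size up

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # The oldest bucket may straddle the window edge, so count only half of it.
        return total - self.buckets[-1][1] // 2


# Toy data for illustration only; this is not stream_data.txt.
stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 30
counter = DGIM(window_size=100)
for bit in stream:
    counter.update(bit)
exact = sum(stream[-100:])
print("estimate:", counter.estimate(), "exact:", exact)
```

With two buckets per size, the estimate should stay within 50% of the true count; allowing more buckets per size, as `maxBucket` does in the solution, trades extra memory for a tighter estimate.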

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
__author__ = 'HaoningWu'
__StudentID__ = '518030910285'

"""
Since running all of the code may take a long time,
the outputs are also stored as comments at the end of each code block.
If you have any questions, please contact me.
Thank you!
"""

# DGIM estimates the number of 1's among the last k bits of the window
# For this problem, k is set equal to the window size
import time
import sys

bucket_list = [] # The list for storing buckets; its elements are dictionary objects

maxBucket = 10   # the maximum number of similar buckets allowed
windowSize = 1000    # size of the window
currentTime = 2000   # current time (start with 0)
k = windowSize    # the last k bits, in this problem, we just need to set it equal to windowSize

fileName = '../input/coding2/stream_data.txt'
with open(fileName, 'r') as f:    # Close the file automatically after reading
    dataset = f.read().split('\t')    # Load data and preprocess

def DGIM(data = dataset, window_size = windowSize, current_time = currentTime):
    start_time = time.time()
    cnt = 0
    bucket_list.clear()    # Start from an empty bucket list so repeated calls do not accumulate state
    
    for i in range(current_time):
        value = data[i]
        if value:
            # Check whether the oldest bucket has expired
            # If the timestamp of the leftmost bucket equals the current time minus window_size, it has left the window, so delete it
            if (len(bucket_list) > 0) and (i+1-window_size == bucket_list[0]['timestamp']):
                del bucket_list[0]
            # Create a new bucket if the new input is 1; otherwise skip it
            if int(value) == 1:
                bucket = {'timestamp': i+1, 'cnt': 1}    # Bucket structure
                bucket_list.append(bucket)
                # If more than maxBucket buckets share the same size, merge the two oldest of them
                # (j is used here so the outer loop variable i is not shadowed)
                for j in range(len(bucket_list)-1, maxBucket-1, -1):
                    if bucket_list[j]['cnt'] == bucket_list[j-maxBucket]['cnt']: # Same size
                        bucket_list[j-maxBucket]['cnt'] += bucket_list[j-maxBucket+1]['cnt'] # Merge
                        bucket_list[j-maxBucket]['timestamp'] = bucket_list[j-maxBucket+1]['timestamp']
                        del bucket_list[j-maxBucket+1] # Delete the newer of the two merged buckets
    
    for bucket in bucket_list:
        cnt += bucket['cnt']    # Sum the sizes of all buckets in the window
    if bucket_list:
        # The oldest bucket may only partly overlap the window, so count just half of it
        cnt -= bucket_list[0]['cnt'] // 2
    total_space = sys.getsizeof(bucket_list)    # Size of the list object only; the bucket dicts are not included
    total_time = time.time() - start_time
    
    return cnt, total_time, total_space

dgim_cnt, dgim_cnt_time, dgim_cnt_space = DGIM(dataset, windowSize, currentTime)
print("DGIM Estimated Count: %d, Running Time of DGIM: %f, Space of DGIM: %f, maxBucket: %d, windowSize: %d, currentTimestamp: %d"%(dgim_cnt, dgim_cnt_time, dgim_cnt_space, maxBucket, windowSize, currentTime))

# DGIM Estimated Count: 41, Running Time of DGIM: 0.000305, Space of DGIM: 272.000000, maxBucket: 10, windowSize: 1000, currentTimestamp: 100
# DGIM Estimated Count: 375, Running Time of DGIM: 0.005183, Space of DGIM: 536.000000, maxBucket: 10, windowSize: 1000, currentTimestamp: 1000
# DGIM Estimated Count: 390, Running Time of DGIM: 0.011971, Space of DGIM: 536.000000, maxBucket: 10, windowSize: 1000, currentTimestamp: 2000
# DGIM Estimated Count: 39, Running Time of DGIM: 0.000269, Space of DGIM: 216.000000, maxBucket: 4, windowSize: 1000, currentTimestamp: 100
# DGIM Estimated Count: 359, Running Time of DGIM: 0.003294, Space of DGIM: 288.000000, maxBucket: 4, windowSize: 1000, currentTimestamp: 1000
# DGIM Estimated Count: 374, Running Time of DGIM: 0.008183, Space of DGIM: 368.000000, maxBucket: 4, windowSize: 1000, currentTimestamp: 2000

2. Write a function that exactly counts the number of 1-bits in the current window, and compare its running time and space usage with those of the DGIM algorithm.
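
To make the space comparison concrete, the sketch below shows one way an exact counter could work in a true streaming setting: it has to remember every bit in the window so that it knows which value leaves when a new one arrives. The class name `ExactWindowCounter` and the toy bits are illustrative assumptions; the solution below instead slices the already loaded list.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1-bits in the last `window_size` bits (hypothetical helper)."""

    def __init__(self, window_size):
        # deque(maxlen=...) evicts the oldest bit automatically when full.
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def update(self, bit):
        # If the window is full, the bit about to be evicted leaves the count.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones


exact_counter = ExactWindowCounter(window_size=3)
counts = []
for bit in [1, 0, 1, 1, 0, 1]:    # toy bits, not stream_data.txt
    exact_counter.update(bit)
    counts.append(exact_counter.count())
print(counts)
```

Each update is constant time, but memory grows linearly with the window size, whereas DGIM only keeps a number of buckets that grows logarithmically with it.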

In [None]:
# Accurately Count the number of 1's in the window
def count_acc_num(data = dataset, window_size = windowSize, current_time = currentTime):
    start_time = time.time()
    cnt = 0
    # If the current time is smaller than the windowSize, simply count at the beginning until the current moment
    # If the current time is larger than the windowSize, the starting position is the current position minus the windowSize
    start = max(current_time - window_size, 0)
    for i in range(min(current_time, window_size)):
        value = data[i + start]
        if value and int(value) == 1:
            cnt += 1
    total_time = time.time() - start_time
    total_space = sys.getsizeof(data[start: start + min(current_time, window_size)])
    return cnt, total_time, total_space

acc_cnt, acc_cnt_time, acc_cnt_space = count_acc_num(dataset, windowSize, currentTime)
print("Accurate Count: %d, Running Time of accurate count: %f, Space of accurate count: %f, maxBucket: %d, windowSize: %d, currentTimestamp: %d"%(acc_cnt, acc_cnt_time, acc_cnt_space, maxBucket, windowSize, currentTime))
error = abs(acc_cnt - dgim_cnt)
error_rate = error / acc_cnt
time_difference = abs(dgim_cnt_time - acc_cnt_time)
space_difference = abs(dgim_cnt_space - acc_cnt_space)
print("Error: %d, Error rate: %f, time_difference: %f, space_difference: %f"%(error, error_rate, time_difference, space_difference))


# Accurate Count: 43, Running Time of accurate count: 0.000038, Space of accurate count: 872.000000, maxBucket: 10, windowSize: 1000, currentTimestamp: 100
# Error: 2, Error rate: 0.046512, time_difference: 0.000267, space_difference: 600.000000
# Accurate Count: 391, Running Time of accurate count: 0.000357, Space of accurate count: 8072.000000, maxBucket: 10, windowSize: 1000, currentTimestamp: 1000
# Error: 16, Error rate: 0.040921, time_difference: 0.004826, space_difference: 7536.000000
# Accurate Count: 399, Running Time of accurate count: 0.000455, Space of accurate count: 8072.000000, maxBucket: 10, windowSize: 1000, currentTimestamp: 2000
# Error: 9, Error rate: 0.022556, time_difference: 0.011516, space_difference: 7536.000000
# Accurate Count: 43, Running Time of accurate count: 0.000038, Space of accurate count: 872.000000, maxBucket: 4, windowSize: 1000, currentTimestamp: 100
# Error: 4, Error rate: 0.093023, time_difference: 0.000231, space_difference: 656.000000
# Accurate Count: 391, Running Time of accurate count: 0.000360, Space of accurate count: 8072.000000, maxBucket: 4, windowSize: 1000, currentTimestamp: 1000
# Error: 32, Error rate: 0.081841, time_difference: 0.002934, space_difference: 7784.000000
# Accurate Count: 399, Running Time of accurate count: 0.000377, Space of accurate count: 8072.000000, maxBucket: 4, windowSize: 1000, currentTimestamp: 2000
# Error: 25, Error rate: 0.062657, time_difference: 0.007805, space_difference: 7704.000000


### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you're given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, then its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.
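
The sketch below illustrates how a document could be turned into a set of k-shingles, how Jaccard similarity is computed on such sets, and how a set maps to a 0/1 row like those in the CSV. It assumes character-level shingles, which the assignment text does not specify; the helper names and the two toy strings are also illustrative and not taken from the dataset.

```python
def shingles(text, k):
    """Return the set of all length-k character substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def to_row(shingle_set, vocabulary):
    """Encode a shingle set as a 0/1 list with one entry per vocabulary shingle."""
    return [1 if s in shingle_set else 0 for s in vocabulary]


# Toy documents for illustration only.
doc_a = "abcdefghij"
doc_b = "abcdefghxy"
set_a = shingles(doc_a, 8)
set_b = shingles(doc_b, 8)

vocabulary = sorted(set_a | set_b)    # one column per distinct shingle
row_a = to_row(set_a, vocabulary)
row_b = to_row(set_b, vocabulary)

print("Jaccard similarity:", jaccard(set_a, set_b))
print(row_a)
print(row_b)
```

The two toy documents share one of five distinct shingles, so their similarity is 1/5.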

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity. 
Parameters you need to determine:
1) Length of signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
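
The role of the signature length *n* can be seen in a small experiment: the fraction of positions where two minhash signatures agree is an estimate of the Jaccard similarity, and a larger *n* makes that estimate less noisy. The following is a minimal sketch using explicit random permutations; the function names, the seed, and the two toy shingle sets are assumptions for illustration and are not drawn from *docs_for_lsh.csv*.

```python
import random


def minhash_signature(shingle_ids, permutations):
    """One value per permutation: the smallest permuted position of any shingle.

    Assumes `shingle_ids` is a non-empty set of column indices.
    """
    return [min(perm[s] for s in shingle_ids) for perm in permutations]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


rng = random.Random(0)           # fixed seed so the sketch is repeatable
num_shingles, n = 200, 100       # 200 shingle columns as in the CSV; n is a free choice
permutations = []
for _ in range(n):
    perm = list(range(num_shingles))
    rng.shuffle(perm)            # perm[s] is the position of shingle s after shuffling
    permutations.append(perm)

doc_a = set(range(0, 10))        # toy documents: true Jaccard similarity is 9 / 11
doc_b = set(range(1, 11))
sig_a = minhash_signature(doc_a, permutations)
sig_b = minhash_signature(doc_b, permutations)
estimate = estimate_similarity(sig_a, sig_b)
print("estimated:", estimate, "true:", 9 / 11)
```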

In [None]:
# Dataset Preprocessing
import csv
import numpy as np
from random import shuffle

fileName = '../input/coding2/docs_for_lsh.csv'
dataset = open(fileName, 'r')
reader = csv.reader(dataset)
rows = [row for row in reader]
data = []

# Only rows 1~100000 of the data turned out to be useful, so as a shortcut only these 100000 rows are used
# This speeds things up; the full dataset can also be used.
# Likewise, only columns 1~20 contain non-zero elements, so only these 20 columns are used
# The full set of columns can also be used; in my opinion the result is the same.

#for item in rows[1:]:
for item in rows[1:100001]:
    data.append(item[1:21]) 
    #data.append(item[1:])    # Split with the first column and row (id and description) 

data = np.array(data).T
# print(data.shape)
# data.shape = (200, 1000000) with the full dataset; (20, 100000) with the subset used here

In [None]:
# GenerateSignature using minHash
b, r = 15, 5
# n = b * r: Length of signature (number of distinct minhash functions)
# b: Number of bands that divide the signature matrix

def generateSignature(input_matrix):
    # Generate signature row value for the signature matrix
    # row number of the original matrix
    rowSeries = [i for i in range(input_matrix.shape[0])]    # 200 / 20
    result = [-1 for i in range(input_matrix.shape[1])]    # 1000000 / 100000
    columnCount = 0
    # Reorder the row numbers
    shuffle(rowSeries)
    for i in range(len(rowSeries)):    # 200 / 20
        rowIndex = rowSeries.index(i)
        for j in range(input_matrix.shape[1]):    # 1000000 / 100000
            if result[j] == -1 and int(input_matrix[rowIndex][j]) != 0:    # Pay attention to converting it into 'int'
                result[j] = rowIndex
                columnCount += 1
        if columnCount == input_matrix.shape[1]:
            break
    return result

def minHash(input_matrix, b, r):
    # minHash function, generate signature matrix
    sigMatrix = []
    n = b * r
    for i in range(n):
        sig = generateSignature(input_matrix)
        sigMatrix.append(sig)
    return np.array(sigMatrix)

signatureMatrix = minHash(data, b, r) # shape: (n, number of documents) with n = b * r, i.e. (75, 100000) for the settings above

Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (except document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tip: You can adjust your parameters so that documents with similarity *s > 0.8* are hashed into the same bucket.
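
One way to reason about this tip is the standard banding analysis: with *b* bands of *r* rows each, two documents with Jaccard similarity *s* share at least one band with probability 1 - (1 - s^r)^b, and the curve rises most steeply near (1/b)^(1/r). The sketch below evaluates both for the (b, r) pairs tried in this notebook; the function names are illustrative assumptions.

```python
def candidate_probability(s, b, r):
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1 - (1 - s ** r) ** b


def approximate_threshold(b, r):
    """Similarity near which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


# (b, r) pairs taken from the runs recorded in this notebook.
settings = [(15, 5), (20, 5), (20, 10), (20, 15), (30, 10)]
for b, r in settings:
    print(
        "b=%d r=%d threshold=%.3f P(s=0.8)=%.4f P(s=0.5)=%.4f"
        % (b, r, approximate_threshold(b, r),
           candidate_probability(0.8, b, r),
           candidate_probability(0.5, b, r))
    )
```

A low threshold makes it very likely that pairs with *s > 0.8* collide, at the cost of more low-similarity candidates that then have to be filtered with an exact Jaccard check.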

In [None]:
# Locality Sensitive Hashing
# Take the 30 most similar document ids, i.e. those that share the most buckets with the query.
import hashlib
import sklearn
from sklearn.metrics import jaccard_score

def LSH(sigMatrix, b, r):
    # Locality Sensitive Hashing: hash each band of every column into a bucket
    hashBuckets = {}
    # start and end of band row
    start, end = 0, r
    count = 0     # index of the current band

    while end <= sigMatrix.shape[0]:    # number of signature rows, n = b * r
        count += 1
        
        # traverse the column of signature matrix
        for colNum in range(sigMatrix.shape[1]):
            hashObject = hashlib.sha256()
            band = str(sigMatrix[start: start + r, colNum]) + str(count)
            #band = "".join(str(sigMatrix[start:start+r, colNum])) + str(count)
            hashObject.update(band.encode())    # hash function
            tag = hashObject.hexdigest()
            
            # Update buckets
            if tag not in hashBuckets:
                hashBuckets[tag] = [colNum]
            elif colNum not in hashBuckets[tag]:
                hashBuckets[tag].append(colNum)
        start += r
        end += r
            
    return hashBuckets

LSH_result = LSH(signatureMatrix, b, r)

# Search for documents which are similar to the query document
def nn_search(dataSet, queryColumn = 0):
    result = {}

    for key in dataSet:
        if queryColumn in dataSet[key]:
            for i in dataSet[key]:
                if i in result.keys():
                    result[i] += 1
                else:
                    result[i] = 1

    return result

res = nn_search(LSH_result, 0)
res = sorted(res.items(), key = lambda item:item[1], reverse = True)

print("LSH output length:", len(res))
output = []

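# Skip the query document explicitly instead of assuming it is always ranked first
for i in [item for item in res if item[0] != 0][:30]:
    output.append(i[0])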
    
print("The 30 outputs of LSH: ", sorted(output))

check_list = {}
check_data = data.T

for i in res:
    index = i[0]
    Jaccard_score = sklearn.metrics.jaccard_score(check_data[0], check_data[index], pos_label='1')
    if Jaccard_score > 0.8:
        check_list[index] = Jaccard_score
print("LSH check output length: ", len(check_list) - 1)
print("Calculate Jaccard_score with LSH output information to check: ", check_list)
print("Check output: ", sorted(check_list.keys())[1:])
print("Ground Truth: [1331, 2575, 20854, 23585, 26980, 28910, 32681, 39310, 39784, 40298, 46220, 48131, 52076, 58694, 58852, 62080, 67032, 68730, 69724, 72156, 73681, 81289, 81379, 81480, 84306, 84520, 89825, 89833, 91300, 99370]")

# output of b = 15, r = 5, n = 75: accuracy = 21 / 30 = 70% 
# [1331, 2110, 2575, 20854, 23585, 28910, 32681, 39784, 42302, 43869, 46220, 48131, 52076, 58694, 58852, 61687, 67032, 68730, 69724, 72078, 72156, 73681, 78531, 79551, 81289, 84520, 89825, 89833, 91300, 96208]
# output of b = 20, r = 5, n = 100: accuracy = 14 / 30 = 47% 
# [1331, 2575, 14134, 14518, 15637, 20854, 20988, 23585, 28910, 32681, 34797, 40298, 44663, 48131, 58711, 61687, 62080, 66125, 67032, 69028, 70964, 73681, 73687, 78531, 81480, 83204, 89580, 89825, 89833, 99370]
# output of b = 20, r = 10, n = 200: accuracy = 14 / 30 = 47% 
# [1331, 2575, 14134, 14518, 15637, 20854, 20988, 23585, 28910, 32681, 34797, 40298, 44663, 48131, 58711, 61687, 62080, 66125, 67032, 69028, 70964, 73681, 73687, 78531, 81480, 83204, 89580, 89825, 89833, 99370]
# output of b = 20, r = 15, n = 300: accuracy = 12 / 30 = 40% 
# [474, 2110, 2397, 6385, 14134, 14518, 28910, 32681, 36005, 37464, 39784, 41720, 44663, 46220, 46923, 50250, 52172, 55780, 58124, 58852, 59443, 62080, 64156, 69724, 72156, 81289, 82105, 89825, 89833, 91300]
# output of b = 30, r = 10, n = 300: accuracy = 22 / 30 = 73% 
# [474, 1331, 2575, 6429, 20854, 23585, 28910, 32681, 39310, 39784, 41720, 42302, 48131, 52076, 58694, 58852, 62080, 64156, 67032, 69028, 69724, 72078, 81289, 81480, 84520, 87793, 89825, 89833, 91300, 99370]
# Ground Truth: 
# [1331, 2575, 20854, 23585, 26980, 28910, 32681, 39310, 39784, 40298, 46220, 48131, 52076, 58694, 58852, 62080, 67032, 68730, 69724, 72156, 73681, 81289, 81379, 81480, 84306, 84520, 89825, 89833, 91300, 99370]


# LSH is probabilistic, so 100% accuracy is hard to reach, but it narrows the search for the most similar documents down to a small candidate set

# Checking only the LSH candidates with the exact Jaccard score runs very fast because LSH has already reduced the search space; the check results are as follows
# Length: 30
# {0: 1.0, 89833: 0.9090909090909091, 32681: 0.9090909090909091, 91300: 0.9090909090909091, 62080: 0.8333333333333334, 89825: 0.8333333333333334, 20854: 0.8333333333333334, 58852: 0.8181818181818182, 69724: 0.8181818181818182, 84520: 0.8181818181818182, 99370: 0.8181818181818182, 2575: 0.8181818181818182, 48131: 0.8181818181818182, 67032: 0.8181818181818182, 28910: 0.8333333333333334, 23585: 0.8181818181818182, 1331: 0.8181818181818182, 39310: 0.8333333333333334, 81289: 0.8181818181818182, 39784: 0.8181818181818182, 81480: 0.8333333333333334, 52076: 0.8181818181818182, 58694: 0.8181818181818182, 84306: 0.8333333333333334, 40298: 0.8333333333333334, 72156: 0.8181818181818182, 73681: 0.8181818181818182, 68730: 0.8181818181818182, 81379: 0.8333333333333334, 46220: 0.8181818181818182, 26980: 0.8181818181818182}
# We still take b = 15, r = 5, n = 75
# Check outputs:
# [1331, 2575, 20854, 23585, 26980, 28910, 32681, 39310, 39784, 40298, 46220, 48131, 52076, 58694, 58852, 62080, 67032, 68730, 69724, 72156, 73681, 81289, 81379, 81480, 84306, 84520, 89825, 89833, 91300, 99370]
# Ground Truth: 
# [1331, 2575, 20854, 23585, 26980, 28910, 32681, 39310, 39784, 40298, 46220, 48131, 52076, 58694, 58852, 62080, 67032, 68730, 69724, 72156, 73681, 81289, 81379, 81480, 84306, 84520, 89825, 89833, 91300, 99370]
# Accuracy is 100% now

In [None]:
# Check for answers
import sklearn
from sklearn.metrics import jaccard_score
test_data = data.T
# test_data.shape = (1000000, 200)

candidates = {}
# for i in range(1, 1000000):
for i in range(1, 100000):
    Jaccard_score = sklearn.metrics.jaccard_score(test_data[0], test_data[i], pos_label='1')    # Pay attention to the parameter 'pos_label'
    if Jaccard_score > 0.8:
        candidates[i] = Jaccard_score
print("Ground Truth and corresponding Jaccard_score: ", candidates)
print("Ground Truth: ", sorted(candidates.keys()))
# {1331: 0.8181818181818182, 2575: 0.8181818181818182, 20854: 0.8333333333333334, 23585: 0.8181818181818182, 26980: 0.8181818181818182, 28910: 0.8333333333333334, 32681: 0.9090909090909091, 39310: 0.8333333333333334, 39784: 0.8181818181818182, 40298: 0.8333333333333334, 46220: 0.8181818181818182, 48131: 0.8181818181818182, 52076: 0.8181818181818182, 58694: 0.8181818181818182, 58852: 0.8181818181818182, 62080: 0.8333333333333334, 67032: 0.8181818181818182, 68730: 0.8181818181818182, 69724: 0.8181818181818182, 72156: 0.8181818181818182, 73681: 0.8181818181818182, 81289: 0.8181818181818182, 81379: 0.8333333333333334, 81480: 0.8333333333333334, 84306: 0.8333333333333334, 84520: 0.8181818181818182, 89825: 0.8333333333333334, 89833: 0.9090909090909091, 91300: 0.9090909090909091, 99370: 0.8181818181818182}