# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the whole binary stream is infeasible, DGIM can estimate the number of 1-bits in the most recent window. In this coding assignment, you are given *stream_data.txt* (a binary stream) and need to implement the DGIM algorithm to count the number of 1-bits. Write the code and answer the questions below.
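
To make the bucket mechanics concrete, here is a minimal, self-contained sketch of DGIM on an in-memory list of bits. The class name `DGIM`, its methods `add` and `estimate`, and the synthetic random stream are assumptions made for this illustration, not part of the assignment. Unlike the solution cell further down, the sketch stores absolute timestamps rather than timestamps modulo the window size.

```python
import random


class DGIM:
    """Illustrative DGIM sketch: estimate the 1-bits among the last `window` bits."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        # Buckets as (timestamp of most recent 1-bit, size), oldest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its timestamp leaves the window.
        if self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if bit != 1:
            return
        self.buckets.append((self.time, 1))
        # While three buckets share a size, merge the two oldest of them.
        i = len(self.buckets) - 1
        while i >= 2 and self.buckets[i][1] == self.buckets[i - 2][1]:
            stamp, size = self.buckets[i - 1]  # keep the newer timestamp
            self.buckets[i - 2:i] = [(stamp, 2 * size)]
            i -= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may lie in the window: count half of it.
        return total - self.buckets[0][1] / 2


# Synthetic stream (an assumption for the demo, not the assignment data).
random.seed(0)
bits = [random.randint(0, 1) for _ in range(5000)]
counter = DGIM(window=100)
for bit in bits:
    counter.add(bit)
print("estimate:", counter.estimate(), "exact:", sum(bits[-100:]))
```

Because only the oldest bucket can straddle the window boundary, the estimate should stay within 50% of the exact count, while the number of stored buckets grows only logarithmically with the window size.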

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
import math
import time


DGIM_res = []
file = '../input/coding2/stream_data.txt'
window = 1000
nowstamp = 0

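# Bucket sizes are powers of two: 1, 2, 4, ..., up to the largest one <= window.
sizes = [2 ** i for i in range(int(math.log2(window)) + 1)]
# For each size, the timestamps (mod window) of its buckets, oldest first.
DGIM_container = {i: [] for i in sizes}

def add_one(stamp):
    # A new 1-bit starts as its own bucket of size 1.
    DGIM_container[sizes[0]].append(stamp)

def update():
    # Restore the DGIM invariant (at most two buckets per size), smallest size first.
    for i in sizes:
        if len(DGIM_container[i]) > 2:
            # Merge the two oldest buckets; the merged bucket keeps the newer timestamp.
            DGIM_container[i].pop(0)
            stamp = DGIM_container[i].pop(0)
            if i != sizes[-1]:
                DGIM_container[i * 2].append(stamp)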

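def display():
    # The oldest bucket is the first entry of the largest non-empty size class.
    firststamp = None
    for key in sizes:
        if DGIM_container[key]:
            firststamp = DGIM_container[key][0]
    # Every bucket counts in full except the oldest one, which may straddle the
    # window boundary and therefore contributes only half of its size.
    cnt = 0
    for key in sizes:
        for stamp in DGIM_container[key]:
            cnt += key if stamp != firststamp else 0.5 * key
    print('In the window of', window, 'bits, there are', cnt, 'ones')
    DGIM_res.append(int(cnt))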
    
    
time11 = time.process_time()
index = 0
with open(file, 'r') as f:
    while 1:
        char = f.read(1)
        if not char:
            break
        # Skip separators (tabs, newlines); only '0' and '1' are stream bits.
        if char not in '01':
            continue
            
        nowstamp = (nowstamp+1) % window
        # Timestamps are stored modulo the window size, so a bucket whose stamp
        # equals the new timestamp is exactly `window` bits old: expire it.
        for k in sizes:
            if nowstamp in DGIM_container[k]:
                DGIM_container[k].remove(nowstamp)

        if char == "1":
            add_one(nowstamp)
            update()
            
        index = (index + 1) % window
        if index == 0:
            display()
time12 = time.process_time()

2. Write a function that exactly counts the number of 1-bits in the current window, and compare its running time and space usage with those of the DGIM algorithm.
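
One possible exact baseline, sketched here under the assumption that the stream is available as an iterable of 0/1 integers, keeps the last `window` bits in a deque together with a running total. The function name `exact_counts` is hypothetical. The running total avoids re-summing the window, but the memory cost still grows linearly with the window size, which is the cost DGIM avoids.

```python
from collections import deque


def exact_counts(bits, window):
    """Yield the exact number of 1-bits in the last `window` bits after each new bit."""
    recent = deque()
    ones = 0
    for bit in bits:
        recent.append(bit)
        ones += bit
        # Once the window is full, subtract the bit that slides out.
        if len(recent) > window:
            ones -= recent.popleft()
        yield ones


print(list(exact_counts([1, 0, 1, 1, 0, 1], window=3)))  # [1, 1, 2, 2, 2, 2]
```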

In [None]:
from collections import deque
index2 = 0

NORMAL_container = deque()
NORMAL_res = []
time21 = time.process_time()
with open('../input/coding2/stream_data.txt', 'r') as f:
    while 1:
        char = f.read(1)
        if not char:
            break
        # Skip separators (tabs, newlines); only '0' and '1' are stream bits.
        if char not in '01':
            continue

        # Keep only the most recent `window` bits.
        if len(NORMAL_container) >= window:
            NORMAL_container.popleft()
        NORMAL_container.append(int(char))

        index2 = (index2 + 1) % window
        if index2 == 0:
            # Exact count: sum every bit currently stored in the window.
            ones = sum(NORMAL_container)
            print('In the window of', window, 'bits, there are', ones, 'ones')
            NORMAL_res.append(ones)
            
time22 = time.process_time()

In [None]:
# Sanity check: count the 1-bits among the first 1000 values directly.
with open('../input/coding2/stream_data.txt', 'r') as f:
    data = f.read().split("\t")
print(sum(int(bit) for bit in data[:1000]))

In [None]:
print("index".ljust(5, " "),end="|")
print("DGIM".ljust(10, " "),end="|")
print("NORMAL".ljust(10, " "),end="|")
print("ERROR RATE".ljust(10," "))

for i in range(len(DGIM_res)):
    print(str(i).ljust(5, " "),end="|")
    print(str(DGIM_res[i]).ljust(10, " "),end="|")
    print(str(NORMAL_res[i]).ljust(10, " ") ,end="|")
    print((str(round(100 * abs(DGIM_res[i] - NORMAL_res[i]) / NORMAL_res[i], 5)) + '%').ljust(10, " "))
    
print("*"*45)
print('sum'.ljust(5, " "),end="|")
print(str(sum(DGIM_res)).ljust(10, " "),end="|")
print(str(sum(NORMAL_res)).ljust(10, " "), end="|")
print((str(round(100 * abs(sum(DGIM_res) - sum(NORMAL_res)) / sum(NORMAL_res), 5)) + '%').ljust(10, " "))

In [None]:
print('Time difference:')
print('DGIM:', '%s ms' % ((time12 - time11)*1000))
print('NORMAL:', '%s ms' % ((time22 - time21)*1000))
print('',end='\n'+"*"*30 + '\n')

print("Space difference:")
print('DGIM:', '%s' % (sum([len(i) for i in DGIM_container.values()])))
print('NORMAL:', '%s' % ((len(NORMAL_container))))

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you are given *docs_for_lsh.csv*, in which each document has been converted into a set of k-shingles (k = 8, 9, 10). The file contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.
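
As a small illustration of what shingling and Jaccard similarity mean, the sketch below assumes character-level shingles and a tiny k. The exact preprocessing behind *docs_for_lsh.csv* is not stated, and the helper names `shingles` and `jaccard` are hypothetical.

```python
def shingles(text, k):
    """Return the set of all length-k character substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = shingles("abcdab", 2)  # {"ab", "bc", "cd", "da"}
b = shingles("abcd", 2)    # {"ab", "bc", "cd"}
print(jaccard(a, b))       # 0.75
```

In the CSV, each document's shingle set is encoded as a 0/1 row, so the same similarity is the number of columns where both rows are 1 divided by the number of columns where at least one row is 1.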

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity.
Parameters you need to determine:
1) Length of signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
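
To see how *n* and *b* interact, recall that with *b* bands of *r = n / b* rows each, two documents with Jaccard similarity *s* share at least one bucket with probability 1 - (1 - s^r)^b, assuming ideal minhash functions. The sketch below evaluates this curve; the helper name `candidate_probability` and the values b = 30, r = 5 are choices made for illustration only.

```python
def candidate_probability(s, b, r):
    """Probability that two documents with similarity s agree on all rows of some band."""
    return 1 - (1 - s ** r) ** b


for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, b=30, r=5), 4))
```

More bands (larger *b*) make it easier for a pair to become a candidate, while more rows per band (larger *r*) make it harder.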

In [None]:
import pandas as pd

df = pd.read_csv('../input/coding2/docs_for_lsh.csv',index_col=0)

df.head()

In [None]:
import numpy as np
import random
import hashlib


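def signMatrixGenerate(matrix, n):
    # matrix: shingles x documents (0/1). Returns an n x documents signature matrix.
    final = []

    for _ in range(n):
        # Visit the rows in a random order; this simulates one random permutation.
        seq = list(range(matrix.shape[0]))
        random.shuffle(seq)
        # Minhash of each document: the first row in that order holding a 1
        # (-1 if the document contains no shingle at all).
        result_one = [-1] * matrix.shape[1]
        cnt = 0

        for row in seq:
            for col in range(matrix.shape[1]):
                if result_one[col] == -1 and matrix[row][col] != 0:
                    result_one[col] = row
                    cnt += 1

            # Stop early once every document has its minhash value.
            if cnt == matrix.shape[1]:
                break
        final.append(result_one)

    return np.array(final)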

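def minHash(matrix, b=30, r=5):
    # LSH banding: build a signature matrix with n = b * r rows, split it into
    # b bands of r rows, and hash every document's band into the global `bucket`.
    n = b * r
    signMatrix = signMatrixGenerate(matrix, n)
    begin, end = 0, r
    cnt = 0  # 1-based index of the current band

    while end <= signMatrix.shape[0]:
        cnt += 1
        for col in range(signMatrix.shape[1]):

            # Appending the band index keeps identical values in different
            # bands from landing in the same bucket.
            hashObj = hashlib.md5()
            band = str(signMatrix[begin: begin + r, col]) + str(cnt)
            hashObj.update(band.encode())
            hash_tag = hashObj.hexdigest()
            # Documents whose band hashes to the same tag become candidate pairs.
            if hash_tag not in bucket:
                bucket[hash_tag] = []
            if col not in bucket[hash_tag]:
                bucket[hash_tag].append(col)
        begin += r
        end += r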

In [None]:
bucket = {}
# Transpose so that rows are shingles and columns are documents.
input_matrix = np.array(df).T
minHash(input_matrix, 30, 5)
print(len(bucket))

In [None]:
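def _search(searched):
    # Candidates: every document sharing at least one bucket with `searched`.
    result = set()
    for key in bucket:
        if searched in bucket[key]:
            result.update(bucket[key])

    # discard() does not raise if `searched` was never hashed.
    result.discard(searched)
    return result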

In [None]:
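def myJaccard(a, b):
    # a and b are one-row DataFrame slices; compare their 0/1 shingle vectors.
    a = np.array(a)[0]
    b = np.array(b)[0]
    num = den = 0
    for i in range(len(a)):
        if a[i] == b[i] == 1:
            num += 1  # shingle present in both documents (intersection)
            den += 1
        elif not (a[i] == b[i] == 0):
            den += 1  # shingle present in exactly one document (union only)
    # Two documents without any shingles have an empty union.
    return num / den if den else 0.0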

all_doc = list(range(df.shape[0]))

# We only show a part of the final results.
for doc in all_doc[:30]:
    # default=None covers a document that shares no bucket with any other document.
    best = max([i for i in _search(doc) if i != doc], default=None,
               key=lambda i: myJaccard(df.iloc[doc:doc+1], df.iloc[i:i+1]))
    print('doc{}: {}'.format(doc, best))

print("......(not all)")

Problem: For document 0 (the one with id '0'), list the **30** most similar document ids (except document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tip: You can adjust your parameters so that documents with similarity *s > 0.8* are hashed into the same bucket.
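
A common rule of thumb is that the candidate probability rises steeply around the similarity (1/b)^(1/r), so one way to tune the parameters is to list the band layouts available for a given signature length and pick one whose threshold sits at or below the target. The sketch below does this; the function names `threshold` and `band_options` and the value n = 150 are assumptions for illustration.

```python
def threshold(b, r):
    """Approximate similarity at which the banding S-curve rises steeply."""
    return (1 / b) ** (1 / r)


def band_options(n):
    """All (b, r) pairs with b * r == n."""
    return [(b, n // b) for b in range(1, n + 1) if n % b == 0]


n = 150  # hypothetical signature length
for b, r in band_options(n):
    print("b = {:3d}, r = {:3d}, threshold ~ {:.3f}".format(b, r, threshold(b, r)))
```

Choosing a threshold somewhat below 0.8 reduces the chance of missing truly similar pairs, at the cost of more candidates to verify with the exact Jaccard similarity.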

In [None]:
from sklearn.metrics import jaccard_score


nowDoc = 0

# Rank the LSH candidates of nowDoc by their exact Jaccard similarity.
my_sim = {i: myJaccard(df.iloc[nowDoc:nowDoc+1], df.iloc[i:i+1])
          for i in _search(nowDoc) if i != nowDoc}
nowDoc_res = sorted(my_sim, key=my_sim.get, reverse=True)[:30]
print('My result: doc{} \n{}'.format(
    nowDoc, '\n'.join(str(i) + '({})'.format(my_sim[i]).ljust(10, ' ') for i in nowDoc_res)))

print("-"*50)
# ----------------------Validation----------------------
# Brute force over all documents with sklearn's jaccard_score.
base = np.array(df.iloc[nowDoc])
true_sim = {i: jaccard_score(base, np.array(df.iloc[i]))
            for i in all_doc if i != nowDoc}
nowDoc_res = sorted(true_sim, key=true_sim.get, reverse=True)[:30]
print('Validation: doc{} \n{}'.format(
    nowDoc, '\n'.join(str(i) + '({})'.format(true_sim[i]).ljust(10, ' ') for i in nowDoc_res)))