# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the whole binary stream is infeasible, DGIM can estimate the number of 1-bits in the current window. In this coding assignment, you are given *stream_data.txt* (a binary stream), and you need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the questions below.
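
To make the estimate concrete, here is a minimal sketch of the textbook DGIM query rule. It assumes the buckets are kept as `(timestamp, size)` pairs ordered from newest to oldest; the function name `dgim_estimate` and the example buckets are hypothetical and not part of the assignment code.

```python
def dgim_estimate(buckets):
    """Estimate the number of 1-bits from DGIM buckets (newest first).

    Each bucket is a (timestamp, size) pair, where size is a power of two.
    Textbook rule: count every bucket in full except the oldest one, of which
    only half is counted because it may straddle the edge of the window.
    """
    if not buckets:
        return 0
    sizes = [size for _, size in buckets]
    return sum(sizes[:-1]) + sizes[-1] / 2

# Hypothetical bucket list: sizes 1, 1, 2, 4, 4, 8 from newest to oldest.
example = [(99, 1), (97, 1), (94, 2), (90, 4), (83, 4), (70, 8)]
print(dgim_estimate(example))  # 1 + 1 + 2 + 4 + 4 + 8/2 = 16.0
```

Only the oldest bucket is uncertain, which is why the error of this rule is bounded by half of that bucket's size.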

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
# Your code here, you can add cells if necessary
class DGIM(object):
    def __init__(self, filepath, windowsize=1000, autoUpdate=True):
        '''
        Parameters:
            filepath: the path of the stream file
            windowsize: the length of the current window
            autoUpdate: whether to update the count after each incoming bit
        '''
        import math
        self.filepath = filepath
        self.windowsize = windowsize
        self.buckets = []
        self.counts = [0] * int(2*math.log(self.windowsize))
        self.cnt1bits = 0             # Estimated number of 1-bits in the current window
        self.curTs = 0                # Current relative timestamp: the real timestamp mod windowsize
        self.autoUpdate = autoUpdate  # If True, update cnt1bits after each incoming bit
        
    def pushDown(self):
        '''
        If the timestamp of the oldest bucket matches the current relative timestamp, its last bit has left the window, so remove the bucket.
        '''
        if self.buckets == []:
            return
        if self.buckets[-1][0] == self.curTs-1:
            self.counts[self.buckets[-1][1]] -= 1
            self.buckets.pop()
            
    def checkBuckets(self):
        '''
        Check the bucket sizes from smallest to largest, merging the 2 oldest buckets of a size whenever 3 buckets share that size.
        '''
        for i in range(len(self.counts)):
            if self.counts[i] == 3:
                self.counts[i] = 1
                self.counts[i+1] += 1
                self.buckets[i+1][1] += 1
                self.buckets.pop(i+2)
        
    def forward(self):
        '''
        Receive a new bit from the stream and build a new bucket if it's bit 1.
        '''
        c = self.file.read(1)
        if c == '\t':
            c = self.file.read(1)
        if c == '':
            return -1
        self.curTs = (self.curTs + 1) % self.windowsize
        self.pushDown()
        if c == '1':
            self.counts[0] += 1
            self.buckets.insert(0,[self.curTs-1, 0])
        self.checkBuckets()
        
    def update(self):
        '''
        Estimate the number of 1-bits in the current window from the buckets.
        '''
        self.cnt1bits = 0
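        for i in range(len(self.counts)-1):
            if self.counts[i+1] == 0:
                # Largest bucket size: count only half, since its oldest bits may be outside the window.
                # Note this halves every bucket of that size; the textbook rule halves only the oldest bucket.
                self.cnt1bits += 2**(i-1) * self.counts[i]
                break
            else:
                # Smaller sizes lie entirely inside the window, so count them in full
                self.cnt1bits += 2**i * self.counts[i]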
    
    def work(self):
        '''
        Traverse the whole stream file.
        '''
        # The context manager closes the file even if forward() raises
        with open(self.filepath, 'r') as f:
            self.file = f
            while True:
                if self.forward() == -1:
                    break
                if self.autoUpdate:
                    self.update()
    
    def autoUpdateEnable(self):
        '''
        Switch on the auto-update.
        '''
        self.autoUpdate = True
    
    def autoUpdateDisable(self):
        '''
        Switch off the auto-update.
        '''
        self.autoUpdate = False
    
    def query(self):
        if not self.autoUpdate:
            self.update()
        return self.cnt1bits
        

In [None]:
d = DGIM("../input/coding2/stream_data.txt", 1000)
%time d.work()
d.query()

In [None]:
d = DGIM("../input/coding2/stream_data.txt", 1000)
d.autoUpdateDisable()
%time d.work()
d.query()

2. Write a function that exactly counts the number of 1-bits in the current window, and compare its running time and space usage with those of the DGIM algorithm.
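
As a point of comparison, one possible exact counter is sketched below. It assumes the bits arrive as integers 0 or 1 from any iterable and uses `collections.deque` so that discarding the oldest bit is O(1), whereas `list.pop(0)` shifts the whole list. The name `exact_window_counts` is hypothetical.

```python
from collections import deque

def exact_window_counts(bits, windowsize):
    """Yield the exact number of 1-bits in the last `windowsize` bits after each new bit."""
    window = deque()
    ones = 0
    for bit in bits:
        window.append(bit)
        ones += bit
        if len(window) > windowsize:
            ones -= window.popleft()  # drop the bit that just left the window
        yield ones

counts = list(exact_window_counts([1, 0, 1, 1, 0, 1], windowsize=3))
print(counts)  # [1, 1, 2, 2, 2, 2]
```

The price of exactness is that all `windowsize` bits must be stored, which is the space cost DGIM avoids.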

In [None]:
# Your code here, you can add cells if necessary
class plainCount(object):
    def __init__(self, filepath, windowsize=1000):
        '''
        Parameters:
            filepath: the path of the stream file
            windowsize: the length of the current window
        '''
        self.filepath = filepath
        self.windowsize = windowsize
        self.buffer = []
        self.cnt1bits = 0
        
    def forward(self):
        '''
        Receive a new bit from the stream and count if it's bit 1. Discard the oldest bit.
        '''
        c = self.file.read(1)
        if c == '\t':
            c = self.file.read(1)
        if c == '':
            return -1
        self.buffer.append(c)
        self.cnt1bits += (c == '1')
        if len(self.buffer) > self.windowsize:
            lc = self.buffer.pop(0)
            self.cnt1bits -= (lc == '1')

    def work(self):
        '''
        Traverse the whole stream file.
        '''
        # The context manager closes the file even if forward() raises
        with open(self.filepath, 'r') as f:
            self.file = f
            while True:
                if self.forward() == -1:
                    break
    
    def query(self):
        return self.cnt1bits

In [None]:
p = plainCount("../input/coding2/stream_data.txt", 1000)
%time p.work()
p.query()

**Cost Comparison**

|        | DGIM | DGIM* | Plain |  
| :----: | :--: | :---: | :---: |  
| Time/ms  | 247 | 92.6 | 41.7 |  
| Space/bits | 8 | 8 | 1000 |  

In this table, the "Time" row shows how long each algorithm takes to process the whole "stream_data.txt" file, and the "Space" row shows how many bits are used to store the counting information. DGIM* disables the automatic update of the count after each incoming bit, which is acceptable because we are not querying all the time. DGIM requires less space at the cost of running time, and that cost can be reduced by disabling auto-update.
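
For reference, the sketch below computes a worst-case storage bound for DGIM under the textbook accounting: at most two buckets per size, each storing a timestamp modulo the window size plus the exponent of its size. This accounting is an assumption and need not match how the "Space" row was measured; the function name `dgim_space_bound_bits` is hypothetical.

```python
import math

def dgim_space_bound_bits(windowsize):
    """Worst-case bits for DGIM buckets under textbook accounting (an assumption)."""
    num_sizes = math.floor(math.log2(windowsize)) + 1    # sizes 1, 2, 4, ... <= windowsize
    max_buckets = 2 * num_sizes                           # at most 2 buckets per size
    timestamp_bits = math.ceil(math.log2(windowsize))     # timestamp stored mod windowsize
    exponent_bits = max(1, math.ceil(math.log2(num_sizes)))  # size stored as its exponent
    return max_buckets * (timestamp_bits + exponent_bits)

print(dgim_space_bound_bits(1000))  # 20 buckets * (10 + 4) bits = 280
```

Even this pessimistic bound grows as O(log² N), while the exact counter needs N bits for a window of N.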

**Accuracy Comparison**

|        | DGIM | Plain |  
| :----: | :--: | :---: |  
| Counts of 1  | 508  | 391 |  
| Error Rate | 29.9% | 0% | 

In this table, the "Counts of 1" row shows each method's count of '1' characters in the last 1000 characters of the "stream_data.txt" file. DGIM keeps its error rate below 50%, as the theoretical bound guarantees.
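
The error rate in the table is the relative error of the DGIM estimate with respect to the exact count. A minimal sketch of that arithmetic, with a hypothetical helper name `relative_error`:

```python
def relative_error(estimate, exact):
    """Return |estimate - exact| / exact as a fraction."""
    return abs(estimate - exact) / exact

err = relative_error(508, 391)  # DGIM estimate vs. exact count from the table
print(f"{err:.1%}")  # 29.9%
```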

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you are given *docs_for_lsh.csv*, where the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the questions below.
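
With this 0/1 encoding, the Jaccard similarity of two documents is the number of shingles they share divided by the number of shingles present in either. A small sketch with NumPy, using two made-up rows and a hypothetical helper name `jaccard`:

```python
import numpy as np

def jaccard(u, v):
    """Jaccard similarity of two 0/1 vectors: |intersection| / |union|."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0  # convention chosen here for two empty shingle sets
    return float(np.logical_and(u, v).sum() / union)

a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]
print(jaccard(a, b))  # 2 shared shingles out of 4 present: 0.5
```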

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity. 
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands that divide the signature matrix *b*. Recommended value: b > n // 10.
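
The reason *n* matters is that the fraction of positions in which two minhash signatures agree estimates the Jaccard similarity, and the estimate gets less noisy as *n* grows. The sketch below illustrates this on two made-up shingle sets. It assumes linear hash functions of the form `(a*x + b) % prime`, which only approximate true random permutations; the names `minhash_signature`, `doc_a` and `doc_b` are hypothetical.

```python
import numpy as np

def minhash_signature(shingle_ids, a, b, prime):
    """Entry k of the signature is the minimum of (a[k]*x + b[k]) % prime over the shingles x."""
    x = np.asarray(sorted(shingle_ids), dtype=np.int64)
    # Shape (n, len(x)): every hash function applied to every shingle id.
    hashed = (a[:, None] * x[None, :] + b[:, None]) % prime
    return hashed.min(axis=1)

rng = np.random.default_rng(0)
n, prime = 200, 2_147_483_647  # signature length and a large prime (2**31 - 1)
a = rng.integers(1, prime, size=n, dtype=np.int64)
b = rng.integers(0, prime, size=n, dtype=np.int64)

doc_a = set(range(0, 8))    # shingles 0..7
doc_b = set(range(4, 12))   # shingles 4..11, true Jaccard = 4/12
sig_a = minhash_signature(doc_a, a, b, prime)
sig_b = minhash_signature(doc_b, a, b, prime)
estimate = float(np.mean(sig_a == sig_b))
print(estimate, len(doc_a & doc_b) / len(doc_a | doc_b))
```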

In [None]:
import numpy as np
from functools import partial
from tqdm import tqdm
from sklearn.metrics import jaccard_score
class LocSenHash(object):
    def __init__(self):
        return
        
    def load(self, filename):
        self.docs = np.loadtxt(filename, dtype=np.uint32, delimiter=',', skiprows=1)[:, 1:]
        self.docNum, self.shingleNum = self.docs.shape
    
    def hashReset(self, prime, mod):
        self.mod = mod
        self.hashFunc = []
        for i in range(self.sigNum):
            a = np.random.randint(1, prime)
            b = np.random.randint(1, prime)
            self.hashFunc.append(partial(self.sigHash, a=a, b=b, prime=prime, mod=mod))
    
    def sigHash(self, x, a, b, prime, mod):
        return ((a * x + b) % prime) % mod

    def getSignature(self, prime, mod, sigNum):
        self.sigNum = sigNum
        self.hashReset(prime, mod)
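        # Initialise every signature entry to mod, which is larger than any hash value
        self.hashTable = np.full((self.sigNum, self.docNum), self.mod, dtype=np.uint32)
        for i in tqdm(range(self.shingleNum)):
            # Documents (rows) that contain shingle i; computed once per shingle
            nzi = np.nonzero(self.docs[:, i])[0]
            for j, hashFunc in enumerate(self.hashFunc):
                hashValue = hashFunc(i)
                self.hashTable[j, nzi] = np.minimum(self.hashTable[j, nzi], hashValue)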
    
    def lsHash(self, bandNum, rowNum, prime, mod):
        # prime and mod are unused here: each band is bucketed with Python's built-in hash()
        self.bandNum = bandNum
        self.rowNum = rowNum
        self.lshResult = []
        for i in tqdm(range(self.bandNum)):
            temp = {}
            for j in range(self.docNum):
                # Hash the rowNum signature rows that make up band i of document j
                v = hash(self.hashTable[i*self.rowNum:(i+1)*self.rowNum, j].tobytes())
                temp.setdefault(v, []).append(j)
            self.lshResult.append(temp)
    
    def getSimDoc(self, doci, num):
        self.queryDoc = doci
        self.simCount = {}
        for j in tqdm(range(len(self.lshResult))):
            for key, value in self.lshResult[j].items():
                if doci in value:
                    for i in value:
                        self.simCount[i] = self.simCount.get(i, 0) + 1
        # Rank candidates by the number of bands in which they share a bucket with doci
        self.similarity = sorted(self.simCount.items(), key=lambda item: item[1], reverse=True)
        # Exclude doci explicitly instead of assuming it is ranked first
        self.simDoc = [doc for doc, _ in self.similarity if doc != doci][:num]
    
    def checkSim(self):
        print("Similar Doc\t\tJaccard Score\n")
        for doc in self.simDoc:
            # Compare against the queried document rather than a hard-coded row 0
            print("{}\t\t\t\t{}".format(doc, jaccard_score(self.docs[self.queryDoc, :], self.docs[doc, :])))

In [None]:
l = LocSenHash()
l.load("../input/coding2/docs_for_lsh.csv")

In [None]:
l.getSignature(2**31-1, 2**17-1, 100)

In [None]:
l.lsHash(10, 10, 2**31-1, 2**17-1)

Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
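
One way to reason about this tip: with *b* bands of *r* rows each, two documents with Jaccard similarity *s* share at least one band with probability 1 - (1 - s^r)^b, and the curve rises most steeply near (1/b)^(1/r). The sketch below evaluates both for the banding used in this notebook (10 bands of 10 rows on a 100-row signature); the function names are hypothetical.

```python
def candidate_probability(s, b, r):
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b

def approx_threshold(b, r):
    """Similarity at which the S-curve rises most steeply, roughly (1/b)**(1/r)."""
    return (1.0 / b) ** (1.0 / r)

b, r = 10, 10  # 10 bands of 10 rows each
print(round(approx_threshold(b, r), 3))  # 0.794
for s in (0.6, 0.8, 0.9):
    print(s, round(candidate_probability(s, b, r), 3))
```

Increasing *r* pushes the threshold up (fewer false positives), while increasing *b* pushes it down (fewer false negatives), so the pair can be tuned until the threshold sits near the target similarity.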

In [None]:
l.getSimDoc(0, 30)
l.checkSim()