# Introduction

In this homework, we work with a dataset of 1000 sports articles taken from the [UC Irvine Machine Learning Repository's website](https://archive-beta.ics.uci.edu/). [[link]](https://archive.ics.uci.edu/ml/machine-learning-databases/00450/)

# Instructions

To test our algorithms, first download the data used for this work and unzip it into a folder called "Data" placed in the same directory as this notebook. Alternatively, you can change the variable <code>dir_path</code> in the cell under [Import data](#Import-data) to point to a location that is more convenient for you.
<br>
<br>
Once the data is in place, you only need to run this notebook to test our algorithms. The libraries used are <code>numpy</code>, <code>time</code> and <code>os</code>. Only <code>numpy</code> has to be installed in your environment before launching the program, since <code>time</code> and <code>os</code> are part of the Python standard library.

# Technical report
The different steps of our work are : 
<ul>
    <li><u>Import and storage of the data :</u> We collect the data in a list <code>tab_data</code>. Each item of the list corresponds to a different article. <a href="#Import-data"> [Link]</a></li>
    <li><u>Implementation of the shingling function :</u> For the shingling function, we take every substring of <code>k=10</code> consecutive characters and apply Python's built-in hash function for strings, which gives a set of hashed shingles for each item of <code>tab_data</code>.<a href="#Shingling-function"> [Link]</a></li>
    <li><u>Implementation of the similarity function :</u> Since the previous step produces sets, the similarity function is straightforward: we compute the size of the union of the two sets and deduce the size of the intersection from it. The similarity is then the ratio $|A \cap B| / |A \cup B|$. <u>Useful property:</u> $|A \cap B| + |A \cup B| = |A| + |B|$ <a href="#Compare-Sets"> [Link]</a> </li>
    <li><u>Implementation of the MinHashing function :</u> First, we need a mapping that numbers the shingles. To do so, we build a dictionary <code>dict_sh_to_int</code> whose keys are the shingles and whose values are integers in $ ⟦ 0;N_{\#_{Shingling}}-1 ⟧$, together with its reverse dictionary <code>dict_int_to_sh</code>. We then tried two methods:
        <ul>
            <li><u>Method 1 :</u> Build a characteristic matrix that indicates, for each shingle, which articles contain it. This requires a matrix of dimensions $ N_{\#_{Shingling}} \times N_{\#_{Articles}}$ that is mostly filled with zeros. This method works for a small amount of data, but we ran out of memory when computing it on the whole dataset.</li>
            <li><u>Method 2 :</u> Build the MinHashing matrix directly from the set of shingles of each article. We use <code>n=100</code> hash functions. These functions are simply permutations of $ ⟦ 0;N_{\#_{Shingling}}-1 ⟧$, and each permutation induces an ordering of the shingles through <code>dict_int_to_sh</code>. For each permutation, we look for the first shingle, in the order of that permutation, that belongs to the set of the article. The result is stored in the matrix <code>matrix_hashed_2</code> (each row corresponds to an article) of size  $ N_{\#_{Articles}} \times n$.</li>
        </ul>
     The final result is the matrix <code>matrix_hashed_2</code>, which holds the MinHashing values. <a href="#MinHashing">[Link]</a></li>
    <li><u>Implementation of the signature comparison function :</u> We simply count the number of positions at which the two signatures hold the same value. Dividing this count by <code>n</code> gives the proportion of matching signature values. <a href="#CompareSignature"> [Link]</a> </li>
    <li><u>Implementation of the LSH model :</u>  <a href="#LSH"> [Link]</a> </li>
</ul>  

# Performances

In [1]:
import numpy as np
# import pandas as pd
# from itertools import permutations 
# import plotly.graph_objects as go
# import matplotlib.pyplot as plt
# import seaborn as sns
# from IPython.display import display, HTML, Markdown
import os
import time
# from tqdm import tqdm

In [2]:
t_0 = time.time()

# Import data

array(['Text0001.txt', 'Text0002.txt', 'Text0003.txt', 'Text0004.txt',
       'Text0005.txt', 'Text0006.txt', 'Text0007.txt', 'Text0008.txt',
       'Text0009.txt', 'Text0010.txt', 'Text0011.txt', 'Text0012.txt',
       'Text0013.txt', 'Text0014.txt', 'Text0015.txt', 'Text0016.txt',
       'Text0017.txt', 'Text0018.txt', 'Text0019.txt', 'Text0020.txt',
       'Text0021.txt', 'Text0022.txt', 'Text0023.txt', 'Text0024.txt',
       'Text0025.txt', 'Text0026.txt', 'Text0027.txt', 'Text0028.txt',
       'Text0029.txt', 'Text0030.txt', 'Text0031.txt', 'Text0032.txt',
       'Text0033.txt', 'Text0034.txt', 'Text0035.txt', 'Text0036.txt',
       'Text0037.txt', 'Text0038.txt', 'Text0039.txt', 'Text0040.txt',
       'Text0041.txt', 'Text0042.txt', 'Text0043.txt', 'Text0044.txt',
       'Text0045.txt', 'Text0046.txt', 'Text0047.txt', 'Text0048.txt',
       'Text0049.txt', 'Text0050.txt', 'Text0051.txt', 'Text0052.txt',
       'Text0053.txt', 'Text0054.txt', 'Text0055.txt', 'Text0056.txt',
      

In [29]:
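dir_path = "Data/"
sample_size = 1000
tab_data=[]
list_files = np.sort(os.listdir(dir_path))  # sorted so that the article order is reproducible
for i in range(sample_size):
    # "ansi" is a Windows-only codec alias; elsewhere, "cp1252" is the usual Western-European equivalent
    with open(os.path.join(dir_path, list_files[i]), encoding="ansi") as f:
        tab_data.append(f.read())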

In [30]:
tab_data

['Finalists in the Apertura play-offs, Toluca had drawn their first two Clausura games but got off to a good start when Edgar Benitez put them ahead in the 16th minute.\nMatias Britos levelled 20 minutes later but Lucas Silva netted 14 minutes from the end to ensure the visitors took all three points.\n  \tFranco Arizala scored 13 minutes from time to ensure Jaguares claimed their first point with a 1-1 draw against Monterrey, who had opened the scoring through Aldo De Nigris (14).\n Hosts Jaguares also had Jorge Rodriguez sent off in the closing moments.',
 'City manager Roberto Mancini has consistently said his fellow Italian is not for sale throughout this month\'s transfer window but that has not quashed rumors linking him with the San Siro giants.\n  \tMilan have made their liking for the 22-year-old clear but have previously baulked at City\'s reported Â£28million valuation. Now fresh reports have emerged claiming negotiations between the clubs have begun but City\'s public messa

# Shingling function

In [31]:
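def shingling(str_doc, k):
    """Return the set of hashed k-character shingles of str_doc."""
    res = set()
    for i in range(len(str_doc)-k+1):
        sample = str_doc[i:i+k]
        # The built-in hash() of a str is salted per Python process,
        # so these values are only comparable within a single run
        hashed_sample = hash(sample)
        res.add(hashed_sample)
    return res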
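
Because the built-in `hash` of a string changes from one Python process to another (unless `PYTHONHASHSEED` is fixed), the shingle values above cannot be saved and compared across runs. The sketch below shows one possible reproducible variant. It swaps in `zlib.crc32`, which is not what the notebook uses, and the name `shingling_stable` is hypothetical. CRC32 values are only 32 bits wide, so collisions are more likely than with the built-in hash.

```python
import zlib

def shingling_stable(str_doc, k):
    """Return the set of CRC32 hashes of all k-character shingles of str_doc."""
    return {
        zlib.crc32(str_doc[i:i + k].encode("utf-8"))
        for i in range(len(str_doc) - k + 1)
    }

# "abcdefghijkl" contains three 10-character windows
example = shingling_stable("abcdefghijkl", 10)
print(example)
```

A document shorter than `k` characters yields an empty set, exactly as with the original function.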

In [32]:
k=10
tab_sh=[]
for data in tab_data:
    tab_sh.append(shingling(data,k))

In [33]:
tab_sh[4]

{2638341628333195267,
 3505249665800552452,
 -6917529969108320244,
 -6478376416265330675,
 -8167185670997213162,
 -2795868008568586219,
 8085990605414916112,
 5769703417160384532,
 6557495180025864214,
 -2255444943385968614,
 -3813593467918688227,
 -4025754694839197665,
 -4972278854652895198,
 -1208065887592357855,
 -6838555718449053659,
 -7999904922883997658,
 5120162707336929315,
 -7736799887066038229,
 1409733005778894890,
 -8331198564011581393,
 1310135214946549804,
 -8869770435912196046,
 725665027147776050,
 -7177003449116024778,
 4120167320945016881,
 2739334762500284470,
 3075572427272937526,
 1147089262074257465,
 3700202646151503929,
 -508967862807953348,
 -5700096538785849281,
 -6697133646992252864,
 -7185848563036544961,
 -3512588507935612861,
 3525835692913492039,
 -3572800693258452917,
 3453068991158067274,
 8486563638802260042,
 6961752118702202954,
 -4144896371010420653,
 -5431272565341011881,
 -6804657982352400296,
 -7442512648647425958,
 -2132072393538805670,
 7471322

# Compare Sets

In [34]:
def CompareSets(set1,set2):
    """Similarity |set1 n set2| / |set1 u set2| of two sets of shingles."""
    size_1, size_2, size_union = len(set1), len(set2), len(set1.union(set2)) # Card(set1), Card(set2), Card(set1 u set2)
    size_inter = size_1 + size_2 - size_union # Card(set1 n set2) = Card(set1) + Card(set2) - Card(set1 u set2)
    return size_inter/size_union # raises ZeroDivisionError if both sets are empty

In [35]:
CompareSets(tab_sh[70],tab_sh[1]),CompareSets(tab_sh[1],tab_sh[1])

(0.01418052904281429, 1.0)

In [24]:
tab_data[70]

'The Uruguay international is enjoying an outstanding season having scored 20 goals for the Anfield outfit.\n\nThat has led to speculation that other clubs could move in for the £40million-rated forward if the Reds\' absence from the Champions League continues.\n\nLiverpool are currently seventh in the Premier League and facing a battle to regain a place in Europe\'s elite competition.\n\nBut Rodgers is confident Suarez shares his vision for the club and does not believe the 26-year-old\'s future on Merseyside depends on whether they qualify or not.\n\nRodgers said: "Luis had a terrific season last season and he had an opportunity to leave in the summer, and probably would have had a ready-made excuse with a new manager coming in, but he never did.\n\n"He committed to the club. He sees the vision going forward, and he believes in that.\n\n"He knows it won\'t happen overnight. He has had a brilliant season with 20 goals until now, and hopefully there are many more to come."\n\nSuarez si

# MinHashing

Creation of a dictionary that numbers the shingles from 0 to $N_{\#_{Shingling}}-1$.

In [10]:
print("Creation of a set of all shinglings")
t_begin = time.time()
union_sh =set()
for sh in (tab_sh):
    union_sh = union_sh.union(sh) #creation of a global set of all the values 
print("Duration:", round(time.time()-t_begin,1),"s")
t_begin = time.time()
print("Creation of a global dictionary")
dict_sh_to_int={}
for i,hash_val in enumerate(list(union_sh)):
    dict_sh_to_int[hash_val]=i  # creation of a dictionary in order to link each shingle to a unique row in the matrix
dict_int_to_sh = {v: k for k, v in dict_sh_to_int.items()}
print("Duration:", round(time.time()-t_begin,1),"s")

Creation of a set of all shinglings
Duration: 55.5 s
Creation of a global dictionary
Duration: 1.1 s
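
Most of the 55.5 s above is probably spent in `union_sh = union_sh.union(sh)`, which copies the growing global set at every iteration. A minimal sketch of an alternative that builds the same two dictionaries is given below. The helper name `build_shingle_index` and the toy data are hypothetical, and no timing is claimed for the real dataset.

```python
def build_shingle_index(sets_of_shingles):
    """Map every distinct shingle to a row number, and back."""
    # One result set is built directly instead of copying it at each step
    all_shingles = set().union(*sets_of_shingles)
    sh_to_int = {sh: i for i, sh in enumerate(all_shingles)}
    int_to_sh = {i: sh for sh, i in sh_to_int.items()}
    return sh_to_int, int_to_sh

# Toy stand-in for tab_sh: three small sets of hashed shingles
toy_sets = [{10, 20}, {20, 30}, {40}]
sh_to_int, int_to_sh = build_shingle_index(toy_sets)
print(len(sh_to_int))  # 4 distinct shingles
```

In the notebook, this would be called as `build_shingle_index(tab_sh)`.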


## Method 1 : With creation of characteristic matrix
Issue : the characteristic matrix can be too large to fit in memory.

Creation of a matrix of shape (# of shingles, # of articles)
```Python
matrix = np.zeros((len(union_sh), sample_size)) # zero-initialized (np.empty leaves arbitrary values); too heavy in memory for the whole dataset
for i,sh in enumerate(tab_sh):
    for hash_val in sh:
        matrix[dict_sh_to_int[hash_val]][i]=1 # filling of the matrix
```

Generation of 100 permutation functions
```Python
n = 100
tab_permutations =[]  #generation of n permutations of [0,1,....,N_sh-1] => equivalent to n hash functions
for i in range(n):
    tab_permutations.append(np.arange(len(union_sh)))
    np.random.shuffle(tab_permutations[i])
``` 

Computation of hashed matrix
```Python
matrix_hashed = np.zeros((sample_size,n))
for i in range(n):
    for j in range(sample_size):
        ind = 0
        # walk through permutation i until reaching a shingle contained in article j
        while (matrix[tab_permutations[i][ind]][j]==0):
            ind+=1
        matrix_hashed[j][i]=ind
matrix_hashed
```
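
To make the memory issue of Method 1 concrete, the size of the dense characteristic matrix can be estimated before allocating it. The sketch below is only an illustration: the helper `dense_matrix_gib` is hypothetical, and the figure of one million distinct shingles is an assumed order of magnitude, not a value measured on this dataset.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_cell=8):
    """Size in GiB of a dense n_rows x n_cols array (float64 cells by default)."""
    return n_rows * n_cols * bytes_per_cell / 2**30

# Assumed: one million distinct shingles, 1000 articles, float64 cells
print(round(dense_matrix_gib(1_000_000, 1000), 2))  # 7.45
```

Storing the cells as one-byte booleans divides this estimate by eight, but the matrix still grows linearly with the number of distinct shingles, which is why Method 2 avoids building it.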

## Method 2 : Direct creation of the MinHashing matrix

In [22]:
n = 100
t_begin = time.time()
print("Creation of the",n, "permutation functions")
tab_permutations_2 =[]  #generation of n permutations of [0,1,....,N_sh-1] => equivalent to n hash functions
for i in (range(n)):
    tab_permutations_2.append(np.arange(len(union_sh)))
    np.random.shuffle(tab_permutations_2[i])
print("Duration:", round(time.time()-t_begin,1),"s")
t_begin = time.time()

print("Creation of the hashed matrix")
matrix_hashed_2 = np.zeros((sample_size,n))
for i in (range(n)):
    for j,sh in enumerate(tab_sh):
        ind = 0
        while (dict_int_to_sh[tab_permutations_2[i][ind]] not in sh):
            ind+=1
        matrix_hashed_2[j][i]=ind
print("Duration:", round(time.time()-t_begin,1),"s")
matrix_hashed_2

Creation of the 100 permutation functions
Duration: 6.6 s
Creation of the hashed matrix
Duration: 61.7 s


array([[1.0330e+03, 4.9580e+03, 3.5240e+03, ..., 5.6900e+02, 6.0160e+03,
        8.7900e+03],
       [1.0620e+03, 5.0800e+02, 2.0920e+03, ..., 4.9000e+01, 2.2940e+03,
        4.4400e+02],
       [2.1175e+04, 2.2108e+04, 1.3793e+04, ..., 3.0740e+03, 3.3440e+03,
        2.9400e+02],
       ...,
       [5.1000e+01, 0.0000e+00, 3.0100e+02, ..., 6.5900e+02, 4.9700e+02,
        1.1000e+01],
       [2.3400e+02, 1.3730e+03, 1.1400e+02, ..., 1.7300e+02, 7.8000e+01,
        1.9700e+02],
       [6.7800e+02, 1.9360e+03, 2.8700e+02, ..., 1.9200e+02, 1.0000e+00,
        6.7900e+02]])
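
The value stored for article `j` and permutation `i` is the position, in that permutation, of the first shingle found in the article. The same value can be obtained without the `while` loop by inverting the permutation once and taking a minimum. The sketch below illustrates this idea under some assumptions: the function name `minhash_signatures` is hypothetical, each article is given as a non-empty set of shingle row numbers (that is, the hashed shingles already mapped through `dict_sh_to_int`), and `np.random.default_rng` replaces the `np.random.shuffle` calls of the notebook.

```python
import numpy as np

def minhash_signatures(sets_of_ids, n_shingles, n_perm, seed=0):
    """MinHash signatures from non-empty sets of shingle row numbers.

    Entry [j, i] is the position, in permutation i, of the first shingle
    that belongs to document j.
    """
    rng = np.random.default_rng(seed)
    docs = [np.fromiter(ids, dtype=np.int64) for ids in sets_of_ids]
    signatures = np.empty((len(docs), n_perm), dtype=np.int64)
    for i in range(n_perm):
        perm = rng.permutation(n_shingles)
        rank = np.empty(n_shingles, dtype=np.int64)
        rank[perm] = np.arange(n_shingles)  # rank[s] = position of shingle s in perm
        for j, ids in enumerate(docs):
            signatures[j, i] = rank[ids].min()
    return signatures

# Toy example: documents 0 and 1 are identical, document 2 is disjoint from them
toy = [{0, 1, 2}, {0, 1, 2}, {3, 4}]
sig = minhash_signatures(toy, n_shingles=5, n_perm=20)
print(sig.shape)  # (3, 20)
```

Identical documents get identical signatures, and disjoint documents never share a value because every shingle has a distinct position in each permutation.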

# CompareSignatures

In [12]:
def CompareSignatures(mat_hashed,i,j):
    '''Comparison between row i and j of the MinHashed matrix'''
    similar = 0
    for k,a in enumerate(mat_hashed[i]):
        if (a==mat_hashed[j][k]):
            similar+=1
    return similar/len(mat_hashed[j])

In [13]:
compared_signature_matrix = np.zeros((sample_size,sample_size))
print("Creation of a comparison matrix")
t_begin = time.time()
for i in (range(sample_size)):
    for j in range(i):
        compared_signature_matrix[i][j] = CompareSignatures(matrix_hashed_2,i,j)
print("Duration:", round(time.time()-t_begin,1),"s")

Creation of a comparison matrix
Duration: 14.5 s
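
The comparison above loops over signature positions in pure Python. As a sketch of a possible alternative, the same proportions can be computed with NumPy comparisons, one row against all earlier rows at a time. The function names and the toy matrix below are hypothetical and only meant to illustrate the idea.

```python
import numpy as np

def compare_signatures_fast(mat_hashed, i, j):
    """Fraction of MinHash positions on which rows i and j agree."""
    return float(np.mean(mat_hashed[i] == mat_hashed[j]))

def all_pairs_agreement(mat_hashed):
    """Lower-triangular matrix of pairwise agreement fractions."""
    n_docs = mat_hashed.shape[0]
    result = np.zeros((n_docs, n_docs))
    for i in range(1, n_docs):
        # Compare row i with all earlier rows in one vectorized step
        result[i, :i] = np.mean(mat_hashed[:i] == mat_hashed[i], axis=1)
    return result

# Rows 0 and 1 agree on 2 of 4 positions; row 2 agrees with neither
toy_sig = np.array([[1, 2, 3, 4], [1, 2, 0, 0], [9, 9, 9, 9]])
agreement = all_pairs_agreement(toy_sig)
print(agreement[1, 0])  # 0.5
```

As in the notebook, only the entries below the diagonal are filled.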


In [14]:
val, t, count = np.unique(compared_signature_matrix, return_index=True, return_counts=True)
# plt.loglog(val,count)
# plt.show()
val,count

(array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
        0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.21, 0.22, 0.24,
        0.28, 0.3 , 0.31, 0.32, 0.37, 0.49, 0.56, 0.65, 0.69, 0.72, 0.74,
        0.96, 0.98, 1.  ]),
 array([874360,  98550,  21180,   4304,    992,    303,    104,     69,
            34,     28,      8,     12,     14,      6,      3,      3,
             2,      2,      1,      1,      2,      2,      2,      1,
             1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      6], dtype=int64))

In [15]:
np.where(compared_signature_matrix==1),np.where(compared_signature_matrix==0.51)

((array([421, 430, 472, 956, 981, 984], dtype=int64),
  array([345, 381, 338, 708, 838, 875], dtype=int64)),
 (array([], dtype=int64), array([], dtype=int64)))

In [16]:
tab_data[643], tab_data[76]

('Morgan fell to the ground as Hazard attempted to get the ball from him, with the 22-year-old then trying to kick it from under him but appearing to instead make contact with the youngster.\n  \tQPR manager Redknapp sympathises with Hazard and holds Morgan responsible for the incident.\n  \t"Hazard toe-poked the ball under the boy\'s body. Why is the kid lying on the ball in the first place?" he said.\n  \t"You can imagine the frustration - you\'re a player trying to reach a cup final but there\'s this kid behaving like an idiot who won\'t give you the ball back.\n  \t"Hazard didn\'t kick the kid, he kicked the ball underneath him, but the whole thing got blown out of all proportion.\n  \t"I can think of a lot of players who would have kicked a bit harder than he did. He just toe-poked the ball away.\n  \t"The boy was tweeting before the game that he\'s a super time waster. The way he behaved was disgusting."\n  \tMorgan himself briefly broke his silence on Thursday night, tweeting: "

# LSH

In [46]:
class LSH:
    def __init__(self,b,r,N=1e9, n =sample_size):
        self.s = np.power(1/b,1/r) # approximate similarity threshold (1/b)^(1/r)
        self.b = b
        self.r = r
        self.N = N 
        self.random_tab = np.random.randint(1e6, size=(b,r+1))
        self.matrix = np.zeros((n,self.b)) # one hash value per band for each article
        print("Estimated s = ",self.s)

    def  __hash_naive(self, i, arr): #private function
        # linear hash of one band: random offset plus random weighted sum, modulo N
        res =self.random_tab[i][-1]
        for j,a in enumerate(arr):
            res+=a*self.random_tab[i][j]
        return res%self.N
    
    def __LocalitySensitiveHashing(self, hashed_vector) -> np.ndarray:
        res = np.zeros(self.b)
        for i in range(self.b):
            sample = hashed_vector[i*self.r:(i+1)*self.r] # band i of the signature
            res[i] = self.__hash_naive(i,sample)
        return res
    
    def full_LSH_matrix(self, minhash_matrix):
        for i,sign in enumerate(minhash_matrix):
            self.matrix[i] = self.__LocalitySensitiveHashing(sign)
        self.matrix = self.matrix.transpose() # shape (b, n): one row per band
            
    def find_similarity(self):
        set_sim = set()
        for band in range(self.b):
            for i in range(len(self.matrix[band])):
                for j in range(i):
                    if (self.matrix[band][i]==self.matrix[band][j]):
                        set_sim.add((i,j))
        return set_sim

In [47]:
LSH_model = LSH(20,5)

Estimated s =  0.5492802716530588


In [48]:
LSH_model.full_LSH_matrix(matrix_hashed_2)

In [50]:
LSH_model.matrix[0]

In [42]:
LSH_model.find_similarity()
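
`find_similarity` compares every pair of articles within every band. A common alternative is to group articles into buckets keyed by the content of each band, so that only articles falling into the same bucket are paired. The sketch below illustrates this under stated assumptions: the function `lsh_candidate_pairs` is hypothetical, and it uses the band itself (as a tuple) as the bucket key instead of the linear hash modulo `N` of the class above.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Return pairs (i, j), i > j, whose signatures agree on at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sign in enumerate(signatures):
            # The band itself, as a tuple, is the bucket key
            key = tuple(sign[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for doc_ids in buckets.values():
            # doc_ids is in increasing order, so j < i in every combination
            for j, i in combinations(doc_ids, 2):
                candidates.add((i, j))
    return candidates

# Documents 0 and 1 share their first band; document 2 shares none
toy_signatures = [
    [1, 2, 3, 4],
    [1, 2, 9, 9],
    [7, 7, 7, 7],
]
print(lsh_candidate_pairs(toy_signatures, b=2, r=2))  # {(1, 0)}
```

With the notebook's data this would be called as `lsh_candidate_pairs(matrix_hashed_2, 20, 5)`, and the returned pairs could then be checked with `CompareSignatures`.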

In [21]:
time.time()-t_0

132.60751366615295