# Introduction

In this homework, we worked with a dataset from the [UC Irvine Machine Learning Repository](https://archive-beta.ics.uci.edu/), composed of 1000 sports articles. [[Link]](https://archive.ics.uci.edu/ml/machine-learning-databases/00450/)

# Instructions

To test our algorithms, first download the data used for this work and unzip it into a folder called "Data" placed next to this notebook. Alternatively, change the variable <code>dir_path</code> in the cell under [Import data](#Import-data) to suit your setup.
<br>
<br>
Once the data is in place, you only need to run this notebook to test our algorithms. The libraries used are <code>numpy</code>, <code>time</code>, <code>plotly</code>, <code>pandas</code> and <code>os</code>. <code>time</code> and <code>os</code> are part of the Python standard library, so only <code>numpy</code>, <code>plotly</code> and <code>pandas</code> have to be installed in your environment before launching the program.

# Technical report
The different steps of our work are:
<ul>
    <li><u>Import and storage of the data:</u> We have collected the data in a list <code>tab_data</code>. Each item of the list corresponds to a different article. <a href="#Import-data"> [Link]</a></li>
    <li><u>Implementation of the shingling function:</u> For the shingling function, we take every substring of <code>k=10</code> consecutive characters and hash it with Python's built-in <code>hash</code> function for strings, which gives a set of hashed shinglings for each item of <code>tab_data</code>.<a href="#Shingling-function"> [Link]</a></li>
    <li><u>Implementation of the similarity function:</u> Since the previous step produces sets, the similarity function is easy to compute: we compute the size of the union of the two sets, infer the size of the intersection from it, and return the ratio $|A \cap B| / |A \cup B|$. <u>Useful property:</u> $|A \cap B| + |A \cup B| = |A| + |B|$ <a href="#Compare-Sets"> [Link]</a> </li>
    <li><u>Implementation of the MinHashing function:</u> First, we have to map the shinglings to integers in order to number them. To do so, we have built a dictionary <code>dict_sh_to_int</code>, which has shinglings as keys and integers in $ ⟦ 0;N_{\#_{Shingling}}-1 ⟧$ as values, and its reversed dictionary <code>dict_int_to_sh</code>. Then two methods have been tried:
        <ul>
            <li><u>Method 1:</u> Computation of a characteristic matrix that indicates, for each shingling, the articles that contain it. This requires building a matrix, mainly composed of zeros, of dimensions $ N_{\#_{Shingling}} \times N_{\#_{Articles}}$. This method works for a small amount of data, but we ran out of memory when computing it on the whole dataset.</li>
            <li><u>Method 2:</u> Build the MinHashing matrix directly from the sets of shinglings of each article. We have used <code>n=100</code> hash functions. These functions are simply permutations of $ ⟦ 0;N_{\#_{Shingling}}-1 ⟧$, and each permutation induces a "permutation" of the shinglings through <code>dict_int_to_sh</code>. For each permutation, we look for the first shingling of each set in the order given by that permutation and store its position. The result is saved in the matrix <code>matrix_hashed_2</code> (each row corresponds to an article) of size  $ N_{\#_{Articles}} \times n$.</li>
        </ul>
     The matrix <code>matrix_hashed_2</code> obtained with Method 2 gives the MinHashing values used in the following steps. <a href="#MinHashing">[Link]</a></li>
    <li><u>Implementation of the signature comparison function:</u> It simply counts the number of positions at which the two signatures have the same value. Dividing this count by <code>n</code> gives the proportion of matching signature components. <a href="#CompareSignature"> [Link]</a> </li>
    <li><u>Implementation of the LSH model:</u> We have written a class for LSH. It is composed of:
        <ul>
            <li>number of bands <code>b</code></li>
            <li>number of rows per band <code>r</code> (so that $b \times r = n$)</li>
            <li>estimate of the similarity threshold <code>s</code> where $s=\left(\frac{1}{b}\right)^{\frac{1}{r}}$</li>
            <li>large modulus <code>N</code> for the hash function (intended to be a large prime number; the code below uses the default value <code>1e9</code>)</li>
            <li>matrix <code>random_tab</code> of size $b \times (r+1)$ of random numbers, which are the coefficients of the hash function</li>
            <li>matrix <code>matrix</code> of size $N_{\#_{Articles}} \times b$, which stores the hashed value of each band of each signature vector (initialized at 0)</li>
        </ul>
        The second step after the initialisation is to fill <code>matrix</code>. To do so, <code>full_LSH_matrix</code> goes through every signature vector, <code>__LocalitySensitiveHashing</code> cuts it into <code>b</code> bands of <code>r</code> values according to the parameters set at initialisation, and each band is hashed with <code>__hash_naive</code>.
        <br>
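        The last step is to find the similar items. This is done by the function <code>find_similarity</code>, which compares the values of <code>matrix</code> between articles, band by band: two articles are reported as similar as soon as they share the same hashed value in at least one band. <a href="#LSH"> [Link]</a>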
    </li>
</ul>
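
The threshold estimate $s=\left(\frac{1}{b}\right)^{\frac{1}{r}}$ listed above can be checked numerically. The sketch below is only illustrative: the helper names `lsh_threshold` and `candidate_probability` are not part of the notebook, and the second one relies on the standard banding result that two signatures agreeing on a fraction $s$ of their components share at least one identical band with probability $1-(1-s^r)^b$ (assuming the components agree independently).

```python
def lsh_threshold(b, r):
    """Approximate similarity at which the banding S-curve rises steeply."""
    return (1 / b) ** (1 / r)


def candidate_probability(s, b, r):
    """Probability that two signatures with agreement s share at least one band."""
    return 1 - (1 - s ** r) ** b


if __name__ == "__main__":
    b, r = 20, 5  # parameters used in this report
    print("Estimated threshold:", lsh_threshold(b, r))
    for s in (0.3, 0.55, 0.8):
        print("s =", s, "-> candidate probability:", candidate_probability(s, b, r))
```

The probability is never exactly 0 below the threshold nor exactly 1 above it, which is one way to understand why a few false positives and false negatives are expected.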

# Performances

## Performance of the model $k=10, n=100, b=20, r=5$
We have reported the computation time of the different steps of this model.
<table>
    <tr>
        <th>Step</th>
        <th>Computation time (in sec)</th>
    </tr>
    <tr>
        <td>Generation of the shinglings</td>
        <td>2.4</td>
    </tr>
    <tr>
        <td>Comparison of all sets of shinglings</td>
        <td>175</td>
    </tr>
    <tr>
        <td>Creation of a set of all shinglings</td>
        <td>80.0</td>
    </tr>
    <tr>
        <td>Creation of a global dictionary</td>
        <td>0.8</td>
    </tr>
    <tr>
        <td>Creation of the 100 permutation functions</td>
        <td>10.4</td>
    </tr>
    <tr>
        <td>Creation of the hashed matrix</td>
        <td>79.2</td>
    </tr>
    <tr>
        <td>Comparison of all signature values</td>
        <td>20.1</td>
    </tr>
    <tr>
        <td>Computation of LSH and find similar items</td>
        <td>6.5</td>
    </tr>
</table>
From this table, we can compare the total time of the different models, which correspond to where we stop in the process.
<table>
    <tr>
        <th>Model</th>
        <th>Computation time (in sec)</th>
    </tr>
    <tr>
        <td>Comparison of shinglings</td>
        <td>177.4</td>
    </tr>
    <tr>
        <td>Comparison of signatures</td>
        <td>192.9</td>
    </tr>
    <tr>
        <td>Comparison of LSH</td>
        <td>179.3</td>
    </tr>
</table>
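This result goes against our goal (and intuition): each further step adds an approximation, so it should run faster than the exact comparison of shinglings. The overhead is mostly due to the preprocessing, in particular the step "Creation of a set of all shinglings" (80.0 s). However, the number of pairs to compare grows quadratically with the number of articles, so for a larger sample the approximate models should become more efficient.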
<br>
With $b=20$ and $r=5$, the estimated threshold of the <code>LSH_model</code> is about 0.55. The model detected 12 similar pairs of articles, but one of them is a false positive and one truly similar pair was missed. These results are summarized in the following table.
<table>
    <tr>
        <th></th>
        <th scope="col">Actual True</th>
        <th scope="col">Actual False</th>
        <th scope="col">Total</th>
    </tr>
    <tr>
        <th scope="row">Predicted True</th>
        <td>11</td>
        <td>1</td>
        <td>12</td>
    </tr>
    <tr>
        <th scope="row">Predicted False</th>
        <td>1</td>
        <td>499487</td>
        <td>499488</td>
    </tr>
    <tr>
        <th scope="row">Total</th>
        <td>12</td>
        <td>499488</td>
        <td>499500</td>
    </tr>
</table>
We obtain $F_{score}=0.92$ (precision and recall are both equal to $11/12$), which confirms that our results are relevant.
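
As a quick check, the score can be recomputed from the confusion table above. This is a minimal sketch assuming the usual $F_1$ definition (harmonic mean of precision and recall); the function name `f_score` is only illustrative.

```python
def f_score(true_pos, false_pos, false_neg):
    """F1 score: harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values read from the table: 11 true positives, 1 false positive, 1 false negative
    print("F-score:", round(f_score(11, 1, 1), 2))
```

The large number of true negatives does not enter the formula, which is why this score is more informative here than plain accuracy.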

## Comparison of the performance by modifying $k$
To see the impact of $k$, we ran the whole model for each value of $k$ in $⟦5;14⟧$, keeping the other parameters unchanged. The first observation is that the computation time increases quickly and then stabilises from $k=11$.
![](Results/k_time.png)
We were also interested in whether some value of $k$ gives more relevant results. It seems that from $k=6$ onwards, the number of similar articles found hardly varies.
![](Results/k_item.png)
Combining these two observations, the value that gives both quick and relevant results seems to be $k=6$.

In [43]:
import numpy as np
import os
import time

In [2]:
t_0 = time.time()

# Import data

In [3]:
dir_path = "Data/"
sample_size = 1000
tab_data=[]
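for file_name in np.sort(os.listdir(dir_path)):
    # "ansi" is a Windows-only codec name; on other systems, replace it with the code page of the files
    with open(os.path.join(dir_path, file_name), encoding="ansi") as f:
        tab_data.append(f.read())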

In [4]:
print("The average amount of characters of the articles are",np.mean([len(a) for a in tab_data]))

The average amount of characters of the articles are 3985.97


# Shingling function

In [5]:
k=10

In [6]:
def shingling (str_doc, k):
    res = set()
    for i in range (len(str_doc)-k+1):
        sample = str_doc[i:i+k]
        hashed_sample = hash(sample)
        res.add(hashed_sample)
    return res

def generate_tab_sh(tab_data, k):
    tab_sh=[]
    for data in tab_data:
        tab_sh.append(shingling(data,k))
    return tab_sh
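
One caveat about `hash`: since Python 3.3, the built-in hash of a string is salted per interpreter process (unless `PYTHONHASHSEED` is fixed), so the shingling values are consistent within one run of the notebook but differ between runs. If reproducible values are needed, a deterministic hash can be swapped in. The sketch below uses `zlib.crc32` from the standard library; this is an alternative, not the function used in this work, and the name `stable_shingling` is hypothetical.

```python
import zlib


def stable_shingling(str_doc, k):
    """Set of CRC32 hashes of all k-character substrings of str_doc.

    Unlike the built-in hash(), zlib.crc32 returns the same value in every run.
    """
    res = set()
    for i in range(len(str_doc) - k + 1):
        sample = str_doc[i:i + k]
        res.add(zlib.crc32(sample.encode("utf-8")))
    return res


if __name__ == "__main__":
    print(len(stable_shingling("the quick brown fox jumps over the lazy dog", 10)))
```

CRC32 only produces 32-bit values, so collisions are more likely than with the 64-bit built-in hash; this is a trade-off to keep in mind for large collections.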

In [85]:
print("Generation of the shinglings")
t_begin = time.time()
tab_sh = generate_tab_sh(tab_data, k)
print("Duration:", round(time.time()-t_begin,1),"s")

Generation of the shinglings
Duration: 2.4 s


# Compare Sets

In [8]:
def CompareSets(set1,set2):
    size_1, size_2, size_union = len(set1), len(set2), len(set1.union(set2)) # Cardinal (set1), Card(set2), Card(set1 u set2)
    size_inter = size_1 + size_2 - size_union # Card(set1 n set 2) = Card(set1) + Card (set2) - Card(set1 u set 2)
    return size_inter/size_union

In [109]:
print("Computation of all comparison value")
t_begin = time.time()
Compare_set_values = []
for i in range(len(tab_sh)):
    for j in range(i):
        Compare_set_values.append(CompareSets(tab_sh[i],tab_sh[j]))
print("Duration:", round(time.time()-t_begin,1),"s")

Computation of all comparison value
Duration: 175.0 s


In [110]:
a = np.array(Compare_set_values)
exp_res_sim = a[np.where(a>0.549)[0]]

exp_res_sim

array([0.58860898, 0.98469876, 1.        , 1.        , 0.66825596,
       1.        , 0.6712404 , 0.64778053, 0.9080033 , 1.        ,
       0.99714081, 0.9999127 ])

# MinHashing

Creation of a dictionary to number the shinglings from 0 to $N_{\#_{Shingling}}-1$.

In [18]:
def preprocessing_MinHashing(tab_sh, printage=False):
    if (printage):
        print("Creation of a set of all shinglings")
    t_begin = time.time()
    union_sh =set()
    for sh in (tab_sh):
        union_sh = union_sh.union(sh) #creation of a global set of all the values 
    if (printage):
        print("Duration:", round(time.time()-t_begin,1),"s")
    t_begin = time.time()
    if (printage):
        print("Creation of a global dictionary")
    dict_int_to_sh={}
    for i,hash_val in enumerate(list(union_sh)):
        dict_int_to_sh[i]=hash_val  # creation of a dictionary in order to link each shingle to a unique row in the matrix
    if (printage):
        print("Duration:", round(time.time()-t_begin,1),"s")
    return union_sh,dict_int_to_sh

In [19]:
union_sh,dict_int_to_sh = preprocessing_MinHashing(tab_sh, printage=True)
# dict_sh_to_int = {v: k for k, v in dict_int_to_sh.items()}  # reversed dictionary, only needed for Method 1

Creation of a set of all shinglings
Duration: 80.0 s
Creation of a global dictionary
Duration: 0.8 s
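
The 80 s spent on the global set are probably explained in large part by `union_sh = union_sh.union(sh)`, which builds a new copy of the accumulated set at every iteration. A possible variant, given here as an untimed sketch, updates the set in place and builds both dictionaries at once. The function name `build_shingle_index` is hypothetical and is not used elsewhere in the notebook.

```python
def build_shingle_index(tab_sh):
    """Build the global set of shinglings and the two index dictionaries."""
    union_sh = set()
    for sh in tab_sh:
        union_sh.update(sh)  # in-place update: no copy of the accumulated set
    # Link each shingling to a unique row number, and back
    dict_int_to_sh = dict(enumerate(union_sh))
    dict_sh_to_int = {sh: i for i, sh in dict_int_to_sh.items()}
    return union_sh, dict_int_to_sh, dict_sh_to_int


if __name__ == "__main__":
    union_sh, dict_int_to_sh, dict_sh_to_int = build_shingle_index([{1, 2}, {2, 3}])
    print(len(union_sh), "distinct shinglings")
```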


## Method 1 : With creation of characteristic matrix
Issue: the characteristic matrix can be too heavy in memory.

Creation of a matrix of shape (# of shinglings, # of articles)
```Python
matrix = np.zeros((len(union_sh), sample_size)) # well-sized matrix initialised at 0 (np.empty would leave arbitrary values) # too heavy in memory
for i,sh in enumerate(tab_sh):
    for hash_val in sh:
        matrix[dict_sh_to_int[hash_val]][i]=1 # filling of the matrix
```

Generation of 100 permutation functions
```Python
n = 100
tab_permutations =[]  #generation of n permutations of [0,1,....,N_sh-1] => equivalent to n hash functions
for i in range(n):
    tab_permutations.append(np.arange(len(union_sh)))
    np.random.shuffle(tab_permutations[i])
``` 

Computation of the hashed matrix
```Python
matrix_hashed = np.zeros((sample_size,n))
for i in range(n):
    for j in range(sample_size):
        ind = 0
        while (matrix[tab_permutations[i][ind]][j]==0):  # walk along permutation i until a shingling of article j is found
            ind+=1
        matrix_hashed[j][i]=ind
matrix_hashed
```
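
To get a feeling for why Method 1 runs out of memory, the size of the dense characteristic matrix can be estimated before allocating it. In the sketch below, the helper name `dense_matrix_gib` and the figure of 1,000,000 distinct shinglings are hypothetical (the actual value is `len(union_sh)`); the 8 bytes per entry correspond to the default `float64` dtype of NumPy arrays.

```python
def dense_matrix_gib(n_shinglings, n_articles, bytes_per_entry=8):
    """Memory needed by a dense (n_shinglings x n_articles) matrix, in GiB."""
    return n_shinglings * n_articles * bytes_per_entry / 2 ** 30


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 distinct shinglings for 1000 articles
    print(round(dense_matrix_gib(1_000_000, 1000), 2), "GiB")
```

Since almost all entries are zeros, Method 2 avoids the problem by working directly on the sets of shinglings.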

## Method 2 : Direct creation of the MinHashing matrix

In [20]:
n = 100

In [21]:
def MinHashing(tab_sh, printage=False):
    global dict_int_to_sh
    global union_sh
    
    t_begin = time.time()
    if (printage):
        print("Creation of the",n, "permutions functions")
    tab_permutations_2 =[]  #generation of n permutations of [0,1,....,N_sh-1] => equivalent to n hash functions
    
    for i in (range(n)):
        tab_permutations_2.append(np.arange(len(union_sh)))
        np.random.shuffle(tab_permutations_2[i])
    if (printage):
        print("Duration:", round(time.time()-t_begin,1),"s")
        
    
    t_begin = time.time()
    if (printage):
        print("Creation of the hashed matrix")
    matrix_hashed_2 = np.zeros((sample_size,n))
    
    for i in (range(n)):
        for j,sh in enumerate(tab_sh):
            ind = 0
            while (dict_int_to_sh[tab_permutations_2[i][ind]] not in sh):
                ind+=1
            matrix_hashed_2[j][i]=ind
    if (printage):
        print("Duration:", round(time.time()-t_begin,1),"s")
        
    return matrix_hashed_2
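
A common alternative to explicit permutations, shown here only as a sketch and not the method used in this work, replaces each permutation with a random linear hash function $h(x) = (a x + c) \bmod P$ and keeps the minimum of $h$ over each set. This avoids storing `n` arrays of length $N_{\#_{Shingling}}$ and the `while` search. The assumptions are: shinglings are represented by non-negative integer ids smaller than `P` (for example the values of a dictionary such as `dict_sh_to_int`), every set is non-empty, and the names `minhash_signatures` and `P` are hypothetical.

```python
import numpy as np

P = 2_147_483_647  # Mersenne prime 2**31 - 1, assumed larger than every shingling id


def minhash_signatures(doc_ids, n_hashes, seed=0):
    """MinHash signatures from random linear hash functions h(x) = (a*x + c) % P.

    doc_ids: list of non-empty sets of integer ids in [0, P).
    Returns an array of shape (len(doc_ids), n_hashes).
    """
    rng = np.random.default_rng(seed)
    a = rng.integers(1, P, size=n_hashes, dtype=np.int64)  # a != 0
    c = rng.integers(0, P, size=n_hashes, dtype=np.int64)
    signatures = np.empty((len(doc_ids), n_hashes), dtype=np.int64)
    for row, ids in enumerate(doc_ids):
        x = np.fromiter(ids, dtype=np.int64, count=len(ids))
        # One row per hash function, one column per shingling of the document.
        # a*x + c stays below 2**63 because a, c and x are all below 2**31.
        hashed = (a[:, None] * x[None, :] + c[:, None]) % P
        signatures[row] = hashed.min(axis=1)
    return signatures


if __name__ == "__main__":
    sig = minhash_signatures([{1, 2, 3}, {1, 2, 3}, {7, 8, 9}], n_hashes=100)
    print(sig.shape)
```

The stored values are hash values rather than positions in a permutation, but the signature comparison only tests equality, so the rest of the pipeline would be unchanged.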

In [22]:
matrix_hashed_2 = MinHashing(tab_sh, printage=True)

Creation of the 100 permutions functions
Duration: 10.4 s
Creation of the hashed matrix
Duration: 79.2 s


# CompareSignatures

In [23]:
def CompareSignatures(mat_hashed,i,j):
    '''Comparison between row i and j of the MinHashed matrix'''
    similar = 0
    for k,a in enumerate(mat_hashed[i]):
        if (a==mat_hashed[j][k]):
            similar+=1
    return similar/len(mat_hashed[j])
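
The same comparison can be written without an explicit Python loop. This is an equivalent sketch assuming `mat_hashed` is a NumPy array, as `matrix_hashed_2` is; the name `compare_signatures_vectorized` is hypothetical.

```python
import numpy as np


def compare_signatures_vectorized(mat_hashed, i, j):
    """Proportion of signature components that are equal in rows i and j."""
    return float(np.mean(mat_hashed[i] == mat_hashed[j]))


if __name__ == "__main__":
    demo = np.array([[1, 2, 3, 4],
                     [1, 0, 3, 0]])
    print(compare_signatures_vectorized(demo, 0, 1))
```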

In [24]:
def list_comparision_signature(matrix_hashed):
    compared_signature_values = []
    print("Creation of comparison of signature values")
    t_begin = time.time()
    for i in (range(sample_size)):
        for j in range(i):
            compared_signature_values.append(CompareSignatures(matrix_hashed,i,j))
    print("Duration:", round(time.time()-t_begin,1),"s")
    return compared_signature_values

In [25]:
compared_signature_values = list_comparision_signature(matrix_hashed_2)

Creation of comparison of signature values
Duration: 20.1 s


In [26]:
val, t, count = np.unique(compared_signature_values, return_index=True, return_counts=True)
val,count

(array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
        0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.19, 0.2 , 0.24, 0.25,
        0.32, 0.41, 0.42, 0.47, 0.62, 0.66, 0.72, 0.75, 0.76, 0.91, 1.  ]),
 array([383021,  93197,  18226,   3514,    813,    326,    142,     72,
            76,     30,     25,     11,      5,      8,      1,      4,
             1,      2,      3,      2,      1,      1,      3,      1,
             1,      1,      1,      1,      1,      1,      1,      1,
             7], dtype=int64))

# LSH

In [91]:
class LSH:
    def __init__(self,b,r,N=1e9, n_article = sample_size,print_s=False):
        self.s = np.power(1/b,1/r)
        self.b = b
        self.r = r
        self.N = N 
        self.random_tab = np.random.randint(1e6, size=(b,r+1))
        self.matrix = np.zeros((n_article,self.b))
        if (print_s):
            print("Estimated s = ",self.s)

    def  __hash_naive(self, i, arr): #private function
        res = self.random_tab[i][-1]
        for j,a in enumerate(arr):
            res+=a*self.random_tab[i][j]
        return res%self.N
    
    def __LocalitySensitiveHashing(self, hashed_vector) -> np.ndarray:
        res = np.zeros(self.b)
        for i in range(self.b):
            sample = hashed_vector[i*self.r:(i+1)*self.r]
            res[i] = self.__hash_naive(i,sample)
        return res #length b
    
    def full_LSH_matrix(self, minhash_matrix):
        for i,vector in enumerate(minhash_matrix):
            self.matrix[i] = self.__LocalitySensitiveHashing(vector) #
        self.matrix = self.matrix.transpose()
            
    def find_similarity(self):
        set_sim = set()
        for band in range(self.b):
            for i in range(np.shape(self.matrix)[1]):
                for j in range(i):
                    if (self.matrix[band][i]==self.matrix[band][j]):
                        set_sim.add((i,j))
        return set_sim
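
`find_similarity` compares every pair of articles in every band. A usual way to avoid this quadratic scan, sketched below as an alternative and not as the method used in this work, is to group the articles of each band into buckets keyed by their hashed value, so that only articles falling in the same bucket are paired. The function name `candidate_pairs` is hypothetical; its input has the same layout as `self.matrix` after the transposition (one row per band, one column per article).

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(band_hashes):
    """Pairs (i, j) with i > j that share the same hashed value in at least one band."""
    pairs = set()
    for band in band_hashes:
        buckets = defaultdict(list)
        for doc, value in enumerate(band):
            buckets[value].append(doc)  # documents are appended in increasing order
        for docs in buckets.values():
            for low, high in combinations(docs, 2):
                pairs.add((high, low))  # same (larger, smaller) order as find_similarity
    return pairs


if __name__ == "__main__":
    demo = [[1.0, 2.0, 1.0],   # band 0: articles 0 and 2 collide
            [5.0, 5.0, 6.0]]   # band 1: articles 0 and 1 collide
    print(sorted(candidate_pairs(demo)))
```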

In [92]:
time_begin=time.time()
LSH_model = LSH(20,5,print_s=True)

Estimated s =  0.5492802716530588


In [93]:
LSH_model.full_LSH_matrix(matrix_hashed_2)

In [94]:
similar = LSH_model.find_similarity()
time.time()-time_begin

7.315013647079468

In [95]:
for sim in list(similar):
    print("Similarity found between",sim,". Real similarity :",round(CompareSets(tab_sh[sim[0]],tab_sh[sim[1]]),3))

Similarity found between (556, 328) . Real similarity : 0.671
Similarity found between (601, 559) . Real similarity : 0.299
Similarity found between (472, 338) . Real similarity : 1.0
Similarity found between (324, 161) . Real similarity : 0.985
Similarity found between (984, 875) . Real similarity : 1.0
Similarity found between (165, 164) . Real similarity : 0.589
Similarity found between (430, 381) . Real similarity : 1.0
Similarity found between (981, 838) . Real similarity : 0.997
Similarity found between (947, 728) . Real similarity : 0.908
Similarity found between (460, 353) . Real similarity : 0.668
Similarity found between (956, 708) . Real similarity : 1.0
Similarity found between (421, 345) . Real similarity : 1.0


In [112]:
print("Count found :",len(similar)," Expected value :",len(exp_res_sim))
false_pos = [sim for sim in similar if 0.549>CompareSets(tab_sh[sim[0]],tab_sh[sim[1]])]
print("# false positive :",len(false_pos),"; # false negative :",len(exp_res_sim)-len(similar)+len(false_pos))

Count found : 12  Expected value : 12
# false positive : 1 ; # false negative : 1


In [32]:
time.time()-t_0

759.9965267181396

# Comparison of results depending on $k$

```Python
import pandas as pd

k_values = np.arange(5,15)
numb_sim = []
time_tab = []
n=100
b=20
r=5

for k in k_values:
    time_begin = time.time()
    tab_sh = generate_tab_sh(tab_data, k)
    union_sh, dict_int_to_sh = preprocessing_MinHashing(tab_sh)
    matrix_hashed_2 = MinHashing(tab_sh)
    LSH_model = LSH(b,r)
    LSH_model.full_LSH_matrix(matrix_hashed_2)
    similar = LSH_model.find_similarity()
    numb_sim.append(len(similar))
    time_tab.append(time.time()-time_begin)

df_result = pd.DataFrame(data={'k':k_values,'Estimation of  similar items':numb_sim,'Computation time' : time_tab}).set_index('k')
df_result.to_csv('Results/comparison_k.csv', sep=';')    
```

In [50]:
import pandas as pd
import plotly.graph_objects as go

df_result = pd.read_csv('Results/comparison_k.csv', sep=';', index_col='k')  # restore k as index so that the x-axis shows k

In [84]:
go.Figure(data=[go.Scatter(x=df_result.index,y=df_result["Estimation of  similar items"], mode="lines", hovertemplate='k = %{x}<br>Number = %{y}<extra></extra>')],
         layout={'xaxis_title':"Length k of a shingling",
                 'yaxis_title':"Number of similar items",
                 'title':"Number of similar items depending on the size of a shingling"})

In [58]:
go.Figure(data=[go.Scatter(x=df_result.index,y=df_result["Computation time"], mode="lines", hovertemplate='k = %{x}<br>Time = %{y}s<extra></extra>')],
         layout={'xaxis_title':"Length k of a shingling",
                 'yaxis_title':"Computation time (in s)",
                 'title':"Computation time depending on the size of a shingling"})