# Computing the cosine and Levenshtein distances


## Objective(s)

- In the US *Création table poids obtenus via le Word2Vec*, we prepared a table listing the most frequent words in the training set together with their weights. In this new step, we compute the similarity between the words that differ between the INSEE address and the INPI address.
- In the US *Creation table inpi insee contenant le test `status_cas` a effectuer pour dedoublonner les lignes*, we created two variables, `list_excep_insee` and `list_except_inpi`, which hold the words that are not identical.
- In the US *Creation table merge INSEE INPI filtree*, we created the variable `row_id`, which lets us add the variables below to the table of cases.

Siretisation relies on an ordered matrix of business rules. Building that matrix first requires creating the variables the tests depend on.

The table below lists all the tests to run and their dependencies.


In this US, we create the variables that feed the tests `test_similarite_exception_words` and `test_distance_levhenstein_exception_words`. The variables are:

- `unzip_inpi`: word compared on the INPI side
- `unzip_insee`: word compared on the INSEE side
- `max_cosine_distance`: highest similarity score between the compared INPI word and INSEE word (despite its name, the value is a cosine similarity, so a higher score means closer words)
- `levenshtein_distance`: number of single-character edits needed to turn one word into the other
- `key_except_to_test`: key-value field listing every combination of the words that the INSEE and INPI addresses do not have in common
- The variable `row_id` must be kept.
- The similarity must be computed over all the non-common elements, and then the highest similarity is retained.
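
As a rough illustration of how these variables could feed the two tests, the sketch below applies the thresholds that appear in the commented-out `CASE WHEN` lines of the main query (`>= .6` for the cosine score, `<= 1` for the Levenshtein distance). The helper name, the default thresholds, and the mapping of those commented lines to the two test names are assumptions made for the example, not the rules finally retained.

```python
def exception_word_tests(max_cosine_distance, levenshtein_distance,
                         cosine_threshold=0.6, levenshtein_threshold=1):
    """Return the two boolean tests for one row (hypothetical helper).

    Thresholds are assumptions taken from the commented-out CASE WHEN
    lines of the query; adjust them to the rules actually in use.
    """
    return {
        # Words are considered similar when the cosine score is high enough.
        "test_similarite_exception_words":
            max_cosine_distance >= cosine_threshold,
        # Words are considered near-identical when few edits separate them.
        "test_distance_levhenstein_exception_words":
            levenshtein_distance <= levenshtein_threshold,
    }


# Illustrative values only, not taken from the table.
print(exception_word_tests(max_cosine_distance=0.72, levenshtein_distance=1))
```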

## Metadata 

* Metadata parameters are available here: 
* US Title: Calcul de la distance de Cosine et Levhenstein
* Epic: Epic 8
* US: US 8
* Date Begin: 9/8/2020
* Duration Task: 0
* Status:  
* Source URL: [US 08 PreparationWord2Vec](https://coda.io/d/_dCtnoqIftTn/US-08-PreparationWord2Vec_su_Xz)
* Task type:
  * Jupyter Notebook
* Users:
  * Thomas Pernet
* Watchers:
  * Thomas Pernet
* Estimated Log points:
  * One being a simple task, 15 a very difficult one
  *  5
* Task tag
  *  #computation,#sql-query,#machine-learning,#word2vec,#similarite,#preparation-similarite
* Toggl Tag
  * #data-preparation
  
## Input Cloud Storage [AWS]

If link from the internet, save it to the cloud first

### Tables [AWS]

1. Batch 1:
  * Select Provider: Athena
  * Select table(s): ets_insee_inpi_cases,list_weight_mots_insee_inpi_word2vec,ets_insee_inpi_status_cas
    * Select only tables created from the same notebook, else copy/paste selection to add new input tables
    * If table(s) does not exist, add them: Add New Table
    * Information:
      * Region: 
        * Name: Europe (Paris)
        * Code: eu-west-3
      * Database: siretisation
      * Notebook construction file: 
        * 05_creation_table_cases
        * 07_creation_table_poids_Word2Vec
        * 02_cas_de_figure
    
## Destination Output/Delivery

1. AWS
    1. Athena: 
      * Region: Europe (Paris)
      * Database: siretisation
      * Tables (Add name new table): ets_inpi_similarite_max_word2vec
      * List new tables
      * ets_inpi_similarite_max_word2vec

## Things to know (Steps, Attention points or new flow of information)

### Sources of information  (meeting notes, Documentation, Query, URL)

1. Jupyter Notebook (Github Link)
  1. md : [07_creation_table_poids_Word2Vec.md](https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/08_US_DATUM/07_creation_table_poids_Word2Vec.md)


## Server connection

In [1]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_athena import service_athena
from awsPy.aws_s3 import service_s3
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import os, shutil

path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = r"{}/credential_AWS.json".format(parent_path)
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = 'eu-west-3')

region = 'eu-west-3'
bucket = 'calfdata'

In [2]:
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = region)
client= con.client_boto()
s3 = service_s3.connect_S3(client = client,
                      bucket = bucket, verbose = False) 

In [3]:
pandas_setting = True
if pandas_setting:
    cm = sns.light_palette("green", as_cmap=True)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

# Input/output

In [4]:
s3_output = 'inpi/sql_output'
database = 'siretisation'

In [5]:
query = """
DROP TABLE IF EXISTS siretisation.ets_inpi_similarite_max_word2vec;
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
  filename = None, ## Add filename to print dataframe
  destination_key = None ### Add destination key if need to copy output
        )

{'Results': {'State': 'SUCCEEDED',
  'SubmissionDateTime': datetime.datetime(2020, 9, 8, 10, 59, 20, 199000, tzinfo=tzlocal()),
  'CompletionDateTime': datetime.datetime(2020, 9, 8, 10, 59, 21, 369000, tzinfo=tzlocal())},
 'QueryID': 'a4c184bc-05c7-47ab-b705-9ab3fb718c62'}

In [7]:
query = """
CREATE TABLE siretisation.ets_inpi_similarite_max_word2vec
WITH (
  format='PARQUET'
) AS
WITH dataset AS (
  SELECT 
    siretisation.ets_insee_inpi.row_id, 
    index_id, 
    status_cas, 
    inpi_except, 
    insee_except, 
    transform(
      sequence(
        1, 
        CARDINALITY(insee_except)
      ), 
      x -> insee_except
    ), 
    ZIP(
      inpi_except, 
      transform(
        sequence(
          1, 
          CARDINALITY(inpi_except)
        ), 
        x -> insee_except
      )
    ) as test 
  FROM 
    siretisation.ets_insee_inpi  
    
  LEFT JOIN siretisation.ets_insee_inpi_status_cas 
  ON siretisation.ets_insee_inpi.row_id = siretisation.ets_insee_inpi_status_cas.row_id
  where 
    (status_cas != 'CAS_2' AND CARDINALITY(inpi_except)  > 0 AND CARDINALITY(insee_except) > 0)
  )
 SELECT 
  * 
FROM 
  (
    WITH distance AS (
      SELECT 
        * 
      FROM 
        (
          WITH list_weights_insee_inpi AS (
            SELECT 
              row_id, 
              index_id, 
              status_cas, 
              inpi_except, 
              insee_except, 
              unzip_inpi, 
              unzip_insee, 
              list_weights_inpi, 
              list_weights_insee 
            FROM 
              (
                SELECT 
                  row_id, 
                  index_id, 
                  status_cas, 
                  inpi_except, 
                  insee_except, 
                  unzip.field0 as unzip_inpi, 
                  unzip.field1 as insee, 
                  test 
                FROM 
                  dataset CROSS 
                  JOIN UNNEST(test) AS new (unzip)
              ) CROSS 
              JOIN UNNEST(insee) as test (unzip_insee) 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_inpi 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_inpi ON unzip_inpi = tb_weight_inpi.words 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_insee 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_insee ON unzip_insee = tb_weight_insee.words 
          ) 
          SELECT 
            row_id, 
            index_id, 
            status_cas, 
            inpi_except, 
            insee_except, 
            unzip_inpi, 
            unzip_insee, 
            REDUCE(
              zip_with(
                list_weights_inpi, 
                list_weights_insee, 
                (x, y) -> x * y
              ), 
              CAST(
                ROW(0.0) AS ROW(sum DOUBLE)
              ), 
              (s, x) -> CAST(
                ROW(x + s.sum) AS ROW(sum DOUBLE)
              ), 
              s -> s.sum
            ) / (
              SQRT(
                REDUCE(
                  transform(
                    list_weights_inpi, 
                    (x) -> POW(x, 2)
                  ), 
                  CAST(
                    ROW(0.0) AS ROW(sum DOUBLE)
                  ), 
                  (s, x) -> CAST(
                    ROW(x + s.sum) AS ROW(sum DOUBLE)
                  ), 
                  s -> s.sum
                )
              ) * SQRT(
                REDUCE(
                  transform(
                    list_weights_insee, 
                    (x) -> POW(x, 2)
                  ), 
                  CAST(
                    ROW(0.0) AS ROW(sum DOUBLE)
                  ), 
                  (s, x) -> CAST(
                    ROW(x + s.sum) AS ROW(sum DOUBLE)
                  ), 
                  s -> s.sum
                )
              )
            ) AS cosine_distance 
          FROM 
            list_weights_insee_inpi
        )
    ) 
    SELECT 
      row_id, 
      dataset.index_id, 
      inpi_except, 
      insee_except, 
      unzip_inpi, 
      unzip_insee, 
      max_cosine_distance,
      -- CASE WHEN max_cosine_distance >= .6 THEN 'TRUE' ELSE 'FALSE' END AS test_distance_cosine,
      test as key_except_to_test,
      levenshtein_distance(unzip_inpi, unzip_insee) AS levenshtein_distance
      -- CASE WHEN levenshtein_distance(unzip_inpi, unzip_insee) <=1  THEN 'TRUE' ELSE 'FALSE' END AS test_distance_levhenstein
    
    FROM 
      dataset 
      LEFT JOIN (
        SELECT 
          distance.index_id, 
          unzip_inpi, 
          unzip_insee, 
          max_cosine_distance 
        FROM 
          distance 
          RIGHT JOIN (
            SELECT 
              index_id, 
              MAX(cosine_distance) as max_cosine_distance 
            FROM 
              distance 
            GROUP BY 
              index_id
          ) as tb_max_distance ON distance.index_id = tb_max_distance.index_id 
          AND distance.cosine_distance = tb_max_distance.max_cosine_distance
      ) as tb_max_distance_lookup ON dataset.index_id = tb_max_distance_lookup.index_id
  )
"""

s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
  filename = None, ## Add filename to print dataframe
  destination_key = None ### Add destination key if need to copy output
        )

{'Results': {'State': 'SUCCEEDED',
  'SubmissionDateTime': datetime.datetime(2020, 9, 8, 11, 22, 3, 835000, tzinfo=tzlocal()),
  'CompletionDateTime': datetime.datetime(2020, 9, 8, 11, 24, 33, 999000, tzinfo=tzlocal())},
 'QueryID': 'c98e86b7-c6d3-450a-b415-2a8f178c92d5'}

# Step by step

Retrieving the highest similarity between the words that the INSEE and INPI addresses do not have in common takes several steps, described below.



### 1. Filter and build the set of similarities to compute

- Keep only the rows that are not `CAS_2` and whose `except` word lists have a cardinality greater than 0. There is no need to compute a similarity when either list, INSEE or INPI, is empty.
- Create a key-value field listing all the similarities to compute.

We use the following functions:

- `transform`
- `ZIP`
- `sequence`

The difficulty in this step was finding a way to duplicate the INSEE list for each key of the INPI list. The trick is to use `sequence` to repeat the INSEE list as many times as there are keys on the INPI side. The size of the INSEE list has no impact; what matters is the size of the INPI list. For instance, if the INPI list has two values (two keys) and the INSEE list has three elements, the code repeats the INSEE list twice, because there are two keys on the INPI side.

Concrete example:

- INPI -> [FRERES, AMADEO]
- INSEE -> [MARTYRS, RESISTANCE]
- Comparisons to make:
    - FRERES -> [MARTYRS, RESISTANCE]
    - AMADEO -> [MARTYRS, RESISTANCE]
- Final key-value field -> [
{field0=FRERES, field1=[MARTYRS, RESISTANCE]},
{field0=AMADEO, field1=[MARTYRS, RESISTANCE]}
]
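
The repeat-then-zip logic can be mimicked in plain Python. This is only a sketch of what `ZIP(inpi_except, transform(sequence(1, CARDINALITY(inpi_except)), x -> insee_except))` produces; the function name `build_key_value` is hypothetical, and the `field0`/`field1` keys simply copy the names shown in the Athena output.

```python
def build_key_value(inpi_except, insee_except):
    """Pair each INPI word with the full INSEE list (hypothetical helper)."""
    # Repeat the INSEE list once per INPI word: the role played by
    # sequence(1, CARDINALITY(inpi_except)) combined with transform.
    repeated = [list(insee_except) for _ in range(len(inpi_except))]
    # Zip each INPI word (the key) with its own copy of the INSEE list.
    return [
        {"field0": inpi_word, "field1": insee_words}
        for inpi_word, insee_words in zip(inpi_except, repeated)
    ]


key_value = build_key_value(["FRERES", "AMADEO"], ["MARTYRS", "RESISTANCE"])
for entry in key_value:
    print(entry)
```

The number of entries depends only on the length of the INPI list, which is why the query repeats on `CARDINALITY(inpi_except)` inside the `ZIP`.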

In [11]:
query = """
WITH dataset AS (
  SELECT 
    siretisation.ets_insee_inpi.row_id, 
    index_id, 
    status_cas, 
    inpi_except, 
    insee_except, 
    transform(
      sequence(
        1, 
        CARDINALITY(insee_except)
      ), 
      x -> insee_except
    ), 
    ZIP(
      inpi_except, 
      transform(
        sequence(
          1, 
          CARDINALITY(inpi_except)
        ), 
        x -> insee_except
      )
    ) as test 
  FROM 
    siretisation.ets_insee_inpi  
    
  LEFT JOIN siretisation.ets_insee_inpi_status_cas 
  ON siretisation.ets_insee_inpi.row_id = siretisation.ets_insee_inpi_status_cas.row_id
  where 
    (status_cas != 'CAS_2' AND CARDINALITY(inpi_except)  > 0 AND CARDINALITY(insee_except) > 0 OR index_id = 4664896)
  LIMIT 10
  )
  
 SELECT 
  * 
  FROM dataset

"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
  filename = 'repeat', ## Add filename to print dataframe
  destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,row_id,index_id,status_cas,inpi_except,insee_except,_col5,test
0,1767,2846046,CAS_4,[BALZAC],"[QUATRE, SEPTEMBRE]","[[QUATRE, SEPTEMBRE], [QUATRE, SEPTEMBRE]]","[{field0=BALZAC, field1=[QUATRE, SEPTEMBRE]}]"
1,1769,3605111,CAS_3,[RUE],[AVENUE],[[AVENUE]],"[{field0=RUE, field1=[AVENUE]}]"
2,1783,3370482,CAS_4,[ZI],"[ZONE, INDUSTRIELLE]","[[ZONE, INDUSTRIELLE], [ZONE, INDUSTRIELLE]]","[{field0=ZI, field1=[ZONE, INDUSTRIELLE]}]"
3,1851,9559462,CAS_3,[ALLEE],[RUE],[[RUE]],"[{field0=ALLEE, field1=[RUE]}]"
4,1911,3990780,CAS_4,"[TRAVERSE, GOUFONNE, CENTRE, COMMERCIAL]","[AVENUE, GOUFFONNE, C]","[[AVENUE, GOUFFONNE, C], [AVENUE, GOUFFONNE, C], [AVENUE, GOUFFONNE, C]]","[{field0=TRAVERSE, field1=[AVENUE, GOUFFONNE, C]}, {field0=GOUFONNE, field1=[AVENUE, GOUFFONNE, C]}, {field0=CENTRE, field1=[AVENUE, GOUFFONNE, C]}, {field0=COMMERCIAL, field1=[AVENUE, GOUFFONNE, C]}]"
5,1947,10636946,CAS_3,[GRASSE],[RATTACHEMENT],[[RATTACHEMENT]],"[{field0=GRASSE, field1=[RATTACHEMENT]}]"
6,1968,5185540,CAS_3,[GRIBAL],[GIRBAL],[[GIRBAL]],"[{field0=GRIBAL, field1=[GIRBAL]}]"
7,2078,8523042,CAS_4,[SAUSSAIES],"[SAINT, HONORE]","[[SAINT, HONORE], [SAINT, HONORE]]","[{field0=SAUSSAIES, field1=[SAINT, HONORE]}]"
8,2110,1450128,CAS_4,"[ST, GERVAIS]","[DOCTEUR, PAUL, GERMAN]","[[DOCTEUR, PAUL, GERMAN], [DOCTEUR, PAUL, GERMAN], [DOCTEUR, PAUL, GERMAN]]","[{field0=ST, field1=[DOCTEUR, PAUL, GERMAN]}, {field0=GERVAIS, field1=[DOCTEUR, PAUL, GERMAN]}]"
9,2134,3951565,CAS_4,[FIRBEIX],"[LIEU, DIT]","[[LIEU, DIT], [LIEU, DIT]]","[{field0=FIRBEIX, field1=[LIEU, DIT]}]"


### 2. Cartesian product of the candidate pairs and weight lists

In the second step, we "explode" the key-value field so that the list of weights can be attached to each INPI word and each INSEE word.

The rest of the walkthrough uses the index `4664896`, which corresponds to the example above.

The `test` field is exploded with `CROSS JOIN UNNEST`. The variable `unzip_inpi` holds the keys of the `test` field, while `unzip_insee` holds the values. For this example, the explosion yields 4 rows in total (2 INPI words × 2 INSEE words).

We therefore end up with two columns holding the word pairs whose similarity must be computed, and two columns holding their weights.
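
To make the explosion concrete, here is a small Python sketch of the two `CROSS JOIN UNNEST` steps followed by the two `LEFT JOIN`s on the weight table. The helper names and the 3-dimensional weights are made up for the illustration; the real vectors in `list_weight_mots_insee_inpi_word2vec` are much longer.

```python
def explode(key_value):
    """Turn the key-value field into one row per (INPI word, INSEE word)."""
    rows = []
    for entry in key_value:                   # CROSS JOIN UNNEST(test)
        for insee_word in entry["field1"]:    # CROSS JOIN UNNEST(insee)
            rows.append((entry["field0"], insee_word))
    return rows


def attach_weights(rows, weights):
    """Emulate the LEFT JOINs: a word missing from the table gets None."""
    return [
        (inpi_word, insee_word, weights.get(inpi_word), weights.get(insee_word))
        for inpi_word, insee_word in rows
    ]


key_value = [
    {"field0": "FRERES", "field1": ["MARTYRS", "RESISTANCE"]},
    {"field0": "AMADEO", "field1": ["MARTYRS", "RESISTANCE"]},
]
# Toy weights, invented for the example.
toy_weights = {
    "FRERES": [1.0, 0.0, 2.0],
    "AMADEO": [0.5, 0.5, 0.0],
    "MARTYRS": [0.0, 1.0, 1.0],
}

exploded = explode(key_value)
for row in attach_weights(exploded, toy_weights):
    print(row)
```

In this toy table `RESISTANCE` has no weights, so its weight column is `None`, the same way a `LEFT JOIN` leaves `NULL` for a word absent from the weight table.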

In [12]:
query = """
WITH dataset AS (
  SELECT 
    siretisation.ets_insee_inpi.row_id, 
    index_id, 
    status_cas, 
    inpi_except, 
    insee_except, 
    transform(
      sequence(
        1, 
        CARDINALITY(insee_except)
      ), 
      x -> insee_except
    ), 
    ZIP(
      inpi_except, 
      transform(
        sequence(
          1, 
          CARDINALITY(inpi_except)
        ), 
        x -> insee_except
      )
    ) as test 
  FROM 
    siretisation.ets_insee_inpi  
    
  LEFT JOIN siretisation.ets_insee_inpi_status_cas 
  ON siretisation.ets_insee_inpi.row_id = siretisation.ets_insee_inpi_status_cas.row_id
  where 
    (status_cas != 'CAS_2' AND CARDINALITY(inpi_except)  > 0 AND CARDINALITY(insee_except) > 0 AND index_id = 4664896)
  
  )
SELECT 
              row_id, 
              index_id, 
              status_cas, 
              inpi_except, 
              insee_except, 
              unzip_inpi, 
              unzip_insee, 
              list_weights_inpi, 
              list_weights_insee 
            FROM 
              (
                SELECT 
                  row_id, 
                  index_id, 
                  status_cas, 
                  inpi_except, 
                  insee_except, 
                  unzip.field0 as unzip_inpi, 
                  unzip.field1 as insee, 
                  test 
                FROM 
                  dataset CROSS 
                  JOIN UNNEST(test) AS new (unzip)
              ) CROSS 
              JOIN UNNEST(insee) as test (unzip_insee) 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_inpi 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_inpi ON unzip_inpi = tb_weight_inpi.words 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_insee 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_insee ON unzip_insee = tb_weight_insee.words 
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
  filename = 'explosion', ## Add filename to print dataframe
  destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,row_id,index_id,status_cas,inpi_except,insee_except,unzip_inpi,unzip_insee,list_weights_inpi,list_weights_insee
0,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",FRERES,MARTYRS,"[-1.688985, 0.097557895, 1.0489448, -3.487443, -1.3399991, 0.4498464, -1.4566468, -0.87048954, -1.2435615, 0.3774105, 1.6745398, 0.9896413, 2.1044707, -0.51510066, 0.16149546, -3.0437603, 0.50869876, 2.7351184, -1.1178105, 0.5041334, 2.4126704, 0.53866184, -1.3713572, -1.2401533, 0.17355655, 0.047125872, 0.14253643, 1.4832045, -0.033414114, -0.32354763, -1.4184397, -2.347894, -1.3796191, -2.7735548, 1.5177174, 0.07628559, -0.14851883, 2.526894, 0.72204036, -1.2351246, 1.7430878, 1.8407371, 1.5273428, 1.6378278, -1.4441094, -0.92657965, 0.6114409, -0.5970426, 0.5298836, 2.105611, 1.4486173, 1.3387399, 0.28431913, -0.72167766, -0.4750457, -1.880534, 2.079888, 1.3568192, 1.4658229, -0.64631444, 1.4790719, 1.2740519, 4.9677095, 1.9248492, 0.14292595, 4.194341, -0.22896816, -0.8432166, 1.705185, 1.6199191, 5.807013, -0.26892865, -1.4424264, 2.1220005, -1.1211319, -4.1195436, -0.8993205, -0.47154924, -0.051248588, 1.7858429, 1.2479581, 2.4390466, -2.615668, 2.7144387, -1.1336803, -0.9504373, 0.42034212, 0.6522583, 0.83028173, -0.33658272, 0.27665073, 1.7544823, 0.8760765, -0.27753547, -1.2579558, -1.3732074, 0.89605165, -0.9968676, -0.099919416, -3.0614164]","[-0.16130398, -0.5583664, 7.135714E-4, -1.8685576, -0.66981864, 0.08794541, -0.64017045, -1.1379296, -2.5542083, -0.19815344, 2.1212552, -2.785796, -2.2404165, -1.3197905, -3.3379602, -0.7703352, 1.3340237, -0.59136367, -0.06352994, -0.276839, 0.6682162, -0.7250914, 0.29910278, -1.5698713, 0.6393642, -1.0393646, 1.8908089, -0.20669825, -0.48462233, -0.21876171, -3.0442584, 1.7248535, -0.66967446, -1.2857287, 1.0468934, -0.5697006, -0.18675226, -0.82099324, 0.24874417, -2.9566681, 0.99892336, -0.7281621, 1.0763891, -0.4793534, -0.17651007, -1.148234, -1.1141315, 1.4952755, 0.8538215, 0.5071729, 0.66542155, -0.19409566, -1.7753453, -0.7059357, -1.7921116, -0.685305, -0.87791896, 0.44349656, 1.5142463, -0.799651, -0.07076867, 1.9158139, 
1.0892427, -0.14858314, 0.25825214, -0.5178906, -1.333723, 0.94021904, 0.225154, -0.6923133, -0.31622592, -1.256697, -0.13491465, -0.6220355, -0.84561133, -0.54744524, 1.2497375, -0.56209326, -1.9172865, -0.9206086, 1.0032535, 0.286544, -1.4973941, -1.1314479, 0.8986352, -1.5813429, 1.108498, -0.66021955, 0.6145745, 0.22086996, 0.680037, -2.495816, 0.2857849, -0.9543916, -2.0819683, -0.8010362, 0.369732, 0.70697147, 0.54890186, -1.1213378]"
1,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",AMADEO,RESISTANCE,"[0.10835233, 0.307957, -0.5756974, -0.21822615, -0.41529492, -0.03489152, -0.034487452, 0.023482231, -0.7280744, 0.27365163, -0.049811274, -0.49503738, -0.4558065, 0.006723329, -0.12963401, 0.07919338, -0.33047056, -0.20311871, 0.10791033, -0.20039529, -0.37215993, -0.17138784, -0.1144783, -0.5619282, 0.6183555, -0.3253822, 0.025963642, 0.029569551, -0.010889502, -0.29141203, -0.18173097, -0.20770955, 0.13482161, 0.21674705, 0.3959404, -0.15857975, 0.28512117, 0.15994772, -0.35802966, -0.42040747, 0.3427062, -0.47629416, 0.28886706, 0.38614088, 0.076509215, -0.6584791, 0.37441805, -0.21275704, 0.38755137, 0.38589337, 0.25039104, 0.111188084, 0.13622893, 0.080100685, 0.025717424, -0.61775, 0.18610889, 0.34514168, 0.0054040486, -0.46469507, -0.3869265, 0.34901676, -0.16019733, -0.29333568, 0.182646, 0.36254779, -0.6077748, -0.08411278, 0.63209605, 0.067260414, -0.09450507, -0.25314018, -0.111741595, 0.35432914, 0.12674154, -0.68386126, -0.40157545, -0.15316018, 0.1077685, -0.33444956, -0.3176587, 0.07800743, -0.28782275, -0.77510524, -0.49866614, -0.4893062, -0.25443476, -0.29218763, 0.11124221, -0.09143806, -0.20461826, -0.12607165, 0.5711711, -0.2759387, -0.07001778, -0.21211171, -0.014845994, -0.41933548, 0.23724158, -0.31881934]","[-0.19620584, 1.6205584, 0.13788338, -0.22766168, -0.63744533, 0.2700976, 0.30442047, -0.81683743, -1.5852581, -0.2109028, 2.615766, -2.236224, -0.76350594, -0.20201756, -1.6577406, -0.59905183, 1.5542233, 0.28602546, 1.0209166, -1.7282783, -1.0166166, -1.6715709, -0.7523009, -1.0603728, -0.117878065, 0.90440655, 1.086054, -1.09216, 0.5359887, 0.42149693, -1.2177356, 1.7897729, -0.111570686, 0.76023173, 0.16672827, -2.298927, -0.73743296, -0.73748624, 1.3283676, -1.6173036, 0.95380235, -0.51807934, 1.8109344, 1.9766983, -1.1374876, -0.011348068, -1.6871501, -1.6737306, 0.04228129, 1.2558118, 0.869428, -0.08804982, 0.8733846, 0.65010756, -0.49490568, 
-1.028711, 1.624814, -0.351934, 1.2993368, 1.582106, 0.3125408, 2.1090963, 0.067718685, 1.1099043, -0.157489, 0.9634114, -0.56188446, 0.01380811, 1.3821543, -1.5496154, 1.7897134, -0.37755457, 0.045469787, 1.1981195, -0.01570279, -0.56835455, 0.13468437, 0.30032554, 0.22408834, -1.9496518, -0.49267983, -1.1014726, -0.09132616, -1.6396186, -0.111661345, -0.9598511, 1.2796035, 1.7527312, 1.737827, 0.23690899, 0.20718007, 1.6944544, 1.3505652, 0.12022101, 0.37390172, -0.63548315, 1.9753871, -0.98605317, -1.0243738, -0.8146936]"
2,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",FRERES,RESISTANCE,"[-1.688985, 0.097557895, 1.0489448, -3.487443, -1.3399991, 0.4498464, -1.4566468, -0.87048954, -1.2435615, 0.3774105, 1.6745398, 0.9896413, 2.1044707, -0.51510066, 0.16149546, -3.0437603, 0.50869876, 2.7351184, -1.1178105, 0.5041334, 2.4126704, 0.53866184, -1.3713572, -1.2401533, 0.17355655, 0.047125872, 0.14253643, 1.4832045, -0.033414114, -0.32354763, -1.4184397, -2.347894, -1.3796191, -2.7735548, 1.5177174, 0.07628559, -0.14851883, 2.526894, 0.72204036, -1.2351246, 1.7430878, 1.8407371, 1.5273428, 1.6378278, -1.4441094, -0.92657965, 0.6114409, -0.5970426, 0.5298836, 2.105611, 1.4486173, 1.3387399, 0.28431913, -0.72167766, -0.4750457, -1.880534, 2.079888, 1.3568192, 1.4658229, -0.64631444, 1.4790719, 1.2740519, 4.9677095, 1.9248492, 0.14292595, 4.194341, -0.22896816, -0.8432166, 1.705185, 1.6199191, 5.807013, -0.26892865, -1.4424264, 2.1220005, -1.1211319, -4.1195436, -0.8993205, -0.47154924, -0.051248588, 1.7858429, 1.2479581, 2.4390466, -2.615668, 2.7144387, -1.1336803, -0.9504373, 0.42034212, 0.6522583, 0.83028173, -0.33658272, 0.27665073, 1.7544823, 0.8760765, -0.27753547, -1.2579558, -1.3732074, 0.89605165, -0.9968676, -0.099919416, -3.0614164]","[-0.19620584, 1.6205584, 0.13788338, -0.22766168, -0.63744533, 0.2700976, 0.30442047, -0.81683743, -1.5852581, -0.2109028, 2.615766, -2.236224, -0.76350594, -0.20201756, -1.6577406, -0.59905183, 1.5542233, 0.28602546, 1.0209166, -1.7282783, -1.0166166, -1.6715709, -0.7523009, -1.0603728, -0.117878065, 0.90440655, 1.086054, -1.09216, 0.5359887, 0.42149693, -1.2177356, 1.7897729, -0.111570686, 0.76023173, 0.16672827, -2.298927, -0.73743296, -0.73748624, 1.3283676, -1.6173036, 0.95380235, -0.51807934, 1.8109344, 1.9766983, -1.1374876, -0.011348068, -1.6871501, -1.6737306, 0.04228129, 1.2558118, 0.869428, -0.08804982, 0.8733846, 0.65010756, -0.49490568, -1.028711, 1.624814, -0.351934, 1.2993368, 1.582106, 0.3125408, 2.1090963, 
0.067718685, 1.1099043, -0.157489, 0.9634114, -0.56188446, 0.01380811, 1.3821543, -1.5496154, 1.7897134, -0.37755457, 0.045469787, 1.1981195, -0.01570279, -0.56835455, 0.13468437, 0.30032554, 0.22408834, -1.9496518, -0.49267983, -1.1014726, -0.09132616, -1.6396186, -0.111661345, -0.9598511, 1.2796035, 1.7527312, 1.737827, 0.23690899, 0.20718007, 1.6944544, 1.3505652, 0.12022101, 0.37390172, -0.63548315, 1.9753871, -0.98605317, -1.0243738, -0.8146936]"
3,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",AMADEO,MARTYRS,"[0.10835233, 0.307957, -0.5756974, -0.21822615, -0.41529492, -0.03489152, -0.034487452, 0.023482231, -0.7280744, 0.27365163, -0.049811274, -0.49503738, -0.4558065, 0.006723329, -0.12963401, 0.07919338, -0.33047056, -0.20311871, 0.10791033, -0.20039529, -0.37215993, -0.17138784, -0.1144783, -0.5619282, 0.6183555, -0.3253822, 0.025963642, 0.029569551, -0.010889502, -0.29141203, -0.18173097, -0.20770955, 0.13482161, 0.21674705, 0.3959404, -0.15857975, 0.28512117, 0.15994772, -0.35802966, -0.42040747, 0.3427062, -0.47629416, 0.28886706, 0.38614088, 0.076509215, -0.6584791, 0.37441805, -0.21275704, 0.38755137, 0.38589337, 0.25039104, 0.111188084, 0.13622893, 0.080100685, 0.025717424, -0.61775, 0.18610889, 0.34514168, 0.0054040486, -0.46469507, -0.3869265, 0.34901676, -0.16019733, -0.29333568, 0.182646, 0.36254779, -0.6077748, -0.08411278, 0.63209605, 0.067260414, -0.09450507, -0.25314018, -0.111741595, 0.35432914, 0.12674154, -0.68386126, -0.40157545, -0.15316018, 0.1077685, -0.33444956, -0.3176587, 0.07800743, -0.28782275, -0.77510524, -0.49866614, -0.4893062, -0.25443476, -0.29218763, 0.11124221, -0.09143806, -0.20461826, -0.12607165, 0.5711711, -0.2759387, -0.07001778, -0.21211171, -0.014845994, -0.41933548, 0.23724158, -0.31881934]","[-0.16130398, -0.5583664, 7.135714E-4, -1.8685576, -0.66981864, 0.08794541, -0.64017045, -1.1379296, -2.5542083, -0.19815344, 2.1212552, -2.785796, -2.2404165, -1.3197905, -3.3379602, -0.7703352, 1.3340237, -0.59136367, -0.06352994, -0.276839, 0.6682162, -0.7250914, 0.29910278, -1.5698713, 0.6393642, -1.0393646, 1.8908089, -0.20669825, -0.48462233, -0.21876171, -3.0442584, 1.7248535, -0.66967446, -1.2857287, 1.0468934, -0.5697006, -0.18675226, -0.82099324, 0.24874417, -2.9566681, 0.99892336, -0.7281621, 1.0763891, -0.4793534, -0.17651007, -1.148234, -1.1141315, 1.4952755, 0.8538215, 0.5071729, 0.66542155, -0.19409566, -1.7753453, -0.7059357, -1.7921116, 
-0.685305, -0.87791896, 0.44349656, 1.5142463, -0.799651, -0.07076867, 1.9158139, 1.0892427, -0.14858314, 0.25825214, -0.5178906, -1.333723, 0.94021904, 0.225154, -0.6923133, -0.31622592, -1.256697, -0.13491465, -0.6220355, -0.84561133, -0.54744524, 1.2497375, -0.56209326, -1.9172865, -0.9206086, 1.0032535, 0.286544, -1.4973941, -1.1314479, 0.8986352, -1.5813429, 1.108498, -0.66021955, 0.6145745, 0.22086996, 0.680037, -2.495816, 0.2857849, -0.9543916, -2.0819683, -0.8010362, 0.369732, 0.70697147, 0.54890186, -1.1213378]"


### 3. Computing the similarity

The similarity is computed as the cosine similarity between the two weight vectors. For the detailed walkthrough, refer to the notebook [07_creation_table_poids_Word2Vec.md#test-acceptanceanalyse-du-modele](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/08_US_DATUM/07_creation_table_poids_Word2Vec.md#test-acceptanceanalyse-du-mod%C3%A8le).
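
For readers who prefer Python to nested `REDUCE` calls, the expression in the query amounts to the dot product of the two weight vectors divided by the product of their Euclidean norms. The function below is a minimal sketch of that formula; its name is hypothetical and it assumes both vectors have the same length and are non-zero.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Dot product divided by the product of the Euclidean norms."""
    # Numerator: REDUCE(zip_with(a, b, (x, y) -> x * y), ...)
    dot = sum(x * y for x, y in zip(weights_a, weights_b))
    # Denominator: SQRT(sum of squares of a) * SQRT(sum of squares of b)
    norm_a = math.sqrt(sum(x ** 2 for x in weights_a))
    norm_b = math.sqrt(sum(y ** 2 for y in weights_b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))             # orthogonal
```

The score ranges from -1 to 1, and a higher value means the two words are closer in the Word2Vec space. That is why the next step keeps the maximum.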

In [13]:
query =  """
WITH dataset AS (
  SELECT 
    siretisation.ets_insee_inpi.row_id, 
    index_id, 
    status_cas, 
    inpi_except, 
    insee_except, 
    transform(
      sequence(
        1, 
        CARDINALITY(insee_except)
      ), 
      x -> insee_except
    ), 
    ZIP(
      inpi_except, 
      transform(
        sequence(
          1, 
          CARDINALITY(inpi_except)
        ), 
        x -> insee_except
      )
    ) as test 
  FROM 
    siretisation.ets_insee_inpi  
    
  LEFT JOIN siretisation.ets_insee_inpi_status_cas 
  ON siretisation.ets_insee_inpi.row_id = siretisation.ets_insee_inpi_status_cas.row_id
  where 
    (status_cas != 'CAS_2' AND CARDINALITY(inpi_except)  > 0 AND CARDINALITY(insee_except) > 0 AND index_id = 4664896)
  
  )
  SELECT 
  * 
FROM 
  (
    WITH distance AS (
SELECT 
              row_id, 
              index_id, 
              status_cas, 
              inpi_except, 
              insee_except, 
              unzip_inpi, 
              unzip_insee, 
              list_weights_inpi, 
              list_weights_insee 
            FROM 
              (
                SELECT 
                  row_id, 
                  index_id, 
                  status_cas, 
                  inpi_except, 
                  insee_except, 
                  unzip.field0 as unzip_inpi, 
                  unzip.field1 as insee, 
                  test 
                FROM 
                  dataset CROSS 
                  JOIN UNNEST(test) AS new (unzip)
              ) CROSS 
              JOIN UNNEST(insee) as test (unzip_insee) 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_inpi 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_inpi ON unzip_inpi = tb_weight_inpi.words 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_insee 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_insee ON unzip_insee = tb_weight_insee.words 
      )
    SELECT row_id, 
            index_id, 
            status_cas, 
            inpi_except, 
            insee_except, 
            unzip_inpi, 
            unzip_insee, 
            REDUCE(
              zip_with(
                list_weights_inpi, 
                list_weights_insee, 
                (x, y) -> x * y
              ), 
              CAST(
                ROW(0.0) AS ROW(sum DOUBLE)
              ), 
              (s, x) -> CAST(
                ROW(x + s.sum) AS ROW(sum DOUBLE)
              ), 
              s -> s.sum
            ) / (
              SQRT(
                REDUCE(
                  transform(
                    list_weights_inpi, 
                    (x) -> POW(x, 2)
                  ), 
                  CAST(
                    ROW(0.0) AS ROW(sum DOUBLE)
                  ), 
                  (s, x) -> CAST(
                    ROW(x + s.sum) AS ROW(sum DOUBLE)
                  ), 
                  s -> s.sum
                )
              ) * SQRT(
                REDUCE(
                  transform(
                    list_weights_insee, 
                    (x) -> POW(x, 2)
                  ), 
                  CAST(
                    ROW(0.0) AS ROW(sum DOUBLE)
                  ), 
                  (s, x) -> CAST(
                    ROW(x + s.sum) AS ROW(sum DOUBLE)
                  ), 
                  s -> s.sum
                )
              )
            ) AS cosine_distance  
    FROM distance
    )
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
            filename='cosine',  # Add filename to print dataframe
            destination_key=None  # Add destination key if the output needs to be copied
        )

Unnamed: 0,row_id,index_id,status_cas,inpi_except,insee_except,unzip_inpi,unzip_insee,cosine_distance
0,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",AMADEO,RESISTANCE,0.326015
1,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",FRERES,RESISTANCE,0.241596
2,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",AMADEO,MARTYRS,0.338002
3,87887,4664896,CAS_3,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",FRERES,MARTYRS,0.182855
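
The `REDUCE` / `zip_with` expression in the query above is the usual cosine formula: the dot product of the two weight vectors divided by the product of their Euclidean norms. A minimal Python sketch of the same arithmetic follows. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the pipeline; the real Word2Vec weights come from `siretisation.list_weight_mots_insee_inpi_word2vec`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors.

    Returns None when a vector is missing or has zero norm. This is an
    assumption meant to mimic the NULL expected from the LEFT JOIN when
    a word is absent from the weight table.
    """
    if u is None or v is None:
        return None
    # zip_with(u, v, (x, y) -> x * y) followed by REDUCE(..., sum)
    dot = sum(x * y for x, y in zip(u, v))
    # SQRT(REDUCE(transform(u, x -> POW(x, 2)), sum))
    norm_u = math.sqrt(sum(x ** 2 for x in u))
    norm_v = math.sqrt(sum(y ** 2 for y in v))
    if norm_u == 0 or norm_v == 0:
        return None
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, far shorter than real embeddings)
weights_inpi = [0.2, -0.1, 0.4]
weights_insee = [0.1, 0.3, 0.5]
print(cosine_similarity(weights_inpi, weights_insee))
```

The value is a similarity (higher means closer), even though the column is named `cosine_distance`.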


### 4. Retrieve the maximum similarity per `row_id` and compute the Levenshtein distance

The final step keeps only the maximum similarity among the duplicated rows that share a `row_id`, so that each INSEE/INPI pair is left with a single row, and computes the Levenshtein distance between the two retained words.

In [16]:
query = """
WITH dataset AS (
  SELECT 
    siretisation.ets_insee_inpi.row_id, 
    index_id, 
    status_cas, 
    inpi_except, 
    insee_except, 
    transform(
      sequence(
        1, 
        CARDINALITY(insee_except)
      ), 
      x -> insee_except
    ), 
    ZIP(
      inpi_except, 
      transform(
        sequence(
          1, 
          CARDINALITY(inpi_except)
        ), 
        x -> insee_except
      )
    ) as test 
  FROM 
    siretisation.ets_insee_inpi  
    
  LEFT JOIN siretisation.ets_insee_inpi_status_cas 
  ON siretisation.ets_insee_inpi.row_id = siretisation.ets_insee_inpi_status_cas.row_id
  where 
    (status_cas != 'CAS_2' AND CARDINALITY(inpi_except)  > 0 AND CARDINALITY(insee_except) > 0 AND index_id = 4664896)
  )
 SELECT 
  * 
FROM 
  (
    WITH distance AS (
      SELECT 
        * 
      FROM 
        (
          WITH list_weights_insee_inpi AS (
            SELECT 
              row_id, 
              index_id, 
              status_cas, 
              inpi_except, 
              insee_except, 
              unzip_inpi, 
              unzip_insee, 
              list_weights_inpi, 
              list_weights_insee 
            FROM 
              (
                SELECT 
                  row_id, 
                  index_id, 
                  status_cas, 
                  inpi_except, 
                  insee_except, 
                  unzip.field0 as unzip_inpi, 
                  unzip.field1 as insee, 
                  test 
                FROM 
                  dataset CROSS 
                  JOIN UNNEST(test) AS new (unzip)
              ) CROSS 
              JOIN UNNEST(insee) as test (unzip_insee) 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_inpi 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_inpi ON unzip_inpi = tb_weight_inpi.words 
              LEFT JOIN (
                SELECT 
                  words, 
                  list_weights as list_weights_insee 
                FROM 
                  siretisation.list_weight_mots_insee_inpi_word2vec 
              ) tb_weight_insee ON unzip_insee = tb_weight_insee.words 
          ) 
          SELECT 
            row_id, 
            index_id, 
            status_cas, 
            inpi_except, 
            insee_except, 
            unzip_inpi, 
            unzip_insee, 
            REDUCE(
              zip_with(
                list_weights_inpi, 
                list_weights_insee, 
                (x, y) -> x * y
              ), 
              CAST(
                ROW(0.0) AS ROW(sum DOUBLE)
              ), 
              (s, x) -> CAST(
                ROW(x + s.sum) AS ROW(sum DOUBLE)
              ), 
              s -> s.sum
            ) / (
              SQRT(
                REDUCE(
                  transform(
                    list_weights_inpi, 
                    (x) -> POW(x, 2)
                  ), 
                  CAST(
                    ROW(0.0) AS ROW(sum DOUBLE)
                  ), 
                  (s, x) -> CAST(
                    ROW(x + s.sum) AS ROW(sum DOUBLE)
                  ), 
                  s -> s.sum
                )
              ) * SQRT(
                REDUCE(
                  transform(
                    list_weights_insee, 
                    (x) -> POW(x, 2)
                  ), 
                  CAST(
                    ROW(0.0) AS ROW(sum DOUBLE)
                  ), 
                  (s, x) -> CAST(
                    ROW(x + s.sum) AS ROW(sum DOUBLE)
                  ), 
                  s -> s.sum
                )
              )
            ) AS cosine_distance 
          FROM 
            list_weights_insee_inpi
        )
    ) 
    SELECT 
      row_id, 
      dataset.index_id, 
      inpi_except, 
      insee_except, 
      unzip_inpi, 
      unzip_insee, 
      max_cosine_distance,
      test as key_except_to_test,
      levenshtein_distance(unzip_inpi, unzip_insee) AS levenshtein_distance
    
    FROM 
      dataset 
      LEFT JOIN (
        SELECT 
          distance.index_id, 
          unzip_inpi, 
          unzip_insee, 
          max_cosine_distance 
        FROM 
          distance 
          RIGHT JOIN (
            SELECT 
              index_id, 
              MAX(cosine_distance) as max_cosine_distance 
            FROM 
              distance 
            GROUP BY 
              index_id
          ) as tb_max_distance ON distance.index_id = tb_max_distance.index_id 
          AND distance.cosine_distance = tb_max_distance.max_cosine_distance
      ) as tb_max_distance_lookup ON dataset.index_id = tb_max_distance_lookup.index_id
  )
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
            filename='max_cosine',  # Add filename to print dataframe
            destination_key=None  # Add destination key if the output needs to be copied
        )

Unnamed: 0,row_id,index_id,inpi_except,insee_except,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance
0,87887,4664896,"[FRERES, AMADEO]","[MARTYRS, RESISTANCE]",AMADEO,MARTYRS,0.338002,"[{field0=FRERES, field1=[MARTYRS, RESISTANCE]}, {field0=AMADEO, field1=[MARTYRS, RESISTANCE]}]",6
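
The selection made by the query can be checked by hand on the four candidate pairs listed in the `cosine` output. The sketch below, in plain Python, keeps the pair with the highest cosine value and computes its edit distance. `levenshtein` is a hypothetical helper written for illustration (a standard dynamic-programming implementation), not the Athena `levenshtein_distance` function itself.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# (unzip_inpi, unzip_insee, cosine_distance) copied from the `cosine` output
pairs = [
    ("AMADEO", "RESISTANCE", 0.326015),
    ("FRERES", "RESISTANCE", 0.241596),
    ("AMADEO", "MARTYRS", 0.338002),
    ("FRERES", "MARTYRS", 0.182855),
]

# Equivalent of MAX(cosine_distance) followed by the join back on the pair
best = max(pairs, key=lambda p: p[2])
print(best, levenshtein(best[0], best[1]))
# -> ('AMADEO', 'MARTYRS', 0.338002) 6
```

This matches the single row returned above: the retained pair is `AMADEO` / `MARTYRS`, with a Levenshtein distance of 6.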


# Report generation

In [17]:
import os, time, shutil, urllib.request, ipykernel, json
from pathlib import Path
from notebook import notebookapp

In [18]:
def create_report(extension = "html", keep_code = False):
    """
    Create a report from the current notebook and save it in the
    Reports folder (a child of the current working directory).

    1. Extract the current notebook name
    2. Convert the notebook
    3. Move the newly created report

    Args:
    extension: string. Can be "html", "pdf", "md"
    keep_code: bool. If True, keep the code cells in the report

    """
    
    ### Get notebook name
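    connection_file = os.path.basename(ipykernel.get_connection_file())
    # Connection files are typically named "kernel-<id>.json": keep the <id> part
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]

    notebookname = None
    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:  
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
            # Keep the session attached to the current kernel
            for sess in sessions:
                if sess['kernel']['id'] == kernel_id:
                    notebookname = sess['name']
        except Exception:
            pass  # server not reachable: try the next one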
    
    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)
    
    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)
    
    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)
    
    ### Generate notebook
    if keep_code:
        os.system('jupyter nbconvert --to {} {}'.format(
    extension,notebookname))
    else:
        os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))
    
    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report available at this address:\n {}".format(dest))

In [19]:
create_report(extension = "html", keep_code = False)

Report available at this address:
 C:\Users\PERNETTH\Documents\Projects\InseeInpi_matching\Notebooks_matching\Data_preprocessed\programme_matching\08_US_DATUM\Reports\08_calcul_cosine_levhenstein.html
