# Creating the INSEE-INPI table without duplicates

# Objective(s)

* In this last step, we simply retrieve the minimum rank from the table ets_insee_inpi_regle for each index_id. Because lower ranks correspond to preferred rules, taking the minimum returns the most probable row given the other information provided by INSEE. In other words, we keep the row that satisfies the most conditions. Some duplicates may remain, caused either by poor data preparation or by the impossibility of telling the candidate siret apart.

# Metadata

* Epic: Epic 6
* US: US 7
* Date Begin: 9/29/2020
* Duration Task: 0
* Description: retrieve the minimum rank from the table ets_insee_inpi_regle for each index_id.
* Step type: Final table
* Status: Active
  * Change Status task: Active
  * Update table: Modify rows
* Source URL: US 07 Dedoublonnement
* Task type: Jupyter Notebook
* Users: Thomas Pernet
* Watchers: Thomas Pernet
* User Account: https://937882855452.signin.aws.amazon.com/console
* Estimated Log points: 5
* Task tag: #athena,#lookup-table,#sql,#remove-duplicate,#siretisation,#inpi,#siren,#siret,#insee,#documentation
* Toggl Tag: #documentation

# Input Cloud Storage [AWS/GCP]

## Table/file

* Origin: 
  * Athena
* Name: 
  * ets_insee_inpi_regle
* Github: 
  * https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/11_sumup_siretisation/08_creation_table_match_regles_gestion_insee_inpi.md

# Destination Output/Delivery

## Table/file

* Origin: 
  * Athena
* Name:
  * ets_insee_inpi_no_duplicate
* GitHub:
  * https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/11_sumup_siretisation/09_creation_table_ets_insee_inpi_no_duplicate.md

## Server connection

In [1]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_s3 import service_s3
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import os, shutil

path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = r"{}/credential_AWS.json".format(parent_path)
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = 'eu-west-3')

region = 'eu-west-3'
bucket = 'calfdata'

In [2]:
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = region)
client= con.client_boto()
s3 = service_s3.connect_S3(client = client,
                      bucket = bucket, verbose = False) 

In [3]:
pandas_setting = True
if pandas_setting:
    cm = sns.light_palette("green", as_cmap=True)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

# Introduction

Matching the two tables, INSEE and INPI, produces two address vectors: one with the words of the INSEE address and one with the words of the INPI address. Our goal is to compare these two vectors to determine whether they are identical. We identified 7 possible cases between the two vectors (figure 1).

![](https://drive.google.com/uc?export=view&id=1Qj_HooHrhFYSuTsoqFbl4Vxy9tN3V5Bu)
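
The comparison of the two address vectors can be sketched with plain Python sets. This is a minimal illustration, not the pipeline's implementation: the function name `compare_addresses` and the `ratio` key are hypothetical, and the address pair is taken from one of the sample rows shown later in this notebook.

```python
def compare_addresses(insee: str, inpi: str) -> dict:
    """Compare two normalised addresses as sets of words."""
    insee_words = set(insee.split())
    inpi_words = set(inpi.split())
    shared = insee_words & inpi_words
    combined = insee_words | inpi_words
    return {
        "intersection": len(shared),
        "union": len(combined),
        # Words found on one side only
        "insee_except": sorted(insee_words - inpi_words),
        "inpi_except": sorted(inpi_words - insee_words),
        # Assumption: share of common words; the pipeline's
        # pct_intersection may be computed differently.
        "ratio": round(len(shared) / len(combined), 2) if combined else None,
    }


result = compare_addresses("IMPASSE BELLEVUE", "IMPASSE BELLEVUE QUIMPERLE")
print(result)
```

Here the two vectors share two words and the INPI side carries one extra word, so they are not identical.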

From there, we built a matrix of business rules, then created those rules from the INSEE and INPI information.

In this matrix, the rows are sorted in ascending order of rank, meaning that row 1 is preferred to row 2.

The table below summarizes the rules:


| Rang | Nom_variable                              | Dependence                                    | Notebook                           | Difficulte | Table_input                                                                                                                                                            | Variables_crees_US                                                                 | Possibilites                  |
|------|-------------------------------------------|-----------------------------------------------|------------------------------------|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|-------------------------------|
| 1    | status_cas                                |                                               | 02_cas_de_figure                   | Moyen      | ets_insee_inpi_status_cas                                                                                                                                              | status_cas,intersection,pct_intersection,union_,inpi_except,insee_except           | CAS_1,CAS_2,CAS_3,CAS_4,CAS_5 |
| 2    | test_list_num_voie                        | intersection_numero_voie,union_numero_voie    | 03_test_list_num_voie              | Moyen      | ets_insee_inpi_list_num_voie                                                                                                                                           | intersection_numero_voie,union_numero_voie                                         | FALSE,NULL,TRUE,PARTIAL       |
| 3    | test_enseigne                             | list_enseigne,enseigne                        | 04_test_enseigne                   | Moyen      | ets_insee_inpi_list_enseigne                                                                                                                                           | list_enseigne_contain                                                              | FALSE,NULL,TRUE               |
| 4    | test_pct_intersection                     | pct_intersection,index_id_max_intersection    | 06_creation_nb_siret_siren_max_pct | Facile     | ets_insee_inpi_var_group_max                                                                                                                                           | count_inpi_index_id_siret,count_inpi_siren_siret,index_id_max_intersection         | FALSE,TRUE                    |
| 4    | test_index_id_duplicate                   | count_inpi_index_id_siret                     | 06_creation_nb_siret_siren_max_pct | Facile     | ets_insee_inpi_var_group_max                                                                                                                                           | count_inpi_index_id_siret,count_inpi_siren_siret,index_id_max_intersection         | FALSE,TRUE                    |
| 4    | test_siren_insee_siren_inpi               | count_initial_insee,count_inpi_siren_siret    | 06_creation_nb_siret_siren_max_pct | Facile     | ets_insee_inpi_var_group_max                                                                                                                                           | count_inpi_index_id_siret,count_inpi_siren_siret,index_id_max_intersection         | FALSE,TRUE                    |
| 5    | test_similarite_exception_words           | max_cosine_distance                           | 08_calcul_cosine_levhenstein       | Difficile  | ets_insee_inpi_similarite_max_word2vec                                                                                                                                 | unzip_inpi,unzip_insee,max_cosine_distance,levenshtein_distance,key_except_to_test | FALSE,NULL,TRUE               |
| 5    | test_distance_levhenstein_exception_words | levenshtein_distance                          | 08_calcul_cosine_levhenstein       | Difficile  | ets_insee_inpi_similarite_max_word2vec                                                                                                                                 | unzip_inpi,unzip_insee,max_cosine_distance,levenshtein_distance,key_except_to_test | FALSE,NULL,TRUE               |
| 6    | test_date                                 | datecreationetablissement,date_debut_activite | 10_match_et_creation_regles.md     | Facile     | ets_insee_inpi_list_num_voie,ets_insee_inpi_list_enseigne,ets_insee_inpi_similarite_max_word2vec,ets_insee_inpi_status_cas,ets_insee_inpi_var_group_max,ets_insee_inpi |                                                                                    | FALSE,TRUE                    |
| 6    | test_siege                                | status_ets,etablissementsiege                 | 10_match_et_creation_regles.md     | Facile     | ets_insee_inpi_list_num_voie,ets_insee_inpi_list_enseigne,ets_insee_inpi_similarite_max_word2vec,ets_insee_inpi_status_cas,ets_insee_inpi_var_group_max,ets_insee_inpi |                                                                                    | FALSE,TRUE,NULL               |
| 6    | test_status_admin                         | etatadministratifetablissement,status_admin   | 10_match_et_creation_regles.md     | Facile     | ets_insee_inpi_list_num_voie,ets_insee_inpi_list_enseigne,ets_insee_inpi_similarite_max_word2vec,ets_insee_inpi_status_cas,ets_insee_inpi_var_group_max,ets_insee_inpi |                                                                                    | FALSE,NULL,TRUE               |
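
To illustrate how an ordered rules matrix yields a rank, the sketch below enumerates every combination of two tests and numbers the rows. The preference order inside each list is hypothetical (best outcome first), and only two of the tests are used; the real matrix is built in an earlier notebook and may be ordered differently.

```python
from itertools import product

import pandas as pd

# Hypothetical preference order, best outcome first.
# The first key has the highest priority.
outcomes = {
    "status_cas": ["CAS_1", "CAS_3"],
    "test_list_num_voie": ["TRUE", "PARTIAL", "FALSE"],
}

# product() varies the last key fastest, so rows are sorted by priority.
rules = pd.DataFrame(list(product(*outcomes.values())), columns=list(outcomes))
rules["rank"] = range(1, len(rules) + 1)
print(rules)
```

With this construction, rank 1 is the combination of the best outcome of every test, and each following row is slightly less favourable.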

In this last step, we simply retrieve the minimum rank from the table `ets_insee_inpi_regle` for each `index_id`. Because lower ranks correspond to preferred rules, taking the minimum returns the most probable row given the other information provided by INSEE. In other words, we keep the row that satisfies the most conditions. Some duplicates may remain, caused either by poor data preparation or by the impossibility of telling the candidate siret apart.
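
The same logic can be sketched in pandas on a toy dataset. The values below are made up for illustration; only the column names `index_id`, `siret`, `rank`, `min_rank` and `count_index` come from the tables used in this notebook.

```python
import pandas as pd

# Toy candidates: several siret may compete for the same index_id.
candidates = pd.DataFrame({
    "index_id": [1, 1, 2, 3, 3, 3],
    "siret": ["A", "B", "C", "D", "E", "F"],
    "rank": [10, 25, 7, 40, 40, 90],
})

# Lowest rank per index_id, repeated on every row of the group.
candidates["min_rank"] = candidates.groupby("index_id")["rank"].transform("min")

# Keep only the rows that reach the minimum rank.
best = candidates[candidates["rank"] == candidates["min_rank"]].copy()

# Number of rows left per index_id: a value above 1 is a remaining duplicate.
best["count_index"] = best.groupby("index_id")["index_id"].transform("count")
print(best)
```

In this toy example, `index_id` 3 keeps two rows because two candidates share the same minimum rank, which is exactly the kind of duplicate that can survive this step.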

In [4]:
s3_output = 'SQL_OUTPUT_ATHENA'
database = 'ets_siretisation'

In [5]:
query = """
DROP TABLE IF EXISTS ets_siretisation.ets_insee_inpi_no_duplicate;
"""

s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
  filename = None, ## Add filename to print dataframe
  destination_key = None ### Add destination key if need to copy output
        )

{'Results': {'State': 'SUCCEEDED',
  'SubmissionDateTime': datetime.datetime(2020, 9, 29, 15, 13, 3, 202000, tzinfo=tzlocal()),
  'CompletionDateTime': datetime.datetime(2020, 9, 29, 15, 13, 4, 315000, tzinfo=tzlocal())},
 'QueryID': 'e662363c-26e0-42e8-840c-55eef39e5a3a'}

In [7]:
query = """
CREATE TABLE ets_siretisation.ets_insee_inpi_no_duplicate
WITH (
  format='PARQUET'
) AS
WITH tb_min_rank AS (
SELECT 
  rank, 
  min_rank, 
  row_id, 
  ets_insee_inpi_regle.index_id, 
  siren, 
  siret, 
  sequence_id, 
  count_inpi_index_id_siret, 
  count_inpi_siren_siret, 
  count_initial_insee, 
  test_index_id_duplicate, 
  test_siren_insee_siren_inpi, 
  adresse_distance_insee, 
  adresse_distance_inpi, 
  insee_except, 
  inpi_except, 
  intersection, 
  union_, 
  pct_intersection, 
  index_id_max_intersection, 
  status_cas, 
  test_pct_intersection, 
  unzip_inpi, 
  unzip_insee, 
  max_cosine_distance, 
  key_except_to_test, 
  levenshtein_distance, 
  test_similarite_exception_words, 
  test_distance_levhenstein_exception_words, 
  list_numero_voie_matching_inpi, 
  list_numero_voie_matching_insee, 
  intersection_numero_voie, 
  union_numero_voie, 
  test_list_num_voie, 
  enseigne, 
  list_enseigne, 
  list_enseigne_contain, 
  test_enseigne, 
  date_debut_activite, 
  test_date, 
  etablissementsiege, 
  status_ets, 
  test_siege, 
  etatadministratifetablissement, 
  status_admin, 
  test_status_admin 
FROM 
  ets_siretisation.ets_insee_inpi_regle 
  INNER JOIN (
    SELECT 
      index_id, 
      MIN(rank) AS min_rank 
    FROM 
      ets_siretisation.ets_insee_inpi_regle 
    GROUP BY 
      index_id
  ) as tb_min_rank ON ets_insee_inpi_regle.index_id = tb_min_rank.index_id 
  AND ets_insee_inpi_regle.rank = tb_min_rank.min_rank
  )
  SELECT 
  rank, 
  min_rank, 
  row_id, 
  tb_min_rank.index_id, 
  count_index,
  siren, 
  siret, 
  sequence_id, 
  count_inpi_index_id_siret, 
  count_inpi_siren_siret, 
  count_initial_insee, 
  test_index_id_duplicate, 
  test_siren_insee_siren_inpi, 
  adresse_distance_insee, 
  adresse_distance_inpi, 
  insee_except, 
  inpi_except, 
  intersection, 
  union_, 
  pct_intersection, 
  index_id_max_intersection, 
  status_cas, 
  test_pct_intersection, 
  unzip_inpi, 
  unzip_insee, 
  max_cosine_distance, 
  key_except_to_test, 
  levenshtein_distance, 
  test_similarite_exception_words, 
  test_distance_levhenstein_exception_words, 
  list_numero_voie_matching_inpi, 
  list_numero_voie_matching_insee, 
  intersection_numero_voie, 
  union_numero_voie, 
  test_list_num_voie, 
  enseigne, 
  list_enseigne, 
  list_enseigne_contain, 
  test_enseigne, 
  date_debut_activite, 
  test_date, 
  etablissementsiege, 
  status_ets, 
  test_siege, 
  etatadministratifetablissement, 
  status_admin, 
  test_status_admin 
  FROM tb_min_rank
  LEFT JOIN (
    SELECT index_id, COUNT(*) AS count_index
    FROM tb_min_rank
    GROUP BY index_id
    ) as tb_nb_index
    ON tb_min_rank.index_id = tb_nb_index.index_id

"""

s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
  filename = None, ## Add filename to print dataframe
  destination_key = None ### Add destination key if need to copy output
        )

{'Results': {'State': 'SUCCEEDED',
  'SubmissionDateTime': datetime.datetime(2020, 9, 29, 15, 13, 57, 925000, tzinfo=tzlocal()),
  'CompletionDateTime': datetime.datetime(2020, 9, 29, 15, 14, 34, 942000, tzinfo=tzlocal())},
 'QueryID': 'ebbc0a7d-e402-4d87-a2b7-58f2e8f15b63'}

# Analysis

1. Row and index counts
2. Duplicate evaluation

## 1. Row and index counts

Number of rows

In [9]:
query = """
SELECT COUNT(*) as CNT
FROM ets_insee_inpi_no_duplicate 
"""

s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'cnt_nb_lignes_rank', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )  

Unnamed: 0,CNT
0,9217708


Number of distinct indexes

In [11]:
query = """
SELECT COUNT(distinct(index_id)) as CNT
FROM ets_insee_inpi_no_duplicate 
"""

s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'cnt_nb_index_rank', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,CNT
0,9141917


Number of distinct indexes per case

In [13]:
query = """
SELECT status_cas,  COUNT(distinct(index_id)) as cnt
FROM ets_insee_inpi_no_duplicate 
GROUP BY status_cas
ORDER BY cnt
"""

s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'cnt_nb_index_rank', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,status_cas,cnt
0,CAS_4,358290
1,CAS_3,748512
2,CAS_5,775873
3,CAS_1,7259242


## 2. Duplicate evaluation

The table below summarizes the unique indexes and the duplicates

In [14]:
query = """
SELECT count_index, COUNT(*) as ligne_dup
FROM ets_insee_inpi_no_duplicate 
GROUP BY count_index 
ORDER BY count_index
"""

nb_ligne = s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'cnt_nb_dup_lignes_rank', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
)

In [15]:
query = """
SELECT count_index, COUNT(DISTINCT(index_id)) as index_dup
FROM ets_insee_inpi_no_duplicate 
GROUP BY count_index 
ORDER BY count_index
"""

nb_index = s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'cnt_nb_dup_index_rank', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

In [16]:
(
pd.concat([    
 pd.concat([
    pd.concat(
    [
        nb_ligne.sum().to_frame().T.rename(index = {0:'total'}), 
        nb_ligne
    ], axis = 0),
    ],axis = 1,keys=["Lignes"]),
    (
 pd.concat([
    pd.concat(
    [
        nb_index.sum().to_frame().T.rename(index = {0:'total'}), 
        nb_index
    ], axis = 0),
    ],axis = 1,keys=["Index"])
)],axis= 1
    )
    .style
    .format("{:,.0f}")
                  .bar(subset= [
                      ('Lignes','ligne_dup'),
                      ('Index','index_dup'),
                      
                  ],
                       color='#d65f5f')
)

Unnamed: 0_level_0,Lignes,Lignes,Index,Index
Unnamed: 0_level_1,count_index,ligne_dup,count_index,index_dup
total,2751,9217708,2751,9141917
0,1,9087219,1,9087219
1,2,95392,2,47696
2,3,4857,3,1619
3,4,17752,4,4438
4,5,930,5,186
5,6,1290,6,215
6,7,350,7,50
7,8,888,8,111
8,9,342,9,38


Number of indexes recovered

In [17]:
nb_index.iloc[0,1]

9087219

Number of indexes to find

In [18]:
nb_index.sum().to_frame().T.rename(index = {0:'total'}).iloc[0,1]

9141917

Share of indexes with a single probable match

In [19]:
round(nb_index.iloc[0,1] / nb_index.sum().to_frame().T.rename(index = {0:'total'}).iloc[0,1], 4)

0.994
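
The same ratio can be written with named values, which also gives the number of `index_id` that still have more than one candidate row. The two counts are the ones printed above; the variable names are illustrative.

```python
unique_indexes = 9_087_219  # index_id left with a single row
total_indexes = 9_141_917   # distinct index_id in the table

share_unique = round(unique_indexes / total_indexes, 4)
remaining = total_indexes - unique_indexes  # index_id still duplicated

print(share_unique, remaining)
```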

Rank analysis

In [21]:
query = """
WITH dataset AS (
  
  SELECT 
  MAP(
    ARRAY[0.1,0.25,0.5,0.75,0.8,0.95],
    approx_percentile(
      min_rank,
    ARRAY[0.1,0.25,0.5,0.75,0.8,0.95])
    ) AS nest
    FROM "ets_siretisation"."ets_insee_inpi_no_duplicate"  
    ) 
    
    SELECT 
    pct, 
    value AS  min_rank
    FROM dataset
    CROSS JOIN UNNEST(nest) as t(pct, value)
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'distribution_rank', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,pct,min_rank
0,0.1,2857
1,0.25,3995
2,0.5,4481
3,0.75,14335
4,0.8,27295
5,0.95,32021


### Rule 10%

In [25]:
query ="""
SELECT *
FROM rank_matrice_regles_gestion 
WHERE rank = 2857
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,test_pct_intersection,status_cas,test_index_id_duplicate,test_list_num_voie,test_siren_insee_siren_inpi,test_siege,test_enseigne,test_similarite_exception_words,test_distance_levhenstein_exception_words,test_date,test_status_admin,rank
0,True,CAS_1,True,,False,False,,,,True,True,2857


In [26]:
query ="""
SELECT *
FROM ets_insee_inpi_no_duplicate 
WHERE rank = 2857
LIMIT 3
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,rank,min_rank,row_id,index_id,count_index,siren,siret,sequence_id,count_inpi_index_id_siret,count_inpi_siren_siret,count_initial_insee,test_index_id_duplicate,test_siren_insee_siren_inpi,adresse_distance_insee,adresse_distance_inpi,insee_except,inpi_except,intersection,union_,pct_intersection,index_id_max_intersection,status_cas,test_pct_intersection,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance,test_similarite_exception_words,test_distance_levhenstein_exception_words,list_numero_voie_matching_inpi,list_numero_voie_matching_insee,intersection_numero_voie,union_numero_voie,test_list_num_voie,enseigne,list_enseigne,list_enseigne_contain,test_enseigne,date_debut_activite,test_date,etablissementsiege,status_ets,test_siege,etatadministratifetablissement,status_admin,test_status_admin
0,2857,2857,1557520,957065,1,341514834,34151483400017,696218,2,2,1,True,False,SAP,SAP,,,1.0,1.0,0.75,0.75,CAS_1,True,,,,,,,,,,,,,,,,,1986-12-23,True,True,False,False,A,A,True
1,2857,2857,1618971,1423378,1,401008164,40100816400022,1718545,5,14,2,True,False,MAS AUDRAN,MAS AUDRAN,,,2.0,2.0,1.0,1.0,CAS_1,True,,,,,,,,,,,,,,,,,1998-01-01,True,True,False,False,A,A,True
2,2857,2857,1625962,623001,1,387551245,38755124500013,1345713,2,2,1,True,False,HAMEAU COURS,HAMEAU COURS,,,2.0,2.0,1.0,1.0,CAS_1,True,,,,,,,,,,,,,,,,,1992-04-23,True,True,False,False,A,A,True


### Rule 25%

In [27]:
query ="""
SELECT *
FROM rank_matrice_regles_gestion 
WHERE rank = 3995
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,test_pct_intersection,status_cas,test_index_id_duplicate,test_list_num_voie,test_siren_insee_siren_inpi,test_siege,test_enseigne,test_similarite_exception_words,test_distance_levhenstein_exception_words,test_date,test_status_admin,rank
0,True,CAS_1,False,True,True,True,,,,False,True,3995


In [28]:
query ="""
SELECT *
FROM ets_insee_inpi_no_duplicate 
WHERE rank = 3995
LIMIT 3
"""
s3.run_query(
            query=query,
            database=database,
            s3_output='INPI/sql_output',
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,rank,min_rank,row_id,index_id,count_index,siren,siret,sequence_id,count_inpi_index_id_siret,count_inpi_siren_siret,count_initial_insee,test_index_id_duplicate,test_siren_insee_siren_inpi,adresse_distance_insee,adresse_distance_inpi,insee_except,inpi_except,intersection,union_,pct_intersection,index_id_max_intersection,status_cas,test_pct_intersection,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance,test_similarite_exception_words,test_distance_levhenstein_exception_words,list_numero_voie_matching_inpi,list_numero_voie_matching_insee,intersection_numero_voie,union_numero_voie,test_list_num_voie,enseigne,list_enseigne,list_enseigne_contain,test_enseigne,date_debut_activite,test_date,etablissementsiege,status_ets,test_siege,etatadministratifetablissement,status_admin,test_status_admin
0,3995,3995,45937,344723,1,319773180,31977318000017,255592,1,1,1,False,True,RUE VIEILLE TEMPLE,RUE VIEILLE TEMPLE,,,3.0,3.0,1.0,1.0,CAS_1,True,,,,,,,,[125],[125],1.0,1.0,True,,,,,1980-09-23,False,True,True,True,A,A,True
1,3995,3995,83199,351201,1,339779720,33977972000020,650049,1,1,1,False,True,RUE CHAZELLES,RUE CHAZELLES,,,2.0,2.0,1.0,1.0,CAS_1,True,,,,,,,,[1],[1],1.0,1.0,True,,,,,1986-10-01,False,True,True,True,A,A,True
2,3995,3995,218927,407087,1,303275028,30327502800015,54303,1,1,1,False,True,RUE BAGNERES,RUE BAGNERES,,,2.0,2.0,0.33,0.33,CAS_1,True,,,,,,,,[50],[50],1.0,1.0,True,,,,,1971-01-01,False,True,True,True,A,A,True


### Rule 50%

In [29]:
query ="""
SELECT *
FROM rank_matrice_regles_gestion 
WHERE rank = 4481
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,test_pct_intersection,status_cas,test_index_id_duplicate,test_list_num_voie,test_siren_insee_siren_inpi,test_siege,test_enseigne,test_similarite_exception_words,test_distance_levhenstein_exception_words,test_date,test_status_admin,rank
0,True,CAS_1,False,True,False,True,,,,False,True,4481


In [30]:
query ="""
SELECT * 
FROM ets_insee_inpi_no_duplicate 
WHERE min_rank = 4481
LIMIT 5
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules_32141', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,rank,min_rank,row_id,index_id,count_index,siren,siret,sequence_id,count_inpi_index_id_siret,count_inpi_siren_siret,count_initial_insee,test_index_id_duplicate,test_siren_insee_siren_inpi,adresse_distance_insee,adresse_distance_inpi,insee_except,inpi_except,intersection,union_,pct_intersection,index_id_max_intersection,status_cas,test_pct_intersection,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance,test_similarite_exception_words,test_distance_levhenstein_exception_words,list_numero_voie_matching_inpi,list_numero_voie_matching_insee,intersection_numero_voie,union_numero_voie,test_list_num_voie,enseigne,list_enseigne,list_enseigne_contain,test_enseigne,date_debut_activite,test_date,etablissementsiege,status_ets,test_siege,etatadministratifetablissement,status_admin,test_status_admin
0,4481,4481,1034827,696051,1,331648089,33164808900026,501948,1,1,2,False,False,RUE SAINTOIS,RUE SAINTOIS,,,2.0,2.0,1.0,1.0,CAS_1,True,,,,,,,,[1],[1],1.0,1.0,True,,[SARL POIRSON JEAN FRANCOIS],False,,1985-02-01,False,True,True,True,A,A,True
1,4481,4481,1051818,589216,1,331759118,33175911800028,504291,1,1,2,False,False,RUE NOTRE DAME LORETTE,RUE NOTRE DAME LORETTE,,,4.0,4.0,1.0,1.0,CAS_1,True,,,,,,,,[10],[10],1.0,1.0,True,,,,,22/02/1985,False,False,False,True,F,F,True
2,4481,4481,1583004,1549551,1,352483341,35248334101338,979341,1,1,413,False,False,AVENUE LIBERATION,AVENUE LIBERATION,,,2.0,2.0,1.0,1.0,CAS_1,True,,,,,,,,[6],[6],1.0,1.0,True,,,,,2000-07-21,False,False,False,True,A,A,True
3,4481,4481,1584496,1251126,1,411016108,41101610800010,1955823,1,68,1,False,False,RUE GAMBETTA,RUE GAMBETTA,,,2.0,2.0,1.0,1.0,CAS_1,True,,,,,,,,[31],[31],1.0,1.0,True,,,,,1997-02-01,False,True,True,True,A,A,True
4,4481,4481,1585175,1497227,1,428816482,42881648200031,2418863,1,1,3,False,False,AVENUE JEAN JAURES,AVENUE JEAN JAURES,,,3.0,3.0,1.0,1.0,CAS_1,True,,,,,,,,"[118, 130]","[118, 130]",2.0,2.0,True,,,,,2001-12-31,False,True,True,True,A,A,True


### Rule 75%

In [31]:
query ="""
SELECT *
FROM rank_matrice_regles_gestion 
WHERE rank = 14335
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,test_pct_intersection,status_cas,test_index_id_duplicate,test_list_num_voie,test_siren_insee_siren_inpi,test_siege,test_enseigne,test_similarite_exception_words,test_distance_levhenstein_exception_words,test_date,test_status_admin,rank
0,True,CAS_3,False,,False,,,False,False,True,True,14335


In [34]:
query ="""
SELECT * 
FROM ets_insee_inpi_no_duplicate 
WHERE min_rank = 14336
LIMIT 5
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules_32141', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,rank,min_rank,row_id,index_id,count_index,siren,siret,sequence_id,count_inpi_index_id_siret,count_inpi_siren_siret,count_initial_insee,test_index_id_duplicate,test_siren_insee_siren_inpi,adresse_distance_insee,adresse_distance_inpi,insee_except,inpi_except,intersection,union_,pct_intersection,index_id_max_intersection,status_cas,test_pct_intersection,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance,test_similarite_exception_words,test_distance_levhenstein_exception_words,list_numero_voie_matching_inpi,list_numero_voie_matching_insee,intersection_numero_voie,union_numero_voie,test_list_num_voie,enseigne,list_enseigne,list_enseigne_contain,test_enseigne,date_debut_activite,test_date,etablissementsiege,status_ets,test_siege,etatadministratifetablissement,status_admin,test_status_admin


### Rule 80%

In [35]:
query ="""
SELECT *
FROM rank_matrice_regles_gestion 
WHERE rank = 27295
"""
s3.run_query(
            query=query,
            database=database,
            s3_output=s3_output,
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,test_pct_intersection,status_cas,test_index_id_duplicate,test_list_num_voie,test_siren_insee_siren_inpi,test_siege,test_enseigne,test_similarite_exception_words,test_distance_levhenstein_exception_words,test_date,test_status_admin,rank
0,True,CAS_5,False,True,True,True,,False,False,True,True,27295


In [36]:
query ="""
SELECT * 
FROM ets_insee_inpi_no_duplicate 
WHERE min_rank = 27295
LIMIT 5
"""
s3.run_query(
            query=query,
            database=database,
            s3_output='INPI/sql_output',
      filename = 'rules_32141', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,rank,min_rank,row_id,index_id,count_index,siren,siret,sequence_id,count_inpi_index_id_siret,count_inpi_siren_siret,count_initial_insee,test_index_id_duplicate,test_siren_insee_siren_inpi,adresse_distance_insee,adresse_distance_inpi,insee_except,inpi_except,intersection,union_,pct_intersection,index_id_max_intersection,status_cas,test_pct_intersection,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance,test_similarite_exception_words,test_distance_levhenstein_exception_words,list_numero_voie_matching_inpi,list_numero_voie_matching_insee,intersection_numero_voie,union_numero_voie,test_list_num_voie,enseigne,list_enseigne,list_enseigne_contain,test_enseigne,date_debut_activite,test_date,etablissementsiege,status_ets,test_siege,etatadministratifetablissement,status_admin,test_status_admin
0,27295,27295,1622900,1041714,1,441851482,44185148200014,2852302,1,1,1,False,True,IMPASSE BELLEVUE,IMPASSE BELLEVUE QUIMPERLE,,[QUIMPERLE],2.0,3.0,0.67,0.67,CAS_5,True,,,,,,False,False,[9],[9],1.0,1.0,True,,,,,2002-04-11,True,True,True,True,A,A,True
1,27295,27295,1027640,2317029,1,380559666,38055966600017,1160641,1,1,1,False,True,CHEMIN NOYERS,CHEMIN NOYERS FOSSE TIGNE,,"[FOSSE, TIGNE]",2.0,4.0,1.0,1.0,CAS_5,True,,,,,,False,False,[1],[1],1.0,1.0,True,,,,,1990-09-01,True,True,True,True,A,A,True
2,27295,27295,1140783,1833198,1,378886980,37888698000010,1099844,1,1,1,False,True,RUE MIELLE,RUE MIELLE GOUVILLE SUR MER,,"[GOUVILLE, SUR, MER]",2.0,5.0,0.33,0.33,CAS_5,True,,,,,,False,False,[45],[45],1.0,1.0,True,,,,,1990-05-04,True,True,True,True,A,A,True
3,27295,27295,1394644,1556730,1,400081485,40008148500015,1689136,1,1,1,False,True,RUE BOIS PRETRE MR VANNESSON MICHEL,RUE BOIS PRETRE,"[MR, VANNESSON, MICHEL]",,3.0,6.0,1.0,1.0,CAS_5,True,,,,,,False,False,[608],[608],1.0,1.0,True,,,,,1995-01-28,True,True,True,True,A,A,True
4,27295,27295,1753838,3527889,1,410446090,41044609000012,1937205,1,1,1,False,True,RUE THOMAS ALVA EDISON,RUE THOMAS ALVA EDISON ARLES,,[ARLES],4.0,5.0,1.0,1.0,CAS_5,True,,,,,,False,False,[22],[22],1.0,1.0,True,,,,,1996-11-29,True,True,True,True,A,A,True


### Rule 95%

In [37]:
query ="""
SELECT *
FROM rank_matrice_regles_gestion 
WHERE rank = 32021
"""
s3.run_query(
            query=query,
            database=database,
            s3_output='INPI/sql_output',
      filename = 'rules', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,test_pct_intersection,status_cas,test_index_id_duplicate,test_list_num_voie,test_siren_insee_siren_inpi,test_siege,test_enseigne,test_similarite_exception_words,test_distance_levhenstein_exception_words,test_date,test_status_admin,rank
0,False,CAS_1,True,True,False,False,,,,False,True,32021


In [38]:
query ="""
SELECT * 
FROM ets_insee_inpi_no_duplicate 
WHERE min_rank = 32021
LIMIT 5
"""
s3.run_query(
            query=query,
            database=database,
            s3_output='INPI/sql_output',
      filename = 'rules_32141', ## Add filename to print dataframe
      destination_key = None ### Add destination key if need to copy output
        )

Unnamed: 0,rank,min_rank,row_id,index_id,count_index,siren,siret,sequence_id,count_inpi_index_id_siret,count_inpi_siren_siret,count_initial_insee,test_index_id_duplicate,test_siren_insee_siren_inpi,adresse_distance_insee,adresse_distance_inpi,insee_except,inpi_except,intersection,union_,pct_intersection,index_id_max_intersection,status_cas,test_pct_intersection,unzip_inpi,unzip_insee,max_cosine_distance,key_except_to_test,levenshtein_distance,test_similarite_exception_words,test_distance_levhenstein_exception_words,list_numero_voie_matching_inpi,list_numero_voie_matching_insee,intersection_numero_voie,union_numero_voie,test_list_num_voie,enseigne,list_enseigne,list_enseigne_contain,test_enseigne,date_debut_activite,test_date,etablissementsiege,status_ets,test_siege,etatadministratifetablissement,status_admin,test_status_admin
0,32021,32021,944223,829673,1,391399953,39139995300010,1470475,5,8,1,True,False,AVENUE GARE,AVENUE GARE,,,2.0,2.0,0.0,1.0,CAS_1,False,,,,,,,,[3],[3],1.0,1.0,True,,,,,1994-07-01,False,True,False,False,A,A,True
1,32021,32021,952234,877590,1,338599855,33859985500024,618687,3,113,1,True,False,PLACE L EGLISE,PLACE L EGLISE,,,3.0,3.0,0.0,0.44,CAS_1,False,,,,,,,,[10],[10],1.0,1.0,True,,,,,2007-05-01,False,True,False,False,A,A,True
2,32021,32021,948542,804451,1,316111517,31611151700017,193799,3,3,1,True,False,BIS RUE BAUME,BIS RUE BAUME,,,3.0,3.0,0.25,1.0,CAS_1,False,,,,,,,,[2],[2],1.0,1.0,True,,,,,1979-07-05,False,True,False,False,A,A,True
3,32021,32021,971465,1313848,1,341407583,34140758300051,693316,2,2,4,True,False,RUE BOIS BOUQUIN,RUE BOIS BOUQUIN,,,3.0,3.0,0.0,1.0,CAS_1,False,,,,,,,,[7],[7],1.0,1.0,True,,,,,2000-09-01,False,True,False,False,A,A,True
4,32021,32021,956707,1070886,1,403854862,40385486200030,1815738,14,27,4,True,False,BOULEVARD DELESSERT,BOULEVARD DELESSERT,,,2.0,2.0,0.0,1.0,CAS_1,False,,,,,,,,[7],[7],1.0,1.0,True,,,,,01/01/1996,False,True,False,False,A,A,True


# Report generation

In [39]:
import os, time, shutil, urllib, ipykernel, json
from pathlib import Path
from notebook import notebookapp

In [40]:
def create_report(extension = "html", keep_code = False):
    """
    Create a report from the current notebook and save it in the 
    Report folder (Parent-> child directory)
    
    1. Extract the current notebook name
    2. Convert the notebook 
    3. Move the newly created report
    
    Args:
    extension: string. Can be "html", "pdf", "md"
    keep_code: bool. If True, keep the code cells in the report
    
    """
    
    ### Get notebook name
    connection_file = os.path.basename(ipykernel.get_connection_file())
    kernel_id = connection_file.split('-', 1)[0].split('.')[0]

    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:  
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
            notebookname = sessions[0]['name']
        except:
            pass  
    
    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)
    
    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)
    
    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)
    
    ### Generate notebook
    if keep_code:
        os.system('jupyter nbconvert --to {} {}'.format(
    extension,notebookname))
    else:
        os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))
    
    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report available at this address:\n {}".format(dest))

In [None]:
create_report(extension = "html",keep_code = True)