# Similarity test on the INSEE and INPI exception word lists (siretisation)

Objective(s)

* The goal of this task is to find a way to return the Word2Vec distance between two lists containing the words that are not shared by the INSEE and INPI addresses.
* Run the test when the variable status_cas equals CAS_5, CAS_6 or CAS_7.
* For example:
    * inpi_except: [A, B]
    * insee_except: [A, C]
    * Test pairs: [[A, A], [A, C], [B, A], [B, C]]
    * Output: [p1, p2, p3, p4]
    * Keep the maximum of the output list
    * Required variables:
        * inpi_except
        * insee_except
        * status_cas
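
The pairing step described above can be sketched as follows. This is a minimal sketch, not the notebook's implementation: the toy 2-dimensional vectors in `TOY_VECTORS`, the helper names `cosine` and `max_pair_similarity`, and the choice of cosine similarity as the score are assumptions made for illustration. In practice the vectors would come from a trained Word2Vec model.

```python
from itertools import product
from math import sqrt

# Hypothetical 2-d vectors standing in for trained Word2Vec embeddings.
TOY_VECTORS = {
    "RUE": (1.0, 0.0),
    "AVENUE": (0.8, 0.6),
    "CHEMIN": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_pair_similarity(inpi_except, insee_except, status_cas, vectors=TOY_VECTORS):
    """Return the highest similarity over all (inpi, insee) word pairs.

    Returns None when the case is not CAS_5, CAS_6 or CAS_7, when a list is
    missing or empty, or when no pair has both words in the vocabulary.
    """
    if status_cas not in {"CAS_5", "CAS_6", "CAS_7"}:
        return None
    if not inpi_except or not insee_except:
        return None
    # product() yields [A, A], [A, C], [B, A], [B, C] for [A, B] x [A, C].
    scores = [
        cosine(vectors[a], vectors[b])
        for a, b in product(inpi_except, insee_except)
        if a in vectors and b in vectors
    ]
    return max(scores) if scores else None


print(max_pair_similarity(["RUE"], ["AVENUE", "CHEMIN"], "CAS_6"))
```

Words missing from the vocabulary are skipped here; whether they should instead count as a zero score is a design choice left open by the task description.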

## Metadata

* Metadata parameters are available here: Ressources_suDYJ#_luZqd
* Task type:
  * Jupyter Notebook
* Users:
  * Thomas Pernet
* Watchers:
  * Thomas Pernet
* Estimated Log points:
  * One being a simple task, 15 a very difficult one
  *  14
* Task tag
  *  #sql-query,#matching,#siretisation,#machine-learning,#word2vec
* Toggl Tag
  * #poc
  
## Input Cloud Storage [AWS/GCP]

If link from the internet, save it to the cloud first

### Tables [AWS/BigQuery]

1. Batch 1:
    * Select Provider: Athena
      * Select table(s): ets_inpi_insee_cases
        * Select only tables created from the same notebook, else copy/paste selection to add new input tables
        * If table(s) does not exist, add them: Add New Table
        * Information:
          * Region: 
            * Name: Europe (Paris)
            * Code: eu-west-3
          * Database: inpi
          * Notebook construction file: [07_pourcentage_siretisation_v3](https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/02_siretisation/07_pourcentage_siretisation_v3.md)
    
## Destination Output/Delivery

* Athena: 
    * Region: Europe (Paris)
    * Database: inpi
    * Tables (Add name new table): ets_inpi_inse_wordvec

  
## Things to know (Steps, Attention points or new flow of information)

### Sources of information  (meeting notes, Documentation, Query, URL)

1. Jupyter Notebook (Github Link)
  1. md : [Test_word2Vec.md](https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/02_siretisation/Test_word2Vec.md)

## Server connection

In [1]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_athena import service_athena
from awsPy.aws_s3 import service_s3
from pathlib import Path
import pandas as pd
import numpy as np
import os, shutil
bucket = 'calfdata'
path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = r"{}/credential_AWS.json".format(parent_path)
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = 'eu-west-3')
client= con.client_boto()
s3 = service_s3.connect_S3(client = client,
                      bucket = 'calfdata', verbose = False) 
athena = service_athena.connect_athena(client = client,
                      bucket = 'calfdata') 

# Create the analysis table


In [2]:
drop_table = False
if drop_table:
    output = athena.run_query(
        query="DROP TABLE `ets_inpi_insee_cases`;",
        database='inpi',
        s3_output='INPI/sql_output'
    )

In [3]:
create_table = """
/* INSEE/INPI match: the 7 cases */
CREATE TABLE inpi.ets_inpi_insee_cases
WITH (
  format='PARQUET'
) AS
WITH test_proba AS (
  SELECT 
  count_initial_insee, 
    index_id, 
    sequence_id, 
    siren, 
    siret, 
    Coalesce(
      try(
        date_parse(
          datecreationetablissement, '%Y-%m-%d'
        )
      ), 
      try(
        date_parse(
          datecreationetablissement, '%Y-%m-%d %H:%i:%s.%f'
        )
      ), 
      try(
        date_parse(
          datecreationetablissement, '%Y-%m-%d %H:%i:%s'
        )
      ), 
      try(
        cast(
          datecreationetablissement as timestamp
        )
      )
    ) as datecreationetablissement, 
    Coalesce(
      try(
        date_parse(
          "date_début_activité", '%Y-%m-%d'
        )
      ), 
      try(
        date_parse(
          "date_début_activité", '%Y-%m-%d %H:%i:%s.%f'
        )
      ), 
      try(
        date_parse(
          "date_début_activité", '%Y-%m-%d %H:%i:%s'
        )
      ), 
      try(
        cast(
          "date_début_activité" as timestamp
        )
      )
    ) as date_debut_activite, 
    etatadministratifetablissement, 
    status_admin, 
    etablissementsiege, 
    status_ets, 
    codecommuneetablissement, 
    code_commune, 
    codepostaletablissement, 
    code_postal_matching, 
    numerovoieetablissement, 
    numero_voie_matching, 
    typevoieetablissement, 
    type_voie_matching, 
    adresse_distance_inpi, 
    adresse_distance_insee, 
    list_numero_voie_matching_inpi, 
    list_numero_voie_matching_insee, 
    array_distinct(
      split(adresse_distance_inpi, ' ')
    ) as list_inpi, 
    cardinality(
      array_distinct(
        split(adresse_distance_inpi, ' ')
      )
    ) as lenght_list_inpi, 
    array_distinct(
      split(adresse_distance_insee, ' ')
    ) as list_insee, 
    cardinality(
      array_distinct(
        split(adresse_distance_insee, ' ')
      )
    ) as lenght_list_insee, 
    array_distinct(
      array_except(
        split(adresse_distance_insee, ' '), 
        split(adresse_distance_inpi, ' ')
      )
    ) as insee_except, 
    array_distinct(
      array_except(
        split(adresse_distance_inpi, ' '), 
        split(adresse_distance_insee, ' ')
      )
    ) as inpi_except, 
    CAST(
      cardinality(
        array_distinct(
          array_intersect(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as intersection, 
    CAST(
      cardinality(
        array_distinct(
          array_union(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as union_,
  CAST(
      cardinality(
        array_distinct(
          array_intersect(
            list_numero_voie_matching_inpi,
            list_numero_voie_matching_insee
          )
        )
      ) AS DECIMAL(10, 2)
    ) as intersection_numero_voie,
  CAST(
      cardinality(
        array_distinct(
          array_union(
            list_numero_voie_matching_inpi, 
            list_numero_voie_matching_insee
          )
        )
      ) AS DECIMAL(10, 2)
    ) as union_numero_voie,
     REGEXP_REPLACE(
  NORMALIZE(
  enseigne, 
            NFD
          ), 
          '\pM', 
          ''
        ) AS enseigne,
  enseigne1etablissement, enseigne2etablissement, enseigne3etablissement, 
  array_remove(
array_distinct(
SPLIT(
  concat(
  enseigne1etablissement,',', enseigne2etablissement,',', enseigne3etablissement),
  ',')
  ), ''
  ) as test, 
  
contains( 
         array_remove(
array_distinct(
SPLIT(
  concat(
  enseigne1etablissement,',', enseigne2etablissement,',', enseigne3etablissement),
  ',')
  ), ''
  ),REGEXP_REPLACE(
  NORMALIZE(
  enseigne, 
            NFD
          ), 
          '\pM', 
          ''
        )
         ) AS temp_test_enseigne
  FROM 
    "inpi"."ets_insee_inpi" -- limit 10
    ) 
SELECT 
count_initial_insee,
  index_id, 
  sequence_id, 
  siren, 
  siret, 
  CASE WHEN cardinality(list_numero_voie_matching_inpi) = 0 THEN NULL ELSE list_numero_voie_matching_inpi END as list_numero_voie_matching_inpi, 
  CASE WHEN cardinality(list_numero_voie_matching_insee) = 0 THEN NULL ELSE list_numero_voie_matching_insee END as list_numero_voie_matching_insee,
  intersection_numero_voie,
  union_numero_voie,
  
  CASE WHEN intersection_numero_voie = union_numero_voie AND (intersection_numero_voie IS NOT NULL OR union_numero_voie IS NOT NULL) THEN 'True' 
  WHEN (intersection_numero_voie IS NULL OR union_numero_voie IS NULL) THEN 'NULL'
  ELSE 'False' END AS test_list_num_voie,
  
  datecreationetablissement, 
  date_debut_activite, 
  
  CASE WHEN datecreationetablissement = date_debut_activite THEN 'True' 
  WHEN datecreationetablissement IS NULL 
  OR date_debut_activite IS NULL  THEN 'NULL'
  --WHEN datecreationetablissement = '' 
  --OR date_debut_activite = ''   THEN 'NULL'
  ELSE 'False' 
  END AS test_date, 
  
  etatadministratifetablissement, 
  status_admin, 
  
  CASE WHEN etatadministratifetablissement = status_admin THEN 'True' 
  WHEN etatadministratifetablissement IS NULL 
  OR status_admin IS NULL  THEN 'NULL'
  WHEN etatadministratifetablissement = '' 
  OR status_admin = '' THEN 'NULL'
  ELSE 'False'  
  END AS test_status_admin, 
  
  etablissementsiege, 
  status_ets, 
  
  CASE WHEN etablissementsiege = status_ets THEN 'True' 
  WHEN etablissementsiege IS NULL 
  OR status_ets IS NULL  THEN 'NULL'
  WHEN etablissementsiege = '' 
  OR status_ets = ''   THEN 'NULL'
  ELSE 'False'  
  END AS test_siege, 
  
  codecommuneetablissement, 
  code_commune, 
  
  CASE WHEN codecommuneetablissement = code_commune THEN 'True' 
  WHEN codecommuneetablissement IS NULL 
  OR code_commune IS NULL  THEN 'NULL'
  WHEN codecommuneetablissement = '' 
  OR code_commune = ''   THEN 'NULL'
  ELSE 'False'  
  END AS test_code_commune, 
  
  codepostaletablissement, 
  code_postal_matching, 
  numerovoieetablissement, 
  numero_voie_matching, 
  
  CASE WHEN numerovoieetablissement = numero_voie_matching THEN 'True' 
  WHEN numerovoieetablissement IS NULL 
  OR numero_voie_matching IS NULL  THEN 'NULL'
  WHEN numerovoieetablissement = '' 
  OR numero_voie_matching = ''   THEN 'NULL'
  ELSE 'False'  
  END AS test_numero_voie, 
  
  typevoieetablissement, 
  type_voie_matching, 
  
  CASE WHEN typevoieetablissement = type_voie_matching THEN 'True' 
  WHEN typevoieetablissement IS NULL 
  OR type_voie_matching IS NULL  THEN 'NULL'
  WHEN typevoieetablissement = '' 
  OR type_voie_matching = ''   THEN 'NULL'
  ELSE 'False'  
  END AS test_type_voie, 
  
  CASE WHEN cardinality(list_inpi) = 0 THEN NULL ELSE list_inpi END as list_inpi,
  
  lenght_list_inpi, 
  
  CASE WHEN cardinality(list_insee) = 0 THEN NULL ELSE list_insee END as list_insee,
  lenght_list_insee, 
  
  CASE WHEN cardinality(inpi_except) = 0 THEN NULL ELSE inpi_except END as inpi_except,
  CASE WHEN cardinality(insee_except) = 0 THEN NULL ELSE insee_except END as insee_except,
   
  intersection, 
  union_, 
  CASE WHEN intersection = union_  THEN 'CAS_1' WHEN intersection = 0 THEN 'CAS_2' WHEN lenght_list_inpi = intersection 
  AND intersection != union_ THEN 'CAS_3' WHEN lenght_list_insee = intersection 
  AND intersection != union_ THEN 'CAS_4' WHEN cardinality(insee_except) = cardinality(inpi_except) 
  AND intersection != 0 
  AND cardinality(insee_except) > 0 THEN 'CAS_5' WHEN cardinality(insee_except) > cardinality(inpi_except) 
  AND intersection != 0 
  AND cardinality(insee_except) > 0 
  AND cardinality(inpi_except) > 0 THEN 'CAS_6' WHEN cardinality(insee_except) < cardinality(inpi_except) 
  AND intersection != 0 
  AND cardinality(insee_except) > 0 
  AND cardinality(inpi_except) > 0 THEN 'CAS_7' ELSE 'CAS_NO_ADRESSE' END AS status_cas,
  enseigne, enseigne1etablissement, enseigne2etablissement, enseigne3etablissement, 
  CASE WHEN cardinality(test) = 0 THEN 'NULL'
WHEN enseigne = '' THEN 'NULL'
WHEN temp_test_enseigne = TRUE THEN 'True'
ELSE 'False' END AS test_enseigne 
  
FROM 
  test_proba
"""
output = athena.run_query(
        query=create_table,
        database='inpi',
        s3_output='INPI/sql_output'
    )

Execution ID: 82767256-97cc-40ff-b5b1-e47146685685
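
The `status_cas` logic of the query above can be mirrored with Python sets to check a few addresses by hand. This is a minimal sketch under assumptions: the function name `classify_case` is hypothetical, addresses are taken to be already normalised and space-separated, and a missing address is mapped to `CAS_NO_ADRESSE` in the same way the SQL `ELSE` branch catches NULL comparisons.

```python
def classify_case(adresse_inpi, adresse_insee):
    """Mirror the status_cas CASE expression using Python sets."""
    if adresse_inpi is None or adresse_insee is None:
        return "CAS_NO_ADRESSE"

    # Equivalent of array_distinct(split(..., ' ')).
    list_inpi = set(adresse_inpi.split(" "))
    list_insee = set(adresse_insee.split(" "))

    intersection = len(list_inpi & list_insee)
    union = len(list_inpi | list_insee)
    inpi_except = list_inpi - list_insee    # words only in INPI
    insee_except = list_insee - list_inpi   # words only in INSEE

    if intersection == union:
        return "CAS_1"  # identical word sets
    if intersection == 0:
        return "CAS_2"  # no word in common
    if len(list_inpi) == intersection:
        return "CAS_3"  # every INPI word is in INSEE
    if len(list_insee) == intersection:
        return "CAS_4"  # every INSEE word is in INPI
    # From here on both exception lists are non-empty and intersection > 0.
    if len(insee_except) == len(inpi_except):
        return "CAS_5"
    if len(insee_except) > len(inpi_except):
        return "CAS_6"
    return "CAS_7"


print(classify_case("10 RUE VICTOR HUGO", "10 RUE HUGO BATIMENT A"))
```

Only CAS_5, CAS_6 and CAS_7 leave words in both `inpi_except` and `insee_except`, which is why the Word2Vec test is restricted to those cases.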


# Create tables per case

## Creation functions

The function below generates the analysis table with a query and returns a Pandas DataFrame, while storing the result in the following folders:

- [calfdata/TEMP_ANALYSE_SIRETISATION/INDEX_20](https://s3.console.aws.amazon.com/s3/buckets/calfdata/TEMP_ANALYSE_SIRETISATION/INDEX_20/?region=eu-west-3&tab=overview)
- [calfdata/TEMP_ANALYSE_SIRETISATION/INDEX_20_TRUE](https://s3.console.aws.amazon.com/s3/buckets/calfdata/TEMP_ANALYSE_SIRETISATION/INDEX_20_TRUE/?region=eu-west-3&tab=overview)

In [4]:
df_ = (pd.DataFrame(data = {'index_unique': range(1,21)})
       .to_csv('index_20.csv', index = False)
      )

s3.upload_file(file_to_upload = 'index_20.csv',
            destination_in_s3 = 'TEMP_ANALYSE_SIRETISATION/INDEX_20')

In [5]:
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS inpi.index_20 (
`index_unique`                     integer
    )
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '"'
   )
     LOCATION 's3://calfdata/TEMP_ANALYSE_SIRETISATION/INDEX_20'
     TBLPROPERTIES ('has_encrypted_data'='false',
              'skip.header.line.count'='1');"""
output = athena.run_query(
        query=create_table,
        database='inpi',
        s3_output='INPI/sql_output'
    )

Execution ID: b3bad8da-252a-46b8-b20d-9a9039040bf2


In [6]:
a = range(1,10)
b = ["True", "False", "NULL"]



index = pd.MultiIndex.from_product([a, b], names = ["index_unique", "groups"])

df_ = (pd.DataFrame(index = index)
       .reset_index()
       .sort_values(by = ["index_unique", "groups"])
       .to_csv('index_20_true.csv', index = False)
      )

s3.upload_file(file_to_upload = 'index_20_true.csv',
            destination_in_s3 = 'TEMP_ANALYSE_SIRETISATION/INDEX_20_TRUE')

In [7]:
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS inpi.index_20_true (
`index_unique`                     integer,
`groups`                     string

    )
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '"'
   )
     LOCATION 's3://calfdata/TEMP_ANALYSE_SIRETISATION/INDEX_20_TRUE'
     TBLPROPERTIES ('has_encrypted_data'='false',
              'skip.header.line.count'='1');"""
output = athena.run_query(
        query=create_table,
        database='inpi',
        s3_output='INPI/sql_output'
    )

Execution ID: 5fed7a4a-c207-4322-8f80-02bf28a6e9a4


### Functions

In [8]:
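def create_table_test_not_false(cas = "CAS_1"):
    """
    For a given case, count the index_id whose tests are not 'False',
    grouped by number of duplicates (index_unique, from the index_20 table).

    The query result is copied to ANALYSE_PRE_SIRETISATION in S3.
    Returns the raw DataFrame and a summary dictionary for index_unique = 1.
    """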
    top = """
    SELECT count_test_list_num_voie.status_cas,
    nb_unique_index, 
    index_unique,
    count_cas,
    test_list_num_voie,
    test_siege,
    test_enseigne,
    test_date, 
    test_status_admin,
    test_code_commune,
    test_type_voie
    FROM index_20 
    
    LEFT JOIN (
    SELECT count_, COUNT(count_) as count_cas
    FROM (
    SELECT COUNT(index_id) as count_
    FROM ets_inpi_insee_cases 
    WHERE status_cas = '{0}'
    GROUP BY index_id
    ORDER BY count_ DESC
  )
  GROUP BY count_
  ORDER BY count_
  ) AS count_unique
  ON index_20.index_unique = count_unique.count_ 
    """.format(cas)
    query = """
    LEFT JOIN (
    SELECT status_cas,count_index,  count(count_index) AS {1}
    FROM (
    SELECT status_cas, index_id, COUNT(test_enseigne) as count_index
    FROM ets_inpi_insee_cases 
    WHERE status_cas = '{0}' AND  {1} != 'False'
    GROUP BY status_cas, index_id
      ) as c
      GROUP BY status_cas, count_index
      ORDER BY count_index
      ) AS count_{1}
      ON index_20.index_unique = count_{1}.count_index 
    """
    
    bottom = """
    LEFT JOIN (
    SELECT  DISTINCT(status_cas), COUNT(DISTINCT(index_id)) as nb_unique_index
    FROM ets_inpi_insee_cases 
    WHERE status_cas = '{0}' 
    GROUP BY status_cas
    ) as index_unique
    ON index_unique.status_cas = count_test_list_num_voie.status_cas
    ORDER BY index_unique
    """.format(cas)

    for i, table in enumerate(["test_list_num_voie",
              "test_siege",
              "test_enseigne",
              "test_date", "test_status_admin", "test_code_commune", "test_type_voie"]):

        top += query.format(cas, table)
    top += bottom
    
    ### run query
    output = athena.run_query(
        query=top,
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    filename = 'table_{}_test_not_false.csv'.format(cas)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'ANALYSE_PRE_SIRETISATION',
                                filename
                            )
        
        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )
        
    #filename = 'table_{}_test_not_false.csv'.format('CAS_1')
    index_unique_inpi = 10981811
    reindex= ['status_cas','nb_unique_index', 'index_unique','count_cas',
              'test_list_num_voie',
              'count_num_voie',
              'test_siege',
              'count_siege',
           'test_enseigne',
               'count_enseigne',
              'test_date',
               'count_date',
              'test_status_admin',
              'count_admin',
              'test_code_commune',
              'count_code_commune',
           'test_type_voie',
              'count_type_voie']
    test_1 = (s3.read_df_from_s3(
            key = 'ANALYSE_PRE_SIRETISATION/{}'.format(filename), sep = ',')
             )
    
    df_ = (
        test_1
     .assign(

         count_num_voie = lambda x: x['test_list_num_voie'] /  index_unique_inpi,
         count_siege = lambda x: x['test_siege'] /  index_unique_inpi,
         count_enseigne = lambda x: x['test_enseigne'] /  index_unique_inpi,
         count_date = lambda x: x['test_date'] /  index_unique_inpi,
         count_admin = lambda x: x['test_status_admin'] /  index_unique_inpi,
         count_code_commune = lambda x: x['test_code_commune'] /  index_unique_inpi,
         count_type_voie = lambda x: x['test_type_voie'] /  index_unique_inpi,
         status_cas = lambda x: x['status_cas'].ffill(),
         nb_unique_index = lambda x: x['nb_unique_index'].ffill()
     )
     .reindex(columns = reindex)
     .fillna(0)
                  .style
                  .format("{:,.0f}", subset =  [
                      "nb_unique_index",
                      "count_cas",
                      'test_list_num_voie',
                                                'test_siege',
                                                'test_enseigne',
                                                'test_date',
                                                'test_status_admin',
                                                'test_code_commune',
                                                'test_type_voie'])
                  .format("{:.2%}", subset =  ['count_num_voie',
                                               'count_siege',
                                               'count_enseigne',
                                               'count_date',
                                               'count_admin',
                                               'count_code_commune',
                                               'count_type_voie'])
                  .bar(subset= ['count_num_voie',
                                               'count_siege',
                                               'count_enseigne',
                                               'count_date',
                                               'count_admin',
                                               'count_code_commune',
                                               'count_type_voie'],
                       color='#d65f5f')
     )
    
    unique_1 = test_1.loc[lambda x: x['index_unique'].isin([1])]
    dic_ = {
    
    'nb_index_unique_{}'.format(cas): int(unique_1['nb_unique_index'].values[0]),
     'index_unique_inpi':index_unique_inpi,   
    'lignes_matches': {   
        'lignes_matche_list_num': int(unique_1['test_list_num_voie'].values[0]),
    'lignes_matche_list_num_pct': unique_1['test_list_num_voie'].values[0] / index_unique_inpi
    },    
    'lignes_a_trouver': {
        'test_list_num_voie':[
            int((unique_1['nb_unique_index'].values - unique_1['test_list_num_voie'].values)[0]),
            (unique_1['test_list_num_voie'].values / unique_1['nb_unique_index'].values)[0]
        ],
        'test_siege':[
            int((unique_1['nb_unique_index'].values - unique_1['test_siege'].values)[0]),
            (unique_1['test_siege'].values / unique_1['nb_unique_index'].values)[0]
        ],
        'test_enseigne':[
            int((unique_1['nb_unique_index'].values - unique_1['test_enseigne'].values)[0]),
            (unique_1['test_enseigne'].values / unique_1['nb_unique_index'].values)[0]
        ],
        'test_date':[
            int((unique_1['nb_unique_index'].values - unique_1['test_date'].values)[0]),
            (unique_1['test_date'].values / unique_1['nb_unique_index'].values)[0]
        ],
        'status_admin':[
            int((unique_1['nb_unique_index'].values - unique_1['test_status_admin'].values)[0]),
            (unique_1['test_status_admin'].values / unique_1['nb_unique_index'].values)[0]
        ],
        'test_code_commune':[
            int((unique_1['nb_unique_index'].values - unique_1['test_code_commune'].values)[0]),
            (unique_1['test_code_commune'].values / unique_1['nb_unique_index'].values)[0]
        ],
        'test_type_voie':[
            int((unique_1['nb_unique_index'].values - unique_1['test_type_voie'].values)[0]),
            (unique_1['test_type_voie'].values / unique_1['nb_unique_index'].values)[0]
        ],
    }
}
    
    return test_1, dic_

In [9]:
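def table_list_num_other_tests(cas = 'CAS_1'):
    """
    For the rows of a given case that passed test_list_num_voie (result not
    'False'), count the rows by number of duplicates (index_unique) and by
    result (True, False, NULL) of each test.

    The query result is copied to ANALYSE_PRE_SIRETISATION in S3.
    Returns a styled pandas table with the counts and their shares.
    """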
    top = """
    SELECT 
    count_test_siege.status_cas,
    index_unique, 
    groups, 
    cnt_test_list_num_voie,
    cnt_test_siege,
    cnt_test_enseigne,
    cnt_test_date, 
    cnt_test_status_admin,
    cnt_test_code_commune,
    cnt_test_type_voie
    FROM index_20_true 
    """
    
    query = """
    -- {0}
    LEFT JOIN (
    SELECT status_cas, count_index,{0}, COUNT(index_id) as cnt_{0}
    FROM (
    SELECT ets_inpi_insee_cases.status_cas, count_index, ets_inpi_insee_cases.index_id, {0}
    FROM ets_inpi_insee_cases
    RIGHT JOIN (
    SELECT *
    FROM(
    SELECT status_cas, index_id, COUNT(index_id) as count_index
    FROM ets_inpi_insee_cases 
    WHERE status_cas = '{1}' AND  test_list_num_voie != 'False'
    GROUP BY status_cas, index_id
  )
  ) as index_
  ON ets_inpi_insee_cases.status_cas = index_.status_cas AND
  ets_inpi_insee_cases.index_id = index_.index_id
  WHERE ets_inpi_insee_cases.status_cas = '{1}' AND  test_list_num_voie != 'False'
  ) 
  GROUP BY status_cas, count_index, {0}
  ) as count_{0}
  ON index_20_true.index_unique = count_{0}.count_index AND
  index_20_true.groups = count_{0}.{0}
 
    """
    
    bottom =   """ORDER BY index_unique, groups"""
    for i, table in enumerate(["test_list_num_voie",
              "test_siege",
              "test_enseigne",
              "test_date", "test_status_admin", "test_code_commune", "test_type_voie"]):

        top += query.format(table, cas)
    top += bottom
    ### run query
    output = athena.run_query(
        query=top,
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    filename = 'table_{}_num_voie_test_not_false.csv'.format(cas)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'ANALYSE_PRE_SIRETISATION',
                                filename
                            )
        
        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )
    reindex= ['status_cas',
          'index_unique',
          'groups',
              "total_rows",
              'cnt_test_list_num_voie',
              'count_list_num_voie',
              'cnt_test_siege',
              'count_siege',
           'cnt_test_enseigne',
               'count_enseigne',
              'cnt_test_date',
               'count_date',
              'cnt_test_status_admin',
              'count_admin',
              'cnt_test_code_commune',
              'count_code_commune',
           'cnt_test_type_voie',
              'count_type_voie']

    test_1 = (s3.read_df_from_s3(
            key = 'ANALYSE_PRE_SIRETISATION/{}'.format(filename), sep = ',')
          .assign(
         total_rows = lambda x: x['cnt_test_siege'].groupby(x['index_unique']).transform('sum'),
         count_list_num_voie = lambda x: x['cnt_test_list_num_voie'] /  x['total_rows'],
         count_siege = lambda x: x['cnt_test_siege'] /  x['total_rows'],
         count_enseigne = lambda x: x['cnt_test_enseigne'] /  x['total_rows'],
         count_date = lambda x: x['cnt_test_date'] /  x['total_rows'],
         count_admin = lambda x: x['cnt_test_status_admin'] /  x['total_rows'],
         count_code_commune = lambda x: x['cnt_test_code_commune'] /  x['total_rows'],
         count_type_voie = lambda x: x['cnt_test_type_voie'] /  x['total_rows'],
         status_cas = lambda x: x['status_cas'].ffill(),
         groups = lambda x: x['groups'].fillna('Null')
          )
          .reindex(columns = reindex)
          .fillna(0)
          .style
                  .format("{:,.0f}", subset =  ['total_rows',
                                                'cnt_test_list_num_voie',
                                                'cnt_test_siege',
                                                'cnt_test_enseigne',
                                                'cnt_test_date',
                                                'cnt_test_status_admin',
                                                'cnt_test_code_commune',
                                                'cnt_test_type_voie'])
                  .format("{:.2%}", subset =  ['count_list_num_voie',
                                               'count_siege',
                                               'count_enseigne',
                                               'count_date',
                                               'count_admin',
                                               'count_code_commune',
                                               'count_type_voie'])
                  .bar(subset= ['count_list_num_voie',
                                               'count_siege',
                                               'count_enseigne',
                                               'count_date',
                                               'count_admin',
                                               'count_code_commune',
                                               'count_type_voie'],
                       color='#d65f5f')
             )
    
    return test_1

In [10]:
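def filter_list_num_test_false(cas = 'CAS_1',test = 'test_type_voie'):
    """
    Return up to 10 example rows of a given case that passed
    test_list_num_voie, have no duplicate (count_index = 1) and failed
    the given test (result 'False').

    The query result is copied to ANALYSE_PRE_SIRETISATION in S3.
    """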
    
    to_append = """count_initial_insee, index_id, sequence_id, siren, siret,
             list_inpi, list_insee,etablissementsiege, status_ets,
             enseigne, enseigne1etablissement, enseigne2etablissement,
             enseigne3etablissement, datecreationetablissement,
             date_debut_activite, etatadministratifetablissement, status_admin,
             typevoieetablissement, type_voie_matching"""

    for i, value in enumerate(["test_siege", "test_enseigne", "test_date", "test_status_admin", "test_type_voie"]):
        if value not in [test]:
            to_append += ",{}".format(value) 
    
    query = """
    SELECT  

count_initial_insee,filter_a.index_id, sequence_id, siren, siret,list_inpi, list_insee,
etablissementsiege, status_ets, test_siege, 
enseigne, enseigne1etablissement, enseigne2etablissement, enseigne3etablissement, test_enseigne, 
datecreationetablissement, date_debut_activite, test_date, 
etatadministratifetablissement, status_admin, test_status_admin, 
test_type_voie, typevoieetablissement, type_voie_matching 

    FROM (
    SELECT ets_inpi_insee_cases.status_cas, count_index, ets_inpi_insee_cases.index_id, {1}
    FROM ets_inpi_insee_cases
    RIGHT JOIN (
    SELECT *
    FROM(
    SELECT status_cas, index_id, COUNT(index_id) as count_index
    FROM ets_inpi_insee_cases 
    WHERE status_cas = '{0}' AND  test_list_num_voie != 'False'
    GROUP BY status_cas, index_id
  )
      WHERE count_index = 1
  ) as index_
  ON ets_inpi_insee_cases.status_cas = index_.status_cas AND
  ets_inpi_insee_cases.index_id = index_.index_id
  WHERE ets_inpi_insee_cases.status_cas = '{0}' AND  test_list_num_voie != 'False'
  ) as filter_a
  
  LEFT JOIN (
    
    SELECT {2}
    
    FROM ets_inpi_insee_cases
    WHERE ets_inpi_insee_cases.status_cas = '{0}' AND  test_list_num_voie != 'False'
    ) as filter_b
    ON filter_a.index_id = filter_b.index_id
    WHERE {1} = 'False'
    LIMIT 10
    """
    #print(query.format(cas, test,to_append))
    output = athena.run_query(
        query=query.format(cas, test,to_append),
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    filename = 'table_{0}_{1}_example_filter.csv'.format(cas, test)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'ANALYSE_PRE_SIRETISATION',
                                filename
                            )
        
        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )
    
    test_1 = (s3.read_df_from_s3(
            key = 'ANALYSE_PRE_SIRETISATION/{}'.format(filename), sep = ',')
             )
    
    return test_1
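
The three functions above poll `s3.copy_object_s3` in an unbounded `while` loop until it returns `True`. A bounded variant could look like the sketch below. The helper name `copy_with_retry` and its parameters are hypothetical, and the sketch assumes the copy call returns `True` on success, as the `results != True` test suggests; it is not part of awsPy.

```python
import time


def copy_with_retry(copy_fn, max_attempts=10, delay_seconds=1.0):
    """Call copy_fn() until it returns True; return the number of attempts.

    Raises TimeoutError when max_attempts is reached without success.
    """
    for attempt in range(1, max_attempts + 1):
        if copy_fn() is True:
            return attempt
        time.sleep(delay_seconds)
    raise TimeoutError(
        "copy did not succeed after {} attempts".format(max_attempts)
    )


# Stand-in for the S3 copy: fails twice, then succeeds (illustration only).
_state = {"calls": 0}


def fake_copy():
    _state["calls"] += 1
    return _state["calls"] >= 3


print(copy_with_retry(fake_copy, max_attempts=5, delay_seconds=0))
```

In the notebook, `copy_fn` would presumably wrap the existing call, for instance `lambda: s3.copy_object_s3(source_key = source_key, destination_key = destination_key, remove = True)`.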
    
    
    

# Analysis

## Number of observations per case

The number of observations must match the following:

|   Case | Title                     |   Total |   Cumulative total |   Percentage |   Cumulative percentage | Comment                   |
|-------:|:--------------------------|--------:|-------------------:|-------------:|------------------------:|:--------------------------|
|      1 | Perfect similarity        | 7775392 |            7775392 |    0.670261  |                0.670261 | Perfect match             |
|      2 | Perfect exclusion         |  974444 |            8749836 |    0.0839998 |                0.75426  | Perfect exclusion         |
|      3 | Perfect partial match     |  407404 |            9157240 |    0.0351194 |                0.78938  | Perfect partial match     |
|      4 | Perfect partial match     |  558992 |            9716232 |    0.0481867 |                0.837566 | Perfect partial match     |
|      5 | Complicated partial match | 1056406 |           10772638 |    0.0910652 |                0.928632 | Complicated partial match |
|      6 | Complicated partial match |  361242 |           11133880 |    0.0311401 |                0.959772 | Complicated partial match |
|      7 | Complicated partial match |  466671 |           11600551 |    0.0402283 |                1        | Complicated partial match |
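
The cumulative and percentage columns of the table above follow directly from the per-case totals. The sketch below recomputes them so that a fresh `COUNT(*)` per `status_cas` can be compared against the table; the totals are copied from the table and the variable names are illustrative.

```python
from itertools import accumulate

# Per-case totals, copied from the table above.
totals = {
    1: 7775392,
    2: 974444,
    3: 407404,
    4: 558992,
    5: 1056406,
    6: 361242,
    7: 466671,
}

grand_total = sum(totals.values())
cumulative = list(accumulate(totals.values()))
shares = [total / grand_total for total in totals.values()]
cumulative_shares = [cum / grand_total for cum in cumulative]

for cas, total, cum, share, cum_share in zip(
    totals, totals.values(), cumulative, shares, cumulative_shares
):
    print(f"CAS_{cas}: {total:>9,} {cum:>11,} {share:.6f} {cum_share:.6f}")
```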

## Number of establishments per case

In [11]:
query = """
SELECT status_cas, COUNT(*) as count
FROM ets_inpi_insee_cases 
GROUP BY status_cas
"""

## Number of unique INSEE establishments per case

In [12]:
query = """
SELECT status_cas, COUNT(DISTINCT(index_id)) as distinct_ets
FROM ets_inpi_insee_cases 
GROUP BY status_cas
ORDER BY status_cas
"""

In [13]:
query = """
SELECT * 
FROM (
SELECT status_cas, count_initial_insee, COUNT(*) as count
FROM ets_inpi_insee_cases 
GROUP BY status_cas, count_initial_insee
  )
  WHERE count_initial_insee = 1
ORDER BY status_cas, count_initial_insee
"""

## Distribution of sum_enseigne

In [14]:
query = """
SELECT 
  approx_percentile(sum_enseigne, ARRAY[0.25,0.50,0.75,.80,.85,.86,.87, .88, .89,.90,0.95, 0.99]) as sum_enseigne
FROM 
  ets_inpi_insee_cases 
"""

# Case analysis

Explanation:

- Dictionary:
    - 

- Table 1:
    - nb_unique_index: Number of unique indexes for a given case. E.g. there are 7,584,503 unique indexes for case 1.
    - index_unique: Duplicate level, ranging from 1 (no duplicate) to 20. A value greater than 1 marks the rows with 2, 3, 4, etc. duplicates.
    - count_cas: Number of rows per case and index_unique. For example, case 1 has 128,821 rows with two duplicates for a given index.
    - `test_*`: Number of rows whose test result is not false, for each duplicate level. For example, 7,471,838 rows passed test_list_num_voie and have no duplicate.
    - `count_*`: test_* / nb_unique_index. Share of rows with a conclusive test relative to the number of unique indexes. Refer to row 0.
- Table 2:
    - index_unique: Same as index_unique in Table 1.
    - groups: Possible test results: True, False, NULL. NULL when the variables hold no information to run the test.
    - total_rows: Number of rows that passed test_list_num_voie. The figure must match test_list_num_voie, row 0.
    - `cnt_test_*`: Number of rows that passed test_list_num_voie, broken down by result for each test. For example, among the 7,471,838 rows without duplicates, 3,037,959 have test_siege equal to False.
    - `count_*`: cnt_test_* / total_rows. Share of rows per test result relative to the number of rows that passed test_list_num_voie, broken down by duplicate level.
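
As a worked example of the two `count_*` ratios, the sketch below plugs in the case 1 figures quoted in the explanation above. The helper `ratio` and the variable names are illustrative only; the notebook computes these ratios in SQL.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

# Case 1, rows without duplicates (index_unique = 1).
nb_unique_index = 7_584_503
test_list_num_voie = 7_471_838  # Table 1, row 0
cnt_test_siege = 3_037_959      # Table 2, one result group of test_siege

# Table 1: count_* = test_* / nb_unique_index
count_list_num_voie = ratio(test_list_num_voie, nb_unique_index)

# Table 2: count_* = cnt_test_* / total_rows
count_siege = ratio(cnt_test_siege, test_list_num_voie)

print(f"{count_list_num_voie:.2%}")  # 98.51%
print(f"{count_siege:.2%}")          # 40.66%
```

The denominators differ between the two tables: Table 1 divides by the number of unique indexes of the case, while Table 2 divides by the rows that already passed test_list_num_voie.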
    

## Case 01: Perfect similarity

* Definition: The words in the INPI address are the same as the words in the INSEE address
* Math definition: $\frac{|INSEE \cap INPI|}{|INSEE|+|INPI|-|INSEE \cap INPI|} = 1$
* Rule: $\text{intersection} = \text{union} \rightarrow \text{case 1}$
* Query [case 1](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/24e58c22-4a67-4a9e-b98d-4eb9d65e7f27)

| list_inpi              | list_insee             | insee_except | intersection | union_ |
|------------------------|------------------------|--------------|--------------|--------|
| [BOULEVARD, HAUSSMANN] | [BOULEVARD, HAUSSMANN] | []           | 2            | 2      |
| [QUAI, GABUT]          | [QUAI, GABUT]          | []           | 2            | 2      |
| [BOULEVARD, VOLTAIRE]  | [BOULEVARD, VOLTAIRE]  | []           | 2            | 2      |
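
The ratio in the math definition is the Jaccard index of the two word lists. The sketch below recomputes `intersection`, `union_` and the ratio in Python, assuming each address is treated as a set of distinct words (the actual computation lives in the Athena query linked above; the function name `jaccard` is hypothetical).

```python
def jaccard(list_inpi, list_insee):
    """Return (intersection, union, ratio) for two lists of address words."""
    inpi, insee = set(list_inpi), set(list_insee)
    intersection = len(inpi & insee)
    # |INSEE| + |INPI| - |INSEE ∩ INPI|, as in the math definition.
    union = len(insee) + len(inpi) - intersection
    score = intersection / union if union else 0.0
    return intersection, union, score

# Row taken from the table above: intersection == union, hence case 1.
print(jaccard(["QUAI", "GABUT"], ["QUAI", "GABUT"]))

# A pair that is not case 1: one INPI word is missing on the INSEE side.
print(jaccard(["ROUTE", "D", "ENGHIEN"], ["ROUTE", "ENGHIEN"]))
```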

In [15]:
tb1, dic_tb1 = create_table_test_not_false(cas = "CAS_1")

Execution ID: 0d2d46e7-acc5-48a2-898f-c25235b76151


In [16]:
dic_tb1

{'nb_index_unique_CAS_1': 7584503,
 'index_unique_inpi': 10981811,
 'lignes_matches': {'lignes_matche_list_num': 7471838,
  'lignes_matche_list_num_pct': 0.6803830442902359},
 'lignes_a_trouver': {'test_list_num_voie': [112665, 0.985145368127615],
  'test_siege': [3083527, 0.5934437628938903],
  'test_enseigne': [189729, 0.9749846496204168],
  'test_date': [1454763, 0.8081927055734568],
  'status_admin': [1337668, 0.8236314231796071],
  'test_code_commune': [143984, 0.9810160270224694],
  'test_type_voie': [158062, 0.9791598737583729]}}

In [17]:
table_list_num_other_tests(cas = 'CAS_1')

Execution ID: fbd000eb-0971-454a-9069-e52d94b42064


Unnamed: 0,status_cas,index_unique,groups,total_rows,cnt_test_list_num_voie,count_list_num_voie,cnt_test_siege,count_siege,cnt_test_enseigne,count_enseigne,cnt_test_date,count_date,cnt_test_status_admin,count_admin,cnt_test_code_commune,count_code_commune,cnt_test_type_voie,count_type_voie
0,CAS_1,1,False,7471838,0,0.00%,3037959,40.66%,60250,0.81%,1402056,18.76%,1268334,16.97%,3071,0.04%,17107,0.23%
1,CAS_1,1,Null,7471838,959963,12.85%,0,0.00%,6847485,91.64%,2367000,31.68%,0,0.00%,110250,1.48%,752809,10.08%
2,CAS_1,1,True,7471838,6511875,87.15%,4433879,59.34%,564103,7.55%,3702782,49.56%,6203504,83.03%,7358517,98.48%,6701922,89.70%
3,CAS_1,2,False,75724,0,0.00%,40671,53.71%,2778,3.67%,23492,31.02%,19759,26.09%,22,0.03%,273,0.36%
4,CAS_1,2,Null,75724,19040,25.14%,0,0.00%,66709,88.09%,29908,39.50%,0,0.00%,1338,1.77%,13622,17.99%
5,CAS_1,2,True,75724,56684,74.86%,35053,46.29%,6237,8.24%,22324,29.48%,55965,73.91%,74364,98.20%,61829,81.65%
6,CAS_1,3,False,2586,0,0.00%,722,27.92%,477,18.45%,1496,57.85%,812,31.40%,0,0.00%,2,0.08%
7,CAS_1,3,Null,2586,1364,52.75%,0,0.00%,1768,68.37%,192,7.42%,0,0.00%,60,2.32%,797,30.82%
8,CAS_1,3,True,2586,1222,47.25%,1864,72.08%,341,13.19%,898,34.73%,1774,68.60%,2526,97.68%,1787,69.10%
9,CAS_1,4,False,644,0,0.00%,162,25.16%,124,19.25%,299,46.43%,211,32.76%,0,0.00%,8,1.24%


In [18]:
pd.set_option('display.max_columns', None) 

In [19]:
filter_list_num_test_false(cas = 'CAS_1',test = 'test_enseigne')

Execution ID: 2ad34fd6-7213-4546-9967-93b3d83fac85


Unnamed: 0,count_initial_insee,index_id,sequence_id,siren,siret,list_inpi,list_insee,etablissementsiege,status_ets,test_siege,enseigne,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,test_enseigne,datecreationetablissement,date_debut_activite,test_date,etatadministratifetablissement,status_admin,test_status_admin,test_type_voie,typevoieetablissement,type_voie_matching
0,4353,5124614,2402264,428268023,42826802300374,"[BOULEVARD, GAMBETTA]","[BOULEVARD, GAMBETTA]",False,False,True,SUPERMARCHE CASINO,CASINO,,,False,2000-07-01 00:00:00.000,2000-07-01 00:00:00.000,True,A,A,True,True,BD,BD
1,10,6086459,2477569,430426502,43042650200067,"[COURS, FONTANAROSA]","[COURS, FONTANAROSA]",False,False,True,RESDENCE LES MYRTILLES,RESIDENCE LES MYRTILLES,,,False,2011-01-01 00:00:00.000,2011-01-01 00:00:00.000,True,F,A,False,True,CRS,CRS
2,2,4237714,2332279,423684786,42368478600027,"[ROUTE, MARQUIXANES, IMPASSE, CASTELLANE]","[ROUTE, MARQUIXANES, IMPASSE, CASTELLANE]",True,False,False,SOCIETE D'EXPLOITATION DES ETS SALVAT,SOCIETE D'EXPLOITATION DES ET-S SAL,,,False,2013-04-01 00:00:00.000,1999-07-01 00:00:00.000,False,A,A,True,True,RTE,RTE
3,1,5678224,2659126,437518384,43751838400015,"[QUARTIER, ILES]","[QUARTIER, ILES]",True,False,False,DOMAINE MUCYN,EARL LES BATELIERS DU RHONE,,,False,2001-04-14 00:00:00.000,2001-04-14 00:00:00.000,True,A,A,True,,,QUA
4,2,1112430,2410792,428647754,42864775400020,"[RUE, CHARDONS, BLEUS]","[RUE, CHARDONS, BLEUS]",True,True,True,"""M.E.J.""",M.E.J.,,,False,2016-10-01 00:00:00.000,2000-01-03 00:00:00.000,False,A,A,True,True,RUE,RUE
5,1,1700991,243854,319147286,31914728600011,"[AVENUE, MICHEL, D, ORNANO]","[AVENUE, MICHEL, D, ORNANO]",True,False,False,HOTEL L'EPI D'OR - REST'O RIPAILLES,HOTEL L'EPI D'OR,REST'O RIPAILLES,,False,1980-05-28 00:00:00.000,1980-05-28 00:00:00.000,True,F,A,False,True,AV,AV
6,2,697005,522338,332559871,33255987100048,"[ALLEE, ANTOINE, BOURDELLE]","[ALLEE, ANTOINE, BOURDELLE]",True,False,False,KAYACIK ALAIN,WEB-RETAIL,,,False,2017-01-30 00:00:00.000,2017-01-30 00:00:00.000,True,A,A,True,True,ALL,ALL
7,4,2658232,2316877,423330885,42333088500041,"[AVENUE, PONT, FRANCE]","[AVENUE, PONT, FRANCE]",True,False,False,LE CHENE VERT,PUB DES CARS,,,False,2017-08-22 00:00:00.000,2016-08-01 00:00:00.000,False,F,A,False,True,AV,AV
8,9,3608078,2319952,423398577,42339857700084,"[BIS, RUE, GUSTAVE, EIFFEL]","[BIS, RUE, GUSTAVE, EIFFEL]",False,False,True,DFC²,DFC,,,False,2011-11-02 00:00:00.000,2011-11-02 00:00:00.000,True,F,F,True,True,RUE,RUE
9,1,8169116,4765936,513401208,51340120800014,"[PLACE, MARTIN, NADAUD]","[PLACE, MARTIN, NADAUD]",True,False,False,BISTROT DE LA PLACE,BISTOT DE LA PLACE,,,False,2009-04-01 00:00:00.000,2009-06-20 00:00:00.000,False,A,A,True,True,PL,PL


## Case 03: Perfect intersection INPI

* Definition: All the words in the INPI address are contained in the INSEE address
* Math definition: $\frac{|INPI|}{|INSEE \cap INPI|} = 1 \text{ and } |INSEE \cap INPI| \neq |INSEE \cup INPI|$
* Query [case 3](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/7fb420a1-5f50-4256-a2ba-b8c7c2b63c9b)
* Rule: $|\text{list\_inpi}| = \text{intersection} \text{ and } \text{intersection} \neq \text{union} \rightarrow \text{case 3}$
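
The inclusion rule can be checked on a single pair of word lists as follows. This is a sketch under two assumptions that the text does not state explicitly: words are compared as sets, and an empty intersection is never classified here. The function name `inclusion_case` is hypothetical, and the same function also covers the symmetric INSEE rule (case 4, described further down).

```python
def inclusion_case(list_inpi, list_insee):
    """Return 'CAS_3', 'CAS_4' or None for a pair of address word lists."""
    inpi, insee = set(list_inpi), set(list_insee)
    intersection = len(inpi & insee)
    union = len(inpi | insee)

    if intersection == 0 or intersection == union:
        # No shared word, or identical word sets (case 1): not an inclusion case.
        return None
    if len(inpi) == intersection:
        return "CAS_3"  # every INPI word is found in the INSEE address
    if len(insee) == intersection:
        return "CAS_4"  # every INSEE word is found in the INPI address
    return None

print(inclusion_case(["CHEMIN", "VERT"], ["RUE", "CHEMIN", "VERT"]))
print(inclusion_case(["ROUTE", "D", "ENGHIEN"], ["ROUTE", "ENGHIEN"]))
```

Both example pairs are taken from rows shown in this document; the first should come out as `CAS_3` and the second as `CAS_4`.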

In [20]:
tb3, dic_tb3 = create_table_test_not_false(cas = "CAS_3")

Execution ID: 6bd17b01-dc97-4e05-9e5c-4f34c40dd3f0


In [21]:
dic_tb3

{'nb_index_unique_CAS_3': 395751,
 'index_unique_inpi': 10981811,
 'lignes_matches': {'lignes_matche_list_num': 333616,
  'lignes_matche_list_num_pct': 0.030378960264386266},
 'lignes_a_trouver': {'test_list_num_voie': [62135, 0.8429947113210075],
  'test_siege': [161173, 0.5927413954734163],
  'test_enseigne': [12925, 0.9673405752607068],
  'test_date': [100141, 0.7469595781185645],
  'status_admin': [85211, 0.7846853198096783],
  'test_code_commune': [8460, 0.9786229219888263],
  'test_type_voie': [17896, 0.954779646798113]}}

In [22]:
tb3

Unnamed: 0,status_cas,nb_unique_index,index_unique,count_cas,test_list_num_voie,test_siege,test_enseigne,test_date,test_status_admin,test_code_commune,test_type_voie
0,CAS_3,395751.0,1,387589.0,333616.0,234578.0,382826.0,295610.0,310540.0,387291.0,377855.0
1,CAS_3,395751.0,2,6915.0,2957.0,2621.0,6000.0,2257.0,3120.0,6913.0,6751.0
2,CAS_3,395751.0,3,721.0,266.0,402.0,569.0,196.0,424.0,720.0,686.0
3,CAS_3,395751.0,4,246.0,84.0,161.0,161.0,67.0,114.0,246.0,242.0
4,CAS_3,395751.0,5,72.0,25.0,62.0,54.0,22.0,53.0,72.0,70.0
5,CAS_3,395751.0,6,69.0,12.0,42.0,50.0,15.0,37.0,69.0,69.0
6,CAS_3,395751.0,7,9.0,2.0,8.0,6.0,1.0,8.0,9.0,9.0
7,CAS_3,395751.0,8,28.0,21.0,25.0,9.0,11.0,24.0,28.0,28.0
8,CAS_3,395751.0,9,13.0,5.0,6.0,10.0,3.0,2.0,13.0,13.0
9,CAS_3,395751.0,10,13.0,1.0,14.0,5.0,5.0,18.0,13.0,13.0


In [23]:
table_list_num_other_tests(cas = 'CAS_3')

Execution ID: d2d67086-1dea-4b45-8137-dd9f24497bb0


Unnamed: 0,status_cas,index_unique,groups,total_rows,cnt_test_list_num_voie,count_list_num_voie,cnt_test_siege,count_siege,cnt_test_enseigne,count_enseigne,cnt_test_date,count_date,cnt_test_status_admin,count_admin,cnt_test_code_commune,count_code_commune,cnt_test_type_voie,count_type_voie
0,CAS_3,1,False,333616,0,0.00%,131829,39.52%,4026,1.21%,78037,23.39%,62947,18.87%,287,0.09%,9145,2.74%
1,CAS_3,1,Null,333616,107882,32.34%,0,0.00%,308840,92.57%,99381,29.79%,0,0.00%,12743,3.82%,100589,30.15%
2,CAS_3,1,True,333616,225734,67.66%,201787,60.48%,20750,6.22%,156198,46.82%,270669,81.13%,320586,96.09%,223882,67.11%
3,CAS_3,2,False,5914,0,0.00%,2327,39.35%,562,9.50%,2830,47.85%,1786,30.20%,4,0.07%,135,2.28%
4,CAS_3,2,Null,5914,3384,57.22%,0,0.00%,4771,80.67%,1298,21.95%,0,0.00%,210,3.55%,2730,46.16%
5,CAS_3,2,True,5914,2530,42.78%,3587,60.65%,581,9.82%,1786,30.20%,4128,69.80%,5700,96.38%,3049,51.56%
6,CAS_3,3,False,798,0,0.00%,246,30.83%,149,18.67%,504,63.16%,229,28.70%,0,0.00%,14,1.75%
7,CAS_3,3,Null,798,563,70.55%,0,0.00%,597,74.81%,102,12.78%,0,0.00%,21,2.63%,470,58.90%
8,CAS_3,3,True,798,235,29.45%,552,69.17%,52,6.52%,192,24.06%,569,71.30%,777,97.37%,314,39.35%
9,CAS_3,4,False,336,0,0.00%,117,34.82%,100,29.76%,251,74.70%,131,38.99%,0,0.00%,0,0.00%


In [24]:
filter_list_num_test_false(cas = 'CAS_3',test = 'test_type_voie')

Execution ID: 59c763cf-8197-40c7-8153-9ad9d765b419


Unnamed: 0,count_initial_insee,index_id,sequence_id,siren,siret,list_inpi,list_insee,etablissementsiege,status_ets,test_siege,enseigne,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,test_enseigne,datecreationetablissement,date_debut_activite,test_date,etatadministratifetablissement,status_admin,test_status_admin,test_type_voie,typevoieetablissement,type_voie_matching
0,3,2202588,2236218,421361056,42136105600037,"[ROUTE, D, AGEN]","[AVENUE, LOUIS, RESSES, ROUTE, D, AGEN]",False,False,True,,,,,,2006-02-01 00:00:00.000,2006-02-01 00:00:00.000,True,A,A,True,False,AV,RTE
1,2,901992,2252833,421744111,42174411100020,"[PLAINE, FAUCHERIE]","[LIEU, DIT, PLAINE, FAUCHERIE]",True,True,True,,,,,,2000-07-01 00:00:00.000,2000-07-01 00:00:00.000,True,A,A,True,False,LD,PLN
2,1,877655,2253717,421763475,42176347500017,"[CHEMIN, ROMPEY, MARCHON]","[ROUTE, MARCHON, CHEMIN, ROMPEY]",True,True,True,,,,,,1999-01-01 00:00:00.000,,,F,A,False,False,RTE,CHE
3,1,2461214,4403198,503036089,50303608900012,"[ROUTE, CHAPELLE, PRESSOIRS]","[LIEU, DIT, PRESSOIRS, ROUTE, CHAPELLE]",True,True,True,,,,,,2008-03-01 00:00:00.000,2008-03-01 00:00:00.000,True,A,A,True,False,LD,RTE
4,1,9012118,6940271,809100159,80910015900015,"[CHEMIN, VERT]","[RUE, CHEMIN, VERT]",True,True,True,,,,,,2015-01-09 00:00:00.000,,,A,A,True,False,RUE,CHE
5,1,9064894,6940272,809100159,80910015900015,"[CHEMIN, VERT]","[RUE, CHEMIN, VERT]",True,False,False,,,,,,2015-01-09 00:00:00.000,2015-01-09 00:00:00.000,True,A,A,True,False,RUE,CHE
6,1,751497,178093,315142778,31514277800010,[VILLAGE],"[ROUTE, SOUGERES, VILLAGE]",True,False,False,,,,,,1979-01-01 00:00:00.000,1979-02-01 00:00:00.000,False,A,F,False,False,RTE,VLGE
7,1,751498,178093,315142778,31514277800010,[VILLAGE],"[ROUTE, SOUGERES, VILLAGE]",True,False,False,,,,,,1979-01-01 00:00:00.000,1979-02-01 00:00:00.000,False,A,F,False,False,RTE,VLGE
8,1,2203589,2246739,421601915,42160191500018,"[CITE, CONRAD]","[RUE, CITE, CONRAD]",True,True,True,,,,,,1999-01-22 00:00:00.000,,,A,A,True,False,RUE,CITE
9,1,6076839,2246740,421601915,42160191500018,"[CITE, CONRAD]","[RUE, CITE, CONRAD]",True,False,False,,,,,,1999-01-22 00:00:00.000,1999-01-22 00:00:00.000,True,A,A,True,False,RUE,CITE


## Case 04: Perfect intersection INSEE

* Definition: All the words in the INSEE address are contained in the INPI address
* Math definition: $\frac{|INSEE|}{|INSEE \cap INPI|} = 1 \text{ and } |INSEE \cap INPI| \neq |INSEE \cup INPI|$
* Query [case 4](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/65344bf4-8999-4ddb-a65e-11bb825f5f40)
* Rule: $|\text{list\_insee}| = \text{intersection} \text{ and } \text{intersection} \neq \text{union} \rightarrow \text{case 4}$

| list_inpi                                                 | list_insee                                      | insee_except | intersection | union_ |
|-----------------------------------------------------------|-------------------------------------------------|--------------|--------------|--------|
| [ROUTE, D, ENGHIEN]                                       | [ROUTE, ENGHIEN]                                | []           | 2            | 3      |
| [ZAC, PARC, D, ACTIVITE, PARIS, EST, ALLEE, LECH, WALESA] | [ALLEE, LECH, WALESA, ZAC, PARC, ACTIVITE, EST] | []           | 7            | 9      |
| [LIEU, DIT, PADER, QUARTIER, RIBERE]                      | [LIEU, DIT, RIBERE]                             | []           | 3            | 5      |
| [A, BOULEVARD, CONSTANTIN, DESCAT]                        | [BOULEVARD, CONSTANTIN, DESCAT]                 | []           | 3            | 4      |
| [RUE, MENILMONTANT, BP]                                   | [RUE, MENILMONTANT]                             | []           | 2            | 3      |


In [25]:
tb4, dic_tb4 = create_table_test_not_false(cas = "CAS_4")

Execution ID: efc7d94e-60fb-4dd0-b1ae-5606ec760179


In [26]:
dic_tb4

{'nb_index_unique_CAS_4': 537921,
 'index_unique_inpi': 10981811,
 'lignes_matches': {'lignes_matche_list_num': 463298,
  'lignes_matche_list_num_pct': 0.04218775937775655},
 'lignes_a_trouver': {'test_list_num_voie': [74623, 0.8612751686585949],
  'test_siege': [206065, 0.6169233028641752],
  'test_enseigne': [17240, 0.9679506842082759],
  'test_date': [118155, 0.780348787275455],
  'status_admin': [119907, 0.7770918034432566],
  'test_code_commune': [14741, 0.9725963477908466],
  'test_type_voie': [21522, 0.959990407513371]}}

In [27]:
tb4

Unnamed: 0,status_cas,nb_unique_index,index_unique,count_cas,test_list_num_voie,test_siege,test_enseigne,test_date,test_status_admin,test_code_commune,test_type_voie
0,CAS_4,537921.0,1,525832.0,463298.0,331856.0,520681.0,419766.0,418014.0,523180.0,516399.0
1,CAS_4,537921.0,2,10572.0,4106.0,3154.0,9509.0,3133.0,4053.0,10529.0,9679.0
2,CAS_4,537921.0,3,830.0,193.0,491.0,626.0,182.0,409.0,827.0,791.0
3,CAS_4,537921.0,4,236.0,51.0,140.0,160.0,55.0,120.0,235.0,218.0
4,CAS_4,537921.0,5,61.0,15.0,58.0,41.0,17.0,18.0,61.0,55.0
5,CAS_4,537921.0,6,43.0,7.0,31.0,30.0,10.0,44.0,43.0,41.0
6,CAS_4,537921.0,7,39.0,15.0,34.0,9.0,10.0,19.0,39.0,39.0
7,CAS_4,537921.0,8,28.0,19.0,29.0,13.0,20.0,9.0,28.0,28.0
8,CAS_4,537921.0,9,15.0,2.0,18.0,11.0,11.0,12.0,15.0,14.0
9,CAS_4,537921.0,10,55.0,29.0,49.0,39.0,14.0,29.0,55.0,58.0


In [28]:
table_list_num_other_tests(cas = 'CAS_4')

Execution ID: 1b510761-b490-4c1e-bf6a-86aa05615d91


Unnamed: 0,status_cas,index_unique,groups,total_rows,cnt_test_list_num_voie,count_list_num_voie,cnt_test_siege,count_siege,cnt_test_enseigne,count_enseigne,cnt_test_date,count_date,cnt_test_status_admin,count_admin,cnt_test_code_commune,count_code_commune,cnt_test_type_voie,count_type_voie
0,CAS_4,1,False,463298,0,0.00%,173231,37.39%,4827,1.04%,93826,20.25%,95554,20.62%,2593,0.56%,7196,1.55%
1,CAS_4,1,Null,463298,176325,38.06%,0,0.00%,430206,92.86%,145221,31.35%,0,0.00%,20709,4.47%,153094,33.04%
2,CAS_4,1,True,463298,286973,61.94%,290067,62.61%,28265,6.10%,224251,48.40%,367744,79.38%,439996,94.97%,303008,65.40%
3,CAS_4,2,False,8212,0,0.00%,3080,37.51%,632,7.70%,3747,45.63%,2603,31.70%,44,0.54%,245,2.98%
4,CAS_4,2,Null,8212,4810,58.57%,0,0.00%,6902,84.05%,1036,12.62%,0,0.00%,370,4.51%,4101,49.94%
5,CAS_4,2,True,8212,3402,41.43%,5132,62.49%,678,8.26%,3429,41.76%,5609,68.30%,7798,94.96%,3866,47.08%
6,CAS_4,3,False,579,0,0.00%,112,19.34%,102,17.62%,370,63.90%,138,23.83%,0,0.00%,18,3.11%
7,CAS_4,3,Null,579,410,70.81%,0,0.00%,436,75.30%,18,3.11%,0,0.00%,48,8.29%,364,62.87%
8,CAS_4,3,True,579,169,29.19%,467,80.66%,41,7.08%,191,32.99%,441,76.17%,531,91.71%,197,34.02%
9,CAS_4,4,False,204,0,0.00%,29,14.22%,52,25.49%,134,65.69%,62,30.39%,0,0.00%,22,10.78%


In [29]:
filter_list_num_test_false(cas = 'CAS_4',test = 'test_type_voie')

Execution ID: 14e262f7-1aef-4e1d-ac59-9214dbd99cf7


Unnamed: 0,count_initial_insee,index_id,sequence_id,siren,siret,list_inpi,list_insee,etablissementsiege,status_ets,test_siege,enseigne,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,test_enseigne,datecreationetablissement,date_debut_activite,test_date,etatadministratifetablissement,status_admin,test_status_admin,test_type_voie,typevoieetablissement,type_voie_matching
0,1,1255682,4660823,510497233,51049723300013,"[LIEU, DIT, CRAYS, SUR, BREUIL, ROUTE, BELUZES]","[ROUTE, BELUZES]",True,True,True,,,,,,2009-02-01 00:00:00.000,,,A,A,True,False,RTE,LD
1,1,1255684,4660824,510497233,51049723300013,"[LIEU, DIT, CRAYS, SUR, BREUIL, ROUTE, BELUZES]","[ROUTE, BELUZES]",True,False,False,,,,,,2009-02-01 00:00:00.000,2009-02-01 00:00:00.000,True,A,A,True,False,RTE,LD
2,1,647627,9541992,878599059,87859905900010,"[LOTISSEMENT, CAMPAGNE, MARGUERITE, BOULEVARD,...","[BOULEVARD, HENRI, BARBUSSE, CAMPAGNE, MARGUER...",True,True,True,,,,,,2019-10-21 00:00:00.000,2019-10-21 00:00:00.000,True,A,A,True,False,BD,LOT
3,1,42961,9126973,848455481,84845548100012,"[RUE, ALLEE, PAMPELUNE]","[ALLEE, PAMPELUNE]",True,False,False,,,,,,2019-02-18 00:00:00.000,2019-02-18 00:00:00.000,True,A,A,True,False,ALL,RUE
4,1,11032406,9105510,848139440,84813944000012,"[RUE, ESCALIER, A, ROUTE, GARGES]","[ROUTE, GARGES, ESCALIER, A]",True,False,False,,,,,,2019-02-15 00:00:00.000,2019-02-15 00:00:00.000,True,A,A,True,False,RTE,RUE
5,1,5211135,2274529,422288381,42228838100011,"[BIS, GRANDE, RUE, ORMEAUX]","[BIS, RUE, GRANDE]",True,True,True,,,,,,1999-03-01 00:00:00.000,1999-03-01 00:00:00.000,True,F,A,False,False,RUE,GR
6,2,2528995,4714240,512000076,51200007600012,"[ROUTE, MONTGAZIN, LIEU, DIT, JEANNY]","[LIEU, DIT, JEANNY]",False,False,True,SOLER DANIEL,,,,,2009-04-23 00:00:00.000,2015-06-29 00:00:00.000,False,F,A,False,False,LD,RTE
7,3,8838080,6714807,803135136,80313513600031,"[QUARTIER, COMBETTES, IMPASSE, CANAILLOUS, VIL...","[IMPASSE, CANAILLOUS, QUARTIER, COMBETTES]",True,True,True,,,,,,2019-11-06 00:00:00.000,2014-06-01 00:00:00.000,False,A,A,True,False,IMP,QUA
8,1,954811,140604,312024490,31202449000012,"[RUE, PAUL, GUIGOU, RESIDENCE, CEDRES]","[RESIDENCE, CEDRES]",True,False,False,,,,,,,1977-11-01 00:00:00.000,,F,A,False,False,RES,RUE
9,2,10348247,9113291,848258406,84825840600026,"[IMPASSE, HAMEAU, COLBERT, ER, ETAGE]","[HAMEAU, COLBERT, ER, ETAGE]",True,True,True,,,,,,2019-02-17 00:00:00.000,2019-02-06 00:00:00.000,False,A,A,True,False,HAM,IMP


## Case 05: Equal exception cardinality for INSEE and INPI, positive intersection

* Definition: The INPI address shares words with the INSEE address, and the number of words found only in the INPI address equals the number of words found only in the INSEE address
* Math definition: $|INPI|-|INPI \cap INSEE| = |INSEE|-|INPI \cap INSEE|$
* Query [case 5](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/fec67222-3a7b-4bfb-af20-dd70d82932e3)
* Rule: $|\text{insee\_except}| = |\text{inpi\_except}| \text{ and } \text{intersection} > 0 \rightarrow \text{case 5}$
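
The exception lists can be derived with set differences, and comparing their sizes gives the case. The sketch below also handles the two neighbouring rules (cases 6 and 7, described further down). It assumes that pairs where one exception list is empty have already been assigned to cases 1, 3 or 4; the function name `exception_case` is hypothetical.

```python
def exception_case(list_inpi, list_insee):
    """Return 'CAS_5', 'CAS_6', 'CAS_7' or None for a pair of word lists."""
    inpi, insee = set(list_inpi), set(list_insee)
    inpi_except = inpi - insee    # words found only in the INPI address
    insee_except = insee - inpi   # words found only in the INSEE address

    # The rules require a positive intersection. An empty exception list
    # means the pair belongs to case 1, 3 or 4 (assumption stated above).
    if not (inpi & insee) or not inpi_except or not insee_except:
        return None

    if len(insee_except) == len(inpi_except):
        return "CAS_5"
    if len(insee_except) > len(inpi_except):
        return "CAS_6"
    return "CAS_7"

print(exception_case(["RUE", "MARECHAL", "MORTIER"],
                     ["AVENUE", "MARECHAL", "MORTIER"]))
print(exception_case(["AVENUE", "CALMETTE"],
                     ["RUE", "PROFESSEUR", "CALMETTE"]))
```

Both pairs come from sample rows in this document. The first has one exception word on each side (`RUE` against `AVENUE`), the second has one INPI exception against two INSEE exceptions.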

In [30]:
tb5, dic_tb5 = create_table_test_not_false(cas = "CAS_5")

Execution ID: 18cda252-8209-4d2b-ba43-9e78b6a5c302


In [31]:
dic_tb5

{'nb_index_unique_CAS_5': 985565,
 'index_unique_inpi': 10981811,
 'lignes_matches': {'lignes_matche_list_num': 788667,
  'lignes_matche_list_num_pct': 0.0718157506079826},
 'lignes_a_trouver': {'test_list_num_voie': [196898, 0.800218148980534],
  'test_siege': [423910, 0.5698812356364116],
  'test_enseigne': [54534, 0.9446672720723646],
  'test_date': [266863, 0.7292284121290833],
  'status_admin': [306151, 0.6893649835373619],
  'test_code_commune': [42697, 0.9566776417587881],
  'test_type_voie': [169254, 0.8282670346451021]}}

In [32]:
tb5

Unnamed: 0,status_cas,nb_unique_index,index_unique,count_cas,test_list_num_voie,test_siege,test_enseigne,test_date,test_status_admin,test_code_commune,test_type_voie
0,CAS_5,985565.0,1,943789,788667.0,561655,931031,718702.0,679414,942868,816311
1,CAS_5,985565.0,2,33429,6882.0,13699,30469,10529.0,10635,33413,29108
2,CAS_5,985565.0,3,4467,664.0,3026,3635,1267.0,1950,4466,3999
3,CAS_5,985565.0,4,1474,218.0,1220,1146,484.0,766,1474,1351
4,CAS_5,985565.0,5,741,100.0,699,595,261.0,392,741,677
5,CAS_5,985565.0,6,470,52.0,410,368,192.0,209,470,446
6,CAS_5,985565.0,7,203,34.0,182,168,103.0,103,203,200
7,CAS_5,985565.0,8,134,42.0,140,108,64.0,109,134,147
8,CAS_5,985565.0,9,115,29.0,116,82,49.0,41,115,93
9,CAS_5,985565.0,10,124,34.0,113,105,55.0,36,124,119


In [33]:
table_list_num_other_tests(cas = 'CAS_5')

Execution ID: e511417d-1e03-4883-bf5b-f2413f0a4cec


Unnamed: 0,status_cas,index_unique,groups,total_rows,cnt_test_list_num_voie,count_list_num_voie,cnt_test_siege,count_siege,cnt_test_enseigne,count_enseigne,cnt_test_date,count_date,cnt_test_status_admin,count_admin,cnt_test_code_commune,count_code_commune,cnt_test_type_voie,count_type_voie
0,CAS_5,1,False,788667,0,0.00%,321762,40.80%,9288,1.18%,168672,21.39%,173290,21.97%,839,0.11%,122710,15.56%
1,CAS_5,1,Null,788667,145758,18.48%,0,0.00%,721177,91.44%,235906,29.91%,0,0.00%,39527,5.01%,181846,23.06%
2,CAS_5,1,True,788667,642909,81.52%,466905,59.20%,58202,7.38%,384089,48.70%,615377,78.03%,748301,94.88%,484111,61.38%
3,CAS_5,2,False,13764,0,0.00%,4795,34.84%,1012,7.35%,5958,43.29%,4822,35.03%,16,0.12%,1954,14.20%
4,CAS_5,2,Null,13764,6794,49.36%,0,0.00%,11458,83.25%,3052,22.17%,0,0.00%,610,4.43%,3775,27.43%
5,CAS_5,2,True,13764,6970,50.64%,8969,65.16%,1294,9.40%,4754,34.54%,8942,64.97%,13138,95.45%,8035,58.38%
6,CAS_5,3,False,1992,0,0.00%,187,9.39%,356,17.87%,1055,52.96%,535,26.86%,0,0.00%,249,12.50%
7,CAS_5,3,Null,1992,1506,75.60%,0,0.00%,1440,72.29%,138,6.93%,0,0.00%,48,2.41%,866,43.47%
8,CAS_5,3,True,1992,486,24.40%,1805,90.61%,196,9.84%,799,40.11%,1457,73.14%,1944,97.59%,877,44.03%
9,CAS_5,4,False,872,0,0.00%,30,3.44%,222,25.46%,464,53.21%,257,29.47%,0,0.00%,71,8.14%


In [34]:
filter_list_num_test_false(cas = 'CAS_5',test = 'test_type_voie')

Execution ID: 6e467c59-8bbe-4cd5-a55c-d27e98f38f89


Unnamed: 0,count_initial_insee,index_id,sequence_id,siren,siret,list_inpi,list_insee,etablissementsiege,status_ets,test_siege,enseigne,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,test_enseigne,datecreationetablissement,date_debut_activite,test_date,etatadministratifetablissement,status_admin,test_status_admin,test_type_voie,typevoieetablissement,type_voie_matching
0,2,2463554,4543325,507830941,50783094100021,"[RUE, MATTES, ATHELIA, I]","[CHEMIN, MATTES, ATHELIA, I]",True,True,True,,,,,,2009-01-01 00:00:00.000,,,A,A,True,False,CHE,RUE
1,2,2463555,4543326,507830941,50783094100021,"[RUE, MATTES, ATHELIA, I]","[CHEMIN, MATTES, ATHELIA, I]",True,False,False,,,,,,2009-01-01 00:00:00.000,2008-09-01 00:00:00.000,False,A,A,True,False,CHE,RUE
2,6,2371914,4544408,507851970,50785197000057,"[VALLEE, VILLAGE, CRS, GARONNE]","[COURS, GARONNE, VALLEE, VILLAGE]",False,False,True,,,,,,2014-07-01 00:00:00.000,2014-07-01 00:00:00.000,True,A,A,True,False,CRS,VLGE
3,2,3195205,4544913,507864577,50786457700014,"[RUE, MARECHAL, MORTIER]","[AVENUE, MARECHAL, MORTIER]",True,True,True,,,,,,2008-09-02 00:00:00.000,,,A,A,True,False,AV,RUE
4,2,3195206,4544914,507864577,50786457700014,"[RUE, MARECHAL, MORTIER]","[AVENUE, MARECHAL, MORTIER]",True,False,False,8EME ART,,,,,2008-09-02 00:00:00.000,2008-09-25 00:00:00.000,False,A,A,True,False,AV,RUE
5,5,10199535,4555368,508108842,50810884200057,"[ROUTE, NATIONALE, CENTRE, COMMERCIAL, AUCHAN]","[AVENUE, REPUBLIQUE, CENTRE, COMMERCIAL, AUCHAN]",False,False,True,SOMETIME,SOMETIME,,,True,2011-06-18 00:00:00.000,2011-06-18 00:00:00.000,True,A,A,True,False,AV,RTE
6,1,4536274,7212231,813232501,81323250100011,"[ROUTE, CHARTRES]","[RUE, CHARTRES]",True,True,True,,,,,,2015-09-01 00:00:00.000,,,A,A,True,False,RUE,RTE
7,1,10388420,7212232,813232501,81323250100011,"[ROUTE, CHARTRES]","[RUE, CHARTRES]",True,False,False,,,,,,2015-09-01 00:00:00.000,2015-09-01 00:00:00.000,True,A,A,True,False,RUE,RTE
8,1,8727241,7214538,813271236,81327123600016,"[RUE, BOIS, MOLLIERES, VERNAYES]","[DOMAINE, BOIS, MOLLIERES, VERNAYES]",True,False,False,,,,,,2015-10-01 00:00:00.000,2018-06-11 00:00:00.000,False,F,A,False,False,DOM,RUE
9,1,5317466,7214696,813272960,81327296000010,"[ROUTE, NANTES]","[RUE, NANTES]",True,True,True,,LA MINOPAINS,,,,2015-09-01 00:00:00.000,,,A,A,True,False,RUE,RTE


## Case 06: INSEE exception cardinality greater than INPI, positive intersection

* Definition: The INPI address shares words with the INSEE address, and the number of words found only in the INSEE address is greater than the number of words found only in the INPI address
* Math definition: $|INPI|-|INPI \cap INSEE| < |INSEE|-|INPI \cap INSEE|$
* Query [case 6](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/9bdce567-5871-4a5a-add4-d5cca6a83528)
* Rule: $|\text{insee\_except}| > |\text{inpi\_except}| \text{ and } \text{intersection} > 0 \rightarrow \text{case 6}$

In [35]:
tb6, dic_tb6 = create_table_test_not_false(cas = "CAS_6")

Execution ID: 833bd5cf-0443-45d5-b19f-66d1b547440a


In [36]:
dic_tb6

{'nb_index_unique_CAS_6': 319138,
 'index_unique_inpi': 10981811,
 'lignes_matches': {'lignes_matche_list_num': 165602,
  'lignes_matche_list_num_pct': 0.015079662179580398},
 'lignes_a_trouver': {'test_list_num_voie': [153536, 0.5189040477787039],
  'test_siege': [143387, 0.5507053375028984],
  'test_enseigne': [29687, 0.9069775457639015],
  'test_date': [126999, 0.6020561637912126],
  'status_admin': [145726, 0.5433762196917947],
  'test_code_commune': [22816, 0.9285074168541508],
  'test_type_voie': [52694, 0.8348864754432251]}}

In [37]:
tb6

Unnamed: 0,status_cas,nb_unique_index,index_unique,count_cas,test_list_num_voie,test_siege,test_enseigne,test_date,test_status_admin,test_code_commune,test_type_voie
0,CAS_6,319138.0,1,296702,165602.0,175751,289451,192139.0,173412.0,296322,266444
1,CAS_6,319138.0,2,16116,3783.0,9376,14060,4616.0,5838.0,16105,14457
2,CAS_6,319138.0,3,3087,576.0,2369,2353,797.0,1357.0,3086,2830
3,CAS_6,319138.0,4,1193,232.0,1037,902,350.0,664.0,1193,1084
4,CAS_6,319138.0,5,615,151.0,559,485,167.0,291.0,613,562
5,CAS_6,319138.0,6,376,113.0,346,278,96.0,193.0,375,334
6,CAS_6,319138.0,7,264,46.0,263,235,107.0,133.0,264,241
7,CAS_6,319138.0,8,167,37.0,147,121,52.0,59.0,167,127
8,CAS_6,319138.0,9,84,19.0,82,48,19.0,56.0,84,53
9,CAS_6,319138.0,10,78,30.0,78,53,64.0,25.0,78,71


In [38]:
table_list_num_other_tests(cas = 'CAS_6')

Execution ID: 770a56ad-c355-4bfa-afc8-fdaef4069672


Unnamed: 0,status_cas,index_unique,groups,total_rows,cnt_test_list_num_voie,count_list_num_voie,cnt_test_siege,count_siege,cnt_test_enseigne,count_enseigne,cnt_test_date,count_date,cnt_test_status_admin,count_admin,cnt_test_code_commune,count_code_commune,cnt_test_type_voie,count_type_voie
0,CAS_6,1,False,165602,0,0.00%,62346,37.65%,3938,2.38%,47142,28.47%,43531,26.29%,335,0.20%,22902,13.83%
1,CAS_6,1,Null,165602,90186,54.46%,0,0.00%,149828,90.47%,43089,26.02%,0,0.00%,10368,6.26%,75087,45.34%
2,CAS_6,1,True,165602,75416,45.54%,103256,62.35%,11836,7.15%,75371,45.51%,122071,73.71%,154899,93.54%,67613,40.83%
3,CAS_6,2,False,7566,0,0.00%,1917,25.34%,1030,13.61%,4177,55.21%,2879,38.05%,6,0.08%,1003,13.26%
4,CAS_6,2,Null,7566,6180,81.68%,0,0.00%,5676,75.02%,1064,14.06%,0,0.00%,420,5.55%,3685,48.70%
5,CAS_6,2,True,7566,1386,18.32%,5649,74.66%,860,11.37%,2325,30.73%,4687,61.95%,7140,94.37%,2878,38.04%
6,CAS_6,3,False,1728,0,0.00%,238,13.77%,387,22.40%,1112,64.35%,593,34.32%,0,0.00%,247,14.29%
7,CAS_6,3,Null,1728,1565,90.57%,0,0.00%,1205,69.73%,165,9.55%,0,0.00%,90,5.21%,847,49.02%
8,CAS_6,3,True,1728,163,9.43%,1490,86.23%,136,7.87%,451,26.10%,1135,65.68%,1638,94.79%,634,36.69%
9,CAS_6,4,False,928,0,0.00%,70,7.54%,174,18.75%,600,64.66%,271,29.20%,0,0.00%,143,15.41%


In [39]:
filter_list_num_test_false(cas = 'CAS_6',test = 'test_type_voie')

Execution ID: 183844b9-fce8-40f0-b248-bdd4fc99fbc8


Unnamed: 0,count_initial_insee,index_id,sequence_id,siren,siret,list_inpi,list_insee,etablissementsiege,status_ets,test_siege,enseigne,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,test_enseigne,datecreationetablissement,date_debut_activite,test_date,etatadministratifetablissement,status_admin,test_status_admin,test_type_voie,typevoieetablissement,type_voie_matching
0,1,1573419,31063,300558756,30055875600015,"[AVENUE, CALMETTE]","[RUE, PROFESSEUR, CALMETTE]",True,True,True,,,,,,1900-01-01 00:00:00.000,,,A,A,True,False,RUE,AV
1,1,1557300,31064,300558756,30055875600015,"[AVENUE, CALMETTE]","[RUE, PROFESSEUR, CALMETTE]",True,False,False,,,,,,1900-01-01 00:00:00.000,1972-05-06 00:00:00.000,False,A,A,True,False,RUE,AV
2,1,5676635,2249686,421665779,42166577900011,"[ROUTE, GRAVELINES, FORT, VERT]","[AVENUE, GENERAL, GAULLE, FORT, VERT]",True,True,True,,,,,,1998-12-28 00:00:00.000,1998-12-28 00:00:00.000,True,A,A,True,False,AV,RTE
3,2,5827408,2342263,423902329,42390232900022,"[ROUTE, HAMEAU, CHAMPS]","[RUE, CHAMPS, RTE, HAMEAU]",True,False,False,VISION D'HOMME,,,,,2012-01-01 00:00:00.000,1999-09-09 00:00:00.000,False,F,A,False,False,RUE,RTE
4,208,1140824,159540,313811515,31381151501746,"[CENTRE, COMMERCIAL, CARREFOUR, CHELLES, AULNO...","[AVENUE, GENDARME, CASTERMANT, CCIAL, CARREFOU...",False,False,True,,,,,,2010-04-30 00:00:00.000,2010-04-30 00:00:00.000,True,A,A,True,False,AV,CAR
5,1,1195303,4655332,510349996,51034999600015,"[AVENUE, VENDEE]","[RUE, VENDEE, CENTRE, COMMERCIAL, INTERMARCHE]",True,True,True,ADELE.A,ADELE.A,,,True,2009-02-02 00:00:00.000,2009-03-02 00:00:00.000,False,A,A,True,False,RUE,AV
6,1,8454740,4362824,502149354,50214935400016,"[CENTRE, CIAL, AUCHAN, CAPS, ROUTE, BOULOGNE]","[AVENUE, ROGER, SALENGRO, CENTRE, CIAL, AUCHAN...",True,True,True,,LYNX OPTIQUE,,,,2008-01-22 00:00:00.000,,,A,A,True,False,AV,RTE
7,1,5381514,4362825,502149354,50214935400016,"[ROUTE, BOULOGNE, CENTRE, COMMERCIAL, AUCHAN, ...","[AVENUE, ROGER, SALENGRO, CENTRE, CIAL, AUCHAN...",True,True,True,LYNX OPTIQUE,LYNX OPTIQUE,,,True,2008-01-22 00:00:00.000,2008-02-01 00:00:00.000,False,A,A,True,False,AV,RTE
8,1,3949397,2137849,418616397,41861639700018,"[AVENUE, AMIRAL, MUSELIER]","[PLACE, L, AMIRAL, MUSELIER]",True,True,True,,,,,,1998-04-29 00:00:00.000,,,F,A,False,False,PL,AV
9,1,2538735,2137850,418616397,41861639700018,"[AVENUE, AMIRAL, MUSELIER]","[PLACE, L, AMIRAL, MUSELIER]",True,False,False,AZUR CAFE,,,,,1998-04-29 00:00:00.000,1998-06-23 00:00:00.000,False,F,A,False,False,PL,AV


## CAS 07: INPI exception cardinality greater than INSEE, positive intersection

* Definition: the INSEE address shares at least one word with the INPI address, and the number of INPI words missing from the INSEE address is greater than the number of INSEE words missing from the INPI address
* Math definition: $|INPI|-|INPI \cap INSEE| > |INSEE|-|INPI \cap INSEE|$
* Rule: $|\text{insee_except}| < |\text{inpi_except}| \text{ and } \text{intersection} > 0 \rightarrow \text{cas 7}$
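
The rule above can be sketched in a few lines of Python. This is only an illustration, not the query used to build the table: the function names `exceptions` and `is_cas_7` are hypothetical, and treating each address as a set of words (so duplicates are ignored) is an assumption.

```python
def exceptions(list_inpi, list_insee):
    """Hypothetical helper: return (inpi_except, insee_except, intersection) as sets."""
    inpi, insee = set(list_inpi), set(list_insee)
    common = inpi & insee
    return inpi - common, insee - common, common


def is_cas_7(list_inpi, list_insee):
    """True when |insee_except| < |inpi_except| and the intersection is not empty."""
    inpi_except, insee_except, common = exceptions(list_inpi, list_insee)
    return len(insee_except) < len(inpi_except) and len(common) > 0


# Word lists taken from one of the CAS_7 sample rows of this notebook
print(is_cas_7(["RUE", "PATIS", "SEGRE"], ["ALLEE", "PATIS"]))
```

Here `inpi_except` is `{RUE, SEGRE}`, `insee_except` is `{ALLEE}` and the intersection is `{PATIS}`, so the pair satisfies the rule.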

In [40]:
tb7, dic_tb7 = create_table_test_not_false(cas = "CAS_7")

Execution ID: 9a7920e3-b494-4301-8573-3a40b6801d24


In [41]:
dic_tb7

{'nb_index_unique_CAS_7': 401620,
 'index_unique_inpi': 10981811,
 'lignes_matches': {'lignes_matche_list_num': 234930,
  'lignes_matche_list_num_pct': 0.021392646440555205},
 'lignes_a_trouver': {'test_list_num_voie': [166690, 0.5849559284896171],
  'test_siege': [176983, 0.5593272247398038],
  'test_enseigne': [40686, 0.8986952840993975],
  'test_date': [142773, 0.6445072456550969],
  'status_admin': [178524, 0.5554902644290624],
  'test_code_commune': [35142, 0.9124993775210398],
  'test_type_voie': [67433, 0.8320975051043275]}}
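
The figures in this dictionary appear to be linked by simple ratios: `lignes_matche_list_num_pct` matches the matched rows divided by `index_unique_inpi`, and the second element of each `lignes_a_trouver` entry matches one minus the failing count divided by `nb_index_unique_CAS_7`. A minimal check, using only the numbers printed above (the helper `share_passing` is hypothetical and not part of the notebook):

```python
# Figures copied from dic_tb7 above
nb_index_unique = 401620       # 'nb_index_unique_CAS_7'
index_unique_inpi = 10981811   # 'index_unique_inpi'
matched = 234930               # 'lignes_matche_list_num'
to_find = 166690               # first element of 'test_list_num_voie'


def share_passing(n_failing, n_total):
    """Hypothetical helper: share of index values that do not fail a test."""
    return 1 - n_failing / n_total


# Matched and remaining rows add up to the number of unique CAS_7 index values
print(matched + to_find == nb_index_unique)
print(matched / index_unique_inpi)
print(share_passing(to_find, nb_index_unique))
```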

In [42]:
tb7

Unnamed: 0,status_cas,nb_unique_index,index_unique,count_cas,test_list_num_voie,test_siege,test_enseigne,test_date,test_status_admin,test_code_commune,test_type_voie
0,CAS_7,401620,1,367508,234930,224637,360934,258847,223096,366478,334187
1,CAS_7,401620,2,25116,6757,11670,23034,7493,7270,25070,21648
2,CAS_7,401620,3,4395,863,2889,3591,1201,1704,4389,3855
3,CAS_7,401620,4,1632,414,1358,1321,476,852,1627,1398
4,CAS_7,401620,5,824,198,704,645,246,430,822,714
5,CAS_7,401620,6,498,139,476,377,182,264,498,416
6,CAS_7,401620,7,377,70,342,266,82,175,376,326
7,CAS_7,401620,8,217,57,201,176,68,103,217,208
8,CAS_7,401620,9,155,54,156,120,34,73,155,131
9,CAS_7,401620,10,130,36,125,101,72,97,130,119


In [43]:
table_list_num_other_tests(cas = 'CAS_7')

Execution ID: d1477134-6bf7-4f6b-8c87-717190a81430


Unnamed: 0,status_cas,index_unique,groups,total_rows,cnt_test_list_num_voie,count_list_num_voie,cnt_test_siege,count_siege,cnt_test_enseigne,count_enseigne,cnt_test_date,count_date,cnt_test_status_admin,count_admin,cnt_test_code_commune,count_code_commune,cnt_test_type_voie,count_type_voie
0,CAS_7,1,False,234930,0,0.00%,87611,37.29%,4350,1.85%,60509,25.76%,66964,28.50%,879,0.37%,26919,11.46%
1,CAS_7,1,Null,234930,115685,49.24%,0,0.00%,213808,91.01%,72985,31.07%,0,0.00%,15169,6.46%,105153,44.76%
2,CAS_7,1,True,234930,119245,50.76%,147319,62.71%,16772,7.14%,101436,43.18%,167966,71.50%,218882,93.17%,102858,43.78%
3,CAS_7,2,False,13514,0,0.00%,4363,32.29%,1086,8.04%,6926,51.25%,5893,43.61%,56,0.41%,2083,15.41%
4,CAS_7,2,Null,13514,10581,78.30%,0,0.00%,11101,82.14%,2394,17.71%,0,0.00%,716,5.30%,4772,35.31%
5,CAS_7,2,True,13514,2933,21.70%,9151,67.71%,1327,9.82%,4194,31.03%,7621,56.39%,12742,94.29%,6659,49.27%
6,CAS_7,3,False,2589,0,0.00%,471,18.19%,407,15.72%,1618,62.50%,1066,41.17%,9,0.35%,386,14.91%
7,CAS_7,3,Null,2589,2258,87.22%,0,0.00%,1947,75.20%,258,9.97%,0,0.00%,138,5.33%,884,34.14%
8,CAS_7,3,True,2589,331,12.78%,2118,81.81%,235,9.08%,713,27.54%,1523,58.83%,2442,94.32%,1319,50.95%
9,CAS_7,4,False,1656,0,0.00%,199,12.02%,340,20.53%,1083,65.40%,599,36.17%,4,0.24%,235,14.19%


In [44]:
filter_list_num_test_false(cas = 'CAS_7',test = 'test_type_voie')

Execution ID: f57e6ece-2584-437c-97a6-3fad2805f3f7


Unnamed: 0,count_initial_insee,index_id,sequence_id,siren,siret,list_inpi,list_insee,etablissementsiege,status_ets,test_siege,enseigne,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,test_enseigne,datecreationetablissement,date_debut_activite,test_date,etatadministratifetablissement,status_admin,test_status_admin,test_type_voie,typevoieetablissement,type_voie_matching
0,1,3106429,2483510,431282821,43128282100013,"[PL, L, EGLISE, RUE, FOLGUET, PLUMELEC]","[PLACE, L, EGLISE]",True,True,True,,,,,,2000-05-15 00:00:00.000,2000-05-15 00:00:00.000,True,A,A,True,False,PL,RUE
1,1,1255362,4669160,510718752,51071875200015,"[RUE, PATIS, SEGRE]","[ALLEE, PATIS]",True,True,True,,,,,,2009-02-10 00:00:00.000,2009-02-10 00:00:00.000,True,A,A,True,False,ALL,RUE
2,1,5822095,4653800,510311608,51031160800010,"[ROUTE, L, AEROPORT, AVENUE, CD, PALYVESTRE]","[AVENUE, L, AEROPORT, CD, PALYVESTE]",True,False,False,RELAIS HYERES PLAGE,RELAIS HYERES PLAGE,,,False,2009-02-03 00:00:00.000,2009-02-03 00:00:00.000,True,A,A,True,False,AV,RTE
3,2,1799685,1053467,377599782,37759978200028,"[A, AVENUE, JEAN, JAURES]","[RUE, JEAN, JAURES]",True,True,True,,,,,,1999-03-10 00:00:00.000,1990-03-01 00:00:00.000,False,A,A,True,False,RUE,AV
4,3,3096208,1056197,377681150,37768115000050,"[RUE, CLUZEL, ZONE, D, ACTIVITE]","[ROUTE, CLUZEL, ZONE, ACTIVITE]",True,True,True,,,,,,2005-01-01 00:00:00.000,,,A,A,True,False,RTE,RUE
5,1,2550995,1056606,377693932,37769393200016,"[RUE, JAMES, GABRIEL, LECOMTE, EPERNAY]","[AVENUE, JAMES, GABRIEL, LECOMTE]",True,True,True,,,,,,1990-03-14 00:00:00.000,,,A,A,True,False,AV,RUE
6,63,3031369,867395,349021840,34902184000476,"[PARC, D, ACTIVITES, NEPTUNE, I, ROUTE, TORIGNI]","[RUE, TORIGNI, PARC, D, ACTIVITES, NEPTUNE]",False,False,True,BEST DRIVE,BEST DRIVE,,,True,2007-01-01 00:00:00.000,2005-01-01 00:00:00.000,False,F,A,False,False,RUE,RTE
7,7,9257143,5242481,529326126,52932612600040,"[CENTRE, CIAL, VACHE, NOIRE, LOCAL, NO, AV, LA...","[AVENUE, LAPLACE, CC, VACHE, NOIRE, LOCAL, N]",False,False,True,ALONE STREET,ALONE STREET,,,True,2018-08-21 00:00:00.000,2018-08-21 00:00:00.000,True,A,A,True,False,AV,CAR
8,1,6128869,2882272,442496196,44249619600019,"[RUE, HECTOR, BERLIOZ, ZI, GRAVIERE, RIOM]","[AVENUE, HECTOR, BERLIOZ, ZAC, GRAVIERE]",True,True,True,,,,,,2002-07-01 00:00:00.000,,,A,A,True,False,AV,RUE
9,1,5412767,2882273,442496196,44249619600019,"[PARC, EUROPEEN, D, ENTREPRISES, RUE, HECTOR, ...","[AVENUE, HECTOR, BERLIOZ, ZAC, GRAVIERE]",True,False,False,,,,,,2002-07-01 00:00:00.000,2002-07-01 00:00:00.000,True,A,A,True,False,AV,RUE


## Test summary

The difference in the number of observations comes from case 2, where the SIRENs were matched but the two addresses have no word in common.

In [45]:
nb_to_find = {
    'cas':[],
    'lignes_matche_list_num':[],
    'to_find':[],
    'lignes_matche_list_num_pct': [],
    
}

# dic_tb2 is not in the list (case 2 is left out), so the case number jumps from 1 to 3
for d, value in enumerate([dic_tb1,dic_tb3,dic_tb4,dic_tb5,dic_tb6,dic_tb7]):
    cas = d + 1
    if d >= 1:
        cas = d + 2
    nb_to_find['cas'].append(cas)
    nb_to_find['to_find'].append(value['lignes_a_trouver']['test_list_num_voie'][0])
    nb_to_find['lignes_matche_list_num'].append(value['lignes_matches']['lignes_matche_list_num'])
    nb_to_find['lignes_matche_list_num_pct'].append(value['lignes_matches']['lignes_matche_list_num_pct'])
    
reindex = ["cas",
           "lignes_matche_list_num", "lignes_matche_list_num_pct", "cum_sum_matche","cum_sum_matche_pct",
           "to_find","to_find_pct", "cum_sum_to_find", "cum_sum_to_find_pct"
          ]
    
(pd.DataFrame(nb_to_find).assign(
    cum_sum_to_find = lambda x: x['to_find'].cumsum(),
    cum_sum_matche = lambda x: x['lignes_matche_list_num'].cumsum(),
    cum_sum_matche_pct = lambda x: x['lignes_matche_list_num_pct'].cumsum(),
    to_find_pct = lambda x:  x['to_find']/x['to_find'].sum(),
    cum_sum_to_find_pct = lambda x: x['cum_sum_to_find']/x['to_find'].sum(),
    #cum_sum_to_find_pct = lambda x: x['pct_total'].cumsum(),
    #cum_sum_pct_inverse = lambda x: 1-x['pct_total'].cumsum(),
    #cum_pct_match = lambda x: x['pct_match'].cumsum(),
    
)
 .reindex(columns  = reindex)
 .style
 .format("{:.2%}", subset =  ['lignes_matche_list_num_pct', 'cum_sum_matche_pct', 'to_find_pct',
                              'cum_sum_to_find_pct'])
 .format("{:,.0f}", subset =  ['lignes_matche_list_num','cum_sum_matche', 'to_find', 'cum_sum_to_find'])
 .bar(subset= ['lignes_matche_list_num_pct','to_find_pct'], color='#d65f5f')
)

Unnamed: 0,cas,lignes_matche_list_num,lignes_matche_list_num_pct,cum_sum_matche,cum_sum_matche_pct,to_find,to_find_pct,cum_sum_to_find,cum_sum_to_find_pct
0,1,7471838,68.04%,7471838,68.04%,112665,14.70%,112665,14.70%
1,3,333616,3.04%,7805454,71.08%,62135,8.11%,174800,22.80%
2,4,463298,4.22%,8268752,75.29%,74623,9.73%,249423,32.54%
3,5,788667,7.18%,9057419,82.48%,196898,25.69%,446321,58.22%
4,6,165602,1.51%,9223021,83.98%,153536,20.03%,599857,78.25%
5,7,234930,2.14%,9457951,86.12%,166690,21.75%,766547,100.00%
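
The `to_find` shares and the gap left by case 2 can be re-derived from the table with plain Python. This is a sketch based on the printed figures only; it assumes that `index_unique_inpi` (10,981,811 in the dictionaries above) is the same total for every case.

```python
from itertools import accumulate

# Figures copied from the summary table above
cas = [1, 3, 4, 5, 6, 7]
matched = [7471838, 333616, 463298, 788667, 165602, 234930]
to_find = [112665, 62135, 74623, 196898, 153536, 166690]
index_unique_inpi = 10981811  # assumption: common to all cases

total_to_find = sum(to_find)
to_find_pct = [n / total_to_find for n in to_find]
cum_to_find_pct = [c / total_to_find for c in accumulate(to_find)]

# Index values covered by none of the six cases listed here
# (the note above attributes this difference to case 2)
not_covered = index_unique_inpi - sum(matched) - total_to_find

for c, p, cp in zip(cas, to_find_pct, cum_to_find_pct):
    print(f"cas {c}: {p:.2%} of rows to find (cumulative {cp:.2%})")
print(f"not covered by these cases: {not_covered:,}")
```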


# Report generation

In [46]:
import os, time, shutil, urllib.request, ipykernel, json
from pathlib import Path
from notebook import notebookapp

In [47]:
def create_report(extension = "html"):
    """
    Create a report from the current notebook and save it in the 
    Reports folder of the current working directory
    
    1. Extract the current notebook name
    2. Convert the notebook 
    3. Move the newly created report
    
    Args:
    extension: string. Can be "html", "pdf", "md"
    
    
    """
    
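    ### Get notebook name: match the running kernel id against the server sessions
    connection_file = os.path.basename(ipykernel.get_connection_file())
    # Connection files are named "kernel-<id>.json"
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]

    notebookname = None
    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:  
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
            for sess in sessions:
                if sess['kernel']['id'] == kernel_id:
                    notebookname = sess['name']
        except Exception:
            pass  # server unreachable or not authorised: try the next one
    if notebookname is None:
        raise RuntimeError("Could not determine the current notebook name")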
    
    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)
    
    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)
    
    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)
    
    ### Generate notebook
    os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))
    
    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report available at this address:\n {}".format(dest))

In [48]:
create_report(extension = "html")

Report available at this address:
 C:\Users\PERNETTH\Documents\Projects\InseeInpi_matching\Notebooks_matching\Data_preprocessed\programme_matching\02_siretisation\Reports\07_pourcentage_siretisation_v3.html
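
As a possible variant of the conversion step, the command can be built as an argument list and passed to `subprocess.run`, which avoids quoting problems when the notebook name contains spaces. This is a sketch, not the code that produced the report above: `build_report_command` and `report_destination` are hypothetical helpers, and the `"md"` to `"markdown"` mapping assumes that nbconvert expects the exporter name `markdown`.

```python
import os
import subprocess

# Assumption: nbconvert names its markdown exporter "markdown", not "md"
NBCONVERT_FORMAT = {"md": "markdown"}


def build_report_command(notebookname, extension="html"):
    """Hypothetical helper: nbconvert command as an argument list."""
    fmt = NBCONVERT_FORMAT.get(extension, extension)
    return ["jupyter", "nbconvert", "--no-input", "--to", fmt, notebookname]


def report_destination(notebookname, extension="html", root="."):
    """Hypothetical helper: path of the report inside the Reports folder."""
    stem = os.path.splitext(os.path.basename(notebookname))[0]
    return os.path.join(root, "Reports", "{}.{}".format(stem, extension))


cmd = build_report_command("07_pourcentage_siretisation_v3.ipynb")
print(cmd)
print(report_destination("07_pourcentage_siretisation_v3.ipynb"))
# To run the conversion for real (requires jupyter to be installed):
# subprocess.run(cmd, check=True)
```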
