# Analysis of address divergence between INPI and INSEE

Objective(s)

* Build an analysis table that joins the prepared INPI table with the prepared INSEE table
* Analyse the similarities and dissimilarities between the INPI address and the INSEE address
* The analysis reports the number of observations falling into each case (1 to 7)
* For each case, the analysis must examine the divergences using the additional information available in the database, namely:
  * Information on the establishment
    * Creation date
      * datecreationetablissement 
      * date_début_activité 
    * Establishment open/closed
      * etatadministratifetablissement 
      * status_admin 
    * Head-office establishment (siège)
      * etablissementsiege 
      * status_ets 
  * Information on the address
    * Municipality code (code commune)
      * codecommuneetablissement 
      * code_commune  
    * Address 
      * Street number
        * numerovoieetablissement 
        * numero_voie_matching 
      * Street type
        * typevoieetablissement 
        * type_voie_matching 
* The analysis must also report the number of duplicates (rows per index_id, by case)
  * For a given case
  * For a given case and test
* The analysis must also report the number of duplicates (sequences per sequence_id, by case)
  * For a given case
  * For a given case and test
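
The seven cases are not described in prose; they are defined by the SQL filters used in the functions of this notebook. As a reading aid, here is a minimal Python sketch of the same logic applied to one pair of addresses. The function name `classify_address_pair` and the sample addresses are hypothetical, and the direction of cases 6 and 7 is an assumption taken from the aliases used in `query_duplicate_cas`.

```python
def classify_address_pair(adresse_inpi: str, adresse_insee: str) -> int:
    """Return the case number (1 to 7) for one pair of normalised addresses."""
    inpi = set(adresse_inpi.split(" "))
    insee = set(adresse_insee.split(" "))
    common = inpi & insee
    only_inpi = inpi - insee    # tokens found in the INPI address only
    only_insee = insee - inpi   # tokens found in the INSEE address only

    if not only_inpi and not only_insee:
        return 1  # intersection = union_: identical token sets
    if not common:
        return 2  # intersection = 0: no token in common
    if not only_inpi:
        return 3  # every INPI token is also in the INSEE address
    if not only_insee:
        return 4  # every INSEE token is also in the INPI address
    if len(only_inpi) == len(only_insee):
        return 5  # same number of unmatched tokens on both sides
    # Assumption: direction of cases 6 and 7 follows query_duplicate_cas.
    return 6 if len(only_inpi) > len(only_insee) else 7
```

For instance, `classify_address_pair("10 RUE PASTEUR", "12 RUE PASTEUR")` falls into case 5: two tokens are shared and each side has exactly one unmatched token.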

## Metadata

* Metadata parameters are available here: Ressources_suDYJ#_luZqd
* Task type:
  * Jupyter Notebook
* Users:
  * Thomas Pernet
* Watchers:
  * Thomas Pernet
* Estimated Log points:
  * One being a simple task, 15 a very difficult one
  *  10
* Task tag
  *  #sql-query,#probability,#matching
* Toggl Tag
  * #datanalaysis
* Instance [AWS/GCP]
  *  
  
## Input Cloud Storage [AWS/GCP]

If the input is a link from the internet, save it to the cloud first.

### AWS

1. S3
  * File (csv/json) + name and link: 
    * 
    * Notebook construction file (data lineage, md link from Github) 
      * md :
      * py :
2. Athena 
  * Region: eu-west-3 
  * Database: inpi 
    * Table: ets_final_sql  
  * Notebook construction file (data lineage) 
    * md : https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/03_ETS_add_variables.md
  * Region: Europe (Paris)
  * Database: inpi 
    * Table: insee_final_sql  
  * Notebook construction file (data lineage) 
    * md : 04_ETS_add_variables_insee.md
    
## Destination Output/Delivery

1. Table/Data (AWS/GCP link)
  * Description expected outcome:
    *  The table brings together the INSEE and INPI data along with a set of related variables used to tell SIRET numbers apart
  * AWS
    * Bucket:
      * Link
  * Athena: 
    * Region: Europe (Paris)
    * Database: inpi 
    *  Table:   ets_insee_inpi  
    
## Things to know (Steps, Attention points or new flow of information)

### Sources of information  (meeting notes, Documentation, Query, URL)

1. Other source [Name](link)
  * Source 1: the set of Presto functions that apply to arrays. The unique, distinct and intersect functions are the relevant ones for this analysis
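
To make the array functions concrete, here is a small Python sketch of what `array_distinct`, `array_intersect`, `array_union` and `array_except` compute on tokenised addresses. These are plain-Python stand-ins written for illustration, not the Presto implementations; the element order they return and the sample addresses are assumptions. The final ratio is also an assumption about how the two counts might be combined, since the queries in this notebook only compare them.

```python
def array_distinct(values):
    """Drop duplicates, keeping the first occurrence of each element."""
    return list(dict.fromkeys(values))

def array_intersect(x, y):
    """Elements present in both x and y, without duplicates."""
    y_set = set(y)
    return [v for v in array_distinct(x) if v in y_set]

def array_union(x, y):
    """Elements present in x or y, without duplicates."""
    return array_distinct(list(x) + list(y))

def array_except(x, y):
    """Elements of x that are absent from y, without duplicates."""
    y_set = set(y)
    return [v for v in array_distinct(x) if v not in y_set]

# Hypothetical normalised addresses, split on spaces as in the SQL queries
inpi = "10 RUE DE LA PAIX".split(" ")
insee = "10 BIS RUE DE LA PAIX PARIS".split(" ")

shared = array_intersect(inpi, insee)
everything = array_union(inpi, insee)
similarity = len(shared) / len(everything)  # share of common tokens
```

In Presto, `array_intersect`, `array_union` and `array_except` are documented as returning results without duplicates, so wrapping them in `array_distinct` appears redundant but harmless.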

## Server connection

In [1]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_athena import service_athena
from awsPy.aws_s3 import service_s3
from pathlib import Path
import pandas as pd
import numpy as np
import os, shutil
bucket = 'calfdata'
# The AWS credential file is expected one level above the working directory
path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = r"{}/credential_AWS.json".format(parent_path)
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = 'eu-west-3')
client= con.client_boto()
# The S3 and Athena helpers share the same client and bucket
s3 = service_s3.connect_S3(client = client,
                      bucket = bucket, verbose = False) 
athena = service_athena.connect_athena(client = client,
                      bucket = bucket) 

# Functions

The function below generates the analysis table through a query and returns a pandas DataFrame, while storing the result in the following folder:

- [calfdata/Analyse_cas_similarite_adresse](https://s3.console.aws.amazon.com/s3/buckets/calfdata/Analyse_cas_similarite_adresse/?region=eu-west-3&tab=overview)

In [2]:
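# Every (group, count) combination: 3 groups x counts from 1 to 19
a = ["True", "False", "NULL"]
b = range(1,20)

index = pd.MultiIndex.from_product([a, b], names = ["groups", "cnt_test"])

# to_csv returns None when given a path, so the result is not assigned
(pd.DataFrame(index = index)
       .reset_index()
       .sort_values(by = ["cnt_test", "groups"])
       .to_csv('cartesian_table.csv', index = False)
      )

# Upload the CSV so that Athena can expose it as an external table
s3.upload_file(file_to_upload = 'cartesian_table.csv',
            destination_in_s3 = 'Temp_table_analyse_similarite')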

In [3]:
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS inpi.cartesian_table (
`groups`                     string,
`cnt_test`                   int
    )
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '"'
   )
     LOCATION 's3://calfdata/Temp_table_analyse_similarite'
     TBLPROPERTIES ('has_encrypted_data'='false',
              'skip.header.line.count'='1');"""
output = athena.run_query(
        query=create_table,
        database='inpi',
        s3_output='INPI/sql_output'
    )

Execution ID: 389b3e01-f7ed-4139-96ec-eff8ed0c9bdb
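
The `cartesian_table` created above is used later as the left side of a chain of LEFT JOINs, presumably so that every (group, count) combination gets a row even when a given test produced no result for it. The sketch below reproduces that idea in plain Python; the `observed` counts are invented for illustration.

```python
from itertools import product

groups = ["True", "False", "NULL"]
scaffold = list(product(groups, range(1, 20)))  # 3 x 19 = 57 combinations

# Hypothetical sparse result keyed by (group, number of rows per identifier)
observed = {("True", 1): 120, ("False", 2): 3}

# Equivalent of the LEFT JOIN: every scaffold row is kept,
# combinations without a match get None (NULL in SQL)
aligned = [(g, n, observed.get((g, n))) for g, n in scaffold]
```

One consequence of this design, if the scaffold is used this way: identifiers repeated 20 times or more have no matching `cnt_test` row and would not appear in the final table.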


## Count of observations per case

In [4]:
query_count = """
WITH test_proba AS (
  SELECT 
    array_distinct(
      split(adresse_distance_inpi, ' ')
    ) as list_inpi, 
  
    cardinality(
      array_distinct(
        split(adresse_distance_inpi, ' ')
      )
    ) as lenght_list_inpi, 
  
    array_distinct(
      split(adresse_distance_insee, ' ')
    ) as list_insee, 
  
    cardinality(
      array_distinct(
        split(adresse_distance_insee, ' ')
      )
    ) as lenght_list_insee, 
  
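    -- Aliases aligned with the other queries in this notebook:
    -- inpi_except = INSEE tokens absent from the INPI address
    array_distinct(
      array_except(
        split(adresse_distance_insee, ' '), 
        split(adresse_distance_inpi, ' ')
      )
    ) as inpi_except, 
    -- insee_except = INPI tokens absent from the INSEE address
  array_distinct(
      array_except(
        split(adresse_distance_inpi, ' '), 
        split(adresse_distance_insee, ' ')
      )
    ) as insee_except,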
    CAST(
      cardinality(
        array_distinct(
          array_intersect(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as intersection, 
    CAST(
      cardinality(
        array_distinct(
          array_union(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as union_ 
  FROM 
    "inpi"."ets_insee_inpi" -- limit 10
    ) 
SELECT 
 count(*) 
FROM 
  test_proba 
WHERE {}
"""

filter_ = ""

In [5]:
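def compte_obs_cas(case = 1):
    """
    Return the number of observations of ets_insee_inpi that fall into `case` (1 to 7).
    The raw Athena result is copied to Analyse_cas_similarite_adresse/nb_obs_cas_<case>.csv
    """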
    
    if case ==1:
        
        filter_= "intersection = union_"
        
    if case ==2:
        
        filter_= "intersection = 0"
    
    if case ==3:
        
        filter_= "lenght_list_inpi = intersection AND intersection != union_"
    
    if case ==4:
        
        filter_= "lenght_list_insee = intersection AND intersection != union_"
    
    if case ==5:
        
        filter_= "cardinality(insee_except) = cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0"
    
    if case ==6:
        
        filter_= "cardinality(insee_except) > cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"
    
    if case ==7:
        
        filter_= "cardinality(insee_except) < cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"
    
    query_ =query_count.format(filter_)
    
    output = athena.run_query(
        query=query_,
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    
    filename = 'nb_obs_cas_{}.csv'.format(case)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'Analyse_cas_similarite_adresse',
                                filename
                            )

        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )
        
    df_ = (s3.read_df_from_s3(
        key = 'Analyse_cas_similarite_adresse/{}'.format(filename), sep = ',')
          ).values[0][0]
    
    return df_
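
The same case-to-filter mapping is repeated in each function of this notebook, and a `case` outside 1 to 7 leaves `filter_` undefined. A possible way to centralise it is sketched below. The names `CASE_FILTERS` and `get_case_filter` are hypothetical; the filter strings are copied from the function above.

```python
# Hypothetical shared mapping: case number -> SQL filter used in the WHERE clause
CASE_FILTERS = {
    1: "intersection = union_",
    2: "intersection = 0",
    3: "lenght_list_inpi = intersection AND intersection != union_",
    4: "lenght_list_insee = intersection AND intersection != union_",
    5: ("cardinality(insee_except) = cardinality(inpi_except) "
        "AND intersection != 0 AND cardinality(insee_except) > 0"),
    6: ("cardinality(insee_except) > cardinality(inpi_except) "
        "AND intersection != 0 AND cardinality(insee_except) > 0 "
        "AND cardinality(inpi_except) > 0"),
    7: ("cardinality(insee_except) < cardinality(inpi_except) "
        "AND intersection != 0 AND cardinality(insee_except) > 0 "
        "AND cardinality(inpi_except) > 0"),
}

def get_case_filter(case):
    """Return the SQL filter for `case`, failing loudly on an unknown case."""
    try:
        return CASE_FILTERS[case]
    except KeyError:
        raise ValueError("case must be an integer from 1 to 7, got {!r}".format(case)) from None
```

Each function could then start with `filter_ = get_case_filter(case)` instead of the chain of `if` statements.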

## Count of duplicates per case

In [6]:
query_duplicate_cas = """
WITH test_proba AS (
  SELECT 
    {0}, 
    Coalesce(
      try(
        date_parse(
          datecreationetablissement, '%Y-%m-%d'
        )
      ), 
      try(
        date_parse(
          datecreationetablissement, '%Y-%m-%d %H:%i:%s.%f'
        )
      ), 
      try(
        date_parse(
          datecreationetablissement, '%Y-%m-%d %H:%i:%s'
        )
      ), 
      try(
        cast(
          datecreationetablissement as timestamp
        )
      )
    ) as datecreationetablissement, 
    Coalesce(
      try(
        date_parse(
          "date_début_activité", '%Y-%m-%d'
        )
      ), 
      try(
        date_parse(
          "date_début_activité", '%Y-%m-%d %H:%i:%s.%f'
        )
      ), 
      try(
        date_parse(
          "date_début_activité", '%Y-%m-%d %H:%i:%s'
        )
      ), 
      try(
        cast(
          "date_début_activité" as timestamp
        )
      )
    ) as date_debut_activite, 
    etatadministratifetablissement, 
    status_admin, 
    etablissementsiege, 
    status_ets, 
    codecommuneetablissement, 
    code_commune, 
    codepostaletablissement, 
    code_postal_matching, 
    numerovoieetablissement, 
    numero_voie_matching, 
    typevoieetablissement, 
    type_voie_matching, 
    adresse_distance_inpi, 
    adresse_distance_insee, 
    array_distinct(
      split(adresse_distance_inpi, ' ')
    ) as list_inpi, 
    cardinality(
      array_distinct(
        split(adresse_distance_inpi, ' ')
      )
    ) as lenght_list_inpi, 
    array_distinct(
      split(adresse_distance_insee, ' ')
    ) as list_insee, 
    cardinality(
      array_distinct(
        split(adresse_distance_insee, ' ')
      )
    ) as lenght_list_insee, 
    array_distinct(
      array_except(
        split(adresse_distance_insee, ' '), 
        split(adresse_distance_inpi, ' ')
      )
    ) as inpi_except, 
    array_distinct(
      array_except(
        split(adresse_distance_inpi, ' '), 
        split(adresse_distance_insee, ' ')
      )
    ) as insee_except, 
    CAST(
      cardinality(
        array_distinct(
          array_intersect(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as intersection, 
    CAST(
      cardinality(
        array_distinct(
          array_union(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as union_ 
  FROM 
    "inpi"."ets_insee_inpi" -- limit 10
    ) 
SELECT 
  * 
FROM 
  (
    WITH tests AS (
      SELECT 
        {0}, 
        adresse_distance_inpi, 
        adresse_distance_insee, 
        datecreationetablissement, 
        date_debut_activite, 
        CASE WHEN datecreationetablissement = date_debut_activite THEN 'True' WHEN datecreationetablissement IS NULL 
        OR date_debut_activite IS NULL THEN 'NULL' ELSE 'False' END AS test_date, 
        etatadministratifetablissement, 
        status_admin, 
        CASE WHEN etatadministratifetablissement = status_admin THEN 'True' WHEN etatadministratifetablissement = '' 
        OR status_admin = '' THEN 'NULL' ELSE 'False' END AS test_status_admin, 
        etablissementsiege, 
        status_ets, 
        CASE WHEN etablissementsiege = status_ets THEN 'True' WHEN etablissementsiege = '' 
        OR status_ets = '' THEN 'NULL' ELSE 'False' END AS test_siege, 
        codecommuneetablissement, 
        code_commune, 
        CASE WHEN codecommuneetablissement = code_commune THEN 'True' WHEN codecommuneetablissement = '' 
        OR code_commune = '' THEN 'NULL' ELSE 'False' END AS test_code_commune, 
        codepostaletablissement, 
        code_postal_matching, 
        CASE WHEN codepostaletablissement = code_postal_matching THEN 'True' WHEN codepostaletablissement = '' 
        OR code_postal_matching = '' THEN 'NULL' ELSE 'False' END AS test_code_postal, 
        numerovoieetablissement, 
        numero_voie_matching, 
        CASE WHEN numerovoieetablissement = numero_voie_matching THEN 'True' WHEN numerovoieetablissement = '' 
        OR numero_voie_matching = '' THEN 'NULL' ELSE 'False' END AS test_numero_voie, 
        typevoieetablissement, 
        type_voie_matching, 
        CASE WHEN typevoieetablissement = type_voie_matching THEN 'True' WHEN typevoieetablissement = '' 
        OR type_voie_matching = '' THEN 'NULL' ELSE 'False' END AS test_type_voie, 
        list_inpi, 
        list_insee, 
        inpi_except, 
        insee_except, 
        intersection, 
        union_ 
      FROM 
        test_proba 
      WHERE 
        {1}
    ) 
    SELECT 
      count_index_id, 
      COUNT(*) count_duplicate_index_id 
    FROM 
      (
        SELECT 
          {0}, 
          COUNT(*) AS count_index_id 
        FROM 
          tests 
        GROUP BY 
          {0}
      ) 
    GROUP BY 
      count_index_id -- WHERE test_type_voie = 'False'
      ) 
ORDER BY 
  count_index_id,
  count_duplicate_index_id DESC
"""

In [7]:
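def compte_dup_cas(var = 'index_id', case = 1):
    """
    Return a styled table giving, for `case` (1 to 7), how many values of `var`
    (index_id or sequence_id) appear once, twice, and so on, with the share of each.
    The raw Athena result is copied to Analyse_cas_similarite_adresse/nb_dup_cas_<case>.csv
    """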
    
    if case ==1:
        
        filter_= "intersection = union_"
        
    if case ==2:
        
        filter_= "intersection = 0"
    
    if case ==3:
        
        filter_= "lenght_list_inpi = intersection AND intersection != union_"
    
    if case ==4:
        
        filter_= "lenght_list_insee = intersection AND intersection != union_"
    
    if case ==5:
        
        filter_= "cardinality(insee_except) = cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0"
    
    if case ==6:
        
        filter_= "cardinality(insee_except) > cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"
    
    if case ==7:
        
        filter_= "cardinality(insee_except) < cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"
    
    query_ =query_duplicate_cas.format(var, filter_)
    
    output = athena.run_query(
        query=query_,
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    
    filename = 'nb_dup_cas_{}.csv'.format(case)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'Analyse_cas_similarite_adresse',
                                filename
                            )

        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )
        
    df_ = (s3.read_df_from_s3(
        key = 'Analyse_cas_similarite_adresse/{}'.format(filename), sep = ',')
           .assign(percentage = lambda x: x['count_duplicate_index_id'] / 
                  x['count_duplicate_index_id'].sum())
           .style
           .bar(subset= ['count_duplicate_index_id'],
                   color='#d65f5f')
           .format("{:.2%}", subset =  ['percentage'])
           .format("{:,.0f}", subset =  ['count_duplicate_index_id'])
           
          )
    
    return df_
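
The nested `GROUP BY` in `query_duplicate_cas` is a count of counts: first the number of rows per identifier, then the number of identifiers sharing each row count. A minimal Python sketch of the same computation follows; the function name `duplicate_distribution` and the sample identifiers are hypothetical.

```python
from collections import Counter

def duplicate_distribution(identifiers):
    """Return sorted (rows per identifier, number of identifiers) pairs."""
    rows_per_id = Counter(identifiers)             # inner GROUP BY on the identifier
    distribution = Counter(rows_per_id.values())   # outer GROUP BY on the row count
    return sorted(distribution.items())

# Hypothetical index_id values: 1 and 4 appear once, 2 twice, 3 three times
index_ids = [1, 2, 2, 3, 3, 3, 4]
result = duplicate_distribution(index_ids)
```

Here `result` is `[(1, 2), (2, 1), (3, 1)]`: two identifiers appear on a single row, one on two rows and one on three rows.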

## Count of observations per case and test

In [8]:
query_count_case = """
WITH test_proba AS (
  SELECT 
  Coalesce(
         try(date_parse(datecreationetablissement, '%Y-%m-%d')),
         try(date_parse(datecreationetablissement, '%Y-%m-%d %H:%i:%s.%f')),
         try(date_parse(datecreationetablissement, '%Y-%m-%d %H:%i:%s')),
         try(cast(datecreationetablissement as timestamp))
       )  as datecreationetablissement,

Coalesce(
         try(date_parse("date_début_activité", '%Y-%m-%d')),
         try(date_parse("date_début_activité", '%Y-%m-%d %H:%i:%s.%f')),
         try(date_parse("date_début_activité", '%Y-%m-%d %H:%i:%s')),
         try(cast("date_début_activité" as timestamp))
  ) as date_debut_activite,
  etatadministratifetablissement, status_admin,
  etablissementsiege,status_ets,
  codecommuneetablissement, code_commune,
  codepostaletablissement, code_postal_matching,
  numerovoieetablissement, numero_voie_matching,
  typevoieetablissement, type_voie_matching,
  
    array_distinct(
      split(adresse_distance_inpi, ' ')
    ) as list_inpi, 
  
    cardinality(
      array_distinct(
        split(adresse_distance_inpi, ' ')
      )
    ) as lenght_list_inpi, 
  
    array_distinct(
      split(adresse_distance_insee, ' ')
    ) as list_insee, 
  
    cardinality(
      array_distinct(
        split(adresse_distance_insee, ' ')
      )
    ) as lenght_list_insee,
  
  array_distinct(
              array_except(
                split(adresse_distance_insee, ' '), 
                split(adresse_distance_inpi, ' ')
              )
            )as inpi_except, 
  array_distinct(
              array_except(
                split(adresse_distance_inpi, ' '), 
                split(adresse_distance_insee, ' ')
              )
            )as insee_except,
  
    CAST(
      cardinality(
        array_distinct(
          array_intersect(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as intersection, 
    CAST(
      cardinality(
        array_distinct(
          array_union(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as union_
  FROM "inpi"."ets_insee_inpi"-- limit 10
  )
  SELECT *
  FROM (WITH tests AS (
    SELECT 
  datecreationetablissement,date_debut_activite,
  CASE WHEN datecreationetablissement = date_debut_activite THEN 'True'
  WHEN datecreationetablissement IS NULL OR date_debut_activite IS NULL THEN 'NULL'
  ELSE 'False' END AS test_date,
  
  etatadministratifetablissement,status_admin,
  CASE WHEN etatadministratifetablissement = status_admin THEN 'True' 
  WHEN etatadministratifetablissement = '' OR status_admin = '' THEN 'NULL'
  ELSE 'False' END AS test_status_admin,
  
  etablissementsiege,status_ets,
  CASE WHEN etablissementsiege = status_ets THEN 'True' 
  WHEN etablissementsiege = '' OR status_ets = '' THEN 'NULL'
  ELSE 'False' END AS test_siege,
  
  codecommuneetablissement,code_commune,
  CASE WHEN codecommuneetablissement = code_commune THEN 'True' 
  WHEN codecommuneetablissement = '' OR code_commune = '' THEN 'NULL'
  ELSE 'False' END AS test_code_commune,
  
  codepostaletablissement,code_postal_matching,
  CASE WHEN codepostaletablissement = code_postal_matching THEN 'True' 
  WHEN codepostaletablissement = '' OR code_postal_matching = '' THEN 'NULL'
  ELSE 'False' END AS test_code_postal,
  
  numerovoieetablissement,numero_voie_matching,
  CASE WHEN numerovoieetablissement = numero_voie_matching THEN 'True' 
  WHEN numerovoieetablissement = '' OR numero_voie_matching = '' THEN 'NULL'
  ELSE 'False' END AS test_numero_voie,
  
  typevoieetablissement,type_voie_matching,
  CASE WHEN typevoieetablissement = type_voie_matching THEN 'True'
  WHEN typevoieetablissement = '' OR type_voie_matching = '' THEN 'NULL'
  ELSE 'False' END AS test_type_voie,
  
  list_inpi, list_insee, inpi_except, insee_except, intersection, union_
  
  FROM test_proba
  WHERE {}
    )
        
  SELECT DISTINCT(test_date) as groups,
        count_test_date,
        count_test_status_admin,
        count_test_siege,
        count_test_commune,
        count_test_cp,
        count_test_num_voie,
        count_test_type_voie
        
  FROM tests
  LEFT JOIN (
    SELECT test_date as groups,  count(*) as count_test_date
    FROM tests
    GROUP BY test_date
    ) as date_
  ON date_.groups = tests.test_date
        
  LEFT JOIN (
    SELECT test_status_admin as groups,  count(*) as count_test_status_admin
    FROM tests
    GROUP BY test_status_admin
    ) as admin_
  ON admin_.groups = tests.test_date      
        
LEFT JOIN (
    SELECT test_siege as groups,  count(*) as count_test_siege
    FROM tests
    GROUP BY test_siege
    ) as siege_
  ON siege_.groups = tests.test_date

LEFT JOIN (
      SELECT test_code_commune as groups,  count(*) as count_test_commune
      FROM tests
      GROUP BY test_code_commune
      ) as code_commune_
    ON code_commune_.groups = tests.test_date

LEFT JOIN (
        SELECT test_code_postal as groups,  count(*) as count_test_cp
        FROM tests
        GROUP BY test_code_postal
        ) as cp_
      ON cp_.groups = tests.test_date

LEFT JOIN (
          SELECT test_numero_voie as groups,  count(*) as count_test_num_voie
          FROM tests
          GROUP BY test_numero_voie
          ) as num_voie_
        ON num_voie_.groups = tests.test_date

LEFT JOIN (
            SELECT test_type_voie as groups,  count(*) as count_test_type_voie
            FROM tests
            GROUP BY test_type_voie
            ) as type_voie_
          ON type_voie_.groups = tests.test_date
 )
"""

In [9]:
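def generate_analytical_table(case = 1):
    """
    Return a styled table giving, for `case` (1 to 7), the count and share of
    'True', 'False' and 'NULL' outcomes for each test (date, administrative status,
    head office, commune code, postal code, street number, street type).
    The raw Athena result is copied to Analyse_cas_similarite_adresse/cas_<case>.csv
    """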
    if case ==1:
        
        filter_= "intersection = union_"
        
    if case ==2:
        
        filter_= "intersection = 0"
    
    if case ==3:
        
        filter_= "lenght_list_inpi = intersection AND intersection != union_"
    
    if case ==4:
        
        filter_= "lenght_list_insee = intersection AND intersection != union_"
    
    if case ==5:
        
        filter_= "cardinality(insee_except) = cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0"
    
    if case ==6:
        
        filter_= "cardinality(insee_except) > cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"
    
    if case ==7:
        
        filter_= "cardinality(insee_except) < cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"

    output = athena.run_query(
        query=query_count_case.format(filter_),
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    filename = 'cas_{}.csv'.format(case)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'Analyse_cas_similarite_adresse',
                                filename
                            )

        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )

    test_1 = (s3.read_df_from_s3(
        key = 'Analyse_cas_similarite_adresse/{}'.format(filename), sep = ',')
              .assign(test = 'cas_{}'.format(case),
                     count_test_date_pct = lambda x: x['count_test_date'] / x['count_test_date'].sum(),
                     count_test_status_admin_pct = lambda x: x['count_test_status_admin'] / x['count_test_status_admin'].sum(),
                     count_test_siege_pct = lambda x: x['count_test_siege'] / x['count_test_siege'].sum(),
                     count_test_commune_pct = lambda x: x['count_test_commune'] / x['count_test_commune'].sum(),
                     count_test_cp_pct = lambda x: x['count_test_cp'] / x['count_test_cp'].sum(),
                     count_test_num_voie_pct = lambda x: x['count_test_num_voie'] / x['count_test_num_voie'].sum(),
                     count_test_type_voie_pct = lambda x: x['count_test_type_voie'] / x['count_test_type_voie'].sum(),
                     )

              .replace({'groups' :{np.nan: 'Null'}})
              #.set_index(['test'])
              .reindex(columns = [
                  'test',
                  'groups',
                   'count_test_num_voie','count_test_num_voie_pct',
                  'count_test_type_voie','count_test_type_voie_pct',
                  'count_test_commune','count_test_commune_pct',
                  'count_test_date','count_test_date_pct',
                  'count_test_status_admin','count_test_status_admin_pct',
                  'count_test_siege','count_test_siege_pct',
                  
                  'count_test_cp','count_test_cp_pct',
                 
              ])
              .fillna(0)
              .style
              .format("{:,.0f}", subset =  ['count_test_date',
                                            'count_test_status_admin',
                                            'count_test_siege',
                                            'count_test_commune',
                                            'count_test_cp',
                                            'count_test_num_voie',
                                            'count_test_type_voie'])
              .format("{:.2%}", subset =  ['count_test_date_pct',
                                           'count_test_status_admin_pct',
                                           'count_test_siege_pct',
                                           'count_test_commune_pct',
                                           'count_test_cp_pct',
                                           'count_test_num_voie_pct',
                                           'count_test_type_voie_pct'])
              .bar(subset= ['count_test_date',
                                            'count_test_status_admin',
                                            'count_test_siege',
                                            'count_test_commune',
                                            'count_test_cp',
                                            'count_test_num_voie',
                                            'count_test_type_voie'],
                   color='#d65f5f')
              #.unstack(0)
             )    
    
    return test_1
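
Each string test in the query yields one of three labels through a `CASE WHEN`: 'True' when the INSEE and INPI values are equal, 'NULL' when either is an empty string, 'False' otherwise. The sketch below mirrors that rule in Python and computes the counts and shares shown in the table. The function name `compare_fields` and the sample postal codes are hypothetical, and `None` stands in for a SQL NULL. The date test is not covered, since it checks `IS NULL` rather than empty strings.

```python
from collections import Counter

def compare_fields(insee_value, inpi_value):
    """Mimic the CASE WHEN of the string tests; None stands for SQL NULL."""
    # In SQL, a comparison involving NULL is never true
    if insee_value is not None and inpi_value is not None and insee_value == inpi_value:
        return "True"
    if insee_value == "" or inpi_value == "":
        return "NULL"
    return "False"

# Hypothetical (INSEE, INPI) postal-code pairs
pairs = [("75001", "75001"), ("75001", "75002"), ("", "75002"), ("69001", "69001")]

counts = Counter(compare_fields(insee, inpi) for insee, inpi in pairs)
shares = {group: n / len(pairs) for group, n in counts.items()}
```

Because the equality branch comes first, two empty strings are labelled 'True' rather than 'NULL', which is also how the SQL expression behaves.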

## Count of duplicates per case and test

In [10]:
query_dup_cas = """
WITH test_proba AS (
  SELECT 
  {0},
  Coalesce(
         try(date_parse(datecreationetablissement, '%Y-%m-%d')),
         try(date_parse(datecreationetablissement, '%Y-%m-%d %H:%i:%s.%f')),
         try(date_parse(datecreationetablissement, '%Y-%m-%d %H:%i:%s')),
         try(cast(datecreationetablissement as timestamp))
       )  as datecreationetablissement,

Coalesce(
         try(date_parse("date_début_activité", '%Y-%m-%d')),
         try(date_parse("date_début_activité", '%Y-%m-%d %H:%i:%s.%f')),
         try(date_parse("date_début_activité", '%Y-%m-%d %H:%i:%s')),
         try(cast("date_début_activité" as timestamp))
  ) as date_debut_activite,
  etatadministratifetablissement, status_admin,
  etablissementsiege,status_ets,
  codecommuneetablissement, code_commune,
  codepostaletablissement, code_postal_matching,
  numerovoieetablissement, numero_voie_matching,
  typevoieetablissement, type_voie_matching,
  
    array_distinct(
      split(adresse_distance_inpi, ' ')
    ) as list_inpi, 
  
    cardinality(
      array_distinct(
        split(adresse_distance_inpi, ' ')
      )
    ) as lenght_list_inpi, 
  
    array_distinct(
      split(adresse_distance_insee, ' ')
    ) as list_insee, 
  
    cardinality(
      array_distinct(
        split(adresse_distance_insee, ' ')
      )
    ) as lenght_list_insee,
  
  array_distinct(
              array_except(
                split(adresse_distance_insee, ' '), 
                split(adresse_distance_inpi, ' ')
              )
            )as inpi_except, 
  array_distinct(
              array_except(
                split(adresse_distance_inpi, ' '), 
                split(adresse_distance_insee, ' ')
              )
            )as insee_except,
  
    CAST(
      cardinality(
        array_distinct(
          array_intersect(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as intersection, 
    CAST(
      cardinality(
        array_distinct(
          array_union(
            split(adresse_distance_inpi, ' '), 
            split(adresse_distance_insee, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ) as union_
  FROM "inpi"."ets_insee_inpi"-- limit 10
  )
  SELECT *
  FROM (WITH tests AS (
    SELECT 
    {0},
  datecreationetablissement,date_debut_activite,
  CASE WHEN datecreationetablissement = date_debut_activite THEN 'True'
  WHEN datecreationetablissement IS NULL OR date_debut_activite IS NULL THEN 'NULL'
  ELSE 'False' END AS test_date,
  
  etatadministratifetablissement,status_admin,
  CASE WHEN etatadministratifetablissement = status_admin THEN 'True' 
  WHEN etatadministratifetablissement = '' OR status_admin = '' THEN 'NULL'
  ELSE 'False' END AS test_status_admin,
  
  etablissementsiege,status_ets,
  CASE WHEN etablissementsiege = status_ets THEN 'True' 
  WHEN etablissementsiege = '' OR status_ets = '' THEN 'NULL'
  ELSE 'False' END AS test_siege,
  
  codecommuneetablissement,code_commune,
  CASE WHEN codecommuneetablissement = code_commune THEN 'True' 
  WHEN codecommuneetablissement = '' OR code_commune = '' THEN 'NULL'
  ELSE 'False' END AS test_code_commune,
  
  codepostaletablissement,code_postal_matching,
  CASE WHEN codepostaletablissement = code_postal_matching THEN 'True' 
  WHEN codepostaletablissement = '' OR code_postal_matching = '' THEN 'NULL'
  ELSE 'False' END AS test_code_postal,
  
  numerovoieetablissement,numero_voie_matching,
  CASE WHEN numerovoieetablissement = numero_voie_matching THEN 'True' 
  WHEN numerovoieetablissement = '' OR numero_voie_matching = '' THEN 'NULL'
  ELSE 'False' END AS test_numero_voie,
  
  typevoieetablissement,type_voie_matching,
  CASE WHEN typevoieetablissement = type_voie_matching THEN 'True'
  WHEN typevoieetablissement = '' OR type_voie_matching = '' THEN 'NULL'
  ELSE 'False' END AS test_type_voie,
  
  list_inpi, list_insee, inpi_except, insee_except, intersection, union_
  
  FROM test_proba
  WHERE {1}
    )
        SELECT 
        cartesian_table.groups, 
        cnt_test,
        cnt_index_date,
        cnt_index_admin,
        cnt_index_siege,
        cnt_index_commune,
        cnt_index_cp,
        cnt_index_num_voie,
        cnt_index_type_voie

        
  FROM cartesian_table
        
  LEFT JOIN (
    
    SELECT groups, cnt_test_date, COUNT(*) AS cnt_index_date   
    FROM (    
    SELECT test_date as groups,{0},   count(*) as cnt_test_date
    FROM tests
    GROUP BY test_date, {0}
    ) as date_
    GROUP BY groups, cnt_test_date
        ) as count_dup_date
   ON count_dup_date.groups = cartesian_table.groups and
        count_dup_date.cnt_test_date = cartesian_table.cnt_test
        
   LEFT JOIN (
    
    SELECT groups, cnt_test_admin, COUNT(*) AS cnt_index_admin
    FROM (    
    SELECT test_status_admin as groups,{0}, count(*) as cnt_test_admin
    FROM tests
    GROUP BY test_status_admin, {0}
    ) as admin_
    GROUP BY groups, cnt_test_admin
        ) as count_dup_admin_
   ON count_dup_admin_.groups = cartesian_table.groups and
        count_dup_admin_.cnt_test_admin = cartesian_table.cnt_test   
        
   
   LEFT JOIN (
    
    SELECT groups, cnt_test_siege, COUNT(*) AS cnt_index_siege
    FROM (    
    SELECT test_siege as groups,{0}, count(*) as cnt_test_siege
    FROM tests
    GROUP BY test_siege, {0}
    ) as siege_
    GROUP BY groups, cnt_test_siege
        ) as count_dup_siege_
   ON count_dup_siege_.groups = cartesian_table.groups and
        count_dup_siege_.cnt_test_siege = cartesian_table.cnt_test
        
   LEFT JOIN (
    
    SELECT groups, cnt_test_commune, COUNT(*) AS cnt_index_commune
    FROM (    
    SELECT test_code_commune as groups,{0}, count(*) as cnt_test_commune
    FROM tests
    GROUP BY test_code_commune, {0}
    ) as commune_
    GROUP BY groups, cnt_test_commune
        ) as count_dup_commune_
   ON count_dup_commune_.groups = cartesian_table.groups and
        count_dup_commune_.cnt_test_commune = cartesian_table.cnt_test
        
    LEFT JOIN (
    
    SELECT groups, cnt_test_cp, COUNT(*) AS cnt_index_cp
    FROM (    
    SELECT test_code_postal as groups,{0}, count(*) as cnt_test_cp
    FROM tests
    GROUP BY test_code_postal, {0}
    ) as cp_
    GROUP BY groups, cnt_test_cp
        ) as count_dup_cp_
   ON count_dup_cp_.groups = cartesian_table.groups and
        count_dup_cp_.cnt_test_cp = cartesian_table.cnt_test
        
   LEFT JOIN (
    
    SELECT groups, cnt_test_num_voie, COUNT(*) AS cnt_index_num_voie
    FROM (    
    SELECT test_numero_voie as groups,{0}, count(*) as cnt_test_num_voie
    FROM tests
    GROUP BY test_numero_voie, {0}
    ) as num_voie_
    GROUP BY groups, cnt_test_num_voie
        ) as count_dup_num_voie_
   ON count_dup_num_voie_.groups = cartesian_table.groups and
        count_dup_num_voie_.cnt_test_num_voie = cartesian_table.cnt_test
        
   LEFT JOIN (
    
    SELECT groups, cnt_test_type_voie, COUNT(*) AS cnt_index_type_voie
    FROM (    
    SELECT test_type_voie as groups,{0}, count(*) as cnt_test_type_voie
    FROM tests
    GROUP BY test_type_voie, {0}
    ) as type_voie_
    GROUP BY groups, cnt_test_type_voie
        ) as count_dup_type_voie_
   ON count_dup_type_voie_.groups = cartesian_table.groups and
        count_dup_type_voie_.cnt_test_type_voie = cartesian_table.cnt_test
   
        )
   ORDER BY cnt_test ASC, groups
"""
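Each `LEFT JOIN` in the query above applies the same two-level aggregation: first count how many rows each identifier (`{0}`) has for a given test outcome, then count how many identifiers share that row count. The sketch below reproduces that logic on toy data with the standard library. The function name `duplicate_profile` and the sample rows are hypothetical and only serve the illustration.

```python
from collections import Counter


def duplicate_profile(rows):
    """Two-level count over (test_value, identifier) pairs.

    Returns a Counter keyed by (test_value, n_rows) whose value is the
    number of identifiers having exactly n_rows rows for that test value.
    """
    # Inner GROUP BY test, identifier: rows per identifier and test value.
    per_identifier = Counter(rows)
    # Outer GROUP BY groups, cnt_test: identifiers per row count.
    return Counter(
        (test_value, n_rows)
        for (test_value, _identifier), n_rows in per_identifier.items()
    )


# Toy data: identifier 1 appears twice with a 'True' test outcome.
rows = [("True", 1), ("True", 1), ("True", 2), ("False", 3), ("NULL", 3)]
profile = duplicate_profile(rows)
for (test_value, n_rows), n_identifiers in sorted(profile.items()):
    print(test_value, n_rows, n_identifiers)
```

In the query, `cartesian_table` then supplies every `(groups, cnt_test)` combination, so combinations with no identifier come back as nulls (filled with 0 later) instead of being dropped.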

In [11]:
def generate_analytical_table_dup(var = 'index_id', case = 1):
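    """
    Build the styled duplicate-analysis table for one case (1 to 7).

    `var` is the identifier whose duplicates are counted ('index_id' or
    'sequence_id'); `case` selects the SQL filter applied to test_proba.
    """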
    if case ==1:
        
        filter_= "intersection = union_"
        
    if case ==2:
        
        filter_= "intersection = 0"
    
    if case ==3:
        
        filter_= "lenght_list_inpi = intersection AND intersection != union_"
    
    if case ==4:
        
        filter_= "lenght_list_insee = intersection AND intersection != union_"
    
    if case ==5:
        
        filter_= "cardinality(insee_except) = cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0"
    
    if case ==6:
        
        filter_= "cardinality(insee_except) > cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"
    
    if case ==7:
        
        filter_= "cardinality(insee_except) < cardinality(inpi_except) AND intersection != 0 AND cardinality(insee_except) > 0 AND cardinality(inpi_except) > 0"

    output = athena.run_query(
        query=query_dup_cas.format(var, filter_),
        database='inpi',
        s3_output='INPI/sql_output'
    )

    results = False
    filename = 'cas_dup_{}.csv'.format(case)
    
    while results != True:
        source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
        destination_key = "{}/{}".format(
                                'Analyse_cas_similarite_adresse',
                                filename
                            )

        results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )

    test_1 = (s3.read_df_from_s3(
        key = 'Analyse_cas_similarite_adresse/{}'.format(filename), sep = ',')
              .assign(test = 'cas_{}'.format(case),
                     count_test_date_pct = lambda x: x['cnt_index_date'] / x['cnt_index_date'].sum(),
                     count_test_status_admin_pct = lambda x: x['cnt_index_admin'] / x['cnt_index_admin'].sum(),
                     count_test_siege_pct = lambda x: x['cnt_index_siege'] / x['cnt_index_siege'].sum(),
                     count_test_commune_pct = lambda x: x['cnt_index_commune'] / x['cnt_index_commune'].sum(),
                     count_test_cp_pct = lambda x: x['cnt_index_cp'] / x['cnt_index_cp'].sum(),
                     count_test_num_voie_pct = lambda x: x['cnt_index_num_voie'] / x['cnt_index_num_voie'].sum(),
                     count_test_type_voie_pct = lambda x: x['cnt_index_type_voie'] / x['cnt_index_type_voie'].sum(),
                     )

              .replace({'groups' :{np.nan: 'Null'}})
              #.set_index(['test'])
              .reindex(columns = [
                  'test',
                  'groups',
                  "cnt_test",
                  'cnt_index_num_voie','count_test_num_voie_pct',
                  'cnt_index_type_voie','count_test_type_voie_pct',
                  'cnt_index_commune','count_test_commune_pct',
                  'cnt_index_date','count_test_date_pct',
                  'cnt_index_admin','count_test_status_admin_pct',
                  'cnt_index_siege','count_test_siege_pct',        
                  'cnt_index_cp','count_test_cp_pct',
                  
                  
              ])
              .fillna(0)
              .style
              .format("{:,.0f}", subset =  ['cnt_index_date',
                                            'cnt_index_admin',
                                            'cnt_index_siege',
                                            'cnt_index_commune',
                                            'cnt_index_cp',
                                            'cnt_index_num_voie',
                                            'cnt_index_type_voie'])
              .format("{:.2%}", subset =  ['count_test_date_pct',
                                           'count_test_status_admin_pct',
                                           'count_test_siege_pct',
                                           'count_test_commune_pct',
                                           'count_test_cp_pct',
                                           'count_test_num_voie_pct',
                                           'count_test_type_voie_pct'])
              .bar(subset= ['cnt_index_date',
                                            'cnt_index_admin',
                                            'cnt_index_siege',
                                            'cnt_index_commune',
                                            'cnt_index_cp',
                                            'cnt_index_num_voie',
                                            'cnt_index_type_voie'],
                   color='#d65f5f')
              #.unstack(0)
             )    
    
    return test_1
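The chain of `if case == ...` statements above leaves `filter_` undefined when `case` is outside 1 to 7, so the failure only shows up later as an unbound variable. One possible alternative, shown here as a sketch rather than as the notebook's own code, is a lookup table that fails early with an explicit message. The names `CASE_FILTERS` and `filter_for_case` are hypothetical; the filter strings are copied unchanged from the function.

```python
# Hypothetical lookup table; the SQL filters are copied verbatim from
# generate_analytical_table_dup.
CASE_FILTERS = {
    1: "intersection = union_",
    2: "intersection = 0",
    3: "lenght_list_inpi = intersection AND intersection != union_",
    4: "lenght_list_insee = intersection AND intersection != union_",
    5: "cardinality(insee_except) = cardinality(inpi_except) "
       "AND intersection != 0 AND cardinality(insee_except) > 0",
    6: "cardinality(insee_except) > cardinality(inpi_except) "
       "AND intersection != 0 AND cardinality(insee_except) > 0 "
       "AND cardinality(inpi_except) > 0",
    7: "cardinality(insee_except) < cardinality(inpi_except) "
       "AND intersection != 0 AND cardinality(insee_except) > 0 "
       "AND cardinality(inpi_except) > 0",
}


def filter_for_case(case):
    """Return the SQL filter for a case, or raise if the case is unknown."""
    try:
        return CASE_FILTERS[case]
    except KeyError:
        raise ValueError("case must be between 1 and 7, got {!r}".format(case))
```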

# Creating the analysis table

## Full Pipeline

*   In this notebook, all the SQL code used for the siretisation is presented in atomic steps, to make the user stories (US) easier to write.
   * The first query joins the two tables, INPI and INSEE
   * The second part computes the Levenshtein edit distance on the address and the trade name (`enseigne`)
   * The third part computes the Jaccard distance on the address (at the letter level) and the trade name
   * The fourth part checks whether any word of the INPI address appears in the INSEE address
   * The fifth part computes the Jaccard distance on the address at the word level
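The second step relies on Athena's `levenshtein_distance`. To make concrete what that distance measures, here is a minimal pure-Python sketch of the classic dynamic-programming formulation. The function name `levenshtein` and the sample trade names are made up for the illustration and do not come from the notebook.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical trade names: the second one only adds a 5-character suffix.
distance = levenshtein("BOULANGERIE PAUL", "BOULANGERIE PAUL SARL")
print(distance)
```

A distance of 0 means the two trade names are identical; larger values mean more edits are needed to go from one to the other.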

In [12]:
query = r"""
CREATE TABLE inpi.ets_insee_inpi WITH (format = 'PARQUET') AS WITH insee_inpi AS (
  SELECT 
    index_id, 
    sequence_id, 
    count_initial_insee, 
    ets_final_sql.siren, 
    siret, 
    code_greffe, 
    nom_greffe, 
    numero_gestion, 
    id_etablissement, 
    status, 
    origin, 
    date_greffe, 
    file_timestamp, 
    datecreationetablissement, 
    "date_début_activité", 
    libelle_evt, 
    last_libele_evt, 
    etatadministratifetablissement, 
    status_admin, 
    type, 
    etablissementsiege, 
    status_ets, 
    adresse_reconstituee_inpi, 
    adresse_reconstituee_insee, 
    adresse_regex_inpi, 
    adresse_distance_inpi, 
    adresse_distance_insee, 
    list_numero_voie_matching_inpi, 
    list_numero_voie_matching_insee, 
    numerovoieetablissement, 
    numero_voie_matching, 
    typevoieetablissement, 
    type_voie_matching, 
    ets_final_sql.code_postal_matching, 
    ets_final_sql.ville_matching, 
    codecommuneetablissement, 
    code_commune, 
    enseigne, 
    enseigne1etablissement, 
    enseigne2etablissement, 
    enseigne3etablissement 
  FROM 
    ets_final_sql 
    INNER JOIN (
      SELECT 
        count_initial_insee, 
        siren, 
        siret, 
        datecreationetablissement, 
        etablissementsiege, 
        etatadministratifetablissement, 
        codepostaletablissement, 
        codecommuneetablissement, 
        ville_matching, 
        list_numero_voie_matching_insee, 
        numerovoieetablissement, 
        typevoieetablissement, 
        adresse_reconstituee_insee, 
        adresse_distance_insee, 
        enseigne1etablissement, 
        enseigne2etablissement, 
        enseigne3etablissement 
      FROM 
        insee_final_sql
    ) as insee ON ets_final_sql.siren = insee.siren 
    AND ets_final_sql.ville_matching = insee.ville_matching 
    AND ets_final_sql.code_postal_matching = insee.codepostaletablissement 
  WHERE 
    status != 'IGNORE'
) 
SELECT 
  index_id, 
  sequence_id, 
  count_initial_insee, 
  siren, 
  siret, 
  code_greffe, 
  nom_greffe, 
  numero_gestion, 
  id_etablissement, 
  status, 
  origin, 
  date_greffe, 
  file_timestamp, 
  datecreationetablissement, 
  "date_début_activité", 
  libelle_evt, 
  last_libele_evt, 
  etatadministratifetablissement, 
  status_admin, 
  type, 
  etablissementsiege, 
  status_ets, 
  adresse_reconstituee_inpi, 
  adresse_reconstituee_insee, 
  adresse_regex_inpi, 
  adresse_distance_inpi, 
  adresse_distance_insee, 
  (
    CAST(
      cardinality(
        array_distinct(
          split(adresse_distance_inpi, ' ')
        )
      ) AS DECIMAL(10, 2)
    ) / (
      CAST(
        cardinality(
          array_distinct(
            split(adresse_distance_insee, ' ')
          )
        ) AS DECIMAL(10, 2)
      )
    )
  ) / NULLIF(
    CAST(
      cardinality(
        array_distinct(
          array_except(
            split(adresse_distance_insee, ' '), 
            split(adresse_distance_inpi, ' ')
          )
        )
      ) AS DECIMAL(10, 2)
    ), 
    0
  )* (
    cardinality(
      array_distinct(
        split(adresse_distance_inpi, ' ')
      )
    )* cardinality(
      array_distinct(
        split(adresse_distance_insee, ' ')
      )
    )
  )/(
    NULLIF(
      CAST(
        cardinality(
          array_distinct(
            array_union(
              split(adresse_distance_inpi, ' '), 
              split(adresse_distance_insee, ' ')
            )
          )
        ) AS DECIMAL(10, 2)
      ), 
      0
    ) * NULLIF(
      CAST(
        cardinality(
          array_distinct(
            array_intersect(
              split(adresse_distance_inpi, ' '), 
              split(adresse_distance_insee, ' ')
            )
          )
        ) AS DECIMAL(10, 2)
      ), 
      0
    )
  ) as score_pairing, 
  CASE WHEN cardinality(
    array_distinct(
      split(adresse_distance_inpi, ' ')
    )
  ) = 0 THEN NULL ELSE array_distinct(
    split(adresse_distance_inpi, ' ')
  ) END as liste_distinct_inpi, 
  CASE WHEN cardinality(
    array_distinct(
      split(adresse_distance_insee, ' ')
    )
  ) = 0 THEN NULL ELSE array_distinct(
    split(adresse_distance_insee, ' ')
  ) END as liste_distinct_insee, 
  CASE WHEN cardinality(
    array_distinct(
      array_except(
        split(adresse_distance_insee, ' '), 
        split(adresse_distance_inpi, ' ')
      )
    )
  ) = 0 THEN NULL ELSE array_distinct(
    array_except(
      split(adresse_distance_insee, ' '), 
      split(adresse_distance_inpi, ' ')
    )
  ) END as insee_exclusion, 
  CASE WHEN cardinality(
    array_distinct(
      array_except(
        split(adresse_distance_inpi, ' '), 
        split(adresse_distance_insee, ' ')
      )
    )
  ) = 0 THEN NULL ELSE array_distinct(
    array_except(
      split(adresse_distance_inpi, ' '), 
      split(adresse_distance_insee, ' ')
    )
  ) END as inpi_exclusion, 
  regexp_like(
    adresse_reconstituee_insee, adresse_regex_inpi
  ) as regex_adresse, 
  list_numero_voie_matching_inpi, 
  list_numero_voie_matching_insee, 
  numerovoieetablissement, 
  numero_voie_matching, 
  typevoieetablissement, 
  type_voie_matching, 
  code_postal_matching, 
  ville_matching, 
  codecommuneetablissement, 
  code_commune, 
  enseigne, 
  enseigne1etablissement, 
  enseigne2etablissement, 
  enseigne3etablissement, 
  levenshtein_distance(
    enseigne, enseigne1etablissement
  ) as edit_enseigne1, 
  levenshtein_distance(
    enseigne, enseigne2etablissement
  ) as edit_enseigne2, 
  levenshtein_distance(
    enseigne, enseigne3etablissement
  ) as edit_enseigne3, 
  1 - CAST(
    cardinality(
      array_intersect(
        regexp_extract_all(enseigne, '(\d+)|([A-Z])'), 
        regexp_extract_all(
          enseigne1etablissement, '(\d+)|([A-Z])'
        )
      )
    ) AS DECIMAL(10, 2)
  ) / NULLIF(
    CAST(
      cardinality(
        array_union(
          regexp_extract_all(enseigne, '(\d+)|([A-Z])'), 
          regexp_extract_all(
            enseigne1etablissement, '(\d+)|([A-Z])'
          )
        )
      ) AS DECIMAL(10, 2)
    ), 
    0
  ) as jaccard_enseigne1_lettre, 
  1 - CAST(
    cardinality(
      array_intersect(
        regexp_extract_all(enseigne, '(\d+)|([A-Z])'), 
        regexp_extract_all(
          enseigne2etablissement, '(\d+)|([A-Z])'
        )
      )
    ) AS DECIMAL(10, 2)
  ) / NULLIF(
    CAST(
      cardinality(
        array_union(
          regexp_extract_all(enseigne, '(\d+)|([A-Z])'), 
          regexp_extract_all(
            enseigne2etablissement, '(\d+)|([A-Z])'
          )
        )
      ) AS DECIMAL(10, 2)
    ), 
    0
  ) as jaccard_enseigne2_lettre, 
  1 - CAST(
    cardinality(
      array_intersect(
        regexp_extract_all(enseigne, '(\d+)|([A-Z])'), 
        regexp_extract_all(
          enseigne3etablissement, '(\d+)|([A-Z])'
        )
      )
    ) AS DECIMAL(10, 2)
  ) / NULLIF(
    CAST(
      cardinality(
        array_union(
          regexp_extract_all(enseigne, '(\d+)|([A-Z])'), 
          regexp_extract_all(
            enseigne3etablissement, '(\d+)|([A-Z])'
          )
        )
      ) AS DECIMAL(10, 2)
    ), 
    0
  ) as jaccard_enseigne3_lettre 
FROM 
  insee_inpi

"""
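The `jaccard_enseigne*_lettre` columns at the end of the query compute a Jaccard distance on the tokens extracted by the pattern `(\d+)|([A-Z])`, that is, runs of digits and single uppercase letters. The sketch below mirrors that computation in Python under two assumptions: tokens are treated as sets (which matches the distinct output of `array_intersect` and `array_union`), and an empty union yields `None`, as `NULLIF(..., 0)` does in the query. The function names and sample strings are hypothetical.

```python
import re

# Same pattern as in the query: a run of digits or one uppercase letter.
TOKEN_PATTERN = re.compile(r"(\d+)|([A-Z])")


def letter_tokens(text):
    """Set of digit runs and uppercase letters found in text."""
    return {match.group(0) for match in TOKEN_PATTERN.finditer(text)}


def jaccard_distance_letters(left, right):
    """1 - |intersection| / |union| on letter tokens, None if undefined."""
    if left is None or right is None:
        return None
    tokens_left = letter_tokens(left)
    tokens_right = letter_tokens(right)
    union = tokens_left | tokens_right
    if not union:
        return None  # mirrors NULLIF(cardinality(union), 0)
    return 1 - len(tokens_left & tokens_right) / len(union)


print(jaccard_distance_letters("PAUL", "PAUL"))
print(jaccard_distance_letters("AB12", "AB"))
```

Because the comparison works on sets of letters, two trade names that are anagrams of each other get a distance of 0. That is one reason to read this metric alongside the Levenshtein columns rather than on its own.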

# Evaluating the number of cases

## Similarity between two addresses

Joining the two tables, INSEE and INPI, produces two address vectors: one holding the words of the INSEE address, and one holding the words of the INPI address. Our objective is to compare these two vectors to determine whether they are identical. We distinguish 7 possible cases between the two vectors (figure 1).

![](https://drive.google.com/uc?export=view&id=1Qj_HooHrhFYSuTsoqFbl4Vxy9tN3V5Bu)
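The seven cases can be read as a decision rule on the sizes of the intersection, the union and the two set differences. The sketch below follows the SQL filters used in `generate_analytical_table_dup`, with two assumptions of its own: the word lists are treated as sets, and the cases are tested in order from 1 to 7, so an edge case that satisfies several filters (for example an empty INPI list) is assigned to the first one that matches. The function name `classify_case` is hypothetical.

```python
def classify_case(list_inpi, list_insee):
    """Return the case number (1 to 7) for a pair of address word lists."""
    inpi, insee = set(list_inpi), set(list_insee)
    intersection = len(inpi & insee)
    union = len(inpi | insee)
    inpi_except = len(inpi - insee)    # words only in the INPI address
    insee_except = len(insee - inpi)   # words only in the INSEE address

    if intersection == union:
        return 1  # perfect similarity
    if intersection == 0:
        return 2  # no word in common
    if inpi_except == 0:
        return 3  # every INPI word is in the INSEE address
    if insee_except == 0:
        return 4  # every INSEE word is in the INPI address
    if insee_except == inpi_except:
        return 5  # partial overlap, same number of extra words
    if insee_except > inpi_except:
        return 6  # partial overlap, more extra words on the INSEE side
    return 7      # partial overlap, more extra words on the INPI side


print(classify_case(["ALLEE", "BERLIOZ"],
                    ["ALLEE", "BERLIOZ", "CHEZ", "MME", "IDALI"]))
```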

## Definition

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Intersection_of_sets_A_and_B.svg/400px-Intersection_of_sets_A_and_B.svg.png)



The `ets_insee_inpi` table contains 11,600,551 observations.

In [13]:
initial_obs = 11600551

## Summary table

|   Cas de figure | Titre                   |   Total |   Total cumulé |   pourcentage |   Pourcentage cumulé | Comment                 |
|----------------:|:------------------------|--------:|---------------:|--------------:|---------------------:|:------------------------|
|               1 | similarité parfaite     | 7775392 |        7775392 |     0.670261  |             0.670261 | Match parfait           |
|               2 | Exclusion parfaite      |  974444 |        8749836 |     0.0839998 |             0.75426  | Exclusion parfaite      |
|               3 | Match partiel parfait   |  407404 |        9157240 |     0.0351194 |             0.78938  | Match partiel parfait   |
|               4 | Match partiel parfait   |  558992 |        9716232 |     0.0481867 |             0.837566 | Match partiel parfait   |
|               5 | Match partiel compliqué | 1056406 |       10772638 |     0.0910652 |             0.928632 | Match partiel compliqué |
|               6 | Match partiel compliqué |  361242 |       11133880 |     0.0311401 |             0.959772 | Match partiel compliqué |
|               7 | Match partiel compliqué |  466671 |       11600551 |     0.0402283 |             1        | Match partiel compliqué |
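The cumulative columns of this table can be recomputed from the per-case totals alone, which gives a quick check that the seven cases partition the 11,600,551 observations. The sketch below uses the totals printed in the table; the variable names are hypothetical, and the notebook itself fills `dic_` case by case instead.

```python
from itertools import accumulate

initial_obs = 11_600_551
# Per-case totals, copied from the summary table (cases 1 to 7).
totals = [7_775_392, 974_444, 407_404, 558_992, 1_056_406, 361_242, 466_671]

cumulative = list(accumulate(totals))
rows = [
    (case, total, running, total / initial_obs, running / initial_obs)
    for case, (total, running) in enumerate(zip(totals, cumulative), start=1)
]

for case, total, running, share, running_share in rows:
    print("{} {:>10,} {:>11,} {:.6f} {:.6f}".format(
        case, total, running, share, running_share))

# The last cumulative total should equal the size of ets_insee_inpi.
assert cumulative[-1] == initial_obs
```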

In [14]:
dic_ = {
    'Cas de figure': [], 
    'Titre': [], 
    'Total': [], 
    'Total cumulé': [], 
    'pourcentage': [], 
    'Pourcentage cumulé': [], 
    'Comment': [], 
}

## Case 1: perfect similarity

* Definition: the words in the INPI address are identical to the words in the INSEE address
* Math definition: $\frac{|INSEE \cap INPI|}{|INSEE|+|INPI|-|INSEE \cap INPI|} =1$
* Rule: $\text{intersection} = \text{union} \rightarrow \text{case 1}$
* Query [case 1](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/24e58c22-4a67-4a9e-b98d-4eb9d65e7f27)

| list_inpi              | list_insee             | insee_except | intersection | union_ |
|------------------------|------------------------|--------------|--------------|--------|
| [BOULEVARD, HAUSSMANN] | [BOULEVARD, HAUSSMANN] | []           | 2            | 2      |
| [QUAI, GABUT]          | [QUAI, GABUT]          | []           | 2            | 2      |
| [BOULEVARD, VOLTAIRE]  | [BOULEVARD, VOLTAIRE]  | []           | 2            | 2      |

- Number of observations: 7,775,392
    - Share of the initial table: 0.67

In [15]:
cas_1 =  compte_obs_cas(case= 1)

Execution ID: c88a483a-2b8d-4d3e-a29b-4131cc7847ad


In [16]:
dic_['Cas de figure'].append(1)
dic_['Titre'].append('similarité parfaite')
dic_['Total'].append(cas_1)
dic_['Total cumulé'].append(cas_1)
dic_['pourcentage'].append(cas_1/initial_obs)
dic_['Pourcentage cumulé'].append(cas_1/initial_obs)
dic_['Comment'].append("Match parfait")

In [17]:
generate_analytical_table(case = 1)

Execution ID: 6b7731a4-b823-4993-83dd-9f2305a15f17


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_1,Null,1084365,13.95%,782532,10.06%,115742,1.49%,2452952,31.55%,0,0.00%,0,0.00%,0,0.00%
1,cas_1,False,190078,2.44%,17979,0.23%,3162,0.04%,1506160,19.37%,1381738,17.77%,3164336,40.70%,0,0.00%
2,cas_1,True,6500949,83.61%,6974881,89.70%,7656488,98.47%,3816280,49.08%,6393654,82.23%,4611056,59.30%,7775392,100.00%


Index analysis

In [18]:
compte_dup_cas(var = 'index_id', case = 1)

Execution ID: 4ce813bb-76d5-4265-96dd-f184b83e9b88


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,7443607,98.14%
1,2,128821,1.70%
2,3,8152,0.11%
3,4,1626,0.02%
4,5,576,0.01%
5,6,292,0.00%
6,7,108,0.00%
7,8,110,0.00%
8,9,66,0.00%
9,10,95,0.00%


In [19]:
generate_analytical_table_dup(var = 'index_id', case = 1)

Execution ID: fe350094-09f1-48b6-b356-fa67747674e6


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_1,False,1,166053,2.16%,17610,0.23%,3088,0.04%,1431756,18.73%,1339506,17.49%,3132922,40.78%,0,0.00%
1,cas_1,Null,1,1025663,13.35%,751806,9.91%,110188,1.45%,2372676,31.03%,0,0.00%,0,0.00%,0,0.00%
2,cas_1,True,1,6443723,83.86%,6676760,88.03%,7330331,96.66%,3757064,49.14%,6246835,81.56%,4500976,58.59%,7443607,98.15%
3,cas_1,False,2,7945,0.10%,170,0.00%,37,0.00%,30346,0.40%,18041,0.24%,14738,0.19%,0,0.00%
4,cas_1,Null,2,9410,0.12%,9350,0.12%,2347,0.03%,37126,0.49%,0,0.00%,0,0.00%,0,0.00%
5,cas_1,True,2,27237,0.35%,118114,1.56%,126437,1.67%,9809,0.13%,48208,0.63%,27940,0.36%,128821,1.70%
6,cas_1,False,3,1257,0.02%,3,0.00%,0,0.00%,2547,0.03%,1123,0.01%,360,0.00%,0,0.00%
7,cas_1,Null,3,739,0.01%,688,0.01%,193,0.00%,1383,0.02%,0,0.00%,0,0.00%,0,0.00%
8,cas_1,True,3,440,0.01%,7418,0.10%,7959,0.10%,751,0.01%,2975,0.04%,3205,0.04%,8152,0.11%
9,cas_1,False,4,414,0.01%,5,0.00%,0,0.00%,569,0.01%,224,0.00%,71,0.00%,0,0.00%


Sequence analysis

In [20]:
compte_dup_cas(var = 'sequence_id', case = 1)

Execution ID: a7957360-5c58-45a6-8180-246156e6b501


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,5954248,87.68%
1,2,757519,11.16%
2,3,50979,0.75%
3,4,22484,0.33%
4,5,1024,0.02%
5,6,2324,0.03%
6,7,88,0.00%
7,8,412,0.01%
8,9,158,0.00%
9,10,155,0.00%


In [21]:
generate_analytical_table_dup(var = 'sequence_id', case = 1)

Execution ID: be94fb46-5bbc-4699-941f-efd699d75cc5


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_1,False,1,134015,1.95%,15171,0.22%,2585,0.04%,1248131,18.00%,1157852,16.78%,2484293,35.57%,0,0.00%
1,cas_1,Null,1,901716,13.09%,655236,9.64%,85831,1.26%,1846484,26.63%,0,0.00%,0,0.00%,0,0.00%
2,cas_1,True,1,5080283,73.77%,5292335,77.89%,5868471,86.41%,3111186,44.86%,4985966,72.28%,3824466,54.76%,5954248,87.69%
3,cas_1,False,2,19541,0.28%,1205,0.02%,261,0.00%,110419,1.59%,95562,1.39%,286626,4.10%,0,0.00%
4,cas_1,Null,2,64680,0.94%,52506,0.77%,13274,0.20%,252554,3.64%,0,0.00%,0,0.00%,0,0.00%
5,cas_1,True,2,623015,9.05%,700482,10.31%,743044,10.94%,311612,4.49%,595051,8.63%,331402,4.74%,757519,11.16%
6,cas_1,False,3,2130,0.03%,76,0.00%,10,0.00%,5466,0.08%,6522,0.09%,24556,0.35%,0,0.00%
7,cas_1,Null,3,3505,0.05%,2597,0.04%,592,0.01%,20687,0.30%,0,0.00%,0,0.00%,0,0.00%
8,cas_1,True,3,41682,0.61%,47934,0.71%,50204,0.74%,11931,0.17%,38542,0.56%,16521,0.24%,50979,0.75%
9,cas_1,False,4,1394,0.02%,30,0.00%,5,0.00%,3448,0.05%,2303,0.03%,7379,0.11%,0,0.00%


## Case 2: perfect dissimilarity

* Definition: none of the words in the INPI address match a word in the INSEE address
* Math definition: $\frac{|INSEE \cap INPI|}{|INSEE|+|INPI|-|INSEE \cap INPI|} = 0$
* Query [case 2](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/4363e8b4-b3c7-4964-804f-4e66b0780a17)
* Rule: $\text{intersection} = 0 \rightarrow \text{case 2}$

| list_inpi                               | list_insee                              | insee_except                            | intersection | union_ |
|-----------------------------------------|-----------------------------------------|-----------------------------------------|--------------|--------|
| [CHEMIN, MOUCHE]                        | [AVENUE, CHARLES, GAULLE, SAINT, GENIS] | [AVENUE, CHARLES, GAULLE, SAINT, GENIS] | 0            | 7      |
| [AVENUE, CHARLES, GAULLE, SAINT, GENIS] | [CHEMIN, MOUCHE]                        | [CHEMIN, MOUCHE]                        | 0            | 7      |

- Number of observations: 974,444
    - Share of the initial table: 0.08

In [22]:
cas_2 =compte_obs_cas(case= 2)

Execution ID: 1386763b-04fe-43df-bebc-3bf158564ce4


In [23]:
dic_['Cas de figure'].append(2)
dic_['Titre'].append('Exclusion parfaite')
dic_['Total'].append(cas_2)
dic_['Total cumulé'].append(cas_1 + cas_2)
dic_['pourcentage'].append(cas_2/initial_obs)
dic_['Pourcentage cumulé'].append((cas_1 +cas_2)/initial_obs)
dic_['Comment'].append("Exclusion parfaite")

In [24]:
generate_analytical_table(case = 2)

Execution ID: df47e7e7-fef1-441c-8aaf-0b314fd57ecb


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_2,True,23205,2.38%,0,0.00%,938971,96.36%,256532,26.33%,377981,38.79%,560483,57.52%,974444,100.00%
1,cas_2,False,705501,72.40%,815281,83.67%,1529,0.16%,491142,50.40%,596463,61.21%,413961,42.48%,0,0.00%
2,cas_2,Null,245738,25.22%,159163,16.33%,33944,3.48%,226770,23.27%,0,0.00%,0,0.00%,0,0.00%


Index analysis

In [25]:
compte_dup_cas(var = 'index_id', case = 2)

Execution ID: 28b65714-9c37-4681-9b95-b63cda502004


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,649128,85.71%
1,2,74413,9.83%
2,3,16586,2.19%
3,4,5976,0.79%
4,5,3002,0.40%
5,6,1866,0.25%
6,7,1310,0.17%
7,8,966,0.13%
8,9,803,0.11%
9,10,554,0.07%


In [26]:
generate_analytical_table_dup(var = 'index_id', case = 2)

Execution ID: 7e44b072-fda9-4305-8b7d-2fe0ca2723c5


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_2,False,1,471966,60.04%,539293,69.70%,1173,0.16%,303762,38.58%,447680,55.64%,356073,45.33%,0,0.00%
1,cas_2,Null,1,197741,25.15%,135749,17.54%,21165,2.80%,175269,22.26%,0,0.00%,0,0.00%,0,0.00%
2,cas_2,True,1,22342,2.84%,0,0.00%,626790,82.83%,220078,27.95%,266958,33.18%,342898,43.66%,649128,85.78%
3,cas_2,False,2,52709,6.70%,61782,7.98%,95,0.01%,38873,4.94%,42923,5.33%,20535,2.61%,0,0.00%
4,cas_2,Null,2,12860,1.64%,6556,0.85%,2359,0.31%,15049,1.91%,0,0.00%,0,0.00%,0,0.00%
5,cas_2,True,2,330,0.04%,0,0.00%,71959,9.51%,4860,0.62%,19423,2.41%,37218,4.74%,74413,9.83%
6,cas_2,False,3,11500,1.46%,13869,1.79%,16,0.00%,10025,1.27%,7760,0.96%,3266,0.42%,0,0.00%
7,cas_2,Null,3,2274,0.29%,1019,0.13%,646,0.09%,2678,0.34%,0,0.00%,0,0.00%,0,0.00%
8,cas_2,True,3,44,0.01%,0,0.00%,15924,2.10%,1561,0.20%,5668,0.70%,10196,1.30%,16586,2.19%
9,cas_2,False,4,4312,0.55%,5098,0.66%,6,0.00%,3757,0.48%,2722,0.34%,712,0.09%,0,0.00%


Sequence analysis

In [27]:
compte_dup_cas(var = 'sequence_id', case = 2)

Execution ID: 49b7f8fb-5e1a-4895-82cb-dcc237bb3966


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,520540,77.04%
1,2,111584,16.51%
2,3,18157,2.69%
3,4,11952,1.77%
4,5,2531,0.37%
5,6,3686,0.55%
6,7,1009,0.15%
7,8,1470,0.22%
8,9,732,0.11%
9,10,751,0.11%


In [28]:
generate_analytical_table_dup(var = 'sequence_id', case = 2)

Execution ID: 4880825b-df8c-436d-98b1-b1de0f2622f4


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_2,False,1,368582,52.50%,427006,61.78%,966,0.14%,235961,32.86%,370758,51.09%,303738,42.27%,0,0.00%
1,cas_2,Null,1,168377,23.98%,115860,16.76%,16953,2.51%,158496,22.07%,0,0.00%,0,0.00%,0,0.00%
2,cas_2,True,1,18776,2.67%,0,0.00%,503381,74.54%,190345,26.51%,213954,29.48%,286647,39.89%,520540,77.14%
3,cas_2,False,2,83689,11.92%,94503,13.67%,173,0.03%,59916,8.34%,68413,9.43%,40608,5.65%,0,0.00%
4,cas_2,Null,2,23152,3.30%,14126,2.04%,3717,0.55%,20212,2.81%,0,0.00%,0,0.00%,0,0.00%
5,cas_2,True,2,1835,0.26%,0,0.00%,107549,15.93%,17567,2.45%,36886,5.08%,53049,7.38%,111584,16.54%
6,cas_2,False,3,13224,1.88%,15416,2.23%,20,0.00%,9488,1.32%,9522,1.31%,5235,0.73%,0,0.00%
7,cas_2,Null,3,2789,0.40%,1444,0.21%,577,0.09%,3486,0.49%,0,0.00%,0,0.00%,0,0.00%
8,cas_2,True,3,156,0.02%,0,0.00%,17519,2.59%,1849,0.26%,5885,0.81%,9617,1.34%,18157,2.69%
9,cas_2,False,4,8758,1.25%,10163,1.47%,9,0.00%,7191,1.00%,5219,0.72%,1665,0.23%,0,0.00%


## Case 3: perfect INPI intersection

* Definition: all the words in the INPI address are contained in the INSEE address
* Math definition: $\frac{|INPI|}{|INSEE \cap INPI|} = 1 \text{ and } |INSEE \cap INPI| \neq |INSEE \cup INPI|$
* Query [case 3](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/7fb420a1-5f50-4256-a2ba-b8c7c2b63c9b)
* Rule: $|\text{list_inpi}| = \text{intersection} \text{ and } \text{intersection} \neq \text{union} \rightarrow \text{case 3}$

| list_inpi                    | list_insee                                               | insee_except                            | intersection | union_ |
|------------------------------|----------------------------------------------------------|-----------------------------------------|--------------|--------|
| [ALLEE, BERLIOZ]             | [ALLEE, BERLIOZ, CHEZ, MME, IDALI]                       | [CHEZ, MME, IDALI]                      | 2            | 5      |
| [RUE, MAI, OLONNE, SUR, MER] | [RUE, HUIT, MAI, OLONNE, SUR, MER]                       | [HUIT]                                  | 5            | 6      |
| [RUE, CAMILLE, CLAUDEL]      | [RUE, CAMILLE, CLAUDEL, VITRE]                           | [VITRE]                                 | 3            | 4      |
| [ROUTE, D, ESLETTES]         | [ROUTE, D, ESLETTES, A]                                  | [A]                                     | 3            | 4      |
| [AVENUE, MAI]                | [AVENUE, HUIT, MAI]                                      | [HUIT]                                  | 2            | 3      |
| [RUE, SOUS, DINE]            | [RUE, SOUS, DINE, RES, SOCIALE, HENRIETTE, D, ANGEVILLE] | [RES, SOCIALE, HENRIETTE, D, ANGEVILLE] | 3            | 8      |

- Number of observations: 407,404
    - Share of the initial table: 0.04

In [29]:
cas_3 = compte_obs_cas(case= 3)

Execution ID: bad295e8-0f77-4b2f-8884-474c44b0b971


In [30]:
dic_['Cas de figure'].append(3)
dic_['Titre'].append('Match partiel parfait')
dic_['Total'].append(cas_3)
dic_['Total cumulé'].append(cas_1 + cas_2 +cas_3)
dic_['pourcentage'].append(cas_3/initial_obs)
dic_['Pourcentage cumulé'].append((cas_1 + cas_2 +cas_3)/initial_obs)
dic_['Comment'].append("Match partiel parfait")

In [31]:
generate_analytical_table(case = 3)

Execution ID: 85a93ebb-4420-4fc2-8330-1d49d884a7a3


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_3,True,259463,63.69%,289312,71.01%,393070,96.48%,181981,44.67%,320406,78.65%,244121,59.92%,407404,100.00%
1,cas_3,Null,110931,27.23%,37878,9.30%,14029,3.44%,119869,29.42%,0,0.00%,0,0.00%,0,0.00%
2,cas_3,False,37010,9.08%,80214,19.69%,305,0.07%,105554,25.91%,86998,21.35%,163283,40.08%,0,0.00%


Index analysis

In [32]:
compte_dup_cas(var = 'index_id', case = 3)

Execution ID: ac0c97f7-f1d6-4d2b-86ab-5093ec7aaa53


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,387589,97.94%
1,2,6915,1.75%
2,3,721,0.18%
3,4,246,0.06%
4,5,72,0.02%
5,6,69,0.02%
6,7,9,0.00%
7,8,28,0.01%
8,9,13,0.00%
9,10,13,0.00%
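
The output above appears to be a multiplicity distribution: the first column is the number of rows sharing one `index_id` within the case, and the second is the number of distinct `index_id` values with that multiplicity. The body of `compte_dup_cas` is not shown in this section, so the following is only a minimal sketch of that kind of count, using the standard library and made-up identifiers; the name `duplicate_distribution` is hypothetical.

```python
from collections import Counter


def duplicate_distribution(ids):
    """Return {multiplicity: number of distinct ids with that multiplicity}."""
    rows_per_id = Counter(ids)                      # rows attached to each id
    ids_per_multiplicity = Counter(rows_per_id.values())
    return dict(sorted(ids_per_multiplicity.items()))


# Made-up identifiers, for illustration only
rows = [101, 102, 102, 103, 104, 104, 104, 105]
dist = duplicate_distribution(rows)
total_ids = sum(dist.values())                      # number of distinct ids

for multiplicity, n_ids in dist.items():
    # Share of distinct ids that appear exactly `multiplicity` times
    print(multiplicity, n_ids, f"{n_ids / total_ids:.2%}")
```

The percentage is taken over distinct identifiers rather than over rows, which seems consistent with the first line of the output above.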


In [33]:
generate_analytical_table_dup(var = 'index_id', case = 3)

Execution ID: e3db1c39-a3c2-465c-bd4c-03f17ae7e290


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_3,False,1,34098,8.54%,78038,19.69%,298,0.08%,97447,24.44%,83528,20.92%,161087,40.27%,0,0.00%
1,cas_3,Null,1,105140,26.33%,34949,8.82%,13468,3.40%,115922,29.07%,0,0.00%,0,0.00%,0,0.00%
2,cas_3,True,1,254858,63.83%,275559,69.54%,373823,94.47%,179688,45.07%,310540,77.77%,234578,58.65%,387589,97.95%
3,cas_3,False,2,1018,0.25%,884,0.22%,2,0.00%,2448,0.61%,1197,0.30%,803,0.20%,0,0.00%
4,cas_3,Null,2,1563,0.39%,659,0.17%,222,0.06%,1571,0.39%,0,0.00%,0,0.00%,0,0.00%
5,cas_3,True,2,1704,0.43%,5038,1.27%,6691,1.69%,686,0.17%,3120,0.78%,2621,0.66%,6915,1.75%
6,cas_3,False,3,140,0.04%,100,0.03%,1,0.00%,324,0.08%,134,0.03%,64,0.02%,0,0.00%
7,cas_3,Null,3,204,0.05%,94,0.02%,17,0.00%,122,0.03%,0,0.00%,0,0.00%,0,0.00%
8,cas_3,True,3,134,0.03%,476,0.12%,703,0.18%,74,0.02%,424,0.11%,402,0.10%,721,0.18%
9,cas_3,False,4,42,0.01%,18,0.00%,0,0.00%,160,0.04%,41,0.01%,34,0.01%,0,0.00%


Sequence analysis

In [34]:
compte_dup_cas(var = 'sequence_id', case = 3)

Execution ID: 18dad1f7-43a9-4675-a86f-d45f491e6bc7


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,325612,89.75%
1,2,33627,9.27%
2,3,2039,0.56%
3,4,1061,0.29%
4,5,76,0.02%
5,6,172,0.05%
6,7,11,0.00%
7,8,48,0.01%
8,9,11,0.00%
9,10,18,0.00%


In [35]:
generate_analytical_table_dup(var = 'sequence_id', case = 3)

Execution ID: 3799f34d-ccfc-419a-aa2c-3735d1a4b4c5


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_3,False,1,28900,7.89%,68144,18.75%,254,0.07%,84847,23.03%,73018,19.95%,136908,37.05%,0,0.00%
1,cas_3,Null,1,92497,25.26%,30028,8.26%,10777,2.97%,100931,27.40%,0,0.00%,0,0.00%,0,0.00%
2,cas_3,True,1,209973,57.34%,228459,62.87%,314788,86.75%,149273,40.52%,257872,70.46%,200982,54.38%,325612,89.76%
3,cas_3,False,2,3072,0.84%,5281,1.45%,21,0.01%,7769,2.11%,5706,1.56%,11359,3.07%,0,0.00%
4,cas_3,Null,2,6963,1.90%,2770,0.76%,1440,0.40%,8178,2.22%,0,0.00%,0,0.00%,0,0.00%
5,cas_3,True,2,21798,5.95%,25272,6.95%,32097,8.85%,14779,4.01%,26323,7.19%,17581,4.76%,33627,9.27%
6,cas_3,False,3,279,0.08%,308,0.08%,3,0.00%,505,0.14%,397,0.11%,767,0.21%,0,0.00%
7,cas_3,Null,3,499,0.14%,191,0.05%,62,0.02%,387,0.11%,0,0.00%,0,0.00%,0,0.00%
8,cas_3,True,3,1100,0.30%,1497,0.41%,1964,0.54%,622,0.17%,1560,0.43%,896,0.24%,2039,0.56%
9,cas_3,False,4,164,0.04%,96,0.03%,0,0.00%,382,0.10%,157,0.04%,228,0.06%,0,0.00%


## Case 4: Perfect INSEE intersection

* Definition: Every word of the INSEE address is contained in the INPI address, and the INPI address has at least one additional word
* Math definition: $\frac{|INSEE|}{|INSEE \cap INPI|} = 1 \text{ and } |INSEE \cap INPI| \neq |INSEE \cup INPI|$
* Query [case 4](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/65344bf4-8999-4ddb-a65e-11bb825f5f40)
* Rule: $|\text{list_insee}| = \text{intersection} \text{ and } \text{intersection} \neq \text{union} \rightarrow \text{case 4}$

| list_inpi                                                 | list_insee                                      | insee_except | intersection | union_ |
|-----------------------------------------------------------|-------------------------------------------------|--------------|--------------|--------|
| [ROUTE, D, ENGHIEN]                                       | [ROUTE, ENGHIEN]                                | []           | 2            | 3      |
| [ZAC, PARC, D, ACTIVITE, PARIS, EST, ALLEE, LECH, WALESA] | [ALLEE, LECH, WALESA, ZAC, PARC, ACTIVITE, EST] | []           | 7            | 9      |
| [LIEU, DIT, PADER, QUARTIER, RIBERE]                      | [LIEU, DIT, RIBERE]                             | []           | 3            | 5      |
| [A, BOULEVARD, CONSTANTIN, DESCAT]                        | [BOULEVARD, CONSTANTIN, DESCAT]                 | []           | 3            | 4      |
| [RUE, MENILMONTANT, BP]                                   | [RUE, MENILMONTANT]                             | []           | 2            | 3      |

- Number of observations: 558992
    - Share of the initial observations: 0.05
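
The rule for case 4 can be checked on the example rows with plain Python sets. This is a minimal sketch, not the Athena query linked above: the function name `is_case_4` is hypothetical, and it assumes each address has already been tokenised into a list of words whose duplicates do not matter.

```python
def is_case_4(list_inpi, list_insee):
    """True when every INSEE word is in INPI and INPI has extra words."""
    inpi, insee = set(list_inpi), set(list_insee)
    intersection = len(inpi & insee)
    union = len(inpi | insee)
    # |list_insee| = intersection and intersection != union
    return len(insee) == intersection and intersection != union


# Rows taken from the table above
print(is_case_4(["ROUTE", "D", "ENGHIEN"], ["ROUTE", "ENGHIEN"]))
print(is_case_4(["RUE", "MENILMONTANT", "BP"], ["RUE", "MENILMONTANT"]))
# Identical lists: intersection equals union, so this is not case 4
print(is_case_4(["RUE", "MENILMONTANT"], ["RUE", "MENILMONTANT"]))
```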

In [36]:
cas_4 = compte_obs_cas(case= 4)

Execution ID: c584a2ca-57de-4972-9f82-4ca5726cfcd4


In [37]:
dic_['Cas de figure'].append(4)
dic_['Titre'].append('Match partiel parfait')
dic_['Total'].append(cas_4)
dic_['Total cumulé'].append(cas_1 + cas_2 + cas_3 + cas_4)
dic_['pourcentage'].append(cas_4/initial_obs)
dic_['Pourcentage cumulé'].append((cas_1 + cas_2 + cas_3 + cas_4) / initial_obs)
dic_['Comment'].append("Match partiel parfait")

In [38]:
generate_analytical_table(case = 4)

Execution ID: 36a3b489-2c50-4395-ab86-1148d86b647f


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_4,False,36724,6.57%,11503,2.06%,2751,0.49%,125430,22.44%,123869,22.16%,210573,37.67%,0,0.00%
1,cas_4,Null,198374,35.49%,168672,30.17%,22870,4.09%,170830,30.56%,0,0.00%,0,0.00%,0,0.00%
2,cas_4,True,323894,57.94%,378817,67.77%,533371,95.42%,262732,47.00%,435123,77.84%,348419,62.33%,558992,100.00%


Index analysis

In [39]:
compte_dup_cas(var = 'index_id', case = 4)

Execution ID: 6fbda57b-3381-4feb-b9cb-1e8692cf23b0


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,525832,97.75%
1,2,10572,1.97%
2,3,830,0.15%
3,4,236,0.04%
4,5,61,0.01%
5,6,43,0.01%
6,7,39,0.01%
7,8,28,0.01%
8,9,15,0.00%
9,10,55,0.01%


In [40]:
generate_analytical_table_dup(var = 'index_id', case = 4)

Execution ID: 4ca2adf9-25fb-4d67-87c9-4bbfce97ea43


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_4,False,1,32392,5.95%,10960,2.03%,2652,0.49%,116001,21.36%,119180,21.91%,209260,38.34%,0,0.00%
1,cas_4,Null,1,185043,34.02%,155929,28.93%,21876,4.07%,166561,30.67%,0,0.00%,0,0.00%,0,0.00%
2,cas_4,True,1,320206,58.86%,361456,67.05%,501304,93.22%,253205,46.62%,418014,76.85%,331856,60.81%,525832,97.78%
3,cas_4,False,2,1416,0.26%,206,0.04%,43,0.01%,3278,0.60%,1733,0.32%,514,0.09%,0,0.00%
4,cas_4,Null,2,2421,0.45%,2159,0.40%,374,0.07%,1862,0.34%,0,0.00%,0,0.00%,0,0.00%
5,cas_4,True,2,1651,0.30%,7062,1.31%,10155,1.89%,1271,0.23%,4053,0.75%,3154,0.58%,10572,1.97%
6,cas_4,False,3,244,0.04%,18,0.00%,3,0.00%,342,0.06%,124,0.02%,48,0.01%,0,0.00%
7,cas_4,Null,3,164,0.03%,174,0.03%,32,0.01%,92,0.02%,0,0.00%,0,0.00%,0,0.00%
8,cas_4,True,3,52,0.01%,607,0.11%,795,0.15%,90,0.02%,409,0.08%,491,0.09%,830,0.15%
9,cas_4,False,4,58,0.01%,9,0.00%,1,0.00%,112,0.02%,42,0.01%,8,0.00%,0,0.00%


Sequence analysis

In [41]:
compte_dup_cas(var = 'sequence_id', case = 4)

Execution ID: 329766a9-ee1e-43a6-aff0-b43037f4bfb9


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,436201,88.87%
1,2,50531,10.29%
2,3,2232,0.45%
3,4,1237,0.25%
4,5,63,0.01%
5,6,214,0.04%
6,7,31,0.01%
7,8,59,0.01%
8,9,19,0.00%
9,10,56,0.01%


In [42]:
generate_analytical_table_dup(var = 'sequence_id', case = 4)

Execution ID: 7919c8ab-a73a-471d-adde-7967f9c28d88


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_4,False,1,27443,5.52%,9552,1.94%,2333,0.48%,98659,19.83%,102335,20.58%,176843,35.20%,0,0.00%
1,cas_4,Null,1,158717,31.91%,133549,27.13%,17507,3.57%,141741,28.49%,0,0.00%,0,0.00%,0,0.00%
2,cas_4,True,1,261823,52.63%,296087,60.14%,417002,84.93%,207098,41.62%,344936,69.38%,280594,55.84%,436201,88.90%
3,cas_4,False,2,3259,0.66%,819,0.17%,187,0.04%,10804,2.17%,9342,1.88%,15441,3.07%,0,0.00%
4,cas_4,Null,2,14357,2.89%,12318,2.50%,2384,0.49%,13535,2.72%,0,0.00%,0,0.00%,0,0.00%
5,cas_4,True,2,28760,5.78%,36248,7.36%,47689,9.71%,22973,4.62%,37427,7.53%,26846,5.34%,50531,10.30%
6,cas_4,False,3,322,0.06%,37,0.01%,12,0.00%,540,0.11%,428,0.09%,719,0.14%,0,0.00%
7,cas_4,Null,3,591,0.12%,512,0.10%,64,0.01%,371,0.07%,0,0.00%,0,0.00%,0,0.00%
8,cas_4,True,3,1007,0.20%,1596,0.32%,2137,0.44%,685,0.14%,1520,0.31%,1034,0.21%,2232,0.45%
9,cas_4,False,4,230,0.05%,32,0.01%,2,0.00%,410,0.08%,189,0.04%,119,0.02%,0,0.00%


## Case 5: Equal exception cardinality for INSEE and INPI, positive intersection

* Definition: The INPI address shares words with the INSEE address, and the number of words found only in the INPI address equals the number found only in the INSEE address
* Math definition: $|INPI|-|INPI \cap INSEE| = |INSEE|-|INPI \cap INSEE|$
* Query [case 5](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/fec67222-3a7b-4bfb-af20-dd70d82932e3)
* Rule: $|\text{insee_except}| = |\text{inpi_except}| \text{ and } \text{intersection} > 0 \rightarrow \text{case 5}$

| list_inpi                                                                                  | list_insee                                                                              | insee_except | inpi_except  | intersection | union_ |
|--------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|--------------|--------------|--------------|--------|
| [AVENUE, GEORGES, VACHER, C, A, SAINTE, VICTOIRE, IMMEUBLE, CCE, CD, ZI, ROUSSET, PEYNIER] | [AVENUE, GEORGES, VACHER, C, A, STE, VICTOIRE, IMMEUBLE, CCE, CD, ZI, ROUSSET, PEYNIER] | [STE]        | [SAINTE]     | 12           | 14     |
| [BIS, AVENUE, PAUL, DOUMER, RES, SAINT, MARTIN, BAT, D, C, O, M, ROSSI]                    | [BIS, AVENUE, PAUL, DOUMER, RES, ST, MARTIN, BAT, D, C, O, M, ROSSI]                    | [ST]         | [SAINT]      | 12           | 14     |
| [ROUTE, DEPARTEMENTALE, CHEZ, SOREME, CENTRE, COMMERCIAL, L, OCCITAN, PLAN, OCCIDENTAL]    | [ROUTE, DEPARTEMENTALE, CHEZ, SOREME, CENTRE, COMMERCIAL, L, OCCITAN, PLAN, OC]         | [OC]         | [OCCIDENTAL] | 9            | 11     |
| [LIEU, DIT, FOND, CHAMP, MALTON, PARC, EOLIEN, SUD, MARNE, PDL]                            | [LIEU, DIT, FONDD, CHAMP, MALTON, PARC, EOLIEN, SUD, MARNE, PDL]                        | [FONDD]      | [FOND]       | 9            | 11     |
| [AVENUE, ROBERT, BRUN, ZI, CAMP, LAURENT, LOT, NUMERO, ST, BERNARD]                        | [AVENUE, ROBERT, BRUN, ZI, CAMP, LAURENT, LOT, ST, BERNARD, N]                          | [N]          | [NUMERO]     | 9            | 11     |
| [PLACE, MARCEL, DASSAULT, PARC, D, ACTIVITES, TY, NEHUE, BATIMENT, H]                      | [PLACE, MARCEL, DASSAULT, PARC, D, ACTIVITES, TY, NEHUE, BAT, H]                        | [BAT]        | [BATIMENT]   | 9            | 11     |

- Number of observations: 1056406
    - Share of the initial observations: 0.09
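
The columns `insee_except`, `inpi_except`, `intersection` and `union_` of the table above can be reproduced with set operations. The sketch below is an illustration under the assumption that words are compared as exact strings and that duplicates within an address are ignored; the helper name `address_sets` is hypothetical and is not part of the notebook.

```python
def address_sets(list_inpi, list_insee):
    """Compute the comparison columns for one pair of tokenised addresses."""
    inpi, insee = set(list_inpi), set(list_insee)
    return {
        "insee_except": sorted(insee - inpi),   # words only in INSEE
        "inpi_except": sorted(inpi - insee),    # words only in INPI
        "intersection": len(inpi & insee),
        "union_": len(inpi | insee),
    }


# Last row of the table above
row = address_sets(
    ["PLACE", "MARCEL", "DASSAULT", "PARC", "D", "ACTIVITES", "TY", "NEHUE", "BATIMENT", "H"],
    ["PLACE", "MARCEL", "DASSAULT", "PARC", "D", "ACTIVITES", "TY", "NEHUE", "BAT", "H"],
)
print(row)

# Case 5 rule: same number of unmatched words on each side, at least one common word
is_case_5 = (
    len(row["insee_except"]) == len(row["inpi_except"])
    and row["intersection"] > 0
)
print(is_case_5)
```

Most rows in this case differ only by an abbreviation (`ST` / `SAINT`, `BAT` / `BATIMENT`), which is why the two exception lists have the same length.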

In [43]:
cas_5 = compte_obs_cas(case= 5)

Execution ID: 2581d3a5-8cd5-4e23-96b1-ee9d2d629a2e


In [44]:
dic_['Cas de figure'].append(5)
dic_['Titre'].append('Match partiel compliqué')
dic_['Total'].append(cas_5)
dic_['Total cumulé'].append(cas_1 + cas_2 + cas_3 + cas_4 + cas_5)
dic_['pourcentage'].append(cas_5/initial_obs)
dic_['Pourcentage cumulé'].append((cas_1 + cas_2 + cas_3 + cas_4 + cas_5)/initial_obs)
dic_['Comment'].append("Match partiel compliqué")

In [45]:
generate_analytical_table(case = 5)

Execution ID: e0075999-8989-4390-9a8c-1c22d1292ad1


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_5,True,631061,59.74%,712321,67.43%,1010098,95.62%,463997,43.92%,727118,68.83%,628399,59.48%,1056406,100.00%
1,cas_5,Null,188775,17.87%,109594,10.37%,45352,4.29%,296638,28.08%,0,0.00%,0,0.00%,0,0.00%
2,cas_5,False,236570,22.39%,234491,22.20%,956,0.09%,295771,28.00%,329288,31.17%,428007,40.52%,0,0.00%


Index analysis

In [46]:
compte_dup_cas(var = 'index_id', case = 5)

Execution ID: d9d54c11-2b4e-4169-b934-0c95dfb525d9


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,943789,95.76%
1,2,33429,3.39%
2,3,4467,0.45%
3,4,1474,0.15%
4,5,741,0.08%
5,6,470,0.05%
6,7,203,0.02%
7,8,134,0.01%
8,9,115,0.01%
9,10,124,0.01%


In [47]:
generate_analytical_table_dup(var = 'index_id', case = 5)

Execution ID: 394da01b-f39d-400f-862f-191ad92a6446


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_5,False,1,188602,18.81%,226028,22.84%,921,0.09%,251449,25.15%,299116,29.75%,416802,41.54%,0,0.00%
1,cas_5,Null,1,163656,16.32%,96870,9.79%,42385,4.30%,277862,27.79%,0,0.00%,0,0.00%,0,0.00%
2,cas_5,True,1,624127,62.24%,628397,63.50%,900483,91.40%,440840,44.09%,679414,67.56%,561655,55.97%,943789,95.79%
3,cas_5,False,2,13488,1.35%,3430,0.35%,16,0.00%,12528,1.25%,10419,1.04%,4527,0.45%,0,0.00%
4,cas_5,Null,2,3423,0.34%,2111,0.21%,1052,0.11%,7396,0.74%,0,0.00%,0,0.00%,0,0.00%
5,cas_5,True,2,3050,0.30%,25340,2.56%,32361,3.28%,3133,0.31%,10635,1.06%,13699,1.37%,33429,3.39%
6,cas_5,False,3,2521,0.25%,334,0.03%,1,0.00%,2249,0.22%,1383,0.14%,442,0.04%,0,0.00%
7,cas_5,Null,3,654,0.07%,411,0.04%,134,0.01%,626,0.06%,0,0.00%,0,0.00%,0,0.00%
8,cas_5,True,3,166,0.02%,3422,0.35%,4332,0.44%,641,0.06%,1950,0.19%,3026,0.30%,4467,0.45%
9,cas_5,False,4,932,0.09%,73,0.01%,0,0.00%,818,0.08%,393,0.04%,66,0.01%,0,0.00%


Sequence analysis

In [48]:
compte_dup_cas(var = 'sequence_id', case = 5)

Execution ID: 71505690-94c8-4cc4-b4aa-2e916fa949af


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,778442,87.03%
1,2,99188,11.09%
2,3,8242,0.92%
3,4,5222,0.58%
4,5,692,0.08%
5,6,1092,0.12%
6,7,177,0.02%
7,8,336,0.04%
8,9,137,0.02%
9,10,175,0.02%


In [49]:
generate_analytical_table_dup(var = 'sequence_id', case = 5)

Execution ID: 95bdf595-cba5-47c9-b674-24e0b5212193


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_5,False,1,145039,15.92%,200149,22.24%,786,0.09%,212025,23.07%,254984,27.85%,348015,37.73%,0,0.00%
1,cas_5,Null,1,143012,15.70%,82176,9.13%,36670,4.10%,235966,25.67%,0,0.00%,0,0.00%,0,0.00%
2,cas_5,True,1,518142,56.88%,505520,56.18%,741522,82.91%,370956,40.36%,556734,60.80%,480134,52.06%,778442,87.07%
3,cas_5,False,2,28470,3.13%,14735,1.64%,75,0.01%,28253,3.07%,27956,3.05%,33730,3.66%,0,0.00%
4,cas_5,Null,2,12116,1.33%,8299,0.92%,3488,0.39%,24442,2.66%,0,0.00%,0,0.00%,0,0.00%
5,cas_5,True,2,50331,5.53%,73282,8.14%,95441,10.67%,35470,3.86%,62808,6.86%,47481,5.15%,99188,11.09%
6,cas_5,False,3,3419,0.38%,819,0.09%,4,0.00%,2584,0.28%,2648,0.29%,2838,0.31%,0,0.00%
7,cas_5,Null,3,1024,0.11%,688,0.08%,214,0.02%,2028,0.22%,0,0.00%,0,0.00%,0,0.00%
8,cas_5,True,3,2857,0.31%,6412,0.71%,8005,0.90%,1701,0.19%,4644,0.51%,4015,0.44%,8242,0.92%
9,cas_5,False,4,2582,0.28%,403,0.04%,2,0.00%,1870,0.20%,1201,0.13%,708,0.08%,0,0.00%


## Case 6: INSEE exception cardinality greater than INPI, positive intersection

* Definition: The INPI address shares words with the INSEE address, and the number of words found only in the INSEE address is greater than the number found only in the INPI address
* Math definition: $|INPI|-|INPI \cap INSEE| < |INSEE|-|INPI \cap INSEE|$
* Query [case 6](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/9bdce567-5871-4a5a-add4-d5cca6a83528)
* Rule: $|\text{insee_except}| > |\text{inpi_except}| \text{ and } \text{intersection} > 0 \rightarrow \text{case 6}$

| list_inpi                                                                         | list_insee                                                                               | insee_except          | inpi_except   | intersection | union_ |
|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------|---------------|--------------|--------|
| [AVENUE, AUGUSTE, PICARD, POP, UP, TOURVILLE, CC, EMPLACEMENT, DIT, PRECAIRE, N]  | [AVENUE, AUGUSTE, PICARD, POP, UP, TOURVILL, CC, TOURVILLE, EMPLACEMT, DIT, PRECAIRE, N] | [TOURVILL, EMPLACEMT] | [EMPLACEMENT] | 10           | 13     |
| [ROUTE, COTE, D, AZUR, C, O, TENERGIE, ARTEPARC, MEYREUIL, BAT, A]                | [ROUTE, C, O, TENERGIE, ARTEPARC, MEYREUI, BAT, A, RTE, COTE, D, AZUR]                   | [MEYREUI, RTE]        | [MEYREUIL]    | 10           | 13     |
| [C, O, TENERGIE, ARTEPARC, MEYREUIL, BATIMENT, A, ROUTE, COTE, D, AZUR]           | [ROUTE, C, O, TENERGIE, ARTEPARC, MEYREUI, BATIMENT, A, RTE, COTE, D, AZUR]              | [MEYREUI, RTE]        | [MEYREUIL]    | 10           | 13     |
| [LOTISSEMENT, VANGA, DI, L, ORU, VILLA, FRANCK, TINA, CHEZ, COLOMBANI, CHRISTIAN] | [LIEU, DIT, VANGA, DI, L, ORU, VILLA, FRANCK, TINA, CHEZ, COLOMBANI, CHRISTIAN]          | [LIEU, DIT]           | [LOTISSEMENT] | 10           | 13     |
| [AVENUE, DECLARATION, DROITS, HOMME, RES, CLOS, ST, MAMET, BAT, C, APPT]          | [AVENUE, DECL, DROITS, L, HOMME, RES, CLOS, ST, MAMET, BAT, C, APPT]                     | [DECL, L]             | [DECLARATION] | 10           | 13     |

- Number of observations: 361242
    - Share of the initial observations: 0.03

In [50]:
cas_6 = compte_obs_cas(case= 6)

Execution ID: b400044a-d860-4b2a-a7c2-18e65dcce100


In [51]:
dic_['Cas de figure'].append(6)
dic_['Titre'].append('Match partiel compliqué')
dic_['Total'].append(cas_6)
dic_['Total cumulé'].append(cas_1 + cas_2 +cas_3 +cas_4 + cas_5 +cas_6)
dic_['pourcentage'].append(cas_6/initial_obs)
dic_['Pourcentage cumulé'].append((cas_1 + cas_2 +cas_3 +cas_4 + cas_5 +cas_6)/initial_obs)
dic_['Comment'].append("Match partiel compliqué")

In [52]:
generate_analytical_table(case = 6)

Execution ID: 81aaa3ad-b713-4475-bcbd-9c500d9eccd7


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_6,True,118441,25.38%,287344,61.57%,444419,95.23%,163465,35.03%,260679,55.86%,289877,62.12%,466671,100.00%
1,cas_6,False,199235,42.69%,92162,19.75%,1177,0.25%,179577,38.48%,205992,44.14%,176794,37.88%,0,0.00%
2,cas_6,Null,148995,31.93%,87165,18.68%,21075,4.52%,123629,26.49%,0,0.00%,0,0.00%,0,0.00%


Index analysis

In [53]:
compte_dup_cas(var = 'index_id', case = 6)

Execution ID: 8750ec6b-c8f1-416c-97c5-a1bf554d797f


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,367508,91.51%
1,2,25116,6.25%
2,3,4395,1.09%
3,4,1632,0.41%
4,5,824,0.21%
5,6,498,0.12%
6,7,377,0.09%
7,8,217,0.05%
8,9,155,0.04%
9,10,130,0.03%


In [54]:
generate_analytical_table_dup(var = 'index_id', case = 6)

Execution ID: 2eb31799-dd77-4240-b7ce-f7aa908b04e0


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_6,False,1,149178,36.09%,82285,20.17%,1030,0.26%,128325,31.07%,170656,40.80%,165770,40.06%,0,0.00%
1,cas_6,Null,1,123075,29.78%,71505,17.53%,17828,4.44%,108257,26.21%,0,0.00%,0,0.00%,0,0.00%
2,cas_6,True,1,115627,27.98%,223805,54.87%,348650,86.87%,150590,36.46%,223096,53.34%,224637,54.29%,367508,91.57%
3,cas_6,False,2,13159,3.18%,2618,0.64%,46,0.01%,10921,2.64%,10273,2.46%,4272,1.03%,0,0.00%
4,cas_6,Null,2,4059,0.98%,2131,0.52%,796,0.20%,5211,1.26%,0,0.00%,0,0.00%,0,0.00%
5,cas_6,True,2,1025,0.25%,17492,4.29%,24274,6.05%,2282,0.55%,7270,1.74%,11670,2.82%,25116,6.26%
6,cas_6,False,3,2620,0.63%,454,0.11%,6,0.00%,2374,0.57%,1619,0.39%,519,0.13%,0,0.00%
7,cas_6,Null,3,663,0.16%,380,0.09%,162,0.04%,699,0.17%,0,0.00%,0,0.00%,0,0.00%
8,cas_6,True,3,86,0.02%,3262,0.80%,4227,1.05%,502,0.12%,1704,0.41%,2889,0.70%,4395,1.10%
9,cas_6,False,4,1000,0.24%,180,0.04%,5,0.00%,951,0.23%,531,0.13%,111,0.03%,0,0.00%


Sequence analysis

In [55]:
compte_dup_cas(var = 'sequence_id', case = 6)

Execution ID: 84318ae4-c35b-4557-accb-45bd273f8f2e


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,311177,84.49%
1,2,44602,12.11%
2,3,5180,1.41%
3,4,3704,1.01%
4,5,672,0.18%
5,6,989,0.27%
6,7,306,0.08%
7,8,411,0.11%
8,9,146,0.04%
9,10,214,0.06%


In [56]:
generate_analytical_table_dup(var = 'sequence_id', case = 6)

Execution ID: 1ccac78c-752f-46ea-be0a-8a3aed056edb


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_6,False,1,121568,32.06%,73152,19.52%,916,0.25%,105676,27.63%,144807,37.79%,143792,37.55%,0,0.00%
1,cas_6,Null,1,107011,28.22%,62179,16.59%,15167,4.12%,98430,25.73%,0,0.00%,0,0.00%,0,0.00%
2,cas_6,True,1,99833,26.33%,185556,49.51%,295448,80.24%,128726,33.66%,187045,48.81%,192486,50.27%,311177,84.56%
3,cas_6,False,2,21946,5.79%,6292,1.68%,87,0.02%,18935,4.95%,19880,5.19%,13450,3.51%,0,0.00%
4,cas_6,Null,2,10485,2.76%,5896,1.57%,1881,0.51%,8876,2.32%,0,0.00%,0,0.00%,0,0.00%
5,cas_6,True,2,8073,2.13%,30326,8.09%,42540,11.55%,11997,3.14%,21460,5.60%,23265,6.08%,44602,12.12%
6,cas_6,False,3,3012,0.79%,520,0.14%,6,0.00%,2422,0.63%,2299,0.60%,1274,0.33%,0,0.00%
7,cas_6,Null,3,929,0.24%,528,0.14%,154,0.04%,995,0.26%,0,0.00%,0,0.00%,0,0.00%
8,cas_6,True,3,405,0.11%,3811,1.02%,5003,1.36%,792,0.21%,2140,0.56%,2997,0.78%,5180,1.41%
9,cas_6,False,4,2190,0.58%,411,0.11%,10,0.00%,1861,0.49%,1162,0.30%,368,0.10%,0,0.00%


## Case 7: INPI exception cardinality greater than INSEE, positive intersection

* Definition: The INSEE address shares words with the INPI address, and the number of words found only in the INPI address is greater than the number found only in the INSEE address
* Math definition: $|INPI|-|INPI \cap INSEE| > |INSEE|-|INPI \cap INSEE|$
* Rule: $|\text{insee_except}| < |\text{inpi_except}| \text{ and } \text{intersection} > 0 \rightarrow \text{case 7}$

| list_inpi                                                                                    | list_insee                                                                   | insee_except | inpi_except                 | intersection | union_ |
|----------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|--------------|-----------------------------|--------------|--------|
| [RTE, CABRIERES, D, AIGUES, CHEZ, MR, DOL, JEAN, CLAUDE, LIEUDIT, PLAN, PLUS, LOIN]          | [ROUTE, CABRIERES, D, AIGUES, CHEZ, MR, DOL, JEAN, CLAUDE, PLAN, PLUS, LOIN] | [ROUTE]      | [RTE, LIEUDIT]              | 11           | 14     |
| [ROUTE, N, ZAC, PONT, RAYONS, CC, GRAND, VAL, ILOT, B, BAT, A, LOCAL]                        | [ZONE, ZAC, PONT, RAYONS, CC, GRAND, VAL, ILOT, B, BAT, A, LOCAL]            | [ZONE]       | [ROUTE, N]                  | 11           | 14     |
| [BOULEVARD, PAUL, VALERY, BAT, B, ESC, H, APPT, C, O, MADAME, BLANDINE, BOVE]                | [BOULEVARD, PAUL, VALERY, BAT, B, ESC, H, APT, C, O, BOVE, BLANDINE]         | [APT]        | [APPT, MADAME]              | 11           | 14     |
| [RUE, JEANNE, D, ARC, A, L, ANGLE, N, ROLLON, EME, ETAGE, POLE, PRO, AGRI]                   | [RUE, JEANNE, D, ARC, A, L, ANGLE, N, ROLLON, E, ETAGE]                      | [E]          | [EME, POLE, PRO, AGRI]      | 10           | 15     |
| [CHEZ, MR, MME, DANIEL, DEZEMPTE, AVENUE, BALCONS, FRONT, MER, L, OISEAU, BLEU, BATIMENT, B] | [AVENUE, BALCONS, FRONT, MER, CHEZ, MR, MME, DANIEL, DEZEMPTE, L, OISEA]     | [OISEA]      | [OISEAU, BLEU, BATIMENT, B] | 10           | 15     |

- Number of observations: 466671
    - Share of the initial observations: 0.04
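
Cases 5, 6 and 7 differ only in how the two exception cardinalities compare, so they can be told apart by a single function. The sketch below is hypothetical (`classify_partial_match` is not a notebook function) and assumes that pairs belonging to cases 1 to 4 have already been filtered out, since the rules as written would otherwise overlap with them.

```python
def classify_partial_match(list_inpi, list_insee):
    """Return 5, 6 or 7, or None when the addresses share no word."""
    inpi, insee = set(list_inpi), set(list_insee)
    if not inpi & insee:
        return None                      # intersection = 0: outside cases 5 to 7
    n_inpi_except = len(inpi - insee)    # words only in INPI
    n_insee_except = len(insee - inpi)   # words only in INSEE
    if n_insee_except == n_inpi_except:
        return 5
    return 6 if n_insee_except > n_inpi_except else 7


# Row from the case 6 table: INSEE has [LIEU, DIT], INPI has [LOTISSEMENT]
case_6_example = classify_partial_match(
    ["LOTISSEMENT", "VANGA", "DI", "L", "ORU", "VILLA", "FRANCK", "TINA",
     "CHEZ", "COLOMBANI", "CHRISTIAN"],
    ["LIEU", "DIT", "VANGA", "DI", "L", "ORU", "VILLA", "FRANCK", "TINA",
     "CHEZ", "COLOMBANI", "CHRISTIAN"],
)

# Row from the case 7 table: INSEE has [ZONE], INPI has [ROUTE, N]
case_7_example = classify_partial_match(
    ["ROUTE", "N", "ZAC", "PONT", "RAYONS", "CC", "GRAND", "VAL", "ILOT",
     "B", "BAT", "A", "LOCAL"],
    ["ZONE", "ZAC", "PONT", "RAYONS", "CC", "GRAND", "VAL", "ILOT", "B",
     "BAT", "A", "LOCAL"],
)
print(case_6_example, case_7_example)
```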

In [57]:
cas_7 = compte_obs_cas(case= 7)

Execution ID: afaa6b6c-107c-4179-b003-138ee3e0ad1d


In [58]:
dic_['Cas de figure'].append(7)
dic_['Titre'].append('Match partiel compliqué')
dic_['Total'].append(cas_7)
dic_['Total cumulé'].append(cas_1 + cas_2 + cas_3 + cas_4 + cas_5+ cas_6 + cas_7)
dic_['pourcentage'].append(cas_7/initial_obs)
dic_['Pourcentage cumulé'].append((cas_1 + cas_2 + cas_3 + cas_4 + cas_5+ cas_6 + cas_7)/initial_obs)
dic_['Comment'].append("Match partiel compliqué")

In [59]:
generate_analytical_table(case = 7)

Execution ID: 030b22ae-8b01-493a-9705-628fe6aa9ddb


Unnamed: 0,test,groups,count_test_num_voie,count_test_num_voie_pct,count_test_type_voie,count_test_type_voie_pct,count_test_commune,count_test_commune_pct,count_test_date,count_test_date_pct,count_test_status_admin,count_test_status_admin_pct,count_test_siege,count_test_siege_pct,count_test_cp,count_test_cp_pct
0,cas_7,True,73185,20.26%,230749,63.88%,346229,95.84%,124772,34.54%,200897,55.61%,223160,61.78%,361242,100.00%
1,cas_7,False,175349,48.54%,92198,25.52%,421,0.12%,151174,41.85%,160345,44.39%,138082,38.22%,0,0.00%
2,cas_7,Null,112708,31.20%,38295,10.60%,14592,4.04%,85296,23.61%,0,0.00%,0,0.00%,0,0.00%


Index analysis

In [60]:
compte_dup_cas(var = 'index_id', case = 7)

Execution ID: 67d9de4d-22b2-4ede-9703-0d96c95c71b5


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,296702,92.97%
1,2,16116,5.05%
2,3,3087,0.97%
3,4,1193,0.37%
4,5,615,0.19%
5,6,376,0.12%
6,7,264,0.08%
7,8,167,0.05%
8,9,84,0.03%
9,10,78,0.02%


In [61]:
generate_analytical_table_dup(var = 'index_id', case = 7)

Execution ID: 2e794c4a-b108-48e8-88ae-8b7ff0c4643c


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_7,False,1,138999,42.86%,84603,26.26%,380,0.12%,114508,35.24%,137574,41.85%,130668,40.30%,0,0.00%
1,cas_7,Null,1,94686,29.20%,30250,9.39%,12763,4.00%,76615,23.58%,0,0.00%,0,0.00%,0,0.00%
2,cas_7,True,1,71630,22.09%,186222,57.81%,283559,88.90%,115524,35.55%,173412,52.75%,175751,54.20%,296702,93.02%
3,cas_7,False,2,9767,3.01%,2214,0.69%,11,0.00%,8164,2.51%,6949,2.11%,2853,0.88%,0,0.00%
4,cas_7,Null,2,3233,1.00%,1197,0.37%,529,0.17%,2922,0.90%,0,0.00%,0,0.00%,0,0.00%
5,cas_7,True,2,578,0.18%,11779,3.66%,15576,4.88%,1694,0.52%,5838,1.78%,9376,2.89%,16116,5.05%
6,cas_7,False,3,2046,0.63%,410,0.13%,1,0.00%,1909,0.59%,1142,0.35%,322,0.10%,0,0.00%
7,cas_7,Null,3,599,0.18%,256,0.08%,94,0.03%,385,0.12%,0,0.00%,0,0.00%,0,0.00%
8,cas_7,True,3,62,0.02%,2337,0.73%,2992,0.94%,412,0.13%,1357,0.41%,2369,0.73%,3087,0.97%
9,cas_7,False,4,747,0.23%,156,0.05%,0,0.00%,757,0.23%,357,0.11%,73,0.02%,0,0.00%


Sequence analysis

In [62]:
compte_dup_cas(var = 'sequence_id', case = 7)

Execution ID: 054e5c05-c3b0-4289-b684-f47a9acc2f45


Unnamed: 0,count_index_id,count_duplicate_index_id,percentage
0,1,246372,84.91%
1,2,34615,11.93%
2,3,3917,1.35%
3,4,2719,0.94%
4,5,517,0.18%
5,6,743,0.26%
6,7,228,0.08%
7,8,292,0.10%
8,9,76,0.03%
9,10,131,0.05%


In [63]:
generate_analytical_table_dup(var = 'sequence_id', case = 7)

Execution ID: adb409c2-9ca2-4ae0-b1a2-e659589b7149


Unnamed: 0,test,groups,cnt_test,cnt_index_num_voie,count_test_num_voie_pct,cnt_index_type_voie,count_test_type_voie_pct,cnt_index_commune,count_test_commune_pct,cnt_index_date,count_test_date_pct,cnt_index_admin,count_test_status_admin_pct,cnt_index_siege,count_test_siege_pct,cnt_index_cp,count_test_cp_pct
0,cas_7,False,1,111354,37.76%,73276,24.99%,327,0.11%,93130,31.07%,114329,38.30%,112412,37.71%,0,0.00%
1,cas_7,Null,1,81243,27.55%,26004,8.87%,10657,3.67%,69729,23.26%,0,0.00%,0,0.00%,0,0.00%
2,cas_7,True,1,60843,20.63%,151304,51.60%,235636,81.22%,98541,32.88%,142809,47.83%,147889,49.61%,246372,84.97%
3,cas_7,False,2,19433,6.59%,6829,2.33%,34,0.01%,16189,5.40%,15799,5.29%,10514,3.53%,0,0.00%
4,cas_7,Null,2,8562,2.90%,2838,0.97%,1399,0.48%,5501,1.84%,0,0.00%,0,0.00%,0,0.00%
5,cas_7,True,2,5386,1.83%,24334,8.30%,33128,11.42%,9264,3.09%,17942,6.01%,19496,6.54%,34615,11.94%
6,cas_7,False,3,2515,0.85%,600,0.20%,2,0.00%,1937,0.65%,1715,0.57%,959,0.32%,0,0.00%
7,cas_7,Null,3,823,0.28%,301,0.10%,115,0.04%,641,0.21%,0,0.00%,0,0.00%,0,0.00%
8,cas_7,True,3,305,0.10%,2933,1.00%,3783,1.30%,647,0.22%,1830,0.61%,2467,0.83%,3917,1.35%
9,cas_7,False,4,1682,0.57%,392,0.13%,1,0.00%,1484,0.50%,861,0.29%,257,0.09%,0,0.00%


## Case summary

In [64]:
(
    pd.DataFrame(dic_)
    .style
    .format("{:,.0f}", subset=['Total', 'Total cumulé'])
    .format("{:.2%}", subset=['pourcentage', 'Pourcentage cumulé'])
    .bar(subset=['Total', 'Total cumulé'], color='#d65f5f')
)

Unnamed: 0,Cas de figure,Titre,Total,Total cumulé,pourcentage,Pourcentage cumulé,Comment
0,1,similarité parfaite,7775392,7775392,67.03%,67.03%,Match parfait
1,2,Exclusion parfaite,974444,8749836,8.40%,75.43%,Exclusion parfaite
2,3,Match partiel parfait,407404,9157240,3.51%,78.94%,Match partiel parfait
3,4,Match partiel parfait,558992,9716232,4.82%,83.76%,Match partiel parfait
4,5,Match partiel compliqué,1056406,10772638,9.11%,92.86%,Match partiel compliqué
5,6,Match partiel compliqué,361242,11133880,3.11%,95.98%,Match partiel compliqué
6,7,Match partiel compliqué,466671,11600551,4.02%,100.00%,Match partiel compliqué
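
The cumulative columns of the summary above can be rechecked from the per-case totals alone. This is a small sketch using only the standard library; it assumes that `initial_obs` equals the sum of the seven cases, which is what the 100.00% in the last row suggests but is not stated in this section.

```python
from itertools import accumulate

# Per-case totals copied from the summary output above
totals = {1: 7775392, 2: 974444, 3: 407404, 4: 558992,
          5: 1056406, 6: 361242, 7: 466671}

# Assumption: the denominator is the sum of the seven cases
initial_obs = sum(totals.values())

cumulative = list(accumulate(totals.values()))
for (case, total), cum in zip(totals.items(), cumulative):
    print(f"{case}  {total:>10,}  {cum:>11,}  "
          f"{total / initial_obs:.2%}  {cum / initial_obs:.2%}")
```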


In [66]:
#print(pd.DataFrame(dic_).set_index('Cas de figure').to_markdown())

## Match summary

The table below summarises the matches obtained when only one of the tests is applied to the unique indexes. For example, there are 7,443,607 unique rows for case 1. Of these 7,443,607 rows, 6,443,723 passed the street number test. Across all cases, the unique rows that passed the street number test represent about 70% of the INPI rows.

| Test                       | Case 1                | Case 2  | Case 3  | Case 4  | Case 5    | Case 6  | Case 7  | SUM        | INPI percentage |
|----------------------------|-----------------------|---------|---------|---------|-----------|---------|---------|------------|-----------------|
| Total analysed             | 7,775,392             | 974,444 | 407,404 | 558,992 | 1,056,406 | 361,242 | 466,671 | 11,600,551 |                 |
| Unique index               | 7,443,607             |         | 387,589 | 525,832 | 943,789   | 367,508 | 296,702 | 9,965,027  | 89.47%          |
|                            | test count (1) + True |         |         |         |           |         |         |            |                 |
| Street number test         | 6,443,723             |         | 254,858 | 320,206 | 624,127   | 115,627 | 7,163   | 7,765,704  | 69.72%          |
| Street type test           | 6,676,760             |         | 275,559 | 361,456 | 628,397   | 223,805 | 186,222 | 8,352,199  | 74.99%          |
| Date test                  | 3,757,064             |         | 179,688 | 253,205 | 44,084    | 15,059  | 115,524 | 4,364,624  | 39.19%          |
| Administrative status test | 6,246,835             |         | 31,054  | 418,014 | 679,414   | 223,096 | 173,412 | 7,771,825  | 69.78%          |
| Head office test           | 4,500,976             |         | 234,578 | 331,856 | 561,655   | 224,637 | 175,751 | 6,029,453  | 54.13%          |
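
As a rough check of the table above, the SUM column and the per-case pass rate of a single test can be recomputed from the per-case counts. This is a sketch only: the column names `unique_index`, `street_number_test` and `street_type_test` are illustrative labels, not identifiers from the notebook, and the INPI row count used for the last column is not stated in this section, so that percentage is not recomputed here.

```python
import pandas as pd

# Counts copied from the table (case 2 has no unique-index figures)
counts = pd.DataFrame(
    {
        "unique_index": [7443607, 387589, 525832, 943789, 367508, 296702],
        "street_number_test": [6443723, 254858, 320206, 624127, 115627, 7163],
        "street_type_test": [6676760, 275559, 361456, 628397, 223805, 186222],
    },
    index=pd.Index([1, 3, 4, 5, 6, 7], name="case"),
)

# Column totals, to compare with the SUM column of the table
sums = counts.sum()

# Share of unique rows in each case that passed the street number test
pass_rate = counts["street_number_test"] / counts["unique_index"]

print(sums)
print(pass_rate.map("{:.2%}".format))
```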

# Generate report

In [67]:
import os, time, shutil, urllib.request, ipykernel, json
from pathlib import Path
from notebook import notebookapp

In [68]:
def create_report(extension = "html"):
    """
    Create a report from the current notebook and save it in the
    Reports folder (a child of the current working directory)

    1. Extract the current notebook name
    2. Convert the notebook
    3. Move the newly created report

    Args:
    extension: string. Can be "html", "pdf", "md"
    """

    ### Get notebook name
    connection_file = os.path.basename(ipykernel.get_connection_file())
    # Connection files are named kernel-<id>.json: keep the <id> part
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]
    notebookname = None

    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
        except Exception:
            # Server not reachable: try the next one
            continue
        # Keep the session attached to the current kernel
        for sess in sessions:
            if sess['kernel']['id'] == kernel_id:
                notebookname = sess['name']

    if notebookname is None:
        raise RuntimeError("Could not find the name of the current notebook")

    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)

    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)

    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)

    ### Generate notebook
    os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))

    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report Available at this adress:\n {}".format(dest))

In [69]:
create_report(extension = "html")

Report Available at this adress:
 C:\Users\PERNETTH\Documents\Projects\InseeInpi_matching\Notebooks_matching\Data_preprocessed\programme_matching\02_siretisation\Reports\06_analyse_pre_siretisation_v3.html
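
The destination path printed above follows from the notebook name and the chosen extension. The sketch below isolates that step with `pathlib`; `report_destination` is a hypothetical helper, not part of the notebook, and it additionally creates the `Reports` folder if it is missing, which `create_report` itself does not do.

```python
import tempfile
from pathlib import Path


def report_destination(notebookname, extension="html", base_dir="."):
    """Return the path the converted notebook would be moved to."""
    reports = Path(base_dir) / "Reports"
    # Assumption: create the folder up front so a later move cannot fail on it
    reports.mkdir(parents=True, exist_ok=True)
    # Drop everything after the first dot, as create_report does
    stem = notebookname.split(".", 1)[0]
    return reports / "{}.{}".format(stem, extension)


# Usage on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    dest = report_destination("06_analyse_pre_siretisation_v3.ipynb", "html", tmp)
    print(dest.name)
```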
