# Test Matching Insee/ETS

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
        - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_ETS.gz
    - INPI/TC_1/Stock_processed/initial_ETS.json
    
Test columns:

```
["Siren","Date_Immatriculation", "Date_Clôture", "Date_Greffe"]
```

## Saving

* The list of matched SIREN will be saved according to their nature and origin
  * nature → ACTES/COMPTES/ETS/etc
  * origin → initial/partiel/new/evt

Matched records will be saved in calfdata/SIRETISATION/matche/ with the following naming format:

* insee_nature_origine_matche.gz
    * ex: insee_pm_initial_matche.gz
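
As a rough illustration of the naming scheme above, the helper below assembles the object key from a nature and an origin. The function name `matched_key` is hypothetical, and lowercasing is an assumption based only on the example `insee_pm_initial_matche.gz`.

```python
def matched_key(nature, origine, prefix="SIRETISATION/matche/"):
    """Build the key of a matched-SIREN file (hypothetical helper)."""
    # Assumption: names are lowercased, as in `insee_pm_initial_matche.gz`.
    return "{}insee_{}_{}_matche.gz".format(
        prefix, nature.lower(), origine.lower())


print(matched_key("pm", "initial"))
```

Keeping the nature and origin in the file name lets each batch be listed or reloaded without opening the archive.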
    
    

## Test search engines

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/
* Infogreffe
  * https://www.infogreffe.fr/


The head office (siège) does not generate a new SIRET; it only indicates the place of jurisdiction.

In [None]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
#import Match_inpi_insee.aws_connectors as aws
#from tqdm.notebook import tqdm
#import tqdm
%load_ext autoreload
%autoreload 2

In [None]:
#instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
#bucket = 'calfdata'

In [None]:
# instanciate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)

## File preparation

## Matching the main establishment

Here we keep only the variables shared by the INSEE and INPI files for secondary establishments.

### Candidates

**INSEE**

https://www.sirene.fr/sirene/public/static/liste-variables

- numeroVoieEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/numeroVoieEtablissement
- indiceRepetitionEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/indiceRepetitionEtablissement
- typeVoieEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/typeVoieEtablissement
- libelleVoieEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleVoieEtablissement
- complementAdresseEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/complementAdresseEtablissement
- codeCommuneEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codeCommuneEtablissement
- libelleCommuneEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleCommuneEtablissement
- codePostalEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codePostalEtablissement
- codeCedexEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codeCedexEtablissement
- libelleCedexEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleCedexEtablissement
- distributionSpecialeEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/distributionSpecialeEtablissement
- libelleCommuneEtrangerEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleCommuneEtrangerEtablissement
- codePaysEtrangerEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codePaysEtrangerEtablissement
- libellePaysEtrangerEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libellePaysEtrangerEtablissement

**INPI**

- Adresse_Ligne1/Adresse_Ligne2/Adresse_Ligne3: depending on the registry (greffe), the address (number + street) appears either in Adresse_Ligne1 or in Adresse_Ligne2.
Not all address lines are necessarily filled in.
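
Because the street address can sit in any of the three lines, one possible approach is to concatenate whatever lines are filled, so that a later comparison does not depend on which line a registry used. The sketch below is only an illustration; `join_address_lines` is a hypothetical helper, not part of the notebook.

```python
def join_address_lines(*lignes):
    """Concatenate the non-empty address lines into a single string."""
    # Non-string values (None, NaN) and blank strings are skipped.
    kept = [l.strip() for l in lignes if isinstance(l, str) and l.strip()]
    return " ".join(kept)


print(join_address_lines(None, "12 RUE DE LA PAIX", None))
```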



### Build a file of all possible commune spellings

In [None]:
communes = (pd.read_csv(r'temp_local\communes-01012019.csv')
            .set_index('ncc').reindex(columns = ['nccenr', 'libelle']))#.unstack()
communes.loc[lambda x: x['libelle'].isin(['Châtillon-sur-Chalaronne'])]

In [None]:
communes = (pd.read_csv(r'temp_local\communes-01012019.csv')
            .set_index('ncc')
            .reindex(columns=['nccenr', 'libelle'])
            .assign(
    noaccent=lambda x: x['nccenr'].str.normalize('NFKD')
    .str.encode('ascii', errors='ignore')
    .str.decode('utf-8'),
    nccenr_noponc=lambda x: x['nccenr'].str.replace(r'[^\w\s]', ' ', regex=True),
    libelle_noponc=lambda x: x['libelle'].str.replace(r'[^\w\s]', ' ', regex=True),
    noaccent_noponc=lambda x: x['noaccent'].str.replace(r'[^\w\s]', ' ', regex=True),
    uppercase=lambda x: x.index,
    nccenr_uppercase=lambda x: x['nccenr'].str.upper(),
    libelle_uppercase=lambda x: x['libelle'].str.upper(),
    noaccent_uppercase=lambda x: x['noaccent'].str.upper(),
    nccenr_noponc_uppercase=lambda x: x['nccenr_noponc'].str.upper(),
    libelle_noponc_uppercase=lambda x: x['libelle_noponc'].str.upper(),
    noaccent_noponc_uppercase=lambda x: x['noaccent_noponc'].str.upper(),
    nccenr_lowercase=lambda x: x['nccenr'].str.lower(),
    libelle_lowercase=lambda x: x['libelle'].str.lower(),
    noaccent_lowercase=lambda x: x['noaccent'].str.lower(),
    nccenr_noponc_lowercase=lambda x: x['nccenr_noponc'].str.lower(),
    libelle_noponc_lowercase=lambda x: x['libelle_noponc'].str.lower(),
    noaccent_noponc_lowercase=lambda x: x['noaccent_noponc'].str.lower(),
    nccenr_noarrond1=lambda x: x['nccenr'].str.replace(
        'er Arrondissement', ''),
    uppercase_noarrond1=lambda x: x['uppercase'].str.replace(
        'ER ARRONDISSEMENT', ''),
    lowercase_noarrond1=lambda x: x['nccenr_lowercase'].str.replace(
        'er arrondissement', ''),
    nccenr_noarrond=lambda x: x['nccenr'].str.replace('e Arrondissement', ''),
    uppercase_noarrond=lambda x: x['uppercase'].str.replace(
        'E ARRONDISSEMENT', ''),
    lowercase_noarrond=lambda x: x['nccenr_lowercase'].str.replace(
        'e arrondissement', ''),
)
)

for n in communes.columns:
    var_ = '{}_ST'.format(n)
    var_1 = '{}_st'.format(n)
    var_2 = '{}_St'.format(n)
    
    communes[var_] = communes[n].str.replace('SAINT', 'ST')
    communes[var_1] = communes[n].str.replace('Saint', 'st')
    communes[var_2] = communes[n].str.replace('Saint', 'St')
    
    var_ = '{}_Sbar'.format(n)
    var_1 = '{}_sbar'.format(n)
    
    communes[var_] = communes[n].str.replace('SUR', 'S/')
    communes[var_1] = communes[n].str.replace('sur', 's/')
    
communes = (communes
            .stack()
            .rename('possibilite')
            .reset_index()
            .drop(columns='level_1')
            .drop_duplicates(subset=['possibilite']))
communes.head()
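
The same idea can be sketched on a single name with the standard library only. This is a simplified illustration of the variants built above (no accents, no punctuation, upper and lower case, "Saint" abbreviated); it does not reproduce every column of the cell, and `strip_accents` and `commune_variants` are hypothetical names.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop diacritics, e.g. 'Châtillon' becomes 'Chatillon'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def commune_variants(name):
    """Return a set of plausible spellings for one commune name."""
    bases = {name, strip_accents(name)}
    # Same bases with punctuation (hyphens, apostrophes) replaced by spaces.
    bases |= {re.sub(r"[^\w\s]", " ", b) for b in bases}
    variants = set()
    for b in bases:
        variants.update({b, b.upper(), b.lower()})
    # Common abbreviation of "Saint".
    variants |= {v.replace("SAINT", "ST").replace("Saint", "St")
                 for v in variants}
    return variants


print(sorted(commune_variants("Châtillon-sur-Chalaronne")))
```

Generating the variants once and joining on them avoids fuzzy matching for the most common spelling differences.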

In [None]:
communes.to_csv(r'data\input\communes_france.csv', index = False)

In [None]:
#insee = AWS_connection.url_instance_bucket(path_file = 'INSEE/Stock/ETS/StockEtablissement_utf8.csv')
#ets = AWS_connection.url_instance_bucket(path_file = 'INPI/TC_1/Stock_processed/initial_ETS.gz')
#ets

In [None]:
insee = r"\temp_local\StockEtablissement_utf8.csv"
ets = r"\temp_local\initial_ETS.gz"

In [None]:
# load data into dataframes
data_insee_ = dd.read_csv(insee,
                          usecols=['siren',
                                   'siret',
                                   "numeroVoieEtablissement",
                                   "indiceRepetitionEtablissement",
                                   "typeVoieEtablissement",
                                   "libelleVoieEtablissement",
                                   "complementAdresseEtablissement",
                                   "codeCommuneEtablissement",
                                   "libelleCommuneEtablissement",
                                   "codePostalEtablissement",
                                   "codeCedexEtablissement",
                                   "libelleCedexEtablissement",
                                   "distributionSpecialeEtablissement",
                                   "libelleCommuneEtrangerEtablissement",
                                   "codePaysEtrangerEtablissement",
                                   "libellePaysEtrangerEtablissement",
                                   "dateCreationEtablissement"
                                   ],
                          dtype={'siren': 'object',
                                 'siret': 'object',
                                 "numeroVoieEtablissement":'object',
                                   "indiceRepetitionEtablissement":'object',
                                   "typeVoieEtablissement":'object',
                                   "libelleVoieEtablissement":'object',
                                   "complementAdresseEtablissement":'object',
                                   "codeCommuneEtablissement":'object',
                                   "libelleCommuneEtablissement":'object',
                                   "codePostalEtablissement":'object',
                                   "codeCedexEtablissement":'object',
                                   "libelleCedexEtablissement":'object',
                                   "distributionSpecialeEtablissement":'object',
                                   "libelleCommuneEtrangerEtablissement":'object',
                                   "codePaysEtrangerEtablissement":'object',
                                   "libellePaysEtrangerEtablissement":'object'
                                 }
                          )

data_ets_ = (dd.read_csv(ets,
                         usecols=[
                             'Type',
                             'Siren',
                             'Code_Postal',
                             'Code_Commune',
                             'Adresse_Ligne1',
                             'Adresse_Ligne2',
                             'Adresse_Ligne3',
                             'Ville',
                             'Pays'
                         ],
                         dtype={
                             'Type': 'object',
                             'Siren': 'object',
                             'Code_Postal': 'object',
                             'Code_Commune': 'object',
                             'Adresse_Ligne1': 'object',
                             'Adresse_Ligne2': 'object',
                             'Adresse_Ligne3': 'object',
                             'Ville':'object',
                             'Pays':'object'
                         },
                         compression='gzip',
                         blocksize=None,
                         low_memory=False
                         )
             .compute()
             .rename(columns={"Siren": "siren"})
             .loc[lambda x: ~x['Type'].isin(['SIE'])]
             )

In [None]:
data_insee_.shape

In [None]:
data_ets_.shape

In [None]:
siren_inpi = data_ets_['siren'].drop_duplicates()
len(siren_inpi)

In [None]:
len(siren_inpi)/data_ets_.shape[0]

In [None]:
subset_insee = (data_insee_
                # Build both masks from the same frame and combine them in a single .loc
                .loc[data_insee_['siren'].isin(siren_inpi.to_list())
                     & (data_insee_['dateCreationEtablissement'] <= "2018-01-01")]
                .assign(
                libelleCommuneEtablissement = lambda x:
                    x['libelleCommuneEtablissement'].str.replace('-', ' ')
                )
                .compute()
               )

SIREN in INPI but not in INSEE -> the company has exercised its right to object with INSEE, so its data cannot be published.
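
The split between SIREN that can be searched and SIREN that must be set aside is a plain anti-join. A minimal sketch with dummy identifiers (the function name is hypothetical):

```python
def split_on_insee_presence(siren_inpi, siren_insee):
    """Return (kept, removed): INPI SIREN found / not found in INSEE."""
    insee = set(siren_insee)
    kept = [s for s in siren_inpi if s in insee]
    removed = [s for s in siren_inpi if s not in insee]
    return kept, removed


# Dummy identifiers, for illustration only.
kept, removed = split_on_insee_presence(
    ["000000001", "000000002", "000000003"],
    ["000000001", "000000003", "000000004"],
)
print(removed)
```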

temp insee -> saves time during development
temp inpi -> saves time during development

In [None]:
siren_to_remove = siren_inpi.loc[lambda x : ~x.isin(subset_insee['siren'])]
len(siren_to_remove)

In [None]:
df_siren_to_find = data_ets_.loc[lambda x:
                                 (~x['siren'].isin(siren_to_remove))    
                                 ]
len(df_siren_to_find)

In [None]:
#df_siren_to_find.to_csv('temp_inpi.csv', index = False)

### NaN in matching variables

We exclude the rows where all candidate variables are NaN.

-> they will be handled later

In [None]:
import numpy as np
siren_fullna = df_siren_to_find.loc[lambda x:
                      (x['Adresse_Ligne1'].isin([np.nan]))
                     & (x['Adresse_Ligne2'].isin([np.nan]))
                     & (x['Adresse_Ligne3'].isin([np.nan]))
                     & (x['Code_Postal'].isin([np.nan]))
                     & (x['Ville'].isin([np.nan]))
                     & (x['Code_Commune'].isin([np.nan]))
                     ]['siren']

In [None]:
df_siren_to_find = df_siren_to_find.loc[lambda x:
                                 (~x['siren'].isin(siren_fullna))
                                 ]
len(siren_fullna)
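
A more compact way to flag rows with no usable candidate is `isna().all(axis=1)`. The sketch below works at row level on dummy data, whereas the cells above remove every row of a SIREN that has at least one fully empty row, so the two are not strictly equivalent.

```python
import numpy as np
import pandas as pd

candidates = ["Adresse_Ligne1", "Adresse_Ligne2", "Adresse_Ligne3",
              "Code_Postal", "Ville", "Code_Commune"]

# Dummy rows: the first one has no usable candidate at all.
df = pd.DataFrame(
    [["000000001"] + [np.nan] * 6,
     ["000000002", "1 RUE EXEMPLE", np.nan, np.nan, "69001", "LYON", np.nan]],
    columns=["siren"] + candidates,
)

all_missing = df[candidates].isna().all(axis=1)
deferred = df.loc[all_missing]    # handled later
to_match = df.loc[~all_missing]   # goes through the matching steps
print(len(deferred), len(to_match))
```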

### Number of establishments per SIREN (INSEE)

We compute the number of establishments per SIREN for the INSEE file.
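
For reference, the group-count-merge pattern used in the next cells can also be written with `transform`, which broadcasts the group size to every row. A small sketch on dummy identifiers, showing that both forms give the same column:

```python
import pandas as pd

df = pd.DataFrame({
    "siren": ["000000001", "000000001", "000000002"],
    "siret": ["00000000100011", "00000000100029", "00000000200019"],
})

# Group by, count, then merge back (the pattern used in this notebook).
via_merge = df.merge(
    df.groupby("siren")["siren"].count().rename("count").reset_index()
)

# Equivalent form: broadcast the group size to every row.
via_transform = df.assign(
    count=df.groupby("siren")["siren"].transform("size"))

print(via_transform)
```

The `transform` form avoids the intermediate table and the merge, which may matter on files with several million rows.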

In [None]:
subset_insee_count = subset_insee.merge(
    (subset_insee
     .groupby('siren')['siren']
     .count()
     .rename('count')
     .reset_index())
)

In [None]:
df_siren_to_find = df_siren_to_find.merge(
    (df_siren_to_find
     .groupby('siren')['siren']
     .count()
     .rename('count')
     .reset_index()
    )
)

In [None]:
df_siren_to_find.shape[0]

INSEE: remove the hyphens from the city name

In [None]:
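def siren_unique(df):
    """
    Print the total number of rows in `df`, then return a styled table giving,
    for each number of rows per SIREN (first 10 values), how many SIREN have
    that count and their share of the total.
    """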
    print("Nombre total obs: {}".format(len(df)))
    count_ = (df
              .groupby('siren')['siren']
              .count()
              .rename('count')
              .reset_index()
              .groupby('count')['count']
              .count()
              .reset_index(name='total_count')
              .set_index('count')
              # .compute()
              .assign(pct=lambda x: x/x.sum())
              .iloc[:10, :]
              .style
              .format('{:,.2%}', subset=['pct'])
              )
    return count_

### Save files for the test class

In [None]:
df_siren_to_find = df_siren_to_find.rename(columns = 
                                           {'count': 'count_initial_inpi'})
df_siren_to_find.assign(index = lambda x:x.index).to_csv(r'data\input\inpi_etb_{}.gz'.format(
    df_siren_to_find.shape[0]),
                        compression='gzip', index = False) 

In [None]:
subset_insee_count = subset_insee_count.rename(columns = 
                                               {'count': 'count_initial_insee'})
subset_insee_count.to_csv(r'data\input\insee_2017_{}.gz'.format(
    subset_insee_count.shape[0]),
                        compression='gzip')

Quick stats

In [None]:
siren_unique(df = subset_insee_count)

In [None]:
siren_unique(df = df_siren_to_find)

## Step 0: Clean ville

Add a match against the communes to recover the INSEE commune label.

WARNING: the city variable in INSEE must be cleaned. See the file `communes.xlsx` for the various problems.

ex: 
- CEDEX, cedex, digit, (d+), 

Note that the arrondissement may be given in parentheses:

- MARSEILLE (7E)

- process:
    - create a variable containing only the digits
    - rebuild `ville2` when `test` is not NaN, to keep the arrondissement
    - strip the remaining information from the city name with a regex
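
The process above can be sketched on plain strings as follows. This is a simplified, standard-library illustration rather than the notebook's exact regex: the ordinal-suffix handling (`8E`, `1ER`) is an addition, and treating a run of one or two digits as an arrondissement is the same assumption the notebook makes.

```python
import re

NOISE = re.compile(
    r"\bCEDEX\b"            # postal routing keyword
    r"|\([^)]*\)"           # anything in parentheses, e.g. "(7E)"
    r"|\d+(?:ER|EME|E)?\b"  # digits, with an optional ordinal suffix
    r"|\.|--"               # stray punctuation
    r"|\bCOMMUNE DE\b",     # administrative prefix
    flags=re.IGNORECASE,
)


def clean_ville(ville):
    """Return (cleaned name, arrondissement digits or None)."""
    digits = re.search(r"\d+", ville)
    cleaned = NOISE.sub(" ", ville)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Assumption: 1-2 digits denote an arrondissement; longer runs
    # (postal codes, for instance) are ignored.
    keep = digits is not None and len(digits.group()) <= 2
    return cleaned, digits.group() if keep else None


print(clean_ville("MARSEILLE (7E)"))
```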

In [None]:
import numpy as np
regex = r'CEDEX|cedex|Cedex|\([^)]*\)|/\s\s+/|^\d+\s|\s\d+\s|\s\d+$|\d+|\.|\--|COMMUNE DE |COMMUNE DE|commune de |commune de|Commune de |Commune de |\s$'
test_adress = df_siren_to_find.copy()
test_adress['test'] =test_adress['Ville'].str.extract(r'(\d+)')
# regex=True is required since pandas 2.0, where str.replace matches literally by default
test_adress['Ville_clean'] = test_adress['Ville'].str.replace(regex, '', regex=True)
# strip leading and trailing whitespace
test_adress['Ville_clean'] = test_adress['Ville_clean'].str.replace(r'^\s+|\s+$',
                                                                    '', regex=True)

### arrondissement
test_adress['ville2'] = np.where(
    np.logical_and(
         ~test_adress['test'].isin([np.nan]),
        test_adress['test'].str.len() <=2
    )
   ,
    test_adress['Ville_clean'] + '' + test_adress['test'].astype(str),
    test_adress['Ville_clean']
)

test_adress['Ville_upper'] = test_adress['Ville_clean'].str.upper()

test_adress = test_adress.merge(communes,
                         left_on='ville2',
                         right_on='possibilite',
                         how='left',
                         indicator=True)

test_adress = pd.concat([
    test_adress.loc[lambda x: x['_merge'].isin(['both'])],
    (test_adress
     .loc[lambda x: x['_merge'].isin(['left_only'])]
     .drop(columns=['ncc', 'possibilite', '_merge'])
     .merge(communes,
            left_on='Ville_upper',
            right_on='possibilite',
            how='left',
            indicator=True)
     )
])

test_adress = pd.concat([
    test_adress.loc[lambda x: x['_merge'].isin(['both'])],
    (test_adress
     .loc[lambda x: x['_merge'].isin(['left_only'])]
     .drop(columns=['ncc', 'possibilite', '_merge'])
     .assign(
         noaccent=lambda x: x['Ville_clean'].str.normalize('NFKD')
         .str.encode('ascii', errors='ignore')
         .str.decode('utf-8'))
     ).merge(communes,
             left_on='noaccent',
             right_on='possibilite',
             how='left',
             indicator=True)])

In [None]:
test_adress.head()

In [None]:
log_commune = {
    
    'total_match':[int(test_adress['_merge'].value_counts()['both']),
                   float(test_adress['_merge'].value_counts()['both']/test_adress.shape[0])
                  ],
    'total_unmatch':[int(test_adress['_merge'].value_counts()['left_only']),
                   float(test_adress['_merge'].value_counts()['left_only']/test_adress.shape[0])
                  ],
    'details_unmatch': {
        'Code_Postal':int(test_adress.loc[lambda x: x['_merge'].isin(['left_only'])].isna().sum()[['Code_Postal']][0]),
        'Code_Commune':int(test_adress.loc[lambda x: x['_merge'].isin(['left_only'])].isna().sum()[['Code_Commune']][0]),
    }
}

In [None]:
log_commune

In [None]:
with open(r'data\logs\commune.json', 'w') as outfile:
                json.dump(log_commune, outfile)

In [None]:
log_commune

In [None]:
test_adress['_merge'].value_counts()['left_only']

In [None]:
test_adress.loc[lambda x: x['_merge'].isin(['left_only'])].isna().sum()[['Code_Postal', 'Code_Commune']]

In [None]:
test_adress.loc[lambda x: (x['_merge'].isin(['left_only']) )
               & (x['Code_Postal'].isin([np.nan]))
               & (~x['Ville'].isin([np.nan]))]

In [None]:
communes.loc[lambda x : x['possibilite'].isin(['NURIEUX'])]

In [None]:
communes[communes['possibilite'].str.contains('LA VARENNE')]

In [None]:
communes[communes['ncc'].str.contains('SAINT RAMBERT EN BUGEY')]

In [None]:
test_adress = test_adress.drop(columns = '_merge')

In [None]:
test_adress.shape

In [None]:
subset_insee_count.to_csv('subset_insee_count.csv', index = False)

## Process

We only match SIREN whose creation date is before 2018.

1) ~Step: compute the number of `nan` in the matching columns~

2) ~Step: count the number of SIRET per SIREN~

2) Step 2: merge on siren and postal code

3) Step 3: merge on siren and commune code
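
The successive merges form a cascade: rows matched on one key leave the pool, and only the remainder is tried on the next key. The sketch below illustrates this with plain dictionaries and dummy identifiers. The key pairs are the ones used later in this notebook, while `cascade` and the `matched_on` field are hypothetical names.

```python
CANDIDATE_KEYS = [
    ("ncc", "libelleCommuneEtablissement"),
    ("Code_Postal", "codePostalEtablissement"),
    ("Code_Commune", "codeCommuneEtablissement"),
]


def cascade(inpi_rows, insee_rows, keys=CANDIDATE_KEYS):
    """Try each key pair in turn; keep only unambiguous matches."""
    matched, remaining = [], list(inpi_rows)
    for inpi_key, insee_key in keys:
        # Index the INSEE SIRET by (siren, key value).
        index = {}
        for row in insee_rows:
            key = (row["siren"], row.get(insee_key))
            index.setdefault(key, []).append(row["siret"])
        still_unmatched = []
        for row in remaining:
            value = row.get(inpi_key)
            found = index.get((row["siren"], value), []) if value is not None else []
            if len(found) == 1:
                matched.append(dict(row, siret=found[0], matched_on=inpi_key))
            else:
                still_unmatched.append(row)
        remaining = still_unmatched
    return matched, remaining


# Dummy data, for illustration only.
insee_rows = [
    {"siren": "000000001", "siret": "00000000100011",
     "libelleCommuneEtablissement": "LYON",
     "codePostalEtablissement": "69001", "codeCommuneEtablissement": "69381"},
    {"siren": "000000002", "siret": "00000000200012",
     "libelleCommuneEtablissement": "PARIS 8",
     "codePostalEtablissement": "75008", "codeCommuneEtablissement": "75108"},
]
inpi_rows = [
    {"siren": "000000001", "ncc": "LYON", "Code_Postal": "69001", "Code_Commune": None},
    {"siren": "000000002", "ncc": None, "Code_Postal": "75008", "Code_Commune": None},
    {"siren": "000000003", "ncc": "LILLE", "Code_Postal": "59000", "Code_Commune": "59350"},
]
matched, remaining = cascade(inpi_rows, insee_rows)
print([(m["siren"], m["matched_on"]) for m in matched])
```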



### Step 1: Match only SIREN with a single row in both INSEE and INPI

We remove the matches from the dataframe `df_siren_to_find` and add the `left_only` rows.

Same for INSEE, to save memory.
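
On a toy example, step 1 amounts to a right merge restricted to SIREN that appear exactly once on each side, followed by a split on the `_merge` indicator. The identifiers below are dummy values and the frames are far smaller than the real files.

```python
import pandas as pd

insee = pd.DataFrame({
    "siren": ["000000001", "000000002", "000000002"],
    "siret": ["00000000100011", "00000000200012", "00000000200020"],
})
inpi = pd.DataFrame({
    "siren": ["000000001", "000000002", "000000003"],
    "Ville": ["LYON", "PARIS", "LILLE"],
})

# Number of rows per SIREN on each side.
insee["count_initial_insee"] = insee.groupby("siren")["siren"].transform("size")
inpi["count_initial_inpi"] = inpi.groupby("siren")["siren"].transform("size")

merged = (
    insee.loc[insee["count_initial_insee"] == 1]
    .merge(inpi.loc[inpi["count_initial_inpi"] == 1],
           on="siren", how="right", indicator=True)
)
matched = merged.loc[merged["_merge"] == "both"]
unmatched = merged.loc[merged["_merge"] == "right_only"]
print(matched[["siren", "siret"]])
```

Here `000000002` stays unmatched because INSEE lists two SIRET for it, and `000000003` because it is absent from INSEE.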

In [None]:
test_adress = test_adress.drop(columns = ['test', 'Ville_clean', 'ville2', 'Ville_upper',
        'possibilite', '_merge', 'noaccent'],
        errors='ignore')  # '_merge' may already have been dropped in an earlier cell

In [None]:
test_adress[lambda x:x['ncc'].isin([np.nan])]['Ville'].drop_duplicates().to_excel('ville_pas_insee.xlsx')

In [None]:
test_adress.to_csv(r'data\input\inpi_etb_cleaned_{}.gz'.format(
    test_adress.shape[0]),
                        compression='gzip', index = False) 

In [None]:
## There are 4312053 unique establishments in the INPI
siren_unique(df = test_adress)

In [None]:
from dask.diagnostics import ProgressBar
from dask.multiprocessing import get
import re
import dask.dataframe as dd
import pandas as pd
import numpy as np
pbar = ProgressBar()
pbar.register()

In [None]:
insee_col = ['siren',
 'siret',
 'dateCreationEtablissement',
 'complementAdresseEtablissement',
 'numeroVoieEtablissement',
 'indiceRepetitionEtablissement',
 'typeVoieEtablissement',
 'libelleVoieEtablissement',
 'codePostalEtablissement',
 'libelleCommuneEtablissement',
 'libelleCommuneEtrangerEtablissement',
 'distributionSpecialeEtablissement',
 'codeCommuneEtablissement',
 'codeCedexEtablissement',
 'libelleCedexEtablissement',
 'codePaysEtrangerEtablissement',
 'libellePaysEtrangerEtablissement',
 'count_initial_insee']

inpi_col =['siren',
           'index',
 'Type',
 'Adresse_Ligne1',
 'Adresse_Ligne2',
 'Adresse_Ligne3',
 'Code_Postal',
 'Ville',
 'Code_Commune',
 'Pays',
 'count_initial_inpi',
 'ncc']
insee_dtype = {
    'siren': 'object',
    'siret': 'object',
    'dateCreationEtablissement': 'object',
    'complementAdresseEtablissement': 'object',
    'numeroVoieEtablissement': 'object',
    'indiceRepetitionEtablissement': 'object',
    'typeVoieEtablissement': 'object',
    'libelleVoieEtablissement': 'object',
    'codePostalEtablissement': 'object',
    'libelleCommuneEtablissement': 'object',
    'libelleCommuneEtrangerEtablissement': 'object',
    'distributionSpecialeEtablissement': 'object',
    'codeCommuneEtablissement': 'object',
    'codeCedexEtablissement': 'object',
    'libelleCedexEtablissement': 'object',
    'codePaysEtrangerEtablissement': 'object',
    'libellePaysEtrangerEtablissement': 'object',
    'count_initial_insee': 'int'
}

inpi_dtype = {
    'siren': 'object',
    'index':'int',
 'Type': 'object',
 'Adresse_Ligne1': 'object',
 'Adresse_Ligne2': 'object',
 'Adresse_Ligne3': 'object',
 'Code_Postal': 'object',
 'Ville': 'object',
 'Code_Commune': 'object',
 'Pays': 'object',
 'count_initial_inpi': 'int',
 'ncc': 'object',
}

In [None]:
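def import_dask(file, usecols = None, dtype=None):
    """
    Read a gzip-compressed CSV into a dask DataFrame.

    blocksize=None loads the file as a single partition, since gzip
    archives cannot be split.
    """
    dd_df = dd.read_csv(file, usecols = usecols, dtype = dtype,
                        blocksize=None, compression='gzip')

    return dd_df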

In [None]:
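def match_unmatch(df_inpi_initial, df_inpi_mergeboth, step = '1_unique_siren',
                  to_csv = True):
    """
    Split the INPI rows into matched and unmatched sets.

    df_inpi_initial: INPI rows that were candidates at this step
    df_inpi_mergeboth: rows already matched (they carry a `siret`)
    step: label used in the output file names
    to_csv: if True, write both sets as gzip-compressed CSV files

    Returns the unmatched rows, without the `siret` column.
    """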

    merge_ = (
        df_inpi_mergeboth
        .merge(df_inpi_initial,
               how='right',
               indicator=True)
    )
    
    match_ = merge_.loc[lambda x: 
                       x['_merge'].isin(['both'])].drop(columns = '_merge')
    
    unmatch_ = merge_.loc[lambda x: 
                       ~x['_merge'].isin(['both'])].drop(columns = ['_merge',
                                                                    'siret'])
    
    
    if to_csv:
        name_match = 'data/output/match_{}_{}.gz'.format(step, match_.shape[0])
        name_unmatch = 'data/input/unmatched/unmatch_{}_{}.gz'.format(
            step, unmatch_.shape[0])
        match_.to_csv(name_match, index = False, compression='gzip',)
        unmatch_.to_csv(name_unmatch, index = False,compression='gzip')
        
    return unmatch_

In [None]:
def log_detail(df_, option = 'left_only'):
    """
    Summarise a merge that was run with indicator=True.

    option -> 'right_only' or 'left_only': the `_merge` value counted as unmatched
    """
    log_ = {
    
    'total_match':[int(df_['_merge'].value_counts()['both']),
                   float(df_['_merge'].value_counts()['both']/df_.shape[0])
                  ],
    'total_unmatch':[int(df_['_merge'].value_counts()[option]),
                   float(df_['_merge'].value_counts()[option]/df_.shape[0])
                  ],
    'details_unmatch': {
        'Code_Postal':int(df_.loc[lambda x: x['_merge'].isin([option])].isna().sum()[['Code_Postal']][0]),
        'Code_Commune':int(df_.loc[lambda x: x['_merge'].isin([option])].isna().sum()[['Code_Commune']][0]),
    }
}
    return log_

In [None]:
def match_unique_etb(list_inpi):
    """
    Step 1: match the SIREN that have exactly one row in both INSEE and INPI.

    list_inpi: columns to keep for the matched rows
    Returns the INPI rows left unmatched.
    """
    insee = import_dask(file=r'data\input\insee_2017_7480120.gz',
                        usecols=insee_col, dtype=insee_dtype)

    inpi = import_dask(file=r'data\input\inpi_etb_cleaned_4979272.gz',
                       usecols=inpi_col, dtype=inpi_dtype)

    # Compute the merge once and reuse it for both the split and the log
    m1_unique = (
        insee.loc[insee['count_initial_insee'].isin([1])]
        .merge(inpi.loc[inpi['count_initial_inpi'].isin([1])],
               how='right', indicator=True)
        .compute()
    )

    unmatched = match_unmatch(
        df_inpi_initial=inpi.compute(),
        df_inpi_mergeboth=(m1_unique
                       .reindex(columns=list_inpi)
                       .loc[lambda x:
                            x['_merge'].isin(['both'])]
                       .drop(columns=['_merge'])),
        step='1_unique_siren',
        to_csv=True)
    
    # Log on the merged frame, which still carries the `_merge` column
    print(log_detail(m1_unique, option = 'right_only'))
    
    return unmatched

In [None]:
list_inpi = [
        'siren',
             'siret',
        'index',
             'Type',
             'Adresse_Ligne1',
             'Adresse_Ligne2',
             'Adresse_Ligne3',
             'Code_Postal',
             'Ville',
             'Code_Commune',
             'Pays',
             'ncc',
             '_merge']
test_adress1 = match_unique_etb(list_inpi)

In [None]:
test_adress1

In [None]:
log_detail(test_adress1, option = 'right_only')

In [None]:
test_adress1 = match_unmatch(
            df_inpi_initial=test_adress,
            df_inpi_mergeboth=(test_adress1
                           .reindex(columns=list_inpi)
                           .loc[lambda x:
                                x['_merge'].isin(['both'])]
                           .drop(columns=['_merge'])),
    step='1_unique_siren',
    to_csv=False)

In [None]:
m1_unique = (
    dd_df_insee.loc[dd_df_insee['count_initial_insee'].isin([1])]
    .merge(dd_df_inpi.loc[dd_df_inpi['count_initial_inpi'].isin([1])],
           how='right', indicator=True)
)
m1_unique.compute()

Of the 4312053 unique establishments, 3009904 are matched correctly.

To retrieve the matched rows, we filter the initial table, namely `test_adress`.

In [None]:
test_adress1 = match_unmatch(
    df_inpi_initial=test_adress,
    df_inpi_mergeboth=m1_unique.reindex(columns=list_inpi).loc[lambda x:
                                         x['_merge'].isin(['both'])].drop(columns=['_merge']),
    step='1_unique_siren',
    to_csv=True)

Conclusion step 1:

- matched: 3009904
- Unmatched: 1969368

In [None]:
test_adress1.head()

In [None]:
siren_unique(df = test_adress1)

Examples of SIREN that have only one row in the INPI but several SIRET in the INSEE.

- 813543063
- 800897092

In [None]:
test_adress1.loc[lambda x: x['count_initial_inpi'] ==1].head(2)

In [None]:
subset_insee_count.loc[lambda x: x['siren'].isin(['813543063'])]

In [None]:
subset_insee_count.loc[lambda x: x['siren'].isin(['800897092'])]

In [None]:
subset_insee_count.isna().sum().sort_values()

In [None]:
test_adress.isna().sum().sort_values()

### Step 2: Merging 

In this part, we merge on several candidates. Most SIREN can be matched directly via the postal code, the commune code, or the city. If a SIREN has several SIRET in the same city or postal code, it will be searched in more depth.

The merge leads to three cases:

- 1) High-relevance merge
- 2) Medium-relevance merge -> several SIRET for the same candidate
- 3) Unmerged
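
The three cases can be illustrated with a small classifier over plain dictionaries. This is a simplified sketch on dummy data: it counts the INSEE SIRET found for each (siren, key value) pair, whereas the notebook counts rows after the merge. The function name `classify` is hypothetical.

```python
from collections import defaultdict


def classify(inpi_rows, insee_rows, inpi_key, insee_key):
    """Split INPI rows into strong matches, ambiguous matches and unmatched rows."""
    sirets = defaultdict(list)
    for row in insee_rows:
        sirets[(row["siren"], row[insee_key])].append(row["siret"])

    strong, ambiguous, unmatched = [], [], []
    for row in inpi_rows:
        found = sirets.get((row["siren"], row[inpi_key]), [])
        if len(found) == 1:
            strong.append(dict(row, siret=found[0]))   # case 1
        elif found:
            ambiguous.append(row)                      # case 2
        else:
            unmatched.append(row)                      # case 3
    return strong, ambiguous, unmatched


# Dummy data, for illustration only.
insee_rows = [
    {"siren": "000000001", "siret": "00000000100011", "codePostalEtablissement": "69001"},
    {"siren": "000000002", "siret": "00000000200012", "codePostalEtablissement": "75008"},
    {"siren": "000000002", "siret": "00000000200020", "codePostalEtablissement": "75008"},
]
inpi_rows = [
    {"siren": "000000001", "Code_Postal": "69001"},
    {"siren": "000000002", "Code_Postal": "75008"},
    {"siren": "000000003", "Code_Postal": "59000"},
]
strong, ambiguous, unmatched = classify(
    inpi_rows, insee_rows, "Code_Postal", "codePostalEtablissement")
print(len(strong), len(ambiguous), len(unmatched))
```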

#### 1: merge on siren and city

- Merge on siren & libelleCommuneEtablissement|Ville_clean

In [None]:
%%time

df_nomatch = pd.DataFrame()
df_input = import_dask(file=r'data\input\unmatched\unmatch_1_unique_siren_1969368.gz',
                       usecols=inpi_col, dtype=inpi_dtype)
  
total_match = pd.DataFrame()
for i in [
        ['ncc', 'libelleCommuneEtablissement'],
        ['Code_Postal', 'codePostalEtablissement'],
        ['Code_Commune', 'codeCommuneEtablissement']
    ]:
    df_input_ = merge_siren_candidat(df_input=df_input,
                                  option=i)

    df_input = df_input_['unmatch']
    total_match = pd.concat([total_match, df_input_['true_match']])

    
match_unmatch(
        df_inpi_initial= import_dask(file=r'data\input\unmatched\unmatch_1_unique_siren_1969368.gz',
                       usecols=inpi_col, dtype=inpi_dtype).compute(),
        df_inpi_mergeboth=total_match.drop(columns = '_merge'),
        step='1_ville_cp_cc',
        to_csv=True)

In [None]:
list(df_input.compute())

In [None]:
total_match.shape

In [None]:
total_match.head()

In [None]:
list_inpi = ['siren',
                 'siret',
                 'Type',
                 'Adresse_Ligne1',
                 'Adresse_Ligne2',
                 'Adresse_Ligne3',
                 'Code_Postal',
                 'Ville',
                 'Code_Commune',
                 'Pays',
                 'test',
                 'Ville_clean',
                 'ville2',
                 'ncc',
                 'possibilite',
                 'noaccent']

In [None]:
def merge_siren_candidat(df_input, 
                         option=['ncc', 'libelleCommuneEtablissement']):
    """
    option is an [INPI column, INSEE column] pair and can only be one of these:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """
    insee = import_dask(file=r'data\input\insee_2017_7480120.gz',
                        usecols=insee_col, dtype=insee_dtype)
    
    if '_merge' in df_input.columns:
        df_input = (df_input
                    .drop(columns=['siret',
                                   'numeroVoieEtablissement',
                                   'libelleVoieEtablissement',
                                   'codePostalEtablissement',
                                   'libelleCommuneEtablissement',
                                   'codeCommuneEtablissement',
                                   '_merge']))

    temp = df_input.merge(insee,
                          how='left',
                          left_on=['siren', option[0]],
                          right_on=['siren',  option[1]],
                          indicator=True,
                          suffixes=['_insee', '_inpi'])

    to_check = temp[temp['_merge'].isin(['both'])]
    nomatch = temp[~temp['_merge'].isin(['both'])]
    
    to_check = to_check.compute()

    # count the case-2 situations (several SIRET for one candidate) -> very conservative
    test_match = (to_check
                  .merge(
                      (to_check
                       .groupby(['siren', option[1]])['siren']
                       .count()
                       .rename('count')
                       .reset_index()
                       )
                  )
                  )

    true_match = (test_match
                  .loc[lambda x:x['count'] == 1]
                  .reindex(columns=list_inpi))

    name_csv = r'Data\output\{}_true_match_{}.csv'.format(
        option[0], true_match.shape[0])

    true_match.to_csv(name_csv, index=False)

    dic_ = {
        'true_match': true_match,
        'unmatch': nomatch
    }

    return dic_

In [None]:
test = test_adress1.merge(subset_insee_count,
                   how='left',
                   left_on=['siren', 'ncc'],
                   right_on=['siren', 'libelleCommuneEtablissement'],
                   indicator=True,
                   suffixes=['_insee', '_inpi'])

In [None]:
test['_merge'].value_counts()

In [None]:
### count the case-2 situations (several SIRET for one candidate) -> very conservative
test_match = (test.loc[lambda x: x['_merge'].isin(['both'])]
              .merge(
                  (test.loc[lambda x: x['_merge'].isin(['both'])]
            .groupby(['siren', 'libelleCommuneEtablissement'])['siren']
            .count()
            .rename('count')
            .reset_index()
                  )
              )
)
test_match['count'].value_counts()

In [None]:
test_adress2 = match_unmatch(
    df_inpi_initial=test_adress1,
    df_inpi_mergeboth=(test_match
                       .loc[lambda x: x['count'].isin([1])]
                   .reindex(columns=list_inpi).loc[lambda x:
                                         x['_merge'].isin(['both'])]
                   .drop(columns = ['count','_merge'])),
    step='2_ville',
    to_csv=True)

In [None]:
test_adress2.shape

In [None]:
test_adress2.shape[0] + test_match['count'].value_counts().loc[1] == test_adress1.shape[0]

#### 2: merge on siren and postal code

In [None]:
test = test_adress2.merge(subset_insee_count,
                   how='left',
                   left_on=['siren', 'Code_Postal'],
                   right_on=['siren', 'codePostalEtablissement'],
                   indicator=True,
                   suffixes=['_insee', '_inpi'])
test['_merge'].value_counts()

In [None]:
### compute the number of case 2 occurrences -> very conservative
test_match = (test.loc[lambda x: x['_merge'].isin(['both'])]
              .merge(
                  (test.loc[lambda x: x['_merge'].isin(['both'])]
            .groupby(['siren', 'codePostalEtablissement'])['siren']
            .count()
            .rename('count')
            .reset_index()
                  )
              )
)
test_match['count'].value_counts()

In [None]:
test_adress3 = match_unmatch(
    df_inpi_initial=test_adress2,
    df_inpi_mergeboth=(test_match
                       .loc[lambda x: x['count'].isin([1])]
                   .reindex(columns=list_inpi).loc[lambda x:
                                         x['_merge'].isin(['both'])]
                   .drop(columns = ['count','_merge'])),
    step='3_codePostal',
    to_csv=True)
test_adress3.shape[0] + test_match['count'].value_counts().loc[1] == test_adress2.shape[0]

#### 3: merge on siren and commune code

- Merge on siren & codeCommuneEtablissement
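Each step follows the same pattern: join on `siren` plus one key, count the candidates per pair, and accept only the pairs with a single candidate. A minimal sketch of that rule in plain Python, on hypothetical toy rows (the notebook itself does this with pandas `merge` and `groupby`, and counts the merged rows rather than the INSEE rows alone):

```python
from collections import Counter

# Hypothetical toy data: (siren, commune code) on the INPI side,
# (siren, commune code, siret) on the INSEE side.
inpi_rows = [("111", "75056"), ("222", "69123"), ("333", "13055")]
insee_rows = [
    ("111", "75056", "11100011"),
    ("222", "69123", "22200012"),
    ("222", "69123", "22200020"),
]


def unique_matches(inpi, insee):
    """Return {(siren, key): siret} for pairs with exactly one INSEE candidate."""
    counts = Counter((siren, key) for siren, key, _ in insee)
    wanted = set(inpi)
    return {
        (siren, key): siret
        for siren, key, siret in insee
        if (siren, key) in wanted and counts[(siren, key)] == 1
    }


matched = unique_matches(inpi_rows, insee_rows)
# Rows with zero or several candidates go on to the next step
unmatched = [row for row in inpi_rows if row not in matched]
```

SIREN `222` has two candidate SIRETs for the same commune, so it is left unmatched rather than guessed, which is what makes the rule conservative.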

In [None]:
test = test_adress3.merge(subset_insee_count,
                   how='left',
                   left_on=['siren', 'Code_Commune'],
                   right_on=['siren', 'codeCommuneEtablissement'],
                   indicator=True,
                   suffixes=['_insee', '_inpi'])
test['_merge'].value_counts()

In [None]:
### compute the number of case 2 occurrences -> very conservative
test_match = (test.loc[lambda x: x['_merge'].isin(['both'])]
              .merge(
                  (test.loc[lambda x: x['_merge'].isin(['both'])]
            .groupby(['siren', 'codeCommuneEtablissement'])['siren']
            .count()
            .rename('count')
            .reset_index()
                  )
              )
)
test_match['count'].value_counts()

In [None]:
test_match['count'].value_counts().loc[1]

In [None]:
test_adress4 = match_unmatch(
    df_inpi_initial=test_adress3,
    df_inpi_mergeboth=(test_match
                       .loc[lambda x: x['count'].isin([1])]
                   .reindex(columns=list_inpi).loc[lambda x:
                                         x['_merge'].isin(['both'])]
                   .drop(columns = ['count','_merge'])),
    step='4_codecommune',
    to_csv=True)
test_adress4.shape[0] + test_match['count'].value_counts().loc[1] == test_adress3.shape[0]

In [None]:
test_adress4.shape[0]

### Create a loop over steps 1 to 4

## Address matching

14% of the SIRENs still have to be matched using the address.

In [None]:
### Share left to match after the city/postal code/commune steps
test_adress4.shape[0] / test_adress.shape[0]

### Matching on the address

Only the rows that are still unmatched are selected.
To speed up the search, only the subset of SIRENs left to check is kept from the INSEE file.

Check whether the street number can be used for matching, in case the principal and secondary establishments are at the same location.

The addresses need to be reworked:

- Upper case

The search runs on the street label. In the INSEE file, this label contains neither the street number nor the street type (rue, avenue, etc.). Street-type abbreviations:

- ALL: Allée
- AV: Avenue
- BD: Boulevard
- CAR: Carrefour
- CHE: Chemin
- CHS: Chaussée
- CITE: Cité
- COR: Corniche
- CRS: Cours
- DOM: Domaine
- DSC: Descente
- ECA: Ecart
- ESP: Esplanade
- FG: Faubourg
- GR: Grande Rue
- HAM: Hameau
- HLE: Halle
- IMP: Impasse
- LD: Lieu dit
- LOT: Lotissement
- MAR: Marché
- MTE: Montée
- PAS: Passage
- PL: Place
- PLN: Plaine
- PLT: Plateau
- PRO: Promenade
- PRV: Parvis
- QUA: Quartier
- QUAI: Quai
- RES: Résidence
- RLE: Ruelle
- ROC: Rocade
- RPT: Rond Point
- RTE: Route
- RUE: Rue
- SEN: Sente - Sentier
- SQ: Square
- TPL: Terre-plein
- TRA: Traverse
- VLA: Villa
- VLGE: Village
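To make the INPI address comparable with the INSEE street label, the number, punctuation, accents and street type have to go. A minimal sketch of that cleaning on a single string, assuming a shortened, hypothetical stop-word set `STOP` (the notebook builds a longer list from NLTK's French stop words plus the street types):

```python
import re
import unicodedata

# Hypothetical, shortened stop-word set used only for this illustration
STOP = {"RUE", "AVENUE", "BOULEVARD", "DE", "LA", "DU", "DES", "BIS"}


def clean_address(raw):
    """Strip accents, digits and punctuation, upper-case, drop stop words."""
    ascii_text = (unicodedata.normalize("NFKD", raw)
                  .encode("ascii", errors="ignore")
                  .decode("ascii"))
    no_punct = re.sub(r"[^\w\s]|\d+", " ", ascii_text).upper()
    return " ".join(word for word in no_punct.split() if word not in STOP)


cleaned = clean_address("12 bis, Avenue de la République")
```

Only the distinctive words of the street name survive, which is what the regex search later relies on.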

In [None]:
additional = ["Avenue",
"Boulevard",
"Carrefour",
"Chemin",
"Chaussee",
"Cite",
"Corniche",
"Cours",
"Domaine",
"Descente",
"Ecart",
"Esplanade",
"Faubourg",
"Grande Rue",
"Hameau",
"Halle",
"Impasse",
"Lieu dit",
"Lotissement",
"Marche",
"Montee",
"Passage",
"Place",
"Plaine",
"Plateau",
"Promenade",
"Parvis",
"Quartier",
"Quai",
"Residence",
"Ruelle",
"Rocade",
"Rond Point",
"Route",
"Rue",
"Sentier",
"Square",
"Terre plein",
"Traverse",
"Villa",
"Village", 'Rn',
'bp', 'cedex', 'Bis',
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
        ]

In [None]:
import nltk
#nltk.download('stopwords')

In [None]:

from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('french')
stop_words.extend(additional)
upper_stop = [i.upper() for i in stop_words]

In [None]:
pd.Series(upper_stop).to_csv('upper_stop.csv', index = False)

In [None]:
#pd.read_csv('upper_stop.csv').iloc[:,0].to_list()

In [None]:
import re

import numpy as np


def create_split_adress(x):
    """Split each cleaned address of the Series `x` into a list of words."""
    split_ = x.str.split().to_list()
    return  split_


def create_regex_adress(x):
    """Build an alternation regex where every word is anchored with `$`.

    Returns NaN when `x` is not a list of words (e.g. a missing value).
    """
    try:
        split_ = [i + "$" for i in x]
        reg = '|'.join(split_)
    except TypeError:
        reg = np.nan
    return  reg


def find_regex(regex, test_str, siret):
    """Return `siret` when `regex` matches `test_str`, NaN otherwise."""
    try:
        matches = re.search(regex, test_str)
        if matches:
            return siret
        else:
            return np.nan
    except (TypeError, re.error):
        # missing regex or missing street label
        return np.nan

def prepare_adress(df):
    """Clean the INPI addresses and build their search regex.

    `df` is a Dask dataframe; `upper_stop`, `split` and `city_` are
    expected to be defined elsewhere in the notebook.
    """
    temp_adresse = df.compute().assign(
    
        Adress_new = lambda x: 
        x['Adresse_Ligne1'].fillna('') + ' '+\
        x['Adresse_Ligne2'].fillna('') + ' '+\
        x['Adresse_Ligne3'].fillna(''),
        # strip accents, punctuation and digits, then upper-case
        Adresse_new_clean=lambda x: x['Adress_new'].str.normalize(
            'NFKD')
        .str.encode('ascii', errors='ignore')
        .str.decode('utf-8')
        .str.replace(r'[^\w\s]|\d+', ' ', regex=True)
        .str.upper(),

    )
    # remove stop words and street types
    temp_adresse['Adresse_new_clean'] = (temp_adresse['Adresse_new_clean']
                                            .apply(lambda x:
                                                   ' '.join([word for word in
                                                             str(x).split() if
                                                             word not in 
                                                             (upper_stop)]))
                                            )
                                            
    temp_adresse = temp_adresse.assign(
        Adresse_new_clean_split=lambda x:
        create_split_adress(x['Adresse_new_clean'])
    )

    temp_adresse['Adresse_new_clean_reg'] = \
    temp_adresse['Adresse_new_clean_split'].apply(lambda x:
                                                     create_regex_adress(x))
    
    temp_adresse = temp_adresse.drop(columns = ['Adresse_new_clean',
                                                'Adresse_new_clean_split'])
    
    chunks = split(city_, 60000)
    try:
        for i in range(0, 10):
            chunks[i].to_csv(
            r'Data\input\unmatched\chunk\chunk_{}.gz'.format(i),
                index = False, compression ='gzip')
    except IndexError:
        # fewer than 10 chunks
        pass

#### Create intermediate files to avoid memory problems

To speed up the code, only the addresses without missing values are used, and the INSEE file is filtered.
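The chunking idea can be sketched on a plain list. The function `split_rows` below is a hypothetical stand-in for the notebook's `split`, which applies `np.split` to a dataframe; the cut points are computed the same way as in `index_marks`.

```python
def index_marks(nrows, chunk_size):
    """Positions at which the rows are cut."""
    return list(range(chunk_size,
                      (nrows // chunk_size + 1) * chunk_size,
                      chunk_size))


def split_rows(rows, chunk_size):
    """Cut `rows` at every mark, like np.split does with a list of indices."""
    marks = [0] + index_marks(len(rows), chunk_size) + [len(rows)]
    return [rows[start:stop] for start, stop in zip(marks, marks[1:])]


sizes = [len(chunk) for chunk in split_rows(list(range(250)), 100)]
sizes_exact = [len(chunk) for chunk in split_rows(list(range(200)), 100)]
```

When the number of rows is an exact multiple of the chunk size, the last chunk is empty, so it is worth skipping empty chunks before writing files.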

In [None]:
#http://yaoyao.codes/pandas/2018/01/23/pandas-split-a-dataframe-into-chunks
def index_marks(nrows, chunk_size):
    return range(1 * chunk_size, (nrows // chunk_size + 1) * chunk_size, chunk_size)

def split(dfm, chunk_size):
    indices = index_marks(dfm.shape[0], chunk_size)
    return np.split(dfm, indices)

To speed up the computation, the dataframe is converted to Dask and processed with `map_partitions`.

Remember to keep NCC

In [None]:
from dask.diagnostics import ProgressBar
from dask.multiprocessing import get
import re
import dask.dataframe as dd
import pandas as pd
import numpy as np
pbar = ProgressBar()
pbar.register()



# load data into dataframes
subset_insee_count = dd.read_csv('subset_insee_count.csv',
                          usecols=['siren',
            'siret',
            'libelleCommuneEtablissement',
            'libelleVoieEtablissement',
            'numeroVoieEtablissement',
                                   'codePostalEtablissement',
                                   'codeCommuneEtablissement'
                                   ],
                          dtype={'siren': 'object',
                                 'siret': 'object',
                                 "libelleCommuneEtablissement":'object',
                                   "libelleVoieEtablissement":'object',
                                   "numeroVoieEtablissement":'object',
                                 'codePostalEtablissement':'object',
                                 'codeCommuneEtablissement':'object'
                                 }
                          )

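def find_regex(regex, test_str, siret):
    """Return `siret` when `regex` matches `test_str`, NaN otherwise."""
    try:
        matches = re.search(regex, test_str)
        if matches:
            return siret
        else:
            return np.nan
    except (TypeError, re.error):
        # missing regex or missing street label
        return np.nan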

In [None]:
insee = import_dask(file=r'data\input\insee_2017_7480120.gz',
                        usecols=insee_col, dtype=insee_dtype)

inpi = import_dask(file=r'data\input\unmatched\unmatch_1_ville_cp_cc_744141.gz',
                       usecols=inpi_col, dtype=inpi_dtype)

In [None]:
test_adress4 = pd.read_csv(r'data\Unmatch\unmatch_4_codecommune_703641.gz',
                          compression='gzip',low_memory=False)

In [None]:
prepare_adress(inpi)#.iloc[:, 1:]

In [None]:
chunks = split(city_, 60000)
try:
    for i in range(0, 10):
        chunks[i].to_csv(
        r'Data\Unmatch\chunk\chunk_{}.csv'.format(i), index = False)
except:
    pass

In [None]:
def prepare_csvAdress(df_input, option=['ncc', 'libelleCommuneEtablissement'],
                      chunk=0,constraint_regex = True):
    """
    option must be one of these [INPI column, INSEE column] pairs:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """

    list_inpi = ['siren',
                 'siret',
                 'Type',
                 'Adresse_Ligne1',
                 'Adresse_Ligne2',
                 'Adresse_Ligne3',
                 'Code_Postal',
                 'Ville',
                 'Code_Commune',
                 'Pays',
                 'test',
                 'Ville_clean',
                 'ville2',
                 'ncc',
                 'possibilite',
                 'noaccent']
    if '_merge' in df_input.columns:
        df_input = (df_input
                    .drop(columns=['siret',
                                   'numeroVoieEtablissement',
                                   'libelleVoieEtablissement',
                                   'codePostalEtablissement',
                                   'libelleCommuneEtablissement',
                                   'codeCommuneEtablissement',
                                   '_merge']))

    temp = df_input.merge(subset_insee_count,
                          how='left',
                          left_on=['siren', option[0]],
                          right_on=['siren',  option[1]],
                          indicator=True,
                          suffixes=['_insee', '_inpi'])

    to_check = temp[temp['_merge'].isin(['both'])]
    nomatch = temp[~temp['_merge'].isin(['both'])]
    
    if not constraint_regex:
        # drop the end-of-string anchors ('$' is treated literally)
        to_check['Adresse_new_clean_reg'] = \
        to_check['Adresse_new_clean_reg'].str.replace('$', '', regex=False)
    # test
    # to_check = to_check.dropna(subset = ['libelleVoieEtablissement'])
    to_check['siret_test1'] = to_check.map_partitions(
        lambda df:
        df.apply(lambda x:
                 find_regex(
                     x['Adresse_new_clean_reg'],
                     x['libelleVoieEtablissement'],
                     x['siret']), axis=1)
    )

    to_check = to_check.dropna(subset=['siret_test1']).compute()

    # compute the number of case 2 occurrences -> very conservative
    test_match = (to_check
                  .merge(
                      (to_check
                       .groupby(['siren', 'Adress_new'])['siren']
                       .count()
                       .rename('count')
                       .reset_index()
                       )
                  )
                  )
    print(test_match['count'].value_counts())
    true_match = (test_match
                  .loc[lambda x:x['count'] == 1]
                  .reindex(columns=list_inpi))

    name_csv = r'Data\Match\chunk\{}\{}_true_match_{}.csv'.format(
        option[0], chunk, true_match.shape[0])

    true_match.to_csv(name_csv, index=False)

    dic_ = {
        'true_match': true_match,
        'unmatch': nomatch
    }

    return dic_

In [None]:
def merge_siren_candidat(df_input, regex_go = False,
                         option=['ncc', 'libelleCommuneEtablissement']):
    """
    option must be one of these [INPI column, INSEE column] pairs:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """
    insee = import_dask(file=r'data\input\insee_2017_7480120.gz',
                        usecols=insee_col, dtype=insee_dtype)
    
    if '_merge' in df_input.columns:
        df_input = (df_input
                    .drop(columns=['siret',
                                   'numeroVoieEtablissement',
                                   'libelleVoieEtablissement',
                                   'codePostalEtablissement',
                                   'libelleCommuneEtablissement',
                                   'codeCommuneEtablissement',
                                   '_merge']))

    temp = df_input.merge(insee,
                          how='left',
                          left_on=['siren', option[0]],
                          right_on=['siren',  option[1]],
                          indicator=True,
                          suffixes=['_insee', '_inpi'])

    to_check = temp[temp['_merge'].isin(['both'])]
    nomatch = temp[~temp['_merge'].isin(['both'])]
    
    if regex_go:
        to_check['siret_test1'] = to_check.map_partitions(
        lambda df:
        df.apply(lambda x:
                 find_regex(
                     x['Adresse_new_clean_reg'],
                     x['libelleVoieEtablissement'],
                     x['siret']), axis=1)
    )
        to_check = to_check.dropna(subset=['siret_test1']).compute()
        group_option = 'Adress_new'
    else:
        group_option = option[1]
        to_check = to_check.compute()

    # compute the number of case 2 occurrences -> very conservative
    test_match = (to_check
                  .merge(
                      (to_check
                       .groupby(['siren', group_option])['siren']
                       .count()
                       .rename('count')
                       .reset_index()
                       )
                  )
                  )

    true_match = (test_match
                  .loc[lambda x:x['count'] == 1]
                  .reindex(columns=list_inpi))

    name_csv = r'Data\output\{}_true_match_{}.csv'.format(
        option[0], true_match.shape[0])

    true_match.to_csv(name_csv, index=False)

    dic_ = {
        'true_match': true_match,
        'unmatch': nomatch
    }

    return dic_

In [None]:
insee_col = ['siren',
             'siret',
             'dateCreationEtablissement',
             'complementAdresseEtablissement',
             'numeroVoieEtablissement',
             'indiceRepetitionEtablissement',
             'typeVoieEtablissement',
             'libelleVoieEtablissement',
             'codePostalEtablissement',
             'libelleCommuneEtablissement',
             'libelleCommuneEtrangerEtablissement',
             'distributionSpecialeEtablissement',
             'codeCommuneEtablissement',
             'codeCedexEtablissement',
             'libelleCedexEtablissement',
             'codePaysEtrangerEtablissement',
             'libellePaysEtrangerEtablissement',
             'count_initial_insee']

inpi_col = ['siren',
            'index',
            'Type',
            'Adresse_Ligne1',
            'Adresse_Ligne2',
            'Adresse_Ligne3',
            'Code_Postal',
            'Ville',
            'Code_Commune',
            'Pays',
            'count_initial_inpi',
            'ncc',
            'Adress_new',
            'Adresse_new_clean_reg']
insee_dtype = {
    'siren': 'object',
    'siret': 'object',
    'dateCreationEtablissement': 'object',
    'complementAdresseEtablissement': 'object',
    'numeroVoieEtablissement': 'object',
    'indiceRepetitionEtablissement': 'object',
    'typeVoieEtablissement': 'object',
    'libelleVoieEtablissement': 'object',
    'codePostalEtablissement': 'object',
    'libelleCommuneEtablissement': 'object',
    'libelleCommuneEtrangerEtablissement': 'object',
    'distributionSpecialeEtablissement': 'object',
    'codeCommuneEtablissement': 'object',
    'codeCedexEtablissement': 'object',
    'libelleCedexEtablissement': 'object',
    'codePaysEtrangerEtablissement': 'object',
    'libellePaysEtrangerEtablissement': 'object',
    'count_initial_insee': 'int'
}

inpi_dtype = {
    'siren': 'object',
    'index': 'int',
    'Type': 'object',
    'Adresse_Ligne1': 'object',
    'Adresse_Ligne2': 'object',
    'Adresse_Ligne3': 'object',
    'Code_Postal': 'object',
    'Ville': 'object',
    'Code_Commune': 'object',
    'Pays': 'object',
    'count_initial_inpi': 'int',
    'ncc': 'object',
    'Adress_new': 'object',
    'Adresse_new_clean_reg': 'object'
}

In [None]:
csv_file = r'data\input\unmatched\chunk\chunk_{}.gz'.format(chunk)
inpi = import_dask(file=csv_file,
                       usecols=inpi_col, dtype=inpi_dtype)
merge_siren_candidat(df_input=inpi,
                     option=['ncc', 'libelleCommuneEtablissement'],
                     regex_go = True)

In [None]:
inpi.head()

In [None]:
%%time
df_nomatch = pd.DataFrame()
df_match = pd.DataFrame()
for chunk in [0,
              1
              #,2,3,4,5,6,7,8,9
             ]:
    csv_file = r'data\input\unmatched\chunk\chunk_{}.gz'.format(chunk)
    inpi = import_dask(file=csv_file,
                       usecols=inpi_col, dtype=inpi_dtype)
    df_input = inpi
    total_match = pd.DataFrame()
    for i in [
        ['ncc', 'libelleCommuneEtablissement'],
        ['Code_Postal', 'codePostalEtablissement'],
        ['Code_Commune', 'codeCommuneEtablissement']
    ]:
        df_input_ = merge_siren_candidat(df_input=df_input,
                                      option=i,
                                      regex_go = True)

        df_input = df_input_['unmatch']
        total_match = pd.concat([total_match, df_input_['true_match']])
    
    test_adress2 = match_unmatch(
        df_inpi_initial=inpi.compute(),
        df_inpi_mergeboth=total_match,
        step='3_adress_{}'.format(chunk),
        to_csv=False)
    
    df_nomatch = pd.concat([df_nomatch, test_adress2])
    df_match = pd.concat([df_match, total_match])


#name_csv = r'Data\Unmatch\chunk\{}\{}_to_check.csv'.format(
#    'adress_only', 0)
#df_input_['unmatch'].compute().to_csv(name_csv, index=False)

In [None]:
df_nomatch.shape

In [None]:
df_nomatch.shape[0] / 4312053

In [None]:
#df_nomatch.loc[lambda x:x['siren'].isin(['388239667'])]

In [None]:
#subset_insee_count = subset_insee_count.compute()
#subset_insee_count.loc[lambda x:x['siren'].isin(['388239667'])]

In [None]:
#subset_insee_count.loc[lambda x:x['siren'].isin(['388239667'])]

Matching on the street number
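The street-number test compares the first run of digits found in the INPI address with the INSEE `numeroVoieEtablissement`. A minimal sketch on plain strings, with hypothetical helper names:

```python
import re


def first_number(address):
    """Return the first run of digits in `address`, or None."""
    found = re.search(r"\d+", address or "")
    return found.group(0) if found else None


def same_street_number(inpi_address, insee_number):
    """True when the INPI address starts its digits with the INSEE number."""
    number = first_number(inpi_address)
    return number is not None and number == insee_number


ok = same_street_number("12 RUE DE LA PAIX", "12")
ko = same_street_number("12 RUE DE LA PAIX", "14")
```

The comparison is done on strings, so `"012"` and `"12"` would not be considered equal; this is why the INSEE column is read as `object` rather than as a number.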

In [None]:
df_nomatch = df_nomatch.drop(columns = 'digit_inpi')

In [None]:
def match_voie(df_input, option=['ncc', 'libelleCommuneEtablissement'],
              constraint_regex = True):
    """
    option must be one of these [INPI column, INSEE column] pairs:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """
    
    # load data into dataframes
    subset_insee_count = dd.read_csv('subset_insee_count.csv',
                          usecols=['siren',
            'siret',
            'libelleCommuneEtablissement',
            'libelleVoieEtablissement',
            'numeroVoieEtablissement',
                                   'codePostalEtablissement',
                                   'codeCommuneEtablissement'
                                   ],
                          dtype={'siren': 'object',
                                 'siret': 'object',
                                 "libelleCommuneEtablissement":'object',
                                   "libelleVoieEtablissement":'object',
                                   "numeroVoieEtablissement":'object',
                                 'codePostalEtablissement':'object',
                                 'codeCommuneEtablissement':'object'
                                 }
                          )

    list_inpi = ['siren',
                 'siret',
                 'Type',
                 'Adresse_Ligne1',
                 'Adresse_Ligne2',
                 'Adresse_Ligne3',
                 'Code_Postal',
                 'Ville',
                 'Code_Commune',
                 'Pays',
                 'test',
                 'Ville_clean',
                 'ville2',
                 'ncc',
                 'possibilite',
                 'noaccent']
    if '_merge' in df_input.columns:
        df_input = (df_input
                    .drop(columns=['siret',
                                   'numeroVoieEtablissement',
                                   'libelleVoieEtablissement',
                                   'codePostalEtablissement',
                                   'libelleCommuneEtablissement',
                                   'codeCommuneEtablissement',
                                    #'count_inpi', 
                                   #'count',
                                   'count_insee',
                                   '_merge']))

    # Compute the number of SIRETs per SIREN in the INSEE file
    # Example: SIREN 750767907
    subset_insee_count = subset_insee_count.merge(
        (subset_insee_count
         .groupby('siren')['siren']
         .count()
         .rename('count_insee')
         .reset_index())
    )
    
    temp = df_input.merge(subset_insee_count,
                          how='left',
                          left_on=['siren', option[0]],
                          right_on=['siren',  option[1]],
                          indicator=True,
                          suffixes=['_insee', '_inpi'])

    to_check = temp[temp['_merge'].isin(['both'])]
    nomatch = temp[~temp['_merge'].isin(['both'])]
    
    if not constraint_regex:
        # drop the end-of-string anchors ('$' is treated literally)
        to_check['Adresse_new_clean_reg'] = \
        to_check['Adresse_new_clean_reg'].str.replace('$', '', regex=False)

    to_check['siret_test1'] = to_check.map_partitions(
        lambda df:
        df.apply(lambda x:
                 find_regex(
                     x['Adresse_new_clean_reg'],
                     x['libelleVoieEtablissement'],
                     x['siret']), axis=1)
    )
    
   
    to_check = to_check.dropna(subset=['siret_test1']).compute()
    to_check['digit_inpi'] = to_check['Adress_new'].str.extract(r'(\d+)')
    

    # test
    # to_check = to_check.dropna(subset = ['libelleVoieEtablissement'])
    to_check['test'] = np.where(
        to_check['digit_inpi'] ==
        to_check['numeroVoieEtablissement'],
        True, False
    )
    
    to_check = to_check[to_check['test'].isin([True])]

    # compute the number of case 2 occurrences -> very conservative
    test_match = (to_check
                  .merge(
                      (to_check
                       .groupby(['siren', 'numeroVoieEtablissement'])['siren']
                       .count()
                       .rename('count_inpi')
                       .reset_index()
                       )
                  )
                  )
    print(test_match['count_inpi'].value_counts())
    ### If INSEE has a single SIRET for the SIREN but INPI has more than one
    ### row, it is still the same SIRET
    ### test run on the 2017 data
    #2    478
    #3      9
    #6      6
    true_match = (test_match
                  .loc[lambda x:
                       (x['count_inpi'] == 1)
                      |
                      (x['count_insee'].isin([1])
                      & ~x['count_inpi'].isin([1]))]
                  .reindex(columns=list_inpi))

    name_csv = r'Data\Match\numero\{}\voie_match_{}_{}.csv'.format(
        option[0], str(constraint_regex),true_match.shape[0])

    true_match.to_csv(name_csv, index=False)

    dic_ = {
        'true_match': true_match,
        'unmatch': nomatch
    }

    return dic_

In [None]:
df_nomatch.to_csv(r'data\Unmatch\06_voie.csv', index = False)

In [None]:
unmatch_adress = dd.read_csv(r'data\Unmatch\06_voie.csv',dtype={
                    'siren':'object',
                                     'Code_Postal':'object',
                                     'Adress_new':'object',
                                     'Adresse_new_clean_reg':'object',
                                     'Type':'object',
                                     'Adresse_Ligne1':'object',
                                     'Adresse_Ligne2':'object',
                                     'Adresse_Ligne3':'object',
                                     'Ville':'object',
                                     'Code_Commune':'object',
                                     'Pays':'object',
                                     'test':'object',
                                     'Ville_clean':'object',
                                     'ville2':'object',
                                     'ncc':'object',
                                     'possibilite':'object',
                                     'noaccent':'object',
                                     'Adress_new':'object',
                                     'Adresse_new_clean_reg':'object'
            })

Example of a SIREN/SIRET matched only thanks to the street number. It was not matched earlier because of spelling mistakes.

- 798126694

In [None]:
%%time

df_nomatch = pd.DataFrame()
df_input = unmatch_adress
  
total_match = pd.DataFrame()
for i in [
        ['ncc', 'libelleCommuneEtablissement'],
        ['Code_Postal', 'codePostalEtablissement'],
        ['Code_Commune', 'codeCommuneEtablissement']
    ]:
    df_input_ = match_voie(df_input=df_input,
                                  option=i)

    df_input = df_input_['unmatch']
    total_match = pd.concat([total_match, df_input_['true_match']])
    
df_nomatch = match_unmatch(
        df_inpi_initial=unmatch_adress.drop(columns = ['test', 'count']).compute(),
        df_inpi_mergeboth=total_match.drop(columns = 'test'),
        step='6_voie_true',
        to_csv=True)

#name_csv = r'Data\Unmatch\chunk\{}\{}_to_check.csv'.format(
#    'adress_only', 0)
#df_input_['unmatch'].compute().to_csv(name_csv, index=False)

In [None]:
list(unmatch_adress.compute())

In [None]:
list(total_match)

In [None]:
df_nomatch = match_unmatch(
        df_inpi_initial=unmatch_adress.drop(columns = ['test', 'count']).compute(),
        df_inpi_mergeboth=total_match.drop(columns = 'test'),
        step='6_voie_true',
        to_csv=True)

In [None]:
df_nomatch.shape

In [None]:
df_nomatch.shape[0] / 4312053

In [None]:
df_nomatch.head()

In [None]:
df_nomatch['siren'].nunique()

## Example of a matching improvement

Remove the constraint on the regex

- 306168568
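What removing the constraint means can be shown on one word. With the `$` anchor, a word only matches at the end of the INSEE street label; without it, the word may appear anywhere. A minimal sketch, where `build_pattern` is a hypothetical helper and the label is made up:

```python
import re


def build_pattern(words, anchored=True):
    """Join the words into one alternation, optionally anchored with `$`."""
    suffix = "$" if anchored else ""
    return "|".join(re.escape(word) + suffix for word in words)


inpi_words = ["PAIX"]
insee_label = "PAIX PROLONGEE"

strict = re.search(build_pattern(inpi_words, anchored=True), insee_label)
relaxed = re.search(build_pattern(inpi_words, anchored=False), insee_label)
```

The relaxed pattern recovers labels with trailing words, at the cost of more false positives since any substring now matches.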

In [None]:
subset_insee_count.compute().loc[lambda x:x['siren'].isin(['331455691'])]
df_nomatch.loc[lambda x:x['siren'].isin(['331455691'])]

In [None]:
df_nomatch.loc[lambda x:x['siren'].isin(['331455691'])]

### Removing the regex constraint

Two tests:

- Without the street number
- With the street number

#### Without the street number

In [None]:
def test_(df_input, option=['ncc', 'libelleCommuneEtablissement'],
              constraint_regex = True):
    """
    option must be one of these [INPI column, INSEE column] pairs:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """
    
    # load data into dataframes
    subset_insee_count = dd.read_csv('subset_insee_count.csv',
                          usecols=['siren',
            'siret',
            'libelleCommuneEtablissement',
            'libelleVoieEtablissement',
            'numeroVoieEtablissement',
                                   'codePostalEtablissement',
                                   'codeCommuneEtablissement'
                                   ],
                          dtype={'siren': 'object',
                                 'siret': 'object',
                                 "libelleCommuneEtablissement":'object',
                                   "libelleVoieEtablissement":'object',
                                   "numeroVoieEtablissement":'object',
                                 'codePostalEtablissement':'object',
                                 'codeCommuneEtablissement':'object'
                                 }
                          )

    list_inpi = ['siren',
                 'siret',
                 'Type',
                 'Adresse_Ligne1',
                 'Adresse_Ligne2',
                 'Adresse_Ligne3',
                 'Code_Postal',
                 'Ville',
                 'Code_Commune',
                 'Pays',
                 'test',
                 'Ville_clean',
                 'ville2',
                 'ncc',
                 'possibilite',
                 'noaccent']
    if '_merge' in df_input.columns:
        df_input = (df_input
                    .drop(columns=['siret',
                                   'numeroVoieEtablissement',
                                   'libelleVoieEtablissement',
                                   'codePostalEtablissement',
                                   'libelleCommuneEtablissement',
                                   'codeCommuneEtablissement',
                                    #'count_inpi', 
                                   #'count',
                                   'count_insee',
                                   '_merge']))

    # Compute the number of SIRETs per SIREN in the INSEE file
    # Example: SIREN 750767907
    subset_insee_count = subset_insee_count.merge(
        (subset_insee_count
         .groupby('siren')['siren']
         .count()
         .rename('count_insee')
         .reset_index())
    )
    
    temp = df_input.merge(subset_insee_count,
                          how='left',
                          left_on=['siren', option[0]],
                          right_on=['siren',  option[1]],
                          indicator=True,
                          suffixes=['_insee', '_inpi'])
    

    to_check = temp[temp['_merge'].isin(['both'])]
    nomatch = temp[~temp['_merge'].isin(['both'])]
    
    if not constraint_regex:
        # drop the end-of-string anchors ('$' is treated literally)
        to_check['Adresse_new_clean_reg'] = \
        to_check['Adresse_new_clean_reg'].str.replace('$', '', regex=False)
        
    

    to_check['siret_test1'] = to_check.map_partitions(
        lambda df:
        df.apply(lambda x:
                 find_regex(
                     x['Adresse_new_clean_reg'],
                     x['libelleVoieEtablissement'],
                     x['siret']), axis=1)
    )
    
    to_check = to_check.dropna(subset=['siret_test1']).compute()

    # test
    # compute the number of case 2 occurrences -> very conservative
    test_match = (to_check
                  .merge(
                      (to_check
                       .groupby(['siren', 'Adress_new'])['siren']
                       .count()
                       .rename('count_inpi')
                       .reset_index()
                       )
                  )
                  )
    print(test_match['count_inpi'].value_counts())
    ### If INSEE has a single SIRET for the SIREN but INPI has more than one
    ### row, it is still the same SIRET
    ### test run on the 2017 data
    #2    478
    #3      9
    #6      6
    true_match = (test_match
                  .loc[lambda x:
                       x['count_inpi'] == 1]
                  .reindex(columns=list_inpi))

    name_csv = r'Data\Match\numero\{}\voie_match_{}_{}.csv'.format(
        option[0], str(constraint_regex),true_match.shape[0])

    true_match.to_csv(name_csv, index=False)

    dic_ = {
        'true_match': true_match,
        'unmatch': nomatch
    }

    return dic_
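
The core of the function above (left merge with an indicator, then keeping only addresses that have a single candidate) can be illustrated on toy data. This is a minimal sketch in plain pandas rather than dask; all values are hypothetical and the `find_regex` street check is left out.

```python
import pandas as pd

# Hypothetical INPI rows waiting for a SIRET
inpi = pd.DataFrame({
    "siren": ["111", "111", "222", "333"],
    "ncc": ["PARIS", "LYON", "DIJON", "NICE"],
    "Adress_new": ["1 RUE A", "2 RUE B", "3 RUE C", "4 RUE D"],
})
# Hypothetical INSEE establishments
insee = pd.DataFrame({
    "siren": ["111", "222", "222"],
    "siret": ["11100011", "22200011", "22200029"],
    "libelleCommuneEtablissement": ["PARIS", "DIJON", "DIJON"],
})

# Left merge on SIREN + city; `_merge` tells which INPI rows found a candidate
temp = inpi.merge(
    insee,
    how="left",
    left_on=["siren", "ncc"],
    right_on=["siren", "libelleCommuneEtablissement"],
    indicator=True,
)
to_check = temp[temp["_merge"] == "both"]
nomatch = temp[temp["_merge"] != "both"]

# Number of candidate rows per (siren, Adress_new) pair
count_inpi = (
    to_check.groupby(["siren", "Adress_new"])
    .size()
    .rename("count_inpi")
    .reset_index()
)
test_match = to_check.merge(count_inpi, on=["siren", "Adress_new"])

# Conservative rule: keep only pairs with exactly one candidate SIRET
true_match = test_match[test_match["count_inpi"] == 1]
print(true_match[["siren", "Adress_new", "siret"]])
```

SIREN 222 has two candidate SIRETs in the same city, so it is not kept as a sure match in this sketch; the rows without a candidate end up in `nomatch` and can go through the next key.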

In [None]:
%%time
unmatch_adress = dd.read_csv(r'C:\Users\PERNETTH\Documents\Projects\InseeInpi_matching\Notebooks_matching\data\Unmatch\unmatch_6_voie_true_105782.gz',dtype={
                    'siren':'object',
                                     'Code_Postal':'object',
                                     'Adress_new':'object',
                                     'Adresse_new_clean_reg':'object',
                                     'Type':'object',
                                     'Adresse_Ligne1':'object',
                                     'Adresse_Ligne2':'object',
                                     'Adresse_Ligne3':'object',
                                     'Ville':'object',
                                     'Code_Commune':'object',
                                     'Pays':'object',
                                     'Ville_clean':'object',
                                     'ville2':'object',
                                     'ncc':'object',
                                     'possibilite':'object',
                                     'noaccent':'object'
            },compression='gzip', blocksize=None)

df_nomatch = pd.DataFrame()
df_input = unmatch_adress
  
total_match = pd.DataFrame()
for i in [
        ['ncc', 'libelleCommuneEtablissement'],
        ['Code_Postal', 'codePostalEtablissement'],
        ['Code_Commune', 'codeCommuneEtablissement']
    ]:
    df_input_ = test_(df_input=df_input,
                                  option=i,
                          constraint_regex = False)

    df_input = df_input_['unmatch']
    # DataFrame.append was removed in pandas 2.0; pd.concat behaves the same here
    total_match = pd.concat([total_match, df_input_['true_match']])
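
The loop above is a cascade: each key pair only sees the rows that the previous pair left unmatched. A minimal sketch of that pattern on toy data follows; `match_on` is a hypothetical helper standing in for the matching function, and the values are invented.

```python
import pandas as pd


def match_on(df_left, df_right, left_key, right_key):
    """Hypothetical helper: split df_left into (matched, unmatched) rows."""
    merged = df_left.merge(
        df_right[["siren", right_key, "siret"]],
        how="left",
        left_on=["siren", left_key],
        right_on=["siren", right_key],
        indicator=True,
    )
    matched = merged[merged["_merge"] == "both"]
    # Unmatched rows go back to the original INPI columns for the next pass
    unmatched = merged.loc[merged["_merge"] != "both", df_left.columns]
    return matched, unmatched


inpi = pd.DataFrame({
    "siren": ["111", "222", "333"],
    "ncc": ["PARIS", "DIJON CEDEX", "NICE"],
    "Code_Postal": ["75001", "21000", "06000"],
})
insee = pd.DataFrame({
    "siren": ["111", "222"],
    "siret": ["11100011", "22200011"],
    "libelleCommuneEtablissement": ["PARIS", "DIJON"],
    "codePostalEtablissement": ["75001", "21000"],
})

remaining = inpi
matches = []
for left_key, right_key in [
    ("ncc", "libelleCommuneEtablissement"),
    ("Code_Postal", "codePostalEtablissement"),
]:
    matched, remaining = match_on(remaining, insee, left_key, right_key)
    matches.append(matched[["siren", "siret"]])

# Collect the per-key matches once, instead of appending inside the loop
total_match = pd.concat(matches, ignore_index=True)
print(total_match)
print(remaining)
```

Here SIREN 222 fails on the city name but is recovered by the postal code, which is the reason for trying the keys one after the other.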

In [None]:
unmatch_adress.compute().shape

In [None]:
total_match.shape

In [None]:
df_nomatch = match_unmatch(
        df_inpi_initial=unmatch_adress.compute(),
        df_inpi_mergeboth=total_match,
        step='6_voie_false_1',
        to_csv=True)

In [None]:
df_nomatch.shape[0]/4312053

#### With street number
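
The extra filter used in this step compares the first run of digits in the INPI address with `numeroVoieEtablissement`. A minimal sketch on hypothetical addresses, assuming both sides are stored as strings:

```python
import pandas as pd

# Hypothetical candidates after the SIREN + city merge
candidates = pd.DataFrame({
    "Adress_new": [
        "12 RUE DE LA PAIX",
        "RUE DES LILAS",
        "BAT 3 5 AVENUE FOCH",
        "7 BIS RUE HOCHE",
    ],
    "numeroVoieEtablissement": ["12", None, "5", "7"],
})

# First run of digits in the INPI address (NaN when there is none)
candidates["digit_inpi"] = candidates["Adress_new"].str.extract(
    r"(\d+)", expand=False
)

# Plain equality: a missing value on either side never matches
candidates["test"] = (
    candidates["digit_inpi"] == candidates["numeroVoieEtablissement"]
)
kept = candidates[candidates["test"]]
print(kept)
```

The third row shows a limit of taking the first digits: a building number placed before the street number makes the comparison fail, so such rows stay unmatched rather than being matched wrongly.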

In [None]:
def test_2(
    df_input, option=["ncc", "libelleCommuneEtablissement"], constraint_regex=True
):
    """
    option must be one of these [INPI column, INSEE column] pairs:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """

    # load data into dataframes
    subset_insee_count = dd.read_csv(
        "subset_insee_count.csv",
        usecols=[
            "siren",
            "siret",
            "libelleCommuneEtablissement",
            "libelleVoieEtablissement",
            "numeroVoieEtablissement",
            "codePostalEtablissement",
            "codeCommuneEtablissement",
        ],
        dtype={
            "siren": "object",
            "siret": "object",
            "libelleCommuneEtablissement": "object",
            "libelleVoieEtablissement": "object",
            "numeroVoieEtablissement": "object",
            "codePostalEtablissement": "object",
            "codeCommuneEtablissement": "object",
        },
    )

    list_inpi = [
        "siren",
        "siret",
        "Type",
        "Adresse_Ligne1",
        "Adresse_Ligne2",
        "Adresse_Ligne3",
        "Code_Postal",
        "Ville",
        "Code_Commune",
        "Pays",
        "test",
        "Ville_clean",
        "ville2",
        "ncc",
        "possibilite",
        "noaccent",
    ]
    if "_merge" in df_input.columns:
        df_input = df_input.drop(
            columns=[
                "siret",
                "numeroVoieEtablissement",
                "libelleVoieEtablissement",
                "codePostalEtablissement",
                "libelleCommuneEtablissement",
                "codeCommuneEtablissement",
                #'count_inpi',
                #'count',
                "count_insee",
                "_merge",
            ]
        )

    # Count the number of SIRETs per SIREN in the INSEE data
    # Example: SIREN 750767907
    subset_insee_count = subset_insee_count.merge(
        (
            subset_insee_count.groupby("siren")["siren"]
            .count()
            .rename("count_insee")
            .reset_index()
        )
    )

    temp = df_input.merge(
        subset_insee_count,
        how="left",
        left_on=["siren", option[0]],
        right_on=["siren", option[1]],
        indicator=True,
        suffixes=["_insee", "_inpi"],
    )

    to_check = temp[temp["_merge"].isin(["both"])]
    nomatch = temp[~temp["_merge"].isin(["both"])]

    if not constraint_regex:
        # Relaxed matching: strip the literal "$" from the INPI address pattern
        to_check["Adresse_new_clean_reg"] = to_check[
            "Adresse_new_clean_reg"
        ].str.replace("$", "", regex=False)

    to_check["siret_test1"] = to_check.map_partitions(
        lambda df: df.apply(
            lambda x: find_regex(
                x["Adresse_new_clean_reg"], x["libelleVoieEtablissement"], x["siret"]
            ),
            axis=1,
        )
    )

    
    to_check = to_check.dropna(subset=['siret_test1']).compute()
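    # First run of digits in the INPI address, taken as the street number
    to_check['digit_inpi'] = to_check['Adress_new'].str.extract(
        r'(\d+)', expand=False)

    # Keep only rows whose street number equals the INSEE one
    to_check['test'] = (
        to_check['digit_inpi'] == to_check['numeroVoieEtablissement']
    )

    to_check = to_check[to_check['test']]

    # Test: count the candidate rows per (siren, Adress_new) pair (case 2).
    # Keeping only pairs with a single candidate is very conservative.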
    test_match = to_check.merge(
        (
            to_check.groupby(["siren", "Adress_new"])["siren"]
            .count()
            .rename("count_inpi")
            .reset_index()
        )
    )
    print(test_match["count_inpi"].value_counts())
   
    ### If the INSEE SIRET count == 1 but the INPI count is not 1, it is
    ### still the same SIRET
    ### Test run on the 2017 data:
    # 2    478
    # 3      9
    # 6      6
    true_match = test_match.loc[lambda x: x["count_inpi"] == 1].reindex(
        columns=list_inpi
    )

    name_csv = r"Data\Match\numero\{}\voie_match_{}_{}.csv".format(
        option[0], str(constraint_regex), true_match.shape[0]
    )

    true_match.to_csv(name_csv, index=False)

    dic_ = {"true_match": true_match, "unmatch": nomatch}

    return dic_

In [None]:
%%time
unmatch_adress = dd.read_csv(r'C:\Users\PERNETTH\Documents\Projects\InseeInpi_matching\Notebooks_matching\data\Unmatch\unmatch_6_voie_false_1_92564.gz',dtype={
                    'siren':'object',
                                     'Code_Postal':'object',
                                     'Adress_new':'object',
                                     'Adresse_new_clean_reg':'object',
                                     'Type':'object',
                                     'Adresse_Ligne1':'object',
                                     'Adresse_Ligne2':'object',
                                     'Adresse_Ligne3':'object',
                                     'Ville':'object',
                                     'Code_Commune':'object',
                                     'Pays':'object',
                                     'Ville_clean':'object',
                                     'ville2':'object',
                                     'ncc':'object',
                                     'possibilite':'object',
                                     'noaccent':'object',
            },compression='gzip', blocksize=None)

df_nomatch = pd.DataFrame()
df_input = unmatch_adress
  
total_match = pd.DataFrame()
for i in [
        ['ncc', 'libelleCommuneEtablissement'],
        ['Code_Postal', 'codePostalEtablissement'],
        ['Code_Commune', 'codeCommuneEtablissement']
    ]:
    df_input_ = test_2(df_input=df_input,
                                  option=i,
                          constraint_regex = False)

    df_input = df_input_['unmatch']
    # DataFrame.append was removed in pandas 2.0; pd.concat behaves the same here
    total_match = pd.concat([total_match, df_input_['true_match']])

In [None]:
df_nomatch = match_unmatch(
        df_inpi_initial=unmatch_adress.compute(),
        df_inpi_mergeboth=total_match.drop(columns = 'test'),
        step='6_voie_false_2',
        to_csv=True)

In [None]:
total_match.columns

In [None]:
df_nomatch.shape[0]/4312053

In [None]:
df_nomatch.head()

In [None]:
subset_insee_count.compute().loc[lambda x:x['siren'].isin(['515226462'])]

In [None]:
subset_insee_count.compute().loc[lambda x:x['siren'].isin(['515226462'])]
df_nomatch.loc[lambda x:
               x['siren'].isin(['515226462'])]

In [None]:
df_nomatch.isna().sum()

### Final check

Follow-up verification:

- Cases where, for a given address, a SIREN has only one row at the INSEE but several at the INPI
    - Example: SIREN 515226462
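
The check described above can be sketched by counting INSEE rows per SIREN and street label, then keeping the INPI rows whose address maps to a single INSEE establishment. This is a minimal sketch on invented values; the `voie` column is a hypothetical, already extracted street label that stands in for the `find_regex` comparison used in the notebook.

```python
import pandas as pd

# Hypothetical INSEE establishments
insee = pd.DataFrame({
    "siren": ["111", "222", "222"],
    "siret": ["11100011", "22200011", "22200029"],
    "libelleVoieEtablissement": ["VICTOR HUGO", "FOCH", "FOCH"],
})
# Number of INSEE rows per (siren, street label)
insee["count_insee"] = (
    insee.groupby(["siren", "libelleVoieEtablissement"])["siret"]
    .transform("count")
)

# Hypothetical INPI rows: SIREN 111 appears twice at the same address
inpi = pd.DataFrame({
    "siren": ["111", "111", "222"],
    "Adress_new": ["3 RUE VICTOR HUGO", "3 RUE VICTOR HUGO", "8 AVENUE FOCH"],
    "voie": ["VICTOR HUGO", "VICTOR HUGO", "FOCH"],
})

merged = inpi.merge(
    insee,
    left_on=["siren", "voie"],
    right_on=["siren", "libelleVoieEtablissement"],
)

# One INSEE row for the address: every INPI row there gets the same SIRET
unique_at_insee = merged[merged["count_insee"] == 1]
print(unique_at_insee[["siren", "Adress_new", "siret"]])
```

Both INPI rows of SIREN 111 receive the same SIRET because the INSEE side is unambiguous, while SIREN 222 stays out since two establishments share the street.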

In [None]:
def test_3(
    df_input, option=["ncc", "libelleCommuneEtablissement"], constraint_regex=True
):
    """
    option must be one of these [INPI column, INSEE column] pairs:
    - ['ncc', 'libelleCommuneEtablissement']
    - ['Code_Postal', 'codePostalEtablissement']
    - ['Code_Commune', 'codeCommuneEtablissement']
    """

    # load data into dataframes
    subset_insee_count = dd.read_csv(
        "subset_insee_count.csv",
        usecols=[
            "siren",
            "siret",
            "libelleCommuneEtablissement",
            "libelleVoieEtablissement",
            "numeroVoieEtablissement",
            "codePostalEtablissement",
            "codeCommuneEtablissement",
        ],
        dtype={
            "siren": "object",
            "siret": "object",
            "libelleCommuneEtablissement": "object",
            "libelleVoieEtablissement": "object",
            "numeroVoieEtablissement": "object",
            "codePostalEtablissement": "object",
            "codeCommuneEtablissement": "object",
        },
    )

    list_inpi = [
        "siren",
        "siret",
        "Type",
        "Adresse_Ligne1",
        "Adresse_Ligne2",
        "Adresse_Ligne3",
        "Code_Postal",
        "Ville",
        "Code_Commune",
        "Pays",
        "test",
        "Ville_clean",
        "ville2",
        "ncc",
        "possibilite",
        "noaccent",
    ]
    if "_merge" in df_input.columns:
        df_input = df_input.drop(
            columns=[
                "siret",
                "numeroVoieEtablissement",
                "libelleVoieEtablissement",
                "codePostalEtablissement",
                "libelleCommuneEtablissement",
                "codeCommuneEtablissement",
                #'count_inpi',
                #'count',
                "count_insee",
                "_merge",
            ]
        )

    # Count the number of INSEE rows per SIREN and street label
    # (libelleVoieEtablissement)
    # Example: SIREN 750767907
    subset_insee_count = subset_insee_count.merge(
        (
            subset_insee_count.groupby(["siren", 'libelleVoieEtablissement'])["siren"]
            .count()
            .rename("count_insee")
            .reset_index()
        )
    )

    temp = df_input.merge(
        subset_insee_count,
        how="left",
        left_on=["siren", option[0]],
        right_on=["siren", option[1]],
        indicator=True,
        suffixes=["_insee", "_inpi"],
    )

    to_check = temp[temp["_merge"].isin(["both"])]
    nomatch = temp[~temp["_merge"].isin(["both"])]

    if not constraint_regex:
        # Relaxed matching: strip the literal "$" from the INPI address pattern
        to_check["Adresse_new_clean_reg"] = to_check[
            "Adresse_new_clean_reg"
        ].str.replace("$", "", regex=False)

    to_check["siret_test1"] = to_check.map_partitions(
        lambda df: df.apply(
            lambda x: find_regex(
                x["Adresse_new_clean_reg"], x["libelleVoieEtablissement"], x["siret"]
            ),
            axis=1,
        )
    )

    
    to_check = to_check.dropna(subset=['siret_test1']).compute()
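    # First run of digits in the INPI address, taken as the street number
    to_check['digit_inpi'] = to_check['Adress_new'].str.extract(
        r'(\d+)', expand=False)

    # Keep only rows whose street number equals the INSEE one
    to_check['test'] = (
        to_check['digit_inpi'] == to_check['numeroVoieEtablissement']
    )

    to_check = to_check[to_check['test']]

    # Test: count the candidate rows per (siren, Adress_new) pair (case 2).
    # Keeping only pairs with a single candidate is very conservative.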
    test_match = to_check.merge(
        (
            to_check.groupby(["siren", "Adress_new"])["siren"]
            .count()
            .rename("count_inpi")
            .reset_index()
        )
    )
    print(test_match["count_insee"].value_counts())
   
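    ### If the INSEE SIRET count == 1 but the INPI count is not 1, it is
    ### still the same SIRET
    ### Test run on the 2017 data:
    # 2    478
    # 3      9
    # 6      6
    # NOTE: debugging early return. While it is in place, test_3 returns the
    # test_match DataFrame and the code below never runs; remove it before
    # looping over the 'true_match' / 'unmatch' keys.
    return test_match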
    true_match = test_match.loc[lambda x: x["count_insee"] == 1].reindex(
        columns=list_inpi
    )

    name_csv = r"Data\Match\numero\{}\voie_match_{}_{}.csv".format(
        option[0], str(constraint_regex), true_match.shape[0]
    )

    true_match.to_csv(name_csv, index=False)

    dic_ = {"true_match": true_match, "unmatch": nomatch}

    return dic_

In [None]:
%%time
unmatch_adress = dd.read_csv(r'C:\Users\PERNETTH\Documents\Projects\InseeInpi_matching\Notebooks_matching\data\Unmatch\unmatch_6_voie_false_2_88186.gz',dtype={
                    'siren':'object',
                                     'Code_Postal':'object',
                                     'Adress_new':'object',
                                     'Adresse_new_clean_reg':'object',
                                     'Type':'object',
                                     'Adresse_Ligne1':'object',
                                     'Adresse_Ligne2':'object',
                                     'Adresse_Ligne3':'object',
                                     'Ville':'object',
                                     'Code_Commune':'object',
                                     'Pays':'object',
                                     'Ville_clean':'object',
                                     'ville2':'object',
                                     'ncc':'object',
                                     'possibilite':'object',
                                     'noaccent':'object',
            },compression='gzip', blocksize=None)

df_nomatch = pd.DataFrame()
df_input = unmatch_adress
  
test_ = test_3(
    df_input, 
    option=["ncc", "libelleCommuneEtablissement"], constraint_regex=True
)#["true_match"].loc[lambda x:
#               x['siren'].isin(['515226462'])]

In [None]:
subset_insee_count.compute().loc[lambda x:
                                 x['siren'].isin(['662043116']) & 
                                 x['libelleCommuneEtablissement'].isin(['DIJON'])]


In [None]:
df_input.compute().loc[lambda x:
               x['siren'].isin(['662043116'])
                      & x['Ville'].isin(['Dijon'])
                      ]

In [None]:
test_.loc[lambda x:
               x['siren'].isin(['662043116'])
                      & x['Ville'].isin(['Dijon'])
                      ]

In [None]:
test_.loc[lambda x :x['count_insee'].isin([2])].head(2)

In [None]:
total_match = pd.DataFrame()
for i in [
        ['ncc', 'libelleCommuneEtablissement'],
        ['Code_Postal', 'codePostalEtablissement'],
        ['Code_Commune', 'codeCommuneEtablissement']
    ]:
    df_input_ = test_3(df_input=df_input,
                                  option=i,
                          constraint_regex = False)

    df_input = df_input_['unmatch']
    # DataFrame.append was removed in pandas 2.0; pd.concat behaves the same here
    total_match = pd.concat([total_match, df_input_['true_match']])

In [None]:
df_nomatch = match_unmatch(
        df_inpi_initial=unmatch_adress.compute(),
        df_inpi_mergeboth=total_match.drop(columns = 'test'),
        step='7_last_voie',
        to_csv=True)