# Test Matching Insee/ETS

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
    - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_ETS.gz
    - INPI/TC_1/Stock_processed/initial_ETS.json
    
Test columns:

```
["Siren","Date_Immatriculation", "Date_Clôture", "Date_Greffe"]
```

## Saving

* The list of matched SIREN will be saved according to their nature and origin
  * nature → ACTES/COMPTES/ETS/etc
  * origin → initial/partiel/new/evt

Matched records will be saved in calfdata/SIRETISATION/matche/ with the following naming format:

* insee_nature_origine_matche.gz
    * e.g. insee_pm_initial_matche.gz
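
The naming convention above can be sketched as a small helper. This is only an illustration: the function name `matched_filename` is hypothetical, and lower-casing both parts is an assumption based on the `insee_pm_initial_matche.gz` example.

```python
def matched_filename(nature, origine):
    """Build the archive name for one (nature, origine) pair.

    Hypothetical helper: lower-casing is an assumption drawn from the
    example file name, not a documented rule.
    """
    return "insee_{}_{}_matche.gz".format(nature.lower(), origine.lower())


# Example: matched legal entities ("pm") from the initial stock.
example_name = matched_filename("PM", "initial")
```

The resulting name would then be written under `calfdata/SIRETISATION/matche/`.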
    
    

## Search engines for testing

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/
* Infogreffe
  * https://www.infogreffe.fr/


The registered office (siège) does not yield a new SIRET; it only indicates the place of jurisdiction.

In [1]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
#import Match_inpi_insee.aws_connectors as aws
#from tqdm.notebook import tqdm
#import tqdm
%load_ext autoreload
%autoreload 2



In [2]:
#instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
#bucket = 'calfdata'

In [3]:
# instanciate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)

## File preparation

## Matching the main establishment

Here, we select the variables shared by the INSEE and INPI files for secondary establishments.

### Candidates

**INSEE**

https://www.sirene.fr/sirene/public/static/liste-variables

- numeroVoieEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/numeroVoieEtablissement
- indiceRepetitionEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/indiceRepetitionEtablissement
- typeVoieEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/typeVoieEtablissement
- libelleVoieEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleVoieEtablissement
- complementAdresseEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/complementAdresseEtablissement
- codeCommuneEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codeCommuneEtablissement
- libelleCommuneEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleCommuneEtablissement
- codePostalEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codePostalEtablissement
- codeCedexEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codeCedexEtablissement
- libelleCedexEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleCedexEtablissement
- distributionSpecialeEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/distributionSpecialeEtablissement
- libelleCommuneEtrangerEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libelleCommuneEtrangerEtablissement
- codePaysEtrangerEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/codePaysEtrangerEtablissement
- libellePaysEtrangerEtablissement: https://www.sirene.fr/sirene/public/static/liste-variables/libellePaysEtrangerEtablissement

**INPI**

- Adresse_Ligne1/Adresse_Ligne2/Adresse_Ligne3: depending on the registry (greffe), the address (number + street) appears either in Adresse_Ligne1 or in Adresse_Ligne2.
  Not all address lines are necessarily filled in.
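
Since the street address may sit in any of the address lines, one possible approach is to take the first line that is actually filled in. The sketch below is an assumption about how that could be handled, not what the notebook does; `first_address_line` is a hypothetical helper.

```python
def first_address_line(ligne1, ligne2, ligne3):
    """Return the first non-empty address line, or None if all are empty.

    Non-string values (None, or NaN coming from pandas) count as empty.
    """
    for line in (ligne1, ligne2, ligne3):
        if isinstance(line, str) and line.strip():
            return line.strip()
    return None


# Address lines shaped like the INPI rows: Adresse_Ligne1 is empty here.
street = first_address_line(None, "RD 1075", "la Gare")
```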



### Build a file of all possible commune spellings

In [4]:
communes = pd.read_csv(r'temp_local\communes-01012019.csv').set_index('ncc').reindex(columns = ['nccenr', 'libelle'])#.unstack()
communes.loc[lambda x: x['libelle'].isin(['Châtillon-sur-Chalaronne'])]

ncc,nccenr,libelle
CHATILLON SUR CHALARONNE,Châtillon-sur-Chalaronne,Châtillon-sur-Chalaronne


In [5]:
communes = (pd.read_csv(r'temp_local\communes-01012019.csv')
            .set_index('ncc')
            .reindex(columns=['nccenr', 'libelle'])
            .assign(
    noaccent=lambda x: x['nccenr'].str.normalize('NFKD')
    .str.encode('ascii', errors='ignore')
    .str.decode('utf-8'),
    # regex=True is explicit: recent pandas versions default to literal replacement
    nccenr_noponc=lambda x: x['nccenr'].str.replace(r'[^\w\s]', ' ', regex=True),
    libelle_noponc=lambda x: x['libelle'].str.replace(r'[^\w\s]', ' ', regex=True),
    noaccent_noponc=lambda x: x['noaccent'].str.replace(r'[^\w\s]', ' ', regex=True),
    uppercase=lambda x: x.index,
    nccenr_uppercase=lambda x: x['nccenr'].str.upper(),
    libelle_uppercase=lambda x: x['libelle'].str.upper(),
    noaccent_uppercase=lambda x: x['noaccent'].str.upper(),
    nccenr_noponc_uppercase=lambda x: x['nccenr_noponc'].str.upper(),
    libelle_noponc_uppercase=lambda x: x['libelle_noponc'].str.upper(),
    noaccent_noponc_uppercase=lambda x: x['noaccent_noponc'].str.upper(),
    nccenr_lowercase=lambda x: x['nccenr'].str.lower(),
    libelle_lowercase=lambda x: x['libelle'].str.lower(),
    noaccent_lowercase=lambda x: x['noaccent'].str.lower(),
    nccenr_noponc_lowercase=lambda x: x['nccenr_noponc'].str.lower(),
    libelle_noponc_lowercase=lambda x: x['libelle_noponc'].str.lower(),
    noaccent_noponc_lowercase=lambda x: x['noaccent_noponc'].str.lower(),
    nccenr_noarrond1=lambda x: x['nccenr'].str.replace(
        'er Arrondissement', ''),
    uppercase_noarrond1=lambda x: x['uppercase'].str.replace(
        'ER ARRONDISSEMENT', ''),
    lowercase_noarrond1=lambda x: x['nccenr_lowercase'].str.replace(
        'er arrondissement', ''),
    nccenr_noarrond=lambda x: x['nccenr'].str.replace('e Arrondissement', ''),
    uppercase_noarrond=lambda x: x['uppercase'].str.replace(
        'E ARRONDISSEMENT', ''),
    lowercase_noarrond=lambda x: x['nccenr_lowercase'].str.replace(
        'e arrondissement', ''),
)
)

for n in communes.columns:
    var_ = '{}_ST'.format(n)
    var_1 = '{}_st'.format(n)
    var_2 = '{}_St'.format(n)
    
    communes[var_] = communes[n].str.replace('SAINT', 'ST')
    communes[var_1] = communes[n].str.replace('Saint', 'st')
    communes[var_2] = communes[n].str.replace('Saint', 'St')
    
    var_ = '{}_Sbar'.format(n)
    var_1 = '{}_sbar'.format(n)
    
    communes[var_] = communes[n].str.replace('SUR', 'S/')
    communes[var_1] = communes[n].str.replace('sur', 's/')
    
communes = (communes
            .stack()
            .rename('possibilite')
            .reset_index()
            .drop(columns='level_1')
            .drop_duplicates(subset=['possibilite']))
communes.head()

Unnamed: 0,ncc,possibilite
0,ABERGEMENT CLEMENCIAT,Abergement-Clémenciat
1,ABERGEMENT CLEMENCIAT,L'Abergement-Clémenciat
2,ABERGEMENT CLEMENCIAT,Abergement-Clemenciat
3,ABERGEMENT CLEMENCIAT,Abergement Clémenciat
4,ABERGEMENT CLEMENCIAT,L Abergement Clémenciat
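
To make the variant table easier to reason about, here is a minimal plain-Python sketch of the same idea applied to a single commune name. It covers only a subset of the transformations (accents, punctuation, case, "Saint" abbreviation); `commune_variants` is a hypothetical helper, not part of the notebook.

```python
import re
import unicodedata


def commune_variants(name):
    """Return a set of plausible spellings for one commune name."""
    # Strip accents: decompose, then drop the non-ASCII combining marks.
    no_accent = (unicodedata.normalize("NFKD", name)
                 .encode("ascii", errors="ignore")
                 .decode("ascii"))
    variants = {name, no_accent}
    # Replace punctuation (hyphens, apostrophes) with a space.
    variants |= {re.sub(r"[^\w\s]", " ", v) for v in variants}
    # Add upper- and lower-case forms.
    variants |= {v.upper() for v in variants} | {v.lower() for v in variants}
    # Abbreviate "Saint" as some registries do.
    variants |= {v.replace("SAINT", "ST").replace("Saint", "St") for v in variants}
    return variants


spellings = commune_variants("Châtillon-sur-Chalaronne")
```

Each spelling in the set maps back to the same `ncc` key, which is what lets a free-text INPI city name be joined to the INSEE label.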


In [6]:
#insee = AWS_connection.url_instance_bucket(path_file = 'INSEE/Stock/ETS/StockEtablissement_utf8.csv')
#ets = AWS_connection.url_instance_bucket(path_file = 'INPI/TC_1/Stock_processed/initial_ETS.gz')
#ets

In [7]:
insee = r"\temp_local\StockEtablissement_utf8.csv"
ets = r"\temp_local\initial_ETS.gz"

In [8]:
# load data into dataframes
data_insee_ = dd.read_csv(insee,
                          usecols=['siren',
                                   'siret',
                                   "numeroVoieEtablissement",
                                   "indiceRepetitionEtablissement",
                                   "typeVoieEtablissement",
                                   "libelleVoieEtablissement",
                                   "complementAdresseEtablissement",
                                   "codeCommuneEtablissement",
                                   "libelleCommuneEtablissement",
                                   "codePostalEtablissement",
                                   "codeCedexEtablissement",
                                   "libelleCedexEtablissement",
                                   "distributionSpecialeEtablissement",
                                   "libelleCommuneEtrangerEtablissement",
                                   "codePaysEtrangerEtablissement",
                                   "libellePaysEtrangerEtablissement",
                                   "dateCreationEtablissement"
                                   ],
                          dtype={'siren': 'object',
                                 'siret': 'object',
                                 "numeroVoieEtablissement":'object',
                                   "indiceRepetitionEtablissement":'object',
                                   "typeVoieEtablissement":'object',
                                   "libelleVoieEtablissement":'object',
                                   "complementAdresseEtablissement":'object',
                                   "codeCommuneEtablissement":'object',
                                   "libelleCommuneEtablissement":'object',
                                   "codePostalEtablissement":'object',
                                   "codeCedexEtablissement":'object',
                                   "libelleCedexEtablissement":'object',
                                   "distributionSpecialeEtablissement":'object',
                                   "libelleCommuneEtrangerEtablissement":'object',
                                   "codePaysEtrangerEtablissement":'object',
                                   "libellePaysEtrangerEtablissement":'object'
                                 }
                          )

data_ets_ = (dd.read_csv(ets,
                         usecols=[
                             'Type',
                             'Siren',
                             'Code_Postal',
                             'Code_Commune',
                             'Adresse_Ligne1',
                             'Adresse_Ligne2',
                             'Adresse_Ligne3',
                             'Ville',
                             'Pays'
                         ],
                         dtype={
                             'Type': 'object',
                             'Siren': 'object',
                             'Code_Postal': 'object',
                             'Code_Commune': 'object',
                             'Adresse_Ligne1': 'object',
                             'Adresse_Ligne2': 'object',
                             'Adresse_Ligne3': 'object',
                             'Ville':'object',
                             'Pays':'object'
                         },
                         compression='gzip',
                         blocksize=None,
                         low_memory=False
                         )
             .compute()
             .rename(columns={"Siren": "siren"})
             .loc[lambda x: ~x['Type'].isin(['SIE'])]
             )

In [9]:
data_insee_ = (data_insee_
               .compute()
               .loc[lambda x: x['dateCreationEtablissement'] <= "2018-01-01"])

In [10]:
data_insee_.shape

(23683869, 17)

In [11]:
data_ets_.shape

(5204825, 9)

In [12]:
siren_inpi = data_ets_['siren'].drop_duplicates()
len(siren_inpi)

4711793

In [13]:
len(siren_inpi)/data_ets_.shape[0]

0.9052740485991364

In [14]:
subset_insee = data_insee_.loc[lambda x: x['siren'].isin(siren_inpi)]
subset_insee.shape

(7480120, 17)

SIREN present in INPI but not in INSEE -> the company has exercised its right to object with INSEE, so its data cannot be published.

In [15]:
siren_to_remove = siren_inpi.loc[lambda x : ~x.isin(subset_insee['siren'])]
len(siren_to_remove)

201794

In [16]:
df_siren_to_find = data_ets_.loc[lambda x:
                                 (~x['siren'].isin(siren_to_remove))
                                 #| (~x['Adresse_Ligne1'].isin([np.nan]))
                                 #& (~x['Adresse_Ligne2'].isin([np.nan]))
                                 #& (~x['Adresse_Ligne3'].isin([np.nan]))
                                 #& (~x['Code_Postal'].isin([np.nan]))
                                 #& (~x['Ville'].isin([np.nan]))
                                 #& (~x['Code_Commune'].isin([np.nan]))
                                 ]
len(df_siren_to_find)

4986974

### NaN in the matching variables

We exclude the SIREN whose candidate variables are all NaN

-> they will be handled later

In [17]:
import numpy as np
siren_fullna = df_siren_to_find.loc[lambda x:
                      (x['Adresse_Ligne1'].isin([np.nan]))
                     & (x['Adresse_Ligne2'].isin([np.nan]))
                     & (x['Adresse_Ligne3'].isin([np.nan]))
                     & (x['Code_Postal'].isin([np.nan]))
                     & (x['Ville'].isin([np.nan]))
                     & (x['Code_Commune'].isin([np.nan]))
                     ]['siren']

In [18]:
df_siren_to_find = df_siren_to_find.loc[lambda x:
                                 (~x['siren'].isin(siren_fullna))
                                 ]
len(siren_fullna)

6335

In [19]:
df_siren_to_find.shape[0]

4979272

### Number of establishments per SIREN (INSEE)

We compute the number of establishments per SIREN in the INSEE file.

In [20]:
subset_insee_count = subset_insee.merge(
    (subset_insee
     .groupby('siren')['siren']
     .count()
     .rename('count')
     .reset_index())
)

We compute the number of establishments per SIREN in the INPI file.

In [21]:
df_siren_to_find = df_siren_to_find.merge(
    (df_siren_to_find
     .groupby('siren')['siren']
     .count()
     .rename('count')
     .reset_index())
)

INSEE: replace hyphens in the city name with spaces

In [22]:
subset_insee_count['libelleCommuneEtablissement'] = subset_insee_count['libelleCommuneEtablissement'].str.replace('-', ' ')

In [23]:
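def siren_unique(df):
    """
    Print the number of rows in `df` and return a styled table giving,
    for each number of rows per SIREN (first 10 values), how many SIREN
    have that count (`total_count`) and their share (`pct`).
    """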
    print("Nombre total obs: {}".format(len(df)))
    count_ = (df
              .groupby('siren')['siren']
              .count()
              .rename('count')
              .reset_index()
              .groupby('count')['count']
              .count()
              .reset_index(name='total_count')
              .set_index('count')
              # .compute()
              .assign(pct=lambda x: x/x.sum())
              .iloc[:10, :]
              .style
              .format('{:,.2%}', subset=['pct'])
              )
    return count_

Quick stat

In [24]:
siren_unique(df = subset_insee_count)

Nombre total obs: 7480120


count,total_count,pct
1,3019633,66.95%
2,979757,21.72%
3,301250,6.68%
4,108365,2.40%
5,44389,0.98%
6,20116,0.45%
7,10204,0.23%
8,5818,0.13%
9,3665,0.08%
10,2509,0.06%


In [25]:
siren_unique(df = df_siren_to_find)

Nombre total obs: 4979272


count,total_count,pct
1,4312053,95.75%
2,142484,3.16%
3,25170,0.56%
4,8708,0.19%
5,4226,0.09%
6,2495,0.06%
7,1558,0.03%
8,1113,0.02%
9,794,0.02%
10,617,0.01%


## Step 0: Clean the city name

We add a match against the communes table to recover the INSEE commune label.

WARNING: the city variable in INSEE must be cleaned. See the file `communes.xlsx` for the different problems

e.g.:
- CEDEX, cedex, digit, (d+),

Note that the arrondissement may be written in parentheses

- MARSEILLE (7E)

- process:
    - create a variable holding only the digits
    - rebuild `ville2` when `test` is not NaN, to keep the arrondissement
    - strip the remaining noise from the city name with a regex
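
The three steps above can be sketched on a single string. This is a simplified illustration under stated assumptions: the `NOISE` pattern is a shortened stand-in for the notebook's regex, `clean_ville` is a hypothetical helper, and the digits are appended without a separator to mirror how `ville2` is built in the notebook.

```python
import re

# Simplified noise pattern (assumption): CEDEX markers, parenthesised text,
# digits, dots and a leading "commune de".
NOISE = re.compile(r"cedex|\([^)]*\)|\d+|\.|commune de ", flags=re.IGNORECASE)


def clean_ville(ville):
    """Return (ville_clean, ville2) for one raw city string."""
    # Step 1: keep the first run of digits found in the raw string.
    digits = re.search(r"\d+", ville)
    # Step 3: strip the noise, then collapse leftover whitespace.
    ville_clean = re.sub(r"\s+", " ", NOISE.sub("", ville)).strip()
    # Step 2: one or two digits are treated as an arrondissement number.
    if digits and len(digits.group()) <= 2:
        ville2 = ville_clean + digits.group()
    else:
        ville2 = ville_clean
    return ville_clean, ville2


marseille = clean_ville("MARSEILLE (7E)")
lyon = clean_ville("69003 LYON")
```

Longer digit runs, such as a postal code typed into the city field, are dropped rather than appended, since they do not identify an arrondissement.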

In [26]:
regex = r'CEDEX|cedex|Cedex|\([^)]*\)|/\s\s+/|^\d+\s|\s\d+\s|\s\d+$|\d+|\.|\--|COMMUNE DE |COMMUNE DE|commune de |commune de|Commune de |Commune de |\s$'
test_adress = df_siren_to_find.copy()
test_adress['test'] = test_adress['Ville'].str.extract(r'(\d+)')
# regex=True is explicit: recent pandas versions default to literal replacement
test_adress['Ville_clean'] = test_adress['Ville'].str.replace(regex, '', regex=True)
test_adress['Ville_clean'] = test_adress['Ville_clean'].str.replace(r'\s$|\s^', '', regex=True)
test_adress['ville2'] = np.where(
    np.logical_and(
         ~test_adress['test'].isin([np.nan]),
        test_adress['test'].str.len() <=2
    )
   ,
    test_adress['Ville_clean'] + '' + test_adress['test'].astype(str),
    test_adress['Ville_clean']
)

test_adress = test_adress.merge(communes,
                         left_on='ville2',
                         right_on='possibilite',
                         how='left',
                         indicator=True)

test_adress = pd.concat([
    test_adress.loc[lambda x: x['_merge'].isin(['both'])],
    (test_adress
     .loc[lambda x: x['_merge'].isin(['left_only'])]
     .drop(columns=['ncc', 'possibilite', '_merge'])
     .merge(communes,
            left_on='Ville_clean',
            right_on='possibilite',
            how='left',
            indicator=True)
     )

])

test_adress = pd.concat([
    test_adress.loc[lambda x: x['_merge'].isin(['both'])],
    (test_adress
     .loc[lambda x: x['_merge'].isin(['left_only'])]
     .drop(columns=['ncc', 'possibilite', '_merge'])
     .assign(
         noaccent=lambda x: x['Ville_clean'].str.normalize('NFKD')
         .str.encode('ascii', errors='ignore')
         .str.decode('utf-8'))
     ).merge(communes,
             left_on='noaccent',
             right_on='possibilite',
             how='left',
             indicator=True)])
test_adress.groupby('_merge')["_merge"].count()


_merge
left_only      202980
right_only          0
both          4776292
Name: _merge, dtype: int64

In [27]:
test_adress = test_adress.drop(columns = '_merge')

In [28]:
test_adress.shape

(4979272, 16)

In [29]:
test_adress.head()

Unnamed: 0,siren,Type,Adresse_Ligne1,Adresse_Ligne2,Adresse_Ligne3,Code_Postal,Ville,Code_Commune,Pays,count,test,Ville_clean,ville2,ncc,possibilite,noaccent
0,813543063,PRI,,RD 1075,la Gare,1500,Ambronay,1007,FRANCE,1,,Ambronay,Ambronay,AMBRONAY,Ambronay,
1,800897092,SEP,,640 route de Montgrimoux,,1570,Feillens,1159,FRANCE,1,,Feillens,Feillens,FEILLENS,Feillens,
2,440432052,SEP,,13 rue du Plat,,1500,Ambronay,1007,FRANCE,1,,Ambronay,Ambronay,AMBRONAY,Ambronay,
3,534728381,SEP,,chemin des Bourbouillons,,1250,Drom,1150,FRANCE,1,,Drom,Drom,DROM,Drom,
4,750647851,SEP,,638 rue Aimé Bonneville,,1200,Chatillon en Michaille,1091,FRANCE,1,,Chatillon en Michaille,Chatillon en Michaille,CHATILLON EN MICHAILLE,Chatillon en Michaille,


## Process

We only match SIREN whose creation date is before 2018 (the filter keeps `dateCreationEtablissement <= "2018-01-01"`)

1) ~Step: count the number of `nan` in the matching columns~

2) ~Step: count the number of SIRET per SIREN~

2) Step 2: merge on siren and postal code

3) Step 3: merge on siren and commune code



### Step 1: Match only the SIREN with a single row in both INSEE and INPI

We remove the matched rows from the `df_siren_to_find` dataframe and add the `left_only` rows.

The same is done for INSEE to save memory.
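
As an illustration of this step, the sketch below runs the same kind of merge on toy data. All values are invented, and the explicit `on=["siren", "count"]` is an assumption about which columns the two frames share.

```python
import pandas as pd

# Toy data, invented for illustration only.
insee = pd.DataFrame({
    "siren": ["111", "222", "333", "333"],
    "siret": ["11100011", "22200012", "33300013", "33300021"],
    "count": [1, 1, 2, 2],
})
inpi = pd.DataFrame({
    "siren": ["111", "333"],
    "Ville": ["Ambronay", "Feillens"],
    "count": [1, 1],
})

# Keep SIREN with a single row on each side, then merge on the shared columns.
m1 = (insee.loc[lambda x: x["count"] == 1]
      .merge(inpi.loc[lambda x: x["count"] == 1],
             on=["siren", "count"], how="left", indicator=True))

# SIREN found on both sides are matched; the rest stay in the INPI pool.
matched = m1.loc[lambda x: x["_merge"] == "both", "siren"].tolist()
remaining = inpi.loc[lambda x: ~x["siren"].isin(matched)]
```

Here SIREN `333` stays in `remaining` because it has two establishments on the INSEE side, so it cannot be resolved by this step.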

In [30]:
m1_unique = (
    subset_insee_count.loc[lambda x: x['count'].isin([1])]
 .merge(test_adress.loc[lambda x: x['count'].isin([1])],
         how='left',indicator=True)
       )

In [31]:
m1_unique.groupby('_merge')["_merge"].count()

_merge
left_only        9729
right_only          0
both          3009904
Name: _merge, dtype: int64

In [32]:
to_remove_ = m1_unique.loc[lambda x: x['_merge'].isin(['both'])]['siren'].to_list()

In [33]:
test_adress = test_adress.loc[lambda x: ~x['siren'].isin(to_remove_)]
#subset_insee_count = subset_insee_count.loc[lambda x: ~x['siren'].isin(to_remove_)]

In [34]:
siren_unique(df = test_adress)

Nombre total obs: 1969368


count,total_count,pct
1,1302149,87.17%
2,142484,9.54%
3,25170,1.68%
4,8708,0.58%
5,4226,0.28%
6,2495,0.17%
7,1558,0.10%
8,1113,0.07%
9,794,0.05%
10,617,0.04%


In [35]:
siren_unique(df = subset_insee_count)

Nombre total obs: 7480120


count,total_count,pct
1,3019633,66.95%
2,979757,21.72%
3,301250,6.68%
4,108365,2.40%
5,44389,0.98%
6,20116,0.45%
7,10204,0.23%
8,5818,0.13%
9,3665,0.08%
10,2509,0.06%


Example of SIREN that have only one row in INPI but several SIRET in INSEE.

In [36]:
test_adress.loc[lambda x: x['count'] ==1].head(2)

Unnamed: 0,siren,Type,Adresse_Ligne1,Adresse_Ligne2,Adresse_Ligne3,Code_Postal,Ville,Code_Commune,Pays,count,test,Ville_clean,ville2,ncc,possibilite,noaccent
0,813543063,PRI,,RD 1075,la Gare,1500,Ambronay,1007,FRANCE,1,,Ambronay,Ambronay,AMBRONAY,Ambronay,
1,800897092,SEP,,640 route de Montgrimoux,,1570,Feillens,1159,FRANCE,1,,Feillens,Feillens,FEILLENS,Feillens,


In [37]:
subset_insee_count.loc[lambda x: x['siren'].isin(['813543063'])]

Unnamed: 0,siren,siret,dateCreationEtablissement,complementAdresseEtablissement,numeroVoieEtablissement,indiceRepetitionEtablissement,typeVoieEtablissement,libelleVoieEtablissement,codePostalEtablissement,libelleCommuneEtablissement,libelleCommuneEtrangerEtablissement,distributionSpecialeEtablissement,codeCommuneEtablissement,codeCedexEtablissement,libelleCedexEtablissement,codePaysEtrangerEtablissement,libellePaysEtrangerEtablissement,count
6866695,813543063,81354306300016,2015-10-01,VIA ALFIERI MASERATI SN - CAP -,,,,ZONE INDUSTRIALE,,,95100 BLOCCO PALMA - CATANIA,,,,,99127.0,ITALIE,3
6866696,813543063,81354306300024,2015-10-01,LA GARE,,,,RD 1075,1500.0,AMBRONAY,,,1007.0,,,,,3
6866697,813543063,81354306300032,2016-05-24,I 0616135,109.0,,,PONTE SAN GIOVANNI VIA ADRIATICA,,,PERUGIA,,,,,99127.0,ITALIE,3


In [38]:
subset_insee_count.loc[lambda x: x['siren'].isin(['800897092'])]

Unnamed: 0,siren,siret,dateCreationEtablissement,complementAdresseEtablissement,numeroVoieEtablissement,indiceRepetitionEtablissement,typeVoieEtablissement,libelleVoieEtablissement,codePostalEtablissement,libelleCommuneEtablissement,libelleCommuneEtrangerEtablissement,distributionSpecialeEtablissement,codeCommuneEtablissement,codeCedexEtablissement,libelleCedexEtablissement,codePaysEtrangerEtablissement,libellePaysEtrangerEtablissement,count
6402442,800897092,80089709200014,2014-03-06,,26,,RTE,D'OSLON,71380,SAINT MARCEL,,,71445,,,,,2
6402443,800897092,80089709200022,2015-07-03,,640,,RTE,DE MONTGRIMOUX,1570,FEILLENS,,,1159,,,,,2


In [39]:
subset_insee_count.isna().sum().sort_values()

siren                                        0
dateCreationEtablissement                    0
count                                        0
siret                                        0
libelleCommuneEtablissement               9691
codeCommuneEtablissement                  9691
codePostalEtablissement                   9691
libelleVoieEtablissement                 97530
typeVoieEtablissement                  1075153
numeroVoieEtablissement                1652820
complementAdresseEtablissement         5887798
indiceRepetitionEtablissement          7188456
distributionSpecialeEtablissement      7408647
codeCedexEtablissement                 7409543
libelleCedexEtablissement              7411252
libellePaysEtrangerEtablissement       7470432
codePaysEtrangerEtablissement          7470432
libelleCommuneEtrangerEtablissement    7472315
dtype: int64

In [40]:
test_adress.isna().sum().sort_values()

siren                   0
Type                    0
count                   0
Pays                   33
Ville                1245
Ville_clean          1245
ville2               1245
Code_Postal          1737
Code_Commune        65614
ncc                 77234
possibilite         77234
Adresse_Ligne2     803945
Adresse_Ligne1    1038866
Adresse_Ligne3    1676779
noaccent          1889395
test              1960221
dtype: int64

### Step 2: Merging

In this part, we merge on several candidates. Most SIREN can be matched directly on postal code, commune code, or city. If a SIREN has several SIRET in the same city or postal code, it will need a more thorough search.

The merge produces three cases:

- 1) High-confidence merge
- 2) Medium-confidence merge -> several SIRET for the same candidate
- 3) Unmerged
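
The three cases can be sketched in plain Python by counting how many INSEE SIRET share the same (siren, key) pair as an INPI row. `classify_matches` and its labels are hypothetical; the first two SIREN reuse rows shown in this notebook, the third is invented.

```python
from collections import Counter


def classify_matches(inpi_rows, insee_rows):
    """Label each INPI (siren, key) pair as strong, medium or unmatched."""
    # Number of INSEE establishments per (siren, key) pair.
    siret_per_key = Counter((siren, key) for siren, key, _siret in insee_rows)
    labels = {}
    for siren, key in inpi_rows:
        n = siret_per_key.get((siren, key), 0)
        if n == 1:
            labels[(siren, key)] = "strong"      # case 1: a single SIRET
        elif n > 1:
            labels[(siren, key)] = "medium"      # case 2: several SIRET
        else:
            labels[(siren, key)] = "unmatched"   # case 3: no SIRET
    return labels


insee_rows = [
    ("200000560", "ISSY LES MOULINEAUX", "20000056000017"),
    ("200000560", "ISSY LES MOULINEAUX", "20000056000025"),
    ("800897092", "SAINT MARCEL", "80089709200014"),
    ("800897092", "FEILLENS", "80089709200022"),
]
inpi_rows = [
    ("200000560", "ISSY LES MOULINEAUX"),
    ("800897092", "FEILLENS"),
    ("999999999", "PARIS"),  # invented SIREN with no INSEE row
]
labels = classify_matches(inpi_rows, insee_rows)
```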

#### 1: merge on siren and city

- Merge on siren & libelleCommuneEtablissement|Ville_clean

In [41]:
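def merge(df_insee, df_inpi, left_on, right_on):
    """
    Right-merge `df_insee` onto `df_inpi` on the given keys and summarise
    the result: distribution of SIRET per (siren, ncc), merge detail,
    SIREN with several SIRET for the same candidate (case 2), and the
    INPI rows left unmatched.
    """
    # match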
    data_merged_1 = (df_insee
                     .merge(
                         df_inpi,
                         how='right',
                         left_on=left_on,
                         right_on=right_on,
                         indicator=True,
                         suffixes=['_insee', '_inpi'])
                     )

    # count
    count_ = (data_merged_1
              .loc[lambda x: x['_merge'].isin(['both'])]
              .groupby(['siren', 'ncc'])['siren']
              .count()
              .rename('count')
              .reset_index()
              .groupby('count')['count']
              .count()
              .reset_index(name='total_count')
              .set_index('count')
              .assign(pct=lambda x: x/x.sum())
              .iloc[:10, :]
              .style
              .format('{:,.2%}', subset=['pct'])
              )

    # detail match
    detail = data_merged_1.groupby('_merge')["_merge"].count()

    # case 2: several SIRET for the same candidate
    siren_fig2 = (data_merged_1
                  .loc[lambda x: x['_merge'].isin(['both'])]
                  .groupby(['siren', 'ncc'])['siren']
                  .count()
                  .rename('count')
                  .loc[lambda x:x > 1]
                  .reset_index('ncc')
                  .index
                  )

    # unmatched
    siren_nmatched = (data_merged_1
                      .loc[lambda x: x['_merge'].isin(['right_only'])]['siren']
                      .to_list()
                      )
    new_unmatch = df_inpi.loc[lambda x: x['siren'].isin(siren_nmatched)]

    dic_ = {

        'count_': count_,
        'detail': detail,
        'siren_fig': siren_fig2,
        'size_fig2': len(siren_fig2),
        'new_unmatch': new_unmatch,
    }

    return dic_

In [42]:
test_city = merge(df_insee =subset_insee_count,
      df_inpi =test_adress,
      left_on=['siren', 'libelleCommuneEtablissement'],
      right_on=['siren', 'ncc'])

In [43]:
test_city['detail']

_merge
left_only           0
right_only     488966
both          2404212
Name: _merge, dtype: int64

In [44]:
test_city['size_fig2']

482994

In [45]:
test_city['count_']

count,total_count,pct
1,923735,65.67%
2,368745,26.21%
3,51847,3.69%
4,38090,2.71%
5,2175,0.15%
6,9324,0.66%
7,237,0.02%
8,2488,0.18%
9,3098,0.22%
10,810,0.06%


In [46]:
test_city['new_unmatch'].shape

(738732, 16)

Example of case 2: medium-confidence merge -> several SIRET for the same candidate:

- 200000560

In [47]:
subset_insee_count.loc[lambda x: x['siren'].isin(['200000560'])]

Unnamed: 0,siren,siret,dateCreationEtablissement,complementAdresseEtablissement,numeroVoieEtablissement,indiceRepetitionEtablissement,typeVoieEtablissement,libelleVoieEtablissement,codePostalEtablissement,libelleCommuneEtablissement,libelleCommuneEtrangerEtablissement,distributionSpecialeEtablissement,codeCommuneEtablissement,codeCedexEtablissement,libelleCedexEtablissement,codePaysEtrangerEtablissement,libellePaysEtrangerEtablissement,count
75,200000560,20000056000017,2005-12-09,,35,,RUE,KLEBER,92130,ISSY LES MOULINEAUX,,,92040,,,,,2
76,200000560,20000056000025,2012-04-16,,71,,BD,GALLIENI,92130,ISSY LES MOULINEAUX,,,92040,,,,,2


In [48]:
test_adress.loc[lambda x: x['siren'].isin(['200000560'])]

Unnamed: 0,siren,Type,Adresse_Ligne1,Adresse_Ligne2,Adresse_Ligne3,Code_Postal,Ville,Code_Commune,Pays,count,test,Ville_clean,ville2,ncc,possibilite,noaccent
4709810,200000560,PRI,71 Boulevard Galliéni,,,92130,Issy-les-Moulineaux,92040,FRANCE,1,,Issy-les-Moulineaux,Issy-les-Moulineaux,ISSY LES MOULINEAUX,Issy-les-Moulineaux,


In [49]:
siren_unique(df = test_city['new_unmatch'])

Nombre total obs: 738732


count,total_count,pct
1,329101,79.67%
2,56542,13.69%
3,11663,2.82%
4,4595,1.11%
5,2482,0.60%
6,1608,0.39%
7,1080,0.26%
8,795,0.19%
9,612,0.15%
10,486,0.12%


#### 2: merge on siren and postal code

In [50]:
test_cp = merge(df_insee =subset_insee_count,
      df_inpi = test_city['new_unmatch'],
      left_on= ['siren', 'codePostalEtablissement'],
      right_on= ['siren', 'Code_Postal'])

In [51]:
test_cp['detail']

_merge
left_only           0
right_only      46035
both          1218393
Name: _merge, dtype: int64

In [52]:
test_cp['size_fig2']

180110

In [53]:
test_cp['count_']

count,total_count,pct
1,380282,67.86%
2,123274,22.00%
3,21209,3.78%
4,17855,3.19%
5,2132,0.38%
6,5190,0.93%
7,652,0.12%
8,1790,0.32%
9,1764,0.31%
10,801,0.14%


In [54]:
siren_unique(df = test_cp['new_unmatch'])

Nombre total obs: 215318


count,total_count,pct
1,19849,56.23%
2,8273,23.44%
3,2009,5.69%
4,937,2.65%
5,530,1.50%
6,403,1.14%
7,335,0.95%
8,252,0.71%
9,184,0.52%
10,185,0.52%


#### 3: merge on siren and commune code

- Merge on siren & codeCommuneEtablissement

In [55]:
test_com = merge(df_insee =subset_insee_count,
      df_inpi = test_cp['new_unmatch'],
      left_on= ['siren', 'codeCommuneEtablissement'],
      right_on= ['siren', 'Code_Commune'])

In [56]:
test_com['detail']

_merge
left_only          0
right_only     33027
both          459365
Name: _merge, dtype: int64

In [57]:
test_com['size_fig2']

39446

In [58]:
test_com['count_']

count,total_count,pct
1,101513,72.02%
2,20075,14.24%
3,4174,2.96%
4,7019,4.98%
5,475,0.34%
6,2017,1.43%
7,166,0.12%
8,708,0.50%
9,1033,0.73%
10,328,0.23%


In [59]:
siren_unique(df = test_com['new_unmatch'])

Nombre total obs: 164068


count,total_count,pct
1,9274,49.34%
2,5626,29.93%
3,1089,5.79%
4,472,2.51%
5,241,1.28%
6,170,0.90%
7,147,0.78%
8,116,0.62%
9,85,0.45%
10,86,0.46%


## To check

The case 2 matches and the unmatched rows still need to be checked

In [60]:
### Total to match before city/postal code/commune
test_adress.shape[0]

1969368

In [None]:
### total case 2

In [61]:
test_city['size_fig2'] + test_cp['size_fig2']+ test_com['size_fig2']

702550

In [62]:
### total unmatched
test_com['new_unmatch'].shape[0]

164068

In [63]:
### total number of SIREN to match

test_city['size_fig2'] + test_cp['size_fig2']+ test_com['size_fig2'] + test_com['new_unmatch'].shape[0]

866618

Check that the number to match does correspond to 2M

In [93]:
pd.concat(
    [
        df_siren_to_find.loc[lambda x: x['siren'].isin(test_city['siren_fig'])],
        df_siren_to_find.loc[lambda x: x['siren'].isin(test_cp['siren_fig'])],
        df_siren_to_find.loc[lambda x: x['siren'].isin(test_com['siren_fig'])],
        test_com['new_unmatch'] 
    ]

).shape

(1487311, 16)

In [74]:
test_city['size_fig2']

482994

In [64]:
(866618 / df_siren_to_find.shape[0]) * 100

17.404512145550594

In [65]:
### Matched with city/postal code/commune
test_adress.shape[0] - test_city['size_fig2'] + test_cp['size_fig2'] + \
test_com['size_fig2'] - test_com['new_unmatch'].shape[0]

1541862

In [66]:
test_adress.loc[lambda x : x['siren'].isin(['200000560'])]

Unnamed: 0,siren,Type,Adresse_Ligne1,Adresse_Ligne2,Adresse_Ligne3,Code_Postal,Ville,Code_Commune,Pays,count,test,Ville_clean,ville2,ncc,possibilite,noaccent
4709810,200000560,PRI,71 Boulevard Galliéni,,,92130,Issy-les-Moulineaux,92040,FRANCE,1,,Issy-les-Moulineaux,Issy-les-Moulineaux,ISSY LES MOULINEAUX,Issy-les-Moulineaux,


In [67]:
subset_insee_count.loc[lambda x : x['siren'].isin(['200000560'])]

Unnamed: 0,siren,siret,dateCreationEtablissement,complementAdresseEtablissement,numeroVoieEtablissement,indiceRepetitionEtablissement,typeVoieEtablissement,libelleVoieEtablissement,codePostalEtablissement,libelleCommuneEtablissement,libelleCommuneEtrangerEtablissement,distributionSpecialeEtablissement,codeCommuneEtablissement,codeCedexEtablissement,libelleCedexEtablissement,codePaysEtrangerEtablissement,libellePaysEtrangerEtablissement,count
75,200000560,20000056000017,2005-12-09,,35,,RUE,KLEBER,92130,ISSY LES MOULINEAUX,,,92040,,,,,2
76,200000560,20000056000025,2012-04-16,,71,,BD,GALLIENI,92130,ISSY LES MOULINEAUX,,,92040,,,,,2


### Matching on address

Only the rows not yet matched are selected.
To speed up the lookup, only the subset of SIREN to check is kept from the INSEE file.

Check whether matching on the street number is possible -> in case the principal and secondary establishments are at the same place
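A minimal sketch of a street-number match, using the SIREN 200000560 rows shown above. It assumes the street number is the leading run of digits in the INPI address line and that `numeroVoieEtablissement` is stored as a string (in the real file it may be read as a float such as `13.0`, which would need converting first). The column name `numero` is introduced here for illustration only.

```python
import pandas as pd

inpi = pd.DataFrame({
    "siren": ["200000560"],
    "Adresse_Ligne1": ["71 Boulevard Galliéni"],
})
insee = pd.DataFrame({
    "siren": ["200000560", "200000560"],
    "siret": ["20000056000017", "20000056000025"],
    "numeroVoieEtablissement": ["35", "71"],
})

# Assumption: the leading digits of the INPI line are the street number
inpi["numero"] = inpi["Adresse_Ligne1"].str.extract(r"^\s*(\d+)", expand=False)

# Join on SIREN and street number; unmatched INPI rows keep a missing siret
matched = inpi.merge(
    insee,
    left_on=["siren", "numero"],
    right_on=["siren", "numeroVoieEtablissement"],
    how="left",
)
print(matched[["siren", "Adresse_Ligne1", "siret"]])
```

If two establishments of the same SIREN share a street number, this join returns several rows, so it can only narrow down the candidates.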

The addresses need to be reworked:

- Upper case

The lookup is done on the street label. In the INSEE file, this label holds neither the street number nor the street type (rue, avenue, etc.)

- ALL: Allée
- AV: Avenue
- BD: Boulevard
- CAR: Carrefour
- CHE: Chemin
- CHS: Chaussée
- CITE: Cité
- COR: Corniche
- CRS: Cours
- DOM: Domaine
- DSC: Descente
- ECA: Ecart
- ESP: Esplanade
- FG: Faubourg
- GR: Grande Rue
- HAM: Hameau
- HLE: Halle
- IMP: Impasse
- LD: Lieu dit
- LOT: Lotissement
- MAR: Marché
- MTE: Montée
- PAS: Passage
- PL: Place
- PLN: Plaine
- PLT: Plateau
- PRO: Promenade
- PRV: Parvis
- QUA: Quartier
- QUAI: Quai
- RES: Résidence
- RLE: Ruelle
- ROC: Rocade
- RPT: Rond Point
- RTE: Route
- RUE: Rue
- SEN: Sente - Sentier
- SQ: Square
- TPL: Terre-plein
- TRA: Traverse
- VLA: Villa
- VLGE: Village
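The codes above are the values found in `typeVoieEtablissement`, whereas the INPI address lines spell the street type out. A small sketch of how a code could be expanded back to its label, using a subset of the list; the helper name `libelle_complet` is hypothetical and not part of the notebook.

```python
# Subset of the street-type codes listed above
TYPE_VOIE = {
    "AV": "Avenue",
    "BD": "Boulevard",
    "CHE": "Chemin",
    "IMP": "Impasse",
    "PL": "Place",
    "RTE": "Route",
    "RUE": "Rue",
}


def libelle_complet(type_voie, libelle_voie):
    """Hypothetical helper: prefix the street label with its spelled-out type."""
    if not isinstance(type_voie, str):
        # Missing street type (NaN/None): return the label alone
        return libelle_voie
    # Unknown codes are kept as they are
    full = TYPE_VOIE.get(type_voie, type_voie)
    return f"{full} {libelle_voie}"


print(libelle_complet("BD", "GALLIENI"))
print(libelle_complet("CHE", "DE RONDE"))
```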

In [69]:
additional = ["Avenue",
"Boulevard",
"Carrefour",
"Chemin",
"Chaussee",
"Cite",
"Corniche",
"Cours",
"Domaine",
"Descente",
"Ecart",
"Esplanade",
"Faubourg",
"Grande Rue",
"Hameau",
"Halle",
"Impasse",
"Lieu dit",
"Lotissement",
"Marche",
"Montee",
"Passage",
"Place",
"Plaine",
"Plateau",
"Promenade",
"Parvis",
"Quartier",
"Quai",
"Residence",
"Ruelle",
"Rocade",
"Rond Point",
"Route",
"Rue",
"Sentier",
"Square",
"Terre plein",
"Traverse",
"Villa",
"Village",
'bp', 'cedex']

In [70]:
import nltk
#nltk.download('stopwords')

In [71]:
from nltk.corpus import stopwords

In [72]:
stop_words = stopwords.words('french')
stop_words.extend(additional)
upper_stop = [i.upper() for i in stop_words]
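A minimal sketch of the word-by-word filter that `upper_stop` is used for later on. To stay self-contained it replaces `stopwords.words('french')` with a short hand-written list, which is an assumption made for the example only; `drop_stop_words` is a hypothetical helper name.

```python
# Hand-written stand-in for stopwords.words('french') + additional
stop_words = ["de", "du", "la", "le", "des",
              "Rue", "Boulevard", "Grande Rue", "Rond Point"]
upper_stop = [w.upper() for w in stop_words]


def drop_stop_words(adresse, stops):
    """Keep the words of an upper-cased address that are not stop words."""
    return " ".join(w for w in adresse.split() if w not in stops)


a = drop_stop_words("RUE DU PLAT", upper_stop)
b = drop_stop_words("ROND POINT DE LA GARE", upper_stop)

# Variant: split multi-word entries so that each word is filtered on its own
upper_stop_split = {p for w in stop_words for p in w.upper().split()}
c = drop_stop_words("ROND POINT DE LA GARE", upper_stop_split)
print(a, "|", b, "|", c)
```

Because the filter compares single words, multi-word entries such as "Rond Point" or "Grande Rue" never match unless they are split first. Whether splitting them is desirable depends on the data, since it would also drop a word like GRANDE wherever it appears.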

In [85]:
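import numpy as np


def create_split_adress(x):
    """
    Split each cleaned address of a Series into a list of words.
    """
    split_ = x.str.split().to_list()
    
    #split_ = ''.join(str(e) for e in split_)
    #reg = '|'.join(split_)
    return  split_


def create_regex_adress(x):
    """
    Build a regex from a list of words: each word is anchored at the
    end of the string and the words are joined with '|'
    (['MICHEL', 'JOURDAN'] gives 'MICHEL$|JOURDAN$').
    Returns NaN when x is not a list of strings.
    """
    try:
        split_ = [i + "$" for i in x]
        reg = '|'.join(split_)
    except TypeError:
        reg = np.nan
    return  reg

def lookupInseeInpi(df_insee, siren, regex_):
    """
    Return the first siret of `siren` in df_insee whose
    libelleVoieEtablissement matches regex_ (case-insensitive),
    or NaN when nothing matches or the lookup fails.
    """
    try:
        siret_ = df_insee.loc[lambda x: 
                                  x['siren'].isin([siren])
                      & x['libelleVoieEtablissement'].str.contains(
                          regex_, 
                          case = False, 
                          regex = True)
                      ]['siret']
        return siret_.values[0]
    except Exception:
        # No match (empty result) or unusable regex
        return np.nan 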
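A short sketch of what the regex lookup does, on a toy copy of the two INSEE rows of SIREN 440432052 that appear further down in the notebook. Only the columns needed for the lookup are reproduced.

```python
import pandas as pd

insee = pd.DataFrame({
    "siren": ["440432052", "440432052"],
    "siret": ["44043205200015", "44043205200023"],
    "libelleVoieEtablissement": ["DE RONDE", "DU PLAT"],
})

# Cleaned INPI address "13 rue du Plat", split into words
words = ["PLAT"]
# Same construction as create_regex_adress
regex_ = "|".join(w + "$" for w in words)

hits = insee.loc[
    insee["siren"].isin(["440432052"])
    & insee["libelleVoieEtablissement"].str.contains(
        regex_, case=False, regex=True)
]
print(regex_, hits["siret"].tolist())
```

Since every word is anchored with `$`, only the last word of the INSEE label can produce a match. The function then keeps the first hit, so two establishments on the same street are not told apart.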

### Test with case 2 -> postal code



In [76]:
len(test_city['siren_fig'])

482994

In [79]:
def prepare_adress(df):
    """
    Clean the three INPI address lines and build one regex per line.

    For each Adresse_LigneN column, adds:
    - Adresse_LigneN_clean: accents, punctuation and digits removed,
      upper-cased, stop words and street types dropped
    - Adresse_LigneN_clean_split: list of the remaining words
    - Adresse_LigneN_clean_reg: regex built from these words
    Missing addresses end up as the string 'nan' (regex 'nan$').
    """
    #temp_adresse = m3['unmerged'].loc[lambda x: ~
    #                                  x['siren'].isin(siren_count_1_found)].copy()
    #sous_ensemble_insee = subset_insee_count.loc[lambda x: x['siren'].isin(
    #    temp_adresse['siren'].to_list())]

    cols = ['Adresse_Ligne1', 'Adresse_Ligne2', 'Adresse_Ligne3']
    temp_adresse = df.copy()

    # The three loops keep the original column order (clean, split, reg),
    # which later cells rely on through positional indexing
    for col in cols:
        clean = (temp_adresse[col]
                 .str.normalize('NFKD')
                 .str.encode('ascii', errors='ignore')
                 .str.decode('utf-8')
                 .str.replace(r'[^\w\s]|\d+', '', regex=True)
                 .str.upper())
        # Drop stop words; str(x) turns a missing value into 'nan'
        temp_adresse[col + '_clean'] = clean.apply(
            lambda x: ' '.join([word for word in str(x).split()
                                if word not in upper_stop]))

    for col in cols:
        temp_adresse[col + '_clean_split'] = create_split_adress(
            temp_adresse[col + '_clean'])

    for col in cols:
        temp_adresse[col + '_clean_reg'] = (
            temp_adresse[col + '_clean_split'].apply(create_regex_adress))

    return temp_adresse
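A minimal sketch of the normalisation chain applied to each address line, run on two addresses quoted in this notebook plus one missing value. It covers only the accent, punctuation, digit and case steps, not the stop-word removal.

```python
import pandas as pd

s = pd.Series(["71 Boulevard Galliéni", "13 rue du Plat", None])

clean = (
    s.str.normalize("NFKD")                       # split accents from letters
     .str.encode("ascii", errors="ignore")        # drop the non-ASCII accents
     .str.decode("utf-8")
     .str.replace(r"[^\w\s]|\d+", "", regex=True)  # drop punctuation and digits
     .str.upper()
)
print(clean.tolist())
```

Removing the digits leaves a leading space, which the later `split()` absorbs. The missing value stays missing at this stage and only becomes the string `'nan'` when `str(x)` is applied during stop-word removal.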

In [81]:
test1 = prepare_adress(
    df_siren_to_find.loc[lambda x: x['siren'].isin(test_city['siren_fig'])])

In [82]:
test1.shape[0]

757325

In [83]:
test1.iloc[0, :]

siren                              440432052
Type                                     SEP
Adresse_Ligne1                           NaN
Adresse_Ligne2                13 rue du Plat
Adresse_Ligne3                           NaN
Code_Postal                            01500
Ville                               Ambronay
Code_Commune                           01007
Pays                                  FRANCE
count                                      1
Adresse_Ligne1_clean                     nan
Adresse_Ligne2_clean                    PLAT
Adresse_Ligne3_clean                     nan
Adresse_Ligne1_clean_split             [nan]
Adresse_Ligne2_clean_split            [PLAT]
Adresse_Ligne3_clean_split             [nan]
Adresse_Ligne1_clean_reg                nan$
Adresse_Ligne2_clean_reg               PLAT$
Adresse_Ligne3_clean_reg                nan$
Name: 2, dtype: object

In [86]:
lookupInseeInpi(
   df_insee=  subset_insee_count, 
    siren = test1.iloc[0, 0],
    regex_ = test1.iloc[0, -2])

'44043205200023'

In [88]:
subset_insee_count.loc[lambda x: x['siren'].isin([test1.iloc[0, 0]])]

Unnamed: 0,siren,siret,dateCreationEtablissement,complementAdresseEtablissement,numeroVoieEtablissement,indiceRepetitionEtablissement,typeVoieEtablissement,libelleVoieEtablissement,codePostalEtablissement,libelleCommuneEtablissement,libelleCommuneEtrangerEtablissement,distributionSpecialeEtablissement,codeCommuneEtablissement,codeCedexEtablissement,libelleCedexEtablissement,codePaysEtrangerEtablissement,libellePaysEtrangerEtablissement,count
2955654,440432052,44043205200015,2002-01-01,AMBRONAY,,,CHE,DE RONDE,1500,AMBRONAY,,,1007,,,,,2
2955655,440432052,44043205200023,2016-04-04,,13.0,,RUE,DU PLAT,1500,AMBRONAY,,,1007,,,,,2


In [90]:
test1.loc[lambda x: x['siren'].isin([test1.iloc[0, 0]])]

Unnamed: 0,siren,Type,Adresse_Ligne1,Adresse_Ligne2,Adresse_Ligne3,Code_Postal,Ville,Code_Commune,Pays,count,Adresse_Ligne1_clean,Adresse_Ligne2_clean,Adresse_Ligne3_clean,Adresse_Ligne1_clean_split,Adresse_Ligne2_clean_split,Adresse_Ligne3_clean_split,Adresse_Ligne1_clean_reg,Adresse_Ligne2_clean_reg,Adresse_Ligne3_clean_reg
2,440432052,SEP,,13 rue du Plat,,1500,Ambronay,1007,FRANCE,1,,PLAT,,[nan],[PLAT],[nan],nan$,PLAT$,nan$


To speed up the code, only the addresses without missing values are used

#### test_1

In [97]:
test_1 = test1.loc[lambda x: ~x['Adresse_Ligne1_clean_reg'].isin(['nan$'])]
test_1.shape

(300504, 19)

In [95]:
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [100]:
%%time
test_1.head(10).apply(lambda x: lookupInseeInpi(
    df_insee=  subset_insee_count,
    siren = x['siren'],
    regex_ = x['Adresse_Ligne1_clean_reg']),
    axis = 1 )

Wall time: 1min 8s


42     33122115000088
43                NaN
50                NaN
146               NaN
167    33145569100229
171    33145569100195
172               NaN
210               NaN
220    71298043200231
221    71298043200157
dtype: object

In [103]:
%%time
test_1.head(10).parallel_apply(lambda x: lookupInseeInpi(
    df_insee = subset_insee_count,
    siren = x['siren'],
    regex_ = x['Adresse_Ligne1_clean_reg']),
    axis = 1 )

[autoreload of joblib.numpy_pickle failed: Traceback (most recent call last):
  File "C:\Users\PERNETTH\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "C:\Users\PERNETTH\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 394, in superreload
    module = reload(module)
  File "C:\Users\PERNETTH\AppData\Local\Continuum\anaconda3\lib\imp.py", line 314, in reload
    return importlib.reload(module)
  File "C:\Users\PERNETTH\AppData\Local\Continuum\anaconda3\lib\importlib\__init__.py", line 169, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 630, in _exec
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\PERNETTH\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\numpy_pick

AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
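The row-by-row lookup scans the whole INSEE frame for every INPI row (1 min 8 s for 10 rows above), and the parallel version stops on a pickling error. One possible alternative, sketched here on toy data and not taken from the notebook, is to join both frames on `siren` once and then test the regex only against the candidate rows. The toy regex `KLEBER$` is chosen so that one SIREN has no match.

```python
import re
import pandas as pd

insee = pd.DataFrame({
    "siren": ["440432052", "440432052", "200000560"],
    "siret": ["44043205200015", "44043205200023", "20000056000025"],
    "libelleVoieEtablissement": ["DE RONDE", "DU PLAT", "GALLIENI"],
})
inpi = pd.DataFrame({
    "siren": ["440432052", "200000560"],
    "Adresse_Ligne1_clean_reg": ["PLAT$", "KLEBER$"],
})

# One join on siren replaces one full scan of the INSEE frame per INPI row
candidates = inpi.merge(insee, on="siren", how="left")

# Test each regex against the street label of its own candidate rows only
candidates["hit"] = [
    isinstance(lib, str) and bool(re.search(reg, lib, flags=re.IGNORECASE))
    for reg, lib in zip(candidates["Adresse_Ligne1_clean_reg"],
                        candidates["libelleVoieEtablissement"])
]

# Keep the first matching siret per siren, as lookupInseeInpi does
found = (candidates.loc[candidates["hit"]]
         .drop_duplicates("siren")
         .set_index("siren")["siret"])
result = inpi["siren"].map(found)
print(result.tolist())
```

This sketch assumes one INPI row per SIREN. With several address rows per SIREN, the join key would need an extra row identifier so that each INPI row gets its own result.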

In [None]:
lookupInseeInpi(df_insee = subset_insee_count,
                siren = test_1.iloc[2, 0],
                regex_ = test_1.iloc[2, -2])

In [None]:
subset_insee_count.loc[lambda x: x['siren'].isin([test_1.iloc[2, 0]])]

In [None]:
((17000 * 2.62) /10)/60

In [None]:
%%time
test_1['siret_1'] = test_1.apply(lambda x: lookupInseeInpi(
    df_insee = subset_insee_count,
    siren = x['siren'],
    regex_ = x['Adresse_Ligne1_clean_reg']),
    axis = 1 )

In [None]:
test_1.shape

In [None]:
test_1.loc[lambda x : ~x['siret_1'].isin([np.nan])].head()

In [None]:
test_1.loc[lambda x : ~x['siret_1'].isin([np.nan])].shape

In [None]:
(subset_insee_count
 .loc[lambda x: x['siren'].isin(["395395759"])]
)

In [None]:
test_1['siret_1'].isna().sum()

#### test_2

In [None]:
%%time
test_2 = test1.loc[lambda x: ~x['Adresse_Ligne2_clean_reg'].isin(['nan$'])].copy()
test_2['siret_2'] = test_2.apply(lambda x: lookupInseeInpi(
    df_insee = subset_insee_count,
    siren = x['siren'],
    regex_ = x['Adresse_Ligne2_clean_reg']),
    axis = 1 )

In [None]:
test_2['siret_2'].isna().sum()

In [None]:
test_2.shape

In [None]:
test_2.loc[lambda x : ~x['siret_2'].isin([np.nan])].head()

In [None]:
test_2.loc[lambda x : ~x['siret_2'].isin([np.nan])].shape

In [None]:
(subset_insee_count
 .loc[lambda x: x['siren'].isin(["394674881"])]
)

#### test_3

In [None]:
%%time
test_3 = test1.loc[lambda x: ~x['Adresse_Ligne3_clean_reg'].isin(['nan$'])].copy()
test_3['siret_3'] = test_3.apply(lambda x: lookupInseeInpi(
    df_insee = subset_insee_count,
    siren = x['siren'],
    regex_ = x['Adresse_Ligne3_clean_reg']),
    axis = 1 )

In [None]:
test_3.loc[lambda x : ~x['siret_3'].isin([np.nan])].shape

In [None]:
test_3.loc[lambda x : ~x['siret_3'].isin([np.nan])].head()

In [None]:
(subset_insee_count
 .loc[lambda x: x['siren'].isin(["302556832"])]
)

In [None]:
lookupInseeInpi(df_insee = subset_insee_count,
                siren = '302556832',
                regex_ = '^AVE$|^MICHEL$|^JOURDAN$')

temp1.head()

In [None]:
import os
# Remove a previous export if there is one
if os.path.exists('communes.xlsx'):
    os.remove('communes.xlsx')
temp1.loc[lambda x: x['_merge'].isin(['left_only'])].drop_duplicates('ville2').to_excel('communes.xlsx')

In [None]:
temp.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["Siren"] )