# Test Matching Insee/PP

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
    - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_PP.gz
    - INPI/TC_1/Stock_processed/initial_PP.json
    

## Saving

* The list of matched SIRENs will be saved according to their nature and origin
  * nature → ACTES/COMPTES/ETS/etc.
  * origin → initial/partiel/new/evt

Matched records will be saved in calfdata/SIRETISATION/matche/ using the following naming format:

* insee_nature_origine_matche.gz
    * e.g. insee_pp_initial_matche.gz
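
As a minimal sketch of this naming convention, the helper below builds the output file name from the source, nature and origin. The name `build_match_filename` and its `matched` flag are hypothetical (they are not part of `Match_inpi_insee`); the `non_matche` suffix mirrors the `unmatched_df` name reported in the matching results further down.

```python
def build_match_filename(source, nature, origin, matched=True):
    """Return e.g. 'insee_PP_initial_matche.gz' (hypothetical helper)."""
    suffix = "matche" if matched else "non_matche"
    return "{}_{}_{}_{}.gz".format(source, nature, origin, suffix)


print(build_match_filename("insee", "PP", "initial"))
print(build_match_filename("insee", "PP", "initial", matched=False))
```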
    
    

## Search engines for testing

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/


In [1]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
import Match_inpi_insee.aws_connectors as aws
#from tqdm.notebook import tqdm
#import tqdm
%load_ext autoreload
%autoreload 2

In [2]:
# Upgrade dask to avoid the following error:
# AttributeError: 'DataFrame' object has no attribute '_getitem_array'

! python -m pip install --upgrade dask
! python -m pip install fsspec
! python -m pip install --upgrade s3fs

Requirement already up-to-date: dask in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (2.10.1)
You are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already up-to-date: s3fs in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.4.0)
Requirement not upgraded as not directly required: boto3>=1.9.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1.11.5)
Requirement not upgraded as not directly required: fsspec>=0.6.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (0.6.2)
Requirement not upgraded as not directly required: botocore>=1.12.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1

In [3]:
instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
bucket = 'calfdata'

# define import paths
## INSEE
source='insee'
insee_csv_relative_filepath = "INSEE/Stock/ETS/StockEtablissement_utf8.csv"

## INPI
inpi_import_root_path = "INPI/TC_1/Stock_processed"
inpi_nature = 'PP'  # nature → ACTES/COMPTES/ETS/etc.
inpi_origin = 'initial'  # origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [4]:
# instantiate AWS connection
AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [5]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
insee = AWS_connection.url_instance_bucket(path_file = insee_csv_relative_filepath)
pp = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

## Cleaning of the tables based on unmatched cases

### Initial matching results

```
{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1234287',
   'total_match': '1182979',
   'total_unmatched_left': '0',
   'total_unmatched_right': '51308'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```
### Cause no. 1 of unmatched rows: missing leading zeros in the INPI SIREN

```
# check whether there is an issue with SIRENs beginning with zeros
data_merged['siren_length'] = data_merged['siren'].apply(lambda x: len(x))
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
```

```
# SIRENs with fewer than 9 characters have not been merged
merged.groupby('siren_length')['siren'].count()
 
siren_length
9    1182979
Name: siren, dtype: int64
```

```
# Most unmerged SIRENs have 9 characters
unmerged.groupby('siren_length')['siren'].count()
siren_length
1        7
6        1
7      726
8     6484
9    44090
Name: siren, dtype: int64
```
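
The counts above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the figures printed in this section; the variable names are illustrative.

```python
# Unmerged rows by SIREN length, copied from the output above
unmerged_by_length = {1: 7, 6: 1, 7: 726, 8: 6484, 9: 44090}

# Rows whose SIREN has fewer than 9 characters
short = sum(n for length, n in unmerged_by_length.items() if length < 9)
# All unmerged rows; should equal total_unmatched_right from the initial results
total = sum(unmerged_by_length.values())

print(short)
print(total)
```

With these figures, 7218 of the 51308 unmerged rows have a SIREN shorter than 9 characters, so missing leading zeros can account for only part of the unmatched rows.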

#### We used a `fill0` function to left-pad the INPI SIRENs with zeros before matching

#### After the correction, `total_unmatched_right` dropped by 5822 (51308 → 45486)


```
# check if there is an issue with siren beginning with zeros
data_merged['siren_length'] = data_merged['siren'].apply(lambda x: len(x))
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]

# Unmerged rows decreased from 51308 to 45486 (all SIRENs now have 9 characters)
unmerged.groupby('siren_length')['siren'].count()

siren_length
9    45486
Name: siren, dtype: int64


{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```
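
To make the effect of the correction explicit, the two summaries can be compared field by field. This sketch hard-codes the `details` values from the two runs shown above; the counts are stored as strings in the JSON, hence the `int` conversion.

```python
# "details" of the matching summary before and after the zero-padding fix
before = {"total_rows_origin": "1234287", "total_match": "1182979",
          "total_unmatched_left": "0", "total_unmatched_right": "51308"}
after = {"total_rows_origin": "1235130", "total_match": "1189644",
         "total_unmatched_left": "0", "total_unmatched_right": "45486"}

# Signed difference for each counter
delta = {key: int(after[key]) - int(before[key]) for key in before}
for key, value in delta.items():
    print("{}: {:+d}".format(key, value))
```

Matched rows rise by more than unmatched rows fall. One possible explanation is that a single INPI row can match several INSEE establishments, since one SIREN can carry several SIRETs.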

In [6]:
# load data into dataframes
## INSEE
data_ets_= dd.read_csv(insee, 
                       usecols = ['siren', 'siret','statutDiffusionEtablissement'],
                       dtype={'siren':'object',
                              'siret':'object',
                             'statutDiffusionEtablissement': 'object'}
                      )

In [7]:
## INPI
dtypes__ = {'Siren':'object',
            'Adresse_Ligne1': 'object',
       'Code_Commune': 'object',
       'Code_Postal': 'object',
       'Conjoint_Collab_Pseudo': 'object',
       'DAP_Adresse_Ligne1': 'object',
       'DAP_Code_Commune': 'object',
       'Date_1re_Immatriculation': 'object',
       'Sans_Activité': 'object'}

data_pp_= (dd.read_csv(pp, 
                      usecols =["Siren",
                                'Adresse_Ligne1',
                                'Code_Commune',
                                'Code_Postal',
                                'Conjoint_Collab_Pseudo',
                                'DAP_Adresse_Ligne1',
                                'DAP_Code_Commune',
                                'Date_1re_Immatriculation',
                                'Sans_Activité'
                               ],
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_.head()

Unnamed: 0,siren,Date_1re_Immatriculation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo
0,442377040,,,,1000,,,,
1,809861297,,,,1140,1348.0,,,
2,480952712,,,,1380,1025.0,,,
3,388239667,,,,69130,,,,
4,828427245,,,,1220,1143.0,,,


In [8]:
# Correct missing leading zeros in SIREN
def fill0(text, num):
    """Left-pad `text` with zeros until it is `num` characters long."""
    while len(text) < num:
        text = '0' + text
    return text

data_pp_['siren']=data_pp_['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))
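
Python's built-in `str.zfill` performs the same left-padding as `fill0`. The sketch below checks the equivalence on a few sample values (the sample list is illustrative). The pandas string accessor exposes the same operation as `.str.zfill(9)`, which could presumably replace the `apply` call, though that was not tested against this notebook's data.

```python
def fill0(text, num):
    # Same logic as the notebook's helper: prepend zeros up to `num` characters
    while len(text) < num:
        text = "0" + text
    return text


samples = ["5741509", "32517500", "442377040", "7"]
for s in samples:
    # str.zfill pads on the left with zeros and leaves 9-character strings untouched
    assert fill0(s, 9) == s.zfill(9)

print([s.zfill(9) for s in samples])
```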



In [9]:
# Merge data

# NB 1: due to computation limits, this step cannot be moved into a class

# NB 2: when the computation limit is hit, the error is
# UnpicklingError: invalid load key, '\xdd'.
# => restart the kernel

data_merged = data_ets_.merge(
    data_pp_,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()
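
The `_merge` column produced by `indicator=True` is what the rest of the notebook relies on to separate matched from unmatched rows. A minimal sketch in plain pandas, using made-up SIREN/SIRET values rather than real data:

```python
import pandas as pd

# Toy INSEE side: one SIREN with two establishments, one with a single one
insee_toy = pd.DataFrame({
    "siren": ["005741509", "005741509", "442377040"],
    "siret": ["00574150900011", "00574150900029", "44237704000012"],
})
# Toy INPI side: one SIREN present in INSEE, one absent
inpi_toy = pd.DataFrame({
    "siren": ["005741509", "999999999"],
    "Code_Postal": ["04270", "01000"],
})

# Right merge keeps every INPI row; `_merge` records where each row came from
merged_toy = insee_toy.merge(inpi_toy, how="right", on="siren", indicator=True)

matched = merged_toy[merged_toy["_merge"] == "both"]
unmatched = merged_toy[merged_toy["_merge"] == "right_only"]
print(len(matched), len(unmatched))
```

Because the join is on SIREN only, an INPI row is repeated once per matching INSEE establishment, which is why the merged table can have more rows than the INPI input.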

In [10]:
# create a check_url for online manual data verification
data_merged = data_merged.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_1re_Immatriculation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,_merge,url
0,5741509,574150900011,O,,,,4270.0,,,,,both,https://data.inpi.fr/entreprises/005741509
1,5915533,591553300011,O,,,,80220.0,80373.0,,,,both,https://data.inpi.fr/entreprises/005915533
2,6472419,647241900032,O,,,9 rue de Conan,44780.0,44098.0,,,,both,https://data.inpi.fr/entreprises/006472419
3,6641104,664110400010,O,,,,,,,,,both,https://data.inpi.fr/entreprises/006641104
4,7241532,724153200014,O,,,,,,,,,both,https://data.inpi.fr/entreprises/007241532


In [11]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}
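
A quick sanity check on this summary: the figures suggest that `total_rows_origin` is the sum of the three other counters. The sketch below assumes the dictionary layout printed above; `check_totals` is a hypothetical helper, not part of `Match_inpi_insee`.

```python
# Copy of the summary printed above
match_summary = {
    "matching_json": {
        "nature": "PP",
        "origin": "initial",
        "path": "INPI/TC_1/Stock_processed",
        "details": {
            "total_rows_origin": "1235130",
            "total_match": "1189644",
            "total_unmatched_left": "0",
            "total_unmatched_right": "45486",
        },
    },
}


def check_totals(details):
    """Return True if matched + unmatched counts add up to the total."""
    counts = {key: int(value) for key, value in details.items()}
    parts = (counts["total_match"]
             + counts["total_unmatched_left"]
             + counts["total_unmatched_right"])
    return parts == counts["total_rows_origin"]


print(check_totals(match_summary["matching_json"]["details"]))
```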

In [12]:
# IN PROGRESS
# Describe results & check

In [13]:
# Unmatched sample
data_merged.loc[lambda x: x['_merge'].isin(['right_only'])].head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_1re_Immatriculation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,_merge,url
14241,504630310,,,,,,1500,1004.0,,,,right_only,https://data.inpi.fr/entreprises/504630310
14242,823218854,,,,,,1090,1258.0,,,,right_only,https://data.inpi.fr/entreprises/823218854
14243,546110156,,,,,,1110,,,,,right_only,https://data.inpi.fr/entreprises/546110156
14244,490068327,,,,,,1480,1166.0,,,,right_only,https://data.inpi.fr/entreprises/490068327
14245,793005000,,,,,,1150,1202.0,,,,right_only,https://data.inpi.fr/entreprises/793005000


In [14]:
# Distribution of the number of rows per SIREN in data_pp_
test = (data_pp_
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        .assign(pct = lambda x:100*x/x.sum())
        #.rename(columns={'count': 'siret_count'})
       )
test

count,siren,pct
1,741492,98.905294
2,7823,1.043485
3,304,0.04055
4,54,0.007203
5,9,0.0012
6,4,0.000534
7,7,0.000934
8,1,0.000133
10,1,0.000133
11,1,0.000133
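
The same "count of counts" can be illustrated without Dask. This sketch uses `collections.Counter` on a small made-up list of SIRENs: the first pass counts rows per SIREN, the second counts how many SIRENs share each row count.

```python
from collections import Counter

# Made-up SIREN column: one value appears twice, one three times
sirens = ["005741509", "005741509", "442377040",
          "809861297", "809861297", "809861297", "480952712"]

rows_per_siren = Counter(sirens)                  # SIREN -> number of rows
distribution = Counter(rows_per_siren.values())   # number of rows -> number of SIRENs
total = sum(distribution.values())                # number of distinct SIRENs

for count in sorted(distribution):
    n = distribution[count]
    print(count, n, round(100 * n / total, 2))
```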


In [15]:
# check some cases
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
unmerged['siren'].head()

14241    504630310
14242    823218854
14243    546110156
14244    490068327
14245    793005000
Name: siren, dtype: object

In [16]:
# TODO: check whether the same SIREN can appear several times in data_pp_

In [17]:
# Case found no. 3:
## created recently: 2020
#  => use the dateCreationEtablissement field
# 2019 stock INSEE
# 2017 stock INPI


In [18]:
test_creation_date = dd.read_csv(insee, 
                       usecols = ['siren', 'siret','dateCreationEtablissement'],
                       dtype={'siren':'object',
                              'siret':'object',
                             'dateCreationEtablissement': 'object'}
                      )

In [19]:
test_creation_date.head()

Unnamed: 0,siren,siret,dateCreationEtablissement
0,325175,32517500016,2000-09-26
1,325175,32517500024,2008-05-20
2,325175,32517500032,2009-05-27
3,325175,32517500040,2011-10-21
4,325175,32517500057,2014-01-07


In [20]:
test_creation_date['dateCreationEtablissement'] = test_creation_date['dateCreationEtablissement'].astype('datetime64[ns]') 

In [21]:
test_creation_date.head()

Unnamed: 0,siren,siret,dateCreationEtablissement
0,325175,32517500016,2000-09-26
1,325175,32517500024,2008-05-20
2,325175,32517500032,2009-05-27
3,325175,32517500040,2011-10-21
4,325175,32517500057,2014-01-07


In [22]:
test_creation_date['dateCreationEtablissement'].max().compute()  # KO: the maximum date is implausible (year 2116)

Timestamp('2116-01-19 00:00:00')
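
A maximum in 2116 suggests that some creation dates are implausible. One way to isolate them is sketched here with the standard library on made-up rows; the cutoff date and the sample values are assumptions for illustration only.

```python
from datetime import date, datetime

# (siret, dateCreationEtablissement) pairs; the last one is made up
rows = [
    ("32517500016", "2000-09-26"),
    ("32517500024", "2008-05-20"),
    ("99999999900011", "2116-01-19"),
]
cutoff = date(2020, 12, 31)  # assumed upper bound for a plausible creation date

# Keep only the rows whose creation date lies after the cutoff
suspicious = [
    (siret, d) for siret, d in rows
    if datetime.strptime(d, "%Y-%m-%d").date() > cutoff
]
print(suspicious)
```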

In [23]:
test_creation_date.info() 

<class 'dask.dataframe.core.DataFrame'>
Columns: 3 entries, siren to dateCreationEtablissement
dtypes: datetime64[ns](1), object(2)