# Test Matching Insee/PP

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
    - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_PP.gz
    - INPI/TC_1/Stock_processed/initial_PP.json
    

## Saving

* The list of matched SIRENs is saved according to its nature and origin
  * nature → ACTES/COMPTES/ETS/etc
  * origin → initial/partiel/new/evt

Matched records are saved in calfdata/SIRETISATION/matche/ using the following naming format:

* insee_nature_origine_matche.gz
    * ex: insee_pp_initial_matche.gz
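
As an illustration of this naming convention, here is a minimal sketch. The helper name `matched_filename` is hypothetical and is not part of the project code; it only shows how the three parts of the name fit together.

```python
def matched_filename(nature, origin, source="insee"):
    """Build an archive name of the form source_nature_origin_matche.gz.

    Hypothetical helper, shown only to illustrate the convention.
    """
    return "{}_{}_{}_matche.gz".format(source, nature, origin)


print(matched_filename("pp", "initial"))
```

The case of `nature` is passed through unchanged, so the caller decides whether the file is named with `pp` or `PP`.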
    
    

## Search engines for manual testing

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/


In [1]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
import Match_inpi_insee.aws_connectors as aws
%load_ext autoreload
%autoreload 2

In [2]:
# upgrade dask to avoid the following error:
# AttributeError: 'DataFrame' object has no attribute '_getitem_array'

! python -m pip install --upgrade dask
! python -m pip install fsspec
! python -m pip install --upgrade s3fs

Requirement already up-to-date: dask in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (2.10.1)
You are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already up-to-date: s3fs in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.4.0)
Requirement not upgraded as not directly required: boto3>=1.9.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1.11.5)
Requirement not upgraded as not directly required: fsspec>=0.6.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (0.6.2)
Requirement not upgraded as not directly required: botocore>=1.12.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1

In [3]:
inpi_nature = 'PP' #* nature → ACTES/COMPTES/ETS/etc
load_type = 'aws' # 'aws'/'local'

In [None]:
source = 'insee'
insee_filename = 'StockEtablissement_utf8.csv'
# NOTE: siretisation_filename was never defined in the original cell.
# The value below is an assumption based on the insee_nature_origine_matche.gz convention.
siretisation_filename = 'insee_{}_initial_matche.gz'.format(inpi_nature)

# define import paths
if load_type == 'local':
    ## INSEE on local machine
    insee_filepath = 'data'
    insee_csv_relative_filepath = "{}/{}".format(insee_filepath, insee_filename)
    insee = insee_csv_relative_filepath
    inpi_processed_import_full_path = 'data'
    # assumption: later cells read inpi_import_root_path, so define it here too
    inpi_import_root_path = inpi_processed_import_full_path
    siretisation_import_path = 'data'
    # matched data
    siretisation_fullpath = "{}/{}".format(siretisation_import_path, siretisation_filename)
    matche = siretisation_fullpath
else:
    ## INSEE on AWS
    import boto3
    import Match_inpi_insee.aws_connectors as aws
    insee_filepath = "INSEE/Stock/ETS"
    insee_csv_relative_filepath = "{}/{}".format(insee_filepath, insee_filename)
    instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
    bucket = 'calfdata'
    # instantiate AWS connection
    AWS_connection = aws.aws_instantiate(instance_aws, bucket)
    insee = AWS_connection.url_instance_bucket(path_file = insee_csv_relative_filepath)
    inpi_import_root_path = "INPI/TC_1/Stock_processed"

    siretisation_import_path = 'SIRETISATION/matche'
    siretisation_fullpath = "{}/{}".format(siretisation_import_path, siretisation_filename)
    matche = AWS_connection.url_instance_bucket(path_file = siretisation_fullpath)

In [4]:
inpi_origin = 'initial' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')
inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [5]:
# instantiate AWS connection
AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [6]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
insee = AWS_connection.url_instance_bucket(path_file = insee_csv_relative_filepath)
pp_initial = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

## Cleaning of the tables based on unmatched cases

### initial matching results

```
{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1234287',
   'total_match': '1182979',
   'total_unmatched_left': '0',
   'total_unmatched_right': '51308'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```
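
The summary above can be sanity-checked with a few lines of arithmetic. The sketch below copies the `details` values from the output (they are stored as strings) and checks that matched plus right-only rows account for every row; the variable names are illustrative only.

```python
# Figures copied from the matching summary above (stored as strings in the JSON).
details = {
    "total_rows_origin": "1234287",
    "total_match": "1182979",
    "total_unmatched_left": "0",
    "total_unmatched_right": "51308",
}

counts = {key: int(value) for key, value in details.items()}

# Matched plus right-only rows should account for every row of the merge.
is_consistent = (
    counts["total_match"] + counts["total_unmatched_right"]
    == counts["total_rows_origin"]
)
match_rate = counts["total_match"] / counts["total_rows_origin"]

print(is_consistent, "{:.2%}".format(match_rate))
```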
### Cause no. 1 of unmatched records: missing leading zeros in INPI SIREN values

```
# check whether SIRENs that should start with zeros are affected
data_merged['siren_length'] = data_merged['siren'].str.len()
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
```

```
# SIRENs with fewer than 9 characters were never merged: every matched SIREN has length 9
merged.groupby('siren_length')['siren'].count()
 
siren_length
9    1182979
Name: siren, dtype: int64
```

```
# Most unmerged SIRENs have 9 characters, but 7218 of them are shorter
unmerged.groupby('siren_length')['siren'].count()
siren_length
1        7
6        1
7      726
8     6484
9    44090
Name: siren, dtype: int64
```

#### We used a `fill0` function to left-pad INPI SIRENs with zeros before matching

#### After correction, total_unmatched_right decreased by 5822 (51308 → 45486)


```
# check whether SIRENs that should start with zeros are still affected
data_merged['siren_length'] = data_merged['siren'].str.len()
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]

# Unmerged rows decreased from 51308 to 45486 (all SIRENs now have 9 characters)
unmerged.groupby('siren_length')['siren'].count()

siren_length
9    45486
Name: siren, dtype: int64


{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```
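
The zero-padding correction described above can be sketched in plain Python. The SIREN values below are made up for illustration; the loop version mirrors the notebook's `fill0`, and `str.zfill` is the built-in equivalent for digit-only strings.

```python
def fill0(text, num):
    """Left-pad text with '0' until it has at least num characters."""
    while len(text) < num:
        text = "0" + text
    return text


# Hypothetical SIREN values that lost their leading zeros.
raw_sirens = ["5520178", "12345678", "801884248"]

padded_loop = [fill0(s, 9) for s in raw_sirens]
padded_builtin = [s.zfill(9) for s in raw_sirens]

print(padded_builtin)
```

On a pandas string column the same padding is available as `.str.zfill(9)`, which avoids a Python-level `apply`.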

In [7]:
usecols_= [
'activitePrincipaleEtablissement',
#'activitePrincipaleRegistreMetiersEtablissement',
#'anneeEffectifsEtablissement',
#'caractereEmployeurEtablissement',
#'codeCedex2Etablissement',
#'codeCedexEtablissement',
#'codeCommune2Etablissement',
#'codeCommuneEtablissement',
#'codePaysEtranger2Etablissement',
#'codePaysEtrangerEtablissement',
#'codePostal2Etablissement',
'codePostalEtablissement',
#'complementAdresse2Etablissement',
#'complementAdresseEtablissement',
'dateCreationEtablissement',
#'dateDebut',
#'dateDernierTraitementEtablissement',
#'denominationUsuelleEtablissement',
#'distributionSpeciale2Etablissement',
#'distributionSpecialeEtablissement',
#'enseigne1Etablissement',
#'enseigne2Etablissement',
#'enseigne3Etablissement',
#'etablissementSiege',
'etatAdministratifEtablissement',
#'indiceRepetition2Etablissement',
#'indiceRepetitionEtablissement',
#'libelleCedex2Etablissement',
#'libelleCedexEtablissement',
#'libelleCommune2Etablissement',
'libelleCommuneEtablissement',
#'libelleCommuneEtranger2Etablissement',
#'libelleCommuneEtrangerEtablissement',
#'libellePaysEtranger2Etablissement',
#'libellePaysEtrangerEtablissement',
#'libelleVoie2Etablissement',
'libelleVoieEtablissement',
#'nic',
#'nombrePeriodesEtablissement',
'nomenclatureActivitePrincipaleEtablissement',
#'numeroVoie2Etablissement',
#'numeroVoieEtablissement',
'siren',
'siret',
#'statutDiffusionEtablissement',
#'trancheEffectifsEtablissement',
#'typeVoie2Etablissement'
#'typeVoieEtablissement'
]

dtype_ = {
'activitePrincipaleEtablissement': 'object',
#'activitePrincipaleRegistreMetiersEtablissement': 'object',
#'anneeEffectifsEtablissement': 'object',
#'caractereEmployeurEtablissement': 'object',
#'codeCedex2Etablissement': 'object',
#'codeCedexEtablissement': 'object',
#'codeCommune2Etablissement': 'object',
#'codeCommuneEtablissement': 'object',
#'codePaysEtranger2Etablissement': 'object',
#'codePaysEtrangerEtablissement': 'object',
#'codePostal2Etablissement': 'object',
'codePostalEtablissement': 'object',
#'complementAdresse2Etablissement': 'object',
#'complementAdresseEtablissement': 'object',
'dateCreationEtablissement': 'object',
#'dateDebut': 'object',
#'dateDernierTraitementEtablissement': 'object',
#'denominationUsuelleEtablissement': 'object',
#'distributionSpeciale2Etablissement': 'object',
#'distributionSpecialeEtablissement': 'object',
#'enseigne1Etablissement': 'object',
#'enseigne2Etablissement': 'object',
#'enseigne3Etablissement': 'object',
#'etablissementSiege': 'object',
'etatAdministratifEtablissement': 'object',
#'indiceRepetition2Etablissement': 'object',
#'indiceRepetitionEtablissement': 'object',
#'libelleCedex2Etablissement': 'object',
#'libelleCedexEtablissement': 'object',
#'libelleCommune2Etablissement': 'object',
'libelleCommuneEtablissement': 'object',
#'libelleCommuneEtranger2Etablissement': 'object',
#'libelleCommuneEtrangerEtablissement': 'object',
#'libellePaysEtranger2Etablissement': 'object',
#'libellePaysEtrangerEtablissement': 'object',
#'libelleVoie2Etablissement': 'object',
'libelleVoieEtablissement': 'object',
#'nic': 'object',
#'nombrePeriodesEtablissement': 'object',
'nomenclatureActivitePrincipaleEtablissement': 'object',
#'numeroVoie2Etablissement': 'object',
#'numeroVoieEtablissement': 'object',
'siren': 'object',
'siret': 'object',
#'statutDiffusionEtablissement': 'object',
#'trancheEffectifsEtablissement': 'object',
#'typeVoie2Etablissement': 'object'
#'typeVoieEtablissement': 'object'
}

In [8]:
# load data into dataframes
## INSEE
data_ets_= dd.read_csv(insee, 
                       usecols = usecols_,
                       dtype=dtype_
                      )

In [9]:
## INPI Columns

usecols__ =[
#'Code Greffe',
'Nom_Greffe',
#'Numero_Gestion',
'Siren',
'Type_Inscription',
'Date_Immatriculation',
#'Date_1re_Immatriculation',
'Date_Radiation',
#'Date_Transfert',
#'Sans_Activité',
'Date_Debut_Activité',
#'Date_Début_1re_Activité',
'Date_Cessation_Activité',
'Nom_Patronymique',
'Nom_Usage',
#'Pseudonyme',
'Prénoms',
'Date_Naissance',
'Ville_Naissance',
'Pays_Naissance',
#'Nationalité',
'Adresse_Ligne1',
#'Adresse_Ligne2',
#'Adresse_Ligne3',
'Code_Postal',
'Ville',
#'Code_Commune',
#'Pays',
#'Activité_Forain',
#'DAP',
#'DAP_Dénomination',
#'DAP_Objet',
#'DAP_Date_Clôture',
#'DAP_Adresse_Ligne1',
#'DAP_Adresse_Ligne2',
#'DAP_Adresse_Ligne3',
#'DAP_Code_Postal',
#'DAP_Ville',
#'DAP_Code_Commune',
#'DAP_Pays',
#'EIRL',
#'Auto-entrepreneur',
'Conjoint_Collab_Nom_Patronym',
'Conjoint_Collab_Nom_Usage',
#'Conjoint_Collab_Pseudo',
'Conjoint_Collab_Prénoms',
'Conjoint_Collab_Date_Fin',
'Date_Greffe',
'Libelle_Evt'
]

dtypes__ = {
#'Code Greffe': 'object',
'Nom_Greffe': 'object',
#'Numero_Gestion': 'object',
'Siren': 'object',
'Type_Inscription': 'object',
'Date_Immatriculation': 'object',
#'Date_1re_Immatriculation': 'object',
'Date_Radiation': 'object',
#'Date_Transfert': 'object',
#'Sans_Activité': 'object',
'Date_Debut_Activité': 'object',
#'Date_Début_1re_Activité': 'object',
'Date_Cessation_Activité': 'object',
'Nom_Patronymique': 'object',
'Nom_Usage': 'object',
#'Pseudonyme': 'object',
'Prénoms': 'object',
'Date_Naissance': 'object',
'Ville_Naissance': 'object',
'Pays_Naissance': 'object',
#'Nationalité': 'object',
'Adresse_Ligne1': 'object',
#'Adresse_Ligne2': 'object',
#'Adresse_Ligne3': 'object',
'Code_Postal': 'object',
'Ville': 'object',
#'Code_Commune': 'object',
#'Pays': 'object',
#'Activité_Forain': 'object',
#'DAP': 'object',
#'DAP_Dénomination': 'object',
#'DAP_Objet': 'object',
#'DAP_Date_Clôture': 'object',
#'DAP_Adresse_Ligne1': 'object',
#'DAP_Adresse_Ligne2': 'object',
#'DAP_Adresse_Ligne3': 'object',
#'DAP_Code_Postal': 'object',
#'DAP_Ville': 'object',
#'DAP_Code_Commune': 'object',
#'DAP_Pays': 'object',
#'EIRL': 'object',
#'Auto-entrepreneur': 'object',
'Conjoint_Collab_Nom_Patronym': 'object',
'Conjoint_Collab_Nom_Usage': 'object',
#'Conjoint_Collab_Pseudo': 'object',
'Conjoint_Collab_Prénoms': 'object',
'Conjoint_Collab_Date_Fin': 'object',
'Date_Greffe': 'object',
'Libelle_Evt': 'object'
}

# INITIAL PP

In [21]:
## INPI Data
data_pp_initial_= (dd.read_csv(pp_initial, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_initial_.head()

Please ensure that each individual file can fit in memory and
use the keyword ``blocksize=None to remove this message``
Setting ``blocksize=None``
  "Setting ``blocksize=None``" % compression


MemoryError: 

In [11]:
# correction of missing leading zeros in siren
def fill0(text, num):
    """Left-pad text with '0' until it is at least num characters long."""
    return text.rjust(num, '0')

data_pp_initial_['siren']=data_pp_initial_['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [12]:
# Merge data

#NB 1 : Due to computation limits, it is not possible to put this part in a class

#NB 2 : Computation limit error : UnpicklingError: invalid load key, '\xdd'.
# => restart kernel

data_merged_initial_ = data_ets_.merge(
    data_pp_initial_,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

MemoryError: 

In [None]:
# create a check_url for online manual data verification
data_merged_initial_ = data_merged_initial_.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_initial_.head()

In [None]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_initial_,
                        save='yes')
match_dict

In [None]:
unmerged_initial = data_merged_initial_.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_initial = data_merged_initial_.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_initial.head()
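
To make the role of `how='right'` and `indicator=True` concrete, here is a small pandas sketch on toy tables. All values are made up; only the column names `siren`, `siret` and `Nom_Patronymique` come from the notebook.

```python
import pandas as pd

# Toy INSEE-like table: one row per establishment (SIRET).
ets = pd.DataFrame({
    "siren": ["000000001", "000000001", "000000002"],
    "siret": ["00000000100011", "00000000100029", "00000000200019"],
})
# Toy INPI-like table: one row per registered person.
pp = pd.DataFrame({
    "siren": ["000000001", "000000003"],
    "Nom_Patronymique": ["DUPONT", "MARTIN"],
})

# A right merge keeps every INPI row; `_merge` records where each row came from.
merged = ets.merge(pp, how="right", on="siren", indicator=True)

matched = merged.loc[merged["_merge"] == "both"]
unmatched = merged.loc[merged["_merge"] == "right_only"]

print(len(matched), len(unmatched))
```

Because the merge is a right join, no row can be `left_only`, which is consistent with `total_unmatched_left` being 0 in the summaries. A SIREN with several establishments yields one merged row per SIRET, so the merged table can have more rows than the INPI table.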

In [None]:
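# Distribution of the number of rows per SIREN (how many SIRENs appear once, twice, ...)
test = (data_pp_initial_
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        # after the second count() the only column left is 'siren'
        .assign(pct = lambda x: 100 * x['siren'] / x['siren'].sum())
        #.rename(columns={'count': 'siret_count'})
       )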
test
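
The chained group-bys in the cell above compute how many SIRENs appear once, twice, and so on in the INPI table. The same idea in plain Python, on made-up identifiers, looks roughly like this:

```python
from collections import Counter

# Hypothetical SIREN column with repeated values.
sirens = ["A", "A", "B", "C", "C", "C", "D"]

rows_per_siren = Counter(sirens)                 # SIREN -> number of rows
distribution = Counter(rows_per_siren.values())  # number of rows -> number of SIRENs

total = sum(distribution.values())
pct = {n: 100 * k / total for n, k in sorted(distribution.items())}

print(pct)
```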

# NEW PP

In [13]:
# Monthly New Import
year='2017'

In [14]:
# define import paths
## INSEE
#source='insee'
#insee_csv_relative_filepath = "INSEE/Stock/ETS/StockEtablissement_utf8.csv"

## INPI
#inpi_import_root_path = "INPI/TC_1/Stock_processed"
#inpi_nature = 'PP' #* nature → ACTES/COMPTES/ETS/etc
inpi_origin = 'new' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'json') #2017_NEW_PP.json
inpi_gz_file_name = "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [15]:
inpi_import_gz_filepath

'INPI/TC_1/Stock_processed/2017_NEW_PP.gz'
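
The notebook uses two naming schemes for the INPI files: `origin_nature.ext` for the stock files and `YEAR_ORIGIN_NATURE.ext` for the yearly ones. A minimal sketch that captures both, assuming a hypothetical helper named `inpi_stock_path`:

```python
def inpi_stock_path(root, origin, nature, ext, year=None):
    """Hypothetical helper mirroring the two naming schemes used in this notebook."""
    if year is None:
        # stock files, e.g. initial_PP.gz
        name = "{}_{}.{}".format(origin, nature, ext)
    else:
        # yearly files, e.g. 2017_NEW_PP.gz
        name = "{}_{}_{}.{}".format(year, origin.upper(), nature.upper(), ext)
    return "{}/{}".format(root, name)


root = "INPI/TC_1/Stock_processed"
print(inpi_stock_path(root, "initial", "PP", "gz"))
print(inpi_stock_path(root, "new", "PP", "gz", year="2017"))
```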

In [16]:
# instantiate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [17]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_new = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [18]:
# load data into dataframes

## INPI
dtypes__ = {
    'Siren': 'object',
    'Adresse_Ligne1': 'object',
    'Code_Commune': 'object',
    'Code_Postal': 'object',
    'Conjoint_Collab_Pseudo': 'object',
    'DAP_Adresse_Ligne1': 'object',
    'DAP_Code_Commune': 'object',
    'Date_Immatriculation': 'object',
    'Date_1re_Immatriculation': 'object',
    'Date_Radiation': 'object',
    'Date_Greffe': 'object',
    'Sans_Activité': 'object',
}

data_pp_new= (dd.read_csv(pp_new, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_new.head()

Unnamed: 0,Nom_Greffe,siren,Type_Inscription,Date_Immatriculation,Date_Radiation,Date_Debut_Activité,Date_Cessation_Activité,Nom_Patronymique,Nom_Usage,Prénoms,...,Pays_Naissance,Adresse_Ligne1,Code_Postal,Ville,Conjoint_Collab_Nom_Patronym,Conjoint_Collab_Nom_Usage,Conjoint_Collab_Prénoms,Conjoint_Collab_Date_Fin,Date_Greffe,Libelle_Evt
0,Bourg-en-Bresse,801884248,P,2017-11-30,,2017-10-24,,AUGUSTE,,"Coralie, Jeanine",...,FRANCE,,1240,Certines,,,,,2017-11-30,Création
1,Bourg-en-Bresse,833599822,P,2017-11-30,,2017-11-04,,LAMKADDAM,,Abdelhafid,...,MAROC,,1300,Belley,,,,,2017-11-30,Création
2,Bourg-en-Bresse,832710818,P,2017-10-19,,2017-09-27,,GARBIT,,"Nicolas, Gaspard, Jacques, Marie",...,FRANCE,,1500,Amberieu en Bugey,,,,,2017-10-19,Création
3,Bourg-en-Bresse,830157384,P,2017-11-10,,2017-11-07,,LESAGE,,"Nicolas, Ludovic, Stephane",...,FRANCE,,1240,Dompierre-sur-Veyle,,,,,2017-11-10,Création
4,Bourg-en-Bresse,833642630,P,2017-12-01,,2017-12-01,,CASTRONOVO,,Richard Jocelyn,...,FRANCE,,1480,Chaleins,,,,,2017-12-01,Création


In [19]:
# correction of missing zeros in siren
data_pp_new['siren']=data_pp_new['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [20]:
# Merge data

#NB 1 : Due to computation limits, it is not possible to put this part in a class

#NB 2 : Computation limit error : UnpicklingError: invalid load key, '\xdd'.
# => restart kernel

data_merged_new = data_ets_.merge(
    data_pp_new,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

MemoryError: 

In [None]:
# create a check_url for online manual data verification
data_merged_new = data_merged_new.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_new.head()

In [None]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_new,
                        save='yes')
match_dict

In [None]:
unmerged_new = data_merged_new.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_new = data_merged_new.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_new.head()
unmerged_new.iloc[1:40]

# Evt PP

In [None]:
# Monthly New Import
year='2017'

In [None]:
# define import paths

## INPI
inpi_origin = 'evt' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'json') #2017_EVT_PP.json
inpi_gz_file_name = "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [None]:
# instantiate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [None]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_evt = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [None]:
# load data into dataframes

## INPI

data_pp_evt= (dd.read_csv(pp_evt, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_evt.head()

In [None]:
# correction of missing zeros in siren
data_pp_evt['siren']=data_pp_evt['siren'].apply(lambda x: fill0(x,9))

In [None]:
# Merge data

data_merged_evt = data_ets_.merge(
    data_pp_evt,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [None]:
# create a check_url for online manual data verification
data_merged_evt = data_merged_evt.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_evt.head()

In [None]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_evt,
                        save='yes')
match_dict

In [None]:
unmerged_evt = data_merged_evt.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_evt = data_merged_evt.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_evt.head()
unmerged_evt.iloc[1:40]

# Partiel PP

In [None]:
# define import paths
inpi_origin = 'partiel' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)


In [None]:
inpi_import_gz_filepath

In [None]:
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [None]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_partiel = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [None]:
# load data into dataframes

## INPI

data_pp_partiel= (dd.read_csv(pp_partiel, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_partiel.head()

In [None]:
# correction of missing zeros in siren
data_pp_partiel['siren']=data_pp_partiel['siren'].apply(lambda x: fill0(x,9))

In [None]:
# Merge data

data_merged_partiel = data_ets_.merge(
    data_pp_partiel,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [None]:
# create a check_url for online manual data verification
data_merged_partiel = data_merged_partiel.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_partiel.head()

In [None]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_partiel,
                        save='yes')
match_dict

In [None]:
unmerged_partiel = data_merged_partiel.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_partiel = data_merged_partiel.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_partiel.head()

In [None]:
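# Distribution of the number of rows per SIREN (how many SIRENs appear once, twice, ...)
test = (data_pp_partiel
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        # after the second count() the only column left is 'siren'
        .assign(pct = lambda x: 100 * x['siren'] / x['siren'].sum())
       )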
test