# Test Matching Insee/PP

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
    - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_PP.gz
    - INPI/TC_1/Stock_processed/initial_PP.json
    

## Saving

* The list of matched SIRENs is saved according to their nature and origin
  * nature → ACTES/COMPTES/ETS/etc.
  * origin → initial/partiel/new/evt

Matched records are saved in calfdata/SIRETISATION/matche/ using the following naming format:

* insee_nature_origin_matche.gz
    * e.g. insee_pp_initial_matche.gz
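
The naming convention above can be sketched as a small helper. This is only an illustration, not the project's actual code: the function name `matched_file_name` and its `matched` flag are hypothetical, and the `non_matche` suffix is taken from the file names reported in the matching outputs of this notebook.

```python
def matched_file_name(source, nature, origin, matched=True):
    """Build a result file name such as 'insee_PP_initial_matche.gz'.

    Hypothetical helper: only the naming pattern comes from this notebook.
    """
    suffix = "matche" if matched else "non_matche"
    return "{}_{}_{}_{}.gz".format(source, nature, origin, suffix)


print(matched_file_name("insee", "PP", "initial"))
print(matched_file_name("insee", "PP", "initial", matched=False))
```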
    
    

## Search engines for testing

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/


In [1]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
import Match_inpi_insee.aws_connectors as aws
%load_ext autoreload
%autoreload 2

In [2]:
# upgrade dask to avoid the following error:
# AttributeError: 'DataFrame' object has no attribute '_getitem_array'

! python -m pip install --upgrade dask
! python -m pip install fsspec
! python -m pip install --upgrade s3fs

Requirement already up-to-date: dask in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (2.10.1)
You are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already up-to-date: s3fs in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.4.0)
Requirement not upgraded as not directly required: boto3>=1.9.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1.11.5)
Requirement not upgraded as not directly required: botocore>=1.12.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1.14.5)
Requirement not upgraded as not directly required: fsspec>=0.6.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (

In [None]:
inpi_nature = 'PP' #* nature → ACTES/COMPTES/ETS/etc

In [3]:
instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
bucket = 'calfdata'

# define import paths
## INSEE
source='insee'
insee_csv_relative_filepath = "INSEE/Stock/ETS/StockEtablissement_utf8.csv"

## INPI
inpi_import_root_path = "INPI/TC_1/Stock_processed"

inpi_origin = 'initial' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')
inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [4]:
# instantiate AWS connection
AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [5]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
insee = AWS_connection.url_instance_bucket(path_file = insee_csv_relative_filepath)
pp_initial = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

## Cleaning of the tables based on unmatched cases

### initial matching results

```
{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1234287',
   'total_match': '1182979',
   'total_unmatched_left': '0',
   'total_unmatched_right': '51308'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```
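
The counts in this summary are stored as strings. A minimal sketch of how they could be converted and sanity-checked is shown here; the helper name `summarize_matching` and the added keys `is_consistent` and `match_rate_pct` are hypothetical, and the consistency rule (matched plus right-only rows equals the total) is an assumption that happens to hold for the figures reported above.

```python
# Counts copied from the summary above.
details = {
    "total_rows_origin": "1234287",
    "total_match": "1182979",
    "total_unmatched_left": "0",
    "total_unmatched_right": "51308",
}


def summarize_matching(details):
    """Convert string counts to integers and derive a match rate (in percent)."""
    counts = {key: int(value) for key, value in details.items()}
    # Assumed rule: every row is either matched or unmatched on the right side.
    counts["is_consistent"] = (
        counts["total_match"] + counts["total_unmatched_right"]
        == counts["total_rows_origin"]
    )
    counts["match_rate_pct"] = 100.0 * counts["total_match"] / counts["total_rows_origin"]
    return counts


summary = summarize_matching(details)
print(summary["is_consistent"], round(summary["match_rate_pct"], 2))
```
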
### Cause no. 1 of unmatched records: missing leading zeros in INPI SIREN values

```
### check if there is an issue with siren beginning with zeros
data_merged['siren_length'] = data_merged['siren'].apply(lambda x: len(x))
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
```

```
# SIREN values with fewer than 9 characters were not merged
merged.groupby('siren_length')['siren'].count()
 
siren_length
9    1182979
Name: siren, dtype: int64
```

```
# Most unmerged SIREN values have 9 characters
unmerged.groupby('siren_length')['siren'].count()
siren_length
1        7
6        1
7      726
8     6484
9    44090
Name: siren, dtype: int64
```

#### We used a `fill0` function to left-pad INPI SIREN values with zeros before matching

#### After correction, total_unmatched_right decreased by 5822 (51308 → 45486)
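
The diagnostic above (group unmatched SIRENs by string length, pad, then check again) can be reproduced on a toy example. This is a sketch on made-up data, not the notebook's actual pipeline: the helper `unmatched_by_length` is hypothetical, and plain Python sets stand in for the merged dataframe.

```python
from collections import Counter

# Toy data (for illustration only): INSEE SIRENs are 9-character strings,
# while some INPI SIRENs have lost their leading zeros.
insee_sirens = {"036410462", "005520176", "300004066"}
inpi_sirens = ["36410462", "5520176", "300004066", "999999999"]


def unmatched_by_length(candidates, reference):
    """Count the candidates absent from `reference`, grouped by string length."""
    return Counter(len(s) for s in candidates if s not in reference)


before = unmatched_by_length(inpi_sirens, insee_sirens)

# Left-pad every INPI SIREN to 9 characters, then check again.
padded = [s.zfill(9) for s in inpi_sirens]
after = unmatched_by_length(padded, insee_sirens)

print(dict(before))
print(dict(after))
```

As in the real data, padding removes the short unmatched values, while a 9-character SIREN that is absent from INSEE stays unmatched.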


```
# check if there is an issue with siren beginning with zeros
data_merged['siren_length'] = data_merged['siren'].apply(lambda x: len(x))
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]

# Unmerged rows decreased from 51308 to 45486 (all SIREN values now have 9 characters)
unmerged.groupby('siren_length')['siren'].count()

siren_length
9    45486
Name: siren, dtype: int64


{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```

In [None]:
usecols_= [
'activitePrincipaleEtablissement',
'activitePrincipaleRegistreMetiersEtablissement',
#'anneeEffectifsEtablissement',
#'caractereEmployeurEtablissement',
#'codeCedex2Etablissement',
#'codeCedexEtablissement',
#'codeCommune2Etablissement',
#'codeCommuneEtablissement',
#'codePaysEtranger2Etablissement',
#'codePaysEtrangerEtablissement',
#'codePostal2Etablissement',
'codePostalEtablissement',
#'complementAdresse2Etablissement',
#'complementAdresseEtablissement',
'dateCreationEtablissement',
'dateDebut',
#'dateDernierTraitementEtablissement',
#'denominationUsuelleEtablissement',
#'distributionSpeciale2Etablissement',
#'distributionSpecialeEtablissement',
'enseigne1Etablissement',
#'enseigne2Etablissement',
#'enseigne3Etablissement',
'etablissementSiege',
'etatAdministratifEtablissement',
#'indiceRepetition2Etablissement',
#'indiceRepetitionEtablissement',
#'libelleCedex2Etablissement',
#'libelleCedexEtablissement',
#'libelleCommune2Etablissement',
'libelleCommuneEtablissement',
#'libelleCommuneEtranger2Etablissement',
#'libelleCommuneEtrangerEtablissement',
#'libellePaysEtranger2Etablissement',
#'libellePaysEtrangerEtablissement',
#'libelleVoie2Etablissement',
'libelleVoieEtablissement',
'nic',
#'nombrePeriodesEtablissement',
'nomenclatureActivitePrincipaleEtablissement',
#'numeroVoie2Etablissement',
#'numeroVoieEtablissement',
'siren',
'siret',
'statutDiffusionEtablissement',
#'trancheEffectifsEtablissement',
'typeVoie2Etablissement'
#'typeVoieEtablissement'
]

dtype_ = {
'activitePrincipaleEtablissement': 'object',
'activitePrincipaleRegistreMetiersEtablissement': 'object',
#'anneeEffectifsEtablissement': 'object',
#'caractereEmployeurEtablissement': 'object',
#'codeCedex2Etablissement': 'object',
#'codeCedexEtablissement': 'object',
#'codeCommune2Etablissement': 'object',
#'codeCommuneEtablissement': 'object',
#'codePaysEtranger2Etablissement': 'object',
#'codePaysEtrangerEtablissement': 'object',
#'codePostal2Etablissement': 'object',
'codePostalEtablissement': 'object',
#'complementAdresse2Etablissement': 'object',
#'complementAdresseEtablissement': 'object',
'dateCreationEtablissement': 'object',
'dateDebut': 'object',
#'dateDernierTraitementEtablissement': 'object',
#'denominationUsuelleEtablissement': 'object',
#'distributionSpeciale2Etablissement': 'object',
#'distributionSpecialeEtablissement': 'object',
'enseigne1Etablissement': 'object',
#'enseigne2Etablissement': 'object',
#'enseigne3Etablissement': 'object',
'etablissementSiege': 'object',
'etatAdministratifEtablissement': 'object',
#'indiceRepetition2Etablissement': 'object',
#'indiceRepetitionEtablissement': 'object',
#'libelleCedex2Etablissement': 'object',
#'libelleCedexEtablissement': 'object',
#'libelleCommune2Etablissement': 'object',
'libelleCommuneEtablissement': 'object',
#'libelleCommuneEtranger2Etablissement': 'object',
#'libelleCommuneEtrangerEtablissement': 'object',
#'libellePaysEtranger2Etablissement': 'object',
#'libellePaysEtrangerEtablissement': 'object',
#'libelleVoie2Etablissement': 'object',
'libelleVoieEtablissement': 'object',
'nic': 'object',
#'nombrePeriodesEtablissement': 'object',
'nomenclatureActivitePrincipaleEtablissement': 'object',
#'numeroVoie2Etablissement': 'object',
#'numeroVoieEtablissement': 'object',
'siren': 'object',
'siret': 'object',
'statutDiffusionEtablissement': 'object',
#'trancheEffectifsEtablissement': 'object',
'typeVoie2Etablissement': 'object',
#'typeVoieEtablissement': 'object',

}

In [6]:
# load data into dataframes
## INSEE
data_ets_= dd.read_csv(insee, 
                       usecols = usecols_,
                       dtype=dtype_
                      )
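
Since `usecols_` and `dtype_` list the same columns and every type is `'object'`, the type mapping could also be derived from the column list so the two stay in sync. A minimal sketch on a short subset of the columns; the names `usecols_example` and `dtype_example` are hypothetical.

```python
# Short illustrative subset of the INSEE columns selected above.
usecols_example = ["siren", "siret", "nic", "codePostalEtablissement"]

# Read every selected column as a string so identifiers keep their exact text.
dtype_example = {column: "object" for column in usecols_example}

print(dtype_example)
```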

In [7]:
## INPI Columns

usecols__ =[
#'Code Greffe',
'Nom_Greffe',
#'Numero_Gestion',
'Siren',
'Type_Inscription',
'Date_Immatriculation',
#'Date_1re_Immatriculation',
'Date_Radiation',
#'Date_Transfert',
#'Sans_Activité',
'Date_Debut_Activité',
'Date_Début_1re_Activité',
'Date_Cessation_Activité',
'Nom_Patronymique',
'Nom_Usage',
#'Pseudonyme',
'Prénoms',
'Date_Naissance',
'Ville_Naissance',
'Pays_Naissance',
'Nationalité',
'Adresse_Ligne1',
'Adresse_Ligne2',
'Adresse_Ligne3',
'Code_Postal',
'Ville',
'Code_Commune',
'Pays',
'Activité_Forain',
#'DAP',
#'DAP_Dénomination',
#'DAP_Objet',
#'DAP_Date_Clôture',
#'DAP_Adresse_Ligne1',
#'DAP_Adresse_Ligne2',
#'DAP_Adresse_Ligne3',
#'DAP_Code_Postal',
#'DAP_Ville',
#'DAP_Code_Commune',
#'DAP_Pays',
#'EIRL',
#'Auto-entrepreneur',
'Conjoint_Collab_Nom_Patronym',
'Conjoint_Collab_Nom_Usage',
#'Conjoint_Collab_Pseudo',
'Conjoint_Collab_Prénoms',
'Conjoint_Collab_Date_Fin',
'Date_Greffe',
'Libelle_Evt'
]

dtypes__ = {
#'Code Greffe': 'object',
'Nom_Greffe': 'object',
#'Numero_Gestion': 'object',
'Siren': 'object',
'Type_Inscription': 'object',
'Date_Immatriculation': 'object',
#'Date_1re_Immatriculation': 'object',
'Date_Radiation': 'object',
#'Date_Transfert': 'object',
#'Sans_Activité': 'object',
'Date_Debut_Activité': 'object',
'Date_Début_1re_Activité': 'object',
'Date_Cessation_Activité': 'object',
'Nom_Patronymique': 'object',
'Nom_Usage': 'object',
#'Pseudonyme': 'object',
'Prénoms': 'object',
'Date_Naissance': 'object',
'Ville_Naissance': 'object',
'Pays_Naissance': 'object',
'Nationalité': 'object',
'Adresse_Ligne1': 'object',
'Adresse_Ligne2': 'object',
'Adresse_Ligne3': 'object',
'Code_Postal': 'object',
'Ville': 'object',
'Code_Commune': 'object',
'Pays': 'object',
'Activité_Forain': 'object',
#'DAP': 'object',
#'DAP_Dénomination': 'object',
#'DAP_Objet': 'object',
#'DAP_Date_Clôture': 'object',
#'DAP_Adresse_Ligne1': 'object',
#'DAP_Adresse_Ligne2': 'object',
#'DAP_Adresse_Ligne3': 'object',
#'DAP_Code_Postal': 'object',
#'DAP_Ville': 'object',
#'DAP_Code_Commune': 'object',
#'DAP_Pays': 'object',
#'EIRL': 'object',
#'Auto-entrepreneur': 'object',
'Conjoint_Collab_Nom_Patronym': 'object',
'Conjoint_Collab_Nom_Usage': 'object',
#'Conjoint_Collab_Pseudo': 'object',
'Conjoint_Collab_Prénoms': 'object',
'Conjoint_Collab_Date_Fin': 'object',
'Date_Greffe': 'object',
'Libelle_Evt': 'object'
}

# INITIAL PP

In [8]:
## INPI Data
data_pp_initial_= (dd.read_csv(pp_initial, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_initial_.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,442377040,2002-06-18,,,,,1000,,,,,2015-08-04
1,809861297,2015-02-27,,,,,1140,1348.0,,,,2015-02-27
2,480952712,2005-02-23,,,,,1380,1025.0,,,,2017-05-03
3,388239667,1992-10-14,,,,,69130,,,,,2017-05-03
4,828427245,2017-04-18,,,,,1220,1143.0,,,,2017-04-18


In [10]:
# correction of missing leading zeros in siren
def fill0(text, num):
    """Left-pad `text` with '0' until it is at least `num` characters long."""
    while len(text) < num:
        text = '0' + text
    return text

data_pp_initial_['siren'] = data_pp_initial_['siren'].apply(lambda x: fill0(x, 9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))
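
For strings made only of digits, the loop-based padding gives the same result as Python's built-in `str.zfill`. The sketch below checks this on made-up values covering the lengths observed earlier (1, 6, 7, 8 and 9 characters); it re-declares a loop-based `fill0` so that it runs on its own.

```python
def fill0(text, num):
    """Loop-based left padding with zeros."""
    while len(text) < num:
        text = "0" + text
    return text


# Made-up digit strings of lengths 1, 6, 7, 8 and 9.
samples = ["7", "123456", "1234567", "36410462", "300004066"]

for siren in samples:
    # Both approaches agree for digit-only strings.
    assert fill0(siren, 9) == siren.zfill(9)

padded_samples = [s.zfill(9) for s in samples]
print(padded_samples)
```

In pandas, the vectorized equivalent is `Series.str.zfill(9)`. With the `apply` call used here, passing `meta=('siren', 'object')` as the message suggests should silence the warning.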



In [11]:
# Merge data

#NB 1 : Due to computation limits, it is not possible to put this part in a class

#NB 2 : Computation limit error : UnpicklingError: invalid load key, '\xdd'.
# => restart kernel

data_merged_initial_ = data_ets_.merge(
    data_pp_initial_,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()
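
To illustrate what the right join with `indicator=True` produces, here is a small pandas sketch on toy tables built from a few values shown in this notebook. The dataframe names `insee` and `inpi` are illustrative, and the example uses plain pandas rather than Dask.

```python
import pandas as pd

# Toy tables: one INSEE row per establishment (SIRET), one INPI row per registration.
insee = pd.DataFrame({
    "siren": ["300004066", "314488321", "314488321"],
    "siret": ["30000406600019", "31448832100014", "31448832100022"],
})
inpi = pd.DataFrame({
    "siren": ["300004066", "314488321", "833424716"],
    "Date_Greffe": ["2017-05-04", "2017-12-18", "2017-11-20"],
})

# A right join keeps every INPI row; `_merge` tells where each row was found.
merged = insee.merge(inpi, how="right", on="siren", indicator=True)

n_both = int((merged["_merge"] == "both").sum())
n_right_only = int((merged["_merge"] == "right_only").sum())
n_left_only = int((merged["_merge"] == "left_only").sum())
print(n_both, n_right_only, n_left_only)
```

Because the join is a right join, `left_only` rows cannot occur, which is consistent with `total_unmatched_left` being `'0'` in the summaries. An INPI row whose SIREN has several establishments is repeated once per SIRET.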

In [12]:
# create a check_url for online manual data verification
data_merged_initial_ = data_merged_initial_.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_initial_.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,300004066,30000406600019,O,1973-08-01,,,,14 rue du Tilleul,59600,59392.0,,,,2017-05-04,both,https://data.inpi.fr/entreprises/300004066
1,300036688,30003668800012,O,1974-03-18,,,,,81100,,,,,2017-05-04,both,https://data.inpi.fr/entreprises/300036688
2,300044542,30004454200029,O,1979-11-29,,,,,56510,,,,,1979-11-29,both,https://data.inpi.fr/entreprises/300044542
3,300047388,30004738800024,O,1974-01-17,,,,,56600,,,,,1974-01-17,both,https://data.inpi.fr/entreprises/300047388
4,300054871,30005487100011,O,1973-09-06,,,,,22700,22168.0,,,,2016-10-05,both,https://data.inpi.fr/entreprises/300054871
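
The verification URL can also be built by a small standalone helper that pads the SIREN defensively. A sketch only: `inpi_check_url` and `INPI_BASE_URL` are hypothetical names, and the URL prefix is the one used in the cell above.

```python
INPI_BASE_URL = "https://data.inpi.fr/entreprises/"


def inpi_check_url(siren):
    """Return the data.inpi.fr page URL for a SIREN, zero-padded to 9 characters."""
    return INPI_BASE_URL + str(siren).zfill(9)


print(inpi_check_url("300004066"))
print(inpi_check_url("36410462"))
```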


In [13]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_initial_,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

In [14]:
unmerged_initial = data_merged_initial_.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_initial = data_merged_initial_.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_initial.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
14241,504630310,,,2017-04-26,,,,,1500,1004.0,,,,2017-04-26,right_only,https://data.inpi.fr/entreprises/504630310
14242,823218854,,,2016-10-19,,,,,1090,1258.0,,,,2016-12-07,right_only,https://data.inpi.fr/entreprises/823218854
14243,546110156,,,1961-01-20,,,,,1110,,,,,2009-01-01,right_only,https://data.inpi.fr/entreprises/546110156
14244,490068327,,,2015-12-16,,,,,1480,1166.0,,,,2015-12-17,right_only,https://data.inpi.fr/entreprises/490068327
14245,793005000,,,2017-01-12,,,,,1150,1202.0,,,,2017-01-12,right_only,https://data.inpi.fr/entreprises/793005000


In [15]:
# Count rows per SIREN, then count how many SIRENs have each row count
test = (data_pp_initial_
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        .assign(pct = lambda x:100*x/x.sum())
        #.rename(columns={'count': 'siret_count'})
       )
test

count,siren,pct
1,741492,98.905294
2,7823,1.043485
3,304,0.04055
4,54,0.007203
5,9,0.0012
6,4,0.000534
7,7,0.000934
8,1,0.000133
10,1,0.000133
11,1,0.000133
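
The two-level count above (rows per SIREN, then number of SIRENs per row count) can be mirrored in plain Python with `collections.Counter`. This is a sketch on a made-up list of SIRENs, one entry per row.

```python
from collections import Counter

# Toy list of SIRENs, one entry per INPI row (made-up repetition pattern).
rows = ["300004066", "314488321", "314488321", "331285775", "833424716"]

rows_per_siren = Counter(rows)                   # siren -> number of rows
distribution = Counter(rows_per_siren.values())  # row count -> number of SIRENs
total_sirens = sum(distribution.values())

for count in sorted(distribution):
    n = distribution[count]
    print(count, n, round(100 * n / total_sirens, 2))
```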


# NEW PP

In [17]:
# Monthly New Import
year='2017'

In [18]:
# define import paths
## INSEE
#source='insee'
#insee_csv_relative_filepath = "INSEE/Stock/ETS/StockEtablissement_utf8.csv"

## INPI
#inpi_import_root_path = "INPI/TC_1/Stock_processed"
#inpi_nature = 'PP' #* nature → ACTES/COMPTES/ETS/etc
inpi_origin = 'new' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'json') #2017_NEW_PP.json
inpi_gz_file_name = "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [19]:
inpi_import_gz_filepath

'INPI/TC_1/Stock_processed/2017_NEW_PP.gz'
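
Two file-name patterns are used in this notebook: `<origin>_<nature>` for the stock files and `<year>_<ORIGIN>_<NATURE>` for the yearly files. A hypothetical helper, `inpi_file_path`, could cover both cases; it is a sketch and not part of `aws_connectors`.

```python
def inpi_file_path(root, origin, nature, extension, year=None):
    """Build an INPI import path following the two patterns used in this notebook.

    Hypothetical helper:
      without `year` -> '<root>/<origin>_<nature>.<extension>'
      with `year`    -> '<root>/<year>_<ORIGIN>_<NATURE>.<extension>'
    """
    if year is None:
        name = "{}_{}.{}".format(origin, nature, extension)
    else:
        name = "{}_{}_{}.{}".format(year, origin.upper(), nature.upper(), extension)
    return "{}/{}".format(root, name)


root = "INPI/TC_1/Stock_processed"
print(inpi_file_path(root, "initial", "PP", "gz"))
print(inpi_file_path(root, "new", "PP", "gz", year="2017"))
```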

In [20]:
# instantiate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [21]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_new = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [22]:
# load data into dataframes

## INPI
dtypes__ = {'Siren':'object',
            'Adresse_Ligne1': 'object',
       'Code_Commune': 'object',
       'Code_Postal': 'object',
       'Conjoint_Collab_Pseudo': 'object',
       'DAP_Adresse_Ligne1': 'object',
       'DAP_Code_Commune': 'object',
       'Date_Immatriculation': 'object',
       'Date_1re_Immatriculation': 'object',
       'Date_Radiation': 'object',
       'Date_Greffe': 'object',
       'Sans_Activité': 'object'}

data_pp_new= (dd.read_csv(pp_new, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_new.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,801884248,2017-11-30,,,,,1240,1069,,,,2017-11-30
1,833599822,2017-11-30,,,,,1300,1034,,,,2017-11-30
2,832710818,2017-10-19,,,,,1500,1004,,,,2017-10-19
3,830157384,2017-11-10,,,,,1240,1145,,,,2017-11-10
4,833642630,2017-12-01,,,,,1480,1075,,,,2017-12-01


In [23]:
# correction of missing zeros in siren
data_pp_new['siren']=data_pp_new['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [24]:
# Merge data

#NB 1 : Due to computation limits, it is not possible to put this part in a class

#NB 2 : Computation limit error : UnpicklingError: invalid load key, '\xdd'.
# => restart kernel

data_merged_new = data_ets_.merge(
    data_pp_new,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [25]:
# create a check_url for online manual data verification
data_merged_new = data_merged_new.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_new.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,314488321,31448832100014,O,2017-12-18,,,,,34290,34009,,,,2017-12-18,both,https://data.inpi.fr/entreprises/314488321
1,314488321,31448832100022,O,2017-12-18,,,,,34290,34009,,,,2017-12-18,both,https://data.inpi.fr/entreprises/314488321
2,314488321,31448832100030,O,2017-12-18,,,,,34290,34009,,,,2017-12-18,both,https://data.inpi.fr/entreprises/314488321
3,331285775,33128577500036,O,2017-12-26,,,,,29360,29031,,,,2017-12-26,both,https://data.inpi.fr/entreprises/331285775
4,331285775,33128577500044,O,2017-12-26,,,,,29360,29031,,,,2017-12-26,both,https://data.inpi.fr/entreprises/331285775


In [26]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_new,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'new',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '21530',
   'total_match': '19862',
   'total_unmatched_left': '0',
   'total_unmatched_right': '1668'}},
 'matched_df': 'insee_PP_new_matche.gz',
 'unmatched_df': 'insee_PP_new_non_matche.gz'}

In [27]:
unmerged_new = data_merged_new.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_new = data_merged_new.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_new.head()
unmerged_new.iloc[1:40]

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
230,833424716,,,2017-11-20,,,,4 Rue Kruger,13004,13204,,,,2017-11-20,right_only,https://data.inpi.fr/entreprises/833424716
231,833424716,,,2017-11-20,,,,4 Rue Kruger,13004,13204,,,,2017-11-20,right_only,https://data.inpi.fr/entreprises/833424716
232,833578297,,,2017-11-27,,,,,14680,14290,,,,2017-11-27,right_only,https://data.inpi.fr/entreprises/833578297
233,833758857,,,2017-12-29,2017-12-29,,,,26780,26169,,,,2017-12-29,right_only,https://data.inpi.fr/entreprises/833758857
234,833983000,,,2017-12-26,,,,,29120,29171,,,,2017-12-26,right_only,https://data.inpi.fr/entreprises/833983000
235,343202909,,,2017-12-20,,,,,34500,34032,,,,2017-12-20,right_only,https://data.inpi.fr/entreprises/343202909
236,833162118,,,2017-11-21,,,,,34110,34108,,,,2017-11-21,right_only,https://data.inpi.fr/entreprises/833162118
237,833120439,,,2017-11-28,,,,,34230,34016,,,,2017-11-28,right_only,https://data.inpi.fr/entreprises/833120439
238,833120439,,,2017-11-28,,,,,34230,34016,,,,2017-11-28,right_only,https://data.inpi.fr/entreprises/833120439
239,833969744,,,2017-12-14,,,,24 rue Alphonse Daudet,44000,44109,,,,2017-12-14,right_only,https://data.inpi.fr/entreprises/833969744


# Evt PP

In [28]:
# Monthly New Import
year='2017'

In [29]:
# define import paths

## INPI
inpi_origin = 'evt' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'json') #2017_EVT_PP.json
inpi_gz_file_name = "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [30]:
# instantiate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [31]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_evt = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [32]:
# load data into dataframes

## INPI

data_pp_evt= (dd.read_csv(pp_evt, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_evt.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,818626053,,,2017-11-30 00:00:00,,,,,,,,2017-11-30
1,799291299,,,2017-11-24 00:00:00,,,,,,,,2017-11-24
2,817544828,,,2017-12-01 00:00:00,,,,,,,,2017-12-01
3,383932688,,,2017-10-26 00:00:00,,,,,,,,2017-10-26
4,518564398,,,,,,1750.0,1320.0,,,,2017-12-01


In [33]:
# correction of missing zeros in siren
data_pp_evt['siren']=data_pp_evt['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [34]:
# Merge data

data_merged_evt = data_ets_.merge(
    data_pp_evt,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [35]:
# create a check_url for online manual data verification
data_merged_evt = data_merged_evt.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_evt.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,305597478,30559747800034,O,,,2017-12-06 00:00:00,,,,,,,,2017-12-06,both,https://data.inpi.fr/entreprises/305597478
1,305597478,30559747800042,O,,,2017-12-06 00:00:00,,,,,,,,2017-12-06,both,https://data.inpi.fr/entreprises/305597478
2,305597478,30559747800059,O,,,2017-12-06 00:00:00,,,,,,,,2017-12-06,both,https://data.inpi.fr/entreprises/305597478
3,305597478,30559747800067,O,,,2017-12-06 00:00:00,,,,,,,,2017-12-06,both,https://data.inpi.fr/entreprises/305597478
4,305597478,30559747800091,O,,,2017-12-06 00:00:00,,,,,,,,2017-12-06,both,https://data.inpi.fr/entreprises/305597478


In [36]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_evt,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'evt',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '22078',
   'total_match': '20998',
   'total_unmatched_left': '0',
   'total_unmatched_right': '1080'}},
 'matched_df': 'insee_PP_evt_matche.gz',
 'unmatched_df': 'insee_PP_evt_non_matche.gz'}

In [37]:
unmerged_evt = data_merged_evt.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_evt = data_merged_evt.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_evt.head()
unmerged_evt.iloc[1:40]

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
262,327933958,,,,,,,,,,,,,2017-12-11,right_only,https://data.inpi.fr/entreprises/327933958
263,327933958,,,,,,,,13270.0,13039.0,,,,2017-12-11,right_only,https://data.inpi.fr/entreprises/327933958
264,825070006,,,,,2017-12-08 00:00:00,,,,,,,,2017-12-08,right_only,https://data.inpi.fr/entreprises/825070006
265,420508509,,,,,2017-12-22 00:00:00,,,,,,,,2017-12-22,right_only,https://data.inpi.fr/entreprises/420508509
303,752142232,,,,,2017-12-20 00:00:00,,,,,,,,2017-12-20,right_only,https://data.inpi.fr/entreprises/752142232
304,802723171,,,,,2017-12-14 00:00:00,,,,,,,,2017-12-14,right_only,https://data.inpi.fr/entreprises/802723171
305,443863527,,,,,,,,,,,,,2017-12-12,right_only,https://data.inpi.fr/entreprises/443863527
306,828637603,,,,,2017-12-19 00:00:00,,,,,,,,2017-12-19,right_only,https://data.inpi.fr/entreprises/828637603
307,532052552,,,,,2017-12-19 00:00:00,,,,,,,,2017-12-19,right_only,https://data.inpi.fr/entreprises/532052552
308,382954626,,,,,2017-12-13 00:00:00,,,,,,,,2017-12-13,right_only,https://data.inpi.fr/entreprises/382954626


# Partiel PP

In [38]:
# define import paths
inpi_origin = 'partiel' #* origin → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)


In [39]:
inpi_import_gz_filepath

'INPI/TC_1/Stock_processed/partiel_PP.gz'

In [40]:
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [41]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_partiel = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [42]:
# load data into dataframes

## INPI

data_pp_partiel= (dd.read_csv(pp_partiel, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_partiel.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,305616815,1977-10-19,,,,,1160,,,,,2018-01-22
1,305637597,1976-08-11,,,,,1220,,,,,2015-12-02
2,306009127,1976-05-12,,,,,1130,1269.0,,,,2018-03-29
3,308729946,1955-12-02,,,,,1300,,,,,2017-08-04
4,308734003,1981-12-16,,,,,1960,1289.0,,,,2018-04-17


In [43]:
# correction of missing zeros in siren
data_pp_partiel['siren']=data_pp_partiel['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [44]:
# Merge data

data_merged_partiel = data_ets_.merge(
    data_pp_partiel,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [45]:
# create a check_url for online manual data verification
data_merged_partiel = data_merged_partiel.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_partiel.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,036410462,03641046200018,O,1964-02-05,,,,,6600,6004.0,,,,2017-11-17,both,https://data.inpi.fr/entreprises/036410462
1,301105615,30110561500019,O,1974-07-02,,,,,20150,,,,,2017-08-04,both,https://data.inpi.fr/entreprises/301105615
2,303405203,30340520300017,O,1978-03-09,,,,,84210,84001.0,,,,2018-03-05,both,https://data.inpi.fr/entreprises/303405203
3,303502876,30350287600038,O,2001-03-27,,,,,81000,,,,,2018-08-29,both,https://data.inpi.fr/entreprises/303502876
4,303502876,30350287600046,O,2001-03-27,,,,,81000,,,,,2018-08-29,both,https://data.inpi.fr/entreprises/303502876


In [46]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_partiel,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'partiel',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '27918',
   'total_match': '25987',
   'total_unmatched_left': '0',
   'total_unmatched_right': '1931'}},
 'matched_df': 'insee_PP_partiel_matche.gz',
 'unmatched_df': 'insee_PP_partiel_non_matche.gz'}
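
With all four origins processed, the match rates can be compared side by side. The sketch below copies the counts from the `match_dict` outputs of this notebook (after the zero-padding correction for `initial`); the `results` and `match_rates` names are illustrative.

```python
# Counts copied from the match_dict outputs in this notebook.
results = {
    "initial": {"total_rows_origin": "1235130", "total_match": "1189644"},
    "new": {"total_rows_origin": "21530", "total_match": "19862"},
    "evt": {"total_rows_origin": "22078", "total_match": "20998"},
    "partiel": {"total_rows_origin": "27918", "total_match": "25987"},
}

# Match rate in percent for each origin.
match_rates = {
    origin: 100.0 * int(d["total_match"]) / int(d["total_rows_origin"])
    for origin, d in results.items()
}

# Print origins from the highest to the lowest match rate.
for origin, rate in sorted(match_rates.items(), key=lambda item: item[1], reverse=True):
    print("{:<8} {:.2f}%".format(origin, rate))
```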

In [47]:
unmerged_partiel = data_merged_partiel.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_partiel = data_merged_partiel.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_partiel.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
329,417242096,,,1978-02-14,,,,,6130,6069.0,,,,2012-09-27,right_only,https://data.inpi.fr/entreprises/417242096
330,327933958,,,1985-12-12,,,,,13270,13039.0,,,,2017-12-11,right_only,https://data.inpi.fr/entreprises/327933958
331,454007162,,,2017-05-09,2017-05-09,,,,28200,28088.0,,,,2017-09-12,right_only,https://data.inpi.fr/entreprises/454007162
332,833066525,,,2017-11-10,,,,,50610,50066.0,,,,2018-02-22,right_only,https://data.inpi.fr/entreprises/833066525
333,349143669,,,1989-01-18,,,,,62480,,,,,2018-03-06,right_only,https://data.inpi.fr/entreprises/349143669


In [48]:
# Count rows per SIREN, then count how many SIRENs have each row count
test = (data_pp_partiel
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        .assign(pct = lambda x:100*x/x.sum())
       )
test

count,siren,pct
1,16640,98.153719
2,237,1.397983
3,33,0.194656
4,13,0.076683
5,11,0.064885
6,3,0.017696
7,4,0.023595
8,3,0.017696
9,2,0.011797
10,1,0.005899
