# Test Matching Insee/PP

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
        - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_PP.gz
    - INPI/TC_1/Stock_processed/initial_PP.json
    

## Sauvegarde

* La liste des SIREN matchés sera sauvegardée selon leur nature et origine
  * nature → ACTES/COMPTES/ETS/etc
  * origine → initial/partiel/new/evt

Les matchés seront sauvegardé dans calfdata/SIRETISATION/matche/ au format suivant:

* insee_nature_origine_matche.gz
    * ex: insee_pp_initial_matche.gz
    
    

## Moteur de recherche TEST

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/


In [92]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
import Match_inpi_insee.aws_connectors as aws
#from tqdm.notebook import tqdm
#import tqdm
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [93]:
# update of dask to avoid error
# 'AttributeError: 'DataFrame' object has no attribute '_getitem_array'

! python -m pip install --upgrade dask
! python -m pip install fsspec
! python -m pip install --upgrade s3fs

Requirement already up-to-date: dask in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (2.10.1)
[33mYou are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[33mYou are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
Requirement already up-to-date: s3fs in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.4.0)
Requirement not upgraded as not directly required: fsspec>=0.6.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (0.6.2)
Requirement not upgraded as not directly required: boto3>=1.9.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1.11.5)
Requirement not upgraded as not directly required: botocore>=1.12.91 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from s3fs) (1

In [94]:
instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
bucket = 'calfdata'

# define import paths
## INSEE
source='insee'
insee_csv_relative_filepath = "INSEE/Stock/ETS/StockEtablissement_utf8.csv"

## INPI
inpi_import_root_path = "INPI/TC_1/Stock_processed"
inpi_nature = 'PP' #* nature → ACTES/COMPTES/ETS/etc
inpi_origin = 'initial' #* origine → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [95]:
# instanciate AWS connection
AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [96]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
insee = AWS_connection.url_instance_bucket(path_file = insee_csv_relative_filepath)
pp = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

## Cleaning of the tables based on unmatched cases

### initial matching results

```
{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1234287',
   'total_match': '1182979',
   'total_unmatched_left': '0',
   'total_unmatched_right': '51308'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```
### Cause n°1 for unmatched : missing zeros before siren in INPI

```
### check if there is an issue with siren beginning with zeros
data_merged['siren_length'] = data_merged['siren'].apply(lambda x: len(x))
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
```

```
# SIREN with less than 9 car haven't been merged
merged.groupby('siren_length')['siren'].count()
 
siren_length
9    1182979
Name: siren, dtype: int64
```

```
###  Most of unmerged are 9 car
unmerged.groupby('siren_length')['siren'].count()
siren_length
1        7
6        1
7      726
8     6484
9    44090
Name: siren, dtype: int64
```

#### We used a fillO function to complete siren from INPI before matching

#### After correction, total_unmatched_right : - 5822


```
# check if there is an issue with siren beginning with zeros
data_merged['siren_length'] = data_merged['siren'].apply(lambda x: len(x))
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]

# Unmerged have decreased 51308=>45486. (all siren are now 9 car)
unmerged.groupby('siren_length')['siren'].count()

siren_length
9    45486
Name: siren, dtype: int64


{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

```

In [97]:
# load data into dataframes
## INSEE
data_ets_= dd.read_csv(insee, 
                       usecols = ['siren', 'siret','statutDiffusionEtablissement'],
                       dtype={'siren':'object',
                              'siret':'object',
                             'statutDiffusionEtablissement': 'object'}
                      )

In [141]:
## INPI Columns
usecols__ =["Siren",
          'Adresse_Ligne1',
          'Code_Commune',
          'Code_Postal',
          'Conjoint_Collab_Pseudo',
          'DAP_Adresse_Ligne1',
          'DAP_Code_Commune',
          'Date_Immatriculation',
          'Date_1re_Immatriculation',
          'Date_Radiation',
          'Date_Greffe',
          'Sans_Activité'
          ]

dtypes__ = {'Siren':'object',
            'Adresse_Ligne1': 'object',
       'Code_Commune': 'object',
       'Code_Postal': 'object',
       'Conjoint_Collab_Pseudo': 'object',
       'DAP_Adresse_Ligne1': 'object',
       'DAP_Code_Commune': 'object',
       'Date_Immatriculation': 'object',
       'Date_1re_Immatriculation': 'object',
       'Date_Radiation': 'object',
       'Date_Greffe': 'object',
       'Sans_Activité': 'object'}

# INITIAL PP

In [98]:
## INPI Data
data_pp_= (dd.read_csv(pp, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,442377040,2002-06-18,,,,,1000,,,,,2015-08-04
1,809861297,2015-02-27,,,,,1140,1348.0,,,,2015-02-27
2,480952712,2005-02-23,,,,,1380,1025.0,,,,2017-05-03
3,388239667,1992-10-14,,,,,69130,,,,,2017-05-03
4,828427245,2017-04-18,,,,,1220,1143.0,,,,2017-04-18


In [99]:
# correction of missing zeros in siren
def fill0(text, num):
    while len(text) < num:
        text = str('0' + text)
    return text

data_pp_['siren']=data_pp_['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [100]:
# Merge data

#NB 1 : Due to computation limits, it is not possible to put this part in a class

#NB 2 : Computation limit error : UnpicklingError: invalid load key, '\xdd'.
# => restart kernel

data_merged = data_ets_.merge(
    data_pp_,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [101]:
# create a check_url for online manual data verification
data_merged = data_merged.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,5741509,574150900011,O,1957-06-01,,,,,4270.0,,,,,2009-01-01,both,https://data.inpi.fr/entreprises/005741509
1,5915533,591553300011,O,1959-12-28,,,,,80220.0,80373.0,,,,2009-01-01,both,https://data.inpi.fr/entreprises/005915533
2,6472419,647241900032,O,1964-05-27,,,,9 rue de Conan,44780.0,44098.0,,,,2017-05-04,both,https://data.inpi.fr/entreprises/006472419
3,6641104,664110400010,O,1966-07-25,,,,,,,,,,2016-06-06,both,https://data.inpi.fr/entreprises/006641104
4,7241532,724153200014,O,1980-06-24,,,,,,,,,,2016-02-04,both,https://data.inpi.fr/entreprises/007241532


In [102]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'initial',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '1235130',
   'total_match': '1189644',
   'total_unmatched_left': '0',
   'total_unmatched_right': '45486'}},
 'matched_df': 'insee_PP_initial_matche.gz',
 'unmatched_df': 'insee_PP_initial_non_matche.gz'}

In [103]:
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
unmerged.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
14241,504630310,,,2017-04-26,,,,,1500,1004.0,,,,2017-04-26,right_only,https://data.inpi.fr/entreprises/504630310
14242,823218854,,,2016-10-19,,,,,1090,1258.0,,,,2016-12-07,right_only,https://data.inpi.fr/entreprises/823218854
14243,546110156,,,1961-01-20,,,,,1110,,,,,2009-01-01,right_only,https://data.inpi.fr/entreprises/546110156
14244,490068327,,,2015-12-16,,,,,1480,1166.0,,,,2015-12-17,right_only,https://data.inpi.fr/entreprises/490068327
14245,793005000,,,2017-01-12,,,,,1150,1202.0,,,,2017-01-12,right_only,https://data.inpi.fr/entreprises/793005000


In [104]:
# IN PROGRESS
# Describe results & check

In [105]:
# Unmatched sample
data_merged.loc[lambda x: x['_merge'].isin(['right_only'])].head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
14241,504630310,,,2017-04-26,,,,,1500,1004.0,,,,2017-04-26,right_only,https://data.inpi.fr/entreprises/504630310
14242,823218854,,,2016-10-19,,,,,1090,1258.0,,,,2016-12-07,right_only,https://data.inpi.fr/entreprises/823218854
14243,546110156,,,1961-01-20,,,,,1110,,,,,2009-01-01,right_only,https://data.inpi.fr/entreprises/546110156
14244,490068327,,,2015-12-16,,,,,1480,1166.0,,,,2015-12-17,right_only,https://data.inpi.fr/entreprises/490068327
14245,793005000,,,2017-01-12,,,,,1150,1202.0,,,,2017-01-12,right_only,https://data.inpi.fr/entreprises/793005000


In [106]:
# Count #SIRET by SIREN
test = (data_pp_
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        .assign(pct = lambda x:100*x/x.sum())
        #.rename(columns={'count': 'siret_count'})
       )
test

Unnamed: 0_level_0,siren,pct
count,Unnamed: 1_level_1,Unnamed: 2_level_1
1,741492,98.905294
2,7823,1.043485
3,304,0.04055
4,54,0.007203
5,9,0.0012
6,4,0.000534
7,7,0.000934
8,1,0.000133
10,1,0.000133
11,1,0.000133


In [107]:
# check some cases
unmerged = data_merged.loc[lambda x: x['_merge'].isin(['right_only'])]
merged = data_merged.loc[lambda x: x['_merge'].isin(['both'])]
unmerged['siren'].head()

14241    504630310
14242    823218854
14243    546110156
14244    490068327
14245    793005000
Name: siren, dtype: object

In [108]:
#to do : verif si possible d'avoir plusieurs fois le meme SIREN dans data_pp_

In [109]:
# cases found N°3:
## created recently : 2020
#  => utiliser le champ dateCreationEtablissement
# 2019 stock INSEE
# 2017 stock INPI


In [110]:
test_creation_date = dd.read_csv(insee, 
                       usecols = ['siren', 'siret','dateCreationEtablissement'],
                       dtype={'siren':'object',
                              'siret':'object',
                             'dateCreationEtablissement': 'object'}
                      )

In [111]:
test_creation_date.head()

Unnamed: 0,siren,siret,dateCreationEtablissement
0,325175,32517500016,2000-09-26
1,325175,32517500024,2008-05-20
2,325175,32517500032,2009-05-27
3,325175,32517500040,2011-10-21
4,325175,32517500057,2014-01-07


In [112]:
test_creation_date['dateCreationEtablissement'] = test_creation_date['dateCreationEtablissement'].astype('datetime64[ns]') 

In [113]:
test_creation_date.head()

Unnamed: 0,siren,siret,dateCreationEtablissement
0,325175,32517500016,2000-09-26
1,325175,32517500024,2008-05-20
2,325175,32517500032,2009-05-27
3,325175,32517500040,2011-10-21
4,325175,32517500057,2014-01-07


In [114]:
#test_creation_date['dateCreationEtablissement'].max().compute() #ko

In [115]:
test_creation_date.info() 

<class 'dask.dataframe.core.DataFrame'>
Columns: 3 entries, siren to dateCreationEtablissement
dtypes: datetime64[ns](1), object(2)

# NEW PP

In [116]:
# Monthly New Import
year='2017'

In [117]:
# define import paths
## INSEE
#source='insee'
#insee_csv_relative_filepath = "INSEE/Stock/ETS/StockEtablissement_utf8.csv"

## INPI
#inpi_import_root_path = "INPI/TC_1/Stock_processed"
#inpi_nature = 'PP' #* nature → ACTES/COMPTES/ETS/etc
inpi_origin = 'new' #* origine → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'json') #2017_NEW_PP.json
inpi_gz_file_name = "{}_{}_{}.{}".format(year,inpi_origin.upper(),inpi_nature.upper(),'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)

# define save_to paths
inpi_save_to_root_path = 'SIRETISATION'

In [118]:
inpi_import_gz_filepath

'INPI/TC_1/Stock_processed/2017_NEW_PP.gz'

In [119]:
# instanciate AWS connection
#AWS_connection = aws.aws_instantiate(instance_aws, bucket)
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [120]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_new = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [121]:
# load data into dataframes

## INPI
dtypes__ = {'Siren':'object',
            'Adresse_Ligne1': 'object',
       'Code_Commune': 'object',
       'Code_Postal': 'object',
       'Conjoint_Collab_Pseudo': 'object',
       'DAP_Adresse_Ligne1': 'object',
       'DAP_Code_Commune': 'object',
       'Date_Immatriculation': 'object',
       'Date_1re_Immatriculation': 'object',
       'Date_Radiation': 'object',
       'Date_Greffe': 'object',
       'Sans_Activité': 'object'}

data_pp_new= (dd.read_csv(pp_new, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_new.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,801884248,2017-11-30,,,,,1240,1069,,,,2017-11-30
1,833599822,2017-11-30,,,,,1300,1034,,,,2017-11-30
2,832710818,2017-10-19,,,,,1500,1004,,,,2017-10-19
3,830157384,2017-11-10,,,,,1240,1145,,,,2017-11-10
4,833642630,2017-12-01,,,,,1480,1075,,,,2017-12-01


In [122]:
# correction of missing zeros in siren
data_pp_new['siren']=data_pp_new['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [128]:
# Merge data

#NB 1 : Due to computation limits, it is not possible to put this part in a class

#NB 2 : Computation limit error : UnpicklingError: invalid load key, '\xdd'.
# => restart kernel

data_merged_new = data_ets_.merge(
    data_pp_new,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [129]:
# create a check_url for online manual data verification
data_merged_new = data_merged_new.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_new.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,314488321,31448832100014,O,2017-12-18,,,,,34290,34009,,,,2017-12-18,both,https://data.inpi.fr/entreprises/314488321
1,314488321,31448832100022,O,2017-12-18,,,,,34290,34009,,,,2017-12-18,both,https://data.inpi.fr/entreprises/314488321
2,314488321,31448832100030,O,2017-12-18,,,,,34290,34009,,,,2017-12-18,both,https://data.inpi.fr/entreprises/314488321
3,338065873,33806587300014,O,2017-12-29,,,,,31820,31417,,,,2017-12-29,both,https://data.inpi.fr/entreprises/338065873
4,338065873,33806587300022,O,2017-12-29,,,,,31820,31417,,,,2017-12-29,both,https://data.inpi.fr/entreprises/338065873


In [130]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged_new,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'new',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '21530',
   'total_match': '19862',
   'total_unmatched_left': '0',
   'total_unmatched_right': '1668'}},
 'matched_df': 'insee_PP_new_matche.gz',
 'unmatched_df': 'insee_PP_new_non_matche.gz'}

In [131]:
unmerged_new = data_merged_new.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_new = data_merged_new.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_new.head()
unmerged_new.iloc[1:40]

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
229,382505865,,,2017-12-08,,,,,9100,9225,,,,2017-12-08,right_only,https://data.inpi.fr/entreprises/382505865
230,833424716,,,2017-11-20,,,,4 Rue Kruger,13004,13204,,,,2017-11-20,right_only,https://data.inpi.fr/entreprises/833424716
231,833424716,,,2017-11-20,,,,4 Rue Kruger,13004,13204,,,,2017-11-20,right_only,https://data.inpi.fr/entreprises/833424716
232,833578297,,,2017-11-27,,,,,14680,14290,,,,2017-11-27,right_only,https://data.inpi.fr/entreprises/833578297
233,833758857,,,2017-12-29,2017-12-29,,,,26780,26169,,,,2017-12-29,right_only,https://data.inpi.fr/entreprises/833758857


# Partiel PP

In [135]:
# define import paths
inpi_origin = 'partiel' #* origine → initial/partiel/new/evt
inpi_json_file_name =  "{}_{}.{}".format(inpi_origin,inpi_nature,'json')
inpi_gz_file_name = "{}_{}.{}".format(inpi_origin,inpi_nature,'gz')

inpi_import_json_filepath ="{}/{}".format(inpi_import_root_path,inpi_json_file_name)
inpi_import_gz_filepath = "{}/{}".format(inpi_import_root_path,inpi_gz_file_name)



In [137]:
inpi_import_gz_filepath

'INPI/TC_1/Stock_processed/partiel_PP.gz'

In [139]:
AWS_connection.uploadFileBucket(pathfile =inpi_import_json_filepath)

In [140]:
# get files

# json recap
with open(inpi_json_file_name, 'r') as f:
    pp_json = json.load(f)

# data    
pp_partiel = AWS_connection.url_instance_bucket(path_file = inpi_import_gz_filepath)

In [144]:
# load data into dataframes

## INPI

data_pp_partiel= (dd.read_csv(pp_partiel, 
                      usecols =usecols__,
                      dtype=dtypes__,
                      compression='gzip',
                      blocksize=None,
                      low_memory=False
                      )
           .rename(columns={"Siren": "siren"})
          )
data_pp_partiel.head()

Unnamed: 0,siren,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe
0,305616815,1977-10-19,,,,,1160,,,,,2018-01-22
1,305637597,1976-08-11,,,,,1220,,,,,2015-12-02
2,306009127,1976-05-12,,,,,1130,1269.0,,,,2018-03-29
3,308729946,1955-12-02,,,,,1300,,,,,2017-08-04
4,308734003,1981-12-16,,,,,1960,1289.0,,,,2018-04-17


In [145]:
# correction of missing zeros in siren
data_pp_partiel['siren']=data_pp_partiel['siren'].apply(lambda x: fill0(x,9))

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('siren', 'object'))



In [146]:
# Merge data

data_merged_partiel = data_ets_.merge(
    data_pp_partiel,
    how= 'right',
    left_on = 'siren', 
    right_on ='siren',
    indicator = True).compute()

In [147]:
# create a check_url for online manual data verification
data_merged_partiel = data_merged_partiel.assign(url = lambda x :
            'https://data.inpi.fr/entreprises/' + x["siren"] ) 
data_merged_partiel.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
0,36410462,3641046200018,O,1964-02-05,,,,,6600,6004.0,,,,2017-11-17,both,https://data.inpi.fr/entreprises/036410462
1,301105615,30110561500019,O,1974-07-02,,,,,20150,,,,,2017-08-04,both,https://data.inpi.fr/entreprises/301105615
2,303405203,30340520300017,O,1978-03-09,,,,,84210,84001.0,,,,2018-03-05,both,https://data.inpi.fr/entreprises/303405203
3,303502876,30350287600038,O,2001-03-27,,,,,81000,,,,,2018-08-29,both,https://data.inpi.fr/entreprises/303502876
4,303502876,30350287600046,O,2001-03-27,,,,,81000,,,,,2018-08-29,both,https://data.inpi.fr/entreprises/303502876


In [148]:
# compute matching
match_dict = AWS_connection.gen_full_matching(source = source, 
                        nature = inpi_nature, 
                        origin = inpi_origin, 
                        source_path = inpi_import_root_path,
                        save_to_path = inpi_save_to_root_path, 
                        data_merged = data_merged,
                        save='yes')
match_dict

{'matching_json': {'nature': 'PP',
  'origin': 'partiel',
  'path': 'INPI/TC_1/Stock_processed',
  'details': {'total_rows_origin': '21530',
   'total_match': '19862',
   'total_unmatched_left': '0',
   'total_unmatched_right': '1668'}},
 'matched_df': 'insee_PP_partiel_matche.gz',
 'unmatched_df': 'insee_PP_partiel_non_matche.gz'}

In [151]:
unmerged_partiel = data_merged_partiel.loc[lambda x: x['_merge'].isin(['right_only'])]
merged_partiel = data_merged_partiel.loc[lambda x: x['_merge'].isin(['both'])]
unmerged_partiel.head()

Unnamed: 0,siren,siret,statutDiffusionEtablissement,Date_Immatriculation,Date_1re_Immatriculation,Date_Radiation,Sans_Activité,Adresse_Ligne1,Code_Postal,Code_Commune,DAP_Adresse_Ligne1,DAP_Code_Commune,Conjoint_Collab_Pseudo,Date_Greffe,_merge,url
329,417242096,,,1978-02-14,,,,,6130,6069.0,,,,2012-09-27,right_only,https://data.inpi.fr/entreprises/417242096
330,327933958,,,1985-12-12,,,,,13270,13039.0,,,,2017-12-11,right_only,https://data.inpi.fr/entreprises/327933958
331,454007162,,,2017-05-09,2017-05-09,,,,28200,28088.0,,,,2017-09-12,right_only,https://data.inpi.fr/entreprises/454007162
332,833066525,,,2017-11-10,,,,,50610,50066.0,,,,2018-02-22,right_only,https://data.inpi.fr/entreprises/833066525
333,349143669,,,1989-01-18,,,,,62480,,,,,2018-03-06,right_only,https://data.inpi.fr/entreprises/349143669


In [152]:
# Count #SIRET by SIREN
test = (data_pp_partiel
        .groupby('siren')['siren']
        .count()
        .rename('count')
        .reset_index()
        .groupby('count')
        .count()
        .compute()
        .assign(pct = lambda x:100*x/x.sum())
       )
test

Unnamed: 0_level_0,siren,pct
count,Unnamed: 1_level_1,Unnamed: 2_level_1
1,16640,98.153719
2,237,1.397983
3,33,0.194656
4,13,0.076683
5,11,0.064885
6,3,0.017696
7,4,0.023595
8,3,0.017696
9,2,0.011797
10,1,0.005899
