# Test Matching Insee/ETS

## INSEE

- https://s3.console.aws.amazon.com/s3/object/calfdata/INSEE/Stock/ETS/
    - INSEE/Stock/ETS/StockEtablissement_utf8.csv
        
```
['siren', 'siret']
```

## INPI

- https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/Stock_processed/
    - INPI/TC_1/Stock_processed/initial_ETS.gz
    - INPI/TC_1/Stock_processed/initial_ETS.json
    
Test columns:

```
["Siren","Date_Immatriculation", "Date_Clôture", "Date_Greffe"]
```

## Saving results

* The list of matched SIREN numbers will be saved according to their nature and origin
  * nature → ACTES/COMPTES/ETS/etc.
  * origine (origin) → initial/partiel/new/evt

Matched records will be saved in calfdata/SIRETISATION/matche/ using the following naming pattern:

* insee_nature_origine_matche.gz
    * ex: insee_pm_initial_matche.gz
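
The naming pattern above can be sketched as a small helper. This is only an illustration: the function names `matched_filename` and `matched_key` are hypothetical, and lowercasing the nature is an assumption inferred from the `insee_pm_initial_matche.gz` example.

```python
def matched_filename(nature: str, origine: str) -> str:
    """Build a file name following insee_nature_origine_matche.gz.

    Assumption: both parts are lowercased, as in the example
    insee_pm_initial_matche.gz.
    """
    return f"insee_{nature.lower()}_{origine.lower()}_matche.gz"


def matched_key(nature: str, origine: str,
                prefix: str = "SIRETISATION/matche") -> str:
    """Prepend the target folder inside the calfdata bucket."""
    return f"{prefix}/{matched_filename(nature, origine)}"


print(matched_filename("PM", "initial"))
print(matched_key("ETS", "initial"))
```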
    
    

## Search engines for manual checks

* Insee
  * http://avis-situation-sirene.insee.fr/IdentificationListeSiret.action
* INPI/TC
  * https://data.inpi.fr/
* Infogreffe
  * https://www.infogreffe.fr/


In [None]:
import boto3, json
import dask.dataframe as dd
import pandas as pd
import Match_inpi_insee.aws_connectors as aws
#from tqdm.notebook import tqdm
#import tqdm
%load_ext autoreload
%autoreload 2

In [None]:
instance_aws = 'https://calfdata.s3.eu-west-3.amazonaws.com'
bucket = 'calfdata'

In [None]:
# Instantiate the AWS connection
AWS_connection = aws.aws_instantiate(instance_aws, bucket)

## Common fields in INSEE/INPI ETS

| INSEE variable            | INPI variable | Comment                 |
|---------------------------|---------------|-------------------------|
| siren                     | Siren         |                         |
| codePostalEtablissement   | Code_Postal   | If Type == 'principale' |
| codePostal2Etablissement  | Code_Postal   | If Type == 'secondaire' |
| codeCommuneEtablissement  | Code_Commune  | If Type == 'principale' |
| codeCommune2Etablissement | Code_Commune  | If Type == 'secondaire' |
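
The correspondence in the table can be expressed as a lookup that returns the join keys for a given establishment type. This is a hedged sketch: `INSEE_COLUMNS_BY_TYPE` and `join_keys` are hypothetical names, and the type labels follow the table (`'principale'`, `'secondaire'`); the values actually stored in the `Type` column of the data may be coded differently.

```python
# INPI column -> INSEE column, per establishment type (labels from the table).
INSEE_COLUMNS_BY_TYPE = {
    "principale": {
        "Code_Postal": "codePostalEtablissement",
        "Code_Commune": "codeCommuneEtablissement",
    },
    "secondaire": {
        "Code_Postal": "codePostal2Etablissement",
        "Code_Commune": "codeCommune2Etablissement",
    },
}


def join_keys(type_ets: str):
    """Return (left_on, right_on) lists for an INSEE (left) / INPI (right) merge."""
    mapping = INSEE_COLUMNS_BY_TYPE[type_ets]
    left_on = ["siren"] + list(mapping.values())   # INSEE side
    right_on = ["Siren"] + list(mapping.keys())    # INPI side
    return left_on, right_on


left_on, right_on = join_keys("principale")
print(left_on)
print(right_on)
```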

In [None]:
insee = AWS_connection.url_instance_bucket(path_file = 'INSEE/Stock/ETS/StockEtablissement_utf8.csv')
ets = AWS_connection.url_instance_bucket(path_file = 'INPI/TC_1/Stock_processed/initial_ETS.gz')
ets

## Matching the principal establishment

Here we keep only the variables common to INSEE and INPI, and restrict the INPI data to principal establishments (Type == 'PRI').

In [None]:
# Load the INSEE stock lazily with dask, keeping only the columns needed
# to match principal establishments. Identifiers are read as strings so
# that leading zeros are preserved.
insee_cols = [
    'siren',
    'siret',
    'codePostalEtablissement',
    #'codePostal2Etablissement',
    'codeCommuneEtablissement',
    #'codeCommune2Etablissement',
]
data_insee_ = dd.read_csv(
    insee,
    usecols=insee_cols,
    dtype={col: 'object' for col in insee_cols},
)

# Load the INPI ETS stock. A gzip file cannot be split into blocks,
# hence blocksize=None. The result is materialized as a pandas DataFrame
# and restricted to principal establishments (Type == 'PRI').
ets_cols = ['Type', 'Siren', 'Code_Postal', 'Code_Commune']
data_ets_ = (
    dd.read_csv(
        ets,
        usecols=ets_cols,
        dtype={col: 'object' for col in ets_cols},
        compression='gzip',
        blocksize=None,
        low_memory=False,
    )
    .compute()
    .loc[lambda x: x['Type'].isin(['PRI'])]
)

In [None]:
data_ets_['Type'].unique()

In [None]:
data_ets_.shape

In [None]:
data_ets_.head()

In [None]:
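# Right join: keep every INPI principal establishment and attach the INSEE
# siret when both the SIREN and the postal code agree. The `_merge` column
# added by indicator=True records whether a match was found.
data_merged = data_insee_.merge(
    data_ets_,
    how='right',
    left_on=['siren', 'codePostalEtablissement'],
    right_on=['Siren', 'Code_Postal'],
    indicator=True,
).compute()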
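
To illustrate how the `_merge` indicator can be read after a right join, here is a minimal pandas sketch on invented toy data (the SIREN, SIRET and postal code values below are made up and do not come from the INSEE or INPI files). With `how='right'`, each row is flagged either `both` (an INSEE row was found) or `right_only` (INPI row without an INSEE match).

```python
import pandas as pd

# Toy stand-ins for the INSEE and INPI tables (all values are invented).
insee_toy = pd.DataFrame({
    "siren": ["111111111", "222222222"],
    "siret": ["11111111100011", "22222222200022"],
    "codePostalEtablissement": ["75001", "69001"],
})
inpi_toy = pd.DataFrame({
    "Siren": ["111111111", "333333333"],
    "Code_Postal": ["75001", "13001"],
})

merged_toy = insee_toy.merge(
    inpi_toy,
    how="right",
    left_on=["siren", "codePostalEtablissement"],
    right_on=["Siren", "Code_Postal"],
    indicator=True,
)

# Split on the indicator: matched rows carry a siret, unmatched rows do not.
matched = merged_toy[merged_toy["_merge"] == "both"]
unmatched = merged_toy[merged_toy["_merge"] == "right_only"]

print(matched[["Siren", "siret"]])
print(unmatched[["Siren", "Code_Postal"]])
```

The same filter applied to `data_merged` would separate the INPI establishments that received a `siret` from those still to be matched.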