# Creating the transformed INSEE table with the new variables used for siretisation

# Objective(s)

* Create the INSEE table from the July data
* Create the variables used to run the siretisation tests

# Metadata

* Epic: Epic 6
* US: US 4
* Date Begin: 9/28/2020
* Duration Task: 1
* Description: Create the variables that will be used to run the siretisation tests
* Step type: Transform table
* Status: Active
  * Change Status task: Active
  * Update table: Modify rows
* Source URL: US 03 Creation Variables data INPI et INSEE
* Task type: Jupyter Notebook
* Users: Thomas Pernet
* Watchers: Thomas Pernet
* User Account: https://937882855452.signin.aws.amazon.com/console
* Estimated Log points: 10
* Task tag: #athena,#lookup-table,#sql,#data-preparation,#insee
* Toggl Tag: #documentation

# Input Cloud Storage [AWS/GCP]

## Table/file

* Origin: 
  * Athena
* Name: 
  * ets_insee_raw_juillet
* Github: 
  * 

# Destination Output/Delivery

## Table/file

* Origin: 
  * Athena
* Name:
  * ets_insee_transformed
* GitHub:


In [1]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_s3 import service_s3
from awsPy.aws_glue import service_glue
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import os, shutil, json

path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = r"{}/credential_AWS.json".format(parent_path)

# Define the region and bucket once, then reuse them below
region = 'eu-west-3'
bucket = 'calfdata'
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = region)


In [2]:
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = region)
client= con.client_boto()
s3 = service_s3.connect_S3(client = client,
                      bucket = bucket, verbose = True) 
glue = service_glue.connect_glue(client = client) 

In [3]:
pandas_setting = True
if pandas_setting:
    cm = sns.light_palette("green", as_cmap=True)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

# Step: creating the transformed INSEE table

The transformed INSEE table is prepared in two steps. The first step is, of course, to load the raw INSEE table into the database. We use the table dated July 2020 so that it matches the one used by the Datum team.

In a second step, we create the variables summarised in the table below.

| Tables | Variables | Comment | Bullet_inputs | Bullet_point_regex | Inputs | US_md | query_md_gitlab | Pattern_regex |
|--------|-----------|---------|---------------|--------------------|--------|-------|-----------------|---------------|
| INSEE | voie_clean | Full-length street type extracted from the address. For example, where INSEE gives CH for "chemin", the variable holds CHEMIN. Requires the external table `type_voie` | | | | [2953](https://tree.taiga.io/project/olivierlubet-air/us/2953) | [etape-1-pr%C3%A9paration-voie_clean](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/04_ETS_add_variables_insee.md#etape-1-pr%C3%A9paration-voie_clean) | |
| INSEE | indiceRepetitionEtablissement_full | Full-length repetition index; for example B becomes BIS and T becomes TER | indiceRepetitionEtablissement | Regles_speciales | indiceRepetitionEtablissement | [2953](https://tree.taiga.io/project/olivierlubet-air/us/2953) | | Regles_speciales |
| INSEE | adresse_reconstituee_insee | Concatenation of the address fields and removal of extra spaces | numeroVoieEtablissement indiceRepetitionEtablissement_full voie_clean libelleVoieEtablissement complementAdresseEtablissement | debut/fin espace espace Upper | numeroVoieEtablissement,indiceRepetitionEtablissement_full,voie_clean,libelleVoieEtablissement,complementAdresseEtablissement | [2954](https://tree.taiga.io/project/olivierlubet-air/us/2954) | [etape-2-preparation-adress_reconstituee_insee](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/04_ETS_add_variables_insee.md#etape-2-preparation-adress_reconstituee_insee) | debut/fin espace,espace,Upper |
| INSEE | adresse_distance_insee | Concatenation of the address fields with spaces and articles removed. Used to compute the score that measures how similar or dissimilar two addresses are (INPI vs INSEE) | numeroVoieEtablissement indiceRepetitionEtablissement_full voie_clean libelleVoieEtablissement complementAdresseEtablissement | article digit debut/fin espace espace Upper | numeroVoieEtablissement,indiceRepetitionEtablissement_full,voie_clean,libelleVoieEtablissement,complementAdresseEtablissement | [3004](https://tree.taiga.io/project/olivierlubet-air/us/3004) | [etape-3-adresse_distance_insee](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/04_ETS_add_variables_insee.md#etape-3-adresse_distance_insee) | article,digit,debut/fin espace,espace,Upper |
| INSEE | list_numero_voie_matching_insee | List of all the numbers found in the INSEE address | numeroVoieEtablissement indiceRepetitionEtablissement_full voie_clean libelleVoieEtablissement complementAdresseEtablissement | article digit debut/fin espace | numeroVoieEtablissement,indiceRepetitionEtablissement_full,voie_clean,libelleVoieEtablissement,complementAdresseEtablissement | [3004](https://tree.taiga.io/project/olivierlubet-air/us/3004) | [etape-4-creation-liste-num%C3%A9ro-de-voie](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/04_ETS_add_variables_insee.md#etape-4-creation-liste-num%C3%A9ro-de-voie) | article,digit,debut/fin espace |
| INSEE | ville_matching | Regex cleaning of the city name and removal of spaces. The same cleaning logic is applied on the INPI side | libelleCommuneEtablissement | article digit debut/fin espace espace Regles_speciales | libelleCommuneEtablissement | [2954](https://tree.taiga.io/project/olivierlubet-air/us/2954) | [etape-2-cr%C3%A9ation-ville_matching](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/04_ETS_add_variables_insee.md#etape-2-cr%C3%A9ation-ville_matching) | article,digit,debut/fin espace,espace,Regles_speciales |
| INSEE | count_initial_insee | Number of siret (establishments) per siren (company) | siren | | siren | [2955](https://tree.taiga.io/project/olivierlubet-air/us/2955) | [etape-5-count_initial_insee](https://scm.saas.cagip.group.gca/PERNETTH/inseeinpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/04_ETS_add_variables_insee.md#etape-5-count_initial_insee) | |
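
As a rough illustration of two of the simpler variables in the table, the sketch below mirrors `indiceRepetitionEtablissement_full` and `count_initial_insee` in plain Python. The B and T expansions come from the table; the Q and C expansions follow the step 8 SQL query. The function names are hypothetical, and the sample rows only reuse siren/siret pairs shown in the table preview.

```python
from collections import Counter

# Expansion of the INSEE repetition index (B/T from the table above,
# Q/C as in the step 8 SQL query).
INDICE_FULL = {"B": "BIS", "T": "TER", "Q": "QUATER", "C": "QUINQUIES"}


def indice_repetition_full(indice):
    """Return the full-length repetition index, or the input unchanged."""
    return INDICE_FULL.get(indice, indice)


def count_initial_insee(rows):
    """Count the number of siret (establishments) per siren (company)."""
    return Counter(row["siren"] for row in rows)


# Sample rows reduced to the siren/siret pair.
rows = [
    {"siren": "420911075", "siret": "42091107500018"},
    {"siren": "420911075", "siret": "42091107500026"},
    {"siren": "420911059", "siret": "42091105900012"},
]
counts = count_initial_insee(rows)
print(indice_repetition_full("B"), counts["420911075"])
```

In the actual pipeline both variables are computed in Athena; the Python version is only meant to make the rules easy to check by hand.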
    
## Prepare `TABLES.CREATION` parameters
    
The JSON config file already contains the INPI preparation steps. We keep adding the queries to execute to this JSON so that the whole process is described in one single file.

In [4]:
### If chinese characters, set  ensure_ascii=False
s3.download_file(key = 'DATA/ETL/parameters_ETL.json')
with open('parameters_ETL.json', 'r') as fp:
    parameters = json.load(fp)

## 2. Prepare `TABLES.CREATION`

This part usually starts with raw/transformed data in S3. The typical layout in S3 is:

- `DATA/RAW_DATA` or `DATA/UNZIP_DATA_APPEND_ALL` or `DATA/TRANSFORMED`. One of our rules is that if the user needs to create a table from a CSV/JSON (raw or transformed), the query should be written under the key `TABLES.CREATION` and the notebook placed in the folder `01_prepare_tables`

One or more notebooks in the folder `01_prepare_tables` are used to create the raw tables. Please use the notebook named `XX_template_table_creation_AWS` to create a table using the key `TABLES.CREATION`

In [5]:
table_raw_insee = [{
    "database": "ets_insee",
    "name": "ets_insee_raw_juillet",
    "output_id": "",
    "separator": ",",
    "s3URI": "s3://calfdata/INSEE/00_rawData/ETS_01_07_2020",
    "schema": [
        {'Name': 'siren', 'Type': 'string', 'Comment': ''},
 {'Name': 'nic', 'Type': 'string', 'Comment': ''},
 {'Name': 'siret', 'Type': 'string', 'Comment': ''},
 {'Name': 'statutdiffusionetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'datecreationetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'trancheeffectifsetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'anneeeffectifsetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'activiteprincipaleregistremetiersetablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'datederniertraitementetablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'etablissementsiege', 'Type': 'string', 'Comment': ''},
 {'Name': 'nombreperiodesetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'complementadresseetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'numerovoieetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'indicerepetitionetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'typevoieetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellevoieetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'codepostaletablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellecommuneetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellecommuneetrangeretablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'distributionspecialeetablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'codecommuneetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'codecedexetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellecedexetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'codepaysetrangeretablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellepaysetrangeretablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'complementadresse2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'numerovoie2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'indicerepetition2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'typevoie2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellevoie2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'codepostal2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellecommune2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellecommuneetranger2etablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'distributionspeciale2etablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'codecommune2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'codecedex2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellecedex2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'codepaysetranger2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'libellepaysetranger2etablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'datedebut', 'Type': 'string', 'Comment': ''},
 {'Name': 'etatadministratifetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'enseigne1etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'enseigne2etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'enseigne3etablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'denominationusuelleetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'activiteprincipaleetablissement', 'Type': 'string', 'Comment': ''},
 {'Name': 'nomenclatureactiviteprincipaleetablissement',
  'Type': 'string',
  'Comment': ''},
 {'Name': 'caractereemployeuretablissement', 'Type': 'string', 'Comment': ''}
    ]
}
]
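
Since every column is declared as `string`, the schema list above could also be generated from the column names rather than typed by hand. A minimal sketch, assuming the names are available as a Python list (the `columns` excerpt and the `build_schema` helper are hypothetical and not part of `awsPy`):

```python
def build_schema(columns, col_type="string"):
    """Build the schema list; names are lower-cased as in the cell above."""
    return [{"Name": c.lower(), "Type": col_type, "Comment": ""} for c in columns]


# Hypothetical excerpt of the INSEE header; the full list is in the cell above.
columns = ["siren", "nic", "siret", "statutDiffusionEtablissement"]
schema = build_schema(columns)
print(schema[3])
```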

To remove an item from the list, use `pop` with the index to remove. For example, `parameters['TABLES']['CREATION']['ALL_SCHEMA'].pop(6)` removes the 7th item (indices are zero-based)

In [6]:
to_remove = False
if to_remove:
    parameters['TABLES']['CREATION']['ALL_SCHEMA'].pop(0)

In [7]:
parameters['TABLES']['CREATION']['ALL_SCHEMA'].extend(table_raw_insee)
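
Because `extend` appends unconditionally, re-running the cell above adds a second copy of the table definition to `ALL_SCHEMA`. One possible guard, sketched here with a hypothetical `upsert_tables` helper and a stand-in `parameters` dictionary, replaces an entry that has the same `name` instead of duplicating it:

```python
def upsert_tables(all_schema, new_tables):
    """Add each table definition, replacing any existing entry with the same name."""
    for table in new_tables:
        for i, existing in enumerate(all_schema):
            if existing["name"] == table["name"]:
                all_schema[i] = table
                break
        else:
            # No entry with this name yet: append it.
            all_schema.append(table)
    return all_schema


# Stand-in for the real parameters file, reduced to the keys used here.
parameters = {"TABLES": {"CREATION": {"ALL_SCHEMA": []}}}
table_raw_insee = [{"database": "ets_insee", "name": "ets_insee_raw_juillet"}]

all_schema = parameters["TABLES"]["CREATION"]["ALL_SCHEMA"]
upsert_tables(all_schema, table_raw_insee)
upsert_tables(all_schema, table_raw_insee)  # second run: no duplicate
print(len(all_schema))
```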

Query executed

In [8]:
for key, value in parameters["TABLES"]["CREATION"].items():
    if key == "ALL_SCHEMA":
        for table_info in value:
            if table_info['name'] in ['ets_insee_raw_juillet']:
                # CREATE QUERY

                ### Create top/bottom query
                table_top = parameters["TABLES"]["CREATION"]["template"]["top"].format(
                            table_info["database"], table_info["name"]
                        )
                table_bottom = parameters["TABLES"]["CREATION"]["template"][
                            "bottom_OpenCSVSerde"
                        ].format(table_info["separator"], table_info["s3URI"])

                ### Create middle
                table_middle = ""
                nb_var = len(table_info["schema"])
                for i, val in enumerate(table_info["schema"]):
                    if i == nb_var - 1:
                        table_middle += parameters["TABLES"]["CREATION"]["template"][
                                    "middle"
                                ].format(val['Name'], val['Type'], ")")
                    else:
                        table_middle += parameters["TABLES"]["CREATION"]["template"][
                                    "middle"
                                ].format(val['Name'], val['Type'], ",")

                query = (
                    table_top + 
                    "\n" + 
                    table_middle +
                    "\n" + 
                    table_bottom
                )
                
                print(query)


CREATE EXTERNAL TABLE IF NOT EXISTS ets_insee.ets_insee_raw_juillet (
siren string ,nic string ,siret string ,statutdiffusionetablissement string ,datecreationetablissement string ,trancheeffectifsetablissement string ,anneeeffectifsetablissement string ,activiteprincipaleregistremetiersetablissement string ,datederniertraitementetablissement string ,etablissementsiege string ,nombreperiodesetablissement string ,complementadresseetablissement string ,numerovoieetablissement string ,indicerepetitionetablissement string ,typevoieetablissement string ,libellevoieetablissement string ,codepostaletablissement string ,libellecommuneetablissement string ,libellecommuneetrangeretablissement string ,distributionspecialeetablissement string ,codecommuneetablissement string ,codecedexetablissement string ,libellecedexetablissement string ,codepaysetrangeretablissement string ,libellepaysetrangeretablissement string ,complementadresse2etablissement string ,numerovoie2etablissement string ,indicere
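
The loop above relies on three templates stored in `parameters_ETL.json`, which are not shown in this notebook. The sketch below reproduces the assembly with assumed templates: `top` and `middle` are inferred from the printed output, while `bottom_OpenCSVSerde` is a guess at a typical Athena `OpenCSVSerde` clause and may differ from the real file. `build_create_query` is a hypothetical helper.

```python
# Assumed templates; the real ones live in parameters_ETL.json.
TEMPLATE = {
    "top": "CREATE EXTERNAL TABLE IF NOT EXISTS {0}.{1} (",
    "middle": "{0} {1} {2}",
    "bottom_OpenCSVSerde": (
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' "
        "WITH SERDEPROPERTIES ('separatorChar' = '{0}') "
        "LOCATION '{1}' "
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    ),
}


def build_create_query(table_info, template=TEMPLATE):
    """Assemble the CREATE EXTERNAL TABLE statement for one table definition."""
    top = template["top"].format(table_info["database"], table_info["name"])
    last = len(table_info["schema"]) - 1
    # Every column is followed by a comma, except the last one,
    # which closes the column list with a parenthesis.
    middle = "".join(
        template["middle"].format(col["Name"], col["Type"], ")" if i == last else ",")
        for i, col in enumerate(table_info["schema"])
    )
    bottom = template["bottom_OpenCSVSerde"].format(
        table_info["separator"], table_info["s3URI"]
    )
    return "\n".join([top, middle, bottom])


table_info = {
    "database": "ets_insee",
    "name": "ets_insee_raw_juillet",
    "separator": ",",
    "s3URI": "s3://calfdata/INSEE/00_rawData/ETS_01_07_2020",
    "schema": [
        {"Name": "siren", "Type": "string", "Comment": ""},
        {"Name": "nic", "Type": "string", "Comment": ""},
    ],
}
print(build_create_query(table_info))
```

Wrapping the assembly in a function would also avoid repeating the same loop in the cell that prints the query and in the cell that runs it.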

In [9]:
json_filename = 'parameters_ETL.json'
# Write the updated parameters locally, then push the file back to S3
with open(json_filename, "w") as f:
    json.dump(parameters, f)
s3.upload_file(json_filename, 'DATA/ETL')

In [10]:
s3.download_file(key = 'DATA/ETL/parameters_ETL.json')
with open('parameters_ETL.json', 'r') as fp:
    parameters = json.load(fp)

Move `parameters_ETL.json` to the parent folder `01_prepare_tables`

In [11]:
s3_output = parameters['GLOBAL']['QUERIES_OUTPUT']
db = parameters['GLOBAL']['DATABASE']

In [12]:
for key, value in parameters["TABLES"]["CREATION"].items():
    if key == "ALL_SCHEMA":
        for table_info in value:
            if table_info['name'] in ['ets_insee_raw_juillet']:

                # CREATE QUERY

                ### Create top/bottom query
                table_top = parameters["TABLES"]["CREATION"]["template"]["top"].format(
                            table_info["database"], table_info["name"]
                        )
                table_bottom = parameters["TABLES"]["CREATION"]["template"][
                            "bottom_OpenCSVSerde"
                        ].format(table_info["separator"], table_info["s3URI"])

                ### Create middle
                table_middle = ""
                nb_var = len(table_info["schema"])
                for i, val in enumerate(table_info["schema"]):
                    if i == nb_var - 1:
                        table_middle += parameters["TABLES"]["CREATION"]["template"][
                                    "middle"
                                ].format(val['Name'], val['Type'], ")")
                    else:
                        table_middle += parameters["TABLES"]["CREATION"]["template"][
                                    "middle"
                                ].format(val['Name'], val['Type'], ",")

                query = table_top + table_middle + table_bottom

                ## DROP IF EXISTS
                ## Run the drop in the same database as the CREATE statement

                s3.run_query(
                                query="DROP TABLE IF EXISTS {}".format(table_info["name"]),
                                database=table_info["database"],
                                s3_output=s3_output
                        )

                ## RUN QUERY
                output = s3.run_query(
                            query=query,
                            database=table_info["database"],
                            s3_output=s3_output,
                            filename=None,  ## Add filename to print dataframe
                            destination_key=None,  ### Add destination key if need to copy output
                        )

                ## SAVE QUERY ID
                table_info['output_id'] = output['QueryID']

                ### UPDATE CATALOG
                #glue.update_schema_table(
                #            database=table_info["database"],
                #            table=table_info["name"],
                #            schema=table_info["schema"],
                #        )

                print(output)

{'Results': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2020, 10, 22, 11, 54, 1, 31000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2020, 10, 22, 11, 54, 1, 553000, tzinfo=tzlocal())}, 'QueryID': 'f1f483d3-5fa6-4cad-bc60-fd4d72636093'}


Preview of the created tables

In [13]:
for key, value in parameters["TABLES"]["CREATION"].items():
    if key == "ALL_SCHEMA":
        for table_info in value:
            if table_info['name'] in ['ets_insee_raw_juillet']:
                print(table_info['name'])
                
                query = """
                SELECT *
                FROM  {}
                LIMIT 10
                """.format(table_info['name'])
                
                output = s3.run_query(
                            query=query,
                            database=table_info["database"],
                            s3_output=s3_output,
                            filename="table_{}".format(table_info['name']),  ## Add filename to print dataframe
                            destination_key=None,  ### Add destination key if need to copy output
                        )
                
                display(output)


ets_insee_raw_juillet


Unnamed: 0,siren,nic,siret,statutdiffusionetablissement,datecreationetablissement,trancheeffectifsetablissement,anneeeffectifsetablissement,activiteprincipaleregistremetiersetablissement,datederniertraitementetablissement,etablissementsiege,nombreperiodesetablissement,complementadresseetablissement,numerovoieetablissement,indicerepetitionetablissement,typevoieetablissement,libellevoieetablissement,codepostaletablissement,libellecommuneetablissement,libellecommuneetrangeretablissement,distributionspecialeetablissement,codecommuneetablissement,codecedexetablissement,libellecedexetablissement,codepaysetrangeretablissement,libellepaysetrangeretablissement,complementadresse2etablissement,numerovoie2etablissement,indicerepetition2etablissement,typevoie2etablissement,libellevoie2etablissement,codepostal2etablissement,libellecommune2etablissement,libellecommuneetranger2etablissement,distributionspeciale2etablissement,codecommune2etablissement,codecedex2etablissement,libellecedex2etablissement,codepaysetranger2etablissement,libellepaysetranger2etablissement,datedebut,etatadministratifetablissement,enseigne1etablissement,enseigne2etablissement,enseigne3etablissement,denominationusuelleetablissement,activiteprincipaleetablissement,nomenclatureactiviteprincipaleetablissement,caractereemployeuretablissement
0,420911026,11,42091102600011,O,1989-12-31,NN,,,2014-09-16T15:50:14,True,4,,10.0,,RUE,PUITS DES CLERCS,19310,AYEN,,,19015,,,,,,,,,,,,,,,,,,,2009-01-01,A,,,,,01.25Z,NAFRev2,N
1,420911034,23,42091103400023,O,1999-01-01,,,,,True,1,,99.0,,RUE,DE LEYSOTTE,33400,TALENCE,,,33522,,,,,,,,,,,,,,,,,,,1999-01-01,F,,,,,74.1G,NAF1993,N
2,420911042,18,42091104200018,O,1998-10-29,NN,,,2018-08-29T08:56:12,True,4,,57.0,,CHE,ST ANTOINE A ST JOSEPH,13015,MARSEILLE 15,,,13215,,,,,,,,,,,,,,,,,,,2018-06-30,F,,,,,68.20A,NAFRev2,N
3,420911059,12,42091105900012,O,1998-11-01,NN,,,,True,1,,46.0,B,RUE,MELUSINE,86480,ROUILLE,,,86213,,,,,,,,,,,,,,,,,,,2001-04-06,F,,,,,52.4L,NAF1993,O
4,420911067,15,42091106700015,O,1998-11-01,NN,,,2019-11-14T14:00:43,True,1,,2.0,,PL,DU MARCHE,53170,MESLAY-DU-MAINE,,,53152,,,,,,,,,,,,,,,,,,,2000-05-04,F,,,,,55.4B,NAF1993,O
5,420911075,18,42091107500018,O,1998-11-04,,,,2008-11-08T01:38:25,False,3,,5.0,,RUE,DE MULHOUSE,75002,PARIS 2,,,75102,,,,,,,,,,,,,,,,,,,2000-09-15,F,,,,,51.4A,NAF1993,N
6,420911075,26,42091107500026,O,2000-09-15,NN,,,2008-01-04T23:59:19,True,3,,38.0,,RUE,D ENGHIEN,75010,PARIS 10,,,75110,,,,,,,,,,,,,,,,,,,2008-01-01,A,,,,,46.41Z,NAFRev2,N
7,420911083,12,42091108300012,O,1998-11-12,NN,,,2019-11-14T14:00:52,True,1,,29.0,,RUE,DEL ESPIGOLAIRE,66140,CANET-EN-ROUSSILLON,,,66037,,,,,,,,,,,,,,,,,,,2002-06-30,F,,,,,70.2C,NAF1993,N
8,420911091,15,42091109100015,O,1998-10-01,01,2016.0,,2019-11-14T14:00:34,True,3,,,,,LE BON ACCUEIL,35370,BREAL-SOUS-VITRE,,,35038,,,,,,,,,,,,,,,,,,,2008-01-01,A,,,,,81.10Z,NAFRev2,O
9,420911117,18,42091111700018,O,1998-07-21,NN,,,2019-11-14T14:01:05,False,5,HHP HHP FRANCE,50.0,,RUE,MARCEL DASSAULT,92100,BOULOGNE-BILLANCOURT,,,92012,,,,,,,,,,,,,,,,,,,2003-12-25,F,,,,,72.2C,NAFRev1,O


## Creating the transformed table

The transformed table contains 6 additional variables that will be used to run the tests. The 6 variables are:

* `voie_clean`
    - Full-length (unabbreviated) street type. For example, where INSEE gives CH for "chemin", the variable holds CHEMIN
* `count_initial_insee`
    - Number of siret (establishments) per siren (company)
* `ville_matching`
    - INSEE city name (`libelleCommuneEtablissement`) cleaned in the same way as on the INPI side
* `adresse_reconstituee_insee`
    - INSEE address rebuilt from the street number `numeroVoieEtablissement`, the repetition index `indiceRepetitionEtablissement_full`, the unabbreviated street type `voie_clean`, the street name `libelleVoieEtablissement` and `complementAdresseEtablissement`, with extra spaces removed
* `adresse_distance_insee`
    - Same concatenation as `adresse_reconstituee_insee`, with digits and articles also removed; used to score the similarity between the INPI and INSEE addresses
* `list_enseigne`
    - Concatenation of:
        - `enseigne1etablissement`
        - `enseigne2etablissement`
        - `enseigne3etablissement`
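
For `list_enseigne`, the step 8 query concatenates the three fields with commas, splits the result, keeps distinct values and drops empty strings; an empty list then becomes NULL. A minimal Python sketch of the same idea follows. The `list_enseigne` function and the sample values are hypothetical, and missing values are treated as empty strings here, which is an assumption about the raw data.

```python
def list_enseigne(enseigne1, enseigne2, enseigne3):
    """Return the distinct non-empty trade names in first-seen order, or None."""
    joined = ",".join(value or "" for value in (enseigne1, enseigne2, enseigne3))
    distinct = []
    for name in joined.split(","):
        if name != "" and name not in distinct:
            distinct.append(name)
    # Mirror the CASE WHEN cardinality(...) = 0 THEN NULL rule.
    return distinct or None


print(list_enseigne("CAFE DU PORT", "", "CAFE DU PORT"))
print(list_enseigne("", None, ""))
```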

To build the regex pattern, we use a list of street types available in the Gitlab and from INSEE, which we then edited manually.

- Input
    - CSV: [TypeVoie.csv](https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/data/input/Parameters/typeVoieEtablissement.csv)
        - CSV in S3: [Parameters/upper_stop.csv](https://s3.console.aws.amazon.com/s3/buckets/calfdata/Parameters/TYPE_VOIE/)
        - To be created as a table
    - Athena: type_voie
        - CSV in S3: [Parameters/type_voie.csv](https://s3.console.aws.amazon.com/s3/buckets/calfdata/Parameters/TYPE_VOIE_SQL/)
- Python code: [Exemple Input 1](https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/05_redaction_US/04_prep_voie_num_2697.md#exemple-input-1)
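
The `type_voie` table acts as a lookup that the step 8 query attaches to the INSEE rows with a LEFT JOIN, so rows without a known street type keep a NULL `voie_clean`. The sketch below imitates that lookup with a dictionary. The three entries are illustrative (only CH to CHEMIN is taken from the text above) and do not reproduce the actual content of `type_voie`; `add_voie_clean` is a hypothetical helper.

```python
# Illustrative excerpt of a street-type lookup; the real table is type_voie.
TYPE_VOIE = {"CH": "CHEMIN", "PL": "PLACE", "RUE": "RUE"}


def add_voie_clean(rows, lookup=TYPE_VOIE):
    """Left-join style lookup: unknown or missing street types give None."""
    return [
        {**row, "voie_clean": lookup.get(row.get("typevoieetablissement"))}
        for row in rows
    ]


rows = [
    {"siret": "42091106700015", "typevoieetablissement": "PL"},
    {"siret": "42091109100015", "typevoieetablissement": ""},
]
for row in add_voie_clean(rows):
    print(row["siret"], row["voie_clean"])
```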

As a reminder, this is step 8 of the data preparation.

The address variables are cleaned according to the following scheme:

| Table | Variables                  | Article | Digit | Leading/trailing space | Space | Accent | Upper |
|-------|----------------------------|---------|-------|------------------------|-------|--------|-------|
| INSEE | adresse_distance_insee     | X       | X     | X                      | X     | X      | X     |
| INSEE | adresse_reconstituee_insee |         |       | X                      | X     | X      | X     |
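
To make the two rows of this table concrete, here is a Python approximation of the cleaning applied by the step 8 SQL. It is a sketch, not a translation of the query: the function names are hypothetical, the article list is copied from the SQL, and Python's `re` and `unicodedata` stand in for the Presto functions, so edge cases may differ.

```python
import re
import unicodedata

# Article list copied from the step 8 SQL query.
ARTICLES = (
    "AU|AUX|AVEC|CE|CES|DANS|DE|DES|DU|ELLE|EN|ET|EUX|IL|ILS|LA|LE|LES"
)


def _strip_accents(text):
    """Decompose to NFD and drop combining marks (the Accent column)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def adresse_reconstituee(*fields):
    """Join the address fields, drop punctuation, tidy spaces, upper-case."""
    text = " ".join(f or "" for f in fields)
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return _strip_accents(text.upper())


def adresse_distance(*fields):
    """Same as above, additionally removing digits and articles."""
    text = adresse_reconstituee(*fields)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(rf"\b(?:{ARTICLES})\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Fields: number, repetition index, street type, street name, complement.
fields = ("2", "", "PLACE", "DU MARCHE", "")
print(adresse_reconstituee(*fields))
print(adresse_distance(*fields))
```

Unlike the SQL, this version upper-cases the text before removing articles, which is a simplification chosen here so that lower-case articles are caught as well.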

In [14]:
step_8 = {
   "STEPS_8":{
      "name":"Creation des variables pour la réalisation des tests pour la siretisation",
      "execution":[
         {
            "database":"ets_insee",
            "name":"ets_insee_transformed",
            "output_id":"",
            "query":{
               "top":" WITH remove_empty_siret AS ( SELECT siren, siret, dateCreationEtablissement, etablissementSiege, etatAdministratifEtablissement, complementAdresseEtablissement, numeroVoieEtablissement, indiceRepetitionEtablissement, CASE WHEN indiceRepetitionEtablissement = 'B' THEN 'BIS' WHEN indiceRepetitionEtablissement = 'T' THEN 'TER' WHEN indiceRepetitionEtablissement = 'Q' THEN 'QUATER' WHEN indiceRepetitionEtablissement = 'C' THEN 'QUINQUIES' ELSE indiceRepetitionEtablissement END as indiceRepetitionEtablissement_full, typeVoieEtablissement, libelleVoieEtablissement, codePostalEtablissement, libelleCommuneEtablissement, libelleCommuneEtrangerEtablissement, distributionSpecialeEtablissement, codeCommuneEtablissement, codeCedexEtablissement, libelleCedexEtablissement, codePaysEtrangerEtablissement, libellePaysEtrangerEtablissement, enseigne1Etablissement, enseigne2Etablissement, enseigne3Etablissement, array_remove( array_distinct( SPLIT( concat( enseigne1etablissement, ',', enseigne2etablissement, ',', enseigne3etablissement ), ',' ) ), '' ) as list_enseigne FROM ets_insee.ets_insee_raw_juillet ) ",
                "middle":" SELECT * FROM ( WITH concat_adress AS( SELECT siren, siret, dateCreationEtablissement, etablissementSiege, etatAdministratifEtablissement, codePostalEtablissement, codeCommuneEtablissement, libelleCommuneEtablissement, ville_matching, numeroVoieEtablissement, array_distinct( regexp_extract_all( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( CONCAT( COALESCE(numeroVoieEtablissement, ''), ' ', COALESCE( indiceRepetitionEtablissement_full, '' ), ' ', COALESCE(voie_clean, ''), ' ', COALESCE(libelleVoieEtablissement, ''), ' ', COALESCE( complementAdresseEtablissement, '' ) ), '[^\w\s]| +', ' ' ), '(?:^|(?<= ))(AU|AUX|AVEC|CE|CES|DANS|DE|DES|DU|ELLE|EN|ET|EUX|IL|ILS|LA|LE|LES)(?:(?= )|$)', '' ), '\s+\s+', ' ' ), '[0-9]+' ) ) AS list_numero_voie_matching_insee, typeVoieEtablissement, voie_clean, libelleVoieEtablissement, complementAdresseEtablissement, indiceRepetitionEtablissement_full, REGEXP_REPLACE( NORMALIZE( UPPER( trim( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( CONCAT( COALESCE(numeroVoieEtablissement, ''), ' ', COALESCE( indiceRepetitionEtablissement_full, '' ), ' ', COALESCE(voie_clean, ''), ' ', COALESCE(libelleVoieEtablissement, ''), ' ', COALESCE( complementAdresseEtablissement, '' ) ), '[^\w\s]| +', ' ' ), '\s\s+', ' ' ), '^\s+|\s+$', '' ) ) ), NFD ), '\pM', '' ) AS adresse_reconstituee_insee, REGEXP_REPLACE( NORMALIZE( UPPER( REGEXP_REPLACE( trim( REGEXP_REPLACE( REGEXP_REPLACE( CONCAT( COALESCE(numeroVoieEtablissement, ''), ' ', COALESCE( indiceRepetitionEtablissement_full, '' ), ' ', COALESCE(voie_clean, ''), ' ', COALESCE(libelleVoieEtablissement, ''), ' ', COALESCE( complementAdresseEtablissement, '' ) ), '[^\w\s]|\d+| +', ' ' ), '(?:^|(?<= ))(AU|AUX|AVEC|CE|CES|DANS|DE|DES|DU|ELLE|EN|ET|EUX|IL|ILS|LA|LE|LES)(?:(?= )|$)', '' ) ), '\s+\s+', ' ' ) ), NFD ), '\pM', '' ) AS adresse_distance_insee, enseigne1Etablissement, enseigne2Etablissement, enseigne3Etablissement, list_enseigne FROM ( SELECT siren, siret, 
dateCreationEtablissement, etablissementSiege, etatAdministratifEtablissement, codePostalEtablissement, codeCommuneEtablissement, libelleCommuneEtablissement, REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( libelleCommuneEtablissement, '^\d+\s|\s\d+\s|\s\d+$', '' ), '^LA\s+|^LES\s+|^LE\s+|\\(.*\\)|^L(ES|A|E) | L(ES|A|E) | L(ES|A|E)$|CEDEX | CEDEX | CEDEX|^E[R*] | E[R*] | E[R*]$', '' ), '^STE | STE | STE$|^STES | STES | STES', 'SAINTE' ), '^ST | ST | ST$', 'SAINT' ), 'S/|^S | S | S$', 'SUR' ), '/S', 'SOUS' ), '[^\w\s]|\([^()]*\)|ER ARRONDISSEMENT|E ARRONDISSEMENT|" \
"|^SUR$|CEDEX|[0-9]+|\s+', '' ) as ville_matching, libelleVoieEtablissement, complementAdresseEtablissement, numeroVoieEtablissement, indiceRepetitionEtablissement_full, typeVoieEtablissement, enseigne1Etablissement, enseigne2Etablissement, enseigne3Etablissement, list_enseigne FROM remove_empty_siret ) LEFT JOIN inpi.type_voie ON typevoieetablissement = type_voie.voie_matching ) ",
                "bottom":" SELECT count_initial_insee, concat_adress.siren, siret, dateCreationEtablissement, etablissementSiege, etatAdministratifEtablissement, codePostalEtablissement, codeCommuneEtablissement, libelleCommuneEtablissement, ville_matching, libelleVoieEtablissement, complementAdresseEtablissement, numeroVoieEtablissement, CASE WHEN cardinality( list_numero_voie_matching_insee ) = 0 THEN NULL ELSE list_numero_voie_matching_insee END as list_numero_voie_matching_insee, indiceRepetitionEtablissement_full, typeVoieEtablissement, voie_clean, adresse_reconstituee_insee, adresse_distance_insee, enseigne1Etablissement, enseigne2Etablissement, enseigne3Etablissement, CASE WHEN cardinality(list_enseigne) = 0 THEN NULL ELSE list_enseigne END AS list_enseigne FROM concat_adress LEFT JOIN ( SELECT siren, COUNT(siren) as count_initial_insee FROM concat_adress GROUP BY siren ) as count_siren ON concat_adress.siren = count_siren.siren ) " }
         }
      ],
       "schema":[
               {
                  "Name":"",
                  "Type":"",
                  "Comment":""
               }
            ]
   }
}
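To make the intent of the regular expressions in the query easier to follow, here is a minimal Python sketch of two derived variables, `adresse_distance_insee` and `list_enseigne`. It is only an approximation under stated assumptions: the function names are hypothetical, Python's `re` engine is not identical to the Presto engine used by Athena, and the order-preserving deduplication is an assumption about `array_distinct`.

```python
import re
import unicodedata

# Stop words copied from the query above
STOP_WORDS = "AU|AUX|AVEC|CE|CES|DANS|DE|DES|DU|ELLE|EN|ET|EUX|IL|ILS|LA|LE|LES"
STOP_WORDS_RE = re.compile(r"(?:^|(?<= ))(" + STOP_WORDS + r")(?:(?= )|$)")


def strip_accents(text):
    """Mimic NORMALIZE(..., NFD) followed by removing '\\pM' (combining marks)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))


def adresse_distance(numero, indice, voie_clean, libelle, complement):
    """Hypothetical helper approximating adresse_distance_insee."""
    # COALESCE(x, '') for each part, joined with single spaces
    raw = " ".join(part or "" for part in (numero, indice, voie_clean, libelle, complement))
    # Replace punctuation, digits and runs of spaces with a single space
    cleaned = re.sub(r"[^\w\s]|\d+| +", " ", raw)
    # Drop stop words that stand alone between spaces
    cleaned = STOP_WORDS_RE.sub("", cleaned)
    # Trim, collapse repeated whitespace, upper-case, remove accents
    cleaned = re.sub(r"\s+", " ", cleaned.strip())
    return strip_accents(cleaned.upper())


def list_enseigne(enseigne1, enseigne2, enseigne3):
    """Hypothetical helper approximating list_enseigne (None when empty)."""
    items = ",".join([enseigne1, enseigne2, enseigne3]).split(",")
    # dict.fromkeys keeps the first occurrence of each non-empty value
    distinct = list(dict.fromkeys(item for item in items if item != ""))
    return distinct or None


print(adresse_distance("39", "", "RUE", "DU HAUT DE LA NOUE", ""))
print(list_enseigne("MAIRIE", "", ""))
```

The first call reproduces the value shown for the first row of the table preview further down (`RUE HAUT NOUE`).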

In [15]:
to_remove = False
if to_remove:
    parameters['TABLES']['PREPARATION']['ALL_SCHEMA'].pop(-1)

In [16]:
parameters['TABLES']['PREPARATION']['ALL_SCHEMA'].append(step_8)

Executed query

In [17]:
for key, value in parameters["TABLES"]["PREPARATION"].items():
    if key == "ALL_SCHEMA":
        ### LOOP STEPS
        for i, steps in enumerate(value):
            step_name = "STEPS_{}".format(i)
            if step_name in [ "STEPS_8"]:
                print('\n', steps[step_name]['name'], '\n')
                for j, step_n in enumerate(steps[step_name]["execution"]):
                    ### COMPILE QUERY
                    query = (
                        table_top
                        + "\n"
                        + step_n["query"]["top"]
                        + "\n"
                        + step_n["query"]["middle"]
                        + "\n"
                        + step_n["query"]["bottom"]
                    )

                    print(query)



 Creation des variables pour la réalisation des tests pour la siretisation 

CREATE EXTERNAL TABLE IF NOT EXISTS ets_insee.ets_insee_raw_juillet (
 WITH remove_empty_siret AS ( SELECT siren, siret, dateCreationEtablissement, etablissementSiege, etatAdministratifEtablissement, complementAdresseEtablissement, numeroVoieEtablissement, indiceRepetitionEtablissement, CASE WHEN indiceRepetitionEtablissement = 'B' THEN 'BIS' WHEN indiceRepetitionEtablissement = 'T' THEN 'TER' WHEN indiceRepetitionEtablissement = 'Q' THEN 'QUATER' WHEN indiceRepetitionEtablissement = 'C' THEN 'QUINQUIES' ELSE indiceRepetitionEtablissement END as indiceRepetitionEtablissement_full, typeVoieEtablissement, libelleVoieEtablissement, codePostalEtablissement, libelleCommuneEtablissement, libelleCommuneEtrangerEtablissement, distributionSpecialeEtablissement, codeCommuneEtablissement, codeCedexEtablissement, libelleCedexEtablissement, codePaysEtrangerEtablissement, libellePaysEtrangerEtablissement, enseigne1Etabliss

In [18]:
json_filename ='parameters_ETL.json'
with open(json_filename, "w") as f:
    json.dump(parameters, f)
s3.upload_file(json_filename, 'DATA/ETL')

In [19]:
for key, value in parameters["TABLES"]["PREPARATION"].items():
    if key == "ALL_SCHEMA":
        ### LOOP STEPS
        for i, steps in enumerate(value):
            step_name = "STEPS_{}".format(i)
            if step_name in ['STEPS_8']:

                ### LOOP EXECUTION WITHIN STEP
                for j, step_n in enumerate(steps[step_name]["execution"]):

                    ### DROP IF EXIST
                    s3.run_query(
                        query="DROP TABLE IF EXISTS {}.{}".format(step_n["database"], step_n["name"]),
                        database=db,
                        s3_output=s3_output,
                    )

                    ### CREATE TOP
                    table_top = parameters["TABLES"]["PREPARATION"]["template"][
                        "top"
                    ].format(step_n["database"], step_n["name"],)

                    ### COMPILE QUERY
                    query = (
                        table_top
                        + step_n["query"]["top"]
                        + step_n["query"]["middle"]
                        + step_n["query"]["bottom"]
                    )
                    output = s3.run_query(
                        query=query,
                        database=db,
                        s3_output=s3_output,
                        filename=None,  ## Add filename to print dataframe
                        destination_key=None,  ### Add destination key if need to copy output
                    )

                    ## SAVE QUERY ID
                    step_n["output_id"] = output["QueryID"]

                    ### UPDATE CATALOG
                    #glue.update_schema_table(
                    #    database=step_n["database"],
                    #    table=step_n["name"],
                    #    schema=steps[step_name]["schema"],
                    #)

                    print(output)

{'Results': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2020, 10, 22, 11, 55, 11, 860000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2020, 10, 22, 11, 56, 44, 479000, tzinfo=tzlocal())}, 'QueryID': 'f8be1a95-a81f-4c76-b912-fc88a90533f4'}
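The same nested loop over `ALL_SCHEMA` appears in several cells. As a sketch only, it could be factored into two small helpers; `iter_executions` and `compile_query` are hypothetical names that do not exist in the notebook or in `awsPy`, and the toy `parameters` dictionary below only mimics the structure used above.

```python
def iter_executions(parameters, wanted_steps):
    """Yield (step_name, execution) for each execution of the wanted steps."""
    all_schema = parameters["TABLES"]["PREPARATION"]["ALL_SCHEMA"]
    for i, step in enumerate(all_schema):
        step_name = "STEPS_{}".format(i)
        if step_name in wanted_steps and step_name in step:
            for execution in step[step_name]["execution"]:
                yield step_name, execution


def compile_query(table_top, execution):
    """Join the table header and the three query parts with newlines."""
    query = execution["query"]
    return "\n".join([table_top, query["top"], query["middle"], query["bottom"]])


# Toy structure with two steps, each holding a single execution
toy_parameters = {
    "TABLES": {
        "PREPARATION": {
            "ALL_SCHEMA": [
                {"STEPS_0": {"execution": [
                    {"name": "a", "query": {"top": "T0", "middle": "M0", "bottom": "B0"}}
                ]}},
                {"STEPS_1": {"execution": [
                    {"name": "b", "query": {"top": "T1", "middle": "M1", "bottom": "B1"}}
                ]}},
            ]
        }
    }
}

for step_name, execution in iter_executions(toy_parameters, ["STEPS_1"]):
    print(step_name, execution["name"])
    print(compile_query("CREATE TABLE x AS", execution))
```

Each cell would then only contain what differs: printing the query, running it, or previewing the table.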


Preview of the created tables

In [21]:
for key, value in parameters["TABLES"]["PREPARATION"].items():
    if key == "ALL_SCHEMA":
        ### LOOP STEPS
        for i, steps in enumerate(value):
            step_name = "STEPS_{}".format(i)
            if step_name in ['STEPS_8']:
                print('\n', steps[step_name]['name'], '\n')
                for j, step_n in enumerate(steps[step_name]["execution"]):
                    query = """
                    SELECT *
                    FROM {}
                    LIMIT 10
                    """.format(step_n['name'])
                    
                    output = s3.run_query(
                    query=query,
                    database='ets_insee',
                    s3_output=s3_output,
                    filename='show_{}'.format(step_n['name']),  ## Add filename to print dataframe
                    destination_key=None,  ### Add destination key if need to copy output
                )
                    
                    display(output)



 Creation des variables pour la réalisation des tests pour la siretisation 



| | count_initial_insee | siren | siret | datecreationetablissement | etablissementsiege | etatadministratifetablissement | codepostaletablissement | codecommuneetablissement | libellecommuneetablissement | ville_matching | libellevoieetablissement | complementadresseetablissement | numerovoieetablissement | list_numero_voie_matching_insee | indicerepetitionetablissement_full | typevoieetablissement | voie_clean | adresse_reconstituee_insee | adresse_distance_insee | enseigne1etablissement | enseigne2etablissement | enseigne3etablissement | list_enseigne |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 321137085 | 32113708500023 | | True | F | 92390 | 92078 | VILLENEUVE-LA-GARENNE | VILLENEUVELAGARENNE | DU HAUT DE LA NOUE | | 39.0 | [39] | | RUE | RUE | 39 RUE DU HAUT DE LA NOUE | RUE HAUT NOUE | | | | |
| 1 | 1 | 321137341 | 32113734100020 | | True | F | 75019 | 75119 | PARIS 19 | PARIS | ARCHEREAU | | 14.0 | [14] | | RUE | RUE | 14 RUE ARCHEREAU | RUE ARCHEREAU | | | | |
| 2 | 1 | 321137374 | 32113737400013 | | True | F | 94410 | 94069 | SAINT-MAURICE | SAINTMAURICE | DU DOCTEUR DECORSE | | 74.0 | [74] | | RUE | RUE | 74 RUE DU DOCTEUR DECORSE | RUE DOCTEUR DECORSE | | | | |
| 3 | 1 | 321137457 | 32113745700024 | | True | F | 94250 | 94037 | GENTILLY | GENTILLY | VICTOR MARQUIGNY | | | | | RUE | RUE | RUE VICTOR MARQUIGNY | RUE VICTOR MARQUIGNY | | | | |
| 4 | 1 | 321137598 | 32113759800017 | 1981-02-01 | True | A | 97100 | 97105 | BASSE-TERRE | BASSETERRE | JEAN JAURES | RIVIERE DES PERES | 35.0 | [35] | | RUE | RUE | 35 RUE JEAN JAURES RIVIERE DES PERES | RUE JEAN JAURES RIVIERE PERES | | | | |
| 5 | 1 | 321138083 | 32113808300027 | 1985-01-01 | True | A | 13001 | 13201 | MARSEILLE 1 | MARSEILLE | NEUVE SAINT MARTIN | | 8.0 | [8] | | RUE | RUE | 8 RUE NEUVE SAINT MARTIN | RUE NEUVE SAINT MARTIN | | | | |
| 6 | 1 | 321138307 | 32113830700020 | | True | F | 35190 | 35337 | TINTENIAC | TINTENIAC | DE L'ECOTAY | | | | | RUE | RUE | RUE DE L ECOTAY | RUE L ECOTAY | | | | |
| 7 | 2 | 321138398 | 32113839800029 | 1994-12-13 | True | A | 28200 | 28389 | THIVILLE | THIVILLE | DE LUTZ | | 2.0 | [2] | | RUE | RUE | 2 RUE DE LUTZ | RUE LUTZ | | | | |
| 8 | 1 | 321139065 | 32113906500031 | 1985-03-01 | True | F | 60800 | 60176 | CREPY-EN-VALOIS | CREPYENVALOIS | HENRI LAROCHE | | 59.0 | [59] | | RUE | RUE | 59 RUE HENRI LAROCHE | RUE HENRI LAROCHE | | | | |
| 9 | 1 | 321139172 | 32113917200019 | 1970-03-01 | True | F | 75001 | 75101 | PARIS 1 | PARIS | DE RICHELIEU | | 40.0 | [40] | | RUE | RUE | 40 RUE DE RICHELIEU | RUE RICHELIEU | | | | |


# Analytics

The cells below execute the job in the key `ANALYSIS`. You need to change the `primary_key` and `secondary_key`.

It is not possible to retrieve the Glue schema with Boto3 on Windows, so we have to copy the schema manually.

In [34]:
schema = {
	"StorageDescriptor": {
		"Columns":  [
				{
					"Name": "count_initial_insee",
					"Type": "bigint",
					"comment": ""
				},
				{
					"Name": "siren",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "siret",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "datecreationetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "etablissementsiege",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "etatadministratifetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "codepostaletablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "codecommuneetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "libellecommuneetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "ville_matching",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "libellevoieetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "complementadresseetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "numerovoieetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "list_numero_voie_matching_insee",
					"Type": "array<string>",
					"comment": ""
				},
				{
					"Name": "indicerepetitionetablissement_full",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "typevoieetablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "voie_clean",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "adresse_reconstituee_insee",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "adresse_distance_insee",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "enseigne1etablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "enseigne2etablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "enseigne3etablissement",
					"Type": "string",
					"comment": ""
				},
				{
					"Name": "list_enseigne",
					"Type": "array<string>",
					"comment": ""
				}
			],
		"location": "s3://calfdata/SQL_OUTPUT_ATHENA/tables/f8be1a95-a81f-4c76-b912-fc88a90533f4/",
		"inputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
		"outputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
		"compressed": "false",
		"numBuckets": "0",
		"SerDeInfo": {
			"name": "ets_insee_transformed",
			"serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
			"parameters": {}
		},
		"bucketCols": [],
		"sortCols": [],
		"parameters": {},
		"SkewedInfo": {},
		"storedAsSubDirectories": "false"
	},
	"parameters": {
		"EXTERNAL": "TRUE",
		"has_encrypted_data": "false"
	}
}
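Typing the `Columns` list by hand is error-prone. As a minimal sketch, assuming only the `Name`/`Type`/`comment` keys used in the cell above, the list could be generated from (name, type) pairs; `build_columns` is a hypothetical helper, not part of `awsPy` or Boto3.

```python
def build_columns(fields):
    """Build the Columns list from (name, type) pairs.

    Names are lower-cased because the Athena output above shows
    lower-case column names.
    """
    return [
        {"Name": name.lower(), "Type": col_type, "comment": ""}
        for name, col_type in fields
    ]


columns = build_columns(
    [
        ("count_initial_insee", "bigint"),
        ("siren", "string"),
        ("list_enseigne", "array<string>"),
    ]
)
print(columns)
```

The resulting list can be assigned to `schema["StorageDescriptor"]["Columns"]` in a dictionary shaped like the one above.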

## Count missing values

In [26]:
from datetime import date
today = date.today().strftime('%Y%m%d')  # %m is the month; %M would be the minutes
today

'20201022'

In [31]:
db = 'ets_insee'

In [50]:
table_top = parameters["ANALYSIS"]["COUNT_MISSING"]["top"]
table_middle = ""
table_bottom = parameters["ANALYSIS"]["COUNT_MISSING"]["bottom"].format(
    db, parameters["TABLES"]["PREPARATION"]['ALL_SCHEMA'][-1]['STEPS_8']['execution'][0]['name']
)

for key, value in enumerate(schema["StorageDescriptor"]["Columns"]):
    if key == len(schema["StorageDescriptor"]["Columns"]) - 1:

        table_middle += "{} ".format(
            parameters["ANALYSIS"]["COUNT_MISSING"]["middle"].format(value["Name"])
        )
    else:
        table_middle += "{} ,".format(
            parameters["ANALYSIS"]["COUNT_MISSING"]["middle"].format(value["Name"])
        )
query = table_top + table_middle + table_bottom
output = s3.run_query(
    query=query,
    database=db,
    s3_output=s3_output,
    filename="count_missing",  ## Add filename to print dataframe
    destination_key=None,  ### Add destination key if need to copy output
)
display(
    output.T.rename(columns={0: "total_missing"})
    .assign(total_missing_pct=lambda x: x["total_missing"] / x.iloc[0, 0])
    .sort_values(by=["total_missing"], ascending=False)
    .style.format("{0:,.2%}", subset=["total_missing_pct"])
    .bar(subset="total_missing_pct", color=["#d65f5f"])
)

| | total_missing | total_missing_pct |
| --- | --- | --- |
| nb_obs | 29928193 | 100.00% |
| list_enseigne | 27332648 | 91.33% |
| list_numero_voie_matching_insee | 6724481 | 22.47% |
| voie_clean | 4903915 | 16.39% |
| siren | 0 | 0.00% |
| siret | 0 | 0.00% |
| enseigne3etablissement | 0 | 0.00% |
| enseigne2etablissement | 0 | 0.00% |
| enseigne1etablissement | 0 | 0.00% |
| adresse_distance_insee | 0 | 0.00% |
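The count-missing query is assembled from three templates stored under `parameters["ANALYSIS"]["COUNT_MISSING"]`, which are not shown in this section. The sketch below illustrates the assembly with `", ".join` instead of the index test on the last column. The template strings are assumptions made for illustration (the real ones may differ), and `build_count_missing_query` is a hypothetical helper.

```python
# Assumed templates, NOT the ones stored in parameters_ETL.json
COUNT_MISSING = {
    "top": "SELECT COUNT(*) AS nb_obs, ",
    "middle": "SUM(CASE WHEN {0} IS NULL THEN 1 ELSE 0 END) AS {0}",
    "bottom": " FROM {}.{}",
}


def build_count_missing_query(columns, database, table, templates=COUNT_MISSING):
    """Build one SELECT that counts NULL values for every column."""
    middle = ", ".join(templates["middle"].format(column) for column in columns)
    return templates["top"] + middle + templates["bottom"].format(database, table)


query = build_count_missing_query(
    ["siren", "siret"], "ets_insee", "ets_insee_transformed"
)
print(query)
```

With templates of this shape only SQL `NULL` values are counted, so empty strings would not appear as missing.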


# Brief description table

In this part, we provide brief summary statistics from the latest job. For the continuous analysis with a primary/secondary key, add the variables whose count and distribution you want to inspect.

## Categorical Description

During the categorical analysis, we will count the number of observations for a given group and for a pair.

### Count obs by group

- Index: primary group
- nb_obs: Number of observations per primary group value
- percentage: Percentage of observation per primary group value over the total number of observations

Returns the top 10 only
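For a small local sample, the same count and percentage can be reproduced without Athena. This is a minimal pure-Python sketch: `count_by_group` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import Counter


def count_by_group(values, top=10):
    """Return (value, nb_obs, percentage) for the `top` most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, nb_obs, nb_obs / total) for value, nb_obs in counts.most_common(top)]


# Made-up sample of etatadministratifetablissement values
sample = ["F", "F", "F", "A"]
for value, nb_obs, percentage in count_by_group(sample):
    print("{}: {} ({:.2%})".format(value, nb_obs, percentage))
```

As in the cell below, the percentage is computed over all observations, not only over the values that make the top 10.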

In [51]:
for field in schema["StorageDescriptor"]["Columns"]:
    if field["Type"] in ["string", "object", "varchar(12)"]:

        print("Nb of obs for {}".format(field["Name"]))

        query = parameters["ANALYSIS"]["CATEGORICAL"]["PAIR"].format(
            db, parameters["TABLES"]["PREPARATION"]['ALL_SCHEMA'][-1]['STEPS_8']['execution'][0]['name'], field["Name"]
        )
        output = s3.run_query(
            query=query,
            database=db,
            s3_output=s3_output,
            filename="count_categorical_{}".format(
                field["Name"]
            ),  ## Add filename to print dataframe
            destination_key=None,  ### Add destination key if need to copy output
        )

        ### Print top 10

        display(
            (
                output.set_index([field["Name"]])
                .assign(percentage=lambda x: x["nb_obs"] / x["nb_obs"].sum())
                .sort_values("percentage", ascending=False)
                .head(10)
                .style.format("{0:.2%}", subset=["percentage"])
                .bar(subset=["percentage"], color="#d65f5f")
            )
        )

Nb of obs for siren


| siren | nb_obs | percentage |
| --- | --- | --- |
| 356000000 | 12565 | 0.04% |
| 552049447 | 9140 | 0.03% |
| 552081317 | 9072 | 0.03% |
| 632041042 | 6411 | 0.02% |
| 662025196 | 5881 | 0.02% |
| 380129866 | 4401 | 0.01% |
| 428268023 | 4358 | 0.01% |
| 662042449 | 3709 | 0.01% |
| 954509741 | 3390 | 0.01% |
| 552120222 | 3274 | 0.01% |


Nb of obs for siret


| siret | nb_obs | percentage |
| --- | --- | --- |
| 81916578800012 | 1 | 0.00% |
| 32317956400014 | 1 | 0.00% |
| 39870682000014 | 1 | 0.00% |
| 39870549100015 | 1 | 0.00% |
| 39869793800023 | 1 | 0.00% |
| 39869300200014 | 1 | 0.00% |
| 39869150100017 | 1 | 0.00% |
| 39867492900011 | 1 | 0.00% |
| 39867297200013 | 1 | 0.00% |
| 34741664600011 | 1 | 0.00% |


Nb of obs for datecreationetablissement


| datecreationetablissement | nb_obs | percentage |
| --- | --- | --- |
| | 3500351 | 11.70% |
| 1900-01-01 | 515844 | 1.72% |
| 1983-03-01 | 81928 | 0.27% |
| 1991-01-01 | 71460 | 0.24% |
| 2012-01-01 | 70411 | 0.24% |
| 1993-01-01 | 68710 | 0.23% |
| 1995-12-25 | 68518 | 0.23% |
| 2016-01-01 | 68378 | 0.23% |
| 1997-01-01 | 67041 | 0.22% |
| 2015-01-01 | 64005 | 0.21% |


Nb of obs for etablissementsiege


| etablissementsiege | nb_obs | percentage |
| --- | --- | --- |
| True | 21255155 | 71.02% |
| False | 8673038 | 28.98% |


Nb of obs for etatadministratifetablissement


| etatadministratifetablissement | nb_obs | percentage |
| --- | --- | --- |
| F | 18000284 | 60.14% |
| A | 11927909 | 39.86% |


Nb of obs for codepostaletablissement


| codepostaletablissement | nb_obs | percentage |
| --- | --- | --- |
| 75008.0 | 322658 | 1.08% |
| 75017.0 | 196993 | 0.66% |
| | 180846 | 0.60% |
| 75011.0 | 171814 | 0.57% |
| 75015.0 | 171077 | 0.57% |
| 75009.0 | 149619 | 0.50% |
| 75018.0 | 149436 | 0.50% |
| 75010.0 | 146782 | 0.49% |
| 75116.0 | 134873 | 0.45% |
| 75020.0 | 132795 | 0.44% |


Nb of obs for codecommuneetablissement


| codecommuneetablissement | nb_obs | percentage |
| --- | --- | --- |
| 75108.0 | 322659 | 1.08% |
| 31555.0 | 271970 | 0.91% |
| 6088.0 | 265196 | 0.89% |
| 75116.0 | 218086 | 0.73% |
| 33063.0 | 200934 | 0.67% |
| 75117.0 | 196994 | 0.66% |
| | 180846 | 0.60% |
| 34172.0 | 176191 | 0.59% |
| 75111.0 | 171815 | 0.57% |
| 75115.0 | 171077 | 0.57% |


Nb of obs for libellecommuneetablissement


| libellecommuneetablissement | nb_obs | percentage |
| --- | --- | --- |
| PARIS 8 | 322659 | 1.08% |
| TOULOUSE | 271970 | 0.91% |
| NICE | 265196 | 0.89% |
| PARIS 16 | 218086 | 0.73% |
| BORDEAUX | 200934 | 0.67% |
| PARIS 17 | 196994 | 0.66% |
| | 180846 | 0.60% |
| MONTPELLIER | 176191 | 0.59% |
| PARIS 11 | 171815 | 0.57% |
| PARIS 15 | 171077 | 0.57% |


Nb of obs for ville_matching


| ville_matching | nb_obs | percentage |
| --- | --- | --- |
| PARIS | 2587531 | 8.65% |
| MARSEILLE | 475401 | 1.59% |
| TOULOUSE | 271970 | 0.91% |
| NICE | 265196 | 0.89% |
| BORDEAUX | 200934 | 0.67% |
| | 180846 | 0.60% |
| MONTPELLIER | 176191 | 0.59% |
| NANTES | 160893 | 0.54% |
| LILLE | 146103 | 0.49% |
| STRASBOURG | 143756 | 0.48% |


Nb of obs for libellevoieetablissement


| libellevoieetablissement | nb_obs | percentage |
| --- | --- | --- |
| | 874221 | 2.92% |
| DE LA REPUBLIQUE | 275385 | 0.92% |
| JEAN JAURES | 218743 | 0.73% |
| VICTOR HUGO | 151002 | 0.50% |
| GRANDE RUE | 139015 | 0.46% |
| DU GENERAL DE GAULLE | 138856 | 0.46% |
| GAMBETTA | 123548 | 0.41% |
| DE LA GARE | 116180 | 0.39% |
| DE PARIS | 109388 | 0.37% |
| LE BOURG | 108316 | 0.36% |


Nb of obs for complementadresseetablissement


| complementadresseetablissement | nb_obs | percentage |
| --- | --- | --- |
| | 24414065 | 81.58% |
| MAIRIE | 183427 | 0.61% |
| RATTACHEMENT MAIRIE | 23664 | 0.08% |
| ZONE INDUSTRIELLE | 23114 | 0.08% |
| HOTEL DE VILLE | 15388 | 0.05% |
| ZI | 14816 | 0.05% |
| ZONE ARTISANALE | 11229 | 0.04% |
| CENTRE COMMERCIAL | 10969 | 0.04% |
| LE BOURG | 10029 | 0.03% |
| MAISON DES ASSOCIATIONS | 9187 | 0.03% |


Nb of obs for numerovoieetablissement


| numerovoieetablissement | nb_obs | percentage |
| --- | --- | --- |
| | 7407033 | 24.75% |
| 1.0 | 1059611 | 3.54% |
| 2.0 | 986275 | 3.30% |
| 3.0 | 819779 | 2.74% |
| 4.0 | 788604 | 2.63% |
| 5.0 | 734491 | 2.45% |
| 6.0 | 697706 | 2.33% |
| 7.0 | 630808 | 2.11% |
| 8.0 | 624500 | 2.09% |
| 10.0 | 571372 | 1.91% |


Nb of obs for indicerepetitionetablissement_full


| indicerepetitionetablissement_full | nb_obs | percentage |
| --- | --- | --- |
| | 28768656 | 96.13% |
| BIS | 915410 | 3.06% |
| TER | 115665 | 0.39% |
| A | 68323 | 0.23% |
| QUINQUIES | 28173 | 0.09% |
| D | 7663 | 0.03% |
| QUATER | 6204 | 0.02% |
| R | 5755 | 0.02% |
| E | 3793 | 0.01% |
| F | 3364 | 0.01% |


Nb of obs for typevoieetablissement


| typevoieetablissement | nb_obs | percentage |
| --- | --- | --- |
| RUE | 13954520 | 46.63% |
| | 4903854 | 16.39% |
| AV | 3450819 | 11.53% |
| RTE | 1375826 | 4.60% |
| BD | 1318926 | 4.41% |
| PL | 1121876 | 3.75% |
| CHE | 1120665 | 3.74% |
| ALL | 614471 | 2.05% |
| LD | 492117 | 1.64% |
| IMP | 390063 | 1.30% |


Nb of obs for voie_clean


| voie_clean | nb_obs | percentage |
| --- | --- | --- |
| RUE | 13954520 | 46.63% |
| | 4903915 | 16.39% |
| AVENUE | 3450819 | 11.53% |
| ROUTE | 1375826 | 4.60% |
| BOULEVARD | 1318926 | 4.41% |
| PLACE | 1121876 | 3.75% |
| CHEMIN | 1120665 | 3.74% |
| ALLEE | 614471 | 2.05% |
| LIEU DIT | 492117 | 1.64% |
| IMPASSE | 390063 | 1.30% |


Nb of obs for adresse_reconstituee_insee


| adresse_reconstituee_insee | nb_obs | percentage |
| --- | --- | --- |
| | 654006 | 2.19% |
| MAIRIE | 106197 | 0.35% |
| LE BOURG | 84838 | 0.28% |
| BOURG | 30858 | 0.10% |
| LE VILLAGE | 21816 | 0.07% |
| PLACE DE L EGLISE | 17125 | 0.06% |
| GRANDE RUE | 14003 | 0.05% |
| HOTEL DE VILLE | 13686 | 0.05% |
| PLACE DE LA MAIRIE | 12792 | 0.04% |
| ZONE INDUSTRIELLE | 10289 | 0.03% |


Nb of obs for adresse_distance_insee


| adresse_distance_insee | nb_obs | percentage |
| --- | --- | --- |
| | 659226 | 2.20% |
| BOURG | 127112 | 0.42% |
| RUE REPUBLIQUE | 116420 | 0.39% |
| GRANDE RUE | 113580 | 0.38% |
| MAIRIE | 107187 | 0.36% |
| RUE JEAN JAURES | 77060 | 0.26% |
| AVENUE JEAN JAURES | 71051 | 0.24% |
| AVENUE REPUBLIQUE | 63854 | 0.21% |
| RUE L EGLISE | 60816 | 0.20% |
| RUE VICTOR HUGO | 57594 | 0.19% |


Nb of obs for enseigne1etablissement


| enseigne1etablissement | nb_obs | percentage |
| --- | --- | --- |
| | 27332648 | 91.33% |
| MAIRIE | 37751 | 0.13% |
| CCAS | 29194 | 0.10% |
| ECOLE PRIMAIRE PUBLIQUE | 16153 | 0.05% |
| BUREAU DE POSTE | 13762 | 0.05% |
| LA POSTE | 12256 | 0.04% |
| CDE | 7364 | 0.02% |
| SERVICE ASSAINISSEMENT | 5941 | 0.02% |
| ECOLE MATERNELLE PUBLIQUE | 5842 | 0.02% |
| ECOLE ELEMENTAIRE PUBLIQUE | 3727 | 0.01% |


Nb of obs for enseigne2etablissement


| enseigne2etablissement | nb_obs | percentage |
| --- | --- | --- |
| | 29912586 | 99.95% |
| OUVERTURE DE PORTE | 58 | 0.00% |
| REAGROUP | 31 | 0.00% |
| SSIAD | 23 | 0.00% |
| DE PARIS | 20 | 0.00% |
| E | 18 | 0.00% |
| BONOBO | 15 | 0.00% |
| SQUARE HABITAT | 12 | 0.00% |
| COMPTOIR D'ELECTRICITE FRANCO BELGE | 12 | 0.00% |
| RENOVATION | 12 | 0.00% |


Nb of obs for enseigne3etablissement


| enseigne3etablissement | nb_obs | percentage |
| --- | --- | --- |
| | 29924864 | 99.99% |
| APPART CITY CAP LOISIRS | 8 | 0.00% |
| GECAGRI | 5 | 0.00% |
| DOMEN SECURITE | 5 | 0.00% |
| AIC CONSEILS | 4 | 0.00% |
| PATRICE BREAL | 4 | 0.00% |
| EM'ROAD 35 | 4 | 0.00% |
| APPART'CITY CAP LOISIRS | 4 | 0.00% |
| MOUCHARABIEH BIJOUX | 4 | 0.00% |
| ROCHER ENTRETIEN | 3 | 0.00% |


# Report generation

In [52]:
import os, time, shutil, urllib.request, ipykernel, json
from pathlib import Path
from notebook import notebookapp

In [None]:
def create_report(extension = "html", keep_code = False):
    """
    Create a report from the current notebook and save it in the 
    Reports folder (a child of the current working directory)
    
    1. Extract the current notebook name
    2. Convert the notebook 
    3. Move the newly created report
    
    Args:
    extension: string. Can be "html", "pdf", "md"
    keep_code: boolean. If False, the code cells are hidden in the report
    
    """
    
    ### Get notebook name
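    connection_file = os.path.basename(ipykernel.get_connection_file())
    # Connection files are named "kernel-<kernel_id>.json"
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]

    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:  
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
            # Keep the session attached to this kernel, not just the first one
            for session in sessions:
                if session['kernel']['id'] == kernel_id:
                    notebookname = session['name']
        except Exception:
            # Server unreachable or without this kernel: try the next one
            pass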
    
    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)
    
    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)
    
    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)
    
    ### Generate notebook
    if keep_code:
        os.system('jupyter nbconvert --to {} {}'.format(
    extension,notebookname))
    else:
        os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))
    
    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report available at this address:\n {}".format(dest))

In [None]:
create_report(extension = "html", keep_code = True)
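`create_report` builds the `nbconvert` command as a single string for `os.system`, which breaks if the notebook name contains spaces. A possible alternative, shown here only as a sketch, is to build the arguments as a list and pass them to `subprocess.run`. The helper names `build_nbconvert_command` and `convert_notebook` and the example file name are hypothetical; the flags are the ones already used above.

```python
import subprocess


def build_nbconvert_command(notebook_name, extension="html", keep_code=False):
    """Return the nbconvert command as a list of arguments."""
    command = ["jupyter", "nbconvert", "--to", extension]
    if not keep_code:
        # Hide the code cells, as in the else branch of create_report
        command.append("--no-input")
    command.append(notebook_name)
    return command


def convert_notebook(notebook_name, extension="html", keep_code=False):
    """Run nbconvert and raise CalledProcessError if the conversion fails."""
    command = build_nbconvert_command(notebook_name, extension, keep_code)
    return subprocess.run(command, check=True)


# Hypothetical notebook name containing a space
print(build_nbconvert_command("my notebook.ipynb", "html", keep_code=True))
```

With `check=True`, a failed conversion stops the cell instead of letting `shutil.move` fail later on a missing file.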