# Stock Etablissement

Here is the schema of Stock Etablissement:

Schema([('siren', String),
        ('nic', Int64),
        ('siret', String),
        ('statutDiffusionEtablissement', String),
        ('dateCreationEtablissement', Date),
        ('trancheEffectifsEtablissement', String),
        ('anneeEffectifsEtablissement', Int64),
        ('activitePrincipaleRegistreMetiersEtablissement', String),
        ('dateDernierTraitementEtablissement',
         Datetime(time_unit='us', time_zone=None)),
        ('etablissementSiege', Boolean),
        ('nombrePeriodesEtablissement', Int64),
        ('complementAdresseEtablissement', String),
        ('numeroVoieEtablissement', String),
        ('indiceRepetitionEtablissement', String),
        ('dernierNumeroVoieEtablissement', String),
        ('indiceRepetitionDernierNumeroVoieEtablissement', String),
        ('typeVoieEtablissement', String),
        ('libelleVoieEtablissement', String),
        ('codePostalEtablissement', String),
        ('libelleCommuneEtablissement', String),
        ('libelleCommuneEtrangerEtablissement', String),
        ('distributionSpecialeEtablissement', String),
        ('codeCommuneEtablissement', String),
        ('codeCedexEtablissement', String),
        ('libelleCedexEtablissement', String),
        ('codePaysEtrangerEtablissement', String),
        ('libellePaysEtrangerEtablissement', String),
        ('identifiantAdresseEtablissement', String),
        ('coordonneeLambertAbscisseEtablissement', String),
        ('coordonneeLambertOrdonneeEtablissement', String),
        ('complementAdresse2Etablissement', String),
        ('numeroVoie2Etablissement', String),
        ('indiceRepetition2Etablissement', String),
        ('typeVoie2Etablissement', String),
        ('libelleVoie2Etablissement', String),
        ('codePostal2Etablissement', String),
        ('libelleCommune2Etablissement', String),
        ('libelleCommuneEtranger2Etablissement', String),
        ('distributionSpeciale2Etablissement', String),
        ('codeCommune2Etablissement', String),
        ('codeCedex2Etablissement', String),
        ('libelleCedex2Etablissement', String),
        ('codePaysEtranger2Etablissement', String),
        ('libellePaysEtranger2Etablissement', String),
        ('dateDebut', Date),
        ('etatAdministratifEtablissement', String),
        ('enseigne1Etablissement', String),
        ('enseigne2Etablissement', String),
        ('enseigne3Etablissement', String),
        ('denominationUsuelleEtablissement', String),
        ('activitePrincipaleEtablissement', String),
        ('nomenclatureActivitePrincipaleEtablissement', String),
        ('caractereEmployeurEtablissement', String)])

In [1]:
import polars as pl
import pyarrow.parquet as pq
import sys

filepath = "../Data/raw/StockEtablissement_utf8.parquet" 

print("--- Début de la lecture 'bypass' ---")

try:
    print(f"Lecture du fichier via PyArrow : {filepath}")
    table_arrow = pq.read_table(
        filepath,
    )
    
    print("Conversion de la table PyArrow en DataFrame Polars...")
    df_eta = pl.from_arrow(table_arrow)
    
    print("--- SUCCÈS ! ---\n")
    print("Le DataFrame est maintenant dans Polars, prêt pour la transformation.")
    print(df_eta.head())

except Exception as e:
    print(f"\n--- ERREUR ---", file=sys.stderr)
    print(f"Impossible de lire le fichier, même avec PyArrow : {e}", file=sys.stderr)

--- Début de la lecture 'bypass' ---
Lecture du fichier via PyArrow : ../Data/raw/StockEtablissement_utf8.parquet
Conversion de la table PyArrow en DataFrame Polars...
--- SUCCÈS ! ---

Le DataFrame est maintenant dans Polars, prêt pour la transformation.
shape: (5, 53)
┌───────────┬─────┬────────────┬────────────┬───┬────────────┬────────────┬────────────┬───────────┐
│ siren     ┆ nic ┆ siret      ┆ statutDiff ┆ … ┆ denominati ┆ activitePr ┆ nomenclatu ┆ caractere │
│ ---       ┆ --- ┆ ---        ┆ usionEtabl ┆   ┆ onUsuelleE ┆ incipaleEt ┆ reActivite ┆ Employeur │
│ str       ┆ i64 ┆ str        ┆ issement   ┆   ┆ tablisseme ┆ ablissemen ┆ Principale ┆ Etablisse │
│           ┆     ┆            ┆ ---        ┆   ┆ …          ┆ …          ┆ …          ┆ men…      │
│           ┆     ┆            ┆ str        ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---       │
│           ┆     ┆            ┆            ┆   ┆ str        ┆ str        ┆ str        ┆ str       │
╞═══════════╪═════╪═══

---

# Stock Etablissement Historique

Here is the schema of Stock Etablissement Historique:

Schema([('siren', String),
        ('nic', Int64),
        ('siret', String),
        ('dateFin', Date),
        ('dateDebut', Date),
        ('etatAdministratifEtablissement', String),
        ('changementEtatAdministratifEtablissement', Boolean),
        ('enseigne1Etablissement', String),
        ('enseigne2Etablissement', String),
        ('enseigne3Etablissement', String),
        ('changementEnseigneEtablissement', Boolean),
        ('denominationUsuelleEtablissement', String),
        ('changementDenominationUsuelleEtablissement', Boolean),
        ('activitePrincipaleEtablissement', String),
        ('nomenclatureActivitePrincipaleEtablissement', String),
        ('changementActivitePrincipaleEtablissement', Boolean),
        ('caractereEmployeurEtablissement', String),
        ('changementCaractereEmployeurEtablissement', Boolean)])

In [2]:
import polars as pl
import pyarrow.parquet as pq
import sys

filepath = "../Data/raw/StockEtablissementHistorique_utf8.parquet" 

print("--- Début de la lecture 'bypass' ---")

try:
    print(f"Lecture du fichier via PyArrow : {filepath}")
    table_arrow = pq.read_table(
        filepath,
    )
    
    print("Conversion de la table PyArrow en DataFrame Polars...")
    df_eta_hist = pl.from_arrow(table_arrow)
    
    print("--- SUCCÈS ! ---\n")
    print("Le DataFrame est maintenant dans Polars, prêt pour la transformation.")
    print(df_eta_hist.head())

except Exception as e:
    print(f"\n--- ERREUR ---", file=sys.stderr)
    print(f"Impossible de lire le fichier, même avec PyArrow : {e}", file=sys.stderr)

--- Début de la lecture 'bypass' ---
Lecture du fichier via PyArrow : ../Data/raw/StockEtablissementHistorique_utf8.parquet
Conversion de la table PyArrow en DataFrame Polars...
--- SUCCÈS ! ---

Le DataFrame est maintenant dans Polars, prêt pour la transformation.
shape: (5, 18)
┌───────────┬─────┬────────────┬────────────┬───┬────────────┬────────────┬────────────┬───────────┐
│ siren     ┆ nic ┆ siret      ┆ dateFin    ┆ … ┆ nomenclatu ┆ changement ┆ caractereE ┆ changemen │
│ ---       ┆ --- ┆ ---        ┆ ---        ┆   ┆ reActivite ┆ ActivitePr ┆ mployeurEt ┆ tCaracter │
│ str       ┆ i64 ┆ str        ┆ date       ┆   ┆ Principale ┆ incipaleEt ┆ ablissemen ┆ eEmployeu │
│           ┆     ┆            ┆            ┆   ┆ …          ┆ …          ┆ …          ┆ rEt…      │
│           ┆     ┆            ┆            ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---       │
│           ┆     ┆            ┆            ┆   ┆ str        ┆ bool       ┆ str        ┆ bool      │
╞═══════════

---

# Stock Unite Legale Historique

We will not use this file to build our siren_date DB: it merely duplicates Stock Unite Legale, and the dateFin information is already available in the StockEtablissement data. It is therefore of little interest, but we keep it in Sandbox just in case.

Here is the schema of Stock Unite Legale Historique:

Schema([('siren', String),
        ('dateFin', Date),
        ('dateDebut', Date),
        ('etatAdministratifUniteLegale', String),
        ('changementEtatAdministratifUniteLegale', Boolean),
        ('nomUniteLegale', String),
        ('changementNomUniteLegale', Boolean),
        ('nomUsageUniteLegale', String),
        ('changementNomUsageUniteLegale', Boolean),
        ('denominationUniteLegale', String),
        ('changementDenominationUniteLegale', Boolean),
        ('denominationUsuelle1UniteLegale', String),
        ('denominationUsuelle2UniteLegale', String),
        ('denominationUsuelle3UniteLegale', String),
        ('changementDenominationUsuelleUniteLegale', String),
        ('categorieJuridiqueUniteLegale', String),
        ('changementCategorieJuridiqueUniteLegale', Boolean),
        ('activitePrincipaleUniteLegale', String),
        ('nomenclatureActivitePrincipaleUniteLegale', String),
        ('changementActivitePrincipaleUniteLegale', Boolean),
        ('nicSiegeUniteLegale', Int64),
        ('changementNicSiegeUniteLegale', Boolean),
        ('economieSocialeSolidaireUniteLegale', String),
        ('changementEconomieSocialeSolidaireUniteLegale', Boolean),
        ('societeMissionUniteLegale', String),
        ('changementSocieteMissionUniteLegale', Boolean),
        ('caractereEmployeurUniteLegale', String),
        ('changementCaractereEmployeurUniteLegale', Boolean)])

In [1]:
import polars as pl
import pyarrow.parquet as pq
import sys

filepath = "../Data/raw/StockUniteLegaleHistorique_utf8.parquet" 

print("--- Début de la lecture 'bypass' ---")

try:
    # STEP 1: read with PyArrow, which is robust.
    print(f"Lecture du fichier via PyArrow : {filepath}")
    table_arrow = pq.read_table(
        filepath,
    )
    
    # STEP 2: hand the data to Polars, avoiding a copy where possible (performance gain).
    print("Conversion de la table PyArrow en DataFrame Polars...")
    df_unit_legal_hist = pl.from_arrow(table_arrow)
    
    print("--- SUCCÈS ! ---")
    print("Le DataFrame est maintenant dans Polars, prêt pour la transformation.")
    print(df_unit_legal_hist.head())

except Exception as e:
    print(f"--- ERREUR ---", file=sys.stderr)
    print(f"Impossible de lire le fichier, même avec PyArrow : {e}", file=sys.stderr)

--- Début de la lecture 'bypass' ---
Lecture du fichier via PyArrow : ../Data/raw/StockUniteLegaleHistorique_utf8.parquet
Conversion de la table PyArrow en DataFrame Polars...
--- SUCCÈS ! ---
Le DataFrame est maintenant dans Polars, prêt pour la transformation.
shape: (5, 28)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ siren     ┆ dateFin   ┆ dateDebut ┆ etatAdmin ┆ … ┆ societeMi ┆ changemen ┆ caractere ┆ changeme │
│ ---       ┆ ---       ┆ ---       ┆ istratifU ┆   ┆ ssionUnit ┆ tSocieteM ┆ Employeur ┆ ntCaract │
│ str       ┆ date      ┆ date      ┆ niteLegal ┆   ┆ eLegale   ┆ issionUni ┆ UniteLega ┆ ereEmplo │
│           ┆           ┆           ┆ e         ┆   ┆ ---       ┆ teL…      ┆ le        ┆ yeurUn…  │
│           ┆           ┆           ┆ ---       ┆   ┆ str       ┆ ---       ┆ ---       ┆ ---      │
│           ┆           ┆           ┆ str       ┆   ┆           ┆ bool      ┆ str       ┆ bool     │
╞═══════════╪══

---

# Stock Unite Legale

Here is the schema of Stock Unite Legale:

Schema([('siren', String),
        ('statutDiffusionUniteLegale', String),
        ('unitePurgeeUniteLegale', Boolean),
        ('dateCreationUniteLegale', Date),
        ('sigleUniteLegale', String),
        ('sexeUniteLegale', String),
        ('prenom1UniteLegale', String),
        ('prenom2UniteLegale', String),
        ('prenom3UniteLegale', String),
        ('prenom4UniteLegale', String),
        ('prenomUsuelUniteLegale', String),
        ('pseudonymeUniteLegale', String),
        ('identifiantAssociationUniteLegale', String),
        ('trancheEffectifsUniteLegale', String),
        ('anneeEffectifsUniteLegale', Int64),
        ('dateDernierTraitementUniteLegale',
         Datetime(time_unit='us', time_zone=None)),
        ('nombrePeriodesUniteLegale', Int64),
        ('categorieEntreprise', String),
        ('anneeCategorieEntreprise', Int64),
        ('dateDebut', Date),
        ('etatAdministratifUniteLegale', String),
        ('nomUniteLegale', String),
        ('nomUsageUniteLegale', String),
        ('denominationUniteLegale', String),
        ('denominationUsuelle1UniteLegale', String),
        ('denominationUsuelle2UniteLegale', String),
        ('denominationUsuelle3UniteLegale', String),
        ('categorieJuridiqueUniteLegale', Int64),
        ('activitePrincipaleUniteLegale', String),
        ('nomenclatureActivitePrincipaleUniteLegale', String),
        ('nicSiegeUniteLegale', Int64),
        ('economieSocialeSolidaireUniteLegale', String),
        ('societeMissionUniteLegale', String),
        ('caractereEmployeurUniteLegale', String)])

In [2]:
import polars as pl
import pyarrow.parquet as pq
import sys

filepath = "../Data/raw/StockUniteLegale_utf8.parquet" 

print("--- Début de la lecture 'bypass' ---")

try:
    # STEP 1: read with PyArrow, which is robust.
    print(f"Lecture du fichier via PyArrow : {filepath}")
    table_arrow = pq.read_table(
        filepath,
    )
    
    # STEP 2: hand the data to Polars, avoiding a copy where possible (performance gain).
    print("Conversion de la table PyArrow en DataFrame Polars...")
    df_unit_legal = pl.from_arrow(table_arrow)
    
    print("--- SUCCÈS ! ---")
    print("Le DataFrame est maintenant dans Polars, prêt pour la transformation.")
    print(df_unit_legal.head())

except Exception as e:
    print(f"--- ERREUR ---", file=sys.stderr)
    print(f"Impossible de lire le fichier, même avec PyArrow : {e}", file=sys.stderr)

--- Début de la lecture 'bypass' ---
Lecture du fichier via PyArrow : ../Data/raw/StockUniteLegale_utf8.parquet
Conversion de la table PyArrow en DataFrame Polars...
--- SUCCÈS ! ---
Le DataFrame est maintenant dans Polars, prêt pour la transformation.
shape: (5, 34)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ siren     ┆ statutDif ┆ unitePurg ┆ dateCreat ┆ … ┆ nicSiegeU ┆ economieS ┆ societeMi ┆ caracter │
│ ---       ┆ fusionUni ┆ eeUniteLe ┆ ionUniteL ┆   ┆ niteLegal ┆ ocialeSol ┆ ssionUnit ┆ eEmploye │
│ str       ┆ teLegale  ┆ gale      ┆ egale     ┆   ┆ e         ┆ idaireUni ┆ eLegale   ┆ urUniteL │
│           ┆ ---       ┆ ---       ┆ ---       ┆   ┆ ---       ┆ teL…      ┆ ---       ┆ egale    │
│           ┆ str       ┆ bool      ┆ date      ┆   ┆ i64       ┆ ---       ┆ str       ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆           ┆ str       ┆           ┆ str      │
╞═══════════╪═══════════╪

---

# Siren Master

In [14]:
import polars as pl
import sys
import os

print("--- Lancement Script 01: Création du MASTER FILE SIRENE ---")

PATH_UL = "../Data/raw/StockUniteLegale_utf8.parquet"
PATH_ETAB = "../Data/raw/StockEtablissement_utf8.parquet"
PATH_ETAB_HISTO = "../Data/raw/StockEtablissementHistorique_utf8.parquet"
PATH_OUTPUT = "../Data/processed/sirene_infos.parquet"

for path in [PATH_UL, PATH_ETAB, PATH_ETAB_HISTO]:
    if not os.path.exists(path):
        print(f"ERREUR FATALE: Fichier brut manquant : {path}", file=sys.stderr)
        sys.exit(1)

# ===================================================================
# STEP 1: The base (features X), from 'StockUniteLegale'
# ===================================================================
print("Étape 1: Lecture des features de 'StockUniteLegale'...")
df_base_features = pl.scan_parquet(PATH_UL).select(
    "siren",
    "dateCreationUniteLegale",
    "categorieJuridiqueUniteLegale",
    "trancheEffectifsUniteLegale",
    "activitePrincipaleUniteLegale",
    "categorieEntreprise",
    "economieSocialeSolidaireUniteLegale",
    "societeMissionUniteLegale"
)

# ===================================================================
# STEP 2: Find the SIRET of the head office (HQ), from 'StockEtablissement'
# ===================================================================
print("Étape 2: Lecture de 'StockEtablissement' pour trouver les sièges...")
df_sieges = pl.scan_parquet(PATH_ETAB).filter(
    pl.col("etablissementSiege")  # Boolean column, so no "== True" is needed
).select(
    "siren", 
    "siret",
    pl.col("codePostalEtablissement").str.slice(0, 2).alias("departement")
)

# ===================================================================
# STEP 3: Find the closure date (target Y), from 'StockEtablissementHistorique'
# ===================================================================
print("Étape 3: Lecture de 'StockEtablissementHistorique' pour trouver les 'morts'...")
df_fermetures = pl.scan_parquet(PATH_ETAB_HISTO).filter(
    pl.col("etatAdministratifEtablissement") == 'F'  # Keep only the "closed" (F) events
).select(
    "siret",
    pl.col("dateFin").alias("dateFermeture")
).group_by("siret").agg(
    # If there are several "F" events, take the most recent one
    pl.col("dateFermeture").max() 
)

# ===================================================================
# STEP 4: The SIRENE "grand merge"
# ===================================================================
print("Étape 4: Jointure finale des 3 tables...")

# 1. Join the base (features X) with the head offices (department + siret)
df_master = df_base_features.join(
    df_sieges, on="siren", how="left"
)

# 2. Join the result with the closure dates
# This is the key join.
df_master = df_master.join(
    df_fermetures, on="siret", how="left"
)

# ===================================================================
# STEP 5: Save
# ===================================================================
print(f"Sauvegarde du Master File dans {PATH_OUTPUT}...")

# Keep only the final columns
df_final = df_master.select(
    "siren",
    "dateCreationUniteLegale",
    "dateFermeture",
    "categorieJuridiqueUniteLegale",
    "trancheEffectifsUniteLegale",
    "activitePrincipaleUniteLegale",
    # "categorieEntreprise",  # mostly null (17M nulls vs 10M values); revisit in a later iteration
    # "economieSocialeSolidaireUniteLegale",
    # "societeMissionUniteLegale",
    "departement"
)

# Collect once and reuse the result, so the lazy query is not executed twice
df_result = df_final.collect()
df_result.write_parquet(PATH_OUTPUT)

print(f"--- Script 01 (Master File) Terminé avec Succès. Shape: {df_result.shape} ---")

--- Lancement Script 01: Création du MASTER FILE SIRENE ---
Étape 1: Lecture des features de 'StockUniteLegale'...
Étape 2: Lecture de 'StockEtablissement' pour trouver les sièges...
Étape 3: Lecture de 'StockEtablissementHistorique' pour trouver les 'morts'...
Étape 4: Jointure finale des 3 tables...
Sauvegarde du Master File dans ../Data/processed/sirene_infos.parquet...
--- Script 01 (Master File) Terminé avec Succès. Shape: (28882409, 7) ---


In [15]:
df_final.schema


Schema([('siren', String),
        ('dateCreationUniteLegale', Date),
        ('dateFermeture', Date),
        ('categorieJuridiqueUniteLegale', Int64),
        ('trancheEffectifsUniteLegale', String),
        ('activitePrincipaleUniteLegale', String),
        ('departement', String)])

In [16]:
import polars as pl

# 1. Define the path to the new "clean" file
path_master_file = "../Data/processed/sirene_infos.parquet"

# 2. Read the file
df_sirene = pl.read_parquet(path_master_file)

# 3. Inspect the DB
print(f"--- Fichier Master SIRENE chargé ---")
print(f"Shape (Lignes, Colonnes) : {df_sirene.shape}\n")
print("Aperçu des 5 premières lignes :")
print(df_sirene.head())
print("\nSchéma des colonnes (Types) :")
print(df_sirene.schema)

--- Fichier Master SIRENE chargé ---
Shape (Lignes, Colonnes) : (28882409, 7)

Aperçu des 5 premières lignes :
shape: (5, 7)
┌───────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┬─────────────┐
│ siren     ┆ dateCreation ┆ dateFermetur ┆ categorieJur ┆ trancheEffe ┆ activitePri ┆ departement │
│ ---       ┆ UniteLegale  ┆ e            ┆ idiqueUniteL ┆ ctifsUniteL ┆ ncipaleUnit ┆ ---         │
│ str       ┆ ---          ┆ ---          ┆ egale        ┆ egale       ┆ eLegale     ┆ str         │
│           ┆ date         ┆ date         ┆ ---          ┆ ---         ┆ ---         ┆             │
│           ┆              ┆              ┆ i64          ┆ str         ┆ str         ┆             │
╞═══════════╪══════════════╪══════════════╪══════════════╪═════════════╪═════════════╪═════════════╡
│ 000325175 ┆ 2000-09-26   ┆ null         ┆ 1000         ┆ NN          ┆ 32.12Z      ┆ 13          │
│ 001807254 ┆ 1972-05-01   ┆ null         ┆ 1000         ┆ NN      

In [17]:
df_sirene.describe()

statistic,siren,dateCreationUniteLegale,dateFermeture,categorieJuridiqueUniteLegale,trancheEffectifsUniteLegale,activitePrincipaleUniteLegale,departement
str,str,str,str,f64,str,str,str
"""count""","""28882409""","""27710121""","""1198759""",28882409.0,"""28882409""","""28860884""","""28560617"""
"""null_count""","""0""","""1172288""","""27683650""",0.0,"""0""","""21525""","""321792"""
"""mean""",,"""2005-10-22 13:11:56.004134""","""2010-09-25 00:21:57.306314""",3246.573881,,,
"""std""",,,,2736.41353,,,
"""min""","""000325175""","""0001-01-16""","""1900-12-31""",1000.0,"""00""","""00.00""",""" D"""
"""25%""",,"""1995-01-19""","""2003-12-24""",1000.0,,,
"""50%""",,"""2010-05-01""","""2010-10-13""",1000.0,,,
"""75%""",,"""2019-11-02""","""2019-12-02""",5599.0,,,
"""max""","""999992357""","""3023-01-06""","""5015-04-04""",9970.0,"""NN""","""99.0Z""","""sw"""


In [18]:
# Drop every row where departement, activitePrincipaleUniteLegale or dateCreationUniteLegale is null

df_sirene_clean = df_sirene.drop_nulls(subset=["departement", "activitePrincipaleUniteLegale", "dateCreationUniteLegale"])


In [19]:
df_sirene_clean.describe()

statistic,siren,dateCreationUniteLegale,dateFermeture,categorieJuridiqueUniteLegale,trancheEffectifsUniteLegale,activitePrincipaleUniteLegale,departement
str,str,str,str,f64,str,str,str
"""count""","""27391299""","""27391299""","""1190800""",27391299.0,"""27391299""","""27391299""","""27391299"""
"""null_count""","""0""","""0""","""26200499""",0.0,"""0""","""0""","""0"""
"""mean""",,"""2005-09-28 12:27:59.586499""","""2010-09-30 23:10:28.167954""",3306.424326,,,
"""std""",,,,2757.269271,,,
"""min""","""000325175""","""0001-01-16""","""1900-12-31""",1000.0,"""00""","""00.00""",""" D"""
"""25%""",,"""1995-01-01""","""2003-12-24""",1000.0,,,
"""50%""",,"""2010-04-13""","""2010-11-30""",1000.0,,,
"""75%""",,"""2019-10-16""","""2019-12-19""",5710.0,,,
"""max""","""999992357""","""3023-01-06""","""5015-04-04""",9970.0,"""NN""","""99.0Z""","""sw"""


We just need to check that every column has the right type (str, int, date, ...) and also that the dates are valid. If a company was created after today's date, the row is dropped.

In [20]:
# Create a closure-year column from dateFermeture

df_sirene_clean = df_sirene_clean.with_columns(
    pl.col("dateFermeture").dt.year().alias("anneeFermeture")
)

In [22]:
df_sirene_clean.schema


Schema([('siren', String),
        ('dateCreationUniteLegale', Date),
        ('dateFermeture', Date),
        ('categorieJuridiqueUniteLegale', Int64),
        ('trancheEffectifsUniteLegale', String),
        ('activitePrincipaleUniteLegale', String),
        ('departement', String),
        ('anneeFermeture', Int32)])

---

# Detail Bilan

In [1]:
import polars as pl
import pyarrow.parquet as pq
import sys

filepath = "../Data/raw/ExportDetailBilan.parquet" 

print("--- Début de la lecture 'bypass' ---")

try:
    print(f"Lecture du fichier via PyArrow : {filepath}")
    table_arrow = pq.read_table(
        filepath,
        columns=["siren", "liasse", "date_cloture_exercice"]
    )
    
    print("Conversion de la table PyArrow en DataFrame Polars...")
    df_bilan = pl.from_arrow(table_arrow)
    
    print("--- SUCCÈS ! ---")
    print("Le DataFrame est maintenant dans Polars, prêt pour la transformation.")
    print(df_bilan.head())

except Exception as e:
    print(f"--- ERREUR ---", file=sys.stderr)
    print(f"Impossible de lire le fichier, même avec PyArrow : {e}", file=sys.stderr)

--- Début de la lecture 'bypass' ---
Lecture du fichier via PyArrow : ../Data/raw/ExportDetailBilan.parquet
Conversion de la table PyArrow en DataFrame Polars...
--- SUCCÈS ! ---
Le DataFrame est maintenant dans Polars, prêt pour la transformation.
shape: (5, 3)
┌───────────┬─────────────────────────────────┬───────────────────────┐
│ siren     ┆ liasse                          ┆ date_cloture_exercice │
│ ---       ┆ ---                             ┆ ---                   │
│ str       ┆ list[struct[2]]                 ┆ date                  │
╞═══════════╪═════════════════════════════════╪═══════════════════════╡
│ 005420120 ┆ [{"HF",111571}, {"BH-BI",2559}… ┆ 2018-12-31            │
│ 005420120 ┆ [{"GK",74833}, {"AS",90304}, …… ┆ 2021-12-31            │
│ 005420120 ┆ [{"HF",151562}, {"NH",5850813}… ┆ 2017-12-31            │
│ 005420120 ┆ [{"HC",918483}, {"CU-CV",66917… ┆ 2016-12-31            │
│ 005420120 ┆ [{"AO",425420}, {"CH-CI",14383… ┆ 2019-12-31            │
└───────────┴────

In [2]:
df_exploded = df_bilan.explode("liasse")  


df_final = df_exploded.with_columns([
    pl.col("liasse").struct.field("key").alias("key"),
    pl.col("liasse").struct.field("value").alias("value")
]).drop("liasse")

df_final

siren,date_cloture_exercice,key,value
str,date,str,i32
"""005420120""",2018-12-31,"""HF""",111571
"""005420120""",2018-12-31,"""BH-BI""",2559
"""005420120""",2018-12-31,"""CJ-CK""",15117606
"""005420120""",2018-12-31,"""DO""",0
"""005420120""",2018-12-31,"""EA""",123502
…,…,…,…
"""999990542""",2017-12-31,"""HE""",0
"""999990542""",2017-12-31,"""I3""",0
"""999990542""",2017-12-31,"""EA""",62902
"""999990542""",2017-12-31,"""FI""",1342983


In [3]:
import polars as pl

if 'df_final' not in locals():
    print("ERREUR: 'df_final' (le df 'long') n'est pas chargé.")
else:
    print(f"DataFrame 'long' (df_final) chargé. Shape: {df_final.shape}")

    # =================================================
    # 1. DEFINE THE "DIAMONDS" (the codes we keep)
    # =================================================

    CODES_A_GARDER = [
        'HN',     # Net income
        'FA',     # Revenue (sales)
        'FB',     # Purchases of goods
        'CJ-CK',  # Total assets
        'DL',     # Debts (up to 1 year)
        'DM',     # Debts (over 1 year)
        'DA',     # Cash (assets)
        'FJ',     # Financial result
        'FR',     # Exceptional result
        'DF',     # Equity
        'EG'      # Taxes and duties
    ]

    # =====================================
    # 2. FILTER & OPTIMIZATION
    # =====================================

    print(f"\nÉtape 1: Filtrage ... On jette 95% des colonnes inutiles.")

    df_filtered = df_final.filter(
        pl.col("key").is_in(CODES_A_GARDER)
    )
    
    print(f"DataFrame filtré. Il ne reste que {df_filtered.shape[0]} lignes.")

    # ==================================
    # 3. PIVOT
    # ==================================

    print("\nÉtape 2: PIVOT sur les codes financiers sélectionnés")
    
    # Recent Polars versions name the pivoted column "on" (formerly "columns")
    df_wide = df_filtered.pivot(
        on="key",
        index=["siren", "date_cloture_exercice"],
        values="value",
        aggregate_function="first"
    ).fill_null(0)

    print("\n---")
    print("PIVOT TERMINÉ AVEC SUCCÈS.")
    print(f"Nouveau DataFrame 'df_wide' créé (shape: {df_wide.shape})")
    print("---")

    df_wide.head()

DataFrame 'long' (df_final) chargé. Shape: (387228690, 4)

Étape 1: Filtrage ... On jette 95% des colonnes inutiles.
DataFrame filtré. Il ne reste que 23374489 lignes.

Étape 2: PIVOT sur les codes financiers sélectionnés



---
PIVOT TERMINÉ AVEC SUCCÈS.
Nouveau DataFrame 'df_wide' créé (shape: (3706645, 13))
---


In [4]:
df_wide

siren,date_cloture_exercice,CJ-CK,EG,FJ,FA,HN,DA,DL,FB,FR,DF,DM
str,date,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""005420120""",2018-12-31,15117606,841098,135797,4623,-289131,711840,90269342,0,470896,0,0
"""005420120""",2021-12-31,10813111,2500000,271605,44073,-1974866,711840,86469939,0,619916,0,0
"""005420120""",2017-12-31,22684824,441247,98112,26192,-376691,711840,90919571,0,450623,0,0
"""005420120""",2016-12-31,31933093,586967,104225,11836,-261053,711840,92013428,0,781843,0,0
"""005420120""",2019-12-31,12736527,0,217792,48370,-970147,711840,89288445,0,520363,0,0
…,…,…,…,…,…,…,…,…,…,…,…,…
"""999990369""",2017-12-31,27032947,0,25446485,0,316070,7111836,9336880,0,25633827,0,0
"""999990369""",2022-12-31,25326821,0,20220489,0,-672325,7111836,14487576,0,23350653,0,0
"""999990369""",2019-12-31,21649436,0,21713101,0,725501,7111836,14454662,0,23034206,0,0
"""999990542""",2016-12-31,1318894,0,1780080,0,318053,225000,598610,0,1780080,3674,0


In [5]:
df_wide.columns

['siren',
 'date_cloture_exercice',
 'CJ-CK',
 'EG',
 'FJ',
 'FA',
 'HN',
 'DA',
 'DL',
 'FB',
 'FR',
 'DF',
 'DM']

In [6]:
df_wide.describe()

statistic,siren,date_cloture_exercice,CJ-CK,EG,FJ,FA,HN,DA,DL,FB,FR,DF,DM
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""3706645""","""3706645""",3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0
"""null_count""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,"""2019-10-06 17:39:49.194433""",3913800.0,1101200.0,3306400.0,1465000.0,361013.787403,1948400.0,4380700.0,188926.995324,5872000.0,29869.091996,28888.753277
"""std""",,,43182000.0,16379000.0,35760000.0,27663000.0,17763000.0,36837000.0,61281000.0,9286100.0,60821000.0,4470800.0,5833200.0
"""min""","""005420120""","""1919-09-30""",-1053500000.0,-2147500000.0,-1073500000.0,-547490000.0,-2147500000.0,-2147500000.0,-2147500000.0,-52500000.0,-2147500000.0,-1382600000.0,-360972.0
"""25%""",,"""2017-12-31""",111338.0,0.0,0.0,0.0,0.0,7622.0,48516.0,0.0,0.0,0.0,0.0
"""50%""",,"""2019-12-31""",425115.0,82444.0,40521.0,0.0,2619.0,30000.0,260972.0,0.0,164367.0,0.0,0.0
"""75%""",,"""2021-09-30""",1224340.0,431124.0,869181.0,0.0,66630.0,155000.0,833139.0,0.0,1285494.0,0.0,0.0
"""max""","""999990542""","""2029-12-31""",2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0


In [7]:
import polars as pl

# ============================
# 1. RENAMING DICTIONARY
# ============================

RENAMING_MAP = {
    'siren': 'siren',
    'HN': 'HN_RésultatNet',
    'FA': 'FA_ChiffreAffairesVentes',
    'FB': 'FB_AchatsMarchandises',
    'CJ-CK': 'CJCK_TotalActifBrut',  # combined CJ-CK code, kept as a single column
    'DL': 'DL_DettesCourtTerme',
    'DM': 'DM_DettesLongTerme',
    'DA': 'DA_TresorerieActive',
    'FJ': 'FJ_ResultatFinancier',
    'FR': 'FR_ResultatExceptionnel',
    'DF': 'DF_CapitauxPropres',
    'EG': 'EG_ImpotsTaxes',
    'date_cloture_exercice' : 'DateClotureExercice'
}

RAW_COLS_TO_KEEP = list(RENAMING_MAP.keys())

df_filtered_raw = df_wide.select(RAW_COLS_TO_KEEP) 

print("\nRenommage des colonnes en noms 'propres'...")
df_renamed = df_filtered_raw.rename(RENAMING_MAP)


# ==================================
# 2. CREATE THE 7 "EXPERT" RATIOS
# ==================================
print("Création des 7 ratios 'experts'...")

df_with_ratios = df_renamed.with_columns(
    
    # Use the NEW clean names
    (pl.col("HN_RésultatNet") / (pl.col("FA_ChiffreAffairesVentes") + 1e-6)).fill_nan(0).alias("ratio_rentabilite_nette"),
    
    ((pl.col("DL_DettesCourtTerme") + pl.col("DM_DettesLongTerme")) / (pl.col("CJCK_TotalActifBrut") + 1e-6)).fill_nan(0).alias("ratio_endettement"),
    
    ((pl.col("FA_ChiffreAffairesVentes") - pl.col("FB_AchatsMarchandises")) / (pl.col("FA_ChiffreAffairesVentes") + 1e-6)).fill_nan(0).alias("ratio_marge_brute"),
    
    (pl.col("DF_CapitauxPropres") / (pl.col("CJCK_TotalActifBrut") + 1e-6)).fill_nan(0).alias("ratio_capitaux_propres"),
    
    (pl.col("DA_TresorerieActive") / (pl.col("CJCK_TotalActifBrut") + 1e-6)).fill_nan(0).alias("ratio_tresorerie"),
    
    (pl.col("FJ_ResultatFinancier") / (pl.col("FA_ChiffreAffairesVentes") + 1e-6)).fill_nan(0).alias("ratio_resultat_financier"),
    
    (pl.col("FR_ResultatExceptionnel") / (pl.col("FA_ChiffreAffairesVentes") + 1e-6)).fill_nan(0).alias("ratio_resultat_exceptionnel")
)

# Keep the final DataFrame under a new name (it is no longer df_wide)
df_dna_expert = df_with_ratios

print("\n---")
print("Transformation TERMINÉE.")
print(f"Nouveau DataFrame 'df_dna_expert' créé (shape: {df_dna_expert.shape})")
print("---")

df_dna_expert = df_dna_expert.sort(["siren", "DateClotureExercice"])

df_dna_expert.head()



Renommage des colonnes en noms 'propres'...
Création des 7 ratios 'experts'...

---
Transformation TERMINÉE.
Nouveau DataFrame 'df_dna_expert' créé (shape: (3706645, 20))
---


siren,HN_RésultatNet,FA_ChiffreAffairesVentes,FB_AchatsMarchandises,CJCK_TotalActifBrut,DL_DettesCourtTerme,DM_DettesLongTerme,DA_TresorerieActive,FJ_ResultatFinancier,FR_ResultatExceptionnel,DF_CapitauxPropres,EG_ImpotsTaxes,DateClotureExercice,ratio_rentabilite_nette,ratio_endettement,ratio_marge_brute,ratio_capitaux_propres,ratio_tresorerie,ratio_resultat_financier,ratio_resultat_exceptionnel
str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,date,f64,f64,f64,f64,f64,f64,f64
"""005420120""",-261053,11836,0,31933093,92013428,0,711840,104225,781843,0,586967,2016-12-31,-22.055847,2.881444,1.0,0.0,0.022292,8.805762,66.056353
"""005420120""",-376691,26192,0,22684824,90919571,0,711840,98112,450623,0,441247,2017-12-31,-14.381911,4.007947,1.0,0.0,0.03138,3.745877,17.204604
"""005420120""",-289131,4623,0,15117606,90269342,0,711840,135797,470896,0,841098,2018-12-31,-62.541856,5.97114,1.0,0.0,0.047087,29.374216,101.859399
"""005420120""",-970147,48370,0,12736527,89288445,0,711840,217792,520363,0,0,2019-12-31,-20.056791,7.010423,1.0,0.0,0.05589,4.502626,10.75797
"""005420120""",-807683,72481,0,12006568,88446360,0,711840,342381,725921,0,885730,2020-12-31,-11.143376,7.366498,1.0,0.0,0.059288,4.723734,10.015328


In [8]:
# Extract the year of the closing date into its own column

df_dna_expert = df_dna_expert.with_columns(
    pl.col("DateClotureExercice").dt.year().alias("AnneeClotureExercice")
)
df_dna_expert

siren,HN_RésultatNet,FA_ChiffreAffairesVentes,FB_AchatsMarchandises,CJCK_TotalActifBrut,DL_DettesCourtTerme,DM_DettesLongTerme,DA_TresorerieActive,FJ_ResultatFinancier,FR_ResultatExceptionnel,DF_CapitauxPropres,EG_ImpotsTaxes,DateClotureExercice,ratio_rentabilite_nette,ratio_endettement,ratio_marge_brute,ratio_capitaux_propres,ratio_tresorerie,ratio_resultat_financier,ratio_resultat_exceptionnel,AnneeClotureExercice
str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,date,f64,f64,f64,f64,f64,f64,f64,i32
"""005420120""",-261053,11836,0,31933093,92013428,0,711840,104225,781843,0,586967,2016-12-31,-22.055847,2.881444,1.0,0.0,0.022292,8.805762,66.056353,2016
"""005420120""",-376691,26192,0,22684824,90919571,0,711840,98112,450623,0,441247,2017-12-31,-14.381911,4.007947,1.0,0.0,0.03138,3.745877,17.204604,2017
"""005420120""",-289131,4623,0,15117606,90269342,0,711840,135797,470896,0,841098,2018-12-31,-62.541856,5.97114,1.0,0.0,0.047087,29.374216,101.859399,2018
"""005420120""",-970147,48370,0,12736527,89288445,0,711840,217792,520363,0,0,2019-12-31,-20.056791,7.010423,1.0,0.0,0.05589,4.502626,10.75797,2019
"""005420120""",-807683,72481,0,12006568,88446360,0,711840,342381,725921,0,885730,2020-12-31,-11.143376,7.366498,1.0,0.0,0.059288,4.723734,10.015328,2020
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""999990369""",389310,0,0,22017428,14843972,0,7111836,20677807,22224228,0,0,2020-12-31,3.8931e11,0.674192,0.0,0.0,0.323009,2.0678e13,2.2224e13,2020
"""999990369""",254808,0,0,23802015,15159901,0,7111836,22429094,24658118,0,0,2021-12-31,2.5481e11,0.636917,0.0,0.0,0.298791,2.2429e13,2.4658e13,2021
"""999990369""",-672325,0,0,25326821,14487576,0,7111836,20220489,23350653,0,0,2022-12-31,-6.7232e11,0.572025,0.0,0.0,0.280803,2.0220e13,2.3351e13,2022
"""999990542""",318053,0,0,1318894,598610,0,225000,1780080,1780080,3674,0,2016-12-31,3.1805e11,0.453873,0.0,0.002786,0.170597,1.7801e12,1.7801e12,2016


In [9]:
# Move the closing date and closing year to the first two columns

cols = df_dna_expert.columns
cols.remove("DateClotureExercice")
cols.remove("AnneeClotureExercice")
new_order = ["DateClotureExercice", "AnneeClotureExercice"] + cols
df_dna_expert = df_dna_expert.select(new_order)
df_dna_expert.head()




DateClotureExercice,AnneeClotureExercice,siren,HN_RésultatNet,FA_ChiffreAffairesVentes,FB_AchatsMarchandises,CJCK_TotalActifBrut,DL_DettesCourtTerme,DM_DettesLongTerme,DA_TresorerieActive,FJ_ResultatFinancier,FR_ResultatExceptionnel,DF_CapitauxPropres,EG_ImpotsTaxes,ratio_rentabilite_nette,ratio_endettement,ratio_marge_brute,ratio_capitaux_propres,ratio_tresorerie,ratio_resultat_financier,ratio_resultat_exceptionnel
date,i32,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,f64,f64,f64,f64,f64,f64,f64
2016-12-31,2016,"""005420120""",-261053,11836,0,31933093,92013428,0,711840,104225,781843,0,586967,-22.055847,2.881444,1.0,0.0,0.022292,8.805762,66.056353
2017-12-31,2017,"""005420120""",-376691,26192,0,22684824,90919571,0,711840,98112,450623,0,441247,-14.381911,4.007947,1.0,0.0,0.03138,3.745877,17.204604
2018-12-31,2018,"""005420120""",-289131,4623,0,15117606,90269342,0,711840,135797,470896,0,841098,-62.541856,5.97114,1.0,0.0,0.047087,29.374216,101.859399
2019-12-31,2019,"""005420120""",-970147,48370,0,12736527,89288445,0,711840,217792,520363,0,0,-20.056791,7.010423,1.0,0.0,0.05589,4.502626,10.75797
2020-12-31,2020,"""005420120""",-807683,72481,0,12006568,88446360,0,711840,342381,725921,0,885730,-11.143376,7.366498,1.0,0.0,0.059288,4.723734,10.015328


In [23]:
# Count the rows per closing year with Polars


df_dna_expert.filter(
    pl.col("AnneeClotureExercice").is_between(2016, 2022)
).group_by("AnneeClotureExercice").count().sort("AnneeClotureExercice")



AnneeClotureExercice,count
i32,u32
2016,484144
2017,530057
2018,522084
2019,523470
2020,510227
2021,500065
2022,435016


In [13]:
df_dna_expert.describe()

statistic,DateClotureExercice,AnneeClotureExercice,siren,HN_RésultatNet,FA_ChiffreAffairesVentes,FB_AchatsMarchandises,CJCK_TotalActifBrut,DL_DettesCourtTerme,DM_DettesLongTerme,DA_TresorerieActive,FJ_ResultatFinancier,FR_ResultatExceptionnel,DF_CapitauxPropres,EG_ImpotsTaxes,ratio_rentabilite_nette,ratio_endettement,ratio_marge_brute,ratio_capitaux_propres,ratio_tresorerie,ratio_resultat_financier,ratio_resultat_exceptionnel
str,str,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""3706645""",3706645.0,"""3706645""",3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0,3706645.0
"""null_count""","""0""",0.0,"""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""","""2019-10-06 17:39:49.194433""",2018.926753,,361013.787403,1465000.0,188926.995324,3913800.0,4380700.0,28888.753277,1948400.0,3306400.0,5872000.0,29869.091996,1101200.0,297520000000.0,415630000000.0,-17131000000.0,873830000.0,217960000000.0,1155400000000.0,3271300000000.0
"""std""",,2.143208,,17763000.0,27663000.0,9286100.0,43182000.0,61281000.0,5833200.0,36837000.0,35760000.0,60821000.0,4470800.0,16379000.0,17176000000000.0,26438000000000.0,3144500000000.0,1144300000000.0,17762000000000.0,16260000000000.0,48061000000000.0
"""min""","""1919-09-30""",1919.0,"""005420120""",-2147500000.0,-547490000.0,-52500000.0,-1053500000.0,-2147500000.0,-360972.0,-2147500000.0,-1073500000.0,-2147500000.0,-1382600000.0,-2147500000.0,-2147500000000000.0,-2147500000000000.0,-1884700000000000.0,-7297200000000.0,-1619000000000000.0,-1073500000000000.0,-2147500000000000.0
"""25%""","""2017-12-31""",2017.0,,0.0,0.0,0.0,111338.0,48516.0,0.0,7622.0,0.0,0.0,0.0,0.0,0.0,0.257128,0.0,0.0,0.024007,0.0,0.0
"""50%""","""2019-12-31""",2019.0,,2619.0,0.0,0.0,425115.0,260972.0,0.0,30000.0,40521.0,164367.0,0.0,82444.0,0.026723,0.609238,0.0,0.0,0.083855,1.002534,61.977393
"""75%""","""2021-09-30""",2021.0,,66630.0,0.0,0.0,1224340.0,833139.0,0.0,155000.0,869181.0,1285494.0,0.0,431124.0,26156000000.0,1.042921,0.0,0.0,0.33101,132980000000.0,370880000000.0
"""max""","""2029-12-31""",2029.0,"""999990542""",2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000.0,2147500000000000.0,2147500000000000.0,1414900000000.0,2147500000000000.0,2147500000000000.0,2147500000000000.0,2147500000000000.0


In [11]:
# Save the final DataFrame to a Parquet file

OUTPUT_PATH = "../Data/processed/sirene_bilan.parquet"

df_dna_expert.write_parquet(OUTPUT_PATH)

In [31]:
import polars as pl
import os

print("--- Lancement de la Data Prep 'Monstrueuse' (Wide) ---")

# --- 1. OPEN THE "MASTER FILES" (lazy scan: the data itself is read at .collect()) ---
try:
    df_sirene = pl.scan_parquet("../Data/processed/sirene_infos.parquet")
    df_bilan = pl.scan_parquet("../Data/processed/sirene_bilan.parquet")
    print("Fichiers 'infos' et 'bilan' chargés.")
except Exception as e:
    print(f"ERREUR: Fichiers 'processed' non trouvés. {e}")
    raise  # re-raise with the original traceback

# --- 2. DEFINE THE FEATURES TO LAG (the history) ---
# The 11 "diamond" raw codes plus the 7 ratios
FEATURES_A_LAGGER = [
    'HN_RésultatNet', 'FA_ChiffreAffairesVentes', 'FB_AchatsMarchandises',
    'CJCK_TotalActifBrut', 'DL_DettesCourtTerme', 'DM_DettesLongTerme',
    'DA_TresorerieActive', 'FJ_ResultatFinancier', 'FR_ResultatExceptionnel',
    'DF_CapitauxPropres', 'EG_ImpotsTaxes',
    "ratio_rentabilite_nette", "ratio_endettement", "ratio_marge_brute", 
    "ratio_capitaux_propres", "ratio_tresorerie",
    "ratio_resultat_financier", "ratio_resultat_exceptionnel"
]

# --- 3. BUILD THE 3 TEMPORAL SNAPSHOTS ---
# The reference year is 2022, so "N" = 2022
print("Création des 3 instantanés temporels (N, N-1, N-2)...")

# A. Year N (2022): the base
df_N = df_bilan.filter(pl.col("AnneeClotureExercice") == 2022).select(
    "siren",
    # Add the "_N" suffix
    *[pl.col(c).alias(f"{c}_N") for c in FEATURES_A_LAGGER]
)

# B. Year N-1 (2021): the recent past
df_N_moins_1 = df_bilan.filter(pl.col("AnneeClotureExercice") == 2021).select(
    "siren",
    # Add the "_N_moins_1" suffix
    *[pl.col(c).alias(f"{c}_N_moins_1") for c in FEATURES_A_LAGGER]
)

# C. Year N-2 (2020): the more distant past
df_N_moins_2 = df_bilan.filter(pl.col("AnneeClotureExercice") == 2020).select(
    "siren",
    # Add the "_N_moins_2" suffix
    *[pl.col(c).alias(f"{c}_N_moins_2") for c in FEATURES_A_LAGGER]
)

# --- 4. TEMPORAL JOIN (LEFT JOINS) ---
# Build the "wide" base, intended to hold one row per siren
# (this holds only if each yearly snapshot has at most one row per siren)
print("Assemblage de la base 'large' (wide)...")

# 2022 is the base year
df_wide = df_N.join(
    df_N_moins_1, on="siren", how="left"
).join(
    df_N_moins_2, on="siren", how="left"
)

# Fill nulls with 0 (e.g. companies created in 2021 have no N-2 data);
# a missing year then becomes indistinguishable from a true zero
df_wide = df_wide.fill_null(0)

# .collect() runs the whole lazy query here just to report the shape
print(f"Base 'wide' financière créée. Shape: {df_wide.collect().shape}")

# --- 5. FEATURE ENGINEERING (velocity) ---
print("Création des features de 'Vélocité' (Niveau 1)...")

# Done in two steps: expressions inside one with_columns() call cannot
# reference columns created in that same call

# STEP 5.A: level-1 variations
df_wide = df_wide.with_columns(

    # Change in profitability (N vs N-1)
    (pl.col("ratio_rentabilite_nette_N") - pl.col("ratio_rentabilite_nette_N_moins_1")).alias("variation_rentabilite_N1"),

    # Change in profitability (N-1 vs N-2)
    (pl.col("ratio_rentabilite_nette_N_moins_1") - pl.col("ratio_rentabilite_nette_N_moins_2")).alias("variation_rentabilite_N2"),

    # Change in revenue (N vs N-1)
    (pl.col("FA_ChiffreAffairesVentes_N") - pl.col("FA_ChiffreAffairesVentes_N_moins_1")).alias("variation_CA_N1")

    # ... other level-1 features can be added here ...
)

print("Création des features d'Accélération (Niveau 2)...")
# STEP 5.B: level-2 variations (built from the level-1 columns)
# This requires a NEW .with_columns() call
df_wide = df_wide.with_columns(

    # Profitability acceleration
    (pl.col("variation_rentabilite_N1") - pl.col("variation_rentabilite_N2")).alias("acceleration_rentabilite")

    # ... other level-2 features can be added here ...
)

# --- 6. FINAL JOIN (WIDE FINANCIALS + DEMOGRAPHICS) ---
print("Jointure finale avec les données SIRENE (Démo)...")
df_final_ml = df_wide.join(
    df_sirene.select(  # demographic features
        "siren",
        "dateCreationUniteLegale",
        "dateFermeture",
        "categorieJuridiqueUniteLegale",
        "trancheEffectifsUniteLegale",
        "activitePrincipaleUniteLegale",
        "departement"
    ),
    on="siren",
    how="left"  # keep the wide rows even when no SIRENE info is available
)

# --- 7. SAVE THE NEW "MASTER FILE" ---
PATH_OUTPUT = "../Data/processed/dataset_ML_FINAL_WIDE_2022.parquet"

# Make sure the "Data/processed" folder exists BEFORE saving
# (os is already imported at the top of this cell)
print(f"Vérification/Création du dossier : {os.path.dirname(PATH_OUTPUT)}")
os.makedirs(os.path.dirname(PATH_OUTPUT), exist_ok=True)

print(f"Sauvegarde du dataset ML final (Wide) dans {PATH_OUTPUT}...")
# Collect once and reuse the result instead of re-running the lazy query three times
df_final_ml_collected = df_final_ml.collect()
df_final_ml_collected.write_parquet(PATH_OUTPUT)

print("---")
print("Data Prep 'Monstrueuse' (Wide) TERMINÉE.")
print(f"Shape finale : {df_final_ml_collected.shape}")
print("---")

# Show the head of the DataFrame that was just saved
print(df_final_ml_collected.head())

--- Lancement de la Data Prep 'Monstrueuse' (Wide) ---
Fichiers 'infos' et 'bilan' chargés.
Création des 3 instantanés temporels (N, N-1, N-2)...
Assemblage de la base 'large' (wide)...
Base 'wide' financière créée. Shape: (436312, 55)
Création des features de 'Vélocité' (Niveau 1)...
Création des features d'Accélération (Niveau 2)...
Jointure finale avec les données SIRENE (Démo)...
Vérification/Création du dossier : ../Data/processed
Sauvegarde du dataset ML final (Wide) dans ../Data/processed/dataset_ML_FINAL_WIDE_2022.parquet...
---
Data Prep 'Monstrueuse' (Wide) TERMINÉE.
Shape finale : (436312, 65)
---
shape: (5, 65)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ siren     ┆ HN_Résult ┆ FA_Chiffr ┆ FB_Achats ┆ … ┆ categorie ┆ trancheEf ┆ activiteP ┆ departem │
│ ---       ┆ atNet_N   ┆ eAffaires ┆ Marchandi ┆   ┆ Juridique ┆ fectifsUn ┆ rincipale ┆ ent      │
│ str       ┆ ---       ┆ Ventes_N  ┆ ses_N     ┆   ┆ UniteLega ┆ 

In [29]:
df_final_ml.schema



Schema([('siren', String),
        ('HN_RésultatNet_N', Int32),
        ('FA_ChiffreAffairesVentes_N', Int32),
        ('FB_AchatsMarchandises_N', Int32),
        ('CJCK_TotalActifBrut_N', Int32),
        ('DL_DettesCourtTerme_N', Int32),
        ('DM_DettesLongTerme_N', Int32),
        ('DA_TresorerieActive_N', Int32),
        ('FJ_ResultatFinancier_N', Int32),
        ('FR_ResultatExceptionnel_N', Int32),
        ('DF_CapitauxPropres_N', Int32),
        ('EG_ImpotsTaxes_N', Int32),
        ('ratio_rentabilite_nette_N', Float64),
        ('ratio_endettement_N', Float64),
        ('ratio_marge_brute_N', Float64),
        ('ratio_capitaux_propres_N', Float64),
        ('ratio_tresorerie_N', Float64),
        ('ratio_resultat_financier_N', Float64),
        ('ratio_resultat_exceptionnel_N', Float64),
        ('HN_RésultatNet_N_moins_1', Int32),
        ('FA_ChiffreAffairesVentes_N_moins_1', Int32),
        ('FB_AchatsMarchandises_N_moins_1', Int32),
        ('CJCK_TotalActifBrut_N_moins_

In [32]:
# Load the final wide 2022 ML file with Polars

import polars as pl

PATH_INPUT = "../Data/processed/dataset_ML_FINAL_WIDE_2022.parquet"
df_ml_final_wide = pl.read_parquet(PATH_INPUT)

In [33]:
df_ml_final_wide.head()

siren,HN_RésultatNet_N,FA_ChiffreAffairesVentes_N,FB_AchatsMarchandises_N,CJCK_TotalActifBrut_N,DL_DettesCourtTerme_N,DM_DettesLongTerme_N,DA_TresorerieActive_N,FJ_ResultatFinancier_N,FR_ResultatExceptionnel_N,DF_CapitauxPropres_N,EG_ImpotsTaxes_N,ratio_rentabilite_nette_N,ratio_endettement_N,ratio_marge_brute_N,ratio_capitaux_propres_N,ratio_tresorerie_N,ratio_resultat_financier_N,ratio_resultat_exceptionnel_N,HN_RésultatNet_N_moins_1,FA_ChiffreAffairesVentes_N_moins_1,FB_AchatsMarchandises_N_moins_1,CJCK_TotalActifBrut_N_moins_1,DL_DettesCourtTerme_N_moins_1,DM_DettesLongTerme_N_moins_1,DA_TresorerieActive_N_moins_1,FJ_ResultatFinancier_N_moins_1,FR_ResultatExceptionnel_N_moins_1,DF_CapitauxPropres_N_moins_1,EG_ImpotsTaxes_N_moins_1,ratio_rentabilite_nette_N_moins_1,ratio_endettement_N_moins_1,ratio_marge_brute_N_moins_1,ratio_capitaux_propres_N_moins_1,ratio_tresorerie_N_moins_1,ratio_resultat_financier_N_moins_1,ratio_resultat_exceptionnel_N_moins_1,HN_RésultatNet_N_moins_2,FA_ChiffreAffairesVentes_N_moins_2,FB_AchatsMarchandises_N_moins_2,CJCK_TotalActifBrut_N_moins_2,DL_DettesCourtTerme_N_moins_2,DM_DettesLongTerme_N_moins_2,DA_TresorerieActive_N_moins_2,FJ_ResultatFinancier_N_moins_2,FR_ResultatExceptionnel_N_moins_2,DF_CapitauxPropres_N_moins_2,EG_ImpotsTaxes_N_moins_2,ratio_rentabilite_nette_N_moins_2,ratio_endettement_N_moins_2,ratio_marge_brute_N_moins_2,ratio_capitaux_propres_N_moins_2,ratio_tresorerie_N_moins_2,ratio_resultat_financier_N_moins_2,ratio_resultat_exceptionnel_N_moins_2,variation_rentabilite_N1,variation_rentabilite_N2,variation_CA_N1,acceleration_rentabilite,dateCreationUniteLegale,dateFermeture,categorieJuridiqueUniteLegale,trancheEffectifsUniteLegale,activitePrincipaleUniteLegale,departement
str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,f64,f64,f64,f64,f64,f64,f64,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,f64,f64,f64,f64,f64,f64,f64,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,f64,f64,f64,f64,f64,f64,f64,f64,f64,i32,f64,date,date,i64,str,str,str
"""005420120""",-1081200,13906,0,10107302,85387013,0,711840,245784,526013,0,854623,-77.750611,8.448052,1.0,0.0,0.070428,17.674673,37.826334,-1974866,44073,0,10813111,86469939,0,711840,271605,619916,0,2500000,-44.808976,7.996768,1.0,0.0,0.065831,6.162617,14.065664,-807683,72481,0,12006568,88446360,0,711840,342381,725921,0,885730,-11.143376,7.366498,1.0,0.0,0.059288,4.723734,10.015328,-32.941635,-33.665601,-30167,0.723965,1954-01-01,,5599,"""03""","""70.10Z""","""62"""
"""005520176""",561370,1362160,0,4849259,3634604,0,1000000,10261455,10598945,0,2021582,0.412118,0.749517,1.0,0.0,0.206217,7.533223,7.780984,541249,997690,27041,4475296,3173234,0,1000000,8054388,8318194,0,1542916,0.542502,0.709056,0.972896,0.0,0.223449,8.073037,8.337454,-300307,700617,24406,3652740,2888510,0,1000000,6082217,6353066,0,1045892,-0.428632,0.790779,0.965165,0.0,0.273767,8.68123,9.067816,-0.130385,0.971134,364470,-1.101519,1955-01-01,,5710,"""21""","""17.21A""","""80"""
"""005520242""",385970,6354,0,3814778,1339284,0,2775000,7498811,8140348,0,2803196,60.744413,0.351078,1.0,0.0,0.727434,1180.171703,1281.137551,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,98843,7241,0,3630190,783561,0,2775000,4874985,5547208,0,3100850,13.650463,0.215846,1.0,0.0,0.764423,673.24748,766.083138,60.744413,-13.650463,6354,74.394876,1955-01-01,,5710,"""12""","""20.30Z""","""80"""
"""005580501""",-6706,0,0,624529,5132555,0,4600000,0,0,0,5277,-6706000000.0,8.218281,0.0,0.0,7.365551,0.0,0.0,230616,0,0,626918,5139261,0,4600000,0,0,0,0,230620000000.0,8.197661,0.0,0.0,7.337483,0.0,0.0,14662,0,0,425447,608645,0,2000000,0,0,0,4586,14662000000.0,1.430601,0.0,0.0,4.700938,0.0,0.0,-237320000000.0,215950000000.0,0,-453280000000.0,1900-01-01,,5710,"""NN""","""66.30Z""","""75"""
"""005620034""",332454,8236901,0,3878366,2602867,0,110050,8236901,8400611,0,1352699,0.040362,0.671125,1.0,0.0,0.028375,1.0,1.019875,0,0,0,3432567,2100114,0,110050,0,0,0,97311,0.0,0.61182,0.0,0.0,0.032061,0.0,0.0,0,0,0,3512689,1522206,0,110050,0,0,0,1723560,0.0,0.433345,0.0,0.0,0.031329,0.0,0.0,0.040362,0.0,8236901,0.040362,1956-01-01,,5710,"""11""","""46.73A""","""80"""

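As a quick check of how the velocity and acceleration columns relate, the first row above (siren 005420120) can be recomputed by hand. The values below are copied from that row as displayed, so they carry the table's rounding:

```python
# ratio_rentabilite_nette for N (2022), N-1 (2021) and N-2 (2020), first row above
r_n, r_n1, r_n2 = -77.750611, -44.808976, -11.143376

variation_n1 = r_n - r_n1                   # variation_rentabilite_N1
variation_n2 = r_n1 - r_n2                  # variation_rentabilite_N2
acceleration = variation_n1 - variation_n2  # acceleration_rentabilite

# The acceleration is the second difference of the series: N - 2*(N-1) + (N-2)
second_difference = r_n - 2 * r_n1 + r_n2

print(round(variation_n1, 6), round(variation_n2, 6), round(acceleration, 6))
```

Note that for siren 005520242 the N-1 values are all 0 after `fill_null(0)`, so its variations measure the distance to zero rather than a real year-over-year change.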

In [None]:
# Number of unique siren values
# (in a notebook the result of the last expression is displayed automatically)

df_ml_final_wide.select(
    pl.col("siren").n_unique().alias("nombre_siren_uniques")
)


---

In [1]:
import polars as pl
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV  # RandomizedSearchCV is kept for tuning
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, r2_score
from xgboost import XGBRegressor
import mlflow
import warnings

warnings.filterwarnings('ignore', category=UserWarning)

# --- 1. CONFIGURE MLFLOW ---
mlflow.set_experiment("Projet_SIRENE_Regression_Monstre")
mlflow.xgboost.autolog()
print("MLflow configuré pour le 'Run Élite Profond'.")

# --- 2. LOAD THE "MASTER FILES" ---
print("Chargement des 'Master files'...")
df_sirene = pl.read_parquet("../Data/processed/sirene_infos.parquet")
df_bilan = pl.read_parquet("../Data/processed/sirene_bilan.parquet")
print("Fichiers 'infos' et 'bilan' chargés.")

# --- 3. DEFINE THE "ELITE" FEATURES (your choice) ---
# The 4 non-ratio codes whose history is tracked
ELITE_RAW_CODES = [
    'HN_RésultatNet', 'FA_ChiffreAffairesVentes', 'FJ_ResultatFinancier', 'EG_ImpotsTaxes'
]
# The 3 ratios, kept for N-1 only
ELITE_RATIO_CODES = [
    "ratio_rentabilite_nette", "ratio_endettement", "ratio_resultat_financier"
]
# Demographic features
CATEGORICAL_FEATURES = ["categorieJuridiqueUniteLegale", "departement"]
# Target
TARGET = "TARGET_rentabilite_N"

# --- 4. BUILD THE DATASET (DEEP "SELF-JOIN" BACK TO N-3) ---
print("Création du dataset temporel (N, N-1, N-2, N-3)...")

# A. Target for year N (2019)
df_N = df_bilan.filter(pl.col("AnneeClotureExercice") == 2019).select(
    "siren", pl.col("ratio_rentabilite_nette").alias(TARGET)
)

# B. Year N-1 (2018): "state" features
df_N_moins_1 = df_bilan.filter(pl.col("AnneeClotureExercice") == 2018).select(
    "siren", *[pl.col(c).alias(f"{c}_N1") for c in (ELITE_RAW_CODES + ELITE_RATIO_CODES)]
)

# C. Year N-2 (2017): "history 1" features
df_N_moins_2 = df_bilan.filter(pl.col("AnneeClotureExercice") == 2017).select(
    "siren", *[pl.col(c).alias(f"{c}_N2") for c in ELITE_RAW_CODES]  # only the 4 non-ratio codes
)

# D. Year N-3 (2016): "history 2" features
df_N_moins_3 = df_bilan.filter(pl.col("AnneeClotureExercice") == 2016).select(
    "siren", *[pl.col(c).alias(f"{c}_N3") for c in ELITE_RAW_CODES]  # only the 4 non-ratio codes
)

# --- 5. "DEEP" FEATURE ENGINEERING ---
print("Création des features de 'Vélocité' et 'Accélération'...")
# Join N-1, N-2 and N-3
df_features = df_N_moins_1.join(
    df_N_moins_2, on="siren", how="left"
).join(
    df_N_moins_3, on="siren", how="left"
).fill_null(0)  # Important: a missing year becomes 0 instead of null

# Build the velocity and acceleration features for the 4 codes
NEW_VELOCITY_FEATURES = []
for c in ELITE_RAW_CODES:
    # Velocity N-1 vs N-2
    var_n1_n2 = f"var_{c}_N1_N2"
    df_features = df_features.with_columns(
        (pl.col(f"{c}_N1") - pl.col(f"{c}_N2")).alias(var_n1_n2)
    )
    # Velocity N-2 vs N-3
    var_n2_n3 = f"var_{c}_N2_N3"
    df_features = df_features.with_columns(
        (pl.col(f"{c}_N2") - pl.col(f"{c}_N3")).alias(var_n2_n3)
    )
    # Acceleration
    accel = f"accel_{c}_N1_N3"
    df_features = df_features.with_columns(
        (pl.col(var_n1_n2) - pl.col(var_n2_n3)).alias(accel)
    )
    NEW_VELOCITY_FEATURES.extend([var_n1_n2, var_n2_n3, accel])

print(f"{len(NEW_VELOCITY_FEATURES)} features de vélocité/accélération créées.")

# Join the demographic features (the backbone)
df_features = df_features.join(
    df_sirene.select("siren", *CATEGORICAL_FEATURES),
    on="siren",
    how="left"
)

# --- 6. FINAL JOIN (features + target) ---
df_ml = df_features.join(df_N, on="siren", how="inner")
print(f"Dataset de Régression 'Profond' créé. Shape: {df_ml.shape}")

# --- 7. FINAL DEFINITION OF FEATURES (X) AND TARGET (y) ---
# The features (X) are:
# 1. The N-1 state (the 7 selected "elite" features)
# 2. The 12 new velocity/acceleration features
# 3. The 2 demographic features
FEATURES_ETAT_N1 = [f"{c}_N1" for c in (ELITE_RAW_CODES + ELITE_RATIO_CODES)]
NUMERIC_FEATURES_FINAL = FEATURES_ETAT_N1 + NEW_VELOCITY_FEATURES

print(f"Total features: {len(CATEGORICAL_FEATURES)} cat + {len(NUMERIC_FEATURES_FINAL)} num.")

# --- 8. OUTLIER HANDLING (clipping) ---
print("Clipping des outliers...")
LOWER_BOUND, UPPER_BOUND = -5.0, 5.0
# Note: no column name contains "variation" here (velocity features are prefixed "var_"),
# while "accel" matches the acceleration features, which are differences of raw amounts
clip_cols = [c for c in df_ml.columns if "ratio" in c or "TARGET" in c or "variation" in c or "accel" in c]
df_ml = df_ml.with_columns(
    pl.col(clip_cols).clip(lower_bound=LOWER_BOUND, upper_bound=UPPER_BOUND)
).fill_null(0)

# Convert to pandas
X = df_ml.select(NUMERIC_FEATURES_FINAL + CATEGORICAL_FEATURES).to_pandas()
y = df_ml.select(TARGET).to_pandas().squeeze()

# --- 9. PREPROCESSING (the "deep" preprocessor) ---
print("Preprocessing avec RobustScaler (Num) + OHE (Cat)...")
numerical_transformer = RobustScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, NUMERIC_FEATURES_FINAL),
        ("cat", categorical_transformer, CATEGORICAL_FEATURES)
    ],
    remainder="passthrough"
)

# --- 10. PIPELINE AND TUNING ---
# Note: this pipeline holds only the preprocessor; the XGBoost model is tuned
# separately, and the preprocessor is actually fitted by fit_transform() further down
print("Création de la pipeline (Preprocessor + XGB Regressor)...")
pipeline_preprocessor = Pipeline(steps=[('preprocessor', preprocessor)])
print("Preprocessing terminé. Lancement du tuning 'Profond' (peut prendre 10-20 minutes)...")

# Tuning grid (unchanged from the previous run)
param_grid = {
    'n_estimators': [100, 250, 400],
    'max_depth': [5, 7, 10],
    'learning_rate': [0.1, 0.05, 0.01],
    'subsample': [0.7, 1.0]
}

xgb_reg = XGBRegressor(objective='reg:squarederror', eval_metric='rmse', random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# RandomizedSearchCV with 20 iterations
random_search = RandomizedSearchCV(
    estimator=xgb_reg,
    param_distributions=param_grid,
    n_iter=20,  # 20 sampled combinations
    cv=kfold,
    scoring='r2',
    verbose=2,
    n_jobs=-1,
    random_state=42
)

X_processed = pipeline_preprocessor.fit_transform(X)

with mlflow.start_run() as run:
    random_search.fit(X_processed, y)
    mlflow.log_param("model_type", "Model_I_Deep_Elite_Tuned")
    mlflow.log_metric("best_r2_score", random_search.best_score_)

# --- 11. "DEEP" TUNING RESULTS ---
print("---")
print("--- RÉSULTATS DU TUNING 'ÉLITE PROFOND' (RANDOMIZEDSEARCHCV) ---")
print(f"Meilleur Score R² trouvé : {random_search.best_score_:.4f}")
print("Meilleurs Hyperparamètres :")
print(random_search.best_params_)
print("---")
print(f"Score précédent (Modèle H, 'large'): 0.3294")
print(f"Score actuel (Modèle I, 'profond'): {random_search.best_score_:.4f}")
print("---")
print("Toutes les expériences sont loggées dans 'mlruns'.")
print("Lance 'mlflow ui' dans ton terminal pour voir le dashboard.")



MLflow configuré pour le 'Run Élite Profond'.
Chargement des 'Master files'...
Fichiers 'infos' et 'bilan' chargés.
Création du dataset temporel (N, N-1, N-2, N-3)...
Création des features de 'Vélocité' et 'Accélération'...
12 features de vélocité/accélération créées.
Dataset de Régression 'Profond' créé. Shape: (415411, 31)
Total features: 2 cat + 19 num.
Clipping des outliers...
Preprocessing avec RobustScaler (Num) + OHE (Cat)...
Création de la pipeline (Preprocessor + XGB Regressor)...
Preprocessing terminé. Lancement du tuning 'Profond' (peut prendre 10-20 minutes)...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END learning_rate=0.05, max_depth=5, n_estimators=100, subsample=1.0; total time=  14.9s
[CV] END learning_rate=0.05, max_depth=5, n_estimators=100, subsample=1.0; total time=  17.4s
[CV] END learning_rate=0.05, max_depth=5, n_estimators=100, subsample=1.0; total time=  17.4s
[CV] END learning_rate=0.05, max_depth=5, n_estimators=100, subsample=1.0; t



---
--- RÉSULTATS DU TUNING 'ÉLITE PROFOND' (RANDOMIZEDSEARCHCV) ---
Meilleur Score R² trouvé : 0.3195
Meilleurs Hyperparamètres :
{'subsample': 0.7, 'n_estimators': 400, 'max_depth': 10, 'learning_rate': 0.01}
---
Score précédent (Modèle H, 'large'): 0.3294
Score actuel (Modèle I, 'profond'): 0.3195
---
Toutes les expériences sont loggées dans 'mlruns'.
Lance 'mlflow ui' dans ton terminal pour voir le dashboard.
