# Machine learning models: regression

## Objectives of this notebook

In this notebook, we test several machine learning models on the following prediction task:

**Predicting the number of days from the release date until a 33% discount**

We compare the performance of several models:
- **Linear regression**
- **Random forest**
- **XGBoost**

## Data used

We use the cleaned PS5 games dataset, with feature engineering already applied: **featured_games_dataset_final.csv**.

- modeling with the base data
- modeling with the data after feature engineering

## Note on the libraries used

This notebook uses scikit-learn for the machine learning part.

## Importing the libraries

In [572]:


# Python standard library
import sys
from pathlib import Path
import os

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder,
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import (
    LinearRegression,
    Ridge,
    Lasso,
    ElasticNet,
    BayesianRidge,
    HuberRegressor,
    RANSACRegressor,
    Lars,
    LassoLars,
    OrthogonalMatchingPursuit,
)
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    AdaBoostRegressor,
    ExtraTreesRegressor,
    HistGradientBoostingRegressor,
    BaggingRegressor,
)
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings('ignore')

## Utility functions

In [573]:
def column_summary(df: pd.DataFrame):
    summary = []
    for col in df.columns:
        col_type = df[col].dtype
        non_null = df[col].notna().sum()
        null_count = df[col].isna().sum()
        
        # Handle columns that contain lists (unhashable values)
        try:
            unique_count = df[col].nunique()
        except TypeError:
            # On error (lists), temporarily convert the values to strings
            unique_count = df[col].astype(str).nunique()
            print(f"⚠️ Column '{col}' contains unhashable types (probably lists)")

        summary.append({
            'Column': col,
            'Type': str(col_type),
            'Non-Null Count': non_null,
            'Null Count': null_count,
            'Unique Values': unique_count,
        })

    # Print the column summary
    print("=" * 80)
    print("Detailed column summary:")
    print("=" * 80)
    column_summary_df = pd.DataFrame(summary)
    print(column_summary_df.to_string(index=False))
    print("\n")

In [574]:
class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clip outliers using the IQR rule (1.5 * IQR beyond Q1 and Q3)."""
    
    def __init__(self):
        self.lower_bounds_ = None
        self.upper_bounds_ = None
    
    def fit(self, X, y=None):
        if isinstance(X, pd.DataFrame):
            X_values = X.values
        else:
            X_values = X
        
        Q1 = np.percentile(X_values, 25, axis=0)
        Q3 = np.percentile(X_values, 75, axis=0)
        IQR = Q3 - Q1
        
        self.lower_bounds_ = Q1 - 1.5 * IQR
        self.upper_bounds_ = Q3 + 1.5 * IQR
        
        return self
    
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            X_values = X.values
            X_clipped = np.clip(X_values, self.lower_bounds_, self.upper_bounds_)
            return pd.DataFrame(X_clipped, columns=X.columns, index=X.index)
        else:
            return np.clip(X, self.lower_bounds_, self.upper_bounds_)
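
To make the IQR rule concrete, here is a small worked example that does not depend on the class above. The column values are made up for illustration; only `np.percentile` and `np.clip` are used, as in `OutlierClipper`.

```python
import numpy as np

# Hypothetical column with one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Bounds of the IQR rule: anything outside is pulled back to the bound
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

clipped = np.clip(values, lower, upper)
print(lower, upper)
print(clipped)
```

With these values, Q1 = 2 and Q3 = 4, so the bounds are -1 and 7 and the extreme value 100 is clipped to 7. Note that the bounds are learned in `fit` on the training set only and then reused on the test set.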

In [575]:
class ZeroImputer(BaseEstimator, TransformerMixin):
    """Impute missing values with 0."""
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            return X.fillna(0)
        else:
            # NumPy array case
            X_copy = X.copy()
            X_copy[np.isnan(X_copy)] = 0
            return X_copy
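
As a point of comparison, scikit-learn's built-in `SimpleImputer` with `strategy="constant"` should give the same result on numeric columns. The sketch below uses a made-up two-column frame and only checks that equivalence; it is not a claim that the custom class must be replaced.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing counts
df_demo = pd.DataFrame(
    {
        "dlcs_count": [2.0, np.nan, 5.0],
        "series_count": [np.nan, 1.0, 3.0],
    }
)

# Built-in equivalent of "fill missing values with 0"
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled = zero_imputer.fit_transform(df_demo)

# Reference result obtained with pandas
expected = df_demo.fillna(0).to_numpy()
print(np.array_equal(filled, expected))
```

One practical difference: `SimpleImputer` implements `get_feature_names_out`, which custom transformers without that method do not.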

## Loading the data

In [576]:
path = os.path.join(Path.cwd().parent, "data/processed")

In [577]:
# Load the CSV data to inspect it
df_dataset = pd.read_csv(os.path.join(path, "featured_games_dataset_final.csv"))

In [578]:
column_summary(df_dataset)

Detailed column summary:
                              Column    Type  Non-Null Count  Null Count  Unique Values
                            id_store  object            5382           0           5382
               pssstore_stars_rating float64            5382           0            352
         pssstore_stars_rating_count   int64            5382           0           1583
             metacritic_critic_score float64            1268        4114             67
                              is_ps4   int64            5382           0              2
                            is_indie   int64            5382           0              2
                              is_dlc   int64            5382           0              1
                               is_vr   int64            5382           0              2
                     is_opti_ps5_pro   int64            5382           0              2
                         is_remaster   int64            5382           0              

In [579]:
df_dataset.head()

Unnamed: 0,id_store,pssstore_stars_rating,pssstore_stars_rating_count,metacritic_critic_score,is_ps4,is_indie,is_dlc,is_vr,is_opti_ps5_pro,is_remaster,...,content_category,exclusif_playstation_content,visibility_score,visibility_category,pegi_unified,price_category,game_age_years,month_sin,month_cos,release_season
0,EP8311-PPSA19174_00-0421646910657705,1.57,14,,0,0,0,0,0,0,...,minimal,0,7.0,obscure,7,0 - 7.99,2,-0.8660254,0.5,fall
1,EP2005-PPSA06055_00-SINUCA0000000000,3.26,72,,0,0,0,0,0,0,...,minimal,0,43.0,moderate,3,0 - 7.99,3,0.5,-0.866025,spring
2,EP8311-PPSA16513_00-0233078860249892,1.55,11,,0,0,0,0,0,0,...,minimal,0,7.0,obscure,3,0 - 7.99,2,0.5,-0.866025,spring
3,EP8311-PPSA13840_00-0277389480637871,1.44,18,,0,0,0,0,0,0,...,minimal,0,8.0,obscure,3,0 - 7.99,2,0.5,0.866025,winter
4,EP8311-PPSA12662_00-0212989199890961,1.17,23,,0,0,0,0,0,0,...,minimal,0,8.0,obscure,3,0 - 7.99,3,-2.449294e-16,1.0,winter


## Reminder of the available features

In [580]:
# 48 features in total

# X Basic features, simple, without advanced processing (25)

# base_price                    - numeric (float)
# pssstore_stars_rating_count   - numeric (int)
# pssstore_stars_rating         - numeric (float)
# is_indie                      - boolean
# has_microtransactions         - boolean
# dlcs_count                    - numeric (int)
# packs_deluxe_count            - numeric (int)
# series_count                  - numeric (int)
# trophies_count                - numeric (int)
# is_vr                         - boolean
# has_local_multiplayer         - boolean
# has_online_multiplayer        - boolean
# is_online_only                - boolean
# is_opti_ps5_pro               - boolean
# is_ps_exclusive               - boolean
# difficulty                    - numeric (int)
# is_remaster                   - boolean
# is_ps4                        - boolean
# pegi_unified                  - category (numeric)
# metacritic_critic_score       - numeric (float)
# hours_main_story              - numeric (float)
# voice_languages_count         - numeric (int)
# sub_languages_count           - numeric (int)
# download_size_gb              - numeric (float)
# price_category                - category



#  X Advanced engineered features (21)

# popularity_score              - numeric (float)
# popularity_category           - category
# visibility_score              - numeric (float)
# visibility_category           - category
# exclusif_playstation_content  - boolean
# publisher_game_count          - numeric (int)
# publisher_game_count_cat      - category
# publisher_category            - category
# genre_action_aventure         - boolean
# genre_roles                   - boolean
# genre_sports                  - boolean
# genre_reflexion               - boolean
# genre_rapide                  - boolean
# localization_category         - category
# download_size_category        - category
# content_score                 - numeric (float)
# content_category              - category
# game_age_years                - numeric (int)
# month_sin                     - numeric (float)
# month_cos                     - numeric (float)
# release_season                - category


#  X Bonus observations at +60 days (2)

# has_5pct_discount_at_30d      - boolean
# has_10pct_discount_at_60d     - boolean

# Y
# 
# Regression: number of days
# 
# days_to_10_percent_discount - numeric (int)
# days_to_25_percent_discount - numeric (int)
# days_to_33_percent_discount - numeric (int)
# days_to_50_percent_discount - numeric (int)
# days_to_75_percent_discount - numeric (int)

# Classification: time-to-discount bracket
# 
# days_to_10_percent_discount_category - category
# days_to_25_percent_discount_category - category
# days_to_33_percent_discount_category - category
# days_to_50_percent_discount_category - category
# days_to_75_percent_discount_category - category
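
The `month_sin` and `month_cos` values shown in the `df_dataset.head()` output are consistent with the usual cyclical encoding sin(2π·month/12), cos(2π·month/12). That this is how the dataset was built is an assumption; the sketch below, with a hypothetical `release_month` column, simply reproduces those values.

```python
import numpy as np
import pandas as pd

# Hypothetical release months: January, April, October, December
months = pd.Series([1, 4, 10, 12], name="release_month")

# Map the month onto a circle so that December and January end up close
angle = 2 * np.pi * months / 12
encoded = pd.DataFrame(
    {
        "month_sin": np.sin(angle),
        "month_cos": np.cos(angle),
    }
)
print(encoded.round(3))
```

October gives approximately (-0.866, 0.5), which matches the first row of the preview (a game released in the fall). Both columns already lie in [-1, 1], so they need no further scaling.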

## Feature combinations

In [581]:
base_features = {
    "numeric_continuous": [
        "base_price",
        "pssstore_stars_rating",  # Post-release observation
        "metacritic_critic_score",  # Post-release observation
        "hours_main_story",
        "download_size_gb",
    ],
    "numeric_discrete_zero": [
        "packs_deluxe_count",
        "series_count",
        "sub_languages_count",
        "voice_languages_count",
        "publisher_game_count",
        "dlcs_count",
        "game_age_years",
    ],
    "numeric_discrete_median": [
        "pssstore_stars_rating_count",  # Post-release observation
        "trophies_count",
    ],
    "boolean_cols": [
        "is_indie",
        "is_vr",
        "is_ps_exclusive",
        "is_remaster",
        "is_ps4",
        "has_local_multiplayer",
        "has_online_multiplayer",
        "is_online_only",
        "has_microtransactions",
        "is_opti_ps5_pro",
        "exclusif_playstation_content",  # advanced
        "has_5pct_discount_at_30d",  # Post-release observation
        "has_10pct_discount_at_60d",  # Post-release observation
        "genre_action_aventure",
        "genre_roles",
        "genre_sports",
        "genre_reflexion",
        "genre_rapide",
    ],
    "categorical_cols": [
        "price_category",
        "publisher_category",
        "visibility_category",  # advanced
        "popularity_category",  # Post-release observation
        "content_category",  # advanced
        "download_size_category",
        "localization_category",
        "publisher_game_count_cat",
        "release_season",
    ],
    "ordinal_cols": ["difficulty", "pegi_unified"],
    "score_cols": ["popularity_score", "visibility_score", "content_score"],  # advanced
    "month_cols": ["month_sin", "month_cos"],

    # SimpleImputer(strategy='median'))
    # ('zero_imputer', ZeroImputer()),
    # ('outlier_clipper', OutlierClipper()),
}

In [582]:
def get_all_features_columns(features_dict):
    all_columns = []
    for feature_list in features_dict.values():
        all_columns.extend(feature_list)
    return all_columns
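
Because each group of the dictionary is later routed to its own transformer, a column listed in two groups would be emitted twice. The following sketch, on a shortened and hypothetical feature dictionary, shows the flattening and a simple duplicate check.

```python
from collections import Counter
from itertools import chain

# Hypothetical, shortened feature groups
demo_features = {
    "numeric_continuous": ["base_price", "download_size_gb"],
    "boolean_cols": ["is_indie", "is_vr"],
    "month_cols": ["month_sin", "month_cos"],
}

# Flatten the groups into a single list of column names
all_columns = list(chain.from_iterable(demo_features.values()))

# Any name that appears in more than one group
duplicates = [name for name, count in Counter(all_columns).items() if count > 1]

print(len(all_columns), duplicates)
```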

### Pending tests

In [583]:
# pssstore_stars_rating_count (0 to 1.8M) → log transform to reduce the scale
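
One possible way to test this idea is `np.log1p` wrapped in a `FunctionTransformer`, so that it can be used as a pipeline step. The values below are illustrative and only span the range mentioned above; `log1p` is chosen over `log` because the column contains zeros.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative rating counts, from 0 up to 1.8M
counts = np.array([[0.0], [14.0], [72.0], [1_800_000.0]])

# log1p(x) = log(1 + x), defined at x = 0
log_transformer = FunctionTransformer(np.log1p)
log_counts = log_transformer.fit_transform(counts)

print(log_counts.ravel().round(2))
```

After the transform, the range shrinks from six orders of magnitude to roughly 0 to 14.4, which is usually easier for linear models and SVMs to handle. Tree-based models are insensitive to monotonic transforms, so the effect is expected mainly on the other families.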

## Preprocessing

1. Continuous numeric → imputation + scaling (if needed)
2. Discrete numeric → imputation + binning or square root (optional)
3. Boolean → 0/1 format
4. Ordinal → keep the natural order
5. Nominal → label encoding for tree-based models, one-hot encoding otherwise

### scikit-learn preprocessing pipeline

In [584]:
def create_pipeline(model_type: str, available_columns: dict):
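    """
    Create a scikit-learn preprocessing Pipeline.

    model_type: 'tree', 'linear' or 'svm'.

    Usage:
        X_train_processed = pipeline.fit_transform(X_train)
        X_test_processed = pipeline.transform(X_test)
    """
    # Fail early: an unknown type would otherwise raise a NameError further down
    if model_type not in ("tree", "linear", "svm"):
        raise ValueError(f"Unknown model_type: {model_type!r}")

    print(f"Creating pipeline for model: {model_type.upper()}")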

    # List collecting all the transformers
    transformers = []

    # CONTINUOUS NUMERIC
    filtered_numeric_continuous = available_columns["numeric_continuous"]

    if model_type == "tree":
        # TREE: imputation only, no scaling
        numeric_continuous_transformer = Pipeline(
            steps=[("imputer", SimpleImputer(strategy="median"))]
        )

    elif model_type == "linear":
        # LINEAR: imputation + outlier clipping + standardization
        numeric_continuous_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("outlier_clipper", OutlierClipper()),
                ("scaler", StandardScaler()),
            ]
        )

    elif model_type == "svm":
        # SVM: imputation + outlier clipping + normalization to [0, 1]
        numeric_continuous_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("outlier_clipper", OutlierClipper()),
                ("scaler", MinMaxScaler()),
            ]
        )

    transformers.append(
        ("num_continuous", numeric_continuous_transformer, filtered_numeric_continuous)
    )

    # DISCRETE NUMERIC (zero imputation)
    numeric_discrete_zero = available_columns["numeric_discrete_zero"]

    if model_type == "tree":
        numeric_discrete_zero_transformer = Pipeline(
            steps=[("zero_imputer", ZeroImputer())]
        )
    elif model_type == "linear":
        numeric_discrete_zero_transformer = Pipeline(
            steps=[("zero_imputer", ZeroImputer()), ("scaler", StandardScaler())]
        )
    elif model_type == "svm":
        numeric_discrete_zero_transformer = Pipeline(
            steps=[("zero_imputer", ZeroImputer()), ("scaler", MinMaxScaler())]
        )

    transformers.append(
        ("num_discrete_zero", numeric_discrete_zero_transformer, numeric_discrete_zero)
    )

    # DISCRETE NUMERIC (median imputation)
    numeric_discrete_median = available_columns["numeric_discrete_median"]

    if model_type == "tree":
        numeric_discrete_median_transformer = Pipeline(
            steps=[("imputer", SimpleImputer(strategy="median"))]
        )
    elif model_type == "linear":
        numeric_discrete_median_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler()),
            ]
        )
    elif model_type == "svm":
        numeric_discrete_median_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", MinMaxScaler()),
            ]
        )

    transformers.append(
        (
            "num_discrete_median",
            numeric_discrete_median_transformer,
            numeric_discrete_median,
        )
    )

    # BOOLEAN
    boolean_cols = available_columns["boolean_cols"]
    boolean_transformer = Pipeline(
        steps=[("zero_imputer", ZeroImputer())]  # Replace missing values with 0
    )
    transformers.append(("boolean", boolean_transformer, boolean_cols))

    # CATEGORICAL
    categorical_cols = available_columns["categorical_cols"]

    if model_type == "tree":
        # TREE: label encoding (via OrdinalEncoder)
        categorical_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
                (
                    "encoder",
                    OrdinalEncoder(
                        handle_unknown="use_encoded_value",
                        unknown_value=-1,
                        encoded_missing_value=-1,
                    ),
                ),
            ]
        )
    else:
        # LINEAR / SVM: One-Hot Encoding
        categorical_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
                (
                    "encoder",
                    OneHotEncoder(
                        drop="first", sparse_output=False, handle_unknown="ignore"
                    ),
                ),
            ]
        )

    transformers.append(("categorical", categorical_transformer, categorical_cols))

    # ORDINAL
    ordinal_cols = available_columns["ordinal_cols"]

    if model_type == "tree":
        ordinal_transformer = Pipeline(
            steps=[("imputer", SimpleImputer(strategy="median"))]
        )
    else:  # linear or svm
        ordinal_transformer = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler()),
            ]
        )

    transformers.append(("ordinal", ordinal_transformer, ordinal_cols))

    # ENGINEERED SCORES
    score_cols = available_columns["score_cols"]

    score_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),  # Recommended for all models
        ]
    )

    transformers.append(("scores", score_transformer, score_cols))

    # Columns already normalized, passed through unchanged
    month_cols = available_columns["month_cols"]
    month_transformer = "passthrough"
    transformers.append(("month", month_transformer, month_cols))

    # FINAL PIPELINE
    pipeline = Pipeline(
        steps=[
            (
                "preprocessor",
                ColumnTransformer(
                    transformers=transformers,
                    remainder="drop",  # Drop the columns that are not listed
                    verbose_feature_names_out=False,  # Keep short names
                ),
            )
        ]
    )

    print(f"✅ Pipeline created with {len(transformers)} transformer groups")

    return pipeline
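
The function above returns a preprocessing-only pipeline. One option, sketched here under assumptions, is to chain the preprocessor and the estimator in a single `Pipeline`, so that the preprocessing is refitted on each training fold during cross-validation or grid search. The data is synthetic and the two-group `ColumnTransformer` is a simplified stand-in for `create_pipeline`, not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the games dataset (hypothetical values)
rng = np.random.default_rng(42)
n = 60
X_demo = pd.DataFrame(
    {
        "base_price": rng.uniform(5, 80, n),
        "hours_main_story": np.where(rng.random(n) < 0.2, np.nan, rng.uniform(1, 60, n)),
        "release_season": rng.choice(["winter", "spring", "summer", "fall"], n),
    }
)
y_demo = 3 * X_demo["base_price"] + rng.normal(0, 10, n)

# Simplified "tree"-style preprocessing: median imputation + ordinal codes
preprocessor = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), ["base_price", "hours_main_story"]),
        (
            "cat",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            ["release_season"],
        ),
    ],
    remainder="drop",
)

# Preprocessing and model in one estimator
full_pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]
)
full_pipeline.fit(X_demo, y_demo)
predictions_demo = full_pipeline.predict(X_demo)
print(predictions_demo.shape)
```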

In [585]:
def get_feature_names(pipeline: Pipeline, input_features):
    """
    Return the feature names after transformation.
    """
    try:
        # Modern method (sklearn >= 1.0)
        return pipeline.named_steps['preprocessor'].get_feature_names_out()
    except (AttributeError, ValueError):
        # Fallback: unfitted pipeline, older version, or a transformer
        # that does not implement get_feature_names_out
        return None


In [586]:
def save_pipeline(pipeline, filepath):
    """
    Save the pipeline to disk.
    """
    import joblib
    joblib.dump(pipeline, filepath)
    print(f"✅ Pipeline saved: {filepath}")

In [587]:
def load_pipeline(filepath):
    """
    Load a saved pipeline.
    """
    import joblib
    pipeline = joblib.load(filepath)
    print(f"✅ Pipeline loaded: {filepath}")
    return pipeline
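
A quick round-trip check can confirm that a reloaded pipeline behaves like the original. The sketch below uses a minimal scaler-only pipeline and a temporary directory; the file name is arbitrary.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Minimal fitted pipeline on made-up data
X_demo = np.array([[1.0], [2.0], [3.0]])
pipeline_demo = Pipeline([("scaler", StandardScaler())]).fit(X_demo)

# Save, then reload from a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = Path(tmp_dir) / "pipeline_demo.joblib"
    joblib.dump(pipeline_demo, filepath)
    reloaded = joblib.load(filepath)

# Both objects should produce the same transformed values
same_output = np.allclose(pipeline_demo.transform(X_demo), reloaded.transform(X_demo))
print(same_output)
```

A pipeline that contains custom classes such as `OutlierClipper` can only be reloaded where those classes are importable under the same module path.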

## First prediction tests

In [588]:
df_clean = df_dataset.copy()

### Choosing the target: discount level

In [589]:
TARGET_PROMO_COL = 'days_to_33_percent_discount'

### Removing rows that have no target

In [590]:
# Drop the rows that have no target value
df_clean = df_clean[df_clean[TARGET_PROMO_COL].notna()].copy()

print(f"Before: {len(df_dataset)} rows")
print(f"After: {len(df_clean)} rows")

Before: 5382 rows
After: 3714 rows


### Retrieving the base features

In [591]:
base_X_features = get_all_features_columns(base_features)

# Select the available columns
X = df_clean[base_X_features]

# Assign the target
y = df_clean[TARGET_PROMO_COL]

### Checking the columns

In [592]:
column_summary(X)

Detailed column summary:
                      Column    Type  Non-Null Count  Null Count  Unique Values
                  base_price float64            3714           0             62
       pssstore_stars_rating float64            3714           0            331
     metacritic_critic_score float64            1038        2676             66
            hours_main_story float64            2486        1228             85
            download_size_gb float64            2017        1697            802
          packs_deluxe_count   int64            3714           0              9
                series_count   int64            3714           0             24
         sub_languages_count float64            2821         893             30
       voice_languages_count float64            1894        1820             17
        publisher_game_count   int64            3714           0             51
                  dlcs_count   int64            3714           0             54
      

### Creating the train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTrain : {len(X_train)} | Test : {len(X_test)}")


Train : 2971 | Test : 743


### Creating the preprocessing pipeline for tree-based models

In [None]:
pipeline_tree = create_pipeline(model_type='tree', available_columns=base_features)

# Fit and transform
X_train_processed_tree = pipeline_tree.fit_transform(X_train)
X_test_processed_tree = pipeline_tree.transform(X_test)

print(f"\nTrain processed : {X_train_processed_tree.shape}")
print(f"Test processed  : {X_test_processed_tree.shape}")

Creating pipeline for model: TREE
✅ Pipeline created with 8 transformer groups

Train processed : (2971, 48)
Test processed  : (743, 48)


### Creating the preprocessing pipeline for linear models

In [None]:
pipeline_linear = create_pipeline(model_type='linear', available_columns=base_features)

# Fit and transform
X_train_processed_linear = pipeline_linear.fit_transform(X_train)
X_test_processed_linear = pipeline_linear.transform(X_test)

print(f"\nTrain processed : {X_train_processed_linear.shape}")
print(f"Test processed  : {X_test_processed_linear.shape}")

Creating pipeline for model: LINEAR
✅ Pipeline created with 8 transformer groups

Train processed : (2971, 76)
Test processed  : (743, 76)


### Creating the preprocessing pipeline for SVM models

In [None]:
pipeline_svm = create_pipeline(model_type='svm', available_columns=base_features)

# Fit and transform
X_train_processed_svm = pipeline_svm.fit_transform(X_train)
X_test_processed_svm = pipeline_svm.transform(X_test)

print(f"\nTrain processed : {X_train_processed_svm.shape}")
print(f"Test processed  : {X_test_processed_svm.shape}")

Creating pipeline for model: SVM
✅ Pipeline created with 8 transformer groups

Train processed : (2971, 76)
Test processed  : (743, 76)


### Linear regression

### Random forest

In [None]:
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train_processed_tree, y_train)

predictions = rf.predict(X_test_processed_tree)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"\n📊 RandomForestRegressor RESULTS:")
print(f"   MAE : {mae:.2f} days")
print(f"   R²  : {r2:.4f}")


📊 RandomForestRegressor RESULTS:
   MAE : 95.03 days
   R²  : 0.2353
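
These scores come from a single 80/20 split, so they can move noticeably with the random seed. One way to get a more stable estimate is k-fold cross-validation. The sketch below runs on synthetic data from `make_regression`, not on the games dataset, and only illustrates the call pattern.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the real features
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MAE is exposed as a negative value
neg_mae = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print(f"MAE: {mae_per_fold.mean():.2f} +/- {mae_per_fold.std():.2f}")
```

The spread across folds gives a sense of whether a gap of a few days of MAE between two models is meaningful.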


### Ridge linear

In [None]:
from sklearn.linear_model import Ridge

model_ridge = Ridge(alpha=1.0)
model_ridge.fit(X_train_processed_linear, y_train)

predictions = model_ridge.predict(X_test_processed_linear)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"\n📊 RIDGE RESULTS:")
print(f"   MAE : {mae:.2f} days")
print(f"   R²  : {r2:.4f}")


📊 RIDGE RESULTS:
   MAE : 99.90 days
   R²  : 0.1775


### SVM

In [None]:
from sklearn.svm import SVR

model_svm = SVR(kernel='rbf', C=1.0)
model_svm.fit(X_train_processed_svm, y_train)

predictions = model_svm.predict(X_test_processed_svm)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"\n📊 SVM RESULTS:")
print(f"   MAE : {mae:.2f} days")
print(f"   R²  : {r2:.4f}")

# elif model_type == 'svm':
#     from sklearn.svm import SVR
#     model = SVR(kernel='rbf', C=1.0)
#     print("Default model: SVR")


📊 SVM RESULTS:
   MAE : 102.47 days
   R²  : 0.0277


### Comparing the results

In [None]:
# generate comparison plots
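
A possible starting point for these plots, using the three MAE and R² values reported in the outputs above. The table layout and column names (`mae_days`, `r2`) are illustrative choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Metrics copied from the three runs above
results = pd.DataFrame(
    {
        "model": ["Random Forest", "Ridge", "SVR"],
        "mae_days": [95.03, 99.90, 102.47],
        "r2": [0.2353, 0.1775, 0.0277],
    }
).set_index("model")

# One bar chart per metric, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
results["mae_days"].plot.bar(ax=axes[0], title="MAE (days, lower is better)")
results["r2"].plot.bar(ax=axes[1], title="R² (higher is better)")
fig.tight_layout()

best_model = results["mae_days"].idxmin()
print(best_model)
```

On these first tests the random forest has both the lowest MAE and the highest R², although all three models explain less than a quarter of the variance.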

## Automating the empirical approach

We build a system that runs a library of models over several feature combinations and several prediction targets.
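
As a rough sketch of what such a system could look like, the loop below evaluates every (target, feature set, model) combination and collects the metrics in one table. Everything here is hypothetical: the data is synthetic, and the `targets`, `feature_sets` and `models` dictionaries are small stand-ins for the real registries.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X_demo, y_base = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=42)

# Hypothetical targets and feature subsets (column indices)
targets = {"target_a": y_base, "target_b": 0.5 * y_base + 20.0}
feature_sets = {"first_3": [0, 1, 2], "all_6": list(range(6))}
models = {
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

rows = []
for target_name, y in targets.items():
    for set_name, cols in feature_sets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo[:, cols], y, test_size=0.2, random_state=42
        )
        for model_name, model in models.items():
            # clone() gives a fresh, unfitted copy for each run
            fitted = clone(model).fit(X_tr, y_tr)
            pred = fitted.predict(X_te)
            rows.append(
                {
                    "target": target_name,
                    "features": set_name,
                    "model": model_name,
                    "mae": mean_absolute_error(y_te, pred),
                    "r2": r2_score(y_te, pred),
                }
            )

results = pd.DataFrame(rows).sort_values("mae").reset_index(drop=True)
print(results)
```

Using `clone` keeps the registry entries unfitted, so the same model object can be reused safely across targets and feature sets.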

In [None]:
# First version of the list, generated by Claude Code. It still needs to be verified.

regression_models = {
    # ========================================================================
    # LINEAR MODELS (fast, interpretable)
    # ========================================================================
    "linear": [
        # --- Simple linear regression ---
        {
            "name": "Linear Regression",
            "category": "linear",
            "model": LinearRegression(),
            "description": "Classic linear regression (OLS)",
            "pros": "Simple, fast, interpretable",
            "cons": "Sensitive to outliers and multicollinearity",
            "best_for": "Baseline, simple linear relationships",
        },
        # --- Ridge (L2 regularization) ---
        {
            "name": "Ridge",
            "category": "linear",
            "model": Ridge(alpha=1.0),
            "description": "Regression with L2 regularization",
            "pros": "Handles multicollinearity, stable",
            "cons": "Does not perform feature selection",
            "best_for": "Correlated features, avoiding overfitting",
            "hyperparams": {"alpha": [0.1, 1.0, 10.0, 100.0]},
        },
        # --- Lasso (L1 regularization) ---
        {
            "name": "Lasso",
            "category": "linear",
            "model": Lasso(alpha=1.0, max_iter=10000),
            "description": "Regression with L1 regularization",
            "pros": "Automatic feature selection",
            "cons": "Can be unstable",
            "best_for": "Feature selection, sparse models",
            "hyperparams": {"alpha": [0.01, 0.1, 1.0, 10.0]},
        },
        # --- ElasticNet (L1 + L2) ---
        {
            "name": "ElasticNet",
            "category": "linear",
            "model": ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000),
            "description": "Combination of Ridge and Lasso",
            "pros": "Balance between Ridge and Lasso",
            "cons": "More hyperparameters to tune",
            "best_for": "Trade-off between stability and feature selection",
            "hyperparams": {
                "alpha": [0.1, 1.0, 10.0],
                "l1_ratio": [0.3, 0.5, 0.7, 0.9],
            },
        },
        # --- Bayesian Ridge ---
        {
            "name": "Bayesian Ridge",
            "category": "linear",
            "model": BayesianRidge(),
            "description": "Bayesian Ridge regression",
            "pros": "Adapts alpha automatically, predictive standard deviation (return_std=True)",
            "cons": "Slower than classic Ridge",
            "best_for": "Quantified uncertainty, adaptive alpha",
        },
        # --- Huber Regressor ---
        {
            "name": "Huber",
            "category": "linear",
            "model": HuberRegressor(epsilon=1.35, max_iter=1000),
            "description": "Regression robust to outliers",
            "pros": "Very robust to outliers",
            "cons": "Slower than OLS",
            "best_for": "Data with outliers",
            "hyperparams": {"epsilon": [1.1, 1.35, 1.5, 2.0]},
        },
        # --- Lars ---
        {
            "name": "Lars",
            "category": "linear",
            "model": Lars(),
            "description": "Least Angle Regression",
            "pros": "Computationally efficient",
            "cons": "Sensitive to noise",
            "best_for": "High-dimensional data",
        },
        # --- Lasso Lars ---
        {
            "name": "Lasso Lars",
            "category": "linear",
            "model": LassoLars(alpha=1.0),
            "description": "Lasso fitted with the LARS algorithm",
            "pros": "Faster than Lasso in high dimensions",
            "cons": "Can be unstable",
            "best_for": "Many features",
        },
    ],
    # ========================================================================
    # TREE-BASED MODELS (strong performance, robust)
    # ========================================================================
    "tree": [
        # --- Decision Tree ---
        {
            "name": "Decision Tree",
            "category": "tree",
            "model": DecisionTreeRegressor(
                max_depth=10, min_samples_split=10, min_samples_leaf=5, random_state=42
            ),
            "description": "Single decision tree",
            "pros": "Interpretable, handles non-linearities",
            "cons": "Overfits easily",
            "best_for": "Tree baseline, interpretability",
            "hyperparams": {
                "max_depth": [5, 10, 15, 20, None],
                "min_samples_split": [2, 5, 10, 20],
                "min_samples_leaf": [1, 2, 5, 10],
            },
        },
        # --- Random Forest ---
        {
            "name": "Random Forest",
            "category": "tree",
            "model": RandomForestRegressor(
                n_estimators=200,
                max_depth=15,
                min_samples_split=5,
                min_samples_leaf=2,
                max_features="sqrt",
                random_state=42,
                n_jobs=-1,
            ),
            "description": "Ensemble of decision trees",
            "pros": "Strong performance, robust, parallelizable",
            "cons": "Less interpretable, can be slow",
            "best_for": "General-purpose performance; best model in the first tests above",
            "hyperparams": {
                "n_estimators": [100, 200, 300, 500],
                "max_depth": [10, 15, 20, 25, None],
                "min_samples_split": [2, 5, 10],
                "min_samples_leaf": [1, 2, 4],
                "max_features": ["sqrt", "log2", 0.5],
            },
        },
        # --- Extra Trees ---
        {
            "name": "Extra Trees",
            "category": "tree",
            "model": ExtraTreesRegressor(
                n_estimators=200,
                max_depth=15,
                min_samples_split=5,
                min_samples_leaf=2,
                random_state=42,
                n_jobs=-1,
            ),
            "description": "Random Forest variant with randomized splits",
            "pros": "Faster than RF, reduces variance",
            "cons": "Can have more bias",
            "best_for": "Faster alternative to RF",
            "hyperparams": {
                "n_estimators": [100, 200, 300],
                "max_depth": [10, 15, 20, None],
                "min_samples_split": [2, 5, 10],
            },
        },
        # --- Gradient Boosting ---
        {
            "name": "Gradient Boosting",
            "category":"tree",
            "model": GradientBoostingRegressor(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=5,
                min_samples_split=5,
                min_samples_leaf=2,
                subsample=0.8,
                random_state=42,
            ),
            "description": "Sequential boosting of trees",
            "pros": "Very strong performance, handles interactions well",
            "cons": "Slow to train, risk of overfitting",
            "best_for": "Maximum performance",
            "hyperparams": {
                "n_estimators": [100, 200, 300],
                "learning_rate": [0.01, 0.05, 0.1, 0.2],
                "max_depth": [3, 5, 7],
                "subsample": [0.7, 0.8, 0.9, 1.0],
            },
        },
        # --- Histogram Gradient Boosting (faster) ---
        {
            "name": "Hist Gradient Boosting",
            "category":"tree",
            "model": HistGradientBoostingRegressor(
                max_iter=200,
                learning_rate=0.1,
                max_depth=10,
                min_samples_leaf=20,
                random_state=42,
            ),
            "description": "Optimized Gradient Boosting (histogram bins)",
            "pros": "Much faster than classic GB",
            "cons": "Less accurate on small datasets",
            "best_for": "Large datasets, speed",
            "hyperparams": {
                "max_iter": [100, 200, 300],
                "learning_rate": [0.05, 0.1, 0.2],
                "max_depth": [5, 10, 15, None],
            },
        },
        # --- AdaBoost ---
        {
            "name": "AdaBoost",
            "category":"tree",
            "model": AdaBoostRegressor(
                estimator=DecisionTreeRegressor(max_depth=5),
                n_estimators=100,
                learning_rate=1.0,
                random_state=42,
            ),
            "description": "Adaptive boosting",
            "pros": "Simple, less overfitting than GB",
            "cons": "Sensitive to noise",
            "best_for": "Alternative to Gradient Boosting",
            "hyperparams": {
                "n_estimators": [50, 100, 200],
                "learning_rate": [0.5, 1.0, 1.5],
            },
        },
        # --- Bagging ---
        {
            "name": "Bagging",
            "category":"tree",
            "model": BaggingRegressor(
                estimator=DecisionTreeRegressor(max_depth=10),
                n_estimators=100,
                max_samples=0.8,
                max_features=0.8,
                random_state=42,
                n_jobs=-1,
            ),
            "description": "Bootstrap Aggregating",
            "pros": "Reduces variance, parallelizable",
            "cons": "Usually weaker than RF",
            "best_for": "Baseline ensemble",
        },
        # --- XGBoost ---
        {
            "name": "XGBoost",
            "category":"tree",
            "model": XGBRegressor(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=6,
                min_child_weight=1,
                subsample=0.8,
                colsample_bytree=0.8,
                gamma=0,
                random_state=42,
                n_jobs=-1,
            ),
            "description": "Extreme Gradient Boosting",
            "pros": "State of the art, very strong performance, fast",
            "cons": "Many hyperparameters",
            "best_for": "Maximum performance, competitions",
            "hyperparams": {
                "n_estimators": [100, 200, 300],
                "learning_rate": [0.01, 0.05, 0.1],
                "max_depth": [3, 5, 7, 9],
                "min_child_weight": [1, 3, 5],
                "subsample": [0.7, 0.8, 0.9],
                "colsample_bytree": [0.7, 0.8, 0.9],
            },
        },
        # --- LightGBM ---
        {
            "name": "LightGBM",
            "category":"tree",
            "model": LGBMRegressor(
                n_estimators=200,
                learning_rate=0.1,
                max_depth=10,
                num_leaves=31,
                min_child_samples=20,
                subsample=0.8,
                colsample_bytree=0.8,
                random_state=42,
                n_jobs=-1,
                verbose=-1,
            ),
            "description": "Light Gradient Boosting Machine",
            "pros": "Very fast, handles large data",
            "cons": "Can overfit on small datasets",
            "best_for": "Large datasets, speed",
            "hyperparams": {
                "n_estimators": [100, 200, 300],
                "learning_rate": [0.01, 0.05, 0.1],
                "max_depth": [5, 10, 15, -1],
                "num_leaves": [15, 31, 63],
                "min_child_samples": [10, 20, 30],
            },
        },
        # --- CatBoost ---
        {
            "name": "CatBoost",
            "category":"tree",
            "model": CatBoostRegressor(
                iterations=200,
                learning_rate=0.1,
                depth=6,
                l2_leaf_reg=3,
                random_state=42,
                verbose=False,
            ),
            "description": "Categorical Boosting",
            "pros": "Handles categorical features natively, robust",
            "cons": "Can be slow",
            "best_for": "Many categorical features",
            "hyperparams": {
                "iterations": [100, 200, 300],
                "learning_rate": [0.01, 0.05, 0.1],
                "depth": [4, 6, 8, 10],
                "l2_leaf_reg": [1, 3, 5, 7],
            },
        },
    ],
    # ========================================================================
    # SVM (Support Vector Machines)
    # ========================================================================
    "svm": [
        # --- SVR RBF ---
        {
            "name": "SVR (RBF)",
            "category":"svm",
            "model": SVR(kernel="rbf", C=1.0, epsilon=0.1),
            "description": "SVM with a radial basis function (RBF) kernel",
            "pros": "Handles non-linearities, robust",
            "cons": "Very slow, hard to tune",
            "best_for": "Small datasets with complex patterns",
            "hyperparams": {
                "C": [0.1, 1.0, 10.0, 100.0],
                "epsilon": [0.01, 0.1, 0.2, 0.5],
                "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
            },
        },
        # --- SVR Linear ---
        {
            "name": "SVR (Linear)",
            "category":"svm",
            "model": LinearSVR(epsilon=0.1, C=1.0, max_iter=10000),
            "description": "Linear SVM",
            "pros": "Faster than RBF",
            "cons": "Linear relationships only",
            "best_for": "Linear alternative to Ridge",
            "hyperparams": {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 0.2]},
        },
        # --- SVR Polynomial ---
        {
            "name": "SVR (Poly)",
            "category":"svm",
            "model": SVR(kernel="poly", degree=3, C=1.0, epsilon=0.1),
            "description": "SVM with a polynomial kernel",
            "pros": "Handles polynomial interactions",
            "cons": "Very slow, hard to tune",
            "best_for": "Polynomial interactions",
            "hyperparams": {"degree": [2, 3, 4], "C": [0.1, 1.0, 10.0]},
        },
    ],
}

def get_all_models():
    """Return a flat list of all model configurations."""
    all_models = []
    for models in regression_models.values():
        all_models.extend(models)
    return all_models


def get_models_by_category(category):
    """Return the model configurations of one category (empty list if unknown)."""
    return regression_models.get(category, [])


def get_model_by_name(name):
    """Return the configuration dict of a model by its name, or None if not found."""
    for models in regression_models.values():
        for model_dict in models:
            if model_dict["name"] == name:
                return model_dict
    return None
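
The `hyperparams` grids defined above grow quickly once an exhaustive grid search expands them. As a rough sketch, the number of candidates is the product of the list lengths, and each candidate is fitted once per cross-validation fold. The helper name `grid_size` is hypothetical; `cv=5` matches the fold count passed to `GridSearchCV` in `train_all_models`, and the grid below is copied from the Random Forest entry.

```python
from math import prod


def grid_size(hyperparams, cv=5):
    """Return (number of candidates, number of cross-validation fits) for a grid."""
    n_candidates = prod(len(values) for values in hyperparams.values())
    return n_candidates, n_candidates * cv


# Grid copied from the "Random Forest" entry of regression_models
rf_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [10, 15, 20, 25, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", 0.5],
}

n_candidates, n_fits = grid_size(rf_grid)
print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

Running this kind of count before enabling `use_grid_search=True` gives an idea of which grids may need trimming.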

### Testing all models

In [None]:
def train_all_models(use_grid_search=False):
    """Train every configured model and return the results sorted by MAE."""
    results = []

    for config in get_all_models():
        try:
            # Pick the preprocessed data matching the model category (tree, linear, svm)
            cat_model = config.get('category', 'N/A')

            if cat_model == 'tree':
                X_tr, X_te = X_train_processed_tree, X_test_processed_tree
            elif cat_model == 'linear':
                X_tr, X_te = X_train_processed_linear, X_test_processed_linear
            elif cat_model == 'svm':
                X_tr, X_te = X_train_processed_svm, X_test_processed_svm
            else:
                # Avoid silently reusing the data of the previous model
                raise ValueError(f"Unknown category: {cat_model!r}")

            model = config["model"]
            print(f"Start with {config['name']}")

            # Hyperparameter grid - GridSearch
            if "hyperparams" in config and use_grid_search:
                grid_search = GridSearchCV(
                    model, config["hyperparams"],
                    cv=5, scoring='neg_mean_absolute_error',
                    n_jobs=-1, verbose=0
                )
                grid_search.fit(X_tr, y_train)
                predictions = grid_search.predict(X_te)
                print(grid_search.best_params_)
            else:
                model.fit(X_tr, y_train)
                predictions = model.predict(X_te)

            # Evaluate on the test set
            mae = mean_absolute_error(y_test, predictions)
            r2 = r2_score(y_test, predictions)

            results.append({
                'Model': config['name'],
                'MAE': mae,
                'r2': r2,
                'Category': config.get('category', 'N/A')
            })

            print(f"✅ {config['name']:30} : MAE = {mae:.2f}")

        except Exception as e:
            print(f"❌ {config['name']:30} : {str(e)[:50]}")

    # Display results sorted by MAE (best first)
    df_results = pd.DataFrame(results).sort_values('MAE')
    print("\n" + df_results.to_string(index=False))
    return df_results
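
When the exhaustive search is too slow, scikit-learn's `RandomizedSearchCV` samples a fixed number of candidates from the same kind of parameter lists. The sketch below is only an illustration on synthetic data: the arrays `X` and `y`, the reduced parameter lists, and the values `n_iter=5` and `cv=3` are assumptions chosen to keep it fast, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: the notebook would use its own processed features)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Same structure as a "hyperparams" entry, with fewer values to keep the example short
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,  # only 5 sampled candidates instead of the full grid
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_tr, y_tr)

mae = mean_absolute_error(y_te, search.predict(X_te))
print(search.best_params_)
print(f"MAE = {mae:.2f}")
```

The trade-off is that a random search may miss the best combination of the full grid, in exchange for a cost that is fixed by `n_iter`.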

### Checking the training method on several models

In [None]:
train_all_models()

Start with Linear Regression
✅ Linear Regression              : MAE = 99.79
Start with Ridge
✅ Ridge                          : MAE = 99.90
Start with Lasso
✅ Lasso                          : MAE = 97.70
Start with ElasticNet
✅ ElasticNet                     : MAE = 101.51
Start with Bayesian Ridge
✅ Bayesian Ridge                 : MAE = 98.46
Start with Huber
✅ Huber                          : MAE = 94.79
Start with Lars
✅ Lars                           : MAE = 124.76
Start with Lasso Lars
✅ Lasso Lars                     : MAE = 97.70
Start with Decision Tree
✅ Decision Tree                  : MAE = 109.07
Start with Random Forest
✅ Random Forest                  : MAE = 93.52
Start with Extra Trees
✅ Extra Trees                    : MAE = 93.63
Start with Gradient Boosting
✅ Gradient Boosting              : MAE = 92.77
Start with Hist Gradient Boosting
✅ Hist Gradient Boosting         : MAE = 94.12
Start with AdaBoost
✅ AdaBoost                       : M

Take the best models and run more thorough checks (no overfitting, etc.).
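
One possible way to run such a check is to compare the MAE on the training set, the cross-validated MAE, and the MAE on the test set: a training error far below the other two suggests overfitting. The sketch below uses synthetic data, and the helper name `overfitting_report` is hypothetical; in this notebook the inputs would presumably be the processed train and test sets of the selected model's category.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split


def overfitting_report(model, X_train, y_train, X_test, y_test, cv=5):
    """Compare train, cross-validated and test MAE for one model."""
    # cross_val_score clones the estimator, so the model passed in is not fitted here
    cv_scores = cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_mean_absolute_error"
    )
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    return {
        "train_mae": train_mae,
        "cv_mae": -cv_scores.mean(),  # scores are negated MAE values
        "cv_mae_std": cv_scores.std(),
        "test_mae": test_mae,
        "gap": test_mae - train_mae,  # a large positive gap hints at overfitting
    }


# Synthetic stand-in data (assumption, not the notebook's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

report = overfitting_report(
    RandomForestRegressor(n_estimators=100, random_state=0), X_tr, y_tr, X_te, y_te
)
for key, value in report.items():
    print(f"{key:12} : {value:.2f}")
```

The standard deviation across folds also indicates how stable the ranking is: MAE differences between models that are smaller than this spread are probably not meaningful.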