# Machine learning models: classification

## Notebook objectives

In this notebook, we test several machine learning models for predicting discount thresholds:

**Predicting the delay class between the release date and the first 33% discount**

We compare the performance of several models:
- **Random forest**
- **XGBoost**

## Data used

We use the cleaned PS5 games dataset with engineered features: **featured_games_dataset_final.csv**.

- modeling with the base data
- modeling with the data after feature engineering

## Note on the libraries used

This notebook uses scikit-learn for the machine learning part.

## Library imports

In [None]:


# Python standard library
import sys
from pathlib import Path
import os

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import mean_absolute_error, r2_score

import warnings
warnings.filterwarnings('ignore')

## Helper functions

In [190]:
def column_summary(df: pd.DataFrame):
    summary = []
    for col in df.columns:
        col_type = df[col].dtype
        non_null = df[col].notna().sum()
        null_count = df[col].isna().sum()
        
        # Handle columns that contain lists (unhashable values)
        try:
            unique_count = df[col].nunique()
        except TypeError:
            # On error (lists), temporarily convert the values to strings
            unique_count = df[col].astype(str).nunique()
            print(f"⚠️ Column '{col}' contains unhashable values (probably lists)")

        summary.append({
            'Column': col,
            'Type': str(col_type),
            'Non-Null Count': non_null,
            'Null Count': null_count,
            'Unique Values': unique_count,
        })

    # Print the column summary
    print("=" * 80)
    print("Detailed column summary:")
    print("=" * 80)
    column_summary_df = pd.DataFrame(summary)
    print(column_summary_df.to_string(index=False))
    print("\n")

In [None]:
class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clip outliers to the [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] range learned in fit."""
    
    def __init__(self):
        self.lower_bounds_ = None
        self.upper_bounds_ = None
    
    def fit(self, X, y=None):
        if isinstance(X, pd.DataFrame):
            X_values = X.values
        else:
            X_values = X
        
        Q1 = np.percentile(X_values, 25, axis=0)
        Q3 = np.percentile(X_values, 75, axis=0)
        IQR = Q3 - Q1
        
        self.lower_bounds_ = Q1 - 1.5 * IQR
        self.upper_bounds_ = Q3 + 1.5 * IQR
        
        return self
    
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            X_values = X.values
            X_clipped = np.clip(X_values, self.lower_bounds_, self.upper_bounds_)
            return pd.DataFrame(X_clipped, columns=X.columns, index=X.index)
        else:
            return np.clip(X, self.lower_bounds_, self.upper_bounds_)
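
The IQR rule used by `OutlierClipper` can be checked by hand on a small array. The sketch below is a standalone illustration with made-up values (it does not use the class itself): the bounds are computed from the 25th and 75th percentiles, and anything outside them is pulled back to the nearest bound.

```python
import numpy as np

# Hypothetical values with one obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Same rule as in OutlierClipper.fit: bounds at 1.5 * IQR around the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Same rule as in OutlierClipper.transform: clip to the learned bounds
clipped = np.clip(values, lower, upper)

print(f"bounds: [{lower}, {upper}]")
print(clipped)
```

Because the bounds are learned in `fit` and reused in `transform`, test data is clipped with the training-set quartiles rather than its own.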

In [None]:
class ZeroImputer(BaseEstimator, TransformerMixin):
    """Impute missing values with 0."""
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            return X.fillna(0)
        else:
            # NumPy array: work on a copy so the input is not modified
            X_copy = X.copy()
            X_copy[np.isnan(X_copy)] = 0
            return X_copy

## Loading the data

In [191]:
path = os.path.join(Path.cwd().parent, "data/processed")

In [192]:
# Load the CSV data to inspect it
df_dataset = pd.read_csv(os.path.join(path, "featured_games_dataset_final.csv"))

In [193]:
column_summary(df_dataset)

Detailed column summary:
                              Column    Type  Non-Null Count  Null Count  Unique Values
                            id_store  object            5382           0           5382
               pssstore_stars_rating float64            5382           0            352
         pssstore_stars_rating_count   int64            5382           0           1583
             metacritic_critic_score float64            1268        4114             67
                              is_ps4   int64            5382           0              2
                            is_indie   int64            5382           0              2
                              is_dlc   int64            5382           0              1
                               is_vr   int64            5382           0              2
                     is_opti_ps5_pro   int64            5382           0              2
                         is_remaster   int64            5382           0              

In [194]:
df_dataset.head()

Unnamed: 0,id_store,pssstore_stars_rating,pssstore_stars_rating_count,metacritic_critic_score,is_ps4,is_indie,is_dlc,is_vr,is_opti_ps5_pro,is_remaster,...,content_category,exclusif_playstation_content,visibility_score,visibility_category,pegi_unified,price_category,game_age_years,month_sin,month_cos,release_season
0,EP8311-PPSA19174_00-0421646910657705,1.57,14,,0,0,0,0,0,0,...,minimal,0,7.0,obscure,7,0 - 7.99,2,-0.8660254,0.5,fall
1,EP2005-PPSA06055_00-SINUCA0000000000,3.26,72,,0,0,0,0,0,0,...,minimal,0,43.0,moderate,3,0 - 7.99,3,0.5,-0.866025,spring
2,EP8311-PPSA16513_00-0233078860249892,1.55,11,,0,0,0,0,0,0,...,minimal,0,7.0,obscure,3,0 - 7.99,2,0.5,-0.866025,spring
3,EP8311-PPSA13840_00-0277389480637871,1.44,18,,0,0,0,0,0,0,...,minimal,0,8.0,obscure,3,0 - 7.99,2,0.5,0.866025,winter
4,EP8311-PPSA12662_00-0212989199890961,1.17,23,,0,0,0,0,0,0,...,minimal,0,8.0,obscure,3,0 - 7.99,3,-2.449294e-16,1.0,winter


## Overview of available features

In [195]:
# 48 features in total

# X: basic features without advanced processing (25)

# base_price                    - numeric (float)
# pssstore_stars_rating_count   - numeric (int)
# pssstore_stars_rating         - numeric (float)
# is_indie                      - boolean
# has_microtransactions         - boolean
# dlc_count                     - numeric (int)
# packs_deluxe_count            - numeric (int)
# serie_count                   - numeric (int)
# trophies_count                - numeric (int)
# is_vr                         - boolean
# has_local_multiplayer         - boolean
# has_online_multiplayer        - boolean
# is_online_only                - boolean
# is_opti_ps5_pro               - boolean
# is_ps_exclusive               - boolean
# difficulty                    - numeric (int)
# is_remaster                   - boolean
# is_ps4                        - boolean
# pegi_unified                  - category (numeric)
# metacritic_critic_score       - numeric (float)
# hours_main_story              - numeric (float)
# voices_languages_count        - numeric (int)
# sub_languages_count           - numeric (int)
# download_size_gb              - numeric (float)
# price_category                - category



# X: advanced engineered features (21)

# popularity_score              - numeric (float)
# popularity_category           - category
# visibility_score              - numeric (float)
# visibility_category           - category
# exclusif_playstation_content  - boolean
# publisher_game_count          - numeric (int)
# publisher_game_count_cat      - category
# publisher_category            - category
# genre_action_aventure         - boolean
# genre_roles                   - boolean
# genre_sports                  - boolean
# genre_reflexion               - boolean
# genre_rapide                  - boolean
# localization_category         - category
# download_size_category        - category
# content_score                 - numeric (float)
# content_category              - category
# game_age_years                - numeric (int)
# month_sin                     - numeric (float)
# month_cos                     - numeric (float)
# release_season                - category


# X: bonus observations up to 60 days after release (2)

# has_5pct_discount_at_30d      - boolean
# has_10pct_discount_at_60d     - boolean

# Y
#
# Regression: number of days
# 
# days_to_10_percent_discount - numeric (int)
# days_to_25_percent_discount - numeric (int)
# days_to_33_percent_discount - numeric (int)
# days_to_50_percent_discount - numeric (int)
# days_to_75_percent_discount - numeric (int)

# Classification: delay bracket before the discount
# 
# days_to_10_percent_discount_category - category
# days_to_25_percent_discount_category - category
# days_to_33_percent_discount_category - category
# days_to_50_percent_discount_category - category
# days_to_75_percent_discount_category - category
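
The `month_sin` and `month_cos` features look like a cyclical encoding of the release month. The sketch below assumes the usual formula, sin(2πm/12) and cos(2πm/12) with m from 1 to 12. This formula is an assumption: the feature engineering code is not part of this notebook, although the values it produces match those visible in `df_dataset.head()` (for example -0.866 and 0.5 for a fall release). The `release_month` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical release months (1 = January ... 12 = December)
months = pd.Series([1, 5, 10, 12], name="release_month")

# Map each month to a point on the unit circle
angle = 2 * np.pi * months / 12
encoded = pd.DataFrame({
    "release_month": months,
    "month_sin": np.sin(angle),
    "month_cos": np.cos(angle),
})

print(encoded.round(3))
```

With this encoding December and January end up next to each other on the circle, which a plain 1 to 12 integer would not capture. Both columns already lie in [-1, 1].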

## Preprocessing

1. Continuous numeric → imputation + scaling (if needed)
2. Discrete numeric → imputation + binning or square root (optional)
3. Booleans → 0/1 format
4. Ordinal → keep the natural order
5. Nominal → label encoding (tree models)
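
As an illustration of step 5, here is a minimal sketch of label encoding with scikit-learn's `OrdinalEncoder`, configured so that categories unseen during `fit` are mapped to -1 instead of raising an error. The season values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training categories and new data containing an unseen one
train = np.array([["fall"], ["spring"], ["winter"]], dtype=object)
new = np.array([["spring"], ["summer"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Known categories get their (alphabetical) index, unknown ones get -1
codes = encoder.transform(new)
print(codes.ravel())
```

The integer codes carry an arbitrary order. Tree models can split around it, but linear models and SVMs would read it as a real ranking, so one-hot encoding is usually preferred for them.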

### Features by type

In [None]:
numeric_continuous = [
    "base_price",
    "pssstore_stars_rating",    # Post-release observation
    "metacritic_critic_score",  # Post-release observation
    "hours_main_story",
    "download_size_gb",
]

numeric_discrete = [
    "pssstore_stars_rating_count",  # Post-release observation
    "trophies_count",
    "packs_deluxe_count",
    "serie_count",
    "sub_languages_count",
    "voices_languages_count",
    "publisher_game_count",
    "dlc_count",
    "game_age_years"
]

boolean_cols = [
    "is_indie",
    "is_vr",
    "is_ps_exclusive",
    "is_remaster",
    "is_ps4",
    "has_local_multiplayer",
    "has_online_multiplayer",
    "is_online_only",
    "has_microtransactions",
    "is_opti_ps5_pro",
    "is_ps_exclusive",  # TODO: duplicate of the entry above
    "exclusif_playstation_content",
    "has_5pct_discount_at_30d",   # Post-release observation
    "has_10pct_discount_at_60d",  # Post-release observation
    "genre_action_aventure",
    "genre_roles",
    "genre_sports",
    "genre_reflexion",
    "genre_rapide",
]

categorical_cols = [
    "price_category",
    "publisher_category",
    "visibility_category",
    "popularity_category",   # Post-release observation
    "content_category",
    "download_size_category",
    "localization_category",
    "publisher_game_count_cat",
    "release_season"
]

ordinal_cols = ["difficulty", "pegi_unified"]

score_cols = [
    "popularity_score",       
    "visibility_score", 
    "content_score"]

month_cols = ["month_sin", "month_cos"]

### Pending tests

In [197]:
# pssstore_stars_rating_count (0 to 1.8M) → log transform to compress the scale
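
One possible way to run this pending test is to wrap `np.log1p` in a `FunctionTransformer`, so that it could later be added as a pipeline step. This is only a sketch on made-up counts: `log1p` is chosen here because it maps 0 to 0, and the transformer is not wired into `create_pipeline`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) is defined at 0; expm1 is its exact inverse
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

# Hypothetical rating counts spanning the range mentioned above
counts = np.array([[0.0], [14.0], [72.0], [1_800_000.0]])

logged = log_transformer.fit_transform(counts)
restored = log_transformer.inverse_transform(logged)

print(logged.ravel().round(2))
```

After the transform the largest value is around 14 instead of 1.8 million, which matters mostly for the linear and SVM variants. Tree models are insensitive to monotonic rescaling of a feature.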

### scikit-learn pipeline

In [None]:
def create_pipeline(model_type='tree', available_columns=None):
    """
    Build the complete sklearn preprocessing Pipeline.

    model_type: 'tree', 'linear' or 'svm'.
    available_columns: optional list of column names used to filter the feature lists.

    Returns an unfitted sklearn Pipeline to use with .fit() and .transform():

    X_train_processed = pipeline.fit_transform(X_train)
    X_test_processed = pipeline.transform(X_test)
    """
    if model_type not in ('tree', 'linear', 'svm'):
        raise ValueError(f"Unknown model_type: {model_type!r}")

    print(f"Creating the pipeline for model type: {model_type.upper()}")
    
    # Keep only the columns that are actually available
    def filter_cols(cols):
        if available_columns is None:
            return cols
        return [c for c in cols if c in available_columns]
    
    # List collecting all the transformers
    transformers = []
    
    # CONTINUOUS NUMERIC
    filtered_numeric_continuous = filter_cols(numeric_continuous)
    if filtered_numeric_continuous:
        if model_type == 'tree':
            # TREE: imputation only, no scaling
            numeric_continuous_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median'))
            ])
        
        elif model_type == 'linear':
            # LINEAR: imputation + outlier clipping + standardization
            numeric_continuous_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('outlier_clipper', OutlierClipper()),
                ('scaler', StandardScaler())
            ])
        
        elif model_type == 'svm':
            # SVM: imputation + outlier clipping + normalization to [0, 1]
            numeric_continuous_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('outlier_clipper', OutlierClipper()),
                ('scaler', MinMaxScaler())
            ])
        
        transformers.append(('num_continuous', numeric_continuous_transformer, filtered_numeric_continuous))
    
    
    # DISCRETE NUMERIC
    
    # Count features (imputed with 0) are handled separately from the others (median imputation)
    zero_fill_cols = ['trophies_count', 'packs_deluxe_count', 'dlc_count', 'serie_count']
    
    filtered_numeric_discrete = filter_cols(numeric_discrete)
    
    numeric_discrete_zero = [col for col in filtered_numeric_discrete if col in zero_fill_cols]
    numeric_discrete_median = [col for col in filtered_numeric_discrete if col not in zero_fill_cols]
    
    # Counts, imputed with 0
    if numeric_discrete_zero:
        if model_type == 'tree':
            numeric_discrete_zero_transformer = Pipeline(steps=[
                ('zero_imputer', ZeroImputer())
            ])
        elif model_type == 'linear':
            numeric_discrete_zero_transformer = Pipeline(steps=[
                ('zero_imputer', ZeroImputer()),
                ('scaler', StandardScaler())
            ])
        elif model_type == 'svm':
            numeric_discrete_zero_transformer = Pipeline(steps=[
                ('zero_imputer', ZeroImputer()),
                ('scaler', MinMaxScaler())
            ])
        
        transformers.append(('num_discrete_zero', numeric_discrete_zero_transformer, numeric_discrete_zero))
    
    # Other discrete numeric features, imputed with the median
    if numeric_discrete_median:
        if model_type == 'tree':
            numeric_discrete_median_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median'))
            ])
        elif model_type == 'linear':
            numeric_discrete_median_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())
            ])
        elif model_type == 'svm':
            numeric_discrete_median_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', MinMaxScaler())
            ])
        
        transformers.append(('num_discrete_median', numeric_discrete_median_transformer, numeric_discrete_median))
    
    # BOOLEANS
    boolean_transformer = Pipeline(steps=[
        ('zero_imputer', ZeroImputer())  # Replace missing values with 0
    ])
    transformers.append(('boolean', boolean_transformer, filter_cols(boolean_cols)))
    
    # CATEGORICAL
    if model_type == 'tree':
        # TREE: label encoding (via OrdinalEncoder)
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
            ('encoder', OrdinalEncoder(
                handle_unknown='use_encoded_value',
                unknown_value=-1,
                encoded_missing_value=-1
            ))
        ])
    else:
        # LINEAR / SVM: one-hot encoding
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
            ('encoder', OneHotEncoder(
                drop='first',
                sparse_output=False,
                handle_unknown='ignore'
            ))
        ])
    
    transformers.append(('categorical', categorical_transformer, filter_cols(categorical_cols)))
    
    # ORDINAL

    if model_type == 'tree':
        ordinal_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median'))
        ])
    else:  # linear or svm
        ordinal_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])
    
    transformers.append(('ordinal', ordinal_transformer, filter_cols(ordinal_cols)))
    
    # ENGINEERED SCORES
    score_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())  # Applied for every model type (not needed for trees, but harmless)
    ])
    
    transformers.append(('scores', score_transformer, filter_cols(score_cols)))
       
    # Columns that are already normalized: pass them through unchanged
    month_transformer = 'passthrough'
    transformers.append(('month', month_transformer, filter_cols(month_cols)))
    
    # BUILD THE COLUMN TRANSFORMER
    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder='drop',  # Drop the columns that are not listed
        verbose_feature_names_out=False  # Keep short feature names
    )
    
    # FINAL PIPELINE
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor)
    ])
    
    print(f"✅ Pipeline created with {len(transformers)} transformer groups")
    
    return pipeline

In [None]:
def get_feature_names(pipeline: Pipeline, input_features):
    """
    Return the feature names after transformation, or None if they are unavailable.
    """
    try:
        # Modern API (sklearn >= 1.0)
        return pipeline.named_steps['preprocessor'].get_feature_names_out()
    except Exception:
        # Fallback for older versions or transformers without feature names
        return None


In [None]:
def save_pipeline(pipeline, filepath):
    """
    Save the pipeline to disk.
    """
    import joblib
    joblib.dump(pipeline, filepath)
    print(f"✅ Pipeline saved: {filepath}")

In [None]:
def load_pipeline(filepath):
    """
    Load a saved pipeline.
    """
    import joblib
    pipeline = joblib.load(filepath)
    print(f"✅ Pipeline loaded: {filepath}")
    return pipeline

## Tests 

### Explicitly drop the target columns

In [None]:
# Prepare the data for the tests

col_to_delete = [
    "id_store",
    "days_to_10_percent_discount",
    "days_to_25_percent_discount",
    # "days_to_33_percent_discount",  # kept: used as the target below
    "days_to_50_percent_discount",
    "days_to_75_percent_discount",
    "days_to_10_percent_discount_category",
    "days_to_25_percent_discount_category",
    "days_to_33_percent_discount_category",
    "days_to_50_percent_discount_category",
    "days_to_75_percent_discount_category",
]

df_clean = df_dataset.drop(columns=col_to_delete)

In [None]:
# Drop the rows that have no target value
df_clean = df_clean[df_clean['days_to_33_percent_discount'].notna()].copy()

In [None]:
# Select the available columns
all_feature_cols = (numeric_continuous + numeric_discrete + boolean_cols + 
                    categorical_cols + ordinal_cols +
                    score_cols + month_cols)

available_cols = [col for col in all_feature_cols if col in df_clean.columns]

# Keep every column except the target: the pipeline selects features by name
# and drops the rest (remainder='drop')
X = df_clean.drop(columns=['days_to_33_percent_discount'])

In [None]:
y = df_clean['days_to_33_percent_discount']
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTrain : {len(X_train)} | Test : {len(X_test)}")


Train : 2971 | Test : 743


In [None]:
pipeline_tree = create_pipeline(model_type='tree', available_columns=available_cols)

# Fit on the training set, then transform both sets
X_train_processed = pipeline_tree.fit_transform(X_train)
X_test_processed = pipeline_tree.transform(X_test)

print(f"\nTrain processed : {X_train_processed.shape}")
print(f"Test processed  : {X_test_processed.shape}")

In [None]:
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train_processed, y_train)

predictions = rf.predict(X_test_processed)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"\n📊 RESULTS:")
print(f"   MAE: {mae:.2f} days")
print(f"   R²:  {r2:.4f}")


Train processed : (2971, 46)
Test processed  : (743, 46)

📊 RESULTS:
   MAE: 94.51 days
   R²:  0.2531
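
To put an MAE value in context, it helps to compare it with a naive baseline that always predicts the training median. The sketch below shows the idea on synthetic data only (the arrays `X_demo` and `y_demo` are made up and unrelated to the games dataset), using scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the games dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = 200 + 80 * X_demo[:, 0] + rng.normal(scale=40, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the median of the training target
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)
forest = RandomForestRegressor(
    n_estimators=100, max_depth=10, random_state=42
).fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_forest = mean_absolute_error(y_te, forest.predict(X_te))

print(f"Baseline MAE: {mae_baseline:.2f}")
print(f"Forest MAE:   {mae_forest:.2f}")
```

Applied to the real split, the same comparison would show how much of the 94.51-day error the random forest actually removes relative to a constant prediction.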


In [None]:
# elif model_type == 'linear':
#     from sklearn.linear_model import Ridge
#     model = Ridge(alpha=1.0)
#     print("Default model: Ridge")

# elif model_type == 'svm':
#     from sklearn.svm import SVR
#     model = SVR(kernel='rbf', C=1.0)
#     print("Default model: SVR")


EXAMPLE 3: COMPARE TREE vs LINEAR
Creating the pipeline for model type: TREE
✅ Pipeline created with 9 transformer groups
Default model: RandomForestRegressor
✅ Full pipeline created (preprocessing + RandomForestRegressor)

Tree   - MAE: 94.51 days
Creating the pipeline for model type: LINEAR
✅ Pipeline created with 9 transformer groups
Default model: Ridge
✅ Full pipeline created (preprocessing + Ridge)
Linear - MAE: 99.33 days


In [None]:
# # 6.1 Grid search for hyperparameters
# print("\n=== GRID SEARCH FOR HYPERPARAMETERS ===")
# # Base model
# model = RandomForestClassifier(random_state=42)
# # Hyperparameters to test
# param_grid = {
#     'n_estimators': [50, 100, 200],
#     'max_depth': [None, 10, 20, 30],
#     'min_samples_split': [2, 5],
#     'min_samples_leaf': [1, 2, 4]
# }

# # Cross-validation setup
# cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # 5-fold cross-validation
# # Search grid
# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, 
#                            scoring='accuracy', cv=cv_strategy, n_jobs=-1, verbose=2)

# # Train the model with GridSearch
# grid_search.fit(preprocessor.fit_transform(X_train), y_train)

# # Best hyperparameters
# print("\nBest hyperparameters found:")
# print(grid_search.best_params_)

# # Best score
# print(f"Best validation score: {grid_search.best_score_:.2%}")
# # Model refitted with the best hyperparameters
# best_model = grid_search.best_estimator_   


In [None]:
# # 7. CROSS-VALIDATION
# print("\n=== CROSS-VALIDATION ===")

# # Stratified cross-validation setup
# cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# # Cross-validation scores
# cv_scores = cross_val_score(model_pipeline, X_train, y_train, cv=cv, scoring='accuracy')

# print(f"CV scores: {cv_scores}")
# print(f"Mean CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
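
The commented cells above outline stratified cross-validation for the classification target. A minimal runnable sketch of the same idea follows, on synthetic data (`X_demo`, `y_demo` and `demo_pipeline` are hypothetical names, and a plain median imputer stands in for the full preprocessing). Putting the preprocessing and the classifier in one `Pipeline` means each fold fits the imputer on its own training part only, which avoids leaking validation data into the preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic 3-class problem standing in for the delay classes (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Inject about 5% missing values so the imputer has something to do
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Preprocessing and model in a single estimator
demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Stratified folds keep the class proportions in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"CV scores: {cv_scores.round(3)}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

`StratifiedKFold` only applies to discrete targets such as `days_to_33_percent_discount_category`. For the regression target in days, a plain `KFold` would be the counterpart.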