# Évaluation d'un système de recommandation My Content

Notebook pour entraîner et comparer plusieurs approches de recommandation sur le dataset Kaggle **news-portal-user-interactions-by-globocom**. L'objectif est de montrer clairement chaque étape (du chargement des données jusqu'au choix final du modèle).

> Ce notebook aligne désormais **toutes les approches de recommandation sur la bibliothèque Surprise** (https://surprise.readthedocs.io/) afin de bénéficier d'algorithmes collaboratifs standardisés et faciles à déployer.

In [49]:
# Imports & Config
from __future__ import annotations
import json
import os
import pickle
import sys
from collections import Counter
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple, Union
import optuna

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

# Ensure the project root is importable
PROJECT_ROOT = Path('.').resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from src.models.lightfm_item2item import (
    CONTEXT_COLUMNS,
    LightFMApproximator,
    build_interaction_matrices,
    precompute_item_neighbors,
    score_from_neighbors,
    session_weight_from_size,
)

# Configuration
CONFIG = {
    "clicks_dir": "../data/news-portal-user-interactions-by-globocom/clicks",
    "metadata_path": "../data/news-portal-user-interactions-by-globocom/articles_metadata.csv",
    "embeddings_path": "../data/news-portal-user-interactions-by-globocom/articles_embeddings.pickle",
    "max_click_files": None,
    "artifacts_dir": "../artifacts/evaluation",
    "k": 5,
    "train_ratio": 0.8,
    "recent_window_days": 7,
    "random_seed": 42,
    "svd_components": 64,
    "content_pca_components": None,
    "covisit_top_n_neighbors": 20,
    "covisit_similarity": "cosine",
    "covisit_hybrid_alpha": 0.7350738721058192,
    "svd_hazard_ndcg": 0.02,
    "min_user_interactions": 3,
    "min_item_interactions": 5,
    "svd_use_session_rating": True,
    "lightfm_use_user_features": True,
    "lightfm_components": 48,
    "lightfm_item_neighbors": 200,
    "hybrid_weights": (0.6, 0.4),
}
np.random.seed(CONFIG["random_seed"])
Path(CONFIG["artifacts_dir"]).mkdir(parents=True, exist_ok=True)
print("Config ready", CONFIG)

from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD


Config ready {'clicks_dir': '../data/news-portal-user-interactions-by-globocom/clicks', 'metadata_path': '../data/news-portal-user-interactions-by-globocom/articles_metadata.csv', 'embeddings_path': '../data/news-portal-user-interactions-by-globocom/articles_embeddings.pickle', 'max_click_files': None, 'artifacts_dir': '../artifacts/evaluation', 'k': 5, 'train_ratio': 0.8, 'recent_window_days': 7, 'random_seed': 42, 'svd_components': 64, 'content_pca_components': None, 'covisit_top_n_neighbors': 20, 'covisit_similarity': 'cosine', 'covisit_hybrid_alpha': 0.7350738721058192, 'svd_hazard_ndcg': 0.02, 'min_user_interactions': 3, 'min_item_interactions': 5, 'svd_use_session_rating': True, 'lightfm_use_user_features': True, 'lightfm_components': 48, 'lightfm_item_neighbors': 200, 'hybrid_weights': (0.6, 0.4)}


## Contexte

Nous voulons proposer à chaque lecteur un Top-5 d'articles susceptibles de l'intéresser. Le notebook illustre la démarche de A à Z : préparation des données, construction de différentes familles de modèles puis comparaison à l'aide de métriques de ranking.

## Données

Les fichiers attendus sont situés dans `/data/*`.

In [50]:

# Load data utilities


def detect_timestamp_column(df: pd.DataFrame) -> str:
    """Detect the timestamp-like column name."""
    candidates = ["click_timestamp", "timestamp", "event_time", "ts", "time"]
    for col in df.columns:
        if col in candidates or col.lower() in candidates:
            return col
    raise ValueError("No timestamp-like column found. Expected one of: " + ",".join(candidates))


def detect_article_column(df: pd.DataFrame) -> str:
    """Detect the article/item column name."""
    candidates = ["click_article_id", "clicked_article_id", "article_id", "item_id", "content_id"]
    for col in df.columns:
        if col in candidates:
            return col
    raise ValueError("No article id column found. Expected one of: " + ",".join(candidates))


def infer_unix_unit(values: pd.Series) -> str:
    numeric = pd.to_numeric(values, errors="coerce").dropna()
    if numeric.empty:
        return "s"
    max_abs = numeric.abs().max()
    if max_abs >= 1e14:
        return "ns"
    if max_abs >= 1e11:
        return "ms"
    return "s"


def to_timestamp(series: pd.Series) -> pd.Series:
    if pd.api.types.is_datetime64_any_dtype(series):
        return pd.to_datetime(series)
    if pd.api.types.is_numeric_dtype(series):
        unit = infer_unix_unit(series)
        return pd.to_datetime(series, unit=unit, errors="coerce")

    converted = pd.to_datetime(series, errors="coerce")
    if converted.notna().any():
        return converted

    unit = infer_unix_unit(series)
    return pd.to_datetime(series, unit=unit, errors="coerce")


def list_click_files(path: Union[str, Path]) -> List[Path]:
    path_obj = Path(path)
    if path_obj.is_file():
        return [path_obj]
    if path_obj.is_dir():
        return sorted(path_obj.glob("clicks_hour_*.csv"))
    return []


def ensure_context_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure session_size and context columns exist with safe defaults."""
    df = df.copy()
    if "session_size" not in df.columns:
        df["session_size"] = 1
    for col in CONTEXT_COLUMNS:
        if col not in df.columns:
            df[col] = "unknown"
    return df


def create_synthetic_clicks(path: str, n_users: int = 50, n_items: int = 120, days: int = 30, interactions_per_user: int = 25) -> pd.DataFrame:
    """Create a small synthetic clicks dataset to keep the notebook runnable."""
    rng = np.random.default_rng(CONFIG["random_seed"])
    start = pd.Timestamp("2022-01-01")
    envs = ["web", "app"]
    devices = ["mobile", "desktop"]
    oss = ["ios", "android", "linux"]
    referrers = ["direct", "search", "social"]
    records = []
    for user in range(1, n_users + 1):
        offsets = rng.integers(0, days, size=interactions_per_user)
        timestamps = [start + pd.Timedelta(int(o), unit="D") for o in sorted(offsets.tolist())]
        articles = rng.integers(1, n_items + 1, size=interactions_per_user)
        for ts, art in zip(timestamps, articles):
            records.append({
                "user_id": int(user),
                "article_id": int(art),
                "timestamp": ts,
                "session_size": int(rng.integers(1, 6)),
                "click_environment": rng.choice(envs),
                "click_deviceGroup": rng.choice(devices),
                "click_os": rng.choice(oss),
                "click_country": rng.choice(["fr", "us", "br"]),
                "click_region": rng.choice(["idf", "sp", "ca"]),
                "click_referrer_type": rng.choice(referrers),
            })
    df = pd.DataFrame(records).sort_values("timestamp").reset_index(drop=True)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    print(
        f"Synthetic clicks dataset created at {path} "
        f"(users={n_users}, items={n_items}, interactions={len(df)})"
    )
    return df


def load_clicks(path: str, max_files: Optional[int] = None) -> pd.DataFrame:
    """Load clicks data from the Globo hourly files, with a safety cap."""
    files = list_click_files(path)
    total_files = len(files)
    if not files:
        print(f"Clicks directory not found at {path}. Generating a synthetic sample for demonstration.")
        return ensure_context_columns(create_synthetic_clicks(Path(path) / "clicks_hour_000.csv"))

    if max_files is not None:
        print(f"Limite explicite max_files={max_files}, total détecté={total_files}")
        files = files[:max_files]

    print(f"Chargement de {len(files)} fichiers clicks (total détecté={total_files}, limite={max_files if max_files is not None else 'aucune'})")
    frames = []
    for file in files:
        df = pd.read_csv(file)
        ts_col = detect_timestamp_column(df)
        article_col = detect_article_column(df)
        df[ts_col] = to_timestamp(df[ts_col])
        df = df.rename(columns={ts_col: "timestamp", article_col: "article_id"})
        df = ensure_context_columns(df)
        keep_cols = [col for col in [
            "user_id",
            "article_id",
            "timestamp",
            "session_size",
            *CONTEXT_COLUMNS,
        ] if col in df.columns]
        frames.append(df[keep_cols])

    combined = pd.concat(frames, ignore_index=True)
    combined = combined.sort_values("timestamp").reset_index(drop=True)
    print(f"Clicks agrégés : {len(combined)} lignes, {combined['user_id'].nunique()} utilisateurs uniques, {combined['article_id'].nunique()} articles uniques.")
    return combined


def load_metadata(path: str) -> Optional[pd.DataFrame]:
    """Load article metadata if available."""
    if not os.path.exists(path):
        print(f"Metadata file not found at {path}. Utilisation du pipeline Surprise uniquement si les métadonnées sont absentes.")
        return None
    meta = pd.read_csv(path)
    if "article_id" not in meta.columns:
        print("Metadata missing 'article_id' column. Ignoring metadata.")
        return None
    return meta


clicks = load_clicks(CONFIG["clicks_dir"], max_files=CONFIG["max_click_files"])
metadata = load_metadata(CONFIG["metadata_path"])
print(clicks.head())
print("Metadata loaded:", metadata is not None)


Chargement de 385 fichiers clicks (total détecté=385, limite=aucune)
Clicks agrégés : 2988181 lignes, 322897 utilisateurs uniques, 46033 articles uniques.
  user_id article_id               timestamp session_size click_environment  \
0      59     234853 2017-10-01 03:00:00.026            2                 4   
1      79     159359 2017-10-01 03:00:01.702            2                 4   
2     154      96663 2017-10-01 03:00:04.207            2                 4   
3     111     202436 2017-10-01 03:00:14.140            2                 4   
4      70     119592 2017-10-01 03:00:18.863            3                 4   

  click_deviceGroup click_os click_country click_region click_referrer_type  
0                 3        2             1           21                   1  
1                 3        2             1           13                   1  
2                 3        2             1           25                   7  
3                 3        2             1            9   

## Analyse exploratoire des données

Courte photographie des fichiers sources immédiatement après le chargement :
- nombre de lignes et noms de colonnes des clics
- volumes et intégrité des métadonnées articles
- dimensions et structure du fichier d'`articles_embeddings`.

In [51]:
# EDA rapide sur les données sources
import pickle
from pathlib import Path
from collections.abc import Mapping


def summarize_timestamps(series: pd.Series):
    series = pd.to_datetime(series)
    daily = series.dt.date.value_counts().sort_index().rename_axis("date").reset_index(name="nb_clicks")
    hourly = series.dt.hour.value_counts().sort_index().rename_axis("hour").reset_index(name="nb_clicks")
    return series.min(), series.max(), daily, hourly


def describe_structure(obj, prefix="embeddings", max_depth=4):
    entries = []

    def add_entry(path, value, note=None):
        entry = {"chemin": path, "type": type(value).__name__}
        if hasattr(value, "shape"):
            entry["shape"] = tuple(getattr(value, "shape"))
        elif hasattr(value, "__len__") and not isinstance(value, (str, bytes)):
            entry["len"] = len(value)
        if hasattr(value, "dtype"):
            entry["dtype"] = str(getattr(value, "dtype"))
        if note:
            entry["note"] = note
        if isinstance(value, np.ndarray) and value.dtype.names:
            entry["dtype_fields"] = list(value.dtype.names)
        if isinstance(value, np.ndarray) and value.ndim == 1 and len(value) > 0 and not isinstance(value[0], (np.ndarray, list, tuple, Mapping)):
            entry["exemple"] = repr(value[:3].tolist())
        entries.append(entry)

    def walk(value, path, depth):
        add_entry(path, value)
        if depth >= max_depth:
            return
        if isinstance(value, Mapping):
            for k, v in value.items():
                walk(v, f"{path}.{k}", depth + 1)
        elif isinstance(value, (list, tuple, np.ndarray)) and not isinstance(value, (str, bytes)):
            if len(value) > 0:
                walk(value[0], f"{path}[0]", depth + 1)

    walk(obj, prefix, 0)
    return entries


click_files = list_click_files(CONFIG["clicks_dir"])
print(f"Nombre total de fichiers clicks détectés: {len(click_files)}")
if not click_files:
    print("Aucun fichier clicks trouvé au chemin configuré. Vérifiez le téléchargement des données.")

files_for_eda = click_files[:2]
per_file_stats = []
for file in files_for_eda:
    df_file = pd.read_csv(file)
    ts_col = detect_timestamp_column(df_file)
    article_col = detect_article_column(df_file)
    timestamps = to_timestamp(df_file[ts_col])
    per_file_stats.append(
        {
            "fichier": file.name,
            "nb_lignes": len(df_file),
            "colonnes": ", ".join(df_file.columns),
            "articles_uniques": df_file[article_col].nunique(),
            "horodatage_min": timestamps.min(),
            "horodatage_max": timestamps.max(),
        }
    )
if per_file_stats:
    display(pd.DataFrame(per_file_stats))
else:
    print("Pas assez de fichiers pour réaliser une EDA détaillée par fichier.")

print("=== Clicks (agrégés) ===")
if clicks.empty:
    print("Aucun clic chargé. Vérifier le chemin ou augmenter max_click_files.")
else:
    clicks_summary = {
        "nb_lignes": len(clicks),
        "colonnes": ", ".join(clicks.columns),
        "utilisateurs_uniques": clicks['user_id'].nunique() if 'user_id' in clicks else None,
        "articles_uniques": clicks['article_id'].nunique() if 'article_id' in clicks else None,
    }
    display(pd.DataFrame([clicks_summary]))

    total_articles = None
    if metadata is not None and 'article_id' in metadata:
        total_articles = metadata['article_id'].nunique()
    elif 'article_id' in clicks:
        total_articles = clicks['article_id'].nunique()

    total_clients = clicks['user_id'].nunique() if 'user_id' in clicks else None
    print("Synthèse globale (articles / clients)")
    display(pd.DataFrame([{
        'nombre_total_articles': total_articles,
        'nombre_total_clients': total_clients,
    }]))

    ts_min, ts_max, daily, hourly = summarize_timestamps(clicks['timestamp'])
    display(pd.DataFrame([
        {
            'horodatage_min': ts_min,
            'horodatage_max': ts_max,
            'fenetre_jours': (ts_max - ts_min).days + 1,
        }
    ]))
    print("Répartition par jour (jusqu'à 10 premières valeurs)")
    display(daily.head(10))
    print("Répartition par heure (0-23)")
    display(hourly)

print("=== Métadonnées des articles ===")
if metadata is None:
    print("Aucun fichier metadata chargé.")
else:
    meta_summary = {
        "nb_articles": len(metadata),
        "colonnes": ", ".join(metadata.columns),
        "articles_uniques": metadata['article_id'].nunique() if 'article_id' in metadata else None,
    }
    display(pd.DataFrame([meta_summary]))
    missing = metadata.isna().sum().sort_values(ascending=False)
    display(missing.to_frame('valeurs_manquantes'))
    if 'created_at_ts' in metadata.columns:
        created = to_timestamp(metadata['created_at_ts'])
        display(pd.DataFrame([{'premier_article': created.min(), 'dernier_article': created.max()}]))
    if 'article_id' in metadata.columns:
        overlap = set(clicks['article_id'].unique()) if 'article_id' in clicks.columns else set()
        coverage = len(overlap & set(metadata['article_id'].unique()))
        print(f"Articles présents dans clicks et metadata: {coverage}")


print("=== Embeddings d'articles ===")
embeddings_path = Path(CONFIG['embeddings_path'])
if embeddings_path.exists():
    with embeddings_path.open('rb') as f:
        embeddings_obj = pickle.load(f)
    print(f"Type chargé: {type(embeddings_obj)}")

    def summarize_matrix(mat):
        stats = {
            'shape': getattr(mat, 'shape', None),
            'dtype': getattr(mat, 'dtype', None),
        }

        dim_values = []
        shape = getattr(mat, 'shape', None)
        if shape is not None and len(shape) >= 2:
            dim_values.append(shape[1])
        elif isinstance(mat, (list, tuple, np.ndarray)):
            for row in mat:
                if hasattr(row, '__len__') and not isinstance(row, (str, bytes)):
                    try:
                        dim_values.append(len(row))
                    except TypeError:
                        continue

        if dim_values:
            stats.update({
                'profondeur_min': min(dim_values),
                'profondeur_moyenne': float(np.mean(dim_values)),
                'profondeur_max': max(dim_values),
            })

        if hasattr(mat, 'shape') and len(getattr(mat, 'shape', [])) == 2:
            norms = np.linalg.norm(mat, axis=1)
            stats.update(
                {
                    'nb_vectors': mat.shape[0],
                    'dim': mat.shape[1],
                    'norm_min': norms.min(),
                    'norm_max': norms.max(),
                    'norm_moyenne': norms.mean(),
                }
            )
        return stats

    base_structure = describe_structure(embeddings_obj, max_depth=4)

    if isinstance(embeddings_obj, dict):
        keys = list(embeddings_obj.keys())
        print(f"Clés disponibles: {keys}")
        matrix = embeddings_obj.get('embeddings')
        ids = embeddings_obj.get('articles_ids') or embeddings_obj.get('article_ids')

        structure = base_structure.copy()
        if ids is not None:
            structure.insert(0, {
                'chemin': 'embeddings.article_ids',
                'type': type(ids).__name__,
                'len': len(ids),
                'note': "Identifiants d'articles fournis dans le fichier",
            })
        if structure:
            print("Structure détaillée de l'objet d'embeddings (par chemin de clé):")
            display(pd.DataFrame(structure))

        if matrix is not None:
            stats = summarize_matrix(matrix)
            stats.update(
                {
                    'colonnes': ", ".join(keys),
                    'nb_articles_ids': len(ids) if ids is not None else None,
                    'ids_uniques': len(set(ids)) if ids is not None else None,
                    'couverture_metadata': len(set(ids) & set(metadata['article_id']))
                    if (metadata is not None and ids is not None and 'article_id' in metadata)
                    else None,
                    'couverture_clicks': len(set(ids) & set(clicks['article_id']))
                    if (not clicks.empty and ids is not None and 'article_id' in clicks)
                    else None,
                }
            )
            display(pd.DataFrame([stats]))

            if ids is not None:
                sample_ids = ids[:5] if len(ids) >= 5 else ids
                print("Aperçu des premiers article_id liés aux embeddings:")
                display(pd.DataFrame({'article_id': sample_ids}))

            preview_cols = [f"emb_{i}" for i in range(min(5, matrix.shape[1] if hasattr(matrix, 'shape') else 0))]
            if preview_cols:
                preview = pd.DataFrame(matrix[:5, : len(preview_cols)], columns=preview_cols)
                if ids is not None:
                    preview.insert(0, 'article_id', ids[: len(preview)])
                print("Aperçu des embeddings (quelques colonnes et premières lignes):")
                display(preview)
                print("Colonnes affichées pour l'aperçu des embeddings:")
                print(", ".join(preview.columns))

                if ids is not None and metadata is not None and 'article_id' in metadata:
                    meta_cols = [c for c in ['title', 'category_id', 'created_at_ts', 'publisher'] if c in metadata.columns]
                    meta_sample = (
                        preview[['article_id']]
                        .merge(metadata[['article_id'] + meta_cols], on='article_id', how='left')
                    )
                    if 'created_at_ts' in meta_sample.columns:
                        meta_sample['created_at_ts'] = to_timestamp(meta_sample['created_at_ts'])
                    print("Exemple de liaison embedding -> metadata sur article_id (5 premières lignes):")
                    display(meta_sample.head())
        else:
            print("Aucune matrice d'embeddings explicite trouvée dans l'objet chargé.")
    elif hasattr(embeddings_obj, 'shape'):
        stats = summarize_matrix(embeddings_obj)

        inferred_ids = None
        mapping_note = None
        if metadata is not None and 'article_id' in metadata and hasattr(embeddings_obj, 'shape'):
            if embeddings_obj.shape[0] == len(metadata):
                inferred_ids = metadata['article_id'].reset_index(drop=True)
                mapping_note = (
                    "Aucun article_id explicite fourni ; association supposée alignée sur l'ordre des metadata."
                )
            else:
                mapping_note = (
                    "Aucun article_id dans le fichier d'embeddings et la taille ne correspond pas aux metadata : "
                    f"{embeddings_obj.shape[0]} vecteurs vs {len(metadata)} lignes de metadata."
                )
        else:
            mapping_note = (
                "Aucun identifiant d'article n'est présent dans le fichier d'embeddings (mapping externe requis)."
            )

        structure = base_structure.copy()
        if inferred_ids is not None:
            structure.insert(0, {
                'chemin': 'embeddings.article_id (inféré)',
                'type': type(inferred_ids).__name__,
                'len': len(inferred_ids),
                'note': "Alignement supposé sur metadata.article_id (index identique).",
            })
        if structure:
            print("Structure détaillée de l'objet d'embeddings (par chemin de clé):")
            display(pd.DataFrame(structure))

        if mapping_note:
            print(mapping_note)

        if inferred_ids is not None:
            stats.update(
                {
                    'ids_source': 'metadata.article_id (alignement par index)',
                    'ids_uniques': inferred_ids.nunique(),
                    'couverture_metadata': len(set(inferred_ids) & set(metadata['article_id'])),
                    'couverture_clicks': len(set(inferred_ids) & set(clicks['article_id'])) if not clicks.empty else None,
                }
            )

        display(pd.DataFrame([stats]))
        if len(getattr(embeddings_obj, 'shape', [])) >= 2 and embeddings_obj.shape[1] > 0:
            preview_cols = [f"emb_{i}" for i in range(min(5, embeddings_obj.shape[1]))]
            preview = pd.DataFrame(embeddings_obj[:5, : len(preview_cols)], columns=preview_cols)
            if inferred_ids is not None:
                preview.insert(0, 'article_id', inferred_ids.iloc[: len(preview)].values)
            print("Aperçu direct de la matrice d'embeddings:")
            display(preview)
            print("Colonnes affichées pour l'aperçu des embeddings:")
            print(", ".join(preview.columns))

            if inferred_ids is not None and metadata is not None:
                meta_cols = [c for c in ['title', 'category_id', 'created_at_ts', 'publisher'] if c in metadata.columns]
                meta_sample = preview[['article_id']].merge(
                    metadata[['article_id'] + meta_cols], on='article_id', how='left'
                )
                if 'created_at_ts' in meta_sample.columns:
                    meta_sample['created_at_ts'] = to_timestamp(meta_sample['created_at_ts'])
                print("Exemple de liaison embedding -> metadata sur article_id (inféré):")
                display(meta_sample.head())
        else:
            print("Objet chargé non structuré, utilisez type/len pour investiguer.")
else:
    print(f"Fichier d'embeddings introuvable à {embeddings_path}")




Nombre total de fichiers clicks détectés: 385


Unnamed: 0,fichier,nb_lignes,colonnes,articles_uniques,horodatage_min,horodatage_max
0,clicks_hour_000.csv,1883,"user_id, session_id, session_start, session_size, click_article_id, click_timestamp, click_environment, click_deviceGroup, click_os, click_country, click_region, click_referrer_type",323,2017-10-01 03:00:00.026,2017-10-03 02:35:54.157
1,clicks_hour_001.csv,1415,"user_id, session_id, session_start, session_size, click_article_id, click_timestamp, click_environment, click_deviceGroup, click_os, click_country, click_region, click_referrer_type",289,2017-10-01 03:36:28.615,2017-10-02 02:41:03.190


=== Clicks (agrégés) ===


Unnamed: 0,nb_lignes,colonnes,utilisateurs_uniques,articles_uniques
0,2988181,"user_id, article_id, timestamp, session_size, click_environment, click_deviceGroup, click_os, click_country, click_region, click_referrer_type",322897,46033


Synthèse globale (articles / clients)


Unnamed: 0,nombre_total_articles,nombre_total_clients
0,364047,322897


Unnamed: 0,horodatage_min,horodatage_max,fenetre_jours
0,2017-10-01 03:00:00.026,2017-11-13 20:04:14.886,44


Répartition par jour (jusqu'à 10 premières valeurs)


Unnamed: 0,date,nb_clicks
0,2017-10-01,94056
1,2017-10-02,303177
2,2017-10-03,261159
3,2017-10-04,215415
4,2017-10-05,190003
5,2017-10-06,207646
6,2017-10-07,139323
7,2017-10-08,108110
8,2017-10-09,248208
9,2017-10-10,282391


Répartition par heure (0-23)


Unnamed: 0,hour,nb_clicks
0,0,126579
1,1,120741
2,2,94295
3,3,61811
4,4,32818
5,5,18562
6,6,14519
7,7,16630
8,8,32108
9,9,72840


=== Métadonnées des articles ===


Unnamed: 0,nb_articles,colonnes,articles_uniques
0,364047,"article_id, category_id, created_at_ts, publisher_id, words_count",364047


Unnamed: 0,valeurs_manquantes
article_id,0
category_id,0
created_at_ts,0
publisher_id,0
words_count,0


Unnamed: 0,premier_article,dernier_article
0,2006-09-27 11:14:35,2018-03-13 12:12:30


Articles présents dans clicks et metadata: 46033
=== Embeddings d'articles ===
Type chargé: <class 'numpy.ndarray'>
Structure détaillée de l'objet d'embeddings (par chemin de clé):


Unnamed: 0,chemin,type,len,note,shape,dtype,exemple
0,embeddings.article_id (inféré),Series,364047.0,Alignement supposé sur metadata.article_id (index identique).,,,
1,embeddings,ndarray,,,"(364047, 250)",float32,
2,embeddings[0],ndarray,,,"(250,)",float32,"[-0.16118301451206207, -0.9572331309318542, -0.13794444501399994]"
3,embeddings[0][0],float32,,,(),float32,


Aucun article_id explicite fourni ; association supposée alignée sur l'ordre des metadata.


Unnamed: 0,shape,dtype,profondeur_min,profondeur_moyenne,profondeur_max,nb_vectors,dim,norm_min,norm_max,norm_moyenne,ids_source,ids_uniques,couverture_metadata,couverture_clicks
0,"(364047, 250)",float32,250,250.0,250,364047,250,1.845483,11.18309,7.939456,metadata.article_id (alignement par index),364047,364047,46033


Aperçu direct de la matrice d'embeddings:


Unnamed: 0,article_id,emb_0,emb_1,emb_2,emb_3,emb_4
0,0,-0.161183,-0.957233,-0.137944,0.050855,0.830055
1,1,-0.523216,-0.974058,0.738608,0.155234,0.626294
2,2,-0.619619,-0.97296,-0.20736,-0.128861,0.044748
3,3,-0.740843,-0.975749,0.391698,0.641738,-0.268645
4,4,-0.279052,-0.972315,0.685374,0.113056,0.238315


Colonnes affichées pour l'aperçu des embeddings:
article_id, emb_0, emb_1, emb_2, emb_3, emb_4
Exemple de liaison embedding -> metadata sur article_id (inféré):


Unnamed: 0,article_id,category_id,created_at_ts
0,0,0,2017-12-13 05:53:39
1,1,1,2014-07-14 12:45:36
2,2,1,2014-08-22 00:35:06
3,3,1,2014-08-19 17:11:53
4,4,1,2014-08-03 13:06:11


# Article Embeddings

Ce fichier contient les **embeddings des articles**, c’est-à-dire une **représentation numérique du contenu textuel** permettant de comparer les articles entre eux sur le plan sémantique.

* **Format** : matrice NumPy `(N, 250)` en `float32`
* **1 ligne = 1 article**
* **250 colonnes = dimensions latentes**
* Les valeurs individuelles n’ont pas de signification directe

L’`article_id` n’est **pas stocké explicitement** : il est **déduit de l’ordre des lignes**, qui doit rester aligné avec les métadonnées des articles.

La variable `words_count` indique le **nombre de mots du texte source** et sert uniquement d’indicateur de qualité du contenu.

Les embeddings **ne sont pas normalisés** : la **similarité cosinus** est la mesure recommandée pour comparer les articles.


## Protocole

1. Tri des interactions par horodatage pour respecter la chronologie.
2. Split temporel train/test selon `train_ratio` afin d'éviter toute fuite du futur.
3. Construction d'un profil utilisateur à partir des interactions de train.
4. Définition du *ground truth* : articles cliqués en test pour chaque utilisateur (au moins un).
5. Génération de recommandations Top-5 en excluant les articles déjà vus en train.
6. Calcul des métriques de ranking (Precision@5, Recall@5, MAP@5, NDCG@5, Coverage@5) et estimation de la latence moyenne sur un échantillon de 500 utilisateurs max.

Cette démarche imite un scénario de production : d'abord on respecte le temps, puis on mesure simultanément la qualité des suggestions et le coût de calcul.

## Préparation minimale des interactions

In [52]:

# Filtrage k-core itératif pour limiter la sparsité avant le split train/test

def iterative_k_core_filter(
    df: pd.DataFrame, min_user_interactions: int, min_item_interactions: int
) -> pd.DataFrame:
    filtered = df.copy()
    previous_size = -1
    while previous_size != len(filtered):
        previous_size = len(filtered)
        user_counts = filtered["user_id"].value_counts()
        item_counts = filtered["article_id"].value_counts()
        filtered = filtered[
            filtered["user_id"].isin(user_counts[user_counts >= min_user_interactions].index)
            & filtered["article_id"].isin(item_counts[item_counts >= min_item_interactions].index)
        ]
    return filtered

if clicks.empty:
    print("Dataset clicks vide : saut du filtrage k-core.")
else:
    before = (
        len(clicks),
        clicks["user_id"].nunique(),
        clicks["article_id"].nunique(),
    )
    clicks = iterative_k_core_filter(
        clicks,
        CONFIG["min_user_interactions"],
        CONFIG["min_item_interactions"],
    ).sort_values("timestamp").reset_index(drop=True)
    after = (
        len(clicks),
        clicks["user_id"].nunique(),
        clicks["article_id"].nunique(),
    )
    print(
        "Filtrage k-core terminé: "
        f"interactions {before[0]} -> {after[0]}, "
        f"utilisateurs {before[1]} -> {after[1]}, "
        f"articles {before[2]} -> {after[2]}"
    )


Filtrage k-core terminé: interactions 2988181 -> 2740858, utilisateurs 322897 -> 218894, articles 46033 -> 12419


In [53]:
# Split and utility functions

def temporal_train_test_split(df: pd.DataFrame, train_ratio: float) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split interactions chronologically according to the train_ratio."""
    cutoff = int(len(df) * train_ratio)
    train = df.iloc[:cutoff].copy()
    test = df.iloc[cutoff:].copy()
    return train, test


def build_user_histories(df: pd.DataFrame) -> Dict[int, List[int]]:
    """Create mapping user -> list of articles in chronological order."""
    histories: Dict[int, List[int]] = {}
    for user_id, group in df.groupby("user_id"):
        histories[int(user_id)] = group.sort_values("timestamp")["article_id"].tolist()
    return histories


def get_candidate_items(df: pd.DataFrame) -> List[int]:
    """Return unique article ids."""
    return df["article_id"].unique().tolist()


def make_ground_truth(train: pd.DataFrame, test: pd.DataFrame) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Build user histories and ground truth for evaluation.

    Only test items that were seen in training are kept so models are
    evaluated on recommendable articles.
    """
    train_hist = build_user_histories(train)
    candidate_items = set(train["article_id"].unique())
    test_hist = build_user_histories(test)
    filtered = {
        u: [it for it in items if it in candidate_items]
        for u, items in test_hist.items()
        if u in train_hist and len(items) > 0
    }
    eligible_users = {u: items for u, items in filtered.items() if items}
    return train_hist, eligible_users


train_df, test_df = temporal_train_test_split(clicks, CONFIG["train_ratio"])
train_histories, ground_truth = make_ground_truth(train_df, test_df)
eval_users = sorted(ground_truth.keys())
candidate_items = get_candidate_items(train_df)
print(f"Train size: {len(train_df)}, Test size: {len(test_df)}, Users for eval: {len(eval_users)}")


Train size: 2192686, Test size: 548172, Users for eval: 67633


## Métriques utilisées

* **Precision@5** : part des recommandations top-5 qui sont réellement cliquées (plus c'est haut, plus le Top-5 est précis).
* **Recall@5** : part des clics test retrouvés dans le Top-5 (mesure la couverture de ce que l'utilisateur aime).
* **MAP@5** : moyenne de la précision cumulée à chaque clic retrouvé ; récompense les bonnes positions dans la liste.
* **NDCG@5** : pondère chaque clic par sa position (gain décroissant) et normalise par le meilleur score possible ; idéal pour comparer des classements.
* **Coverage@5** : proportion d'articles différents recommandés sur l'ensemble des utilisateurs (diversité du catalogue).
* **Latence par utilisateur** : temps moyen pour produire le Top-5 (important pour une API temps réel).
* **RMSE** : erreur quadratique moyenne sur les prédictions de note ; résume l'écart global entre les estimations du modèle et les clics réels.
* **MAE** : erreur absolue moyenne ; met en avant l'erreur moyenne sans amplifier les grands écarts.

In [54]:

# Metrics

def precision_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Precision@k for a single user."""
    if not recommended:
        return 0.0
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k


def recall_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Recall@k for a single user."""
    if not relevant:
        return 0.0
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / len(relevant)


def average_precision_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """MAP@k for a single user."""
    if not relevant:
        return 0.0
    score = 0.0
    hits = 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)


def dcg_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Discounted cumulative gain."""
    dcg = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            dcg += 1 / np.log2(i + 1)
    return dcg


def ndcg_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Normalized DCG."""
    ideal_dcg = dcg_at_k(relevant[:k], relevant, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(recommended, relevant, k) / ideal_dcg


def coverage_at_k(all_recommendations: List[List[int]], candidate_items: List[int], k: int) -> float:
    """Coverage of unique recommended items over candidates."""
    rec_items = set()
    for rec in all_recommendations:
        rec_items.update(rec[:k])
    if not candidate_items:
        return 0.0
    return len(rec_items) / len(candidate_items)


## Fonctions utilitaires pour les recommanders

In [55]:

# Fonctions classiques (popularité, similarité, SVD léger) utilisées par les baselines

def build_global_popularity(train: pd.DataFrame) -> List[int]:
    """Retourne les articles triés par nombre de clics."""
    return train.groupby("article_id").size().sort_values(ascending=False).index.tolist()


def build_recent_popularity(train: pd.DataFrame, window_days: int) -> List[int]:
    """Retourne les articles populaires sur la dernière fenêtre glissante."""
    max_time = train["timestamp"].max()
    window_start = max_time - pd.Timedelta(days=window_days)
    recent = train[train["timestamp"] >= window_start]
    if recent.empty:
        return build_global_popularity(train)
    counts = recent.groupby("article_id")["timestamp"].agg(["size", "max"])
    ranked = counts.sort_values(by=["size", "max"], ascending=[False, False])
    return ranked.index.tolist()


def build_covisit_graph(train: pd.DataFrame) -> Dict[int, Dict[int, int]]:
    """Construire un graphe de co-visitation basé sur l'historique utilisateur."""
    graph: Dict[int, Dict[int, int]] = {}
    for _, group in train.groupby("user_id"):
        items = group.sort_values("timestamp")["article_id"].tolist()
        unique_items = list(dict.fromkeys(items))
        for i, item_i in enumerate(unique_items):
            graph.setdefault(item_i, {})
            for item_j in unique_items[i + 1 :]:
                graph[item_i][item_j] = graph[item_i].get(item_j, 0) + 1
                graph.setdefault(item_j, {})
                graph[item_j][item_i] = graph[item_j].get(item_i, 0) + 1
    return graph


def build_content_embeddings(metadata: pd.DataFrame, pca_components: Optional[int] = None):
    """Crée des embeddings TF-IDF à partir des colonnes textuelles (avec PCA optionnel)."""
    text_cols = [
        c
        for c in metadata.columns
        if metadata[c].dtype == object and c not in {"article_id", "clicks"}
    ]
    non_id_cols = [c for c in metadata.columns if c != "article_id"]

    if not text_cols and non_id_cols:
        print("Aucune colonne textuelle : utilisation des colonnes non-ID comme tokens catégoriels.")
        text_cols = non_id_cols

    if not text_cols:
        raise ValueError("Aucune colonne utilisable dans les métadonnées pour construire des embeddings")

    corpus = metadata[text_cols].fillna("")
    corpus = corpus.apply(lambda row: " ".join(f"{col}_{val}" for col, val in row.items()), axis=1)

    vectorizer = TfidfVectorizer(max_features=5000)
    tfidf = vectorizer.fit_transform(corpus)
    if pca_components and pca_components < tfidf.shape[1]:
        svd = TruncatedSVD(n_components=pca_components, random_state=CONFIG["random_seed"])
        reduced = svd.fit_transform(tfidf)
        embeddings = normalize(reduced)
    else:
        embeddings = normalize(tfidf)
    ids = metadata["article_id"].tolist()
    return embeddings, ids


def build_item_similarity(train: pd.DataFrame, metadata: Optional[pd.DataFrame]):
    """Construit une similarité article-article par contenu ou co-visitation."""
    if metadata is not None:
        try:
            embeddings, ids = build_content_embeddings(metadata, CONFIG["content_pca_components"])
            similarity: Dict[int, Dict[int, float]] = {}
            for i, aid in enumerate(ids):
                sims = embeddings @ embeddings[i].T
                sims = np.asarray(sims).flatten()
                top_idx = np.argsort(-sims)[1:51]
                similarity[aid] = {ids[j]: float(sims[j]) for j in top_idx if sims[j] > 0}
            return similarity, "content"
        except Exception as exc:
            print(f"Embeddings de contenu impossibles ({exc}). Bascule sur la co-visitation.")
    graph = build_covisit_graph(train)
    similarity = {item: {nbr: float(cnt) for nbr, cnt in neigh.items()} for item, neigh in graph.items()}
    return similarity, "covisitation"


def recommend_from_similarity(
    user_id: int,
    train_histories: Dict[int, List[int]],
    similarity: Dict[int, Dict[int, float]],
    candidate_items: List[int],
    k: int,
) -> List[int]:
    """Agrège les scores de similarité depuis l'historique utilisateur."""
    seen = set(train_histories.get(user_id, []))
    scores: Dict[int, float] = {}
    for item in seen:
        for neighbor, sim in similarity.get(item, {}).items():
            if neighbor in seen:
                continue
            scores[neighbor] = scores.get(neighbor, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen]
    if len(recs) < k:
        for c in candidate_items:
            if c not in seen and c not in recs:
                recs.append(c)
            if len(recs) >= k:
                break
    return recs[:k]


def build_collaborative_svd(train: pd.DataFrame, n_components: int):
    """Entraîne un SVD implicite léger et retourne une fonction de recommandation."""
    user_codes, user_index = pd.factorize(train["user_id"], sort=True)
    item_codes, item_index = pd.factorize(train["article_id"], sort=True)

    interactions = pd.DataFrame({"user_idx": user_codes, "item_idx": item_codes}).drop_duplicates()
    data = np.ones(len(interactions), dtype=np.float32)
    mat = sparse.coo_matrix((data, (interactions["user_idx"], interactions["item_idx"])), shape=(len(user_index), len(item_index))).tocsr()

    svd = TruncatedSVD(n_components=n_components, random_state=CONFIG["random_seed"])
    user_factors = svd.fit_transform(mat)
    item_factors = svd.components_.T

    user_to_idx = {int(uid): int(idx) for idx, uid in enumerate(user_index.tolist())}
    items = [int(aid) for aid in item_index.tolist()]

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        if user_id not in user_to_idx:
            popularity = build_global_popularity(train)
            return [it for it in popularity if it not in seen][:k]

        u_vec = user_factors[user_to_idx[user_id]]
        scores = item_factors @ u_vec
        ranked_items = [items[i] for i in np.argsort(-scores)]
        return [it for it in ranked_items if it not in seen][:k]

    meta = {"users": len(user_index), "items": len(item_index), "components": n_components}
    return recommend, meta


In [56]:

from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD
from surprise import accuracy


def build_surprise_trainset(interactions: pd.DataFrame, *, use_session_rating: bool = False):
    if use_session_rating:
        weighted = interactions.copy()
        weighted = ensure_context_columns(weighted)
        weighted["rating"] = session_weight_from_size(weighted.get("session_size"))
        aggregated = (
            weighted.groupby(["user_id", "article_id"])
            .agg(rating=("rating", "mean"), last_ts=("timestamp", "max"))
            .reset_index()
        )
    else:
        aggregated = (
            interactions.groupby(["user_id", "article_id"])
            .agg(clicks=("article_id", "size"), last_ts=("timestamp", "max"))
            .reset_index()
        )
        if aggregated.empty:
            raise ValueError("Impossible de construire un trainset Surprise sans interactions")

        min_ts = aggregated["last_ts"].min()
        max_ts = aggregated["last_ts"].max()
        span_seconds = max((max_ts - min_ts).total_seconds(), 1.0)
        recency = (aggregated["last_ts"] - min_ts).dt.total_seconds() / span_seconds
        aggregated["rating"] = np.log1p(aggregated["clicks"]) + 0.5 * recency

    min_rating = float(aggregated["rating"].min())
    max_rating = float(aggregated["rating"].max())
    if max_rating == min_rating:
        max_rating = min_rating + 1.0

    reader = Reader(rating_scale=(min_rating, max_rating))
    return Dataset.load_from_df(
        aggregated[["user_id", "article_id", "rating"]], reader
    ).build_full_trainset()


surprise_trainset = build_surprise_trainset(train_df, use_session_rating=False)
surprise_items = [int(surprise_trainset.to_raw_iid(iid)) for iid in surprise_trainset.all_items()]
popularity_order = build_global_popularity(train_df)
popularity_rank = {int(aid): rank for rank, aid in enumerate(popularity_order)}
# Chaque algorithme utilise un tie-breaker différent pour éviter des tops identiques en cas d'égalité


def wrap_surprise_recommender(algo, label: str, *, tie_breaker=None, trainset=None, items=None):
    current_trainset = trainset or surprise_trainset
    current_items = items or surprise_items
    algo.fit(current_trainset)

    is_normal = isinstance(algo, NormalPredictor)
    is_knn = hasattr(algo, "get_neighbors") and hasattr(algo, "sim")
    neighbor_cache: dict[int, list[int]] = {}
    sim_matrix = getattr(algo, "sim", None)

    fallback_sorted_items = list(current_items)
    if tie_breaker:
        fallback_sorted_items = sorted(
            fallback_sorted_items,
            key=lambda iid: tie_breaker(iid),
            reverse=True,
        )

    if is_knn and sim_matrix is not None:
        max_neighbors = getattr(algo, "k", 40)
        for inner_iid in current_trainset.all_items():
            raw_iid = int(current_trainset.to_raw_iid(inner_iid))
            inner_neighbors = algo.get_neighbors(inner_iid, k=max_neighbors)
            neighbor_cache[raw_iid] = [
                int(current_trainset.to_raw_iid(neighbor))
                for neighbor in inner_neighbors
                if neighbor != inner_iid
            ]

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        raw_uid = int(user_id)
        scored: list[tuple[int, float]] = []

        if is_knn:
            if not seen:
                return [iid for iid in fallback_sorted_items if iid not in seen][:k]

            candidate_scores: Counter[int] = Counter()
            inner_seen: dict[int, int] = {}
            for seen_item in seen:
                try:
                    inner_seen[seen_item] = current_trainset.to_inner_iid(str(seen_item))
                except ValueError:
                    continue

            for seen_item, inner_seen_id in inner_seen.items():
                neighbors = neighbor_cache.get(seen_item, [])
                for neighbor_raw in neighbors:
                    if neighbor_raw in seen:
                        continue
                    try:
                        neighbor_inner = current_trainset.to_inner_iid(str(neighbor_raw))
                    except ValueError:
                        continue
                    sim = float(sim_matrix[inner_seen_id, neighbor_inner])
                    if np.isfinite(sim):
                        candidate_scores[neighbor_raw] += sim

            if candidate_scores:
                scored = list(candidate_scores.items())
            else:
                return [iid for iid in fallback_sorted_items if iid not in seen][:k]

        if not scored:
            if is_normal:
                base_score = float(getattr(algo, "mu", 0.0))
                scored = [(iid, base_score) for iid in current_items if iid not in seen]
            else:
                scored = []
                for iid in current_items:
                    if iid in seen:
                        continue
                    pred = algo.predict(raw_uid, int(iid), verbose=False)
                    scored.append((iid, float(pred.est)))

        if not scored:
            return [it for it in current_items if it not in seen][:k]

        def sort_key(item_score):
            iid, score = item_score
            tie = tie_breaker(iid) if tie_breaker else 0.0
            return (score, tie)

        scored.sort(key=sort_key, reverse=True)
        return [it for it, _ in scored[:k]]

    meta = {"algo": label, "n_items": len(current_items), "estimator": algo, "trainset": current_trainset}
    return recommend, meta

# Configuration commune
K = CONFIG["k"]

# Modèles Surprise prêts à l'emploi
popularity_recommender, pop_meta = wrap_surprise_recommender(
    NormalPredictor(),
    "NormalPredictor (baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

itemknn_recommender, itemknn_meta = wrap_surprise_recommender(
    KNNBasic(
        k=60,
        min_k=2,
        sim_options={"name": "pearson_baseline", "user_based": False, "min_support": 2, "n_jobs": -1},
    ),
    "KNNBasic item-based (pearson baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

svd_recommender, svd_meta = wrap_surprise_recommender(
    SVD(
        n_factors=CONFIG["svd_components"],
        n_epochs=35,
        reg_all=0.06,
        lr_all=0.004,
        random_state=CONFIG["random_seed"],
    ),
    "SVD collaboratif (facteurs latents)",
    tie_breaker=lambda iid: popularity_rank.get(int(iid), len(popularity_rank)),
)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


In [57]:
# Mesures d'erreur pour les algorithmes Surprise

def surprise_error_metrics(estimator, test_df: pd.DataFrame, candidate_pool=None) -> dict[str, float]:
    """Compute RMSE/MAE on the test split for a fitted Surprise estimator.

    Parameters
    ----------
    estimator : surprise.AlgoBase
        Trained Surprise model with a ``predict`` method.
    test_df : pd.DataFrame
        Test interactions containing ``user_id`` and ``article_id``. If
        ``session_size`` is available, it will be converted to a continuous
        rating using ``session_weight_from_size`` (fallback 1.0).
    candidate_pool : Iterable[int], optional
        If provided, restrict the evaluated items to this candidate pool.
    """
    if test_df.empty:
        return {"rmse": float("nan"), "mae": float("nan")}

    candidate_set = set(candidate_pool) if candidate_pool is not None else None
    session_sizes = test_df.get("session_size")
    ratings = session_weight_from_size(session_sizes) if session_sizes is not None else np.ones(len(test_df), dtype=np.float32)

    predictions = []
    for (uid, iid, true_rating) in zip(test_df["user_id"], test_df["article_id"], ratings):
        if candidate_set is not None and iid not in candidate_set:
            continue
        predictions.append(estimator.predict(int(uid), int(iid), r_ui=float(true_rating), verbose=False))

    if not predictions:
        return {"rmse": float("nan"), "mae": float("nan")}

    rmse = accuracy.rmse(predictions, verbose=False)
    mae = accuracy.mae(predictions, verbose=False)
    return {"rmse": float(rmse), "mae": float(mae)}

In [58]:

# Evaluation pipeline

def evaluate_model(
    name: str,
    recommend_func: Callable[[int, set, int], List[int]],
    train_histories: Dict[int, List[int]],
    ground_truth: Dict[int, List[int]],
    candidate_items: List[int],
    k: int,
    latency_sample: int = 500,
    progress_every: int = 500,
) -> Dict[str, float]:
    """Evaluate a recommender with ranking metrics and latency estimation."""
    precisions: List[float] = []
    recalls: List[float] = []
    maps: List[float] = []
    ndcgs: List[float] = []
    hits: List[int] = []
    all_recs: List[List[int]] = []

    users = eval_users
    total_users = len(users)
    start_eval = time.perf_counter()
    for idx, user_id in enumerate(users, start=1):
        seen = set(train_histories.get(user_id, []))
        recs = recommend_func(user_id, seen, k)
        gt = ground_truth[user_id]
        all_recs.append(recs)
        precisions.append(precision_at_k(recs, gt, k))
        recalls.append(recall_at_k(recs, gt, k))
        maps.append(average_precision_at_k(recs, gt, k))
        ndcgs.append(ndcg_at_k(recs, gt, k))
        hits.append(1 if set(recs[:k]) & set(gt) else 0)

        if progress_every and idx % progress_every == 0:
            elapsed = time.perf_counter() - start_eval
            rate = elapsed / idx
            eta = rate * max(total_users - idx, 0)
            print(
                f"[{name}] {idx}/{total_users} users processed "
                f"(elapsed {elapsed:.1f}s, ETA {eta:.1f}s)"
            )

    coverage = coverage_at_k(all_recs, candidate_items, k)
    hitrate = float(np.mean(hits)) if users else 0.0

    sample_users = users[: min(latency_sample, len(users))]
    start = time.perf_counter()
    for user_id in sample_users:
        seen = set(train_histories.get(user_id, []))
        _ = recommend_func(user_id, seen, k)
    latency = (time.perf_counter() - start) / max(1, len(sample_users))
    total_eval_time = time.perf_counter() - start_eval

    return {
        "model": name,
        "users": len(users),
        "precision@k": float(np.mean(precisions)),
        "recall@k": float(np.mean(recalls)),
        "map@k": float(np.mean(maps)),
        "ndcg@k": float(np.mean(ndcgs)),
        "hitrate@k": hitrate,
        "coverage@k": coverage,
        "latency_per_user_s": latency,
        "eval_time_s": total_eval_time,
        "all_recommendations": all_recs,
    }


## Entraînement des systèmes de recommandation

Chaque approche est entraînée séparément pour limiter le temps d'exécution de chaque cellule et mieux contextualiser le rôle de chaque modèle.

### Popularité globale
La recommandation par popularité globale trie les articles par volume d'interactions dans l'ensemble d'entraînement. Elle est rapide à calculer (simple agrégation) et sert de baseline robuste pour comparer les modèles plus avancés.

In [59]:

# Configuration commune
K = CONFIG["k"]

# Modèles Surprise prêts à l'emploi
popularity_recommender, pop_meta = wrap_surprise_recommender(
    NormalPredictor(),
    "NormalPredictor (baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

itemknn_recommender, itemknn_meta = wrap_surprise_recommender(
    KNNBasic(
        k=60,
        min_k=2,
        sim_options={"name": "pearson_baseline", "user_based": False, "min_support": 2, "n_jobs": -1},
    ),
    "KNNBasic item-based (pearson baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

svd_recommender, svd_meta = wrap_surprise_recommender(
    SVD(
        n_factors=CONFIG["svd_components"],
        n_epochs=35,
        reg_all=0.06,
        lr_all=0.004,
        random_state=CONFIG["random_seed"],
    ),
    "SVD collaboratif (facteurs latents)",
    tie_breaker=lambda iid: popularity_rank.get(int(iid), len(popularity_rank)),
)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


### Popularité récente
Cette variante privilégie la fraîcheur en filtrant les interactions sur une fenêtre temporelle avant de trier les articles par fréquence. Utile pour capter les tendances du moment, au prix d'un recalcul plus fréquent de la fenêtre glissante.

In [60]:
# Popularité récente
recent_rank = build_recent_popularity(train_df, CONFIG["recent_window_days"])

def recent_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return [it for it in recent_rank if it not in seen][:k]

### Collaborative (SVD)
Le filtrage collaboratif factorise la matrice utilisateur-item (SVD) pour capturer des préférences latentes. L'entraînement est plus long que les méthodes de popularité ou de similarité de contenu, mais il modélise mieux les affinités implicites entre utilisateurs et articles.

In [61]:
# Filtrage collaboratif (SVD)
collab_recommend, collab_meta = build_collaborative_svd(train_df, CONFIG["svd_components"])

def collaborative_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return collab_recommend(user_id, seen, k)

In [62]:
# Modèles co-visitation désactivés au profit de Surprise

### Contenu (similarité article-article)
Un modèle basé contenu construit une matrice de similarité entre articles à partir des métadonnées. Les recommandations se font en projetant l'historique utilisateur vers les items proches dans cet espace. Ce calcul peut être plus coûteux car il nécessite la vectorisation et le produit croisé des articles.

In [63]:
# Initialiser un conteneur de résultats pour chaque entraînement
results = []
step_results = []

In [64]:
# Recommandation basée contenu (désactivable)
ENABLE_CONTENT_MODEL = True  # Passer à True pour activer le calcul de similarité contenu

if ENABLE_CONTENT_MODEL:
    item_similarity, sim_mode = build_item_similarity(train_df, metadata)

    def content_recommender(user_id: int, seen: set, k: int) -> List[int]:
        return recommend_from_similarity(user_id, train_histories, item_similarity, candidate_items, k)
else:
    sim_mode = "désactivé"
    content_recommender = None


Aucune colonne textuelle : utilisation des colonnes non-ID comme tokens catégoriels.


## Entraînements séparés

Les trois stratégies Surprise sont exécutées dans des cellules distinctes afin de pouvoir lancer, arrêter ou relancer chaque bloc indépendamment. Cela évite d'attendre l'ensemble du pipeline quand un seul entraînement est nécessaire.


### Entraînement 1 : Baseline Surprise (NormalPredictor)

Ce bloc entraîne le modèle de base `NormalPredictor` de Surprise et calcule Precision@K, Recall@K, MAP@K, NDCG@K, couverture, latence moyenne ainsi que RMSE et MAE sur le jeu de test.


In [65]:
popularity_result = evaluate_model(
    "Baseline Surprise - NormalPredictor",
    popularity_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)

pop_meta_errors = surprise_error_metrics(
    pop_meta["estimator"], test_df, candidate_pool=candidate_items
)
popularity_result.update(pop_meta_errors)
results.append(popularity_result)
pd.DataFrame([popularity_result])


[Baseline Surprise - NormalPredictor] 500/67633 users processed (elapsed 2.5s, ETA 341.3s)
[Baseline Surprise - NormalPredictor] 1000/67633 users processed (elapsed 5.0s, ETA 332.5s)
[Baseline Surprise - NormalPredictor] 1500/67633 users processed (elapsed 7.6s, ETA 333.0s)
[Baseline Surprise - NormalPredictor] 2000/67633 users processed (elapsed 10.0s, ETA 328.6s)
[Baseline Surprise - NormalPredictor] 2500/67633 users processed (elapsed 12.5s, ETA 325.8s)
[Baseline Surprise - NormalPredictor] 3000/67633 users processed (elapsed 15.0s, ETA 322.1s)
[Baseline Surprise - NormalPredictor] 3500/67633 users processed (elapsed 17.5s, ETA 319.9s)
[Baseline Surprise - NormalPredictor] 4000/67633 users processed (elapsed 20.0s, ETA 318.5s)
[Baseline Surprise - NormalPredictor] 4500/67633 users processed (elapsed 22.5s, ETA 315.3s)
[Baseline Surprise - NormalPredictor] 5000/67633 users processed (elapsed 24.9s, ETA 312.0s)
[Baseline Surprise - NormalPredictor] 5500/67633 users processed (elapsed 

Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,coverage@k,latency_per_user_s,eval_time_s,all_recommendations,rmse,mae
0,Baseline Surprise - NormalPredictor,67633,0.000195,0.000317,0.000113,0.000226,0.000976,0.003382,0.004825,343.749867,"[[272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 336223], [160974, 272143, 123909, 336223, 162655], [160974, 123909, 336223, 162655, 168623], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 336223], [160974, 272143, 336221, 123909, 336223], [160974, 336223, 162655, 168623, 64329], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 234698, 123909, 336223], [160974, 336221, 234698, 123909, 336223], [160974, 272143, 234698, 162655, 168623], [160974, 336221, 234698, 336223, 64329], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 123909, 162655, 168623], [160974, 272143, 336221, 234698, 123909], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 336221, 123909, 162655, 168623], [336221, 234698, 123909, 336223, 168623], [160974, 336221, 123909, 162655, 168623], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 123909, 336223, 162655, 168623], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 234698, 123909, 162655, 168623], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 234698, 123909, 336223], [160974, 336221, 123909, 336223, 162655], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [272143, 234698, 123909, 336223, 162655], [160974, 272143, 336221, 234698, 123909], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 336223], [160974, 336221, 234698, 123909, 336223], [272143, 234698, 162655, 168623, 158536], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [272143, 336221, 234698, 123909, 162655], [160974, 272143, 234698, 123909, 336223], [272143, 336221, 336223, 162655, 168623], [160974, 272143, 336221, 234698, 123909], [160974, 336221, 234698, 123909, 162655], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 123909, 336223, 162655], [160974, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 123909, 336223], [336221, 234698, 123909, 162655, 168623], [160974, 272143, 336221, 168623, 64329], [160974, 336221, 234698, 123909, 336223], [272143, 336221, 234698, 123909, 162655], [336221, 234698, 162655, 168623, 64329], [160974, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [336221, 168623, 158536, 233717, 235616], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 123909], [160974, 336221, 123909, 336223, 162655], [160974, 272143, 336221, 234698, 123909], [160974, 336221, 234698, 123909, 272660], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 336223], [160974, 272143, 336221, 234698, 123909], [272143, 336221, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], [160974, 272143, 336221, 234698, 336223], [160974, 272143, 234698, 123909, 336223], [160974, 272143, 336221, 234698, 123909], ...]",0.311626,0.251196


### Entraînement 2 : KNN item-based (Surprise)

Ce bloc exécute `KNNBasic` en mode item-based avec une similarité **Pearson baseline**, 60 voisins
(`k=60`, `min_k=2`, `min_support=2`). Cette configuration force le modèle à exploiter des co-cliques
pour sortir des simples effets de popularité, afin d'obtenir des recommandations distinctes du SVD.

Astuce performance : `n_jobs=-1` exploite tous les cœurs CPU pour la matrice de similarité Surprise, ce qui réduit nettement le temps de fit sur de gros catalogues (le modèle reste CPU-only).

In [66]:
if False :
    item2item_result = evaluate_model(
        "Modèle KNNBasic item-based (Pearson baseline)",
        itemknn_recommender,
        train_histories,
        ground_truth,
        candidate_items,
        K,
    )

    itemknn_meta_errors = surprise_error_metrics(
        itemknn_meta["estimator"], test_df, candidate_pool=candidate_items
    )
    item2item_result.update(itemknn_meta_errors)
    results.append(item2item_result)
    pd.DataFrame([item2item_result])


### Entraînement 3 : SVD Surprise

Ce bloc entraîne un SVD implicite (facteurs latents) avec 64 dimensions, davantage d'itérations et une
régularisation renforcée (`n_epochs=35`, `reg_all=0.06`, `lr_all=0.004`). L'objectif est d'obtenir un
profil utilisateur/item plus contrasté que le KNN de voisinage.


In [67]:
svd_result = evaluate_model(
    "Modèle SVD Surprise (facteurs latents)",
    svd_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)

svd_meta_errors = surprise_error_metrics(
    svd_meta["estimator"], test_df, candidate_pool=candidate_items
)
svd_result.update(svd_meta_errors)
results.append(svd_result)
pd.DataFrame([svd_result])


[Modèle SVD Surprise (facteurs latents)] 500/67633 users processed (elapsed 13.3s, ETA 1780.7s)
[Modèle SVD Surprise (facteurs latents)] 1000/67633 users processed (elapsed 26.1s, ETA 1739.7s)
[Modèle SVD Surprise (facteurs latents)] 1500/67633 users processed (elapsed 38.7s, ETA 1707.8s)
[Modèle SVD Surprise (facteurs latents)] 2000/67633 users processed (elapsed 51.5s, ETA 1690.2s)
[Modèle SVD Surprise (facteurs latents)] 2500/67633 users processed (elapsed 64.3s, ETA 1674.9s)
[Modèle SVD Surprise (facteurs latents)] 3000/67633 users processed (elapsed 77.2s, ETA 1662.3s)
[Modèle SVD Surprise (facteurs latents)] 3500/67633 users processed (elapsed 89.9s, ETA 1647.3s)
[Modèle SVD Surprise (facteurs latents)] 4000/67633 users processed (elapsed 102.6s, ETA 1632.3s)
[Modèle SVD Surprise (facteurs latents)] 4500/67633 users processed (elapsed 115.2s, ETA 1616.9s)
[Modèle SVD Surprise (facteurs latents)] 5000/67633 users processed (elapsed 128.0s, ETA 1603.1s)
[Modèle SVD Surprise (facteu

Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,coverage@k,latency_per_user_s,eval_time_s,all_recommendations,rmse,mae
0,Modèle SVD Surprise (facteurs latents),67633,0.000322,0.000548,0.000242,0.000434,0.001597,0.224333,0.025594,1743.695242,"[[235101, 68851, 237071, 237480, 234390], [68851, 190274, 186544, 191967, 363925], [68851, 363925, 237071, 47884, 324286], [237071, 363925, 47908, 68851, 263994], [68851, 330828, 255708, 31810, 237071], [68851, 363925, 237071, 168633, 185608], [68851, 105941, 141416, 237480, 20120], [283506, 68851, 198633, 162572, 331746], [254567, 272794, 84429, 136474, 48673], [283595, 68851, 84457, 237071, 235101], [68851, 113970, 69135, 218331, 47850], [10253, 16893, 15447, 174360, 354451], [237071, 15447, 320545, 300493, 118585], [304040, 248310, 162572, 330894, 235101], [96173, 352414, 194405, 248310, 235101], [321406, 352845, 207543, 172034, 244744], [237071, 363925, 68851, 352266, 324393], [68851, 137220, 105941, 181715, 21229], [237071, 31187, 68851, 48485, 219962], [248310, 363925, 237071, 43032, 355157], [141239, 363925, 128724, 327118, 348123], [68851, 32547, 257672, 363925, 234258], [237071, 68851, 235101, 307061, 270675], [32547, 184163, 237071, 330890, 48796], [237071, 68851, 363925, 132650, 341720], [363925, 128869, 276783, 63760, 237071], [237071, 73431, 68851, 313931, 303866], [68851, 146068, 338164, 137220, 132744], [68851, 157068, 363925, 234250, 219962], [68851, 363925, 237071, 327118, 168633], [363925, 237071, 105941, 141174, 248310], [17167, 237071, 292604, 62548, 360467], [68851, 66371, 168633, 166072, 285269], [95492, 233904, 43032, 199153, 57472], [224950, 162276, 282933, 234390, 32547], [73431, 355161, 312398, 47850, 74254], [363925, 68851, 237071, 47850, 31187], [63760, 362887, 57434, 105941, 145848], [68851, 237071, 157068, 63760, 327546], [47850, 235101, 68851, 15447, 245299], [68851, 353675, 195989, 313761, 265994], [237071, 237746, 31219, 327643, 48485], [237071, 323780, 363925, 68851, 136474], [115286, 68851, 237071, 38823, 353684], [68851, 237071, 363925, 50611, 31187], [68851, 237071, 363925, 10253, 84457], [338164, 47850, 168633, 263994, 68851], [68851, 248310, 235101, 237071, 363925], [237071, 352414, 242946, 298915, 65561], [237071, 248310, 107059, 288548, 237452], [181942, 16867, 58647, 273529, 270675], [136474, 119514, 181942, 73431, 271018], [321406, 73431, 68851, 237071, 352098], [237071, 363925, 43032, 68851, 195595], [68851, 363925, 157068, 237071, 166072], [136474, 237071, 363925, 68851, 19785], [62875, 355161, 166072, 331376, 113958], [68851, 73431, 105941, 237071, 363925], [119514, 298551, 353675, 313082, 32335], [68851, 363925, 107059, 137220, 250225], [321406, 43032, 336704, 261760, 74501], [237071, 68851, 321406, 105941, 10253], [68851, 363925, 330894, 353675, 100940], [363925, 68851, 62548, 312398, 304752], [162276, 355167, 10253, 68851, 331870], [70310, 242605, 185608, 96173, 181453], [73431, 4549, 2421, 36063, 355161], [200023, 10253, 308761, 313358, 57698], [363925, 47850, 68851, 235396, 271265], [323780, 68851, 48485, 235396, 129642], [261760, 43032, 321406, 84429, 158766], [199278, 235101, 185993, 307483, 158766], [149655, 308414, 31784, 162276, 128422], [363925, 111095, 68851, 237071, 43032], [29630, 264585, 283506, 69135, 353675], [68851, 237071, 31187, 313931, 303866], [68851, 363925, 73431, 237071, 158911], [69135, 207543, 42559, 111095, 76539], [363925, 341151, 32444, 68851, 3438], [272962, 218359, 162419, 355157, 32547], [69720, 207543, 250225, 248310, 285839], [237071, 68851, 327535, 200040, 199153], [68851, 237071, 363925, 48485, 283506], [300493, 355215, 68851, 47884, 271705], [348123, 313931, 159004, 8404, 48485], [68851, 363925, 181453, 237071, 283595], [237071, 158512, 168633, 288616, 57472], [304040, 312534, 284584, 118585, 133233], [237071, 68851, 327118, 211723, 363925], [237071, 68851, 47908, 194509, 313931], [363925, 263994, 283506, 199646, 224220], [237071, 2280, 314196, 254638, 59117], [363925, 185608, 74396, 237071, 58647], [186047, 323926, 361963, 308546, 298551], [68851, 363925, 235101, 237071, 276783], [68851, 254567, 234258, 235329, 76539], [68851, 237071, 363925, 321069, 167370], [313931, 304040, 283595, 270547, 74628], [352414, 117931, 83652, 184003, 236140], [224824, 136768, 175001, 68851, 30622], ...]",0.387916,0.332238



### Session-size weighting and LightFM item-to-item

`session_size` is turned into a relevance weight with **1 / log1p(session_size)** to dampen
very long sessions while keeping short, focused sessions influential. The LightFM-style
item-to-item model trains latent item vectors on these weighted interactions and enriches
user representations with aggregated context features (environment, device, OS, country,
region, referrer). Recommendations then come from cosine neighbors in that latent space.


In [68]:

# Session-weighted Surprise SVD and LightFM item-to-item setup
svd_session_trainset = build_surprise_trainset(
    train_df, use_session_rating=CONFIG["svd_use_session_rating"]
)
svd_session_items = [int(svd_session_trainset.to_raw_iid(iid)) for iid in svd_session_trainset.all_items()]
svd_session_recommender, svd_session_meta = wrap_surprise_recommender(
    SVD(n_factors=CONFIG["svd_components"], n_epochs=35, reg_all=0.06, lr_all=0.004, random_state=CONFIG["random_seed"]),
    "Modèle SVD Surprise (session_weighted)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
    trainset=svd_session_trainset,
    items=svd_session_items,
)

lightfm_interactions, lightfm_weights, lightfm_user_features, lightfm_item_ids = build_interaction_matrices(
    train_df,
    CONTEXT_COLUMNS,
    use_user_features=CONFIG["lightfm_use_user_features"],
)
lightfm_model = LightFMApproximator(
    n_components=CONFIG["lightfm_components"],
    epochs=15,
    random_state=CONFIG["random_seed"],
).fit(
    lightfm_interactions,
    sample_weight=lightfm_weights,
    user_features=lightfm_user_features,
)
_, lightfm_item_embeddings = lightfm_model.get_item_representations()
lightfm_neighbors = precompute_item_neighbors(
    lightfm_item_embeddings, lightfm_item_ids, top_n=CONFIG["lightfm_item_neighbors"]
)


def svd_score_candidates(user_id: int, estimator, candidates: list[int], seen: set) -> Dict[int, float]:
    scores: Dict[int, float] = {}
    for iid in candidates:
        if iid in seen:
            continue
        pred = estimator.predict(int(user_id), int(iid), verbose=False)
        scores[int(iid)] = float(pred.est)
    return scores


def minmax_normalize(scores: Dict[int, float]) -> Dict[int, float]:
    if not scores:
        return {}
    values = list(scores.values())
    min_v, max_v = min(values), max(values)
    if max_v == min_v:
        return {i: 0.0 for i in scores}
    return {i: (v - min_v) / (max_v - min_v) for i, v in scores.items()}


def lightfm_item2item_recommender(user_id: int, seen: set, k: int) -> List[int]:
    scores = score_from_neighbors(train_histories.get(user_id, []), lightfm_neighbors, seen)
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen][:k]
    if len(recs) < k:
        for cand in candidate_items:
            if cand not in seen and cand not in recs:
                recs.append(cand)
            if len(recs) >= k:
                break
    return recs[:k]


def hybrid_svd_item2item(user_id: int, seen: set, k: int) -> List[int]:
    svd_scores = svd_score_candidates(user_id, svd_meta["estimator"], candidate_items, seen)
    item_scores = score_from_neighbors(train_histories.get(user_id, []), lightfm_neighbors, seen)
    svd_norm = minmax_normalize(svd_scores)
    item_norm = minmax_normalize(item_scores)
    alpha, beta = CONFIG["hybrid_weights"]
    combined_items = set(list(svd_norm.keys()) + list(item_norm.keys()))
    combined_scores = {
        iid: alpha * svd_norm.get(iid, 0.0) + beta * item_norm.get(iid, 0.0)
        for iid in combined_items
    }
    ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen][:k]
    if len(recs) < k:
        for cand in candidate_items:
            if cand not in seen and cand not in recs:
                recs.append(cand)
            if len(recs) >= k:
                break
    return recs[:k]


In [69]:

# Entraînement 4 : variantes SVD session-weighted, LightFM item2item et hybride
svd_session_result = evaluate_model(
    "SVD session_weighted",
    svd_session_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)
svd_session_errors = surprise_error_metrics(
    svd_session_meta["estimator"], test_df, candidate_pool=candidate_items
)
svd_session_result.update(svd_session_errors)
results.append(svd_session_result)

lightfm_result = evaluate_model(
    "Item2Item LightFM (latent voisins)",
    lightfm_item2item_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)
lightfm_result.update({"rmse": float("nan"), "mae": float("nan")})
results.append(lightfm_result)

hybrid_result = evaluate_model(
    "Hybrid SVD60 + Item2Item40",
    hybrid_svd_item2item,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)
hybrid_result.update({"rmse": float("nan"), "mae": float("nan")})
results.append(hybrid_result)

pd.DataFrame([svd_session_result, lightfm_result, hybrid_result])


[SVD session_weighted] 500/67633 users processed (elapsed 12.9s, ETA 1731.8s)
[SVD session_weighted] 1000/67633 users processed (elapsed 25.7s, ETA 1712.1s)
[SVD session_weighted] 1500/67633 users processed (elapsed 38.6s, ETA 1702.0s)
[SVD session_weighted] 2000/67633 users processed (elapsed 51.2s, ETA 1681.6s)
[SVD session_weighted] 2500/67633 users processed (elapsed 63.9s, ETA 1664.3s)
[SVD session_weighted] 3000/67633 users processed (elapsed 76.5s, ETA 1647.2s)
[SVD session_weighted] 3500/67633 users processed (elapsed 89.1s, ETA 1631.9s)
[SVD session_weighted] 4000/67633 users processed (elapsed 102.0s, ETA 1622.9s)
[SVD session_weighted] 4500/67633 users processed (elapsed 114.6s, ETA 1607.5s)
[SVD session_weighted] 5000/67633 users processed (elapsed 127.2s, ETA 1593.6s)
[SVD session_weighted] 5500/67633 users processed (elapsed 139.8s, ETA 1578.9s)
[SVD session_weighted] 6000/67633 users processed (elapsed 152.4s, ETA 1565.4s)
[SVD session_weighted] 6500/67633 users processe

Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,coverage@k,latency_per_user_s,eval_time_s,all_recommendations,rmse,mae
0,SVD session_weighted,67633,0.000322,0.000548,0.000328,0.000526,0.001597,0.638082,0.025551,1765.151115,"[[297955, 336704, 254983, 354951, 89282], [286086, 202626, 320524, 140729, 220478], [108908, 106886, 114053, 58689, 76408], [277046, 224865, 363345, 9705, 172879], [32088, 19167, 338321, 332585, 19645], [76268, 43032, 207129, 327872, 331538], [31092, 224865, 351883, 217923, 141416], [63307, 87224, 236951, 283505, 236444], [29689, 107043, 84429, 190339, 204461], [84429, 132483, 146063, 199278, 312102], [42552, 43032, 289954, 29581, 283437], [233717, 235230, 293114, 336380, 161506], [39791, 105121, 13851, 320545, 220367], [19903, 167040, 146226, 208253, 19645], [292622, 194405, 256063, 101978, 73287], [303533, 172034, 276783, 68248, 300537], [76267, 30876, 39016, 206319, 332585], [107289, 107156, 283589, 177507, 128550], [164575, 129790, 68764, 236213, 98869], [236444, 25325, 208150, 240233, 158047], [251304, 292180, 141239, 360580, 292291], [271143, 83751, 303533, 32547, 327603], [30876, 254983, 353674, 284420, 104237], [128316, 195100, 59782, 355178, 327452], [206073, 63809, 161656, 312398, 233809], [207129, 108853, 146230, 63760, 284278], [326769, 220296, 313655, 266011, 20281], [63809, 351282, 43032, 313662, 277674], [43032, 337721, 258638, 284168, 69163], [76268, 43032, 19167, 207193, 254983], [285055, 59467, 256163, 275154, 58136], [217923, 78572, 57258, 220367, 363984], [111031, 107151, 274915, 331100, 132427], [43032, 32157, 137439, 177195, 203423], [360862, 327603, 223928, 217862, 250128], [284284, 315099, 234740, 107070, 312398], [284064, 225607, 195303, 7873, 39791], [96252, 236512, 283272, 39657, 33495], [48780, 214690, 74501, 19336, 190714], [341599, 19643, 263695, 137734, 236213], [57563, 234633, 256163, 351995, 224355], [124089, 132427, 304834, 237746, 341661], [48570, 341339, 323780, 264841, 119513], [177507, 355156, 202628, 57887, 353684], [203423, 200521, 84429, 284064, 236213], [74757, 110907, 361987, 166542, 61599], [107290, 206346, 57563, 73445, 331041], [353674, 68248, 174864, 284420, 13851], [89762, 355164, 336848, 100929, 96978], [237452, 159710, 136075, 234633, 192079], [256163, 119259, 43032, 38708, 271850], [286735, 146230, 285841, 192077, 19167], [31935, 73287, 166050, 321406, 166038], [327006, 157237, 43032, 357694, 312398], [162282, 106936, 155871, 266781, 150890], [4359, 182005, 140729, 337721, 217923], [235373, 29689, 190339, 100884, 100777], [289954, 251767, 2422, 190339, 39202], [58344, 270546, 182977, 91363, 129086], [83491, 8266, 203844, 137220, 227555], [43032, 74501, 336704, 71298, 312703], [105121, 299995, 337678, 160959, 70439], [271966, 70027, 200521, 74501, 129909], [363185, 159720, 111681, 107196, 43032], [156972, 338321, 175325, 195915, 307838], [225745, 76268, 237769, 124089, 43032], [14050, 286725, 15079, 158695, 254408], [34745, 43025, 270512, 42560, 200023], [104237, 19645, 68945, 292180, 192059], [242406, 323780, 354945, 158695, 254408], [43032, 218058, 31968, 32714, 351731], [218458, 199278, 307483, 146226, 142335], [160409, 235827, 256163, 195789, 78371], [43032, 68771, 153612, 292590, 264297], [71596, 225065, 110933, 57887, 63644], [43032, 30876, 200521, 332585, 272243], [206073, 38823, 361706, 207543, 286735], [75982, 341599, 217862, 70439, 32578], [33495, 107156, 313189, 156736, 360990], [85396, 202712, 106886, 360465, 58417], [205052, 76216, 207068, 43032, 236149], [47864, 157237, 220296, 304040, 96954], [254983, 353674, 19645, 68248, 43032], [19643, 266778, 57720, 355215, 300493], [202637, 96252, 107290, 48852, 95804], [63737, 202354, 285664, 234252, 270685], [254612, 270927, 107196, 106962, 199940], [285164, 304040, 308496, 59155, 123665], [43032, 73344, 272243, 61498, 192059], [337721, 179690, 163116, 303185, 9968], [166082, 284420, 263994, 43032, 314593], [73287, 277046, 190339, 285974, 327452], [71596, 78376, 328215, 297628, 194218], [323926, 276998, 314441, 255703, 39176], [68248, 19645, 70439, 353674, 87137], [5090, 337678, 272243, 236213, 195300], [337678, 236213, 258638, 361532, 68248], [304040, 313931, 236713, 362389, 338321], [305095, 298950, 71635, 336848, 202628], [63690, 57982, 286725, 164575, 175001], ...]",0.163335,0.135566
1,Item2Item LightFM (latent voisins),67633,0.000772,0.001184,0.0005,0.000955,0.003844,0.412455,0.000926,31.910945,"[[123909, 183176, 199198, 234698, 272143], [336223, 234698, 293050, 161801, 336380], [235230, 293050, 59758, 203288, 161801], [123909, 293050, 315105, 183176, 233478], [123909, 336476, 161801, 234698, 199198], [203288, 336380, 315105, 234698, 199198], [336223, 123909, 31520, 161801, 284547], [161801, 129434, 336380, 199198, 315105], [315105, 183176, 123909, 336476, 224354], [183176, 284985, 272143, 123909, 123290], [184076, 361969, 288271, 360974, 245069], [199198, 293050, 235230, 59057, 284547], [234698, 199198, 183176, 161801, 203288], [123909, 199198, 336223, 336476, 161801], [31520, 123909, 199198, 218330, 161801], [123909, 336380, 315105, 161801, 235854], [123909, 315105, 293050, 161801, 63307], [31520, 161801, 123289, 162655, 285343], [235854, 336223, 272143, 123909, 234698], [203288, 123909, 286128, 289090, 207813], [123909, 199198, 315105, 336380, 235230], [123909, 234698, 199198, 315105, 31520], [123909, 336380, 161801, 203288, 235230], [272143, 123909, 168623, 235854, 87231], [160974, 129434, 234698, 158047, 203288], [234698, 293050, 123909, 233688, 315105], [123909, 199198, 315105, 224354, 336380], [338129, 59704, 272218, 42883, 353406], [123909, 199198, 183176, 234698, 161801], [123909, 315105, 234698, 199198, 203288], [123909, 234698, 199198, 315105, 235230], [123909, 284547, 235854, 234698, 303331], [123909, 168623, 183176, 336380, 315105], [70591, 272218, 124176, 159938, 202308], [315105, 336380, 272143, 31520, 183176], [234698, 129434, 123909, 199198, 124194], [123909, 315105, 234698, 161801, 224354], [234698, 203288, 166581, 162655, 123909], [234698, 123909, 336380, 161801, 199198], [123909, 199198, 161801, 234698, 315105], [315105, 123290, 168623, 63307, 207299], [161191, 123909, 124679, 215955, 95972], [234698, 183176, 336380, 199198, 129434], [234698, 123909, 224354, 336380, 336476], [234698, 315105, 199198, 272143, 161801], [161801, 214800, 31520, 36399, 234698], [123909, 234698, 284547, 293050, 161801], [123909, 199198, 218330, 293050, 315105], [129434, 123909, 234698, 168623, 207299], [234698, 123909, 224354, 336380, 284547], [235230, 31520, 293050, 161801, 123909], [59057, 58193, 315105, 199197, 203288], [123909, 183176, 315105, 123289, 234698], [315105, 123909, 234698, 199198, 235230], [234698, 31520, 123909, 293050, 161801], [336223, 199198, 235230, 315105, 161801], [234698, 123909, 31520, 233688, 315105], [123909, 161801, 218330, 284547, 59758], [123909, 293050, 235230, 183176, 31520], [315105, 354086, 123909, 234698, 161801], [123909, 235854, 31520, 336223, 272143], [123909, 199198, 234698, 284547, 235854], [123909, 203288, 218330, 182513, 234698], [129434, 336476, 235854, 300470, 203288], [161801, 123909, 199198, 31520, 336223], [123909, 161801, 284547, 336380, 272143], [235854, 234698, 354086, 293050, 336223], [123909, 234698, 315105, 161801, 235230], [123909, 315105, 161801, 234698, 199198], [234698, 199198, 123909, 315105, 272143], [123909, 234698, 293050, 161801, 203288], [123909, 199198, 315105, 284547, 129434], [234698, 123909, 284547, 293050, 199198], [161801, 293050, 284547, 354086, 272143], [234698, 129434, 123909, 199198, 233478], [123909, 199198, 234698, 203288, 315105], [235854, 31520, 336380, 284547, 293050], [183176, 123909, 234698, 336380, 315105], [123909, 235854, 336476, 31520, 199198], [129434, 234698, 160974, 123909, 272143], [283238, 272660, 235812, 124194, 58193], [168623, 203288, 225055, 233658, 123909], [315105, 161801, 336380, 63307, 224354], [123909, 293050, 203288, 168623, 162655], [123909, 161801, 235230, 183176, 199198], [42181, 107304, 292899, 265196, 282959], [123909, 315105, 203288, 129434, 199198], [353019, 31520, 156624, 212167, 203199], [123909, 315105, 293050, 234698, 235854], [123909, 199198, 203288, 224354, 161801], [123909, 203288, 284547, 336380, 293050], [234698, 123909, 199198, 293050, 224354], [123909, 336380, 218330, 284547, 289090], [234698, 224354, 199198, 285343, 161801], [123909, 234698, 293050, 161801, 303331], [199198, 123909, 234698, 272143, 289090], [234698, 336223, 336380, 315105, 123909], [199198, 315105, 272143, 234698, 285343], [203288, 166581, 218330, 199198, 123909], [203288, 315105, 236613, 123909, 199198], ...]",,
2,Hybrid SVD60 + Item2Item40,67633,0.006562,0.010795,0.007275,0.011091,0.030089,0.270493,0.034358,2286.091375,"[[183176, 234698, 224354, 284339, 233688], [183176, 233688, 354086, 235854, 234698], [235854, 289090, 354086, 233688, 224354], [183176, 224354, 289090, 315105, 336223], [183176, 234698, 284339, 233688, 354086], [183176, 234698, 224354, 315105, 354086], [289090, 224354, 336223, 336220, 315105], [233688, 183176, 354086, 289090, 235854], [183176, 224354, 315105, 235854, 234698], [183176, 224354, 234698, 235854, 233688], [265992, 288271, 235870, 245069, 233688], [183176, 233688, 224354, 354086, 235854], [183176, 234698, 233688, 224354, 354086], [233688, 354086, 336223, 284339, 224354], [233688, 235854, 224354, 234814, 289090], [183176, 235854, 284339, 315105, 224354], [224354, 233688, 354086, 235854, 315105], [235854, 233688, 32082, 218330, 124679], [235854, 336223, 234698, 284339, 354086], [289090, 224354, 354086, 336223, 336220], [183176, 224354, 354086, 235854, 315105], [354086, 235854, 234698, 224354, 233688], [354086, 235854, 224354, 284339, 161801], [235854, 224354, 284339, 289186, 336223], [348093, 233688, 234698, 183176, 224354], [233688, 234698, 183176, 224354, 354086], [224354, 183176, 354086, 289090, 315105], [195020, 338129, 62484, 2672, 42158], [183176, 234698, 354086, 315105, 233688], [183176, 234698, 224354, 315105, 289090], [183176, 233688, 224354, 289090, 354086], [183176, 235854, 233688, 354086, 234698], [183176, 224354, 284339, 233688, 289090], [70591, 159938, 208150, 128260, 203199], [183176, 235854, 233688, 315105, 224354], [159529, 234698, 235854, 224354, 183176], [224354, 233688, 315105, 234698, 289090], [183176, 124679, 234698, 336245, 224354], [233688, 183176, 234698, 235854, 224354], [183176, 233688, 235854, 234698, 224354], [183176, 315105, 234698, 63307, 119522], [124679, 161191, 348109, 233688, 202372], [183176, 233688, 234698, 289090, 224354], [183176, 233688, 224354, 234698, 348093], [183176, 234698, 233688, 315105, 289090], [233688, 36399, 161801, 234698, 183176], [235854, 354086, 234698, 289090, 315105], [183176, 235854, 224354, 233688, 354086], [183176, 234698, 235854, 289090, 224354], [224354, 233688, 354086, 234698, 289090], [183176, 215993, 233688, 289090, 336220], [183176, 315105, 59057, 224354, 234269], [183176, 233688, 315105, 234698, 235854], [183176, 234698, 233688, 315105, 354086], [183176, 233688, 354086, 234698, 224354], [183176, 235854, 233688, 354086, 336223], [233688, 183176, 354086, 234698, 289090], [233688, 354086, 235854, 289090, 234698], [183176, 233688, 224354, 289090, 234698], [183176, 354086, 235854, 233688, 234698], [183176, 235854, 284339, 224354, 336223], [235854, 183176, 234698, 336220, 224354], [233688, 354086, 234698, 284339, 218330], [235854, 233688, 124679, 224354, 289090], [183176, 354086, 235854, 233688, 234698], [183176, 119522, 354086, 284339, 289090], [183176, 235854, 354086, 234698, 224354], [233688, 235854, 183176, 354086, 234698], [233688, 234698, 289090, 315105, 284339], [233688, 183176, 234698, 354086, 315105], [233688, 234698, 354086, 224354, 315105], [183176, 235854, 354086, 119522, 284339], [233688, 183176, 234698, 354086, 235854], [183176, 354086, 233688, 235854, 289090], [233688, 234698, 289090, 354086, 348093], [233688, 224354, 234698, 354086, 315105], [235854, 354086, 284339, 224354, 234698], [183176, 233688, 354086, 224354, 235854], [235854, 354086, 183176, 224354, 233688], [233688, 234698, 235854, 348093, 183176], [348093, 235812, 224354, 120967, 235854], [233688, 348093, 235854, 336220, 119522], [224354, 235854, 284339, 315105, 354086], [289090, 235854, 234698, 224354, 233688], [183176, 233688, 354086, 224354, 235854], [42181, 68851, 107304, 272390, 363925], [235854, 354086, 315105, 224354, 233688], [111043, 203199, 183176, 286260, 31520], [235854, 233688, 315105, 354086, 234698], [224354, 284339, 233688, 234698, 354086], [235854, 224354, 315105, 119522, 234698], [233688, 224354, 234698, 183176, 235854], [183176, 289090, 233688, 336220, 354086], [224354, 183176, 234698, 284339, 233688], [235854, 234698, 224354, 284339, 161801], [289090, 234698, 233688, 235854, 284339], [234698, 183176, 336223, 224354, 233688], [234698, 233688, 315105, 354086, 284339], [233688, 183176, 289090, 224354, 234698], [183176, 315105, 235854, 224354, 234698], ...]",,


### Modèles Surprise uniquement
Les anciennes sections E* basées sur la co-visitation sont remplacées par des algorithmes Surprise (NormalPredictor, KNNBasic, SVD).

#### Variantes co-visitation retirées
Nous privilégions désormais les algorithmes Surprise pour assurer une cohérence entre expérimentation et déploiement.

In [70]:
# Les variantes de co-visitation sont remplacées par les modèles Surprise ci-dessus.

### Section hybride supprimée
L'hybridation co-visitation + popularité a été remplacée par le modèle SVD Surprise plus flexible.

In [71]:
# Section hybride supprimée : la bibliothèque Surprise couvre les besoins collaboratifs.

In [72]:
# Optuna n'est plus nécessaire pour ce notebook centré sur Surprise.

## Résultats consolidés

Après exécution des trois blocs d'entraînement ci-dessus, les métriques sont agrégées pour comparer les approches. Chaque ligne du tableau récapitule la précision, le rappel, la MAP, le NDCG, la couverture et la latence moyenne par utilisateur, complétés par le RMSE et la MAE.


In [73]:
candidate_items = train_df["article_id"].unique().tolist()

per_user_topk = {
    res["model"]: res.get("all_recommendations", [])
    for res in results
}

def coverage_from_topk(rec_lists, candidates, k):
    pool = set()
    for rec in rec_lists:
        pool.update(rec[:k])
    return len(pool) / len(candidates) if candidates else 0.0

coverage_by_model = {
    label: coverage_from_topk(rec_lists, candidate_items, K)
    for label, rec_lists in per_user_topk.items()
    if rec_lists
}


In [74]:
# Agréger les métriques une fois les entraînements terminés
clean_columns = [
    "model",
    "users",
    "precision@k",
    "recall@k",
    "map@k",
    "ndcg@k",
    "hitrate@k",
    "latency_per_user_s",
]

results_df = pd.DataFrame(results)
results_df["coverage@k"] = results_df["model"].map(coverage_by_model)
results_df = (
    results_df[clean_columns + ["coverage@k"]]
    .drop_duplicates(subset=["model"])
    .sort_values(["ndcg@k", "map@k"], ascending=False)
    .reset_index(drop=True)
)

display(results_df)


Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,latency_per_user_s,coverage@k
0,Hybrid SVD60 + Item2Item40,67633,0.006562,0.010795,0.007275,0.011091,0.030089,0.034358,0.270493
1,Item2Item LightFM (latent voisins),67633,0.000772,0.001184,0.0005,0.000955,0.003844,0.000926,0.412455
2,SVD session_weighted,67633,0.000322,0.000548,0.000328,0.000526,0.001597,0.025551,0.638082
3,Modèle SVD Surprise (facteurs latents),67633,0.000322,0.000548,0.000242,0.000434,0.001597,0.025594,0.224333
4,Baseline Surprise - NormalPredictor,67633,0.000195,0.000317,0.000113,0.000226,0.000976,0.004825,0.003382


In [75]:

svd_native_label = "Modèle SVD Surprise (facteurs latents)"
svd_session_label = "SVD session_weighted"
lightfm_label = "Item2Item LightFM (latent voisins)"
hybrid_label = "Hybrid SVD60 + Item2Item40"

comparison_df = results_df[
    results_df["model"].isin(
        [svd_native_label, svd_session_label, lightfm_label, hybrid_label]
    )
].reset_index(drop=True)
comparison_df


Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,latency_per_user_s,coverage@k
0,Hybrid SVD60 + Item2Item40,67633,0.006562,0.010795,0.007275,0.011091,0.030089,0.034358,0.270493
1,Item2Item LightFM (latent voisins),67633,0.000772,0.001184,0.0005,0.000955,0.003844,0.000926,0.412455
2,SVD session_weighted,67633,0.000322,0.000548,0.000328,0.000526,0.001597,0.025551,0.638082
3,Modèle SVD Surprise (facteurs latents),67633,0.000322,0.000548,0.000242,0.000434,0.001597,0.025594,0.224333


In [76]:
# Comparaison rapide des top-5 pour un utilisateur
sample_user = eval_users[0] if eval_users else None
if sample_user is None:
    print("Pas d'utilisateur pour comparer")
else:
    seen = set(train_histories.get(sample_user, []))
    print(f"Utilisateur de test: {sample_user}")
    print("KNNBasic item-based:", itemknn_recommender(sample_user, seen, 5))
    print("SVD collaboratif:", svd_recommender(sample_user, seen, 5))


Utilisateur de test: 5
KNNBasic item-based: [272143, 336221, 234698, 123909, 336223]
SVD collaboratif: [235101, 68851, 237071, 237480, 234390]


In [77]:
results_steps = (
    results_df
    .sort_values(["ndcg@k", "precision@k"], ascending=False)
    .reset_index(drop=True)
)
print("Tableau comparatif des modèles Surprise (trié sur ndcg@k puis precision@k) :")
results_steps


Tableau comparatif des modèles Surprise (trié sur ndcg@k puis precision@k) :


Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,latency_per_user_s,coverage@k
0,Hybrid SVD60 + Item2Item40,67633,0.006562,0.010795,0.007275,0.011091,0.030089,0.034358,0.270493
1,Item2Item LightFM (latent voisins),67633,0.000772,0.001184,0.0005,0.000955,0.003844,0.000926,0.412455
2,SVD session_weighted,67633,0.000322,0.000548,0.000328,0.000526,0.001597,0.025551,0.638082
3,Modèle SVD Surprise (facteurs latents),67633,0.000322,0.000548,0.000242,0.000434,0.001597,0.025594,0.224333
4,Baseline Surprise - NormalPredictor,67633,0.000195,0.000317,0.000113,0.000226,0.000976,0.004825,0.003382


In [78]:
# Métriques détaillées : hitrate, lifts vs baseline et cohortes d'historique
train_click_count = train_df.groupby("user_id").size().to_dict()

def assign_cohort(clicks: int) -> str:
    if 1 <= clicks <= 2:
        return "1-2 clicks"
    if 3 <= clicks <= 9:
        return "3-9 clicks"
    return "10+ clicks"

user_cohort = {user_id: assign_cohort(train_click_count.get(user_id, 0)) for user_id in eval_users}
coverage_lookup = {res["model"]: res.get("coverage@k", np.nan) for res in results}
recommendations_by_model = {res["model"]: res.get("all_recommendations", []) for res in results}
baseline_label = svd_native_label if svd_native_label in results_df['model'].values else results_df['model'].iloc[0]

def safe_lift(value: float, baseline: float) -> float:
    if baseline is None or baseline == 0:
        return np.nan
    return value / baseline

cohort_rows = []
for model_name, recs in recommendations_by_model.items():
    buckets = {
        "ALL": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "1-2 clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "3-9 clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "10+ clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
    }

    for user_id, recs_user in zip(eval_users, recs):
        gt = ground_truth[user_id]
        metrics = {
            "precision": precision_at_k(recs_user, gt, K),
            "recall": recall_at_k(recs_user, gt, K),
            "ndcg": ndcg_at_k(recs_user, gt, K),
            "hit": 1 if set(recs_user[:K]) & set(gt) else 0,
        }
        labels = ["ALL", user_cohort[user_id]]
        for label in labels:
            bucket = buckets[label]
            bucket["precisions"].append(metrics["precision"])
            bucket["recalls"].append(metrics["recall"])
            bucket["ndcgs"].append(metrics["ndcg"])
            bucket["hits"] += metrics["hit"]
            bucket["users"] += 1

    for cohort, bucket in buckets.items():
        users = bucket["users"]
        cohort_rows.append(
            {
                "model": model_name,
                "cohort": cohort,
                "users": users,
                "precision@k": float(np.mean(bucket["precisions"])) if users else 0.0,
                "recall@k": float(np.mean(bucket["recalls"])) if users else 0.0,
                "ndcg@k": float(np.mean(bucket["ndcgs"])) if users else 0.0,
                "hitrate@k": bucket["hits"] / users if users else 0.0,
                "coverage@k": coverage_lookup.get(model_name, np.nan),
            }
        )

cohort_df = pd.DataFrame(cohort_rows)
baseline_rows = cohort_df[cohort_df["model"] == baseline_label].set_index("cohort")
for metric in ["precision@k", "recall@k", "ndcg@k"]:
    cohort_df[f"lift_{metric}_vs_baseline"] = cohort_df.apply(
        lambda row: safe_lift(
            row[metric],
            float(baseline_rows.loc[row["cohort"], metric])
            if row["cohort"] in baseline_rows.index
            else np.nan,
        ),
        axis=1,
    )

cohort_df = cohort_df.sort_values(["cohort", "ndcg@k", "precision@k"], ascending=[True, False, False]).reset_index(drop=True)
cohort_df


Unnamed: 0,model,cohort,users,precision@k,recall@k,ndcg@k,hitrate@k,coverage@k,lift_precision@k_vs_baseline,lift_recall@k_vs_baseline,lift_ndcg@k_vs_baseline
0,Hybrid SVD60 + Item2Item40,1-2 clicks,9426,0.007999,0.016629,0.014257,0.03607,0.270493,26.928571,23.394397,28.89326
1,Item2Item LightFM (latent voisins),1-2 clicks,9426,0.000658,0.001392,0.000996,0.003183,0.412455,2.214286,1.957711,2.018153
2,Modèle SVD Surprise (facteurs latents),1-2 clicks,9426,0.000297,0.000711,0.000493,0.001485,0.224333,1.0,1.0,1.0
3,Baseline Surprise - NormalPredictor,1-2 clicks,9426,0.000149,0.00046,0.000229,0.000743,0.003382,0.5,0.646766,0.464874
4,SVD session_weighted,1-2 clicks,9426,0.000106,0.000274,0.000224,0.00053,0.638082,0.357143,0.385572,0.454057
5,Hybrid SVD60 + Item2Item40,10+ clicks,33973,0.005852,0.007638,0.009292,0.026992,0.270493,12.909091,11.148591,16.558413
6,Item2Item LightFM (latent voisins),10+ clicks,33973,0.000842,0.001017,0.000936,0.004209,0.412455,1.857143,1.484446,1.668312
7,SVD session_weighted,10+ clicks,33973,0.000424,0.000659,0.000624,0.00209,0.638082,0.935065,0.961742,1.111496
8,Modèle SVD Surprise (facteurs latents),10+ clicks,33973,0.000453,0.000685,0.000561,0.002237,0.224333,1.0,1.0,1.0
9,Baseline Surprise - NormalPredictor,10+ clicks,33973,0.000259,0.000304,0.000268,0.001295,0.003382,0.571429,0.443907,0.476937


## Analyse & choix du modèle MVP

Le classement met en lumière des compromis :
- **Pertinence** : la popularité globale obtient le meilleur NDCG@5/MAP@5, signe que trier par volume reste difficile à battre sur ce petit jeu synthétique.
- **Diversité** : l'item2item couvre trois fois plus d'articles, ce qui réduit le risque d'effet tunnel.
- **Latence** : toutes les approches sont très rapides (millisecondes), la popularité restant la plus simple.

Le choix MVP bascule vers la popularité globale uniquement si l'on cherche la pertinence maximale et un déploiement express. Pour un produit, il serait pertinent de tester une hybridation : démarrer par la popularité pour les nouveaux utilisateurs puis basculer vers l'item2item dès que l'historique se construit afin d'augmenter la couverture sans sacrifier la qualité.

In [79]:

best_row = results_df.iloc[0]
justification = f"""
## Choix du modèle MVP

Modèle retenu : **{best_row['model']}**

Motifs principaux :
- NDCG@5 = {best_row['ndcg@k']:.4f}, MAP@5 = {best_row['map@k']:.4f}, Precision@5 = {best_row['precision@k']:.4f}, Recall@5 = {best_row['recall@k']:.4f}
- Couverture = {best_row['coverage@k']:.4f} sur {len(candidate_items)} articles candidats.
- Latence moyenne par utilisateur = {best_row['latency_per_user_s']:.6f} s (CPU).
- Complexité : implémentation {'optimisée via Surprise (SVD/KNN)' if 'SVD' in best_row['model'] else 'basée sur Surprise'} compatible avec Azure Functions.
- Gestion du cold-start utilisateur via popularité globale.

Note : ajuster `content_pca_components` pour réduire la taille des embeddings en production si nécessaire.
"""
choice_path = Path(CONFIG["artifacts_dir"]) / "model_choice.md"
choice_path.write_text(justification)
print(justification)



## Choix du modèle MVP

Modèle retenu : **Hybrid SVD60 + Item2Item40**

Motifs principaux :
- NDCG@5 = 0.0111, MAP@5 = 0.0073, Precision@5 = 0.0066, Recall@5 = 0.0108
- Couverture = 0.2705 sur 10052 articles candidats.
- Latence moyenne par utilisateur = 0.034358 s (CPU).
- Complexité : implémentation optimisée via Surprise (SVD/KNN) compatible avec Azure Functions.
- Gestion du cold-start utilisateur via popularité globale.

Note : ajuster `content_pca_components` pour réduire la taille des embeddings en production si nécessaire.



In [80]:

results_path_csv = Path(CONFIG["artifacts_dir"]) / "results.csv"
results_path_json = Path(CONFIG["artifacts_dir"]) / "results.json"
results_df.to_csv(results_path_csv, index=False)
results_df.to_json(results_path_json, orient="records", lines=True)
print(f"Résultats sauvegardés dans {results_path_csv} et {results_path_json}")


Résultats sauvegardés dans ../artifacts/evaluation/results.csv et ../artifacts/evaluation/results.json


### Déploiement (application et Azure Functions)

Le modèle **SVD Surprise** est exporté pour l'application Flask et la Function Azure. Les
hyperparamètres reflètent la configuration du notebook (facteurs latents, lr_all, reg_all), tandis que
le modèle KNN reste disponible pour comparaison locale.


## Conclusion

Ce notebook montre comment comparer des stratégies de recommandation avec une procédure reproductible : split temporel, entraînement, évaluation multi-métriques et sauvegarde des résultats. Les essais révèlent que la popularité globale reste une valeur sûre pour débuter, mais que des modèles plus personnalisés (item2item ou SVD) apportent de la diversité dès que l'on dispose d'historique. Les prochaines étapes naturelles sont d'exécuter les tests sur les vraies données Kaggle, d'ajouter des métriques business (taux de clic simulé, couverture par catégorie) et de prototyper une hybridation popularité + item2item dans une Azure Function pour valider le comportement en production.