# Evaluation of a My Content recommendation system

Notebook for training and comparing several recommendation approaches on the Kaggle dataset **news-portal-user-interactions-by-globocom**. The goal is to show each step clearly, from data loading to the final model choice.

> This notebook now aligns **all recommendation approaches on the Surprise library** (https://surprise.readthedocs.io/) to benefit from standardized collaborative algorithms that are easy to deploy.

In [None]:
# Imports & Config
from __future__ import annotations
import json
import os
import pickle
import sys
from collections import Counter
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple, Union
import optuna

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

# Ensure the project root is importable
PROJECT_ROOT = Path('.').resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from src.models.lightfm_item2item import (
    CONTEXT_COLUMNS,
    LightFMApproximator,
    build_interaction_matrices,
    precompute_item_neighbors,
    score_from_neighbors,
    session_weight_from_size,
)

# Configuration
CONFIG = {
    "clicks_dir": "../data/news-portal-user-interactions-by-globocom/clicks",
    "metadata_path": "../data/news-portal-user-interactions-by-globocom/articles_metadata.csv",
    "embeddings_path": "../data/news-portal-user-interactions-by-globocom/articles_embeddings.pickle",
    "max_click_files": None,
    "artifacts_dir": "../artifacts/evaluation",
    "k": 5,
    "train_ratio": 0.8,
    "recent_window_days": 7,
    "random_seed": 42,
    "svd_components": 64,
    "content_pca_components": None,
    "covisit_top_n_neighbors": 20,
    "covisit_similarity": "cosine",
    "covisit_hybrid_alpha": 0.7350738721058192,
    "svd_hazard_ndcg": 0.02,
    "min_user_interactions": 3,
    "min_item_interactions": 5,
    "svd_use_session_rating": True,
    "lightfm_use_user_features": True,
    "lightfm_components": 48,
    "lightfm_item_neighbors": 200,
    "hybrid_weights": (0.6, 0.4),
}
np.random.seed(CONFIG["random_seed"])
Path(CONFIG["artifacts_dir"]).mkdir(parents=True, exist_ok=True)
print("Config ready", CONFIG)

from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD



## Context

We want to offer each reader a Top-5 of articles likely to interest them. The notebook illustrates the approach end to end: data preparation, construction of several model families, then comparison using ranking metrics.

## Data

The expected files are located under `../data/news-portal-user-interactions-by-globocom/` (see the paths in `CONFIG`).

In [None]:

# Load data utilities


def detect_timestamp_column(df: pd.DataFrame) -> str:
    """Detect the timestamp-like column name."""
    candidates = ["click_timestamp", "timestamp", "event_time", "ts", "time"]
    for col in df.columns:
        if col in candidates or col.lower() in candidates:
            return col
    raise ValueError("No timestamp-like column found. Expected one of: " + ",".join(candidates))


def detect_article_column(df: pd.DataFrame) -> str:
    """Detect the article/item column name."""
    candidates = ["click_article_id", "clicked_article_id", "article_id", "item_id", "content_id"]
    for col in df.columns:
        if col in candidates:
            return col
    raise ValueError("No article id column found. Expected one of: " + ",".join(candidates))


def infer_unix_unit(values: pd.Series) -> str:
    numeric = pd.to_numeric(values, errors="coerce").dropna()
    if numeric.empty:
        return "s"
    max_abs = numeric.abs().max()
    if max_abs >= 1e14:
        return "ns"
    if max_abs >= 1e11:
        return "ms"
    return "s"


def to_timestamp(series: pd.Series) -> pd.Series:
    if pd.api.types.is_datetime64_any_dtype(series):
        return pd.to_datetime(series)
    if pd.api.types.is_numeric_dtype(series):
        unit = infer_unix_unit(series)
        return pd.to_datetime(series, unit=unit, errors="coerce")

    converted = pd.to_datetime(series, errors="coerce")
    if converted.notna().any():
        return converted

    unit = infer_unix_unit(series)
    return pd.to_datetime(series, unit=unit, errors="coerce")


def list_click_files(path: Union[str, Path]) -> List[Path]:
    path_obj = Path(path)
    if path_obj.is_file():
        return [path_obj]
    if path_obj.is_dir():
        return sorted(path_obj.glob("clicks_hour_*.csv"))
    return []


def ensure_context_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure session_size and context columns exist with safe defaults."""
    df = df.copy()
    if "session_size" not in df.columns:
        df["session_size"] = 1
    for col in CONTEXT_COLUMNS:
        if col not in df.columns:
            df[col] = "unknown"
    return df


def create_synthetic_clicks(path: str, n_users: int = 50, n_items: int = 120, days: int = 30, interactions_per_user: int = 25) -> pd.DataFrame:
    """Create a small synthetic clicks dataset to keep the notebook runnable."""
    rng = np.random.default_rng(CONFIG["random_seed"])
    start = pd.Timestamp("2022-01-01")
    envs = ["web", "app"]
    devices = ["mobile", "desktop"]
    oss = ["ios", "android", "linux"]
    referrers = ["direct", "search", "social"]
    records = []
    for user in range(1, n_users + 1):
        offsets = rng.integers(0, days, size=interactions_per_user)
        timestamps = [start + pd.Timedelta(int(o), unit="D") for o in sorted(offsets.tolist())]
        articles = rng.integers(1, n_items + 1, size=interactions_per_user)
        for ts, art in zip(timestamps, articles):
            records.append({
                "user_id": int(user),
                "article_id": int(art),
                "timestamp": ts,
                "session_size": int(rng.integers(1, 6)),
                "click_environment": rng.choice(envs),
                "click_deviceGroup": rng.choice(devices),
                "click_os": rng.choice(oss),
                "click_country": rng.choice(["fr", "us", "br"]),
                "click_region": rng.choice(["idf", "sp", "ca"]),
                "click_referrer_type": rng.choice(referrers),
            })
    df = pd.DataFrame(records).sort_values("timestamp").reset_index(drop=True)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    print(
        f"Synthetic clicks dataset created at {path} "
        f"(users={n_users}, items={n_items}, interactions={len(df)})"
    )
    return df


def load_clicks(path: str, max_files: Optional[int] = None) -> pd.DataFrame:
    """Load clicks data from the Globo hourly files, with a safety cap."""
    files = list_click_files(path)
    total_files = len(files)
    if not files:
        print(f"Clicks directory not found at {path}. Generating a synthetic sample for demonstration.")
        return ensure_context_columns(create_synthetic_clicks(Path(path) / "clicks_hour_000.csv"))

    if max_files is not None:
        print(f"Explicit limit max_files={max_files}, total detected={total_files}")
        files = files[:max_files]

    print(f"Loading {len(files)} click files (total detected={total_files}, limit={max_files if max_files is not None else 'none'})")
    frames = []
    for file in files:
        df = pd.read_csv(file)
        ts_col = detect_timestamp_column(df)
        article_col = detect_article_column(df)
        df[ts_col] = to_timestamp(df[ts_col])
        df = df.rename(columns={ts_col: "timestamp", article_col: "article_id"})
        df = ensure_context_columns(df)
        keep_cols = [col for col in [
            "user_id",
            "article_id",
            "timestamp",
            "session_size",
            *CONTEXT_COLUMNS,
        ] if col in df.columns]
        frames.append(df[keep_cols])

    combined = pd.concat(frames, ignore_index=True)
    combined = combined.sort_values("timestamp").reset_index(drop=True)
    print(f"Aggregated clicks: {len(combined)} rows, {combined['user_id'].nunique()} unique users, {combined['article_id'].nunique()} unique articles.")
    return combined


def load_metadata(path: str) -> Optional[pd.DataFrame]:
    """Load article metadata if available."""
    if not os.path.exists(path):
        print(f"Metadata file not found at {path}. Continuing with the Surprise pipeline only, without article metadata.")
        return None
    meta = pd.read_csv(path)
    if "article_id" not in meta.columns:
        print("Metadata missing 'article_id' column. Ignoring metadata.")
        return None
    return meta


clicks = load_clicks(CONFIG["clicks_dir"], max_files=CONFIG["max_click_files"])
metadata = load_metadata(CONFIG["metadata_path"])
print(clicks.head())
print("Metadata loaded:", metadata is not None)



## Exploratory data analysis

A quick snapshot of the source files right after loading:
- number of rows and column names of the clicks
- volume and integrity of the article metadata
- dimensions and structure of the `articles_embeddings` file.

In [None]:
# Quick EDA on the source data
import pickle
from pathlib import Path
from collections.abc import Mapping


def summarize_timestamps(series: pd.Series):
    series = pd.to_datetime(series)
    daily = series.dt.date.value_counts().sort_index().rename_axis("date").reset_index(name="nb_clicks")
    hourly = series.dt.hour.value_counts().sort_index().rename_axis("hour").reset_index(name="nb_clicks")
    return series.min(), series.max(), daily, hourly


def describe_structure(obj, prefix="embeddings", max_depth=4):
    entries = []

    def add_entry(path, value, note=None):
        entry = {"chemin": path, "type": type(value).__name__}
        if hasattr(value, "shape"):
            entry["shape"] = tuple(getattr(value, "shape"))
        elif hasattr(value, "__len__") and not isinstance(value, (str, bytes)):
            entry["len"] = len(value)
        if hasattr(value, "dtype"):
            entry["dtype"] = str(getattr(value, "dtype"))
        if note:
            entry["note"] = note
        if isinstance(value, np.ndarray) and value.dtype.names:
            entry["dtype_fields"] = list(value.dtype.names)
        if isinstance(value, np.ndarray) and value.ndim == 1 and len(value) > 0 and not isinstance(value[0], (np.ndarray, list, tuple, Mapping)):
            entry["exemple"] = repr(value[:3].tolist())
        entries.append(entry)

    def walk(value, path, depth):
        add_entry(path, value)
        if depth >= max_depth:
            return
        if isinstance(value, Mapping):
            for k, v in value.items():
                walk(v, f"{path}.{k}", depth + 1)
        elif isinstance(value, (list, tuple, np.ndarray)) and not isinstance(value, (str, bytes)):
            if len(value) > 0:
                walk(value[0], f"{path}[0]", depth + 1)

    walk(obj, prefix, 0)
    return entries


click_files = list_click_files(CONFIG["clicks_dir"])
print(f"Total number of click files detected: {len(click_files)}")
if not click_files:
    print("No click files found at the configured path. Check that the data was downloaded.")

files_for_eda = click_files[:2]
per_file_stats = []
for file in files_for_eda:
    df_file = pd.read_csv(file)
    ts_col = detect_timestamp_column(df_file)
    article_col = detect_article_column(df_file)
    timestamps = to_timestamp(df_file[ts_col])
    per_file_stats.append(
        {
            "fichier": file.name,
            "nb_lignes": len(df_file),
            "colonnes": ", ".join(df_file.columns),
            "articles_uniques": df_file[article_col].nunique(),
            "horodatage_min": timestamps.min(),
            "horodatage_max": timestamps.max(),
        }
    )
if per_file_stats:
    display(pd.DataFrame(per_file_stats))
else:
    print("Not enough files for a detailed per-file EDA.")

print("=== Clicks (aggregated) ===")
if clicks.empty:
    print("No clicks loaded. Check the path or increase max_click_files.")
else:
    clicks_summary = {
        "nb_lignes": len(clicks),
        "colonnes": ", ".join(clicks.columns),
        "utilisateurs_uniques": clicks['user_id'].nunique() if 'user_id' in clicks else None,
        "articles_uniques": clicks['article_id'].nunique() if 'article_id' in clicks else None,
    }
    display(pd.DataFrame([clicks_summary]))

    total_articles = None
    if metadata is not None and 'article_id' in metadata:
        total_articles = metadata['article_id'].nunique()
    elif 'article_id' in clicks:
        total_articles = clicks['article_id'].nunique()

    total_clients = clicks['user_id'].nunique() if 'user_id' in clicks else None
    print("Global summary (articles / users)")
    display(pd.DataFrame([{
        'nombre_total_articles': total_articles,
        'nombre_total_clients': total_clients,
    }]))

    ts_min, ts_max, daily, hourly = summarize_timestamps(clicks['timestamp'])
    display(pd.DataFrame([
        {
            'horodatage_min': ts_min,
            'horodatage_max': ts_max,
            'fenetre_jours': (ts_max - ts_min).days + 1,
        }
    ]))
    print("Breakdown by day (up to the first 10 values)")
    display(daily.head(10))
    print("Breakdown by hour (0-23)")
    display(hourly)

print("=== Article metadata ===")
if metadata is None:
    print("No metadata file loaded.")
else:
    meta_summary = {
        "nb_articles": len(metadata),
        "colonnes": ", ".join(metadata.columns),
        "articles_uniques": metadata['article_id'].nunique() if 'article_id' in metadata else None,
    }
    display(pd.DataFrame([meta_summary]))
    missing = metadata.isna().sum().sort_values(ascending=False)
    display(missing.to_frame('valeurs_manquantes'))
    if 'created_at_ts' in metadata.columns:
        created = to_timestamp(metadata['created_at_ts'])
        display(pd.DataFrame([{'premier_article': created.min(), 'dernier_article': created.max()}]))
    if 'article_id' in metadata.columns:
        overlap = set(clicks['article_id'].unique()) if 'article_id' in clicks.columns else set()
        coverage = len(overlap & set(metadata['article_id'].unique()))
        print(f"Articles present in both clicks and metadata: {coverage}")


print("=== Article embeddings ===")
embeddings_path = Path(CONFIG['embeddings_path'])
if embeddings_path.exists():
    with embeddings_path.open('rb') as f:
        embeddings_obj = pickle.load(f)
    print(f"Loaded type: {type(embeddings_obj)}")

    def summarize_matrix(mat):
        stats = {
            'shape': getattr(mat, 'shape', None),
            'dtype': getattr(mat, 'dtype', None),
        }

        dim_values = []
        shape = getattr(mat, 'shape', None)
        if shape is not None and len(shape) >= 2:
            dim_values.append(shape[1])
        elif isinstance(mat, (list, tuple, np.ndarray)):
            for row in mat:
                if hasattr(row, '__len__') and not isinstance(row, (str, bytes)):
                    try:
                        dim_values.append(len(row))
                    except TypeError:
                        continue

        if dim_values:
            stats.update({
                'profondeur_min': min(dim_values),
                'profondeur_moyenne': float(np.mean(dim_values)),
                'profondeur_max': max(dim_values),
            })

        if hasattr(mat, 'shape') and len(getattr(mat, 'shape', [])) == 2:
            norms = np.linalg.norm(mat, axis=1)
            stats.update(
                {
                    'nb_vectors': mat.shape[0],
                    'dim': mat.shape[1],
                    'norm_min': norms.min(),
                    'norm_max': norms.max(),
                    'norm_moyenne': norms.mean(),
                }
            )
        return stats

    base_structure = describe_structure(embeddings_obj, max_depth=4)

    if isinstance(embeddings_obj, dict):
        keys = list(embeddings_obj.keys())
        print(f"Available keys: {keys}")
        matrix = embeddings_obj.get('embeddings')
        # Avoid `a or b` here: a NumPy array of ids has no unambiguous truth value
        ids = embeddings_obj.get('articles_ids')
        if ids is None:
            ids = embeddings_obj.get('article_ids')

        structure = base_structure.copy()
        if ids is not None:
            structure.insert(0, {
                'chemin': 'embeddings.article_ids',
                'type': type(ids).__name__,
                'len': len(ids),
                'note': "Article identifiers provided in the file",
            })
        if structure:
            print("Detailed structure of the embeddings object (by key path):")
            display(pd.DataFrame(structure))

        if matrix is not None:
            stats = summarize_matrix(matrix)
            stats.update(
                {
                    'colonnes': ", ".join(keys),
                    'nb_articles_ids': len(ids) if ids is not None else None,
                    'ids_uniques': len(set(ids)) if ids is not None else None,
                    'couverture_metadata': len(set(ids) & set(metadata['article_id']))
                    if (metadata is not None and ids is not None and 'article_id' in metadata)
                    else None,
                    'couverture_clicks': len(set(ids) & set(clicks['article_id']))
                    if (not clicks.empty and ids is not None and 'article_id' in clicks)
                    else None,
                }
            )
            display(pd.DataFrame([stats]))

            if ids is not None:
                sample_ids = ids[:5] if len(ids) >= 5 else ids
                print("Preview of the first article_id values linked to the embeddings:")
                display(pd.DataFrame({'article_id': sample_ids}))

            preview_cols = [f"emb_{i}" for i in range(min(5, matrix.shape[1] if hasattr(matrix, 'shape') else 0))]
            if preview_cols:
                preview = pd.DataFrame(matrix[:5, : len(preview_cols)], columns=preview_cols)
                if ids is not None:
                    preview.insert(0, 'article_id', ids[: len(preview)])
                print("Embeddings preview (a few columns and the first rows):")
                display(preview)
                print("Columns shown in the embeddings preview:")
                print(", ".join(preview.columns))

                if ids is not None and metadata is not None and 'article_id' in metadata:
                    meta_cols = [c for c in ['title', 'category_id', 'created_at_ts', 'publisher'] if c in metadata.columns]
                    meta_sample = (
                        preview[['article_id']]
                        .merge(metadata[['article_id'] + meta_cols], on='article_id', how='left')
                    )
                    if 'created_at_ts' in meta_sample.columns:
                        meta_sample['created_at_ts'] = to_timestamp(meta_sample['created_at_ts'])
                    print("Example embedding -> metadata link on article_id (first 5 rows):")
                    display(meta_sample.head())
        else:
            print("No explicit embeddings matrix found in the loaded object.")
    elif hasattr(embeddings_obj, 'shape'):
        stats = summarize_matrix(embeddings_obj)

        inferred_ids = None
        mapping_note = None
        if metadata is not None and 'article_id' in metadata and hasattr(embeddings_obj, 'shape'):
            if embeddings_obj.shape[0] == len(metadata):
                inferred_ids = metadata['article_id'].reset_index(drop=True)
                mapping_note = (
                    "No explicit article_id provided; mapping assumed to follow the metadata row order."
                )
            else:
                mapping_note = (
                    "No article_id in the embeddings file and the size does not match the metadata: "
                    f"{embeddings_obj.shape[0]} vectors vs {len(metadata)} metadata rows."
                )
        else:
            mapping_note = (
                "No article identifier is present in the embeddings file (external mapping required)."
            )

        structure = base_structure.copy()
        if inferred_ids is not None:
            structure.insert(0, {
                'chemin': 'embeddings.article_id (inféré)',
                'type': type(inferred_ids).__name__,
                'len': len(inferred_ids),
                'note': "Assumed alignment with metadata.article_id (same index).",
            })
        if structure:
            print("Detailed structure of the embeddings object (by key path):")
            display(pd.DataFrame(structure))

        if mapping_note:
            print(mapping_note)

        if inferred_ids is not None:
            stats.update(
                {
                    'ids_source': 'metadata.article_id (alignement par index)',
                    'ids_uniques': inferred_ids.nunique(),
                    'couverture_metadata': len(set(inferred_ids) & set(metadata['article_id'])),
                    'couverture_clicks': len(set(inferred_ids) & set(clicks['article_id'])) if not clicks.empty else None,
                }
            )

        display(pd.DataFrame([stats]))
        if len(getattr(embeddings_obj, 'shape', [])) >= 2 and embeddings_obj.shape[1] > 0:
            preview_cols = [f"emb_{i}" for i in range(min(5, embeddings_obj.shape[1]))]
            preview = pd.DataFrame(embeddings_obj[:5, : len(preview_cols)], columns=preview_cols)
            if inferred_ids is not None:
                preview.insert(0, 'article_id', inferred_ids.iloc[: len(preview)].values)
            print("Direct preview of the embeddings matrix:")
            display(preview)
            print("Columns shown in the embeddings preview:")
            print(", ".join(preview.columns))

            if inferred_ids is not None and metadata is not None:
                meta_cols = [c for c in ['title', 'category_id', 'created_at_ts', 'publisher'] if c in metadata.columns]
                meta_sample = preview[['article_id']].merge(
                    metadata[['article_id'] + meta_cols], on='article_id', how='left'
                )
                if 'created_at_ts' in meta_sample.columns:
                    meta_sample['created_at_ts'] = to_timestamp(meta_sample['created_at_ts'])
                print("Example embedding -> metadata link on article_id (inferred):")
                display(meta_sample.head())
        else:
            print("Loaded object is unstructured; use type/len to investigate.")
else:
    print(f"Embeddings file not found at {embeddings_path}")





## Article Embeddings

This file contains the **article embeddings**, that is, a **numerical representation of the textual content** that makes it possible to compare articles semantically.

* **Format**: NumPy matrix `(N, 250)` in `float32`
* **1 row = 1 article**
* **250 columns = latent dimensions**
* Individual values have no direct meaning

The `article_id` is **not stored explicitly**: it is **inferred from the row order**, which must stay aligned with the article metadata.

The `words_count` variable (a column of the article metadata) gives the **number of words in the source text** and serves only as an indicator of content quality.

The embeddings are **not normalized**, so **cosine similarity** is the recommended measure for comparing articles.
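
As an illustration of the point above, here is a minimal sketch of a cosine-similarity lookup on unnormalized vectors. The function name `cosine_top_k` and the toy matrix are hypothetical and only stand in for the real `(N, 250)` array; row indices play the role of the implicit `article_id`.

```python
import numpy as np


def cosine_top_k(embeddings: np.ndarray, query_index: int, k: int = 5) -> list:
    """Return the row indices of the k rows most similar to one query row."""
    # L2-normalize each row so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_index]
    # Never return the query article itself
    scores[query_index] = -np.inf
    # Sort by decreasing similarity and keep the k best row indices
    return np.argsort(-scores)[:k].tolist()


# Toy matrix (hypothetical values); row 1 has a large norm but points
# in almost the same direction as row 0
toy = np.array(
    [[1.0, 0.0], [10.0, 0.5], [0.0, 3.0], [-2.0, 0.0]], dtype=np.float32
)
top = cosine_top_k(toy, query_index=0, k=2)
print(top)
```

Normalizing first means the vector length no longer influences the ranking, which is why row 1 ranks first for row 0 despite its much larger norm.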


## Protocol

1. Sort interactions by timestamp to respect chronology.
2. Temporal train/test split per user according to `train_ratio`, to avoid any leakage from the future.
3. Build a user profile from the train interactions.
4. Define the *ground truth*: articles clicked in test for each user (at least one).
5. Generate Top-5 recommendations, excluding articles already seen in train.
6. Compute the ranking metrics (Precision@5, Recall@5, MAP@5, NDCG@5, Coverage@5) and estimate the average latency on a sample of at most 500 users.

This procedure mimics a production scenario: chronology is respected first, then the quality of the suggestions and the computation cost are measured together.
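
Steps 1 to 4 can be sketched on a toy table as follows. This is only an illustration under assumed data: the frame `toy_clicks` and its values are hypothetical, and the cutoff rule (the first `floor(n * train_ratio)` interactions of each user go to train) mirrors the one used by the split function in this notebook.

```python
import pandas as pd

# Hypothetical interactions: user 1 has 5 clicks, user 2 has 3
toy_clicks = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1, 2, 2, 2],
    "article_id": [10, 11, 12, 13, 20, 10, 20, 11],
    "timestamp":  pd.to_datetime([
        "2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
        "2022-01-01", "2022-01-03", "2022-01-04",
    ]),
})
train_ratio = 0.8

# Steps 1-2: chronological order, then a per-user cutoff
ordered = toy_clicks.sort_values(["user_id", "timestamp"])
position = ordered.groupby("user_id").cumcount()
n_per_user = ordered.groupby("user_id")["user_id"].transform("size")
in_train = position < (n_per_user * train_ratio).astype(int)
toy_train, toy_test = ordered[in_train], ordered[~in_train]

# Step 3: user profile = set of articles seen in train
seen = toy_train.groupby("user_id")["article_id"].apply(set).to_dict()

# Step 4: ground truth = test clicks restricted to articles known from train
candidates = set(toy_train["article_id"])
truth = toy_test.groupby("user_id")["article_id"].apply(set).to_dict()
truth = {user: items & candidates for user, items in truth.items()}

print(seen)
print(truth)
```

With these values each user keeps exactly one test click, and that article is a valid candidate because the other user clicked it during the train period.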

## Minimal preparation of the interactions

In [None]:

# Iterative k-core filtering to limit sparsity before the train/test split

def iterative_k_core_filter(
    df: pd.DataFrame, min_user_interactions: int, min_item_interactions: int
) -> pd.DataFrame:
    filtered = df.copy()
    previous_size = -1
    while previous_size != len(filtered):
        previous_size = len(filtered)
        user_counts = filtered["user_id"].value_counts()
        item_counts = filtered["article_id"].value_counts()
        filtered = filtered[
            filtered["user_id"].isin(user_counts[user_counts >= min_user_interactions].index)
            & filtered["article_id"].isin(item_counts[item_counts >= min_item_interactions].index)
        ]
    return filtered

if clicks.empty:
    print("Clicks dataset is empty: skipping k-core filtering.")
else:
    before = (
        len(clicks),
        clicks["user_id"].nunique(),
        clicks["article_id"].nunique(),
    )
    clicks = iterative_k_core_filter(
        clicks,
        CONFIG["min_user_interactions"],
        CONFIG["min_item_interactions"],
    ).sort_values("timestamp").reset_index(drop=True)
    after = (
        len(clicks),
        clicks["user_id"].nunique(),
        clicks["article_id"].nunique(),
    )
    print(
        "k-core filtering done: "
        f"interactions {before[0]} -> {after[0]}, "
        f"users {before[1]} -> {after[1]}, "
        f"articles {before[2]} -> {after[2]}"
    )



In [None]:
# Split and utility functions

from __future__ import annotations

from typing import Dict, List, Tuple, Set
import numpy as np
import pandas as pd


MIN_HISTORY = 2


def _assert_no_temporal_leakage(train_df: pd.DataFrame, test_df: pd.DataFrame) -> None:
    """
    Vectorized leakage check:
    For each user that has both train and test rows, ensure max(train_ts) <= min(test_ts).
    """
    if train_df.empty or test_df.empty:
        return

    train_max = train_df.groupby("user_id")["timestamp"].max()
    test_min = test_df.groupby("user_id")["timestamp"].min()

    joined = pd.concat([train_max.rename("train_max"), test_min.rename("test_min")], axis=1).dropna()
    bad = joined[joined["train_max"] > joined["test_min"]]
    assert bad.empty, f"Temporal leakage detected for users: {bad.index.tolist()[:5]}"


def temporal_train_test_split_per_user(df: pd.DataFrame, train_ratio: float) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split interactions chronologically per user to avoid leakage.

    For each user, the earliest floor(n * train_ratio) interactions go to
    train and the remaining ones to test. A vectorized assertion then checks
    that no train interaction is later than a test interaction of the same user.
    """
    if df.empty:
        empty = df.iloc[0:0].copy()
        return empty, empty

    # Sort once globally instead of sorting inside each group
    df2 = df.sort_values(["user_id", "timestamp"]).copy()

    # Position inside each user (0..n-1) and per-user cutoff
    pos = df2.groupby("user_id").cumcount()
    n = df2.groupby("user_id")["user_id"].transform("size")
    cutoff = (n * float(train_ratio)).astype(int)

    train_mask = pos < cutoff
    train_df = df2[train_mask].copy()
    test_df = df2[~train_mask].copy()

    _assert_no_temporal_leakage(train_df, test_df)
    return train_df, test_df


def build_user_histories(df: pd.DataFrame) -> Dict[int, List[int]]:
    """Create mapping user -> unique list of articles in chronological order.

    Returns {user_id: [article_id1, article_id2, ...]}. Duplicates are removed
    by keeping the first occurrence, so the chronological order is preserved.
    """
    if df.empty:
        return {}

    df2 = df.sort_values(["user_id", "timestamp"])

    # Keep first time a user saw an article (preserves chronological order per user)
    df2 = df2.drop_duplicates(subset=["user_id", "article_id"], keep="first")

    # Build dict user -> list[article_id], coercing ids to plain Python ints
    grouped = df2.groupby("user_id")["article_id"].apply(lambda s: [int(x) for x in s.tolist()])
    return {int(uid): items for uid, items in grouped.items()}


def make_ground_truth(
    train: pd.DataFrame,
    test: pd.DataFrame,
    min_history: int,
) -> Tuple[Dict[int, List[int]], Dict[int, Set[int]], List[int], float]:
    """Build user histories and ground truth for evaluation.

    Test items not observed in train are filtered out to keep candidates consistent.
    Users with fewer than `min_history` distinct train items, or with no
    remaining test item, are excluded from the ground truth.
    Returns (train histories, ground truth, candidate items, unknown test item
    rate), where the rate is the fraction of test interactions filtered out.
    """
    train = train.copy()
    test = test.copy()

    # Coerce ids to int (important for joins/sets and Surprise adapters)
    train["user_id"] = train["user_id"].astype(int)
    train["article_id"] = train["article_id"].astype(int)
    test["user_id"] = test["user_id"].astype(int)
    test["article_id"] = test["article_id"].astype(int)

    train_hist = build_user_histories(train)

    # Candidate pool from train items
    candidate_items_global = [int(i) for i in train["article_id"].dropna().unique().tolist()]
    candidate_set = set(candidate_items_global)

    unknown_mask = ~test["article_id"].isin(candidate_set)
    unknown_test_items_rate = float(unknown_mask.mean()) if len(test) else 0.0

    test_hist = build_user_histories(test)

    filtered_gt: Dict[int, Set[int]] = {}
    for user_id, items in test_hist.items():
        filtered_items = {int(i) for i in items if int(i) in candidate_set}
        if not filtered_items:
            continue
        if len(train_hist.get(int(user_id), [])) < int(min_history):
            continue
        filtered_gt[int(user_id)] = filtered_items

    return train_hist, filtered_gt, candidate_items_global, unknown_test_items_rate


# --- Usage ---

train_df, test_df = temporal_train_test_split_per_user(clicks, CONFIG["train_ratio"])
train_histories, ground_truth, candidate_items_global, unknown_test_items_rate = make_ground_truth(
    train_df, test_df, MIN_HISTORY
)
eval_users = sorted(ground_truth.keys())
candidate_items = candidate_items_global
print(
    "Train size: "
    f"{len(train_df)}, Test size: {len(test_df)}, Users for eval: {len(eval_users)}, "
    f"Unknown test items rate (filtered): {unknown_test_items_rate:.4f}"
)


## Metrics used

* **Precision@5**: share of the Top-5 recommendations that are actually clicked (the higher, the more precise the Top-5).
* **Recall@5**: share of test clicks found in the Top-5 (measures how much of what the user likes is covered).
* **MAP@5**: mean of the precision accumulated at each retrieved click; rewards good positions in the list.
* **NDCG@5**: weights each click by its position (decreasing gain) and normalizes by the best achievable score; well suited to comparing rankings.
* **Coverage@5**: proportion of distinct articles recommended across all users (catalog diversity).
* **Latency per user**: average time to produce the Top-5 (important for a real-time API).
* **RMSE**: root mean squared error on rating predictions; summarizes the overall gap between the model's estimates and the actual clicks.
* **MAE**: mean absolute error; highlights the average error without amplifying large deviations.
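
To make these definitions concrete, the following sketch works through the per-user metrics by hand on one hypothetical recommendation list. All values (`recommended`, `relevant`, `actual`, `predicted`) are invented for illustration, and the normalizations follow the binary-relevance conventions used in this notebook (AP and ideal DCG are capped at `min(len(relevant), k)`).

```python
import math

recommended = [7, 3, 9, 4, 1]   # hypothetical Top-5 for one user
relevant = {3, 4, 8}            # hypothetical test clicks of that user
k = 5

# One boolean per rank: hits at ranks 2 and 4
hits = [item in relevant for item in recommended[:k]]
precision = sum(hits) / k
recall = sum(hits) / len(relevant)

# AP@k: precision measured at each hit, normalized by min(|relevant|, k)
found, ap_sum = 0, 0.0
for rank, hit in enumerate(hits, start=1):
    if hit:
        found += 1
        ap_sum += found / rank
average_precision = ap_sum / min(len(relevant), k)

# NDCG@k: discounted gain of the hits divided by the best achievable DCG
dcg = sum(1 / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1) if hit)
ideal_dcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
ndcg = dcg / ideal_dcg

# RMSE / MAE on hypothetical rating predictions (1.0 = click, 0.0 = no click)
actual = [1.0, 1.0, 0.0]
predicted = [0.8, 0.4, 0.3]
errors = [p - a for p, a in zip(predicted, actual)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)

print(precision, recall, average_precision, ndcg, rmse, mae)
```

MAP@5 and the averaged NDCG@5 are then obtained by taking the mean of these per-user values over all evaluated users.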

In [None]:
# Metrics

def precision_at_k(recommended: List[int], relevant: set[int], k: int) -> float:
    """Precision@k for a single user (binary relevance)."""
    if k == 0:
        return 0.0
    rec_k = recommended[:k]
    hits = sum(1 for item in rec_k if item in relevant)
    return hits / k


def recall_at_k(recommended: List[int], relevant: set[int], k: int) -> float:
    """Recall@k for a single user (binary relevance)."""
    if not relevant:
        return 0.0
    rec_k = recommended[:k]
    hits = sum(1 for item in rec_k if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(recommended: List[int], relevant: set[int], k: int) -> float:
    """Average precision@k for a single user (binary relevance)."""
    if not relevant:
        return 0.0
    score = 0.0
    hits = 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)


def dcg_at_k(recommended: List[int], relevant: set[int], k: int) -> float:
    """Discounted cumulative gain (binary relevance)."""
    dcg = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            dcg += 1 / np.log2(i + 1)
    return dcg


def ndcg_at_k(recommended: List[int], relevant: set[int], k: int) -> float:
    """Normalized DCG with binary relevance and capped ideal DCG."""
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    ideal_dcg = sum(1 / np.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg_at_k(recommended, relevant, k) / ideal_dcg


def coverage_at_k(all_recommendations: List[List[int]], candidate_items: List[int], k: int) -> float:
    """Coverage of unique recommended items over global candidates."""
    assert candidate_items, "candidate_items_global must not be empty"
    rec_items = set()
    for rec in all_recommendations:
        rec_items.update(rec[:k])
    assert rec_items.issubset(set(candidate_items)), "Coverage union must be subset of candidates"
    return len(rec_items) / len(candidate_items)



## Utility functions for the recommenders

In [None]:

# Classic helpers (popularity, similarity, lightweight SVD) used by the baselines

def build_global_popularity(train: pd.DataFrame) -> List[int]:
    """Return articles sorted by click count, most clicked first."""
    return train.groupby("article_id").size().sort_values(ascending=False).index.tolist()


def build_recent_popularity(train: pd.DataFrame, window_days: int) -> List[int]:
    """Return articles ranked by popularity within the most recent sliding window."""
    max_time = train["timestamp"].max()
    window_start = max_time - pd.Timedelta(days=window_days)
    recent = train[train["timestamp"] >= window_start]
    if recent.empty:
        return build_global_popularity(train)
    counts = recent.groupby("article_id")["timestamp"].agg(["size", "max"])
    ranked = counts.sort_values(by=["size", "max"], ascending=[False, False])
    return ranked.index.tolist()


def build_covisit_graph(train: pd.DataFrame) -> Dict[int, Dict[int, int]]:
    """Build a co-visitation graph from each user's click history."""
    graph: Dict[int, Dict[int, int]] = {}
    for _, group in train.groupby("user_id"):
        items = group.sort_values("timestamp")["article_id"].tolist()
        unique_items = list(dict.fromkeys(items))
        for i, item_i in enumerate(unique_items):
            graph.setdefault(item_i, {})
            for item_j in unique_items[i + 1 :]:
                graph[item_i][item_j] = graph[item_i].get(item_j, 0) + 1
                graph.setdefault(item_j, {})
                graph[item_j][item_i] = graph[item_j].get(item_i, 0) + 1
    return graph


def build_content_embeddings(metadata: pd.DataFrame, pca_components: Optional[int] = None):
    """Build TF-IDF embeddings from the text columns (with optional TruncatedSVD reduction)."""
    text_cols = [
        c
        for c in metadata.columns
        if metadata[c].dtype == object and c not in {"article_id", "clicks"}
    ]
    non_id_cols = [c for c in metadata.columns if c != "article_id"]

    if not text_cols and non_id_cols:
        print("No text column found: using the non-ID columns as categorical tokens.")
        text_cols = non_id_cols

    if not text_cols:
        raise ValueError("No usable metadata column to build embeddings from")

    corpus = metadata[text_cols].fillna("")
    corpus = corpus.apply(lambda row: " ".join(f"{col}_{val}" for col, val in row.items()), axis=1)

    vectorizer = TfidfVectorizer(max_features=5000)
    tfidf = vectorizer.fit_transform(corpus)
    if pca_components and pca_components < tfidf.shape[1]:
        svd = TruncatedSVD(n_components=pca_components, random_state=CONFIG["random_seed"])
        reduced = svd.fit_transform(tfidf)
        embeddings = normalize(reduced)
    else:
        embeddings = normalize(tfidf)
    ids = metadata["article_id"].tolist()
    return embeddings, ids


def build_item_similarity(train: pd.DataFrame, metadata: Optional[pd.DataFrame]):
    """Build an article-to-article similarity from content, falling back to co-visitation."""
    if metadata is not None:
        try:
            embeddings, ids = build_content_embeddings(metadata, CONFIG["content_pca_components"])
            similarity: Dict[int, Dict[int, float]] = {}
            for i, aid in enumerate(ids):
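                sims = embeddings @ embeddings[i].T
                # The product is sparse when the TF-IDF matrix was not reduced; np.asarray would
                # wrap it in a 0-d object array, so densify explicitly before sorting.
                sims = sims.toarray().ravel() if sparse.issparse(sims) else np.asarray(sims).ravel()
                # Keep the 50 nearest articles, excluding the article itself.
                top_idx = [j for j in np.argsort(-sims)[:51] if j != i][:50]
                similarity[aid] = {ids[j]: float(sims[j]) for j in top_idx if sims[j] > 0}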
            return similarity, "content"
        except Exception as exc:
            print(f"Content embeddings unavailable ({exc}). Falling back to co-visitation.")
    graph = build_covisit_graph(train)
    similarity = {item: {nbr: float(cnt) for nbr, cnt in neigh.items()} for item, neigh in graph.items()}
    return similarity, "covisitation"


def recommend_from_similarity(
    user_id: int,
    train_histories: Dict[int, List[int]],
    similarity: Dict[int, Dict[int, float]],
    candidate_items: List[int],
    k: int,
) -> List[int]:
    """Aggregate similarity scores from the user's history."""
    seen = set(train_histories.get(user_id, []))
    scores: Dict[int, float] = {}
    for item in seen:
        for neighbor, sim in similarity.get(item, {}).items():
            if neighbor in seen:
                continue
            scores[neighbor] = scores.get(neighbor, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen]
    if len(recs) < k:
        for c in candidate_items:
            if c not in seen and c not in recs:
                recs.append(c)
            if len(recs) >= k:
                break
    return recs[:k]


def build_collaborative_svd(train: pd.DataFrame, n_components: int):
    """Train a lightweight implicit SVD and return a recommendation function."""
    user_codes, user_index = pd.factorize(train["user_id"], sort=True)
    item_codes, item_index = pd.factorize(train["article_id"], sort=True)

    interactions = pd.DataFrame({"user_idx": user_codes, "item_idx": item_codes}).drop_duplicates()
    data = np.ones(len(interactions), dtype=np.float32)
    mat = sparse.coo_matrix((data, (interactions["user_idx"], interactions["item_idx"])), shape=(len(user_index), len(item_index))).tocsr()

    svd = TruncatedSVD(n_components=n_components, random_state=CONFIG["random_seed"])
    user_factors = svd.fit_transform(mat)
    item_factors = svd.components_.T

    user_to_idx = {int(uid): int(idx) for idx, uid in enumerate(user_index.tolist())}
    items = [int(aid) for aid in item_index.tolist()]

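    # Computed once here rather than on every cold-start call.
    popularity = build_global_popularity(train)

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        if user_id not in user_to_idx:
            # Cold-start user: fall back to global popularity.
            return [it for it in popularity if it not in seen][:k]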

        u_vec = user_factors[user_to_idx[user_id]]
        scores = item_factors @ u_vec
        ranked_items = [items[i] for i in np.argsort(-scores)]
        return [it for it in ranked_items if it not in seen][:k]

    meta = {"users": len(user_index), "items": len(item_index), "components": n_components}
    return recommend, meta



In [None]:

from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD
from surprise import accuracy


def build_surprise_trainset(interactions: pd.DataFrame, *, use_session_rating: bool = False):
    if use_session_rating:
        weighted = interactions.copy()
        weighted = ensure_context_columns(weighted)
        weighted["rating"] = session_weight_from_size(weighted.get("session_size"))
        aggregated = (
            weighted.groupby(["user_id", "article_id"])
            .agg(rating=("rating", "mean"), last_ts=("timestamp", "max"))
            .reset_index()
        )
    else:
        aggregated = (
            interactions.groupby(["user_id", "article_id"])
            .agg(clicks=("article_id", "size"), last_ts=("timestamp", "max"))
            .reset_index()
        )
        if aggregated.empty:
            raise ValueError("Cannot build a Surprise trainset without interactions")

        min_ts = aggregated["last_ts"].min()
        max_ts = aggregated["last_ts"].max()
        span_seconds = max((max_ts - min_ts).total_seconds(), 1.0)
        recency = (aggregated["last_ts"] - min_ts).dt.total_seconds() / span_seconds
        aggregated["rating"] = np.log1p(aggregated["clicks"]) + 0.5 * recency

    min_rating = float(aggregated["rating"].min())
    max_rating = float(aggregated["rating"].max())
    if max_rating == min_rating:
        max_rating = min_rating + 1.0

    reader = Reader(rating_scale=(min_rating, max_rating))
    return Dataset.load_from_df(
        aggregated[["user_id", "article_id", "rating"]], reader
    ).build_full_trainset()


surprise_trainset = build_surprise_trainset(train_df, use_session_rating=False)
surprise_items = [int(surprise_trainset.to_raw_iid(iid)) for iid in surprise_trainset.all_items()]
popularity_order = build_global_popularity(train_df)
popularity_rank = {int(aid): rank for rank, aid in enumerate(popularity_order)}


# Adapters and baselines for evaluation

def seen_recommender_adapter(recommend_func, k: int):
    """Wrap recommenders that expect (user_id, seen, k)."""
    def wrapped(user_id: int, candidates_u: list[int], train_histories: Dict[int, List[int]]):
        seen = set(train_histories.get(user_id, []))
        return recommend_func(user_id, seen, k)
    return wrapped


def random_recommender(user_id: int, candidates_u: list[int], train_histories: Dict[int, List[int]]):
    """Return a random ranking over the user candidate set."""
    rng = np.random.default_rng(CONFIG["random_seed"] + int(user_id))
    return [int(i) for i in rng.permutation(candidates_u)]


def most_popular_recommender(user_id: int, candidates_u: list[int], train_histories: Dict[int, List[int]]):
    """Return candidates sorted by global popularity from train."""
    candidate_set = set(candidates_u)
    return [int(i) for i in popularity_order if int(i) in candidate_set]
# Each algorithm uses a different tie-breaker so that ties do not produce identical top lists


def wrap_surprise_recommender(algo, label: str, *, tie_breaker=None, trainset=None, items=None):
    current_trainset = trainset or surprise_trainset
    current_items = items or surprise_items
    algo.fit(current_trainset)

    is_normal = isinstance(algo, NormalPredictor)
    is_knn = hasattr(algo, "get_neighbors") and hasattr(algo, "sim")
    neighbor_cache: dict[int, list[int]] = {}
    sim_matrix = getattr(algo, "sim", None)

    fallback_sorted_items = list(current_items)
    if tie_breaker:
        fallback_sorted_items = sorted(
            fallback_sorted_items,
            key=lambda iid: tie_breaker(iid),
            reverse=True,
        )

    if is_knn and sim_matrix is not None:
        max_neighbors = getattr(algo, "k", 40)
        for inner_iid in current_trainset.all_items():
            raw_iid = int(current_trainset.to_raw_iid(inner_iid))
            inner_neighbors = algo.get_neighbors(inner_iid, k=max_neighbors)
            neighbor_cache[raw_iid] = [
                int(current_trainset.to_raw_iid(neighbor))
                for neighbor in inner_neighbors
                if neighbor != inner_iid
            ]

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        raw_uid = int(user_id)
        scored: list[tuple[int, float]] = []

        if is_knn:
            if not seen:
                return [iid for iid in fallback_sorted_items if iid not in seen][:k]

            candidate_scores: Counter[int] = Counter()
            inner_seen: dict[int, int] = {}
            for seen_item in seen:
                try:
                    # load_from_df keeps raw ids as they appear in the DataFrame (ints here),
                    # so the lookup uses an int, consistent with algo.predict below.
                    inner_seen[seen_item] = current_trainset.to_inner_iid(int(seen_item))
                except ValueError:
                    continue

            for seen_item, inner_seen_id in inner_seen.items():
                neighbors = neighbor_cache.get(seen_item, [])
                for neighbor_raw in neighbors:
                    if neighbor_raw in seen:
                        continue
                    try:
                        neighbor_inner = current_trainset.to_inner_iid(int(neighbor_raw))
                    except ValueError:
                        continue
                    sim = float(sim_matrix[inner_seen_id, neighbor_inner])
                    if np.isfinite(sim):
                        candidate_scores[neighbor_raw] += sim

            if candidate_scores:
                scored = list(candidate_scores.items())
            else:
                return [iid for iid in fallback_sorted_items if iid not in seen][:k]

        if not scored:
            if is_normal:
                base_score = float(getattr(algo, "mu", 0.0))
                scored = [(iid, base_score) for iid in current_items if iid not in seen]
            else:
                scored = []
                for iid in current_items:
                    if iid in seen:
                        continue
                    pred = algo.predict(raw_uid, int(iid), verbose=False)
                    scored.append((iid, float(pred.est)))

        if not scored:
            return [it for it in current_items if it not in seen][:k]

        def sort_key(item_score):
            iid, score = item_score
            tie = tie_breaker(iid) if tie_breaker else 0.0
            return (score, tie)

        scored.sort(key=sort_key, reverse=True)
        return [it for it, _ in scored[:k]]

    meta = {"algo": label, "n_items": len(current_items), "estimator": algo, "trainset": current_trainset}
    return recommend, meta



In [None]:
# Error metrics for the Surprise algorithms

def surprise_error_metrics(estimator, test_df: pd.DataFrame, candidate_pool=None) -> dict[str, float]:
    """Compute RMSE/MAE on the test split for a fitted Surprise estimator.

    Parameters
    ----------
    estimator : surprise.AlgoBase
        Trained Surprise model with a ``predict`` method.
    test_df : pd.DataFrame
        Test interactions containing ``user_id`` and ``article_id``. If
        ``session_size`` is available, it will be converted to a continuous
        rating using ``session_weight_from_size`` (fallback 1.0).
    candidate_pool : Iterable[int], optional
        If provided, restrict the evaluated items to this candidate pool.
    """
    if test_df.empty:
        return {"rmse": float("nan"), "mae": float("nan")}

    candidate_set = set(candidate_pool) if candidate_pool is not None else None
    session_sizes = test_df.get("session_size")
    ratings = session_weight_from_size(session_sizes) if session_sizes is not None else np.ones(len(test_df), dtype=np.float32)

    predictions = []
    for (uid, iid, true_rating) in zip(test_df["user_id"], test_df["article_id"], ratings):
        if candidate_set is not None and iid not in candidate_set:
            continue
        predictions.append(estimator.predict(int(uid), int(iid), r_ui=float(true_rating), verbose=False))

    if not predictions:
        return {"rmse": float("nan"), "mae": float("nan")}

    rmse = accuracy.rmse(predictions, verbose=False)
    mae = accuracy.mae(predictions, verbose=False)
    return {"rmse": float(rmse), "mae": float(mae)}


In [None]:
# Evaluation pipeline

def deduplicate_preserve_order(items: List[int]) -> List[int]:
    """Remove duplicates while preserving the original order."""
    seen: set[int] = set()
    deduped: List[int] = []
    for item in items:
        if item in seen:
            continue
        seen.add(item)
        deduped.append(item)
    return deduped


def normalize_recommendations(raw_recs: List[Union[int, Tuple[int, float]]]) -> List[int]:
    """Normalize recommender output to a ranked list of int item ids."""
    if not raw_recs:
        return []
    first = raw_recs[0]
    if isinstance(first, (tuple, list)):
        ranked = sorted(raw_recs, key=lambda x: x[1], reverse=True)
        return [int(item) for item, _ in ranked]
    return [int(item) for item in raw_recs]


def evaluate_model(
    name: str,
    recommend_func: Callable[[int, List[int], Dict[int, List[int]]], List[Union[int, Tuple[int, float]]]],
    train_histories: Dict[int, List[int]],
    ground_truth: Dict[int, set[int]],
    candidate_items_global: List[int],
    k: int,
    latency_sample: int = 500,
    progress_every: int = 500,
    unknown_test_items_rate: float = 0.0,
    popularity_fallback: Optional[List[int]] = None,
) -> Dict[str, float]:
    """Evaluate a recommender with ranking metrics and latency estimation."""
    assert candidate_items_global, "candidate_items_global must not be empty"
    popularity_fallback = popularity_fallback or list(candidate_items_global)

    precisions: List[float] = []
    recalls: List[float] = []
    maps: List[float] = []
    ndcgs: List[float] = []
    hitrates: List[int] = []
    all_recs: List[List[int]] = []

    users = sorted(ground_truth.keys())
    total_users = len(users)
    empty_candidates_count = 0
    short_recs_count = 0
    candidate_sizes: List[int] = []
    users_evaluated_list: List[int] = []

    rng = np.random.default_rng(CONFIG["random_seed"])
    debug_users = set(rng.choice(users, size=min(5, len(users)), replace=False).tolist()) if users else set()
    debug_examples: List[Dict[str, object]] = []

    start_eval = time.perf_counter()
    for idx, user_id in enumerate(users, start=1):
        seen = set(train_histories.get(user_id, []))
        candidates_u = [item for item in candidate_items_global if item not in seen]
        if not candidates_u:
            empty_candidates_count += 1
            continue

        candidate_sizes.append(len(candidates_u))
        users_evaluated_list.append(user_id)
        candidate_set_u = set(candidates_u)
        raw_recs = recommend_func(user_id, candidates_u, train_histories)
        recs = normalize_recommendations(raw_recs)
        recs = deduplicate_preserve_order(recs)
        recs = [item for item in recs if item in candidate_set_u]

        if len(recs) < k:
            short_recs_count += 1
            fallback = [item for item in popularity_fallback if item in candidate_set_u and item not in recs]
            recs.extend(fallback)

        recs = recs[: min(k, len(candidates_u))]
        all_recs.append(recs)
        gt = ground_truth[user_id]

        hits_u = sum(1 for item in recs[:k] if item in gt)
        precisions.append(hits_u / k)
        recalls.append(hits_u / len(gt))
        hitrates.append(1 if hits_u > 0 else 0)
        maps.append(average_precision_at_k(recs, gt, k))
        ndcgs.append(ndcg_at_k(recs, gt, k))

        if user_id in debug_users and len(debug_examples) < 5:
            debug_examples.append(
                {
                    "user_id": user_id,
                    "train_size": len(train_histories.get(user_id, [])),
                    "gt_size": len(gt),
                    "candidate_size": len(candidates_u),
                    "gt_items": list(sorted(gt))[:10],
                    "recs": recs[:k],
                    "hits": hits_u,
                }
            )

        if progress_every and idx % progress_every == 0:
            elapsed = time.perf_counter() - start_eval
            rate = elapsed / idx
            eta = rate * max(total_users - idx, 0)
            print(
                f"[{name}] {idx}/{total_users} users processed "
                f"(elapsed {elapsed:.1f}s, ETA {eta:.1f}s)"
            )

    users_evaluated = len(users_evaluated_list)
    empty_candidates_rate = empty_candidates_count / max(1, total_users)
    short_recs_rate = short_recs_count / max(1, users_evaluated)

    if candidate_sizes:
        candidate_min = int(np.min(candidate_sizes))
        candidate_median = float(np.median(candidate_sizes))
        candidate_max = int(np.max(candidate_sizes))
    else:
        candidate_min = candidate_median = candidate_max = 0

    coverage = coverage_at_k(all_recs, candidate_items_global, k)
    hitrate = float(np.mean(hitrates)) if hitrates else 0.0

    sample_users = users[: min(latency_sample, len(users))]
    start = time.perf_counter()
    for user_id in sample_users:
        seen = set(train_histories.get(user_id, []))
        candidates_u = [item for item in candidate_items_global if item not in seen]
        if not candidates_u:
            continue
        _ = recommend_func(user_id, candidates_u, train_histories)
    latency = (time.perf_counter() - start) / max(1, len(sample_users))
    total_eval_time = time.perf_counter() - start_eval

    print(
        f"[{name}] users_evaluated={users_evaluated}, "
        f"short_recs_rate={short_recs_rate:.4f}, "
        f"empty_candidates_rate={empty_candidates_rate:.4f}, "
        f"unknown_test_items_rate={unknown_test_items_rate:.4f}, "
        f"candidate_size(min/median/max)={candidate_min}/{candidate_median:.1f}/{candidate_max}, "
        f"precision@k={float(np.mean(precisions)) if precisions else 0.0:.4f}, "
        f"recall@k={float(np.mean(recalls)) if recalls else 0.0:.4f}, "
        f"map@k={float(np.mean(maps)) if maps else 0.0:.4f}, "
        f"ndcg@k={float(np.mean(ndcgs)) if ndcgs else 0.0:.4f}, "
        f"hitrate@k={hitrate:.4f}, "
        f"coverage@k={coverage:.4f}"
    )

    if debug_examples:
        print(f"[{name}] Debug examples (seeded):")
        for example in debug_examples:
            print(
                f"user={example['user_id']} "
                f"| train={example['train_size']} "
                f"| gt={example['gt_size']} "
                f"| candidates={example['candidate_size']} "
                f"| gt_items={example['gt_items']} "
                f"| recs={example['recs']} "
                f"| hits={example['hits']}"
            )

    return {
        "model": name,
        "users": users_evaluated,
        "precision@k": float(np.mean(precisions)) if precisions else 0.0,
        "recall@k": float(np.mean(recalls)) if recalls else 0.0,
        "map@k": float(np.mean(maps)) if maps else 0.0,
        "ndcg@k": float(np.mean(ndcgs)) if ndcgs else 0.0,
        "hitrate@k": hitrate,
        "coverage@k": coverage,
        "latency_per_user_s": latency,
        "eval_time_s": total_eval_time,
        "all_recommendations": all_recs,
        "short_recs_rate": short_recs_rate,
        "empty_candidates_rate": empty_candidates_rate,
        "unknown_test_items_rate": unknown_test_items_rate,
        "candidate_size_min": candidate_min,
        "candidate_size_median": candidate_median,
        "candidate_size_max": candidate_max,
        "users_evaluated_list": users_evaluated_list,
    }









## Training the recommender systems

Each approach is trained separately to limit the run time of each cell and to give clearer context on each model's role.

### Global popularity
The global popularity recommender sorts articles by interaction volume in the training set. It is fast to compute (a simple aggregation) and serves as a robust baseline against which to compare the more advanced models.
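
As an illustration of the aggregation described above, the following minimal sketch ranks articles by click count and filters out what the user has already seen. The click log and the helper names `popularity_ranking` and `recommend_popular` are hypothetical, and the sketch uses only the standard library rather than the notebook's pandas helpers.

```python
from collections import Counter

def popularity_ranking(clicks):
    """Rank article ids by number of clicks, most clicked first."""
    counts = Counter(article_id for _user_id, article_id in clicks)
    return [article_id for article_id, _count in counts.most_common()]

def recommend_popular(ranking, seen, k):
    """Top-k popular articles the user has not clicked yet."""
    return [article_id for article_id in ranking if article_id not in seen][:k]

# Hypothetical (user_id, article_id) click log.
toy_clicks = [(1, "a"), (2, "a"), (3, "b"), (1, "c"), (2, "c"), (3, "c")]
ranking = popularity_ranking(toy_clicks)
print(ranking)
print(recommend_popular(ranking, seen={"c"}, k=2))
```

On this toy log the ranking is `['c', 'a', 'b']`, and a user who already clicked `c` gets `['a', 'b']`.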

In [None]:

# Shared configuration
K = CONFIG["k"]

# Ready-to-use Surprise models
popularity_recommender, pop_meta = wrap_surprise_recommender(
    NormalPredictor(),
    "NormalPredictor (baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

itemknn_recommender, itemknn_meta = wrap_surprise_recommender(
    KNNBasic(
        k=60,
        min_k=2,
        sim_options={"name": "pearson_baseline", "user_based": False, "min_support": 2, "n_jobs": -1},
    ),
    "KNNBasic item-based (pearson baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

svd_recommender, svd_meta = wrap_surprise_recommender(
    SVD(
        n_factors=CONFIG["svd_components"],
        n_epochs=35,
        reg_all=0.06,
        lr_all=0.004,
        random_state=CONFIG["random_seed"],
    ),
    "SVD collaboratif (facteurs latents)",
    tie_breaker=lambda iid: popularity_rank.get(int(iid), len(popularity_rank)),
)




### Recent popularity
This variant favors freshness: it filters interactions to a time window (the last `recent_window_days` days of the training set) before sorting articles by frequency. It is useful for capturing current trends, at the cost of recomputing the sliding window more often.
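
The windowing step can be sketched as follows. This is a simplified standard-library illustration with invented timestamps and a hypothetical helper name `recent_popularity`. Unlike the notebook's `build_recent_popularity`, it does not break ties by most recent click.

```python
from collections import Counter
from datetime import datetime, timedelta

def recent_popularity(clicks, window_days):
    """Rank articles by clicks inside the last `window_days` of the log."""
    latest = max(ts for _article_id, ts in clicks)
    window_start = latest - timedelta(days=window_days)
    counts = Counter(article_id for article_id, ts in clicks if ts >= window_start)
    return [article_id for article_id, _count in counts.most_common()]

# Hypothetical (article_id, timestamp) click log.
toy_clicks = [
    ("old_hit", datetime(2017, 10, 1)),
    ("old_hit", datetime(2017, 10, 2)),
    ("old_hit", datetime(2017, 10, 3)),
    ("fresh", datetime(2017, 10, 15)),
    ("fresh", datetime(2017, 10, 16)),
    ("other", datetime(2017, 10, 16)),
]
recent_ranking = recent_popularity(toy_clicks, window_days=7)
print(recent_ranking)
```

Here `old_hit` has the most clicks overall but falls outside the 7-day window, so the ranking is `['fresh', 'other']`.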

In [None]:
# Recent popularity
recent_rank = build_recent_popularity(train_df, CONFIG["recent_window_days"])

def recent_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return [it for it in recent_rank if it not in seen][:k]


### Collaborative (SVD)
Collaborative filtering factorizes the user-item matrix (SVD) to capture latent preferences. Training takes longer than for the popularity or content-similarity methods, but it better models the implicit affinities between users and articles.
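
Once the factorization is available, scoring reduces to a dot product between a user vector and each item vector, which is what `build_collaborative_svd` does with `item_factors @ u_vec`. The sketch below shows only that scoring step. The two-factor vectors are hand-picked for illustration (a real model learns them from the click matrix), and `score_items` and `top_k_unseen` are hypothetical helper names.

```python
def score_items(user_vector, item_vectors):
    """Dot product between one user vector and every item vector."""
    return {
        item_id: sum(u * v for u, v in zip(user_vector, vec))
        for item_id, vec in item_vectors.items()
    }

def top_k_unseen(scores, seen, k):
    """Highest-scoring items the user has not interacted with."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [item_id for item_id in ranked if item_id not in seen][:k]

# Hypothetical latent vectors: factor 0 ~ "sports", factor 1 ~ "politics".
user_vector = [0.9, 0.1]
item_vectors = {
    "sports_1": [1.0, 0.0],
    "sports_2": [0.8, 0.2],
    "politics_1": [0.0, 1.0],
}
scores = score_items(user_vector, item_vectors)
recs = top_k_unseen(scores, seen={"sports_1"}, k=2)
print(recs)
```

The user leans towards the first factor, so the unseen sports article ranks ahead of the politics one: `['sports_2', 'politics_1']`.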

In [None]:
# Collaborative filtering (SVD)
collab_recommend, collab_meta = build_collaborative_svd(train_df, CONFIG["svd_components"])

def collaborative_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return collab_recommend(user_id, seen, k)


In [None]:
# Co-visitation models disabled in favor of Surprise


### Content (article-to-article similarity)
A content-based model builds a similarity matrix between articles from their metadata. Recommendations are made by projecting the user's history onto the nearby items in that space. This computation can be more expensive because it requires vectorizing the articles and computing their pairwise products.
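
A minimal sketch of this idea, assuming cosine similarity between article vectors and a plain sum over the user's history. The vectors stand in for TF-IDF rows and are invented, as are the helper names `cosine` and `recommend_similar`. The notebook's `build_item_similarity` also keeps only the 50 nearest neighbors per article, which is omitted here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend_similar(history, item_vectors, k):
    """Sum each unseen article's similarity to every article in the history."""
    scores = {}
    for candidate, cand_vec in item_vectors.items():
        if candidate in history:
            continue
        scores[candidate] = sum(cosine(cand_vec, item_vectors[h]) for h in history)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical article vectors.
item_vectors = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.9, 0.1, 0.0],
    "a3": [0.2, 1.0, 0.0],
    "a4": [0.0, 0.0, 1.0],
}
content_recs = recommend_similar(history={"a1"}, item_vectors=item_vectors, k=2)
print(content_recs)
```

With `a1` as the only history item, `a2` (nearly parallel to `a1`) comes first, followed by `a3`.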

In [None]:
# Initialize a results container for each training run
results = []
step_results = []


In [None]:
# Content-based recommendation (can be disabled)
ENABLE_CONTENT_MODEL = True  # Set to False to skip the content-similarity computation

if ENABLE_CONTENT_MODEL:
    item_similarity, sim_mode = build_item_similarity(train_df, metadata)

    def content_recommender(user_id: int, seen: set, k: int) -> List[int]:
        return recommend_from_similarity(user_id, train_histories, item_similarity, candidate_items, k)
else:
    sim_mode = "désactivé"
    content_recommender = None



## Separate training runs

The three Surprise strategies run in separate cells so that each block can be started, stopped, or rerun independently. This avoids waiting for the whole pipeline when only one training run is needed.


### Training 1: Surprise baseline (NormalPredictor)

This block first evaluates two sanity baselines (uniform random and most-popular), then evaluates Surprise's `NormalPredictor` baseline and computes Precision@K, Recall@K, MAP@K, NDCG@K, coverage, mean latency, plus RMSE and MAE on the test set.
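
To show how RMSE and MAE differ on the same predictions, here is a small standard-library sketch. The error values are hypothetical, and the helpers `rmse` and `mae` are illustrative stand-ins rather than the `surprise.accuracy` functions used by the notebook.

```python
import math

def rmse(errors):
    """Root mean squared error: large errors weigh more."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: every error weighs the same."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors (estimate minus true rating) for three test clicks.
errors = [0.5, -0.5, 2.0]
print(mae(errors), rmse(errors))
```

MAE is 1.0 here, while RMSE is the square root of 1.5 (about 1.22): the single large error of 2.0 pulls RMSE above MAE.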


In [None]:
random_result = evaluate_model(
    "Random (uniform)",
    random_recommender,
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)
results.append(random_result)

most_popular_result = evaluate_model(
    "Most-popular (train)",
    most_popular_recommender,
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)
results.append(most_popular_result)

if random_result["coverage@k"] < 0.01 or most_popular_result["hitrate@k"] < random_result["hitrate@k"]:
    print(
        "[WARN] Baseline sanity check failed: random coverage near 0 or "
        "most-popular hitrate worse than random. Evaluation may be broken."
    )

popularity_result = evaluate_model(
    "Baseline Surprise - NormalPredictor",
    seen_recommender_adapter(popularity_recommender, K),
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)

pop_meta_errors = surprise_error_metrics(
    pop_meta["estimator"], test_df, candidate_pool=candidate_items_global
)
popularity_result.update(pop_meta_errors)
results.append(popularity_result)
pd.DataFrame([popularity_result])






### Training 2: item-based KNN (Surprise)

This block runs `KNNBasic` in item-based mode with a **Pearson baseline** similarity and 60 neighbors
(`k=60`, `min_k=2`, `min_support=2`). This configuration pushes the model to rely on co-clicks
rather than plain popularity effects, so that its recommendations differ from those of the SVD.

Performance note: the model is created with `n_jobs=-1` inside `sim_options`, but Surprise documents only `name`, `user_based`, `min_support`, and `shrinkage` for that dictionary, so this key is probably ignored and the similarity matrix is likely computed on a single core. Expect a long fit on large catalogs (the model is CPU-only).
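
For an item-based model, Surprise's `min_support` is the minimum number of common users two items need for their similarity to be non-zero. The sketch below illustrates only that filtering step on hypothetical click histories. It does not compute the Pearson baseline similarity itself, and `co_click_counts` and `supported_pairs` are invented helper names.

```python
from collections import defaultdict
from itertools import combinations

def co_click_counts(user_histories):
    """Count, for each pair of items, how many users clicked both."""
    counts = defaultdict(int)
    for items in user_histories.values():
        for item_a, item_b in combinations(sorted(set(items)), 2):
            counts[(item_a, item_b)] += 1
    return dict(counts)

def supported_pairs(counts, min_support):
    """Keep only pairs with at least `min_support` common users."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical histories: user_id -> clicked article ids.
histories = {1: ["a", "b"], 2: ["a", "b", "c"], 3: ["b", "c"], 4: ["a", "d"]}
counts = co_click_counts(histories)
print(supported_pairs(counts, min_support=2))
```

With `min_support=2`, only the pairs `('a', 'b')` and `('b', 'c')` survive; pairs shared by a single user would get a similarity of zero.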

In [None]:
if False:  # KNN evaluation is currently disabled; set to True to run it
    item2item_result = evaluate_model(
        "Modèle KNNBasic item-based (Pearson baseline)",
        seen_recommender_adapter(itemknn_recommender, K),
        train_histories,
        ground_truth,
        candidate_items_global,
        K,
        unknown_test_items_rate=unknown_test_items_rate,
        popularity_fallback=popularity_order,
    )

    itemknn_meta_errors = surprise_error_metrics(
        itemknn_meta["estimator"], test_df, candidate_pool=candidate_items_global
    )
    item2item_result.update(itemknn_meta_errors)
    results.append(item2item_result)
    pd.DataFrame([item2item_result])







### Training 3: Surprise SVD

This block trains an SVD on implicit feedback (latent factors) with 64 dimensions, more epochs, and
stronger regularization than Surprise's defaults (`n_epochs=35`, `reg_all=0.06`, `lr_all=0.004`). The goal is to obtain
user/item profiles that are more contrasted than those of the neighborhood KNN.
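
To show what the learning rate and regularization control, here is a sketch of one stochastic gradient step for biased matrix factorization, the update rule given in Surprise's documentation for `SVD`. The toy rating, initial values, and function names `predict` and `sgd_step` are hypothetical. Surprise's actual implementation is compiled and loops over the whole trainset.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.004, reg=0.06):
    """One regularized SGD update for a single (user, item, rating) triple."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical starting point: global mean 1.0, true rating 2.0.
mu, b_u, b_i = 1.0, 0.0, 0.0
p_u, q_i = [0.1, -0.1], [0.2, 0.1]
r_ui = 2.0
before = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))
for _ in range(35):  # 35 passes over this single rating, mirroring n_epochs=35
    b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))
print(before, after)
```

The absolute error shrinks over the passes. A larger `reg` pulls biases and factors back towards zero, while a smaller `lr` makes each correction more cautious.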


In [None]:
svd_result = evaluate_model(
    "Modèle SVD Surprise (facteurs latents)",
    seen_recommender_adapter(svd_recommender, K),
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)

svd_meta_errors = surprise_error_metrics(
    svd_meta["estimator"], test_df, candidate_pool=candidate_items_global
)
svd_result.update(svd_meta_errors)
results.append(svd_result)
pd.DataFrame([svd_result])







### Session-size weighting and LightFM item-to-item

`session_size` is turned into a relevance weight with **1 / log1p(session_size)** to dampen
very long sessions while keeping short, focused sessions influential. The LightFM-style
item-to-item model trains latent item vectors on these weighted interactions and enriches
user representations with aggregated context features (environment, device, OS, country,
region, referrer). Recommendations then come from cosine neighbors in that latent space.
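
The weighting rule can be sketched in a few lines. This is an illustration of the formula only, not the project's `session_weight_from_size` helper; the name `sketch_session_weight` and the clamping of sizes below 1 are assumptions made here to avoid a division by zero.

```python
import math


def sketch_session_weight(session_size: int) -> float:
    """Return 1 / log1p(session_size), clamping sizes below 1 to 1 (assumption)."""
    size = max(int(session_size), 1)
    return 1.0 / math.log1p(size)


# A one-click session weighs about 1.44, a ten-click session about 0.42.
weights = {size: round(sketch_session_weight(size), 2) for size in (1, 10)}
```

The weight decreases monotonically with session length, so a click inside a long browsing session contributes less than a click from a short, focused visit.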


In [None]:

# Session-weighted Surprise SVD and LightFM item-to-item setup
svd_session_trainset = build_surprise_trainset(
    train_df, use_session_rating=CONFIG["svd_use_session_rating"]
)
svd_session_items = [int(svd_session_trainset.to_raw_iid(iid)) for iid in svd_session_trainset.all_items()]
svd_session_recommender, svd_session_meta = wrap_surprise_recommender(
    SVD(n_factors=CONFIG["svd_components"], n_epochs=35, reg_all=0.06, lr_all=0.004, random_state=CONFIG["random_seed"]),
    "Modèle SVD Surprise (session_weighted)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
    trainset=svd_session_trainset,
    items=svd_session_items,
)

lightfm_interactions, lightfm_weights, lightfm_user_features, lightfm_item_ids = build_interaction_matrices(
    train_df,
    CONTEXT_COLUMNS,
    use_user_features=CONFIG["lightfm_use_user_features"],
)
lightfm_model = LightFMApproximator(
    n_components=CONFIG["lightfm_components"],
    epochs=15,
    random_state=CONFIG["random_seed"],
).fit(
    lightfm_interactions,
    sample_weight=lightfm_weights,
    user_features=lightfm_user_features,
)
_, lightfm_item_embeddings = lightfm_model.get_item_representations()
lightfm_neighbors = precompute_item_neighbors(
    lightfm_item_embeddings, lightfm_item_ids, top_n=CONFIG["lightfm_item_neighbors"]
)


def svd_score_candidates(user_id: int, estimator, candidates: list[int], seen: set) -> Dict[int, float]:
    scores: Dict[int, float] = {}
    for iid in candidates:
        if iid in seen:
            continue
        pred = estimator.predict(int(user_id), int(iid), verbose=False)
        scores[int(iid)] = float(pred.est)
    return scores


def minmax_normalize(scores: Dict[int, float]) -> Dict[int, float]:
    if not scores:
        return {}
    values = list(scores.values())
    min_v, max_v = min(values), max(values)
    if max_v == min_v:
        return {i: 0.0 for i in scores}
    return {i: (v - min_v) / (max_v - min_v) for i, v in scores.items()}


def lightfm_item2item_recommender(user_id: int, seen: set, k: int) -> List[int]:
    scores = score_from_neighbors(train_histories.get(user_id, []), lightfm_neighbors, seen)
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen][:k]
    if len(recs) < k:
        for cand in candidate_items_global:
            if cand not in seen and cand not in recs:
                recs.append(cand)
            if len(recs) >= k:
                break
    return recs[:k]


def hybrid_svd_item2item(user_id: int, seen: set, k: int) -> List[int]:
    svd_scores = svd_score_candidates(user_id, svd_meta["estimator"], candidate_items_global, seen)
    item_scores = score_from_neighbors(train_histories.get(user_id, []), lightfm_neighbors, seen)
    svd_norm = minmax_normalize(svd_scores)
    item_norm = minmax_normalize(item_scores)
    alpha, beta = CONFIG["hybrid_weights"]
    combined_items = set(list(svd_norm.keys()) + list(item_norm.keys()))
    combined_scores = {
        iid: alpha * svd_norm.get(iid, 0.0) + beta * item_norm.get(iid, 0.0)
        for iid in combined_items
    }
    ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen][:k]
    if len(recs) < k:
        for cand in candidate_items_global:
            if cand not in seen and cand not in recs:
                recs.append(cand)
            if len(recs) >= k:
                break
    return recs[:k]




In [None]:

# Training 4: session-weighted SVD, LightFM item2item and hybrid variants
svd_session_result = evaluate_model(
    "SVD session_weighted",
    seen_recommender_adapter(svd_session_recommender, K),
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)
svd_session_errors = surprise_error_metrics(
    svd_session_meta["estimator"], test_df, candidate_pool=candidate_items_global
)
svd_session_result.update(svd_session_errors)
results.append(svd_session_result)

lightfm_result = evaluate_model(
    "Item2Item LightFM (latent voisins)",
    seen_recommender_adapter(lightfm_item2item_recommender, K),
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)
lightfm_result.update({"rmse": float("nan"), "mae": float("nan")})
results.append(lightfm_result)

hybrid_result = evaluate_model(
    "Hybrid SVD60 + Item2Item40",
    seen_recommender_adapter(hybrid_svd_item2item, K),
    train_histories,
    ground_truth,
    candidate_items_global,
    K,
    unknown_test_items_rate=unknown_test_items_rate,
    popularity_fallback=popularity_order,
)
hybrid_result.update({"rmse": float("nan"), "mae": float("nan")})
results.append(hybrid_result)

pd.DataFrame([svd_session_result, lightfm_result, hybrid_result])






In [None]:

svd_native_label = "Modèle SVD Surprise (facteurs latents)"
svd_session_label = "SVD session_weighted"
lightfm_label = "Item2Item LightFM (latent voisins)"
hybrid_label = "Hybrid SVD60 + Item2Item40"

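# Build the comparison from `results` directly: `results_df` is only created
# in the consolidated-results cell further down.
comparison_df = pd.DataFrame(results).drop(
    columns=["all_recommendations", "users_evaluated_list"], errors="ignore"
)
comparison_df = comparison_df[
    comparison_df["model"].isin(
        [svd_native_label, svd_session_label, lightfm_label, hybrid_label]
    )
].reset_index(drop=True)
comparison_df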



### Surprise models only
The former E* sections based on co-visitation are replaced by Surprise algorithms (NormalPredictor, KNNBasic, SVD).

#### Co-visitation variants removed
We now favor the Surprise algorithms to keep experimentation and deployment consistent.

In [None]:
# The co-visitation variants are replaced by the Surprise models above.


### Hybrid section removed
The co-visitation + popularity hybrid has been replaced by the more flexible Surprise SVD model.

In [None]:
# Hybrid section removed: the Surprise library covers the collaborative needs.


In [None]:
# Optuna is no longer needed for this Surprise-centered notebook.


## Consolidated results

After running the training blocks above, the metrics are aggregated to compare the approaches. Each row of the table summarizes precision, recall, MAP, NDCG, hit rate, coverage and mean latency per user. RMSE and MAE are computed alongside each Surprise model but are not part of this summary table.
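
For reference, the ranking metrics in the table can be sketched as follows for binary relevance. These are illustrative definitions under stated assumptions (precision divides by `k`, recommendation lists contain no duplicates); the `sketch_` names are hypothetical and the notebook's own `precision_at_k`, `recall_at_k` and `ndcg_at_k` may use different conventions.

```python
import math
from typing import Sequence, Set


def sketch_precision_at_k(recs: Sequence[int], relevant: Set[int], k: int) -> float:
    """Share of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recs[:k] if item in relevant)
    return hits / k


def sketch_recall_at_k(recs: Sequence[int], relevant: Set[int], k: int) -> float:
    """Share of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recs[:k] if item in relevant)
    return hits / len(relevant)


def sketch_ndcg_at_k(recs: Sequence[int], relevant: Set[int], k: int) -> float:
    """DCG of the top-k divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recs[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Precision and recall ignore where a hit lands inside the top-k, whereas NDCG discounts hits logarithmically by rank, which is why the table is sorted on NDCG first.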


In [None]:
candidate_items = candidate_items_global



In [None]:
# Aggregate the metrics once all training runs are finished
clean_columns = [
    "model",
    "users",
    "precision@k",
    "recall@k",
    "map@k",
    "ndcg@k",
    "hitrate@k",
    "coverage@k",
    "latency_per_user_s",
]

results_df = pd.DataFrame(results)
results_df = (
    results_df[clean_columns]
    .drop_duplicates(subset=["model"])
    .sort_values(["ndcg@k", "map@k"], ascending=False)
    .reset_index(drop=True)
)

display(results_df)



In [None]:
# Quick comparison of the top-K lists for one user
sample_user = eval_users[0] if eval_users else None
if sample_user is None:
    print("No user available for comparison")
else:
    seen = set(train_histories.get(sample_user, []))
    print(f"Test user: {sample_user}")
    print("KNNBasic item-based:", itemknn_recommender(sample_user, seen, K))
    print("Collaborative SVD:", svd_recommender(sample_user, seen, K))



In [None]:
results_steps = (
    results_df
    .sort_values(["ndcg@k", "precision@k"], ascending=False)
    .reset_index(drop=True)
)
print("Model comparison table (sorted by ndcg@k, then precision@k):")
results_steps



In [None]:
# Detailed metrics: hit rate, lifts vs. baseline and history-size cohorts
train_click_count = train_df.groupby("user_id").size().to_dict()

def assign_cohort(clicks: int) -> str:
    if 1 <= clicks <= 2:
        return "1-2 clicks"
    if 3 <= clicks <= 9:
        return "3-9 clicks"
    return "10+ clicks"

user_cohort = {user_id: assign_cohort(train_click_count.get(user_id, 0)) for user_id in train_histories}
coverage_lookup = {res["model"]: res.get("coverage@k", np.nan) for res in results}
recommendations_by_model = {res["model"]: res.get("all_recommendations", []) for res in results}
users_by_model = {res["model"]: res.get("users_evaluated_list", eval_users) for res in results}
baseline_label = svd_native_label if svd_native_label in results_df['model'].values else results_df['model'].iloc[0]

def safe_lift(value: float, baseline: float) -> float:
    if baseline is None or baseline == 0:
        return np.nan
    return value / baseline

cohort_rows = []
for model_name, recs in recommendations_by_model.items():
    users_for_model = users_by_model.get(model_name, [])
    buckets = {
        "ALL": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "1-2 clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "3-9 clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "10+ clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
    }

    for user_id, recs_user in zip(users_for_model, recs):
        gt = ground_truth[user_id]
        metrics = {
            "precision": precision_at_k(recs_user, gt, K),
            "recall": recall_at_k(recs_user, gt, K),
            "ndcg": ndcg_at_k(recs_user, gt, K),
            "hit": 1 if set(recs_user[:K]) & set(gt) else 0,
        }
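        labels = ["ALL", user_cohort.get(user_id, "unknown")]
        for label in labels:
            # Users without training history land in "unknown"; create that bucket on demand.
            bucket = buckets.setdefault(
                label, {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0}
            )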
            bucket["precisions"].append(metrics["precision"])
            bucket["recalls"].append(metrics["recall"])
            bucket["ndcgs"].append(metrics["ndcg"])
            bucket["hits"] += metrics["hit"]
            bucket["users"] += 1

    for cohort, bucket in buckets.items():
        users = bucket["users"]
        cohort_rows.append(
            {
                "model": model_name,
                "cohort": cohort,
                "users": users,
                "precision@k": float(np.mean(bucket["precisions"])) if users else 0.0,
                "recall@k": float(np.mean(bucket["recalls"])) if users else 0.0,
                "ndcg@k": float(np.mean(bucket["ndcgs"])) if users else 0.0,
                "hitrate@k": bucket["hits"] / users if users else 0.0,
                "coverage@k": coverage_lookup.get(model_name, np.nan),
            }
        )

cohort_df = pd.DataFrame(cohort_rows)
baseline_rows = cohort_df[cohort_df["model"] == baseline_label].set_index("cohort")
for metric in ["precision@k", "recall@k", "ndcg@k"]:
    cohort_df[f"lift_{metric}_vs_baseline"] = cohort_df.apply(
        lambda row: safe_lift(
            row[metric],
            float(baseline_rows.loc[row["cohort"], metric])
            if row["cohort"] in baseline_rows.index
            else np.nan,
        ),
        axis=1,
    )

cohort_df = cohort_df.sort_values(["cohort", "ndcg@k", "precision@k"], ascending=[True, False, False]).reset_index(drop=True)
cohort_df


## Analysis & MVP model choice

The ranking highlights several trade-offs:
- **Relevance**: global popularity achieves the best NDCG@5/MAP@5, a sign that sorting by volume remains hard to beat on this small synthetic dataset.
- **Diversity**: item2item covers three times as many articles, which reduces the risk of a tunnel effect.
- **Latency**: all approaches are very fast (milliseconds), with popularity remaining the simplest.

The MVP choice tips toward global popularity only if the goal is maximum relevance and a very quick deployment. For a real product, it would be worth testing a hybrid approach: start with popularity for new users, then switch to item2item as soon as a history builds up, in order to increase coverage without sacrificing quality.
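
The switching strategy suggested above could look like the sketch below. Everything here is hypothetical: the name `switching_recommender`, the `min_history` threshold of 3 clicks and the `personalised` callable (standing in for an item2item recommender) are illustration choices, not part of this notebook.

```python
from typing import Callable, List, Sequence


def switching_recommender(
    history: Sequence[int],
    k: int,
    popularity_order: Sequence[int],
    personalised: Callable[[List[int], int], List[int]],
    min_history: int = 3,
) -> List[int]:
    """Serve popularity to new users, personalised items once history exists."""
    seen = set(history)
    recs: List[int] = []
    # Only trust the personalised model when enough distinct clicks are available.
    if len(seen) >= min_history:
        for item in personalised(list(history), k):
            if item not in seen and item not in recs:
                recs.append(item)
    # Popularity fills the list for cold-start users or short personalised lists.
    for item in popularity_order:
        if len(recs) >= k:
            break
        if item not in seen and item not in recs:
            recs.append(item)
    return recs[:k]
```

Because popularity is used both as the cold-start policy and as the filler, every user receives `k` items whenever the catalog holds enough unseen articles, and coverage grows as more users cross the history threshold.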

In [None]:

best_row = results_df.iloc[0]
justification = f"""
## MVP model choice

Selected model: **{best_row['model']}**

Main reasons:
- NDCG@{K} = {best_row['ndcg@k']:.4f}, MAP@{K} = {best_row['map@k']:.4f}, Precision@{K} = {best_row['precision@k']:.4f}, Recall@{K} = {best_row['recall@k']:.4f}
- Coverage = {best_row['coverage@k']:.4f} across {len(candidate_items)} candidate articles.
- Mean latency per user = {best_row['latency_per_user_s']:.6f} s (CPU).
- Complexity: {'implementation optimized via Surprise (SVD/KNN)' if 'SVD' in best_row['model'] else 'Surprise-based implementation'}, compatible with Azure Functions.
- User cold start handled through global popularity.

Note: adjust `content_pca_components` to shrink the embeddings in production if needed.
"""
choice_path = Path(CONFIG["artifacts_dir"]) / "model_choice.md"
# Make sure the artifacts directory exists, and write UTF-8 explicitly.
choice_path.parent.mkdir(parents=True, exist_ok=True)
choice_path.write_text(justification, encoding="utf-8")
print(justification)



In [None]:

results_path_csv = Path(CONFIG["artifacts_dir"]) / "results.csv"
results_path_json = Path(CONFIG["artifacts_dir"]) / "results.json"
results_df.to_csv(results_path_csv, index=False)
results_df.to_json(results_path_json, orient="records", lines=True)
print(f"Results saved to {results_path_csv} and {results_path_json}")



### Deployment (application and Azure Functions)

The **Surprise SVD** model is exported for the Flask application and the Azure Function. The
hyperparameters mirror the notebook configuration (latent factors, `lr_all`, `reg_all`), while
the KNN model remains available for local comparison.


## Conclusion

This notebook shows how to compare recommendation strategies with a reproducible procedure: temporal split, training, multi-metric evaluation and saving of the results. The experiments show that global popularity remains a safe starting point, but that more personalized models (item2item or SVD) bring diversity as soon as some history is available. The natural next steps are to run the tests on the real Kaggle data, add business metrics (simulated click-through rate, coverage per category) and prototype a popularity + item2item hybrid in an Azure Function to validate its behavior in production.