# Evaluation of a My Content recommender system

Notebook for training and comparing several recommendation approaches on the Kaggle dataset **news-portal-user-interactions-by-globocom**. The goal is to make every step explicit, from data loading to the final model choice.

> This notebook now aligns **all recommendation approaches on the Surprise library** (https://surprise.readthedocs.io/) in order to benefit from standardized collaborative-filtering algorithms that are easy to deploy.

In [443]:
# Imports & Config
from __future__ import annotations
import json
import os
import pickle
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple, Union
import optuna

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

# Configuration
CONFIG = {
    "clicks_dir": "../data/news-portal-user-interactions-by-globocom/clicks",
    "metadata_path": "../data/news-portal-user-interactions-by-globocom/articles_metadata.csv",
    "embeddings_path": "../data/news-portal-user-interactions-by-globocom/articles_embeddings.pickle",
    "max_click_files": 20,
    "artifacts_dir": "../artifacts/evaluation",
    "k": 5,
    "train_ratio": 0.8,
    "recent_window_days": 7,
    "random_seed": 42,
    "svd_components": 64,
    "content_pca_components": None,
    "covisit_top_n_neighbors": 20,
    "covisit_similarity": "cosine",
    "covisit_hybrid_alpha": 0.7350738721058192,
    "svd_hazard_ndcg": 0.02,
    "min_user_interactions": 3,
    "min_item_interactions": 5,
}
np.random.seed(CONFIG["random_seed"])
Path(CONFIG["artifacts_dir"]).mkdir(parents=True, exist_ok=True)
print("Config ready", CONFIG)

from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD

Config ready {'clicks_dir': '../data/news-portal-user-interactions-by-globocom/clicks', 'metadata_path': '../data/news-portal-user-interactions-by-globocom/articles_metadata.csv', 'embeddings_path': '../data/news-portal-user-interactions-by-globocom/articles_embeddings.pickle', 'max_click_files': 20, 'artifacts_dir': '../artifacts/evaluation', 'k': 5, 'train_ratio': 0.8, 'recent_window_days': 7, 'random_seed': 42, 'svd_components': 64, 'content_pca_components': None, 'covisit_top_n_neighbors': 20, 'covisit_similarity': 'cosine', 'covisit_hybrid_alpha': 0.7350738721058192, 'svd_hazard_ndcg': 0.02, 'min_user_interactions': 3, 'min_item_interactions': 5}


## Context

We want to offer each reader a Top-5 of articles likely to interest them. The notebook walks through the whole process: data preparation, construction of several model families, then comparison using ranking metrics.

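## Data

The expected files are located under `../data/news-portal-user-interactions-by-globocom/` (see `CONFIG`): the hourly click files `clicks/clicks_hour_*.csv`, `articles_metadata.csv` and `articles_embeddings.pickle`.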
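
Before running the loaders, it can help to check that the expected layout is in place. The sketch below is only an illustration: `check_expected_files` is a hypothetical helper (it is not part of the notebook), and the file names are the ones that appear in `CONFIG`.

```python
from pathlib import Path
from typing import Dict


def check_expected_files(data_root: str) -> Dict[str, bool]:
    """Report which of the expected dataset files exist under `data_root` (hypothetical helper)."""
    root = Path(data_root)
    clicks_dir = root / "clicks"
    return {
        # At least one hourly click file must be present.
        "clicks/clicks_hour_*.csv": clicks_dir.is_dir() and any(clicks_dir.glob("clicks_hour_*.csv")),
        "articles_metadata.csv": (root / "articles_metadata.csv").is_file(),
        "articles_embeddings.pickle": (root / "articles_embeddings.pickle").is_file(),
    }


status = check_expected_files("../data/news-portal-user-interactions-by-globocom")
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If the click files are missing, the notebook's `load_clicks` falls back to a small synthetic dataset, so a check like this mainly avoids evaluating on synthetic data by accident.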

In [444]:
# Load data utilities

def detect_timestamp_column(df: pd.DataFrame) -> str:
    """Detect the timestamp-like column name."""
    candidates = ["click_timestamp", "timestamp", "event_time", "ts", "time"]
    for col in df.columns:
        if col in candidates or col.lower() in candidates:
            return col
    raise ValueError("No timestamp-like column found. Expected one of: " + ",".join(candidates))


def detect_article_column(df: pd.DataFrame) -> str:
    """Detect the article/item column name."""
    candidates = ["click_article_id", "clicked_article_id", "article_id", "item_id", "content_id"]
    for col in df.columns:
        if col in candidates:
            return col
    raise ValueError("No article id column found. Expected one of: " + ",".join(candidates))


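def infer_unix_unit(values: pd.Series) -> str:
    """Guess the Unix epoch unit (s, ms, us or ns) from the magnitude of the values."""
    numeric = pd.to_numeric(values, errors="coerce").dropna()
    if numeric.empty:
        return "s"
    max_abs = numeric.abs().max()
    # Present-day epochs are roughly 1e9 s, 1e12 ms, 1e15 us and 1e18 ns.
    if max_abs >= 1e17:
        return "ns"
    if max_abs >= 1e14:
        return "us"
    if max_abs >= 1e11:
        return "ms"
    return "s"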


def to_timestamp(series: pd.Series) -> pd.Series:
    if pd.api.types.is_datetime64_any_dtype(series):
        return pd.to_datetime(series)
    if pd.api.types.is_numeric_dtype(series):
        unit = infer_unix_unit(series)
        return pd.to_datetime(series, unit=unit, errors="coerce")

    converted = pd.to_datetime(series, errors="coerce")
    if converted.notna().any():
        return converted

    unit = infer_unix_unit(series)
    return pd.to_datetime(series, unit=unit, errors="coerce")


def list_click_files(path: Union[str, Path]) -> List[Path]:
    path_obj = Path(path)
    if path_obj.is_file():
        return [path_obj]
    if path_obj.is_dir():
        return sorted(path_obj.glob("clicks_hour_*.csv"))
    return []


def create_synthetic_clicks(path: str, n_users: int = 50, n_items: int = 120, days: int = 30, interactions_per_user: int = 25) -> pd.DataFrame:
    """Create a small synthetic clicks dataset to keep the notebook runnable."""
    rng = np.random.default_rng(CONFIG["random_seed"])
    start = pd.Timestamp("2022-01-01")
    records = []
    for user in range(1, n_users + 1):
        offsets = rng.integers(0, days, size=interactions_per_user)
        timestamps = [start + pd.Timedelta(int(o), unit="D") for o in sorted(offsets.tolist())]
        articles = rng.integers(1, n_items + 1, size=interactions_per_user)
        for ts, art in zip(timestamps, articles):
            records.append({"user_id": int(user), "article_id": int(art), "timestamp": ts})
    df = pd.DataFrame(records).sort_values("timestamp").reset_index(drop=True)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    print(
        f"Synthetic clicks dataset created at {path} "
        f"(users={n_users}, items={n_items}, interactions={len(df)})"
    )
    return df


def load_clicks(path: str, max_files: Optional[int] = None) -> pd.DataFrame:
    """Load clicks data from the Globo hourly files, with a safety cap."""
    files = list_click_files(path)
    total_files = len(files)
    if not files:
        print(f"Clicks directory not found at {path}. Generating a synthetic sample for demonstration.")
        return create_synthetic_clicks(Path(path) / "clicks_hour_000.csv")

    if max_files is not None:
        print(f"Limite explicite max_files={max_files}, total détecté={total_files}")
        files = files[:max_files]

    print(f"Chargement de {len(files)} fichiers clicks (total détecté={total_files}, limite={max_files if max_files is not None else 'aucune'})")
    frames = []
    for file in files:
        df = pd.read_csv(file)
        ts_col = detect_timestamp_column(df)
        article_col = detect_article_column(df)
        df[ts_col] = to_timestamp(df[ts_col])
        df = df.rename(columns={ts_col: "timestamp", article_col: "article_id"})
        frames.append(df[["user_id", "article_id", "timestamp"]])

    combined = pd.concat(frames, ignore_index=True)
    combined = combined.sort_values("timestamp").reset_index(drop=True)
    print(f"Clicks agrégés : {len(combined)} lignes, {combined['user_id'].nunique()} utilisateurs uniques, {combined['article_id'].nunique()} articles uniques.")
    return combined


def load_metadata(path: str) -> Optional[pd.DataFrame]:
    """Load article metadata if available."""
    if not os.path.exists(path):
        print(f"Metadata file not found at {path}. Utilisation du pipeline Surprise uniquement si les métadonnées sont absentes.")
        return None
    meta = pd.read_csv(path)
    if "article_id" not in meta.columns:
        print("Metadata missing 'article_id' column. Ignoring metadata.")
        return None
    return meta


clicks = load_clicks(CONFIG["clicks_dir"], max_files=CONFIG["max_click_files"])
metadata = load_metadata(CONFIG["metadata_path"])
print(clicks.head())
print("Metadata loaded:", metadata is not None)


Limite explicite max_files=20, total détecté=385
Chargement de 20 fichiers clicks (total détecté=385, limite=20)
Clicks agrégés : 81245 lignes, 27911 utilisateurs uniques, 3353 articles uniques.
   user_id  article_id               timestamp
0       59      234853 2017-10-01 03:00:00.026
1       79      159359 2017-10-01 03:00:01.702
2      154       96663 2017-10-01 03:00:04.207
3      111      202436 2017-10-01 03:00:14.140
4       70      119592 2017-10-01 03:00:18.863
Metadata loaded: True
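
The loader converts raw click timestamps by guessing the Unix epoch unit from the magnitude of the values. The sketch below illustrates the idea on a single value; `guess_epoch_unit` is a hypothetical stand-in for the notebook's `infer_unix_unit`, and the raw value is an assumption (the first timestamp of the output above, re-encoded in milliseconds).

```python
import pandas as pd


def guess_epoch_unit(value: float) -> str:
    """Guess the Unix epoch unit from the order of magnitude (hypothetical helper)."""
    magnitude = abs(value)
    # Present-day epochs are roughly 1e9 s, 1e12 ms, 1e15 us and 1e18 ns.
    if magnitude >= 1e17:
        return "ns"
    if magnitude >= 1e14:
        return "us"
    if magnitude >= 1e11:
        return "ms"
    return "s"


raw_click_timestamp = 1506826800026  # assumed raw value, in milliseconds
unit = guess_epoch_unit(raw_click_timestamp)
ts = pd.to_datetime(raw_click_timestamp, unit=unit)
print(unit, ts)
```

Interpreting the same integer as seconds would put the click tens of thousands of years in the future, which is why the unit has to be settled before sorting chronologically.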


## Exploratory data analysis

A quick snapshot of the source files right after loading:
- row count and column names of the click files
- volume and integrity of the article metadata
- dimensions and structure of the `articles_embeddings` file.

In [445]:
# Quick EDA on the source data
import pickle
from pathlib import Path
from collections.abc import Mapping


def summarize_timestamps(series: pd.Series):
    series = pd.to_datetime(series)
    daily = series.dt.date.value_counts().sort_index().rename_axis("date").reset_index(name="nb_clicks")
    hourly = series.dt.hour.value_counts().sort_index().rename_axis("hour").reset_index(name="nb_clicks")
    return series.min(), series.max(), daily, hourly


def describe_structure(obj, prefix="embeddings", max_depth=4):
    entries = []

    def add_entry(path, value, note=None):
        entry = {"chemin": path, "type": type(value).__name__}
        if hasattr(value, "shape"):
            entry["shape"] = tuple(getattr(value, "shape"))
        elif hasattr(value, "__len__") and not isinstance(value, (str, bytes)):
            entry["len"] = len(value)
        if hasattr(value, "dtype"):
            entry["dtype"] = str(getattr(value, "dtype"))
        if note:
            entry["note"] = note
        if isinstance(value, np.ndarray) and value.dtype.names:
            entry["dtype_fields"] = list(value.dtype.names)
        if isinstance(value, np.ndarray) and value.ndim == 1 and len(value) > 0 and not isinstance(value[0], (np.ndarray, list, tuple, Mapping)):
            entry["exemple"] = repr(value[:3].tolist())
        entries.append(entry)

    def walk(value, path, depth):
        add_entry(path, value)
        if depth >= max_depth:
            return
        if isinstance(value, Mapping):
            for k, v in value.items():
                walk(v, f"{path}.{k}", depth + 1)
        elif isinstance(value, (list, tuple, np.ndarray)) and not isinstance(value, (str, bytes)):
            if len(value) > 0:
                walk(value[0], f"{path}[0]", depth + 1)

    walk(obj, prefix, 0)
    return entries


click_files = list_click_files(CONFIG["clicks_dir"])
print(f"Nombre total de fichiers clicks détectés: {len(click_files)}")
if not click_files:
    print("Aucun fichier clicks trouvé au chemin configuré. Vérifiez le téléchargement des données.")

files_for_eda = click_files[:2]
per_file_stats = []
for file in files_for_eda:
    df_file = pd.read_csv(file)
    ts_col = detect_timestamp_column(df_file)
    article_col = detect_article_column(df_file)
    timestamps = to_timestamp(df_file[ts_col])
    per_file_stats.append(
        {
            "fichier": file.name,
            "nb_lignes": len(df_file),
            "colonnes": ", ".join(df_file.columns),
            "articles_uniques": df_file[article_col].nunique(),
            "horodatage_min": timestamps.min(),
            "horodatage_max": timestamps.max(),
        }
    )
if per_file_stats:
    display(pd.DataFrame(per_file_stats))
else:
    print("Pas assez de fichiers pour réaliser une EDA détaillée par fichier.")

print("=== Clicks (agrégés) ===")
if clicks.empty:
    print("Aucun clic chargé. Vérifier le chemin ou augmenter max_click_files.")
else:
    clicks_summary = {
        "nb_lignes": len(clicks),
        "colonnes": ", ".join(clicks.columns),
        "utilisateurs_uniques": clicks['user_id'].nunique() if 'user_id' in clicks else None,
        "articles_uniques": clicks['article_id'].nunique() if 'article_id' in clicks else None,
    }
    display(pd.DataFrame([clicks_summary]))

    total_articles = None
    if metadata is not None and 'article_id' in metadata:
        total_articles = metadata['article_id'].nunique()
    elif 'article_id' in clicks:
        total_articles = clicks['article_id'].nunique()

    total_clients = clicks['user_id'].nunique() if 'user_id' in clicks else None
    print("Synthèse globale (articles / clients)")
    display(pd.DataFrame([{
        'nombre_total_articles': total_articles,
        'nombre_total_clients': total_clients,
    }]))

    ts_min, ts_max, daily, hourly = summarize_timestamps(clicks['timestamp'])
    display(pd.DataFrame([
        {
            'horodatage_min': ts_min,
            'horodatage_max': ts_max,
            'fenetre_jours': (ts_max - ts_min).days + 1,
        }
    ]))
    print("Répartition par jour (jusqu'à 10 premières valeurs)")
    display(daily.head(10))
    print("Répartition par heure (0-23)")
    display(hourly)

print("=== Métadonnées des articles ===")
if metadata is None:
    print("Aucun fichier metadata chargé.")
else:
    meta_summary = {
        "nb_articles": len(metadata),
        "colonnes": ", ".join(metadata.columns),
        "articles_uniques": metadata['article_id'].nunique() if 'article_id' in metadata else None,
    }
    display(pd.DataFrame([meta_summary]))
    missing = metadata.isna().sum().sort_values(ascending=False)
    display(missing.to_frame('valeurs_manquantes'))
    if 'created_at_ts' in metadata.columns:
        created = to_timestamp(metadata['created_at_ts'])
        display(pd.DataFrame([{'premier_article': created.min(), 'dernier_article': created.max()}]))
    if 'article_id' in metadata.columns:
        overlap = set(clicks['article_id'].unique()) if 'article_id' in clicks.columns else set()
        coverage = len(overlap & set(metadata['article_id'].unique()))
        print(f"Articles présents dans clicks et metadata: {coverage}")


print("=== Embeddings d'articles ===")
embeddings_path = Path(CONFIG['embeddings_path'])
if embeddings_path.exists():
    with embeddings_path.open('rb') as f:
        embeddings_obj = pickle.load(f)
    print(f"Type chargé: {type(embeddings_obj)}")

    def summarize_matrix(mat):
        stats = {
            'shape': getattr(mat, 'shape', None),
            'dtype': getattr(mat, 'dtype', None),
        }

        dim_values = []
        shape = getattr(mat, 'shape', None)
        if shape is not None and len(shape) >= 2:
            dim_values.append(shape[1])
        elif isinstance(mat, (list, tuple, np.ndarray)):
            for row in mat:
                if hasattr(row, '__len__') and not isinstance(row, (str, bytes)):
                    try:
                        dim_values.append(len(row))
                    except TypeError:
                        continue

        if dim_values:
            stats.update({
                'profondeur_min': min(dim_values),
                'profondeur_moyenne': float(np.mean(dim_values)),
                'profondeur_max': max(dim_values),
            })

        if hasattr(mat, 'shape') and len(getattr(mat, 'shape', [])) == 2:
            norms = np.linalg.norm(mat, axis=1)
            stats.update(
                {
                    'nb_vectors': mat.shape[0],
                    'dim': mat.shape[1],
                    'norm_min': norms.min(),
                    'norm_max': norms.max(),
                    'norm_moyenne': norms.mean(),
                }
            )
        return stats

    base_structure = describe_structure(embeddings_obj, max_depth=4)

    if isinstance(embeddings_obj, dict):
        keys = list(embeddings_obj.keys())
        print(f"Clés disponibles: {keys}")
        matrix = embeddings_obj.get('embeddings')
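        ids = embeddings_obj.get('articles_ids')
        if ids is None:
            # Avoid `a or b` here: the truth value of a NumPy array of ids is ambiguous and raises ValueError.
            ids = embeddings_obj.get('article_ids')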

        structure = base_structure.copy()
        if ids is not None:
            structure.insert(0, {
                'chemin': 'embeddings.article_ids',
                'type': type(ids).__name__,
                'len': len(ids),
                'note': "Identifiants d'articles fournis dans le fichier",
            })
        if structure:
            print("Structure détaillée de l'objet d'embeddings (par chemin de clé):")
            display(pd.DataFrame(structure))

        if matrix is not None:
            stats = summarize_matrix(matrix)
            stats.update(
                {
                    'colonnes': ", ".join(keys),
                    'nb_articles_ids': len(ids) if ids is not None else None,
                    'ids_uniques': len(set(ids)) if ids is not None else None,
                    'couverture_metadata': len(set(ids) & set(metadata['article_id']))
                    if (metadata is not None and ids is not None and 'article_id' in metadata)
                    else None,
                    'couverture_clicks': len(set(ids) & set(clicks['article_id']))
                    if (not clicks.empty and ids is not None and 'article_id' in clicks)
                    else None,
                }
            )
            display(pd.DataFrame([stats]))

            if ids is not None:
                sample_ids = ids[:5] if len(ids) >= 5 else ids
                print("Aperçu des premiers article_id liés aux embeddings:")
                display(pd.DataFrame({'article_id': sample_ids}))

            preview_cols = [f"emb_{i}" for i in range(min(5, matrix.shape[1] if hasattr(matrix, 'shape') else 0))]
            if preview_cols:
                preview = pd.DataFrame(matrix[:5, : len(preview_cols)], columns=preview_cols)
                if ids is not None:
                    preview.insert(0, 'article_id', ids[: len(preview)])
                print("Aperçu des embeddings (quelques colonnes et premières lignes):")
                display(preview)
                print("Colonnes affichées pour l'aperçu des embeddings:")
                print(", ".join(preview.columns))

                if ids is not None and metadata is not None and 'article_id' in metadata:
                    meta_cols = [c for c in ['title', 'category_id', 'created_at_ts', 'publisher'] if c in metadata.columns]
                    meta_sample = (
                        preview[['article_id']]
                        .merge(metadata[['article_id'] + meta_cols], on='article_id', how='left')
                    )
                    if 'created_at_ts' in meta_sample.columns:
                        meta_sample['created_at_ts'] = to_timestamp(meta_sample['created_at_ts'])
                    print("Exemple de liaison embedding -> metadata sur article_id (5 premières lignes):")
                    display(meta_sample.head())
        else:
            print("Aucune matrice d'embeddings explicite trouvée dans l'objet chargé.")
    elif hasattr(embeddings_obj, 'shape'):
        stats = summarize_matrix(embeddings_obj)

        inferred_ids = None
        mapping_note = None
        if metadata is not None and 'article_id' in metadata and hasattr(embeddings_obj, 'shape'):
            if embeddings_obj.shape[0] == len(metadata):
                inferred_ids = metadata['article_id'].reset_index(drop=True)
                mapping_note = (
                    "Aucun article_id explicite fourni ; association supposée alignée sur l'ordre des metadata."
                )
            else:
                mapping_note = (
                    "Aucun article_id dans le fichier d'embeddings et la taille ne correspond pas aux metadata : "
                    f"{embeddings_obj.shape[0]} vecteurs vs {len(metadata)} lignes de metadata."
                )
        else:
            mapping_note = (
                "Aucun identifiant d'article n'est présent dans le fichier d'embeddings (mapping externe requis)."
            )

        structure = base_structure.copy()
        if inferred_ids is not None:
            structure.insert(0, {
                'chemin': 'embeddings.article_id (inféré)',
                'type': type(inferred_ids).__name__,
                'len': len(inferred_ids),
                'note': "Alignement supposé sur metadata.article_id (index identique).",
            })
        if structure:
            print("Structure détaillée de l'objet d'embeddings (par chemin de clé):")
            display(pd.DataFrame(structure))

        if mapping_note:
            print(mapping_note)

        if inferred_ids is not None:
            stats.update(
                {
                    'ids_source': 'metadata.article_id (alignement par index)',
                    'ids_uniques': inferred_ids.nunique(),
                    'couverture_metadata': len(set(inferred_ids) & set(metadata['article_id'])),
                    'couverture_clicks': len(set(inferred_ids) & set(clicks['article_id'])) if not clicks.empty else None,
                }
            )

        display(pd.DataFrame([stats]))
        if len(getattr(embeddings_obj, 'shape', [])) >= 2 and embeddings_obj.shape[1] > 0:
            preview_cols = [f"emb_{i}" for i in range(min(5, embeddings_obj.shape[1]))]
            preview = pd.DataFrame(embeddings_obj[:5, : len(preview_cols)], columns=preview_cols)
            if inferred_ids is not None:
                preview.insert(0, 'article_id', inferred_ids.iloc[: len(preview)].values)
            print("Aperçu direct de la matrice d'embeddings:")
            display(preview)
            print("Colonnes affichées pour l'aperçu des embeddings:")
            print(", ".join(preview.columns))

            if inferred_ids is not None and metadata is not None:
                meta_cols = [c for c in ['title', 'category_id', 'created_at_ts', 'publisher'] if c in metadata.columns]
                meta_sample = preview[['article_id']].merge(
                    metadata[['article_id'] + meta_cols], on='article_id', how='left'
                )
                if 'created_at_ts' in meta_sample.columns:
                    meta_sample['created_at_ts'] = to_timestamp(meta_sample['created_at_ts'])
                print("Exemple de liaison embedding -> metadata sur article_id (inféré):")
                display(meta_sample.head())
        else:
            print("Objet chargé non structuré, utilisez type/len pour investiguer.")
else:
    print(f"Fichier d'embeddings introuvable à {embeddings_path}")




Nombre total de fichiers clicks détectés: 385


| | fichier | nb_lignes | colonnes | articles_uniques | horodatage_min | horodatage_max |
|---|---|---|---|---|---|---|
| 0 | clicks_hour_000.csv | 1883 | user_id, session_id, session_start, session_size, click_article_id, click_timestamp, click_environment, click_deviceGroup, click_os, click_country, click_region, click_referrer_type | 323 | 2017-10-01 03:00:00.026 | 2017-10-03 02:35:54.157 |
| 1 | clicks_hour_001.csv | 1415 | user_id, session_id, session_start, session_size, click_article_id, click_timestamp, click_environment, click_deviceGroup, click_os, click_country, click_region, click_referrer_type | 289 | 2017-10-01 03:36:28.615 | 2017-10-02 02:41:03.190 |


=== Clicks (agrégés) ===


| | nb_lignes | colonnes | utilisateurs_uniques | articles_uniques |
|---|---|---|---|---|
| 0 | 81245 | user_id, article_id, timestamp | 27911 | 3353 |


Synthèse globale (articles / clients)


| | nombre_total_articles | nombre_total_clients |
|---|---|---|
| 0 | 364047 | 27911 |


| | horodatage_min | horodatage_max | fenetre_jours |
|---|---|---|---|
| 0 | 2017-10-01 03:00:00.026 | 2017-10-24 23:48:51.578 | 24 |


Répartition par jour (jusqu'à 10 premières valeurs)


| | date | nb_clicks |
|---|---|---|
| 0 | 2017-10-01 | 80483 |
| 1 | 2017-10-02 | 683 |
| 2 | 2017-10-03 | 45 |
| 3 | 2017-10-04 | 8 |
| 4 | 2017-10-05 | 7 |
| 5 | 2017-10-07 | 4 |
| 6 | 2017-10-08 | 2 |
| 7 | 2017-10-16 | 1 |
| 8 | 2017-10-22 | 5 |
| 9 | 2017-10-23 | 3 |


Répartition par heure (0-23)


| | hour | nb_clicks |
|---|---|---|
| 0 | 0 | 273 |
| 1 | 1 | 111 |
| 2 | 2 | 105 |
| 3 | 3 | 2255 |
| 4 | 4 | 1276 |
| 5 | 5 | 816 |
| 6 | 6 | 541 |
| 7 | 7 | 628 |
| 8 | 8 | 949 |
| 9 | 9 | 2268 |


=== Métadonnées des articles ===


| | nb_articles | colonnes | articles_uniques |
|---|---|---|---|
| 0 | 364047 | article_id, category_id, created_at_ts, publisher_id, words_count | 364047 |


| | valeurs_manquantes |
|---|---|
| article_id | 0 |
| category_id | 0 |
| created_at_ts | 0 |
| publisher_id | 0 |
| words_count | 0 |


| | premier_article | dernier_article |
|---|---|---|
| 0 | 2006-09-27 11:14:35 | 2018-03-13 12:12:30 |


Articles présents dans clicks et metadata: 3353
=== Embeddings d'articles ===
Type chargé: <class 'numpy.ndarray'>
Structure détaillée de l'objet d'embeddings (par chemin de clé):


| | chemin | type | len | note | shape | dtype | exemple |
|---|---|---|---|---|---|---|---|
| 0 | embeddings.article_id (inféré) | Series | 364047.0 | Alignement supposé sur metadata.article_id (index identique). | | | |
| 1 | embeddings | ndarray | | | (364047, 250) | float32 | |
| 2 | embeddings[0] | ndarray | | | (250,) | float32 | [-0.16118301451206207, -0.9572331309318542, -0.13794444501399994] |
| 3 | embeddings[0][0] | float32 | | | () | float32 | |


Aucun article_id explicite fourni ; association supposée alignée sur l'ordre des metadata.


| | shape | dtype | profondeur_min | profondeur_moyenne | profondeur_max | nb_vectors | dim | norm_min | norm_max | norm_moyenne | ids_source | ids_uniques | couverture_metadata | couverture_clicks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (364047, 250) | float32 | 250 | 250.0 | 250 | 364047 | 250 | 1.845483 | 11.18309 | 7.939456 | metadata.article_id (alignement par index) | 364047 | 364047 | 3353 |


Aperçu direct de la matrice d'embeddings:


| | article_id | emb_0 | emb_1 | emb_2 | emb_3 | emb_4 |
|---|---|---|---|---|---|---|
| 0 | 0 | -0.161183 | -0.957233 | -0.137944 | 0.050855 | 0.830055 |
| 1 | 1 | -0.523216 | -0.974058 | 0.738608 | 0.155234 | 0.626294 |
| 2 | 2 | -0.619619 | -0.97296 | -0.20736 | -0.128861 | 0.044748 |
| 3 | 3 | -0.740843 | -0.975749 | 0.391698 | 0.641738 | -0.268645 |
| 4 | 4 | -0.279052 | -0.972315 | 0.685374 | 0.113056 | 0.238315 |


Colonnes affichées pour l'aperçu des embeddings:
article_id, emb_0, emb_1, emb_2, emb_3, emb_4
Exemple de liaison embedding -> metadata sur article_id (inféré):


| | article_id | category_id | created_at_ts |
|---|---|---|---|
| 0 | 0 | 0 | 2017-12-13 05:53:39 |
| 1 | 1 | 1 | 2014-07-14 12:45:36 |
| 2 | 2 | 1 | 2014-08-22 00:35:06 |
| 3 | 3 | 1 | 2014-08-19 17:11:53 |
| 4 | 4 | 1 | 2014-08-03 13:06:11 |


## Article embeddings

This file contains the **article embeddings**, i.e. a **numerical representation of the textual content** that makes it possible to compare articles semantically.

* **Format**: NumPy matrix of shape `(N, 250)` in `float32` (here `N = 364047`)
* **1 row = 1 article**
* **250 columns = latent dimensions**
* Individual values have no direct interpretation

The `article_id` is **not stored explicitly**: it is **inferred from the row order**, which must stay aligned with the article metadata.

The `words_count` column of the metadata gives the **number of words in the source text** and serves only as an indicator of content quality.

The embeddings **are not normalized** (row norms range from about 1.85 to 11.18 in the summary above), so **cosine similarity** is the recommended measure for comparing articles.
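
Because the row norms vary widely, a raw dot product would favour long vectors over genuinely similar ones. The sketch below shows one way to rank neighbours by cosine similarity; `most_similar_rows` and the toy matrix are illustrative assumptions, and row indices only equal `article_id` under the row-order alignment described above.

```python
import numpy as np


def most_similar_rows(embeddings: np.ndarray, query_row: int, n: int = 5):
    """Return the n rows closest to `query_row` by cosine similarity, with their scores."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # L2-normalize; the clip guards against zero vectors
    scores = unit @ unit[query_row]                  # cosine similarity of every row to the query
    scores[query_row] = -np.inf                      # never return the query itself
    top = np.argsort(-scores)[:n]
    return top, scores[top]


# Toy matrix: row 1 points in almost the same direction as row 0, row 2 is much longer.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [10.0, 10.0]], dtype=np.float32)
rows, sims = most_similar_rows(toy_embeddings, query_row=0, n=2)
print(rows.tolist(), np.round(sims, 3).tolist())
```

Here row 1 ranks ahead of row 2 (cosine of about 0.99 versus about 0.71), whereas a raw dot product would have preferred row 2 (10 versus 0.9) purely because of its norm.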


## Protocol

1. Sort interactions by timestamp to respect chronology.
2. Temporal train/test split according to `train_ratio` to avoid any leakage from the future.
3. Build each user's profile from the training interactions.
4. Define the *ground truth*: for each user present in train, the articles clicked in test that also appear in train (at least one).
5. Generate Top-5 recommendations, excluding articles already seen in train.
6. Compute ranking metrics (Precision@5, Recall@5, MAP@5, NDCG@5, Coverage@5) and estimate mean latency on a sample of at most 500 users.

This procedure mimics a production scenario: chronology is respected first, then recommendation quality and computation cost are measured together.
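
The steps above can be traced on a handful of interactions. This is a minimal sketch under stated assumptions: the toy data, the `toy_*` names and the popularity recommender are illustrative choices, not the notebook's actual models.

```python
import pandas as pd

# Ten toy interactions, one per hour (illustrative data only).
toy_clicks = pd.DataFrame({
    "user_id":    [1, 2, 1, 3, 2, 1, 3, 2, 1, 3],
    "article_id": [10, 10, 11, 11, 12, 12, 10, 11, 13, 12],
    "timestamp":  pd.date_range("2017-10-01", periods=10, freq="h"),
})
k = 5

# Steps 1-2: chronological sort, then split at 80% of the rows.
toy_clicks = toy_clicks.sort_values("timestamp").reset_index(drop=True)
cutoff = int(len(toy_clicks) * 0.8)
toy_train, toy_test = toy_clicks.iloc[:cutoff], toy_clicks.iloc[cutoff:]

# Step 3: user profile = set of articles seen in train.
seen = toy_train.groupby("user_id")["article_id"].apply(set).to_dict()

# Step 4: ground truth = test clicks on articles known from train, for users known from train.
train_items = set(toy_train["article_id"])
toy_ground_truth = {}
for user, items in toy_test.groupby("user_id")["article_id"]:
    relevant = [int(a) for a in items if a in train_items]
    if user in seen and relevant:
        toy_ground_truth[int(user)] = relevant

# Step 5: Top-k by global popularity, excluding articles already seen in train.
popular = toy_train["article_id"].value_counts().index.tolist()
toy_recs = {u: [int(a) for a in popular if a not in seen[u]][:k] for u in toy_ground_truth}

print(toy_ground_truth, toy_recs)
```

User 1 clicks article 13 in test, but that article never appears in train, so the user drops out of the evaluation; only user 3 (who clicks article 12) remains.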

## Minimal interaction preprocessing

In [446]:

# Iterative k-core filtering to limit sparsity before the train/test split

def iterative_k_core_filter(
    df: pd.DataFrame, min_user_interactions: int, min_item_interactions: int
) -> pd.DataFrame:
    filtered = df.copy()
    previous_size = -1
    while previous_size != len(filtered):
        previous_size = len(filtered)
        user_counts = filtered["user_id"].value_counts()
        item_counts = filtered["article_id"].value_counts()
        filtered = filtered[
            filtered["user_id"].isin(user_counts[user_counts >= min_user_interactions].index)
            & filtered["article_id"].isin(item_counts[item_counts >= min_item_interactions].index)
        ]
    return filtered

if clicks.empty:
    print("Dataset clicks vide : saut du filtrage k-core.")
else:
    before = (
        len(clicks),
        clicks["user_id"].nunique(),
        clicks["article_id"].nunique(),
    )
    clicks = iterative_k_core_filter(
        clicks,
        CONFIG["min_user_interactions"],
        CONFIG["min_item_interactions"],
    ).sort_values("timestamp").reset_index(drop=True)
    after = (
        len(clicks),
        clicks["user_id"].nunique(),
        clicks["article_id"].nunique(),
    )
    print(
        "Filtrage k-core terminé: "
        f"interactions {before[0]} -> {after[0]}, "
        f"utilisateurs {before[1]} -> {after[1]}, "
        f"articles {before[2]} -> {after[2]}"
    )


Filtrage k-core terminé: interactions 81245 -> 42765, utilisateurs 27911 -> 9962, articles 3353 -> 467
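
The filter has to iterate because removing a rare article can push a user below the threshold, which in turn can make another article rare. The sketch below reproduces that cascade on toy data; `k_core` is a hypothetical compact variant of `iterative_k_core_filter` that also counts the removal passes, and the thresholds are chosen for illustration.

```python
import pandas as pd


def k_core(df: pd.DataFrame, min_user: int, min_item: int):
    """Repeat user/item filtering until nothing changes; return the core and the number of removal passes."""
    passes = 0
    while True:
        user_counts = df["user_id"].value_counts()
        item_counts = df["article_id"].value_counts()
        keep = (df["user_id"].map(user_counts) >= min_user) & (
            df["article_id"].map(item_counts) >= min_item
        )
        if keep.all():
            return df, passes
        df = df[keep]
        passes += 1


toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 4, 4],
    "article_id": ["a", "b", "a", "b", "b", "c", "c", "d"],
})
core, passes = k_core(toy, min_user=2, min_item=2)
print(f"{len(toy)} -> {len(core)} interactions after {passes} removal passes")
```

A single pass would only drop the click on article `d`; iterating then removes user 4, article `c` and user 3 in turn, leaving the four interactions of users 1 and 2.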


In [447]:
# Split and utility functions

def temporal_train_test_split(df: pd.DataFrame, train_ratio: float) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split interactions chronologically according to the train_ratio."""
    cutoff = int(len(df) * train_ratio)
    train = df.iloc[:cutoff].copy()
    test = df.iloc[cutoff:].copy()
    return train, test


def build_user_histories(df: pd.DataFrame) -> Dict[int, List[int]]:
    """Create mapping user -> list of articles in chronological order."""
    histories: Dict[int, List[int]] = {}
    for user_id, group in df.groupby("user_id"):
        histories[int(user_id)] = group.sort_values("timestamp")["article_id"].tolist()
    return histories


def get_candidate_items(df: pd.DataFrame) -> List[int]:
    """Return unique article ids."""
    return df["article_id"].unique().tolist()


def make_ground_truth(train: pd.DataFrame, test: pd.DataFrame) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Build user histories and ground truth for evaluation.

    Only test items that were seen in training are kept so models are
    evaluated on recommendable articles.
    """
    train_hist = build_user_histories(train)
    candidate_items = set(train["article_id"].unique())
    test_hist = build_user_histories(test)
    filtered = {
        u: [it for it in items if it in candidate_items]
        for u, items in test_hist.items()
        if u in train_hist and len(items) > 0
    }
    eligible_users = {u: items for u, items in filtered.items() if items}
    return train_hist, eligible_users


train_df, test_df = temporal_train_test_split(clicks, CONFIG["train_ratio"])
train_histories, ground_truth = make_ground_truth(train_df, test_df)
eval_users = sorted(ground_truth.keys())
candidate_items = get_candidate_items(train_df)
print(f"Train size: {len(train_df)}, Test size: {len(test_df)}, Users for eval: {len(eval_users)}")


Train size: 34212, Test size: 8553, Users for eval: 1300


## Metrics

* **Precision@5**: share of the Top-5 recommendations that were actually clicked (the higher, the more precise the Top-5).
* **Recall@5**: share of test clicks found in the Top-5 (how much of what the user likes is retrieved).
* **MAP@5**: mean of the precision measured at each rank where a click is retrieved; rewards hits placed near the top of the list.
* **NDCG@5**: weights each hit by its position (logarithmically decreasing gain) and normalizes by the best achievable score; well suited to comparing rankings.
* **Coverage@5**: proportion of the candidate articles that appear in at least one user's Top-5 (catalog diversity).
* **Latency per user**: mean time to produce the Top-5 (important for a real-time API).
* **RMSE**: root mean squared error on rating predictions; summarizes the overall gap between the model's estimates and the actual clicks.
* **MAE**: mean absolute error; reports the average error without amplifying large deviations.
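
The ranking metrics can be checked by hand on a single user. The sketch below is a self-contained illustration: `ranking_metrics` is a hypothetical helper that assumes the recommendation list contains no duplicates, and the item ids are made up.

```python
import math
from typing import Dict, List


def ranking_metrics(recommended: List[int], relevant: List[int], k: int = 5) -> Dict[str, float]:
    """Precision, recall, average precision and NDCG at k for one user (hypothetical helper)."""
    relevant_set = set(relevant)
    # 1-based ranks of the recommended items that were actually clicked.
    hit_ranks = [rank for rank, item in enumerate(recommended[:k], start=1) if item in relevant_set]
    n_ideal = min(len(relevant_set), k)
    if n_ideal == 0:
        return {"precision": 0.0, "recall": 0.0, "ap": 0.0, "ndcg": 0.0}
    # Precision at each hit rank is (hits so far) / rank.
    ap = sum((i + 1) / rank for i, rank in enumerate(hit_ranks)) / n_ideal
    dcg = sum(1.0 / math.log2(rank + 1) for rank in hit_ranks)
    # Ideal DCG: all relevant items placed at the top of the list.
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_ideal + 1))
    return {
        "precision": len(hit_ranks) / k,
        "recall": len(hit_ranks) / len(relevant_set),
        "ap": ap,
        "ndcg": dcg / idcg,
    }


metrics = ranking_metrics(recommended=[10, 20, 30, 40, 50], relevant=[20, 50, 99], k=5)
print({name: round(value, 4) for name, value in metrics.items()})
```

With hits at ranks 2 and 5, precision is 2/5, recall is 2/3, average precision is (1/2 + 2/5) / 3 = 0.3, and NDCG is about 0.478 because both hits sit below the positions an ideal ranking would give them.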

In [448]:

# Metrics

def precision_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Precision@k for a single user."""
    if not recommended:
        return 0.0
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k


def recall_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Recall@k for a single user."""
    if not relevant:
        return 0.0
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / len(relevant)


def average_precision_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """MAP@k for a single user."""
    if not relevant:
        return 0.0
    score = 0.0
    hits = 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)


def dcg_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Discounted cumulative gain."""
    dcg = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            dcg += 1 / np.log2(i + 1)
    return dcg


def ndcg_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Normalized DCG."""
    ideal_dcg = dcg_at_k(relevant[:k], relevant, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(recommended, relevant, k) / ideal_dcg


def coverage_at_k(all_recommendations: List[List[int]], candidate_items: List[int], k: int) -> float:
    """Coverage of unique recommended items over candidates."""
    rec_items = set()
    for rec in all_recommendations:
        rec_items.update(rec[:k])
    if not candidate_items:
        return 0.0
    return len(rec_items) / len(candidate_items)
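
As a quick sanity check of the metric definitions, the sketch below recomputes them by hand for a single toy user. The item IDs are arbitrary and the code is deliberately standalone (it does not call the functions above).

```python
import math

recommended = [10, 20, 30, 40, 50]  # toy top-5 list
relevant = [20, 50, 99]             # toy test clicks
k = 5

# Binary relevance of each recommended position.
rel = [1 if item in relevant else 0 for item in recommended[:k]]
hits = sum(rel)

precision = hits / k
recall = hits / len(relevant)

# Average precision: precision at each rank where a hit occurs.
running_hits, ap_sum = 0, 0.0
for rank, r in enumerate(rel, start=1):
    if r:
        running_hits += 1
        ap_sum += running_hits / rank
ap = ap_sum / min(len(relevant), k)

# NDCG: discounted gain divided by the gain of an ideal ordering.
dcg = sum(r / math.log2(rank + 1) for rank, r in enumerate(rel, start=1))
idcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
ndcg = dcg / idcg

print(precision, round(recall, 3), round(ap, 3), round(ndcg, 3))
```

With hits at ranks 2 and 5, this gives a precision of 0.4, a recall of about 0.667, an average precision of 0.3 and an NDCG of about 0.478.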


## Utility functions for the recommenders

In [449]:

# Classic helpers (popularity, similarity, lightweight SVD) used by the baselines

def build_global_popularity(train: pd.DataFrame) -> List[int]:
    """Return articles sorted by click count (most clicked first)."""
    return train.groupby("article_id").size().sort_values(ascending=False).index.tolist()


def build_recent_popularity(train: pd.DataFrame, window_days: int) -> List[int]:
    """Return the most popular articles within the latest sliding window."""
    max_time = train["timestamp"].max()
    window_start = max_time - pd.Timedelta(days=window_days)
    recent = train[train["timestamp"] >= window_start]
    if recent.empty:
        return build_global_popularity(train)
    counts = recent.groupby("article_id")["timestamp"].agg(["size", "max"])
    ranked = counts.sort_values(by=["size", "max"], ascending=[False, False])
    return ranked.index.tolist()


def build_covisit_graph(train: pd.DataFrame) -> Dict[int, Dict[int, int]]:
    """Build a co-visitation graph from user histories."""
    graph: Dict[int, Dict[int, int]] = {}
    for _, group in train.groupby("user_id"):
        items = group.sort_values("timestamp")["article_id"].tolist()
        unique_items = list(dict.fromkeys(items))
        for i, item_i in enumerate(unique_items):
            graph.setdefault(item_i, {})
            for item_j in unique_items[i + 1 :]:
                graph[item_i][item_j] = graph[item_i].get(item_j, 0) + 1
                graph.setdefault(item_j, {})
                graph[item_j][item_i] = graph[item_j].get(item_i, 0) + 1
    return graph


def build_content_embeddings(metadata: pd.DataFrame, pca_components: Optional[int] = None):
    """Create TF-IDF embeddings from the text columns (optionally reduced with TruncatedSVD)."""
    text_cols = [
        c
        for c in metadata.columns
        if metadata[c].dtype == object and c not in {"article_id", "clicks"}
    ]
    non_id_cols = [c for c in metadata.columns if c != "article_id"]

    if not text_cols and non_id_cols:
        print("No text columns found: using the non-ID columns as categorical tokens.")
        text_cols = non_id_cols

    if not text_cols:
        raise ValueError("No usable column in the metadata to build embeddings")

    corpus = metadata[text_cols].fillna("")
    corpus = corpus.apply(lambda row: " ".join(f"{col}_{val}" for col, val in row.items()), axis=1)

    vectorizer = TfidfVectorizer(max_features=5000)
    tfidf = vectorizer.fit_transform(corpus)
    if pca_components and pca_components < tfidf.shape[1]:
        svd = TruncatedSVD(n_components=pca_components, random_state=CONFIG["random_seed"])
        reduced = svd.fit_transform(tfidf)
        embeddings = normalize(reduced)
    else:
        embeddings = normalize(tfidf)
    ids = metadata["article_id"].tolist()
    return embeddings, ids


def build_item_similarity(train: pd.DataFrame, metadata: Optional[pd.DataFrame]):
    """Build an article-to-article similarity from content, falling back to co-visitation."""
    if metadata is not None:
        try:
            embeddings, ids = build_content_embeddings(metadata, CONFIG["content_pca_components"])
            similarity: Dict[int, Dict[int, float]] = {}
            for i, aid in enumerate(ids):
                sims = embeddings @ embeddings[i].T
                sims = np.asarray(sims).flatten()
                top_idx = np.argsort(-sims)[1:51]
                similarity[aid] = {ids[j]: float(sims[j]) for j in top_idx if sims[j] > 0}
            return similarity, "content"
        except Exception as exc:
            print(f"Content embeddings unavailable ({exc}). Falling back to co-visitation.")
    graph = build_covisit_graph(train)
    similarity = {item: {nbr: float(cnt) for nbr, cnt in neigh.items()} for item, neigh in graph.items()}
    return similarity, "covisitation"


def recommend_from_similarity(
    user_id: int,
    train_histories: Dict[int, List[int]],
    similarity: Dict[int, Dict[int, float]],
    candidate_items: List[int],
    k: int,
) -> List[int]:
    """Aggregate similarity scores over the user's history."""
    seen = set(train_histories.get(user_id, []))
    scores: Dict[int, float] = {}
    for item in seen:
        for neighbor, sim in similarity.get(item, {}).items():
            if neighbor in seen:
                continue
            scores[neighbor] = scores.get(neighbor, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen]
    if len(recs) < k:
        for c in candidate_items:
            if c not in seen and c not in recs:
                recs.append(c)
            if len(recs) >= k:
                break
    return recs[:k]


def build_collaborative_svd(train: pd.DataFrame, n_components: int):
    """Train a lightweight implicit-feedback SVD and return a recommendation function."""
    user_codes, user_index = pd.factorize(train["user_id"], sort=True)
    item_codes, item_index = pd.factorize(train["article_id"], sort=True)

    interactions = pd.DataFrame({"user_idx": user_codes, "item_idx": item_codes}).drop_duplicates()
    data = np.ones(len(interactions), dtype=np.float32)
    mat = sparse.coo_matrix((data, (interactions["user_idx"], interactions["item_idx"])), shape=(len(user_index), len(item_index))).tocsr()

    svd = TruncatedSVD(n_components=n_components, random_state=CONFIG["random_seed"])
    user_factors = svd.fit_transform(mat)
    item_factors = svd.components_.T

    user_to_idx = {int(uid): int(idx) for idx, uid in enumerate(user_index.tolist())}
    items = [int(aid) for aid in item_index.tolist()]

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        if user_id not in user_to_idx:
            popularity = build_global_popularity(train)
            return [it for it in popularity if it not in seen][:k]

        u_vec = user_factors[user_to_idx[user_id]]
        scores = item_factors @ u_vec
        ranked_items = [items[i] for i in np.argsort(-scores)]
        return [it for it in ranked_items if it not in seen][:k]

    meta = {"users": len(user_index), "items": len(item_index), "components": n_components}
    return recommend, meta
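
To show what the co-visitation graph contains, here is a minimal sketch of the same counting logic on plain dictionaries instead of a DataFrame. The function name `covisit_counts` and the toy histories are assumptions made for this example.

```python
from typing import Dict, List


def covisit_counts(histories: Dict[int, List[int]]) -> Dict[int, Dict[int, int]]:
    """Count, for each pair of articles, how many users clicked both."""
    graph: Dict[int, Dict[int, int]] = {}
    for items in histories.values():
        unique_items = list(dict.fromkeys(items))  # dedupe while keeping order
        for i, a in enumerate(unique_items):
            graph.setdefault(a, {})
            for b in unique_items[i + 1:]:
                # The graph is symmetric: update both directions.
                graph[a][b] = graph[a].get(b, 0) + 1
                graph.setdefault(b, {})
                graph[b][a] = graph[b].get(a, 0) + 1
    return graph


toy_histories = {1: [10, 11, 10], 2: [10, 11, 12], 3: [12]}
toy_graph = covisit_counts(toy_histories)
print(toy_graph[10])
```

Articles 10 and 11 are co-clicked by two users, so their edge weight is 2; repeated clicks by the same user (article 10 for user 1) count only once.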


In [450]:

# Recommenders (Surprise)

from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD
from surprise import accuracy


def build_surprise_trainset(interactions: pd.DataFrame):
    aggregated = (
        interactions.groupby(["user_id", "article_id"])
        .agg(clicks=("article_id", "size"), last_ts=("timestamp", "max"))
        .reset_index()
    )
    if aggregated.empty:
        raise ValueError("Cannot build a Surprise trainset without interactions")

    min_ts = aggregated["last_ts"].min()
    max_ts = aggregated["last_ts"].max()
    span_seconds = max((max_ts - min_ts).total_seconds(), 1.0)
    recency = (aggregated["last_ts"] - min_ts).dt.total_seconds() / span_seconds

    aggregated["rating"] = np.log1p(aggregated["clicks"]) + 0.5 * recency

    min_rating = float(aggregated["rating"].min())
    max_rating = float(aggregated["rating"].max())
    if max_rating == min_rating:
        max_rating = min_rating + 1.0

    reader = Reader(rating_scale=(min_rating, max_rating))
    return Dataset.load_from_df(
        aggregated[["user_id", "article_id", "rating"]], reader
    ).build_full_trainset()


surprise_trainset = build_surprise_trainset(train_df)
surprise_items = [int(surprise_trainset.to_raw_iid(iid)) for iid in surprise_trainset.all_items()]
popularity_order = build_global_popularity(train_df)
popularity_rank = {int(aid): rank for rank, aid in enumerate(popularity_order)}
# A per-algorithm tie-breaker decides the order of items whose scores are equal


def wrap_surprise_recommender(algo, label: str, *, tie_breaker=None):
    algo.fit(surprise_trainset)

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        raw_uid = int(user_id)
        is_normal = isinstance(algo, NormalPredictor)

        if is_normal:
            base_score = float(getattr(algo, "mu", 0.0))
            scored = [(iid, base_score) for iid in surprise_items if iid not in seen]
        else:
            scored = []
            for iid in surprise_items:
                if iid in seen:
                    continue
                pred = algo.predict(raw_uid, int(iid), verbose=False)
                scored.append((iid, float(pred.est)))

        if not scored:
            return [it for it in surprise_items if it not in seen][:k]

        def sort_key(item_score):
            iid, score = item_score
            tie = tie_breaker(iid) if tie_breaker else 0.0
            return (score, tie)

        scored.sort(key=sort_key, reverse=True)
        return [it for it, _ in scored[:k]]

    meta = {"algo": label, "n_items": len(surprise_items), "estimator": algo}
    return recommend, meta


def surprise_error_metrics(
    algo,
    test_df: pd.DataFrame,
    *,
    candidate_pool: list[int],
    negatives_per_user: int = 50,
) -> dict:
    """Compute RMSE and MAE on the test set with negative sampling.

    For each test click, articles absent from the user's training history are
    added with a target of 0 to avoid artificially null RMSE/MAE on a binary dataset.
    """
    if test_df is None or test_df.empty:
        return {"rmse": float("nan"), "mae": float("nan")}

    rng = np.random.default_rng(42)
    rating_rows = []

    for row in test_df.itertuples(index=False):
        rating_rows.append((row.user_id, row.article_id, 1.0))

        seen = set(train_histories.get(row.user_id, []))
        available = [it for it in candidate_pool if it not in seen]
        if not available:
            continue

        negatives = rng.choice(
            available,
            size=min(negatives_per_user, len(available)),
            replace=False,
        )
        rating_rows.extend((row.user_id, int(neg), 0.0) for neg in negatives)

    preds = [
        algo.predict(int(uid), int(iid), r_ui=rating, verbose=False)
        for uid, iid, rating in rating_rows
    ]

    rmse = accuracy.rmse(preds, verbose=False)
    mae = accuracy.mae(preds, verbose=False)
    return {"rmse": float(rmse), "mae": float(mae)}
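
The pseudo-rating built in `build_surprise_trainset` combines click count and recency. The sketch below reproduces that formula for a single (user, article) pair with the standard library only; the function name `implicit_rating` and the dates are arbitrary choices for illustration.

```python
import math
from datetime import datetime


def implicit_rating(clicks: int, last_ts: datetime, min_ts: datetime, max_ts: datetime) -> float:
    """log1p(clicks) + 0.5 * recency, with recency scaled to [0, 1]."""
    span_seconds = max((max_ts - min_ts).total_seconds(), 1.0)
    recency = (last_ts - min_ts).total_seconds() / span_seconds
    return math.log1p(clicks) + 0.5 * recency


t_first = datetime(2017, 10, 1)   # oldest "last click" in the toy data
t_last = datetime(2017, 10, 11)   # most recent "last click" in the toy data

old_single = implicit_rating(1, t_first, t_first, t_last)
recent_single = implicit_rating(1, t_last, t_first, t_last)
recent_triple = implicit_rating(3, t_last, t_first, t_last)
print(round(old_single, 3), round(recent_single, 3), round(recent_triple, 3))
```

The smallest possible value is `ln 2` (about 0.69, one click at the oldest timestamp), so these pseudo-ratings do not live on the 0/1 scale used as targets in `surprise_error_metrics`. Under that reading, the RMSE and MAE reported later are probably best interpreted as relative comparisons between models rather than as absolute errors.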


In [451]:
# Evaluation pipeline

def evaluate_model(
    name: str,
    recommend_func: Callable[[int, set, int], List[int]],
    train_histories: Dict[int, List[int]],
    ground_truth: Dict[int, List[int]],
    candidate_items: List[int],
    k: int,
    latency_sample: int = 500,
) -> Dict[str, float]:
    """Evaluate a recommender with ranking metrics and latency estimation."""
    precisions: List[float] = []
    recalls: List[float] = []
    maps: List[float] = []
    ndcgs: List[float] = []
    hits: List[int] = []
    all_recs: List[List[int]] = []

    # Derive the evaluated users from the argument instead of a notebook-level global
    users = sorted(ground_truth.keys())
    for user_id in users:
        seen = set(train_histories.get(user_id, []))
        recs = recommend_func(user_id, seen, k)
        gt = ground_truth[user_id]
        all_recs.append(recs)
        precisions.append(precision_at_k(recs, gt, k))
        recalls.append(recall_at_k(recs, gt, k))
        maps.append(average_precision_at_k(recs, gt, k))
        ndcgs.append(ndcg_at_k(recs, gt, k))
        hits.append(1 if set(recs[:k]) & set(gt) else 0)

    coverage = coverage_at_k(all_recs, candidate_items, k)
    hitrate = float(np.mean(hits)) if users else 0.0

    sample_users = users[: min(latency_sample, len(users))]
    start = time.perf_counter()
    for user_id in sample_users:
        seen = set(train_histories.get(user_id, []))
        _ = recommend_func(user_id, seen, k)
    latency = (time.perf_counter() - start) / max(1, len(sample_users))

    return {
        "model": name,
        "users": len(users),
        "precision@k": float(np.mean(precisions)),
        "recall@k": float(np.mean(recalls)),
        "map@k": float(np.mean(maps)),
        "ndcg@k": float(np.mean(ndcgs)),
        "hitrate@k": hitrate,
        "coverage@k": coverage,
        "latency_per_user_s": latency,
        "all_recommendations": all_recs,
    }
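
The latency estimate in `evaluate_model` boils down to timing repeated calls on a sample of users. A minimal standalone sketch, assuming a hypothetical `mean_latency` helper and a dummy recommender:

```python
import time
from typing import Callable, List, Sequence


def mean_latency(
    recommend: Callable[[int, set, int], List[int]],
    users: Sequence[int],
    k: int,
    sample: int = 500,
) -> float:
    """Average wall-clock seconds per call over the first `sample` users."""
    sample_users = list(users)[:sample]
    start = time.perf_counter()
    for user_id in sample_users:
        recommend(user_id, set(), k)
    # max(1, ...) avoids a division by zero when there are no users.
    return (time.perf_counter() - start) / max(1, len(sample_users))


def dummy_recommender(user_id: int, seen: set, k: int) -> List[int]:
    """Placeholder model: always returns the first k item IDs."""
    return list(range(k))


latency = mean_latency(dummy_recommender, range(100), k=5)
print(f"{latency:.2e} s per user")
```

Because the timing loop runs after the metrics loop on the same users, any caching inside a model would make the measured latency optimistic compared with a cold start.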


## Training the recommender systems

Each approach is trained separately to limit the runtime of each cell and to put each model's role in context.

### Global popularity
Global popularity recommendation sorts articles by interaction volume in the training set. It is fast to compute (a simple aggregation) and serves as a robust baseline against which to compare more advanced models.

In [452]:

# Shared configuration
K = CONFIG["k"]

# Ready-to-use Surprise models
popularity_recommender, pop_meta = wrap_surprise_recommender(
    NormalPredictor(),
    "NormalPredictor (baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

itemknn_recommender, itemknn_meta = wrap_surprise_recommender(
    KNNBasic(
        k=60,
        min_k=2,
        sim_options={"name": "pearson_baseline", "user_based": False, "min_support": 2},
    ),
    "KNNBasic item-based (pearson baseline)",
    tie_breaker=lambda iid: -popularity_rank.get(int(iid), len(popularity_rank)),
)

svd_recommender, svd_meta = wrap_surprise_recommender(
    SVD(
        n_factors=CONFIG["svd_components"],
        n_epochs=35,
        reg_all=0.06,
        lr_all=0.004,
        random_state=CONFIG["random_seed"],
    ),
    "SVD collaboratif (facteurs latents)",
    tie_breaker=lambda iid: popularity_rank.get(int(iid), len(popularity_rank)),
)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
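
The effect of the two tie-breakers used above can be seen on a toy score list. This sketch mirrors the `(score, tie)` sort key of `wrap_surprise_recommender`; the item IDs, scores and the `rank_items` helper are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

popularity_rank_toy: Dict[int, int] = {101: 0, 102: 1, 103: 2}  # 0 = most popular
scored: List[Tuple[int, float]] = [(103, 0.9), (102, 0.5), (101, 0.5)]


def rank_items(scored: List[Tuple[int, float]], tie_breaker: Callable[[int], float]) -> List[int]:
    """Sort by score, then by tie-breaker value, both in descending order."""
    ordered = sorted(scored, key=lambda pair: (pair[1], tie_breaker(pair[0])), reverse=True)
    return [iid for iid, _ in ordered]


# Negative rank: among tied items, the most popular one comes first.
favor_popular = rank_items(scored, lambda iid: -popularity_rank_toy[iid])
# Positive rank: among tied items, the least popular one comes first.
favor_niche = rank_items(scored, lambda iid: popularity_rank_toy[iid])
print(favor_popular, favor_niche)
```

Item 103 wins in both cases thanks to its higher score; only the order of the tied items 101 and 102 changes. When every item receives the same score, as the wrapper does for `NormalPredictor`, the tie-breaker alone determines the whole ranking.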


### Recent popularity
This variant favors freshness by restricting interactions to a time window before sorting articles by frequency. It is useful for capturing current trends, at the cost of recomputing the sliding window more often.

In [453]:
# Recent popularity
recent_rank = build_recent_popularity(train_df, CONFIG["recent_window_days"])

def recent_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return [it for it in recent_rank if it not in seen][:k]
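
The sliding-window idea can be checked on a handful of clicks. The sketch below uses only the standard library; `recent_popularity` and the toy timestamps are assumptions for illustration, and ties are broken by the latest click as in `build_recent_popularity`.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def recent_popularity(clicks: List[Tuple[int, datetime]], window_days: int) -> List[int]:
    """Rank articles by click count inside the last `window_days` days."""
    max_time = max(ts for _, ts in clicks)
    window_start = max_time - timedelta(days=window_days)
    counts: Counter = Counter()
    last_seen: Dict[int, datetime] = {}
    for article_id, ts in clicks:
        if ts >= window_start:
            counts[article_id] += 1
            last_seen[article_id] = max(ts, last_seen.get(article_id, ts))
    # Most clicks first; the most recently clicked article wins ties.
    return sorted(counts, key=lambda a: (counts[a], last_seen[a]), reverse=True)


toy_clicks = [
    (1, datetime(2017, 10, 1)), (1, datetime(2017, 10, 2)), (1, datetime(2017, 10, 3)),
    (2, datetime(2017, 10, 14)), (2, datetime(2017, 10, 15)),
    (3, datetime(2017, 10, 16)),
]
ranking = recent_popularity(toy_clicks, window_days=7)
print(ranking)
```

Article 1 has the most clicks overall but none in the last seven days, so it drops out of the recent ranking entirely.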

### Collaborative (SVD)
Collaborative filtering factorizes the user-item matrix (SVD) to capture latent preferences. Training takes longer than for popularity or content-similarity methods, but it models implicit affinities between users and articles better.

In [454]:
# Collaborative filtering (SVD)
collab_recommend, collab_meta = build_collaborative_svd(train_df, CONFIG["svd_components"])

def collaborative_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return collab_recommend(user_id, seen, k)

In [455]:
# Co-visitation models disabled in favor of Surprise

### Content (article-to-article similarity)
A content-based model builds a similarity matrix between articles from the metadata. Recommendations are made by projecting the user's history onto nearby items in that space. This computation can be more expensive because it requires vectorizing the articles and computing their pairwise products.

In [456]:
# Initialize a results container for each training run
results = []
step_results = []

In [457]:
# Content-based recommendation (can be disabled)
ENABLE_CONTENT_MODEL = False  # Set to True to enable the content similarity computation

if ENABLE_CONTENT_MODEL:
    item_similarity, sim_mode = build_item_similarity(train_df, metadata)

    def content_recommender(user_id: int, seen: set, k: int) -> List[int]:
        return recommend_from_similarity(user_id, train_histories, item_similarity, candidate_items, k)
else:
    sim_mode = "désactivé"
    content_recommender = None
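
When the content model is enabled, scoring amounts to summing the similarities between each candidate and the articles in the user's history. A simplified sketch of that aggregation, assuming a hypothetical `score_from_history` helper and a toy similarity table (the popularity padding of `recommend_from_similarity` is left out):

```python
from typing import Dict, List


def score_from_history(
    history: List[int],
    similarity: Dict[int, Dict[int, float]],
    k: int,
) -> List[int]:
    """Rank unseen articles by summed similarity to the user's history."""
    seen = set(history)
    scores: Dict[int, float] = {}
    for item in seen:
        for neighbor, sim in similarity.get(item, {}).items():
            if neighbor not in seen:
                scores[neighbor] = scores.get(neighbor, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked][:k]


toy_similarity = {
    1: {2: 0.9, 3: 0.2},
    4: {3: 0.5, 5: 0.4},
}
top = score_from_history([1, 4], toy_similarity, k=2)
print(top)
```

Article 3 is only moderately similar to each history item, yet it outranks article 5 because its two contributions add up (0.2 + 0.5).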


## Separate training runs

The three Surprise strategies are run in separate cells so that each block can be started, stopped, or rerun independently. This avoids waiting for the whole pipeline when only one training run is needed.


### Training run 1: Surprise baseline (NormalPredictor)

This block evaluates the `NormalPredictor` baseline from Surprise and computes Precision@K, Recall@K, MAP@K, NDCG@K, coverage, mean latency, and RMSE and MAE on the test set. Because the wrapper assigns the same score to every item for this model, the ranking is decided entirely by the popularity tie-breaker, so the top-K behaves like a global popularity baseline.


In [458]:
popularity_result = evaluate_model(
    "Baseline Surprise - NormalPredictor",
    popularity_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)

pop_meta_errors = surprise_error_metrics(
    pop_meta["estimator"], test_df, candidate_pool=candidate_items
)
popularity_result.update(pop_meta_errors)
results.append(popularity_result)
pd.DataFrame([popularity_result])


| | model | users | precision@k | recall@k | map@k | ndcg@k | hitrate@k | coverage@k | latency_per_user_s | all_recommendations | rmse | mae |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline Surprise - NormalPredictor | 1300 | 0.057077 | 0.107896 | 0.046245 | 0.078194 | 0.243077 | 0.029213 | 0.000134 | [[207122, 160474, 59758, 336430, 68866], ...] | 1.012126 | 0.995852 |
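
The RMSE and MAE columns depend on the negative sampling performed in `surprise_error_metrics`. The sketch below isolates that step with the standard library; `sample_negatives` and the toy IDs are assumptions for illustration (the notebook uses a NumPy generator instead).

```python
import random
from typing import List, Set, Tuple


def sample_negatives(seen: Set[int], candidate_pool: List[int], n: int, rng: random.Random) -> List[int]:
    """Draw up to n distinct articles the user has not seen in training."""
    available = [item for item in candidate_pool if item not in seen]
    return rng.sample(available, min(n, len(available)))


rng = random.Random(42)
seen_in_train = {1, 2}
pool = [1, 2, 3, 4, 5, 6]

negatives = sample_negatives(seen_in_train, pool, n=3, rng=rng)

# One positive row (target 1.0) followed by its sampled negatives (target 0.0).
user_id, clicked_article = 7, 4
rows: List[Tuple[int, int, float]] = [(user_id, clicked_article, 1.0)]
rows += [(user_id, neg, 0.0) for neg in negatives]
print(rows)
```

Two details of the notebook's version are worth keeping in mind when reading the error metrics: negatives are drawn once per test click rather than once per user, and only the training history is excluded, so an article clicked in the test set can still be drawn as a negative.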


### Training run 2: item-based KNN (Surprise)

This block runs `KNNBasic` in item-based mode with a **Pearson baseline** similarity and up to 60 neighbors
(`k=60`, `min_k=2`, `min_support=2`). This configuration pushes the model to exploit co-clicks
rather than plain popularity effects, so that its recommendations differ from those of the SVD.
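
To make the roles of `k` and `min_k` concrete, here is a simplified sketch of a neighborhood prediction: a similarity-weighted average over the user's rated items that are most similar to the target. It reflects the general rule behind basic KNN models, not Surprise's actual implementation; the function name and toy values are assumptions.

```python
import heapq
from typing import Dict, Optional


def knn_item_estimate(
    user_ratings: Dict[int, float],
    sims_to_target: Dict[int, float],
    k: int = 60,
    min_k: int = 2,
) -> Optional[float]:
    """Weighted average of the user's ratings over the k most similar items."""
    neighbors = [(sims_to_target.get(item, 0.0), rating) for item, rating in user_ratings.items()]
    top_neighbors = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])

    sum_sim = sum_weighted = 0.0
    actual_k = 0
    for sim, rating in top_neighbors:
        if sim > 0:  # only positively correlated neighbors contribute
            sum_sim += sim
            sum_weighted += sim * rating
            actual_k += 1

    if actual_k < min_k:
        return None  # not enough neighbors for a reliable estimate
    return sum_weighted / sum_sim


user_ratings = {1: 1.0, 2: 2.0, 3: 1.5}  # item -> pseudo-rating given by the user
sims = {1: 0.5, 2: 0.5, 3: -0.2}         # similarity of each item to the target
estimate = knn_item_estimate(user_ratings, sims)
too_few = knn_item_estimate({1: 1.0}, {1: 0.9})
print(estimate, too_few)
```

In this sketch an impossible prediction returns `None`; a library would typically substitute a default estimate instead. If many (user, item) pairs end up with the same fallback value, the tie-breaker decides their order, which may explain why some KNN top-5 lists resemble the popularity ranking.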


In [459]:
item2item_result = evaluate_model(
    "Modèle KNNBasic item-based (Pearson baseline)",
    itemknn_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)

itemknn_meta_errors = surprise_error_metrics(
    itemknn_meta["estimator"], test_df, candidate_pool=candidate_items
)
item2item_result.update(itemknn_meta_errors)
results.append(item2item_result)
pd.DataFrame([item2item_result])


| | model | users | precision@k | recall@k | map@k | ndcg@k | hitrate@k | coverage@k | latency_per_user_s | all_recommendations | rmse | mae |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Modèle KNNBasic item-based (Pearson baseline) | 1300 | 0.048154 | 0.089278 | 0.046528 | 0.072609 | 0.200769 | 0.61573 | 0.001411 | [[336972, 60258, 33497, 33583, 15166], ...] | 1.003838 | 0.993943 |


### Training run 3: Surprise SVD

This block evaluates Surprise's `SVD` (latent-factor model) fitted on the click-derived ratings, with 64 factors, more epochs
and stronger regularization than the library defaults (`n_epochs=35`, `reg_all=0.06`, `lr_all=0.004`). The goal is to obtain
user and item profiles that are more contrasted than those of the neighborhood KNN.
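
For intuition about what `n_epochs`, `lr_all` and `reg_all` control, the sketch below trains a tiny biased matrix factorization with stochastic gradient descent in pure Python. It is an illustrative approximation of this family of models, not Surprise's code; `train_funk_svd`, the toy ratings and the toy hyperparameters are assumptions.

```python
import math
import random
from typing import Callable, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, item_id, rating)


def train_funk_svd(
    ratings: List[Rating],
    n_factors: int = 4,
    n_epochs: int = 35,
    lr_all: float = 0.004,
    reg_all: float = 0.06,
    seed: int = 42,
) -> Callable[[int, int], float]:
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD and return a predictor."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u: int, i: int) -> float:
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step with L2 regularization on biases and factors.
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qif - reg_all * puf)
                q[i][f] += lr_all * (err * puf - reg_all * qif)
    return predict


def rmse(predict: Callable[[int, int], float], ratings: List[Rating]) -> float:
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))


toy = [(1, 10, 1.2), (1, 11, 0.7), (2, 10, 1.9), (2, 12, 0.7), (3, 11, 1.1), (3, 12, 1.6)]
untrained = train_funk_svd(toy, n_epochs=0)
trained = train_funk_svd(toy, n_epochs=200, lr_all=0.05)  # larger values suit this tiny dataset
print(round(rmse(untrained, toy), 3), round(rmse(trained, toy), 3))
```

More epochs and a higher learning rate fit the training ratings more closely, while a larger `reg_all` pulls biases and factors back toward zero, which limits overfitting on sparse click data.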


In [460]:
svd_result = evaluate_model(
    "Modèle SVD Surprise (facteurs latents)",
    svd_recommender,
    train_histories,
    ground_truth,
    candidate_items,
    K,
)

svd_meta_errors = surprise_error_metrics(
    svd_meta["estimator"], test_df, candidate_pool=candidate_items
)
svd_result.update(svd_meta_errors)
results.append(svd_result)
pd.DataFrame([svd_result])


| | model | users | precision@k | recall@k | map@k | ndcg@k | hitrate@k | coverage@k | latency_per_user_s | all_recommendations | rmse | mae |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Modèle SVD Surprise (facteurs latents) | 1300 | 0.032308 | 0.060513 | 0.032275 | 0.051923 | 0.156154 | 0.552809 | 0.000999 | [[29727, 177507, 216304, 236179, 95645], ...] | 0.998556 | 0.987822 |


### Surprise models only
The former E* sections based on co-visitation are replaced by Surprise algorithms (NormalPredictor, KNNBasic, SVD).

#### Co-visitation variants removed
We now favor Surprise algorithms to keep experimentation and deployment consistent.

In [461]:
# The co-visitation variants are replaced by the Surprise models above.

### Hybrid section removed
The co-visitation + popularity hybrid has been replaced by the more flexible Surprise SVD model.

In [462]:
# Hybrid section removed: the Surprise library covers the collaborative needs.

In [463]:
# Optuna is no longer needed for this Surprise-centered notebook.

## Consolidated results

After the three training blocks above have run, the metrics are aggregated to compare the approaches. Each row of the table summarizes precision, recall, MAP, NDCG, hit rate, coverage and mean latency per user. RMSE and MAE are not among the columns kept in `clean_columns`.
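
The ranking metrics named above can be sketched as follows. The notebook's own `precision_at_k`, `recall_at_k` and `ndcg_at_k` are defined outside this section, so the functions below are an assumption about their shape (binary relevance, log2 discount), not a copy of them. The `sketch_` prefix is hypothetical and only avoids shadowing the real helpers.

```python
import math
from typing import Sequence, Set


def sketch_precision_at_k(recs: Sequence[int], relevant: Set[int], k: int) -> float:
    """Share of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recs[:k] if item in relevant)
    return hits / k


def sketch_recall_at_k(recs: Sequence[int], relevant: Set[int], k: int) -> float:
    """Share of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recs[:k] if item in relevant)
    return hits / len(relevant)


def sketch_ndcg_at_k(recs: Sequence[int], relevant: Set[int], k: int) -> float:
    """DCG of the top-k divided by the DCG of an ideal ordering (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recs[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (not taken from the notebook): 2 of the 3 relevant items are recommended.
toy_recs = [10, 20, 30, 40, 50]
toy_relevant = {20, 50, 99}
print(sketch_precision_at_k(toy_recs, toy_relevant, 5))  # 2 hits out of 5 slots
print(sketch_recall_at_k(toy_recs, toy_relevant, 5))     # 2 hits out of 3 relevant items
print(sketch_ndcg_at_k(toy_recs, toy_relevant, 5))       # hits at ranks 2 and 5 are discounted
```

Precision and recall ignore the position of a hit inside the top-k, whereas NDCG rewards hits placed near the top, which is why the results table is sorted on `ndcg@k` first.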


In [464]:
candidate_items = train_df["article_id"].unique().tolist()

per_user_topk = {
    res["model"]: res.get("all_recommendations", [])
    for res in results
}

def coverage_from_topk(rec_lists, candidates, k):
    pool = set()
    for rec in rec_lists:
        pool.update(rec[:k])
    return len(pool) / len(candidates) if candidates else 0.0

coverage_by_model = {
    label: coverage_from_topk(rec_lists, candidate_items, K)
    for label, rec_lists in per_user_topk.items()
    if rec_lists
}


In [465]:
# Aggregate the metrics once training has finished
clean_columns = [
    "model",
    "users",
    "precision@k",
    "recall@k",
    "map@k",
    "ndcg@k",
    "hitrate@k",
    "latency_per_user_s",
]

results_df = pd.DataFrame(results)
results_df["coverage@k"] = results_df["model"].map(coverage_by_model)
results_df = (
    results_df[clean_columns + ["coverage@k"]]
    .drop_duplicates(subset=["model"])
    .sort_values(["ndcg@k", "map@k"], ascending=False)
    .reset_index(drop=True)
)

display(results_df)


Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,latency_per_user_s,coverage@k
0,Baseline Surprise - NormalPredictor,1300,0.057077,0.107896,0.046245,0.078194,0.243077,0.000134,0.029213
1,Modèle KNNBasic item-based (Pearson baseline),1300,0.048154,0.089278,0.046528,0.072609,0.200769,0.001411,0.61573
2,Modèle SVD Surprise (facteurs latents),1300,0.032308,0.060513,0.032275,0.051923,0.156154,0.000999,0.552809


In [466]:
baseline_label = "Baseline Surprise - NormalPredictor"
knn_label = "Modèle KNNBasic item-based (Pearson baseline)"
svd_label = "Modèle SVD Surprise (facteurs latents)"

comparison_df = results_df[results_df["model"].isin([baseline_label, knn_label, svd_label])].reset_index(drop=True)
comparison_df


Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,latency_per_user_s,coverage@k
0,Baseline Surprise - NormalPredictor,1300,0.057077,0.107896,0.046245,0.078194,0.243077,0.000134,0.029213
1,Modèle KNNBasic item-based (Pearson baseline),1300,0.048154,0.089278,0.046528,0.072609,0.200769,0.001411,0.61573
2,Modèle SVD Surprise (facteurs latents),1300,0.032308,0.060513,0.032275,0.051923,0.156154,0.000999,0.552809


In [467]:
# Quick comparison of the top-5 for one user
sample_user = eval_users[0] if eval_users else None
if sample_user is None:
    print("No user available for comparison")
else:
    seen = set(train_histories.get(sample_user, []))
    print(f"Test user: {sample_user}")
    print("KNNBasic item-based:", itemknn_recommender(sample_user, seen, 5))
    print("Collaborative SVD:", svd_recommender(sample_user, seen, 5))


Test user: 22
KNNBasic item-based: [336972, 60258, 33497, 33583, 15166]
Collaborative SVD: [29727, 177507, 216304, 236179, 95645]


In [468]:
results_steps = (
    results_df
    .sort_values(["ndcg@k", "precision@k"], ascending=False)
    .reset_index(drop=True)
)
print("Comparison table of the Surprise models (sorted on ndcg@k, then precision@k):")
results_steps


Comparison table of the Surprise models (sorted on ndcg@k, then precision@k):


Unnamed: 0,model,users,precision@k,recall@k,map@k,ndcg@k,hitrate@k,latency_per_user_s,coverage@k
0,Baseline Surprise - NormalPredictor,1300,0.057077,0.107896,0.046245,0.078194,0.243077,0.000134,0.029213
1,Modèle KNNBasic item-based (Pearson baseline),1300,0.048154,0.089278,0.046528,0.072609,0.200769,0.001411,0.61573
2,Modèle SVD Surprise (facteurs latents),1300,0.032308,0.060513,0.032275,0.051923,0.156154,0.000999,0.552809


In [469]:
# Detailed metrics: hit rate, lifts vs baseline and history cohorts
train_click_count = train_df.groupby("user_id").size().to_dict()

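def assign_cohort(clicks: int) -> str:
    # A user missing from train_click_count gets 0 clicks; keep that user in the
    # lowest-history cohort instead of falling through to "10+ clicks".
    if clicks <= 2:
        return "1-2 clicks"
    if clicks <= 9:
        return "3-9 clicks"
    return "10+ clicks"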

user_cohort = {user_id: assign_cohort(train_click_count.get(user_id, 0)) for user_id in eval_users}
coverage_lookup = {res["model"]: res.get("coverage@k", np.nan) for res in results}
recommendations_by_model = {res["model"]: res.get("all_recommendations", []) for res in results}

def safe_lift(value: float, baseline: float) -> float:
    if baseline is None or baseline == 0:
        return np.nan
    return value / baseline

cohort_rows = []
for model_name, recs in recommendations_by_model.items():
    buckets = {
        "ALL": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "1-2 clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "3-9 clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
        "10+ clicks": {"precisions": [], "recalls": [], "ndcgs": [], "hits": 0, "users": 0},
    }

    for user_id, recs_user in zip(eval_users, recs):
        gt = ground_truth[user_id]
        metrics = {
            "precision": precision_at_k(recs_user, gt, K),
            "recall": recall_at_k(recs_user, gt, K),
            "ndcg": ndcg_at_k(recs_user, gt, K),
            "hit": 1 if set(recs_user[:K]) & set(gt) else 0,
        }
        labels = ["ALL", user_cohort[user_id]]
        for label in labels:
            bucket = buckets[label]
            bucket["precisions"].append(metrics["precision"])
            bucket["recalls"].append(metrics["recall"])
            bucket["ndcgs"].append(metrics["ndcg"])
            bucket["hits"] += metrics["hit"]
            bucket["users"] += 1

    for cohort, bucket in buckets.items():
        users = bucket["users"]
        cohort_rows.append(
            {
                "model": model_name,
                "cohort": cohort,
                "users": users,
                "precision@k": float(np.mean(bucket["precisions"])) if users else 0.0,
                "recall@k": float(np.mean(bucket["recalls"])) if users else 0.0,
                "ndcg@k": float(np.mean(bucket["ndcgs"])) if users else 0.0,
                "hitrate@k": bucket["hits"] / users if users else 0.0,
                "coverage@k": coverage_lookup.get(model_name, np.nan),
            }
        )

cohort_df = pd.DataFrame(cohort_rows)
baseline_rows = cohort_df[cohort_df["model"] == baseline_label].set_index("cohort")
for metric in ["precision@k", "recall@k", "ndcg@k"]:
    cohort_df[f"lift_{metric}_vs_baseline"] = cohort_df.apply(
        lambda row: safe_lift(
            row[metric],
            float(baseline_rows.loc[row["cohort"], metric])
            if row["cohort"] in baseline_rows.index
            else np.nan,
        ),
        axis=1,
    )

cohort_df = cohort_df.sort_values(["cohort", "ndcg@k", "precision@k"], ascending=[True, False, False]).reset_index(drop=True)
cohort_df


Unnamed: 0,model,cohort,users,precision@k,recall@k,ndcg@k,hitrate@k,coverage@k,lift_precision@k_vs_baseline,lift_recall@k_vs_baseline,lift_ndcg@k_vs_baseline
0,Baseline Surprise - NormalPredictor,1-2 clicks,718,0.057382,0.109562,0.078818,0.246518,0.029213,1.0,1.0,1.0
1,Modèle KNNBasic item-based (Pearson baseline),1-2 clicks,718,0.046518,0.089898,0.06994,0.193593,0.61573,0.81068,0.820519,0.88736
2,Modèle SVD Surprise (facteurs latents),1-2 clicks,718,0.032591,0.064929,0.053582,0.157382,0.552809,0.567961,0.592621,0.679818
3,Modèle KNNBasic item-based (Pearson baseline),10+ clicks,33,0.090909,0.158923,0.135072,0.363636,0.61573,3.0,4.140351,3.466949
4,Modèle SVD Surprise (facteurs latents),10+ clicks,33,0.048485,0.063721,0.066175,0.242424,0.552809,1.6,1.660088,1.698535
5,Baseline Surprise - NormalPredictor,10+ clicks,33,0.030303,0.038384,0.03896,0.151515,0.029213,1.0,1.0,1.0
6,Baseline Surprise - NormalPredictor,3-9 clicks,549,0.058288,0.109895,0.079737,0.24408,0.029213,1.0,1.0,1.0
7,Modèle KNNBasic item-based (Pearson baseline),3-9 clicks,549,0.047723,0.084282,0.072345,0.200364,0.61573,0.81875,0.766928,0.907303
8,Modèle SVD Surprise (facteurs latents),3-9 clicks,549,0.030965,0.054545,0.048897,0.149362,0.552809,0.53125,0.496336,0.613228
9,Baseline Surprise - NormalPredictor,ALL,1300,0.057077,0.107896,0.078194,0.243077,0.029213,1.0,1.0,1.0
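
As a worked example of the lift columns, the sketch below recomputes two lifts from values copied out of the table above. The `lift` helper is a hypothetical stand-in that mirrors the notebook's `safe_lift`. Because the table values are rounded to six decimals, the results match the printed lifts only approximately.

```python
import math


def lift(value: float, baseline: float) -> float:
    """Ratio of a model metric to the baseline metric; NaN when the baseline is 0."""
    if baseline == 0:
        return math.nan
    return value / baseline


# precision@k for the "10+ clicks" cohort (rows 3 and 5 of the table above)
knn_precision = 0.090909
baseline_precision = 0.030303
knn_lift = lift(knn_precision, baseline_precision)
print(round(knn_lift, 2))  # 3.0

# precision@k for the "1-2 clicks" cohort (rows 2 and 0 of the table above)
svd_lift = lift(0.032591, 0.057382)
print(round(svd_lift, 3))
```

A lift above 1 means the model beats the baseline within that cohort, and a lift below 1 means it trails it. Here KNNBasic roughly triples the baseline precision for users with 10+ clicks, while SVD stays below the baseline for users with 1-2 clicks.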


## Analysis & MVP model choice

The ranking highlights trade-offs:
- **Relevance**: global popularity (the `Baseline Surprise - NormalPredictor` row) has the best NDCG@5 (0.0782 vs 0.0726 for KNNBasic) and is essentially tied on MAP@5 (0.0462 vs 0.0465), a sign that sorting by volume remains hard to beat on this small synthetic dataset.
- **Diversity**: item2item (KNNBasic) covers about 62% of the candidate articles against about 3% for the baseline, roughly 21 times more, which reduces the risk of a filter-bubble effect.
- **Latency**: every approach is fast, between 0.13 ms and 1.4 ms per user, with popularity remaining the simplest.

The MVP choice goes to global popularity only if the goal is maximum relevance and the quickest deployment. For a product, it would be worth testing a hybrid: start with popularity for new users, then switch to item2item once history builds up, to increase coverage without sacrificing quality. The cohort table points the same way: for the 33 users with 10+ training clicks, KNNBasic reaches an NDCG@5 of 0.1351 against 0.0390 for the baseline, whereas the baseline leads in the two shorter-history cohorts.
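
The hybrid idea above (popularity first, item2item once history builds up) could be sketched as below. Everything here is illustrative: `make_hybrid_recommender`, the toy recommenders and the `min_history=10` threshold are assumptions. The threshold is chosen only because the cohort table shows KNNBasic ahead of the baseline in the "10+ clicks" cohort. The `(user_id, seen, k)` signature follows the way `itemknn_recommender` and `svd_recommender` are called in this notebook.

```python
from typing import Callable, Dict, List, Set

# Same call shape as the notebook's recommenders: (user_id, seen_items, k) -> item ids
Recommender = Callable[[int, Set[int], int], List[int]]


def make_hybrid_recommender(
    popularity_rec: Recommender,
    item2item_rec: Recommender,
    history_sizes: Dict[int, int],
    min_history: int = 10,  # assumed switch point, to be tuned on validation data
) -> Recommender:
    """Route short-history users to popularity and the others to item2item."""

    def recommend(user_id: int, seen: Set[int], k: int) -> List[int]:
        if history_sizes.get(user_id, 0) < min_history:
            return popularity_rec(user_id, seen, k)
        recs = item2item_rec(user_id, seen, k)
        if len(recs) < k:
            # Backfill with popular items when item2item returns too few candidates.
            extra = popularity_rec(user_id, seen | set(recs), k)
            recs = recs + extra[: k - len(recs)]
        return recs

    return recommend


# Toy recommenders standing in for the real models (hypothetical data).
POPULAR = [1, 2, 3, 4, 5, 6]
NEIGHBOURS = {7: [10, 11]}


def toy_popularity(user_id: int, seen: Set[int], k: int) -> List[int]:
    return [item for item in POPULAR if item not in seen][:k]


def toy_item2item(user_id: int, seen: Set[int], k: int) -> List[int]:
    return [item for item in NEIGHBOURS.get(user_id, []) if item not in seen][:k]


hybrid = make_hybrid_recommender(toy_popularity, toy_item2item, {7: 12, 8: 1})
print(hybrid(8, {1}, 3))  # short history: popularity only -> [2, 3, 4]
print(hybrid(7, {1}, 3))  # long history: item2item, then backfill -> [10, 11, 2]
```

Keeping the switch inside a single callable means the hybrid can be scored with the same evaluation loop as the other models, so its coverage and NDCG can be compared on equal terms.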

In [470]:

best_row = results_df.iloc[0]
justification = f"""
## MVP model choice

Selected model: **{best_row['model']}**

Main reasons:
- NDCG@5 = {best_row['ndcg@k']:.4f}, MAP@5 = {best_row['map@k']:.4f}, Precision@5 = {best_row['precision@k']:.4f}, Recall@5 = {best_row['recall@k']:.4f}
- Coverage = {best_row['coverage@k']:.4f} over {len(candidate_items)} candidate articles.
- Mean latency per user = {best_row['latency_per_user_s']:.6f} s (CPU).
- Complexity: implementation {'optimized via Surprise (SVD/KNN)' if 'SVD' in best_row['model'] else 'based on Surprise'}, compatible with Azure Functions.
- User cold start handled via global popularity.

Note: adjust `content_pca_components` to shrink the embeddings in production if needed.
"""
choice_path = Path(CONFIG["artifacts_dir"]) / "model_choice.md"
# Explicit encoding so the file is written identically on every platform
choice_path.write_text(justification, encoding="utf-8")
print(justification)



## MVP model choice

Selected model: **Baseline Surprise - NormalPredictor**

Main reasons:
- NDCG@5 = 0.0782, MAP@5 = 0.0462, Precision@5 = 0.0571, Recall@5 = 0.1079
- Coverage = 0.0292 over 445 candidate articles.
- Mean latency per user = 0.000134 s (CPU).
- Complexity: implementation based on Surprise, compatible with Azure Functions.
- User cold start handled via global popularity.

Note: adjust `content_pca_components` to shrink the embeddings in production if needed.



In [471]:

results_path_csv = Path(CONFIG["artifacts_dir"]) / "results.csv"
results_path_json = Path(CONFIG["artifacts_dir"]) / "results.json"
results_df.to_csv(results_path_csv, index=False)
results_df.to_json(results_path_json, orient="records", lines=True)
print(f"Results saved to {results_path_csv} and {results_path_json}")


Results saved to ../artifacts/evaluation/results.csv and ../artifacts/evaluation/results.json


### Deployment (application and Azure Functions)

The **SVD Surprise** model is exported for the Flask application and the Azure Function. The
hyperparameters mirror the notebook configuration (latent factors, `lr_all`, `reg_all`), while
the KNN model remains available for local comparison.


## Conclusion

This notebook shows how to compare recommendation strategies with a reproducible procedure: temporal split, training, multi-metric evaluation and saving of results. The experiments show that global popularity remains a safe starting point, but that more personalized models (item2item or SVD) bring diversity as soon as history is available. The natural next steps are to run the tests on the real Kaggle data, add business metrics (simulated click-through rate, coverage per category) and prototype a popularity + item2item hybrid in an Azure Function to validate behavior in production.