# Évaluation d'un système de recommandation sur Globocom

Notebook autonome pour entraîner et comparer plusieurs approches de recommandation sur le dataset Kaggle **news-portal-user-interactions-by-globocom**.

In [1]:

# Imports & Config
from __future__ import annotations
import json
import os
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Configuration
CONFIG = {
    "clicks_path": "data/clicks_sample.csv",
    "metadata_path": "data/articles_metadata.csv",
    "artifacts_dir": "artifacts/evaluation",
    "k": 5,
    "train_ratio": 0.8,
    "recent_window_days": 7,
    "random_seed": 42,
    "svd_components": 64,
    "content_pca_components": None,
}
np.random.seed(CONFIG["random_seed"])
Path(CONFIG["artifacts_dir"]).mkdir(parents=True, exist_ok=True)
print("Config ready", CONFIG)


Config ready {'clicks_path': 'data/clicks_sample.csv', 'metadata_path': 'data/articles_metadata.csv', 'artifacts_dir': 'artifacts/evaluation', 'k': 5, 'train_ratio': 0.8, 'recent_window_days': 7, 'random_seed': 42, 'svd_components': 64, 'content_pca_components': None}


## Contexte

Ce notebook compare plusieurs stratégies de recommandation pour choisir un Top-5 d'articles par utilisateur. Les textes sont en français tandis que le code et les commentaires restent en anglais.

## Données

Les fichiers attendus sont situés dans `data/`. Si un fichier n'est pas trouvé, un jeu de clics synthétique minimal est généré automatiquement pour que le notebook reste exécutable (les messages expliquent comment remplacer par les données Kaggle).

In [2]:
# Load data utilities

def detect_timestamp_column(df: pd.DataFrame) -> str:
    """Detect the timestamp-like column name."""
    candidates = ["timestamp", "click_timestamp", "event_time", "ts", "time"]
    for col in df.columns:
        if col.lower() in candidates:
            return col
    raise ValueError("No timestamp-like column found. Expected one of: " + ",".join(candidates))


def detect_article_column(df: pd.DataFrame) -> str:
    """Detect the article/item column name."""
    candidates = ["article_id", "click_article_id", "item_id", "content_id"]
    for col in df.columns:
        if col in candidates:
            return col
    raise ValueError("No article id column found. Expected one of: " + ",".join(candidates))


def create_synthetic_clicks(path: str, n_users: int = 50, n_items: int = 120, days: int = 30, interactions_per_user: int = 25) -> pd.DataFrame:
    """Create a small synthetic clicks dataset to keep the notebook runnable."""
    rng = np.random.default_rng(CONFIG["random_seed"])
    start = pd.Timestamp("2022-01-01")
    records = []
    for user in range(1, n_users + 1):
        offsets = rng.integers(0, days, size=interactions_per_user)
        timestamps = [start + pd.Timedelta(int(o), unit="D") for o in sorted(offsets.tolist())]
        articles = rng.integers(1, n_items + 1, size=interactions_per_user)
        for ts, art in zip(timestamps, articles):
            records.append({"user_id": int(user), "article_id": int(art), "timestamp": ts})
    df = pd.DataFrame(records).sort_values("timestamp").reset_index(drop=True)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    print(
        f"Synthetic clicks dataset created at {path} "
        f"(users={n_users}, items={n_items}, interactions={len(df)})"
    )
    return df


def load_clicks(path: str) -> pd.DataFrame:
    """Load clicks data with robust timestamp parsing; auto-generate if missing."""
    if not os.path.exists(path):
        print(f"Clicks file not found at {path}. Generating a synthetic sample for demonstration.")
        return create_synthetic_clicks(path)
    df = pd.read_csv(path)
    ts_col = detect_timestamp_column(df)
    article_col = detect_article_column(df)
    df[ts_col] = pd.to_datetime(df[ts_col])
    required_cols = {"user_id", article_col, ts_col}
    if not required_cols.issubset(df.columns):
        raise ValueError(f"Missing columns. Found {df.columns}, expected at least {required_cols}")
    df = df.rename(columns={ts_col: "timestamp", article_col: "article_id"})
    df = df[["user_id", "article_id", "timestamp"]]
    df = df.sort_values("timestamp").reset_index(drop=True)
    return df


def load_metadata(path: str) -> Optional[pd.DataFrame]:
    """Load article metadata if available."""
    if not os.path.exists(path):
        print(f"Metadata file not found at {path}. Falling back to co-visitation content model.")
        return None
    meta = pd.read_csv(path)
    if "article_id" not in meta.columns:
        print("Metadata missing 'article_id' column. Ignoring metadata.")
        return None
    return meta


clicks = load_clicks(CONFIG["clicks_path"])
metadata = load_metadata(CONFIG["metadata_path"])
print(clicks.head())
print("Metadata loaded:", metadata is not None)


Clicks file not found at data/clicks_sample.csv. Generating a synthetic sample for demonstration.
Synthetic clicks dataset created at data/clicks_sample.csv (users=50, items=120, interactions=1250)
Metadata file not found at data/articles_metadata.csv. Falling back to co-visitation content model.
   user_id  article_id  timestamp
0        6          58 2022-01-01
1       17          11 2022-01-01
2       17          82 2022-01-01
3       38          15 2022-01-01
4        7          28 2022-01-01
Metadata loaded: False


## Protocole

1. Tri des interactions par horodatage.
2. Split temporel train/test selon `train_ratio`.
3. Profil utilisateur: interactions de train.
4. Ground truth: articles cliqués en test pour chaque utilisateur (au moins 1).
5. Recommandations Top-5 en excluant les articles déjà vus en train.
6. Calcul des métriques de ranking (Precision@5, Recall@5, MAP@5, NDCG@5, Coverage@5) et estimation de la latence moyenne sur un échantillon de 500 utilisateurs max.

In [3]:

# Split and utility functions

def temporal_train_test_split(df: pd.DataFrame, train_ratio: float) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split interactions chronologically according to the train_ratio."""
    cutoff = int(len(df) * train_ratio)
    train = df.iloc[:cutoff].copy()
    test = df.iloc[cutoff:].copy()
    return train, test


def build_user_histories(df: pd.DataFrame) -> Dict[int, List[int]]:
    """Create mapping user -> list of articles in chronological order."""
    histories: Dict[int, List[int]] = {}
    for user_id, group in df.groupby("user_id"):
        histories[int(user_id)] = group.sort_values("timestamp")["article_id"].tolist()
    return histories


def get_candidate_items(df: pd.DataFrame) -> List[int]:
    """Return unique article ids."""
    return df["article_id"].unique().tolist()


def make_ground_truth(train: pd.DataFrame, test: pd.DataFrame) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Build user histories and ground truth for evaluation."""
    train_hist = build_user_histories(train)
    test_hist = build_user_histories(test)
    eligible_users = {u: items for u, items in test_hist.items() if u in train_hist and len(items) > 0}
    return train_hist, eligible_users


train_df, test_df = temporal_train_test_split(clicks, CONFIG["train_ratio"])
train_histories, ground_truth = make_ground_truth(train_df, test_df)
candidate_items = get_candidate_items(train_df)
print(f"Train size: {len(train_df)}, Test size: {len(test_df)}, Users for eval: {len(ground_truth)}")


Train size: 1000, Test size: 250, Users for eval: 50


## Modèles évalués

* **Baseline A** : popularité globale (Top-K des articles les plus cliqués en train).
* **Baseline B** : popularité récente (Top-K sur les N derniers jours de train).
* **Modèle C** : item-to-item basé sur le contenu si disponible, sinon co-visitation (co-occurrence dans l'historique).
* **Modèle D** : filtrage collaboratif implicite (factorisation via SVD sur matrice user-item binaire).

In [4]:

# Metrics

def precision_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Precision@k for a single user."""
    if not recommended:
        return 0.0
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k


def recall_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Recall@k for a single user."""
    if not relevant:
        return 0.0
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / len(relevant)


def average_precision_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """MAP@k for a single user."""
    if not relevant:
        return 0.0
    score = 0.0
    hits = 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)


def dcg_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Discounted cumulative gain."""
    dcg = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            dcg += 1 / np.log2(i + 1)
    return dcg


def ndcg_at_k(recommended: List[int], relevant: List[int], k: int) -> float:
    """Normalized DCG."""
    ideal_dcg = dcg_at_k(relevant[:k], relevant, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(recommended, relevant, k) / ideal_dcg


def coverage_at_k(all_recommendations: List[List[int]], candidate_items: List[int], k: int) -> float:
    """Coverage of unique recommended items over candidates."""
    rec_items = set()
    for rec in all_recommendations:
        rec_items.update(rec[:k])
    if not candidate_items:
        return 0.0
    return len(rec_items) / len(candidate_items)


In [5]:

# Recommenders

def build_global_popularity(train: pd.DataFrame) -> List[int]:
    """Return items sorted by global click counts."""
    return train.groupby("article_id").size().sort_values(ascending=False).index.tolist()


def build_recent_popularity(train: pd.DataFrame, window_days: int) -> List[int]:
    """Return popular items over the last window_days of training data."""
    max_time = train["timestamp"].max()
    window_start = max_time - pd.Timedelta(days=window_days)
    recent = train[train["timestamp"] >= window_start]
    return recent.groupby("article_id").size().sort_values(ascending=False).index.tolist()


def build_covisit_graph(train: pd.DataFrame) -> Dict[int, Dict[int, int]]:
    """Build co-visitation counts based on user histories."""
    graph: Dict[int, Dict[int, int]] = {}
    for _, group in train.groupby("user_id"):
        items = group.sort_values("timestamp")["article_id"].tolist()
        unique_items = list(dict.fromkeys(items))
        for i, item_i in enumerate(unique_items):
            graph.setdefault(item_i, {})
            for item_j in unique_items[i + 1 :]:
                graph[item_i][item_j] = graph[item_i].get(item_j, 0) + 1
                graph.setdefault(item_j, {})
                graph[item_j][item_i] = graph[item_j].get(item_i, 0) + 1
    return graph


def build_content_embeddings(metadata: pd.DataFrame, pca_components: Optional[int] = None):
    """Create TF-IDF embeddings from textual columns with optional PCA reduction."""
    text_cols = [c for c in metadata.columns if metadata[c].dtype == object and c != "article_id"]
    if not text_cols:
        raise ValueError("No textual columns in metadata")
    corpus = metadata[text_cols].fillna("").astype(str).agg(" ".join, axis=1)
    vectorizer = TfidfVectorizer(max_features=5000)
    tfidf = vectorizer.fit_transform(corpus)
    if pca_components and pca_components < tfidf.shape[1]:
        svd = TruncatedSVD(n_components=pca_components, random_state=CONFIG["random_seed"])
        reduced = svd.fit_transform(tfidf)
        embeddings = normalize(reduced)
    else:
        embeddings = normalize(tfidf)
    ids = metadata["article_id"].tolist()
    return embeddings, ids


def build_item_similarity(train: pd.DataFrame, metadata: Optional[pd.DataFrame]):
    """Build item-to-item similarity either from content or co-visitation."""
    if metadata is not None:
        try:
            embeddings, ids = build_content_embeddings(metadata, CONFIG["content_pca_components"])
            similarity: Dict[int, Dict[int, float]] = {}
            for i, aid in enumerate(ids):
                sims = embeddings @ embeddings[i].T
                sims = np.asarray(sims).flatten()
                top_idx = np.argsort(-sims)[1:51]
                similarity[aid] = {ids[j]: float(sims[j]) for j in top_idx if sims[j] > 0}
            return similarity, "content"
        except Exception as exc:
            print(f"Content embeddings failed ({exc}). Falling back to co-visitation.")
    graph = build_covisit_graph(train)
    similarity = {item: {nbr: float(cnt) for nbr, cnt in neigh.items()} for item, neigh in graph.items()}
    return similarity, "covisitation"


def recommend_from_similarity(user_id: int, train_histories: Dict[int, List[int]], similarity: Dict[int, Dict[int, float]], candidate_items: List[int], k: int) -> List[int]:
    """Aggregate similarity scores from user's history."""
    seen = set(train_histories.get(user_id, []))
    scores: Dict[int, float] = {}
    for item in seen:
        for neighbor, sim in similarity.get(item, {}).items():
            if neighbor in seen:
                continue
            scores[neighbor] = scores.get(neighbor, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    recs = [it for it, _ in ranked if it not in seen]
    if len(recs) < k:
        for c in candidate_items:
            if c not in seen and c not in recs:
                recs.append(c)
            if len(recs) >= k:
                break
    return recs[:k]


def build_collaborative_svd(train: pd.DataFrame, n_components: int):
    """Train a simple implicit SVD recommender returning a recommend function."""
    users = train["user_id"].unique().tolist()
    items = train["article_id"].unique().tolist()
    user_to_idx = {u: i for i, u in enumerate(users)}
    item_to_idx = {it: i for i, it in enumerate(items)}

    rows = [user_to_idx[u] for u in train["user_id"]]
    cols = [item_to_idx[it] for it in train["article_id"]]
    data = np.ones(len(rows))
    mat = sparse.coo_matrix((data, (rows, cols)), shape=(len(users), len(items))).tocsr()

    svd = TruncatedSVD(n_components=n_components, random_state=CONFIG["random_seed"])
    user_factors = svd.fit_transform(mat)
    item_factors = svd.components_.T

    user_norm = normalize(user_factors)
    item_norm = normalize(item_factors)

    def recommend(user_id: int, seen: set, k: int) -> List[int]:
        if user_id not in user_to_idx:
            popularity = build_global_popularity(train)
            return [it for it in popularity if it not in seen][:k]
        u_vec = user_norm[user_to_idx[user_id]]
        scores = item_norm @ u_vec
        ranked_items = [items[i] for i in np.argsort(-scores)]
        return [it for it in ranked_items if it not in seen][:k]

    meta = {"users": len(users), "items": len(items), "components": n_components}
    return recommend, meta


In [6]:

# Evaluation pipeline

def evaluate_model(
    name: str,
    recommend_func: Callable[[int, set, int], List[int]],
    train_histories: Dict[int, List[int]],
    ground_truth: Dict[int, List[int]],
    candidate_items: List[int],
    k: int,
    latency_sample: int = 500,
) -> Dict[str, float]:
    """Evaluate a recommender with ranking metrics and latency estimation."""
    precisions: List[float] = []
    recalls: List[float] = []
    maps: List[float] = []
    ndcgs: List[float] = []
    all_recs: List[List[int]] = []

    users = list(ground_truth.keys())
    for user_id in users:
        seen = set(train_histories.get(user_id, []))
        recs = recommend_func(user_id, seen, k)
        gt = ground_truth[user_id]
        all_recs.append(recs)
        precisions.append(precision_at_k(recs, gt, k))
        recalls.append(recall_at_k(recs, gt, k))
        maps.append(average_precision_at_k(recs, gt, k))
        ndcgs.append(ndcg_at_k(recs, gt, k))

    coverage = coverage_at_k(all_recs, candidate_items, k)

    sample_users = users[: min(latency_sample, len(users))]
    start = time.perf_counter()
    for user_id in sample_users:
        seen = set(train_histories.get(user_id, []))
        _ = recommend_func(user_id, seen, k)
    latency = (time.perf_counter() - start) / max(1, len(sample_users))

    return {
        "model": name,
        "users": len(users),
        "precision@k": float(np.mean(precisions)),
        "recall@k": float(np.mean(recalls)),
        "map@k": float(np.mean(maps)),
        "ndcg@k": float(np.mean(ndcgs)),
        "coverage@k": coverage,
        "latency_per_user_s": latency,
    }


In [7]:

# Train recommenders
K = CONFIG["k"]

popularity_rank = build_global_popularity(train_df)

def popularity_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return [it for it in popularity_rank if it not in seen][:k]

recent_rank = build_recent_popularity(train_df, CONFIG["recent_window_days"])

def recent_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return [it for it in recent_rank if it not in seen][:k]

item_similarity, sim_mode = build_item_similarity(train_df, metadata)

def content_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return recommend_from_similarity(user_id, train_histories, item_similarity, candidate_items, k)

collab_recommend, collab_meta = build_collaborative_svd(train_df, CONFIG["svd_components"])

def collaborative_recommender(user_id: int, seen: set, k: int) -> List[int]:
    return collab_recommend(user_id, seen, k)


## Résultats

Les métriques sont calculées sur les utilisateurs présents en test avec au moins un clic et un historique en train.

In [8]:

results = []
results.append(evaluate_model("Baseline A - Popularité globale", popularity_recommender, train_histories, ground_truth, candidate_items, K))
results.append(evaluate_model(f"Baseline B - Popularité {CONFIG['recent_window_days']}j", recent_recommender, train_histories, ground_truth, candidate_items, K))
results.append(evaluate_model(f"Modèle C - Item2Item ({sim_mode})", content_recommender, train_histories, ground_truth, candidate_items, K))
results.append(evaluate_model("Modèle D - Collaborative SVD", collaborative_recommender, train_histories, ground_truth, candidate_items, K))

results_df = pd.DataFrame(results)
results_df = results_df.sort_values(["ndcg@k", "map@k"], ascending=False).reset_index(drop=True)
print(results_df)


                                 model  users  precision@k  recall@k  \
0      Baseline A - Popularité globale     50        0.048  0.045357   
1  Modèle C - Item2Item (covisitation)     50        0.048  0.038103   
2           Baseline B - Popularité 7j     50        0.044  0.042579   
3         Modèle D - Collaborative SVD     50        0.024  0.023079   

      map@k    ndcg@k  coverage@k  latency_per_user_s  
0  0.028633  0.053961    0.083333            0.000007  
1  0.021067  0.044166    0.258333            0.000220  
2  0.019967  0.044085    0.091667            0.000006  
3  0.016667  0.029087    0.783333            0.000027  


## Analyse & choix du modèle MVP

Comparer les scores NDCG@5, MAP@5, couverture et latence pour sélectionner le meilleur compromis pour un déploiement Azure Functions. La justification est stockée dans `artifacts/evaluation/model_choice.md`.

In [9]:

best_row = results_df.iloc[0]
justification = f"""
## Choix du modèle MVP

Modèle retenu : **{best_row['model']}**

Motifs principaux :
- NDCG@5 = {best_row['ndcg@k']:.4f}, MAP@5 = {best_row['map@k']:.4f}, Precision@5 = {best_row['precision@k']:.4f}, Recall@5 = {best_row['recall@k']:.4f}
- Couverture = {best_row['coverage@k']:.4f} sur {len(candidate_items)} articles candidats.
- Latence moyenne par utilisateur = {best_row['latency_per_user_s']:.6f} s (CPU).
- Complexité : implémentation {('légère (contenu/co-visitation)' if 'Item2Item' in best_row['model'] else 'linéaire en dimensions SVD')} compatible avec Azure Functions.
- Gestion du cold-start utilisateur via popularité globale.

Note : ajuster `content_pca_components` pour réduire la taille des embeddings en production si nécessaire.
"""
choice_path = Path(CONFIG["artifacts_dir"]) / "model_choice.md"
choice_path.write_text(justification)
print(justification)



## Choix du modèle MVP

Modèle retenu : **Baseline A - Popularité globale**

Motifs principaux :
- NDCG@5 = 0.0540, MAP@5 = 0.0286, Precision@5 = 0.0480, Recall@5 = 0.0454
- Couverture = 0.0833 sur 120 articles candidats.
- Latence moyenne par utilisateur = 0.000007 s (CPU).
- Complexité : implémentation linéaire en dimensions SVD compatible avec Azure Functions.
- Gestion du cold-start utilisateur via popularité globale.

Note : ajuster `content_pca_components` pour réduire la taille des embeddings en production si nécessaire.



In [10]:

results_path_csv = Path(CONFIG["artifacts_dir"]) / "results.csv"
results_path_json = Path(CONFIG["artifacts_dir"]) / "results.json"
results_df.to_csv(results_path_csv, index=False)
results_df.to_json(results_path_json, orient="records", lines=True)
print(f"Résultats sauvegardés dans {results_path_csv} et {results_path_json}")


Résultats sauvegardés dans artifacts/evaluation/results.csv et artifacts/evaluation/results.json
