# PART 1: Recommendation code (based on the LDA model built in Question 4)


## Step 1: Import and extract the LDA model






In this step, we:

1. **Upload the ZIP file** containing the model and the data.
2. **Extract** its contents into the `/content` folder.
3. Check that the `Q5_model_LDA` directory was created and contains:
   - `lda_model.joblib`
   - `vectorizer.joblib`
   - `playlist_topic_matrix.joblib`
   - `df_playlists.parquet`
   - `df_pl_tracks.parquet`

These files are used in the following steps to load:
- the LDA model,
- the vectorizer,
- the playlist-topic matrix,
- and the DataFrames needed for the analysis.
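
The loading cell reads the files directly, so a missing file only surfaces as a load error. As a minimal sketch, the check from item 3 could be made explicit as follows; the helper name `missing_model_files` is hypothetical, and the file names are the five listed above.

```python
import os

# File names taken from the list above.
EXPECTED_FILES = [
    "lda_model.joblib",
    "vectorizer.joblib",
    "playlist_topic_matrix.joblib",
    "df_playlists.parquet",
    "df_pl_tracks.parquet",
]

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir (hypothetical helper)."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

Calling `missing_model_files("/content/Q5_model_LDA")` right after extraction should return an empty list if the archive is complete.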


In [None]:
import zipfile
import os
import joblib
import pandas as pd

# 1) Path to the ZIP file (after uploading it to Colab)
zip_path = "/content/Q5_model_LDA.zip"

# 2) Extract into /content
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall("/content")

# 3) Directory that holds the extracted files (Q5_model_LDA)
MODEL_DIR = "/content/Q5_model_LDA"

# 4) Load the models and the DataFrames
lda = joblib.load(os.path.join(MODEL_DIR, "lda_model.joblib"))
vectorizer = joblib.load(os.path.join(MODEL_DIR, "vectorizer.joblib"))
playlist_topic_matrix = joblib.load(os.path.join(MODEL_DIR, "playlist_topic_matrix.joblib"))

df_playlists = pd.read_parquet(os.path.join(MODEL_DIR, "df_playlists.parquet"))
df_pl_tracks = pd.read_parquet(os.path.join(MODEL_DIR, "df_pl_tracks.parquet"))

print(df_playlists.shape, df_pl_tracks.shape)
df_playlists.head()


(50000, 3) (3353026, 4)


Unnamed: 0,pid,playlist_name,text
0,0,Throwbacks,Throwbacks Lose Control (feat. Ciara & Fat Ma...
1,1,Awesome Playlist,Awesome Playlist Eye of the Tiger Survivor Ey...
2,2,korean,korean Like You Hoody On And On GOOD (feat. E...
3,3,mat,mat Danse macabre Camille Saint-Saëns French ...
4,4,90s,"90s Tonight, Tonight The Smashing Pumpkins Me..."


## Step 2: Build pl2tracks and track_meta




This cell creates two objects that the recommendation relies on:

1. pl2tracks:
   - a dictionary {pid → set(track_uri)}
   - used to exclude tracks that are already in the playlist
   - keeps scoring fast, because membership tests on a set are constant time

2. track_meta:
   - a table containing {track_uri, track_name, artist_name}
   - used to display readable recommendations (titles and artists)

Both objects are required by the recommendation function.
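
To make the two structures concrete, here is a small sketch on toy data. The DataFrame `toy_tracks` and its values are hypothetical; only the column names match `df_pl_tracks`.

```python
import pandas as pd

# Toy stand-in for df_pl_tracks (hypothetical values, same column names).
toy_tracks = pd.DataFrame({
    "pid":         [0, 0, 1, 1, 1],
    "track_uri":   ["t1", "t2", "t2", "t3", "t3"],
    "track_name":  ["Song A", "Song B", "Song B", "Song C", "Song C"],
    "artist_name": ["X", "Y", "Y", "Z", "Z"],
})

# pid -> set of track URIs; a track repeated inside a playlist is kept once.
toy_pl2tracks = toy_tracks.groupby("pid")["track_uri"].apply(set).to_dict()

# One metadata row per distinct track_uri.
toy_track_meta = toy_tracks.drop_duplicates("track_uri")[
    ["track_uri", "track_name", "artist_name"]
]
```

Here playlist 0 maps to `t1` and `t2`, playlist 1 maps to `t2` and `t3`, and `toy_track_meta` has three rows, one per distinct track.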

In [None]:
# pid -> set(track_uri), used to exclude tracks already in the playlist
pl2tracks = df_pl_tracks.groupby("pid")["track_uri"].apply(set).to_dict()

# Metadata per track_uri, for display
track_meta = (
    df_pl_tracks
      .sort_values("track_name")
      .drop_duplicates("track_uri")[["track_uri", "track_name", "artist_name"]]
)


## Step 3: Normalise the topics and define the recommendation functions



This cell does two things:

1. Normalise the `playlist_topic_matrix`
   - after row normalisation, a cosine similarity is a plain dot product, which is fast to compute
   - used to find the closest playlists (top-K neighbours)

2. Define the functions:
   - top_k_neighbors_from_idx: finds the most similar playlists
   - recommend_tracks_for_pid: computes the recommendations for a given playlist

How it works internally:
- we look for the playlists closest to a target playlist
- each neighbour playlist adds a "vote" for every track it contains, weighted by its similarity
- the tracks with the highest scores become the final recommendations
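
The neighbour search can be illustrated on a toy matrix. This is a sketch under stated assumptions: `toy_topics` and `toy_top_k` are hypothetical names, and the values are invented so that the expected ordering is easy to verify by hand.

```python
import numpy as np

# Toy topic matrix: 4 playlists x 3 topics (hypothetical values).
toy_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
])

# Row-normalise so that a dot product equals the cosine similarity.
toy_norm = toy_topics / (np.linalg.norm(toy_topics, axis=1, keepdims=True) + 1e-12)

def toy_top_k(idx, k):
    """Indices and similarities of the k rows closest to row idx, excluding idx (k >= 1)."""
    sims = toy_norm @ toy_norm[idx]
    sims[idx] = -np.inf                       # never return the query itself
    k = min(k, len(sims) - 1)                 # cannot return more neighbours than exist
    nn = np.argpartition(-sims, k - 1)[:k]    # k best, in arbitrary order
    nn = nn[np.argsort(-sims[nn])]            # sort them by decreasing similarity
    return nn, sims[nn]

nn, nn_sims = toy_top_k(0, k=2)
```

For row 0, the closest row is 1 (nearly the same topic mix), followed by row 3; row 2 is dominated by a different topic and is left out when `k=2`.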



In [None]:
import numpy as np
from collections import defaultdict

# L2-normalise each row so that a dot product gives the cosine similarity
playlist_norm = playlist_topic_matrix / (
    np.linalg.norm(playlist_topic_matrix, axis=1, keepdims=True) + 1e-12
)

def top_k_neighbors_from_idx(pl_idx, k=500):
    q = playlist_norm[pl_idx:pl_idx+1]
    sims = (q @ playlist_norm.T).ravel()
    sims[pl_idx] = -np.inf  # exclude the query playlist itself
    k = min(k, len(sims) - 1)  # np.argpartition requires k < len(sims)
    nn = np.argpartition(-sims, k)[:k]  # k best neighbours, unordered
    nn = nn[np.argsort(-sims[nn])]      # order them by decreasing similarity
    return nn, sims[nn]

def recommend_tracks_for_pid(query_pid, topn=500, k_neighbors=500):
    # Row of the query playlist; assumes df_playlists has a default 0..n-1 index
    # aligned with the rows of playlist_topic_matrix
    pl_idx = int(df_playlists.index[df_playlists["pid"] == query_pid][0])
    nn_idx, nn_sims = top_k_neighbors_from_idx(pl_idx, k=k_neighbors)
    nn_pids = df_playlists.iloc[nn_idx]["pid"].to_numpy()

    query_tracks = pl2tracks.get(query_pid, set())

    # Each neighbour votes for its tracks with a weight equal to its similarity
    scores = defaultdict(float)
    for pid, s in zip(nn_pids, nn_sims):
        for t in pl2tracks.get(pid, ()):
            if t not in query_tracks:
                scores[t] += float(s)

    items = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:topn]
    recs = pd.DataFrame(items, columns=["track_uri", "score"])
    recs = recs.merge(track_meta, on="track_uri", how="left")
    return recs[["track_uri", "track_name", "artist_name", "score"]]



In [13]:
# Look at a few playlists to pick a pid
df_playlists.head(30)


Unnamed: 0,pid,playlist_name,text
0,0,Throwbacks,Throwbacks Lose Control (feat. Ciara & Fat Ma...
1,1,Awesome Playlist,Awesome Playlist Eye of the Tiger Survivor Ey...
2,2,korean,korean Like You Hoody On And On GOOD (feat. E...
3,3,mat,mat Danse macabre Camille Saint-Saëns French ...
4,4,90s,"90s Tonight, Tonight The Smashing Pumpkins Me..."
5,5,Wedding,Wedding Teach Me How to Dougie Cali Swag Dist...
6,6,I Put A Spell On You,I Put A Spell On You I Put A Spell On You Cre...
7,7,2017,2017 Hard To See You Happy Fink Fink’s Sunday...
8,8,BOP,BOP Twice Catfish and the Bottlemen The Ride ...
9,9,old country,old country Highwayman Willie Nelson Nashvill...


## Step 4: Test the recommendation on one playlist



This cell:
- picks a pid from df_playlists
- calls the function recommend_tracks_for_pid()
- displays the top 20 recommended tracks

It checks that the pipeline works end to end.


In [None]:
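#query_pid = int(df_playlists.iloc[0]["pid"])
query_pid = 29  # <-- playlist to test
# Look up the name that belongs to query_pid. The recorded output shows "Throwbacks"
# because it was produced with df_playlists.iloc[0], i.e. the first playlist.
query_name = df_playlists.loc[df_playlists["pid"] == query_pid, "playlist_name"].iloc[0]
print("Selected playlist:", query_pid, query_name)

recs = recommend_tracks_for_pid(query_pid, topn=20, k_neighbors=500)
recs.head(20)


Selected playlist: 29 Throwbacks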


Unnamed: 0,track_uri,track_name,artist_name,score
0,spotify:track:3eR23VReFzcdmS7TYCrhCe,It Ain't Me (with Selena Gomez),Kygo,133.299013
1,spotify:track:0dA2Mk56wEzDgegdC6R17g,Stay (with Alessia Cara),Zedd,123.653736
2,spotify:track:5CtI0qwDJkDQGwXD1H1cLb,Despacito - Remix,Luis Fonsi,121.711813
3,spotify:track:6D0b04NJIKfEMg040WioJQ,Issues,Julia Michaels,119.760042
4,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,109.861672
5,spotify:track:4iLqG9SeJSnt0cSPICSjxv,Attention,Charlie Puth,107.997895
6,spotify:track:79cuOz3SPQTuFrp8WgftAu,There's Nothing Holdin' Me Back,Shawn Mendes,101.981914
7,spotify:track:152lZdxL1OR0ZMW6KquMif,Location,Khalid,97.201655
8,spotify:track:6DNtNfH8hXkqOX1sjqmI7p,Cold Water (feat. Justin Bieber & MØ),Major Lazer,96.997841
9,spotify:track:7BKLCZ1jbUBVqRi2FVlTVw,Closer,The Chainsmokers,96.99209


## Step 5: Sanity check of a playlist's actual contents



This cell provides a small function:
- show_playlist_tracks(pid)
- displays the tracks of an existing playlist
- handy for making sense of the recommendations obtained

It is useful for verifying that the recommendations exclude tracks already in the playlist.
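
The exclusion can also be checked programmatically instead of by eye. A minimal sketch, assuming a hypothetical helper `overlap_with_playlist` and toy URIs:

```python
def overlap_with_playlist(rec_uris, playlist_uris):
    """Return the recommended URIs that are already in the playlist (ideally empty)."""
    return set(rec_uris) & set(playlist_uris)

# Toy example (hypothetical URIs): "t6" is both recommended and already present.
toy_recs = ["t5", "t6", "t7"]
toy_playlist = {"t1", "t2", "t6"}
leaked = overlap_with_playlist(toy_recs, toy_playlist)
```

In the notebook, `overlap_with_playlist(recs["track_uri"], pl2tracks[29])` should return an empty set if the filter in `recommend_tracks_for_pid` works as intended.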


In [None]:
def show_playlist_tracks(pid):
    return df_pl_tracks[df_pl_tracks["pid"] == pid][["track_uri", "track_name", "artist_name"]]

# Example:
pid = 29
show_playlist_tracks(pid).head(20)


Unnamed: 0,track_uri,track_name,artist_name
1447,spotify:track:1wjzFQodRWrPcQ0AnYnvQ9,I Like Me Better,Lauv
1448,spotify:track:5bcTCxgc7xVfSaMV3RuVke,Feels,Calvin Harris
1449,spotify:track:0Z9HQ8YvHqdeOjTwsR3cS7,Two High,Moon Taxi
1450,spotify:track:0NiXXAI876aGImAd6rTj8w,Mama,Jonas Blue
1451,spotify:track:38yBBH2jacvDxrznF7h08J,Slow Hands,Niall Horan
1452,spotify:track:3PEgB3fkiojxms35ntsTgs,More Than You Know,Axwell /\ Ingrosso
1453,spotify:track:1OAh8uOEOvTDqkKFsKksCi,Wild Thoughts,DJ Khaled
1454,spotify:track:0V9cosR5jWa4fr2koARmhD,Summer Air,ItaloBrothers
1455,spotify:track:0sri8HoWihcZO9Gr31gbmy,Degas Park,Kevin Abstract
1456,spotify:track:7F9vK8hNFMml4GtHsaXui6,Back to You (feat. Bebe Rexha & Digital Farm A...,Louis Tomlinson


# PART 2: Checking whether the recommendations are good

## 1. What a single recommendation does (for 1 playlist)

For one test playlist (one `pid`), the algorithm performs:

1. **Find the 500 most similar playlists**  
   → using `top_k_neighbors_from_idx(pl_idx, k=500)`

2. **Collect all tracks from these 500 neighbour playlists**

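3. **Score each track**  
   The score of a track *t* is the sum of the similarities of the neighbour playlists that contain it:  
   score(t) = Σ sim(q, p), summed over the neighbour playlists p that contain t, where q is the query playlist.

   → In other words, each similar playlist votes for its tracks with a weight equal to its cosine similarity to the query. Tracks already in the query playlist are skipped.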

4. **Sort the tracks** and keep the **top 500** as recommendations.

So for one playlist, the method only uses:  
→ 500 closest playlists + all tracks appearing in these 500 playlists.
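
The voting rule in step 3 can be traced on a tiny example. The neighbours, similarities and track names below are hypothetical and only serve to show the arithmetic.

```python
from collections import defaultdict

# Hypothetical neighbours of a query playlist: (similarity, tracks they contain).
neighbours = [
    (0.9, {"a", "b"}),
    (0.5, {"b", "c"}),
    (0.2, {"c", "d"}),
]
query_tracks = {"a"}  # already in the query playlist, so never recommended

scores = defaultdict(float)
for sim, tracks in neighbours:
    for t in tracks - query_tracks:
        scores[t] += sim  # each neighbour votes with its similarity

# Highest score first.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Track `b` collects votes from two neighbours (0.9 + 0.5) and ranks first, `c` follows with 0.5 + 0.2, and `d` comes last with 0.2; `a` never receives a score.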

---

## 2. What the evaluation does (loop over many playlists)

The evaluation does **not** run the process for only one playlist.  
It runs it for **many playlists**.

Example:

```python
eval_pids_sample = random.sample(eval_pids, k=min(1000, len(eval_pids)))
```
---
## Note on the evaluation metric: Recall@500

To evaluate the quality of the recommendations, a simple metric is used: **Recall@500**.

For each playlist in the evaluation set:

1. A subset of tracks (the *seed tracks*) is kept as input.
2. The remaining tracks form the **ground truth**.
3. The model generates **500 recommended tracks**.
4. The score measures how many of the hidden ground-truth tracks appear in the top 500.

Formally:


$$
\text{Recall@500} =
\frac{ |\text{Recommended}_{500} \cap \text{GroundTruth}| }
     { |\text{GroundTruth}| }.
$$







A higher recall means the model is able to recover more of the missing tracks of the playlist.
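
As a worked example of the formula, here is a sketch with a generic cutoff `k`. The function `recall_at_k` and the toy track names are hypothetical; with `k=500` it computes the same ratio as the notebook's metric.

```python
def recall_at_k(ground_truth, ranked_recs, k):
    """Fraction of ground-truth items found among the first k recommendations."""
    gt = set(ground_truth)
    if not gt:
        return 0.0
    return len(gt & set(ranked_recs[:k])) / len(gt)

# Hypothetical example: 4 hidden tracks, 2 of them appear in the top 3.
hidden = {"t1", "t2", "t3", "t4"}
ranked = ["t2", "x", "t4", "t1", "y"]
example = recall_at_k(hidden, ranked, k=3)  # 2 / 4 = 0.5
```

Raising the cutoff to `k=5` also recovers `t1`, which gives 3 / 4 = 0.75; `t3` is never recommended, so the recall cannot reach 1 on this list.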



Choose K and a list of test playlists

In [None]:
import random

K = 25  # number of seed tracks kept (can be changed)

# Playlists with more than K + 1 distinct tracks, so at least 2 tracks remain as ground truth
eval_pids = [
    pid for pid, tracks in pl2tracks.items()
    if len(tracks) > K + 1
]

print("Number of eligible playlists:", len(eval_pids))

# Take a sample to keep the evaluation fast
random.seed(42)
eval_pids_sample = random.sample(eval_pids, k=min(1000, len(eval_pids)))
print("Number of playlists used for the evaluation:", len(eval_pids_sample))


Number of eligible playlists: 37234
Number of playlists used for the evaluation: 1000


Function to create the seed / ground truth split

In [None]:
def make_seed_and_ground_truth(pid, K):
    """
    From a playlist (pid), draw K tracks at random as seeds;
    the remaining tracks serve as ground truth.
    Returns (None, None) if the playlist has K tracks or fewer.
    """
    tracks = list(pl2tracks[pid])
    if len(tracks) <= K:
        return None, None
    seed_tracks = set(random.sample(tracks, K))
    gt_tracks = set(tracks) - seed_tracks
    return seed_tracks, gt_tracks


Version of the recommender that takes explicit seeds

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict

def recommend_tracks_for_pid_with_seeds(query_pid, seed_tracks, topn=500, k_neighbors=500):
    """
    Same principle as recommend_tracks_for_pid, but only seed_tracks are
    treated as already present in the playlist on the user side.
    Note: the neighbours are still looked up from the playlist's stored row
    in the topic matrix; seed_tracks only controls which tracks are excluded.
    """
    # row index of the playlist in df_playlists
    pl_idx = int(df_playlists.index[df_playlists["pid"] == query_pid][0])

    # neighbours in LDA topic space
    nn_idx, nn_sims = top_k_neighbors_from_idx(pl_idx, k=k_neighbors)
    nn_pids = df_playlists.iloc[nn_idx]["pid"].to_numpy()

    query_tracks = set(seed_tracks)

    scores = defaultdict(float)
    for pid, s in zip(nn_pids, nn_sims):
        for t in pl2tracks.get(pid, ()):
            # only tracks that are not among the seeds may be recommended
            if t not in query_tracks:
                scores[t] += float(s)

    items = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:topn]
    recs = pd.DataFrame(items, columns=["track_uri", "score"])
    recs = recs.merge(track_meta, on="track_uri", how="left")
    return recs[["track_uri", "track_name", "artist_name", "score"]]


Recall@500 metric

In [None]:
def recall_at_500(gt_tracks, rec_uris):
    """
    Recall@500 = (# of ground-truth tracks found in the top 500)
                 / (# of tracks in the ground truth)
    """
    gt = set(gt_tracks)
    if not gt:
        return 0.0
    rec = set(rec_uris[:500])
    return len(gt & rec) / len(gt)


Evaluation loop

In [None]:
recalls = []

for pid in eval_pids_sample:
    seed_tracks, gt_tracks = make_seed_and_ground_truth(pid, K)
    if not seed_tracks or not gt_tracks:
        continue

    recs = recommend_tracks_for_pid_with_seeds(pid, seed_tracks, topn=500, k_neighbors=500)
    rec_uris = recs["track_uri"].tolist()

    r = recall_at_500(gt_tracks, rec_uris)
    recalls.append(r)

len(recalls), sum(recalls) / len(recalls)


(1000, 0.40941276762373296)

Some statistics

In [None]:
import numpy as np

print("Number of playlists evaluated:", len(recalls))
print("Mean Recall@500:", float(np.mean(recalls)))
print("Median Recall@500:", float(np.median(recalls)))


Number of playlists evaluated: 1000
Mean Recall@500: 0.40941276762373296
Median Recall@500: 0.4
