1. **Anomaly detection**

   **Goal**  
   Mark tracks whose mood is far from the playlist’s main cluster(s).

   **Implementation steps**

   - **Feature extraction**  
     - Build a numeric feature vector from `audio_features` plus encoded tags (as you described to Sphinx earlier).

   - **Clustering**  
     - Cluster per playlist (e.g., KMeans/GMM with \(K \approx 3\text{–}8\)).

   - **Anomaly scoring**  
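     - For each track, compute the distance to the nearest centroid (or \(1 - \max_k p_k\), where \(p_k\) is the track’s probability of belonging to cluster \(k\)).  
     - Flag as anomalies the tracks with the top X% of scores (e.g., 10–15%), or those whose score exceeds a fixed threshold.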

   - **Output**  
     - For each track: anomaly score, `is_anomaly`, nearest cluster id.  
     - Store summary like “N anomalies out of M tracks”.

   - **Front‑end action**  
     - When the user runs “anomaly detection”, call your backend endpoint that:
       - Runs/loads the analysis for selected playlists.
       - Returns anomaly details per track (with Spotify IDs) to display and optionally create an “anomalies” playlist.
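
The scoring and output steps above can be sketched as follows. This is a minimal, dependency-free illustration: it assumes the centroids were already produced by whichever clustering step is used (e.g., KMeans), and the names `nearest_centroid`, `score_anomalies`, and `top_fraction` are hypothetical rather than part of the existing code.

```python
import math


def nearest_centroid(vec, centroids):
    """Return (cluster_id, distance) for the centroid closest to vec."""
    dists = [math.dist(vec, c) for c in centroids]
    cid = min(range(len(dists)), key=dists.__getitem__)
    return cid, dists[cid]


def score_anomalies(vectors, centroids, top_fraction=0.10):
    """Score each track by distance to its nearest centroid.

    vectors: {spotify_id: feature vector}; centroids: list of vectors.
    The top_fraction of tracks with the largest distances are flagged.
    """
    rows = []
    for sid, vec in vectors.items():
        cid, dist = nearest_centroid(vec, centroids)
        rows.append({"spotify_id": sid, "nearest_cluster": cid, "anomaly_score": dist})

    n_flag = math.ceil(top_fraction * len(rows))
    ranked = sorted(rows, key=lambda r: r["anomaly_score"], reverse=True)
    flagged = {r["spotify_id"] for r in ranked[:n_flag]}
    for r in rows:
        r["is_anomaly"] = r["spotify_id"] in flagged

    summary = f"{len(flagged)} anomalies out of {len(rows)} tracks"
    return {"tracks": rows, "summary": summary}
```

A percentile cut always flags some tracks, even in a very homogeneous playlist; a fixed distance threshold avoids that but needs tuning per feature space.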

---

2. **Mood detection (clustering + visualization)**

   **Goal**  
   Show the main mood clusters in the playlist.

   **Implementation**

   - Reuse the same feature matrix.  
   - After clustering:
     - Compute centroid stats (mean energy, valence, tempo, tag frequencies) per cluster.  
     - Auto‑label clusters with short human names (e.g., “High‑energy happy”, “Low‑energy chill”) using rules over centroids and tags.

   **Visualization**

   - 2D projection (PCA/UMAP) colored by cluster.  
   - Tag clouds per cluster from Last.fm tags.

   **Front‑end action**

   - Display clusters as:
     - A scatter plot (mood map).
     - Cluster cards listing representative songs.
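
The centroid statistics and rule-based auto-labelling described above might look like the sketch below. The 0.4/0.6 thresholds, the label vocabulary, and the helper names (`centroid_stats`, `bucket`, `label_cluster`) are assumptions chosen for illustration; tag frequencies are left out to keep it short.

```python
from statistics import fmean


def centroid_stats(members, keys=("energy", "valence", "tempo")):
    """Mean of each audio feature over a cluster's member tracks (dicts)."""
    return {k: fmean(m[k] for m in members) for k in keys}


def bucket(value, low=0.4, high=0.6):
    """Map a 0-1 feature to 'low' / 'mid' / 'high' (thresholds are assumptions)."""
    if value >= high:
        return "high"
    if value < low:
        return "low"
    return "mid"


def label_cluster(centroid):
    """Turn centroid energy/valence into a short label such as 'high_energy_happy'."""
    energy = bucket(centroid.get("energy", 0.5))
    valence = bucket(centroid.get("valence", 0.5))
    if energy == "low" and valence != "low":
        mood = "chill"
    else:
        mood = {"high": "happy", "mid": "neutral", "low": "sad"}[valence]
    return f"{energy}_energy_{mood}"
```

Keeping the rules in one small function makes it easy to adjust thresholds after looking at real centroids, or to add tag-based overrides later.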

---
## Steps 1 and 2 together produce a shared analysis that the remaining actions reuse.
---

3. **Playlist comparisons**

   **Goal**  
   Compare moods across multiple playlists.

   **Implementation**

   - For each playlist, compute a *mood fingerprint*:
     - Mean of each audio feature (energy, valence, etc.).
     - Normalized tag distribution over your mood/tag vocabulary.

   - Derive summary metrics:
     - Overall energy/valence, tempo, acousticness, etc.
     - Optionally, a 2D point (valence vs. energy centroid).

   - **Similarity**
     - Compute distance between mood fingerprints (e.g., cosine or Euclidean).
     - Rank how similar playlists are to each other.

   **Front‑end action**

   - Show a radar chart or bar comparison of features per playlist.  
   - Option: “These two playlists are your most similar; this one is the outlier.”
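
As a rough sketch of the fingerprint idea, the code below averages a few audio features, appends a normalized tag distribution, and compares two fingerprints with cosine distance. It works on tracks in their JSON (dict) form; the choice of `AUDIO_KEYS`, the vocabulary passed in, and the function names are assumptions.

```python
import math
from collections import Counter

AUDIO_KEYS = ("energy", "valence", "danceability", "acousticness")


def mood_fingerprint(tracks, vocabulary):
    """Build a fingerprint vector from tracks given as dicts (the JSON form).

    Each track may carry an 'audio_features' dict and a 'tags' list of
    {'name', 'count'} dicts; either can be missing or empty.
    """
    feats = [t["audio_features"] for t in tracks if t.get("audio_features")]
    means = [sum(f[k] for f in feats) / len(feats) if feats else 0.0 for k in AUDIO_KEYS]

    weights = Counter()
    for t in tracks:
        for tag in t.get("tags") or []:
            if tag["name"] in vocabulary:
                weights[tag["name"]] += tag["count"]
    total = sum(weights.values())
    tag_dist = [weights[v] / total if total else 0.0 for v in vocabulary]
    return means + tag_dist


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0
```

Only 0–1 features are averaged here; tempo and loudness would need rescaling before they are mixed into the same vector, otherwise they dominate the distance.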

---

4. **Mood selection**

   **Goal**  
   Pick a target mood and show matching songs from selected playlists.

   **Implementation**

   - Define a fixed set of mood labels mapped to feature regions (e.g., from your clusters or a manual mapping).  
   - For each track:
     - Either use its cluster label (if clusters ↔ mood labels).
     - Or compute a mood score using rules: e.g., “happy” = high valence + medium/high energy.

   - Given a chosen mood:
     - Filter tracks closest to that mood’s centroid/region.

   **Front‑end action**

   - Mood selector (chips/buttons).  
   - When a mood is chosen, show a ranked list of matching tracks and a “Create playlist from these tracks” button.
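
One way to realize the rule-based variant is to treat each mood as a target point in (valence, energy) space and score tracks by closeness to it. The sketch below assumes that approach; the `MOOD_TARGETS` values, the `min_score` cutoff, and the function names are illustrative, not tuned.

```python
import math

# Hypothetical mood targets in (valence, energy) space; tune or derive from clusters.
MOOD_TARGETS = {
    "happy": {"valence": 0.8, "energy": 0.7},
    "chill": {"valence": 0.6, "energy": 0.25},
    "sad": {"valence": 0.2, "energy": 0.3},
    "intense": {"valence": 0.35, "energy": 0.9},
}


def match_score(audio_features, mood_label):
    """Score in [0, 1]: 1.0 at the mood target, falling off with distance."""
    target = MOOD_TARGETS[mood_label]
    keys = sorted(target)
    dist = math.dist([audio_features[k] for k in keys], [target[k] for k in keys])
    return max(0.0, 1.0 - dist / math.sqrt(len(keys)))


def rank_tracks_for_mood(tracks, mood_label, min_score=0.7):
    """Return matching tracks, best first; tracks without audio features are skipped."""
    ranked = []
    for t in tracks:
        if not t.get("audio_features"):
            continue
        score = match_score(t["audio_features"], mood_label)
        if score >= min_score:
            ranked.append({
                "spotify_id": t["spotify_id"],
                "title": t["title"],
                "match_score": round(score, 3),
            })
    return sorted(ranked, key=lambda r: r["match_score"], reverse=True)
```

If clusters map cleanly onto mood labels, the targets can be replaced by the corresponding cluster centroids so that both variants share one code path.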

---

5. **Anomaly recommendations**

   **Goal**  
   Recommend songs that “fit better” than the anomalies, or that are similar to anomalies but align with playlist mood.

   **Two variants**

   - **“Fix my playlist”**
     - For each anomaly, find tracks in the user’s other playlists whose features are close to the dominant playlist cluster instead of the anomaly cluster.
     - Recommend those as replacements.

   - **“Lean into the weirdness”**
     - Find tracks similar to the anomalies (same cluster in global space) to build a “weird side‑quest” playlist.

   **Implementation**

   - Need a global index of the user’s `EnrichedTracks` across playlists.  
   - Use nearest‑neighbor search in feature space to pull candidates.

   **Front‑end action**

   - For each anomaly, show:
     - “Replace this with …” list.
     - “Build an ‘anomaly vibes’ playlist” button.
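
A minimal sketch of the “Fix my playlist” lookup follows, assuming feature vectors have already been built for every track in the global index. It uses a plain linear scan instead of a dedicated nearest-neighbor index, which should be adequate for a single user’s library; `suggest_replacements` and the candidate dict layout are hypothetical.

```python
import math


def suggest_replacements(dominant_centroid, candidates, exclude_ids, k=3):
    """Rank candidate tracks by closeness to the playlist's dominant centroid.

    candidates: dicts with 'spotify_id', 'title', 'source_playlist_id', 'vector'.
    exclude_ids: tracks already in the target playlist (including the anomaly).
    """
    seen = set(exclude_ids)
    scored = []
    for c in candidates:
        if c["spotify_id"] in seen:
            continue
        seen.add(c["spotify_id"])  # a track can appear in several playlists
        dist = math.dist(c["vector"], dominant_centroid)
        scored.append({
            "spotify_id": c["spotify_id"],
            "title": c["title"],
            "source_playlist_id": c["source_playlist_id"],
            "fit_score": 1.0 / (1.0 + dist),  # 1.0 = exactly at the centroid
        })
    return sorted(scored, key=lambda r: r["fit_score"], reverse=True)[:k]
```

The “Lean into the weirdness” variant can reuse the same function by passing the anomaly’s own vector in place of `dominant_centroid`.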

---

6. **Mood recommendations**

   **Goal**  
   Given a mood, recommend tracks from *outside* the selected playlist(s) but within the user’s library or beyond.

   **Implementation (within user’s library)**

   - Build a mood centroid from:
     - Selected mood label → target centroid (from your clusters/global stats).
   - Across all user’s `EnrichedTracks`:
     - Find nearest tracks to that centroid not already in the active playlist(s).

   - If you want to go beyond the user’s songs, you could later integrate ReccoBeats or Spotify recommendations as a second step, but that’s optional for the datathon.

   **Front‑end action**

   - “More songs for this mood” button on mood selection results.

---

7. **Sphinx chatbot integration**

   **Goal**  
   Let users ask questions like “Why is this song an anomaly?” or “Which playlist is happiest?”.

   **Architecture**

   - **Data surface for Sphinx**:
     - In a notebook (local or Databricks), load:
       - The `EnrichedTracks`.
       - All derived analysis tables (clusters, anomalies, fingerprints).

   - **Sphinx CLI / agent**:
     - Run Sphinx in “agent” or “plan” mode over that notebook kernel.
     - Give it a description of the dataframes/dicts and the actions (like the scenario you already wrote).

   - **App integration**:
     - Your app sends user questions → a backend route that:
       - Forwards the question to Sphinx CLI (e.g., via subprocess or HTTP if you proxy it).
       - Sphinx reads the dataframes, generates/executes code, and returns a natural language answer plus optional tables/plots.
     - You can predefine some “shortcuts” in Sphinx rules, e.g.,
       - “anomalies for playlist X” → call `get_anomalies(playlist_id)`
       - “compare playlist A and B” → call `compare_playlists([A, B])`
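
The format of Sphinx rules is not shown here, so the sketch below handles the shortcuts on the app side instead: the backend route checks a question against a few patterns before forwarding it to Sphinx. The patterns, `route_question`, and the stub bodies of `get_anomalies` and `compare_playlists` are all placeholders for illustration.

```python
import re


def get_anomalies(playlist_id):
    # Placeholder: the real function would return stored anomaly results.
    return {"action": "get_anomalies", "playlist_id": playlist_id}


def compare_playlists(playlist_ids):
    # Placeholder: the real function would return fingerprint distances.
    return {"action": "compare_playlists", "playlist_ids": playlist_ids}


SHORTCUTS = [
    (re.compile(r"^anomalies for playlist (\S+)$", re.IGNORECASE),
     lambda m: get_anomalies(m.group(1))),
    (re.compile(r"^compare playlists? (\S+) and (\S+)$", re.IGNORECASE),
     lambda m: compare_playlists([m.group(1), m.group(2)])),
]


def route_question(question):
    """Return a shortcut result, or None to forward the question to Sphinx."""
    text = question.strip().rstrip("?.!")
    for pattern, handler in SHORTCUTS:
        match = pattern.match(text)
        if match:
            return handler(match)
    return None
```

Answering the common questions directly keeps them fast and deterministic, and leaves open-ended ones such as “Why is this song an anomaly?” to the agent.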

---

8. **Creating playlists from results**

   **Goal**  
   Turn any list of Spotify track IDs into a new playlist via the Web API.

   **Implementation**

   - **Backend helper**:
     - `create_playlist(user_id, name, description, is_public)` → call Spotify “Create Playlist” API.
     - `add_tracks_to_playlist(playlist_id, track_spotify_ids)`.

   - **Wire into actions**:
     - Anomaly detection:
       - “Create playlist of anomalies.”
     - Mood selection:
       - “Create playlist of ‘High‑energy happy’ songs from selected playlists.”
     - Anomaly/mood recommendations:
       - “Create recommended playlist” from candidate tracks.

   - Optional: store a link between analysis runs and created playlist IDs for traceability.
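
The two backend helpers could start from request builders like the ones below, which only describe the HTTP calls and leave sending them (with the user’s OAuth token) to whatever client the backend uses. The endpoint paths, the `spotify:track:` URI form, and the 100-item batch limit follow my understanding of the Spotify Web API and should be checked against the current reference; the function names are hypothetical.

```python
API_BASE = "https://api.spotify.com/v1"
MAX_URIS_PER_REQUEST = 100  # assumed per-request limit for adding items


def create_playlist_request(user_id, name, description="", is_public=False):
    """Describe the 'Create Playlist' call as (method, url, json_body)."""
    body = {"name": name, "description": description, "public": is_public}
    return "POST", f"{API_BASE}/users/{user_id}/playlists", body


def add_tracks_requests(playlist_id, track_spotify_ids):
    """Describe the 'Add Items to Playlist' calls, batched to the request limit."""
    uris = [f"spotify:track:{tid}" for tid in track_spotify_ids]
    url = f"{API_BASE}/playlists/{playlist_id}/tracks"
    return [
        ("POST", url, {"uris": uris[i:i + MAX_URIS_PER_REQUEST]})
        for i in range(0, len(uris), MAX_URIS_PER_REQUEST)
    ]
```

Separating request construction from sending makes the batching logic easy to test without network access or credentials.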

In [None]:
"""Data classes shared across modules."""

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Artist:
    name: str
    spotify_id: Optional[str] = None


@dataclass
class Track:
    """Minimal Spotify track info collected from playlists."""

    spotify_id: str
    title: str
    artists: list[Artist]
    album_name: str
    duration_ms: int


@dataclass
class AudioFeatures:
    """Audio features retrieved from ReccoBeats."""

    acousticness: float
    danceability: float
    energy: float
    instrumentalness: float
    liveness: float
    loudness: float
    speechiness: float
    tempo: float
    valence: float
    key: Optional[int] = None
    mode: Optional[int] = None


@dataclass
class Tag:
    """A single Last.fm tag with its weight."""

    name: str
    count: int  # 0-100 relevance weight


@dataclass
class EnrichedTrack:
    """Final fused object combining Spotify metadata, ReccoBeats features, and
    Last.fm tags.  This is the object handed off to later data-processing code.
    """

    spotify_id: str
    title: str
    artists: list[Artist]
    album_name: str
    duration_ms: int
    audio_features: Optional[AudioFeatures] = None
    tags: list[Tag] = field(default_factory=list)

    # ReccoBeats internal id (useful for further API calls)
    reccobeats_id: Optional[str] = None


@dataclass
class Playlist:
    """Basic Spotify playlist metadata (without tracks)."""

    spotify_id: str
    name: str
    total_tracks: int
    owner: str
    description: Optional[str] = None
    image_url: Optional[str] = None


@dataclass
class EnrichedPlaylist:
    """A playlist represented as a collection of EnrichedTracks."""

    spotify_id: str
    name: str
    tracks: list[EnrichedTrack] = field(default_factory=list)
    description: Optional[str] = None
    owner: Optional[str] = None
    snapshot_id: Optional[str] = None
    image_url: Optional[str] = None
    total_tracks: int = 0


In [None]:
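# Lets the `dict | None` annotations below work on Python versions before 3.10.
from __future__ import annotations

import json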
from pathlib import Path

import models

# We expect models to define:
# models.Artist
# models.AudioFeatures
# models.Tag
# models.EnrichedTrack
# models.EnrichedPlaylist

DATA_PATH = Path("enriched_playlists.json")

with DATA_PATH.open("r", encoding="utf-8") as f:
    raw_playlists = json.load(f)

def artist_from_dict(d: dict) -> models.Artist:
    return models.Artist(
        name=d["name"],
        spotify_id=d.get("spotify_id"),
    )

def audio_features_from_dict(d: dict | None) -> models.AudioFeatures | None:
    if d is None:
        return None
    return models.AudioFeatures(
        acousticness=d["acousticness"],
        danceability=d["danceability"],
        energy=d["energy"],
        instrumentalness=d["instrumentalness"],
        liveness=d["liveness"],
        loudness=d["loudness"],
        speechiness=d["speechiness"],
        tempo=d["tempo"],
        valence=d["valence"],
        key=d.get("key"),
        mode=d.get("mode"),
    )

def tag_from_dict(d: dict) -> models.Tag:
    return models.Tag(
        name=d["name"],
        count=d["count"],
    )

def enriched_track_from_dict(d: dict) -> models.EnrichedTrack:
    return models.EnrichedTrack(
        spotify_id=d["spotify_id"],
        title=d["title"],
        artists=[artist_from_dict(a) for a in d.get("artists", [])],
        album_name=d["album_name"],
        duration_ms=d["duration_ms"],
        audio_features=audio_features_from_dict(d.get("audio_features")),
        tags=[tag_from_dict(t) for t in d.get("tags", [])],
        reccobeats_id=d.get("reccobeats_id"),
    )

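def enriched_playlist_from_dict(d: dict) -> models.EnrichedPlaylist:
    tracks = [enriched_track_from_dict(t) for t in d.get("tracks", [])]
    return models.EnrichedPlaylist(
        spotify_id=d["spotify_id"],
        name=d["name"],
        tracks=tracks,
        # Optional metadata: carried over when present instead of being dropped.
        description=d.get("description"),
        owner=d.get("owner"),
        snapshot_id=d.get("snapshot_id"),
        image_url=d.get("image_url"),
        total_tracks=d.get("total_tracks", len(tracks)),
    )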

example_playlists: list[models.EnrichedPlaylist] = [
    enriched_playlist_from_dict(p) for p in raw_playlists
]

# Quick sanity check
print(f"Loaded {len(example_playlists)} playlist(s)")
for pl in example_playlists:
    print(f"- {pl.name}: {len(pl.tracks)} tracks")


We will implement the following analysis functions:

- `run_anomaly_detection(playlist: EnrichedPlaylist) -> dict`

- `run_mood_clustering(playlist: EnrichedPlaylist) -> dict`

- `compare_playlists(playlists: list[EnrichedPlaylist]) -> dict`

- `select_tracks_by_mood(playlists: list[EnrichedPlaylist], mood_label: str) -> dict`

- `recommend_for_anomalies(playlists: list[EnrichedPlaylist]) -> dict`

- `recommend_for_mood(playlists: list[EnrichedPlaylist], mood_label: str) -> dict`

All functions must:

- Take `EnrichedPlaylist` / `list[EnrichedPlaylist]` and an optional `mood_label` as input.

- Use `EnrichedPlaylist.tracks` (a list of `EnrichedTrack`) as the data source.

- Handle missing `audio_features` and/or empty `tags` without crashing.

- Return JSON‑serializable dicts ready to send from a backend API to a front‑end.

You are a data‑science coding agent working in this notebook.

The core data model is:

- `EnrichedPlaylist` with fields:
  - `spotify_id: str`
  - `name: str`
  - `tracks: list[EnrichedTrack]`
- `EnrichedTrack` has:
  - `spotify_id`, `title`, `artists`, `album_name`, `duration_ms`
  - `audio_features: Optional[AudioFeatures]`
  - `tags: list[Tag]`
  - `reccobeats_id: Optional[str]`
- `AudioFeatures` contains numeric audio features:
  - `acousticness`, `danceability`, `energy`, `instrumentalness`, `liveness`, `loudness`, `speechiness`, `tempo`, `valence`, `key`, `mode`
- `Tag` contains:
  - `name: str` (text)
  - `count: int` (0–100 relevance)

We already use `EnrichedTrack` as the fused object containing Spotify metadata, ReccoBeats audio features, and Last.fm tags.

---

## Functions to implement

Implement the following Python functions in this notebook.

Each function must:

- Take `EnrichedPlaylist` or `list[EnrichedPlaylist]` objects as input.
- Construct feature matrices from `audio_features` + encoded `tags`.
- Handle missing data robustly:
  - If a track has neither `audio_features` nor `tags`, exclude it from numerical clustering, but report it separately.
  - If a track has only `audio_features`, use numeric features only.
  - If a track has only `tags`, use tag‑based features only and treat its audio features as missing.
- Return JSON‑serializable Python dicts (no custom classes in the return value).
- Never crash due to missing fields.
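
The missing-data rules above could be implemented along the lines of the sketch below, which works on tracks in their JSON (dict) form. Filling absent audio features with the playlist mean is an assumption made here so that tag-only tracks still get a full-length vector; `AUDIO_KEYS`, `build_feature_rows`, and the lower-cased tag vocabulary are likewise illustrative.

```python
AUDIO_KEYS = ("energy", "valence", "danceability", "acousticness")  # all 0-1 scaled


def build_feature_rows(tracks, vocabulary):
    """Split tracks (JSON-style dicts) into usable feature rows and excluded ids."""
    rows, excluded = [], []
    for t in tracks:
        audio = t.get("audio_features")
        tags = t.get("tags") or []
        if not audio and not tags:
            excluded.append(t["spotify_id"])  # reported separately, not clustered
            continue
        tag_weights = {tag["name"].lower(): tag["count"] / 100.0 for tag in tags}
        rows.append({
            "spotify_id": t["spotify_id"],
            "audio": [float(audio[k]) for k in AUDIO_KEYS] if audio else None,
            "tags": [tag_weights.get(v, 0.0) for v in vocabulary],
        })

    # Assumption: tracks without audio features get the playlist mean, so they
    # sit at a neutral position on the audio axes and are driven by tags only.
    with_audio = [r["audio"] for r in rows if r["audio"] is not None]
    if with_audio:
        means = [sum(col) / len(col) for col in zip(*with_audio)]
    else:
        means = [0.0] * len(AUDIO_KEYS)
    for r in rows:
        audio_part = r["audio"] if r["audio"] is not None else means
        r["vector"] = audio_part + r["tags"]
    return rows, excluded
```

The `excluded` list feeds directly into the summary of tracks skipped due to missing data, so nothing disappears silently.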

### 1. `run_playlist_analysis(playlist: EnrichedPlaylist) -> dict`

This is the **combined analysis** action (mood clustering + anomaly detection).  
Other actions will reuse its outputs.

Responsibilities:

- Build a mood feature space for the playlist using `EnrichedTrack` features.
- Cluster tracks into 3–8 mood clusters using combined features.
- For each cluster:
  - Compute centroid feature summary (mean energy, valence, tempo, tag distribution).
  - Assign a simple label like `"high_energy_happy"` based on rules.
- Compute an anomaly score per track (e.g., distance to nearest cluster center or low cluster membership probability).
- Mark the top X% as anomalies.
- Return a dict with:
  - `"playlist_id"`, `"playlist_name"`
  - `"clusters"`: list of clusters with  
    `cluster_id`, `label`, `size`, `centroid_features`, and list of member `spotify_id`s.
  - `"tracks"`: list of objects with  
    `spotify_id`, `title`, `cluster_id`, `anomaly_score`, `is_anomaly`, and a brief text `"reason"` field explaining why it’s an anomaly (if `is_anomaly` is true).
  - `"summary"`: overall stats for the playlist (e.g., `num_tracks`, `num_anomalies`, `num_clusters`, any tracks excluded due to missing data).
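
To illustrate the `"reason"` field and the JSON-serializable requirement, the sketch below builds one entry of the `"tracks"` list. The reason is a simple heuristic (the feature that deviates most from the cluster centroid) and assumes the compared features share a scale; `explain_anomaly` and `track_entry` are hypothetical helper names.

```python
import json


def explain_anomaly(track_features, centroid_features):
    """Name the feature that deviates most from the track's cluster centroid."""
    shared = [k for k in track_features if k in centroid_features]
    if not shared:
        return "no comparable features"
    key = max(shared, key=lambda k: abs(track_features[k] - centroid_features[k]))
    diff = track_features[key] - centroid_features[key]
    direction = "higher" if diff > 0 else "lower"
    return f"{key} is {abs(diff):.2f} {direction} than the cluster centroid"


def track_entry(spotify_id, title, cluster_id, anomaly_score, is_anomaly,
                track_features, centroid_features):
    """Build one element of the "tracks" list using only JSON-friendly types."""
    return {
        "spotify_id": spotify_id,
        "title": title,
        "cluster_id": int(cluster_id),
        "anomaly_score": float(anomaly_score),
        "is_anomaly": bool(is_anomaly),
        "reason": explain_anomaly(track_features, centroid_features) if is_anomaly else None,
    }


entry = track_entry("s1", "Song", 2, 1.25, True,
                    {"energy": 0.9, "valence": 0.5}, {"energy": 0.3, "valence": 0.45})
print(json.dumps(entry, indent=2))
```

The explicit `int`, `float`, and `bool` conversions matter once the values come from NumPy, since scalar types such as `numpy.int64` are rejected by `json.dumps`.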

Later functions should assume they can consume the outputs of `run_playlist_analysis` instead of recomputing clustering/anomalies from scratch.

### 2. `compare_playlists(playlists: list[EnrichedPlaylist]) -> dict`

- For each playlist, compute a mood fingerprint using the analysis results from `run_playlist_analysis` when available:
  - Mean audio features (energy, valence, etc.).
  - Normalized tag distribution over a small mood/tag vocabulary you infer from the data.
  - Optional: aggregate cluster information (e.g., distribution over mood labels).
- Compute pairwise similarity/distance between playlists.
- Return:
  - `"playlists"`: list with `playlist_id`, `name`, `fingerprint` (features).
  - `"similarities"`: list of `{ "playlist_id_a", "playlist_id_b", "distance" }`.
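
Assuming each playlist’s fingerprint has already been flattened into a numeric vector, the `"similarities"` list and the “most similar / outlier” summary could be derived as sketched below. Euclidean distance is used here for simplicity; `pairwise_similarities` and `find_outlier` are hypothetical names.

```python
import math
from itertools import combinations


def pairwise_similarities(fingerprints):
    """fingerprints: {playlist_id: numeric vector}. Returns pairs, closest first."""
    pairs = [
        {
            "playlist_id_a": a,
            "playlist_id_b": b,
            "distance": math.dist(fingerprints[a], fingerprints[b]),
        }
        for a, b in combinations(sorted(fingerprints), 2)
    ]
    return sorted(pairs, key=lambda p: p["distance"])


def find_outlier(fingerprints):
    """Playlist with the largest mean distance to all the others (needs >= 2)."""
    def mean_distance(pid):
        others = [v for other, v in fingerprints.items() if other != pid]
        return sum(math.dist(fingerprints[pid], v) for v in others) / len(others)

    return max(fingerprints, key=mean_distance)
```

The first element of the returned list is the most similar pair, which is enough to drive a message like “These two playlists are your most similar.”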

### 3. `select_tracks_by_mood(playlists: list[EnrichedPlaylist], mood_label: str) -> dict`

- Use the existing mood feature space or the cluster labels from `run_playlist_analysis` to find tracks that match `mood_label` from the selected playlists.
- If a playlist has not yet been analyzed, call `run_playlist_analysis` internally or document that it must be precomputed.
- Return:
  - `"mood_label"`
  - `"tracks"`: list with  
    `spotify_id`, `title`, `playlist_id`, `playlist_name`, `cluster_id` (if applicable), and a `"match_score"`.

### 4. `recommend_for_anomalies(playlists: list[EnrichedPlaylist]) -> dict`

- Use the anomaly detection results from `run_playlist_analysis`.
- For each anomaly track in each playlist:
  - Identify the dominant mood cluster(s) of that playlist.
  - Suggest a few replacement tracks from the user’s *other* playlists that better fit the dominant mood cluster (using the same feature space).
- Return:
  - `"recommendations"`: list of items with  
    `playlist_id`, `playlist_name`,  
    `anomaly_track_id`, `anomaly_title`,  
    and a list of suggested replacement tracks (`spotify_id`, `title`, `source_playlist_id`, `source_playlist_name`, `fit_score`).

### 5. `recommend_for_mood(playlists: list[EnrichedPlaylist], mood_label: str) -> dict`

- Given a mood label, use the mood feature space / clusters from `run_playlist_analysis` to find tracks across the provided playlists that match the mood but are *not* already in a specific target playlist (or simply return all matching tracks with scores).
- Return:
  - `"mood_label"`
  - `"tracks"`: list with  
    `spotify_id`, `title`, `playlist_id`, `playlist_name`, `cluster_id` (if applicable), and `"match_score"`.

---

## General requirements

- Use idiomatic Python with `numpy` / `pandas` / `scikit-learn` for feature construction and clustering.
- Encapsulate common logic (feature extraction, tag encoding, clustering, distance / similarity computations) into helper functions to avoid duplication.
- Prefer reusing the outputs of `run_playlist_analysis` instead of re‑clustering whenever possible.
- Provide at least one example call per function using `example_playlists` (or a small synthetic playlist) to demonstrate usage.
- Make sure all code runs in this notebook without external dependencies beyond standard data‑science libraries.
