
# Next Best Action Music Recommender: High-Level Approach (plain notes with an extended example)

Let’s walk through how to build a next best action recommender, using a music recommendation example.

## 1. Business Questions to Ask

Before diving into the solution, clarify the following:

### **Business Objective**
- **What action are we recommending?** (e.g., “next song to play”)
- **What defines success?** (e.g., user listens for >30 seconds, adds to playlist, doesn’t skip)
- **What’s the business impact?** (e.g., increased engagement, reduced churn)

### **Context & Constraints**
- Do we need real-time recommendations or batch predictions?  
- Are there diversity requirements? (don’t just recommend similar songs)
- Any business rules? (e.g., new artist promotion, licensing priorities)  
- How often can recommendations change?  

### **User Experience**
- How many recommendations to show?  
- Should we explain *why* we’re recommending something?
- What happens if we have no data on a user (cold start problem)?  

---

## 2. Data Collection Strategy

For a music recommender, you typically need:

### **Interaction Data (the core)**
- User ID, Song ID, Timestamp  
- Interaction type: played, skipped, liked, added to playlist  
- Duration listened (partial vs. complete plays)  
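
One way to turn these raw events into a single training signal is to map each interaction to an implicit rating. The sketch below is only an illustration: the event names, the 90% completion cutoff, and the `implicit_rating` helper are assumptions, and the 30-second threshold is borrowed from the success definition in section 1.

```python
def implicit_rating(event_type: str, seconds_listened: float, track_seconds: float) -> float:
    """Map one interaction event to an implicit rating in [0, 1] (hypothetical scheme)."""
    if event_type in ("liked", "added_to_playlist"):
        return 1.0  # explicit positive signal
    if event_type == "skipped" or seconds_listened < 30:
        return 0.0  # treated as a skip
    fraction = seconds_listened / track_seconds
    return 1.0 if fraction >= 0.9 else 0.5  # complete vs. partial play


# (event_type, seconds_listened, track_seconds)
events = [
    ("played", 200.0, 210.0),   # nearly complete listen
    ("played", 45.0, 210.0),    # partial listen
    ("skipped", 5.0, 210.0),    # skip
    ("liked", 0.0, 210.0),      # like without a full play
]
ratings = [implicit_rating(*event) for event in events]
print(ratings)
```

For the four sample events this yields `[1.0, 0.5, 0.0, 1.0]`, mirroring a 1 / 0.5 / 0 convention for completed, partial, and skipped plays.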

### **User Features**
- Demographics (age, location)  
- Listening history aggregates (genres preferred, listening times)  
- Subscription type, tenure  

### **Item Features (Songs)**
- Artist, album, genre, release date  
- Audio features (tempo, energy, danceability)  
- Popularity metrics  

### **Contextual Data**
- Time of day, day of week  
- Device type, location  
- Current playlist/session context  

---

## 3. Approach Structure

A mature recommender system typically has multiple layers:

### **Stage 1: Candidate Generation (Recall)**
Generate a broad set of potential recommendations (~100–1000 items) using fast, simpler methods:
- **Collaborative filtering:** “Users like you also listened to…”
- **Content-based:** “Similar songs based on audio features…”
- **Popularity-based:** Trending songs in relevant categories  

### **Stage 2: Ranking**
Score and rank candidates using more sophisticated models:
- Predict engagement probability for each candidate  
- Consider multiple objectives (relevance, diversity, novelty)  
- Apply business rules  
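
A minimal sketch of how a ranking stage might fold several objectives into one score. The `Candidate` fields, the weights, and the promotion boost are all hypothetical; in practice `p_engage` would come from a trained model and the weights would be tuned against the success metric.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    song_id: str
    p_engage: float   # predicted engagement probability (assumed model output)
    novelty: float    # 1.0 = never heard by this user, 0.0 = heard often
    promoted: bool    # business-rule flag, e.g. new-artist promotion


def rank_score(c: Candidate, w_novelty: float = 0.2, promo_boost: float = 0.05) -> float:
    """Weighted mix of relevance and novelty, plus a flat boost for promoted items."""
    score = (1 - w_novelty) * c.p_engage + w_novelty * c.novelty
    if c.promoted:
        score += promo_boost
    return score


candidates = [
    Candidate("S001", p_engage=0.80, novelty=0.1, promoted=False),
    Candidate("S002", p_engage=0.70, novelty=0.9, promoted=False),
    Candidate("S003", p_engage=0.72, novelty=0.5, promoted=True),
]
ranked = sorted(candidates, key=rank_score, reverse=True)
print([c.song_id for c in ranked])
```

With these made-up numbers the most relevant song (S001) drops to last place, because the other two gain more from novelty and the promotion boost.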

### **Stage 3: Post-Processing**
- Ensure diversity (don’t show 10 songs from the same artist)
- Apply filters (explicit content, regional availability)  
- Reorder for better user experience  
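
The diversity rule can be sketched as a greedy pass over the ranked list. `cap_per_artist` and its parameters are illustrative names, and a per-artist cap is only one of several possible diversity heuristics.

```python
from collections import Counter


def cap_per_artist(ranked_songs, artist_of, max_per_artist=2, n=5):
    """Walk the ranked list in order, keeping at most max_per_artist songs per artist."""
    kept, counts = [], Counter()
    for song in ranked_songs:
        artist = artist_of[song]
        if counts[artist] < max_per_artist:
            kept.append(song)
            counts[artist] += 1
        if len(kept) == n:
            break
    return kept


# Toy data: four of the six top-ranked songs come from artist "A"
artist_of = {"S1": "A", "S2": "A", "S3": "A", "S4": "B", "S5": "A", "S6": "C"}
ranked_songs = ["S1", "S2", "S3", "S4", "S5", "S6"]
result = cap_per_artist(ranked_songs, artist_of, max_per_artist=2, n=4)
print(result)
```

Here the third and fourth songs by artist A are skipped, so the final list is `['S1', 'S2', 'S4', 'S6']`.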

---

## 4. Algorithmic Approach (The Core Methods)

### **Collaborative Filtering**
The workhorse of recommendation systems. Two main flavors:

#### **User-Based CF**
- Find users similar to the target user (based on listening history)  
- Recommend songs those similar users enjoyed  
- **Intuition:** “People with similar taste to you liked these songs”

#### **Item-Based CF**
- Find songs similar to ones the user has listened to  
- Calculate similarity based on co-listening patterns  
- **Intuition:** “If you liked song A, you’ll probably like song B because they’re often enjoyed together”

**The math:** Both flavors rely on similarity metrics such as *cosine similarity* or *Pearson correlation*, computed over the rows (users) or columns (items) of the user-item interaction matrix.
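
To make the similarity step concrete, here is a small hand-rolled cosine similarity, $\text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, applied to item columns of a toy user-item matrix. The `cosine` helper and the four song vectors are made up for illustration; the notebook code further down uses scikit-learn's `cosine_similarity` for the same computation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each vector is one song's column: one entry per user (1 = listened, 0 = not)
song_a = [1, 1, 0]
song_b = [1, 1, 0]
song_c = [0, 0, 1]
song_d = [1, 0, 1]

sim_ab = cosine(song_a, song_b)  # identical listeners
sim_ac = cosine(song_a, song_c)  # no shared listeners
sim_ad = cosine(song_a, song_d)  # one shared listener, two listeners each
print(round(sim_ab, 3), round(sim_ac, 3), round(sim_ad, 3))
```

Songs with the same audience score about 1, songs with disjoint audiences score 0, and partial overlap lands in between (about 0.5 here).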

#### **Matrix Factorization (Advanced CF)**
- Decompose the user-item interaction matrix into latent factors  
- Each user and song gets embedded into a lower-dimensional space  
- **Recommendation score** = dot product of user and item embeddings  
- **Intuition:** Discover hidden patterns like “this user likes upbeat indie rock” without explicitly defining these categories
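
As a rough illustration of the dot-product score, assume two-dimensional embeddings have already been learned. The vectors below are hand-picked, not trained, and the dimension labels in the comments are an interpretation rather than something a model outputs.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical embeddings: dimension 0 ~ "upbeat indie", dimension 1 ~ "slow jazz"
user_vec = [0.9, 0.1]
song_vecs = {
    "upbeat_indie_track": [0.8, 0.2],
    "slow_jazz_ballad": [0.1, 0.9],
}

# Recommendation score = dot product of user and item embeddings
scores = {song: dot(user_vec, vec) for song, vec in song_vecs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

The user vector points mostly along the first dimension, so the song that shares that direction gets the higher score (about 0.74 versus 0.18).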

---

### **Content-Based Filtering**
- Create feature vectors for songs (genre, tempo, artist, etc.)  
- Build a user profile based on songs they’ve enjoyed
- Recommend songs with similar features
- **Intuition:** “You like energetic pop songs, so here are more.”

---

### **Hybrid Approaches**
Combine multiple methods to overcome individual weaknesses:
- **Collaborative filtering** struggles with new songs (no interaction data yet)  
- **Content-based** can get stuck in a “filter bubble”
- **Hybrid:** Use content-based for new items, CF for established catalog  
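
One possible way to express this hand-off is a blend weight that shifts from content-based to collaborative scores as a song accumulates interactions. The `n / (n + k)` schedule, the constant `k = 20`, and the function names are assumptions for illustration; this differs from the fixed-weight, rank-based blend used in the `HybridRecommender` class in the notebook code.

```python
def blend_weight(n_interactions: int, k: float = 20.0) -> float:
    """Share of the final score given to CF; grows from 0 toward 1 with more data."""
    return n_interactions / (n_interactions + k)


def hybrid_score(cf_score: float, cb_score: float, n_interactions: int, k: float = 20.0) -> float:
    """Blend CF and content-based scores according to how much data the song has."""
    alpha = blend_weight(n_interactions, k)
    return alpha * cf_score + (1 - alpha) * cb_score


# A brand-new song has no CF signal, so its score is purely content-based
new_song = hybrid_score(cf_score=0.0, cb_score=0.8, n_interactions=0)
# An established song with 180 interactions leans 90% on CF
established = hybrid_score(cf_score=0.6, cb_score=0.8, n_interactions=180)
print(round(new_song, 2), round(established, 2))
```

With zero interactions the result equals the content-based score (0.8); at 180 interactions the CF score dominates and the blend comes out at about 0.62.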

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from collections import defaultdict
import random

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

# ============================================================================
# 1. GENERATE SYNTHETIC DATA
# ============================================================================

def generate_synthetic_data():
    """Create realistic fake music listening data"""
    
    # Define songs with features
    songs_data = {
        'song_id': [f'S{i:03d}' for i in range(1, 51)],
        'artist': np.random.choice(['Artist_A', 'Artist_B', 'Artist_C', 'Artist_D', 'Artist_E'], 50),
        'genre': np.random.choice(['Pop', 'Rock', 'Electronic', 'Jazz', 'Hip-Hop'], 50),
        'tempo': np.random.randint(60, 180, 50),  # BPM
        'energy': np.random.uniform(0, 1, 50),
        'danceability': np.random.uniform(0, 1, 50),
        'popularity': np.random.randint(0, 100, 50)
    }
    songs_df = pd.DataFrame(songs_data)
    
    # Generate user listening history
    users = [f'U{i:03d}' for i in range(1, 101)]
    interactions = []
    
    for user in users:
        # Each user listens to 5-20 songs
        n_listens = np.random.randint(5, 21)
        # Users have preferences (some like pop, some like rock, etc.)
        preferred_genre = np.random.choice(songs_df['genre'].unique())
        
        for _ in range(n_listens):
            # 70% chance to listen to preferred genre
            if np.random.random() < 0.7:
                song = songs_df[songs_df['genre'] == preferred_genre].sample(1).iloc[0]
            else:
                song = songs_df.sample(1).iloc[0]
            
            interactions.append({
                'user_id': user,
                'song_id': song['song_id'],
                'rating': np.random.choice([1, 0.5, 0], p=[0.8, 0.15, 0.05])
                # rating: 1=completed listen (80%), 0.5=partial (15%), 0=skipped (5%)
            })
    
    interactions_df = pd.DataFrame(interactions)
    
    return songs_df, interactions_df


# ============================================================================
# 2. COLLABORATIVE FILTERING - ITEM-BASED
# ============================================================================

class ItemBasedCF:
    """Item-based collaborative filtering recommender"""
    
    def __init__(self):
        self.item_similarity = None
        self.item_similarity_df = None
        self.user_item_matrix = None
        
    def fit(self, interactions_df):
        """Train the model by calculating item-item similarities"""
        
        # Create user-item matrix
        self.user_item_matrix = interactions_df.pivot_table(
            index='user_id',
            columns='song_id',
            values='rating',
            fill_value=0
        )
        
        # Calculate item-item similarity (cosine similarity)
        # Transpose so items are rows
        item_matrix = self.user_item_matrix.T
        self.item_similarity = cosine_similarity(item_matrix)
        self.item_similarity_df = pd.DataFrame(
            self.item_similarity,
            index=item_matrix.index,
            columns=item_matrix.index
        )
        
        print(f"✓ Trained CF model on {len(self.user_item_matrix)} users and {len(item_matrix)} songs")
        
    def recommend(self, user_id, n_recommendations=5, exclude_listened=True):
        """Generate recommendations for a user"""
        
        if user_id not in self.user_item_matrix.index:
            return []  # Cold start - return empty for now
        
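        # Get songs the user has played (rating > 0). Skipped songs (rating 0)
        # are indistinguishable from unheard ones because the matrix is 0-filled.
        user_ratings = self.user_item_matrix.loc[user_id]
        listened_songs = user_ratings[user_ratings > 0].index.tolist()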
        
        # Calculate scores for all songs
        scores = {}
        for song in self.item_similarity_df.columns:
            if exclude_listened and song in listened_songs:
                continue
            
            # Score = weighted sum of similarities with listened songs
            score = 0
            weight_sum = 0
            for listened_song in listened_songs:
                similarity = self.item_similarity_df.loc[song, listened_song]
                user_rating = user_ratings[listened_song]
                score += similarity * user_rating
                weight_sum += similarity
            
            if weight_sum > 0:
                scores[song] = score / weight_sum
        
        # Sort and return top N
        recommended_songs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [song for song, score in recommended_songs[:n_recommendations]]


# ============================================================================
# 3. CONTENT-BASED FILTERING
# ============================================================================

class ContentBasedFilter:
    """Content-based recommender using song features"""
    
    def __init__(self):
        self.song_features = None
        self.scaler = StandardScaler()
        
    def fit(self, songs_df):
        """Prepare song feature vectors"""
        
        # One-hot encode categorical features (dtype=float keeps the matrix numeric)
        genre_dummies = pd.get_dummies(songs_df['genre'], prefix='genre', dtype=float)
        artist_dummies = pd.get_dummies(songs_df['artist'], prefix='artist', dtype=float)
        
        # Combine with numerical features
        numerical_features = songs_df[['tempo', 'energy', 'danceability', 'popularity']]
        numerical_scaled = self.scaler.fit_transform(numerical_features)
        numerical_df = pd.DataFrame(
            numerical_scaled,
            columns=numerical_features.columns,
            index=songs_df['song_id']
        )
        
        # Combine all features
        self.song_features = pd.concat([
            numerical_df,
            genre_dummies.set_index(songs_df['song_id']),
            artist_dummies.set_index(songs_df['song_id'])
        ], axis=1)
        
        print(f"✓ Prepared content features for {len(self.song_features)} songs")
        
    def recommend(self, user_id, interactions_df, n_recommendations=5):
        """Recommend songs similar to what user has listened to"""
        
        # Get the songs the user completed (rating > 0.5 keeps only rating == 1)
        user_history = interactions_df[
            (interactions_df['user_id'] == user_id) & 
            (interactions_df['rating'] > 0.5)
        ]
        
        if len(user_history) == 0:
            return []
        
        listened_songs = user_history['song_id'].tolist()
        
        # Create user profile (average of liked songs' features)
        user_profile = self.song_features.loc[listened_songs].mean(axis=0)
        
        # Calculate similarity between user profile and all songs
        similarities = cosine_similarity(
            user_profile.values.reshape(1, -1),
            self.song_features.values
        )[0]
        
        # Create score dictionary
        song_scores = dict(zip(self.song_features.index, similarities))
        
        # Remove songs the user already completed (partial plays and skips stay eligible)
        for song in listened_songs:
            song_scores.pop(song, None)
        
        # Return top N
        recommended_songs = sorted(song_scores.items(), key=lambda x: x[1], reverse=True)
        return [song for song, score in recommended_songs[:n_recommendations]]


# ============================================================================
# 4. HYBRID RECOMMENDER
# ============================================================================

class HybridRecommender:
    """Combines collaborative and content-based filtering"""
    
    def __init__(self, cf_weight=0.6, cb_weight=0.4):
        self.cf_model = ItemBasedCF()
        self.cb_model = ContentBasedFilter()
        self.cf_weight = cf_weight
        self.cb_weight = cb_weight
        
    def fit(self, songs_df, interactions_df):
        """Train both models"""
        print("\n🎵 Training Hybrid Recommender System...")
        self.cf_model.fit(interactions_df)
        self.cb_model.fit(songs_df)
        print("✓ Training complete!\n")
        
    def recommend(self, user_id, interactions_df, n_recommendations=10):
        """Generate hybrid recommendations"""
        
        # Get recommendations from both models
        cf_recs = self.cf_model.recommend(user_id, n_recommendations=20)
        cb_recs = self.cb_model.recommend(user_id, interactions_df, n_recommendations=20)
        
        # Combine scores
        combined_scores = defaultdict(float)
        
        # CF scores (inverse rank scoring)
        for rank, song in enumerate(cf_recs):
            combined_scores[song] += self.cf_weight * (1 / (rank + 1))
        
        # CB scores (inverse rank scoring)
        for rank, song in enumerate(cb_recs):
            combined_scores[song] += self.cb_weight * (1 / (rank + 1))
        
        # Sort and return top N
        final_recs = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        return [song for song, score in final_recs[:n_recommendations]]


# ============================================================================
# 5. DEMONSTRATION
# ============================================================================

def main():
    """Run the complete recommendation pipeline"""
    
    print("=" * 70)
    print("MUSIC RECOMMENDATION SYSTEM - DEMONSTRATION")
    print("=" * 70)
    
    # Generate data
    print("\n📊 Generating synthetic music data...")
    songs_df, interactions_df = generate_synthetic_data()
    print(f"✓ Created {len(songs_df)} songs and {len(interactions_df)} listening events")
    print(f"✓ {len(interactions_df['user_id'].unique())} users in the system")
    
    # Initialize and train the hybrid recommender
    recommender = HybridRecommender(cf_weight=0.6, cb_weight=0.4)
    recommender.fit(songs_df, interactions_df)
    
    # Make recommendations for a few users
    print("\n" + "=" * 70)
    print("RECOMMENDATION EXAMPLES")
    print("=" * 70)
    
    sample_users = ['U001', 'U042', 'U075']
    
    for user_id in sample_users:
        print(f"\n🎧 Recommendations for {user_id}:")
        print("-" * 70)
        
        # Get the songs the user listened to in full (rating > 0.5)
        user_history = interactions_df[
            (interactions_df['user_id'] == user_id) & 
            (interactions_df['rating'] > 0.5)
        ]['song_id'].tolist()
        
        print(f"User has listened to: {', '.join(user_history[:5])}...")
        
        # Show what genres they like
        user_songs = songs_df[songs_df['song_id'].isin(user_history)]
        favorite_genres = user_songs['genre'].value_counts()
        print(f"Favorite genres: {', '.join([f'{g} ({c})' for g, c in favorite_genres.head(3).items()])}")
        
        # Get recommendations
        recommendations = recommender.recommend(user_id, interactions_df, n_recommendations=5)
        
        print(f"\n📝 Top 5 Recommended Songs:")
        for i, song_id in enumerate(recommendations, 1):
            song_info = songs_df[songs_df['song_id'] == song_id].iloc[0]
            print(f"  {i}. {song_id} - {song_info['artist']} ({song_info['genre']}) "
                  f"[tempo: {song_info['tempo']}, energy: {song_info['energy']:.2f}]")
    
    print("\n" + "=" * 70)
    print("✨ Recommendation pipeline complete!")
    print("=" * 70)


if __name__ == "__main__":
    main()

MUSIC RECOMMENDATION SYSTEM - DEMONSTRATION

📊 Generating synthetic music data...
✓ Created 50 songs and 1328 listening events
✓ 100 users in the system

🎵 Training Hybrid Recommender System...
✓ Trained CF model on 100 users and 50 songs
✓ Prepared content features for 50 songs
✓ Training complete!


RECOMMENDATION EXAMPLES

🎧 Recommendations for U001:
----------------------------------------------------------------------
User has listened to: S020, S036, S019, S030, S012...
Favorite genres: Jazz (10), Pop (2), Rock (2)

📝 Top 5 Recommended Songs:
  1. S005 - Artist_E (Electronic) [tempo: 71, energy: 0.29]
  2. S035 - Artist_C (Pop) [tempo: 74, energy: 0.96]
  3. S008 - Artist_C (Rock) [tempo: 107, energy: 0.81]
  4. S016 - Artist_B (Jazz) [tempo: 163, energy: 0.90]
  5. S017 - Artist_D (Rock) [tempo: 145, energy: 0.32]

🎧 Recommendations for U042:
----------------------------------------------------------------------
User has listened to: S033, S036, S050, S0

# Latent Factors Explained

> A latent factor is a **hidden dimension**: a feature that is not explicitly observed in your data but is inferred from patterns in user–item interactions.
> “An abstract concept that explains why a user likes certain items, even though we never labeled it.”

Imagine you have a user–song matrix that records whether a user listened to a song (1) or not (0):

|       | Song A | Song B | Song C | Song D |
|-------|:------:|:------:|:------:|:------:|
| Alice |   1    |   1    |   0    |   0    |
| Bob   |   1    |   1    |   0    |   0    |
| Carol |   0    |   0    |   1    |   1    |


We can clearly see that:
* Alice and Bob both like A & B (maybe both are rock songs)
* Carol likes C & D (maybe jazz songs)

But our data does not say “rock” or “jazz” anywhere!
Those genres are **latent (hidden) properties** that explain the structure of the data.

A matrix factorization model learns such hidden dimensions automatically:
* It finds that one dimension (latent factor #1) might correspond to “rock vs. jazz”
* Another could capture “energetic vs. calm”
* Another could capture “popular vs. niche”

Each song and each user gets a coordinate in this hidden “taste space”.

## In Math Terms
We approximate the user–item matrix R (ratings or play counts) as:

$R \approx U \times V^T$

* U: user–latent matrix (each user has a vector of latent preferences)
* V: item–latent matrix (each item has a vector of latent attributes)
* The number of columns in U and in V (equivalently, the number of rows in $V^T$) equals the number of latent factors k

Each factor dimension captures a pattern such as:
* “Prefers energetic vs calm music”
* “Prefers mainstream vs indie”
* “Prefers instrumental vs vocal”

Prediction for user $u$ and item $i$:
$\hat{r}_{ui} = U_u \cdot V_i^T$


In [2]:
import numpy as np
from sklearn.decomposition import NMF
import pandas as pd

# Example user–item play matrix
R = np.array([
    [5, 4, 0, 0],  # Alice
    [4, 5, 0, 0],  # Bob
    [0, 0, 4, 5],  # Carol
    [0, 0, 5, 4],  # Dave
])

nmf = NMF(n_components=2, random_state=42)
U = nmf.fit_transform(R)   # users × latent factors
V = nmf.components_.T      # items × latent factors

pd.DataFrame(U, columns=["Latent_1", "Latent_2"], index=["Alice", "Bob", "Carol", "Dave"])

|       | Latent_1 | Latent_2 |
|-------|:--------:|:--------:|
| Alice | 0.09224  | 0.0      |
| Bob   | 0.09224  | 0.0      |
| Carol | 0.0      | 1.646558 |
| Dave  | 0.0      | 1.646558 |


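You’ll see that:
* Alice and Bob load only on latent factor 1 (→ “Rock”)
* Carol and Dave load only on latent factor 2 (→ “Jazz”)

The absolute sizes (0.09 vs. 1.65) are not comparable across factors, because NMF can shift scale freely between U and V; what matters is which factor each user loads on.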
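
The prediction formula $\hat{R} = U V^T$ can be checked on the same toy matrix. This sketch swaps in a rank-2 truncated SVD from NumPy instead of NMF, purely to keep the example dependency-light; the factor values will differ from the NMF output above, but the reconstruction shows the same two-group structure.

```python
import numpy as np

# Same toy play matrix as in the NMF cell
R = np.array([
    [5, 4, 0, 0],  # Alice
    [4, 5, 0, 0],  # Bob
    [0, 0, 4, 5],  # Carol
    [0, 0, 5, 4],  # Dave
], dtype=float)

k = 2  # number of latent factors to keep

# Truncated SVD (a stand-in for NMF here): keep the k largest singular values
left, singular_values, right_t = np.linalg.svd(R, full_matrices=False)
U = left[:, :k] * singular_values[:k]  # users x k
V = right_t[:k, :].T                   # items x k

# Predicted score for every (user, item) pair: r_hat[u, i] = U[u] . V[i]
R_hat = U @ V.T
print(np.round(R_hat, 2))
```

Within each taste group the reconstructed scores come out at about 4.5 (the 5s and 4s are averaged by the low-rank approximation), while cross-group scores stay at about 0.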

| Concept        | Meaning                                                                 |
|----------------|--------------------------------------------------------------------------|
| Latent Factor  | A hidden variable inferred from data that explains observed patterns.    |
| In Recommenders | Captures abstract tastes or themes shared by users/items.               |
| Discovered by  | Matrix Factorization, Embeddings, Neural Nets                            |
| Use            | Predict unknown interactions, compress large matrices, reveal structure  |