## Content-Based Music Recommendation (Nearest Neighbors)

This section builds a **content-based** recommender from the audio features and metadata in `df_clean`. A Nearest Neighbors model with cosine distance is fit on standardized numeric features plus a one-hot-encoded `mode`, and returns the **10 most similar songs** for a given input song.
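
As a rough illustration of the idea, the sketch below ranks rows of a small feature matrix by cosine similarity to one query row. It is a brute-force NumPy version, not the scikit-learn model used in the cells that follow; `top_k_cosine` and `toy_features` are hypothetical names, and the numbers are made up.

```python
import numpy as np

def top_k_cosine(X, query_idx, k=2):
    """Return (indices, similarities) of the k rows most similar to row query_idx.

    Assumes every row of X has a non-zero norm.
    """
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1)[:, None]  # normalize each row to unit length
    sims = unit @ unit[query_idx]                  # cosine similarity to the query row
    sims[query_idx] = -np.inf                      # exclude the query itself
    order = np.argsort(-sims)[:k]                  # highest similarity first
    return order, sims[order]

# Toy, already-standardized feature rows (illustrative values only)
toy_features = np.array([
    [ 1.0,  0.9, -0.2],   # song 0 (query)
    [ 0.9,  1.0, -0.1],   # song 1: points in almost the same direction as song 0
    [-1.0, -0.8,  0.5],   # song 2: points roughly the opposite way
    [ 0.2, -0.1,  1.0],   # song 3: nearly orthogonal to song 0
])

neighbors, similarities = top_k_cosine(toy_features, query_idx=0, k=2)
print(neighbors, similarities)
```

Song 1 ranks first because its feature vector points in nearly the same direction as the query, regardless of vector length.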

In [2]:
# ============================================================================
# 1) Imports
# ============================================================================

import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import NearestNeighbors

import warnings
warnings.filterwarnings('ignore')

print('Imports ready.')

Imports ready.


In [3]:
# ============================================================================
# 2) Load df_clean
# ============================================================================

df_clean_path = r"D:\UNH Materials\Projects\Spotify Song Recommendations\data\df_clean.csv"
# df_clean_path = r"D:\UNH Materials\Projects\Spotify Song Recommendations\data\top_10000_1960-now.csv"
df_clean = pd.read_csv(df_clean_path)
# df_clean.columns = df_clean.columns.str.lower().str.replace(' ', '_')
# df_clean = df_clean.drop(['album_genres', 'artist_genres', 'track_preview_url', 'copyrights'], axis=1)
# df_clean = df_clean.dropna()

print('df_clean loaded:', df_clean.shape)
display(df_clean.head())

df_clean loaded: (8582, 36)


Unnamed: 0,track_uri,track_name,artist_uri(s),artist_name(s),album_uri,album_name,album_artist_uri(s),album_artist_name(s),album_release_date,album_image_url,...,valence,tempo,time_signature,label,release_year,release_month,release_quarter,release_week,release_day_of_week,track_duration_min
0,spotify:track:1XAZlnVtthcDZt2NI1Dtxo,Justified & Ancient - Stand by the Jams,spotify:artist:6dYrdRlNZSKaVxYg5IrvCH,The KLF,spotify:album:4MC0ZjNtVP1nDD5lsLxFjc,Songs Collection,spotify:artist:6dYrdRlNZSKaVxYg5IrvCH,The KLF,1992-08-03,https://i.scdn.co/image/ab67616d0000b27355346b...,...,0.504,111.458,4.0,Jams Communications,1992,8,3,32,Monday,3.6045
1,spotify:track:6a8GbQIlV8HBUW3c6Uk9PH,I Know You Want Me (Calle Ocho),spotify:artist:0TnOYISbd1XYRBk9myaseg,Pitbull,spotify:album:5xLAcbvbSAlRtPXnKkggXA,Pitbull Starring In Rebelution,spotify:artist:0TnOYISbd1XYRBk9myaseg,Pitbull,2009-10-23,https://i.scdn.co/image/ab67616d0000b27326d73a...,...,0.8,127.045,4.0,Mr.305/Polo Grounds Music/J Records,2009,10,4,43,Friday,3.952
2,spotify:track:70XtWbcVZcpaOddJftMcVi,From the Bottom of My Broken Heart,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Britney Spears,spotify:album:3WNxdumkSMGMJRhEgK80qx,...Baby One More Time (Digital Deluxe Version),spotify:artist:26dSoYclwsYLMAKD3tpOr4,Britney Spears,1999-01-12,https://i.scdn.co/image/ab67616d0000b2738e4986...,...,0.706,74.981,4.0,Jive,1999,1,1,2,Tuesday,5.208883
3,spotify:track:1NXUWyPJk5kO6DQJ5t7bDu,Apeman - 2014 Remastered Version,spotify:artist:1SQRv42e4PjEYfPhS0Tk9E,The Kinks,spotify:album:6lL6HugNEN4Vlc8sj0Zcse,"Lola vs. Powerman and the Moneygoround, Pt. On...",spotify:artist:1SQRv42e4PjEYfPhS0Tk9E,The Kinks,2014-10-20,https://i.scdn.co/image/ab67616d0000b2731e7c53...,...,0.833,75.311,4.0,Sanctuary Records,2014,10,4,43,Monday,3.89
4,spotify:track:72WZtWs6V7uu3aMgMmEkYe,You Can't Always Get What You Want,spotify:artist:22bE4uQ6baNwSHPVcDxLCe,The Rolling Stones,spotify:album:0c78nsgqX6VfniSNWIxwoD,Let It Bleed,spotify:artist:22bE4uQ6baNwSHPVcDxLCe,The Rolling Stones,1969-12-05,https://i.scdn.co/image/ab67616d0000b27373d927...,...,0.497,85.818,4.0,Universal Music Group,1969,12,4,49,Friday,7.478667


In [6]:
# ============================================================================
# 3) Feature selection (content-based)
#    We avoid IDs/names for modeling, but keep them for display.
# ============================================================================

# Strong content signals: audio + a bit of metadata
numerical_features = [
    'danceability', 'energy', 'loudness', 'explicit',
    'instrumentalness', 'tempo', 'popularity', 'valence',
    'speechiness', 'acousticness', 'liveness'
]

categorical_features = [
    # 'genre', 'country', 'label'
    'mode'
]

# Keep only columns that exist
numerical_features = [c for c in numerical_features if c in df_clean.columns]
categorical_features = [c for c in categorical_features if c in df_clean.columns]

required_id_cols = ['track_name', 'artist_name(s)']
available_id_cols = [c for c in required_id_cols if c in df_clean.columns]

print('Numerical features:', numerical_features)
print('Categorical features:', categorical_features)
print('ID/display columns:', available_id_cols)

# Basic cleaning for modeling
model_df = df_clean[available_id_cols + numerical_features + categorical_features].copy()

# Fill missing values
for c in numerical_features:
    model_df[c] = model_df[c].fillna(model_df[c].median())
for c in categorical_features:
    model_df[c] = model_df[c].fillna('Unknown')

print('Modeling dataframe:', model_df.shape)
display(model_df.head())

Numerical features: ['danceability', 'energy', 'loudness', 'explicit', 'instrumentalness', 'tempo', 'popularity', 'valence', 'speechiness', 'acousticness', 'liveness']
Categorical features: ['mode']
ID/display columns: ['track_name', 'artist_name(s)']
Modeling dataframe: (8582, 14)


Unnamed: 0,track_name,artist_name(s),danceability,energy,loudness,explicit,instrumentalness,tempo,popularity,valence,speechiness,acousticness,liveness,mode
0,Justified & Ancient - Stand by the Jams,The KLF,0.617,0.872,-12.305,False,0.112,111.458,0,0.504,0.048,0.0158,0.408,1.0
1,I Know You Want Me (Calle Ocho),Pitbull,0.825,0.743,-5.995,False,2.1e-05,127.045,64,0.8,0.149,0.0142,0.237,1.0
2,From the Bottom of My Broken Heart,Britney Spears,0.677,0.665,-5.171,False,1e-06,74.981,56,0.706,0.0305,0.56,0.338,1.0
3,Apeman - 2014 Remastered Version,The Kinks,0.683,0.728,-8.92,False,5.1e-05,75.311,42,0.833,0.259,0.568,0.0384,1.0
4,You Can't Always Get What You Want,The Rolling Stones,0.319,0.627,-9.611,False,7.3e-05,85.818,0,0.497,0.0687,0.675,0.289,1.0
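
One thing to watch in the cell above: `mode` is stored as a float, so filling its missing values with the string `'Unknown'` would mix numbers and strings in one column, which encoders may not accept. A minimal sketch of one alternative is to map the codes to labels before filling. It assumes the usual Spotify convention (1 = major, 0 = minor) applies to this dataset; `toy`, `mode_label`, and `explicit_int` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for model_df (values are illustrative)
toy = pd.DataFrame({
    'mode': [1.0, 0.0, np.nan, 1.0],
    'explicit': [False, True, False, False],
})

# Map numeric codes to string labels first, then fill gaps,
# so the column holds strings only (assumed convention: 1 = major, 0 = minor)
toy['mode_label'] = toy['mode'].map({1.0: 'major', 0.0: 'minor'}).fillna('Unknown')

# Make the boolean flag explicitly numeric before scaling
toy['explicit_int'] = toy['explicit'].astype(int)

print(toy)
```

With this variant, `mode_label` would replace `mode` in `categorical_features` and `explicit_int` would replace `explicit` in `numerical_features`.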


In [7]:
# ============================================================================
# 4) Build feature matrix + fit Nearest Neighbors model
# ============================================================================

# Preprocessing: scale numericals, one-hot encode categoricals
preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ],
    remainder='drop'
)

# We fit a NearestNeighbors model on the transformed feature space
nn_model = NearestNeighbors(metric='cosine', algorithm='auto')

pipe = Pipeline([
    ('preprocess', preprocess),
    ('nn', nn_model)
])

X = model_df[numerical_features + categorical_features]
pipe.fit(X)

print('NearestNeighbors model fit complete.')

NearestNeighbors model fit complete.
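
The scaling step matters because cosine distance is computed on whatever magnitudes the features carry. The sketch below uses made-up values for two columns (danceability and tempo) to show how an unscaled `tempo` can dominate the comparison. The z-scoring is done by hand with the population standard deviation, which is what `StandardScaler` uses; `cosine_similarity` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy rows: [danceability, tempo] (illustrative values only)
raw = np.array([
    [0.90, 120.0],   # very danceable
    [0.10, 121.0],   # barely danceable, almost the same tempo
    [0.88,  90.0],   # very danceable, slower
])

# Column-wise z-scores (np.std defaults to the population standard deviation)
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

raw_sim = cosine_similarity(raw[0], raw[1])
scaled_sim = cosine_similarity(scaled[0], scaled[1])
print(f"raw: {raw_sim:.5f}  scaled: {scaled_sim:.5f}")
```

On the raw values the first two rows look almost identical because tempo swamps danceability; after scaling, their difference in danceability shows up and the similarity drops sharply.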


In [8]:
# ============================================================================
# 5) Recommendation function
#    Input can be a track_name (recommended) or a track_id.
#    model_df has no track_id column here, so only the name lookup is active.
# ============================================================================

# Build quick lookup indices
name_to_idx = None
id_to_idx = None

if 'track_name' in model_df.columns:
    # If duplicates exist, we keep the first occurrence
    name_to_idx = pd.Series(model_df.index.values, index=model_df['track_name'].astype(str)).groupby(level=0).first().to_dict()

if 'track_id' in model_df.columns:
    id_to_idx = pd.Series(model_df.index.values, index=model_df['track_id'].astype(str)).groupby(level=0).first().to_dict()


def recommend_songs(song, k=10, verbose=True):
    """Return k closest songs for a given input song (track_name or track_id).

    Parameters
    ----------
    song : str
        A track name (e.g., "Blinding Lights") OR a track_id.
    k : int
        Number of recommendations to return.
    verbose : bool
        If True, prints the matched input row.

    Returns
    -------
    pd.DataFrame
        Top-k similar songs with similarity score.
    """
    if song is None or str(song).strip() == '':
        raise ValueError('Please provide a non-empty song name or track_id.')

    song = str(song)

    # Resolve index
    idx = None
    if id_to_idx is not None and song in id_to_idx:
        idx = id_to_idx[song]
    elif name_to_idx is not None and song in name_to_idx:
        idx = name_to_idx[song]
    else:
        # Fallback: case-insensitive substring match on track_name.
        # regex=False treats characters such as '(' or '.' in titles literally.
        if 'track_name' in model_df.columns:
            mask = model_df['track_name'].astype(str).str.lower().str.contains(song.lower(), na=False, regex=False)
            if mask.any():
                idx = model_df.loc[mask].index[0]
                song = model_df.loc[idx, 'track_name']  # normalize to exact name

    if idx is None:
        raise KeyError(f"Song '{song}' not found. Try an exact track_name or a valid track_id.")

    if verbose:
        print('Matched input song:')
        cols_to_show = available_id_cols + (['genre'] if 'genre' in model_df.columns else [])
        display(model_df.loc[[idx], cols_to_show])

    # Query neighbors: ask for k+1 so we can drop the song itself
    query_X = model_df.loc[[idx], numerical_features + categorical_features]
    distances, indices = pipe.named_steps['nn'].kneighbors(
        pipe.named_steps['preprocess'].transform(query_X),
        n_neighbors=min(k + 1, len(model_df))
    )

    distances = distances.ravel()
    indices = indices.ravel()

    # Convert transformed-space indices back to original row indices
    # NearestNeighbors was fit on rows in the same order as model_df
    neighbor_df_indices = model_df.iloc[indices].index.values

    # Build results and drop self
    results = model_df.loc[neighbor_df_indices, :].copy()
    results['distance_cosine'] = distances
    results['similarity'] = 1 - results['distance_cosine']

    # Drop the input song itself (distance 0)
    results = results[results.index != idx]

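    # Pick the display columns that are present in the results.
    # Note: the artist column in this dataset is 'artist_name(s)', so the
    # 'artist_name' entry below never matches and the artist is left out of
    # the output; list 'artist_name(s)' instead to include it.
    cols_out = []
    for c in ['track_name', 'artist_name', 'popularity', 'genre', 'country', 'label']:
        if c in results.columns:
            cols_out.append(c)
    if 'track_id' in results.columns:
        cols_out = ['track_id'] + cols_out

    # Sort by similarity and return top k
    out = results.sort_values('similarity', ascending=False).head(k)
    out = out[cols_out + ['similarity']]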

    return out.reset_index(drop=True)


# Example usage (pick any exact track_name from df_clean):
# recommend_songs('Say It Right')
print('Function recommend_songs(song, k=10) is ready.')

Function recommend_songs(song, k=10) is ready.
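
Because `name_to_idx` keeps only the first row for a repeated title, two different songs with the same name resolve to whichever appears first. One possible way to surface the ambiguity is to list every matching row and let the caller choose. The sketch below uses a toy frame rather than `df_clean`, and `find_candidates` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for model_df (not taken from df_clean)
toy = pd.DataFrame({
    'track_name': ['Hello', 'Hello', 'Angel'],
    'artist_name(s)': ['Adele', 'Lionel Richie', 'Shaggy'],
})

def find_candidates(df, title):
    """Return every row whose track_name equals title, ignoring case."""
    mask = df['track_name'].astype(str).str.lower() == title.lower()
    return df.loc[mask, ['track_name', 'artist_name(s)']]

candidates = find_candidates(toy, 'hello')
print(candidates)
```

The index of the chosen row could then be used in place of the name lookup when querying neighbors.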


In [11]:
# Quick demo: recommend based on the second song in the dataset (row index 1)
example_song_name = df_clean['track_name'].astype(str).iloc[1] if 'track_name' in df_clean.columns else df_clean['track_id'].astype(str).iloc[0]
example_song = df_clean[df_clean['track_name'] == example_song_name].iloc[0]
print("\n--- Testing Recommendation System ---")
print(f"\nOriginal Track:")
print(f"  Track: {example_song['track_name']}")
print(f"  Artist: {example_song['artist_name(s)']}")
# print(f"  Genre: {example_song['genre']}")
print(f"  Popularity: {example_song['popularity']}")

print(f"\nTop 10 Recommended Similar Tracks:")

recs = recommend_songs(example_song_name, k=10, verbose=False)
display(recs)


--- Testing Recommendation System ---

Original Track:
  Track: I Know You Want Me (Calle Ocho)
  Artist: Pitbull
  Popularity: 64

Top 10 Recommended Similar Tracks:


Unnamed: 0,track_name,popularity,similarity
0,Super Freaky Girl,53,0.925639
1,Gettin' Jiggy Wit It,76,0.916959
2,Friday (feat. Mufasa & Hypeman) - Dopamine Re-...,83,0.915333
3,Big Girl (You Are Beautiful),61,0.896947
4,Wearing My Rolex - Radio Edit,60,0.890618
5,Shackles (Praise You),59,0.889421
6,Say It Right,83,0.870965
7,Turn Up The Love,56,0.870914
8,Run the World (Girls),76,0.867466
9,Jenny from the Block (feat. Jadakiss & Styles ...,65,0.867216
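
To help read the `similarity` column: it is 1 minus the cosine distance, so it depends only on the direction of the feature vectors, not their length. Because standardized features can be negative, the value can fall anywhere from -1 to 1. The sketch below checks both ends with made-up vectors; `cosine_distance` is a hypothetical helper written in NumPy.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, -0.5, 0.3])   # illustrative standardized features
same_direction = 2.0 * query         # same profile, larger magnitude
opposite = -query                    # every feature on the other side of the mean

sim_same = 1 - cosine_distance(query, same_direction)
sim_opp = 1 - cosine_distance(query, opposite)
print(sim_same, sim_opp)
```

On that scale, the scores of roughly 0.87 to 0.93 in the table above indicate songs whose standardized feature profiles point in nearly the same direction as the query.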
