---
# ***Notebook Overview:***
# **Auraly – Mood-Based Playlist Generator**

This notebook forms the core data-processing and playlist-generation pipeline for Auraly, a mood-based music recommendation application. It loads and cleans a dataset of 20,718 tracks with Spotify and YouTube features, then applies a pre-trained XGBoost classifier (auraly_xgb_model.pkl), developed in the main modeling notebook, to predict the emotional mood of each track.
The model classifies songs into four moods (happy, sad, energetic, and calm) based on audio and perceptual features such as energy, valence, tempo, loudness, and acousticness.

Before prediction, the dataset is cleaned: irrelevant metadata columns are removed, rows with missing values are dropped, duplicates are checked for, column names are lowercased, and the feature columns are aligned with the inputs the model expects. The resulting dataset (spotify_df) serves as the foundation for generating personalized playlists. Each record links a song’s Spotify URI, artist, and essential audio features to its predicted mood label.

The notebook then extends into a playlist-generation system that allows users to input either a mood keyword (e.g., “happy”) or a short descriptive phrase (e.g., “need calm focus music” or “upbeat gym vibes”). Through lightweight natural-language keyword matching and mood inference, the system maps user input to one of the four moods, ranks tracks within that category using mood-specific scoring weights (energy, valence, tempo, loudness, etc.), and outputs a curated list of around fifteen songs.
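
As a rough illustration of the keyword-matching step described above, the sketch below maps free text to one of the four moods by counting keyword hits. The keyword lists, the `infer_mood` name, and the fallback mood are assumptions made for this example, not the vocabulary Auraly actually uses.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
MOOD_KEYWORDS = {
    "happy": {"happy", "joy", "cheerful", "sunny", "smile"},
    "sad": {"sad", "heartbreak", "lonely", "melancholy", "cry"},
    "energetic": {"energetic", "upbeat", "gym", "workout", "party", "hype"},
    "calm": {"calm", "focus", "relax", "chill", "sleep", "study"},
}


def infer_mood(text, default="happy"):
    """Return the mood whose keyword set overlaps most with the words in `text`."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {mood: len(words & keywords) for mood, keywords in MOOD_KEYWORDS.items()}
    # On a tie, max() keeps the mood that appears first in MOOD_KEYWORDS.
    best_mood = max(scores, key=scores.get)
    # Fall back to the default when no keyword matched at all.
    return best_mood if scores[best_mood] > 0 else default


print(infer_mood("need calm focus music"))
print(infer_mood("upbeat gym vibes"))
```

Counting overlaps keeps the logic transparent and easy to extend with new keywords; a production system would likely need stemming or synonyms to handle phrasing this simple lookup misses.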

In deployment, this logic can be integrated into a web interface, where users instantly receive Spotify playlist recommendations derived from the trained model and curated dataset.
Overall, this notebook demonstrates the complete flow from data ingestion to deployable recommendation logic, bridging machine-learning mood classification with user-friendly playlist generation for real-world applications.

Beyond individual users, producers, record labels, and local music platforms can apply this workflow to classify and recommend their own catalogues. For example, Kenyan underground music or emerging regional artists can be automatically categorized by mood and surfaced to new audiences, which makes Auraly both a discovery tool and a personalization tool for global and local music ecosystems.

---


In [1]:
# importing the libraries
import pandas as pd
import numpy as np
import zipfile
import warnings
from xgboost import XGBClassifier
import joblib
import json
warnings.filterwarnings('ignore')

In [2]:
# Extract and load the CSV
zip_path = "Spotify_Youtube.csv.zip"

# Extract contents
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("unzipped_folder")

# Load the data
spotify_data = pd.read_csv('unzipped_folder/Spotify_Youtube.csv')

spotify_data.head(2)

,Unnamed: 0,Artist,Url_spotify,Track,Album,Album_type,Uri,Danceability,Energy,Key,...,Url_youtube,Title,Channel,Views,Likes,Comments,Description,Licensed,official_video,Stream
0,0,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Feel Good Inc.,Demon Days,album,spotify:track:0d28khcov6AiegSCpG5TuT,0.818,0.705,6.0,...,https://www.youtube.com/watch?v=HyHNuVaZJ-k,Gorillaz - Feel Good Inc. (Official Video),Gorillaz,693555221.0,6220896.0,169907.0,Official HD Video for Gorillaz' fantastic trac...,True,True,1040235000.0
1,1,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Rhinestone Eyes,Plastic Beach,album,spotify:track:1foMv2HQwfQ2vntFf9HFeG,0.676,0.703,8.0,...,https://www.youtube.com/watch?v=yYDmaexVHic,Gorillaz - Rhinestone Eyes [Storyboard Film] (...,Gorillaz,72011645.0,1079128.0,31003.0,The official video for Gorillaz - Rhinestone E...,True,True,310083700.0
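
If the archive holds a single CSV, pandas can usually read it without the explicit extraction step, because `read_csv` infers zip compression from the `.zip` extension. The sketch below demonstrates this on a small throwaway archive; the file names and rows are made up for illustration, and whether `Spotify_Youtube.csv.zip` contains exactly one file is an assumption.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive so the example runs anywhere.
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, "demo.csv.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("demo.csv", "Artist,Track,Energy\nGorillaz,Feel Good Inc.,0.705\n")

# read_csv infers compression from the ".zip" suffix; the archive must
# contain exactly one file for this to work.
demo_df = pd.read_csv(demo_zip)
print(demo_df)
```

This avoids leaving an `unzipped_folder` directory on disk, at the cost of failing if the archive ever gains a second file.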


In [3]:
# Check dataset info and missing values
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20718 entries, 0 to 20717
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        20718 non-null  int64  
 1   Artist            20718 non-null  object 
 2   Url_spotify       20718 non-null  object 
 3   Track             20718 non-null  object 
 4   Album             20718 non-null  object 
 5   Album_type        20718 non-null  object 
 6   Uri               20718 non-null  object 
 7   Danceability      20716 non-null  float64
 8   Energy            20716 non-null  float64
 9   Key               20716 non-null  float64
 10  Loudness          20716 non-null  float64
 11  Speechiness       20716 non-null  float64
 12  Acousticness      20716 non-null  float64
 13  Instrumentalness  20716 non-null  float64
 14  Liveness          20716 non-null  float64
 15  Valence           20716 non-null  float64
 16  Tempo             20716 non-null  float64

In [4]:
spotify_data.isnull().sum().sort_values(ascending=False)

Description         876
Stream              576
Comments            569
Likes               541
official_video      470
Licensed            470
Views               470
Channel             470
Title               470
Url_youtube         470
Valence               2
Duration_ms           2
Tempo                 2
Liveness              2
Instrumentalness      2
Acousticness          2
Speechiness           2
Loudness              2
Key                   2
Energy                2
Danceability          2
Artist                0
Uri                   0
Album_type            0
Album                 0
Track                 0
Url_spotify           0
Unnamed: 0            0
dtype: int64

In [5]:
spotify_data.columns

Index(['Unnamed: 0', 'Artist', 'Url_spotify', 'Track', 'Album', 'Album_type',
       'Uri', 'Danceability', 'Energy', 'Key', 'Loudness', 'Speechiness',
       'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo',
       'Duration_ms', 'Url_youtube', 'Title', 'Channel', 'Views', 'Likes',
       'Comments', 'Description', 'Licensed', 'official_video', 'Stream'],
      dtype='object')

In [6]:
# Dropping columns that are not required 

spotify_df = spotify_data.drop(columns=['Unnamed: 0', 'Url_youtube', 'Title', 'Channel', 'Views', 'Likes',
       'Comments', 'Description', 'Licensed', 'official_video', 'Stream', 'Key'])

# Drop rows with any missing Spotify feature values
spotify_df = spotify_df.dropna()
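
The call above drops a row if any remaining column is missing. In this dataset that matches the intent, because only the audio-feature columns contain gaps once the YouTube columns are gone. A slightly more explicit variant, sketched here on made-up rows, restricts the check to the feature columns with `subset`; the `audio_cols` list and the demo values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Artist": ["A", "B", "C"],
    "Album": ["X", None, "Z"],      # missing metadata, not an audio feature
    "Energy": [0.7, 0.5, np.nan],   # missing audio feature
    "Valence": [0.8, 0.4, 0.6],
})

audio_cols = ["Energy", "Valence"]

# Only rows lacking an audio feature are removed; row "B" survives
# even though its album name is missing.
cleaned = demo.dropna(subset=audio_cols)
print(cleaned["Artist"].tolist())
```

Using `subset` keeps tracks whose metadata is incomplete but whose model inputs are intact, which matters if the catalogue is later extended with sparser records.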

In [7]:
# Inspect cleaned dataset
spotify_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20716 entries, 0 to 20717
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Artist            20716 non-null  object 
 1   Url_spotify       20716 non-null  object 
 2   Track             20716 non-null  object 
 3   Album             20716 non-null  object 
 4   Album_type        20716 non-null  object 
 5   Uri               20716 non-null  object 
 6   Danceability      20716 non-null  float64
 7   Energy            20716 non-null  float64
 8   Loudness          20716 non-null  float64
 9   Speechiness       20716 non-null  float64
 10  Acousticness      20716 non-null  float64
 11  Instrumentalness  20716 non-null  float64
 12  Liveness          20716 non-null  float64
 13  Valence           20716 non-null  float64
 14  Tempo             20716 non-null  float64
 15  Duration_ms       20716 non-null  float64
dtypes: float64(10), object(6)
memory usage: 2.7+ MB

In [8]:
spotify_df.isnull().sum()

Artist              0
Url_spotify         0
Track               0
Album               0
Album_type          0
Uri                 0
Danceability        0
Energy              0
Loudness            0
Speechiness         0
Acousticness        0
Instrumentalness    0
Liveness            0
Valence             0
Tempo               0
Duration_ms         0
dtype: int64

In [9]:
spotify_df = spotify_df.rename(columns=str.lower)
print(spotify_df.columns)

Index(['artist', 'url_spotify', 'track', 'album', 'album_type', 'uri',
       'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
      dtype='object')


In [10]:
spotify_df.head()

,artist,url_spotify,track,album,album_type,uri,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Feel Good Inc.,Demon Days,album,spotify:track:0d28khcov6AiegSCpG5TuT,0.818,0.705,-6.679,0.177,0.00836,0.00233,0.613,0.772,138.559,222640.0
1,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Rhinestone Eyes,Plastic Beach,album,spotify:track:1foMv2HQwfQ2vntFf9HFeG,0.676,0.703,-5.815,0.0302,0.0869,0.000687,0.0463,0.852,92.761,200173.0
2,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,spotify:track:64dLd6rVqDLtkXFYrEUHIU,0.695,0.923,-3.93,0.0522,0.0425,0.0469,0.116,0.551,108.014,215150.0
3,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,On Melancholy Hill,Plastic Beach,album,spotify:track:0q6LuUqGLUiCPP1cbdwFs3,0.689,0.739,-5.81,0.026,1.5e-05,0.509,0.064,0.578,120.423,233867.0
4,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Clint Eastwood,Gorillaz,album,spotify:track:7yMiX7n9SBvadzox8T5jzT,0.663,0.694,-8.627,0.171,0.0253,0.0,0.0698,0.525,167.953,340920.0


In [11]:
spotify_df.duplicated().value_counts()

False    20716
Name: count, dtype: int64
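
`duplicated()` with no arguments only flags rows that match on every column. The same track could still appear more than once with differing metadata, for instance if a collaboration were listed once per artist. The sketch below, on made-up rows, shows how a key-based check on `uri` differs; whether such repeats exist in this dataset is not established here.

```python
import pandas as pd

demo = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist A"],
    "track": ["Song 1", "Song 1", "Song 2"],
    "uri": ["spotify:track:aaa", "spotify:track:aaa", "spotify:track:bbb"],
})

# Full-row comparison: no duplicates, because the artist differs.
exact_dupes = int(demo.duplicated().sum())

# Key-based comparison: the second row repeats an already seen URI.
uri_dupes = int(demo.duplicated(subset="uri").sum())

# Keep the first occurrence of each track URI.
deduped = demo.drop_duplicates(subset="uri", keep="first")
print(exact_dupes, uri_dupes, len(deduped))
```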

In [12]:
spotify_df.shape

(20716, 16)

In [13]:
# Load trained model and label map
xgb_model = joblib.load("auraly_xgb_model.pkl")
with open("label_map.json") as f:
    label_map = json.load(f)

In [14]:
# Make a copy 
X_spotify = spotify_df.copy()

# Convert duration from milliseconds to minutes
X_spotify["duration_min"] = X_spotify["duration_ms"] / 60000  

# Now only keep the expected features
features = ['danceability', 'energy', 'loudness', 'speechiness', 
            'acousticness', 'instrumentalness', 'liveness', 
            'valence', 'tempo', 'duration_min']

X_spotify = X_spotify[features]

In [15]:
X_spotify.columns

Index(['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_min'],
      dtype='object')

In [16]:
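# Predict the numeric mood class for every track
spotify_df["predicted_mood"] = xgb_model.predict(X_spotify)
# JSON object keys are strings, so cast the class codes to str before mapping them to names
spotify_df["mood_label"] = spotify_df["predicted_mood"].astype(str).map(label_map)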
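
The `astype(str)` step matters because JSON stores object keys as strings. The sketch below reproduces the effect with a stand-in label map. Only the pairs 1 → Happy and 2 → Energetic are visible in the notebook output; the codes assigned to Sad and Calm here are guesses for illustration.

```python
import json

import pandas as pd

# Stand-in for label_map.json; the 0 and 3 assignments are assumed.
# A JSON round trip turns the integer keys into strings.
label_map = json.loads(json.dumps({0: "Sad", 1: "Happy", 2: "Energetic", 3: "Calm"}))

preds = pd.Series([2, 1, 1, 3])

# Integer codes do not match the string keys, so every lookup yields NaN.
direct = preds.map(label_map)

# Casting to str first lines the codes up with the JSON keys.
via_str = preds.astype(str).map(label_map)

# A quick guard worth running after mapping: count codes with no label.
unmapped = int(via_str.isna().sum())
print(via_str.tolist(), unmapped)
```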


In [17]:
spotify_df.head()

,artist,url_spotify,track,album,album_type,uri,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,predicted_mood,mood_label
0,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Feel Good Inc.,Demon Days,album,spotify:track:0d28khcov6AiegSCpG5TuT,0.818,0.705,-6.679,0.177,0.00836,0.00233,0.613,0.772,138.559,222640.0,2,Energetic
1,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Rhinestone Eyes,Plastic Beach,album,spotify:track:1foMv2HQwfQ2vntFf9HFeG,0.676,0.703,-5.815,0.0302,0.0869,0.000687,0.0463,0.852,92.761,200173.0,1,Happy
2,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,spotify:track:64dLd6rVqDLtkXFYrEUHIU,0.695,0.923,-3.93,0.0522,0.0425,0.0469,0.116,0.551,108.014,215150.0,1,Happy
3,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,On Melancholy Hill,Plastic Beach,album,spotify:track:0q6LuUqGLUiCPP1cbdwFs3,0.689,0.739,-5.81,0.026,1.5e-05,0.509,0.064,0.578,120.423,233867.0,1,Happy
4,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Clint Eastwood,Gorillaz,album,spotify:track:7yMiX7n9SBvadzox8T5jzT,0.663,0.694,-8.627,0.171,0.0253,0.0,0.0698,0.525,167.953,340920.0,1,Happy


In [18]:
spotify_df["mood_label"].value_counts()

mood_label
Happy        12579
Sad           4701
Energetic     2753
Calm           683
Name: count, dtype: int64

In [19]:
spotify_df["mood_label"].unique()

array(['Energetic', 'Happy', 'Sad', 'Calm'], dtype=object)

In [20]:
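def get_playlist_by_mood(mood, n_songs=10):
    """Return up to n_songs randomly sampled tracks whose predicted mood matches `mood`."""
    # Case-insensitive match against the predicted mood label
    filtered_songs = spotify_df[spotify_df["mood_label"].str.lower() == mood.lower()]
    # Cap the request at the number of available tracks so sample() cannot fail
    n_songs = min(n_songs, len(filtered_songs))
    playlist = filtered_songs.sample(n=n_songs)
    # Note: url_spotify holds the artist page URL in this dataset, not a track link
    return playlist[["artist", "track", "url_spotify"]].to_dict(orient="records")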
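
The function above samples at random, whereas the overview describes ranking tracks with mood-specific scoring weights. One possible shape for that ranking step is sketched below. The weight values, the `MOOD_WEIGHTS` and `rank_playlist` names, the min-max scaling, and the demo rows are all assumptions for illustration, not Auraly's actual scoring.

```python
import pandas as pd

# Hypothetical per-mood weights; negative values penalize a feature.
MOOD_WEIGHTS = {
    "happy": {"valence": 0.6, "energy": 0.4},
    "sad": {"valence": -0.6, "acousticness": 0.4},
    "energetic": {"energy": 0.7, "tempo": 0.3},
    "calm": {"acousticness": 0.6, "energy": -0.4},
}


def rank_playlist(df, mood, n_songs=15):
    """Return the n_songs highest-scoring tracks for `mood` as a list of dicts."""
    subset = df[df["mood_label"].str.lower() == mood.lower()].copy()
    if subset.empty:
        return []
    score = pd.Series(0.0, index=subset.index)
    for col, weight in MOOD_WEIGHTS[mood.lower()].items():
        col_min, col_max = subset[col].min(), subset[col].max()
        col_range = col_max - col_min
        # Min-max scale within the mood so tempo (BPM) and 0-1 features are comparable.
        scaled = (subset[col] - col_min) / col_range if col_range > 0 else 0.0
        score += weight * scaled
    subset["score"] = score
    top = subset.sort_values("score", ascending=False).head(n_songs)
    return top[["artist", "track", "url_spotify"]].to_dict(orient="records")


demo = pd.DataFrame({
    "artist": ["A", "B", "C", "D"],
    "track": ["t1", "t2", "t3", "t4"],
    "url_spotify": ["u1", "u2", "u3", "u4"],
    "valence": [0.9, 0.5, 0.7, 0.2],
    "energy": [0.8, 0.9, 0.3, 0.1],
    "tempo": [120.0, 140.0, 90.0, 70.0],
    "acousticness": [0.1, 0.05, 0.4, 0.8],
    "mood_label": ["Happy", "Happy", "Happy", "Sad"],
})

ranked = rank_playlist(demo, "happy", n_songs=2)
print([song["track"] for song in ranked])
```

Scaling within the filtered subset keeps any single feature from dominating the score purely because of its units; the trade-off is that scores are only comparable among tracks of the same mood.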

In [21]:
# Example: get 10 happy songs

happy_playlist = get_playlist_by_mood("Happy", n_songs=10)
for song in happy_playlist:
    print(song["track"], "-", song["artist"], "|", song["url_spotify"])

Faroeste Caboclo - Legião Urbana | https://open.spotify.com/artist/6tw6EpC9RgmSRZiZg0n22t
Caméléon - Booba | https://open.spotify.com/artist/58wXmynHaAWI5hwlPZP3qL
Down In the DM (feat. Nicki Minaj) - Remix - Yo Gotti | https://open.spotify.com/artist/6Ha4aES39QiVjR0L2lwuwq
Conteo Regresivo - Salsa Version - Gilberto Santa Rosa | https://open.spotify.com/artist/27vNK840zYq6IfDijHPsv1
러시안 룰렛 Russian Roulette - Red Velvet | https://open.spotify.com/artist/1z4g3DjTBBZKhvAroFlhOM
Munda Sona Hoon Main (From "Shehzada") - Diljit Dosanjh | https://open.spotify.com/artist/2FKWNmZWDBZR4dE5KX4plR
This Kiss - Faith Hill | https://open.spotify.com/artist/25NQNriVT2YbSW80ILRWJa
Broke In A Minute - Tory Lanez | https://open.spotify.com/artist/2jku7tDXc6XoB6MO2hFuqg
What'd I Say, Pt. 1 & 2 - Ray Charles | https://open.spotify.com/artist/1eYhYunlNJlDoQhtYBvPsi
Jireh (My Provider) - Lecrae | https://open.spotify.com/artist/1CFCsEqKrCyvAFKOATQHiW


In [22]:
# Example: get 10 sad songs

sad_playlist = get_playlist_by_mood("Sad", n_songs=10)
for song in sad_playlist:
    print(song["track"], "-", song["artist"], "|", song["url_spotify"])

God of Revival - Live - Bethel Music | https://open.spotify.com/artist/26T4yOaOoFJvUvxR87Y9HO
Oceans (Where Feet May Fail) - Radio Version - Hillsong UNITED | https://open.spotify.com/artist/74cb3MG0x0BOnYNW1uXYnM
Não Olhe Pra Trás (feat. Lenine) - Ao Vivo - Capital Inicial | https://open.spotify.com/artist/4Z0yuwHVJBROVZqFpTIr0d
Agarra Tu Camino - El Fantasma | https://open.spotify.com/artist/0my6Pg4I28dVcZLSpAkqhv
No Woman, No Cry - Live At The Lyceum, London/1975 - Bob Marley & The Wailers | https://open.spotify.com/artist/2QsynagSdAqZj3U9HgDzjD
可惜沒如果 - JJ Lin | https://open.spotify.com/artist/7Dx7RhX0mFuXhCOUgB01uM
Cómo Duele - Ricardo Arjona | https://open.spotify.com/artist/0h1zs4CTlU9D2QtgPxptUD
feel like shit - Tate McRae | https://open.spotify.com/artist/45dkTj5sMRSjrmBSBeiHym
Where's My Love - SYML | https://open.spotify.com/artist/6AyATGg7mDgBlZ4N5uNog0
Magic Moments - Perry Como | https://open.spotify.com/artist/5v8jlSmAQfrkTjAlpUfWtu


In [23]:
# Example: get 10 calm songs

calm_playlist = get_playlist_by_mood("calm", n_songs=10)
for song in calm_playlist:
    print(song["track"], "-", song["artist"], "|", song["url_spotify"])

With you - Instrumental - Jimin | https://open.spotify.com/artist/1oSPZhvZMIrWW5I41kPkkY
The Ellie Badge - Michael Giacchino | https://open.spotify.com/artist/4kLvhMAuCloLxoP1aVM7Lr
Atom 8 - Sleeping At Last | https://open.spotify.com/artist/0MeLMJJcouYXCymQSHPn8g
Headphones - Lofi Fruits Music | https://open.spotify.com/artist/1dABGukgZ8XKKOdd2rVSHM
Elder Scrolls – Skyrim: Far Horizons - London Philharmonic Orchestra | https://open.spotify.com/artist/3PfJE6ebCbCHeuqO4BfNeA
Cançoneta for Violin and Orchestra - Academy of St. Martin in the Fields | https://open.spotify.com/artist/77CaCn32H4mOMQA7UElzfF
12 Songs, Op. 21: V. Lilacs (Transcr. Rachmaninoff for Solo Piano) - Sergei Rachmaninoff | https://open.spotify.com/artist/0Kekt6CKSo0m5mivKcoH51
Tchaikovsky: The Nutcracker, Op. 71: Miniature Overture - Pyotr Ilyich Tchaikovsky | https://open.spotify.com/artist/3MKCzCnpzw3TjUYs2v7vDA
Suite in D Minor, HWV 447: Allemande - George Frideric Handel | https://open.spotify.com/artist/1QL7yTHrd

In [24]:
# Saving the dataset
spotify_df.to_csv("spotify_mood_dataset.csv", index=False)