
# ***Notebook Overview:***
# **Phrase-Based Mood Playlist Generator**

---
This notebook implements the core phrase-to-playlist recommendation pipeline for Auraly, a music recommendation application based on mood. Its purpose is to allow users to receive personalized playlists by typing a mood keyword (e.g., “happy”) or a short descriptive phrase (e.g., “need calm focus music” or “upbeat gym vibes”).

It begins with raw phrases collected from friends and potential users, which are cleaned and preprocessed through duplicate checks, removal of punctuation and special symbols, tokenization, stop-word elimination, and lemmatization. The resulting dataset (`phrases.csv`) maps each cleaned phrase to a mood label (`Sad`, `Happy`, `Energetic`, `Calm`). A TF-IDF vectorizer fitted on these phrases converts a user's input into a numerical vector, which is compared against every phrase in the dataset using cosine similarity; the mood of the most similar phrase is taken as the most likely mood.
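The matching step described above can be sketched on a handful of phrases. This is a minimal illustration, assuming scikit-learn is installed; the five reference phrases are taken from the dataset preview shown later in this notebook, and `nearest_mood` is a hypothetical helper name used only here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy reference phrases (already cleaned and lemmatized) with their moods.
reference = [
    ("long day work", "Sad"),
    ("host party", "Energetic"),
    ("peaceful study", "Calm"),
    ("feel good road trip", "Happy"),
    ("lonely night", "Sad"),
]
texts = [text for text, _ in reference]
moods = [mood for _, mood in reference]

# Fit unigram + bigram TF-IDF on the reference phrases.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(texts)

def nearest_mood(phrase):
    """Return (mood, similarity) of the most similar reference phrase."""
    sims = cosine_similarity(vectorizer.transform([phrase]), matrix)[0]
    best = int(sims.argmax())
    return moods[best], float(sims[best])

mood, score = nearest_mood("lonely")
print(mood, round(score, 2))
```

Only "lonely night" shares a term with the query, so it is the sole phrase with a non-zero similarity and its mood is returned.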

The notebook also integrates a pre-processed song dataset (`spotify_mood_dataset.csv`) containing Spotify track features and mood labels that were predicted beforehand in `Tracks.ipynb`. Once a user's mood has been inferred from their input phrase, the system filters the tracks to that mood and ranks them by valence, energy, and danceability. Each generated playlist typically contains about 10 songs.

Ambiguous or unmatched phrases get special handling: if the system cannot confidently infer a mood from the input (e.g., a rare or unseen keyword), it prompts the user to choose from all four moods, so a playlist can still be delivered.
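To make "confidently" concrete, the fallback decision can be sketched as a small rule over the two best similarity scores. The default thresholds (0.05 and 0.1) are the ones used later in this notebook; the function name `needs_mood_picker` and the sample scores are illustrative assumptions.

```python
def needs_mood_picker(ranked, ambiguity_margin=0.05, min_similarity=0.1):
    """Decide whether to fall back to asking the user for a mood.

    `ranked` is a list of (mood, similarity) pairs, best match first.
    """
    if len(ranked) < 2:
        return True
    best, runner_up = ranked[0][1], ranked[1][1]
    too_close = (best - runner_up) <= ambiguity_margin  # no clear winner
    too_weak = best < min_similarity                    # nothing really matched
    return too_close or too_weak

print(needs_mood_picker([("Sad", 0.73), ("Calm", 0.20)]))  # clear winner
print(needs_mood_picker([("Sad", 0.41), ("Calm", 0.40)]))  # near tie
print(needs_mood_picker([("Sad", 0.00), ("Calm", 0.00)]))  # nothing matched
```

Note that an input with no known terms scores 0.0 against every phrase, so it fails both checks at once.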

This notebook demonstrates the complete flow from raw user input to curated playlist output, bridging natural-language mood inference with a pre-classified song dataset. It enables users, producers, and music platforms to discover and experience music personalized by emotional context, enhancing both user engagement and music discovery, especially for niche or regional artists.

---


---
## **1. Data Cleaning & Preprocessing**
---

In [1]:
# importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
import warnings
import contractions
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
import nltk
nltk.download('punkt')      
nltk.download('wordnet')    
nltk.download('omw-1.4') 
nltk.download('stopwords')
# Note: pos_tag (used below) also needs the averaged perceptron tagger resource
# ('averaged_perceptron_tagger', or 'averaged_perceptron_tagger_eng' on newer NLTK
# releases); download it the same way if it is not already installed.
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Loading the data
phrase_df = pd.read_csv('music_app.csv')
phrase_df.head()

Unnamed: 0,Short phrase,Mood,Mood_label
0,long day of work,Sad,0
1,hosting a party,Energetic,2
2,peaceful for studying,Calm,3
3,feel good road trip,Happy,1
4,lonely nights,Sad,0


In [3]:
# Information and shape of the dataset 
print('Shape:', phrase_df.shape)
print('Information:')
print(phrase_df.info())

Shape: (229, 3)
Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229 entries, 0 to 228
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Short phrase   229 non-null    object
 1   Mood           229 non-null    object
 2   Mood_label     229 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 5.5+ KB
None


In [4]:
#Checking for duplicates in the dataset
phrase_df.duplicated().value_counts()

False    229
Name: count, dtype: int64

In [5]:
phrase_df.columns

Index(['Short phrase ', 'Mood', 'Mood_label'], dtype='object')

In [6]:
# Normalise column names: lowercase + remove extra spaces
phrase_df.columns = phrase_df.columns.str.strip().str.lower().str.replace(' ', '_')
phrase_df.columns

Index(['short_phrase', 'mood', 'mood_label'], dtype='object')

In [7]:
# Cleaning the short phrases text

# Define a simple cleaning function
def clean_phrases(phrase):
    # Convert to lowercase
    phrase = str(phrase).lower()
    # Expand contractions, e.g. "can't" -> "cannot"
    phrase = contractions.fix(phrase)
    # Remove URLs, in case social-style phrases are added later
    phrase = re.sub(r'http\S+|www\.\S+', '', phrase)
    # Remove mentions and hashtags (words starting with @ or #)
    phrase = re.sub(r'[@#]\w+', '', phrase)
    # Remove non-ASCII characters
    phrase = re.sub(r'[^\x00-\x7F]+', '', phrase)
    # Remove special symbols, keeping only letters, digits, and whitespace
    phrase = re.sub(r'[^a-z0-9\s]', '', phrase)
    # Remove extra spaces
    return ' '.join(phrase.split())

# Apply the function to the short phrases column              
phrase_df['cleaned_phrase'] = phrase_df['short_phrase'].apply(clean_phrases)

# Remove empty phrases after cleaning
phrase_df = phrase_df[phrase_df['cleaned_phrase'].str.strip() != ''].reset_index(drop=True)

phrase_df.head(10)

Unnamed: 0,short_phrase,mood,mood_label,cleaned_phrase
0,long day of work,Sad,0,long day of work
1,hosting a party,Energetic,2,hosting a party
2,peaceful for studying,Calm,3,peaceful for studying
3,feel good road trip,Happy,1,feel good road trip
4,lonely nights,Sad,0,lonely nights
5,in my feelings,Calm,3,in my feelings
6,workout music,Energetic,2,workout music
7,Ethereal,Calm,3,ethereal
8,Birthday,Happy,1,birthday
9,Chores,Energetic,2,chores
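The phrases in this dataset are already fairly clean, so the table above does not exercise most of the cleaning rules. The sketch below applies the regex steps to a few made-up inputs. It is an illustration only: contraction expansion is left out so the block needs nothing beyond the standard library, and `clean_text_demo` and the sample strings are hypothetical.

```python
import re

def clean_text_demo(phrase):
    """Regex-only version of the cleaning steps (contraction expansion omitted)."""
    phrase = str(phrase).lower()
    phrase = re.sub(r'http\S+|www\.\S+', '', phrase)   # URLs
    phrase = re.sub(r'[@#]\w+', '', phrase)            # mentions and hashtags
    phrase = re.sub(r'[^\x00-\x7F]+', '', phrase)      # non-ASCII characters
    phrase = re.sub(r'[^a-z0-9\s]', '', phrase)        # remaining symbols
    return ' '.join(phrase.split())                    # collapse whitespace

samples = [
    "Late-night DRIVE!!  #vibes",
    "Chill study music ☕ www.example.com",
    "   Gym   time @coach  ",
]
for s in samples:
    print(repr(clean_text_demo(s)))
```

One side effect worth knowing: because hyphens are deleted rather than replaced with a space, "late-night" becomes the single token "latenight".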


In [8]:
# Dropping the short phrase column (pre cleaning)
phrase_df = phrase_df.drop(columns=['short_phrase'])
phrase_df.head(2)

Unnamed: 0,mood,mood_label,cleaned_phrase
0,Sad,0,long day of work
1,Energetic,2,hosting a party


In [9]:
# Defining the preprocessing function 
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Map a Penn Treebank POS tag (as returned by pos_tag) to the matching WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Function for tokenising and lemmatising
def preprocess_text(phrase):
    # Tokenize the phrase
    tokens = word_tokenize(phrase)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # POS tagging
    tagged_tokens = pos_tag(tokens)
    # Lemmatize each token using its POS tag
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_tokens]
    # Return the preprocessed phrase
    return " ".join(lemmatized)
    
# Applying the function to the data
phrase_df['phrases'] = phrase_df['cleaned_phrase'].apply(preprocess_text)
phrase_df.head()

Unnamed: 0,mood,mood_label,cleaned_phrase,phrases
0,Sad,0,long day of work,long day work
1,Energetic,2,hosting a party,host party
2,Calm,3,peaceful for studying,peaceful study
3,Happy,1,feel good road trip,feel good road trip
4,Sad,0,lonely nights,lonely night


In [10]:
# Dropping the cleaned phrase column (post cleaning)
phrase_df = phrase_df.drop(columns=['cleaned_phrase'])
phrase_df.head(2)

Unnamed: 0,mood,mood_label,phrases
0,Sad,0,long day work
1,Energetic,2,host party


In [11]:
phrase_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229 entries, 0 to 228
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   mood        229 non-null    object
 1   mood_label  229 non-null    int64 
 2   phrases     229 non-null    object
dtypes: int64(1), object(2)
memory usage: 5.5+ KB


In [12]:
# Saving the dataset
phrase_df.to_csv("phrases.csv", index=False)

---
## **2. Mapping The Phrases**
---

In [13]:
import pickle, json

phrases_df = pd.read_csv("phrases.csv")
songs_df   = pd.read_csv("spotify_mood_dataset.csv")
# Load the label map with a context manager so the file handle is closed
with open("label_map.json", "r") as f:
    label_map = json.load(f)

In [14]:
phrases_df.head(2)

Unnamed: 0,mood,mood_label,phrases
0,Sad,0,long day work
1,Energetic,2,host party


In [15]:
songs_df.head(2)

Unnamed: 0,artist,url_spotify,track,album,album_type,uri,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,predicted_mood,mood_label
0,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Feel Good Inc.,Demon Days,album,spotify:track:0d28khcov6AiegSCpG5TuT,0.818,0.705,-6.679,0.177,0.00836,0.00233,0.613,0.772,138.559,222640.0,2,Energetic
1,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Rhinestone Eyes,Plastic Beach,album,spotify:track:1foMv2HQwfQ2vntFf9HFeG,0.676,0.703,-5.815,0.0302,0.0869,0.000687,0.0463,0.852,92.761,200173.0,1,Happy


In [16]:
with open("label_map.json", "r") as f:
    label_map = json.load(f)

label_map

{'0': 'Sad', '1': 'Happy', '2': 'Energetic', '3': 'Calm'}

In [17]:
phrases_df = pd.read_csv("phrases.csv")

# Rename the columns' roles to match songs_df, where "mood_label" holds the mood name:
# "mood" keeps the numeric code, "mood_label" becomes the string label
phrases_df["mood"] = phrases_df["mood_label"]
phrases_df["mood_label"] = phrases_df["mood_label"].astype(str).map(label_map)

In [18]:
phrases_df.head()

Unnamed: 0,mood,mood_label,phrases
0,0,Sad,long day work
1,2,Energetic,host party
2,3,Calm,peaceful study
3,1,Happy,feel good road trip
4,0,Sad,lonely night


In [19]:
# Fit TF-IDF on the cleaned phrases text
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_phr = tfidf.fit_transform(phrases_df['phrases'].astype(str))

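def predict_mood_from_phrase(phrase):
    # Transform the input phrase using the TF-IDF vectorizer.
    # The phrase is vectorized as given: it is not lemmatized or stripped of
    # stop words here, unlike the reference phrases the vectorizer was fitted on.
    x = tfidf.transform([str(phrase)])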
    # Compute cosine similarity between input and all phrases in dataset
    sims = cosine_similarity(x, X_phr)[0]
    # Find the index of the most similar phrase
    idx = sims.argmax()
    # Return the corresponding mood label and similarity score
    return phrases_df.iloc[idx]["mood_label"], sims[idx]

In [20]:
# Return the two most similar reference phrases with their moods and scores
# (both entries may carry the same mood)
def top2_moods_from_phrase(phrase):
    x = tfidf.transform([str(phrase)])
    sims = cosine_similarity(x, X_phr)[0]
    top_idx = sims.argsort()[::-1][:2]
    return [(phrases_df.iloc[i]["mood_label"], float(sims[i])) for i in top_idx]
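Because the function above ranks phrases rather than moods, two reference phrases with the same mood can occupy both slots, and a tie between them looks like a zero margin. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to keep the best similarity per mood before comparing the top two. The helper name `best_score_per_mood` and the sample scores are assumptions, and adopting this would change the results shown further down.

```python
def best_score_per_mood(moods, sims):
    """Collapse phrase-level similarities to one score per mood (the max).

    `moods` and `sims` are parallel sequences: the mood of each reference
    phrase and its similarity to the user's input.
    """
    best = {}
    for mood, sim in zip(moods, sims):
        sim = float(sim)
        if sim > best.get(mood, float("-inf")):
            best[mood] = sim
    # Highest-scoring mood first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Two "Sad" phrases tie for first place; at phrase level that is a zero margin.
moods = ["Sad", "Sad", "Calm", "Happy"]
sims = [0.62, 0.62, 0.10, 0.00]
ranked = best_score_per_mood(moods, sims)
print(ranked)
```

With per-mood scores, the margin is measured between the best "Sad" match and the best match from a different mood, which is closer to what an ambiguity check is meant to capture.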

In [21]:
# Curate playlist using songs_df['mood_label']
def playlist_for_mood(mood_label, top_k=30):
    sub = songs_df[songs_df["mood_label"].str.lower() == mood_label.lower()]
    if sub.empty:
        return sub
    sort_cols = [c for c in ["valence","energy","danceability"] if c in sub.columns]
    if sort_cols:
        sub = sub.sort_values(sort_cols, ascending=[False]*len(sort_cols))
    return sub.head(min(top_k, len(sub)))
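Sorting on several columns is lexicographic: valence decides the order, energy only breaks valence ties, and danceability only breaks ties in both. The toy example below illustrates this, assuming pandas is installed; the tracks and their values are invented, and only the column names mirror the song dataset.

```python
import pandas as pd

# Hypothetical tracks; only the column names mirror the song dataset.
toy_songs = pd.DataFrame({
    "track": ["A", "B", "C", "D"],
    "mood_label": ["Happy", "Happy", "Sad", "Happy"],
    "valence": [0.90, 0.90, 0.20, 0.50],
    "energy": [0.50, 0.70, 0.30, 0.90],
    "danceability": [0.60, 0.40, 0.20, 0.90],
})

def rank_tracks(df, mood_label, top_k=2):
    """Filter to one mood, then sort by valence, energy, danceability (descending)."""
    subset = df[df["mood_label"].str.lower() == mood_label.lower()]
    ranked = subset.sort_values(
        ["valence", "energy", "danceability"], ascending=[False, False, False]
    )
    return ranked.head(top_k)

top = rank_tracks(toy_songs, "happy")
print(top["track"].tolist())
```

Track D has the highest energy and danceability among the "Happy" tracks but the lowest valence, so it is ranked last and drops out of the top two.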

In [22]:
ALL_MOODS = ["Sad", "Happy", "Energetic", "Calm"]

def playlist_from_phrase(phrase, top_k=30, ambiguity_margin=0.05, min_similarity=0.1):
    top2 = top2_moods_from_phrase(phrase)
    top_score = top2[0][1] if top2 else 0

    # Trigger ambiguity if scores too close OR similarity too low
    if (len(top2) < 2) or (top2[0][1] - top2[1][1] <= ambiguity_margin) or (top_score < min_similarity):
        return {
            "ambiguous": True,
            "candidates": ALL_MOODS,  # show all options to user
            "playlist": None
        }

    # Otherwise, take top mood and return playlist
    mood, score = top2[0]
    pl = playlist_for_mood(mood, top_k=top_k)
    return {"ambiguous": False, "mood": mood, "score": score, "playlist": pl}

In [23]:
# Test phrases
tests = ["long day", "host", "peace", "road", "hurt"]

for phrase in tests:
    result = playlist_from_phrase(phrase, top_k=10)

    if result["ambiguous"] or result["playlist"] is None:
        print(f"❌ '{phrase}': Sorry, we couldn’t match your mood. Try typing another phrase or pick a mood!: {ALL_MOODS}")
    else:
        mood = result["mood"]
        score = result["score"]
        playlist = result["playlist"]
        print(f"✅ '{phrase}' → {mood} ({score:.2f}), {len(playlist)} tracks")
        display(playlist[["artist","track","album","uri"]].head(5))

✅ 'long day' → Sad (0.73), 10 tracks


Unnamed: 0,artist,track,album,uri
3436,Chalino Sanchez,Alma Enamorada,Chalino Sánchez con Banda Brava,spotify:track:6ab5dRx0VtGzMjUejMFI9u
3614,Chuck Berry,My Mustang Ford,Fresh Berry's,spotify:track:0IHYNEHVT1zMEeNOgvyg9B
7065,Jorge Ben Jor,Menina Mulher Da Pele Preta,A Tabua De Esmeralda,spotify:track:5HpURubJUz2gysQiAkle9I
7370,Gene Autry,Here Comes Santa Claus (Right Down Santa Claus...,Rudolph The Red Nosed Reindeer And Other Chris...,spotify:track:25leEEaz1gIpp7o21Fqyjo
4988,Grupo Laberinto,El Perron Merino,Otra Carga de Corridos,spotify:track:3iVIHhKNpltJRseXibdbWx


✅ 'host' → Energetic (0.63), 10 tracks


Unnamed: 0,artist,track,album,uri
15180,Shilpi Raj,Dilwa Le Gaile Raja,Dilwa Le Gaile Raja,spotify:track:7z2HnyAP1dcSK6vEdcsJr6
17018,Beach Weather,Chit Chat,Chit Chat,spotify:track:2JIfkiU96Z25ERPoYhWirP
5718,John Mellencamp,Authority Song,Uh-HUH!,spotify:track:38Lf9Im0jQhAaKx8ehQM1S
5252,Ginuwine,Pony,R&B: From Doo-Wop To Hip-Hop,spotify:track:6mz1fBdKATx6qP4oP1I65G
7633,Vengaboys,To Brazil!,The Party Album!,spotify:track:6Z1Q31m2SQt7uLfyqce6ot


✅ 'peace' → Calm (0.58), 10 tracks


Unnamed: 0,artist,track,album,uri
4621,Antônio Carlos Jobim,O Morro Não Tem Vez,"The Composer Of Desafinado, Plays",spotify:track:6fwICn8FWouigyB8BxfljW
4618,Antônio Carlos Jobim,Wave,Wave,spotify:track:2hXBS8q9rGMovfG1S8FB4F
18510,Toby Fox,Fallen Down (Reprise),UNDERTALE Soundtrack,spotify:track:23b9BdZ2WZnDSeDzNUTVvZ
2172,Antonio Vivaldi,"Cello Concerto in E Minor, RV 409: II. Allegro",Vivaldi: Concertos for 2 Cellos,spotify:track:4sCgQZkwhYlB1a8ocpiWRr
4300,Academy of St. Martin in the Fields,Solomon HWV 67 / Act 3: The Arrival Of The Que...,The World of Handel,spotify:track:4iJWNp04bqqjLZ7OjZHtgF


✅ 'road' → Happy (0.40), 10 tracks


Unnamed: 0,artist,track,album,uri
18512,Toby Fox,Spider Dance,UNDERTALE Soundtrack,spotify:track:3aiGshuqYhdBBBhHqRf6jn
1934,Los Tucanes De Tijuana,El Tucanazo,Tucanes De Oro ... Secuestro De Amor,spotify:track:07Ag8vm1pW409NrhpPokFg
6912,Mi Banda El Mexicano,Feliz Feliz,Mi Mexico Querido,spotify:track:0JZV1UuBsbSwHhGirgWaXI
5449,Los Dareyes De La Sierra,La Recia,Con Banda,spotify:track:5PGbQgXt8bgXceoK3yZvYo
20535,BoyWithUke,Long Drives,Serotonin Dreams,spotify:track:1APefTkgjjwQfZbUWQkOFc


❌ 'hurt': Sorry, we couldn’t match your mood. Try typing another phrase or pick a mood!: ['Sad', 'Happy', 'Energetic', 'Calm']


In [24]:
# Save vectorizer and phrase matrix
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)
with open("tfidf_phrase_matrix.pkl", "wb") as f:
    pickle.dump(X_phr, f)

# Save lookup of cleaned phrases + moods
phrases_df[["phrases", "mood_label"]].to_csv("tfidf_phrases_lookup.csv", index=False)
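For completeness, here is a sketch of how saved artifacts like these could be loaded back and used for matching, for example inside the application. It assumes scikit-learn is installed and uses a tiny stand-in vectorizer and temporary files instead of the real `.pkl` files, so the file names and phrases below are illustrative.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for the fitted vectorizer and phrase matrix.
texts = ["long day work", "host party", "peaceful study"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(texts)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "vectorizer.pkl")
    mat_path = os.path.join(tmp, "matrix.pkl")

    # Save both objects.
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(mat_path, "wb") as f:
        pickle.dump(matrix, f)

    # Load them back, as a separate process would.
    with open(vec_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)
    with open(mat_path, "rb") as f:
        loaded_matrix = pickle.load(f)

# The reloaded objects can score new input without refitting.
sims = cosine_similarity(loaded_vectorizer.transform(["party"]), loaded_matrix)[0]
best_index = int(sims.argmax())
print(texts[best_index])
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, so it may be worth recording the version alongside the files.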