# **Spotify Recommender System**

Welcome to my project on building a Spotify Recommender System! In this project, I develop a recommendation system using a Spotify Tracks dataset sourced from Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset). The dataset contains Spotify tracks spanning 125 different genres. Each track comes with artist information and a set of audio features such as danceability, energy, and valence.

**Project Motivation**

The primary motivation behind this project is to put into practice the machine learning knowledge I've acquired, both in academic settings and through various online courses. By working on a real-world application like this, I aim to apply data science techniques to a practical problem: helping music enthusiasts discover new songs that align with their tastes.

**The Dataset**

The core of this project lies in the Spotify tracks dataset, which is a goldmine of information for music lovers and data enthusiasts alike. It encompasses a vast and diverse collection of tracks, offering a window into the world of music through data. This dataset is not only a treasure trove for building a recommendation system but also a playground for exploring the intricacies of music from a data-driven perspective.

**Building the Recommender System**

Building a Spotify Recommender System is a multi-faceted challenge. It involves harnessing the power of machine learning algorithms to analyze patterns and relationships within the dataset. By doing so, I can create a personalized music recommendation engine that understands individual preferences and suggests songs that are likely to resonate with each user's unique taste.


So, let's dive into the Recommender System!

In [18]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [19]:
import os
import pandas as pd
import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# **Streamlining Data Preparation: From Raw to Ready**

Below, I import the data and merge in features from a second file. The original dataset, 'data.csv', lacks genre information, so I add the `genres` column from 'data_w_genres.csv' and drop the tracks whose genre list is empty.

In [20]:
os.chdir('/content/gdrive/My Drive/Spotify_Training/data')

dataset = pd.read_csv("data.csv")
dataset_genre = pd.read_csv("data_w_genres.csv")
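dataset_genre = dataset_genre['genres']

# join() aligns on the row index: row i of data.csv is paired with row i of data_w_genres.csv
dataset = dataset.join(dataset_genre)
# Treat empty genre lists as missing and drop those tracks
# (plain assignment, because chained inplace=True is deprecated in recent pandas)
dataset['genres'] = dataset['genres'].replace('[]', pd.NA)
dataset = dataset.dropna(subset=['genres'])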

display(dataset)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,genres
0,0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878000,10,0.6650,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954,['show tunes']
8,0.7210,1921,0.996,['Ignacio Corsini'],0.485,161520,0.130,0,05xDjWH9ub67nJJk82yfGf,0.151000,5,0.1040,-21.508,0,La Mañanita - Remasterizado,0,1921-03-20,0.0483,64.678,"['comedy rock', 'comic', 'parody']"
9,0.7710,1921,0.982,['Fortugé'],0.684,196560,0.257,0,08zfJvRLp7pjAb94MA9JmF,0.000000,8,0.5040,-16.415,1,Il Etait Syndiqué,0,1921,0.3990,109.378,"['emo rap', 'florida rap', 'sad rap', 'undergr..."
10,0.8260,1921,0.995,['Maurice Chevalier'],0.463,147133,0.260,0,0BMkRpQtDoKjcgzCpnqLNa,0.000000,9,0.2580,-16.894,1,Dans La Vie Faut Pas S'en Faire,0,1921,0.0557,85.146,"['dark trap', 'meme rap']"
12,0.4930,1921,0.990,['Georgel'],0.315,190800,0.363,0,0H3k2CvJvHULnWChlbeFgx,0.000000,5,0.2920,-12.562,0,La Vipère,0,1921,0.0546,174.532,"['asian american hip hop', 'cali rap', 'west c..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28667,0.6700,1970,0.659,['The Beatles'],0.540,50467,0.489,0,0eRyOunOVBChlXxIvqwOxH,0.001900,5,0.4570,-12.276,1,Dig It - Remastered 2009,45,1970-05-08,0.1170,157.030,['classic cantopop']
28673,0.4940,1970,0.459,['Santana'],0.424,297507,0.658,0,1vcmEqKQAHlnV2fcNdJjEt,0.897000,0,0.1030,-11.751,1,Incident at Neshabur,44,1970-09-23,0.0359,136.018,"['c-pop', 'classic mandopop', 'vintage chinese..."
28676,0.5390,1970,0.630,['Joe Cocker'],0.420,315200,0.713,0,0a3kDhZaZDNJJO25zNryki,0.000048,0,0.7800,-7.632,1,Space Captain - Live At The Fillmore East/1970,42,1970-08-01,0.0912,148.284,"['c-pop', 'classic cantopop', 'classic mandopo..."
28678,0.5140,1970,0.566,['Van Morrison'],0.629,310680,0.334,0,2D5bUcNHDDMvXcd0jKWhtk,0.000049,7,0.0978,-13.020,1,Brand New Day - 2013 Remaster,41,1970-02,0.0288,134.132,"['chinese indie', 'chinese indie rock']"
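
`DataFrame.join` without an `on` argument aligns rows by index, so the cell above pairs track *i* with row *i* of `data_w_genres.csv`. If that file is keyed by artist rather than by track (an assumption here, as are the column layout and the toy values below), a lookup by artist name is one alternative. A minimal sketch, not the notebook's method:

```python
import ast
import pandas as pd

# Toy stand-ins for data.csv and data_w_genres.csv (hypothetical values).
tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "artists": ["['Artist X']", "['Artist Y', 'Artist X']", "['Artist Z']"],
})
artist_genres = pd.DataFrame({
    "artists": ["Artist X", "Artist Y"],
    "genres": ["['jazz']", "['swing', 'jazz']"],
})

# Parse the stringified lists into real Python lists.
tracks["artist_list"] = tracks["artists"].apply(ast.literal_eval)
artist_genres["genre_list"] = artist_genres["genres"].apply(ast.literal_eval)
genre_lookup = dict(zip(artist_genres["artists"], artist_genres["genre_list"]))

def genres_for(artist_list):
    """Union of the genres of every artist on the track, sorted for stability."""
    found = set()
    for artist in artist_list:
        found.update(genre_lookup.get(artist, []))
    return sorted(found)

tracks["genres"] = tracks["artist_list"].apply(genres_for)

# Drop tracks for which no artist had a genre entry.
tracks_with_genres = tracks[tracks["genres"].apply(len) > 0]
print(tracks_with_genres[["name", "genres"]])
```

With this approach a track's genres depend on who performed it, not on where the row happens to sit in the file.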


# Data Exploration & Preprocessing

In the following steps, I inspect the dataset for anomalies and potential issues such as missing values. Where issues arise, I correct them so that the data is reliable enough for the recommendation system.

In [21]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18823 entries, 0 to 28679
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   valence           18823 non-null  float64
 1   year              18823 non-null  int64  
 2   acousticness      18823 non-null  float64
 3   artists           18823 non-null  object 
 4   danceability      18823 non-null  float64
 5   duration_ms       18823 non-null  int64  
 6   energy            18823 non-null  float64
 7   explicit          18823 non-null  int64  
 8   id                18823 non-null  object 
 9   instrumentalness  18823 non-null  float64
 10  key               18823 non-null  int64  
 11  liveness          18823 non-null  float64
 12  loudness          18823 non-null  float64
 13  mode              18823 non-null  int64  
 14  name              18823 non-null  object 
 15  popularity        18823 non-null  int64  
 16  release_date      18823 non-null  object

In [22]:
dataset.isnull().sum()

valence             0
year                0
acousticness        0
artists             0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
genres              0
dtype: int64

In [23]:
df = dataset.drop(columns=['id', 'name', 'artists', 'release_date', 'year', 'genres'])
df = df.fillna(0)  # fillna returns a new DataFrame, so the result must be assigned
df.corr(numeric_only=True)

Unnamed: 0,valence,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo
valence,1.0,-0.161474,0.566828,-0.20249,0.377172,-0.033478,-0.16732,0.010938,0.029181,0.328851,0.033362,0.028188,0.051279,0.220691
acousticness,-0.161474,1.0,-0.251041,-0.118045,-0.745821,-0.304505,0.373585,-0.018433,0.009118,-0.537893,0.041337,-0.693114,-0.092383,-0.161548
danceability,0.566828,-0.251041,1.0,-0.140025,0.230907,0.238098,-0.236635,0.019422,-0.078823,0.268786,-0.033149,0.166042,0.244159,0.035164
duration_ms,-0.20249,-0.118045,-0.140025,1.0,0.071478,-0.014858,0.044524,0.008702,0.056397,0.027233,-0.043355,0.109522,-0.04535,-0.036989
energy,0.377172,-0.745821,0.230907,0.071478,1.0,0.148143,-0.299695,0.00412,0.073899,0.779202,-0.014093,0.584364,-0.076509,0.230007
explicit,-0.033478,-0.304505,0.238098,-0.014858,0.148143,1.0,-0.1422,0.020046,0.045905,0.134823,-0.073084,0.190835,0.395817,0.019549
instrumentalness,-0.16732,0.373585,-0.236635,0.044524,-0.299695,-0.1422,1.0,-0.00401,-0.036841,-0.414298,-0.03584,-0.372542,-0.109875,-0.091826
key,0.010938,-0.018433,0.019422,0.008702,0.00412,0.020046,-0.00401,1.0,0.000263,-0.001275,-0.09608,0.00542,0.020302,-0.004606
liveness,0.029181,0.009118,-0.078823,0.056397,0.073899,0.045905,-0.036841,0.000263,1.0,0.032475,0.001131,-0.099394,0.122605,-0.009288
loudness,0.328851,-0.537893,0.268786,0.027233,0.779202,0.134823,-0.414298,-0.001275,0.032475,1.0,0.015096,0.495408,-0.175668,0.199782


In [24]:
dataset.columns

Index(['valence', 'year', 'acousticness', 'artists', 'danceability',
       'duration_ms', 'energy', 'explicit', 'id', 'instrumentalness', 'key',
       'liveness', 'loudness', 'mode', 'name', 'popularity', 'release_date',
       'speechiness', 'tempo', 'genres'],
      dtype='object')

# **Spotify Song Recommendation using K-Means Clustering**


The code below builds a song recommender around K-Means clustering. It first normalizes the numeric columns with Min-Max scaling, then applies K-Means to group the songs into 10 clusters and stores each song's cluster label in a new `features` column. The `Spotify_Recommendation` class takes a song name and, for every other song, sums the absolute differences between the two songs' numeric columns as stored in `dataset`, including the cluster label. Songs are sorted by this distance, and the closest matches (10 in the example) are returned with their names, artists, and release years.



In [25]:
datatypes = ('int16', 'int32', 'int64', 'float16', 'float32', 'float64')
normalization = df.select_dtypes(include=datatypes)

# Scale every numeric column to [0, 1] so no single feature dominates the clustering
scaler = MinMaxScaler()
normalization_scaled = scaler.fit_transform(normalization)
normalization = pd.DataFrame(normalization_scaled, columns=normalization.columns)

# Group the songs into 10 clusters and store each song's cluster label
kmeans = KMeans(n_clusters=10, n_init=10)
features = kmeans.fit_predict(normalization)
dataset['features'] = features

class Spotify_Recommendation:
    def __init__(self, dataset):
        self.dataset = dataset

    def recommend(self, song_name, amount=1):
        distances = []

        song = self.dataset[(self.dataset.name.str.lower() == song_name.lower())].head(1).values[0]

        rec = self.dataset[self.dataset.name.str.lower() != song_name.lower()]

        for songs in rec.values:
            dis = 0
            for col in range(len(rec.columns)):
                if col not in [8, 14, 3, 16, 1, 19]:  # Skip the non-feature columns: 'id', 'name', 'artists', 'release_date', 'year', 'genres'
                    dis += np.absolute(float(song[col]) - float(songs[col]))
            distances.append(dis)

        rec = rec.copy()
        rec.loc[:, 'distance'] = distances
        rec = rec.sort_values('distance')

        columns = ['name', 'artists', 'year']

        recommended_songs = []
        for _, row in rec[columns][:amount].iterrows():
            song_info = {
                'name': row['name'],
                'artists': row['artists'],
                'year': row['year']
            }
            recommended_songs.append(song_info)

        return recommended_songs

recommendations = Spotify_Recommendation(dataset)
recommended_songs = recommendations.recommend("La Vipère", 10) # As an example, I am using "La Vipère"
for idx, song_info in enumerate(recommended_songs, start=1):
    song_name = song_info['name']
    artists_string = song_info['artists']
    artists_list = [name.strip() for name in artists_string.split(',')]
    cleaned_names = [name.strip(" '[]") for name in artists_list]
    formatted_artists = ' & '.join(cleaned_names)
    print(f"{idx}. {song_name} by {formatted_artists}")

1. Gloomy Sunday (with Teddy Wilson & His Orchestra) - Take 1 by Billie Holiday & Teddy Wilson
2. Juana La Cubana by Fito Olivares y Su Grupo
3. Quejas de bandoneón by Adolfo Berón
4. Mera Naam Raju - Instrumental by Master Ebrahim
5. Caballo Golondrino by Vicente Fernández
6. Feedin' the "Bean" by Count Basie
7. No me Escribas - Instrumental (Remasterizado) by Francisco Canaro
8. Song of the South by Alabama
9. When I Fall In Love by Nat King Cole
10. Dua Kar Gham-E-Dil-Anarkali - Instrumental by Master Ebrahim
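
The loop above measures distance on the columns of `dataset` as stored, so wide-range columns such as `duration_ms` contribute far more than the 0-to-1 audio features. As a point of comparison, here is a vectorized sketch that computes the same sum of absolute differences on Min-Max-scaled features instead. The toy frame and the helper name `nearest_songs` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values) with columns on very different scales.
songs = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "danceability": [0.30, 0.32, 0.90, 0.31],
    "energy": [0.40, 0.41, 0.95, 0.80],
    "duration_ms": [200_000, 320_000, 201_000, 205_000],
})
feature_cols = ["danceability", "energy", "duration_ms"]

# Scale every feature to [0, 1] so each contributes comparably.
scaled = MinMaxScaler().fit_transform(songs[feature_cols])

def nearest_songs(query_name, amount=1):
    """Return the `amount` closest songs by L1 distance on scaled features."""
    mask = (songs["name"].str.lower() == query_name.lower()).to_numpy()
    query_pos = np.flatnonzero(mask)[0]
    # Broadcasting subtracts the query row from every row at once.
    distances = np.abs(scaled - scaled[query_pos]).sum(axis=1)
    order = np.argsort(distances, kind="stable")
    order = order[order != query_pos][:amount]  # drop the query itself
    return songs["name"].iloc[order].tolist()

print(nearest_songs("A", 2))
```

On the raw values, song C would be closest to A because their durations differ by only one second; after scaling, its very different danceability and energy push it to last place.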


# **Spotify Content-Based Song Recommendation**

The code below is a content-based song recommender for the same dataset. It selects a set of features (artists, year, genres, and several audio characteristics), converts them to strings, and joins them into a single text representation per song. A TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer then turns each text into a weighted token vector. Similarity between songs is the cosine similarity of their TF-IDF vectors, computed with `linear_kernel`. The `Spotify_ContentBasedRecommendation` class takes a song name, ranks all songs by similarity to it, skips the top-ranked entry (the song itself), and returns the next most similar songs with their names, artists, and release years.

In [26]:
features_to_include = ['artists', 'year', 'genres', 'valence', 'danceability', 'energy', 'popularity', 'instrumentalness', 'acousticness', 'loudness', 'tempo', 'duration_ms']
combined_features = dataset[features_to_include].astype(str).apply(lambda x: ' '.join(x), axis=1)

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
# combined_features = dataset['artists'] + ' ' + (dataset['year'].astype(str) + ' ') * 3
tfidf_matrix = tfidf_vectorizer.fit_transform(combined_features)

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

class Spotify_ContentBasedRecommendation:
    def __init__(self, dataset, cosine_sim):
        self.dataset = dataset
        self.cosine_sim = cosine_sim

    def recommend(self, song_name, amount=1):
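        # cosine_sim rows are positional, so look up the song's row position
        # rather than its index label (labels have gaps after dropna)
        mask = (self.dataset['name'].str.lower() == song_name.lower()).to_numpy()
        idx = np.flatnonzero(mask)[0]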
        sim_scores = list(enumerate(self.cosine_sim[idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        sim_scores = sim_scores[1:amount + 1]
        song_indices = [i[0] for i in sim_scores]
        recommended_songs = []
        for index in song_indices:
            song_info = self.dataset.iloc[index]
            song_dict = {
                'name': song_info['name'],
                'artists': song_info['artists'],
                'year': song_info['year']
            }
            recommended_songs.append(song_dict)

        return recommended_songs

content_based_recommendations = Spotify_ContentBasedRecommendation(dataset, cosine_sim)

recommended_songs = content_based_recommendations.recommend("La Vipère", 10)

for idx, song_info in enumerate(recommended_songs, start=1):
    song_name = song_info['name']
    artists_string = song_info['artists']
    artists_list = [name.strip() for name in artists_string.split(',')]
    cleaned_names = [name.strip(" '[]") for name in artists_list]
    formatted_artists = ' & '.join(cleaned_names)
    print(f"{idx}. {song_name} by {formatted_artists}")

1. Nothin' on You (feat. Bruno Mars) by B.o.B & Bruno Mars
2. Big Girls Don't Cry by Frankie Valli & The Four Seasons
3. Maybe Not At All by Ethel Waters
4. Here's to Never Growing Up by Avril Lavigne
5. She's Funny That Way by Erroll Garner
6. Gone, Gone, Gone - Remastered Version by The Everly Brothers
7. Soja Soja Chandni by M.G.Shreekumar
8. Válgame la Magdalena (Zambra) by Lola Flores
9. Cousin Mary by John Coltrane
10. I Never Told You by Colbie Caillat


# **Testing the Recommender System**

This section tests the recommender interactively. It randomly draws 10 songs, skipping any whose genre entry has already been drawn, and presents them to the user, who is then prompted to pick 3. This mirrors the onboarding strategy used by platforms like Netflix and Spotify, which offer new users a small selection and ask for their preferences. The picked songs then seed the recommendation system so that it can provide tailored suggestions.
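
One compact way to draw ten songs with pairwise-distinct genres is to shuffle the rows, keep the first row per genre, and take ten. The sketch below does this on a toy frame; the column values and the fixed `random_state` are assumptions made for reproducibility, not part of the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values): 30 songs spread over 15 genre labels.
toy = pd.DataFrame({
    "name": [f"Song {i}" for i in range(30)],
    "genres": [f"['genre {i % 15}']" for i in range(30)],
})

initial = (
    toy.sample(frac=1, random_state=0)    # shuffle all rows
       .drop_duplicates(subset="genres")  # keep the first song seen per genre
       .head(10)                          # ten starter songs
       .reset_index(drop=True)
)

for i, row in initial.iterrows():
    print(i + 1, row["name"], row["genres"])
```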

In [27]:
initial_songs = []
seen_genres = set()

# Keep drawing until there are 10 songs with pairwise-distinct genre entries
while len(initial_songs) < 10:
  # randrange excludes len(dataset), so iloc always stays in bounds
  random_num = random.randrange(len(dataset))
  candidate = dataset.iloc[random_num]
  if candidate['genres'] not in seen_genres:
    seen_genres.add(candidate['genres'])
    initial_songs.append(candidate)

initial_songs = pd.Series(initial_songs)

for i in range(len(initial_songs)):
  print(i+1,"\t","Song Name: ",initial_songs.iloc[i]['name'],"\n","\t","Artists: ",initial_songs.iloc[i]['artists'],"\n","\t","Release Date: ",initial_songs.iloc[i]['year'],"\n")

print("Pick three songs from the following list: ")

count = 3
picked_songs = []
while count!= 0:
  song_choice = int(input())
  picked_songs.append(str(initial_songs.iloc[song_choice-1]['name']))
  count -= 1

print(picked_songs)


1 	 Song Name:  Eine kurze Weltgeschichte für junge Leser: Von den Anfängen bis zum Mittelalter, Kapitel 13 
 	 Artists:  ['Ernst H. Gombrich', 'Christoph Waltz'] 
 	 Release Date:  1936 

2 	 Song Name:  I Can't Begin To Tell You 
 	 Artists:  ['Harry James', 'Betty Grable'] 
 	 Release Date:  1939 

3 	 Song Name:  Washboard Blues 
 	 Artists:  ['Hoagy Carmichael'] 
 	 Release Date:  1950 

4 	 Song Name:  La Bohème: Quando me'n vo soletta (Musetta/Marcellop/Alcindoro/Mimì/Rodolfo/Schaunard/Colline) - 1997 Remastered Version 
 	 Artists:  ['Giacomo Puccini', 'Maria Callas', 'Giuseppe Di Stefano', 'Rolando Panerai', 'Manuel Spatafora', 'Nicola Zaccaria', 'Coro e Orchestra del Teatro alla Scala, Milano', 'Carlo Forti', 'Carlo Badioli', 'Franco Ricciardi', 'Antonino Votto', 'Eraldo Coda', 'Anna Moffo', 'Orchestra Del Teatro Alla Scala, Milano'] 
 	 Release Date:  1958 

5 	 Song Name:  The Curse of Curves 
 	 Artists:  ['Cute Is What We Aim For'] 
 	 Release Date:  2006 

6 	 Song Name:

# **Results**

With the user's 3 picked songs in hand, the following code combines the two recommenders: K-Means clustering and content-based filtering. Each picked song is passed through both engines, which return 10 recommendations each, for up to 20 songs per pick. Both recommenders carry an equal weight of 0.5, so a song suggested by both scores higher than a song suggested by only one. After all three picks have been processed, the combined list is ranked and the top 10 songs are presented to the user.
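
The weighting scheme described above can be illustrated on toy data: each recommender contributes its weight for every song it returns, the contributions are summed per song, and the highest totals win. The function name `blend_recommendations` and the song names are hypothetical.

```python
from collections import defaultdict

def blend_recommendations(rec_lists, weights, top_n=10):
    """Sum per-recommender weights for each song and return the top_n names."""
    scores = defaultdict(float)
    for recs, weight in zip(rec_lists, weights):
        for name in recs:
            scores[name] += weight
    # sorted() is stable, so songs with equal scores keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy outputs of the two recommenders for one picked song.
kmeans_recs = ["Song A", "Song B", "Song C"]
content_recs = ["Song C", "Song D", "Song A"]

print(blend_recommendations([kmeans_recs, content_recs], [0.5, 0.5], top_n=3))
```

Songs A and C appear in both lists and reach a score of 1.0, so they rank ahead of B and D, which each score 0.5.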

In [32]:
# Sum a score per recommended song over all picked songs:
# each recommender contributes 0.5 for every song it returns
combined_scores = {}
for song in picked_songs:
    cf_recs = Spotify_Recommendation(dataset).recommend(str(song), 10)
    content_recs = Spotify_ContentBasedRecommendation(dataset, cosine_sim).recommend(song, 10)
    for rec in cf_recs + content_recs:
        combined_scores[rec['name']] = combined_scores.get(rec['name'], 0) + 0.5

# Rank by combined score, highest first (ties keep insertion order), and keep the top 10
final_recommendations = sorted(combined_scores, key=combined_scores.get, reverse=True)[:10]


print("\t Recommended Songs: \n")
for idx, song in enumerate(final_recommendations, start=1):
    song_info = dataset[dataset['name'] == song].iloc[0]
    song_name = song_info['name']
    artists = song_info['artists']
    release_date = song_info['release_date']

    print(f"{idx}\t Song Name: {song_name}\n\t Artists: {artists}\n\t Release Date: {release_date} \n")

	 Recommended Songs: 

1	 Song Name: Wie man Freunde gewinnt - Die Kunst, beliebt und einflussreich zu werden, Kapitel 54
	 Artists: ['Dale Carnegie', 'Till Hagen', 'Stefan Kaminski']
	 Release Date: 1936 

2	 Song Name: Invitation - Instrumental
	 Artists: ['Cal Tjader']
	 Release Date: 1959-01-01 

3	 Song Name: Foe Life
	 Artists: ['Mack 10', 'Ice Cube']
	 Release Date: 1995-01-01 

4	 Song Name: Goomba Boomba - Remastered
	 Artists: ['Billy May', 'Yma Sumac']
	 Release Date: 1954 

5	 Song Name: Never No Lament
	 Artists: ['Duke Ellington', 'Duke Ellington Orchestra']
	 Release Date: 1940-01-01 

6	 Song Name: Il Ragazzo Della Via Gluck - Remastered
	 Artists: ['Adriano Celentano']
	 Release Date: 1966 

7	 Song Name: Treat Her Like A Lady
	 Artists: ['The Temptations']
	 Release Date: 1997-01-01 

8	 Song Name: Doin' Time - Original Version
	 Artists: ['Sublime']
	 Release Date: 1996-07-30 

9	 Song Name: Best Thing I Never Had
	 Artists: ['Beyoncé']
	 Release Date: 2011-06-24 

1