# k-NN Machine Learning Model

After loading the data, we can start building a machine learning model to recommend songs. There are many ways to approach this. A simple one is a k-nearest neighbors (k-NN) model that returns the songs closest in distance to a set of feature inputs selected by the user. These “feature inputs” include the genre of interest, the release year range (start year and end year), and a set of audio features (acousticness, danceability, energy, instrumentalness, valence, tempo).
We can use scikit-learn to build the k-NN model and return the top-k results for a given test point. To do this, we write the function n_neighbors_uri_audio, which returns the Spotify URIs and audio feature values of the top neighbors in ascending order of rank (the point closest to the input features is ranked first).
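The function n_neighbors_uri_audio itself is not shown in this part of the notebook, so the following is only a minimal sketch of how such a lookup might work. The name `n_neighbors_uri_audio_sketch`, its parameters, and the toy DataFrame are all assumptions made for illustration; only scikit-learn's `NearestNeighbors` and the column names come from the text and data above.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

AUDIO_FEATS = ["acousticness", "danceability", "energy",
               "instrumentalness", "valence", "tempo"]

def n_neighbors_uri_audio_sketch(df, genre, start_year, end_year, test_feat, k=3):
    """Hypothetical sketch: URIs and audio features of the k nearest tracks."""
    # Keep tracks that list the requested genre and fall in the year range (inclusive)
    mask = (
        df["genres"].apply(lambda g: genre in g)
        & df["release_year"].between(start_year, end_year)
    )
    candidates = df[mask]
    if candidates.empty:
        return [], []
    # Never ask for more neighbors than there are candidate tracks
    k = min(k, len(candidates))
    model = NearestNeighbors(n_neighbors=k).fit(candidates[AUDIO_FEATS].to_numpy())
    # kneighbors returns neighbors sorted by increasing distance
    _, idx = model.kneighbors([test_feat])
    nearest = candidates.iloc[idx[0]]
    return nearest["uri"].tolist(), nearest[AUDIO_FEATS].to_numpy()

# Toy data (made up) with the same column names as the track table
toy = pd.DataFrame({
    "uri": ["a", "b", "c", "d"],
    "genres": [["pop"], ["pop", "rock"], ["jazz"], ["pop"]],
    "release_year": [2000, 2010, 2010, 1980],
    "acousticness": [0.1, 0.8, 0.5, 0.1],
    "danceability": [0.9, 0.2, 0.5, 0.9],
    "energy": [0.8, 0.3, 0.5, 0.8],
    "instrumentalness": [0.0, 0.1, 0.9, 0.0],
    "valence": [0.7, 0.2, 0.5, 0.7],
    "tempo": [120.0, 80.0, 100.0, 120.0],
})
uris, audios = n_neighbors_uri_audio_sketch(
    toy, "pop", 1990, 2020, [0.1, 0.9, 0.8, 0.0, 0.7, 118.0], k=2
)
print(uris)
```

Note that tempo is measured in beats per minute while the other features lie between 0 and 1, so with plain Euclidean distance tempo may dominate the ranking unless the features are rescaled first.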


In [2]:
import pandas as pd
import pickle

In [3]:
albums_data = pd.read_csv('spotify_albums.csv')
tracks_data = pd.read_csv('spotify_tracks.csv')
artist_data = pd.read_csv('spotify_artists.csv')


# DATA PREPROCESSING

Next, we join the album and artist tables onto the track data so that each track carries its album release date and its artist's name and genres.

We also drop columns that are not needed and keep only tracks released in 1950 or later.

In [4]:
def join_genre_and_date(artist_df, album_df, track_df):
    # Index albums and artists by ID so they can be joined onto the tracks
    album = album_df.rename(columns={'id': "album_id"}).set_index('album_id')
    artist = artist_df.rename(columns={'id': "artists_id", 'name': "artists_name"}).set_index('artists_id')
    # Attach each track's album release date
    track = track_df.set_index('album_id').join(album['release_date'], on='album_id')
    # artists_id is presumably stored as a string such as "['<id>']"; strip the brackets and quotes
    track['artists_id'] = track['artists_id'].apply(lambda x: x[2:-2])
    # Attach the artist name and genres
    track = track.set_index('artists_id').join(artist[['artists_name', 'genres']], on='artists_id')
    track.reset_index(drop=False, inplace=True)
    track['release_year'] = pd.to_datetime(track.release_date).dt.year
    # Drop columns that are not needed downstream
    track.drop(columns=['Unnamed: 0', 'country', 'track_name_prev', 'track_number', 'type'], inplace=True)

    # Keep only tracks released in 1950 or later
    return track[track.release_year >= 1950]
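To make the join easier to follow, here is a small sketch on made-up rows. It assumes, as the `x[2:-2]` slice suggests, that `artists_id` is stored as a string like `"['ar1']"`. It uses `DataFrame.merge` rather than `set_index` plus `join`, which is a swapped-in but equivalent way to express the same left joins; the IDs and values are invented.

```python
import pandas as pd

# Made-up single-row tables that mimic the three CSV files
album_df = pd.DataFrame({"id": ["al1"], "release_date": ["2018-09-28"]})
artist_df = pd.DataFrame({"id": ["ar1"], "name": ["Evalyn"], "genres": ["['electropop']"]})
track_df = pd.DataFrame({"album_id": ["al1"], "artists_id": ["['ar1']"], "name": ["Song"]})

# "['ar1']" -> "ar1": drop two characters at each end
track_df["artists_id"] = track_df["artists_id"].str[2:-2]

albums = album_df.rename(columns={"id": "album_id"})[["album_id", "release_date"]]
artists = artist_df.rename(columns={"id": "artists_id", "name": "artists_name"})

# Left joins keep every track even when an album or artist is missing
joined = (
    track_df
    .merge(albums, on="album_id", how="left")
    .merge(artists[["artists_id", "artists_name", "genres"]], on="artists_id", how="left")
)
joined["release_year"] = pd.to_datetime(joined["release_date"]).dt.year
print(joined[["name", "artists_name", "release_year"]])
```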


    

In [5]:
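def get_filtered_track_df(df, genres_to_include):
    # Parse the stringified genre list, e.g. "['pop', 'rock']" -> ['pop', 'rock']
    df['genres'] = df.genres.apply(lambda x: [i[1:-1] for i in str(x)[1:-1].split(", ")])
    # One row per (track, genre) pair, keeping only the requested genres
    exploded = df.explode("genres")
    df_exploded = exploded[exploded["genres"].isin(genres_to_include)].copy()
    # Note: this relabel only touches df_exploded; the returned frame keeps the original genre lists
    df_exploded.loc[df_exploded["genres"] == "indian pop", "genres"] = "pop"
    # Keep every track that matched at least one requested genre
    df_exploded_indices = list(df_exploded.index.unique())
    df = df[df.index.isin(df_exploded_indices)]
    df = df.reset_index(drop=True)
    return df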
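The parse, explode, and filter steps can be tried on a few invented rows. This is only an illustrative sketch: the helper name `parse_genres` and the sample data are assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up tracks whose genres are stored as stringified lists
df = pd.DataFrame({
    "name": ["t1", "t2", "t3"],
    "genres": ["['dance pop', 'pop']", "['jazz']", "['metal']"],
})

def parse_genres(raw):
    # "['dance pop', 'pop']" -> ["dance pop", "pop"]
    return [g[1:-1] for g in str(raw)[1:-1].split(", ")]

df["genres"] = df["genres"].apply(parse_genres)

# explode gives one row per (track, genre); the original index is repeated
exploded = df.explode("genres")
keep = exploded[exploded["genres"].isin(["pop", "jazz"])].index.unique()

# Select the original rows, so each kept track still has its full genre list
filtered = df[df.index.isin(keep)].reset_index(drop=True)
print(filtered)
```

Because the filter is applied to the exploded copy and then mapped back through the index, a track is kept when any one of its genres matches.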

In [6]:
genres_to_include = genres = ['dance pop', 'electronic', 'electropop', 'hip hop', 'jazz', 'k-pop', 'latin', 'pop', 'pop rap', 'r&b', 'rock']
track_with_year_and_genre = join_genre_and_date(artist_data, albums_data, tracks_data)
filtered_track_df = get_filtered_track_df(track_with_year_and_genre, genres_to_include)

# Genre labels related to Bollywood tracks (the second list is used below):
# ['indian pop', 'desi', 'bollywood']
# ['desi', 'desi hip hop', 'filmi', 'indian pop', 'modern bollywood', 'sufi']

In [7]:
# Strip the "spotify:track:" prefix so only the bare track ID remains
filtered_track_df["uri"] = filtered_track_df["uri"].str.replace("spotify:track:", "", regex=False)
filtered_track_df = filtered_track_df.drop(columns=['analysis_url', 'available_markets'])


In [8]:
# Select the Bollywood tracks using a separate set of genre labels
genres_to_include = genres = ['desi', 'desi hip hop', 'filmi', 'indian pop', 'modern bollywood', 'sufi']
track_with_year_and_genre = join_genre_and_date(artist_data, albums_data, tracks_data)
filtered_track_df_bollywood = get_filtered_track_df(track_with_year_and_genre, genres_to_include)

filtered_track_df_bollywood.shape


(372, 30)

In [9]:
# Strip the "spotify:track:" prefix so only the bare track ID remains
filtered_track_df_bollywood["uri"] = filtered_track_df_bollywood["uri"].str.replace("spotify:track:", "", regex=False)
filtered_track_df_bollywood = filtered_track_df_bollywood.drop(columns=['analysis_url', 'available_markets'])

In [10]:
# Relabel every Bollywood track with the same two genres.
# Assigning the whole column at once avoids chained indexing
# (df['genres'][i] = ...), which triggers pandas' SettingWithCopyWarning.
replaced = ["bollywood", "pop"]
filtered_track_df_bollywood['genres'] = pd.Series(
    [list(replaced) for _ in range(len(filtered_track_df_bollywood))],
    index=filtered_track_df_bollywood.index,
)


In [11]:
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat replaces it
filtered_track_df = pd.concat([filtered_track_df, filtered_track_df_bollywood], ignore_index=True)

filtered_track_df.shape


(11047, 28)

In [20]:
filtered_track_df.head(5)

Unnamed: 0,artists_id,acousticness,danceability,disc_number,duration_ms,energy,href,id,instrumentalness,key,...,tempo,time_signature,track_href,uri,valence,release_date,artists_name,genres,release_year,song_artist
0,68WwJXWrpo1yVOOIZjLSeT,0.0268,0.506,1.0,248777.0,0.741,https://api.spotify.com/v1/tracks/0UATU9OJxh4m...,0UATU9OJxh4m3fwDljdGZn,2.7e-05,1.0,...,94.042,4.0,https://api.spotify.com/v1/tracks/0UATU9OJxh4m...,0UATU9OJxh4m3fwDljdGZn,0.236,2018-09-28,Evalyn,"[electropop, indie electro-pop, indie poptimis...",2018,Creme de la creme - Evalyn
1,09xj0S68Y1OU1vHMCZAIvz,0.505,0.487,1.0,171573.0,0.297,https://api.spotify.com/v1/tracks/4JH1M62gVDND...,4JH1M62gVDNDhDAUiQB3Qv,5.2e-05,11.0,...,185.912,3.0,https://api.spotify.com/v1/tracks/4JH1M62gVDND...,4JH1M62gVDNDhDAUiQB3Qv,0.289,2001-08-21,Café Tacvba,"[latin, latin alternative, latin rock, mexican...",2001,La muerte chiquita - Café Tacvba
2,6pSsE5y0uJMwYj83KrPyf9,0.133,0.629,1.0,207396.0,0.706,https://api.spotify.com/v1/tracks/0h7Ld5CvgzaU...,0h7Ld5CvgzaUN1zA3tdyPq,0.0,1.0,...,81.22,4.0,https://api.spotify.com/v1/tracks/0h7Ld5CvgzaU...,0h7Ld5CvgzaUN1zA3tdyPq,0.543,2019-01-25,Dawn Richard,"[alternative r&b, deep pop r&b, escape room, h...",2019,"we, diamonds - Dawn Richard"
3,7slfeZO9LsJbWgpkIoXBUJ,0.406,0.59,1.0,279000.0,0.597,https://api.spotify.com/v1/tracks/4S1bYWrLOC8s...,4S1bYWrLOC8smuy8kJzxKQ,2.3e-05,9.0,...,121.051,4.0,https://api.spotify.com/v1/tracks/4S1bYWrLOC8s...,4S1bYWrLOC8smuy8kJzxKQ,0.466,1995-09-12,Ricky Martin,"[dance pop, latin, latin pop, mexican pop, pop...",1995,"Te Extraño, Te Olvido, Te Amo - Ricky Martin"
4,09hVIj6vWgoCDtT03h8ZCa,0.0316,0.727,1.0,218773.0,0.38,https://api.spotify.com/v1/tracks/758mQT4zzlvB...,758mQT4zzlvBhy9PvNePwC,0.0,7.0,...,92.05,4.0,https://api.spotify.com/v1/tracks/758mQT4zzlvB...,758mQT4zzlvBhy9PvNePwC,0.455,1991-09-24,A Tribe Called Quest,"[alternative hip hop, conscious hip hop, east ...",1991,Butter - A Tribe Called Quest


In [12]:
filtered_track_df['song_artist'] = filtered_track_df['name'] + ' - ' + filtered_track_df['artists_name']

filtered_track_df.shape

(11047, 29)

In [13]:
song_list = filtered_track_df['song_artist'].unique().tolist()
artist_list = filtered_track_df['artists_name'].unique().tolist()
print(len(song_list), len(artist_list))

song_list = song_list + artist_list
#pickle.dump(song_list, open('song_list.pkl','wb'))

9764 2054


In [14]:
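# Rows whose song_artist value repeats an earlier row
filtered_track_df.loc[filtered_track_df['song_artist'].duplicated()]
# Scratch copy for trying out the de-duplication
test = filtered_track_df.copy()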


In [15]:
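import numpy as np
# Positions of rows whose song_artist value already appeared earlier.
# The index was rebuilt as 0..n-1 when the frames were combined, so positions equal labels.
drop_values = np.where(filtered_track_df['song_artist'].duplicated())

# Try the drop on the scratch copy first
test = test.drop(index=drop_values[0])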

In [16]:
# Drop the duplicate rows, keeping the first occurrence of each song_artist
filtered_track_df = filtered_track_df.drop(index=drop_values[0])
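As a cross-check on the de-duplication step, the sketch below compares the `np.where` plus `drop` approach with pandas' built-in `drop_duplicates` on a made-up column. The sample values are invented; the point is only that the two give the same rows when the index runs from 0 to n-1.

```python
import numpy as np
import pandas as pd

# Made-up song_artist values with two repeats
df = pd.DataFrame({"song_artist": ["A - x", "B - y", "A - x", "C - z", "B - y"]})

# Positions of repeated values, dropped by label.
# This relies on the index being 0..n-1 so that positions equal labels.
dup_positions = np.where(df["song_artist"].duplicated())[0]
manual = df.drop(index=dup_positions)

# Built-in alternative; keep="first" is the default
deduped = df.drop_duplicates(subset="song_artist", keep="first")

print(manual.equals(deduped))
```

`drop_duplicates` does not depend on the index layout, which makes it the safer choice if the frame has been filtered or reordered beforehand.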

In [17]:
#filtered_track_df.reset_index(drop=False, inplace=True)
#filtered_track_df.drop(columns=['index'],inplace=True)                # } ALL EXECUTED
#pickle.dump(filtered_track_df, open("filtered_track_df.pkl", "wb"))
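The commented-out lines above rebuild the index and save the frame with pickle. A minimal sketch of that round trip, on a made-up frame and a temporary path (both assumptions), might look like this. It uses `reset_index(drop=True)`, which has the same effect as resetting the index and then dropping the resulting `index` column, and `with` blocks so the file handles are closed.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up frame with gaps in its index, as left behind by dropping rows
df = pd.DataFrame({"uri": ["a", "b"], "release_year": [2018, 2001]}, index=[0, 3])

# Renumber the rows 0..n-1 without keeping the old index as a column
df = df.reset_index(drop=True)

# Temporary location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "filtered_track_df.pkl")

with open(path, "wb") as f:
    pickle.dump(df, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.equals(df))
```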