# Spotify Funk Recommender
Looking for new songs that I would like based on my "Tom's Funky Playlist" tracks.

* Using data collected with Funk Recommender Data.ipynb
* Create a content-based filter that compares candidate songs to my own Funky Songs playlist
* Recommend songs from the other lists that I might like!

Following along with: https://towardsdatascience.com/part-iii-building-a-song-recommendation-system-with-spotify-cf76b52705e7

## Imports

In [1]:
import pandas as pd
import numpy as np

# Graphing
import matplotlib.pyplot as plt

# Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import re

# Sentiment analysis: the helper functions below call TextBlob(...) directly
from textblob import TextBlob

## Gather/Read Data
In Funk Recommender Data.ipynb, I gathered all of the track info for the following playlists:

#### Spotify Playlists to Draw From
* Old School Funk: https://open.spotify.com/playlist/37i9dQZF1EIfqkfSDVB2GV
* All Funked Up: https://open.spotify.com/playlist/37i9dQZF1DX4WgZiuR77Ef
* Funky Jams: https://open.spotify.com/playlist/37i9dQZF1DX6drTZKzZwSo
* Crisp: https://open.spotify.com/playlist/37i9dQZF1DXdb5FEvfgsH9
* Instrumental Funk: https://open.spotify.com/playlist/37i9dQZF1DX8f5qTGj8FYl
* Future Funk: https://open.spotify.com/playlist/37i9dQZF1DXbjGYBfEmjR5

#### My Funky Songs Playlist to Compare to
* Toms Funky Playlist: https://open.spotify.com/playlist/7eWWLoTfmLUcD0viBP6Hr0?si=e8b0760749404749

In [2]:
filename = "funky_playlist_tracks.xlsx"
tracks_df = pd.read_excel(filename)
tracks_df.head()

Unnamed: 0,track_uri,track_name,artist_uri,artist_name,artist_pop,artist_genres,album,track_pop,explicit,acousticness,...,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,playlist
0,spotify:track:1v1PV2wERHiMPesMWX0qmO,Flash Light,spotify:artist:5SMVzTJyKFJ7TUb46DglcH,Parliament,53,"['afrofuturism', 'funk', 'funk rock', 'p funk'...",Funkentelechy Vs. The Placebo Syndrome,64,False,0.243,...,0.117,7,0.474,-10.458,0,0.0465,105.177,4,0.687,Old School Funk
1,spotify:track:68oL33xGl9GsUhDSTCXCrD,Hit And Run,spotify:artist:0Z4CzYz9ieK8q9XiVMPkW5,The Bar-Kays,43,"['classic soul', 'disco', 'funk', 'memphis sou...",The Best Of The Bar-Kays,46,False,0.31,...,4e-06,4,0.0431,-10.219,0,0.0328,112.43,4,0.968,Old School Funk
2,spotify:track:1VKPiQJnV15flF5B3zeocD,You Dropped A Bomb On Me,spotify:artist:4TwHRCIu3Xg9fjS3l7owkp,The Gap Band,55,"['disco', 'funk', 'motown', 'quiet storm', 'so...",The Gap Band IV,61,False,0.00737,...,0.00186,9,0.181,-11.177,1,0.0339,126.461,4,0.831,Old School Funk
3,spotify:track:6nJh9dyel0o2jmlZzYGh3h,Firecracker,spotify:artist:4Aj5BsUYgadIeoC759FrhE,Mass Production,26,"['classic soul', 'disco', 'funk', 'p funk', 'p...",Firecrackers: The Best Of Mass Production,40,False,0.0268,...,0.0108,7,0.19,-9.482,1,0.132,127.738,4,0.937,Old School Funk
4,spotify:track:71djYUPXyLrhOYWZcpYufv,Backstrokin',spotify:artist:6PWU6JQvvYv5sz5FOODHg6,Fatback Band,43,"['classic soul', 'disco', 'funk', 'harlem hip ...",Hustle! The Ultimate Fatback,47,False,0.0783,...,0.00442,6,0.0533,-8.422,1,0.109,116.494,4,0.802,Old School Funk


## Feature Generation
1. Sentiment Analysis - the example does this on the track names. I'm skipping it for now.
1. One-hot Encoding - the example applies this to the subjectivity and polarity categories of the song names. I'm dropping that part because song titles are very short and I don't expect them to carry much information. It is also applied to key and mode, though, which could be interesting.
1. TF-IDF - this is done on the artist genres, so I may as well give it a try.
1. Normalization - scale numeric values to a range of 0-1 or something comparable. In this case only the popularity columns (scored out of 100) need to be divided by 100.

In [3]:
def getSubjectivity(text):
  '''
  Getting the Subjectivity using TextBlob
  '''
  return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
  '''
  Getting the Polarity using TextBlob
  '''
  return TextBlob(text).sentiment.polarity

def getAnalysis(score, task="polarity"):
  '''
  Categorizing the Polarity & Subjectivity score
  '''
  if task == "subjectivity":
    if score < 1/3:
      return "low"
    elif score > 2/3:  # upper third; with "> 1/3" only a score of exactly 1/3 would be "medium"
      return "high"
    else:
      return "medium"
  else:
    if score < 0:
      return 'Negative'
    elif score == 0:
      return 'Neutral'
    else:
      return 'Positive'

def sentiment_analysis(df, text_col):
  '''
  Perform sentiment analysis on text
  ---
  Input:
  df (pandas dataframe): Dataframe of interest
  text_col (str): column of interest
  '''
  df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
  df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
  return df

In [4]:
# copying this function, but not using it yet
def ohe_prep(df, column, new_name): 
    ''' 
    Create One Hot Encoded features of a specific column
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    column (str): Column to be processed
    new_name (str): new column name to be used
        
    Output: 
    tf_df: One-hot encoded features 
    '''
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df
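
As a quick illustration of what `ohe_prep` produces, here is a small sketch on a toy frame. The data and the helper name `one_hot` are hypothetical; the 0.5 weight mirrors the one applied to `key` and `mode` later in this notebook.

```python
import pandas as pd

# Toy frame with made-up values; 'key' and 'mode' mirror the Spotify columns.
toy = pd.DataFrame({"key": [7, 4, 7], "mode": [0, 1, 1]})

def one_hot(df, column, prefix):
    """One-hot encode `column`, naming the outputs '<prefix>|<value>'."""
    dummies = pd.get_dummies(df[column]).astype(float)
    dummies.columns = [f"{prefix}|{value}" for value in dummies.columns]
    return dummies.reset_index(drop=True)

# Down-weight the indicator columns so they do not dominate the similarity score.
key_ohe = one_hot(toy, "key", "key") * 0.5
print(key_ohe)
```

Each distinct key becomes its own column, and a track gets the weight (0.5 here) in the column for its key and 0 everywhere else.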


### TF-IDF: artist_genre

In [5]:
# TF-IDF implementation - on the artist genre
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(tracks_df['artist_genres'])
genre_df = pd.DataFrame(tfidf_matrix.toarray())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
# genre_df.drop(columns='genre|unknown') # Drop unknown genre
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]

genre|acid            0.000000
genre|acoustic        0.000000
genre|adult           0.000000
genre|afrobeat        0.000000
genre|afrofuturism    0.696265
                        ...   
genre|urban           0.000000
genre|vaporwave       0.000000
genre|video           0.000000
genre|vocal           0.000000
genre|worth           0.000000
Name: 0, Length: 166, dtype: float64
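
One thing worth noticing in the output above is that the columns are single words, not whole genres. A minimal sketch of why, assuming the default `TfidfVectorizer` settings and two made-up genre strings in the same stringified-list form as `artist_genres`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical genre strings, formatted like the artist_genres column.
genres = [
    "['afrofuturism', 'funk', 'funk rock', 'p funk']",
    "['modern funk']",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(genres)

# vocabulary_ maps each token to its column index in the matrix.
tokens = sorted(tfidf.vocabulary_)
print(tokens)
print(matrix.shape)
```

The default tokenizer lowercases the text, splits on punctuation and whitespace, and drops single-character tokens, so "p funk" contributes only "funk" and "funk rock" is split into "funk" and "rock". That is probably where columns such as `genre|worth` come from. Treating each full genre as one token would need a custom tokenizer, which this notebook does not do.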

### Normalization
The popularity columns are scored out of 100, so I'll scale them down to a 0-1 range.

In [6]:
list(tracks_df.columns)
tracks_df[['artist_pop','track_pop']].describe()

Unnamed: 0,artist_pop,track_pop
count,582.0,582.0
mean,37.685567,33.749141
std,15.201088,14.058727
min,0.0,0.0
25%,28.0,26.0
50%,36.0,34.0
75%,48.0,41.0
max,88.0,89.0


In [7]:
tracks_df['artist_pop'] = tracks_df['artist_pop']/100
tracks_df['track_pop'] = tracks_df['track_pop']/100
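
Dividing by 100 and min-max scaling are not quite the same thing, and `create_feature_set` below applies `MinMaxScaler` on top of the division. A minimal NumPy sketch of the difference, using made-up popularity values rather than the real columns:

```python
import numpy as np

# Hypothetical popularity scores on a 0-100 scale.
pop = np.array([0.0, 36.0, 88.0])

# Fixed rescale: keeps the meaning of the original 0-100 scale.
divided = pop / 100

# Min-max scaling: stretches the observed range so min -> 0 and max -> 1.
min_max = (pop - pop.min()) / (pop.max() - pop.min())

print(divided)
print(min_max)
```

With min-max scaling the most popular track in the data always maps to 1.0, even if its raw score is below 100.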

## Feature Generation Function
Create a function that does all of the feature creation and data prep for modeling. This is similar to the Recipe step in R tidymodels.

In [8]:
def create_feature_set(df, float_cols):
    '''
    Process spotify df to create a final set of features that will be used to generate recommendations
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    float_cols (list(str)): List of float columns that will be scaled
            
    Output: 
    final (pandas dataframe): Final set of features 
    '''
    
    # Extract the track id from the URI (spotify:track:<id>), using the df argument rather than the global
    df['id'] = df['track_uri'].apply(lambda x: x.split(":")[2])
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['artist_genres'])
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
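    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    
    if 'genre|unknown' in genre_df.columns:
        # drop() returns a new frame, so the result has to be assigned back
        genre_df = genre_df.drop(columns='genre|unknown')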
    
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
#     df = sentiment_analysis(df, "track_name")

    # One-hot Encoding - commenting out subjectivity and polarity, since I'm skipping that bit
#     subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
#     polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [14]:
# Generate features
float_cols = tracks_df.dtypes[tracks_df.dtypes == 'float64'].index.values
complete_feature_set = create_feature_set(tracks_df, float_cols=float_cols)

## Content Based Filtering Recommendations
The next step is to perform content-based filtering on the song features we have. To do so, we sum the feature vectors of all songs in a playlist into one summary vector. Then we compute the similarity between that summary vector and every song in the database that is not in the playlist. Finally, we use the similarity scores to retrieve the most relevant songs that are not in the playlist and recommend them.

There are three steps in this section:

1. Choose playlist: In this part, we retrieve a playlist
1. Extract features: In this part, we retrieve playlist-of-interest features and non-playlist-of-interest features.
1. Find similarity: In this part, we compare the summarized playlist features with all other songs.

The first two parts were already done in the data collection step. I just need to separate out my playlist from the rest of the songs. The "rest of the songs" will be the database from which I want to make recommendations by finding the songs that "belong" in my playlist with the content filter.

In [15]:
def generate_playlist_feature(complete_feature_set, playlist_df):
    '''
    Summarize a user's playlist into a single vector
    ---
    Input: 
    complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
    playlist_df (pandas dataframe): playlist dataframe
        
    Output: 
    complete_feature_set_playlist_final (pandas series): single vector feature that summarizes the playlist
    complete_feature_set_nonplaylist (pandas dataframe): features (including id) of all songs that are not in the playlist
    '''
    
    # Find song features in the playlist
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    # Find all non-playlist song features
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist
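
To see what the function above returns, here is the same split-and-sum logic on a toy table. The feature values, ids, and the name `playlist_ids` are all hypothetical.

```python
import pandas as pd

# Hypothetical feature table: two numeric features plus a track id.
features = pd.DataFrame({
    "genre|funk": [0.7, 0.0, 0.5, 0.2],
    "mode|1": [0.5, 0.0, 0.5, 0.5],
    "id": ["a", "b", "c", "d"],
})
playlist_ids = ["a", "c"]  # tracks assumed to be in the user's playlist

# Boolean mask: True for rows that belong to the playlist.
in_playlist = features["id"].isin(playlist_ids)

# Summarize the playlist by summing its feature rows column-wise.
playlist_vector = features[in_playlist].drop(columns="id").sum(axis=0)

# Everything else is the pool of candidate recommendations.
candidates = features[~in_playlist]

print(playlist_vector)
print(candidates["id"].tolist())
```

Because the playlist is summarized with a sum rather than a mean, the entries grow with playlist length. Cosine similarity ignores vector length, so this does not affect the ranking.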

In [16]:
playlist_df = tracks_df.loc[tracks_df.playlist=='Toms Funky Playlist']

# Generate the features
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlist_df)

In [17]:
# Non-playlist features
complete_feature_set_nonplaylist.head()

Unnamed: 0,genre|acid,genre|acoustic,genre|adult,genre|afrobeat,genre|afrofuturism,genre|album,genre|alternative,genre|americana,genre|ann,genre|arbor,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.696265,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,1v1PV2wERHiMPesMWX0qmO
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,68oL33xGl9GsUhDSTCXCrD
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.5,1VKPiQJnV15flF5B3zeocD
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,6nJh9dyel0o2jmlZzYGh3h
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,71djYUPXyLrhOYWZcpYufv


In [18]:
# Summarized playlist features
complete_feature_set_playlist_vector

genre|acid             0.653129
genre|acoustic         0.000000
genre|adult            0.000000
genre|afrobeat         0.494668
genre|afrofuturism     0.000000
                        ...    
key|9                  7.000000
key|10                 4.000000
key|11                 3.000000
mode|0                19.500000
mode|1                31.500000
Length: 193, dtype: float64

## Find Similarity
The last piece of the puzzle is to find the similarity between the summarized playlist vector and all other songs. There are many similarity measures, but one of the most common is cosine similarity.

Cosine similarity measures how similar two vectors are using the cosine of the angle between them. Imagining our song vectors as only two-dimensional, the visual representation would look similar to the figure below.

The mathematical formula can be expressed as:
 
$$\text{CosineSim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

In our code, we use the cosine_similarity() function from scikit-learn to measure the similarity between each song and the summarized playlist vector.

One big advantage of this approach is that the time complexity of the whole algorithm equals that of a matrix multiplication, since we compute the cosine similarity between each row vector (song) and the column vector of summarized playlist features.
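
To make the formula concrete, here is a small NumPy sketch that scores three made-up songs against a made-up playlist vector. It uses plain NumPy instead of scikit-learn's `cosine_similarity()` so the arithmetic stays visible; all values are hypothetical.

```python
import numpy as np

# Hypothetical feature rows for three candidate songs.
songs = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
# Hypothetical summed playlist vector.
playlist = np.array([3.0, 3.0])

# One matrix-vector product gives every dot product A . B at once;
# dividing by the norms turns them into cosine similarities.
song_norms = np.linalg.norm(songs, axis=1)
sims = (songs @ playlist) / (song_norms * np.linalg.norm(playlist))
print(np.round(sims, 4))

# Rescaling the playlist vector (sum vs. mean) leaves the scores unchanged.
playlist_mean = playlist / 2
sims_mean = (songs @ playlist_mean) / (song_norms * np.linalg.norm(playlist_mean))
print(np.allclose(sims, sims_mean))
```

The second song points in the same direction as the playlist vector and scores 1.0, while the other two sit at a 45 degree angle and score about 0.707.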

In [19]:
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    Generate an ordered recommendation list based on the songs in a specific playlist.
    ---
    Input: 
    df (pandas dataframe): spotify dataframe
    features (pandas series): summarized playlist feature (single vector)
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Output: 
    non_playlist_df_ordered: ordered list of songs by similarity to the given playlist
    '''
    
    # Copy the slice so that adding 'sim' does not trigger pandas' SettingWithCopyWarning
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()
    # Find cosine similarity between the playlist vector and every non-playlist song
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_ordered = non_playlist_df.sort_values('sim',ascending = False)
    
    return non_playlist_df_ordered

In [20]:
# Generate top 10 recommendations
recommend = generate_playlist_recos(tracks_df, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend.head(10)


Unnamed: 0,track_uri,track_name,artist_uri,artist_name,artist_pop,artist_genres,album,track_pop,explicit,acousticness,...,liveness,loudness,mode,speechiness,tempo,time_signature,valence,playlist,id,sim
257,spotify:track:5lMJEmLGzqbdFuOaonB0eO,Your Touch,spotify:artist:4FcDSQOUJabW2HEHGofJOM,The APX,0.24,['modern funk'],Amplified Experiment,0.29,False,0.0482,...,0.116,-5.776,1,0.0633,116.981,4,0.691,Crisp,5lMJEmLGzqbdFuOaonB0eO,0.769971
93,spotify:track:3uQjXCAXuSAZIduWjU5mY8,D.R.E.A.D,spotify:artist:7JnJgTo8cCtAQmtC0cJyjp,Tom McGuire & the Brassholes,0.3,['modern funk'],D.R.E.A.D,0.28,False,0.0597,...,0.113,-8.196,1,0.149,190.051,4,0.926,All Funked Up,3uQjXCAXuSAZIduWjU5mY8,0.767957
66,spotify:track:3OSS6R3an41FservLqCpZH,2nd Place,spotify:artist:7JnJgTo8cCtAQmtC0cJyjp,Tom McGuire & the Brassholes,0.3,['modern funk'],Stay Rad,0.3,False,0.139,...,0.29,-7.874,1,0.072,164.028,4,0.884,All Funked Up,3OSS6R3an41FservLqCpZH,0.766341
53,spotify:track:0Bwy62vMCaxtEdgRCh4jh5,Bump The Man,spotify:artist:1wnaeDbP5Yl9MNV9qC008L,Philip Lassiter,0.29,['modern funk'],Bump The Man,0.33,False,0.0149,...,0.437,-7.463,1,0.0838,117.9,4,0.697,All Funked Up,0Bwy62vMCaxtEdgRCh4jh5,0.760799
139,spotify:track:3xDvyv5KF5Jvnycgutrgb9,Turn up the Sound,spotify:artist:7GnRzYsBXvLyhcdFEtCAei,The Brooks,0.24,['modern funk'],Turn up the Sound,0.29,False,0.119,...,0.213,-5.221,1,0.171,114.562,4,0.607,All Funked Up,3xDvyv5KF5Jvnycgutrgb9,0.758079
124,spotify:track:3Y0WMxJ7Mpb7xe2RTa1LkD,Satisfaction,spotify:artist:3xgLOazt16FXyWSWJ99ViC,Diggin' Dirt,0.22,['modern funk'],Satisfaction,0.27,False,0.0385,...,0.0316,-6.825,1,0.117,92.537,4,0.642,All Funked Up,3Y0WMxJ7Mpb7xe2RTa1LkD,0.757744
123,spotify:track:0GjSGefxut8enOP0LFPlln,Mother Funkin' Robots,spotify:artist:3gfBx0SvMGdMQ2ZsjPvIV4,MF Robots,0.25,['modern funk'],Mother Funkin' Robots,0.26,False,0.00226,...,0.291,-5.797,1,0.0479,169.999,4,0.834,All Funked Up,0GjSGefxut8enOP0LFPlln,0.755208
145,spotify:track:7Mg0Lbc0ehV83usNMmZlKi,Wanna Do (Funk With You),spotify:artist:1gODfHkJMTmn5Kmyy3M6LW,The Aquaducks,0.16,['modern funk'],Wanna Do (Funk With You),0.23,False,0.0649,...,0.0142,-5.987,1,0.0809,92.071,4,0.876,All Funked Up,7Mg0Lbc0ehV83usNMmZlKi,0.750633
103,spotify:track:6162qDKnzcPSOh1NcqoLM3,Brand New Day,spotify:artist:3gfBx0SvMGdMQ2ZsjPvIV4,MF Robots,0.25,['modern funk'],Break the Wall,0.29,False,0.0105,...,0.307,-9.085,1,0.0362,98.021,4,0.892,All Funked Up,6162qDKnzcPSOh1NcqoLM3,0.748057
140,spotify:track:2jnOjKvvOH7BZSAJmwdCr7,Gotta Keep on Movin',spotify:artist:1gODfHkJMTmn5Kmyy3M6LW,The Aquaducks,0.16,['modern funk'],Gotta Keep on Movin',0.29,False,0.445,...,0.0696,-6.024,1,0.0934,123.913,4,0.819,All Funked Up,2jnOjKvvOH7BZSAJmwdCr7,0.746425


## Collaborative Filtering