# Cosine Similarity Algorithm 

## Business Understanding 

Cosine Similarity is an algorithm used to calculate how similar variables are to each other using metrics that describe each variable. What I hope to accomplish with this algorithm is to plug a song in and have the algorithm output 20 songs that are similar to the song plugged in. This will make it easy for users to discover new music and a fast way to create playlists!

The goal of this cosine similarity function is give Apple Music a new way to create playlists quick and easy for their users to enjoy. Spotify has playlists on its streaming service that has a team dedicated to making playlists for their listeners. Rap Caviar, for example, is Spotify's biggest playlist with over 14 million likes. If Apple Music developed a team dedicated to making playlists, they could potentially catch up to Spotify's number of subscribers. 

In [1]:
import pandas as pd 
import numpy as np 
from numpy import dot 
import operator
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer 
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_validate

In [2]:
pd.set_option('display.max_columns', 500)

This website is where I got the information to implement a cosine similarity system.
https://www.datasciencelearner.com/sklearn-cosine-similarity-implementation/

In [3]:
df = pd.read_csv('../data/final_df.csv', index_col=0)
df

Unnamed: 0,track_id,track_name,track_popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist_id,followers,genres,artist_name,artist_popularity
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,1922-02-22,0.645,0.44500,0,-13.338,1,0.4510,0.674,0.744000,0.1510,0.127,104.851,3,45tIt06XoI0Iio4LBEVpls,91.0,,Uli,4
1,0PH9AACae1f957JAavhOl2,Lazy Boi,0,157333,0,1922-02-22,0.298,0.46000,1,-18.645,1,0.4530,0.521,0.856000,0.4360,0.402,87.921,4,45tIt06XoI0Iio4LBEVpls,91.0,,Uli,4
2,2SiNuAZ6jIU9xhClRKXcST,Sketch,0,87040,0,1922-02-22,0.634,0.00399,5,-29.973,0,0.0377,0.926,0.919000,0.1050,0.396,79.895,4,45tIt06XoI0Iio4LBEVpls,91.0,,Uli,4
3,4vV7uBcF2AnjNTOejBS5oL,L'enfer,0,40000,0,1922-02-22,0.657,0.32500,10,-14.319,0,0.2540,0.199,0.856000,0.0931,0.105,81.944,5,45tIt06XoI0Iio4LBEVpls,91.0,,Uli,4
4,598LlBn6jpEpVbLjmZPsYV,Graphite,0,104400,0,1922-02-22,0.644,0.68400,7,-8.247,1,0.1990,0.144,0.802000,0.0847,0.138,100.031,4,45tIt06XoI0Iio4LBEVpls,91.0,,Uli,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
470033,0MmaEacabpK8Yp3Mdeo5uY,下雨天,50,265846,0,2020-02-25,0.528,0.67300,4,-3.639,1,0.0314,0.143,0.000000,0.0989,0.297,130.066,4,5VGgFE9nPgMfEnYiPT5J2B,929.0,chinese viral pop,芝麻,36
470034,1dKxf4Ht2SsKLyXfSDJAgy,The Cutest Puppy,67,82500,0,2020-10-30,0.609,0.01720,8,-28.573,1,0.1180,0.996,0.973000,0.1080,0.890,68.619,4,7vgGpuiXdNlCmc994PlMlz,23.0,instrumental lullaby,Laureen Conrad,52
470035,0SjsIzJkZfDU7wlcdklEFR,John Brown's Song,66,185250,0,2020-03-20,0.562,0.03310,1,-25.551,1,0.1030,0.996,0.961000,0.1110,0.386,63.696,3,4MxqhahGRT4BPz1PilXGeu,91.0,instrumental lullaby,Gregory Oberle,55
470036,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,2020-09-26,0.560,0.51800,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.211,131.896,4,1QLBXKM5GCpyQQSVMNZqrZ,896.0,chinese viral pop,阿YueYue,38


Getting rid of null values in genres

In [4]:
df = df.dropna(subset=['genres'])

I am going to create a variable with all of the numeric columns to scale and have them easily accessible when I plug them into functions. 

In [5]:
num_cols = ['danceability', 'energy', 'loudness', 'key', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'track_popularity']

Scaling my numeric data so that it will run smoothly through the cosine similarity algorithm.

In [6]:
scaler = StandardScaler()

scaler.fit(df[num_cols])

StandardScaler()

In [7]:
scaled_data = pd.DataFrame(scaler.fit_transform(df[num_cols]), columns=num_cols)
scaled_data

Unnamed: 0,danceability,energy,loudness,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_popularity
0,-0.823764,-1.595823,-2.533133,-1.203130,0.712639,-0.265077,1.719770,-0.279239,-0.010532,-0.419442,0.367961,2.512874,-1.733620
1,-1.534985,-1.936469,-4.034035,0.501498,0.712639,-0.269892,1.722753,3.581239,-0.590591,-0.656883,1.703360,-1.992675,-1.733620
2,-0.106249,-2.165909,-2.986436,0.217393,0.712639,0.347572,1.725736,3.645853,-0.354271,0.193948,-1.482568,-1.992675,-1.733620
3,-1.226580,-0.789683,-1.638480,-0.919025,0.712639,-0.052636,1.692923,-0.370211,3.840421,0.751936,1.355317,0.260100,-1.733620
4,-1.421693,-1.931508,-3.808933,0.785602,0.712639,-0.254245,1.719770,3.800927,-0.488544,-1.361293,-1.121934,-1.992675,-1.733620
...,...,...,...,...,...,...,...,...,...,...,...,...,...
432223,-0.232129,0.454667,1.349378,-0.350816,0.712639,-0.384237,-0.818768,-0.373145,-0.617983,-1.052619,0.356079,0.260100,1.190996
432224,0.277684,-2.256445,-4.169494,0.785602,0.712639,0.136936,1.725736,3.818157,-0.569108,1.294093,-1.718040,0.260100,2.185366
432225,-0.018133,-2.190713,-3.500607,-1.203130,0.712639,0.046664,1.725736,3.766466,-0.552995,-0.700414,-1.884214,-1.992675,2.126874
432226,-0.030721,-0.186111,0.501206,-1.487234,-1.403235,-0.397477,1.096322,-0.373145,-0.801132,-1.392951,0.417850,0.260100,1.190996


Making a new data frame to have track_id, track_name, and artist_name to plug into the algorithm.

In [8]:
df_track_info = df[['track_id', 'track_name', 'artist_name']]
df_track_info

Unnamed: 0,track_id,track_name,artist_name
56,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,Ignacio Corsini
57,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,Ignacio Corsini
58,0JV4iqw2lSKJaHBQZ0e5zK,Martín Fierro - Remasterizado,Ignacio Corsini
59,0l3BQsVJ7F76wlN5QhJzaP,El Vendaval - Remasterizado,Ignacio Corsini
60,0xJCJ9XSNcdTIz0QKmhtEn,La Maleva - Remasterizado,Ignacio Corsini
...,...,...,...
470033,0MmaEacabpK8Yp3Mdeo5uY,下雨天,芝麻
470034,1dKxf4Ht2SsKLyXfSDJAgy,The Cutest Puppy,Laureen Conrad
470035,0SjsIzJkZfDU7wlcdklEFR,John Brown's Song,Gregory Oberle
470036,5rgu12WBIHQtvej2MdHSH0,云与海,阿YueYue


The following code is from 
https://stackoverflow.com/questions/36538780/merging-dataframes-on-index-with-pandas

New dataframe with the scaled data and information on the track_id, track_name, and artist_name. 
- data is organized and ready to go through cosine similarity

In [9]:
df_mark = df_track_info.join(scaled_data, how='inner')
df_mark

Unnamed: 0,track_id,track_name,artist_name,danceability,energy,loudness,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_popularity
56,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,Ignacio Corsini,-1.446869,-1.930681,-3.926021,1.637916,-1.403235,-0.041803,1.725736,2.926482,-0.300562,0.379944,0.933857,2.512874,-1.733620
57,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,Ignacio Corsini,-1.062936,-1.310573,-2.526493,0.217393,0.712639,-0.233181,1.707838,3.473548,-0.262965,0.415560,-1.879927,0.260100,-1.733620
58,0JV4iqw2lSKJaHBQZ0e5zK,Martín Fierro - Remasterizado,Ignacio Corsini,-1.723805,-1.695040,-2.734995,-0.066712,0.712639,-0.220543,1.722753,-0.292162,-0.144805,0.003995,1.945178,0.260100,-1.733620
59,0l3BQsVJ7F76wlN5QhJzaP,El Vendaval - Remasterizado,Ignacio Corsini,-0.792294,-1.674370,-2.487980,0.217393,-1.403235,-0.132678,1.704855,3.340013,-0.563737,0.724234,1.656644,0.260100,-1.733620
60,0xJCJ9XSNcdTIz0QKmhtEn,La Maleva - Remasterizado,Ignacio Corsini,-1.465751,-1.178284,-2.248712,0.785602,0.712639,-0.123049,1.698889,3.775081,-0.198514,-0.063280,2.026054,-1.992675,-1.733620
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
432223,65o7zOY79D5vqOJJNm1l3T,下雨的晚上,Dadado Huang,-0.232129,0.454667,1.349378,-0.350816,0.712639,-0.384237,-0.818768,-0.373145,-0.617983,-1.052619,0.356079,0.260100,1.190996
432224,7D9yBn5ivJUao1v4jmVdgG,25歲,Dadado Huang,0.277684,-2.256445,-4.169494,0.785602,0.712639,0.136936,1.725736,3.818157,-0.569108,1.294093,-1.718040,0.260100,2.185366
432225,6di4lDxW9XThds6gIHVRtL,跟你出去玩,Dadado Huang,-0.018133,-2.190713,-3.500607,-1.203130,0.712639,0.046664,1.725736,3.766466,-0.552995,-0.700414,-1.884214,-1.992675,2.126874
432226,4EoOSTT7iBjHxSfOfmB8Iq,香格里拉,Dadado Huang,-0.030721,-0.186111,0.501206,-1.487234,-1.403235,-0.397477,1.096322,-0.373145,-0.801132,-1.392951,0.417850,0.260100,1.190996


Creating function for cosine similarity.

To make sure this function outputs accurate recommendations, I will run my recommendations based on a song I know well: Through The Wire by Kanye West.

In [10]:
df_mark.loc[df_mark['track_id'] == '4mmkhcEm1Ljy1U9nwtsxUo'] 

Unnamed: 0,track_id,track_name,artist_name,danceability,energy,loudness,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_popularity
249828,4mmkhcEm1Ljy1U9nwtsxUo,Through The Wire,Kanye West,-1.075524,-0.289463,0.554549,-1.20313,-1.403235,-0.38845,0.693616,-0.373145,-0.688879,-1.286103,-1.254758,0.2601,1.424966


The function below will take in a track and an artist and then it will return the track ID.

In [11]:
def find_track_id(track_name, artist_name):
    '''
    Given an artist and track, return the track ID
    
    NOTE: if there's more than one song with the same artist/track names, 
    will return the first one
    '''
    song = df_mark.loc[(df_mark['track_name'] == track_name) & (df_mark['artist_name'] == artist_name)]
    return song['track_id'].values[0]

In [12]:
find_track_id("Through The Wire", 'Kanye West')

'4mmkhcEm1Ljy1U9nwtsxUo'

This function is the recommendation system itself. Using cosine similarity, it will return the 10 most similar songs to the song plugged into the function. One of the problems with this recommendation system is that it has to go through every song in the data frame before finding the 10 most similar songs. There are over 400,000 songs in this data frame!

In [13]:
def cosSim(trackID, num_of_songs):
    '''
    Takes in a trackID, returns top num_of_songs most similar tracks
    '''
    song_starter = df_mark[df_mark['track_id'] == trackID][num_cols]
    sim_series = df_mark[num_cols].apply(
        lambda x: cosine_similarity(song_starter, x.values.reshape(1, -1)), axis=1)

#     sorted_dict = {k:v for k,v in sorted(similarity.items(), key=lambda k: k[1], reverse=True)}
#     list(sorted_dict.keys())[:num_of_songs]

    return sim_series.str[0].str[0].sort_values().tail(num_of_songs)

In [14]:
lindsey_sim = cosSim(find_track_id("Through The Wire", 'Kanye West'), 20)

This will output the top 20 songs and their similarity score, but it only gives us the index number and not the track name or the information about the features.

In [15]:
lindsey_sim

399665    0.949851
160256    0.950513
199369    0.952726
239999    0.955288
70914     0.955376
234320    0.955694
389698    0.960418
54082     0.960490
389637    0.962931
242984    0.963329
305109    0.963335
204849    0.964361
185161    0.965142
314135    0.970456
155184    0.974233
239074    0.976038
255300    0.980135
255348    0.980537
250124    0.986290
249828    1.000000
dtype: float64

The following code is to help match the track name and the other features to the index provided from the function above. 

In [18]:
playlist = []
for ind in lindsey_sim.index:
    playlist.append(df_mark.loc[df_mark.index == ind])

In [19]:
pd.concat(playlist)

Unnamed: 0,track_id,track_name,artist_name,danceability,energy,loudness,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_popularity
399665,6JlOU0RRr1LFU8CMjIjhfB,Change of Frame,Heidrik,-1.9378,-1.207222,0.169861,-1.487234,-1.403235,-0.399283,1.340929,-0.3719,-0.510028,-1.420653,-1.420696,0.2601,2.068382
160256,5iJ51DcYRNmbePnFc99XXK,Con Un Nudo En La Garganta,Pimpinela,-0.616062,-0.074492,0.250207,-0.919025,-1.403235,-0.398681,0.195454,-0.340149,-0.619057,-1.428568,-0.977296,0.2601,0.664565
199369,4ie9UwMClCZKN56STsnKy5,Spirito,Litfiba,-0.741942,0.061932,1.008958,-0.919025,-1.403235,-0.380024,0.416196,-0.373145,-0.629262,-0.565864,-1.175435,0.2601,1.366473
239999,5KNuHsIeFtD0oukst77hBi,Maula Mere Maula,Roop Kumar Rathod,-0.81747,-0.38868,0.416212,-0.350816,-1.403235,-0.311417,0.26108,-0.373138,-0.531511,-0.957642,-1.329694,0.2601,1.249489
70914,5IoLgsS2vq18Ymb6F7PNgR,A saudade continua,Tião Carreiro & Pardinho,-1.169934,-0.318401,0.524225,-0.919025,-1.403235,-0.395672,-0.156541,-0.373145,-0.648597,-0.918069,-1.242404,0.2601,1.074012
234320,1V3HlCCSvHmKqvGSNuSL1e,A Ele A Glória - Ao Vivo,Diante do Trono,-0.955938,-0.487897,0.54813,-1.20313,-1.403235,-0.422753,0.061219,-0.373145,-0.552995,-1.764943,-0.80859,0.2601,1.892905
389698,58fvOyLXXFP04O9ArtdGyh,Antio Leme,Konstantinos Argiros,-0.792294,-0.599516,0.267472,-0.634921,-1.403235,-0.345721,0.648871,-0.37314,-0.542253,-1.317762,-1.407532,0.2601,0.840042
54082,33TDAUeConyLclvr9GyvmZ,"To Be Young, Gifted and Black - Single Version",Nina Simone,-0.836352,-0.880632,-0.025803,-1.487234,-1.403235,-0.413124,0.759242,-0.369496,-0.648597,-1.187169,-1.556288,0.2601,1.249489
389637,1iiUBVsjSCw8A4fvn1iVdm,Pou Einai Oi Dikoi Mou,Voreia Asteria,-1.012584,-0.475495,0.729185,-0.350816,-1.403235,-0.361368,0.681684,-0.373145,-0.510028,-1.361293,-1.138339,0.2601,1.424966
242984,3xyNNZepxZSWRADK0SsNal,Tu Cabeza en Mi Hombro,Roberto Jordan,-1.528691,-0.095162,0.76305,-1.20313,-1.403235,-0.408912,1.048594,-0.373145,-0.504657,-1.523544,-0.868876,0.2601,0.957027


Above are the top 20 most similar songs to Through The Wire. The recommendation system prioritized instrumentalness, time signature, and mode. 

# Conclusion 

- Cosine Similarity function can recommend songs to help build playlists quickly 
- Can sort the songs from least similar to most similar 
- Can be used to help Apple Music build playlists and create more user-friendly features.
