# Spotify recommendation

Using audio statistics about songs I like and dislike, we're going to build a model that predicts whether I will like a song.

Summary:

1. Data Collection
2. Data Features
3. Data Cleaning
4. Exploratory Data Analysis
5. Modelling
6. Testing on new data

## 1. Data Collection

### 1.1 Playlist creation
I collected 100 liked songs and 95 disliked songs.

For the songs I like, I made a [playlist](https://open.spotify.com/playlist/2WONKi3eZaR29QaQCRSiAE?si=a2463f1d382f4399) of my 100 favorite songs. It is mainly French rap, with some American rap, rock, and electro.

For the songs I dislike, I collected songs from various genres so that the model gets a broader view of what I don't like.

They are:
- [25 metal songs (Cannibal Corpse)](https://open.spotify.com/playlist/37i9dQZF1DZ06evO0grpKg?si=3c829a46465d4367)
- [20 rap songs I don't like (PNL)](https://open.spotify.com/playlist/37i9dQZF1DX2fxPY4lXxv8?si=c69f40a2a2014a25)
- [25 classical songs](https://open.spotify.com/playlist/1h0CEZCm6IbFTbxThn6Xcs?si=933db0752a684db0)
- [25 disco songs](https://open.spotify.com/playlist/2rkU3Aop33atDJoF8LCCjh?si=5e1247ee29284f0a)

I didn't include any pop songs because I'm fairly neutral about them.

### 1.2 Getting the IDs

1. Using [Spotify's "Get a playlist's Items" API endpoint](https://developer.spotify.com/console/get-playlist-tracks/), I turned the playlists into JSON data containing the ID and name of each track (ids/yes.py and ids/no.py). NB: on the website, specify `items(track(id,name))` in the fields box to avoid being overwhelmed by useless data.

2. With a script (ids/ids_to_data.py), I turned the JSON data into one long string of IDs separated by commas.
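
The conversion in step 2 could look roughly like the sketch below. This is not the actual ids/ids_to_data.py script: the function name `ids_to_string` is hypothetical, the sample IDs are made up, and the response shape (`items` containing `track` objects with an `id`) is assumed from the `items(track(id,name))` filter.

```python
import json

def ids_to_string(playlist_json: str) -> str:
    """Return the track IDs of a playlist response as one comma-separated string."""
    playlist = json.loads(playlist_json)
    # Skip entries without a usable track ID (assumption: such entries can occur)
    ids = [
        item["track"]["id"]
        for item in playlist["items"]
        if item.get("track") and item["track"].get("id")
    ]
    return ",".join(ids)

# Made-up IDs, for illustration only
sample = json.dumps({"items": [
    {"track": {"id": "aaa111", "name": "Song A"}},
    {"track": {"id": "bbb222", "name": "Song B"}},
]})
id_string = ids_to_string(sample)
print(id_string)
```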

### 1.3 Getting the statistics

Then I just had to paste these strings into [Spotify's "Get Audio Features for Several Tracks" API endpoint](https://developer.spotify.com/console/get-audio-features-several-tracks/) and save the responses as my data files (data/good.json and data/dislike.json).

## 2. Data Features

From [Spotify's API documentation](https://developer.spotify.com/documentation/web-api/reference/#object-audiofeaturesobject) :

* **acousticness** : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
* **danceability** : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **duration_ms** : The duration of the track in milliseconds.
* **energy** : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
* **instrumentalness** : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
* **key** : The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
* **liveness** : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
* **loudness** : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
* **mode** : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
* **speechiness** : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
* **tempo** : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **time_signature** : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
* **valence** : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
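
To make a couple of these definitions concrete, here is a small sketch that turns the `key` integer into a note name and buckets `speechiness` using the thresholds quoted above. The helper names `key_name` and `describe_speechiness` are hypothetical, and returning "unknown" for values outside 0-11 is an assumption of this sketch.

```python
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def key_name(key: int) -> str:
    """Map a pitch-class integer (0-11) to its note name."""
    if 0 <= key < len(PITCH_CLASSES):
        return PITCH_CLASSES[key]
    return "unknown"  # assumption: anything outside 0-11 is treated as unknown

def describe_speechiness(value: float) -> str:
    """Bucket a speechiness value using the thresholds from the list above."""
    if value > 0.66:
        return "probably entirely spoken words"
    if value >= 0.33:
        return "may contain both music and speech"
    return "most likely music"

print(key_name(1), "|", describe_speechiness(0.45))
```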


And the variable to predict:

* **liked** : 1 for liked songs, 0 for disliked songs

## 3. Data Cleaning

We're going to:
* Load each JSON file
* Turn it into a dataframe
* Add a "liked" column
* Drop useless columns
* Shuffle the rows (so that liked and disliked songs are mixed)
* Save the result as a CSV file

### Load Data

In [None]:
import pandas as pd 
import numpy as np
import json

In [None]:
with open("../input/spotify-recommendation/good.json","r") as f:
    liked = json.load(f)
liked = pd.DataFrame(liked["audio_features"])
liked

In [None]:
with open("../input/spotify-recommendation/dislike.json","r") as f:
    disliked = json.load(f)
disliked = pd.DataFrame(disliked["audio_features"])
disliked

### Add the "liked" column

In [None]:
# Assign a scalar so the label matches the number of rows, whatever it is
liked["liked"] = 1
disliked["liked"] = 0

In [None]:
liked

In [None]:
disliked

In [None]:
data = pd.concat([liked,disliked])
data

### Drop useless columns

We're going to drop the identifiers and URLs, which say nothing about how a song sounds.

In [None]:
data.drop(["type","id","uri","track_href","analysis_url"],axis=1,inplace=True)
data

### Shuffle rows

After concatenation, all liked songs come first and all disliked songs come last. If the rows stayed in that order, a model that sees them sequentially could end up learning mostly from liked songs at first.

Shuffling prevents this from happening.

In [None]:
data = data.sample(frac=1)
data
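
A possible variation, not used above: passing `random_state` to `sample` makes the shuffle reproducible between runs, and `reset_index(drop=True)` discards the old row labels. The toy dataframe below is invented for illustration.

```python
import pandas as pd

# Toy data standing in for the real dataframe
toy = pd.DataFrame({"tempo": [90, 120, 150, 100], "liked": [1, 1, 0, 0]})

# Same seed, same order on every run; the index is renumbered from 0
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```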

### Save the dataframe as a csv file

In [None]:
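try:
    # Reuse the saved dataset if it already exists
    data = pd.read_csv("../input/spotify-recommendation/data.csv")
    print("Loaded existing file")
except FileNotFoundError:
    # Otherwise save the freshly built dataframe
    data.to_csv("../input/spotify-recommendation/data.csv", index=False)
    print("Saved file")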

## 4. Exploratory Data Analysis

As all features are numeric, we're simply going to look at their correlation with the liked column.

In [None]:
# Import the libraries we're going to use
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
corr = data.corr()[["liked"]]
fig, ax = plt.subplots(figsize=(10,10)) 
sns.heatmap(
    corr, 
    annot=True,
    ax=ax
);
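
As a complement to the heatmap, the same correlations can be read as a sorted list. This is only a sketch on an invented toy dataframe; the helper name `liked_correlations` is hypothetical.

```python
import pandas as pd

def liked_correlations(df: pd.DataFrame, target: str = "liked") -> pd.Series:
    """Pearson correlation of each numeric column with the target, sorted ascending."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values()

# Toy data, for illustration only
toy = pd.DataFrame({
    "energy": [0.9, 0.8, 0.2, 0.1],
    "acousticness": [0.1, 0.2, 0.9, 0.8],
    "liked": [1, 1, 0, 0],
})
ranking = liked_correlations(toy)
print(ranking)
```

The most negative features appear first and the most positive ones last.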

According to the correlations, I'm likely...
* To like songs with these traits:
    * danceable
    * high energy
    * loud
    * many words
    * fast
    * a high number of beats
    * slightly positive
* To dislike songs with these traits:
    * not very acoustic
    * low instrumentalness
    * short

## 5. Modelling

We're going to try several models:
* SVC with RBF kernel
* Random Forest Classifier
* KNN Classifier

### 5.1 Initial modelling

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split , GridSearchCV
from sklearn.metrics import accuracy_score , log_loss , roc_auc_score 

In [None]:
def evaluation(y_true, y_pred):
    # Returns (accuracy, log loss, ROC AUC), all computed from the hard 0/1 predictions
    return accuracy_score(y_true, y_pred), log_loss(y_true, y_pred), roc_auc_score(y_true, y_pred)

In [None]:
X , y = data.drop("liked",axis=1) , data.liked
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [None]:
np.random.seed(42)

svc = SVC(kernel="rbf")
svc.fit(X_train,y_train)
y_svc_pred = svc.predict(X_test)

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_rf_pred = rf.predict(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
y_knn_pred = knn.predict(X_test)
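
One thing that might matter here, although it was not tried in this notebook: an RBF-kernel SVC and KNN both rely on distances between samples, so a feature with large values such as `duration_ms` can dominate features that lie between 0 and 1. A common remedy is to standardize the features inside a pipeline. The sketch below uses invented toy data and hypothetical variable names.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy features on very different scales, mimicking duration_ms and danceability
rng = np.random.default_rng(0)
duration_ms = rng.normal(200_000, 40_000, size=200)
danceability = rng.uniform(0, 1, size=200)
X_toy = np.column_stack([duration_ms, danceability])
y_toy = (danceability > 0.5).astype(int)

# The scaler is fitted on the training data only, then applied before each model
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_svc.fit(X_toy[:150], y_toy[:150])
scaled_knn.fit(X_toy[:150], y_toy[:150])

svc_acc = scaled_svc.score(X_toy[150:], y_toy[150:])
knn_acc = scaled_knn.score(X_toy[150:], y_toy[150:])
print(svc_acc, knn_acc)
```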

In [None]:
class_svc , loss_svc , auc_svc = evaluation(y_test , y_svc_pred)
class_rf , loss_rf , auc_rf = evaluation(y_test ,y_rf_pred)
class_knn , loss_knn , auc_knn = evaluation(y_test , y_knn_pred)

In [None]:
scores = {
    "SVC":{
        "Accuracy":class_svc,
        "Loss":loss_svc,
        "AUC":auc_svc
    },
    "Random Forest":{
        "Accuracy":class_rf,
        "Loss":loss_rf,
        "AUC":auc_rf
    },
    "KNN":{
        "Accuracy":class_knn,
        "Loss":loss_knn,
        "AUC":auc_knn
    }
}
scores = pd.DataFrame(scores)
scores

In [None]:
scores.drop("Loss").plot.bar();

In [None]:
scores.drop(["Accuracy","AUC"]).plot.bar();

Given these results, I'm only going to tune the hyperparameters of the Random Forest and skip the other models.

### 5.2 Hyperparameter tuning

In [None]:
# I already ran the full search locally, so only part of the grid is included here to limit running time
np.random.seed(42)

rf = RandomForestClassifier(n_jobs=-1)
rf_grid = {
 'max_depth': [10,15],
 'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
 'min_samples_leaf': [1,3],
 'min_samples_split': [2,4],
 'n_estimators': [10,42,100]
}
rf_cv = GridSearchCV(rf,rf_grid,verbose=2,cv=3)
rf_cv.fit(X_train , y_train)

In [None]:
rf_cv.best_params_

In [None]:
cv_pred = rf_cv.predict(X_test)

In [None]:
cv_acc , cv_loss , cv_auc = evaluation(y_test,cv_pred)
print(cv_acc , cv_loss , cv_auc)

In [None]:
comp = {
    "Old":{
        "Accuracy":class_rf,
        "Loss":loss_rf,
        "AUC":auc_rf
    },
    "CV":{
        "Accuracy":cv_acc,
        "Loss":cv_loss,
        "AUC":cv_auc
    }
}
comp = pd.DataFrame(comp)
comp

In [None]:
comp.drop("Loss").plot.bar();

In [None]:
comp.drop(["Accuracy","AUC"]).plot.bar();

**I don't really understand why it didn't improve here; it did on my machine with the same code.**
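
One possible explanation, offered as a guess rather than a diagnosis: the test split holds only 39 songs (20% of 195), so one or two songs classified differently shift the accuracy by several points. Cross-validating both configurations over several folds gives a steadier comparison. The sketch below uses invented toy data of the same size, and the "tuned" values are simply the ones hard-coded later in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data with 195 rows, standing in for the real features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(195, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=195) > 0).astype(int)

# Same folds for both models, so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
default_rf = RandomForestClassifier(random_state=0)
tuned_rf = RandomForestClassifier(max_depth=15, min_samples_split=4,
                                  n_estimators=42, random_state=0)

default_scores = cross_val_score(default_rf, X_toy, y_toy, cv=cv)
tuned_scores = cross_val_score(tuned_rf, X_toy, y_toy, cv=cv)
print(f"default: {default_scores.mean():.3f} +/- {default_scores.std():.3f}")
print(f"tuned:   {tuned_scores.mean():.3f} +/- {tuned_scores.std():.3f}")
```

If the two means differ by less than their spread, the change is probably within noise.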

## 6. Testing on new data

In [None]:
# Reimporting libraries in case I just want to run this cell
import pickle
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import os
import json
import urllib.parse

data = pd.read_csv("../input/spotify-recommendation/data.csv")
X , y = data.drop("liked",axis=1) , data.liked

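try:
    # Reuse the saved model if there is one
    with open("./model.sav", "rb") as f:
        model = pickle.load(f)
except FileNotFoundError:
    # No saved model yet: train one on the full dataset and save it
    model = RandomForestClassifier(n_jobs=-1,
                                  max_depth=15,
                                  min_samples_leaf=1,
                                  min_samples_split=4,
                                  n_estimators=42)

    model.fit(X, y)

    with open("./model.sav", "wb") as f:
        pickle.dump(model, f)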

token = input("""Spotify token:

To create one, visit this page: https://developer.spotify.com/console/get-several-tracks/

Log in to your Spotify account, then copy the contents of the "OAuth Token" field: """)
query = input("\n\n\nName of the track and artist (be careful, the database is somewhat capricious): ")

query = urllib.parse.quote(query)
stream = os.popen(f'curl -X "GET" "https://api.spotify.com/v1/search?q={query}&type=track" -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer {token}"')
data = stream.read()
try :
    data = json.loads(data)["tracks"]["items"][0]
    song_id = data["id"]
    artist = data["artists"][0]["name"]
    title = data["name"]
    stream = os.popen(f'curl -X "GET" "https://api.spotify.com/v1/audio-features/{song_id}" -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer {token}"')
    data = stream.read()
    data = json.loads(data)
    data = pd.DataFrame(data,index=[0])
    data = data[X.columns]  # keep only the training features, in the training column order
    print(f"\n\n\n\nThere is a {model.predict_proba(data)[0][1]*100:.2f}% chance that Brice likes \"{title}\" by {artist}\n\n\n")
except KeyError:
    print("\n\n\nYour token has expired , create a new one : https://developer.spotify.com/console/get-several-tracks/\n\n\n")
except IndexError:
    print("\n\n\nWe didn't find the song you were looking for\n\n\n")
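
The cell above shells out to `curl` through `os.popen`. A possible alternative, sketched here and not part of the original code, is to build the same search request with the standard library's `urllib.request`, which avoids passing user input through a shell. The helper name `build_search_request`, the sample query, and the placeholder token are hypothetical; sending the request needs a valid token and network access, so that part is left as a comment.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.spotify.com/v1"

def build_search_request(query: str, token: str) -> urllib.request.Request:
    """Build a GET request for the track search endpoint used above."""
    params = urllib.parse.urlencode(
        {"q": query, "type": "track"},
        quote_via=urllib.parse.quote,  # encode spaces as %20, like urllib.parse.quote above
    )
    return urllib.request.Request(
        f"{API_ROOT}/search?{params}",
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )

request = build_search_request("Around the World Daft Punk", "PLACEHOLDER_TOKEN")
print(request.full_url)

# To send it (requires a valid token):
# import json
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```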