# How to make a personal DJ in 5 simple steps

Part 1: Setup
After importing the necessary libraries, we set up the spotipy client credentials. SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are meant to be saved in a secrets.env file; for now we have put our credentials directly in the notebook. Having set up the wrapper for the Spotify Web API, we needed to create a dataframe and populate it.

In [8]:
import os
import pandas as pd
import matplotlib
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [9]:
#Spotify API Setup
SPOTIPY_CLIENT_ID = "236e81909708434598e63e00fe671955"
SPOTIPY_CLIENT_SECRET = "5d574a1eb8f940b783b72b00c5eb4658"
cc = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=cc)
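
Keeping the two values in a secrets.env file, as described above, could look roughly like the sketch below. It assumes the file holds plain KEY=VALUE lines; the helper name load_env_file and the placeholder values are ours for illustration and are not part of spotipy.

```python
import os
import tempfile


def load_env_file(path):
    """Read KEY=VALUE pairs from a file, skipping blank lines and # comments."""
    values = {}
    with open(path) as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip("\"'")
    return values


# Demonstration with a throwaway file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, "secrets.env")
    with open(env_path, "w") as handle:
        handle.write("# Spotify API credentials\n")
        handle.write('SPOTIPY_CLIENT_ID="placeholder-id"\n')
        handle.write("SPOTIPY_CLIENT_SECRET=placeholder-secret\n")
    secrets = load_env_file(env_path)

client_id = secrets["SPOTIPY_CLIENT_ID"]
client_secret = secrets["SPOTIPY_CLIENT_SECRET"]
print(sorted(secrets))
```

The two values could then be passed to SpotifyClientCredentials in place of the string literals, which keeps the secret out of the notebook itself.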

Part 2: Data Population and data visualization
To create a "DJ" that recommends songs to your liking, the DJ must first be trained on some data. We decided to train a model on audio feature data. All of the audio features are listed below in the array "columns", along with other useful information like "track_id" and "song". To populate the dataframe we created two helper methods: song_to_df, which returns one row of data holding a song and its audio features, and playlist_to_df, which returns a dataframe of all the songs in a playlist.

In [10]:
#Create music dataframe of party songs and non party songs
columns = ["track_id", "song", "acousticness", "danceability", "energy",
            "instrumentalness", "liveness", "loudness", "speechiness",
            "valence", "tempo", "party"]
music = pd.DataFrame(columns=columns)

In [11]:
#Two playlists that we subjectively labeled as "party songs" and "non-party songs"
#Here's the link to the party playlist we used: https://open.spotify.com/playlist/5ge2YqUbZrmqd2Mve8Uezf?si=m7ms4-EfQj6VFBsN5o-BbA
party_playlist_id = "5ge2YqUbZrmqd2Mve8Uezf?si=VVFB-RkdQMOpy1BffTeozQ"
#Here's the link for the non-party playlist we created to train with: https://open.spotify.com/playlist/5hCRFgctanZE1v1XzTDim4?si=tVGyL48TS-uv_TRHL_FywA
non_party_playlist_id = "5hCRFgctanZE1v1XzTDim4?si=M3NmQZrwTJOElCUKVYCzZg"

In [12]:
def song_to_df(track_id, song_name, party):
    song_features = sp.audio_features(track_id)[0]
    song_data = {'track_id': track_id,
                 'song': song_name,
                 'acousticness': song_features.get("acousticness"),
                 'danceability': song_features.get("danceability"),
                 'energy': song_features.get("energy"),
                 'instrumentalness': song_features.get("instrumentalness"),
                 'liveness': song_features.get("liveness"),
                 'loudness': song_features.get("loudness"),
                 'speechiness': song_features.get("speechiness"),
                 'valence': song_features.get("valence"),
                 'tempo': song_features.get("tempo"),
                 'party': party
                }
    return song_data

In [13]:
def playlist_to_df(df, playlist_id, party):
    songs = sp.playlist_tracks(playlist_id).get("tracks").get("items")
    #Collect rows in a list; DataFrame.append was removed in pandas 2.0
    rows = []
    for song in songs:
        track = song.get("track")
        song_name = track.get("name")
        track_id = track.get("id")
        rows.append(song_to_df(track_id, song_name, party))

        #Get five more songs that Spotify says are like this song
        recommendations = sp.recommendations(seed_tracks=[track_id], limit = 5).get("tracks")
        for recommendation in recommendations:
            r_song_name = recommendation.get("name")
            r_track_id = recommendation.get("id")
            rows.append(song_to_df(r_track_id, r_song_name, party))
    #Add all new rows to the existing dataframe in one step
    return pd.concat([df, pd.DataFrame(rows, columns=df.columns)], ignore_index=True)


In [17]:
#Populate the dataframe with both playlists, then shuffle the rows
music = playlist_to_df(music, party_playlist_id, 1)
music = playlist_to_df(music, non_party_playlist_id, 0)
music = music.sample(frac = 1)
music = music.reset_index(drop=True)
print(music)

                    track_id  \
0     3FskQrDXcY24ur2fCvz35O   
1     3hyKSdJDcYJQgRp3kMrfcs   
2     0Sd0kdgU6HrIclxYjuV99j   
3     1XFHbzTikXks9CsMq4v8Q3   
4     3WyRgi8CzQnhzO0xw79tTS   
...                      ...   
1315  3PYx9Wte3jwb48V0wArMOy   
1316  6MWtB6iiXyIwun0YzU6DFP   
1317  3JIqY33zQzdSgmcsMiAhUy   
1318  1mC2UjWt25Oixtqu7C6suL   
1319  4knL4iPxPOZjQzTUlELGSY   

                                                   song  acousticness  \
0                                                    Ye       0.01810   
1                                     Floss In The Bank       0.10500   
2                                            Break Shit       0.20900   
3                               Not So Bad (feat. Emie)       0.00903   
4                                             Goin Baby       0.12800   
...                                                 ...           ...   
1315                                       Keanu Reeves       0.05570   
1316                           
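
One simple way to compare audio features between the two classes is to average each feature per class. The sketch below does this on a few made-up rows standing in for the music dataframe; the values are illustrative only and are not taken from our playlists.

```python
import pandas as pd

feature_columns = ["danceability", "energy", "valence", "tempo"]

# Made-up rows standing in for the real `music` dataframe.
sample = pd.DataFrame(
    {
        "song": ["a", "b", "c", "d"],
        "danceability": [0.85, 0.78, 0.40, 0.35],
        "energy": [0.90, 0.80, 0.30, 0.45],
        "valence": [0.70, 0.65, 0.20, 0.30],
        "tempo": [128.0, 120.0, 80.0, 95.0],
        "party": [1, 1, 0, 0],
    }
)

# Columns filled row by row from dicts can end up with object dtype,
# so cast the features to float before aggregating.
numeric = sample[feature_columns].astype(float)

# Mean of every feature for non-party (0) and party (1) songs.
class_means = numeric.groupby(sample["party"]).mean()
print(class_means)
```

Features whose means differ a lot between the two rows are the ones most likely to help a classifier separate party songs from non-party songs. Tempo is on a very different scale from the other features, which is one reason to normalize before training.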

Part 3: Training models
Having visualized the data, we understood a little more about which audio features make a "party song". Next we wanted to choose the model that would most accurately predict whether a song is a party song or not. We framed this as binary classification: the model outputs 1 if the song is a party song and 0 if it is not. To train it, we passed in a two-dimensional dataframe of audio feature data as the features and a single-column target, "party", holding a 1 or a 0 depending on whether we thought the song was a party song. Since we are building a classifier, we looked at classification models: K nearest neighbors, logistic regression, a random forest classifier, and a support vector machine. Overall, these models were trained on our subjective opinion of what a party song is, to predict whether a given song counts as a party song.

In [15]:
#Splitting dataframe into train and testing data
features = music.drop(["track_id", "song", "party"], axis = 1)
target = music["party"]
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)
y_train = y_train.astype("int")
y_test = y_test.astype("int")

#Normalizing values
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

#Training MODELS
#KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors=5)
kNN.fit(x_train, y_train)

#Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)

#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=2, random_state=0)
rfc.fit(x_train, y_train)

#Support vector machine
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(x_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
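
The cell above scales the data and fits each model on a single 70/30 split. One way to check that a comparison does not depend on that particular split is cross-validation with the scaler inside a pipeline. The sketch below is not what we ran: it uses synthetic, imbalanced data from make_classification as a stand-in for the audio features, and the fold count and max_iter value are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the audio feature table (9 numeric features).
X, y = make_classification(
    n_samples=600,
    n_features=9,
    n_informative=5,
    weights=[0.1, 0.9],
    random_state=42,
)

# The pipeline refits the scaler on each training fold, so no statistics
# from the held-out fold leak into the scaling step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly equal in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
print(scores.mean(), scores.std())
```

Stratification matters when one class is rare, because a plain random split can leave very few examples of the minority class in the test data.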

Part 4: Analyzing and comparing models
We arbitrarily chose which models to test. To compare them, we looked at a few metrics for each. First, we checked how closely each model's predictions matched the y_test data using the accuracy score. Then, to gain more insight and confirm or contradict the accuracy score, we looked at the classification report and confusion matrix.

In [16]:
#Metrics for K nearest neighbors
y_pred_kNN = kNN.predict(x_test)
print("Accuracy" + str(accuracy_score(y_test, y_pred_kNN)))
print(classification_report(y_test, y_pred_kNN))
print(confusion_matrix(y_test, y_pred_kNN))

#Metrics for Logistic regression
y_pred_lr = lr.predict(x_test)

print("Accuracy" + str(accuracy_score(y_test, y_pred_lr)))
print(classification_report(y_test, y_pred_lr))
print(confusion_matrix(y_test, y_pred_lr))

#Metrics for Random Forest Classifier
y_pred_rfc = rfc.predict(x_test)

print("Accuracy" + str(accuracy_score(y_test, y_pred_rfc)))
print(classification_report(y_test, y_pred_rfc))
print(confusion_matrix(y_test, y_pred_rfc))

#Metrics for SVM
y_pred_svm = svm.predict(x_test)

print("Accuracy" + str(accuracy_score(y_test, y_pred_svm)))
print(classification_report(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))

Accuracy0.9747474747474747
              precision    recall  f1-score   support

           0       0.80      0.73      0.76        11
           1       0.98      0.99      0.99       187

    accuracy                           0.97       198
   macro avg       0.89      0.86      0.87       198
weighted avg       0.97      0.97      0.97       198

[[  8   3]
 [  2 185]]
Accuracy0.98989898989899
              precision    recall  f1-score   support

           0       0.85      1.00      0.92        11
           1       1.00      0.99      0.99       187

    accuracy                           0.99       198
   macro avg       0.92      0.99      0.96       198
weighted avg       0.99      0.99      0.99       198

[[ 11   0]
 [  2 185]]
Accuracy0.9797979797979798
              precision    recall  f1-score   support

           0       0.89      0.73      0.80        11
           1       0.98      0.99      0.99       187

    accuracy                           0.98       198
   

Part 5: Choosing a model
To choose a model for our DJ, we wanted the most accurate and most precise model. After looking through the accuracy scores, classification reports, and confusion matrices, we chose logistic regression as the model for our DJ. K nearest neighbors and the random forest had accuracies of about 97 and 98 percent, while logistic regression reached 99 percent. Looking at the precision for the party class, logistic regression scored 1.00: every song it labeled as a party song in the test set really was one. Its only errors were two party songs misclassified as non-party, so the DJ may occasionally leave out a party song but, on this test set, did not return any non-party songs.
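
The accuracy, precision, and recall for logistic regression can be re-derived by hand from its confusion matrix printed in Part 4. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are ours.

```python
import numpy as np

# Logistic regression confusion matrix from Part 4.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = np.array([[11, 0],
               [2, 185]])

tn, fp = cm[0]  # true non-party songs: predicted non-party, predicted party
fn, tp = cm[1]  # true party songs: predicted non-party, predicted party

accuracy = (tp + tn) / cm.sum()
precision_party = tp / (tp + fp)  # share of predicted party songs that are party songs
recall_party = tp / (tp + fn)     # share of actual party songs that were found

print(f"accuracy={accuracy:.4f} precision={precision_party:.4f} recall={recall_party:.4f}")
```

With no false positives, precision for the party class is exactly 1. The two false negatives are what pull recall and accuracy slightly below 1, which matches the 0.99 values in the classification report.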