# Part 4: Predictions on Separate Dataset

In this final part, I apply the most successful model from the previous part, the random forest classifier, to a separate set of test data.

## 1. Setting Up Training Data

I first load all of the data from the previous parts to use as the training data.

In [1]:
import numpy as np
import pandas as pd

In [15]:
# Read the 12 training playlist files (playlist0.json to playlist11.json)
df_list = []
for i in range(12):
    filename = 'playlist' + str(i) + '.json'
    df_list.append(pd.read_json(filename))

songs = pd.concat(df_list)
songs = songs.reset_index()
songs = songs.drop_duplicates()
# Drop the helper index and the non-feature columns
songs = songs.drop(['index', 'id', 'title'], axis=1)

In [16]:
songs.head()

Unnamed: 0,danceability,energy,key,loudness,mode,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,num_bars,num_sections,num_segments,class
0,0.521,0.673,8,-8.685,1,0.00573,0.0,0.12,0.543,108.031,225947,4,100,8,830,1
1,0.735,0.849,4,-4.308,0,0.212,2.9e-05,0.0608,0.223,125.972,207477,4,107,7,999,1
2,0.614,0.547,5,-11.772,0,0.767,0.893,0.0919,0.226,92.99,172000,4,65,8,698,1
3,0.769,0.817,0,-4.092,0,0.0268,9.9e-05,0.084,0.866,139.979,218615,4,125,10,921,1
4,0.623,0.793,11,-6.63,0,0.000397,0.0015,0.375,0.36,98.998,274213,4,112,9,1137,1


## 2. Setting Up Test Data

I then read the test data from the separate test playlists. Note that the test .json files use a different naming format ('playlist_test' instead of 'playlist'), for which I slightly modified the 'extractor.py' script. I ran the script at different times for the training and test data, and without the change the test .json files would have overwritten the training .json files.

In [19]:
df_list = []
for i in range(0,2):
    filename = 'playlist_test' + str(i) + '.json'
    df_list.append(pd.read_json(filename))

songs_test = pd.concat(df_list)
songs_test = songs_test.reset_index()
songs_test = songs_test.drop_duplicates()
songs_test = songs_test.drop(['index', 'id'], axis=1) # keeping title to self-check data once predictions are done

In [20]:
songs_test.head()

Unnamed: 0,title,danceability,energy,key,loudness,mode,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,num_bars,num_sections,num_segments,class
0,Like Silk,0.748,0.321,7,-9.464,1,0.389,0.00191,0.107,0.245,76.965,124675,4,39,6,506,1
1,Lion Heart,0.754,0.788,2,-3.175,1,0.129,0.00274,0.0775,0.961,124.992,224258,4,115,10,841,1
2,Love Alone,0.596,0.817,8,-3.769,1,0.000192,3e-06,0.0619,0.445,130.927,211647,4,113,8,793,1
3,Love Lies (with Normani),0.708,0.648,6,-5.626,1,0.0956,0.0,0.134,0.338,143.955,201707,4,120,14,747,1
4,Love Like That,0.761,0.407,2,-6.616,1,0.343,0.0306,0.308,0.854,154.049,214400,4,136,8,888,1
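
Since the training and test cells differ only in the file prefix and the number of files, the loading step could be wrapped in one function. The following is a minimal sketch, not the code used in this notebook: the name `load_playlists` and its `directory` parameter are hypothetical. It also passes `ignore_index=True` to `pd.concat`, so duplicate songs are compared on their content only, which differs slightly from the cells above.

```python
import os
import pandas as pd

def load_playlists(prefix, count, directory='.'):
    """Read <prefix>0.json ... <prefix>{count-1}.json into one DataFrame.

    Hypothetical helper: the name and parameters are assumptions.
    """
    frames = []
    for i in range(count):
        path = os.path.join(directory, f'{prefix}{i}.json')
        frames.append(pd.read_json(path))
    # ignore_index=True discards the per-file row labels before deduplicating
    songs = pd.concat(frames, ignore_index=True)
    return songs.drop_duplicates().reset_index(drop=True)
```

With this helper, the two datasets would be loaded as `load_playlists('playlist', 12)` and `load_playlists('playlist_test', 2)`.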


## 3. Applying Random Forest Classifier Model

Instead of splitting one dataset into train and test sets, I used all of the training data to train the model and all of the test data to evaluate it.

In [23]:
X_train = songs.drop('class', axis=1)
X_test = songs_test.drop(['class','title'], axis=1)
y_train = songs['class']
y_test = songs_test['class']
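
Because the two datasets come from different files, it may be worth confirming that both feature tables have the same columns in the same order before fitting. As far as I know, the scikit-learn version used here treats features by position, so a reordered column would go unnoticed. The sketch below uses a hypothetical helper, `check_feature_alignment`, and tiny stand-in frames built from a few values in the tables above.

```python
import pandas as pd

def check_feature_alignment(X_train, X_test):
    """Return True if both frames have the same columns in the same order."""
    return list(X_train.columns) == list(X_test.columns)

# Stand-in frames: first rows of the training and test tables, three features only
train_demo = pd.DataFrame({'danceability': [0.521], 'energy': [0.673], 'key': [8]})
test_demo = pd.DataFrame({'energy': [0.321], 'danceability': [0.748], 'key': [7]})

aligned_before = check_feature_alignment(train_demo, test_demo)

# Reorder the test columns to match the training columns
test_demo = test_demo[train_demo.columns]
aligned_after = check_feature_alignment(train_demo, test_demo)
```

In the notebook itself, the same check would be `check_feature_alignment(X_train, X_test)`.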

In [12]:
from sklearn.metrics import confusion_matrix, classification_report

In [24]:
from sklearn.ensemble import RandomForestClassifier
rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [34]:
rfc_pred = rfc_model.predict(X_test)
print(confusion_matrix(y_test, rfc_pred))
print('\n')
print(classification_report(y_test, rfc_pred))

[[37  5]
 [11 31]]


             precision    recall  f1-score   support

          0       0.77      0.88      0.82        42
          1       0.86      0.74      0.79        42

avg / total       0.82      0.81      0.81        84
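
The figures in the report can be recomputed by hand from the confusion matrix, where rows are the actual classes and columns are the predicted classes. The sketch below only uses the four counts printed above; the variable names are my own.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[37, 5],
               [11, 31]])

true_pos = np.diag(cm)                    # correctly predicted songs per class
precision = true_pos / cm.sum(axis=0)     # divide by column totals (predicted counts)
recall = true_pos / cm.sum(axis=1)        # divide by row totals (actual counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()      # 68 of 84 songs

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(accuracy, 2))
```

For class 0, for example, precision is 37 / (37 + 11) ≈ 0.77 and recall is 37 / (37 + 5) ≈ 0.88, which agrees with the report.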



The results are actually better than those in Part 3, with a noticeable improvement in the precision for class 0. Below, I create a dataframe with the title and the actual and predicted classes to see how the model performed on individual songs.

In [48]:
songs_test_pred = songs_test[['title','class']]
songs_test_pred.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84 entries, 0 to 83
Data columns (total 2 columns):
title    84 non-null object
class    84 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.0+ KB


In [49]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
songs_test_pred = songs_test_pred.assign(pred_class=rfc_pred)


In [52]:
songs_test_pred.sort_values('title')

Unnamed: 0,title,class,pred_class
35,A Million Miles,1,1
37,LOSER,1,1
5,LOVE ME RIGHT,1,1
7,LOVE SCENARIO,1,0
42,Lead Me On,0,0
43,Learned From Texas,0,1
54,Left Side Of Leavin',0,0
65,Let It,0,0
76,Let It Be Mine,0,0
79,Let It Fly (feat. Travis Scott),0,1
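
To look only at the songs the model got wrong, the rows where the actual and predicted classes differ can be filtered out. This is a small sketch on a stand-in frame holding the ten rows shown above; in the notebook the same mask would be applied to `songs_test_pred`.

```python
import pandas as pd

# Stand-in for songs_test_pred, using the ten rows displayed above
demo = pd.DataFrame({
    'title': ['A Million Miles', 'LOSER', 'LOVE ME RIGHT', 'LOVE SCENARIO',
              'Lead Me On', 'Learned From Texas', "Left Side Of Leavin'",
              'Let It', 'Let It Be Mine', 'Let It Fly (feat. Travis Scott)'],
    'class':      [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'pred_class': [1, 1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Keep only the rows where the prediction disagrees with the actual class
misclassified = demo[demo['class'] != demo['pred_class']]
print(misclassified)
```

In this excerpt, three of the ten songs are misclassified: one liked song predicted as disliked and two disliked songs predicted as liked.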


## 4. Conclusion

The random forest classifier seems to be fairly accurate in predicting whether I will like a certain song based on the given data: on the separate test set it classified 68 of the 84 songs (81%) correctly. The results could potentially improve with a better-tuned estimator (the model above uses the default settings, including only 10 trees) or with better data. One way to get better data would be to change how I label the songs. Instead of a binary like/dislike class, I could rate each song from 1 to 5 and check whether the ratings form clusters in the feature space. The model could then predict how much I like or dislike a song, rather than only whether I like it.
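
As one possible way to tune the estimator, a small grid search over the number of trees and the tree depth could be tried. This is only a sketch under assumptions: the data is synthetic, generated by `make_classification` with 15 features to mirror the number of feature columns, and the parameter grid is an arbitrary choice, not something evaluated in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 samples, 15 features, 2 classes
X_demo, y_demo = make_classification(n_samples=200, n_features=15, random_state=0)

# Illustrative grid; the values are arbitrary starting points
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5]}

# 3-fold cross-validated search over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the tuned model would still be evaluated once on the separate test set.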