# Spot the Classifiers
## By Kathrine Gibson and Lucy Tibbetts
For our final project, we were interested in how different song attributes affect song popularity. To investigate this, we used [this dataset from Kaggle](https://www.kaggle.com/tomigelo/spotify-audio-features). The dataset contains the artist name, track id, track name, audio features, and popularity of over 116k songs on Spotify. Popularity was based on the number of plays (as of December 3, 2018) on a scale of 0 to 100. There were thirteen audio features: acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time signature, and valence. All of this data was taken from Spotify's Web API. We found that our k-NN classifier was the most accurate at predicting popularity, followed by our decision tree classifier and our Naive Bayes classifier.

To determine which audio features to use, we created graphs of each feature and popularity:

In [6]:
import matplotlib.pyplot as plt
import numpy as np
import utils



#grid of graphs goes here

After looking at the graphs, we decided not to use key, mode, or time signature. 

This dataset didn't need much cleaning, but we chose to remove the first three columns of the table since many of the track names included commas. Furthermore, the graphs above were created using every song in the dataset, but since our classifiers took significantly longer to run, we created a file which held only the first 2,000 songs. To check that this subset was representative, we also graphed the data from this file and made sure the graphs were similar to those made from the entire dataset. Note, however, that the dataset itself is only a sample of all the songs available on Spotify.

### k-NN Classifier

We next implemented our own k-NN classifier to predict a song's popularity using the ten interesting audio features. Most of the audio features are already on a 0 to 1 scale, but we had to normalize duration, loudness, and tempo. Popularity was discretized into >=0.25, >=0.50, >=0.75, >=1.00 prior to classification.
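
The helpers `utils.normalize` and `utils.discretize_popularity` are not shown in this write-up. As a rough sketch of what such helpers might do, the block below assumes min-max scaling and four equal-width popularity bins labelled 1 to 4. Both function names and the bin cutoffs are hypothetical, not taken from our `utils` module.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the 0-1 range (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_popularity_sketch(popularity):
    """Map a 0-100 popularity score to one of four bins (assumed cutoffs)."""
    if popularity <= 25:
        return 1
    if popularity <= 50:
        return 2
    if popularity <= 75:
        return 3
    return 4


tempos = [60.0, 120.0, 180.0]
print(min_max_normalize(tempos))          # [0.0, 0.5, 1.0]
print(discretize_popularity_sketch(42))   # 2
```

Scaling matters for k-NN because an unscaled column such as tempo would otherwise dominate the distance between two songs.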

In [7]:
trimmed_data = []
utils.read_file_to_table("small_audio_data.csv", trimmed_data, [0, 1, 2, 3, 4, 6, 7, 9, 10, 12, 13])

duration = utils.get_column(trimmed_data, 2)
normalized_duration = utils.normalize(duration)
loudness = utils.get_column(trimmed_data, 6)
normalized_loudness = utils.normalize(loudness)
tempo = utils.get_column(trimmed_data, 8)
normalized_tempo = utils.normalize(tempo)

for i in range(len(trimmed_data)):
    trimmed_data[i][2] = normalized_duration[i]
    trimmed_data[i][6] = normalized_loudness[i]
    trimmed_data[i][8] = normalized_tempo[i]
    trimmed_data[i][-1] = utils.discretize_popularity(trimmed_data[i][-1])
    

After normalizing our data, we created ten stratified cross folds in order to check the accuracy of our k-NN classifier.
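
For readers unfamiliar with stratified folds, here is a minimal sketch of the idea: rows are grouped by class and dealt round-robin into folds so that each fold keeps roughly the same class proportions. The function names below are hypothetical stand-ins, and the real `utils.stratified_cross_folds` and `utils.set_up_train_test` may differ in their details.

```python
from collections import defaultdict


def stratified_folds_sketch(table, k):
    """Split rows into k folds, keeping class proportions roughly equal.

    The class label is assumed to be the last value in each row.
    """
    by_class = defaultdict(list)
    for row in table:
        by_class[row[-1]].append(row)

    folds = [[] for _ in range(k)]
    position = 0
    # deal each class's rows round-robin so every fold gets a share
    for label in sorted(by_class):
        for row in by_class[label]:
            folds[position % k].append(row)
            position += 1
    return folds


def train_test_from_folds(test_index, folds):
    """Use one fold as the test set and the remaining folds as the training set."""
    test = folds[test_index]
    train = [row for i, fold in enumerate(folds) if i != test_index for row in fold]
    return train, test


demo_rows = [[0.1, 1], [0.2, 1], [0.3, 1], [0.4, 1], [0.8, 2], [0.9, 2]]
demo_folds = stratified_folds_sketch(demo_rows, 2)
demo_train, demo_test = train_test_from_folds(0, demo_folds)
print(len(demo_train), len(demo_test))  # 3 3
```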

In [8]:
folds = utils.stratified_cross_folds(trimmed_data, 10)
num_correct = 0
for i in range(0, 10):
    train, test = utils.set_up_train_test(i, folds)
    actual_popularities = [x[-1] for x in test]
    predicted_popularities = utils.knn_classifier(train, test)
    for idx in range(len(test)):
        if actual_popularities[idx] == predicted_popularities[idx]:
            num_correct += 1
accuracy = num_correct / len(trimmed_data)
print("Accuracy kNN: " + str(round(accuracy * 100, 2)) + "%")

Accuracy kNN: 74.9%


As you can see, the accuracy of our k-NN classifier on the first 2,000 instances in our dataset was about 74.9%. For this, we used k = 8; in our experiments, changing k did not cause any significant change in the accuracy of the classifier.
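
To make the prediction step concrete, the following is a minimal k-NN sketch, assuming Euclidean distance over the numeric features and a simple majority vote among the k nearest training rows. The function name is hypothetical, and our `utils.knn_classifier` is not reproduced here, so treat this as an illustration rather than the exact implementation.

```python
import math
from collections import Counter


def knn_predict_sketch(train, instance, k=8):
    """Predict the class of `instance` by majority vote of its k nearest rows.

    Every value except the last in a row is treated as a numeric feature;
    the last value is the class label.
    """
    def distance(row):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row[:-1], instance[:-1])))

    nearest = sorted(train, key=distance)[:k]
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]


train_rows = [[0.1, 0.2, 1], [0.2, 0.1, 1], [0.15, 0.15, 1], [0.9, 0.8, 2], [0.8, 0.9, 2]]
unseen = [0.12, 0.18, None]  # class label unknown
print(knn_predict_sketch(train_rows, unseen, k=3))  # 1
```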

### Ensemble Classifier

To see whether we could improve on this accuracy, we implemented a k-NN classifier ensemble with five weak learners. Each learner used a different subset of four attributes and generated its prediction for a given instance from a different training set. A single prediction for each instance was then decided by simple majority voting.

In [9]:
num_correct_ensemble = 0
for i in range(10):
    train, test = utils.set_up_train_test(i, folds)
    actual_popularities = [x[-1] for x in test]
    predicted_popularities = []
    for instance in test:
        predictions = []
        for j in range(6):
            # each learner gets its own slice of four training instances
            training_subset = train[j:j+4]
            prediction = utils.compute_class_knn(instance, training_subset)
            predictions.append(prediction)
        # use simple majority voting
        np_arr = np.array(predictions)
        majority_vote = np.bincount(np_arr).argmax()
        predicted_popularities.append(majority_vote)
    for idx in range(len(test)):
        if predicted_popularities[idx] == actual_popularities[idx]:
            num_correct_ensemble += 1
accuracy_ensemble = num_correct_ensemble / len(trimmed_data)
print("Accuracy ensemble kNN: " + str(round(accuracy_ensemble * 100, 2)) + "%")

Accuracy ensemble kNN: 43.1%


Surprisingly, accuracy was about 30 percentage points lower with our ensemble classifier than with our single classifier. This could be because each classifier in the ensemble used fewer attributes than the single classifier did.
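
One way to sketch the ensemble design described in this section (five learners, each with its own four-attribute subset and its own training subset, combined by majority vote) is shown below. This is an illustrative sketch, not the cell we ran: the function names are hypothetical, and the deterministic choice of attribute windows and every-fifth-row training subsets is an assumption made to keep the example reproducible.

```python
import math
from collections import Counter


def knn_on_attributes(train, instance, att_indexes, k=3):
    """k-NN vote that measures distance only over the given attribute indexes."""
    def distance(row):
        return math.sqrt(sum((row[a] - instance[a]) ** 2 for a in att_indexes))

    nearest = sorted(train, key=distance)[:k]
    return Counter(row[-1] for row in nearest).most_common(1)[0][0]


def ensemble_predict_sketch(train, instance, n_learners=5, n_atts=4, k=3):
    """Majority vote over learners with different attribute and training subsets."""
    n_features = len(train[0]) - 1
    votes = []
    for j in range(n_learners):
        # learner j uses a window of attributes starting at j (wrapping around)
        atts = [(j + t) % n_features for t in range(min(n_atts, n_features))]
        # and every n_learners-th training row, starting at row j
        subset = train[j::n_learners]
        votes.append(knn_on_attributes(subset, instance, atts, k))
    return Counter(votes).most_common(1)[0][0]


# toy data: ten "low" songs of class 1 and ten "high" songs of class 2
low = [[0.1 + 0.01 * i] * 5 + [1] for i in range(10)]
high = [[0.9 - 0.01 * i] * 5 + [2] for i in range(10)]
toy_train = low + high
print(ensemble_predict_sketch(toy_train, [0.1] * 5 + [None]))  # 1
```

Using an odd number of learners reduces the chance of tied votes, although with four classes a tie between two classes is still possible.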

### Scikit-learn

We then compared our accuracy results to the accuracy generated using scikit-learn kNN:

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

df = pd.DataFrame(trimmed_data)
X = np.array(df.iloc[:, 0:10])  # features (columns 0 through 9)
y = np.array(df.iloc[:, 10])  # class label (popularity)

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
print("Scikit-learn accuracy (kNN): " + str(round(accuracy_score(y_test, prediction) * 100, 2)) + "%")

Scikit-learn accuracy (kNN): 74.4%


The accuracy of scikit-learn's k-NN (74.4%) was fairly similar to the 74.9% we got from our own single k-NN classifier.

### Decision Trees

With k-NN done, we moved on to an implementation of a TDIDT decision tree classifier. Popularity was already discretized, all attributes were already normalized, and our fingers were ready to pitter-patter, so we began by declaring the variables which the classifier would need:


In [13]:
col_names = ["acousticness", "danceability", "duration", "energy",
             "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "popularity"]

# all possible values for each attribute
att_domains = {0: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               1: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               2: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               3: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               4: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               5: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               6: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               7: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               8: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               9: [">=0.25", ">=0.50", ">=0.75", ">=1.0"],
               10: [">=25", ">=50", ">=75", ">=100"]}

class_index = len(col_names) - 1

# att_indexes is a list of attributes to use for building the tree
att_indexes = list(range(len(col_names) - 1))

It's looking beautiful, so now let's run our classifier using the ten stratified cross folds previously created:

In [15]:
import tree_utils
num_correct = 0
for i in range(0, 10):  
    train, test = utils.set_up_train_test(i, folds)
    actual_popularities = [x[-1] for x in test]
    att_indexes = list(range(len(col_names) - 1))
    predicted_popularities = tree_utils.tree_classifier(
        train, test, att_indexes, att_domains, class_index, col_names)
    for idx in range(len(test)):
        if actual_popularities[idx] == predicted_popularities[idx]:
            num_correct += 1
accuracy = num_correct / len(trimmed_data)
print("Accuracy Decision Tree: " + str(round(accuracy * 100, 2)) + "%")

Accuracy Decision Tree: 52.2%


Wonderful! Our accuracy was nowhere near as high as the accuracy of our single k-NN classifier, but it was higher than the accuracy of our ensemble classifier. This is about the result we expected, since decision trees are better suited to categorical attributes and most of our audio features are continuous.
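
TDIDT builds the tree by repeatedly choosing an attribute to split on. The write-up does not show how `tree_utils` makes that choice, so the sketch below assumes the common entropy-based rule: pick the attribute whose split leaves the lowest weighted entropy, which is the same as the highest information gain. All function names here are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def weighted_entropy_after_split(table, att_index, class_index=-1):
    """Average entropy of the partitions produced by splitting on one attribute."""
    partitions = defaultdict(list)
    for row in table:
        partitions[row[att_index]].append(row[class_index])
    total = len(table)
    return sum(len(part) / total * entropy(part) for part in partitions.values())


def select_attribute(table, att_indexes, class_index=-1):
    """Pick the attribute whose split leaves the least entropy."""
    return min(att_indexes, key=lambda a: weighted_entropy_after_split(table, a, class_index))


toy_rows = [
    [">=0.25", ">=0.50", 1],
    [">=0.25", ">=1.0", 1],
    [">=0.75", ">=0.50", 2],
    [">=0.75", ">=1.0", 2],
]
print(select_attribute(toy_rows, [0, 1]))  # 0
```

In the toy table, attribute 0 separates the two classes perfectly while attribute 1 tells us nothing, so attribute 0 is selected. Binning a continuous feature into four ranges discards detail within each range, which is one plausible reason a tree does worse than k-NN on this data.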

### Naive Bayes

Now it was time for our old friend, Naive Bayes. To test the accuracy of our Naive Bayes classifier, we used the same procedure as for our k-NN and decision tree classifiers; we found the total number of correct guesses from ten stratified cross folds and divided by the total number of instances. 

In [17]:
num_correct_bayes = 0
for i in range(0, 10):
    train, test = utils.set_up_train_test(i, folds)
    priors = utils.compute_probabilities(train)
    actual_popularities_bayes = [x[-1] for x in test]
    predicted_popularities_bayes = []
    for instance in test:
        predicted_popularity_bayes = utils.naive_bayes_classifier(
            priors, instance, train)
        predicted_popularities_bayes.append(predicted_popularity_bayes)
    for idx in range(len(test)):
        if actual_popularities_bayes[idx] == predicted_popularities_bayes[idx]:
            num_correct_bayes += 1
accuracy_bayes = num_correct_bayes / len(trimmed_data)
print("Accuracy Naive Bayes: " + str(round(accuracy_bayes * 100, 2)) + "%")

Accuracy Naive Bayes: 43.1%


As shown above, the accuracy of our Naive Bayes classifier was 43.1%, the same as our ensemble classifier. We predicted that this classifier would result in the lowest accuracy and, lucky us, we were right! We attribute this to the conditional independence assumption: Naive Bayes treats the audio features as independent of one another given the popularity class, and we did not expect them to be wholly independent.
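
To show where that assumption enters, here is a minimal Naive Bayes sketch for categorical attribute values. It is an illustration with a hypothetical function name, not our `utils.naive_bayes_classifier`, and it assumes plain frequency estimates with no smoothing.

```python
from collections import Counter


def naive_bayes_predict_sketch(train, instance):
    """Pick the class maximizing P(class) * product of P(value | class).

    Attribute values are assumed to be categorical (e.g. binned ranges) and
    the class label is assumed to be the last value in each row.
    """
    class_counts = Counter(row[-1] for row in train)
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / len(train)  # prior P(class)
        rows_in_class = [row for row in train if row[-1] == label]
        for att_index, value in enumerate(instance[:-1]):
            matches = sum(1 for row in rows_in_class if row[att_index] == value)
            # multiplying per-attribute terms is where independence is assumed
            score *= matches / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy_train = [
    ["low", "low", 1],
    ["low", "high", 1],
    ["high", "high", 2],
    ["high", "low", 2],
    ["high", "high", 2],
]
print(naive_bayes_predict_sketch(toy_train, ["low", "high", None]))  # 1
```

If two features tend to move together, their evidence is effectively counted twice in the product, which can push the classifier toward the wrong class.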