# Introduction
In this project I will use a Decision Tree Classifier and a Random Forest Classifier to predict the genre of a song.

This data science project uses the dataset [Top Spotify songs from 2010-2019 - BY YEAR](https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year). I will be predicting the genre of each song, using features such as bpm, volume, length, and more.

I think the best model will be the Random Forest Classifier, as it creates multiple trees and combines them to make its predictions. I think the bpm (beats per minute) will be the strongest feature affecting the results, as some genres are generally faster than others.

I will be attempting to get an accuracy of above 80%.

# Setting up
The below code contains necessary steps for setting up our machine learning environment. Key features are described in the comments.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualisation purposes
from sklearn.tree import DecisionTreeClassifier, plot_tree # our first model and a handy tool for visualising trees
from sklearn.model_selection import train_test_split # a tool for splitting data
from sklearn.ensemble import RandomForestClassifier # our second model

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Gather and explore the data
I added the Spotify dataset by selecting the **Add data** button, then selecting the chosen data.

I chose the dataset [Top Spotify songs](https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year) because music is something I am interested in, and I have never seen a genre predictor before.

This dataset contains over 600 songs and multiple potential features to use. I have chosen top genre as my prediction target, as it is the categorical column I want to predict and can potentially relate to all the other columns.

My data:
- Is in .csv format
- Contains mostly numerical values
- Contains categorical targets
- Doesn't have many missing values

The data is shown under the _input_ folder icon as you see it in the top right of this notebook.

Now that we have a file containing data, let's get it into a Pandas DataFrame and take a peek.

The values in this dataset are:

- Title (not using)
- Artist (not using)
- Top Genre (target)
- Year (not using)
- BPM (beats per minute; The higher the value, the faster the song)
- nrgy (Energy; The higher the value, the more energetic the song)
- dnce (Danceability; The higher the value, the easier it is to dance to this song)
- dB (volume; The higher the value, the louder the song)
- live (Liveness; The higher the value, the more likely the song is a live recording)
- val (Valence; The higher the value, the more positive the mood is in the song)
- dur (Duration; The higher the value, the longer the song)
- acous (Acousticness; The higher the value, the more acoustic the song is)
- spch (Speech; The higher the value, the more spoken words in this song)
- pop (Popularity; The higher the value, the more popular the song is)





In [None]:
# Create a new Pandas DataFrame with our training data
spotifydata = pd.read_csv('../input/top-spotify-songs-from-20102019-by-year/top10s.csv')

print(spotifydata.columns, end='\n\n\n')
(spotifydata.describe(include='all')) 


# Prepare the data
In this project, I will be trying to predict the genre, so 'top genre' will be the prediction target.

Before we can separate our prediction target 'y' (top genre) from the rest of the data, we would normally need to do some preparation so that there aren't any rows with missing values, as our machine learning model will not be able to handle them. However, my chosen data does not contain any missing values, so that step is not necessary.
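A minimal sketch of how missing values could be checked, assuming a small made-up DataFrame called `toy` rather than the real Spotify data. On the real data the same check would be `spotifydata.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Spotify data (values are made up)
toy = pd.DataFrame({
    'top genre': ['dance pop', 'pop', None, 'dance pop'],
    'bpm': [120, 98, 105, np.nan],
    'dur': [210, 185, 240, 200],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only the rows where every column has a value
complete_rows = toy.dropna()
print(len(complete_rows))
```

If every count is zero, as it is for my chosen columns, nothing needs to be dropped.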


## Select features and target then drop missing values
Choosing our features first will help reduce the total number of rows we need to drop (remove).

I have chosen a selection of features that:
- Are relevant to our predictions
- Don't have missing values, so I don't need to drop any rows

Note that we'll also be including the target for now.


In [None]:
# Let's reduce our data to only the features we need and the target.
# The features we chose have similar 'count' values when we describe() them
# We need to keep the target as part of our DataFrame for now.
selected_columns = ['top genre', 'bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch', 'pop']

# Create our new training set containing only the features we want
prepared_data = spotifydata[selected_columns]

# Don't need to drop anything because there are no empty values


prepared_data.describe(include='all') #Note there are no empty values.

## Split data into training and testing data.
Splitting the data into two subsets is important because you need data that your model hasn't seen yet, with known actual values, to compare your predictions against and tell how well the model is performing. I do the split in the next step using `train_test_split` from scikit-learn. The [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) goes through how to do this.

## Separate Features From Target
Now that we have a set of data (as a Pandas DataFrame) without any missing values, let's separate the features we will use for training from the target, and split the data into training and testing sets. Note that the ratio of testing:training values is approximately 1:3.
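The 1:3 ratio comes from the default behaviour of `train_test_split`, which holds out 25% of the rows when no `test_size` is given. Here is a small sketch on random stand-in data. The 603 rows and 10 columns are my assumption about the size of the prepared data, and the `demo_` names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 603 rows and 10 feature columns (assumed sizes, random values)
rng = np.random.default_rng(0)
demo_X = rng.random((603, 10))
demo_y = rng.integers(0, 2, size=603)

# With no test_size given, train_test_split holds out 25% of the rows
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=0)

print(len(demo_train_y), len(demo_val_y))
```

Passing a value such as `test_size=0.2` would change the ratio if a different split were wanted.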




In [None]:
# Separate out the prediction target
y = prepared_data['top genre']

# Drop the target column (axis=1) from the original dataframe and use the rest as our feature data
X = prepared_data.drop('top genre', axis=1)

# Split the data using train_test_split from sklearn
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
print(f'Number of training values: {len(train_y)}')
print(f'Number of testing values: {len(val_y)}')

# Take a look at the feature data again
X.head()

# Choose and Train a Model
Now that we have data our model can digest, let's use it to train a model and make some predictions. We're going to use a __Decision Tree Classifier__ and a __Random Forest Classifier__. These models make categorical predictions instead of continuous numerical predictions, which is perfect for predicting genres. 

Ok, let's train our model and see what it looks like.



In [None]:
def besttdepth(train_X, val_X, train_y, val_y):
    '''Find the best max_depth for a Decision Tree Classifier.
       Counts the correct validation predictions for each depth, as an
       alternative to the mean absolute error used for regressors.'''
    maxnodes = {i: 0 for i in range(1, 100)}  # depth -> number of correct predictions
    for i in maxnodes:
        genrepred = DecisionTreeClassifier(max_depth=i)
        genrepred.fit(train_X, train_y)
        predictions = genrepred.predict(val_X)  # predict once per depth
        maxnodes[i] = int((predictions == val_y.values).sum())
    # Pick the depth with the most correct predictions (the shallowest depth wins ties)
    best = max(maxnodes, key=maxnodes.get)
    print(maxnodes)
    return best

The above function is my alternative to the MAE (mean absolute error) used to decide the tree depth for regressors. It trains a tree for every depth in *maxnodes* (1 to 99), counts the number of correct predictions each tree makes on the validation data, and returns the depth with the most correct predictions.

MAE cannot be used in this situation because it relies on the output of the tree being a continuous value, as produced by a Decision Tree Regressor. A classifier predicts genre labels, which have no numerical distance between them.
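Counting correct predictions is closely related to accuracy, which scikit-learn provides as `accuracy_score`. The sketch below uses made-up labels (not results from this notebook) to show that the count of matches and the accuracy describe the same thing.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
actual = np.array(['dance pop', 'pop', 'dance pop', 'latin', 'dance pop'])
predicted = np.array(['dance pop', 'dance pop', 'dance pop', 'dance pop', 'dance pop'])

# Number of correct predictions, counted the same way as in besttdepth
correct = int((predicted == actual).sum())

# accuracy_score returns the fraction of correct predictions
accuracy = accuracy_score(actual, predicted)

print(correct, accuracy)
```

Dividing the count by the number of validation songs gives the same fraction, so either could be used to compare tree depths.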

In [None]:
# Create a decision tree classifier with best depth
bestd = besttdepth(train_X, val_X, train_y, val_y)
print(bestd)
genrepred = DecisionTreeClassifier(max_depth=bestd) #replace bestd with 1 to get a faster running time

genrepred.fit(train_X, train_y)

The dictionary printed above shows the depth of each tree alongside its number of correct predictions. A depth of 1 has the best success rate, with 82/151 (54.3%). Generally, the deeper the tree, the lower the success rate. You may also notice that the genre **dance pop** makes up about 54% of the songs, which is the same as the success rate of a tree with a depth of 1.

*If you can't see the dictionary, run the above cell.*

In [None]:
# Let's plot the tree to see what it looks like!
plt.figure(figsize = (20,10))
plot_tree(genrepred,
          feature_names=list(X.columns),
          # Take the class names from the fitted model so they match the genres it was trained on
          class_names=list(genrepred.classes_),
          filled=True)
plt.show()


As shown in the visualisation, 'pop' (popularity) has the biggest effect on determining the genre.
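Besides reading the plot, a fitted tree exposes `feature_importances_`, which gives one value per feature. The sketch below uses made-up data in which the label depends only on 'pop', so it is not a result from the Spotify data. The `demo_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the label depends only on 'pop', while 'bpm' is random noise
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'bpm': rng.integers(60, 200, size=200),
    'pop': rng.integers(0, 100, size=200),
})
demo_y = np.where(demo_X['pop'] > 50, 'dance pop', 'other')

demo_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
demo_tree.fit(demo_X, demo_y)

# One importance value per feature; a depth-1 tree can only use a single feature
importances = pd.Series(demo_tree.feature_importances_, index=demo_X.columns)
print(importances.sort_values(ascending=False))
```

On the real model, the equivalent would be `pd.Series(genrepred.feature_importances_, index=X.columns)`.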

In [None]:
# Check the fraction of correct predictions on the validation data
predictions = genrepred.predict(val_X)  # predict once and reuse the result
best = (predictions == val_y.values).sum()
print(best / len(val_y))

Note that this value is extremely close to the proportion of dance pop songs in the table.
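One way to make this comparison explicit is a baseline that always predicts the most common genre. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels. The 6-in-10 share of dance pop is only for illustration and is not taken from the dataset.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up genre labels: 6 of the 10 songs are dance pop
genres = pd.Series(['dance pop'] * 6 + ['pop'] * 3 + ['latin'])
features = pd.DataFrame({'bpm': range(10)})  # ignored by the baseline

# Share of each genre in the labels
genre_share = genres.value_counts(normalize=True)
print(genre_share)

# A baseline that always predicts the most common genre
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features, genres)
baseline_accuracy = baseline.score(features, genres)
print(baseline_accuracy)
```

If a model scores about the same as this kind of baseline, it has probably learned little beyond "predict the most common genre".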

# Random Forest model

In [None]:
forestgenrepred = RandomForestClassifier(random_state=1)

# Train the model on the training data
forestgenrepred.fit(train_X, train_y)

# Count the correct predictions on the validation data
forest_predictions = forestgenrepred.predict(val_X)  # predict once and reuse the result
best = (forest_predictions == val_y.values).sum()
print(best)

Surprisingly, the Random Forest Classifier only increases the number of correct predictions by 2 (1.3%).

# Evaluate model performance
As shown above, the results are:

- Decision Tree Classifier (tree depth of 1): 82/151 (54.3% success rate)
- Random Forest Classifier: 84/151 (55.6% success rate)


As seen, the Random Forest Classifier has a higher success rate, but not by much.

My guess is that the results are not improved by much because the best tree depth is 1. This would mean that the random forest doesn't have many options to choose from when selecting potential trees.
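One way to test this guess would be to look at the trees inside a fitted forest. As far as I know, `RandomForestClassifier` grows its own trees with its own settings (the default `max_depth` is `None`) rather than reusing the depth chosen for the single tree. The sketch below checks this on synthetic data from `make_classification`, not on the Spotify data, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 3 classes
demo_X, demo_y = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

demo_forest = RandomForestClassifier(random_state=1)
demo_forest.fit(demo_X, demo_y)

# Each fitted tree is stored in estimators_; get_depth() reports how deep it grew
tree_depths = [tree.get_depth() for tree in demo_forest.estimators_]
print(len(tree_depths), min(tree_depths), max(tree_depths))
```

Running the same two lines on `forestgenrepred` would show how deep its trees actually are, which would either support or rule out the guess above.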

# Hyperparameters

I have changed the majority of the hyperparameters as I went. The biggest change is the tree depth, where I chose the optimal depth of 1 for the Decision Tree Classifier. The bigger the tree depth, the less accurate the results: the number of correct predictions dropped by roughly 3 for every extra layer in the tree. Ultimately, I think the tree is really classifying whether the genre is dance pop or not. This is because the success rate is very close to the proportion of dance pop entries in the data.
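With a bit more time, the depth could also be tuned with cross-validation instead of a single validation split. Below is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The list of depths and the 5 folds are arbitrary choices for illustration, and on the real data genres with only a handful of songs may cause warnings when the folds are made.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 3 classes
demo_X, demo_y = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Try each depth with 5-fold cross-validation and keep the most accurate one
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 2, 3, 5, 10]},
    scoring='accuracy',
    cv=5,
)
search.fit(demo_X, demo_y)

print(search.best_params_, round(search.best_score_, 3))
```

The same approach works for other hyperparameters, for example `n_estimators` on the Random Forest Classifier.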

# Conclusion
My best model is the Random Forest Classifier, with a success rate of 55.6%. My other model, the Decision Tree Classifier, is not much worse, with a success rate of 54.3%. I didn't succeed in reaching my target rate of 80%, although with a bit more time, or a different model, I might have been able to raise that percentage. The only feature affecting the Decision Tree Classifier was the popularity, rather than the bpm as I predicted. This is because there is a correlation between dance pop as a genre and popularity. My hypothesis was correct: the Random Forest Classifier is better, but the difference in success was small.