This code is an introduction to supervised learning solving a classification problem using **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps of solving a machine learning problem.


1. Import the Data
2. Clean the Data
3. split the Data into Training/ Test steps
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and improve


# Problem description
Enter in the text cell below what you will be predicting in this classification problem (y) and which columns will be used in the prediction (X)

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [2]:
df_original = pd.read_csv('originalfile.csv')
df = pd.read_csv('originalfile.csv')
#df.drop(['title', 'artist', 'top genre'], axis = 1)
df.drop(df.columns[[0, 1, 2]], inplace = True, axis = 1)

2. Display columns and describe the data set

In [3]:
print(df)

    year  beats.per.minute  energy  danceability  loudness.dB  liveness  \
0   2020               171      73            51           -6         9   
1   2019                95      82            55           -4        34   
2   2021                91      72            70           -4        32   
3   2019               110      41            50           -6        11   
4   2017                95      45            60           -6        11   
..   ...               ...     ...           ...          ...       ...   
95  2016               104      61            79           -6        32   
96  2015               120      79            75           -7         9   
97  2021               126      83            66           -5        40   
98  2018                93      80            61           -5        16   
99  2016               102      73            67           -7         9   

    valance  length  acousticness  speechiness  popularity  
0        33     200             0     

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              100 non-null    int64
 1   beats.per.minute  100 non-null    int64
 2   energy            100 non-null    int64
 3   danceability      100 non-null    int64
 4   loudness.dB       100 non-null    int64
 5   liveness          100 non-null    int64
 6   valance           100 non-null    int64
 7   length            100 non-null    int64
 8   acousticness      100 non-null    int64
 9   speechiness       100 non-null    int64
 10  popularity        100 non-null    int64
dtypes: int64(11)
memory usage: 8.7 KB


In [5]:
df.describe()

Unnamed: 0,year,beats.per.minute,energy,danceability,loudness.dB,liveness,valance,length,acousticness,speechiness,popularity
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,2015.96,116.97,62.68,66.96,-6.1,16.86,49.97,214.53,24.95,9.93,79.67
std,5.327497,27.470629,16.491737,13.60401,1.987334,12.972403,21.737857,35.934974,26.27876,9.424077,5.905065
min,1975.0,71.0,11.0,35.0,-14.0,3.0,6.0,119.0,0.0,2.0,53.0
25%,2015.0,95.0,52.0,59.0,-7.0,10.0,33.75,190.5,4.0,4.0,79.0
50%,2017.0,115.0,64.5,69.0,-6.0,12.0,48.0,210.0,13.0,6.0,81.0
75%,2018.0,135.25,76.0,77.0,-5.0,17.25,66.0,234.25,41.5,11.0,83.0
max,2021.0,186.0,92.0,91.0,-3.0,79.0,93.0,354.0,98.0,46.0,91.0


3. Prepare Data

In [6]:
# Run this section to inspect X
# X = df.drop(columns = ['year'])
# df.drop(columns = ['artist'])
X = df

In [7]:
# Uncomment this section to inpect y
y = df['energy']
y

0     73
1     82
2     72
3     41
4     45
      ..
95    61
96    79
97    83
98    80
99    73
Name: energy, Length: 100, dtype: int64

4. Calculate accuracy

In [8]:
# Train 80% of the data set and use the rest to test
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.45

5. Persisting Models

In [9]:
# Save the model to file
joblib.dump(model, 'MODELNAME.joblib')


['MODELNAME.joblib']

5.b. Import the model and make predictions

In [10]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MODELNAME.joblib')
predictions = model.predict(X_test)
predictions

array([86, 73, 38, 54, 73, 37, 75, 26, 79, 72, 86, 69, 69, 57, 73, 87, 42,
       65, 61, 47])

6. (Optional) Visualize decision trees

In [13]:
tree.export_graphviz(model, out_file = 'MODELNAME.dot',
                    feature_names = X.columns, 
                    class_names = str(sorted(y.unique())), 
                    label = 'all',
                    rounded = True, 
                    filled = True)

#Download the file music-recommender.dot and open it in VS Code.
