# A very simple machine learning example

This example is adapted from a [video](https://www.youtube.com/watch?v=7eh4d6sabA0) on YouTube.
(Thanks to Mosh Hamedani, codewithmosh.com.)


In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('music.csv')
df.shape

(18, 3)

In [2]:
df

Unnamed: 0,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz
5,30,1,Jazz
6,31,1,Classical
7,33,1,Classical
8,37,1,Classical
9,20,0,Dance


### Splitting the data into input and output 

We need to split our data into an input dataset and an output dataset. The input set holds the features (age and gender), and the output set holds the genre labels. When training a model, we give it both sets so that it learns to predict the output from the input.

In [3]:
X = df.drop(columns=['genre']).values
y = df['genre'].values

In [5]:
y

array(['HipHop', 'HipHop', 'HipHop', 'Jazz', 'Jazz', 'Jazz', 'Classical',
       'Classical', 'Classical', 'Dance', 'Dance', 'Dance', 'Acoustic',
       'Acoustic', 'Acoustic', 'Classical', 'Classical', 'Classical'],
      dtype=object)
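
As a minimal sketch of the same split, the snippet below rebuilds a few rows from the table above as a stand-in for `music.csv` (the file itself is not read here) and prints the shapes of the two resulting arrays.

```python
import pandas as pd

# Stand-in for music.csv: six rows copied from the table above.
df = pd.DataFrame({
    "age": [20, 23, 26, 29, 31, 20],
    "gender": [1, 1, 1, 1, 1, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Jazz", "Classical", "Dance"],
})

X = df.drop(columns=["genre"]).values  # features: one (age, gender) row per person
y = df["genre"].values                 # labels: one genre per person

print(X.shape)  # (6, 2)
print(y.shape)  # (6,)
```

Each row of `X` lines up with the entry of `y` at the same position, which is what `fit()` relies on.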

### Build and train a DecisionTree model

In [6]:
model = DecisionTreeClassifier()
model.fit(X, y)

DecisionTreeClassifier()

### Predict new data with the model

We give the trained model new inputs, and it predicts a genre for each one based on the patterns it learned from the training data.

We test a 21-year-old male and a 22-year-old female. These are not in our dataset!

In [10]:
predictions = model.predict([[21, 1], [22, 0]])
predictions

array(['HipHop', 'Dance'], dtype=object)

Comparing with our initial dataset, we can see that the result is plausible: the model predicts that the 21-year-old male likes HipHop and that the 22-year-old female likes Dance music.
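
To see which rules the tree actually learned, one option is scikit-learn's `export_text`. The sketch below is only an illustration: it trains on the ten rows visible in the table (not the full 18-row file), and `random_state=0` is an added assumption to keep the tree repeatable.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The ten (age, gender) rows visible in the table above; the full file has 18.
X = [[20, 1], [23, 1], [25, 1], [26, 1], [29, 1], [30, 1],
     [31, 1], [33, 1], [37, 1], [20, 0]]
y = ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
     "Classical", "Classical", "Classical", "Dance"]

# random_state is an added assumption so the tree is built the same way each run.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Print the learned if/else rules as plain text.
rules = export_text(model, feature_names=["age", "gender"])
print(rules)

# Same two test persons as above: a 21-year-old male and a 22-year-old female.
predictions = model.predict([[21, 1], [22, 0]])
print(predictions)
```

The printed rules are a chain of threshold tests on `age` and `gender`; a prediction is simply the genre at the leaf that a new row ends up in.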

In [11]:
df

Unnamed: 0,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz
5,30,1,Jazz
6,31,1,Classical
7,33,1,Classical
8,37,1,Classical
9,20,0,Dance


### Estimating the accuracy of our model

So far, we have used all of the data for training and then asked for predictions on new inputs. However, we have no idea how well the model actually predicts. Therefore, we split the data into a training set and a testing set.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [16]:
X_train

array([[30,  1],
       [20,  1],
       [34,  0],
       [25,  1],
       [35,  0],
       [31,  0],
       [25,  0],
       [37,  1],
       [33,  1],
       [26,  1],
       [26,  0],
       [27,  0],
       [29,  1],
       [23,  1]])
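
The 14 rows above follow from the split sizes: with 18 rows and `test_size=0.2`, the test set is rounded up to 4 rows and the remaining 14 go to training. A small sketch with placeholder arrays (not the real data) shows this. The `random_state` argument is an addition that the notebook does not use; it fixes the shuffle so a split can be repeated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 18 rows, the same number as music.csv.
X = np.arange(36).reshape(18, 2)
y = np.arange(18)

# random_state (not used in the notebook) fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 14 4

# The same random_state gives the same split again.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test2))  # True
```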

The first step is to retrain the model using only the training set.

In [20]:
model.fit(X_train, y_train)

DecisionTreeClassifier()

Now we can get predictions for the test inputs and then check the accuracy score.

In [21]:
predictions2 = model.predict(X_test)
score = accuracy_score(y_test, predictions2)
score

0.75

Because the dataset is so small, the score varies a lot from run to run: it may be 1.0 (a perfect match) or as low as 0.25 or less. The test set here contains only 4 rows, so every wrong prediction costs 0.25. Methods like this are usually run on much larger datasets.
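
The accuracy score is just the fraction of test rows predicted correctly. As a small illustration with made-up labels for a 4-row test set (these are hypothetical, not taken from a real split):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels for a 4-row test set.
y_test = ["HipHop", "Jazz", "Dance", "Classical"]

# Hypothetical predictions with zero, one and two mistakes.
all_right = ["HipHop", "Jazz", "Dance", "Classical"]
one_wrong = ["HipHop", "Jazz", "Dance", "Jazz"]
two_wrong = ["Jazz", "Jazz", "Dance", "Jazz"]

print(accuracy_score(y_test, all_right))  # 1.0
print(accuracy_score(y_test, one_wrong))  # 0.75
print(accuracy_score(y_test, two_wrong))  # 0.5
```

With only four test rows, the score can take just five values (0.0, 0.25, 0.5, 0.75, 1.0), which is why it jumps around so much.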

Let's test what happens if we run the model many times. The train_test_split() function picks the rows at random, so every run gets a different split!

In [22]:
scores = []

for i in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores.append(accuracy_score(y_test, pred))
    print(scores[i])

print(f"\naverage score: {np.mean(scores)}")


1.0
0.75
0.75
0.5
1.0
0.25
0.5
1.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
0.25
0.75
1.0
1.0
0.75

average score: 0.775
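
The same experiment can also be written more compactly with scikit-learn's `ShuffleSplit` and `cross_val_score`, which repeat random train/test splits and collect the accuracy of each. This is a sketch under assumptions: the data is a stand-in for `music.csv` (the labels match the `y` array shown earlier and the first ten feature rows match the table, but the last eight feature rows are assumed), and the `random_state` values are additions to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for music.csv. The first ten (age, gender) rows match the table;
# the last eight are assumed values for illustration.
X = np.array([[20, 1], [23, 1], [25, 1], [26, 1], [29, 1], [30, 1],
              [31, 1], [33, 1], [37, 1], [20, 0], [21, 0], [25, 0],
              [26, 0], [27, 0], [30, 0], [31, 0], [34, 0], [35, 0]])
y = np.array(["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3)

# 20 random 80:20 splits; random_state is added so the run is repeatable.
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores)
print(f"mean: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Reporting the standard deviation next to the mean makes the run-to-run spread visible at a glance.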


And now let's see what happens if we swap the training/testing ratio from 80:20 to 20:80!

In [23]:
scores = []

for i in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores.append(accuracy_score(y_test, pred))
    print(scores[i])

print(f"\naverage score: {np.round(np.mean(scores),2)}")

0.6
0.26666666666666666
0.13333333333333333
0.4
0.2
0.3333333333333333
0.4
0.3333333333333333
0.4
0.3333333333333333
0.2
0.4666666666666667
0.4
0.2
0.3333333333333333
0.3333333333333333
0.2
0.6
0.3333333333333333
0.3333333333333333

average score: 0.34


We can see that the accuracy drops dramatically. That is because the training set is now far too small: only 3 of the 18 rows are used for training.
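
A quick way to check how little the model gets to see is to count the training rows and the distinct genres among them for different values of `test_size`. The sketch below uses the 18 genre labels shown earlier with a placeholder `X` (only the labels matter for this count); `random_state=0` is an added assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The 18 genre labels shown earlier; X is a placeholder, since only the
# labels matter for this count.
y = np.array(["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3)
X = np.arange(len(y)).reshape(-1, 1)

train_sizes = []
for test_size in (0.2, 0.5, 0.8):
    # random_state is an added assumption to make the output repeatable.
    _, _, y_train, _ = train_test_split(X, y, test_size=test_size, random_state=0)
    train_sizes.append(len(y_train))
    genres_seen = len(set(y_train))
    print(f"test_size={test_size}: {len(y_train)} training rows, "
          f"{genres_seen} of 5 genres seen")
```

With only 3 training rows, the tree can have seen at most 3 of the 5 genres, so it has no way to predict the others.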