# Machine Learning

Machine learning is a branch of artificial intelligence (AI) and computer science that uses data and algorithms to imitate the way humans learn, gradually improving in accuracy.

A typical machine learning project follows six steps:

1. **I**mport the data
2. **C**lean the data
3. **S**plit the data
4. **T**rain the model
5. **M**ake predictions
6. **E**valuate the model

The algorithm you choose depends on the problem you need to solve. One of the most popular is the decision tree, available in **scikit-learn**.

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', None)

# Onward to Machine Learning

Learn patterns of existing users to make recommendations

In [3]:
music = pd.read_csv('data/music.csv') #Totally fake and oversimplified

[Click here for a 'Programming with Mosh' video on this dataset and analysis](https://www.youtube.com/watch?v=7eh4d6sabA0&t=1214s)

# Import the data

In [4]:
music

Unnamed: 0,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz
5,30,1,Jazz
6,31,1,Classical
7,33,1,Classical
8,37,1,Classical
9,20,0,Dance


# Prepare the data

The dataset has no duplicate rows or null values, so no cleaning is needed.
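
One way to confirm this is with the pandas `duplicated` and `isnull` checks. The sketch below is a minimal example, assuming the data matches the rows printed in this notebook: the inline `music` DataFrame is a stand-in for `data/music.csv`, not the file itself.

```python
import pandas as pd

# Stand-in for data/music.csv, rebuilt from the rows printed in this notebook
music = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

duplicate_rows = int(music.duplicated().sum())  # rows identical to an earlier row
null_counts = music.isnull().sum()              # missing values per column

print(duplicate_rows)
print(null_counts)
```

If either check reported a non-zero count, `music.drop_duplicates()` or `music.dropna()` would be the usual first things to try.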

# Split the data

Input set: the information used to make the prediction (age and gender)  
Output set: the prediction (genre)

# Creating the input data set

In [15]:
X = music.drop(columns = ['genre']) #X used by convention

In [16]:
X

Unnamed: 0,age,gender
0,20,1
1,23,1
2,25,1
3,26,1
4,29,1
5,30,1
6,31,1
7,33,1
8,37,1
9,20,0


# Creating the output data set

In [17]:
y = music['genre'] #y used by convention

In [18]:
y

0        HipHop
1        HipHop
2        HipHop
3          Jazz
4          Jazz
5          Jazz
6     Classical
7     Classical
8     Classical
9         Dance
10        Dance
11        Dance
12     Acoustic
13     Acoustic
14     Acoustic
15    Classical
16    Classical
17    Classical
Name: genre, dtype: object

# Modeling the data
scikit-learn decision tree algorithm  
A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.  
[For more information on scikit-learn, click here](https://scikit-learn.org/stable/)
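
To see the root node, branches and leaves for this dataset, the fitted tree can be printed as text with scikit-learn's `export_text`. This is a minimal sketch: the inline `music` DataFrame is a stand-in rebuilt from the rows printed in this notebook, and `random_state=0` is an added assumption so that ties between equally good splits are broken the same way on every run.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in for data/music.csv, rebuilt from the rows printed in this notebook
music = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

X = music[["age", "gender"]]
y = music["genre"]

# random_state is an assumption added here for repeatable output
model = DecisionTreeClassifier(random_state=0)
model.fit(X.values, y)

# Each indented line is a branch (a test on age or gender); "class:" lines are leaves
tree_rules = export_text(model, feature_names=["age", "gender"])
print(tree_rules)
```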

<img src="images/decision_tree.png" width=600 height=900 align="center"/>

# Classification and Regression Trees (CART)

<img src="images/decision_tree_1.jpeg" width=600 height=900 align="center"/>

<img src="images/cart_formula.png" width=300 height=450 align="center"/>
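
CART chooses each split by how much it reduces an impurity measure, and scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default. The sketch below computes that measure by hand for the genre labels in this notebook. The helper `gini_impurity` is a hypothetical illustration written for this example, not a scikit-learn function, and it may not be the exact formula shown in the image above.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Genre labels as printed in this notebook (18 rows)
genres = (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
          + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3)

root_impurity = gini_impurity(genres)             # all rows mixed together
pure_impurity = gini_impurity(["Classical"] * 6)  # a node holding a single genre

print(round(root_impurity, 4))  # 0.7778
print(pure_impurity)            # 0.0
```

A node with impurity 0 contains only one genre, so the tree stops splitting there and turns it into a leaf.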


In [28]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X.values, y)

DecisionTreeClassifier()

[For more information on decision trees](http://scikit-learn.org/stable/modules/tree.html)

In [29]:
#Using Age and gender to predict genre

model.predict([[21, 1]]) # making a prediction for a 21 year old male

array(['HipHop'], dtype=object)

In [30]:
model.predict([[22, 0]]) # making a prediction for a 22 year old female

array(['Dance'], dtype=object)

In [31]:
model.predict([[21,1],[22, 0],[33,1],[38,0]]) # making a bunch of predictions

array(['HipHop', 'Dance', 'Classical', 'Classical'], dtype=object)

In [32]:
model.predict([[40,1]]) # an age outside the training range still gets a prediction: the tree returns the label of the leaf the input falls into

array(['Classical'], dtype=object)

# How it would look put together

In [34]:
from sklearn.tree import DecisionTreeClassifier

X = music.drop(columns = ['genre'])
y = music['genre']

model = DecisionTreeClassifier()
model.fit(X.values, y)

model.predict([[21,1],[22, 0],[33,1],[38,0]]) # making a bunch of predictions

array(['HipHop', 'Dance', 'Classical', 'Classical'], dtype=object)

# Do we know how accurate our predictions actually are?

In [35]:
from sklearn.model_selection import train_test_split

In [36]:
from sklearn.tree import DecisionTreeClassifier

X = music.drop(columns = ['genre'])
y = music['genre']

# Starts the same way

In [37]:
train_test_split(X,y, test_size = 0.2) 
# the test_size argument specifies that 20% of our data will be set aside for testing
# The function outputs 4 data sets 
# 80% of X for training 
# 20% of X for testing
# 80% of y for training 
# 20% of y for testing

[    age  gender
 3    26       1
 17   35       0
 1    23       1
 0    20       1
 16   34       0
 6    31       1
 5    30       1
 4    29       1
 14   30       0
 10   21       0
 9    20       0
 2    25       1
 7    33       1
 11   25       0,
     age  gender
 8    37       1
 13   27       0
 15   31       0
 12   26       0,
 3          Jazz
 17    Classical
 1        HipHop
 0        HipHop
 16    Classical
 6     Classical
 5          Jazz
 4          Jazz
 14     Acoustic
 10        Dance
 9         Dance
 2        HipHop
 7     Classical
 11        Dance
 Name: genre, dtype: object,
 8     Classical
 13     Acoustic
 15    Classical
 12     Acoustic
 Name: genre, dtype: object]

In [38]:
# Let's assign these datasets to variables

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2) 

We change the inputs to make use of the new split
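
Because `train_test_split` shuffles the rows, every call produces a different split, so the rows printed earlier need not be the ones assigned to these variables. A minimal sketch of pinning the split with `random_state` follows. The value 42 is an arbitrary assumption, and the inline `music` DataFrame is a stand-in rebuilt from the rows printed in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for data/music.csv, rebuilt from the rows printed in this notebook
music = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

X = music.drop(columns=["genre"])
y = music["genre"]

# Same random_state -> same shuffle -> same split on every call
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))            # 14 4
print(X_test.index.equals(X_test2.index))   # True
```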

In [39]:
from sklearn.metrics import accuracy_score

In [41]:
model = DecisionTreeClassifier() #Same as before
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [42]:
predictions #show us what the model has guessed

array(['HipHop', 'Acoustic', 'HipHop', 'Classical'], dtype=object)

In [44]:
accuracy_score(y_test, predictions)

1.0
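
`accuracy_score` returns the fraction of predictions that match the true labels, so 1.0 means every test row was predicted correctly. The sketch below reproduces the calculation by hand. The two label arrays are hypothetical values chosen for illustration, not the notebook's actual test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only
y_true = np.array(["Classical", "Acoustic", "Classical", "Acoustic"])
y_pred = np.array(["Classical", "Acoustic", "Classical", "Dance"])

manual_accuracy = float((y_true == y_pred).mean())  # 3 of 4 predictions match
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)  # 0.75 0.75
```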

# Now all together

In [45]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [48]:
# Create your input and output sets
X = music.drop(columns = ['genre'])
y = music['genre']

# Split the input and output sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2) #check how changing test size affects score

# Select and fit the model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Make some predictions
predictions = model.predict(X_test)

# Evaluate the model

score = accuracy_score(y_test, predictions)
score

1.0

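In general, the more representative data we give the model, the more accurate it tends to become. With only 18 rows and a randomly chosen test set of 4, the score can change from run to run.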
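
One way to get a steadier estimate from so little data is cross-validation, which trains and scores the model on several different splits. This is a sketch that goes beyond the original analysis: `cv=3` is an assumption chosen because the smallest genre has three rows, `random_state=0` is added for repeatability, and the inline `music` DataFrame is a stand-in rebuilt from the rows printed in this notebook.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for data/music.csv, rebuilt from the rows printed in this notebook
music = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

X = music.drop(columns=["genre"])
y = music["genre"]

model = DecisionTreeClassifier(random_state=0)

# Three folds: each fold is held out once while the model trains on the other two
scores = cross_val_score(model, X, y, cv=3)

print(scores)         # one accuracy per fold
print(scores.mean())  # average accuracy across folds
```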

# Model persistence 

Training a model every time we want to use it is not a good use of time, so we save the trained model to a file and reuse it.

Now we will use the entire dataset for training and see what results it yields.

In [49]:
from sklearn.tree import DecisionTreeClassifier

X = music.drop(columns = ['genre'])
y = music['genre']

model = DecisionTreeClassifier()
model.fit(X.values, y)

DecisionTreeClassifier()

In [50]:
import joblib 

In [51]:
joblib.dump(model, 'music_recommender.joblib')

['music_recommender.joblib']
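
The saved file can later be read back with `joblib.load` and used for predictions without retraining. The sketch below shows the round trip. It writes to a temporary directory (an assumption made so the example leaves no files behind), and the inline `music` DataFrame is a stand-in rebuilt from the rows printed in this notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for data/music.csv, rebuilt from the rows printed in this notebook
music = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

X = music.drop(columns=["genre"])
y = music["genre"]

model = DecisionTreeClassifier()
model.fit(X.values, y)

# Save the trained model, then load it back as if in a later session
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "music_recommender.joblib")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The loaded model predicts without being trained again
restored_predictions = loaded_model.predict([[21, 1], [22, 0]])
print(restored_predictions)
```

In a real session the path would simply be `'music_recommender.joblib'`, the file written by the `joblib.dump` call above.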