# Introduction to Machine Learning 4

See https://learning.anaconda.cloud/getting-started-with-ai-ml

Cover major supervised machine learning algorithms, which use labelled data to make predictions:

- Linear regression
- Logistic regression
- Naive Bayes
- **Decision trees / random forests**
- Neural networks

Using `scikit-learn` for implementation

## Decision Trees / Random Forests

Decision trees are a very powerful machine learning technique that forms the basis of other machine learning algorithms, including random forests. 

In this module, you'll learn how to: 

- Dissect a decision tree and learn how `scikit-learn` builds one
- Optimize splits with Gini impurity
- Prevent overfitting by using random forests

Decision Trees fit to data very well, with a definite risk of over-fitting.

One mitigation is **Random Forests** which generate hundreds of decision trees with randomly-sampled data.

Other possible mitigations include techniques such as gradient boosting #FurtherLearning


### Building Decision Trees

*skip over a chunk of stuff about building by hand*

Possible questions though:

- at a given step, how do I determine the best property to split on?
- where do I split continuous variables?
- when do I stop splitting and end my decision tree?

### Gini Impurity

A measure for *impurity* i.e. how mixed a population is.

So for two classes $A$ and $B$:

$$
Gini\;Impurity = 1 - {(Probability\;of\;A)}^2  - {(Probability\;of\;B)}^2
$$

or more generically, consider a dataset $D$ that contains samples from $k$ classes. 
The probability of samples belonging to class $i$ at a given node can be denoted as $p_i$.

Then the Gini Impurity of $D$ is defined as:

$$
Gini(D) = 1 - \sum_{i=1}^{k} p_i^2
$$

For example, in a kennel with 6 dogs and 3 cats, the [Gini impurity](https://www.learndatasci.com/glossary/gini-impurity/) of dogs v cats is:

$$
Gini\;Impurity = 1 - {\left(\frac6{6+3}\right)}^2  - {\left(\frac3{6+3}\right)}^2 = 0.44444
$$
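The formula above can be checked with a few lines of Python. This is a minimal sketch, not taken from the course material; the function name `gini_impurity` and the `kennel` list are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# The kennel example: 6 dogs and 3 cats.
kennel = ["dog"] * 6 + ["cat"] * 3
print(round(gini_impurity(kennel), 5))  # 0.44444
```

A population with a single class gives 0 (pure), and an even two-class mix gives the two-class maximum of 0.5.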

Using the weather example I skipped over in the last section, and looking at the first decision node $RAIN$:

![](gini-1.png)

Impurity for "good/bad weather" on the YES side is 0% meaning pure:

$$
1 - {\left(\frac0{0+16}\right)}^2  - {\left(\frac{16}{0+16}\right)}^2 = 0
$$

On the NO side we get an impurity of 48.4429%

$$
1 - {\left(\frac{14}{14+20}\right)}^2  - {\left(\frac{20}{14+20}\right)}^2 \approx 0.484429
$$

To score the split on the $RAIN$ property as a whole, we weight the two impurities by the share of samples that land on each side:

$$
0\,{\left(\frac{0+16}{0+16+14+20}\right)} + 0.484429\,{\left(\frac{14+20}{0+16+14+20}\right)} \approx 0.329412
$$

This is known as the **weighted average Gini impurity** and can be used as a measure of quality for splitting the data on that property. Lowest impurity is the best.
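The weighting step for the $RAIN$ split can be reproduced from the class counts on each branch. This is a small sketch under the assumption that each branch is given as a list of per-class counts; `gini_from_counts` and `weighted_gini` are hypothetical helper names.

```python
def gini_from_counts(counts):
    """Gini impurity of one node, given the number of samples in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(branches):
    """Average the branch impurities, weighted by each branch's share of samples."""
    grand_total = sum(sum(b) for b in branches)
    return sum(sum(b) / grand_total * gini_from_counts(b) for b in branches)


# Class counts from the RAIN node: YES side is 0 vs 16, NO side is 14 vs 20.
rain_split = [[0, 16], [14, 20]]
print(round(gini_from_counts(rain_split[1]), 6))  # 0.484429
print(round(weighted_gini(rain_split), 6))        # 0.329412
```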

At every step through the tree we choose the property that has the least weighted average Gini impurity.
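Choosing the property with the lowest weighted average Gini impurity can be sketched as a search over candidate splits. The version below is an illustration only: it assumes every split has the form "value <= threshold" and tries the midpoints between consecutive sorted values, which is one common way to handle continuous variables such as temperature. The rows are made-up values, not the course dataset, and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def split_score(rows, labels, feature, threshold):
    """Weighted average Gini impurity of splitting on rows[feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Return (score, feature index, threshold) with the lowest weighted impurity."""
    best = None
    for feature in range(len(rows[0])):
        values = sorted({x[feature] for x in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            score = split_score(rows, labels, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Made-up rows of (RAIN, TEMPERATURE); label 1 means good weather.
rows = [(1, 58), (1, 79), (0, 69), (0, 71), (0, 74), (0, 73)]
labels = [0, 0, 1, 1, 1, 1]
score, feature, threshold = best_split(rows, labels)
print(score, feature, threshold)
```

In this toy data, splitting on the first column (RAIN) at 0.5 separates the labels completely, so it wins with an impurity of 0.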

The algorithm in `scikit-learn` repeats this recursively on each branch until a stopping condition is met: the tree reaches its allowed depth (`max_depth`), a node is already pure, or no further split is possible under the configured limits (for example `min_samples_split`).
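To see the effect of a depth limit, one option is to fit two trees on the same data and print the learned rules with `sklearn.tree.export_text`. This is a sketch on made-up rows (not the course dataset); the variable names `deep` and `shallow` are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows of [RAIN, TEMPERATURE]; label 1 means good weather.
X = [[1, 58], [1, 79], [0, 69], [0, 71], [0, 73], [0, 74], [0, 95], [0, 97]]
y = [0, 0, 1, 1, 1, 1, 0, 0]

# No depth limit: the tree keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# max_depth=1 allows a single split, so some leaves stay mixed.
shallow = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0).fit(X, y)

# Text rendering of the split rules chosen at each node.
print(export_text(deep, feature_names=["RAIN", "TEMPERATURE"]))
print("depth without limit:", deep.get_depth())
print("depth with max_depth=1:", shallow.get_depth())
```

The unrestricted tree fits this training data perfectly, which is exactly the behaviour that makes a single decision tree prone to over-fitting.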


### Example using `scikit-learn`

In this example we use the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class

In [33]:
import pandas as pd
import numpy as np 

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

np.set_printoptions(suppress=True)

In [34]:
df = pd.read_csv('https://bit.ly/3zVspy4')
df.head(11)

Unnamed: 0,RAIN,LIGHTNING,CLOUDY,TEMPERATURE,GOOD_WEATHER_IND
0,0,1,1,74,0
1,0,0,0,69,1
2,1,0,1,58,0
3,0,0,0,71,1
4,0,0,0,73,1
5,0,1,1,80,0
6,0,1,1,74,0
7,0,0,0,73,1
8,1,0,1,79,0
9,0,0,1,72,1


In [35]:
X = df.values[:, :-1]
Y = df.values[:, -1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)

In [36]:
# Declare the decision tree and fit the training data.
model = DecisionTreeClassifier(max_depth=10, criterion='gini')
model.fit(X_train, Y_train)

In [37]:
# Score the accuracy of the model
results = model.score(X_test, Y_test)
print(results)

1.0


In [38]:
# Show the confusion matrix.
matrix = confusion_matrix(y_true=Y_test, y_pred=model.predict(X_test))
print(matrix)

[[13  0]
 [ 0  4]]


### Random Forests

Random forests are an ML technique that generates hundreds of decision trees, each one built from a random sample of the data and a random subset of the properties rather than all of them.
Each tree trains on a bootstrap sample: rows are drawn at random with replacement, so a tree typically sees only about 2/3 of the distinct rows.
Each tree also considers only a random subset of the variables when evaluating each node.

Each tree then "votes" on a prediction; the prediction with the most votes wins.
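The two ingredients, bootstrapping and voting, can be sketched in plain Python. This is an illustration of the idea rather than how `scikit-learn` implements it; `bootstrap_indices` and `majority_vote` are hypothetical helper names and the votes are made up.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices at random with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(votes):
    """Return the prediction that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
n = 1000
sample = bootstrap_indices(n, rng)

# Because rows are drawn with replacement, some repeat and others are left out.
unique_fraction = len(set(sample)) / n
print(f"distinct rows seen by this tree: {unique_fraction:.0%}")

# Five trees vote on one sample (made-up votes); class 1 wins three to two.
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

The distinct fraction comes out near 63%, which is where the "about 2/3" rule of thumb comes from.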



### Example of Random Forest in `scikit-learn`

This uses the class [`sklearn.ensemble.RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [39]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [40]:
df = pd.read_csv('https://bit.ly/3zVspy4')
df

Unnamed: 0,RAIN,LIGHTNING,CLOUDY,TEMPERATURE,GOOD_WEATHER_IND
0,0,1,1,74,0
1,0,0,0,69,1
2,1,0,1,58,0
3,0,0,0,71,1
4,0,0,0,73,1
5,0,1,1,80,0
6,0,1,1,74,0
7,0,0,0,73,1
8,1,0,1,79,0
9,0,0,1,72,1


In [41]:
X = df.values[:, :-1]
Y = df.values[:, -1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)

In [42]:
# Create a random forest classifier model with 300 trees.
# Limit each tree to a depth of 10 levels
# and consider at most 4 features when looking for the best split at each node.
model = RandomForestClassifier(n_estimators=300, max_depth=10, max_features=4, criterion='gini')
model.fit(X_train, Y_train)

In [43]:
# score the accuracy of the model
results = model.score(X_test, Y_test)
print(results)

1.0


In [44]:
# View the confusion matrix to see the accuracy of true predictions and false predictions.
matrix = confusion_matrix(y_true=Y_test, y_pred=model.predict(X_test))
print(matrix)

[[13  0]
 [ 0  4]]


### Another example

In [46]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


df = pd.read_csv('https://bit.ly/3QHvclX')

X = df.values[:, :-1]
Y = df.values[:, -1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)

model = RandomForestClassifier(n_estimators=300, max_depth=10, max_features=3, criterion='gini')
model.fit(X_train, Y_train)


results = model.score(X_test, Y_test)
print(results)

matrix = confusion_matrix(y_true=Y_test, y_pred=model.predict(X_test))
print(matrix)

0.8243243243243243
[[77 17]
 [ 9 45]]
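The accuracy score and the confusion matrix above report the same result in two forms. As a quick check, the sketch below recomputes the accuracy from the printed matrix, using the `scikit-learn` convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix as printed above: rows are true classes, columns are predictions.
matrix = [[77, 17],
          [9, 45]]

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total
print(round(accuracy, 4))  # 0.8243

# Share of each true class that was predicted correctly (per-class recall).
recalls = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]
print([round(r, 3) for r in recalls])  # [0.819, 0.833]
```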
