# Multiclass Classification

1. Train a multiclass classifier using the Random Forest algorithm.
2. Evaluate the performance of multiclass models.


## Introduction
The response variable of a binary classifier can take only two different values, such as 0 or 1, or yes or no. A multiclass classification task is simply an extension of this: its response variable can take more than two different values.

## Training a Random Forest Classifier

The Random Forest methodology was first proposed in 1995 by Tin Kam Ho and was further developed by Leo Breiman in 2001.

Random Forest is not a recent algorithm: it has been in use for almost two decades, thanks to its performance and simplicity.




In [2]:
import pandas as pd
dataset_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/activity.csv'
df = pd.read_csv(dataset_url)

df.head()

Unnamed: 0,avg_rss12,var_rss12,avg_rss13,var_rss13,avg_rss23,var_rss23,Activity
0,42.0,0.0,18.5,0.5,12.0,0.0,bending1
1,42.0,0.0,18.0,0.0,11.33,0.94,bending1
2,42.75,0.43,16.75,1.79,18.25,0.43,bending1
3,42.5,0.5,16.75,0.83,19.0,1.22,bending1
4,43.0,0.82,16.25,0.83,18.0,0.0,bending1


Each row represents an activity that was performed by a person, and the name of the activity is stored in the **Activity** column.
There are seven different activities in this variable: **bending1**, **bending2**, **cycling**, **lying**, **sitting**, **standing** and **Walking**. The other six columns are different measurements taken from sensor data.

In [3]:
target = df.pop('Activity')

Now we are going to split the dataset into training and testing sets. The model uses the training set to learn the parameters that are relevant for predicting the response variable. The test set is used to check whether the model can accurately predict unseen data. We say the model is overfitting when it has learned patterns that are relevant only to the training set and makes incorrect predictions on the testing set. In this case, the model performance will be much higher for the training set than for the testing one. Ideally, we want a very similar level of performance for the training and testing sets.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size = 0.33, random_state = 42)

Now we can instantiate the Random Forest classifier with some hyperparameters.

A hyperparameter is a type of parameter that the model can't learn from the data; instead, it is set by the data scientist to tune the model's learning process.
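To make the idea concrete, here is a minimal sketch of how hyperparameters are set and inspected in sklearn. The values chosen (50 trees, depth 10) are only illustrative assumptions, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are passed to the constructor, before any training happens.
demo_model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=1)

# get_params() returns every hyperparameter and its current value as a dict.
demo_params = demo_model.get_params()
for name in ("n_estimators", "max_depth", "min_samples_leaf"):
    print(name, "=", demo_params[name])
```

Any hyperparameter we don't set explicitly keeps its default value, which is why the printed model below lists many more settings than the one we passed.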

In [5]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state = 1)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [6]:
preds = rf_model.predict(X_train)
preds

array(['lying', 'bending1', 'cycling', ..., 'cycling', 'bending1',
       'standing'], dtype=object)

These are basically the key steps required for training a Random Forest classifier. This was quite straightforward, right? Training a machine learning model is incredibly easy but getting meaningful and accurate results is where the challenges lie.

## Evaluating the Model's Performance

Now that we know how to train a Random Forest classifier, it is time to check whether we did a good job or not.
What we want is a model that makes extremely accurate predictions, so we need to assess its performance using some metric.

For classification problems, multiple metrics can be used to assess the model's predictive power, such as F1 score, precision, recall or ROC AUC. Each of them has its own specificity, and depending on the project and dataset, you may use one or another.
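As a quick illustration of some of these metrics on a multiclass problem, the sketch below computes precision, recall and F1 on a handful of made-up labels (they are hypothetical, not taken from the dataset). For more than two classes, sklearn needs to know how to combine the per-class scores; `average="macro"` is one option among several and is used here only as an example.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for a three-class problem (not taken from the dataset).
y_true_demo = ["sitting", "sitting", "lying", "lying", "cycling", "cycling"]
y_pred_demo = ["sitting", "lying", "lying", "lying", "cycling", "sitting"]

# average="macro" computes the metric per class, then takes the unweighted mean.
macro_precision = precision_score(y_true_demo, y_pred_demo, average="macro")
macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(macro_precision, macro_recall, macro_f1)
```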

We will use the metric called **accuracy score**. It calculates the ratio between the number of correct predictions and the total number of predictions made by the model:

![Accuracy score formula](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/15c/062/734/B15109_04_05.png)

For instance, if your model made 950 correct predictions out of 1,000 cases, then the accuracy score would be 950/1000 = 0.95. This would mean that your model was 95% accurate on that dataset. The sklearn package provides a function to calculate this score automatically and it is called accuracy_score(). 
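Before using the function on our model, here is a small worked example of the same ratio computed by hand. The labels are hypothetical and chosen so that 4 of the 5 predictions are correct.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match the true class.
y_true_demo = ["lying", "sitting", "cycling", "lying", "standing"]
y_pred_demo = ["lying", "sitting", "cycling", "sitting", "standing"]

# Accuracy = number of correct predictions / total number of predictions.
n_correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = n_correct / len(y_true_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy)   # 0.8
print(sklearn_accuracy)  # 0.8
```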

In [7]:
from sklearn.metrics import accuracy_score 

accuracy_score(y_train, preds)

0.9972120952176711

We achieved an accuracy score of 0.997 on our training data. This means we accurately predicted more than 99% of these cases. This is an amazing result! Unfortunately, this doesn't mean you will be able to achieve such a high score for new, unseen data. Your model may have just learned the patterns that are only relevant to this training set, and in that case, the model will overfit.

But how can we assess the performance of a model for unseen data? Is there a way to get that kind of assessment? The answer to these questions is yes.

In [8]:
predictions = rf_model.predict(X_test)
accuracy_score(y_test, predictions)

0.7916696901531094

OK. Now the accuracy has dropped drastically to 0.79. The difference between the training and testing sets is quite big. This tells us our model is actually overfitting and learned only the patterns relevant to the training set. In an ideal case, the performance of your model should be equal or very close to equal for those two sets.
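Since we will keep comparing training and testing accuracy, it can help to wrap the comparison in a small function. The sketch below is one possible way to do it; `accuracy_gap` is a hypothetical helper name, and the data is synthetic (generated with `make_classification`) so that the example runs without downloading anything.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_gap(model, X_tr, X_te, y_tr, y_te):
    """Return (train accuracy, test accuracy, difference) for a fitted model."""
    acc_tr = accuracy_score(y_tr, model.predict(X_tr))
    acc_te = accuracy_score(y_te, model.predict(X_te))
    return acc_tr, acc_te, acc_tr - acc_te


# Synthetic stand-in data with three classes (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
demo_train_acc, demo_test_acc, demo_gap = accuracy_gap(demo_model, X_tr, X_te, y_tr, y_te)
print(f"train={demo_train_acc:.3f} test={demo_test_acc:.3f} gap={demo_gap:.3f}")
```

A large positive gap is the sign of overfitting described above; a gap close to zero means the model behaves similarly on seen and unseen data.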

#### Exercise: Building a Model for Classifying Animal Type

In [9]:
import pandas as pd
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/openml_phpZNNasq.csv'
df = pd.read_csv(file_url)

In [10]:
df.head()

Unnamed: 0,animal,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,aardvark,True,False,False,True,False,False,True,True,True,True,False,False,4,False,False,True,mammal
1,antelope,True,False,False,True,False,False,False,True,True,True,False,False,4,True,False,True,mammal
2,bass,False,False,True,False,False,True,True,True,True,False,False,True,0,True,False,False,fish
3,bear,True,False,False,True,False,False,True,True,True,True,False,False,4,False,False,True,mammal
4,boar,True,False,False,True,False,False,True,True,True,True,False,False,4,True,False,True,mammal


In [11]:
df.drop(columns = 'animal', inplace = True)

In [12]:
y = df.pop('type')

In [13]:
df.head()

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize
0,True,False,False,True,False,False,True,True,True,True,False,False,4,False,False,True
1,True,False,False,True,False,False,False,True,True,True,False,False,4,True,False,True
2,False,False,True,False,False,True,True,True,True,False,False,True,0,True,False,False
3,True,False,False,True,False,False,True,True,True,True,False,False,4,False,False,True
4,True,False,False,True,False,False,True,True,True,True,False,False,4,True,False,True


In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.4, random_state = 188)

In [15]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state = 42)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [16]:
train_preds = rf_model.predict(X_train)
train_preds

array(['mammal', 'mammal', 'mammal', 'fish', 'mammal', 'insect', 'fish',
       'bird', 'mammal', 'mammal', 'fish', 'bird', 'reptile', 'bird',
       'fish', 'mammal', 'mammal', 'bird', 'bird', 'mammal', 'bird',
       'bird', 'mammal', 'invertebrate', 'reptile', 'invertebrate',
       'fish', 'bird', 'mammal', 'mammal', 'amphibian', 'mammal',
       'invertebrate', 'mammal', 'mammal', 'insect', 'mammal', 'fish',
       'invertebrate', 'mammal', 'invertebrate', 'invertebrate', 'insect',
       'amphibian', 'mammal', 'reptile', 'amphibian', 'invertebrate',
       'mammal', 'fish', 'bird', 'mammal', 'mammal', 'bird', 'mammal',
       'mammal', 'fish', 'mammal', 'bird', 'fish'], dtype=object)

In [17]:
from sklearn.metrics import accuracy_score 

train_acc = accuracy_score(y_train, train_preds)
print(train_acc)

1.0


In [18]:
test_preds = rf_model.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)
print(test_acc)

0.9024390243902439


## Number of Tree Estimators
The first hyperparameter you will look at in this section is called **n_estimators**. This hyperparameter is responsible for defining the number of trees that will be trained by the **RandomForest** algorithm.

A tree is a logical graph that maps a decision and its outcomes at each of its nodes. It is a series of yes/no questions that lead to different outcomes.


A leaf is a special type of node where the model will make a prediction. There will be no split after a leaf. A single node split of a tree may look like this:

![Example of a single tree node split](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/cb7/169/c8b/B15109_04_14.PNG)


A tree node is composed of a question and two outcomes, depending on whether the condition defined by the question is met or not. In the preceding example, the question is: is avg_rss12 > 41? If the answer is yes, the outcome is the bending_1 leaf; if not, it is the sitting leaf.

![Example of a tree composed of three nodes](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/55c/7c6/e07/B15109_04_15.PNG)

In the preceding example, the tree is composed of three nodes with different questions. Now, for an observation to be predicted as sitting, it will need to meet the conditions: avg_rss13 <= 41, var_rss > 0.7, and avg_rss13 <= 16.25.

The **RandomForest** algorithm will build this kind of tree based on the training data it sees. Basically, it will go through every column of the dataset and see which split value will best help to separate the data into two groups of similar classes. Taking the preceding example, the first node with the avg_rss13 > 41 condition will help to get the group of data on the left-hand side with mostly the bending_1 class. The RandomForest algorithm usually builds several trees of this kind, and this is the reason why it is called a forest.
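To see what such a tree looks like once it has been learned from data, we can print the questions of a small fitted tree. This is only a sketch: it uses a single `DecisionTreeClassifier` and the built-in iris dataset as a stand-in for the activity data, so the column names and thresholds differ from the figures above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is used only as a small built-in stand-in for the activity data.
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
demo_tree.fit(iris.data, iris.target)

# export_text prints each node's question and the class predicted at each leaf.
tree_rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(tree_rules)
```

Each indented line is a question of the form `feature <= value`, and each `class:` line is a leaf where a prediction is made.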


The **n_estimators** hyperparameter is used to specify the number of trees the RandomForest algorithm will build. The default was 10 trees in older versions of sklearn; from version 0.22 onward it is 100, which is the n_estimators=100 shown in the model output earlier. For a given observation, the algorithm asks each tree to make a prediction, then averages those predictions and uses the result as the final prediction for this input. For instance, if, out of 10 trees, 8 of them predict the outcome sitting, then the RandomForest algorithm will use this outcome as the final prediction.
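The way the trees are combined can be checked by hand. In sklearn, the forest averages the class probabilities predicted by each tree and picks the class with the highest average. The sketch below reproduces this for one observation, using the built-in iris dataset as a stand-in and 10 trees as an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_forest = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; ask every tree for its
# class probabilities on the first observation.
one_row = X_demo[:1]
per_tree_proba = np.array([t.predict_proba(one_row)[0] for t in demo_forest.estimators_])

# Average the trees' probabilities and pick the class with the highest mean.
mean_proba = per_tree_proba.mean(axis=0)
combined_class = demo_forest.classes_[np.argmax(mean_proba)]

print(mean_proba)
print(combined_class, demo_forest.predict(one_row)[0])
```

The two values on the last line should agree, since the manual average follows the same logic as the forest's own `predict` method.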

In general, the higher the number of trees, the better the performance you will get, although the improvement tends to level off as more trees are added.
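A simple way to explore this is to train the same model with several values of n_estimators and compare the scores. The sketch below does so on synthetic data (an assumption, so that it runs on its own); the list of values tried is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three classes.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Train one forest per value and record (train accuracy, test accuracy).
scores_by_trees = {}
for n_trees in (1, 5, 25, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    forest.fit(X_tr, y_tr)
    scores_by_trees[n_trees] = (
        accuracy_score(y_tr, forest.predict(X_tr)),
        accuracy_score(y_te, forest.predict(X_te)),
    )
    print(n_trees, scores_by_trees[n_trees])
```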


In [19]:
#n_estimators = 2
rf_model2 = RandomForestClassifier(random_state=1, n_estimators=2)
rf_model2.fit(X_train, y_train)
preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
print(accuracy_score(y_train, preds2))
print(accuracy_score(y_test, test_preds2))

0.95
0.8780487804878049


In [20]:
rf_model3 = RandomForestClassifier(random_state=1, n_estimators=50)
rf_model3.fit(X_train, y_train)
preds3 = rf_model3.predict(X_train)
test_preds3 = rf_model3.predict(X_test)
print(accuracy_score(y_train, preds3))
print(accuracy_score(y_test, test_preds3))

1.0
0.8780487804878049


#### Exercise: Tuning n_estimators to Reduce Overfitting

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [22]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/openml_phpZNNasq.csv'

In [23]:
df = pd.read_csv(file_url)

In [24]:
df.drop(columns = 'animal', inplace = True)

In [25]:
y = df.pop("type")

In [26]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.40, random_state= 188)

In [28]:
rf_model = RandomForestClassifier(random_state = 42, n_estimators= 1)
rf_model.fit(X_train, y_train)
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)

In [29]:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

In [30]:
print(train_acc)
print(test_acc)

0.9166666666666666
0.8048780487804879


In [31]:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=30)
rf_model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [32]:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
print(train_acc2)
print(test_acc2)

1.0
0.9024390243902439


## Maximum Depth
Random Forest builds multiple trees to make predictions. Increasing the number of trees does improve model performance, but it usually doesn't help much to decrease the risk of overfitting. Our model in the previous example is still performing much better on the training set (data it has already seen) than on the testing set (unseen data).

There are different hyperparameters that can help to lower the risk of overfitting for RandomForest and one of them is called **max_depth**.

This hyperparameter defines the depth of the trees built by Random Forest. It tells the model how many nodes (questions) it can create before making a prediction. But how will that help to reduce overfitting? Well, let's say you built a single tree and set the max_depth hyperparameter to 50. This would mean that there would be some cases where you could ask up to 50 different questions before making a prediction. So, the logic would be IF X1 > value1 AND X2 > value2 AND X1 <= value3 AND … AND X3 > value50 THEN predict class A.

This is a very specific rule. In the end, it may apply to only a few observations in the training set, with this case appearing once in a blue moon. Therefore, your model would be overfitting. By default, the value of the **max_depth** parameter is **None**, which means there is no limit set for the depth of the trees.
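We can check the effect of this hyperparameter by looking at how deep the fitted trees actually are. The sketch below is an illustration on the built-in iris dataset (a stand-in, not our data); it compares a forest limited to a depth of 2 with one that has no limit.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

shallow_forest = RandomForestClassifier(
    n_estimators=20, max_depth=2, random_state=1
).fit(X_demo, y_demo)
unlimited_forest = RandomForestClassifier(
    n_estimators=20, max_depth=None, random_state=1
).fit(X_demo, y_demo)

# get_depth() returns the depth of each fitted tree in the forest.
shallow_depths = [t.get_depth() for t in shallow_forest.estimators_]
unlimited_depths = [t.get_depth() for t in unlimited_forest.estimators_]

print(max(shallow_depths), max(unlimited_depths))
```

With a limit in place, no tree can grow past it; without one, each tree keeps splitting until its leaves are pure or another stopping rule applies.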


In [33]:
rf_model4 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=3)
rf_model4.fit(X_train, y_train)
preds4 = rf_model4.predict(X_train)
test_preds4 = rf_model4.predict(X_test)
print(accuracy_score(y_train, preds4))
print(accuracy_score(y_test, test_preds4))

0.95
0.8536585365853658


In [34]:
rf_model5 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10)
rf_model5.fit(X_train, y_train)
preds5 = rf_model5.predict(X_train)
test_preds5 = rf_model5.predict(X_test)
print(accuracy_score(y_train, preds5))
print(accuracy_score(y_test, test_preds5))

1.0
0.8780487804878049


In [35]:
rf_model6 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=50)
rf_model6.fit(X_train, y_train)
preds6 = rf_model6.predict(X_train)
test_preds6 = rf_model6.predict(X_test)
print(accuracy_score(y_train, preds6))
print(accuracy_score(y_test, test_preds6))

1.0
0.8780487804878049


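The model is still overfitting with max_depth = 50: its scores are the same as with max_depth = 10 (1.0 on the training set and 0.878 on the testing set). Limiting the depth to 3 narrows the gap (0.95 versus 0.854) but lowers the testing accuracy. It seems the sweet spot to get good predictions and not much overfitting is around 10 for the max_depth hyperparameter in this dataset.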

#### Exercise: Tuning max_depth to Reduce Overfitting

In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [37]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/openml_phpZNNasq.csv'

In [38]:
df = pd.read_csv(file_url)

In [39]:
df.drop(columns='animal', inplace=True)
y = df.pop('type')

In [40]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.4, random_state=188)


In [41]:
rf_model = RandomForestClassifier(random_state=42, n_estimators=30, max_depth=5)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [42]:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)

In [43]:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

In [44]:
print(train_acc)
print(test_acc)

1.0
0.9024390243902439


In [45]:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=30, max_depth=2)
rf_model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [46]:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)

In [47]:
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)

In [48]:
print(train_acc2)
print(test_acc2)

0.9
0.8292682926829268


## Minimum Sample in Leaf

This hyperparameter, as its name implies, is related to the leaf nodes of the trees. We saw earlier that **RandomForest** builds nodes that will clearly separate observations into two different groups. If we look at the tree example, the top node splits the data into two groups: the left-hand group contains mainly observations for the bending_1 class, and the right-hand group can be from any class.

This sounds like a reasonable split but are we sure it is not increasing the risk of overfitting? For instance, what if this split leads to only one observation falling on the left-hand side? This rule would be very specific (applying to only one single case) and we can't say it is generic enough for unseen data. It may be an edge case in the training set that will never happen again.

It would be great if we could let the model know to not create such specific rules that happen quite infrequently.

Luckily, **RandomForest** has such a hyperparameter, **min_samples_leaf**. This hyperparameter will specify the minimum number of observations that will have to fall under a leaf node to be considered in the tree.

For instance, if we set **min_samples_leaf** to **3**, then RandomForest will only consider a split that leads to at least three observations on both the left and right leaf nodes. If this condition is not met for a split, the model will not consider it and will exclude it from the tree. The default value of this hyperparameter is **1**.
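To verify that the constraint is respected, we can inspect the leaves of every fitted tree and look for the smallest one. The sketch below is an illustration on the built-in iris dataset (a stand-in); it relies on the `tree_` attribute of each fitted tree, where a node with `children_left == -1` is a leaf.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_forest = RandomForestClassifier(
    n_estimators=20, min_samples_leaf=3, random_state=1
).fit(X_demo, y_demo)

smallest_leaf = None
for fitted_tree in demo_forest.estimators_:
    structure = fitted_tree.tree_
    # A node is a leaf when it has no left child (children_left == -1).
    is_leaf = structure.children_left == -1
    # n_node_samples holds the number of training samples that reached each node.
    tree_min = int(structure.n_node_samples[is_leaf].min())
    smallest_leaf = tree_min if smallest_leaf is None else min(smallest_leaf, tree_min)

print(smallest_leaf)
```

The printed value is the size of the smallest leaf across the whole forest, which should never fall below the value of min_samples_leaf.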

In [53]:
rf_model7 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=3)
rf_model7.fit(X_train, y_train)
preds7 = rf_model7.predict(X_train)
test_preds7 = rf_model7.predict(X_test)
print(accuracy_score(y_train, preds7))
print(accuracy_score(y_test, test_preds7))

1.0
0.8780487804878049


With min_samples_leaf=3, the accuracy for both the training and testing sets didn't change much compared to the best model we found in the previous section. Let's try increasing it to 10:

In [54]:
rf_model8 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=10)
rf_model8.fit(X_train, y_train)
preds8 = rf_model8.predict(X_train)
test_preds8 = rf_model8.predict(X_test)
print(accuracy_score(y_train, preds8))
print(accuracy_score(y_test, test_preds8))

0.7333333333333333
0.7317073170731707


 
Now the accuracy dropped for both the training and testing sets, to about 0.73, but the difference between them has almost disappeared. So, our model is overfitting less, although at the cost of lower accuracy. Let's try another value for this hyperparameter, 25:

In [55]:
rf_model9 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=25)
rf_model9.fit(X_train, y_train)
preds9 = rf_model9.predict(X_train)
test_preds9 = rf_model9.predict(X_test)
print(accuracy_score(y_train, preds9))
print(accuracy_score(y_test, test_preds9))

0.4
0.4146341463414634


When choosing the optimal value for this hyperparameter, you need to be careful: a value that's too low will increase the chance of the model overfitting, but on the other hand, setting a very high value will lead to underfitting (the model will not accurately predict the right outcome).

For instance, if you have a dataset of 1,000 rows and you set min_samples_leaf to 400, the model will not be able to find good splits to predict 5 different classes. In this case, the model can create only one single split and will be able to predict only two different classes instead of 5. It is good practice to start with low values first and then progressively increase them until you reach satisfactory performance.
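The 1,000-row example can be reproduced with a short sketch. The data here is synthetic (1,000 rows, 5 classes, generated with `make_classification`), and a single `DecisionTreeClassifier` is used instead of a forest so that the tree sees all 1,000 rows; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 1,000 rows and 5 classes.
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

# Every leaf must hold at least 400 rows, so at most one split is possible.
big_leaf_tree = DecisionTreeClassifier(min_samples_leaf=400, random_state=1)
big_leaf_tree.fit(X_demo, y_demo)

n_leaves = big_leaf_tree.get_n_leaves()
predicted_classes = np.unique(big_leaf_tree.predict(X_demo))
print(n_leaves)
print(predicted_classes)
```

With no more than two leaves, the tree can output no more than two distinct classes, however many classes the data contains.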

#### Exercise: Tuning min_samples_leaf

In [56]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [57]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/openml_phpZNNasq.csv'

In [58]:
df = pd.read_csv(file_url)

In [60]:
df.drop(columns='animal', inplace=True)
y = df.pop('type')

In [61]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.4, random_state=188)


In [62]:
rf_model = RandomForestClassifier(random_state=42, n_estimators=30, max_depth=2, min_samples_leaf=3)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=3, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [63]:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)

In [64]:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

In [65]:
print(train_acc)
print(test_acc)

0.8333333333333334
0.8048780487804879


In [66]:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=30, max_depth=2, min_samples_leaf=7)
rf_model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=7, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [67]:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)

In [68]:
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)

In [69]:
print(train_acc2)
print(test_acc2)

0.8
0.8048780487804879


## Maximum Features

RandomForest builds multiple trees and takes the average to make predictions. This is why it is called a forest, but we haven't really discussed the "random" part yet.
How does building multiple trees help to get better predictions, and won't all the trees look the same given that the input data is the same?


Let's use the analogy of a court trial. In some countries, the final decision of a trial is made either by a judge or by a jury. A judge is a person who knows the law in detail and can decide whether a person has broken the law or not. On the other hand, a jury is composed of people from different backgrounds who don't know each other or any of the parties involved in the trial and have limited knowledge of the legal system. In this case, we are asking random people who are not experts in the law to decide the outcome of a case. This sounds very risky at first: the risk of one person making the wrong decision is very high. But in fact, the risk of 10 or 20 people all making the wrong decision is relatively low.

But there is one condition that needs to be met for this to work: **randomness**. If the people in the jury come from the same background, they may share the same way of thinking and make similar decisions. For instance, if a group of people were raised in a community where you only drink hot chocolate at breakfast and one day you ask them if it is OK to drink coffee at breakfast, they would say no.

On the other hand, say you got another group of people from different backgrounds with different habits: some drink coffee, others tea, a few drink orange juice, and so on. If you asked them the same question, you would end up with the majority of them saying yes. Because we randomly picked these people, they have less bias as a group, and this therefore lowers the risk of them making a wrong decision.

RandomForest actually applies the same logic: it builds a number of trees independently of each other by randomly sampling data. A tree may see 60% of the data, another 70% and so on. By doing so, there is a high chance that the trees are absolutely different from each other and don't share the same bias. This is the secret of the RandomForest: building multiple random trees leads to higher accuracy.
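The row sampling can be sketched with NumPy alone. The assumption here is bootstrap sampling (drawing as many rows as the dataset has, with replacement), which matches the bootstrap=True setting shown in the model output earlier; the dataset size of 1,000 rows is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000

# A bootstrap sample draws n_rows row indices with replacement,
# so some rows are repeated and others are left out.
shares_seen = []
for tree_id in range(3):
    drawn = rng.integers(0, n_rows, size=n_rows)
    share = np.unique(drawn).size / n_rows
    shares_seen.append(share)
    print(f"tree {tree_id}: sees {share:.0%} of the distinct rows")
```

Each simulated tree ends up with a different subset of the rows, which is one source of the differences between trees.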

But this is not the only way that RandomForest creates randomness. It also does so by randomly sampling columns. Each tree will only see a subset of the features rather than all of them; in sklearn, this random subset is drawn again for every split within a tree.
And this is exactly what the **max_features** hyperparameter is for: it sets the maximum number of features a tree is allowed to consider.


In **sklearn**, you can specify the value of this hyperparameter as:
* The maximum number of features, as an integer.
* A ratio, as a float between 0 and 1 giving the fraction of features allowed.
* **sqrt**, which uses the square root of the number of features as the maximum value. For example, if a dataset has 25 features, the square root is 5 and this will be the value for max_features.
* **log2**, which uses the log base 2 of the number of features as the maximum value. For example, if a dataset has eight features, its log2 is 3 and this will be the value for max_features.
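The four options in the list above can be turned into a small calculation. The function below, `resolve_max_features`, is a hypothetical helper written for this illustration and is not part of sklearn; in particular, truncating the result to an integer is an assumption about how fractional values are handled.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: translate a max_features setting into a feature count."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # Assumption: a ratio is multiplied by the feature count and truncated.
        return max(1, int(max_features * n_features))
    return int(max_features)  # already an integer count


print(resolve_max_features("sqrt", 25))  # 5
print(resolve_max_features("log2", 8))   # 3
print(resolve_max_features(0.5, 16))     # 8
print(resolve_max_features(2, 16))       # 2
```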

In [71]:
rf_model10 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=10, max_features=2)
rf_model10.fit(X_train, y_train)
preds10 = rf_model10.predict(X_train)
test_preds10 = rf_model10.predict(X_test)
print(accuracy_score(y_train, preds10))
print(accuracy_score(y_test, test_preds10))

0.7333333333333333
0.7317073170731707


In [75]:
rf_model11 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=10, max_features=0.7)
rf_model11.fit(X_train, y_train)
preds11 = rf_model11.predict(X_train)
test_preds11 = rf_model11.predict(X_test)
print(accuracy_score(y_train, preds11))
print(accuracy_score(y_test, test_preds11))

0.7333333333333333
0.7317073170731707


With this ratio, the accuracy scores for the training and testing sets are the same as with max_features=2 (about 0.73 each), so on this dataset changing the hyperparameter made no visible difference. Let's give it a shot with the log2 option:

In [78]:
rf_model12 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=10, max_features='log2')
rf_model12.fit(X_train, y_train)
preds12 = rf_model12.predict(X_train)
test_preds12 = rf_model12.predict(X_test)
print(accuracy_score(y_train, preds12))
print(accuracy_score(y_test, test_preds12))

0.7333333333333333
0.7317073170731707


#### Exercise: Tuning max_features

In [79]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [80]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/openml_phpZNNasq.csv'
df = pd.read_csv(file_url)

In [81]:
df.drop(columns='animal', inplace=True)
y = df.pop('type')

In [82]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.4, random_state=188)

In [83]:
rf_model = RandomForestClassifier(random_state=42, n_estimators=30, max_depth=2, min_samples_leaf=7, max_features=10)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features=10,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=7, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [85]:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.85
0.8048780487804879


In [86]:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=30, max_depth=2, min_samples_leaf=7, max_features=0.2)
rf_model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features=0.2,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=7, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [88]:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
print(train_acc2)
print(test_acc2)

0.8333333333333334
0.8048780487804879


## Activity: Train a Random Forest Classifier on the ISOLET Dataset


In [89]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/phpB0xrNj.csv'
df = pd.read_csv(file_url)

In [90]:
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,...,f579,f580,f581,f582,f583,f584,f585,f586,f587,f588,f589,f590,f591,f592,f593,f594,f595,f596,f597,f598,f599,f600,f601,f602,f603,f604,f605,f606,f607,f608,f609,f610,f611,f612,f613,f614,f615,f616,f617,class
0,-0.4394,-0.093,0.1718,0.462,0.6226,0.4704,0.3578,0.0478,-0.1184,-0.231,-0.2958,-0.2704,-0.262,-0.217,-0.0874,-0.0564,0.0254,0.0958,0.4226,0.6648,0.9184,0.9718,0.9324,0.707,0.6986,0.755,0.8816,1.0,0.938,0.845,0.7268,0.5578,-0.433,-0.1982,0.127,0.3666,0.4496,0.4258,0.2646,-0.0368,...,1,-1,-1.0,-1.0,-1.0,0.1334,-1,-0.077,0.0512,0.2564,0.5642,0.4872,0.077,0.4358,0.7436,0.5128,0.6666,0.641,0.6154,1.0,0.8206,0.641,0.359,0.6924,0.4358,0.1538,0.4616,0.6154,0.3334,0.3334,0.4102,0.2052,0.3846,0.359,0.5898,0.3334,0.641,0.5898,-0.4872,'1'
1,-0.4348,-0.1198,0.2474,0.4036,0.5026,0.6328,0.4948,0.0338,-0.052,-0.1302,-0.0964,-0.2084,-0.0494,-0.0494,-0.2942,0.0704,0.0546,0.1302,0.5652,0.6848,0.776,0.9558,0.8542,0.7474,0.6094,0.7708,0.8282,1.0,0.9974,0.948,0.7422,0.5678,-0.2196,0.109,0.5892,0.8768,1.0,0.9936,0.7852,0.3712,...,-1,-1,-1.0,-1.0,-1.0,-1.0,-1,0.0228,-0.091,0.2728,0.8636,0.75,0.4318,0.7272,0.659,0.409,0.7728,1.0,0.7272,0.4772,0.4772,0.4772,0.659,0.1818,0.4318,0.3864,0.841,0.8864,0.25,0.2272,0.0,0.2954,0.2046,0.4772,0.0454,0.2046,0.4318,0.4546,-0.091,'1'
2,-0.233,0.2124,0.5014,0.5222,-0.3422,-0.584,-0.7168,-0.6342,-0.8614,-0.8318,-0.7228,-0.6312,-0.4986,-0.708,-0.6666,-0.5428,-0.413,-0.3776,-0.0472,0.1356,0.6136,0.8024,1.0,0.9794,0.9352,0.8732,0.944,0.9588,0.6962,0.4838,0.3982,0.2064,-0.327,0.0134,0.362,0.3218,-0.4558,-0.8096,-0.7748,-0.7238,...,-1,1,-0.8,-1.0,-0.6,-0.8334,-1,-0.4286,-0.254,-0.365,-0.0952,-0.0794,0.0318,-0.2064,0.0634,0.1112,0.1746,0.238,0.1904,0.508,0.5396,0.0318,-0.0158,0.7142,1.0,0.4126,-0.0794,-0.0476,0.0,0.0952,-0.1112,-0.0476,-0.1746,0.0318,-0.0476,0.1112,0.254,0.1588,-0.4762,'2'
3,-0.3808,-0.0096,0.2602,0.2554,-0.429,-0.6746,-0.6868,-0.665,-0.841,-0.9614,-0.7374,-0.7084,-0.6772,-0.6338,-0.6482,-0.624,-0.3976,-0.5662,-0.2168,0.0458,0.3832,0.6168,0.8988,1.0,0.9156,0.8796,0.9132,0.7132,0.759,0.7278,0.5856,0.506,-0.371,-0.0868,0.4114,0.3438,-0.1816,-0.5964,-0.6888,-0.6686,...,-1,1,-1.0,-1.0,-1.0,-0.8334,-1,-0.2374,-0.5396,0.1798,0.2086,0.0792,0.036,0.3238,0.3956,0.41,0.2662,0.5252,0.367,0.9136,1.0,0.41,0.1224,0.5252,0.4388,0.0216,-0.0792,0.3812,0.2806,0.0648,-0.0504,-0.036,-0.1224,0.1366,0.295,0.0792,-0.0072,0.0936,-0.151,'2'
4,-0.3412,0.0946,0.6082,0.6216,-0.1622,-0.3784,-0.4324,-0.4358,-0.4966,-0.5406,-0.5472,-0.544,-0.4494,-0.2332,-0.2332,-0.1148,0.0068,0.0778,0.4864,0.9054,0.956,0.7602,0.777,0.7636,0.8818,1.0,0.9426,0.7162,0.5472,0.4122,0.277,0.2364,-0.4684,-0.1394,0.421,0.4316,-0.3106,-0.5448,-0.5132,-0.6368,...,1,-1,-1.0,-1.0,-1.0,1.0,-1,0.25,0.5,0.0624,0.3438,0.25,0.25,0.625,0.25,0.5312,0.4376,0.4688,0.5626,0.5938,0.3438,0.5626,0.25,1.0,0.9376,0.3438,0.2812,-0.0312,0.4376,0.2812,0.1562,0.3124,0.25,-0.0938,0.1562,0.3124,0.3124,0.2188,-0.25,'3'


In [91]:
y = df.pop('class')

In [92]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.25, random_state = 101)

In [94]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

1.0
0.9343589743589743


In [95]:
rfc = RandomForestClassifier(
    n_estimators = 20
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.9998289721224559
0.9158974358974359


In [96]:
rfc = RandomForestClassifier(
    n_estimators = 50
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

1.0
0.9282051282051282


In [97]:
rfc = RandomForestClassifier(
    n_estimators = 20, max_depth = 10
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.9793056268171711
0.9051282051282051


In [98]:
rfc = RandomForestClassifier(
    n_estimators = 20, max_depth = 5
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.8542842483324782
0.8148717948717948


In [99]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 5
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.8754917051479391
0.838974358974359


In [100]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 10
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.9840944073884043
0.9235897435897436


In [101]:
rfc = RandomForestClassifier(
    n_estimators = 20, max_depth = 5, min_samples_leaf = 10
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.8416281853942192
0.8082051282051282


In [102]:
rfc = RandomForestClassifier(
    n_estimators = 20, max_depth = 5, min_samples_leaf = 50
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.8216179237215666
0.7825641025641026


In [103]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 5, min_samples_leaf = 5
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.8590730289037113
0.8282051282051283


In [104]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 5, min_samples_leaf = 50
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.8703608688216179
0.8302564102564103


In [105]:
rfc = RandomForestClassifier(
    n_estimators = 20, max_depth = 5, min_samples_leaf = 10, max_features = 0.5
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.7545749957243031
0.7384615384615385


In [106]:
rfc = RandomForestClassifier(
    n_estimators = 20, max_depth = 5, min_samples_leaf = 10, max_features = 0.3
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.7549170514793911
0.7297435897435898


In [107]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 5, min_samples_leaf = 10, max_features = 0.3
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.7839917906618779
0.7528205128205128


In [108]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 5, min_samples_leaf = 10, max_features = 0.5
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.7523516333162306
0.7235897435897436


In [109]:
rfc = RandomForestClassifier(
    n_estimators = 50, max_depth = 5, min_samples_leaf = 50, max_features = 0.5
)
rfc.fit(X_train, y_train)
train_preds = rfc.predict(X_train)
test_preds = rfc.predict(X_test)
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print(train_acc)
print(test_acc)

0.7349067898067385
0.7164102564102565
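
The cells above repeat the same fit-and-score steps for each hyperparameter combination. As a minimal sketch, the same comparison could be written as a loop. The helper name `score_rf`, the fixed `random_state` values and the synthetic dataset from `make_classification` are assumptions made for illustration (they keep the example self-contained and repeatable); they are not part of the original activity, so the printed scores will not match the ISOLET results above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 600 rows, 20 features, 4 classes
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101
)


def score_rf(**params):
    """Hypothetical helper: fit a Random Forest and return (train_acc, test_acc)."""
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Try every combination of the two hyperparameters
results = {}
for n_estimators, max_depth in product([20, 50], [5, 10]):
    results[(n_estimators, max_depth)] = score_rf(
        n_estimators=n_estimators, max_depth=max_depth
    )

# Print training and testing accuracy side by side to spot overfitting
for (n_estimators, max_depth), (train_acc, test_acc) in results.items():
    print(f"n_estimators={n_estimators}, max_depth={max_depth}: "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

Keeping the results in a dictionary makes it easier to compare the gap between training and testing accuracy across combinations than scrolling through separate cell outputs.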


# Summary
We have finally reached the end of this chapter on multiclass classification with Random Forest. We learned that multiclass classification is an extension of binary classification: instead of predicting only two classes, target variables can have many more values. We saw how we can train a Random Forest model in just a few lines of code and assess its performance by calculating the accuracy score for the training and testing sets. Finally, we learned how to tune some of its most important hyperparameters: n_estimators, max_depth, min_samples_leaf, and max_features. We also saw how their values can have a significant impact not only on the predictive power of a model but also on its ability to generalize to unseen data.

In real projects, it is extremely important to choose a valid testing set. This is your final proxy before putting a model into production, so you really want it to reflect the types of data you think it will receive in the future. For instance, if your dataset has a date field, you can use the last few weeks or months as your testing set and everything before that date as the training set. If you don't choose the testing set properly, you may end up with a model that looks very good and appears not to overfit, yet generates incorrect results once in production. The problem doesn't come from the model but from the fact that the testing set was chosen poorly.
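
The date-based split described above could look roughly like the following sketch. The DataFrame, its column names (`date`, `feature`, `target`) and the four-week cutoff are all hypothetical and chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical dataset (assumption): one row per day for 100 days
data = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=100, freq='D'),
    'feature': range(100),
    'target': [i % 3 for i in range(100)],
})

# Use the last four weeks as the testing set (the cutoff is an assumption)
cutoff = data['date'].max() - pd.Timedelta(weeks=4)

train_df = data[data['date'] <= cutoff]  # everything up to the cutoff
test_df = data[data['date'] > cutoff]    # the most recent four weeks

print(len(train_df), len(test_df))
```

Unlike a random split, this keeps every testing row later in time than every training row, which is closer to how the model would be used in production.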

In some projects, you may see that the dataset is split into three different sets: training, validation, and testing. The validation set can be used to tune the hyperparameters and once you are confident enough, you can test your model on the testing set. As mentioned earlier, we don't want the model to see too much of the testing set but hyperparameter tuning requires you to run a model several times until you find the optimal values. This is the reason why most data scientists create a validation set for this purpose and only use the testing set a handful of times. This will be explained in more depth in Chapter 7, The Generalization of Machine Learning Models.
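
As a rough sketch of the three-way split, one option is to call `train_test_split` twice: once to hold out the testing set and once more to carve a validation set out of the remainder. The 60/20/20 proportions, the synthetic data from `make_classification` and the candidate `max_depth` values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 20 features, 3 classes
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Step 1: hold out 20% of the data as the final testing set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 2: split the remaining 80% into training (60% of the total)
# and validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Tune max_depth using the validation set only
best_depth, best_val_acc = None, -1.0
for depth in [2, 5, 10]:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Touch the testing set once, with the selected hyperparameter
final_model = RandomForestClassifier(
    n_estimators=50, max_depth=best_depth, random_state=42
)
final_model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(best_depth, best_val_acc, test_acc)
```

The testing set is scored only after the hyperparameter has been chosen, so its accuracy remains an honest estimate of performance on unseen data.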

In the next chapter, you will be introduced to unsupervised learning and will learn how to build a clustering model with the k-means algorithm.