<a href="https://colab.research.google.com/github/shstreuber/Data-Mining/blob/master/Module6_TreesRandomForest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 6: CLASSIFICATION with Random Forest**
In this module, we are going to study different ways of setting up advanced classification models. At the end of this notebook, you will be able to:
* Explain what classification trees do and how, taken together as Random Forest, they produce more reliable results than single trees
* Build a Random Forest algorithm

**Be sure to expand all the hidden cells, run all the code, and do all the exercises--you will need the techniques for the lesson lab!**


#**AGAIN: What is the Classification Process?**
When you work with algorithms, you basically follow a pretty standard process. This process consists of the following steps:
0. **Preparation and Setup**: Loading the data and verifying that the data has indeed loaded
1. **Exploratory Data Analysis**: Getting a basic understanding of the data, including number of columns and rows, data types, data shape and distribution (remember 5-number summary?), and the like
2. **Preprocessing**: Cleaning the data up and reducing them to the smallest useful dataset with which you and your hardware will be able to work. This includes building a reduced dataframe.
3. **Splitting your data into Training and Test set**: We use the Training set to configure the model and the Test set to evaluate how well the model works. NOTE that you will need to separate the class attribute from the predictors because this is the attribute whose values you want to predict. The model never sees the class values of the test set; we hold them back so that we can check the predictions against them.
4. **Building and Training the model**: Here, you select the algorithm you are going to use and you configure it using the Training set.
5. **Evaluating the Quality of the model**: This is where you apply the configured model to the Test set and determine how accurately it handles the test data. In other words, we compare the calculated class values to the actual class values shown in the test set. At the end, you'll use THREE methods to evaluate:
* The accuracy score (shows how the calculated predictions for the test data class compare to the actual class values that you have split off)
* The Confusion Matrix (visualization which compares the number of predictions with the number of true class values)
* The Classification Report (numeric breakdown and overview of accuracy and more)

And that's it! Let's dive into the material now!


#**1. Tree-Based Classification**
A Classification tree assigns data records to discrete levels (or labels) in a class attribute. It is built through binary recursive partitioning, which means that data is being split into partitions, then sub-partitions, and sub-sub-partitions, and so on. The outcome is a tree with a root, several branches, and leaves like the one below (which comes from [this awesome post](https://towardsdatascience.com/https-medium-com-lorrli-classification-and-regression-analysis-with-decision-trees-c43cdbc58054) on classification trees that will tell you almost everything you need to know):


<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/classtree.jpeg" width="400">
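
To make "binary recursive partitioning" a bit more concrete, here is a minimal sketch that grows a very small tree and prints its split rules as text. It uses scikit-learn's built-in iris dataset as a stand-in (an assumption for illustration; it is not the dataset used in this module), and the depth limit of 2 is chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data for illustration only: 150 flowers, 4 numeric features, 3 classes
iris = load_iris()

# max_depth=2 stops the recursion after two levels of splits
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Each indented "|---" line is one binary split; "class:" lines are the leaves
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
print("Depth:", small_tree.get_depth())
print("Number of leaves:", small_tree.get_n_leaves())
```

Every split asks a yes/no question about one attribute, and each answer leads either to another split or to a leaf with a class label.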



Now take a look at the first video in which I explain this in more detail (and with examples):

<iframe width="560" height="315" src="https://www.youtube.com/embed/BxQAIyDxDKg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

In [20]:
from IPython.display import IFrame  # This is just for me so I can embed videos
IFrame(src="https://www.youtube.com/embed/BxQAIyDxDKg", width=560, height=315)

While decision tree models are awesome, they have some real disadvantages:
1. If you build your training and test sets with random sampling, no two decision trees built from the same dataset will be exactly the same, because even small changes in the training data can change the splits. So, can you ever really tell what the *real* result of your math is? Not really.
2. Left unchecked, they keep splitting until they have classified *everything* in the training set, to the point of going into too much detail and modeling noise instead of real patterns. [This is called overfitting](https://aws.amazon.com/what-is/overfitting/): the tree looks great on the training set but does poorly on new data, and on really big datasets it can waste considerable resources.
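
Both disadvantages can be seen in a few lines. The sketch below is an illustration on synthetic data, not on the insurance dataset: `make_classification` and all of its settings are assumptions chosen to produce noisy data. It trains one unpruned tree per random split and compares training accuracy with test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 600 records, 10 features, and roughly 20% of the
# labels reassigned at random so that there is noise to overfit
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0)

results = []
for seed in (1, 2, 3):
    # A different random 80/20 split for every seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    # No depth limit: the tree keeps splitting until its leaves are pure
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    results.append((seed, tree.tree_.node_count,
                    tree.score(X_tr, y_tr), tree.score(X_te, y_te)))

for seed, n_nodes, train_acc, test_acc in results:
    print(f"split seed {seed}: {n_nodes} nodes, "
          f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

The training accuracy is perfect for every split, while the test accuracy is clearly lower; that gap is what overfitting looks like. The number of nodes typically changes from split to split as well, which is the instability described in the first point.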

Wouldn't it be much better to combine different trees from randomly sampled subsets of the same dataset and then check where these trees come to the same solution?

##**Welcome to Random Forest!**

Random Forest doesn't build just one tree--it builds an entire classroom full of trees, each one of which is based on a slightly different training set (which is, in fact, a random sample drawn with replacement from the big overall training set). To make the trees even more different from one another, Random Forest also considers just a random few of the attributes whenever it looks for the best split, so that the trees do not all lean on the same strong attributes. Finally, Random Forest combines all the trees it has constructed and, for a given prediction, outputs the class assignment that is the mode of the classes (classification) or, if you run it as a regression model, the mean prediction (regression) of the individual trees.

<div>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/randomforest2.png" width="600">
</div>

So, we have:
* A number of trees
* Using a random subset of features in the dataset to make their split decisions
* Built on a number of slightly different training subsets, selected as random samples with replacement (= bootstrap aggregating or bagging) from the overall training set
* A voting function that selects the mode of the classes (classification) or the mean prediction (regression)
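
The four ingredients above can be sketched by hand in a few lines. This is a simplified illustration under stated assumptions, not the scikit-learn implementation: the data is synthetic, the names `n_trees`, `all_votes`, and `forest_pred` are made up for this example, and the vote is a plain majority vote (scikit-learn's own Random Forest averages the trees' predicted class probabilities instead).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 3 classes (labels 0, 1, 2)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for i in range(n_trees):
    # Bagging: draw row indices WITH replacement, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt": a random subset of features is tried at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# One row of predictions per tree: shape (n_trees, n_test_records)
all_votes = np.array([t.predict(X_te) for t in trees])

# Voting: for each test record, pick the class predicted most often (the mode)
forest_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T])

single_acc = (trees[0].predict(X_te) == y_te).mean()
forest_acc = (forest_pred == y_te).mean()
print(f"first tree alone: {single_acc:.2f}   majority vote: {forest_acc:.2f}")
```

On most runs the majority vote does at least as well as a typical single tree, because mistakes made by only a few trees get outvoted.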

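In other words, we introduce dual randomness into our classification (random records and random attributes) so that the individual trees make different mistakes, and then we let the trees vote. Mistakes that only a few trees make get outvoted, which usually leaves us with a more accurate and more stable model than any single tree.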

**Got Questions?**

Take a look at the awesome video below.

In [22]:
IFrame(src="https://www.youtube.com/embed/v6VJ2RO66Ag", width=560, height=315)

##**1.0. Preparation and Setup**
There really isn't anything new going on between the modules on k Nearest Neighbor and Naive Bayes and this one. As with our previous problems, we will use the insurance dataset again.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import spatial
import statsmodels.api as sm

from IPython.display import HTML # This is just for me so I can embed videos
from IPython.display import Image # This is just for me so I can embed images

#Reading in the data as insurance dataframe
insurance = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/insurance_with_categories.csv")

#Verifying that we can see the data
insurance.head()

Now we are ready for our Exploratory Data Analysis (EDA).

##**1.1 Exploratory Data Analysis (EDA)**
This is always the first step. Even though we already know this dataset, let's walk through the motions again. In the previous module, we used the ydata profiling package to generate a beautiful HTML interface with tabs that showed us everything we needed to know and then some more--but it required installing a new package. You may not always have the user permissions to do this. So, below is the basic process of data investigation.

###**1.1.1 Data Shape and Distribution**
Run each code line below to see what it does.

In [None]:
print("***DATA OVERVIEW***")
insurance.describe(include='all')

In [None]:
print("***DATA TYPES***")
insurance.dtypes

In [None]:
print("***DATA CORRELATIONS***")
insurance.corr(numeric_only=True)

## Your Turn
What do these commands show you? Why is this important? Explain in the text field below:

###**1.1.2 Some Basic Visualizations**

In [None]:
# Data Distribution (numeric data only)
insurance.hist()
insurance.plot()

I know ... I promised you a pie plot in Module 1, and that was too hard back then. Here are two ways to do this.

**NOTE** that all plots require numeric information, so you have to first count the size of each level in a categorical attribute and then build the pie size based on that. You already know groupby, so all you need to do is get the size of each group with the size() command--or you can make an array from the attribute and count the values. Both ways are shown below.

**Uncomment each of the code lines below separately to see how they work**:

In [None]:
# You can also use the groupby command we have learned earlier in this course.
# insurance.groupby('sex').size().plot(kind='pie', autopct='%.2f')
# insurance['sex'].value_counts().plot(kind='pie', autopct='%.2f')
# insurance.groupby('children').size().plot(kind='pie', autopct='%.2f')

##Your Turn
Now analyze the second code line above and then display just the counts for the levels in the 'region' attribute:

##**1.2. Preprocessing: Building the Dataframe for Analysis**
We will, as before, use the "region" attribute as the class attribute and the numeric attributes (age, bmi, children, charges) in the insurance dataframe as the predictors. Since we already know that no data is missing, all we have to do is assemble the insurance2 dataframe we are going to use.

In the code row below, build the insurance2 dataframe we need (if you don't remember how to do this, review last week's module in which we built this dataframe already):

In [None]:
insurance2 = pd.DataFrame(insurance, columns = ['age', 'bmi', 'children','charges','region'])

##**1.3. Setting up the Training and the Test Sets**
Just like before, we need to build the training set and the test set again. We want an **80% training/ 20% test split**. Finish the code below to build this (if you can't remember how to do this, use the code from either of the two previous workbooks):

In [None]:
from sklearn.model_selection import train_test_split
x = insurance2.iloc[:, :4]  # predictors: age, bmi, children, charges
y=insurance2['region'] # class labels 'southwest', 'southeast', 'northwest', 'northeast'
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.20)
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))

##**1.4. Build and Train the Random Forest classifier**
We are going to use the [RandomForestClassifier from sklearn.ensemble](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). The RandomForestClassifier has a number of really interesting parameters that we can control in order to optimize our model to run quickly and efficiently. One of them is the sub-sample size: if bootstrap=True (the default), the max_samples parameter controls how many records are drawn to build each tree; if bootstrap=False, the whole dataset is used to build each tree.



###**1.4.1 Building the Classifier**

The most important parameters are:
* **n_estimators: int, default=100** --
The number of trees in the forest.
* **criterion: {“gini”, “entropy”, “log_loss”}, default=“gini”** --
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” or “log_loss” for the information gain. Note: this parameter is tree-specific.
* **max_features: {“sqrt”, “log2”, None}, int or float, default=“sqrt”** --
The number of features to consider when looking for the best split: If int, then consider max_features features at each split. If float, then consider that fraction of the features. If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features. (Older scikit-learn versions also accepted “auto”, which meant the same as “sqrt” for classification; recent versions no longer accept it.)
* **max_depth: int, default=None** --
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
* **min_samples_split: int or float, default=2** --
The minimum number of samples required to split an internal node.
* **bootstrap: bool, default=True** -- Whether bootstrap samples are used when building trees (which is 50% of the whole idea behind Random Forest). If False, the whole dataset is used to build each tree.
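
As a quick illustration of how these parameters are passed in, here is a minimal sketch. The values (50 trees, depth 5, and so on) are arbitrary choices for demonstration, not tuned recommendations, and the name `rf_demo` is made up so that it does not collide with the classifier we build further down.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings only; every value here is an arbitrary example
rf_demo = RandomForestClassifier(
    n_estimators=50,        # 50 trees instead of the default 100
    criterion="entropy",    # split quality measured by information gain
    max_features="sqrt",    # sqrt(n_features) candidates at each split
    max_depth=5,            # stop growing each tree after 5 levels
    min_samples_split=10,   # need at least 10 records to split a node
    bootstrap=True,         # each tree sees a bootstrap sample
    random_state=42,        # fixed seed so results can be repeated
)

# get_params returns the full configuration as a dictionary
params = rf_demo.get_params()
for name in ("n_estimators", "criterion", "max_features",
             "max_depth", "min_samples_split", "bootstrap"):
    print(f"{name} = {params[name]}")
```

Limiting max_depth or raising min_samples_split makes each tree smaller, which usually speeds up training and reduces overfitting at the cost of some detail.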

Let's get started!

In [None]:
# Importing the Random Forest library.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

# Configuring the classifier and using get_params to double-check all the parameters with which it is configured
rf = RandomForestClassifier()
rf.get_params(deep=True)

###**1.4.2 Training the Classifier**
rf is our Random Forest classifier. As before, we use .fit to train the classifier on the dataset.
X_train[['age', 'bmi', 'children', 'charges']] are all the feature columns of the training set, and y_train is 'region'. Based on these we want to make a prediction.

In [None]:
rf.fit(X_train, y_train)

## **1.5. Use the Classifier to test and predict**
There is nothing different about the steps below than what you have already done. Uncomment the second line starting with "print" if you would like to see the output of your predictions.

In [None]:
y_pred = rf.predict(X_test)
# print(y_pred) # If you want to see the big long list, uncomment this line!

##**1.6. Evaluate the Quality of the Model**
OK, now we can calculate the accuracy score and then look at the Confusion Matrix.

###**1.6.1 Accuracy Score**

First, the accuracy score:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

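Would you accept a result of about 38% on an exam? (Take a look at the grading scale for this course to see where that would land you.) Your exact number will differ a little from run to run because the training/test split is random. Let's see what the Confusion Matrix tells us about this lousy score.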
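
One way to put a score like 38% into perspective is to compare it with what pure guessing would achieve. The sketch below uses hypothetical, perfectly balanced labels (an assumption; the real region counts in insurance2 are not exactly equal) and scikit-learn's DummyClassifier, which ignores the features and always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

regions = ["northeast", "northwest", "southeast", "southwest"]

# Hypothetical balanced labels: 100 training and 25 test records per region
y_demo_train = np.repeat(regions, 100)
y_demo_test = np.repeat(regions, 25)

# DummyClassifier ignores the features, so a column of zeros is enough
X_demo_train = np.zeros((len(y_demo_train), 1))
X_demo_test = np.zeros((len(y_demo_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo_train, y_demo_train)
baseline_acc = accuracy_score(y_demo_test, baseline.predict(X_demo_test))
print(f"Baseline accuracy: {baseline_acc:.2f}")
```

With four equally large classes, the baseline lands at 25%. To run the same comparison on the real data, you could fit the DummyClassifier on X_train and y_train and score it on X_test and y_test.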

###**1.6.2 Confusion Matrix**
And now the Confusion Matrix:

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=rf.classes_)
cm_display = ConfusionMatrixDisplay(cm, display_labels=rf.classes_).plot()

Let's look at the "northwest" row: Out of 21+24+8+13 = 66 true northwest values, only 24 were predicted correctly. 21 were predicted as northeast, 8 as southeast, and 13 as southwest.
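
The same row arithmetic can be written out in code. This small sketch uses only the four "northwest" numbers quoted above and assumes that the columns are in the alphabetical order of rf.classes_ (northeast, northwest, southeast, southwest).

```python
import numpy as np

labels = ["northeast", "northwest", "southeast", "southwest"]

# The "northwest" row from the text: how the true northwest records were predicted
northwest_row = np.array([21, 24, 8, 13])

total_northwest = northwest_row.sum()               # all true northwest records
correct = northwest_row[labels.index("northwest")]  # the diagonal entry
share_correct = correct / total_northwest

print(f"{correct} of {total_northwest} correct = {share_correct:.1%}")
```

Only a bit more than a third of the true northwest records were recognized as northwest. This share is what the Classification Report calls recall.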

##Your Turn
What about the "southwest" row? Are the results better or worse? Write your explanation into the text field below:

###**1.6.3 Classification Report**
The Classification Report gives us even more insights into how well (or, in our case, badly) our model performs. To read it correctly, we first have to define a few terms:

* **precision** (also called positive predictive value) is the number of correctly identified positive results divided by the number of all results that the model labeled as positive, including those labeled incorrectly ((true positives) / (true positives + false positives)). Said another way, “for all instances classified positive, what percent was correct?”
* **recall** (also known as sensitivity) is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive ((true positives) / (true positives + false negatives)). Said another way, “for all instances that were actually positive, what percent was classified correctly?”
* **f1-score** is the harmonic mean of the precision and recall. The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. As a rule of thumb, the weighted average of F1 is a better basis for comparing classifier models than global accuracy, especially when the classes are not equally large.
* **support** is the number of actual occurrences of the class in the specified dataset.
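
Here is a small worked example of these four terms. The ten labels are made up for illustration (1 = positive, 0 = negative); the hand-calculated values are then checked against scikit-learn's own functions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted positive, is positive
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted positive, is negative
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted negative, is positive

precision = tp / (tp + fp)   # 3 / 5
recall = tp / (tp + fn)      # 3 / 4
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = sum(t == 1 for t in y_true)  # actual occurrences of the positive class

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")

# The same numbers from scikit-learn
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat),
      f1_score(y_true, y_hat))
```

In a problem with more than two classes, such as our four regions, the report repeats this calculation once per class, treating that class as "positive" and all the others as "negative".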

Now you have all the tools to read the classification report below:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, labels=['northeast', 'northwest', 'southeast', 'southwest']))

Can you explain what these numbers mean for the insurance2 dataset?

# If you get Stuck

1.1 Data Shape
The analysis shows the connections and the dependencies between the different attributes. This is important because we want the X attributes (or features) to be independent from each other; the only dependent attribute should be the class attribute. If the X attributes are too correlated, we are looking at [multicollinearity](https://www.statisticshowto.com/multicollinearity/), which can impact the usefulness of our model.

In [None]:
# 1.1.2 Basic Visualizations: counts for each level of the 'region' attribute
insurance['region'].value_counts()

In [None]:
# This is the solution for section 1.2 above.
insurance2 = pd.DataFrame(insurance, columns = ['age', 'bmi', 'children','charges','region'])
insurance2.head()

In [None]:
# This is the solution for section 1.3 above (x and y are defined in that section):
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

1.6.2 The prediction is just as bad.