<img src="https://ucfai.org//course/sp19/random-forests/banner.jpg">

<div class="col-12">
    <a class="btn btn-success btn-block" href="https://ucfai.org/signup">
        First Attendance? Sign Up!
    </a>
</div>

<div class="col-12">
    <h1> A Walk Through the Random Forest </h1>
    <hr>
</div>

<div style="line-height: 2em;">
    <p>by: 
        <strong> John Muchovej</strong>
        (<a href="https://github.com/ionlights">@ionlights</a>)
     on 2019-03-27</p>
</div>

## Overview

Before getting into more complex examples, we're going to take a look at a very simple example using the Iris Dataset.

After that is done, we're going to move on to using a hybrid model made out of an Autoencoder and a Random Forest to classify hand-drawn digits from the MNIST dataset.

The final example deals with credit card fraud, and how to identify whether fraud is taking place based on a dataset of over 280,000 entries.

In [None]:
# Importing the important stuff
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, matthews_corrcoef

____
## Iris Data Set

This is a classic dataset of flowers. The goal is to have the model classify the types of flowers based on 4 factors. Those factors are sepal length, sepal width, petal length, and petal width, which are all measured in cm. The dataset is very old in comparison to many of the datasets we use, coming from a [1936 paper about taxonomy](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1936.tb02137.x).


### Getting the Data Set


Sklearn has the dataset built into the library, so getting the data will be easy. Once we do that, we'll do a train-test split.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.1)
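
Before training, it can help to look at what was actually loaded. The cell below is a small sketch of that: it prints the shape and names stored on the `iris` object and repeats the split. The `random_state=0` argument is an addition that is not in the original cell; it is only there so the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# 150 flowers, 4 measurements each
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)

# random_state is an assumption added here so the split is the same on every run
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

# 90% of the rows go to training, 10% to testing
print(X_train.shape, X_test.shape)
```

With only 15 flowers in the test set, the scores further down can move around quite a bit from one split to the next.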

### Making the Model

Making a Random Forest model is very easy, taking just two lines of code! Training can take a while on larger datasets, but this dataset is small, so the training time is minimal.

In [None]:
trees = RandomForestClassifier(n_estimators=150)
trees.fit(X_train, Y_train)
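
Once a forest is fitted, scikit-learn exposes a `feature_importances_` attribute that gives a rough idea of which measurements the trees leaned on. The sketch below fits on the whole Iris dataset purely for illustration; the variable name `forest` and the `random_state=0` argument are assumptions that are not part of the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# same number of trees as above; random_state is added for repeatability
forest = RandomForestClassifier(n_estimators=150, random_state=0)
forest.fit(iris.data, iris.target)

# one importance value per feature; the values sum to 1
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```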

### Figuring Out How Well the Model Does

We are going to measure how well the model does in two ways: a Confusion Matrix and the Matthews Correlation Coefficient.




#### Confusion Matrix

A Confusion Matrix shows us where the model is messing up. Below is an example from dataschool.io. The benefit of a confusion matrix is that it is a very easy way to visualise the performance of the model.

![alt text](https://www.dataschool.io/content/images/2015/01/confusion_matrix_simple2.png)

In [None]:
predictions = trees.predict(X_test)
print(confusion_matrix(Y_test, predictions))
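
To make the layout concrete, here is a tiny hand-made example. The labels in `y_true` and `y_pred` are made up for illustration. In scikit-learn's `confusion_matrix`, each row is a true class and each column is a predicted class, so correct predictions sit on the diagonal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# rows are the true class, columns are the predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# accuracy is the diagonal (correct predictions) over the total
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Here four of the six predictions land on the diagonal, so the accuracy is 4/6.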

#### Matthews correlation coefficient

This is used to measure the quality of a classification. It was designed for binary classification, and scikit-learn's `matthews_corrcoef` also accepts multiclass labels, which is what we rely on for the three Iris classes. It is based on the values found in the Confusion Matrix and boils them down to one number. It is generally considered one of the better measures of quality for classification. MCC is not skewed by class size, so in cases where we have very different class sizes we still get a reliable measure of how well the model did.

___ 

Matthews correlation coefficient ranges from -1 to 1. -1 represents total disagreement between the prediction and the observation, 0 is no better than random guessing, and 1 represents perfect prediction. In other words, the closer to 1 we get, the better the model is considered.

In [None]:
print(matthews_corrcoef(Y_test, predictions))
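
For the binary case, MCC can be written directly in terms of the four cells of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The helper below is a sketch of that formula; the name `mcc_from_counts` is hypothetical, and the counts passed to it are made up. When the denominator is zero, it returns 0, which is a common convention.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Binary MCC computed from the four confusion matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # an empty row or column makes the denominator zero; report 0 in that case
    return numerator / denominator if denominator else 0.0

perfect = mcc_from_counts(tp=5, tn=5, fp=0, fn=0)    # every prediction right
inverted = mcc_from_counts(tp=0, tn=0, fp=5, fn=5)   # every prediction wrong
mixed = mcc_from_counts(tp=6, tn=3, fp=1, fn=2)      # somewhere in between

print(perfect, inverted, round(mixed, 4))
```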

That wasn't too bad, was it? Random Forests is a very easy model to implement, and its low training times mean that the model can be used without the overheads associated with neural networks. It is also a very flexible model, but for some types of data it will require a little more wizardry.

## Image Classification on MNIST

Random Forests by themselves do not work well on image data. An image in a computer's world is just an array with values representing intensity and colour. By themselves, those raw pixel values do not lend themselves nicely to decision trees, but if we had values that represent higher-level features of the image, a Random Forest would have an easier time training on them.

To get those features out of an image, we can apply a dimension reduction technique. Here we are going to use an Autoencoder to find the important features in the data so that our Random Forest model can perform better.

### Making the Autoencoder

Making an Autoencoder is just like making any other type of Neural Network in Keras. This example is very simple, which was done to save on training time more than anything. More complex Autoencoders can be used in this case and would probably give better results than the simple one that we are using today.

In [None]:
from keras.layers import Input, Dense
from keras.models import Model

# size of our original image
img = Input(shape=(784,))

# how small the compressed image is going to be
compression = 42

# making the autoencoder
encoded = Dense(compression, activation='relu')(img)
decoded = Dense(784, activation='sigmoid')(encoded)
autoencoder = Model(img, decoded)
encoder = Model(img, encoded)

# create a placeholder for the encoded input
encoded_input = Input(shape=(compression,))

# retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]

# create the decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))

# compile the model
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
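
To see what the two `Dense` layers compute, here is a NumPy-only sketch of the same forward pass: 784 inputs squeezed down to 42 values with a ReLU, then expanded back to 784 values with a sigmoid. The weights here are random and untrained, and the names `W_enc`, `W_dec`, `encode` and `decode` are made up for this illustration, so the "reconstruction" is meaningless. The point is only the shapes and the activations.

```python
import numpy as np

rng = np.random.default_rng(0)
compression = 42

# random, untrained weights standing in for the two Dense layers
W_enc = rng.normal(scale=0.05, size=(784, compression))
b_enc = np.zeros(compression)
W_dec = rng.normal(scale=0.05, size=(compression, 784))
b_dec = np.zeros(784)

def encode(x):
    # Dense + relu: negative values are clipped to 0
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(z):
    # Dense + sigmoid: outputs land strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-(z @ W_dec + b_dec)))

# five fake flattened "images" with pixel values in [0, 1)
batch = rng.random((5, 784))
codes = encode(batch)
recon = decode(codes)

print(codes.shape, recon.shape)
```

The 42-value `codes` are the kind of compressed representation that the Random Forest is trained on later in this section.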

### Getting the MNIST data

Keras has the dataset already, so importing it will be a breeze. Unlike the last example, we don't need to do a train-test split ourselves, because `mnist.load_data()` returns the data already split into training and test sets.

In [None]:
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# convert to floats in [0, 1], then flatten each 28x28 image into a 784-long vector
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
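
The same two steps can be checked on a tiny fake batch without downloading anything. The array `fake_images` below is random data standing in for MNIST; it only shares the `uint8` type and the 28x28 shape.

```python
import numpy as np

# three random 28x28 "images" with integer pixel values from 0 to 255
fake_images = np.random.default_rng(0).integers(
    0, 256, size=(3, 28, 28), dtype=np.uint8)

# scale pixel values into [0, 1]
scaled = fake_images.astype('float32') / 255.

# flatten each 28x28 image into one row of 784 values
flat = scaled.reshape((len(scaled), np.prod(scaled.shape[1:])))

print(flat.shape, flat.dtype, flat.min(), flat.max())
```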

### Training the Autoencoder 

Training is going to take a moment, but because the model is relatively simple it shouldn't take *too long*.

In [None]:
autoencoder.fit(x_train, x_train,
                epochs=45,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))


### Autoencoder results

To see if the Autoencoder did a good job in compressing the images, we're going to be comparing the original images to ones that have been passed through the Autoencoder.


In [None]:
# run on the testing set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)

# how many digits we will display
n = 7  

# this prints the images below
plt.figure(figsize=(14, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

### Random Forest Training 

We're going to train this model just like we did with the Iris dataset, but this time we are going to have to encode the training data before we train the model. Training might take a moment to complete since the dataset is quite large in comparison to the Iris data set we used before.

In [None]:
# Let's encode our training data for the Random Forest model
encoded_x_train = encoder.predict(x_train)

# Train the model, timing how long it takes
start = time.time()
trees = RandomForestClassifier(n_estimators=75)
trees.fit(encoded_x_train, y_train)

# get the time it took to train the model
end = time.time()
print(end-start)

#### Something While We Wait

Since the model is going to take a moment to train, let's look at some XKCD comics to pass the time.

* [Pointers](http://www.xkcd.com/138/)

* [Sandwich](https://xkcd.com/149/)

* [Labyrinth Puzzle](https://xkcd.com/246/)

* [Wikipedian Protester](https://xkcd.com/285/)

* [Correlation](https://xkcd.com/552/)

* [Compiling](https://www.xkcd.com/303/)

* [(](https://xkcd.com/859/)

* [Time Machine](https://xkcd.com/716/)

### Testing the Model(s)

We're going to test this model just like we did with the Iris dataset, but this time the model needs encoded inputs. We already have those: `encoded_imgs` holds the test images after they were passed through the encoder.

In [None]:
images_pred = trees.predict(encoded_imgs)
# true labels go first, predictions second, matching the Iris example
print(confusion_matrix(y_test, images_pred))
print(matthews_corrcoef(y_test, images_pred))

## Credit Card Fraud Dataset

As always, we are going to need a dataset to work on!
Credit Card Fraud Detection is a serious issue, and as such it is something that data scientists have looked into. This dataset is from a Kaggle competition with over 2,000 Kernels based on it. Let's see how well Random Forests can do with this dataset!

Let's read in the data and use *.info()* to find out some metadata.

In [None]:
!git clone "https://github.com/JarvisEQ/RandomData.git"
!unzip RandomData/creditcardfraud.zip
data = pd.read_csv("creditcard.csv")
data.info()

What's going on with this V stuff?
Credit Card information is a bit sensitive, and as such the raw information had to be obscured in some way to protect it.

In this case, the data provider used a method known as PCA transformation to hide those features that would be considered sensitive. PCA is a very useful technique for dimension reduction, a topic that we will be covering in a later lecture. For now, know that this technique allows us to take data and transform it in a way that maintains the patterns in that data.
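
As a rough illustration of what a PCA transformation does, the sketch below runs scikit-learn's `PCA` on two made-up, correlated columns. The names `income` and `spending` and all the numbers are invented for this example, and the exact preprocessing applied to the credit card data is not known beyond "PCA". The transformed columns no longer look like the original features, but they carry the same variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# two made-up, strongly correlated features
income = rng.normal(50_000, 10_000, size=500)
spending = 0.3 * income + rng.normal(0, 1_000, size=500)
raw = np.column_stack([income, spending])

# keep both components, so no information is thrown away
pca = PCA(n_components=2)
transformed = pca.fit_transform(raw)

# share of the total variance carried by each new column
print(pca.explained_variance_ratio_)

# the new columns are uncorrelated with each other
print(np.corrcoef(transformed.T)[0, 1])
```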

Next, let's get some basic statistical info from this data.

In [None]:
data.describe()

### Some important points about this data 

For most of the features, there is not a lot we can gather since they have been obscured with PCA. However, there are three features that were left untransformed for us to see.

#### 1. Time

Time is the number of seconds elapsed since the first transaction in the dataset. The max is 172792, so the data was collected for around 48 hours.

#### 2. Amount

Amount is the value of the transaction. The currency of the transactions was not given, so we're going to call the units "Simoleons" (§) as a placeholder. One interesting point about this feature is the STD, or standard deviation, which is 250§. That's quite large, but it makes sense when the min and max are 0§ and 25,691§ respectively. There is a lot of variance in this feature, which is to be expected. The 75th percentile is only 77§, which means that the whole set skews toward the lower end.

#### 3. Class

This tells us whether the transaction was fraud or not: 0 for no fraud, 1 for fraud. If you look at the mean, it is 0.001727, which means that only 0.1727% of the 284,807 cases are fraud. There are only 492 fraud examples in this dataset, so we're looking for a needle in a haystack.
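
This imbalance is why plain accuracy would be misleading here. The arithmetic below uses the counts quoted above and imagines a lazy "model" that labels every transaction as not fraud. It is a thought experiment, not a model trained on the data.

```python
import math

total = 284_807
fraud = 492
legit = total - fraud

# a "model" that always predicts 0 (no fraud)
tp, fn = 0, fraud     # every fraud case is missed
tn, fp = legit, 0     # every legitimate case is labelled correctly

accuracy = (tp + tn) / total

# binary MCC from the four counts; a zero denominator is reported as 0
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

print(f"accuracy: {accuracy:.4%}")
print(f"MCC: {mcc}")
```

The accuracy comes out above 99.8% even though not a single fraud case was caught, while the MCC is 0. That is the reason MCC is the metric used for this dataset.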

Now that we have that out of the way, let's start making this model! We first need to split up our data into Test and Train sets.

In [None]:
X = data.drop(labels='Class', axis = 1)
Y = data.loc[:,'Class']

#don't need that original data
del data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

#don't need our X and Y anymore
del X, Y
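
With so few fraud cases, a purely random split can leave the test set with noticeably more or fewer of them than expected. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same on both sides. The sketch below shows it on made-up labels (10 positives out of 1,000); the variable names and `random_state=0` are assumptions for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up, heavily imbalanced labels: 10 positives out of 1,000
y = np.zeros(1000, dtype=int)
y[:10] = 1
X = np.arange(1000).reshape(-1, 1)

# stratify=y keeps the 1% positive rate in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(len(y_tr), y_tr.sum())
print(len(y_te), y_te.sum())
```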

### Some Points about Training

With Neural Networks, we saw that training times could easily reach the hours mark and sometimes even days. With Random Forests, training times are much lower, so we can finish training the model in relatively short order, even when our dataset contains 284,807 entries. This is done without GPU acceleration, which scikit-learn's Random Forest implementation does not use.

The cell below is left blank, but there are examples of how to make a Random Forest model earlier in the notebook that you can refer to if you need them.

In [None]:
start = time.time()
#TODO, make the model



end = time.time()

# this is going to tell you the training Time
print(end-start)

### Testing the model 

Once the model is done training, use the testing set to find the Matthews Correlation Coefficient. We have examples of it from earlier in the notebook, so give them a look if you need an example.

We're going to be collecting everyone's results and graphing the relationship between Number of Trees, Training Time, and Quality, using MCC as the metric. Make sure you fill out the form below to tell us how your model did.

### [Tell us your results](https://docs.google.com/forms/d/e/1FAIpQLSdMh2J2I6X6DFuwIwXeZeagF8ah6ywPFjWjmMiNFDhONXuUDg/viewform?usp=sf_link)

In [None]:
# TODO, use your model to predict for the test set
predictions = 
print(matthews_corrcoef(Y_test, predictions))