### 1 - Shelter Animals Part 2
This post is a continuation of my first shelter animal [post](https://github.com/yscyang1/ExploringDataScience/blob/master/5-ShelterAnimals1.ipynb) and covers a few things: speeding up calculations, examining trees, and tuning hyperparameters.

As usual, first import all the libraries and data.  Since I saved the dataframes in feather format after some processing in my first post, I can easily read the processed data using pandas' read_feather function.  As you can see, all the categorical data has been encoded.  

Note:  This notebook was originally run on my personal computer, so %time might be a little different.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train_df = pd.read_feather('../input/shelter/train_df')
test_df = pd.read_feather('../input/shelter/test_df')

In [None]:
train_df.head()

In [None]:
test_df.head()

In the training dataset, I again drop the outcomes and save it as X, and 'Outcome1' becomes my y. 

In [None]:
X = train_df.drop(['Outcome1', 'Outcome2'], axis = 1)

In [None]:
y = train_df['Outcome1']

### 2 - Creating a Validation Set
In my first post, I committed a machine learning faux pas and didn't create a validation set.  I only submitted my solution once in the first post, so a validation set wasn't critical.  However, if I want to speed up my model or play with hyperparameters, then creating a validation set is important so that I don't overfit to Kaggle's public leaderboard.  

When creating a validation set, it is important to note whether dates matter.  For example, if your goal is to predict a future price, you wouldn't want to create your validation set by picking out random data points.  Instead, you'd want the earliest 60-80% of the training set to stay as your training set, and take the remaining, most recent data points as your validation set.  
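To make the date-based split concrete, here is a minimal sketch. The `prices` dataframe, its column names, and the helper `time_ordered_split` are all made up for illustration; they are not part of the shelter data.

```python
import pandas as pd

# Hypothetical time-stamped data: eight daily prices.
prices = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=8, freq="D"),
    "price": [10, 11, 12, 11, 13, 14, 13, 15],
})

def time_ordered_split(df, date_col, train_frac=0.75):
    """Keep the earliest rows for training and the latest rows for validation."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cutoff = int(train_frac * len(ordered))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

train_part, val_part = time_ordered_split(prices, "date", train_frac=0.75)
print(len(train_part), len(val_part))
# Every training date comes before every validation date.
print(train_part["date"].max() < val_part["date"].min())
```

The point of the sketch is only that no shuffling happens, so the validation rows are always "in the future" relative to the training rows.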

For this shelter animal outcome, the training and test sets come from the same time frame, and I'm not trying to forecast something in the future, so taking random data points for a validation set is reasonable.  To create my validation set, I'll be using scikit-learn.  I've chosen to make the validation set 43% of the training set because that's about the same size as test_df.  



In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.43, random_state = 42)

In [None]:
print('training shape: {}'.format(X_train.shape))
print('validation shape: {}'.format(X_val.shape))
print('test shape: {}'.format(test_df.shape))

#### 2.1 - Random Forest with Validation Set
Now to try the random forest model again with separate training and validation sets. 

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf1 = RandomForestClassifier(n_estimators=100, n_jobs= -1)
%time rf1.fit(X_train, y_train)

In [None]:
def print_score(model, X_t, y_t, X_v, y_v, oob = False):
    print('Training Score: {}'.format(model.score(X_t, y_t)))
    print('Validation Score: {}'.format(model.score(X_v, y_v)))
    if oob:
        if hasattr(model, 'oob_score_'):
            print("OOB Score:{}".format(model.oob_score_))

In [None]:
print_score(rf1, X_train, y_train, X_val, y_val)

I've written a function that prints out the score of the training and validation sets.  This score is the accuracy, or the number of correct predictions over the number of total predictions.  With the minimal processing done, the validation set got a score of 67.7%, which is far lower than the training score.  Clearly, there is quite a bit of overfitting.  
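As a quick illustration of what the score means, accuracy can be computed by hand. This is a small sketch with made-up labels, not the shelter data; the helper name `accuracy` is my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration only.
y_true = ["Adoption", "Transfer", "Adoption", "Died", "Transfer"]
y_pred = ["Adoption", "Adoption", "Adoption", "Died", "Transfer"]
print(accuracy(y_true, y_pred))  # 4 correct out of 5
```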

### 3 - Speeding Things Up
Why would we want to decrease computational time?  One reason would be if you have a large dataset and want to fiddle with hyperparameters.  If each computation took a minute or more, that's a lot of time wasted!  Instead, you could take a subset of the data, tune the hyperparameters, and then apply the hyperparameters to the whole dataset when you are ready.  

Above, I've used the %time magic command to see how long it took to fit the data (~1 second). 

The function get_subset (code borrowed from this [stackoverflow](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test) thread) takes in the training set, training and validation percent, and outputs a randomized subset of the training set.  I've included a way to get randomized training, validation, and test sets as well.  

Note:  If you don't need a test set, then scikit-learn's train_test_split function will also work, and it takes a random_state parameter for reproducible splits.  

In [None]:
def get_subset(df, train_percent=.6, validate_percent=.2, copy=True, seed=None):
    # Work on a copy unless the caller opts out
    df_copy = df.copy() if copy else df
    # Shuffle the index labels; a fixed seed makes the shuffle reproducible
    perm = np.random.RandomState(seed).permutation(df_copy.index)
    length = len(df_copy.index)
    train_end = int(train_percent * length)
    validate_end = int(validate_percent * length) + train_end
    # perm holds index labels, so select rows with .loc rather than .iloc
    train = df_copy.loc[perm[:train_end]]
    validate = df_copy.loc[perm[train_end:validate_end]]
    test = df_copy.loc[perm[validate_end:]]

    return train, validate, test

In [None]:
train_speed, val_speed, test_speed = get_subset(train_df, 0.35, 0.35, seed = 42)

In [None]:
train_speed.head()

In [None]:
train_speed.shape

In [None]:
rf_speed = RandomForestClassifier(n_estimators=100, n_jobs=-1)
X_train_speed = train_speed.drop(['Outcome1', 'Outcome2'], axis = 1)
y_train_speed = train_speed['Outcome1']
X_val_speed = val_speed.drop(['Outcome1', 'Outcome2'], axis = 1)
y_val_speed = val_speed['Outcome1']
%time rf_speed.fit(X_train_speed, y_train_speed)
print_score(rf_speed, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

After testing the model on a subset of the data, the computational time decreased by 1 second, and both the training and validation accuracy decreased by a tiny bit.  

### 4 - Examining Trees
Similar to a real forest, a random forest classifier is made of trees.  One way to really understand what is going on behind the scenes is to study a tree.  Keep in mind that each tree is unique.  



#### 4.1 - Creating a Tree
The first step is to create a model.  I've skipped over it before, but in RandomForestClassifier and RandomForestRegressor, an estimator is a tree.  By setting n_estimators to one, I am creating only one tree in this model.  Max depth defines the maximum number of levels of splits between the root and a leaf in each tree.  When examining a tree, it's good to set a max depth, otherwise the tree will be nearly impossible to interpret, especially if there are a lot of features.  

In addition, random forests introduce randomization through something called bootstrapping.  A more in-depth explanation can be found [here](https://nititek.wordpress.com/2013/12/10/bootstrapping/), but basically bootstrapping builds a new sample by drawing rows from the data at random, with replacement.  It sounds fancy, but a simple example would be to have a bag of 3 red, 3 blue, and 3 green marbles.  Sample the bag by drawing a marble 9 times, but each time, put the marble back in the bag.  The act of putting the drawn marble back is the replacement part, as opposed to removing the drawn marble from the bag.  In the model below, I set bootstrap=False so that the single tree is built from all of the rows.
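The marble example can be sketched in a few lines of standard-library Python. The seed and variable names are arbitrary choices for illustration.

```python
import random
from collections import Counter

bag = ["red"] * 3 + ["blue"] * 3 + ["green"] * 3

rng = random.Random(0)  # fixed seed so the draws are repeatable

# Sampling WITH replacement: every draw sees the full bag of 9 marbles,
# so some marbles can show up several times and others not at all.
with_replacement = [rng.choice(bag) for _ in range(len(bag))]

# Sampling WITHOUT replacement: each marble can be drawn only once,
# so the result is just a shuffled copy of the bag.
without_replacement = rng.sample(bag, k=len(bag))

print(Counter(with_replacement))
print(Counter(without_replacement))
```

The second counter always shows 3 of each color, while the first usually does not; that difference is what makes each bootstrapped tree see slightly different data.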

Lastly, n_jobs describes how many jobs to run in parallel.  This is machine dependent: scikit-learn's random forest runs on the CPU, so the useful upper limit is the number of CPU cores you have.  If you want to make things simple, the int -1 will tell your computer to use all processors.  

In [None]:
rf_1tree = RandomForestClassifier(n_estimators=1, max_depth=3, bootstrap=False, random_state=23,  n_jobs=-1)
rf_1tree.fit(X_train_speed, y_train_speed)
print_score(rf_1tree, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

You can see the accuracy of the training and validation sets dropped to about 50%.  Definitely not a model to submit to Kaggle or your boss, but small and simple enough to examine deeply.  

The second step is to extract and export the tree.  Extracting a single tree uses the .estimators_ attribute, which is a list of the fitted trees.  Since I built a model with only one tree, there is no picking and choosing of trees.

Viewing the tree involves importing the export_graphviz function and exporting the tree as a .dot file.  ProTip: rotating the image made the tree much easier to read.

In [None]:
from sklearn.tree import export_graphviz

In [None]:
estimator = rf_1tree.estimators_[0]

In [None]:
export_graphviz(estimator, out_file = 'tree.dot', 
                feature_names = list(X_train_speed.columns), 
                class_names = [str(c) for c in rf_1tree.classes_],
                rounded = True,
                filled = True,
                precision = 2,
                rotate = True,
                node_ids = True)

To actually view the tree, the .dot file should be changed to a png file.  I saw two ways to do this, but the one I understood better was with the pydot library.

In [None]:
import pydot

In [None]:
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')

Lastly, view the image with IPython.display.

In [None]:
from IPython.display import Image
Image(filename = 'tree.png')

Okay, so what's going on here?  There is a lot of information to unpack, but luckily, trees make it easy to interpret data.  

First, what is a node?  A node is each of these boxes, and you can see each node is split into 2, depending on whether its conditional is True or False.  The conditional it is splitting on is written under the node number.  So at node 0, the split is on whether the value of 'Sex' is <= 2.0.  If this conditional is False, we end up at node 8, where the conditional is whether 'Age' is <= 547.5.  

Samples is self-explanatory: it is the number of samples in that node.  The sum of the samples in each layer of the tree should add up to the sample size in the root node 0.  

The values correspond to the outcomes (adoption, died, euthanasia, return to owner, and transfer, in this order), and can be viewed as the distribution of the samples in that node.  Class identifies the most common outcome.  For leaf nodes (the nodes furthest to the right, which have no conditional), the class is the prediction for all samples in that node.  

Saving the best for last is something called Gini.  Calculating the Gini Impurity sounds complicated, but it isn't as hard as it sounds.  [This website](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) goes in depth on how to calculate it, and I've also calculated it for the [first node](https://github.com/yscyang1/ExploringDataScience/blob/master/6-ShelterAnimals2_SupportingInfo.ipynb).  The Gini Impurity is the probability that a randomly chosen sample in a node will be incorrectly labeled if it is labeled according to the distribution of values in that node.  It is 0 when every sample in the node belongs to the same class, and it approaches (but never reaches) 1 as the samples spread evenly over more classes.  The higher the Gini, the more likely the sample will be labeled incorrectly.  
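For reference, the usual formula is one minus the sum of the squared class proportions. Here is a minimal sketch of that formula; the function name and the example counts are made up and are not the counts from the tree above.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A pure node: every sample has the same outcome.
print(gini_impurity([10, 0, 0, 0, 0]))

# Five outcomes, perfectly evenly mixed: 1 - 5 * (1/5)**2, about 0.8.
print(gini_impurity([2, 2, 2, 2, 2]))

# A lopsided node sits somewhere in between.
print(gini_impurity([6, 1, 1, 1, 1]))
```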

So how does the tree pick which feature to split on, and where?  This is where the Gini comes in.  Take the root node for example.  The tree goes through the values of each feature and tries splitting the node at each one to find the greatest reduction in the Gini Impurity.  If you peek at the calculations page, you will see that the Gini after a split is calculated as a weighted average of the two child nodes.  Thus, as the tree moves towards the leaves, the weighted Gini should decrease. 
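Here is a rough sketch of that search for a single feature. It is a simplified stand-in rather than scikit-learn's actual implementation, and the ages and outcomes are invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    return 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try each midpoint between sorted unique values.

    Returns (threshold, weighted_gini) for the split with the lowest
    weighted average impurity of the two child nodes.
    """
    best = (None, gini(labels))  # start from the unsplit node
    unique_values = sorted(set(values))
    for low, high in zip(unique_values, unique_values[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Invented data: age in days and outcome.
ages = [30, 60, 200, 400, 700, 900]
outcomes = ["Transfer", "Transfer", "Transfer", "Adoption", "Adoption", "Adoption"]

print(gini(outcomes))                  # impurity before splitting
print(best_threshold(ages, outcomes))  # the split that separates the classes
```

A real tree repeats this search over every candidate feature and keeps the feature and threshold with the lowest weighted impurity.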

### 5 - Bagging



#### 5.1 - Intro to Bagging
So in the last section, I created one shallow tree (max depth of 3) that had pretty terrible accuracy on both training and validation sets.  What happens if I create a deep tree?  How would that affect accuracy?  

In [None]:
train_speed, val_speed, test_speed = get_subset(train_df, 0.5, 0.35)
X_train_speed = train_speed.drop(['Outcome1', 'Outcome2'], axis = 1)
y_train_speed = train_speed['Outcome1']
X_val_speed = val_speed.drop(['Outcome1', 'Outcome2'], axis = 1)
y_val_speed = val_speed['Outcome1']

In [None]:
rf_deeptree = RandomForestClassifier(n_estimators=1, n_jobs=-1)
rf_deeptree.fit(X_train_speed, y_train_speed)
print_score(rf_deeptree, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

As you can see, the training score improved drastically, but the validation score is still pretty terrible.  Is there a way to profit from this overfitting?  The answer is yes, using a technique called bagging.  The thought behind this technique is to combine a bunch of different bad models to create one better model.  In terms of trees and random forests, you could create a lot of different deep trees that overfit a LOT, like the one above.  But because each tree sees a different subset of the population, they make different (random) errors.  If you take the average of many random errors, it tends toward 0.  

Let's see if this works.  Let's try it with 100 trees.
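The intuition can be checked with a toy simulation. This sketch assumes a two-class problem and models whose errors are completely independent, each being right 60% of the time. Real trees in a forest are correlated, so the gain in practice is smaller. All names and numbers here are illustrative.

```python
import random

rng = random.Random(42)
n_samples = 2000   # number of simulated rows
p_correct = 0.6    # each weak model is right 60% of the time

def vote_accuracy(n_models):
    """Fraction of rows where a strict majority of the models is correct."""
    hits = 0
    for _ in range(n_samples):
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        hits += correct_votes > n_models / 2
    return hits / n_samples

single = vote_accuracy(1)
ensemble = vote_accuracy(25)
print(single, ensemble)
```

With independent errors, the majority vote of 25 mediocre models comes out noticeably more accurate than any one of them.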

In [None]:
rf_deeptree = RandomForestClassifier(n_estimators=100, n_jobs=-1)
%time rf_deeptree.fit(X_train_speed, y_train_speed)
print_score(rf_deeptree, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

Indeed, the validation set's accuracy increased by about 10%.  


#### 5.2 - Extra Tree Classifier

An easy way to try to improve on the tree model and bagging is to use trees that are less correlated with each other, as opposed to more accurate trees.  Scikit-learn has a model called ExtraTreesClassifier (or ExtraTreesRegressor) that draws a random threshold for each randomly selected feature and then chooses the best of those random splits.  An additional pro of the extra trees model is that it takes less computing time, so you can build more trees in the time saved.  
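To show the difference between the two splitting strategies, here is a simplified sketch for one feature. It is not scikit-learn's code; the helper names and the toy data are assumptions made for illustration.

```python
import random

def weighted_gini(values, labels, threshold):
    """Weighted average Gini impurity of the children made by values <= threshold."""
    def gini(group):
        if not group:
            return 0.0
        return 1.0 - sum((group.count(c) / len(group)) ** 2 for c in set(group))
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

def exhaustive_threshold(values, labels):
    """Random-forest style: score every midpoint between sorted unique values."""
    unique_values = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(unique_values, unique_values[1:])]
    return min(midpoints, key=lambda t: weighted_gini(values, labels, t))

def random_threshold(values, rng):
    """Extra-trees style: a single uniform draw between the feature's min and max."""
    return rng.uniform(min(values), max(values))

# Invented data: age in days and outcome.
ages = [30, 60, 200, 400, 700, 900]
outcomes = ["Transfer", "Transfer", "Adoption", "Transfer", "Adoption", "Adoption"]

rng = random.Random(7)
t_best = exhaustive_threshold(ages, outcomes)
t_rand = random_threshold(ages, rng)
print(t_best, weighted_gini(ages, outcomes, t_best))
print(t_rand, weighted_gini(ages, outcomes, t_rand))
```

The random threshold skips the scan over all midpoints, which is where the time savings come from, at the cost of a split that is usually no better than the exhaustive one. In the real model, one random threshold is drawn per candidate feature and the best of those is kept.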

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
etc_deeptree = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
%time etc_deeptree.fit(X_train_speed, y_train_speed)
print_score(etc_deeptree, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

Well, the computation time decreased by ~50 ms, but the validation score didn't improve.  In fact, it actually got a little bit worse.  Perhaps in this case, the trees developed by the random forest classifier are already largely uncorrelated with each other, so introducing more randomness from the extra trees classifier didn't improve the model?  

#### 5.3 - Picking the Number of Trees
If you've been paying attention, the number of trees I've used for each model varied wildly, ranging from 1 to 100.  Yet, for the models that used 10 and 100 trees, the accuracy of the validation set was ~65%.  Clearly, there is a plateau for number of trees vs accuracy.  How do we figure out what it is?  

First, I need a list of predictions for each tree, called preds.  The shape tells us there are 100 lists, one for each tree, and each list holds 9355 predictions, one for each row of the validation set.  I also printed out the predictions, where you can see each outcome is a number instead of a category.  

In [None]:
preds = np.stack([i.predict(X_val_speed) for i in rf_deeptree.estimators_])
print(preds.shape)
print(preds)

Since the predictions are numbers, I had to write a function to encode the outcomes of my y_val data.  Scikit-learn's LabelEncoder would also do the trick: it assigns integer labels in sorted (here, alphabetical) order, which matches the mapping I wrote by hand.

In [None]:
def convert_outcome1(col):
    if col == 'Adoption':
        return 0
    if col == 'Died':
        return 1
    if col == 'Euthanasia':
        return 2
    if col == 'Return_to_owner':
        return 3
    if col == 'Transfer':
        return 4

y_val_speed_convert = y_val_speed.apply(convert_outcome1)
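As an alternative to the hand-written mapping, scikit-learn's LabelEncoder sorts the classes before numbering them. The short sketch below uses a made-up array of outcomes rather than y_val_speed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up outcomes, deliberately out of alphabetical order.
outcomes = np.array(["Transfer", "Adoption", "Died",
                     "Return_to_owner", "Euthanasia", "Adoption"])

encoder = LabelEncoder().fit(outcomes)

# classes_ is sorted, so position i in this list is the integer code i.
print(list(encoder.classes_))

codes = encoder.transform(outcomes)
print(codes)
```

Because the order is alphabetical, 'Adoption' maps to 0 and 'Transfer' maps to 4, the same codes as in convert_outcome1.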

In [None]:
import scipy.stats

In [None]:
from sklearn import metrics

And finally, the graph.  What I'm doing is graphing the accuracy based on the number of trees (as found from estimators_).  The predictions from the trees are combined using the mode, i.e. a majority vote.  Note: if you are using a regression model instead, average using the mean.  

What you see is that the accuracy starts plateauing at around 50 or 60 trees.  

In [None]:
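# Majority vote (mode) of the first i+1 trees; np.ravel copes with SciPy returning shape (1, n) or (n,)
votes = [np.ravel(scipy.stats.mode(preds[:i+1], axis=0)[0]) for i in range(100)]
plt.plot([metrics.accuracy_score(y_val_speed_convert, v) for v in votes])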
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
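To see what taking the mode across trees does, here is a tiny standard-library sketch with three hypothetical trees and four rows. The function name majority_vote and the toy predictions are made up.

```python
from collections import Counter

def majority_vote(per_tree_preds):
    """Combine predictions from several trees into one prediction per row.

    per_tree_preds is a list of lists: one inner list per tree,
    one entry per row of the data.
    """
    voted = []
    for row_votes in zip(*per_tree_preds):
        # most_common(1) gives the most frequent vote for this row;
        # on a tie it returns the value that appeared first.
        voted.append(Counter(row_votes).most_common(1)[0][0])
    return voted

toy_preds = [
    [0, 4, 3, 0],  # tree 1
    [0, 4, 4, 2],  # tree 2
    [3, 4, 3, 2],  # tree 3
]
print(majority_vote(toy_preds))       # votes of all three trees
print(majority_vote(toy_preds[:1]))   # a "forest" of just the first tree
```

Note that ties are handled differently from scipy.stats.mode, which returns the smallest of the tied values, so the two approaches can disagree on rows where the trees split evenly.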

Just to confirm, I've printed out the accuracy for 10, 20, 50, and 100 trees.

In [None]:
model = RandomForestClassifier(n_estimators=10, n_jobs=-1)
model.fit(X_train_speed, y_train_speed)
print_score(model, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

In [None]:
model = RandomForestClassifier(n_estimators=20, n_jobs=-1)
model.fit(X_train_speed, y_train_speed)
print_score(model, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

In [None]:
model = RandomForestClassifier(n_estimators=50, n_jobs=-1)
model.fit(X_train_speed, y_train_speed)
print_score(model, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

In [None]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train_speed, y_train_speed)
print_score(model, X_train_speed, y_train_speed, X_val_speed, y_val_speed)

#### 5.4 - Out of Bag Score
Something unique to bagged models such as random forests is the out of bag score.  Since each tree is trained on a bootstrap sample that contains only about 2/3 of the unique rows, the remaining 1/3 of the rows can technically serve as a validation set for that tree.  Just like before, the predictions of the trees that did not see a given row can be combined into a single prediction, and we can calculate a separate accuracy score from these out of bag predictions.  

The advantage of an out of bag (OOB) score is that it lets you check for overfitting without the need to create a whole new validation set.  Here, we see that with 50 trees, the validation and OOB scores are similar, which suggests the OOB score is a reasonable stand-in for a validation score.

In [None]:
model = RandomForestClassifier(n_estimators=50, oob_score= True, n_jobs=-1)
model.fit(X_train_speed, y_train_speed)
print_score(model, X_train_speed, y_train_speed, X_val_speed, y_val_speed, oob=True)
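The "about 1/3" figure for out of bag rows can be checked with a quick simulation. This is a sketch with an arbitrary number of rows and seed, not a calculation on the shelter data.

```python
import random

rng = random.Random(0)
n_rows = 10_000

# One bootstrap sample: draw n_rows row indices with replacement
# and keep track of which distinct rows were picked at least once.
in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
oob_fraction = 1 - len(in_bag) / n_rows

print(round(oob_fraction, 3))
# Chance that a given row is never picked in n_rows draws; close to 1/e.
print(round((1 - 1 / n_rows) ** n_rows, 3))
```

Both numbers land near 0.37, so each tree leaves a bit more than a third of the rows out of its bag.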

### 6 - Reducing Overfitting
The general consensus on how to avoid overfitting in random forests (according to Jeremy and a few stack overflow threads at least) is to 1) adjust the number of estimators, and 2) make the trees less deep.  I've already gone through finding a good number of estimators, so next is to explore some popular techniques to "prune" a tree, or make it less deep.

To start off, let's get a model that utilizes the full data set.  Validation and OOB scores are similar.

In [None]:
rf_all = RandomForestClassifier(n_estimators=50, n_jobs=-1, oob_score=True, random_state=42)
rf_all.fit(X_train, y_train)
print_score(rf_all, X_train, y_train, X_val, y_val, oob=True)

#### 6.1 - min_samples_leaf
Instead of letting the branches keep splitting until each leaf has a single sample, min_samples_leaf only allows a split if it leaves at least the specified number of samples in each resulting leaf.  According to the documentation, this may have the effect of smoothing the model, especially in regression.  

In [None]:
rf_minleaf = RandomForestClassifier(n_estimators=50, min_samples_leaf=2, n_jobs=-1, oob_score=True, random_state=42)
rf_minleaf.fit(X_train, y_train)
print_score(rf_minleaf, X_train, y_train, X_val, y_val, oob=True)

With a min sample size of 2, the validation score increases a tiny bit from 0.668 to 0.672.

#### 6.2 - max_features
Another hyperparameter to play around with is the max_features parameter.  Instead of considering every feature (i.e. column) at each split and finding the best one, the tree considers only a random subset of the features, given as a fraction of the total.  There are also special values that use the square root or log of the number of features.  Popular values to try are fractions between 0.3 and 0.5, sqrt, and log2.  
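To get a feel for how many columns each setting looks at, here is a small sketch. The rounding rule follows my reading of the scikit-learn documentation (truncate to an integer, with a floor of 1), and the count of 25 features is a made-up number, not the number of columns in X_train.

```python
import math

def n_features_per_split(max_features, n_features):
    """Rough number of features considered at each split (assumed rounding rule)."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))
    raise ValueError("unsupported max_features value: {!r}".format(max_features))

n_features = 25  # hypothetical number of columns
for setting in [0.3, 0.5, "sqrt", "log2"]:
    print(setting, n_features_per_split(setting, n_features))
```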

In [None]:
rf_maxfeat = RandomForestClassifier(n_estimators=50, min_samples_leaf=2, max_features=0.3, n_jobs=-1, oob_score=True, random_state=42)
rf_maxfeat.fit(X_train, y_train)
print_score(rf_maxfeat, X_train, y_train, X_val, y_val, oob=True)

In [None]:
rf_maxfeat = RandomForestClassifier(n_estimators=50, min_samples_leaf=2, max_features='sqrt', n_jobs=-1, oob_score=True, random_state=42)
rf_maxfeat.fit(X_train, y_train)
print_score(rf_maxfeat, X_train, y_train, X_val, y_val, oob=True)

In [None]:
rf_maxfeat = RandomForestClassifier(n_estimators=50, min_samples_leaf=2, max_features='log2', n_jobs=-1, oob_score=True, random_state=42)
rf_maxfeat.fit(X_train, y_train)
print_score(rf_maxfeat, X_train, y_train, X_val, y_val, oob=True)

Keeping the min samples at 2, I can eke out a little bit more accuracy when I randomly sample 30% of the features.   

### 7 - Hyperparameter Tuning
The number of hyperparameters to tune and optimize is getting to be a bit much to do by hand.  Luckily, scikit-learn has RandomizedSearchCV and GridSearchCV to help us out. 

#### 7.1 - RandomizedSearchCV
If there are a lot of hyperparameters to test out, scikit-learn's RandomizedSearchCV is generally a good place to start.  We create a grid of hyperparameters that includes a range of values for each hyperparameter to test out.  RandomizedSearchCV then tests out a random selection of the combinations specified.  This gives us an idea of what values to use, which we can then narrow down using GridSearchCV.

First, define the values of all the hyperparameters to test.

In [None]:
n_estimators = [int(x) for x in range(1,100,5)]
max_features = [float(x) for x in np.linspace(0.1,1,9)]
max_features.append('log2')
max_features.append('sqrt')
min_samples_split = [2,5,8,10,20,25]
min_samples_leaf = [2,5,8,10,20,25]
bootstrap = [True, False]

Next, combine the hyperparameters into a dictionary that RandomizedSearchCV will take.

In [None]:
randomCV_grid = {'n_estimators': n_estimators, 'max_features': max_features, 'min_samples_split': min_samples_split, 
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}

Implementing RandomizedSearchCV is pretty similar to what we've seen before.  Create a RandomForestClassifier, create a RandomizedSearchCV, and fit the RandomizedSearchCV on the X and y training sets.  The most important parameters to pay attention to are n_iter and cv.  The variable n_iter is how many combinations to try.  Obviously, the more combinations, the more time it takes.  The variable cv is how many cross validation folds to do.  More folds give a more reliable estimate of how well each combination generalizes, but again, will take more time.  

One thing I noticed is that when I specified n_jobs = -1, I had an error message pop up, '[WinError 5] Access is denied:' to be specific.  Setting n_jobs to default seemed to fix it though.
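To put n_iter and cv in perspective, here is a sketch that counts the combinations in the grid defined earlier. The grid is rebuilt with plain Python so the block stands on its own. The single random draw at the end only illustrates the idea of sampling a candidate and is not how RandomizedSearchCV is implemented internally.

```python
import random

grid = {
    "n_estimators": list(range(1, 100, 5)),
    # nine evenly spaced fractions from 0.1 to 1.0, plus the two special values
    "max_features": [round(0.1 + i * 0.9 / 8, 4) for i in range(9)] + ["log2", "sqrt"],
    "min_samples_split": [2, 5, 8, 10, 20, 25],
    "min_samples_leaf": [2, 5, 8, 10, 20, 25],
    "bootstrap": [True, False],
}

# Size of the full grid: the product of the number of values per hyperparameter.
n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)

n_iter, cv = 100, 3
print(n_combinations)           # every possible combination
print(n_iter * cv)              # fits done by the search (before the final refit)
print(n_iter / n_combinations)  # fraction of the grid actually tried

# One randomly drawn candidate, for illustration.
rng = random.Random(42)
candidate = {name: rng.choice(values) for name, values in grid.items()}
print(candidate)
```

With n_iter = 100, the search touches well under 1% of the grid, which is why it works best as a first pass.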

In [None]:
rf_random = RandomForestClassifier(n_jobs=-1)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
rf_randomGrid = RandomizedSearchCV(rf_random, param_distributions=randomCV_grid, n_iter = 100, cv = 3, random_state=42,  verbose=1)

In [None]:
rf_randomGrid.fit(X_train, y_train);

To view which of the tested hyperparameters had the best result, use the best_params_ attribute.  

In [None]:
rf_randomGrid.best_params_

We see that the validation set's accuracy is 67.17%, which did better than the base model's score of 66.79%, but worse than a previous model, which got 67.30%.

In [None]:
best_randomCV_tree = rf_randomGrid.best_estimator_
print_score(best_randomCV_tree, X_train, y_train, X_val, y_val)

#### 7.2 - GridSearchCV
From the results of RandomizedSearchCV, I am able to focus on a smaller range for each hyperparameter.  The implementation of GridSearchCV is similar to the randomized version.


In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
CV_grid = {'n_estimators': [60, 65, 70, 75], 
           'max_features': [0.2, 0.3, 0.4, 0.5], 
           'min_samples_split': [16, 18, 20, 22], 
           'min_samples_leaf': [6, 7, 8, 9], 
           'bootstrap': [False]}
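Unlike the randomized search, a grid search tries every combination. The sketch below enumerates the grid above with itertools to show how many models that means, assuming cv = 3 as used in this post. The variable names are my own.

```python
from itertools import product

cv_grid = {
    "n_estimators": [60, 65, 70, 75],
    "max_features": [0.2, 0.3, 0.4, 0.5],
    "min_samples_split": [16, 18, 20, 22],
    "min_samples_leaf": [6, 7, 8, 9],
    "bootstrap": [False],
}

# Build one dictionary of settings per combination in the grid.
names = list(cv_grid)
combos = [dict(zip(names, values)) for values in product(*cv_grid.values())]

cv = 3
print(len(combos))       # 4 * 4 * 4 * 4 * 1 combinations
print(len(combos) * cv)  # model fits needed with 3-fold cross validation
print(combos[0])         # the first combination that would be tried
```

Adding just one more value to each of the four lists would more than double the number of fits, so it pays to keep the grid narrow.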

In [None]:
rf_grid = RandomForestClassifier(n_jobs=-1)

In [None]:
rf_randomGrid = GridSearchCV(rf_grid, param_grid=CV_grid, cv = 3, verbose = 1)

In [None]:
rf_randomGrid.fit(X_train, y_train)

In [None]:
rf_randomGrid.best_params_

In [None]:
print_score(rf_randomGrid, X_train, y_train, X_val, y_val)

These results are a slight decrease in accuracy from the randomized CV, so I'm pretty sure I've hit the limits of random forest hyperparameter tuning for this data set without further data processing.

### 8 - Submitting to Kaggle
Using the model that got the highest accuracy on the validation set (from section 6.2), I got a score of 0.84781, placing me at 809 out of 1604.  This is actually worse than my first submission.  However, when I submitted using the optimized hyperparameters, I got a score of 0.77676, placing me at 548.  Thank you, RandomizedSearchCV and GridSearchCV.  