# Random Forest Regression Example
Following [This Tutorial](http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/)

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
# Utilities for scaling, transforming and wrangling data
from sklearn import preprocessing

# Family of models that use Random Forests for regression
from sklearn.ensemble import RandomForestRegressor

# Cross Validation tools
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Metrics for model evaluation
from sklearn.metrics import mean_squared_error, r2_score

# An alternative to pickle for persistently saving numpy arrays efficiently
# (sklearn.externals.joblib was removed from scikit-learn; import joblib directly)
import joblib

#### Load the data about wine

In [None]:
data = pd.DataFrame()

try:
    data = pd.read_pickle('./wine_dataset_in_pandas.pkl')
    print('Loaded local copy of wines data')
except FileNotFoundError:
    dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
    data = pd.read_csv(dataset_url, sep=';')
    data.to_pickle('./wine_dataset_in_pandas.pkl')
    print('Downloaded data from the web and saved it locally')

Let's look at the data

In [None]:
display(data.describe())
display(data.head(4))
print('Data Size:', data.shape)

11 Features<br>
1599 Samples<br>
**quality** is the target<br>
Data features look to be very different, scale wise. _chlorides_ vs _total sulfur dioxide_ for instance<br>
**Scaling** will be required.

#### Setup the data

In [None]:
# Split between features and labels
y = data.quality # labels
X = data.drop('quality', axis=1) # features

# Split between training and testing
# Predefine a random state (seed) so we can get the same split again.
# By stratifying by the target we ensure the training set looks similar to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

# Scaling all the features (zero mean, unit variance per column)
X_train_scaled = preprocessing.scale(X_train, axis=0, with_mean=True, with_std=True)
display(X_train_scaled[0:2])
display(X_train_scaled[2][0:2])

# Look up column positions by name rather than hard-coding them
cloriIdx = X_train.columns.get_loc('chlorides')
sulDioxIdx = X_train.columns.get_loc('total sulfur dioxide')

display(X_train_scaled[0:5, cloriIdx])
cloriScaledMax = X_train_scaled[:, cloriIdx].max()
cloriScaledMin = X_train_scaled[:, cloriIdx].min()
cloriOriginalMax = X_train.chlorides.max()
cloriOriginalMin = X_train.chlorides.min()
print('Original Min = ', cloriOriginalMin)
print('Scaled Min = ', cloriScaledMin)

print('Original MAX = ', cloriOriginalMax)
print('Scaled MAX = ', cloriScaledMax)
print('='*50)

sulDioxScaledMax = X_train_scaled[:, sulDioxIdx].max()
sulDioxScaledMin = X_train_scaled[:, sulDioxIdx].min()
sulDioxOriginalMax = X_train['total sulfur dioxide'].max()
sulDioxOriginalMin = X_train['total sulfur dioxide'].min()
print('Original Min = ', sulDioxOriginalMin)
print('Scaled Min = ', sulDioxScaledMin)

print('Original MAX = ', sulDioxOriginalMax)
print('Scaled MAX = ', sulDioxScaledMax)




**NOTE**

Great, but we won't use this scaling.<br>
The reason is that we won't be able to perform the exact same transformation on the test set.<br>
Sure, we can still scale the test set separately, but we won't be using the same means and standard deviations as we used to transform the training set.<br>
In other words, it wouldn't be a fair representation of how the model pipeline, including the preprocessing steps, would perform on brand new data.

So instead of directly invoking the scale function, we'll be using a feature in Scikit-Learn called the Transformer API.<br>
The Transformer API allows you to "fit" a preprocessing step using the training data the same way you'd fit a model...
...and then use the same transformation on future data sets!

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

display(X_train_scaled.mean(axis=0))
display(X_train_scaled.std(axis=0))
display(X_train_scaled.min())
display(X_train_scaled.max())

Every feature in the scaled training set is now centered at zero with unit standard deviation.

In [None]:
X_test_scaled = scaler.transform(X_test)
display(X_test_scaled.mean(axis=0))
display(X_test_scaled.std(axis=0))
display(X_test_scaled.min())
display(X_test_scaled.max())

The scaled features in the test set are not exactly centered at zero, and their variance is **!= 1**. This is exactly what we'd expect, as we're transforming the test set using the means and standard deviations from the training set, not from the test set itself.<br>
In practice, when we set up the cross-validation pipeline, we won't even need to manually fit the transformer. Instead, we'll simply declare the class object, like so:

## Pipeline

In [None]:
pipeline = make_pipeline(preprocessing.StandardScaler(), RandomForestRegressor(n_estimators=100))

This pipeline first passes the data through a `StandardScaler`, then feeds the scaled output into the `RandomForestRegressor`.
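
To make the two stages concrete, here is a minimal sketch comparing a pipeline with the same steps done by hand. It uses a synthetic dataset from `make_regression` as a stand-in for the wine data, and the names `demo_pipeline`, `manual_scaler` and `manual_forest` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 11 numeric features)
X_demo, y_demo = make_regression(n_samples=200, n_features=11, noise=0.5, random_state=0)
X_fit, X_new = X_demo[:150], X_demo[150:]
y_fit = y_demo[:150]

# Pipeline: fit() scales then trains, predict() scales then predicts
demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)
demo_pipeline.fit(X_fit, y_fit)
pipeline_pred = demo_pipeline.predict(X_new)

# The same two stages done by hand
manual_scaler = StandardScaler().fit(X_fit)
manual_forest = RandomForestRegressor(n_estimators=20, random_state=0)
manual_forest.fit(manual_scaler.transform(X_fit), y_fit)
manual_pred = manual_forest.predict(manual_scaler.transform(X_new))

# Both routes should agree because the seed and the data are identical
print(np.allclose(pipeline_pred, manual_pred))
```

The fixed `random_state` is only there so the two forests are comparable; the pipeline in this notebook leaves it unset.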

### Hyperparameters
There are two types of parameters we need to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot.<br>
Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.
#### Random Forest hyperparameters
As an example, let's take our random forest for regression:<br>
Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch locations are model parameters.<br>
However, the algorithm does not know which of the two criteria, MSE or MAE, it should use. It also cannot decide how many trees to include in the forest. These are examples of _hyperparameters_ that the user must set.
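
As a rough illustration of the difference, the sketch below sets two hyperparameters by hand and then inspects what one tree learned during fitting. The data is synthetic and the small values for `n_estimators` and `max_depth` are arbitrary choices for the demo.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=11, noise=0.5, random_state=0)

# Hyperparameters: chosen by the user before training
forest = RandomForestRegressor(n_estimators=5, max_depth=3, random_state=0)
print(forest.get_params()['n_estimators'], forest.get_params()['max_depth'])

# Model parameters: learned from the data during fit()
forest.fit(X_demo, y_demo)
first_tree = forest.estimators_[0].tree_
split_nodes = first_tree.feature >= 0  # leaves carry a negative feature index
print('Features used for splits:', first_tree.feature[split_nodes])
print('Learned thresholds:', first_tree.threshold[split_nodes])
```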

In [None]:
# List the tunable hyperparameters
display(pipeline.get_params())

Since we are using a pipeline, each hyperparameter is prefixed by the name of its pipeline stage: `randomforestregressor__` or `standardscaler__`.<br>
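
The prefix convention can be tried out directly with `set_params`. This is a small sketch on a throwaway pipeline (`demo_pipeline` is a name invented here), so it does not alter the pipeline defined earlier.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100))

# make_pipeline names each step after its lowercased class name
print(list(demo_pipeline.named_steps))

# <step name>__<parameter name> reaches into a single stage
demo_pipeline.set_params(
    randomforestregressor__max_depth=5,
    standardscaler__with_mean=False,
)
print(demo_pipeline.named_steps['randomforestregressor'].max_depth)
print(demo_pipeline.named_steps['standardscaler'].with_mean)
```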


In [None]:
hyperparameters = {
                    # 1.0 means "use all features"; the older 'auto' option was removed in scikit-learn 1.3
                    'randomforestregressor__max_features' : [1.0, 'sqrt', 'log2'],
                    'randomforestregressor__max_depth' : [None, 5, 3, 1]
                    }

**max_features** : int, float, string or None, optional (default=”auto” in older scikit-learn releases, 1.0 since version 1.1)<br>
The number of features to consider when looking for the best split:
* _int_ then consider max_features features at each split.
* _float_ then max_features is a fraction and int(max_features * n_features) features are considered at each split.
* _“auto”_ then max_features=n_features (deprecated in scikit-learn 1.1 and removed in 1.3; use 1.0 or None instead).
* _“sqrt”_ then max_features=sqrt(n_features).
* _“log2”_ then max_features=log2(n_features).
* _None_ then max_features=n_features.
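
To see what these settings mean for the 11 features in this dataset, here is a small sketch. `resolve_max_features` is a hypothetical helper written for this example, not a scikit-learn function, and it assumes that fractional results are rounded down with a minimum of one feature.

```python
import math

def resolve_max_features(max_features, n_features):
    """Hypothetical helper: how many features are tried at each split."""
    if max_features is None:
        return n_features
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # Remaining case: a float, read as a fraction of the features
    return max(1, int(max_features * n_features))

n_features = 11
for setting in [None, 'sqrt', 'log2', 4, 0.5, 1.0]:
    print(setting, '->', resolve_max_features(setting, n_features))
```

With 11 features, `'sqrt'` and `'log2'` both come out at 3 features per split, so on this dataset the two settings behave the same way.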

**max_depth**:<br>
The maximum depth of each tree. If _None_, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

### Tune Model using Cross-Validation (CV) pipeline
This is one of the most important skills in all of machine learning because it helps maximize model performance while reducing the chance of overfitting.<br>
**Cross-validation** is a process for reliably estimating the performance of a method for building a model by training and evaluating your model multiple times using the same method.<br>
Practically, that "method" is simply a set of hyperparameters in this context.<br>
These are the steps for CV:
1. Split your data into k equal parts, or "folds" (typically k=10).
2. Train your model on k-1 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).

Perform steps (2) and (3) k times, each time holding out a different fold.<br>
Aggregate the performance across all k folds. This is your _performance metric_.
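
The steps above can be sketched as an explicit loop. This is only an illustration on synthetic data; the forest settings and the use of R^2 as the per-fold score are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=11, noise=0.5, random_state=0)

k = 10
fold_scores = []
# Step 1: split the data into k folds
for train_idx, holdout_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_demo):
    # Step 2: train on the k-1 training folds
    model = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Step 3: evaluate on the hold-out fold
    fold_pred = model.predict(X_demo[holdout_idx])
    fold_scores.append(r2_score(y_demo[holdout_idx], fold_pred))

# Aggregate the performance across all k folds
cv_score = np.mean(fold_scores)
print(len(fold_scores), cv_score)
```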

<img src=https://elitedatascience.com/wp-content/uploads/2016/12/K-fold_cross_validation_EN.jpg>

**The Motivation**<br>
Let's say you want to train a random forest regressor. One of the hyperparameters you must tune is the maximum depth allowed for each decision tree in your forest.
How can you decide?
That's where cross-validation comes in. Using only your training set, you can use CV to evaluate different hyperparameters and estimate their effectiveness.
This allows you to keep your test set "untainted" and save it for a true hold-out evaluation when you're finally ready to select a model.
For example, you can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbors model, using only the training set. Then, you still have the untainted test set to make your final selection between the model families!

**CV "Pipeline"**<br>
The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop.<br>
This prevents accidentally tainting your training folds with influential data from your test fold.<br>
Here's how the CV pipeline looks after including preprocessing steps:

1. Split your data into k equal parts, or "folds" (typically k=10).
2. **Preprocess k-1** training folds.
3. Train your model on the same k-1 folds.
4. **Preprocess the hold-out** fold using the same transformations from step (2).
5. Evaluate your model on the same hold-out fold.
6. Perform steps (2) - (5) k times, each time holding out a different fold.
7. Aggregate the performance across all k folds. This is your performance metric.
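
As a sketch of why a pipeline gives you this for free, the loop below follows the numbered steps by hand and then compares the fold scores with `cross_val_score` run on an equivalent pipeline. The data is synthetic and the seeds are fixed only so the two runs are comparable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=11, noise=0.5, random_state=0)
folds = KFold(n_splits=10, shuffle=True, random_state=0)  # step 1

manual_scores = []
for train_idx, holdout_idx in folds.split(X_demo):
    # Step 2: fit the preprocessing on the k-1 training folds only
    scaler = StandardScaler().fit(X_demo[train_idx])
    # Step 3: train the model on the same (scaled) folds
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # Step 4: transform the hold-out fold with the training-fold statistics
    X_holdout = scaler.transform(X_demo[holdout_idx])
    # Step 5: evaluate on the hold-out fold (score() returns R^2 for regressors)
    manual_scores.append(model.score(X_holdout, y_demo[holdout_idx]))

# The pipeline version performs steps 2 to 6 internally
demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)
pipeline_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=folds)

print(np.allclose(manual_scores, pipeline_scores))
print(np.mean(pipeline_scores))  # step 7: aggregate
```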

In [None]:
# Cross-validating the pipeline is straightforward with scikit-learn
# cv is the number of folds to take
cvp = GridSearchCV(pipeline, hyperparameters, cv=10)

# Fit and tune
cvp.fit(X_train, y_train)

This performs CV across the entire _grid_ (i.e. all possible combinations) of the hyperparameters.
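
To get a feel for how much work the search does, the grid can be enumerated with `ParameterGrid`. The dictionary below, `demo_grid`, is a stand-in with the same shape as the one used in this notebook (three values for `max_features`, four for `max_depth`).

```python
from sklearn.model_selection import ParameterGrid

# Stand-in grid: 3 values for max_features, 4 for max_depth
demo_grid = {
    'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1],
}

combinations = list(ParameterGrid(demo_grid))
n_folds = 10

print(len(combinations))            # number of candidate settings (3 * 4)
print(len(combinations) * n_folds)  # model fits during the search, before the final refit
print(combinations[0])              # one candidate, as a plain dict
```

Every extra value added to either list multiplies the number of fits, so large grids get expensive quickly.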

In [None]:
print('Best parameters for our pipeline:')
display(cvp.best_params_)
print('Best estimator')
display(cvp.best_estimator_)

#### Refit on the training set

After tuning your hyperparameters appropriately using cross-validation, you can generally get a small performance improvement by refitting the model on the entire training set.<br>
`GridSearchCV`  will automatically refit the model with the best set of hyperparameters using the entire training set.<br>
This functionality is ON by default, but you can confirm it:

In [None]:
print("Refit enabled: " + str(cvp.refit))
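
A small sketch of what the refit gives you, using a deliberately tiny search on synthetic data so it runs quickly (`demo_search` and its two-value grid are assumptions for this example). With refitting on, `predict` on the search object is delegated to the refitted `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=11, noise=0.5, random_state=0)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=10, random_state=0)),
    {'randomforestregressor__max_depth': [None, 3]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# best_estimator_ is a full pipeline, already refitted on all data passed to fit()
print(demo_search.refit)
print(demo_search.best_params_)
same_predictions = np.allclose(
    demo_search.predict(X_demo),
    demo_search.best_estimator_.predict(X_demo),
)
print(same_predictions)
```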

## Evaluate the Model pipeline
Predict the test data using the pipeline

In [None]:
y_pred = cvp.predict(X_test)

###  Evaluation Metrics
Using the metrics we previously imported, evaluate the model performance.
\begin{equation*}
R^2 \text{ score}
\end{equation*}

In [None]:
display(r2_score(y_test, y_pred))
print("MSE (Mean Square Error)")
display(mean_squared_error(y_test, y_pred))
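
For reference, both metrics can be reproduced by hand. The sketch below uses a handful of made-up quality scores (not results from this model) and checks the manual formulas against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only, not output of the model above
y_true = np.array([5.0, 6.0, 5.0, 7.0, 4.0])
y_hat = np.array([5.2, 5.7, 5.1, 6.4, 4.6])

residual_ss = np.sum((y_true - y_hat) ** 2)       # squared error left unexplained
total_ss = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean

manual_mse = residual_ss / len(y_true)
manual_r2 = 1 - residual_ss / total_ss

print(np.isclose(manual_mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(manual_r2, r2_score(y_true, y_hat)))
```

Read this way, an R^2 of 0 means the model does no better than always predicting the mean quality, and 1 means perfect predictions.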

#### So, is this good?
The tutorial's authors recommend a combination of three strategies to decide if you're satisfied with your model performance.<br>
1. Start with the goal of the model. If the model is tied to a business problem, have you successfully solved the problem?
2. Look in academic literature to get a sense of the current performance benchmarks for specific types of data.
3. Try to find low-hanging fruit in terms of ways to improve your model.

There are various ways to improve a model. Here are a few quick things to try:
1. Try other regression model families (e.g. regularized regression, boosted trees, etc.).
1. Collect more data if it's cheap to do so.
1. Engineer smarter features after spending more time on exploratory analysis.
1. Speak to a domain expert to get more context (...this is a good excuse to go wine tasting!).

As a final note, when you try other families of models, we recommend using the **same training and test** set as you used to fit the random forest model. That's the best way to get a true apples-to-apples comparison between your models.
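
One possible way to set up such a comparison is sketched below. The three model families and their settings are arbitrary picks for illustration, and the data is synthetic; the point is only that every candidate is fitted and scored on the same split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=0)

# One split, reused for every model family
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

candidates = {
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'boosted trees': GradientBoostingRegressor(random_state=0),
    'ridge regression': Ridge(alpha=1.0),
}

test_scores = {}
for name, model in candidates.items():
    fitted = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr)
    test_scores[name] = r2_score(y_te, fitted.predict(X_te))

for name, score in test_scores.items():
    print(f'{name}: R^2 = {score:.3f}')
```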

#### Save the Model to Disk

In [None]:
joblib.dump(cvp, 'rf_regressor.pkl')

In [None]:
# For loading the model and using it:
cvpFromDisk = joblib.load('./rf_regressor.pkl')
cvpFromDisk.predict(X_test)
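
A quick way to check that a saved model survives the round trip is to compare predictions before and after loading. This sketch writes to a temporary directory and uses a small stand-in pipeline on synthetic data; the file name `rf_regressor_demo.pkl` is made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and model (assumption, not the tuned pipeline above)
X_demo, y_demo = make_regression(n_samples=100, n_features=11, noise=0.5, random_state=0)
demo_model = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
).fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rf_regressor_demo.pkl')
    joblib.dump(demo_model, model_path)
    restored_model = joblib.load(model_path)

round_trip_ok = np.allclose(demo_model.predict(X_demo), restored_model.predict(X_demo))
print(round_trip_ok)
```

Files written by joblib are pickles, so they should only be loaded from sources you trust, and ideally with the same scikit-learn version that wrote them.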