# Predicting Yacht Resistance with Decision Trees & Random Forests

## Introduction

This notebook is a simple demonstration of how to use scikit-learn to build Decision Tree and Random Forest models for regression. It uses a dataset of 308 experiments and their attributes. The goal is to predict the residuary resistance per unit weight of displacement from those attributes.

## The Data

The data is taken from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml); the raw data and further information can be found [here](https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics).

The columns are as follows:

1. Longitudinal position of the center of buoyancy, adimensional.
2. Prismatic coefficient, adimensional.
3. Length-displacement ratio, adimensional.
4. Beam-draught ratio, adimensional.
5. Length-beam ratio, adimensional.
6. Froude number, adimensional.
7. Residuary resistance per unit weight of displacement, adimensional. 

Where column 7 is the target variable we are looking to predict.

We import the Python libraries we need

In [1]:
import pandas as pd
import numpy as np

We read in the data we've saved, passing the column names

In [2]:
yacht = pd.read_csv("data/yacht_hydrodynamics.csv", names=["longitudinal_pos", "presmatic_coef", "length_disp", "beam-draught_rt", 
                                                           "length-beam_rt", "froude_num", "resid_resist"], sep=" ")

Let's check out the first few rows of data

In [3]:
yacht.head()

| | longitudinal_pos | presmatic_coef | length_disp | beam-draught_rt | length-beam_rt | froude_num | resid_resist |
|---|---|---|---|---|---|---|---|
| 0 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.125 | 0.11 |
| 1 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.15 | 0.27 |
| 2 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.175 | 0.47 |
| 3 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.2 | 0.78 |
| 4 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.225 | 1.18 |


We can quickly check if we have any null values in our data

In [4]:
yacht.isnull().values.any()

True

We do! Let's use the "describe" method to find them, amongst other interesting information

In [5]:
yacht.describe()

| | longitudinal_pos | presmatic_coef | length_disp | beam-draught_rt | length-beam_rt | froude_num | resid_resist |
|---|---|---|---|---|---|---|---|
| count | 308.0 | 252.0 | 308.0 | 308.0 | 308.0 | 308.0 | 308.0 |
| mean | -2.381818 | 0.563944 | 4.008182 | 4.096364 | 3.341364 | 0.824318 | 8.476461 |
| std | 1.513219 | 0.022947 | 1.643974 | 0.653655 | 0.391571 | 1.1462 | 14.052367 |
| min | -5.0 | 0.53 | 0.53 | 2.81 | 2.73 | 0.125 | 0.01 |
| 25% | -2.4 | 0.546 | 4.34 | 3.75 | 3.15 | 0.225 | 0.3675 |
| 50% | -2.3 | 0.565 | 4.78 | 3.99 | 3.17 | 0.325 | 1.79 |
| 75% | -2.3 | 0.574 | 4.78 | 4.77 | 3.53 | 0.425 | 8.0925 |
| max | 0.0 | 0.6 | 5.14 | 5.35 | 4.24 | 3.51 | 62.42 |
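Reading the "count" row works, but counting nulls per column is a more direct check. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative only, not the yacht data):

```python
import numpy as np
import pandas as pd

# Illustrative data only: two columns, one of them with gaps.
toy = pd.DataFrame({
    "presmatic_coef": [0.568, np.nan, 0.530, np.nan],
    "froude_num": [0.125, 0.150, 0.175, 0.200],
})

# isnull() marks each missing cell; sum() then counts them column by column.
null_counts = toy.isnull().sum()
print(null_counts)
```

The same call on `yacht` would list the number of missing values for every column at once.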


The column *presmatic_coef* has a count of 252 rather than 308, so it has 56 missing values. We can deal with this in a few different ways. The simplest solution is to remove the affected rows, though we lose many examples in doing so. Alternatively, we could impute the values, replacing each NaN with an average (mean or median). For the purpose of this simple notebook, we will simply remove them.

In [6]:
yacht = yacht.dropna()
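For comparison, the imputation alternative could look roughly like the sketch below. It uses a small made-up frame (`toy` is illustrative only) and fills each gap with the column median; this notebook does not take this route.

```python
import numpy as np
import pandas as pd

# Illustrative data only, not the yacht measurements.
toy = pd.DataFrame({
    "presmatic_coef": [0.53, np.nan, 0.57, 0.60, np.nan],
    "froude_num": [0.125, 0.150, 0.175, 0.200, 0.225],
})

# median() skips NaN by default; fillna() then puts each column's
# median into that column's missing cells.
imputed = toy.fillna(toy.median())
print(imputed)
```

No rows are lost this way. In a train/test workflow the median would normally be computed on the training split only, so that the test data does not influence the fill value.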

## Train & Test Data

The purpose of splitting the data is to be able to assess the quality of a predictive model when it is used on unseen data. When training, you will try to build a model that fits to the data as closely as possible, to be able to most accurately make a prediction. However, without a test set you run the risk of overfitting - the model works very well for the data it has seen but not for new data.

The split ratio is often debated and in practice you might split your data into three sets: train, validation and test. You would use the training data to fit the models you are considering; the validation set to test on whilst tweaking parameters; and the test set to get an understanding of how your final model would work in practice. Furthermore, techniques such as K-Fold cross-validation help to reduce the bias that comes from relying on a single split.
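As a rough illustration of K-Fold cross-validation, the sketch below scores a decision tree on five folds. The data is synthetic (`X_demo` and `y_demo` are made-up stand-ins), and the choice of five folds is an assumption for the example rather than a recommendation.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 100 samples, 3 features, noisy linear target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# 5 folds: each fold is held out once while the model trains on the other 4.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="neg_mean_squared_error",  # errors are negated so that higher is better
)

fold_mse = -scores  # flip the sign back to get one MSE per fold
print(fold_mse.mean(), fold_mse.std())
```

The spread across folds gives a sense of how much the error estimate depends on which rows happen to be held out.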

For the purpose of this demonstration, we will only be randomly splitting our data into train and test sets, with an 80/20 split.

We import the required library from scikit-learn, [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [7]:
from sklearn.model_selection import train_test_split

We wish for all features to be used for training, therefore we are taking all columns except "resid_resist"

In [8]:
X = yacht.drop(["resid_resist"], axis=1)

The column "resid_resist" is our target variable, so we set y as this column

In [9]:
y = yacht["resid_resist"]

We use the *train_test_split* function to create the appropriate train and test data for our features ("X_train" and "X_test" respectively) and target data ("y_train" and "y_test"). We are specifying our test data to be 20% of the total data. We are also providing a seed to be able to reproduce this split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We can check the number of examples we have in each of our train and test data sets using "shape"

In [11]:
X_train.shape

(201, 6)

In [12]:
X_test.shape

(51, 6)

## Standardisation

All features are numeric, so we do not need to worry about converting categorical data with techniques such as one-hot encoding. However, we will demonstrate how to standardise our data. Standardisation rescales each attribute so that it has a mean of 0 and a standard deviation of 1. It works best when the attribute's distribution is roughly Gaussian. Alternatively, normalisation can be used to rescale each attribute to the range 0 to 1.
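To make the two rescalings concrete, here is a small worked example with NumPy on made-up values (`x` is illustrative only). It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Illustrative feature values only.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Standardisation: subtract the mean, divide by the standard deviation.
standardised = (x - x.mean()) / x.std()

# Min-max normalisation: map the smallest value to 0 and the largest to 1.
normalised = (x - x.min()) / (x.max() - x.min())

print(standardised)
print(normalised)
```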

We use scikit-learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [13]:
from sklearn.preprocessing import StandardScaler

We create the scaler, leaving parameters as default

In [14]:
scaler = StandardScaler()

We fit the scaler passing the training data but also request it transforms the data and returns it to a variable named "train_scaled"

In [15]:
train_scaled = scaler.fit_transform(X_train)

We then transform our test data with the same fitted scaler

In [16]:
test_scaled = scaler.transform(X_test)

## Decision Trees & Random Forests

Decision trees learn how to best split the dataset into separate branches, allowing them to learn non-linear relationships.

Random forests (RF) and Gradient Boosted Trees (GBT) are two algorithms that build many individual decision trees, pooling their predictions. As they use a collection of results to make a final decision, they are referred to as 'Ensemble techniques'.
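For a random forest regressor, "pooling" means averaging. The sketch below fits a small forest on synthetic data (`X_demo` and `y_demo` are made-up stand-ins) and checks that the forest's prediction equals the mean of its individual trees' predictions. Gradient boosted trees combine their trees differently and are not covered by this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with a non-linear target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(80, 2))
y_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] ** 2

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; average their predictions by hand.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own prediction.
matches = np.allclose(manual_average, forest.predict(X_demo))
print(matches)
```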

We are using scikit-learn's [Decision Tree Regressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) and [Random Forest Regressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [17]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

We create a Decision Tree and a Random Forest model

In [18]:
tree_model = DecisionTreeRegressor()
rf_model = RandomForestRegressor()

We train them with our scaled training data and target values

In [19]:
tree_model.fit(train_scaled, y_train)
rf_model.fit(train_scaled, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

## Model Evaluation

We wish to understand how good our model is; there are a few different metrics we can use. We will evaluate mean squared error (MSE) and mean absolute error (MAE)

We import [scikit-learn's mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) and [scikit-learn's mean absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error)

In [20]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

We calculate the errors for our training data

In [25]:
tree_mse = mean_squared_error(y_train, tree_model.predict(train_scaled))
tree_mae = mean_absolute_error(y_train, tree_model.predict(train_scaled))
rf_mse = mean_squared_error(y_train, rf_model.predict(train_scaled))
rf_mae = mean_absolute_error(y_train, rf_model.predict(train_scaled))

In [26]:
from math import sqrt

In [27]:
print("Decision Tree training mse = ",tree_mse," & mae = ",tree_mae," & rmse = ", sqrt(tree_mse))
print("Random Forest training mse = ",rf_mse," & mae = ",rf_mae," & rmse = ", sqrt(rf_mse))

Decision Tree training mse =  0.0  & mae =  0.0  & rmse =  0.0
Random Forest training mse =  0.10843392537313411  & mae =  0.16930845771144287  & rmse =  0.32929306912404654


The mean absolute error is the easier metric to interpret: on the training data, the decision tree's predictions were perfect, while the random forest's were on average 0.17 away from the true value. Mean squared error, and consequently root mean squared error (RMSE), penalises a prediction more heavily the further it lies from the true value.
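A small worked example shows the difference between the metrics. The values are made up: both sets of predictions have the same MAE, but the one containing a single large miss has a much higher MSE and RMSE.

```python
import numpy as np

# Illustrative values only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
many_small = np.array([2.0, 3.0, 4.0, 5.0])  # every prediction off by 1
one_large = np.array([1.0, 2.0, 3.0, 8.0])   # a single prediction off by 4


def report(y_pred):
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))  # average size of the error
    mse = np.mean(errors ** 2)     # squaring weights large errors more heavily
    rmse = np.sqrt(mse)            # back in the units of the target
    return mae, mse, rmse


print(report(many_small))  # MAE 1.0, MSE 1.0, RMSE 1.0
print(report(one_large))   # MAE 1.0, MSE 4.0, RMSE 2.0
```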

We can calculate the same metrics on the test data to understand how well the models generalise.

In [28]:
tree_test_mse = mean_squared_error(y_test, tree_model.predict(test_scaled))
tree_test_mae = mean_absolute_error(y_test, tree_model.predict(test_scaled))
rf_test_mse = mean_squared_error(y_test, rf_model.predict(test_scaled))
rf_test_mae = mean_absolute_error(y_test, rf_model.predict(test_scaled))

In [29]:
print("Decision Tree test mse = ",tree_test_mse," & mae = ",tree_test_mae," & rmse = ", sqrt(tree_test_mse))
print("Random Forest test mse = ",rf_test_mse," & mae = ",rf_test_mae," & rmse = ", sqrt(rf_test_mse))

Decision Tree test mse =  1.190045098039216  & mae =  0.573921568627451  & rmse =  1.0908918819201177
Random Forest test mse =  1.1227068823529418  & mae =  0.5241764705882354  & rmse =  1.0595786343414735


So even though we were seeing perfect results on the training data for our decision tree model, it is actually performing worse than the random forest model on our test data.

## Decision Tree & Random Forest Parameters

More information on tree algorithms can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/tree.html) and ensembles [here](http://scikit-learn.org/stable/modules/ensemble.html)

There are a number of parameters that can be tuned and should be explored when trying to improve Decision Tree and Random Forest models. A common approach is to test many different parameter values, building multiple models and comparing their errors to find the best combination.

### Decision Trees
For Decision Trees, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) provides parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

Here is high-level information on the parameters, with the defaults of the scikit-learn version used in this notebook; the documentation has more details:
- criterion : default="mse"
    - The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node.

- splitter : default="best"
    - The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

- max_depth : default=None
    - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

- min_samples_split : default=2
    - The minimum number of samples required to split an internal node.

- min_samples_leaf : default=1
    - The minimum number of samples required to be at a leaf node.

- min_weight_fraction_leaf : default=0.
    - The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

- max_features : default=None
    - The number of features to consider when looking for the best split.

- random_state : default=None
    - If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

- max_leaf_nodes : default=None
    - Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

- min_impurity_decrease : default=0.
    - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

- presort : default=False
    - Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
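To see how a couple of these parameters shape a tree, the sketch below compares an unrestricted tree with one limited by `max_depth` and `min_samples_leaf`. The data is synthetic (`X_demo` and `y_demo` are made-up stand-ins) and the parameter values are arbitrary. It assumes a scikit-learn version that provides `get_depth()` and `get_n_leaves()` (0.21 or later).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: noisy non-linear target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 2))
y_demo = np.sin(4 * X_demo[:, 0]) + rng.normal(scale=0.2, size=200)

# Default settings: the tree grows until its leaves cannot be split further.
unrestricted = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Restricted: at most 3 levels of splits and at least 5 samples per leaf.
restricted = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_demo, y_demo)

for name, model in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(name, model.get_depth(), model.get_n_leaves())
```

A smaller tree fits the training data less closely, which is one way to reduce overfitting of the kind seen with the default decision tree.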
    
### Random Forests

Similarly, for Random Forests, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) provides parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

- n_estimators : default=10
    - The number of trees in the forest.

- criterion : default="mse"
    - The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.

- max_features : default="auto"
    - The number of features to consider when looking for the best split.

- max_depth : default=None
    - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

- min_samples_split : default=2
    - The minimum number of samples required to split an internal node.

- min_samples_leaf : default=1
    - The minimum number of samples required to be at a leaf node.

- min_weight_fraction_leaf : default=0.
    - The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

- max_leaf_nodes : default=None
    - Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

- min_impurity_decrease : default=0.
    - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

- bootstrap : default=True
    - Whether bootstrap samples are used when building trees.

- oob_score : default=False
    - Whether to use out-of-bag samples to estimate the R^2 on unseen data.

- n_jobs : default=1
    - The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

- random_state : default=None
    - If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

- verbose : default=0
    - Controls the verbosity of the tree building process.

- warm_start : default=False
    - When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
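As an illustration of `bootstrap` and `oob_score` working together, the sketch below fits a forest on synthetic data (`X_demo` and `y_demo` are made-up stand-ins) and reads the out-of-bag R^2 from the fitted model. The value of `n_estimators` is chosen for the example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: noisy linear target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# With bootstrap sampling, each tree leaves some rows out of its training sample.
# oob_score=True scores every row using only the trees that did not see it.
forest = RandomForestRegressor(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
).fit(X_demo, y_demo)

print(forest.oob_score_)  # out-of-bag R^2
```

This gives an estimate of performance on unseen data without setting aside a separate validation set.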

### Grid Search

To search for the best hyper-parameters for your algorithm and data, grid search cross validation is commonly used. The [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html) provides more thorough information on how to use this. 
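A minimal sketch of a grid search over two Random Forest parameters is shown below. The data is synthetic (`X_demo` and `y_demo` are made-up stand-ins), and the grid values and the three-fold setting are arbitrary choices for the example, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: noisy non-linear target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(120, 3))
y_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Every combination of these values is tried: 2 x 2 = 4 candidates.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                              # 3-fold cross-validation per candidate
    scoring="neg_mean_squared_error",  # errors are negated so that higher is better
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best candidate
```

After fitting, `search.best_estimator_` holds a model refitted on all of the supplied data with the best parameter combination.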

#### Data Citation

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 