## After finding a final algorithm using a train and test set, should you build your model on all available data or just training data?

Author: Zach Schuster <br> 2019-11-22
***

I recently had a conversation with colleagues [Mark Ewing](https://github.com/bmewing) and Christine Grassi about whether a final model should be fit on just the training data or on all available data (train and test data). Hopefully this notebook helps answer the question.

The idea of this "experiment" is fairly simple:
* Create a train, test, validation split (the validation set will act as new data for evaluating the final models)
* Run 10-fold CV on the training data to determine whether a random forest or a linear regression should be used
    * No strong reason for these two models other than keeping it simple
* Train two final models
    * One model will be trained on only the training data
    * One model will be trained on the train and test data
* Evaluate the models on the validation data 

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

In [2]:
train_data = pd.read_csv('../data/train.csv')
print('training data original dimensions: {}'.format(train_data.shape))

training data original dimensions: (1460, 81)


To keep this analysis simple, we will keep only the numeric columns that contain no null values.

In [3]:
train_data.dropna(axis=1, inplace=True)
train_data = train_data.select_dtypes(include=np.number)

# remove id column
train_data.drop(columns=['Id'], inplace=True)

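We will call the `train_test_split` function twice to create a train, test, validation split of (50, 30, 20). To verify the logic below: the first call holds out 20% of the data for validation, and the second takes 37.5% of the remaining 80% as test data, $.8 * .375 = .3$, which leaves $.8 * .625 = .5$ of the data for training.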

In [4]:
to_split, validation = train_test_split(train_data, test_size=.2, random_state=27)
train, test = train_test_split(to_split, test_size=.375, random_state=27)
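
The split arithmetic above can be checked directly. This is a minimal sketch, assuming a plain index array of 1,460 rows as a stand-in for the housing data; the `_demo` names are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 1,460-row data set: only the row count matters here.
rows_demo = np.arange(1460)

# Same two-step split as in the notebook: 20% validation, then 37.5% of the rest as test.
to_split_demo, validation_demo = train_test_split(rows_demo, test_size=.2, random_state=27)
train_demo, test_demo = train_test_split(to_split_demo, test_size=.375, random_state=27)

# Fraction of all rows that ends up in each part.
fractions = [len(part) / len(rows_demo) for part in (train_demo, test_demo, validation_demo)]
print('train, test, validation fractions: {}'.format(fractions))
```

The three fractions should come out close to .5, .3, and .2, and the three parts together should account for every row.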

Create X and y variables for the train set, the combined train and test set, and the validation set.

In [5]:
x_train = train.loc[:, train.columns != 'SalePrice']
y_train = train.loc[:, 'SalePrice']

x_train_test = to_split.loc[:, to_split.columns != 'SalePrice']
y_train_test = to_split.loc[:, 'SalePrice']

x_val = validation.loc[:, validation.columns != 'SalePrice']
y_val = validation.loc[:, 'SalePrice']

To use mean absolute error during CV, we can create a scorer using the `make_scorer` function.

In [6]:
mae_scorer = make_scorer(mean_absolute_error)
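
One detail worth knowing about this scorer: `make_scorer` defaults to `greater_is_better=True`, so the scores it returns are plain positive MAEs. That is fine for reading off numbers from `cross_val_score`, but a tool that maximizes the score (such as a grid search) would need `greater_is_better=False` or the built-in `'neg_mean_absolute_error'` string instead. The sketch below compares the two on synthetic data from `make_regression`, which is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption; the notebook uses the housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Custom scorer as in the notebook: returns positive MAE values.
custom_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                                scoring=make_scorer(mean_absolute_error))

# Built-in scorer: returns negated MAE values so that larger is better.
builtin_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                                 scoring='neg_mean_absolute_error')

# With the same deterministic folds and model, the two differ only in sign.
same_up_to_sign = np.allclose(custom_scores, -builtin_scores)
print('scores match up to sign: {}'.format(same_up_to_sign))
```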

In [7]:
rf = RandomForestRegressor(n_estimators=500)
lr = LinearRegression()

rf_score = cross_val_score(rf, X=x_train, y=y_train, cv=10, scoring=mae_scorer)
lr_score = cross_val_score(lr, X=x_train, y=y_train, cv=10, scoring=mae_scorer)

print('Random Forest CV MAE: {}'.format(np.mean(rf_score)))
print('Linear Regression CV MAE: {}'.format(np.mean(lr_score)))

Random Forest CV MAE: 19305.812270580565
Linear Regression CV MAE: 22307.177580429536


From the MAEs, the random forest has a lower cross-validated error than linear regression (not too surprising), which suggests it will generalize better, so we will use random forest for all models going forward.

Now it gets interesting. The two main approaches that can be taken are to:
1. train the algorithm on the training set and use that as a final model
2. train the algorithm on a combination of the train and test sets (which is all the data available to us) and use that as the final model

The second model is trained on more data, but does it run the risk of generalizing poorly to new data (our validation set)?

In [8]:
# train on training data
rf_train = RandomForestRegressor(n_estimators=500)
rf_train.fit(x_train, y_train)

# train on combination of train and test data
rf_all = RandomForestRegressor(n_estimators=500)
rf_all.fit(x_train_test, y_train_test)

# avoid unwanted output
print('')




Finally, we can generate predictions and evaluate performance on the validation data.

In [9]:
rf_train_preds = rf_train.predict(x_val)
rf_all_preds = rf_all.predict(x_val)

print('MAE when trained on only train data: {}'.format(mean_absolute_error(y_val, rf_train_preds)))
print('MAE when trained on combination of train and test data: {}'.format(mean_absolute_error(y_val, rf_all_preds)))


MAE when trained on only train data: 18482.00240410959
MAE when trained on combination of train and test data: 17991.74179452055


## So what do we make of it?
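From the results, the model trained on all of the available data generalized better to new data: its validation MAE was about 490 lower (roughly 2.7%) than that of the model trained on the training data alone. The difference is not drastic, and it comes from a single split with unseeded random forests, but it does provide some evidence for training a final model using all available data.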
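
A single split cannot show how stable that gap is, so one way to probe it would be to repeat the comparison over several random seeds. The following is only a sketch under stated assumptions: it uses synthetic data from `make_regression` rather than the housing data, smaller forests (50 trees) to keep it fast, and a hypothetical helper named `compare_once`. It says nothing about what the housing data would show.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the housing data).
X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)

def compare_once(seed):
    """Return validation MAE (train only) minus validation MAE (train + test) for one seed."""
    # Same (50, 30, 20) scheme as the notebook, with the seed controlling both splits.
    X_rest, X_val, y_rest, y_val = train_test_split(X_demo, y_demo, test_size=.2, random_state=seed)
    X_tr, _, y_tr, _ = train_test_split(X_rest, y_rest, test_size=.375, random_state=seed)

    rf_small = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    rf_large = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X_rest, y_rest)

    mae_small = mean_absolute_error(y_val, rf_small.predict(X_val))
    mae_large = mean_absolute_error(y_val, rf_large.predict(X_val))
    return mae_small - mae_large

# A positive difference means the model trained on train + test did better for that seed.
diffs = np.array([compare_once(seed) for seed in range(5)])
print('MAE differences per seed: {}'.format(diffs))
print('mean difference: {}, std: {}'.format(diffs.mean(), diffs.std()))
```

If the differences are mostly positive and their mean is large relative to their spread, that would support the conclusion more strongly than one split can.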
