# Bluebook for Bulldozers

We will be looking at the Blue Book for Bulldozers Kaggle competition: "The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations."

This dataset/competition has been chosen because the data closely resembles what you would encounter in a real workplace.

Competition link: https://www.kaggle.com/c/bluebook-for-bulldozers

In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [None]:
import pandas as pd
import numpy as np
from structured import *
import warnings
from sklearn.ensemble import RandomForestRegressor

from sklearn import metrics 

warnings.filterwarnings('ignore')

In [None]:
! unzip ../input/bluebook-for-bulldozers/Train.zip

In [None]:
df_raw = pd.read_csv('./Train.csv',low_memory=False,
                    parse_dates=["saledate"])
df_raw.head()

In [None]:
df_raw.saledate

In [None]:
# The Kaggle evaluation metric is RMSLE (root mean squared log error),
# so we take the log of the target and can then work with plain RMSE
df_raw.SalePrice = np.log(df_raw.SalePrice)
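
To see why taking the log up front is enough, here is a minimal sketch with made-up prices. The helper names `rmse_demo` and `rmsle_demo` and the sample numbers are illustrative assumptions, not part of the competition code. It checks that the RMSE computed on log prices is the same quantity as the RMSLE computed on raw prices.

```python
import numpy as np

def rmse_demo(pred, actual):
    # Plain root mean squared error
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle_demo(pred, actual):
    # Root mean squared error of the log values (np.log, as in the cell above)
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

# Made-up sale prices and predictions, for illustration only
actual_prices = np.array([10_000.0, 25_000.0, 60_000.0])
predicted_prices = np.array([12_000.0, 20_000.0, 66_000.0])

score_on_raw = rmsle_demo(predicted_prices, actual_prices)
score_on_logs = rmse_demo(np.log(predicted_prices), np.log(actual_prices))
print(score_on_raw, score_on_logs)
```

Because the two numbers agree, any model trained on the logged target can be scored with ordinary RMSE.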

## Initial processing

Before fitting a model, you need to separate out the target variable and convert the categorical variables to numbers.

The dataset has both continuous and categorical variables, as well as a datetime column (`saledate`). Some feature engineering is needed to extract useful information from these.

### 1. Dealing with dates

In [None]:
df_raw.saledate #datatype is datetime

The **add_datepart** method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities.

In [None]:
add_datepart(df_raw, 'saledate')
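
As a rough idea of what this kind of expansion involves, here is a minimal sketch built only on the pandas `.dt` accessor. The function `add_datepart_sketch`, the generated column names, and the two sample dates are assumptions for illustration; the real `add_datepart` from `structured` may produce different fields and names.

```python
import pandas as pd

def add_datepart_sketch(df, col):
    """Hypothetical stand-in: expand a datetime column into simple parts, in place."""
    dates = df[col]
    df[f'{col}_year'] = dates.dt.year
    df[f'{col}_month'] = dates.dt.month
    df[f'{col}_day'] = dates.dt.day
    df[f'{col}_dayofweek'] = dates.dt.dayofweek        # Monday=0 ... Sunday=6
    df[f'{col}_dayofyear'] = dates.dt.dayofyear
    df[f'{col}_is_month_end'] = dates.dt.is_month_end
    # Seconds since 1970-01-01, so the model can also see a continuous time trend
    df[f'{col}_elapsed'] = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    df.drop(columns=col, inplace=True)

dates_demo = pd.DataFrame({'saledate': pd.to_datetime(['2006-11-16', '2004-03-31'])})
add_datepart_sketch(dates_demo, 'saledate')
print(dates_demo.T)
```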

In [None]:
df_raw.head()

### 2. Convert strings to numbers for pandas

In [None]:
df_raw.columns

In [None]:
df_raw.info()

The categorical variables are currently stored as strings, which is inefficient, and doesn't provide the numeric coding required for a random forest. Therefore we call **train_cats** to convert strings to pandas categories.

In [None]:
train_cats(df_raw) #converts most of these objects into categories
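
The conversion itself can be sketched in a few lines of plain pandas. This is only an approximation under stated assumptions: `train_cats_sketch` is a hypothetical name, and the real `train_cats` may treat edge cases differently.

```python
import pandas as pd

def train_cats_sketch(df):
    """Hypothetical stand-in: turn every string/object column into an ordered category."""
    for name, col in df.items():
        if pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col):
            df[name] = col.astype('category').cat.as_ordered()

cats_demo = pd.DataFrame({
    'UsageBand': ['Low', 'High', 'Medium'],
    'MachineHours': [68.0, 4640.0, 2838.0],   # numeric columns are left alone
})
train_cats_sketch(cats_demo)
print(cats_demo.dtypes)
print(cats_demo.UsageBand.cat.categories)     # sorted alphabetically by default
```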

In [None]:
df_raw.info() # note how most object columns are now of type category

**Note**: `category` is a pandas dtype.

In [None]:
df_raw.UsageBand

In [None]:
df_raw.UsageBand.cat.categories #gives you the categories for the usage band feature

To make things easier for the random forest, we reorder the categories of the UsageBand feature so that their order is meaningful to split on.

In [None]:
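# set_categories no longer accepts inplace=True in recent pandas, so assign the result back
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
# order the categories so that the splits get the maximum benefit from them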

In [None]:
df_raw.UsageBand.cat.categories

Normally, pandas will continue displaying the text categories, while treating them as numerical data internally. Optionally, we can replace the text categories with their integer codes, which makes the variable non-categorical:

In [None]:
df_raw.UsageBand.cat.codes
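
The link between category order and codes is easiest to see on a tiny example. The series below is made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

# A made-up column with one missing value
band_demo = pd.Series(['Low', 'High', None, 'Medium'], dtype='category')

# Reorder the categories; codes are simply positions in this list
band_demo = band_demo.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

codes_demo = band_demo.cat.codes
print(list(codes_demo))   # High -> 0, Medium -> 1, Low -> 2, missing -> -1
```

Missing values get the code -1, which is worth keeping in mind when the codes are later fed to a model.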

In [None]:
df_raw.head() # UsageBand still displays High/Medium/Low, but behind the scenes the values are stored as integer codes

## Pre-processing

In [None]:
df_raw.to_feather('bulldozers-raw')

In [None]:
df_raw = pd.read_feather('bulldozers-raw')

What `proc_df` does:

`proc_df` takes a data frame `df`, splits off the response variable, and changes `df` into an entirely numeric dataframe. For each column of `df` that is in neither `skip_flds` nor `ignore_flds`, NA values are replaced by the median value of the column.

 1. fix_missing - Fills missing data in a column of df with the median, and adds a {name}_na column that specifies whether the data was missing.
 2. scale_vars (if needed)
 3. numericalize - Changes the column col from a categorical type to its integer codes.

In [None]:
df, y, nas = proc_df(df_raw, 'SalePrice')
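
Steps 1 and 3 can be approximated with plain pandas as follows. The names `fix_missing_sketch` and `numericalize_sketch` are hypothetical, and shifting the codes by one (so that missing categories become 0 instead of -1) is an assumption about how the library behaves rather than something confirmed here.

```python
import numpy as np
import pandas as pd

def fix_missing_sketch(df, name):
    """Fill NaNs in a numeric column with its median and flag which rows were missing."""
    col = df[name]
    if pd.api.types.is_numeric_dtype(col) and col.isna().any():
        df[name + '_na'] = col.isna()
        df[name] = col.fillna(col.median())

def numericalize_sketch(df, name):
    """Replace a categorical column by integer codes (assumed +1 shift: missing -> 0)."""
    if isinstance(df[name].dtype, pd.CategoricalDtype):
        df[name] = df[name].cat.codes + 1

proc_demo = pd.DataFrame({
    'hours': [10.0, np.nan, 30.0],
    'band': pd.Categorical(['Low', None, 'High']),
})
fix_missing_sketch(proc_demo, 'hours')
numericalize_sketch(proc_demo, 'band')
print(proc_demo)
```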

Creating a validation set

In [None]:
def split_vals(a,n): 
    return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
# raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

In [None]:
import math
#let's track the metrics we're interested in 
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

### Modelling

In [None]:
m = RandomForestRegressor() 
%time m.fit(X_train, y_train)
print_score(m) # training RMSE, validation RMSE, training R² and validation R², in that order

Two minutes is too long; anything more than about 10 seconds slows down the iteration process. So we use `proc_df` to reduce the size of the training set: its `subset` parameter handles this.

In [None]:
len(df_raw)

In [None]:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=50_000) 
X_train, _ = split_vals(df_trn, 40_000) 
y_train, _ = split_vals(y_trn, 40_000) 

In [None]:
m = RandomForestRegressor()
%time m.fit(X_train, y_train)
print_score(m)

Use a subset that takes around 8-10 seconds to fit.

Note that as you increase the size of the subset, the accuracy of the model increases. This means you can tune the hyperparameters on the subset and then go back to the full dataset once you have found the best ones.

Each tree is stored in `estimators_`, so we can run the validation set through each tree individually.
Every row then has one prediction per tree: 12,000 predictions per tree, and there are 10 trees (the default in older scikit-learn versions; it is 100 from version 0.22 onwards).

In [None]:
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
preds[:,0], np.mean(preds[:,0]), y_valid[0]

In [None]:
preds.shape
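
A small self-contained check of the same idea on synthetic data: stacking the per-tree predictions and averaging them reproduces what the forest itself predicts. The data from `make_regression` and all `_demo` names are assumptions for illustration, not the bulldozer data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem: 200 rows to train on, 100 held out
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_trn_demo, X_val_demo = X_demo[:200], X_demo[200:]
y_trn_demo = y_demo[:200]

m_demo = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_trn_demo, y_trn_demo)

# One row of predictions per tree: shape (n_trees, n_validation_rows)
preds_demo = np.stack([t.predict(X_val_demo) for t in m_demo.estimators_])
forest_preds_demo = m_demo.predict(X_val_demo)

print(preds_demo.shape)
print(np.allclose(preds_demo.mean(axis=0), forest_preds_demo))
```

The spread of `preds_demo[:, i]` across trees also gives a rough sense of how uncertain the forest is about row `i`.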

In [None]:
m = RandomForestRegressor(n_estimators=20, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m) #prev was 81

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=50, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=75, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=100, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=125, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=160, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=170, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

The score decreases at 170, so let's revert to 160.
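
The repeated cells above could also be condensed into a loop. Here is a sketch on synthetic data; the tree counts, the `make_regression` data, and the `_sweep` names are illustrative assumptions. Fixing `random_state` makes the comparison between tree counts less noisy.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_sweep, y_sweep = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=1)
X_trn_sweep, X_val_sweep = X_sweep[:300], X_sweep[300:]
y_trn_sweep, y_val_sweep = y_sweep[:300], y_sweep[300:]

scores_sweep = {}
for n_trees in (10, 20, 40):
    m_sweep = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1, random_state=0)
    m_sweep.fit(X_trn_sweep, y_trn_sweep)
    # score() returns R² on the held-out rows
    scores_sweep[n_trees] = m_sweep.score(X_val_sweep, y_val_sweep)

best_n_sweep = max(scores_sweep, key=scores_sweep.get)
print(scores_sweep, best_n_sweep)
```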

### Out-of-bag (OOB) score

In [None]:
m = RandomForestRegressor(n_estimators=160, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m) # the last value is the OOB score (R² on the out-of-bag rows)

This shows that our validation set time difference is making an impact, as is model over-fitting.
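
For context, each tree in a bagged forest is trained on a bootstrap sample, so the rows a tree never saw can act as a free validation set for that tree; `oob_score_` is the R² of those out-of-bag predictions. A minimal sketch on synthetic data, where all `_oob` names and the data are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_oob, y_oob = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_trn_oob, X_val_oob = X_oob[:200], X_oob[200:]
y_trn_oob, y_val_oob = y_oob[:200], y_oob[200:]

m_oob = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
m_oob.fit(X_trn_oob, y_trn_oob)

train_r2 = m_oob.score(X_trn_oob, y_trn_oob)   # optimistic: trees have seen these rows
oob_r2 = m_oob.oob_score_                      # each row scored only by trees that did not see it
valid_r2 = m_oob.score(X_val_oob, y_val_oob)   # separate hold-out rows
print(train_r2, oob_r2, valid_r2)
```

Here the hold-out rows are a random slice, so the OOB and validation scores should be similar; on the bulldozer data the validation rows come later in time, which is one reason the two can differ.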

## Reducing over-fitting

It turns out that one of the easiest ways to avoid over-fitting is also one of the best ways to speed up analysis: subsampling. Let's return to using our full dataset, so that we can demonstrate the impact of this technique.

In [None]:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

The basic idea is this: rather than limit the total amount of data that our model can access, let's instead limit it to a different random subset per tree. That way, given enough trees, the model can still see all the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before.

This requires the `set_rf_samples` method, which modifies scikit-learn's internals.
To see its implementation, check
https://github.com/VishakBharadwaj94/bluebook_for_bulldozers/blob/master/bluebook_for_bulldozers.ipynb
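
As an alternative that avoids patching, scikit-learn 0.22 and later expose a `max_samples` parameter on the forest itself, which draws a bootstrap sample of that size for each tree. This is a different mechanism from `set_rf_samples`, shown here only as a sketch on synthetic data; the sample size of 200 and the `_sub` names are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_sub, y_sub = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=2)
X_trn_sub, X_val_sub = X_sub[:800], X_sub[800:]
y_trn_sub, y_val_sub = y_sub[:800], y_sub[800:]

# Each of the 40 trees is fit on its own random draw of 200 rows,
# while the forest as a whole still has access to all 800 training rows
m_sub = RandomForestRegressor(n_estimators=40, max_samples=200, n_jobs=-1, random_state=0)
m_sub.fit(X_trn_sub, y_trn_sub)

valid_r2_sub = m_sub.score(X_val_sub, y_val_sub)
print(valid_r2_sub)
```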

In [None]:
m = RandomForestRegressor(n_estimators=150, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

With the model's validation score (R²) now close to 0.9, we can dig deeper for further insights.

For model interpretation, have a look at:

https://github.com/VishakBharadwaj94/bluebook_for_bulldozers/blob/master/rf_interp.ipynb