# Simple Random Forests implementation

This is a challenge from Kaggle that I wanted to experiment with. The goal is to apply Random Forests to a fairly structured dataset.

The description of the challenge is available at: https://www.kaggle.com/c/bluebook-for-bulldozers

I'm going to start by importing the data and doing some basic EDA (Exploratory Data Analysis). I won't do it too much, just enough to see what kind of data I'm dealing with.

There is a folder called `data` which contains the `Train.csv` dataset.

Let's start by importing all the libraries we need in Python, and set up some configuration.

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

import pandas as pd
import numpy as np


## EDA and data cleaning

With the pandas library imported, we can now read the CSV data to see what we're dealing with.

In [2]:
data_raw = pd.read_csv('./data/Train.csv', low_memory=False)
data_raw.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,11/16/2006 0:00,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,3/26/2004 0:00,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2/26/2004 0:00,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,5/19/2011 0:00,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,7/23/2009 0:00,...,,,,,,,,,,


We have 53 features and I can't really see all of them. Let me transpose the data to see better.

In [3]:
data_raw.head().transpose()

Unnamed: 0,0,1,2,3,4
SalesID,1139246,1139248,1139249,1139251,1139253
SalePrice,66000,57000,10000,38500,11000
MachineID,999089,117657,434808,1026470,1057373
ModelID,3157,77,7009,332,17311
datasource,121,121,121,121,121
auctioneerID,3,3,3,3,3
YearMade,2004,1996,2001,2001,2007
MachineHoursCurrentMeter,68,4640,2838,3486,722
UsageBand,Low,Low,High,High,Medium
saledate,11/16/2006 0:00,3/26/2004 0:00,2/26/2004 0:00,5/19/2011 0:00,7/23/2009 0:00


There is a mixture of categorical and numerical data. There is also a date column that I might want to convert to a numerical format. There are also a large number of NaN values, which we need to fill.

Let's address a few things first. The Kaggle challenge specified that the score of interest is the Root Mean Square Error (RMSE). So let's apply a transformation to the `SalePrice` column.

In [4]:
data_raw['SalePrice'] = np.log(data_raw['SalePrice'])

In [5]:
data_raw['SalePrice'].head()

0    11.097410
1    10.950807
2     9.210340
3    10.558414
4     9.305651
Name: SalePrice, dtype: float64

Next, we need to deal with the date column, saledate. This is what it looks like:

In [6]:
data_raw['saledate'].head()

0    11/16/2006 0:00
1     3/26/2004 0:00
2     2/26/2004 0:00
3     5/19/2011 0:00
4     7/23/2009 0:00
Name: saledate, dtype: object

It's probably a lot more useful to transform this to some features. We have features like year, month, day of month, day of week, is_weekend, etc. To make things simple, I'll just pull out the day of month, month, and year:

In [7]:
saledates = data_raw['saledate'].astype('datetime64')

In [8]:
data_raw['salesYear'] = saledates.dt.year
data_raw['salesMonth'] = saledates.dt.month
data_raw['salesDay'] = saledates.dt.day
data_raw = data_raw.drop('saledate', axis=1)

Now, there are other string data, or "categorical" data. Pandas actually has a Categorical type but it's not converted to Categorical by default (probably for performance). 

We don't actually need to convert these columns to Categorical. Instead we will use a function called `get_dummies()` to perform one-hot encoding on these columns.

In [9]:
data = pd.get_dummies(data_raw)

In [10]:
len(data.columns)

7725

Okay, that created 7725 columns. Now all the data is numerical, but there is still one thing left to do.

Remember we had a large number of `NaN` values. There are different approaches to deal with this, but one common way is to fill it with the median of the column. 

So let's go through all 7725 columns and fill in `NaN` values. Pandas conveniently has a function called `fillna` to do precisely this.

In [11]:
for col in data.columns:
    median = data[col].median()
    data[col].fillna(median, inplace=True)

Let's actually backup the data so we don't have to go through this again when we restart the notebook.

In [12]:
data.to_feather('bulldozer_processed')

# Learning

We are now ready to do the training step. Let's reload the data to see if it works.

In [18]:
data = pd.read_feather('bulldozer_processed')
data.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,salesYear,salesMonth,...,Travel_Controls_Pedal,Differential_Type_Limited Slip,Differential_Type_Locking,Differential_Type_No Spin,Differential_Type_Standard,Steering_Controls_Command Control,Steering_Controls_Conventional,Steering_Controls_Four Wheel Standard,Steering_Controls_No,Steering_Controls_Wheel
0,1139246,11.09741,999089,3157,121,3.0,2004,68.0,2006,11,...,0,0,0,0,1,0,1,0,0,0
1,1139248,10.950807,117657,77,121,3.0,1996,4640.0,2004,3,...,0,0,0,0,1,0,1,0,0,0
2,1139249,9.21034,434808,7009,121,3.0,2001,2838.0,2004,2,...,0,0,0,0,0,0,0,0,0,0
3,1139251,10.558414,1026470,332,121,3.0,2001,3486.0,2011,5,...,0,0,0,0,0,0,0,0,0,0
4,1139253,9.305651,1057373,17311,121,3.0,2007,722.0,2009,7,...,0,0,0,0,0,0,0,0,0,0


Great. We should split our data into training and validation sets. We'll use a 0.25 split.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),
    data['SalePrice'],
    test_size=0.25,
    random_state=42,
)

We will be using an algorithm called Random Forests, which is an ensemble supervised learning algorithm based on an ensemble of decision trees.

As this is a prediction task (regression) we will use the `RandomForestRegressor`. For a classification task we may use `RandomForestClassifier` but this is not a classification task.

Training will take a while...

In [5]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1)  #we'll leave everything else as default

In [6]:
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

# Evaluation

Now that it's been trained, we can see how it does on the training and testing sets.

In [7]:
model.score(X_train, y_train)

0.981710492948687

In [8]:
model.score(X_test, y_test)

0.8970776394527499

That's actually pretty good. However, there is a bit of difference between the test and training scores (although they're both pretty high). This suggests a bit of overfitting. We may want to address that later, but I'm going to leave this for now.

Let's save the model so we don't have to retrain again.

In [17]:
# save model
import pickle

pkl_filename = "rf_model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(model, file)

# Solution

Finally, I'm going to import the validation and test sets given by Kaggle. My prediction on `Valid.csv` will determine my rank on the public leaderboard, while `Test.csv` determines my rank on the private leaderboard. Let me import both of them, and do the same processing. 

We need to ensure that the one-hot encodings are the same.

In [12]:
data_raw = pd.read_csv('./data/Valid.csv')

In [13]:
saledates = data_raw['saledate'].astype('datetime64')
data_raw['salesYear'] = saledates.dt.year
data_raw['salesMonth'] = saledates.dt.month
data_raw['salesDay'] = saledates.dt.day
data_raw = data_raw.drop('saledate', axis=1)

In [34]:
public_data_processed = pd.get_dummies(data_raw)

In [35]:
# get columns from training set
train_cols = X_test.columns
test_cols = public_data_processed.columns

missing_cols = (set(train_cols)).symmetric_difference(set(test_cols))
for c in missing_cols:
     public_data_processed[c] = 0
        
new_cols = set(test_cols) - set(train_cols)
for c in new_cols:
    public_data_processed.drop(c, axis=1, inplace=True)
        
for col in public_data_processed.columns:
    median = public_data_processed[col].median()
    public_data_processed[col].fillna(median, inplace=True)

In [36]:
y_public = model.predict(public_data_processed)

In [60]:
results_public = public_data_processed[['SalesID']]
results_public = results_public.assign(SalePrice=lambda x: np.exp(y_public))

In [71]:
results_public.to_csv('public_leaderboard.csv', index=False)

Now for `Test.csv`:

In [62]:
data_raw = pd.read_csv('./data/Test.csv')

In [63]:
saledates = data_raw['saledate'].astype('datetime64')
data_raw['salesYear'] = saledates.dt.year
data_raw['salesMonth'] = saledates.dt.month
data_raw['salesDay'] = saledates.dt.day
data_raw = data_raw.drop('saledate', axis=1)

In [64]:
private_data_processed = pd.get_dummies(data_raw)

In [65]:
# get columns from training set
train_cols = X_test.columns
test_cols = private_data_processed.columns

missing_cols = (set(train_cols)).symmetric_difference(set(test_cols))
for c in missing_cols:
     private_data_processed[c] = 0
        
new_cols = set(test_cols) - set(train_cols)
for c in new_cols:
    private_data_processed.drop(c, axis=1, inplace=True)
        
for col in private_data_processed.columns:
    median = private_data_processed[col].median()
    private_data_processed[col].fillna(median, inplace=True)

In [66]:
y_private = model.predict(private_data_processed)

In [67]:
results_private = private_data_processed[['SalesID']]
results_private = results_private.assign(SalePrice=lambda x: np.exp(y_private))

In [70]:
results_private.to_csv('private_leaderboard.csv', index=False)

In [69]:
results_private

Unnamed: 0,SalesID,SalePrice
0,1227829,18395.146713
1,1227844,15191.654578
2,1227847,31871.396669
3,1227848,30287.580884
4,1227863,34139.328934
5,1227870,56269.391170
6,1227871,38709.354077
7,1227879,12981.014777
8,1227880,16303.186961
9,1227881,30573.677321
