## The Data

From:
https://www.kaggle.com/c/bluebook-for-bulldozers

'The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.'

We use `TrainAndValid.csv`, available from the competition's data page linked below. For this notebook, we put it in a folder named `data`:

https://www.kaggle.com/c/bluebook-for-bulldozers/data

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

from metrics_auto_visualizer import plot_metrics

## Regression: Let's Load the Data and Generate some Features

We'll do some basic feature engineering with the sale date, splitting it into month, day of week, day of year, and year, and deriving the machine's age at sale. We're not trying to build the greatest model here, just an illustrative example, so we keep a small subset of columns and drop the ones that would need more involved cleaning. Next, we split `fiProductClassDesc` into two features. Lastly, we label encode the categorical variables.

In [3]:
# Data Loading
data = pd.read_csv('./data/TrainAndValid.csv')
data.head()


Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False.



Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000.0,999089,3157,121,3.0,2004,68.0,Low,11/16/2006 0:00,...,,,,,,,,,Standard,Conventional
1,1139248,57000.0,117657,77,121,3.0,1996,4640.0,Low,3/26/2004 0:00,...,,,,,,,,,Standard,Conventional
2,1139249,10000.0,434808,7009,121,3.0,2001,2838.0,High,2/26/2004 0:00,...,,,,,,,,,,
3,1139251,38500.0,1026470,332,121,3.0,2001,3486.0,High,5/19/2011 0:00,...,,,,,,,,,,
4,1139253,11000.0,1057373,17311,121,3.0,2007,722.0,Medium,7/23/2009 0:00,...,,,,,,,,,,
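The mixed-types warning above comes from pandas inferring dtypes chunk by chunk. A minimal sketch of the two remedies the warning itself suggests, run on a few made-up rows held in memory rather than the real file (only the column names are taken from the dataset):

```python
import io
import pandas as pd

# Made-up stand-in for TrainAndValid.csv; only the column names are real.
csv_text = io.StringIO(
    "SalesID,SalePrice,UsageBand,saledate\n"
    "1,66000.0,Low,11/16/2006 0:00\n"
    "2,57000.0,,3/26/2004 0:00\n"
)

# low_memory=False infers each column's dtype from the whole file at once;
# an explicit dtype pins a column that would otherwise be ambiguous.
sample = pd.read_csv(csv_text, low_memory=False, dtype={"UsageBand": "object"})
print(sample.dtypes)
```

Either option alone should silence the warning; passing `dtype` for the affected columns is the more explicit of the two.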


In [4]:
# Date Features
tmp_date = data.saledate.apply(lambda x : datetime.strptime(x[:-5], '%m/%d/%Y'))
data['sale_mon'] = tmp_date.dt.month
data['sale_dayofweek'] = tmp_date.dt.dayofweek
data['sale_dayofyear'] = tmp_date.dt.dayofyear
data['sale_year'] = tmp_date.dt.year
data.drop(['saledate'],axis=1,inplace=True)

# Taking Subset of Columns
kept_columns = [
                'YearMade', 
                'sale_mon', 
                'sale_dayofweek',
                'sale_dayofyear',
                'sale_year',
                'fiModelDesc',
                'fiBaseModel',
                'fiProductClassDesc',
                'state',
                'SalePrice'
               ]
data = data[kept_columns].copy()  # copy so later column assignments don't write to a slice
data['age'] = data.sale_year - data.YearMade
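The same date features can also be derived without a per-row `strptime` call. A minimal sketch, assuming every `saledate` value follows the `month/day/year hour:minute` layout seen in the preview; the two-row frame is made up for illustration:

```python
import pandas as pd

# Made-up rows in the same layout as the saledate column.
demo = pd.DataFrame({
    "saledate": ["11/16/2006 0:00", "3/26/2004 0:00"],
    "YearMade": [2004, 1996],
})

# Parse the whole column at once with an explicit format.
parsed = pd.to_datetime(demo["saledate"], format="%m/%d/%Y %H:%M")
demo["sale_mon"] = parsed.dt.month
demo["sale_dayofweek"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
demo["sale_dayofyear"] = parsed.dt.dayofyear
demo["sale_year"] = parsed.dt.year
demo["age"] = demo["sale_year"] - demo["YearMade"]
print(demo.drop(columns="saledate"))
```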

In [5]:
data.fiProductClassDesc.head(5)

0             Wheel Loader - 110.0 to 120.0 Horsepower
1             Wheel Loader - 150.0 to 175.0 Horsepower
2    Skid Steer Loader - 1351.0 to 1601.0 Lb Operat...
3    Hydraulic Excavator, Track - 12.0 to 14.0 Metr...
4    Skid Steer Loader - 1601.0 to 1751.0 Lb Operat...
Name: fiProductClassDesc, dtype: object
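The descriptions above follow a `<class> - <size range>` pattern. A minimal sketch of splitting them into two columns with pandas string methods, using the two untruncated values from the output; it assumes the hyphen only ever appears as the separator:

```python
import pandas as pd

desc = pd.Series([
    "Wheel Loader - 110.0 to 120.0 Horsepower",
    "Wheel Loader - 150.0 to 175.0 Horsepower",
])

# Drop commas, split once on the hyphen, then trim the surrounding spaces.
cleaned = desc.str.replace(",", "", regex=False)
parts = cleaned.str.split("-", n=1, expand=True)
parts.columns = ["classDesc_1", "classDesc_2"]
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```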

In [6]:
# Encoding Class Description: split "<class> - <size range>" into two features
data['classDesc_1'] = data.fiProductClassDesc.apply(lambda x : x.replace(',','').split('-')[0].strip())
data['classDesc_2'] = data.fiProductClassDesc.apply(lambda x : x.replace(',','').split('-')[1].strip())
data.drop('fiProductClassDesc',axis=1,inplace=True)

# Label encode each categorical column as integers
for col in ['fiModelDesc','fiBaseModel','state','classDesc_1','classDesc_2']:
    lb = LabelEncoder()
    data[col] = lb.fit_transform(data[col])

# Repair miscoded years: implausibly early values get the median, future values are capped at 2012
data.loc[data.YearMade < 1920, 'YearMade'] = np.median(data.YearMade)
data.loc[data.YearMade > 2012, 'YearMade'] = 2012
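To make the label encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a single column. The state names are made-up sample values, not rows from the dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

states = pd.Series(["Texas", "Alabama", "Texas", "Florida"])  # made-up values

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

print(list(encoder.classes_))                  # sorted unique labels
print(codes.tolist())                          # position of each value in classes_
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

The integer codes follow alphabetical order and carry no real ordering. Tree-based models such as a random forest usually tolerate this, which is presumably why it is good enough for an illustrative example.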

## Building a Random Forest Regressor and Saving the Outputs

In [7]:
# split into train and valid
train, valid = train_test_split(data, test_size = .2)

# reset the indices for fewer headaches
train.reset_index(drop=True, inplace=True)
valid.reset_index(drop=True, inplace=True)

# separate the response
y_train = train.SalePrice
y_valid = valid.SalePrice
train.drop('SalePrice',axis=1,inplace=True)
valid.drop('SalePrice',axis=1,inplace=True)

# train the model
rf = RandomForestRegressor(min_samples_split=15, 
                           n_estimators = 60, 
                           n_jobs=-1)
rf.fit(train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=15,
           min_weight_fraction_leaf=0.0, n_estimators=60, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
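As a quick sanity check before opening the dashboard, validation error can also be computed directly. A minimal sketch on synthetic data (not the bulldozer data), assuming RMSE and R² are the metrics of interest; what `plot_metrics` computes internally is not shown here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: the target depends on the first feature plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(min_samples_split=15, n_estimators=60, random_state=0)
model.fit(X_train, y_train)

# Score on the held-out split only.
pred = model.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
r2 = r2_score(y_valid, pred)
print(f"validation RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```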

## Plot those metrics!

In [11]:
plot_metrics(rf, train, valid, y_train, y_valid, port=9999)

 * Running on http://127.0.0.1:9999/ (Press CTRL+C to quit)


Metrics Loaded!
Plot Preprocessing Complete!
Reticulating Splines!



## Now do Classification!

For this data set, we made up a binary target called `is_old`, which is 1 if the machine was built before 1995. Because `YearMade` is still among the features, this leads to a perfect classifier, so we randomize some of the labels to add noise.
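The noise step described above can be sketched on a small made-up frame. This assumes "randomizing" means overwriting a random subset of labels with fair coin flips; the 20% share and the `YearMade` values are illustrative only:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Made-up build years; is_old is fully determined by YearMade at this point.
demo = pd.DataFrame({"YearMade": rng.randint(1970, 2012, size=1000)})
demo["is_old"] = (demo["YearMade"] < 1995).astype(np.int8)
clean = demo["is_old"].copy()

# Pick 20% of the rows without replacement and overwrite their labels.
noisy_idx = rng.choice(demo.index, size=200, replace=False)
# randint's upper bound is exclusive, so (0, 2) draws from {0, 1}.
demo.loc[noisy_idx, "is_old"] = rng.randint(0, 2, size=len(noisy_idx))

changed = (demo["is_old"] != clean).mean()
print(f"share of labels changed: {changed:.2%}")
```

Only about half of the overwritten labels actually change, since a coin flip matches the original label half the time. Note that `np.random.randint(0, 1)` can only ever return 0, so the upper bound has to be 2 to get both classes.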

In [9]:
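# new feature
data['is_old'] = np.int8(data.YearMade < 1995)
# overwrite 90,000 randomly chosen labels with coin flips to add noise
r_index = np.random.choice(data.is_old.index, size=90_000, replace=False)
# randint's upper bound is exclusive, so (0, 2) draws from {0, 1}
data.loc[r_index,'is_old'] = np.random.randint(0, 2, size=len(r_index))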

# train test split
train_c, valid_c = train_test_split(data, test_size = .2)
y_c_train = train_c.is_old
y_c_valid = valid_c.is_old
train_c.drop('is_old',axis=1,inplace=True)
valid_c.drop('is_old',axis=1,inplace=True)

# fit the model
rf_c = RandomForestClassifier(min_samples_split=15, 
                            n_estimators = 60, 
                            n_jobs=-1)
rf_c.fit(train_c, y_c_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=15,
            min_weight_fraction_leaf=0.0, n_estimators=60, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## Plot those Metrics!

In [10]:
plot_metrics(rf_c, train_c, valid_c, y_c_train, y_c_valid, port=9999)


F-score is ill-defined and being set to 0.0 due to no predicted samples.


'argmax' is deprecated. Use 'idxmax' instead. The behavior of 'argmax' will be corrected to return the positional maximum in the future. Use 'series.values.argmax' to get the position of the maximum now.

 * Running on http://127.0.0.1:9999/ (Press CTRL+C to quit)


Hard Predictions Loaded!

