# Predicting the Sale Price of Bulldozers

## 1. Problem definition
    
    How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have sold for?
    
    
## 2. Data

    The data is taken from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data
    
    There are three main datasets:
    
    * Train.csv is the training set, which contains data through the end of 2011.
    * Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012.
    * Test.csv is the test set, which contains data from May 2012 to November 2012.
    
## 3. Evaluation

    The evaluation metric for this competition is the RMSLE (root mean squared log error) between the true labels and the predicted labels.
    
## 4. Features

    Kaggle provides a data dictionary detailing all the features of the dataset. It can be viewed here: https://docs.google.com/spreadsheets/d/1zRkHaM6oMOd-Fdo7hqhNkLEt4JJXJP1c_tugelQ1nXY/edit#gid=1461612573

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import tools for EDA
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns

In [None]:
# Import the training and validation datasets
df = pd.read_csv('../input/bluebook-for-bulldozers/TrainAndValid.csv',
                 low_memory=False)
df.head().T

In [None]:
# Check info on the data
df.info()

In [None]:
# Check for null values
df.isna().sum()

In [None]:
# Plot the SalePrice frequency distribution
# (histplot replaces the deprecated distplot in seaborn >= 0.11)
sns.histplot(df.SalePrice, bins=100, kde=True);

# Parsing dates

When we work with time series data, we want to enrich the time and date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the `parse_dates` parameter of `read_csv`.
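
As a small illustration of what `parse_dates` changes, the sketch below reads a hypothetical two-row CSV (built in memory, not taken from the competition files) with and without the parameter. The column name `saledate` matches the dataset; the values are made up.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real data file
csv_text = "SalesID,saledate\n1,2006-11-16\n2,2004-03-26\n"

# Without parse_dates the column is read as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates pandas converts the column to a datetime dtype
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(raw.saledate.dtype)
print(parsed.saledate.dtype)

# Only a datetime column exposes the .dt accessor used later for feature engineering
print(parsed.saledate.dt.year.tolist())
```

Once the column has a datetime dtype, attributes such as `.dt.year` and `.dt.dayofweek` become available.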

In [None]:
# Import the data again, but this time parse dates
df = pd.read_csv('../input/bluebook-for-bulldozers/TrainAndValid.csv',
                 low_memory=False,
                 parse_dates=['saledate'])

In [None]:
df.saledate.dtype

In [None]:
df.saledate[:1000]

In [None]:
# Sort the DataFrame by date
df.sort_values(by=['saledate'],
               inplace=True,
               ascending=True)
df.saledate.head(20)

# Make a copy of the original DataFrame

We make a copy of the original DataFrame so when we manipulate the copy we still have our original data.
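
To see why a copy is needed rather than a second variable name, here is a minimal sketch on a made-up two-row DataFrame (the names `original`, `alias` and `snapshot` are hypothetical). Plain assignment only creates another reference to the same object, while `.copy()` creates independent data.

```python
import pandas as pd

# Made-up data for illustration only
original = pd.DataFrame({"SalePrice": [9500.0, 14000.0]})

alias = original            # another name for the same object
snapshot = original.copy()  # independent copy (deep by default)

# Editing the copy leaves the original untouched
snapshot["SalePrice"] = snapshot["SalePrice"] * 2

print(alias is original)
print(original.SalePrice.tolist())
print(snapshot.SalePrice.tolist())
```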

In [None]:
# Make a copy of df to make edits on
df_tmp = df.copy()

# Preprocessing the data

In [None]:
# Create a function to preprocess the data into a format we can train on, evaluate and make predictions

def preprocess_data(df):
    '''
    Transforms df in place and returns it: adds date features, encodes
    strings as category codes, and fills missing values.
    '''
    # Add datetime parameters to df and drop saledate column
    df['saleyear'] = df.saledate.dt.year
    df['salemonth'] = df.saledate.dt.month
    df['saleday'] = df.saledate.dt.day
    df['saledayofweek'] = df.saledate.dt.dayofweek
    df['saledayofyear'] = df.saledate.dt.dayofyear
    
    df.drop('saledate',
            axis=1,
            inplace=True)
    # Turn all string values into categorical values
    for label, content in df.items():
        if pd.api.types.is_string_dtype(content):
            df[label] = content.astype('category').cat.as_ordered()
    
    # Fill missing numeric values
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which declares the data missing or not
                df[label+'_is_missing'] = pd.isnull(content)
                # Fill missing numeric values with median
                df[label] = content.fillna(content.median())
                
        # Fill missing categorical data and convert categories to numbers
        if not pd.api.types.is_numeric_dtype(content):
            # Add binary column to indicate whether sample had missing values
            df[label+'_is_missing'] = pd.isnull(content)
            # Turn categories into numbers
            df[label] = pd.Categorical(content).codes + 1
            
    return df
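
The `+ 1` in the category encoding is easy to miss. The sketch below, on a made-up series named `sizes`, shows why it is there: pandas encodes a missing category as `-1`, so shifting every code by one turns missing values into `0` and keeps all codes non-negative.

```python
import pandas as pd

# Made-up categorical values with one missing entry
sizes = pd.Series(["Small", None, "Large", "Small"])

# Categories are sorted alphabetically: Large -> 0, Small -> 1, missing -> -1
codes = pd.Categorical(sizes).codes
print(codes.tolist())

# After the shift, 0 means "missing" and real categories start at 1
shifted = codes + 1
print(shifted.tolist())
```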

In [None]:
# Preprocess the copy so the original df stays untouched
df_tmp = preprocess_data(df_tmp)
df_tmp.head().T

# Splitting the Data

Now that we've preprocessed the data, we can split it into training and validation sets. Because the validation set covers 2012, we split on the sale year rather than at random.
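
A minimal sketch of a year-based split on made-up rows (the frame `toy` and its values are hypothetical). The point is that every training row predates every validation row, which mirrors how the model will be used on future sales.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"saleyear": [2010, 2011, 2012, 2011, 2012],
                    "SalePrice": [10000, 12000, 15000, 9000, 20000]})

# Boolean mask marking the validation year
is_val = toy.saleyear == 2012
toy_train, toy_val = toy[~is_val], toy[is_val]

print(len(toy_train), len(toy_val))
print(toy_train.saleyear.max(), toy_val.saleyear.min())
```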


In [None]:
# Split the data into train and validation sets
df_train = df_tmp[df_tmp.saleyear != 2012]
df_val = df_tmp[df_tmp.saleyear == 2012]

len(df_train), len(df_val)

In [None]:
# Split the data into X & y
X_train, y_train = df_train.drop('SalePrice', axis=1), df_train.SalePrice
X_val, y_val = df_val.drop('SalePrice', axis=1), df_val.SalePrice

X_train.shape, y_train.shape, X_val.shape, y_val.shape

# Evaluation

The evaluation metric for this Kaggle competition is the root mean squared log error (RMSLE).
Older versions of scikit-learn have no dedicated RMSLE function, so we will build one on top of `mean_squared_log_error`.

We will also create a function that returns the scores for the metrics we use.
To be thorough, we will also evaluate the model with the following metrics:

* Mean Absolute Error
* R^2
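
To make the RMSLE concrete, here is a NumPy-only sketch that mirrors what taking the square root of scikit-learn's `mean_squared_log_error` computes, namely the root mean squared difference of `log(1 + y)`. The helper name `rmsle_manual` and the prices are made up for illustration. Because the error is measured on a log scale, a 10% overestimate costs roughly the same for a cheap machine as for an expensive one.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """Root mean squared log error, using log(1 + y) for both arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up prices: the same 10% overestimate at two price levels
cheap = rmsle_manual([10_000], [11_000])
expensive = rmsle_manual([100_000], [110_000])

print(cheap, expensive)
```

Mean absolute error, by contrast, would rate the second prediction as ten times worse than the first.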

In [None]:
# Import metrics for evaluation
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

# Create an evaluation function to calculate the RMSLE
def rmsle(y_test, y_preds):
    '''
    Calculates root mean squared log error between predictions and true labels.
    '''
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create another function to show scores of given metrics
def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_val)
    scores = {'Training MAE': mean_absolute_error(y_train, train_preds),
              'Validation MAE': mean_absolute_error(y_val, val_preds),
              'Training RMSLE': rmsle(y_train, train_preds),
              'Validation RMSLE': rmsle(y_val, val_preds),
              'Training R^2': r2_score(y_train, train_preds),
              'Validation R^2': r2_score(y_val, val_preds)
             }
    
    return scores

# Testing the model on a subset (for tuning hyperparameters)


In [None]:
%%time

# Import a RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Instantiate a RandomForestRegressor with the max_samples set to 20,000
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42,
                              max_samples=20000)

model.fit(X_train, y_train)

In [None]:
show_scores(model)

# Hyperparameter tuning with RandomizedSearchCV

In [None]:
%%time

# Import the RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Create a dict grid of various hyperparams to try
rf_grid = {'n_estimators': np.arange(10, 100, 10),
           'max_depth': [None, 3, 5, 10],
           'min_samples_split': np.arange(2, 20, 2),
           'min_samples_leaf': np.arange(1, 20, 2),
           'max_features': [0.5, 1, 'sqrt', 1.0],  # 'auto' was removed in scikit-learn 1.3; 1.0 uses all features
           'max_samples': [15000]
          }

# Instantiate the RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                              param_distributions=rf_grid,
                              n_iter=5,
                              cv=5,
                              verbose=True,
                              n_jobs=-1)

# Fit the RSCV model
rs_model.fit(X_train, y_train)

In [None]:
# Check the best params
rs_model.best_params_

In [None]:
# Check the scores
show_scores(rs_model)

In [None]:
%%time

# Train a model on the whole set using the best_params
ideal_model = RandomForestRegressor(n_estimators=80,
                                    min_samples_leaf=3,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    max_depth=None,
                                    random_state=42)
ideal_model.fit(X_train, y_train)

In [None]:
# Scores for rs_model, trained on a subset of the data with 15,000 samples
show_scores(rs_model)

In [None]:
# Scores for ideal_model, trained on all the data
show_scores(ideal_model)

As the original competition is over and this is merely a training notebook, we will not tune the model any further.


Let's make some predictions!

# Make predictions on the test data

In [None]:
# Import the test data
df_test = pd.read_csv('../input/bluebook-for-bulldozers/Test.csv',
                      low_memory=False,
                      parse_dates=['saledate'])

df_test.head().T

In [None]:
df_test.shape

In [None]:
df_test.sort_values(by=['saledate'], inplace=True, ascending=True)

In [None]:
df_test.saledate.head(50)

In [None]:
# Preprocess the test data using the preprocess_data function we created earlier
preprocess_data(df_test)
df_test.head(), df_test.shape

In [None]:
# Check the difference in number of columns between the training and test data
set(X_train.columns) - set(df_test.columns)

In [None]:
# Manually add a column for 'auctioneerID_is_missing'
df_test['auctioneerID_is_missing'] = False
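
A more general way to handle this kind of mismatch, sketched here on hypothetical toy data (`train_cols` and `toy_test` are made-up names), is to reindex the test frame against the training columns. This adds any missing indicator columns and puts all columns in the training order in one step.

```python
import pandas as pd

# Made-up column list standing in for X_train.columns
train_cols = ["SalesID", "auctioneerID", "auctioneerID_is_missing"]

# Made-up test frame: one column absent, the others in a different order
toy_test = pd.DataFrame({"auctioneerID": [1.0, 2.0], "SalesID": [10, 11]})

# Columns the model saw during training but the test frame lacks
missing = sorted(set(train_cols) - set(toy_test.columns))
print(missing)

# Add the missing columns (filled with False) and match the training order
aligned = toy_test.reindex(columns=train_cols, fill_value=False)
print(aligned.columns.tolist())
```

Filling with `False` is only appropriate for `_is_missing` indicator columns; any other missing feature would need its own handling.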

In [None]:
# Make predictions on the test data
# (reorder the columns to match the order the model was trained on)
test_preds = ideal_model.predict(df_test[X_train.columns])

In [None]:
test_preds

In [None]:
# # Format and export prediction data in the format requested by Kaggle
# df_preds = pd.DataFrame()
# df_preds['SalesID'] = df_test.SalesID
# df_preds['SalePrice'] = test_preds

# df_preds.to_csv('../input/bluebook-for-bulldozers/test-predictions.csv', index=False)

# Feature Importance

In [None]:
# Find the feature importance of our best model
ideal_model.feature_importances_

The raw array is hard to read, so let's plot the most important features.


In [None]:
# Create a function to plot the feature importance

def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({'features': columns,
                        'feature_importances': importances})
           .sort_values('feature_importances', ascending=False)
           .reset_index(drop=True))
    # Plot the df
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.barh(df['features'][:n], df['feature_importances'][:n])
    ax.set_ylabel('Features')
    ax.set_xlabel('Feature Importance')
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)