
# Predicting the Sale Price of Bulldozers using Machine Learning

In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.

Since we're trying to predict a number, this kind of problem is known as a regression problem.

The data and evaluation metric we'll be using (root mean squared log error, or RMSLE) come from the Kaggle Bluebook for Bulldozers competition.

The techniques used here are inspired by and adapted from the fast.ai machine learning course.

## What we'll end up with


To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.
Tools which can be used for each step of the machine learning modelling process.

We'll work through each step and, by the end of the notebook, we'll have a trained machine learning model which predicts the sale price of a bulldozer given different characteristics about it.

## 1. Problem Definition

For this dataset, the problem we're trying to solve, or better, the question we're trying to answer is:

    How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

## 2. Data

Looking at the dataset from Kaggle, you can see it's a time series problem. This means there's a time attribute to the dataset.

In this case, it's historical sales data of bulldozers, including things like model type, size, sale date and more.

There are 3 datasets:

    Train.csv - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including SalePrice which is the target variable).
    Valid.csv - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as Train.csv).
    Test.csv - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the SalePrice attribute, as this is what we'll be trying to predict).

## 3. Evaluation

For this problem, Kaggle has set the evaluation metric to root mean squared log error (RMSLE). As with many regression metrics, the goal is to get this value as low as possible.

To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the Kaggle leaderboard.
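
To get a feel for what RMSLE rewards, here is a minimal sketch in plain Python. It assumes the usual definition based on `log(1 + x)`, which is also what scikit-learn's `mean_squared_log_error` uses; the prices are made up for illustration.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) for each value."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Both predictions are 10% too high, so their RMSLE is almost identical,
# even though the absolute errors are 1,000 and 10,000.
cheap = rmsle([10_000], [11_000])
expensive = rmsle([100_000], [110_000])
print(cheap, expensive)
```

Because the metric works on log values, it cares about relative rather than absolute errors, which suits a dataset where prices span a wide range.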

## 4. Features

Features are different parts of the data. During this step, you'll want to start finding out what you can about the data.

One of the most common ways to do this is to create a data dictionary.

For this dataset, Kaggle provides a data dictionary which contains information about what each attribute of the dataset means. You can download this file directly from the Kaggle competition page (account required) or view it on Google Sheets.

With all of this being known, let's get started!

First, we'll import the dataset and start exploring. Since we know the evaluation metric we're trying to minimise, our first goal will be building a baseline model and seeing how it stacks up against the competition.

### Importing the data and preparing it for modelling

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Load the data with the **parse_dates** parameter so that the **saledate** column is converted to the datetime64[ns] format.

In [None]:
df = pd.read_csv('../input/bluebook-for-bulldozers/TrainAndValid.csv', parse_dates = ['saledate'])
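
As a small illustration of what `parse_dates` changes, the sketch below reads a tiny in-memory CSV twice. The two rows are made up and stand in for the real file; only the column names mirror the dataset.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (values are made up for illustration)
csv_text = "SalesID,saledate\n1,11/16/2006 0:00\n2,03/26/2004 0:00\n"

without_parsing = pd.read_csv(io.StringIO(csv_text))
with_parsing = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(without_parsing["saledate"].dtype)  # dates are left as plain strings
print(with_parsing["saledate"].dtype)     # dates become a datetime dtype
print(with_parsing["saledate"].dt.year.tolist())
```

Only the parsed column supports the `.dt` accessor, which we rely on later to build date features.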

In [None]:
df.head()

In [None]:
df.head().T

In [None]:
df.info()

In [None]:
df.saledate.head(10)

### Let's sort the dataframe by the ***saledate*** column.

In [None]:
df.sort_values(['saledate'], ascending=True, inplace=True)
df['saledate'].head(10)

It looks like the data has been sorted by the ***saledate*** column, so let's continue.

Now let's make a copy of our dataset, so that we have a backup in case we mess something up.

In [None]:
data = df.copy()

In [None]:
data.head()

In [None]:
data['saleyear'] = data.saledate.dt.year
data['salemonth'] = data.saledate.dt.month
data['saleday'] = data.saledate.dt.day
data['saledayofweek'] = data.saledate.dt.dayofweek
data['saledayofyear'] = data.saledate.dt.dayofyear
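
The cell above splits `saledate` into numeric parts the model can use. Here is a small sketch of the same `.dt` accessors on two hand-picked dates, so the values are easy to verify by hand (the toy frame `sales` is only for illustration).

```python
import pandas as pd

sales = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2012-01-01"])})

sales["saleyear"] = sales["saledate"].dt.year
sales["salemonth"] = sales["saledate"].dt.month
sales["saledayofweek"] = sales["saledate"].dt.dayofweek  # Monday=0 ... Sunday=6
sales["saledayofyear"] = sales["saledate"].dt.dayofyear
print(sales)
```

16 November 2006 was a Thursday (day-of-week 3) and 1 January 2012 was a Sunday (day-of-week 6).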


In [None]:
data.head().T

Now let's drop our `saledate` column.

In [None]:
data.drop('saledate', inplace=True, axis=1)

In [None]:
data.head().T

Let's check out our `state` column.

In [None]:
print(data['state'].value_counts())
#data['state'].value_counts().plot(kind='bar')

In [None]:
data.head()

If ever in doubt about choosing the right ML algorithm, take a look at scikit-learn's [choosing the right estimator map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

Let's try with `RandomForestRegressor`.

## EDA

Let's take a look at datatypes of all the columns in our data.

In [None]:
data.info()

Let's convert all our strings into numbers.

One way to do that is to convert them into pandas categories.

You can check out the pandas type-checking utilities yourself [here](https://pandas.pydata.org/docs/reference/general_utility_functions.html).

In [None]:
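for label, content in data.items():
    # On recent pandas versions, is_string_dtype can return False for object
    # columns that contain NaN, so we also check for the object dtype.
    if pd.api.types.is_object_dtype(content) or pd.api.types.is_string_dtype(content):
        data[label] = content.astype('category').cat.as_ordered()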

In [None]:
data.info()

We've now converted all the features with strings into categories.
Although these features and their values don't look numerical on the surface, under the hood they are treated as numbers.
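
To make the "numbers under the hood" idea concrete, here is a tiny sketch on a made-up Series of state names (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

states = pd.Series(["Texas", "Alabama", "Florida", "Texas"])
states_cat = states.astype("category").cat.as_ordered()

print(states_cat.cat.categories.tolist())  # categories are sorted alphabetically
print(states_cat.cat.codes.tolist())       # integer code for each row
```

Each row stores only the integer position of its category, so `"Texas"` is stored as `2` here because it is the third category alphabetically.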

In [None]:
data['state'].cat.categories

In [None]:
data['state'].cat.codes

Let's now check what percentage of values is missing in each column.

In [None]:
missing_values = data.isnull().sum() / len(data) * 100

In [None]:
missing_values

In [None]:
#data.to_csv('new_data.csv', index=False)

In [None]:
#data = pd.read_csv('./new_data.csv')

## Taking care of missing values

#### Let's do that by separating numerical and categorical features.

Let's first take care of numerical features.

In [None]:
for label, content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
for label, content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            #Add a binary column which will tell us if the data was missing
            data[label+'_is_missing'] = pd.isnull(content)
            #Fill missing numeric values with median.
            #Reason for choosing median over mean is, median is robust to outliers.
            data[label] = content.fillna(content.median())

Let's check for missing values now.

In [None]:
for label, content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

Nice, missing values in numeric features have been taken care of.
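
To see why the median is a safer fill value than the mean, here is a minimal sketch on a toy column with one extreme value. The column name `MachineHours` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"MachineHours": [100.0, None, 300.0, 10_000.0]})

# Record where the value was missing before filling it in
toy["MachineHours_is_missing"] = toy["MachineHours"].isnull()
toy["MachineHours"] = toy["MachineHours"].fillna(toy["MachineHours"].median())
print(toy)
```

The gap is filled with 300 (the median of the observed values), whereas the mean would have been pulled above 3,000 by the single outlier. The `_is_missing` column keeps the information that the value was originally absent.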

Now let's do the same with our categorical features.

In [None]:
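# Collect categorical columns that contain missing values
missing_cat_cols = []
for label, content in data.items():
    # is_categorical_dtype is deprecated in recent pandas versions
    if isinstance(content.dtype, pd.CategoricalDtype):
        if pd.isnull(content).sum():
            print(label)
            missing_cat_cols.append(label)
len(missing_cat_cols)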

Turn categorical features into numbers and fill missing values.

In [None]:
for label, content in data.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column which will tell us if the data was missing
        data[label+'_is_missing'] = pd.isnull(content)
        # Turn categories into numbers and then add 1.
        # pandas encodes missing categories as -1, so adding 1 turns
        # missing values into 0 and keeps every code non-negative.
        data[label] = pd.Categorical(content).codes+1
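
The effect of the `+1` shift is easiest to see on a tiny example. This sketch uses a made-up Series with one missing value.

```python
import pandas as pd

sizes = pd.Series(["Small", None, "Large", "Small"])
codes = pd.Categorical(sizes).codes

print(codes.tolist())        # the missing value is encoded as -1
print((codes + 1).tolist())  # after the shift, missing becomes 0
```

After the shift, `0` always means "missing" and the real categories start at `1`.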

In [None]:
data.info()

In [None]:
data.head().T

In [None]:
data.isnull().sum()

## 5. Modelling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [None]:
data_train = data[data.saleyear!=2012]
data_val = data[data.saleyear==2012]

len(data_train), len(data_val)

In [None]:
X_train, y_train = data_train.drop('SalePrice', axis=1), data_train['SalePrice']
X_valid, y_valid = data_val.drop('SalePrice', axis=1), data_val['SalePrice']

In [None]:
X_train.shape,y_train.shape, X_valid.shape, y_valid.shape
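
Because this is a time series problem, the split above is made by year rather than at random. A minimal sketch of the same idea on a toy frame (all values made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "saleyear": [2010, 2011, 2011, 2012, 2012],
    "SalePrice": [9500, 14000, 50000, 16000, 22000],
})

train = toy[toy["saleyear"] != 2012]
valid = toy[toy["saleyear"] == 2012]

# Every validation sale happens after every training sale
print(len(train), len(valid))
print(train["saleyear"].max() < valid["saleyear"].min())
```

Holding out the most recent year mimics the real task of predicting future prices from past sales. A random split would let the model see sales from the same period it is evaluated on.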

#### Let's build our own evaluation function

In [None]:
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

def rmsle(y_test, y_pred):
    """
    Calculates root mean squared log error between predictions and true labels.
    """
    return np.sqrt(mean_squared_log_error(y_test, y_pred))

# Create function to evaluate model on a few different levels.

def show_scores(model):
    train_pred = model.predict(X_train)
    val_pred = model.predict(X_valid)
    # If the model performs much better on the training set than on the validation set, it is likely overfitting.
    scores = {'Training MAE':mean_absolute_error(y_train, train_pred),
             "Validation MAE": mean_absolute_error(y_valid, val_pred),
             "Training RMSLE": rmsle(y_train, train_pred),
             "Valid RMSLE": rmsle(y_valid, val_pred),
             "Training R^2": r2_score(y_train, train_pred),
             "Valid R^2": r2_score(y_valid, val_pred)}
    return scores

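Let's first train our model on a subset of the data by setting `max_samples`, so that each tree only sees 10,000 rows. This keeps training fast, which helps with hyperparameter tuning.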

In [None]:
%%time
model = RandomForestRegressor(n_jobs=-1, random_state=42,
                             max_samples=10000)
model.fit(X_train, y_train)
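
To illustrate what `max_samples` does, here is a small sketch on synthetic data (the data, sizes and the name `quick_model` are assumptions for illustration, not part of the bulldozer dataset).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(2_000, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=2_000)

# Each tree is trained on a bootstrap sample of only 200 of the 2,000 rows
quick_model = RandomForestRegressor(n_estimators=20, max_samples=200, random_state=42)
quick_model.fit(X, y)

print(len(quick_model.estimators_))
print(quick_model.score(X, y))
```

Each tree sees far fewer rows, so fitting is much cheaper. The forest as a whole can still draw on the full dataset, because every tree gets a different sample.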

In [None]:
show_scores(model)

#### Let's use RandomizedSearchCV for hyperparameter tuning.

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
# Note: max_features="auto" was removed in scikit-learn 1.3; for a regressor
# it meant "use all features", which 1.0 expresses instead.
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1, 1.0, "sqrt"],
           "max_samples": [10000]}

#Instantiating RandomizedSearchCV
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1, random_state=42),
                              param_distributions=rf_grid,
                              n_iter=20,
                              cv=5,
                              verbose=True)
#Fitting the RandomizedSearchCV model
rs_model.fit(X_train, y_train)

In [None]:
rs_model.best_params_

Now let's evaluate the RandomizedSearchCV model.

In [None]:
show_scores(rs_model)

The search only tried 20 combinations, so we can't be sure this is the best model. After trying several other parameter variations, I found the following settings to work best.

In [None]:
%%time
# Most ideal hyperparameters
ideal_model = RandomForestRegressor(n_estimators=90,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                   random_state=42)
ideal_model.fit(X_train, y_train)

In [None]:
show_scores(ideal_model)

## Predictions on Test Data

In [None]:
df_test = pd.read_csv('../input/bluebook-for-bulldozers/Test.csv',
                     parse_dates = ['saledate'])

In [None]:
df_test.shape

## Taking care of the missing values.

In [None]:
def preprocess_data(df):
    # Add datetime parameters for saledate
    # (column names must match the ones created for the training data)
    df["saleyear"] = df.saledate.dt.year
    df["salemonth"] = df.saledate.dt.month
    df["saleday"] = df.saledate.dt.day
    df["saledayofweek"] = df.saledate.dt.dayofweek
    df["saledayofyear"] = df.saledate.dt.dayofyear

    # Drop original saledate
    df.drop("saledate", axis=1, inplace=True)
    
    # Fill numeric rows with the median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                df[label+"_is_missing"] = pd.isnull(content)
                df[label] = content.fillna(content.median())
                
        # Turn categorical variables into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add the +1 because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1        
    
    return df

In [None]:
preprocess_data(df_test)

In [None]:
df_test.shape, X_train.shape

We can see that test data is still not in the right format, let's take care of that now.

Let's find out how the columns differ between the test and train datasets using a `set` difference.

In [None]:
set(X_train.columns)-set(df_test.columns)

Manually adjust `df_test` so that number of features match.

In [None]:
df_test['auctioneerID_is_missing'] = False
# Match the column order used when fitting the model
df_test = df_test[X_train.columns]
df_test.head()

In [None]:
df_test.shape, X_train.shape
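
A more general way to line up test columns with training columns is `DataFrame.reindex`. The sketch below uses toy frames with hypothetical values. It assumes any column missing from the test set is an `_is_missing` indicator, which is why `False` is a reasonable fill value.

```python
import pandas as pd

train_cols = ["YearMade", "auctioneerID", "auctioneerID_is_missing"]
test = pd.DataFrame({"auctioneerID": [1.0, 2.0], "YearMade": [2004, 1996]})

# Columns missing from test are added and filled with False;
# the result follows the training column order.
aligned = test.reindex(columns=train_cols, fill_value=False)
print(aligned.columns.tolist())
print(aligned["auctioneerID_is_missing"].tolist())
```

This adds missing columns and fixes the column order in one step. Recent scikit-learn versions check both when predicting on a DataFrame.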

Now that our test data is in the right format, let's make the predictions.

In [None]:
test_pred = ideal_model.predict(df_test)
test_pred

In [None]:
len(test_pred)

Let's format the predictions as required.

In [None]:
df_preds = pd.DataFrame()
df_preds['SalesID'] = df_test['SalesID']
df_preds['SalePrice'] = test_pred
# index=False keeps the DataFrame index out of the submission file
df_preds.to_csv('submission.csv', index=False)



## Feature Importance

Now that we've built a model which is able to make predictions, the people you share these predictions with (or you yourself) might be curious about which parts of the data led to these predictions.

This is where feature importance comes in. Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable.

In our case, after our model learned the patterns in the data, which bulldozer sale attributes were most important for predicting its overall sale price?

Beware: the default feature importances for random forests can lead to non-ideal results.

To find which features were most important for a machine learning model, a good idea is to search something like "[MODEL NAME] feature importance".

Doing this for our RandomForestRegressor leads us to find the feature_importances_ attribute.

Let's check it out.


In [None]:
# Find feature importance of our best model
ideal_model.feature_importances_

In [None]:
import seaborn as sns

# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importance": importances})
          .sort_values("feature_importance", ascending=False)
          .reset_index(drop=True))
    
    sns.barplot(x="feature_importance",
                y="features",
                data=df[:n],
                orient="h")

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)

In [None]:
sum(ideal_model.feature_importances_)
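
Given the caveat about default random forest importances, one commonly used cross-check is permutation importance. This technique is swapped in here for comparison and is not what the cells above use. The sketch runs on synthetic data where only the first feature matters. In practice you would probably compute it on held-out data such as `X_valid`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=500)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Mean drop in R^2 when each feature is shuffled
print(result.importances_mean)
```

A feature whose shuffling barely changes the score contributes little to the predictions, regardless of what the impurity-based `feature_importances_` suggest.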

In [None]:
df.ProductSize.isna().sum()

In [None]:
df.ProductSize.value_counts()

In [None]:
df.Turbocharged.value_counts()

In [None]:
df.Thumb.value_counts()