# Predicting the sale price of Bulldozers using machine learning

This notebook goes through an example machine learning project with the goal of predicting the sale price of bulldozers.

## 1. Problem Definition

> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?


## 2. Data

The data is downloaded from Kaggle's "Bluebook for Bulldozers" competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

* **Train.csv** is the training set, which contains data through the end of 2011.
* **Valid.csv** is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* **Test.csv** is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

## 3. Evaluation

The evaluation metric for this competition is the **RMSLE (root mean squared log error)** between the actual and predicted auction prices.

The goal for most regression evaluation metrics is to **minimize the error**. 
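
To make the metric concrete, here is a minimal sketch of RMSLE written with only the standard library. It assumes the common definition of the metric (the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`); the function name `rmsle_manual` and the prices are made up for illustration.

```python
import math

def rmsle_manual(y_true, y_pred):
    """Root mean squared log error, assuming the log(1 + x) form of the metric."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical auction prices: one cheap machine, one expensive machine
actual = [10_000, 100_000]
doubled = [20_000, 200_000]  # every prediction is twice the true price

score = rmsle_manual(actual, doubled)
print(round(score, 3))
```

Because the error is measured on logarithms, doubling a 10,000 price costs about as much as doubling a 100,000 price (each contributes roughly `log(2)`), so the metric penalizes relative rather than absolute errors.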

## 4. Features

Kaggle provides a data dictionary detailing all of the features of the dataset: https://www.kaggle.com/c/bluebook-for-bulldozers/data?select=Data+Dictionary.xlsx


### Preparing the tools

We're going to use:
* pandas for data analysis.
* NumPy for numerical operations.
* Matplotlib/seaborn for plotting or data visualization.
* Scikit-Learn for machine learning modelling and evaluation.

In [None]:
# Import all the tools we need

# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Models from scikit-learn (this is a regression problem)
from sklearn.ensemble import RandomForestRegressor

# Model selection and evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

In [None]:
# Load the data (training and validation sets)
df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000]);

In [None]:
# plot histogram to see distribution of sales price
df.SalePrice.plot.hist();

### Parsing Dates

When we work with time series data, we want to enrich the time and date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the `parse_dates` parameter of `read_csv`.
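
As a small illustration of what `parse_dates` changes, the sketch below reads the same text twice. The two-row CSV is made up and only stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "SalesID,saledate,SalePrice\n1,2006-11-16,66000\n2,2004-03-26,57000\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(as_text["saledate"].dtype)   # without parse_dates the column is not a datetime type
print(as_dates["saledate"].dtype)  # with parse_dates it is a datetime64 column
```

Only the parsed version supports the `.dt` accessor, which is what makes extracting the year, month and day possible later on.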

In [None]:
df.saledate.dtype

In [None]:
# Import the data again but this time parse dates
df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False,
                 parse_dates=["saledate"])

In [None]:
df.saledate.dtype

In [None]:
df.saledate[:1000]

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(df.saledate[:1000], df.SalePrice[:1000]);

In [None]:
df.saledate.head(20)

### Sort Dataframe by saledate

When working with time series data, it's a good idea to sort it by date.

In [None]:
# Sort dataframe by date
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(20)

In [None]:
df.head()

### Make a copy of the original dataframe

We make a copy of the original dataframe so when we manipulate the copy, we've still got our original data.

In [None]:
# make a copy
df_tmp = df.copy()

In [None]:
df_tmp.head()

In [None]:
df_tmp.saledate.head()

### Add datetime parameters to `saledate` column

In [None]:
df[:5].saledate

In [None]:
df[:5].saledate.dt.year  # the .dt accessor exposes datetime properties of the column

In [None]:
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayOfWeek"] = df_tmp.saledate.dt.dayofweek
df_tmp["saleDayOfYear"] = df_tmp.saledate.dt.dayofyear

In [None]:
df_tmp.head().T
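
To check what each of these `.dt` properties returns, here is a short sketch on two hand-picked dates (they are illustrative and not taken from the dataset).

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2012-01-01", "2012-04-30"]))

features = pd.DataFrame({
    "saleYear": dates.dt.year,
    "saleMonth": dates.dt.month,
    "saleDay": dates.dt.day,
    "saleDayOfWeek": dates.dt.dayofweek,   # Monday=0 ... Sunday=6
    "saleDayOfYear": dates.dt.dayofyear,   # counting starts at 1
})
print(features)
```

January 1, 2012 was a Sunday, so its `saleDayOfWeek` is 6, while April 30, 2012 was a Monday and gets 0.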

In [None]:
# Now that we've enriched our dataframe with datetime features, we can remove the 'saledate' column
df_tmp.drop("saledate", axis=1, inplace=True)

In [None]:
# Checking the values of different columns
df_tmp.state.value_counts()

## 5. Modelling

Let's start to do some model-driven EDA.

In [None]:
df_tmp.isna().sum()

In [None]:
df_tmp.info()

Our dataset contains non-numeric data, and many columns have missing values. Thus, we need to convert the non-numeric data to numbers and handle the missing values before building a model.

### Convert strings to categories - Label Encoding

One way we can turn all of our data into numbers is by converting them into pandas categories.

Pandas categories: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

Different datatypes compatible with pandas: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html

Helpful resource: https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

In [None]:
df_tmp.UsageBand.dtype == "object"

In [None]:
pd.api.types.is_string_dtype(df_tmp["UsageBand"])

In [None]:
# Find the columns which contain strings

for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# Find the columns which contain strings
for label in df_tmp.keys():
    if df_tmp[label].dtype == "object":
        print(label)

In [None]:
df_tmp.info()

In [None]:
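# Turn all of the string columns into ordered categories.
# Object columns are checked as well, because in pandas 2.x is_string_dtype
# returns False for object columns that contain missing values.
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content) or pd.api.types.is_object_dtype(content):
        df_tmp[label] = content.astype("category").cat.as_ordered()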

`.cat` is used to access the category.

`cat.as_ordered()` - this means each column that gets turned into a category will have an assumed order (by default, the sorted order of its unique values), so operations such as min() and max() are defined.

If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.
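
A minimal sketch of the difference, using made-up labels rather than the real column values:

```python
import pandas as pd

levels = pd.Series(["Low", "High", "Medium", "Low"])  # hypothetical labels

unordered = levels.astype("category")
ordered = unordered.cat.as_ordered()

# min() is only defined once the categorical is ordered
try:
    unordered.min()
    unordered_min_works = True
except TypeError:
    unordered_min_works = False

print(unordered_min_works)
print(list(ordered.cat.categories))
print(ordered.min())
```

Note that the assumed order is alphabetical (`High`, `Low`, `Medium`), so `min()` returns `High` here. The order makes the operations possible, but it does not necessarily reflect the real-world meaning of the labels.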

In [None]:
df_tmp.info()

The columns that had dtype `object` now have dtype `category`.

In [None]:
df_tmp.state  # ordered categories

In [None]:
type(df_tmp.state)

In [None]:
df_tmp.state.cat.categories

`cat.categories` - returns the categories of this categorical as an `Index`.

Under the hood, pandas assigns an integer code to each category (its position in `cat.categories`). The codes can be accessed using `cat.codes`.

In [None]:
df_tmp.state.cat.codes

In [None]:
d = dict(enumerate(df_tmp.state.cat.categories))  # Maps each integer code to its category
print(d)

In [None]:
print(dict(enumerate(df_tmp.Hydraulics.cat.categories)))

Now all of our data can be accessed as numbers (thanks to pandas `categories`!).

But we still have a bunch of missing data.

In [None]:
# Check missing values
df_tmp.isnull().sum()/len(df_tmp)

## Fill missing values

### Fill numerical missing values

In [None]:
df_tmp.isnull().sum()

In [None]:
# Finding numerical columns
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
df_tmp.ModelID

In [None]:
# Check for which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Fill missing numeric values with the median
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells us whether the value was missing
            df_tmp[label + "_is_missing"] = pd.isnull(content)
            # The median is more robust to outliers than the mean
            df_tmp[label] = content.fillna(content.median())

In [None]:
# Check whether any numeric columns still have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Check to see how many examples were missing
df_tmp.auctioneerID_is_missing.value_counts()
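
The fill-with-median pattern used above can be seen on a toy column. The values are invented, with one deliberate outlier to show why the median is preferred over the mean.

```python
import pandas as pd

# Hypothetical meter readings: one missing value and one extreme outlier
toy = pd.DataFrame({"hours": [100.0, None, 300.0, 10_000.0]})

mean_value = toy["hours"].mean()      # pulled upwards by the outlier
median_value = toy["hours"].median()  # the middle of the observed values

# Record which rows were missing before overwriting them
toy["hours_is_missing"] = toy["hours"].isna()
toy["hours"] = toy["hours"].fillna(median_value)

print(mean_value, median_value)
print(toy)
```

The `_is_missing` column keeps the information that a value was absent, which would otherwise be lost once the gap is filled.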

### Filling and turning categorical data into numbers

In [None]:
# Check for columns which aren't numeric
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
pd.Categorical(df_tmp['state']).codes

In [None]:
pd.Categorical(df_tmp.UsageBand).codes

By default pandas assigns `code` = -1 for rows with missing values (for any column).
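
A quick sketch of this behaviour on a made-up series with one missing entry:

```python
import pandas as pd

usage = pd.Series(["Low", None, "High", "Medium"])  # hypothetical values

codes = pd.Categorical(usage).codes  # position of each value in the sorted categories
shifted = codes + 1                  # moves the missing-value code from -1 to 0

print(codes.tolist())
print(shifted.tolist())
```

After the shift, 0 is reserved for missing values and every real category gets a positive integer.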

In [None]:
df_tmp.UsageBand.cat.codes

In [None]:
# Turn categorical variables into numbers and flag missing values
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column to indicate whether the sample had a missing value
        df_tmp[label + "_is_missing"] = pd.isnull(content)
        # pandas assigns code -1 to missing values; adding 1 makes them 0
        df_tmp[label] = pd.Categorical(content).codes + 1

In [None]:
df_tmp.isna().sum()

In [None]:
df_tmp.info()

Now that all of our data is numeric and our dataframe has no missing values, we should be able to build a machine learning model.

In [None]:
len(df_tmp)

In [None]:
%%time
from sklearn.ensemble import RandomForestRegressor
# Instantiate model
model = RandomForestRegressor(random_state=42)

# Fit the model
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

In [None]:
model.score(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

**Question:** Why doesn't the above metric hold water (why isn't it reliable)? Because the model is evaluated on the same data it was trained on, so the score says nothing about how well it generalizes to unseen sales.

Since we are predicting future prices, the data is split into train/validation sets by sale year rather than at random. Everything sold in 2012 will be part of the validation set and everything sold before 2012 will be part of the training set.

In [None]:
df_tmp.saleYear.value_counts()

In [None]:
# Split data into train and validation sets
df_valid = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

len(df_valid), len(df_train)

In [None]:
# Split data into x and y
x_train, y_train = df_train.drop("SalePrice", axis=1), df_train["SalePrice"]
x_valid, y_valid = df_valid.drop("SalePrice", axis=1), df_valid["SalePrice"]

x_train.shape, y_train.shape, x_valid.shape, y_valid.shape

In [None]:
y_train

### Building an evaluation function

In [None]:
# Create evaluation function (kaggle competition uses RMSLE)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

def rmsle(y_test, y_preds):
    """
    Calculates root mean squared log error between predictions
    and true labels.
    """
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create function to evaluate model on a few different metrics
def show_scores(model):
    train_preds = model.predict(x_train)
    valid_preds = model.predict(x_valid)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Validation MAE": mean_absolute_error(y_valid, valid_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Validation RMSLE": rmsle(y_valid, valid_preds),
              "Training R^2": r2_score(y_train, train_preds),
              "Validation R^2": r2_score(y_valid, valid_preds)}
    return scores

### Testing our model on a subset (to tune the hyperparameters)

In [None]:
# This takes far too long for experimenting

# %%timeit

# model = RandomForestRegressor()
# model.fit(x_train, y_train)

In [None]:
# Change max_samples value
model = RandomForestRegressor(random_state=42,
                              max_samples=10000)
model

In [None]:
%%time
# Cutting down on the max number of samples each estimator can see improves training time.
model.fit(x_train, y_train)

In [None]:
show_scores(model)
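
A back-of-the-envelope sketch of why capping `max_samples` helps. The row count is a hypothetical round number (the real value is `x_train.shape[0]`), and the comparison only counts bootstrap rows, so it is a rough guide rather than a timing prediction.

```python
# Hypothetical training-set size; the real value is x_train.shape[0]
n_rows = 400_000
n_estimators = 100     # scikit-learn's default for RandomForestRegressor
max_samples = 10_000

rows_seen_default = n_estimators * n_rows       # each tree draws n_rows bootstrap samples
rows_seen_capped = n_estimators * max_samples   # each tree draws max_samples instead
reduction_factor = rows_seen_default / rows_seen_capped

print(rows_seen_default, rows_seen_capped, reduction_factor)
```

Under these assumptions the forest processes 40 times fewer rows, which is why the capped model is convenient for experimenting even though each tree sees less data.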

### Hyperparameter tuning with RandomizedSearchCV

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10,100,10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2,20,2),
           "min_samples_leaf": np.arange(1,20,2),
           "max_features": [0.5, 1, "sqrt", 1.0],  # 1 = a single feature, 1.0 = all features
           "max_samples": [10000]}

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              verbose=True)

# Fit the randomizedSearchCV model
rs_model.fit(x_train, y_train) 

In [None]:
# find the best hyperparameters
rs_model.best_params_

In [None]:
# evaluate the model
show_scores(rs_model)
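
As a rough sketch of how little of the search space `n_iter=2` covers, the snippet below counts the combinations in a grid shaped like `rf_grid`. The counts mirror the ranges used in the search cell above, and the 5 folds are scikit-learn's default when `cv` is not set.

```python
import math

# Number of candidate values per hyperparameter, mirroring rf_grid
grid_sizes = {
    "n_estimators": len(range(10, 100, 10)),     # 10, 20, ..., 90
    "max_depth": 4,
    "min_samples_split": len(range(2, 20, 2)),   # 2, 4, ..., 18
    "min_samples_leaf": len(range(1, 20, 2)),    # 1, 3, ..., 19
    "max_features": 4,
    "max_samples": 1,
}

total_combinations = math.prod(grid_sizes.values())
n_iter = 2   # combinations actually sampled by the search
cv = 5       # default number of folds when cv is not specified
total_fits = n_iter * cv

print(total_combinations, total_fits)
```

With 12,960 possible combinations and only 2 of them tried, the result of such a short search is mostly down to luck; increasing `n_iter` trades training time for better coverage.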

### Train a model with best hyperparameters

**Note:** These were found after 100 iterations of `RandomizedSearchCV`

In [None]:
%%time

# Best hyperparameters found
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_split=14,
                                    min_samples_leaf=1,
                                    max_features=0.5, 
                                    max_samples=None,
                                    random_state=42) 

# fit the ideal model
ideal_model.fit(x_train, y_train)

In [None]:
# evaluate the ideal model
show_scores(ideal_model)

There is a significant decrease in validation `RMSLE` for the ideal model (trained on all the data with the best hyperparameters) compared with the previous `RandomizedSearchCV` model trained on ~10,000 samples.

## Make predictions on test data

In [None]:
# Import the test data
df_test = pd.read_csv("../input/bluebook-for-bulldozers/Test.csv",
                      low_memory=False,
                      parse_dates=["saledate"])
df_test.head()

In [None]:
df_test.shape

In [None]:
df_test.isna().sum()

In [None]:
df_test.info()

In [None]:
df_test.columns

In [None]:
x_train.columns

The test data has missing values as well as non-numeric columns.
Also, the test data has only 52 columns whereas `x_train` (the data the model was trained on) has 102 columns.

Therefore, our ideal model can't predict on the test data directly, because it is not in the same format as the data the model was trained on.

### Preprocessing the data (getting the test data in the same format as the training dataset)

In [None]:
def preprocess_data(df):
    """
    Performs transformations on df (in place) and returns the transformed df.
    """
    # Add datetime parameters from the `saledate` column
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear

    # Now that the date features are extracted, drop the 'saledate' column
    df.drop("saledate", axis=1, inplace=True)

    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us whether the value was missing
                df[label + "_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with the median
                df[label] = content.fillna(content.median())
        else:
            # Add a binary column to indicate whether the sample had a missing value
            df[label + "_is_missing"] = pd.isnull(content)
            # Turn categories into numbers; +1 moves missing values from -1 to 0
            df[label] = pd.Categorical(content).codes + 1

    return df

In [None]:
# Process the test data
df_test = preprocess_data(df_test)
df_test.head()

In [None]:
df_test.isna().sum()

In [None]:
df_test.info()

In [None]:
x_train.shape[1]

In [None]:
# make predictions on test data
# test_preds = ideal_model.predict(df_test)

There is a mismatch between the number of columns in the training dataset and the test dataset.

In [None]:
# we can find how the columns differ using sets
set(x_train.columns) - set(df_test.columns)

This means that test data did not have any missing values for `auctioneerID` column and hence `auctioneerID_is_missing` column was not created.
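
A more general way to line up the columns is sketched below on toy frames. The column names and values are hypothetical stand-ins for `x_train.columns` and the processed test data, and `reindex` is one possible approach rather than the only one.

```python
import pandas as pd

# Hypothetical column names standing in for x_train.columns
train_columns = ["YearMade", "auctioneerID", "auctioneerID_is_missing"]

# Hypothetical processed test frame: different column order, one indicator absent
test = pd.DataFrame({"auctioneerID": [1.0, 2.0], "YearMade": [2004, 1998]})

missing_in_test = sorted(set(train_columns) - set(test.columns))

# reindex adds absent columns (filled with False) and matches the training order
aligned = test.reindex(columns=train_columns, fill_value=False)

print(missing_in_test)
print(aligned.columns.tolist())
```

Filling with `False` only makes sense when every absent column is an `_is_missing` indicator, as is the case here.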

In [None]:
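# Manually adjust df_test to have the 'auctioneerID_is_missing' column
df_test["auctioneerID_is_missing"] = False
# Put the columns in the same order as the training data
# (recent scikit-learn versions check feature names and their order)
df_test = df_test[x_train.columns]
df_test.head()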

Finally, our test dataframe has the same features as the training dataframe and we can make predictions!

In [None]:
# make predictions on test data
test_preds = ideal_model.predict(df_test)
test_preds

In [None]:
len(test_preds)

Format the predictions the way Kaggle asks for:
* Have a header: "SalesID,SalePrice"
* Contain two columns

  **SalesID**: SalesID for the validation set in sorted order
  
  **SalePrice**: Your predicted price of the sale
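
A small sketch of checking that an exported file matches this layout. The IDs and prices are made up, and the file is written to an in-memory buffer instead of disk.

```python
import io
import pandas as pd

# Made-up predictions standing in for the real ones
preds = pd.DataFrame({"SalesID": [1227829, 1227844],
                      "SalePrice": [17000.0, 19500.0]})

without_index = io.StringIO()
preds.to_csv(without_index, index=False)   # only the two required columns

with_index = io.StringIO()
preds.to_csv(with_index, index=True)       # adds the dataframe index as an extra first column

header_without_index = without_index.getvalue().splitlines()[0]
header_with_index = with_index.getvalue().splitlines()[0]

print(header_without_index)
print(header_with_index)
```

Writing with `index=False` is what keeps the header at exactly `SalesID,SalePrice`.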

In [None]:
# format predictions
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds

In [None]:
# Export prediction data to csv (index=False keeps only the two required columns)
df_preds.to_csv("test-predictions.csv", index=False)

### Feature Importance

Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the **target variable** (`SalePrice`).

In [None]:
# find feature importance of best model
ideal_model.feature_importances_

In [None]:
x_train.columns

In [None]:
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importance": importances})
          .sort_values("feature_importance", ascending=False)
          .reset_index(drop=True))
    
    # Plot the dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importance"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature Importance")
    ax.invert_yaxis()

In [None]:
plot_features(x_train.columns, ideal_model.feature_importances_)