## Predicting the Sale Price of Bulldozers (Kaggle Competition)

## 1. Problem Definition

The goal of this project is to predict the sale price of bulldozers at auction.  
Predictions are based on usage, equipment type, and configuration.  
The data is sourced from auction result postings.  
Type of machine learning problem: **supervised learning / time series regression**

## 2. Evaluation

The competition evaluation metric was *root mean squared log error (RMSLE)*.  
**Project goal:** To minimize the difference between actual and predicted prices, i.e., to minimize RMSLE.
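
As a rough illustration of the metric, the sketch below computes RMSLE by hand with NumPy. The function name `rmsle` and the example prices are made up for this illustration and are not part of the competition material. It assumes all prices are non-negative.

```python
import numpy as np

def rmsle(y_true, y_pred) -> float:
    """Root mean squared log error; assumes all values are non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a value of 0 does not produce -inf
    squared_log_diffs = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(squared_log_diffs.mean()))

# hypothetical prices, for illustration only
actual = [10_000, 50_000]
predicted = [20_000, 100_000]

# both predictions are off by a factor of 2, so the result is close to log(2)
print(rmsle(actual, predicted))
```

Because the error is measured on the log scale, being off by a factor of 2 on a cheap machine costs about as much as being off by a factor of 2 on an expensive one.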

## 3. Data

The data comes from a past Kaggle competition, *Bluebook for Bulldozers*:  
[Bluebook for Bulldozers Kaggle Competition](https://www.kaggle.com/c/bluebook-for-bulldozers/overview)  
There are three main datasets:  
* **Train.csv** is the training set, which contains data through the end of 2011.
* **Valid.csv** is the validation set, which contains data from January 1, 2012 - April 30, 2012.  
The score on this set was used to create the public leaderboard.
* **Test.csv** is the test set, which contains data from May 1, 2012 - November 30, 2012.  
The score on this set determined the final rank for the competition.

## 4. Data Features

**Data dictionary**

Kaggle provided a data dictionary for these datasets.  
See `data-dictionary.xlsx` in the project folder.

#### Importing tools

In [1]:
### importing standard libraries
from typing import List, Dict

### importing data analysis libraries
import numpy, pandas
from pandas import read_csv, DataFrame, Series
from matplotlib import pyplot
from matplotlib.figure import Figure

### importing machine learning libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import RandomizedSearchCV

#### Importing data

**Parsing dates**

Time series data is easier to process when date/time columns are stored as pandas datetime values rather than as plain strings.  
Date/time columns are parsed into datetime format using the `parse_dates=` parameter of `read_csv()`.
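
A minimal sketch of the difference, using a tiny in-memory CSV instead of the real file. The rows are made up for illustration; only the `saledate` column name comes from the dataset.

```python
from io import StringIO

import pandas as pd

# tiny in-memory CSV standing in for the real file (values are made up)
csv_text = "SalesID,SalePrice,saledate\n1,10000,2006-11-16\n2,25000,2004-03-26\n"

# without parse_dates the column stays as text
raw = pd.read_csv(StringIO(csv_text))
# with parse_dates the column is converted to datetime values
parsed = pd.read_csv(StringIO(csv_text), parse_dates=["saledate"])

print(raw["saledate"].dtype)
print(parsed["saledate"].dtype)

# the .dt accessor is only available on the parsed column
sale_years = parsed["saledate"].dt.year.tolist()
print(sale_years)
```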

In [2]:
### importing training and validation datasets from file
import_df: DataFrame = read_csv(
    filepath_or_buffer="../Large-Files/data-train-valid.csv",
    parse_dates=["saledate"],
    low_memory=False)

#### Exploring: Target variable

In [None]:
### plotting the distribution of target variable
bulldozer_sorted["SalePrice"].plot.hist(color="steelblue")

### customizing the plot
pyplot.title(label="Distribution of Bulldozer Sale Prices")
pyplot.ylabel(ylabel="Sale Count")
pyplot.xlabel(xlabel="Sale Price ($)");

#### Exploring: Sale date

In [None]:
### plotting correlation between sale price and sale date
bulldozer_df[:500].plot.scatter(y="SalePrice", x="saledate", c="steelblue", s=50)

### customizing the plot
pyplot.title(label="Sale Price and Sale Date")
pyplot.ylabel(ylabel="Sale Price ($)")
pyplot.xlabel(xlabel="Sale Date");

#### Exploring: State of sale

In [None]:
### plotting the distribution of sales by state
bulldozer_sorted["state"].value_counts().sort_index(ascending=True).plot.bar(figsize=(12,5))

### customizing the plot
pyplot.title(label="Distribution of Sales by States")
pyplot.ylabel(ylabel="Sale Counts");

#### Model-driven data exploration

**Concept**

When a dataset has a very large number of features, it may be better to start building a machine learning model right away.  
Model-driven data exploration lets the machine learning algorithm select the most important features.
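
One common way to do this with a random forest is to inspect `feature_importances_` after fitting. The sketch below uses synthetic data with made-up column names (`informative`, `noise_a`, `noise_b`), not the bulldozer dataset, so it only illustrates the idea.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# synthetic data: only the "informative" column actually drives the target
rng = np.random.default_rng(42)
n_rows = 500
X = pd.DataFrame({
    "informative": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
y = 3.0 * X["informative"] + rng.normal(scale=0.1, size=n_rows)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# importances sum to 1; larger values mean the feature was more useful for splitting
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real dataset the same ranking can point to the columns that deserve a closer manual look.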

## 5. Modeling

#### Feature engineering

**Concept**

Feature engineering means reshaping the dataset so that it gives the model more useful inputs.  
It includes transforming existing data and/or creating new data from existing data.

**Sorting dataframe by date**

When working with time series, it is better to sort the data by date/time, so that rows appear in chronological order and date-based splits are easier to check.

In [3]:
### function: sorting dataframe by sale date
### (sort_values returns a new dataframe, so the result is returned to the caller)
def sortData(df:DataFrame) -> DataFrame:
    return df.sort_values(by="saledate", ascending=True)

**Enriching dataset with date information**

Using the `.dt` accessor of a pandas datetime column, additional data is extracted from datetime values.

In [5]:
### function: extracting date information from sale date column
def appendDateInfo(df:DataFrame) -> None:
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df["saleDayOfWeek"] = df["saledate"].dt.day_of_week
    df["saleDayOfYear"] = df["saledate"].dt.day_of_year
    df.drop(columns="saledate", inplace=True) # dropping in place so the change reaches the caller

#### Preprocessing: Cleaning and transforming string values

**Pandas api.types interface**

Datatypes can be manipulated with the `pandas.api.types` interface.

**Pandas category datatype**

One way to convert strings into numbers is to use the pandas category datatype.  
In the integer codes of a category column, missing values are represented by -1.

**Pandas .cat interface**

The category datatype is manipulated through the `.cat` accessor of a pandas column.  
`dataframe[column].cat.categories` = lists all categories in the given dataframe column.  
`dataframe[column].cat.codes` = lists all integer codes in the given dataframe column.
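
A small sketch of how the codes behave, using a made-up column of state names. It also shows why adding 1 to the codes, as `cleanStrings` does below, maps missing values to 0.

```python
import pandas as pd

# made-up values, including one missing entry
states = pd.Series(["Texas", None, "Alabama", "Texas"], dtype="object")
as_category = states.astype("category")

categories = list(as_category.cat.categories)
codes = as_category.cat.codes.tolist()
shifted = (as_category.cat.codes + 1).tolist()

print(categories)  # ['Alabama', 'Texas'] (categories are sorted)
print(codes)       # [1, -1, 0, 1] (missing value = -1)
print(shifted)     # [2, 0, 1, 2] (missing value = 0)
```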

In [None]:
### function: converting string > category > int and filling missing values
def cleanStrings(df:DataFrame) -> None:
    for column,values in df.items(): # iterating through dataframe
        if pandas.api.types.is_string_dtype(values): # selecting string columns
            df[column+"_missing"] = values.isna() # saving location of missing values
            df[column] = values.astype('category').cat.as_ordered() # converting string > category
            values = df[column]
            df[column] = values.cat.codes + 1 # converting category > int (missing values = 0)

#### Preprocessing: Cleaning numerical values

**Important**

Filling missing numerical values with a summary statistic (such as the median) must be done after splitting the dataset!  
Otherwise the statistic is computed partly from validation / test rows, and that data must not be used in any form when training the algorithm.
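
One way to respect this rule is to compute the fill value on the training rows only and reuse it for the validation rows. The sketch below uses a made-up `hours` column and made-up values. It is an illustration of the idea, not the notebook's own preprocessing.

```python
import numpy as np
import pandas as pd

# made-up training and validation frames with missing values
train = pd.DataFrame({"hours": [100.0, np.nan, 300.0, 500.0]})
valid = pd.DataFrame({"hours": [np.nan, 10_000.0]})

# the median is computed from the training rows only
train_median = train["hours"].median()

# the same training median fills the gaps in both sets
train_filled = train.fillna({"hours": train_median})
valid_filled = valid.fillna({"hours": train_median})

print(train_median)                    # 300.0
print(valid_filled["hours"].tolist())  # [300.0, 10000.0]
```

The validation value of 10,000 never influences the fill value, which is the point of computing the median before looking at the validation rows.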

**Statistical concept**

The mean is much more sensitive to outliers than the median.  
The median is therefore the safer fill value for columns that contain many outliers.
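
A quick numeric check with made-up prices, where one extreme value is enough to pull the mean far away from the typical price:

```python
import numpy as np

# four ordinary prices and one extreme outlier (made-up values)
prices = np.array([10_000, 12_000, 11_000, 13_000, 500_000])

mean_price = float(np.mean(prices))
median_price = float(np.median(prices))

print(mean_price)    # 109200.0
print(median_price)  # 12000.0
```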

In [None]:
### function: filling missing numerical values with median
def cleanNumerics(df:DataFrame) -> None:
    for column,values in df.items(): # iterating through dataframe
        if pandas.api.types.is_numeric_dtype(values): # selecting numeric columns
            df[column+"_missing"] = values.isna() # saving location of missing values
            df[column] = values.fillna(value=values.median()) # filling with median

#### Preparing data: Splitting into training / validation sets

In [None]:
### splitting dataset training / validation
training_df: DataFrame = bulldozer_sorted.loc[bulldozer_sorted["saleYear"] != 2012]
validation_df: DataFrame = bulldozer_sorted.loc[bulldozer_sorted["saleYear"] == 2012]

training_df.shape, validation_df.shape

#### Preparing data: Splitting into features / targets

In [None]:
### splitting datasets features / targets
features_train: DataFrame = training_df.drop(columns="SalePrice")
targets_train: Series = training_df["SalePrice"]
features_valid: DataFrame = validation_df.drop(columns="SalePrice")
targets_valid: Series = validation_df["SalePrice"]

features_train.shape, targets_train.shape, features_valid.shape, targets_valid.shape

#### Selecting and training machine learning algorithm

In [None]:
### preparing data
features: DataFrame = bulldozer_sorted.drop(columns="SalePrice")
targets: Series = bulldozer_sorted["SalePrice"]

### preparing random forest regressor
regressor: RandomForestRegressor = RandomForestRegressor(random_state=42, n_jobs=-1)

### training algorithm
regressor.fit(X=features, y=targets)

### scoring model (R^2 on the same data it was trained on, so the score is optimistic)
regressor.score(X=features, y=targets)

#### Model evaluation

In [None]:
### custom evaluation function
def evaluateModel(
        model:RandomForestRegressor,
        features_train:DataFrame, features_valid:DataFrame,
        targets_train:Series, targets_valid:Series):
    """
    Calculates the Root Mean Squared Log Error (RMSLE) for the given model.
    Returns a tuple: (RMSLE on training set, RMSLE on validation set).
    """
    predict_train: numpy.ndarray = model.predict(X=features_train)
    predict_valid: numpy.ndarray = model.predict(X=features_valid)
    # taking the square root explicitly, because squared=False is deprecated in newer scikit-learn releases
    rmsle_train: float = float(numpy.sqrt(mean_squared_log_error(y_true=targets_train, y_pred=predict_train)))
    rmsle_valid: float = float(numpy.sqrt(mean_squared_log_error(y_true=targets_valid, y_pred=predict_valid)))
    return rmsle_train, rmsle_valid

#### Reducing data

**Machine learning concept**

It takes a lot of time to train an algorithm on a large training dataset.  
The solution is to use only a fraction of the training dataset for experimenting.  
The final model is trained on the entire training dataset using the best hyperparameters found.

**`max_samples` argument**

With `RandomForestRegressor`, the `max_samples` argument sets how many rows each tree draws from the training dataset: an integer is a row count, a float is a fraction of the rows.
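
A minimal sketch of both forms on synthetic data (the arrays are random numbers, not bulldozer data). With 1000 training rows, `max_samples=100` and `max_samples=0.1` both amount to 100 rows per tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data: 1000 rows, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)

# int: each tree is trained on this many rows drawn from the training data
by_count = RandomForestRegressor(n_estimators=10, max_samples=100, random_state=0)
by_count.fit(X, y)

# float: each tree is trained on this fraction of the training rows
by_fraction = RandomForestRegressor(n_estimators=10, max_samples=0.1, random_state=0)
by_fraction.fit(X, y)

predictions_count = by_count.predict(X)
predictions_fraction = by_fraction.predict(X)
print(predictions_count.shape, predictions_fraction.shape)
```

`max_samples` only takes effect when `bootstrap=True`, which is the default for `RandomForestRegressor`.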

In [None]:
### training algorithm on a subset of training data
regressor = RandomForestRegressor(random_state=42, max_samples=10000, n_jobs=-1)
regressor.fit(X=features_train, y=targets_train)
scores = evaluateModel(
    model=regressor,
    features_train=features_train, features_valid=features_valid,
    targets_train=targets_train, targets_valid=targets_valid)
print(scores)

#### Hyperparameter tuning

In [None]:
### creating randomized search grid
rscv_grid: Dict[str,list] = {
    "max_depth": [None, 3, 5, 10],
    "max_features": [0.5, 1, "sqrt", 1.0], # int 1 = one feature, float 1.0 = all features (replaces "auto", removed in newer scikit-learn)
    "max_samples": [10000],
    "min_samples_leaf": numpy.arange(1, 20, 2),
    "min_samples_split": numpy.arange(2, 20, 2),
    "n_estimators": numpy.arange(10, 100, 10)}

### creating and fitting randomized search cross validation model
rscv_model = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42, n_jobs=-1),
    param_distributions=rscv_grid, n_iter=10, cv=5, verbose=True)
rscv_model.fit(X=features_train, y=targets_train)

### displaying best parameters
rscv_model.best_params_

In [None]:
### evaluating best model
scores = evaluateModel(
    model=rscv_model.best_estimator_,
    features_train=features_train, features_valid=features_valid,
    targets_train=targets_train, targets_valid=targets_valid)
print(scores)

#### Training the model on full training dataset using best hyperparameters

**Note**

These hyperparameters were found by the course instructor with 100 iterations of randomized search.

In [None]:
### creating and training random forest regressor
regressor = RandomForestRegressor(
    n_estimators=40, min_samples_split=14, min_samples_leaf=1,
    max_samples=None, max_features=0.5, random_state=42, n_jobs=-1)
regressor.fit(X=features_train, y=targets_train)

### evaluating ideal model
scores = evaluateModel(
    model=regressor,
    features_train=features_train, features_valid=features_valid,
    targets_train=targets_train, targets_valid=targets_valid)
print(scores)

#### Predicting on the test dataset

**Concept**

The training and test datasets must be in the same format.  
The exact same manipulations must be performed on both datasets.
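
Even after identical preprocessing, the test set can end up with different columns or a different column order than the training set. The sketch below shows one possible way to check and align the columns with `reindex`. The frames and the fill value are made up for illustration, and whether `False` is a sensible fill value depends on the column.

```python
import pandas as pd

# made-up frames: the test set lacks one column and has a different column order
train_features = pd.DataFrame({
    "saleYear": [2010, 2011],
    "state": [1, 2],
    "state_missing": [False, True],
})
test_features = pd.DataFrame({"state": [2, 1], "saleYear": [2012, 2012]})

# columns the model was trained on but the test set does not have
missing_columns = set(train_features.columns) - set(test_features.columns)
print(missing_columns)  # {'state_missing'}

# same columns, same order as training; absent columns are filled with False
aligned = test_features.reindex(columns=train_features.columns, fill_value=False)
print(list(aligned.columns))
```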

In [None]:
### importing test dataset from file
test_df: DataFrame = read_csv(
    filepath_or_buffer="data-test.csv",
    parse_dates=["saledate"],
    low_memory=False)
test_df.head()

In [None]:
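### making predictions using the regressor trained on the full training dataset
### (test_df must first go through the same preprocessing as the training data)
test_predicts = regressor.predict(test_df)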

In [None]:
### formatting predictions as requested by Kaggle and exporting to csv
### (a dict maps each column name to its values; the target column is SalePrice, as in the training data)
predicts_df = DataFrame(data={"SalesID": test_df["SalesID"], "SalePrice": test_predicts})
predicts_df.to_csv(path_or_buf="predict-test.csv", index=False)