## Predicting the Sale Price of Bulldozers (Kaggle Competition)

## 1. Problem Definition

The goal of this project is to predict the sale price of bulldozers at auction.  
Predictions are based on usage, equipment type, and configuration.  
The data is sourced from auction result postings.  
Type of machine learning problem: **supervised learning / time series regression**

## 2. Evaluation

The competition evaluation metric was *root mean squared log error (RMSLE)*.  
**Project goal:** To minimize the difference between actual and predicted prices, i.e., to minimize RMSLE.
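
As a rough illustration of the metric, the sketch below computes RMSLE by hand with NumPy. The helper name `rmsle` and the two example prices are hypothetical and are not taken from the competition data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps a price of zero well defined
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# hypothetical prices: both predictions are off by a factor of two
actual = np.array([10_000.0, 100_000.0])
predicted = np.array([20_000.0, 200_000.0])
score = rmsle(actual, predicted)
```

Because the error is measured on the log scale, being off by a factor of two costs roughly the same for a cheap machine as for an expensive one.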

## 3. Data

The data comes from the past Kaggle competition *Bluebook for Bulldozers*:  
[Bluebook for Bulldozers Kaggle Competition](https://www.kaggle.com/c/bluebook-for-bulldozers/overview)  
There are three main datasets:  
* **Train.csv** is the training set, which contains data through the end of 2011.
* **Valid.csv** is the validation set, which contains data from January 1, 2012 - April 30, 2012.  
The score on this set was used to create the public leaderboard.
* **Test.csv** is the test set, which contains data from May 1, 2012 - November 30, 2012.  
The score on this set determined the final rank for the competition.

## 4. Data Features

#### Data dictionary

Kaggle provided a data dictionary for these datasets.  
See `data-dictionary.xlsx` in the project folder.

#### Importing the tools

In [None]:
### importing standard libraries
from typing import List, Dict

### importing data analysis libraries
import numpy, pandas
from pandas import read_csv, DataFrame, Series
from matplotlib import pyplot
from matplotlib.figure import Figure

### importing machine learning libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, make_scorer
from sklearn.model_selection import RandomizedSearchCV

#### Importing the data

**Parsing dates**

Working with time series data requires date/time information to be stored as datetime values (rather than plain strings) for easy processing.  
Date/time columns are parsed into datetime format using the `parse_dates=` parameter of `read_csv()`.

In [None]:
### importing training and validation datasets from file
import_df: DataFrame = read_csv(
    filepath_or_buffer="../Large-Files/data-train-valid.csv",
    parse_dates=["saledate"],
    low_memory=False)
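
To make the effect of `parse_dates=` visible, here is a minimal sketch that reads the same small CSV twice, once without and once with date parsing. The two-row CSV text is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# hypothetical two-row extract; the real file has many more columns
csv_text = "SalePrice,saledate\n66000,2006-11-16\n57000,2004-03-26\n"

raw_df = pd.read_csv(io.StringIO(csv_text))
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

# without parse_dates the column holds plain strings
raw_is_datetime = pd.api.types.is_datetime64_any_dtype(raw_df["saledate"])
# with parse_dates the column holds datetime values
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed_df["saledate"])

# datetime columns expose date parts through the .dt accessor
first_year = parsed_df["saledate"].dt.year.iloc[0]
```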

#### Exploring: Target variable

In [None]:
### plotting the distribution of target variable
import_df["SalePrice"].plot.hist(color="steelblue")

### customizing the plot
pyplot.title(label="Distribution of Bulldozer Sale Prices")
pyplot.ylabel(ylabel="Sale Count")
pyplot.xlabel(xlabel="Sale Price ($)");

#### Exploring: Sale date

In [None]:
### plotting the correlation between sale price and sale date
import_df[:500].plot.scatter(y="SalePrice", x="saledate", c="steelblue", s=50)

### customizing the plot
pyplot.title(label="Sale Price and Sale Date")
pyplot.ylabel(ylabel="Sale Price ($)")
pyplot.xlabel(xlabel="Sale Date");

#### Exploring: State of sale

In [None]:
### plotting the distribution of sales by state
import_df["state"].value_counts().sort_index(ascending=True).plot.bar(figsize=(12,5))

### customizing the plot
pyplot.title(label="Distribution of Sales by State")
pyplot.ylabel(ylabel="Sale Counts");

#### Model-driven data exploration

**Concept**

When a dataset has a very large number of features, it can be more efficient to start building a machine learning model right away.  
Model-driven data exploration lets the machine learning algorithm indicate which features are the most important.
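
One common way to do this with a random forest is to inspect `feature_importances_` after fitting. The sketch below uses synthetic data with hypothetical column names, so it only illustrates the idea and says nothing about the bulldozer features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)
features = pd.DataFrame({
    "informative": rng.uniform(0, 10, size=500),
    "noise_a": rng.uniform(0, 10, size=500),
    "noise_b": rng.uniform(0, 10, size=500),
})
# the target depends only on the "informative" column (plus small noise)
targets = 3.0 * features["informative"] + rng.normal(0, 0.1, size=500)

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(features, targets)

# importances sum to 1; higher means the feature was more useful for splitting
importances = pd.Series(model.feature_importances_, index=features.columns)
ranking = importances.sort_values(ascending=False)
```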

## 5. Modeling

#### Functions: Feature engineering

**Concept**

Feature engineering means turning the raw data in the dataset into features that a model can learn from.  
It includes transforming existing data and/or creating new data from existing data.

**Sorting dataframe by date**

When working with time series, sorting the data by date/time keeps the records in chronological order, which makes time-based splits straightforward.

In [None]:
### function: sorting dataframe by sale date
def sortDataFrame(df:DataFrame) -> None:
    df.sort_values(by="saledate", ascending=True, ignore_index=True, inplace=True)
    return

**Enriching dataset with date information**

Using the `Series.dt` accessor of pandas, additional data is extracted from datetime values.

In [None]:
### function: extracting date information from sale date column
def appendDateInfo(df:DataFrame) -> None:
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df["saleDayOfWeek"] = df["saledate"].dt.day_of_week
    df["saleDayOfYear"] = df["saledate"].dt.day_of_year
    df.drop(columns="saledate", inplace=True)
    return

#### Functions: Preprocessing numerical data

**Checking datatypes: Pandas api.types interface**

Datatypes can be checked with the functions of the `pandas.api.types` interface, such as `is_numeric_dtype()`.

**Statistical concept**

The mean is much more sensitive to outliers than the median.  
Using the median is advised with large datasets full of outliers.
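
A tiny worked example, using made-up meter readings, shows how a single outlier moves the mean but not the median:

```python
import pandas as pd

# hypothetical meter readings with one extreme outlier
hours = pd.Series([100.0, 120.0, 110.0, 130.0, 50_000.0])

mean_hours = hours.mean()      # pulled far upward by the outlier
median_hours = hours.median()  # stays near the typical values
```

Here the mean is 10092.0 while the median is 120.0, so the median is the safer fill value for skewed columns.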

**Important machine learning concept**

Filling missing numerical values with a statistic such as the median must be done after splitting the dataset!  
Information from validation / test data must never be used, in any form, when training the algorithm.

In [None]:
### function: filling missing numerical values with median
def cleanNumerics(df:DataFrame) -> None:
    for column,values in df.items(): # iterating through dataframe
        if pandas.api.types.is_numeric_dtype(values): # selecting numeric columns
            df[column+"_missing"] = values.isna() # saving location of missing values
            df[column] = values.fillna(value=values.median()) # filling with median
    return
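
The function above fills each dataframe with its own median. A stricter variant, sketched here as an assumption rather than as the approach used in this notebook, computes the median on the training rows only and reuses it for the validation rows. The helper name `fill_with_train_median` and the column name are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_train_median(train, valid, column):
    """Fill gaps in both frames using the median of the training rows only."""
    train_median = train[column].median()
    # assign() returns new frames, so the inputs are left untouched
    train_filled = train.assign(**{column: train[column].fillna(train_median)})
    valid_filled = valid.assign(**{column: valid[column].fillna(train_median)})
    return train_filled, valid_filled

train = pd.DataFrame({"MachineHours": [10.0, 20.0, np.nan, 30.0]})
valid = pd.DataFrame({"MachineHours": [np.nan, 1000.0]})
train_filled, valid_filled = fill_with_train_median(train, valid, "MachineHours")
```

With this variant the validation gap is filled with 20.0 (the training median), not with a value derived from the validation rows themselves.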

#### Functions: Preprocessing string data

**Pandas category datatype**

One way to convert strings into numbers is to use the pandas category datatype.  
The integer codes of the category datatype represent missing values with -1.

**Pandas .cat interface**

The category datatype is manipulated with the `Series.cat` accessor.  
`dataframe[column].cat.categories` = lists all categories in the given dataframe column.  
`dataframe[column].cat.codes` = lists all integer codes in the given dataframe column.

In [None]:
### function: converting str > category > int and filling missing values
def cleanStrings(df:DataFrame) -> None:
    for column,values in df.items(): # iterating through dataframe
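        if pandas.api.types.is_string_dtype(values) or pandas.api.types.is_object_dtype(values): # selecting string columns (the object check also catches string columns with missing values in newer pandas)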
            df[column] = values.astype('category').cat.as_ordered() # converting str > category
            df[column] = df[column].cat.codes + 1 # converting category > int (missing values = 0)
    return
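
The following minimal sketch, on a made-up column, shows what the categories and codes look like and why adding 1 turns the missing-value code into 0:

```python
import numpy as np
import pandas as pd

# hypothetical string column with one missing value
sizes = pd.Series(["Small", "Large", np.nan, "Medium", "Large"])
as_category = sizes.astype("category").cat.as_ordered()

categories = list(as_category.cat.categories)   # sorted alphabetically
codes = as_category.cat.codes.tolist()          # -1 marks the missing value
shifted = (as_category.cat.codes + 1).tolist()  # missing value becomes 0
```

Note that the codes follow alphabetical order, so `Large` < `Medium` < `Small` here; the ordering carries no physical meaning.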

#### Preparing data: Engineering, splitting, and preprocessing

In [None]:
### date/time feature engineering
sortDataFrame(df=import_df)
appendDateInfo(df=import_df)

### splitting dataset training / validation
training_df: DataFrame = import_df.loc[import_df["saleYear"] != 2012].copy(deep=True)
validation_df: DataFrame = import_df.loc[import_df["saleYear"] == 2012].copy(deep=True)

### preprocessing numeric columns
cleanNumerics(df=training_df)
cleanNumerics(df=validation_df)

### preprocessing string / categorical columns
cleanStrings(df=training_df)
cleanStrings(df=validation_df)

### splitting datasets features / targets
train_features: DataFrame = training_df.drop(columns="SalePrice")
train_targets: Series = training_df.loc[:,"SalePrice"].copy(deep=True)
valid_features: DataFrame = validation_df.drop(columns="SalePrice")
valid_targets: Series = validation_df.loc[:,"SalePrice"].copy(deep=True)


#### Algorithm training

In [None]:
### creating initial random forest regressor
regressor = RandomForestRegressor(random_state=42, n_jobs=-1)

### training algorithm
regressor.fit(X=train_features, y=train_targets);

#### Model evaluation

In [None]:
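### creating scorer object: root mean squared log error
### (square root taken explicitly, since the `squared` argument is deprecated in newer scikit-learn releases)
### note: greater_is_better=False makes the scorer return the negated RMSLE
rmsle_scorer = make_scorer(lambda y_true, y_pred: numpy.sqrt(mean_squared_log_error(y_true, y_pred)), greater_is_better=False)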

### scoring initial model using rmsle scorer
rmsle_train: float = rmsle_scorer(estimator=regressor, X=train_features, y_true=train_targets)
rmsle_valid: float = rmsle_scorer(estimator=regressor, X=valid_features, y_true=valid_targets)
rmsle_train, rmsle_valid
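
When reading these scores, keep in mind that a scorer built with `greater_is_better=False` flips the sign of the metric, so the values come out negative. The sketch below demonstrates this on toy data with a `DummyRegressor`; the helper `rmsle` is a hypothetical wrapper that takes the square root explicitly.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_log_error

def rmsle(y_true, y_pred):
    # explicit square root of the mean squared log error
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))

scorer = make_scorer(rmsle, greater_is_better=False)

X = np.zeros((4, 1))
y = np.array([10.0, 20.0, 30.0, 40.0])
model = DummyRegressor(strategy="mean").fit(X, y)

raw_error = rmsle(y, model.predict(X))  # plain RMSLE, non-negative
scorer_value = scorer(model, X, y)      # same magnitude, negated
```

The sign flip lets search tools such as `RandomizedSearchCV` always maximize the score, which for an error metric means minimizing the error.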

#### Hyperparameter tuning

**Reducing data**

It takes a lot of time to train an algorithm on a large training dataset.  
The solution is to use only a fraction of the training dataset for experimenting.  
The final model is trained on the entire training dataset using the best hyperparameters found.

**max_samples argument**

With `RandomForestRegressor`, the `max_samples` argument sets how many training rows (as a count or as a fraction) each tree is trained on.

In [None]:
### creating reduced data random forest regressor
regressor_reduced = RandomForestRegressor(random_state=42, max_samples=10000, n_jobs=-1)

### training algorithm
regressor_reduced.fit(X=train_features, y=train_targets)

### scoring reduced data model using rmsle scorer
rmsle_train = rmsle_scorer(estimator=regressor_reduced, X=train_features, y_true=train_targets)
rmsle_valid = rmsle_scorer(estimator=regressor_reduced, X=valid_features, y_true=valid_targets)
rmsle_train, rmsle_valid
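
As a small illustration on synthetic data (not the bulldozer dataset), `max_samples` accepts either an integer row count or a float fraction. The array shapes and values below are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(2000, 3))
y = X[:, 0] * 2.0 + rng.normal(0, 0.1, size=2000)

# int: each tree draws 200 bootstrap rows
forest_int = RandomForestRegressor(
    n_estimators=10, max_samples=200, random_state=0).fit(X, y)

# float in (0, 1]: each tree draws that fraction of the rows (here 10%)
forest_frac = RandomForestRegressor(
    n_estimators=10, max_samples=0.1, random_state=0).fit(X, y)

predictions = forest_int.predict(X[:5])
```

Each tree still draws its own sample, so the forest as a whole can see more than `max_samples` distinct rows.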

**Randomized search**

In [None]:
### creating randomized search grid
rscv_grid: Dict[str,list] = {
    "max_depth": [None, 3, 5, 10],
    "max_features": [0.5, 1, "sqrt"],
    "max_samples": [10000],
    "min_samples_leaf": numpy.arange(1, 20, 2),
    "min_samples_split": numpy.arange(2, 20, 2),
    "n_estimators": numpy.arange(10, 100, 10)}

### creating randomized search cross validation model
rscv_model = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42, n_jobs=-1),
    param_distributions=rscv_grid, random_state=42,
    n_iter=50, cv=3, scoring=rmsle_scorer, verbose=1)

### training model
rscv_model.fit(X=train_features, y=train_targets);

In [None]:
### displaying best parameters
rscv_model.best_params_

In [None]:
### evaluating best model
rmsle_train = rmsle_scorer(estimator=rscv_model.best_estimator_, X=train_features, y_true=train_targets)
rmsle_valid = rmsle_scorer(estimator=rscv_model.best_estimator_, X=valid_features, y_true=valid_targets)
rmsle_train, rmsle_valid

**Training full model with best hyperparameters**

These hyperparameters were found by the course instructor with 100 iterations of randomized search.

In [None]:
### creating random forest regressor with best hyperparameters
regressor_best = RandomForestRegressor(
    n_estimators=40, min_samples_split=14, min_samples_leaf=1,
    max_samples=None, max_features=0.5, max_depth=None,
    random_state=42, n_jobs=-1)

### training algorithm
regressor_best.fit(X=train_features, y=train_targets)

### evaluating best model
rmsle_train = rmsle_scorer(estimator=regressor_best, X=train_features, y_true=train_targets)
rmsle_valid = rmsle_scorer(estimator=regressor_best, X=valid_features, y_true=valid_targets)
rmsle_train, rmsle_valid

#### Predicting on the test dataset

**Concept**

The training and test datasets must be in the same format.  
The exact same manipulations must be performed on both datasets.
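
Even after identical preprocessing, the two datasets can end up with different columns, for example when a `_missing` indicator is created for only one of them. One possible way to line them up is `DataFrame.reindex`, sketched below on made-up frames. Filling absent columns with `False` assumes they are indicator flags.

```python
import pandas as pd

# hypothetical feature frames with mismatched columns and column order
train_features = pd.DataFrame({
    "YearMade": [1999, 2005],
    "YearMade_missing": [False, False],
    "auctioneerID_missing": [True, False],
})
test_features = pd.DataFrame({
    "YearMade_missing": [False],
    "YearMade": [2010],
    "extra": [1],
})

# keep exactly the training columns, in training order;
# columns absent from the test frame are created and filled with False
aligned = test_features.reindex(columns=train_features.columns, fill_value=False)
```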

In [None]:
### importing test dataset from file
test_df: DataFrame = read_csv(
    filepath_or_buffer="data-test.csv",
    parse_dates=["saledate"],
    low_memory=False)
test_df.head()

In [None]:
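### preprocessing test dataset like the training dataset, then predicting with the best regressor
for preprocess in (appendDateInfo, cleanNumerics, cleanStrings):
    preprocess(df=test_df)
test_predicts = regressor_best.predict(test_df.reindex(columns=train_features.columns, fill_value=False))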

In [None]:
### formatting predictions as requested by Kaggle (columns SalesID, SalePrice) and exporting to csv
predicts_df = DataFrame(data={"SalesID": test_df["SalesID"], "SalePrice": test_predicts})
predicts_df.to_csv(path_or_buf="predict-test.csv", index=False)