# Predicting the sale price of bulldozers using machine learning

## 1. Problem Definition

> How well can we predict the future sale price of a bulldozer, given its characteristics and previous sales records? (This is a regression problem.)

## 2. Data

The data comes from Kaggle's Blue Book for Bulldozers competition:
https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are three main datasets:
- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.
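
The split described above is chronological: the model is trained on older sales and judged on newer ones. As a minimal sketch of that idea, the toy frame below (its dates and prices are made up, and the names `sales`, `toy_train` and `toy_valid` are only for illustration) is cut at the start of 2012:

```python
import pandas as pd

# Toy sale records; the dates and prices are made up for illustration
sales = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2011-06-15", "2012-02-01", "2011-12-31", "2012-04-30"]
    ),
    "SalePrice": [30_000, 45_000, 27_500, 60_000],
})

# Everything through the end of 2011 is training data; 2012 sales are held out
cutoff = pd.Timestamp("2012-01-01")
toy_train = sales[sales.saledate < cutoff]
toy_valid = sales[sales.saledate >= cutoff]

print(len(toy_train), len(toy_valid))
```

Splitting by date rather than at random keeps future sales out of the training data, which is closer to how the model would be used in practice.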

## 3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices. For more on the evaluation check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
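
As a rough illustration of the metric, RMSLE can be computed by hand as the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The helper name `rmsle_by_hand` and the sample prices are made up for this sketch, and the log(1 + x) form follows scikit-learn's `mean_squared_log_error`, which is assumed here to match the competition's definition:

```python
import numpy as np

def rmsle_by_hand(y_true, y_pred):
    """Root mean squared log error using log(1 + x), as in scikit-learn."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up auction prices in USD, for illustration only
actual = [10_000, 50_000]
doubled = [20_000, 100_000]   # every prediction is twice the actual price

score = rmsle_by_hand(actual, doubled)
perfect = rmsle_by_hand(actual, actual)
print(round(score, 3), perfect)
```

Because the error is measured on a log scale, doubling a 10,000 USD price is penalised about as much as doubling a 50,000 USD price: the metric cares about relative rather than absolute error.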

## 4. Features

| Variable | Description |
| --- | --- |
| SalesID | Unique identifier of a particular sale of a machine at auction |
| MachineID | Identifier for a particular machine; machines may have multiple sales |
| ModelID | Identifier for a unique machine model (i.e. fiModelDesc) |
| datasource | Source of the sale record; some sources are more diligent about reporting attributes of the machine than others. Note that a particular datasource may report on multiple auctioneerIDs. |
| auctioneerID | Identifier of a particular auctioneer, i.e. the company that sold the machine at auction. Not the same as datasource. |
| YearMade | Year of manufacture of the machine |
| MachineHoursCurrentMeter | Current usage of the machine in hours at time of sale (saledate); null or 0 means no hours have been reported for that sale |
| UsageBand | Value (low, medium, high) calculated by comparing this particular machine-sale's hours to the average usage for the fiBaseModel; e.g. 'Low' means this machine has fewer hours, given its lifespan, relative to the average of fiBaseModel. |
| saledate | Time of sale |
| SalePrice | Cost of sale in USD |
| fiModelDesc | Description of a unique machine model (see ModelID); concatenation of fiBaseModel & fiSecondaryDesc & fiModelSeries & fiModelDescriptor |
| fiBaseModel | Disaggregation of fiModelDesc |
| fiSecondaryDesc | Disaggregation of fiModelDesc |
| fiModelSeries | Disaggregation of fiModelDesc |
| fiModelDescriptor | Disaggregation of fiModelDesc |
| ProductSize | Unknown (no description available) |
| ProductClassDesc | Description of 2nd level hierarchical grouping (below ProductGroup) of fiModelDesc |
| State | US state in which the sale occurred |
| ProductGroup | Identifier for top-level hierarchical grouping of fiModelDesc |
| ProductGroupDesc | Description of top-level hierarchical grouping of fiModelDesc |
| Drive_System | Machine configuration - typically describes whether 2 or 4 wheel drive |
| Enclosure | Machine configuration - does the machine have an enclosed cab or not |
| Forks | Machine configuration - attachment used for lifting |
| Pad_Type | Machine configuration - type of treads a crawler machine uses |
| Ride_Control | Machine configuration - optional feature on loaders to make the ride smoother |
| Stick | Machine configuration - type of control |
| Transmission | Machine configuration - describes type of transmission; typically automatic or manual |
| Turbocharged | Machine configuration - engine naturally aspirated or turbocharged |
| Blade_Extension | Machine configuration - extension of standard blade |
| Blade_Width | Machine configuration - width of blade |
| Enclosure_Type | Machine configuration - does the machine have an enclosed cab or not |
| Engine_Horsepower | Machine configuration - engine horsepower rating |
| Hydraulics | Machine configuration - type of hydraulics |
| Pushblock | Machine configuration - option |
| Ripper | Machine configuration - implement attached to machine to till soil |
| Scarifier | Machine configuration - implement attached to machine to condition soil |
| Tip_control | Machine configuration - type of blade control |
| Tire_Size | Machine configuration - size of primary tires |
| Coupler | Machine configuration - type of implement interface |
| Coupler_System | Machine configuration - type of implement interface |
| Grouser_Tracks | Machine configuration - describes ground contact interface |
| Hydraulics_Flow | Machine configuration - normal or high flow hydraulic system |
| Track_Type | Machine configuration - type of treads a crawler machine uses |
| Undercarriage_Pad_Width | Machine configuration - width of crawler treads |
| Stick_Length | Machine configuration - length of machine digging implement |
| Thumb | Machine configuration - attachment used for grabbing |
| Pattern_Changer | Machine configuration - can adjust the operator control configuration to suit the user |
| Grouser_Type | Machine configuration - type of treads a crawler machine uses |
| Backhoe_Mounting | Machine configuration - optional interface used to add a backhoe attachment |
| Blade_Type | Machine configuration - describes type of blade |
| Travel_Controls | Machine configuration - describes operator control configuration |
| Differential_Type | Machine configuration - differential type, typically locking or standard |
| Steering_Controls | Machine configuration - describes operator control configuration |


## 5. Modelling
## 6. Experimentation

In [None]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df.saledate[:1000], df.SalePrice[:1000])
plt.show()

In [None]:
df.SalePrice.plot.hist()
plt.show()

In [None]:
df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv"
                ,low_memory=False
                ,parse_dates=["saledate"])
df.head()

In [None]:
df.saledate.dtype

In [None]:
df.saledate

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df.saledate[:1000], df.SalePrice[:1000])
plt.show()

In [None]:
df.head()

In [None]:
df.head().T

In [None]:
df.saledate.head(20)

In [None]:
df.sort_values(by=["saledate"]
              ,ascending=True
              ,inplace=True)

In [None]:
df.saledate.head(20)

In [None]:
df_temp = df.copy()

In [None]:
df_temp.head().T

In [None]:
df.saledate.head(20)

In [None]:
df_temp["saleYear"] = df_temp.saledate.dt.year
df_temp.saleYear.head()

In [None]:
df_temp["saleMonth"] = df_temp.saledate.dt.month
df_temp.saleMonth.head()

In [None]:
df_temp["saleDay"] = df_temp.saledate.dt.day
df_temp.saleDay.head()

In [None]:
df_temp["saleDayOfWeek"] = df_temp.saledate.dt.dayofweek
df_temp.saleDayOfWeek.head()

In [None]:
df_temp["saleDayOfYear"] = df_temp.saledate.dt.dayofyear
df_temp.saleDayOfYear.head()

In [None]:
df_temp.drop("saledate",axis=1, inplace=True)

In [None]:
df_temp.state.value_counts()

In [None]:
df_temp.info()

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1
                             ,random_state=42)

In [None]:
pd.api.types.is_string_dtype(df_temp.UsageBand)

In [None]:
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        df_temp[label] = content.astype("category").cat.as_ordered()
        
df_temp.info()

In [None]:
df_temp.state.cat.codes

In [None]:
df_temp.isna().sum()/len(df_temp) * 100

In [None]:
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            df_temp[label+"_is_missing"] = pd.isnull(content)
            df_temp[label]=content.fillna(content.median())
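
The cell above fills each numeric gap with the column median and records where the gap was in an `_is_missing` column. The sketch below shows the same pattern on toy data, with one variation that this notebook does not use: the median is taken from the training rows only and then reused for the validation rows. The frames `toy_train` and `toy_valid` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames; the values are made up for illustration
toy_train = pd.DataFrame({"MachineHoursCurrentMeter": [100.0, np.nan, 300.0, 500.0]})
toy_valid = pd.DataFrame({"MachineHoursCurrentMeter": [np.nan, 900.0]})

col = "MachineHoursCurrentMeter"
train_median = toy_train[col].median()  # NaN is skipped: median of 100, 300, 500

for frame in (toy_train, toy_valid):
    # Record where the value was missing before it is filled
    frame[col + "_is_missing"] = frame[col].isna()
    # Fill with the median learned from the training rows only
    frame[col] = frame[col].fillna(train_median)

print(toy_valid)
```

Keeping the indicator column lets the model learn whether "value was missing" is itself informative, while the median fill keeps the column numeric.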

In [None]:
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
df_temp.auctioneerID_is_missing.value_counts()

In [None]:
df_temp.MachineHoursCurrentMeter_is_missing.value_counts()

In [None]:
df_temp.info()

In [None]:
df_temp.isna().sum()

In [None]:
pd.Categorical(df_temp.state).codes

In [None]:
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        df_temp[label+"_is_missing"]=pd.isnull(content)
        df_temp[label]=pd.Categorical(content).codes + 1
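
The `+ 1` in the cell above is there because pandas encodes a missing category as `-1`; shifting every code by one turns missing values into `0` and keeps all codes non-negative. A small sketch on a made-up column (the state names are sample values only):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; the state names are sample data
state = pd.Series(["Texas", np.nan, "Alabama", "Texas"])

codes = pd.Categorical(state).codes   # missing values are coded as -1
shifted = codes + 1                   # adding 1 maps missing to 0

print(list(codes), list(shifted))
```

The categories are sorted alphabetically, so "Alabama" gets code 0 and "Texas" gets code 1 before the shift.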

In [None]:
df_temp.isna().sum()

In [None]:
model

In [None]:
df_temp.info()

In [None]:
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
%%time
model.fit(df_temp.drop("SalePrice", axis=1), df_temp.SalePrice)

In [None]:
model.score(df_temp.drop("SalePrice", axis=1), df_temp.SalePrice)

In [None]:
df_temp.head()

In [None]:
df_temp.saleYear

In [None]:
df_val = df_temp[df_temp.saleYear == 2012]
df_train = df_temp[df_temp.saleYear != 2012]

In [None]:
df_val.head()

In [None]:
df_train.head()

In [None]:
len(df_train), len(df_val)

In [None]:
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train.SalePrice
X_val, y_val = df_val.drop("SalePrice", axis=1), df_val.SalePrice

X_train.shape, y_train.shape, X_val.shape, y_val.shape

In [None]:
%%time
model.fit(X_train, y_train)

In [None]:
model.score(X_val, y_val)

In [None]:
from sklearn.metrics import mean_squared_log_error, r2_score, mean_absolute_error


def rmsle(y_test, y_preds):
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

def show_scores(model, X_train, X_valid, y_train, y_valid):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    
    scores = {
        "MAE train":mean_absolute_error(y_train, train_preds),
        "MAE valid":mean_absolute_error(y_valid, val_preds),
        "RMSLE train":rmsle(y_train, train_preds),
        "RMSLE valid":rmsle(y_valid, val_preds),
        "R2 train":r2_score(y_train, train_preds),
        "R2 valid":r2_score(y_valid, val_preds)
    }
    
    return scores 

In [None]:
show_scores(model, X_train, X_val, y_train, y_val)

In [None]:
model.set_params(max_samples=10000)

In [None]:
show_scores(model, X_train, X_val, y_train, y_val)

In [None]:
%%time
model.fit(X_train, y_train)

In [None]:
%%time
show_scores(model, X_train, X_val, y_train, y_val)
#out of the box scores

In [None]:
# %%time
# from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# rs_grid = {
#     "n_estimators":np.arange(10,100,10),
#     "max_depth":[None, 3, 5, 10],
#     "min_samples_split":np.arange(2,20,2),
#     "min_samples_leaf":np.arange(1,20,2),
#     "max_features":[0.5, 1, "sqrt", "auto"],
#     "max_samples":[10_000],
# }

# rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=42, n_jobs=-1),
#                              param_distributions=rs_grid,
#                              n_iter=1000,
#                              cv=5,
#                              verbose=2)

In [None]:
# %%time
# rs_model.fit(X_train, y_train)
# rs_model.best_params_

- n_iter = 10

{'n_estimators': 30,
 'min_samples_split': 12,
 'min_samples_leaf': 5,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': None}
 
 RMSLE: 0.256

- n_iter = 1000

{'n_estimators': 80,
 'min_samples_split': 4,
 'min_samples_leaf': 1,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': None}

In [None]:
model

In [None]:
model.set_params(max_samples=None)

In [None]:
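# "auto" (the value found by the search) meant "use all features" for regressors
# and was removed in scikit-learn 1.3; 1.0 is the equivalent setting
model.set_params(n_estimators=80, min_samples_split=4, min_samples_leaf=1, max_features=1.0, max_depth=None)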

In [None]:
model

In [None]:
%%time
model.fit(X_train, y_train)
model.score(X_val, y_val)


In [None]:
show_scores(model, X_train, X_val, y_train, y_val)

In [None]:
df_test = pd.read_csv("../input/bluebook-for-bulldozers/Test.csv", low_memory=False, parse_dates=["saledate"])
df_test.head()

In [None]:
df_test.info()

In [None]:
df_test.isna().sum()

In [None]:
def preprocess_data(df):
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    
    df.drop("saledate", axis=1, inplace=True)
    
    for label, content in df.items():
        if pd.api.types.is_string_dtype(content):
            df[label] = content.astype("category").cat.as_ordered()
            
    for label, content in df.items():
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"]=pd.isnull(content)
            df[label]=pd.Categorical(content).codes + 1
            
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                df[label+"_is_missing"]=pd.isnull(content)
                df[label]=content.fillna(content.median())
    
    return df

In [None]:
df_test = preprocess_data(df_test)

In [None]:
df_test.info()

In [None]:
for label, content in df_test.items():
        if pd.api.types.is_string_dtype(content):
            print(label)

In [None]:
for label, content in df_test.items():
        if pd.api.types.is_numeric_dtype(content):
            print(label)

In [None]:
df_test.isna().sum()

In [None]:
df_test.head()

In [None]:
df_test.info()

In [None]:
X_train.head()

In [None]:
df_test.head()

In [None]:
model

In [None]:
set(X_train.columns)-set(df_test.columns)
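
The set difference above lists the training columns that the preprocessed test frame lacks. One general way to line the two frames up, sketched here on toy data, is `DataFrame.reindex`, which adds any absent columns and also puts the columns in the training order. The names `toy_X_train`, `toy_test` and `toy_test_aligned` are illustrative, and filling with `False` is only sensible when every absent column is a missing-value indicator.

```python
import pandas as pd

# Toy stand-ins for the training features and the preprocessed test frame;
# the names and values are illustrative only
toy_X_train = pd.DataFrame({
    "SalesID": [1, 2],
    "auctioneerID": [3.0, 2.0],
    "auctioneerID_is_missing": [False, True],
    "saleYear": [2010, 2011],
})
toy_test = pd.DataFrame({
    "saleYear": [2012],
    "SalesID": [9],
    "auctioneerID": [1.0],
})

missing_cols = set(toy_X_train.columns) - set(toy_test.columns)

# reindex adds the absent columns (filled with False) and puts every
# column in the same order as the training features
toy_test_aligned = toy_test.reindex(columns=toy_X_train.columns, fill_value=False)

print(missing_cols)
print(list(toy_test_aligned.columns))
```

Recent scikit-learn versions check feature names at prediction time, so matching both the names and the order of the training columns avoids errors there.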

In [None]:
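# Add the missing-value indicator that preprocess_data did not create for the test set
df_test["auctioneerID_is_missing"] = False

In [None]:
# Put the columns in the same order as the training features
df_test = df_test[X_train.columns]
df_test.head()

In [None]:
%%time
test_preds = model.predict(df_test)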

In [None]:
test_preds

In [None]:
len(test_preds)

In [None]:
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds.to_csv("./output.csv", index=False)