Predicting the Sale Price of Bulldozers:
1. Problem Definition
    - How well can we predict the future sale price of bulldozers?
    - Predictions are based on each bulldozer's given characteristics
    - and on previous sale prices of similar bulldozers

2. Data https://www.kaggle.com/c/bluebook-for-bulldozers/data
    - The data comes from the Kaggle Blue Book for Bulldozers competition:
    - Train.csv is the training set, which contains data through the end of 2011.
    - Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
    - Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

3. Evaluation https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
    - The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
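
As a minimal sketch of what this metric computes, the snippet below implements RMSLE with NumPy. It assumes the log(1 + x) convention that scikit-learn's `mean_squared_log_error` uses; `rmsle_manual` is a hypothetical helper name and the prices are made-up toy values.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) on both inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Difference of logs measures the ratio between prediction and truth,
    # so a $1,000 miss matters more on a cheap machine than on an expensive one
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Toy auction prices (assumed values, for illustration only)
example_rmsle = rmsle_manual([10_000, 50_000], [11_000, 45_000])
print(example_rmsle)
```

Because the error is taken on a log scale, being 10% off on a cheap bulldozer is penalised about as much as being 10% off on an expensive one.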

4. Features (Data)
    - Kaggle provides a data dictionary describing all the features of the data set:
    - https://www.kaggle.com/c/bluebook-for-bulldozers/data?select=Data+Dictionary.xlsx

In [None]:
# 1. Regular Exploratory Data Analysis (EDA) Tools and Plotting Library  
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import sklearn

%matplotlib inline
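# "seaborn-whitegrid" was renamed "seaborn-v0_8-whitegrid" in Matplotlib 3.6
style = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(style)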

In [None]:
# Import the data
df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False,  # read the file in one pass so column dtypes are inferred consistently
                 parse_dates=["saledate"])  # parse saledate as datetime
df.info()

In [None]:
df.isna().sum()

In [None]:
# Plot sale price against sale date for the first 1,000 rows
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

In [None]:
# Sort by sale date so the rows are in chronological order
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.head()

In [None]:
df_tmp = df.copy()

#### Feature Engineering :

In [None]:
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayofWeek"] = df_tmp.saledate.dt.dayofweek
df_tmp['saleDayofYear'] = df_tmp.saledate.dt.dayofyear
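
To make the date parts above concrete, here is a small sketch on two toy dates (assumed values, not rows from the data set). It uses the same `.dt` accessors as the cell above.

```python
import pandas as pd

# Two toy sale dates around a year boundary
toy_dates = pd.Series(pd.to_datetime(["2011-12-31", "2012-01-01"]))

parts = pd.DataFrame({
    "saleYear": toy_dates.dt.year,
    "saleMonth": toy_dates.dt.month,
    "saleDay": toy_dates.dt.day,
    "saleDayofWeek": toy_dates.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    "saleDayofYear": toy_dates.dt.dayofyear,
})
print(parts)
```

Each derived column is a plain integer, which is what lets the model use the sale date after the original `saledate` column is dropped.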

In [None]:
df_tmp.T

In [None]:
# We don't need the saledate column anymore
df_tmp.drop(["saledate"], axis = 1, inplace = True)

In [None]:
df_tmp.head()

In [None]:
# trying to get more info about data 
pd.crosstab(df_tmp.saleYear, df_tmp.state)

In [None]:
# Convert all string columns to ordered category columns
for label, content in df_tmp.items():
  if pd.api.types.is_string_dtype(content):
    df_tmp[label] = content.astype('category').cat.as_ordered()

In [None]:
df_tmp.info()

In [None]:
# checking the missing data ratio
df_tmp.isna().sum()/len(df_tmp)

#### Filling Missing Values

In [None]:
# Fill missing numeric values with the column median
for label, content in df_tmp.items():
  if pd.api.types.is_numeric_dtype(content):
    if pd.isna(content).sum():
      # Record which rows were missing before filling them
      df_tmp[label + "_is_missing"] = pd.isna(content)
      df_tmp[label] = content.fillna(content.median())

In [None]:
# Turn categorical columns into numbers and flag missing values
for label, content in df_tmp.items():
  if not pd.api.types.is_numeric_dtype(content):
    df_tmp[label + "_is_missing"] = pd.isna(content)
    # pandas encodes a missing category as -1, so add 1 to make missing values 0
    df_tmp[label] = pd.Categorical(content).codes + 1
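
The `+ 1` shift above can be checked on a toy column. This is a small sketch with made-up values; `toy`, `codes` and `shifted` are illustrative names only.

```python
import pandas as pd

# Toy categorical column with one missing value
toy = pd.Series(["Low", None, "High", "Medium"])

# Categories are sorted alphabetically: High=0, Low=1, Medium=2; missing is -1
codes = pd.Categorical(toy).codes
shifted = codes + 1  # missing becomes 0, real categories start at 1

print(list(codes))    # [1, -1, 0, 2]
print(list(shifted))  # [2, 0, 1, 3]
```

After the shift every value is non-negative and 0 is reserved for "missing".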

In [None]:
df_tmp.isna().sum()

In [None]:
df_tmp

#### Splitting Data :

In [None]:
np.random.seed(42)
# Sales before 2012 form the training set; sales from 2012 form the validation set
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

# Split into X and y
X_train, y_train = df_train.drop("SalePrice", axis = 1), df_train['SalePrice']
X_valid, y_valid = df_val.drop("SalePrice", axis = 1), df_val['SalePrice']

len(X_train), len(y_train), len(X_valid), len(y_valid)
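
Note that the missing values above were filled with medians computed on the combined data before this split, so the training rows carry a little information from 2012. One possible alternative, sketched below on toy data, is to split first and fill both parts with the training median. The values are made up, and this is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) with a numeric column that has gaps
toy = pd.DataFrame({
    "saleYear": [2010, 2011, 2011, 2012, 2012],
    "MachineHoursCurrentMeter": [100.0, np.nan, 300.0, np.nan, 5000.0],
})

# Split by year first, as in the cell above
train = toy[toy.saleYear != 2012].copy()
valid = toy[toy.saleYear == 2012].copy()

# Compute the median on the training rows only
train_median = train["MachineHoursCurrentMeter"].median()

# Flag and fill both parts with the training median
for part in (train, valid):
    part["MachineHoursCurrentMeter_is_missing"] = part["MachineHoursCurrentMeter"].isna()
    part["MachineHoursCurrentMeter"] = part["MachineHoursCurrentMeter"].fillna(train_median)

print(train_median)
print(valid)
```

Here the 2012 value of 5000.0 has no influence on the fill value, which keeps the validation score an honest estimate.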

#### Modelling :

In [None]:
np.random.seed(42)

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs= -1)

# Fit the model on the training data
model.fit(X_train, y_train)

In [None]:
# For regressors, score() returns the coefficient of determination (R^2)
model.score(X_valid, y_valid)

#### Evaluation 
- We will use the Root Mean Squared Log Error (RMSLE), the competition's evaluation metric

In [None]:
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

# Root Mean Squared Log Error
def rmsle(y_test, y_preds):
  """
  Calculate the Root Mean Squared Log Error between predictions and true labels.
  """
  return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Evaluate a model on a few different metrics
def model_scores(model):
  train_preds = model.predict(X_train)
  valid_preds = model.predict(X_valid)
  scores = {"Training MAE" : mean_absolute_error(y_train, train_preds),
           "Valid MAE" : mean_absolute_error(y_valid, valid_preds),
           "Training RMSLE" : rmsle(y_train, train_preds),
           "Valid RMSLE" : rmsle(y_valid, valid_preds),
           "Training R^2" : r2_score(y_train, train_preds),
           "Valid R^2" : r2_score(y_valid, valid_preds)}
  return scores

In [None]:
%%time
# Limit max_samples so each tree trains on 10,000 rows, which reduces fitting time
np.random.seed(42)
model = RandomForestRegressor(n_jobs=-1, random_state=42, max_samples=10000)


model.fit(X_train, y_train)

In [None]:
%%time
model_scores(model)

#### Hyperparameter Tuning
- RandomizedSearchCV

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Make a grid of hyperparameters
rf_grid = {"n_estimators" : np.arange(10,100,10),
           "max_depth" : [None, 3, 5, 10],
           "min_samples_split" : np.arange(2, 20, 2),
           "min_samples_leaf" : np.arange(1,20,2),
           # 1 (int) means a single feature; 1.0 (float) means all features and
           # replaces "auto", which was removed in scikit-learn 1.3
           "max_features" : [0.5, 1, "sqrt", 1.0],
           "max_samples" : [10000]}

rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs= -1, random_state= 42),
                              param_distributions = rf_grid, 
                              n_iter = 2,
                              cv = 5,
                              verbose = True)

rs_model.fit(X_train, y_train)
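
The search above uses `cv = 5`, which gives 5 contiguous, unshuffled folds, so some folds train on later sales and validate on earlier ones. For time-ordered data, one possible alternative is scikit-learn's `TimeSeriesSplit`, where every fold trains on earlier rows and validates on later ones. The sketch below only shows how the folds are formed on ten toy rows that are assumed to be sorted by date; it is not part of the original search.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten toy rows, assumed to be in chronological order already
X_toy = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for train_idx, test_idx in folds:
    # Every training index comes before every validation index
    print("train:", train_idx, "validate:", test_idx)
```

Since `df` was sorted by `saledate` before the split, an object like `TimeSeriesSplit(n_splits=5)` could be passed as the `cv` argument of `RandomizedSearchCV` in place of the integer.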

In [None]:
# getting best parameters :
rs_model.best_params_

In [None]:
# evaluating score 
model_scores(rs_model)

#### Training a model with the best hyperparameters
Note: These parameters were found after 100 iterations of RandomizedSearchCV (n_iter = 100), not the 2 iterations run above.

In [None]:
%%time
# Instantiate the model with the best hyperparameters

ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf = 1,
                                    min_samples_split = 14,
                                    max_features = 0.5,
                                    n_jobs = -1,
                                    max_samples = None, 
                                    random_state = 42)

In [None]:
%%time
# Fitting Data to our ideal model 
ideal_model.fit(X_train, y_train)

In [None]:
# Model Score 
model_scores(ideal_model)

#### Importing Test data CSV

In [None]:
df_test = pd.read_csv("../input/bluebook-for-bulldozers/Test.csv",
                      low_memory = False, 
                      parse_dates = ["saledate"])
df_test.head()

In [None]:
# Preprocess the test data
def preprocess_data(df_test):
  """
  Preprocess the data and transform it to match the format of our training and validation data.
  """
  # Derive the date parts from the test set's own saledate column
  df_test["saleYear"] = df_test.saledate.dt.year
  df_test["saleMonth"] = df_test.saledate.dt.month
  df_test["saleDay"] = df_test.saledate.dt.day
  df_test["saleDayofWeek"] = df_test.saledate.dt.dayofweek
  df_test["saleDayofYear"] = df_test.saledate.dt.dayofyear

  df_test.drop("saledate", axis = 1, inplace = True)

  for label, content in df_test.items():
    # Fill missing numeric values with the median
    if pd.api.types.is_numeric_dtype(content):
      if pd.isna(content).sum():
        df_test[label + "_is_missing"] = pd.isna(content)
        df_test[label] = content.fillna(content.median())
    # Flag missing categorical values and turn categories into numbers
    if not pd.api.types.is_numeric_dtype(content):
      df_test[label + "_is_missing"] = pd.isna(content)
      df_test[label] = pd.Categorical(content).codes + 1
  return df_test
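
One caveat with the function above: `pd.Categorical(content).codes` is computed from the test set alone, so the same category can receive a different number than it did in the training data. A possible alternative, sketched below on toy values, is to encode the test column against the training categories. The names `train_state` and `test_state` are illustrative, and this is not what the notebook does.

```python
import pandas as pd

# Toy training and test columns (assumed values)
train_state = pd.Series(["Texas", "Florida", "Ohio"]).astype("category")
test_state = pd.Series(["Texas", "Alaska", None])

# Independent encoding: categories come from the test values only
independent = pd.Categorical(test_state).codes + 1

# Shared encoding: reuse the categories learned from the training column;
# unseen categories ("Alaska") and missing values both become 0
shared = pd.Categorical(test_state, categories=train_state.cat.categories).codes + 1

train_codes = train_state.cat.codes + 1
print(list(train_codes))  # [3, 1, 2]
print(list(independent))  # [2, 1, 0]
print(list(shared))       # [3, 0, 0]
```

With the shared encoding, "Texas" maps to 3 in both sets, whereas the independent encoding gives it 2 in the test set.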

In [None]:
df_test_processed = preprocess_data(df_test)
df_test_processed

In [None]:
# Use set() to find columns that are in X_train but missing from the test data
set(X_train.columns) - set(df_test_processed.columns)
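
The set difference above reports which columns are missing but not how to add them in the right order. As a sketch, `DataFrame.reindex` can add any missing columns and match the training column order in one step. The toy frames below are assumptions for illustration, and `fill_value=False` assumes every missing column is an `_is_missing` flag.

```python
import pandas as pd

# Toy stand-ins for X_train and the processed test data
train_like = pd.DataFrame({"SalesID": [1],
                           "auctioneerID": [3.0],
                           "auctioneerID_is_missing": [False]})
test_like = pd.DataFrame({"auctioneerID": [2.0], "SalesID": [9]})

# Columns present in training but absent from the test data
missing = set(train_like.columns) - set(test_like.columns)

# Add missing columns filled with False and match the training column order
aligned = test_like.reindex(columns=train_like.columns, fill_value=False)
print(missing)
print(aligned)
```

Matching the column order matters because recent scikit-learn versions check feature names against those seen during `fit`.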

In [None]:
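# Add the "auctioneerID_is_missing" column (the test set has no missing auctioneerID values)
df_test_processed["auctioneerID_is_missing"] = False
# Match the column order used during training; scikit-learn checks feature names at predict time
df_test_processed = df_test_processed[X_train.columns]
df_test_processed.head()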

In [None]:
# Make predictions using the processed test data
test_preds = ideal_model.predict(df_test_processed)
test_preds

In [None]:
# Format the predictions into a DataFrame
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test_processed['SalesID']
df_preds["SalesPrice"] = test_preds
df_preds

#### Feature Importance :
- Find out which features of the data matter most for predicting SalePrice

In [None]:
ideal_model.feature_importances_

In [None]:
# Function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))
    
    # Plot the dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()

In [None]:
# plot_features draws the chart and returns nothing, so there is no result to store
plot_features(X_train.columns, ideal_model.feature_importances_)

In [None]:
df["Enclosure"].value_counts()