# Avocado Price Prediction Regression Models

## Table of Contents

1. Problem Statement
2. Importing Libraries
3. Data
  - Loading data
  - Description of data columns
  - Understanding data - Pre-Profiling
4. Exploratory Data Analysis
  - Pre processing data
      - Handling missing values
      - Type conversions
      - Feature engineering
      - Transforming exploratory variable
  - Post profiling
5. Modelling using sklearn
  - Data Preparation
    - Splitting data as train and test
    - Scaling and encoding
  - Building Models
  - Model Predictions
6. Model Evaluations
7. Model Plotting
    - Comparing models
8. Conclusions
    - Analyzing and finalizing best-fit model
  


## 1. Problem Statement
---

Historical data on avocado prices and sales volume in multiple US markets is given, along with other factors such as Date, AveragePrice, Total Volume, Total Bags, Year and Type.

The goal is to predict the average price of an avocado using the best regression model among Linear Regression, Decision Tree Regressor and Random Forest Regressor.

## 2. Importing Libraries
---

In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# Setting options
np.set_printoptions(precision=4)                 # To display values up to four decimal places.
plt.style.use('seaborn-whitegrid')               # To apply seaborn whitegrid style to the plots.
plt.rc('figure', figsize=(20, 12))               # Set the default figure size of plots.
sns.set(style='whitegrid')                       # To apply whitegrid style to the plots.
warnings.filterwarnings('ignore')                # To ignore warnings, if any

## 3. Data
---

### Loading data...

In [None]:
# Importing the dataset as data
data = pd.read_csv('../input/avocado-prices/avocado.csv', index_col=0)
data.sample(8)    # Preview of random 8 rows 

### Description of data columns


<p>
The dataset contains information about Hass avocados.

It covers historical data on avocado prices and sales volume in multiple US markets. Variables present in the dataset include Date, AveragePrice, Total Volume, Total Bags, Year, Type, etc.

The dataset comprises 18249 observations of 14 columns. The table below lists the column names and their descriptions.
</p>

|Column|Description|
|--:|:--|
|**Date**|The date of the observation|
|**AveragePrice**|Average price of a single avocado - ***Target Variable***|
|**Total Volume**|Total number of avocados sold|
|**4046**|Total number of avocados with PLU 4046 - *Small/Medium Hass Avocado (\~3-5oz avocado)* sold|
|**4225**|Total number of avocados with PLU 4225 - *Large Hass Avocado (\~8-10oz avocado)* sold|
|**4770**|Total number of avocados with PLU 4770 - *Extra Large Hass Avocado (\~10-15oz avocado)* sold|
|**type**|Conventional or Organic|
|**year**|Year of observation|
|**Region**|City or region of the observation|





In [None]:
data.shape       # Number of (records, features) of data

In [None]:
data.info()      # Info of data

In [None]:
data.describe()     # Descriptive statistics of data

* There are a total of 18249 rows and 13 columns.
* From `info`, we can infer that there are no missing values.
* The target variable `AveragePrice` looks approximately normally distributed, as its mean and median (50th percentile) are almost equal, but it appears slightly right-skewed.



### Understanding data - Pre-Profiling

In [None]:
pre_profile = data.profile_report(title='Avocado Pre-Profiling')   # Performing Pre Profiling on data.

In [None]:
pre_profile.to_file('pre-profiling.html')                          # Saving report to pre-profiling.html

In [None]:
# pre_profile.to_notebook_iframe()                                 # Displaying the profiling report inline. 

**Profiling before Data Processing** <br><br>
__Dataset info__:
- Number of variables: 13
- Number of observations: 18249
- Missing cells: 0


__Variables types__: 
- Numeric: 10
- Categorical: 3

__Observations__: 
* There seems to be a problem with the __index__, as there are only 53 unique values for 18000+ records
* The __target variable__ is normally distributed, with a slight right skew
* __Conventional__ and __organic__ avocado types are equally represented
* __Region__ and __Date__ are uniformly distributed and have high cardinality
* Most variables, such as _4046, 4225, Total Bags, Small Bags, Large Bags_, are highly correlated with **Total Volume**

## 4. Exploratory Data Analysis
---

###  Pre processing data

* __Handling issues found in pre-profiling__
* __Preparing data for modelling__
      - Handling missing values
      - Type conversions
      - Feature engineering
      - Transforming exploratory variable
      
---

From the above observations we will:
- Reset the index
- Rename features for convenience
- Alter the types of some features
- Feature engineer some columns
- Drop ineffective features
- Drop highly correlated features
- Drop records with right skewed target variable

Fixing issues with index

In [None]:
# Unique values in data index - doing this as profiling shows there are zeros in index
print('No. of unique index values:', data.index.nunique())                          

In [None]:
data.reset_index(drop=True, inplace=True)    # Resetting index, as the index values seem to be incorrect

# Unique values in data index after resetting
print('No. of unique index values after resetting index: ', data.index.nunique())   
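A repeating index is what one would see if the file's first column restarts its count for each block of rows. That is an assumption here, not something verified against the file. A minimal sketch on a toy frame (made-up values) shows how such an index can be detected and replaced:

```python
import pandas as pd

# Toy frame (hypothetical values) whose index restarts, mimicking the symptom above
toy = pd.DataFrame({'AveragePrice': [1.1, 1.3, 0.9, 1.6]}, index=[0, 1, 0, 1])

print(toy.index.is_unique)    # False: labels 0 and 1 each appear twice
print(toy.index.nunique())    # 2 unique labels for 4 rows

# Discard the old labels and use a fresh 0..n-1 range instead
toy_reset = toy.reset_index(drop=True)
print(toy_reset.index.is_unique)    # True
```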

Rename columns for convenience

In [None]:
# Renaming column names
data.rename(columns={'4046':'PLU_4046','4225':'PLU_4225','4770':'PLU_4770'}, inplace=True) # Renaming size as per description
# Renaming columns to remove spaces and capitalize first letter
data.columns = data.columns.str.replace(' ','').map(lambda x : x[0].upper() + x[1:]) 
data.head(2)  # Preview of column header

Working with type of features...

In [None]:
data.dtypes # Looking for data types

There are 3 categorical columns:
* __Type__ has only two values, which are evenly distributed
* __Date__ and __Region__ have high cardinality, so we will decide how to handle them below

Converting `Date` to `datetime` from `object`

Converting `Year` to `object` from `numeric`

In [None]:
data['Date'] = pd.to_datetime(data['Date'])    # Converting date to datetime type
data['Year'] = data['Year'].astype('object')   # Converting Year to object from numeric

Deriving some insightful columns from __Date__ - like 'Season', 'Month', 'Quarter'

In [None]:
# Utility / Helper Function - To categorize season based on date

def categorizing_seasons(date):
    month = date.month

    # Source - https://en.wikipedia.org/wiki/Season#Meteorological
    winter, spring, summer, autumn = ([12, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11])
    if month in winter:
        return 'Winter'
    elif month in spring:
        return 'Spring'
    elif month in summer:
        return 'Summer'
    else:
        return 'Autumn'

In [None]:
data['Month'] = data['Date'].dt.month_name()             # Deriving Month from Date
data['Quarter'] = data['Date'].dt.quarter                # Deriving Quarter from Date
data['Season'] = data['Date'].map(categorizing_seasons)  # Deriving Season from Date
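As an alternative to calling a Python function per row, the same month-to-season grouping can be written as a dictionary lookup on `dt.month`. The sketch below uses a hypothetical `MONTH_TO_SEASON` mapping and a few made-up dates; it is not part of the original notebook.

```python
import pandas as pd

# Meteorological seasons keyed by month number (same grouping as the helper above)
MONTH_TO_SEASON = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

# Made-up dates, one per season
dates = pd.Series(pd.to_datetime(['2017-01-15', '2017-04-02', '2017-07-30', '2017-11-05']))

# Vectorized lookup: extract the month number, then map it to a season label
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())    # ['Winter', 'Spring', 'Summer', 'Autumn']
```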

Analyzing how `AveragePrice` varies w.r.t `Month`, `Quarter`, `Season`. 

In [None]:
# Utility / Helper Function - To update the variables as per data

def get_variables_from_data():
    # Target Variables
    y_column = 'AveragePrice'                                          
     
    # Categorical Feature variables 
    X_columns_cat = list(data.dtypes[data.dtypes.values == 'object'].index)  

    # Numeric Feature variables
    X_columns_num = list(data.dtypes[(data.dtypes.values != 'object') & (data.dtypes.index != y_column)].index)    

    # Feature variables
    X_columns = X_columns_num + X_columns_cat
    
    print('y_column:', y_column)
    print('X_columns: ',X_columns) 
    print('X_columns_num: ',X_columns_num) 
    print('X_columns_cat: ',X_columns_cat) 
    
    # Returning as a tuple
    return y_column, X_columns, X_columns_num, X_columns_cat
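The same split into categorical and numeric feature names can also be sketched with `DataFrame.select_dtypes`. The toy frame below uses made-up values and only a few of the notebook's column names; the string columns are cast to `object` explicitly so the selection behaves the same regardless of the pandas default string dtype.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring a few of the notebook's columns
toy = pd.DataFrame({
    'AveragePrice': [1.2, 1.5],
    'TotalVolume': [1000.0, 2500.0],
    'Type': ['conventional', 'organic'],
    'Region': ['Albany', 'Boston'],
}).astype({'Type': 'object', 'Region': 'object'})

target = 'AveragePrice'

# Categorical features: every object-typed column
cat_cols = toy.select_dtypes(include='object').columns.tolist()

# Numeric features: every non-object column except the target
num_cols = toy.select_dtypes(exclude='object').columns.drop(target).tolist()

print(cat_cols)    # ['Type', 'Region']
print(num_cols)    # ['TotalVolume']
```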

In [None]:
# Updating Variables
y_column, X_columns, X_columns_num, X_columns_cat = get_variables_from_data()

In [None]:
data.groupby('Month')[y_column].agg(['max', 'mean', 'min'])   # Understanding TV w.r.t 'Month'

In [None]:
data.groupby('Quarter')[y_column].agg(['max', 'mean', 'min'])  # Understanding TV w.r.t 'Quarter'

In [None]:
data.groupby('Season')[y_column].agg(['max', 'mean', 'min'])  # Understanding TV w.r.t 'Season'

In [None]:
f, ax = plt.subplots(1, 3, figsize=(15,5))
f.suptitle('Spread of mean AveragePrice Over Season, Quarter and Month', fontsize=16)
data.groupby('Season')[y_column].mean().plot(kind='bar',ax=ax[0])
data.groupby('Quarter')[y_column].mean().plot(kind='bar',ax=ax[1])
data.groupby('Month')[y_column].mean().plot(kind='bar',ax=ax[2])

__Observations__

*   Average price drops in the months of December, January, February, May, June, July
*   There is not much variance in price w.r.t. Quarter, so we can drop this column.
*   Avocado prices drop more in winter than in any other season. As `Season` is correlated with `Month`, which gives more information, we can drop `Season` and `Date` and keep `Month` as an important feature.




In [None]:
X_columns          # Preview of existing Feature columns 

In [None]:
# Replacing Date with the lower-cardinality Month column
data.drop(columns=['Date', 'Season', 'Quarter'], inplace=True)   # Dropping Date, Quarter and Season columns

In [None]:
# Updating Variables
y_column, X_columns, X_columns_num, X_columns_cat = get_variables_from_data()

In [None]:
f, ax =  plt.subplots(1, 2, figsize=(15, 8))
f.suptitle('Box plot on Target Variable and Target Variable Distribution - Before', fontsize=16)
sns.boxplot(y=y_column, data=data, ax=ax[0]) # Box plot on TV before dropping extreme values
sns.distplot(data[y_column], ax=ax[1])       # Distribution of Target Variable

In [None]:
# Checking mean|median and limiting data to 2 * (mean|median) - To eliminate extreme right values
data[y_column].describe()                  

We will remove extreme values where the average price is above 2.8, which makes the target variable more symmetric.

In [None]:
data.drop(data[data[y_column] > 2.8].index, inplace=True) # Dropping records where price > 2.8
print(data.shape)                                         # Shape of data after dropping few records
data.sample(5)                                            # Preview of data after dropping few records
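The comment above mentions limiting the data to about twice the mean or median. A small sketch of that rule on made-up prices, using a boolean mask rather than `drop`, might look like this (the numbers are illustrative and unrelated to the 2.8 cutoff chosen in the notebook):

```python
import pandas as pd

# Hypothetical prices; the last value plays the role of an extreme right-tail record
prices = pd.Series([0.9, 1.1, 1.4, 1.5, 1.8, 3.2], name='AveragePrice')

# Cutoff at twice the median of the toy prices
cutoff = 2 * prices.median()

# Keep only the rows at or below the cutoff
kept = prices[prices <= cutoff]

print(round(cutoff, 2))    # 2.9
print(len(kept))           # 5
```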

In [None]:
f, ax =  plt.subplots(1, 2, figsize=(15, 8))
f.suptitle('Box plot on Target Variable and Target Variable Distribution - After', fontsize=16)
sns.boxplot(y=y_column, data=data, ax=ax[0]) # Box plot on TV after dropping extreme values
sns.distplot(data[y_column], ax=ax[1])       # Distribution of Target Variable

EDA
How prices vary w.r.t. Region

Price and Type relation


Separate X and y 
  - Do correlation and drop a few columns
  - Joint plot
  - Dist Plot
  - Linear relation among X and y

In [None]:
data.head()   # Preview of data

In [None]:
# Density of mean price w.r.t categorical columns
f, ax = plt.subplots(2,2)
for x_var, subplot in zip(X_columns_cat, ax.flatten()):
    subplot.set_xlabel(x_var)
    data.groupby(x_var)[y_column].mean().plot(kind='kde', ax=subplot, label='Test')

In [None]:
# Mean price w.r.t categorical columns
f, ax = plt.subplots(2,2)
plt.subplots_adjust(hspace=0.5)
for x_var, subplot in zip(X_columns_cat, ax.flatten()):
    subplot.set_xlabel(x_var)
    subplot.set_ylabel('Mean Avg price')
    data.groupby(x_var)[y_column].mean().plot(kind='bar', ax=subplot, label='Test')

In [None]:
# Box plot to check outliers in categorical columns

f, ax = plt.subplots(1,2, figsize=(15,5))
for x_var, subplot in zip(X_columns_cat[0:2], ax.flatten()):
    sns.boxplot(data = data, x=x_var, y=y_column, ax=subplot)

f, ax = plt.subplots(1, figsize=(15,5))
sns.boxplot(data = data, x=X_columns_cat[-1], y=y_column, ax=ax)

f, ax = plt.subplots(1, figsize=(15,5))
sns.boxplot(data = data, x=X_columns_cat[2], y=y_column, ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()

There are some outliers present, but they are not too extreme, so we do not drop any records.

In [None]:
# Checking for relation of numeric columns w.r.t Target Variable
f, ax = plt.subplots(1, len(X_columns_num), figsize=(20, 5))

for x_var, sp in zip(X_columns_num, ax.flatten()):
    sns.regplot(x=data[x_var], y=data[y_column], ax=sp)

#### Assumptions - Checking for No Multicollinearity

In [None]:
# Heatmap to check correlation
plt.figure(figsize=(10,8))
sns.heatmap(data[X_columns_num].corr(), annot=True, cmap='viridis')

So we drop 'PLU_4046', 'PLU_4225', 'TotalBags' and 'SmallBags', which are very highly correlated with other features.

In [None]:
# Dropping highly correlated columns
data.drop(columns=['PLU_4046', 'PLU_4225', 'TotalBags', 'SmallBags'], inplace=True)
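Instead of reading candidates off the heatmap, highly correlated columns can also be flagged programmatically. The sketch below uses synthetic data and a hypothetical threshold of 0.9; it scans the upper triangle of the absolute correlation matrix so each pair is checked once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic columns: 'b' is almost a scaled copy of 'a', 'c' is independent
toy = pd.DataFrame({
    'a': base,
    'b': base * 2 + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr().abs()

# Mask everything except the strict upper triangle (k=1 skips the diagonal)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it correlates above the threshold with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)    # ['b']
```

Which member of a correlated pair to drop remains a judgment call; this rule simply keeps the column that appears first.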

In [None]:
data.head(2)   # Preview after dropping columns

In [None]:
# Updating Variables
y_column, X_columns, X_columns_num, X_columns_cat = get_variables_from_data()

#### Assumptions - Target Variable is Normally Distributed

In [None]:
sns.distplot(data[y_column]) # Distribution of Target Variable

In [None]:
# Pair Plot of data
sns.pairplot(data, height = 2, aspect = 1.5)   # `height` replaces the older `size` argument

In [None]:
# Checking for relation of Numeric Features with Target Variable
sns.pairplot(data, x_vars=X_columns_num, y_vars=y_column, height=5, aspect=1, kind='reg') 

### Post profiling

In [None]:
post_profile = data.profile_report(title='Avocado Post-Profiling')   # Performing Post Profiling on data.

In [None]:
post_profile.to_file('post-profiling.html')                          # Saving report to post-profiling.html

In [None]:
# post_profile.to_notebook_iframe()                                    # View report inline here

## 5. Modelling using sklearn
---

### Preparing X and y

In [None]:
data.head(2) # Preview of data

In [None]:
X_columns   # Preview of feature columns

In [None]:
X = data[X_columns]           # Features data
y = data[y_column]            # TV data

In [None]:
print(X.shape)
X.head()                      # Preview of X

In [None]:
print(y.shape)
y.head()                     # Preview of y

In [None]:
# Splitting the dataset into training and test sets 80-20 split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
# Reset index of split data sets
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

In [None]:
print(X_train.shape)
X_train.head()        # Preview of X_train

In [None]:
print(X_test.shape)
X_test.head()         # Preview of X_test

Scaling numerical fields using StandardScaler

In [None]:
X_train_num = X_train[X_columns_num]       # Numeric X_train 
X_test_num = X_test[X_columns_num]         # Numeric X_test

In [None]:
from sklearn.preprocessing import StandardScaler         # Importing StandardScaler
scaler = StandardScaler().fit(X_train_num)               # Fitting with train data

In [None]:
X_train_s = pd.DataFrame(scaler.transform(X_train_num), columns=X_columns_num)  # Transforming train data
X_test_s = pd.DataFrame(scaler.transform(X_test_num), columns=X_columns_num)    # Transforming test data
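The scaler is fitted on the training data only, so the test data is transformed with the training mean and standard deviation. A tiny sketch with made-up numbers illustrates what that means:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics are learned from the training rows only
scaler = StandardScaler().fit(train)
print(scaler.mean_)     # training mean: [2.]

# The test row is scaled with the training mean and std, not its own
test_scaled = scaler.transform(test)
expected = (4.0 - train.mean()) / train.std()
print(np.isclose(test_scaled[0, 0], expected))    # True
```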

In [None]:
print(X_train_s.shape)
X_train_s.head()            # Scaled train data - Numeric

In [None]:
print(X_test_s.shape)
X_test_s.head()             # Scaled test data - Numeric

Encoding categorical fields

In [None]:
X_train[X_columns_cat].head()         # Preview of categorical features

There are 4 categorical features; each is encoded as described in the table below.

|Column|Type of Encoding|
|--:|:--|
|Type|**OneHot** - As there are only 2 unique values|
|Year|**Label** - To keep the ordinal importance|
|Region|**Target** - As it has high cardinality we can use TargetEncoding to have effect of each Region on AveragePrice|
|Month|**Target** - As it has high cardinality we can use TargetEncoding to have effect of each Month on AveragePrice|
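To make the idea behind target encoding concrete, here is a plain mean-encoding sketch in pandas on made-up data. It is a simplification: the `category_encoders` `TargetEncoder` used in this notebook also blends each category mean with the global mean (smoothing), so its values will generally differ from these.

```python
import pandas as pd

# Made-up training and test rows
train = pd.DataFrame({
    'Region': ['Albany', 'Albany', 'Boston', 'Boston'],
    'AveragePrice': [1.0, 1.2, 1.6, 1.8],
})
test = pd.DataFrame({'Region': ['Boston', 'Chicago']})   # 'Chicago' is unseen in train

# Mean target per category, learned from the training rows only
region_means = train.groupby('Region')['AveragePrice'].mean()
global_mean = train['AveragePrice'].mean()

# Replace each category by its training mean; unseen categories fall back to the global mean
test_encoded = test['Region'].map(region_means).fillna(global_mean)
print(test_encoded.round(2).tolist())    # [1.7, 1.4]
```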

In [None]:
# One Hot Encoding on Type for Train set.
X_train_type_dummies = pd.get_dummies(X_train['Type'], prefix='Type', drop_first=True)
print(X_train_type_dummies.shape)       # Shape of Dummies
X_train_type_dummies.head()             # Preview of Type Dummies               

In [None]:
X_train_s = pd.concat([X_train_s, X_train_type_dummies], axis=1) # Merging type dummies to Scaled Train set
print(X_train_s.shape)                                      # Shape of merged train set
X_train_s.head()                                            # Preview of merged train set

In [None]:
# One Hot Encoding on Type for Test set.
X_test_type_dummies = pd.get_dummies(X_test['Type'], prefix='Type', drop_first=True)
print(X_test_type_dummies.shape)       # Shape of Dummies
X_test_type_dummies.head()             # Preview of Type Dummies               

In [None]:
X_test_s = pd.concat([X_test_s, X_test_type_dummies], axis=1)   # Merging type dummies to Scaled test set
print(X_test_s.shape)                                      # Shape of merged test set
X_test_s.head()                                            # Preview of merged test set

In [None]:
# Label Encoding on Year for Train set.
from sklearn.preprocessing import LabelEncoder         # Importing Label Encoder
label_encoder = LabelEncoder().fit(X_train['Year'])    # Fitting on train set

In [None]:
X_train_year_dummies = pd.DataFrame(label_encoder.transform(X_train['Year']), columns=['Year'])
print(X_train_year_dummies.shape)       # Shape of Transformed Year
X_train_year_dummies.head()             # Preview of Transformed Year 

In [None]:
X_train_s = pd.concat([X_train_s, X_train_year_dummies], axis=1)   # Merging encoded Year to Scaled train set
print(X_train_s.shape)                                        # Shape of merged train set
X_train_s.head()                                              # Preview of merged train set

In [None]:
X_test_year_dummies = pd.DataFrame(label_encoder.transform(X_test['Year']), columns=['Year'])
print(X_test_year_dummies.shape)       # Shape of Transformed Year
X_test_year_dummies.head()             # Preview of Transformed Year 

In [None]:
X_test_s = pd.concat([X_test_s, X_test_year_dummies], axis=1)   # Merging encoded Year to Scaled test set
print(X_test_s.shape)                                      # Shape of merged test set
X_test_s.head()                                            # Preview of merged test set

In [None]:
# Installing category_encoders to import TargetEncoder
# !pip install category_encoders

In [None]:
# Target Encoding on Region for Train set.
from category_encoders import TargetEncoder                                # Importing Target Encoder
target_encoder_region = TargetEncoder().fit(X_train['Region'], y_train)    # Fitting on train set

In [None]:
X_train_region_dummies = target_encoder_region.transform(X_train['Region'])
print(X_train_region_dummies.shape)       # Shape of Transformed region
X_train_region_dummies.head()             # Preview of Transformed region 

In [None]:
X_train_s = pd.concat([X_train_s, X_train_region_dummies], axis=1)   # Merging region dummies to Scaled train set
print(X_train_s.shape)                                          # Shape of merged train set
X_train_s.head()                                                # Preview of merged train set

In [None]:
X_test_region_dummies = target_encoder_region.transform(X_test['Region'])
print(X_test_region_dummies.shape)       # Shape of Transformed region
X_test_region_dummies.head()             # Preview of Transformed region 

In [None]:
X_test_s = pd.concat([X_test_s, X_test_region_dummies], axis=1)     # Merging region dummies to Scaled test set
print(X_test_s.shape)                                          # Shape of merged test set
X_test_s.head()                                                # Preview of merged test set

In [None]:
target_encoder_month = TargetEncoder().fit(X_train['Month'], y_train)    # Fitting on train set for Month

In [None]:
X_train_month_dummies = target_encoder_month.transform(X_train['Month'])
print(X_train_month_dummies.shape)       # Shape of Transformed month
X_train_month_dummies.head()             # Preview of Transformed month 

In [None]:
X_train_s = pd.concat([X_train_s, X_train_month_dummies], axis=1)    # Merging month dummies to Scaled train set
print(X_train_s.shape)                                          # Shape of merged train set
X_train_s.head()                                                # Preview of merged train set

In [None]:
X_test_month_dummies = target_encoder_month.transform(X_test['Month'])
print(X_test_month_dummies.shape)       # Shape of Transformed month
X_test_month_dummies.head()             # Preview of Transformed month 

In [None]:
X_test_s = pd.concat([X_test_s, X_test_month_dummies], axis=1)      # Merging month dummies to Scaled test set
print(X_test_s.shape)                                          # Shape of merged test set
X_test_s.head()                                                # Preview of merged test set

#### Final data after Scalings and Encodings

In [None]:
print(X_train_s.shape)
X_train_s.head()                    # Preview of X_train

In [None]:
print(y_train.shape)
y_train.head()                    # Preview of y_train

In [None]:
print(X_test_s.shape)
X_test_s.head()                    # Preview of X_test

In [None]:
print(y_test.shape)
y_test.head()                    # Preview of y_test
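The manual scaling and encoding steps above could also be expressed with scikit-learn's `ColumnTransformer`. The sketch below is not what this notebook does; it covers only the numeric scaling and the one-hot `Type` column, on a toy frame with made-up values, to show how fitting on train and transforming test can be done in one object.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up train and test frames with one numeric and one binary categorical column
toy_train = pd.DataFrame({
    'TotalVolume': [100.0, 200.0, 300.0, 400.0],
    'Type': ['conventional', 'organic', 'organic', 'conventional'],
})
toy_test = pd.DataFrame({
    'TotalVolume': [250.0, 500.0],
    'Type': ['organic', 'conventional'],
})

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['TotalVolume']),              # scale numeric columns
        ('type', OneHotEncoder(drop='if_binary'), ['Type']),     # one 0/1 column for a binary feature
    ],
    sparse_threshold=0,   # always return a dense array
)

X_tr = preprocess.fit_transform(toy_train)   # statistics and categories learned on train
X_te = preprocess.transform(toy_test)        # the same mapping applied to test

print(X_tr.shape, X_te.shape)    # (4, 2) (2, 2)
```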

### Building Models

In [None]:
# Importing Models
from sklearn.linear_model import LinearRegression              # Importing LinearRegression Algo
from sklearn.tree import DecisionTreeRegressor                 # Importing DecisionTreeRegressor Algo
from sklearn.ensemble import RandomForestRegressor             # Importing RandomForestRegressor Algo

In [None]:
# Creating our LinearRegression model and fitting the data into it.
linreg_model = LinearRegression()
linreg_model.fit(X_train_s, y_train)

In [None]:
# Creating our DecisionTreeRegressor model and fitting the data into it.
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train_s, y_train)

In [None]:
# Creating our RandomForestRegressor model and fitting the data into it.
rf_model=RandomForestRegressor()
rf_model.fit(X_train_s,y_train)

### Hyper Parameter Tuning 
    - To find best RandomForestRegressor Using GridSearchCV and RandomizedSearchCV

In [None]:
# Preparations for Hyper Parameter Tuning

from sklearn.model_selection import GridSearchCV          # Importing GridSearchCV
from sklearn.model_selection import RandomizedSearchCV    # Importing RandomizedSearchCV

n_estimators = [10,50,100,200,300,500]                    # Number of trees in random forest
max_features = [1.0, 'log2',2,4,8,12]                     # Number of features to consider at every split (1.0 = all features, formerly 'auto')
max_depth = [2,4,8,16,25]                                 # Maximum number of levels in tree

# Creating param_grid for hyper-parameter tuning.
random_grid = {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth,}

In [None]:
# Creating our RandomForestRegressor model from GridSearchCV and fitting the data into it.
rf_model_grid = GridSearchCV(estimator = rf_model, param_grid=random_grid, cv = 3, n_jobs = -1 )
rf_model_grid.fit(X_train_s,y_train)

In [None]:
# Creating our RandomForestRegressor model from RandomizedSearchCV and fitting the data into it.
rf_model_random = RandomizedSearchCV(estimator = rf_model, param_distributions = random_grid, 
                                     n_iter = 10, cv = 3, verbose=2, random_state=100, n_jobs = -1)
rf_model_random.fit(X_train_s, y_train)
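After a search has been fitted, the winning configuration can be read from `best_params_`, its mean cross-validated score from `best_score_`, and the refitted model from `best_estimator_`. A minimal sketch on synthetic data, with a deliberately small hypothetical grid so it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the avocado features
X_toy, y_toy = make_regression(n_samples=120, n_features=4, noise=0.1, random_state=0)

# Small illustrative grid; the notebook's grid is larger
toy_grid = {'n_estimators': [10, 30], 'max_depth': [2, 8]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid=toy_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)             # best combination found in the grid
print(round(search.best_score_, 3))    # mean cross-validated R^2 of that combination

# With the default refit=True, the best model is retrained on all of X_toy
best_rf = search.best_estimator_
```

Calling `predict` on the search object, as done later in this notebook, delegates to this refitted best estimator.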

### Model Predictions

#### 1. Predictions from LinearRegression Model - linreg_model

In [None]:
# Predictions from `linreg_model` - TRAIN Set
y_train_pred_lr = linreg_model.predict(X_train_s)     # Predicted Target Values for TRAIN set.
print(y_train_pred_lr.shape)                          # Shape of Predicted Target Value - TRAIN set.
y_train_pred_lr[:10]                                  # Top 10 Predicted Target Values for TRAIN set.

In [None]:
# Predictions from `linreg_model` - TEST Set
y_test_pred_lr = linreg_model.predict(X_test_s)     # Predicted Target Values for TEST set.
print(y_test_pred_lr.shape)                         # Shape of Predicted Target Value - TEST set.
y_test_pred_lr[:10]                                 # Top 10 Predicted Target Values for TEST set.

#### 2. Predictions from DecisionTreeRegressor Model - dt_model

In [None]:
# Predictions from `dt_model` - TRAIN Set
y_train_pred_dt = dt_model.predict(X_train_s)         # Predicted Target Values for TRAIN set.
print(y_train_pred_dt.shape)                          # Shape of Predicted Target Value - TRAIN set.
y_train_pred_dt[:10]                                  # Top 10 Predicted Target Values for TRAIN set.

In [None]:
# Predictions from `dt_model` - TEST Set
y_test_pred_dt = dt_model.predict(X_test_s)          # Predicted Target Values for TEST set.
print(y_test_pred_dt.shape)                          # Shape of Predicted Target Value - TEST set.
y_test_pred_dt[:10]                                  # Top 10 Predicted Target Values for TEST set.

#### 3. Predictions from RandomForestRegressor Model - rf_model

In [None]:
# Predictions from `rf_model` - TRAIN Set
y_train_pred_rf = rf_model.predict(X_train_s)         # Predicted Target Values for TRAIN set.
print(y_train_pred_rf.shape)                          # Shape of Predicted Target Value - TRAIN set.
y_train_pred_rf[:10]                                  # Top 10 Predicted Target Values for TRAIN set.

In [None]:
# Predictions from `rf_model` - TEST Set
y_test_pred_rf = rf_model.predict(X_test_s)          # Predicted Target Values for TEST set.
print(y_test_pred_rf.shape)                          # Shape of Predicted Target Value - TEST set.
y_test_pred_rf[:10]                                  # Top 10 Predicted Target Values for TEST set.

#### 4. Predictions from RandomForestRegressor - GridSearchCV  - rf_model_grid

In [None]:
# Predictions from `rf_model_grid` - TRAIN Set
y_train_pred_rf_grid = rf_model_grid.predict(X_train_s)    # Predicted Target Values for TRAIN set.
print(y_train_pred_rf_grid.shape)                          # Shape of Predicted Target Value - TRAIN set.
y_train_pred_rf_grid[:10]                                  # Top 10 Predicted Target Values for TRAIN set.

In [None]:
# Predictions from `rf_model_grid` - TEST Set
y_test_pred_rf_grid = rf_model_grid.predict(X_test_s)     # Predicted Target Values for TEST set.
print(y_test_pred_rf_grid.shape)                          # Shape of Predicted Target Value - TEST set.
y_test_pred_rf_grid[:10]                                  # Top 10 Predicted Target Values for TEST set.

#### 5. Predictions from RandomForestRegressor - RandomizedSearchCV  - rf_model_random

In [None]:
# Predictions from `rf_model_random` - TRAIN Set
y_train_pred_rf_random = rf_model_random.predict(X_train_s)  # Predicted Target Values for TRAIN set.
print(y_train_pred_rf_random.shape)                          # Shape of Predicted Target Value - TRAIN set.
y_train_pred_rf_random[:10]                                  # Top 10 Predicted Target Values for TRAIN set.

In [None]:
# Predictions from `rf_model_random` - TEST Set
y_test_pred_rf_random = rf_model_random.predict(X_test_s)     # Predicted Target Values for TEST set.
print(y_test_pred_rf_random.shape)                          # Shape of Predicted Target Value - TEST set.
y_test_pred_rf_random[:10]                                  # Top 10 Predicted Target Values for TEST set.

## 6. Model Evaluations


---



In [None]:
# Utility / Helper Function - Regression Model Evaluation

def regression_model_evaluation(y, y_pred, set_type='', features_count=None):
    '''
    Utility/Helper method to calculate the evaluation metrics for a regression model
    '''
    from sklearn import metrics # Importing metrics from SK-Learn
    result = {}
    
    if set_type != '':
        set_type = '_'+set_type
        
    # Mean Absolute Error
    result['MAE'] = metrics.mean_absolute_error(y, y_pred) 
    # Mean Squared Error
    result['MSE'] = metrics.mean_squared_error(y, y_pred)  
    # Root Mean Squared Error
    result['RMSE'] = np.sqrt(result['MSE'])                      
    # R_squared
    result['R_squared'] = metrics.r2_score(y, y_pred)      
    
    # Adj r2 = 1-(1-R2)*(n-1)/(n-p-1)
    if features_count:
        # Adjusted R_squared; the numerator uses (n - 1) to match the formula above
        result['Adj_R_squared'] = 1 - (((1 - result['R_squared']) * (len(y) - 1)) / (len(y) - features_count - 1))
    # Appending the set type to each key and rounding each value
    return {f'{k}'+set_type: round(v, 4) for k, v in result.items()} 
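As a worked example of the adjusted R-squared formula quoted in the comment above, `1 - (1 - R2) * (n - 1) / (n - p - 1)`, the standalone sketch below plugs in illustrative numbers (not results from this notebook). The helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative inputs: R^2 = 0.80 with 101 samples and 8 features
# 1 - 0.20 * 100 / 92 = 0.7826...
value = adjusted_r2(0.80, n_samples=101, n_features=8)
print(round(value, 4))    # 0.7826
```

The penalty shrinks as the sample count grows relative to the feature count, so on a dataset of this size the adjusted value stays close to plain R-squared.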

#### 1. Evaluation Parameters for - linreg_model

In [None]:
# Evaluation metrics for LinearRegression - TRAIN set
metrics_lr_train = regression_model_evaluation(y_train, y_train_pred_lr, features_count=8)
metrics_lr_train

In [None]:
# Evaluation metrics for LinearRegression - TEST set
metrics_lr_test = regression_model_evaluation(y_test, y_test_pred_lr, features_count=8)
metrics_lr_test

In [None]:
# Converting metrics map to DataFrame
LR_Train_mertrics = pd.DataFrame(metrics_lr_train.items(), columns=['Metrics', 'LR_Train'])
LR_Test_mertrics = pd.DataFrame(metrics_lr_test.items(), columns=['Metrics', 'LR_Test'])

In [None]:
# To get the intercept of the model.
linreg_model.intercept_

In [None]:
# To get the coefficients of the model.
coefs = linreg_model.coef_
features = X_train_s.columns

list(zip(features,coefs))

#### 2. Evaluation Parameters for - dt_model

In [None]:
# Evaluation metrics for DecisionTreeRegressor - TRAIN set
metrics_dt_train = regression_model_evaluation(y_train, y_train_pred_dt, features_count=8)
metrics_dt_train

In [None]:
# Evaluation metrics for DecisionTreeRegressor - TEST set
metrics_dt_test = regression_model_evaluation(y_test, y_test_pred_dt, features_count=8)
metrics_dt_test

In [None]:
# Converting metrics map to DataFrame
DT_Train_mertrics = pd.DataFrame(metrics_dt_train.items(), columns=['Metrics', 'DT_Train'])
DT_Test_mertrics = pd.DataFrame(metrics_dt_test.items(), columns=['Metrics', 'DT_Test'])

In [None]:
# DecisionTreeRegressor Score; Same as R-Squared from X and y; So it internally calculates r-squared of y and y_pred (-from X)
print('Train set: ',dt_model.score(X_train_s,y_train))
print('Test set: ',dt_model.score(X_test_s,y_test))
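That `score` on a regressor matches `r2_score` computed from its predictions can be checked on synthetic data. A short sketch with made-up inputs:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first feature plus a little noise
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 2))
y_toy = X_toy[:, 0] * 3 + rng.normal(scale=0.1, size=50)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)

# score(X, y) predicts from X internally and returns R^2 against y
same = np.isclose(tree.score(X_toy, y_toy), r2_score(y_toy, tree.predict(X_toy)))
print(same)    # True
```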

#### 3. Evaluation Parameters for - rf_model

In [None]:
# Evaluation metrics for RandomForestRegressor - TRAIN set
metrics_rf_train = regression_model_evaluation(y_train, y_train_pred_rf, features_count=8)
metrics_rf_train

In [None]:
# Evaluation metrics for RandomForestRegressor - TEST set
metrics_rf_test = regression_model_evaluation(y_test, y_test_pred_rf, features_count=8)
metrics_rf_test

In [None]:
# Converting metrics map to DataFrame
RF_Train_mertrics = pd.DataFrame(metrics_rf_train.items(), columns=['Metrics', 'RF_Train'])
RF_Test_mertrics = pd.DataFrame(metrics_rf_test.items(), columns=['Metrics', 'RF_Test'])

In [None]:
# RandomForestRegressor Score; Same as R-Squared from X and y; So it internally calculates r-squared of y and y_pred (-from X)
print('Train set: ',rf_model.score(X_train_s,y_train))
print('Test set: ',rf_model.score(X_test_s,y_test))

#### 4. Evaluation Parameters for - rf_model_grid

In [None]:
# Evaluation metrics for RandomForestRegressor with GridSearchCV - TRAIN set
metrics_rf_grid_train = regression_model_evaluation(y_train, y_train_pred_rf_grid, features_count=8)
metrics_rf_grid_train

In [None]:
# Evaluation metrics for RandomForestRegressor with GridSearchCV - TEST set
metrics_rf_grid_test = regression_model_evaluation(y_test, y_test_pred_rf_grid, features_count=8)
metrics_rf_grid_test

In [None]:
# Converting metrics map to DataFrame
RF_Grid_Train_mertrics = pd.DataFrame(metrics_rf_grid_train.items(), columns=['Metrics', 'RF_Grid_Train'])
RF_Grid_Test_mertrics = pd.DataFrame(metrics_rf_grid_test.items(), columns=['Metrics', 'RF_Grid_Test'])

#### 5. Evaluation Parameters for - rf_model_random

In [None]:
# Evaluation metrics for RandomForestRegressor with RandomizedSearchCV - TRAIN set
metrics_rf_random_train = regression_model_evaluation(y_train, y_train_pred_rf_random, features_count=8)
metrics_rf_random_train

In [None]:
# Evaluation metrics for RandomForestRegressor with RandomizedSearchCV - TEST set
metrics_rf_random_test = regression_model_evaluation(y_test, y_test_pred_rf_random, features_count=8)
metrics_rf_random_test

In [None]:
# Converting metrics map to DataFrame
RF_Random_Train_mertrics = pd.DataFrame(metrics_rf_random_train.items(), columns=['Metrics', 'RF_Random_Train'])
RF_Random_Test_mertrics = pd.DataFrame(metrics_rf_random_test.items(), columns=['Metrics', 'RF_Random_Test'])

Combining the metrics of all 5 models into DataFrames

In [None]:
# Merging the TRAIN metrics of all models into one DataFrame, indexed by metric name
Train_mertrics = LR_Train_mertrics.merge(
                    DT_Train_mertrics, on='Metrics').merge(
                    RF_Train_mertrics, on='Metrics').merge(
                    RF_Grid_Train_mertrics, on='Metrics').merge(
                    RF_Random_Train_mertrics, on='Metrics').set_index(keys='Metrics')
Train_mertrics

In [None]:
# Merging the TEST metrics of all models into one DataFrame, indexed by metric name
Test_mertrics = LR_Test_mertrics.merge(DT_Test_mertrics, on='Metrics').merge(
                    RF_Test_mertrics, on='Metrics').merge(
                    RF_Grid_Test_mertrics, on='Metrics').merge(
                    RF_Random_Test_mertrics, on='Metrics').set_index(keys='Metrics')
Test_mertrics

In [None]:
# Joining TRAIN and TEST metrics side by side, then ordering the columns model by model
model_mertrics = Train_mertrics.merge(Test_mertrics, on='Metrics')
model_mertrics = model_mertrics.reindex(
    columns=['LR_Train', 'LR_Test', 'DT_Train', 'DT_Test', 'RF_Train', 'RF_Test', 'RF_Grid_Train', 'RF_Grid_Test', 'RF_Random_Train', 'RF_Random_Test'])

model_mertrics
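
As a possible shortcut for the merge chain above, a dict of metric dicts can be passed straight to `pd.DataFrame`, which uses the outer keys as columns and the inner keys as the index. This is only a sketch: `demo_metrics` and its numbers are placeholder values for illustration, not results from this notebook, and it assumes each metrics object is a plain `dict` with the same keys.

```python
import pandas as pd

# Placeholder metric dicts (assumption: same shape as the dicts returned by
# regression_model_evaluation; the numbers are NOT real results)
demo_metrics = {
    'LR_Train': {'MAE': 1.0, 'RMSE': 2.0, 'R2': 0.5},
    'LR_Test':  {'MAE': 1.1, 'RMSE': 2.1, 'R2': 0.4},
    'DT_Train': {'MAE': 0.0, 'RMSE': 0.0, 'R2': 1.0},
    'DT_Test':  {'MAE': 0.9, 'RMSE': 1.8, 'R2': 0.6},
}

# Outer keys become columns, inner keys become the row index
demo_table = pd.DataFrame(demo_metrics).rename_axis('Metrics')
print(demo_table)
```

With the real dicts, the column order would follow the insertion order of the outer dict, so a separate `reindex` step would not be needed.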

## 7. Model Plotting

#### 1. LinearRegression

In [None]:
train_diff = pd.DataFrame({'Y_ACT':y_train , 'Y_Pred':y_train_pred_lr},columns=['Y_ACT','Y_Pred']) 
train_diff.head()    # Preview of DF - y_train and y_train_pred

In [None]:
test_diff = pd.DataFrame({'Y_ACT':y_test , 'Y_Pred':y_test_pred_lr},columns=['Y_ACT','Y_Pred'])
test_diff.head()    # Preview of DF - y_test and y_test_pred

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(16,8))
f.suptitle('Y-Actual VS Y-Predicted - LinearRegression')
ax1.set_title('Train Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=train_diff, ax=ax1)
ax2.set_title('Test Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=test_diff, ax=ax2)

#### 2. DecisionTreeRegressor

In [None]:
train_diff = pd.DataFrame({'Y_ACT':y_train , 'Y_Pred':y_train_pred_dt},columns=['Y_ACT','Y_Pred'])
train_diff.head() # Preview of DF - y_train and y_train_pred

In [None]:
test_diff = pd.DataFrame({'Y_ACT':y_test , 'Y_Pred':y_test_pred_dt},columns=['Y_ACT','Y_Pred'])
test_diff.head()   # Preview of DF - y_test and y_test_pred

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(16,8))
f.suptitle('Y-Actual VS Y-Predicted - DecisionTreeRegressor')
ax1.set_title('Train Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=train_diff, ax=ax1)
ax2.set_title('Test Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=test_diff, ax=ax2)

#### 3. RandomForestRegressor

In [None]:
train_diff = pd.DataFrame({'Y_ACT':y_train , 'Y_Pred':y_train_pred_rf},columns=['Y_ACT','Y_Pred'])
train_diff.head() # Preview of DF - y_train and y_train_pred

In [None]:
test_diff = pd.DataFrame({'Y_ACT':y_test , 'Y_Pred':y_test_pred_rf},columns=['Y_ACT','Y_Pred'])
test_diff.head()   # Preview of DF - y_test and y_test_pred

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(16,8))
f.suptitle('Y-Actual VS Y-Predicted - RandomForestRegressor')
ax1.set_title('Train Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=train_diff, ax=ax1)
ax2.set_title('Test Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=test_diff, ax=ax2)

#### 4. RandomForestRegressor - GridSearchCV

In [None]:
train_diff = pd.DataFrame({'Y_ACT':y_train , 'Y_Pred':y_train_pred_rf_grid},columns=['Y_ACT','Y_Pred'])
train_diff.head() # Preview of DF - y_train and y_train_pred

In [None]:
test_diff = pd.DataFrame({'Y_ACT':y_test , 'Y_Pred':y_test_pred_rf_grid},columns=['Y_ACT','Y_Pred'])
test_diff.head()  # Preview of DF - y_test and y_test_pred

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(16,8))
f.suptitle('Y-Actual VS Y-Predicted - RandomForestRegressor With GridSearchCV')
ax1.set_title('Train Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=train_diff, ax=ax1)
ax2.set_title('Test Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=test_diff, ax=ax2)

#### 5. RandomForestRegressor - RandomizedSearchCV

In [None]:
train_diff = pd.DataFrame({'Y_ACT':y_train , 'Y_Pred':y_train_pred_rf_random},columns=['Y_ACT','Y_Pred'])
train_diff.head() # Preview of DF - y_train and y_train_pred

In [None]:
test_diff = pd.DataFrame({'Y_ACT':y_test , 'Y_Pred':y_test_pred_rf_random},columns=['Y_ACT','Y_Pred'])
test_diff.head() # Preview of DF - y_test and y_test_pred

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(16,8))
f.suptitle('Y-Actual VS Y-Predicted - RandomForestRegressor With RandomizedSearchCV')
ax1.set_title('Train Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=train_diff, ax=ax1)
ax2.set_title('Test Set', fontsize=14)
sns.regplot(x='Y_ACT',y='Y_Pred',data=test_diff, ax=ax2)
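
The five plotting cells above repeat the same steps, so they could be wrapped in one helper. The sketch below is an assumption, not the notebook's own code: `plot_actual_vs_predicted` is a hypothetical name, and it swaps `sns.regplot` for a plain matplotlib scatter with a dashed `y = x` reference line, on which a perfect prediction would lie. The data is synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_train, y_train_pred, y_test, y_test_pred, model_name):
    """Draw actual-vs-predicted scatter plots for the train and test sets."""
    fig, axes = plt.subplots(1, 2, figsize=(16, 8))
    fig.suptitle(f'Y-Actual VS Y-Predicted - {model_name}')
    panels = [('Train Set', y_train, y_train_pred),
              ('Test Set', y_test, y_test_pred)]
    for ax, (title, y_act, y_pred) in zip(axes, panels):
        y_act, y_pred = np.asarray(y_act), np.asarray(y_pred)
        ax.scatter(y_act, y_pred, alpha=0.4)
        # Dashed y = x line: points on it are predicted exactly
        low = min(y_act.min(), y_pred.min())
        high = max(y_act.max(), y_pred.max())
        ax.plot([low, high], [low, high], color='red', linestyle='--')
        ax.set_title(title, fontsize=14)
        ax.set_xlabel('Y_ACT')
        ax.set_ylabel('Y_Pred')
    return fig

# Synthetic demo values (assumption: stand-ins for y_train, y_train_pred_lr, ...)
rng = np.random.default_rng(0)
y_actual = rng.uniform(0.5, 3.0, size=100)
y_predicted = y_actual + rng.normal(scale=0.2, size=100)
demo_fig = plot_actual_vs_predicted(y_actual[:80], y_predicted[:80],
                                    y_actual[80:], y_predicted[80:], 'Demo model')
```

In the notebook, a call might look like `plot_actual_vs_predicted(y_train, y_train_pred_rf, y_test, y_test_pred_rf, 'RandomForestRegressor')`.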

## 8. Conclusions

---


  - Analyzing and finalizing best-fit model

In [None]:
model_mertrics # Preview of Model Evaluation Metrics

From the above table we can observe:
1. The LinearRegression model has the least R-Squared value - **UnderFit Model**
2. The DecisionTreeRegressor model has the maximum R-Squared value on the train data but a lower R-Squared value on the test data - **OverFit Model**
3. The RandomForestRegressor model has a better R-Squared value on the test data
4. We can obtain the best version of RandomForestRegressor by *hyperparameter tuning* with GridSearchCV or RandomizedSearchCV
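
The underfit and overfit reading above can be illustrated by comparing train and test R-Squared side by side. The sketch below does this on synthetic, non-linear data. It is an illustration under assumptions (the data, `models` and `r2_gap` are hypothetical names) and does not reproduce this notebook's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target with noise (assumption: not the avocado data)
rng = np.random.default_rng(42)
X_syn = rng.uniform(-3, 3, size=(600, 2))
y_syn = np.sin(X_syn[:, 0]) + X_syn[:, 1] ** 2 + rng.normal(scale=0.5, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

models = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=42),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=50, random_state=42),
}

# A low R-squared on both sets suggests underfitting;
# a large train-minus-test gap suggests overfitting
r2_gap = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    r2_gap[name] = (train_r2, test_r2, train_r2 - test_r2)
    print(f'{name}: train={train_r2:.3f} test={test_r2:.3f} gap={train_r2 - test_r2:.3f}')
```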