
Team Name - Notebook

Team Members -
Daksha Pareek (2019mc21080)
Ranjan Mathur (2018mc21571)
    
This notebook covers the following models:
- Regularization Model - Ridge
- Regularization Model - Lasso
- Regression Tree with Pruning
- Ensemble Model - Gradient Boost
- Ensemble Models - Random Forest

**About Dataset**

**Overview**
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

**Data Fields**
datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather -

1. Clear, Few clouds, Partly cloudy
2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

* temp - temperature in Celsius
* atemp - "feels like" temperature in Celsius
* humidity - relative humidity
* windspeed - wind speed
* casual - number of non-registered user rentals initiated
* registered - number of registered user rentals initiated
* count - number of total rentals (Dependent Variable)

In [None]:
import IPython
display("IPython version: {}". format(IPython.__version__)) 

from IPython.display import display 

import sys
display("Python version: {}". format(sys.version))

import sklearn
display("scikit-learn version: {}". format(sklearn.__version__))

import pylab
import calendar
import numpy as np
display("NumPy version: {}". format(np.__version__))
import pandas as pd
display("pandas version: {}". format(pd.__version__))

import seaborn as sn
from scipy import stats
import missingno as msno
from datetime import datetime
import matplotlib.pyplot as plt
import warnings


pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline



from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import cross_val_score
pd.set_option("max_colwidth", None)

The Dataset

In [None]:
dailyData = pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv")

**Data Summary**

As a first step, let's look at three simple things about the dataset:

* The size of the dataset
* A glimpse of the data, by printing a few rows of it
* The types of the variables that make up our data

Shape of Dataset

In [None]:
dailyData.shape

Sample Of First Few Rows

In [None]:
dailyData.head(10)

Variables Data Type

In [None]:
dailyData.dtypes

**Feature Engineering**

As we see from the above results, the columns "season", "holiday", "workingday" and "weather" should be of "categorical" data type, but their current data type is "int". Let us transform the dataset in the following ways so that we can get started with our EDA:

* Create new columns "date", "hour", "weekday" and "month" from the "datetime" column.
* Coerce the data type of "season", "holiday", "workingday" and "weather" to category.
* Drop the "datetime" column, as we have already extracted the useful features from it.
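
The date-derived columns listed above can also be produced with pandas' vectorized datetime accessors. The following is a minimal sketch on a tiny made-up sample (the two timestamps are illustrative, not rows from the dataset); note that it yields an integer "hour", whereas the string-splitting approach used in this notebook yields a string.

```python
import pandas as pd

# Tiny made-up sample in the same "YYYY-MM-DD HH:MM:SS" format as the datetime column.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00"]})

# Parse once, then derive every feature from the parsed timestamps.
parsed = pd.to_datetime(sample["datetime"])
sample["date"] = parsed.dt.strftime("%Y-%m-%d")
sample["hour"] = parsed.dt.hour
sample["weekday"] = parsed.dt.day_name()
sample["month"] = parsed.dt.month_name()

# Coerce the derived columns to the categorical dtype.
for col in ["hour", "weekday", "month"]:
    sample[col] = sample[col].astype("category")

# Drop the raw column once the features are extracted.
sample = sample.drop(columns="datetime")
print(sample)
```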

Creating New Columns From "Datetime" Column

In [None]:
dailyData["date"] = dailyData.datetime.apply(lambda x : x.split()[0])
dailyData["hour"] = dailyData.datetime.apply(lambda x : x.split()[1].split(":")[0])
dailyData["weekday"] = dailyData.date.apply(lambda dateString : calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])
dailyData["month"] = dailyData.date.apply(lambda dateString : calendar.month_name[datetime.strptime(dateString,"%Y-%m-%d").month])
dailyData["season"] = dailyData.season.map({1: "Spring", 2 : "Summer", 3 : "Fall", 4 :"Winter" })
dailyData["weather"] = dailyData.weather.map({1: " Clear + Few clouds + Partly cloudy + Partly cloudy",\
                                        2 : " Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist ", \
                                        3 : " Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds", \
                                        4 :" Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog " })

Coercing To Category Type

In [None]:
categoryVariableList = ["hour","weekday","month","season","weather","holiday","workingday"]
for var in categoryVariableList:
    dailyData[var] = dailyData[var].astype("category")

Dropping Unnecessary Columns

In [None]:
dailyData  = dailyData.drop(["datetime"],axis=1)

Let's Start With A Very Simple Visualization Of Variables DataType Count

In [None]:
dataTypeDf = pd.DataFrame(dailyData.dtypes.value_counts()).reset_index().rename(columns={"index":"variableType",0:"count"})
fig,ax = plt.subplots()
fig.set_size_inches(12,5)
dataTypeDf['variableType'] = dataTypeDf['variableType'].astype('str')
sn.barplot(data=dataTypeDf,x="variableType",y="count",ax=ax)
ax.set(xlabel='Variable Type', ylabel='Count',title="Variables DataType Count")

**Missing Values Analysis**
Once we get the hang of the data and columns, the next step is generally to find out whether we have any missing values in our data. Luckily, we don't have any missing values in this dataset. One way I generally prefer to visualize missing values in a dataset is through "missingno".

It's quite a handy library for quickly visualizing missing values per variable. As mentioned earlier, we got lucky this time, as there are no missing values in the dataset.

**Skewness In Distribution**


In [None]:
msno.matrix(dailyData,figsize=(12,5))
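
Alongside the visual check, missing values can be counted numerically per column. This is a small sketch on a made-up frame (the values are illustrative and deliberately contain gaps, unlike the actual dataset).

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberately missing entries.
toy = pd.DataFrame({
    "temp": [9.8, np.nan, 13.1],
    "humidity": [81.0, 80.0, np.nan],
    "count": [16, 40, 32],
})

# isnull() flags each missing cell; sum() counts the flags column by column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```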

**Outliers Analysis**
At first look, the "count" variable contains a lot of outlier data points, which skews the distribution towards the right (as there are more data points beyond the outer quartile limit). In addition, the following inferences can be made from the simple boxplots given below.

* The spring season has a relatively lower count. The dip in the median value in the boxplot gives evidence for it.
* The boxplot with "Hour Of The Day" is quite interesting. The median values are relatively higher at 7AM - 8AM and 5PM - 6PM. This can be attributed to regular school and office users at those times.
* Most of the outlier points come from "Working Day" rather than "Non Working Day". It is quite visible from figure 4.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(15, 15)
sn.boxplot(data=dailyData,y="count",orient="v",ax=axes[0][0])
sn.boxplot(data=dailyData,y="count",x="season",orient="v",ax=axes[0][1])
sn.boxplot(data=dailyData,y="count",x="hour",orient="v",ax=axes[1][0])
sn.boxplot(data=dailyData,y="count",x="workingday",orient="v",ax=axes[1][1])

axes[0][0].set(ylabel='Count',title="Box Plot On Count")
axes[0][1].set(xlabel='Season', ylabel='Count',title="Box Plot On Count Across Season")
axes[1][0].set(xlabel='Hour Of The Day', ylabel='Count',title="Box Plot On Count Across Hour Of The Day")
axes[1][1].set(xlabel='Working Day', ylabel='Count',title="Box Plot On Count Across Working Day")

Remove Outliers In The Count Column

In [None]:
dailyDataWithoutOutliers = dailyData[np.abs(dailyData["count"]-dailyData["count"].mean())<=(3*dailyData["count"].std())] 
display("Shape of the dataframe before removing outliers: ",dailyData.shape)
display("Shape of the dataframe after removing outliers: ",dailyDataWithoutOutliers.shape)
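
The filter above keeps rows whose "count" lies within three standard deviations of the mean. A minimal sketch of the same rule on synthetic numbers (made up for illustration, not taken from the dataset) shows how it behaves. It uses NumPy's population standard deviation, which differs very slightly from the sample standard deviation that pandas computes.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 well-behaved values plus two obvious outliers (synthetic data).
counts = np.concatenate([rng.normal(100, 10, size=1000), [500.0, 650.0]])

# Keep only values within 3 standard deviations of the mean.
mean, std = counts.mean(), counts.std()
keep = np.abs(counts - mean) <= 3 * std
filtered = counts[keep]

print(len(counts), "->", len(filtered))
```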

**Correlation Analysis**
One common way to understand how a dependent variable is influenced by numerical features is to compute a correlation matrix between them. Let's plot the correlation between "count" and ["temp","atemp","humidity","windspeed"].

* The "temp" and "humidity" features have a positive and a negative correlation with "count" respectively. Although these correlations are not very prominent, "count" still shows some dependency on "temp" and "humidity".
* "windspeed" is not going to be a really useful numerical feature, which is visible from its correlation value with "count".
* "atemp" is not taken into account since "atemp" and "temp" are strongly correlated with each other. During model building one of the two variables has to be dropped, since together they introduce multicollinearity in the data.
* "casual" and "registered" are also not taken into account, since they are leakage variables by nature and need to be dropped during model building.

A regression plot in seaborn is one useful way to depict the relationship between two features. Here we consider "count" vs "temp", "humidity" and "windspeed".

In [None]:
corrMatt = dailyData[["temp","atemp","casual","registered","humidity","windspeed","count"]].corr()
# Boolean mask: True hides a cell, so hide everything strictly above the diagonal.
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
sn.heatmap(corrMatt, mask=mask,vmax=.8, square=True,annot=True)

In [None]:
# #correlation heatmap of dataset
# def correlation_heatmap(df):
#     _ , ax = plt.subplots(figsize =(14, 12))
#     colormap = sn.diverging_palette(220, 10, as_cmap = True)
    
#     _ = sn.heatmap(
#         df.corr(), 
#         cmap = colormap,
#         square=True, 
#         cbar_kws={'shrink':.9 }, 
#         ax=ax,
#         annot=True, 
#         linewidths=0.1,vmax=1.0, linecolor='white',
#         annot_kws={'fontsize':12 }
#     )
    
#     plt.title('Pearson Correlation of Features', y=1.05, size=15)

# correlation_heatmap(dailyData)

In [None]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3)
fig.set_size_inches(12, 5)
sn.regplot(x="temp", y="count", data=dailyData,ax=ax1)
sn.regplot(x="windspeed", y="count", data=dailyData,ax=ax2)
sn.regplot(x="humidity", y="count", data=dailyData,ax=ax3)

**Visualizing Distribution Of Data**
As the figures below show, the "count" variable is skewed towards the right. Linear models in particular tend to behave better when the dependent variable is roughly symmetric, so a distribution close to normal is desirable. One possible solution is to apply a log transformation to "count" after removing the outlier data points. After the transformation the data looks a lot better, but it still does not follow a normal distribution exactly.

In [None]:
fig,axes = plt.subplots(ncols=2,nrows=2)
fig.set_size_inches(12, 10)
sn.distplot(dailyData["count"],ax=axes[0][0])
stats.probplot(dailyData["count"], dist='norm', fit=True, plot=axes[0][1])
sn.distplot(np.log1p(dailyDataWithoutOutliers["count"]),ax=axes[1][0])
stats.probplot(np.log1p(dailyDataWithoutOutliers["count"]), dist='norm', fit=True, plot=axes[1][1])
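
The effect of the log transformation can also be measured numerically. The sketch below uses synthetic log-normal "counts" (an assumption for illustration, not the bike-sharing data) and compares the skewness before and after `np.log1p`; it also checks that `np.expm1` is the exact inverse, which matters when predictions made in log space are mapped back to counts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed "counts" (log-normal), not the bike-sharing data.
counts = np.round(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

skew_raw = stats.skew(counts)
skew_log = stats.skew(np.log1p(counts))
print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# expm1 undoes log1p, so values in log space map back to the original counts.
assert np.allclose(np.expm1(np.log1p(counts)), counts)
```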

**Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)**
* People clearly tend to rent bikes during the summer season, since the weather is conducive to riding at that time of year. Therefore June, July and August have relatively higher demand for bicycles.
* On weekdays more people tend to rent bicycles around 7AM-8AM and 5PM-6PM. As mentioned earlier, this can be attributed to regular school and office commuters.
* The above pattern is not observed on "Saturday" and "Sunday". On those days more people tend to rent bicycles between 10AM and 4PM.
* The peak user count around 7AM-8AM and 5PM-6PM is contributed almost entirely by registered users.

In [None]:
fig,(ax1,ax2,ax3,ax4)= plt.subplots(nrows=4)
fig.set_size_inches(12,20)
sortOrder = ["January","February","March","April","May","June","July","August","September","October","November","December"]
hueOrder = ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]

monthAggregated = pd.DataFrame(dailyData.groupby("month")["count"].mean()).reset_index()
monthSorted = monthAggregated.sort_values(by="count",ascending=False)
sn.barplot(data=monthSorted,x="month",y="count",ax=ax1,order=sortOrder)
ax1.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")

hourAggregated = pd.DataFrame(dailyData.groupby(["hour","season"],sort=True)["count"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["season"], data=hourAggregated, join=True,ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season",label='big')

hourAggregated = pd.DataFrame(dailyData.groupby(["hour","weekday"],sort=True)["count"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["weekday"],hue_order=hueOrder, data=hourAggregated, join=True,ax=ax3)
ax3.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Weekdays",label='big')

hourTransformed = pd.melt(dailyData[["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'])
hourAggregated = pd.DataFrame(hourTransformed.groupby(["hour","variable"],sort=True)["value"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["value"],hue=hourAggregated["variable"],hue_order=["casual","registered"], data=hourAggregated, join=True,ax=ax4)
ax4.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across User Type",label='big')

We have now explored the data visually in some depth, so let's build some models and see how closely we can predict the results.

**Filling 0's In windspeed Using Random Forest**

In [None]:
from scipy import stats

dataTrain = pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv")
#dataTrain = dataTrain5[np.abs(dataTrain5["count"]-dataTrain5["count"].mean())<=(3*dataTrain5["count"].std())] 


#dataTrain = dailyDataWithoutOutliers.copy(deep=True)
dataTest = pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv")

dataTrain.info()  # info() prints its report itself and returns None
display(dataTrain.sample(10))

In [None]:
display('Train columns with null values:\n', dataTrain.isnull().sum())
display("-"*10)

display('Test/Validation columns with null values:\n', dataTest.isnull().sum())
display("-"*10)

In [None]:
display(dataTrain.describe(include = 'all'))

In [None]:
display(dataTest.describe(include = 'all'))

Combine Train And Test

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
data = pd.concat([dataTrain, dataTest], ignore_index=True)

Feature Engineering

In [None]:
data["date"] = data.datetime.apply(lambda x : x.split()[0])
data["hour"] = data.datetime.apply(lambda x : x.split()[1].split(":")[0]).astype("int")
data["year"] = data.datetime.apply(lambda x : x.split()[0].split("-")[0])
data["weekday"] = data.date.apply(lambda dateString : datetime.strptime(dateString,"%Y-%m-%d").weekday())
data["month"] = data.date.apply(lambda dateString : datetime.strptime(dateString,"%Y-%m-%d").month)

Random Forest Model To Predict 0's In Windspeed

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Split rows by whether windspeed was recorded as 0; copy so we do not modify views of `data`.
dataWind0 = data[data["windspeed"]==0].copy()
dataWindNot0 = data[data["windspeed"]!=0].copy()
rfModel_wind = RandomForestRegressor()
windColumns = ["season","weather","humidity","month","temp","year","atemp"]
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed"])

# Replace the zero readings with the model's predictions and recombine.
wind0Values = rfModel_wind.predict(X= dataWind0[windColumns])
dataWind0["windspeed"] = wind0Values
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
data = pd.concat([dataWindNot0, dataWind0], ignore_index=True)
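
The imputation idea above (train on rows with a non-zero reading, predict the rows recorded as 0) can be sketched on synthetic data. Everything in this example is made up for illustration: the column relationship, the number of zeroed rows, and the `n_estimators` and `random_state` values. It assigns the predictions in place with a boolean mask, which keeps the original row order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "temp": rng.uniform(0, 35, n),
    "humidity": rng.uniform(20, 100, n),
})
# Synthetic relationship: windspeed falls as humidity rises, plus noise.
toy["windspeed"] = 30 - 0.2 * toy["humidity"] + rng.normal(0, 1, n)

# Pretend 40 readings were recorded as 0.
zero_rows = rng.choice(n, size=40, replace=False)
toy.loc[zero_rows, "windspeed"] = 0.0

is_zero = toy["windspeed"] == 0
predictors = ["temp", "humidity"]

# Fit on rows with a real reading, then fill in the zero rows in place.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(toy.loc[~is_zero, predictors], toy.loc[~is_zero, "windspeed"])
toy.loc[is_zero, "windspeed"] = model.predict(toy.loc[is_zero, predictors])

print((toy["windspeed"] == 0).sum())
```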

Coercing To Categorical Type

In [None]:
categoricalFeatureNames = ["season","holiday","workingday","weather","weekday","month","year","hour"]
numericalFeatureNames = ["temp","humidity","windspeed","atemp"]
dropFeatures = ['casual',"count","datetime","date","registered"]

In [None]:
for var in categoricalFeatureNames:
    data[var] = data[var].astype("category")

Splitting Train And Test Data

In [None]:
dataTrain1 = data[pd.notnull(data['count'])].sort_values(by=["datetime"])
dataTest1 = data[~pd.notnull(data['count'])].sort_values(by=["datetime"])
datetimecol = dataTest["datetime"]
yLablesRegistered = dataTrain1["registered"]
yLablesCasual = dataTrain1["casual"]

Dropping Unnecessary Variables

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Keep only training rows whose numeric features lie within 3 standard deviations of the mean.
features = ["temp","atemp","humidity","windspeed"]
dataTrain2  = dataTrain1[dataTrain1[features].apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

yLabels = dataTrain2["count"]

display(dataTrain2)

# Drop the target, the leakage columns and the raw date columns.
dataTrain3  = dataTrain2.drop(dropFeatures,axis=1)
dataTest2  = dataTest1.drop(dropFeatures,axis=1)

# Fit the min-max scaler on the training data only, then reuse it for the test data
# so that both sets are scaled with the same minimum and maximum.
scaler = MinMaxScaler()
dataTrain = scaler.fit_transform(dataTrain3)
display(dataTrain)
dataTest = scaler.transform(dataTest2)
display(dataTest)
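
A scaler is normally fitted on the training data and then reused, unchanged, on the test data, so that a given raw value maps to the same scaled value in both sets. A minimal sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features, three training rows, one test row.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max from the training rows
test_scaled = scaler.transform(test)        # reuses the training min/max

print(test_scaled)
```

Test values outside the training range land outside [0, 1] (the second feature here), which is expected and signals that the test data extends beyond what the model saw during training.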

In [None]:
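def rmsle(y, y_, convertExp=True):
    """Root mean squared logarithmic error.

    With convertExp=True the inputs are assumed to be log1p-transformed
    and are mapped back to counts with expm1 first.
    """
    if convertExp:
        y = np.expm1(y)
        y_ = np.expm1(y_)
    log1 = np.nan_to_num(np.log1p(np.asarray(y, dtype=float)))
    log2 = np.nan_to_num(np.log1p(np.asarray(y_, dtype=float)))
    return np.sqrt(np.mean((log1 - log2) ** 2))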
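
Because the models in this notebook are trained on `np.log1p(count)`, it helps to see that RMSLE on raw counts is the same quantity as plain RMSE on the log1p-transformed values. The sketch below uses a hypothetical helper `rmsle_counts` and three made-up values chosen so that each non-zero log difference equals ln 2.

```python
import numpy as np

def rmsle_counts(actual, predicted):
    """RMSLE computed directly on raw, non-negative counts (illustrative helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

actual = np.array([0, 9, 99])
predicted = np.array([1, 9, 49])
score = rmsle_counts(actual, predicted)

# The same number falls out as the RMSE between the log1p-transformed values.
log_actual, log_pred = np.log1p(actual), np.log1p(predicted)
rmse_log = np.sqrt(np.mean((log_pred - log_actual) ** 2))
assert np.isclose(score, rmse_log)

# Round trip: expm1 maps log-space values back to counts without changing the score.
assert np.isclose(rmsle_counts(np.expm1(log_actual), np.expm1(log_pred)), score)
print(score)
```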

In [None]:
summary_data = {
    'ModelName': [],
    'RootMeanSquaredError': [],
    'RSquare': [],
    'RMSLE': [],
    'TrainingScore': [],
    'TestingScore': [],
    'IdealCondition': [],
}

df_summary = pd.DataFrame(summary_data)
df_summary

In [None]:
def mean_squared_error(actual, pred): 
    actual, pred = np.array(actual), np.array(pred)
    return np.square(np.subtract(actual,pred)).mean() 

In [None]:
def rmse(actual, pred): 
    actual, pred = np.array(actual), np.array(pred)
    return np.sqrt(np.square(np.subtract(actual,pred)).mean())

**Regression Tree with Pruning**

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import r2_score


yLabelsLog = np.log1p(yLabels)


reg = DecisionTreeRegressor()
parameters = {'max_depth':[5,6,7,8,9,10]}

reg = RandomizedSearchCV(reg,parameters, cv = 10,refit = True)
reg.fit(dataTrain, yLabelsLog)     

preds = reg.predict(X= dataTrain)

# The target was transformed with log1p, so expm1 is the matching inverse.
rmsle_1 = rmsle(np.expm1(yLabelsLog), np.expm1(preds), False)
display("RMSLE Value: ", rmsle_1)

rmse_1 = rmse(yLabelsLog, preds)
display("Root Mean squared Error", rmse_1)

r2 = r2_score(yLabelsLog, preds)
display("R Square Value", r2)

train_score = reg.score(dataTrain, yLabelsLog)
display("Training Score", train_score)

predsTest = reg.predict(X= dataTest)
# The Kaggle test set has no "count" labels, so it cannot be scored directly.
# Report the mean cross-validated score of the best estimator instead.
test_score = reg.best_score_
display("Testing Score", test_score)

new_row = {'ModelName':'Regression Tree with Pruning', 'RootMeanSquaredError':rmse_1, 'RSquare':r2, 'RMSLE':rmsle_1,'TrainingScore':train_score, 'TestingScore': test_score   ,'IdealCondition': 'The most ideal result would be an RMSE value of zero and R-squared value of 1'}

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df_summary = pd.concat([df_summary, pd.DataFrame([new_row])], ignore_index=True)

# Compare the distribution of the training counts with the predicted test counts.
fig,(ax1,ax2)= plt.subplots(ncols=2)
fig.set_size_inches(12,5)
sn.distplot(yLabels,ax=ax1,bins=50)
sn.distplot(np.expm1(predsTest),ax=ax2,bins=50)
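
Limiting `max_depth`, as above, restricts the tree while it grows. A different notion of pruning, shown here only as an alternative and not as the method used in this notebook, is scikit-learn's cost-complexity pruning, which grows a full tree and then cuts it back according to `ccp_alpha`. The sketch below runs on synthetic data from `make_regression`; the sample size, noise level and the thinning step of 20 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the bike-sharing data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow an unrestricted tree and ask for its candidate pruning strengths.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Clip tiny negative values from floating-point error, then thin the candidates.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))[::20]

# Pick the pruning strength by cross-validation.
search = GridSearchCV(DecisionTreeRegressor(random_state=0), {"ccp_alpha": alphas}, cv=5)
search.fit(X, y)
pruned = search.best_estimator_

print(full_tree.get_n_leaves(), pruned.get_n_leaves())
```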

**Regularization Model - Ridge**

In [None]:
from sklearn.model_selection import GridSearchCV

ridge_m_ = Ridge()
ridge_params_ = { 'max_iter':[3000],'alpha':[0.1, 1, 2, 3, 4, 10, 30,100,200,300,400,800,900,1000]}
rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)
grid_ridge_m = GridSearchCV( ridge_m_,
                          ridge_params_,
                          scoring = rmsle_scorer,
                          cv=10)
yLabelsLog = np.log1p(yLabels)

grid_ridge_m.fit( dataTrain, yLabelsLog )
preds = grid_ridge_m.predict(X= dataTrain)

rmse_1 = np.sqrt(mean_squared_error(yLabelsLog,preds))

display(grid_ridge_m.best_params_)
r2 = r2_score(yLabelsLog, preds)
# The target was transformed with log1p, so expm1 is the matching inverse.
rmsle_1 = rmsle(np.expm1(yLabelsLog), np.expm1(preds), False)

display("Root Mean squared Error", rmse_1)
display("R Square Value", r2)
display("RMSLE Value: ", rmsle_1)

# The search uses rmsle_scorer with greater_is_better=False, so these scores
# are negated RMSLE values: the closer to 0, the better.
train_score = grid_ridge_m.score(dataTrain, yLabelsLog)
display("Training Score", train_score)

predsTest = grid_ridge_m.predict(X= dataTest)
# The Kaggle test set has no "count" labels, so it cannot be scored directly.
# Report the mean cross-validated score of the best estimator instead.
test_score = grid_ridge_m.best_score_
display("Testing Score", test_score)

new_row = {'ModelName':'Regularization Model - Ridge', 'RootMeanSquaredError':rmse_1, 'RSquare':r2, 'RMSLE':rmsle_1,'TrainingScore':train_score, 'TestingScore': test_score, 'IdealCondition': 'The most ideal result would be an RMSE value of zero and R-squared value of 1'}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df_summary = pd.concat([df_summary, pd.DataFrame([new_row])], ignore_index=True)

fig,(ax1,ax2)= plt.subplots(ncols=2)
fig.set_size_inches(12,5)
sn.distplot(yLabels,ax=ax1,bins=50)
sn.distplot(np.expm1(predsTest),ax=ax2,bins=50)

**Regularization Model - Lasso**

In [None]:
from sklearn.model_selection import cross_val_score


lasso_m_ = Lasso()

alpha  = 1/np.array([0.1, 1, 2, 3, 4, 10, 30,100,200,300,400,800,900,1000])
lasso_params_ = { 'max_iter':[3000],'alpha':alpha}

grid_lasso_m = GridSearchCV( lasso_m_,lasso_params_,scoring = rmsle_scorer,cv=10)
yLabelsLog = np.log1p(yLabels)
grid_lasso_m.fit( dataTrain, yLabelsLog )
preds = grid_lasso_m.predict(X= dataTrain)
display(grid_lasso_m.best_params_)

rmse_1 = np.sqrt(mean_squared_error(yLabelsLog,preds))

r2 = r2_score(yLabelsLog, preds)
# The target was transformed with log1p, so expm1 is the matching inverse.
rmsle_1 = rmsle(np.expm1(yLabelsLog), np.expm1(preds), False)

display("Root Mean squared Error", rmse_1)
display("R Square Value", r2)
display("RMSLE value: ", rmsle_1)

# The search uses rmsle_scorer with greater_is_better=False, so these scores
# are negated RMSLE values: the closer to 0, the better.
train_score = grid_lasso_m.score(dataTrain, yLabelsLog)
display("Training Score", train_score)

predsTest = grid_lasso_m.predict(X= dataTest)
# The Kaggle test set has no "count" labels, so it cannot be scored directly.
# Report the mean cross-validated score of the best estimator instead.
test_score = grid_lasso_m.best_score_
display("Testing Score", test_score)

new_row = {'ModelName':'Regularization Model - Lasso', 'RootMeanSquaredError':rmse_1, 'RSquare':r2, 'RMSLE':rmsle_1, 'TrainingScore':train_score, 'TestingScore': test_score, 'IdealCondition': 'The most ideal result would be an RMSE value of zero and R-squared value of 1'}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df_summary = pd.concat([df_summary, pd.DataFrame([new_row])], ignore_index=True)

fig,(ax1,ax2)= plt.subplots(ncols=2)
fig.set_size_inches(12,5)
sn.distplot(yLabels,ax=ax1,bins=50)
sn.distplot(np.expm1(predsTest),ax=ax2,bins=50)
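
The practical difference between the two regularized models is easiest to see in their coefficients: Ridge (L2 penalty) shrinks every coefficient towards zero, while Lasso (L1 penalty) can set some of them exactly to zero. A minimal sketch on synthetic data, where only the first two of five features influence the target (the data and the `alpha` values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter in this synthetic target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge keeps small non-zero weights on the irrelevant features;
# Lasso drops them entirely.
print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```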

**Ensemble Models - Random Forest**

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfModel = RandomForestRegressor(n_estimators=100)
yLabelsLog = np.log1p(yLabels)
rfModel.fit(dataTrain,yLabelsLog)
preds = rfModel.predict(X= dataTrain)

rmse_1 = np.sqrt(mean_squared_error(yLabelsLog,preds))

r2 = r2_score(yLabelsLog, preds)
# The target was transformed with log1p, so expm1 is the matching inverse.
rmsle_1 = rmsle(np.expm1(yLabelsLog), np.expm1(preds), False)

display("Root Mean squared Error", rmse_1)
display("R Square Value", r2)
display("RMSLE Value: ", rmsle_1)

train_score = rfModel.score(dataTrain, yLabelsLog)
display("Training Score", train_score)

predsTest = rfModel.predict(X= dataTest)
output = predsTest

# The Kaggle test set has no "count" labels, so it cannot be scored directly.
# Report the mean 5-fold cross-validated R^2 on the training data instead.
test_score = cross_val_score(rfModel, dataTrain, yLabelsLog, cv=5).mean()
display("Testing Score", test_score)

new_row = {'ModelName':'Ensemble Models - Random Forest', 'RootMeanSquaredError':rmse_1, 'RSquare':r2, 'RMSLE':rmsle_1, 'TrainingScore':train_score, 'TestingScore': test_score, 'IdealCondition': 'The most ideal result would be an RMSE value of zero and R-squared value of 1'}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df_summary = pd.concat([df_summary, pd.DataFrame([new_row])], ignore_index=True)

fig,(ax1,ax2)= plt.subplots(ncols=2)
fig.set_size_inches(12,5)
sn.distplot(yLabels,ax=ax1,bins=50)
sn.distplot(np.expm1(predsTest),ax=ax2,bins=50)

**Ensemble Model - Gradient Boost**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
# Note: `alpha` only affects the "huber" and "quantile" losses; with the default
# squared-error loss it is ignored. The shrinkage parameter is `learning_rate`.
gbm = GradientBoostingRegressor(n_estimators=4000, alpha=0.01)  # Test 0.41
yLabelsLog = np.log1p(yLabels)
gbm.fit(dataTrain,yLabelsLog)
preds = gbm.predict(X= dataTrain)

rmse_1 = np.sqrt(mean_squared_error(yLabelsLog,preds))

r2 = r2_score(yLabelsLog, preds)
# The target was transformed with log1p, so expm1 is the matching inverse.
rmsle_1 = rmsle(np.expm1(yLabelsLog), np.expm1(preds), False)

display("Root Mean squared Error", rmse_1)
display("R Square Value", r2)
display("RMSLE value: ", rmsle_1)
train_score = gbm.score(dataTrain, yLabelsLog)
display("Training Score", train_score)
predsTest = gbm.predict(X= dataTest)
# The Kaggle test set has no "count" labels, so it cannot be scored directly.
# Report the mean 5-fold cross-validated R^2 on the training data instead.
test_score = cross_val_score(gbm, dataTrain, yLabelsLog, cv=5).mean()
display("Testing Score", test_score)

new_row = {'ModelName':'Ensemble Model - Gradient Boost', 'RootMeanSquaredError':rmse_1, 'RSquare':r2, 'RMSLE':rmsle_1, 'TrainingScore':train_score, 'TestingScore': test_score, 'IdealCondition': 'The most ideal result would be an RMSE value of zero and R-squared value of 1'}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df_summary = pd.concat([df_summary, pd.DataFrame([new_row])], ignore_index=True)

fig,(ax1,ax2)= plt.subplots(ncols=2)
fig.set_size_inches(12,5)
sn.distplot(yLabels,ax=ax1,bins=50)
sn.distplot(np.expm1(predsTest),ax=ax2,bins=50)

Let's compare the distribution of the training counts with the distribution of the predicted test counts. The two look broadly similar, which suggests visually that the model's predictions are reasonable and that it does not suffer from a major overfitting problem.
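
A visual comparison like this can be complemented by a numeric check on held-out data. The sketch below is not part of the original workflow: it uses synthetic data from `make_regression` and an arbitrary 75/25 split to show how a gap between the training score and the validation score exposes overfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the bike-sharing data).
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# Hold out 25% of the labelled rows; the model never sees them during fitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)

# A large gap between the two scores is a sign of overfitting.
print(f"train R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}")
```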