<a href="https://colab.research.google.com/github/yasharma09/Bike-Sharing-Demand-Prediction-ML-project1/blob/main/yash_Bike_Sharing_Demand_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual(Yash Sharma)

# **Project Summary -**

Bike-sharing systems are a means of renting bicycles.
Rental bikes have now been introduced in many cities to improve mobility and comfort.



# **GitHub Link -**

https://github.com/yasharma09/Bike-Sharing-Demand-Prediction-ML-project1

# **Problem Statement**


It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing a stable supply of rental bikes therefore becomes a major concern for growing a bike-sharing business. The crucial part is predicting the bike count required at each hour so that the supply stays stable.
Bike rental behavior is strongly influenced by environmental and seasonal settings. For instance, weather conditions, day of the week, season and hour of the day can all affect rental behavior.
Therefore, the proposed model will predict the demand for rental bikes given information about the weather and time of the day.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.tree import export_graphviz
from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

from prettytable import PrettyTable

%matplotlib inline
sns.set()

import warnings 
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
dataset = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Projects/ML regression Project/Bike sharing project/SeoulBikeData.csv', sep=',', encoding='latin')

In [None]:
dataset

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'We have total {dataset.shape[0]}  rows')
print(f'We have total {dataset.shape[1]}  columns')

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

We have a good mix of categorical and numerical features (mostly numerical) in our dataset.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"we have {dataset.duplicated().sum()} duplicate values")

We don't have any duplicate rows in our dataset, which is a good thing.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

In [None]:
# Visualizing the missing values
# There are no missing values, so this heatmap would not show anything.
# The code is kept here (commented out) for reference.
# plt.figure(figsize=(15, 5))
# sns.heatmap(dataset.isnull(), cbar=True, yticklabels=False)
# plt.xlabel("Column_Name", size=14, weight="bold")
# plt.title("Places of missing values in column", fontweight="bold", size=17)
# plt.show()

### What did you know about your dataset?

So far we know that the dataset contains the number of bikes rented per hour along with date information.
* It contains 8760 rows and 14 columns. The columns (features) include:
  * Date
  * Count of bikes rented
  * Hour of the day
  * Weather conditions (temperature, humidity, season, etc.)
* There are no missing or duplicate values in the dataset.




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe().transpose()
#I am using transpose method for better view.

### Variables Description 

**Date** - Date on which bikes are rented

**Rented Bike count** - Count of bikes rented at each hour

**Hour** - Hour of the day (0-23)

**Temperature** - Temperature in °C

**Humidity** - Humidity in %

**Windspeed** - Wind speed in m/s

**Visibility** - Visibility in units of 10 m

**Dew Point Temperature** - Dew point temperature in °C

**Solar Radiation** - Solar radiation in MJ/m2

**Rainfall** - Rainfall in mm

**Snowfall** - Snowfall in cm

**Seasons** - Season in which the bike was rented
  1.   spring
  2.   summer
  3.   fall
  4.   winter

**Holiday** - Whether a holiday or not

**Functional Day** - Whether a functional day or not


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# converting date variable in to datetime datatype
dataset['Date'] = dataset['Date'].apply(lambda x: dt.strptime(x,'%d/%m/%Y'))
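
As an alternative to the row-by-row `strptime` call, pandas can parse the whole column in one vectorized call. The sketch below assumes the same day/month/year layout; the small `sample` frame is made-up illustrative data, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same day/month/year layout as the 'Date' column
sample = pd.DataFrame({'Date': ['01/12/2017', '02/12/2017', '30/11/2018']})

# Vectorized parsing with an explicit format string
sample['Date'] = pd.to_datetime(sample['Date'], format='%d/%m/%Y')

# Difference between the last and the first date in the sample
span = sample['Date'].max() - sample['Date'].min()
print(sample.dtypes)
print(span)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.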

In [None]:
# Number of days for which the data is collected
print('Number of days the data is collected: ',dataset['Date'].max()-dataset['Date'].min())

In [None]:
# Days between which the data is collected
print('Start date: ',dataset['Date'].min())
print('End date: ',dataset['Date'].max())



*   The dataset is from a rental bike company based in Seoul. The goal of this project is to develop a machine learning model that can predict the demand for rental bikes.

*   The dataset contains hourly weather conditions covering one year (the first and last dates are 364 days apart), along with other details such as whether a given day was a holiday.

*   The dataset contains a total of 8760 records and 14 attributes. There are no duplicate records or missing values in the dataset.





We will rename the features to short names without spaces or units so that they are easier to reference in code.

In [None]:
# Renaming the columns
dataset.rename(columns= {'Date':'date','Rented Bike Count': 'rented_bike_count', 'Hour':'hour',
                    'Temperature(°C)':'temperature', 'Humidity(%)':'humidity',
                    'Wind speed (m/s)': 'wind_speed', 'Visibility (10m)': 'visibility',
                    'Dew point temperature(°C)':'dew_point_temp',
                    'Solar Radiation (MJ/m2)': 'solar_radiation', 'Rainfall(mm)': 'rainfall',
                    'Snowfall (cm)':'snowfall', 'Seasons':'seasons',
                    'Holiday':'holiday', 'Functioning Day':'func_day'},
          inplace=True)

In [None]:
dataset.columns

In [None]:
# Engineering new features 'month' and 'day_of_week' from 'date'
# Note: `df` is a second name for `dataset` (the same object); the cells below use both
df = dataset
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}



*   In a city, rental bike demand is likely to follow a different pattern over the weekend, when most people do not go to work.

*   To capture this trend, we can define a new feature 'weekend' which indicates whether a given day is a weekend (1) or not (0).



In [None]:
# engineering new feature 'weekend' from day_of_week
dataset['weekend'] = dataset['day_of_week'].apply(lambda x: 1 if x>4 else 0)
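
The same flag can also be computed without `apply`, using a vectorized comparison. This is only a sketch on a few hypothetical dates (the `demo` frame is not part of the dataset), shown to make the Monday=0 ... Sunday=6 convention concrete.

```python
import pandas as pd

# Hypothetical dates: a Friday, a Saturday, a Sunday and a Monday
demo = pd.DataFrame({'date': pd.to_datetime(
    ['2018-06-01', '2018-06-02', '2018-06-03', '2018-06-04'])})

demo['day_of_week'] = demo['date'].dt.dayofweek          # Monday=0 ... Sunday=6
demo['weekend'] = (demo['day_of_week'] > 4).astype(int)  # 1 for Saturday/Sunday
print(demo)
```

Only Saturday (5) and Sunday (6) are greater than 4, so those two rows get the value 1.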

### What all manipulations have you done and insights you found?

*   We had zero null values in our dataset.
*   Zero duplicate values found.
*   We changed the data type of the 'Date' column from 'object' to 'datetime64[ns]'. This was done for feature engineering.
*   We renamed the columns and created three new columns from the date: 'month', 'day_of_week' and 'weekend'. These are used for EDA, and the date column is dropped later.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - Analyzing the distribution of the dependent variable:

# defining dependent variable separately
dependent_variable = ['rented_bike_count']

# visualizing the distribution of the dependent variable - rental bike count
plt.figure(figsize=(12,5))
sns.distplot(df[dependent_variable])
plt.xlabel(dependent_variable[0])
plt.title(dependent_variable[0]+' distribution')
plt.axvline(df[dependent_variable[0]].mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(df[dependent_variable[0]].median(), color='cyan', linestyle='dashed', linewidth=2)

In [None]:
# skew of the dependent variable
df[dependent_variable].skew()

##### 1. Why did you pick the specific chart?

To check the skewness of the dependent variable

##### 2. What is/are the insight(s) found from the chart?

*   The dependent variable is positively skewed. Predictions are often better when the dependent variable is close to normally distributed.
*   To achieve this, we can transform the data (log, square root, etc.).

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This is not a negative insight: transforming the target (log or square root) can reduce the skew and may lead to better predictions.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart - 2 Log transformation & Square-root transformation:
# visualizing the distribution of dependent variable after log transformation
plt.figure(figsize=(10,5))
sns.distplot(np.log1p(df[dependent_variable]))
plt.xlabel(dependent_variable[0])
plt.title(dependent_variable[0]+' distribution')
plt.axvline(np.log1p(df['rented_bike_count']).mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(np.log1p(df['rented_bike_count']).median(), color='cyan', linestyle='dashed', linewidth=2)

In [None]:
# skew of the dependent variable after log transformation
np.log1p(df[dependent_variable]).skew()

After the log transformation the dependent variable is still skewed (now negatively), so let's try to reduce the skewness with the square root transformation instead.

In [None]:
# visualizing the distribution of dependent variable after sqrt transformation
plt.figure(figsize=(10,5))
sns.distplot(np.sqrt(df[dependent_variable]))
plt.xlabel(dependent_variable[0])
plt.title(dependent_variable[0]+' distribution')
plt.axvline(np.sqrt(df['rented_bike_count']).mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(np.sqrt(df['rented_bike_count']).median(), color='cyan', linestyle='dashed', linewidth=2)

In [None]:
# # skew of the dependent variable after sqrt transformation
np.sqrt(df[dependent_variable]).skew()

The skewness has decreased: after the log transformation the distribution was negatively (left) skewed, whereas after the square root transformation it looks much closer to symmetric.

##### 1. Why did you pick the specific chart?

We are trying to reduce the skewness with the help of a log transformation or a square root transformation.

##### 2. What is/are the insight(s) found from the chart?

We were able to reduce skewness on square root transformation. Hence we can use square root transformation during the modelling.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. A less skewed target obtained through the square root transformation should help the models predict demand more reliably. No insight here points to negative growth.
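
The comparison of transformations can be sketched on synthetic data. The example below uses a made-up, right-skewed gamma sample as a stand-in for hourly rental counts (it is not the real data) and prints the skewness of the raw, log-transformed and square-root-transformed values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for hourly rentals (not the real data)
counts = pd.Series(rng.gamma(shape=1.5, scale=400.0, size=5000))

skews = {
    'raw': counts.skew(),
    'log1p': np.log1p(counts).skew(),
    'sqrt': np.sqrt(counts).skew(),
}
for name, value in skews.items():
    print(f'{name:>5}: {value:.3f}')
```

On data like this, the square root typically leaves a small positive skew, while the log can overcorrect into a negative skew. Which transformation works best depends on the actual distribution.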

#### Chart - 3

In [None]:
# Chart - 3 Analyzing the distribution of continuous independent variables:
# defining continuous independent variables separately
continuous_var = ['temperature', 'humidity', 'wind_speed', 'visibility', 'solar_radiation', 'rainfall', 'snowfall']

# Analyzing the distribution of the continuous independent variables
for col in continuous_var:
  plt.figure(figsize=(9,4))
  sns.distplot(df[col])
  plt.axvline(df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(df[col].median(), color='cyan', linestyle='dashed', linewidth=2)
  plt.title(col+' distribution')
  plt.show()

##### 1. Why did you pick the specific chart?

To analyze the distribution of the continuous independent variables.

##### 2. What is/are the insight(s) found from the chart?



*   Normally distributed attributes: temperature, humidity.
*   Positively skewed attributes: wind_speed, solar_radiation, snowfall, rainfall.
*   Negatively skewed attributes: visibility.





##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it is going to be very helpful in modeling.

#### Chart - 4

In [None]:
#Chart -4 : Analyzing the relationship between dependent variable and continuous independent variables:
# Analyzing the relationship between the dependent variable and the continuous variables
for i in continuous_var:
  plt.figure(figsize=(10,5))
  plt.scatter(x=i,y=dependent_variable[0],data=df)
  plt.xlabel(i)
  plt.ylabel(dependent_variable[0])
  plt.title(i+' vs '+ dependent_variable[0])
  plt.show()

##### 1. Why did you pick the specific chart?

These scatter plots show the relationship between the dependent variable and each continuous independent variable.

##### 2. What is/are the insight(s) found from the chart?

*   Positively correlated variables: temperature, wind_speed, visibility, solar_radiation.
*   Negatively correlated variables: humidity, rainfall, snowfall.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It helps us identify which variables are positively and which are negatively correlated with demand.

#### Chart - 5 to 10

In [None]:
# Chart 5 - Analyzing the relationship between the dependent variable and categorical independent variables
plt.figure(figsize=(10,5))
sns.barplot(x='hour', y=dependent_variable[0], data=df)
plt.xlabel("Hour")
plt.ylabel("Rented bike count")
plt.title('Hour vs rented bike count')
plt.show()

Instead of repeating the same code for every categorical feature, we can use a **for loop**, which keeps the notebook shorter.

##### There are 5 charts within the same code, so there are 10 charts in total so far

In [None]:
#Chart 5 - Analyzing the relationship between dependent variable and categorical independent variables:
# defining categorical independent variables separately
categorical_var = ['hour','seasons', 'holiday', 'func_day', 'month', 'day_of_week', 'weekend']

In [None]:
# Analyzing the relationship between the dependent variable and the categorical variables
for i in categorical_var:
  plt.figure(figsize=(10,5))
  sns.barplot(x=i,y=dependent_variable[0],data=df)
  plt.xlabel(i)
  plt.ylabel(dependent_variable[0])
  plt.title(i+' vs '+ dependent_variable[0])
  plt.show()

In [None]:
# Highest rented bike count on a functioning day vs a non functioning day
dataset.groupby(['func_day'])['rented_bike_count'].max()

In [None]:
# Non functioning days in the dataset
df[df['func_day'] == 'No']['date'].unique()

###### 1. Why did you pick the specific chart?

These charts show how the dependent variable relates to each categorical independent variable.

###### 2. What is/are the insight(s) found from the chart?


1.   The number of bikes rented is on average higher during the rush hours.
2.   The rented bike count is highest during the summer and lowest during the winter.
3.   The rented bike count is higher on working days than on non-working days.
4.   On a non-functioning day, no bikes are rented in any instance in the data.
5.   The number of bikes rented remains roughly constant on average from Monday to Saturday and dips on Sunday. On average, the rented bike count is lower on weekends than on weekdays.



###### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These insights support a positive business impact: we learn at which hours, on which days and in which seasons bikes are rented the most.

**On a non functioning day, no bikes are rented in all the instances of the data.**

#### Chart - 7 - Outlier analysis:


In [None]:
for col in categorical_var:
  plt.figure(figsize=(10,5))
  sns.boxplot(x = col,y = dependent_variable[0],data=df)
  plt.title(col+' boxplot')
  plt.show()

##### 1. Why did you pick the specific chart?



*We use box plots to identify outliers in the data, since they show them very clearly.*

##### 2. What is/are the insight(s) found from the chart?




There are outliers in the data and this must be taken into consideration in the model building phase.
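
A box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on a few hypothetical values (`values` is made-up data, not taken from the dataset):

```python
import pandas as pd

# Hypothetical counts with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond the whiskers of a standard box plot
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

Counting such points per category is one way to quantify what the box plots show visually.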


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason



It shows that there are outliers in the data, which will be taken into consideration at the time of model building.

#### Chart - 11

In [None]:
# Chart - 11 visualization code for bike demand throughout the day, broken down by different attributes
for i in categorical_var:
  if i == 'hour':
    continue
  fig, ax = plt.subplots(figsize=(10,5))
  sns.pointplot(data=df, x='hour', y='rented_bike_count', hue=i, ax=ax)
  plt.title('Hourly bike demand broken down based on the attribute: '+i)
  plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',title=i)
  plt.show()

##### 1. Why did you pick the specific chart?

We use point plots because they show at what time of day, on which days and in which seasons bikes are needed the most.

##### 2. What is/are the insight(s) found from the chart?


*   In winter the overall demand for rented bikes is comparatively lower than in other seasons.
*   On a non-functioning day, no bikes are rented.
*   The demand for rented bikes throughout the day follows a different pattern on holidays and weekends than on other days. On regular days, demand is higher during rush hours. On holidays or weekends, demand is comparatively lower in the mornings and higher in the afternoons.






##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


These insights make the demand pattern for bikes very clear, so they will be very helpful.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Correlation magnitude for continuous variables
plt.figure(figsize=(15,8))
plt.title('Correlation Analysis')
correlation = df[continuous_var+dependent_variable].corr()
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

##### 1. Why did you pick the specific chart?



From the above graph, we can see that temperature and dew point temperature are highly correlated, with a correlation of 0.91. Hour also shows a good correlation with our dependent variable.




##### 2. What is/are the insight(s) found from the chart?

There is no multicollinearity among the continuous variables plotted in the heatmap.


#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
sns.pairplot(dataset)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
#Lets see our features again
df.columns

Since there are very few days with snowfall or rainfall, we convert these columns to binary categorical columns indicating whether there was rainfall / snowfall at that particular hour.

In [None]:
# Converting snowfall and rainfall to categorical attributes
df['snowfall'] = df['snowfall'].apply(lambda x: 1 if x>0 else 0)
df['rainfall'] = df['rainfall'].apply(lambda x: 1 if x>0 else 0)

When

1. Visibility >= 10 km ---> Clear (high visibility)
2. 4 km <= Visibility < 10 km ---> Haze (medium visibility)
3. Visibility < 4 km ---> Fog (low visibility)

We convert visibility based on the thresholds above. The column is recorded in units of 10 m, so the bin edges 399 and 999 correspond to roughly 4 km and 10 km. Since the categories are ordinal, we can encode them as 0 (low visibility), 1 (medium visibility), 2 (high visibility).

In [None]:
# encoding the visibility column
dataset['visibility'] = pd.cut(df.visibility,bins=[0,399,999,2001],labels=[0,1,2])
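
To make the binning concrete, here is a small sketch of how `pd.cut` assigns values at the bin edges. The `vis` series holds hypothetical readings in units of 10 m, not rows from the dataset. By default the bins are closed on the right, so 399 falls into level 0 and 400 into level 1.

```python
import pandas as pd

# Hypothetical visibility readings in units of 10 m (2000 corresponds to 20 km)
vis = pd.Series([150, 399, 400, 999, 1000, 2000])

# Same bin edges as in the cell above: (0, 399] -> 0, (399, 999] -> 1, (999, 2001] -> 2
vis_level = pd.cut(vis, bins=[0, 399, 999, 2001], labels=[0, 1, 2])
print(pd.DataFrame({'visibility': vis, 'level': vis_level}))
```

Note that a reading of exactly 0 would fall outside the first bin and become missing unless `include_lowest=True` is passed.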

'month', 'day_of_week' and 'hour' are nominal categorical variables, so we need to encode them.

In [None]:
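# one hot encoding
# Note: get_dummies returns a new DataFrame, so from here on `df` is a separate
# object and `dataset` keeps the original 'month', 'hour' and 'day_of_week' columns
df = pd.get_dummies(df, columns = ['month', 'hour','day_of_week'])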

In [None]:
df.columns
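
One detail worth illustrating: `pd.get_dummies` returns a new DataFrame and leaves its input untouched. In this notebook `df` and `dataset` were two names for the same object up to this point, so after the reassignment they differ. The sketch below uses a toy frame (`toy`, `alias` and `encoded` are hypothetical names) to show the effect.

```python
import pandas as pd

toy = pd.DataFrame({'hour': [0, 1, 0], 'temperature': [1.5, 2.0, 3.1]})
alias = toy                                      # second name for the same object

# One-hot encode 'hour'; the result is a new DataFrame
encoded = pd.get_dummies(toy, columns=['hour'])

print(list(toy.columns))      # original frame is unchanged
print(list(encoded.columns))  # encoded columns are appended after the others
```

Whichever frame is later used to build the feature matrix determines whether the model sees the one-hot columns or the original integer columns.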

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# NO MISSING VALUES ARE THERE IN OUR DATA SET SO WE WILL SKIP THIS PART

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
sns.set(font_scale=1.0)
fig, axes = plt.subplots(nrows=4,ncols=2)
fig.set_size_inches(15, 15)
sns.boxplot(data=dataset,y="rented_bike_count",x="humidity",orient="v",ax=axes[0][0])
sns.boxplot(data=dataset,y="rented_bike_count",x="hour",orient="v",ax=axes[0][1])
sns.boxplot(data=dataset,y="rented_bike_count",x="temperature",orient="v",ax=axes[1][0])
sns.boxplot(data=dataset,y="rented_bike_count",x="wind_speed",orient="v",ax=axes[1][1])
sns.boxplot(data=dataset,y="rented_bike_count",x="visibility",orient="v",ax=axes[2][0])
sns.boxplot(data=dataset,y="rented_bike_count",x="seasons",orient="v",ax=axes[2][1])
sns.boxplot(data=dataset,y="rented_bike_count",x="holiday",orient="v",ax=axes[3][0])
sns.boxplot(data=dataset,y="rented_bike_count",x="solar_radiation",orient="v",ax=axes[3][1])

Since we have encoded the 'month' and 'day_of_week' attributes, we no longer need 'weekend' and 'seasons' attributes since they essentially convey similar information.

In [None]:
# dropping seasons and weekend
dataset.drop(['seasons','weekend'],axis=1, inplace=True)

In [None]:
dataset.head()

### 3. Categorical Encoding

In [None]:
#Encoding the data to fit a model:

# encoding
dataset['func_day'] = np.where(dataset['func_day'] == 'Yes',1,0)
dataset['holiday'] = np.where(dataset['holiday'] == 'Holiday', 1,0)

In [None]:
# Dropping date attribute
dataset.drop('date',axis=1,inplace=True)

The date column cannot be used directly to build an ML model, and its information is already captured by the derived features. Hence we can drop it.

In [None]:
# Defining dependent and independent variables
X = dataset.drop('rented_bike_count',axis=1)
y = np.sqrt(df[dependent_variable])

In [None]:
# shape of dataframe
df.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

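Binary (0/1) encoding for 'holiday', 'func_day', 'rainfall' and 'snowfall', since each takes two values. Ordinal encoding (0, 1, 2) for the binned 'visibility', since its levels are ordered. One-hot encoding for the nominal variables 'month', 'hour' and 'day_of_week'.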


### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)

##### What data splitting ratio have you used and why? 

We hold out 20% of the data as a test set. Since the dataset is compact, with just 8760 records and 18 attributes, we tune hyperparameters with K-fold cross validation on the training portion rather than carving out a separate validation split.
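
As a rough illustration of K-fold cross validation, the sketch below scores a decision tree on six folds of synthetic data. `X_demo` and `y_demo` are made-up arrays, and the `min_samples_leaf` value is an arbitrary choice for the example, not a tuned value from this project.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 3))              # synthetic features
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 0.5, 300)   # synthetic target

# Six folds: each fold is used once for validation and five times for training
cv = KFold(n_splits=6, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(min_samples_leaf=10, random_state=0),
    X_demo, y_demo, cv=cv, scoring='neg_root_mean_squared_error')

fold_rmse = -scores    # scikit-learn returns negated RMSE so that higher is better
print(fold_rmse.round(3), round(fold_rmse.mean(), 3))
```

Averaging over folds gives a more stable error estimate than a single validation split, which matters for a dataset of this size.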

## ***7. ML Model Implementation***

### Evaluation Metric



*   We know that the data contains outliers. We didn't drop them because doing so could lose important trends/patterns in the data.
*   Decision trees and the other tree-based algorithms used here are known to handle outliers well. Hence we can use RMSE as the evaluation metric.

*   Since RMSE penalizes large errors heavily, it is a good metric for checking whether or not the model has learnt the trends/patterns in the data.
*   In addition to RMSE, we can use the R2 score to make the results more explainable to a larger audience.






In [None]:
# defining rmse evaluation metric
def rmse(actual, predicted):
  '''
  Return the root mean squared error between actual and predicted values.
  '''
  mse = mean_squared_error(actual, predicted)
  return np.sqrt(mse)
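
Because the models are trained on the square root of the bike count, predictions have to be squared before computing RMSE if the error is to be read in units of bikes. A minimal sketch with hypothetical numbers (`actual_counts` and `pred_sqrt` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual counts and predictions made on the square-root scale
actual_counts = np.array([100.0, 400.0, 900.0])
pred_sqrt = np.array([11.0, 19.0, 30.0])

# Square the predictions to return to bike counts before scoring
pred_counts = np.square(pred_sqrt)   # 121, 361 and 900 bikes
rmse_counts = np.sqrt(mean_squared_error(actual_counts, pred_counts))
print(round(rmse_counts, 3))
```

Scoring on the transformed scale instead would give a much smaller number that is not directly comparable to bike counts.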

### ML Model - 1 **Decision tree**

In [None]:
# ML Model - 1 Implementation
# Using gridsearchcv to find the hyperparameters with best predictions
# A full grown tree has a max depth of 28.
dt_model = DecisionTreeRegressor(random_state=0)
dt_params = {'max_depth':np.arange(20,26),
             'min_samples_leaf':np.arange(30,41,2)
             }

In [None]:
# tuning the hyperparameters with grid search
dt_gridsearch = GridSearchCV(dt_model,dt_params, cv=6, scoring= 'neg_root_mean_squared_error')
dt_gridsearch.fit(X_train,y_train)
dt_best_params = dt_gridsearch.best_params_

# model best parameters
dt_best_params

In [None]:
# building DT model with best parameters
dt_model = DecisionTreeRegressor(max_depth=dt_best_params['max_depth'], min_samples_leaf=dt_best_params['min_samples_leaf'], random_state=0)

In [None]:
# fitting model
dt_model.fit(X_train,y_train)

In [None]:
# dt train predictions
dt_y_train_pred = dt_model.predict(X_train)

In [None]:
# dt test predictions
dt_y_test_pred = dt_model.predict(X_test)

In [None]:
from sklearn.metrics import r2_score

In [None]:
# train score
dt_train_r2_score = r2_score(np.square(y_train),np.square(dt_y_train_pred))
dt_train_r2_score

# test score
dt_test_r2_score = r2_score(np.square(y_test),np.square(dt_y_test_pred))
dt_test_r2_score

In [None]:
# training rmse
dt_train_rmse = rmse(np.square(y_train),np.square(dt_y_train_pred))
dt_train_rmse

# test rmse
dt_test_rmse = rmse(np.square(y_test),np.square(dt_y_test_pred))
dt_test_rmse 

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

A decision tree is a low-bias, high-variance model. If we fit a decision tree on a dataset without tuning the hyperparameters, we get zero RMSE on the training data and a high RMSE on the test data.
The R2 score is likewise 1 on the training data and significantly lower on the test data.
Our aim is to build a generalized model that is able to predict the dependent variable for unseen data with less error.
To achieve this, we tune the decision tree hyperparameters, thereby reducing the model complexity, which in turn improves predictions for the test data.
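
The effect described above can be reproduced on synthetic data. The sketch below compares a fully grown tree with one constrained by `min_samples_leaf`. The data and the value 30 are assumptions chosen for the illustration and do not come from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(600, 2))                 # synthetic features
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, 600)    # noisy synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for name, leaf in [('full', 1), ('constrained', 30)]:
    model = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    results[name] = (train_rmse, test_rmse)
    print(f'{name:>11}: train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}')
```

The fully grown tree memorizes the training rows, so its training RMSE is zero. The constrained tree gives up some training accuracy in exchange for a simpler model.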

In [None]:
# Predicted vs actual values of dependent variable
plt.figure(figsize=(10,5))
plt.scatter(x=np.square(y_test),y=np.square(dt_y_test_pred))
plt.xlabel('Actual Rented Bike Count')
plt.ylabel('Predicted Rented Bike Count')
plt.title('Actual vs Predicted values of dependent variable using: DECISION TREE')

In [None]:
# Decision tree diagram
graph = Source(tree.export_graphviz(dt_model,
                                    out_file=None,
                                    feature_names=X_train.columns,
                                    filled= True))
display(SVG(graph.pipe(format='svg')))

In [None]:
# Feature importances

dt_feat_imp = pd.Series(dt_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: DECISION TREE')
plt.xlabel('Relative Importance')
dt_feat_imp.nlargest(20).plot(kind='barh')

### ML Model - 2 **Random Forest**

In [None]:
# random forest model
rf_model = RandomForestRegressor(random_state=0)
rf_params = {'n_estimators':[500],                    # limited due to computational power availability
             'min_samples_leaf':np.arange(25,31)}     # Approximate range after fitting a decision tree model

In [None]:
# fitting a rf model with best parameters obtained from gridsearch
rf_gridsearch = GridSearchCV(rf_model,rf_params,cv=6,scoring='neg_root_mean_squared_error')
rf_gridsearch.fit(X_train,y_train)
rf_best_params = rf_gridsearch.best_params_

In [None]:
# best parameters for random forests
rf_best_params

In [None]:
# Fitting RF model with best parameters
rf_model = RandomForestRegressor(n_estimators=rf_best_params['n_estimators'],
                                 min_samples_leaf=rf_best_params['min_samples_leaf'],
                                 random_state=0)

In [None]:
# fit
rf_model.fit(X_train,y_train)

In [None]:
# rf predictions on train data
rf_y_train_pred = rf_model.predict(X_train)

In [None]:
# rf predictions on test data
rf_y_test_pred = rf_model.predict(X_test)

In [None]:
# train score
rf_train_r2_score = r2_score(np.square(y_train),np.square(rf_y_train_pred))
rf_train_r2_score

In [None]:
# test score
rf_test_r2_score = r2_score(np.square(y_test),np.square(rf_y_test_pred))
rf_test_r2_score

In [None]:
# train rmse
rf_train_rmse = rmse(np.square(y_train),np.square(rf_y_train_pred))
rf_train_rmse

In [None]:
# test rmse
rf_test_rmse = rmse(np.square(y_test),np.square(rf_y_test_pred))
rf_test_rmse

In [None]:
# Feature importances

rf_feat_imp = pd.Series(rf_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: RANDOM FORESTS')
plt.xlabel('Relative Importance')
rf_feat_imp.nlargest(20).plot(kind='barh')

Temperature is the most important feature in predicting the value of the dependent variable for random forests, followed by humidity and func_day.

In [None]:
# Actual vs predicted values of dependent variables

plt.figure(figsize=(10,5))
plt.scatter(x=np.square(y_test),y=np.square(rf_y_test_pred))
plt.xlabel('Actual Rented Bike Count')
plt.ylabel('Predicted Rented Bike Count')
plt.title('Actual vs Predicted values of dependent variable using: RANDOM FOREST')

**Scatter plot of the actual and predicted values of the dependent variable on test data using random forests.**

### ML Model - 3 **Gradient Boosting**

In [None]:
# GBM model
gb_model = GradientBoostingRegressor(random_state=0)
gb_params = {'n_estimators':[500],
             'min_samples_leaf':np.arange(25,31)}

In [None]:
# finding best parameters
gb_gridsearch = GridSearchCV(gb_model,gb_params,cv=6,scoring='neg_root_mean_squared_error')
gb_gridsearch.fit(X_train,y_train)
gb_best_params = gb_gridsearch.best_params_

In [None]:
# GBM best parameters
gb_best_params

In [None]:
# Building GBM model with best parameters
gb_model = GradientBoostingRegressor(n_estimators=gb_best_params['n_estimators'],
                                     min_samples_leaf=gb_best_params['min_samples_leaf'],
                                     random_state=0)

In [None]:
# fit
gb_model.fit(X_train,y_train)

In [None]:
# gradient boosting train predictions
gb_y_train_pred = gb_model.predict(X_train)

In [None]:
# gradient boosting test predictions
gb_y_test_pred = gb_model.predict(X_test)

In [None]:
# train score
gb_train_r2_score = r2_score(np.square(y_train),np.square(gb_y_train_pred))
gb_train_r2_score

In [None]:
# test score
gb_test_r2_score = r2_score(np.square(y_test),np.square(gb_y_test_pred))
gb_test_r2_score

In [None]:
# train rmse
gb_train_rmse = rmse(np.square(y_train),np.square(gb_y_train_pred))
gb_train_rmse

In [None]:
# test rmse
gb_test_rmse = rmse(np.square(y_test),np.square(gb_y_test_pred))
gb_test_rmse 

In [None]:
# gradient boosting feature importances
gbm_feat_imp = pd.Series(gb_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: Gradient Boosting Machine (GBM)')
plt.xlabel('Relative Importance')
gbm_feat_imp.nlargest(20).plot(kind='barh')

Temperature is the most important feature in predicting the value of the dependent variable using gradient boosting, followed by func_day and humidity.



In [None]:
# Actual vs predicted values of dependent variables

plt.figure(figsize=(10,5))
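# square both axes so they show bike counts, as is done for the metrics above
plt.scatter(x=np.square(y_test),y=np.square(gb_y_test_pred))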
plt.xlabel('Actual Rented Bike Count')
plt.ylabel('Predicted Rented Bike Count')
plt.title('Actual vs Predicted values of dependent variable using: GRADIENT BOOSTING MACHINE (GBM)')

Scatter plot of the actual and predicted values of the dependent variable on test data using Gradient boosting.


### ML Model - 4 **XG Boost**

In [None]:
X_train = X_train.astype('float')
X_test = X_test.astype('float')

In [None]:
# xg boost
xgb_model = xgb.XGBRegressor(random_state=0,
                             objective='reg:squarederror')
# XGBoost has no min_samples_leaf parameter; min_child_weight is its closest analogue
# (for squared error it is the minimum number of samples in a leaf)
xgb_params = {'n_estimators':[500],
              'min_child_weight':np.arange(25,31)}

In [None]:
# finding best parameters
xgb_gridsearch = GridSearchCV(xgb_model,xgb_params,cv=6,scoring='neg_root_mean_squared_error')
xgb_gridsearch.fit(X_train,y_train)
xgb_best_params = xgb_gridsearch.best_params_

In [None]:
# xg boost best parameters
xgb_best_params

In [None]:
# Building an XG boost model with best parameters
xgb_model = xgb.XGBRegressor(n_estimators=xgb_best_params['n_estimators'],
                             min_child_weight=xgb_best_params['min_child_weight'],
                             objective='reg:squarederror',
                             random_state=0)

In [None]:
# fit
xgb_model.fit(X_train,y_train)

In [None]:
xgb_y_train_pred = xgb_model.predict(X_train)

In [None]:
xgb_y_test_pred = xgb_model.predict(X_test)

In [None]:
# train score
xgb_train_r2_score = r2_score(np.square(y_train),np.square(xgb_y_train_pred))
xgb_train_r2_score

In [None]:
# test score
xgb_test_r2_score = r2_score(np.square(y_test),np.square(xgb_y_test_pred))
xgb_test_r2_score

In [None]:
# train rmse
xgb_train_rmse = rmse(np.square(y_train),np.square(xgb_y_train_pred))
xgb_train_rmse 

In [None]:
# test rmse
xgb_test_rmse = rmse(np.square(y_test),np.square(xgb_y_test_pred))
xgb_test_rmse

In [None]:
# feature importance
xgb_feat_imp = pd.Series(xgb_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: XG Boost')
plt.xlabel('Relative Importance')
xgb_feat_imp.nlargest(20).plot(kind='barh')

Func_day is the most important feature in predicting the value of the dependent variable using XG Boost, followed by hour_4 and temperature. Unlike the other models, where a few top features dominate, the XG Boost model assigns significant importance to many features.

In [None]:
# Actual vs predicted values of dependent variables

plt.figure(figsize=(10,5))
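# square both axes so they show bike counts, as is done for the metrics above
plt.scatter(x=np.square(y_test),y=np.square(xgb_y_test_pred))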
plt.xlabel('Actual Rented Bike Count')
plt.ylabel('Predicted Rented Bike Count')
plt.title('Actual vs Predicted values of dependent variable using: XG BOOST')

Scatter plot of the actual and predicted values of the dependent variable on test data using XG boost.

# RESULTS:

In [None]:
# Summarizing the results obtained
test = PrettyTable(['Sl. No.','Regression Model', 'Train RMSE','Test RMSE','Train R2 Score (%)','Test R2 Score (%)'])
test.add_row(['1','Decision Tree',dt_train_rmse,dt_test_rmse,dt_train_r2_score*100,dt_test_r2_score*100])
test.add_row(['2','Random Forests',rf_train_rmse,rf_test_rmse,rf_train_r2_score*100,rf_test_r2_score*100])
test.add_row(['3','Gradient Boosting Method',gb_train_rmse,gb_test_rmse,gb_train_r2_score*100,gb_test_r2_score*100])
test.add_row(['4','XG Boost',xgb_train_rmse,xgb_test_rmse,xgb_train_r2_score*100,xgb_test_r2_score*100])
print(test)

In [None]:
# Plotting RMSEs

ML_models = ['Decision Tree','Random Forests','GBM','XG Boost']
train_rmses = [dt_train_rmse,rf_train_rmse,gb_train_rmse,xgb_train_rmse]
test_rmses = [dt_test_rmse,rf_test_rmse,gb_test_rmse,xgb_test_rmse]
  
X_axis = np.arange(len(ML_models))

plt.figure(figsize=(10,5))
plt.bar(X_axis - 0.2, train_rmses, 0.4, label = 'Train RMSE')
plt.bar(X_axis + 0.2, test_rmses, 0.4, label = 'Test RMSE')
  
plt.xticks(X_axis,ML_models)
plt.ylabel("RMSE")
plt.title("RMSE for each model")
plt.legend()
plt.show()

The XG boost model was able to predict the dependent variable with the lowest test RMSE.

In [None]:
# Plotting R2 scores

ML_models = ['Decision Tree','Random Forests','GBM','XG Boost']
train_r2_scores = [dt_train_r2_score,rf_train_r2_score,gb_train_r2_score,xgb_train_r2_score]
test_r2_scores = [dt_test_r2_score,rf_test_r2_score,gb_test_r2_score,xgb_test_r2_score]
  
X_axis = np.arange(len(ML_models))

plt.figure(figsize=(10,5))
plt.bar(X_axis - 0.2, train_r2_scores, 0.4, label = 'Train R2 Score')
plt.bar(X_axis + 0.2, test_r2_scores, 0.4, label = 'Test R2 Score')
  
plt.xticks(X_axis,ML_models)
plt.ylabel("R2 Score")
plt.title("R2 score for each model")
plt.legend()
plt.show()

The XG boost model was able to predict the dependent variable with the highest test R2 score.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

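We considered two evaluation metrics, both computed on the test data: RMSE, which measures the typical size of the prediction error, and the R2 score, which measures the share of variance in the dependent variable that the model explains. On these metrics we choose the XG Boost model, because it predicted the dependent variable with the highest test R2 score and the lowest test RMSE.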
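
As a rough illustration of how the two metrics behave, the sketch below computes RMSE and R2 by hand on a few made-up values. It assumes the target was square-root transformed before modelling, which is what the `np.square` calls in the scoring cells suggest. The helpers `rmse_demo` and `r2_demo` are hypothetical stand-ins and not the notebook's own `rmse` function.

```python
import numpy as np

def rmse_demo(y_true, y_pred):
    """Root mean squared error (hypothetical stand-in helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_demo(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up hourly counts, and predictions produced on the square-root scale.
counts = np.array([100.0, 400.0, 900.0, 1600.0])
sqrt_true = np.sqrt(counts)                      # 10, 20, 30, 40
sqrt_pred = np.array([11.0, 19.0, 31.0, 39.0])   # each off by 1 on the sqrt scale

# Error on the transformed scale versus error in actual bike counts.
rmse_sqrt = rmse_demo(sqrt_true, sqrt_pred)
rmse_count = rmse_demo(np.square(sqrt_true), np.square(sqrt_pred))
r2_count = r2_demo(np.square(sqrt_true), np.square(sqrt_pred))

print(rmse_sqrt, rmse_count, r2_count)
```

An error of one unit on the square-root scale becomes a much larger error in bike counts at busy hours, which is why the scores are reported after squaring both the targets and the predictions.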

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

* We trained four machine learning models on the training dataset and improved the performance of each through hyperparameter tuning.
* We started with a decision tree, mainly because it is easy to explain to stakeholders and quick to train.
* Once the decision tree was fit, the next goal was to improve prediction accuracy and reduce the errors in the predictions.
* To achieve this, we fit a random forest on the training data; its final predictions showed lower errors than those of the decision tree.
* To improve the predictions further, we fit two boosting models: gradient boosting machine (GBM) and extreme gradient boosting (XG Boost). Both showed errors in a similar range, and lower than those of the decision tree.
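
The tuning step described in the list above can be sketched as follows. This is a minimal example on synthetic data, not the notebook's actual grids: `X_demo`, `y_demo` and the small `param_grid` are assumptions chosen to keep it fast. It also shows that, with the default `refit=True`, `GridSearchCV` already exposes the winning model refit on the full training data as `best_estimator_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(240, 3))
y_demo = 10.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=240)

# Small illustrative grid; the notebook's real grids are larger.
param_grid = {"n_estimators": [30], "min_samples_leaf": [1, 5, 25]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_rmse = -search.best_score_      # the scorer is negated RMSE
tuned_model = search.best_estimator_    # already refit on all of X_demo
demo_pred = tuned_model.predict(X_demo[:5])

print(best_params, round(best_cv_rmse, 3))
```

Using `best_estimator_` gives the same model as rebuilding it by hand from `best_params_` and calling `fit` again, provided every tuned parameter is copied across.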

The XG Boost model has the lowest test RMSE and the highest test R2 score.



The final choice of model depends on the requirements:
* If having the most accurate model is essential, XG Boost is the best choice, since it has the lowest RMSE of the models built.
* However, as discussed above, the higher the model complexity, the lower the model explainability. If the predictions must be explained to stakeholders, XG Boost is not an ideal choice.
* In that case a decision tree can be used, since it is easier to explain. Choosing the simpler model means compromising on accuracy (the accuracy vs. interpretability tradeoff).
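
To make the interpretability side of the tradeoff concrete, the sketch below prints the decision rules of a shallow tree with `sklearn.tree.export_text`. The data is synthetic and the feature names `temperature` and `humidity` are used only for illustration; this is not the tuned decision tree from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (hypothetical): demand jumps when temperature exceeds 15.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(-10, 35, 300),   # "temperature"
    rng.uniform(0, 100, 300),    # "humidity"
])
y_demo = np.where(X_demo[:, 0] > 15, 30.0, 10.0) + rng.normal(scale=0.5, size=300)

# A depth-limited tree keeps the rule set small enough to read.
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Plain-text rendering of the learned if/else rules.
rules = export_text(shallow_tree, feature_names=["temperature", "humidity"])
print(rules)
```

A rule list like this can be handed to stakeholders directly, whereas a boosted ensemble of several hundred trees has no comparably compact description.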

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***