<a href="https://colab.research.google.com/github/yankita165/BIKE-SHARING-DEMAND-PROJECT/blob/main/BIKE_PREDCTION_DEMAND_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **PROJECT NAME - BIKE SHARING DEMAND PREDICTION**

##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

* A bike sharing system is a relatively new form of public transportation that allows users to rent a bike at one location and return it at another.
* Historical data on bike rentals and environmental factors will be utilized to create a model that optimizes inventory management and enhances customer satisfaction in the bike rental industry.
* By accurately predicting bike demand, this ML regression project aims to support bike rental companies in optimizing inventory management, reducing costs, and delivering better customer service. The insights gained from the project will enable data-driven decisions, such as adjusting rental prices, allocating resources, and improving overall operational efficiency.
* The data set we will use for this project contains 14 columns as variables: Date, Seasons, Holiday, Functioning Day, Hour, Rainfall, Snowfall, Rented Bike Count, Temperature, Humidity, Dew Point Temperature, Visibility, Solar Radiation and Wind Speed.
* We will use Python libraries such as Pandas, Seaborn, Numpy, and sklearn to develop our prediction algorithm. By testing and evaluating different models, we will determine which algorithm provides the most accurate predictions and can be deployed effectively in real world scenarios.
* Today, there is great interest in these systems due to their important role in traffic, environmental and health issues. Apart from the interesting real world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns bike sharing into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

#  **GitHub Link**

GitHub Link : https://github.com/yankita165/BIKE-SHARING-DEMAND-PROJECT

# **Problem Statement**


At present, numerous cities have introduced rental bikes as a means to improve mobility and convenience. Ensuring that rental bikes are available and accessible to the public in a timely manner is crucial, as it reduces waiting times. Consequently, the challenge lies in establishing a reliable and consistent supply of rental bikes for the city. The key aspect involves accurately predicting the number of bikes needed during each hour to maintain a stable provision of rental bikes.



# **Data Description**  

**The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.**


**Attribute Information:**

* Date : day/month/year (DD/MM/YYYY)

* Rented Bike count - Count of bikes rented at each hour

* Hour - Hour of the day

* Temperature-Temperature in Celsius

* Humidity - %

* Windspeed - m/s

* Visibility - 10m

* Dew point temperature - Celsius

* Solar radiation - MJ/m2

* Rainfall - mm

* Snowfall - cm

* Seasons - Winter, Spring, Summer, Autumn

* Holiday - Holiday/No holiday

* Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)





















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Let's import the Modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor


from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')

# **Dataset Loading**

In [None]:
# Load Dataset
# Mount Google Drive so the dataset can be imported
from google.colab import drive
drive.mount('/content/drive')



In [None]:
# Load the Seoul bike data set (the path below points to the Colab working directory)
bike_df=pd.read_csv("/content/SeoulBikeData (4).csv", encoding= 'latin')

# **Dataset First View**

In [None]:
# Display the top 5 rows of the dataset
bike_df.head()

In [None]:
# Display the bottom 5 rows of the dataset
bike_df.tail()

In [None]:
# Getting the shape of  dataset with rows and columns
print(bike_df.shape)

# **Dataset Columns count**

In [None]:
# Getting all the columns
print("Features of the dataset")
bike_df.columns

In [None]:
# Looking for the description of the dataset to get insights of the data
bike_df.describe().T

In [None]:
# Print the unique values
bike_df.nunique()

# **Dataset Information**

In [None]:
# Check details about the data set
bike_df.info()


# **Duplicate Values**

In [None]:
# Checking the duplicate values
dup=len(bike_df[bike_df.duplicated()])
print("The number of duplicate values in the data set is = ",dup)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count the missing values in each column (isna() and isnull() are aliases, so one call is enough)
bike_df.isna().sum()

In [None]:
# Visualizing the missing values
missing = pd.DataFrame((bike_df.isnull().sum())*100/bike_df.shape[0]).reset_index()
plt.figure(figsize = (16,5))
ax = plt.stem(missing['index'], missing[0])
plt.xticks(rotation = 90, fontsize = 7)
plt.title("Percentage of Missing Values")
plt.ylabel("PERCENTAGE")
plt.show()

* **The checks above show that the data set contains no missing values and no duplicate rows.**

In [None]:
#Rename the complex columns name
bike_df=bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

### What did you know about your dataset?

**This Dataset contains 8760 rows and 14 columns.**

## ***2. Understanding Your Variables***

# Breaking date column

In [None]:
# Convert the "Date" column from string to datetime so that "year", "month" and "day" can be extracted
bike_df['Date'] = bike_df['Date'].apply(lambda x:
                                        dt.datetime.strptime(x,"%d/%m/%Y"))


In [None]:
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
# Create a "weekdays_weekend" flag (1 = Saturday/Sunday, 0 = weekday), then drop "Date", "day" and "year"
bike_df['weekdays_weekend']=bike_df['day'].apply(lambda x : 1 if x in ('Saturday','Sunday') else 0 )
bike_df=bike_df.drop(columns=['Date','day','year'])

* **When pandas reads the 'Date' column, it interprets it as an object type, which is basically a string. Since the date is crucial for analyzing user behaviour, it needs to be converted to a datetime format. Afterward, it can be split into three distinct columns, namely "year", "month", and "day" (the day name).**

* **From these we keep "month" and derive a "weekdays_weekend" flag from "day"; the "Date", "day", and "year" columns are then dropped.**
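
The same date handling can be sketched with vectorised pandas calls instead of `apply`. This is a minimal illustration on a toy frame (the four dates and the name `toy` are assumptions made for the example, not rows taken from the dataset); it uses `pd.to_datetime` with an explicit format and the `dt.dayofweek` accessor, where Monday is 0 and Sunday is 6.

```python
import pandas as pd

# Toy dates in the same DD/MM/YYYY layout as the dataset (illustrative values only).
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"]})

# Vectorised parsing; the explicit format avoids day/month ambiguity.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

# Month number (1-12) extracted from the datetime column.
toy["month"] = toy["Date"].dt.month

# dayofweek: Monday=0 ... Sunday=6, so values 5 and 6 mark the weekend.
toy["weekdays_weekend"] = (toy["Date"].dt.dayofweek >= 5).astype(int)

print(toy)
```

Comparing `dayofweek` with 5 gives the same flag as matching the day names 'Saturday' and 'Sunday', without depending on the locale of the day names.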

In [None]:
bike_df.head()

In [None]:
bike_df.info()

In [None]:
bike_df['weekdays_weekend'].value_counts()

In [None]:
bike_df.describe()

# **Variable Description**



**Date**: The date of the day, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type: str; we need to convert it into datetime format.

**Rented Bike Count**: Number of rented bikes per hour. This is our dependent variable, the one we need to predict, type: int.

**Hour**: The hour of the day, from 0 to 23 (24-hour format), type: int; we need to convert it into a category data type.

**Temperature(°C)**: Temperature in Celsius, type: float

**Humidity(%)**: Humidity in the air in %, type: int

**Wind Speed**: Speed of the wind in m/s, type: float

**Visibility(10m)**: Visibility in units of 10 m, type: int

**Dew point temperature(°C)**: Temperature in Celsius at which the air becomes saturated with water vapour, type: float

**Solar Radiation(MJ/m2)**: Solar radiation received, in MJ/m2, type: float

**Rainfall(mm)**: Amount of rain in mm, type: float

**Snowfall(cm)**: Amount of snow in cm, type: float

**Seasons**: Season of the year, type: str; there are only 4 seasons in the data

**Holiday**: Whether the day is a holiday or not, type: str

**Functioning Day**: Whether the day is a Functioning Day or not, type: str


# Check Unique Values for each variable

In [None]:
# Check Unique Values for each variable.
bike_df.nunique()

# Changing data type

In [None]:
#Change the int64 column into category column
cols=['Hour','month','weekdays_weekend']
for col in cols:
    bike_df[col]= bike_df[col].astype('category')

In [None]:
#Let's check the result of data type
bike_df.info()

In [None]:
bike_df.columns

In [None]:
bike_df['weekdays_weekend'].unique()

# **Exploratory Data Analysis On The Data Set**


**What is a dependent variable in data analysis?**

  * **A dependent variable is a variable whose value changes depending on the value of another variable.**
  * **Our dependent variable is 'Rented Bike Count', so we need to analyse this column against the other columns using visualisation plots.**

#Month

In [None]:
#Analysing dataset by visualisation
fig, ax = plt.subplots(figsize=(15,8))
sns.pointplot(data=bike_df,x='month', y='Rented_Bike_Count',ax =ax)
ax.set(title='Count of Rented bikes according to Month')
plt.show()

* **Based on the point plot shown above, the demand for rented bikes is higher from May to October than in the other months. These months cover late spring, summer, and early autumn.**

## Working  Day


In [None]:
#Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=bike_df,x='Functioning_Day',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Functioning Day')
plt.show()

In [None]:
#Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(15,8))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='Functioning_Day',ax=ax)
ax.set(title='Count of Rented bikes according to Functioning_Day')

* **The bar plot and point plot displayed above depict the use of rented bikes on functioning and non-functioning days. It is evident from the plots that people do not use rented bikes on non-functioning days.**

## Hour

In [None]:
#Analysing dataset by visualisation
fig,ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=bike_df,x='Hour', y='Rented_Bike_Count', ax=ax)
ax.set(title='Count of Rented bikes according to Hour')
plt.show()


* **The plot above shows the usage of rented bikes across the hours of the day throughout the year. People tend to use rented bikes most around commuting hours, specifically from 7 AM to 9 AM and from 5 PM to 7 PM.**

#  Weekdays_weekend





In [None]:
# Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=bike_df,x='weekdays_weekend',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to weekdays and weekend ')

In [None]:
#Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(15,8))
sns.lineplot(data=bike_df,x='Hour',y= 'Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Count of Rented bikes according to weekdays_weekend')


* **Based on the line and bar plots above, we can observe that the demand for rented bikes is higher on weekdays, represented by the blue color, which is likely due to the increased demand for transportation to and from the office. The peak demand times during weekdays are between 7am-9am and 5pm-7pm. On weekends, represented by the orange color, the demand for rented bikes is generally lower, especially during the morning hours. However, in the evening, between 4pm-8pm, we can observe a slight increase in demand for rented bikes.**

#Seasons

In [None]:
#Analysing data by visualisation
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(data=bike_df,x='Seasons', y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Seasons')

In [None]:
#Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(15,8))
sns.pointplot(data=bike_df,x='Hour', y='Rented_Bike_Count',hue = 'Seasons',ax=ax)
ax.set(title='Count of Rented bikes according to seasons')


* **The bar plot and point plot presented above depict the usage of rented bikes across the four seasons. The use of rented bikes is significantly higher during the summer season, with peak demand during 7am-9am and 5pm-7pm. During the winter season, the use of rented bikes is quite low, likely because of cold weather and snowfall.**

#Holiday

In [None]:
#Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(10,5))
sns.boxplot(data=bike_df,x='Holiday',y='Rented_Bike_Count',ax=ax)
ax.set(title='Count of Rented bikes according to Holiday')

In [None]:
#Analysing dataset by visualisation
fig,ax=plt.subplots(figsize=(15,8))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue ='Holiday',ax=ax)
ax.set(title='Count of Rented bikes according to Holiday')

* **The box plot and point plot displayed above illustrate the usage of rented bikes during holidays, indicating that on holidays people tend to use rented bikes primarily between 2pm and 8pm.**

# **Analysis of Numerical Variables**

In [None]:
#Assign the numerical columns to a variable ('int64' is the NumPy integer dtype used by these columns)
num_features=bike_df.select_dtypes(['int64','float64']).columns
num_features

In [None]:
#Plot the distribution of all numerical features
for col in num_features:
  plt.figure(figsize=(10,6))
  # histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
  sns.histplot(bike_df[col], kde=True)
  plt.xlabel(col)
  plt.show()

#Numerical vs Rented_Bike_Count

In [None]:
#Print the plot to analyse the relationship between 'Rented_Bike_Count' and 'Temperature'
plt.figure(figsize=(10,6))
sns.scatterplot(x='Temperature',y='Rented_Bike_Count',data=bike_df)
plt.title('Temperature vs Rented Bike Count')
plt.show()

* **The plot above indicates that individuals tend to prefer biking when the temperature is relatively high, at around 25°C.**

In [None]:
#Print the plot to analyze  the relationship between 'Rented_Bike_Count' and 'Solar_Radiation'
plt.figure(figsize=(10,6))
sns.scatterplot(x='Solar_Radiation', y='Rented_Bike_Count', data=bike_df)
plt.title('Solar_Radiation vs Rented Bike Count')
plt.show()

* **The plot above indicates that the number of rented bikes significantly increases with the presence of solar radiation, reaching a count of approximately 1000.**

In [None]:
# Print the plot to analyze the relationship between 'Rented_Bike_Count' and 'Rainfall'
# Select the target column before mean() so that non-numeric columns are not averaged
bike_df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot()


* **The above plot of the mean rented bike count per rainfall value suggests that demand for rented bikes does not vanish even under heavy rainfall. For instance, even with a rainfall of 20mm, there is a noticeable peak in the number of rented bikes.**

In [None]:
# Print the plot to analyze the relationship between 'Rented_Bike_Count' and 'Snowfall'
plt.figure(figsize=(10,6))
sns.scatterplot(x='Snowfall', y='Rented_Bike_Count',data=bike_df)
plt.title('Snowfall vs Rented Bike Count')
plt.show()

* **The plot indicates that when the snowfall is more than 4cm, there is a significant drop in the number of rented bikes, as shown on the y-axis.**

In [None]:
# Print the plot to analyze the relationship between 'Rented_Bike_Count' and 'Wind_speed'
# Select the target column before mean() so that non-numeric columns are not averaged
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()



* **From the plot above, we can observe that the demand for rented bikes is fairly evenly distributed across wind speeds. However, there is a spike in bike rentals when the wind speed is at 7m/s, which may indicate that people still enjoy riding bikes when there is some wind.**

### **Regression plot**

* **Seaborn regression plots are designed to aid in exploratory data analysis by providing a visual aid that highlights patterns in a dataset. These plots, as their name implies, generate a regression line between two variables and assist in visualizing their linear relationship.**

In [None]:
# Printing the regression plot for all the numerical features
for col in num_features:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=bike_df[col],y=bike_df['Rented_Bike_Count'],scatter_kws={"color": 'red'}, line_kws={"color": "black"})


 * **The above regression plots for all the numerical features indicate that 'Temperature', 'Wind_speed', 'Visibility', 'Dew_point_temperature', and 'Solar_Radiation' are positively correlated with the target variable, that is, the rented bike count tends to increase as these features increase. On the other hand, 'Rainfall', 'Snowfall', and 'Humidity' are negatively correlated with the target variable, that is, the rented bike count tends to decrease as these features increase.**

###  ***Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Normalise Rented_Bike_Count column data**


In [None]:
# Distribution plot of Rented Bike Count
plt.figure(figsize=(10,6))
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
ax=sns.histplot(bike_df['Rented_Bike_Count'], kde=True, stat='density', color="y")
ax.set(xlabel='Rented_Bike_Count', ylabel='Density')
# Dashed lines mark the mean (magenta) and the median (black)
ax.axvline(bike_df['Rented_Bike_Count'].mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(bike_df['Rented_Bike_Count'].median(), color='black', linestyle='dashed', linewidth=2)
plt.show()

* **Based on the graph above, the Rented Bike Count has a moderately right-skewed distribution. Linear regression assumes normally distributed residuals, and a strongly skewed dependent variable often works against that assumption, so we apply a transformation to bring the distribution closer to normal.**

In [None]:
# Boxplot of Rented Bike Count to check outliers
plt.figure(figsize=(10,6))
plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=bike_df['Rented_Bike_Count'])
plt.show()

 * **The above boxplot shows that there are outliers in the Rented Bike Count column.**

In [None]:
# Applying square root to Rented Bike Count to reduce skewness
plt.figure(figsize=(10,8))
sqrt_count = np.sqrt(bike_df['Rented_Bike_Count'])

# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
ax=sns.histplot(sqrt_count, kde=True, stat='density', color="y")
ax.set(xlabel='Rented Bike Count', ylabel='Density')
ax.axvline(sqrt_count.mean(), color='green', linestyle='dashed', linewidth=2)
ax.axvline(sqrt_count.median(), color='black', linestyle='dashed', linewidth=2)

plt.show()

* **After applying the common rule of thumb of taking the square root of a right-skewed variable, we can observe that the Rented Bike Count, which was previously skewed, now follows a nearly normal distribution.**
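
The effect of the square root on skewness can also be checked numerically rather than only by eye. The sketch below is an illustration on synthetic data: the gamma-distributed `counts` array and the helper `skewness` are assumptions made for the example and do not come from the notebook or the Seoul data.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed, non-negative "counts" (illustrative only).
rng = np.random.default_rng(0)
counts = rng.gamma(shape=2.0, scale=350.0, size=5000)

skew_raw = skewness(counts)
skew_sqrt = skewness(np.sqrt(counts))

print(f"skewness before: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")
```

On the real data, the same check could be made with `bike_df['Rented_Bike_Count'].skew()` before and after `np.sqrt`.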

In [None]:
# After applying sqrt on Rented Bike Count, check whether we still have outliers
plt.figure(figsize=(10,6))

plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=np.sqrt(bike_df['Rented_Bike_Count']))
plt.show()

In [None]:
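# numeric_only=True skips the string and category columns, which corr() cannot handle
bike_df.corr(numeric_only=True)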

* **After applying the square root to the Rented Bike Count column, the boxplot shows no outliers.**

# **Checking of correlation between variables**

# Checking in OLS Model

**Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship between one or more independent variables and a dependent variable**
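
As a small illustration of what OLS estimates, the sketch below fits an intercept and two slopes with NumPy's least-squares solver. The synthetic data and the names `x1`, `x2`, `y_toy`, `X_toy` are assumptions made for the example; the notebook itself uses `statsmodels`.

```python
import numpy as np

# Synthetic data generated from y = 3 + 2*x1 - 1*x2 plus small noise (illustrative only).
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y_toy = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=200)

# Design matrix with a leading column of ones for the intercept,
# which plays the same role as sm.add_constant.
X_toy = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution: the coefficients that minimise the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(beta)  # close to [3, 2, -1]
```

The recovered coefficients are close to the values used to generate the data, which is the sense in which OLS "estimates the relationship" between the predictors and the target.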

In [None]:
#importing  the module
#assign the 'x','y' value
#Checking in OLS Model
import statsmodels.api as sm
X = bike_df[[ 'Temperature','Humidity',
       'Wind_speed', 'Visibility','Dew_point_temperature',
       'Solar_Radiation', 'Rainfall', 'Snowfall']]
Y = bike_df['Rented_Bike_Count']
bike_df.head()

In [None]:
#add a constant column
X = sm.add_constant(X)
X

In [None]:
# fitting a OLS model
model= sm.OLS(Y, X).fit()
model.summary()


* **The R-squared and adjusted R-squared values are close to each other, indicating that the model explains about 40% of the variance in the Rented Bike Count. The p-value for the F-statistic is less than 0.05, so the model is significant at the 5% level. However, the p-values for dew point temperature and visibility are quite high, indicating that these variables are not significant.**

* **The Omnibus test checks the skewness and kurtosis of the residuals, and in this case the value of Omnibus is high, indicating that the residuals are skewed. The condition number is large, 3.11e+04, which suggests that there may be strong multicollinearity or other numerical issues.**

* **The Durbin-Watson test is used to detect autocorrelation in the residuals, and in this case the value is less than 0.5, indicating the presence of positive autocorrelation in the residuals.**
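
To make the Durbin-Watson reading concrete: the statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It is close to 2 when residuals are uncorrelated and moves toward 0 under positive autocorrelation. The sketch below computes it by hand on synthetic residuals; the AR(1) series and the helper name `durbin_watson_stat` are assumptions for illustration (statsmodels provides `statsmodels.stats.stattools.durbin_watson` for real use).

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(1)
noise = rng.normal(size=2000)

# Positively autocorrelated residuals: each value carries over 90% of the previous one.
autocorrelated = np.zeros_like(noise)
for t in range(1, len(noise)):
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + noise[t]

dw_independent = durbin_watson_stat(noise)
dw_autocorrelated = durbin_watson_stat(autocorrelated)

print(f"independent: {dw_independent:.2f}, autocorrelated: {dw_autocorrelated:.2f}")
```

Hourly observations stored in time order tend to produce residuals of the second kind, which is consistent with the low value reported in the summary.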

In [None]:
X.corr()

* **The correlation matrix above shows a high correlation between 'Temperature' and 'Dew_point_temperature'. Therefore, one of these variables needs to be dropped. To decide which one to drop, the (P>|t|) values from the OLS summary table were checked. The p-value of 'Dew_point_temperature' is higher, indicating that it is less significant. Therefore the Dew_point_temperature column is dropped. To make this decision clearer, a heatmap visualization is used in the next step.**

# Heatmap

* **A correlation heatmap is a type of graphical representation that displays the correlation matrix, which helps to determine the correlation between different variables.**

In [None]:
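# Plot the correlation matrix (numeric_only=True skips the string and category columns)
plt.figure(figsize=(20,8))
correlation=bike_df.corr(numeric_only=True)
# Mask the upper triangle so each pair is shown only once
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap(correlation, mask=mask, annot=True, cmap='coolwarm')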

**On the target variable row of the heatmap, the variables most positively correlated with the rented bike count are:**
* Temperature
* Dew point temperature
* Solar radiation

**And the most negatively correlated variables are:**
* Humidity
* Rainfall

* **Based on the correlation heatmap above, we observe that the columns 'Temperature' and 'Dew point temperature' are strongly positively correlated, with a correlation coefficient of 0.91. Therefore, dropping the 'Dew point temperature(°C)' column would not significantly affect our analysis, since it varies in a similar way to 'Temperature'.**
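
Reading highly correlated pairs off a heatmap can also be done programmatically. The sketch below lists every feature pair whose absolute correlation exceeds a cut-off. The synthetic columns and the 0.9 threshold are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Synthetic columns: dew point is built to track temperature closely, humidity is independent.
temperature = rng.normal(loc=13.0, scale=12.0, size=n)
dew_point = temperature - 9.0 + rng.normal(scale=4.0, size=n)
humidity = rng.normal(loc=58.0, scale=20.0, size=n)

names = ["Temperature", "Dew_point_temperature", "Humidity"]
corr = np.corrcoef(np.vstack([temperature, dew_point, humidity]))

threshold = 0.9  # assumed cut-off for "highly correlated"
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))  # upper triangle only, so each pair appears once
    if abs(corr[i, j]) > threshold
]

print(high_pairs)
```

One column from each reported pair is then a candidate for removal, which mirrors the decision taken for 'Dew_point_temperature'.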

In [None]:
# Drop the Dew point temperature column
bike_df=bike_df.drop(['Dew_point_temperature'],axis=1)

In [None]:
bike_df.info()

# **Feature Engineering & Data Pre-Processing**

* **A dataset may contain various types of values, and some of them are categorical. In order to use those categorical values efficiently in a model, we create dummy variables.**

In [None]:
#Assign all categorical features to a variable
cat_features=list(bike_df.select_dtypes(['object','category']).columns)
cat_features=pd.Index(cat_features)
cat_features


* **One-hot encoding enables a more descriptive representation of categorical data. Since many machine learning algorithms do not accept categorical data as input or output variables, the categories need to be converted into numerical values.**
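
A minimal sketch of what one-hot encoding with `drop_first=True` produces, using a toy 'Seasons' column (the five rows and the name `toy_seasons` are made up for illustration):

```python
import pandas as pd

# Toy categorical column (illustrative values only).
toy_seasons = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# drop_first=True drops the first category in sorted order ("Autumn"),
# which becomes the implicit baseline: a row with all dummies equal to 0.
encoded = pd.get_dummies(toy_seasons["Seasons"], prefix="Seasons", drop_first=True)

print(encoded.columns.tolist())
# ['Seasons_Spring', 'Seasons_Summer', 'Seasons_Winter']
```

Dropping one level per feature avoids a set of dummy columns that always sums to 1, which would be perfectly collinear with the intercept in a linear model.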

In [None]:
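#Create a copy (.copy() makes an independent DataFrame; plain assignment would only create a second name for bike_df)
bike_df_copy = bike_df.copy()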

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

for col in cat_features:
    bike_df_copy = one_hot_encoding(bike_df_copy, col)
bike_df_copy.head()

# **Model Training**

### **Train Test split for regression**

**It is generally recommended to divide the dataset into two parts, namely training and testing sets, before applying any model. This division involves allocating some proportion of the data for training the model and reserving the remaining portion for evaluating the model's performance on unseen data. The proportion of data allocated for training and testing can vary from person to person, with commonly used ratios being 60:40, 70:30, 75:25 or 80:20 for training and testing respectively. To perform this split, we will use the scikit-learn library.**

In [None]:
#Assign the value  in X and Y
X = bike_df_copy.drop(columns=['Rented_Bike_Count'], axis=1)
y = np.sqrt(bike_df_copy['Rented_Bike_Count'])

In [None]:
X.head()

In [None]:
y.head()

In [None]:
#Creating test and train data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)


In [None]:
bike_df_copy.describe().columns

* The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the "errors") and squaring them. It's called the mean squared error because you're finding the average of a set of squared errors. The lower the MSE, the better the forecast.

* MSE formula:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\text{Actual}_i - \text{Forecast}_i\right)^2$$

Where:

n = number of items,

Σ = summation notation,

Actual = original or observed y-value,

Forecast = y-value from regression.

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

Mean Absolute Error (MAE) is a metric used to evaluate a regression model. It is the average of the absolute errors, where errors are the differences between the predicted values (values predicted by our regression model) and the actual values of a variable.

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

Formula for R-Squared

$$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$$

Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
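
The metrics described above can be computed directly with NumPy. The sketch below is an illustration on five made-up values; the helper name `regression_metrics` and the sample numbers are assumptions, and the adjusted R-squared uses the usual form 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    # Unexplained variation divided by total variation, subtracted from 1.
    r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # Penalise R2 for the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2, "Adjusted R2": adj_r2}

# Made-up observed and predicted values for a model with a single predictor.
toy_metrics = regression_metrics([3, 5, 7, 9, 11], [2, 5, 8, 9, 12], n_features=1)
print(toy_metrics)
```

Here the squared errors sum to 3 and the total variation is 40, so R² is 0.925 and the adjusted value is 0.9. Note that n and p must come from the same data the predictions were made on.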
​

# **ML Model Implementation**

#### **LINEAR REGRESSION**

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line.
Linear regression uses a linear approach to model the relationship between independent and dependent variables. In simple words, it is a best fit line drawn over the values of the independent variables and the dependent variable. In the case of a single variable, the formula is the same as the straight line equation, with an intercept and a slope:

$$y_{pred} = \beta_0 + \beta_1 x$$

where $\beta_0$ and $\beta_1$ are the intercept and slope respectively.
In the case of multiple features the formula translates into

$$y_{pred} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$$

where $x_1, x_2, x_3, \dots$ are the feature values and $\beta_1, \beta_2, \beta_3, \dots$ are the weights assigned to each of the features. These become the parameters which the algorithm tries to learn, for example using gradient descent.

Gradient descent is the process by which the algorithm updates the parameters using a loss function. The loss function is the difference between the actual values and the predicted values (also known as the error or residual). There are different types of loss functions, but this is the simplest one. The loss summed over all observations gives the cost function. The role of gradient descent is to update the parameters until the cost function is minimized, i.e., a global minimum is reached. It uses a hyperparameter 'alpha', called the learning rate, that scales the gradient and decides how big each step is. It is necessary to choose a suitable value of alpha: a value that is too high can make gradient descent overshoot the minimum, while a value that is too low makes convergence very slow. There are also some basic assumptions that must be fulfilled before implementing this algorithm. They are:

  1. No multicollinearity in the dataset.

  2. Independent variables should show a linear relationship with the dependent variable (dv).

  3. Residual mean should be 0 or close to 0.

  4. There should be no heteroscedasticity, i.e., the variance of the residuals should be constant along the line of best fit.
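
The gradient descent idea described above can be sketched for a single feature as follows. This is an illustration only: the synthetic data, the learning rate of 0.5, and the 2000 iterations are assumptions chosen for the example. As far as I know, scikit-learn's `LinearRegression` solves the least-squares problem directly rather than by gradient descent, so this sketch shows the concept, not the library's internals.

```python
import numpy as np

# Synthetic data generated from y = 1 + 2x plus small noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y_obs = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=200)

b0, b1 = 0.0, 0.0   # intercept and slope, starting from zero
alpha = 0.5         # learning rate (assumed value)

for _ in range(2000):
    error = (b0 + b1 * x) - y_obs          # prediction minus actual
    # Gradients of the mean squared error cost with respect to b0 and b1.
    grad_b0 = 2.0 * error.mean()
    grad_b1 = 2.0 * (error * x).mean()
    # Step against the gradient, scaled by the learning rate.
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(f"intercept: {b0:.3f}, slope: {b1:.3f}")
```

With a much larger `alpha` the updates would diverge, and with a much smaller one many more iterations would be needed, which is the trade-off the learning rate controls.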

Let us now implement our first model. We will be using LinearRegression from the scikit-learn library.

                         

In [None]:
#import the packages
from sklearn.linear_model import LinearRegression
reg= LinearRegression().fit(X_train, y_train)

In [None]:
#check the score
reg.score(X_train, y_train)



In [None]:
#Check the coefficient
reg.coef_

In [None]:
#Get the predictions for X_train and X_test
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#calculate MSE on the training set
MSE_lr= mean_squared_error(y_train, y_pred_train)
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

#calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)

#calculate r2 and adjusted r2 (X_train.shape is used because these are training-set metrics)
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
n_train, n_feat = X_train.shape
Adjusted_R2_lr = 1-(1-r2_lr)*((n_train-1)/(n_train-n_feat-1))
print("Adjusted R2 :",Adjusted_R2_lr)
# Recorded output: MSE : 35.07751288189293


* **The r2 score of our model on the training set is 0.77, indicating that the model has captured a significant portion of the variance in the data. We store this result in a dataframe for later comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
training_df=pd.DataFrame(dict1,index=[1])


In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_lr= mean_squared_error(y_test, y_pred_test)
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)


#calculate MAE
MAE_lr= mean_absolute_error(y_test, y_pred_test)
print("MAE :",MAE_lr)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_lr= r2_score((y_test), (y_pred_test))
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",Adjusted_R2_lr )

**The r2_score for the test set is 0.78. This means our linear model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity(unequal variance or scatter).**
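
Because the target was defined as `np.sqrt(Rented_Bike_Count)`, the metrics above are measured on the square-root scale. To express errors in actual bike counts, the predictions can be squared before comparison. The sketch below uses four made-up values (the arrays `actual_counts` and `pred_sqrt` are hypothetical) to show the difference between the two scales.

```python
import numpy as np

# Hypothetical actual counts and model predictions on the sqrt scale (illustrative only).
actual_counts = np.array([100.0, 400.0, 900.0, 1600.0])
pred_sqrt = np.array([11.0, 19.0, 31.0, 38.0])

# RMSE on the square-root scale, the scale the model was trained on.
rmse_sqrt = np.sqrt(np.mean((np.sqrt(actual_counts) - pred_sqrt) ** 2))

# Undo the transform by squaring, then measure the error in bikes per hour.
pred_counts = pred_sqrt ** 2
rmse_counts = np.sqrt(np.mean((actual_counts - pred_counts) ** 2))

print(f"RMSE (sqrt scale): {rmse_sqrt:.2f}, RMSE (counts): {rmse_counts:.2f}")
```

In the notebook, the analogous step would be `y_pred_test ** 2` compared against `y_test ** 2`; which scale to report depends on whether the number should be read in transformed units or in bikes.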

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
test_df=pd.DataFrame(dict2,index=[1])

In [None]:
### Heteroscedasticity
plt.scatter((y_pred_test),(y_test)-(y_pred_test),color='black')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

In [None]:
plt.figure(figsize=(12,10))
plt.scatter(range(len(y_pred_test)),y_pred_test, s=20, c='blue', label = 'Predicted')
plt.scatter(range(len(y_test)), y_test, s=20, c ='red', label = 'Actual')
plt.legend()
plt.xlabel('No of Test Data')
plt.show()

### **LASSO REGRESSION**

In [None]:
#Create an instances of Lasso Regression Implementation
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0, max_iter=3000)
# Fit the Lasso model
lasso.fit(X_train, y_train)
# Create the model score
print(lasso.score(X_test, y_test), lasso.score(X_train, y_train))

In [None]:
#get the predictions for X_train and X_test
y_pred_train_lasso=lasso.predict(X_train)
y_pred_test_lasso=lasso.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l= mean_squared_error((y_train), (y_pred_train_lasso))
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)


#calculate MAE
MAE_l= mean_absolute_error(y_train, y_pred_train_lasso)
print("MAE :",MAE_l)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l= r2_score(y_train, y_pred_train_lasso)
print("R2 :",r2_l)
# adjusted r2 for the training set must use the training set's shape
Adjusted_R2_l = 1-(1-r2_l)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_l)

**Our model's r2 score is 0.40, indicating that it is unable to capture a significant portion of the data variance. We should save the score in a dataframe for future comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Lasso regression ',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2)
       }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df=pd.concat([training_df,pd.DataFrame([dict1])],ignore_index=True)

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l= mean_squared_error(y_test, y_pred_test_lasso)
print("MSE :",MSE_l)

#calculate RMSE
RMSE_l=np.sqrt(MSE_l)
print("RMSE :",RMSE_l)


#calculate MAE
MAE_l= mean_absolute_error(y_test, y_pred_test_lasso)
print("MAE :",MAE_l)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l= r2_score((y_test), (y_pred_test_lasso))
print("R2 :",r2_l)
Adjusted_R2_l=(1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )




**The r2_score for the test set is 0.38. This means our Lasso model is not performing well on the data. To investigate further, we will examine the residuals and check for heteroscedasticity (unequal variance or scatter).**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Lasso regression ',
       'MAE':round((MAE_l),3),
       'MSE':round((MSE_l),3),
       'RMSE':round((RMSE_l),3),
       'R2_score':round((r2_l),3),
       'Adjusted R2':round((Adjusted_R2_l ),2),
       }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df=pd.concat([test_df,pd.DataFrame([dict2])],ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_test_lasso))
plt.plot(np.array((y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
### Heteroscedasticity
plt.scatter((y_pred_test_lasso),(y_test)-(y_pred_test_lasso),color = 'Indigo')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

# **RIDGE REGRESSION**

In [None]:
# Import the packages
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.1)

In [None]:
#FIT THE MODEL
ridge.fit(X_train,y_train)

In [None]:
#Check the score
ridge.score(X_train,y_train)

In [None]:
#Get the X_train and X_test value
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)


In [None]:
#Import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r= mean_squared_error((y_train), (y_pred_train_ridge))
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r= r2_score(y_train, y_pred_train_ridge)
print("R2 :",r2_r)
# adjusted r2 for the training set must use the training set's shape
Adjusted_R2_r=1-(1-r2_r)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_r)

**Our model has an r2 score of 0.77, indicating that it captures a significant portion of the data variance. We can store this result in a dataframe for future comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df=pd.concat([training_df,pd.DataFrame([dict1])],ignore_index=True)


In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r= mean_squared_error(y_test, y_pred_test_ridge)
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r= r2_score((y_test), (y_pred_test_ridge))
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

**The test set has an r2 score of 0.78, indicating good performance by our linear model on the data. However, we need to examine the residuals visually to check for heteroscedasticity, which refers to unequal variance or scatter.**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df=pd.concat([test_df,pd.DataFrame([dict2])],ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(12,10))
plt.scatter(range(len(y_pred_test_ridge)),y_pred_test_ridge, s=20, c='blue', label='Predicted')
plt.scatter(range(len(y_test)),y_test, s=20, c='red', label='Actual')
plt.legend()
plt.xlabel('No of Test Data')
plt.show()

In [None]:
#Heteroscedasticity
plt.scatter((y_pred_test_ridge),(y_test)-(y_pred_test_ridge),color='blue')
plt.xlabel('Predicted value')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

# **ELASTIC NET REGRESSION**

In [None]:
#Import the packages
from sklearn.linear_model import ElasticNet
#a * L1 + b * L2
#alpha = a + b and l1_ratio = a / (a + b)
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)

In [None]:
#FIT THE MODEL
elasticnet.fit(X_train,y_train)

In [None]:
#Check the score
elasticnet.score(X_train, y_train)

In [None]:
#Get the predictions for X_train and X_test
y_pred_train_en=elasticnet.predict(X_train)
y_pred_Test_en=elasticnet.predict(X_test)

In [None]:
#Import the packages
from sklearn.metrics import mean_squared_error
#Calculate MSE
MSE_e= mean_squared_error((y_train), (y_pred_train_en))
print("MSE :",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


#calculate MAE
MAE_e= mean_absolute_error(y_train, y_pred_train_en)
print("MAE :",MAE_e)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e= r2_score(y_train, y_pred_train_en)
print("R2 :",r2_e)
# adjusted r2 for the training set must use the training set's shape
Adjusted_R2_e=1-(1-r2_e)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_e)

**Based on the r2 score value of 0.62, it appears that our model has successfully captured a significant portion of the data variance. We can store this data in a dataframe for future comparisons.**

In [None]:
#Storing the training set metrics value in a dataframe for later comparison
dict1 = {'Model':'Elastic net regression ',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df=pd.concat([training_df,pd.DataFrame([dict1])],ignore_index=True)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#Calculate MSE on the test set
MSE_e= mean_squared_error(y_test, y_pred_Test_en)
print("MSE :",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


#calculate MAE
MAE_e= mean_absolute_error(y_test, y_pred_Test_en)
print("MAE :",MAE_e)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e= r2_score(y_test, y_pred_Test_en)
print("R2 :",r2_e)
Adjusted_R2_e=(1-(1-r2_e)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_e)


**The test set's r2_score of 0.62 indicates that our linear model is effectively modeling the data. However, we need to investigate whether there is heteroscedasticity, which refers to unequal variance or scatter, by examining the residuals visually.**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Elastic net regression ',
       'MAE':round((MAE_e),3),
       'MSE':round((MSE_e),3),
       'RMSE':round((RMSE_e),3),
       'R2_score':round((r2_e),3),
       'Adjusted R2':round((Adjusted_R2_e ),2)}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df=pd.concat([test_df,pd.DataFrame([dict2])],ignore_index=True)

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_Test_en))
plt.plot(np.array(y_test))
plt.legend(['Predicted','Actual'])
plt.show()

In [None]:
#Heteroscedasticity
plt.scatter((y_pred_Test_en),(y_test)-(y_pred_Test_en), color='green')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

# **GRADIENT BOOSTING**

In [None]:
# Import the packages
from sklearn.ensemble import GradientBoostingRegressor
# Create an instance of the GradientBoostingRegressor
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train,y_train)

In [None]:
# Making prediction on train and test data
y_pred_train_g = gb_model.predict(X_train)
y_pred_test_g = gb_model.predict(X_test)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
print("Model Score:",gb_model.score(X_train,y_train))
#calculate MSE
MSE_gb= mean_squared_error(y_train, y_pred_train_g)
print("MSE :",MSE_gb)

#calculate RMSE
RMSE_gb=np.sqrt(MSE_gb)
print("RMSE :",RMSE_gb)


#calculate MAE
MAE_gb= mean_absolute_error(y_train, y_pred_train_g)
print("MAE :",MAE_gb)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_gb= r2_score(y_train, y_pred_train_g)
print("R2 :",r2_gb)
# adjusted r2 for the training set must use the training set's shape
Adjusted_R2_gb = 1-(1-r2_gb)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_gb)

**Based on the r2 score of 0.87, it appears that our model has successfully captured a significant portion of the data variance. We can now store this value in a dataframe for future comparisons.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Gradient boosting regression ',
       'MAE':round((MAE_gb),3),
       'MSE':round((MSE_gb),3),
       'RMSE':round((RMSE_gb),3),
       'R2_score':round((r2_gb),3),
       'Adjusted R2':round((Adjusted_R2_gb ),2),
       }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df=pd.concat([training_df,pd.DataFrame([dict1])],ignore_index=True)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_gb= mean_squared_error(y_test, y_pred_test_g)
print("MSE :",MSE_gb)

#calculate RMSE
RMSE_gb=np.sqrt(MSE_gb)
print("RMSE :",RMSE_gb)


#calculate MAE
MAE_gb= mean_absolute_error(y_test, y_pred_test_g)
print("MAE :",MAE_gb)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_gb= r2_score((y_test), (y_pred_test_g))
print("R2 :",r2_gb)
Adjusted_R2_gb = (1-(1-r2_score((y_test), (y_pred_test_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )


**The test set's r2_score of 0.86 indicates that our gradient boosting model performs well on the data. However, we need to examine the residuals graphically to determine whether any heteroscedasticity (unequal variance or scatter) is present.**

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Gradient boosting regression ',
       'MAE':round((MAE_gb),3),
       'MSE':round((MSE_gb),3),
       'RMSE':round((RMSE_gb),3),
       'R2_score':round((r2_gb),3),
       'Adjusted R2':round((Adjusted_R2_gb ),2),
       }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df=pd.concat([test_df,pd.DataFrame([dict2])],ignore_index=True)

In [None]:
### Heteroscedasticity
plt.scatter((y_pred_test_g),(y_test)-(y_pred_test_g))

In [None]:
gb_model.feature_importances_

In [None]:
importances = gb_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.head()

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)


In [None]:
gb_model.fit(X_train,y_train)

In [None]:
features = X_train.columns
importances = gb_model.feature_importances_
indices = np.argsort(importances)

In [None]:
#Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

## ***5. HYPERPARAMETER TUNING***

* Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a model argument whose value is set before the learning process begins. Careful tuning is often key to getting good performance from a machine learning algorithm.

* **Using GridSearch CV**


GridSearchCV helps to loop through predefined hyperparameters and fit the model on the training set. So, in the end, we can select the best parameters from the listed hyperparameters.
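
To make the cost of an exhaustive search concrete, the sketch below enumerates every combination in a grid and counts the model fits that cross-validation triggers. It uses the same candidate values and fold count as the grid defined later in this section. The enumeration with `itertools.product` is an illustration of what a grid search iterates over, not GridSearchCV's internal code.

```python
from itertools import product

# Same candidate values as the grid used in this section.
param_grid = {
    "n_estimators": [50, 80, 100],
    "max_depth": [4, 6, 8],
    "min_samples_split": [50, 100, 150],
    "min_samples_leaf": [40, 50],
}
cv_folds = 5

# Build one dict per candidate setting (Cartesian product of all value lists).
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))             # 54 candidate settings
print(len(combinations) * cv_folds)  # 270 cross-validation fits
print(combinations[0])
```

With the default `refit=True`, GridSearchCV fits the best setting once more on the full training set, so adding values to any list multiplies the total work.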

## **Gradient Boosting Regressor with GridSearch CV**

### **Provide a range of values for chosen hyperparameters**

In [None]:
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# Hyperparameter grid
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}

In [None]:
param_dict

### **Importing Gradient Boosting Regressor**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
# Create an instance of the GradientBoostingRegressor
gb_model = GradientBoostingRegressor()

# Grid search
gb_grid = GridSearchCV(estimator=gb_model,
                       param_grid = param_dict,
                       cv = 5, verbose=2)

gb_grid.fit(X_train,y_train)


In [None]:
gb_grid.best_estimator_

In [None]:
gb_optimal_model = gb_grid.best_estimator_
gb_grid.best_params_

In [None]:
# Making predictions on train and test data

y_pred_train_g_g = gb_optimal_model.predict(X_train)
y_pred_g_g= gb_optimal_model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error
print("Model Score:",gb_optimal_model.score(X_train,y_train))
MSE_gbh= mean_squared_error(y_train, y_pred_train_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)


MAE_gbh= mean_absolute_error(y_train, y_pred_train_g_g)
print("MAE :",MAE_gbh)


from sklearn.metrics import r2_score
r2_gbh= r2_score(y_train, y_pred_train_g_g)
print("R2 :",r2_gbh)
# adjusted r2 for the training set must use the training set's shape
Adjusted_R2_gbh = 1-(1-r2_gbh)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_gbh)

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
training_df=pd.concat([training_df,pd.DataFrame([dict1])],ignore_index=True)

In [None]:
from sklearn.metrics import mean_squared_error
MSE_gbh= mean_squared_error(y_test, y_pred_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)


MAE_gbh= mean_absolute_error(y_test, y_pred_g_g)
print("MAE :",MAE_gbh)


from sklearn.metrics import r2_score
r2_gbh= r2_score((y_test), (y_pred_g_g))
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_test, y_pred_g_g))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_g_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
test_df=pd.concat([test_df,pd.DataFrame([dict2])],ignore_index=True)

In [None]:
### Heteroscedasticity
plt.scatter((y_pred_g_g),(y_test)-(y_pred_g_g), color='orange')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

In [None]:
gb_optimal_model.feature_importances_

In [None]:
imp = gb_optimal_model.feature_importances_

imp_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : imp}

imp_df = pd.DataFrame(imp_dict)

In [None]:
imp_df['Feature Importance'] = round(imp_df['Feature Importance'],2)

In [None]:
imp_df.head()

In [None]:
imp_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
gb_model.fit(X_train,y_train)

In [None]:
features = X_train.columns
# use the tuned model so the plot reflects the GridSearchCV result
imp = gb_optimal_model.feature_importances_
indices = np.argsort(imp)

In [None]:
# Plot the figure
plt.figure(figsize=(10,20))
plt.title('Feature importance')
plt.barh(range(len(indices)),imp[indices], color='black', align = 'center')
plt.yticks(range(len(indices)),[features[i] for i in indices])
plt.xlabel('Relative importance')
plt.show()

# **Conclusion**

During our analysis, we conducted an initial exploratory data analysis (EDA) on all the features in our dataset. First, we analyzed our dependent variable 'Rented Bike Count' and applied a transformation as necessary. We then examined the categorical variables and removed those with a majority of one class. We also studied the numerical variables, calculating their correlations, distributions, and relationships with the dependent variable. Additionally, we removed some numerical features that contained mostly 0 values and applied one-hot encoding to the categorical variables.

Subsequently, we employed five machine learning algorithms: Linear Regression, Lasso, Ridge, Elastic Net, and Gradient Boosting. We also performed hyperparameter tuning to enhance the performance of our models. The evaluation of our models resulted in the following findings:

In [None]:
# Displaying the results of evaluation metric values for all models
result = pd.concat([training_df,test_df],keys=['Training set','Test set'])
result

1. Among all the models on the training set, the Gradient Boosting GridSearchCV has the lowest MAE, MSE, and RMSE, and the highest R2_score and Adjusted R2.

2. The Linear Regression and Ridge Regression models have nearly the same MAE, MSE, RMSE, R2_score, and Adjusted R2 on the training set and test set.

3. Among all the models on the test set, the Gradient Boosting GridSearchCV has the lowest MAE, MSE, and RMSE, and the highest R2_score and Adjusted R2.

4. The Lasso regression has the highest MAE, MSE, and RMSE on both the training set and test set.

5. The Elastic net regression has a similar performance on both the training set and test set.

6. The Gradient Boosting regression has a lower performance than Gradient Boosting GridSearchCV on both the training set and test set.

7. The Gradient Boosting GridSearchCV model performed the best on both the training and test sets based on its low MAE, MSE, and RMSE and high R2_score and Adjusted R2.


8. The Linear Regression model also performed well on both sets, although not as well as the Gradient Boosting GridSearchCV model.

9. Lasso Regression had a higher error and a lower R2_score compared to the other models, indicating that it may not be the best model for this dataset.

10. Elastic net regression performed well but not as well as the Gradient Boosting GridSearchCV and Linear Regression models.

11. No overfitting is observed in the dataset.

12. Ridge regression had similar performance to Linear Regression on both the training and test sets.

13. The Gradient Boosting regression model had good performance on the training set, but slightly worse performance on the test set, indicating that some overfitting may have occurred.
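
The overfitting remarks above come down to the gap between training and test scores. The sketch below computes that gap from the R2 values quoted in this notebook's commentary. The tuned Gradient Boosting model is left out because its R2 is not quoted in the text, and `r2_gaps` is a hypothetical helper written for illustration.

```python
# R2 values as quoted in the commentary of this notebook: (train, test).
reported_r2 = {
    "Linear regression": (0.77, 0.78),
    "Lasso regression": (0.40, 0.38),
    "Ridge regression": (0.77, 0.78),
    "Elastic net regression": (0.62, 0.62),
    "Gradient boosting regression": (0.87, 0.86),
}


def r2_gaps(scores):
    """Return train-minus-test R2 per model, rounded to 2 decimals."""
    return {name: round(train - test, 2) for name, (train, test) in scores.items()}


gaps = r2_gaps(reported_r2)
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.2f}")
```

A positive gap means the model scores higher on data it was trained on. Gaps of a few hundredths, as here, are small.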

Overall, the Gradient Boosting GridSearchCV model is the most promising model for this dataset based on the presented metrics.

We can deploy this model.

Although the current analysis may be insightful, it is important to note that the dataset is time-dependent and variables such as temperature, windspeed, and solar radiation may not always remain consistent. As a result, there may be situations where the model fails to perform well. As the field of machine learning is constantly evolving, it is necessary to stay up-to-date with the latest developments and be prepared to handle unexpected scenarios. Maintaining a strong understanding of machine learning concepts will undoubtedly provide an advantage in staying ahead in the future.
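
Because the data is time-dependent, one way to probe how the model holds up on later periods is to validate on blocks that always come after the training rows. The sketch below generates such expanding-window splits. It assumes the rows are sorted chronologically, which this notebook does not verify, and `expanding_window_splits` is a hypothetical helper. sklearn's `TimeSeriesSplit` offers a library version of the same idea.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices); each test block follows its training rows."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last test block absorbs any leftover rows.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


for train_idx, test_idx in expanding_window_splits(n_samples=10, n_splits=3):
    print(train_idx, test_idx)
```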

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***