<a href="https://colab.research.google.com/github/s07376/EDA-project/blob/main/ML_Capstone_project_shared.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -**Bike sharing demand prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**




*   This ML project is dedicated to improving public mobility and convenience through the implementation of bike-sharing programs in metropolitan areas. With the aim of ensuring a consistent supply of bikes for rental, we tackle the challenge of predicting demand based on historical data. By leveraging factors such as temperature and time, we employ advanced machine learning techniques to forecast the bike-sharing program's usage in Seoul.
*   With a dataset of **8760 records and 14 attributes**, we carried out an extensive exploratory data analysis (EDA) to gain insights and prepare the data for modeling. The data was checked for null and duplicate values, outliers were clipped rather than removed, and appropriate transformations were applied to make the data compatible with machine learning algorithms.
*   Subsequently, the cleaned and scaled data was fed into **11 diverse models**, allowing us to evaluate their performance using multiple metrics. Hyperparameter tuning was conducted to optimize the models and ensure accurate predictions.
*   In our evaluation, we prioritize the R2 score and RMSE. R2 is unitless: it reports the share of variance in the target that a model explains, which makes it easy to interpret and to compare across models. RMSE complements it by expressing the typical prediction error in the units of the target.
*   By accurately forecasting bike-sharing demand, we aim to enhance resource allocation, reduce wait times, and elevate the overall user experience.
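
To make the two metrics mentioned above concrete, here is a minimal sketch that computes RMSE and R2 by hand with NumPy. The helper names `rmse` and `r2` and the sample values are assumptions made for illustration only; the square root of `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` should give the same numbers.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, expressed in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical hourly rental counts and predictions (illustrative values only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

print("RMSE:", rmse(y_true, y_pred))
print("R2:  ", r2(y_true, y_pred))

# Rescaling the target changes RMSE but leaves R2 unchanged
print("RMSE (target / 10):", rmse(y_true / 10, y_pred / 10))
print("R2   (target / 10):", r2(y_true / 10, y_pred / 10))
```

The last two lines show why R2 is called scale-independent: dividing the target by 10 shrinks RMSE by the same factor, while R2 stays the same.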


# **GitHub Link -**

# **Problem Statement**


**The "Bike Sharing Demand Prediction" project addresses the challenge faced by bike sharing companies in accurately forecasting and meeting the fluctuating demand for bike rentals. The unpredictable nature of bike rental demand poses difficulties in managing fleet size, allocating resources, and providing optimal customer service. Without a reliable demand prediction system, bike sharing companies often struggle to ensure a sufficient number of bikes are available during peak periods, resulting in frustrated customers and missed revenue opportunities. Conversely, overestimating demand leads to surplus bikes and unnecessary operational costs. Therefore, the problem at hand is to develop a robust machine learning model that can accurately forecast bike rental demand, enabling companies to optimize fleet management, allocate resources efficiently, and deliver an exceptional user experience while maximizing profitability.**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import libraries and modules

# libraries that are used for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno
from scipy import stats as st

pd.set_option('display.max_columns', 500)

plt.style.use('ggplot')

# to import datetime library
from datetime import datetime
import datetime as dt

# libraries used to pre-process
from sklearn import preprocessing, linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split


# libraries used to implement models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# libraries to evaluate performance
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, mean_absolute_error

# Library of warnings would assist in ignoring warnings issued
import warnings
warnings.filterwarnings('ignore')

# show all columns when displaying DataFrames (overrides the limit of 500 set above)
pd.set_option('display.max_columns', None)

### Dataset Loading

In [None]:
#mounting the drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Seoul Bike Dataset
df=pd.read_csv("/content/drive/MyDrive/Capstone projects/Module_6_Machine learning/SeoulBikeData.csv",encoding= 'unicode_escape')
df.head()

### Dataset First View

In [None]:
# Dataset First Look
# Display the first 5 rows
df.head()

In [None]:
#Check last five records
df.tail()

In [None]:
#check random sample
df.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Dimensions of the dataset
df.shape

There are 8760 rows and 14 columns in this dataset.

In [None]:
# Name of columns in the data
df.columns

### Dataset Information

In [None]:
# Dataset Info
# Get information about the dataset
df.info()

**Observation:**

* Float64 datatype: 6 columns, i.e. Temperature(°C), Wind speed (m/s), Dew point temperature(°C), Solar Radiation (MJ/m2), Rainfall(mm) & Snowfall (cm).

* Int64 datatype: 4 columns, i.e. Rented Bike Count, Hour, Humidity(%) & Visibility (10m).

* Object datatype: 4 columns, i.e. Date, Seasons, Holiday & Functioning Day.









#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print('The number of duplicated rows:', df.duplicated().sum())

#### Unique values

In [None]:
# Number of unique values in each columns
df.nunique()

**From the above result, we can see that the dataset covers 1 year of bike rental data, since the Date column has 365 unique values (365 days × 24 hourly records = 8760 rows).**
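
The row count can be cross-checked against the calendar. The sketch below builds a hypothetical stand-in for the data (the start date, the frame name `toy`, and the day-first `DD/MM/YYYY` string format are all assumptions for illustration) and confirms that one record per hour over 365 days gives 8760 rows.

```python
import pandas as pd

# Hypothetical stand-in for the real file: one row per hour for 365 days.
# The start date is an assumption chosen only for illustration.
timestamps = pd.date_range("2017-12-01", periods=365 * 24, freq=pd.Timedelta(hours=1))

toy = pd.DataFrame({
    "Date": timestamps.strftime("%d/%m/%Y"),  # day-first date strings
    "Hour": timestamps.hour,
})

# dayfirst=True makes "01/12/2017" parse as 1 December rather than 12 January
parsed = pd.to_datetime(toy["Date"], dayfirst=True)

print("rows:", len(toy))
print("distinct days:", parsed.nunique())
print("distinct hours:", toy["Hour"].nunique())
```

Running the same three checks on `df` is a quick way to confirm that no day or hour is missing from the real data.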

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Check for missing values

df.isnull().sum()

In [None]:
# Visualizing the missing values

msno.matrix(df)

**From the above results, it is evident that there are no missing values in the dataset.**

### What did you know about your dataset?


* The dataset contains 8760 rows and 14 columns.
* There are 6 columns of datatype float64, 4 columns of datatype int64 and 4 columns of datatype object.
* There are no missing or duplicate values in the dataset.
* The dataset contains bike rental data of 1 year.
* Input features: Date, Hour, Temperature(°C), Humidity(%), Wind speed (m/s), Visibility (10m), Dew point temperature(°C), Solar Radiation (MJ/m2), Rainfall(mm), Snowfall (cm), Seasons, Holiday & Functioning Day
* Target feature: Rented Bike Count

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.tolist()

In [None]:
# Dataset Describe
df.describe().T

In [None]:
(df["Rainfall(mm)"]<1).sum()

In [None]:
(df["Rainfall(mm)"]>1).sum()

In [None]:
(df["Snowfall (cm)"]<1).sum()

In [None]:
(df["Snowfall (cm)"]>1).sum()


*   **The majority of the Rainfall & Snowfall values are below 1.**

In [None]:
df['Seasons'].value_counts()

In [None]:
df['Functioning Day'].value_counts()

In [None]:
df['Holiday'].value_counts()

### Variables Description

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

Attribute Information:



*   Date: The specific calendar date for the bike rental record.
*   Rented Bike Count: The number of bikes rented during a specific time interval.
*   Temperature: The temperature in Celsius at the time of the bike rental.
*   Humidity: The relative humidity percentage at the time of the bike rental.
*   Wind Speed: The speed of the wind in meters per second at the time of the bike rental.
*   Visibility: The visibility at the time of the bike rental, recorded in units of 10 meters (as the column name `Visibility (10m)` indicates).
*   Dew Point Temperature: The temperature at which air becomes saturated and dew forms at the time of the bike rental.
*   Solar Radiation: The amount of solar radiation in mega-joules per square meter at the time of the bike rental.
*   Rainfall: The amount of rainfall in millimeters at the time of the bike rental.
*   Snowfall: The amount of snowfall in centimeters at the time of the bike rental.
*   Seasons: The four seasons (Spring, Summer, Autumn, Winter) corresponding to the bike rental record.
*   Holiday: A categorical variable indicating whether the day of the bike rental record is a holiday or not. It has two possible values: "Holiday" and "No Holiday". The "Holiday" value represents a day that is recognized as a holiday, while the "No Holiday" value represents a regular day that is not a designated holiday.
*   Functioning Day: A categorical variable indicating whether the bike rental service was functioning on the day of the record. It has two possible values: "Yes" and "No". The "Yes" value indicates that the bike rental service was operational and functioning normally on that day. Conversely, the "No" value indicates that the bike rental service was not operating, potentially due to maintenance, strikes, or other reasons.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.to_list():
  print('Number of unique values in', i, 'is', df[i].nunique())

In [None]:
# Converting Date column of datatype Object to Datetime datatype
df['Date'] = pd.to_datetime(df['Date'], dayfirst = True)

In [None]:
# Write your code to make your dataset analysis ready.

# Extracting day name feature
df['Day'] = df['Date'].dt.day_name()

# Extracting month name feature
df['Month'] = df['Date'].dt.month_name()

# Extracting year feature
df['Year'] = df['Date'].dt.year


In [None]:
# Dropping Date column
df.drop(columns = ['Date'], inplace = True)

In [None]:
#Rename the complex columns name
df = df.rename(columns={
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind Speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew point temperature',
                                'Solar Radiation (MJ/m2)':'Solar Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                              })

In [None]:
#To observe the day, month , year columns and other changes made to the dataset
df.sample(3)

In [None]:
# convert Hour and Year columns from integer to object
df['Hour'] = df['Hour'].astype('object')
df['Year'] = df['Year'].astype('object')

## 3. ***Exploratory data analysis***

**What is EDA?**

- EDA stands for Exploratory Data Analysis. It is a crucial step in the data analysis process that involves exploring and understanding the characteristics, patterns, and relationships within a dataset. EDA aims to uncover insights, identify patterns, detect outliers, and gain a deeper understanding of the data before conducting further analysis or modeling.

In [None]:
df.info()

## **3.1 Numerical & categorical variable**

In [None]:
# Dividing data into numerical and categorical features

categorical_features = df.select_dtypes(include = 'object')
numerical_features = df.select_dtypes(exclude = 'object')

In [None]:
categorical_features.head(2)

In [None]:
numerical_features.head(2)

# **3.2 Univariate analysis**

## **3.2.1 Data distribution of numeric features**

In [None]:
# figsize
plt.figure(figsize=(15,10))

# title
plt.suptitle('Data Distribution of Numeric Features', fontsize = 20, fontweight = 'bold', y=1.02)

for i, col in enumerate(numerical_features):
  # subplots 3 rows and 3 columns
  plt.subplot(3, 3, i+1 )

  # distribution plot (histplot with a KDE replaces the deprecated sns.distplot)
  sns.histplot(x=df[col], kde=True, stat='density')
  plt.axvline(df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(df[col].median(), color='cyan', linestyle='dashed', linewidth=2)

  plt.title(col)
  plt.tight_layout()

**Observations:**
*   Right-skewed features are: Rented Bike Count, Wind speed, Solar Radiation, Rainfall & Snowfall.
*   Left-skewed features: Visibility & Dew point temperature


## **3.2.2 Outlier analysis of numerical features**

In [None]:
# figsize
plt.figure(figsize = (15,10))

# title
plt.suptitle('Outlier Analysis of Numeric features', fontsize = 20, fontweight='bold', y=1.02)

for i, col in enumerate(numerical_features):
  # subplots 3 rows, 3 columns
  plt.subplot(3,3, i+1)

  # boxplots (the column is passed as x explicitly, which draws a horizontal boxplot)
  sns.boxplot(x=numerical_features[col])

  plt.title(col)
  plt.tight_layout()

**Observations:**
*   Outliers are visible in Rented Bike Count, Wind Speed, Solar Radiation, Rainfall & Snowfall.
*   The columns like Temperature, Humidity, Visibility & Dew point temperature do not contain any outliers.







## **3.2.3 Univariate analysis of categorical features**

In [None]:
# figure
plt.figure(figsize = (20,8))

# title
plt.suptitle('Univariate Analysis of Categorical Features', fontsize = 20, fontweight = 'bold', y = 1.02)

for i, col in enumerate(categorical_features):
  # subplots of 3 rows and 3 columns
  plt.subplot(3,3, i+1)

  # Countplots
  sns.countplot(x = categorical_features[col])

  plt.xticks(rotation ='vertical')
  plt.title(col)
  plt.tight_layout()

**Observations:**
- Every hour has an equal number of records in the dataset.
- Every season has an almost equal number of records.
- The dataset has more No Holiday records than Holiday records, which is expected, as most days are not holidays.
- The dataset has more Functioning Day records than non-functioning ones, which indicates that the rental service was operating on most days.
- Except for Friday, all days have an equal number of records in the dataset.
- April, June, September, November & February have slightly fewer records than the other months, since these months have fewer than 31 days.
- More data was collected in 2018 than in 2017.

# **3.3 Multivariate & Bivariate analysis**

## **3.3.1 Analysis between target variable and numerical features**

In [None]:
# Identify patterns and trends in numerical features

for col in numerical_features:
  # skip the target itself
  if col == "Rented Bike Count":
    continue

  # one figure per feature; the line shows the mean rented bike count at each feature value
  plt.figure(figsize=(15,6))
  sns.lineplot(x=col, y='Rented Bike Count', data=numerical_features)
  plt.title(f"Bike Demand over {col}")
  plt.xticks(rotation = 45)
  plt.show()


In [None]:
plt.figure(figsize = (15, 10))

# title
plt.suptitle('Bivariate Analysis of Numerical features', fontsize=20, fontweight='bold', y=1.02)

for index, col in enumerate(numerical_features):
  if col=="Rented Bike Count":
    pass
  else:
    # subplots of 3 rows and 3 columns
    plt.subplot(3,3, index+1)

    # scatter plots
    sns.scatterplot(x = numerical_features[col], y = numerical_features['Rented Bike Count'])

    plt.title(f'Bike Demand Over {col}')
    plt.xticks(rotation = 45)
    plt.tight_layout()

## **3.3.2 Bivariate Analysis of Categorical Features**

In [None]:
# Counting number of category present in each feature with respect to target feature

# figsize
plt.figure(figsize=(15,10))
# title
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(categorical_features):
   # subplots of 3 rows and 3 columns
  plt.subplot(3, 3, i+1)
  a = df.groupby(col)[['Rented Bike Count']].mean().reset_index()

  # barplot
  sns.barplot(x=a[col], y=a['Rented Bike Count'])
  # title
  plt.title(f'Average bike rentals across {col}')
  plt.xticks(rotation = 'vertical')
  plt.tight_layout()

Observations:

*   Hours: Demand is highest between hours 7-10 and 15-19. A likely reason is that these are the peak commuting hours in most metropolitan cities, so more people rent bikes.

*   Seasons: Summer had the highest rented bike count, so people are more likely to rent bikes in summer. Bike rentals in winter are far lower than in the other seasons.

*   Holidays: More bikes were rented on non-holidays than on holidays.

*   Functioning Day: On non-functioning days, zero bikes were rented. Hence, this column does not add value to our prediction, and we can drop it in the next steps.

*   Day: More bikes were rented on weekdays than on weekends.
*   Month: The rented bike count started increasing from March and was highest in June.

## **3.3.3 Multivariate analysis**

In [None]:
# Analysing bike demand by hour, split by each of the other categorical features

for i in categorical_features:
  if i == 'Hour':
    pass
  else:
    plt.figure(figsize=(15,8))
    sns.lineplot(x= df["Hour"], y= df['Rented Bike Count'], hue= df[i], marker ='o')
    plt.title(f"Bike Demand over Hour wrt to {i}")
  plt.show()

In [None]:
#Bar plot for seasonwise monthly distribution of Rented_Bike_Count
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(x='Month',y='Rented Bike Count',data= df, hue='Seasons',ax=ax);
ax.set_title('Season-wise monthly Rented Bike Count');
plt.show();

**Observations:**
*   The line and scatter plots of the numerical features in section 3.3.1 indicate that Temperature, Wind Speed, Visibility, Dew point temperature & Solar Radiation are positively correlated with the target variable, i.e. an increase in these features is accompanied by an increase in the rented bike count.
*   On the other hand, Rainfall, Snowfall & Humidity are negatively correlated with the target variable, i.e. an increase in these features is accompanied by a decrease in the rented bike count.





# **4. Data cleaning**



*   Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset.
*   It involves handling missing data, removing duplicates, addressing outliers, standardizing formats, resolving inconsistencies, and validating data.
*   Data cleaning ensures that the data is accurate, complete, and reliable for analysis or machine learning purposes.

### **4.1 Handling missing values**

In [None]:
# Checking for missing values
df.isnull().sum()

**As we can see there are no null values present in our dataset and therefore we are good to go.**

### **4.2 Handling duplicate values**

In [None]:
# Checking for duplicate values
df.duplicated().sum()

As we can see, there are no duplicate values.

### **4.3 Handling outliers**

In [None]:
#Creating a boxplot to detect columns with outliers
# figsize
plt.figure(figsize = (15,10))

# title
plt.suptitle('Outlier Analysis of Numerical features', fontsize = 20, fontweight='bold', y=1.02)

for index , col in enumerate(numerical_features):
  # subplots 3 rows, 3 columns
  plt.subplot(3,3, index+1)

  # boxplots
  sns.boxplot(numerical_features[col])

  plt.title(col)
  plt.tight_layout()

**Here we can see that the columns that contain outliers are Rented Bike Count, Wind Speed, Solar Radiation, Rainfall & Snowfall.**

In [None]:
#Creating a list of columns that contains outliers
outlier_cols = ['Rented Bike Count', 'Wind Speed', 'Solar Radiation', 'Rainfall','Snowfall']
outlier_cols

In [None]:
def calculate_ranges(data, column):

  # Skip categorical columns
  if data[column].dtype == 'object':
    return None, None
  else:
    # Calculate quartiles
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)

    # Calculate IQR
    IQR = Q3 - Q1

    # Calculate upper and lower ranges
    upper_range = Q3 + 1.5 * IQR
    lower_range = Q1 - 1.5 * IQR

    return upper_range, lower_range

In [None]:
calculate_ranges(numerical_features, 'Rented Bike Count')

In [None]:
# Identify potential outliers
plt.figure(figsize = (15,10))

for index, col in enumerate(outlier_cols):

  # Apply calculate_ranges function to get upper bound and lower bound
  upper_bound, lower_bound = calculate_ranges(df, col)

  # Identify potential outliers
  outliers = df[(df[col] > upper_bound) | (df[col] < lower_bound)]

  # Visualize the potential outliers: one subplot per column (2 rows, 3 columns)
  plt.subplot(2, 3, index+1)
  plt.hist(df[col], bins=30, color='lightblue', edgecolor='black', label='Data')
  plt.hist(outliers[col], bins=10, color='red', edgecolor='black', label='Potential Outliers')
  plt.xlabel(col)
  plt.ylabel('Frequency')
  plt.title(f"Distribution of {col} with Potential Outliers", fontsize = 10, fontweight='bold', y=1.02)
  plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Create a function to count the total number of outliers in each column

def count_outliers(data):
    # Initialize a dictionary to store the number of outliers per column
    outlier_count = {}

    # Loop through each column in the list containing outliers
    for col in outlier_cols:

        # Calculate the upper and lower ranges
        upper_range, lower_range = calculate_ranges(data, col)

        # Count the number of outliers in the column
        outlier_count[col] = len(data[(data[col] > upper_range) | (data[col] < lower_range)])

    return outlier_count

In [None]:
# Number of outliers in each column
count_outliers(df)

**Observation:**

Removing every outlier is not recommended, because we would lose many data points. Instead of dropping the outliers, we cap them using the clipping method.

In [None]:
num_features = ['Temperature', 'Humidity', 'Wind Speed', 'Visibility', 'Dew point temperature', 'Solar Radiation']

**Clipping Method:** In this method, we cap the outliers: any value above the upper threshold or below the lower threshold is treated as an outlier and replaced with that threshold. Values inside the range are left unchanged, so no rows are lost.
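
As a small worked example of the clipping method, the sketch below applies IQR-based bounds to a hypothetical series of readings (the values and variable names are made up for illustration and do not come from the dataset).

```python
import pandas as pd

# Hypothetical readings with one extreme value at the end
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# IQR-based bounds
q1 = readings.quantile(0.25)
q3 = readings.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are replaced by the nearest bound
clipped = readings.clip(lower_bound, upper_bound)

print("bounds:", lower_bound, upper_bound)
print(clipped.tolist())
```

Only the extreme reading is changed (it is pulled down to the upper bound), and the series keeps its original length.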

In [None]:
# Replace values outside the IQR-based bounds with the nearest bound

def clip_outliers(bike_df):
    # Clips the columns listed in num_features (defined above); modifies bike_df in place and returns it

    for col in num_features:
        # Using IQR method to define the range of upper and lower limits
        q1 = bike_df[col].quantile(0.25)
        q3 = bike_df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # Replacing the outliers with the upper and lower bounds
        bike_df[col] = bike_df[col].clip(lower_bound, upper_bound)

    return bike_df

In [None]:
# Creating a copy of dataset

new_df = df.copy()
# using the function to treat outliers
new_df = clip_outliers(new_df)

In [None]:
# checking the boxplot after outlier treatment

# figsize
plt.figure(figsize=(15,8))
# title
plt.suptitle('Outlier Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(num_features):
  # subplot of 3 rows and 2 columns
  plt.subplot(3, 2, i+1)

  # boxplot
  sns.boxplot(x=new_df[col])
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

In [None]:
# checking for distribution after treating outliers.

# figsize
plt.figure(figsize=(15,6))
# title
plt.suptitle('Data Distribution of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(num_features):
  # subplots 3 rows, 2 columns
  plt.subplot(3, 2, i+1)

  # distribution plots (histplot with a KDE replaces the deprecated sns.distplot)
  sns.histplot(x=new_df[col], kde=True, stat='density')
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

**After treating the outliers, the distributions of the clipped features shift: the extreme tails are capped, so features that were skewed before now look closer to a normal distribution. Based on this visual check, we do not apply any further transformation to the numerical features.**
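
The visual impression can be backed by a number. The sketch below measures skewness before and after IQR clipping on a hypothetical right-skewed feature; the log-normal sample is an assumption standing in for a real column, so the printed values are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (log-normal sample), not taken from the dataset
feature = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# Same IQR-based clipping rule as used in this section
q1, q3 = feature.quantile(0.25), feature.quantile(0.75)
iqr = q3 - q1
clipped = feature.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

skew_before = feature.skew()
skew_after = clipped.skew()
print(f"skewness before clipping: {skew_before:.2f}")
print(f"skewness after clipping:  {skew_after:.2f}")
```

A skewness near 0 suggests a roughly symmetric distribution. Running `new_df[num_features].skew()` would give the corresponding numbers for this notebook.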

# **5. Feature Engineering**

1. Feature engineering is the process of transforming raw data into a set of meaningful, informative, and predictive features that can be used to train machine learning models.

2. It involves selecting, creating, or modifying features in the dataset to enhance the performance and effectiveness of the models.

3. Feature engineering is a critical step in machine learning because the quality and relevance of features can significantly impact the model's performance. Well-engineered features can help capture relevant patterns, relationships, and structures in the data, enabling the model to make accurate predictions or classifications.

### **5.1 Linear relationship with target variable**

In [None]:
# Checking Linearity of all numerical features with our target variable

# figsize
plt.figure(figsize=(15, 10))

# title
plt.suptitle('Regression Analysis of Numerical features', fontsize=20, fontweight='bold', y=1.02)

for i, col in enumerate(numerical_features):
  if col=="Rented Bike Count":
    pass
  else:
  # subplots of 3 rows and 3 columns
    plt.subplot(3, 3, i+1)

    # regression plots
    sns.regplot(x= numerical_features[col], y = numerical_features['Rented Bike Count'], scatter_kws={"color": "blue"}, line_kws={"color": "red"})

    plt.title(f'Dependent variable and {col}')
    plt.tight_layout()


Most of the features are positively correlated with the target variable.

### **5.2 Correlation heatmap**

1. A correlation coefficient of 1 indicates a perfect positive linear relationship, where the variables increase or decrease together with a constant slope.
2. A correlation coefficient of -1 indicates a perfect negative linear relationship, where the variables move in opposite directions with a constant slope.
3. A correlation coefficient of 0 indicates no linear relationship between the variables.
4. The correlation coefficient is calculated using the covariance between the variables divided by the product of their standard deviations.
5. The correlation coefficient provides insight into the strength and direction of the relationship between variables.
However, it only measures linear relationships and does not capture other types of associations, such as nonlinear or complex dependencies.
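
The definition in point 4 and the caveat in point 5 can be sketched in a few lines of NumPy. The function name `pearson_r` and the sample values are assumptions made for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return float(covariance / (x.std() * y.std()))

# Hypothetical temperatures and rental counts (illustrative values only)
temperature = np.array([-5.0, 0.0, 10.0, 20.0, 30.0])
rentals = np.array([50.0, 120.0, 300.0, 520.0, 700.0])
print("temperature vs rentals:", pearson_r(temperature, rentals))

# A perfect but nonlinear (quadratic) dependence still has zero linear correlation
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("x vs x**2:", pearson_r(x, x ** 2))
```

The second result shows why a correlation near 0 does not rule out a relationship: `x ** 2` is fully determined by `x`, yet the linear correlation is zero.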

In [None]:
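# Heatmap relative to all numeric columns
# numeric_only=True skips the object columns, which corr() cannot handle in recent pandas versions
corr_matrix = new_df.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (cells above the diagonal)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)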

fig = plt.figure(figsize=(10, 10))
sns.heatmap(corr_matrix, mask=mask, annot=True, cbar=True, vmax=0.8, vmin=-0.8, cmap='RdYlGn')
plt.show()

In [None]:
plt.figure(figsize=(2,4), dpi=150)
sns.heatmap(new_df.corr(numeric_only=True)[["Rented Bike Count"]].sort_values
            (by="Rented Bike Count", ascending=False)[1:],annot=True)
plt.title('Features Correlating with Rented Bike Count', fontsize=10, fontweight='bold', y=1.02);

**From the above graphs, we can see that Temperature and Dew point temperature are highly correlated with each other, so we can drop one of them. Because Temperature has a higher correlation with our dependent variable "Rented Bike Count" than Dew point temperature does, we keep the Temperature column and drop the Dew point temperature column.**

In [None]:
# dropping Dew point temperature column due to multicollinearity

new_df.drop('Dew point temperature', axis=1, inplace=True)


### **5.3 VIF**

1. VIF, which stands for Variance Inflation Factor, is a measure used in regression analysis to assess multicollinearity among predictor variables.

2. Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other, which can cause issues in interpreting the individual effects of the variables and can lead to unstable and unreliable model estimates.

3. The VIF quantifies the extent to which the variance of the estimated regression coefficient is inflated due to multicollinearity.

4. It measures how much the variance of a particular predictor variable's estimated coefficient is increased compared to if that variable were uncorrelated with the other predictor variables in the model.

**Interpreting VIF values:**

1. A VIF of 1 indicates no multicollinearity, meaning the predictor variable is not correlated with the other predictors.
2. A VIF greater than 1 suggests some degree of multicollinearity, where higher values indicate stronger correlation with other predictors.
3. A commonly used threshold is a VIF value of 5 or 10. Variables with VIF values exceeding these thresholds are considered to have high multicollinearity and may need to be addressed.
4. By examining VIF values, researchers can identify predictor variables that contribute to multicollinearity and take appropriate actions, such as removing highly correlated variables, combining variables, or gathering additional data to mitigate the multicollinearity issue.
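
The idea behind the points above is the formula VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors. The sketch below implements it with NumPy on synthetic data; the function name `vif` and the generated columns are assumptions for illustration. Each auxiliary regression here includes an intercept, whereas statsmodels' `variance_inflation_factor`, to our understanding, uses the design matrix exactly as passed, so its values may differ unless a constant column is added.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X.

    R_j^2 comes from regressing column j on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        result.append(1.0 / (1.0 - r_squared))
    return np.array(result)

rng = np.random.default_rng(0)
# Synthetic predictors: dew_point is built to be almost a copy of temperature
temperature = rng.normal(size=500)
dew_point = temperature + 0.1 * rng.normal(size=500)
humidity = rng.normal(size=500)

with_dew = vif(np.column_stack([temperature, dew_point, humidity]))
without_dew = vif(np.column_stack([temperature, humidity]))
print("VIF with dew_point:   ", with_dew)
print("VIF without dew_point:", without_dew)
```

In this synthetic setup the two near-duplicate columns get large VIF values, and dropping one of them brings the remaining values close to 1.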

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to calculate Multicollinearity

def calculate_vif(X):

  # For each column of X, calculate VIF and save in dataframe
  vif = pd.DataFrame()
  vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
  vif["features"] = X.columns

  return vif

In [None]:
# multicollinearity result

calculate_vif(new_df[[i for i in new_df.describe().columns if i not in ['Rented Bike Count','Date']]])

### **We are going to use these variables for further model building**

### **5.4 Encoding**

Encoding refers to the process of converting categorical variables into numerical representations that can be understood and processed by machine learning algorithms. Since many machine learning algorithms require numerical inputs, encoding categorical variables becomes necessary.
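
As a minimal sketch of the two encoding styles used in this section, the example below one-hot encodes a multi-category column and maps a two-category column to 0/1. The four-row frame `toy` is hypothetical and only mirrors the column names of the dataset.

```python
import pandas as pd

# Hypothetical four-row frame mirroring two categorical columns of the dataset
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
})

# One-hot encoding: one indicator column per category; drop_first removes the
# first category (Autumn, alphabetically), which becomes the all-zeros baseline
encoded = pd.get_dummies(toy, columns=["Seasons"], prefix="Seasons", drop_first=True)

# Binary encoding: map the two categories to 1 and 0
encoded["Holiday"] = encoded["Holiday"].map({"Holiday": 1, "No Holiday": 0})

print(encoded)
```

Dropping the first category avoids a redundant column: a row with all season indicators at zero is implicitly Autumn.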

In [None]:
# dropping the Year column as it does not add any useful information

new_df.drop(['Year'], axis=1, inplace = True)
categorical_features.drop('Year', axis = 1, inplace = True)

In [None]:
# Check Unique Values for each categorical variable.
for i in categorical_features:
  print("Number of unique values in", i, "is" , new_df[i].nunique())


We will use one-hot encoding for Seasons, Day and Month, and binary (0/1) encoding for Holiday and Functioning Day. The remaining columns already hold numeric values.

In [None]:
# Creating a copy for checking
Enc = new_df.copy()
Enc =pd.get_dummies(Enc, columns=['Seasons'],prefix='Seasons',drop_first=True)

In [None]:
Enc.head()

In [None]:
new_df.head()

In [None]:
# Doing the same process on our data after checking the dummy function output

new_df = pd.get_dummies(new_df, columns = ['Seasons'], prefix='Seasons', drop_first = True)

In [None]:
new_df.head()

In [None]:
new_df = pd.get_dummies(new_df, columns = ['Day'], prefix='Day', drop_first = True)

In [None]:
new_df

In [None]:
new_df = pd.get_dummies(new_df, columns = ['Month'], prefix='Month', drop_first = True)

In [None]:
new_df.head()

In [None]:
# Numerical Encoding for holiday and functioning_day

new_df['Holiday'] = new_df['Holiday'].map({'Holiday': 1, 'No Holiday': 0})

new_df['Functioning Day'] = new_df['Functioning Day'].map({'Yes': 1, 'No': 0})

In [None]:
new_df.head(2)

### **5.5 Normalisation of target variable**

In [None]:
fig, ax = plt.subplots(1,2 , figsize = (15,5))

# Distribution plot of Rented Bike Count (histplot with a KDE replaces the deprecated sns.distplot)
dist = sns.histplot(x=new_df['Rented Bike Count'], kde=True, stat='density', ax = ax[0])
dist.set(xlabel = 'Rented Bike Count', ylabel ='Density', title = 'Distribution Plot of Target Variable')

# mean line
dist.axvline(new_df['Rented Bike Count'].mean(), color='magenta', linestyle='dashed', linewidth=2)
# median line
dist.axvline(new_df['Rented Bike Count'].median(), color='black', linestyle='dashed', linewidth=2)

# Boxplot
box = sns.boxplot(x=new_df['Rented Bike Count'], ax= ax[1])
box.set(title = 'Outlier Analysis of Target Variable')
plt.show()

**Observation:**

The plot above shows that Rented Bike Count is moderately right-skewed. Linear regression does not strictly require a normally distributed target (the normality assumption applies to the residuals), but a strongly skewed target often leads to skewed residuals, so we transform the target to bring its distribution closer to normal.
The boxplot also shows that there are outliers in the Rented Bike Count column.
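
The outliers a boxplot draws can also be listed numerically. The sketch below uses the 1.5 × IQR rule that boxplot whiskers follow by default; the helper name `iqr_outliers` and the sample values are assumptions for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample with one obviously extreme value
sample = pd.Series([1, 2, 3, 4, 5, 100])
flagged = iqr_outliers(sample)
print(flagged)
```

On the project data the same helper could be called as `iqr_outliers(new_df['Rented Bike Count'])`.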


In [None]:
# apply different transformation techniques and check the resulting distributions
fig,axes = plt.subplots(1,4,figsize=(20,5))
sns.distplot((new_df['Rented Bike Count']),ax=axes[0],color='brown').set_title(" Input data");

# here we use log1p, i.e. log(1 + x)
# the log is undefined at 0, so log1p adds 1 to every value before taking the log
sns.distplot(np.log1p(new_df['Rented Bike Count']),ax=axes[1],color='red').set_title("log1p");

# here we use square root
sns.distplot(np.sqrt(new_df['Rented Bike Count']),ax=axes[2], color='blue').set_title("Square root");

# here we use cube root
sns.distplot(np.cbrt(new_df['Rented Bike Count']),ax=axes[3], color='green').set_title("cube root");

Observations:

1. Applying a logarithmic transformation to the dependent variable did not help much as it resulted in a negatively skewed distribution.
2. Cube root transformation was attempted, but it did not result in a normally distributed variable.
3. Therefore, we will use a square root transformation for the regression as it transformed the variable into a well-distributed form.
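
The visual comparison can be backed by a number: the sample skewness of each transformed series, where values close to 0 indicate a roughly symmetric distribution. This is only a sketch on synthetic right-skewed data (the gamma-distributed `counts` series is an assumption standing in for Rented Bike Count).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative data (an assumption, not the project data)
rng = np.random.default_rng(33)
counts = pd.Series(rng.gamma(shape=2.0, scale=350.0, size=5000))

# Skewness of the raw data and of each candidate transformation
skews = {
    'raw': counts.skew(),
    'log1p': np.log1p(counts).skew(),
    'sqrt': np.sqrt(counts).skew(),
    'cbrt': np.cbrt(counts).skew(),
}

for name, value in skews.items():
    print(f'{name:>5}: skewness = {value:.3f}')
```

Replacing `counts` with `new_df['Rented Bike Count']` would give the corresponding figures for the real target.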

In [None]:
fig, ax = plt.subplots(1,2 , figsize = (15,5))

# checking the square root transformation of our target variable
dist =sns.distplot(np.sqrt(new_df['Rented Bike Count']), ax = ax[0])
dist.set(xlabel = 'sqrt(Rented Bike Count)', ylabel ='Density', title = 'Distribution Plot of Target Variable after sqrt transformation')

# mean line
dist.axvline(np.sqrt(new_df['Rented Bike Count']).mean(), color='magenta', linestyle='dashed', linewidth=2)
# median line
dist.axvline(np.sqrt(new_df['Rented Bike Count']).median(), color='black', linestyle='dashed', linewidth=2)

# Boxplot
box = sns.boxplot(np.sqrt(new_df['Rented Bike Count']), ax= ax[1])
box.set(title = 'Outlier Analysis of Target Variable after sqrt transformation')
plt.show()

**Observation**-

1. Applying the square root transformation to the right-skewed Rented Bike Count gives an approximately normal distribution, which suits linear models whose residuals are assumed to be roughly normal.
2. After the square root transformation, the boxplot shows no outliers in the Rented Bike Count column.
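
One consequence of training on the square root of the target is that predictions come out on the square-root scale and must be squared to get bike counts back. The sketch below illustrates this with made-up numbers (all arrays are assumptions for illustration); note how an RMSE on the transformed scale is not comparable to an RMSE in actual bikes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true counts and model predictions on the sqrt scale
y_true_counts = np.array([100.0, 400.0, 900.0, 1600.0])
y_true_sqrt = np.sqrt(y_true_counts)
y_pred_sqrt = y_true_sqrt + np.array([1.0, -2.0, 0.5, -1.0])

# Reverse the transformation: square the predictions to get counts
y_pred_counts = np.square(y_pred_sqrt)

rmse_sqrt = np.sqrt(mean_squared_error(y_true_sqrt, y_pred_sqrt))
rmse_counts = np.sqrt(mean_squared_error(y_true_counts, y_pred_counts))

print(f'RMSE on sqrt scale : {rmse_sqrt:.3f}')
print(f'RMSE in bike counts: {rmse_counts:.3f}')
```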

In [None]:
# applying square root on Rented_Bike_Count
new_df['Rented Bike Count']=np.sqrt(new_df['Rented Bike Count'])

## **Manipulations done and insights found**

1. We checked the correlation coefficients and found that most of the numerical features are positively correlated with our target variable.
2. From the heatmap, dew_point_temperature and temperature have a correlation coefficient of 0.91. Since dew_point_temperature is less correlated with our target variable, we dropped it.
3. We also ran a VIF analysis to detect multicollinearity. Because the VIF of 'year' was too large, we removed it from the data used to build our model.
4. We encoded our categorical features so that the models can use them: one-hot encoding for 'seasons', 'day' and 'month', and numeric encoding for 'holiday' and 'functioning_day'. The other columns are already numeric.
5. To treat our target variable, we applied a square root transformation, as it turned the target into a well-distributed form.

# **6. Model Building**

## **6.1 Train Test Split**

In [None]:
#X = independent variable and y = target variable
X = new_df.drop('Rented Bike Count', axis=1)
y= new_df['Rented Bike Count']

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)

print(X_train.shape)
print(X_test.shape)

## **6.2 Scaling Data**

In [None]:
# Scaling Data
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## **6.3 Model Training**

In [None]:
# empty list for appending performance metric score
model_result = []

def predict(ml_model,model_name):

  '''
  Pass the model and predict value.
  Function will calculate all the evaluation metrics and appending those metrics score on model_result list.
  Plotting different graphs for test data.
  '''

  # model fitting
  model = ml_model.fit(X_train,y_train)

  # predicting values
  y_train_pred = model.predict(X_train)
  y_test_pred = model.predict(X_test)

  # Reverse the transformation on the predictions    (In case if we need y_train_pred in original and transformed way)
  y_train_pred_original = np.power(y_train_pred, 2)
  y_test_pred_original = np.power(y_test_pred, 2)

  # graph --> best fit line on test data
  sns.regplot(x=y_test_pred, y=y_test, line_kws={'color':'red'})
  plt.xlabel('Predicted')
  plt.ylabel('Actual')

  '''Evaluation metrics on train data'''
  train_MSE  = round(mean_squared_error(y_train, y_train_pred),3)
  train_RMSE = round(np.sqrt(train_MSE),3)
  train_r2 = round(r2_score(y_train, y_train_pred),3)
  train_MAE = round(mean_absolute_error(y_train, y_train_pred),3)
  train_adj_r2 = round(1-(1-r2_score(y_train, y_train_pred))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)),3)
  print(f'train MSE : {train_MSE}')
  print(f'train RMSE : {train_RMSE}')
  print(f'train MAE : {train_MAE}')
  print(f'train R2 : {train_r2}')
  print(f'train Adj R2 : {train_adj_r2}')
  print('-'*150)

  '''Evaluation metrics on test data'''
  test_MSE  = round(mean_squared_error(y_test, y_test_pred),3)
  test_RMSE = round(np.sqrt(test_MSE),3)
  test_r2 = round(r2_score(y_test, y_test_pred),3)
  test_MAE = round(mean_absolute_error(y_test, y_test_pred),3)
  test_adj_r2 = round(1-(1-r2_score(y_test, y_test_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)),3)
  print(f'test MSE : {test_MSE}')
  print(f'test RMSE : {test_RMSE}')
  print(f'test MAE : {test_MAE}')
  print(f'test R2 : {test_r2}')
  print(f'test Adj R2 : {test_adj_r2}')
  print('-'*150)

  # graph --> actual vs predicted on test data
  plt.figure(figsize=(6,5))
  plt.plot((y_test_pred)[:20])
  plt.plot(np.array((y_test)[:20]))
  plt.legend(["Predicted","Actual"])
  plt.xlabel('First 20 test points')
  plt.show()
  print('-'*150)

  '''actual vs predicted value on test data'''
  d = {'y_actual':y_test, 'y_predict':y_test_pred, 'error':y_test-y_test_pred}
  print(pd.DataFrame(data=d).head().T)
  print('-'*150)

  # using the score from the performance metrics to create the final model_result.
  model_result.append({'model':model_name,
                       'train MSE':train_MSE,
                       'test MSE':test_MSE,
                       'train RMSE':train_RMSE,
                       'test RMSE':test_RMSE,
                       'train MAE':train_MAE,
                       'test MAE':test_MAE,
                       'train R2':train_r2,
                       'test R2':test_r2,
                       'train Adj R2':train_adj_r2,
                       'test Adj R2':test_adj_r2})

# **8. Model Implementation**

## **8.1 Linear Regression**

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The goal is to find the best-fitting line that can predict the value of the dependent variable based on the values of the independent variables.
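
As a small illustration of "fitting a linear equation", the sketch below generates noise-free data from a known equation and checks that `LinearRegression` recovers it. The data and the names `X_demo`, `y_demo` and `lin` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data generated from y = 3*x1 - 2*x2 + 5
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

lin = LinearRegression().fit(X_demo, y_demo)

print('coefficients:', lin.coef_)
print('intercept   :', lin.intercept_)
```

With real, noisy data the fitted coefficients are the least-squares estimates rather than exact values.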

In [None]:
predict(LinearRegression(), 'LinearRegression')

## **8.2 Lasso**

Lasso (Least Absolute Shrinkage and Selection Operator) is a regularization technique used in linear regression models. It helps to reduce the complexity of the model and improve its generalization ability by penalizing the magnitude of coefficients of the features.

The lasso regularization adds a penalty term to the loss function being optimized. The penalty term is proportional to the absolute magnitude of the coefficients, but unlike ridge regression, it shrinks the coefficients of some features to zero, effectively removing them from the model.
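
The feature-selection effect can be seen in a small sketch on synthetic data, where only the first two of five features influence the target. The data, the `alpha=0.5` value and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only features 0 and 1 carry signal
rng = np.random.default_rng(33)
X_demo = rng.normal(size=(200, 5))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Lasso sets the coefficients of the irrelevant features to exactly zero;
# ridge only shrinks them towards zero
print('lasso:', np.round(lasso.coef_, 3))
print('ridge:', np.round(ridge.coef_, 3))
```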

In [None]:
predict(Lasso(alpha=0.1, max_iter=1000), 'Lasso')

## **8.3 Ridge**

Ridge Regression is a type of regularized linear regression that aims to solve the problem of multicollinearity and overfitting by adding a penalty term to the loss function. The penalty term is the L2 regularization term (also known as the weight decay term), which adds a penalty proportional to the square of the magnitude of the coefficients.

In [None]:
predict(Ridge(alpha=0.1, max_iter=1000), 'Ridge')

## **8.4 Elastic Net**

ElasticNet is a linear regression algorithm that combines both L1 (Lasso) and L2 (Ridge) regularization techniques. L1 and L2 regularization are methods used to prevent overfitting by adding penalty terms to the loss function that the algorithm minimizes. Lasso adds a penalty proportional to the absolute value of the coefficients, while Ridge adds a penalty proportional to the square of the coefficients.
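
In scikit-learn the mix between the two penalties is controlled by the `l1_ratio` parameter of `ElasticNet` (default 0.5). A minimal sketch on synthetic data (the data and `alpha` value are assumptions) shows that `l1_ratio=1.0` reproduces Lasso, while lower values add an L2 component.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: only features 0 and 1 carry signal
rng = np.random.default_rng(33)
X_demo = rng.normal(size=(200, 5))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_pure_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)   # L1 penalty only
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)     # half L1, half L2

print('lasso          :', np.round(lasso.coef_, 3))
print('enet l1_ratio=1:', np.round(enet_pure_l1.coef_, 3))
print('enet l1_ratio=.5:', np.round(enet_mixed.coef_, 3))
```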

In [None]:
predict(ElasticNet(alpha=0.1, max_iter=1000), 'Elastic Net')

## **8.5 K-Nearest Neighbors**

K-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems. KNN is non-parametric, i.e. it makes no assumptions about the distribution underlying the data. For an unseen data point, the algorithm finds the K closest training points and, for regression, predicts the average of their target values (for classification, their majority class). The closest neighbors are determined by a distance metric such as Euclidean distance, Manhattan distance, Minkowski distance, or cosine similarity.
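
A tiny sketch of KNN regression on made-up one-dimensional data (the arrays are assumptions for illustration): with `n_neighbors=3`, the prediction is simply the mean target of the three nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training data: one feature, target equal to the feature
X_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
y_demo = np.array([0.0, 1.0, 2.0, 10.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_demo, y_demo)

# Nearest neighbours of x=1 are the points 0, 1 and 2, so the prediction is their mean
pred_near = knn.predict([[1.0]])
# Nearest neighbours of x=9 are the points 10, 2 and 1
pred_far = knn.predict([[9.0]])

print(pred_near, pred_far)
```

Because the method relies on distances, features on larger scales dominate unless the data is scaled first, which is one reason the scaling step matters for this model.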

In [None]:
predict(KNeighborsRegressor(n_neighbors=3),'KNN')

## **8.6 Support Vector Machine**

Support Vector Machine (SVM) is a popular and powerful machine learning algorithm for classification and regression problems. For classification it finds the hyperplane that best separates the classes; for regression (Support Vector Regression, used here through `SVR`) it fits a function that keeps as many training points as possible within a margin of tolerance around the prediction.

In [None]:
predict(SVR(kernel='rbf',C=100), 'SVM')

## **8.7 Decision Tree**

A decision tree is a tree-like model used in machine learning to make predictions or decisions by breaking down a set of rules or conditions into smaller and smaller sub-conditions, based on the values of the input features.

Each internal node in the tree represents a test on a feature, and each branch represents the outcome of that test. The terminal nodes, called leaves, hold the predictions: a class for classification or, for regression as in this project, the mean target value of the training samples that reach the leaf. The tree is built recursively by choosing, at each node, the feature and threshold that most reduce impurity (information gain for classification, variance or squared error for regression).
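
The sketch below fits a one-split regression tree (a "stump") to made-up data, to illustrate that a leaf of a regression tree predicts the mean target of the training samples it contains. The arrays and `max_depth=1` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data with two well-separated groups
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
y_demo = np.array([1.0, 3.0, 20.0, 22.0])

# A depth-1 tree makes a single split and has two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=33).fit(X_demo, y_demo)

# Each prediction is the mean target of the leaf the sample falls into
preds = stump.predict([[0.5], [10.5]])
print(preds)
```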

In [None]:
predict(DecisionTreeRegressor(min_samples_leaf=20, min_samples_split=3,max_depth=20, random_state=33), 'Decision Tree')

## **8.8 Random Forest**

Random Forest is an ensemble machine learning algorithm that builds multiple decision trees and combines their predictions to make a final classification or regression prediction. In contrast to a single decision tree, Random Forest reduces the risk of overfitting by combining the results of many trees, each built on a different subset of the data.

**Hyperparameter Tuning using GridSearchCV**

In [None]:
param_grid = {'n_estimators': [50,80],       # number of trees in the ensemble
             'max_depth': [15,20],           # maximum number of levels allowed in each tree.
             'min_samples_split': [5,15],    # minimum number of samples necessary in a node to cause node splitting.
             'min_samples_leaf': [3,5]}      # minimum number of samples which can be stored in a tree leaf.


# Initialize the RandomForestRegressor model
rf = RandomForestRegressor()

# Use GridSearchCV to perform a grid search over the parameter grid
grid_search = GridSearchCV(rf, param_grid=param_grid, cv=5, scoring='r2')

# Fit the grid search on the training data only, so the test set stays unseen during tuning
grid_search.fit(X_train, y_train)

In [None]:
# Get the best parameters from the grid search
rf_optimal_model = grid_search.best_estimator_
rf_optimal_model

In [None]:

predict(rf_optimal_model, 'Random Forest')

# **Model Explainability**

In [None]:
# feature importance
importances = rf_optimal_model.feature_importances_

# Creating a dictionary
importance_dict = {'Feature' : list(X.columns),
                   'Feature Importance' : importances}

# Creating the dataframe
importance = pd.DataFrame(importance_dict)
sorting_features = importance.sort_values(by=['Feature Importance'],ascending=False)
sorting_features

In [None]:
# plotting feature importance graph
plt.figure(figsize=(15,5))
bar = sns.barplot(x='Feature Importance', y='Feature', data=sorting_features, color='blue')
bar.set_title('Important Features')
plt.show()

**The top 5 important features in Random Forest are temperature, hour, functioning_day, rainfall, and humidity**
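
Impurity-based importances such as `feature_importances_` are computed on the training data and can favour some feature types, so it may be worth cross-checking them with permutation importance on held-out data. The sketch below does this on synthetic data where only feature 0 matters; the data and all `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on feature 0 only
rng = np.random.default_rng(33)
X_demo = rng.normal(size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=33)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=33).fit(X_tr, y_tr)

# Shuffle each feature in turn on the held-out set and measure the drop in score
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=33)
print('mean importance per feature:', np.round(result.importances_mean, 3))
```

For the project model the same call could be made with `rf_optimal_model`, `X_test` and `y_test`.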

## **8.9 AdaBoost**

AdaBoost (Adaptive Boosting) is an ensemble machine learning algorithm that combines multiple weak models to form a stronger model. It works by assigning weights to the data points in a dataset and iteratively building weak models that try to correctly classify or predict the target variable. After each iteration, the weights of the misclassified or mispredicted data points are increased, making it more likely that the next weak model will focus on these points.

In [None]:
# Create a base decision tree regression model
dt = DecisionTreeRegressor(max_depth=12)

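# Initialize the AdaBoost regression model.
# The base tree is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2.
ada = AdaBoostRegressor(dt, n_estimators=60, learning_rate=1, random_state=33)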

# Predict using function
predict(ada, 'AdaBoost')

In [None]:
new_df.info()

In [None]:
new_df["Hour"]

In [None]:
new_df['Hour'] = new_df['Hour'].astype('int')

In [None]:
new_df.info()

In [None]:
new_df["Functioning Day"]

# **Test XGBoost**

In [None]:
# Txg is a working copy of new_df used for the XGBoost experiment
Txg = new_df.copy()
X1 = Txg.drop('Rented Bike Count', axis=1)
y1 = Txg['Rented Bike Count']

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.2, random_state=33)

print(X_train.shape)
print(X_test.shape)

In [None]:
Txg=new_df.copy()

In [None]:
Txg.info()

In [None]:
Txg["Hour"].astype("int")

In [None]:
# columns to convert to the pandas 'category' dtype
cat_cols = ['Holiday', 'Functioning Day',
            'Seasons_Spring', 'Seasons_Summer', 'Seasons_Winter',
            'Day_Monday', 'Day_Saturday', 'Day_Sunday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday',
            'Month_August', 'Month_December', 'Month_February', 'Month_January', 'Month_July', 'Month_June',
            'Month_March', 'Month_May', 'Month_November', 'Month_October', 'Month_September']
Txg[cat_cols] = Txg[cat_cols].astype("category")

In [None]:
Txg.info()

In [None]:
param_grid = {'n_estimators': [300,500],     # number of boosted trees
             'max_depth': [7,8],             # maximum depth of each tree
             'min_child_weight': [3,5]}      # minimum sum of instance weight needed in a child
                                             # (min_samples_split / min_samples_leaf are scikit-learn tree
                                             # parameters that XGBoost does not use)


# Initialize the XGBRegressor model; enable_categorical lets it accept pandas 'category' columns
xgb = XGBRegressor(tree_method='hist', enable_categorical=True)

# Rebuild the features from Txg so the category dtypes set above are picked up
X1 = Txg.drop('Rented Bike Count', axis=1)
y1 = Txg['Rented Bike Count']

# Use GridSearchCV to perform a grid search over the parameter grid
grid_search = GridSearchCV(xgb, param_grid=param_grid, cv=5, scoring='r2')

# Fit the grid search
grid_search.fit(X1, y1)

## **8.10 Xtreme Gradient Boosting**

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm designed for large-scale and complex data. It is an ensemble learning algorithm that builds decision trees sequentially, each new tree correcting the errors of the trees before it, and sums their outputs to make the final prediction.
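
The core boosting idea can be sketched by hand with plain scikit-learn trees: each round fits a small tree to the current residuals and adds a damped version of its output to the running prediction. This is only an illustration of gradient boosting with squared error, not XGBoost's actual regularized implementation; the data, `learning_rate` and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data
rng = np.random.default_rng(33)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_demo, y_demo.mean())        # start from the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    residuals = y_demo - prediction                     # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=33).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # damped correction
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f'training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}')
```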

**Hyperparameter Tuning using GridSearchCV**

In [None]:
param_grid = {'n_estimators': [300,500],     # number of boosted trees
             'max_depth': [7,8],             # maximum depth of each tree
             'min_child_weight': [3,5]}      # minimum sum of instance weight needed in a child
                                             # (min_samples_split / min_samples_leaf are scikit-learn tree
                                             # parameters that XGBoost does not use)


# Initialize the XGBRegressor model
xgb = XGBRegressor()

# Use GridSearchCV to perform a grid search over the parameter grid
grid_search = GridSearchCV(xgb, param_grid=param_grid, cv=5, scoring='r2')

# Fit the grid search on the training data only, so the test set stays unseen during tuning
grid_search.fit(X_train, y_train)

In [None]:
# Get the best parameters from the grid search
xgb_optimal_model = grid_search.best_estimator_
xgb_optimal_model

In [None]:
predict(xgb_optimal_model, 'XGB')

**Model Explainability**

In [None]:
# feature importance
importances = xgb_optimal_model.feature_importances_

# Creating a dictionary
importance_dict = {'Feature' : list(X.columns),
                   'Feature Importance' : importances}

# Creating the dataframe
importance = pd.DataFrame(importance_dict)
sorting_features = importance.sort_values(by=['Feature Importance'],ascending=False)
sorting_features

In [None]:

# plotting feature importance graph
plt.figure(figsize=(15,5))
bar = sns.barplot(x='Feature Importance', y='Feature', data=sorting_features, color='blue')
bar.set_title('Important Features')
plt.show()

**The top 5 important features in XGB Model are season_winter, functioning_day, rainfall, hour, and season_autumn**

## **8.11 Light GBM**

LightGBM is an open-source gradient boosting framework, developed by Microsoft, that uses tree-based learning algorithms. It is designed to be more efficient than traditional gradient boosting implementations and is particularly well-suited for large datasets.

One of the key features of LightGBM is its use of a histogram-based approach to split nodes in decision trees.
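
The idea behind a histogram-based split can be sketched in a few lines of NumPy: feature values are bucketed into a fixed number of bins, per-bin target sums and counts are accumulated, and only the bin boundaries are considered as split candidates. This is a simplified illustration, not LightGBM's actual implementation; the data, the bin count and all names are assumptions.

```python
import numpy as np

# Synthetic feature with a step in the target at x = 20
rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=1000)
y = np.where(x > 20, 30.0, 10.0) + rng.normal(0, 1, size=1000)

# 1. Bucket the continuous feature into a small number of equal-width bins
n_bins = 16
edges = np.linspace(x.min(), x.max(), n_bins + 1)
bin_idx = np.digitize(x, edges[1:-1])                 # bin index 0 .. n_bins-1

# 2. Build the histogram: per-bin sum of targets and per-bin counts
sum_y = np.bincount(bin_idx, weights=y, minlength=n_bins)
cnt = np.bincount(bin_idx, minlength=n_bins)

# 3. Evaluate splits only at bin boundaries using cumulative sums
left_sum, left_cnt = np.cumsum(sum_y)[:-1], np.cumsum(cnt)[:-1]
right_sum, right_cnt = sum_y.sum() - left_sum, cnt.sum() - left_cnt
gain = left_sum ** 2 / left_cnt + right_sum ** 2 / right_cnt   # higher = lower squared error

best = int(np.argmax(gain))
threshold = edges[best + 1]
print(f'best split threshold: {threshold:.2f}')
```

Scanning 15 bin boundaries instead of up to 999 distinct values is where the speed-up comes from, at the cost of a slightly coarser threshold.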

**Hyperparameter Tuning using GridSearchCV**

In [None]:
param_grid = {'n_estimators': [600,800],     # number of boosted trees
             'max_depth': [8,10],            # maximum depth of each tree
             'min_child_samples': [2,3]}     # minimum number of samples in a leaf (LightGBM's counterpart of
                                             # scikit-learn's min_samples_leaf; it has no min_samples_split)


# Initialize the LGBMRegressor model
lgb = LGBMRegressor()

# Use GridSearchCV to perform a grid search over the parameter grid
grid_search = GridSearchCV(lgb, param_grid=param_grid, cv=5, scoring='r2')

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)


In [None]:
# Get the best parameters from the grid search
lgb_optimal_model = grid_search.best_estimator_
lgb_optimal_model

In [None]:
predict(lgb_optimal_model, 'LGB')

**Model Explainability**

In [None]:
# feature importance
importances = lgb_optimal_model.feature_importances_

# Creating a dictionary
importance_dict = {'Feature' : list(X.columns),
                   'Feature Importance' : importances}

#Creating the dataframe
importance = pd.DataFrame(importance_dict)
sorting_features = importance.sort_values(by=['Feature Importance'],ascending=False)
sorting_features

In [None]:
# plotting feature importance graph
plt.figure(figsize=(15,5))
bar=sns.barplot(x='Feature Importance', y='Feature', data=sorting_features, color='blue')
bar.set_title('Important Features')
plt.show()

**The top 5 important features in the LGB Model are temperature, humidity, visibility, hour, and day.**

## **8.12 Model Result**

R-squared (R2) measures how well the model fits the dependent variable, but it does not account for overfitting: adding more independent variables never lowers R2 on the training data, so a model with many variables may score well on training data yet fail on test data because it is too complex. Adjusted R-squared corrects for this by penalizing each additional independent variable added to the model.
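
The adjustment can be written as a small helper that uses the same formula as the `predict` function above, with `n_samples` rows and `n_features` independent variables. The function name and the example numbers are assumptions for illustration.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R2, but more features lead to a lower adjusted score
few_features = adjusted_r2(0.90, n_samples=101, n_features=10)
many_features = adjusted_r2(0.90, n_samples=101, n_features=50)

print(round(few_features, 4), round(many_features, 4))
```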

Because R2 measures the proportion of variance in the dependent variable that the independent variables explain, we use it, together with RMSE, as the main metric for comparing the models that predict rented_bike_count.

In [None]:
# converting the model_result list into DataFrame
model_result = pd.DataFrame(model_result)

# sorting the values by test R2 score
model_result.sort_values(by='test R2', ascending=False)

In [None]:
# plotting graph to compare model performance of all the models
fig, ax = plt.subplots(1,2, figsize=(20,5))
sns.barplot(x=model_result['model'], y=model_result['test R2'], ax=ax[0])           # Model Vs test R2
sns.barplot(x=model_result['model'], y=model_result['test Adj R2'], ax=ax[1])       # Model Vs test Adj R2
plt.tight_layout()

**From the above results, we can select either the XGB or the LGB Regressor as the final model because they have the lowest RMSE values as well as the highest R2 scores on the test data. I would go with LGB because it is designed to scale well to large datasets and its feature importances give a clearer picture of the features it relies on.**

# **9. Conclusion**

**Summary**

We began our analysis by performing EDA on the dataset. First, we examined and transformed our dependent variable, "Rented Bike Count." We then analysed the categorical and numerical variables, looking at their distributions, their correlations, and their relationship to the dependent variable. Finally, we one-hot encoded the categorical variables and removed numerical features that were used only for EDA or that showed multicollinearity.

Following that, we trained several well-known models, ranging from straightforward Linear Regression and regularized models (Ridge, Lasso, and Elastic Net) to more complex ensemble models such as Random Forest, XGBoost, and Light GBM. To improve performance, we tuned the hyperparameters of the ensemble models.

## **Conclusion**

1. Here are some solutions to manage Bike Sharing Demand.
- The majority of rentals are for daily commutes to workplaces and colleges. Therefore open additional stations near these landmarks to reach their primary customers.
- While planning for extra bikes to stations the peak rental hours must be considered, i.e. 7–9 am and 5–6 pm.
- Maintenance activities for bikes should be done at night due to the low usage of bikes during the night time. Removing some bikes from the streets at night time will not cause trouble for the customers.
2. We see 2 rental patterns across the day in bike rental count - first for a Working Day where the rental count is high at peak office hours (8 am and 5 pm) and the second for a Non-working day where the rental count is more or less uniform across the day with a peak at around noon.
3. Hour of the day: Bike rental count is mostly correlated with the time of the day. As indicated above, the count reaches a high point during peak hours on a working day and is mostly uniform during the day on a non-working day.
4. Temperature: People generally prefer to bike at moderate to high temperatures. We see the highest rental counts between 32 and 36 degrees Celsius.
5. Season: We see the highest number of bike rentals in the Spring (March to May) and Summer (June to August) seasons and the lowest in the Winter (December to February) season.
6. Weather: As one would expect, we see the highest number of bike rentals on a clear day and the lowest on a snowy or rainy day
7. Humidity: With increasing humidity, we see a decrease in the bike rental count.
8. I chose the Light GBM model because prediction accuracy for rented_bike_count matters most here and training time is not a constraint. Linear models, decision trees, Random Forest, and gradient boosting techniques were all tried, and I compared their R2 scores to choose the final model.
9. The training R2 score is around 99% while the test R2 score is 92.5%. This gap suggests some overfitting, likely because the dataset is small. Once more data is available, the model can be retrained for better performance.

## **Way Forward**

However, this is not the end. Because the data is time-dependent, variables such as temperature, wind speed, and solar radiation will not stay consistent, so there will be scenarios where the model does not perform well. The model should therefore be checked from time to time and retrained as new data arrives. Machine learning is a fast-evolving field, and keeping pace with it will help keep the solution a step ahead.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
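The project summary names the R2 score and RMSE as the main metrics. The sketch below shows how both are computed, using hypothetical true and predicted rental counts; `sklearn.metrics` provides equivalent functions for real model output.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Share of target variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical rental counts and model predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

model_rmse = rmse(y_true, y_pred)
model_r2 = r2_score(y_true, y_pred)
```

RMSE reads as a typical error in bikes per hour, while R2 is unitless, which makes it the easier of the two to compare across models.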

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model
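To illustrate what grid search with k-fold cross-validation does, the sketch below tunes the penalty `alpha` of a one-feature ridge regression without an intercept. The toy data, the candidate grid, and all helper names are assumptions made for illustration; scikit-learn's `GridSearchCV` automates the same loop for real estimators.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def fit_ridge_slope(xs, ys, alpha):
    """Closed-form slope of a one-feature ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_val_mse(xs, ys, alpha, k=5):
    """Mean validation MSE over k folds for one hyperparameter value."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        slope = fit_ridge_slope([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx], alpha)
        preds = [slope * xs[i] for i in val_idx]
        scores.append(mse([ys[i] for i in val_idx], preds))
    return sum(scores) / len(scores)

def grid_search(xs, ys, alphas, k=5):
    """Evaluate every candidate and return the one with the lowest CV error."""
    results = {alpha: cross_val_mse(xs, ys, alpha, k) for alpha in alphas}
    best_alpha = min(results, key=results.get)
    return best_alpha, results

# Hypothetical toy data: y is roughly 3 * x with alternating noise.
xs = [float(i) for i in range(1, 21)]
ys = [3.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
best_alpha, cv_results = grid_search(xs, ys, alphas)
```

Each candidate is scored only on folds it was not trained on, so the selected value reflects generalization rather than fit to the training data.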

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
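A minimal sketch of the save, load, and sanity-check round trip using `pickle`. The "model" here is a hypothetical stand-in (a dict of linear coefficients with a small `predict` helper), and the file name is illustrative; a fitted scikit-learn estimator can be passed to `pickle.dump` in the same way.

```python
import os
import pickle
import tempfile

def predict(params, rows):
    """Linear prediction from stored weights and intercept."""
    return [params["intercept"] + sum(w * x for w, x in zip(params["weights"], row))
            for row in rows]

# Hypothetical fitted parameters standing in for the best model.
model = {"weights": [2.0, -1.0], "intercept": 10.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")

    # Save the model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a deployment process would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# Sanity check on a row the model has not seen.
unseen = [[5.0, 3.0]]
predictions = predict(loaded_model, unseen)
```

If the loaded model's predictions match those of the in-memory model on the same rows, the file is safe to hand over. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.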

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***