<a href="https://colab.research.google.com/github/sandip99999/Bike-shearing-Demand-Prediction/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Seoul Bike Sharing Demand Prediction**



##### **Project Type**    - **Regression**

##### **Contribution**    - **Individual**

##### **Team Member**- **Sandip Dey**

# **Project Summary -**

The data come from the public bike-sharing system of Seoul. A bike-sharing system is a service in which bikes are made available to individuals for shared, short-term use, either for a fee or for free. Many bike-share systems let people borrow a bike from a computer-controlled "dock": the user enters payment information and the system unlocks a bike, which can later be returned to any other dock in the same system. The dataset has the variables date, hour, temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, snowfall, seasons, holiday, functioning day and rented bike count.


The problem statement was to build a machine learning model that predicts the number of rented bikes required in a given hour from the other variables. The first step was exploratory data analysis, where we looked for insights in the data at hand. It included univariate and multivariate analysis, in which we identified trends, relationships and correlations and found the features that had an impact on our dependent variable. The second step was to clean and modify the data: we checked for missing values and outliers, removed irrelevant features and encoded the categorical variables. The third step was to try several machine learning algorithms on the split and standardized data, namely linear regression, random forest and XGBoost. We tuned the hyperparameters and evaluated each model with several metrics. The best performance came from the gradient boosting and random forest models, with R2 scores of 0.95 on the training set and 0.92 on the test set.


The features with the largest impact on the model predictions were hour, temperature, wind speed, solar radiation, month and seasons. Demand for bikes was higher for higher values of temperature and hour, and for low values of wind speed and solar radiation. Demand was high during spring and summer and very low during winter.


The model performed well in this case, but the data is time dependent, so values of temperature, wind speed, solar radiation etc. will not always stay consistent. There will therefore be scenarios where the model does not perform well. As machine learning is a rapidly evolving field, we have to be prepared for such contingencies and keep checking our model from time to time.



# **GitHub Link -**

https://github.com/sandip99999/Bike-shearing-Demand-Prediction

# **Problem Statement**


**Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd     # data analysis tools: grouping, combining and filtering data

import numpy as np      # basic numerical operations

from matplotlib import pyplot as plt # library for static, animated and interactive visualizations

import seaborn as sns                # high-level interface for statistical graphics

from sklearn.model_selection import train_test_split, GridSearchCV,  cross_val_score
from sklearn import preprocessing, linear_model
from sklearn.preprocessing import  LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler 
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score
from sklearn.linear_model import Ridge, Lasso, LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn import neighbors
from sklearn.svm import SVR
from sklearn import tree
from sklearn.ensemble import BaggingRegressor


import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)   # show every column when displaying a DataFrame
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Path to the CSV file on the mounted Google Drive
path="/content/drive/MyDrive/Colab Notebooks/Bike shearing demand prediction/SeoulBikeData.csv"
bike_df = pd.read_csv(path,encoding= 'unicode_escape')
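
Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper that fails early with a clear message. The following is only a sketch: `load_bike_data` is a hypothetical helper, not part of the original notebook, and the demo file holds made-up values rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_bike_data(csv_path, encoding="unicode_escape"):
    """Read the CSV and raise a descriptive error if it is missing or empty."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    df = pd.read_csv(csv_path, encoding=encoding)
    if df.empty:
        raise ValueError(f"Dataset at {csv_path} contains no rows")
    return df


# Demo on a tiny throwaway file (made-up values, not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text(
        "Date,Rented Bike Count,Hour\n01/12/2017,254,0\n01/12/2017,204,1\n"
    )
    demo_df = load_bike_data(demo_path)

    # A missing file should be reported instead of failing later on.
    try:
        load_bike_data(Path(tmp_dir) / "missing.csv")
        missing_file_detected = False
    except FileNotFoundError:
        missing_file_detected = True

print(demo_df.shape, missing_file_detected)
```

In the notebook itself the call would be `load_bike_data(path)` with the Drive path defined above.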

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head(5)


In [None]:
bike_df.tail(5)

In [None]:
#Getting all the columns
print("Features of the dataset:")
bike_df.columns

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bike_df.shape

In [None]:
print(f'There are {bike_df.shape[0]} Rows and {bike_df.shape[1]} Columns in the dataset')

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicate entry in dataset is :",bike_df.duplicated().sum())

**No duplicate entries were found in the dataset.**

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

In [None]:
# Missing values per column (isna is an alias of isnull)
bike_df.isna().sum()

In [None]:
pd.DataFrame((bike_df.isnull().sum())*100/bike_df.shape[0]).reset_index()

In [None]:
missing = pd.DataFrame((bike_df.isnull().mean())*100).reset_index()
plt.figure(figsize=(8,3))
ax = sns.pointplot(data=missing,x="index",y=0)
plt.xticks(rotation =90,fontsize =7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()

**No missing values were found in the dataset.**

### What did you know about your dataset?

In [None]:
# Custom function summarising dtype, unique values and null values per column
def bike_df_info():
    bike = pd.DataFrame(index=bike_df.columns)
    bike['DataType'] = bike_df.dtypes
    bike["Non-null_Values"] = bike_df.count()
    bike['Unique_Values'] = bike_df.nunique()   # assignment aligns on the column names, so no sorting is needed
    bike['NaN_Values'] = bike_df.isnull().sum()
    bike['NaN_Values_Percentage'] = (bike['NaN_Values']/len(bike_df))*100 
    return bike

In [None]:
# Custom Function
bike_df_info()

**Finding details from data:**

1. There are 14 features with 8760 rows of data.
2. There are 4 categorical columns and 10 numerical columns. Columns 'Date', 'Seasons', 'Holiday' and 'Functioning Day' are of *object* data type.
3. Columns 'Rented Bike Count', 'Hour', 'Humidity(%)' and 'Visibility (10m)' are of *int64* numerical data type.
4. Columns 'Temperature(°C)', 'Wind speed (m/s)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)' and 'Snowfall (cm)' are of *float64* numerical data type.
5. No column contains null values.
6. Unique counts: Seasons - 4, Holiday - 2, Functioning Day - 2, Date - 365, Rented Bike Count - 2166, Hour - 24, Temperature(°C) - 546 etc.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Features of the dataset:")
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe().T               # .T transposes the describe table

### Variables Description 

**Breakdown of Our Features:**

**Date** : *The date of the observation, covering 365 days from 01/12/2017 to 30/11/2018 in DD/MM/YYYY format, type : str. We need to convert it to datetime format.*

**Rented Bike Count** : *Number of bikes rented per hour. This is our dependent variable, the one we need to predict, type : int*

**Hour**: *The hour of the day, from 0 to 23 (24-hour clock), type : int. We need to convert it to the category data type.*

**Temperature(°C)**: *Temperature in Celsius, type : Float*

**Humidity(%)**: *Humidity in the air in %, type : int*

**Wind speed (m/s)** : *Speed of the wind in m/s, type : Float*

**Visibility (10m)**: *Visibility in units of 10 m, type : int*

**Dew point temperature(°C)**: *Temperature, in Celsius, to which the air must cool to become saturated with water vapour, type : Float*

**Solar Radiation (MJ/m2)**: *Solar radiation in MJ/m2, type : Float*

**Rainfall(mm)**: *Amount of rainfall in mm, type : Float*

**Snowfall (cm)**: *Amount of snowfall in cm, type : Float*

**Seasons**: *Season of the year, type : str. There are only 4 seasons in the data.*

**Holiday**: *Whether the day is a holiday or not, type : str*

**Functioning Day**: *Whether the day is a functioning day or not, type : str*






### Check Unique Values for each variable.

**Checking the possible values of the important categorical columns**

In [None]:
print(bike_df["Seasons"].unique())

In [None]:
print(bike_df["Holiday"].unique())

In [None]:
print(bike_df["Functioning Day"].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#Rename the columns with complex names
bike_df=bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
bike_df.head(2)

In [None]:
bike_df.groupby("Functioning_Day")["Rented_Bike_Count"].sum().sort_values(ascending=False).reset_index()

The totals above show that bikes are rented only on functioning days, so we remove the Functioning Day column.

In [None]:
#The Functioning Day column is not useful, so remove it
bike_df1=bike_df.drop(["Functioning_Day"],axis=1)
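
The reasoning behind this drop can be checked directly. The sketch below uses a made-up toy frame, not the real data, to confirm that every non-functioning hour has zero rentals. It also shows an optional variation, which the notebook does not apply, where those zero-rental rows are filtered out before the column is dropped.

```python
import pandas as pd

# Toy stand-in for bike_df; the values are made up for illustration only.
toy = pd.DataFrame({
    "Rented_Bike_Count": [254, 0, 310, 0, 120],
    "Functioning_Day": ["Yes", "No", "Yes", "No", "Yes"],
})

# Check the claim: every non-functioning hour has zero rentals.
non_functioning = toy[toy["Functioning_Day"] == "No"]
all_zero = bool((non_functioning["Rented_Bike_Count"] == 0).all())

# Optional variation (an assumption, not the notebook's approach):
# keep only functioning hours, then drop the now-constant column.
toy_functioning = (
    toy[toy["Functioning_Day"] == "Yes"]
    .drop(columns=["Functioning_Day"])
    .reset_index(drop=True)
)

print(all_zero, toy_functioning.shape)
```

Whether to keep the zero-rental rows depends on whether the model should also learn from hours when the service was closed.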

In [None]:
#check the new shape of the dataset
print(f"New shape of the dataset is {bike_df1.shape}")

In [None]:
#convert the Date column from string to datetime
import datetime as dt
bike_df1['Date'] = pd.to_datetime(bike_df1['Date'], format="%d/%m/%Y")

In [None]:
#Separate Day (weekday name), Month (month name) and Year from the Date column
bike_df1['Day']=bike_df1['Date'].dt.day_name()
bike_df1['Month']=bike_df1['Date'].dt.month_name()
bike_df1['Year']=bike_df1['Date'].dt.year

In [None]:
bike_df1.head(10)

In [None]:
bike_df1["Year"].unique()

In [None]:
#create a new column "weekdays_weekend" (1 = weekend, 0 = weekday) and drop the columns "Date", "Day" and "Year"
bike_df1['weekdays_weekend']=bike_df1['Day'].apply(lambda x : 1 if x=='Saturday' or x=='Sunday' else 0 )
bike_df1=bike_df1.drop(columns=['Date','Day','Year'])

In [None]:
bike_df1.head(2)

In [None]:
bike_df1.info()

In [None]:
bike_df1['weekdays_weekend'].value_counts()

In [None]:
#Change the categorical columns to the category data type
columns1=['Hour','Seasons','Holiday','Month','weekdays_weekend']
for col in columns1:
  bike_df1[col]=bike_df1[col].astype('category')

In [None]:
#Change the int64 columns to float
columns2=["Rented_Bike_Count","Humidity","Visibility"]
for col in columns2:
  bike_df1[col]=bike_df1[col].astype('float')

In [None]:
#let's check the result of data type
bike_df1.info()

In [None]:
bike_df1.columns

### What all manipulations have you done and insights you found?

1. The data show that bikes are rented only on functioning days, so we removed the "Functioning Day" column.
2. We split the "Date" column into 3 different columns: "Year", "Month" and "Day".
3. The "Year" column has only 2 unique values because the data run from December 2017 to November 2018. If we treat this period as one year, the "Year" column carries no extra information, so we dropped it.
4. The "Day" column holds the name of the weekday for each date. We do not need every individual day, only whether a day is a weekday or a weekend, so we derived the "weekdays_weekend" column from it and dropped the "Day" column.
5. "Hour" and "weekdays_weekend" are stored as integers and "Month" as text, but all three are really categorical, so we converted them to the category data type. Otherwise further analysis, such as correlation, would treat them as continuous values and could mislead us.
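
The steps above can be sketched on a toy frame. This is a minimal illustration with four made-up rows, and it derives the weekend flag from `dt.dayofweek` instead of the weekday name, which is an equivalent alternative to the notebook's approach.

```python
import pandas as pd

# Four made-up rows in the same DD/MM/YYYY format as the dataset.
toy = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017", "03/12/2017", "30/11/2018"],
    "Hour": [0, 8, 18, 23],
})

# Parse the date strings, then derive the month name and a weekend flag.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")
toy["Month"] = toy["Date"].dt.month_name()
# dayofweek: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend.
toy["weekdays_weekend"] = (toy["Date"].dt.dayofweek >= 5).astype(int)

# Treat the derived columns as categories rather than continuous numbers.
for col in ["Hour", "Month", "weekdays_weekend"]:
    toy[col] = toy[col].astype("category")

toy = toy.drop(columns=["Date"])
print(toy)
```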

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**Let's explore the data by visualizing the distribution of values in some columns of the dataset, and the relationships between "Rented Bike count" and other columns.**

**We'll use libraries Matplotlib, Seaborn for visualization.**

#### Chart - 1 Relation between 'Month' and 'Rented_bike_count'

In [None]:
bike_df1.head(2)

In [None]:
# Chart - 1 visualization code
fig,ax=plt.subplots(figsize=(14,7))
sns.barplot(data=bike_df1,x='Month',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to Month')
plt.show()

##### 1. Why did you pick the specific chart?

To see the distribution of Rented_Bike_Count across the months.

##### 2. What is/are the insight(s) found from the chart?

From the above bar plot we can clearly say that demand for rented bikes is higher from May to October than in the other months.

These months cover the summer season together with late spring and early autumn.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In the summer season the rented bike business will be stronger than in the winter season.

#### Chart - 2 Count of Rented bikes according to Weekdays_Weekend and Time

In [None]:
bike_df1.head(2)

In [None]:
# mean rented bike count for weekdays and weekends
bike_df1.groupby("weekdays_weekend")["Rented_Bike_Count"].mean().reset_index()

In [None]:
# Chart - 2 visualization code
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=bike_df1,x='weekdays_weekend',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented bikes according to weekdays and weekend ')
plt.show()

In [None]:
# Chart - 2 visualization code (hourly demand, weekdays vs weekends)
fig,ax=plt.subplots(figsize=(14,7))
sns.pointplot(data=bike_df1,x='Hour',y='Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Count of Rented bikes according to Hour and weekdays_weekend')
plt.show()

##### 1. Why did you pick the specific chart?

To see how the use of rented bikes is distributed across weekdays, weekends and the time of day.

##### 2. What is/are the insight(s) found from the chart?

1. From the above bar plot we can say that the mean rented bike count on weekdays and weekends is almost the same.
  On weekdays it is slightly higher, likely because of office commutes, and on weekends it is slightly lower.

2. In the point plot, weekdays are shown in blue. Demand is higher on weekdays, which we attribute to office commutes.

  Peak times are 7 am to 9 am and 5 pm to 7 pm.


3. The orange line represents weekends. It shows that demand for rented bikes is very low, especially in the morning hours, and that it increases slightly in the evening from 4 pm to 8 pm.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From the above graphs we can say that the business can run through the entire week without affecting our profit margins.

On weekdays, demand for rented bikes increases during office commute hours, from 7 am to 9 am and from 5 pm to 7 pm.

Similarly, on weekends demand increases slightly from 4 pm to 8 pm compared to other times.
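
The peak hours read off the point plot can also be computed. The following sketch assumes the same column names as `bike_df1` but uses a handful of made-up rows, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up hourly observations; weekdays_weekend: 0 = weekday, 1 = weekend.
toy = pd.DataFrame({
    "Hour":              [8, 8, 13, 13, 18, 18, 8, 13, 18],
    "weekdays_weekend":  [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Rented_Bike_Count": [900, 1100, 400, 500, 1500, 1700, 300, 700, 900],
})

# Mean demand per hour, with one column per day type.
hourly_mean = (
    toy.groupby(["weekdays_weekend", "Hour"])["Rented_Bike_Count"]
    .mean()
    .unstack("weekdays_weekend")
)

# Hour with the highest mean demand for each day type.
peak_hour = hourly_mean.idxmax()

print(hourly_mean)
print(peak_hour)
```

On the real data, sorting each column of `hourly_mean` in descending order would list the busiest hours for weekdays and weekends separately.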

#### Chart - 3 Relation between Rented Bikes count and Hours

In [None]:
bike_df1.head(2)

In [None]:
# Chart - 3 visualization code
fig,ax=plt.subplots(figsize=(14,7))
sns.barplot(data=bike_df1,x='Hour',y='Rented_Bike_Count',ax=ax,capsize=.2)   # use the wrangled bike_df1 like the other charts
ax.set(title='Count of Rented bikes according to Hour')
plt.show()

##### 1. Why did you pick the specific chart?

To see the use of rented bikes according to the hour of the day.

##### 2. What is/are the insight(s) found from the chart?

The above plot shows the use of rented bikes by hour, using data from the whole year. People generally use rented bikes around their commute to and from work, from 7am to 9am and from 5pm to 7pm.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The above graph shows that the maximum demand for rented bikes comes around commute hours, from 7am to 9am and from 5pm to 7pm, and

the minimum demand for rented bikes comes in the early morning.

#### Chart - 4 Count of Rented bikes according to Seasons

In [None]:
bike_df1.head(2)

In [None]:
# Chart - 4 visualization code
fig,ax=plt.subplots(figsize=(14,7))
sns.barplot(data=bike_df1,x="Seasons",y="Rented_Bike_Count",ax=ax,capsize=.2)
ax.set(title="Count of Rented bikes according to Seasons")
plt.show()

In [None]:
# Chart - 4 visualization code (hourly demand by season)
fig,ax=plt.subplots(figsize=(14,7))
sns.pointplot(data=bike_df1,x='Hour',y='Rented_Bike_Count',hue='Seasons',ax=ax)
ax.set(title='Count of Rented bikes according to Hour and Seasons')
plt.show()

##### 1. Why did you pick the specific chart?

To see how the use of rented bikes is distributed across the seasons.

##### 2. What is/are the insight(s) found from the chart?

The above bar plot and point plot show the use of rented bikes in the four seasons. They clearly show that:

1. In the summer season the use of rented bikes is high, with peak times at 7am-9am and 5pm-7pm.

2. In the winter season the use of rented bikes is very low, which we attribute to cold weather and snowfall.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In the summer, spring and autumn seasons the use of rented bikes is high, so it is profitable to do business in those seasons.
In summer we will get the maximum profit.

In the winter season the use of rented bikes is very low, which we attribute to cold weather and snowfall, so we will get the minimum profit margin.

#### Chart - 5 Analysis of Numerical variables

In [None]:
# Chart - 5 visualization code
bike_df1.head(2)

In [None]:
bike_df1.info()

In [None]:
#columns of float dtype are the numerical features
numerical_features=[col for col in bike_df1.columns if bike_df1[col].dtype=="float"]
numerical_features

In [None]:
# Separate dataframe for the numerical features
num_data=bike_df1[numerical_features]
num_data.head(5)

In [None]:
for col in numerical_features:
    fig = plt.figure(figsize=(6, 3))
    ax = fig.gca()
    feature = bike_df1[col]
    sns.histplot(feature, kde=True, ax=ax)    # histplot replaces distplot, which is deprecated in recent seaborn releases
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)    # mean
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)     # median
    ax.set_title(col)
    plt.show()

##### 1. Why did you pick the specific chart?

To check and analyse the distribution of each numerical variable in the dataset.

##### 2. What is/are the insight(s) found from the chart?

1. The mean rented bike count is about 650 and the maximum goes above 3000 bikes in an hour. The distribution is right-skewed.

2. The mean temperature is about 14 degrees Celsius and most values lie between 0 and 30 degrees Celsius.

3. The mean humidity is about 58%.

4. The mean wind speed over the year is about 1.7 m/s, which is a normal value.

5. Visibility was good for most observations and the mean visibility over the year was about 1700.

6. The mean dew point temperature is about 5 °C and most values lie between -5 and +25 °C.

7. The mean solar radiation is about 0.6 and for most observations solar radiation lies close to zero.

8. Most observations in the year were dry.

9. Most observations in the year had no snowfall.

In [None]:
bike_df1.select_dtypes(include='number').skew().to_frame('skew')   # skewness is only defined for the numeric columns

1. Right/Positive Skewed Distribution: Mode < Median < Mean: Rented Bike Count, Wind Speed, Solar Radiation, Rainfall(mm), Snowfall(cm)

2. Approximately No Skew: Mean ≈ Median ≈ Mode : Temperature, Humidity(%)

3. Left/Negative Skewed Distribution: Mean < Median < Mode: Visibility(10m), Dew point temperature
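
The link between the sign of the skewness and the order of mean and median can be seen on three small made-up series. This is only an illustrative sketch and the numbers have nothing to do with the bike data.

```python
import pandas as pd

toy = pd.DataFrame({
    # Long tail to the right: a few large values pull the mean up.
    "right_skewed": [0, 0, 0, 0, 1, 1, 2, 3, 10],
    # Mirror image: a few small values pull the mean down.
    "left_skewed": [10, 10, 10, 10, 9, 9, 8, 7, 0],
    # Symmetric around its centre.
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary)
```

A positive skew goes with mean > median, a negative skew with mean < median, and a skew near zero with mean and median close together.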

#### Chart - 6  Numerical vs.Rented_Bike_Count

In [None]:
# Chart - 6 visualization code
bike_df1.head(2)

In [None]:
#print the plot to analyze the relationship between "Rented_Bike_Count" and "Temperature"  
bike_df1.groupby("Temperature")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Humidity and Rented_Bike_Count
bike_df1.groupby("Humidity")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Wind_speed and Rented_Bike_Count
bike_df1.groupby("Wind_speed")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Visibility and Rented_Bike_Count
bike_df1.groupby("Visibility")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Dew_point_temperature and Rented_Bike_Count
bike_df1.groupby("Dew_point_temperature")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Solar_Radiation and Rented_Bike_Count
bike_df1.groupby("Solar_Radiation")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Rainfall and Rented_Bike_Count
bike_df1.groupby("Rainfall")["Rented_Bike_Count"].mean().plot()
plt.show()

In [None]:
#print the plot to analyze the relationship between Snowfall and Rented_Bike_Count
bike_df1.groupby("Snowfall")["Rented_Bike_Count"].mean().plot()
plt.show()

##### 1. Why did you pick the specific chart?

To establish the relation between each numerical variable and Rented_Bike_Count.

##### 2. What is/are the insight(s) found from the chart?

1. From the above plot we see that people like to ride bikes when it is fairly warm, around 25°C on average.

2. Demand for rented bikes is roughly uniform from 20% to 80% humidity. Above 80% humidity demand decreases, and below 20% humidity demand increases.


3. Demand for rented bikes is roughly uniform across wind speeds, with an increase around 7 m/s. This suggests that people do not mind riding when it is a little windy, although such high wind speeds are rare in the data, so this mean rests on few observations.

4. Demand for rented bikes is roughly uniform above a visibility of 500, and slightly lower below 500.

5. The plot for 'Dew_point_temperature' is almost the same as the plot for 'Temperature'. We check this similarity in a later step.


6. The number of rented bikes is high when there is solar radiation, with the rental count around 1000.


7. Even when it rains a lot, the mean demand for rented bikes does not always decrease. For example, around 20 mm of rain there is a big peak of rented bikes. Heavy rain is rare in the data, so such peaks are likely based on very few observations.

8. The number of rented bikes is very low when there is snow. With more than 4 cm of snow, bike rentals are much lower.
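
Grouping on every distinct value of a continuous variable can give noisy means, because some values occur only a few times. One way to make such plots easier to read is to bin the variable first and report the number of observations per bin. The sketch below uses synthetic data and an arbitrary 5-degree bin width; both are assumptions for illustration and not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: demand rises with temperature, plus random noise.
rng = np.random.default_rng(0)
temperature = rng.uniform(-15, 35, size=500).round(1)
count = np.clip(30 * (temperature + 15) + rng.normal(0, 100, size=500), 0, None)
toy = pd.DataFrame({"Temperature": temperature, "Rented_Bike_Count": count})

# Bin the temperature into 5-degree intervals: -15, -10, ..., 35.
bins = np.arange(-15, 40, 5)
toy["temp_bin"] = pd.cut(toy["Temperature"], bins=bins, include_lowest=True)

# Mean demand and number of observations per bin.
binned = (
    toy.groupby("temp_bin", observed=True)["Rented_Bike_Count"]
    .agg(["mean", "count"])
)
print(binned)
```

The `count` column shows how many observations support each mean, which helps to judge isolated peaks such as those at high wind speed or heavy rainfall.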

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

1. People like to ride bikes when it is fairly warm, around 25°C on average, so warm weather has a positive impact on the business.

2. Humidity between 20% and 80% goes with steady demand, which is positive for the business, while humidity above 80% goes with lower demand, which is negative for business growth.


3. Demand for rented bikes is roughly uniform across wind speeds and rises around 7 m/s, so wind does not appear to hurt the business.

4. Visibility below 500 goes with lower demand, which is negative for business growth.

6. The number of rented bikes is high when there is solar radiation, with the rental count around 1000. This is a positive insight for the business.

7. In the plot, mean demand does not fall when it rains and even peaks at some rainfall values. Heavy rain is rare in the data, so this should be read with caution before treating rain as positive for the business.

8. The number of rented bikes is very low when there is snow, and much lower with more than 4 cm of snow. It shows that the business is less profitable in winter, an insight that points to negative growth.

#### Chart - 7 percentage distribution of the value counts of the categorical features

In [None]:
# percentage distribution of the value counts of the categorical features
cols=['Month','Holiday','Seasons','Hour','weekdays_weekend']
n=1
plt.figure(figsize=(20,15))
for i in cols:
  plt.subplot(3,3,n)
  n=n+1
  plt.pie(bike_df1[i].value_counts(),labels = bike_df1[i].value_counts().keys().tolist(),autopct='%.0f%%')
  plt.title(i)
  plt.tight_layout()

#### Chart - 8 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
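plt.figure(figsize=(12,6))                                      # create the figure before drawing the heatmap on it
data_corr= bike_df1.select_dtypes(include='number').corr()      # correlation is computed on the numeric columns only
sns.heatmap(data_corr, cmap='bwr', linewidths=0.1, annot=True, linecolor='black')
plt.show()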

##### 1. Why did you pick the specific chart?

Correlation heatmaps can be used to find potential relationships between variables and to understand the strength of these relationships.

##### 2. What is/are the insight(s) found from the chart?

We can infer the following from the above correlation heatmap -

1. Temperature and Dew point temperature are highly correlated to each other.

2. We see a positive correlation between Rented bike count and temperature. Note that this is only true for the range of temperatures provided.

3. We see a negative correlation of rented bike count with Humidity, Rainfall and Snowfall. The higher the humidity, rainfall and snowfall, the less people prefer to bike.

4. Visibility has a weak dependence on Humidity.




#### Chart - 9- Pair Plot 

In [None]:
# Pair Plot visualization code
bike_df1.head(2)

In [None]:
sns.pairplot(bike_df1)   # pairplot creates its own figure, so no separate plt.figure call is needed
plt.show()

##### 1. Why did you pick the specific chart?

To understand which set of features best explains the relationship between two variables, or forms the most separated clusters.

##### 2. What is/are the insight(s) found from the chart?

We can infer the following from the above Pair Plot -

1. When Snowfall and Rainfall increase, the rented bike count decreases.

2. When Temperature increases, the rented bike count also increases.

3. There is a positive relation between the rented bike count and Visibility.

4. Visibility has a weak dependence on Humidity.

5. When Snowfall and Rainfall decrease, Solar Radiation increases.

## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
bike_df1.head(2)

In [None]:
# Handling Missing Values & Missing Value Imputation
missing = pd.DataFrame((bike_df1.isnull().mean())*100).reset_index()

In [None]:
missing

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,5))
ax = sns.pointplot(x='index',y=0,data=missing)
plt.xticks(rotation =90,fontsize =7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Temperature and Dew point temperature have a correlation of about 0.91, which creates a multicollinearity issue, so we drop the Dew point temperature feature.

In [None]:
#Drop Dew point temperature(°C) from dataset bike_df1
bike_df1.drop(columns=['Dew_point_temperature'],inplace=True)

As we can see above, there are no missing values in the dataset, so no imputation is needed.
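
The multicollinearity argument above can be made more concrete with the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the other features. The sketch below is not part of the original notebook: `vif_table` is a hypothetical helper written with NumPy only, and the data are synthetic, with the dew point built from a rough rule of thumb purely so that it is strongly tied to temperature and humidity.

```python
import numpy as np
import pandas as pd


def vif_table(df):
    """Return the VIF of every column, from a least-squares fit on the other columns."""
    X = df.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1 - residuals.var() / y.var()
        vifs[name] = 1 / (1 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic weather data (illustrative assumption, not the real dataset).
rng = np.random.default_rng(42)
temperature = rng.normal(13, 12, 1000)
humidity = rng.normal(58, 20, 1000)
dew_point = temperature - (100 - humidity) / 5 + rng.normal(0, 1, 1000)

toy = pd.DataFrame({
    "Temperature": temperature,
    "Humidity": humidity,
    "Dew_point_temperature": dew_point,
})

vif_with = vif_table(toy)
vif_without = vif_table(toy.drop(columns=["Dew_point_temperature"]))
print(vif_with)
print(vif_without)
```

In this toy setting the VIF values drop to about 1 once the dew point column is removed, which is the effect the drop above aims for. A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.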

### 2. Handling Outliers

In [None]:
#show 1st 2 row of dataset
bike_df1.head(2)

***Target Parameter Rented Bike Count distribution analysis***

In [None]:
# draw boxplot of numeric values of the dataset
plt.figure(figsize=(18, 18))
for i, col in enumerate(bike_df1.select_dtypes(include=["float"]).columns):
    ax = plt.subplot(4,2, i+1)
    sns.boxplot(data=bike_df1, x=col, ax=ax,color='r')
plt.suptitle('Box Plot of continuous variables')
plt.tight_layout()

In [None]:
#Work on a copy so that the original data stays unchanged
bike_df1_cap=bike_df1.copy()

In [None]:
#select the numeric features whose outliers will be capped
feature=bike_df1_cap.loc[:,['Rented_Bike_Count', 'Temperature', 'Humidity', 'Wind_speed','Visibility', 'Solar_Radiation']]

In [None]:
feature

In [None]:
#capping the outliers
def iqr_capping(df,cols,factor):                 # this function caps the outliers of the dataset with the IQR method
  for col in cols:
    q1=df[col].quantile(0.25)
    q3=df[col].quantile(0.75)

    iqr=q3-q1

    upper_whisker=q3+(factor*iqr)
    lower_whisker=q1-(factor*iqr)

    # values beyond a whisker are replaced by that whisker; no rows are removed
    df[col]=df[col].clip(lower=lower_whisker,upper=upper_whisker)

In [None]:
#call the function (iterating over the feature DataFrame yields its column names)
iqr_capping(bike_df1_cap,feature,1.5)

In [None]:
bike_df1_cap

In [None]:
# Final check for outliers in the capped columns
plt.figure(figsize=(18, 18))
for i, col in enumerate(['Rented_Bike_Count', 'Temperature', 'Humidity', 'Wind_speed','Visibility', 'Solar_Radiation']):
    plt.subplot(3,2, i+1)
    sns.boxplot(data=bike_df1_cap, x=col,color='r')
plt.suptitle('Box Plot of continuous variables')
plt.tight_layout()

In [None]:
bike_df1_cap.describe().T

***The box plots and summary statistics now show that the capped columns have no values outside the 1.5 × IQR whiskers.***

##### What all outlier treatment techniques have you used and why did you use those techniques?

***Interquartile Range Definition***


The interquartile range defines the difference between the third and the first quartile. Quartiles are the partitioned values that divide the whole series into 4 equal parts. So, there are 3 quartiles. First Quartile is denoted by Q1 known as the lower quartile, the second Quartile is denoted by Q2 and the third Quartile is denoted by Q3 known as the upper quartile. Therefore, the interquartile range is equal to the upper quartile minus lower quartile.

***Interquartile Range Formula***

The difference between the upper and lower quartile is known as the interquartile range. The formula for the interquartile range is given below.

***Interquartile range (IQR)*** = Upper Quartile - Lower Quartile = Q3 - Q1

* where Q1 is the first quartile and Q3 is the third quartile of the series.

* Median: In the box plot, the median is displayed rather than the mean.
* Q1: The first quartile (25%) position.
* Q3: The third quartile (75%) position.
* Interquartile range (IQR): a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles. It represents how 50% of the points were dispersed.
* Lower and upper 1.5*IQR whiskers: These represent the limits and boundaries for the outliers.
* Outliers: Defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR. Outliers are displayed as dots or circles.
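
The definitions above can be worked through on a tiny example. This is a sketch on made-up numbers, not on the bike data; it computes the quartiles, the IQR and the 1.5 × IQR limits, then caps the single extreme value at the upper limit.

```python
import numpy as np

# Nine made-up observations with one obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower = q1 - 1.5 * iqr                     # lower limit for outliers
upper = q3 + 1.5 * iqr                     # upper limit for outliers

# Capping: values beyond a limit are pulled back to that limit
capped = np.clip(values, lower, upper)

print(q1, q3, iqr, lower, upper)
print(capped)
```

Here Q1 = 3, Q3 = 7 and IQR = 4, so the limits are -3 and 13, and the value 100 is capped to 13 while every other observation is unchanged.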

### 3. Regression plot

In [None]:
# Regression plot of each continuous feature against the target, both taken from the capped dataframe
for col in bike_df1_cap.select_dtypes(include=["float"]).columns:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=bike_df1_cap[col],y=bike_df1_cap['Rented_Bike_Count'],scatter_kws={"color": 'orange'}, line_kws={"color": "black"})

#####  What is/are the insight(s) found from the chart?

***The regression plots above show that 'Temperature', 'Wind_speed', 'Visibility' and 'Solar_Radiation' are positively related to the target variable.***


* ***This means the rented bike count increases as these features increase.***
* ***'Rainfall', 'Snowfall' and 'Humidity' are negatively related to the target variable, which means the rented bike count decreases when these features increase.***

### 4. Data Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler=MinMaxScaler()

In [None]:
numeric_bike_df1=bike_df1_cap.iloc[:,2:9]

In [None]:
model=scaler.fit(numeric_bike_df1)

In [None]:
scaled_df=pd.DataFrame(model.transform(numeric_bike_df1),columns=numeric_bike_df1.columns)

In [None]:
scaled_df

In [None]:
for col in scaled_df :
    fig = plt.figure(figsize=(6, 3))
    ax = fig.gca()
    feature = scaled_df[col]
    sns.distplot(feature)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

In [None]:
#Applying square root to Rented Bike Count to improve skewness
bike_df1_cap['Rented_Bike_Count']=(np.sqrt(bike_df1_cap['Rented_Bike_Count']))

In [None]:
# Distribution of Rented Bike Count after the square root transform
# (the column was already transformed in the previous cell, so np.sqrt is not applied a second time)
plt.figure(figsize=(10,8))

ax=sns.distplot(bike_df1_cap['Rented_Bike_Count'], color="y")
ax.axvline(bike_df1_cap['Rented_Bike_Count'].mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(bike_df1_cap['Rented_Bike_Count'].median(), color='black', linestyle='dashed', linewidth=2)
plt.xlabel('Rented Bike Count (square root)')
plt.ylabel('Density')

plt.show()

In [None]:
plt.figure(figsize=(10,6))

# The column already holds square-root values, so it is plotted directly
sns.boxplot(x=bike_df1_cap['Rented_Bike_Count'])
plt.xlabel('Rented_Bike_Count (square root)')
plt.show()

In [None]:
bike_df1_cap.head(2)

In [None]:
scaled_bike_df=pd.concat([bike_df1_cap.loc[:,["Rented_Bike_Count","Hour","Seasons","Holiday","Month","weekdays_weekend"]],scaled_df],axis=1)

In [None]:
scaled_bike_df.shape

In [None]:
scaled_bike_df.sample(5)

##### Which method have you used to scale your data and why?

* We applied MinMaxScaler to the continuous numeric columns selected with bike_df1_cap.iloc[:,2:9]. MinMaxScaler rescales each feature to the range 0 to 1 so that features measured in different units become comparable; it does not change the shape of a distribution. The columns "Rented_Bike_Count", "Hour", "Seasons", "Holiday", "Month" and "weekdays_weekend" were left unscaled and concatenated with the scaled features.


* A square root transform is a common remedy for a right-skewed variable. After applying it to the skewed Rented Bike Count, we get an almost normal distribution.
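
The difference between the two steps can be checked numerically. The sketch below uses a synthetic right-skewed series (a gamma sample standing in for a count variable, which is an assumption and not the bike data) and compares the skewness before scaling, after MinMaxScaler, and after a square root transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic right-skewed values standing in for a rental count
rng = np.random.default_rng(42)
counts = pd.Series(rng.gamma(shape=2.0, scale=300.0, size=5000))

skew_raw = counts.skew()

# Min-max scaling is a linear rescaling to [0, 1], so skewness stays the same
scaled = MinMaxScaler().fit_transform(counts.to_frame()).ravel()
skew_minmax = pd.Series(scaled).skew()

# The square root compresses large values more than small ones, reducing right skew
skew_sqrt = np.sqrt(counts).skew()

print(round(skew_raw, 3), round(skew_minmax, 3), round(skew_sqrt, 3))
```

On data like this, the first two numbers agree and the third is noticeably smaller, which is why the scaler handles units and the square root handles skew.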

### 5. Categorical Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
categorical_features=list(scaled_bike_df.select_dtypes(['category']).columns)
categorical_features=pd.Index(categorical_features)
categorical_features

In [None]:
# Take a real copy; plain assignment would only give the same DataFrame a second name
scaled_bike_df_copy = scaled_bike_df.copy()

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

for col in categorical_features:
    scaled_bike_df_copy = one_hot_encoding(scaled_bike_df_copy, col)
scaled_bike_df_copy.head()      

In [None]:
scaled_bike_df_copy.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

***Here I used both OrdinalEncoder on the 'Seasons' feature and one-hot encoding on the 'Hour', 'Holiday', 'Month' and 'weekdays_weekend' features.***

* OrdinalEncoder is used when the variables in the data are ordinal. Ordinal encoding converts each label into an integer value, and the encoded data represents the sequence of the labels.

* In one-hot encoding, each category of a categorical variable gets a new variable, which maps that category to a binary value (0 or 1). This type of encoding is used when the data is nominal. The newly created binary features can be considered dummy variables. The number of dummy variables depends on the number of categories present in the data. In the cell above, the one-hot step is done with pd.get_dummies and drop_first=True, so a variable with k categories produces k-1 dummy columns.
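
As a small illustration of the one-hot step, the sketch below encodes a toy 'Seasons' column (made-up rows, not the actual dataset) with and without `drop_first=True` to show how many dummy columns each option produces.

```python
import pandas as pd

# Toy categorical column with four categories
demo = pd.DataFrame({
    "Seasons": pd.Categorical(["Spring", "Summer", "Autumn", "Winter", "Summer"])
})

# One dummy column per category
full = pd.get_dummies(demo["Seasons"], prefix="Seasons")

# drop_first=True removes the first category (alphabetically 'Autumn' here),
# which avoids perfectly collinear dummies in a linear model
reduced = pd.get_dummies(demo["Seasons"], prefix="Seasons", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category becomes the baseline: a row with zeros in every remaining dummy column belongs to it.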

### 6. Data Splitting

In [None]:
#import Library
from sklearn.model_selection import train_test_split, GridSearchCV,  cross_val_score

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
scaled_bike_df_copy.head(2)

In [None]:
x_train,x_test,y_train,y_test= train_test_split(scaled_bike_df_copy.drop(["Rented_Bike_Count"],axis=1),scaled_bike_df_copy["Rented_Bike_Count"],test_size=0.2,random_state=42)

In [None]:
x_train.head(2)

In [None]:
x_test.tail(2)

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

##### What data splitting ratio have you used and why? 

A splitting ratio has to be chosen before the data can be split. A commonly used ratio is 80:20, meaning 80% of the data is used for training and 20% for testing, and that is the ratio used here (test_size=0.2). Other ratios such as 70:30, 60:40 and even 50:50 are also used in practice. There does not seem to be clear guidance on which ratio is optimal for a given dataset. The 80:20 split is often justified by the well-known Pareto principle, but that is only a rule of thumb used by practitioners.
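
A quick sketch of what an 80:20 split produces, using a dummy array of 1000 rows rather than the bike data. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy feature matrix (1000 rows, 2 columns) and target
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

# test_size=0.2 keeps 20% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With 1000 rows this gives 800 training rows and 200 test rows, and every row ends up in exactly one of the two sets.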

## ***6. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
#import the packages
from sklearn.linear_model import LinearRegression

# Fit the Algorithm
reg= LinearRegression().fit(x_train, y_train)

In [None]:
#check the score
reg.score(x_train, y_train)

In [None]:
reg.coef_

In [None]:
# Predict on the train and test sets
y_pred_train=reg.predict(x_train)
y_pred_test=reg.predict(x_test)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Training set metrics
#calculate MSE
MSE_lr= mean_squared_error(y_train, y_pred_train)
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

#calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)

#calculate r2 and adjusted r2 (n and p are taken from the training set, which is the set being scored)
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
Adjusted_R2_lr = 1-(1-r2_lr)*((x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_lr)


**The R2 score on the training set is 0.77, which means the model captures most of the variance in the data. Let's save it in a dataframe for later comparison.**

In [None]:
# storing the training set metrics value in a dataframe for later comparison
dict1={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
training_df=pd.DataFrame(dict1,index=[1])

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Test set metrics
#calculate MSE
MSE_lr= mean_squared_error(y_test, y_pred_test)
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)

#calculate MAE
MAE_lr= mean_absolute_error(y_test, y_pred_test)
print("MAE :",MAE_lr)

#calculate r2 and adjusted r2 (x_test, in lower case, is the name created by train_test_split)
r2_lr= r2_score(y_test, y_pred_test)
print("R2 :",r2_lr)
Adjusted_R2_lr = 1-(1-r2_lr)*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_lr )
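
The adjusted R2 computed in the cells above follows the formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features of the set being scored. A small helper can keep n and p tied to the right split. The function name `adjusted_r2` and the toy numbers below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n = len(y_true) and p = n_features."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Toy targets and predictions (made-up numbers)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0])

r2_demo = r2_score(y_true_demo, y_pred_demo)
adj_demo = adjusted_r2(y_true_demo, y_pred_demo, n_features=2)

print(r2_demo, adj_demo)
```

In this toy case the residual sum of squares is 1 and the total sum of squares is 70, so R2 = 1 - 1/70 and the adjusted value with n = 6 and p = 2 is 1 - (1/70)(5/3), slightly lower than R2.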


**The r2_score for the test set is 0.78. This means our linear model is performing well on the data. Let us visualize the residuals and check for heteroscedasticity (unequal variance or scatter).**




In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Linear regression ',
       'MAE':round((MAE_lr),3),
       'MSE':round((MSE_lr),3),
       'RMSE':round((RMSE_lr),3),
       'R2_score':round((r2_lr),3),
       'Adjusted R2':round((Adjusted_R2_lr ),2)
       }
test_df=pd.DataFrame(dict2,index=[1])

In [None]:
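# Heteroscedasticity check: residuals plotted against the predicted values
plt.scatter((y_pred_test),(y_test)-(y_pred_test))
plt.xlabel('Predicted value')
plt.ylabel('Residual')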

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(y_pred_test)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***7.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
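
One possible way to carry out the save and reload steps is sketched below with Python's built-in pickle module. The toy LinearRegression model and the file name `bike_model.pkl` are stand-ins chosen for illustration; they are not the notebook's final model or an agreed file name. The sketch writes the model to a temporary folder, loads it back, and confirms that both objects give the same predictions on unseen inputs.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for the best performing model (y = 2x + 1)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_toy, y_toy)

# Save the fitted model to a file (hypothetical file name in a temporary folder)
path = os.path.join(tempfile.mkdtemp(), "bike_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on inputs the model has not seen
with open(path, "rb") as f:
    restored = pickle.load(f)

unseen = np.array([[4.0], [5.0]])
same = np.allclose(model.predict(unseen), restored.predict(unseen))
print(restored.predict(unseen), same)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should not be opened.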

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***