
# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>

## <b> Problem Description </b>

### Rental bikes have been introduced in many cities to improve mobility and comfort. Making bikes available and accessible to the public at the right time matters because it reduces waiting time, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the number of bikes required at each hour to keep that supply stable.


## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date - day/month/year (dd/mm/yyyy in the raw file)
* ### Rented Bike Count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature - Temperature in Celsius
* ### Humidity - %
* ### Wind speed - m/s
* ### Visibility - in units of 10 m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc (Non Functional Hours), Fun (Functional hours)

**What has been given to us? (Input)**
- We have been given a dataset of rental records for Seoul bikes.
- The dataset contains independent variables such as date, hour, weather conditions and holiday status, along with the number of bikes rented in that hour of the day.

**What do we need to achieve? (Required Output)**
- Build a model trained on the given dataset that can predict the required bike count as closely as possible for a given set of independent variables.

**How will we approach it? (Process to follow)**
- Import the required libraries and the dataset
- Based on initial observations, handle null values and perform EDA on the data
- Plot each variable individually and check its distribution
- Plot each variable against the target variable and check the correlation
- Prepare the data for the ML models
- Prepare train and test datasets and fit them with different machine learning algorithms
- Derive evaluation metrics for each ML model and compare them
- Conclude which model performs best at predicting the rented bike count
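
The modelling steps in this list can be sketched end to end as follows. This is only a minimal illustration on synthetic data: the column names, the made-up target formula and the hyperparameters are assumptions chosen for the example, not the notebook's actual features or settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temperature": rng.normal(13, 12, n),
    "humidity": rng.uniform(0, 100, n),
})
# Made-up target: warmer and drier hours get more rentals, plus noise.
y = 500 + 20 * X["temperature"] - 3 * X["humidity"] + rng.normal(0, 50, n)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every model on the same split and collect comparable metrics.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

print(pd.DataFrame(results).T)
```

Evaluating every model on the same held-out split keeps the R2 and RMSE values directly comparable across models.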

In [1]:
# Import the libraries for processing data
import pandas as pd
import numpy as np
import math  # `from numpy import math` fails on NumPy 2.0+, where the alias was removed

# Import the libraries for plotting data
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split # For preparing our train and test datasets
from sklearn.preprocessing import minmax_scale # For scaling our dataset

# Import the regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Import the performance metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [2]:
# Mount driver
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Read the CSV file from Drive
# The file contains special characters (such as the degree sign in the column names), so it is read with Latin-1 encoding
df=pd.read_csv('/content/drive/MyDrive/Almabetter/Capstone Project-2/SeoulBikeData.csv',encoding='latin1') 
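
To see why the encoding matters, here is a small self-contained sketch. It assumes the special characters in question are Latin-1 bytes such as the degree sign in the column names; the two-line CSV below is a hypothetical stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical two-line CSV whose header contains the degree sign, stored as Latin-1 bytes.
raw = "Temperature(°C),Hour\n-5.2,0\n".encode("latin1")

# In Latin-1 the degree sign is the single byte 0xB0, which is not valid UTF-8 on its own.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the real encoding recovers the header correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(utf8_ok, list(sample.columns))
```

Passing the encoding explicitly avoids depending on the UTF-8 default, which rejects these bytes.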

In [4]:
# Let's check what our dataset looks like
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature(°C)            8760 non-null   float64
 4   Humidity(%)                8760 non-null   int64  
 5   Wind speed (m/s)           8760 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature(°C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8760 non-null   float64
 10  Snowfall (cm)              8760 non-null   float64
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functioning Day            8760 non-null   object 

In [8]:
# Let's verify if there is any column with null values
df.isna().sum()

Date                         0
Rented Bike Count            0
Hour                         0
Temperature(°C)              0
Humidity(%)                  0
Wind speed (m/s)             0
Visibility (10m)             0
Dew point temperature(°C)    0
Solar Radiation (MJ/m2)      0
Rainfall(mm)                 0
Snowfall (cm)                0
Seasons                      0
Holiday                      0
Functioning Day              0
dtype: int64

In [6]:
# Let's check the statistics of columns
df.describe()

Unnamed: 0,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm)
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,704.602055,11.5,12.882922,58.226256,1.724909,1436.825799,4.073813,0.569111,0.148687,0.075068
std,644.997468,6.922582,11.944825,20.362413,1.0363,608.298712,13.060369,0.868746,1.128193,0.436746
min,0.0,0.0,-17.8,0.0,0.0,27.0,-30.6,0.0,0.0,0.0
25%,191.0,5.75,3.5,42.0,0.9,940.0,-4.7,0.0,0.0,0.0
50%,504.5,11.5,13.7,57.0,1.5,1698.0,5.1,0.01,0.0,0.0
75%,1065.25,17.25,22.5,74.0,2.3,2000.0,14.8,0.93,0.0,0.0
max,3556.0,23.0,39.4,98.0,7.4,2000.0,27.2,3.52,35.0,8.8


**Initial observation** 
- The given dataset contains 8760 rows and 14 columns.
- No column contains null values.
- Every column already has a suitable data type except "Date", which is stored as object, so only "Date" needs a type conversion.
- Of the 14 columns, 10 hold numeric data and 4 ("Date", "Seasons", "Holiday", "Functioning Day") hold object (string) data.
- Our dependent/target variable is "Rented Bike Count"; the rest are independent variables.
- The target variable is not normally distributed: its mean (about 705) exceeds its median (504.5) by roughly 200, which points to a right-skewed distribution. (We'll verify this by plotting the chart.)
- The target variable is numeric, so we will treat this as a regression problem.
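
The mean-versus-median gap can be turned into a rough skewness figure. The sketch below uses Pearson's second skewness coefficient, which is a swapped-in rule of thumb rather than something the notebook computes, with the values copied from the `df.describe()` output above.

```python
# Values copied from the df.describe() output for "Rented Bike Count".
mean, median, std = 704.602055, 504.5, 644.997468

# Pearson's second skewness coefficient: 3 * (mean - median) / std.
# A clearly positive value indicates a right-skewed distribution.
pearson_skew = 3 * (mean - median) / std
print(round(pearson_skew, 2))
```

The result is about 0.93, which is consistent with a right-skewed target. On the real frame, `df['Rented Bike Count'].skew()` would give the sample skewness directly.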

# Deal with the "Date" Variable

In [9]:
# First, convert the Date column from object to datetime
# The raw strings are day-first (01/12/2017 is 1 December 2017), so the format is stated explicitly
df['Date']=pd.to_datetime(df['Date'], format='%d/%m/%Y')

In [11]:
df['Date'].dtypes

dtype('<M8[ns]')

In [19]:
import datetime
# Create three different columns (year, month and day) from the Date column
df['year']=df['Date'].dt.year
df['month']=df['Date'].dt.month_name()
df['day']=df['Date'].dt.day

In [21]:
# We do not require the Date column now, so let's drop it
df.drop('Date', axis='columns', inplace=True)

In [22]:
# Verify our updated dataset
df.head()

Unnamed: 0,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day,year,month,day
0,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes,2017,December,1
1,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes,2017,December,1
2,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes,2017,December,1
3,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes,2017,December,1
4,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes,2017,December,1
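
Date strings such as 01/12/2017 are ambiguous, and the derived month and day depend on which convention the parser applies. The sketch below assumes day-first strings, which is what the Winter label on the 01/12/2017 rows suggests; the two sample strings are illustrative only.

```python
import pandas as pd

# Sample day-first strings (illustrative, not read from the file).
dates = pd.Series(["01/12/2017", "13/12/2017"])

# An explicit format removes the ambiguity between day-first and month-first.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month_name(),
    "day": parsed.dt.day,
})
print(features)

# Reading the first string month-first yields a different calendar date.
month_first = pd.to_datetime("01/12/2017", format="%m/%d/%Y")
print(month_first.month_name(), month_first.day)
```

With the day-first format both strings fall in December, while the month-first reading of the first string gives January 12.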


# **Let's Start Basic EDA**

In [29]:
# Season-wise analysis: total rentals per season, highest first
df.groupby('Seasons')['Rented Bike Count'].sum().reset_index().sort_values(by='Rented Bike Count',ascending=False)

Unnamed: 0,Seasons,Rented Bike Count
2,Summer,2283234
0,Autumn,1790002
1,Spring,1611909
3,Winter,487169
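
The seasonal totals are easier to compare as shares of all rentals. A minimal sketch using the four totals from the output above (the variable names are illustrative):

```python
import pandas as pd

# Seasonal totals copied from the groupby output above.
season_totals = pd.Series(
    {"Summer": 2283234, "Autumn": 1790002, "Spring": 1611909, "Winter": 487169},
    name="Rented Bike Count",
)

# Share of all rentals contributed by each season, in percent.
share_pct = (season_totals / season_totals.sum() * 100).round(1)
print(share_pct)
```

Summer accounts for roughly 37% of all rentals and winter for about 8%.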
