<a href="https://colab.research.google.com/github/singhdiwakar020/Data-Analytis-Projects/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# **Project Summary -**

Bike-sharing demand prediction is the process of estimating how many bikes will be rented from a bike-sharing system at a given time. The prediction is based on factors such as the day of the week, the weather, and the time of day.

Factors that can affect bike-sharing demand:

*   Day of the week: Registered users tend to demand more bikes on weekdays than on weekends or holidays.
*   Weather: Demand for bikes is lower on rainy days and when humidity is higher.
*   Time of day: The number of bikes required at each hour is crucial for a stable supply of rental bikes.

Datasets for bike-sharing demand prediction typically include:

*   Weather information, such as temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall
*   The number of bikes rented per hour
*   Date information


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that the supply stays stable.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following points are covered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
#data visualization libraries(matplotlib,seaborn, plotly)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("ticks")
sns.set_context("poster");
import plotly.express as px
from scipy.stats import norm

# Importing numpy, pandas and tensorflow
import pandas as pd
import numpy as np
import tensorflow as tf

# Z score
# from scipy import stats          # was using to detect outliers


# Datetime library for manipulating Date columns.
from datetime import datetime
import calendar

# Scaling and label-encoding utilities from scikit-learn, used to turn raw
# feature vectors into a representation that is more suitable for the
# downstream estimators.
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.preprocessing import  LabelEncoder

# Importing various machine learning models.
from sklearn.linear_model import Lasso, Ridge, LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import neighbors
from lightgbm import LGBMRegressor
import lightgbm


# XGB regressor.
from xgboost import XGBRegressor

# Calculating VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Splitting data
from sklearn.model_selection import train_test_split

# For hyperparameter optimization
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Import different metrics from sci-kit libraries for model evaluation.
from sklearn import metrics
from sklearn.metrics import r2_score as r2
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error


# The following lines adjust the granularity of reporting.
pd.options.display.max_rows = 50
pd.options.display.float_format = "{:.3f}".format

# Importing warnings library. The warnings module handles warnings in Python.
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading
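The load cell in this section passes `encoding='ISO-8859-1'` directly, although its comment mentions a detected encoding. As a minimal sketch of one dependency-free way to choose an encoding, the hypothetical helper `pick_encoding` (not part of the original notebook) tries a short, assumed list of candidates and returns the first one that decodes the raw bytes cleanly. ISO-8859-1 accepts any byte sequence, so it only makes sense as the last fallback.

```python
def pick_encoding(raw: bytes, candidates=("utf-8", "ISO-8859-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# A degree sign encoded as Latin-1 is not valid UTF-8, so the fallback is chosen.
latin1_header = "Temperature(°C)".encode("ISO-8859-1")
utf8_header = "Temperature(°C)".encode("utf-8")

print(pick_encoding(latin1_header))
print(pick_encoding(utf8_header))
```

In the notebook, the raw bytes could come from reading the CSV file in binary mode, and the result could then be passed to the `encoding` argument of `pd.read_csv`.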

In [None]:
# Load Dataset
import chardet

In [None]:


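# Read the CSV with an explicit ISO-8859-1 encoding (no detection is run here;
# the column names contain non-ASCII characters such as the degree sign)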
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SeoulBikeData.csv',encoding='ISO-8859-1')


### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
# Count the number of rows and columns
num_rows, num_columns = df.shape

print(f"The DataFrame has {num_rows} rows and {num_columns} columns.")

### Dataset Information

In [None]:
# Dataset Info

In [None]:
df.info()

In [None]:
df.describe(include='all')

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count duplicate rows
num_duplicates = df.duplicated().sum()

print(f"There are {num_duplicates} duplicate rows in the DataFrame.")

#### Missing Values/Null Values

In [None]:
# Count missing values in each column
missing_values = df.isnull().sum()


In [None]:
print("Missing Values in Each Column:")
print(missing_values)

In [None]:
# Visualizing the missing values

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a heatmap to visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False, yticklabels=False)

plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

*   The dataset has 14 features and 8760 rows.
*   There are 4 categorical columns and 10 numerical columns.
*   Columns 'Date', 'Seasons', 'Holiday' and 'Functioning Day' are of `object` data type.
*   Columns 'Rented Bike Count', 'Hour', 'Humidity (%)' and 'Visibility (10m)' are of `int64` data type.
*   Columns 'Temperature (℃)', 'Wind Speed (m/s)', 'Dew Point Temperature (℃)', 'Solar Radiation (MJ/m2)', 'Rainfall (mm)' and 'Snowfall (cm)' are of `float64` data type.
*   No column contains null values.
*   Unique counts: Seasons 4, Holiday 2, Functioning Day 2.
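
Counts like these are easy to get wrong by hand, so it may help to derive them from the frame itself. The following is a small sketch on a made-up stand-in frame (the column names mirror a few of the dataset's columns, the values are invented for illustration); the same calls should work on `df`.

```python
import pandas as pd

# Tiny stand-in frame; values are made up for illustration only.
sample = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017"],
    "Rented Bike Count": [254, 204],
    "Temperature(°C)": [-5.2, -5.5],
    "Seasons": ["Winter", "Winter"],
    "Holiday": ["No Holiday", "No Holiday"],
})

# Split columns by dtype instead of counting them by hand.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
print("Missing values in total:", int(sample.isnull().sum().sum()))
```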

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe()

## Variables Description

*The ranges of values in the numerical columns look reasonable, so little data cleaning may be needed. However, the "Wind speed", "Dew point temperature(°C)", "Solar Radiation", "Rainfall" and "Snowfall" columns appear to be significantly skewed, as the median (50th percentile) is much lower than the maximum value.*

---






### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Display unique values for each variable
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for {column}:\n{unique_values}\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Display basic information about the DataFrame
print("Initial DataFrame Info:")
df.info()  # info() prints its report itself and returns None

# Display summary statistics of numeric columns
print("\nSummary Statistics:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check for duplicate rows
print("\nDuplicate Rows:")
print(df.duplicated().sum())

# Handle missing values (if needed)
# Example: Drop rows with missing values
df = df.dropna()

# Handle duplicate rows (if needed)
# Example: Drop duplicate rows
df = df.drop_duplicates()

# Convert data types (if needed)
# Convert 'Date' to datetime; the raw dates are assumed to be day-first (DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Additional data cleaning and transformation steps can be added here

# Display final information about the cleaned DataFrame
print("\nFinal DataFrame Info:")
df.info()  # info() prints its report itself and returns None

In [None]:
# Create separate columns for weekday, day, month and year from the 'Date' column.

df['Weekday'] = df['Date'].dt.day_name()
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year



In [None]:
df['Weekday']

In [None]:
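# Preview the frame without 'Date'. The result is not assigned, so df keeps
# the column, which is used again before encoding.
df.drop('Date',axis=1)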

In [None]:
df.head()

### What all manipulations have you done and insights you found?

I performed data cleaning and a basic statistical analysis. The dataset contains no null values and no duplicate rows. I also converted the 'Date' column from text to a proper datetime type.

Describing the data gives informative summary values for each column, such as the minimum, the maximum, and the 25th and 50th percentiles.
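
Date parsing deserves a quick check, because a string like `01/12/2017` is ambiguous between 1 December and 12 January. The sketch below assumes the raw dates are in DD/MM/YYYY form, which should be verified against the CSV file; the two example strings are invented for illustration.

```python
import pandas as pd

# Assumed example strings in DD/MM/YYYY form (1 and 2 December 2017).
raw_dates = pd.Series(["01/12/2017", "02/12/2017"])

# An explicit format removes any month/day ambiguity.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed.dt.month.tolist())       # month component of each date
print(parsed.dt.day_name().tolist())  # weekday name of each date
```

If the months or weekday names derived this way look implausible for the data, the format assumption is probably wrong.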

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between all numerical variables in a single figure.

##### 2. What is/are the insight(s) found from the chart?

The insights we found here is that

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(10,7))

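# Sum only the target column; summing every column fails on datetime/text columns in recent pandas
Month = df.groupby('Month')['Rented Bike Count'].sum().reset_index()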

sns.barplot(x='Month', y ='Rented Bike Count', data= Month)





##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(16,10))

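# Sum only the target column; summing every column fails on datetime/text columns in recent pandas
Day = df.groupby('Day')['Rented Bike Count'].sum().reset_index()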

sns.barplot(x='Day',y= 'Rented Bike Count', data=Day)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(13,7))
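# Sum only the target column; summing every column fails on datetime/text columns in recent pandas
Hour = df.groupby("Hour")["Rented Bike Count"].sum().reset_index()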
sns.barplot(x="Hour", y="Rented Bike Count", data=Hour)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(10,7))

sns.barplot(x='Holiday',y='Rented Bike Count',data=df)

In [None]:
plt.figure(figsize=(10,7))
sns.barplot(x="Seasons", y="Rented Bike Count", data=df)

In [None]:
plt.figure(figsize=(40,7))
sns.barplot(x="Rainfall(mm)", y="Rented Bike Count", data=df)

In [None]:
plt.figure(figsize=(40,7))

sns.barplot(x="Snowfall (cm)", y="Rented Bike Count", data=df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# displot is figure-level and creates its own figure, so size it via height/aspect
sns.displot(df['Rented Bike Count'], height=7, aspect=2)

In [None]:
sns.displot(np.sqrt(df["Rented Bike Count"]))

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

##   Skewed Data
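
The two distribution plots in Chart 6 compare the raw target with its square root. As a rough numerical companion, the sketch below measures skewness before and after a square-root transform on synthetic right-skewed data; the exponential sample is only a stand-in for 'Rented Bike Count', not the real column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for 'Rented Bike Count'.
counts = pd.Series(rng.exponential(scale=700.0, size=5000))

skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()

print(f"skewness before sqrt: {skew_raw:.2f}")
print(f"skewness after sqrt:  {skew_sqrt:.2f}")
```

A value closer to zero after the transform suggests a more symmetric distribution, which can help linear models in particular.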

In [None]:
# Restrict to numeric columns; text and datetime columns have no skewness
df.skew(numeric_only=True).sort_values(ascending=True)

# Remove Multicollinearity

In [None]:
# Correlation heatmap of the numeric columns, used to spot multicollinearity

plt.figure(figsize=(25,25))

sns.heatmap(df.corr(numeric_only=True),annot=True, cmap = 'coolwarm')

## **Variance Inflation Factor (VIF)**

VIF indicates which features are affected by multicollinearity: the higher a feature's VIF, the better it can be predicted from the other features.
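
To make the quantity concrete, here is a NumPy-only sketch of the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. The function name `vif_scores` and the synthetic columns are illustrative assumptions. This version fits an intercept in each regression, which may differ from how the notebook's `get_vif` helper behaves, so absolute numbers need not match.

```python
import numpy as np

def vif_scores(X: np.ndarray) -> np.ndarray:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of X.

    R_j^2 is obtained by regressing column j on the remaining columns
    plus an intercept. Illustrative helper, not the notebook's get_vif.
    """
    n_samples, n_features = X.shape
    scores = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        scores[j] = 1.0 / (1.0 - r_squared)
    return scores

rng = np.random.default_rng(42)
temperature = rng.normal(size=500)
dew_point = temperature + rng.normal(scale=0.1, size=500)  # nearly collinear
wind = rng.normal(size=500)                                # independent

scores = vif_scores(np.column_stack([temperature, dew_point, wind]))
print(scores.round(2))
```

The two nearly collinear columns should receive large scores, while the independent one stays close to 1.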

In [None]:
def get_vif(df):
    vif = pd.DataFrame()
    vif["variables"] = df.columns
    vif["VIF"] = [ variance_inflation_factor(df.values, i) for i in range(df.shape[1])]

    return vif

In [None]:
not_for_vif = [ "Day", "Month", "Year", "Rented Bike Count"]

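# Use numeric columns only; VIF cannot be computed on datetime or text columns
get_vif(df[[i for i in df.select_dtypes(include="number").columns if i not in not_for_vif]])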

'Dew point temperature(°C)' shows high collinearity with 'Temperature', so only one of the two should be kept. We keep 'Temperature' because its correlation with the dependent variable 'Rented Bike Count' is higher than that of 'Dew point temperature(°C)'.

In [None]:
not_for_vif = [ "Day", "Month", "Year", "Rented Bike Count", "Dew point temperature(°C)"]

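# Use numeric columns only; VIF cannot be computed on datetime or text columns
get_vif(df[[i for i in df.select_dtypes(include="number").columns if i not in not_for_vif]])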

In [None]:
df.drop(["Dew point temperature(°C)"], axis=1, inplace=True)

# **Encoding**

In [None]:
df.info()

### Categorical Features

In [None]:
cat_features = ["Seasons", "Holiday", "Functioning Day", "weekday"]

### Value counts of each categorical feature

In [None]:
df["Holiday"].value_counts()

In [None]:
df["Functioning Day"].value_counts()

In [None]:
df["Seasons"].value_counts()

In [None]:
df.columns

In [None]:
df["weekday"] = pd.to_datetime(df["Date"]).dt.day_name()

In [None]:
df["weekday"].value_counts()

In [None]:
df.drop("Date", axis=1, inplace=True)

### Map the binary categorical values to 0 and 1.
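
One thing to watch with `Series.map` and a dictionary: labels that are missing from the dictionary silently become `NaN`. The sketch below shows a simple guard on a made-up series; the last label is a deliberate typo added for illustration and is not claimed to occur in the dataset.

```python
import pandas as pd

# Made-up labels; "holiday" (lowercase) is a deliberate typo for the demo.
holiday = pd.Series(["No Holiday", "Holiday", "No Holiday", "holiday"])

mapping = {"No Holiday": 0, "Holiday": 1}
encoded = holiday.map(mapping)

# Labels missing from the mapping turn into NaN instead of raising an error.
unmapped = holiday[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list here means every label was covered by the mapping.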

In [None]:
df["Holiday"] = df["Holiday"].map({"No Holiday":0, "Holiday":1})
df["Functioning Day"] = df["Functioning Day"].map({"No":0, "Yes":1})

### Take the dummies of these variables
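
As a small illustration of what `drop_first=True` does, the sketch below one-hot encodes a short made-up 'Seasons' series. The first category in sorted order is dropped and becomes the implicit baseline, which avoids a redundant, perfectly collinear column.

```python
import pandas as pd

# Made-up values using the four season labels.
seasons = pd.Series(["Winter", "Spring", "Summer", "Autumn", "Winter"], name="Seasons")

# "Autumn" sorts first, so it is dropped and is represented by all-zero dummies.
dummies = pd.get_dummies(seasons, drop_first=True)

print(dummies.columns.tolist())
print(dummies.astype(int))
```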

In [None]:
df_season = pd.get_dummies(df["Seasons"], drop_first = True)
df_weekday = pd.get_dummies(df["weekday"], drop_first = True)

In [None]:
df.info()

## Now join the dummy variables to the main data frame.

In [None]:
df = pd.concat([df, df_season, df_weekday], axis=1)

In [None]:
df.info()

In [None]:
df.drop(["Seasons", "weekday"], axis=1, inplace=True)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.columns

In [None]:

df.shape

In [None]:
df.drop('Weekday',axis=1,inplace=True)

In [None]:

df.shape

# **Split the Data for Training and Testing**

---



In [None]:
X = df.drop("Rented Bike Count", axis=1)
y = df["Rented Bike Count"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)

print("Shape of X_train : ", X_train.shape)
print("Shape of y_train : ", y_train.shape)
print("Shape of X_test : ", X_test.shape)
print("Shape of y_test : ", y_test.shape)

## **Scaling**
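
The scaler should learn its mean and standard deviation from the training split only and then reuse them on the test split, so that no information from the test data leaks into preprocessing. A minimal sketch of that pattern on synthetic features (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=2023)

scaler = StandardScaler().fit(X_train)      # statistics come from the training split only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuses the training mean and std

print(X_train_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # close to, but not exactly, zero
```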

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()
sc.fit(X_train)

X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X_train[:2]

In [None]:
sc.mean_

In [None]:
sc.scale_

# Training ML Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)

In [None]:
y_pred = lr.predict(X_test)

In [None]:
y_pred

# **Model Evaluation**

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)

print(f"MSE : {MSE}")
print(f"RMSE : {RMSE}")
print(f"MAE : {MAE}")
print(f"R2 : {R2}")

### Define a reusable function that reports mean_squared_error, mean_absolute_error and r2_score for any model.
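
To make explicit what these metrics compute, here is a NumPy-only sketch of the standard formulas. The name `regression_metrics` and the four sample values are illustrative assumptions; the notebook itself relies on the scikit-learn functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Return MSE, RMSE, MAE and R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))                    # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(errors))),
        "R2": 1.0 - ss_res / ss_tot,
    }

scores = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(scores)
```

Returning a dictionary rather than printing makes it easier to collect the scores of several models into one table later.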

In [None]:
def get_metrics(y_true, y_pred, model_name):
    # Compute every metric from the arguments, not from the global y_test
    MSE = mean_squared_error(y_true, y_pred)
    RMSE = np.sqrt(MSE)
    MAE = mean_absolute_error(y_true, y_pred)
    R2 = r2_score(y_true, y_pred)

    print(f"{model_name} : ['MSE': {round(MSE,3)}, 'RMSE':{round(RMSE,3)}, 'MAE' :{round(MAE,3)}, 'R2':{round(R2,3)}]")

In [None]:
get_metrics(y_test, y_pred, "LinearRegression")

# **Train Multiple Models**

### ***Now we train multiple models and compare their performance.***

In [None]:
!pip install xgboost

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [None]:
rir = Ridge().fit(X_train, y_train)
y_pred_rir = rir.predict(X_test)

lar = Lasso().fit(X_train, y_train)
y_pred_lar = lar.predict(X_test)

poly = PolynomialFeatures(2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)  # reuse the transformer fitted on the training data

poly_r = LinearRegression().fit(X_train_poly, y_train)
y_pred_poly = poly_r.predict(X_test_poly)

svr = SVR().fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)

knnr = KNeighborsRegressor().fit(X_train, y_train)
y_pred_knnr = knnr.predict(X_test)

dtr = DecisionTreeRegressor().fit(X_train, y_train)
y_pred_dtr = dtr.predict(X_test)

rfr = RandomForestRegressor().fit(X_train, y_train)
y_pred_rfr = rfr.predict(X_test)

xgbr = XGBRegressor().fit(X_train, y_train)
y_pred_xgbr = xgbr.predict(X_test)

In [None]:
get_metrics(y_test, y_pred_rir, "Ridge")
get_metrics(y_test, y_pred_lar, "Lasso")
get_metrics(y_test, y_pred_poly, "PolynomialFeatures")
get_metrics(y_test, y_pred_svr, "SVR")
get_metrics(y_test, y_pred_knnr, "KNNR")
get_metrics(y_test, y_pred_dtr, "DecisionTreeRegressor")
get_metrics(y_test, y_pred_rfr, "RandomForestRegressor")
get_metrics(y_test, y_pred_xgbr, "XGBRegressor")

## The two best models here are Random Forest Regressor and XGBoost, so we need to decide which one to use.
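
A single train/test split can favor one model by chance, so one hedged way to choose between two close candidates is k-fold cross-validation. The sketch below uses synthetic data and, to stay within scikit-learn, substitutes `GradientBoostingRegressor` for XGBoost; both the data and that substitution are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(300, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.1, size=300)  # synthetic target

candidates = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    # Stand-in for XGBoost so the sketch needs scikit-learn only.
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn reports negated MSE, so flip the sign before the square root.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_rmse[name] = float(np.sqrt(-neg_mse).mean())
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.3f}")
```

The model with the lower mean cross-validated RMSE is the more defensible choice, especially if the gap is larger than the fold-to-fold variation.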

# **Visualise Model Prediction**

**Linear Regression model**

In [None]:
plt.scatter(y_test, y_pred)
plt.title("Linear Regression Truth vs Prediction ")
plt.xlabel("Ground Truth")
plt.ylabel("Prediction")
plt.show()

**Random Forest Regressor**

In [None]:
plt.scatter(y_test, y_pred_rfr)
plt.title("Random Forest Regressor Truth vs Prediction ")
plt.xlabel("Ground Truth")
plt.ylabel("Prediction")
plt.show()

### **XGB Regressor**

In [None]:
plt.scatter(y_test, y_pred_xgbr)
plt.title("XGB Regressor Truth vs Prediction ")
plt.xlabel("Ground Truth")
plt.ylabel("Prediction")
plt.show()

# **Hyperparameter Tuning for XGBoost Regressor**

In [None]:
from sklearn.model_selection import RandomizedSearchCV

import time
start_time = time.time()

params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'subsample': np.arange(0.5, 1.0, 0.1),
           'colsample_bytree': np.arange(0.4, 1.0, 0.1),
           'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
           'n_estimators': [100, 500, 1000]}


xgbr = XGBRegressor(seed = 20)
rscv = RandomizedSearchCV(estimator=xgbr,
                          param_distributions=params,
                          scoring='neg_mean_squared_error',
                          n_iter=25,
                          cv=5,
                          verbose=1,
                          random_state=20)  # fixed seed so the sampled candidates are reproducible

rscv.fit(X_train, y_train)

y_pred_xgb_random = rscv.predict(X_test)

get_metrics(y_test, y_pred_xgb_random, "XGBRegressor With Best Parameters")

print("Time taken for the randomized search (seconds): ", time.time()-start_time)

print("Best parameters:", rscv.best_params_)

In [None]:
xgbr = XGBRegressor(subsample=0.6,
                   n_estimators=1000,
                   max_depth=6,
                   learning_rate=0.1,
                   colsample_bytree=0.7,
                   colsample_bylevel=0.4,
                   seed = 20)

xgbr.fit(X_train, y_train)

y_pred_tuned = xgbr.predict(X_test)

get_metrics(y_test, y_pred_tuned, "XGBRegressor With Best Parameters")