# **Project Name**    - Yulu Bike Sharing Demand Prediction- Regression

##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

Rental bikes have been introduced in many urban cities to enhance mobility and convenience. Ensuring the availability and accessibility of rental bikes at the right time is essential, as it reduces waiting times and provides a consistent supply across the city. A key challenge is predicting the number of bikes needed each hour to maintain a steady supply. To address this, data mining techniques are used to predict hourly rental bike demand.

This project builds models to forecast hourly bike demand using the **Seoul Bike Rental dataset, available on Kaggle**. The dataset includes weather-related factors (such as **temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall**), along with the **number of bikes rented per hour and date information**. **Three regression models (Linear Regression, Random Forest, and Bagging) were trained and validated with 5-fold cross-validation**, and their performance was evaluated on a held-out testing set.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Accurately predict the hourly demand for rental bikes in urban cities, using the Seoul Bike Rental dataset. This involves building regression models based on weather data and rental history to ensure the timely availability of bikes, reduce waiting times, and maintain a stable supply across the city.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("ticks")
sns.set_context("poster");
import plotly.express as px
from scipy.stats import norm, ttest_ind, mannwhitneyu, pearsonr
import pandas as pd
import numpy as np
import tensorflow as tf
from datetime import datetime
import calendar
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.preprocessing import  LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import Lasso, Ridge, LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor,BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import neighbors
from lightgbm import LGBMRegressor
import lightgbm
from xgboost import XGBRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
pd.options.display.max_rows = 50
pd.options.display.float_format = "{:.3f}".format
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mount Google Drive so the dataset file can be read (Colab only)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset

train_file_path = "/content/drive/MyDrive/AlmaBetter Projects/SeoulBikeData.csv"
df = pd.read_csv(train_file_path,encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
print(df['Rented_Bike_Count'].dtype)


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

* The dataset contains 8760 rows and 14 columns.
* There are no duplicate values in the data.
* There are no missing values in the data.
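
The three checks above (shape, duplicates, missing values) can be bundled into one helper. The sketch below is illustrative only: `summarize_quality` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return basic shape and data-quality counts for a DataFrame."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_values": int(frame.isnull().sum().sum()),
    }


# Toy frame (made-up values) with one duplicated row and one missing value
toy_quality_df = pd.DataFrame({
    "Hour": [0, 1, 1, 2],
    "Temperature": [-5.2, -5.5, -5.5, None],
})
quality = summarize_quality(toy_quality_df)
print(quality)
```

Calling `summarize_quality(df)` on the real dataset should reproduce the numbers listed above if the file matches the one used here.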

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* **Date** - Date of the observation (day/month/year in the raw file)
* **Rented_Bike_Count** - Count of bikes rented at each hour
* **Hour** - Hour of the day
* **Temperature** - Temperature in Celsius
* **Humidity** - %
* **Wind_speed** - m/s
* **Visibility** - in units of 10 m
* **Dew_point_temperature** - Celsius
* **Solar_Radiation** - MJ/m2
* **Rainfall** - mm
* **Snowfall** - cm
* **Seasons** - Winter, Spring, Summer, Autumn
* **Holiday** - Holiday/No Holiday
* **Functioning_Day** - NoFunc (non-functional hours), Fun (functional hours)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique().sort_values(ascending=True)

## 3. ***Data Wrangling***

### Data Wrangling Code

Changing the data types

In [None]:
# Write your code to make your dataset analysis ready.
# Change the data types
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')         # Convert to datetime
df['Seasons'] = df['Seasons'].astype('category')  # Convert to category
df['Holiday'] = df['Holiday'].astype('category')  # Convert to category
df['Functioning_Day'] = df['Functioning_Day'].astype('category')  # Convert to category

Extracting the month, day name, and year from the date and adding them to the dataset

In [None]:
#Extracting month from date column
df['month'] = pd.DatetimeIndex(df['Date']).month
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])

#Extracting day name from date
df['day'] = df['Date'].dt.day_name()

#Extracting year
df['year'] = df['Date'].dt.year

Count the number of distinct values in each categorical column

In [None]:
def total(df, var):
  # Number of distinct values in column `var`
  return df[var].nunique()

total_length_different_column = {
    'Seasons': total(df,'Seasons'),
    'Holiday': total(df,'Holiday'),
    'Functioning_Day': total(df,'Functioning_Day'),
    'month' : total(df,'month'),
    'day'   : total(df,'day'),
    'year'  : total(df,'year')
}
total_df = pd.DataFrame.from_dict(total_length_different_column, orient='index', columns=['unique_values'])

print(total_df)

The year is stored as a number, but it labels a time period rather than measuring a quantity, so we treat it as a categorical value.


In [None]:
df['year'] = df['year'].astype('object')

### What all manipulations have you done and insights you found?

**Manipulations done-**
* We converted each column to an appropriate data type.
* Extracted the day name, month, and year from the 'Date' column, which makes further analysis easier.

**Insights found-**
* The total no. of unique Seasons, Holiday, Functioning Day, month , day and year are found to be 4, 2, 2, 12, 7 and 2 respectively.
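
As a small illustration of the date handling described above, the sketch below parses two made-up dates in the `day/month/year` format and derives the same three features. The toy frame is an assumption for demonstration; the column names mirror the ones used in this notebook.

```python
import calendar

import pandas as pd

# Two made-up dates in the raw day/month/year format
toy_dates = pd.DataFrame({"Date": ["01/12/2017", "15/06/2018"]})

# An explicit format avoids day/month ambiguity (01/12 is 1 December, not 12 January)
toy_dates["Date"] = pd.to_datetime(toy_dates["Date"], format="%d/%m/%Y")

toy_dates["month"] = toy_dates["Date"].dt.month.map(lambda m: calendar.month_abbr[m])
toy_dates["day"] = toy_dates["Date"].dt.day_name()
toy_dates["year"] = toy_dates["Date"].dt.year
print(toy_dates)
```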

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

We split the dataframe into numerical and categorical columns.

In [None]:
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])

#### Chart - 1: Correlation Heatmap to find the variables that affect the number of bike rented.

In [None]:
# Chart - 1: Correlation Heatmap
plt.figure(figsize=(12, 10))
corr_matrix = df_numerical.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5,
            annot_kws={"size": 10}, cbar_kws={"shrink": 0.8})
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.yticks(fontsize=10)
plt.title('Correlation Heatmap of all numerical variables', fontsize=16)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap was chosen to visualize correlations between variables and highlight relationships.

##### 2. What is/are the insight(s) found from the chart?

Rented bike count has strong correlations with hour and temperature, and weaker correlations with rainfall and wind speed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding these correlations can help optimize bike availability based on time and weather conditions.
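
A heatmap can be hard to read when the question is simply "which variables move most with the target?". One option is to pull the target's column out of the correlation matrix and sort it by absolute value. The sketch below uses synthetic data (all values are made up) and is not the notebook's own analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
temperature = rng.normal(15, 10, n)
wind_speed = rng.normal(2, 1, n)
# Synthetic target: driven mostly by temperature, only weakly by wind speed
rented = 30 * temperature - 5 * wind_speed + rng.normal(0, 50, n)

toy_corr_df = pd.DataFrame({
    "Rented_Bike_Count": rented,
    "Temperature": temperature,
    "Wind_speed": wind_speed,
})

# Correlation of every feature with the target, strongest first
corr_with_target = toy_corr_df.corr()["Rented_Bike_Count"].drop("Rented_Bike_Count")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

On the real data, the same three lines applied to `df_numerical` would list the features in the order the heatmap suggests.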

#### Chart - 2: Pairplot to see the relation between Rented Bike Count, Temperature, Humidity, Wind_speed and Solar_Radiation

In [None]:
# Chart - 2: PairPlot
selected_columns = ['Rented_Bike_Count', 'Temperature', 'Humidity', 'Wind_speed', 'Solar_Radiation']
g = sns.pairplot(df[selected_columns], diag_kind='kde', markers='o', palette='Set2', height=2.5)
for ax in g.axes.flatten():
    ax.label_outer()
    ax.set_xlabel(ax.get_xlabel(), rotation=45, ha='right', fontsize=12)
    ax.set_ylabel(ax.get_ylabel(), rotation=0, ha='right', fontsize=12)
plt.suptitle('Pairplot of Selected Variables', y=1.02, fontsize=16)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot was chosen to visualize relationships and distributions between multiple numerical variables simultaneously.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals correlations, such as higher temperatures leading to increased bike rentals, and trends in humidity and solar radiation affecting demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights can inform operational strategies for bike availability, optimizing rentals during peak demand times.

* High rainfall correlates with lower bike rentals, indicating potential negative growth during rainy seasons, suggesting a need for alternative promotional strategies or services during adverse weather conditions.

#### Chart - 3: Bar Plot to see the hours wise data distribution.

In [None]:
# Chart - 3: Bar Plot
sns.set_style("whitegrid")
plt.figure(figsize=(18, 14))
for i, col in enumerate(df_numerical):
    ax = plt.subplot(4, 3, i + 1)
    sns.barplot(data=df, x='Hour', y=col, ax=ax, palette='Set2')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
    ax.set_title(f'Hour vs {col}', fontsize=12)
    ax.set_xlabel('Hour', fontsize=10)
    ax.set_ylabel(col, fontsize=10)
plt.tight_layout()
plt.subplots_adjust(top=0.92)
plt.suptitle('Hourly Data Distribution of Continuous Variables', fontsize=18, y=0.98)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart was chosen to visualize hourly data distribution across multiple continuous variables, making it easy to compare trends over time.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights peaks and patterns at specific hours, showing how rented bike count, temperature, wind speed, and humidity vary over the day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can optimize bike availability by predicting demand, reducing wait times, and improving customer satisfaction.
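
The bar heights in this chart are per-hour means, so the peak hours can also be read off numerically with a `groupby`. The numbers below are invented for illustration.

```python
import pandas as pd

# Made-up hourly observations: two readings each for 08:00, 13:00 and 18:00
toy_hourly_df = pd.DataFrame({
    "Hour": [8, 8, 13, 13, 18, 18],
    "Rented_Bike_Count": [900, 1100, 400, 600, 1500, 1700],
})

# Mean rentals per hour (what each bar in a seaborn barplot shows by default)
hourly_mean = toy_hourly_df.groupby("Hour")["Rented_Bike_Count"].mean()
peak_hour = int(hourly_mean.idxmax())
print(hourly_mean)
print("Peak hour:", peak_hour)
```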

#### Chart - 4: Histogram Plot to check data distribution of each continuous variable

In [None]:
# Chart - 4: Histogram Plot
sns.set_style("whitegrid")
plt.figure(figsize=(12, 12))
for i, col in enumerate(df.select_dtypes(include=['float64','int64']).columns):
    ax = plt.subplot(5,2, i+1)
    sns.histplot(data=df, x=col, ax=ax, kde=True)
    # Resize tick labels without overwriting the automatically computed tick text
    ax.tick_params(axis='both', labelsize=10)
    ax.set_xlabel(col, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)
plt.suptitle('Data distribution of continuous variables', fontsize=18, y=0.98)
plt.tight_layout()

##### 1. Why did you pick the specific chart?

The chart shows the distribution of key continuous variables, allowing for quick identification of trends, patterns, and outliers.

##### 2. What is/are the insight(s) found from the chart?

* Bike rentals peak at specific hours and temperatures.
* Wind speed and visibility affect rentals.
* Significant outliers exist in rainfall, snowfall, and visibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights on peak hours and weather conditions can improve demand forecasting and resource allocation.

Outliers, especially in rainfall and snowfall, may indicate unpredictable conditions affecting rental demand, leading to potential misallocation of resources.

#### Chart - 5: Box Plot to find outliers present

In [None]:
# Chart - 5 Box Plot
sns.set_style("whitegrid")
plt.figure(figsize=(12, 12))
for i, col in enumerate(df.select_dtypes(include=['float64','int64']).columns):
    ax = plt.subplot(5,2, i+1)
    sns.boxplot(data=df, x=col, ax=ax, color=sns.color_palette('Set2')[0])
    # Resize tick labels without overwriting the automatically computed tick text
    ax.tick_params(axis='both', labelsize=10)
    ax.set_xlabel(col, fontsize=10)
plt.suptitle('Box Plot of continuous variables')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

Box plots are useful for visualizing the distribution of continuous variables and detecting outliers.

##### 2. What is/are the insight(s) found from the chart?

Several variables, like "Rented_Bike_Count," "Wind_speed," "Solar_Radiation," "Rainfall," and "Snowfall," show significant outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying outliers and understanding data distribution can help refine predictions and improve decision-making.

Outliers could indicate unusual conditions (extreme weather), which might affect rental demand unpredictably, leading to potential resource misallocation.
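
Box plots conventionally draw points beyond 1.5 × IQR from the quartiles as outliers. The same rule can be applied numerically to count them per column. The helper `iqr_outlier_counts` and the toy values below are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = frame.lt(lower, axis="columns") | frame.gt(upper, axis="columns")
    return mask.sum()


# Made-up data: rainfall is zero except for one heavy shower
toy_outlier_df = pd.DataFrame({
    "Rainfall": [0, 0, 0, 0, 0, 0, 0, 0, 0, 35.0],
    "Humidity": [40, 45, 50, 55, 60, 65, 70, 75, 80, 85],
})
outlier_counts = iqr_outlier_counts(toy_outlier_df)
print(outlier_counts)
```

For a column like rainfall, where most hours are zero, the IQR collapses to zero and every rainy hour is flagged, which is worth keeping in mind when reading the box plot.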

#### Chart - 6: Pie chart to analyze categorical variables.

In [None]:
# Chart - 6: Pie Chart
fig, ax = plt.subplots(1, 3, figsize=(18, 6))

season_var = pd.crosstab(index=df['Seasons'], columns='% observations')
ax[0].pie(season_var['% observations'], labels=season_var['% observations'].index, autopct='%.0f%%')
ax[0].set_title('Seasons')

Functioning_Day_var = pd.crosstab(index=df['Functioning_Day'], columns='% observations')
ax[1].pie(Functioning_Day_var['% observations'], labels=Functioning_Day_var['% observations'].index, autopct='%.0f%%')
ax[1].set_title('Functioning_Day')

holiday_var = pd.crosstab(index=df['Holiday'], columns='% observations')
ax[2].pie(holiday_var['% observations'], labels=holiday_var['% observations'].index, autopct='%.0f%%')
ax[2].set_title('Holiday')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Pie charts effectively represent categorical data as proportions of a whole.

##### 2. What is/are the insight(s) found from the chart?

It shows the distribution of observations across seasons, functioning days, and holidays. Among them, the Seasons variable is balanced.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding trends during seasons and holidays can optimize resource allocation.

If the business doesn't adapt to low demand during certain seasons or holidays, growth may be negatively impacted.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1: The average bike rental count is higher during summer compared to winter.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* **Null Hypothesis (H₀)**: The average bike rental count during summer is less than or equal to that in winter.
* **Alternative Hypothesis (H₁)**: The average bike rental count during summer is higher than in winter.

#### 2. Perform an appropriate statistical test.

In [None]:
# Two-sample t-test to obtain P-Value
# Separate the data based on the seasons
summer_data = df[df['Seasons'] == 'Summer']['Rented_Bike_Count']
winter_data = df[df['Seasons'] == 'Winter']['Rented_Bike_Count']

# Perform the one-sided two-sample t-test
# alternative='greater' tests whether the mean of the first sample (summer) exceeds the second (winter)
t_stat, p_value = ttest_ind(summer_data, winter_data, alternative='greater')

print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4f}')

# Conclusion based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The average bike rental count during summer is higher than in winter.")
else:
    print("Fail to reject the null hypothesis: No significant evidence that the average bike rental count during summer is higher than in winter.")

##### Which statistical test have you done to obtain P-Value?

**Two-sample t-test**

##### Why did you choose the specific statistical test?

The **two-sample t-test** is appropriate for **comparing** the **means** of two **independent groups** (bike rentals in summer and winter) to determine if there is a significant difference between them.
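
In a one-sided test the direction depends on argument order: with `alternative='greater'`, `ttest_ind` asks whether the mean of the first sample exceeds that of the second. The sketch below uses synthetic samples (made-up means and spreads) and additionally sets `equal_var=False` (Welch's variant), which is an assumption added here because the two groups may have very different variances.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic hourly counts: a high-demand group and a low-demand group
summer_toy = rng.normal(1000, 300, 200)
winter_toy = rng.normal(250, 100, 200)

# H1: mean(summer_toy) > mean(winter_toy)
t_toy, p_toy = ttest_ind(summer_toy, winter_toy, equal_var=False, alternative="greater")
print(f"t = {t_toy:.2f}, p = {p_toy:.3g}")
```

Swapping the two samples while keeping `alternative='greater'` would test the opposite direction, so it helps to state H₁ in a comment next to the call.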

### Hypothetical Statement - 2: There is a significant difference in bike rentals on holidays versus non-holidays.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* **Null Hypothesis (H₀)**: There is no difference in bike rental counts between holidays and non-holidays.
* **Alternative Hypothesis (H₁)**: Bike rental counts differ between holidays and non-holidays.

#### 2. Perform an appropriate statistical test.

In [None]:
# Mann-Whitney U test
# Separate the data based on holidays
holiday_data = df[df['Holiday'] == 'Holiday']['Rented_Bike_Count']
non_holiday_data = df[df['Holiday'] == 'No Holiday']['Rented_Bike_Count']

# Perform the two-sided Mann-Whitney U test
u_stat, p_value = mannwhitneyu(holiday_data, non_holiday_data, alternative='two-sided')

print(f'U-statistic: {u_stat:.4f}')
print(f'P-value: {p_value:.4f}')

# Conclusion based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Bike rental counts differ significantly between holidays and non-holidays.")
else:
    print("Fail to reject the null hypothesis: No significant evidence of a difference in bike rental counts between holidays and non-holidays.")

##### Which statistical test have you done to obtain P-Value?

**Mann-Whitney U test**

##### Why did you choose the specific statistical test?

It is a **non-parametric** test suitable for comparing two **independent groups** when the data may **not** be **normally distributed**. It works on ranks rather than raw values, and its result is often interpreted as a comparison of **medians**.
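
The U statistic also gives a simple effect size: dividing it by the number of pairs `n1 * n2` estimates the probability that a value from the first group exceeds one from the second. The sketch below uses tiny made-up samples and assumes SciPy 1.7 or later, where the returned statistic refers to the first sample.

```python
from scipy.stats import mannwhitneyu

# Made-up samples: x is mostly larger than y
x_toy = [10, 12, 14, 16]
y_toy = [1, 2, 3, 4, 11]

u_toy, p_mw_toy = mannwhitneyu(x_toy, y_toy, alternative="two-sided")

# Share of (x, y) pairs in which x is larger (ties would count as one half)
prob_x_greater = u_toy / (len(x_toy) * len(y_toy))
print(f"U = {u_toy}, P(x > y) = {prob_x_greater:.2f}, p = {p_mw_toy:.3f}")
```

Reporting this probability alongside the p-value shows how large the holiday effect is, not only whether it is detectable.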

### Hypothetical Statement - 3: Wind speed has no significant impact on the number of bike rentals

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* **Null Hypothesis (H₀)**: Wind speed has no significant impact on the number of bike rentals.
* **Alternative Hypothesis (H₁)**: Wind speed has a significant impact on the number of bike rentals.

#### 2. Perform an appropriate statistical test.

In [None]:
# Pearson's correlation coefficient

# Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(df['Wind_speed'], df['Rented_Bike_Count'])

print(f'Correlation coefficient: {correlation:.4f}')
print(f'P-value: {p_value:.4f}')

# Perform simple linear regression
X = df[['Wind_speed']]
y = df['Rented_Bike_Count']
X = sm.add_constant(X)  # Adds a constant term to the predictor

model = sm.OLS(y, X).fit()
print(model.summary())

# Conclusion based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Wind speed has a significant impact on bike rentals.")
else:
    print("Fail to reject the null hypothesis: No significant evidence of an impact of wind speed on bike rentals.")


##### Which statistical test have you done to obtain P-Value?

**Pearson correlation coefficient**

##### Why did you choose the specific statistical test?

It measures the strength and direction of the linear relationship between two continuous variables, which is suitable for assessing the impact of wind speed on bike rentals.
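
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it from that definition on made-up numbers and checks it against `pearsonr`. Note that with thousands of rows even a weak correlation can have a very small p-value, so the size of r should be read together with its significance.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up paired observations
wind_toy = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
count_toy = np.array([200.0, 260.0, 310.0, 330.0, 420.0, 450.0])

# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx = wind_toy - wind_toy.mean()
dy = count_toy - count_toy.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_scipy, p_r_toy = pearsonr(wind_toy, count_toy)
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}")
```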

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
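# Handling Outliers & Outlier treatments
# Apply a Yeo-Johnson power transform to every numeric column to reduce skewness.
# Note: this list also contains the target (Rented_Bike_Count), so the model
# metrics reported later are on the transformed, standardized scale.
var=list(df.select_dtypes(include=['float64','int64']).columns)
sc_X=PowerTransformer(method = 'yeo-johnson')
df[var]=sc_X.fit_transform(df[var])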

In [None]:
# Box Plot after applying Power Transformation
sns.set_style("whitegrid")
plt.figure(figsize=(12, 12))
for i, col in enumerate(df.select_dtypes(include=['float64','int64']).columns):
    ax = plt.subplot(5,2, i+1)
    sns.boxplot(data=df, x=col, ax=ax, color=sns.color_palette('Set2')[0])
    # Resize tick labels without overwriting the automatically computed tick text
    ax.tick_params(axis='both', labelsize=10)
    ax.set_xlabel(col, fontsize=10)
plt.suptitle('Box Plot of continuous variables', fontsize=16)
plt.tight_layout()

##### What all outlier treatment techniques have you used and why did you use those techniques?

We apply a Yeo-Johnson Power Transformation to reduce the skewness of the data and the influence of outliers. It can handle both positive and negative values.

We then re-plotted the box plots of the variables and found that the number of outliers had reduced significantly.
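
To see what the transformation does, the sketch below applies Yeo-Johnson to a synthetic right-skewed column and compares skewness before and after. The data are made up. The transform does not delete any rows; it compresses the long tail so that fewer points fall outside the box-plot whiskers.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Synthetic right-skewed column (many small values, a few large ones)
rain_toy = rng.exponential(scale=2.0, size=(1000, 1))

pt_toy = PowerTransformer(method="yeo-johnson")  # standardize=True by default
rain_toy_t = pt_toy.fit_transform(rain_toy)

skew_before = float(skew(rain_toy.ravel()))
skew_after = float(skew(rain_toy_t.ravel()))
print(f"skew before = {skew_before:.2f}, after = {skew_after:.2f}")
```

Because `PowerTransformer` also standardizes by default, the transformed column has mean 0 and unit variance.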

### 2. Categorical Encoding

In [None]:
# Encode your categorical columns
df=pd.get_dummies(df,columns=['Holiday','Seasons','Functioning_Day','Hour','month','day'] ,drop_first=True)

#### What all categorical encoding techniques have you used & why did you use those techniques?

We use **one-hot encoding** to convert categorical variables into a numerical format that machine learning models can understand. Many algorithms (e.g., linear regression, logistic regression) require input features to be numerical, so one-hot encoding transforms categorical values into binary vectors. We set `drop_first=True` so that one level of each variable is dropped, which avoids perfectly collinear dummy columns.
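
A toy example makes the effect of `drop_first=True` visible: with four seasons, only three dummy columns are created, and the dropped level is represented by all three being zero. The values below are made up.

```python
import pandas as pd

toy_season_df = pd.DataFrame({
    "Temperature": [-3.0, 12.5, 27.1, 14.0, -1.5],
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"],
})

# The first level in sorted order ('Autumn') is dropped and becomes the baseline
encoded_toy = pd.get_dummies(toy_season_df, columns=["Seasons"], drop_first=True)
print(encoded_toy)
```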

### 3. Feature Selection

**Variance Inflation Factor**: A variance inflation factor (VIF) detects multicollinearity in regression analysis. Multicollinearity is when there is correlation between predictors (i.e. independent variables) in a model; its presence can adversely affect your regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
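
For predictor j, the VIF is defined as 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. The sketch below implements this definition directly on synthetic data. The helper `vif_table` is hypothetical, and it fits each auxiliary regression with an intercept, so its numbers can differ from `variance_inflation_factor` in statsmodels when no constant column is supplied.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(frame: pd.DataFrame) -> pd.DataFrame:
    """VIF per column: regress each column on the others and use 1 / (1 - R^2)."""
    rows = []
    for col in frame.columns:
        others = frame.drop(columns=col)
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        rows.append((col, 1.0 / (1.0 - r2)))
    return pd.DataFrame(rows, columns=["variable", "VIF"])


rng = np.random.default_rng(7)
n = 500
temperature_toy = rng.normal(15, 10, n)
toy_vif_df = pd.DataFrame({
    "Temperature": temperature_toy,
    # Synthetic dew point that tracks temperature closely
    "Dew_point_temperature": temperature_toy - 5 + rng.normal(0, 1, n),
    "Wind_speed": rng.normal(2, 1, n),
})
vif_toy = vif_table(toy_vif_df).set_index("variable")["VIF"]
print(vif_toy)
```

In this synthetic setup the two temperature columns get a large VIF while wind speed stays close to 1, which is the pattern that motivates dropping one of a near-duplicate pair.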

In [None]:
X=df.iloc[:,2:]
y=df.iloc[:,1]
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)
calc_vif(X.select_dtypes(include=['float','int']))

Dew Point Temperature is highly correlated with the other predictors. We drop this variable and check the VIF scores again.

In [None]:
X = X.drop(columns=['Dew_point_temperature'])
calc_vif(X.select_dtypes(include=['float','int']))

Feature selection aims to keep the input variables that are most useful for predicting the target variable. Here we score every feature with `f_regression`; because `k='all'` is used, no feature is dropped at this step.

In [None]:
# Select your features wisely to avoid overfitting
fs = SelectKBest(score_func=f_regression, k='all')
fs.fit(X, y)
feature_contribution=(fs.scores_/sum(fs.scores_))*100
for i,j in enumerate(X.columns):
    print(f'{j} : {feature_contribution[i]:.2f}%')

# Create a DataFrame for the feature scores
feature_scores = pd.DataFrame({'Feature': X.columns, 'Score': fs.scores_})
# Sort the DataFrame by scores in descending order
feature_scores = feature_scores.sort_values(by='Score', ascending=False)
# Plot the bar plot in descending order
sns.set_style("whitegrid")
plt.figure(figsize=(12, 8))
sns.barplot(x='Feature', y='Score', data=feature_scores, palette='Set2')
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.yticks(fontsize=10)
plt.suptitle('Bar Plot of Feature Importance Scores (Descending)', fontsize=16)
plt.xlabel('Features', fontsize=12)
plt.ylabel('Feature Scores', fontsize=12)
plt.tight_layout()
plt.show()

##### What all feature selection methods have you used  and why?

* We used **Variance Inflation Factor** (VIF) to detect **multicollinearity** and removed **highly correlated** variables like **Dew Point Temperature**, as it was redundant and affected model stability.

* We also performed **Feature Selection** using the **SelectKBest** method with **f_regression** as the score function, to identify and prioritize features that have the  **most influence** in predicting the target variable.

##### Which all features you found important and why?

* **Temperature (20.27%)**: This has the highest score, indicating it is the feature most strongly correlated with the target variable (rented bike count) and crucial for prediction.

* **Solar Radiation (7.16%)** and **Seasons_Summer (5.82%)**: These environmental factors significantly impact bike demand, making them important predictors.

* Other features like **Rainfall (4.73%)** and **Humidity (2.95%)** also influence bike rentals but to a lesser extent. These variables are essential for understanding weather-related fluctuations in bike demand.
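
The percentages above are each feature's share of the total `f_regression` score. With the default settings, that score is a univariate F statistic that depends only on the feature's correlation r with the target: F = r² / (1 - r²) × (n - 2). The sketch below checks this relation on synthetic data (all values made up), which shows that the ranking reflects linear, one-feature-at-a-time association only.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(3)
n_fs = 400
X_fs_toy = rng.normal(size=(n_fs, 3))
# Synthetic target: strong effect of feature 0, weak effect of feature 1, none of feature 2
y_fs_toy = 3 * X_fs_toy[:, 0] + 0.5 * X_fs_toy[:, 1] + rng.normal(0, 1, n_fs)

f_scores, p_values = f_regression(X_fs_toy, y_fs_toy)
contribution = 100 * f_scores / f_scores.sum()  # share of total score, in percent

# Same scores rebuilt from the Pearson correlation of each feature with the target
r_fs = np.array([np.corrcoef(X_fs_toy[:, j], y_fs_toy)[0, 1] for j in range(3)])
f_from_r = r_fs ** 2 / (1 - r_fs ** 2) * (n_fs - 2)
print(np.round(contribution, 2))
```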

### 4. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=0)

##### What data splitting ratio have you used and why?

We used a **70:30** data splitting ratio to allocate **70%** of the data for **training** and **30%** for **testing**. This ensures sufficient data for training the model while keeping a good portion for evaluating its performance.

### 5. Data Scaling

In [None]:
# Scaling your data
# Fit the scaler on the training set only, then apply the same scaling to the test set
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

##### Which method have you used to scale your data and why?

We have used **Standardization**, a scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of each attribute becomes zero and the resultant distribution has a unit standard deviation. The split is done first so that the scaler is fitted on the training data only.
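
One way to keep the split and the scaling in the right order automatically is to wrap the scaler and the model in a scikit-learn `Pipeline`. The scaler is then refitted on the training part of every cross-validation fold. This is a sketch on synthetic data, not the approach taken in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_pipe_toy = rng.normal(loc=[10, 50], scale=[5, 20], size=(300, 2))
y_pipe_toy = 2 * X_pipe_toy[:, 0] - 0.5 * X_pipe_toy[:, 1] + rng.normal(0, 1, 300)

Xtr, Xte, ytr, yte = train_test_split(X_pipe_toy, y_pipe_toy, test_size=0.30, random_state=0)

# Scaling happens inside the pipeline, so it only ever sees training data when fitting
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(pipe, Xtr, ytr, cv=5)
pipe.fit(Xtr, ytr)
test_r2 = pipe.score(Xte, yte)
print(f"CV R2 = {cv_r2.mean():.3f}, test R2 = {test_r2:.3f}")
```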

## ***7. ML Model Implementation***

In [None]:
#creating dictionary for storing different models accuracy
model_comparison={}

### ML Model - 1: Linear Regression

In [None]:
# ML Model - 1 Implementation
lm=LinearRegression()
# Fit the Algorithm
lm.fit(X_train,y_train)
# Predict on the model
y_pred=lm.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(f"Model R-Square : {r2_score(y_test,y_pred)*100:.2f}%")
# MSE is not a percentage; report it in the (transformed) units of the target
print(f"Model MSE : {mean_squared_error(y_test,y_pred):.4f}")

In [None]:
# Scatter plot of actual vs predicted (transformed) rented bike counts
plt.figure(figsize=(10, 6))
# Create a scatter plot of actual vs predicted values
sns.scatterplot(x=y_test, y=y_pred, color='blue', label='Predicted Values')
# Add a line for perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--', label='Perfect Prediction')
# Adding labels and title
plt.title('Actual vs Predicted Rented Bike Count', fontsize=16)
plt.xlabel('Actual Rented Bike Count (transformed)', fontsize=14)
plt.ylabel('Predicted Rented Bike Count (transformed)', fontsize=14)
plt.legend()
plt.grid()
plt.show()

The model used is **Linear Regression**, which aims to establish a relationship between the independent variables (features) and the dependent variable (target) to predict outcomes.

**R-Square (83.11%)**: This indicates that approximately 83.11% of the variance in the target variable can be explained by the independent variables, suggesting a strong fit of the model to the data.

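**Mean Squared Error (MSE) (0.1683)**: This metric measures the average squared difference between the actual and predicted values. It is not a percentage: because the target was power-transformed and standardized earlier, the value is in transformed units rather than bikes per hour. A lower MSE indicates better model performance, meaning the model's predictions are closer to the actual values.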
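
R² and MSE are linked by R² = 1 - MSE / Var(y), where Var(y) is the variance of the actual values in the evaluation set. When the target has been standardized to roughly unit variance, MSE is therefore close to 1 - R², which would be consistent with the two figures above adding up to almost 100%. The sketch below checks the identity with plain NumPy on made-up numbers.

```python
import numpy as np

# Made-up actual and predicted values on a standardized scale
y_true_toy = np.array([-1.2, -0.4, 0.1, 0.6, 1.3])
y_hat_toy = np.array([-1.0, -0.5, 0.3, 0.4, 1.1])

residuals = y_true_toy - y_hat_toy
mse_toy = np.mean(residuals ** 2)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

# Population variance (ddof=0) matches the definition used in R²
var_y_toy = np.var(y_true_toy)
print(f"MSE = {mse_toy:.4f}, R2 = {r2_toy:.4f}, 1 - MSE/Var(y) = {1 - mse_toy / var_y_toy:.4f}")
```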

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1: 5-fold cross-validation (no hyperparameter search is run here)
# For regressors, cross_val_score returns R² by default
accuracies = cross_val_score(estimator = lm, X = X_train, y = y_train, cv = 5)
print("Cross Val R-Square: {:.2f} %".format(accuracies.mean()*100))
print("Cross Val Standard Deviation: {:.2f} %".format(accuracies.std()*100))
model_comparison['Linear Regression']=[r2_score(y_test,y_pred),mean_squared_error(y_test,y_pred),(accuracies.mean()),(accuracies.std())]

### ML Model - 2: Random Forest Regression

In [None]:
# ML Model - 2 Implementation
rm=RandomForestRegressor(n_estimators=10,random_state=0)
rm.fit(X_train,y_train)
y_pred=rm.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(f"Model R-Square : {r2_score(y_test,y_pred)*100:.2f}%")
# MSE is not a percentage; report it in the (transformed) units of the target
print(f"Model MSE : {mean_squared_error(y_test,y_pred):.4f}")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2: 5-fold cross-validation (no hyperparameter search is run here)
# For regressors, cross_val_score returns R² by default
accuracies = cross_val_score(estimator = rm, X = X_train, y = y_train, cv = 5)
print("Cross Val R-Square: {:.2f} %".format(accuracies.mean()*100))
print("Cross Val Standard Deviation: {:.2f} %".format(accuracies.std()*100))
model_comparison['Random forest Regression']=[r2_score(y_test,y_pred),mean_squared_error(y_test,y_pred),(accuracies.mean()),(accuracies.std())]

### ML Model - 3: Bagging Regression

In [None]:
bm= BaggingRegressor(RandomForestRegressor(n_estimators=10,random_state=0),random_state=0)
bm.fit(X_train, y_train)
y_pred=bm.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(f"Model R-Square : {r2_score(y_test,y_pred)*100:.2f}%")
# MSE is not a percentage; report it in the (transformed) units of the target
print(f"Model MSE : {mean_squared_error(y_test,y_pred):.4f}")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
accuracies = cross_val_score(estimator = bm, X = X_train, y = y_train, cv = 5)
print("Cross Val R-Square: {:.2f} %".format(accuracies.mean()*100))
print("Cross Val Standard Deviation: {:.2f} %".format(accuracies.std()*100))
model_comparison['Bagging Regressor']=[r2_score(y_test,y_pred),mean_squared_error(y_test,y_pred),(accuracies.mean()),(accuracies.std())]

#### Model Comparison

In [None]:
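Model_com_df=pd.DataFrame(model_comparison).T
# cross_val_score returns R² for regressors, so the CV columns are R² statistics
Model_com_df.columns=['R-Square','MSE','CV R-Square','CV std']
Model_com_df=Model_com_df.sort_values(by='R-Square',ascending=False)
# Show R² values as percentages; MSE is not a percentage, so keep it as a plain number
Model_com_df.style.format({'R-Square':'{:.2%}','MSE':'{:.4f}','CV R-Square':'{:.2%}','CV std':'{:.2%}'}).background_gradient(cmap='RdYlBu_r')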

* **Bagging Regressor** outperforms the other models across all metrics, indicating it has the highest predictive accuracy and the lowest error.

* **Random Forest Regression** also shows strong performance, but not as high as Bagging.

* **Linear Regression** exhibits the lowest R² and the highest MSE, indicating that it may not capture the underlying patterns in the data as effectively as the ensemble methods.

##### Which hyperparameter optimization technique have you used and why?

I used 5-fold cross-validation (`cross_val_score` with `cv=5`) to check that each model's performance is stable across different subsets of the training data, which gives a better estimate of its ability to generalize to unseen data. No hyperparameter search (such as `GridSearchCV` or `RandomizedSearchCV`) was run; the models use fixed settings such as `n_estimators=10`.
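
If a hyperparameter search were added, one common option is `GridSearchCV`, which runs cross-validation for every combination in a parameter grid. The sketch below is a minimal example on synthetic data; the grid values are arbitrary assumptions, not tuned recommendations for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_gs_toy = rng.uniform(0, 1, size=(200, 3))
# Synthetic non-linear target
y_gs_toy = np.sin(6 * X_gs_toy[:, 0]) + X_gs_toy[:, 1] + rng.normal(0, 0.1, 200)

# Small illustrative grid: 2 x 2 = 4 candidate settings, each scored with 3-fold CV
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_gs_toy, y_gs_toy)
print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `search.fit(X_train, y_train)` followed by `search.best_estimator_.predict(X_test)` would slot into the existing evaluation cells.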

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**R² (Coefficient of Determination)**:

**Indication**: R² explains how well the model's predictions match the actual data. A higher R² value means the model captures a larger portion of the variability in bike rentals.

**Business Impact**: High R² suggests the model is effective in predicting demand, leading to better inventory management, optimized staffing, and improved bike availability, ultimately enhancing customer satisfaction and reducing operational costs.

**MSE (Mean Squared Error)**:

**Indication**: MSE measures the average squared difference between actual and predicted values. A lower MSE indicates fewer errors in the model's predictions.

**Business Impact**: A low MSE ensures accurate demand forecasting, reducing the risk of over- or under-supplying bikes. This improves resource allocation, minimizing maintenance costs, and maximizing bike utilization.
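
For business reporting, an error expressed in bikes per hour is easier to act on than one on a transformed scale. One possible approach, sketched below on synthetic data, is to fit a separate `PowerTransformer` for the target so that predictions can be mapped back with `inverse_transform`. This differs from the notebook, which transforms all numeric columns with a single transformer; the names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed hourly counts
counts_toy = rng.gamma(shape=2.0, scale=300.0, size=(500, 1))

# Dedicated transformer for the target only
target_pt = PowerTransformer(method="yeo-johnson")
counts_toy_t = target_pt.fit_transform(counts_toy)

# Stand-in for model predictions on the transformed scale: truth plus small noise
pred_toy_t = counts_toy_t + rng.normal(0, 0.1, size=counts_toy_t.shape)

# Map predictions back to bikes per hour and compute RMSE in original units
pred_toy_counts = target_pt.inverse_transform(pred_toy_t)
rmse_bikes = float(np.sqrt(np.mean((counts_toy - pred_toy_counts) ** 2)))
roundtrip = target_pt.inverse_transform(counts_toy_t)
print(f"RMSE in bikes per hour: {rmse_bikes:.1f}")
```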

# **Conclusion**

* Bagging Regressor performs best with the highest accuracy and lowest error.
* Random Forest Regression shows strong performance but is slightly weaker than Bagging.
* Linear Regression has the lowest R² and highest MSE, indicating it captures patterns less effectively.
* Summer rentals are higher than winter rentals.
* Rentals during holidays are not higher than non-holidays.
* Wind speed significantly impacts bike rentals.
* Strong correlations exist between rented bike count, hour, and temperature.
* Weaker correlations are seen with rainfall and wind speed.
* Higher temperatures lead to increased rentals, as shown in the chart.
* Outliers in "Rented_Bike_Count," "Wind_speed," and other variables may reflect extreme conditions, affecting demand unpredictably.
* The Seasons variable shows a balanced distribution.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***