<h1><b>Project Type - Supervised Machine learning (Regression Model)
<h1><b>Contribution - Individual

<h1><b>GitHub Link



<h1><b>Project Summary

* This project aimed to predict how many bikes would be rented at different times based on factors like weather and environmental conditions. The dataset used had information on temperature, humidity, wind speed, and other factors.

* The dataset used in the study consisted of 8760 entries with 14 features.

* To start, the dataset was checked for any missing values, outliers, or relationships between the features. Categorical variables were transformed into numerical ones. The dataset was then divided into a training set (70%) and a testing set (30%).

* The first step was to clean and explore the data. There were no duplicate or empty values, and the columns were converted to the right data types. Exploratory analysis was done to find any patterns or trends in the data.

* Next, eight different machine learning algorithms were trained and evaluated: Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, Decision Tree, Random Forest, and XGBoost Regression. Their performance was assessed using mean squared error (MSE), root mean squared error (RMSE), R2 score, and mean absolute error (MAE). The best models were then fine-tuned using hyperparameter tuning.

* The results showed that the Random Forest and XGBoost models performed better than the other models, with R2 scores of 0.89 and 0.91, respectively. These models were found to be the most accurate in predicting the demand for rental bikes.

* In conclusion, this study demonstrated that machine learning algorithms can accurately predict the demand for rental bikes. These findings can assist bike rental companies in making better decisions and improving their services to meet the growing demand.
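
As a rough illustration of the four metrics named above, the sketch below computes them with scikit-learn. The arrays are made-up values, not results from this project. Because the models in this notebook are fitted on the square root of the count, the sketch assumes predictions are squared before scoring so that the errors are expressed in bikes.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical hourly counts and model outputs on the square-root scale
y_true = np.array([100.0, 400.0, 900.0, 1600.0])
y_pred_sqrt = np.array([11.0, 19.0, 31.0, 38.0])

# Undo the square-root transform so errors are expressed in bikes
y_pred = y_pred_sqrt ** 2

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same units as the target
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f}")
```

RMSE and MAE are in the same units as the target, which usually makes them easier to communicate than MSE.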

<h1><b>Business Context

* Seoul's public bike-sharing service allows people to rent bikes by the hour or day. It's important for this service to accurately predict how many bikes will be rented at different times. This helps them plan their operations effectively, reduce waiting times, and make sure they have enough bikes available for customers.

* To make these predictions, they can use machine learning models based on historical data. By analyzing past rental patterns and factors like time of day, weather, and other relevant information, the model can estimate how many bikes will be rented in the future. This information is crucial for the rental service to adjust their bike supply accordingly, ensuring that customers don't have to wait for bikes and have a good experience.

* Accurate prediction of bike rental demand is very beneficial. It allows the rental service to plan better, provide a more efficient service, and ultimately satisfy their customers. It also helps improve overall mobility in the city and can result in increased revenue for the bike rental service. With reliable demand predictions, the bike rental service can stay ahead in the market and offer a more reliable and convenient service to the public.

<h1><b>Problem Statement


The city of Seoul has a bike-sharing system where people can rent bikes. They have collected data on how many bikes are rented, as well as information about the weather and the time of year, for the years 2017 and 2018.


* The goal of this project is to create a machine learning model that can accurately predict how many bikes will be rented in Seoul based on this historical data. The model will take into account factors like the weather conditions, the season (such as summer or winter), and the time of day. By considering all these factors, the model will be able to make more accurate predictions about how many bikes will be needed at any given time.

* Having this model will be very helpful for the city's bike rental service. It will allow them to better plan and manage their bike supply, making sure they have enough bikes available for people to rent. This will improve the overall customer experience by reducing waiting times and ensuring that people can easily find a bike when they need one. By accurately predicting the demand for rental bikes, the city's bike rental service can provide a more reliable and convenient service to its residents and visitors.

# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries
# Importing Pandas and Numpy
import pandas as pd
import numpy as np
# the standard-library math module; `numpy.math` is not available in recent NumPy releases
import math

# importing visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from seaborn.rcmod import set_style

import datetime as dt
from datetime import datetime

# Importing Models libraries
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_validate, cross_val_score

# import evaluation metrics
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Importing warning for ignore warnings
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/capstone project 2/SeoulBikeData.csv',encoding="latin1")

In [None]:
# making a copy of the data for safety
df_copy = df.copy()


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(len(df[df.duplicated()]))


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum().sort_values(ascending= False).reset_index().rename(columns={'index':'Columns',0:'Null values'})

In [None]:
# Visualizing the missing values
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("column_name", size=14, weight="bold")
plt.title("Missing Values in Column",fontweight="bold",size=17)
plt.show()

In [None]:
# Describe the dataset
df.describe(include='all').T

In [None]:
# Dataset Columns

# fetch attribute
df.columns

### Variables Description

> Date - year-month-day

>Rented Bike count - Count of bikes rented at each hour

>Hour - Hour of the day

>Temperature - Temperature in Celsius

>Humidity - %

>Windspeed - m/s

>Visibility - 10m

>Dew point temperature - Celsius

>Solar radiation - MJ/m2

>Rainfall - mm

>Snowfall - cm

>Seasons - Winter, Spring, Summer, Autumn

>Holiday - Holiday/No holiday

>Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique().reset_index().rename(columns={'index':'Columns',0:'Unique values'})

<h1>Understanding more about your variables<h>


> The dataset is from a rental bike company based out of Seoul. The goal of this project is to develop a machine learning model that can predict the demand for rental bikes.

> There are no null or duplicate values in the dataset.

> Dataset has 8760 entries with 14 features.

> The dataset contains the hourly weather conditions for a period of 365
  days (8760 hourly rows / 24), and other details such as whether a given day was a holiday or not.

## 3. ***Data Wrangling***

<B/>  Data wrangling, also known as data cleaning or data preprocessing, is the
  process of transforming and cleaning raw data into a structured and usable
  format. It involves various tasks such as removing irrelevant or duplicate
  data, and converting data into a standardized format. The goal of data
  wrangling is to make the data more suitable for analysis, which involves
  making it accurate, complete, and consistent.

<b><h2>Renaming the column name :-<b/><h/>

<h6>Renaming the columns to short, consistent names improves the readability of the dataset and makes the analysis easier to follow.

In [None]:
# Write your code to make your dataset analysis ready.

# renaming the features

df.rename(columns= {'Date':'date','Rented Bike Count': 'rented_bike_count', 'Hour':'hour',
                    'Temperature(°C)':'temperature', 'Humidity(%)':'humidity',
                    'Wind speed (m/s)': 'wind_speed', 'Visibility (10m)': 'visibility',
                    'Dew point temperature(°C)':'dew_point_temp',
                    'Solar Radiation (MJ/m2)': 'solar_radiation', 'Rainfall(mm)': 'rainfall',
                    'Snowfall (cm)':'snowfall', 'Seasons':'seasons',
                    'Holiday':'holiday', 'Functioning Day':'func_day'},
          inplace=True)

<h3>Converting the date column in appropriate format

In [None]:
# finding the datatype of 'Date' column

type(df['date'][0])

In [None]:
# converting string format of 'Date' column into date-time format

# dates in this file are day-first (for example 01/12/2017 is 1 December 2017)
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

In [None]:
df.head()

<h4>Derive "month" and a weekday/weekend flag from the "date" column:

>The data runs from December 2017 to November 2018, so the year takes only two
 values that together cover a single 12-month period. A "year" column would add no information, so it is not created.

>Day-of-month values are too fine-grained to be useful on their own, so each
  date is reduced to whether it falls on a weekend. The day name is extracted only to build this flag and is then dropped.
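
The steps described above can be sketched on a tiny made-up sample. The dates below are hypothetical rows in day-first format, and the integer weekend flag is one possible encoding (the notebook itself stores the flag as a string first and converts it later).

```python
import pandas as pd

# Hypothetical sample in day-first format (DD/MM/YYYY)
sample = pd.DataFrame({"date": ["01/12/2017", "02/12/2017", "03/12/2017", "30/11/2018"]})

# An explicit format avoids any month/day ambiguity
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")
sample["month"] = sample["date"].dt.month
sample["day_of_week"] = sample["date"].dt.day_name()

# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are weekend days
sample["weekdays_weekend"] = (sample["date"].dt.dayofweek >= 5).astype(int)
print(sample)
```

Passing an explicit `format` is a defensive choice: if a row does not match, pandas raises an error instead of silently swapping day and month.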

In [None]:

# extract month and day of week from the date column

df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()

# Encode weekday/weekend as a binary class: weekday = "0", weekend = "1" (stored as strings here, cast to int later)
df['weekdays_weekend']=df['day_of_week'].apply(lambda x : "1" if x=='Saturday' or x=='Sunday' else "0" )

In [None]:
# Remove date, day_of_week column from data set
df=df.drop(columns=['date','day_of_week'],axis=1)

In [None]:
df.info()

In [None]:
df.describe()

<h3>The hour and month columns are stored as numbers, but they represent points in time rather than quantities, so we treat both as categorical features

In [None]:

# convert Hour column integer to Categorical
df['hour']=df['hour'].astype('object')

# convert month column integer to Categorical
df['month']=df['month'].astype('object')

Let's Separate the data into appropriate format:

<H4> What is categorical data?

> A categorical variable is a variable that represents a set of categories or
  groups. Categorical variables can take on a limited number of possible values, such as yes or no, or red, blue or green. They are often used to represent qualitative data and are non-numeric in nature. In pandas they are typically stored with the object or category data type.


<h4> What is numerical data?

> A numerical variable is a variable that represents numerical values. Numerical
  variables can take on any numeric value and can be either discrete or continuous. Examples of numerical variables include age, weight, height, temperature, and income. Numerical variables are often used to represent quantitative data and can be used in mathematical calculations.
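
A small sketch of how this distinction plays out in pandas, using a made-up three-row frame. It shows that casting an integer column such as the hour to `object` moves it from the numerical group into the categorical one when columns are selected by data type.

```python
import pandas as pd

# Hypothetical rows; 'seasons' is given the object dtype explicitly
toy = pd.DataFrame({
    "hour": [0, 1, 2],
    "temperature": [-5.2, -5.5, -6.0],
    "seasons": pd.Series(["Winter", "Winter", "Winter"], dtype="object"),
})

# Casting an integer column to object moves it into the categorical group
toy["hour"] = toy["hour"].astype("object")

numeric_part = toy.select_dtypes(exclude="object")
categorical_part = toy.select_dtypes(include="object")
print(list(numeric_part.columns))
print(list(categorical_part.columns))
```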



  <B><h1> Let's extract the numerical and categorical data

In [None]:
# Divide Data in categorical and numerical features
num_col= df.select_dtypes(exclude='object')
cat_col=df.select_dtypes(include='object')


In [None]:

# fetch unique value in categorical feature columns

for col in cat_col:
  print('{} has {} values'.format(col,df[col].unique()))
  print('\n')

<h2><b>Visualizing the categorical columns

In [None]:
# Set the figure size
plt.figure(figsize=(12, 6), dpi=200)

# Iterate through each categorical column and create a pie chart
for i, feature in enumerate(cat_col):

  # Create a subplot for each pie chart
  plt.subplot(2,3, i+1)

  # Generate the value counts of the current column
  value_counts = df[feature].value_counts()

  # Generate the pie chart with the value counts of the current column
  plt.pie(value_counts, labels=value_counts.index, autopct='%.0f%%')

  # Set the title of the subplot to the name of the current column
  plt.title(feature, fontsize=12, color='red')

  # Adjust the layout of the subplots for better spacing
  plt.tight_layout()

# Show the pie charts
plt.show()

<h3><b> Why did you pick this specific chart?

>A pie chart is a useful tool to display the distribution of various categories in a dataset. By dividing the circle into proportional sections, each representing a different category, the pie chart allows for a clear comparison of the relative size of each category. The use of different colors for each section further enhances the clarity of the representation and makes it easier to understand and interpret the data.

<h3><b>What is/are the insight(s) found from the chart?

>The data shows that hours and seasons are close to evenly distributed.

>Most hourly records fall on functioning days and on non-holidays.

<h3><b>Will the gained insights help creating a positive business impact?

>Yes. Knowing how the records are distributed across hours and seasons, and that most of them fall on functioning days and non-holidays, can help bike rental businesses improve their operations and customer experience. With this knowledge, they can adjust their resources and marketing strategies to match the demand, ensuring they have enough bikes available when customers need them most.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

<h3>What is EDA :-<h3>

Exploratory Data Analysis (EDA) is a process of examining data to uncover patterns, relationships, anomalies, and other insights. It involves visualizing and summarizing data, identifying missing data and outliers, and exploring relationships between variables. EDA is a crucial step in the data analysis process and helps to guide further analysis and model selection.




<B><h3>Let's start the EDA to find the data insight



#### Chart - 1  <h4> First, we analyze the distribution of the dependent variable

In [None]:
# Chart - 1 visualization code
# defining dependent variable separately

dependent_variable = ['rented_bike_count']

In [None]:
# visualizing the distribution of the dependent variable - rental bike count
plt.figure(figsize=(10,5))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['rented_bike_count'], kde=True, color="c")
plt.title('Distribution of the dependent variable',fontsize=16,color='indigo');


In [None]:
# skew of the dependent variable
df[dependent_variable].skew()

In [None]:
df[dependent_variable]

In [None]:
tranformed_dep_var=np.sqrt(df[dependent_variable])
tranformed_dep_var

In [None]:
# visualizing the distribution of dependent variable after sqrt transformation
plt.figure(figsize=(10,5))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(tranformed_dep_var['rented_bike_count'], kde=True, color="c")
plt.title('Distribution of the dependent variable after sqrt',fontsize=16,color='red');

In [None]:
# skew of the dependent variable after sqrt transformation
np.sqrt(df[dependent_variable]).skew()


##### 1. Why did you pick the specific chart?


> A distribution plot is a type of chart used to visualize the distribution of a
  single variable. It combines a histogram with a kernel density estimate (KDE)
  curve to approximate the probability density function of the variable. The
  histogram displays the frequency distribution of the variable, while the KDE
  curve displays a smoothed, continuous version of it. The plot is useful for
  checking normality, exploring the shape of the distribution, and identifying
  any issues that need to be addressed before data analysis.

##### 2. What is/are the insight(s) found from the chart?


>The Rented Bike Count variable is strongly right-skewed. Linear regression does not strictly require a normally distributed target, but a heavily skewed target often produces skewed, unevenly spread residuals, which is what the normality assumption actually concerns. To reduce the skew, we applied a square root transformation, which gives a distribution much closer to normal for the Rented Bike Count variable.
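
The effect of a square root transformation on skewness can be illustrated on synthetic data. The gamma-distributed sample below only stands in for right-skewed counts and is an assumption made for the sketch, not the rental data itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed counts (gamma-distributed), standing in for rental counts
counts = pd.Series(rng.gamma(shape=1.5, scale=400.0, size=5000))

skew_before = counts.skew()
skew_after = np.sqrt(counts).skew()
print(f"skew before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")

# A model trained on sqrt(y) predicts on the sqrt scale; squaring restores counts
restored = np.sqrt(counts) ** 2
```

When a model is trained on the transformed target, its predictions need to be squared to get back to bike counts before they are reported.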

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


>Yes, gaining insights from data analysis can have a positive impact on a business. By analyzing data, businesses can gain a better understanding of customer behavior, market trends, and operational efficiency.

#### Chart - 2


<h2>Explore relation between categorical feature and dependent variable

In [None]:
# Chart - 2 visualization code


# fetch categorical columns
cat_col

In [None]:
# Count of rented bikes according to Functioning Day

plt.figure(figsize=(15,5),dpi=200)
sns.pointplot(data=df,x='hour',y='rented_bike_count',hue='func_day')
plt.title('Count of rented bikes according to Functioning Day', fontsize=16,color='red');


In [None]:
# Count of rented bikes according to season

plt.figure(figsize=(15,5),dpi=200)
sns.pointplot(data=df,x='hour',y='rented_bike_count',hue='seasons')
plt.title('Count of rented bikes according to seasons',fontsize=16,color='red');

In [None]:
# Count of rented bikes according to Holiday

plt.figure(figsize=(15,5),dpi=200)
sns.pointplot(data=df,x='hour',y='rented_bike_count',hue='holiday')
plt.title('Count of rented bikes according to Holiday',fontsize=16,color='red');

##### 1. Why did you pick the specific chart?

> A point plot shows the mean of a numerical variable at each level of a categorical variable, with the points joined by lines so that trends across levels are easy to follow. Here it displays how the numerical variable (rented_bike_count) changes by hour for each level of the categorical variables (holiday, seasons, and functioning day).



##### 2. What is/are the insight(s) found from the chart?

>Understanding the patterns of bike rentals can be a valuable asset for businesses looking to enhance their operations and customer satisfaction. By analyzing data, businesses can gain insights into the demand for bikes during different seasons, peak hours, and holidays, allowing them to optimize inventory, plan promotions, and improve their services accordingly. By leveraging these insights, businesses can make informed decisions that drive positive business impact and improve the overall customer experience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Understanding the patterns of bike rentals can be a valuable asset for businesses looking to enhance their operations and customer satisfaction. By analyzing data, businesses can gain insights into the demand for bikes during different seasons, peak hours, and holidays, allowing them to optimize inventory, plan promotions, and improve their services accordingly. By leveraging these insights, businesses can make informed decisions that drive positive business impact and improve the overall customer experience.

#### Chart - 3

<h2>Explore relation between numerical feature and dependent variable

In [None]:
# Chart - 3 visualization code

# fetch numerical columns
num_col

In [None]:
# Analyzing the relationship between "Bike_Count" and "Temperature" :
plt.figure(figsize=(15,5),dpi=200)
# select the target before averaging so non-numeric columns are not passed to mean()
df.groupby('temperature')['rented_bike_count'].mean().plot()
plt.title("Bike_Count  v/s  temp",fontsize=16,color='red')
plt.ylabel('Bike_count',fontsize=12)
plt.xlabel('temp',fontsize=12);

In [None]:

# Analyzing the relationship between "Bike_Count" and "Snowfall" :
plt.figure(figsize=(15,5),dpi=200)
# select the target before averaging so non-numeric columns are not passed to mean()
df.groupby('snowfall')['rented_bike_count'].mean().plot()
plt.title("Bike_Count  v/s  Snowfall",fontsize=16,color='red')
plt.ylabel('Bike_count',fontsize=12)
plt.xlabel('Snow',fontsize=12);

In [None]:
# Analyzing the relationship between "Bike_Count" and "Wind" :
plt.figure(figsize=(15,8),dpi=200)
# select the target before averaging so non-numeric columns are not passed to mean()
df.groupby('wind_speed')['rented_bike_count'].mean().plot()
plt.title("Bike_Count  v/s  Wind",fontsize=16,color='red')
plt.ylabel('Bike_count',fontsize=12)
plt.xlabel('Wind',fontsize=12);

In [None]:
# Analyzing the relationship between "Bike_Count" and "Rainfall" :
plt.figure(figsize=(15,8),dpi=200)
# select the target before averaging so non-numeric columns are not passed to mean()
df.groupby('rainfall')['rented_bike_count'].mean().plot()
plt.title("Bike_Count  v/s  Rainfall",fontsize=16,color='red')
plt.ylabel('Bike_count',fontsize=12)
plt.xlabel('Rainfall',fontsize=12);

##### 1. Why did you pick the specific chart?

>A line plot is a type of plot that displays data as a series of points connected by straight lines. Line plots are commonly used to display the relationship between two numerical variables, where one variable is plotted on the x-axis and the other variable is plotted on the y-axis. Line plots are useful for showing trends or patterns in data over time, and for identifying changes or discontinuities in the data.

##### 2. What is/are the insight(s) found from the chart?


>The count of rented bikes is highest when the temperature is around 30°C, indicating that people prefer to ride bikes in warmer weather. While the count of rented bikes is generally lower in winter compared to other seasons, heavy snowfall (beyond 4 cm) leads to a significant drop in bike rentals, so snowfall appears to be a major factor influencing the count of rented bikes. Despite a slight drop during moderate rainfall (10-20 mm), the demand for rented bikes does not decrease significantly. In fact, the mean count rises at the heaviest rainfall (around 20 mm), although hours with that much rain are likely to be rare, so this rise may rest on only a few observations. On this chart, rainfall therefore does not appear to have a major impact on the count of rented bikes.
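
One way to judge how much weight a grouped mean deserves is to report the number of hours behind each mean. The sketch below does this on made-up rows; the values are hypothetical and chosen only to show a high mean that rests on a single observation.

```python
import pandas as pd

# Hypothetical hourly records: most hours are dry, a single hour has heavy rain
toy = pd.DataFrame({
    "rainfall": [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 21.0],
    "rented_bike_count": [500, 700, 600, 800, 100, 140, 900],
})

# Mean count per rainfall value, alongside how many hours each mean is based on
summary = (
    toy.groupby("rainfall")["rented_bike_count"]
       .agg(mean_count="mean", n_hours="count")
)
print(summary)
```

A group with a high mean but a very small `n_hours` is weak evidence on its own and is worth checking before drawing conclusions from the line plot.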

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


>Yes, the gained insights from the analysis of the data can help create a positive business impact. For example, the insight that the count of rented bikes is highest when the temperature is around 30°C can be used to adjust the number of bikes available during the warmer months to meet the increased demand. Similarly, the insight that heavy snowfall leads to a significant drop in bike rentals can help the business plan accordingly for the winter months. By taking these insights into consideration, the business can optimize its operations and potentially increase revenue.

#### Chart - 4

<h2>Regression plot to know relation between dependent variable and numerical feature

In [None]:
# Chart - 4 visualization code

n = 1
plt.figure(figsize=(15,10))
for i in num_col.columns:
  # skip the dependent variable itself
  if i == 'rented_bike_count':
    continue
  plt.subplot(4,2,n)
  n += 1
  sns.regplot(x=df[i], y=df['rented_bike_count'],scatter_kws={"color": "c"}, line_kws={"color": "red"})
  plt.title(f'Dependent variable and {i}')

# adjust the spacing once, after all subplots are drawn
plt.tight_layout()




<h2><b>Why did you pick the specific chart?

<H4>A regression plot is a type of data visualization that displays the relationship between two variables using a scatter plot with a regression line. It is commonly used to examine the relationship between a dependent variable and one or more independent variables, and can also be used to identify outliers, trends, and patterns in the data.

<h2><b>What is/are the insight(s) found from the chart?

> <span>1. Positively related features: Hour, Temperature, Wind speed, Visibility and Solar radiation.

> <span>2. Negatively related features: Rainfall, Snowfall, Humidity.



 <h2><B>Will the gained insights help creating a positive business impact?

 <H4>Yes, the insights gained from the regression plots can be used to create a positive business impact. By identifying the factors that are positively correlated with the count of rented bikes, the business can allocate its resources accordingly, such as increasing the number of bikes available during peak hours and optimizing the rental prices. On the other hand, the factors that have a negative correlation with the count of rented bikes can help the business plan for adverse weather conditions, such as heavy rainfall or snowfall.

<H1><B> Pair Plot

In [None]:
# Plot pairwise relationships in the dataset
sns.pairplot(df, corner=True)

<H3>Why did you pick the specific chart?

>Pairplot is a useful tool for visualizing patterns and relationships between variables in a dataset.

>In a pairplot, each variable is plotted against all other variables in the dataset, resulting in a grid of scatterplots.

>The diagonal of the grid shows the distribution of each variable, while the lower triangle shows the scatterplots of the pairwise relationships between the variables. The upper triangle would be a mirror image of the lower one with the axes flipped, so it is omitted here with `corner=True`.

>Pairplots are useful for identifying patterns and relationships between variables, such as linear or nonlinear relationships, clusters, and outliers. They can also be used to identify which variables are strongly correlated with each other.

<H3> What is/are the insight(s) found from the chart?

>The resulting plot will show scatterplots of each pair of features along with the distributions of each individual feature.

 <H3>Will the gained insights help creating a positive business impact?

 >The insights gained from a pair plot can definitely help create a positive business impact. By identifying any patterns or relationships between variables, businesses can make more informed decisions and improve their operations.

<h1><b>Feature Engineering & Data Pre-processing



In [None]:
#checking Outliers in numeric features using seaborn boxplot

plt.figure(figsize=(20,15))

for i,feature in enumerate(num_col):
  plt.subplot(3,3,i+1)
  sns.boxplot(x=df[feature])
  plt.title(feature, fontsize=16,color='red')
  plt.tight_layout()

>It is not always necessary to remove outliers from a dataset, especially if the data is continuous and represents real-world measurements.

>In the case of rainfall and snowfall data, if the outliers are due to extreme weather events or other natural phenomena, they may represent important information that should not be removed.
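
To see why box-plot outliers need care for variables such as rainfall, the sketch below applies the usual 1.5 × IQR whisker rule to made-up readings. The helper name `iqr_outlier_mask` and the data are hypothetical and only illustrate the rule.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical rainfall readings in mm: mostly dry hours plus two wet ones
rainfall = pd.Series([0.0] * 18 + [4.0, 35.0])
mask = iqr_outlier_mask(rainfall)
print(int(mask.sum()), "of", len(rainfall), "hours flagged")
```

When most readings are zero, the interquartile range collapses to zero and every rainy hour is flagged, which supports keeping such points rather than removing them.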



<h1><b>Check Correlation and Multicollinearity between features

In [None]:

#checking correlation between independent features using heatmap

plt.figure(figsize=(15,6))
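# numeric_only=True restricts the correlation to numeric columns (text columns cannot be correlated)
sns.heatmap(df.corr(numeric_only=True),cmap='PiYG',annot=True)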
plt.title('Correlation between all the variables', size=16,color='red')
plt.show()

>The correlation heatmap indicates a high positive correlation of 0.91 between 'Temperature' and 'Dew point temperature', suggesting that dropping one of the columns would not significantly affect our analysis. Therefore, we can drop the 'Dew point temperature(°C)' column since it has the same variation as 'Temperature'.

>To ensure there is no collinearity between other variables, we will check their VIF values before proceeding further.



<h3>What is Variance Inflation Factor (VIF)?

>VIF measures how strongly one independent variable can be explained by the others. It is computed by regressing each variable on all the remaining variables and taking 1 / (1 - R²) from that regression, so a value of 1 means no collinearity and larger values mean more. A commonly used cutoff is 5: variables with a VIF below 5 are kept in the model. In some cases a VIF below 10 is also considered acceptable.

>Here, we calculate the VIF to get a clearer picture of the correlation between the features. After that, we drop the features that are highly correlated with other independent features, to get more accurate predictions.
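
As a minimal sketch of the idea behind VIF, the function below regresses one column on the others with NumPy and returns 1 / (1 - R²). The helper name `vif_from_r2` and the synthetic weather-like columns are assumptions made for illustration; the notebook itself relies on statsmodels.

```python
import numpy as np

def vif_from_r2(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns (with intercept), then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
temperature = rng.normal(15, 10, size=1000)
dew_point = temperature - 8 + rng.normal(0, 3, size=1000)   # strongly tied to temperature
wind_speed = rng.normal(2, 1, size=1000)                     # unrelated to the other two
X = np.column_stack([temperature, dew_point, wind_speed])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

Note that statsmodels' `variance_inflation_factor` does not add an intercept on its own, so its values match this sketch only when the input already contains a constant column.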

In [None]:
# Function to calculate Multicollinearity

# Checking the VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
  # VIF dataframe
  vif_data = pd.DataFrame()
  vif_data["feature"] = X.columns

  # calculating VIF for each feature
  vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
  vif_data['VIF'] = round(vif_data['VIF'],2)
  return(vif_data)

In [None]:
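# VIF is computed on the numeric independent variables only, so the target is excluded
calc_vif(df.select_dtypes(include='number').drop(columns=['rented_bike_count']))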

>The heatmap and the VIF values show that Temperature and Dew point temperature(°C) are highly correlated and strongly collinear. To reduce this correlation and multicollinearity, we drop the dew point temperature column.

In [None]:
# Drop Dew Point Temperature Column

df.drop(columns= ['dew_point_temp'], inplace=True)

In [None]:
# Again plot correlation using heatmap

plt.figure(figsize=(15,6))
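# numeric_only=True restricts the correlation to numeric columns (text columns cannot be correlated)
sns.heatmap(df.corr(numeric_only=True),cmap='PiYG',annot=True)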
plt.title('Correlation between all the Variables', size=16, color='red')
plt.show()

<h3><b> Why did you choose this specific chart

>Correlation analysis is a statistical technique used to measure the strength and direction of the relationship between two or more variables. It is used to determine whether and how strongly two variables are related to each other.

 <h3><b>What is/are the insight(s) found from the chart?

 >After removal of the dew point temperature column, the multicollinearity among the independent variables is reduced.

<h3><b>Convert columns to data types suitable for machine learning

In [None]:
# Convert hour, month and the weekend flag to the category dtype

cols=['hour','month','weekdays_weekend']
for col in cols:
  df[col]= df[col].astype('category')

<B>Converting snowfall, rainfall and visibility to categorical attributes:

In [None]:
# Converting snowfall and rainfall to categorical attributes

df['snowfall'] = df['snowfall'].apply(lambda x: 1 if x>0 else 0)
df['rainfall'] = df['rainfall'].apply(lambda x: 1 if x>0 else 0)

<h>Converting visibility to a categorical attribute:

1. pd.cut() is a function from the pandas library that is used to segment data
   into bins. It takes the column df.visibility as the input data to be binned.

2. The bins parameter specifies the bin edges or intervals. In this case, the
   bins are defined as [0, 399, 999, 2001]. By default each interval is open on the left and closed on the right, so the values in the 'visibility' column are divided into three bins: values greater than 0 and less than or equal to 399, values greater than 399 and less than or equal to 999, and values greater than 999 and less than or equal to 2001.

3. The labels parameter specifies the labels to assign to each bin. In this
   case, the labels are [0, 1, 2]. So, the first bin (values <= 399) will be labeled as 0, the second bin (values > 399 and <= 999) will be labeled as 1, and the third bin (values > 999 and <= 2001) will be labeled as 2.

4. The resulting bin labels are assigned back to the 'visibility' column of the
   DataFrame, replacing the original values with the binned ones.
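
The binning steps above can be tried on a few made-up visibility readings. The values are hypothetical and chosen to sit on the bin edges so that the right-closed intervals are visible.

```python
import pandas as pd

# Hypothetical visibility readings in units of 10 m
visibility = pd.Series([27, 399, 400, 999, 1000, 2000])

binned = pd.cut(visibility, bins=[0, 399, 999, 2001], labels=[0, 1, 2])
print(binned.tolist())

# Intervals are right-closed and exclude the lowest edge, so a reading of 0 becomes NaN
edge_case = pd.cut(pd.Series([0]), bins=[0, 399, 999, 2001], labels=[0, 1, 2])
print(edge_case.isna().tolist())
```

If readings of exactly 0 could occur, passing `include_lowest=True` to `pd.cut` would keep them in the first bin.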

In [None]:
# Binning the 'visibility' column into 3 categories using Pandas cut() function.

df['visibility'] = pd.cut(df.visibility,bins=[0,399,999,2001],labels=[0,1,2])

In [None]:

# Converting categorical columns to numerical using numpy's where() function.

df['func_day'] = np.where(df['func_day'] == 'Yes',1,0)
df['holiday'] = np.where(df['holiday'] == 'Holiday', 1,0)

<h3>In the data, 'month', 'hour' and 'seasons' are nominal categorical variables, so we apply one-hot encoding to them.

In [None]:
# Converting categorical to float and integer

df['visibility'] = df['visibility'].astype(float)
df['weekdays_weekend'] = df['weekdays_weekend'].astype(int)

<H> Apply one hot encoding to convert data into numerical format
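
A small sketch of what one-hot encoding with `drop_first=True` produces, using a made-up four-row frame. The `dtype=int` argument is an optional choice made here so the indicator columns print as 0/1.

```python
import pandas as pd

# Hypothetical rows, one per season
toy = pd.DataFrame({
    "seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "humidity": [37, 50, 65, 48],
})

# drop_first=True drops one level per feature (the first in sorted order, here 'Autumn')
encoded = pd.get_dummies(toy, columns=["seasons"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The dropped level becomes the baseline: a row with zeros in all three season columns is an Autumn row. Dropping one level avoids perfectly collinear indicator columns in linear models.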

In [None]:
# One-Hot Encoding of Categorical Features in Dataset

df= pd.get_dummies(df,columns = ['month','hour','seasons'],drop_first=True)

<h> convert the data into dependent and independent variable

In [None]:
# Assigning the independent variables (X) and the dependent variable (y).
# The target is square-root transformed, so every metric below is on the sqrt(count) scale.

X = df.drop(columns=['rented_bike_count'])
y = np.sqrt(df['rented_bike_count'])

In [None]:
X.head()

In [None]:
y.head()
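
> Because y is the square root of the bike count, errors computed on y are not in bikes. The sketch below, with made-up numbers and a hypothetical `rmse` helper, shows how squaring the predictions moves an error back to the count scale.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two array-likes."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical true counts and predictions made on the sqrt scale
counts = np.array([100.0, 400.0, 900.0])
sqrt_preds = np.array([11.0, 19.0, 31.0])

# Error on the transformed scale (what the notebook's metrics report)
rmse_sqrt = rmse(np.sqrt(counts), sqrt_preds)

# Square the predictions to return to bike counts, then score again
count_preds = sqrt_preds ** 2
rmse_counts = rmse(counts, count_preds)

print(rmse_sqrt, rmse_counts)
```

> An error of 1 on the square-root scale corresponds to a much larger error in bikes when the counts are high, so the two figures should not be compared directly.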

<h1><B>Data Splitting


<h2> Why we use data splitting<h>

>Dividing the data into training and testing sets is a common approach in machine learning to evaluate the performance of a model. The idea is to use the training data to estimate the parameters of the model, and the testing data to evaluate the performance of the model on new, unseen data.

>An 80/20 split is a common convention. It is sometimes loosely associated with the Pareto principle, but that is only a mnemonic: the principle says nothing about how to partition data. What matters is that the training set is large enough to estimate the model's parameters and the test set is large enough to give a stable estimate of its performance.

>The choice of split ratio therefore depends on the size of the dataset and the complexity of the model. With a large dataset, even a small test fraction (e.g., 90/10) contains enough rows for a reliable evaluation. With a small dataset, a single hold-out split is noisy whichever ratio is used, and cross-validation is usually a better option.

>In general, the goal is to balance the variance of the parameter estimates against the variance of the performance statistics, so that neither is too high. With about 8760 rows, a 70:30 split leaves roughly 6100 rows for training and 2600 for testing, which is ample for both, so I chose a 70:30 ratio.
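
> A quick way to see what a 70:30 split produces is to run train_test_split on stand-in arrays. The data below is synthetic, and only the row count is taken from the dataset description.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset described above
n_rows = 8760
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(len(X_tr), len(X_te))

# The same random_state gives an identical partition on a second call
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```

> Fixing `random_state` keeps the partition reproducible, so metrics from different models are computed on the same test rows.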


In [None]:
# Dividing the dataset into train and test set

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)

In [None]:
X_train

<h1><b> Data Scaling<h1>

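<h2><b>Which method have you used to scale your data and why?

<h3>We use Min-Max scaling to scale the data

> I used MinMaxScaler because it is a linear rescaling that preserves the shape of each feature's original distribution.
  Note that MinMaxScaler does not reduce the influence of outliers: a single extreme value compresses the rest of the feature into a narrow range. The default output range is 0 to 1 on the data the scaler was fitted on. The model cells that follow are fitted on the unscaled X_train and X_test rather than on X_train_scaled and X_test_scaled. This does not change ordinary least squares or tree-based fits, but it does affect the penalized models (Ridge and Lasso).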
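
> The sketch below, on made-up one-column arrays, shows why the scaler is fitted on the training set only: the test set is transformed with the training minimum and maximum, so test values can fall outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train and test sets
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0], [-2.0]])

# Learn min and max from the training data only
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

> Fitting on the training set alone keeps information about the test set out of the preprocessing step.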

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
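# Keep the original row index so the scaled frames stay aligned with y_train / y_test
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)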

<h1><b>ML Model Implementation

<b>First we apply linear regression

In [None]:
# Initializing and fitting the model
from sklearn.linear_model import LinearRegression
regg = LinearRegression().fit(X_train,y_train)

In [None]:
# Predicted Train & Test values

y_pred_train = regg.predict(X_train)
y_pred_test = regg.predict(X_test)

In [None]:
# Checking score

regg.score(X_train,y_train)

In [None]:
# Checking coefficients

regg.coef_

In [None]:
# Performance metrics calculation function
import math
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

def print_metrics(actual, predicted):
  mse = mean_squared_error(actual, predicted)
  print('MSE is: {}'.format(mse))
  print('RMSE is: {}'.format(math.sqrt(mse)))
  print('R2 score is: {}'.format(r2_score(actual, predicted)))
  print('MAE is: {}'.format(mean_absolute_error(actual, predicted)))


# How each metric is computed (shown for the training set):
# MSE    = mean_squared_error(y_train, y_pred_train)
# RMSE   = np.sqrt(MSE)
# MAE    = mean_absolute_error(y_train, y_pred_train)
# R2     = r2_score(y_train, y_pred_train)
# Adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n rows and p features

In [None]:
print_metrics(y_train, y_pred_train)

In [None]:
# Adjusted R2: R2 penalized for the number of predictors p, given n observations
def Adjusted_R2(actual, predicted):
  n = len(actual)          # rows in the set being scored (train or test)
  p = X_train.shape[1]     # number of features
  Adj_R2 = 1 - (1 - r2_score(actual, predicted)) * (n - 1) / (n - p - 1)
  return('Adjusted R2 :', Adj_R2)

In [None]:
Adjusted_R2(y_train, y_pred_train)

In [None]:
print_metrics(y_test, y_pred_test)
Adjusted_R2(y_test, y_pred_test)

>The linear regression model has an MSE of 32.93 and an RMSE of 5.74. Because the target was square-root transformed, these errors are in units of sqrt(rented bike count), not in bikes: a typical prediction is off by about 5.74 on the square-root scale.

>The R2 score of 0.79 indicates that the model explains around 79% of the variance in the transformed target, which is a reasonable fit.

>The MAE of 4.44 means the average absolute difference between predicted and actual values is 4.44, again on the square-root scale.

>The adjusted R2 score of 0.78 is almost equal to the R2 score, so the number of features is not inflating R2. Whether the model overfits is judged by comparing the training and test metrics.

>Overall, these results suggest that linear regression is a usable baseline for this data.
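
> The adjusted R2 formula can be tried in isolation. The helper below is a standalone sketch (the name `adjusted_r2` and the row and feature counts are assumptions for illustration, not values taken from the notebook's output).

```python
def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_rows - n_features - 1 <= 0:
        raise ValueError("need more rows than features + 1")
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

# Hypothetical sizes: 2628 test rows; the real feature count depends on the encoding
few = adjusted_r2(0.79, n_rows=2628, n_features=47)
many = adjusted_r2(0.79, n_rows=2628, n_features=1000)
print(round(few, 4), round(many, 4))
```

> With few features relative to rows the adjustment is small, which is why R2 and adjusted R2 are so close here. The penalty only becomes large when the feature count approaches the row count.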

In [None]:
# If our model is perfect, residuals would all be zeros

test_residuals = y_test - y_pred_test
test_residuals

In [None]:
# residual plot

sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='r', ls='--');

In [None]:
# Distribution of the test residuals (histogram with KDE)

sns.displot(test_residuals, bins=50, kde=True);

In [None]:
from scipy import stats

# Create a figure and axis to plot on
fig, ax = plt.subplots(figsize=(6,8),dpi=100)
# Normal probability (Q-Q) plot of the residuals; probplot also returns the raw values if needed
_ = stats.probplot(test_residuals,dist='norm',plot=ax)

In [None]:
# Plot between actual target variable vs Predicted one

plt.figure(figsize=(18,6))
plt.plot(y_pred_test[:250], color='r')
plt.plot(np.array(y_test)[:250], color='g')
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Checking heteroscedasticity

plt.figure(figsize=(8,6))
plt.scatter((y_pred_test),(y_test)-(y_pred_test),marker='x')
plt.title('Heteroscedasticity of Linear model')

<h><B> Observation

> The residual plot and the normal probability plot show that the linear model is far from perfect: the residuals do not behave like pure noise. We therefore move on to more flexible models.

<h2><b>Apply Polynomial Regression

In [None]:
# importing PolynomialFeatures from sklearn
from sklearn.preprocessing import PolynomialFeatures

# polynomial converter (degree-2 terms, no bias column)
polynomial_converter = PolynomialFeatures(degree=2,include_bias=False)

In [None]:
# Expand X to degree-2 terms, then split with the same settings as before
# so the rows line up with y_train and y_test
poly_features = polynomial_converter.fit_transform(X)
X_train_poly, X_test_poly = train_test_split(poly_features, test_size=0.3, random_state=0)

In [None]:
from sklearn.linear_model import LinearRegression

# Fit a linear model on the polynomial features
model = LinearRegression(fit_intercept=True)
model.fit(X_train_poly,y_train)
y_pred_train_poly = model.predict(X_train_poly)
y_pred_test_poly = model.predict(X_test_poly)

In [None]:
print_metrics(y_train, y_pred_train_poly)

print_metrics(y_test, y_pred_test_poly)

>The figures recorded for this step were an MSE of 32.80, an RMSE of 5.73, an MAE of 4.39, an R2 score of 0.79 and an adjusted R2 score of 0.78, which are almost identical to the linear regression results.

>These figures came from scoring the plain linear model's predictions (y_pred_train and y_pred_test), not a model fitted on poly_features, so they do not show that the degree-2 terms improve the fit. The metrics printed by the cells above are the ones to compare with the linear baseline.

>A degree-2 expansion of the one-hot encoded features adds many columns, so the polynomial model should be judged on its test metrics: a training error well below the test error would indicate overfitting.
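
> To get a feel for how quickly a degree-2 expansion grows, the sketch below counts the output columns of PolynomialFeatures. The helper `n_poly_terms` and the column count of 47 are assumptions for illustration, since the real count depends on the encoding.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_terms(n_features, degree=2):
    # Number of monomials of degree 1..degree in n_features variables (no bias column)
    return comb(n_features + degree, degree) - 1

# Check the formula against scikit-learn on a 3-feature input
small = PolynomialFeatures(degree=2, include_bias=False).fit_transform(np.ones((1, 3)))
print(small.shape[1], n_poly_terms(3))

# Hypothetical column count after one-hot encoding
print(n_poly_terms(47))
```

> With well over a thousand columns from a few dozen inputs, an unregularized fit can overfit, which is one reason to compare train and test metrics for this model.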

<h1><B> Apply Regularized Regression


<h6>Regularization minimizes the RSS (residual sum of squares) plus a penalty term. The penalty grows with the size of the coefficients, so it discourages models whose coefficients are too large.<h>
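
> Written out, the two penalized objectives used in this section look as follows. This is a sketch in the form minimized by scikit-learn's Ridge and Lasso, with the intercept omitted for brevity. Here \(\alpha\) is the regularization strength, \(n\) the number of rows and \(p\) the number of features.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \alpha \sum_{j=1}^{p} \beta_j^{2}

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \frac{1}{2n} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

> The squared (L2) penalty shrinks all coefficients smoothly, while the absolute-value (L1) penalty can set some of them exactly to zero.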



<h2>First we apply Ridge Regression(L2 Regularization)

In [None]:
# Initializing ridge regression
from sklearn.linear_model import Ridge

ridge = Ridge(alpha = 0.1)

ridge.fit(X_train,y_train)

In [None]:
# Predicted Train & Test values

y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
#checking score

ridge.score(X_train,y_train)

<h2><b>Choosing an Alpha value with cross-validation

In [None]:
from sklearn.linear_model import RidgeCV

ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0),scoring='neg_mean_absolute_error',cv=5)

In [None]:
ridge_cv_model.fit(X_train,y_train)
ridge_cv_model.alpha_

In [None]:
test_predictions = ridge_cv_model.predict(X_test)

In [None]:
print_metrics(y_test, test_predictions)
Adjusted_R2(y_test, test_predictions)

In [None]:
# Plot between actual target variable vs Predicted one

plt.figure(figsize=(18,6))
plt.plot(test_predictions[:250], color='r')
plt.plot(np.array(y_test)[:250], color='g')
plt.legend(["Predicted","Actual"])
plt.show()

>The ridge regression model gives essentially the same results as the linear regression model: an MSE of 32.93, an RMSE of 5.74, an R2 score of 0.79, an MAE of 4.44 and an adjusted R2 score of 0.78. At the alpha chosen by cross-validation, the L2 penalty barely changes the fit.

>Overall, ridge regression does not provide a measurable improvement over linear regression here. It can still be useful for reducing the impact of multicollinearity among the predictors and for stabilizing the coefficient estimates.
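
> The alpha selection step can be reproduced on synthetic data. The sketch below uses the same RidgeCV arguments as above on a made-up regression problem. The data and coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (hypothetical coefficients and noise level)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Each candidate alpha is scored by 5-fold cross-validated mean absolute error
alphas = (0.1, 1.0, 10.0)
ridge_cv = RidgeCV(alphas=alphas, scoring="neg_mean_absolute_error", cv=5)
ridge_cv.fit(X_demo, y_demo)

print("chosen alpha:", ridge_cv.alpha_)
print("coefficients:", ridge_cv.coef_.round(2))
```

> RidgeCV can only return one of the listed candidates, so a result at either end of the list is a hint to widen the range.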

<h1><b>Lasso Regression(L1 Regularization) along with cross-validation

In [None]:
from sklearn.linear_model import LassoCV

In [None]:
# Initializing lasso regression with a cross-validated alpha

lasso_cv_model = LassoCV(eps=0.001,n_alphas=100,cv=3,max_iter=1000000)
lasso_cv_model.fit(X_train, y_train)

In [None]:
# Creating the model score

print(lasso_cv_model.score(X_test, y_test))
print(lasso_cv_model.score(X_train, y_train))

In [None]:
lasso_cv_model.fit(X_train,y_train)

In [None]:
lasso_cv_model.alpha_

In [None]:
test_predictions = lasso_cv_model.predict(X_test)

In [None]:
# Calculate the performance metrics

print_metrics(y_test, test_predictions)
Adjusted_R2(y_test, test_predictions)

In [None]:
# Predicted Train & Test values

y_pred_train_lasso=lasso_cv_model.predict(X_train)
y_pred_test_lasso=lasso_cv_model.predict(X_test)

In [None]:
# Plot between actual target variable vs Predicted one
plt.figure(figsize=(18,6))
plt.plot(test_predictions[:250], color='r')
plt.plot(np.array(y_test)[:250], color='g')
plt.legend(["Predicted","Actual"])
plt.show()

>The lasso regression model has a higher MSE (37.43) and RMSE (6.12) than both the linear regression and ridge regression models, indicating a worse fit. The R2 score of 0.76 means it explains a smaller share of the variance in the target, and the MAE of 4.79 is also higher than the 4.44 of the linear and ridge models. The adjusted R2 score (0.75) is the lowest so far. This pattern is more consistent with the L1 penalty shrinking useful coefficients too far (underfitting) than with overfitting.

>Overall, these results suggest that the lasso regression model does not perform as well as the other models in predicting the target variable. Lasso regression can be useful in selecting a subset of predictor variables and reducing the complexity of the model, but in this case, it seems that the full set of predictor variables may be necessary for a better fit.
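
> The variable-selection effect of the L1 penalty can be seen on synthetic data where only one feature matters. This is a sketch with made-up data and an arbitrarily chosen alpha of 0.5, not a reproduction of the model above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Only the first of five features drives the target (hypothetical setup)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=500)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("OLS coefficients:  ", ols.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
```

> The lasso sets the irrelevant coefficients to exactly zero but also shrinks the useful one, which is the trade-off behind the weaker fit reported above.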

<h1><b>Decision Tree Regression

In [None]:
# Initializing the model

from sklearn.tree import DecisionTreeRegressor
dt_regressor = DecisionTreeRegressor(criterion='squared_error', max_depth=8, max_features=9, max_leaf_nodes=100)

In [None]:
dt_regressor.fit(X_train,y_train)

In [None]:
#Train Test values
y_pred_train_d = dt_regressor.predict(X_train)
y_pred_test_d = dt_regressor.predict(X_test)

In [None]:
# Calculating Performance Metrics for training data
print_metrics((y_train), (y_pred_train_d))

In [None]:
# Calculating Performance Metrics for Test data
print_metrics((y_test), (y_pred_test_d))

In [None]:
#adjusted R2 score
Adjusted_R2((y_train), (y_pred_train_d))

In [None]:
# Plot between actual target variable vs Predicted one
plt.figure(figsize=(18,6))
plt.plot(y_pred_test_d[:250], color='r')
plt.plot(np.array(y_test)[:250], color='g')
plt.legend(["Predicted","Actual"])
plt.show()

>The decision tree model has the weakest fit of the models tried, with an MSE of 59.55, an RMSE of 7.72, an R2 score of 0.61 and an MAE of 5.45, all worse than the linear models.

>The adjusted R2 score above is computed on the training set only, so on its own it says nothing about overfitting. Decision trees can capture nonlinear relationships, but a single tree is prone to overfitting when grown deep and, when constrained as it is here (max_depth=8, max_features=9, max_leaf_nodes=100), may underfit.

>Overall, other models are more suitable for this dataset.
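
> The link between tree depth and overfitting can be illustrated by comparing train and test R2 for trees of different depth. The sketch below uses a made-up one-feature problem, and the depths are chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve as a hypothetical nonlinear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Record (train R2, test R2) for a shallow, a medium and an unconstrained tree
scores = {}
for depth in (2, 8, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

> A large gap between the two scores for the unconstrained tree is the signature of overfitting. A train-only metric cannot reveal it.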

<h1><b>Random Forest Regression

In [None]:
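# Initializing the model
rf_model = RandomForestRegressor(random_state=0)
# Candidate grid for tuning; it is not used below, where the model is fitted with default settings
parameters = {'n_estimators':[500],
             'min_samples_leaf':np.arange(25,31)
             }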

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
#Train test values
y_pred_train_rf = rf_model.predict(X_train)
y_pred_test_rf = rf_model.predict(X_test)

In [None]:
#Calculating Performance Metrics for train data
print_metrics((y_train), (y_pred_train_rf))

In [None]:
#adjusted R2 score
Adjusted_R2((y_train), (y_pred_train_rf))

In [None]:
# Calculating Performance Metrics for Test data
print_metrics((y_test), (y_pred_test_rf))

In [None]:
#adjusted R2 score
Adjusted_R2((y_test), (y_pred_test_rf))

In [None]:
# Feature importances
rf_model.feature_importances_

In [None]:
importances = rf_model.feature_importances_

importance_dict = {'Feature' : list(X_train_scaled.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:

# features = X_train_scaled.columns
# importances = rf_model.feature_importances_
# indices = np.argsort(importances)

In [None]:
plt.style.use('dark_background')

In [None]:
# Feature importances

rf_feat_imp = pd.Series(rf_model.feature_importances_, index=X.columns)
plt.figure(figsize=(15,5),dpi=200)
plt.title('Feature Importances: RANDOM FORESTS',fontsize=16,color='red')
plt.xlabel('Relative Importance')
rf_feat_imp.nlargest(20).plot(kind='barh', color='r')

>The random forest regression model has the lowest MSE (15.95), RMSE (3.99) and MAE (2.64) of the models fitted so far, indicating the best fit up to this point.

>The R2 score of 0.89 means it explains a high proportion of the variance. The adjusted R2 score is close to R2; overfitting is assessed by comparing the training and test metrics above.

>The random forest model is powerful and can handle nonlinear relationships, interactions, and high-dimensional data.

>Overall, it performs well in predicting the target variable and is a good choice for this dataset.
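
> How impurity-based feature importances behave can be checked on synthetic data where the answer is known. In this sketch the column names and data are made up, and only the first column carries signal.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One informative column and two pure-noise columns (hypothetical names)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
y_demo = 5.0 * X_demo["signal"] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# feature_importances_ is normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

> Because the importances are relative shares, they rank features within one model but are not comparable across models trained on different data.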

<h1><b>XG Boost Regression

In [None]:
# Initializing the model

xgb_r = xgb.XGBRegressor()

In [None]:
#Fitting the model
xgb_r.fit(X_train, y_train)

In [None]:
#Train Test values
y_pred_train_xgb = xgb_r.predict(X_train)
y_pred_test_xgb = xgb_r.predict(X_test)

In [None]:
# Calculating Performance Metrics for train data
print_metrics((y_train), (y_pred_train_xgb))

In [None]:
#adjusted R2 score
Adjusted_R2((y_train), (y_pred_train_xgb))

In [None]:
# Calculating Performance Metrics for Test data
print_metrics((y_test), (y_pred_test_xgb))

In [None]:
#adjusted R2 score
Adjusted_R2((y_test), (y_pred_test_xgb))

In [None]:
# Feature importances
xgb_r.feature_importances_

In [None]:
features = X_train.columns
importances = xgb_r.feature_importances_
indices = np.argsort(importances)

In [None]:
# Feature importances
xgb_feat_imp = pd.Series(xgb_r.feature_importances_, index=X.columns)
plt.figure(figsize=(15,5),dpi=200)
plt.title('Feature Importances: XG Boost regression',fontsize=16,color='red')
plt.xlabel('Relative Importance')
xgb_feat_imp.nlargest(20).plot(kind='barh', color='r')
plt.show()

>The XGBoost regression model has a lower MSE, RMSE, and MAE than the linear models, suggesting a good fit. The R2 score of 0.86 indicates that it explains a high proportion of the variance in the target variable, although with default settings it is still below the random forest (0.89). The adjusted R2 score of 0.86 matches R2, so the number of features is not inflating the score.

>XGBoost is a powerful algorithm that can handle nonlinear relationships and interactions in the data, and is known for its high predictive accuracy. Overall, these results suggest that the XGBoost model is a good choice for this dataset, as it performs well in predicting the target variable.

<h1><b>Hyperparameter Tuning :

<h2>XG Boost Regressor with GridSearchCV

In [None]:
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Note: min_samples_split and min_samples_leaf are scikit-learn tree parameters.
# XGBRegressor does not use them, so they are left out of the grid;
# XGBoost's closest equivalent is min_child_weight.

# Hyperparameter grid
parameter_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth}

In [None]:
parameter_dict

In [None]:
# Create an instance of the XG Boost Regressor
xg_boost = xgb.XGBRegressor()

# Grid search
xg_grid = GridSearchCV(estimator=xg_boost,
                       param_grid = parameter_dict,
                       cv = 5, verbose=2)

xg_grid.fit(X_train,y_train)

In [None]:
xg_grid.best_estimator_

In [None]:
xg_optimal_model = xg_grid.best_estimator_

In [None]:
#Train Test values
y_pred_train_xg_opt = xg_optimal_model.predict(X_train)
y_pred_test_xg_opt= xg_optimal_model.predict(X_test)

In [None]:
# Calculating Performance Metrics for train data
print_metrics((y_train), (y_pred_train_xg_opt))

In [None]:
# adjusted R2 score
Adjusted_R2((y_train), (y_pred_train_xg_opt))

In [None]:
# Calculating Performance Metrics for test data
print_metrics((y_test), (y_pred_test_xg_opt))

In [None]:
#adjusted R2 score
Adjusted_R2((y_test), (y_pred_test_xg_opt))

In [None]:
xg_optimal_model.feature_importances_

In [None]:
xg_optimal_model.fit(X_train,y_train)

In [None]:
features = X_train.columns
importance = xg_optimal_model.feature_importances_
index = np.argsort(importance)

In [None]:
# Feature importances
xgb_opt_feat_imp = pd.Series(xg_optimal_model.feature_importances_, index=X.columns)
plt.figure(figsize=(15,5),dpi=200)
plt.title('Feature Importances: XG Boost Regressor with GridSearchCV',fontsize=16,color='red')
plt.xlabel('Relative Importance')
xgb_opt_feat_imp.nlargest(20).plot(kind='barh', color='r')
plt.show()


>The XGBoost Regressor tuned with GridSearchCV has the lowest MSE (14.27), RMSE (3.78), and MAE (2.59) of all the models, indicating the best fit. The R2 score of 0.91 means that it explains a high proportion of the variance in the target variable, and the adjusted R2 score of 0.91 shows that the number of features is not inflating that figure.

>GridSearchCV selects the best combination of hyperparameters among those listed in the grid, using 5-fold cross-validation on the training set. This raised the R2 score from 0.86 with default settings to 0.91.

>Overall, these results suggest that the tuned XGBoost Regressor is a good choice for this dataset.
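
> The mechanics of a grid search can be sketched on a small synthetic problem. To keep the example free of extra dependencies it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost. The data, the grid values and the choice of 3 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic nonlinear regression problem (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# Every combination in the grid is one candidate model
param_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold, then the best one is refitted on all the data
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

print(n_candidates, "candidates")
print(search.best_params_, round(search.best_score_, 3))
```

> The number of fits is the number of candidates times the number of folds, so every extra grid entry multiplies the run time. The winner is only the best of the listed values, not a global optimum.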

# **Conclusion**

* The goal of this project is to develop a machine learning model that can accurately predict the demand for rental bikes based on different weather and other conditions.

* The XG Boost prediction model had the lowest RMSE.

* After applying several regression models to the dataset, it can be concluded that the XGBoost Regressor with GridSearchCV provides the best results, with an MSE of 14.27, RMSE of 3.78, R2 score of 0.91, and MAE of 2.59 (all on the square-root scale of the target). Its adjusted R2 score of 0.91 equals its R2 score, so the number of features is not inflating the fit.

* The Random Forest Regressor also showed promising results with an MSE of 15.95, RMSE of 3.99, R2 score of 0.89, and MAE of 2.64.

* The XG Boost Regressor with GridSearchCV has a slightly lower MSE, RMSE, and MAE, and a higher R2 score and adjusted R2 score, than the Random Forest Regression model, which makes it the better model for predicting the target variable in this dataset. Both models show promising results, and the final choice for deployment depends on the business need. If the highest accuracy is required, the XG Boost Regressor should be deployed. If model interpretability matters more to the stakeholders, the Random Forest Regression model can be considered.

* The Linear, Polynomial and Ridge Regression models reached an R2 score of about 0.79, with an MSE roughly twice that of the ensemble models.

* The Lasso Regression and Elastic Net Regression models did not perform as well, with higher error rates and lower R2 scores.

* The Decision Tree Regressor showed relatively lower performance, with an MSE of 59.55, RMSE of 7.72, R2 score of 0.61, and MAE of 5.45.

* Overall, the XGBoost Regressor with GridSearchCV is the most appropriate model for this dataset, as it provides high accuracy and robustness in predicting the target variable.
