# Exploring Urban Mobility: A Comprehensive Analysis of Bike Sharing System Data

Developed by Uma Sivakumar - 834006815

## 2. Introduction

#### Hypothesis (Objectives / Goals)

1. Examine how different weather conditions impact bike rentals.
2. Determine if there are differences in bike rentals on working days and non-working days.
3. Investigate the impact of holidays on bike rentals.
4. Determine whether temperature has a discernible impact on bike rentals.
5. Assess how well these variables predict bike demand.


#### Significance of the project

We model the demand for shared bikes using the available independent variables. The management of bike companies could use the model to understand how demand varies with different features. They can adjust their business strategy accordingly to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

By leveraging the rich dataset provided by bike-sharing systems, the research aims to provide a comprehensive understanding of urban mobility patterns. This knowledge can inform city planning, transportation policies, and infrastructure development, ultimately leading to more sustainable and efficient urban environments. Furthermore, the project's findings may have implications for addressing broader issues related to smart city initiatives and the integration of data-driven solutions into urban planning.


#### Background of the project

Bike sharing systems represent an innovative evolution of traditional bike rentals, automating the entire process from membership acquisition to bike rental and return. Users can seamlessly rent a bike from one location and return it to another, marking a departure from conventional rental systems. Currently, there are over 500 bike-sharing programs globally, collectively offering more than 500,000 bicycles. These systems have garnered significant interest due to their pivotal role in addressing traffic congestion, environmental concerns, and public health.

Beyond their practical applications, bike-sharing systems generate data with distinctive characteristics that make them intriguing for research. Unlike other transportation services like buses or subways, bike-sharing systems explicitly record travel duration, departure, and arrival positions. This unique feature transforms bike-sharing systems into virtual sensor networks capable of sensing mobility patterns within a city. Consequently, the project anticipates that important urban events can be detected by monitoring this data.

The primary objective of the project is to model the demand for shared bikes by considering various independent variables. This modeling can prove invaluable for the management of bike companies, offering insights into how demand varies with different features. Armed with this understanding, management can strategically manipulate business approaches to meet demand levels and exceed customer expectations. Moreover, the model provides a valuable tool for comprehending the dynamics of demand in new markets.

Through the utilization of the expansive dataset provided by bike-sharing systems, the research endeavors to provide a comprehensive understanding of urban mobility patterns. The insights derived from this analysis can play a crucial role in informing city planning, shaping transportation policies, and guiding infrastructure development. The ultimate aim is to contribute to the creation of more sustainable and efficient urban environments. Furthermore, the project's findings may extend to addressing broader issues related to smart city initiatives and the integration of data-driven solutions into urban planning practices.

## 3. Data Description

**Source of Information :** Kaggle - https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset
    
**Description :** Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike from one position and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

**Attribute Information**

We will be using the day.csv file, which has the following fields:

day.csv - bike sharing counts aggregated on daily basis.

1. instant: record index  - (type is **Discrete Quantitative Variable**)
2. dteday : date          - (type is **DateTime**)
3. season :               - (type is **Nominal categorical variable**)
    - 1: spring
    - 2: summer
    - 3: fall
    - 4: winter
4. yr : year (0: 2011, 1:2012)               - (type is **Ordinal categorical variable**)
5. mnth : month (1 to 12)                   - (type is **Nominal categorical variable**)
6. holiday : whether the day is a holiday or not   - (type is **Nominal categorical variable**)
7. weekday : day of the week                 - (type is **Nominal categorical variable**)
8. workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.   - (type is **Nominal categorical variable**)
9. weathersit :                                                               - (type is **Nominal categorical variable**)
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
10. temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)           - (type is **Continuous Quantitative Variable**)
11. atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)  - (type is **Continuous Quantitative Variable**)
12. hum: Normalized humidity. The values are divided by 100 (max)          - (type is **Continuous Quantitative Variable**)
13. windspeed: Normalized wind speed. The values are divided by 67 (max)   - (type is **Continuous Quantitative Variable**)
14. casual: count of casual users                                          - (type is **Discrete Quantitative Variable**)
15. registered: count of registered users                                  - (type is **Discrete Quantitative Variable**)
16. cnt: count of total rental bikes including both casual and registered  - (type is **Discrete Quantitative Variable**)
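The normalization formulas in the attribute list can be inverted to recover approximate physical values. The following is a minimal sketch in plain Python, assuming the constants quoted above apply to the file in use (the list notes that they are stated for the hourly scale). The function names are illustrative and are not part of the dataset or the project code.

```python
# Constants quoted in the attribute list above.
T_MIN, T_MAX = -8.0, 39.0      # temp, degrees Celsius
AT_MIN, AT_MAX = -16.0, 50.0   # atemp ("feels like"), degrees Celsius


def denormalize(value, lo, hi):
    """Invert (t - lo) / (hi - lo): map a value in [0, 1] back to [lo, hi]."""
    return value * (hi - lo) + lo


def temp_celsius(temp):
    """Recover temperature in Celsius from the normalized temp column."""
    return denormalize(temp, T_MIN, T_MAX)


def humidity_percent(hum):
    """hum was divided by 100, so multiply back."""
    return hum * 100.0


def windspeed_raw(windspeed):
    """windspeed was divided by 67 (the maximum), so multiply back."""
    return windspeed * 67.0


# The midpoint of the normalized range maps to the midpoint of -8..39.
print(temp_celsius(0.5))
```

The same `denormalize` helper works for `atemp` by passing `AT_MIN` and `AT_MAX`.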

## Data Preprocessing

Preprocessing steps that will be done in the course of the project:

**1. Handling Missing Values, Outliers, and Data Quality Issues:**

   - <span style="color:green">Missing Values:</span> Identifying and handling any missing values in the dataset. This could involve imputing missing values using methods like mean, median, or dropping rows or columns with missing values depending on the impact.
      - After checking for NULL values in our bikeShare dataset we can observe that there were no NULL values present.
      
      
   - <span style="color:green">Outliers:</span> Detect and address outliers. This might involve visual inspection using box plots or statistical methods like the IQR (Interquartile Range) to filter out extreme values.
      - Analyzing outliers for attribute <span style="color:blue">temp</span> using the boxplot visualization.
      
      ![download.png](attachment:download.png)
      
      We can observe that there is no outlier in the variable temp.
      <br/>
      <br/>
      
      - Analyzing outliers for attribute <span style="color:blue">temp_feel</span> using the boxplot visualization.
      
      ![download-2.png](attachment:download-2.png)
      
      We can observe that there is no outlier in the variable temp_feel.
      <br/>
      <br/>
      
      - Analyzing outliers for attribute <span style="color:blue">humidity</span> using the boxplot visualization.
      
      ![download-3.png](attachment:download-3.png)
      
      We have identified outliers in the "humidity" attribute using the Interquartile Range (IQR) statistical method. The analysis revealed 2 outliers. To address this, we established lower and upper bounds, calculated as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, respectively. Subsequently, we filtered the dataset to retain only the data points with humidity values falling within these bounds, effectively removing the identified outliers.
      
      The humidity distribution, post the removal of outliers using the calculated bounds, is depicted below:
     
      ![download-4.png](attachment:download-4.png)
      
      <br/>
      <br/>
      
      - Analyzing outliers for attribute <span style="color:blue">windspeed</span> using the boxplot visualization.
      
      ![download-5.png](attachment:download-5.png)
      
      We have identified outliers in the "windspeed" attribute using the Interquartile Range (IQR) statistical method. The analysis revealed 2 outliers. To address this, we established lower and upper bounds, calculated as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, respectively. Subsequently, we filtered the dataset to retain only the data points with windspeed values falling within these bounds, effectively removing the identified outliers.
      
      The windspeed distribution, post the removal of outliers using the calculated bounds, is depicted below:
      
      ![download-6.png](attachment:download-6.png)
      
      <br/>
      <br/>
      
      - Analyzing outliers for attribute <span style="color:blue">casual</span> using the boxplot visualization.
      
      ![download-7.png](attachment:download-7.png)
      
      We have identified outliers in the "casual" attribute using the Interquartile Range (IQR) statistical method. The analysis revealed 2 outliers. To address this, we established lower and upper bounds, calculated as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, respectively. Subsequently, we filtered the dataset to retain only the data points with casual values falling within these bounds, effectively removing the identified outliers.
      
      The casual users distribution, post the removal of outliers using the calculated bounds, is depicted below:
      
      ![download-8.png](attachment:download-8.png)
      
      <br/>
      <br/>
      
      - Analyzing outliers for attribute <span style="color:blue">registered</span> using the boxplot visualization.
      
      ![download-9.png](attachment:download-9.png)
      
      We can observe that there is no outlier in the variable registered.
      <br/>
      <br/>
      
      
   - <span style="color:green">Data Quality Issues:</span> Check for any inconsistencies or errors in the data. This could include renaming columns, correcting typos, addressing duplicate entries, or ensuring consistency in categorical variables.
      - <span style="color:blue">Renamed</span> the columns for better understanding and readability.
         - yr → year
         - mnth → month
         - weathersit → weather_situation
         - atemp → temp_feel
         - hum → humidity
         - cnt → count
         
      - Ensuring <span style="color:blue">consistency</span> in categorical variables (by converting their numeric codes to meaningful categories)
         - season → (1 : 'spring', 2 : 'summer', 3 : 'fall', 4 : 'winter')
         - month → (1 : 'January', 2 : 'February', 3 : 'March', 4 : 'April', 5 : 'May', 6 : 'June', 7 : 'July', 8 : 'August', 9 : 'September', 10 : 'October', 11 : 'November', 12 : 'December')
         - weekday → (0 : 'Monday', 1 : 'Tuesday', 2 : 'Wednesday', 3 : 'Thursday', 4 : 'Friday', 5 : 'Saturday', 6 : 'Sunday')
         - weather_situation → (1 : 'Clear', 2 : 'Mist', 3 : 'Light Snow', 4 : 'Heavy Rain')
         - year → (0 : "2011", 1 : "2012")
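The IQR rule applied above to humidity, windspeed, and casual can be sketched as follows. This is a dependency-free illustration in plain Python rather than the notebook's own code; the helper names and the toy rows are assumptions, and the quartile interpolation method chosen here may differ slightly from the one used in the project.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose `column` value lies within the IQR fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


# Toy rows, not taken from day.csv.
rows = [{"humidity": h} for h in (0.50, 0.55, 0.60, 0.62, 0.65, 0.70, 0.05)]
kept = drop_outliers(rows, "humidity")
print(len(rows), "->", len(kept))
```

Filtering one column at a time, as here, removes whole rows, so the order in which columns are filtered can change which rows survive.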

**2. Building a Common Dataset:**

   - Ensure that all relevant data is integrated into a unified dataset. This may involve merging datasets, handling different formats, or addressing data inconsistencies.
   
      - <span style="color:blue"><u>Removing</u></span> insignificant data from the dataset for our regression hypothesis - <span style="color:blue">instant, dteday</span>
      - We later <span style="color:blue"><u>remove</u> temp_feel</span> (originally atemp) from the sub-dataset to create a new relevant dataset after observing a high positive correlation between temp and temp_feel, indicating <span style="color:blue">multicollinearity</span>.
      
      The sub-dataset that will be used for regression is depicted below (head only, for illustration):
      
      ![subdataset.png](attachment:subdataset.png)

**3. Transforming Variables:**

   - <span style="color:green">Normalization (scaling):</span> If variables have different scales, normalization can be applied to bring them to a similar scale. Normalization methods that could be used are Min-Max scaling or Z-score normalization.
   
   We have used Min-Max scaler to normalize or rescale the data here.
   
   Sample train data after normalization:
   
   ![MinMax_train.png](attachment:MinMax_train.png)

   Sample test data after normalization:
   
   ![MinMax_test.png](attachment:MinMax_test.png)

- <span style="color:green">Encoding Categorical Variables:</span> Convert categorical variables into a numerical format suitable for machine learning models. This might involve one-hot encoding, or other methods depending on the nature of the categorical data.
   
   Here, we are using One-hot encoding to convert categorical variables into binary columns.
   
   With this, the number of columns increases from 13 to 31.
   
   ![one-hot_encoding.png](attachment:one-hot_encoding.png)
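The two transformations in this step can be sketched in plain Python. This is an illustration only: it assumes the scaling range is learned from the training split and then reused on the test split, and that one level per categorical column is dropped as a baseline. Neither choice is confirmed by the text above, and the `is_` column prefix is hypothetical.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training split only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Rescale with a fitted range; unseen values may fall outside [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values, drop_first=True):
    """Encode labels as 0/1 indicator columns, one dict per input value."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level becomes the all-zeros baseline
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# Toy values, not taken from day.csv.
train_wind, test_wind = [10.0, 20.0, 30.0], [15.0, 35.0]
lo, hi = fit_min_max(train_wind)
print(apply_min_max(test_wind, lo, hi))

print(one_hot(["fall", "spring", "fall", "winter"]))
```

Reusing the training range on the test split keeps information about the test data out of the fitted scaler, at the cost of test values occasionally landing outside the 0 to 1 interval.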

**4. Performing Exploratory Data Analysis (EDA):**

   - Histograms: Show the distribution of numerical variables.
   - Box Plots: Visualize the spread and identify outliers in numerical data.
   - Scatter Plots: Explore relationships between two numerical variables.
   - Bar Plots: Represent the distribution of categorical variables.
   - Correlation Heatmaps: Display the correlation matrix visually.


   - <span style="color:green">Visualizing Data:</span> Use various plots and charts to visualize the data. This would include histograms for distribution analysis, scatter plots to explore relationships, and box plots to identify outliers.
   <br/>
   <br/>
   
      - **Univariate Analysis**
      
      __Examining the skewness and overall distribution of continuous features through the visualization of histograms and kernel density estimation.__
      <br/>
      
      ![download.png](attachment:download.png)
      
      Observations :
         - The distributions of humidity and windspeed resemble a normal distribution, increasing up to a point and then gradually decreasing.
         - The distribution of casual users is highly right-skewed (most values are concentrated at the low end), indicating that fewer casual users use the shared bikes, whereas the distribution of registered users is close to normal.
         - The distributions of temp and temp_feel are similar, each rising and then falling.
         <br/>
         <br/>
         
      __Examining the overall distribution of categorical features with respect to target variable count through the visualization of barplots.__
      
      ![download-2.png](attachment:download-2.png)
      
      Observations :
         - The highest usage of shared bikes occurs during the fall season. This could be due to a variety of reasons, such as schools and universities reopening, corporate appraisals, and more travel to the office.
         - In 2012 there was a drastic increase in the demand for shared bikes, which could contribute to a cleaner, less polluted environment.
         - We can also see more usage of shared bikes on Fridays and Saturdays.
         - Bikes are used mostly during clear weather, but a more surprising discovery is that people use shared bikes even during mist and light snow.
         <br/>
         <br/>
         <br/>
      
      - **Bivariate Analysis**
      
      __Visualizing the relationship between features and the target variable while considering the distinction based on weather situation.__
      
      ![download-14.png](attachment:download-14.png)
      
      Observations : 
         - We can see a clear increase in the usage of bikes during clear weather by both registered and casual users, with registered users being highly positively correlated with the target.
         - We can also see that people use the bikes irrespective of the temperature, but usage is heavily concentrated at mild to medium temperatures.
         <br/>
         <br/>
         
      __Visualizing the relationship between features and the target variable while considering the distinction based on working days.__
      
      ![download-15.png](attachment:download-15.png)
      
      Observations : 
      Bike usage on working days heavily dominates usage on non-working days.
         <br/>
         <br/>
         
      __Visualizing the relationship between features and the target variable while considering the data for the years 2011 and 2012.__
      
      ![download-16.png](attachment:download-16.png)
      
      Observations : 
      Demand increased from 2011 to 2012.
         <br/>
         <br/>

   - <span style="color:green">Finding Correlations:</span> Explore relationships between variables by calculating correlation coefficients. This can help identify which variables are strongly or weakly correlated.
   <br/>
   <br/>
   
      - **Multivariate Analysis**
      
      __Plotting a pairplot to find correlations between features.__
      
      ![download-17.png](attachment:download-17.png)
      
      Observations : 
         - temp_feel and temp have a high linear correlation.
         - temp and casual seem to have some positive correlation with a small slope.
         - temp and registered also seem to have some positive correlation, with a slightly higher slope than casual and temp.
         - casual and registered seem to have a slight positive correlation.
         - casual and windspeed seem to have some negative correlation.
         <br/>
         <br/>
         
      __Plotting a heatmap to find correlations between features.__
      
      ![download-18.png](attachment:download-18.png)
      
      Observations : 
         - The strong correlation of 0.99 between 'temp' and 'temp_feel' indicates a high degree of multicollinearity. Therefore, we are dropping 'temp_feel' from the dataset.

## 4. Methodology

**Statistical Methods to test hypothesis on the data:**
    
   - Various statistical methods can be performed on the dataset, such as the t-test, ANOVA, and A/B testing, depending on the nature of the data and the hypothesis.

I have implemented Pearson's and Spearman's rank correlation as part of the statistical analysis. Pearson's correlation coefficient is a valuable statistical measure for assessing the strength and direction of a linear relationship between two continuous variables. It ranges from -1 to 1: a positive coefficient indicates a positive correlation, while a negative coefficient signifies a negative correlation. Values close to 1 or -1 suggest a strong linear association, whereas those close to 0 indicate a weak correlation.

In exploring relationships between variables, researchers may consider alternative measures like Spearman's rank correlation or Kendall's tau, which are more robust to nonlinear associations and less influenced by outliers. These considerations contribute to a comprehensive analysis, enhancing the reliability and interpretability of findings in research endeavors.
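The difference between the two coefficients can be seen on a small example. The sketch below is a plain-Python illustration with toy data (not values from the dataset); in practice a library routine such as `scipy.stats.pearsonr` or `scipy.stats.spearmanr` would normally be used. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks.

```python
import math


def pearson(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: y grows monotonically with x but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because `y` always increases with `x`, the rank-based coefficient reaches its maximum, while Pearson's coefficient stays below 1 since the relationship is curved.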


**Hypothesis-1**
   
Weather Impact Hypothesis:
   - <span style="color:blue">Null Hypothesis:</span> Weather conditions have no significant impact on bike rentals.
   - <span style="color:blue">Alternative Hypothesis:</span> Different weather conditions affect bike rentals differently.
   
   
  ![Test-1.png](attachment:Test-1.png)

![Test-visual-1.png](attachment:Test-visual-1.png)

**Hypothesis-2**
   
Working Day Influence Hypothesis:
   - <span style="color:blue">Null Hypothesis:</span> Working days and non-working days have no difference in bike rentals.
   - <span style="color:blue">Alternative Hypothesis:</span> The number of bike rentals varies significantly between working days and non-working days
      
      ![Test-2.png](attachment:Test-2.png)

![Test-visual-2.png](attachment:Test-visual-2.png)
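A two-sample comparison of this kind can be sketched with Welch's t statistic. The exact test and its output are in the figure above, so the code below is only an illustration under assumptions: the two groups are treated as independent samples with possibly unequal variances, and the daily counts are toy numbers. The p-value would come from the t distribution with the computed degrees of freedom, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    se2 = va / na + vb / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Toy daily counts, not taken from day.csv.
working = [4500, 4700, 4300, 4900, 4600]
non_working = [4100, 4400, 3900, 4200]
t_stat, dof = welch_t(working, non_working)
print(round(t_stat, 2), round(dof, 1))
```

A large absolute t statistic relative to the t distribution with `dof` degrees of freedom is evidence against the null hypothesis of equal means.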

**Hypothesis-3**
   
Holiday Effect Hypothesis:
   - <span style="color:blue">Null Hypothesis:</span> Holidays do not impact bike rentals.
   - <span style="color:blue">Alternative Hypothesis:</span> Bike rentals experience a change in demand during holidays.
      
      ![Test-3.png](attachment:Test-3.png)

![Test-visual-3.png](attachment:Test-visual-3.png)

**Hypothesis-4**
   
Temperature Impact Hypothesis:
   - <span style="color:blue">Null Hypothesis:</span> Temperature has no effect on bike rentals.
   - <span style="color:blue">Alternative Hypothesis:</span> Bike rentals are influenced by temperature, with specific temperature ranges associated with higher or lower demand.
      
      ![Test-4.png](attachment:Test-4.png)

![Test-visual-4.png](attachment:Test-visual-4.png)

### Model Evaluation and Selection

The regression models I chose for predicting the demand for shared bikes are Multiple Linear Regression, Lasso Regression, Ridge Regression and Elastic Net Regression.

Quantile, Poisson, Negative Binomial, Zero-inflated, Cox, Partial Least Squares, and PCA regression models were deemed unsuitable for this dataset. Quantile regression is typically employed for predicting quantiles, while Poisson and Negative Binomial regressions are designed for forecasting counts of events. In this context, the primary objective is to predict the daily bike rental count ('count') as a continuous response with directly interpretable coefficients, making these models less appropriate. Zero-inflated regression is designed for datasets with a significant number of zero values, which is not characteristic of this dataset. Cox regression, designed for survival analysis, does not align with the project's hypothesis and objectives. Additionally, PCA regression, intended for reducing high-dimensional feature sets, is unnecessary here: since the project focuses on predicting a single variable from a modest number of features, PCA does not contribute to the assumptions or goals of the analysis.

Reasons for choosing these regressors are as follows : 

**1. Standard Linear Regression:**
Standard linear regression is a straightforward model that assumes a linear relationship between the features and the target variable. If the relationship between the features (e.g., temperature, humidity, windspeed) and the bike rental count ('count') is approximately linear, a standard linear regression model can provide interpretable coefficients.

**2. Ridge Regression:**
Ridge regression is useful when there is multicollinearity among the features. If features such as temperature and 'feels-like' temperature ('temp_feel') are highly correlated, Ridge regression can help mitigate multicollinearity by adding a regularization term to the cost function.

**3. Lasso Regression:**
Lasso regression is beneficial when feature selection is desired. If there are many features, and some of them may not be significantly contributing to the prediction of bike rentals, Lasso can automatically shrink the coefficients of less important features to zero.

**4. Elastic Net Regression:**
Elastic Net regression combines the advantages of Ridge and Lasso, making it suitable when there is a mix of correlated features and potential feature sparsity. It provides a balance between Ridge and Lasso regularization.
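The effect of the two penalties can be illustrated with a single centred feature, where the minimiser has a closed form. This is a sketch under assumptions: the objective is written as 0.5*sum((y - b*x)^2) + l1*|b| + 0.5*l2*b^2, which is a convenient parametrisation for illustration and not necessarily the scaling used by any particular library, and the data are toy values.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def one_feature_fit(x, y, l1=0.0, l2=0.0):
    """Minimise 0.5*sum((y - b*x)^2) + l1*|b| + 0.5*l2*b^2 for centred x, y.

    l1 = l2 = 0 gives ordinary least squares, l1 = 0 gives Ridge,
    l2 = 0 gives Lasso, and both positive gives Elastic Net.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft_threshold(sxy, l1) / (sxx + l2)


# Toy centred data with a weak relationship.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-0.5, 0.1, 0.0, -0.1, 0.5]
print("ols  ", one_feature_fit(x, y))
print("ridge", one_feature_fit(x, y, l2=10.0))
print("lasso", one_feature_fit(x, y, l1=2.0))
```

Ridge shrinks the slope but keeps it non-zero, whereas a large enough Lasso penalty sets it exactly to zero, which is the feature-selection behaviour described above.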

### Model Implementation and Evaluation

**Hypothesis-5**

Four regression models were implemented: Multiple Linear Regression, Lasso Regression, Elastic Net Regression, and Ridge Regression. The models were instantiated, trained on the training data, and used to predict the target variable ('count'). Subsequently, model evaluation was conducted using metrics like Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and R^2 score.

![Evaluation_before_hyper.png](attachment:Evaluation_before_hyper.png)
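For reference, the four metrics in the table can be computed by hand as sketched below. The helper name and the toy values are illustrative and are not model output from this project; library functions such as `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` compute the same quantities.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Toy values, not model output from this project.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(regression_metrics(actual, predicted))
```

Lower MSE, RMSE, and MAE indicate smaller errors, while an R^2 closer to 1 indicates that more of the variance in the target is explained.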

#### Hyperparameter Tuning (Model improvement)

For further refinement, **<span style="color:blue">hyperparameter tuning</span>** was performed using GridSearchCV for Elastic Net Regression, LassoCV for Lasso Regression, and RidgeCV for Ridge Regression. The models were then instantiated with the optimized hyperparameters, trained on the training data, and used to predict the target variable.

Comparisons between models were visualized by creating a dataframe summarizing error and accuracy rates in a tabular form.

![Evaluation_after_hyper.png](attachment:Evaluation_after_hyper.png)
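The idea behind the cross-validated search can be sketched by hand. The project itself used GridSearchCV, LassoCV, and RidgeCV; the code below is a simplified plain-Python stand-in that assumes a one-feature ridge model without an intercept, a hypothetical grid of penalty values, and toy data, purely to show how each candidate is scored on held-out folds.

```python
def ridge_slope(x, y, alpha):
    """Closed-form one-feature ridge slope (no intercept): Sxy / (Sxx + alpha)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def cv_mse(x, y, alpha, k=3):
    """Mean validation MSE over k folds (fold i holds rows i, i+k, i+2k, ...)."""
    fold_scores = []
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        valid = [i for i in range(len(x)) if i % k == fold]
        b = ridge_slope([x[i] for i in train], [y[i] for i in train], alpha)
        fold_scores.append(sum((y[i] - b * x[i]) ** 2 for i in valid) / len(valid))
    return sum(fold_scores) / k


def grid_search(x, y, alphas, k=3):
    """Return (best_alpha, scores) where scores maps each alpha to its CV MSE."""
    scores = {a: cv_mse(x, y, a, k) for a in alphas}
    return min(scores, key=scores.get), scores


# Toy data, not taken from day.csv.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-5.8, -4.3, -1.7, 0.2, 2.1, 3.8, 6.3, 7.9, 10.2]
best_alpha, scores = grid_search(x, y, alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
print(best_alpha, scores[best_alpha])
```

The penalty with the lowest mean validation error is selected, and the model is then refitted on the full training data with that value.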

Additionally, a **line graph was generated to provide a visual representation** of the different errors and accuracies across various models. The results demonstrated improved performance with hyperparameter tuning compared to the initial model configurations.

![Evaluation_line_graph.png](attachment:Evaluation_line_graph.png)

Comparing the four models (Linear, Lasso, Ridge, ElasticNet), we can see that **Linear Regression** has the lowest MSE and RMSE and is therefore the **better predictor**. Its **R-squared value** is also the highest, indicating that the model **best explains the variation of the output with different inputs**.

Therefore, **Linear Regression** is the optimal model here.

#### Linear Regression (Optimal Model - Graphs)

![LR1.png](attachment:LR1.png)

This graph plots the predicted count against the actual values. The predicted values coincide with the actual values, i.e. the fit is essentially exact on this data.

![LR2.png](attachment:LR2.png)

## 5. Results and Interpretation

The coefficients and intercept of the model are : 

![Coeffcients%20and%20intercept.png](attachment:Coeffcients%20and%20intercept.png)

The coefficients and intercept play crucial roles in understanding the relationship between the independent variables (features) and the dependent variable (target). Here's what they imply:


<span style="color:green">Intercept (Bias):</span>
b0 - intercept represents the predicted value of the dependent variable when all independent variables are set to zero. In many cases, it doesn't have a practical interpretation unless the variables involved are meaningful when set to zero.


<span style="color:green">Coefficients (Slopes):</span>
b1, b2, ..., bn: Each coefficient represents the change in the predicted value of the dependent variable for a one-unit change in the corresponding independent variable, while holding all other variables constant.
   - <span style="color:purple">Positive Coefficient:</span> A positive coefficient implies that an increase in the corresponding independent variable is associated with an increase in the predicted value of the dependent variable.
   - <span style="color:purple">Negative Coefficient:</span> A negative coefficient implies that an increase in the corresponding independent variable is associated with a decrease in the predicted value of the dependent variable.
   - <span style="color:purple">Coefficient Magnitude:</span> The magnitude of the coefficient indicates the strength of the relationship. Larger magnitudes suggest a larger change in the prediction per unit change in the feature.


In summary, the intercept and coefficients provide insights into the baseline value and the impact of each independent variable on the predicted outcome. They allow you to interpret how changes in the input variables contribute to changes in the predicted output in a linear manner.

#### Equation of the Model

The linear regression equation is represented as follows:
            
   <span style="color:green">Y=b0+(b1∗X1+b2∗X2+...+bn∗Xn)</span>
            
<span style="color:purple">Y</span> is the predicted value of the dependent variable.

<span style="color:purple">b0</span> is the intercept.

<span style="color:purple">b1,b2,...,bn</span> are the coefficients corresponding to independent variables 

<span style="color:purple">X1,X2,...,Xn</span> are the independent variables.

![Equation%20of%20the%20model-2.png](attachment:Equation%20of%20the%20model-2.png)
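Evaluating the equation for a single day amounts to a weighted sum. The sketch below uses placeholder coefficients and a hypothetical subset of features for illustration only; the fitted values are those shown in the figure above and are not reproduced here.

```python
def predict(intercept, coefficients, features):
    """Evaluate Y = b0 + b1*X1 + ... + bn*Xn for one observation."""
    return intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )


# Placeholder coefficients for illustration only (not the fitted values).
b0 = 0.10
b = {"temp": 0.40, "humidity": -0.15, "windspeed": -0.05}
day = {"temp": 0.50, "humidity": 0.60, "windspeed": 0.20}
print(predict(b0, b, day))
```

Raising one feature by one unit while holding the others fixed changes the prediction by exactly that feature's coefficient, which is the interpretation given in the previous section.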

## 6. Conclusion

The Bike Share dataset provides a rich source of information for understanding urban mobility patterns and predicting bike rentals based on various features. During the exploratory data analysis (EDA) and regression modelling phases, several hypotheses were tested and insights were gained:

<span style="color:green">Impact of Weather Conditions:</span>
Hypothesis: Weather conditions have a significant impact on bike rentals.
Findings: The dataset revealed that certain weather situations, such as clear days, may influence bike rentals differently.

<span style="color:green">Working Day vs. Non-Working Day Comparison:</span>
Hypothesis: There are differences in bike rentals on working days and non-working days.
Findings: Statistical tests, such as t-tests, were performed to compare bike rentals on different days, providing insights into the rental patterns. The results indicate that demand for shared bikes differs between working and non-working days.

<span style="color:green">Holiday Effect Analysis:</span>
Hypothesis: Holidays have an impact on bike rentals.
Findings: Statistical tests were conducted to compare bike rentals on holidays and regular days, uncovering trends and patterns around holiday periods. Based on the t-test, holidays do impact bike rentals.

<span style="color:green">Temperature Impact Hypothesis:</span>
Hypothesis: Temperature has an effect on bike rentals.
Findings: Spearman's rank correlation coefficient was used to evaluate the monotonic relationship between temperature and bike rentals, indicating a significant correlation. The result supports the conclusion that temperature affects the demand for shared bikes.

<span style="color:green">Linear Regression Modeling:</span>
Hypothesis: Linear regression models can predict bike rentals based on various features.
Findings: Multiple linear regression, Lasso regression, Elastic Net regression, and Ridge regression models were implemented and evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and R^2 score. Hyperparameter tuning improved model performance, after which we concluded that linear regression is the optimal model.


In conclusion, the Bike Share dataset allows for the modeling and prediction of bike rentals, with weather conditions, working days, holidays, and temperature playing significant roles. The implemented regression models provide a valuable tool for understanding and predicting bike rental patterns, enabling better management and strategy development for bike-sharing companies.

---------------
In summary, this study employed regression modeling to predict bike rental counts using a comprehensive bike share dataset. The exploratory data analysis (EDA) phase provided valuable insights into the trends, patterns, and relationships within the dataset. Multiple regression models, including linear regression, lasso regression, elastic net regression, and ridge regression, were implemented and evaluated for predicting bike rentals. The models underwent hyperparameter tuning, leading to enhanced performance, as evidenced by metrics such as Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and R^2 score.

This study contributes reliable models for predicting bike rentals, shedding light on factors influencing demand. The findings may have practical applications for bike-sharing companies and urban planning efforts. However, it's essential to acknowledge the dataset's limitations, such as its two-year timeframe (2011 and 2012) and the inherent unpredictability associated with human behavior.

**<span style="color:blue">Limitations:</span>**

   - **Temporal Scope:**
The dataset may have a limited temporal scope, covering a specific time period. This could limit the generalizability of the findings to different seasons, years, or changing trends.

   - **Geographical Scope:**
The dataset may be specific to a certain city or region, and the patterns observed may not necessarily apply to other locations with different characteristics.

   - **Population Representativeness:**
The dataset may not be fully representative of the entire population using bike-sharing services. For example, if certain demographics are underrepresented in the dataset, the analysis might not capture the preferences or behaviors of those groups.


**<span style="color:blue">Future work:</span>**

   - **Time Series Analysis:**
   Explore more advanced time series analysis techniques to capture temporal patterns and trends in bike rentals. This could involve using methods like ARIMA, SARIMA, or even deep learning models tailored for time series data.

   - **Predictive Modeling Improvement:**
   Experiment with more advanced predictive models beyond linear regression, such as decision trees, random forests, gradient boosting, or neural networks. Evaluate their performance and compare them with the linear regression model.

   - **User Segmentation:**
Segment users based on their rental patterns, and analyze each segment separately. This could provide insights into the different behaviors and preferences of user groups, helping tailor marketing or operational strategies.

   - **Customer Behavior Analysis:**
Conduct a detailed analysis of customer behavior, such as preferred routes, popular pickup/drop-off locations, and user demographics. This information can guide marketing efforts and service improvements.

   - **Integration with Weather Data:**
Incorporate more detailed weather data to enhance the model's accuracy. Weather conditions beyond the basic categories (clear, mist, etc.) may provide more nuanced insights into how weather impacts bike rentals.

   - **Long-Term Trends and Seasonal Analysis:**
Analyze long-term trends in bike rentals and conduct a more in-depth seasonal analysis to understand how demand varies across different seasons and years.
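
The time series and predictive modeling items above could be prototyped along the following lines. This is only a sketch under stated assumptions: the daily series is synthetic, the lag features (`lag_1`, `lag_7`) and the 80/20 chronological split are illustrative choices, and a random forest stands in for the "more advanced" models mentioned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic two-year daily series (hypothetical, stands in for the rental counts)
rng = np.random.default_rng(42)
days = np.arange(730)
count = 4500 + 2000 * np.sin(2 * np.pi * (days - 100) / 365) + rng.normal(0, 400, size=days.size)
df = pd.DataFrame({"count": count})

# Lag features: yesterday's demand and demand one week earlier
df["lag_1"] = df["count"].shift(1)
df["lag_7"] = df["count"].shift(7)
df = df.dropna().reset_index(drop=True)

# Chronological split: train on the first 80% of days, test on the rest
split = int(len(df) * 0.8)
features = ["lag_1", "lag_7"]
X_train, X_test = df[features].iloc[:split], df[features].iloc[split:]
y_train, y_test = df["count"].iloc[:split], df["count"].iloc[split:]

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: test R^2 = {scores[name]:.3f}")
```

Splitting by time rather than at random keeps future days out of the training set, which gives a more realistic picture of forecasting performance.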

## Reference and Appendix

1. Lecture notes for Regression models
2. hue - [How to use Hue](https://www.statology.org/seaborn-pairplot-hue/#:~:text=You%20can%20use%20the%20hue,values%20of%20a%20specific%20variable.&text=This%20particular%20example%20creates%20a,value%20of%20the%20team%20variable.)
3. Generate descriptive statistics - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
4. Drop() - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
5. isnull() - detects missing values for an array-like object - https://pandas.pydata.org/docs/reference/api/pandas.isnull.html
6. boxplot() - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html
7. pairplot() - https://seaborn.pydata.org/generated/seaborn.pairplot.html
8. https://seaborn.pydata.org/generated/seaborn.heatmap.html
9. https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
10. https://stackoverflow.com/questions/50773877/create-for-loop-to-plot-histograms-for-individual-columns-of-dataframe-with-seab
11. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
12. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
13. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
14. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

### Code

```
# scipy and scikit-learn provide the statistical tests and regression models used below
!pip install scipy
!pip install scikit-learn

# Importing libraries required
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import spearmanr

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_regression
from scipy.stats import uniform

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


# Reading the data set
bikeShare = pd.read_csv("days.csv")
bikeShare.head()


bikeShare.shape

bikeShare.info()

bikeShare.describe()

# Checking for Null Values
bikeShare.isnull().sum()

# Checking for unique values
bikeShare.nunique()

# Renaming columns for better readability
bikeShare.rename(columns={'yr':'year','mnth':'month','weathersit':'weather_situation','atemp':'temp_feel',
                   'hum':'humidity','cnt':'count'}, inplace=True)
bikeShare.head()

season_codes = {1:'spring', 2:'summer', 3:'fall', 4:'winter'}
bikeShare['season'] = bikeShare['season'].map(season_codes)
bikeShare.season.head()

month_codes = {1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}
bikeShare['month'] = bikeShare['month'].map(month_codes)
bikeShare.month.head()

# Assuming the original encoding of this dataset, where weekday 0 is Sunday and 6 is Saturday
weekday_codes = {0:'Sunday', 1:'Monday', 2:'Tuesday', 3:'Wednesday', 4:'Thursday', 5:'Friday', 6:'Saturday'}
bikeShare['weekday'] = bikeShare['weekday'].map(weekday_codes)
bikeShare.weekday.head()

weathersit_codes = {1:'Clear', 2:'Mist', 3:'Light Snow', 4:'Heavy Rain'}
bikeShare['weather_situation'] = bikeShare['weather_situation'].map(weathersit_codes)
bikeShare.weather_situation.head()

yr_codes = {0:"2011",1:"2012"}
bikeShare['year'] = bikeShare['year'].map(yr_codes)
bikeShare.year.head()

bikeShare.head()

# Keeping only the columns used in the analysis; the remaining columns are of no use to us
bikeShare = bikeShare[['season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_situation', 'temp', 'temp_feel', 'humidity', 'windspeed', 'casual', 'registered', 'count']].copy()
bikeShare.head()

bikeShare.info()

# Outlier Detection
# Plot the boxplot of temp variable.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.temp)
plt.title("Distribution of temperature", fontsize = 12, color = "brown")
plt.show()

# Plot the boxplot of temp_feel variable.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.temp_feel)
plt.title("Distribution of feels-like temperature", fontsize = 12, color = "brown")
plt.show()

# Plot the boxplot of humidity variable.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.humidity)
plt.title("Distribution of humidity", fontsize = 12, color = "brown")
plt.show()

# Calculate the IQR (Interquartile Range)
Q1 = bikeShare.humidity.quantile(0.25)
Q3 = bikeShare.humidity.quantile(0.75)
IQR = Q3 - Q1

# Define a threshold for outliers
threshold = 1.5 * IQR

# Identify and mark outliers
bikeShare['Is_Outlier'] = (bikeShare.humidity < (Q1 - threshold)) | (bikeShare.humidity > (Q3 + threshold))

# Print the outliers
outliers = bikeShare[bikeShare['Is_Outlier']]
print("\033[1mNumber of outliers:\033[0m ", outliers.shape[0])

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

bikeShare = bikeShare[(bikeShare.humidity >= lower_bound) & (bikeShare.humidity <= upper_bound)]

bikeShare.drop('Is_Outlier', inplace = True, axis = 1)
bikeShare.head()

# Checking for shape
bikeShare.shape

# Plot the boxplot of humidity variable after the removal of outliers.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.humidity)
plt.title("Distribution of humidity", fontsize = 12, color = "brown")
plt.show()


# Plot the boxplot of windspeed variable.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.windspeed)
plt.title("Distribution of windspeed", fontsize = 12, color = "brown")
plt.show()

# Calculate the IQR (Interquartile Range)
Q1 = bikeShare.windspeed.quantile(0.25)
Q3 = bikeShare.windspeed.quantile(0.75)
IQR = Q3 - Q1

# Define a threshold for outliers
threshold = 1.5 * IQR

# Identify and mark outliers
bikeShare['Is_Outlier'] = (bikeShare.windspeed < (Q1 - threshold)) | (bikeShare.windspeed > (Q3 + threshold))

# Print the outliers
outliers = bikeShare[bikeShare['Is_Outlier']]
print("\033[1mNumber of outliers:\033[0m ", outliers.shape[0])

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

bikeShare = bikeShare[(bikeShare.windspeed >= lower_bound) & (bikeShare.windspeed <= upper_bound)]

bikeShare.drop('Is_Outlier', inplace = True, axis = 1)
bikeShare.head()

# Checking for shape
bikeShare.shape

# Plot the boxplot of windspeed variable after the removal of outliers.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.windspeed)
plt.title("Distribution of windspeed", fontsize = 12, color = "brown")
plt.show()

# Plot the boxplot of casual variable.
plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.casual)
plt.title("Distribution of casual users", fontsize = 12, color = "brown")
plt.show()

# Calculate the IQR (Interquartile Range)
Q1 = bikeShare.casual.quantile(0.25)
Q3 = bikeShare.casual.quantile(0.75)
IQR = Q3 - Q1

# Define a threshold for outliers
threshold = 1.5 * IQR

# Identify and mark outliers
bikeShare['Is_Outlier'] = (bikeShare.casual < (Q1 - threshold)) | (bikeShare.casual > (Q3 + threshold))

# Print the outliers
outliers = bikeShare[bikeShare['Is_Outlier']]
print("\033[1mNumber of outliers:\033[0m ", outliers.shape[0])

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

bikeShare = bikeShare[(bikeShare.casual >= lower_bound) & (bikeShare.casual <= upper_bound)]

bikeShare.drop('Is_Outlier', inplace = True, axis = 1)
bikeShare.head()

# Checking for shape

bikeShare.shape

# Plot the boxplot of casual variable after the removal of outliers.

plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.casual)
plt.title("Distribution of casual users", fontsize = 12, color = "brown")
plt.show()

# Plot the boxplot of registered variable.

plt.figure(figsize = [8,2])
sns.boxplot(x=bikeShare.registered)
plt.title("Distribution of registered users", fontsize = 12, color = "brown")
plt.show()


# Creating a df for numerical values

num_var = bikeShare[['temp', 'temp_feel', 'humidity', 'windspeed', 'casual', 'registered']]

# Creating a df for categories

cat_var = bikeShare[['season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_situation', 'count']]

# Exploring numerical columns of the bikeShare dataframe

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 12))

axes = axes.flatten()

for i, col in enumerate(num_var.columns):
    sns.histplot(num_var[col], stat='density', kde=True, kde_kws={"cut": 3}, ax=axes[i])
    
plt.suptitle('Histograms depicting the distribution of Numerical variables', fontsize=16, color='Green')
plt.show()

# Exploring categorical columns of the bikeShare dataframe

for col in cat_var:
    print("\033[1m" + col + "\033[0m")
    print(bikeShare[col].value_counts())
    print("\n")
    
# Plotting the categorical variables

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(14, 14))

# Flatten the axes for easier iteration
axes = axes.flatten()

# Loop through each categorical column (excluding the target 'count') and create a bar plot
cat_cols = cat_var.columns[:-1]
for i, column in enumerate(cat_cols):
    sns.barplot(x=column, y='count', data=cat_var, ax=axes[i])
    axes[i].set_title(f'{column} vs. count')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('count')
    axes[i].tick_params(axis='x', rotation=45)

# Remove the unused subplots
for j in range(len(cat_cols), len(axes)):
    fig.delaxes(axes[j])
        
# Adjust layout
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.suptitle('Bar Plots for Categorical Variables Against Target Variable count', fontsize=16, color="Green")
plt.show()

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 14))
axes = axes.flatten()

for i, col in enumerate(num_var.columns):
    sns.scatterplot(data=bikeShare,x=col,y='count',hue = bikeShare['weather_situation'], ax=axes[i])

plt.suptitle('Relationship between features and the target variable w.r.t weather situation', fontsize=16, color="Green", y=0.9)
plt.show()

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 14))
axes = axes.flatten()

for i, col in enumerate(num_var.columns):
    sns.scatterplot(data=bikeShare,x=col,y='count',hue = bikeShare['workingday'], ax=axes[i], palette='inferno')
   
plt.suptitle('Relationship between features and the target variable w.r.t working day', fontsize=16, color="Green", y=0.9)
plt.show()

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 14))
axes = axes.flatten()

for i, col in enumerate(num_var.columns):
    sns.scatterplot(data=bikeShare,x=col,y='count',hue = bikeShare['year'], ax=axes[i], palette='magma')
    
plt.suptitle('Relationship between features and the target variable w.r.t year', fontsize=16, color="Green", y=0.9)
plt.show()

# Visualizing numerical variables - pairplot

num_var1 = bikeShare[['temp', 'temp_feel', 'humidity', 'windspeed', 'casual', 'registered', 'count']]

sns.pairplot(num_var1)
plt.suptitle('Pairplot to find correlation between features', fontsize=20, color="Green", y=1.02)
plt.show()

# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (8, 6))
sns.heatmap(num_var1.corr(), annot = True, cmap="YlGnBu")
plt.title('Heatmap to find correlation between features', fontsize=12, color="Green", y=1.02)
plt.show()

# Removing temp_feel.

bikeShare.drop(['temp_feel'],axis=1,inplace=True)

# Statistical Modelling
# Split data into three groups based on weather conditions (no 'Heavy Rain' group is tested)
group_clear = bikeShare[bikeShare['weather_situation'] == 'Clear']['count']
group_mist = bikeShare[bikeShare['weather_situation'] == 'Mist']['count']
group_light_snow = bikeShare[bikeShare['weather_situation'] == 'Light Snow']['count']

# Perform one-way ANOVA
statistic, p_value = f_oneway(group_clear, group_mist, group_light_snow)

# Define significance level
alpha = 0.05

# Print the results
print(f'\033[1mANOVA Statistic:\033[0m {statistic}')
print(f'\033[1mP-value:\033[0m {p_value}')

# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: Different weather conditions affect bike rentals differently.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant impact of weather conditions on bike rentals.\033[0m')
    
    
# Visualize the distribution of rentals for each weather situation
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=bikeShare, x='count', hue='weather_situation', bins=30, kde=True)
plt.title('Distribution of Bike Rentals on different weather situations', color='green')
plt.xlabel('Count of Bike Rentals')
plt.ylabel('Frequency')
# Keep seaborn's own legend so that the labels stay matched to the hue colors
ax.get_legend().set_title('Weather situation')
plt.show()


# Split data into two groups: working days and non-working days
working_days = bikeShare[bikeShare['workingday'] == 1]['count']
non_working_days = bikeShare[bikeShare['workingday'] == 0]['count']

# Perform independent t-test
statistic, p_value = ttest_ind(working_days, non_working_days)

# Define significance level
alpha = 0.05

# Print the results
print(f'\033[1mT-test Statistic:\033[0m {statistic}')
print(f'\033[1mP-value:\033[0m {p_value}')

# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: There are significant differences in bike rentals between working and non-working days.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant differences in bike rentals between working and non-working days.\033[0m')

    
# Visualize the distribution of rentals on working and non-working days
plt.figure(figsize=(10, 6))
# Map the 0/1 flag to readable labels so seaborn builds a correctly matched legend
day_type = bikeShare['workingday'].map({0: 'Non-Working Day', 1: 'Working Day'})
ax = sns.histplot(data=bikeShare, x='count', hue=day_type, bins=30, kde=True)
plt.title('Distribution of Bike Rentals on Working and Non-Working Days', color='green')
plt.xlabel('Count of Bike Rentals')
plt.ylabel('Frequency')
ax.get_legend().set_title('Working Day')
plt.show()


# Split data into two groups: holidays and regular days
holidays = bikeShare[bikeShare['holiday'] == 1]['count']
regular_days = bikeShare[bikeShare['holiday'] == 0]['count']

# Perform independent t-test
statistic, p_value = ttest_ind(holidays, regular_days)

# Define significance level
alpha = 0.05

# Print the results
print(f'\033[1mT-test Statistic:\033[0m {statistic}')
print(f'\033[1mP-value:\033[0m {p_value}')

# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: There are significant differences in bike rentals between holidays and regular days.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant differences in bike rentals between holidays and regular days.\033[0m')

# Visualize the distribution of rentals on holidays and regular days
plt.figure(figsize=(10, 6))
# Map the 0/1 flag to readable labels so seaborn builds a correctly matched legend
holiday_type = bikeShare['holiday'].map({0: 'Regular Day', 1: 'Holiday'})
ax = sns.histplot(data=bikeShare, x='count', hue=holiday_type, bins=30, kde=True)
plt.title('Distribution of Bike Rentals on Holidays and Regular Days', color='green')
plt.xlabel('Count of Bike Rentals')
plt.ylabel('Frequency')
ax.get_legend().set_title('Holiday')
plt.show()


# Test the correlation between temperature and bike rentals using Spearman's rank correlation
correlation, p_value = spearmanr(bikeShare['temp'], bikeShare['count'])

# Define significance level
alpha = 0.05

# Print the results
print(f'\033[1mSpearman\'s Rank Correlation:\033[0m {correlation}')
print(f'\033[1mP-value:\033[0m {p_value}')

# Compare p-value with significance level
if p_value < alpha:
    print('\033[1mReject the null hypothesis: There is a significant correlation between temperature and bike rentals.\033[0m')
else:
    print('\033[1mFail to reject the null hypothesis: No significant correlation between temperature and bike rentals.\033[0m')

# Visualize the relationship between temperature and bike rentals
plt.figure(figsize=(10, 6))
sns.scatterplot(data=bikeShare, x='temp', y='count')
plt.title('Relationship Between Temperature and Bike Rentals', color='green')
plt.xlabel('Temperature')
plt.ylabel('Count of Bike Rentals')
plt.show()

# Creating dummies

status = pd.get_dummies(bikeShare[['season', 'year', 'month', 'weekday', 'weather_situation']], drop_first=True)
status.head()

# Concatenate the dummy-variable dataframe (status) with the bikeShare dataframe

bikeShare_new = pd.concat([bikeShare, status], axis = 1)

# Checking the concatenated dataframe

bikeShare_new.head()

# Drop 'season', 'year', 'month', 'weekday' and 'weather_situation' as we have created the dummies for them

drop_cols = ['season', 'year', 'month', 'weekday', 'weather_situation']
bikeShare_new.drop(drop_cols, axis = 1, inplace = True)

from sklearn.model_selection import train_test_split

# We specify random_state so that the train and test data set always have the same rows, respectively

df_train, df_test = train_test_split(bikeShare_new, train_size = 0.7, test_size = 0.3, random_state = 100)

scaler = MinMaxScaler()

# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables

num_vars = ['temp', 'humidity', 'windspeed', 'casual', 'registered']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_test[num_vars] = scaler.transform(df_test[num_vars])

# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (25, 20))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

X_train = df_train.drop('count', axis=1)
y_train = df_train['count']

X_test = df_test.drop('count', axis=1)
y_test = df_test['count']

# Linear Regression

# Create the regression model instances and fit each model on the training data

# Linear Regression Model
print("\033[1mLinear Regression Model\033[0m")
lr = LinearRegression()
lr.fit(X_train,y_train)
y_pred_linear = lr.predict(X_test)
print('Predicted Linear Regression values are ', y_pred_linear[1:5]) 

mse = mean_squared_error(y_test,y_pred_linear)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_linear)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_linear)
print(f"R^2 Score: {r2score}")


print("\n")
# Lasso Regression Model
print("\033[1mLasso Regression Model\033[0m")
lasso = Lasso(alpha=0.5)
lasso.fit(X_train,y_train)
y_pred_lasso = lasso.predict(X_test)
print('Predicted Lasso Regression values are ', y_pred_lasso[1:5])

mse = mean_squared_error(y_test,y_pred_lasso)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_lasso)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_lasso)
print(f"R^2 Score: {r2score}")


print("\n")
# Ridge Regression Model
print("\033[1mRidge Regression Model\033[0m")
ridge = Ridge(alpha=1)
ridge.fit(X_train,y_train)
y_pred_ridge=ridge.predict(X_test)
print('Predicted Ridge Regression values are ', y_pred_ridge[1:5])

mse = mean_squared_error(y_test,y_pred_ridge)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_ridge)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_ridge)
print(f"R^2 Score: {r2score}")



print("\n")
# Elastic Net Regression Model
print("\033[1mElasticNet Regression Model\033[0m")
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train,y_train)
y_pred_elastic = elastic_net.predict(X_test)
print('Predicted ElasticNet Regression values are ', y_pred_elastic[1:5])

mse = mean_squared_error(y_test,y_pred_elastic)
print(f"Mean Squared Error: {mse}")
rmse = mse**0.5
print(f"Root Mean Squared Error: {rmse}")
mae = mean_absolute_error(y_test,y_pred_elastic)
print(f"Mean Absolute Error: {mae}")
r2score=r2_score(y_test,y_pred_elastic)
print(f"R^2 Score: {r2score}")

# Hyperparameter tuning
# ElasticNet Regression


# Define the grid of hyperparameters 'param_grid'
param_grid = {
    'alpha': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'l1_ratio': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
}

# Initialize GridSearchCV with the required parameters
grid_model_result = GridSearchCV(estimator=ElasticNet(),
                                 param_grid=param_grid,
                                 cv=10).fit(X_train,y_train)

# Print results (best_score_ is the mean cross-validated R^2, the default score for regressors)
print(f"Best mean CV R^2: {grid_model_result.best_score_} using {grid_model_result.best_params_}")

# Ridge Regression

# Defining list of alphas
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

# Create a ridge regressor object that does cross-validation
ridge_cv = RidgeCV(alphas=alphas)

# Fit it to our training data
ridge_cv.fit(X_train,y_train)

# Get predictions for test set.
y_pred_ridgecv=ridge_cv.predict(X_test)

print('Ridge CV Model Mean Squared Error:', mean_squared_error(y_test,y_pred_ridgecv))
print('Best Alpha after Cross Validation :', ridge_cv.alpha_)

# Refit the regression models on the training data using the tuned hyperparameters


# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train,y_train)

y_pred_linear = lr.predict(X_test)
print("\033[1mLinear Regression Model\033[0m")
print('Predicted Linear Regression values are ', y_pred_linear[1:5])    


# Lasso Regression Model
lasso = Lasso(alpha=0)
lasso.fit(X_train,y_train)

y_pred_lasso = lasso.predict(X_test)
print("\n")
print("\033[1mLasso Regression Model\033[0m")
print('Predicted Lasso Regression values are ', y_pred_lasso[1:5])


# Ridge Regression Model
ridge = Ridge(alpha=0.1)
ridge.fit(X_train,y_train)

y_pred_ridge=ridge.predict(X_test)
print("\n")
print("\033[1mRidge Regression Model\033[0m")
print('Predicted Ridge Regression values are ', y_pred_ridge[1:5])


# Elastic Net Regression Model
elastic_net = ElasticNet(alpha=0.1, l1_ratio=1)
elastic_net.fit(X_train,y_train)

y_pred_elastic = elastic_net.predict(X_test)
print("\n")
print("\033[1mElasticNet Regression Model\033[0m")
print('Predicted ElasticNet Regression values are ', y_pred_elastic[1:5])

# Calculate evaluation metrics for each model

model_names = ["Linear Regression", "Lasso Regression", "Ridge Regression", "Elastic Net Regression"]
models = [lr, lasso, ridge, elastic_net]
predictions = [y_pred_linear, y_pred_lasso, y_pred_ridge, y_pred_elastic]

metrics = []
for name, model, y_pred in zip(model_names, models, predictions):
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse) 
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test,y_pred)
    metrics.append([name, mse, rmse, mae, r2])

# Create a DataFrame to display the metrics
df_metrics = pd.DataFrame(metrics, columns=["Model", "Mean Squared Error (MSE)", "Root Mean Square Error (RMSE)", "Mean Absolute Error (MAE)", "R-squared (R^2)"])

# Display the table
print(df_metrics)

# Index the table by model name so each metric can be plotted per model
df_metrics.set_index("Model", inplace=True)

# Custom x-axis labels
custom_labels = ["Linear", "Lasso", "Ridge", "Elastic Net"]

# Plotting the metrics, one subplot per metric (MSE, RMSE, MAE, R^2)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.suptitle('Model Evaluation Metrics')

y_labels = ['MSE', 'RMSE', 'MAE', 'R^2']
for ax, column, y_label in zip(axes.flatten(), df_metrics.columns, y_labels):
    ax.plot(range(len(custom_labels)), df_metrics[column].values, marker='o')
    ax.grid(True)
    ax.set_ylabel(y_label)
    ax.set_xticks(range(len(custom_labels)))
    ax.set_xticklabels(custom_labels)

plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjust the layout
plt.show()

# Plotting Regression Graph
plt.scatter(y_test, y_pred_linear)
plt.xlabel("Actual Count")
plt.ylabel("Predicted Count")
plt.title("Linear Regression: Actual vs. Predicted Count")
plt.show()

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 12))
axes = axes.flatten()

plot_data = pd.DataFrame({'Actual Count': y_test, 'Predicted Count': y_pred_linear})

numerical_cols = ['temp', 'humidity', 'windspeed', 'casual', 'registered']

# Plotting individual regression plots of count against each numerical variable
for i, column in enumerate(numerical_cols):
    sns.regplot(x=X_test[column], y=y_test, scatter_kws={'s': 15}, line_kws={'color': 'red'}, ax=axes[i])
    # Set labels on the subplot being drawn, not on the most recently created axes
    axes[i].set_title(f"Linear Regression: Count vs. {column}")
    axes[i].set_xlabel(column)
    axes[i].set_ylabel("Count")

# Remove the unused subplots
for j in range(len(numerical_cols), len(axes)):
    fig.delaxes(axes[j])


plt.suptitle('Linear Regression: Actual vs. Predicted Count for Numerical Variables', fontsize=16, color='Green')
plt.show()

# Coefficients for each feature
coefficients = lr.coef_

feature_names = X_train.columns
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
sorted_coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

# Intercept
intercept = lr.intercept_

print("\033[1mCoefficients:\033[0m")
print(sorted_coef_df)
print("\n")
print("\033[1mIntercept:\033[0m", intercept)
```
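
One caveat on the feature set used in the code above: `casual` and `registered` are kept as predictors of `count`. In the commonly distributed version of this dataset the total count is the sum of those two columns. Assuming that also holds for the file used here, a linear model can reproduce `count` almost exactly from them, which would inflate the reported scores. The sketch below illustrates the effect on synthetic data (all values and the helper `holdout_r2` are hypothetical) by fitting a linear regression with and without the component columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data in which count is, by construction, casual + registered
rng = np.random.default_rng(1)
n = 500
temp = rng.uniform(0, 1, n)
casual = 200 + 1200 * temp + rng.normal(0, 150, n)
registered = 1500 + 3000 * temp + rng.normal(0, 600, n)
df = pd.DataFrame({"temp": temp, "casual": casual, "registered": registered})
df["count"] = df["casual"] + df["registered"]


def holdout_r2(features):
    """Fit a linear regression on the given features and return the test-set R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["count"], test_size=0.3, random_state=100
    )
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)


r2_with_components = holdout_r2(["temp", "casual", "registered"])
r2_without_components = holdout_r2(["temp"])

print(f"R^2 with casual/registered:    {r2_with_components:.4f}")
print(f"R^2 without casual/registered: {r2_without_components:.4f}")
```

If the same pattern appears on the real data, dropping `casual` and `registered` from the feature matrix would give a more honest estimate of how well weather and calendar variables predict demand.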