# Analysis of bike sharing demand in Seoul



## Introduction 

This project analyzes the factors that influence bike rentals for a bike-sharing company. Specifically, we investigate whether temperature has a significant impact on the number of bikes rented. The dataset covers the years 2017 and 2018 and includes variables such as temperature, weather conditions, and the number of bikes rented.
Our goal is to use this data to build a predictive model that can estimate the number of bikes that will be rented for each month. The model will help the company optimize bike availability and manage resources more efficiently based on temperature forecasts and other environmental factors.

The dataset contains the number of public bicycles rented per hour in the Seoul Bike Sharing System, together with the corresponding weather data (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall), holiday information, and date information. Each column in “SeoulBikeData.csv” is described below:

    Date : the date of the day
    Rented Bike Count : the number of rented bikes
    Hour : the hour of the day
    Temperature(°C) : the temperature at this time
    Humidity(%) : the percentage of humidity at this time
    Wind speed (m/s) : the wind speed in m/s at this time
    Visibility (10m) : the visibility in units of 10 m at this time
    Dew point temperature(°C) : the dew point temperature of the day
    Solar Radiation (MJ/m2) : the solar radiation in MJ/m2 at this time
    Rainfall(mm) : the rainfall in mm at this time
    Snowfall (cm) : the snowfall in cm at this time
    Seasons : the season of the day
    Holiday : whether the day is a holiday or not
    Functioning Day : 




## Business Understanding 

The city of Seoul is well organized for cycling, like cities in Denmark, and has a well-organized bike rental system. The rental data of one company have been collected in a file called “SeoulBikeData.csv”. These data give the number of bikes rented every hour from 2017 to 2018; the 8760 rows correspond to one full year of hourly records (365 × 24). The dataset therefore shows the yearly cycle of bike rentals. With these data, a company may want to forecast bike rentals as a function of the weather conditions, the day, or the season.
The purpose of the project is to design a model that helps bike companies manage their bike stock based on the weather forecast and the time period.
The following research question (RQ) has been formulated:

**"Determine the optimum number of bikes needed each time of day based on hour, temperature and solar radiation."**




## Data collection

The dataset comes from the UC Irvine Machine Learning Repository. It is composed of 14 columns (the features) and 8760 rows (the observations). We chose this dataset because it can be useful for a real business and because it lets us test whether it is possible to forecast the number of rented bikes.
In this part, we first look at the structure of the dataset and at which features are useful.

### The Library
Libraries are crucial for expanding Python’s capabilities, improving efficiency, and offering solutions for a broad spectrum of tasks. To get started, it is necessary to import the required libraries.

In [10]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots

import statsmodels.api as sm # implements several commonly used regression methods
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from sklearn.metrics import mean_squared_error, accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

In [11]:
df = pd.read_csv('./SeoulBikeData.csv', encoding='unicode_escape')


The pandas library is used to manage and prepare the data.
The numpy library handles numerical operations.
The matplotlib library is used for data visualization.
The seaborn library is used to create visualizations of statistical data.
The statsmodels library implements several commonly used regression methods.
The sklearn library is used for machine learning and modelling.

## Data Cleaning and Data Preparation


After importing the libraries, the single dataset at our disposal can be loaded.
The dataset has 8760 rows and 14 columns, each column being one feature.

Data cleaning and data manipulation are necessary steps before taking a closer look at the data.

Data cleaning and exploration are shown as two separate steps, although in practice exploration often happens during cleaning because the process is iterative. Data cleaning is a fundamental step in any data science and machine learning project. It involves identifying and correcting data quality issues, which can significantly affect the accuracy and reliability of subsequent analyses and models.




In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

The output of df.info() lists the data type of each column. Most features are numeric; the date, seasons, holiday and functioning day columns are read as text and are converted or dropped in the Data Manipulation section.

Furthermore, the check for duplicates shows that the original dataframe contains no duplicated rows.
As a result, the number of rows and columns remains the same.

Missing data can arise for various reasons, such as incomplete records or data entry errors. Examining each column for missing values gives a better understanding of the dataset’s characteristics: it identifies which columns have gaps and whether these gaps can be filled with reasonable imputations.
In our specific case, no missing values were reported.
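
As a minimal sketch of the missing-value check and of one possible imputation, the toy frame below has a gap in one numeric column, which is filled with the column median. The values, the variable names and the choice of the median are assumptions made for illustration only; they are not taken from SeoulBikeData.csv, where no imputation was needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from SeoulBikeData.csv
toy = pd.DataFrame({
    "temperature": [1.5, np.nan, 3.0, 4.5],
    "humidity": [40, 55, 60, 70],
})

# Count missing values per column, as df.isna().sum() does in the notebook
missing_per_column = toy.isna().sum()

# One simple imputation strategy: replace gaps with the column median
toy_filled = toy.fillna(toy.median(numeric_only=True))

print(missing_per_column)
print(toy_filled)
```

The median is only one option; whether an imputed value is reasonable depends on the feature and on how many values are missing.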



In [None]:
df.describe().T

## Data Manipulation

Data manipulation makes the data exploration phase easier.
In our case, the main temporal feature is the date. To simplify data analysis and exploration, additional features are derived from it: the day of the month, the month, the year, and the day of the week.
As a result, four new columns are added to the existing 14. After the functioning day column is dropped, the dataframe has 17 features in total.

In [None]:
df.columns = df.columns.str.replace(r"\s*\(.*?\)\s*", "", regex=True)
df.columns = df.columns.str.replace(" ", "_", regex=False)

In [None]:
df.columns = [x.lower() for x in df.columns]
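
To make the effect of the three renaming steps concrete, the sketch below applies the same operations to three header names written as in the column list of the introduction. The variable name cols is an assumption introduced for this example.

```python
import pandas as pd

# Three header names as listed in the introduction
cols = pd.Index(["Rented Bike Count", "Wind speed (m/s)", "Dew point temperature(°C)"])

cols = cols.str.replace(r"\s*\(.*?\)\s*", "", regex=True)  # drop the unit in parentheses
cols = cols.str.replace(" ", "_", regex=False)             # spaces become underscores
cols = cols.str.lower()                                    # lower case

print(list(cols))
```

The resulting names (rented_bike_count, wind_speed, dew_point_temperature) are the ones used in the remaining cells.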

In [None]:
df["date"] = pd.to_datetime(df["date"], dayfirst = True)
df["day"] = df['date'].dt.day
df["month"] = df['date'].dt.month
df["year"] = df['date'].dt.year
df["weekday"] = df['date'].dt.day_name()
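
The dayfirst = True argument matters because a string such as 01/12/2017 is ambiguous. The sketch below parses one hypothetical date string (assumed here to be in day/month/year order) with and without the argument.

```python
import pandas as pd

sample = "01/12/2017"  # hypothetical date string in day/month/year order

as_day_first = pd.to_datetime(sample, dayfirst=True)  # read as 1 December 2017
as_month_first = pd.to_datetime(sample)               # read as 12 January 2017

print(as_day_first.day, as_day_first.month, as_day_first.day_name())
print(as_month_first.day, as_month_first.month)
```

Without dayfirst, the month and weekday features derived from the date would be wrong for every day-first date whose day is 12 or less.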

In [None]:
df['seasons'] = df['seasons'].map({'Winter': 0, 'Spring': 1, 'Summer': 2, 'Autumn': 3})
df['holiday'] = df['holiday'].map({"No Holiday": 0, "Holiday": 1})
df['weekday'] = df['weekday'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, "Saturday": 6, "Sunday": 7})

In [None]:
df=df.drop(['functioning_day'], axis = 1)

## Heat Map
In the following section, a heat map is used to discover correlations between the numerical features.
The heat map shows that some features are correlated:
the main feature to analyze, "rented bike count", shows a relevant level of correlation with the season, the temperature, and the dew point temperature (the temperature at which the air becomes saturated with water vapour, which is closely linked to humidity).


In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

sns.heatmap(df.corr(numeric_only=True),  # numeric columns only; the datetime column is excluded
            annot=True, 
            fmt='1.2f', 
            annot_kws={'size': 12},
            linewidths=1, 
            linecolor='white',
            cmap='Purples',
            cbar_kws={"shrink": 0.75, "aspect": 30})

plt.title('Correlation Heatmap of Bike Rental Data', fontsize=18, fontweight='bold', pad=20)
plt.xticks(fontsize=12) 
plt.yticks(fontsize=12) 
plt.tight_layout()

plt.show()
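
Each cell of the heat map is a Pearson correlation coefficient. As a minimal sketch of what that number measures, the function below computes it from its definition and compares it with np.corrcoef. The toy arrays and the helper name pearson_r are assumptions made for this example, not values from the dataset.

```python
import numpy as np

# Hypothetical toy values, chosen only to illustrate the formula
temperature = np.array([-5.0, 0.0, 10.0, 20.0, 30.0])
rentals = np.array([100.0, 250.0, 600.0, 900.0, 1100.0])

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_manual = pearson_r(temperature, rentals)
r_numpy = np.corrcoef(temperature, rentals)[0, 1]
print(r_manual, r_numpy)
```

The coefficient only captures linear association, so a low value does not rule out a non-linear relationship, such as the one between hour and rentals.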

## Data Exploration

In this section, we conduct a thorough exploration of the dataset, employing visualization techniques and examining correlations between features to gain a deeper understanding.


In [None]:
plt.figure(figsize=(10, 6))
sns.set_palette("husl")

ax = sns.barplot(x="seasons", y="rented_bike_count", data=df, errorbar=None)

plt.title("Bike Rentals by Season", fontsize=16, fontweight='bold')
plt.xlabel("Seasons", fontsize=14)
plt.ylabel("Rented Bike Count", fontsize=14)

for patch in ax.patches:
    patch.set_edgecolor('black')
    patch.set_linewidth(2)

for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2, p.get_height()), 
                ha='center', va='bottom', fontsize=12, color='black')

plt.tight_layout()
plt.show()

We start by visualizing the number of bikes rented per season, then break the same variable down by month and by hour of the day. The plots show that rentals peak in summer and autumn, when weather conditions are favourable. The plot of bike rentals by month confirms this: the rental service is used most from May to October, with a peak in June.


In [None]:
plt.figure(figsize=(10, 6))
sns.set_palette("Set2")

ax = sns.barplot(x="month", y="rented_bike_count", data=df, errorbar=None)

plt.title("Bike Rentals by Month", fontsize=16, fontweight='bold')
plt.xlabel("Months", fontsize=14)
plt.ylabel("Rented Bike Count", fontsize=14)

for patch in ax.patches:
    patch.set_edgecolor('black')
    patch.set_linewidth(2)

for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2, p.get_height()), 
                ha='center', va='bottom', fontsize=12, color='black')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.set_palette("flare")

ax = sns.barplot(x="hour", y="rented_bike_count", data=df, errorbar=None)

plt.title("Bike Rentals by Hour", fontsize=16, fontweight='bold')
plt.xlabel("Hour", fontsize=14)
plt.ylabel("Rented Bike Count", fontsize=14)

for patch in ax.patches:
    patch.set_edgecolor('black')
    patch.set_linewidth(2)

for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2, p.get_height()), 
                ha='center', va='bottom', fontsize=12, color='black')

plt.tight_layout()
plt.show()

The rentals per hour help to identify the most likely uses: commuting between home and work, and moving around the city in the evening, around and after dinner time.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['humidity'], df['rented_bike_count'], c="#61b7f7", alpha=0.5)
plt.title('Rented Bike Count vs. Humidity')
plt.xlabel('Humidity (%)')
plt.ylabel('Rented Bike Count')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")

scatter = plt.scatter(x=df['temperature'], y=df['rented_bike_count'], 
                      c=df['rented_bike_count'], cmap='coolwarm', alpha=0.7)

plt.title('Temperature and Rented Bike Count in Seoul', fontsize=14)
plt.xlabel('Temperature (°C)', fontsize=12)
plt.ylabel('Rented Bike Count', fontsize=12)

plt.colorbar(scatter, label='Rented Bike Count')

plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 6))

# Regression plot
sns.regplot(
    x=df['temperature'], y=df['rented_bike_count'],
    scatter_kws={'alpha': 0.6},
    line_kws={'color': 'red'},
    ci=None
)

# Add title and labels
plt.title('Relationship Between Temperature and Rented Bike Count', fontsize=14)
plt.xlabel('Temperature (°C)', fontsize=12)
plt.ylabel('Rented Bike Count', fontsize=12)

In [None]:
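corr = df.corr(numeric_only=True)  # correlations between numeric columns only
features = corr["rented_bike_count"]
# Keep the features whose absolute correlation with the target exceeds 0.199
significant_features = features[features.abs() > 0.199]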

In [None]:
significant_features

## First Simple Regression

Simple regression of the rented bike count on the temperature.

In [None]:
# Define the data frame
X = pd.DataFrame({'intercept': np.ones(df.shape[0]), 'temperature': df['temperature']})
X[:5]

In [None]:
y= df['rented_bike_count']
model = sm.OLS(y,X) # does not fit the model, but specifies it 
results = model.fit()

In [None]:
results.summary()
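
With a single predictor, the least-squares coefficients reported in the summary have a closed form: the slope is the ratio of the co-variation of x and y to the variation of x, and the intercept follows from the means. The sketch below checks this on toy data; the arrays and the line y = 200 + 30 x are assumptions made for illustration, not estimates from the dataset.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 200 + 30 * x
x = np.array([-10.0, 0.0, 10.0, 20.0, 30.0])
y = 200.0 + 30.0 * x

x_mean, y_mean = x.mean(), y.mean()

# Closed-form least-squares estimates for one predictor
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

print(intercept, slope)
```

On noise-free data the estimates recover the line exactly; on real data they give the line that minimizes the sum of squared residuals.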

In [None]:
new_df = pd.DataFrame({'intercept': np.ones(3), 'temperature': [1,6,-9]})
new_df

In [None]:
new_predictions = results.get_prediction(new_df)

In [None]:
new_predictions.predicted_mean

In [None]:
# Produce confidence intervals for the predicted values:
new_predictions.conf_int(alpha=0.05)

In [None]:
# Prediction intervals are computed by setting obs=True:
new_predictions.conf_int(obs=True, alpha=0.05)
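
The difference between the two intervals can be written out for the simple regression model. The formulas below are a sketch under the usual linear-model assumptions; the symbols $n$, $\bar{x}$, $S_{xx}$, $s$ and $x_0$ are notation introduced here, not names used in the notebook.

```latex
\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0, \qquad
s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
S_{xx} = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2

% confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% prediction interval for a single new observation at x_0
\hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root accounts for the noise of a single new observation, so the prediction interval is always wider than the confidence interval at the same $x_0$.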

In [None]:
def abline(ax, b, m): # defining the function 
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m*xlim[0] +b, m*xlim[1] +b]
    ax.plot(xlim, ylim)

In [None]:
# Including additional arguments: *args allows any number of non-named arguments to abline
def abline(ax, b, m, *args, **kwargs):  # **kwargs allows any number of named arguments, e.g., linewidth=3
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)

In [None]:
# Use the new function to add the regression line to the plot of rented_bike_count vs. temperature:
ax = df.plot.scatter('temperature', 'rented_bike_count')
abline(ax,
       results.params.iloc[0],  # intercept
       results.params.iloc[1],  # slope for temperature
       'r--',                   # produces a red dashed line
       linewidth=3)             # sets the line width

## Multiple Regression

In [None]:
# Multiple linear regression using least squares, with temperature, humidity, rainfall,
# hour and weekday as predictors:
X = pd.DataFrame({'intercept': np.ones(df.shape[0]), 'temperature': df['temperature'], 'humidity' : df['humidity'], 'rain': df['rainfall'], 'hour' : df['hour'], 'weekday': df['weekday'] })
model_2pred = sm.OLS(y,X)
results_2pred = model_2pred.fit()
results_2pred.summary()

In [None]:
X

In [None]:
new_de2 = pd.DataFrame({'intercept': np.ones(3), 'temperature': [1,7,10], 'humidity' : [37,70,40], 'rain': [0,0,0], 'hour' : [19,8,7], 'weekday': [1,7,3] })
new_de2

In [None]:
new_predictions = results_2pred.get_prediction(new_de2)
new_predictions.predicted_mean
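
The predicted mean of a linear model is the dot product of a row of predictor values (including the intercept column) with the fitted coefficients. The sketch below shows this with np.linalg.lstsq on a small noise-free design matrix. The matrix, the coefficient values and the names X_toy, true_beta and x_new are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy design matrix: intercept column, temperature, hour
X_toy = np.array([
    [1.0, 0.0, 8.0],
    [1.0, 10.0, 18.0],
    [1.0, 20.0, 12.0],
    [1.0, 30.0, 6.0],
])
true_beta = np.array([100.0, 20.0, 5.0])
y_toy = X_toy @ true_beta  # noise-free responses

# Least-squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# A prediction is the dot product of a new row with the coefficients
x_new = np.array([1.0, 15.0, 9.0])
y_new = x_new @ beta_hat
print(beta_hat, y_new)
```

This is why the new dataframe passed to get_prediction must contain the same columns, in the same order, as the design matrix used for fitting.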

## K Nearest Neighbors and Cross Validation

In [None]:
X = pd.DataFrame({'temperature': df['temperature'], 'humidity' : df['humidity'], 'hour' : df['hour'], 'visibility' : df['visibility'], 'dew_point_temperature' : df['dew_point_temperature'], 'solar_radiation' : df['solar_radiation'], 'seasons':df['seasons'],'year' : df['year'] })

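All selected features are numeric, so they can be passed to KNN directly. The raw date column is not included; only the year derived from it is used.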

In [None]:
X.head()

In [None]:
y = pd.DataFrame({ 'rented_bike_count': df['rented_bike_count']})
y.head()

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y, random_state =7, train_size=0.75)

In [None]:
knn = KNeighborsRegressor(n_neighbors=7)

In [None]:
knn.fit(X_train,y_train)

In [None]:
y_pred = knn.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print("Validation MSE:", mse)
print("Validation R^2:", r2)  # for a regressor the score is R^2, not a classification accuracy
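
For intuition, a KNN regressor with uniform weights predicts the average target of the k closest training points. The sketch below reproduces that rule with numpy on toy data. The helper knn_predict and the arrays are assumptions made for this example; it mirrors the default Euclidean distance and uniform weights of KNeighborsRegressor but is not its implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict by averaging the targets of the k training points closest to x_query."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                          # indices of the k closest rows
    return y_train[nearest].mean()

# Hypothetical toy data: one feature (temperature) and a rental count
X_toy = np.array([[0.0], [5.0], [10.0], [20.0], [30.0]])
y_toy = np.array([100.0, 200.0, 300.0, 700.0, 900.0])

# The three closest points to 6 degrees are 5, 10 and 0 degrees
prediction = knn_predict(X_toy, y_toy, np.array([6.0]), k=3)
print(prediction)
```

Because the distance is computed on raw feature values, features measured on larger scales weigh more in the neighbour search.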

In [None]:
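# Search for the number of neighbours k that gives the best validation R^2
m = r2  # best R^2 so far (the k=7 model above)
n = 7   # corresponding number of neighbours
for k in range(2, 200):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_val)
    r2 = r2_score(y_val, y_pred)

    if r2 > m:
        m = r2
        n = k
[m, n]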

In [None]:
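# KNN regressor evaluated by cross-validation (k=7, as in the first model above)
clf = KNeighborsRegressor(n_neighbors=7)
loo = LeaveOneOut()
# R^2 is undefined on a single held-out sample, so leave-one-out uses the (negated) squared error
scores = cross_val_score(clf, X, y, cv=loo, scoring='neg_mean_squared_error')
print("LOOCV mean squared error: ", -scores.mean())

In [None]:
# 100-fold cross-validation; the default score of a regressor is R^2
scores = cross_val_score(clf, X, y, cv=100)
print("CV score in average: ", scores.mean())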
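
The mechanics of k-fold cross-validation can be sketched by hand: split the rows into folds, hold each fold out once, fit on the rest, and average the validation errors. The example below does this with numpy for a least-squares line on toy data. The helper k_fold_mse, the toy arrays and the use of a straight line instead of KNN are assumptions made to keep the sketch short. The folds are contiguous and unshuffled, which is also how an integer cv splits the rows for a regressor.

```python
import numpy as np

def k_fold_mse(x, y, n_folds):
    """Average validation MSE of a least-squares line over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(x)), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False  # hold out the current fold
        slope, intercept = np.polyfit(x[train_mask], y[train_mask], deg=1)
        residuals = y[val_idx] - (intercept + slope * x[val_idx])
        errors.append((residuals ** 2).mean())
    return float(np.mean(errors)), len(folds)

# Hypothetical noise-free toy data on the line y = 50 + 3 * x
x_toy = np.arange(20, dtype=float)
y_toy = 50.0 + 3.0 * x_toy

cv_mse, n_used = k_fold_mse(x_toy, y_toy, n_folds=5)
loo_mse, n_loo = k_fold_mse(x_toy, y_toy, n_folds=len(x_toy))  # leave-one-out
print(cv_mse, n_used, loo_mse, n_loo)
```

Leave-one-out is the special case where the number of folds equals the number of rows, which is why it requires as many model fits as there are observations.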