# Bike Sharing Data Analysis Project

## Business Questions
- Which user type rents bicycles the most: casual or registered?
- At what time of day are bicycle rentals highest, and at what time are they lowest?
- Which season has the most bicycle rentals?
- How did the company's sales performance change from 2011 to 2012?

## Import Required Libraries

In [3]:
import numpy as np
import pandas as pd
import zipfile
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## Data Wrangling

### Data Gathering

In [4]:
# Download using gdown
!gdown 1RaBmV6Q6FYWU4HWZs80Suqd7KQC34diQ

Downloading...
From: https://drive.google.com/uc?id=1RaBmV6Q6FYWU4HWZs80Suqd7KQC34diQ
To: /home/usernx/codespace/bikesharing/Bike-sharing-dataset.zip
100%|████████████████████████████████████████| 280k/280k [00:00<00:00, 2.34MB/s]


In [5]:
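# Extract the zip archive into the data/ directory
# (a context manager closes the file and avoids shadowing the built-in zip)
content = 'Bike-sharing-dataset.zip'
with zipfile.ZipFile(content, 'r') as archive:
    archive.extractall('data/')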

### Data Loading

In [143]:
# Load hour.csv into a DataFrame
dfh = pd.read_csv("data/hour.csv")
dfh.head(len(dfh))

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0000,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0000,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.80,0.0000,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0000,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,17375,2012-12-31,1,1,12,19,0,1,1,2,0.26,0.2576,0.60,0.1642,11,108,119
17375,17376,2012-12-31,1,1,12,20,0,1,1,2,0.26,0.2576,0.60,0.1642,8,81,89
17376,17377,2012-12-31,1,1,12,21,0,1,1,1,0.26,0.2576,0.60,0.1642,7,83,90
17377,17378,2012-12-31,1,1,12,22,0,1,1,1,0.26,0.2727,0.56,0.1343,13,48,61


In [7]:
# Load data day.csv as a table
dfd = pd.read_csv("data/day.csv")
dfd.head(len(dfd))

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.200000,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.229270,0.436957,0.186900,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,2012-12-27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.253333,0.255046,0.590000,0.155471,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.253333,0.242400,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.231700,0.483333,0.350754,364,1432,1796


### Data Assessing

In [12]:
# Print information about the hour.csv dataset
dfh.info(verbose=True, buf=None, max_cols=None, memory_usage=None, show_counts=None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


In [13]:
# Print information about the day.csv dataset
dfd.info(verbose=True, buf=None, max_cols=None, memory_usage=None, show_counts=None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


In [15]:
# Function for assessing data quality
def data_assesing(data):

    # Display the total number of NaN and Null values in each column, sorted in descending order
    print(f"Total NaN/Null Data per Column:\n{data.isna().sum().sort_values(ascending=False)}\n")

    # Display the shape of the dataset
    print(f"Data Shape:\n{data.shape}")

    # Display the total number of duplicated rows in the dataset
    print(f"\nTotal Duplicated Data: {data.duplicated().sum()}")

# Call the function to assess the hour.csv dataset
data_assesing(dfh)

Total NaN/Null Data per Column:
instant       0
weathersit    0
registered    0
casual        0
windspeed     0
hum           0
atemp         0
temp          0
workingday    0
dteday        0
weekday       0
holiday       0
hr            0
mnth          0
yr            0
season        0
cnt           0
dtype: int64

Data Shape:
(17379, 17)

Total Duplicated Data: 0


In [16]:
# Call the function to assess the day.csv dataset
data_assesing(dfd)

Total NaN/Null Data per Column:
instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

Data Shape:
(731, 16)

Total Duplicated Data: 0


In [17]:
# Generate descriptive statistics for the hour.csv dataset
dfh.describe()

Unnamed: 0,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
std,5017.0295,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


In [18]:
# Generate descriptive statistics for the day.csv dataset
dfd.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


### Data Cleaning

In [22]:
# Rename selected columns to more descriptive names
dfd.rename(columns={'yr':'year',
                    'mnth':'month',
                    'hum':'humidity',
                    'cnt':'count',
                    'dteday':'Datetime'
                    }, inplace=True)

# Capitalize each column name
dfd.columns = dfd.columns.str.title()

# Change the 'Datetime' data type from object to datetime
dfd['Datetime'] = pd.to_datetime(dfd['Datetime'])
dfd.set_index('Datetime', inplace=True)

# Show the dataset
dfd.head(len(dfd))

Datetime,Instant,Season,Year,Month,Holiday,Weekday,Workingday,Weathersit,Temp,Atemp,Humidity,Windspeed,Casual,Registered,Count
2011-01-01,1,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
2011-01-02,2,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2011-01-03,3,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
2011-01-04,4,1,0,1,0,2,1,1,0.200000,0.212122,0.590435,0.160296,108,1454,1562
2011-01-05,5,1,0,1,0,3,1,1,0.226957,0.229270,0.436957,0.186900,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-27,727,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
2012-12-28,728,1,1,12,0,5,1,2,0.253333,0.255046,0.590000,0.155471,644,2451,3095
2012-12-29,729,1,1,12,0,6,0,2,0.253333,0.242400,0.752917,0.124383,159,1182,1341
2012-12-30,730,1,1,12,0,0,0,1,0.255833,0.231700,0.483333,0.350754,364,1432,1796


In [144]:
# Rename selected columns to more descriptive names
dfh.rename(columns={'yr':'year',
                    'mnth':'month',
                    'hum':'humidity',
                    'cnt':'count',
                    'dteday':'Datetime',
                    'hr':'Hour'
                    }, inplace=True)

# Capitalize each column name
dfh.columns = dfh.columns.str.title()

# Show the dataset
dfh.head(len(dfh))

Unnamed: 0,Instant,Datetime,Season,Year,Month,Hour,Holiday,Weekday,Workingday,Weathersit,Temp,Atemp,Humidity,Windspeed,Casual,Registered,Count
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0000,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0000,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.80,0.0000,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0000,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,17375,2012-12-31,1,1,12,19,0,1,1,2,0.26,0.2576,0.60,0.1642,11,108,119
17375,17376,2012-12-31,1,1,12,20,0,1,1,2,0.26,0.2576,0.60,0.1642,8,81,89
17376,17377,2012-12-31,1,1,12,21,0,1,1,1,0.26,0.2576,0.60,0.1642,7,83,90
17377,17378,2012-12-31,1,1,12,22,0,1,1,1,0.26,0.2727,0.56,0.1343,13,48,61


## Exploratory Data Analysis

In [38]:
# Displays casual and registered users by year
dfd.groupby(by="Year").agg({"Registered": "sum","Casual": "sum"})

Year,Registered,Casual
0,995851,247252
1,1676811,372765


Seasons Dictionary:
- 1 : Spring
- 2 : Summer
- 3 : Fall 
- 4 : Winter
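
The numeric codes can be turned into readable labels before grouping. The following is a minimal sketch that assumes the mapping listed above; the `SEASON_NAMES` name and the small `sample` frame (with made-up values) are illustrative and not part of the notebook.

```python
import pandas as pd

# Season codes as listed in the dictionary above
SEASON_NAMES = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}

# Hypothetical stand-in for a few rows of the daily table (values are made up)
sample = pd.DataFrame({
    "Season": [1, 1, 2, 3, 3, 4],
    "Count": [100, 150, 300, 500, 450, 200],
})

# Map each code to its name, then total the rentals per season
season_totals = (
    sample.assign(SeasonName=sample["Season"].map(SEASON_NAMES))
    .groupby("SeasonName")["Count"]
    .sum()
    .sort_values(ascending=False)
)
print(season_totals)
```

Applying the same `map` call to `dfd["Season"]` should make the grouped output easier to read than the raw codes.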

In [35]:
# Displays the number of users by season
dfd.groupby(by="Season").Count.sum().sort_values(ascending=False).reset_index().head(10)

Unnamed: 0,Season,Count
0,3,1061129
1,2,918589
2,4,841613
3,1,471348


The results above show that fall (season 3) has the most rentals, with 1,061,129 in total over the two years.

In [45]:
# Displays the total number of rentals by month
dfd.groupby(by="Month").Count.sum().sort_values(ascending=False).reset_index().head(len(dfd))

Unnamed: 0,Month,Count
0,8,351194
1,6,346342
2,9,345991
3,7,344948
4,5,331686
5,10,322352
6,4,269094
7,11,254831
8,3,228920
9,12,211036


The totals above show that August (month 8) has the most rentals, with 351,194 over the two years.

In [90]:
# Displays the total number of rentals by hour of day
dfh.groupby(by="Hour").Count.sum().sort_values(ascending=False).reset_index().head(len(dfh))

Unnamed: 0,Hour,Count
0,17,336860
1,18,309772
2,8,261001
3,16,227748
4,19,226789
5,13,184919
6,12,184414
7,15,183149
8,14,175652
9,20,164550


In [80]:
corr = dfd.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Apply the mask to the correlation matrix
corr_masked = corr.mask(mask)
# Create the heatmap
fig = go.Figure(data=go.Heatmap(
    z=corr_masked.values,
    x=corr_masked.columns,
    y=corr_masked.index,
    colorscale=[(0, '#636EFA'), (0.5, 'white'), (1, '#EF553B')],
    zmin=-1,
    zmax=1,
    text=np.round(corr_masked.values, 2),
    hoverinfo='text'
))
# Add annotations
annotations = []
for i in range(len(corr_masked)):
    for j in range(len(corr_masked)):
        if not mask[i, j]:
            annotations.append(
                dict(
                    x=corr_masked.columns[j],
                    y=corr_masked.index[i],
                    text=str(np.round(corr_masked.iloc[i, j], 2)),
                    showarrow=False,
                    font=dict(color="black")
                )
            )
fig.update_layout(
    width=1000,  # Set the width of the plot
    height=600,  # Set the height of the plot
    title='Correlation Heatmap',
    annotations=annotations,
    xaxis=dict(tickmode='array', tickvals=list(range(len(corr.columns))), ticktext=corr.columns),
    yaxis=dict(tickmode='array', tickvals=list(range(len(corr.index))), ticktext=corr.index)
)
fig.show()

The correlation matrix shows two very strong relationships. Temp and atemp have a correlation of 0.99, which is expected because both describe temperature. Registered and count have a correlation of 0.95, which is expected because rentals by registered users make up most of the total count.
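
Reading the strongest pairs off a heatmap by eye is error-prone, so they can also be listed programmatically. The sketch below is one possible approach, not the notebook's method: `top_correlations` is a hypothetical helper, and the `demo` frame is synthetic data built so that two columns are nearly collinear.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict lower triangle so each pair appears exactly once
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
    # Flatten to a Series indexed by (row, column) and drop the masked cells
    pairs = lower.stack().dropna()
    # Sort by absolute value so strong negative correlations are ranked too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data: "Atemp" is almost a rescaled copy of "Temp"
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 200)
demo = pd.DataFrame({
    "Temp": temp,
    "Atemp": temp * 0.9 + rng.normal(0, 0.01, 200),
    "Windspeed": rng.uniform(0, 1, 200),
})

top = top_correlations(demo, n=1)
print(top)
```

Calling `top_correlations(dfd)` on the daily table would be expected to surface the temp/atemp and registered/count pairs discussed above.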

## Visualization & Explanatory Analysis
Business Questions
- Which user type rents bicycles the most: casual or registered?
- At what time of day are bicycle rentals highest, and at what time are they lowest?
- Which season has the most bicycle rentals?
- How did the company's sales performance change from 2011 to 2012?

### Question 1


In [78]:
# Assuming dfd is a DataFrame with 'Registered' and 'Casual' columns
registered_sum = dfd['Registered'].sum()
casual_sum = dfd['Casual'].sum()

# Create a subplot with 1 row and 2 columns, with the first column being the bar chart and the second column being the pie chart
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "bar"}, {"type": "pie"}]],
)

# Add bar chart to the first column
fig.add_trace(
    go.Bar(
        x=["Registered", "Casual"],
        y=[registered_sum, casual_sum],
        marker_color=["#EF553B", "#636EFA"],
        showlegend=False  # Hide legend for the bar chart
    ),
    row=1, col=1
)

# Add pie chart to the second column
fig.add_trace(
    go.Pie(
        labels=["Registered", "Casual"],
        values=[registered_sum, casual_sum],
        marker=dict(colors=["#EF553B", "#636EFA"]),
        showlegend=False  # Hide legend for the pie chart
    ),
    row=1, col=2
)

# Update layout for the figure
fig.update_layout(
    width=1000,  # Set the width of the plot
    height=600,  # Set the height of the plot
    title_text="Total and Proportional Rides by User Type",
    template="plotly_white"
)

# Show the plot
fig.show()

# Print the counts
print(f"Count of Registered Users: {registered_sum}")
print(f"Count of Casual Users: {casual_sum}")

Count of Registered Users: 2672662
Count of Casual Users: 620017


The totals above show that registered users account for most rentals: 2,672,662 of 3,292,679 rides, or just over 81%, compared with 620,017 rides by casual users. Note that these figures count rentals, not distinct people.
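
The percentage split can be checked directly from the two printed totals. This is a small worked example using the values from the output above; the variable names for the shares are illustrative.

```python
# Totals printed by the cell above
registered_sum = 2_672_662
casual_sum = 620_017

total = registered_sum + casual_sum

# Share of all rentals made by each user type, in percent
registered_share = round(100 * registered_sum / total, 1)
casual_share = round(100 * casual_sum / total, 1)

print(f"Total rentals: {total}")
print(f"Registered: {registered_share}%  Casual: {casual_share}%")
```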

### Question 2

In [127]:
sumhours = dfh.groupby("Hour").Count.sum().sort_values(ascending=False).reset_index()

In [148]:
# Create a subplot with 1 row and 2 columns
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Hours with Most Bike Rentals", "Hours with Fewest Bike Rentals")
)

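# Top five hours by total rentals (sumhours is already sorted by Count, descending)
most_rentals = sumhours.head(5)
# Bottom five hours by total rentals; sorting by "Hour" would only select hours 0-4
least_rentals = sumhours.sort_values(by="Count", ascending=True).head(5)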

# Add bar chart for the most bike rentals
fig.add_trace(
    go.Bar(
        x=most_rentals["Hour"],
        y=most_rentals["Count"],
        marker_color="#EF553B",  # Set color
        showlegend=False  # Hide legend for the bar chart
    ),
    row=1, col=1
)

# Add bar chart for the fewest bike rentals
fig.add_trace(
    go.Bar(
        x=least_rentals["Hour"],
        y=least_rentals["Count"],
        marker_color="#636EFA",  # Set color
        showlegend=False  # Hide legend for the bar chart
    ),
    row=1, col=2
)

# Update layout for the figure
fig.update_layout(
    width=1000,
    height=600,
    title_x=0.5,
    template="plotly_white"
)

# Update x-axis and y-axis properties for both subplots
fig.update_xaxes(title_text="Hour of Day (24-hour clock)", row=1, col=1, tickfont=dict(size=15))
fig.update_yaxes(tickfont=dict(size=15), row=1, col=1)

fig.update_xaxes(title_text="Hours (AM)", row=1, col=2, tickfont=dict(size=15))
fig.update_yaxes(tickfont=dict(size=15), row=1, col=2)

# Reverse x-axis for the second plot
fig['layout']['xaxis2']['autorange'] = 'reversed'

# Show the plot
fig.show()


The chart shows that rentals peak at 17:00, with 336,860 rentals in total over the two years. This is probably because many people finish work or school around this time and need transport home or want to ride for recreation. It is also usually still light at this hour, which makes cycling more practical.
Rentals are lowest at 04:00, with only 4,428 in total. This is understandable, since most people are asleep and there is little outdoor activity in the early morning. Cold and dark conditions at that hour may also discourage cycling.
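
The busiest and quietest hours can also be read straight from the grouped totals without a chart. The sketch below uses only a subset of the hourly totals reported in this notebook (the five largest values plus the 04:00 figure), so it illustrates the technique rather than reproducing the full 24-hour table; the `hourly` name is an assumption.

```python
import pandas as pd

# Subset of the hourly totals reported above (not the full 24-hour table)
hourly = pd.Series(
    {17: 336860, 18: 309772, 8: 261001, 16: 227748, 19: 226789, 4: 4428},
    name="Count",
)
hourly.index.name = "Hour"

# Hours ranked by total rentals
busiest = hourly.nlargest(3)
# Hour with the smallest total
quietest_hour = hourly.idxmin()

print(busiest)
print(f"Quietest hour: {quietest_hour}:00")
```

With the full data, `dfh.groupby("Hour")["Count"].sum()` would take the place of the hand-built series.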

### Question 3

In [79]:
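# Grouped bar chart of mean daily rentals per season, split by user type
fig = go.Figure()
fig.update_layout(title='Distribution of Seasons',
                  xaxis_title='Season',
                  yaxis_title='Count')

# Mean daily rentals by registered users (season codes 1-4 are in Spring, Summer, Fall, Winter order)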
fig.add_trace(go.Bar(
    x=['Spring', 'Summer', 'Fall', 'Winter'],
    y=dfd.groupby('Season')['Registered'].mean(),
    name='Registered Users',
    marker_color='#EF553B',
    width=0.5
))

fig.add_trace(go.Bar(
    x=['Spring', 'Summer', 'Fall', 'Winter'],
    y=dfd.groupby('Season')['Casual'].mean(),
    name='Casual Users',
    marker_color='#636EFA',
    width=0.5
))

fig.update_layout(
    barmode='group',
    width=1000,  # Set the width of the plot
    height=600,  # Set the height of the plot
    title='Registered and Casual Users (Seasons)',
    xaxis_title='Season',
    yaxis_title='Users Count')

fig.show()

The analysis shows that fall is the most popular season for renting a bicycle. This could be due to generally cooler, more comfortable cycling weather and to the autumn scenery. Summer ranks second, probably because warm weather and school vacations lead to more outdoor activity. Winter ranks third and spring last, perhaps because of colder or less predictable weather. These explanations are hypotheses rather than results of the analysis. Even so, the seasonal pattern can help a bike rental provider plan stock and marketing around the higher demand in fall and summer.

### Question 4

In [150]:
# Create a figure
fig = go.Figure()

# Add trace for Registered users (same color as in the earlier charts)
fig.add_trace(go.Scatter(x=dfd.index, y=dfd['Registered'], mode='lines', name='Registered', line=dict(color='#EF553B')))

# Add trace for Casual users
fig.add_trace(go.Scatter(x=dfd.index, y=dfd['Casual'], mode='lines', name='Casual', line=dict(color='#636EFA')))

# Update layout
fig.update_layout(
    width=1000,  # Set the width of the plot
    height=600,  # Set the height of the plot
    title='Registered and Casual Over Time',
    xaxis_title='Datetime',
    yaxis_title='Count'
)

fig.show()

The visualization shows that daily rentals peak in September, which may reflect a seasonal rise in demand or a promotion that attracted many customers that month. The data alone cannot distinguish between the two.
There is also a marked drop in rentals in November and December.
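
A year-over-year comparison gives a more direct measure of performance than the daily line chart. The sketch below uses the yearly totals from the exploratory step (year codes 0 and 1 correspond to 2011 and 2012); the `yearly` and `growth_pct` names are illustrative.

```python
import pandas as pd

# Yearly rental totals by user type, as reported in the exploratory analysis
yearly = pd.DataFrame(
    {"Registered": [995851, 1676811], "Casual": [247252, 372765]},
    index=pd.Index([2011, 2012], name="Year"),
)
yearly["Total"] = yearly["Registered"] + yearly["Casual"]

# Percentage change from 2011 to 2012 for each column
growth_pct = (yearly.pct_change().loc[2012] * 100).round(1)

print(yearly)
print(growth_pct)
```

On these totals, overall rentals grow by roughly 65% from 2011 to 2012, with registered rentals growing faster than casual ones.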

## Conclusion

### Question 1
Which user type rents bicycles the most: casual or registered?

<p align="justify">
Registered users account for 81.2% of rentals, while casual (unregistered) users account for 18.8%.
</p>

### Question 2
At what time of day are bicycle rentals highest, and at what time are they lowest?

<p align="justify">
Rentals peak at 17:00 (5 PM) and are lowest at 04:00 (4 AM).
</p>

### Question 3
Which season has the most bicycle rentals?

<p align="justify">
The season with the most bicycle rental users is the <b>fall</b> season.
</p>

### Question 4
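How did the company's sales performance change from 2011 to 2012?

<p align="justify">
Total rentals grew from 1,243,103 in 2011 to 2,049,576 in 2012, an increase of about 65%. Daily rentals peaked in September 2012 and dropped markedly in November and December. The lowest levels appear at the start of the period, in January 2011.
</p>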

## Implementation of Advanced Analysis Techniques

### RFM Analysis
RFM (Recency, Frequency, Monetary) analysis is a marketing method used to analyze and categorize customers based on their behavior in three main dimensions:

- Recency: How recently the customer made a purchase. Customers who have recently made a purchase are more likely to make another purchase compared to those who have not transacted for a long time.

- Frequency: How often customers make purchases within a certain period of time. Customers who transact frequently tend to be more loyal and valuable to the business.

- Monetary: How much money customers spend on their purchases. Customers who spend more money are considered more valuable to the business.
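
The three dimensions can be computed with a single grouped aggregation when a transaction log with customer identifiers is available. The sketch below is illustrative only: the `transactions` table, its `customer_id`, `date`, and `amount` columns, and all values are hypothetical, since the bike-sharing tables contain no customer identifier.

```python
import pandas as pd

# Hypothetical transaction log; the bike-sharing tables have no customer ID
transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "date": pd.to_datetime([
        "2012-12-01", "2012-12-20", "2012-11-15",
        "2012-12-29", "2012-12-30", "2012-12-31",
    ]),
    "amount": [10.0, 15.0, 40.0, 5.0, 5.0, 5.0],
})

# Reference date: the most recent transaction in the log
snapshot = transactions["date"].max()

rfm = transactions.groupby("customer_id").agg(
    # Days since each customer's latest purchase
    Recency=("date", lambda d: (snapshot - d.max()).days),
    # Number of purchases per customer
    Frequency=("date", "count"),
    # Total amount spent per customer
    Monetary=("amount", "sum"),
)
print(rfm)
```

Here customer `c` is the most recent and most frequent buyer but spends the least, which shows why the three scores are usually read together rather than separately.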

In [141]:
dfh.head()

Datetime,Instant,Season,Year,Month,Hour,Holiday,Weekday,Workingday,Weathersit,Temp,Atemp,Humidity,Windspeed,Casual,Registered,Count
2011-01-01,1,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2011-01-01,2,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2011-01-01,3,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
2011-01-01,4,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
2011-01-01,5,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [147]:
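# Convert the date column to datetime so date arithmetic works
dfh['Datetime'] = pd.to_datetime(dfh['Datetime'])

# Reference date: the most recent date in the dataset
current_date = dfh['Datetime'].max()

# Note: the dataset has no customer ID, so this groups by the hourly
# 'Registered' value. Each distinct value acts as a stand-in "customer",
# which makes the result an approximation rather than a true RFM table.
rfm_df = dfh.groupby('Registered').agg({
    'Datetime': lambda x: (current_date - x.max()).days,  # Recency
    'Instant': 'count',  # Frequency
    'Count': 'sum'  # Monetary
}).reset_index()

rfm_df.columns = ['Registered', 'Recency', 'Frequency', 'Monetary']

print(rfm_df.head())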

   Registered  Recency  Frequency  Monetary
0           0       38         24        35
1           1        0        201       294
2           2        1        245       648
3           3        0        294      1154
4           4        3        307      1602
