# Data Analysis Project: [Bike Sharing]
- **Name:** [Zahra Nadhifah]
- **Email:** [zahraanadhfh@gmail.com]
- **Dicoding ID:** [yourloops]

## Defining Business Questions

- Specific
1.   What is the ratio of total bike rentals on holidays to total bike rentals on working days in 2012?

- Measurable
2.    What is the predicted number of hourly bike rentals based on environmental and seasonal settings?

- Time-bound
3.   On holidays or working days, at what hour during 2012 did bike rentals increase most significantly?



## Import All Packages/Libraries Used

Data Processing Library

In [None]:
!pip install numpy
!pip install pandas
!pip install scipy

In [69]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from plotly.subplots import make_subplots

Data Visualization Library

In [None]:
!pip install matplotlib
!pip install seaborn

## Data Wrangling

### Gathering Data

In [41]:
df_day = pd.read_csv("/content/day.csv")
df_hour = pd.read_csv("/content/hour.csv")

In [42]:
df_day.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [43]:
df_hour.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


The `head()` function displays the first five rows of each dataframe.

### Assessing Data

Datatype Information

In [44]:
print('Dataframe (day) information: ')
print(df_day.info())

print('Dataframe (hour) information: ')
print(df_hour.info())

Dataframe (day) information: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB
None
Dataframe (hour) information: 
<class 'pandas.core.frame.D

The output shows a data type problem in df_day: the `dteday` column is stored as `object`, but it should be a datetime type.

Handling Missing Values

In [45]:
print('Total missing value (DF day): ')
print(df_day.isnull().sum())

Total missing value (DF day): 
instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64


There are no missing values in the day dataframe.

In [46]:
print('Total missing value (DF hour): ')
print(df_hour.isnull().sum())

Total missing value (DF hour): 
instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64


There are no missing values in the hour dataframe.

Handling Duplicate Data

In [47]:
print('Total Duplicate (DF day): ', df_day.duplicated().sum())
print('Total Duplicate (DF hour): ', df_hour.duplicated().sum())

Total Duplicate (DF day):  0
Total Duplicate (DF hour):  0


Neither the day nor the hour dataframe contains duplicate rows.
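The missing-value and duplicate checks above can be bundled into one small helper. This is a minimal sketch under stated assumptions: `quality_report` is a hypothetical name, not part of the notebook, and the toy frame only stands in for `day.csv` with made-up rows.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize basic data-quality checks for one dataframe (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy frame standing in for day.csv (illustrative values only);
# the last two rows are identical on purpose.
toy = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-02", "2011-01-02"],
    "cnt": [985, 801, 801],
})

report = quality_report(toy)
print(report)
```

Calling the same helper on `df_day` and `df_hour` would give both summaries in a consistent shape, which makes them easier to compare side by side.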

Descriptive Statistics

In [48]:
df_day.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


In [49]:
df_hour.describe()

Unnamed: 0,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
std,5017.0295,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


### Descriptive Analysis


*   **Dataset Day**

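1.   Total entries: 731 rows.
2.   Season (`season`): mean code 2.50, median 3. This is the average of a categorical code, not a peak season.
3.   Year (`yr`): mean 0.50, so rows are split almost evenly between 2011 (0) and 2012 (1).
4.   Month (`mnth`): mean 6.52, median 7 (July).
5.   Percentage of holidays: 2.87%.
6.   Percentage of working days: 68.40%.
7.   Weather situation (`weathersit`): mean 1.40, median 1.
8.   Normalized temperature (`temp`): mean 0.50 (unitless, not °C).
9.   Normalized feeling temperature (`atemp`): mean 0.47 (unitless).
10.  Normalized humidity (`hum`): mean 0.63.
11.  Normalized wind speed (`windspeed`): mean 0.19.
12.  Casual users: mean 848 per day.
13.  Registered users: mean 3656 per day.
14.  Total rentals: mean 4504 per day.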




*   **Dataset Hour**

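1.   Total entries: 17,379 rows.
2.   Season (`season`): mean code 2.50, median 3. This is the average of a categorical code, not a peak season.
3.   Year (`yr`): mean 0.50, so rows are split almost evenly between 2011 (0) and 2012 (1).
4.   Month (`mnth`): mean 6.54, median 7 (July).
5.   Hour (`hr`): mean 11.55.
6.   Percentage of holidays: 2.88%.
7.   Percentage of working days: 68.27%.
8.   Weather situation (`weathersit`): mean 1.43, median 1.
9.   Normalized temperature (`temp`): mean 0.50 (unitless, not °C).
10.  Normalized feeling temperature (`atemp`): mean 0.48 (unitless).
11.  Normalized humidity (`hum`): mean 0.63.
12.  Normalized wind speed (`windspeed`): mean 0.19.
13.  Casual users: mean 35.68 per hour.
14.  Registered users: mean 153.79 per hour.
15.  Total rentals: mean 189.46 per hour.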
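The holiday and working-day percentages in both lists come from the `mean` row of `describe()`: the mean of a 0/1 column equals the fraction of rows equal to 1. A small sketch on toy data (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: 10 days, 1 holiday, 7 working days (illustrative values only)
toy = pd.DataFrame({
    "holiday":    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "workingday": [1, 1, 1, 1, 1, 0, 0, 1, 1, 0],
})

# The mean of a binary flag is the share of rows where the flag is 1
holiday_share = toy["holiday"].mean() * 100
workingday_share = toy["workingday"].mean() * 100

print(f"Holiday share: {holiday_share:.2f}%")
print(f"Working-day share: {workingday_share:.2f}%")
```

These shares count days (or hours), so they say nothing yet about how many bikes were rented on each kind of day.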



### Cleaning Data

Fixing data type

In [50]:
df_day["dteday"] = pd.to_datetime(df_day["dteday"])
df_hour["dteday"] = pd.to_datetime(df_hour["dteday"])

In [51]:
print('df_day["dteday"] : ', df_day["dteday"].dtypes)
print('df_hour["dteday"] : ', df_hour["dteday"].dtypes)

df_day["dteday"] :  datetime64[ns]
df_hour["dteday"] :  datetime64[ns]
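Once `dteday` is a datetime column, calendar fields are available through the `.dt` accessor, which is the practical reason for fixing the type. A minimal sketch on a toy frame (the three dates are chosen for illustration):

```python
import pandas as pd

# Toy frame with date strings, as read from a CSV
toy = pd.DataFrame({"dteday": ["2011-01-01", "2011-01-02", "2012-12-31"]})

# Convert the string column to a datetime column
toy["dteday"] = pd.to_datetime(toy["dteday"])

# Datetime columns expose calendar fields through the .dt accessor
toy["year"] = toy["dteday"].dt.year
toy["day_name"] = toy["dteday"].dt.day_name()

print(toy.dtypes)
print(toy)
```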


## Exploratory Data Analysis (EDA)

### Explore ...

In [52]:
df_day.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


Based on the summary statistics above, about 2.87% of the days in the dataset (2011 and 2012 combined) are holidays, and about 68.40% are working days. These figures are the share of days of each type, not the share of users or rentals. They show only that far more rental days fall on working days than on holidays; comparing rental volume between the two requires aggregating `cnt` by day type.

In [66]:
df_hour.describe()

Unnamed: 0,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
std,5017.0295,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


Distribution of numeric variables

In [53]:
numeric_cols = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']

# Create subplots
fig = make_subplots(rows=len(numeric_cols), cols=1, subplot_titles=[f'Distribution of {col}' for col in numeric_cols])

# Populate subplots with histograms
for i, col in enumerate(numeric_cols, start=1):
    histogram = go.Histogram(x=df_day[col], name=f'Distribution of {col}')
    fig.add_trace(histogram, row=i, col=1)

# Update layout
fig.update_layout(height=len(numeric_cols) * 300, showlegend=False, title_text="Distribution of Numeric Variables")

# Show the plot
fig.show()


Distribution of categorical variables

In [54]:
categorical_cols = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

for col in categorical_cols:
    # Count rows per category; naming both columns explicitly works across pandas versions
    counts = df_day[col].value_counts().rename_axis(col).reset_index(name='count')

    # Create bar chart
    fig = px.bar(counts, x=col, y='count')

    # Update layout
    fig.update_layout(title=f'Distribution of {col}', xaxis_title=col, yaxis_title='Count')

    # Show the plot
    fig.show()


Relationship between holidays and rental count

In [55]:
# Create box plot
fig = px.box(df_day, x='holiday', y='cnt')

# Update layout
fig.update_layout(title='Relationship between Holidays and Rental Count', xaxis_title='Holiday', yaxis_title='Rental Count')

# Show the plot
fig.show()


Relationship between working days and rental count

In [56]:
# Create box plot
fig = px.box(df_day, x='workingday', y='cnt')

# Update layout
fig.update_layout(title='Relationship between Working Days and Rental Count', xaxis_title='Working Day', yaxis_title='Rental Count')

# Show the plot
fig.show()


Relationship between season and rental count

In [57]:
# Create box plot
fig = px.box(df_day, x='season', y='cnt')

# Update layout
fig.update_layout(title='Relationship between Season and Rental Count', xaxis_title='Season', yaxis_title='Rental Count')

# Show the plot
fig.show()


Relationship between weather and rental count

In [58]:
# Create box plot
fig = px.box(df_day, x='weathersit', y='cnt')

# Update layout
fig.update_layout(title='Relationship between Weather and Rental Count', xaxis_title='Weather Situation', yaxis_title='Rental Count')

# Show the plot
fig.show()

Hourly bike rentals by weather condition

In [63]:
# Group by weather and hour, then compute the mean rental count
hourly_weather_counts = df_hour.groupby(['weathersit', 'hr'], as_index=False)['cnt'].mean()

# Visualize with a line chart
fig = px.line(hourly_weather_counts, x='hr', y='cnt', color='weathersit',
              labels={'hr': 'Hour', 'cnt': 'Rental Count', 'weathersit': 'Weather Condition'},
              title='Hourly Bike Rentals by Weather Condition')
fig.show()


## Visualization & Explanatory Analysis

### Question 1: (Specific)

What is the ratio of total bike rentals on holidays to total bike rentals on working days in 2012?



In [60]:
# Filter data for 2012
df_2012 = df_day[df_day['yr'] == 1]

# Compute total bike rentals on holidays and on weekdays (weekday < 5) in 2012
total_sewa_holiday = df_2012[df_2012['holiday'] == 1]['cnt'].sum()
total_sewa_weekday = df_2012[df_2012['weekday'] < 5]['cnt'].sum()

print(f"Total bike rentals on holidays in 2012: {total_sewa_holiday} bikes")
print(f"Total bike rentals on weekdays in 2012: {total_sewa_weekday} bikes")

Total bike rentals on holidays in 2012: 48413 bikes
Total bike rentals on weekdays in 2012: 1445728 bikes


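Based on the results above, the ratio of total holiday rentals to total weekday rentals in 2012 is 48413 / 1445728 ≈ 0.0335, or 3.35%. Note that the weekday total is computed with `weekday < 5`. In this dataset the `weekday` codes start at 0 for Sunday (2011-01-02, a Sunday, has `weekday` 0), so this filter covers Sunday through Thursday rather than the days flagged by `workingday`.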
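Because the question asks about working days, a comparison based on the `workingday` flag is closer to its wording. The sketch below runs on a toy frame with made-up numbers, so its output says nothing about the real 2012 totals. It also shows per-day averages, which remove the effect of there being far fewer holidays than working days.

```python
import pandas as pd

# Toy stand-in for df_day (illustrative values, not the real data)
toy = pd.DataFrame({
    "yr":         [1, 1, 1, 1, 1, 0],
    "holiday":    [0, 0, 1, 0, 0, 0],
    "workingday": [1, 1, 0, 0, 1, 1],
    "cnt":        [5000, 6000, 3000, 4000, 7000, 2000],
})

# Keep only 2012 rows (yr == 1)
toy_2012 = toy[toy["yr"] == 1]

# Totals by flag: holidays versus days flagged as working days
holiday_total = toy_2012.loc[toy_2012["holiday"] == 1, "cnt"].sum()
workingday_total = toy_2012.loc[toy_2012["workingday"] == 1, "cnt"].sum()
ratio = holiday_total / workingday_total

# Average rentals per day for non-working (0) and working (1) days
per_day = toy_2012.groupby("workingday")["cnt"].mean()

print(f"Holiday total: {holiday_total}, working-day total: {workingday_total}")
print(f"Ratio: {ratio:.4f}")
print(per_day)
```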

### Question 2: (Measurable)

What is the predicted number of hourly bike rentals based on environmental and seasonal settings?

In [64]:
# Build a new DataFrame with the mean hourly rental count per weather condition
hourly_weather_avg = df_hour.groupby(['hr', 'weathersit'], as_index=False)['cnt'].mean()

# Display the resulting DataFrame
print(hourly_weather_avg)


    hr  weathersit         cnt
0    0           1   59.161554
1    0           2   47.232432
2    0           3   28.115385
3    1           1   34.395918
4    1           2   35.541899
..  ..         ...         ...
70  22           2  116.823171
71  22           3   70.345455
72  23           1   93.981707
73  23           2   85.171598
74  23           3   49.373134

[75 rows x 3 columns]


In [67]:
hourly_weather_avg.describe()

Unnamed: 0,hr,weathersit,cnt
count,75.0,75.0,75.0
mean,11.506667,2.08,158.386736
std,6.996859,0.896841,122.705899
min,0.0,1.0,4.684211
25%,5.5,1.0,48.302783
50%,12.0,2.0,138.060606
75%,17.5,3.0,227.331484
max,23.0,4.0,500.42998


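From the summary above, the mean of the 75 hour-and-weather group averages is about 158 rentals per hour, with group averages ranging from about 4.7 to about 500. The means of `hr` (11.5) and `weathersit` (2.08) are simply averages of the group keys, so they do not single out a particular hour or weather condition. These are descriptive averages rather than model predictions.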
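To see which hour and weather combination has the highest average, the grouped frame can be queried directly instead of summarized with `describe()`. A minimal sketch on toy data (values are illustrative only), using `idxmax` for the top group and `pivot` for an hour-by-weather table:

```python
import pandas as pd

# Toy stand-in for df_hour (illustrative values only)
toy = pd.DataFrame({
    "hr":         [8, 8, 8, 17, 17, 17, 17],
    "weathersit": [1, 1, 2, 1, 1, 3, 3],
    "cnt":        [400, 500, 300, 600, 700, 200, 100],
})

# Mean rentals for each (hour, weather) group
avg = toy.groupby(["hr", "weathersit"], as_index=False)["cnt"].mean()

# Row with the highest group average
best = avg.loc[avg["cnt"].idxmax()]

# Hours as rows, weather codes as columns; missing combinations become NaN
table = avg.pivot(index="hr", columns="weathersit", values="cnt")

print(best)
print(table)
```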

### Question 3: (Time-bound)

On holidays or working days, at what hour during 2012 did bike rentals increase most significantly?

In [65]:
import plotly.express as px

# Filter data for 2012 only
df_2012 = df_hour[df_hour['yr'] == 1]

# Group by holiday flag, working-day flag, and hour, then compute the mean rental count
hourly_day_type_avg = df_2012.groupby(['holiday', 'workingday', 'hr'], as_index=False)['cnt'].mean()


# Visualize with a bar chart
fig = px.bar(hourly_day_type_avg, x='hr', y='cnt', color='workingday',
             facet_col='holiday', labels={'hr': 'Hour', 'cnt': 'Rental Count'},
             title='Average Hourly Bike Rentals on Holidays and Working Days (2012)')

fig.show()

The chart shows that on working days the peak-hour average is 656 rentals.
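The question asks about the largest increase, which is a change between consecutive hours rather than a peak level. One way to measure it, sketched here on toy data with made-up values, is to take the hour-over-hour difference of the average within each day type and pick the largest one:

```python
import pandas as pd

# Toy stand-in for the 2012 hourly data (illustrative values only)
toy = pd.DataFrame({
    "workingday": [0] * 4 + [1] * 4,
    "hr":         [6, 7, 8, 9] * 2,
    "cnt":        [10, 20, 50, 90, 80, 300, 650, 350],
})

# Mean rentals per day type and hour, ordered by hour within each day type
avg = toy.groupby(["workingday", "hr"], as_index=False)["cnt"].mean()
avg = avg.sort_values(["workingday", "hr"])

# Change relative to the previous hour, computed separately per day type
avg["increase"] = avg.groupby("workingday")["cnt"].diff()

# Hour with the largest increase for each day type
idx = avg.groupby("workingday")["increase"].idxmax()
steepest = avg.loc[idx, ["workingday", "hr", "increase"]]

print(steepest)
```

The first hour of each day type has no previous hour, so its increase is NaN and is skipped by `idxmax`.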

## Conclusion

- Conclusion for question 1
Based on this comparison, most bike rentals in 2012 occurred on working days, while holidays contributed much less to total rentals. Part of this gap reflects that only about 2.87% of days are holidays. It may also reflect people's tendency to use bikes more on working days than on holidays.

- Conclusion for question 2
Based on the grouped averages, the mean across all hour and weather groups is about 158 rentals per hour. Weather affects rental counts: in the output shown, less favorable weather (`weathersit` 3) has lower averages than clear weather (`weathersit` 1), for example about 28 versus about 59 rentals at hour 0.

- Conclusion for question 3
On working days a common pattern appears: rentals rise in the morning, fall during midday, and rise again from late afternoon into the evening. The peaks occur at 8 a.m. and 5 p.m., which can be linked to commuting to and from work.

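RFM Analysis Technique

Recency (how recently a customer last interacted with the business), Frequency (how often the customer interacts), and Monetary (how much money the customer spends). The dataset has no customer identifier, so the cell below groups rows by the `registered` count instead. Its Recency, Frequency, and Monetary columns therefore describe groups of hours that share the same registered count, with `cnt` standing in for money, not individual customers.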

In [72]:
# Compute RFM
current_date = max(df_hour['dteday'])
rfm_df = df_hour.groupby('registered').agg({
    'dteday': lambda x: (current_date - x.max()).days,  # Recency
    'instant': 'count',  # Frequency
    'cnt': 'sum'  # Monetary
}).reset_index()

# Rename columns
rfm_df.columns = ['registered', 'Recency', 'Frequency', 'Monetary']

# Show the result
print(rfm_df.head())

   registered  Recency  Frequency  Monetary
0           0       38         24        35
1           1        0        201       294
2           2        1        245       648
3           3        0        294      1154
4           4        3        307      1602
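For comparison, this is what an RFM table looks like when a per-customer transaction log is available. Everything in the sketch is hypothetical: the bike-sharing data has no `customer_id`, `date`, or `amount` columns, and the rows are invented purely to illustrate the three measures.

```python
import pandas as pd

# Hypothetical transaction log; the bike-sharing dataset has no such table
tx = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "date": pd.to_datetime([
        "2012-12-01", "2012-12-30", "2012-11-15",
        "2012-12-10", "2012-12-20", "2012-12-31",
    ]),
    "amount": [5.0, 7.5, 3.0, 4.0, 6.0, 2.0],
})

# Reference date: the most recent transaction in the log
snapshot = tx["date"].max()

rfm = tx.groupby("customer_id").agg(
    Recency=("date", lambda d: (snapshot - d.max()).days),  # days since last visit
    Frequency=("date", "count"),                            # number of transactions
    Monetary=("amount", "sum"),                             # total amount spent
).reset_index()

print(rfm)
```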


Clustering Technique

A technique for grouping data based on its features.

In [73]:
# Prepare the data used for clustering
X = rfm_df[['Frequency', 'Monetary']]

# Set the number of clusters
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters)

# Run the clustering
rfm_df['cluster'] = kmeans.fit_predict(X)

print(rfm_df[['registered', 'Frequency', 'Monetary', 'cluster']].head())

   registered  Frequency  Monetary  cluster
0           0         24        35        1
1           1        201       294        1
2           2        245       648        1
3           3        294      1154        1
4           4        307      1602        1






In [74]:
# Sum the registered values per cluster
cluster_totals = rfm_df.groupby('cluster')['registered'].sum().reset_index()

print(cluster_totals)

   cluster  registered
0        0       63308
1        1      208868
2        2       33537
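K-means assigns clusters by Euclidean distance, so a feature with a much larger range (here `Monetary`) can dominate the result, and cluster labels can change between runs when no seed is set. The sketch below is one possible variant, not the notebook's method: it standardizes the features with `StandardScaler` and fixes `random_state`. The toy points are invented and form three clearly separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy Frequency/Monetary pairs forming three well-separated groups (illustrative only)
X = np.array([
    [1, 10], [2, 12], [1, 11],
    [50, 500], [52, 520], [51, 510],
    [100, 5000], [98, 5100], [102, 4900],
], dtype=float)

# Put both features on a comparable scale before measuring distances
X_scaled = StandardScaler().fit_transform(X)

# A fixed random_state makes the cluster assignment reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(labels)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful.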
