# Seoul Bike Rental Dataset Visualization

## Introduction



In [19]:
import pandas as pd
import altair as alt

# Read the dataset
url = 'https://raw.githubusercontent.com/yuwangy/Seoul_bike_rental_viz/main/SeoulBikeData.csv'
data = pd.read_csv(url, encoding='latin-1')
data.head()

# Parse 'Date' (day/month/year) into a datetime column
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')

# Add 'Month' as an ordinal attribute derived from the date
data['Month'] = data['Date'].dt.month

# Add 'Weekend' as a nominal attribute: 'Yes' if the date falls on Saturday or Sunday
data['Weekend'] = data['Date'].dt.dayofweek.apply(lambda x: 'Yes' if x >= 5 else 'No')

data.head(10)



Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day,Month,Weekend
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
5,2017-12-01,100,5,-6.4,37,1.5,2000,-18.7,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
6,2017-12-01,181,6,-6.6,35,1.3,2000,-19.5,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
7,2017-12-01,460,7,-7.4,38,0.9,2000,-19.3,0.0,0.0,0.0,Winter,No Holiday,Yes,12,No
8,2017-12-01,930,8,-7.6,37,1.1,2000,-19.8,0.01,0.0,0.0,Winter,No Holiday,Yes,12,No
9,2017-12-01,490,9,-6.5,27,0.5,1928,-22.4,0.23,0.0,0.0,Winter,No Holiday,Yes,12,No
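
The `Weekend` column above is derived row by row with `apply`. As a minimal sketch of a vectorized alternative, assuming the same `%d/%m/%Y` date format, the comparison can be done on the whole column at once. The `sample` frame and the `is_weekend` name are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative sample: four consecutive days starting on Friday, 1 December 2017.
sample = pd.DataFrame({'Date': ['01/12/2017', '02/12/2017', '03/12/2017', '04/12/2017']})
sample['Date'] = pd.to_datetime(sample['Date'], format='%d/%m/%Y')

# Month as an ordinal attribute.
sample['Month'] = sample['Date'].dt.month

# dayofweek is 0 for Monday through 6 for Sunday, so values >= 5 are weekend days.
is_weekend = sample['Date'].dt.dayofweek >= 5
sample['Weekend'] = is_weekend.map({True: 'Yes', False: 'No'})

print(sample)
```

The result should match the `apply` version; the vectorized form mainly avoids a Python-level call per row.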


In [20]:
# EDA for the rental bike dataset

# Print the shape
print(data.shape)

# Print the info (info() prints its report directly and returns None)
data.info()

# Print the names of the columns
print(data.columns)



(8760, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       8760 non-null   datetime64[ns]
 1   Rented Bike Count          8760 non-null   int64         
 2   Hour                       8760 non-null   int64         
 3   Temperature(°C)            8760 non-null   float64       
 4   Humidity(%)                8760 non-null   int64         
 5   Wind speed (m/s)           8760 non-null   float64       
 6   Visibility (10m)           8760 non-null   int64         
 7   Dew point temperature(°C)  8760 non-null   float64       
 8   Solar Radiation (MJ/m2)    8760 non-null   float64       
 9   Rainfall(mm)               8760 non-null   float64       
 10  Snowfall (cm)              8760 non-null   float64       
 11  Seasons                    8760 non-null   object        


In [21]:
# Obtaining the extreme quantitative values in the data 

# Extract maximum and minimum Rented Bike Count
max_rented_bike_count = data['Rented Bike Count'].max()
print(f'maximum bikes rented: {max_rented_bike_count}')
min_rented_bike_count = data['Rented Bike Count'].min()
print(f'minimum bikes rented: {min_rented_bike_count}')

# Extract highest and lowest temperature
highest_temperature = data['Temperature(°C)'].max()
print(f'highest temperature: {highest_temperature}')
lowest_temperature = data['Temperature(°C)'].min()
print(f'lowest temperature: {lowest_temperature}')

# Extract highest and lowest humidity
highest_humidity = data['Humidity(%)'].max()
print(f'highest humidity: {highest_humidity}')
lowest_humidity = data['Humidity(%)'].min()
print(f'lowest humidity: {lowest_humidity}')

# Extract maximum and minimum wind speed
max_wind_speed = data['Wind speed (m/s)'].max()
print(f'maximum wind speed: {max_wind_speed}')
min_wind_speed = data['Wind speed (m/s)'].min()
print(f'minimum wind speed: {min_wind_speed}')

# Extract maximum and minimum visibility
max_visibility = data['Visibility (10m)'].max()
print(f'maximum visibility: {max_visibility}')
min_visibility = data['Visibility (10m)'].min()
print(f'minimum visibility: {min_visibility}')

# Extract highest and lowest dew point temperature
highest_dew_point_temperature = data['Dew point temperature(°C)'].max()
print(f'highest dew point temperature: {highest_dew_point_temperature}')
lowest_dew_point_temperature = data['Dew point temperature(°C)'].min()
print(f'lowest dew point temperature: {lowest_dew_point_temperature}')

# Extract maximum and minimum solar radiation
max_solar_radiation = data['Solar Radiation (MJ/m2)'].max()
print(f'maximum solar radiation: {max_solar_radiation}')
min_solar_radiation = data['Solar Radiation (MJ/m2)'].min()
print(f'minimum solar radiation: {min_solar_radiation}')

# Extract maximum and minimum rainfall
max_rainfall = data['Rainfall(mm)'].max()
print(f'maximum rainfall: {max_rainfall}')
min_rainfall = data['Rainfall(mm)'].min()
print(f'minimum rainfall: {min_rainfall}')

# Extract maximum and minimum snowfall
max_snowfall = data['Snowfall (cm)'].max()
print(f'maximum snowfall: {max_snowfall}')
min_snowfall = data['Snowfall (cm)'].min()
print(f'minimum snowfall: {min_snowfall}')



maximum bikes rented: 3556
minimum bikes rented: 0
highest temperature: 39.4
lowest temperature: -17.8
highest humidity: 98
lowest humidity: 0
maximum wind speed: 7.4
minimum wind speed: 0.0
maximum visibility: 2000
minimum visibility: 27
highest dew point temperature: 27.2
lowest dew point temperature: -30.6
maximum solar radiation: 3.52
minimum solar radiation: 0.0
maximum rainfall: 35.0
minimum rainfall: 0.0
maximum snowfall: 8.8
minimum snowfall: 0.0
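
The extremes above are collected with one `max()` and one `min()` call per column. As a rough sketch, the same figures could be gathered in a single table with `DataFrame.agg`. The `sample` frame below reuses three rows shown in the earlier `head` output, and the `extremes` name is illustrative.

```python
import pandas as pd

# Three rows copied from the head() output above, limited to three columns.
sample = pd.DataFrame({
    'Rented Bike Count': [254, 204, 173],
    'Temperature(°C)': [-5.2, -5.5, -6.0],
    'Humidity(%)': [37, 38, 39],
})

# Keep only numeric columns, then compute both extremes in one pass.
extremes = sample.select_dtypes('number').agg(['min', 'max'])

print(extremes)
```

On the full dataset, `select_dtypes('number')` would also pick up `Hour` and `Month`, so those columns may need to be dropped if only the weather and rental measures are of interest.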


In [22]:
# Load the data abstraction table for the Seoul Bike Rental dataset from a CSV file

url = 'https://raw.githubusercontent.com/yuwangy/Seoul_bike_rental_viz/main/Seoul%20Bike%20Rental%20Data%20Abstraction.csv'
data_abstraction = pd.read_csv(url)
data_abstraction



Unnamed: 0.1,Unnamed: 0,Semantics,Attribute Type,Cardinality
0,Date,The specific day of the month of the year,Temporal,01-12-2017 to 30-11-2018
1,Rented Bike Count,The specific number of bikes rented in given hour,Quantitative,0 - 3556 bikes rented
2,Hour,The specific hour of the day ...,Ordinal,24
3,Temperature (°C),The temperature in given hour,Quantitative,-17.8 - 39.4 °C
4,Humidity (%),The humidity in given hour,Quantitative,0 - 98 % humidity
5,Wind speed (m/s),The wind speed in given hour,Quantitative,0 - 7.4 m/s wind speed
6,Visibility (10m),The visibility (how far you can see clearly) i...,Quantitative,27 - 2000 (10m) visibility
7,Dew point temperature (°C),The dew point temperature in given hour,Quantitative,-30.6 - 27.2 °C dew point temperature
8,Solar Radiation (MJ/m2),The amount of solar radiation in given hour,Quantitative,0 - 3.52 (MJ/m2)
9,Rainfall (mm),The amount of rain in given hour,Quantitative,0 - 35 (mm)


In [23]:
# Statistical summaries for numerical attributes (central tendency and spread)

data.describe()


Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Month
count,8760,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,2018-05-31 23:59:59.999999744,704.602055,11.5,12.882922,58.226256,1.724909,1436.825799,4.073813,0.569111,0.148687,0.075068,6.526027
min,2017-12-01 00:00:00,0.0,0.0,-17.8,0.0,0.0,27.0,-30.6,0.0,0.0,0.0,1.0
25%,2018-03-02 00:00:00,191.0,5.75,3.5,42.0,0.9,940.0,-4.7,0.0,0.0,0.0,4.0
50%,2018-06-01 00:00:00,504.5,11.5,13.7,57.0,1.5,1698.0,5.1,0.01,0.0,0.0,7.0
75%,2018-08-31 00:00:00,1065.25,17.25,22.5,74.0,2.3,2000.0,14.8,0.93,0.0,0.0,10.0
max,2018-11-30 00:00:00,3556.0,23.0,39.4,98.0,7.4,2000.0,27.2,3.52,35.0,8.8,12.0
std,,644.997468,6.922582,11.944825,20.362413,1.0363,608.298712,13.060369,0.868746,1.128193,0.436746,3.448048


In [24]:
# Obtain a frequency table for every attribute, in column order

for column in data.columns:
    print(data[column].value_counts())



Date
2018-11-30    24
2017-12-01    24
2017-12-02    24
2017-12-03    24
2017-12-04    24
              ..
2017-12-17    24
2017-12-18    24
2017-12-19    24
2017-12-20    24
2017-12-21    24
Name: count, Length: 365, dtype: int64
Rented Bike Count
0       295
122      19
223      19
262      19
189      18
       ... 
2159      1
1692      1
894       1
793       1
1304      1
Name: count, Length: 2166, dtype: int64
Hour
0     365
1     365
2     365
3     365
4     365
5     365
6     365
7     365
8     365
9     365
10    365
11    365
12    365
13    365
14    365
15    365
16    365
17    365
18    365
19    365
20    365
21    365
22    365
23    365
Name: count, dtype: int64
Temperature(°C)
 19.1    40
 20.5    40
 23.4    39
 20.7    38
 7.6     38
         ..
-17.4     1
-13.5     1
-13.6     1
-13.9     1
 36.9     1
Name: count, Length: 546, dtype: int64
Humidity(%)
97    173
53    173
43    164
57    159
56    157
     ... 
19     11
13      3
12      1
10      1
11      1

In [25]:
# Activate the VegaFusion data transformer
import altair as alt
alt.data_transformers.enable("vegafusion")

# # Drop the missing data
# data = data.dropna()
# print(data.head())

# Create simple visual summaries: bar charts, line charts, boxplots, etc.
# Boxplot to identify outliers in Rented Bike Count for each hour.
# mark_boxplot computes its own summary statistics, so the raw field is encoded
# rather than a pre-aggregated mean.
boxplot_chart = alt.Chart(data).mark_boxplot().encode(
    alt.X('Hour:O'),
    alt.Y('Rented Bike Count:Q', title = 'Rented Bike Count'),
).properties(
    width = 500,
    height = 300,
    title = 'Boxplot chart for Hourly Rented Bike Count',
)
display(boxplot_chart)

# The sum of Rented Bike Count across various hours of day
hourly_barchart = alt.Chart(data).mark_bar().encode(
    alt.X('Hour:O', sort = '-y'),
    alt.Y('sum(Rented Bike Count):Q'),
    alt.Color('Seasons:N'),
    alt.Tooltip('Date'),
).properties(
    title = 'Barchart for Hourly Sum Rented Bike Count',
)
display(hourly_barchart)

# Sum of Rented Bike Count across the 4 seasons of the year, colored by month
seasons_barchart = alt.Chart(data).mark_bar().encode(
    alt.Y('Seasons:N', sort = 'x'),
    alt.X('sum(Rented Bike Count):Q'),
    alt.Color('Month:O'),
    alt.Tooltip('sum(Rented Bike Count):Q'),
).properties(
    width = 500,
    height = 150,
    title = 'Barchart for Season Sum Rented Bike Count',
)
display(seasons_barchart)

# Density plot for Rented Bike Count
density_chart = alt.Chart(data).mark_area().encode(
    x='Rented Bike Count:Q',
    y='density:Q',
).transform_density(
    'Rented Bike Count',
    as_=['Rented Bike Count', 'density'],
).properties(
    width = 500,
    height = 300,
    title = 'Density plot for Rented Bike Count',
)

display(density_chart)

# Heatmap for mean Rented Bike Count by hour and month.
# Each Month/Hour cell covers many rows, so the count is averaged per cell.
heatmap_temperature = alt.Chart(data).mark_rect().encode(
    alt.Y('Month:O'),
    alt.X('Hour:O'),
    alt.Color('mean(Rented Bike Count):Q'),
    alt.Tooltip('mean(Rented Bike Count):Q'),
).properties(
    width = 500,
    height = 250,
    title = 'Heatmap for Mean Rented Bike Count by Hour & Month',
)

display(heatmap_temperature)

# # Scatter matrix for Rented Bike Count
# scatter_matrix = alt.Chart(data).mark_circle().encode(
#     alt.X(alt.repeat("column"), type='quantitative'),
#     alt.Y(alt.repeat("row"), type='quantitative'),
# ).properties(
#     width=100,
#     height=100
# ).repeat(
#     row=['Rainfall (mm)', 'Snowfall (cm)', 'sum(Rented Bike Count)'],
#     column=['Rainfall (mm)', 'Snowfall (cm)', 'sum(Rented Bike Count)'],
# )

# display(scatter_matrix)

# Line chart for Rented Bike Count
linechart_temperature = alt.Chart(data).mark_line(point = True).encode(
    alt.X('Temperature(°C):Q', bin = alt.Bin(maxbins = 30)),
    alt.Y('mean(Rented Bike Count):Q'),
    alt.Color('Seasons:N'),
    alt.Tooltip('mean(Rented Bike Count):Q'),
).properties(
    width = 500,
    height = 300,
    title = 'Linechart for Mean Rented Bike Count by Temperature',
)

# display(linechart_temperature)

linechart_humidity = alt.Chart(data).mark_line(point = True).encode(
    alt.X('Humidity(%):Q', bin = alt.Bin(maxbins = 30)),
    alt.Y('mean(Rented Bike Count):Q'),
    alt.Color('Seasons:N'),
    alt.Tooltip('mean(Rented Bike Count):Q'),
).properties(
    width = 500,
    height = 300,
    title = 'Linechart for Mean Rented Bike Count by Humidity',
)

# display(linechart_humidity)

# concatenating two line charts horizontally
side_by_side_linecharts = alt.hconcat(linechart_temperature, linechart_humidity)
display(side_by_side_linecharts)




# Study of Bike Rental in Seoul


### Introduction
In this study of bike rental in Seoul, we want to practice data wrangling and processing: tidying the data and adapting it to the task we want to complete. We also want to improve our visualization design skills by exploring different visualization types and choosing the one that best suits each task.


### Audience
Our intended audience is primarily young to middle-aged citizens of Seoul, since it is mainly young people living in Seoul who rent bikes. They would like to know the bike rental conditions (number of bikes available, etc.) under different circumstances. Another potential audience is the rental company, which wants to know when the rush hours for bike rental occur so that it can better arrange its storage and staff schedules.


### Motivation
Citizens would like to avoid the rush hour, or reserve a bike before it starts, so that a bike is available when they need one. The rental company would like to know when the rush hour occurs so that it can prepare for it. Our audience can therefore apply their own conditions (weather, temperature, etc.) to the visualization, get a basic sense of how many bikes are likely to be rented, and decide what to do accordingly.

### Task Analysis
The visualization should have 2 main tabs, "Overview" and "Trending", with 4 minor tabs under the second. "Overview" studies the overall distribution of the rental amount throughout the year. "Trending" studies the trend of the rental amount over temperature and over the time of day, and is divided into 4 subsets, one for each season.

For the first part, "Overview", the visualization should show the total rental amount per day on the y-axis channel (compute a derived value) and the date on the x-axis channel, forming an area plot of the distribution of rentals through the year (Characterize Distribution). Each day should have a tooltip showing the date, the total rental amount, and whether the day is a weekend or a holiday. Clicking on a specific day should show the detailed distribution of rentals for each hour of that day.
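
As a minimal sketch of the derived value described for "Overview", the hourly rows can be collapsed into one row per day before charting. The `hourly` sample, its rental counts, and the `daily`, `total_rented`, `weekend`, and `holiday` names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical hourly sample: two hours on each of three days (counts are made up).
hourly = pd.DataFrame({
    'Date': pd.to_datetime([
        '2017-12-01', '2017-12-01',
        '2017-12-02', '2017-12-02',
        '2017-12-03', '2017-12-03',
    ]),
    'Rented Bike Count': [254, 204, 300, 100, 50, 25],
    'Holiday': ['No Holiday'] * 6,
})
hourly['Weekend'] = (hourly['Date'].dt.dayofweek >= 5).map({True: 'Yes', False: 'No'})

# One row per day: the total for the y-axis plus the fields a tooltip would show.
daily = hourly.groupby('Date', as_index=False).agg(
    total_rented=('Rented Bike Count', 'sum'),
    weekend=('Weekend', 'first'),
    holiday=('Holiday', 'first'),
)

print(daily)
```

In a chart built from `daily`, `Date` would feed the x-axis, `total_rented` the y-axis, and the remaining columns the tooltip. The per-hour drill-down would still need the original hourly rows.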

For the second part, "Trending", the visualization aims to find the trend between the rental amount and either the temperature or the time of day, so a line plot is the better choice. A dropdown menu should let the audience select which trend they want to see, and 4 tabs should let them switch between the 4 seasons. The average rental amount is encoded on the y-axis channel, and the temperature or the time of day is encoded on the x-axis channel. The plot aims to show the correlation between the rental amount and the temperature or the time of day (Correlate). A second dropdown menu should let the audience select between weekdays and weekends (Filter), because the rental amount can be greatly affected by whether a day is a weekday or a weekend. A tooltip should also be available, showing the average rental amount (Retrieve value) and the temperature or time of day.
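
A rough sketch of the data behind one "Trending" view, assuming the season tab and the weekday/weekend dropdown each reduce to a simple row filter. The `hourly` sample, its counts, and the `hourly_trend` helper are hypothetical and only illustrate the Filter and mean-per-hour steps.

```python
import pandas as pd

# Hypothetical sample rows; the counts are made up for illustration.
hourly = pd.DataFrame({
    'Seasons': ['Winter', 'Winter', 'Winter', 'Winter', 'Summer', 'Summer'],
    'Weekend': ['No', 'No', 'Yes', 'No', 'No', 'No'],
    'Hour': [8, 8, 8, 9, 8, 8],
    'Rented Bike Count': [900, 700, 300, 500, 1500, 1700],
})


def hourly_trend(frame, season, weekend):
    """Mean rentals per hour for one season tab and one weekday/weekend choice."""
    subset = frame[(frame['Seasons'] == season) & (frame['Weekend'] == weekend)]
    return subset.groupby('Hour')['Rented Bike Count'].mean()


# Winter weekdays: the series index is the hour, the values are mean rentals.
winter_weekdays = hourly_trend(hourly, 'Winter', 'No')
print(winter_weekdays)
```

For the temperature trend, the same pattern would group on binned temperature values instead of `Hour`.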