# Helsinki City Bikes EDA
## Exploratory data analysis of the Helsinki city bike system.

### [Associated Github Repo](https://github.com/Geometrein/helsinki-city-bikes)
### [Associated Medium Article](https://towardsdatascience.com/helsinki-city-bikes-exploratory-data-analysis-e241ce5096db?sk=19ffb3c11b016486b2dd11455568eee1)

## What are Helsinki City bikes?
Helsinki City Bikes are shared bicycles available to the public in the Helsinki and Espoo metropolitan areas. The main aim of the system is to address the so-called last-mile problem present in all distribution networks. The city bikes were introduced in 2016 as a pilot project with only 46 bike stations in Helsinki. After the bikes became popular among citizens, the City of Helsinki decided to gradually expand the network. Between 2017 and 2019, approximately one hundred stations were added each year. By 2019 the network had reached its complete state, with only 7 stations added in 2020. As of 2020, there were 3,510 bikes and 350 stations operating in Helsinki and Espoo.

>Since 2016, more than 10,000,000 rides have been made. The total distance of the trips is 25,291,523 kilometres. To put that in perspective, 25.3 million kilometres is about 65 times the distance to the Moon. The total time all residents spent riding the bikes is approximately 280 years and 4 months.

To use the city bikes, citizens purchase access for a day, a week or the entire cycling season, which lasts from April to November. All passes include an unlimited number of 30-minute bike rides. For an extra fee of 1€/hour, you can use the bike for longer. Bikes are picked up from and returned to stations located all around Helsinki and Espoo.
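
The headline figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the average Earth-Moon distance of 384,400 km is an assumption that does not come from the dataset, and the implied average speed is a rough consistency check rather than a measured value.

```python
# Figures quoted in the introduction.
TOTAL_DISTANCE_KM = 25_291_523
TOTAL_RIDING_YEARS = 280 + 4 / 12  # "approximately 280 years and 4 months"

# Assumption: average Earth-Moon distance in kilometres (not part of the dataset).
EARTH_MOON_KM = 384_400

# How many one-way trips to the Moon the total distance corresponds to.
moon_trips = TOTAL_DISTANCE_KM / EARTH_MOON_KM
print(f"One-way trips to the Moon: {moon_trips:.1f}")

# Rough consistency check: total distance divided by total riding time.
total_hours = TOTAL_RIDING_YEARS * 365.25 * 24
avg_speed_kmh = TOTAL_DISTANCE_KM / total_hours
print(f"Implied average speed: {avg_speed_kmh:.1f} km/h")
```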




In [None]:
import datetime
import calendar

import numpy as np
import pandas as pd

import matplotlib.cm as cm
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, LinearSegmentedColormap

import seaborn as sns
import plotly.graph_objects as go

import networkx as nx
import community as community_louvain
from operator import itemgetter

import folium
from folium import plugins

# Custom Colors
MAGENTA = "#6C3483"
GREEN = "#239B56"
BLUE = "#5DADE2"

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
dataframe = pd.read_csv("/kaggle/input/helsinki-city-bikes/database.csv", low_memory=False)
dataframe.head()

Data type conversion

In [None]:
# Convert timestamp to datetime64[ns]
dataframe[['departure','return']] =  dataframe[['departure','return']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S.%f')
dataframe.dropna(inplace=True)
dataframe.info()

# Preparation
## Rename Columns
It is considered best practice to store measurement units in the column names when sharing a dataset. However, such names are awkward to type during analysis. Renaming the columns to something shorter makes the code more readable and reduces the risk of human error.

In [None]:
dataframe = dataframe.rename(columns={'distance (m)': 'distance',
                                     'duration (sec.)': 'duration',
                                     'avg_speed (km/h)':'speed',
                                      'Air temperature (degC)':'temperature',
                                     })
dataframe.head()

## Looking for Errors
Large datasets almost always contain some corrupted entries. It's crucial to check for suspicious values before proceeding with the analysis.

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
dataframe[["distance", "duration", "temperature"]].describe()

As we can see above, there are certain anomalies in the data, such as negative and extremely large distances. In cases like this, the context of the data can help to filter out the anomalies.
### Allowed Distance range
Obviously, a distance cannot be negative in Euclidean space. However, keeping only positive values is not enough either. To filter out the odd cases, it is best to remove any trip shorter than 50 meters. Stations are always positioned more than 50 meters apart, so a trip shorter than 50 meters is an irregularity within the context of this EDA.

### Allowed Duration range
Determining the allowed upper and lower limits is very case-specific. However, the maximum rental time for a bike is 5 hours, and users who exceed it pay an 80€ penalty. The 5-hour mark (18,000 seconds) can therefore be used as the upper limit for a trip's duration.
The lower limit can be derived from the previously defined limits: the 50 m distance limit divided by an average speed of 25 km/h, which gives roughly 7 seconds.
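
As a minimal sketch of the lower-limit calculation described above, the snippet below converts the assumed 25 km/h average speed to metres per second and divides the 50 m minimum distance by it. The constant names are illustrative and do not appear elsewhere in the notebook.

```python
MIN_DISTANCE_M = 50      # minimum plausible trip distance from the text
ASSUMED_SPEED_KMH = 25   # average speed used in the text for the lower bound

# Convert km/h to m/s, then compute the time needed to cover the minimum distance.
speed_ms = ASSUMED_SPEED_KMH * 1000 / 3600
min_duration_s = MIN_DISTANCE_M / speed_ms

print(f"Shortest plausible trip: {min_duration_s:.1f} seconds")
```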

### Allowed Temperature range
As we can see from the table above, the only anomaly for temperature is that it reached 32 degrees in Helsinki. However, this probably has more to do with global warming than with an error in our dataset.

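Based on these assumptions, the dataset can be filtered as shown below. Note that the filter is somewhat stricter than the limits derived above: it also caps the distance at 10,000 meters, uses a 120-second minimum duration, and keeps temperatures between -20 and 50 degrees.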


In [None]:
dataframe = dataframe[ 
                    (50 < dataframe['distance']) & (dataframe['distance'] < 10000) &
                    (120 < dataframe['duration']) & (dataframe['duration'] <  18000) &
                    (-20 < dataframe['temperature']) & (dataframe['temperature'] < 50)        
                    ]

dataframe[["distance", "duration", "temperature"]].describe()

# Exploratory Data Analysis

The City Bikes were introduced in 2016 as a pilot project with only 46 bike stations in Helsinki. After the bikes became popular among citizens, the City of Helsinki decided to gradually expand the network. From 2017 to 2019, about 100 stations were added each year, and in 2020 only 7 stations were added. The network now operates in Helsinki and Espoo and has 350 stations.
The number of bike trips over time, plotted in the "When are City Bikes used?" section, shows that expanding the coverage of the network had a huge impact on the number of trips made by citizens. It also shows that 2020 was the first year in which bike usage decreased. There are multiple possible explanations: the decrease may be due to the COVID-19 pandemic, or the network may have reached the end of its growth phase.


### What does the average ride look like?
The city bike system has grown significantly since 2016; however, the way the city bikes are used has not changed substantially. If we look at individual trips over the last 5 years, the average ride lasts around 13 minutes and the average travelled distance is approximately 2242 meters (1.4 miles). Because the distributions are right-skewed, the means are pulled upwards by a small number of long rides; the majority of trips actually last 4–8 minutes and cover a distance of about 1700 meters (approx. 1 mile).
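
To illustrate why the mean sits above the median in a right-skewed distribution, here is a small sketch on synthetic data. The log-normal parameters are arbitrary assumptions chosen only to produce a long right tail; they are not fitted to the city bike dataset.

```python
import numpy as np

# Synthetic, right-skewed "ride durations" in minutes (NOT the real dataset).
rng = np.random.default_rng(seed=0)
synthetic_durations = rng.lognormal(mean=2.3, sigma=0.6, size=100_000)

mean_duration = synthetic_durations.mean()
median_duration = np.median(synthetic_durations)

# A few very long rides pull the mean up, while the median stays near the typical ride.
print(f"Mean:   {mean_duration:.1f} min")
print(f"Median: {median_duration:.1f} min")
```

For skewed data like this, the median is usually the better description of a "typical" ride, which is why both lines are drawn on the histograms in this section.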

#### Ride duration distribution

In [None]:
def duration(dataframe):
    """
    Plot a histogram of ride durations in minutes with mean and median lines.
    """
    df = dataframe.copy()
    
    # Converting seconds to minutes
    df["duration"] = df["duration"]/60
    
    # Filtering relevant data
    duration_data = df["duration"]
    fig = plt.figure(figsize=(15,7))
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

    # Plotting the histogram
    plt.hist(duration_data, bins = range(62), color = BLUE, histtype ="bar")

    # Adding median and mean lines
    plt.axvline(df["duration"].mean(), color=MAGENTA, linestyle='-', linewidth=2 )
    plt.axvline(df["duration"].median(), color=GREEN, linestyle='-', linewidth=2 )
    plt.axvline(x = 30, color="blue", linestyle='-', linewidth=1 )

    # Adding median and mean texts
    min_ylim, max_ylim = plt.ylim()
    plt.text(df["duration"].mean()*1.1, max_ylim*0.9, 'Mean: {:.0f} min'.format(df["duration"].mean()), color = MAGENTA,  fontsize= 12)
    plt.text(df["duration"].median()*1.1, max_ylim*0.8, 'Median: {:.0f} min'.format(df["duration"].median()), color = GREEN, fontsize= 12)
    plt.text(x= 28,y=200000, s="Free", color = GREEN, fontsize= 12)
    plt.text(x= 30.5,y=200000, s="Extra Charge", color = "grey", fontsize= 12)

    # Setting ticks on x axis
    ticks = range(0, 62, 2)
    plt.xticks(ticks)

    # Labeling Axes
    ax.set_title('Ride Duration Distribution', fontdict={"fontsize":20}, pad = 20)
    plt.xlabel("Duration of Rides (Minutes)", fontsize= 12, x = 0.5)
    plt.ylabel("Number of Rides", fontsize= 12)

    # Adding Grid
    plt.grid(linestyle=":", color='grey')

    # Watermark
    ax.text(0.99, 0.01, '© Github/Geometrein',
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes,
            color='grey',
            fontsize=15,
            alpha = 0.8)

    plt.show()

duration(dataframe)

#### Ride Distance distribution

In [None]:
def distance(dataframe):
    """
    Plot a histogram of travelled distances in meters with mean and median lines.
    """
    df = dataframe.copy()
    
    fig = plt.figure(figsize=(15,7))
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

    data = df["distance"]
    plt.hist(data, bins = 60, color = BLUE)


    plt.axvline(df["distance"].mean(), color=MAGENTA, linestyle='-', linewidth=2 )
    plt.axvline(df["distance"].median(), color=GREEN, linestyle='-', linewidth=2 )

    min_ylim, max_ylim = plt.ylim()

    plt.text(df["distance"].mean()*1.1, max_ylim*0.9, 'Mean: {:.0f} m'.format(df["distance"].mean()), color = MAGENTA,  fontsize= 10)
    plt.text(df["distance"].median()*1.1, max_ylim*0.8, 'Median: {:.0f} m'.format(df["distance"].median()), color = GREEN, fontsize= 10)

    ax.set_xlim([0,10000])

    # Labeling Axes
    plt.xlabel("Travelled Distance (meters)", fontsize= 15, x = 0.5)
    plt.ylabel("Number of Rides", fontsize= 15)

    # Adding Grid
    plt.grid(linestyle=":", color='grey')

    # Watermark
    ax.text(0.99, 0.01, '© Github/Geometrein',
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes,
            color='grey',
            fontsize=15,
            alpha = 0.8)

    plt.show()

distance(dataframe)

As we can see from the graphs above, the vast majority of rides are shorter than 30 minutes. However, 3.176% of users ended up exceeding the limit. Those who exceeded the 30-minute limit but not the 60-minute limit have collectively paid €261,715 since the launch of the city bikes in 2016.
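
A hedged sketch of how such overtime figures could be derived from a `duration` column in seconds is shown below. The helper `overtime_summary` and the toy values are hypothetical, and the flat 1 € charge for every ride between 30 and 60 minutes is an assumption based on the 1€/hour fee mentioned in the introduction, not a confirmed pricing rule.

```python
import pandas as pd

def overtime_summary(duration_sec: pd.Series, fee_eur: float = 1.0) -> dict:
    """Summarise rides that exceed the free 30-minute window."""
    minutes = duration_sec / 60
    over_30 = minutes > 30
    between_30_and_60 = over_30 & (minutes <= 60)
    return {
        # Share of all rides longer than 30 minutes, in percent.
        "share_over_30_pct": 100 * over_30.mean(),
        # Rides that went over 30 minutes but stayed within 60 minutes.
        "rides_30_to_60": int(between_30_and_60.sum()),
        # Assumption: each such ride is charged one flat fee.
        "estimated_fees_eur": fee_eur * int(between_30_and_60.sum()),
    }

# Toy durations in seconds (made up for illustration).
toy_durations = pd.Series([300, 600, 1500, 1900, 2400, 4000, 900, 1200, 700, 3500])
summary = overtime_summary(toy_durations)
print(summary)
```

On the real data, the same function could be called as `overtime_summary(dataframe["duration"])` after the filtering step.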

## When are City Bikes used?
Below you can see the number of daily bike trips since the launch of the city bike system. As we can see, expanding the coverage of the network had a huge impact on the number of trips made by citizens. It is also visible that 2020 was the first year in which bike usage decreased. There are multiple possible explanations: the decrease may be due to the COVID-19 pandemic, or the city bike network may have reached the end of its growth phase.

In [None]:
def tripsByYear(dataframe):
    """
    Number of trips over the years
    """
    # Data 
    df = dataframe.copy()
    df_over_time = df.groupby(df['departure'].dt.date).size().reset_index(name='count')

    # Figure
    fig, ax = plt.subplots(figsize=(20,9))
    plt.plot(df_over_time["departure"], df_over_time["count"], color= BLUE)

    # Labels
    ax.set_title("Number of trips over time", fontsize= 15, pad= 20)
    ax.set_ylabel("Number of trips", fontsize=12)
    ax.set_xlabel("Years", fontsize=12)

    # Grid & Legend
    plt.grid(linestyle=":", color='grey')
    plt.legend(["Number of trips"])

    # Watermark
    ax.text(0.99, 0.01, '© Github/Geometrein',
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes,
            color='grey',
            fontsize=15,
            alpha = 0.9)

    plt.show()

tripsByYear(dataframe)

If we look at the heatmap below, we can see a clear usage pattern. The most intensive bike usage occurs from 6:00 to 8:00 and from 16:00 to 18:00 on weekdays. This shows that bikes are actively used by commuters around the beginning and the end of the working day.

In [None]:
def weekday_heatmap(dataframe):
    """
    Plot a heatmap of the number of trips by weekday and hour of the day.
    """
    weekdays = ["Mon", "Tue","Wed", "Thu", "Fri", "Sat", "Sun"]
    
    # Data
    df = dataframe.copy()
    df["hour"] = pd.DatetimeIndex(df['departure']).hour
    df["weekday"] = pd.DatetimeIndex(df['departure']).weekday
    daily_activity = df.groupby(by=['weekday','hour']).count()['departure_name'].unstack()

    # Figure
    fig, ax = plt.subplots(figsize=(15,15))
    sns.heatmap(daily_activity, robust=True, cmap="Blues", yticklabels=weekdays)

    # Labeling Axes
    plt.xlabel("Time of the day (Hours)", fontsize= 12, x = 0.5)
    plt.ylabel("Day of the week", fontsize= 12)

    # Watermark
    ax.text(0.99, 0.01, '© Github/Geometrein',
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes,
            color='grey',
            fontsize=15,
            alpha = 0.8)


weekday_heatmap(dataframe)

On weekends, however, the usage is different. It seems that Helsinki's citizens prefer to kick off the weekend a little late: the most active hours are between 15:00 and 17:00. Interestingly, city bike usage is also higher around midnight on weekends. This could mean that on weekends city bikes are used as a substitute when other forms of public transport are no longer available.
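
The weekday and weekend peaks read off the heatmap can also be extracted numerically. The following is a minimal sketch on made-up timestamps; the helper name `peak_hours` is hypothetical, and on the real data it would be applied to `dataframe["departure"]`.

```python
import pandas as pd

def peak_hours(departures: pd.Series) -> pd.Series:
    """Return the busiest departure hour for weekdays and for weekends."""
    hour = departures.dt.hour
    # Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    day_type = (departures.dt.weekday >= 5).map({True: "weekend", False: "weekday"})
    # Rows: hour of day, columns: weekday/weekend, values: number of trips.
    counts = pd.crosstab(hour, day_type)
    # For each column, the row label (hour) with the highest count.
    return counts.idxmax()

# Toy departures (made up): 2020-06-01 is a Monday, 2020-06-06 a Saturday.
toy_departures = pd.Series(pd.to_datetime([
    "2020-06-01 07:10", "2020-06-01 07:40", "2020-06-01 16:20", "2020-06-02 07:05",
    "2020-06-06 15:30", "2020-06-06 15:45", "2020-06-06 23:50", "2020-06-07 11:00",
]))
peaks = peak_hours(toy_departures)
print(peaks)
```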

Since the city bikes are actively used by commuters, it is natural to assume that the COVID-19 pandemic and the transition to remote work had some effect on city bike usage. The graph below illustrates bike usage patterns for the past three years.

In [None]:
def yearlyHeatmap(dataframe):
    """
    This function plots the number of trips by weekday and hour of the day.
    """
    weekdays = ["Mon", "Tue","Wed", "Thu", "Fri", "Sat", "Sun"]
    
    # Data
    df = dataframe.copy()

    df["hour"] = pd.DatetimeIndex(df['departure']).hour
    df["weekday"] = pd.DatetimeIndex(df['departure']).weekday

    df_2018 = df[df['departure'].dt.year == 2018]
    df_2019 = df[df['departure'].dt.year == 2019]
    df_2020 = df[df['departure'].dt.year == 2020]

    daily_activity_2018 = df_2018.groupby(by=['weekday','hour']).count()['departure_name'].unstack()
    daily_activity_2019 = df_2019.groupby(by=['weekday','hour']).count()['departure_name'].unstack()
    daily_activity_2020 = df_2020.groupby(by=['weekday','hour']).count()['departure_name'].unstack()

    # Figure
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3,figsize=(60,15))

    sns.heatmap(daily_activity_2018, ax=ax1, robust=True, vmin=0, vmax=70000, cmap="Blues", yticklabels=weekdays)
    sns.heatmap(daily_activity_2019, ax=ax2, robust=True, vmin=0, vmax=70000, cmap="Blues", yticklabels=weekdays)
    sns.heatmap(daily_activity_2020, ax=ax3, robust=True, vmin=0, vmax=70000, cmap="Blues", yticklabels=weekdays)

    # Labeling Axes
    ax1.set_title("Usage patterns in 2018", fontsize= 15, pad = 15)    
    ax2.set_title("Usage patterns in 2019", fontsize= 15, pad = 15)
    ax3.set_title("Usage patterns in 2020", fontsize= 15,pad = 15)

    ax1.set(xlabel="Time of the day (Hours)", ylabel="Day of the week")
    ax2.set(xlabel="Time of the day (Hours)", ylabel="Day of the week")
    ax3.set(xlabel="Time of the day (Hours)", ylabel="Day of the week")

    # Watermark
    ax2.text(0.99, 0.01, '© Github/Geometrein',
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax2.transAxes,
            color='grey',
            fontsize=15,
            alpha = 0.8)


yearlyHeatmap(dataframe)

These graphs already show that bike usage patterns were somewhat different in 2020. Besides the decrease in overall bike usage, the number of trips during rush hours also seems to have decreased.
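
One way to put a number on this impression is to compare the share of weekday trips that start during rush hours in each year. The sketch below uses made-up timestamps; the helper `rush_hour_share` is hypothetical, and the rush-hour window (06:00 to 08:00 and 16:00 to 18:00) is taken from the weekday pattern described earlier.

```python
import pandas as pd

# Assumption: rush hours are the departure hours 6, 7, 16 and 17 on weekdays.
RUSH_HOURS = [6, 7, 16, 17]

def rush_hour_share(departures: pd.Series) -> pd.Series:
    """Percentage of trips per year that start on a weekday during rush hours."""
    is_weekday = departures.dt.weekday < 5
    is_rush = (departures.dt.hour.isin(RUSH_HOURS) & is_weekday).astype(float)
    # Mean of a 0/1 indicator per year gives the share; scale to percent.
    return is_rush.groupby(departures.dt.year).mean() * 100

# Toy departures (made up); all dates fall on Mondays or Tuesdays.
toy_departures = pd.Series(pd.to_datetime([
    "2019-06-03 07:15", "2019-06-03 16:30", "2019-06-04 17:05", "2019-06-04 12:00",
    "2020-06-01 07:15", "2020-06-01 11:00", "2020-06-02 13:30", "2020-06-02 20:00",
]))
share = rush_hour_share(toy_departures)
print(share)
```

A falling share between years would support the reading that commuting trips declined more than other trips, although it would not by itself establish the cause.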

# Which stations are the most popular?

In [None]:
def topDepartureStantions(dataframe):
    """
    This function displays top stations by departure
    """
    # Data
    df = dataframe.copy()
    df = df[df['departure'].dt.year == 2017]

    # Figure
    fig = plt.figure(figsize=(20,9))
    ax = sns.countplot(x="departure_name", color = BLUE, data=df, order = df['departure_name'].value_counts().index)

    # Labeling Axes
    ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    plt.xlabel("Stations", fontsize= 12, x = 0.5)
    plt.ylabel("Number of Rides", fontsize= 12)
    plt.xlim(-1,20.5)

    # Adding Grid
    plt.grid(linestyle=":", color='grey')

topDepartureStantions(dataframe)

In [None]:
def topReturnStantions(dataframe):
    """
    This function displays top stations by return
    """
    # Data
    df = dataframe.copy()
    df = df[df['departure'].dt.year == 2017]

    # Figure
    fig = plt.figure(figsize=(20,9))
    ax = sns.countplot(x="return_name", color = BLUE, data=df, order = df['return_name'].value_counts().index)

    # Labeling Axes
    ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    plt.xlabel("Stations", fontsize= 12, x = 0.5)
    plt.ylabel("Number of Rides", fontsize= 12)
    plt.xlim(-1,20.5)

    # Adding Grid
    plt.grid(linestyle=":", color='grey')

#topReturnStantions(dataframe)

As one might expect, not all stations are used equally. In 2016, the station next to the Kamppi metro station (central Helsinki) was the most popular one. However, since 2017 Itämerentori has been the undisputed champion by usage. Seeing Itämerentori and Töölönlahdenkatu as the most popular stations might be surprising; however, their popularity is explained by their location within the city bike network. While these stations are not in the centre of Helsinki, they are grouped around the centre of the bike network. In 2016, when there were fewer than 50 stations, Kamppi was at the structural centre of the network. However, as the network expanded towards northern Helsinki, the centre of the bike network moved north too. Because of this, the Itämerentori and Töölönlahdenkatu stations gained a more "important" role within the whole network.

One boundary condition that can affect this list is bike availability at a given station. If no bikes are available, the dataset reflects the availability of bikes rather than the demand for them. What makes the Itämerentori and Töölönlahdenkatu stations lively is that they are popular for both departures and returns. This ensures bike availability at all times and increases station usage.
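
The balance between departures and returns can be checked per station. Below is a minimal sketch on a toy trip table that reuses the `departure_name` and `return_name` columns from the dataset; the helper `station_balance` and the trips themselves are made up for illustration.

```python
import pandas as pd

def station_balance(trips: pd.DataFrame) -> pd.DataFrame:
    """Count departures and returns per station and compute the net flow."""
    departures = trips["departure_name"].value_counts()
    returns = trips["return_name"].value_counts()
    # Align both counts on the station name; stations missing on one side get 0.
    balance = pd.DataFrame({"departures": departures, "returns": returns})
    balance = balance.fillna(0).astype(int)
    # Positive net flow: more bikes arrive than leave. Negative: the station drains.
    balance["net_flow"] = balance["returns"] - balance["departures"]
    return balance.sort_values("departures", ascending=False)

# Toy trips (made up) between three stations mentioned in the text.
toy_trips = pd.DataFrame({
    "departure_name": ["Itämerentori", "Töölönlahdenkatu", "Itämerentori", "Kamppi", "Kamppi"],
    "return_name": ["Töölönlahdenkatu", "Itämerentori", "Kamppi", "Itämerentori", "Töölönlahdenkatu"],
})
balance = station_balance(toy_trips)
print(balance)
```

A station with a net flow close to zero tends to replenish itself, which matches the explanation given above for why some stations stay lively.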

In [None]:
def heatMapPlot(dataframe, year = 2020):
    """
    This function displays an interactive heatmap of departure locations for a given year.
    """
    # Data
    df = dataframe.copy()
    df = df[df['departure'].dt.year == year]

    df.dropna(inplace=True)
    df['freq'] = df.groupby('departure_name')['departure_name'].transform('count')
    

    # Map
    hel_map = folium.Map([60.1975594, 24.9320720], zoom_start=12)
    folium.TileLayer('cartodbdark_matter').add_to(hel_map)

    stationArr = df[['departure_latitude', 'departure_longitude']].to_numpy()
    hel_map.add_child(plugins.HeatMap(stationArr, radius=15))

    display(hel_map)
    
# function call is commented for performance reasons
#heatMapPlot(dataframe)

Another interesting observation is that the popularity of stations doesn't change substantially throughout the year. This tendency is illustrated in the animated heatmap above.

# Which trips are the most popular?
The heatmap below shows the origin-destination pairs and the frequency of their occurrence in 2016.

In [None]:
def odHeatmap(dataframe, year=2016):
    """
    This function plots an origin-destination heatmap for a given year.
    """
    # Data
    df = dataframe.copy()
    df = df[df['departure'].dt.year == year]
    # Count trips per origin-destination pair and name the count column explicitly
    dff = (df.groupby(['departure_name', 'return_name'])
             .size()
             .sort_values(ascending=False)
             .reset_index(name="count"))
    #dff = dff[:50] 

    # Color scale for heatmap
    min_value = dff["count"].quantile(0.05)
    max_value = dff["count"].quantile(0.95)

    # Pivot
    dff = dff.pivot_table(index='departure_name', columns="return_name", fill_value=0)
    dff.sort_index(level=0, ascending=True, inplace=True)

    # Figure
    fig, ax = plt.subplots(figsize=(21,20))
    sns.heatmap(dff,vmin=min_value,vmax=max_value, cmap="Blues",square=True)
    
    # Labeling
    ax.set_title('Origin-Destination Heatmap', fontdict={"fontsize":20}, pad = 50)
    ax.set_xlabel("Destination Name", fontsize= 15, x = 0.5)
    ax.set_ylabel("Origin Name", fontsize= 15)

    # Watermark
    ax.text(0.99, 0.01, '© Github/Geometrein',
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes,
            color='grey',
            fontsize=15,
            alpha = 0.9)

odHeatmap(dataframe)
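
Because a full origin-destination matrix is hard to read, a ranked table of the most frequent pairs can complement the heatmap. The sketch below is an illustration on made-up trips; the helper `top_od_pairs` is hypothetical and only reuses the `departure_name` and `return_name` column names from the dataset.

```python
import pandas as pd

def top_od_pairs(trips: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return the n most frequent origin-destination pairs with their trip counts."""
    return (trips.groupby(["departure_name", "return_name"])
                 .size()
                 .sort_values(ascending=False)
                 .head(n)
                 .reset_index(name="count"))

# Toy trips (made up for illustration).
toy_trips = pd.DataFrame({
    "departure_name": ["A", "A", "A", "B", "B", "C"],
    "return_name":    ["B", "B", "B", "A", "A", "A"],
})
top_pairs = top_od_pairs(toy_trips)
print(top_pairs)
```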

## Conclusion
In this article, we looked at the Helsinki city bike system through the lens of descriptive statistics. We barely scratched the surface of all the possible analyses that can be performed on the underlying dataset. **[The second part of the notebook](https://www.kaggle.com/geometrein/helsinki-city-bike-network-analysis)** will analyze the Helsinki city bike system as a complex transportation network.