In [1]:
import pandas as pd
import altair as alt
alt.data_transformers.enable('default', max_rows=None)

DataTransformerRegistry.enable('default')

The dataset we'll be working with is [Bike Share ridership](https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/) data from the City of Toronto Open Data portal.

We can download it and save it in a folder as follows:

In [2]:
import os
import urllib.request

year = 2022
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/7e876c24-177c-4605-9cef-e50dd74c617f/resource/db10a7b1-2702-481c-b7f0-0c67070104bb/download/bikeshare-ridership-" + str(year) + ".zip"
folder = "data"
# urlretrieve does not create missing folders, so make sure the target exists
os.makedirs(folder, exist_ok=True)
urllib.request.urlretrieve(url, folder + "/bike-share-ridership-" + str(year) + ".zip")

('data/bike-share-ridership-2022.zip',
 <http.client.HTTPMessage at 0x7f045e6f05b0>)

The zip archive contains one `.csv` file for each month of the selected year.

Since our data are zipped, we can either unzip the archive manually and run `df = pd.read_csv(path_to_csv_file)`, or read the file straight out of the archive using the `zipfile` library.

I'm going with the second option, and feeding in variables for year and month so that it's easy to switch these out or loop over several months in the future.

In [3]:
import zipfile

month = '06'

with zipfile.ZipFile("data/bike-share-ridership-" + str(year) + ".zip") as myzip:
    with myzip.open("bikeshare-ridership-" + str(year) + "/Bike share ridership " + str(year) + "-" + month + ".csv") as myfile:
        df = pd.read_csv(myfile)
        
df.head()

Unnamed: 0,Trip Id,Trip Duration,Start Station Id,Start Time,Start Station Name,End Station Id,End Time,End Station Name,Bike Id,User Type
0,16028433,384,7430,06/01/2022 00:00,Marilyn Bell Park Tennis Court,7518.0,06/01/2022 00:06,Lake Shore Blvd W / Colborne Lodge Dr,4157,Annual Member
1,16028434,437,7372,06/01/2022 00:00,Adelaide St W / Portland St,7035.0,06/01/2022 00:07,Queen St W / Ossington Ave,1577,Annual Member
2,16028435,495,7156,06/01/2022 00:00,Salem Ave / Bloor St W,7666.0,06/01/2022 00:08,Dundas St W / St Helen Ave - SMART,4628,Casual Member
3,16028436,812,7248,06/01/2022 00:00,Baldwin Ave / Spadina Ave - SMART,7044.0,06/01/2022 00:14,Church St / Alexander St,4137,Annual Member
4,16028437,293,7256,06/01/2022 00:00,Vanauley St / Queen St W - SMART,7416.0,06/01/2022 00:05,Spadina Ave / Blue Jays Way,2295,Annual Member


Great! 
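As an aside, here is a rough sketch of what looping over several months could look like. The function name `load_months` and the added `Month` column are my own choices for illustration, and the sketch assumes every month in the archive follows the same file naming pattern as the one we just opened.

```python
import zipfile

import pandas as pd


def load_months(zip_path, year, months):
    """Read the monthly CSVs listed in `months` and stack them into one DataFrame."""
    frames = []
    with zipfile.ZipFile(zip_path) as myzip:
        for month in months:
            # Assumption: every month uses the same naming pattern as the file loaded above
            member = f"bikeshare-ridership-{year}/Bike share ridership {year}-{month}.csv"
            with myzip.open(member) as myfile:
                # Tag each row with its month so the pieces stay distinguishable
                frames.append(pd.read_csv(myfile).assign(Month=month))
    return pd.concat(frames, ignore_index=True)
```

Something like `load_months("data/bike-share-ridership-2022.zip", 2022, ["06", "07", "08"])` would then give one table for the summer months, provided those files exist in the archive under those names.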

Let's start by looking at the trip duration column. I'm curious how long people are travelling by Bike Share.

The "Trip Duration" column is in seconds, which can be a bit difficult to picture, so let's create a column for minutes by dividing by 60. Also notice that the original column name has an extra space in it (`Trip  Duration`), probably just a typo from when the data were created.

We can then compute some simple summary statistics on the column.

In [4]:
df["Trip Duration Minutes"] = df["Trip  Duration"] / 60
df["Trip Duration Minutes"].describe()

count    605645.000000
mean         16.901254
std          56.321134
min           0.000000
25%           7.766667
50%          12.850000
75%          20.150000
max       19578.000000
Name: Trip Duration Minutes, dtype: float64

Cool! We've got the mean, standard deviation, and quantiles. The max trip is pretty crazy: 19578 minutes, which is over 13 days! Not sure if it's an error in the data, or someone just forgot to return their bike for that long.
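A quick way to double check that conversion is to let Python's `datetime.timedelta` do the arithmetic. This is just a sanity check on the single number printed by `describe()`; the variable name `longest` is mine.

```python
from datetime import timedelta

# The maximum trip duration reported by describe(), in minutes
longest = timedelta(minutes=19578)

print(longest)       # 13 days, 14:18:00
print(longest.days)  # 13
```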

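The median (50%) being lower than the mean suggests the distribution is skewed to the right: a small number of very long trips (outliers) pull the mean up, while the median barely moves.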
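To see why a single long trip affects the mean so much more than the median, here is a tiny illustration. The durations are made up for the example (only the 19578 comes from the summary above), so treat it as a sketch of the idea rather than a result about the data.

```python
from statistics import mean, median

# Made-up trip durations in minutes, not taken from the dataset
typical = [8, 10, 12, 13, 15, 20]
# Same trips plus one extreme value, borrowed from the max above
with_outlier = typical + [19578]

print(median(typical), mean(typical))            # 12.5 13
print(median(with_outlier), mean(with_outlier))  # 13 2808
```

The median shifts by half a minute, while the mean jumps from 13 to 2808.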

Let's plot a distribution of shorter trips (those of 2 hours or less).

This will be our first foray into Altair. `alt.Chart` takes in the data, here filtered to just trips of 120 minutes or less, and `mark_bar().encode` builds the chart.

Note as well that I am just plotting a random sample of 1000 observations. We could plot them all, but plotting would be slower.

In [5]:
alt.Chart(
    df.loc[df["Trip Duration Minutes"] <= 120].sample(1000)
).mark_bar(
    opacity=0.8
).encode(
    alt.X("Trip Duration Minutes", bin=alt.Bin(step = 5)),
    y='count()',
    tooltip='count()'
)
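If it's unclear what the binned `count()` is doing, here is roughly the same computation done by hand in plain Python. Altair (via Vega-Lite) does this inside the chart, so this is only a sketch of the idea with made-up durations, and I'm not claiming it matches Vega-Lite's handling of bin edges exactly.

```python
from collections import Counter

# Made-up trip durations in minutes, not taken from the dataset
durations = [3.2, 7.8, 12.9, 14.1, 20.2, 4.9, 11.0]
step = 5  # same bin width as alt.Bin(step=5)

# Each value falls in the bin that starts at the largest multiple of `step` not above it
bin_starts = [int(d // step) * step for d in durations]
counts = Counter(bin_starts)

for start in sorted(counts):
    print(f"[{start}, {start + step}): {counts[start]}")
```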

Let's add some colour for user type.

In [7]:
alt.Chart(
    df.loc[df["Trip Duration Minutes"] <= 120].sample(1000)
).mark_bar(
    opacity=0.8
).encode(
    alt.X("Trip Duration Minutes", bin=alt.Bin(step = 5)),
    alt.Y('count()'),
    alt.Color('User Type'),
    tooltip='count()'
)

How about a plot of trips by day of the month, and colour by user type?

In [36]:
alt.Chart(
    df.loc[df["Trip Duration Minutes"] <= 120].sample(10000)
).mark_line(point=True).encode(
    x='date(Start Time):O',
    y='count()',
    color='User Type',
    tooltip='count()'
)
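In the chart above, `date(Start Time)` extracts the day of the month and `count()` counts the rows per day. The same table can be built in pandas, which is handy for checking the numbers behind the plot. The frame below is made up but uses the same column names, and the format string assumes the month/day/year layout seen in `df.head()`.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the ridership data
trips = pd.DataFrame({
    "Start Time": ["06/01/2022 00:00", "06/01/2022 08:15", "06/02/2022 09:30"],
    "User Type": ["Annual Member", "Casual Member", "Annual Member"],
})

# Assumption: timestamps are month/day/year hour:minute, as in df.head()
start = pd.to_datetime(trips["Start Time"], format="%m/%d/%Y %H:%M")

# Count trips per day of the month and user type
daily = (
    trips.assign(Day=start.dt.day)
    .groupby(["Day", "User Type"])
    .size()
    .reset_index(name="count")
)
print(daily)
```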

Let's try something a bit more analytical. Which stations have the most trips where the bike is taken out and returned at the same station?

In [64]:
dfs = df.loc[df["Start Station Id"] == df["End Station Id"]]
dfs = dfs.groupby("Start Station Name").size().reset_index(name = "count").sort_values("count", ascending = False)

alt.Chart(
    dfs.head(20),
    title = "Number of return trips to the same station"
).mark_bar(
    opacity=0.8
).encode(
    y = alt.Y("Start Station Name", sort='-x'),
    x = alt.X("count"),
    tooltip = "count"
).configure_axis(
    labelLimit=300,
    labelPadding=10,
    title=None
)
