## Exploratory Data Analysis and Visualization in pandas

In [1]:
import pandas as pd
import altair as alt
alt.data_transformers.enable('default', max_rows=None)

DataTransformerRegistry.enable('default')

The dataset we'll be working with is [Bike Share ridership](https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/) data from the City of Toronto Open Data portal.

We can download it and save it in a folder as follows:

In [2]:
import urllib.request

year = 2022
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/7e876c24-177c-4605-9cef-e50dd74c617f/resource/db10a7b1-2702-481c-b7f0-0c67070104bb/download/bikeshare-ridership-" + str(year) + ".zip"
folder = "data"
urllib.request.urlretrieve(url, folder + "/bike-share-ridership-" + str(year) + ".zip")

('data/bike-share-ridership-2022.zip',
 <http.client.HTTPMessage at 0x7f8ca8561480>)

The zip folder has `.csv` data for each month in the selected year. 
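If you're not sure what the files inside the archive are called, `zipfile` can list them without extracting anything. The sketch below builds a tiny in-memory archive (a stand-in for the real download, so the file contents are made up) and prints its member names; on the real data you would pass the path to the downloaded zip instead.

```python
import io
import zipfile

# Build a tiny in-memory archive so the sketch runs without the real download.
# The member names mimic the ridership archive; the contents are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("bikeshare-ridership-2022/Bike share ridership 2022-01.csv", "Trip Id\n1\n")
    demo_zip.writestr("bikeshare-ridership-2022/Bike share ridership 2022-02.csv", "Trip Id\n2\n")

# namelist() returns the path of every member stored in the archive.
with zipfile.ZipFile(buffer) as demo_zip:
    member_names = demo_zip.namelist()

for name in member_names:
    print(name)
```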

Since our data are zipped, we can either unzip the folder manually and run `df = pd.read_csv(path_to_csv_file)`, or load a file directly from the archive using the `zipfile` library.

I'm passing in variables for the year and month so that they are easy to switch out, or to loop over, in the future.

In [12]:
import zipfile

month = '06'

with zipfile.ZipFile("data/bike-share-ridership-" + str(year) + ".zip") as myzip:
    with myzip.open("bikeshare-ridership-" + str(year) + "/Bike share ridership " + str(year) + "-" + month + ".csv") as myfile:
        df = pd.read_csv(myfile)
        
df.head()

Unnamed: 0,Trip Id,Trip Duration,Start Station Id,Start Time,Start Station Name,End Station Id,End Time,End Station Name,Bike Id,User Type
0,16028433,384,7430,06/01/2022 00:00,Marilyn Bell Park Tennis Court,7518.0,06/01/2022 00:06,Lake Shore Blvd W / Colborne Lodge Dr,4157,Annual Member
1,16028434,437,7372,06/01/2022 00:00,Adelaide St W / Portland St,7035.0,06/01/2022 00:07,Queen St W / Ossington Ave,1577,Annual Member
2,16028435,495,7156,06/01/2022 00:00,Salem Ave / Bloor St W,7666.0,06/01/2022 00:08,Dundas St W / St Helen Ave - SMART,4628,Casual Member
3,16028436,812,7248,06/01/2022 00:00,Baldwin Ave / Spadina Ave - SMART,7044.0,06/01/2022 00:14,Church St / Alexander St,4137,Annual Member
4,16028437,293,7256,06/01/2022 00:00,Vanauley St / Queen St W - SMART,7416.0,06/01/2022 00:05,Spadina Ave / Blue Jays Way,2295,Annual Member


Great! 

Let's start by looking at the trip duration column. I'm curious how long people are travelling by Bike Share.

The trip duration column is in seconds, which can be a bit difficult to picture, so let's create a column for minutes by dividing by 60. Also notice that the original column name, `Trip  Duration`, has an extra space between the two words, probably just a typo from when the data were created.
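If stray spaces in column names get annoying, one option is to normalise them right after loading. This is just a sketch on a toy DataFrame (the values are copied from the first rows above); note that after doing this you would refer to the column as `Trip Duration`, with a single space.

```python
import pandas as pd

# Toy frame whose first column name has a doubled space, as in the ridership file.
demo = pd.DataFrame({
    "Trip  Duration": [384, 437],
    "User Type": ["Annual Member", "Annual Member"],
})

# Collapse any run of whitespace to a single space and trim the ends.
demo.columns = demo.columns.str.replace(r"\s+", " ", regex=True).str.strip()

# The column can now be referenced with a single space.
demo["Trip Duration Minutes"] = demo["Trip Duration"] / 60
print(demo.columns.tolist())
```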

We can then compute some simple summary statistics on the column.

In [4]:
df["Trip Duration Minutes"] = df["Trip  Duration"] / 60
df["Trip Duration Minutes"].describe()

count    180010.000000
mean         12.727990
std          43.539560
min           0.000000
25%           6.100000
50%           9.733333
75%          15.450000
max        8513.250000
Name: Trip Duration Minutes, dtype: float64

Cool! We've got the mean, standard deviation, and quantiles. The max trip is pretty crazy, at 8,513 minutes (almost 6 days)! Not sure if it's an error in the data, or someone just forgot to return their bike for that long.

The median (50%) being lower than the mean shows that there are definitely outliers pulling the mean upwards.
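To see why a few long trips move the mean but not the median, here is a small sketch on made-up durations. The 1.5 × IQR threshold used at the end is a common rule of thumb for flagging outliers, not something specific to this dataset.

```python
import pandas as pd

# Hypothetical trip durations in minutes; the last value is an extreme outlier.
durations = pd.Series([6.0, 8.0, 10.0, 12.0, 14.0, 8500.0])

median_minutes = durations.median()  # barely affected by the one long trip
mean_minutes = durations.mean()      # pulled far upwards by it
print(median_minutes, mean_minutes)

# Share of trips above the upper fence of the 1.5 * IQR rule of thumb.
q1, q3 = durations.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
share_outliers = (durations > upper_fence).mean()
print(share_outliers)
```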

Let's plot a distribution of shorter trips (those less than 2 hours long).

This will be our first foray into Altair. `alt.Chart` takes in the data, here filtered to just trips of 120 minutes or less, and `mark_bar().encode()` builds the chart.

Note as well that I am plotting just a random sample of 1,000 observations. We could plot them all, but plotting would be slower.

In [5]:
alt.Chart(
    df.loc[df["Trip Duration Minutes"] <= 120].sample(1000)
).mark_bar(
    opacity=0.8
).encode(
    alt.X("Trip Duration Minutes", bin=alt.Bin(step = 5)),
    y='count()',
    tooltip='count()'
)
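Because `sample` draws different rows on every run, the chart above will look slightly different each time. If you want a reproducible chart, one option is to fix the seed with `random_state`. A minimal sketch on a made-up DataFrame (the seed value 42 is arbitrary):

```python
import pandas as pd

# Made-up durations from 1 to 200 minutes.
demo = pd.DataFrame({"Trip Duration Minutes": range(1, 201)})
short_trips = demo.loc[demo["Trip Duration Minutes"] <= 120]

# random_state fixes the seed, so both calls return exactly the same rows.
sample_a = short_trips.sample(50, random_state=42)
sample_b = short_trips.sample(50, random_state=42)
print(sample_a.equals(sample_b))
```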

Let's add some colour for user type.

In [6]:
alt.Chart(
    df.loc[df["Trip Duration Minutes"] <= 120].sample(1000)
).mark_bar(
    opacity=0.8
).encode(
    alt.X("Trip Duration Minutes", bin=alt.Bin(step = 5)),
    alt.Y('count()'),
    alt.Color('User Type'),
    tooltip='count()'
)

How about a plot of trips by day of the month, coloured by user type? We can comment the `color` parameter in or out to draw separate lines by user type.

In [7]:
alt.Chart(
    df.loc[df["Trip Duration Minutes"] <= 120].sample(10000)
).mark_line(point=True).encode(
    x='date(Start Time):O',
    y='count()',
    # color='User Type',
    tooltip='count()'
)
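In the chart above, Altair does the aggregation for us: `date(Start Time)` extracts the day of the month and `count()` counts the rows. If you want the same numbers as a table, a rough pandas equivalent looks like this (toy data, with the column names of the ridership file):

```python
import pandas as pd

demo = pd.DataFrame({
    "Start Time": ["06/01/2022 00:00", "06/01/2022 08:15", "06/02/2022 09:30"],
    "User Type": ["Annual Member", "Casual Member", "Annual Member"],
})

# Parse the timestamps, then group on the day of the month (1 to 31).
start = pd.to_datetime(demo["Start Time"], format="%m/%d/%Y %H:%M")
trips_per_day = demo.groupby(start.dt.day).size()

# Same idea, split by user type (the equivalent of enabling the colour encoding).
trips_per_day_by_type = demo.groupby([start.dt.day.rename("Day"), "User Type"]).size()

print(trips_per_day)
print(trips_per_day_by_type)
```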

I'm curious whether both types of members tend to use Bike Share on the same dates. It's a bit difficult to see in this plot whether Annual Member and Casual Member trips are correlated, so let's make a scatter plot!

Let's first group the data to generate a smaller DataFrame of daily counts for each user type. We can use the `pivot_table` function, which works much like a pivot table in Excel.

In [8]:
df['Start Date'] = pd.to_datetime(df['Start Time'], format='%m/%d/%Y %H:%M')

df_date = df.pivot_table(index=df['Start Date'].dt.date, columns='User Type', aggfunc='size', fill_value=0).reset_index()

df_date["Start Date"] = df_date["Start Date"].astype(str)

df_date.head(5)


User Type,Start Date,Annual Member,Casual Member
0,2022-12-01,2689,5461
1,2022-12-02,2871,5778
2,2022-12-03,1912,3987
3,2022-12-04,2126,4480
4,2022-12-05,2804,5657
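For simple counts like this, `pd.crosstab` or a `groupby` followed by `unstack` should give the same kind of table as `pivot_table` with `aggfunc='size'`. Here is a sketch on a toy DataFrame, in case one of these reads more naturally to you:

```python
import pandas as pd

demo = pd.DataFrame({
    "Start Date": ["2022-06-01", "2022-06-01", "2022-06-01", "2022-06-02"],
    "User Type": ["Annual Member", "Casual Member", "Casual Member", "Annual Member"],
})

# Option 1: crosstab counts every (date, user type) combination, filling gaps with 0.
counts = pd.crosstab(demo["Start Date"], demo["User Type"]).reset_index()

# Option 2: count per group, then move the user types into columns.
alt_counts = demo.groupby(["Start Date", "User Type"]).size().unstack(fill_value=0)

print(counts)
print(alt_counts)
```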


In [9]:
alt.Chart(
    pd.DataFrame(df_date)
).mark_circle(
    size=60
).encode(
    x='Annual Member',
    y='Casual Member',
    tooltip=['Annual Member', 'Casual Member', 'Start Date']
).interactive()

We can also quickly compute the correlation between the two variables.

In [10]:
from scipy.stats import pearsonr

pearsonr(df_date['Annual Member'], df_date['Casual Member'])

PearsonRResult(statistic=0.9879081099750452, pvalue=4.952780539990712e-25)
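If you only need the correlation coefficient and not the p-value, pandas can compute it directly with `Series.corr`, which uses the Pearson method by default. A small sketch using just the five rows printed in the table above (so the number will differ from the full-month result):

```python
import pandas as pd

# The five daily counts shown in the df_date preview above.
demo = pd.DataFrame({
    "Annual Member": [2689, 2871, 1912, 2126, 2804],
    "Casual Member": [5461, 5778, 3987, 4480, 5657],
})

# Pearson correlation between the two columns.
r = demo["Annual Member"].corr(demo["Casual Member"])
print(r)
```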

Let's try something a bit more analytical. Which stations have the most trips that start and end at the same station?

In [11]:
dfs = df.loc[df["Start Station Id"] == df["End Station Id"]]
dfs = dfs.groupby("Start Station Name").size().reset_index(name = "count")

alt.Chart(
    dfs.sort_values("count", ascending = False).head(20),
    title = "Number of return trips to the same station"
).mark_bar(
    opacity=0.8
).encode(
    y = alt.Y("Start Station Name", sort='-x'),
    x = alt.X("count"),
    tooltip = "count"
).configure_axis(
    labelLimit=300,
    labelPadding=10,
    title=None
)

### Table Joins - Looking at Weather and Ridership

Okay! Let's do one last bit of analysis. Let's try to see how ridership is related to weather.

Let's first load in ALL the ridership data and compute the total number of trips per day. This might take a little while, since it's a lot of data to load!

In [151]:
df_months = []

zip_path = "data/bike-share-ridership-" + str(year) + ".zip"

def daily_trip_counts(csv_file):
    # read one month of trips and count the number of trips per start date
    df_trips = pd.read_csv(csv_file)
    df_trips['Start Date'] = pd.to_datetime(df_trips['Start Time'], format='%m/%d/%Y %H:%M').dt.date.astype(str)
    return df_trips.groupby('Start Date')['Start Date'].count().reset_index(name='Count')

with zipfile.ZipFile(zip_path) as myzip:
    for month in ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]:
        file_name = "Bike share ridership " + str(year) + "-" + month
        # for some reason, the data for November is zipped twice
        if month == "11":
            with myzip.open("bikeshare-ridership-" + str(year) + "/" + file_name + ".zip") as inner_zip_file:
                with zipfile.ZipFile(inner_zip_file) as inner_zip:
                    with inner_zip.open(file_name + ".csv") as myfile:
                        df_months.append(daily_trip_counts(myfile))
        else:
            with myzip.open("bikeshare-ridership-" + str(year) + "/" + file_name + ".csv") as myfile:
                df_months.append(daily_trip_counts(myfile))

df_by_day = pd.concat(df_months)

df_by_day['Start DateTime'] = pd.to_datetime(df_by_day['Start Date'])

del df_months

Great! Let's plot it like we did earlier.

In [152]:
alt.Chart(
    df_by_day
).mark_line(point=True).encode(
    x=alt.X('Start DateTime:T', axis=alt.Axis(format="%b %d")),
    y='Count',
    tooltip=['Count', 'Start DateTime:T']
).configure_view(
    width=1000
)

Let's load in our weather data! This was accessed from the federal government's historical climate data website: https://climate.weather.gc.ca/index_e.html

In [153]:
df_weather = pd.read_csv("toronto-historical-weather-2022.csv")
df_weather.head()

Unnamed: 0,Longitude (x),Latitude (y),Station Name,Climate ID,Date/Time,Year,Month,Day,Data Quality,Max Temp (°C),...,Total Snow (cm),Total Snow Flag,Total Precip (mm),Total Precip Flag,Snow on Grnd (cm),Snow on Grnd Flag,Dir of Max Gust (10s deg),Dir of Max Gust Flag,Spd of Max Gust (km/h),Spd of Max Gust Flag
0,-79.4,43.67,TORONTO CITY,6158355,2022-01-01,2022,1,1,,5.1,...,,,2.4,,,,,M,,M
1,-79.4,43.67,TORONTO CITY,6158355,2022-01-02,2022,1,2,,-2.1,...,,,2.0,,3.0,,,M,,M
2,-79.4,43.67,TORONTO CITY,6158355,2022-01-03,2022,1,3,,-4.0,...,,,0.0,,3.0,,,M,,M
3,-79.4,43.67,TORONTO CITY,6158355,2022-01-04,2022,1,4,,3.3,...,,,0.0,,3.0,,,M,,M
4,-79.4,43.67,TORONTO CITY,6158355,2022-01-05,2022,1,5,,4.9,...,,,0.3,,3.0,,,M,,M


There's a lot of data here we could look at, but let's keep it simple for now: we'll just take mean temperature (°C) and total precipitation (mm) and join them to our daily ridership DataFrame.

In [154]:
df_ridership_weather = df_by_day.merge(df_weather[["Date/Time", "Mean Temp (°C)", "Total Precip (mm)"]], left_on="Start Date", right_on="Date/Time")
df_ridership_weather.head(5)

Unnamed: 0,Start Date,Count,Start DateTime,Date/Time,Mean Temp (°C),Total Precip (mm)
0,2022-01-01,2851,2022-01-01,2022-01-01,1.5,2.4
1,2022-01-02,1135,2022-01-02,2022-01-02,-6.3,2.0
2,2022-01-03,2157,2022-01-03,2022-01-03,-8.4,0.0
3,2022-01-04,3371,2022-01-04,2022-01-04,-1.2,0.0
4,2022-01-05,2870,2022-01-05,2022-01-05,0.2,0.3
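By default `merge` does an inner join, so any day that is missing from the weather table silently drops out. If you want to check for that, one option is a left join with `indicator=True`, which adds a `_merge` column showing where each row came from. A sketch on toy data (the first rows of the tables above, with one weather day deliberately left out):

```python
import pandas as pd

rides = pd.DataFrame({
    "Start Date": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "Count": [2851, 1135, 2157],
})
# The weather table is deliberately missing 2022-01-03.
weather = pd.DataFrame({
    "Date/Time": ["2022-01-01", "2022-01-02"],
    "Mean Temp (°C)": [1.5, -6.3],
})

# how="left" keeps every ridership day; validate raises if a date is duplicated.
checked = rides.merge(
    weather,
    left_on="Start Date",
    right_on="Date/Time",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Days with ridership but no matching weather record.
unmatched = checked.loc[checked["_merge"] == "left_only", "Start Date"].tolist()
print(unmatched)
```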


In [186]:
alt.Chart(
    df_ridership_weather
).mark_circle(
    size=60
).encode(
    x="Mean Temp (°C)",
    y="Count",
    tooltip=["Mean Temp (°C)", "Count", "Start Date"]
).configure_view(
    width=420, 
    height=420
)

Clearly pretty correlated (except for the one outlier)! How about we include a simple classification for precipitation and add it to the chart as a colour?

In [207]:
df_ridership_weather['Precip Category'] = df_ridership_weather['Total Precip (mm)'].apply(
    lambda x: '0mm' if x == 0 else ('0mm < X < 10mm' if 0 < x < 10 else '10mm +')
)
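One thing to watch for with the lambda above: a missing precipitation value (`NaN`) fails both comparisons and so ends up in the `'10mm +'` category. The sketch below shows this on made-up values, along with a hypothetical helper, `precip_category`, that keeps missing readings missing instead.

```python
import numpy as np
import pandas as pd

# Made-up precipitation values, including one missing reading.
precip = pd.Series([0.0, 0.3, 9.9, 10.0, 25.4, np.nan])

# The classification used above: NaN falls through to the last branch.
original = precip.apply(
    lambda x: '0mm' if x == 0 else ('0mm < X < 10mm' if 0 < x < 10 else '10mm +')
)

def precip_category(x):
    """Hypothetical helper: same categories, but missing values stay missing."""
    if pd.isna(x):
        return None
    if x == 0:
        return '0mm'
    if x < 10:
        return '0mm < X < 10mm'
    return '10mm +'

category = precip.apply(precip_category)
print(original.tolist())
print(category.tolist())
```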

In [244]:
domain = ['0mm', '0mm < X < 10mm', '10mm +']
range_ = ['#DC4633', '#8DBF2E', '#007FA3']

alt.Chart(
    df_ridership_weather
).mark_circle(
    size=60
).encode(
    x="Mean Temp (°C)",
    y="Count",
    color=alt.Color('Precip Category', scale=alt.Scale(domain=domain, range=range_)),
    tooltip=["Mean Temp (°C)", "Count", 'Total Precip (mm)', "Start Date"]
).configure_view(
    width=420, 
    height=420
)

Cool! Clearly there is a trend here.

We can try to model this trend statistically via a linear regression model.

How well do temperature and precipitation predict ridership per day?

[statsmodels](https://www.statsmodels.org/stable/index.html), which we use below, is a commonly used library for statistical modelling in Python. [scikit-learn](https://scikit-learn.org/stable/index.html) is a commonly used alternative for statistical and machine learning modelling.

Let's first just do a bivariate model:

In [243]:
import pandas as pd
import statsmodels.api as sm

df_ridership_weather.dropna(inplace=True)

X = df_ridership_weather[['Mean Temp (°C)']]
y = df_ridership_weather['Count']

# Add constant to the X matrix
X = sm.add_constant(X)

# Fit an OLS model and print the results
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Count   R-squared:                       0.808
Model:                            OLS   Adj. R-squared:                  0.808
Method:                 Least Squares   F-statistic:                     1525.
Date:                Sun, 07 May 2023   Prob (F-statistic):          7.54e-132
Time:                        15:12:32   Log-Likelihood:                -3495.4
No. Observations:                 364   AIC:                             6995.
Df Residuals:                     362   BIC:                             7003.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           5824.3977    256.883     22.
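To make it a bit clearer what the model is estimating, here is a sketch of the same idea with plain NumPy on made-up, perfectly linear data. The column of ones plays the role of `sm.add_constant`, and the least-squares solution gives an intercept and a slope. (In statsmodels, the fitted coefficients are available as `ols_model.params`.)

```python
import numpy as np

# Hypothetical data: ridership rises by exactly 500 trips per degree.
temp = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
rides = 1000.0 + 500.0 * temp

# Design matrix: a column of ones (the constant) next to the temperature.
X = np.column_stack([np.ones_like(temp), temp])

# Ordinary least squares: find coefficients minimising the squared residuals.
coef, *_ = np.linalg.lstsq(X, rides, rcond=None)
intercept, slope = coef
print(intercept, slope)
```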

And now with the precipitation categories!

In [242]:
dummies = pd.get_dummies(df_ridership_weather['Precip Category'], prefix='Precip')

# Concatenate the dummy variables with the original dataframe
df_ridership_weather = pd.concat([df_ridership_weather, dummies], axis=1)

# Define the X and y variables for the regression model
X = df_ridership_weather[['Mean Temp (°C)', 'Precip_0mm', 'Precip_0mm < X < 10mm', 'Precip_10mm +']]
y = df_ridership_weather['Count']

# Add constant to the X matrix
X = sm.add_constant(X)

# Fit an OLS model and print the results
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Count   R-squared:                       0.850
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                     680.6
Date:                Sun, 07 May 2023   Prob (F-statistic):          6.09e-148
Time:                        15:12:12   Log-Likelihood:                -3450.5
No. Observations:                 364   AIC:                             6909.
Df Residuals:                     360   BIC:                             6925.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                  3243.53
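A caveat on this model: including a constant together with all three precipitation dummies makes the columns linearly dependent (the dummies always sum to 1, which is the constant). That is consistent with the summary reporting `Df Model: 3` even though four predictors went in, and it means the individual dummy coefficients are not uniquely identified. A common way around it is to drop one category as the baseline, for example with `drop_first=True`. The sketch below only checks the matrix rank on made-up data; it does not refit the model.

```python
import numpy as np
import pandas as pd

# Made-up categories and temperatures, just to build a small design matrix.
cats = pd.Series(
    ['0mm', '0mm < X < 10mm', '10mm +', '0mm', '10mm +', '0mm < X < 10mm'],
    name='Precip Category',
)
temp = pd.Series([1.5, -6.3, -8.4, -1.2, 0.2, 12.0], name='Mean Temp (°C)')
const = np.ones(len(cats))

# Constant + temperature + all three dummies: 5 columns, but only rank 4.
full = pd.get_dummies(cats, prefix='Precip', dtype=float)
X_full = np.column_stack([const, temp.to_numpy(), full.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping the first category ('0mm') as the baseline gives full column rank.
reduced = pd.get_dummies(cats, prefix='Precip', drop_first=True, dtype=float)
X_reduced = np.column_stack([const, temp.to_numpy(), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

With a baseline dropped, each remaining dummy coefficient reads as the difference in daily trips relative to dry days at the same temperature.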

Count = 3243.54 + 686.34 * "Mean Temp (°C)" + 3863.46 * "Precip_0mm" + 1024.64 * "Precip_0mm < X < 10mm" - 1644.57 * "Precip_10mm +"
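To get a feel for what this equation implies, we can plug in some values. The sketch below simply copies the rounded coefficients from the equation above into a dictionary and uses a hypothetical helper, `predicted_trips`, to evaluate it. Exactly one precipitation dummy equals 1 on any given day, so each category effectively shifts the intercept.

```python
# Rounded coefficients copied from the equation above.
coef = {
    "const": 3243.54,
    "Mean Temp (°C)": 686.34,
    "0mm": 3863.46,
    "0mm < X < 10mm": 1024.64,
    "10mm +": -1644.57,
}

def predicted_trips(mean_temp, precip_category):
    """Hypothetical helper: evaluate the fitted equation for one day."""
    return coef["const"] + coef["Mean Temp (°C)"] * mean_temp + coef[precip_category]

# Effective intercept for each precipitation category (temperature of 0 °C).
intercepts = {cat: predicted_trips(0, cat) for cat in ["0mm", "0mm < X < 10mm", "10mm +"]}
print(intercepts)

# Predicted trips on a dry 20 °C day versus a very wet 20 °C day.
dry_day = predicted_trips(20, "0mm")
wet_day = predicted_trips(20, "10mm +")
print(dry_day, wet_day, dry_day - wet_day)
```

Under these coefficients, a day with 10 mm or more of precipitation is predicted to have roughly 5,500 fewer trips than a dry day at the same temperature.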