# Explore cycle share data

This notebook is directly inspired by this excellent [blogpost](http://jakevdp.github.io/blog/2015/10/17/analyzing-pronto-cycleshare-data-with-python-and-pandas/) by Jake VanderPlas. Incidentally, Jake VanderPlas is also a lead developer of the [Altair](https://altair-viz.github.io/) package that we will use for data viz and the author of the 
[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) :)

# Import data

The data was downloaded from the [Kaggle Datasets](https://www.kaggle.com/pronto/cycle-share-dataset) repository.
It was originally provided by Pronto, the company that operates the cycle share system in Seattle, as part of an open data initiative.
You can find the following data description on the Kaggle repository.

## Context

The Pronto Cycle Share system consists of 500 bikes and 54 stations located in Seattle. Pronto provides open data on individual trips, stations, and daily weather.

## Content

There are 3 datasets that provide data on the stations, trips, and weather from 2014-2016.

1. Station dataset
    
    - station_id: station ID number
    - name: name of station
    - lat: station latitude
    - long: station longitude
    - install_date: date that station was placed in service
    - install_dockcount: number of docks at each station on the installation date
    - modification_date: date that station was modified, resulting in a change in location or dock count
    - current_dockcount: number of docks at each station on 8/31/2016
    - decommission_date: date that station was placed out of service


2. Trip dataset
    
    - trip_id: numeric ID of bike trip taken
    - start_time: day and time trip started, in PST
    - stop_time: day and time trip ended, in PST
    - bike_id: ID attached to each bike
    - trip_duration: time of trip in seconds
    - from_station_name: name of station where trip originated
    - to_station_name: name of station where trip terminated
    - from_station_id: ID of station where trip originated
    - to_station_id: ID of station where trip terminated
    - user_type: "Short-Term Pass Holder" is a rider who purchased a 24-Hour or 3-Day Pass; "Member" is a rider who purchased a Monthly or an Annual Membership
    - gender: gender of rider
    - birth_year: birth year of rider


3. Weather dataset contains daily weather information in the service area

    - speeds in miles-per-hour
    - temperatures in Fahrenheit
    - visibilities in miles
    - precipitation in inches
    - pressure in inches of mercury

In [None]:
# import the csv data in pandas 
import os
import pandas as pd
import numpy as np

data_dir = "cycle-share-dataset"

station = pd.read_csv(
    os.path.join(data_dir, "station.csv"), 
    parse_dates=["install_date","modification_date","decommission_date"]
)

trip = pd.read_csv(
    os.path.join(data_dir, "trip.csv"), 
    parse_dates=["start_time","stop_time"],
    infer_datetime_format=True,
    skiprows=range(1,50794) # skip the 50793 data rows after the header (row 0): they are duplicates
)
assert trip["trip_id"].nunique()==trip.shape[0]

weather = pd.read_csv(
    os.path.join(data_dir, "weather.csv"), 
    parse_dates=["date"]
)

# Clean data

Before using any data you should check it thoroughly! Let's check, for instance, the `station` dataset. The data description already provides a lot of information:

- the meaning (and therefore expected data-type) of each column
- there are 54 stations, uniquely identified by `station_id`

First let's see how many rows we have and list the columns along with their data-types.

In [None]:
print(f"station: {station.shape[0]} rows {station.shape[1]} columns\n")
print(station.dtypes)

We have 58 records and not 54. Are there any duplicated rows? Let's see how many distinct `station_id` values we have:

In [None]:
n_station_id = station["station_id"].nunique()
print(f"There are {n_station_id} distinct station_id")

OK, so we have no duplicates and exactly one record per `station_id`. There are in fact 58 stations in our dataset, not 54.
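
The same uniqueness check can also be written with `Series.is_unique` and `DataFrame.duplicated`. Here is a minimal sketch on a hypothetical toy table (the values are made up for illustration; they are not from the Pronto data):

```python
import pandas as pd

# Toy data (hypothetical values): station "B" appears twice
toy_station = pd.DataFrame({
    "station_id": ["A", "B", "B", "C"],
    "name": ["Alpha", "Beta", "Beta", "Gamma"],
})

# True only when every station_id occurs exactly once
ids_are_unique = toy_station["station_id"].is_unique

# keep=False flags every copy of a repeated station_id, not just the later ones
duplicated_rows = toy_station[toy_station.duplicated("station_id", keep=False)]

print(ids_are_unique)
print(duplicated_rows)
```

Listing the offending rows is often more useful than a bare count, since it shows whether the copies are exact duplicates or conflicting records.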

Now let's take a look at the first rows:

In [None]:
station.head(5)

Note that all fields are consistent with the description: `lat` and `long` look like latitude and longitude, `name` looks like a station name, the counts look like counts and the dates like dates.

However the `modification_date` and `decommission_date` are all missing in the first 5 rows: values are all `NaT` meaning Not-a-Time. For the dates let's see how many missing values we have, as well as the min and max date: 

In [None]:
for date_column in ["install_date", "modification_date", "decommission_date"]:
    n_missing = station[date_column].isnull().sum()
    date_min = station[date_column].min()
    date_max = station[date_column].max()
    print(f"{date_column}: {n_missing} are missing min={date_min} max={date_max}")

That makes sense: all have an installation date, a few have been modified and only 58 - 54 = 4 were decommissioned.
Probably the 54 in the data description was referring to stations still in service.

We can list here the 4 out of service stations:

In [None]:
station[station.decommission_date.notnull()]

What about the remaining numeric columns `lat`, `long`, `install_dockcount` and `current_dockcount`: does the range of values make sense? Are there any missing values? Let's use the `describe()` method to get a quick statistical summary for each column, and `.T` to transpose the summary stats dataframe.

In [None]:
station.describe().T

There are no missing values (**count** counts the number of non-null values), and every range of values makes sense. For instance, Seattle is located at $47.6^{\circ}$N, $122.3^{\circ}$W, which matches the `lat` and `long` ranges.

Now let's check the `trip` dataset. If the start and stop times are consistent, we expect that

- `stop_time` $>$ `start_time`,
- `trip_duration` $\simeq$ `stop_time` $-$ `start_time` (in seconds).

But actually there are a few errors!
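
A caveat worth knowing when turning a difference of timestamps into seconds: on a pandas timedelta column, `.dt.seconds` returns only the seconds component (the whole days are dropped), whereas `.dt.total_seconds()` returns the full elapsed time. A minimal sketch with one hypothetical trip that lasts just over a day:

```python
import pandas as pd

# Hypothetical trip lasting 1 day and 5 minutes
start = pd.Series(pd.to_datetime(["2015-01-01 08:00:00"]))
stop = pd.Series(pd.to_datetime(["2015-01-02 08:05:00"]))
delta = stop - start

# Seconds component only: the full day is ignored
seconds_component = delta.dt.seconds
# Full elapsed time in seconds, as a float
total_seconds = delta.dt.total_seconds()

print(seconds_component.iloc[0], total_seconds.iloc[0])
```

For trips shorter than a day, both give the same value.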

In [None]:
print(f"trip: {trip.shape[0]} records")
# count number of trips where stop_time < start_time
n_time_travel = (trip["stop_time"] < trip["start_time"]).sum()
print(f"The {n_time_travel} trips for which stop_time < start_time:")
# show the few outliers
trip.query("stop_time < start_time")

In [None]:
# let's recompute the trip duration in seconds from the stop and start times:
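# (total_seconds() keeps whole days, unlike .dt.seconds which returns only the seconds component)
trip["computed_duration"] = (trip["stop_time"]-trip["start_time"]).dt.total_seconds()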
# and see if there is more than a 1min = 60s difference with trip_duration:
trip["over_1min"] = (trip["computed_duration"]-trip["trip_duration"]).abs() > 60
# show the few outliers
print("Trips with over 1min difference between stop_time-start_time and trip_duration")
# we focus on the stop_time > start_time trips, the stop_time < start_time trips are shown just above
trip.query("(stop_time > start_time) & over_1min")

There are actually very few errors (9 out of 236065). Let's filter out the bad rows and drop the utility columns
`computed_duration` and `over_1min` we have created for the sanity check.

In [None]:
# tilde is the logical NOT operator
trip = trip.query("(stop_time > start_time) & ~over_1min")
# dropping columns
trip = trip.drop(columns=["computed_duration", "over_1min"])
print(f"Filtered trip dataset: we now have {trip.shape[0]} records")

Let's do one final check. You may have noticed that the trip dataset contains the IDs of the stations where each trip originated and terminated. But is every station ID in the `trip` dataset also present in the `station` dataset?

In [None]:
station_ids = station["station_id"].unique().tolist()
from_station_ids = trip["from_station_id"].unique().tolist()
to_station_ids = trip["to_station_id"].unique().tolist()
trip_station_ids = set(from_station_ids + to_station_ids)
not_in_station = [
    station_id for station_id in trip_station_ids
    if station_id not in station_ids
]
print(
    f"{len(not_in_station)} / {len(trip_station_ids)} ids not recovered in station", 
    not_in_station
)

These station IDs do seem special: maybe they correspond to a repair or maintenance shop, or refer to "lost" bikes?
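
This kind of referential check can also be expressed with set operations. Below is a minimal sketch on hypothetical toy tables (the station ID `"X"` is made up); it also counts how many trips touch an unknown station, which helps decide whether those rows matter:

```python
import pandas as pd

# Toy tables (hypothetical values): station "X" is used in a trip but not listed
toy_station = pd.DataFrame({"station_id": ["A", "B", "C"]})
toy_trip = pd.DataFrame({
    "from_station_id": ["A", "B", "X"],
    "to_station_id": ["C", "A", "B"],
})

# All station ids referenced by trips, as origin or destination
trip_ids = set(toy_trip["from_station_id"]) | set(toy_trip["to_station_id"])
# Ids referenced by trips but absent from the station table
unknown_ids = trip_ids - set(toy_station["station_id"])

# Trips that start or end at an unknown station
touches_unknown = (
    toy_trip["from_station_id"].isin(list(unknown_ids))
    | toy_trip["to_station_id"].isin(list(unknown_ids))
)
n_affected = int(touches_unknown.sum())

print(unknown_ids, n_affected)
```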

## Summary

This dataset is very clean: all fields have a clear meaning that matches their data type, and there are no weird values. Note that such high-quality data is more the exception than the rule...

Usually data is very messy, and you will spend a considerable amount of time cleaning it.
Unfortunately people are very creative at messing things up: each dataset is messy in its own unique way.
You may encounter incoherent fields (start time $>$ end time), $-1$ or $999$ used as missing values, aberrant values, duplicated rows, unintelligible column names, etc.
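
As an illustration of the sentinel-value problem, here is a minimal sketch that converts such codes to proper missing values. The toy table and the per-column sentinel lists are hypothetical; in practice the codes must be confirmed from the data documentation, since a value like `999` can be perfectly legitimate in another column:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) where -1 and 999 encode "missing"
toy = pd.DataFrame({
    "birth_year": [1985, -1, 1990, 999],
    "trip_duration": [600.0, 1200.0, -1.0, 999.0],
})

# Sentinels declared per column: 999 seconds is a valid duration, so it is kept there
sentinels = {"birth_year": [-1, 999], "trip_duration": [-1]}

cleaned = toy.copy()
for column, values in sentinels.items():
    cleaned[column] = cleaned[column].replace(values, np.nan)

print(cleaned.isna().sum())
```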

## Exercise

Your turn! Check that:

- each trip corresponds to a single record in the `trip` dataset (no duplicates)
- the values for `gender`, `user_type` and `birth_year` make sense
- the number of bikes roughly agrees with the data description
- the values in `weather` data make sense

In [None]:
# %load solutions/exo_sanity_check.py

# Basic visualization

For data visualization we will be using the [Altair](https://altair-viz.github.io/) package, a good place to start is the [gallery](https://altair-viz.github.io/gallery/index.html) where you will find many basic and advanced examples.

In [None]:
import altair as alt
alt.data_transformers.enable('csv')

## Weather

Let's take a look at the number of events (fog/snow/rain/sun) for each month. In a declarative visualization library like Altair, it's very simple: just declare *what* you would like to see, not *how* to do it. In our case we would like to show, for each 
$x$ = month and $y$ = type of events, the number of records (or count) in our dataset, 
and display this count by a circle of varying size for example.
Well, that's pretty much all you need to write in code:

In [None]:
# creating a chart using the `weather` dataset
alt.Chart(weather).mark_circle().encode(
    x="month(date)", y='events', size="count()"
)

It seems that there are many sunny days in spring and summer, and many rainy days in autumn and winter.
There are very few snowy days, and only during winter. This seems reasonable (Seattle has a warm-temperate climate)!

Now let's take a look at the daily temperature: we want to show the mean temperature, but also the min-max temperature interval. 

In [None]:
# the mean temperature is represented as a line
line  = alt.Chart(weather).mark_line().encode(
    x='date', y='mean_temperature'
) 
# the min-max temperature interval is represented by a shaded (opacity=0.2) area
area = alt.Chart(weather).mark_area(opacity=0.2).encode(
    x='date', 
    # we change the y-axis title and do not enforce the y-scale to start at zero
    y=alt.Y('min_temperature', scale=alt.Scale(zero=False), title="temperature"),
    y2='max_temperature'
)
# we can zoom and pan along the x-axis
(line + area).interactive(bind_y=False)

We clearly see a seasonal trend.

## Exercise

Explore the weather dataset! The list of columns in the weather dataset is given below. Don't forget to take a look at the Altair [examples gallery](https://altair-viz.github.io/gallery/index.html): there is probably an example very close to the visualization you would like to do ;)

In [None]:
weather.columns

# Group-by   

## Daily Trips

Let's now look at the daily number of trips, for each user type (member or short-term pass holder). To do this, we need to group by date and user type and simply count the number of records (method `.size()` in pandas). 
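
To see what this aggregation produces, here is a minimal sketch on a hypothetical toy table (four made-up trips). Note that a (date, user type) pair with no trips does not appear in the result at all, rather than appearing with a count of zero:

```python
import pandas as pd

# Toy trips (hypothetical values): three on June 1st, one on June 2nd
toy_trip = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-06-01", "2015-06-01", "2015-06-01", "2015-06-02"]
    ),
    "user_type": ["Member", "Member", "Short-Term Pass Holder", "Member"],
})

# size() counts rows per group; rename() names the resulting series;
# reset_index() turns the group keys back into ordinary columns
toy_daily = (
    toy_trip.groupby(["date", "user_type"]).size().rename("trips").reset_index()
)

print(toy_daily)
```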

In [None]:
# extract the date from start_time
trip["date"] = pd.to_datetime(trip["start_time"].dt.date)
# daily number of trips per user_type
daily_trips = trip.groupby(["date","user_type"]).size().rename("trips").reset_index()

We can now visualize the computed time-series. The chart below is interactive: you can pan or zoom. 

In [None]:
alt.Chart(daily_trips).mark_line().encode(
    x='date', y="trips", color="user_type"
).interactive(bind_y=False)

You must have noticed a very strong weekly pattern! Interestingly, this pattern is opposite for members and short-term pass holders.

## Exercise : weekly trend

Explain the weekly trend. Using an appropriate groupby aggregation and visualization, investigate if there is something like a "commute" versus "leisure ride" usage of the cycle share system.  

In [None]:
# %load solutions/exo_user_type.py

## Exercise : most popular routes

Find the most popular routes for members and short-term pass holders.

In [None]:
# %load solutions/exo_routes.py

## Exercise : trips per bike

Show that there is a clear relationship between the number of trips done on a bike and the number of days
the bike has been in service.

In [None]:
# %load solutions/exo_bikes.py

## Exercise : trip durations

In the Pronto cycle share system only the first 30 minutes are free; afterwards one must pay an additional fee. Are members and short-term pass holders equally aware of the 30-minute limit?

**Hint** : look at the `.value_counts()` method in pandas. How would you compute a normalized histogram on the trip durations in minutes ?
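
To illustrate the hint without solving the exercise, here is a minimal sketch of `.value_counts(normalize=True)` on a handful of hypothetical durations (made-up numbers, not taken from the `trip` dataset):

```python
import pandas as pd

# Hypothetical trip durations in seconds
durations_s = pd.Series([300, 620, 1750, 1790, 1810, 2500])

# Integer division bins each duration into whole minutes: 5, 10, 29, 29, 30, 41
minutes = durations_s // 60

# normalize=True returns the share of trips in each bin instead of raw counts;
# sort_index() orders the bins by minute rather than by frequency
share = minutes.value_counts(normalize=True).sort_index()

print(share)
```

Because the shares sum to 1, histograms computed this way can be compared between groups of very different sizes.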

In [None]:
# %load solutions/exo_durations.py

# Join

## Weather influence

We have computed the daily number of trips and we also have daily weather data. To enrich our daily trips with meteorological  data we simply need to join the `daily_trips` dataframe with the `weather` dataframe on the `date` column. The function `merge` in pandas is used to compute joins.
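
Here is a minimal sketch of a left join on hypothetical toy tables (all values made up). The optional `validate` and `indicator` parameters of `pd.merge` are used as a sanity check: the first raises an error if a date appears more than once in the right-hand table, the second adds a `_merge` column showing which rows found no match:

```python
import pandas as pd

# Toy tables (hypothetical values): no weather record for June 3rd
toy_daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-01", "2015-06-02", "2015-06-03"]),
    "trips": [120, 95, 143],
})
toy_weather = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-01", "2015-06-02"]),
    "mean_temperature": [61, 58],
})

# how="left" keeps every row of toy_daily, filling missing weather with NaN
joined = pd.merge(
    toy_daily, toy_weather, on="date", how="left",
    validate="many_to_one", indicator=True,
)

# Rows of toy_daily that have no matching weather record
unmatched = joined[joined["_merge"] == "left_only"]

print(joined)
```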

In [None]:
# join daily_trips with weather
daily_trips_joined = pd.merge(daily_trips, weather, on="date", how="left")
daily_trips_joined.head()

## Exercise : temperature trend

Does the temperature affect the number of trips? Do you observe the same trend for members and short-term pass holders, during workdays and weekends? Do the weather events (rain/sun/fog/snow) affect the number of trips?

In [None]:
# %load solutions/exo_temperature.py

## Exercise : elevation trend

Seattle is quite a hilly city. Does elevation affect the trips?
The elevation for each station was fetched from the Google Maps Elevation API, see the [cycle-share-dataset/Fetch_APIs.ipynb](cycle-share-dataset/Fetch_APIs.ipynb) notebook.

**Hint** try to find the elevations for both `from_station_id` and `to_station_id` in the `routes` dataset. You can use for instance the `.rename(columns={...})` method and the `pd.merge()` function to do this join.
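
The rename-then-merge pattern from the hint can be sketched on hypothetical toy tables as follows. The column names `trips`, `from_elevation`, `to_elevation` and `climb` are illustrative assumptions, as is the shape of the toy `routes` table; only `station_id` and `elevation` are taken from the elevations file used in this notebook:

```python
import pandas as pd

# Toy tables (hypothetical values)
toy_routes = pd.DataFrame({
    "from_station_id": ["A", "B"],
    "to_station_id": ["B", "C"],
    "trips": [10, 4],
})
toy_elev = pd.DataFrame({
    "station_id": ["A", "B", "C"],
    "elevation": [5.0, 40.0, 12.0],
})

# First join: elevation of the origin station
with_from = pd.merge(
    toy_routes,
    toy_elev.rename(
        columns={"station_id": "from_station_id", "elevation": "from_elevation"}
    ),
    on="from_station_id", how="left",
)
# Second join: elevation of the destination station
with_both = pd.merge(
    with_from,
    toy_elev.rename(
        columns={"station_id": "to_station_id", "elevation": "to_elevation"}
    ),
    on="to_station_id", how="left",
)
# Positive climb means the trip ends higher than it starts
with_both["climb"] = with_both["to_elevation"] - with_both["from_elevation"]

print(with_both)
```

Renaming before each merge avoids the `_x` / `_y` suffixes pandas would otherwise add to the two `elevation` columns.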

In [None]:
elevations = pd.read_csv(os.path.join(data_dir, "elevations.csv"))
elevations.head()

In [None]:
# %load solutions/exo_elevation.py

# Maps

## Stations

It's always fun and instructive to visualize geographical data on a map! Let's display the stations on a map of Seattle. If you zoom in enough to see a station and hover over it to display its name, you should be able to convince yourself that the locations (latitude/longitude) are consistent with the station names.

In [None]:
from ipyleaflet import Map, basemaps, Marker, MarkerCluster, CircleMarker

station_map = Map(center=(47.63, -122.3), zoom=12)
markers = tuple(
    Marker(location=(row["lat"],row["long"]), title=row["name"], draggable=False)
    for _, row in station.iterrows()
)
station_map.add_layer(MarkerCluster(markers = markers))
station_map

## Feature engineering on the station dataset
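
Among the features computed in the cells below is a normalized *outflow* per station. Reading the code, with $D$ the number of departures from a station and $A$ the number of arrivals, it is defined as:

$$
\text{outflow} = \frac{D - A}{D + A}, \qquad -1 \le \text{outflow} \le 1
$$

A positive value means that more trips start at the station than end there, a negative value means the opposite, and a value close to $0$ means that departures and arrivals are balanced.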

In [None]:
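# derive the weekday name from start_time, so that no precomputed day_name column is needed
trip["weekend"] = trip["start_time"].dt.day_name().isin(["Saturday", "Sunday"])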
trip["member"] = trip["user_type"]=="Member"

In [None]:
station_infos = station.set_index("station_id")
station_infos = station_infos.join(
    elevations.set_index("station_id")["elevation"]
)
station_infos = station_infos.join(
    trip.groupby("from_station_id")[["member","weekend"]].mean()
)
station_infos = station_infos.join(
    trip.groupby("from_station_id").size().rename("departures")
)
station_infos = station_infos.join(
    trip.groupby("to_station_id").size().rename("arrivals")
)
station_infos["outflow"] = (
    station_infos["departures"] - station_infos["arrivals"]
).div(
    station_infos["departures"] + station_infos["arrivals"]
)
station_infos.head()

In [None]:
station_infos.describe()

## Visualizing stations features on a map
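
The map cell below colors each station by a feature rescaled to the $[0, 1]$ interval, since a colormap expects values in that range. A minimal sketch of this min-max scaling on a hypothetical toy series (made-up values):

```python
import pandas as pd

# Hypothetical feature values for three stations
feature = pd.Series([10.0, 25.0, 40.0], index=["A", "B", "C"])

# Shift so that the minimum maps to 0, then divide by the range so the maximum maps to 1
scaled = (feature - feature.min()).div(feature.max() - feature.min())

print(scaled)
```

Note that a constant feature has a range of zero, in which case this formula yields `NaN` for every station.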

In [None]:
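from ipywidgets import HTML, interact
import seaborn as sns
import matplotlib
import matplotlib.colors

def viridis(x):
    # map a value in [0, 1] to a hex color of the viridis colormap
    # (matplotlib.colormaps needs matplotlib >= 3.5; cm.get_cmap was removed in 3.9)
    cmap = matplotlib.colormaps["viridis"]
    return matplotlib.colors.to_hex(cmap(x))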


station_map = Map(center=(47.63, -122.3), zoom=12)
circles = {
    station_id: CircleMarker(
        location=(row["lat"],row["long"]),
        fill_color=viridis(row["member"]),
        radius=7, fill_opacity=1, stroke=False
    )
    for station_id, row in station_infos.iterrows()
}
for station_id, circle in circles.items():
    station_map.add_layer(circle)
    row = station_infos.loc[station_id]
    circle.popup = HTML(
        f"{station_id}<br><small>{row['name']}"
        "</small>"
    )

def min_max_scaling(x):
    return (x-x.min()).div(x.max()-x.min())
    
    
def refresh_circles(feature):
    scaled_feature = min_max_scaling(station_infos[feature])
    for station_id, circle in circles.items():
        scaled_value = scaled_feature.loc[station_id]
        circle.fill_color = viridis(scaled_value)

interact(
    refresh_circles, 
    feature=["member", "weekend", "departures", "install_dockcount", "outflow", "elevation","lat","long"]
)
station_map