# Explore cycle share data


## Installation requirements

You must install the following packages to follow this notebook:

```bash
conda install -c conda-forge altair vega_datasets vega
conda install -c conda-forge ipyleaflet
pip install dataset
```

# Import data

The data was downloaded from the [Kaggle Datasets](https://www.kaggle.com/pronto/cycle-share-dataset) repository,
and was originally provided by Pronto, the company that operates the cycle share system in Seattle, as part of an open data initiative.

The Kaggle repository provides a detailed data description, reproduced here for the reader's convenience.

## Context

The Pronto Cycle Share system consists of 500 bikes and 54 stations located in Seattle. Pronto provides open data on individual trips, stations, and daily weather.

## Content

There are 3 datasets that provide data on the stations, trips, and weather from 2014-2016.

1. Station dataset
    
    - station_id: station ID number
    - name: name of station
    - lat: station latitude
    - long: station longitude
    - install_date: date that station was placed in service
    - install_dockcount: number of docks at each station on the installation date
    - modification_date: date that station was modified, resulting in a change in location or dock count
    - current_dockcount: number of docks at each station on 8/31/2016
    - decommission_date: date that station was placed out of service


2. Trip dataset
    
    - trip_id: numeric ID of bike trip taken
    - starttime: day and time trip started, in PST
    - stoptime: day and time trip ended, in PST
    - bike_id: ID attached to each bike
    - tripduration: time of trip in seconds
    - from_station_name: name of station where trip originated
    - to_station_name: name of station where trip terminated
    - from_station_id: ID of station where trip originated
    - to_station_id: ID of station where trip terminated
    - usertype: "Short-Term Pass Holder" is a rider who purchased a 24-Hour or 3-Day Pass; "Member" is a rider who purchased a Monthly or an Annual Membership
    - gender: gender of rider
    - birthyear: birth year of rider


3. Weather dataset contains daily weather information in the service area

In [None]:
# import the csv data in pandas 
import os
import pandas as pd
import numpy as np

data_dir = "cycle-share-dataset"

station = pd.read_csv(
    os.path.join(data_dir, "station.csv"), 
    parse_dates=["install_date","modification_date","decommission_date"]
)

trip = pd.read_csv(
    os.path.join(data_dir, "trip.csv"), 
    parse_dates=["starttime","stoptime"],
    skiprows=range(1, 50794)  # skip file rows 1 to 50793 (the header is row 0): these are duplicates
)
assert trip["trip_id"].nunique()==trip.shape[0]

weather = pd.read_csv(
    os.path.join(data_dir, "weather.csv"), 
    parse_dates=["Date"]
)
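
The hard-coded `skiprows` range above only works for this particular file. As a hedged alternative, duplicates can be detected after loading with `DataFrame.duplicated` and removed with `drop_duplicates`. The sketch below runs on a hypothetical five-row miniature of `trip.csv` (the name `toy_trip` and all its values are invented for illustration), not on the real data.

```python
import io
import pandas as pd

# Hypothetical miniature of trip.csv: the first two data rows appear twice.
toy_csv = io.StringIO(
    "trip_id,starttime,stoptime\n"
    "1,2014-10-13 10:31:00,2014-10-13 10:48:00\n"
    "2,2014-10-13 10:32:00,2014-10-13 10:48:00\n"
    "1,2014-10-13 10:31:00,2014-10-13 10:48:00\n"
    "2,2014-10-13 10:32:00,2014-10-13 10:48:00\n"
    "3,2014-10-13 10:33:00,2014-10-13 10:48:00\n"
)
toy_trip = pd.read_csv(toy_csv, parse_dates=["starttime", "stoptime"])

# Count rows that are exact copies of an earlier row ...
n_duplicates = int(toy_trip.duplicated().sum())
# ... then keep only the first occurrence of each row.
toy_trip = toy_trip.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicated rows removed, {len(toy_trip)} rows left")
# Same sanity check as in the cell above: one record per trip_id.
assert toy_trip["trip_id"].is_unique
```

This approach reads the whole file before filtering, so it trades a little memory for not having to know in advance where the duplicates are.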

# Clean data

Before using any data you should check it thoroughly! Let's check, for instance, the `station` dataset. The data description already provides a lot of information:

- the meaning (and therefore expected data-type) of each column
- there are 54 stations, uniquely identified by `station_id`

First let's see how many rows we have and list the columns along with their data-type.

In [None]:
print(f"station: {station.shape[0]} rows {station.shape[1]} columns\n")
print(station.dtypes)

We have 58 records and not 54. Are there any duplicated rows? Let's see how many distinct `station_id` values we have:

In [None]:
n_station_id = station["station_id"].nunique()
print(f"There are {n_station_id} distinct station_id")

OK, so we have no duplicates and exactly one record per `station_id`. There are in fact 58 stations in our dataset, not 54.

Now let's take a look at the first rows:

In [None]:
station.head(5)

Note that all fields are consistent with the description: `lat` and `long` look like latitude and longitude, `name` looks like a station name, the counts look like counts and the dates like dates.

However, `modification_date` and `decommission_date` are missing in all of the first 5 rows: the values are `NaT`, meaning Not-a-Time. For the date columns, let's see how many missing values we have, along with the min and max date:

In [None]:
for date_column in ["install_date", "modification_date", "decommission_date"]:
    n_missing = station[date_column].isnull().sum()
    date_min = station[date_column].min()
    date_max = station[date_column].max()
    print(f"{date_column}: {n_missing} are missing min={date_min} max={date_max}")

That makes sense: all have an installation date, a few have been modified and only 58 - 54 = 4 were decommissioned.
Probably the 54 in the data description was referring to stations still in service.

We can list here the 4 out of service stations:

In [None]:
station[station.decommission_date.notnull()]

What about the remaining numeric columns `lat`, `long`, `install_dockcount` and `current_dockcount`: does the range of values make sense? Are there any missing values? Let's use the `describe()` method to get a quick statistical summary for each column, and `.T` to transpose the summary stats dataframe.

In [None]:
station.describe().T

There are no missing values (**count** gives the number of non-null values), and every range of values makes sense. For instance, Seattle is located at about $47^{\circ}N$, $122^{\circ}W$.
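
Such a range check can also be made explicit rather than read off the summary table. The sketch below flags coordinates that fall outside a bounding box. The box limits, the name `toy_station` and its rows are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy station table; the second row has lat and long swapped on purpose.
toy_station = pd.DataFrame({
    "station_id": ["A-01", "B-02", "C-03"],
    "lat": [47.61, -122.33, 47.65],
    "long": [-122.34, 47.60, -122.31],
})

# Assumed bounding box around Seattle (illustrative limits only).
LAT_RANGE = (47.4, 47.8)
LONG_RANGE = (-122.5, -122.2)

# Series.between is inclusive on both ends by default.
in_box = (
    toy_station["lat"].between(*LAT_RANGE)
    & toy_station["long"].between(*LONG_RANGE)
)
suspicious = toy_station.loc[~in_box, "station_id"].tolist()
print("stations outside the bounding box:", suspicious)
```

A swapped latitude and longitude is a common data-entry error, and an explicit box catches it even when `describe()` is only skimmed.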

Now let's check the `trip` dataset. If the start and stop times are consistent, we expect that

- `stoptime` $>$ `starttime`,
- `tripduration` $\simeq$ `stoptime` $-$ `starttime` (in seconds).

But there are actually a few errors!

In [None]:
print(f"trip: {trip.shape[0]} records")
# count number of trips where stoptime < starttime
n_time_travel = (trip["stoptime"] < trip["starttime"]).sum()
print(f"The {n_time_travel} trips for which stoptime < starttime:")
# show the few outliers
trip.query("stoptime < starttime")

In [None]:
# let's recompute the trip duration in seconds from stop and start time
# (total_seconds() gives the full span; .dt.seconds would drop whole days):
trip["computed_duration"] = (trip["stoptime"] - trip["starttime"]).dt.total_seconds()
# and see if there is more than a 1min = 60s difference with tripduration:
trip["over_1min"] = (trip["computed_duration"] - trip["tripduration"]).abs() > 60
# show the few outliers
print("Trips with over 1min difference between stoptime-starttime and tripduration")
# we focus on the stoptime > starttime trips, the stoptime < starttime trips are shown just above
trip.query("(stoptime > starttime) & over_1min")

There are actually very few errors (9 out of 236065). Let's filter out the bad rows, and drop the utility columns
`computed_duration` and `over_1min` we have created for the sanity check.

In [None]:
# tilde is the logical NOT operator
trip = trip.query("(stoptime > starttime) & ~over_1min")
# dropping columns
trip = trip.drop(
    columns=["computed_duration", "over_1min"]
)
print(f"Filtered trip dataset: we now have {trip.shape[0]} records")
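
One subtlety when recomputing a duration from two timestamps: on a timedelta column, `.dt.seconds` returns only the seconds component (between 0 and 86399), whereas `.dt.total_seconds()` returns the whole span, including days and sign. The toy timestamps below are invented to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-01-01 08:00:00",
        "2015-01-01 08:00:00",
        "2015-01-01 08:00:00",
    ]),
    "stoptime": pd.to_datetime([
        "2015-01-01 08:10:00",  # 10 minutes later
        "2015-01-02 08:10:00",  # 1 day and 10 minutes later
        "2015-01-01 07:50:00",  # 10 minutes BEFORE the start
    ]),
})
delta = toy["stoptime"] - toy["starttime"]

# Seconds component only: whole days are dropped, and a negative span
# is stored as "-1 day + 85800 seconds".
toy["seconds_component"] = delta.dt.seconds
# Full signed duration in seconds.
toy["total_seconds"] = delta.dt.total_seconds()

print(toy[["seconds_component", "total_seconds"]])
```

With `.dt.seconds`, the second and third rows would not reflect the true elapsed time, so `.dt.total_seconds()` is the safer choice whenever a trip may span midnight-to-midnight or the timestamps may be reversed.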

Let's do one final check. You may have noticed that the `trip` dataset contains the IDs of the stations where each trip originated and terminated. But do we find every station ID of the `trip` dataset in the `station` dataset?

In [None]:
station_ids = station["station_id"].unique().tolist()
from_station_ids = trip["from_station_id"].unique().tolist()
to_station_ids = trip["to_station_id"].unique().tolist()
trip_station_ids = set(from_station_ids + to_station_ids)
not_in_station = [
    station_id for station_id in trip_station_ids
    if station_id not in station_ids
]
print(
    f"{len(not_in_station)} / {len(trip_station_ids)} ids not recovered in station", 
    not_in_station
)

These station IDs do seem special; maybe they correspond to a repair or maintenance shop?
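
The same membership check can also be written with set operations, and `isin` then tells how many trips are affected. This is a minimal sketch on invented toy tables (the names `toy_station` and `toy_trip` and the ID `X-99` are hypothetical):

```python
import pandas as pd

toy_station = pd.DataFrame({"station_id": ["A-01", "B-02", "C-03"]})
toy_trip = pd.DataFrame({
    "from_station_id": ["A-01", "B-02", "X-99"],
    "to_station_id": ["C-03", "A-01", "B-02"],
})

# IDs referenced by trips, minus IDs known to the station table.
known_ids = set(toy_station["station_id"])
trip_ids = set(toy_trip["from_station_id"]) | set(toy_trip["to_station_id"])
unknown_ids = sorted(trip_ids - known_ids)

# Trips that start or end at one of the unknown stations.
id_columns = ["from_station_id", "to_station_id"]
affected = toy_trip[id_columns].isin(unknown_ids).any(axis=1)
n_affected = int(affected.sum())

print(f"unknown ids: {unknown_ids}, trips affected: {n_affected}")
```

Set difference avoids scanning a list for each ID, which matters little here but helps on larger tables.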

## Summary

This dataset is very clean: all fields have a clear meaning that matches their data type, there are no weird values, and each row of `station` corresponds to exactly one station. Note that such high-quality data is more the exception than the rule...

Usually data is very messy, and you will spend a considerable amount of time cleaning it.
Unfortunately, people are relentlessly creative at messing things up: often each dataset is messy in its own unique way.
Typical problems include incoherent fields (start time $>$ end time), $-1$ or $999$ used as missing values, aberrant values, duplicated rows, unintelligible column names, etc.
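
As an illustration of the placeholder problem, the sketch below converts assumed sentinel codes into proper missing values so that pandas counts them as missing. The toy frame and the choice of $-1$ and $999$ as sentinels are hypothetical; the Pronto data does not use them.

```python
import numpy as np
import pandas as pd

# Invented data in which -1 and 999 stand for "unknown".
toy = pd.DataFrame({
    "birthyear": [1985, -1, 1990, 999],
    "tripduration": [600.0, 720.0, -1.0, 300.0],
})

SENTINELS = [-1, 999]  # assumed placeholder codes

# Replace every sentinel by NaN; integer columns become float as a result.
cleaned = toy.replace(SENTINELS, np.nan)
n_missing = cleaned.isnull().sum()

print(n_missing)
```

When the sentinel codes are known before loading, `pd.read_csv` also accepts them through its `na_values` parameter.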

## Exercise

Your turn! Check that:

- each trip corresponds to a single record in the `trip` dataset (no duplicates)
- the values for `gender`, `usertype` and `birthyear` make sense
- the number of bikes agrees with the data description
- the values in `weather` data make sense

From Wikipedia:

- The dew point is the temperature to which air must be cooled to become saturated with water vapor.
- A gust or wind gust is a brief increase in the speed of the wind, usually less than 20 seconds.

In [None]:
# %load exo1.py

# Basic visualization

In [None]:
import altair as alt
alt.renderers.enable("notebook")

# Reshape data

# Advanced visualization