# Identifying Empty Citibike Docks

The goal of this Citibike project is to produce insightful analyses of 
the Citibike system. I'd also like to use the results of those analyses 
to build some novel predictive models. There are plenty of "gotchas" in 
these data and in the potential analyses, and I'm going to address the 
major ones in their own notebooks. 

In this particular notebook, we address the issue of discerning what a 
lack of rides originating from a given station really means. 

The data only contains information about individual rides, not the 
availability of bikes at a station. Zero rides originating from a station 
in a certain time period can mean one of two things:
* There is no demand for bikes from the station. 
* There is such high demand for bikes that the station is empty.  

The latter situation poses a particular issue for any kind of demand 
model, as the actual ridership data does not reflect the number of people
who would have started a ride had a bike been available to them.

Furthermore, knowing which stations frequently run out of bikes is useful 
for deciding where to expand stations or build new ones, and when and how 
to stage resupply runs from other areas. 
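
The ambiguity described above can be made concrete with a small sketch. It counts departures per station per hour on toy data and flags the hours with zero rides. The values and the `toy_rides` name are made up for illustration; the `started_at` and `start_station_id` column names match the ride data loaded later in this notebook.

```python
import pandas as pd

# Toy ride data: hypothetical values, shaped like the ride table used in this notebook
toy_rides = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2024-03-07 08:05:00", "2024-03-07 08:40:00",
        "2024-03-07 10:15:00", "2024-03-07 08:20:00",
    ]),
    "start_station_id": [0, 0, 0, 1],
})

# Count rides per station per hour (one column per station)
hourly = (
    toy_rides
    .groupby(["start_station_id", pd.Grouper(key="started_at", freq="h")])
    .size()
    .unstack("start_station_id", fill_value=0)
)

# Reindex to every hour in the window so hours with no rides at all appear as zeros
all_hours = pd.date_range(hourly.index.min(), hourly.index.max(), freq="h")
hourly = hourly.reindex(all_hours, fill_value=0)

# True where a station had zero departures in that hour
zero_ride_hours = hourly.eq(0)
```

The flag alone cannot say whether a station was empty or simply had no demand in that hour, which is exactly the problem this notebook sets out to address.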

## Setup
Module imports and notebook setup. 

In [14]:
import os
import pandas as pd

## Getting the Data
If you want to run all this yourself and haven't run through the Citibike
Data Pull notebook, you should do so now.

Our data is stored in /data (../data relative to this notebook) as monthly 
.parquet files containing "fact tables" of ridership data. /data also 
contains the relevant "dimension tables" that join to the ridership data. 
Splitting the data this way significantly reduces its size, so we can load 
longer time frames and do more useful research. 
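
As a rough illustration of why this split saves space, here is a minimal sketch of turning a repeated string column into an integer key plus a lookup table. This is an assumption about how such a split can be produced, not necessarily what the Citibike Data Pull notebook does; the `raw` frame, its values, and the `start_station_name` column are hypothetical.

```python
import pandas as pd

# Hypothetical raw rides with the station name repeated on every row
raw = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3", "D4"],
    "start_station_name": [
        "48 St & Skillman Ave", "Liberty St & Broadway",
        "48 St & Skillman Ave", "48 St & Skillman Ave",
    ],
})

# factorize assigns each distinct name an integer code, in order of first appearance
codes, names = pd.factorize(raw["start_station_name"])

# Dimension table: one row per distinct station
stations_dim = pd.DataFrame({
    "int_station_id": range(len(names)),
    "station_name": list(names),
})

# Fact table: the repeated string is replaced by its integer key
rides_fact = raw[["ride_id"]].assign(start_station_id=codes)
```

Each station name is now stored once in `stations_dim`, while `rides_fact` carries only a small integer per ride.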

Let's start by reading our dimension tables into the notebook. 

### Stations
These are the docks for the bikes. Each station has an integer key 
(int_station_id), a station_id, a cross-street name (station_name), and a 
lat/long (lat, lng).

In [15]:
stations = pd.read_parquet(os.path.join('..', 'data', 'stations.parquet'))
stations.head(3)

Unnamed: 0,int_station_id,station_id,station_name,lat,lng
0,0,6283.05,48 St & Skillman Ave,40.746155,-73.916191
1,1,5105.01,Liberty St & Broadway,40.708858,-74.010231
2,2,6809.07,W 56 St & 6 Ave,40.763405,-73.977226


### Rideable Types
These are the different kinds of bikes. Currently that's limited to 
electric (pedal assisted) and classic (regular) bikes. 

In [16]:
rideables = pd.read_parquet(os.path.join('..', 'data', 'rideable_types.parquet'))
rideables.head(3)

Unnamed: 0,rideable_id,rideable_type
0,0,electric_bike
1,1,classic_bike


### Membership Types
These are the kinds of rider "memberships" available. Basically this 
denotes whether a rider has a monthly subscription to the Citibike 
service or if they're riding on a "casual" pay-as-you-go basis. 

In [17]:
membership_types = pd.read_parquet(os.path.join('..', 'data', 'membership_types.parquet'))
membership_types.head(3)

Unnamed: 0,membership_id,membership_type
0,0,member
1,1,casual


### Ride Data
Now let's look at a sample of the ride data, which is stored in fact-table format. 

In [27]:
# All the ride files start with a yyyy or yyyymm before an underscore
# We can extract those filenames from the data folder
ride_filepaths = [f for f in os.listdir('../data') if f.split('_')[0].isnumeric()]

# Read every ride file, then concatenate once. Calling pd.concat inside a
# loop re-copies the accumulated dataframe on every iteration.
rides = pd.concat([pd.read_parquet(os.path.join('../data', path)) for path in ride_filepaths])

rides.head(3)

Unnamed: 0,ride_id,started_at,start_station_id,end_station_id,rideable_id,membership_id,trip_duration
0,0FC89A53DF9D7E90,2024-03-07 19:49:43,0,70,0,0,1850
1,0FF38F5D1277746B,2024-03-15 17:45:30,1,68,0,0,609
2,DE040AD144FB0BFA,2024-03-19 18:00:52,2,65,0,0,394
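
To read the integer keys back as labels, the fact table can be joined to the dimension tables. The sketch below uses small stand-in frames copied from the samples shown above (the `*_sample` names are hypothetical), and it assumes that `start_station_id` in the ride data refers to `int_station_id` in the stations table, which the sample values suggest but which is worth confirming against the Data Pull notebook.

```python
import pandas as pd

# Small stand-ins copied from the samples shown in this notebook
stations_sample = pd.DataFrame({
    "int_station_id": [0, 1, 2],
    "station_name": ["48 St & Skillman Ave", "Liberty St & Broadway", "W 56 St & 6 Ave"],
})
rideables_sample = pd.DataFrame({
    "rideable_id": [0, 1],
    "rideable_type": ["electric_bike", "classic_bike"],
})
rides_sample = pd.DataFrame({
    "ride_id": ["0FC89A53DF9D7E90", "0FF38F5D1277746B", "DE040AD144FB0BFA"],
    "start_station_id": [0, 1, 2],
    "rideable_id": [0, 0, 0],
})

# Left joins keep every ride even if a key has no match in a dimension table
labeled = (
    rides_sample
    # Assumption: start_station_id is a foreign key to int_station_id
    .merge(stations_sample, left_on="start_station_id", right_on="int_station_id", how="left")
    .merge(rideables_sample, on="rideable_id", how="left")
    .drop(columns="int_station_id")
)
```

The same pattern would apply to `end_station_id` and `membership_id`; joining only the columns needed for a given analysis keeps the working dataframe small.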


In [29]:
# Confirm which ride files were picked up
ride_filepaths

['202403_data.parquet', '202404_data.parquet', '202405_data.parquet']