# Bike Rebalances [2019]
Citi Bike does not publish data on bike rebalances. However, a bike that starts a trip from a station other than the one where its previous trip ended was most likely either rebalanced or taken out of service in between. For this preliminary exercise we assume the former, and will consider ways to make the inference more robust in the future.
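
As a minimal sketch of this heuristic, the snippet below walks a handful of made-up trips in chronological order per bike and flags every gap where the next trip starts somewhere other than where the previous one ended. The trip tuples, the station labels, and the `find_rebalances` helper are illustrative assumptions, not part of the Citi Bike data or of this notebook's pipeline.

```python
from collections import defaultdict

# Hypothetical trips: (bikeid, starttime, start_station, end_station)
trips = [
    (1, "2019-01-01 08:00", "A", "B"),
    (1, "2019-01-01 09:00", "B", "C"),  # starts where the last trip ended
    (1, "2019-01-01 12:00", "D", "A"),  # starts at D, last trip ended at C
    (2, "2019-01-01 08:30", "C", "C"),
    (2, "2019-01-01 10:00", "A", "B"),  # starts at A, last trip ended at C
]


def find_rebalances(trips):
    """Return (bikeid, from_station, to_station) for each inferred move."""
    by_bike = defaultdict(list)
    for bikeid, starttime, start, end in trips:
        by_bike[bikeid].append((starttime, start, end))

    moves = []
    for bikeid, bike_trips in sorted(by_bike.items()):
        bike_trips.sort()  # ISO-style timestamps sort chronologically as strings
        # compare each trip's end station with the next trip's start station
        for (_, _, prev_end), (_, next_start, _) in zip(bike_trips, bike_trips[1:]):
            if next_start != prev_end:
                moves.append((bikeid, prev_end, next_start))
    return moves


rebalances = find_rebalances(trips)
print(rebalances)
```

Note that the heuristic cannot tell a rebalance from a bike that was pulled for service and returned elsewhere, which is why the assumption above matters.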

To Do:
- parquet files instead of CSV?
- general dock station EDA (pull geo data from reverse_geo eda)
- notebook for generating dataframe with station info (separate from above)

- NY_2019 (or any dataset from clobber)
  - stations
  - geo
      - final stations dataframe
      - rebalance dataframe
  - weather (not from NY_2019)
  - rides


In [None]:
import pandas as pd
from pandas import to_datetime
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
import gc
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

Skip the Generate Rebalance Data section to import the rebalance dataframe directly.

# Generate Rebalance Data

## Identify Rebalanced Bikes

TODO:
- drop 180 rows with missing location data

In [None]:
col_select = [
    "starttime",
    "stoptime",
    "startstationid",
    "startstationname",
    "startstationlatitude",
    "startstationlongitude",
    "endstationid",
    "endstationname",
    "endstationlatitude",
    "endstationlongitude",
    "bikeid",
]
col_types = {
    "startstationid": "category",
    "startstationname": "category",
    "endstationid": "category",
    "endstationname": "category",
    "bikeid": "category",
}


rides_raw = pd.read_csv(
    "data/NY_2019.csv",
    index_col=False,
    parse_dates=["starttime", "stoptime"],
    usecols=col_select,
    dtype=col_types,
)


pd.DataFrame.from_records(
    [
        (
            col,
            rides_raw[col].nunique(),
            rides_raw[col].dtype,
            rides_raw[col].memory_usage(deep=True),
        )
        for col in rides_raw.columns
    ],
    columns=["Column Name", "Unique", "Data Type", "Memory Usage"],
)

In [None]:
# For comparison: the same summary when the columns are left at their default dtypes
# (memory usage in bytes)
#
#     Column Name       Unique    Data Type       Memory Usage
# 0   starttime         20539444  datetime64[ns]     164413704
# 1   stoptime          20539225  datetime64[ns]     164413704
# 2   startstationid         936  float64            164413704
# 3   startstationname       938  object            1574199724
# 4   endstationid           973  float64            164413704
# 5   endstationname         976  object            1573922082
# 6   bikeid               19571  int64              164413704
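
To illustrate why low-cardinality columns such as station names shrink so much as `category`, here is a small sketch on synthetic data. The series contents and the variable names are made up for the illustration; the exact byte counts depend on the pandas version and are not meant to match the figures above.

```python
import pandas as pd

# Synthetic stand-in for a station-name column: few unique values, many rows
# (the names are placeholders, not real stations)
names = pd.Series(["Station A", "Station B", "Station C"] * 10_000)

# deep=True counts the string payloads, not just the pointers to them
as_default = names.memory_usage(deep=True)
as_category = names.astype("category").memory_usage(deep=True)

print(f"default dtype: {as_default:,} bytes")
print(f"category:      {as_category:,} bytes")
```

A categorical column stores each distinct string once plus a small integer code per row, so the saving grows with the ratio of rows to unique values.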

In [None]:
rides_raw.loc[rides_raw.startstationid.isna()]

In [None]:
rides_raw.head()

In [None]:
# order trips sequentially by bike
rides = rides_raw.sort_values(by=["bikeid", "starttime"])

# create a dummy one-row dataframe used to offset the rows by one when merging
offset = pd.DataFrame(
    {
        "starttime": pd.to_datetime("2010-09-01"),
        "startstationid": 0,
        "stoptime": pd.to_datetime("2010-09-01"),
        "endstationid": 0,
        "bikeid": 0,
    },
    index=[0],
)

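# rides1: dummy row prepended, so row i holds the previous trip's stop time and end station
# rides2: dummy row appended, so row i holds the current trip's start time and start station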
rides1 = (
    pd.concat([offset, rides])
    .reset_index(drop=True)
    .rename(columns={"bikeid": "bikeid1"})
)
rides2 = (
    pd.concat([rides, offset])
    .reset_index(drop=True)
    .rename(columns={"bikeid": "bikeid2"})
)

# concat horizontally - a bike's next ride starts from its previous end station unless it was rebalanced
rides = pd.concat(
    [
        rides1[["bikeid1", "stoptime", "endstationid", "endstationname"]],
        rides2[["bikeid2", "starttime", "startstationid", "startstationname"]],
    ],
    axis=1,
)

# remove temp dataframes from memory
del offset, rides1, rides2
gc.collect()

rides.head(10)

In [None]:
# filter for rebalances - bikeid = same, different stop and start stations
rebal = rides[
    [
        "bikeid1",
        "stoptime",
        "endstationid",
        "endstationname",
        "starttime",
        "startstationid",
        "startstationname",
    ]
].loc[(rides.bikeid1 == rides.bikeid2) & (rides.startstationid != rides.endstationid)]

rebal.reset_index(drop=True, inplace=True)

rebal.head()
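
An alternative way to pair each trip with the same bike's previous trip is a grouped `shift`, which avoids the dummy row and the manual bike-ID comparison. The sketch below is one possible formulation on a made-up five-row table, not the method used above; `rides_demo`, `moves`, and `rebal_demo` are hypothetical names. It also drops pairs where either station ID is missing, since a missing ID would otherwise compare as "different"; whether that is the right treatment for the real data is an assumption.

```python
import pandas as pd

# Hypothetical miniature of the rides table, using the notebook's column names
rides_demo = pd.DataFrame(
    {
        "bikeid": ["1", "1", "1", "2", "2"],
        "starttime": pd.to_datetime(
            ["2019-01-01 08:00", "2019-01-01 09:00", "2019-01-01 12:00",
             "2019-01-01 08:30", "2019-01-01 10:00"]
        ),
        "stoptime": pd.to_datetime(
            ["2019-01-01 08:20", "2019-01-01 09:15", "2019-01-01 12:30",
             "2019-01-01 08:45", "2019-01-01 10:10"]
        ),
        "startstationid": ["A", "B", "D", "C", None],  # last start is missing
        "endstationid": ["B", "C", "A", "C", "B"],
    }
)

# order trips sequentially by bike, then pull the previous trip's end per bike
rides_demo = rides_demo.sort_values(["bikeid", "starttime"])
prev = rides_demo.groupby("bikeid")[["stoptime", "endstationid"]].shift(1)

moves = pd.DataFrame(
    {
        "bikeid": rides_demo["bikeid"],
        "prev_stoptime": prev["stoptime"],
        "prev_endstationid": prev["endstationid"],
        "starttime": rides_demo["starttime"],
        "startstationid": rides_demo["startstationid"],
    }
)

# keep pairs with known stations on both sides whose IDs differ;
# each bike's first trip has no previous end station and drops out here
known = moves["prev_endstationid"].notna() & moves["startstationid"].notna()
differs = moves["prev_endstationid"] != moves["startstationid"]
rebal_demo = moves[known & differs].reset_index(drop=True)

print(rebal_demo)
```

Because the shift happens within each `bikeid` group, a row can never be paired with a different bike's trip, so no `bikeid1 == bikeid2` filter is needed.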

## Import Geo Data

In [None]:
locations = pd.read_parquet("data/NY_2019_locations.parquet")

In [None]:
locations.head()

## Geo Analysis
CAUTION - the data has many null values (the neighborhood column in particular)

In [None]:
# distribution of station counts per neighborhood
sns.histplot(locations.neighborhood.value_counts())

In [None]:
plt.figure(figsize=(15, 8))
x = sns.countplot(
    x=locations.neighborhood, order=locations.neighborhood.value_counts().index[:20]
)
x.set_xticklabels(x.get_xticklabels(), rotation=45, horizontalalignment="right")
x.set(title="Count of Stations per neighborhood [top 20]")

In [None]:
plt.figure(figsize=(15, 8))
x = sns.countplot(x=locations.boro, order=locations.boro.value_counts().index)
# x.set_xticklabels(x.get_xticklabels(),rotation=45,horizontalalignment='right')
x.set(title="Count of Stations per boro")

# Rebalancing Analysis

In [None]:
print(rides.shape)
print(rebal.shape)
# rides carries one extra row from the dummy offset, so count rides from rides_raw
print("The ratio of rebalances to rides in 2019 is: ", rebal.shape[0] / rides_raw.shape[0])

## Rebalance by Station (to and from)

In [None]:
# plot the top 20 stations by rebalance count
# rename_axis + reset_index(name=...) yields the same column names on any pandas
# version (value_counts().reset_index() labels its columns differently in pandas 2.x)
rebalout = (
    rebal["endstationname"]
    .value_counts()
    .rename_axis("Station")
    .reset_index(name="Count_Out")[:20]
)
rebalin = (
    rebal["startstationname"]
    .value_counts()
    .rename_axis("Station")
    .reset_index(name="Count_In")[:20]
)

plt.figure(figsize=(10, 8))
plt.title("Citi Bike Rebalancing [2019] From Stations")
sns.barplot(y=rebalout.Station, x=rebalout.Count_Out, orient="h")

plt.figure(figsize=(10, 8))
plt.title("Citi Bike Rebalancing [2019] To Stations")
sns.barplot(y=rebalin.Station, x=rebalin.Count_In, orient="h")

## Station Capacity
Station information feed: https://gbfs.citibikenyc.com/gbfs/en/station_information.json

Open question: what is the region code?
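
As a starting point, the sketch below pulls per-station capacity out of a payload shaped like a GBFS `station_information` feed. The sample payload is hand-written with made-up values, and the field names (`data.stations[]`, `station_id`, `capacity`, `region_id`) follow the GBFS specification as an assumption; the live feed should be inspected before relying on them. `station_capacity` is a hypothetical helper name. In the GBFS specification the optional `region_id` field refers to an entry in the feed's `system_regions.json`; whether that is the "region code" in question here is also an assumption.

```python
import json

# Hand-written sample in the GBFS station_information layout (values are made up)
sample_payload = json.loads(
    """
    {
      "last_updated": 0,
      "ttl": 5,
      "data": {
        "stations": [
          {"station_id": "1", "name": "Station A", "capacity": 30, "region_id": "71"},
          {"station_id": "2", "name": "Station B", "capacity": 19},
          {"station_id": "3", "name": "Station C"}
        ]
      }
    }
    """
)


def station_capacity(payload):
    """Map station_id -> capacity, skipping stations without a capacity field."""
    return {
        station["station_id"]: station["capacity"]
        for station in payload["data"]["stations"]
        if "capacity" in station
    }


capacity = station_capacity(sample_payload)
print(capacity)
```

For the live feed, the payload could be loaded with `urllib.request.urlopen` and `json.load`, assuming the URL above is still served in this layout.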

## Rebalance Timing