## Data Processing

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

from deepar_model_utils import prep_station_data
from deepar_model_utils import get_station_data
from deepar_model_utils import deepar_station_data
from deepar_model_utils import write_dicts_to_file

%matplotlib inline

In [None]:
#bucket = ""

In [None]:
#file = "cleaned_historical_trips_2015_2022.csv"

#s3_data_location = f"s3://{bucket}/{file}*"
#trips = pd.read_csv(s3_data_location, parse_dates = True)

Keep 5 years of trips data (September 2017 through August 2022).

In [None]:
#trips = trips[(trips["starttime"] > "2017-09-01") & (trips["stoptime"] < "2022-08-31")]

In [None]:
#trips_start = trips[["starttime", "start station id", "start station name"]]
#trips_stop = trips[["stoptime", "end station id", "end station name"]]

In [None]:
#trips_start.to_csv("model_trips_start_station_2017_2022.csv")
#trips_stop.to_csv("model_trips_stop_station_2017_2022.csv")

Use 2 years of trips data for training plus 3 days for testing. Keep trips whose start time falls between 8/29/2020 and 8/31/2022 (inclusive) and, separately, trips whose stop time falls in the same range.
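
The inclusive date range above can be expressed as a half-open interval on the timestamps, which avoids dropping a trip stamped exactly at midnight on the first day. This is a minimal sketch on made-up rows. The helper name `filter_window` and the demo values are assumptions for illustration, not part of `deepar_model_utils`.

```python
import pandas as pd

def filter_window(df, column, first_day, last_day):
    """Keep rows whose timestamp falls on first_day through last_day, inclusive."""
    start = pd.Timestamp(first_day)
    # Exclusive upper bound: midnight of the day after last_day
    end = pd.Timestamp(last_day) + pd.Timedelta(days=1)
    return df[(df[column] >= start) & (df[column] < end)]

# Synthetic rows, not real trip data
demo = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2020-08-28 23:59:59",  # one second before the window
        "2020-08-29 00:00:00",  # first instant of the window
        "2022-08-31 23:59:59",  # last second of the window
        "2022-09-01 00:00:00",  # first instant after the window
    ]),
    "start station id": [1, 2, 3, 4],
})

window = filter_window(demo, "starttime", "2020-08-29", "2022-08-31")
print(window["start station id"].tolist())
```

Only the two middle rows survive, so both boundary days are fully included.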

In [None]:
#trips_start_poc = trips[(trips["starttime"] > "2020-08-29") & (trips["starttime"] < "2022-09-01")][["starttime", "start station id"]]
#trips_stop_poc = trips[(trips["stoptime"] > "2020-08-29") & (trips["stoptime"] < "2022-09-01")][["stoptime", "end station id"]]

In [None]:
#trips_start_poc.to_csv("../model_trips_start_station_20208029_20220831.csv", index = False)
#trips_stop_poc.to_csv("../model_trips_stop_station_20208029_20220831.csv", index = False)

### Trip Start Station

That is, how many bikes left a station.

In [None]:
start_file = "../model_trips_start_station_20208029_20220831.csv"

#s3_start_location = f"s3://{bucket}/{start_file}*"
#trips_start = pd.read_csv(s3_start_location, parse_dates = True)

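# parse_dates = True only tries to parse the index, so name the timestamp column explicitly
trips_start = pd.read_csv(start_file, parse_dates = ["starttime"])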
trips_start.shape

Check that each start station id matches up with a single start station name. In this case, some stations have changed names, either because the station moved or because the data was not standardized. `trips_start_lookup` is a lookup table that maps between the different station ids and station names.
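
A check like this could be sketched by counting the distinct names recorded for each id. The rows below are invented for illustration (the ids and names are not taken from the trips data), and `ambiguous_ids` is a hypothetical variable name.

```python
import pandas as pd

# Synthetic rows: station 2 appears under two spellings of its name
demo = pd.DataFrame({
    "start station id": [1, 1, 2, 2, 3],
    "start station name": [
        "Main St & 1 Ave",
        "Main St & 1 Ave",
        "Oak St & 2 Ave",
        "Oak St and 2 Ave",
        "Pine St & 3 Ave",
    ],
})

# One row per distinct (id, name) pair, as in a lookup table
lookup = demo.drop_duplicates(subset=["start station id", "start station name"])

# Number of distinct names recorded for each id
names_per_id = lookup.groupby("start station id")["start station name"].nunique()

# Ids that map to more than one name and need a manual look
ambiguous_ids = names_per_id[names_per_id > 1].index.tolist()
print(ambiguous_ids)
```

The same pattern with the two columns swapped would list names that map to more than one id.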

In [None]:
#trips_start["start station id"].nunique()

In [None]:
#trips_start["start station name"].nunique()

In [None]:
#trips_start.drop_duplicates(subset = ["start station id", "start station name"]).to_csv("unique_start.csv")

In [None]:
#trips_start_lookup = trips_start.drop(["Unnamed: 0", "starttime"], axis = 1).drop_duplicates()

In [None]:
#trips_start_lookup.to_csv("trip_start_station_id_lookup.csv", index = False)

I will use start station id rather than start station name. From a manual look at the data, station names vary more, and very similar station names share the same station id.

Grouping the trips and taking the group sizes is not terribly useful on its own, but it will help with the resampling later.
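
To make the group-then-resample idea concrete, here is a minimal sketch of what the two steps might look like. The internals of `prep_station_data` and `get_station_data` are not shown in this notebook, so `count_trips` and `resample_station` are hypothetical stand-ins, and the demo rows are synthetic.

```python
import pandas as pd

def count_trips(df, station_col, time_col):
    """Count trips per (station, timestamp) pair; the count lands in a 'size' column."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    return out.groupby([station_col, time_col]).size().reset_index(name="size")

def resample_station(grouped, station_col, time_col, station, freq, max_date):
    """Sum one station's counts into fixed bins and pad with zeros up to max_date."""
    series = grouped.loc[grouped[station_col] == station].set_index(time_col)["size"]
    counts = series.resample(freq).sum()
    full_index = pd.date_range(counts.index.min(), pd.Timestamp(max_date), freq=freq)
    return counts.reindex(full_index, fill_value=0)

# Synthetic trips: two share a timestamp, one falls in the next 15-minute bin
demo = pd.DataFrame({
    "start station id": [1, 1, 1],
    "starttime": ["2022-08-31 23:01:00", "2022-08-31 23:01:00", "2022-08-31 23:20:00"],
})

grouped = count_trips(demo, "start station id", "starttime")
station_series = resample_station(
    grouped, "start station id", "starttime", 1, "15min", "2022-08-31 23:45:00"
)
print(station_series.tolist())
```

Padding with zeros up to a shared `max_date` is what makes every series end at the same timestamp, even when a station's last trip happened earlier.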

In [None]:
trips_start_all_group = prep_station_data(trips_start, "start station id", "starttime")
print(sum(trips_start_all_group["size"]))

Transform data into the format required by DeepAR. Not all series start at the same time or end at the same time. DeepAR allows series to start at different times, but I assume that all series have to end at the same time (or else how is prediction supposed to happen?).
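
For reference, the sketch below shows one record in the JSON Lines layout, assuming the SageMaker DeepAR convention of a `start` timestamp plus a `target` array per series. The helpers `series_to_record` and `to_json_lines` are hypothetical and are not the implementation of `deepar_station_data` or `write_dicts_to_file`.

```python
import json
import pandas as pd

def series_to_record(series):
    """Turn a regularly spaced count series into a {'start', 'target'} dict."""
    return {
        "start": series.index[0].strftime("%Y-%m-%d %H:%M:%S"),
        "target": [int(v) for v in series.to_numpy()],
    }

def to_json_lines(records):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

# Synthetic 15-minute series for one station
idx = pd.date_range("2022-08-31 23:00:00", periods=4, freq="15min")
demo = pd.Series([2, 1, 0, 0], index=idx)

record = series_to_record(demo)
print(to_json_lines([record]))
```

Because each record carries its own `start`, series of different lengths can sit side by side in the same file.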

**Large Model**

The training period is the first 4 years of the data and the testing period is the final year of the data. Also, to train the initial model, I filtered out any stations that did not exist prior to the `test_date`. This ensures that there is corresponding training and testing data for every station.
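
A fixed test window at a fixed frequency implies a fixed number of points per test series, which gives a quick sanity check. The sketch below assumes the 15-minute frequency and the `2022-08-31 23:45:00` end timestamp used in this notebook. `expected_points` is a hypothetical helper.

```python
import pandas as pd

def expected_points(first_timestamp, last_timestamp, freq):
    """Number of regularly spaced timestamps from first to last, both included."""
    return len(pd.date_range(first_timestamp, last_timestamp, freq=freq))

# One-year test window: 365 days * 96 fifteen-minute intervals per day
large_test_points = expected_points("2021-09-01 00:00:00", "2022-08-31 23:45:00", "15min")

# Three-day test window: 3 days * 96 fifteen-minute intervals per day
poc_test_points = expected_points("2022-08-29 00:00:00", "2022-08-31 23:45:00", "15min")

print(large_test_points, poc_test_points)
```

If the average test series length differs from these values, at least one series starts late or ends early.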

In [None]:
freq = "15min" # group and sum trips by a set increment
max_date = "2022-08-31 23:45:00" # make sure all series end at the same time
train_date = "2021-08-31"
test_date = "2021-09-01"

In [None]:
train_data_start, test_data_start = deepar_station_data(trips_start_all_group, "start station id", "starttime", freq, max_date, train_date, test_date)

In [None]:
print(len(train_data_start))
print(len(test_data_start))

In [None]:
# make sure all test data is the same length
test_length = 0
for i in range(len(test_data_start)):
    test_length += len(test_data_start[i]["target"])
test_length / len(test_data_start) # should be 35,040 (365 days * 96 intervals per day)

In [None]:
# check number of trips
trips = 0
for i in range(len(train_data_start)):
    trips += sum(train_data_start[i]["target"])
for i in range(len(test_data_start)):
    trips += sum(test_data_start[i]["target"])
trips # lost 85,515 trips

In [None]:
# save to json lines format
write_dicts_to_file("train_start.json", train_data_start)
write_dicts_to_file("test_start.json", test_data_start)

**POC Model**

The training period is the 4th-5th year of the data minus the final 3 days, and the testing period is the final 3 days of the data. Also, to train the initial model, I filtered out any stations that did not exist prior to the `test_date`. This ensures that there is corresponding training and testing data for every station.

In [None]:
freq = "15min" # group and sum trips by a set increment
max_date = "2022-08-31 23:45:00" # make sure all series end at the same time
train_date = "2022-08-28"
test_date = "2022-08-29"

In [None]:
train_poc_start, test_poc_start = deepar_station_data(trips_start_all_group, "start station id", "starttime", freq, max_date, train_date, test_date)

In [None]:
# retained all stations
print(len(train_poc_start))
print(len(test_poc_start))

In [None]:
# make sure all test data is the same length
test_length = 0
for i in range(len(test_poc_start)):
    test_length += len(test_poc_start[i]["target"])
test_length / len(test_poc_start) # should be 288 (3 days * 96 intervals per day)

In [None]:
# check number of trips
trips = 0
for i in range(len(train_poc_start)):
    trips += sum(train_poc_start[i]["target"])
for i in range(len(test_poc_start)):
    trips += sum(test_poc_start[i]["target"])
trips # retained all trips

In [None]:
# save to json lines format
write_dicts_to_file("train_poc_start.json", train_poc_start)
write_dicts_to_file("test_poc_start.json", test_poc_start)

**Plot 15-minute time series by station**

In [None]:
fig, axs = plt.subplots(4, 1, figsize = (20, 20), sharex = True)
axx = axs.ravel()
for i in range(0, 4):
    temp_station = [177, 436, 572, 67][i]
    get_station_data(trips_start_all_group, "start station id", "starttime", temp_station, freq, max_date).plot(ax = axx[i])
    axx[i].set_xlabel("date")
    axx[i].set_ylabel("trip count")
    axx[i].set_title(str(temp_station))
    axx[i].grid(which = "minor", axis = "x")

### Trip End Station

That is, how many bikes arrived at a station.

In [None]:
stop_file = "../model_trips_stop_station_20208029_20220831.csv"

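#s3_stop_location = f"s3://{bucket}/{stop_file}*"
#trips_stop = pd.read_csv(s3_stop_location, parse_dates = True)

# parse_dates = True only tries to parse the index, so name the timestamp column explicitly
trips_stop = pd.read_csv(stop_file, parse_dates = ["stoptime"])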
trips_stop.shape

Grouping the trips and taking the group sizes is not terribly useful on its own, but it will help with the resampling later.

In [None]:
trips_stop_all_group = prep_station_data(trips_stop, "end station id", "stoptime")
print(sum(trips_stop_all_group["size"]))

Transform data into the format required by DeepAR. Not all series start at the same time or end at the same time. DeepAR allows series to start at different times, but I assume that all series have to end at the same time (or else how is prediction supposed to happen?).

**POC Model**

The training period is the 4th-5th year of the data minus the final 3 days, and the testing period is the final 3 days of the data. Also, to train the initial model, I filtered out any stations that did not exist prior to the `test_date`. This ensures that there is corresponding training and testing data for every station.

In [None]:
freq = "15min" # group and sum trips by a set increment
max_date = "2022-08-31 23:45:00" # make sure all series end at the same time
train_date = "2022-08-28"
test_date = "2022-08-29"

In [None]:
train_poc_stop, test_poc_stop = deepar_station_data(trips_stop_all_group, "end station id", "stoptime", freq, max_date, train_date, test_date)

In [None]:
# retained all but 1 station
print(len(train_poc_stop))
print(len(test_poc_stop))

In [None]:
# make sure all test data is the same length
test_length = 0
for i in range(len(test_poc_stop)):
    test_length += len(test_poc_stop[i]["target"])
test_length / len(test_poc_stop) # should be 288 (3 days * 96 intervals per day)

In [None]:
# check number of trips
trips = 0
for i in range(len(train_poc_stop)):
    trips += sum(train_poc_stop[i]["target"])
for i in range(len(test_poc_stop)):
    trips += sum(test_poc_stop[i]["target"])
trips # lost 5 trips due to the 1 station loss

Station 572, with 5 trips, was dropped because the first trip that ended there was after the `test_date` of 8/29/2022.
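
Stations like this can be spotted before building the datasets by looking at each station's first recorded trip. The sketch below uses invented rows, and `late_stations` is a hypothetical helper. The exact filtering rule inside `deepar_station_data` is assumed to be "first trip on or after `test_date`".

```python
import pandas as pd

def late_stations(df, station_col, time_col, test_date):
    """Return ids of stations whose first trip is on or after test_date."""
    first_seen = df.groupby(station_col)[time_col].min()
    return first_seen[first_seen >= pd.Timestamp(test_date)].index.tolist()

# Synthetic rows: station 572 first appears inside the test window
demo = pd.DataFrame({
    "end station id": [177, 177, 572, 572],
    "stoptime": pd.to_datetime([
        "2020-09-01 08:00:00",
        "2022-08-30 09:15:00",
        "2022-08-30 10:00:00",
        "2022-08-31 17:30:00",
    ]),
})

dropped = late_stations(demo, "end station id", "stoptime", "2022-08-29")
print(dropped)
```

A station flagged this way has no training history, so it has to be excluded from both the training and the test files.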

In [None]:
# save to json lines format
write_dicts_to_file("train_poc_stop.json", train_poc_stop)
write_dicts_to_file("test_poc_stop.json", test_poc_stop)

**Plot 15-minute time series by station**

In [None]:
fig, axs = plt.subplots(4, 1, figsize = (20, 20), sharex = True)
axx = axs.ravel()
for i in range(0, 4):
    temp_station = [177, 436, 572, 67][i]
    get_station_data(trips_stop_all_group, "end station id", "stoptime", temp_station, freq, max_date).plot(ax = axx[i])
    axx[i].set_xlabel("date")
    axx[i].set_ylabel("trip count")
    axx[i].set_title(str(temp_station))
    axx[i].grid(which = "minor", axis = "x")

### Master Training Dataset

In [None]:
freq = "15min" # group and sum trips by a set increment
max_date = "2022-08-28 23:45:00" # make sure all series end at the same time

In [None]:
station_frames = [] # collect per-station frames and concatenate once at the end
for station in tqdm(trips_start_all_group["start station id"].unique()):
    trip_station = pd.DataFrame(get_station_data(trips_start_all_group, "start station id", "starttime", station, freq, max_date)["size"])
    trip_station["timestamp"] = trip_station.index
    trip_station = trip_station.reset_index(drop = True)
    trip_station["station"] = station
    station_frames.append(trip_station)

train_data = pd.concat(station_frames, ignore_index = True)

In [None]:
sum(train_data["size"])

In [None]:
train_data.to_pickle("../../datasets/trip_start_station_train_20200829-20220828.pkl")

In [None]:
station_frames = [] # collect per-station frames and concatenate once at the end
for station in tqdm(trips_stop_all_group["end station id"].unique()):
    if station != 572: # station 572 has no trips before the test window
        trip_station = pd.DataFrame(get_station_data(trips_stop_all_group, "end station id", "stoptime", station, freq, max_date)["size"])
        trip_station["timestamp"] = trip_station.index
        trip_station = trip_station.reset_index(drop = True)
        trip_station["station"] = station
        station_frames.append(trip_station)

train_data = pd.concat(station_frames, ignore_index = True)

In [None]:
sum(train_data["size"])

In [None]:
train_data.to_pickle("../../datasets/trip_stop_station_train_20200829-20220828.pkl")

### Master Evaluation Dataset

In [None]:
freq = "15min" # group and sum trips by a set increment
test_start = "2022-08-29 00:00:00"
max_date = "2022-08-31 23:45:00" # make sure all series end at the same time

In [None]:
station_frames = [] # collect per-station frames and concatenate once at the end
for station in tqdm(trips_start_all_group["start station id"].unique()):
    trip_station = pd.DataFrame(get_station_data(trips_start_all_group, "start station id", "starttime", station, freq, max_date, cluster = True, min_date = test_start).loc[test_start:]["size"])
    trip_station["timestamp"] = trip_station.index
    trip_station = trip_station.reset_index(drop = True)
    trip_station["station"] = station
    station_frames.append(trip_station)

eval_data = pd.concat(station_frames, ignore_index = True)

In [None]:
sum(eval_data["size"])

In [None]:
eval_data.to_pickle("../../datasets/trip_start_station_eval_20220829-20220831.pkl")

In [None]:
station_frames = [] # collect per-station frames and concatenate once at the end
for station in tqdm(trips_stop_all_group["end station id"].unique()):
    if station != 572: # station 572 was dropped from the stop-station model data
        trip_station = pd.DataFrame(get_station_data(trips_stop_all_group, "end station id", "stoptime", station, freq, max_date, cluster = True, min_date = test_start).loc[test_start:]["size"])
        trip_station["timestamp"] = trip_station.index
        trip_station = trip_station.reset_index(drop = True)
        trip_station["station"] = station
        station_frames.append(trip_station)

eval_data = pd.concat(station_frames, ignore_index = True)

In [None]:
sum(eval_data["size"])

In [None]:
eval_data.to_pickle("../../datasets/trip_stop_station_eval_20220829-20220831.pkl")