### Create Training Data from Time Series

The next step is to build our training data from the time series of aggregated rides. The features are the ride counts for the preceding hours, and the target is the ride count for the hour we want to predict. To do this, we use the slice-and-slide method. For example, to predict the number of rides in the first hour of today, the features are the previous 24 hours and the target is the hour that follows. In terms of our data, the first example is described by the indices (0, 24, 25): the features are rows 0 through 23 of the data frame (the slice 0:24) and the target is row 24 (the slice 24:25). For the next example, we shift every index up by one to get (1, 25, 26), so the features are rows 1 through 24 and the target is row 25.

The resulting tabular data has N + 1 columns, where N is the number of features and the last column is the target variable. Remember that Python ranges and slices exclude the end value: to use row 12 as the target and every row before it as features, we use range(0, 12), which covers rows 0 through 11 and leaves out row 12.
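The slice-and-slide idea can be sketched on a plain Python list before touching the real data. The ride counts and the window length of 3 below are made up for illustration and are not taken from the notebook's dataset.

```python
# Toy hourly ride counts (made-up values, for illustration only)
rides = [11, 15, 26, 8, 9, 12]
n_features = 3  # hypothetical window length

rows = []
for start in range(len(rides) - n_features):
    # half-open slice: the end position is excluded
    features = rides[start:start + n_features]
    # the target is the hour right after the feature window
    target = rides[start + n_features]
    rows.append(features + [target])

for row in rows:
    print(row)
# [11, 15, 26, 8]
# [15, 26, 8, 9]
# [26, 8, 9, 12]
```

Each printed row has N + 1 = 4 values: three features followed by the target.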

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import our aggregated data
ts_data = pd.read_parquet('../data/transformed/ts_data_2022_01.parquet')


To illustrate, here is a small example that shows the features used to predict the third hour. We use two features, the rides in the previous two hours (11 and 15), to predict the rides in the third hour (26).

In [3]:
ts_data.head(n = 3)

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,11,4
1,2022-01-01 01:00:00,15,4
2,2022-01-01 02:00:00,26,4


In [4]:
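# features used to predict the third row (index 2): rides in rows 0 and 1
# column position 1 is the rides column
pred_3 = ts_data.iloc[0:2, 1].to_numpy()

# target variable: rides in the third row
target_3 = ts_data.iloc[2, 1]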

print(f'Features: {pred_3}\nTarget: {target_3}')


Features: [11 15]
Target: 26


In [5]:
ts_data.head()

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,11,4
1,2022-01-01 01:00:00,15,4
2,2022-01-01 02:00:00,26,4
3,2022-01-01 03:00:00,8,4
4,2022-01-01 04:00:00,9,4


In [6]:
# create central park data frame 
ts_data_cp = ts_data.loc[ts_data['pickup_location_id'] == 43].reset_index(drop = True)

In [7]:
ts_data_cp.head()

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,97,43
1,2022-01-01 01:00:00,60,43
2,2022-01-01 02:00:00,22,43
3,2022-01-01 03:00:00,8,43
4,2022-01-01 04:00:00,6,43


In [8]:
# compute the (first, mid, last) cutoff indices for every training example
def get_cutoff_indices(data: pd.DataFrame, n_features: int, step_size: int) -> list:
    stop_position = len(data) - 1

    # index of the first feature row
    subseq_first_idx = 0
    # end of the feature slice (exclusive), which is also the target row
    subseq_mid_idx = n_features
    # end of the target slice (exclusive)
    subseq_last_idx = n_features + 1
    indices = []

    # slide the window until the target slice reaches the stop position
    while subseq_last_idx <= stop_position:
        # add the three indices to our list
        indices.append((subseq_first_idx, subseq_mid_idx, subseq_last_idx))

        # shift every index by the step size (1 in this notebook)
        subseq_first_idx += step_size
        subseq_mid_idx += step_size
        subseq_last_idx += step_size

    return indices


As the output below shows, the **get_cutoff_indices** function describes each training example with 24 features and one target. Read the first tuple, (0, 24, 25), as two half-open slices: rows 0:24 are the features, so row 24 is not included, and rows 24:25 are the target, so row 25 is not included and the target is just row 24.

In [14]:
# generate a sample where we use a full day as features
n_features = 24
step_size = 1

indices = get_cutoff_indices(
    ts_data_cp,
    n_features,
    step_size
)
indices[:5]

[(0, 24, 25), (1, 25, 26), (2, 26, 27), (3, 27, 28), (4, 28, 29)]
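To make the tuples easier to read, here is a minimal sketch that runs the same index logic on a six-row toy frame with two features. The function is repeated in compact form only so the block runs on its own, and the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

def get_cutoff_indices(data: pd.DataFrame, n_features: int, step_size: int) -> list:
    """Same logic as the function above, repeated so this sketch is self-contained."""
    stop_position = len(data) - 1
    first, mid, last = 0, n_features, n_features + 1
    indices = []
    while last <= stop_position:
        indices.append((first, mid, last))
        first += step_size
        mid += step_size
        last += step_size
    return indices

# toy data: six hourly ride counts (illustrative values)
toy = pd.DataFrame({'rides': [97, 60, 22, 8, 6, 5]})
toy_indices = get_cutoff_indices(toy, n_features=2, step_size=1)
print(toy_indices)  # [(0, 2, 3), (1, 3, 4), (2, 4, 5)]

# unpack the first tuple into its two half-open slices
first, mid, last = toy_indices[0]
features = toy['rides'].iloc[first:mid].tolist()
target = toy['rides'].iloc[mid:last].tolist()
print(features, target)  # [97, 60] [22]
```

One thing the toy run makes visible: the largest value of the third index is `len(data) - 1`, and because the slice end is exclusive, the final row of the frame is never used as a target under this loop condition.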

In [15]:
# we can see here that indices 0-23 account for one day (it is 24 hours)
ts_data_cp[0:28]

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,97,43
1,2022-01-01 01:00:00,60,43
2,2022-01-01 02:00:00,22,43
3,2022-01-01 03:00:00,8,43
4,2022-01-01 04:00:00,6,43
5,2022-01-01 05:00:00,5,43
6,2022-01-01 06:00:00,3,43
7,2022-01-01 07:00:00,10,43
8,2022-01-01 08:00:00,7,43
9,2022-01-01 09:00:00,19,43


### Create X and y 

In [16]:
import numpy as np

n_features = 24
step_size = 1

# one row per index tuple and one column per feature
n_examples = len(indices)

# x holds the features of every example and y holds one target per example
x = np.empty(shape=(n_examples, n_features), dtype=np.float32)
y = np.empty(shape=(n_examples,), dtype=np.float32)
pickup_hours = []

# i is the example number and idx is the (first, mid, last) tuple for that example
for i, idx in enumerate(indices):
    # features: rides in rows idx[0] up to, but not including, idx[1]
    x[i, :] = ts_data_cp.iloc[idx[0]:idx[1]]['rides'].values
    # target: the single rides value in the slice idx[1]:idx[2], taken as a scalar
    y[i] = ts_data_cp.iloc[idx[1]:idx[2]]['rides'].values[0]
    pickup_hours.append(ts_data_cp.iloc[idx[1]]['pickup_hour'])



In [26]:
print(f'{x.shape=}')
print(f'{x=}')
print(f'{pickup_hours[:5]=}')


x.shape=(719, 24)
x=array([[ 97.,  60.,  22., ...,  16.,  18.,   6.],
       [ 60.,  22.,   8., ...,  18.,   6.,   3.],
       [ 22.,   8.,   6., ...,   6.,   3.,   1.],
       ...,
       [ 28.,  16.,  13., ..., 102.,  66.,  61.],
       [ 16.,  13.,   8., ...,  66.,  61.,  73.],
       [ 13.,   8.,   1., ...,  61.,  73.,  33.]], dtype=float32)
pickup_hours[:5]=[Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 01:00:00'), Timestamp('2022-01-02 02:00:00'), Timestamp('2022-01-02 03:00:00'), Timestamp('2022-01-02 04:00:00')]
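As an alternative to the explicit loop, the same kind of windowing can be sketched with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). This is a different technique from the one used in this notebook, shown here on made-up toy values; the names `x_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# toy ride counts (illustrative values)
rides = np.array([97, 60, 22, 8, 6, 5], dtype=np.float32)
n_features = 2

# each window holds n_features inputs followed by one target
windows = sliding_window_view(rides, n_features + 1)
x_toy = windows[:, :n_features]
y_toy = windows[:, n_features]

print(x_toy.shape, y_toy.shape)  # (4, 2) (4,)
```

Note that this variant also uses the final value as a target, so on the same input it yields one more example than the loop-based indices above.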


In [27]:
# change to dataframe
features_one_location = pd.DataFrame(
    x,
    # the first column is the oldest hour, so count down from 24 to 1
    columns = [f'rides_previous_{i+1}_hour' for i in reversed(range(n_features))]
)

In [28]:
features_one_location

Unnamed: 0,rides_previous_24_hour,rides_previous_23_hour,rides_previous_22_hour,rides_previous_21_hour,rides_previous_20_hour,rides_previous_19_hour,rides_previous_18_hour,rides_previous_17_hour,rides_previous_16_hour,rides_previous_15_hour,...,rides_previous_10_hour,rides_previous_9_hour,rides_previous_8_hour,rides_previous_7_hour,rides_previous_6_hour,rides_previous_5_hour,rides_previous_4_hour,rides_previous_3_hour,rides_previous_2_hour,rides_previous_1_hour
0,97.0,60.0,22.0,8.0,6.0,5.0,3.0,10.0,7.0,19.0,...,70.0,94.0,87.0,73.0,34.0,32.0,22.0,16.0,18.0,6.0
1,60.0,22.0,8.0,6.0,5.0,3.0,10.0,7.0,19.0,24.0,...,94.0,87.0,73.0,34.0,32.0,22.0,16.0,18.0,6.0,3.0
2,22.0,8.0,6.0,5.0,3.0,10.0,7.0,19.0,24.0,39.0,...,87.0,73.0,34.0,32.0,22.0,16.0,18.0,6.0,3.0,1.0
3,8.0,6.0,5.0,3.0,10.0,7.0,19.0,24.0,39.0,35.0,...,73.0,34.0,32.0,22.0,16.0,18.0,6.0,3.0,1.0,1.0
4,6.0,5.0,3.0,10.0,7.0,19.0,24.0,39.0,35.0,77.0,...,34.0,32.0,22.0,16.0,18.0,6.0,3.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
714,52.0,36.0,28.0,16.0,13.0,8.0,1.0,1.0,2.0,1.0,...,78.0,74.0,66.0,91.0,117.0,100.0,106.0,147.0,121.0,102.0
715,36.0,28.0,16.0,13.0,8.0,1.0,1.0,2.0,1.0,1.0,...,74.0,66.0,91.0,117.0,100.0,106.0,147.0,121.0,102.0,66.0
716,28.0,16.0,13.0,8.0,1.0,1.0,2.0,1.0,1.0,4.0,...,66.0,91.0,117.0,100.0,106.0,147.0,121.0,102.0,66.0,61.0
717,16.0,13.0,8.0,1.0,1.0,2.0,1.0,1.0,4.0,9.0,...,91.0,117.0,100.0,106.0,147.0,121.0,102.0,66.0,61.0,73.0


In [29]:
# do the same for the target variable
targets_one_location = pd.DataFrame(y,
                                    columns = ['target_rides_next_hour'])

In [30]:
targets_one_location

Unnamed: 0,target_rides_next_hour
0,3.0
1,1.0
2,1.0
3,0.0
4,0.0
...,...
714,66.0
715,61.0
716,73.0
717,33.0
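The features and targets can be placed side by side to get the single table with N + 1 columns described at the start of this section. The sketch below assumes small made-up arrays and a window of 3 for readability; `training_table` is a hypothetical name, and the notebook itself keeps the two frames separate.

```python
import numpy as np
import pandas as pd

n_features = 3  # hypothetical, kept small for readability
x_toy = np.array([[11, 15, 26], [15, 26, 8]], dtype=np.float32)
y_toy = np.array([8, 9], dtype=np.float32)

features = pd.DataFrame(
    x_toy,
    columns=[f'rides_previous_{i + 1}_hour' for i in reversed(range(n_features))],
)
targets = pd.DataFrame(y_toy, columns=['target_rides_next_hour'])

# align on the shared row index and place the target in the last column
training_table = pd.concat([features, targets], axis=1)
print(training_table.columns.tolist())
```

Because both frames share the default integer index, `pd.concat` with `axis=1` lines each feature row up with its target.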
