### Create Training Data from Time Series

The next step is to create our training data from the time series of aggregated rides. The features will be the rides per hour in the preceding hours, and the target will be the number of rides in the hour we want to predict. To do this, we will use the slice and slide method. For example, if we want to predict the number of rides for the first hour of today, the features will be the previous 24 hours and the target will be that first hour. In terms of our data, this means the features are rows 0 through 23 of the data frame and the target is row 24. We write this as the tuple (0, 24, 25), because the end point of a slice is excluded. For the next hour, we shift every index up by one, so the tuple becomes (1, 25, 26), with rows 1 through 24 as the features and row 25 as the target.

This tabular data has N + 1 columns, where N is the number of features and the last column is the target variable. Remember that Python excludes the end point of ranges and slices: if the row at index 12 is the target and we want every row before it as features, we use range(0, 12), because 12 is not included in the range.
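As a rough illustration of the slice and slide idea, the sketch below builds (features, target) pairs from a short list. It reuses the first five ride counts shown for location 4 in this notebook and assumes only two features per example so the output stays readable; the variable names are illustrative and not part of the pipeline.

```python
# Toy hourly ride counts (the first five values for pickup location 4)
rides = [11, 15, 26, 8, 9]
n_features = 2  # assumption for this sketch: use the previous two hours

pairs = []
# stop early enough that the target index stays inside the list
for start in range(len(rides) - n_features):
    features = rides[start:start + n_features]  # end index is excluded
    target = rides[start + n_features]          # the hour right after the window
    pairs.append((features, target))

for features, target in pairs:
    print(features, '->', target)
```

Each pass moves the window forward by one hour, which is the "slide" part; the slice inside the loop is the "slice" part.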

In [2]:
import pandas as pd
import numpy as np

In [3]:
# import our aggregated data
ts_data = pd.read_parquet('../data/transformed/ts_data_2022_01.parquet')


To illustrate, here is an example of the features we would use to predict the third hour. We use two features, the rides in the previous two hours (11 and 15), to predict the rides in the third hour (26).

In [4]:
# example to predict the 26 rides in the third hour 
ts_data.head(n = 3)

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,11,4
1,2022-01-01 01:00:00,15,4
2,2022-01-01 02:00:00,26,4


In [5]:
# features used to predict the third row (index 2): rides in rows 0 and 1
pred_3 = ts_data.iloc[0:2, 1].to_numpy()

# target variable
target_3 = ts_data.iloc[2, 1]

print(f'Features: {pred_3}\nTarget: {target_3}')


Features: [11 15]
Target: 26


In [6]:
# ts data
ts_data.head()

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,11,4
1,2022-01-01 01:00:00,15,4
2,2022-01-01 02:00:00,26,4
3,2022-01-01 03:00:00,8,4
4,2022-01-01 04:00:00,9,4


In [7]:
# create central park data frame - central park is pickup location 43
ts_data_cp = ts_data.loc[ts_data['pickup_location_id'] == 43].reset_index(drop = True)

In [8]:
# central park time series df
ts_data_cp.head()

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,97,43
1,2022-01-01 01:00:00,60,43
2,2022-01-01 02:00:00,22,43
3,2022-01-01 03:00:00,8,43
4,2022-01-01 04:00:00,6,43


To create the features and targets, we need a tuple of three indices for each example: the index where the features start, the index where the features end (excluded, so it is also the position of the target), and the index where the target slice ends (also excluded). Since the length of the data is always 1 more than the highest index, the **stop_position** is len(data) - 1.

In [9]:
# cutoff indices function
def get_cutoff_indices(data: pd.DataFrame, n_features: int, step_size: int) -> list:
    stop_position = len(data) - 1

    # start at 0th index - first feature
    subseq_first_idx = 0
    # end of the feature slice (excluded), which is also the position of the target
    subseq_mid_idx = n_features
    # end of the target slice (excluded)
    subseq_last_idx = n_features + 1
    # empty list
    indices = []

    # continue this process for all data filling the empty list
    while subseq_last_idx <= stop_position:

        # add the three indices to our list
        indices.append((subseq_first_idx, subseq_mid_idx, subseq_last_idx))

        # move every index forward by step_size
        subseq_first_idx += step_size
        subseq_mid_idx += step_size
        subseq_last_idx += step_size

    return indices


As we see below, the **get_cutoff_indices** function generates the slicing indices for our training data, using 24 features for one target. Each tuple is read as two slices whose end points are excluded. For (0, 24, 25), the features are the slice 0:24, which means rows 0 through 23, and the target is the slice 24:25, which is just row 24.

In [10]:
# generate a sample where we use a full day as features
n_features = 24
step_size = 1

indices = get_cutoff_indices(
    ts_data_cp,
    n_features,
    step_size
)
indices[:5]

[(0, 24, 25), (1, 25, 26), (2, 26, 27), (3, 27, 28), (4, 28, 29)]
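To see how many tuples the stopping rule produces, the sketch below mirrors the loop with plain integers and compares it against a closed-form count. Both `cutoff_indices` and `count_windows` are hypothetical helpers written for this illustration, and the 744 rows (31 days of 24 hours) are an assumption about the January data for one location.

```python
def cutoff_indices(n_rows, n_features, step_size):
    """Hypothetical pure-Python mirror of the loop in get_cutoff_indices."""
    stop_position = n_rows - 1
    first, mid, last = 0, n_features, n_features + 1
    out = []
    while last <= stop_position:
        out.append((first, mid, last))
        first += step_size
        mid += step_size
        last += step_size
    return out


def count_windows(n_rows, n_features, step_size):
    """Closed-form count of the tuples the loop above produces."""
    span = (n_rows - 1) - (n_features + 1)
    return 0 if span < 0 else span // step_size + 1


n_rows = 31 * 24  # assumed: one row per hour in January for one location
print(count_windows(n_rows, 24, 1))
print(len(cutoff_indices(n_rows, 24, 1)))
```

Under that assumption both lines print 719, which is the number of examples we should expect when slicing one location with 24 features and a step size of 1.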

In [15]:
# we can see here that indices 0-23 account for one day (it is 24 hours)
ts_data_cp[0:28]

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,97,43
1,2022-01-01 01:00:00,60,43
2,2022-01-01 02:00:00,22,43
3,2022-01-01 03:00:00,8,43
4,2022-01-01 04:00:00,6,43
5,2022-01-01 05:00:00,5,43
6,2022-01-01 06:00:00,3,43
7,2022-01-01 07:00:00,10,43
8,2022-01-01 08:00:00,7,43
9,2022-01-01 09:00:00,19,43


### Create X and y 

In [13]:
# view this to help visualize the following code
indices[0:2]

[(0, 24, 25), (1, 25, 26)]

In [14]:
import numpy as np

n_features = 24
step_size = 1

# rows are the length of all sliced indices (list created above) and columns are the numbers of features
n_examples = len(indices)

# x holds one row of n_features ride counts per example and y holds one target per example
x = np.ndarray(shape=(n_examples, n_features), dtype=np.float32) 
y = np.ndarray(shape=(n_examples), dtype=np.float32)
pickup_hours = []

# i is the position in the list and idx is the tuple (first, mid, last)
for i, idx in enumerate(indices):

    # features: rides in rows idx[0] up to but not including idx[1]
    x[i, :] = ts_data_cp.iloc[idx[0]:idx[1]]['rides'].values

    # target: the single row at idx[1], since iloc excludes the ending index idx[2]
    y[i] = ts_data_cp.iloc[idx[1]:idx[2]]['rides'].values[0]

    # keep track of pickup hours
    pickup_hours.append(ts_data_cp.iloc[idx[1]]['pickup_hour'])


In [15]:
# view the creation of x
print(f'{x.shape=}')
print(f'{x=}')
print(f'{pickup_hours[:5]=}')


x.shape=(719, 24)
x=array([[ 97.,  60.,  22., ...,  16.,  18.,   6.],
       [ 60.,  22.,   8., ...,  18.,   6.,   3.],
       [ 22.,   8.,   6., ...,   6.,   3.,   1.],
       ...,
       [ 28.,  16.,  13., ..., 102.,  66.,  61.],
       [ 16.,  13.,   8., ...,  66.,  61.,  73.],
       [ 13.,   8.,   1., ...,  61.,  73.,  33.]], dtype=float32)
pickup_hours[:5]=[Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 01:00:00'), Timestamp('2022-01-02 02:00:00'), Timestamp('2022-01-02 03:00:00'), Timestamp('2022-01-02 04:00:00')]
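For comparison, the same kind of feature matrix can be built without an explicit loop. The sketch below is an alternative and not the method used in this notebook: it uses NumPy's `sliding_window_view` on a toy array that stands in for the `rides` column. Note that with a step size of 1 this view yields one more row than the loop above, because the loop stops at `len(data) - 1`; dropping the last row would reproduce the loop's output.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# toy series standing in for the 'rides' column of one location
rides = np.arange(10, dtype=np.float32)
n_features = 3
step_size = 1

# each window holds n_features inputs followed by one target value
windows = sliding_window_view(rides, window_shape=n_features + 1)[::step_size]

x_vec = windows[:, :n_features]  # all but the last column are features
y_vec = windows[:, n_features]   # the last column is the target

print(x_vec.shape, y_vec.shape)
print(x_vec[0], y_vec[0])
```

The view does not copy the data, so it can be a convenient way to cross-check the loop on a small sample before running the full transformation.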


In [16]:
# change to dataframe 
features_one_location = pd.DataFrame(
    x, 
    # range is 24 but we want to iterate backwards so use reversed
    columns = [f'rides_previous_{i+1}_hour' for i in reversed(range(n_features))]
)
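To make the column naming easier to follow, here is the same list comprehension with a small number of features. The value 3 is chosen only for readability.

```python
n_features = 3  # small value so the whole list is easy to read

# reversed(range(3)) yields 2, 1, 0, so the names count down from 3 to 1
columns = [f'rides_previous_{i+1}_hour' for i in reversed(range(n_features))]
print(columns)
# ['rides_previous_3_hour', 'rides_previous_2_hour', 'rides_previous_1_hour']
```

The leftmost column is therefore the oldest hour in the window and the rightmost column is the hour just before the target.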

In [17]:
# data frame of features
features_one_location

Unnamed: 0,rides_previous_24_hour,rides_previous_23_hour,rides_previous_22_hour,rides_previous_21_hour,rides_previous_20_hour,rides_previous_19_hour,rides_previous_18_hour,rides_previous_17_hour,rides_previous_16_hour,rides_previous_15_hour,...,rides_previous_10_hour,rides_previous_9_hour,rides_previous_8_hour,rides_previous_7_hour,rides_previous_6_hour,rides_previous_5_hour,rides_previous_4_hour,rides_previous_3_hour,rides_previous_2_hour,rides_previous_1_hour
0,97.0,60.0,22.0,8.0,6.0,5.0,3.0,10.0,7.0,19.0,...,70.0,94.0,87.0,73.0,34.0,32.0,22.0,16.0,18.0,6.0
1,60.0,22.0,8.0,6.0,5.0,3.0,10.0,7.0,19.0,24.0,...,94.0,87.0,73.0,34.0,32.0,22.0,16.0,18.0,6.0,3.0
2,22.0,8.0,6.0,5.0,3.0,10.0,7.0,19.0,24.0,39.0,...,87.0,73.0,34.0,32.0,22.0,16.0,18.0,6.0,3.0,1.0
3,8.0,6.0,5.0,3.0,10.0,7.0,19.0,24.0,39.0,35.0,...,73.0,34.0,32.0,22.0,16.0,18.0,6.0,3.0,1.0,1.0
4,6.0,5.0,3.0,10.0,7.0,19.0,24.0,39.0,35.0,77.0,...,34.0,32.0,22.0,16.0,18.0,6.0,3.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
714,52.0,36.0,28.0,16.0,13.0,8.0,1.0,1.0,2.0,1.0,...,78.0,74.0,66.0,91.0,117.0,100.0,106.0,147.0,121.0,102.0
715,36.0,28.0,16.0,13.0,8.0,1.0,1.0,2.0,1.0,1.0,...,74.0,66.0,91.0,117.0,100.0,106.0,147.0,121.0,102.0,66.0
716,28.0,16.0,13.0,8.0,1.0,1.0,2.0,1.0,1.0,4.0,...,66.0,91.0,117.0,100.0,106.0,147.0,121.0,102.0,66.0,61.0
717,16.0,13.0,8.0,1.0,1.0,2.0,1.0,1.0,4.0,9.0,...,91.0,117.0,100.0,106.0,147.0,121.0,102.0,66.0,61.0,73.0


In [18]:
# do the same for target variables 
targets_one_location = pd.DataFrame(y, 
                                    columns = ['target_rides_next_hour'])

In [19]:
# df of targets
targets_one_location

Unnamed: 0,target_rides_next_hour
0,3.0
1,1.0
2,1.0
3,0.0
4,0.0
...,...
714,66.0
715,61.0
716,73.0
717,33.0


In [23]:
from tqdm import tqdm

def transform_ts_data_into_features_and_target(
    ts_data: pd.DataFrame,
    input_seq_len: int,
    step_size: int
) -> tuple[pd.DataFrame, pd.Series]:
    """
    Slices and transposes data from time-series format into a (features, target)
    format that we can use to train Supervised ML models
    """
    # make sure the columns match exactly
    assert set(ts_data.columns) == {'pickup_hour', 'rides', 'pickup_location_id'}

    # only need to do this once per location id
    location_ids = ts_data['pickup_location_id'].unique()

    # set up empty dataframes
    features = pd.DataFrame()
    targets = pd.DataFrame()
    
    # for each location in our time series data.....
    for location_id in tqdm(location_ids):
        
        # keep only ts data for this location_id
        ts_data_one_location = ts_data.loc[ts_data['pickup_location_id'] == location_id, ['pickup_hour', 'rides']]

        # indices to split dataframe rows
        indices = get_cutoff_indices(
            ts_data_one_location,
            # number of features
            input_seq_len,
            step_size
        )

        # slice and transpose data into numpy arrays for features and targets
        n_examples = len(indices)
        x = np.ndarray(shape=(n_examples, input_seq_len), dtype=np.float32)
        y = np.ndarray(shape=(n_examples), dtype=np.float32)
        pickup_hours = []

        for i, idx in enumerate(indices):
            x[i, :] = ts_data_one_location.iloc[idx[0]:idx[1]]['rides'].values
            y[i] = ts_data_one_location.iloc[idx[1]:idx[2]]['rides'].values[0]
            pickup_hours.append(ts_data_one_location.iloc[idx[1]]['pickup_hour'])

        # numpy -> pandas
        features_one_location = pd.DataFrame(
            x,
            columns=[f'rides_previous_{i+1}_hour' for i in reversed(range(input_seq_len))]
        )
        features_one_location['pickup_hour'] = pickup_hours
        features_one_location['pickup_location_id'] = location_id

        # numpy -> pandas
        targets_one_location = pd.DataFrame(y, columns=['target_rides_next_hour'])

        # concatenate results
        features = pd.concat([features, features_one_location])
        targets = pd.concat([targets, targets_one_location])

    features.reset_index(inplace=True, drop=True)
    targets.reset_index(inplace=True, drop=True)

    return features, targets['target_rides_next_hour']

In [24]:
features, targets = transform_ts_data_into_features_and_target(
    ts_data,
    input_seq_len=24*7*1, # number of features in one week
    step_size=24,         # speed up the loop - 24 hour jump
)

print(f'{features.shape=}')
print(f'{targets.shape=}')

100%|██████████| 257/257 [00:01<00:00, 136.83it/s]

features.shape=(6168, 170)
targets.shape=(6168,)
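The reported shapes can be checked with a little arithmetic. The sketch below assumes every location has 744 hourly rows (31 days of 24 hours), which is consistent with the output above but is not verified here; the 257 locations come from the progress bar.

```python
n_rows = 31 * 24        # assumed hourly rows per location in January
input_seq_len = 24 * 7  # one week of hourly features
step_size = 24          # jump one day between examples
n_locations = 257       # taken from the progress bar output

# the end of the target slice starts at input_seq_len + 1
# and must stay at or below n_rows - 1
rows_per_location = len(range(input_seq_len + 1, n_rows, step_size))

# lag columns plus pickup_hour and pickup_location_id
n_columns = input_seq_len + 2

print(rows_per_location)
print((n_locations * rows_per_location, n_columns))
```

With these assumptions each location contributes 24 examples, giving 6168 rows and 170 columns in total.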





In [25]:
features.head()

Unnamed: 0,rides_previous_168_hour,rides_previous_167_hour,rides_previous_166_hour,rides_previous_165_hour,rides_previous_164_hour,rides_previous_163_hour,rides_previous_162_hour,rides_previous_161_hour,rides_previous_160_hour,rides_previous_159_hour,...,rides_previous_8_hour,rides_previous_7_hour,rides_previous_6_hour,rides_previous_5_hour,rides_previous_4_hour,rides_previous_3_hour,rides_previous_2_hour,rides_previous_1_hour,pickup_hour,pickup_location_id
0,11.0,15.0,26.0,8.0,9.0,7.0,3.0,1.0,0.0,3.0,...,2.0,3.0,3.0,7.0,4.0,4.0,7.0,10.0,2022-01-08,4
1,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,3.0,3.0,5.0,7.0,8.0,6.0,7.0,14.0,2022-01-09,4
2,0.0,1.0,0.0,0.0,1.0,1.0,1.0,3.0,2.0,3.0,...,6.0,4.0,3.0,5.0,1.0,1.0,1.0,0.0,2022-01-10,4
3,1.0,1.0,0.0,0.0,0.0,3.0,2.0,3.0,4.0,5.0,...,6.0,3.0,2.0,4.0,1.0,0.0,1.0,2.0,2022-01-11,4
4,0.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,1.0,2.0,...,1.0,6.0,3.0,2.0,3.0,2.0,4.0,1.0,2022-01-12,4
