# Transform Raw Data into Features and Targets

### Step 1: Fetch/Validate the Data

We want to import another function from our data.py script, but we already imported from src.data earlier. Because Python caches imported modules, Jupyter keeps using the first version of src.data that it loaded, so importing a function that was added afterwards raises an **ImportError**. To fix this, we use the autoreload magic commands. They tell Jupyter to reload modules automatically before running a cell, so the latest version of data.py is picked up.

In [25]:
%reload_ext autoreload
%autoreload 2
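
The caching behavior described above can be reproduced outside of the project with a throwaway module. The sketch below is only an illustration: `demo_data` is a hypothetical stand-in for `src/data.py`, and `importlib.reload` is used to do by hand what autoreload does automatically.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so this demo always re-reads the source on reload
sys.dont_write_bytecode = True

# Create a throwaway module (hypothetical stand-in for src/data.py)
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_data.py"
module_path.write_text("def load_raw_data():\n    return 'raw'\n")
sys.path.insert(0, str(tmp_dir))
importlib.invalidate_caches()

import demo_data  # first import: the module is now cached in sys.modules

# Add a new function to the file AFTER the module was imported
module_path.write_text(
    "def load_raw_data():\n    return 'raw'\n\n"
    "def transform():\n    return 'ts'\n"
)

# The cached module does not know about the new function yet
try:
    from demo_data import transform
    stale_import_failed = False
except ImportError:
    stale_import_failed = True

# Reloading re-executes the source file, so the new function becomes available
importlib.reload(demo_data)
from demo_data import transform

print(f"{stale_import_failed=}")
print(f"{transform()=}")
```

With `%autoreload 2` active, the manual `importlib.reload` call is not needed in the notebook.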

In [26]:
from src.data import load_raw_data

In [27]:
# load rides from 2022 for all months
rides = load_raw_data(year = 2022)
rides.head()

File 2022-01 was already in local storage
File 2022-02 was already in local storage
File 2022-03 was already in local storage
File 2022-04 was already in local storage
File 2022-05 was already in local storage
File 2022-06 was already in local storage
File 2022-07 was already in local storage
File 2022-08 was already in local storage
File 2022-09 was already in local storage
File 2022-10 was already in local storage
File 2022-11 was already in local storage
File 2022-12 was already in local storage


Unnamed: 0,pickup_datetime,pickup_location_id
0,2022-01-01 00:35:40,142
1,2022-01-01 00:33:43,236
2,2022-01-01 00:53:21,166
3,2022-01-01 00:25:21,114
4,2022-01-01 00:36:48,68


### Step 2: Transform the Data to Time Series Data

First, we transform the raw data into time series data using grouping and aggregation: we count the number of rides per hour for each pickup location. In doing so, we must fill in 0 for every hour and location in which no rides took place. The function reuses the steps we developed in the 02_transform_raw_data_into_ts_data.ipynb notebook.
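
As a rough illustration of the idea (not the actual body of `transform_raw_data_into_ts_data`, which may be implemented differently), the sketch below counts rides per hour and location on a tiny made-up sample and then reindexes onto the full grid of hours and locations so that missing combinations become 0. The sample rows are invented for the example; only the column names come from the data shown above.

```python
import pandas as pd

# Tiny made-up sample with the same columns as the raw rides data
rides = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2022-01-01 00:35:40",
        "2022-01-01 00:53:21",
        "2022-01-01 02:10:00",
        "2022-01-01 01:05:00",
    ]),
    "pickup_location_id": [142, 142, 142, 236],
})

# Round each pickup time down to the start of its hour
rides["pickup_hour"] = rides["pickup_datetime"].dt.floor("h")

# Count rides for each (location, hour) pair that actually occurs
counts = (
    rides.groupby(["pickup_location_id", "pickup_hour"])
    .size()
    .rename("rides")
)

# Build the complete grid of locations x hours, then fill the gaps with 0
full_hours = pd.date_range(
    rides["pickup_hour"].min(), rides["pickup_hour"].max(), freq="h"
)
location_ids = sorted(rides["pickup_location_id"].unique())
full_index = pd.MultiIndex.from_product(
    [location_ids, full_hours], names=["pickup_location_id", "pickup_hour"]
)
ts_sample = counts.reindex(full_index, fill_value=0).reset_index()

print(ts_sample)
```

Here location 142 has no ride between 01:00 and 02:00, so that row appears with `rides` equal to 0 instead of being absent.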

In [28]:
from src.data import transform_raw_data_into_ts_data
# transform the raw data into time series data
ts_data = transform_raw_data_into_ts_data(rides)
ts_data

100%|██████████| 265/265 [00:03<00:00, 68.68it/s]


Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,0,1
1,2022-01-01 01:00:00,0,1
2,2022-01-01 02:00:00,0,1
3,2022-01-01 03:00:00,0,1
4,2022-01-01 04:00:00,1,1
...,...,...,...
2321395,2022-12-31 19:00:00,2,265
2321396,2022-12-31 20:00:00,2,265
2321397,2022-12-31 21:00:00,7,265
2321398,2022-12-31 22:00:00,3,265


Just as before, the next step after transforming the data into time series data will be to transform that data into features and targets (tabular data transformation).

In [35]:
# import the function
from src.data import transform_ts_data_into_features_and_target

features, targets = transform_ts_data_into_features_and_target(
    ts_data, 
    input_seq_len = 24*28*1,   # 672 hourly values (28 days, roughly one month) as features
    step_size = 24
)

print(f'{features.shape=}')
print(f'{targets.shape=}')

100%|██████████| 265/265 [00:30<00:00,  8.72it/s]

features.shape=(89305, 674)
targets.shape=(89305,)
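
To see where these shapes could come from, here is a minimal sliding-window sketch for a single location. It is an assumption about how such a transformation typically works, not the actual code of `transform_ts_data_into_features_and_target`; the helper name `sliding_window_indices` and the toy series are hypothetical.

```python
import numpy as np

def sliding_window_indices(n_hours, input_seq_len, step_size):
    """Return (start, end) pairs; hours [start, end) are features, hour `end` is the target."""
    indices = []
    start = 0
    # Stop as soon as the target hour would fall outside the series
    while start + input_seq_len < n_hours:
        indices.append((start, start + input_seq_len))
        start += step_size
    return indices

n_hours = 365 * 24            # one year of hourly values for one location
input_seq_len = 24 * 28       # 672 past hours used as features
step_size = 24                # move the window forward one day at a time

windows = sliding_window_indices(n_hours, input_seq_len, step_size)

# Toy series: the value at each hour is simply its position in the year
toy_rides = np.arange(n_hours)
X = np.stack([toy_rides[start:end] for start, end in windows])
y = np.array([toy_rides[end] for _, end in windows])

print(f"{len(windows)=}")
print(f"{X.shape=}")
print(f"{y.shape=}")
```

Under these assumptions each location yields 337 windows, and 265 locations × 337 windows gives 89,305 rows, which matches the output above. The 674 feature columns would then be the 672 hourly values plus two extra columns, presumably the pickup hour and the pickup location ID.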

Now we save the data to a parquet file, following a process similar to the one we used before. This time, we use the **TRANSFORMED_DATA_DIR** variable that we created in the paths.py script. It holds the path to the directory where we want to store the transformed data.

In [37]:
from src.paths import TRANSFORMED_DATA_DIR

# combine features and target into one table
# (copy first so that `features` itself is not modified)
tabular_data = features.copy()
tabular_data['target_rides_next_hour'] = targets

# save to the transformed data directory defined in src/paths.py
tabular_data.to_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')