## Create Feature Pipeline

The first step is to load the configuration file we created, **config.py**, so that we can access the Hopsworks settings: the project name, feature group name, feature group version, and API key. Keeping them in one place means we do not have to redefine them in every notebook.

In [1]:
import src.config as config
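For orientation, a configuration module of this kind might look like the minimal sketch below. The variable names are the ones used later in this notebook. The values are assumptions: the feature group name and version are inferred from the materialization job name in the upload output, and the project name default is a hypothetical placeholder. The real **config.py** may load these differently.

```python
import os

# Assumption: secrets come from environment variables rather than being hard-coded.
# "my_project" is a hypothetical placeholder, not the real project name.
HOPSWORKS_PROJECT_NAME = os.environ.get("HOPSWORKS_PROJECT_NAME", "my_project")
HOPSWORKS_API_KEY = os.environ.get("HOPSWORKS_API_KEY", "")

# Assumption: inferred from the job name
# "time_series_hourly_feature_group_1_offline_fg_materialization".
FEATURE_GROUP_NAME = "time_series_hourly_feature_group"
FEATURE_GROUP_VERSION = 1
```

Reading the API key from the environment keeps it out of version control, which matters once the notebook runs in an automated workflow.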

Since this notebook fetches data every hour, we need to keep track of the current time and the date range we want to fetch. Each hourly run fetches the last 28 days. Refetching the full window is not strictly necessary, but the redundancy helps if an earlier fetch failed.

In [19]:
from datetime import datetime, timedelta, timezone

import pandas as pd

current_date = pd.to_datetime(datetime.now(timezone.utc)).floor('H')
current_date = current_date.replace(tzinfo=None)
print(f'{current_date=}')


# fetch data from 28 days ago up to the current time (floored to the hour, not rounded to nearest)
fetch_data_to = current_date
fetch_data_from = current_date - timedelta(days = 28)

current_date=Timestamp('2024-09-25 19:00:00')


In [23]:
fetch_data_from

Timestamp('2024-08-28 19:00:00')
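To make the window arithmetic concrete, here is a small sketch with a fixed timestamp in place of `datetime.now()`. The example time `19:42:10` is an assumption chosen to reproduce the values printed above.

```python
from datetime import timedelta

import pandas as pd

# Fixed "now" for reproducibility (assumed example value).
now = pd.Timestamp("2024-09-25 19:42:10")

# floor() drops the minutes and seconds; it never rounds up.
current_date = now.floor("h")

fetch_data_to = current_date
fetch_data_from = current_date - timedelta(days=28)

# Number of whole hours covered by the half-open window [from, to).
n_hours = int((fetch_data_to - fetch_data_from) / pd.Timedelta(hours=1))
print(fetch_data_from, fetch_data_to, n_hours)
```

The window covers 28 × 24 = 672 hours. If the transformed data holds one row per pickup zone per hour, 265 zones give 265 × 672 = 178,080 rows, which matches the row count shown when the frame is uploaded later in this notebook.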

For this model we do not have access to the NYC Taxi data warehouse, so we cannot actually get new rides every hour. Instead we simulate the call to show how it would work. Since we do have the rides from a year ago, we pretend that they are the most recent demand. The function below loads the 28-day window that ends 52 weeks (364 days) before the current hour and shifts it forward by 52 weeks. This is the simulated new data. We pretend to run this code every hour to get the newest 28 days of data, but because our source does not update hourly, we use this synthetic data.

In [32]:
from src.data import load_raw_data

def fetch_batch_raw_data(from_date: datetime, to_date: datetime) -> pd.DataFrame:
    """
    Simulate production data by sampling historical data from 52 weeks ago (i.e. roughly 1 year)
    """
    # shift the requested window back by 52 weeks (364 days)
    from_date_ = from_date - timedelta(days=7*52)
    to_date_ = to_date - timedelta(days=7*52)
    print(f'{from_date_=}, {to_date_=}')

    # download the monthly file(s) covering the window using our load_raw_data function
    rides = load_raw_data(year=from_date_.year, months=from_date_.month)

    # a 28-day window usually spans two months; load the second file only when it does,
    # otherwise the same month would be loaded twice and rows would be duplicated
    if (to_date_.year, to_date_.month) != (from_date_.year, from_date_.month):
        rides_2 = load_raw_data(year=to_date_.year, months=to_date_.month)
        rides = pd.concat([rides, rides_2])

    # keep only rides inside the half-open window [from_date_, to_date_)
    in_window = (rides['pickup_datetime'] >= from_date_) & (rides['pickup_datetime'] < to_date_)
    rides = rides[in_window].copy()

    # shift the data forward by 52 weeks to pretend it is recent
    rides['pickup_datetime'] += timedelta(days=7*52)

    rides.sort_values(by=['pickup_location_id', 'pickup_datetime'], inplace=True)

    return rides

In [33]:
# get rides
rides = fetch_batch_raw_data(from_date=fetch_data_from, to_date=fetch_data_to)

from_date_=Timestamp('2023-08-30 19:00:00'), to_date_=Timestamp('2023-09-27 19:00:00')
File 2023-08 was already in local storage
File 2023-09 was already in local storage
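The shift logic can be checked on a tiny hand-made frame without downloading anything. The sketch below is not the function above; it only illustrates the same filter-and-shift idea on four made-up rides. One property worth noting: 52 weeks is exactly 364 days, a multiple of 7, so every shifted ride keeps its day of the week. Whether that was the reason for choosing 52 weeks over 365 days is an assumption on my part.

```python
from datetime import timedelta

import pandas as pd

SHIFT = timedelta(days=7 * 52)  # 364 days

# Made-up historical rides (illustrative data only).
historical = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2023-08-30 20:15:00",
        "2023-09-10 08:00:00",
        "2023-09-27 18:59:00",
        "2023-09-27 19:30:00",  # falls after the window and is dropped
    ]),
    "pickup_location_id": [1, 1, 2, 2],
})

from_date = pd.Timestamp("2024-08-28 19:00:00")
to_date = pd.Timestamp("2024-09-25 19:00:00")

# Select the window 52 weeks earlier, as a half-open interval [from, to).
in_window = (
    (historical["pickup_datetime"] >= from_date - SHIFT)
    & (historical["pickup_datetime"] < to_date - SHIFT)
)
simulated = historical[in_window].copy()
original_weekdays = simulated["pickup_datetime"].dt.dayofweek.tolist()

# Move the selected rides forward so they appear to be recent.
simulated["pickup_datetime"] += SHIFT
print(simulated)
```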


Below we can get a sense of what happened. We asked the data warehouse for the 28 days of data leading up to now. Since we didn't have that data, we used last year's data and shifted it forward so that it covers the last 28 days. This is why the **rides** dataframe starts on 08-28-2024 (28 days ago) and ends today. All we did was retrieve 28 days' worth of data; because the real data was unavailable, we simulated it. Don't overthink this step: it's just getting 28 days' worth of *new* data. In practice we would have the real data available to us.

At the time of writing (09.25.2024), the newest data available is from July.

In [37]:
rides

Unnamed: 0,pickup_datetime,pickup_location_id
2639768,2024-08-28 21:50:04,1
2656234,2024-08-29 07:55:22,1
2658958,2024-08-29 08:41:05,1
2695049,2024-08-29 16:24:00,1
2702871,2024-08-29 17:15:21,1
...,...,...
2824574,2024-09-25 18:09:07,265
2824595,2024-09-25 18:09:10,265
2341399,2024-09-25 18:22:50,265
2345555,2024-09-25 18:35:37,265


##### Transform the Raw Data into Time Series Data

We now have the raw data and want to transform it into time series data just as we did before. This will be done with the **transform_raw_data_into_ts_data** function from **src.data**.

The older data that we had is already transformed into time series data and loaded into our feature group. This is the new data that we now have to transform.

In [38]:
from src.data import transform_raw_data_into_ts_data

ts_data = transform_raw_data_into_ts_data(rides)

100%|██████████| 265/265 [00:00<00:00, 316.52it/s]
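As a reminder of what this transformation does conceptually, the sketch below counts rides per location per hour and fills hours without rides with zero. It is an illustrative stand-in, not the implementation in **src.data**; the function name `to_hourly_counts` and the count column name `rides` are assumptions, while `pickup_location_id` and `pickup_hour` are the key columns used by the feature group below.

```python
import pandas as pd


def to_hourly_counts(rides: pd.DataFrame) -> pd.DataFrame:
    """Count rides per (location, hour), with explicit zeros for empty hours."""
    df = rides.copy()
    df["pickup_hour"] = df["pickup_datetime"].dt.floor("h")

    # One count per (location, hour) pair that actually has rides.
    counts = df.groupby(["pickup_location_id", "pickup_hour"]).size().rename("rides")

    # Build the full grid of locations x hours so missing slots become 0.
    all_hours = pd.date_range(df["pickup_hour"].min(), df["pickup_hour"].max(), freq="h")
    full_index = pd.MultiIndex.from_product(
        [sorted(df["pickup_location_id"].unique()), all_hours],
        names=["pickup_location_id", "pickup_hour"],
    )
    return counts.reindex(full_index, fill_value=0).reset_index()


# Made-up sample: three rides in zone 1, one ride in zone 2.
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2024-09-25 10:05:00",
        "2024-09-25 10:40:00",
        "2024-09-25 12:10:00",
        "2024-09-25 11:30:00",
    ]),
    "pickup_location_id": [1, 1, 1, 2],
})
hourly = to_hourly_counts(sample)
print(hourly)
```

Filling the empty hours matters because a model trained on hourly demand needs a row for every hour, including hours with zero rides.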


##### Connect to Feature Store/Feature Group

This is where we will connect this new data to our feature group.

In [41]:
import hopsworks

# connect to project 
project = hopsworks.login(
    project = config.HOPSWORKS_PROJECT_NAME,
    api_key_value = config.HOPSWORKS_API_KEY
)

# get feature store
feature_store = project.get_feature_store()

# connect to feature group
feature_group = feature_store.get_or_create_feature_group(
    name=config.FEATURE_GROUP_NAME,
    version=config.FEATURE_GROUP_VERSION,
    description='Hourly time series data',
    primary_key=['pickup_location_id', 'pickup_hour'],
    event_time='pickup_hour'
)


Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1049751
Connected. Call `.close()` to terminate connection gracefully.


In [42]:
# insert this data to the feature group
feature_group.insert(ts_data, write_options={'wait_for_job':False})

Uploading Dataframe: 0.00% |          | Rows 0/178080 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: time_series_hourly_feature_group_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/1049751/jobs/named/time_series_hourly_feature_group_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x2bb214ad060>, None)
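Because the feature group is keyed on `pickup_location_id` and `pickup_hour`, it can be worth confirming before an insert that the frame holds at most one row per key. The helper below is a hypothetical local check written in plain pandas; it is not part of the Hopsworks API.

```python
import pandas as pd

PRIMARY_KEY = ["pickup_location_id", "pickup_hour"]


def assert_unique_primary_key(df: pd.DataFrame, key=PRIMARY_KEY) -> None:
    """Raise ValueError if any two rows share the same primary key values."""
    dupes = df.duplicated(subset=key, keep=False)
    if dupes.any():
        raise ValueError(f"{int(dupes.sum())} rows share a primary key {key}")


# Made-up sample with unique keys; the 'rides' column name is an assumption.
sample = pd.DataFrame({
    "pickup_location_id": [1, 1, 2],
    "pickup_hour": pd.to_datetime(
        ["2024-09-25 10:00", "2024-09-25 11:00", "2024-09-25 10:00"]
    ),
    "rides": [2, 0, 1],
})
assert_unique_primary_key(sample)  # passes silently
```

In this notebook the call would be `assert_unique_primary_key(ts_data)` just before `feature_group.insert(...)`.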

#### Automate Running Every Hour

We want this notebook to run automatically every hour. We will do this using GitHub Actions, which requires a **.github/workflows** folder at the root of the repository. In this folder we create a YAML file called **feature_workflows.yaml**. View the file to see the syntax. Help with the syntax can be found at the following links:

+ https://docs.github.com/en/actions/use-cases-and-examples/creating-an-example-workflow

+ https://spacelift.io/blog/github-actions-tutorial

+ https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions

+ https://docs.github.com/en/actions/writing-workflows/quickstart
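
For readers who have not seen a scheduled workflow before, an hourly workflow might look roughly like the sketch below. This is not the contents of **feature_workflows.yaml**. The workflow name, job name, Python version, requirements file, and notebook path are all hypothetical, and the API key is assumed to be stored as a repository secret named `HOPSWORKS_API_KEY`.

```yaml
name: hourly-feature-pipeline   # hypothetical name

on:
  schedule:
    - cron: '0 * * * *'         # minute 0 of every hour (UTC)
  workflow_dispatch:            # also allow manual runs from the Actions tab

jobs:
  feature_pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'   # assumed version

      - name: Install dependencies
        run: pip install -r requirements.txt   # assumed requirements file

      - name: Run the feature pipeline notebook
        env:
          HOPSWORKS_API_KEY: ${{ secrets.HOPSWORKS_API_KEY }}
        # hypothetical notebook path
        run: jupyter nbconvert --to notebook --execute notebooks/feature_pipeline.ipynb
```

Scheduled workflows on GitHub can start later than the cron time when the service is busy, which is one more reason the 28-day refetch window is a useful safety margin.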
