## Create Feature Pipeline

The first step is to load the configuration file we created, **config.py**, so that we can access the Hopsworks settings: the project name, feature group name, feature group version and API key. Keeping them in one place means we do not have to redefine them in every notebook.

In [11]:
import src.config as config

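The contents of **config.py** are not shown in this notebook. As a rough sketch of what such a file could look like, the block below defines the four names used later. The environment-variable lookup and the `my_project` placeholder are assumptions for illustration; the feature group name and version are taken from the job name that appears in the insert output further down.

```python
import os

# Hypothetical stand-in for src/config.py; the real file may be organised differently.

# Placeholder default, replace with your own Hopsworks project name.
HOPSWORKS_PROJECT_NAME = os.environ.get('HOPSWORKS_PROJECT_NAME', 'my_project')

# Read the API key from the environment so it is never committed to the repository.
HOPSWORKS_API_KEY = os.environ.get('HOPSWORKS_API_KEY', '')

# These match the materialization job name shown in the insert output of this notebook.
FEATURE_GROUP_NAME = 'time_series_hourly_feature_group'
FEATURE_GROUP_VERSION = 4
```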
Because this notebook fetches data every hour, we need to keep track of the current time and the range of dates we want to fetch. Each run fetches a trailing window that ends at the current hour; the cell below sets that window to 28 days. The window has to be long enough to cover any gap since the last available data, so widen it (for example to 28x3 days) if the newest available data is more than 28 days old.

In [1]:
from datetime import datetime, timedelta, timezone

import pandas as pd

current_date = pd.to_datetime(datetime.now(timezone.utc)).floor('H')
current_date = current_date.replace(tzinfo=None)
print(f'{current_date=}')


# fetch data from 28 days ago up to the current time (rounded down to the hour)
fetch_data_to = current_date
# the 28-day window must be long enough to cover any gap since the last available data
fetch_data_from = current_date - timedelta(days = 28)
print(f'{fetch_data_from=}')

current_date=Timestamp('2024-10-07 18:00:00')
fetch_data_from=Timestamp('2024-09-09 18:00:00')

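As a worked example of the window arithmetic, the sketch below repeats the same computation with only the standard library and a fixed clock, so the result can be checked by hand. The helper name `fetch_window` and its `days` parameter are illustrative and not part of the project code.

```python
from datetime import datetime, timedelta, timezone


def fetch_window(now: datetime, days: int = 28) -> tuple[datetime, datetime]:
    """Return (fetch_from, fetch_to) as naive UTC datetimes floored to the hour."""
    # Round down to the start of the current hour, then drop the timezone,
    # mirroring floor('H') followed by replace(tzinfo=None) in the cell above.
    fetch_to = now.astimezone(timezone.utc).replace(
        minute=0, second=0, microsecond=0, tzinfo=None
    )
    fetch_from = fetch_to - timedelta(days=days)
    return fetch_from, fetch_to


# Fixed "current time" so the example is reproducible.
now = datetime(2024, 10, 7, 18, 42, 13, tzinfo=timezone.utc)
fetch_from, fetch_to = fetch_window(now)
print(fetch_from, fetch_to)  # 2024-09-09 18:00:00 2024-10-07 18:00:00
```

With the clock fixed at 18:42 UTC on 2024-10-07, the window matches the values printed by the notebook cell above.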

In [2]:
fetch_data_from

Timestamp('2024-09-09 18:00:00')

For this model we do not have access to the NYC Taxi data warehouse, so we cannot fetch the new rides every hour. Instead, we simulate that call to show how it would work. We do have the rides from a year ago, so we pretend that they are the most recent demand: the function below loads the data from exactly 52 weeks (364 days) before the requested window (28 days in the cell above) and shifts the timestamps forward by 52 weeks. In production this code would run every hour against the live source; because our source does not update every hour, we use this synthetic data instead.

In [3]:
from src.data import load_raw_data

def fetch_batch_raw_data(from_date: datetime, to_date: datetime) -> pd.DataFrame:
    """
    Simulate production data by sampling historical data from 52 weeks ago
    (364 days, i.e. roughly 1 year, which keeps the days of the week aligned)
    """
    from_date_ = from_date - timedelta(days=7*52)
    to_date_ = to_date - timedelta(days=7*52)
    print(f'{from_date=}, {to_date_=}')

    # download the monthly file that contains the start of the window
    rides = load_raw_data(year=from_date_.year, months=from_date_.month)

    # download the file that contains the end of the window, unless it is the same month
    # (a 28-day window never spans more than two calendar months)
    if (from_date_.year, from_date_.month) != (to_date_.year, to_date_.month):
        rides_2 = load_raw_data(year=to_date_.year, months=to_date_.month)
        rides = pd.concat([rides, rides_2])

    # keep only the rides inside [from_date_, to_date_)
    in_window = (rides.pickup_datetime >= from_date_) & (rides.pickup_datetime < to_date_)
    rides = rides[in_window].copy()

    # shift the data to pretend this is recent data
    rides['pickup_datetime'] += timedelta(days=7*52)

    rides.sort_values(by=['pickup_location_id', 'pickup_datetime'], inplace=True)

    return rides


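The reason for shifting by 52 weeks rather than a calendar year can be checked with a small standalone example: 364 is a multiple of 7, so every shifted timestamp keeps its day of the week and hour, which matters for demand patterns. The helper names below are illustrative only.

```python
from datetime import datetime, timedelta

SHIFT = timedelta(days=7 * 52)  # 364 days, the same shift used by fetch_batch_raw_data


def to_historical(ts: datetime) -> datetime:
    """Map a 'current' timestamp to the historical timestamp that stands in for it."""
    return ts - SHIFT


def to_simulated(ts: datetime) -> datetime:
    """Map a historical timestamp forward so that it looks recent."""
    return ts + SHIFT


window_start = datetime(2024, 9, 9, 18)
historical_start = to_historical(window_start)
print(historical_start)  # 2023-09-11 18:00:00

# The weekday is preserved by a 364-day shift, but not by a 365-day shift.
same_weekday = historical_start.weekday() == window_start.weekday()
year_shift_same_weekday = (
    (window_start - timedelta(days=365)).weekday() == window_start.weekday()
)
print(same_weekday, year_shift_same_weekday)  # True False
```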

In [4]:
# get rides
rides = fetch_batch_raw_data(from_date=fetch_data_from, to_date=fetch_data_to)

from_date=Timestamp('2024-09-09 18:00:00'), to_date_=Timestamp('2023-10-09 18:00:00')
File 2023-09 was already in local storage
File 2023-10 was already in local storage


In [6]:
rides

| | pickup_datetime | pickup_location_id |
|---|---|---|
| 1053576 | 2024-09-09 18:31:20 | 1 |
| 1061035 | 2024-09-09 19:32:46 | 1 |
| 1065718 | 2024-09-09 20:25:14 | 1 |
| 1065719 | 2024-09-09 20:26:26 | 1 |
| 1090789 | 2024-09-10 07:34:58 | 1 |
| ... | ... | ... |
| 3407858 | 2024-10-07 16:10:52 | 265 |
| 893498 | 2024-10-07 16:31:34 | 265 |
| 897416 | 2024-10-07 17:16:51 | 265 |
| 3408207 | 2024-10-07 17:22:13 | 265 |


In [35]:
# rides[rides['pickup_datetime'] > '2024-08-22']

| | pickup_datetime | pickup_location_id |
|---|---|---|
| 1966252 | 2024-08-22 08:12:17 | 1 |
| 1984767 | 2024-08-22 12:21:34 | 1 |
| 2002619 | 2024-08-22 15:08:40 | 1 |
| 2002620 | 2024-08-22 15:10:04 | 1 |
| 2008549 | 2024-08-22 16:22:31 | 1 |
| ... | ... | ... |
| 3377697 | 2024-10-02 13:10:55 | 265 |
| 3377760 | 2024-10-02 13:10:59 | 265 |
| 3377865 | 2024-10-02 14:10:13 | 265 |
| 3377871 | 2024-10-02 14:10:18 | 265 |


To recap what happened above: we asked the data warehouse for the 28 days of data leading up to now. Since we did not have that data, we took last year's data and shifted it forward so that it covers the last 28 days. This is why the **rides** dataframe starts 28 days before the current hour (2024-09-09 18:00 in the run above) and ends at the current hour. Don't overthink this step: it just retrieves 28 days' worth of *new* data. In practice we would have the real data available to us.

At the time of writing this code (09.25.2024), the newest data available is from July.

##### Transform the Raw Data into Time Series Data

We now have the raw data and want to transform it into time series data just as we did before. This will be done with the **transform_raw_data_into_ts_data** function from **src.data**.

The older data that we had is already transformed into time series data and loaded into our feature group. This is the new data that we now have to transform.

In [7]:
from src.data import transform_raw_data_into_ts_data

ts_data = transform_raw_data_into_ts_data(rides)

100%|██████████| 265/265 [00:00<00:00, 300.94it/s]

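The implementation of **transform_raw_data_into_ts_data** lives in **src.data** and is not reproduced here. As a minimal sketch of the idea, assuming the function counts rides per location and hour and fills hours without rides with 0 (which is what the output further down suggests), one way to do it in pandas is shown below. The function name `to_hourly_counts` and the sample rows are illustrative.

```python
import pandas as pd


def to_hourly_counts(rides: pd.DataFrame) -> pd.DataFrame:
    """Count rides per (location, hour) and fill hours without rides with 0."""
    rides = rides.copy()
    rides['pickup_hour'] = rides['pickup_datetime'].dt.floor('h')

    # Number of rides observed for each (location, hour) pair.
    counts = rides.groupby(['pickup_location_id', 'pickup_hour']).size()

    # Full grid of locations x hours, so that hours without rides appear as 0.
    hours = pd.date_range(
        rides['pickup_hour'].min(), rides['pickup_hour'].max(), freq='h'
    )
    locations = sorted(rides['pickup_location_id'].unique())
    grid = pd.MultiIndex.from_product(
        [locations, hours], names=['pickup_location_id', 'pickup_hour']
    )

    ts = counts.reindex(grid, fill_value=0).rename('rides').reset_index()
    return ts[['pickup_hour', 'rides', 'pickup_location_id']]


raw = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2024-09-09 18:31:20', '2024-09-09 19:32:46',
        '2024-09-09 20:25:14', '2024-09-09 20:26:26',
        '2024-09-09 20:05:00',
    ]),
    'pickup_location_id': [1, 1, 1, 1, 2],
})
hourly = to_hourly_counts(raw)
print(hourly)
```

Location 1 gets counts of 1, 1 and 2 for the three hours, while location 2 gets 0, 0 and 1: the two empty hours are filled in rather than dropped, so every location has one row per hour.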

In [19]:
# string to datetime
ts_data['pickup_hour'] = pd.to_datetime(ts_data['pickup_hour'], utc=True)

# add a column with Unix epoch milliseconds
# (astype replaces Series.view, which is deprecated in recent pandas; the int64 values are nanoseconds)
ts_data['pickup_ts'] = ts_data['pickup_hour'].astype('int64') // 10**6

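To make the conversion concrete, the sketch below computes epoch milliseconds for two of the hours in this run. It uses an explicit subtraction from the Unix epoch instead of reinterpreting the underlying integers, which is an alternative to the cell above and does not depend on the column being stored in nanoseconds.

```python
import pandas as pd

hours = pd.to_datetime(
    pd.Series(['2024-09-09 18:00:00', '2024-09-09 19:00:00']), utc=True
)

# Time elapsed since the Unix epoch, expressed as whole milliseconds.
epoch = pd.Timestamp('1970-01-01', tz='UTC')
pickup_ts = (hours - epoch) // pd.Timedelta(milliseconds=1)

print(pickup_ts.tolist())  # [1725904800000, 1725908400000]
```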
In [15]:
ts_data

| | pickup_hour | rides | pickup_location_id | pickup_ts |
|---|---|---|---|---|
| 0 | 2024-09-09 18:00:00+00:00 | 1 | 1 | 1.725905e+12 |
| 1 | 2024-09-09 19:00:00+00:00 | 1 | 1 | 1.725908e+12 |
| 2 | 2024-09-09 20:00:00+00:00 | 2 | 1 | 1.725912e+12 |
| 3 | 2024-09-09 21:00:00+00:00 | 0 | 1 | 1.725916e+12 |
| 4 | 2024-09-09 22:00:00+00:00 | 0 | 1 | 1.725919e+12 |
| ... | ... | ... | ... | ... |
| 178075 | 2024-10-07 13:00:00+00:00 | 1 | 265 | 1.728306e+12 |
| 178076 | 2024-10-07 14:00:00+00:00 | 5 | 265 | 1.728310e+12 |
| 178077 | 2024-10-07 15:00:00+00:00 | 4 | 265 | 1.728313e+12 |
| 178078 | 2024-10-07 16:00:00+00:00 | 4 | 265 | 1.728317e+12 |


##### Connect to Feature Store/Feature Group

This is where we connect to the Hopsworks feature store and write the new time series data to our feature group.

In [20]:
import hopsworks

# connect to project 
project = hopsworks.login(
    project = config.HOPSWORKS_PROJECT_NAME,
    api_key_value = config.HOPSWORKS_API_KEY
)

# get feature store
feature_store = project.get_feature_store()

# connect to feature group
feature_group = feature_store.get_or_create_feature_group(
    name=config.FEATURE_GROUP_NAME,
    version=config.FEATURE_GROUP_VERSION,
    description='Hourly time series data',
    primary_key=['pickup_location_id', 'pickup_ts'],
    event_time='pickup_ts'
)


Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1049751
Connected. Call `.close()` to terminate connection gracefully.


In [21]:
ts_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178080 entries, 0 to 178079
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype              
---  ------              --------------   -----              
 0   pickup_hour         178080 non-null  datetime64[ns, UTC]
 1   rides               178080 non-null  int64              
 2   pickup_location_id  178080 non-null  int64              
 3   pickup_ts           178080 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(3)
memory usage: 5.4 MB

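Because the feature group is keyed on `pickup_location_id` and `pickup_ts`, it can be worth confirming that no two rows share a key before inserting. The check below is an optional addition, not part of the original notebook; the helper name `find_duplicate_keys` and the sample rows are illustrative.

```python
import pandas as pd

PRIMARY_KEY = ['pickup_location_id', 'pickup_ts']


def find_duplicate_keys(df: pd.DataFrame, key: list) -> pd.DataFrame:
    """Return every row whose primary key appears more than once."""
    return df[df.duplicated(subset=key, keep=False)]


sample = pd.DataFrame({
    'pickup_location_id': [1, 1, 2, 2],
    'pickup_ts': [1725904800000, 1725908400000, 1725904800000, 1725904800000],
    'rides': [1, 1, 0, 3],
})

dupes = find_duplicate_keys(sample, PRIMARY_KEY)
print(len(dupes))  # 2, the two rows for location 2 share the same pickup_ts
```

In the notebook the same call would be `find_duplicate_keys(ts_data, PRIMARY_KEY)`, and an empty result means every (location, hour) pair is unique.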

In [22]:
# insert this data to the feature group
feature_group.insert(ts_data, write_options={'wait_for_job':False})

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1049751/fs/1041478/fg/1257807


Uploading Dataframe: 0.00% |          | Rows 0/178080 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: time_series_hourly_feature_group_4_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/1049751/jobs/named/time_series_hourly_feature_group_4_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x2094e1e0ee0>, None)

In [23]:
from src.plot import plot_one_sample, plot_ts

plot_ts(ts_data, [43])




#### Automate Running Every Hour

We want this notebook to run automatically every hour, and we will do this using GitHub Actions. This is done by creating a **.github/workflows** folder in the parent (root) directory of the repository. In this folder we create a YAML file called **feature_workflows.yaml**. View the file to see the syntax. Help with the syntax can be found at the following links:

+ https://docs.github.com/en/actions/use-cases-and-examples/creating-an-example-workflow

+ https://spacelift.io/blog/github-actions-tutorial

+ https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions

+ https://docs.github.com/en/actions/writing-workflows/quickstart

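For orientation, a workflow of this kind could look roughly like the sketch below. This is not the project's actual **feature_workflows.yaml**: the notebook path, the `requirements.txt` file, the Python version and the secret name are all assumptions for illustration. Scheduled workflows use UTC, and GitHub may start them later than the scheduled minute when the service is under heavy load.

```yaml
name: hourly-feature-pipeline

on:
  schedule:
    - cron: '0 * * * *'    # at minute 0 of every hour (UTC)
  workflow_dispatch:        # also allow manual runs from the Actions tab

jobs:
  feature_pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'    # assumed version

      - name: Install dependencies
        run: pip install -r requirements.txt    # assumed dependency file

      - name: Run the feature pipeline notebook
        env:
          # assumed secret name, defined in the repository settings
          HOPSWORKS_API_KEY: ${{ secrets.HOPSWORKS_API_KEY }}
        # hypothetical notebook path
        run: jupyter nbconvert --to notebook --execute notebooks/feature_pipeline.ipynb
```

Passing the API key through a repository secret keeps it out of the code, which fits the idea of reading it through **config.py** rather than hard-coding it.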