## Create Feature Pipeline

The first step is to load the configuration file we created, **config.py**, so that we can access the Hopsworks variables: the project name, feature group name, feature group version, and API key. Keeping them in one place means we do not have to redefine them every time we need them.

In [1]:
import src.config as config
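For orientation, a configuration module of this kind might look roughly like the sketch below. This is an assumption, not the actual contents of **config.py**: the variable names are the ones this notebook reads from `config`, while the values and the use of environment variables are placeholders chosen for illustration.

```python
import os

# Hypothetical sketch of src/config.py: names match what the notebook uses,
# values are placeholders and must be replaced with your own.
HOPSWORKS_PROJECT_NAME = os.environ.get('HOPSWORKS_PROJECT_NAME', 'my_project')

# Read the API key from the environment so it is never committed to the repository
HOPSWORKS_API_KEY = os.environ.get('HOPSWORKS_API_KEY')  # None if the variable is not set

FEATURE_GROUP_NAME = 'time_series_hourly_feature_group'  # placeholder name
FEATURE_GROUP_VERSION = 2                                # placeholder version
```

Reading the key from the environment also makes it easier to supply it as a secret when the notebook runs unattended.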

Since this notebook fetches data every hour, we need to keep track of the current time and the range of dates that we want to fetch. Each hourly run fetches the last 28x3 (84) days. We use 28x3 days because the most recent available data is more than two months old, so the window has to be long enough to cover that gap.

In [2]:
from datetime import datetime, timedelta, timezone

import pandas as pd

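# current UTC time, rounded down to the start of the hour
# ('h' is used because the uppercase 'H' alias is deprecated in recent pandas versions)
current_date = pd.to_datetime(datetime.now(timezone.utc)).floor('h')
# drop the timezone information so that the timestamp is naive
current_date = current_date.replace(tzinfo=None)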
print(f'{current_date=}')


# fetch data from 28*3 = 84 days ago up to the current time (rounded down to the hour)
fetch_data_to = current_date
# the window must cover the gap with no data from July until today - 09/26
fetch_data_from = current_date - timedelta(days = 28*3)
print(f'{fetch_data_from=}')

current_date=Timestamp('2024-10-02 15:00:00')
fetch_data_from=Timestamp('2024-07-10 15:00:00')


In [3]:
fetch_data_from

Timestamp('2024-07-10 15:00:00')
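The window arithmetic can be checked in isolation. The sketch below uses a fixed timestamp instead of `datetime.now()` so that the result is reproducible; the minutes and seconds in `now` are made up, and the variable names are illustrative only.

```python
from datetime import timedelta

import pandas as pd

# Fixed example time (hypothetical minutes/seconds) in place of datetime.now()
now = pd.Timestamp('2024-10-02 15:37:12')

# Round down to the start of the hour, then step back 28*3 = 84 days
window_end = now.floor('h')
window_start = window_end - timedelta(days=28 * 3)

# Number of hourly slots in the half-open window [window_start, window_end)
n_hours = int((window_end - window_start) / pd.Timedelta(hours=1))

print(f'{window_start=}, {window_end=}, {n_hours=}')
```

With these inputs the window starts on 2024-07-10 15:00:00 and spans 84 x 24 = 2016 hours, which matches the `fetch_data_from` value shown above.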

For this model, we do not have access to the NYC Taxi Data Warehouse, so we cannot fetch the new rides every hour. Instead, we simulate that call to show how it would work. Since we do have the rides from a year ago, we pretend that they are the most recent demand we can fetch. The function below loads the data for the requested 28x3-day window shifted back by exactly one year (365 days), and then moves the timestamps forward by a year. The result is our simulated new data: we pretend that we run this code every hour to get the newest 28x3 days' worth of rides, but since our source does not update every hour, we use this synthetic data.

In [32]:
from src.data import load_raw_data

def fetch_batch_raw_data(from_date: datetime, to_date: datetime) -> pd.DataFrame:
    """
    Simulate production data by replaying historical rides from one year (365 days)
    earlier, shifted forward so that they fall in the window [from_date, to_date).
    """
    # Shift the requested window back by one year
    from_date_ = from_date - timedelta(days=365)
    to_date_ = to_date - timedelta(days=365)
    print(f'{from_date_=}, {to_date_=}')

    # Load every calendar month that overlaps the shifted window.
    # Iterating over monthly periods avoids skipping or repeating a month,
    # which stepping in fixed 30-day increments can do.
    monthly_data = []
    for period in pd.period_range(from_date_, to_date_, freq='M'):
        month_data = load_raw_data(year=period.year, months=period.month)
        monthly_data.append(month_data)

    # Concatenate all the monthly DataFrames
    rides = pd.concat(monthly_data)

    # Keep only the rides inside the shifted window
    in_window = (rides['pickup_datetime'] >= from_date_) & (rides['pickup_datetime'] < to_date_)
    rides = rides[in_window].copy()

    # Shift the data to pretend this is recent data - add a year to each row
    rides['pickup_datetime'] += timedelta(days=365)

    rides.sort_values(by=['pickup_location_id', 'pickup_datetime'], inplace=True)

    return rides



In [33]:
# get rides
rides = fetch_batch_raw_data(from_date=fetch_data_from, to_date=fetch_data_to)

from_date_=Timestamp('2023-07-11 15:00:00'), to_date_=Timestamp('2023-10-03 15:00:00')
File 2023-07 was already in local storage
File 2023-08 was already in local storage
File 2023-09 was already in local storage
File 2023-10 was already in local storage
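The printed dates are one calendar day later than the requested window (July 11 rather than July 10) because the shift is a fixed 365 days and the year in between contains 29 February 2024. The short sketch below reproduces the shifted window and lists the calendar months it touches; using `pd.period_range` for the month list is one possible approach, shown here for illustration.

```python
from datetime import timedelta

import pandas as pd

# The window requested in this run (values taken from the output above)
from_date = pd.Timestamp('2024-07-10 15:00:00')
to_date = pd.Timestamp('2024-10-02 15:00:00')

# A fixed 365-day shift: the span back to the same calendar date is 366 days
# here because it contains 29 February 2024, so we land one day later
from_date_ = from_date - timedelta(days=365)
to_date_ = to_date - timedelta(days=365)

# Calendar months (year, month) that overlap the shifted window
months = [(p.year, p.month) for p in pd.period_range(from_date_, to_date_, freq='M')]

print(f'{from_date_=}, {to_date_=}')
print(months)
```

The four (year, month) pairs correspond to the four files reported as already being in local storage.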


In [34]:
rides

Unnamed: 0,pickup_datetime,pickup_location_id
838318,2024-07-10 16:55:05,1
845283,2024-07-10 17:00:50,1
878579,2024-07-10 23:08:28,1
887385,2024-07-11 06:36:53,1
895335,2024-07-11 08:44:43,1
...,...,...
3377697,2024-10-02 13:10:55,265
3377760,2024-10-02 13:10:59,265
3377865,2024-10-02 14:10:13,265
3377871,2024-10-02 14:10:18,265


In [35]:
rides[rides['pickup_datetime'] > '2024-08-22']

Unnamed: 0,pickup_datetime,pickup_location_id
1966252,2024-08-22 08:12:17,1
1984767,2024-08-22 12:21:34,1
2002619,2024-08-22 15:08:40,1
2002620,2024-08-22 15:10:04,1
2008549,2024-08-22 16:22:31,1
...,...,...
3377697,2024-10-02 13:10:55,265
3377760,2024-10-02 13:10:59,265
3377865,2024-10-02 14:10:13,265
3377871,2024-10-02 14:10:18,265


Above we can get a sense of what happened. We called the data warehouse to get 28x3 (84) days' worth of data up to today. Since we didn't have that data, we used last year's data and shifted it forward by a year. This is why the **rides** dataframe starts on 07-10-2024 (84 days ago) and ends today. All we did was retrieve 84 days' worth of data, but since we didn't have it, we pretended to have it by creating it. Don't overthink this step; it's just getting 84 days' worth of *new* data. In practice we would have the real data available to us.

At the time of writing this code (09.25.2024), the newest data available is from July.

##### Transform the Raw Data into Time Series Data

We now have the raw data and want to transform it into time series data just as we did before. This will be done with the **transform_raw_data_into_ts_data** function from **src.data**.

The older data that we had is already transformed into time series data and loaded into our feature group. This is the new data that we now have to transform.

In [36]:
from src.data import transform_raw_data_into_ts_data

ts_data = transform_raw_data_into_ts_data(rides)

See https://numpy.org/devdocs/release/1.25.0-notes.html and the docs for more information.  (Deprecated NumPy 1.25)



100%|██████████| 265/265 [00:01<00:00, 153.03it/s]
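To make the transformation concrete, here is a minimal sketch of how raw rides could be turned into hourly counts per location. It is a hypothetical stand-in, not the actual `transform_raw_data_into_ts_data` from **src.data**: the function name `to_hourly_counts` and the zero-filling strategy are assumptions, and only the column names are taken from the outputs in this notebook.

```python
import pandas as pd


def to_hourly_counts(rides: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: count rides per pickup location and hour."""
    rides = rides.copy()
    # Round each pickup time down to the start of its hour
    rides['pickup_hour'] = rides['pickup_datetime'].dt.floor('h')

    # Count rides per (location, hour)
    counts = (
        rides.groupby(['pickup_location_id', 'pickup_hour'])
        .size()
        .rename('rides')
    )

    # Build the full grid of locations x hours so that hours with no rides get a 0
    hours = pd.date_range(rides['pickup_hour'].min(), rides['pickup_hour'].max(), freq='h')
    locations = sorted(rides['pickup_location_id'].unique())
    full_index = pd.MultiIndex.from_product(
        [locations, hours], names=['pickup_location_id', 'pickup_hour']
    )
    ts = counts.reindex(full_index, fill_value=0).reset_index()
    return ts[['pickup_hour', 'rides', 'pickup_location_id']]


# Tiny synthetic example: three rides at location 1, one ride at location 2
example = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2024-07-10 16:55:05', '2024-07-10 17:00:50',
        '2024-07-10 19:08:28', '2024-07-10 17:30:00',
    ]),
    'pickup_location_id': [1, 1, 1, 2],
})
ts_example = to_hourly_counts(example)
print(ts_example)
```

Filling the empty hours with zeros matters: without it, quiet hours would simply be missing from the series instead of showing zero demand.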


In [37]:
# string to datetime
ts_data['pickup_hour'] = pd.to_datetime(ts_data['pickup_hour'], utc=True)

# add column with Unix epoch milliseconds
ts_data['pickup_ts'] = ts_data['pickup_hour'].apply(lambda x: x.timestamp() * 1000)
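The `apply` call above computes the timestamp row by row and yields floats. As a possible alternative, the same values can be computed in one vectorized step as integers. The sketch below compares both on two sample hours; whether an integer column fits an existing feature group depends on that group's schema, which is not verified here.

```python
import pandas as pd

# Two sample hours taken from the table above
hours = pd.Series(pd.to_datetime(['2024-07-10 15:00:00', '2024-10-02 13:00:00'], utc=True))

# Element-wise version, as in the cell above (float64 result)
ts_apply = hours.apply(lambda x: x.timestamp() * 1000)

# Vectorized alternative: whole milliseconds since the Unix epoch (int64 result)
epoch = pd.Timestamp('1970-01-01', tz='UTC')
ts_vectorized = (hours - epoch) // pd.Timedelta(milliseconds=1)

print(ts_apply.dtype, ts_vectorized.dtype)
print(ts_vectorized.tolist())
```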

In [38]:
cp = ts_data[ts_data['pickup_location_id'] == 43]

In [39]:
cp[cp['pickup_hour'] > '2024-09-05']

Unnamed: 0,pickup_hour,rides,pickup_location_id,pickup_ts
86026,2024-09-05 01:00:00+00:00,0,43,1.725498e+12
86027,2024-09-05 02:00:00+00:00,4,43,1.725502e+12
86028,2024-09-05 03:00:00+00:00,0,43,1.725505e+12
86029,2024-09-05 04:00:00+00:00,0,43,1.725509e+12
86030,2024-09-05 05:00:00+00:00,8,43,1.725512e+12
...,...,...,...,...
86683,2024-10-02 10:00:00+00:00,90,43,1.727863e+12
86684,2024-10-02 11:00:00+00:00,122,43,1.727867e+12
86685,2024-10-02 12:00:00+00:00,144,43,1.727870e+12
86686,2024-10-02 13:00:00+00:00,134,43,1.727874e+12


In [40]:
# use an unambiguous ISO date (YYYY-MM-DD) for the comparison
ts_data[ts_data['pickup_hour'] > '2024-09-07']

Unnamed: 0,pickup_hour,rides,pickup_location_id,pickup_ts
1402,2024-09-07 01:00:00+00:00,0,1,1.725671e+12
1403,2024-09-07 02:00:00+00:00,0,1,1.725674e+12
1404,2024-09-07 03:00:00+00:00,0,1,1.725678e+12
1405,2024-09-07 04:00:00+00:00,0,1,1.725682e+12
1406,2024-09-07 05:00:00+00:00,3,1,1.725685e+12
...,...,...,...,...
534235,2024-10-02 10:00:00+00:00,6,265,1.727863e+12
534236,2024-10-02 11:00:00+00:00,3,265,1.727867e+12
534237,2024-10-02 12:00:00+00:00,6,265,1.727870e+12
534238,2024-10-02 13:00:00+00:00,7,265,1.727874e+12


In [41]:
ts_data['pickup_hour'].dtype

datetime64[ns, UTC]

In [42]:
ts_data

Unnamed: 0,pickup_hour,rides,pickup_location_id,pickup_ts
0,2024-07-10 15:00:00+00:00,0,1,1.720624e+12
1,2024-07-10 16:00:00+00:00,1,1,1.720627e+12
2,2024-07-10 17:00:00+00:00,1,1,1.720631e+12
3,2024-07-10 18:00:00+00:00,0,1,1.720634e+12
4,2024-07-10 19:00:00+00:00,0,1,1.720638e+12
...,...,...,...,...
534235,2024-10-02 10:00:00+00:00,6,265,1.727863e+12
534236,2024-10-02 11:00:00+00:00,3,265,1.727867e+12
534237,2024-10-02 12:00:00+00:00,6,265,1.727870e+12
534238,2024-10-02 13:00:00+00:00,7,265,1.727874e+12


##### Connect to Feature Store/Feature Group

This is where we connect to the feature store and write this new data to our feature group.

In [43]:
import hopsworks

# connect to project 
project = hopsworks.login(
    project = config.HOPSWORKS_PROJECT_NAME,
    api_key_value = config.HOPSWORKS_API_KEY
)

# get feature store
feature_store = project.get_feature_store()

# connect to feature group
feature_group = feature_store.get_or_create_feature_group(
    name=config.FEATURE_GROUP_NAME,
    version=config.FEATURE_GROUP_VERSION,
    description='Hourly time series data',
    primary_key=['pickup_location_id', 'pickup_ts'],
    event_time='pickup_ts'
)


Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1049751
Connected. Call `.close()` to terminate connection gracefully.


In [44]:
ts_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534240 entries, 0 to 534239
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype              
---  ------              --------------   -----              
 0   pickup_hour         534240 non-null  datetime64[ns, UTC]
 1   rides               534240 non-null  int64              
 2   pickup_location_id  534240 non-null  int64              
 3   pickup_ts           534240 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(1), int64(2)
memory usage: 16.3 MB
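The row count is what we would expect: 265 locations x 84 days x 24 hours = 534,240 rows. Before writing to the feature group, it can be worth asserting this kind of invariant. The helper below is a hypothetical sketch (it is not part of **src**), demonstrated on a tiny synthetic frame so that it runs on its own.

```python
import pandas as pd


def check_ts_data(ts_data: pd.DataFrame, n_locations: int, n_hours: int) -> None:
    """Hypothetical pre-insert checks; raises AssertionError if one fails."""
    # Exactly one row per location and hour
    assert len(ts_data) == n_locations * n_hours, 'unexpected number of rows'
    # The primary key (pickup_location_id, pickup_ts) must be unique
    assert not ts_data.duplicated(['pickup_location_id', 'pickup_ts']).any(), 'duplicate keys'
    # No missing values in any column
    assert not ts_data.isna().any().any(), 'missing values found'


# Tiny synthetic frame: 2 locations x 3 hours
hours = pd.date_range('2024-10-02 11:00', periods=3, freq='h', tz='UTC')
demo = pd.DataFrame({
    'pickup_hour': list(hours) * 2,
    'rides': [6, 3, 7, 0, 1, 2],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
})
demo['pickup_ts'] = demo['pickup_hour'].apply(lambda x: x.timestamp() * 1000)

check_ts_data(demo, n_locations=2, n_hours=3)  # passes silently
```

For the real data the call would be `check_ts_data(ts_data, n_locations=265, n_hours=84 * 24)`.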


In [45]:
# insert this data to the feature group
feature_group.insert(ts_data, write_options={'wait_for_job':False})

Uploading Dataframe: 0.00% |          | Rows 0/534240 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: time_series_hourly_feature_group_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/1049751/jobs/named/time_series_hourly_feature_group_2_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x1fc398e52d0>, None)

In [47]:
from src.plot import plot_ts

plot_ts(ts_data, [43])


np.find_common_type is deprecated.  Please use `np.result_type` or `np.promote_types`.
See https://numpy.org/devdocs/release/1.25.0-notes.html and the docs for more information.  (Deprecated NumPy 1.25)




##### Automate Running Every Hour

We want this notebook to run automatically every hour. We will do this using GitHub Actions. This is done by creating a **.github/workflows** folder in the parent directory (the root of the repository). In this folder we create a YAML file called **feature_workflows.yaml**. View the file to see the syntax. Help with the syntax can be found at the following links:

+ https://docs.github.com/en/actions/use-cases-and-examples/creating-an-example-workflow

+ https://spacelift.io/blog/github-actions-tutorial

+ https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions

+ https://docs.github.com/en/actions/writing-workflows/quickstart
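As a rough illustration, an hourly workflow might look something like the sketch below. This is not the actual **feature_workflows.yaml**. The workflow and job names, the Python version, the `requirements.txt` file, the notebook path, and the secret name are all assumptions that would need to be adapted to the repository.

```yaml
name: hourly-feature-pipeline

on:
  schedule:
    - cron: '0 * * * *'    # at minute 0 of every hour (UTC)
  workflow_dispatch:        # also allow manual runs from the Actions tab

jobs:
  feature_pipeline:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'    # assumed version

      - name: Install dependencies
        run: pip install -r requirements.txt    # assumed dependency file

      - name: Run the feature pipeline notebook
        env:
          # assumed secret name, defined in the repository settings
          HOPSWORKS_API_KEY: ${{ secrets.HOPSWORKS_API_KEY }}
        # assumed notebook path
        run: jupyter nbconvert --to notebook --execute notebooks/feature_pipeline.ipynb
```

Scheduled workflows run on UTC time and can start a few minutes late when GitHub is under load, so the pipeline should not depend on running at exactly the top of the hour.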
