To call the feature store API we need the project name and an API key. I use uppercase names for Hopsworks-related variables.

In [3]:
# project name for API call feature store
HOPSWORKS_PROJECT_NAME = 'taxi_demand_rs'

#### Loading the API Key from the .env File

The **dotenv** library loads variables from an external file into the process environment, so we can read them with the **os** module.

A **.env** file is a plain text file that stores API keys and other sensitive information. It lives in the project's parent directory (not in the notebooks or src folders). We keep the key there because hardcoding an API key in a notebook is a serious security risk.

In [4]:
import os
from dotenv import load_dotenv
from src.paths import PARENT_DIR

# specify the path where the file is
load_dotenv(PARENT_DIR / '.env')

HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
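
**os.environ[...]** raises a bare KeyError when the variable is missing, which can be confusing if the .env file was not found or not loaded. Below is a minimal sketch of a stricter lookup using only the standard library. The helper name **require_env** and the variable **DEMO_API_KEY** are hypothetical and not part of this project or of dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # covers both "variable not set" and "variable set to an empty string"
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Check that the .env file exists and was loaded."
        )
    return value


# demo with a placeholder variable (not a real key)
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

In this notebook the equivalent call would be require_env('HOPSWORKS_API_KEY'), placed after load_dotenv.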

##### Committing to Git

Never commit the API key to a GitHub repository. To prevent this, we list the **.env** file in a **.gitignore** file, which is located in the parent directory (not src, data, or notebooks).

#### Fetching Raw Data

Now we can use the **load_raw_data** function from the **src.data** module to load the raw data from 2022 through the current year.

In [5]:
from datetime import datetime
import pandas as pd
from src.data import load_raw_data

# use load_raw_data
# starting year of data fetching
start_year = 2022
# ending year will be the current year
end_year = datetime.now().year   
print(f'Downloading files from {start_year} to {end_year}.')

# collect one DataFrame per year, then concatenate once at the end
# (avoids re-copying the accumulated rows on every iteration)
yearly_rides = []

# loop to download all wanted data
for year in range(start_year, end_year + 1):
    # download data for the year
    yearly_rides.append(load_raw_data(year))

# combine all years into a single DataFrame
rides = pd.concat(yearly_rides)



Downloading files from 2022 to 2024.
File 2022-01 was already in local storage
File 2022-02 was already in local storage
File 2022-03 was already in local storage
File 2022-04 was already in local storage
File 2022-05 was already in local storage
File 2022-06 was already in local storage
File 2022-07 was already in local storage
File 2022-08 was already in local storage
File 2022-09 was already in local storage
File 2022-10 was already in local storage
File 2022-11 was already in local storage
File 2022-12 was already in local storage
File 2023-01 was already in local storage
File 2023-02 was already in local storage
File 2023-03 was already in local storage
File 2023-04 was already in local storage
File 2023-05 was already in local storage
File 2023-06 was already in local storage
File 2023-07 was already in local storage
File 2023-08 was already in local storage
File 2023-09 was already in local storage
File 2023-10 was already in local storage
File 2023-11 was already in local stora

In [6]:
print(len(rides))

101372930


In [7]:
rides

Unnamed: 0,pickup_datetime,pickup_location_id
0,2022-01-01 00:35:40,142
1,2022-01-01 00:33:43,236
2,2022-01-01 00:53:21,166
3,2022-01-01 00:25:21,114
4,2022-01-01 00:36:48,68
...,...,...
3076898,2024-07-31 23:12:00,243
3076899,2024-07-31 23:10:34,170
3076900,2024-07-31 23:32:00,197
3076901,2024-07-31 23:32:52,230


#### Transform the Data into Time Series Data

Next, the raw rides need to be transformed into time series data. We can use the **transform_raw_data_into_ts_data** function from the **data.py** module for this.

In [8]:
from src.data import transform_raw_data_into_ts_data

ts_data = transform_raw_data_into_ts_data(rides)

100%|██████████| 265/265 [00:06<00:00, 42.95it/s]


In [9]:
ts_data

Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,0,1
1,2022-01-01 01:00:00,0,1
2,2022-01-01 02:00:00,0,1
3,2022-01-01 03:00:00,0,1
4,2022-01-01 04:00:00,1,1
...,...,...,...
5997475,2024-07-31 19:00:00,2,265
5997476,2024-07-31 20:00:00,1,265
5997477,2024-07-31 21:00:00,4,265
5997478,2024-07-31 22:00:00,2,265
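
The result has one row per hour and location, including hours with zero rides. The internals of **transform_raw_data_into_ts_data** are not shown in this notebook, so the following is only a rough standard-library sketch of that kind of aggregation. It assumes that each pickup is floored to the hour, rides are counted per (hour, location), and missing hours are filled with zeros. The function name **count_rides_per_hour** is made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_rides_per_hour(pickups, start, end):
    """Count rides per (hour, location), filling hours without rides with 0.

    pickups: iterable of (pickup_datetime, pickup_location_id) tuples
    start, end: first and last hour (inclusive) of the output range
    """
    pickups = list(pickups)
    # floor each pickup time to the start of its hour and count occurrences
    counts = Counter(
        (ts.replace(minute=0, second=0, microsecond=0), loc) for ts, loc in pickups
    )
    locations = sorted({loc for _, loc in pickups})

    rows = []
    for loc in locations:
        hour = start
        while hour <= end:
            # Counter returns 0 for (hour, location) pairs that never occurred
            rows.append((hour, counts[(hour, loc)], loc))
            hour += timedelta(hours=1)
    return rows


pickups = [
    (datetime(2022, 1, 1, 0, 35, 40), 142),
    (datetime(2022, 1, 1, 0, 53, 21), 142),
    (datetime(2022, 1, 1, 2, 5, 0), 142),
]
for row in count_rides_per_hour(
    pickups, datetime(2022, 1, 1, 0), datetime(2022, 1, 1, 2)
):
    print(row)
```

Here the 01:00 hour has no pickups, so it appears in the output with a ride count of 0 rather than being dropped.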


The column **pickup_ts** is the pickup_hour expressed in Unix milliseconds. I added it because pickup_hour, being of type datetime, is not ideal as a primary key in the feature group. Instead, I use pickup_ts, which is a bigint, as a primary key together with pickup_location_id.

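You can store a datetime as a BIGINT by representing it as a Unix timestamp, the time elapsed since January 1, 1970 (UTC), conventionally counted in seconds. For a pandas datetime column with nanosecond resolution, .astype('int64') returns nanoseconds since that epoch, so .astype('int64') // 10**6 gives the millisecond representation.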
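
The units are easy to mix up, so here is a small standard-library sketch of the same conversion for a single value, including the round trip back to a datetime. The function names **to_unix_ms** and **from_unix_ms** are illustrative only, and the input is assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    # integer arithmetic on the timedelta avoids floating-point rounding
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_unix_ms(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


pickup_hour = datetime(2022, 1, 1, 0, 0, tzinfo=timezone.utc)
pickup_ts = to_unix_ms(pickup_hour)
print(pickup_ts)                # 1640995200000
print(from_unix_ms(pickup_ts))  # 2022-01-01 00:00:00+00:00
```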


In [21]:
# string to datetime
ts_data['pickup_hour'] = pd.to_datetime(ts_data['pickup_hour'], utc=True)

# add column with Unix epoch milliseconds
ts_data['pickup_ts'] = ts_data['pickup_hour'].apply(lambda x: int(pd.Timestamp(x).timestamp() * 1000))


In [22]:
ts_data

Unnamed: 0,pickup_hour,rides,pickup_location_id,pickup_ts
0,2022-01-01 00:00:00+00:00,0,1,1640995200000
1,2022-01-01 01:00:00+00:00,0,1,1640998800000
2,2022-01-01 02:00:00+00:00,0,1,1641002400000
3,2022-01-01 03:00:00+00:00,0,1,1641006000000
4,2022-01-01 04:00:00+00:00,1,1,1641009600000
...,...,...,...,...
5997475,2024-07-31 19:00:00+00:00,2,265,1722452400000
5997476,2024-07-31 20:00:00+00:00,1,265,1722456000000
5997477,2024-07-31 21:00:00+00:00,4,265,1722459600000
5997478,2024-07-31 22:00:00+00:00,2,265,1722463200000


In [23]:
import hopsworks

Now we connect to our project on Hopsworks. We pass **hopsworks.login()** the **HOPSWORKS_PROJECT_NAME** variable defined at the top of this notebook and the **HOPSWORKS_API_KEY** that we loaded with **load_dotenv**.

In [24]:
# connect to our project
project = hopsworks.login(
    project = HOPSWORKS_PROJECT_NAME,
    api_key_value= HOPSWORKS_API_KEY
)

Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1049751


Once connected to the project, we can get a handle to its feature store with **project.get_feature_store()**.

In [25]:
# get the project's feature store
feature_store = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


##### Save Features to the Feature Store

To save data to the feature store we use **feature groups**. A feature group is a table of features, ideally ones with predictive power for our model. Each group has a **primary key** and, optionally, an **event_time** column and a **partition key**, all of which are columns in our data. The partition key helps with queries that filter on a column such as a location id.

**Primary Key** - The primary key ensures each row in the data is unique. Our data has the columns pickup_hour, pickup_ts, pickup_location_id and rides, and a value in any single column can appear in many rows. For this reason, the primary key will be **['pickup_location_id', 'pickup_ts']**, where pickup_ts is pickup_hour as a bigint: each combination of location and hour appears exactly once.

**Partition Key** - This helps query the data. We can use either pickup_hour or pickup_location_id if we want to query based on hour or location. We can also use both for multidimensional partitioning. Since it is optional, I will not start out with a partition key.
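
Since the feature group relies on the composite primary key being unique, it can be worth checking this before inserting. Below is a minimal sketch with plain Python dictionaries. The helper **find_duplicate_keys** is hypothetical, and on a pandas DataFrame the **duplicated** method with its **subset** argument serves the same purpose.

```python
from collections import Counter


def find_duplicate_keys(rows, key_fields):
    """Return the composite-key values that appear in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"pickup_location_id": 1, "pickup_ts": 1640995200000, "rides": 0},
    {"pickup_location_id": 1, "pickup_ts": 1640998800000, "rides": 0},
    {"pickup_location_id": 2, "pickup_ts": 1640995200000, "rides": 3},
]

# each column repeats on its own, but the (location, timestamp) pair is unique
print(find_duplicate_keys(rows, ["pickup_location_id"]))
print(find_duplicate_keys(rows, ["pickup_location_id", "pickup_ts"]))
```

An empty list for the composite key means every row can be identified by its location and timestamp together.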

To use a feature group we need to set the **name** and **version**.

In [26]:
# feature group name and version
FEATURE_GROUP_NAME = 'time_series_hourly_feature_group'
FEATURE_GROUP_VERSION = 3

In [27]:
# create the feature group
feature_group = feature_store.get_or_create_feature_group(
    name = FEATURE_GROUP_NAME,
    version = FEATURE_GROUP_VERSION,
    description = 'Hourly time series data',
    primary_key = ['pickup_location_id', 'pickup_ts'],
    event_time = 'pickup_ts'
)

Now that the feature group is created we save our data to it. We use **write_options = {'wait_for_job':False}** so that the call returns without waiting for the materialization job to finish, and we can keep working as the data is inserted.

In [28]:
# add data to the feature group
feature_group.insert(ts_data, write_options={'wait_for_job':False})

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1049751/fs/1041478/fg/1241445


Uploading Dataframe: 0.00% |          | Rows 0/5997480 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: time_series_hourly_feature_group_3_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/1049751/jobs/named/time_series_hourly_feature_group_3_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x21e5b3a6470>, None)