## Feature Store
The Feature Store provides an API that lets clients write features to feature groups and read features from feature views. Reads go through one of two interfaces: a low-latency Online API that retrieves precomputed features for operational models, or a high-throughput, latency-insensitive Offline API used to create training data and to retrieve batch data for scoring.

<img src="./architecture.svg"></img>


### Feature Group
A feature group is a table of features. Each feature group has a primary key and, optionally, an event_time column (indicating when the features in that row were observed) and a partition key.
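
As a rough illustration of how a primary key and an event_time column work together, the plain-Python sketch below keeps only the most recently observed row for each key. The row layout and the helper name `latest_per_key` are invented for this example and are not part of the Hopsworks API.

```python
from datetime import datetime

def latest_per_key(rows, primary_key, event_time):
    """Return the most recently observed row for each primary-key value."""
    latest = {}
    for row in rows:
        key = row[primary_key]
        # Keep the row with the largest event time seen so far for this key
        if key not in latest or row[event_time] > latest[key][event_time]:
            latest[key] = row
    return latest

# Hypothetical rows: two observations for card 1, one for card 2
rows = [
    {"cc_num": 1, "datetime": datetime(2022, 1, 1, 0, 0), "amount": 10.0},
    {"cc_num": 1, "datetime": datetime(2022, 1, 2, 0, 0), "amount": 25.0},
    {"cc_num": 2, "datetime": datetime(2022, 1, 1, 12, 0), "amount": 7.5},
]
latest_rows = latest_per_key(rows, primary_key="cc_num", event_time="datetime")
print(latest_rows[1]["amount"])
```

Without an event_time column, there would be no way to tell which of the two rows for card 1 is the newer observation.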

### Hands-On with the Hopsworks Feature Store

In [1]:
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/514196


In [3]:
import joblib
import os
import time

import pandas as pd
import numpy as np
from matplotlib import pyplot
import seaborn as sns
from math import radians

import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Specify the window length as "4h"
window_len = "4h"

# Specify the URL for the data
url = "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/"

In [5]:
# Read the 'credit_cards.csv' file
credit_cards_df = pd.read_csv(url + "credit_cards.csv")

# Read the 'profiles.csv' file
# Parse the 'birthdate' column as dates
profiles_df = pd.read_csv(url + "profiles.csv", parse_dates=["birthdate"])

# Read the 'transactions.csv' file
# Parse the 'datetime' column as dates
trans_df = pd.read_csv(url + "transactions.csv", parse_dates=["datetime"])

In [6]:
credit_cards_df.head()

Unnamed: 0,cc_num,provider,expires
0,4796807885357879,visa,05/23
1,4529266636192966,visa,03/22
2,4922690008243953,visa,02/27
3,4897369589533543,visa,04/22
4,4848518335893425,visa,10/26


In [7]:
profiles_df.head()

Unnamed: 0,name,sex,mail,birthdate,City,Country,cc_num
0,Catherine Zimmerman,F,valenciajason@hotmail.com,1988-09-20,Bryn Mawr-Skyway,US,4796807885357879
1,Michael Williams,M,brettkennedy@yahoo.com,1977-03-01,Gates-North Gates,US,4529266636192966
2,Jessica Krueger,F,marthacruz@hotmail.com,1947-09-10,Greenfield,US,4922690008243953
3,Ruth Harris,F,james11@yahoo.com,1983-12-27,New City,US,4897369589533543
4,Paul Ashley,M,matthew97@hotmail.com,1974-11-10,Peabody,US,4848518335893425


In [8]:
trans_df.head()

Unnamed: 0,tid,datetime,cc_num,category,amount,latitude,longitude,city,country,fraud_label
0,11df919988c134d97bbff2678eb68e22,2022-01-01 00:00:24,4473593503484549,Health/Beauty,62.95,42.30865,-83.48216,Canton,US,0
1,dd0b2d6d4266ccd3bf05bc2ea91cf180,2022-01-01 00:00:56,4272465718946864,Grocery,85.45,33.52253,-117.70755,Laguna Niguel,US,0
2,e627f5d9a9739833bd52d2da51761fc3,2022-01-01 00:02:32,4104216579248948,Domestic Transport,21.63,37.60876,-77.37331,Mechanicsville,US,0
3,6fb3e6beafbb92b8e15827037f603c52,2022-01-01 00:03:24,4814447237003448,Health/Beauty,54.71,43.54072,-116.56346,Nampa,US,0
4,be0b8acc57bfe126a5a392fd99e6ddd1,2022-01-01 00:03:55,4515188652242507,Grocery,59.22,40.24537,-75.64963,Pottstown,US,0


### Feature Engineering

To make fraud patterns easier for the model to learn, we will create additional features from the raw data. In particular, we will create two types of features:

- Features that aggregate data from different data sources. This could for instance be the age of a customer at the time of a transaction, which combines the birthdate feature from profiles.csv with the datetime feature from transactions.csv.
- Features that aggregate data from multiple time steps. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.
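
The first kind of feature can be sketched for a single transaction with the standard library alone. This is a minimal illustration, assuming a fixed 365-day year as in the notebook code; the function name `age_at_transaction` and the sample dates are chosen for this example.

```python
from datetime import datetime, timedelta

def age_at_transaction(birthdate, transaction_time):
    """Approximate age in years, using a fixed 365-day year (leap days ignored)."""
    # Dividing one timedelta by another yields a plain float
    return (transaction_time - birthdate) / timedelta(days=365)

# Birthdate comes from one source (profiles), the timestamp from another (transactions)
birthdate = datetime(1988, 9, 20)
transaction_time = datetime(2022, 1, 1, 0, 0, 24)

age = age_at_transaction(birthdate, transaction_time)
print(round(age, 2))
```

Because leap days are ignored, the result drifts slightly upward from the calendar age, which is usually acceptable for a model input.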

In [9]:
# Merge the 'trans_df' DataFrame with the 'profiles_df' DataFrame based on the 'cc_num' column
age_df = trans_df.merge(profiles_df, on="cc_num", how="left")

# Compute the age at the time of each transaction and store it in the 'age_at_transaction' column
trans_df["age_at_transaction"] = (age_df["datetime"] - age_df["birthdate"]) / np.timedelta64(365, "D")

# Merge the 'trans_df' DataFrame with the 'credit_cards_df' DataFrame based on the 'cc_num' column
card_expiry_df = trans_df.merge(credit_cards_df, on="cc_num", how="left")

# Convert the 'expires' column to datetime format
card_expiry_df["expires"] = pd.to_datetime(card_expiry_df["expires"], format="%m/%y")

# Compute the days until the card expires and store it in the 'days_until_card_expires' column
trans_df["days_until_card_expires"] = (card_expiry_df["expires"] - card_expiry_df["datetime"]) / np.timedelta64(1, "D")

In [10]:
# Sort the 'trans_df' DataFrame based on the 'datetime' column in ascending order
trans_df.sort_values("datetime", inplace=True)

# Convert the 'longitude' and 'latitude' columns to radians
# (vectorized with NumPy; DataFrame.applymap is deprecated in recent pandas versions)
trans_df[["longitude", "latitude"]] = np.radians(trans_df[["longitude", "latitude"]])

# Define a function to compute Haversine distance between consecutive coordinates
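def haversine(long, lat):
    """Compute the Haversine central angle (in radians) between each pair of consecutive coordinates in (long, lat).

    Inputs must already be in radians. Multiply the result by a sphere radius to get a distance.
    """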

    # Shift the longitude and latitude columns to get consecutive values
    long_shifted = long.shift()
    lat_shifted = lat.shift()

    # Calculate the differences in longitude and latitude
    long_diff = long_shifted - long
    lat_diff = lat_shifted - lat

    # Haversine formula to compute distance
    a = np.sin(lat_diff/2.0)**2
    b = np.cos(lat) * np.cos(lat_shifted) * np.sin(long_diff/2.0)**2
    c = 2*np.arcsin(np.sqrt(a + b))

    return c

# Apply the haversine function to compute the 'loc_delta' column
trans_df["loc_delta"] = trans_df.groupby("cc_num")\
    .apply(lambda x : haversine(x["longitude"], x["latitude"]))\
    .reset_index(level=0, drop=True)\
    .fillna(0)
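
The `loc_delta` values above are central angles in radians, that is, distances on a unit sphere. If a distance in kilometres is wanted, the angle can be scaled by the Earth's radius. The sketch below is a scalar, standard-library version of the same formula; the constant `EARTH_RADIUS_KM` (mean radius, about 6371 km) and the function names are assumptions of this example, not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption of this sketch)

def haversine_angle(lon1, lat1, lon2, lat2):
    """Central angle in radians between two points whose coordinates are in radians."""
    a = math.sin((lat2 - lat1) / 2.0) ** 2
    b = math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2
    return 2 * math.asin(math.sqrt(a + b))

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres on a sphere of radius EARTH_RADIUS_KM."""
    return EARTH_RADIUS_KM * haversine_angle(lon1, lat1, lon2, lat2)

# A quarter of the equator: from (0° E, 0° N) to (90° E, 0° N)
angle = haversine_angle(0.0, 0.0, math.radians(90), 0.0)
distance_km = haversine_km(0.0, 0.0, math.radians(90), 0.0)
print(angle, distance_km)
```

For a tree-based model the unit makes no difference, since scaling by a constant does not change the ordering of values; kilometres are mainly easier to read.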

In [11]:
trans_df.head()

Unnamed: 0,tid,datetime,cc_num,category,amount,latitude,longitude,city,country,fraud_label,age_at_transaction,days_until_card_expires,loc_delta
0,11df919988c134d97bbff2678eb68e22,2022-01-01 00:00:24,4473593503484549,Health/Beauty,62.95,0.738425,-1.457039,Canton,US,0,97.578083,1460.999722,0.0
1,dd0b2d6d4266ccd3bf05bc2ea91cf180,2022-01-01 00:00:56,4272465718946864,Grocery,85.45,0.585079,-2.054384,Laguna Niguel,US,0,33.775344,1733.999352,0.0
2,e627f5d9a9739833bd52d2da51761fc3,2022-01-01 00:02:32,4104216579248948,Domestic Transport,21.63,0.656397,-1.350419,Mechanicsville,US,0,80.953429,242.998241,0.0
3,6fb3e6beafbb92b8e15827037f603c52,2022-01-01 00:03:24,4814447237003448,Health/Beauty,54.71,0.759929,-2.034416,Nampa,US,0,53.56165,150.997639,0.0
4,be0b8acc57bfe126a5a392fd99e6ddd1,2022-01-01 00:03:55,4515188652242507,Grocery,59.22,0.702414,-1.320335,Pottstown,US,0,46.035624,515.99728,0.0


In [12]:
# Define a rolling window groupby on 'cc_num' with a specified window length on the 'datetime' column
cc_group = trans_df[["cc_num", "amount", "datetime"]].groupby("cc_num").rolling(
    window_len, 
    on="datetime",
)

# Moving average of transaction volume.
df_4h_mavg = pd.DataFrame(cc_group.mean())
df_4h_mavg.columns = ["trans_volume_mavg", "datetime"]
df_4h_mavg = df_4h_mavg.reset_index(level=["cc_num"])
df_4h_mavg = df_4h_mavg.drop(columns=["cc_num", "datetime"])
df_4h_mavg = df_4h_mavg.sort_index()

# Moving standard deviation of transaction volume.
df_4h_std = pd.DataFrame(cc_group.std())
df_4h_std.columns = ["trans_volume_mstd", "datetime"]
df_4h_std = df_4h_std.reset_index(level=["cc_num"])
df_4h_std = df_4h_std.drop(columns=["cc_num", "datetime"])
df_4h_std = df_4h_std.fillna(0)
df_4h_std = df_4h_std.sort_index()
window_aggs_df = df_4h_std.merge(df_4h_mavg, left_index=True, right_index=True)

# Moving transaction frequency.
df_4h_count = pd.DataFrame(cc_group.count())
df_4h_count.columns = ["trans_freq", "datetime"]
df_4h_count = df_4h_count.reset_index(level=["cc_num"])
df_4h_count = df_4h_count.drop(columns=["cc_num", "datetime"])
df_4h_count = df_4h_count.sort_index()
window_aggs_df = window_aggs_df.merge(df_4h_count, left_index=True, right_index=True)

# Moving average of location difference between consecutive transactions.
cc_group_loc_delta = trans_df[["cc_num", "loc_delta", "datetime"]].groupby("cc_num").rolling(window_len, on="datetime").mean()
df_4h_loc_delta_mavg = pd.DataFrame(cc_group_loc_delta)
df_4h_loc_delta_mavg.columns = ["loc_delta_mavg", "datetime"]
df_4h_loc_delta_mavg = df_4h_loc_delta_mavg.reset_index(level=["cc_num"])
df_4h_loc_delta_mavg = df_4h_loc_delta_mavg.drop(columns=["cc_num", "datetime"])
df_4h_loc_delta_mavg = df_4h_loc_delta_mavg.sort_index()
window_aggs_df = window_aggs_df.merge(df_4h_loc_delta_mavg, left_index=True, right_index=True)

# Merge 'trans_df' with selected columns for the final result
window_aggs_df = window_aggs_df.merge(
    trans_df[["cc_num", "datetime"]].sort_index(), 
    left_index=True, 
    right_index=True,
)
window_aggs_df.tail()

Unnamed: 0,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg,cc_num,datetime
106015,0.0,73.08,1.0,0.045635,4032019521897961,2022-03-24 10:57:02
106016,0.0,287.33,1.0,0.045846,4032019521897961,2022-03-28 11:57:02
106017,0.0,53.88,1.0,0.00012,4032019521897961,2022-04-01 12:57:02
106018,0.0,279.73,1.0,0.045928,4032019521897961,2022-04-05 13:57:02
106019,0.0,73.66,1.0,0.045974,4032019521897961,2022-04-09 14:57:02
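
To make the window semantics concrete, here is a plain-Python sketch of a time-based rolling count and mean for a single card. It assumes the window covers the interval (t - window, t], which is the default for offset-based windows in pandas; the function name `rolling_stats` and the sample events are invented for this example.

```python
from datetime import datetime, timedelta

def rolling_stats(events, window):
    """For each (timestamp, amount) event, sorted by time, return (count, mean)
    over the events whose timestamp falls in (t - window, t]."""
    results = []
    start = 0
    for end, (t, _) in enumerate(events):
        # Drop events that are at or before the left edge of the window
        while events[start][0] <= t - window:
            start += 1
        amounts = [amount for _, amount in events[start:end + 1]]
        results.append((len(amounts), sum(amounts) / len(amounts)))
    return results

# Hypothetical transactions for one card
events = [
    (datetime(2022, 1, 1, 0, 0), 10.0),
    (datetime(2022, 1, 1, 1, 0), 20.0),
    (datetime(2022, 1, 1, 3, 30), 30.0),
    (datetime(2022, 1, 1, 5, 0), 40.0),
]
stats = rolling_stats(events, timedelta(hours=4))
for (t, _), (count, mean) in zip(events, stats):
    print(t, count, mean)
```

The last event sees only two transactions: the one at 01:00 lies exactly four hours earlier and falls outside the left-open window. A card with a single transaction in the window has a count of 1 and an undefined standard deviation, which is why the notebook fills missing `trans_volume_mstd` values with 0.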


### Creating Feature Groups

In [13]:
# Get or create the 'transactions' feature group
trans_fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    description="Transaction data",
    primary_key=["cc_num"],
    event_time="datetime",
    online_enabled=True,
)

In [14]:
# Insert data into feature group
trans_fg.insert(trans_df)

Uploading Dataframe: 0.00% |          | Rows 0/106020 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: transactions_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/514196/jobs/named/transactions_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fa799a13450>, None)

In [15]:
# Update feature descriptions
feature_descriptions = [
    {"name": "tid", "description": "Transaction id"},
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "category", "description": "Expense category"},
    {"name": "amount", "description": "Dollar amount of the transaction"},
    {"name": "latitude", "description": "Transaction location latitude"},
    {"name": "longitude", "description": "Transaction location longitude"},
    {"name": "city", "description": "City in which the transaction was made"},
    {"name": "country", "description": "Country in which the transaction was made"},
    {"name": "fraud_label", "description": "Whether the transaction was fraudulent or not"},
    {"name": "age_at_transaction", "description": "Age of the card holder when the transaction was made"},
    {"name": "days_until_card_expires", "description": "Card validity days left when the transaction was made"},
    {"name": "loc_delta", "description": "Haversine distance between this transaction location and the previous transaction location from the same card"},
]

for desc in feature_descriptions: 
    trans_fg.update_feature_description(desc["name"], desc["description"])

In [19]:
# Get or create the feature group that holds the windowed aggregations (named after the window length)
window_aggs_fg = fs.get_or_create_feature_group(
    name=f"transactions_{window_len}_aggs",
    version=1,
    description=f"Aggregate transaction data over {window_len} windows.",
    primary_key=["cc_num"],
    event_time="datetime",
    online_enabled=True,
)

In [20]:
# Insert data into feature group
window_aggs_fg.insert(window_aggs_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/514196/fs/510019/fg/607592


Uploading Dataframe: 0.00% |          | Rows 0/106020 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: transactions_4h_aggs_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/514196/jobs/named/transactions_4h_aggs_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fa79c536550>, None)

In [21]:
# Update feature descriptions
feature_descriptions = [
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "loc_delta_mavg", "description": "Moving average of location difference between consecutive transactions from the same card"},
    {"name": "trans_freq", "description": "Number of transactions from the same card within the moving window"},
    {"name": "trans_volume_mavg", "description": "Moving average of transaction volume from the same card"},
    {"name": "trans_volume_mstd", "description": "Moving standard deviation of transaction volume from the same card"},
]

for desc in feature_descriptions: 
    window_aggs_fg.update_feature_description(desc["name"], desc["description"])

### Transformation Function

In [24]:
# Load transformation functions.
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
}
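
Conceptually, a label encoder replaces each distinct category string with an integer code. The sketch below shows the idea in plain Python. It is not the Hopsworks implementation, and assigning codes in sorted order is an assumption of this example; the built-in function may assign them differently.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

# Category values taken from the transactions preview
categories = ["Grocery", "Health/Beauty", "Grocery", "Domestic Transport"]

mapping = fit_label_encoder(categories)
encoded = [mapping[c] for c in categories]
print(mapping)
print(encoded)
```

The mapping has to be learned once and reused unchanged at prediction time, which is the reason for attaching the transformation to the feature view rather than applying it ad hoc.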

### Feature Selection

In [22]:
# Select features for training data
selected_features = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["cc_num"]))

### Feature View Creation

In [25]:
# Get or create the 'transactions_view' feature view
feature_view = fs.get_or_create_feature_view(
    name='transactions_view',
    version=1,
    query=selected_features,
    labels=["fraud_label"],
    transformation_functions=transformation_functions,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/514196/fs/510019/fv/transactions_view/version/1


### Offline Fetch Feature Store Data For Training

In [2]:
feature_view = fs.get_feature_view(
    name='transactions_view',
    version=1
)

In [4]:
feature_view.get_feature_vector({'cc_num': 4000323325541926})



[2,
 43.81,
 25.46375076103501,
 134.73097222222222,
 0.19325359537553227,
 26.799347006970155,
 24.86,
 2.0,
 0.13231378113449282,
 datetime.datetime(2022, 2, 16, 6, 27, 24)]

In [None]:
features_df, labels_df  = feature_view.training_data(
    description='Description of a dataset',
)

### Online Fetch Feature Store Data For Prediction

In [6]:
feature_vector = feature_view.get_feature_vector({"cc_num": 4855787436134696}, allow_missing=True)

In [7]:
feature_vector

[5,
 85.21,
 58.49763546423136,
 797.3630555555555,
 6.366227392422488e-05,
 0.0,
 85.21,
 1.0,
 6.366227392422488e-05,
 datetime.datetime(2022, 4, 25, 15, 17, 12)]
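
The returned vector is a plain list, so it helps to pair the values with feature names when inspecting it. The order used below is inferred from the feature selection (the `trans_fg` features without the `fraud_label` label, followed by the window aggregation columns) and should be treated as an assumption; the values are copied from the output above.

```python
from datetime import datetime

# Assumed feature order, inferred from the query behind the feature view
feature_names = [
    "category", "amount", "age_at_transaction", "days_until_card_expires",
    "loc_delta", "trans_volume_mstd", "trans_volume_mavg", "trans_freq",
    "loc_delta_mavg", "datetime",
]

# Values copied from the feature vector shown above
feature_vector = [
    5, 85.21, 58.49763546423136, 797.3630555555555, 6.366227392422488e-05,
    0.0, 85.21, 1.0, 6.366227392422488e-05,
    datetime(2022, 4, 25, 15, 17, 12),
]

named = dict(zip(feature_names, feature_vector))
for name, value in named.items():
    print(f"{name}: {value}")
```

Note that `category` appears as the integer 5 rather than a string, because the label encoder attached to the feature view has already been applied.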