In [1]:
import time

notebook_start_time = time.time()

# Set up environment

In [2]:
import sys
from pathlib import Path


def is_google_colab() -> bool:
    if "google.colab" in str(get_ipython()):
        return True
    return False


def clone_repository() -> None:
    !git clone https://github.com/decodingml/hands-on-recommender-system.git
    %cd hands-on-recommender-system/


def install_dependencies() -> None:
    !pip install --upgrade uv
    !uv pip install --all-extras --system --requirement pyproject.toml


if is_google_colab():
    clone_repository()
    install_dependencies()

    root_dir = str(Path().absolute())
    print("⛳️ Google Colab environment")
else:
    root_dir = str(Path().absolute().parent)
    print("⛳️ Local environment")

# Add the root directory to the `PYTHONPATH` to use the `recsys` Python module from the notebook.
if root_dir not in sys.path:
    print(f"Adding the following directory to the PYTHONPATH: {root_dir}")
    sys.path.append(root_dir)

⛳️ Local environment
Adding the following directory to the PYTHONPATH: /Users/pauliusztin/Documents/01_projects/hopsworks_recsys/hands-on-recommender-system


# 👩🏻‍🔬 Feature pipeline: Computing features

## Imports

In [3]:
%load_ext autoreload
%autoreload 2

import warnings
from pprint import pprint

import polars as pl
import torch
from loguru import logger
from sentence_transformers import SentenceTransformer

warnings.filterwarnings("ignore")

from recsys import hopsworks_integration
from recsys.config import settings
from recsys.features.articles import (
    compute_features_articles,
    generate_embeddings_for_dataframe,
)
from recsys.features.customers import DatasetSampler, compute_features_customers
from recsys.features.interaction import generate_interaction_data
from recsys.features.ranking import compute_ranking_dataset
from recsys.features.transactions import compute_features_transactions
from recsys.hopsworks_integration import feature_store
from recsys.raw_data_sources import h_and_m as h_and_m_raw_data



## Constants

These are the default settings used across the lessons. You can always override them in the `.env` file that sits at the root of the repository:

In [4]:
pprint(dict(settings))

{'CUSTOMER_DATA_SIZE': <CustomerDatasetSize.SMALL: 'SMALL'>,
 'CUSTOM_HOPSWORKS_INFERENCE_ENV': 'custom_env_name',
 'FEATURES_EMBEDDING_MODEL_ID': 'all-MiniLM-L6-v2',
 'HOPSWORKS_API_KEY': SecretStr('**********'),
 'OPENAI_API_KEY': SecretStr('**********'),
 'OPENAI_MODEL_ID': 'gpt-4o-mini',
 'RANKING_DATASET_VALIDATON_SPLIT_SIZE': 0.1,
 'RANKING_EARLY_STOPPING_ROUNDS': 5,
 'RANKING_ITERATIONS': 100,
 'RANKING_LEARNING_RATE': 0.2,
 'RANKING_MODEL_TYPE': 'ranking',
 'RANKING_SCALE_POS_WEIGHT': 10,
 'RECSYS_DIR': PosixPath('/Users/pauliusztin/Documents/01_projects/hopsworks_recsys/hands-on-recommender-system/recsys'),
 'TWO_TOWER_DATASET_TEST_SPLIT_SIZE': 0.1,
 'TWO_TOWER_DATASET_VALIDATON_SPLIT_SIZE': 0.1,
 'TWO_TOWER_LEARNING_RATE': 0.01,
 'TWO_TOWER_MODEL_BATCH_SIZE': 2048,
 'TWO_TOWER_MODEL_EMBEDDING_SIZE': 16,
 'TWO_TOWER_NUM_EPOCHS': 10,
 'TWO_TOWER_WEIGHT_DECAY': 0.001}


The most important one is the dataset size.

Choosing a different dataset size affects both the time it takes to run everything and the quality of the final models. We suggest using the small dataset size the first time you run this notebook.

Supported customer dataset sizes (number of sampled customers):

In [5]:
DatasetSampler.get_supported_sizes()

{<CustomerDatasetSize.LARGE: 'LARGE'>: 50000,
 <CustomerDatasetSize.MEDIUM: 'MEDIUM'>: 5000,
 <CustomerDatasetSize.SMALL: 'SMALL'>: 1000}

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [6]:
project, fs = hopsworks_integration.get_feature_store()

2024-12-24 12:39:18.485 | INFO     | recsys.hopsworks_integration.feature_store:get_feature_store:13 - Loging to Hopsworks using HOPSWORKS_API_KEY env var.


2024-12-24 12:39:18,486 INFO: Initializing external client
2024-12-24 12:39:18,486 INFO: Base URL: https://c.app.hopsworks.ai:443
2024-12-24 12:39:19,965 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1192098


# The H&M dataset

To show how a recommender system built on the two-tower architecture works, we will use the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) dataset.

It consists of:
- articles
- customers
- transactions

# 🗄️ Articles data

The **article_id** and **product_code** columns serve different purposes in H&M's product database:

- **Article ID**: A unique identifier for each individual article. Each distinct variant of a product (e.g., a different size or color) has its own article_id.

- **Product Code**: An identifier for a product or style rather than for an individual article. It is not unique per row: multiple articles share the same product code when they are variants of the same product.

In short, article_id identifies a single item variant, whereas product_code groups the variants of one product or style.

Here is an example:

**Product: Basic T-Shirt**

- **Product Code:** TS001

- **Article IDs:**
    - Article ID: 1001 (Size: Small, Color: White)
    - Article ID: 1002 (Size: Medium, Color: White)
    - Article ID: 1003 (Size: Large, Color: White)
    - Article ID: 1004 (Size: Small, Color: Black)
    - Article ID: 1005 (Size: Medium, Color: Black)

In this example, "TS001" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.
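
The one-to-many relationship between a product code and its article IDs can be sketched in plain Python. The three `(article_id, product_code)` pairs are taken from the H&M articles data; the grouping code itself is only an illustration and is not part of the `recsys` module.

```python
from collections import defaultdict

# Three (article_id, product_code) pairs from the H&M articles data.
rows = [
    (108775015, 108775),
    (108775044, 108775),
    (108775051, 108775),
]

# Group article IDs by the product code they belong to.
articles_by_product: dict[int, list[int]] = defaultdict(list)
for article_id, product_code in rows:
    articles_by_product[product_code].append(article_id)

for product_code, article_ids in articles_by_product.items():
    print(f"product_code={product_code} -> {len(article_ids)} articles: {article_ids}")
```

All three articles map to the same product code, which is why `product_code` alone cannot serve as a primary key for the articles data.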



In [7]:
articles_df = h_and_m_raw_data.extract_articles_df()
articles_df.shape

(105542, 25)

The articles DataFrame looks as follows:

In [8]:
articles_df.head(3)

article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
i64,i64,str,i64,str,str,i64,str,i64,str,i64,str,i64,str,i64,str,str,str,i64,str,i64,str,i64,str,str
108775015,108775,"""Strap top""",253,"""Vest top""","""Garment Upper body""",1010016,"""Solid""",9,"""Black""",4,"""Dark""",5,"""Black""",1676,"""Jersey Basic""","""A""","""Ladieswear""",1,"""Ladieswear""",16,"""Womens Everyday Basics""",1002,"""Jersey Basic""","""Jersey top with narrow shoulde…"
108775044,108775,"""Strap top""",253,"""Vest top""","""Garment Upper body""",1010016,"""Solid""",10,"""White""",3,"""Light""",9,"""White""",1676,"""Jersey Basic""","""A""","""Ladieswear""",1,"""Ladieswear""",16,"""Womens Everyday Basics""",1002,"""Jersey Basic""","""Jersey top with narrow shoulde…"
108775051,108775,"""Strap top (1)""",253,"""Vest top""","""Garment Upper body""",1010017,"""Stripe""",11,"""Off White""",1,"""Dusty Light""",9,"""White""",1676,"""Jersey Basic""","""A""","""Ladieswear""",1,"""Ladieswear""",16,"""Womens Everyday Basics""",1002,"""Jersey Basic""","""Jersey top with narrow shoulde…"


Check for null values:


In [9]:
articles_df.null_count()

article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,416


## Articles feature engineering


In [10]:
articles_df = compute_features_articles(articles_df)
articles_df.shape


(105542, 27)

The article features look as follows:

In [11]:
articles_df.head(3)

article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,prod_name_length,article_description,image_url
str,i64,str,i64,str,str,i64,str,i64,str,i64,str,i64,str,i64,str,str,str,i64,str,i64,str,i64,str,u32,str,str
"""108775015""",108775,"""Strap top""",253,"""Vest top""","""Garment Upper body""",1010016,"""Solid""",9,"""Black""",4,"""Dark""",5,"""Black""",1676,"""Jersey Basic""","""A""","""Ladieswear""",1,"""Ladieswear""",16,"""Womens Everyday Basics""",1002,"""Jersey Basic""",9,"""Strap top - Vest top in Garmen…","""https://repo.hops.works/dev/jd…"
"""108775044""",108775,"""Strap top""",253,"""Vest top""","""Garment Upper body""",1010016,"""Solid""",10,"""White""",3,"""Light""",9,"""White""",1676,"""Jersey Basic""","""A""","""Ladieswear""",1,"""Ladieswear""",16,"""Womens Everyday Basics""",1002,"""Jersey Basic""",9,"""Strap top - Vest top in Garmen…","""https://repo.hops.works/dev/jd…"
"""108775051""",108775,"""Strap top (1)""",253,"""Vest top""","""Garment Upper body""",1010017,"""Stripe""",11,"""Off White""",1,"""Dusty Light""",9,"""White""",1676,"""Jersey Basic""","""A""","""Ladieswear""",1,"""Ladieswear""",16,"""Womens Everyday Basics""",1002,"""Jersey Basic""",13,"""Strap top (1) - Vest top in Ga…","""https://repo.hops.works/dev/jd…"


### Create embeddings from the article descriptions

In [12]:
for i, desc in enumerate(articles_df["article_description"].head(n=3)):
    logger.info(f"Item {i+1}:\n{desc}")

2024-12-24 12:39:23.398 | INFO     | __main__:<module>:2 - Item 1:
Strap top - Vest top in Garment Upper body
Appearance: Solid
Color: Dark Black (Black)
Category: Ladieswear - Womens Everyday Basics - Jersey Basic
Details: Jersey top with narrow shoulder straps.
2024-12-24 12:39:23.398 | INFO     | __main__:<module>:2 - Item 2:
Strap top - Vest top in Garment Upper body
Appearance: Solid
Color: Light White (White)
Category: Ladieswear - Womens Everyday Basics - Jersey Basic
Details: Jersey top with narrow shoulder straps.
2024-12-24 12:39:23.399 | INFO     | __main__:<module>:2 - Item 3:
Strap top (1) - Vest top in Garment Upper body
Appearance: Stripe
Color: Dusty Light White (Off White)
Category: Ladieswear - Womens Everyday Basics - Jersey Basic
Details: Jersey top with narrow shoulder straps.
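
Judging from the logged descriptions, each `article_description` appears to be assembled from several article columns. The helper below is a hypothetical reconstruction of that template, not the actual `compute_features_articles` implementation; in particular, whether the category line starts with `index_group_name` or `index_name` is an assumption, since both hold the same value in these rows.

```python
def build_article_description(article: dict[str, str]) -> str:
    """Hypothetical helper: join article columns into one text block."""
    return "\n".join(
        [
            f"{article['prod_name']} - {article['product_type_name']} "
            f"in {article['product_group_name']}",
            f"Appearance: {article['graphical_appearance_name']}",
            f"Color: {article['perceived_colour_value_name']} "
            f"{article['perceived_colour_master_name']} "
            f"({article['colour_group_name']})",
            f"Category: {article['index_group_name']} - "
            f"{article['section_name']} - {article['garment_group_name']}",
            f"Details: {article['detail_desc']}",
        ]
    )


# Values of the first article in the DataFrame.
article = {
    "prod_name": "Strap top",
    "product_type_name": "Vest top",
    "product_group_name": "Garment Upper body",
    "graphical_appearance_name": "Solid",
    "perceived_colour_value_name": "Dark",
    "perceived_colour_master_name": "Black",
    "colour_group_name": "Black",
    "index_group_name": "Ladieswear",
    "section_name": "Womens Everyday Basics",
    "garment_group_name": "Jersey Basic",
    "detail_desc": "Jersey top with narrow shoulder straps.",
}
description = build_article_description(article)
print(description)
```

Folding the categorical columns into one text field lets a single text embedding model capture them together.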


In [13]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
logger.info(
    f"Loading '{settings.FEATURES_EMBEDDING_MODEL_ID}' embedding model to {device=}"
)

# Load the embedding model from SentenceTransformer's model registry.
model = SentenceTransformer(settings.FEATURES_EMBEDDING_MODEL_ID, device=device)

2024-12-24 12:39:23.438 | INFO     | __main__:<module>:8 - Loading 'all-MiniLM-L6-v2' embedding model to device='mps'


2024-12-24 12:39:23,438 INFO: Load pretrained SentenceTransformer: all-MiniLM-L6-v2


In [14]:
articles_df = generate_embeddings_for_dataframe(
    articles_df, "article_description", model, batch_size=128
)  # Reduce batch size if getting OOM errors.

Generating embeddings: 100%|██████████| 105542/105542 [04:59<00:00, 352.58it/s]


Each article description is now represented by a numerical vector that we can feed to a model, instead of the raw description string.
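
One reason vectors are more useful than strings is that we can measure how close two descriptions are. A minimal sketch using cosine similarity follows; the three-dimensional vectors are made up for illustration (the real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and the helper is not part of the `recsys` module.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values).
black_strap_top = [0.9, 0.1, 0.0]
white_strap_top = [0.8, 0.2, 0.1]
trousers = [0.1, 0.9, 0.3]

sim_tops = cosine_similarity(black_strap_top, white_strap_top)
sim_top_trousers = cosine_similarity(black_strap_top, trousers)
print(f"top vs. top:      {sim_tops:.3f}")
print(f"top vs. trousers: {sim_top_trousers:.3f}")
```

With real embeddings, near-identical descriptions such as the two strap tops would be expected to score much closer to each other than to an unrelated garment.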

In [15]:
articles_df[["article_description", "embeddings"]].head(3)

article_description,embeddings
str,list[f64]
"""Strap top - Vest top in Garmen…","[-0.026782, 0.082344, … 0.022782]"
"""Strap top - Vest top in Garmen…","[-0.010396, 0.089874, … 0.022564]"
"""Strap top (1) - Vest top in Ga…","[-0.032753, 0.091124, … 0.022804]"


### Looking at image links

In [16]:
articles_df["image_url"][0]

'https://repo.hops.works/dev/jdowling/h-and-m/images/010/0108775015.jpg'
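
The URL above suggests a simple naming scheme: the article ID is zero-padded to ten digits and stored in a folder named after its first three digits. The function below is a sketch inferred from this single URL, so treat the pattern as an assumption rather than the documented layout of the image repository.

```python
IMAGE_BASE_URL = "https://repo.hops.works/dev/jdowling/h-and-m/images"


def build_image_url(article_id: str) -> str:
    """Hypothetical helper: derive an image URL from an article ID."""
    padded_id = article_id.zfill(10)  # "108775015" -> "0108775015"
    folder = padded_id[:3]  # first three digits of the padded ID
    return f"{IMAGE_BASE_URL}/{folder}/{padded_id}.jpg"


print(build_image_url("108775015"))
```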

In [17]:
from IPython.display import HTML, display

image_urls = articles_df["image_url"].tail(12).to_list()
grid_html = '<div style="display: grid; grid-template-columns: repeat(6, 1fr); gap: 10px; max-width: 900px;">'

for url in image_urls:
    grid_html += f'<img src="{url}" style="width: 100%; height: auto;">'

grid_html += "</div>"

display(HTML(grid_html))


## Customers Data

In [18]:
customers_df = h_and_m_raw_data.extract_customers_df()
customers_df.shape


(1371980, 7)

The customers DataFrame looks as follows:

In [19]:
customers_df.head(3)

customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
str,f64,f64,str,str,i64,str
"""00000dbacae5abe5e23885899a1fa4…",,,"""ACTIVE""","""NONE""",49,"""52043ee2162cf5aa7ee79974281641…"
"""0000423b00ade91418cceaf3b26c6a…",,,"""ACTIVE""","""NONE""",25,"""2973abc54daa8a5f8ccfe9362140c6…"
"""000058a12d5b43e67d225668fa1f8d…",,,"""ACTIVE""","""NONE""",24,"""64f17e6a330a85798e4998f62d0930…"


Check for null values:

In [20]:

customers_df.null_count()

customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
u32,u32,u32,u32,u32,u32,u32
0,895050,907576,6062,16009,15861,0


## Customers feature engineering


In [21]:
customers_df = compute_features_customers(customers_df, drop_null_age=True)
customers_df.shape

(1356119, 5)
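
Besides dropping rows with a missing age, the feature step adds an `age_group` column. The sketch below shows one way such bucketing could work. Only the `19-25` and `46-55` labels appear in this notebook's output; the remaining bin edges and labels are assumptions, and the function is not the actual `compute_features_customers` code.

```python
from bisect import bisect_right

# Lower edge of each bucket and its label (mostly hypothetical).
AGE_BIN_STARTS = [0, 19, 26, 36, 46, 56, 66]
AGE_LABELS = ["0-18", "19-25", "26-35", "36-45", "46-55", "56-65", "66+"]


def age_to_group(age: float) -> str:
    """Map a non-negative age to its bucket label."""
    index = bisect_right(AGE_BIN_STARTS, age) - 1
    return AGE_LABELS[index]


for age in (49.0, 25.0, 24.0):
    print(age, "->", age_to_group(age))
```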

The features of the customers DataFrame look as follows:

In [22]:
customers_df.head(3)

customer_id,club_member_status,age,postal_code,age_group
str,str,f64,str,str
"""00000dbacae5abe5e23885899a1fa4…","""ACTIVE""",49.0,"""52043ee2162cf5aa7ee79974281641…","""46-55"""
"""0000423b00ade91418cceaf3b26c6a…","""ACTIVE""",25.0,"""2973abc54daa8a5f8ccfe9362140c6…","""19-25"""
"""000058a12d5b43e67d225668fa1f8d…","""ACTIVE""",24.0,"""64f17e6a330a85798e4998f62d0930…","""19-25"""



# Transactions Data

In [23]:
transactions_df = h_and_m_raw_data.extract_transactions_df()
transactions_df.shape

(31788324, 5)

The transaction DataFrame looks as follows:

In [24]:
transactions_df.head(3)

t_dat,customer_id,article_id,price,sales_channel_id
date,str,i64,f64,i64
2018-09-20,"""000058a12d5b43e67d225668fa1f8d…",663713001,0.050831,2
2018-09-20,"""000058a12d5b43e67d225668fa1f8d…",541518023,0.030492,2
2018-09-20,"""00007d2de826758b65a93dd24ce629…",505221004,0.015237,2


## Transactions feature engineering

In [25]:
transactions_df = compute_features_transactions(transactions_df)
transactions_df.shape

(31788324, 9)

The time of year a purchase was made should be a strong predictor, as seasonality is a major factor in fashion purchases. Here, you will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), the month can be mapped to the unit circle using sine and cosine.

Thus, the features of the transactions DataFrame look as follows:

In [26]:
transactions_df.head(3)

t_dat,customer_id,article_id,price,sales_channel_id,year,month,day,day_of_week
i64,str,str,f64,i64,i32,i8,i8,i8
0,"""000058a12d5b43e67d225668fa1f8d…","""663713001""",0.050831,2,2018,9,20,4
0,"""000058a12d5b43e67d225668fa1f8d…","""541518023""",0.030492,2,2018,9,20,4
0,"""00007d2de826758b65a93dd24ce629…","""505221004""",0.015237,2,2018,9,20,4
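
The cyclical month encoding described above can be sketched with the standard library. The function name and the choice to place January at angle 0 are assumptions made for this example; the DataFrame above still holds the raw `month` column.

```python
import math


def encode_month(month: int) -> tuple[float, float]:
    """Map a month (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


january = encode_month(1)
february = encode_month(2)
july = encode_month(7)
december = encode_month(12)

# Euclidean distances between the encoded points.
print(f"Jan-Feb: {math.dist(january, february):.3f}")
print(f"Jan-Dec: {math.dist(january, december):.3f}")
print(f"Jan-Jul: {math.dist(january, july):.3f}")
```

With the raw integer, January (1) and December (12) are 11 units apart; on the circle they are exactly as close as January and February, while July sits on the opposite side.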


We don't want to work with ~30 million transactions in this series, as everything would take too long to run. Thus, we create a subset of the original dataset by randomly sampling customers and keeping only their transactions.

In [27]:
sampler = DatasetSampler(size=settings.CUSTOMER_DATA_SIZE)
dataset_subset = sampler.sample(
    customers_df=customers_df, transations_df=transactions_df
)
customers_df = dataset_subset["customers"]
transactions_df = dataset_subset["transactions"]

2024-12-24 12:45:24.721 | INFO     | recsys.features.customers:sample:29 - Sampling 1000 customers.
2024-12-24 12:45:24.776 | INFO     | recsys.features.customers:sample:32 - Number of transactions for all the customers: 31788324
2024-12-24 12:45:25.369 | INFO     | recsys.features.customers:sample:38 - Number of transactions for the 1000 sampled customers: 23799


In [28]:
transactions_df.shape

(23799, 9)
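
The idea behind the sampling step can be sketched in plain Python: pick a random subset of customers, then keep only the transactions that belong to them. This is an illustration on toy data under assumed names (`sample_customers`, the `seed` parameter), not the `DatasetSampler` implementation.

```python
import random


def sample_customers(
    customer_ids: list[str],
    transactions: list[dict],
    n_customers: int,
    seed: int = 42,
) -> tuple[list[str], list[dict]]:
    """Sample customers, then keep only their transactions."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    sampled = set(rng.sample(customer_ids, n_customers))
    kept = [t for t in transactions if t["customer_id"] in sampled]
    return sorted(sampled), kept


# Toy data (hypothetical IDs).
customer_ids = ["c1", "c2", "c3", "c4"]
transactions = [
    {"customer_id": "c1", "article_id": "a1"},
    {"customer_id": "c2", "article_id": "a2"},
    {"customer_id": "c3", "article_id": "a3"},
    {"customer_id": "c3", "article_id": "a4"},
    {"customer_id": "c4", "article_id": "a1"},
]

sampled_ids, sampled_transactions = sample_customers(
    customer_ids, transactions, n_customers=2
)
print(sampled_ids, len(sampled_transactions))
```

Sampling by customer rather than by transaction keeps each sampled customer's full purchase history intact.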

Some of the remaining customers:

In [29]:
for customer_id in transactions_df["customer_id"].unique().head(10):
    logger.info(f"Logging customer ID: {customer_id}")

2024-12-24 12:45:25.506 | INFO     | __main__:<module>:2 - Logging customer ID: 70ea0561e99d2e66da63cba206615b0341d160dc8e99600b74a125cf53d54eb6
2024-12-24 12:45:25.506 | INFO     | __main__:<module>:2 - Logging customer ID: 58a4f309b5431ddc9913cf126f28c0010913a4a22729dbb087e751cfd2c7c9a7
2024-12-24 12:45:25.507 | INFO     | __main__:<module>:2 - Logging customer ID: 51e21d5437279ea291b8b5aa58162192ae7998ef98708d754c79a9239cee690e
2024-12-24 12:45:25.507 | INFO     | __main__:<module>:2 - Logging customer ID: 480ba7290134bc7d959e0a230d8bd9e1c0bb387a69c456450b3b845aca4d7278
2024-12-24 12:45:25.507 | INFO     | __main__:<module>:2 - Logging customer ID: 042012178b0189c9f2269bba579273aaded65c23b7aa4e42a3a77b536d3c0403

# 🤳🏻 Interaction data

To train our models, we need more than just the transactions DataFrame. We need positive samples that signal whether a customer clicked or bought an item, but we also need negative samples that signal no interactions between a customer and an item.

In [30]:
interaction_df = generate_interaction_data(transactions_df)
interaction_df.shape

Processing customer chunks: 100%|██████████| 1/1 [00:05<00:00,  5.04s/it]


(135813, 5)

The interaction features look as follows:

In [31]:
interaction_df.head()

t_dat,customer_id,article_id,interaction_score,prev_article_id
i64,str,str,i64,str
-370800000,"""00b203a32faa3d007dba198ef27c15…","""854301003""",0,"""START"""
-367200000,"""00b203a32faa3d007dba198ef27c15…","""717490008""",0,"""854301003"""
-363600000,"""00b203a32faa3d007dba198ef27c15…","""717490008""",0,"""717490008"""
-352800000,"""00b203a32faa3d007dba198ef27c15…","""811099002""",0,"""717490008"""
-349200000,"""00b203a32faa3d007dba198ef27c15…","""811099002""",0,"""811099002"""


Let's take a look at the interaction score distribution:

In [32]:
interaction_df.group_by("interaction_score").agg(
    pl.count("interaction_score").alias("total_interactions")
)

interaction_score,total_interactions
i64,u32
0,73710
1,38304
2,23799


Here is what each score means:
- `0` : No interaction between a customer and an item
- `1` : A customer clicked an item
- `2` : A customer bought an item
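
To make the scores concrete, here is a simplified sketch of how positive and negative samples could be produced from purchases. It covers only purchases (score `2`) and randomly sampled non-interactions (score `0`) and leaves out clicks (score `1`). The function name and sampling strategy are assumptions, not the `generate_interaction_data` implementation.

```python
import random

PURCHASE_SCORE = 2
NO_INTERACTION_SCORE = 0


def add_negative_samples(
    purchases: list[tuple[str, str]],
    all_article_ids: list[str],
    negatives_per_purchase: int,
    seed: int = 0,
) -> list[tuple[str, str, int]]:
    """Return (customer_id, article_id, score) rows with sampled negatives."""
    rng = random.Random(seed)

    # Articles each customer bought, so we never sample them as negatives.
    bought: dict[str, set[str]] = {}
    for customer_id, article_id in purchases:
        bought.setdefault(customer_id, set()).add(article_id)

    rows = []
    for customer_id, article_id in purchases:
        rows.append((customer_id, article_id, PURCHASE_SCORE))
        candidates = [a for a in all_article_ids if a not in bought[customer_id]]
        k = min(negatives_per_purchase, len(candidates))
        for negative_id in rng.sample(candidates, k):
            rows.append((customer_id, negative_id, NO_INTERACTION_SCORE))
    return rows


purchases = [("c1", "a1"), ("c2", "a2")]
all_article_ids = ["a1", "a2", "a3", "a4"]
rows = add_negative_samples(purchases, all_article_ids, negatives_per_purchase=2)
for row in rows:
    print(row)
```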

# <span style="color:#ff5f27">🪄 Create Hopsworks Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.

To create a feature group you need to give it a name and specify a primary key. It is also best practice to provide a description of the contents of the feature group.

#### Customers

We set `online_enabled=True` to enable low-latency access to the data from the inference pipeline for real-time predictions. 

A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).
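
As a rough idea of what a helper such as `create_customers_feature_group` might do internally, the sketch below uses the Hopsworks `get_or_create_feature_group` and `insert` calls. The wrapper function, the version number, and the example arguments are assumptions; the actual helper in `recsys.hopsworks_integration.feature_store` may differ.

```python
def create_feature_group_sketch(
    fs,
    df,
    name: str,
    primary_key: list[str],
    description: str,
    online_enabled: bool = True,
):
    """Hypothetical wrapper: create (or fetch) a feature group and upload df."""
    feature_group = fs.get_or_create_feature_group(
        name=name,
        version=1,  # assumed version
        description=description,
        primary_key=primary_key,
        online_enabled=online_enabled,  # also serve features with low latency
    )
    # Upload the DataFrame; Hopsworks then runs a materialization job.
    feature_group.insert(df)
    return feature_group
```

A call could look like `create_feature_group_sketch(fs, customers_df, name="customers", primary_key=["customer_id"], description="Customers data")`, where the primary key and description are again illustrative.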

In [33]:
logger.info("Uploading 'customers' Feature Group to Hopsworks.")
customers_fg = feature_store.create_customers_feature_group(
    fs, df=customers_df, online_enabled=True
)

logger.info("✅ Uploaded 'customers' Feature Group to Hopsworks!")

2024-12-24 12:45:30.825 | INFO     | __main__:<module>:1 - Uploading 'customers' Feature Group to Hopsworks.
Uploading Dataframe: 100.00% |██████████| Rows 1000/1000 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: customers_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1192098/jobs/named/customers_1_offline_fg_materialization/executions
2024-12-24 12:45:48,374 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2024-12-24 12:45:51,547 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2024-12-24 12:47:23,642 INFO: Waiting for execution to finish. Current state: SUCCEEDING. Final status: UNDEFINED
2024-12-24 12:47:26,825 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2024-12-24 12:47:26,989 INFO: Waiting for log aggregation to finish.
2024-12-24 12:47:45,595 INFO: Execution finished successfully.


2024-12-24 12:47:48.501 | INFO     | __main__:<module>:6 - ✅ Uploaded 'customers' Feature Group to Hopsworks!


#### Articles

Let's do the same for the rest of the DataFrames.

In [34]:
logger.info("Uploading 'articles' Feature Group to Hopsworks.")
articles_fg = feature_store.create_articles_feature_group(
    fs,
    df=articles_df,
    articles_description_embedding_dim=model.get_sentence_embedding_dimension(),
    online_enabled=True,
)
logger.info("✅ Uploaded 'articles' Feature Group to Hopsworks!")


2024-12-24 12:47:48.556 | INFO     | __main__:<module>:1 - Uploading 'articles' Feature Group to Hopsworks.
Uploading Dataframe: 100.00% |██████████| Rows 105542/105542 | Elapsed Time: 00:29 | Remaining Time: 00:00


Launching job: articles_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1192098/jobs/named/articles_1_offline_fg_materialization/executions
2024-12-24 12:48:32,354 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2024-12-24 12:48:35,566 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2024-12-24 12:51:23,826 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2024-12-24 12:51:23,984 INFO: Waiting for log aggregation to finish.
2024-12-24 12:51:39,233 INFO: Execution finished successfully.


2024-12-24 12:51:39.234 | INFO     | __main__:<module>:8 - ✅ Uploaded 'articles' Feature Group to Hopsworks!


####  Transactions

In [35]:
logger.info("Uploading 'transactions' Feature Group to Hopsworks.")
trans_fg = feature_store.create_transactions_feature_group(
    fs=fs, df=transactions_df, online_enabled=True
)
logger.info("✅ Uploaded 'transactions' Feature Group to Hopsworks!")

2024-12-24 12:51:39.287 | INFO     | __main__:<module>:1 - Uploading 'transactions' Feature Group to Hopsworks.
Uploading Dataframe: 100.00% |██████████| Rows 23799/23799 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: transactions_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1192098/jobs/named/transactions_1_offline_fg_materialization/executions
2024-12-24 12:51:55,702 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2024-12-24 12:51:58,876 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2024-12-24 12:53:31,141 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2024-12-24 12:53:31,301 INFO: Waiting for log aggregation to finish.
2024-12-24 12:53:43,307 INFO: Execution finished successfully.


2024-12-24 12:53:50.168 | INFO     | __main__:<module>:5 - ✅ Uploaded 'transactions' Feature Group to Hopsworks!


#### Interactions

In [36]:
logger.info("Uploading 'interactions' Feature Group to Hopsworks.")
interactions_fg = feature_store.create_interactions_feature_group(
    fs=fs, df=interaction_df, online_enabled=True
)
logger.info("✅ Uploaded 'interactions' Feature Group to Hopsworks!!")

2024-12-24 12:53:50.220 | INFO     | __main__:<module>:1 - Uploading 'interactions' Feature Group to Hopsworks.
Uploading Dataframe: 100.00% |██████████| Rows 135813/135813 | Elapsed Time: 00:02 | Remaining Time: 00:00


Launching job: interactions_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1192098/jobs/named/interactions_1_offline_fg_materialization/executions
2024-12-24 12:54:06,519 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2024-12-24 12:54:09,702 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2024-12-24 12:55:48,209 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2024-12-24 12:55:48,370 INFO: Waiting for log aggregation to finish.
2024-12-24 12:55:56,957 INFO: Execution finished successfully.


2024-12-24 12:55:59.467 | INFO     | __main__:<module>:5 - ✅ Uploaded 'interactions' Feature Group to Hopsworks!!


# Compute ranking dataset

The last step is to compute the ranking dataset used to train the scoring/ranking model from the feature groups we've just created:


In [37]:
ranking_df = compute_ranking_dataset(
    trans_fg,
    articles_fg,
    customers_fg,
)
ranking_df.shape

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (1.55s) 
Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (5.61s) 
Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (0.67s) 
Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (21.64s) 


(224136, 15)

The ranking dataset looks as follows:

In [38]:
ranking_df.head(3)

customer_id,age,article_id,label,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
str,f64,str,i32,str,str,str,str,str,str,str,str,str,str,str
"""34d30dcece38ac652e9fd05285d94d…",64.0,"""839496003""",1,"""Top""","""Garment Upper body""","""All over pattern""","""Beige""","""Dark""","""Beige""","""Jersey fancy""","""Ladieswear""","""Ladieswear""","""Womens Everyday Collection""","""Jersey Fancy"""
"""91a30617b9a5b5ff3927779204a176…",49.0,"""777093001""",1,"""Jumpsuit/Playsuit""","""Garment Full body""","""All over pattern""","""Dark Blue""","""Dark""","""Blue""","""Young Girl Dresses""","""Children Sizes 134-170""","""Baby/Children""","""Young Girl""","""Dresses/Skirts girls"""
"""23eeb5e9595c9409031f21a9c01fa3…",25.0,"""875719003""",1,"""Trousers""","""Garment Lower body""","""Solid""","""Beige""","""Dusty Light""","""Mole""","""Trousers DS""","""Divided""","""Divided""","""Divided Selected""","""Trousers"""


In [39]:
ranking_df.get_column("label").value_counts()

label,count
i32,u32
0,203760
1,20376
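
The label counts above show a strong class imbalance. The short calculation below makes the ratio explicit, using only the numbers from this output. The result lines up with the `RANKING_SCALE_POS_WEIGHT` value of 10 in the settings, which is presumably there to offset the imbalance during training (an interpretation, not something this notebook states).

```python
# Label counts copied from the output above.
label_counts = {0: 203_760, 1: 20_376}

total_rows = sum(label_counts.values())
negatives_per_positive = label_counts[0] / label_counts[1]
positive_share = label_counts[1] / total_rows

print(f"Total rows: {total_rows}")
print(f"Negatives per positive: {negatives_per_positive:.1f}")
print(f"Share of positive labels: {positive_share:.1%}")
```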


Because the ranking dataset was computed from the articles, customers, and transactions Feature Groups, we can record this lineage in the ranking Feature Group by passing them as `parents`.

In [40]:
logger.info("Uploading 'ranking' Feature Group to Hopsworks.")
rank_fg = feature_store.create_ranking_feature_group(
    fs,
    df=ranking_df,
    parents=[articles_fg, customers_fg, trans_fg],
    online_enabled=False
)
logger.info("✅ Uploaded 'ranking' Feature Group to Hopsworks!!")

2024-12-24 12:56:32.770 | INFO     | __main__:<module>:1 - Uploading 'ranking' Feature Group to Hopsworks.
Uploading Dataframe: 100.00% |██████████| Rows 224136/224136 | Elapsed Time: 00:04 | Remaining Time: 00:00


Launching job: ranking_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1192098/jobs/named/ranking_1_offline_fg_materialization/executions
2024-12-24 12:56:51,122 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2024-12-24 12:56:54,297 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2024-12-24 12:58:39,189 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2024-12-24 12:58:39,350 INFO: Waiting for log aggregation to finish.
2024-12-24 12:58:47,939 INFO: Execution finished successfully.


2024-12-24 12:58:55.424 | INFO     | __main__:<module>:8 - ✅ Uploaded 'ranking' Feature Group to Hopsworks!!


## <span style="color:#ff5f27"> Inspecting the Feature Groups in Hopsworks UI </span>

View results in [Hopsworks Serverless](https://rebrand.ly/serverless-github): **Feature Store → Feature Groups**

---

In [41]:
notebook_end_time = time.time()
notebook_execution_time = notebook_end_time - notebook_start_time

logger.info(
    f"⌛️ Notebook Execution time: {notebook_execution_time:.2f} seconds ~ {notebook_execution_time / 60:.2f} minutes"
)

2024-12-24 12:58:55.478 | INFO     | __main__:<module>:4 - ⌛️ Notebook Execution time: 1181.77 seconds ~ 19.70 minutes


# <span style="color:#ff5f27">→ Next Steps </span>

In the next notebook, you'll train the retrieval model and register it in the Hopsworks model registry.