In [1]:
import time

notebook_start_time = time.time()

# Set up environment

In [2]:
import sys
from pathlib import Path


def is_google_colab() -> bool:
    # `get_ipython()` is injected by IPython; its repr mentions "google.colab" on Colab.
    return "google.colab" in str(get_ipython())


def clone_repository() -> None:
    !git clone https://github.com/decodingml/hands-on-recommender-system.git
    %cd hands-on-recommender-system/


def install_dependencies() -> None:
    !pip install --upgrade uv
    !uv pip install --all-extras --system --requirement pyproject.toml


if is_google_colab():
    clone_repository()
    install_dependencies()

    root_dir = str(Path().absolute())
    print("⛳️ Google Colab environment")
else:
    root_dir = str(Path().absolute().parent)
    print("⛳️ Local environment")

# Add the root directory to the `PYTHONPATH` to use the `recsys` Python module from the notebook.
if root_dir not in sys.path:
    print(f"Adding the following directory to the PYTHONPATH: {root_dir}")
    sys.path.append(root_dir)

⛳️ Local environment
Adding the following directory to the PYTHONPATH: /Users/pauliusztin/Documents/01_projects/hopsworks_recsys/hands-on-recommender-system


# 👩🏻‍🔬 Feature pipeline: Computing item embeddings

In this notebook you will compute the candidate embeddings and populate a Hopsworks feature group with a vector index.

## 📝 Imports

In [3]:
import warnings

warnings.filterwarnings("ignore")

from loguru import logger

from recsys import features, hopsworks_integration
from recsys.config import settings

## Constants

In [4]:
from pprint import pprint

pprint(dict(settings))

{'CUSTOMER_DATA_SIZE': <CustomerDatasetSize.SMALL: 'SMALL'>,
 'FEATURES_EMBEDDING_MODEL_ID': 'all-MiniLM-L6-v2',
 'HOPSWORKS_API_KEY': SecretStr('**********'),
 'RECSYS_DIR': PosixPath('/Users/pauliusztin/Documents/01_projects/hopsworks_recsys/hands-on-recommender-system/recsys'),
 'TWO_TOWER_DATASET_TEST_SPLIT_SIZE': 0.1,
 'TWO_TOWER_DATASET_VALIDATON_SPLIT_SIZE': 0.1,
 'TWO_TOWER_LEARNING_RATE': 0.01,
 'TWO_TOWER_MODEL_BATCH_SIZE': 2048,
 'TWO_TOWER_MODEL_EMBEDDING_SIZE': 16,
 'TWO_TOWER_NUM_EPOCHS': 10,
 'TWO_TOWER_WEIGHT_DECAY': 0.001}


## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [5]:
project, fs = hopsworks_integration.get_feature_store()

mr = project.get_model_registry()

2024-11-21 13:24:34.734 | INFO     | recsys.hopsworks_integration.feature_store:get_feature_store:12 - Loging to Hopsworks using HOPSWORKS_API_KEY env var.


Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/15551


# Computing candidate embeddings

You start by computing candidate embeddings for all items in the training data.

First, you load your candidate model. Recall that you uploaded it to the Hopsworks Model Registry in previous steps:

In [6]:
candidate_model, model_schema = (
    hopsworks_integration.two_tower_serving.HopsworksCandidateModel.download(mr=mr)
)

2024-11-21 13:24:39.407 | INFO     | recsys.hopsworks_integration.two_tower_serving:download:204 - Downloading 'candidate_model' version 3


Downloading model artifact (2 dirs, 6 files)... DONE

### Get candidates data

Now you fetch the retrieval training data, which contains all the features required by the candidate embedding model.

In [7]:
feature_view = fs.get_feature_view(
    name="retrieval",
    version=1,
)

In [8]:
train_df, val_df, test_df, _, _, _ = feature_view.train_validation_test_split(
    validation_size=settings.TWO_TOWER_DATASET_VALIDATON_SPLIT_SIZE,
    test_size=settings.TWO_TOWER_DATASET_TEST_SPLIT_SIZE,
    description="Retrieval dataset splits",
)
train_df.head(3)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (51.64s) 



| | customer_id | article_id | t_dat | price | month_sin | month_cos | age | club_member_status | age_group | garment_group_name | index_group_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 601bdfee20edbbf1bf49844216b9b64dbf1ce1ed1b6770... | 790663002 | 1575763200000 | 0.045746 | -2.449294e-16 | 1.0 | 59.0 | ACTIVE | 56-65 | Jersey Fancy | Ladieswear |
| 1 | f687aec1b7e8b87dd5bc226af37ee120f36cee375cc1db... | 811198002 | 1572912000000 | 0.033881 | -0.5 | 0.866025 | 51.0 | ACTIVE | 46-55 | Knitwear | Ladieswear |
| 2 | 080707ddf1df9461c54b359f909fba5c24e9e4d57619db... | 721063009 | 1566345600000 | 0.018068 | -0.8660254 | -0.5 | 55.0 | ACTIVE | 46-55 | Shorts | Menswear |
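
The `month_sin` and `month_cos` columns look like a cyclical encoding of the month of `t_dat`. The sketch below is an assumption about how such features can be derived, not the project's actual feature code (which belongs to an earlier pipeline step); the function name `month_features` is hypothetical. Under that assumption, it reproduces the values in the three rows above up to floating-point rounding.

```python
import math
from datetime import datetime, timezone


def month_features(t_dat_ms: int) -> tuple[float, float]:
    """Encode the month of a millisecond UTC timestamp as a point on the unit circle."""
    month = datetime.fromtimestamp(t_dat_ms / 1000, tz=timezone.utc).month
    angle = 2 * math.pi * month / 12  # assumption: one full turn per 12 months
    return math.sin(angle), math.cos(angle)


# The three `t_dat` values from the rows shown above.
for t_dat in (1575763200000, 1572912000000, 1566345600000):
    month_sin, month_cos = month_features(t_dat)
    print(t_dat, month_sin, month_cos)
```

Encoding the month as a sine/cosine pair keeps December and January close to each other, which a plain integer month would not.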


### Compute embeddings

Next, you compute the embeddings of all candidate items that appear in the data used to train the retrieval model.

In [9]:
item_df = features.embeddings.preprocess(train_df, model_schema)
item_df.head(3)

| | article_id | garment_group_name | index_group_name |
|---|---|---|---|
| 0 | 790663002 | Jersey Fancy | Ladieswear |
| 1 | 811198002 | Knitwear | Ladieswear |
| 2 | 721063009 | Shorts | Menswear |
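
`train_df` appears to hold one row per transaction, so the same `article_id` can occur many times, while `item_df` keeps only item-level columns. The plain-Python sketch below illustrates one way such a step could work. It is an assumption, not the body of `features.embeddings.preprocess`, and `unique_items` and its arguments are hypothetical names.

```python
from typing import Iterable


def unique_items(
    rows: Iterable[dict], item_features: list[str], key: str = "article_id"
) -> list[dict]:
    """Keep the first row seen for each item, restricted to the item-level columns."""
    seen = set()
    items = []
    for row in rows:
        if row[key] in seen:
            continue  # this item was already collected from an earlier row
        seen.add(row[key])
        items.append({name: row[name] for name in item_features})
    return items


# Toy transactions: the customer IDs are made up; the article rows mirror the table above.
transactions = [
    {"customer_id": "a", "article_id": 790663002, "garment_group_name": "Jersey Fancy", "index_group_name": "Ladieswear"},
    {"customer_id": "b", "article_id": 811198002, "garment_group_name": "Knitwear", "index_group_name": "Ladieswear"},
    {"customer_id": "c", "article_id": 790663002, "garment_group_name": "Jersey Fancy", "index_group_name": "Ladieswear"},
]
items = unique_items(transactions, ["article_id", "garment_group_name", "index_group_name"])
print(items)
```

With pandas, the same idea is usually a column selection followed by `drop_duplicates(subset="article_id")`.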


In [10]:
embeddings_df = features.embeddings.embed(df=item_df, candidate_model=candidate_model)
embeddings_df.head()

| | article_id | embeddings |
|---|---|---|
| 0 | 790663002 | [-0.7971345782279968, -1.1829744577407837, -0.... |
| 1 | 811198002 | [-1.6718555688858032, -0.8620117902755737, -0.... |
| 2 | 721063009 | [-1.0836117267608643, 1.1448973417282104, 0.46... |
| 3 | 698286003 | [0.8582326173782349, 0.7445253133773804, 0.192... |
| 4 | 663970005 | [-0.1496417075395584, -0.14807413518428802, -1... |
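
Before writing the vectors to the feature store, it can help to confirm that every embedding has the length the index will expect. The check below is a small sketch: `check_embedding_dim` is a hypothetical helper, and the expected length of 16 is taken from `TWO_TOWER_MODEL_EMBEDDING_SIZE` in the settings printed earlier, on the assumption that the candidate model outputs vectors of that size.

```python
def check_embedding_dim(rows: list[dict], expected_dim: int) -> None:
    """Raise ValueError if any embedding does not have `expected_dim` entries."""
    for row in rows:
        actual = len(row["embeddings"])
        if actual != expected_dim:
            raise ValueError(
                f"article_id={row['article_id']}: expected {expected_dim} values, got {actual}"
            )


# Illustrative rows with made-up vectors, not real model output.
rows = [
    {"article_id": 1, "embeddings": [0.0] * 16},
    {"article_id": 2, "embeddings": [0.1] * 16},
]
check_embedding_dim(rows, expected_dim=16)  # passes silently when all lengths match
```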


# <span style="color:#ff5f27">Create Hopsworks Embedding Index </span>

Now you are ready to create a feature group for your candidate embeddings.

First, you create an embedding index, specifying the name of the embeddings feature and the embedding length (its dimension).
Then you attach this index to the feature group.

In [11]:
candidate_embeddings_fg = (
    hopsworks_integration.feature_store.create_candidate_embeddings_feature_group(
        fs=fs, df=embeddings_df, online_enabled=True
    )
)
logger.info("✅ Uploaded 'candidate_embeddings' Feature Group to Hopsworks!!")

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/15551/fs/15471/fg/1361248


Uploading Dataframe: 0.00% |          | Rows 0/11954 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: candidate_embeddings_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/15551/jobs/named/candidate_embeddings_1_offline_fg_materialization/executions


2024-11-21 13:27:23.597 | INFO     | __main__:<module>:6 - ✅ Uploaded 'candidate_embeddings' Feature Group to Hopsworks!!
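
Conceptually, the embedding index lets you ask for the items whose vectors are closest to a query vector. The sketch below does this by brute force in plain Python to show the idea. It is not how Hopsworks implements its index (production vector indexes typically rely on approximate search), and `top_k_items`, the dot-product scoring, and the toy two-dimensional vectors are assumptions made for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_items(
    query: list[float], item_embeddings: dict[int, list[float]], k: int = 2
) -> list[tuple[int, float]]:
    """Score every item against the query and return the k highest-scoring (id, score) pairs."""
    scored = [(item_id, dot(query, vec)) for item_id, vec in item_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy item vectors keyed by made-up item IDs.
item_embeddings = {
    101: [1.0, 0.0],
    102: [0.0, 1.0],
    103: [0.7, 0.7],
}
neighbours = top_k_items([1.0, 0.2], item_embeddings, k=2)
print(neighbours)
```

Brute force scans every item for each query, which is fine for a toy example but is the cost a dedicated vector index is meant to avoid.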


## Expose it to the feature pipeline as a Feature View


In [12]:
feature_view = (
    hopsworks_integration.feature_store.create_candidate_embeddings_feature_view(
        fs=fs, fg=candidate_embeddings_fg
    )
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/15551/fs/15471/fv/candidate_embeddings/version/1


## <span style="color:#ff5f27"> Inspecting the embeddings in Hopsworks UI </span>

View results in [Hopsworks Serverless](https://rebrand.ly/serverless-github): **Feature Store → Feature Groups**

---

In [13]:
notebook_end_time = time.time()
notebook_execution_time = notebook_end_time - notebook_start_time

logger.info(
    f"⌛️ Notebook Execution time: {notebook_execution_time:.2f} seconds ~ {notebook_execution_time / 60:.2f} minutes"
)

2024-11-21 13:27:25.918 | INFO     | __main__:<module>:4 - ⌛️ Notebook Execution time: 176.10 seconds ~ 2.93 minutes


# <span style="color:#ff5f27">→ Next Steps </span>

Now that the vector index is populated with item embeddings, everything is ready for production. In the next notebook, we will zoom in on the inference pipeline and see how to deploy it to Hopsworks as a real-time deployment.