# Recommender Systems Workshop
*Presented by Stefan Dominicus at Deep Learning IndabaX 2025.*

# Part 2: TFX Pipeline & TensorFlow Recommenders
In this notebook, we'll build a TFX pipeline that trains a personalised recommender using the TensorFlow Recommenders two-tower model architecture.

Relevant Guides:
- https://www.tensorflow.org/recommenders/examples/basic_retrieval
- https://www.tensorflow.org/tfx/tutorials/tfx/recommenders
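
As a rough intuition for the two-tower architecture mentioned above: one tower maps a customer to an embedding, the other maps a product to an embedding, and products are ranked by the dot product of the two. The sketch below uses hand-picked toy embeddings in plain Python. The names (`CUSTOMER_EMBEDDINGS`, `PRODUCT_EMBEDDINGS`, `recommend`) and all values are hypothetical and are not taken from `trainer_module`; in the real model, both towers are learned from the reviews.

```python
# Toy two-tower retrieval with hypothetical, hand-picked 2-d embeddings.
CUSTOMER_EMBEDDINGS = {
    "customer_a": [1.0, 0.0],
    "customer_b": [0.0, 1.0],
}
PRODUCT_EMBEDDINGS = {
    "product_1": [0.9, 0.1],
    "product_2": [0.2, 0.8],
    "product_3": [0.5, 0.5],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(customer_id, k=2):
    """Return the k product IDs whose embeddings score highest for the customer."""
    query = CUSTOMER_EMBEDDINGS[customer_id]
    scores = {pid: dot(query, emb) for pid, emb in PRODUCT_EMBEDDINGS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend("customer_a"))
print(recommend("customer_b"))
```

Because the two towers only interact through the dot product, product embeddings can be computed ahead of time and searched quickly at serving time.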

In [None]:
from importlib import reload
from pathlib import Path

import pandas as pd
import tensorflow as tf
import tensorflow_model_analysis as tfma
from absl import logging
from tfx import v1 as tfx
from tfx.components import (
    CsvExampleGen,
    Evaluator,
    Pusher,
    SchemaGen,
    StatisticsGen,
    Transform,
)
from tfx.orchestration.experimental.interactive.interactive_context import (
    InteractiveContext,
)
from tfx.types.standard_component_specs import (
    BLESSING_KEY,
    EVALUATION_KEY,
    EXAMPLES_KEY,
    MODEL_KEY,
    POST_TRANSFORM_SCHEMA_KEY,
    SCHEMA_KEY,
    STATISTICS_KEY,
    TRANSFORM_GRAPH_KEY,
    TRANSFORMED_EXAMPLES_KEY,
)

from recommender_systems import evaluator_module, trainer_module, transform_module
from recommender_systems.features import ProductFeatures
from recommender_systems.splits import Splits
from tfx_tfrs.trainer import Trainer

logging.set_verbosity(logging.INFO)

DATA = Path.cwd().parent / "data"

PIPELINE_NAME = "recommender_systems"

context = InteractiveContext(
    pipeline_name=PIPELINE_NAME,
    pipeline_root=str(Path("pipeline-root") / PIPELINE_NAME),
)

%load_ext tensorboard

Set `PARTICIPANT` below so that your trained models can be identified in the Google Cloud Storage bucket.

In [None]:
# TODO[IndabaX]: Enter your name here
PARTICIPANT = "stefan-dominicus"

## Ingest Reviews

### Examples
Docs:
- https://www.tensorflow.org/tfx/guide/examplegen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/CsvExampleGen
- https://github.com/tensorflow/tfx/blob/master/tfx/proto/example_gen.proto

In [None]:
reviews_example_gen_component = CsvExampleGen(
    input_base=str(DATA / "reviews"),
    input_config=tfx.proto.Input(
        splits=[
            tfx.proto.Input.Split(name=split, pattern=f"{split}.csv")
            for split in [Splits.TRAIN, Splits.VALIDATION]
        ]
    ),
)
context.run(reviews_example_gen_component, enable_cache=True)

### Statistics
Docs:
- https://www.tensorflow.org/tfx/guide/statsgen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/StatisticsGen

In [None]:
reviews_statistics_gen_component = StatisticsGen(
    examples=reviews_example_gen_component.outputs[EXAMPLES_KEY]
)
context.run(reviews_statistics_gen_component, enable_cache=True)

In [None]:
context.show(reviews_statistics_gen_component.outputs[STATISTICS_KEY])

### Schema
Docs:
- https://www.tensorflow.org/tfx/guide/schemagen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/SchemaGen

In [None]:
reviews_schema_gen_component = SchemaGen(
    statistics=reviews_statistics_gen_component.outputs[STATISTICS_KEY]
)
context.run(reviews_schema_gen_component, enable_cache=True)

In [None]:
context.show(reviews_schema_gen_component.outputs[SCHEMA_KEY])

## Transform Reviews
Feature transformation is one of the most significant benefits of the TFX framework: transformations are defined once, run efficiently during training, and are applied identically in production.
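
To make the production-consistency point above concrete, here is a minimal plain-Python sketch of the analyse-then-transform idea: a vocabulary is computed once in a full pass over the data, and the same fixed mapping is then applied everywhere. This only illustrates the concept. It is not TensorFlow Transform code, and the function names and sample IDs are made up.

```python
def analyse_vocabulary(values):
    """Full pass over the data: assign each distinct value an integer index."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def transform(value, vocabulary):
    """Apply the fixed vocabulary; unseen values share one out-of-vocabulary index."""
    return vocabulary.get(value, len(vocabulary))


# Hypothetical product IDs seen during analysis
training_ids = ["P3", "P1", "P2", "P1"]
vocabulary = analyse_vocabulary(training_ids)

# The same mapping is reused at training time and at serving time
print([transform(v, vocabulary) for v in training_ids])
print(transform("P9", vocabulary))
```

An ID that was never analysed (here `"P9"`) falls into the out-of-vocabulary index, which is why analysing more than one split can improve vocabulary coverage.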

### Transform
Docs:
- https://www.tensorflow.org/tfx/guide/transform
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Transform

In [None]:
# TODO[IndabaX]: Open `recommender_systems/transform_module.py`

reload(transform_module)

transform_component = Transform(
    examples=reviews_example_gen_component.outputs[EXAMPLES_KEY],
    schema=reviews_schema_gen_component.outputs[SCHEMA_KEY],
    module_file=transform_module.__file__,
    splits_config=tfx.proto.SplitsConfig(
        # Analyse all splits for full vocabulary coverage (default: train only)
        analyze=[Splits.TRAIN, Splits.VALIDATION],
        # Transform (and materialise) examples from all splits (default)
        transform=[Splits.TRAIN, Splits.VALIDATION],
    ),
)
context.run(transform_component, enable_cache=True)

### Statistics
Docs:
- https://www.tensorflow.org/tfx/guide/statsgen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/StatisticsGen

In [None]:
post_transform_statistics_gen_component = StatisticsGen(
    examples=transform_component.outputs[TRANSFORMED_EXAMPLES_KEY]
)
context.run(post_transform_statistics_gen_component, enable_cache=True)

In [None]:
context.show(post_transform_statistics_gen_component.outputs[STATISTICS_KEY])

### Schema

In [None]:
context.show(transform_component.outputs[POST_TRANSFORM_SCHEMA_KEY])

## Ingest Products

### Examples
Docs:
- https://www.tensorflow.org/tfx/guide/examplegen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/CsvExampleGen
- https://github.com/tensorflow/tfx/blob/master/tfx/proto/example_gen.proto

In [None]:
product_example_gen_component = CsvExampleGen(
    input_base=str(DATA),
    input_config=tfx.proto.Input(
        splits=[tfx.proto.Input.Split(name=Splits.SINGLE, pattern="products.csv")]
    ),
    output_config=tfx.proto.Output(
        split_config=tfx.proto.SplitConfig(
            splits=[tfx.proto.SplitConfig.Split(name=Splits.SINGLE, hash_buckets=1)]
        )
    ),
)
context.run(product_example_gen_component, enable_cache=True)
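
The `hash_buckets=1` setting above keeps every product in a single output split. To illustrate how hash-bucket splitting works in general, the sketch below assigns records to splits by hashing them into buckets. It is a simplified stand-in built on `hashlib`, not the hashing that ExampleGen uses internally, and `assign_split` is a hypothetical helper.

```python
import hashlib


def assign_split(record, splits):
    """Pick a split for a record; `splits` is a list of (name, hash_buckets)."""
    total_buckets = sum(buckets for _, buckets in splits)
    digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total_buckets
    # Walk the splits in order until the bucket falls inside one of them
    for name, buckets in splits:
        if bucket < buckets:
            return name
        bucket -= buckets
    raise AssertionError("unreachable: bucket always falls within a split")


records = [f"product-{i}" for i in range(100)]

# One split with one bucket: every record lands in it
print({assign_split(r, [("single", 1)]) for r in records})

# A 3:1 bucket ratio sends roughly three quarters of the records to "train"
counts = {"train": 0, "eval": 0}
for r in records:
    counts[assign_split(r, [("train", 3), ("eval", 1)])] += 1
print(counts)
```

Hashing makes the assignment deterministic, so the same record always lands in the same split across runs.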

### Statistics
Docs:
- https://www.tensorflow.org/tfx/guide/statsgen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/StatisticsGen

In [None]:
product_statistics_gen_component = StatisticsGen(
    examples=product_example_gen_component.outputs[EXAMPLES_KEY]
)
context.run(product_statistics_gen_component, enable_cache=True)

In [None]:
context.show(product_statistics_gen_component.outputs[STATISTICS_KEY])

### Schema
Docs:
- https://www.tensorflow.org/tfx/guide/schemagen
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/SchemaGen

In [None]:
product_schema_gen_component = SchemaGen(
    statistics=product_statistics_gen_component.outputs[STATISTICS_KEY]
)
context.run(product_schema_gen_component, enable_cache=True)

In [None]:
context.show(product_schema_gen_component.outputs[SCHEMA_KEY])

## Train Model

### Trainer
Docs:
- https://www.tensorflow.org/tfx/guide/trainer
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Trainer
- https://github.com/tensorflow/tfx/blob/master/tfx/proto/trainer.proto

In [None]:
# TODO[IndabaX]: Open `recommender_systems/trainer_module.py`

reload(trainer_module)

trainer_component = Trainer(
    examples=transform_component.outputs[TRANSFORMED_EXAMPLES_KEY],
    transform_graph=transform_component.outputs[TRANSFORM_GRAPH_KEY],
    schema=transform_component.outputs[POST_TRANSFORM_SCHEMA_KEY],
    item_examples=product_example_gen_component.outputs[EXAMPLES_KEY],
    item_schema=product_schema_gen_component.outputs[SCHEMA_KEY],
    module_file=trainer_module.__file__,
    train_args=tfx.proto.TrainArgs(splits=[Splits.TRAIN]),
    eval_args=tfx.proto.EvalArgs(splits=[Splits.VALIDATION]),
    custom_config=dict(
        # NOTE: `tensorboard_log_dir` must match in the next cell
        tensorboard_log_dir="tensorboard",
    ),
)
context.run(trainer_component, enable_cache=False)

Use TensorBoard to view the training and validation metrics.

In [None]:
%tensorboard --logdir tensorboard

### Evaluator
Docs:
- https://www.tensorflow.org/tfx/guide/evaluator
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Evaluator
- https://github.com/tensorflow/tfx/blob/master/tfx/proto/evaluator.proto

In [None]:
# TODO[IndabaX]: Open `recommender_systems/evaluator_module.py`
# TODO[IndabaX]: Consider referencing a baseline model for validation

reload(evaluator_module)

evaluator_component = Evaluator(
    examples=reviews_example_gen_component.outputs[EXAMPLES_KEY],
    model=trainer_component.outputs[MODEL_KEY],
    example_splits=[Splits.VALIDATION],
    eval_config=tfma.EvalConfig(
        metrics_specs=[
            tfma.MetricsSpec(
                metrics=[
                    tfma.MetricConfig(
                        class_name="ExampleCount",
                        threshold=tfma.MetricThreshold(
                            value_threshold=tfma.GenericValueThreshold(
                                lower_bound=dict(value=1)
                            ),
                        ),
                    ),
                    tfma.MetricConfig(
                        class_name="TopKAccuracy",
                        module=evaluator_module.__name__,
                    ),
                ],
            ),
        ],
        model_specs=[
            tfma.ModelSpec(
                label_key=ProductFeatures.ID,
                signature_name="evaluate_products_for_customer",
            ),
        ],
    ),
    schema=reviews_schema_gen_component.outputs[SCHEMA_KEY],
)
context.run(evaluator_component, enable_cache=False)
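
The `TopKAccuracy` metric referenced above is defined in `recommender_systems/evaluator_module.py`, which isn't shown here. As a rough guide to what such a metric typically measures, the sketch below computes the fraction of examples whose true product appears among the first k recommendations. The function name and sample data are hypothetical, and the workshop's implementation may differ.

```python
def top_k_accuracy(labels, ranked_predictions, k):
    """Fraction of examples whose label is among the first k ranked predictions."""
    hits = sum(
        1 for label, ranked in zip(labels, ranked_predictions) if label in ranked[:k]
    )
    return hits / len(labels)


# Hypothetical labels and ranked recommendations for three reviews
labels = ["P1", "P2", "P3"]
ranked_predictions = [
    ["P1", "P4", "P5"],  # hit at rank 1
    ["P4", "P2", "P5"],  # hit at rank 2
    ["P4", "P5", "P6"],  # miss
]

print(top_k_accuracy(labels, ranked_predictions, k=1))
print(top_k_accuracy(labels, ranked_predictions, k=2))
```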

The TFX Evaluator uses TensorFlow Model Analysis under the hood. We can use the same library to inspect the evaluation output.

In [None]:
output_path = evaluator_component.outputs[EVALUATION_KEY].get()[0].uri

# Load the evaluation result
eval_result = tfma.load_eval_result(output_path)
print("\n📈 EvalResult:\n", eval_result)

# Load the evaluation metrics
metrics = tfma.load_metrics(output_path)
print("\n🎯 Metrics:\n", list(metrics))

# Load the validation results
validation_result = tfma.load_validation_result(output_path)
print("\n✅ ValidationResult:\n", validation_result)
if not validation_result.validation_ok:
    print("\n❌ Validation failed (model not blessed).")

Let's also inspect some results to get a sense of what the model may have learned.

In [None]:
# Load customers, and pick one at random
customers = pd.read_csv(DATA / "customers.csv")
random_customer_id = customers.sample(1)["customer_id"].values[0]

# Load products
products = pd.read_csv(DATA / "products.csv")

# Load reviews for the random customer
reviews = pd.read_csv(DATA / "reviews.csv")
reviews = reviews[reviews["customer_id"] == random_customer_id]

# Merge reviews with product titles, sort by timestamp, and drop unnecessary columns
df = (
    reviews.merge(products[["product_id", "product_title"]], on="product_id")
    .sort_values("review_timestamp", ascending=False)
    .reset_index(drop=True)
    .drop(["review_id", "review_text", "review_timestamp", "customer_id"], axis=1)
    .rename(
        columns={
            "product_id": "reviewed_product_id",
            "product_title": "reviewed_product_title",
        }
    )
)


# Load the serving model exported by the most recent Trainer run
model = tf.saved_model.load(
    str(Path(trainer_component.outputs[MODEL_KEY].get()[0].uri) / "Format-Serving")
).signatures[tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY]

# Make predictions for the random customer
prediction = model(customer_id=tf.constant(random_customer_id, shape=(1, 1)))
recommended_product_ids = pd.Series(
    prediction["product_ids"].numpy().squeeze().astype(str).tolist(), name="product_id"
)

# Merge the recommended product IDs with their product titles.
# The recommendations are the left frame, so the model's ranking order is kept.
recommendations_for_customer = (
    recommended_product_ids.to_frame()
    .merge(products[["product_id", "product_title"]], on="product_id", how="left")
    .rename(
        columns={
            "product_id": "recommended_product_id",
            "product_title": "recommended_product_title",
        }
    )
)

# Merge the reviews with the recommendations (so we can view them side-by-side)
df = df.merge(
    recommendations_for_customer, how="left", left_index=True, right_index=True
)
df.head(20)

Remember, you can view specific products on Takealot.com if you want more info.

In [None]:
product_id = "PLID95234247"

print(f"https://takealot.com/abc/{product_id}")

### Pusher
Docs:
- https://www.tensorflow.org/tfx/guide/pusher
- https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Pusher
- https://github.com/tensorflow/tfx/blob/master/tfx/proto/pusher.proto

In [None]:
# TODO[IndabaX]: Make sure you've set `PARTICIPANT` to your name
# TODO[IndabaX]: Consider using the validation result to avoid pushing bad models

pusher_component = Pusher(
    model=trainer_component.outputs[MODEL_KEY],
    model_blessing=evaluator_component.outputs[BLESSING_KEY],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=f"gs://tal-deep-learning-indabax-models/{PARTICIPANT}",
            versioning=tfx.proto.Versioning.UNIX_TIMESTAMP,
        )
    ),
)
context.run(pusher_component, enable_cache=True)

Well done! Check the leaderboard to see how your model compares to others. If you have time, feel free to try to improve your model's performance.