[![Open In
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alibaba/feathub/blob/master/docs/examples/feature_embedding.ipynb)

# Feature Embedding

Feature embedding translates a high-dimensional feature vector into a
lower-dimensional vector; the embedding can be learned once and reused across
models. This example shows how to define feature embeddings with a Python UDF
(user-defined function).

We use a sample hotel review dataset downloaded from the [Azure-Samples
repository](https://github.com/Azure-Samples/azure-search-sample-data). The
original dataset is available on
[Kaggle](https://www.kaggle.com/datasets/datafiniti/hotel-reviews).

For the embedding, a pre-trained [HuggingFace Transformer
model](https://huggingface.co/sentence-transformers) encodes review text into
numerical vectors. Text embeddings can be used for many NLP problems, such as
detecting fake reviews, sentiment analysis, and finding similar hotels.
Building such models is out of scope for this notebook.
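
As a rough illustration of the "finding similar" use case, embeddings are often compared with cosine similarity. The sketch below is not part of the FeatHub pipeline; the three-dimensional vectors and the `review_*` names are made up for illustration and stand in for real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real review embeddings.
toy_embeddings = {
    "review_a": [0.9, 0.1, 0.0],
    "review_b": [0.8, 0.2, 0.1],
    "review_c": [0.0, 0.1, 0.9],
}

# Rank the other reviews by similarity to review_a, most similar first.
query = toy_embeddings["review_a"]
ranked = sorted(
    (name for name in toy_embeddings if name != "review_a"),
    key=lambda name: cosine_similarity(query, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings the same ranking idea applies, only with vectors of a few hundred dimensions instead of three.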

You can run this example interactively in Colab by clicking the badge at the
top left corner of this notebook.

## Install dependencies

This example has been verified with Python 3.7 and the following libraries:

- feathub-nightly[flink]
- sentence-transformers
- plotly
- matplotlib

Execute the following cells to install these dependencies. **If the notebook is
executed in Colab, restart the runtime after running these cells so that
Python 3.7 is correctly configured to execute the Python cells.**

In [None]:
%%bash
python_version=`python -V`
if [[ $python_version != *"3.7"* ]]; then
    # install python 3.7
    sudo apt-get update -y
    sudo apt-get install -y python3.7 python3-pip python3.7-distutils python3-apt

    # change alternatives
    sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 0
    sudo update-alternatives --set python3 /usr/bin/python3.7
fi

In [None]:
%%bash
feathub_dependencies=`pip list | grep feathub`
if [[ -z "$feathub_dependencies" ]]; then
    pip install "feathub-nightly[flink]"
fi

pip install sentence-transformers plotly matplotlib
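
After restarting the runtime, it can help to confirm which interpreter the Python cells actually use. The following is a minimal sketch using only the standard library; the helper name `is_expected_python` is hypothetical and not part of FeatHub.

```python
import sys
from typing import Sequence


def is_expected_python(
    expected: Sequence[int],
    actual: Sequence[int] = tuple(sys.version_info[:3]),
) -> bool:
    """Return True if `actual` starts with the version numbers in `expected`."""
    return tuple(actual[: len(expected)]) == tuple(expected)


# Report the version of the interpreter running this cell.
print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Is Python 3.7:", is_expected_python((3, 7)))
```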

## Import Python dependencies

In [None]:
import os
from typing import Any, List

import numpy as np
import pandas as pd
import requests
from sentence_transformers import SentenceTransformer

from feathub.common import types
from feathub.feathub_client import FeathubClient
from feathub.feature_tables.sinks.memory_store_sink import MemoryStoreSink
from feathub.feature_tables.sources.file_system_source import FileSystemSource
from feathub.feature_tables.sources.memory_store_source import MemoryStoreSource
from feathub.feature_views.derived_feature_view import DerivedFeatureView
from feathub.feature_views.feature import Feature
from feathub.feature_views.on_demand_feature_view import OnDemandFeatureView
from feathub.feature_views.transforms.python_udf_transform import PythonUdfTransform
from feathub.table.schema import Schema

## Download and preprocess resource files

Download the hotel review dataset and append an incremental integer column,
`reviews_id`, to serve as the review ID.

In [None]:
source_file_name = "HotelReviews_data.csv"

if not os.path.exists(source_file_name):
    url = (
        "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/hotelreviews/"
        + source_file_name
    )
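    r = requests.get(url)
    r.raise_for_status()  # fail early instead of saving an error page as CSV
    with open(source_file_name, "wb") as f:
        f.write(r.content)
    df = pd.read_csv(source_file_name)
    os.remove(source_file_name)
    # Use the default 0-based row index as an incremental review ID.
    df["reviews_id"] = df.index
    # Write without a header row; column names come from the schema declared below.
    df.to_csv(source_file_name, index=False, header=False)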
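
To make the preprocessing step concrete, here is a small standard-library sketch of the same idea on made-up data: drop the header row and append a 0-based row number as the last column. The two sample rows are invented for illustration; the notebook itself uses pandas as shown above.

```python
import csv
import io

# Made-up two-row CSV standing in for the downloaded dataset.
raw_csv = "name,reviews_text\nHotel A,Great stay\nHotel B,Too noisy\n"

reader = csv.reader(io.StringIO(raw_csv))
next(reader)  # skip the header row, since the output is written without one

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for review_id, row in enumerate(reader):
    # Append the 0-based row number as the review ID.
    writer.writerow(row + [review_id])

result_csv = out.getvalue()
print(result_csv)
```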

## Initialize FeatHub client

In [None]:
client = FeathubClient(
    props={
        "processor": {
            "type": "flink",
            "flink": {
                "master": "local",
            },
        },
        "registry": {
            "type": "local",
            "local": {
                "namespace": "default",
            },
        },
        "feature_service": {
            "type": "local",
            "local": {},
        },
    }
)

## Specify source dataset

In [None]:
schema = (
    Schema.new_builder()
    .column("address", types.String)
    .column("categories", types.String)
    .column("city", types.String)
    .column("country", types.String)
    .column("latitude", types.Float64)
    .column("longitude", types.Float64)
    .column("name", types.String)
    .column("postalCode", types.String)
    .column("province", types.String)
    .column("reviews_date", types.String)
    .column("reviews_dateAdded", types.String)
    .column("reviews_rating", types.Int32)
    .column("reviews_text", types.String)
    .column("reviews_title", types.String)
    .column("reviews_username", types.String)
    .column("reviews_id", types.Int32)
    .build()
)

source = FileSystemSource(
    name="source_1",
    path=source_file_name,
    data_format="csv",
    schema=schema,
    keys=["reviews_id"],
)

## Create feature embedding UDF

Create the feature-embedding UDF from [a pretrained Transformer model from
HuggingFace](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2).

In [None]:
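def predict_batch_udf(row: pd.Series) -> List[float]:
    # Receives one source row. Note that the model is re-created on every call.
    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    # encode() returns a NumPy array; convert it to a list of Python floats.
    return [float(x) for x in model.encode(row["reviews_text"]).tolist()]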
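
Constructing the model inside the UDF means it is loaded each time the function runs. One common way to avoid repeated loading in plain Python is to cache the loader. The sketch below shows the pattern with a hypothetical `StandInModel` in place of `SentenceTransformer`, so that it runs without downloading anything. Whether this pattern works with the way FeatHub and Flink serialize UDFs has not been verified here, so treat it as an idea to test rather than a drop-in replacement.

```python
import functools
from typing import List

load_count = 0  # counts how many times the stand-in model is constructed


class StandInModel:
    """Hypothetical placeholder for an expensive-to-load encoder."""

    def encode(self, text: str) -> List[float]:
        # Toy "embedding": character count and word count, not a real model.
        return [float(len(text)), float(len(text.split()))]


@functools.lru_cache(maxsize=1)
def get_model() -> StandInModel:
    """Construct the model on first use and reuse it afterwards."""
    global load_count
    load_count += 1
    return StandInModel()


def embed(text: str) -> List[float]:
    # Every call shares the single cached model instance.
    return get_model().encode(text)


vectors = [embed(t) for t in ["clean room", "friendly staff at check-in"]]
print(vectors, load_count)
```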

## Build and register features

In [None]:
feature_view = DerivedFeatureView(
    name="feature_view",
    source=source,
    features=[
        Feature(
            name="reviews_text_embedding",
            dtype=types.Float64Vector,
            transform=PythonUdfTransform(predict_batch_udf),
            keys=["reviews_id"],
        )
    ],
    keep_source_fields=True,
)

_ = client.build_features([feature_view])

## Materialize features into online feature store

In [None]:
sink = MemoryStoreSink(table_name="table_name_1")

job = client.materialize_features(
    feature_descriptor=feature_view,
    sink=sink,
    allow_overwrite=True,
)
job.wait(timeout_ms=10000)

## Fetch features from online feature store with on-demand transformations

In [None]:
source = MemoryStoreSource(
    name="online_store_source",
    keys=["reviews_id"],
    table_name="table_name_1",
)
on_demand_feature_view = OnDemandFeatureView(
    name="on_demand_feature_view",
    features=[
        "online_store_source.name",
        "online_store_source.reviews_text",
        "online_store_source.reviews_text_embedding",
    ],
    request_schema=Schema.new_builder().column("reviews_id", types.Int32).build(),
)
client.build_features([source, on_demand_feature_view])

request_df = pd.DataFrame(np.array([[i] for i in range(19)]), columns=["reviews_id"])
online_features = client.get_online_features(
    request_df=request_df,
    feature_view=on_demand_feature_view,
)

online_features

## Visualize online features

Let's visualize the feature values. Here, we use scikit-learn's t-SNE
(t-distributed Stochastic Neighbor Embedding) implementation to project the
embedding vectors into 2D space for plotting.

In [None]:
import numpy as np
import plotly.graph_objs as go
from sklearn.manifold import TSNE


X = np.stack(online_features["reviews_text_embedding"], axis=0)
result = TSNE(
    n_components=2,
    init="random",
    perplexity=10,
).fit_transform(X)

result[:10]

In [None]:
names = set(online_features["name"])
names

In [None]:
fig = go.Figure()

for name in names:
    mask = online_features["name"] == name

    fig.add_trace(
        go.Scatter(
            x=result[mask, 0],
            y=result[mask, 1],
            name=name,
            textposition="top center",
            mode="markers+text",
            marker={
                "size": 8,
                "opacity": 0.8,
            },
        )
    )

fig.update_layout(
    margin={"l": 0, "r": 0, "b": 0, "t": 0},
    showlegend=True,
    autosize=False,
    width=1000,
    height=500,
)
fig.show()