>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Embeddings)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Embeddings) to leverage the power of whylogs and WhyLabs together!*

# Logging Generic Embeddings Data using Reference Distances

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/experimental/embeddings/Embeddings_Distance_Logging.ipynb)

High dimensional embedding spaces can be difficult to understand because we often rely on our own subjective judgement of clusters in the space. Often, data scientists try to find issues solely by hovering over individual data points and noting trends in which ones feel out of place.

In whylogs, you can profile embedding values by comparing them to reference data points. These references can be chosen entirely by the user (helpful when they represent prototypical "ideal" representations of a cluster or scenario), or they can be chosen programmatically.
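
As a rough illustration of the idea (not the whylogs implementation), the sketch below computes the Euclidean distance from each embedding to each reference with plain NumPy. The function name `distances_to_references` and the toy 2-D values are made up for this example.

```python
import numpy as np

def distances_to_references(embeddings: np.ndarray, references: np.ndarray) -> np.ndarray:
    """Return an (n_embeddings, n_references) matrix of Euclidean distances."""
    # Broadcast to (n_embeddings, n_references, n_dims), then reduce over the dims axis
    diffs = embeddings[:, None, :] - references[None, :, :]
    return np.linalg.norm(diffs, axis=2)

# Hypothetical toy data: two references and three embeddings in 2-D
toy_references = np.array([[0.0, 0.0], [10.0, 0.0]])
toy_embeddings = np.array([[3.0, 4.0], [10.0, 0.0], [6.0, 8.0]])

toy_distances = distances_to_references(toy_embeddings, toy_references)
nearest_reference = toy_distances.argmin(axis=1)  # index of the closest reference per embedding
print(toy_distances)
print(nearest_reference)
```

Profiling the distribution of each column of such a distance matrix, rather than the raw vectors, is what keeps the summary small even when the embeddings are high dimensional.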

## Setup

### Install package extras for whylogs

For convenience, we include helper functions to select reference data points for comparing new embedding vectors against. To follow this notebook in full, you need the `embeddings` extra (for helper functions) and the `viz` extra (for visualizing drift) when installing whylogs. The cell below installs the `all` extra, which should cover both.

In [1]:
# Note: you may need to restart the kernel to use updated packages.
%pip install --upgrade whylogs[all] -q

Note: you may need to restart the kernel to use updated packages.


## MNIST dataset

### Downloading from OpenML

We'll use the 784-dimensional MNIST dataset as our example. This can be downloaded from OpenML via scikit-learn. Because the download can take a few minutes, we suggest saving the data locally as well.

In [2]:
import os
import pickle
from sklearn.datasets import fetch_openml

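cache_path = "mnist_784_X_y.pkl"
if os.path.exists(cache_path):
    # Reuse the local copy saved by a previous run
    with open(cache_path, "rb") as f:
        X, y = pickle.load(f)
else:
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    # Save a local copy so later runs can skip the download
    with open(cache_path, "wb") as f:
        pickle.dump((X, y), f)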



### Splitting into training and production datasets

We won't train a model here, but we'll use scikit-learn's `train_test_split`, normally used to hold out test data, to split our dataset into an original training set and the data we'll see on our first day of production.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_prod, y_train, y_prod = train_test_split(X, y, test_size=0.1)

In [4]:
import numpy as np
length_of_x = X_prod.shape[0]
print(length_of_x)

# here we generate some placeholder ids, in practice these could be uuid or some unique string or int
ids = np.arange(length_of_x)
id_vector_tuples = np.column_stack((ids, X_prod))
print(id_vector_tuples.shape)


7000
(7000, 785)


## Finding references

We would like to compare incoming embeddings against up to 30 predefined references. These can be chosen by the user either manually or algorithmically. Both of the provided reference selection algorithms run on the raw data, but only for the purpose of finding the references themselves.

#### Manual selection

If we had prototypical examples of digits that we wanted to compare our incoming data against, we would collect those data points now.
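
A minimal sketch of what manual collection could look like, assuming we already know the row index of one representative example per label. The helper `pick_manual_references` and the small stand-in arrays are hypothetical and only illustrate the shape of the result.

```python
import numpy as np

def pick_manual_references(X: np.ndarray, y: np.ndarray, chosen_indices: dict) -> tuple:
    """Build (references, labels) from a hand-picked {label: row index} mapping."""
    labels = sorted(chosen_indices)
    for label in labels:
        # Guard against picking a row whose label does not match the intended one
        if y[chosen_indices[label]] != label:
            raise ValueError(f"row {chosen_indices[label]} is not labeled {label!r}")
    references = np.stack([X[chosen_indices[label]] for label in labels])
    return references, labels

# Hypothetical stand-in data: 6 vectors of 4 dims with string labels
X_demo = np.arange(24, dtype=float).reshape(6, 4)
y_demo = np.array(["0", "1", "0", "2", "1", "2"])

manual_references, manual_labels = pick_manual_references(X_demo, y_demo, {"0": 2, "1": 1, "2": 5})
print(manual_references.shape, manual_labels)
```

The resulting array has one row per reference, in the same order as the labels.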

#### Algorithmic selection for labeled data

If we have labels for our data, selecting the centroids of clusters for each label makes sense. We provide a helper class, `PCACentroidsSelector`, that finds the centroids in PCA space before converting back to the raw 784-dimensional space.

Let's utilize the labels available in the dataset for determining our references.

In [5]:
from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector

references, labels = PCACentroidsSelector(n_components=20).calculate_references(X_train, y_train)
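
For intuition, selecting centroids in PCA space could look roughly like the following scikit-learn sketch. This is an assumption about the general technique, not the actual `PCACentroidsSelector` code; `pca_centroid_references` and the synthetic two-cluster data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_centroid_references(X, y, n_components=2):
    """Average each label's points in PCA space, then map the centroids back to raw space."""
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    labels = np.unique(y)
    # One centroid per label, computed in the reduced space
    centroids_pca = np.stack([X_pca[y == label].mean(axis=0) for label in labels])
    return pca.inverse_transform(centroids_pca), labels.tolist()

# Synthetic stand-in for labeled embeddings: two well-separated clusters in 5 dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)), rng.normal(10.0, 1.0, size=(100, 5))])
y_demo = np.array(["a"] * 100 + ["b"] * 100)

demo_references, demo_labels = pca_centroid_references(X_demo, y_demo)
print(demo_references.shape, demo_labels)
```

Each returned reference has the same dimensionality as the raw data, so it can be compared directly against incoming vectors.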

#### Algorithmic selection for unlabeled data

If we don't have labels for our data, we can instead use the centroids of clusters found without supervision. We provide a helper class, `PCAKMeansSelector`, that finds k-means centroids in PCA space and then converts them back to the raw space.

We'll also calculate these but will elect to use the supervised version for the rest of the notebook.

In [6]:
from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector

unsup_references, unsup_labels = PCAKMeansSelector(n_clusters=8, n_components=20).calculate_references(X_train, y_train)

PCAKMeansSelector is unsupervised; ignoring labels
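
The unsupervised variant can be pictured in a similar way. The sketch below is a guess at the general approach using scikit-learn's `PCA` and `KMeans`, not the `PCAKMeansSelector` source; the helper name, the integer cluster labels, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def pca_kmeans_references(X, n_clusters=2, n_components=2, seed=0):
    """Cluster in PCA space with k-means, then map the cluster centers back to raw space."""
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_pca)
    references = pca.inverse_transform(kmeans.cluster_centers_)
    labels = list(range(n_clusters))  # clusters have no names, so use their indices
    return references, labels

# Synthetic unlabeled data: two well-separated clusters in 5 dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)), rng.normal(10.0, 1.0, size=(100, 5))])

kmeans_references, kmeans_labels = pca_kmeans_references(X_demo)
print(kmeans_references.shape, kmeans_labels)
```

Because no labels are involved, the order of the returned references carries no meaning beyond the cluster index.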


## Profiling with whylogs

As with other advanced features, we can create a `DeclarativeSchema` to tell whylogs to resolve columns of a certain name to the `EmbeddingMetric` that we want to use.

We must pass our references, labels, and preferred distance function (either cosine distance or Euclidean distance) as parameters to `EmbeddingConfig` then log as normal.
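
To make the choice of distance function concrete, here is a small standalone sketch using scikit-learn's pairwise helpers. The three vectors are invented for illustration: cosine distance ignores vector length and only reflects direction, while Euclidean distance is sensitive to both.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

reference = np.array([[1.0, 2.0, 3.0]])
same_direction = np.array([[2.0, 4.0, 6.0]])  # the reference scaled by 2
orthogonal = np.array([[3.0, 0.0, -1.0]])     # dot product with the reference is 0

cos_same = cosine_distances(same_direction, reference)[0, 0]
euc_same = euclidean_distances(same_direction, reference)[0, 0]
cos_orth = cosine_distances(orthogonal, reference)[0, 0]
euc_orth = euclidean_distances(orthogonal, reference)[0, 0]

print(f"scaled copy: cosine {cos_same:.3f}, euclidean {euc_same:.3f}")
print(f"orthogonal:  cosine {cos_orth:.3f}, euclidean {euc_orth:.3f}")
```

For raw pixel values, where overall brightness matters, Euclidean distance is a natural fit; cosine distance may suit embeddings whose magnitude is not meaningful.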

In [26]:
from typing import Union
import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)
from whylogs.experimental.core.metrics.udf_metric import (
    generate_udf_schema,
    register_metric_udf,
)

@register_metric_udf(col_name="id_vector_tuple")
def embeddings_outliers(indexed_vectors: np.ndarray,
                        embedding_references = references,
                        DISTANCE_THRESHOLD = 850) -> Union[str, float, None]:
    from sklearn.metrics.pairwise import euclidean_distances

    # Work on the logged array itself rather than a global:
    # column 0 holds the row ids, the remaining columns hold the embedding values.
    # np.atleast_2d turns a single (id, vector) row into a 1-row matrix.
    indexed_vectors = np.atleast_2d(indexed_vectors)
    row_ids = indexed_vectors[:, 0]
    matrix = indexed_vectors[:, 1:]

    outlier_row_ids = []
    min_d = None
    for row_id, vector in zip(row_ids, matrix):
        # Distances from this vector to every reference, shape (1, n_references)
        reference_distances = euclidean_distances([vector], embedding_references)
        nearest = reference_distances.min()
        min_d = nearest if min_d is None else min(min_d, nearest)
        # Flag the row when it lies within DISTANCE_THRESHOLD of any reference
        if (reference_distances < DISTANCE_THRESHOLD).any():
            outlier_row_ids.append(row_id)

    # Return the flagged ids as a string, or the smallest distance seen if none were flagged
    if outlier_row_ids:
        return str(outlier_row_ids)
    return min_d

config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)
schema = DeclarativeSchema(
    [ResolverSpec(column_name="pixel_values", metrics=[MetricSpec(EmbeddingMetric, config)])] + generate_udf_schema(),
)

train_profile = why.log(row={"pixel_values": X_train}, schema=schema)

Let's confirm that our profile measures the distribution of embeddings relative to the references we've provided.

In [8]:
train_profile_view = train_profile.view()
column = train_profile_view.get_column("pixel_values")
summary = column.to_summary_dict()
for digit in [str(i) for i in range(10)]:
    mean = summary[f'embedding/{digit}_distance:distribution/mean']
    stddev = summary[f'embedding/{digit}_distance:distribution/stddev']
    print(f"{digit} distance: mean {mean}   stddev {stddev}")

0 distance: mean 2190.804087986482   stddev 202.9271250806984
1 distance: mean 2065.1900710071172   stddev 473.02864974704306
2 distance: mean 1995.5666316550871   stddev 233.4807738289276
3 distance: mean 1998.5577454476095   stddev 274.5659056313274
4 distance: mean 1978.138399595618   stddev 304.4965127662653
5 distance: mean 1911.848097052227   stddev 254.1624837472048
6 distance: mean 2026.1385940336384   stddev 259.55087444579095
7 distance: mean 2013.5022809924112   stddev 349.2571781182969
8 distance: mean 1936.5924006704263   stddev 263.6855684871088
9 distance: mean 1943.798363232508   stddev 329.4170641147685


## Measuring embeddings drift in WhyLabs

This distance-based approach lets us measure drift across new batches of embeddings programmatically, using drift metrics as well as the WhyLabs Observability Platform.

We'll look at a single example where an engineer introduces a change to reduce the amount of unnecessary processing by filtering out images where more than 90% of pixels are zeros. This is a realistic cleaning step that might be added to an ML pipeline, but will have a detrimental impact on our incoming data, especially the 1s.

In [9]:
# Keep only the digits where at most 90% of the pixels are zero
not_empty_mask = (X_prod == 0).sum(axis=1) <= (0.9 * 784)
X_prod_filtered = X_prod[not_empty_mask]
id_vectors_filtered = id_vector_tuples[not_empty_mask]
y_prod_filtered = y_prod[not_empty_mask]
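
To check which digits a rule like this removes, one could count the dropped rows per label. The sketch below applies the same kind of mask to a tiny made-up dataset; `removed_per_label` and the 20-pixel "images" are hypothetical stand-ins for the MNIST arrays.

```python
import numpy as np

def removed_per_label(X, y, max_zero_fraction=0.9):
    """Return the keep mask and a {label: number of removed rows} mapping."""
    # Keep rows where at most max_zero_fraction of the values are zero
    keep_mask = (X == 0).sum(axis=1) <= max_zero_fraction * X.shape[1]
    removed = {str(label): int(((y == label) & ~keep_mask).sum()) for label in np.unique(y)}
    return keep_mask, removed

# Made-up 20-pixel images: sparse "1"-like rows and denser "0"-like rows
X_demo = np.zeros((4, 20))
X_demo[0, :1] = 255   # 19 of 20 pixels are zero
X_demo[1, :1] = 128   # 19 of 20 pixels are zero
X_demo[2, :10] = 255  # 10 of 20 pixels are zero
X_demo[3, :8] = 255   # 12 of 20 pixels are zero
y_demo = np.array(["1", "1", "0", "0"])

keep_mask, removed = removed_per_label(X_demo, y_demo)
print(keep_mask, removed)
```

On real data, a label that loses far more rows than the others is a hint that the filter is biased against it.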

In [27]:
# Log production digits using the same schema
prod_profile_view = why.log(row={"pixel_values": X_prod_filtered, "id_vector_tuple": id_vectors_filtered}, schema=schema).profile().view()

In [28]:
prod_profile_view.get_column("id_vector_tuple").to_summary_dict()

{'udf/embeddings_outliers:counts/n': 1,
 'udf/embeddings_outliers:counts/null': 0,
 'udf/embeddings_outliers:counts/nan': 0,
 'udf/embeddings_outliers:counts/inf': 0,
 'udf/embeddings_outliers:types/integral': 0,
 'udf/embeddings_outliers:types/fractional': 0,
 'udf/embeddings_outliers:types/boolean': 0,
 'udf/embeddings_outliers:types/string': 1,
 'udf/embeddings_outliers:types/object': 0,
 'udf/embeddings_outliers:types/tensor': 0,
 'udf/embeddings_outliers:distribution/mean': 0.0,
 'udf/embeddings_outliers:distribution/stddev': 0.0,
 'udf/embeddings_outliers:distribution/n': 0,
 'udf/embeddings_outliers:distribution/max': nan,
 'udf/embeddings_outliers:distribution/min': nan,
 'udf/embeddings_outliers:distribution/q_01': None,
 'udf/embeddings_outliers:distribution/q_05': None,
 'udf/embeddings_outliers:distribution/q_10': None,
 'udf/embeddings_outliers:distribution/q_25': None,
 'udf/embeddings_outliers:distribution/median': None,
 'udf/embeddings_outliers:distribution/q_75': None

Now let's compare the training and production profiles using the summaries from their profile views:

In [12]:
train_profile_summary = train_profile_view.get_column("pixel_values").to_summary_dict()
prod_profile_summary = prod_profile_view.get_column("pixel_values").to_summary_dict()
for digit in [str(i) for i in range(10)]:
    mean_diff = train_profile_summary[f'embedding/{digit}_distance:distribution/mean'] - prod_profile_summary[f'embedding/{digit}_distance:distribution/mean']
    stddev_diff = train_profile_summary[f'embedding/{digit}_distance:distribution/stddev'] - prod_profile_summary[f'embedding/{digit}_distance:distribution/stddev']
    print(f"{digit} distance difference (target-prod): mean {mean_diff}   stddev {stddev_diff}")

0 distance difference (target-prod): mean -1.0643165250717175   stddev -3.6655108378841703
1 distance difference (target-prod): mean -47.41027240533458   stddev 53.22718190553525
2 distance difference (target-prod): mean -8.175996896501147   stddev 2.9526443587183735
3 distance difference (target-prod): mean -9.230368340017549   stddev 2.063415827007759
4 distance difference (target-prod): mean -6.469144722837655   stddev -4.544108913614707
5 distance difference (target-prod): mean -12.712325265648133   stddev 4.2041559907290775
6 distance difference (target-prod): mean -12.015676261615454   stddev 0.1252791484623117
7 distance difference (target-prod): mean -11.930807147114365   stddev 0.9495908865241063
8 distance difference (target-prod): mean -7.396926679449962   stddev 0.36210386386750315
9 distance difference (target-prod): mean -8.536834250590118   stddev -2.4705431799656026


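This drift shows up in the distances to our reference data points, as we'd expect. The 1s are most affected by our rule: the mean distance to the 1 reference shifts by about 47 and its standard deviation by about 53, while no other reference's mean shifts by more than about 13.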
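
One simple way to put these differences on a common scale is to divide each mean shift by the training standard deviation for that reference. The sketch below does this with values rounded from the outputs above; the "standardized shift" score is only an illustrative choice here, not a whylogs or WhyLabs drift metric.

```python
import numpy as np

digits = [str(i) for i in range(10)]
# Training stddev of the distance to each reference (rounded from the profile summary above)
train_stddev = np.array([202.93, 473.03, 233.48, 274.57, 304.50, 254.16, 259.55, 349.26, 263.69, 329.42])
# Training mean minus production mean (rounded from the comparison above)
mean_diff = np.array([-1.06, -47.41, -8.18, -9.23, -6.47, -12.71, -12.02, -11.93, -7.40, -8.54])

# Size of each mean shift, measured in training standard deviations
standardized_shift = np.abs(mean_diff) / train_stddev
most_shifted = digits[int(standardized_shift.argmax())]

for digit, score in zip(digits, standardized_shift):
    print(f"{digit}: {score:.3f}")
print("most shifted reference:", most_shifted)
```

A score like this could be compared against a threshold per batch, although a dedicated drift metric over the full distribution would be more robust than a comparison of means.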

## What's Next?

### Upload profiles to WhyLabs for more drift calculations and monitoring

See the [example notebook](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_to_WhyLabs.html) on monitoring your profiles continuously with the WhyLabs Observability Platform.

### Exploring other sources of drift

Consider comparing this profile to different transformations and subsets of our MNIST dataset: randomly selected subsets of the data, normalized values, missing one or more labels, sorted values, and more.

### More example notebooks and documentation

Go to the [examples page](https://whylogs.readthedocs.io/en/stable/examples.html) for the complete list of examples!