# Visualize Embeddings

In [1]:
import os

from pathlib import Path
import shutil
import numpy as np
import pandas as pd
from torch.utils.tensorboard import SummaryWriter

from text_classification import defs
from text_classification.data import Samples
from text_classification.transforms import LabelTransform
import text_classification.config as cfg

🧠 Will load pre-computed embeddings.

In [2]:
train_set: Samples = defs.load_asset_value("train_set")  # type: ignore
train_embeddings: np.ndarray = defs.load_asset_value("train_embeddings")  # type: ignore
label_transform: LabelTransform = defs.load_asset_value("label_transform")  # type: ignore

2023-02-12 23:33:38 +0000 - dagster - DEBUG - system - Loading file from: /Users/thomelane/Projects/text_classification/data/storage/train_set
2023-02-12 23:33:38 +0000 - dagster - DEBUG - system - Loading file from: /Users/thomelane/Projects/text_classification/data/storage/train_embeddings
2023-02-12 23:33:38 +0000 - dagster - DEBUG - system - Loading file from: /Users/thomelane/Projects/text_classification/data/storage/label_transform


In [3]:
train_df = pd.DataFrame(train_set)
train_df["class_label"] = train_df["category"].apply(lambda e: cfg.CLASS_LABELS[e])

🧠 We don't want to show 65,000 samples in TensorBoard's embedding viewer.

🧠 Will sample to 100 samples per class.

In [4]:
samples_per_class = 100
sampled_train_df = train_df.groupby("category").sample(samples_per_class)
sampled_idxs = sampled_train_df.index
sampled_embeddings = train_embeddings[sampled_idxs]
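
🧠 The sample above changes on every run, and `groupby(...).sample(n)` raises an error if any class has fewer than `n` rows. A minimal sketch of a seeded, capped alternative follows. `sample_per_class`, `seed` and the toy data are hypothetical names for illustration, not part of `text_classification`.

```python
import numpy as np
import pandas as pd


def sample_per_class(df, label_col, n_per_class, seed=0):
    """Draw up to n_per_class rows from each class, reproducibly."""
    # Shuffle all rows once with a fixed seed, then keep the first
    # n_per_class rows of each group; smaller classes are kept whole.
    shuffled = df.sample(frac=1.0, random_state=seed)
    return shuffled.groupby(label_col).head(n_per_class)


# Toy stand-ins for train_df and train_embeddings (hypothetical data).
toy_df = pd.DataFrame({"category": list("AAAAABBBC"), "id": range(9)})
toy_embeddings = np.arange(9 * 4, dtype=np.float32).reshape(9, 4)

sampled = sample_per_class(toy_df, "category", n_per_class=3, seed=42)
# Positional lookup is only valid because toy_df has a default RangeIndex.
sampled_vectors = toy_embeddings[sampled.index.to_numpy()]
```

Fixing the seed keeps the projector view stable between runs, so an `id` spotted as an outlier can be found again later.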

🧠 Choose what fields to show in TensorBoard.

In [5]:
metadata_header = ["id", "class_label", "headline", "short_description"]
metadata = sampled_train_df[metadata_header].values.tolist()
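
🧠 The projector reads metadata from a tab-separated file with one sample per line, so a stray tab or newline inside a headline could shift columns or rows. I have not checked whether `add_embedding` escapes these characters, so treat the sketch below as a precaution. `clean_cell` and `clean_metadata` are hypothetical helpers, and the row is made-up data.

```python
import re


def clean_cell(value):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", str(value)).strip()


def clean_metadata(rows):
    """Apply clean_cell to every field of every metadata row."""
    return [[clean_cell(cell) for cell in row] for row in rows]


# Made-up row in [id, class_label, headline, short_description] order.
rows = [
    [1, "X: Example", "A headline\twith a tab", "First line\nsecond line  "],
]
cleaned = clean_metadata(rows)
```

Applied here, it would be `metadata = clean_metadata(metadata)` before the writer is created.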

In [6]:
data_root = os.environ.get("DATA_ROOT")
assert data_root, "Set the DATA_ROOT environment variable before running this cell"
output_path = Path(data_root, "embeddings/train")
if output_path.exists():
    shutil.rmtree(output_path)

In [7]:
writer = SummaryWriter(output_path)
writer.add_embedding(
    sampled_embeddings,
    metadata=metadata,
    metadata_header=metadata_header
)
writer.close()  # close the writer once the embedding has been written
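
🧠 The projector pairs row `i` of the embedding matrix with row `i` of the metadata, so a mismatch gives wrong hover text without any warning in the UI. A small pre-flight check, sketched with a hypothetical helper `check_projector_inputs` and toy inputs:

```python
import numpy as np


def check_projector_inputs(embeddings, metadata, metadata_header):
    """Raise ValueError if the embedding matrix and metadata rows disagree."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2D matrix, got shape {embeddings.shape}")
    if len(metadata) != embeddings.shape[0]:
        raise ValueError(
            f"{embeddings.shape[0]} embeddings but {len(metadata)} metadata rows"
        )
    for i, row in enumerate(metadata):
        if len(row) != len(metadata_header):
            raise ValueError(
                f"row {i} has {len(row)} fields, header has {len(metadata_header)}"
            )
    return embeddings.shape


# Toy inputs (hypothetical): 3 samples, 4 dimensions, 2 metadata fields.
toy_shape = check_projector_inputs(
    np.zeros((3, 4)),
    [[1, "a"], [2, "b"], [3, "c"]],
    ["id", "class_label"],
)
```

In this notebook the call would be `check_projector_inputs(sampled_embeddings, metadata, metadata_header)` just before `add_embedding`.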

Uncomment and run the cell below to start the TensorBoard server, then view the embeddings [here](http://localhost:6006/#projector).

In [8]:
#!tensorboard --logdir {output_path}/

## Visualization

### UMAP

Colouring by class_label.

* 2D
* Neighbours: 10
* Iterations: 500

<img src="./umap.png" alt="umap" style="width: 400px;"/>

🧠 Get some unsupervised separation, but would have liked to give it more iterations.

### t-SNE

Colouring by class_label.

* 2D
* Perplexity: 5
* Learning Rate: 1
* Supervision: 25
* Iterations: 300

<img src="./t-sne.png" alt="t-sne" style="width: 400px;"/>

🧠 As expected, we get better separation with supervision.

🧠 Shows there is potential for learning a good head classifier, but we know that from our earlier models too.

❓ What are the outliers in the clusters?

## Outliers

In [9]:
def print_row(row: pd.Series):
    print("#" * 50)
    fields = [str(f) for f in row.index]
    for field in fields:
        print(f"# {field}:")
        print(str(row[field]) + "\n")

In [10]:
print_row(train_df.query('id == 8063').iloc[0])

##################################################
# category:
D

# headline:
Closet Confidential: 10 Ways To Wear White After Labor Day

# short_description:
We've all heard the age old adage, 'No white after Labor Day.' Still, many fashion rules are definitely meant to be broken

# id:
8063

# class_label:
D: Diversity



🧠 Should this not be in "F: Fashion"?

🧠 Could be an incorrect label? Or we named the classes wrong when we labelled them manually.

In [11]:
print_row(train_df.query('id == 11520').iloc[0])

##################################################
# category:
D

# headline:
Here's A Brilliant Way You Can Explain Marriage Equality To Kids

# short_description:
❤️  ❤️  ❤️

# id:
11520

# class_label:
D: Diversity



🧠 Can this be classified into one class? It's also "J: Parenting" and "E: Relationship".

In [12]:
print_row(train_df.query('id == 51394').iloc[0])

##################################################
# category:
H

# headline:
LGBT Rights -- Modernity vs. Forces of Yesteryear

# short_description:
No matter what, we need to be clear about our vision on global LGBT rights, so we can develop a better strategy and translate it into a sound, reliable and realistic policy. We should act, be relentless and impatient when there is an immediate need to protect individuals across the globe.

# id:
51394

# class_label:
H: Foreign Affairs



🧠 Could also belong to the "D: Diversity" class.

⭐️ Will stop there for now, but would inspect more outliers given more time.
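
🧠 With more time, outlier candidates could be shortlisted in code instead of by eye in the projector. The sketch below is one possible heuristic, not what was used above: it ranks samples by how much closer (in cosine similarity) they sit to another class's centroid than to their own. `centroid_outliers` and the toy points are hypothetical.

```python
import numpy as np


def centroid_outliers(embeddings, labels, top_k=10):
    """Rank samples by how much closer they sit to another class's centroid."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # own_col[i] is the column of sample i's own class in `classes`.
    classes, own_col = np.unique(np.asarray(labels), return_inverse=True)

    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    centroids = np.stack([unit[own_col == k].mean(axis=0) for k in range(len(classes))])
    centroids /= np.clip(np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12, None)

    sims = unit @ centroids.T  # shape: (n_samples, n_classes)
    rows = np.arange(len(own_col))
    own_sim = sims[rows, own_col]

    # Mask out the own class, then take the best competing centroid.
    other = sims.copy()
    other[rows, own_col] = -np.inf
    best_other_col = other.argmax(axis=1)
    margin = other.max(axis=1) - own_sim  # > 0 means nearer to another class

    order = np.argsort(-margin)[:top_k]
    return [(int(i), str(classes[best_other_col[i]]), float(margin[i])) for i in order]


# Toy data (hypothetical): the last point is labelled "D" but sits with the "F" points.
points = np.array(
    [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.95, 0.05]]
)
toy_labels = np.array(["F", "F", "F", "D", "D", "D"])
ranked = centroid_outliers(points, toy_labels, top_k=1)
```

Each result is `(row position, nearest other class, margin)`. On the real data the inputs would be `train_embeddings` and `train_df["category"]`, and the top rows could be passed to `print_row` for manual review. It needs at least two classes, and a high margin only flags a candidate: the sample may be mislabelled or may genuinely straddle two classes.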