<a href="https://colab.research.google.com/github/rmr327/LLMEmbeddingVisualization/blob/main/stella_400m_embedding_visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## References

1) https://colab.research.google.com/github/AIPI-590-XAI/Duke-AI-XAI/blob/dev/explainable-ml-example-notebooks/embedding-visualization.ipynb#scrollTo=iWAi_aAXWyfU

(Used some visualization code from here)

2) https://huggingface.co/dunzhang/stella_en_400M_v5

(Used the embedding-extraction code from the authors' model page above)

3) Used Google Gemini for code completion and help with debugging

4) Used ChatGPT to generate a significant number of test sentences and words

## Overview

In this notebook I demonstrate three embedding-space visualization techniques for LLMs: PCA, t-SNE and UMAP. I chose the most compact model on the MTEB overall leaderboard at the time of making this notebook, Stella_en_400m. This model can be used to extract sentence-level as well as document-level embeddings. For this demo I visualize the model's embedding space using sentences and words. To keep the plots readable, I restricted myself to 15 sentences from 5 distinct sectors (3 per sector) and, similarly, 50 words from the same 5 sectors (10 per sector). The sectors are "Data Science", "Finance", "Sports", "Healthcare" and "Technology".

In the embedding visualizations, we should expect entries from related sectors to lie closer to each other than entries from unrelated sectors, and entries from the same sector to be the closest of all.

A table at the end of the page compares and contrasts the visualization techniques used.

In [None]:
# !pip install xformers
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# !pip install umap-learn==0.5.6

from transformers import AutoModel, AutoTokenizer
import torch
from typing import List
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import pandas as pd

# Suppress all warnings (including FutureWarning) to keep the cell output readable
import warnings
warnings.filterwarnings("ignore")


> Ensure GPU is used if available

In [None]:
# Select the device so that the GPU is used when available
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
    print("Warning: CUDA not available, using CPU. Performance may be slower.")

> Load the Stella model and tokenizer from Hugging Face.

In [None]:
model = AutoModel.from_pretrained("dunzhang/stella_en_400M_v5", trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained("dunzhang/stella_en_400M_v5", trust_remote_code=True)

Some weights of the model checkpoint at dunzhang/stella_en_400M_v5 were not used when initializing NewModel: ['new.pooler.dense.bias', 'new.pooler.dense.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


> Function for getting embeddings from the model (similar to the developers' example code)

In [None]:
def get_stella_style_text_embeddings(text_list: List[str]) -> np.ndarray:
  """This function is used to extract embeddings for the passed text as defined by the
  authors of the Stella_en_400M_v5 sentence embedding model"""
  # Collect embeddings
  embeddings = []

  with torch.no_grad():
      for entry in text_list:
          # Tokenize the text entry and move the tensors to the selected device.
          # Using .to(device) here keeps the function working on CPU-only machines,
          # where a hard-coded .cuda() call would raise an error.
          input_data = tokenizer(entry, return_tensors="pt", truncation=True, max_length=512).to(device)

          # Forward pass through the model
          last_hidden_state = model(**input_data).last_hidden_state

          # Use the hidden state of the first token as the word/sentence embedding
          sentence_embedding = last_hidden_state[:, 0, :].cpu().numpy().flatten()

          # Append to the list of embeddings
          embeddings.append(sentence_embedding)

  # Shape: (number of entries, hidden size)
  return np.array(embeddings)
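
The function above keeps the hidden state of the first token as the embedding. Another pooling choice that is common for sentence-embedding models is a mean over the token states, weighted by the attention mask so that padding is ignored. The sketch below contrasts the two on a made-up toy batch in NumPy; `cls_pool`, `mean_pool` and the toy arrays are hypothetical names and values used only for illustration, and which pooling the Stella authors recommend should be checked against the model page in reference 2.

```python
import numpy as np

def cls_pool(hidden: np.ndarray) -> np.ndarray:
    """Return the first-token state of each sequence: (batch, seq, dim) -> (batch, dim)."""
    return hidden[:, 0, :]

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token states per sequence, ignoring positions where mask == 0."""
    weights = mask[..., None].astype(hidden.dtype)       # (batch, seq, 1)
    summed = (hidden * weights).sum(axis=1)              # (batch, dim)
    counts = np.clip(weights.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Toy batch (made-up values): 2 sequences, 3 token positions, 2 hidden dimensions.
toy_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],   # last position is padding
])
toy_mask = np.array([[1, 1, 1], [1, 1, 0]])

cls_vectors = cls_pool(toy_hidden)
mean_vectors = mean_pool(toy_hidden, toy_mask)
print(cls_vectors)
print(mean_vectors)
```

The two pooling choices generally give different vectors, so plots produced with one are not directly comparable to plots produced with the other.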

>Function to generate PCA visualizations of the embedding space of Stella_en_400m

In [None]:
def generate_pca_visualization(embeddings: np.ndarray, labels_=None, color_=None) -> None:
  """This function is used to generate a PCA visualization of the embedding space
  of Stella_en_400m, using the plotly library"""
  # See first reference page for more info on PCA
  # Apply PCA to reduce to 2D
  pca = PCA(n_components=2)
  reduced_embeddings = pca.fit_transform(embeddings)

  # Plot PCA results using Plotly for interactivity
  fig_pca = px.scatter(
      reduced_embeddings, x=0, y=1,
      text=labels_,
      title="PCA of Stella_en_400m embedding space",
      labels={'0': 'Principal Component 1', '1': 'Principal Component 2'},
      color=color_
  )

  fig_pca.update_traces(marker=dict(size=8))

  # making the xaxis a little wider
  fig_pca.update_xaxes(range=[-20, 20])

  fig_pca.show()
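
A 2D PCA plot is only as informative as the share of variance its two components retain. As a rough sketch, that share can be computed from the singular values of the centered data; scikit-learn's fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`. The helper name `explained_variance_ratio_2d` and the random toy data below are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def explained_variance_ratio_2d(embeddings: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each of the first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance along each principal direction.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:2] / variances.sum()

# Toy stand-in for real embeddings: 15 points in 8 dimensions, with two
# directions deliberately given a larger spread than the rest.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(15, 8)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

ratio = explained_variance_ratio_2d(toy_embeddings)
print(ratio, ratio.sum())
```

If the two ratios sum to a small fraction, distances in the PCA plot should be read with caution, since most of the structure lies in the discarded components.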



>Function to generate t-SNE visualizations of the embedding space of Stella_en_400m

In [None]:
def generate_tsne_visualization(embeddings: np.ndarray, labels_=None, color_=None) -> None:
  """This function is used to generate a t-SNE visualization of the embedding space
  of Stella_en_400m, using the plotly library"""
  # perplexity must be less than the number of samples, so cap it at
  # n_samples - 1 for small inputs. The count is taken from the embeddings
  # so that the function also works when labels_ is None.
  n_samples = len(embeddings)
  if n_samples > 30:
    perplexity = 30
  else:
    perplexity = n_samples - 1

  # See first reference page for more info on t-SNE
  # Apply t-SNE to reduce to 2D
  # (recent scikit-learn releases rename n_iter to max_iter)
  tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=300, random_state=42)
  embeddings_tsne = tsne.fit_transform(embeddings)

  # Plot t-SNE results using Plotly for interactivity
  fig_tsne = px.scatter(
      embeddings_tsne, x=0, y=1,
      text=labels_,
      title="t-SNE of Stella_en_400m embedding space",
      labels={'0': 'Component 1', '1': 'Component 2'},
      color=color_
  )
  fig_tsne.update_traces(marker=dict(size=8))

  # making the x-axis a little wider
  fig_tsne.update_xaxes(range=[-200, 200])

  fig_tsne.show()

>Function to generate UMAP visualizations of the embedding space of Stella_en_400m

In [None]:
def generate_umap_visualization(embeddings: np.ndarray, labels_=None, color_=None) -> None:
  """This function is used to generate a UMAP visualization of the embedding space
  of Stella_en_400m, using the plotly library"""
  # See first reference page for more info on UMAP
  # Apply UMAP
  umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
  embeddings_umap = umap_model.fit_transform(embeddings)

  # Plot UMAP results using Plotly
  fig_umap = px.scatter(
      embeddings_umap, x=0, y=1,
      text=labels_,
      title="UMAP of Stella_en_400m Embeddings",
      labels={'0': 'Component 1', '1': 'Component 2'},
      color=color_
  )
  fig_umap.update_traces(marker=dict(size=8))
  fig_umap.show()

> Defining the set of sentences and words to explore the embedding space of stella_en_400m model.

In [None]:
sentences = [
    "There are many effective ways to reduce stress.",
    "Deep breathing, meditation, and physical activity are common techniques.",
    "Green tea is known for its potential health benefits.",
    "Regular consumption of green tea is associated with improved heart health.",
]

words = [
    "stress", "reduce", "effective", "techniques", "breathing", "meditation", "activity",
]

In [None]:
# Domains List
domains = ['Data Science'] * 10 + ['Finance'] * 10 + ['Healthcare'] * 10 + ['Technology'] * 10 + ['Sports'] * 10
domains_short = ['Data Science'] * 3 + ['Finance'] * 3 + ['Healthcare'] * 3 + ['Technology'] * 3 + ['Sports'] * 3

# Sentences List (Organized by Domain)
sentences = [
    # Data Science
    "Data scientists use machine learning models to uncover patterns in large datasets.",
    "Predictive analytics help businesses forecast trends and make data-driven decisions.",
    "Reinforcement learning models train algorithms by rewarding desired actions.",

    # Finance
    "Compound interest is a fundamental principle in personal finance and investing.",
    "Hedge funds use quantitative models to identify profitable trading strategies.",
    "Bonds are debt instruments that pay fixed interest to investors.",


    # Healthcare
    "Telemedicine connects patients with healthcare providers remotely.",
    "Gene editing tools like CRISPR have revolutionized personalized medicine.",
    "Precision medicine tailors treatments to an individual’s genetic profile.",

    # Technology
    "Blockchain technology provides a secure, decentralized ledger for transactions.",
    "Cloud computing allows data storage and processing over the internet.",
    "Virtual reality immerses users in a digitally simulated environment.",

    # Sports
    "Basketball is a team sport played on a rectangular court.",
    "Football can be a dangerous sport",
    "Table tennis is a very exciting game."
]

# Words List (Organized by Domain)
words = [
    # Data Science
    "Algorithm", "Regression", "Data Scientist", "Feature", "Model", "Prediction", "Clustering", "Neural Network", "Overfitting", "Normalization",

    # Finance
    "Equity", "Dividend", "Derivative", "Arbitrage", "Trader", "Bond", "Portfolio", "Asset", "Interest", "Yield",

    # Healthcare
    "Diagnosis", "Therapy", "Immunology", "Doctor", "Biopsy", "Vaccination", "Prescription", "Surgery", "Pathology", "Cardiology",

    # Technology
    "Encryption", "Protocol", "Hacker", "Processor", "Network", "Firewall", "Algorithm", "Data Center", "API", "Cloud Computing",

    # Sports
    "Basketball", "Football", "Soccer", "Fencing", "Badminton", "Tennis", "Swimming", "Running", "Cycling", "Volleyball"
]
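
The plots only colour points correctly if each text lines up with its domain label, which the cell above ensures by keeping the `domains` multipliers in step with the list order by hand. One possible alternative, sketched here with a hypothetical helper `flatten_by_domain` and a made-up toy dictionary, is to store the entries per domain and derive both lists together so they cannot drift apart.

```python
from typing import Dict, List, Tuple

def flatten_by_domain(groups: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten {domain: [entries]} into parallel lists of texts and domain labels."""
    texts: List[str] = []
    labels: List[str] = []
    for domain, entries in groups.items():
        texts.extend(entries)
        labels.extend([domain] * len(entries))   # one label per entry
    return texts, labels

# Toy example (not the notebook's full word list).
toy_groups = {
    "Finance": ["Equity", "Bond"],
    "Sports": ["Tennis", "Soccer", "Fencing"],
}
toy_texts, toy_labels = flatten_by_domain(toy_groups)
print(list(zip(toy_texts, toy_labels)))
```

With this layout, adding or removing an entry in one domain needs no change to the label counts.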


In [None]:
# Generate sentence embeddings with the Stella model
embeddings = get_stella_style_text_embeddings(sentences)

# Generate PCA visualization
generate_pca_visualization(embeddings, sentences, domains_short)

# Generate T-sne visualization
generate_tsne_visualization(embeddings, sentences, domains_short)

# Generate Umap visualizations
generate_umap_visualization(embeddings, sentences, domains_short)

In [None]:
# Generate word embeddings with the Stella model
embeddings = get_stella_style_text_embeddings(words)

# generate PCA visualization
generate_pca_visualization(embeddings, words, domains)

# generate t-SNE visualization
generate_tsne_visualization(embeddings, words, domains)

# generate umap visualization
generate_umap_visualization(embeddings, words, domains)


All of the plots above show that embeddings from the same sector generally lie close to each other. This gives us a better idea of how the Stella 400m English model represents the world internally. The expectations stated in the overview are confirmed.
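
The visual impression that same-sector entries sit together can also be checked numerically in the original embedding space, before any dimensionality reduction. The sketch below compares the mean cosine similarity of same-label pairs with that of different-label pairs. The function name and the toy 2D vectors are made up for illustration; with the real data one would pass the embeddings array together with `domains` or `domains_short`.

```python
import numpy as np

def mean_within_and_between_similarity(embeddings: np.ndarray, labels) -> tuple:
    """Mean cosine similarity for same-label pairs and for different-label pairs."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                       # pairwise cosine similarities
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # True where both labels match
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # exclude self-similarity
    within = sims[same & off_diagonal].mean()
    between = sims[~same].mean()
    return float(within), float(between)

# Toy vectors (made-up values): two "Finance" points and two "Sports" points.
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
toy_labels = ["Finance", "Finance", "Sports", "Sports"]

within, between = mean_within_and_between_similarity(toy_embeddings, toy_labels)
print(f"within-sector: {within:.3f}, between-sector: {between:.3f}")
```

A within-sector mean clearly above the between-sector mean would support the reading of the plots; similar values would suggest that the 2D projections exaggerate the clustering.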

## Compare and contrast PCA, t-SNE & UMAP visualizations

| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Speciality | Global structure (variance) | Local structure (neighborhoods) | Both global and local structure |
| Speed | Fastest | Slowest | Middle |
| Scalability | Good | Limited | Good |
| Preservation of high dimensional structure | Global | Local | Both |
| Interpretability | Somewhat easy | Somewhat difficult | Moderate (easy with fewer entries) |
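
The "local structure" row of the table can be made concrete with a simple neighbour-overlap score: for each point, compare its k nearest neighbours in the original space with its k nearest neighbours in the 2D projection. This is a minimal sketch with hypothetical helper names and random toy data, where the "projection" just keeps the first two coordinates. scikit-learn also provides a related metric, `sklearn.manifold.trustworthiness`.

```python
import numpy as np

def knn_indices(points: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbours (Euclidean) of every point, excluding itself."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    return np.argsort(dists, axis=1)[:, :k]

def neighbor_overlap(high_dim: np.ndarray, low_dim: np.ndarray, k: int = 3) -> float:
    """Average fraction of each point's k nearest neighbours shared by both spaces."""
    high_nn = knn_indices(high_dim, k)
    low_nn = knn_indices(low_dim, k)
    overlaps = [len(set(h) & set(l)) / k for h, l in zip(high_nn, low_nn)]
    return float(np.mean(overlaps))

# Toy data: 20 random points in 10 dimensions and a crude 2D "projection".
rng = np.random.default_rng(0)
toy_high = rng.normal(size=(20, 10))
toy_low = toy_high[:, :2]

score = neighbor_overlap(toy_high, toy_low, k=3)
print(f"neighbour overlap: {score:.2f}")
```

A score of 1.0 means every point keeps exactly the same nearest neighbours after projection; computing it for the PCA, t-SNE and UMAP outputs would give one number per method to set beside the table.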