# Model Setup and Embedding Pipeline

This notebook loads the EmbeddingGemma model, sets up the embedding pipeline, and demonstrates how to compute embeddings for text.


In [None]:
# Import the pipeline helpers from the src.models.embedding_pipeline module
from src.models.embedding_pipeline import load_embeddinggemma_model, embed_texts, compute_cosine_similarity
import torch


In [None]:
import os  # Python standard library (environment variables)
from getpass import getpass  # Python standard library (hidden input)


def get_hf_token() -> str:
    """Return a usable Hugging Face token from env or a hidden prompt.

    Raises:
        ValueError: If the token is missing or empty after prompting.
    """
    token = os.environ.get("HF_TOKEN")
    if token:
        return token

    print("HF_TOKEN not found in environment.")
    token = getpass("Paste HF_TOKEN (input hidden): ").strip()
    if not token:
        raise ValueError("HF_TOKEN is required to authenticate with Hugging Face.")

    # Set it for this kernel session only (not persisted to your shell).
    os.environ["HF_TOKEN"] = token
    return token


hf_token = get_hf_token()
print("HF_TOKEN loaded (masked):", f"{hf_token[:4]}...{hf_token[-4:]}")

## Load EmbeddingGemma Model

EmbeddingGemma is a 300M-parameter multilingual embedding model that produces 768-dimensional embeddings.


In [None]:
# Load the model and tokenizer
# Note: Make sure you've accepted the license and are logged into Hugging Face
tokenizer, model = load_embeddinggemma_model()

# Check device
device = next(model.parameters()).device
print(f"Model loaded on device: {device}")
print(f"Model type: {type(model).__name__}")


## Test Embedding Computation

Let's compute embeddings for a few example sentences and verify the pipeline works correctly.


In [None]:
# Example sentences to test
s1 = "The cat sits on the mat."
s2 = "A cat is sitting on a rug."
s3 = "The weather is sunny today."

# Compute embeddings using the embedding pipeline
embeddings = embed_texts([s1, s2, s3], model, tokenizer)

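print(f"Embedding shape: {tuple(embeddings.shape)}")
print("Expected shape: (3, 768)")
# Build the reference tensor from `norms` so its device and dtype match the embeddings.
norms = torch.norm(embeddings, dim=1)
print(f"Embeddings are normalized: {torch.allclose(norms, torch.ones_like(norms), atol=1e-5)}")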


## Compute Cosine Similarities

Since embeddings are normalized, cosine similarity equals the dot product. Let's see how similar our example sentences are.
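
The identity above can be checked by hand on small vectors. The following is a minimal, dependency-free sketch: the helper names (`l2_normalize`, `dot`, `cosine_similarity`) and the toy 3-dimensional vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (hypothetical helper for illustration)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Full cosine formula: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

full_cosine = cosine_similarity(u, v)
dot_of_units = dot(l2_normalize(u), l2_normalize(v))

# The two quantities agree up to floating-point rounding.
print(math.isclose(full_cosine, dot_of_units, rel_tol=1e-9))
```

Because normalization divides each vector by its own norm, the denominator of the cosine formula becomes 1, which is why a single matrix product is enough once the embeddings are normalized.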


In [None]:
# Compute cosine similarity matrix
cos_sim_matrix = compute_cosine_similarity(embeddings)

# Display the similarity matrix
import pandas as pd

sentences = [s1, s2, s3]
labels = [f"S{i}" for i in range(1, len(sentences) + 1)]
sim_df = pd.DataFrame(
    # Detach and move to CPU so the conversion also works for GPU tensors.
    cos_sim_matrix.detach().cpu().numpy(),
    index=labels,
    columns=labels,
)

print("Cosine Similarity Matrix:")
print(sim_df.round(3))

print("\nSentences:")
for i, s in enumerate(sentences, 1):
    print(f"S{i}: '{s}'")


## Expected Behavior

- S1 and S2 (both describe a cat on a mat or rug) should have high similarity (~0.8-0.9)
- S1 and S3 (cat vs. weather) should have low similarity (~0.1-0.3)
- S2 and S3 (cat vs. weather) should also have low similarity

The diagonal should be 1.0 (each sentence compared to itself).
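
These expectations can be turned into explicit checks. The sketch below assumes a hypothetical helper, `check_similarity_matrix`, and uses placeholder numbers chosen only to illustrate the pattern; they are not outputs of EmbeddingGemma.

```python
import math


def check_similarity_matrix(sim, tol=1e-5):
    """Return boolean sanity checks for a 3x3 cosine similarity matrix.

    Hypothetical helper: assumes row/column order S1, S2, S3.
    """
    n = len(sim)
    return {
        # Each sentence compared with itself should score 1.0.
        "diagonal_is_one": all(
            math.isclose(sim[i][i], 1.0, abs_tol=tol) for i in range(n)
        ),
        # Cosine similarity does not depend on argument order.
        "symmetric": all(
            math.isclose(sim[i][j], sim[j][i], abs_tol=tol)
            for i in range(n)
            for j in range(n)
        ),
        # The related pair (S1, S2) should outrank both unrelated pairs.
        "related_pair_ranks_highest": sim[0][1] > sim[0][2] and sim[0][1] > sim[1][2],
    }


# Placeholder values for illustration only; these are NOT model outputs.
example = [
    [1.00, 0.85, 0.20],
    [0.85, 1.00, 0.25],
    [0.20, 0.25, 1.00],
]

checks = check_similarity_matrix(example)
for name, passed in checks.items():
    print(f"{name}: {passed}")
```

In the notebook, the same function could be given `cos_sim_matrix.tolist()`, assuming `cos_sim_matrix` is a 3x3 torch tensor as in the cell above. Checking the ranking instead of fixed thresholds keeps the test meaningful even if the absolute scores differ from the rough ranges listed.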
