# Stage 2: Embedding Generation with BGE-M3

**Primary author:** Victoria
**Builds on:**
- *Hierarchical_Clustering_Indicators_with_BGE_M3_Embeddings.ipynb* (Victoria/Sahana — BGE-M3 model selection and inline embedding approach)
- *NC_Comprehensive_Embeddings.ipynb* (Nathan — multi-model comparison that helped justify BGE-M3 as the primary model)
- `01_data_cleaning.ipynb` (Stage 1 output: verified unique indicators)

**Prompt engineering:** Victoria
**AI assistance:** Claude (Anthropic)
**Environment:** Great Lakes (GPU required) or Google Colab (GPU enabled)

This notebook loads the 12,622 verified unique indicator strings produced by Stage 1,
generates a 1024-dimensional embedding for each one with the BGE-M3 model (loaded via
`sentence-transformers`), and saves the results as `.npy` and `.csv` files for downstream
dimensionality reduction (Stage 3) and clustering (Stage 4).

**Great Lakes session settings:**
- Partition: gpu
- GPUs: 1 (V100 or A40)
- CPUs: 4
- Memory: 32GB
- Wall time: 1 hour (embedding takes ~2-5 min; most time is model download on first run)

## Running on Google Colab

If you are running this notebook on Google Colab after the course ends:

1. Go to **Runtime > Change runtime type**
2. Select a **GPU** accelerator:
   - **T4** is available on the free tier and is sufficient for this notebook
   - **A100** is available with Colab Pro and will be faster
3. Click **Save**, then run all cells

Embedding 12,622 short phrases takes approximately 2-5 minutes on a T4 GPU.
Without a GPU, the notebook still runs, but embedding may take 15-30 minutes on CPU.
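
To confirm which device will be used before running the embedding cells, a minimal sketch follows. It assumes PyTorch is importable (it is installed as a dependency of `sentence-transformers`); the helper name `pick_device` is hypothetical and not part of this notebook.

```python
def pick_device() -> str:
    """Return 'cuda' if a CUDA GPU is visible to PyTorch, otherwise 'cpu'."""
    try:
        import torch  # normally installed alongside sentence-transformers
    except ImportError:
        # Fallback so the check itself never crashes if PyTorch is missing.
        return 'cpu'
    return 'cuda' if torch.cuda.is_available() else 'cpu'


device = pick_device()
print(f'Embedding will run on: {device}')
```

If this prints `cpu` on Colab, the runtime type has most likely not been switched to a GPU accelerator yet.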

## Imports

In [None]:
import os
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer

## Environment Auto-Detection and Paths

In [None]:
# --- Environment Auto-Detection ---
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Batch size for embedding generation.
# Colab free-tier T4 GPUs have 16GB VRAM — use a smaller batch to avoid OOM.
# Great Lakes V100/A40 and local GPUs with more VRAM can handle larger batches.
BATCH_SIZE = 32 if IS_COLAB else 64

print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Batch size: {BATCH_SIZE}')
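
To make the effect of `BATCH_SIZE` concrete: the encoder works through the list in chunks, so the number of batches (and, as far as we can tell, the number of progress-bar steps) is the indicator count divided by the batch size, rounded up. A quick worked example using the 12,622 indicators from Stage 1:

```python
import math

n_indicators = 12_622  # unique indicators from Stage 1

for batch_size in (32, 64):
    n_batches = math.ceil(n_indicators / batch_size)
    print(f'batch_size={batch_size}: {n_batches} batches')
# batch_size=32: 395 batches
# batch_size=64: 198 batches
```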

In [None]:
np.random.seed(42)

## Load Unique Indicators

The input file `verified_indicators_unique.csv` is produced by `01_data_cleaning.ipynb`.
It contains one row per unique indicator string (12,622 indicators), with no wordplay labels.

Labels are stored separately in `verified_clues_labeled.csv` and can be joined by indicator
string whenever needed for evaluation.
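
As a sketch of that join, the example below merges small in-memory stand-ins for the two files on the `indicator` column. The `wordplay` column name and the toy rows are assumptions for illustration only; the real column names come from `verified_clues_labeled.csv`.

```python
import pandas as pd

# Toy stand-in for verified_indicators_unique.csv (one row per unique indicator).
unique_df = pd.DataFrame({'indicator': ['scrambled', 'about', 'mixed up']})

# Toy stand-in for verified_clues_labeled.csv. The 'wordplay' column name is an
# assumption; one indicator may appear under more than one label.
labeled_df = pd.DataFrame({
    'indicator': ['scrambled', 'about', 'about', 'mixed up'],
    'wordplay': ['anagram', 'container', 'reversal', 'anagram'],
})

# Collapse to one row per indicator, then left-join so the row order of
# unique_df (and therefore of the embedding matrix) is preserved.
labels_per_indicator = (
    labeled_df.groupby('indicator')['wordplay']
    .agg(lambda s: ', '.join(sorted(set(s))))
    .rename('wordplay_labels')
    .reset_index()
)
joined = unique_df.merge(labels_per_indicator, on='indicator', how='left')
print(joined)
```

Collapsing the labels first keeps the result at one row per indicator, so row `i` of `joined` still lines up with row `i` of the embedding array.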

In [None]:
# Check that the input file exists before proceeding
input_file = DATA_DIR / 'verified_indicators_unique.csv'
assert input_file.exists(), (
    f'Missing input file: {input_file}\n'
    f'Run 01_data_cleaning.ipynb first to produce this file.'
)

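df_indicators = pd.read_csv(input_file)
# Guard against missing values: pandas parses strings such as "NA" or "null" as NaN,
# which would break the len() calls below and the encoder.
assert df_indicators['indicator'].notna().all(), (
    'Found missing values in the indicator column; check the Stage 1 output.'
)
indicators_list = df_indicators['indicator'].tolist()

shortest = min(indicators_list, key=len)
longest = max(indicators_list, key=len)
print(f'Loaded {len(indicators_list):,} unique indicators')
print(f'Examples: {indicators_list[:5]}')
print(f'Shortest: "{shortest}" ({len(shortest)} chars)')
print(f'Longest: "{longest}" ({len(longest)} chars)')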

## Generate BGE-M3 Embeddings

We use the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model, loaded through the
`sentence-transformers` library. BGE-M3 produces 1024-dimensional dense embeddings
and belongs to the BGE (BAAI General Embedding) family of models; the "M3" stands for
multi-functionality, multi-linguality, and multi-granularity.

**Why BGE-M3?** Our indicators are short phrases (1-6 words) that carry specific
semantic meaning related to wordplay operations. BGE-M3 handles short text well
and produces embeddings where semantically similar phrases (e.g., "scrambled" and
"mixed up") are close in vector space. This is the settled model choice per
FINDINGS_AND_DECISIONS.md.
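
"Close in vector space" is usually measured with cosine similarity. The sketch below shows the calculation on three made-up 4-dimensional vectors. They are illustrative placeholders, not real BGE-M3 outputs, and the function name `cosine_similarity` is our own.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up toy vectors standing in for the embeddings of three phrases.
vec_scrambled = np.array([0.9, 0.1, 0.0, 0.2])
vec_mixed_up = np.array([0.8, 0.2, 0.1, 0.3])
vec_inside = np.array([0.0, 0.9, 0.4, 0.0])

sim_close = cosine_similarity(vec_scrambled, vec_mixed_up)
sim_far = cosine_similarity(vec_scrambled, vec_inside)
print(f'scrambled vs mixed up: {sim_close:.3f}')
print(f'scrambled vs inside:   {sim_far:.3f}')
```

With the real matrix, the same function can be applied to any two rows of `embeddings`.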

**What we are NOT doing:** We embed each indicator in isolation (not within its
clue context). This is a settled decision — see FINDINGS_AND_DECISIONS.md for
the rationale.

In [None]:
# Load the BGE-M3 model
# First run will download the model (~2.3 GB). Subsequent runs use the cached version.
model = SentenceTransformer('BAAI/bge-m3')
print(f'Model loaded: {model.get_sentence_embedding_dimension()} dimensions')

In [None]:
# Generate embeddings for all unique indicators
# show_progress_bar=True displays a tqdm progress bar during encoding
embeddings = model.encode(
    indicators_list,
    batch_size=BATCH_SIZE,
    show_progress_bar=True
)

print(f'Embeddings shape: {embeddings.shape}')
print(f'Dtype: {embeddings.dtype}')
print(f'Memory: {embeddings.nbytes / 1024**2:.1f} MB')
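
As a rough cross-check on the memory figure printed above, the expected size can be worked out by hand. This sketch assumes the encoder returns `float32` values (4 bytes each), which is worth confirming against the printed dtype.

```python
# Expected size of the embedding matrix, assuming float32 (4 bytes per value).
n_indicators = 12_622   # unique indicators from Stage 1
n_dims = 1024           # BGE-M3 dense embedding size
bytes_per_value = 4     # assumption: float32

expected_bytes = n_indicators * n_dims * bytes_per_value
expected_mb = expected_bytes / 1024**2
print(f'Expected size: {expected_bytes:,} bytes ({expected_mb:.1f} MB)')
# Expected size: 51,699,712 bytes (49.3 MB)
```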

## Save Outputs

Two files are saved:

1. **`embeddings_bge_m3_all.npy`** — NumPy array of shape (N, 1024) where N is the
   number of unique indicators. Row `i` in this array corresponds to row `i` in the
   indicator index CSV.

2. **`indicator_index_all.csv`** — Maps each row number to its indicator string.
   The CSV index (first column) is the row number in the embedding array. This is
   the contract between the embedding file and the indicator identity.

Downstream notebooks (Stage 3, 4, 5) should load these files rather than
recomputing embeddings.
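
A downstream notebook could load the pair along these lines. This is a sketch, not the actual Stage 3 code: the helper name `load_embeddings` is hypothetical, and the demo writes tiny stand-in files to a temporary directory so that the block runs on its own.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_embeddings(data_dir: Path):
    """Load the embedding matrix and its index, checking that the rows line up."""
    embeddings = np.load(data_dir / 'embeddings_bge_m3_all.npy')
    index_df = pd.read_csv(data_dir / 'indicator_index_all.csv', index_col=0)
    if embeddings.shape[0] != len(index_df):
        raise ValueError(
            f'Row mismatch: {embeddings.shape[0]} embeddings vs {len(index_df)} index rows'
        )
    return embeddings, index_df


# Demo with tiny stand-in files (3 indicators, 4 dimensions instead of 1024).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    np.save(tmp_dir / 'embeddings_bge_m3_all.npy', np.zeros((3, 4), dtype=np.float32))
    pd.DataFrame({'indicator': ['about', 'scrambled', 'mixed up']}).to_csv(
        tmp_dir / 'indicator_index_all.csv', index=True
    )
    demo_embeddings, demo_index = load_embeddings(tmp_dir)

print(demo_embeddings.shape, demo_index['indicator'].tolist())
```

Reading the CSV with `index_col=0` restores the row numbers as the DataFrame index, so `demo_index.loc[i, 'indicator']` names the phrase behind `demo_embeddings[i]`.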

In [None]:
# Save the embedding matrix
np.save(DATA_DIR / 'embeddings_bge_m3_all.npy', embeddings)
print(f'Saved embeddings to {DATA_DIR / "embeddings_bge_m3_all.npy"}')

# Save the indicator index (row number -> indicator string)
df_indicators.to_csv(DATA_DIR / 'indicator_index_all.csv', index=True)
print(f'Saved indicator index to {DATA_DIR / "indicator_index_all.csv"}')

## Verification

Reload the saved files, check that the row counts and embedding dimension match, and spot-check one known indicator.

In [None]:
# Reload and verify
embeddings_check = np.load(DATA_DIR / 'embeddings_bge_m3_all.npy')
index_check = pd.read_csv(DATA_DIR / 'indicator_index_all.csv', index_col=0)

assert embeddings_check.shape[0] == len(index_check), (
    f'Shape mismatch: embeddings has {embeddings_check.shape[0]} rows, '
    f'index has {len(index_check)} rows'
)
assert embeddings_check.shape[1] == 1024, (
    f'Expected 1024 dimensions, got {embeddings_check.shape[1]}'
)

print(f'Embeddings: {embeddings_check.shape}')
print(f'Index: {len(index_check)} rows')

# Spot-check: find a known indicator and verify it has a non-zero embedding
spot_check = 'about'
matches = index_check.index[index_check['indicator'] == spot_check]
assert len(matches) > 0, f'Spot-check indicator "{spot_check}" not found in index'
row = matches[0]
norm = np.linalg.norm(embeddings_check[row])
assert norm > 0, f'Embedding for "{spot_check}" has zero norm'
print(f'\nSpot check: "{spot_check}" is at row {row}, embedding L2 norm = {norm:.4f}')
print('All checks passed.')
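
A stricter alignment check could also confirm that the reloaded index lists the same strings in the same order as the list that was encoded, and that the matrix contains only finite values. The sketch below uses toy stand-ins; the helper name `check_alignment` is hypothetical. In this notebook the arguments would be `indicators_list`, `index_check`, and `embeddings_check`.

```python
import numpy as np
import pandas as pd


def check_alignment(original, index_df, embeddings):
    """Raise AssertionError if the saved index and embeddings are out of step."""
    # Same strings in the same order as the list that was encoded.
    assert index_df['indicator'].tolist() == list(original), 'Indicator order changed'
    # One embedding row per indicator.
    assert embeddings.shape[0] == len(original), 'Row count mismatch'
    # No NaN or inf values in the matrix.
    assert np.isfinite(embeddings).all(), 'Non-finite values in embeddings'


# Toy stand-ins (3 indicators, 4 dimensions instead of 1024).
toy_indicators = ['about', 'scrambled', 'mixed up']
toy_index = pd.DataFrame({'indicator': toy_indicators})
toy_embeddings = np.ones((3, 4), dtype=np.float32)

check_alignment(toy_indicators, toy_index, toy_embeddings)
print('Alignment checks passed.')
```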