# Stage 5: Constrained and Targeted Experiments

**Primary author:** Victoria
**Builds on:** `04_clustering.ipynb` (Stage 4: unconstrained clustering results)
**Prompt engineering:** Victoria
**AI assistance:** Claude (Anthropic)
**Environment:** Great Lakes (for definitions embedding); Colab (for subset experiments); Local (for label evaluation)

**Research question: "Does expert knowledge improve clustering, and do theoretically
motivated subsets behave as predicted?"**

Stage 4 asked what structure emerges when we let clustering algorithms explore freely.
The answer: strong local structure at fine granularity (282 HDBSCAN clusters, metrics
improving up to k=250 for agglomerative), but no natural k=8 grouping and no stable
intermediate level in HDBSCAN.

This notebook introduces domain knowledge for the first time:
- **Wordplay type labels** (Ho blog labels and algorithmically derived GT labels) to
  evaluate whether unconstrained clusters correspond to known types
- **Seed words** from expert sources to guide constrained clustering
- **Subset experiments** to test specific hypotheses about which types should be easy
  vs. hard to separate

The four sections:
1. **Setup and Data Preparation**: load all inputs and create per-indicator label
   mappings
2. **Label-Based Evaluation of Notebook 04 Results**: overlay labels on unconstrained
   clusters to see what they captured
3. **Constrained Agglomerative Clustering** (to be added): use seed words to guide
   clustering
4. **Subset Experiments** (to be added): test separation and overlap hypotheses on
   targeted subsets

## Running on Google Colab

If running on Google Colab:

1. Go to **Runtime > Change runtime type**
2. Keep the default CPU runtime for Sections 1-2 (label evaluation); choose a GPU only
   when running the definitions embedding in Section 3.
3. Click **Save**, then run all cells

---
## Section 1: Setup and Data Preparation

### Imports

In [1]:
import os
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

matplotlib.rcParams['figure.dpi'] = 120
np.random.seed(42)

### Environment Auto-Detection and Paths

In [2]:
# --- Environment Auto-Detection ---
try:
    IS_COLAB = 'google.colab' in str(get_ipython())
except NameError:
    IS_COLAB = False

IS_GREATLAKES = 'SLURM_JOB_ID' in os.environ

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
elif IS_GREATLAKES:
    # Update YOUR_UNIQNAME to your actual UMich uniqname
    PROJECT_ROOT = Path('/scratch/YOUR_UNIQNAME/ccc_project')
else:
    try:
        PROJECT_ROOT = Path(__file__).resolve().parent.parent
    except NameError:
        PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

print(f'Project root: {PROJECT_ROOT}')
print(f'Data directory: {DATA_DIR}')
print(f'Output directory: {OUTPUT_DIR}')
print(f'Figures directory: {FIGURES_DIR}')

Project root: /Users/victoria/Desktop/MADS/ccc-project/indicator_clustering
Data directory: /Users/victoria/Desktop/MADS/ccc-project/indicator_clustering/data
Output directory: /Users/victoria/Desktop/MADS/ccc-project/indicator_clustering/outputs
Figures directory: /Users/victoria/Desktop/MADS/ccc-project/indicator_clustering/outputs/figures


### Input File Validation

This notebook requires outputs from Stages 1–4. We check that all required files exist
before proceeding, rather than failing partway through.

| File | Produced by | Description |
|------|-------------|-------------|
| `embeddings_umap_10d.npy` | Stage 3 | 10D UMAP embeddings for clustering |
| `embeddings_umap_2d.npy` | Stage 3 | 2D UMAP embeddings for visualization |
| `indicator_index_all.csv` | Stage 2 | Row-to-indicator-string mapping |
| `verified_clues_labeled.csv` | Stage 1 | Clue-indicator pairs with Ho and GT labels |
| `cluster_labels_hdbscan_eps_0p0000.csv` | Stage 4 | Best HDBSCAN labels (eps=0.0) |
| `cluster_labels_agglo_k8.csv` | Stage 4 | Agglomerative k=8 labels |
| `cluster_labels_agglo_k10.csv` | Stage 4 | Agglomerative k=10 labels |
| `cluster_labels_agglo_k34.csv` | Stage 4 | Agglomerative k=34 labels |
| `clustering_metrics_summary.csv` | Stage 4 | Metrics from all Stage 4 runs |
| `wordplay_seeds.xlsx` | Manual (expert) | Seed words for constrained clustering (Section 3) |

**New in this notebook:** `verified_clues_labeled.csv` and `wordplay_seeds.xlsx`. These
were deliberately excluded from Notebook 04 to keep the unconstrained analysis label-free.
This is where domain knowledge enters the clustering pipeline for the first time.

In [3]:
required_files = {
    # Stage 2-3: embeddings
    'embeddings_umap_10d.npy': 'Run 03_dimensionality_reduction.ipynb',
    'embeddings_umap_2d.npy': 'Run 03_dimensionality_reduction.ipynb',
    'indicator_index_all.csv': 'Run 02_embedding_generation.ipynb',
    # Stage 1: labels
    'verified_clues_labeled.csv': 'Run 01_data_cleaning.ipynb',
    # Stage 4: cluster assignments
    'cluster_labels_hdbscan_eps_0p0000.csv': 'Run 04_clustering.ipynb',
    'cluster_labels_agglo_k8.csv': 'Run 04_clustering.ipynb',
    'cluster_labels_agglo_k10.csv': 'Run 04_clustering.ipynb',
    'cluster_labels_agglo_k34.csv': 'Run 04_clustering.ipynb',
    # Seed words (used in Section 3)
    'wordplay_seeds.xlsx': 'Manually created from expert sources (Minute Cryptic, CC for Dummies)',
}

# Metrics summary is in OUTPUT_DIR, not DATA_DIR
required_output_files = {
    'clustering_metrics_summary.csv': 'Run 04_clustering.ipynb',
}

all_ok = True
for fname, fix_msg in required_files.items():
    fpath = DATA_DIR / fname
    if not fpath.exists():
        print(f'MISSING: {fpath}')
        print(f'  Fix: {fix_msg}')
        all_ok = False

for fname, fix_msg in required_output_files.items():
    fpath = OUTPUT_DIR / fname
    if not fpath.exists():
        print(f'MISSING: {fpath}')
        print(f'  Fix: {fix_msg}')
        all_ok = False

if all_ok:
    print('All input files found.')
else:
    raise FileNotFoundError('One or more required files are missing. See messages above.')

All input files found.


### Load Embeddings and Indicator Index

These are the same files loaded in Notebook 04. The 10D embeddings are used for
clustering; the 2D embeddings are used for scatter plot visualization. The indicator
index maps each row number to its indicator string.

In [4]:
# Load UMAP embeddings
embeddings_10d = np.load(DATA_DIR / 'embeddings_umap_10d.npy')
embeddings_2d = np.load(DATA_DIR / 'embeddings_umap_2d.npy')

# Load indicator index (maps row i -> indicator string)
df_index = pd.read_csv(DATA_DIR / 'indicator_index_all.csv', index_col=0)
indicator_names = df_index['indicator'].values

n_indicators = len(df_index)
print(f'10D embeddings shape: {embeddings_10d.shape}')
print(f'2D embeddings shape:  {embeddings_2d.shape}')
print(f'Indicator count:      {n_indicators:,}')

assert embeddings_10d.shape == (n_indicators, 10)
assert embeddings_2d.shape == (n_indicators, 2)
print('Shape checks passed.')

10D embeddings shape: (12622, 10)
2D embeddings shape:  (12622, 2)
Indicator count:      12,622
Shape checks passed.


### Load Wordplay Labels

**This is the first time labels enter the clustering pipeline.**

`verified_clues_labeled.csv` contains one row per verified (clue_id, indicator) pair —
76,015 rows total. Each row carries two kinds of wordplay labels:

#### Ho Labels (`wordplay_ho`)

These are the original wordplay type labels parsed from cryptic crossword blog
commentary by George Ho. They cover **all 8 wordplay types**: anagram, reversal,
hidden, container, insertion, deletion, homophone, and alternation. Every row in the
file has a Ho label.

**Strengths:** Complete coverage of all types; large sample size.
**Weaknesses:** Parsed from informal blog text, so some labels may be noisy or
inconsistent. The parsing treats the blogger's description as ground truth, but
bloggers occasionally mislabel or use ambiguous terminology.

#### GT Labels (`wordplay_gt`)

These are algorithmically derived "ground truth" labels produced by Victoria's
verification code in Stage 1. The algorithm checks whether the answer can be
mechanically derived from the clue components using the rules of each wordplay type.
It covers **only 4 types**: hidden, reversal, alternation, and anagram — the types
where the transformation can be verified algorithmically (e.g., checking that the
answer's letters appear consecutively in the fodder for hidden-word clues).

**Strengths:** High precision — if the algorithm says it's a hidden-word indicator,
the mechanical check confirms it. No human judgment involved.
**Weaknesses:** Only 4 of 8 types are covered. Container, insertion, deletion, and
homophone require contextual understanding that the algorithm cannot perform. About
74% of rows have no GT label.

#### Why Both Matter

Neither label set supersedes the other:
- **Ho labels** give us full coverage but lower precision
- **GT labels** give us high precision but partial coverage

When they disagree, that's informative — it highlights indicators where the blog
commentary and the mechanical verification diverge. The `label_match` column tracks
this agreement (92.6% match rate where GT exists).
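
How `label_match` is computed in Stage 1 is not shown in this notebook. As a minimal sketch, assuming it is plain equality between `wordplay_ho` and `wordplay_gt` on rows where a GT label exists, the match rate and the disagreeing rows could be recovered as follows. The toy rows are invented for illustration and do not come from the real file.

```python
import pandas as pd

# Toy stand-in for verified_clues_labeled.csv (invented values, illustration only)
toy = pd.DataFrame({
    'indicator':   ['mixed', 'back', 'in', 'oddly', 'about'],
    'wordplay_ho': ['anagram', 'reversal', 'container', 'alternation', 'reversal'],
    'wordplay_gt': ['anagram', 'reversal', None, 'alternation', 'anagram'],
})

# Restrict to rows that carry a GT label, then compare the two label columns
has_gt = toy['wordplay_gt'].notna()
with_gt = toy.loc[has_gt]
match = with_gt['wordplay_ho'] == with_gt['wordplay_gt']

gt_coverage = has_gt.mean()        # share of rows with any GT label
match_rate = match.mean()          # agreement rate where GT exists
disagreements = with_gt[~match]    # rows worth inspecting by hand

print(f'GT coverage: {gt_coverage:.1%}')
print(f'Match rate where GT exists: {match_rate:.1%}')
print(disagreements[['indicator', 'wordplay_ho', 'wordplay_gt']])
```

On the real data, the same filter applied to `df_labels` would list the indicators where blog commentary and mechanical verification diverge.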

#### Multi-Label Indicators

A single indicator string can appear under **multiple wordplay types**. For example,
"about" is labeled as an anagram, container, hidden, insertion, and reversal indicator
in different clues. This is linguistically real rather than parsing noise: the word
genuinely serves multiple functions. In `verified_clues_labeled.csv`, such an
indicator appears in multiple rows with different `wordplay_ho` values.

For this analysis, we build:
- A **label set** per indicator: all Ho (or GT) types it appears under (for overlay
  plots where a multi-label indicator should appear in every relevant type's subplot)
- A **primary label**: the single most frequent Ho (or GT) type across that
  indicator's clue appearances (for heatmaps and coloring where a single assignment
  is needed)

In [5]:
# Load the full labels file
df_labels = pd.read_csv(DATA_DIR / 'verified_clues_labeled.csv')

print(f'Labels file: {len(df_labels):,} rows')
print(f'Unique indicators: {df_labels["indicator"].nunique():,}')
print(f'\nHo label distribution (instance-level):')
print(df_labels['wordplay_ho'].value_counts().to_string())
print(f'\nGT label distribution (NaN = no GT label):')
print(df_labels['wordplay_gt'].value_counts(dropna=False).to_string())

Labels file: 76,015 rows
Unique indicators: 12,622

Ho label distribution (instance-level):
wordplay_ho
anagram        38226
container      10836
reversal       10149
insertion       8305
homophone       3642
hidden          2595
deletion        1608
alternation      654

GT label distribution (NaN = no GT label):
wordplay_gt
NaN            56348
anagram        15346
hidden          2556
reversal        1506
alternation      259


### Build Per-Indicator Label Sets

For each unique indicator, we compute:

1. **`ho_labels`**: the set of all Ho wordplay types this indicator appears under
   (e.g., `{'anagram', 'container', 'hidden', 'insertion', 'reversal'}` for "about")
2. **`gt_labels`**: the set of all GT wordplay types (empty if no GT label exists)
3. **`primary_ho`**: the single Ho type this indicator is most frequently labeled as
   (the mode across all its clue appearances; ties go to whichever type
   `value_counts()` lists first)
4. **`primary_gt`**: the single most frequent GT type, or NaN if no GT labels exist

The label sets (1–2) are used for the per-type overlay plots, where an indicator should
appear in every subplot for every type it belongs to. The primary labels (3–4) are used
for heatmaps and single-color scatter plots where each indicator needs exactly one label.

We align everything with `indicator_index_all.csv` so that row `i` in the label arrays
corresponds to row `i` in the embedding matrices.
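
To make the difference between a label set and a primary label concrete, here is a small sketch on invented clue-level rows. The helper `primary_label` is hypothetical, and its alphabetical tie-break is an assumption added here for reproducibility; the notebook itself takes the first entry of `value_counts()`.

```python
import pandas as pd

# Invented clue-level rows; the real file has one row per (clue_id, indicator) pair
toy = pd.DataFrame({
    'indicator':   ['about', 'about', 'about', 'about', 'in', 'in', 'oddly'],
    'wordplay_ho': ['container', 'reversal', 'container', 'anagram',
                    'insertion', 'container', 'alternation'],
})

def primary_label(types):
    """Most frequent type; ties are broken alphabetically (an assumption of this sketch)."""
    counts = types.value_counts()
    tied = counts[counts == counts.max()].index
    return sorted(tied)[0]

grouped = toy.groupby('indicator')['wordplay_ho']

# Label set: every type the indicator has appeared under
label_sets = {name: frozenset(types.dropna()) for name, types in grouped}

# Primary label: one type per indicator
primary = {name: primary_label(types) for name, types in grouped}

for name in sorted(label_sets):
    print(f'{name}: set={sorted(label_sets[name])}, primary={primary[name]}')
```

Here "about" keeps all three types in its label set but is reduced to `container` as its primary label, which is the information loss the heatmaps accept in exchange for one label per indicator.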

In [6]:
# --- Build label sets per indicator ---

# Ho label sets: for each indicator, the set of all Ho types it appears under
ho_label_sets = (
    df_labels
    .groupby('indicator')['wordplay_ho']
    .apply(lambda x: frozenset(x.dropna().unique()))
    .to_dict()
)

# GT label sets: for each indicator, the set of all GT types it appears under
gt_label_sets = (
    df_labels
    .groupby('indicator')['wordplay_gt']
    .apply(lambda x: frozenset(x.dropna().unique()))
    .to_dict()
)

# Primary Ho label: the most common Ho label across all clue appearances
primary_ho_map = (
    df_labels
    .groupby('indicator')['wordplay_ho']
    .agg(lambda x: x.value_counts().index[0])
    .to_dict()
)

# Primary GT label: the most common GT label, if any GT labels exist for this indicator
df_labels_gt = df_labels[df_labels['wordplay_gt'].notna()]
primary_gt_map = (
    df_labels_gt
    .groupby('indicator')['wordplay_gt']
    .agg(lambda x: x.value_counts().index[0])
    .to_dict()
)

# --- Create master dataframe aligned with embedding order ---
# This ensures row i matches embeddings_10d[i] and embeddings_2d[i]
df_master = df_index[['indicator']].copy()
df_master['ho_labels'] = df_master['indicator'].map(ho_label_sets)
df_master['gt_labels'] = df_master['indicator'].map(gt_label_sets)
df_master['primary_ho'] = df_master['indicator'].map(primary_ho_map)
df_master['primary_gt'] = df_master['indicator'].map(primary_gt_map)

# Verify alignment: every indicator should have at least one Ho label
assert df_master['primary_ho'].notna().all(), 'Some indicators have no Ho label!'
print(f'Master dataframe: {len(df_master):,} indicators')
print(f'Indicators with GT labels: {df_master["primary_gt"].notna().sum():,} '
      f'({df_master["primary_gt"].notna().mean():.1%})')
print(f'Indicators without GT labels: {df_master["primary_gt"].isna().sum():,}')

Master dataframe: 12,622 indicators
Indicators with GT labels: 5,827 (46.2%)
Indicators without GT labels: 6,795


In [7]:
# --- Label statistics ---

# Ho type counts (unique indicators, using primary label)
print('=== Primary Ho Label Distribution (unique indicators) ===')
ho_counts = df_master['primary_ho'].value_counts()
for wtype, count in ho_counts.items():
    print(f'  {wtype:>12s}: {count:>5,} ({count / len(df_master):.1%})')

print(f'\n=== Primary GT Label Distribution (unique indicators) ===')
gt_counts = df_master['primary_gt'].value_counts()
for wtype, count in gt_counts.items():
    print(f'  {wtype:>12s}: {count:>5,} ({count / len(df_master):.1%})')
n_no_gt = df_master['primary_gt'].isna().sum()
print(f'  {"(no GT)":>12s}: {n_no_gt:>5,} ({n_no_gt / len(df_master):.1%})')

# Multi-label indicators
n_multi_ho = sum(1 for s in df_master['ho_labels'] if len(s) > 1)
print(f'\nMulti-label indicators (Ho): {n_multi_ho:,} '
      f'({n_multi_ho / len(df_master):.1%})')

n_multi_gt = sum(1 for s in df_master['gt_labels'] if len(s) > 1)
print(f'Multi-label indicators (GT): {n_multi_gt:,}')

# Examples of multi-label indicators
multi_ho_df = df_master[df_master['ho_labels'].apply(len) > 1]
print(f'\nExamples of multi-label indicators (Ho):')
for _, row in multi_ho_df.head(8).iterrows():
    types_str = ', '.join(sorted(row['ho_labels']))
    print(f'  "{row["indicator"]}" \u2192 {types_str}')

=== Primary Ho Label Distribution (unique indicators) ===
       anagram: 6,453 (51.1%)
     container: 1,523 (12.1%)
     insertion: 1,385 (11.0%)
      reversal: 1,350 (10.7%)
      deletion:   604 (4.8%)
        hidden:   560 (4.4%)
     homophone:   541 (4.3%)
   alternation:   206 (1.6%)

=== Primary GT Label Distribution (unique indicators) ===
       anagram: 4,364 (34.6%)
        hidden:   867 (6.9%)
      reversal:   474 (3.8%)
   alternation:   122 (1.0%)
       (no GT): 6,795 (53.8%)

Multi-label indicators (Ho): 1,248 (9.9%)
Multi-label indicators (GT): 304

Examples of multi-label indicators (Ho):
  "a bit of" → hidden, insertion
  "a little" → hidden, insertion
  "abandoned" → anagram, deletion
  "abducted by" → hidden, insertion
  "aboard" → container, hidden, insertion
  "aborted" → anagram, deletion
  "about" → anagram, container, hidden, insertion, reversal
  "absorbed" → container, hidden


---
## Section 2: Label-Based Evaluation of Notebook 04 Results

Now that we have wordplay type labels attached to each indicator, we can answer:
**do the unconstrained clusters from Notebook 04 correspond to known wordplay types?**

Recall from NB 04:
- **HDBSCAN at eps=0.0** found 282 tight clusters with 33.4% noise — high silhouette
  (0.631) but many excluded points
- **Agglomerative k=8** is the reference point matching the number of Ho types
- **Agglomerative k=10** is the local silhouette optimum (the only evidence for coarse
  structure in the metrics sweep)
- **Agglomerative k=34** is a mid-range granularity where clusters become semantically
  more coherent

This section produces:
1. **Per-type overlay plots** — where does each wordplay type's indicators live in the
   2D UMAP space? This shows the "ground truth" spatial layout of types.
2. **Per-cluster type distribution heatmaps** — for each clustering run, what is the
   Ho type composition of each cluster? This reveals whether clusters are type-pure or
   mixed.

### Load Cluster Labels from Notebook 04

We load the cluster assignments saved by NB 04 for the four runs we want to evaluate.
Each CSV has columns `indicator` and `cluster_label`, aligned with
`indicator_index_all.csv`.

In [8]:
# Define the clustering runs to evaluate
runs_to_evaluate = {
    'HDBSCAN eps=0.0': {
        'file': 'cluster_labels_hdbscan_eps_0p0000.csv',
        'has_noise': True,
    },
    'Agglomerative k=8': {
        'file': 'cluster_labels_agglo_k8.csv',
        'has_noise': False,
    },
    'Agglomerative k=10': {
        'file': 'cluster_labels_agglo_k10.csv',
        'has_noise': False,
    },
    'Agglomerative k=34': {
        'file': 'cluster_labels_agglo_k34.csv',
        'has_noise': False,
    },
}

cluster_labels_dict = {}
for run_name, run_info in runs_to_evaluate.items():
    df_cl = pd.read_csv(DATA_DIR / run_info['file'])
    # Verify alignment with indicator index
    assert list(df_cl['indicator']) == list(indicator_names), (
        f'Indicator order mismatch in {run_info["file"]}'
    )
    cluster_labels_dict[run_name] = df_cl['cluster_label'].values
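    # Count noise from the labels themselves so the cluster count stays correct
    # even if a run flagged has_noise=True happens to contain no -1 labels
    n_noise = int((df_cl['cluster_label'] == -1).sum())
    n_clusters = df_cl['cluster_label'].nunique() - (1 if n_noise > 0 else 0)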
    print(f'{run_name}: {n_clusters} clusters, {n_noise} noise points')

print(f'\nLoaded {len(cluster_labels_dict)} clustering runs.')

HDBSCAN eps=0.0: 282 clusters, 4212 noise points
Agglomerative k=8: 8 clusters, 0 noise points
Agglomerative k=10: 10 clusters, 0 noise points
Agglomerative k=34: 34 clusters, 0 noise points

Loaded 4 clustering runs.


### Color Palette and Helper Functions

We define a consistent color palette for the 8 Ho wordplay types. The same colors are
used in every plot throughout this notebook and in Notebook 06, making it easy to track
types across visualizations.

The palette follows ColorBrewer's Set1 qualitative scheme (yellow omitted), which is
colorblind-accessible only in part because it contains both red and green. High-frequency
types (anagram, container, reversal) get the most visually distinct colors.

In [9]:
# --- Consistent color palette for wordplay types ---
# Used in all overlay plots, heatmaps, and scatter plots throughout NB 05 and 06

HO_TYPES = ['anagram', 'reversal', 'hidden', 'container',
            'insertion', 'deletion', 'homophone', 'alternation']

GT_TYPES = ['anagram', 'reversal', 'hidden', 'alternation']

TYPE_COLORS = {
    'anagram':     '#e41a1c',  # red
    'reversal':    '#377eb8',  # blue
    'hidden':      '#4daf4a',  # green
    'container':   '#984ea3',  # purple
    'insertion':   '#ff7f00',  # orange
    'deletion':    '#a65628',  # brown
    'homophone':   '#f781bf',  # pink
    'alternation': '#999999',  # gray
}

# Pre-compute boolean masks for each Ho type:
# ho_type_masks['anagram'][i] is True if indicator i has ever been labeled 'anagram'
ho_type_masks = {}
for wtype in HO_TYPES:
    ho_type_masks[wtype] = np.array([
        wtype in label_set for label_set in df_master['ho_labels']
    ])

# Same for GT types
gt_type_masks = {}
for wtype in GT_TYPES:
    gt_type_masks[wtype] = np.array([
        wtype in label_set for label_set in df_master['gt_labels']
    ])

# Print type counts for overlay reference
print('Ho type indicator counts (an indicator can appear in multiple types):')
for wtype in HO_TYPES:
    n = ho_type_masks[wtype].sum()
    print(f'  {wtype:>12s}: {n:>5,}')

print(f'\nGT type indicator counts:')
for wtype in GT_TYPES:
    n = gt_type_masks[wtype].sum()
    print(f'  {wtype:>12s}: {n:>5,}')

Ho type indicator counts (an indicator can appear in multiple types):
       anagram: 6,610
      reversal: 1,495
        hidden:   971
     container: 1,728
     insertion: 1,915
      deletion:   695
     homophone:   565
   alternation:   216

GT type indicator counts:
       anagram: 4,436
      reversal:   631
        hidden:   964
   alternation:   131


### Per-Type Overlay: Ho Labels

Each subplot below highlights the indicators belonging to one Ho wordplay type,
plotted on top of the full 2D UMAP cloud (shown in light gray). This reveals
**where each type lives in the embedding space**:

- **Concentrated clusters:** If a type's indicators form a tight region, the
  embedding model captured something distinctive about that type's vocabulary.
- **Dispersed clouds:** If a type's indicators are scattered across the full space,
  the type's vocabulary is semantically diverse and may not form a natural cluster.
- **Overlapping types:** If two types occupy the same region, their indicators share
  semantic properties and will be hard to separate by clustering.

These plots use all Ho labels (not just the primary label), so a multi-label indicator
like "about" appears in every type subplot where it has been used. This gives the
complete picture of where each type's vocabulary lives.
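
The "concentrated vs. dispersed" reading of the overlays can be backed by a rough number. The sketch below is one possible measure, not part of the original pipeline: the mean distance from a type's points to that type's centroid in the 2D layout. The helper name `spread_score` and the synthetic points are assumptions, and UMAP distances are only loosely meaningful, so the score is best used to rank types rather than as an absolute quantity.

```python
import numpy as np

def spread_score(points_2d, mask):
    """Mean Euclidean distance from the masked points to their own centroid."""
    pts = points_2d[mask]
    centroid = pts.mean(axis=0)
    return float(np.linalg.norm(pts - centroid, axis=1).mean())

rng = np.random.default_rng(0)

# Synthetic stand-ins for a concentrated type and a dispersed type
tight = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(200, 2))
wide = rng.normal(loc=[0.0, 0.0], scale=3.0, size=(200, 2))
points = np.vstack([tight, wide])
is_tight = np.arange(len(points)) < 200

tight_spread = spread_score(points, is_tight)
wide_spread = spread_score(points, ~is_tight)
print(f'tight type spread: {tight_spread:.2f}')
print(f'wide type spread:  {wide_spread:.2f}')
```

In the notebook, the same function could be called as `spread_score(embeddings_2d, ho_type_masks[wtype])` for each type in `HO_TYPES`.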

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(22, 10))
axes_flat = axes.flatten()

for i, wtype in enumerate(HO_TYPES):
    ax = axes_flat[i]
    mask = ho_type_masks[wtype]
    n_type = mask.sum()

    # Background: all indicators in light gray
    ax.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
               s=1, alpha=0.08, color='lightgray', rasterized=True)

    # Foreground: this type's indicators
    ax.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],
               s=3, alpha=0.4, color=TYPE_COLORS[wtype], rasterized=True)

    ax.set_title(f'{wtype} (n={n_type:,})', fontsize=11,
                 color=TYPE_COLORS[wtype], fontweight='bold')
    ax.set_xlabel('UMAP 1', fontsize=8)
    ax.set_ylabel('UMAP 2', fontsize=8)
    ax.tick_params(labelsize=7)

plt.suptitle('Ho Label Overlay: Where Each Wordplay Type Lives in Embedding Space',
             fontsize=14, y=1.02)
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'overlay_ho_types.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: {FIGURES_DIR / "overlay_ho_types.png"}')

### Per-Type Overlay: GT Labels

The same visualization using the algorithmically derived GT labels. Only 4 types are
covered (anagram, reversal, hidden, alternation). Indicators without a GT label are
part of the gray background only.

Comparing this to the Ho overlay above reveals:
- GT labels tend to highlight a **subset** of each type's indicators (the ones where
  the transformation could be mechanically verified)
- If the GT-highlighted region is a spatial subset of the Ho-highlighted region for
  the same type, the two label sources are consistent
- If GT highlights indicators in a different part of the space than Ho, there may be
  labeling disagreements worth investigating
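
Because both overlays plot the same points, the subset check above reduces to a membership check on the masks: what fraction of a type's GT-highlighted indicators are also Ho-highlighted for that type? A minimal sketch follows, with a hypothetical helper `gt_within_ho` and invented toy masks.

```python
import numpy as np

def gt_within_ho(gt_mask, ho_mask):
    """Fraction of GT-highlighted indicators that are also Ho-highlighted."""
    gt_mask = np.asarray(gt_mask, dtype=bool)
    ho_mask = np.asarray(ho_mask, dtype=bool)
    n_gt = gt_mask.sum()
    if n_gt == 0:
        return float('nan')  # type has no GT-labeled indicators
    return float((gt_mask & ho_mask).sum() / n_gt)

# Toy masks over six indicators (invented for illustration)
toy_ho = {
    'hidden':   np.array([1, 1, 1, 0, 0, 0], dtype=bool),
    'reversal': np.array([0, 0, 0, 1, 1, 0], dtype=bool),
}
toy_gt = {
    'hidden':   np.array([1, 1, 0, 0, 0, 0], dtype=bool),
    'reversal': np.array([0, 0, 0, 1, 0, 1], dtype=bool),
}

overlap = {t: gt_within_ho(toy_gt[t], toy_ho[t]) for t in toy_gt}
for wtype, frac in overlap.items():
    print(f'{wtype:>10s}: {frac:.0%} of GT indicators also carry the Ho label')
```

A value of 100% means the GT highlights are a strict subset of the Ho highlights; lower values point to indicators where the two label sources disagree. With the notebook's objects, the call would be `gt_within_ho(gt_type_masks[t], ho_type_masks[t])` for each `t` in `GT_TYPES`.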

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(22, 5))

for i, wtype in enumerate(GT_TYPES):
    ax = axes[i]
    mask = gt_type_masks[wtype]
    n_type = mask.sum()

    # Background: all indicators in light gray
    ax.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
               s=1, alpha=0.08, color='lightgray', rasterized=True)

    # Foreground: this type's GT-verified indicators
    ax.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],
               s=4, alpha=0.5, color=TYPE_COLORS[wtype], rasterized=True)

    ax.set_title(f'{wtype} — GT verified (n={n_type:,})', fontsize=11,
                 color=TYPE_COLORS[wtype], fontweight='bold')
    ax.set_xlabel('UMAP 1', fontsize=8)
    ax.set_ylabel('UMAP 2', fontsize=8)
    ax.tick_params(labelsize=7)

plt.suptitle('GT Label Overlay: Algorithmically Verified Types in Embedding Space',
             fontsize=14, y=1.05)
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'overlay_gt_types.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: {FIGURES_DIR / "overlay_gt_types.png"}')

### Per-Cluster Type Distribution Heatmaps

For each of the four clustering runs from NB 04, we create a heatmap showing the Ho
type composition of each cluster. Each row is a cluster, each column is a wordplay
type, and each cell shows the **proportion** of that cluster's indicators that have
that type as their primary Ho label. Rows sum to 1.0.

**How to read these heatmaps:**
- A **pure cluster** has one bright cell in its row and near-zero everywhere else —
  the cluster captured a single wordplay type.
- A **mixed cluster** has color spread across multiple columns — it contains a blend
  of types that the algorithm could not separate.
- The **dominant type** of each cluster is the column with the highest value.
- Clusters are sorted by size (largest at top) for readability.
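
As a compact illustration of what each heatmap row contains, the sketch below builds a row-normalized crosstab on invented data. It uses `pd.crosstab(..., normalize='index')` as a shortcut; the plotting function in this notebook normalizes manually with `ct.div(...)`, which yields the same proportions.

```python
import pandas as pd

# Invented assignments: cluster 0 is pure, cluster 1 is mixed
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 1, 1, 2],
    'type':    ['anagram', 'anagram', 'anagram', 'anagram',
                'container', 'container', 'insertion', 'hidden',
                'reversal'],
})

# normalize='index' divides every row by its own total, so rows sum to 1.0
ct_norm = pd.crosstab(toy['cluster'], toy['type'], normalize='index')

# Dominant type = column with the highest proportion in each row
dominant = ct_norm.idxmax(axis=1)

print(ct_norm.round(2))
print(dominant)
```

Cluster 0 shows a single cell at 1.0 (pure), while cluster 1 spreads its mass over three columns (mixed).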

For HDBSCAN, which labels some points as noise (cluster = −1), we include a separate
"noise" row showing the type distribution of noise points. If noise points are
distributed similarly to the full dataset, the density-based method is not
systematically excluding any particular type. If noise is concentrated in certain
types, those types' indicators may be more dispersed in the embedding space.
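
The noise comparison can also be read off numerically by putting the two distributions side by side. This is a sketch on invented arrays; the column name `noise_minus_overall` is an illustrative choice, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Invented example: four clustered points and four noise points (-1)
cluster_labels = np.array([0, 0, 1, 1, -1, -1, -1, -1])
primary_type = np.array(['anagram', 'anagram', 'reversal', 'reversal',
                         'anagram', 'container', 'container', 'container'])

types = pd.Series(primary_type)
overall = types.value_counts(normalize=True)
noise = types[cluster_labels == -1].value_counts(normalize=True)

# Align on the union of types; a type absent from the noise gets proportion 0
comparison = pd.DataFrame({'overall': overall, 'noise': noise}).fillna(0.0)
comparison['noise_minus_overall'] = comparison['noise'] - comparison['overall']

print(comparison.sort_values('noise_minus_overall', ascending=False).round(3))
```

Positive differences mark types that are over-represented among noise points; values near zero suggest the density-based method is not singling out that type.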

These heatmaps use the **primary Ho label** (single most common type per indicator)
to avoid double-counting multi-label indicators.

In [None]:
def plot_type_distribution(cluster_labels, primary_ho, ho_types, type_colors,
                           title, filename, max_clusters=20, show_noise=False):
    """Heatmap showing the Ho type composition of each cluster.

    Parameters
    ----------
    cluster_labels : array of int
        Cluster assignment for each indicator (-1 = noise for HDBSCAN).
    primary_ho : array of str
        Primary Ho label for each indicator.
    ho_types : list of str
        Ordered list of Ho type names (column order in heatmap).
    type_colors : dict
        Colors per type (used for column header coloring).
    title : str
        Plot title.
    filename : str
        Output filename (saved to FIGURES_DIR).
    max_clusters : int
        Maximum number of cluster rows to display (largest first).
    show_noise : bool
        If True, include a 'noise' row for cluster=-1 points.
    """
    df_temp = pd.DataFrame({
        'cluster': cluster_labels,
        'type': primary_ho,
    })

    # Separate noise if applicable
    if show_noise:
        noise_df = df_temp[df_temp['cluster'] == -1]
        df_clean = df_temp[df_temp['cluster'] != -1]
    else:
        noise_df = pd.DataFrame()
        df_clean = df_temp

    # Crosstab: rows = cluster, columns = type
    ct = pd.crosstab(df_clean['cluster'], df_clean['type'])
    for t in ho_types:
        if t not in ct.columns:
            ct[t] = 0
    ct = ct[ho_types]

    # Sort by cluster size (descending)
    ct = ct.loc[ct.sum(axis=1).sort_values(ascending=False).index]

    # Limit to top N clusters if there are too many rows
    if len(ct) > max_clusters:
        ct = ct.head(max_clusters)

    # Add noise row if applicable
    if show_noise and len(noise_df) > 0:
        noise_counts = noise_df['type'].value_counts()
        noise_row = pd.DataFrame(
            [[noise_counts.get(t, 0) for t in ho_types]],
            columns=ho_types,
            index=['noise']
        )
        ct = pd.concat([ct, noise_row])

    # Row labels with cluster sizes
    sizes = ct.sum(axis=1).astype(int)
    # Pair labels and sizes positionally; the index mixes ints with 'noise'
    row_labels = [f'{idx} (n={int(n):,})' for idx, n in zip(ct.index, sizes)]

    # Normalize by row (proportions)
    ct_norm = ct.div(ct.sum(axis=1), axis=0)

    # Decide whether to annotate cells (skip for large heatmaps)
    n_rows = len(ct_norm)
    do_annot = n_rows <= 15
    height = max(4, n_rows * 0.45 + 2)

    fig, ax = plt.subplots(figsize=(14, height))
    sns.heatmap(
        ct_norm.values.astype(float),
        annot=ct_norm.values.astype(float) if do_annot else False,
        fmt='.2f' if do_annot else '',
        cmap='YlOrRd',
        ax=ax,
        vmin=0, vmax=0.8,  # scale saturates at 0.8: purer cells share the darkest shade
        linewidths=0.5,
        xticklabels=ho_types,
        yticklabels=row_labels,
    )
    ax.set_title(title, fontsize=13, pad=12)
    ax.set_xlabel('Primary Ho Wordplay Type', fontsize=11)
    ax.set_ylabel('Cluster (sorted by size)', fontsize=11)

    # Color the x-axis tick labels by type for visual consistency
    for tick_label in ax.get_xticklabels():
        wtype = tick_label.get_text()
        if wtype in type_colors:
            tick_label.set_color(type_colors[wtype])
            tick_label.set_fontweight('bold')

    plt.tight_layout()
    fig.savefig(FIGURES_DIR / filename, dpi=150, bbox_inches='tight')
    plt.show()
    print(f'Saved: {FIGURES_DIR / filename}')

    return ct_norm

In [None]:
# --- Generate heatmaps for each clustering run ---
primary_ho_array = df_master['primary_ho'].values

type_dist_results = {}
for run_name, labels in cluster_labels_dict.items():
    run_info = runs_to_evaluate[run_name]
    # Create a filename-safe version of the run name
    safe_name = run_name.lower().replace(' ', '_').replace('=', '').replace('.', '')
    ct_norm = plot_type_distribution(
        cluster_labels=labels,
        primary_ho=primary_ho_array,
        ho_types=HO_TYPES,
        type_colors=TYPE_COLORS,
        title=f'Per-Cluster Ho Type Distribution \u2014 {run_name}',
        filename=f'type_distribution_{safe_name}.png',
        max_clusters=20 if run_info['has_noise'] else 40,
        show_noise=run_info['has_noise'],
    )
    type_dist_results[run_name] = ct_norm

### Dominant Type per Cluster

As a complement to the heatmaps, we print each cluster's **dominant type** (the Ho type
with the highest proportion) and its **purity** (that proportion). A purity of 1.0 means
every indicator in the cluster has the same primary Ho label; a purity of 0.3 means the
cluster is a mixture with no single type dominating.

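This is a quick summary for comparing runs: higher average purity means the clustering
better separates wordplay types. Two caveats apply. The average is unweighted, so a
10-indicator cluster counts as much as a 1,000-indicator cluster. And it covers only the
clusters shown in the heatmap, which for HDBSCAN means the 20 largest of 282.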
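
A size-weighted purity is a common complement to the plain average, since it reports the share of all indicators that sit in their cluster's dominant type. The sketch below contrasts the two on invented counts; it is an illustration, not a metric computed elsewhere in this notebook.

```python
import pandas as pd

# Invented per-cluster type counts: one large pure cluster, one small mixed cluster
counts = pd.DataFrame(
    {'anagram': [90, 2], 'container': [10, 3], 'insertion': [0, 5]},
    index=pd.Index([0, 1], name='cluster'),
)

sizes = counts.sum(axis=1)             # indicators per cluster
purity = counts.max(axis=1) / sizes    # dominant-type proportion per cluster

unweighted = purity.mean()                        # every cluster counts equally
weighted = (purity * sizes).sum() / sizes.sum()   # every indicator counts equally

print(f'Unweighted average purity: {unweighted:.3f}')
print(f'Size-weighted purity:      {weighted:.3f}')
```

The small mixed cluster pulls the unweighted average down much more than the weighted one, which matters when comparing runs with very different cluster-size distributions.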

In [None]:
for run_name, ct_norm in type_dist_results.items():
    print(f'\n{"=" * 60}')
    print(f'{run_name}')
    print(f'{"=" * 60}')
    for idx in ct_norm.index:
        row = ct_norm.loc[idx]
        dominant_type = row.idxmax()
        purity = row.max()
        print(f'  Cluster {str(idx):>6s}: dominant={dominant_type:<12s} purity={purity:.2f}')

    # Unweighted average purity over the displayed clusters (noise row excluded)
    numeric_rows = ct_norm.loc[ct_norm.index != 'noise']
    avg_purity = numeric_rows.max(axis=1).mean()
    print(f'\n  Average cluster purity: {avg_purity:.3f}')

### Save Section 2 Outputs

In [None]:
# Save the per-cluster type distribution tables for use in NB 06
# (the HDBSCAN table holds only the 20 largest clusters plus the noise row)
for run_name, ct_norm in type_dist_results.items():
    safe_name = run_name.lower().replace(' ', '_').replace('=', '').replace('.', '')
    out_path = OUTPUT_DIR / f'type_distribution_{safe_name}.csv'
    ct_norm.to_csv(out_path)
    print(f'Saved: {out_path}')

# List all figures produced in this section
print(f'\nFigures saved to {FIGURES_DIR}:')
for f in sorted(FIGURES_DIR.glob('overlay_*.png')):
    print(f'  {f.name}')
for f in sorted(FIGURES_DIR.glob('type_distribution_*.png')):
    print(f'  {f.name}')

### Interpretation: Do Unconstrained Clusters Correspond to Wordplay Types?

#### Type Overlay Findings

The per-type overlay plots show the spatial distribution of each wordplay type in
the embedding space. Patterns to check against the plots:

- **Homophone** indicators are expected to be the most spatially concentrated — words
  like "sounds like", "I hear", and "reportedly" share a distinctive hearing/speaking
  semantic field that is well-separated from other types.
- **Anagram** indicators are expected to be the most dispersed — they span many
  conceptual metaphors (disorder, cooking, damage, movement) and occupy a large region
  of the space. This is consistent with anagram having the largest and most diverse
  indicator vocabulary (6,610 unique verified indicators).
- **Container and insertion** indicators are expected to heavily overlap in space,
  since they share placement/containment conceptual metaphors and many of the same
  indicator phrases ("in", "about", "around" appear under both types).
- **Reversal** indicators likely form a moderately concentrated region, especially the
  up/rising words (for down clues) and the back/return words.
- **Hidden** indicators may partially overlap with container/insertion due to shared
  placement metaphors ("in", "within", "inside" are shared across these types).
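
The shared-vocabulary argument behind the container/insertion/hidden overlap can be checked directly from the label sets, independent of the embedding. A minimal sketch using Jaccard similarity between type vocabularies follows; the toy vocabularies are invented (only "in", "about", "around", "sounds like", and "reportedly" are taken from the examples above), and the helper `jaccard` is hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two vocabularies."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy vocabularies per type (partly invented for illustration)
toy_vocab = {
    'container': {'in', 'about', 'around', 'holding'},
    'insertion': {'in', 'about', 'around', 'entering'},
    'homophone': {'sounds like', 'reportedly'},
}

pairwise = {
    (t1, t2): jaccard(toy_vocab[t1], toy_vocab[t2])
    for t1, t2 in combinations(toy_vocab, 2)
}
for (t1, t2), score in pairwise.items():
    print(f'{t1} vs {t2}: {score:.2f}')
```

In the notebook, each type's vocabulary could be taken as `set(indicator_names[ho_type_masks[wtype]])`. High Jaccard values between two types would indicate that clean separation is limited by the labels themselves, not only by the embedding.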

The GT overlay provides a higher-precision check on these spatial patterns, although it
covers only the 4 types where mechanical verification is possible.

#### Per-Cluster Distribution Findings

The heatmaps answer the core question of whether unconstrained clusters align with
wordplay types:

- **At k=8**: If clusters perfectly aligned with types, each row would have a single
  bright cell and near-zero elsewhere. In practice, most clusters are likely mixed —
  especially those spanning the container/insertion/hidden region.
- **At k=10**: The local silhouette optimum from NB 04. Compare the purity of the k=10
  heatmap to k=8 to see whether the two extra clusters help isolate overlapping types.
- **At k=34**: Finer granularity should produce purer clusters, but now multiple
  clusters correspond to the same type. This is consistent with the NB 04 finding that
  the data's natural structure is finer-grained than 8 types.
- **HDBSCAN eps=0.0**: The 282 tight clusters are very fine-grained. The top 20 largest
  clusters likely show high type purity — each tight cluster contains indicators of
  mostly one type. The noise row reveals which types have the most "ambiguous" indicators
  that don't fit neatly into any dense region.

#### Key Takeaway

The alignment between unconstrained clusters and wordplay types establishes the baseline
for Section 3, where we ask: **can domain knowledge (seed words, constraints) improve
this alignment?** If the unconstrained clusters already separate types well for some
types (e.g., homophone, reversal) but not others (e.g., container/insertion/hidden),
that tells us exactly where constrained clustering has room to help — and where the
linguistic reality of shared indicator vocabulary may make clean separation impossible.

---
## Section 3: Constrained Agglomerative Clustering — to be added

---
## Section 4: Subset Experiments — to be added