# Benchmark: LandmarkTriangulation vs t-SNE

This notebook compares **LandmarkTriangulation** against scikit-learn's **t-SNE** on synthetic clustered data.

**Objectives:**
- Compare execution time across different landmark selection strategies
- Evaluate cluster separation in each embedding using silhouette scores computed against the true cluster labels
- Visualize embeddings for qualitative assessment

**Dataset:** 2,000 samples, 50 features, 5 clusters

## 1. Setup and Imports

Install the package first if needed:
```bash
uv sync --extra examples
```

In [None]:
import time

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

from landmark_triangulation import LandmarkTriangulation

print("✓ All imports successful")

## 2. Helper Functions

Define utilities for data generation and benchmarking.
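
The `run_method` helper in the next cell scores each embedding with `silhouette_score`, passing the true cluster labels. As a rough illustration of how that metric behaves, the sketch below compares tight and heavily overlapping blobs. The toy centers and spreads are assumptions chosen for the demonstration and are not part of the benchmark.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative 2D centers (assumed values, unrelated to the benchmark data)
centers = [[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]

# Tight blobs: clusters are far apart relative to their spread
X_tight, y_tight = make_blobs(
    n_samples=300, centers=centers, cluster_std=0.5, random_state=0
)
# Wide blobs: same centers, but the clusters overlap heavily
X_wide, y_wide = make_blobs(
    n_samples=300, centers=centers, cluster_std=8.0, random_state=0
)

score_tight = silhouette_score(X_tight, y_tight)
score_wide = silhouette_score(X_wide, y_wide)
print(f"tight: {score_tight:.3f} | wide: {score_wide:.3f}")
```

The tight configuration should score close to 1 and the wide one much lower, which is the sense in which "higher is better" applies to the embeddings compared here.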

In [None]:
def generate_synthetic_data(n_samples=2000, n_features=50, n_clusters=5):
    """
    Generate high-dimensional synthetic data with distinct clusters.
    
    Returns:
        X: Feature matrix of shape (n_samples, n_features)
        y: Cluster labels of shape (n_samples,)
    """
    print(f"Generating {n_samples} samples with {n_features} features and {n_clusters} clusters...")
    
    X, y = make_blobs(
        n_samples=n_samples,
        n_features=n_features,
        centers=n_clusters,
        cluster_std=2.0,
        random_state=42,
    )
    return X, y


def run_method(name, model, X, y):
    """
    Run a dimensionality reduction method and compute metrics.
    
    Returns:
        Dictionary with method name, runtime, silhouette score, and embedding.
    """
    print(f"Running {name}...")
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    
    try:
        X_embedded = model.fit_transform(X)
        duration = time.perf_counter() - start_time
        
        # Silhouette score: measures cluster separation (-1 to 1, higher is better)
        score = silhouette_score(X_embedded, y)
        
        print(f"  ✓ Finished in {duration:.2f}s | Silhouette: {score:.3f}")
        
        return {
            "Method": name,
            "Time (s)": duration,
            "Silhouette": score,
            "Embedding": X_embedded,
        }
    except Exception as e:
        print(f"  ✗ Failed: {e}")
        return None

## 3. Generate Dataset

Create synthetic data matching the README benchmark specifications.

In [None]:
# Generate synthetic dataset
X, y = generate_synthetic_data(
    n_samples=2000, 
    n_features=50, 
    n_clusters=5
)

print(f"\nData shape: {X.shape}")
print(f"Number of clusters: {len(set(y))}")

## 4. Configure Methods

Set up all dimensionality reduction methods to benchmark:
- **Random mode**: Randomly samples landmarks from data
- **Synthetic mode**: Generates sine-wave landmarks
- **Hybrid mode**: Generates synthetic landmarks and snaps to nearest real points
- **t-SNE**: Baseline comparison
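
To make the three landmark strategies more concrete, here is a minimal NumPy sketch of what each one could look like. It is not taken from the `landmark_triangulation` source: the function names (`random_landmarks`, `sine_landmarks`, `snap_to_nearest`) are hypothetical, and the sine-wave construction in particular is only one possible reading of "sine-wave landmarks".

```python
import numpy as np


def random_landmarks(X, n_landmarks, seed=0):
    """Pick n_landmarks distinct rows of X uniformly at random."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_landmarks, replace=False)
    return X[idx]


def sine_landmarks(X, n_landmarks):
    """Hypothetical synthetic landmarks: one sine wave per feature,
    rescaled to that feature's observed range (an assumption)."""
    t = np.linspace(0.0, 2.0 * np.pi, n_landmarks)[:, None]  # (n_landmarks, 1)
    freqs = np.arange(1, X.shape[1] + 1)[None, :]            # (1, n_features)
    waves = np.sin(t * freqs)                                # values in [-1, 1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    return lo + (waves + 1.0) / 2.0 * (hi - lo)


def snap_to_nearest(X, landmarks):
    """Replace each landmark with its nearest row of X (Euclidean distance)."""
    # Pairwise squared distances, shape (n_landmarks, n_samples)
    d2 = ((landmarks[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return X[d2.argmin(axis=1)]


# Small demo on random data (not the benchmark dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))

rand_lm = random_landmarks(X_demo, 10)        # "random" style
synth_lm = sine_landmarks(X_demo, 10)         # "synthetic" style
hybrid_lm = snap_to_nearest(X_demo, synth_lm)  # "hybrid" style
print(rand_lm.shape, synth_lm.shape, hybrid_lm.shape)
```

The practical difference this sketch highlights is that random and hybrid landmarks are always real data points, while purely synthetic landmarks need not coincide with any sample.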

In [None]:
# Define methods to benchmark
methods = [
    (
        "Random Mode",
        LandmarkTriangulation(
            n_landmarks=150, 
            landmark_mode="random", 
            random_state=42
        ),
    ),
    (
        "Synthetic Mode",
        LandmarkTriangulation(
            n_landmarks=150, 
            landmark_mode="synthetic", 
            random_state=42
        ),
    ),
    (
        "Hybrid Mode",
        LandmarkTriangulation(
            n_landmarks=150, 
            landmark_mode="hybrid", 
            random_state=42
        ),
    ),
    (
        "t-SNE",
        TSNE(
            n_components=2, 
            init="pca", 
            learning_rate="auto", 
            random_state=42
        ),
    ),
]

print(f"Prepared {len(methods)} methods for benchmarking")

## 5. Run Benchmark

Execute all methods and collect timing and quality metrics.
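
Each method is timed once in this notebook, so the reported runtimes can vary noticeably between runs. If steadier numbers are needed, one option is to repeat each fit on a fresh model and report the best and median times. The helper below is a sketch of that idea; `time_fit_transform` and the stand-in `SquareModel` are hypothetical names introduced only for illustration.

```python
import statistics
import time


def time_fit_transform(make_model, X, repeats=3):
    """Time fit_transform over several fresh models.

    make_model is a zero-argument callable returning an unfitted model.
    Returns the best and median wall-clock durations in seconds.
    """
    durations = []
    for _ in range(repeats):
        model = make_model()  # fresh, unfitted model for each run
        start = time.perf_counter()
        model.fit_transform(X)
        durations.append(time.perf_counter() - start)
    return {"best": min(durations), "median": statistics.median(durations)}


class SquareModel:
    """Stand-in model used only to keep this sketch self-contained."""

    def fit_transform(self, X):
        return [[v * v for v in row] for row in X]


timings = time_fit_transform(SquareModel, [[1.0, 2.0], [3.0, 4.0]], repeats=5)
print(f"best={timings['best']:.6f}s | median={timings['median']:.6f}s")
```

In this notebook the same helper could be called with a factory such as `lambda: TSNE(n_components=2, init="pca", learning_rate="auto", random_state=42)` and the benchmark matrix `X`.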

In [None]:
# Run all methods and collect results
print("\n" + "=" * 60)
print("BENCHMARK START")
print("=" * 60 + "\n")

results = []
for name, model in methods:
    res = run_method(name, model, X, y)
    if res is not None:
        results.append(res)

print("\n" + "=" * 60)
print("BENCHMARK COMPLETE")
print("=" * 60)

## 6. Results Summary

Display performance metrics and identify the fastest method and best clustering quality.

In [None]:
# Create summary table
df_results = pd.DataFrame(results)[["Method", "Time (s)", "Silhouette"]]

print("\n" + "=" * 60)
print("RESULTS SUMMARY")
print("=" * 60)
print(df_results.to_string(index=False))
print("=" * 60)

# Highlight fastest and best quality
fastest = df_results.loc[df_results["Time (s)"].idxmin()]
best_quality = df_results.loc[df_results["Silhouette"].idxmax()]

print(f"\n⚡ Fastest: {fastest['Method']} ({fastest['Time (s)']:.2f}s)")
print(f"🎯 Best Quality: {best_quality['Method']} (Silhouette: {best_quality['Silhouette']:.3f})")

# Calculate speedup relative to t-SNE (skipped if the t-SNE run failed)
tsne_rows = df_results.loc[df_results["Method"] == "t-SNE", "Time (s)"]
if not tsne_rows.empty:
    tsne_time = tsne_rows.iloc[0]
    for _, row in df_results.iterrows():
        if row["Method"] != "t-SNE":
            speedup = tsne_time / row["Time (s)"]
            print(f"📊 {row['Method']}: {speedup:.1f}x faster than t-SNE")

## 7. Visualization

Compare embeddings visually across all methods.

In [None]:
# Create comparison plot
fig, axes = plt.subplots(2, 2, figsize=(14, 10), constrained_layout=True)
axes = axes.flatten()

for i, res in enumerate(results):
    ax = axes[i]
    emb = res["Embedding"]

    scatter = ax.scatter(
        emb[:, 0],
        emb[:, 1],
        c=y,
        cmap="viridis",
        alpha=0.6,
        s=8,
        edgecolors="none",
    )

    ax.set_title(
        f"{res['Method']}\n"
        f"Time: {res['Time (s)']:.2f}s | Silhouette: {res['Silhouette']:.3f}",
        fontsize=11,
    )
    ax.set_xticks([])
    ax.set_yticks([])
    ax.grid(True, alpha=0.2, linestyle="--")

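# Hide any unused panels (e.g. if a method failed), then add a single colorbar
for unused_ax in axes[len(results):]:
    unused_ax.set_visible(False)
fig.colorbar(scatter, ax=axes.ravel().tolist(), label="Cluster")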

fig.suptitle(
    "Dimensionality Reduction Benchmark: LandmarkTriangulation vs t-SNE",
    fontsize=14,
    fontweight="bold",
)

plt.show()