# UAT for Cycle 03: The Filter - Surrogate Explorer and Selector

This notebook provides a visual demonstration of the **`SurrogateExplorer`** module. It answers one question: *"How do you avoid wasting supercomputer time on useless calculations?"*

We will walk through the two-stage filtering process (surrogate screening, then farthest point sampling) to make the concept of "intelligent selection" concrete. 2D scatter plots stand in for the high-dimensional descriptor space and show how a large, redundant dataset is automatically reduced to a small, information-rich subset.

## Part 1: Generating a Candidate Set

First, we'll create a synthetic dataset to simulate the output of the `PhysicsInformedGenerator`. To make the scenario interesting, the data will contain two distinct clusters, representing a common situation where many candidate structures are redundant and clustered together in descriptor space.

In [None]:
import os
import sys
from unittest.mock import MagicMock

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Add the project root to the Python path to import our modules
sys.path.append(os.path.abspath(os.path.join("..", "..")))

from mlip_autopipec.config_schemas import (
    ExplorerParams,
    FPSParams,
    SOAPParams,
    SurrogateModelParams,
)
from mlip_autopipec.modules.descriptors import SOAPDescriptorCalculator
from mlip_autopipec.modules.explorer import SurrogateExplorer
from mlip_autopipec.modules.screening import SurrogateModelScreener

print("✓ Imports successful")

In [None]:
# Step 1.1: Generate a synthetic dataset with two clusters
np.random.seed(42)
cluster1 = np.random.randn(400, 2) + np.array([3, 3])
cluster2 = np.random.randn(400, 2) + np.array([-3, -3])
outliers = np.random.uniform(-10, 10, size=(200, 2))
initial_candidates = np.vstack([cluster1, cluster2, outliers])

print(f"Generated {len(initial_candidates)} initial candidate structures.")

# Step 1.2: Visualize the initial candidate set
sns.set_style("whitegrid")
plt.figure(figsize=(8, 8))
plt.scatter(initial_candidates[:, 0], initial_candidates[:, 1], s=10, alpha=0.6)
plt.title(f"Step 1: Initial Candidate Set ({len(initial_candidates)} structures)")
plt.xlabel("Descriptor Dimension 1")
plt.ylabel("Descriptor Dimension 2")
plt.show()

The plot above shows the problem: most candidates sit in two dense clusters, so a uniform random selection would draw most of its points from those clusters and provide largely redundant information. Our goal is to select points that are spread out and cover the entire space.
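
The redundancy of a random draw can be estimated with a short sketch. It rebuilds a dataset of the same shape as the one above and measures how many of 15 randomly chosen structures land inside a cluster. Two details are assumptions made for illustration only: "inside a cluster" is defined as lying within 3 units of a cluster centre, and the sketch uses `np.random.default_rng` instead of the notebook's global seed, so the exact points differ from the plot.

```python
import numpy as np

# Rebuild a two-cluster dataset with the same shape as the notebook's.
rng = np.random.default_rng(42)
cluster1 = rng.standard_normal((400, 2)) + np.array([3, 3])
cluster2 = rng.standard_normal((400, 2)) + np.array([-3, -3])
outliers = rng.uniform(-10, 10, size=(200, 2))
candidates = np.vstack([cluster1, cluster2, outliers])

# Assumption: a point counts as "inside a cluster" if it lies within
# 3 units of either cluster centre.
centres = np.array([[3.0, 3.0], [-3.0, -3.0]])
dist_to_centres = np.linalg.norm(
    candidates[:, None, :] - centres[None, :, :], axis=2
)
in_cluster = dist_to_centres.min(axis=1) < 3.0

# Draw 15 structures uniformly at random, many times, and average the
# fraction of picks that fall inside a cluster.
n_select, n_trials = 15, 1000
fractions = np.array([
    in_cluster[rng.choice(len(candidates), size=n_select, replace=False)].mean()
    for _ in range(n_trials)
])
random_cluster_fraction = fractions.mean()

print(f"Fraction of all candidates inside a cluster: {in_cluster.mean():.2f}")
print(f"Average fraction of random picks inside a cluster: {random_cluster_fraction:.2f}")
```

A uniform random draw mirrors the density of the pool, so the two printed fractions should be close to each other. That is the behaviour the filter is meant to avoid.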

## Part 2: Surrogate Screening

Next, we simulate the first filtering stage. The `SurrogateExplorer` would use a fast, pre-trained model (like MACE) to predict the energy of each structure. Any structure with an unphysically high energy (e.g., due to overlapping atoms) is discarded.
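
The screening rule described above amounts to a threshold on predicted energies. The following is a minimal sketch of that idea, not the actual `SurrogateModelScreener` implementation: the function name `screen_by_energy` is hypothetical, the energies are made-up numbers, and whether the real threshold applies per atom or per structure is not specified in this notebook.

```python
import numpy as np


def screen_by_energy(energies_ev, threshold_ev):
    """Return a boolean mask that is True for structures to keep.

    Hypothetical helper: keeps structures whose predicted energy is
    finite and does not exceed the threshold.
    """
    energies_ev = np.asarray(energies_ev, dtype=float)
    # NaN or infinite predictions are treated as failures and rejected.
    return np.isfinite(energies_ev) & (energies_ev <= threshold_ev)


# Made-up surrogate predictions (eV) for five structures.
predicted = np.array([-3.2, -2.9, 15.7, -3.0, np.nan])
keep_mask = screen_by_energy(predicted, threshold_ev=0.0)
kept_indices = np.flatnonzero(keep_mask)
print(kept_indices)  # [0 1 3]
```

Returning a mask rather than the filtered array keeps the original indexing intact, so the kept and discarded structures can both be recovered, as the plotting cell below does.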

For this visualization, we will synthetically label some points as "high-energy" and remove them.

In [None]:
# Step 2.1: Simulate the energy screening
# We'll define a region (a circle) and label points within it as high-energy.
center = np.array([0, 0])
radius = 4.0
distances = np.linalg.norm(initial_candidates - center, axis=1)

high_energy_mask = distances < radius
screened_candidates = initial_candidates[~high_energy_mask]
discarded_candidates = initial_candidates[high_energy_mask]

print(f"{np.sum(high_energy_mask)} structures were discarded as 'high-energy'.")
print(f"{len(screened_candidates)} structures remain after screening.")

# Step 2.2: Visualize the result of the screening
plt.figure(figsize=(8, 8))
plt.scatter(
    screened_candidates[:, 0],
    screened_candidates[:, 1],
    s=10,
    alpha=0.6,
    label="Kept",
)
plt.scatter(
    discarded_candidates[:, 0],
    discarded_candidates[:, 1],
    s=10,
    alpha=0.6,
    c="red",
    label="Discarded",
)
plt.title("Step 2: After Surrogate Screening")
plt.xlabel("Descriptor Dimension 1")
plt.ylabel("Descriptor Dimension 2")
plt.legend()
plt.show()

## Part 3: Farthest Point Sampling (FPS)

This is the core of the intelligent selection process. From the remaining pool of physically plausible structures, we now select a small, maximally diverse subset using FPS. At each step, the algorithm picks the candidate whose distance to its nearest already-selected point is largest, which pushes every new pick into the least-covered region and ensures broad coverage of the descriptor space.
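
The greedy rule can be sketched in a few lines of NumPy. This is a textbook version of FPS written for illustration, not the code behind `SurrogateExplorer._farthest_point_sampling`. The function name, the `start_index` parameter, and the choice of a fixed rather than random starting point are all assumptions.

```python
import numpy as np


def farthest_point_sampling(points, n_select, start_index=0):
    """Greedy FPS sketch: returns indices in the order they were picked."""
    points = np.asarray(points, dtype=float)
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and the number of points")

    selected = [start_index]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = np.linalg.norm(points - points[start_index], axis=1)

    for _ in range(n_select - 1):
        # Pick the point that is farthest from the current selection.
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        # Update nearest-selected distances with the newly added point.
        new_dist = np.linalg.norm(points - points[next_index], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return np.array(selected)


# Five points on a line: a tight group near 0 plus two distant points.
line = np.array([[0.0], [0.1], [0.2], [5.0], [10.0]])
picked = farthest_point_sampling(line, n_select=3)
print(picked)  # [0 4 3]
```

In the toy example the tight group near 0 contributes only the starting point. The other two picks go to the distant points at 10 and 5, which is the behaviour we want on the clustered dataset. Keeping a running `min_dist` array makes each step cost one pass over the pool instead of recomputing all pairwise distances.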

This is the **"Wow" moment**.

In [None]:
# Step 3.1: Run the FPS algorithm
num_structures_to_select = 15

# We need a dummy config and dependencies to instantiate the explorer
dummy_explorer_config = ExplorerParams(
    surrogate_model=SurrogateModelParams(model_path="", energy_threshold_ev=0),
    fps=FPSParams(num_structures_to_select=num_structures_to_select),
)
dummy_calculator = SOAPDescriptorCalculator(SOAPParams(), species=["X"])
dummy_screener = MagicMock(spec=SurrogateModelScreener)

explorer = SurrogateExplorer(
    config=dummy_explorer_config,
    descriptor_calculator=dummy_calculator,
    screener=dummy_screener,
)

# Fix the seed for reproducibility
np.random.seed(1)
selected_indices = explorer._farthest_point_sampling(
    screened_candidates, num_structures_to_select
)
final_selection = screened_candidates[selected_indices]

print(f"Selected {len(final_selection)} structures via FPS.")

# Step 3.2: Create the final visualization
plt.figure(figsize=(10, 10))
# Plot all original points in light grey
plt.scatter(
    initial_candidates[:, 0],
    initial_candidates[:, 1],
    s=10,
    alpha=0.1,
    c="gray",
    label="Original Candidates",
)
# Plot the screened-out points in light red
plt.scatter(
    discarded_candidates[:, 0],
    discarded_candidates[:, 1],
    s=10,
    alpha=0.3,
    c="red",
    label="Discarded by Surrogate",
)
# Plot the final FPS selection in vibrant blue
plt.scatter(
    final_selection[:, 0],
    final_selection[:, 1],
    s=50,
    c="blue",
    edgecolor="w",
    linewidth=1,
    label=f"Final Selection ({num_structures_to_select} structures)",
    zorder=3,
)
# Connect the selected points in index order to show the selection path
# (assumes the FPS routine returns indices in the order they were picked)
plt.plot(
    final_selection[:, 0],
    final_selection[:, 1],
    "-o",
    c="blue",
    alpha=0.5,
    markersize=3,
    linewidth=1,
)

plt.title("Step 3: Final Diverse Selection via FPS", fontsize=16)
plt.xlabel("Descriptor Dimension 1")
plt.ylabel("Descriptor Dimension 2")
plt.legend()
plt.show()

The final plot shows the effect of the two filtering stages. Instead of picking redundant points from the dense clusters, the FPS algorithm has chosen points spread across the boundaries and outlying regions of the screened dataset. Each subsequent, expensive DFT calculation therefore samples a distinct part of descriptor space, which should yield a more accurate and robust MLIP with less wasted computation.
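
The visual impression can be backed by two simple numbers. The sketch below defines a covering radius (how far the worst-served pool point is from its nearest selected point) and a minimum pairwise distance (how close the two most redundant picks are). Both helper names are hypothetical and not part of `mlip_autopipec`, and the toy data is made up to keep the example self-contained.

```python
import numpy as np


def covering_radius(pool, selected):
    """Largest distance from any pool point to its nearest selected point."""
    diffs = pool[:, None, :] - selected[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1).max()


def min_pairwise_distance(selected):
    """Smallest distance between two distinct selected points."""
    diffs = selected[:, None, :] - selected[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return dists.min()


# Toy pool: four corners of a square plus two near-duplicates of one corner.
pool = np.array(
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [0.1, 0.0], [0.0, 0.1]]
)
spread = pool[[0, 1, 2, 3]]   # one pick per corner
clumped = pool[[0, 4, 5, 3]]  # three picks crowded around one corner

for name, subset in [("spread", spread), ("clumped", clumped)]:
    print(
        f"{name}: covering radius = {covering_radius(pool, subset):.2f}, "
        f"min pairwise distance = {min_pairwise_distance(subset):.2f}"
    )
```

A diverse selection has a small covering radius and a large minimum pairwise distance. In this notebook the same functions could be called with `screened_candidates` as the pool and `final_selection` as the subset, then compared against a random subset of the same size.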