<a href="https://colab.research.google.com/github/sidchaini/dimmadtutorial/blob/main/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![github-badge](https://img.shields.io/badge/GitHub-sidchaini/dimmadtutorial-blue)](https://github.com/sidchaini/dimmadtutorial)

# Distance Multi-Metric Anomaly Detection
Author: Siddharth Chaini

4 February 2026 (Prepared for [Quasar Bazaar Hackweek](https://indico.sissa.it/event/178/))

## 0. Installing DistClassiPy
- The anomaly detector is now included in DistClassiPy. If the package is not already installed, `pip install distclassipy` should fetch it from PyPI.

In [None]:
import distclassipy as dcpy

print(dcpy.__version__)

## 1. Other imports and preamble

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from distclassipy.anomaly import DistanceAnomaly
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

sns.set_theme(context="talk", style="whitegrid", palette="tab10")
%matplotlib inline

seed_val = 44
np.random.seed(seed_val)

## 2. Let us use features derived from ZTF by ALeRCE 
These features were extracted by the [ALeRCE team](https://science.alerce.online/) from Zwicky Transient Facility (ZTF) light curves using their pipeline ([Sánchez-Sáez+21](https://ui.adsabs.harvard.edu/abs/2021AJ....161..141S/abstract)), whose code is available on [GitHub](https://github.com/alercebroker/pipeline).

Note: We'll download these features as a parquet file from the data directory of the [DiMMAD paper](https://ml4physicalsciences.github.io/2025/files/NeurIPS_ML4PS_2025_222.pdf) repository.

In [None]:
url = "https://github.com/sidchaini/DiMMAD/raw/refs/heads/main/data/alerceztf_features.parquet"

df = pd.read_parquet(...)
# df.head(5)

In [None]:
print(f"Total objects: {len(df)}")
print(f"Classes found: {df['class'].unique()}")

## 3. Knowns, and Unknowns

Let us consider an alternate universe: a world where a few classes of supernovae are well known, but QSOs and AGNs have not yet been discovered and are `unknown` to us.

So, we will train our algorithm(s) only on objects belonging to the known classes. We will then hide the unknowns among a held-out set of knowns, and see whether our algorithm can recover them from the test set as anomalies.

In [None]:
known_classes = ["SNIa", "SNII", "SLSN", "SNIbc"]
unknown_classes = [...]

In [None]:
features_to_use = [c for c in df.columns if c.startswith("SPM")]
# features_to_use

In [None]:
df_subset = (
    df[df["class"].isin(known_classes + unknown_classes)][features_to_use + ["class"]]
    .dropna()
    .copy()
)

In [None]:
# knowns

df_known = df_subset[df_subset["class"].isin(known_classes)].dropna()
X_known = df_known[features_to_use].values
y_known = df_known["class"].values

In [None]:
# unknowns

df_unknown = (
    df_subset[df_subset["class"].isin(unknown_classes)]
    .dropna()
    .sample(100, random_state=seed_val)
)
X_unknown = df_unknown[features_to_use].values
y_unknown = df_unknown["class"].values

In [None]:
# train = knowns

X_train, X_test_inliers, y_train, y_test_inliers = train_test_split(
    X_known, y_known, test_size=0.3, stratify=y_known, random_state=seed_val
)

In [None]:
# test = knowns + unknowns

X_test = np.vstack([X_test_inliers, X_unknown])
y_test = np.concatenate([y_test_inliers, y_unknown])


y_test_binary = np.isin(y_test, unknown_classes).astype(int)  # 1 = anomaly (unknown class), 0 = known
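
The binary label above relies on `np.isin`, which returns a boolean mask marking the entries of the first array that appear in the second. A tiny sketch, using placeholder class names (`"A"`, `"B"` and `"X"` are made up and are not classes from the dataset):

```python
import numpy as np

# Placeholder labels: "X" plays the role of an unknown class.
demo_y = np.array(["A", "X", "B", "X", "A"])
demo_unknown = ["X"]

# True where the label is in the unknown list, then cast to 0/1.
demo_binary = np.isin(demo_y, demo_unknown).astype(int)
print(demo_binary)  # 1 marks an unknown-class object
```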

In [None]:
print(f"Training Size: {len(...)} (All Knowns)")
print(
    f"Test Size: {len(...)} ({np.sum(y_test_binary)} unknowns + {(y_test_binary==0).sum()} knowns)"
)

## 4. Using DistanceAnomaly

In [None]:
model = ...


# cluster_agg='min': Distance to the nearest class centroid
# metric_agg='median': Consensus across the 16 distance metrics
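
The two aggregation steps described in the comments above can be illustrated with a minimal NumPy/SciPy sketch. This is not DistClassiPy's implementation: the helper `toy_anomaly_score`, the four-metric list `TOY_METRICS`, and the use of per-class mean centroids are assumptions made purely for illustration, and the sketch skips any rescaling that a real detector may apply so that metrics with different ranges are comparable.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical short list of metrics, chosen only for this illustration.
TOY_METRICS = ["euclidean", "cityblock", "chebyshev", "canberra"]


def toy_anomaly_score(X_train, y_train, X_test, metrics=TOY_METRICS):
    """Toy score: min over class centroids, then median over metrics."""
    classes = np.unique(y_train)
    # One centroid (feature-wise mean) per known class.
    centroids = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])

    per_metric = []
    for metric in metrics:
        # Shape (n_test, n_classes): distance of each object to each centroid.
        d = cdist(X_test, centroids, metric=metric)
        # Cluster aggregation "min": keep the distance to the nearest centroid.
        per_metric.append(d.min(axis=1))

    # Metric aggregation "median": consensus across the metrics.
    return np.median(np.vstack(per_metric), axis=0)


# Two synthetic "known" classes in a 3-feature space.
rng = np.random.default_rng(0)
X_toy = np.vstack(
    [rng.normal(0.0, 1.0, size=(50, 3)), rng.normal(5.0, 1.0, size=(50, 3))]
)
y_toy = np.array(["a"] * 50 + ["b"] * 50)

# One object near class "a", and one far from both classes.
X_new = np.array([[0.0, 0.0, 0.0], [20.0, -20.0, 20.0]])
toy_scores = toy_anomaly_score(X_toy, y_toy, X_new)
print(toy_scores)
```

The second object sits far from both centroids under most of the metrics, so its median score comes out larger than that of the first object.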

In [None]:
print("Training DiMMAD...")
model.fit(...)
print("Done!")

scores = ...
# higher is more anomalous

## 5. Some quick checks on the results

### 5.1. What are the most anomalous objects?

In [None]:
results = pd.DataFrame(
    {"True_Class": y_test, "Is_Anomaly": y_test_binary, "Anomaly_Score": scores}
)

results = results.sort_values("Anomaly_Score", ascending=False)
# most anomalous at the top

...

### 5.2. Do anomalous objects have high anomaly scores?

In [None]:
sns.histplot(
    data=results,
    x="Anomaly_Score",
    hue="Is_Anomaly",
    element="step",
    common_norm=False,
    bins=100,
    palette={0: "tab:blue", 1: "tab:orange"},
)

plt.xlabel("DiMMAD Anomaly Score")
plt.title("Separation of Knowns (0) vs Unknowns (1)")
plt.show()
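
The histogram gives a visual impression of the separation; a single-number summary can complement it. One common choice, offered here as an optional extra, is the area under the ROC curve from scikit-learn's `roc_auc_score`. The sketch below runs on made-up labels and scores (`demo_labels` and `demo_scores` are hypothetical, not results from the dataset).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: 3 knowns (0) and 2 unknowns (1) with arbitrary scores.
demo_labels = np.array([0, 0, 0, 1, 1])
demo_scores = np.array([0.10, 0.40, 0.35, 0.80, 0.30])

# AUC is the probability that a random unknown outscores a random known:
# 0.5 means no separation, 1.0 means perfect separation.
demo_auc = roc_auc_score(demo_labels, demo_scores)
print(f"ROC AUC: {demo_auc:.3f}")
```

In the notebook, the analogous call would be `roc_auc_score(y_test_binary, scores)`.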

### 5.3. If we have a limited "budget", how well do we do?

In [None]:
budget = ...
top_candidates = results.head(budget)
purity = top_candidates["Is_Anomaly"].mean()

print(f"Budget: {budget} observations")
print(f"True Discoveries (QSO/AGN): {top_candidates['Is_Anomaly'].sum()}")
print(f"Purity: {purity:.1%}")
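
The budget check above is a precision-at-k calculation: sort by score, keep the top `k` objects, and take the fraction that are true anomalies. A minimal sketch on made-up scores and labels follows; the helper name `purity_at_budget` and the toy arrays are hypothetical and not part of DistClassiPy.

```python
import numpy as np


def purity_at_budget(scores, is_anomaly, budget):
    """Fraction of true anomalies among the `budget` highest-scoring objects."""
    scores = np.asarray(scores, dtype=float)
    is_anomaly = np.asarray(is_anomaly, dtype=int)
    # Indices sorted from most to least anomalous.
    order = np.argsort(scores)[::-1]
    top = order[:budget]
    return is_anomaly[top].mean()


# Made-up example: 6 objects, 2 of them true anomalies.
budget_demo_scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
budget_demo_labels = [1, 0, 0, 0, 1, 0]

for k in (1, 3):
    p = purity_at_budget(budget_demo_scores, budget_demo_labels, k)
    print(f"budget={k}: purity={p:.2f}")
```

With the notebook's `results` table, the equivalent call would be `purity_at_budget(results["Anomaly_Score"], results["Is_Anomaly"], budget)`, which makes it easy to compare several budgets in a loop.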

---
---

If you're interested in more details, take a look at our:
1. ML4PS@NeurIPS [Paper](https://ml4physicalsciences.github.io/2025/files/NeurIPS_ML4PS_2025_222.pdf) / [Poster](https://neurips.cc/media/PosterPDFs/NeurIPS%202025/122936.png)
2. Accompanying [GitHub](https://github.com/sidchaini/dimmad)
3. DistClassiPy [source code](https://github.com/sidchaini/distclassipy/)

<h1 align="center">
<picture align="center">
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/sidchaini/DistClassiPy/main/docs/_static/logo-dark.svg" width="300">
  <img alt="DistClassiPy Logo" src="https://raw.githubusercontent.com/sidchaini/DistClassiPy/main/docs/_static/logo.svg" width="300">
</picture>
</h1>


---
---