[![colab-button](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sidchaini/DistClassiPyTutorial/blob/main/tutorial.ipynb)

[![github-badge](https://img.shields.io/badge/GitHub-sidchaini/DistClassiPyTutorial-blue)](https://github.com/sidchaini/DistClassiPyTutorial)

# Leveraging Distance Metrics for Better Machine Learning

**Siddharth Chaini, 6th January, 2025**

(Special thanks to Federica Bianco, Ashish Mahabal and Ajit Kembhavi!)

This hands-on session is largely based on and derived from the work described in [Chaini et al. 2024](https://arxiv.org/abs/2403.12120). It will go over:
1. What are distance metrics?
2. Where are they used in machine learning?
3. DistClassiPy
    - Demo on a real astronomical dataset!

---

### 0. Prerequisites

Let us first install DistClassiPy from PyPI. I am installing 0.2.1, the latest as of 2025-01-05.

In [None]:
!pip install distclassipy==0.2.1 # latest as of 2025-01-05.

In [None]:
# @title
%%capture
!wget https://github.com/sidchaini/DistClassiPyTutorial/archive/refs/heads/main.zip
!unzip main.zip
!mv DistClassiPyTutorial-main/* .
!rm -rf main.zip DistClassiPyTutorial-main

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

import distclassipy as dcpy
import utils  # local helper module

# Fix the random seed so that results are reproducible
seed = 0
np.random.seed(seed)

print(f"distclassipy version {dcpy.__version__}")

---

### 1. What are distance metrics?

**Definition**: A distance is a quantity that tells us how dissimilar two objects are: the smaller the distance, the more similar the objects. A distance metric satisfies the following axioms:
1. *Identity of indiscernibles*: $$d(x, y)=0 \iff x=y $$
2. *Symmetry*: $$d(x, y)=d(y, x)$$
3. *Triangle inequality*: $$d(x, y)\leq d(x, z) + d(z, y)$$
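
The axioms above can be spot-checked numerically. The following is a minimal sketch, not part of DistClassiPy: `check_metric_axioms`, `euclidean` and `squared_euclidean` are hypothetical helper names introduced here for illustration. A test on a finite sample can only disprove an axiom (by finding a counterexample), never prove it.

```python
import itertools

import numpy as np


def check_metric_axioms(dist, points, tol=1e-9):
    """Empirically test the three axioms on a finite sample of points.

    Passing is only evidence, not a proof; one failure is a counterexample.
    """
    results = {"identity": True, "symmetry": True, "triangle": True}
    for x, y in itertools.product(points, repeat=2):
        d_xy = dist(x, y)
        same = np.array_equal(x, y)
        # d(x, y) = 0 must hold exactly when x and y coincide
        if (abs(d_xy) <= tol) != same:
            results["identity"] = False
        if abs(d_xy - dist(y, x)) > tol:
            results["symmetry"] = False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            results["triangle"] = False
    return results


def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))


def squared_euclidean(x, y):
    return np.sum((x - y) ** 2)


rng = np.random.default_rng(0)
sample = list(rng.normal(size=(8, 3)))
# Three collinear points make triangle-inequality violations easy to find
sample += [np.zeros(3), np.ones(3), 2 * np.ones(3)]

print("euclidean:        ", check_metric_axioms(euclidean, sample))
print("squared euclidean:", check_metric_axioms(squared_euclidean, sample))
```

The squared Euclidean distance fails the triangle inequality on the collinear points: going from the origin to $(2,2,2)$ directly costs 12, while going via $(1,1,1)$ costs only 3 + 3 = 6.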

---

**Small exercise**: Which of the following is a distance metric, and which is not? Why?

In [None]:
def custom_fn1(x, y):
    return np.sum(np.abs(x - y))

In [None]:
def custom_fn2(x, y):
    return (1 + np.sum(np.abs(x - y)))**2

In [None]:
...

**Visualizing 2D distance metric spaces**: We can plot the locus of points that lie at a fixed distance from a central point (*e.g.,* $(5,5)$) in a given two-dimensional space. These loci appear as contour lines, which illustrate the geometry of the space when plotted in Euclidean space.

In [None]:
...
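
One possible way to build such a picture, sketched here with NumPy only, is to evaluate the distance from the central point at every node of a grid; the level sets of the resulting array are the loci. The three metrics and the grid resolution below are illustrative choices, not necessarily those used in the session.

```python
import numpy as np

center = np.array([5.0, 5.0])
xs, ys = np.meshgrid(np.linspace(0, 10, 101), np.linspace(0, 10, 101))
dx, dy = xs - center[0], ys - center[1]

# Distance from `center` to every grid node, for three common metrics
fields = {
    "euclidean": np.sqrt(dx**2 + dy**2),
    "cityblock": np.abs(dx) + np.abs(dy),
    "chebyshev": np.maximum(np.abs(dx), np.abs(dy)),
}

# Compare the distance to the grid corner (10, 10) under each metric
for name, field in fields.items():
    print(f"{name:>10}: distance to (10, 10) = {field[-1, -1]:.3f}")
```

Each array can then be passed to `plt.contour(xs, ys, field)`: the Euclidean loci are circles, the cityblock loci are diamonds, and the Chebyshev loci are axis-aligned squares.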

---

### 2. Distances in Machine Learning

Distance metrics power different ML tasks:

- **Clustering**: Distance metrics help group similar data points (e.g., K-Means, Hierarchical Clustering).
- **Dimensionality Reduction**: They preserve data structure in fewer dimensions (e.g., PCA, t-SNE).
- **Classification**: They determine proximity for decision-making (e.g., K-Nearest Neighbors, SVM, **DistClassiPy**).
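
To make the classification case concrete, here is a minimal sketch of a nearest-centroid rule with a pluggable distance function. It is a toy illustration under stated assumptions and not DistClassiPy's implementation: `nearest_centroid_predict`, `cityblock` and the toy data are hypothetical names made up for this example.

```python
import numpy as np


def nearest_centroid_predict(X_fit, y_fit, X_new, dist):
    """Assign each new point to the class whose median centroid is closest."""
    classes = np.unique(y_fit)
    centroids = np.array(
        [np.median(X_fit[y_fit == c], axis=0) for c in classes]
    )
    # distances[i, j] = distance from new point i to the centroid of class j
    distances = np.array([[dist(x, c) for c in centroids] for x in X_new])
    return classes[np.argmin(distances, axis=1)]


def cityblock(x, y):
    return np.sum(np.abs(x - y))


# Two well-separated toy blobs, labelled "a" and "b"
rng = np.random.default_rng(0)
X_toy = np.vstack(
    [rng.normal(0, 0.5, size=(20, 2)), rng.normal(5, 0.5, size=(20, 2))]
)
y_toy = np.array(["a"] * 20 + ["b"] * 20)
X_query = np.array([[0.2, -0.1], [4.8, 5.3]])

print(nearest_centroid_predict(X_toy, y_toy, X_query, cityblock))
```

Swapping `cityblock` for another function changes the decision boundary without touching the rest of the classifier, which is the idea explored in the remainder of this session.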

In [None]:
# @title
from IPython.display import Video
Video(
    "https://sidchaini.github.io/videos/distclassipy.mp4", 
    width=480, height=240
)

---

### 3. DistClassiPy for ZTF Light Curve Classification

For this example, we will be using data from "The ZTF Source Classification Project: III. A Catalog of Variable Sources", which the authors have made available on Zenodo.

[![zenodo-badge](https://zenodo.org/badge/DOI/10.5281/zenodo.14155156.svg)](https://zenodo.org/records/14155156)

I downloaded the catalog and downsampled it to 4000 objects from 4 classes of variable stars:

In [None]:
...

In [None]:
...

For the sake of simplicity, let us focus on three features from the complete ZTF SCoPE features (refer to [Healy et al. 2024](https://arxiv.org/abs/2312.00143) for more details):
- ```inv_vonneumannratio```: Inverse of von Neumann ratio ([von Neumann 1941](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-12/issue-4/Distribution-of-the-Ratio-of-the-Mean-Square-Successive-Difference/10.1214/aoms/1177731677.full), [1942](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-13/issue-1/A-Further-Remark-Concerning-the-Distribution-of-the-Ratio-of/10.1214/aoms/1177731645.full)), which is the ratio of correlated variance and variance - it detects non-randomness, and a high value implies periodic behaviour.
- ```norm_peak_to_peak_amp```: Normalized peak-to-peak amplitude [(Sokolovsky et al. 2009)](https://arxiv.org/abs/0901.1064) - it tells us about the source brightness.
- ```stetson_k```: Stetson K coefficient ([Stetson 1996](https://iopscience.iop.org/article/10.1086/133808/meta?casa_token=EMo0hxKqIkUAAAAA:b8y8ONGzEQAJq2WJfrCASQt_FMw7HX_h7i-VChDbTYc1ShDkEih4I2Sm184VFLTS1UpDbATGN8GPmTY4YXRG87jP2Q)) is related to the observed scatter - it tells us about the light curve shape.
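
For intuition, two of these features can be computed from a light curve in a few lines. The sketch below uses the standard textbook definitions; the SCoPE pipeline may differ in weighting or normalization, so treat the function bodies as assumptions rather than the catalog's exact recipe. The synthetic light curves are made up for illustration.

```python
import numpy as np


def inv_von_neumann_ratio(mag):
    """Sample variance divided by the mean square successive difference."""
    mag = np.asarray(mag, dtype=float)
    mean_sq_successive_diff = np.mean(np.diff(mag) ** 2)
    return np.var(mag, ddof=1) / mean_sq_successive_diff


def stetson_k(mag, err):
    """Stetson K: mean absolute residual over root-mean-square residual."""
    mag = np.asarray(mag, dtype=float)
    err = np.asarray(err, dtype=float)
    n = mag.size
    weights = 1.0 / err**2
    weighted_mean = np.sum(weights * mag) / np.sum(weights)
    # Residuals normalized by the quoted uncertainties
    delta = np.sqrt(n / (n - 1)) * (mag - weighted_mean) / err
    return np.sum(np.abs(delta)) / np.sqrt(n * np.sum(delta**2))


# Synthetic light curves: a smooth periodic signal and pure white noise
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
err = np.full(t.size, 0.05)
smooth = np.sin(t)
noise = rng.normal(size=t.size)

for name, mag in [("smooth sine", smooth), ("white noise", noise)]:
    print(
        f"{name}: inv_vonneumannratio = {inv_von_neumann_ratio(mag):.2f}, "
        f"stetson_k = {stetson_k(mag, err):.3f}"
    )
```

For uncorrelated noise the von Neumann ratio is close to 2, so its inverse is close to 0.5, while a smooth, well-sampled periodic signal gives a much larger value. For Gaussian scatter, Stetson K approaches $\sqrt{2/\pi} \approx 0.798$.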

In [None]:
feature_names = ...

In [None]:
# @title
df = features.loc[:, feature_names].copy()  # copy, so adding a column does not touch `features`
df["class"] = labels["class"]
sns.pairplot(df, hue="class")
plt.show()

In [None]:
# @title
X = features.loc[:, feature_names].to_numpy()
y = labels.to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=seed
)

In [None]:
clf = ...

In [None]:
clf.fit(...)

In [None]:
y_pred = ...

In [None]:
# @title
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
f1 = f1_score(y_true=y_test, y_pred=y_pred, average="macro")

print(f"Accuracy = {acc:.3f}")
print(f"F1 = {f1:.3f}")

---

#### Using multiple distance metrics together!

Instead of committing to a single distance metric, we can combine several of them.

**Case 1**: Keeping the same set of features, vary the distance metric.
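
As a rough illustration of the idea, the sketch below lets several metrics each pick the nearest class centroid and then takes a majority vote. This simple voting scheme is swapped in purely for illustration: it is an assumption of this example, not a description of how DistClassiPy combines metrics. `METRICS`, `vote_across_metrics` and the toy data are hypothetical names.

```python
import numpy as np

# Each function broadcasts over the last axis (the feature axis)
METRICS = {
    "euclidean": lambda x, c: np.sqrt(np.sum((x - c) ** 2, axis=-1)),
    "cityblock": lambda x, c: np.sum(np.abs(x - c), axis=-1),
    "chebyshev": lambda x, c: np.max(np.abs(x - c), axis=-1),
}


def vote_across_metrics(X_fit, y_fit, X_new, metrics=METRICS):
    """Majority vote over the nearest-centroid prediction of each metric."""
    classes = np.unique(y_fit)
    centroids = np.array(
        [np.median(X_fit[y_fit == c], axis=0) for c in classes]
    )
    votes = np.zeros((len(X_new), len(classes)), dtype=int)
    for dist in metrics.values():
        # distances has shape (n_new, n_classes) thanks to broadcasting
        distances = dist(X_new[:, None, :], centroids[None, :, :])
        nearest = np.argmin(distances, axis=1)
        votes[np.arange(len(X_new)), nearest] += 1
    # Ties are resolved in favour of the first class in sorted order
    return classes[np.argmax(votes, axis=1)]


rng = np.random.default_rng(0)
X_toy = np.vstack(
    [rng.normal(0, 0.5, size=(20, 2)), rng.normal(5, 0.5, size=(20, 2))]
)
y_toy = np.array(["a"] * 20 + ["b"] * 20)
X_query = np.array([[0.2, -0.1], [4.8, 5.3]])

print(vote_across_metrics(X_toy, y_toy, X_query))
```

On well-separated data like this toy example all metrics agree, so voting changes nothing. It only matters near class boundaries, where different metrics draw different decision surfaces.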

In [None]:
ensemble_clf = ...
ensemble_clf.fit(...)

In [None]:
y_pred_ensemble = ...

In [None]:
# @title
acc = accuracy_score(y_true=y_test, y_pred=y_pred_ensemble)
f1 = f1_score(y_true=y_test, y_pred=y_pred_ensemble, average="macro")

print(f"Accuracy = {acc:.3f}")
print(f"F1 = {f1:.3f}")

In [None]:
...

In [None]:
...

In [None]:
# @title
sns.heatmap(
    ensemble_clf.quantile_scores_df_.drop_duplicates(), annot=True, cmap="Blues"
)
plt.show()

The performance improves, but not by a lot.

But what if we also allowed each metric to work with different features?

---

**Case 2**: Varying the features AND the distance metric.

From our work, we found:
- **We can select a distance metric that works best based on the object of interest!**

In [None]:
# @title
from IPython.display import Image
Image(url="https://arxiv.org/html/2403.12120v2/x31.png",width=480)

Performance improvement here is much more significant!

If you are interested in more details:

[![arxiv-badge](https://img.shields.io/badge/arXiv-2403.12120-red)](https://arxiv.org/abs/2403.12120)
[![github-badge](https://img.shields.io/badge/GitHub-sidchaini/LightCurveDistanceClassification-blue)](https://github.com/sidchaini/LightCurveDistanceClassification)
