<a href="https://colab.research.google.com/github/sidchaini/DistClassiPyTutorial/blob/main/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 0. Prerequisites

In [None]:
!pip install distclassipy==0.2.1 # latest as of 2024-10-22

In [None]:
# @title
%%capture
!wget https://github.com/sidchaini/DistClassiPyTutorial/archive/refs/heads/main.zip
!unzip main.zip
!mv DistClassiPyTutorial-main/* .
!rm -rf main.zip DistClassiPyTutorial-main

In [None]:
import numpy as np
seed = 0
np.random.seed(seed)
import pandas as pd
import distclassipy as dcpy
import utils
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

### 1. Visualizing 2D distance metric spaces

We can visualize a distance metric space by plotting the locus of points that lie at a fixed distance from a central point, such as (5, 5), in a two-dimensional space. These loci appear as contour lines, which illustrate the geometry of the metric when drawn in ordinary Euclidean coordinates.
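
As a rough illustration of the idea, the sketch below computes the distance from every point of a grid to the central point under three common metrics. The function name `distance_grid` and its parameters are hypothetical and are not part of `utils` or `distclassipy`; it only shows what the contour lines represent.

```python
import numpy as np


def distance_grid(metric, center=(5.0, 5.0), lo=0.0, hi=10.0, n=101):
    """Return grid coordinates and the distance of each grid point to `center`.

    Hypothetical helper for illustration only; supports three metrics.
    """
    xs = np.linspace(lo, hi, n)
    gx, gy = np.meshgrid(xs, xs)
    dx = np.abs(gx - center[0])
    dy = np.abs(gy - center[1])
    if metric == "euclidean":
        dist = np.sqrt(dx**2 + dy**2)  # circles around the center
    elif metric == "cityblock":
        dist = dx + dy                 # diamonds around the center
    elif metric == "chebyshev":
        dist = np.maximum(dx, dy)      # squares around the center
    else:
        raise ValueError(f"unsupported metric: {metric}")
    return gx, gy, dist


# Each level set of `dist` is one locus of equidistant points.
# To draw it: plt.contour(gx, gy, dist) after importing matplotlib.pyplot as plt.
gx, gy, dist = distance_grid("cityblock")
```

Swapping the metric name changes the shape of the level sets while the plotting axes stay the same, which is what makes the different geometries easy to compare.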

In [None]:
# @title
utils.visualize_distance("euclidean")
plt.show()

### 2. Data

For this example, we will use data from "The ZTF Source Classification Project: III. A Catalog of Variable Sources", which the authors have made available on Zenodo.

[![zenodo-badge](https://zenodo.org/badge/DOI/10.5281/zenodo.13920513.svg)](https://zenodo.org/records/13920513)

I downloaded the catalog and sampled it to select 4000 objects from 4 classes of variable stars:

In [None]:
features = pd.read_csv("data/ztfscope_features.csv", index_col=0)
labels = pd.read_csv("data/ztfscope_labels.csv", index_col=0)

In [None]:
# @title
labels.value_counts()

To understand what's going on, let us focus on three simple features (refer to [(Healy et al. 2024)](https://arxiv.org/abs/2312.00143) for more details):
- ```inv_vonneumannratio```: Inverse of the von Neumann ratio ([von Neumann 1941](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-12/issue-4/Distribution-of-the-Ratio-of-the-Mean-Square-Successive-Difference/10.1214/aoms/1177731677.full), [1942](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-13/issue-1/A-Further-Remark-Concerning-the-Distribution-of-the-Ratio-of/10.1214/aoms/1177731645.full)), which is the ratio of correlated variance to variance: an indicator of variability.
- ```norm_peak_to_peak_amp```: Normalized peak-to-peak amplitude (Sokolovsky et al. 2009)
- ```stetson_k```: Stetson K coefficient ([Stetson 1996](https://iopscience.iop.org/article/10.1086/133808/meta?casa_token=EMo0hxKqIkUAAAAA:b8y8ONGzEQAJq2WJfrCASQt_FMw7HX_h7i-VChDbTYc1ShDkEih4I2Sm184VFLTS1UpDbATGN8GPmTY4YXRG87jP2Q))
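
To give a feel for what these three numbers measure, here is a minimal sketch using the textbook definitions applied to a single light curve (magnitudes `mag` with uncertainties `err`). This is an assumption-laden illustration: the function names are hypothetical, and the exact conventions used to build the catalog (weighting, normalization, outlier handling) may differ from what is shown.

```python
import numpy as np


def inv_von_neumann_ratio(mag):
    """Sample variance divided by the mean squared successive difference."""
    mag = np.asarray(mag, dtype=float)
    mean_sq_diff = np.mean(np.diff(mag) ** 2)
    return np.var(mag, ddof=1) / mean_sq_diff


def norm_peak_to_peak_amp(mag, err):
    """Normalized peak-to-peak amplitude, shrunk by the measurement errors."""
    mag = np.asarray(mag, dtype=float)
    err = np.asarray(err, dtype=float)
    upper = np.max(mag - err)
    lower = np.min(mag + err)
    return (upper - lower) / (upper + lower)


def stetson_k(mag, err):
    """Stetson K: a kurtosis-like measure of the residual distribution."""
    mag = np.asarray(mag, dtype=float)
    err = np.asarray(err, dtype=float)
    n = mag.size
    weights = 1.0 / err**2
    weighted_mean = np.sum(weights * mag) / np.sum(weights)
    # Residuals normalized by their uncertainties
    delta = np.sqrt(n / (n - 1)) * (mag - weighted_mean) / err
    return np.sum(np.abs(delta)) / np.sqrt(n * np.sum(delta**2))


# Toy light curve that alternates between two levels
mag = [1.0, 2.0, 1.0, 2.0]
err = [0.1, 0.1, 0.1, 0.1]
print(inv_von_neumann_ratio(mag), norm_peak_to_peak_amp(mag, err), stetson_k(mag, err))
```

A smoothly varying source has small successive differences relative to its overall variance, so its inverse von Neumann ratio is large, whereas uncorrelated noise keeps it small.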

In [None]:
feature_names = ...

In [None]:
# @title
X = features.loc[:,feature_names].to_numpy()
y = labels.to_numpy().ravel()

In [None]:
# @title
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=seed)

### 3. DistanceMetricClassifier

The DistanceMetricClassifier computes the distance from each test point to the centroid of every class, scales it by the standard deviation of that class, and assigns the point to the class with the nearest centroid.
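
The idea can be sketched in plain NumPy as follows. This is not the `distclassipy` implementation: the class name `ScaledCentroidSketch` is hypothetical, only the Euclidean metric is shown, and the way the scaling is applied (dividing each feature difference by the class's standard deviation) is an assumption made for illustration.

```python
import numpy as np


class ScaledCentroidSketch:
    """Nearest-centroid classifier with per-class, per-feature scaling."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid and one standard deviation vector per class
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        stds = np.array([X[y == c].std(axis=0, ddof=1) for c in self.classes_])
        # Guard against division by zero for constant features
        self.stds_ = np.where(stds == 0, 1.0, stds)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Shape (n_samples, n_classes, n_features): scaled offsets to each centroid
        z = (X[:, None, :] - self.centroids_[None, :, :]) / self.stds_[None, :, :]
        dist = np.sqrt((z**2).sum(axis=2))
        return self.classes_[dist.argmin(axis=1)]


# Two well-separated toy classes
X_toy = [[0, 0], [1, 1], [0, 1], [1, 0], [10, 10], [11, 11], [10, 11], [11, 10]]
y_toy = ["a"] * 4 + ["b"] * 4
toy_clf = ScaledCentroidSketch().fit(X_toy, y_toy)
print(toy_clf.predict([[0.5, 0.5], [10.5, 10.5]]))
```

Scaling by the class spread means a point that is far from a centroid in absolute terms can still be "close" to a class that is naturally broad along that feature.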

In [None]:
clf = ...
clf.fit(...)

In [None]:
# @title
y_pred = clf.predict(X_test, metric="euclidean")
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
f1 = f1_score(y_true=y_test, y_pred=y_pred, average="macro")

print(f"Accuracy = {acc:.3f}")
print(f"F1 = {f1:.3f}")

In [None]:
quantile_scores_df, best_metrics_per_quantile, group_bins = dcpy.classifier.find_best_metrics(
    clf, X_train, y_train, feat_idx=0, n_quantiles=6, random_state=seed
)

### 4. EnsembleDistanceClassifier

The EnsembleDistanceClassifier splits the training set into multiple quantiles based on a feature (```feat_idx```), iterates over all metrics to see which one performs best on a validation set within each quantile, and then builds an ensemble that uses the best-performing metric for each quantile.
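
The procedure described above can be sketched roughly as follows. Everything here is a simplified stand-in rather than the `distclassipy` implementation: the class `QuantileEnsembleSketch`, the three hand-written metrics, the unscaled centroids, and the validation split are all assumptions chosen to keep the example short.

```python
import numpy as np

# A small, hand-written set of metrics acting on difference vectors
METRICS = {
    "euclidean": lambda d: np.sqrt((d**2).sum(axis=-1)),
    "cityblock": lambda d: np.abs(d).sum(axis=-1),
    "chebyshev": lambda d: np.abs(d).max(axis=-1),
}


def centroid_predict(X, classes, centroids, metric):
    """Assign each row of X to the class with the nearest centroid."""
    diff = X[:, None, :] - centroids[None, :, :]
    return classes[METRICS[metric](diff).argmin(axis=1)]


class QuantileEnsembleSketch:
    def __init__(self, feat_idx=0, n_quantiles=4, val_frac=0.25, seed=0):
        self.feat_idx = feat_idx
        self.n_quantiles = n_quantiles
        self.val_frac = val_frac
        self.seed = seed

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Hold out a validation subset of the training data
        idx = np.random.default_rng(self.seed).permutation(len(X))
        n_val = int(len(X) * self.val_frac)
        val, tr = idx[:n_val], idx[n_val:]
        self.classes_ = np.unique(y[tr])
        self.centroids_ = np.array(
            [X[tr][y[tr] == c].mean(axis=0) for c in self.classes_]
        )
        # Interior quantile edges of the chosen feature define the bins
        qs = np.linspace(0, 1, self.n_quantiles + 1)[1:-1]
        self.edges_ = np.quantile(X[tr, self.feat_idx], qs)
        bins = np.digitize(X[val, self.feat_idx], self.edges_)
        # For each bin, keep the metric with the best validation accuracy
        self.best_ = []
        for b in range(self.n_quantiles):
            mask = bins == b
            scores = {}
            for name in METRICS:
                if mask.any():
                    pred = centroid_predict(
                        X[val][mask], self.classes_, self.centroids_, name
                    )
                    scores[name] = np.mean(pred == y[val][mask])
                else:
                    scores[name] = 0.0  # empty bin: no evidence for any metric
            self.best_.append(max(scores, key=scores.get))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        bins = np.digitize(X[:, self.feat_idx], self.edges_)
        out = np.empty(len(X), dtype=self.classes_.dtype)
        for b, name in enumerate(self.best_):
            mask = bins == b
            if mask.any():
                out[mask] = centroid_predict(
                    X[mask], self.classes_, self.centroids_, name
                )
        return out
```

The point of binning by a feature is that different regions of feature space may be better served by different metrics, so a single global choice can leave accuracy on the table.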

In [None]:
ensemble_clf = ...
ensemble_clf.fit(...)

In [None]:
# @title
y_pred_ensemble = ensemble_clf.predict(X_test)
acc = accuracy_score(y_true=y_test, y_pred=y_pred_ensemble)
f1 = f1_score(y_true=y_test, y_pred=y_pred_ensemble, average="macro")

print(f"Accuracy = {acc:.3f}")
print(f"F1 = {f1:.3f}")