# Basic Usage of reLAISS
### Authors: Evan Reynolds and Alex Gagliano

## Introduction

reLAISS is the second version of LAISS (Lightcurve Anomaly Identification & Similarity Search), a tool that finds similar supernovae and identifies anomalous ones (along with the galaxies that host them) using their photometric features.

The similarity search takes advantage of [Approximate Nearest Neighbors Oh Yeah (ANNOY)](https://github.com/spotify/annoy), the approximate nearest neighbors library developed by Spotify so that a relevant next song is ready before your current one ends. The anomaly detection classifier is an isolation forest model trained on a dataset bank of over 22,000 transients.
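
To make the idea of a similarity search concrete, here is a minimal sketch of the exact (brute-force) nearest neighbor lookup that an approximate index such as ANNOY speeds up. It is not reLAISS code: the `nearest_neighbors` helper, the toy two-feature bank, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math

def nearest_neighbors(query, bank, k):
    """Return the k bank entries closest to `query` as (name, distance) pairs."""
    distances = [
        (name, math.dist(query, features)) for name, features in bank.items()
    ]
    # Sort by distance and keep the k smallest.
    return sorted(distances, key=lambda pair: pair[1])[:k]

# Hypothetical bank: each transient is reduced to two made-up features.
toy_bank = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 2.0),
    "D": (5.0, 5.0),
}

closest = nearest_neighbors((0.1, 0.1), toy_bank, k=2)
print(closest)
```

An exact search like this compares the query against every entry, which becomes slow for large banks and many features. An approximate index trades a small amount of accuracy for much faster lookups.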

This notebook demonstrates the basic features of the reLAISS library for finding similar astronomical transients.

## Topics Covered
1. Initializing the ReLAISS client
2. Loading reference data
3. Finding optimal number of neighbors
4. Basic nearest neighbor search
5. Using Monte Carlo simulations and feature weighting
6. Basic anomaly detection

## Setup

First, let's import the required packages and create the necessary output directories:

In [None]:
import os
import pandas as pd
import relaiss

# Create output directories
os.makedirs('./figures', exist_ok=True)
os.makedirs('./sfddata-master', exist_ok=True)
os.makedirs('./models', exist_ok=True)
os.makedirs('./timeseries', exist_ok=True)

## 1. Initialize the ReLAISS Client

First, we create an instance of the ReLAISS client that we'll use to find similar transients.

In [None]:
# Create ReLAISS client
client = relaiss.ReLAISS()

## 2. Load Reference Data

Next, we load the reference dataset bank. This contains the features of known transients that we'll use for comparison.

The `load_reference` method automatically downloads the SFD dust map files if they are not already in the specified directory. These files are required for extinction corrections in the reLAISS pipeline.

In [None]:
# Load reference data
client.load_reference(
    path_to_sfd_folder='./sfddata-master',  # Directory for SFD dust maps
    use_pca=False,  # Don't use PCA for this example
    host_features=[]  # Empty list for this example
)

## 3. Finding the Optimal Number of Neighbors

Before doing a full neighbor search, we can use reLAISS to suggest an optimal number of neighbors based on the distance distribution. This helps avoid arbitrary choices for the number of neighbors to return.

First, let's run a search with a larger number of neighbors and set `suggest_neighbor_num=True`. This will show us a distance plot that helps identify a reasonable cutoff point for similar objects.
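
For intuition about what a "reasonable cutoff" on a distance curve can mean, the sketch below picks the elbow as the point farthest from the straight line joining the first and last distances. This is one common heuristic, assumed here for illustration; it is not necessarily the rule reLAISS applies, and `suggest_neighbor_count` is a hypothetical helper.

```python
def suggest_neighbor_count(distances):
    """Return how many neighbors lie at or before the elbow of a sorted distance curve."""
    n = len(distances)
    if n < 3:
        return n
    y0, y1 = distances[0], distances[-1]
    x1 = n - 1

    def gap(i):
        # Proportional to the perpendicular distance from point i
        # to the chord joining the first and last points.
        return abs((y1 - y0) * i - x1 * (distances[i] - y0))

    elbow = max(range(n), key=gap)
    return elbow + 1  # convert a zero-based index into a count

# Made-up distances: four close matches, then a sharp rise.
toy_distances = [1.0, 1.1, 1.2, 1.3, 3.0, 5.0, 7.0]
print(suggest_neighbor_count(toy_distances))
```

In this toy curve the first four neighbors sit close together before the distances climb, so the heuristic suggests keeping four.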

In [None]:
# Find optimal number of neighbors
client.find_neighbors(
    ztf_object_id='ZTF21aaublej',  # ZTF ID to find neighbors for
    n=40,  # Search in a larger pool
    suggest_neighbor_num=True,  # Only suggest optimal number, don't return neighbors
    plot=True,  # Show the distance elbow plot
    save_figures=True,  # Save plots to disk
    path_to_figure_directory='./figures'
)

## 4. Basic Nearest Neighbor Search

Now we can find the most similar transients to a given ZTF object. Let's use ZTF21aaublej as an example.

The `find_neighbors` function allows you to:
- Specify the number of neighbors to return
- Set a maximum distance threshold
- Adjust the weight of lightcurve features relative to host features
- Generate diagnostic plots

Based on the distance curve we saw earlier, we'll choose to return 5 neighbors.

In [None]:
# Find nearest neighbors
neighbors_df = client.find_neighbors(
    ztf_object_id='ZTF21aaublej',  # ZTF ID to find neighbors for
    n=5,  # Number of neighbors to return
    suggest_neighbor_num=False,  # Return actual neighbors
    plot=True,  # Generate diagnostic plots
    save_figures=True,  # Save plots to disk
    path_to_figure_directory='./figures'
)

# Display the results
print("\nNearest Neighbors:")
print(neighbors_df)

## 5. Using Monte Carlo Simulations and Feature Weighting

reLAISS allows you to adjust the relative importance of lightcurve features compared to host galaxy features using the `weight_lc_feats_factor` parameter. A value greater than 1.0 will make lightcurve features more important in the similarity search.
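
One way to picture the effect of such a factor is to scale the lightcurve components before computing a distance. The sketch below does exactly that on made-up two-component vectors. How reLAISS applies `weight_lc_feats_factor` internally is an assumption here, and `weighted_distance` is a hypothetical helper.

```python
import math

def weighted_distance(a, b, n_lc, weight_lc):
    """Euclidean distance with the first n_lc (lightcurve) components scaled by weight_lc."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        scale = weight_lc if i < n_lc else 1.0
        total += (scale * (x - y)) ** 2
    return math.sqrt(total)

# Each vector is (lightcurve feature, host feature); values are made up.
query = (0.0, 0.0)
lc_match = (0.2, 1.0)    # similar lightcurve, different host
host_match = (0.6, 0.1)  # different lightcurve, similar host

for w in (1.0, 3.0):
    d_lc = weighted_distance(query, lc_match, n_lc=1, weight_lc=w)
    d_host = weighted_distance(query, host_match, n_lc=1, weight_lc=w)
    print(w, "lc_match" if d_lc < d_host else "host_match")
```

With equal weights the host-like candidate is closer. Tripling the lightcurve weight makes the lightcurve-like candidate the nearest neighbor instead.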

The Monte Carlo simulation functionality (`num_sims` parameter) helps account for measurement uncertainties by running multiple simulations with perturbed feature values.
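
The general idea can be sketched as follows: draw perturbed copies of the query features from their uncertainties, search with each copy, and count how often each candidate appears. This is a simplified illustration, not the reLAISS implementation. The Gaussian error model, the vote-counting aggregation, and the `monte_carlo_neighbors` helper are all assumptions.

```python
import math
import random
from collections import Counter

def monte_carlo_neighbors(query, errors, bank, k, num_sims, seed=0):
    """Rank bank entries by how often they land in the top k across perturbed queries."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_sims):
        # Perturb each feature by a Gaussian draw scaled to its uncertainty.
        draw = [rng.gauss(value, sigma) for value, sigma in zip(query, errors)]
        ranked = sorted(bank, key=lambda name: math.dist(draw, bank[name]))
        votes.update(ranked[:k])
    return votes.most_common(k)

# Made-up bank and query with per-feature uncertainties.
mc_bank = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (10.0, 10.0)}
mc_result = monte_carlo_neighbors(
    query=(0.3, 0.0), errors=(0.2, 0.2), bank=mc_bank, k=1, num_sims=200
)
print(mc_result)
```

A candidate that wins in most simulations is a robust match. One that wins only occasionally owes its ranking to noise in the measured features.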

If you find that your matches aren't quite what you're looking for, you can try:
- Using Monte Carlo simulations to account for feature measurement uncertainties
- Upweighting lightcurve features to focus more on the transient's photometric properties than its host
- Removing host features entirely for a "lightcurve-only" search
- Removing lightcurve features for a "host-only" search

Let's try using Monte Carlo simulations with upweighted lightcurve features:

In [None]:
# Using Monte Carlo simulations and feature weighting
neighbors_df = client.find_neighbors(
    ztf_object_id='ZTF21aaublej',  # Using the test transient
    n=5,
    num_sims=20,  # Number of Monte Carlo simulations
    weight_lc_feats_factor=3.0,  # Up-weight lightcurve features
    plot=True,
    save_figures=True,
    path_to_figure_directory='./figures'
)

print("\nNearest neighbors with Monte Carlo simulations:")
print(neighbors_df)

## 6. Basic Anomaly Detection

reLAISS also includes tools for anomaly detection that can help identify unusual transients. The anomaly detection module uses an Isolation Forest algorithm to identify outliers in the feature space.
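
As a rough illustration of the isolation forest idea, the sketch below fits scikit-learn's `IsolationForest` to synthetic data and scores two test points. Whether reLAISS builds its model in this way is an assumption. The hyperparameter values simply mirror those used later in this notebook, and the four-feature Gaussian bank is made up.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "dataset bank": 1000 objects with 4 standardized features each.
synthetic_bank = rng.normal(size=(1000, 4))

forest = IsolationForest(
    n_estimators=100,    # number of trees
    contamination=0.02,  # expected fraction of anomalies in the bank
    max_samples=256,     # samples drawn to build each tree
    random_state=0,
)
forest.fit(synthetic_bank)

typical = np.zeros((1, 4))      # sits at the center of the distribution
unusual = np.full((1, 4), 8.0)  # far from every object in the bank

# predict() returns +1 for inliers and -1 for outliers.
print(forest.predict(typical), forest.predict(unusual))
```

Points that can be isolated with only a few random splits receive low scores and are labeled as outliers.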

The anomaly detection process produces plots showing the lightcurve of the input transient and the probability, as a function of time, that the transient is anomalous. If the probability exceeds 50% at any epoch, the transient is flagged as anomalous.
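
The flagging rule itself is simple enough to sketch directly. The helpers below are hypothetical and operate on made-up epochs and probabilities; they only illustrate the "exceeds 50% at any epoch" criterion described above.

```python
def is_flagged_anomalous(probabilities, threshold=50.0):
    """Return True if the anomaly probability (in percent) exceeds threshold at any epoch."""
    return any(p > threshold for p in probabilities)

def first_flagged_epoch(times, probabilities, threshold=50.0):
    """Return the first epoch at which the threshold is exceeded, or None if it never is."""
    for t, p in zip(times, probabilities):
        if p > threshold:
            return t
    return None

# Made-up epochs (days) and anomaly probabilities (percent).
epochs = [0.0, 3.0, 6.0, 9.0]
probs = [12.0, 35.0, 61.0, 48.0]

print(is_flagged_anomalous(probs), first_flagged_epoch(epochs, probs))
```

Note that a single epoch above the threshold is enough to flag the transient, even if the probability later drops back below 50%.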

### Training an Anomaly Detection Model

First, let's train an anomaly detection model on our dataset bank:

In [None]:
from relaiss.anomaly import train_AD_model

# Train the anomaly detection model
model_path = train_AD_model(
    lc_features=client.lc_features,
    host_features=client.host_features,
    path_to_dataset_bank=client.bank_csv,
    path_to_sfd_folder='./sfddata-master',
    path_to_models_directory="./models",
    n_estimators=100,  # Using a smaller value for faster execution
    contamination=0.02,  # Expected proportion of anomalies
    max_samples=256,  # Maximum samples used for each tree
    force_retrain=False  # Only retrain if model doesn't exist
)

print(f"Model saved to: {model_path}")

### Running Anomaly Detection on a Transient

Now we can run anomaly detection on a specific transient to see if it's considered anomalous:

In [None]:
from relaiss.anomaly import anomaly_detection

# Run anomaly detection on a transient
anomaly_detection(
    transient_ztf_id="ZTF21aaublej",  # Use the same transient for this example
    lc_features=client.lc_features,
    host_features=client.host_features,
    path_to_timeseries_folder="./timeseries",
    path_to_sfd_folder='./sfddata-master',
    path_to_dataset_bank=client.bank_csv,
    path_to_models_directory="./models",
    path_to_figure_directory="./figures",
    save_figures=True,
    n_estimators=100,
    contamination=0.02,
    max_samples=256,
    force_retrain=False
)

print("Anomaly detection figures saved to ./figures/AD/")

## Next Steps

To explore more advanced features, check out the `advanced_usage.ipynb` notebook, which covers:
- Using PCA for dimensionality reduction
- Creating theorized lightcurves
- Swapping host galaxies
- Setting maximum neighbor distances
- Tweaking ANNOY parameters
- Making corner plots
- Advanced anomaly detection techniques