# Building a New Dataset Bank for reLAISS

This notebook explains how to create a new dataset bank for reLAISS using your own data. The process involves:

1. Preparing your data in the required format
2. Building the dataset bank with built-in preprocessing
3. Creating the ANNOY index for fast similarity search
4. Testing your new bank

## 1. Data Preparation

Your data should be organized into two CSV files:

1. Lightcurve data with columns:
   - ztf_object_id: ZTF ID of the transient
   - ant_mjd: Modified Julian Date of observations
   - ant_passband: Filter (g, r, i, z)
   - ant_mag: Magnitude
   - ant_magerr: Magnitude error

2. Host galaxy data with columns:
   - ztf_object_id: ZTF ID of the transient
   - ra, dec: Host galaxy coordinates
   - gKronMag, rKronMag, iKronMag, zKronMag: Host magnitudes
   - gKronMagErr, rKronMagErr, iKronMagErr, zKronMagErr: Magnitude errors
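
Before merging, it can help to confirm that both files carry the columns listed above. The helper below is a minimal sketch: `missing_columns` and the small in-memory frame are hypothetical names used for illustration and are not part of reLAISS.

```python
import pandas as pd

# Column names taken from the two lists above.
LIGHTCURVE_COLUMNS = ["ztf_object_id", "ant_mjd", "ant_passband", "ant_mag", "ant_magerr"]
HOST_COLUMNS = (
    ["ztf_object_id", "ra", "dec"]
    + [f"{band}KronMag" for band in "griz"]
    + [f"{band}KronMagErr" for band in "griz"]
)


def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Tiny illustrative frame that lacks the magnitude error column.
example_lc = pd.DataFrame(
    {
        "ztf_object_id": ["ZTF21aaublej"],
        "ant_mjd": [59300.5],
        "ant_passband": ["g"],
        "ant_mag": [18.2],
    }
)
print(missing_columns(example_lc, LIGHTCURVE_COLUMNS))  # ['ant_magerr']
```

Running the same check on your real `lc_df` and `host_df` right after `pd.read_csv` surfaces a misnamed column before it reaches the preprocessing step.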

In [None]:
import pandas as pd
import numpy as np
from relaiss import build_dataset_bank

# Load your data
lc_df = pd.read_csv('path/to/your/lightcurves.csv')
host_df = pd.read_csv('path/to/your/hosts.csv')

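# Merge lightcurve and host data (inner join on the ZTF ID)
raw_df_bank = pd.merge(lc_df, host_df, on='ztf_object_id')
# Count unique IDs: the merged frame may hold several rows per transient
print(f"Loaded {raw_df_bank['ztf_object_id'].nunique()} transients ({len(raw_df_bank)} rows)")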
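
Note that `pd.merge` performs an inner join by default, so any transient without a host entry is dropped without warning. The following sketch, built on hypothetical miniature frames (`lc_demo`, `host_demo`, and the `ZTF_A`/`ZTF_B` IDs are made up), shows one way to list such transients using the `how` and `indicator` arguments of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature inputs: two light-curve points for one transient,
# and one point for a second transient that has no host entry.
lc_demo = pd.DataFrame(
    {
        "ztf_object_id": ["ZTF_A", "ZTF_A", "ZTF_B"],
        "ant_mjd": [59000.1, 59003.2, 59001.7],
        "ant_mag": [19.1, 18.7, 20.3],
    }
)
host_demo = pd.DataFrame({"ztf_object_id": ["ZTF_A"], "rKronMag": [17.4]})

# how="left" keeps every light-curve row; indicator=True adds a "_merge"
# column recording whether a host match was found for that row.
checked = pd.merge(lc_demo, host_demo, on="ztf_object_id", how="left", indicator=True)

unmatched_ids = sorted(
    checked.loc[checked["_merge"] == "left_only", "ztf_object_id"].unique()
)
n_rows = len(checked)
n_transients = checked["ztf_object_id"].nunique()

print(f"{n_rows} rows, {n_transients} transients, no host for: {unmatched_ids}")
```

If `unmatched_ids` is non-empty for your data, decide whether those transients should be excluded or whether the host table needs to be completed before building the bank.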

## 2. Building the Dataset Bank

The `build_dataset_bank` function performs several preprocessing steps:

1. **Missing Value Handling**:
   - Replaces infinite values and -999 with NaN
   - Uses KNN imputation for missing values
   - Drops rows with NaN values after imputation

2. **Dust Correction**:
   - Uses SFD dust maps to correct host galaxy magnitudes
   - Creates dust-corrected magnitude columns

3. **Feature Engineering**:
   - Creates color indices (g-r, r-i, i-z)
   - Calculates color uncertainties
   - Adds additional features for similarity search
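
To make steps 1 and 3 concrete, here is a rough sketch of sentinel replacement and a single color index in plain pandas. It is not the code inside `build_dataset_bank`: the output column names `gminusr` and `gminusr_err` are hypothetical, the uncertainty assumes independent magnitude errors, and KNN imputation and dust correction are left out.

```python
import numpy as np
import pandas as pd

# Hypothetical host photometry; -999 and inf stand in for missing measurements.
hosts = pd.DataFrame(
    {
        "gKronMag": [19.8, -999.0, 21.0],
        "rKronMag": [19.1, 18.5, np.inf],
        "gKronMagErr": [0.03, 0.05, 0.10],
        "rKronMagErr": [0.04, 0.02, 0.08],
    }
)

# Step 1: turn sentinel and infinite values into NaN so they can be handled uniformly.
cleaned = hosts.replace([np.inf, -np.inf, -999.0], np.nan)

# Step 3: a color index is a magnitude difference; assuming independent errors,
# its uncertainty is the quadrature sum of the two magnitude errors.
cleaned["gminusr"] = cleaned["gKronMag"] - cleaned["rKronMag"]
cleaned["gminusr_err"] = np.sqrt(
    cleaned["gKronMagErr"] ** 2 + cleaned["rKronMagErr"] ** 2
)

# Without imputation, rows that still contain NaN are simply dropped here.
complete = cleaned.dropna()
print(complete[["gminusr", "gminusr_err"]])
```

In this toy frame only the first row survives, because the other two each have a missing magnitude; in the real pipeline, imputation would fill such gaps before any rows are dropped.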

In [None]:
# Build the dataset bank
bank_df = build_dataset_bank(
    raw_df_bank=raw_df_bank,
    path_to_sfd_folder='./sfddata-master',  # Path to SFD dust maps
    building_entire_df_bank=True  # Set to True when building a new bank
)

# Save the processed bank
output_path = './my_dataset_bank.csv'
bank_df.to_csv(output_path, index=False)
print(f"Dataset bank created at: {output_path}")

## 3. Creating the ANNOY Index

The similarity search runs on an ANNOY index built from the dataset bank. You do not need to build it yourself: the index is created automatically when you load your new bank into the ReLAISS client with `load_reference`.
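
For intuition about what the index speeds up, the sketch below performs the exact, brute-force neighbor search that an approximate index such as ANNOY stands in for. It uses plain NumPy rather than the ANNOY library; the Euclidean metric, the `exact_neighbors` helper, and the toy feature matrix are assumptions made for illustration and may differ from what reLAISS uses internally.

```python
import numpy as np


def exact_neighbors(features, query_index, n):
    """Return indices and Euclidean distances of the n rows nearest to the query row."""
    distances = np.linalg.norm(features - features[query_index], axis=1)
    order = np.argsort(distances)
    order = order[order != query_index][:n]  # leave out the query itself
    return order, distances[order]


# Hypothetical feature matrix: one row per transient, one column per feature.
features = np.array(
    [
        [0.0, 0.0],
        [1.0, 0.0],
        [0.0, 3.0],
        [5.0, 5.0],
    ]
)
idx, dist = exact_neighbors(features, query_index=0, n=2)
print(idx, dist)
```

This exact search compares the query against every row, so its cost grows linearly with the size of the bank. An approximate index trades a small amount of accuracy for much faster lookups on large banks.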

In [None]:
from relaiss import ReLAISS

# Initialize ReLAISS with your new bank
client = ReLAISS()
client.load_reference(
    bank_path=output_path,
    path_to_sfd_folder='./sfddata-master'
)

## 4. Testing Your Bank

Let's test your new dataset bank by finding neighbors for a known transient:

In [None]:
# Find neighbors for a test transient
test_ztf_id = 'ZTF21aaublej'  # Replace with a ZTF ID from your bank
neighbors = client.find_neighbors(
    ztf_object_id=test_ztf_id,
    n=5,  # Number of neighbors to find
    plot=True  # Generate diagnostic plots
)

print("\nFound neighbors:")
print(neighbors[['neighbor_num', 'ztf_link', 'dist', 'spec_cls', 'z']])

## Summary

You have now:
1. Prepared your data in the required format
2. Built a dataset bank with built-in preprocessing
3. Created an ANNOY index for fast similarity search
4. Tested your bank by finding similar transients

Your new dataset bank is ready to use with reLAISS!