# Building a New Dataset Bank for reLAISS
### Authors: Evan Reynolds and Alex Gagliano

## Introduction

This notebook demonstrates how to build a new dataset bank for reLAISS and use different feature combinations for nearest neighbor searches. The dataset bank is the foundation of reLAISS, containing all the features of transients that are used for similarity searches and anomaly detection.

Building your own dataset bank allows you to incorporate new data, apply custom preprocessing steps, and tailor the feature set to your specific research needs.

## Topics Covered
1. Adding extinction corrections (A_V)
2. Joining new lightcurve features
3. Handling missing values
4. Building the final dataset bank
5. Using different feature combinations for nearest neighbor search:
   - Lightcurve-only features
   - Host-only features
   - Custom feature subsets
   - Feature weighting

## Setup

First, let's import the necessary libraries and create the required directories:

```python
import os
import pandas as pd
import numpy as np
from relaiss import constants
import relaiss as rl

# Create necessary directories
os.makedirs('./figures', exist_ok=True)
os.makedirs('./sfddata-master', exist_ok=True)

# Define default feature sets from constants
default_lc_features = constants.lc_features_const.copy()
default_host_features = constants.host_features_const.copy()

# Initialize client
client = rl.ReLAISS()
client.load_reference(
    path_to_sfd_folder='./sfddata-master'
)
```

```
Preprocessing reference bank...
Loading preprocessed features from cache...
Caching preprocessed reference bank...
Building search index...
Loading previously saved ANNOY index...
Done!

Loaded index with 25515 items
```


## 1. Adding Extinction Corrections (A_V)

The first step in building a dataset bank is to add extinction corrections to account for interstellar dust. The Schlegel, Finkbeiner & Davis (SFD) dust maps are used to estimate the amount of extinction.

```python
# Example code for adding extinction corrections
from sfdmap2 import sfdmap

df = pd.read_csv("../data/large_df_bank.csv")
m = sfdmap.SFDMap('../data/sfddata-master')
RV = 3.1  # Standard value for Milky Way
ebv = m.ebv(df['ra'].values, df['dec'].values)
df['A_V'] = RV * ebv
df.to_csv("../data/large_df_bank_wAV.csv", index=False)
```

This adds the `A_V` column, the V-band extinction computed as `A_V = R_V * E(B-V)` with `R_V = 3.1`, to your dataset. It is used later in the feature processing pipeline.
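To make the conversion concrete, here is a minimal sketch of the `A_V = R_V * E(B-V)` step on hand-written reddening values. The `E(B-V)` numbers and object IDs are illustrative assumptions, not values read from the SFD maps, so no dust-map files are needed to run it.

```python
import numpy as np
import pandas as pd

# Hypothetical E(B-V) reddening values; in the real workflow these
# come from the SFD map lookup at each object's (ra, dec).
toy = pd.DataFrame({
    "ztf_object_id": ["obj_a", "obj_b", "obj_c"],  # placeholder IDs
    "ebv": [0.02, 0.10, 0.35],
})

RV = 3.1  # same Milky Way value used in the block above
toy["A_V"] = RV * toy["ebv"]

print(toy)
```

Because `A_V` scales linearly with `E(B-V)`, sightlines through more dust receive a proportionally larger correction.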

## 2. Joining New Lightcurve Features

If you have additional features in a separate dataset, you can merge them with your existing bank:

```python
# Example code for joining features
df_large = pd.read_csv("../data/large_df_bank_wAV.csv")
df_small = pd.read_csv("../data/small_df_bank_re_laiss.csv")

key = 'ztf_object_id'
extra_features = [col for col in df_large.columns if col not in df_small.columns]

merged_df = df_small.merge(df_large[[key] + extra_features], on=key, how='left')

lc_feature_names = constants.lc_features_const.copy()
host_feature_names = constants.host_features_const.copy()

small_final_df = merged_df.replace([np.inf, -np.inf, -999], np.nan).dropna(subset=lc_feature_names + host_feature_names)

small_final_df.to_csv("../data/small_hydrated_df_bank_re_laiss.csv", index=False)
```

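The left join keeps every row of the smaller bank and adds only the columns it lacks from the larger bank, matched on `ztf_object_id`. Rows that contain infinite values, the placeholder value `-999`, or missing values in any lightcurve or host feature are then dropped before saving.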
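The behavior of the left join is easiest to see on a toy example. The sketch below uses two hypothetical miniature banks; every column name other than `ztf_object_id`, and all IDs and values, are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature banks (illustrative names and values only)
small = pd.DataFrame({
    "ztf_object_id": ["obj_a", "obj_b", "obj_c"],
    "feat_1": [1.0, 2.0, 3.0],
})
large = pd.DataFrame({
    "ztf_object_id": ["obj_a", "obj_b", "obj_d"],
    "feat_1": [1.0, 2.0, 4.0],
    "feat_2": [10.0, 20.0, 40.0],
})

key = "ztf_object_id"
# Columns present in the large bank but absent from the small one
extra = [col for col in large.columns if col not in small.columns]

# Left join: keep all rows of `small`, pull `extra` columns from `large`
merged = small.merge(large[[key] + extra], on=key, how="left")

# obj_c has no match in `large`, so its feat_2 is NaN and dropna removes it
hydrated = merged.dropna(subset=["feat_1", "feat_2"])

print(merged)
print(hydrated)
```

Note that `obj_d`, which exists only in the larger bank, never enters the result, and `obj_c` is lost at the `dropna` step. Comparing `len(merged)` with `len(hydrated)` is a quick way to see how many objects the merge failed to hydrate.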

## 3. Handling Missing Values

Missing values in the dataset can cause problems during analysis. reLAISS uses KNN imputation to fill in missing values:

```python
# Example code for handling missing values
from sklearn.impute import KNNImputer

lc_feature_names = constants.lc_features_const.copy()
raw_host_feature_names = constants.raw_host_features_const.copy()
impute_cols = lc_feature_names + raw_host_feature_names

raw_dataset_bank = pd.read_csv('../data/large_df_bank_wAV.csv')

# Fit the imputer on the feature columns and fill in the missing values
X = raw_dataset_bank[impute_cols]
feat_imputer = KNNImputer(weights='distance').fit(X)
imputed_filt_arr = feat_imputer.transform(X)

# Write the imputed values back, keeping the original row index aligned
imputed_df = pd.DataFrame(imputed_filt_arr, columns=impute_cols, index=raw_dataset_bank.index)
raw_dataset_bank[impute_cols] = imputed_df

imputed_df_bank = raw_dataset_bank
```

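KNN imputation fills each missing value from the k nearest samples in feature space that have that feature observed (scikit-learn's `KNNImputer` defaults to `n_neighbors=5`). With `weights='distance'`, closer neighbors contribute more to the filled-in value than distant ones.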
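As a small illustration of the mechanism, the sketch below runs scikit-learn's `KNNImputer` on a hand-written 4x3 array. The array and the choice of `n_neighbors=2` are assumptions made to keep the example readable; they are not the settings or data used for the reLAISS bank.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with two missing entries (illustrative values)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Uniform weights: each missing entry becomes the plain mean of the
# two nearest rows that have that feature observed.
X_uniform = KNNImputer(n_neighbors=2).fit_transform(X)

# Distance weights (as in the block above): nearer rows count for more.
X_weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(X_uniform)
print(X_weighted)
```

With uniform weights, the missing entry in the first row is filled with 4.0 (the mean of 3.0 and 5.0 from its two nearest rows) and the one in the third row with 5.5. With distance weighting, the first-row value moves toward 3.0 because that neighbor is closer.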

## 4. Building the Final Dataset Bank

With all the preprocessing done, we can now build the final dataset bank using the `build_dataset_bank` function from reLAISS:

```python
# Example code for building the final dataset bank
from relaiss.features import build_dataset_bank

dataset_bank = build_dataset_bank(
    raw_df_bank=imputed_df_bank,
    av_in_raw_df_bank=True,
    path_to_sfd_folder="../data/sfddata-master",
    building_entire_df_bank=True
)

# Clean and save final dataset
final_dataset_bank = dataset_bank.replace(
    [np.inf, -np.inf, -999], np.nan
).dropna(subset=lc_feature_names + host_feature_names)

final_dataset_bank.to_csv('../data/large_final_df_bank_new_lc_feats.csv', index=False)
```

This function applies additional processing to prepare the features for reLAISS, including normalization and other transformations.
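The cleaning step applied to the output of `build_dataset_bank` can be checked in isolation. The following sketch reproduces the same `replace` and `dropna` pattern on a toy table; the feature names, IDs, and values are hypothetical, and it makes no assumptions about what `build_dataset_bank` does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and values, for illustration only
feature_cols = ["feat_1", "feat_2"]
bank = pd.DataFrame({
    "ztf_object_id": ["obj_a", "obj_b", "obj_c", "obj_d"],
    "feat_1": [0.5, np.inf, 1.5, 2.0],
    "feat_2": [1.0, 2.0, -999.0, 4.0],
})

# Treat infinities and the -999 placeholder as missing, then drop any
# row that is missing one of the required features.
cleaned = bank.replace([np.inf, -np.inf, -999], np.nan).dropna(subset=feature_cols)

n_dropped = len(bank) - len(cleaned)
print(f"Dropped {n_dropped} of {len(bank)} rows")
```

Reporting the number of dropped rows before writing the CSV makes it easy to notice when a new feature column removes far more objects than expected.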

## Conclusion

Building your own dataset bank and customizing feature combinations provides powerful flexibility for tailoring reLAISS to your specific research questions. By selecting different feature combinations and adjusting feature weights, you can explore various aspects of transient similarity and discover new insights about the transient population.

The process involves several steps:
1. Preprocessing your data with extinction corrections
2. Merging additional features if needed
3. Handling missing values through imputation
4. Building the final dataset bank
5. Customizing feature sets for different analysis goals

These capabilities make reLAISS a versatile tool across a wide range of research applications.