# Building a New Dataset Bank for reLAISS
### Authors: Evan Reynolds and Alex Gagliano

## Introduction

This notebook demonstrates how to build a new dataset bank for reLAISS and how to run nearest-neighbor searches with different feature combinations. The dataset bank is the foundation of reLAISS: it holds the transient features used for similarity searches and anomaly detection.

Building your own dataset bank allows you to incorporate new data, apply custom preprocessing steps, and tailor the feature set to your specific research needs.

## Topics Covered
1. Adding extinction corrections (A_V)
2. Joining new lightcurve features
3. Handling missing values
4. Building the final dataset bank
5. Using different feature combinations for nearest neighbor search:
   - Lightcurve-only features
   - Host-only features
   - Custom feature subsets
   - Feature weighting

## Setup

First, let's import the necessary libraries and create the required directories:

In [None]:
import os
import pandas as pd
import numpy as np
from relaiss import constants
import relaiss as rl

# Create necessary directories
os.makedirs('./figures', exist_ok=True)
os.makedirs('./sfddata-master', exist_ok=True)

# Define default feature sets from constants
default_lc_features = constants.lc_features_const.copy()
default_host_features = constants.host_features_const.copy()

# Initialize client
client = rl.ReLAISS()
client.load_reference(
    path_to_sfd_folder='./sfddata-master'
)

## 1. Adding Extinction Corrections (A_V)

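The first step in building a dataset bank is to add an extinction correction that accounts for Milky Way dust along each line of sight. The Schlegel, Finkbeiner & Davis (SFD) dust maps give the reddening E(B-V) at each transient's coordinates, which is converted to V-band extinction with A_V = R_V × E(B-V).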

```python
# Example code for adding extinction corrections
from sfdmap2 import sfdmap

df = pd.read_csv("../data/large_df_bank.csv")
m = sfdmap.SFDMap('../data/sfddata-master')
RV = 3.1  # Standard value for Milky Way
ebv = m.ebv(df['ra'].values, df['dec'].values)
df['A_V'] = RV * ebv
df.to_csv("../data/large_df_bank_wAV.csv", index=False)
```

This adds the A_V (extinction in V-band) column to your dataset, which will be used later in the feature processing pipeline.
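
The conversion is simple arithmetic, A_V = R_V × E(B-V), so it can be checked without the dust-map files. The sketch below uses made-up E(B-V) values in place of the output of `m.ebv(...)`; the helper name `add_av_column` and the sample numbers are illustrative assumptions, not part of reLAISS.

```python
import numpy as np
import pandas as pd


def add_av_column(df, ebv, rv=3.1):
    """Return a copy of df with A_V = rv * E(B-V) added (hypothetical helper)."""
    ebv = np.asarray(ebv, dtype=float)
    if ebv.shape != (len(df),):
        raise ValueError("need exactly one E(B-V) value per row")
    if np.any(ebv < 0):
        raise ValueError("E(B-V) must be non-negative")
    out = df.copy()
    out["A_V"] = rv * ebv
    return out


# Made-up coordinates and reddening values, standing in for m.ebv(ra, dec)
toy = pd.DataFrame({"ra": [10.0, 150.5], "dec": [-5.0, 32.1]})
toy_ebv = [0.02, 0.10]

toy_with_av = add_av_column(toy, toy_ebv)
print(toy_with_av)
```

Returning a copy keeps the input table unchanged, which makes it easy to compare the bank before and after the correction.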

## 2. Joining New Lightcurve Features

If you have additional features in a separate dataset, you can merge them with your existing bank:

```python
# Example code for joining features
df_large = pd.read_csv("../data/large_df_bank_wAV.csv")
df_small = pd.read_csv("../data/small_df_bank_re_laiss.csv")

key = 'ztf_object_id'
extra_features = [col for col in df_large.columns if col not in df_small.columns]

merged_df = df_small.merge(df_large[[key] + extra_features], on=key, how='left')

lc_feature_names = constants.lc_features_const.copy()
host_feature_names = constants.host_features_const.copy()

small_final_df = merged_df.replace([np.inf, -np.inf, -999], np.nan).dropna(subset=lc_feature_names + host_feature_names)

small_final_df.to_csv("../data/small_hydrated_df_bank_re_laiss.csv", index=False)
```

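The left join keeps every row of the smaller bank and adds only the columns that exist in the larger one, matched on `ztf_object_id`. Rows that still have infinite, `-999`, or missing values in any required lightcurve or host feature are then dropped before saving.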
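
To see what a left join does with unmatched rows, here is a toy sketch with two small, made-up tables. It adds `validate` and `indicator`, two standard `DataFrame.merge` options that the example above does not use; the column `new_feat` and the object IDs are invented for illustration.

```python
import pandas as pd

key = "ztf_object_id"

# Made-up stand-ins for the small and large banks
small = pd.DataFrame({key: ["ZTF_a", "ZTF_b", "ZTF_c"], "feat_1": [1.0, 2.0, 3.0]})
large = pd.DataFrame({
    key: ["ZTF_a", "ZTF_b", "ZTF_d"],
    "feat_1": [1.0, 2.0, 4.0],
    "new_feat": [0.1, 0.2, 0.4],
})

# Columns that exist only in the large table
extra = [c for c in large.columns if c not in small.columns]

merged = small.merge(
    large[[key] + extra],
    on=key,
    how="left",
    validate="one_to_one",  # raise if an ID is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Rows of the small table with no match keep NaN in the new columns
unmatched = merged.loc[merged["_merge"] == "left_only", key].tolist()
print(merged)
print("IDs without a match in the large table:", unmatched)
```

Checking the unmatched IDs before calling `dropna` shows how many objects would be lost purely because they are absent from the larger table.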

## 3. Handling Missing Values

Missing values in the dataset can cause problems during analysis. reLAISS uses KNN imputation to fill in missing values:

```python
# Example code for handling missing values
from sklearn.impute import KNNImputer

lc_feature_names = constants.lc_features_const.copy()
raw_host_feature_names = constants.raw_host_features_const.copy()
feature_cols = lc_feature_names + raw_host_feature_names

raw_dataset_bank = pd.read_csv('../data/large_df_bank_wAV.csv')

# Fit the imputer and fill missing entries in the feature columns
X = raw_dataset_bank[feature_cols]
imputed_filt_arr = KNNImputer(weights='distance').fit_transform(X)

# Write the imputed values back, keeping the original row index
raw_dataset_bank[feature_cols] = pd.DataFrame(
    imputed_filt_arr, columns=feature_cols, index=raw_dataset_bank.index
)

imputed_df_bank = raw_dataset_bank
```

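KNN imputation fills each missing value from the k samples that are closest in feature space among those that have that feature observed. With `weights='distance'`, closer neighbors contribute more to the filled-in value; k is left at the scikit-learn default of 5 here.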
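
As a toy illustration of the idea, the sketch below runs the same scikit-learn `KNNImputer` on a small made-up matrix. `n_neighbors=2` is chosen only to keep the example readable, and the numbers are invented.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up feature matrix: 4 objects, 2 features, one missing entry
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
])

imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X_toy)

print(X_filled)
```

The two nearest neighbors of the incomplete row are the rows with first-feature values 2.0 and 4.0, so the filled-in entry lies between their second-feature values. Observed entries are left untouched.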

## 4. Building the Final Dataset Bank

With all the preprocessing done, we can now build the final dataset bank using the `build_dataset_bank` function from reLAISS:

```python
# Example code for building the final dataset bank
from relaiss.features import build_dataset_bank

dataset_bank = build_dataset_bank(
    raw_df_bank=imputed_df_bank,
    av_in_raw_df_bank=True,
    path_to_sfd_folder="../data/sfddata-master",
    building_entire_df_bank=True
)

# Clean and save final dataset
final_dataset_bank = dataset_bank.replace(
    [np.inf, -np.inf, -999], np.nan
).dropna(subset=lc_feature_names + host_feature_names)

final_dataset_bank.to_csv('../data/large_final_df_bank_new_lc_feats.csv', index=False)
```

This function applies additional processing to prepare the features for reLAISS, including normalization and other transformations.
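
Before saving, it can help to count how many bad values the cleaning step will remove. The following is a minimal sketch on a made-up table; `summarize_bad_values`, the column names `feat_lc` and `feat_host`, and the data are hypothetical and only mirror the `replace(...).dropna(...)` pattern used above.

```python
import numpy as np
import pandas as pd


def summarize_bad_values(df, required_cols, sentinel=-999):
    """Count NaN, infinite, and sentinel entries per required column (hypothetical helper)."""
    sub = df[required_cols]
    return pd.DataFrame({
        "n_nan": sub.isna().sum(),
        "n_inf": np.isinf(sub).sum(),
        "n_sentinel": (sub == sentinel).sum(),
    })


# Made-up bank with one problem of each kind
toy_bank = pd.DataFrame({
    "ztf_object_id": ["ZTF_a", "ZTF_b", "ZTF_c", "ZTF_d"],
    "feat_lc": [0.5, np.inf, 0.7, 0.9],
    "feat_host": [1.2, 1.1, -999.0, np.nan],
})
required = ["feat_lc", "feat_host"]

summary = summarize_bad_values(toy_bank, required)
print(summary)

# Same cleaning pattern as above: map bad values to NaN, then drop those rows
clean = toy_bank.replace([np.inf, -np.inf, -999], np.nan).dropna(subset=required)
print(clean)
```

A summary like this makes it clear whether rows are being dropped because of one problematic feature or many.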

## 5. Using Different Feature Combinations

reLAISS allows you to customize which features are used for similarity search. This can be useful for studying the importance of different features and for tailoring the search to specific scientific questions.

### 5.1 Using Only Lightcurve Features

You can perform a search using only lightcurve features, ignoring host galaxy properties. This is useful when:
- You want to focus solely on the temporal evolution of the transient
- Host data might be unreliable or missing
- You're testing hypotheses about lightcurve-based classification

Here's how to set up a lightcurve-only search:

In [None]:
lc_only_client = rl.ReLAISS()
lc_only_client.load_reference(
    path_to_sfd_folder='./sfddata-master',
    lc_features=default_lc_features,  # Use default lightcurve features
    host_features=[],  # Empty list means no host features
)

# Find neighbors using only lightcurve features
neighbors_df_lc_only = lc_only_client.find_neighbors(
    ztf_object_id='ZTF21abbzjeq',
    n=5,
    plot=True,
    save_figures=True,
    path_to_figure_directory='./figures/lc_only'
)
print("\nNearest neighbors using only lightcurve features:")
print(neighbors_df_lc_only)

### 5.2 Using Only Host Features

Alternatively, you can perform a search using only host galaxy features, ignoring the lightcurve properties. This approach is valuable when:
- You're more interested in environmental effects on transients
- You want to find transients in similar host galaxies
- You're studying correlations between host properties and transient types

Here's how to set up a host-only search:

In [None]:
host_only_client = rl.ReLAISS()
host_only_client.load_reference(
    path_to_sfd_folder='./sfddata-master',
    lc_features=[],  # Empty list means no lightcurve features
    host_features=default_host_features,  # Use default host features
)

# Find neighbors using only host features
neighbors_df_host_only = host_only_client.find_neighbors(
    ztf_object_id='ZTF21abbzjeq',
    n=5,
    plot=True,
    save_figures=True,
    path_to_figure_directory='./figures/host_only'
)
print("\nNearest neighbors using only host features:")
print(neighbors_df_host_only)
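
One simple way to compare a lightcurve-only search with a host-only search is to measure how many neighbors they share. The sketch below computes a Jaccard overlap on plain lists of IDs. The IDs are placeholders rather than real search results, and extracting IDs from the returned DataFrames is left out because the column layout is not described here.

```python
def neighbor_overlap(ids_a, ids_b):
    """Jaccard overlap between two collections of neighbor IDs."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder IDs, not real search results
lc_only_ids = ["obj_1", "obj_2", "obj_3", "obj_4", "obj_5"]
host_only_ids = ["obj_4", "obj_5", "obj_6", "obj_7", "obj_8"]

overlap = neighbor_overlap(lc_only_ids, host_only_ids)
print(overlap)  # 2 shared IDs out of 8 distinct ones -> 0.25
```

A low overlap suggests the two feature sets pick out different notions of similarity for the queried transient.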

### 5.3 Using a Custom Feature Subset

You can also select specific features from both categories for a more targeted search. This allows you to:
- Focus on the features most relevant to your research question
- Reduce noise by excluding less useful features
- Test hypotheses about which features drive similarity

Here's how to create a custom feature subset:

In [None]:
# Select specific lightcurve and host features
custom_lc_features = ['g_peak_mag', 'r_peak_mag', 'g_peak_time', 'r_peak_time']
custom_host_features = ['host_ra', 'host_dec', 'gKronMag', 'rKronMag']

custom_client = rl.ReLAISS()
custom_client.load_reference(
    path_to_sfd_folder='./sfddata-master',
    lc_features=custom_lc_features,  # Custom subset of lightcurve features
    host_features=custom_host_features,  # Custom subset of host features
)

# Find neighbors with custom feature subset
neighbors_df_custom = custom_client.find_neighbors(
    ztf_object_id='ZTF21abbzjeq',
    n=5,
    plot=True,
    save_figures=True,
    path_to_figure_directory='./figures/custom_features'
)
print("\nNearest neighbors using custom feature subset:")
print(neighbors_df_custom)
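
A typo in a custom feature name is easy to make, so it can be worth checking the requested names against the available ones before loading the reference set. This is a minimal sketch: `check_feature_names` is a hypothetical helper, and `available_lc` is a short stand-in for `constants.lc_features_const` (the name `g_rise_time` is invented, and the real list may differ).

```python
def check_feature_names(requested, available):
    """Raise ValueError if any requested feature is not in the available list."""
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(f"Unknown feature names: {unknown}")
    return list(requested)


# Stand-in for constants.lc_features_const; the real list may differ
available_lc = ["g_peak_mag", "r_peak_mag", "g_peak_time", "r_peak_time", "g_rise_time"]

ok = check_feature_names(["g_peak_mag", "r_peak_mag"], available_lc)
print(ok)

# A misspelled name is reported instead of silently passed along
try:
    check_feature_names(["g_peak_mag", "g_peek_time"], available_lc)
except ValueError as err:
    print(err)
```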

### 5.4 Using Feature Weighting

You can also adjust the relative importance of lightcurve features versus host galaxy features using the `weight_lc_feats_factor` parameter:
- Values > 1: Emphasize lightcurve features
- Values < 1: Emphasize host features
- Value = 1: Equal weighting (default)

This allows you to fine-tune the balance between photometric and host properties:

In [None]:
# Regular search prioritizing lightcurve features
neighbors_df_lc_weighted = client.find_neighbors(
    ztf_object_id='ZTF21abbzjeq',
    n=5,
    weight_lc_feats_factor=3.0,  # Strongly prioritize lightcurve features
    plot=True,
    save_figures=True,
    path_to_figure_directory='./figures/lc_weighted'
)
print("\nNearest neighbors with lightcurve features weighted 3x:")
print(neighbors_df_lc_weighted)

# Now prioritize host features by using a factor < 1
neighbors_df_host_weighted = client.find_neighbors(
    ztf_object_id='ZTF21abbzjeq',
    n=5,
    weight_lc_feats_factor=0.3,  # Prioritize host features
    plot=True,
    save_figures=True,
    path_to_figure_directory='./figures/host_weighted'
)
print("\nNearest neighbors with host features given higher weight:")
print(neighbors_df_host_weighted)
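
To build intuition for how a weighting factor can change the result, the toy sketch below multiplies the lightcurve columns by a factor before computing Euclidean distances. This is one plausible way such a factor could act and is an assumption for illustration, not a description of how reLAISS implements `weight_lc_feats_factor`. The function `nearest_index` and all numbers are made up.

```python
import numpy as np


def nearest_index(query, bank, n_lc, weight_lc=1.0):
    """Index of the bank row closest to query after scaling the first n_lc columns."""
    scale = np.ones(bank.shape[1])
    scale[:n_lc] = weight_lc
    dists = np.linalg.norm((bank - query) * scale, axis=1)
    return int(np.argmin(dists))


# Made-up, already-standardized features: columns 0-1 "lightcurve", column 2 "host"
query = np.array([0.0, 0.0, 0.0])
bank = np.array([
    [0.1, 0.1, 2.0],  # row 0: similar lightcurve, different host
    [1.0, 1.0, 0.1],  # row 1: different lightcurve, similar host
])

for w in (3.0, 1.0, 0.3):
    print(w, nearest_index(query, bank, n_lc=2, weight_lc=w))
```

With a factor of 3.0 the lightcurve-similar row 0 is the nearest neighbor, while with factors of 1.0 and 0.3 the host-similar row 1 is chosen.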

## Conclusion

Building your own dataset bank and customizing feature combinations lets you tailor reLAISS to your specific research questions. By choosing which features enter the search and how lightcurve and host features are weighted against each other, you can probe different aspects of transient similarity across the transient population.