# Black Hole Evolution Dataset Preparation

This notebook extracts and formats time-series data from TNG100 so that an LSTM can be trained to predict supermassive black hole evolution.


### 1. Environment Setup
---
Import necessary libraries and configure global settings for reproducibility.


In [1]:
import requests
import numpy as np
import torch
import random

# Seed every RNG imported above so later random sampling is reproducible
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")


NumPy version: 1.24.3
PyTorch version: 2.0.1+cpu


### 2. Load and Filter TNG100 Subhalo Catalog  
---
Locate the TNG100 simulation directory and load the subhalo catalog from snapshot 33, then extract all subhalos hosting supermassive black holes (SMBHs).
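
The selection step described here can be sketched as follows. `SubhaloBHMass` is a TNG group-catalog field, but the helper name `select_smbh_hosts` and the "nonzero black hole mass" criterion are assumptions, since the notebook does not show how the hosts were chosen. The catalog is represented by a small stand-in dict so the example runs without the simulation data.

```python
import numpy as np


def select_smbh_hosts(catalog, min_bh_mass=0.0):
    """Return indices of subhalos whose black hole mass exceeds min_bh_mass.

    `catalog` is a dict of arrays keyed by field name. Masses are expected
    in catalog units (1e10 Msun/h), so the default keeps any subhalo that
    contains a black hole at all. The threshold is an assumption.
    """
    bh_mass = np.asarray(catalog["SubhaloBHMass"])
    return np.flatnonzero(bh_mass > min_bh_mass)


# Stand-in catalog with four subhalos, two of which host a black hole
demo_catalog = {
    "count": 4,
    "SubhaloBHMass": np.array([0.0, 1.2e-3, 0.0, 4.5e-2]),
}
host_ids = select_smbh_hosts(demo_catalog)
print(host_ids)
```

With the real data, a call such as `il.groupcat.loadSubhalos(basePath, 33, fields=['SubhaloBHMass', 'SubhaloMassType'])` should return a dict of this form, and the returned indices then serve as subhalo IDs within that snapshot.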


#### 2.1 Load Preprocessed Dataset
---
This cell sets the simulation base path for later data access, loads the preprocessed black hole evolution dataset from CSV, and prints its shape and column names to confirm its structure.

In [3]:
import illustris_python as il
import pandas as pd

# Set simulation base path
basePath = "/home/tnguser/sims.TNG/TNG100-1"

# Load precompiled black hole sample from CSV
csv_path = "/home/tnguser/cosmic-evolution-ml/black_hole_evolution/data/black_hole_evolution_tng100.csv"
df = pd.read_csv(csv_path)

print(f"Dataset loaded with shape: {df.shape}")
print("Columns:", df.columns.tolist())


Dataset loaded with shape: (2500, 91)
Columns: ['subhalo_id', 'bh_mass_snap18', 'bh_acc_snap18', 'stellar_mass_snap18', 'sfr_snap18', 'halo_mass_snap18', 'vel_disp_snap18', 'bh_mass_snap19', 'bh_acc_snap19', 'stellar_mass_snap19', 'sfr_snap19', 'halo_mass_snap19', 'vel_disp_snap19', 'bh_mass_snap20', 'bh_acc_snap20', 'stellar_mass_snap20', 'sfr_snap20', 'halo_mass_snap20', 'vel_disp_snap20', 'bh_mass_snap21', 'bh_acc_snap21', 'stellar_mass_snap21', 'sfr_snap21', 'halo_mass_snap21', 'vel_disp_snap21', 'bh_mass_snap22', 'bh_acc_snap22', 'stellar_mass_snap22', 'sfr_snap22', 'halo_mass_snap22', 'vel_disp_snap22', 'bh_mass_snap23', 'bh_acc_snap23', 'stellar_mass_snap23', 'sfr_snap23', 'halo_mass_snap23', 'vel_disp_snap23', 'bh_mass_snap24', 'bh_acc_snap24', 'stellar_mass_snap24', 'sfr_snap24', 'halo_mass_snap24', 'vel_disp_snap24', 'bh_mass_snap25', 'bh_acc_snap25', 'stellar_mass_snap25', 'sfr_snap25', 'halo_mass_snap25', 'vel_disp_snap25', 'bh_mass_snap26', 'bh_acc_snap26', 'stellar_mass_s
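
Because the printed column list is long, a compact summary of the layout can be easier to check. The sketch below assumes the `<feature>_snap<N>` naming pattern visible in the output holds for every column; `summarize_wide_columns` is a hypothetical helper, not part of the original notebook.

```python
import re
from collections import defaultdict

# Matches names such as 'bh_mass_snap18' -> feature='bh_mass', snap='18'
SNAP_PATTERN = re.compile(r"^(?P<feature>.+)_snap(?P<snap>\d+)$")


def summarize_wide_columns(columns):
    """Map each feature prefix to the sorted snapshot numbers it covers."""
    layout = defaultdict(list)
    for name in columns:
        match = SNAP_PATTERN.match(name)
        if match:  # columns without a snapshot suffix (e.g. 'subhalo_id') are skipped
            layout[match["feature"]].append(int(match["snap"]))
    return {feature: sorted(snaps) for feature, snaps in layout.items()}


demo_columns = ["subhalo_id", "bh_mass_snap18", "bh_acc_snap18",
                "bh_mass_snap19", "bh_acc_snap19"]
layout = summarize_wide_columns(demo_columns)
print(layout)
```

Calling `summarize_wide_columns(df.columns)` on the loaded dataset would show whether every feature spans the same set of snapshots.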

#### 2.2 Extract Time-Series Data for Modeling
---
This cell organizes the dataset into time-series format, reshaping the features for each black hole into sequences suitable for temporal modeling.

In [7]:
import numpy as np
import torch

# Define input features and target columns
feature_cols = [
    'BH_Mass', 'BH_AccretionRate', 'StellarMass',
    'HaloMass', 'VelocityDispersion', 'SFR'
]
target_cols = [
    'Future_BH_Mass', 'Future_BH_AccretionRate',
    'Future_BH_to_StellarMass_Ratio', 'Quenching_Snapshot'
]

# Get number of time steps (snapshots 18–32 inclusive → 15 steps)
num_steps = 15

# Group by SubhaloID and sort by Snapshot
grouped = df.groupby('subhalo_id')
subhalo_ids = []
sequences = []
targets = []

for sub_id, group in grouped:
    group_sorted = group.sort_values('Snapshot')
    if len(group_sorted) == num_steps:
        subhalo_ids.append(sub_id)
        sequences.append(group_sorted[feature_cols].values)
        targets.append(group_sorted[target_cols].values[-1])  # only final snapshot

# Convert to tensors
X = torch.tensor(np.array(sequences), dtype=torch.float32)
y = torch.tensor(np.array(targets), dtype=torch.float32)

print(f"Input tensor shape: {X.shape}")
print(f"Target tensor shape: {y.shape}")


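KeyError: 'Snapshot'

The cell fails because the CSV loaded in 2.1 appears to be in wide format: each feature has one column per snapshot (for example `bh_mass_snap18`), so there is no `Snapshot` column to sort on. The names in `feature_cols` and `target_cols` likewise do not match the lowercase `<feature>_snap<N>` names visible in that output.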
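
One way to build the input array directly from `<feature>_snap<N>` columns is sketched below. It assumes, based only on the visible part of the column list in 2.1, that each of the six feature prefixes has a column for every snapshot from 18 to 32. The function name `wide_to_sequences` and the prefix list are illustrative. Targets are not constructed, because none of the `target_cols` names appear among the visible columns. The demo uses random stand-in data rather than the real CSV.

```python
import numpy as np
import pandas as pd

# Assumed feature prefixes, read off the visible column names in 2.1
FEATURE_PREFIXES = ["bh_mass", "bh_acc", "stellar_mass",
                    "sfr", "halo_mass", "vel_disp"]
SNAPSHOTS = range(18, 33)  # snapshots 18-32 inclusive -> 15 steps


def wide_to_sequences(df, prefixes=FEATURE_PREFIXES, snapshots=SNAPSHOTS):
    """Return (subhalo_ids, X) with X shaped (n_subhalos, n_steps, n_features)."""
    snapshots = list(snapshots)
    # Snapshot-major column order, so a C-order reshape yields (N, T, F)
    cols = [f"{p}_snap{s}" for s in snapshots for p in prefixes]
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns, e.g. {missing[:3]}")
    # Keep only subhalos with a complete history (no NaN at any step)
    complete = df.dropna(subset=cols)
    values = complete[cols].to_numpy(dtype=np.float32)
    X = values.reshape(len(complete), len(snapshots), len(prefixes))
    return complete["subhalo_id"].to_numpy(), X


# Synthetic stand-in for the real table: three subhalos, one incomplete
rng = np.random.default_rng(0)
demo = {"subhalo_id": [101, 102, 103]}
for s in SNAPSHOTS:
    for p in FEATURE_PREFIXES:
        demo[f"{p}_snap{s}"] = rng.random(3)
demo_df = pd.DataFrame(demo)
demo_df.loc[1, "sfr_snap25"] = np.nan  # incomplete history, dropped below

ids, X = wide_to_sequences(demo_df)
print(ids, X.shape)
```

The resulting array can be converted with `torch.tensor(X, dtype=torch.float32)`, matching the tensor conversion in the cell above. Dropping incomplete rows mirrors the original `len(group_sorted) == num_steps` filter, although whether the real CSV contains missing values is not shown.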