# Black Hole Evolution Dataset Preparation

This notebook extracts and formats time-series data from TNG100 so that an LSTM can predict supermassive black hole evolution.


### 1. Environment Setup
---
Import necessary libraries and configure global settings for reproducibility.


In [1]:
import requests
import numpy as np
import torch
import random

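# Seed every RNG used in this notebook so that sampling, shuffling,
# and weight initialization are reproducible
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)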

print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")


NumPy version: 1.24.3
PyTorch version: 2.0.1+cpu


### 2. Load and Filter TNG100 Subhalo Catalog
---
Locate the TNG100 simulation directory and load the subhalo catalog from snapshot 33. We then extract all subhalos hosting supermassive black holes (SMBHs).


#### 2.1 Load Dataset
---
This cell loads the preprocessed black hole evolution dataset from the data directory and confirms its structure. It also sets the simulation base path for future data access.

In [2]:
import illustris_python as il
import pandas as pd

# Set simulation base path
basePath = "/home/tnguser/sims.TNG/TNG100-1"

# Load precompiled black hole sample from CSV
csv_path = "/home/tnguser/cosmic-evolution-ml/black_hole_evolution/data/black_hole_evolution_tng100.csv"
df = pd.read_csv(csv_path)

print(f"Dataset loaded with shape: {df.shape}")
print("Columns:", df.columns.tolist())


Dataset loaded with shape: (37500, 8)
Columns: ['subhalo_id', 'snapshot', 'bh_mass', 'bh_acc', 'stellar_mass', 'sfr', 'halo_mass', 'vel_disp']


#### 2.2 Save Processed Data as NumPy Arrays
---
This section converts the cleaned long-format CSV dataset into NumPy arrays for efficient model training and stores them alongside the CSV in the data directory.

In [4]:
import numpy as np
import pandas as pd
from pathlib import Path

# Paths
DATA_DIR = Path("../data")
CSV_PATH = DATA_DIR / "black_hole_evolution_tng100.csv"  # long-format dataset

# Load long-format CSV
df = pd.read_csv(CSV_PATH)

# Convert to NumPy arrays
ids = df["subhalo_id"].to_numpy()
snapshots = df["snapshot"].to_numpy()
features = df.drop(columns=["subhalo_id", "snapshot"]).to_numpy()

# Save arrays in the same directory as the CSV
np.save(DATA_DIR / "ids.npy", ids)
np.save(DATA_DIR / "snapshots.npy", snapshots)
np.save(DATA_DIR / "features.npy", features)

print(f"[OK] Processed arrays saved to: {DATA_DIR}")


[OK] Processed arrays saved to: ../data
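
Before training on the saved arrays, it can help to confirm that they round-trip through `np.save`/`np.load` and to count any non-finite entries per feature column. The following is a minimal sketch on a synthetic stand-in array, not the real TNG100 data; the helper name `summarize_non_finite` is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_non_finite(features):
    """Return per-column counts of NaN and +/-inf entries in a 2-D array."""
    nan_counts = np.isnan(features).sum(axis=0)
    inf_counts = np.isinf(features).sum(axis=0)
    return nan_counts, inf_counts


# Synthetic stand-in for features.npy (3 rows, 2 columns); not real TNG100 data.
toy = np.array([[1.0, np.nan], [np.inf, 2.0], [3.0, 4.0]])

# Round-trip through disk the same way the cell above saves its arrays.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features.npy"
    np.save(path, toy)
    loaded = np.load(path)

nan_counts, inf_counts = summarize_non_finite(loaded)
print("NaN per column:", nan_counts.tolist())
print("inf per column:", inf_counts.tolist())
```

Running the same helper on the real `features.npy` would show which of the six feature columns, if any, carry NaN or infinite values into the dataset.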


### 3. Data Loading and Model Training
---
This section prepares the processed dataset for model training by defining a PyTorch-compatible `Dataset` and `DataLoader`. The goal is to efficiently feed the model sequential input–output pairs representing black hole and galaxy properties across snapshots.

#### 3.1 Dataset and DataLoader Setup
---
We load the processed `.npy` files generated in Section 2, group the rows by subhalo, and build `(initial_conditions, final_conditions)` pairs from the first and last rows of each window of `sequence_length` consecutive snapshots. A `DataLoader` then provides batching and shuffling during training.

In [15]:
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

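class BlackHoleEvolutionDataset(Dataset):
    """Pairs each subhalo's feature vector at one snapshot with its feature vector
    `sequence_length - 1` rows later, with rows sorted by snapshot number."""
    def __init__(self, data_dir, sequence_length=2):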
        self.ids = np.load(data_dir / "ids.npy")
        self.snapshots = np.load(data_dir / "snapshots.npy")
        self.features = np.load(data_dir / "features.npy")
        self.sequence_length = sequence_length
        
        # Replace NaNs with column means. This does not touch +/-inf values,
        # and a column that is entirely NaN keeps a NaN mean.
        nan_mask = np.isnan(self.features)
        if nan_mask.any():
            col_means = np.nanmean(self.features, axis=0)
            self.features[nan_mask] = np.take(col_means, np.where(nan_mask)[1])
        
        # Group by subhalo_id
        self.subhalo_sequences = {}
        for sid in np.unique(self.ids):
            mask = self.ids == sid
            seq_features = self.features[mask]
            seq_snapshots = self.snapshots[mask]
            sort_idx = np.argsort(seq_snapshots)
            self.subhalo_sequences[sid] = seq_features[sort_idx]

        # Build input-output pairs
        self.samples = []
        for seq in self.subhalo_sequences.values():
            if len(seq) >= self.sequence_length:
                for i in range(len(seq) - self.sequence_length + 1):
                    initial = seq[i]
                    final = seq[i + self.sequence_length - 1]
                    self.samples.append((initial, final))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        initial, final = self.samples[idx]
        return torch.tensor(initial, dtype=torch.float32), torch.tensor(final, dtype=torch.float32)

# Initialize dataset and dataloader
dataset = BlackHoleEvolutionDataset(DATA_DIR, sequence_length=2)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

print(f"Total samples: {len(dataset)}")
for batch in dataloader:
    x, y = batch
    print("Batch X shape:", x.shape)
    print("Batch Y shape:", y.shape)
    break


Total samples: 35000
Batch X shape: torch.Size([64, 6])
Batch Y shape: torch.Size([64, 6])
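
The pair construction above can be illustrated on a toy sequence. This sketch mirrors the sliding-window loop in `BlackHoleEvolutionDataset`; the function name `build_pairs` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def build_pairs(seq, sequence_length=2):
    """Pair the first and last row of every window of `sequence_length` rows."""
    pairs = []
    # An empty range (sequence shorter than the window) yields no pairs.
    for i in range(len(seq) - sequence_length + 1):
        pairs.append((seq[i], seq[i + sequence_length - 1]))
    return pairs


# Toy sequence: 4 snapshots with 1 feature each (values are illustrative only).
toy_seq = np.array([[10.0], [11.0], [12.0], [13.0]])

pairs = build_pairs(toy_seq, sequence_length=2)
print(len(pairs))  # 4 - 2 + 1 = 3 pairs
print([(float(a[0]), float(b[0])) for a, b in pairs])
```

With `sequence_length=2`, a subhalo with n rows contributes n - 1 pairs, so the total sample count equals the number of rows minus the number of unique subhalo IDs. The reported 35,000 samples from 37,500 rows therefore implies 2,500 distinct subhalos.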


#### 3.2 Model Architecture Definition
---
We define a neural network model to learn the mapping from `initial_conditions` to `final_conditions`. The architecture consists of fully connected layers with nonlinear activations, allowing the model to capture complex relationships in the astrophysical data.


In [17]:
import torch.nn as nn

class BlackHoleEvolutionModel(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, output_dim=None):
        super().__init__()
        if output_dim is None:
            output_dim = input_dim  # Predict same number of features as input
        
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# Initialize model
input_dim = dataset[0][0].shape[0]  # Feature size from initial_conditions
model = BlackHoleEvolutionModel(input_dim=input_dim, hidden_dim=128)
print(model)


BlackHoleEvolutionModel(
  (net): Sequential(
    (0): Linear(in_features=6, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=6, bias=True)
  )
)
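
As a quick check on model size, the trainable parameters of the printed architecture can be counted directly. This is a small sketch that rebuilds the same layer stack (6 → 128 → 128 → 6) as a standalone `nn.Sequential`; the variable names are illustrative.

```python
import torch.nn as nn

# Same layer sizes as the printed BlackHoleEvolutionModel.
net = nn.Sequential(
    nn.Linear(6, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 6),
)

# Each Linear layer holds in_features * out_features weights plus out_features biases.
n_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(n_params)  # 896 + 16512 + 774 = 18182
```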


#### 3.3 Loss Function & Optimizer Setup
---
We configure the loss function to measure prediction accuracy and the optimizer to update model weights. Mean Squared Error (MSE) is used since we are predicting continuous astrophysical quantities, and Adam is chosen for its adaptive learning rate capabilities.


In [18]:
import torch.optim as optim

# Loss function
criterion = nn.MSELoss()

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3)

print("Loss function and optimizer ready.")


Loss function and optimizer ready.


#### 3.4 Training Loop
---
We iterate over the dataset for multiple epochs, performing forward passes, computing the loss, backpropagating gradients, and updating model parameters. Progress is printed each epoch to monitor convergence.


In [19]:
import torch

# Ensure device is defined
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Make sure DataLoader variables exist
if "train_loader" not in locals():
    train_loader = dataloader  # From 3.1, using full dataset as training data

# Training parameters
num_epochs = 20

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for initial_conditions, final_conditions in train_loader:
        # Move data to device
        initial_conditions = initial_conditions.to(device)
        final_conditions = final_conditions.to(device)

        # Forward pass
        outputs = model(initial_conditions)
        loss = criterion(outputs, final_conditions)

        # Backward pass + optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * initial_conditions.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.6f}")


Epoch [1/20], Loss: nan
Epoch [2/20], Loss: nan
Epoch [3/20], Loss: nan
Epoch [4/20], Loss: nan
Epoch [5/20], Loss: nan
Epoch [6/20], Loss: nan
Epoch [7/20], Loss: nan
Epoch [8/20], Loss: nan
Epoch [9/20], Loss: nan
Epoch [10/20], Loss: nan
Epoch [11/20], Loss: nan
Epoch [12/20], Loss: nan
Epoch [13/20], Loss: nan
Epoch [14/20], Loss: nan
Epoch [15/20], Loss: nan
Epoch [16/20], Loss: nan
Epoch [17/20], Loss: nan
Epoch [18/20], Loss: nan
Epoch [19/20], Loss: nan
Epoch [20/20], Loss: nan
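
The loss is `nan` from the first epoch onward. The notebook does not establish the cause, but two common suspects are non-finite values that survive preprocessing (the NaN imputation in Section 3.1 does not handle `inf`) and unscaled features with very large magnitudes. The sketch below, on toy data, shows one possible remedy under those assumptions: fill non-finite entries with the column median of the finite values, then standardize each column. The helper `clean_and_standardize` is hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn


def clean_and_standardize(features, eps=1e-8):
    """Fill non-finite entries with the column median of finite values, then z-score each column."""
    x = np.asarray(features, dtype=np.float64).copy()
    finite = np.isfinite(x)
    for j in range(x.shape[1]):
        col_ok = finite[:, j]
        # Fall back to 0.0 if a column has no finite values at all.
        fill = np.median(x[col_ok, j]) if col_ok.any() else 0.0
        x[~col_ok, j] = fill
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / (std + eps)


# Toy features containing one inf and one NaN (illustrative values only).
toy = np.array([[1.0, 1e12], [np.inf, 2e12], [3.0, np.nan], [5.0, 4e12]])

criterion = nn.MSELoss()

# A single non-finite entry is enough to make the mean-reduced loss non-finite.
raw = torch.tensor(toy, dtype=torch.float32)
raw_loss = criterion(raw, torch.zeros_like(raw))
print("raw loss finite:", torch.isfinite(raw_loss).item())

clean = torch.tensor(clean_and_standardize(toy), dtype=torch.float32)
clean_loss = criterion(clean, torch.zeros_like(clean))
print("clean loss finite:", torch.isfinite(clean_loss).item())
```

If this approach were applied to the real data, the mean and standard deviation would normally be computed on the training split only and reused for any validation or test data.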
