## Black Hole Evolution Dataset (TNG100)
This notebook builds a dataset for modeling the evolution of black holes using the IllustrisTNG100 simulation (snapshots 18–33). The dataset will be used to train an LSTM to predict future black hole properties based on their past evolution.

### 1. Environment Setup
---
Import necessary libraries and configure global settings for reproducibility.


In [1]:
import requests
import numpy as np
import torch
import random

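SEED = 42
random.seed(SEED)        # Ensures reproducible random sampling later
np.random.seed(SEED)     # Also seed NumPy and PyTorch for any later use
torch.manual_seed(SEED)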

print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")

NumPy version: 1.24.3
PyTorch version: 2.0.1+cpu


### 2. Data Access & Preprocessing
---

#### 2.1 Load Subhalo Catalog (Snapshot 33)
---
We begin by selecting black-hole-hosting subhalos at snapshot 33 (z ≈ 2), the last snapshot in the range studied here.

In [2]:
import illustris_python as il

basePath = "/home/tnguser/sims.TNG/TNG100-1/output"  # adjust for your environment

subhalos = il.groupcat.loadSubhalos(
    basePath, 
    33, 
    fields=['SubhaloBHMass', 'SubhaloMassType']
)

bh_mass = subhalos['SubhaloBHMass']
stellar_mass = subhalos['SubhaloMassType'][:, 4]  # Type 4 = stellar component

bh_mask = bh_mass > 0
print(f"Total subhalos with black holes: {bh_mask.sum()}")

Total subhalos with black holes: 29415


#### 2.2 Extract Black Hole Evolutionary Histories
---
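Trace each host subhalo's main progenitor branch (MPB) through the SubLink merger trees and store its time-ordered properties. The coverage check targets snapshots 18–33, but the extraction mask in the cell below stops at snapshot 32, so the stored histories span 18–32.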


In [3]:
tree_base = f"{basePath}/postprocessing/trees/sublink"

full_histories = {}
bh_list = np.flatnonzero(bh_mask).tolist()  # indices of subhalos hosting a black hole
required_snaps = set(range(18, 34))

for count, sub_id in enumerate(bh_list, start=1):
    try:
        tree = il.sublink.loadTree(
            basePath,
            33,
            sub_id,
            fields=[
                'SubhaloID',
                'SnapNum',
                'SubhaloBHMass',
                'SubhaloBHMdot',
                'SubhaloMassType',
                'SubhaloSFR',
                'SubhaloVelDisp'
            ],
            onlyMPB=True
        )

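        # NOTE: the upper bound is 32, so snapshot 33 is not stored even though
        # required_snaps spans 18-33; a history matches at most 15 of 16 snapshots.
        mask = (tree['SnapNum'] <= 32) & (tree['SnapNum'] >= 18)
        snaps = set(tree['SnapNum'][mask])

        # Coverage filter: int(0.9 * 16) truncates 14.4 to 14, so the effective
        # requirement is 14 of 16 snapshots (87.5%), slightly below 90%.
        if len(snaps & required_snaps) >= int(0.9 * len(required_snaps)):
            # The tree is not guaranteed to be in ascending snapshot order
            sorted_idx = np.argsort(tree['SnapNum'][mask])
            full_histories[sub_id] = {
                "snap_nums": tree['SnapNum'][mask][sorted_idx].tolist(),
                "bh_mass": tree['SubhaloBHMass'][mask][sorted_idx].tolist(),
                "bh_accretion": tree['SubhaloBHMdot'][mask][sorted_idx].tolist(),
                "stellar_mass": tree['SubhaloMassType'][mask][sorted_idx, 4].tolist(),
                # Total subhalo mass: SubhaloMassType summed over all particle types
                "halo_mass": tree['SubhaloMassType'][mask][sorted_idx].sum(axis=1).tolist(),
                "sfr": tree['SubhaloSFR'][mask][sorted_idx].tolist(),
                "vel_dispersion": tree['SubhaloVelDisp'][mask][sorted_idx].tolist()
            }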
    except Exception:
        # Skip subhalos whose tree is missing or cannot be read
        continue

    if count % 5000 == 0:
        print(f"Checked {count}/{len(bh_list)} subhalos...", flush=True)

print(f"Black holes with ≥90% snapshot coverage: {len(full_histories)}")


Checked 5000/29415 subhalos...
Checked 10000/29415 subhalos...
Checked 15000/29415 subhalos...
Checked 20000/29415 subhalos...
Checked 25000/29415 subhalos...
Black holes with ≥90% snapshot coverage: 29232
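
The coverage threshold in the cell above is `int(0.9 * len(required_snaps))`, and `int()` truncates rather than rounds up. The short check below works through that arithmetic. The `math.ceil` variant is only one possible stricter alternative, shown as an assumption about what "≥90%" might be intended to mean; the notebook itself does not use it.

```python
import math

required_snaps = set(range(18, 34))  # snapshots 18..33, as in the cell above
n_required = len(required_snaps)

threshold_truncated = int(0.9 * n_required)    # what the notebook computes
threshold_ceil = math.ceil(0.9 * n_required)   # hypothetical stricter variant

print(n_required, threshold_truncated, threshold_ceil)
print(f"effective coverage: {threshold_truncated / n_required:.1%}")
```

With 16 required snapshots the truncated threshold is 14, which corresponds to 87.5% coverage. Because the snapshot mask in the extraction cell stops at 32, a history can contain at most 15 of the 16 required snapshots.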


#### 2.3 Sample and Verify Black Holes
---
Randomly select 2,500 black holes from those that passed the coverage filter, then inspect one example to confirm that the time-series properties were extracted correctly.

In [9]:
sampled_ids = random.sample(list(full_histories.keys()), 2500)
print(f"Sampled {len(sampled_ids)} black holes.")


Sampled 2500 black holes.
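
`random.sample` draws from the global generator, so the selection depends on every earlier call to `random` in the session, and re-running cells out of order can change the sample. A minimal sketch of one way to make the draw self-contained is shown below; the helper `sample_ids` and the toy IDs are hypothetical and not part of the notebook.

```python
import random

def sample_ids(ids, k, seed=42):
    """Draw k IDs reproducibly, independent of the global random state."""
    rng = random.Random(seed)          # local generator; random.seed() is untouched
    return rng.sample(sorted(ids), k)  # sorted() fixes the population order

toy_ids = {901, 17, 334728, 5, 42, 1200, 88}  # hypothetical subhalo IDs
first = sample_ids(toy_ids, 3)
second = sample_ids(toy_ids, 3)
print(first == second)
```

Sorting the population first matters because the result of `sample` depends on the order of its input as well as on the seed.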


In [10]:
# Verification
check_id = random.choice(sampled_ids)
print(f"Inspecting black hole ID: {check_id}")
print("Snapshots:", full_histories[check_id]["snap_nums"])
print("BH Mass Sequence:", full_histories[check_id]["bh_mass"])


Inspecting black hole ID: 334728
Snapshots: [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
BH Mass Sequence: [0.0, 0.0, 0.0, 0.0, 8.097196405287832e-05, 8.336849714396521e-05, 8.618094580015168e-05, 9.101477917283773e-05, 9.366880112793297e-05, 9.904281614581123e-05, 0.000103422709798906, 0.00010685356392059475, 0.0001107670905184932, 0.00011433172039687634, 0.00011698293383233249]
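
The values above are in catalog units rather than solar masses. The sketch below converts the first few entries, assuming the usual TNG convention of 10^10 M☉/h with h = 0.6774; neither value is stated in this notebook, so treat both as assumptions. It also finds the first snapshot with a nonzero black hole mass, since the leading zeros presumably mean the progenitor hosted no black hole yet. The helper `to_solar_masses` is a hypothetical name.

```python
H_LITTLE = 0.6774       # assumed dimensionless Hubble parameter (TNG cosmology)
MASS_UNIT_MSUN = 1e10   # assumed catalog mass unit: 1e10 Msun / h

def to_solar_masses(mass_code_units):
    """Convert a catalog mass to solar masses under the assumptions above."""
    return mass_code_units * MASS_UNIT_MSUN / H_LITTLE

# First six entries of the sequence printed above
snaps = [18, 19, 20, 21, 22, 23]
bh_mass = [0.0, 0.0, 0.0, 0.0, 8.097196405287832e-05, 8.336849714396521e-05]

# First snapshot at which the progenitor has a nonzero black hole mass
first_bh_snap = next((s for s, m in zip(snaps, bh_mass) if m > 0), None)
first_bh_mass = to_solar_masses(bh_mass[snaps.index(first_bh_snap)])
print(first_bh_snap, f"{first_bh_mass:.3e}")
```

Under these assumptions the black hole first appears at snapshot 22 with a mass of roughly 1.2 × 10^6 M☉.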


#### 2.4 Prepare Dataset for Machine Learning
---
Structure the extracted properties into a long-format table with one row per (subhalo, snapshot) and one column per feature, then save it as a CSV for later training.

In [11]:
import os
import pandas as pd

ID_COL = "subhalo_id"
FEATURE_ORDER = ["bh_mass", "bh_acc", "stellar_mass", "sfr", "halo_mass", "vel_disp"]
OUT_CSV = "../data/black_hole_evolution_tng100.csv"

# Output column name -> key used in full_histories
HISTORY_KEYS = {
    "bh_mass": "bh_mass",
    "bh_acc": "bh_accretion",
    "stellar_mass": "stellar_mass",
    "sfr": "sfr",
    "halo_mass": "halo_mass",
    "vel_disp": "vel_dispersion",
}

# No wide `df` with *_snapXX columns is defined in this notebook, so build the
# long table directly from full_histories for the sampled black holes:
# one row per (ID, snapshot)
frames = []
for sub_id in sampled_ids:
    hist = full_histories[sub_id]
    frame = pd.DataFrame({f: hist[HISTORY_KEYS[f]] for f in FEATURE_ORDER})
    frame.insert(0, "snapshot", hist["snap_nums"])
    frame.insert(0, ID_COL, sub_id)
    frames.append(frame)

tidy = pd.concat(frames, ignore_index=True)
tidy["snapshot"] = tidy["snapshot"].astype(int)
tidy = tidy.sort_values([ID_COL, "snapshot"]).reset_index(drop=True)

os.makedirs(os.path.dirname(OUT_CSV), exist_ok=True)  # create ../data if needed
tidy.to_csv(OUT_CSV, index=False)
print(f"[OK] Saved long-format dataset to: {OUT_CSV}")
tidy.head()
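
The notebook does not specify how the saved time series will be fed to the LSTM. One common approach is to cut each black hole's time-ordered sequence into sliding (input window, target) pairs. The sketch below illustrates this on a toy sequence; the helper `make_windows`, the window length, and the toy values are all hypothetical.

```python
def make_windows(sequence, n_input, n_ahead=1):
    """Split one time-ordered sequence into (input window, target) pairs."""
    pairs = []
    for start in range(len(sequence) - n_input - n_ahead + 1):
        window = sequence[start:start + n_input]          # past values
        target = sequence[start + n_input + n_ahead - 1]  # value n_ahead steps later
        pairs.append((window, target))
    return pairs

toy_bh_mass = [0.10, 0.12, 0.15, 0.19, 0.24]  # hypothetical values, one per snapshot
pairs = make_windows(toy_bh_mass, n_input=3)
for window, target in pairs:
    print(window, "->", target)
```

Windows should be built per black hole, after sorting by snapshot, so that no window mixes values from two different histories.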