Add hyperparameter search toolkit and Modal GPU training script by KrishnaRaheja · Pull Request #2 · williamQyq/pyreflect

KrishnaRaheja · 2026-04-01T01:52:18Z

This PR adds tooling for systematically searching CNN hyperparameters and training on cloud GPUs, built on top of the existing pyreflect training pipeline.
Nothing in src/ is modified.

What's added

hyperparam_search/

An Optuna-based hyperparameter search for the NR→SLD CNN (layers, dropout, batch size). Modal GPU and Weights & Biases are both fully optional — runs locally
with no external services by default.

Local, no logging

python hyperparam_search.py --config config.yml

With W&B logging

python hyperparam_search.py --config config.yml --use-wandb

On Modal GPU

python hyperparam_search.py --config config.yml --use-modal

Modal + W&B

python hyperparam_search.py --config config.yml --use-modal --use-wandb

Configuration lives in config.yml — point data_dir at any dataset folder following the pyreflect init structure and set your search space:

data_dir: ../my-dataset
n_trials: 20
epochs_per_trial: 10
search_space:
layers: [6, 8, 12]
dropout_min: 0.05
dropout_max: 0.3
batch_size: [32, 64, 128]

The Optuna study is persisted to SQLite so interrupted runs resume automatically.

experiments/train_modal.py

A companion script for running training on a Modal GPU instead of locally. Reads directly from the experiment's settings.yml (generated by pyreflect init) —
no separate config needed.

python experiments/train_modal.py --experiment-dir experiments/my_experiment
python experiments/train_modal.py --experiment-dir experiments/my_experiment --use-wandb

Uploads the experiment's data/ directory to Modal, trains on a T4 GPU, and saves trained_model.pth back to the experiment directory.

Documentation

experiments/README.md — explains the experiment folder convention and local vs Modal training workflow
hyperparam_search/README.md — quick start, config reference, prerequisites

Prerequisites for new scripts

pip install optuna pyyaml # hyperparam search (always)
pip install wandb # optional, for --use-wandb
pip install modal && modal setup # optional, for --use-modal

For --use-modal --use-wandb: create a Modal secret named wandb-secret containing WANDB_API_KEY.

…arch

- Implemented training script rain_best_model_modal.py - Configured model with 6 layers, 0.11 dropout, batch size of 64, and 20 epochs - Integrated W&B for experiment tracking - Added data loading and preprocessing steps - Included model saving functionality

… 6 layers, 0.107 dropout, and 64 batch size.

- Replace optuna_search.py + optuna_modal.py with unified hyperparam_search.py (--use-modal, --use-wandb flags) - Add hyperparam_search/config.yml for search configuration - Replace hardcoded train_best_model_modal.py with train_modal.py that reads settings.yml - Rename experiment folders: best_model_6L_0.11D -> best_model_30k_6L_0.107D, best_model_125k -> best_model_125k_6L_0.087D - Rename experimental notebooks with descriptive model params in filename - Update notebook paths to match renamed experiment folders - Remove master_training_data/ - Add README.md to hyperparam_search/ and experiments/ - Unignore hyperparam_search/config.yml in .gitignore

The app.function() dynamic decoration pattern was silently falling back to local execution (Tasks: 0 in Modal). Fix by generating a self-contained temp script with the Modal function at module level and invoking via subprocess modal run, matching the pattern Modal requires.

…format - Add traceback.print_exc() to failed trial handler for better debugging - Fix W&B run name dropout format from :.3f to :.2f to match original scripts

qodo-code-review · 2026-04-01T01:52:36Z

Review Summary by Qodo

Add Optuna hyperparameter search and Modal GPU training toolkit

✨ Enhancement

Walkthroughs

Description

• Add unified Optuna hyperparameter search with optional Modal GPU and W&B logging
• Implement Modal GPU training script that reads experiment settings and trains models
• Add comprehensive documentation and configuration files for both tools
• Include VS Code settings and reorganize experiment notebooks to runs/ directory

Diagram

flowchart LR
  A["Config YAML"] -->|hyperparam_search.py| B["Optuna Study"]
  B -->|Local or Modal| C["Train CNN Trials"]
  C -->|Minimize Val Loss| D["Best Params"]
  D -->|optuna_best_params.txt| E["Results"]
  F["settings.yml"] -->|train_modal.py| G["Modal GPU"]
  G -->|Train Model| H["trained_model.pth"]
  B -->|Optional| I["W&B Logging"]
  G -->|Optional| I

File Changes

1. hyperparam_search/hyperparam_search.py ✨ Enhancement +452/-0

Unified Optuna hyperparameter search with Modal and W&B support

hyperparam_search/hyperparam_search.py

2. experiments/train_modal.py ✨ Enhancement +268/-0

Modal GPU training script reading experiment settings

experiments/train_modal.py

3. hyperparam_search/README.md 📝 Documentation +56/-0

Quick start guide and configuration reference for search

hyperparam_search/README.md

View more (6)

4. experiments/README.md 📝 Documentation +62/-0

Experiment folder structure and training workflow documentation

experiments/README.md

5. hyperparam_search/config.yml ⚙️ Configuration changes +40/-0

Example configuration for hyperparameter search space

hyperparam_search/config.yml

6. .vscode/settings.json ⚙️ Configuration changes +4/-0

VS Code Python environment manager configuration

.vscode/settings.json

7. runs/reflectivity_pipeline_experimental_125k_6L_0.087D.ipynb Miscellaneous +742/-0

Experimental notebook for 125k dataset model

runs/reflectivity_pipeline_experimental_125k_6L_0.087D.ipynb

8. runs/reflectivity_pipeline_experimental_30k_6L_0.107D.ipynb Miscellaneous +760/-0

Experimental notebook for 30k dataset model

runs/reflectivity_pipeline_experimental_30k_6L_0.107D.ipynb

9. examples/example_reflectivity_pipeline.ipynb Miscellaneous +88/-86

Updated example notebook paths and references

examples/example_reflectivity_pipeline.ipynb

qodo-code-review · 2026-04-01T01:52:37Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

1. ~~Wandb import always executed~~ ☑ 🐞 Bug ☼ Reliability

Description

_run_training unconditionally imports wandb, so running train_modal without wandb installed (and
Modal images built without wandb when --use-wandb is false) will crash before training starts.

Code

experiments/train_modal.py[R112-114]

+    import torch
+    import wandb as _wandb  # imported lazily below if needed
+    from pyreflect.input import NRSLDDataProcessor

Evidence

The training function imports wandb regardless of the use_wandb flag, but the Modal image only
installs wandb when use_wandb is true; therefore a default run (no --use-wandb) will raise
ImportError in both local and remote environments where wandb isn’t installed.

experiments/train_modal.py[92-126]
experiments/train_modal.py[189-200]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`experiments/train_modal.py::_run_training` imports `wandb` unconditionally, causing an `ImportError` when users run without `--use-wandb` and don't have `wandb` installed (and also in Modal runs where the image omits `wandb` unless `use_wandb=True`).
## Issue Context
- `wandb` is intended to be optional (enabled only via `--use-wandb`).
- The Modal image installs `wandb` only when `use_wandb` is true.
## Fix
- Remove the unconditional `import wandb as _wandb`.
- Keep `import wandb` strictly inside `if use_wandb:`.
## Fix Focus Areas
- experiments/train_modal.py[112-126]
- experiments/train_modal.py[189-196]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Wrong normalization_stats key~~ ☑ 🐞 Bug ≡ Correctness

Description

_extract_training_params reads normalization_stats from nr_predict_sld.file, but pyreflect’s
settings.yml schema stores normalization_stats under nr_predict_sld.models; this will raise KeyError
and prevent Modal training.

Code

experiments/train_modal.py[R70-78]

+    nr_cfg = config["nr_predict_sld"]
+    files = nr_cfg["file"]
+    models = nr_cfg["models"]
+
+    return {
+        "nr_train": str((experiment_dir / files["nr_train"]).resolve()),
+        "sld_train": str((experiment_dir / files["sld_train"]).resolve()),
+        "norm_stats": str((experiment_dir / files["normalization_stats"]).resolve()),
+        "layers": models["layers"],

Evidence

train_modal.py currently looks up files["normalization_stats"], but pyreflect’s init template and
config loader place normalization_stats under the models section
(nr_predict_sld.models.normalization_stats). This mismatch will crash when reading a standard
pyreflect-generated settings.yml.

experiments/train_modal.py[65-83]
src/pyreflect/cli/init_content.py[34-49]
src/pyreflect/models/config.py[106-150]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`experiments/train_modal.py::_extract_training_params` reads `normalization_stats` from `nr_predict_sld.file`, but `pyreflect init` and the config parser expect `normalization_stats` under `nr_predict_sld.models`. This causes a `KeyError` when using a standard `settings.yml`.
## Issue Context
`src/pyreflect/cli/init_content.py` writes:
- `nr_predict_sld.file`: `nr_train`, `sld_train`
- `nr_predict_sld.models`: `normalization_stats`, `layers`, `dropout`, etc.
## Fix
- Change lookup to `models["normalization_stats"]`.
- (Optional hardening) Add a friendly error message if required keys are missing.
## Fix Focus Areas
- experiments/train_modal.py[65-83]
- src/pyreflect/cli/init_content.py[34-49]
- src/pyreflect/models/config.py[124-150]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Trials use different splits 🐞 Bug ≡ Correctness

Description

hyperparam_search re-splits the dataset inside each Optuna trial using DataProcessor.split_arrays,
which is non-deterministic; each trial evaluates on a different validation set, making best_params
selection unreliable.

Code

hyperparam_search/hyperparam_search.py[R149-153]

+        list_arrays = DataProcessor.split_arrays(X, y, size_split=TRAIN_SPLIT)
+        tensor_arrays = DataProcessor.convert_tensors(list_arrays)
+        _, _, _, train_loader, valid_loader, _ = DataProcessor.get_dataloaders(
+            *tensor_arrays, batch_size=batch_size
+        )

Evidence

Each trial calls DataProcessor.split_arrays(), which uses sklearn.train_test_split without a fixed
random_state, so the validation set changes between trials. This introduces noise unrelated to
hyperparameters and can select suboptimal parameters.

hyperparam_search/hyperparam_search.py[133-154]
src/pyreflect/input/data_processor.py[14-22]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Optuna trials are not comparable because `train_with_epoch_logging()` re-splits data each trial via `DataProcessor.split_arrays()`, which uses `train_test_split` without a fixed seed.
## Issue Context
Reliable hyperparameter optimization requires evaluating all trials on the same validation set (or using consistent CV folds).
## Fix
Implement a single deterministic split once per study (outside `objective`), then reuse the same `train_loader`/`valid_loader` (or the same `X_train/X_val`) for all trials.
Concrete options:
1) In `_run_study`, create fixed index splits using numpy/torch RNG with a constant seed, then slice `X_all/y_all`.
2) Or update `DataProcessor.split_arrays` to accept `random_state` and pass it.
## Fix Focus Areas
- hyperparam_search/hyperparam_search.py[133-154]
- src/pyreflect/input/data_processor.py[14-22]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

4. Modal Optuna DB not persisted 🐞 Bug ☼ Reliability

Description

In --use-modal mode, Optuna writes optuna_study.db to /root/data inside the remote container, but
the script only copies optuna_best_params.txt back locally; interrupted Modal runs cannot be resumed
as documented.

Code

hyperparam_search/hyperparam_search.py[R337-406]

+    modal_config = {**config, "data_dir": "/root/data"}
+    out_path = str(data_dir / "optuna_best_params.txt")
+    wandb_pkg = '"wandb",' if use_wandb else ""
+    secret_expr = '[modal.Secret.from_name("wandb-secret")]' if use_wandb else "[]"
+
+    # Embed _run_study verbatim — it has all its imports inside so Modal can
+    # serialize and ship it to the remote container without local dependencies.
+    run_study_src = inspect.getsource(_run_study)
+
+    # Build result-saving footer as plain strings to avoid f-string escaping
+    # conflicts with the curly braces already present in run_study_src.
+    footer_lines = [
+        "",
+        '@app.function(gpu="T4", image=image, secrets=secrets, timeout=12*60*60)',
+        "def _remote():",
+        "    return _run_study(MODAL_CONFIG, USE_WANDB)",
+        "",
+        "@app.local_entrypoint()",
+        "def main():",
+        "    result = _remote.remote()",
+        "    if not result['success']:",
+        "        print('No trials completed.')",
+        "        return",
+        "    print('Best params: ' + str(result['best_params']))",
+        "    print('Best val loss: ' + str(result['best_val_loss']))",
+        "    with open(" + repr(out_path) + ", 'w') as f:",
+        "        f.write('Best validation loss: ' + f\"{result['best_val_loss']:.6f}\" + '\\n')",
+        "        f.write('Best parameters:\\n')",
+        "        for k, v in result['best_params'].items():",
+        "            f.write('  ' + str(k) + ': ' + str(v) + '\\n')",
+        "    print('Results saved to: ' + " + repr(out_path) + ")",
+    ]
+
+    script = "\n".join([
+        "import modal",
+        "",
+        'app = modal.App("pyreflect-optuna")',
+        "image = (",
+        '    modal.Image.debian_slim(python_version="3.10")',
+        "    .pip_install(",
+        '        "torch==2.5.1", "numpy==2.1.0", "optuna",',
+        '        "pandas", "scikit-learn", "scipy", "opencv-python",',
+        '        "pyyaml", "tqdm", "refnx", "llvmlite", "numba",',
+        f"        {wandb_pkg}",
+        "    )",
+        '    .apt_install("git")',
+        '    .run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")',
+        f"    .add_local_dir({repr(str(data_dir))}, remote_path='/root/data', copy=True)",
+        ")",
+        f"secrets = {secret_expr}",
+        f"MODAL_CONFIG = {json.dumps(modal_config)}",
+        f"USE_WANDB = {use_wandb}",
+        "",
+    ]) + run_study_src + "\n".join(footer_lines)
+
+    tmp = tempfile.NamedTemporaryFile(
+        suffix=".py", mode="w", delete=False, encoding="utf-8"
+    )
+    try:
+        tmp.write(script)
+        tmp.close()
+
+        print(f"Launching Optuna search on Modal GPU (T4)...")
+        print(f"Data: {data_dir}")
+        print(f"Trials: {config['n_trials']}, Epochs/trial: {config['epochs_per_trial']}\n")
+
+        env = {**os.environ, "PYTHONIOENCODING": "utf-8"}
+        subprocess.run(["modal", "run", tmp.name], check=True, env=env)
+    finally:
+        os.unlink(tmp.name)

Evidence

_run_study persists the study to /optuna_study.db. run_modal forces data_dir to /root/data and
mounts the local dataset directory into the image via add_local_dir(copy=True), but the generated
Modal script only writes the best params file to the local path and never exports the SQLite DB
back, so the persisted study is lost for subsequent local reruns/resume.

hyperparam_search/hyperparam_search.py[247-254]
hyperparam_search/hyperparam_search.py[333-406]
hyperparam_search/README.md[41-46]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
For `--use-modal`, the Optuna SQLite DB is created inside the remote container at `/root/data/optuna_study.db`, but it is never persisted back to the local machine nor to durable remote storage. This breaks the documented “resume automatically” behavior for Modal runs.
## Issue Context
- `_run_study` always uses `<data_dir>/optuna_study.db`.
- `run_modal` rewrites `data_dir` to `/root/data`.
- `.add_local_dir(..., copy=True)` is a one-way copy into the image; the current flow only writes the best-params text file locally.
## Fix
Pick one:
1) Use a `modal.Volume` mounted at e.g. `/vol` and point Optuna storage to `sqlite:////vol/optuna_study.db` so the study persists across Modal runs.
2) Or have the remote function return the DB bytes (or upload it as an artifact) and write it to `<local data_dir>/optuna_study.db` after the run.
## Fix Focus Areas
- hyperparam_search/hyperparam_search.py[247-254]
- hyperparam_search/hyperparam_search.py[333-406]
- hyperparam_search/README.md[41-46]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

5. Unpinned git install 🐞 Bug ⛨ Security

Description

Modal images install pyreflect via an unpinned git URL, making remote training/search
non-reproducible and vulnerable to upstream changes breaking runs.

Code

experiments/train_modal.py[R189-199]

+    image = (
+        modal.Image.debian_slim(python_version="3.10")
+        .pip_install(
+            "torch==2.5.1", "numpy==2.1.0",
+            "pandas", "scikit-learn", "scipy",
+            "opencv-python", "pyyaml", "tqdm", "refnx", "llvmlite", "numba",
+            *(["wandb"] if use_wandb else []),
+        )
+        .apt_install("git")
+        .run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")
+        .add_local_dir(str(data_dir), remote_path="/root/data", copy=True)

Evidence

The Modal image build runs pip install git+https://.../pyreflect.git without pinning to a tag or
commit; the same pattern exists in the generated Modal script for hyperparam_search. This means the
code used remotely can change over time without any change in this repo.

experiments/train_modal.py[189-200]
hyperparam_search/hyperparam_search.py[370-385]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Modal builds install `pyreflect` from GitHub without pinning a revision. This makes remote runs non-reproducible and can break unexpectedly if upstream changes.
## Issue Context
Both Modal paths include a `pip install git+https://github.com/williamQyq/pyreflect.git` command.
## Fix
- Pin to a specific commit SHA or a released tag, e.g.:
- `pip install git+https://github.com/williamQyq/pyreflect.git@<sha>`
- Alternatively, if the intent is to run *this repo’s* code remotely, mount/add the repo and `pip install .` from that mounted source.
## Fix Focus Areas
- experiments/train_modal.py[189-200]
- hyperparam_search/hyperparam_search.py[370-385]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

- normalization_stats was read from nr_predict_sld.file instead of nr_predict_sld.models, causing a KeyError on standard settings.yml - unconditional `import wandb` at top of _run_training would crash when wandb is not installed and --use-wandb is false

KrishnaRaheja · 2026-04-01T21:41:06Z

if this shouldnt be on the main branch, please create another one and I can merge it there

williamQyq · 2026-04-03T19:49:02Z

@claude review

claude · 2026-04-03T19:49:19Z

Claude finished @williamQyq's task in 1m 42s —— View job

PR Review: Add hyperparameter search toolkit and Modal GPU training script

Tasks:

Gather context and understand the PR
Read key source files
Analyze bugs flagged by Qodo
Review code quality and design
Post comprehensive review

Overall this is a useful addition — Optuna-based search and cloud GPU training fill a real gap. The design is broadly sound, but there are confirmed bugs that will cause failures on first use, plus some architectural concerns worth addressing before merge.

Confirmed Bugs

1. Wrong `normalization_stats` lookup key — will crash on any standard experiment

experiments/train_modal.py line ~77 reads normalization_stats from nr_predict_sld.file, but pyreflect init places it under nr_predict_sld.models (verified in src/pyreflect/cli/init_content.py:48 and src/pyreflect/models/config.py:133):

# train_modal.py — WRONG key
"norm_stats": str((experiment_dir / files["normalization_stats"]).resolve())
#                                   ^^^^^ KeyError every time

Fix: change to models["normalization_stats"]. Fix this →

2. `wandb` imported unconditionally — crashes without `--use-wandb`

In experiments/train_modal.py line ~112, import wandb as _wandb runs regardless of the use_wandb flag. The Modal image only installs wandb when use_wandb=True, so a plain run without --use-wandb raises ImportError before any training starts. Fix this →

3. Non-deterministic train/val splits across Optuna trials

hyperparam_search.py re-splits the dataset inside every trial via DataProcessor.split_arrays, which calls sklearn.train_test_split without a random_state (confirmed in src/pyreflect/input/data_processor.py:19). Each trial evaluates on a different validation set, so the Optuna objective values are not comparable — the search picks "best" hyperparameters partly based on which trial got a lucky split. Fix this →

4. Modal Optuna DB is lost after each run — resume is broken

The README promises "interrupted runs resume automatically" via SQLite persistence. But when using --use-modal, the DB is written to /root/data/optuna_study.db inside the remote container and is never returned. run_modal only writes optuna_best_params.txt back locally. The resume feature only works for local runs. Fix this →

Additional Issues

5. Unpinned `pip install git+...` in Modal images

Both experiments/train_modal.py:195 and hyperparam_search/hyperparam_search.py:~382 install pyreflect via:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")

This pulls HEAD of the default branch at image-build time. A future commit to main could silently change behaviour in older experiments. Pin to a tag or commit SHA:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git@v0.1.0")

6. Dynamically-generated Modal script is fragile

run_modal in hyperparam_search.py builds a Python script as a string using inspect.getsource(_run_study) and subprocess.run(["modal", "run", tmp.name]). This is brittle:

Any closure variable or import inside _run_study that isn't self-contained will fail at runtime in the container (no local imports available).
The generated script can silently embed the wrong config if json.dumps escapes anything unexpectedly.
A simpler, more robust approach: use Modal's Python API (app.run()) directly without subprocess and script generation.

7. `get_xy_norm_stats` has a `@staticmethod` + `self` signature bug (existing code, but referenced by PR)

src/pyreflect/input/data_processor.py:93 declares get_xy_norm_stats(self, data_curves) as a @staticmethod, making self just a positional arg rather than the instance. This is an existing bug in the library that the new code depends on — worth noting.

8. No `requirements.txt` or `pyproject.toml` extras for new dependencies

The PR documents prerequisites in README files but doesn't add them to any machine-readable dependency file. This makes reproducible environment setup harder. Consider adding an extras_require or a requirements-dev.txt:

optuna
pyyaml
wandb; extra == "wandb"
modal; extra == "modal"

9. Branch concern (from PR author's comment)

The PR targets main → main (fork's main into this repo's main). @KrishnaRaheja asked if a separate branch should be used. I'd recommend the PR author push to a feature branch (e.g. feature/hyperparam-search) to keep main clean and allow iterative review without blocking other work.

Summary

Severity	Issue
🔴 Blocker	Wrong `normalization_stats` key — `KeyError` on any standard settings.yml
🔴 Blocker	`wandb` unconditionally imported — `ImportError` without `--use-wandb`
🟡 High	Non-deterministic splits — Optuna results are unreliable
🟡 High	Modal Optuna DB not persisted — resume doesn't work as documented
🟠 Medium	Unpinned git install — non-reproducible remote runs
🟠 Medium	Dynamically-generated Modal script — fragile, hard to debug
🔵 Low	Missing machine-readable dependency declaration

The two blockers (#1 and #2) should be fixed before merge. The non-deterministic split (#3) significantly impacts the correctness of the search and should also be addressed.

KrishnaRaheja · 2026-04-03T20:02:04Z

Issue fixes:

Wrong normalization_stats lookup key — will crash on any standard experiment
This issue is fixed.
wandb imported unconditionally — crashes without --use-wandb
import wandb only happens inside if use_wandb: blocks, not at the top level.
Non-deterministic train/val splits across Optuna trials
The split logic is inside pyreflect's DataProcessor.split_arrays, I am using it as is by calling the Pyreflect API

williamQyq · 2026-04-03T20:15:40Z

@claude visualize the the added feature in simple diagrams, explain in details how is this feature implemented and what does it do. Check again for any potential bugs.

claude · 2026-04-03T20:15:56Z

Claude finished @williamQyq's task in 3m 7s —— View job

Claude finished @williamQyq's task —

Tasks:

Gather context and understand the request
Read key source files (hyperparam_search.py, train_modal.py, data_processor.py, config.py, init_content.py)
Create architecture diagrams
Write detailed feature explanations
Re-check for potential bugs
Post final review

Feature 1: Hyperparameter Search (`hyperparam_search/`)

What it does

Systematically explores the NR→SLD CNN architecture space using Optuna, a Bayesian optimization framework. Instead of manually trying hyperparameter combinations, Optuna adaptively samples from the search space and narrows toward configurations that minimize validation loss.

Architecture Diagram

flowchart TD
    A["CLI: hyperparam_search.py\n--config config.yml\n[--use-modal] [--use-wandb]"] --> B{Mode?}

    B -->|Local| C["_run_study(config, use_wandb)"]
    B -->|Modal| D["run_modal(config, use_wandb, data_dir)"]

    D --> E["Generate script.py\n(inspect.getsource + string template)"]
    E --> F["subprocess: modal run script.py\n(T4 GPU, remote container)"]
    F --> C

    C --> G["Create Optuna Study\n(SQLite: optuna_study.db)"]
    G --> H["For each trial (1..n_trials)"]

    H --> I["Optuna suggests:\n- layers: choice [6,8,12]\n- dropout: float [0.05, 0.3]\n- batch_size: choice [32,64,128]"]
    I --> J["DataProcessor.split_arrays(X, y)\n(train 80% / val 10% / test 10%)"]
    J --> K["Build CNN + Train\nfor epochs_per_trial"]
    K --> L["Return val_loss"]
    L --> M{More trials?}
    M -->|yes| H
    M -->|no| N["Save optuna_best_params.txt"]
    N --> O{use_wandb?}
    O -->|yes| P["Log to Weights & Biases"]
    O -->|no| Q["Done"]

How It's Implemented (step-by-step)

Config loading — config.yml defines data_dir, n_trials, epochs_per_trial, and search_space (layers list, dropout range, batch_size list).
Data loading — NR and SLD .npy files are loaded from data_dir via NRSLDDataProcessor. Normalization stats are loaded from normalization_stat.npy.
Optuna study — Created with optuna.create_study(direction="minimize", storage="sqlite:///optuna_study.db"). SQLite persistence means a local run can be resumed if interrupted.
Objective function — Each trial:
- Optuna samples layers, dropout, batch_size using its TPE sampler
- Data is split into train/val/test via DataProcessor.split_arrays()
- A CNN model is built with the sampled architecture and trained
- Validation loss is returned as the objective to minimize
Modal execution path — When --use-modal is passed, inspect.getsource(_run_study) extracts the function source code as a string, a complete Modal Python script is assembled with the config embedded as json.dumps, and it is written to a temp file and executed via subprocess.run(["modal", "run", tmp.py]). The remote container installs all dependencies and runs _run_study on a T4 GPU.

Feature 2: Modal GPU Training (`experiments/train_modal.py`)

What it does

Trains the NR→SLD CNN on a cloud GPU using Modal, reading directly from the standard pyreflect experiment directory structure. No separate config needed — it parses the experiment's settings.yml.

Architecture Diagram

flowchart TD
    A["CLI: train_modal.py\n--experiment-dir experiments/my_exp\n[--use-wandb]"] --> B["Load settings.yml"]
    B --> C["_extract_training_params()\nRead nr_predict_sld section"]

    C --> D["files: nr_train, sld_train\nmodels: normalization_stats,\nlayers, dropout, batch_size, epochs"]

    D --> E["Build Modal Image\n(debian_slim + pip_install)"]
    E --> F["add_local_dir(data/) → /root/data\n(one-way copy into container)"]

    F --> G["Remote: _run_training() on T4 GPU"]

    G --> H["Import torch, pyreflect\n(conditionally: wandb)"]
    H --> I["NRSLDDataProcessor\nload + normalize NR & SLD"]
    I --> J["split_arrays → DataLoaders"]
    J --> K["Build CNN(layers, dropout)\nTrain for epochs"]
    K --> L{use_wandb?}
    L -->|yes| M["Log metrics to W&B"]
    L -->|no| N["Local training loop"]
    M --> O["Save trained_model.pth\n→ experiment dir"]
    N --> O

How It's Implemented (step-by-step)

Parse experiment — _extract_training_params(experiment_dir, config) reads settings.yml and extracts:
- File paths from nr_predict_sld.file: nr_train, sld_train
- Model params from nr_predict_sld.models: normalization_stats, layers, dropout, batch_size, epochs
Modal image — A modal.Image.debian_slim is constructed with pip_install for all dependencies (torch, numpy, pyreflect, etc.). The local data/ directory is uploaded into the container at /root/data via add_local_dir(..., copy=True).
Remote training — _run_training(params, use_wandb) runs on the T4 GPU:
- Loads and normalizes NR/SLD data using NRSLDDataProcessor
- Splits into train/val/test sets
- Builds and trains the CNN
- Saves trained_model.pth
Result retrieval — The trained model file is returned from Modal back to the local experiment directory.

Data Flow (end-to-end)

Local filesystem                Modal/Cloud
─────────────────               ─────────────
experiments/
  my_exp/
    settings.yml  ──────────── read by train_modal.py
    data/
      nr_train.npy ────copy──► /root/data/nr_train.npy
      sld_train.npy ───copy──► /root/data/sld_train.npy
      norm_stats.npy ──copy──► /root/data/norm_stats.npy
                               ┌──────────────┐
                               │ T4 GPU Train │
                               └──────┬───────┘
    trained_model.pth ◄─────────────-┘

Bug Re-check

Based on the source files I read (including data_processor.py, init_content.py, config.py), here's an updated assessment:

Fixed (per @KrishnaRaheja)

#	Issue	Status
1	Wrong `normalization_stats` key (`files[...]` → `models[...]`)	✅ Fixed
2	`wandb` imported unconditionally	✅ Fixed

Still Open

Bug A — Non-deterministic splits (Correctness, High)

DataProcessor.split_arrays() at src/pyreflect/input/data_processor.py:19 calls train_test_split without a random_state:

# data_processor.py:19 — no random_state!
crv_tr, crv_hld, chi_tr, chi_hld = train_test_split(X, y, train_size=size_split)

Since this is called inside every Optuna trial, each trial evaluates on a different validation set. Trial A may see a harder subset than Trial B, making the Optuna objective values incomparable. @KrishnaRaheja says "I'm using the API as-is", but this fundamentally undermines the search results — the "best" params may simply be the trial that got a lucky val split.

Minimal fix without modifying pyreflect: Move the split outside objective() and reuse train_loader/valid_loader across all trials (only recreate DataLoader per trial to change batch_size). Fix this →

Bug B — Modal Optuna DB lost after each run (Reliability, High)

The README says "The Optuna study is persisted to SQLite so interrupted runs resume automatically." This is only true for local runs. For --use-modal runs:

_run_study creates optuna_study.db at /root/data/optuna_study.db inside the container
run_modal only copies optuna_best_params.txt back locally
The DB is destroyed when the container exits → resume is broken for Modal runs

Fix this →

Bug C — Unpinned pip install git+... in both Modal images (Reproducibility, Medium)

Both experiments/train_modal.py:~195 and the generated Modal script in hyperparam_search.py:~382 use:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")

This installs the latest HEAD of main at image-build time. A future commit could silently break old experiments. Pin to a commit SHA or tag:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git@<sha_or_tag>")

Bug D — Fragile dynamic script generation (Reliability, Medium)

run_modal() in hyperparam_search.py builds a Python file by concatenating strings including inspect.getsource(_run_study). This breaks if:

_run_study closes over any variable not serializable to json.dumps
The source extraction picks up decorators or formatting that changes the indentation

A more robust alternative: use Modal's Python API (@app.function + app.run()) directly from the calling script, instead of code generation + subprocess.

Pre-existing Bug E — @staticmethod with self parameter (Existing, Low)

src/pyreflect/input/data_processor.py:92-93 (pre-PR code):

@staticmethod
def get_xy_norm_stats(self, data_curves):  # ← self is just a positional arg here!

As a @staticmethod, self is not the instance — it's just the first positional argument. Callers would need to pass an extra dummy value. This method isn't directly used by the new PR code, but is a latent bug in the library.

Summary

Severity	Issue	Status
✅ Fixed	Wrong `normalization_stats` key	KrishnaRaheja
✅ Fixed	`wandb` unconditional import	KrishnaRaheja
🟡 Open	Non-deterministic trial splits	Needs fix
🟡 Open	Modal Optuna DB not persisted (resume broken)	Needs fix
🟠 Open	Unpinned `git+` install in Modal images	Low priority
🟠 Open	Fragile `inspect.getsource` + subprocess approach	Low priority
🔵 Pre-existing	`@staticmethod get_xy_norm_stats(self, ...)` in data_processor	Pre-PR

williamQyq · 2026-04-06T07:31:27Z

@claude fix the pre-existing issues then merge the pr.

claude · 2026-04-06T07:31:43Z

Claude finished @williamQyq's task in 1m 13s —— View job

Fixing Pre-existing Issues

Tasks:

Gather context and understand the request
Read src/pyreflect/input/data_processor.py to verify the @staticmethod bug
Fix the @staticmethod get_xy_norm_stats(self, ...) issue
Commit and push the fix
Note limitation regarding PR merge

Fix Applied

src/pyreflect/input/data_processor.py:93 — Removed the erroneous self parameter from DataProcessor.get_xy_norm_stats. As a @staticmethod, self was just treated as a regular positional argument, meaning any caller would have needed to pass a dummy first value to use this method correctly. The fix aligns the signature with its decorator:

- def get_xy_norm_stats(self, data_curves):
+ def get_xy_norm_stats(data_curves):

Commit 4ab6ab8 pushed to main.

Regarding "merge the PR"

I'm unable to merge pull requests due to GitHub App permission constraints. You can merge it directly via the GitHub UI. More info in the FAQ.

Note: The following issues from the prior review are still open if you'd like to address them before merging:

🟡 Non-deterministic Optuna trial splits (each trial re-splits data without a fixed seed)

🟡 Modal Optuna DB not persisted — resume doesn't work as documented for --use-modal

🟠 Unpinned pip install git+... in Modal image builds (non-reproducible)

KrishnaRaheja added 9 commits February 26, 2026 13:22

optuna, w&b, modal scripts

a85152b

[Fix] update EPOCHS_PER_TRIAL and dropout range for hyperparameter se…

b88e535

…arch

reflectivity_pipeline_experimental is a run of 6-layer cnn model with…

01e897b

… 6 layers, 0.107 dropout, and 64 batch size.

reflectivity_pipeline_experimental_125k

8be70dc

Move user inference notebooks from examples/ to runs/

bb09019

Fix minor issues in hyperparam_search: traceback logging and dropout …

df366bd

…format - Add traceback.print_exc() to failed trial handler for better debugging - Fix W&B run name dropout format from :.3f to :.2f to match original scripts

qodo-code-review Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread experiments/train_modal.py

Comment thread experiments/train_modal.py

Add experiment setup docs for data generation before Modal training

044c69f

williamQyq merged commit eaea0f0 into williamQyq:main Apr 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add hyperparameter search toolkit and Modal GPU training script#2

Add hyperparameter search toolkit and Modal GPU training script#2
williamQyq merged 11 commits into
williamQyq:mainfrom
Northeastern-Research-ORNL-1:main

KrishnaRaheja commented Apr 1, 2026

Uh oh!

qodo-code-review Bot commented Apr 1, 2026

Uh oh!

qodo-code-review Bot commented Apr 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

KrishnaRaheja commented Apr 1, 2026

Uh oh!

williamQyq commented Apr 3, 2026

Uh oh!

claude Bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

KrishnaRaheja commented Apr 3, 2026 •

edited

Loading

Uh oh!

williamQyq commented Apr 3, 2026

Uh oh!

claude Bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

williamQyq commented Apr 6, 2026

Uh oh!

claude Bot commented Apr 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

KrishnaRaheja commented Apr 1, 2026

Local, no logging

With W&B logging

On Modal GPU

Modal + W&B

Uh oh!

qodo-code-review Bot commented Apr 1, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

Uh oh!

Uh oh!

KrishnaRaheja commented Apr 1, 2026

Uh oh!

williamQyq commented Apr 3, 2026

Uh oh!

claude Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add hyperparameter search toolkit and Modal GPU training script

Confirmed Bugs

1. Wrong normalization_stats lookup key — will crash on any standard experiment

2. wandb imported unconditionally — crashes without --use-wandb

3. Non-deterministic train/val splits across Optuna trials

4. Modal Optuna DB is lost after each run — resume is broken

Additional Issues

5. Unpinned pip install git+... in Modal images

6. Dynamically-generated Modal script is fragile

7. get_xy_norm_stats has a @staticmethod + self signature bug (existing code, but referenced by PR)

8. No requirements.txt or pyproject.toml extras for new dependencies

9. Branch concern (from PR author's comment)

Summary

Uh oh!

KrishnaRaheja commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

williamQyq commented Apr 3, 2026

Uh oh!

claude Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Feature 1: Hyperparameter Search (hyperparam_search/)

What it does

Architecture Diagram

How It's Implemented (step-by-step)

Feature 2: Modal GPU Training (experiments/train_modal.py)

What it does

Architecture Diagram

How It's Implemented (step-by-step)

Data Flow (end-to-end)

Bug Re-check

Fixed (per @KrishnaRaheja)

Still Open

Summary

Uh oh!

williamQyq commented Apr 6, 2026

Uh oh!

claude Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixing Pre-existing Issues

Fix Applied

Regarding "merge the PR"

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qodo-code-review Bot commented Apr 1, 2026 •

edited

Loading

claude Bot commented Apr 3, 2026 •

edited

Loading

1. Wrong `normalization_stats` lookup key — will crash on any standard experiment

2. `wandb` imported unconditionally — crashes without `--use-wandb`

5. Unpinned `pip install git+...` in Modal images

7. `get_xy_norm_stats` has a `@staticmethod` + `self` signature bug (existing code, but referenced by PR)

8. No `requirements.txt` or `pyproject.toml` extras for new dependencies

KrishnaRaheja commented Apr 3, 2026 •

edited

Loading

claude Bot commented Apr 3, 2026 •

edited

Loading

Feature 1: Hyperparameter Search (`hyperparam_search/`)

Feature 2: Modal GPU Training (`experiments/train_modal.py`)

claude Bot commented Apr 6, 2026 •

edited

Loading