Skip to content

Add hyperparameter search toolkit and Modal GPU training script#2

Merged
williamQyq merged 11 commits into
williamQyq:mainfrom
Northeastern-Research-ORNL-1:main
Apr 6, 2026
Merged

Add hyperparameter search toolkit and Modal GPU training script#2
williamQyq merged 11 commits into
williamQyq:mainfrom
Northeastern-Research-ORNL-1:main

Conversation

@KrishnaRaheja
Copy link
Copy Markdown
Contributor


This PR adds tooling for systematically searching CNN hyperparameters and training on cloud GPUs, built on top of the existing pyreflect training pipeline.
Nothing in src/ is modified.


What's added

hyperparam_search/

An Optuna-based hyperparameter search for the NR→SLD CNN (layers, dropout, batch size). Modal GPU and Weights & Biases are both fully optional — runs locally
with no external services by default.

Local, no logging

python hyperparam_search.py --config config.yml

With W&B logging

python hyperparam_search.py --config config.yml --use-wandb

On Modal GPU

python hyperparam_search.py --config config.yml --use-modal

Modal + W&B

python hyperparam_search.py --config config.yml --use-modal --use-wandb

Configuration lives in config.yml — point data_dir at any dataset folder following the pyreflect init structure and set your search space:

data_dir: ../my-dataset
n_trials: 20
epochs_per_trial: 10
search_space:
layers: [6, 8, 12]
dropout_min: 0.05
dropout_max: 0.3
batch_size: [32, 64, 128]

The Optuna study is persisted to SQLite so interrupted runs resume automatically.


experiments/train_modal.py

A companion script for running training on a Modal GPU instead of locally. Reads directly from the experiment's settings.yml (generated by pyreflect init) —
no separate config needed.

python experiments/train_modal.py --experiment-dir experiments/my_experiment
python experiments/train_modal.py --experiment-dir experiments/my_experiment --use-wandb

Uploads the experiment's data/ directory to Modal, trains on a T4 GPU, and saves trained_model.pth back to the experiment directory.


Documentation

  • experiments/README.md — explains the experiment folder convention and local vs Modal training workflow
  • hyperparam_search/README.md — quick start, config reference, prerequisites

Prerequisites for new scripts

pip install optuna pyyaml # hyperparam search (always)
pip install wandb # optional, for --use-wandb
pip install modal && modal setup # optional, for --use-modal

For --use-modal --use-wandb: create a Modal secret named wandb-secret containing WANDB_API_KEY.

- Implemented training script 	rain_best_model_modal.py
- Configured model with 6 layers, 0.11 dropout, batch size of 64, and 20 epochs
- Integrated W&B for experiment tracking
- Added data loading and preprocessing steps
- Included model saving functionality
… 6 layers, 0.107 dropout, and 64 batch size.
- Replace optuna_search.py + optuna_modal.py with unified hyperparam_search.py (--use-modal, --use-wandb flags)
- Add hyperparam_search/config.yml for search configuration
- Replace hardcoded train_best_model_modal.py with train_modal.py that reads settings.yml
- Rename experiment folders: best_model_6L_0.11D -> best_model_30k_6L_0.107D, best_model_125k -> best_model_125k_6L_0.087D
- Rename experimental notebooks with descriptive model params in filename
- Update notebook paths to match renamed experiment folders
- Remove master_training_data/
- Add README.md to hyperparam_search/ and experiments/
- Unignore hyperparam_search/config.yml in .gitignore
The app.function() dynamic decoration pattern was silently falling back
to local execution (Tasks: 0 in Modal). Fix by generating a self-contained
temp script with the Modal function at module level and invoking via
subprocess modal run, matching the pattern Modal requires.
…format

- Add traceback.print_exc() to failed trial handler for better debugging
- Fix W&B run name dropout format from :.3f to :.2f to match original scripts
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add Optuna hyperparameter search and Modal GPU training toolkit

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add unified Optuna hyperparameter search with optional Modal GPU and W&B logging
• Implement Modal GPU training script that reads experiment settings and trains models
• Add comprehensive documentation and configuration files for both tools
• Include VS Code settings and reorganize experiment notebooks to runs/ directory
Diagram
flowchart LR
  A["Config YAML"] -->|hyperparam_search.py| B["Optuna Study"]
  B -->|Local or Modal| C["Train CNN Trials"]
  C -->|Minimize Val Loss| D["Best Params"]
  D -->|optuna_best_params.txt| E["Results"]
  F["settings.yml"] -->|train_modal.py| G["Modal GPU"]
  G -->|Train Model| H["trained_model.pth"]
  B -->|Optional| I["W&B Logging"]
  G -->|Optional| I
Loading

Grey Divider

File Changes

1. hyperparam_search/hyperparam_search.py ✨ Enhancement +452/-0

Unified Optuna hyperparameter search with Modal and W&B support

hyperparam_search/hyperparam_search.py


2. experiments/train_modal.py ✨ Enhancement +268/-0

Modal GPU training script reading experiment settings

experiments/train_modal.py


3. hyperparam_search/README.md 📝 Documentation +56/-0

Quick start guide and configuration reference for search

hyperparam_search/README.md


View more (6)
4. experiments/README.md 📝 Documentation +62/-0

Experiment folder structure and training workflow documentation

experiments/README.md


5. hyperparam_search/config.yml ⚙️ Configuration changes +40/-0

Example configuration for hyperparameter search space

hyperparam_search/config.yml


6. .vscode/settings.json ⚙️ Configuration changes +4/-0

VS Code Python environment manager configuration

.vscode/settings.json


7. runs/reflectivity_pipeline_experimental_125k_6L_0.087D.ipynb Miscellaneous +742/-0

Experimental notebook for 125k dataset model

runs/reflectivity_pipeline_experimental_125k_6L_0.087D.ipynb


8. runs/reflectivity_pipeline_experimental_30k_6L_0.107D.ipynb Miscellaneous +760/-0

Experimental notebook for 30k dataset model

runs/reflectivity_pipeline_experimental_30k_6L_0.107D.ipynb


9. examples/example_reflectivity_pipeline.ipynb Miscellaneous +88/-86

Updated example notebook paths and references

examples/example_reflectivity_pipeline.ipynb


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Apr 1, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Action required

1. Wandb import always executed🐞 Bug ☼ Reliability
Description
_run_training unconditionally imports wandb, so running train_modal without wandb installed (and
Modal images built without wandb when --use-wandb is false) will crash before training starts.
Code

experiments/train_modal.py[R112-114]

+    import torch
+    import wandb as _wandb  # imported lazily below if needed
+    from pyreflect.input import NRSLDDataProcessor
Evidence
The training function imports wandb regardless of the use_wandb flag, but the Modal image only
installs wandb when use_wandb is true; therefore a default run (no --use-wandb) will raise
ImportError in both local and remote environments where wandb isn’t installed.

experiments/train_modal.py[92-126]
experiments/train_modal.py[189-200]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`experiments/train_modal.py::_run_training` imports `wandb` unconditionally, causing an `ImportError` when users run without `--use-wandb` and don't have `wandb` installed (and also in Modal runs where the image omits `wandb` unless `use_wandb=True`).
## Issue Context
- `wandb` is intended to be optional (enabled only via `--use-wandb`).
- The Modal image installs `wandb` only when `use_wandb` is true.
## Fix
- Remove the unconditional `import wandb as _wandb`.
- Keep `import wandb` strictly inside `if use_wandb:`.
## Fix Focus Areas
- experiments/train_modal.py[112-126]
- experiments/train_modal.py[189-196]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Wrong normalization_stats key🐞 Bug ≡ Correctness
Description
_extract_training_params reads normalization_stats from nr_predict_sld.file, but pyreflect’s
settings.yml schema stores normalization_stats under nr_predict_sld.models; this will raise KeyError
and prevent Modal training.
Code

experiments/train_modal.py[R70-78]

+    nr_cfg = config["nr_predict_sld"]
+    files = nr_cfg["file"]
+    models = nr_cfg["models"]
+
+    return {
+        "nr_train": str((experiment_dir / files["nr_train"]).resolve()),
+        "sld_train": str((experiment_dir / files["sld_train"]).resolve()),
+        "norm_stats": str((experiment_dir / files["normalization_stats"]).resolve()),
+        "layers": models["layers"],
Evidence
train_modal.py currently looks up files["normalization_stats"], but pyreflect’s init template and
config loader place normalization_stats under the models section
(nr_predict_sld.models.normalization_stats). This mismatch will crash when reading a standard
pyreflect-generated settings.yml.

experiments/train_modal.py[65-83]
src/pyreflect/cli/init_content.py[34-49]
src/pyreflect/models/config.py[106-150]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`experiments/train_modal.py::_extract_training_params` reads `normalization_stats` from `nr_predict_sld.file`, but `pyreflect init` and the config parser expect `normalization_stats` under `nr_predict_sld.models`. This causes a `KeyError` when using a standard `settings.yml`.
## Issue Context
`src/pyreflect/cli/init_content.py` writes:
- `nr_predict_sld.file`: `nr_train`, `sld_train`
- `nr_predict_sld.models`: `normalization_stats`, `layers`, `dropout`, etc.
## Fix
- Change lookup to `models["normalization_stats"]`.
- (Optional hardening) Add a friendly error message if required keys are missing.
## Fix Focus Areas
- experiments/train_modal.py[65-83]
- src/pyreflect/cli/init_content.py[34-49]
- src/pyreflect/models/config.py[124-150]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. Trials use different splits 🐞 Bug ≡ Correctness
Description
hyperparam_search re-splits the dataset inside each Optuna trial using DataProcessor.split_arrays,
which is non-deterministic; each trial evaluates on a different validation set, making best_params
selection unreliable.
Code

hyperparam_search/hyperparam_search.py[R149-153]

+        list_arrays = DataProcessor.split_arrays(X, y, size_split=TRAIN_SPLIT)
+        tensor_arrays = DataProcessor.convert_tensors(list_arrays)
+        _, _, _, train_loader, valid_loader, _ = DataProcessor.get_dataloaders(
+            *tensor_arrays, batch_size=batch_size
+        )
Evidence
Each trial calls DataProcessor.split_arrays(), which uses sklearn.train_test_split without a fixed
random_state, so the validation set changes between trials. This introduces noise unrelated to
hyperparameters and can select suboptimal parameters.

hyperparam_search/hyperparam_search.py[133-154]
src/pyreflect/input/data_processor.py[14-22]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Optuna trials are not comparable because `train_with_epoch_logging()` re-splits data each trial via `DataProcessor.split_arrays()`, which uses `train_test_split` without a fixed seed.
## Issue Context
Reliable hyperparameter optimization requires evaluating all trials on the same validation set (or using consistent CV folds).
## Fix
Implement a single deterministic split once per study (outside `objective`), then reuse the same `train_loader`/`valid_loader` (or the same `X_train/X_val`) for all trials.
Concrete options:
1) In `_run_study`, create fixed index splits using numpy/torch RNG with a constant seed, then slice `X_all/y_all`.
2) Or update `DataProcessor.split_arrays` to accept `random_state` and pass it.
## Fix Focus Areas
- hyperparam_search/hyperparam_search.py[133-154]
- src/pyreflect/input/data_processor.py[14-22]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Modal Optuna DB not persisted 🐞 Bug ☼ Reliability
Description
In --use-modal mode, Optuna writes optuna_study.db to /root/data inside the remote container, but
the script only copies optuna_best_params.txt back locally; interrupted Modal runs cannot be resumed
as documented.
Code

hyperparam_search/hyperparam_search.py[R337-406]

+    modal_config = {**config, "data_dir": "/root/data"}
+    out_path = str(data_dir / "optuna_best_params.txt")
+    wandb_pkg = '"wandb",' if use_wandb else ""
+    secret_expr = '[modal.Secret.from_name("wandb-secret")]' if use_wandb else "[]"
+
+    # Embed _run_study verbatim — it has all its imports inside so Modal can
+    # serialize and ship it to the remote container without local dependencies.
+    run_study_src = inspect.getsource(_run_study)
+
+    # Build result-saving footer as plain strings to avoid f-string escaping
+    # conflicts with the curly braces already present in run_study_src.
+    footer_lines = [
+        "",
+        '@app.function(gpu="T4", image=image, secrets=secrets, timeout=12*60*60)',
+        "def _remote():",
+        "    return _run_study(MODAL_CONFIG, USE_WANDB)",
+        "",
+        "@app.local_entrypoint()",
+        "def main():",
+        "    result = _remote.remote()",
+        "    if not result['success']:",
+        "        print('No trials completed.')",
+        "        return",
+        "    print('Best params: ' + str(result['best_params']))",
+        "    print('Best val loss: ' + str(result['best_val_loss']))",
+        "    with open(" + repr(out_path) + ", 'w') as f:",
+        "        f.write('Best validation loss: ' + f\"{result['best_val_loss']:.6f}\" + '\\n')",
+        "        f.write('Best parameters:\\n')",
+        "        for k, v in result['best_params'].items():",
+        "            f.write('  ' + str(k) + ': ' + str(v) + '\\n')",
+        "    print('Results saved to: ' + " + repr(out_path) + ")",
+    ]
+
+    script = "\n".join([
+        "import modal",
+        "",
+        'app = modal.App("pyreflect-optuna")',
+        "image = (",
+        '    modal.Image.debian_slim(python_version="3.10")',
+        "    .pip_install(",
+        '        "torch==2.5.1", "numpy==2.1.0", "optuna",',
+        '        "pandas", "scikit-learn", "scipy", "opencv-python",',
+        '        "pyyaml", "tqdm", "refnx", "llvmlite", "numba",',
+        f"        {wandb_pkg}",
+        "    )",
+        '    .apt_install("git")',
+        '    .run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")',
+        f"    .add_local_dir({repr(str(data_dir))}, remote_path='/root/data', copy=True)",
+        ")",
+        f"secrets = {secret_expr}",
+        f"MODAL_CONFIG = {json.dumps(modal_config)}",
+        f"USE_WANDB = {use_wandb}",
+        "",
+    ]) + run_study_src + "\n".join(footer_lines)
+
+    tmp = tempfile.NamedTemporaryFile(
+        suffix=".py", mode="w", delete=False, encoding="utf-8"
+    )
+    try:
+        tmp.write(script)
+        tmp.close()
+
+        print(f"Launching Optuna search on Modal GPU (T4)...")
+        print(f"Data: {data_dir}")
+        print(f"Trials: {config['n_trials']}, Epochs/trial: {config['epochs_per_trial']}\n")
+
+        env = {**os.environ, "PYTHONIOENCODING": "utf-8"}
+        subprocess.run(["modal", "run", tmp.name], check=True, env=env)
+    finally:
+        os.unlink(tmp.name)
Evidence
_run_study persists the study to /optuna_study.db. run_modal forces data_dir to /root/data and
mounts the local dataset directory into the image via add_local_dir(copy=True), but the generated
Modal script only writes the best params file to the local path and never exports the SQLite DB
back, so the persisted study is lost for subsequent local reruns/resume.

hyperparam_search/hyperparam_search.py[247-254]
hyperparam_search/hyperparam_search.py[333-406]
hyperparam_search/README.md[41-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
For `--use-modal`, the Optuna SQLite DB is created inside the remote container at `/root/data/optuna_study.db`, but it is never persisted back to the local machine nor to durable remote storage. This breaks the documented “resume automatically” behavior for Modal runs.
## Issue Context
- `_run_study` always uses `<data_dir>/optuna_study.db`.
- `run_modal` rewrites `data_dir` to `/root/data`.
- `.add_local_dir(..., copy=True)` is a one-way copy into the image; the current flow only writes the best-params text file locally.
## Fix
Pick one:
1) Use a `modal.Volume` mounted at e.g. `/vol` and point Optuna storage to `sqlite:////vol/optuna_study.db` so the study persists across Modal runs.
2) Or have the remote function return the DB bytes (or upload it as an artifact) and write it to `<local data_dir>/optuna_study.db` after the run.
## Fix Focus Areas
- hyperparam_search/hyperparam_search.py[247-254]
- hyperparam_search/hyperparam_search.py[333-406]
- hyperparam_search/README.md[41-46]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Unpinned git install 🐞 Bug ⛨ Security
Description
Modal images install pyreflect via an unpinned git URL, making remote training/search
non-reproducible and vulnerable to upstream changes breaking runs.
Code

experiments/train_modal.py[R189-199]

+    image = (
+        modal.Image.debian_slim(python_version="3.10")
+        .pip_install(
+            "torch==2.5.1", "numpy==2.1.0",
+            "pandas", "scikit-learn", "scipy",
+            "opencv-python", "pyyaml", "tqdm", "refnx", "llvmlite", "numba",
+            *(["wandb"] if use_wandb else []),
+        )
+        .apt_install("git")
+        .run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")
+        .add_local_dir(str(data_dir), remote_path="/root/data", copy=True)
Evidence
The Modal image build runs pip install git+https://.../pyreflect.git without pinning to a tag or
commit; the same pattern exists in the generated Modal script for hyperparam_search. This means the
code used remotely can change over time without any change in this repo.

experiments/train_modal.py[189-200]
hyperparam_search/hyperparam_search.py[370-385]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Modal builds install `pyreflect` from GitHub without pinning a revision. This makes remote runs non-reproducible and can break unexpectedly if upstream changes.
## Issue Context
Both Modal paths include a `pip install git+https://github.com/williamQyq/pyreflect.git` command.
## Fix
- Pin to a specific commit SHA or a released tag, e.g.:
- `pip install git+https://github.com/williamQyq/pyreflect.git@<sha>`
- Alternatively, if the intent is to run *this repo’s* code remotely, mount/add the repo and `pip install .` from that mounted source.
## Fix Focus Areas
- experiments/train_modal.py[189-200]
- hyperparam_search/hyperparam_search.py[370-385]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread experiments/train_modal.py
Comment thread experiments/train_modal.py
- normalization_stats was read from nr_predict_sld.file instead of
  nr_predict_sld.models, causing a KeyError on standard settings.yml
- unconditional `import wandb` at top of _run_training would crash when
  wandb is not installed and --use-wandb is false
@KrishnaRaheja
Copy link
Copy Markdown
Contributor Author

if this shouldnt be on the main branch, please create another one and I can merge it there

@williamQyq
Copy link
Copy Markdown
Owner

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 3, 2026

Claude finished @williamQyq's task in 1m 42s —— View job


PR Review: Add hyperparameter search toolkit and Modal GPU training script

Tasks:

  • Gather context and understand the PR
  • Read key source files
  • Analyze bugs flagged by Qodo
  • Review code quality and design
  • Post comprehensive review

Overall this is a useful addition — Optuna-based search and cloud GPU training fill a real gap. The design is broadly sound, but there are confirmed bugs that will cause failures on first use, plus some architectural concerns worth addressing before merge.


Confirmed Bugs

1. Wrong normalization_stats lookup key — will crash on any standard experiment

experiments/train_modal.py line ~77 reads normalization_stats from nr_predict_sld.file, but pyreflect init places it under nr_predict_sld.models (verified in src/pyreflect/cli/init_content.py:48 and src/pyreflect/models/config.py:133):

# train_modal.py — WRONG key
"norm_stats": str((experiment_dir / files["normalization_stats"]).resolve())
#                                   ^^^^^ KeyError every time

Fix: change to models["normalization_stats"]. Fix this →

2. wandb imported unconditionally — crashes without --use-wandb

In experiments/train_modal.py line ~112, import wandb as _wandb runs regardless of the use_wandb flag. The Modal image only installs wandb when use_wandb=True, so a plain run without --use-wandb raises ImportError before any training starts. Fix this →

3. Non-deterministic train/val splits across Optuna trials

hyperparam_search.py re-splits the dataset inside every trial via DataProcessor.split_arrays, which calls sklearn.train_test_split without a random_state (confirmed in src/pyreflect/input/data_processor.py:19). Each trial evaluates on a different validation set, so the Optuna objective values are not comparable — the search picks "best" hyperparameters partly based on which trial got a lucky split. Fix this →

4. Modal Optuna DB is lost after each run — resume is broken

The README promises "interrupted runs resume automatically" via SQLite persistence. But when using --use-modal, the DB is written to /root/data/optuna_study.db inside the remote container and is never returned. run_modal only writes optuna_best_params.txt back locally. The resume feature only works for local runs. Fix this →


Additional Issues

5. Unpinned pip install git+... in Modal images

Both experiments/train_modal.py:195 and hyperparam_search/hyperparam_search.py:~382 install pyreflect via:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")

This pulls HEAD of the default branch at image-build time. A future commit to main could silently change behaviour in older experiments. Pin to a tag or commit SHA:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git@v0.1.0")

6. Dynamically-generated Modal script is fragile

run_modal in hyperparam_search.py builds a Python script as a string using inspect.getsource(_run_study) and subprocess.run(["modal", "run", tmp.name]). This is brittle:

  • Any closure variable or import inside _run_study that isn't self-contained will fail at runtime in the container (no local imports available).
  • The generated script can silently embed the wrong config if json.dumps escapes anything unexpectedly.
  • A simpler, more robust approach: use Modal's Python API (app.run()) directly without subprocess and script generation.

7. get_xy_norm_stats has a @staticmethod + self signature bug (existing code, but referenced by PR)

src/pyreflect/input/data_processor.py:93 declares get_xy_norm_stats(self, data_curves) as a @staticmethod, making self just a positional arg rather than the instance. This is an existing bug in the library that the new code depends on — worth noting.

8. No requirements.txt or pyproject.toml extras for new dependencies

The PR documents prerequisites in README files but doesn't add them to any machine-readable dependency file. This makes reproducible environment setup harder. Consider adding an extras_require or a requirements-dev.txt:

optuna
pyyaml
wandb; extra == "wandb"
modal; extra == "modal"

9. Branch concern (from PR author's comment)

The PR targets main → main (fork's main into this repo's main). @KrishnaRaheja asked if a separate branch should be used. I'd recommend the PR author push to a feature branch (e.g. feature/hyperparam-search) to keep main clean and allow iterative review without blocking other work.


Summary

Severity Issue
🔴 Blocker Wrong normalization_stats key — KeyError on any standard settings.yml
🔴 Blocker wandb unconditionally imported — ImportError without --use-wandb
🟡 High Non-deterministic splits — Optuna results are unreliable
🟡 High Modal Optuna DB not persisted — resume doesn't work as documented
🟠 Medium Unpinned git install — non-reproducible remote runs
🟠 Medium Dynamically-generated Modal script — fragile, hard to debug
🔵 Low Missing machine-readable dependency declaration

The two blockers (#1 and #2) should be fixed before merge. The non-deterministic split (#3) significantly impacts the correctness of the search and should also be addressed.

@KrishnaRaheja
Copy link
Copy Markdown
Contributor Author

KrishnaRaheja commented Apr 3, 2026

Issue fixes:

  1. Wrong normalization_stats lookup key — will crash on any standard experiment
    This issue is fixed.

  2. wandb imported unconditionally — crashes without --use-wandb
    import wandb only happens inside if use_wandb: blocks, not at the top level.

  3. Non-deterministic train/val splits across Optuna trials
    The split logic is inside pyreflect's DataProcessor.split_arrays, I am using it as is by calling the Pyreflect API

@williamQyq
Copy link
Copy Markdown
Owner

@claude visualize the the added feature in simple diagrams, explain in details how is this feature implemented and what does it do. Check again for any potential bugs.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 3, 2026

Claude finished @williamQyq's task in 3m 7s —— View job


Claude finished @williamQyq's task


Tasks:

  • Gather context and understand the request
  • Read key source files (hyperparam_search.py, train_modal.py, data_processor.py, config.py, init_content.py)
  • Create architecture diagrams
  • Write detailed feature explanations
  • Re-check for potential bugs
  • Post final review

Feature 1: Hyperparameter Search (hyperparam_search/)

What it does

Systematically explores the NR→SLD CNN architecture space using Optuna, a Bayesian optimization framework. Instead of manually trying hyperparameter combinations, Optuna adaptively samples from the search space and narrows toward configurations that minimize validation loss.

Architecture Diagram

flowchart TD
    A["CLI: hyperparam_search.py\n--config config.yml\n[--use-modal] [--use-wandb]"] --> B{Mode?}

    B -->|Local| C["_run_study(config, use_wandb)"]
    B -->|Modal| D["run_modal(config, use_wandb, data_dir)"]

    D --> E["Generate script.py\n(inspect.getsource + string template)"]
    E --> F["subprocess: modal run script.py\n(T4 GPU, remote container)"]
    F --> C

    C --> G["Create Optuna Study\n(SQLite: optuna_study.db)"]
    G --> H["For each trial (1..n_trials)"]

    H --> I["Optuna suggests:\n- layers: choice [6,8,12]\n- dropout: float [0.05, 0.3]\n- batch_size: choice [32,64,128]"]
    I --> J["DataProcessor.split_arrays(X, y)\n(train 80% / val 10% / test 10%)"]
    J --> K["Build CNN + Train\nfor epochs_per_trial"]
    K --> L["Return val_loss"]
    L --> M{More trials?}
    M -->|yes| H
    M -->|no| N["Save optuna_best_params.txt"]
    N --> O{use_wandb?}
    O -->|yes| P["Log to Weights & Biases"]
    O -->|no| Q["Done"]
Loading

How It's Implemented (step-by-step)

  1. Config loadingconfig.yml defines data_dir, n_trials, epochs_per_trial, and search_space (layers list, dropout range, batch_size list).

  2. Data loading — NR and SLD .npy files are loaded from data_dir via NRSLDDataProcessor. Normalization stats are loaded from normalization_stat.npy.

  3. Optuna study — Created with optuna.create_study(direction="minimize", storage="sqlite:///optuna_study.db"). SQLite persistence means a local run can be resumed if interrupted.

  4. Objective function — Each trial:

    • Optuna samples layers, dropout, batch_size using its TPE sampler
    • Data is split into train/val/test via DataProcessor.split_arrays()
    • A CNN model is built with the sampled architecture and trained
    • Validation loss is returned as the objective to minimize
  5. Modal execution path — When --use-modal is passed, inspect.getsource(_run_study) extracts the function source code as a string, a complete Modal Python script is assembled with the config embedded as json.dumps, and it is written to a temp file and executed via subprocess.run(["modal", "run", tmp.py]). The remote container installs all dependencies and runs _run_study on a T4 GPU.


Feature 2: Modal GPU Training (experiments/train_modal.py)

What it does

Trains the NR→SLD CNN on a cloud GPU using Modal, reading directly from the standard pyreflect experiment directory structure. No separate config needed — it parses the experiment's settings.yml.

Architecture Diagram

flowchart TD
    A["CLI: train_modal.py\n--experiment-dir experiments/my_exp\n[--use-wandb]"] --> B["Load settings.yml"]
    B --> C["_extract_training_params()\nRead nr_predict_sld section"]

    C --> D["files: nr_train, sld_train\nmodels: normalization_stats,\nlayers, dropout, batch_size, epochs"]

    D --> E["Build Modal Image\n(debian_slim + pip_install)"]
    E --> F["add_local_dir(data/) → /root/data\n(one-way copy into container)"]

    F --> G["Remote: _run_training() on T4 GPU"]

    G --> H["Import torch, pyreflect\n(conditionally: wandb)"]
    H --> I["NRSLDDataProcessor\nload + normalize NR & SLD"]
    I --> J["split_arrays → DataLoaders"]
    J --> K["Build CNN(layers, dropout)\nTrain for epochs"]
    K --> L{use_wandb?}
    L -->|yes| M["Log metrics to W&B"]
    L -->|no| N["Local training loop"]
    M --> O["Save trained_model.pth\n→ experiment dir"]
    N --> O
Loading

How It's Implemented (step-by-step)

  1. Parse experiment_extract_training_params(experiment_dir, config) reads settings.yml and extracts:

    • File paths from nr_predict_sld.file: nr_train, sld_train
    • Model params from nr_predict_sld.models: normalization_stats, layers, dropout, batch_size, epochs
  2. Modal image — A modal.Image.debian_slim is constructed with pip_install for all dependencies (torch, numpy, pyreflect, etc.). The local data/ directory is uploaded into the container at /root/data via add_local_dir(..., copy=True).

  3. Remote training_run_training(params, use_wandb) runs on the T4 GPU:

    • Loads and normalizes NR/SLD data using NRSLDDataProcessor
    • Splits into train/val/test sets
    • Builds and trains the CNN
    • Saves trained_model.pth
  4. Result retrieval — The trained model file is returned from Modal back to the local experiment directory.


Data Flow (end-to-end)

Local filesystem                Modal/Cloud
─────────────────               ─────────────
experiments/
  my_exp/
    settings.yml  ──────────── read by train_modal.py
    data/
      nr_train.npy ────copy──► /root/data/nr_train.npy
      sld_train.npy ───copy──► /root/data/sld_train.npy
      norm_stats.npy ──copy──► /root/data/norm_stats.npy
                               ┌──────────────┐
                               │ T4 GPU Train │
                               └──────┬───────┘
    trained_model.pth ◄─────────────-┘

Bug Re-check

Based on the source files I read (including data_processor.py, init_content.py, config.py), here's an updated assessment:

Fixed (per @KrishnaRaheja)

# Issue Status
1 Wrong normalization_stats key (files[...]models[...]) ✅ Fixed
2 wandb imported unconditionally ✅ Fixed

Still Open

Bug A — Non-deterministic splits (Correctness, High)

DataProcessor.split_arrays() at src/pyreflect/input/data_processor.py:19 calls train_test_split without a random_state:

# data_processor.py:19 — no random_state!
crv_tr, crv_hld, chi_tr, chi_hld = train_test_split(X, y, train_size=size_split)

Since this is called inside every Optuna trial, each trial evaluates on a different validation set. Trial A may see a harder subset than Trial B, making the Optuna objective values incomparable. @KrishnaRaheja says "I'm using the API as-is", but this fundamentally undermines the search results — the "best" params may simply be the trial that got a lucky val split.

Minimal fix without modifying pyreflect: Move the split outside objective() and reuse train_loader/valid_loader across all trials (only recreate DataLoader per trial to change batch_size). Fix this →

Bug B — Modal Optuna DB lost after each run (Reliability, High)

The README says "The Optuna study is persisted to SQLite so interrupted runs resume automatically." This is only true for local runs. For --use-modal runs:

  • _run_study creates optuna_study.db at /root/data/optuna_study.db inside the container
  • run_modal only copies optuna_best_params.txt back locally
  • The DB is destroyed when the container exits → resume is broken for Modal runs

Fix this →

Bug C — Unpinned pip install git+... in both Modal images (Reproducibility, Medium)

Both experiments/train_modal.py:~195 and the generated Modal script in hyperparam_search.py:~382 use:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")

This installs the latest HEAD of main at image-build time. A future commit could silently break old experiments. Pin to a commit SHA or tag:

.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git@<sha_or_tag>")

Bug D — Fragile dynamic script generation (Reliability, Medium)

run_modal() in hyperparam_search.py builds a Python file by concatenating strings including inspect.getsource(_run_study). This breaks if:

  • _run_study closes over any variable not serializable to json.dumps
  • The source extraction picks up decorators or formatting that changes the indentation

A more robust alternative: use Modal's Python API (@app.function + app.run()) directly from the calling script, instead of code generation + subprocess.

Pre-existing Bug E — @staticmethod with self parameter (Existing, Low)

src/pyreflect/input/data_processor.py:92-93 (pre-PR code):

@staticmethod
def get_xy_norm_stats(self, data_curves):  # ← self is just a positional arg here!

As a @staticmethod, self is not the instance — it's just the first positional argument. Callers would need to pass an extra dummy value. This method isn't directly used by the new PR code, but is a latent bug in the library.


Summary

Severity Issue Status
✅ Fixed Wrong normalization_stats key KrishnaRaheja
✅ Fixed wandb unconditional import KrishnaRaheja
🟡 Open Non-deterministic trial splits Needs fix
🟡 Open Modal Optuna DB not persisted (resume broken) Needs fix
🟠 Open Unpinned git+ install in Modal images Low priority
🟠 Open Fragile inspect.getsource + subprocess approach Low priority
🔵 Pre-existing @staticmethod get_xy_norm_stats(self, ...) in data_processor Pre-PR

@williamQyq
Copy link
Copy Markdown
Owner

@claude fix the pre-existing issues then merge the pr.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 6, 2026

Claude finished @williamQyq's task in 1m 13s —— View job


Fixing Pre-existing Issues

Tasks:

  • Gather context and understand the request
  • Read src/pyreflect/input/data_processor.py to verify the @staticmethod bug
  • Fix the @staticmethod get_xy_norm_stats(self, ...) issue
  • Commit and push the fix
  • Note limitation regarding PR merge

Fix Applied

src/pyreflect/input/data_processor.py:93 — Removed the erroneous self parameter from DataProcessor.get_xy_norm_stats. As a @staticmethod, self was just treated as a regular positional argument, meaning any caller would have needed to pass a dummy first value to use this method correctly. The fix aligns the signature with its decorator:

- def get_xy_norm_stats(self, data_curves):
+ def get_xy_norm_stats(data_curves):

Commit 4ab6ab8 pushed to main.


Regarding "merge the PR"

I'm unable to merge pull requests due to GitHub App permission constraints. You can merge it directly via the GitHub UI. More info in the FAQ.

Note: The following issues from the prior review are still open if you'd like to address them before merging:

  • 🟡 Non-deterministic Optuna trial splits (each trial re-splits data without a fixed seed)
  • 🟡 Modal Optuna DB not persisted — resume doesn't work as documented for --use-modal
  • 🟠 Unpinned pip install git+... in Modal image builds (non-reproducible)

@williamQyq williamQyq merged commit eaea0f0 into williamQyq:main Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants