Add hyperparameter search toolkit and Modal GPU training script #2
- Implemented training script train_best_model_modal.py
- Configured model with 6 layers, 0.11 dropout, batch size of 64, and 20 epochs
- Integrated W&B for experiment tracking
- Added data loading and preprocessing steps
- Included model saving functionality
… 6 layers, 0.107 dropout, and 64 batch size.
- Replace optuna_search.py + optuna_modal.py with unified hyperparam_search.py (--use-modal, --use-wandb flags)
- Add hyperparam_search/config.yml for search configuration
- Replace hardcoded train_best_model_modal.py with train_modal.py that reads settings.yml
- Rename experiment folders: best_model_6L_0.11D -> best_model_30k_6L_0.107D, best_model_125k -> best_model_125k_6L_0.087D
- Rename experimental notebooks with descriptive model params in filename
- Update notebook paths to match renamed experiment folders
- Remove master_training_data/
- Add README.md to hyperparam_search/ and experiments/
- Unignore hyperparam_search/config.yml in .gitignore
The app.function() dynamic decoration pattern was silently falling back to local execution (Tasks: 0 in Modal). Fix by generating a self-contained temp script with the Modal function at module level and invoking via subprocess modal run, matching the pattern Modal requires.
…format
- Add traceback.print_exc() to failed trial handler for better debugging
- Fix W&B run name dropout format from :.3f to :.2f to match original scripts
Review Summary by Qodo
Add Optuna hyperparameter search and Modal GPU training toolkit
Walkthroughs
Description
• Add unified Optuna hyperparameter search with optional Modal GPU and W&B logging
• Implement Modal GPU training script that reads experiment settings and trains models
• Add comprehensive documentation and configuration files for both tools
• Include VS Code settings and reorganize experiment notebooks to runs/ directory
Diagram
flowchart LR
A["Config YAML"] -->|hyperparam_search.py| B["Optuna Study"]
B -->|Local or Modal| C["Train CNN Trials"]
C -->|Minimize Val Loss| D["Best Params"]
D -->|optuna_best_params.txt| E["Results"]
F["settings.yml"] -->|train_modal.py| G["Modal GPU"]
G -->|Train Model| H["trained_model.pth"]
B -->|Optional| I["W&B Logging"]
G -->|Optional| I
File Changes
1. hyperparam_search/hyperparam_search.py
|
Code Review by Qodo
1.
|
- normalization_stats was read from nr_predict_sld.file instead of nr_predict_sld.models, causing a KeyError on standard settings.yml
- unconditional `import wandb` at top of _run_training would crash when wandb is not installed and --use-wandb is false
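The second fix can be pictured as deferring the import until logging is actually requested. The following is a minimal sketch of that idea, not the code from this PR; `load_wandb` is a hypothetical helper name introduced only for illustration.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_wandb(use_wandb: bool) -> Optional[ModuleType]:
    """Import wandb only when logging was requested.

    `load_wandb` is a hypothetical helper, not part of the PR.
    """
    if not use_wandb:
        # Skip the import entirely when logging is off, so the script
        # still runs on machines without wandb installed.
        return None
    try:
        return importlib.import_module("wandb")
    except ImportError as exc:
        raise RuntimeError(
            "--use-wandb was set but wandb is not installed; run `pip install wandb`"
        ) from exc


wandb = load_wandb(use_wandb=False)
print(wandb)  # None
```

With this shape, a missing wandb package only matters when --use-wandb is passed, and the error message then says how to fix it.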
|
If this shouldn't be on the main branch, please create another one and I can merge it there. |
|
@claude review |
|
Claude finished @williamQyq's task in 1m 42s
PR Review: Add hyperparameter search toolkit and Modal GPU training script
Overall this is a useful addition: Optuna-based search and cloud GPU training fill a real gap. The design is broadly sound, but there are confirmed bugs that will cause failures on first use, plus some architectural concerns worth addressing before merge.
Confirmed Bugs
1. Wrong
|
| Severity | Issue |
|---|---|
| 🔴 Blocker | Wrong normalization_stats key — KeyError on any standard settings.yml |
| 🔴 Blocker | wandb unconditionally imported — ImportError without --use-wandb |
| 🟡 High | Non-deterministic splits — Optuna results are unreliable |
| 🟡 High | Modal Optuna DB not persisted — resume doesn't work as documented |
| 🟠 Medium | Unpinned git install — non-reproducible remote runs |
| 🟠 Medium | Dynamically-generated Modal script — fragile, hard to debug |
| 🔵 Low | Missing machine-readable dependency declaration |
The two blockers (#1 and #2) should be fixed before merge. The non-deterministic split (#3) significantly impacts the correctness of the search and should also be addressed.
|
Issue fixes:
|
|
@claude visualize the added feature in simple diagrams, and explain in detail how the feature is implemented and what it does. Check again for any potential bugs. |
|
Claude finished @williamQyq's task in 3m 7s
Feature 1: Hyperparameter Search (
|
| # | Issue | Status |
|---|---|---|
| 1 | Wrong normalization_stats key (files[...] → models[...]) | ✅ Fixed |
| 2 | wandb imported unconditionally | ✅ Fixed |
Still Open
Bug A — Non-deterministic splits (Correctness, High)
DataProcessor.split_arrays() at src/pyreflect/input/data_processor.py:19 calls train_test_split without a random_state:
# data_processor.py:19 (no random_state!)
crv_tr, crv_hld, chi_tr, chi_hld = train_test_split(X, y, train_size=size_split)
Since this is called inside every Optuna trial, each trial evaluates on a different validation set. Trial A may see a harder subset than Trial B, making the Optuna objective values incomparable. @KrishnaRaheja says "I'm using the API as-is", but this fundamentally undermines the search results: the "best" params may simply be the trial that got a lucky val split.
Minimal fix without modifying pyreflect: move the split outside objective() and reuse the same train/validation split across all trials, recreating only the DataLoader per trial to change batch_size.
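The idea of splitting once and varying only the batching can be sketched as below. This is a standard-library illustration under stated assumptions, not the pyreflect or PR code: `fixed_split`, `batches`, and `objective_inputs` are hypothetical names, and index lists stand in for the real arrays and DataLoaders.

```python
import random
from typing import List, Sequence, Tuple


def fixed_split(n_samples: int, train_size: float, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_idx, valid_idx); identical on every call for a given seed."""
    indices = list(range(n_samples))
    # A private RNG keeps the split independent of global random state.
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_size)
    return indices[:cut], indices[cut:]


def batches(indices: Sequence[int], batch_size: int) -> List[List[int]]:
    """Chunk a fixed index list; this is the only step that varies per trial."""
    return [list(indices[i:i + batch_size]) for i in range(0, len(indices), batch_size)]


# Split once, before any trial runs.
TRAIN_IDX, VALID_IDX = fixed_split(n_samples=1000, train_size=0.8)


def objective_inputs(batch_size: int) -> Tuple[List[List[int]], List[List[int]]]:
    """What each trial would see: the same samples, with trial-specific batching."""
    return batches(TRAIN_IDX, batch_size), batches(VALID_IDX, batch_size)
```

Every trial then scores against the same validation samples, so objective values are comparable. If changing the library call were acceptable, passing an integer `random_state` to scikit-learn's `train_test_split` would also make the split reproducible.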
Bug B — Modal Optuna DB lost after each run (Reliability, High)
The README says "The Optuna study is persisted to SQLite so interrupted runs resume automatically." This is only true for local runs. For --use-modal runs:
- _run_study creates optuna_study.db at /root/data/optuna_study.db inside the container
- run_modal only copies optuna_best_params.txt back locally
- The DB is destroyed when the container exits → resume is broken for Modal runs
Bug C — Unpinned pip install git+... in both Modal images (Reproducibility, Medium)
Both experiments/train_modal.py:~195 and the generated Modal script in hyperparam_search.py:~382 use:
.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git")
This installs the latest HEAD of main at image-build time. A future commit could silently break old experiments. Pin to a commit SHA or tag:
.run_commands("pip install git+https://github.com/williamQyq/pyreflect.git@<sha_or_tag>")
Bug D — Fragile dynamic script generation (Reliability, Medium)
run_modal() in hyperparam_search.py builds a Python file by concatenating strings including inspect.getsource(_run_study). This breaks if:
- _run_study closes over any variable not serializable to json.dumps
- The source extraction picks up decorators or formatting that changes the indentation
A more robust alternative: use Modal's Python API (@app.function + app.run()) directly from the calling script, instead of code generation + subprocess.
Pre-existing Bug E — @staticmethod with self parameter (Existing, Low)
src/pyreflect/input/data_processor.py:92-93 (pre-PR code):
@staticmethod
def get_xy_norm_stats(self, data_curves):  # ← self is just a positional arg here!
As a @staticmethod, self is not the instance; it's just the first positional argument. Callers would need to pass an extra dummy value. This method isn't directly used by the new PR code, but is a latent bug in the library.
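The failure mode can be reproduced in isolation. In this sketch the classes `Broken` and `Fixed` and the method bodies are stand-ins (the real method body is not shown in this thread); only the decorator and signature mirror the snippet above.

```python
class Broken:
    @staticmethod
    def get_xy_norm_stats(self, data_curves):
        # `self` is an ordinary first parameter here, not the instance.
        return len(data_curves)  # placeholder body for illustration


class Fixed:
    @staticmethod
    def get_xy_norm_stats(data_curves):
        return len(data_curves)  # placeholder body for illustration


curves = [[0.1, 0.2], [0.3, 0.4]]

try:
    # `curves` binds to `self`, so `data_curves` is missing.
    Broken().get_xy_norm_stats(curves)
except TypeError as exc:
    print("Broken:", type(exc).__name__)

print("Fixed:", Fixed().get_xy_norm_stats(curves))
```

The broken variant only works if the caller passes a dummy first argument, for example `Broken.get_xy_norm_stats(None, curves)`.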
Summary
| Severity | Issue | Status |
|---|---|---|
| ✅ Fixed | Wrong normalization_stats key | KrishnaRaheja |
| ✅ Fixed | wandb unconditional import | KrishnaRaheja |
| 🟡 Open | Non-deterministic trial splits | Needs fix |
| 🟡 Open | Modal Optuna DB not persisted (resume broken) | Needs fix |
| 🟠 Open | Unpinned git+ install in Modal images | Low priority |
| 🟠 Open | Fragile inspect.getsource + subprocess approach | Low priority |
| 🔵 Pre-existing | @staticmethod get_xy_norm_stats(self, ...) in data_processor | Pre-PR |
|
@claude fix the pre-existing issues then merge the pr. |
|
Claude finished @williamQyq's task in 1m 13s
Fixing Pre-existing Issues
Fix Applied
- def get_xy_norm_stats(self, data_curves):
+ def get_xy_norm_stats(data_curves):
Regarding "merge the PR"
I'm unable to merge pull requests due to GitHub App permission constraints. You can merge it directly via the GitHub UI. More info in the FAQ.
|
This PR adds tooling for systematically searching CNN hyperparameters and training on cloud GPUs, built on top of the existing pyreflect training pipeline.
Nothing in src/ is modified.
What's added
hyperparam_search/
An Optuna-based hyperparameter search for the NR→SLD CNN (layers, dropout, batch size). Modal GPU and Weights & Biases are both fully optional — runs locally
with no external services by default.
# Local, no logging
python hyperparam_search.py --config config.yml
# With W&B logging
python hyperparam_search.py --config config.yml --use-wandb
# On Modal GPU
python hyperparam_search.py --config config.yml --use-modal
# Modal + W&B
python hyperparam_search.py --config config.yml --use-modal --use-wandb
Configuration lives in config.yml — point data_dir at any dataset folder following the pyreflect init structure and set your search space:
data_dir: ../my-dataset
n_trials: 20
epochs_per_trial: 10
search_space:
  layers: [6, 8, 12]
  dropout_min: 0.05
  dropout_max: 0.3
  batch_size: [32, 64, 128]
The Optuna study is persisted to SQLite so interrupted runs resume automatically.
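As a rough illustration of how a search space like the one above could feed an Optuna-style trial, consider the sketch below. It assumes the four keys sit under search_space; `suggest_params` and `MidpointTrial` are hypothetical names, and the stub trial exists only so the example runs without Optuna installed. The actual hyperparam_search.py may be organized differently.

```python
from typing import Any, Dict, Sequence

# Mirrors the search_space block of config.yml shown above.
SEARCH_SPACE: Dict[str, Any] = {
    "layers": [6, 8, 12],
    "dropout_min": 0.05,
    "dropout_max": 0.3,
    "batch_size": [32, 64, 128],
}


def suggest_params(trial: Any, space: Dict[str, Any]) -> Dict[str, Any]:
    """Draw one hyperparameter set using Optuna's suggest_* trial methods."""
    return {
        "layers": trial.suggest_categorical("layers", space["layers"]),
        "dropout": trial.suggest_float("dropout", space["dropout_min"], space["dropout_max"]),
        "batch_size": trial.suggest_categorical("batch_size", space["batch_size"]),
    }


class MidpointTrial:
    """Stand-in for an Optuna trial: first choice, midpoint of float ranges."""

    def suggest_categorical(self, name: str, choices: Sequence[Any]) -> Any:
        return choices[0]

    def suggest_float(self, name: str, low: float, high: float) -> float:
        return (low + high) / 2


print(suggest_params(MidpointTrial(), SEARCH_SPACE))
```

In a real run, the trial would come from an Optuna study. Resumable SQLite persistence is typically obtained with `optuna.create_study(direction="minimize", storage="sqlite:///optuna_study.db", study_name=..., load_if_exists=True)`; whether the script uses exactly these arguments is an assumption.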
experiments/train_modal.py
A companion script for running training on a Modal GPU instead of locally. Reads directly from the experiment's settings.yml (generated by pyreflect init) —
no separate config needed.
python experiments/train_modal.py --experiment-dir experiments/my_experiment
python experiments/train_modal.py --experiment-dir experiments/my_experiment --use-wandb
Uploads the experiment's data/ directory to Modal, trains on a T4 GPU, and saves trained_model.pth back to the experiment directory.
Documentation
Prerequisites for new scripts
pip install optuna pyyaml # hyperparam search (always)
pip install wandb # optional, for --use-wandb
pip install modal && modal setup # optional, for --use-modal
For --use-modal --use-wandb: create a Modal secret named wandb-secret containing WANDB_API_KEY.