Update the benchmark to launch one instance per (dataset, synthesizer) pair#611
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #611 +/- ##
==========================================
- Coverage 85.91% 85.81% -0.11%
==========================================
Files 40 40
Lines 3749 3722 -27
==========================================
- Hits 3221 3194 -27
Misses 528 528
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fealho
left a comment
There was a problem hiding this comment.
This runs the UniformSynthesizer in every instance. So for example, for the 81 single table runs, it ran 81 times. Is this intended?
Also, do we want to change this? I don't think it affects the monthly run, because the config returns before reaching that line, but if you pass datasets/synthesizers, it still creates only 1 instance.
And could you double check the CLI path still works? I think maybe it depended on the yaml files you deleted. Something like this:
python sdgym/_benchmark_launcher/script.py --modality single_table --output-destination s3://bucket/test --synthesizers CTGANSynthesizer
amontanez24
left a comment
There was a problem hiding this comment.
This looks good. I just want to confirm something. Does the instance that runs the workflow to launch the jobs download all the datasets for all jobs, or only the one for the job it launches?
I understand we are doing one job = 1 synthesizer and dataset, but I want to make sure the instance that launches the job isn't downloading too much
ea2f9bc to
c378d46
Compare
95be93a to
9a91033
Compare
ba22fa7 to
fdb0a93
Compare
@amontanez24, yes, that was the case. The instance that runs the workflow was downloading all the datasets. I tried adding the https://github.com/sdv-dev/SDGym/actions/runs/26508750022/job/78067938428 To avoid this in this PR I moved the dataset loading during execution so it's only the gcp instances that will load the data. I'm running some test and check that it works end-to-end with the |
725c22c to
5555441
Compare
19c7c43 to
5cb84b0
Compare
Hi @fealho thanks for the review!
python sdgym/_benchmark_launcher/script.py --modality single_table --output-destination s3://sdgym-benchmark/Debug/Issue_605_CLI --synthesizers GaussianCopulaSynthesizer |
b97fa0d to
0786528
Compare
| 'instacart_marketbasket_ml', | ||
| 'MovieLens', | ||
| 'rossmann', | ||
| 'Telstra', | ||
| 'walmart', |
There was a problem hiding this comment.
I checked it with Kalyan this morning and we want to have all the private demo datasets included in the benchmark
cdf74e5 to
142d9b9
Compare
4907bb3 to
e4bfb1d
Compare
142d9b9 to
bd95074
Compare
sarahmish
left a comment
There was a problem hiding this comment.
I prefer if the list of synthesizers and datasets is moved to a separate file as you suggested to make the processes of adding / removing datasets and synthesizers easy. That could be another issue
bd95074 to
4e89898
Compare
Resolve #605
86b9zz525