Move the current benchmark configs to yaml files #566
R-Palazzo merged 21 commits into feature/benchmark_launcher
Conversation
```yaml
compute_privacy_score: false

compute:
  service: 'gcp'
```
In this PR, we only indicate which compute service to use. In a future issue, we will also move the compute config defined in SDGym/sdgym/_benchmark/config_utils.py (line 19 in 3c2a248) inside this yaml file.
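For context, one possible shape once that compute config moves into the yaml file (the extra field names here are illustrative guesses, not the final schema):

```yaml
compute:
  service: 'gcp'
  # Hypothetical fields that could migrate from config_utils.py:
  machine_type: 'n1-standard-4'
  zone: 'us-central1-a'
  disk_size_gb: 100
```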
Codecov Report
❌ Patch coverage — additional details and impacted files:

```
@@               Coverage Diff                @@
##   feature/benchmark_launcher   #566    +/-   ##
==================================================
+ Coverage    82.41%    83.30%    +0.88%
==================================================
  Files           33        38        +5
  Lines         2923      3198      +275
==================================================
+ Hits          2409      2664      +255
- Misses         514       534       +20
```

Flags with carried forward coverage won't be shown.
pvk-developer left a comment
I think this is looking very good. I would suggest going over _validation.py and trying to modularize it a bit if you have the time, since the long functions in there are hard to navigate.
Force-pushed from abe9001 to f670ba5
```yaml
@@ -0,0 +1,21 @@
method_params:
  timeout: 345600
output_destination: 's3://sdgym-benchmark/Debug/Benchmark_Launcher/'
```
TODO: Update before merging
```shell
)
"
python -m pip install "sdgym[all]"
python -m pip install "sdgym[all] @ git+https://github.com/sdv-dev/SDGym.git@issue-532-define-yaml-files"
```
TODO: Revert before merging
amontanez24 left a comment
I think we should restructure the credentials a bit. Let's define a format for the dict like:

```yaml
gcp:
  GCP_SERVICE_ACCOUNT_JSON: ...
  GCP_SERVICE_ACCOUNT_PATH: ...
  GCP_PROJECT_ID: ...
  GCP_ZONE: ...
sdv_enterprise:
  SDV_ENTERPRISE_USERNAME: ...
  SDV_ENTERPRISE_LICENSE_KEY: ...
aws:
  AWS_ACCESS_KEY_ID: ...
  AWS_SECRET_ACCESS_KEY: ...
```

We then load this dict and check for expected keys. If a key isn't present, we try to load it from the environment. So if someone is running with GCP, we would attempt to load all the expected GCP keys from the dict and then check the environment for any keys the dict doesn't have.

If you want to make the names less redundant, you can remove the prefixes ('SDV', 'AWS', 'GCP') and say that if you are using environment variables, you should store them as {service_name}_{key_name}.
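The dict-then-environment fallback described above could be sketched roughly like this (the function name and the `EXPECTED_KEYS` table are illustrative stand-ins, not SDGym's actual API):

```python
import os

# Illustrative subset of the expected keys per service.
EXPECTED_KEYS = {
    'gcp': ['GCP_SERVICE_ACCOUNT_JSON', 'GCP_PROJECT_ID', 'GCP_ZONE'],
    'aws': ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'],
}


def resolve_credentials(service_name, credentials=None):
    """Take each expected key from the provided dict first, then from env vars."""
    provided = (credentials or {}).get(service_name, {})
    resolved = {}
    for key in EXPECTED_KEYS[service_name]:
        if key in provided:
            resolved[key] = provided[key]
        elif key in os.environ:
            resolved[key] = os.environ[key]

    return resolved
```

With the de-prefixed naming suggested above, the environment lookup would instead build the variable name as `f'{service_name.upper()}_{key_name}'`.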
```python
def _get_credentials(credential_locations):
    """Get resolved credentials dict."""
    config = credential_locations or {}
    filepath = config.get('credential_filepath')
```
In the case where the filepath is provided, is the structure of the credentials dict the same as credential_locations?
```python
sig = inspect.signature(method_to_run)
required = {
    parameter.name
    for parameter in sig.parameters.values()
    if parameter.default is inspect.Parameter.empty
    and parameter.kind
    in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
}
required_from_yaml = required - _INJECTED_PARAMS
missing = required_from_yaml - set(method_params)
if missing:
    errors.append(
        f'method_params: missing required parameters for {method_to_run.__name__}:'
        f' {sorted(missing)}'
    )
```
Is this necessary? Can't we just have defaults for whatever they don't provide? The only one I can see as required is the output destination, and I don't think that one should be grouped with the other parameters.
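For reference, the signature-inspection check in the hunk above can be reproduced as a standalone sketch (the `_INJECTED_PARAMS` set and the toy benchmark method below are hypothetical stand-ins, not SDGym's real ones):

```python
import inspect

# Stand-in for the parameters the launcher injects itself (hypothetical set).
_INJECTED_PARAMS = {'output_destination'}


def find_missing_params(method_to_run, method_params):
    """Return required parameter names that method_params does not supply."""
    sig = inspect.signature(method_to_run)
    required = {
        parameter.name
        for parameter in sig.parameters.values()
        if parameter.default is inspect.Parameter.empty
        and parameter.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return sorted((required - _INJECTED_PARAMS) - set(method_params))


# Toy benchmark method: synthesizers/datasets are required,
# output_destination is injected, timeout has a default.
def run_benchmark(synthesizers, datasets, output_destination, timeout=3600):
    pass
```

Here `synthesizers` and `datasets` would be reported as missing from a yaml that only sets `timeout`, while `output_destination` is excluded because the launcher injects it.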
```
'method_params': dict of parameters to pass to the benchmark method (e.g. timeout),
'credentials': dict specifying how to resolve credentials (e.g. from env vars or a file),
'compute': dict specifying the compute configuration (e.g. service: 'gcp'),
'instance_jobs': list of dicts, each specifying a combination of synthesizers and datasets:
```
I think the output destination should be specified with the jobs
I was thinking it makes more sense to have all the results for a benchmark in the same location, but you're right, we could also set one output_destination per instance_job. Both options work for me; let me know which you prefer.
I think at some point Kalyan told me we may want to have different output destinations so we should put it with the jobs
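Moving the destination into each job, as suggested, could make the config look something like this (a hypothetical sketch, not the final schema):

```yaml
instance_jobs:
  - synthesizers: ['GaussianCopulaSynthesizer']
    datasets: ['adult']
    output_destination: 's3://sdgym-benchmark/run_a/'
  - synthesizers: ['CTGANSynthesizer']
    datasets: ['census']
    output_destination: 's3://sdgym-benchmark/run_b/'
```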
amontanez24 left a comment

I think we're almost there! I left a couple of comments, and I think we should move the output destination to the jobs, but besides that it looks good.
Resolve #545
CU-86b8h52bh
It's a pretty big PR, thanks in advance for your review :)
Currently the BenchmarkLauncher only works for GCP. To make it work on other compute services (e.g. AWS), we will have to define benchmark methods that have similar parameters to the GCP ones (with credentials/compute_config):
SDGym/sdgym/_benchmark/benchmark.py, line 350 in 3c2a248