Problem Description
The BenchmarkConfig supports defining and validating the configuration used to launch benchmark runs on remote compute services.
Currently, the compute section only requires a service key, with values such as gcp or aws. Based on that service, the instance configuration is resolved from Python-side defaults that are hardcoded in config_utils.py.
We want to update the compute section so that instance-level configuration is defined directly in the YAML file.
As part of this change, we should define provider-agnostic terminology for the YAML schema and continue supporting provider-specific aliases internally. This will make it clearer which values users are expected to provide while still allowing the implementation to map those fields to each compute service.
Expected behavior
- Move the current configuration defined in config_utils.py to benchmark_base.yaml:
  - Define the compute instance settings explicitly in YAML, for example:

        compute:
          service: gcp
          instance_type: n1-highmem-16
          boot_image: projects/deeplearning-platform-release/global/images/family/common-cu128-ubuntu-2204-nvidia-570
          root_disk_gb: 300
          gpu_type: nvidia-tesla-t4
          gpu_count: 1
          swap_gb: 64
          name_prefix: sdgym-run

- Validate the new compute schema as part of BenchmarkConfig.
- Update BenchmarkLauncher and benchmark methods to work with the new structure.
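
As a rough illustration of the validation step, the compute section could be checked along these lines. This is only a sketch: the function name validate_compute, the split between required and optional keys, and the error messages are all assumptions made for the example, not the existing BenchmarkConfig API.

```python
# Services named in this issue.
SUPPORTED_SERVICES = ('gcp', 'aws')

# Assumption: which keys are required vs. optional is not decided in this issue.
REQUIRED_KEYS = {
    'service': str,
    'instance_type': str,
    'boot_image': str,
    'root_disk_gb': int,
}
OPTIONAL_KEYS = {
    'gpu_type': str,
    'gpu_count': int,
    'swap_gb': int,
    'name_prefix': str,
}


def validate_compute(compute):
    """Return a list of error messages for a compute mapping (empty if valid)."""
    errors = []
    allowed = {**REQUIRED_KEYS, **OPTIONAL_KEYS}

    # Reject keys that are not part of the canonical schema.
    for key in sorted(set(compute) - set(allowed)):
        errors.append(f"Unknown compute key: '{key}'")

    # Report every missing required key, not just the first one.
    for key in sorted(set(REQUIRED_KEYS) - set(compute)):
        errors.append(f"Missing required compute key: '{key}'")

    # Check value types; bool is excluded because it is a subclass of int.
    for key, expected_type in allowed.items():
        if key not in compute:
            continue
        value = compute[key]
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"'{key}' must be of type {expected_type.__name__}")

    service = compute.get('service')
    if isinstance(service, str) and service not in SUPPORTED_SERVICES:
        errors.append(f"Unsupported compute service: '{service}'")

    return errors
```

Collecting all errors before reporting them lets a user fix a YAML file in one pass instead of one failure at a time.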
Additional context
The config should use canonical keys such as:
- instance_type
- boot_image
- root_disk_gb
- gpu_type
- gpu_count
These keys can then be translated internally to provider-specific names. For example:
instance_type → machine_type on GCP, instance_type on AWS
boot_image → source_image on GCP, ami on AWS
root_disk_gb → disk_size_gb on GCP, volume_size_gb on AWS
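
The translation described above could be sketched as a per-service alias table. The names PROVIDER_ALIASES and to_provider_config are hypothetical, and passing keys without an alias through unchanged is an assumption; only the three mappings listed above come from this issue.

```python
# Canonical key -> provider-specific key, taken from the mappings in this issue.
PROVIDER_ALIASES = {
    'gcp': {
        'instance_type': 'machine_type',
        'boot_image': 'source_image',
        'root_disk_gb': 'disk_size_gb',
    },
    'aws': {
        'instance_type': 'instance_type',
        'boot_image': 'ami',
        'root_disk_gb': 'volume_size_gb',
    },
}


def to_provider_config(compute):
    """Translate canonical compute keys into the names used by one service."""
    service = compute['service']
    if service not in PROVIDER_ALIASES:
        raise ValueError(f"Unsupported compute service: '{service}'")

    aliases = PROVIDER_ALIASES[service]
    # Assumption: keys without an alias (e.g. gpu_type) keep their canonical name.
    return {
        aliases.get(key, key): value
        for key, value in compute.items()
        if key != 'service'
    }
```

For example, a config with service gcp and instance_type n1-highmem-16 would yield a machine_type entry, while the same config with service aws would keep instance_type.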