Labels: feature request
Problem Description
As a user, I'd like a way to run the multi-table benchmark on an EC2 instance since I do not have the compute power to do so on my machine.
We want to add a multi-table version of benchmark_single_table_aws.
Expected behavior
Add a new function to the sdgym.benchmark module:
def benchmark_multi_table_aws(
    output_destination,
    aws_access_key_id=None,
    aws_secret_access_key=None,
    synthesizers=['HMASynthesizer', 'MultiTableUniformSynthesizer'],
    custom_synthesizers=None,
    sdv_datasets=['NBA', 'financial', 'Student_loan', 'Biodegradability', 'fake_hotels', 'restbase', 'airbnb-simplified'],
    additional_datasets_folder=None,
    limit_dataset_size=False,
    compute_quality_score=True,
    compute_diagnostic_score=True,
    timeout=None,
    show_progress=False
):
"""
Args:
output_destination (str):
An S3 bucket or filepath. The results output folder will be written here.
Should be structured as:
s3://{s3_bucket_name}/{path_to_file} or s3://{s3_bucket_name}.
aws_access_key_id (str): The AWS access key id. Optional
aws_secret_access_key (str): The AWS secret access key. Optional
synthesizers (list[string] | sdgym.synthesizer.BaselineSynthesizer): List of synthesizers to use.
sdv_datasets (list[str] or ``None``):Names of the SDV demo datasets to use for the benchmark.
additional_datasets_folder (str or ``None``): The path to an S3 bucket. Datasets found in this folder are
run in addition to the SDV datasets. If ``None``, no additional datasets are used.
limit_dataset_size (bool):
We should still limit the dataset to 10 columns per table (not including primary/foreign keys).
But as for the # of rows: The overall dataset needs to be subsampled with referential integrity.
We should use the [get_random_subset](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/cleaning-your-data#get_random_subset) function to perform the subsample.
For the main table, select the table with the larges # of rows; and for num rows, set it to 1000.
compute_quality_score (bool):
Whether or not to evaluate an overall quality score. In this case we should use the MultiTableQualityReport.
compute_diagnostic_score (bool):
Whether or not to evaluate an overall diagnostic score. In this case we should use the MultiTableDiagnosticReport.
timeout (int or ``None``):
The maximum number of seconds to wait for synthetic data creation. If ``None``, no
timeout is enforced.
"""
Additional context
- Once Add benchmark_multi_table function #486 is done, this should be relatively straightforward. You just have to adapt the startup script that we give the EC2 instance to use the benchmark_multi_table function (a rough sketch follows below).
- Consider that we may add support for other cloud services (like GCP). This means we should abstract things in a way that any cloud can be plugged in.
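A rough sketch of how the startup script adaptation could look is below. The template is not the actual script SDGym uses; the helper name, the script contents, and the bucket path are hypothetical. The idea is to keep the user-data rendering generic over which benchmark function gets called, so another cloud backend could reuse the same template.

import textwrap

def build_startup_script(benchmark_function, output_destination, **kwargs):
    """Render a user-data script that installs sdgym and runs the given benchmark.

    Hypothetical helper for illustration; the real startup script may differ.
    """
    call_kwargs = ''.join(f', {key}={value!r}' for key, value in kwargs.items())
    return textwrap.dedent(f'''\
        #!/bin/bash
        pip install sdgym
        python -c "
        from sdgym.benchmark import {benchmark_function}
        {benchmark_function}(output_destination={output_destination!r}{call_kwargs})
        "
    ''')

# The same template would serve the multi-table benchmark on the EC2 instance:
script = build_startup_script(
    'benchmark_multi_table',                # the function from issue #486
    's3://my-sdgym-results/multi_table/',   # hypothetical bucket
    timeout=3600,
)

Keeping the script rendering separate from the instance-launch call is one way to leave room for a GCP (or other provider) backend later.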