
Add benchmark_multi_table_aws #487

@amontanez24

Description


Problem Description

As a user, I'd like a way to run the multi-table benchmark on an EC2 instance since I do not have the compute power to do so on my machine.
We want to add a multi-table version of benchmark_single_table_aws.

Expected behavior

Add a new function to the sdgym.benchmark module:

def benchmark_multi_table_aws(
    output_destination,
    aws_access_key_id=None,
    aws_secret_access_key=None,
    synthesizers=['HMASynthesizer', 'MultiTableUniformSynthesizer'],
    custom_synthesizers=None,
    sdv_datasets=[
        'NBA', 'financial', 'Student_loan', 'Biodegradability', 'fake_hotels',
        'restbase', 'airbnb-simplified'
    ],
    additional_datasets_folder=None,
    limit_dataset_size=False,
    compute_quality_score=True,
    compute_diagnostic_score=True,
    timeout=None,
    show_progress=False
):
    """
    Args:
        output_destination (str):
            An S3 bucket or filepath. The results output folder will be written here.
            Should be structured as:
            s3://{s3_bucket_name}/{path_to_file} or s3://{s3_bucket_name}.
        aws_access_key_id (str): The AWS access key id. Optional
        aws_secret_access_key (str): The AWS secret access key. Optional
        synthesizers (list[str] or ``None``):
            List of synthesizer names or ``sdgym.synthesizers.BaselineSynthesizer``
            subclasses to use.
        custom_synthesizers (list[class] or ``None``):
            A list of custom synthesizer classes to use in addition to ``synthesizers``.
            If ``None``, no custom synthesizers are used.
        sdv_datasets (list[str] or ``None``):
            Names of the SDV demo datasets to use for the benchmark.
        additional_datasets_folder (str or ``None``):
            The path to an S3 bucket. Datasets found in this folder are run in addition
            to the SDV datasets. If ``None``, no additional datasets are used.
        limit_dataset_size (bool):
            We should still limit the dataset to 10 columns per table (not including
            primary/foreign keys). As for the # of rows, the overall dataset needs to be
            subsampled with referential integrity preserved. We should use the
            [get_random_subset](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/cleaning-your-data#get_random_subset)
            function to perform the subsample: for the main table, select the table with
            the largest # of rows, and set num_rows to 1000 (see the sketch below this
            function).
        compute_quality_score (bool):
            Whether or not to evaluate an overall quality score. In this case we should use the MultiTableQualityReport.
        compute_diagnostic_score (bool):
            Whether or not to evaluate an overall diagnostic score. In this case we should use the MultiTableDiagnosticReport.
        timeout (int or ``None``):
            The maximum number of seconds to wait for synthetic data creation. If ``None``,
            no timeout is enforced.
        show_progress (bool):
            Whether to show a progress bar while the benchmark runs.
    """
    

Additional context

  • Once Add benchmark_multi_table function #486 is done, this should be relatively straightforward: you just have to adapt the startup script that we give the EC2 instance so it calls the benchmark_multi_table function.
  • Consider that we may add support for other cloud services (like GCP). This means we should abstract things in a way that any cloud can be plugged in; see the sketch below this list.
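
One way to keep the cloud layer pluggable is an abstract base class that hides the provider behind a small interface. The class and method names below are hypothetical, not existing sdgym API; the EC2 details use boto3 calls with placeholder values.

from abc import ABC, abstractmethod


class CloudRunner(ABC):
    """Provision a remote instance, run the benchmark on it, collect results."""

    @abstractmethod
    def launch(self, startup_script):
        """Start an instance that executes ``startup_script``; return its id."""


class EC2Runner(CloudRunner):
    """AWS implementation; a hypothetical GCPRunner could plug in the same way."""

    def launch(self, startup_script):
        import boto3

        ec2 = boto3.resource('ec2')
        instances = ec2.create_instances(
            ImageId='ami-xxxxxxxx',      # placeholder AMI id
            InstanceType='c5.2xlarge',   # illustrative instance type
            MinCount=1,
            MaxCount=1,
            UserData=startup_script,     # boots into benchmark_multi_table
        )
        return instances[0].id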
