Add benchmark_multi_table function #486

@amontanez24

Description

Problem Description

As a user, I'd like a reliable method and set of metrics to compare multi-table synthesizers.

We want to add a multi-table version of benchmark_single_table.

Expected behavior

Add a new function to the sdgym.benchmark module:

def benchmark_multi_table(
    synthesizers=['HMASynthesizer', 'MultiTableUniformSynthesizer'],
    custom_synthesizers=None,
    sdv_datasets=['NBA', 'financial', 'Student_loan', 'Biodegradability', 'fake_hotels', 'restbase', 'airbnb-simplified'],
    additional_datasets_folder=None,
    limit_dataset_size=False,
    compute_quality_score=True,
    compute_diagnostic_score=True,
    timeout=None,
    output_destination=None,
    show_progress=False
):
    """
    Args:
        synthesizers (list[str] or sdgym.synthesizer.BaselineSynthesizer): List of synthesizers to use.
        custom_synthesizers (list[class] or ``None``): Same as single table.
        sdv_datasets (list[str] or ``None``): Names of the SDV demo datasets to use for the benchmark.
        additional_datasets_folder (str or ``None``): The path to a local folder. Datasets found in this folder are
            run in addition to the SDV datasets. If ``None``, no additional datasets are used.
        limit_dataset_size (bool):
            We should still limit the dataset to 10 columns per table (not including primary/foreign keys).
            As for the number of rows: the overall dataset needs to be subsampled with referential integrity.
            We should use the [get_random_subset](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/cleaning-your-data#get_random_subset) function to perform the subsample.
            For the main table, select the table with the largest number of rows, and set the number of rows to 1000.
        compute_quality_score (bool):
            Whether or not to evaluate an overall quality score. In this case we should use the MultiTableQualityReport.
        compute_diagnostic_score (bool):
            Whether or not to evaluate an overall diagnostic score. In this case we should use the MultiTableDiagnosticReport.
        timeout (int or ``None``):
            The maximum number of seconds to wait for synthetic data creation. If ``None``, no
            timeout is enforced.
        output_destination (str or ``None``):
            The path to the output directory where results will be saved. If ``None``, no
            output is saved.
        show_progress (bool):
            Whether to use tqdm to keep track of the progress. Defaults to ``False``.

    """
    

Changes to storage and artifacts

We should store the artifacts in this new folder structure

output_destination/
|-- single_table/
    |-- SDGym_results_06_24_2025/
        |-- census_06_24_2025/
            |-- CTGANSynthesizer/
                |-- CTGANSynthesizer.pkl
                |-- CTGANSynthesizer_synthetic_data.csv
                |-- CTGANSynthesizer_benchmark_result.csv
            |-- TVAESynthesizer/
                |-- <artifacts>
        |-- expedia_hotel_logs_06_24_2025/
            |-- ...
        |-- meta.yaml
        |-- results.csv
    |-- SDGym_results_07_24_2025/
        |-- ...
|-- multi_table/
    |-- SDGym_results_06_24_2025/
        |-- berka_06_24_2025/
            |-- HMASynthesizer/
                |-- HMASynthesizer.pkl
                |-- HMASynthesizer_synthetic_data.zip
                |-- HMASynthesizer_benchmark_result.csv
            |-- HSASynthesizer/
                |-- <artifacts>
        |-- synthea_06_24_2025/
            |-- ...
        |-- meta.yaml
        |-- results.csv
    |-- SDGym_results_07_24_2025/
        |-- ...

The main difference is that everything will now be nested inside a per-modality folder.
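The nesting above can be composed mechanically from the modality, run date, dataset, and synthesizer. A small sketch, where the helper name and argument order are assumptions rather than SDGym API:

```python
from pathlib import Path

def artifact_dir(output_destination, modality, run_date, dataset, synthesizer):
    # Compose: <output>/<modality>/SDGym_results_<date>/<dataset>_<date>/<synthesizer>/
    return (
        Path(output_destination)
        / modality
        / f'SDGym_results_{run_date}'
        / f'{dataset}_{run_date}'
        / synthesizer
    )
```

For example, artifact_dir('out', 'multi_table', '06_24_2025', 'berka', 'HMASynthesizer') yields out/multi_table/SDGym_results_06_24_2025/berka_06_24_2025/HMASynthesizer, matching the tree above.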

Changes to results

  • The result columns should be the same as in the single table case.
  • We should still add the adjusted total time and quality score, except they should use the MultiTableUniformSynthesizer results instead.
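The adjustment semantics should mirror whatever the single-table benchmark already does, just with the MultiTableUniformSynthesizer run as the baseline. Purely as an illustration of baseline-relative adjustment (the function name and formula below are invented for this sketch, not the actual single-table formula):

```python
def adjusted_score(score, uniform_baseline_score):
    # Hypothetical normalization: 0.0 at the uniform baseline's score and
    # 1.0 at a perfect score. The real code should reuse the single-table
    # adjustment logic verbatim, swapping in the multi-table baseline.
    if uniform_baseline_score >= 1.0:
        return 0.0
    return (score - uniform_baseline_score) / (1.0 - uniform_baseline_score)
```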

Additional context

  • Don't worry about AWS yet. That will be in Add benchmark_multi_table_aws #487
  • A lot of code will need to be adapted to support the multi-table case.
  • Most functions in this file can be generalized to work for single or multi table synthesizers.
    • We should re-use as much as possible and not just copy it all over and replace single table with multi table everywhere.
  • Code like the following snippet needs to be restructured or abstracted out so that we can easily replace the modality.

    SDGym/sdgym/benchmark.py

    Lines 1073 to 1079 in 95f770e

    if synthesizer not in SDV_SINGLE_TABLE_SYNTHESIZERS:
        ext_lib = EXTERNAL_SYNTHESIZER_TO_LIBRARY.get(synthesizer)
        if ext_lib:
            library_version = version(ext_lib)
            metadata[f'{ext_lib}_version'] = library_version
        elif 'sdv' not in metadata.keys():
            metadata['sdv_version'] = version('sdv')
  • We may end up making classes in the future to benchmark and view results. Imagine someone initializing a benchmark class with options like the modality (single vs. multi table), whether it runs on the cloud, etc.
  • Note that there will be no support for metrics outside of the Quality and Diagnostic Reports.
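One lightweight way to avoid copy-pasting per-modality branches like the snippet above is a modality registry that the shared code paths consult. All names below are hypothetical, not existing SDGym symbols:

```python
# Hypothetical registry mapping each modality to its specific pieces; shared
# benchmark code would look up values here instead of branching inline.
MODALITY_CONFIG = {
    'single_table': {
        'baseline_synthesizer': 'UniformSynthesizer',
        'quality_report': 'QualityReport',
        'diagnostic_report': 'DiagnosticReport',
        'results_subfolder': 'single_table',
    },
    'multi_table': {
        'baseline_synthesizer': 'MultiTableUniformSynthesizer',
        'quality_report': 'MultiTableQualityReport',
        'diagnostic_report': 'MultiTableDiagnosticReport',
        'results_subfolder': 'multi_table',
    },
}

def get_modality_config(modality):
    try:
        return MODALITY_CONFIG[modality]
    except KeyError:
        raise ValueError(f'Unknown modality: {modality!r}') from None
```

A future benchmark class could simply hold one of these config dicts, which also keeps the door open for the AWS variant in #487.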
