
SDGym should be able to automatically discover SDV Enterprise synthesizers #481

@npatki

Problem Description

SDGym is designed to benchmark synthesizers. SDV synthesizers are natively supported, while external, third-party synthesizers can be integrated as custom synthesizers.

Currently, the main publicly available SDV synthesizers are natively supported in SDGym. However, this does not apply to synthesizers in SDV Enterprise or in bundles (e.g. the SegmentSynthesizer from the XSynthesizers bundle). The user has to integrate these manually as custom synthesizers.

Expected behavior

I expect that SDGym should be able to automatically discover any single- or multi-table SDV synthesizer that I have access to in my Python environment.

Single-table synthesizers: SDGym should be able to search for the synthesizer name in the sdv.single_table namespace. If that synthesizer name exists, then SDGym should be able to load it into the appropriate format (e.g. see this base class). As a result, a user should be able to benchmark any SDV single-table synthesizer that they have access to by providing its name as a string.

import sdgym

# assuming I have SDV Enterprise and the XSynthesizers bundle already installed in my environment
# I should be able to benchmark the synthesizers by inputting their names
sdgym.benchmark_single_table(
    synthesizers=['SegmentSynthesizer', 'XGCSynthesizer']
)
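
The name lookup described above could be sketched roughly as follows. This is only an illustration of the idea, not SDGym's implementation: the helper name `find_synthesizer_class` and its `module_name` parameter are hypothetical, and the sketch assumes that any class exposed under the namespace counts as a candidate.

```python
import importlib


def find_synthesizer_class(name, module_name='sdv.single_table'):
    """Return the class called `name` from `module_name`, or None if unavailable.

    Hypothetical helper for illustration; not part of the SDGym API.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package (or this namespace) is not installed in the environment.
        return None

    candidate = getattr(module, name, None)
    # Only accept classes, so that helper functions or constants are ignored.
    return candidate if isinstance(candidate, type) else None
```

Because the lookup happens at call time, a synthesizer from SDV Enterprise or a bundle would be found whenever it is importable from the namespace, without SDGym needing to know its name in advance.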

Additionally, I should be able to create variants of these types of synthesizers following this guide.

from sdgym import create_sdv_synthesizer_variant

BiSegmentSynthesizer = create_sdv_synthesizer_variant(
  synthesizer_class='SegmentSynthesizer',
  synthesizer_parameters={ 'n_segments': 2 },
  display_name='BiSegmentSynthesizer'
)

Multi-table synthesizers: SDGym should also be able to search for synthesizer names in the sdv.multi_table namespace. In this case, it should be able to find and load the synthesizer in a format similar to single-table, but with some modifications:

  • _get_trained_synthesizer function: The input parameter data should be a dictionary of dataframes instead of a single pandas DataFrame
  • _sample_from_synthesizer function: Instead of n_samples, the parameter it accepts should be scale.

Note that multi-table benchmarking is not currently supported in SDGym but we hope to add support for it in the future.
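
As a rough sketch of the two multi-table modifications listed above: the wrapper class below is hypothetical and not part of SDGym. It assumes the wrapped synthesizer is constructed from metadata, trained with `fit(data)` on a dictionary of tables, and sampled with `sample(scale=...)`, and it assumes the two method signatures otherwise mirror the single-table ones.

```python
class MultiTableSynthesizerWrapper:
    """Hypothetical multi-table wrapper; for illustration only."""

    def __init__(self, synthesizer_class):
        # Assumption: the class is instantiated as synthesizer_class(metadata).
        self._synthesizer_class = synthesizer_class

    def _get_trained_synthesizer(self, data, metadata):
        # `data` maps table names to dataframes, rather than being
        # a single pandas DataFrame as in the single-table case.
        if not isinstance(data, dict):
            raise TypeError('Multi-table data must be a dict of table name to dataframe.')

        synthesizer = self._synthesizer_class(metadata)
        synthesizer.fit(data)
        return synthesizer

    def _sample_from_synthesizer(self, synthesizer, scale=1.0):
        # Multi-table sampling is controlled by `scale` instead of `n_samples`.
        return synthesizer.sample(scale=scale)
```

The default of `scale=1.0` is a choice made for this sketch; the issue does not specify one.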

Additional context

We should also generally clean up this file when making the changes.

  • The nomenclature is "single-table" and "multi-table". Get rid of any references to "tabular" or "relational" (these were the old names).
  • None of the SDV synthesizers should be hard-coded (e.g. GaussianCopulaSynthesizer, CTGANSynthesizer, etc.). All of them should be dynamically discoverable from the sdv.single_table and sdv.multi_table modules.
  • We can remove the sequential code for now.
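
One possible way to replace a hard-coded list is to enumerate the classes that a module exposes. The sketch below is an assumption about how this might look, not the planned implementation: `list_synthesizer_names` is a hypothetical helper, and filtering on the `Synthesizer` name suffix is a convention chosen for this example. A stand-in module is used so that the snippet runs without SDV installed.

```python
import inspect
import types


def list_synthesizer_names(module):
    """Return sorted names of public classes in `module` ending in 'Synthesizer'.

    Hypothetical helper for illustration; not part of the SDGym API.
    """
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if name.endswith('Synthesizer') and not name.startswith('_')
    )


# Stand-in module used only to demonstrate the helper without requiring SDV.
demo = types.ModuleType('demo_single_table')
demo.AlphaSynthesizer = type('AlphaSynthesizer', (), {})
demo.BetaSynthesizer = type('BetaSynthesizer', (), {})
demo.helper = lambda: None  # not a class, so it is skipped

print(list_synthesizer_names(demo))  # ['AlphaSynthesizer', 'BetaSynthesizer']
```

With SDV installed, the same helper could be called on the imported sdv.single_table or sdv.multi_table module. A suffix filter would also match any base classes exposed there, so a real implementation would likely need an additional check to exclude them.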
