Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: replace synthesizer function with class #89

Closed
sbrugman opened this issue Jun 1, 2021 · 5 comments
Closed

rfc: replace synthesizer function with class #89

sbrugman opened this issue Jun 1, 2021 · 5 comments
Labels
feature request Request for a new feature resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@sbrugman
Copy link
Contributor

sbrugman commented Jun 1, 2021

Description

SDGym's synthesizers all inherit from the Baseline class (or BaseSynthesizer class in previous versions). Users can provide custom synthesizer functions. The convenience inheritance is demonstrated throughout SDGym's code base and has all sort of other benefits. My suggestion would be to make the following changes:

  • All synthesizers should inherit from a synthesizer base class (Baseline)
  • All synthesizers should implement a separate fit and sample method

These changes provide consistency between SDGym's native and user-provided synthesizers and clear distinction between fit and sample logic, at nearly no cost:

def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> real_data: dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data

will become

from sdgym.synthesizers.base import Baseline


class MySynthesizer(Baseline):
    def fit(self, real_data: dict[str, pandas.DataFrame], metadata: sdv.Metadata) -> None:
        # ...
        # do all necessary steps to learn from the real data
        # ...
    
    def sample(self, n_samples: int) -> dict[str, pandas.DataFrame]:
        # and produce new synthetic data that resembles it
        return synthetic_data

More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.

The code that needs to be changed for this is minimal, however I wanted to make sure you see value in this point before drafting a PR.

@sbrugman
Copy link
Contributor Author

@csala Any thoughts on this proposal?

@sbrugman
Copy link
Contributor Author

sbrugman commented Nov 3, 2021

@leix28 @katxiao would a PR be welcome?

@csala
Copy link
Contributor

csala commented Nov 3, 2021

Hi @sbrugman I think that for now the exact change that you are proposing is not within the current SDGym roadmap, but some variation is:

My suggestion would be to make the following changes:

  • All synthesizers should inherit from a synthesizer base class (Baseline)
  • All synthesizers should implement a separate fit and sample method

To add some more context to it, the reason for which the required input is a function instead of a class is that wrapping a class-based synthesizer that follows the fit/sample abstraction within a single function that receives real data, runs fit/sample internally and returns a synthetic clone is far easier than the opposite approach of trying to adapt a functional based synthesizer into this fit/sample abstraction. Also, the current implementation of SDGym already supports class-based synthesizers that inherit from the Baseline class, so making this a hard requirement does not really expand the support, it only narrows it.

On the other hand, it is true that this support for class-based synthesizer is not really documented, so adding it to the docs would be interesting!

More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.

This is another story, and it could actually be interesting too! An option that would be acceptable without sacrificing the functional input, would be to modify the code to capture such metrics only when the provided synthesizer is a Baseline subclass. We could make it so that model_time stays the same and is always reported for all synthesizer, but for Baseline subclasses two new columns, fit_time and sample_time, are added to the output table.

@sbrugman
Copy link
Contributor Author

sbrugman commented Nov 3, 2021

Hi Carles, thanks for getting back at this. The clarification on why you would not like to impose the fit/sample abstraction on users is helpful. The backwards-compatibility argument also makes sense. Good to know that we can move forward on by documenting the class-based synthesizers and with the conditional extension of the benchmark with metrics on whether the implementation is Baseline-based or otherwise.

@npatki
Copy link

npatki commented Jun 5, 2023

Hello, I'm jumping in here a few years after this initial conversation. Since 2021, we have made significant updates to the usage/API of SDGym as well as its functionality. And I believe that some key features that were discussed here have now been incorporated into the library.

  1. You can now supply a custom synthesizer by supplying 2 separate functions for fit and sample. For more information, see the custom synthesizer user guide. We will automatically use these functions to create a class for you.
  2. The benchmarking script now reports more granular results, such as time (fit time, sample time, evaluation time) and memory usage. See interpreting the results.

So I'm going to mark this issue as (more-or-less) resolved.

Apologies if I've overlooked any of the finer points in the discussion. If there is more to add, I'd recommend filing a new issue and we can start a new discussion based on the vision and capabilities of the newest SDGym library. Thanks!

@npatki npatki closed this as completed Jun 5, 2023
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. feature request Request for a new feature labels Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Request for a new feature resolution:resolved The issue was fixed, the question was answered, etc.
Projects
None yet
Development

No branches or pull requests

3 participants