Need Help on benchmark function for real data #23

ArupDukeBanerjee · 2020-03-20T14:03:24Z

The benchmark function requires a my_synthesizer_function that takes the real data and the categorical and ordinal features as input and returns the synthesized data. The documentation provided is not sufficient for a novice like me, so I am having trouble implementing it. Moreover, the benchmark function appears to take its data from the predefined default_datasets, each of which has its own metadata file stored on a server in JSON format. This prevents me from benchmarking on my own data, because I don't have metadata ready for my datasets; there are quite a few of them and they are large.

Any detailed documentation on how to use this benchmark function more efficiently would be helpful.
Thanks a lot for such a beautiful package.
I am new to this domain.
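The three-argument synthesizer signature described above can be illustrated with a toy baseline. This is only a minimal sketch, not SDGym code: the name `independent_column_synthesizer` and the `seed` parameter are hypothetical, and plain Python lists are used as an assumption to keep the example dependency-free (the actual input type expected by the benchmark should be checked against the SDGym documentation).

```python
import random


def independent_column_synthesizer(real_data, categorical_columns, ordinal_columns, seed=0):
    """Toy baseline: resample every column independently from its observed values.

    real_data: list of rows, each row a list of values (an assumption of this sketch)
    categorical_columns / ordinal_columns: lists of column indexes; this baseline
    does not need them and keeps them only to match the three-argument signature
    """
    rng = random.Random(seed)
    n_rows = len(real_data)
    if n_rows == 0:
        return []
    n_cols = len(real_data[0])
    # Collect the observed values of each column
    columns = [[row[j] for row in real_data] for j in range(n_cols)]
    # Draw each cell from its own column: keeps marginals, drops correlations
    return [[rng.choice(columns[j]) for j in range(n_cols)] for _ in range(n_rows)]


real = [
    [25, "red", 1],
    [31, "blue", 2],
    [47, "red", 3],
    [52, "green", 1],
]
synthetic = independent_column_synthesizer(real, categorical_columns=[1], ordinal_columns=[2])
print(len(synthetic), len(synthetic[0]))
```

A real synthesizer would fit a model in place of the resampling step; the point here is only the shape of the inputs and of the returned table.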

csala · 2020-03-20T14:25:16Z

Hi @ArupDukeBanerjee, at the moment SDGym is not intended to be used with your own dataset, but rather only to evaluate and compare the performance of data synthesis methods over a set of well-known datasets.

For the scenario that you mention, we are working on a separate package called SDMetrics that will be made public in the upcoming days.

ArupDukeBanerjee · 2020-03-23T12:25:24Z

Hi @csala,
I just wanted to know one thing about this package: can it be used purely to generate data from my own real data? As you said, benchmarking on custom data is yet to come; in the meantime, can I use the different generators on my own datasets? Thanks a lot in advance!

Thanks,
Arup

csala · 2020-06-23T08:48:39Z

@ArupDukeBanerjee Yes, SDGym synthesizers can be used for modeling and sampling your own data, but this is just a secondary effect of having all the synthesizers here implemented with a uniform API.

I would rather recommend that you use the CTGAN package, which is simpler to use and will give you better results in the long term: it is actively maintained with ease of use and sampling quality in mind, while SDGym's goal is only to provide a benchmark.

ArupDukeBanerjee · 2020-06-23T08:56:02Z

Hi Carles,

Thanks a lot for replying. I take your point on benchmarking, and CTGAN is a great package, but when my *data has missing values, it throws errors*. Handling missing values is part of generating realistic data: I want my synthetic data to contain realistic missing values, which I believe the CTGAN package does not support. It would be great if you could let me know how missing data can be handled.

Thanks a lot!

Regards,
Arup
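One generic workaround for the missing-value problem raised above is to record how often each column is missing, fill the gaps before fitting, and blank out the same share of cells in the synthetic output. The sketch below is an assumption-laden illustration, not a CTGAN feature: `split_missing` and `reinject_missing` are hypothetical helper names, `None` stands in for a missing value, and the fitted synthesizer is replaced by a simple copy.

```python
import random


def split_missing(rows):
    """Return per-column missing rates and per-column lists of observed values."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    rates, observed = [], []
    for j in range(n_cols):
        values = [row[j] for row in rows if row[j] is not None]
        observed.append(values)
        rates.append(1 - len(values) / n_rows)
    return rates, observed


def reinject_missing(synthetic_rows, rates, seed=0):
    """Blank out cells so each column has about the real data's missing rate."""
    rng = random.Random(seed)
    n_rows = len(synthetic_rows)
    result = [list(row) for row in synthetic_rows]
    for j, rate in enumerate(rates):
        n_missing = round(rate * n_rows)
        # Pick distinct rows at random and mark this column as missing there
        for i in rng.sample(range(n_rows), n_missing):
            result[i][j] = None
    return result


real = [
    [25, "red"],
    [None, "blue"],
    [47, None],
    [52, "green"],
]
rates, observed = split_missing(real)
rng = random.Random(1)
# Fill gaps with a random observed value so a model that rejects missing values can be fit
filled = [
    [v if v is not None else rng.choice(observed[j]) for j, v in enumerate(row)]
    for row in real
]
# Stand-in for the output of a real synthesizer fitted on `filled`
synthetic = [list(row) for row in filled]
with_gaps = reinject_missing(synthetic, rates)
```

This treats missingness as independent of the other values in the row. If the gaps in the real data depend on other columns, a per-column indicator flag modeled alongside the data would be a closer fit.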


csala self-assigned this Mar 20, 2020

csala added the "question" label (General question about the software) Mar 20, 2020

csala closed this as completed Jun 23, 2020
