Gaussian approximation of continuous variables really clear in non-gaussian/non-multimodal data #8

Baukebrenninkmeijer · 2019-11-26T15:48:09Z

Columns whose continuous data is distributed in a way that Gaussians approximate poorly (e.g. dates that increase in frequency) and follows a line are not well approximated by the GMM. I haven't used the BGMT much because it is much slower, so if this does not occur there, please correct me. With a GMM, however, the following pattern occurs. The plots show the cumulative distribution.

There you can clearly see the several Gaussians that are fit to the curve, resulting in a not horrible but definitely not great fit. Do you have any thoughts on how this can be improved?
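To make the effect concrete, here is a minimal sketch using only the Python standard library. It does not use CTGAN or any fitted model: it hand-builds a mixture of equally spaced Gaussians whose weights follow a linearly increasing density on [0, 1], then counts how many separate bumps appear in the mixture density. The helper names (`ramp_mixture`, `count_local_maxima`), the target density f(x) = 2x, the five components, and the two width ratios are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sigma):
    """Density of a Gaussian mixture with a shared std sigma."""
    return sum(w * normal_pdf(x, m, sigma) for w, m in zip(weights, means))


def ramp_mixture(n_components):
    """Hypothetical mixture for the ramp density f(x) = 2x on [0, 1].

    Components sit at the centres of equal-width bins and each weight is
    the probability mass of the ramp inside that bin, so weights sum to 1.
    """
    width = 1.0 / n_components
    means = [(i + 0.5) * width for i in range(n_components)]
    weights = [((i + 1) * width) ** 2 - (i * width) ** 2
               for i in range(n_components)]
    return weights, means, width


def count_local_maxima(values):
    """Number of interior points strictly higher than both neighbours."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:])
               if b > a and b > c)


grid = [i / 1000 for i in range(1001)]
weights, means, width = ramp_mixture(5)

bumps = {}
for ratio in (0.2, 0.6):  # component std as a fraction of the spacing
    sigma = ratio * width
    density = [mixture_pdf(x, weights, means, sigma) for x in grid]
    bumps[ratio] = count_local_maxima(density)
    print(f"std = {ratio} x spacing: {bumps[ratio]} local maxima")
```

With narrow components, each Gaussian shows up as its own bump; with wider components, the bumps merge into one smooth rise. This only illustrates the geometry of the problem and says nothing about which widths a fitted GMM actually picks.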

In TGAN, this problem was much less pronounced, and the curves looked as follows. In preprocessing, I think the only difference is using 4 x std instead of 2 x std. Apart from the different architecture, I can't immediately think of a reason for this behaviour.
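For reference, the 2 x std versus 4 x std difference can be sketched as follows. This is a hypothetical illustration, not code from either library: the function names, the clipping range of [-1, 1], and the assumption that values within one mode are roughly normal are mine. It scales a value by `k * std` around the mode mean and reports how much of a normal mode would fall outside the clipping range for each multiplier.

```python
import math


def normalize(value, mean, std, k):
    """Scale a value relative to its mode and clip to [-1, 1].

    k is the std multiplier under discussion (2 or 4). The clipping
    range is an assumption made for this sketch.
    """
    scaled = (value - mean) / (k * std)
    return max(-1.0, min(1.0, scaled))


def clipped_fraction(k):
    """P(|Z| > k) for a standard normal Z: the share of a perfectly
    Gaussian mode that would be clipped with multiplier k."""
    return math.erfc(k / math.sqrt(2.0))


for k in (2, 4):
    print(f"k={k}: normalize(3.0) = {normalize(3.0, 0.0, 1.0, k)}, "
          f"clipped fraction = {clipped_fraction(k):.6f}")
```

A smaller multiplier clips more of each mode's tails and spreads the remaining values over more of the output range; a larger one keeps the tails but compresses the bulk toward zero. Whether this contributes to the visible bumps is not established in this thread.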

leix28 · 2019-12-02T16:44:08Z

Hi, I'm not sure if I understand your plots correctly. Are these plots about the cumulative distribution for synthetic data (generated by GAN) and real (training) data?

Baukebrenninkmeijer · 2019-12-02T21:58:42Z

Both, orange is synthetic, blue is real. I'll upload some clearer plots tomorrow.

Baukebrenninkmeijer · 2019-12-03T14:06:29Z

This is a bit clearer with a distribution plot. For all plots: blue is real data of one column, orange is fake data of the same column. The fake data in this plot was generated with CTGAN.

For example, this was from data generated with TGAN:

And from my WGAN adaptation of TGAN:

So in these plots, we see a clear decrease in spikiness of the generated data. I'm trying to figure out what causes this, because the data in the TGAN-WGAN is modelled quite well, while the data in CTGAN and TGAN is quite clearly composed of several smaller distributions.

shreyanshs · 2020-11-12T05:39:30Z

@Baukebrenninkmeijer just to confirm, the TGAN-WGAN implementation you are talking about is https://github.com/Baukebrenninkmeijer/On-the-Generation-and-Evaluation-of-Synthetic-Tabular-Data-using-GANs/tree/master/tgan_wgan_gp?

Baukebrenninkmeijer · 2020-11-13T15:52:48Z

@shreyanshs Yes, that is correct. However, it's quite old now.

csala added the question label on Jun 22, 2020

shreyansh26 mentioned this issue Dec 2, 2020

Tensorpack version Baukebrenninkmeijer/On-the-Generation-and-Evaluation-of-Synthetic-Tabular-Data-using-GANs#1

Closed
