Gaussian approximation of continuous variables really clear in non-gaussian/non-multimodal data #8

Open
Baukebrenninkmeijer opened this issue Nov 26, 2019 · 5 comments
Labels
question General question about the software

Comments

Contributor

Baukebrenninkmeijer commented Nov 26, 2019

Columns where the continuous data is distributed in a way that is hard to approximate with Gaussians (e.g. dates that increase in frequency) and follows a line are not well approximated by the GMM. I haven't used the BGMT much because it is much slower, so if this does not occur there, please correct me. With a GMM, however, the following pattern occurs. The plots show the cumulative distribution.
image

You can clearly see the several Gaussians that are fit to the curve, resulting in a fit that is not horrible but definitely not great. Do you have any thoughts on how this can be improved?
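
A minimal sketch of how this effect could be reproduced in isolation, without any GAN involved. It assumes scikit-learn's `GaussianMixture` as a stand-in for the GMM used in preprocessing; the synthetic column, the number of components, and the helper `empirical_cdf` are all hypothetical choices made for illustration, not taken from CTGAN.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for a date column whose frequency increases linearly:
# density f(x) = 2x on [0, 1], sampled through the inverse CDF x = sqrt(u).
real = np.sqrt(rng.uniform(size=20_000))

# Fit a small GMM to the column; n_components=5 is an arbitrary choice.
gmm = GaussianMixture(n_components=5, random_state=0)
gmm.fit(real.reshape(-1, 1))

# Draw the same number of samples from the fitted mixture.
synthetic, _ = gmm.sample(len(real))
synthetic = synthetic.ravel()


def empirical_cdf(values, grid):
    """Fraction of values less than or equal to each grid point."""
    return np.searchsorted(np.sort(values), grid, side="right") / len(values)


grid = np.linspace(0.0, 1.0, 201)
cdf_real = empirical_cdf(real, grid)
cdf_synth = empirical_cdf(synthetic, grid)

# Largest vertical gap between the two CDFs (a Kolmogorov-Smirnov style distance).
max_gap = float(np.max(np.abs(cdf_real - cdf_synth)))
print(f"max CDF gap between real and GMM samples: {max_gap:.4f}")
```

Plotting `cdf_real` and `cdf_synth` against `grid` gives the same kind of cumulative comparison as in the image above, and varying `n_components` shows how much of the mismatch comes from the mixture itself rather than from the generator.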

In TGAN, this problem was much less pronounced, and the curves looked as follows. In preprocessing, I think the only difference is the use of 4 x std instead of 2 x std. Apart from the different architecture, I can't immediately think of a reason for this behaviour.
image
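
To get a feel for what the 4 x std versus 2 x std difference could change, here is a minimal sketch. It assumes the continuous value is normalised within its mode as `(x - mean) / (k * std)` and that the usable output range is `[-1, 1]`; both are assumptions made for illustration, and `fraction_outside` and the mode parameters are hypothetical.

```python
import numpy as np


def normalize_within_mode(x, mean, std, k):
    """Scale values relative to one Gaussian mode: (x - mean) / (k * std)."""
    return (x - mean) / (k * std)


def fraction_outside(x, mean, std, k):
    """Share of normalised values that land outside the interval [-1, 1]."""
    scaled = normalize_within_mode(x, mean, std, k)
    return float(np.mean(np.abs(scaled) > 1.0))


rng = np.random.default_rng(0)
mean, std = 10.0, 3.0  # hypothetical parameters of one fitted mode
x = rng.normal(mean, std, size=100_000)

frac_k2 = fraction_outside(x, mean, std, k=2)
frac_k4 = fraction_outside(x, mean, std, k=4)
print(f"k=2: {frac_k2:.4%} of values fall outside [-1, 1]")
print(f"k=4: {frac_k4:.4%} of values fall outside [-1, 1]")
```

Under these assumptions, the smaller multiplier pushes a few percent of each mode's values past the bounds, while the larger one keeps almost everything inside but compresses the bulk of the mode into a narrower band around zero.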

Collaborator

leix28 commented Dec 2, 2019

Hi, I'm not sure if I understand your plots correctly. Are these plots about the cumulative distribution for synthetic data (generated by GAN) and real (training) data?

Contributor Author

Baukebrenninkmeijer commented Dec 2, 2019

Both: orange is synthetic, blue is real. I'll upload some clearer plots tomorrow.

@Baukebrenninkmeijer
Contributor Author

This is a bit clearer with a distribution plot. For all plots: blue is real data of one column, orange is fake data of the same column. The fake data in this plot was generated with CTGAN.

image

For example, this was from data generated with TGAN:

image

And from my WGAN adaptation of TGAN:

image

So in these plots, we see a clear decrease in spikiness of the generated data. I'm trying to figure out what causes this, because the data in the TGAN-WGAN is modelled quite well, while the data in CTGAN and TGAN is quite clearly made up of several smaller distributions.

@csala added the question (General question about the software) label Jun 22, 2020
@shreyanshs

@Baukebrenninkmeijer
Contributor Author

@shreyanshs Yes, that is correct. However, it's quite old now.
