Large Training Files #226

Closed

gennsev opened this issue Oct 29, 2020 · 3 comments

Labels: question (General question about the software)

gennsev commented Oct 29, 2020

  • SDV version: 0.4.5
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04.1 LTS

Description

I'm having problems fitting the model for large single tables.
In terms of memory the table is not even very heavy (a 16 MB CSV file), but it does contain around 300k rows and 6 columns.

I wonder what the practical limitations of the models are (regarding training time), and what the optimal training data size is for single-table and relational data usage.
Also, is there any built-in tool to automatically handle this problem (such as batching)?

What I Did

import pandas as pd
from sdv.tabular import GaussianCopula

employees = pd.read_csv("employees.csv")
model = GaussianCopula()
model.fit(employees)  # here it hangs, using 100% CPU
synthetic_employees = model.sample()
csala commented Oct 30, 2020

Hi @gennsev

The limitations depend a lot on the type of data that you are using. For example, if you have categorical variables in your dataset with a lot of unique values, the one-hot encoding transformation that the GaussianCopula uses by default may generate too many new columns for the model to handle in a reasonable time.

So one thing you can try, if you have categorical variables with a lot of values, is to change the transformer used for those columns to categorical or categorical_fuzzy, which avoids creating extra columns to encode the values.
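For illustration, a minimal sketch of that change, assuming the field_transformers argument accepted by the SDV 0.4.x tabular models; the column name 'job_title' is a hypothetical high-cardinality field, not one from this issue:

from sdv.tabular import GaussianCopula

# Use a transformer that does not one-hot encode the high-cardinality
# column ('categorical' or 'categorical_fuzzy').
model = GaussianCopula(
    field_transformers={
        'job_title': 'categorical_fuzzy',  # hypothetical column name
    }
)
model.fit(employees)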

Another option you have, if you are letting the GaussianCopula figure out the distributions for you, is to do the distribution discovery on a subsample of your data and then fit on the entire table with fixed distributions. Since the distribution discovery is the slowest part, this should help accelerate the entire process. You can see that this approach is already under consideration in Copulas: sdv-dev/Copulas#183

Your code would look something like this:

# Discover the marginal distributions on a small subsample.
sample_size = 100  # or any other size that seems reasonable to you
model = GaussianCopula()
model.fit(employees.sample(sample_size))

distributions = model.get_distributions()

# Fit on the entire table, keeping the discovered distributions fixed.
model = GaussianCopula(distribution=distributions)
model.fit(employees)

csala added the question label on Oct 30, 2020
gennsev commented Oct 30, 2020

Thanks @csala, that really helped improve the execution time here.

However, for the second model fit I still had to take a sample of the input table (this one could be considerably larger than the first one that was used to extract the distributions).
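For reference, a minimal sketch of what that adjusted workflow might look like; the sample sizes below are purely illustrative and not from this thread:

sample_size = 100          # small sample for distribution discovery
fit_sample_size = 50_000   # larger, but still partial, sample for the final fit

model = GaussianCopula()
model.fit(employees.sample(sample_size))
distributions = model.get_distributions()

model = GaussianCopula(distribution=distributions)
model.fit(employees.sample(fit_sample_size))

synthetic_employees = model.sample(len(employees))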

csala commented Nov 2, 2020

Great, I'm glad it worked! Closing this, as the question has been answered.

However, for the second model fit I still had to take a sample of the input table (this one could be considerably larger than the first one that was used to extract the distributions).

Yes, I'm afraid that sometimes subsampling may be required. There is definitely a maximum size after which adding more rows just makes the fitting process slower without really helping to capture the parameters of the real distributions. However, it's hard to give an accurate notion of what this size is, because it depends a lot on the data itself: marginal distributions, number of columns, types of variables, etc.
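As a rough way to gauge those factors for a given table, a quick pandas check (a sketch, not part of SDV) shows the column types and the cardinality of the categorical columns:

# Column types and number of distinct values per column: high-cardinality
# object/categorical columns are the usual cause of slow fits.
print(employees.dtypes)
print(employees.select_dtypes(include='object').nunique())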
