Large Training Files #226
Comments
Hi @gennsev The limitations depend a lot on the type of data that you are using. For example, if you have categorical variables in your dataset with a lot of unique values, the one-hot encoding transformation that the GaussianCopula uses by default may generate too many new columns for the model to handle in a reasonable time. So one thing that you can try, if you have categorical variables with a lot of values, is to change the transformer that is used for those columns to one that does not expand them into so many new columns.

Another option that you have, if you are letting the GaussianCopula figure out the distributions for you, is to do the distribution discovery on a subsample of your data, and then fit on the entire table with fixed distributions. Since the distribution discovery is the slowest part, this should help accelerate the entire process. You can see that this approach is already under consideration in Copulas: sdv-dev/Copulas#183

Your code would look something like this:

```python
sample_size = 100  # or any other size that seems reasonable to you

# Discover the marginal distributions on a small subsample
model = GaussianCopula()
model.fit(employees.sample(sample_size))
distributions = model.get_distributions()

# Fit on the full table with the discovered distributions fixed
model = GaussianCopula(distribution=distributions)
model.fit(employees)
```
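To see why high-cardinality categoricals are the bottleneck, here is a minimal sketch of the column explosion one-hot encoding causes; the column name and values are hypothetical, and `pd.get_dummies` stands in for the transformer's internal encoding:

```python
import pandas as pd

# Toy illustration: one-hot encoding turns a single categorical column into
# one new column per unique value, so a column with thousands of unique
# values becomes thousands of columns in the encoded table.
df = pd.DataFrame({"department": ["sales", "hr", "it", "sales", "legal"]})
encoded = pd.get_dummies(df["department"])
print(encoded.shape)  # (5, 4): one column per unique category
```

With 300k rows and a column holding thousands of unique values, the encoded table the model has to fit grows to millions of cells, which is why swapping the transformer for those columns can help.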
Thanks @csala, that really helped improve the execution time here. However, for the second model fit I still had to take a sample of the input table (though this one could be considerably larger than the first one that was used to extract the distributions).
Great, I'm glad it worked! Closing this, as the question has been answered.
Yes, I'm afraid that sometimes sub-sampling may be required. There is definitely a maximum size after which adding more rows just makes the fitting process slower, but it does not really help in capturing the parameters of the real distributions. However, it's hard to give an accurate notion of what this size is, because it depends a lot on the data itself: marginal distributions, number of columns, types of variables, etc.
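The diminishing returns can be illustrated with a quick sketch on synthetic data (the distribution, sizes, and tolerances below are all arbitrary assumptions, not SDV behavior):

```python
import numpy as np

# Estimates of the marginal distribution parameters stabilize long before
# the full data size, so extra rows mostly add fitting time, not accuracy.
rng = np.random.default_rng(0)
population = rng.normal(loc=10.0, scale=2.0, size=300_000)

estimates = {n: (population[:n].mean(), population[:n].std())
             for n in (1_000, 10_000, 100_000, 300_000)}
for n, (mean, std) in estimates.items():
    print(f"{n:>7} rows -> mean={mean:.3f}, std={std:.3f}")
```

All four estimates land close to the true parameters (10.0 and 2.0), which is the sense in which a well-chosen subsample can be "enough" for distribution discovery.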
Description
I'm having problems fitting the model for large single tables.
In terms of memory, the table is not even very heavy (16MB csv file), but it does contain around 300k lines and 6 columns.
I wonder what the practical limitations of the models are (regarding training time), and what the optimal training data size is for single-table and relational data usage.
Also, is there any built-in tool to automatically handle this problem (such as batching)?
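In the absence of built-in batching, one workaround is to fit on a random subsample of the table. The sketch below uses a synthetic stand-in for the real data; all column names and sizes are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a ~300k-row, 6-column table (only 2 columns shown).
rng = np.random.default_rng(42)
employees = pd.DataFrame({
    "salary": rng.normal(50_000, 10_000, 300_000),
    "years_employed": rng.integers(0, 40, 300_000),
})

# Fit on a random subsample instead of the full table.
sample_size = 50_000
train = employees.sample(sample_size, random_state=42)
print(train.shape)  # (50000, 2)
```

The subsample could then be passed to something like `GaussianCopula().fit(train)` in place of the full table.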
What I Did