Large Training Files #226

Closed

gennsev opened this issue Oct 29, 2020 · 3 comments

Labels: question (General question about the software)

gennsev commented Oct 29, 2020

  • SDV version: 0.4.5
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04.1 LTS

Description

I'm having problems fitting the model for large single tables.
In terms of memory the table is not even very heavy (a 16 MB CSV file), but it does contain around 300k rows and 6 columns.

I wonder what the practical limitations of the models are (regarding training time), and what the optimal training data size is for single-table and relational data usage.
Also, is there any built-in tool to automatically handle this problem (such as batching)?

What I Did

import pandas as pd
from sdv.tabular import GaussianCopula

employees = pd.read_csv("employees.csv")
model = GaussianCopula()
model.fit(employees)  # here it hangs, using 100% CPU
synthetic_employees = model.sample()
csala commented Oct 30, 2020

Hi @gennsev

The limitations depend a lot on the type of data that you are using. For example, if you have categorical variables in your dataset with a lot of unique values, the one-hot encoding transformation that the GaussianCopula uses by default may generate too many new columns for the model to handle in a reasonable time.

So one thing you can try, if you have categorical variables with a lot of values, is to change the transformer used for those columns to categorical or categorical_fuzzy, which avoids creating extra columns to encode the values.
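For illustration, a minimal sketch of that change, assuming the field_transformers argument accepted by the SDV 0.4.x tabular models; the column name 'job_title' is a hypothetical high-cardinality field, not one from this issue:

from sdv.tabular import GaussianCopula

# Use a transformer that does not one-hot encode the high-cardinality
# column ('categorical' or 'categorical_fuzzy').
model = GaussianCopula(
    field_transformers={
        'job_title': 'categorical_fuzzy',  # hypothetical column name
    }
)
model.fit(employees)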

Another option you have, if you are letting the GaussianCopula figure out the distributions for you, is to do the distribution discovery on a subsample of your data and then fit on the entire table with fixed distributions. Since the distribution discovery is the slowest part, this should help accelerate the entire process. You can see that this approach is already under consideration in Copulas: sdv-dev/Copulas#183

Your code would look something like this:

# Discover the marginal distributions on a small subsample.
sample_size = 100  # or any other size that seems reasonable to you
model = GaussianCopula()
model.fit(employees.sample(sample_size))

distributions = model.get_distributions()

# Fit on the entire table, keeping the discovered distributions fixed.
model = GaussianCopula(distribution=distributions)
model.fit(employees)

csala added the question label on Oct 30, 2020
gennsev commented Oct 30, 2020

Thanks @csala, that really helped improve the execution time here.

However, for the second model fit I still had to take a sample of the input table (this one could be considerably larger than the first one that was used to extract the distributions).
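For reference, a minimal sketch of what that adjusted workflow might look like; the sample sizes below are purely illustrative and not from this thread:

sample_size = 100          # small sample for distribution discovery
fit_sample_size = 50_000   # larger, but still partial, sample for the final fit

model = GaussianCopula()
model.fit(employees.sample(sample_size))
distributions = model.get_distributions()

model = GaussianCopula(distribution=distributions)
model.fit(employees.sample(fit_sample_size))

synthetic_employees = model.sample(len(employees))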

csala commented Nov 2, 2020

Great, I'm glad it worked! Closing this, as the question has been answered.

However, for the second model fit I still had to take a sample of the input table (this one could be considerably larger than the first one that was used to extract the distributions).

Yes, I'm afraid that sometimes subsampling may be required. There is definitely a maximum size after which adding more rows just makes the fitting process slower without really helping to capture the parameters of the real distributions. However, it's hard to give an accurate notion of what this size is, because it depends a lot on the data itself: marginal distributions, number of columns, types of variables, etc.
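As a rough way to gauge those factors for a given table, a quick pandas check (a sketch, not part of SDV) shows the column types and the cardinality of the categorical columns:

# Column types and number of distinct values per column: high-cardinality
# object/categorical columns are the usual cause of slow fits.
print(employees.dtypes)
print(employees.select_dtypes(include='object').nunique())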
