model.fit() hangs before training begins for medium to large datasets #988
Labels
- data:single-table — Related to tabular datasets
- feature:performance — Related to time or memory usage
- question — General question about the software
Environment details
If you are already running SDV, please indicate the following details about the environment in
which you are running it:
Problem description
I'm interested in using CTGAN and CopulaGAN on medium to large datasets (anywhere from thousands to tens of millions of rows, and anywhere from tens to hundreds of columns). For example, one dataset of interest is the covtype dataset, which can be loaded directly via `sklearn.datasets`. When I define a CTGAN model and call `fit`, the program hangs for minutes or even hours (depending on the size of the dataset) before training begins. From what I can tell, there is some sort of clustering going on as a pre-processing step, which seems to be responsible for the slowdown; I'm guessing this has to do with identifying the modes in the different columns. But since covtype isn't a particularly large dataset, this hanging seems undesirable.

I'm interested in ways of speeding this up. Is it possible to identify the clusters (or, more generally, to complete the pre-processing) on a small subset of the data and then to train on the full dataset? Alternatively, is it possible to supply additional metadata to the model that would speed up the pre-processing?
What I already tried
I looked through the docs, but couldn't find anything about speeding up the pre-processing. The closest solution seems to be #226 which describes a method for doing this with the GaussianCopula model. I didn't see an analogous method for CTGAN or CopulaGAN, but it's entirely possible that I missed it.
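To make the subset idea concrete, here is a sketch of the workaround I have in mind, again using scikit-learn's `BayesianGaussianMixture` as a stand-in (I did not find an SDV hook that exposes this, so this only illustrates the idea): fit the mixture on a random subsample, then use the fitted model on the full column.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Stand-in for a large continuous column (200k rows).
column = rng.normal(0.0, 1.0, 200_000).reshape(-1, 1)

# Fit on a 10k-row subsample instead of the full column.
subsample = column[rng.choice(len(column), 10_000, replace=False)]
vgm = BayesianGaussianMixture(n_components=10, max_iter=100)
vgm.fit(subsample)

# The subsample-fitted model can still assign a mode to every row.
modes = vgm.predict(column)
print(modes.shape)
```

If the pre-processing in CTGAN/CopulaGAN could be fit on a subsample like this, the transform of the full dataset would presumably be much cheaper than the fit.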
Thanks in advance!