
model.fit() hangs before training begins for medium to large datasets #988

Closed
arobey1 opened this issue Aug 31, 2022 · 2 comments
Labels
data:single-table Related to tabular datasets feature:performance Related to time or memory usage question General question about the software

Comments


arobey1 commented Aug 31, 2022

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

  • SDV version: 0.16.0
  • Python version: 3.7.12
  • Operating System: Linux

Problem description

I'm interested in using CTGAN and CopulaGAN on medium to large datasets (anywhere from thousands to tens of millions of rows, and anywhere from tens to hundreds of columns). For example, one dataset of interest is the covtype dataset, which can be loaded directly via

```python
from sdv.demo import load_tabular_demo

data = load_tabular_demo('covtype')
```

When I define a CTGAN model via

```python
from sdv.tabular import CTGAN

model = CTGAN(verbose=True)
model.fit(data)
```

the program hangs for minutes or even hours (depending on the size of the dataset) before training begins. From what I can tell, there is some sort of clustering going on as a pre-processing step, which seems to be responsible for the slowdown. I'm guessing that this has to do with identifying the modes in different columns. But as covtype isn't a particularly large dataset, this hanging seems undesirable.

I'm interested in ways of speeding this up. Is it possible to identify the clusters (or more generally to complete the pre-processing) on a small subset of the data and then to train on the full dataset? Alternatively, is it possible to supply additional meta-data to the model which will speed up the pre-processing?

What I already tried

I looked through the docs, but couldn't find anything about speeding up the pre-processing. The closest solution seems to be #226 which describes a method for doing this with the GaussianCopula model. I didn't see an analogous method for CTGAN or CopulaGAN, but it's entirely possible that I missed it.

Thanks in advance!


npatki commented Sep 1, 2022

Hi @arobey1, nice to meet you. There are 2 phases to the overall training process:

  1. Preprocessing the data to an optimized, numerical format (using a clustering algorithm) and
  2. Training the GAN for the desired number of epochs

I assume that you're referring to phase 1 as the "hanging" time. The epochs begin to print out during phase 2.

Basic Benchmarking

The covtype dataset has 55 columns and roughly 580K rows. On my device, CTGAN takes about 12 hours to model it. Less than 1% of that time is spent in Phase 1 (6 min), while the remaining 99%+ is spent in Phase 2 (an estimated 2.57 min/epoch for 300 epochs).
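As a quick arithmetic check of that split (a sketch using only the numbers quoted above):

```python
# Sanity-check the benchmark split: 6 min of preprocessing vs.
# ~2.57 min/epoch of GAN training for the default 300 epochs.
phase1_min = 6.0            # Phase 1: clustering/preprocessing
phase2_min = 2.57 * 300     # Phase 2: GAN training
total_min = phase1_min + phase2_min

print(f"total: {total_min / 60:.1f} h")                 # about 13 hours
print(f"phase 1 share: {phase1_min / total_min:.1%}")   # under 1%
```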

How does this compare with your observation? If you have CUDA, phase 2 might be faster, but I'm guessing it still won't be overtaken by phase 1. Will improving phase 1 ultimately lead to any substantial improvements?

Improving Phase 1

The clustering algorithm is implemented in the ClusterBasedNormalizer from the RDT library. You are welcome to file an RDT feature request for speeding it up using the methods you describe (estimating the clusters using a subset of the data).
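The pattern that request describes, fitting the clustering step on a random subset and then transforming the full column with the fitted centers, can be sketched roughly like this. This is a toy stand-in using plain 1-D k-means, not the actual ClusterBasedNormalizer, and all function names here are hypothetical:

```python
import random

def fit_modes_on_subset(values, n_modes=3, subset_size=1_000, iters=20, seed=0):
    """Estimate cluster centers (column 'modes') from a sample via 1-D k-means."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(subset_size, len(values)))
    centers = rng.sample(sample, n_modes)
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for v in sample:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            buckets[nearest].append(v)
        # Keep a center unchanged if its bucket happens to be empty.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return sorted(centers)

def assign_mode(value, centers):
    """Cheap transform step: map a value to its nearest fitted center."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))

# Toy column: a mixture of three well-separated normals (150K rows).
data_rng = random.Random(1)
column = [data_rng.gauss(mu, 1.0) for mu in (0.0, 10.0, 20.0) for _ in range(50_000)]

# Fit on a 1K-row sample instead of all 150K rows, then transform everything;
# the expensive iterative step now scales with the sample size only.
centers = fit_modes_on_subset(column, n_modes=3)
modes = [assign_mode(v, centers) for v in column[:100]]
```

The speedup comes from the fit step scaling with the sample size rather than the full row count, while the per-row transform stays a cheap nearest-center lookup, at the cost of some accuracy in the estimated modes.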

SDV issue #643 is related but takes a different approach: We can save the transformed data so that you do not have to repeat this step for future models. Feel free to follow along and add relevant details to the issue.

Workarounds for overall speedup

  • (Proposed in Large Training Files #226) Train with a smaller subset of the data
  • Train for fewer epochs (the default is 300)
  • Use a different model. I'm curious about your use case and why you'd like to stick to CTGAN/CopulaGAN. The FAST ML TabularPreset optimizes for speed. On my device, it took 46 seconds to model the entire dataset end-to-end.


npatki commented Sep 14, 2022

Hi @arobey1, I'm marking this issue as closed since we've discussed your original question and have some follow-ups/workarounds. As mentioned above, it would be great if you could file a feature request in our RDT library.

Feel free to respond if there is more to discuss and I can re-open the issue.

@npatki npatki closed this as completed Sep 14, 2022