
model.fit() hangs before training begins for medium to large datasets #988

Closed
arobey1 opened this issue Aug 31, 2022 · 2 comments
Labels
data:single-table Related to tabular datasets feature:performance Related to time or memory usage question General question about the software

Comments


arobey1 commented Aug 31, 2022

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

  • SDV version: 0.16.0
  • Python version: 3.7.12
  • Operating System: Linux

Problem description

I'm interested in using CTGAN and CopulaGAN on medium to large datasets (anywhere from thousands to tens of millions of rows, and anywhere from tens to hundreds of columns). For example, one dataset of interest is the covtype dataset, which can be loaded directly via

```python
from sdv.demo import load_tabular_demo

data = load_tabular_demo('covtype')
```

When I define a CTGAN model via

```python
from sdv.tabular import CTGAN

model = CTGAN(verbose=True)
model.fit(data)
```

the program hangs for minutes or even hours (depending on the size of the dataset) before training begins. From what I can tell, there is some sort of clustering going on as a pre-processing step, which seems to be responsible for the slowdown. I'm guessing that this has to do with identifying the modes in different columns. But as covtype isn't a particularly large dataset, this hanging seems undesirable.

I'm interested in ways of speeding this up. Is it possible to identify the clusters (or more generally to complete the pre-processing) on a small subset of the data and then to train on the full dataset? Alternatively, is it possible to supply additional meta-data to the model which will speed up the pre-processing?

What I already tried

I looked through the docs, but couldn't find anything about speeding up the pre-processing. The closest solution seems to be #226 which describes a method for doing this with the GaussianCopula model. I didn't see an analogous method for CTGAN or CopulaGAN, but it's entirely possible that I missed it.

Thanks in advance!


npatki commented Sep 1, 2022

Hi @arobey1, nice to meet you. There are 2 phases to the overall training process:

  1. Preprocessing the data to an optimized, numerical format (using a clustering algorithm) and
  2. Training the GAN for the desired number of epochs

I assume that you're referring to phase 1 as the "hanging" time. The epochs begin to print out during phase 2.

Basic Benchmarking

The covtype dataset has 55 columns and roughly 580K rows. On my device, CTGAN takes about 12 hours to model it. Less than 1% of that time is spent in Phase 1 (6 min), while the remaining 99%+ is spent in Phase 2 (an estimated 2.57 min/epoch for 300 epochs).
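As a quick arithmetic check of that split (a sketch using only the numbers quoted above):

```python
# Sanity-check the benchmark split: 6 min of preprocessing vs.
# ~2.57 min/epoch of GAN training for the default 300 epochs.
phase1_min = 6.0            # Phase 1: clustering/preprocessing
phase2_min = 2.57 * 300     # Phase 2: GAN training
total_min = phase1_min + phase2_min

print(f"total: {total_min / 60:.1f} h")                 # about 13 hours
print(f"phase 1 share: {phase1_min / total_min:.1%}")   # under 1%
```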

How does this compare with your observation? If you have CUDA, phase 2 might be faster, but I'm guessing it still won't be overtaken by phase 1. Will improving phase 1 ultimately lead to any substantial improvements?

Improving Phase 1

The clustering algorithm is implemented in the ClusterBasedNormalizer from the RDT library. You are welcome to file an RDT feature request for speeding it up using the methods you describe (estimating the clusters using a subset of the data).
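The pattern that request describes, fitting the clustering step on a random subset and then transforming the full column with the fitted centers, can be sketched roughly like this. This is a toy stand-in using plain 1-D k-means, not the actual ClusterBasedNormalizer, and all function names here are hypothetical:

```python
import random

def fit_modes_on_subset(values, n_modes=3, subset_size=1_000, iters=20, seed=0):
    """Estimate cluster centers (column 'modes') from a sample via 1-D k-means."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(subset_size, len(values)))
    centers = rng.sample(sample, n_modes)
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for v in sample:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            buckets[nearest].append(v)
        # Keep a center unchanged if its bucket happens to be empty.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return sorted(centers)

def assign_mode(value, centers):
    """Cheap transform step: map a value to its nearest fitted center."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))

# Toy column: a mixture of three well-separated normals (150K rows).
data_rng = random.Random(1)
column = [data_rng.gauss(mu, 1.0) for mu in (0.0, 10.0, 20.0) for _ in range(50_000)]

# Fit on a 1K-row sample instead of all 150K rows, then transform everything;
# the expensive iterative step now scales with the sample size only.
centers = fit_modes_on_subset(column, n_modes=3)
modes = [assign_mode(v, centers) for v in column[:100]]
```

The speedup comes from the fit step scaling with the sample size rather than the full row count, while the per-row transform stays a cheap nearest-center lookup, at the cost of some accuracy in the estimated modes.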

SDV issue #643 is related but takes a different approach: We can save the transformed data so that you do not have to repeat this step for future models. Feel free to follow along and add relevant details to the issue.

Workarounds for overall speedup

  • (Proposed in Large Training Files #226) Train with a smaller subset of the data
  • Train for fewer epochs (the default is 300)
  • Use a different model. I'm curious about your use case and why you'd like to stick to CTGAN/CopulaGAN. The FAST ML TabularPreset optimizes for speed. On my device, it took 46 seconds to model the entire dataset end-to-end.


npatki commented Sep 14, 2022

Hi @arobey1, I'm marking this issue as closed since we've discussed your original question and have some follow-ups/workarounds. As mentioned above, it would be great if you could file a feature request in our RDT library.

Feel free to respond if there is more to discuss and I can re-open the issue.

@npatki npatki closed this as completed Sep 14, 2022