
CUDA out of memory #52

Closed
adamFinastra opened this issue Jun 11, 2020 · 6 comments
Labels
question General question about the software

Comments

@adamFinastra

I have been trying to run a dataset that is ~100k rows with ~40 columns through the synthesizer but am getting

CUDA out of memory. Tried to allocate 2.52 GiB (GPU 0; 11.17 GiB total capacity; 9.79 GiB already allocated; 706.81 MiB free; 10.19 GiB reserved in total by PyTorch)

when I run

ctgan = CTGANSynthesizer(batch_size=50)
ctgan.fit(data, discrete_cols, epochs=3, log_frequency=True)

I have reduced the batch size from the default of 500 down to 50 and am still getting the error above. Is a very small batch size required here with a large dataset?

I am able to run 10-20k rows just fine but would like to synthesize all available data. Any pointers around running large datasets?

@leix28
Collaborator

leix28 commented Jun 12, 2020

Hi,

This can happen when a categorical column has too many distinct values. If the first 10k rows contain, say, 1,000 different categories, the model is still small and fits into GPU memory; with 100k rows there may be many more categories, and the model runs out of memory. A typical example of such a column is a unique ID. Unique-ID columns are not supported by our model.

Is there a categorical column in your data that has a large set of distinct values?
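A quick way to check is to count the distinct values per discrete column. A minimal sketch, assuming `data` is a pandas DataFrame and `discrete_cols` names the categorical columns (the column names below are hypothetical):

```python
import pandas as pd

# Hypothetical example data; replace with your own DataFrame.
data = pd.DataFrame({
    "state": ["MA", "NY", "CA", "NY"],
    "user_id": ["u1", "u2", "u3", "u4"],  # unique id: one category per row
})
discrete_cols = ["state", "user_id"]

# Count distinct categories per discrete column to spot high-cardinality ones.
cardinality = {col: data[col].nunique() for col in discrete_cols}
print(cardinality)  # {'state': 3, 'user_id': 4}
```

Columns where the count is close to the number of rows (like `user_id` here) behave like unique IDs and should be dropped before fitting.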

@adamFinastra
Author

Thank you! I had a couple of columns with 20k and 300k unique categorical values. I am removing those now.

What about taking a high-dimensionality categorical column, remapping it to integers, and then running the model? The model would learn an integer (I can round it if it comes out as a float), and I'd remap it back afterwards.

e.g.
['A','B',...,'Z'] --> [1,2,...,26]
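A minimal sketch of that remapping idea, using `pandas.factorize` for the forward map and rounding plus clipping for the reverse map (the column values here are hypothetical):

```python
import pandas as pd

letters = pd.Series(list("ABCZ"))  # hypothetical high-cardinality column

# Forward map: category -> integer code (0-based here).
codes, categories = pd.factorize(letters)
print(list(codes))  # [0, 1, 2, 3]

# Reverse map: round the model's (possibly float) output to the nearest
# valid code, clip it into range, then look the category back up.
model_output = [0.2, 2.7, 1.1]
recovered = [categories[min(max(round(x), 0), len(categories) - 1)]
             for x in model_output]
print(recovered)  # ['A', 'Z', 'B']
```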

@leix28
Collaborator

leix28 commented Jun 14, 2020

It depends on the meaning of the column.

If the column is a unique identifier, it should be removed.
If the column is ordinal, converting categories to integers may make sense. For example, ["elementary school", "middle school", "college"] --> [1, 2, 3] is reasonable, but ["Massachusetts", "New York", "California"] --> [1, 2, 3] does not make much sense.
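The ordinal case can be sketched with a plain dict mapping (the column and ordering below are hypothetical):

```python
import pandas as pd

education = pd.Series(["college", "elementary school", "middle school"])

# Ordinal columns have a natural order, so an integer encoding preserves
# the meaning of "less than" / "greater than" between categories.
order = {"elementary school": 1, "middle school": 2, "college": 3}
encoded = education.map(order)
print(list(encoded))  # [3, 1, 2]

# Nominal columns like US states have no such order; integer codes would
# impose a spurious distance (e.g. "New York" sitting "between" MA and CA).
```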

@adamFinastra
Author

adamFinastra commented Jun 14, 2020 via email

@csala csala added the question General question about the software label Jun 22, 2020
@leix28
Collaborator

leix28 commented Jul 6, 2020

When a discrete column has that many categories, it usually means there are fewer examples per category, so the learning task becomes much harder. The solution depends on the data and the use case. For example, when there is sufficient data, training multiple models on subsets of the data can help. It is also possible to cluster the categories: in your example, you could replace cities with states so that the number of categories becomes much smaller.
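The clustering idea can be sketched with a simple lookup table (the city-to-state map below is a hypothetical example):

```python
import pandas as pd

cities = pd.Series(["Boston", "Cambridge", "Albany", "Buffalo"])

# Hypothetical mapping that clusters fine-grained categories (cities)
# into coarser ones (states), shrinking the category count.
city_to_state = {
    "Boston": "MA",
    "Cambridge": "MA",
    "Albany": "NY",
    "Buffalo": "NY",
}
states = cities.map(city_to_state)
print(states.nunique())  # 2 states instead of 4 cities
```

The same pattern works for any hierarchy (product -> category, ZIP code -> region) where a coarser grouping still carries the signal you care about.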

@fealho
Member

fealho commented Dec 22, 2020

Closing this issue, as it seems all questions have been answered.

@fealho fealho closed this as completed Dec 22, 2020