CUDA out of memory #52

adamFinastra · 2020-06-11T18:37:15Z

I have been trying to run a dataset that is ~100k rows with ~40 columns through the synthesizer but am getting

CUDA out of memory. Tried to allocate 2.52 GiB (GPU 0; 11.17 GiB total capacity; 9.79 GiB already allocated; 706.81 MiB free; 10.19 GiB reserved in total by PyTorch)

when I run

from ctgan import CTGANSynthesizer

ctgan = CTGANSynthesizer(batch_size=50)
ctgan.fit(data, discrete_cols, epochs=3, log_frequency=True)

I have reduced the batch size from the default of 500 to 50 and still get the error above. Does a large dataset require a very small batch size?

I am able to run 10-20k rows just fine but would like to synthesize all available data. Any pointers around running large datasets?


leix28 · 2020-06-12T04:09:34Z

Hi,

This can happen when a categorical column has too many distinct values. If the first 10k rows contain, say, 1,000 different categories, the model is still small enough to fit in GPU memory. With 100k rows, there may be many more distinct categories, and the model runs out of memory. A typical example of such a column is a unique ID. Unique ID columns are not supported by our model.

Is there a categorical column in your data that has a large set of categorical values?
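
One way to answer that question is to count the distinct values in each discrete column before fitting. The snippet below is a minimal sketch, not part of CTGAN: the helper name `high_cardinality_columns` and the threshold of 1000 are assumptions chosen for illustration, and the toy data stands in for the real dataset.

```python
def high_cardinality_columns(columns, discrete_cols, max_categories=1000):
    """Return {column name: number of distinct values} for every discrete
    column whose distinct-value count exceeds max_categories.

    `columns` is any mapping from column name to an iterable of values
    (a plain dict of lists here). The threshold is an arbitrary assumption.
    """
    flagged = {}
    for name in discrete_cols:
        n_unique = len(set(columns[name]))
        if n_unique > max_categories:
            flagged[name] = n_unique
    return flagged


# Toy data: 'user_id' is unique per row, 'state' has only five categories.
columns = {
    "user_id": [f"u{i}" for i in range(5000)],
    "state": ["MA", "NY", "CA", "TX", "WA"] * 1000,
}

flagged = high_cardinality_columns(columns, ["user_id", "state"])
print(flagged)  # only 'user_id' is reported
```

If the data is already in a pandas DataFrame, `data[discrete_cols].nunique()` gives the same counts in one call.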

adamFinastra · 2020-06-12T13:43:43Z

Thank you! I had a few columns that had 20k unique and 300k unique categorical values. I am removing those now.

What about taking a high-cardinality categorical column, remapping it to integers, and running the model on that? The model would then learn an integer (I can round it if it comes out as a float), and I would map it back to the original category afterwards.

e.g.
['A', 'B', ..., 'Z'] --> [1, 2, ..., 26]

leix28 · 2020-06-14T19:01:19Z

It depends on the meaning of the column.

If the column is a unique identifier, it should be removed.
If the column is an ordinal column, converting categories to integers may make sense. For example, ["elementary school", "middle school", "college"] --> [1, 2, 3] is reasonable. But ["Massachusetts", "New York", "California"] --> [1, 2, 3] does not make much sense.
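
For the ordinal case, the round trip could look like the sketch below. It assumes the category order is known in advance and that the model is then given the integer column as a continuous one; the helper `make_ordinal_codec` is hypothetical and not part of CTGAN. Sampled values are rounded and clipped so they always map back to a valid category.

```python
def make_ordinal_codec(ordered_categories):
    """Build encode/decode functions for an ordinal column.

    Categories are mapped to 1..n in the given order. Decoding rounds
    each sampled value and clips it into the range 1..n.
    """
    to_int = {cat: i + 1 for i, cat in enumerate(ordered_categories)}
    n = len(ordered_categories)

    def encode(values):
        return [to_int[v] for v in values]

    def decode(values):
        decoded = []
        for v in values:
            idx = min(max(int(round(v)), 1), n)  # round, then clip to 1..n
            decoded.append(ordered_categories[idx - 1])
        return decoded

    return encode, decode


encode, decode = make_ordinal_codec(
    ["elementary school", "middle school", "college"]
)

encoded = encode(["college", "elementary school"])
# Synthetic output may be fractional or slightly out of range.
decoded = decode([2.7, 0.2, 1.4])
print(encoded, decoded)
```

This only makes sense because neighbouring integers correspond to neighbouring categories. For a nominal column such as state names, rounding a sampled 1.6 to 2 has no meaningful interpretation.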

adamFinastra · 2020-06-14T22:26:38Z

What if a categorical column has high cardinality? Say there are 50,000 different cities, which are not ordinal. Isn't it more efficient to remap these to integer values 1, 2, ..., 50000 and learn an integer representation? I'm just concerned that 50k different categories could be too much.


leix28 · 2020-07-06T01:39:18Z

When a discrete column has that many categories, it usually means there are few examples for each category, so the learning task becomes much more difficult. The solution to this problem depends on the data and the use case. For example, when there is sufficient data, training multiple models on subsets of the data is helpful. It's also possible to cluster the categories. In your example, you can replace cities with states so that the number of categories becomes much smaller.
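
Replacing cities with states amounts to a lookup before training. The sketch below assumes a city-to-state table is available; the `CITY_TO_STATE` dictionary, the `coarsen` helper, and the "Other" fallback bucket are illustrative assumptions, not part of CTGAN.

```python
# Hypothetical lookup table; a real one would come from reference data.
CITY_TO_STATE = {
    "Boston": "Massachusetts",
    "Cambridge": "Massachusetts",
    "New York City": "New York",
    "Buffalo": "New York",
    "Los Angeles": "California",
}


def coarsen(values, mapping, default="Other"):
    """Replace each fine-grained category with its coarser group.

    Values missing from the mapping fall into a single `default` bucket.
    """
    return [mapping.get(v, default) for v in values]


cities = [
    "Boston", "Buffalo", "Cambridge",
    "Los Angeles", "New York City", "Springfield",
]
states = coarsen(cities, CITY_TO_STATE)

print(len(set(cities)), "categories before,", len(set(states)), "after")
```

The coarsened column is then passed to the synthesizer as the discrete column in place of the original one. The city-level detail is lost, which is the trade-off for a much smaller category set.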

fealho · 2020-12-22T00:33:19Z

Closing this issue, as it seems all questions have been answered.

csala added the "question" label (General question about the software) on Jun 22, 2020

fealho closed this as completed Dec 22, 2020
