CUDA out of memory #52
Comments
Hi, this can happen when a categorical column has too many distinct values. If the first 10k rows contain, say, 1,000 different categories, the model is still small and fits into GPU memory. With 100k rows, there may be many more distinct categories, and the model runs out of memory. A typical example of such a column is a unique ID. Unique ID columns are not supported by our model. Is there a categorical column in your data that has a large set of categorical values? |
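One way to check for such columns before training is to count distinct values per discrete column. This is a minimal sketch using plain pandas, not part of CTGAN; the helper name `high_cardinality_columns` and the threshold of 1000 are assumptions chosen for illustration, and the right cutoff depends on your GPU memory and data.

```python
import pandas as pd


def high_cardinality_columns(df, discrete_columns, max_unique=1000):
    """Return {column: n_unique} for discrete columns above max_unique.

    Hypothetical helper; max_unique is an illustrative threshold only.
    """
    counts = df[discrete_columns].nunique()
    return counts[counts > max_unique].to_dict()


# Toy data: "user_id" is unique per row, "state" has only five values.
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(5000)],
    "state": ["MA", "NY", "CA", "TX", "WA"] * 1000,
})

flagged = high_cardinality_columns(df, ["user_id", "state"], max_unique=1000)
print(flagged)  # only "user_id" should be reported
```

Columns flagged this way are candidates for removal (unique IDs) or for some form of grouping before fitting.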
Thank you! I had a few columns with 20k and 300k unique categorical values; I am removing those now. What about taking a high-cardinality categorical column, remapping it to integers, and running the model? The model would then learn an integer (I can round it if it comes out as a float), and I could try to remap it back, e.g. |
It depends on the meaning of the column. If the column is a unique identifier, it should be removed. |
What if a categorical column has very high cardinality, say 50,000 different cities that are not ordinal? Is it not more efficient to remap these to integer values 1, 2, ..., 50000 and learn an integer representation? I’m just concerned that 50k different categories could be too much.
If the column is an ordinal column, converting categories to integers may make sense. For example, ["elementary school", "middle school", "college"] --> [1, 2, 3] is reasonable. But ["Massachusetts", "New York", "California"] --> [1, 2, 3] does not make much sense.
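For the ordinal case, the mapping to integers and back could be sketched as follows. This uses pandas and NumPy only; the column name `education`, the sample values, and the `sampled` array standing in for model output are all made up for illustration. Rounding and clipping is one possible way to map float outputs back to valid codes, not something the model does for you.

```python
import numpy as np
import pandas as pd

# The category order is explicit, so the integer codes carry real meaning.
education_order = ["elementary school", "middle school", "college"]

df = pd.DataFrame({
    "education": ["college", "elementary school", "middle school", "college"],
})

# Encode as 1, 2, 3 following the declared order.
cat = pd.Categorical(df["education"], categories=education_order, ordered=True)
df["education_code"] = cat.codes + 1

# Hypothetical float values, as a model treating the column as numeric might emit.
sampled = np.array([2.7, 0.2, 1.6])

# Round to the nearest code and clip into the valid range before decoding.
idx = np.clip(np.rint(sampled), 1, len(education_order)).astype(int)
decoded = [education_order[i - 1] for i in idx]
print(decoded)
```

The same recipe applied to unordered values such as states would impose an arbitrary order, which is why it is not recommended there.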
|
When a discrete column has this many categories, it usually means that there are fewer examples for each category, so the learning task becomes much more difficult. The solution to this problem depends on the data and use case. For example, when there is sufficient data, training multiple models on subsets of the data is helpful. It's also possible to cluster the categories. In your example, you can replace cities with states so that the number of categories becomes much smaller. |
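Clustering categories as described above could look like this in pandas. It is a sketch under assumptions: the `city_to_state` lookup, the helper names `coarsen` and `collapse_rare`, and the `"OTHER"` bucket are illustrative and not part of CTGAN. In practice the lookup table would come from your own reference data.

```python
import pandas as pd

# Hypothetical lookup table; a real one would cover every city in the data.
city_to_state = {
    "Boston": "Massachusetts",
    "Cambridge": "Massachusetts",
    "Albany": "New York",
    "Buffalo": "New York",
    "Fresno": "California",
}


def coarsen(series, mapping, other="OTHER"):
    """Map fine-grained categories to coarser ones; unmapped values go to `other`."""
    return series.map(mapping).fillna(other)


def collapse_rare(series, min_count, other="OTHER"):
    """Replace categories seen fewer than min_count times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


df = pd.DataFrame({
    "city": ["Boston", "Cambridge", "Albany", "Buffalo", "Fresno", "Springfield"],
})

df["state"] = coarsen(df["city"], city_to_state)
df["state_grouped"] = collapse_rare(df["state"], min_count=2)
print(df)
```

Either step reduces the number of categories the model has to represent; the cost is that the synthetic data can only contain the coarser values.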
Closing this issue, as it seems all questions have been answered. |
I have been trying to run a dataset that is ~100k rows with ~40 columns through the synthesizer but am getting
CUDA out of memory. Tried to allocate 2.52 GiB (GPU 0; 11.17 GiB total capacity; 9.79 GiB already allocated; 706.81 MiB free; 10.19 GiB reserved in total by PyTorch)
when I run
ctgan = CTGANSynthesizer(batch_size=50)
ctgan.fit(data, discrete_cols, epochs=3, log_frequency=True)
I have reduced the batch size from a default of 500 to 50 and am still getting the above. Is it required to use a very very small batch size here with a large dataset?
I am able to run 10-20k rows just fine but would like to synthesize all available data. Any pointers around running large datasets?