New Feature: Pascal, Cuda 8, Unified memory #3678

deeplearning-ai-research opened this Issue Aug 6, 2016 · 11 comments



deeplearning-ai-research commented Aug 6, 2016


CUDA 8 enables unified memory on Pascal GPUs, putting CPU and GPU in the same address space and extending the memory available to the GPU (at some latency cost) by using CPU RAM.

  1. Would it be possible to train in TensorFlow with a network + data larger than GPU RAM (but smaller than CPU RAM)? It would help reduce distributed-computing/network latency.

i.e., using the idea of oversubscribing GPU memory for large datasets/models.

Here is the CUDA API:

Example of a 64 GB allocation on the GPU:

void foo() {
  // Allocate 64 GB of managed memory, backed by CPU RAM;
  // pages migrate between host and device on demand.
  char *data;
  size_t size = 64ULL * 1024 * 1024 * 1024;  // 64-bit literal: 64*1024*1024*1024 overflows a 32-bit int
  cudaMallocManaged(&data, size);
  // ... use data in kernels ...
  cudaFree(data);
}



zheng-xq commented Aug 7, 2016

UVM is implemented through page faults between CPU and GPU. It makes GPU programs easier to adopt, but not necessarily as fast as possible. Without more investigation, we are not sure it is suitable for high-performance machine learning. Note that typical bandwidth across PCIe is often about two orders of magnitude lower than the bandwidth of the GPU's own DRAM.

If the model is indeed too large to fit into GPU memory, it may make sense to load parts of the model in parallel, instead of relying on page faults in the kernels to page in the data.
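The explicit alternative described above can be sketched as follows. This is an illustrative CPU-only model, not TensorFlow code, and `chunked_sum` and its buffer sizes are made-up names for the pattern: stage an oversized, host-resident array through a small fixed "device" buffer, chunk by chunk, so the compute step always reads fast local memory. In real CUDA code the `memcpy` would be a `cudaMemcpyAsync` on one stream while a kernel consumes the previous chunk on another, overlapping transfer with compute instead of faulting pages in on demand.

```c
#include <string.h>

#define CHUNK 1024  /* elements that fit on the "device" at once */

/* Sum a large host-resident array by staging it through a small
 * device-sized buffer, so compute never touches host memory directly. */
long long chunked_sum(const int *host_data, size_t n) {
    int device_buf[CHUNK];   /* stands in for a cudaMalloc'd buffer */
    long long total = 0;
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        memcpy(device_buf, host_data + off, len * sizeof(int)); /* "H2D" copy */
        for (size_t i = 0; i < len; i++)  /* the "kernel": local memory only */
            total += device_buf[i];
    }
    return total;
}
```

The point of the pattern is that the application, not the driver, decides which chunk is resident next, so the transfer schedule can be pipelined ahead of the compute.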



arita37 commented Aug 9, 2016

CUDA 8 with Pascal has NVLink, which is 80 GB/s, so latency to RAM is very low.
It allows creating a larger-than-GPU-memory buffer through a single memory allocation,
at very low latency. Performance would be enhanced.

See the slides.




zheng-xq commented Aug 9, 2016

My understanding was that 80 GB/s is between devices. The CPU/GPU communication through PCIe used by UVM is only a small fraction of that. And both are much smaller than the 720 GB/s the GPU gets when accessing its own memory.

This is an active research area, and we might still find a good use for UVM down the road. But the current belief is that it is better to page memory in and out with the CPU in parallel, while the compute engine on the GPU accesses its own memory at full speed. A good example is the "swap_memory" option in "tf.while_loop", which swaps the temporary memory created in the loop out to the host when device memory is under pressure.
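Schematically, the swapping idea works like this. The sketch below is an illustrative C model, not TensorFlow's implementation, and `forward`/`backward`/`host_swap` are invented names: each loop temporary is copied out to host memory as it is produced, and copied back only when the backward pass needs it, so "device" memory holds just the current step. In real code the two copies would be D2H/H2D `cudaMemcpyAsync` transfers overlapped with compute.

```c
#define STEPS 8

static double host_swap[STEPS];      /* stands in for CPU RAM */

/* Forward loop: a_t = 2 * a_{t-1}; swap each activation out to the
 * host before it is overwritten, keeping one value resident. */
double forward(double a0) {
    double device_a = a0;
    for (int t = 0; t < STEPS; t++) {
        host_swap[t] = device_a;     /* "D2H": swap out */
        device_a *= 2.0;
    }
    return device_a;                 /* 2^STEPS * a0 */
}

/* Backward loop: swap each saved activation back in as it is needed.
 * For this linear example d(2a)/da = 2, so the gradient is 2^STEPS. */
double backward(double grad_out) {
    double grad = grad_out;
    for (int t = STEPS - 1; t >= 0; t--) {
        double device_a = host_swap[t];  /* "H2D": swap in saved value */
        (void)device_a;  /* unused here: this example's gradient is constant */
        grad *= 2.0;
    }
    return grad;
}
```

The trade-off is the same one discussed above: device memory pressure drops from O(loop length) to O(1), at the cost of PCIe traffic that must be hidden behind compute.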




yaroslavvb commented Oct 3, 2017

initial support added in cd4f584



evolu8 commented Oct 19, 2017

UVM support throughout would be of enormous benefit. I would be very keen to see this implemented. Ideal training throughput is often less important than coping with large inputs and large models that simply fail to fit on a card.




byronyi commented Oct 19, 2017

Looks like another “we don’t need distributed transactions at Google, so we let users implement crappy/wrong versions of their own,” as with BigTable/MegaStore.

Now we have Spanner :)



ranshadmi commented Oct 21, 2017

+1 for "UVM support throughout"!


Orna123 commented Oct 22, 2017

+1 for "UVM support throughout"!



evolu8 commented Nov 10, 2017

Any movement on this?




tarlovsky commented May 30, 2018

Hey guys, any progress? Highly interested!




smit-hinsu commented Jun 16, 2018

UVM support was added recently in b113981.
