# Managed Memory

## Learning Objectives

In this lab we will learn about managed memory, an approach to data management that substantially improves productivity. In particular, we will:

- Learn what Unified Memory (managed memory) is and how to use it
- Understand the performance limitations and how to avoid them if desired

## Prerequisites

It is assumed that participants understand data management using `cudaMalloc()`, `cudaFree()`, and `cudaMemcpy()`. (Although managed memory provides an alternative to these functions, we contrast it with the explicit memory management APIs.)

## Motivation

In our first introduction to CUDA, we demonstrated a three-step processing workflow for CUDA. 

![](images/simple_processing_flow.png)

Wouldn't it be nice if we didn't explicitly have to do steps 1 and 3?

## Unified Memory

Starting with CUDA 6 and the Kepler generation of GPUs, NVIDIA introduced Unified Memory (synonymous with "managed memory"). When using unified memory, you can allocate pointers that can be dereferenced in *both* host and device code.

![](images/unified_memory_cuda_6_kepler.png)

Then, in CUDA 8 with the Pascal generation of GPUs, Unified Memory was greatly expanded into the model we have today.

![](images/unified_memory_cuda_8_pascal.png)

As a result, we can take code that looks like this with manual data management:

![](images/simplified_memory_management_before.png)

and turn it into:

![](images/simplified_memory_management_after.png)

## Unified Memory Properties

Let's look at a code sample to understand how the unified memory implementation works. First, we'll discuss how it works on Linux, for Pascal generation and later GPUs.

In this code sample, we allocate data with the unified memory allocator, `cudaMallocManaged()`. The allocated pointer is then accessed on the CPU (using `memset()`), then on the GPU (in the kernel `setValue()`), and then on the CPU again (in the function `useData()`).

```
// The pointer type matches the char buffer allocated in foo()
__global__
void setValue(char *ptr, int index, int val)
{
  ptr[index] = val;
}


void foo(int size) {
  char *data;
  cudaMallocManaged(&data, size);

  memset(data, 0, size);

  setValue<<<...>>>(data, size/2, 5);
  cudaDeviceSynchronize();

  useData(data);
  
  cudaFree(data);
}
```

The animation below describes what happens. On both CPUs and GPUs, on the vast majority of operating systems, virtual memory is allocated in chunks called *pages*, and when an element of its data is touched, the processor asks if there is a corresponding physical memory location that the virtual memory is backed by. If there is no such location known to the processor's memory management unit (MMU), we have a *page fault*, and the system must provide physical memory to match the virtual memory. This is true on both CPUs and GPUs when using unified memory; in the code above, the CPU will page fault when `memset()` is called, and some pages will be filled. Then, when `setValue()` is executed on the GPU, the GPU will page fault, and the CUDA driver will *migrate* the pages from the CPU to the GPU. After the kernel completes, the `useData()` function on the host will page fault because the pages no longer reside in CPU RAM, and the CUDA driver will migrate the pages back to the CPU.

![](images/managed_memory_page_faulting.gif)
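
The CPU, GPU, CPU access pattern described above can be written as a small self-contained sketch. It makes a few assumptions that are not in the original sample: a single-thread launch (`<<<1, 1>>>`) stands in for the elided launch configuration, the host read replaces `useData()`, and `run_migration_demo()` is a hypothetical wrapper name. The comments about page migration apply to the Linux, Pascal-and-later behavior discussed here.

```cuda
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Writes one byte of a managed allocation from the GPU.
__global__ void setValue(char *ptr, size_t index, char val)
{
  ptr[index] = val;
}

// Hypothetical wrapper: touches the same managed allocation on the CPU,
// then the GPU, then the CPU again.
// Returns the byte written by the GPU, or -1 if a CUDA call failed.
int run_migration_demo(size_t size)
{
  char *data = nullptr;
  if (cudaMallocManaged(&data, size) != cudaSuccess) return -1;

  // CPU touch: pages are populated in host memory.
  memset(data, 0, size);

  // GPU touch: the page holding data[size / 2] is migrated to the GPU.
  setValue<<<1, 1>>>(data, size / 2, 5);
  if (cudaDeviceSynchronize() != cudaSuccess) {
    cudaFree(data);
    return -1;
  }

  // CPU touch: the page is migrated back to host memory.
  int result = data[size / 2];

  cudaFree(data);
  return result;
}
```

Calling `run_migration_demo(1 << 20)` from `main()` should return the value 5 that the kernel stored; no `cudaMemcpy()` appears anywhere in the code.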

It is possible to allocate more managed memory than there is GPU memory:

```
void foo() {
  // Assume GPU has 16 GB memory
  // Allocate 64 GB
  char *data;
  // be careful with size type:
  size_t size = 64ULL * 1024 * 1024 * 1024;
  cudaMallocManaged(&data, size);
}
```

Only a subset of the pages will reside on the GPU at any one time. If the GPU memory is full and new pages are requested to move over, the driver will *evict* some pages back to the CPU. This enables you to work on datasets that are larger than can fit in the GPU. However, achieving reasonable performance in this case may require some effort.

### Exercise

Let's test this oversubscription idea to verify that it works. [exercises/vector_addition.cu](exercises/vector_addition.cu) is a vector addition implementation that currently uses arrays which are much smaller than GPU memory. Change the length of the arrays so that the total amount of dynamic memory allocated is larger than GPU memory. Does the time it takes to run the code proportionately or disproportionately increase as you begin to overflow GPU memory?

For each of the sizes you try, calculate the total size of the data you've allocated and estimate an effective bandwidth of the kernel. How does it compare to the DRAM bandwidth of the GPU? Don't worry if it is slow -- we will talk about Unified Memory performance later.
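
One way to do the bandwidth estimate is sketched below. It assumes that the kernel computes `c[i] = a[i] + b[i]`, so each element costs two reads and one write, and that you have measured the elapsed time in seconds yourself. The function names are hypothetical, and the element size is a parameter because the exercise's data type is not stated here.

```cuda
#include <cstddef>

// Bytes moved by c[i] = a[i] + b[i]: two reads and one write per element.
double vector_add_bytes(size_t n, size_t bytes_per_element)
{
  return 3.0 * static_cast<double>(n) * static_cast<double>(bytes_per_element);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a time given in seconds.
double effective_bandwidth_gbs(double bytes, double seconds)
{
  return bytes / seconds / 1.0e9;
}
```

For example, 1000 four-byte elements move 12000 bytes in total. If the measured time includes page migration, the result reflects the interconnect between CPU and GPU as much as the GPU's DRAM.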

!nvcc -arch=native -o vector_addition exercises/vector_addition.cu
%time !./vector_addition

It is also possible to *concurrently* access a unified memory allocation on both CPU and GPU.

```
__global__ void mykernel(char *data) {
  data[1] = 'g';
}

void foo() {
  char *data;
  cudaMallocManaged(&data, 2);

  mykernel<<<...>>>(data);
  // no synchronize here
  data[0] = 'c';

  cudaFree(data);
}
```

However, the fact that this is possible does not exempt you from considering race conditions. Generally, the unified memory implementation does not enforce ordering or visibility guarantees for concurrent CPU-GPU accesses. In the example above, without a synchronization after the kernel, there is no guarantee about what the value of either `data[0]` or `data[1]` will be. (If the two elements were on different memory pages, this example might happen to work as desired.)
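
For contrast, here is a minimal sketch of the same example with a synchronization before the CPU write, which removes the race. The single-thread launch and the wrapper name `write_from_both_sides()` are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

__global__ void mykernel(char *data)
{
  data[1] = 'g';
}

// Hypothetical wrapper: returns true if both the GPU write and the CPU write
// are visible on the host afterwards.
bool write_from_both_sides()
{
  char *data = nullptr;
  if (cudaMallocManaged(&data, 2) != cudaSuccess) return false;

  mykernel<<<1, 1>>>(data);

  // Wait for the kernel to finish before the CPU touches the allocation.
  if (cudaDeviceSynchronize() != cudaSuccess) {
    cudaFree(data);
    return false;
  }
  data[0] = 'c';

  bool ok = (data[0] == 'c') && (data[1] == 'g');
  cudaFree(data);
  return ok;
}
```

With the synchronization in place the CPU and GPU never access the allocation at the same time, so the outcome no longer depends on which page each element lives on.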

With that said, on Pascal and later GPUs, system-wide atomic operations are possible. These can be combined with CPU atomic operations for joint CPU-GPU atomic operations.

```
__global__ void mykernel(int *addr) {
  // GPU atomic:
  atomicAdd_system(addr, 10);
}

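void foo() {
  int *addr;
  // Pass the address of the pointer so the allocator can set it
  cudaMallocManaged(&addr, sizeof(int));
  *addr = 0;

  mykernel<<<...>>>(addr);
  // CPU atomic:
  __sync_fetch_and_add(addr, 10); 

  // Wait for the kernel before reading the result or freeing the allocation
  cudaDeviceSynchronize();
  cudaFree(addr);
}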
```

There are a few contexts where this unified memory implementation is not available and a less powerful implementation is used instead: pre-Pascal GPUs, all Jetson GPUs, and Windows. In those situations:

- When you launch a kernel, *all* managed data migrates immediately to the GPU
- CPU page faulting works as normal
- No concurrent access to unified memory between CPUs and GPUs is permitted
- The limit of allocatable unified memory is the size of the GPU DRAM

![](images/pre_pascal_unified_memory_page_faulting.gif)
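
If you are not sure which of the two implementations a given system provides, one option is to query the device attribute `cudaDevAttrConcurrentManagedAccess`. The sketch below wraps that query in a hypothetical helper, `concurrent_managed_access()`; treating the attribute as the indicator for the behavior described in this section is an interpretation, not something the lab itself states.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports support for concurrent
// CPU-GPU access to managed memory, 0 if it does not, and -1 if the query failed.
int concurrent_managed_access(int device)
{
  int value = 0;
  cudaError_t err =
      cudaDeviceGetAttribute(&value, cudaDevAttrConcurrentManagedAccess, device);
  return (err == cudaSuccess) ? value : -1;
}
```

A result of 0 suggests that the restrictions listed above apply, so code should synchronize before touching managed data on the CPU after a kernel launch.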

## Unified Memory Use Cases

Unified Memory is primarily designed around productivity. It is nice that you don't have to know where and when data motion occurs -- the CUDA driver will figure this out for you. However, there are some particular use cases where unified memory truly shines.

### Deep Copy

Suppose we define a struct as follows:

```
struct dataElem {
  int key;
  int len;
  char *name;
};
```

If we have an instance of this struct on the host, and we want to copy it to the device, we need to allocate a device copy of the data and also all the data it points to:

```
void launch(dataElem *elem) {
  dataElem *d_elem;
  char *d_name;

  int namelen = strlen(elem->name) + 1;

  // Allocate storage for struct and name
  cudaMalloc(&d_elem, sizeof(dataElem));
  cudaMalloc(&d_name, namelen);

  // Copy up each piece separately, including new "name" pointer value
  cudaMemcpy(d_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
  cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
  cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*), cudaMemcpyHostToDevice);

  // Finally we can launch our kernel, but CPU and GPU use different copies of "elem"
  kernel<<< ... >>>(d_elem);
}
```

Obviously, this can be very tedious if there are many data elements to copy. However, if all of the data has been allocated with managed memory, we can simply launch the kernel:

```
void launch(dataElem *elem) {
  kernel<<< ... >>>(elem);
}
```
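
For this to work, both the struct and the string it points to must come from `cudaMallocManaged()`. A minimal sketch of what that allocation might look like follows; `make_elem()` and `launch_and_read()` are hypothetical helpers, the kernel body is purely illustrative, and a single-thread launch is assumed.

```cuda
#include <cstring>
#include <cuda_runtime.h>

struct dataElem {
  int key;
  int len;
  char *name;
};

// Illustrative kernel: follows the nested pointer on the GPU and stores
// the first character of the name in key.
__global__ void kernel(dataElem *elem)
{
  elem->key = elem->name[0];
}

// Hypothetical helper: builds a dataElem entirely in managed memory.
dataElem *make_elem(const char *name)
{
  dataElem *elem = nullptr;
  if (cudaMallocManaged(&elem, sizeof(dataElem)) != cudaSuccess) return nullptr;

  elem->key = 0;
  elem->len = static_cast<int>(strlen(name));
  if (cudaMallocManaged(&elem->name, elem->len + 1) != cudaSuccess) {
    cudaFree(elem);
    return nullptr;
  }
  memcpy(elem->name, name, elem->len + 1);  // includes the terminating '\0'
  return elem;
}

// Hypothetical helper: launches the kernel on the same pointers the CPU uses
// and returns the key written by the GPU, or -1 on allocation failure.
int launch_and_read(const char *name)
{
  dataElem *elem = make_elem(name);
  if (elem == nullptr) return -1;

  kernel<<<1, 1>>>(elem);
  cudaDeviceSynchronize();  // required before the CPU reads elem->key

  int key = elem->key;
  cudaFree(elem->name);
  cudaFree(elem);
  return key;
}
```

The pointer stored in `elem->name` is valid on both processors, so there is no device-side copy of the struct to keep in sync.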

### Linked List

Another example would be a linked list shared between the CPU and GPU. Since a linked list is a chain of pointers, logic that correctly keeps the CPU and GPU copies consistent would be fairly complex. With managed memory, we don't have to worry about that; we can just use the data when it is needed.

![](images/linked_list.png)

### Exercise

Let's experiment with this linked list idea. In [exercises/linked_list.cu](exercises/linked_list.cu) you will find example code that attempts to print out a specific member of a linked list on both the CPU and GPU. However, because the data is not accessible on the GPU, the kernel will fail. Rewrite this code to use managed memory (this should just be a one-line change) and verify that you get the expected result. The solution can be found in [solutions/linked_list.cu](solutions/linked_list.cu).

!nvcc -arch=native -o linked_list exercises/linked_list.cu
!./linked_list

### Automatic Memory Management for C++ Objects

Managed memory also helps substantially in managing complex C++ classes. We can declare a base class `Managed`:

```
class Managed {
public:
  void *operator new(size_t len) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    cudaDeviceSynchronize();
    return ptr;
  }

  void operator delete(void *ptr) {
    cudaDeviceSynchronize();
    cudaFree(ptr);
  }
};
```

and then other classes can derive from it:

```
// Deriving from "Managed" allows pass-by-reference to kernel
class String : public Managed {
  int length;
  char *data;

public:
  // Unified memory copy constructor allows pass-by-value to kernel
  String (const String &s) {
    length = s.length;
    cudaMallocManaged(&data, length);
    memcpy(data, s.data, length);
  }

  // ...
};
```

Here we also implement a copy constructor that allocates the data with managed memory.

This can also be used for structs:

```
class dataElem : public Managed {
public:
  int prop1;
  int prop2;
  String name;
};

...

dataElem *data = new dataElem[N];

...

// C++ now handles our deep copies
kernel<<< ... >>>(data);

```

Now we can have kernels that both pass by reference and pass by value:

```
// Pass-by-reference version
__global__ void kernel_by_ref(dataElem &data) { ... }

// Pass-by-value version
__global__ void kernel_by_val(dataElem data) { ... }

int main(void) {
  dataElem *data = new dataElem;
  ...
  // pass data to kernel by reference
  kernel_by_ref<<<1,1>>>(*data);

  // pass data to kernel by value -- this will create a copy
  kernel_by_val<<<1,1>>>(*data);
}
```

## Performance Considerations

Consider the following scenario:

```
__global__ void kernel(float *data) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  data[idx] = idx;  // any per-element value works here
}

...

int n = 256 * 256;
float *data;
cudaMallocManaged(&data, n * sizeof(float));
kernel<<<256, 256>>>(data);
```

This kernel runs *much* slower than the case where we explicitly allocate and copy the data with `cudaMalloc()` and `cudaMemcpy()`. The reason is that the data migrates on demand: the kernel's accesses trigger many page faults, each with some service overhead, so the data arrives through many small, inefficient copies rather than one efficient bulk copy.

If this overhead significantly affects your application performance, you can trigger a bulk copy with:

```
cudaMemPrefetchAsync(ptr, length, destDevice);
```

(As is suggested by the name, the resulting copy happens asynchronously, like kernel execution.)

In the code sample above, that would look like:

```
// Note that the default device is 0
cudaMemPrefetchAsync(data, n * sizeof(float), 0);
kernel<<<256, 256>>>(data);
cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId); // copy back to host
```


### Exercise

Consider the code in [exercises/array_increment.cu](exercises/array_increment.cu), which allocates an array on the host and then increments every value in the array on the device.

Let's compile and run the code as-is, noting the duration of the kernel:

!nvcc -arch=native -o array_increment exercises/array_increment.cu
!nsys profile --stats=true ./array_increment

Now convert the code to use unified memory. First, replace `cudaMalloc()` with `cudaMallocManaged()`, and eliminate the calls to `cudaMemcpy()`. Again, note the kernel runtime. Then, use `cudaMemPrefetchAsync()` to improve the performance, and validate that with your profiling results. Look at [solutions/array_increment.cu](solutions/array_increment.cu) if you need help.

!nvcc -arch=native -o array_increment exercises/array_increment.cu
!nsys profile --stats=true ./array_increment

If you like, you can also experiment with running the kernel many times in a row (say, 10000). This is representative of many real world use cases where the data is transferred to the device once and stays there for a long time. What can we say about the fraction of the time spent in kernels versus memory operations in this case?

## Unified Memory Hints

You can advise the unified memory runtime on expected memory access behaviors with:

```
cudaMemAdvise(ptr, count, hint, device);
```

Some available "hints" are:

- `cudaMemAdviseSetReadMostly`: specifies read duplication (both CPU and GPU have a copy)
- `cudaMemAdviseSetPreferredLocation`: suggest best location (data will stay here if possible)
- `cudaMemAdviseSetAccessedBy`: suggest a page mapping (to avoid page faults on later access)

Note that these hints don’t trigger data movement by themselves. For more details, see the [CUDA Runtime API documentation](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge37112fc1ac88d0f6bab7a945e48760a).

## Review

In this lab we learned:

- How to allocate Unified Memory (managed memory) with `cudaMallocManaged()`
- The on-demand paging nature of Unified Memory transfers on Linux with Pascal and later GPUs
- How to asynchronously copy managed data to the device (or back to the host) with `cudaMemPrefetchAsync()`
- Some ideas for how to optimize applications that use Unified Memory

## Further Study

[NVIDIA Developer Blog: Unified Memory for CUDA Beginners](https://devblogs.nvidia.com/unified-memory-cuda-beginners/)

[NVIDIA Developer Blog: Unified Memory in CUDA 6](https://devblogs.nvidia.com/unified-memory-in-cuda-6/)

[NVIDIA Developer Blog: Maximizing Unified Memory Performance in CUDA](https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/)

[GTC 2018: Everything You Need to Know About Unified Memory](http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf)

[CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd)

CUDA Samples: conjugateGradientUM

## Lab Materials

You can download this notebook using the `File > Download as > Notebook (.ipynb)` menu item. Source code files can be downloaded from the `File > Download` menu item after opening them.