Skip to content


Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
tree: 4ea1bde7c6
Fetching contributors…

Cannot retrieve contributors at this time

214 lines (194 sloc) 10.172 kb
\section{Device Memory Management}
Gdev manages the device memory using the virtual memory management unit
of the GPU.
%We store a page table for each GPU context in the device memory but not
%in the host memory to reduce paging latency.
%For every memory copy operation, Gdev configures DMA engines to write a
%sequential value to the specified memory area so that it can poll the
%value to wait for the operations to be completed.
In addition to generic pieces of memory management, Gdev
identifies how to increase memory-copy throughput and support shared
memory and memory swapping for the GPU.
\subsection{Memory-Copy Optimization}
The memory-copy throughput could govern application performance overall,
since the data need to be moved across the device and host memory in GPU
Although our goal is to enhance resource management for the GPU in
time-sharing systems, we still respect standalone performance, since it
relates to a practical use of our solution in the real world.
We hence study the characteristic of memory-copy operations for the GPU.
It should be noted that our problem is different from those considered
in previous work~\cite{Jablin_PLDI11, Rossbach_SOSP11} in that we are
looking into a basic single instance of memory-copy transaction, while
the previous work addressed more applied situations where multiple
contexts compete for memory-copy transaction.
First of all, we have found that the memory-copy API functions provided
by proprietary software~\cite{CUDA40} are well-optimized for standalone
In the following, hence, we disclose how to realize this optimization.
\textbf{Split Transaction:}
In order to copy data between the device and host memory, the data
typically need to be copied twice, unless users directly allocate
buffers to the host I/O memory.
For example, when uploading user buffers from the host to the device
memory, the data are first copied from the main memory to intermediate
\textit{bounce} buffers in the I/O memory accessible to the GPU, and
then copied to the device memory.
To optimize this two-step memory-copy operation, we split each operation
by a fixed size of chunks, which play a role of ping-pong buffers,
\textit{i.e.}, while some chunk is transferred between the main and I/O
memory, the preceding chunk can be transferred between the I/O and
device memory.
In this way, only the first and last chunks need to be transferred
alone, reducing the total makespan almost half.
This split transaction also needs only a small size of bounce buffers
equal to the chunk size, reducing the usage of the host I/O memory
%In Gdev, the chunk size is configurable.
%The same idea can also be applied to all types of DMA engines desribed in
\textbf{Direct I/O Access:}
The split transaction is effective for a large size of data.
For a small size of data, however, the use of DMA engines incurs
non-trivial overhead by itself.
Hence, we also employ a method to read/write data one by one by mapping
device memory space onto host I/O memory space, rather than send/receive
data in burst mode by exploiting DMA transaction.
We have found that such a direct I/O access method is much faster than
using DMA engines for a small size of data.
In our experiment presented in Section~\ref{sec:evaluation}, we will
show a boundary on the data size that inverts the superiority of I/O
access and DMA, together with the best chunk size, to optimize
memory-copy throughput.
\subsection{Shared Memory Support}
The current GPU programming model does not support IPC.
For example, data communication among contexts incurs significant
overhead by copying data back and forth between the device- and
host-memory buffers.
Currently, an OS dataflow abstraction~\cite{Rossbach_SOSP11} is a useful
tool to minimize such data movement costs; however users are required to
use a dataflow programming model and understand that optimization is
applied implicitly by the OS at runtime.
It would be nice if users could manage data communication among contexts
easily using a familiar method, such as a POSIX IPC mechanism.
Gdev supports shared memory for the GPU, providing a set of API
functions, as listed in Table~\ref{tab:gdev_api}, based on the POSIX IPC
standard, \textit{i.e.}, \texttt{gshmget}, \texttt{gshmat},
\texttt{gshmdt}, and \texttt{gshmctl} correspond to \texttt{shmget},
\texttt{shmat}, \texttt{shmdt}, and \texttt{shmctl} respectively.
We have also added \texttt{cuShmGet}, \texttt{cuShmAt},
\texttt{cuShmDt}, and \texttt{cuShmCtl} to our CUDA API
implementation, which correspondingly call the Gdev shared memory
functions, so that CUDA applications can easily leverage Gdev's shared
memory support.
Our shared memory design is straightforward, though its implementation
is challenging.
Upon the first call to \texttt{gshmget}, new space is allocated to the
device memory, similar to \texttt{gmalloc}, and it holds an identifier
to this memory object.
After the first call, however, Gdev only returns this identifier to the
The allocated space is mapped onto the context virtual address space
when \texttt{gshmat} is called.
Address mapping is done by setting the page table so that the virtual
addresses point to the shared physical memory space.
The allocated space can also be unmapped by \texttt{gshmdt} and freed by
Gdev counts the number of users referencing the shared memory, and frees
it when unmapped by the last user -- hence the call to \texttt{gshmctl}
is optional.
If the shared memory needs to be accessed exclusively, the host
program itself is responsible for taking care of traditional
mutex/semaphore mechanisms.
We believe that our shared memory scheme can be easily integrated into
GPU programming.
For example, legacy CUDA applications can use our shared memory scheme
by replacing \texttt{cuMemAlloc} with \texttt{cuShmGet} and
\texttt{cuShmAt}, and \texttt{cuMemFree} with \texttt{cuShmDt}.
\subsection{Memory Swapping}
The proprietary software in Linux~\cite{CUDA40} fails to allocate
device memory, when the memory demand exceeds the physical memory
capacity, while Windows software can somehow swap device memory,
according to PTask~\cite{Rossbach_SOSP11}.
In either case, however, it is not well studied how to swap device
memory and speed up its operation.
Gdev uses shared memory to achieve device memory swapping.
When a memory allocation request fails due to a short of free memory
space, Gdev seeks memory objects whose allocated size is greater than
the requested size, and selects one owned by a low-priority context,
where ties are broken arbitrarily.
Gdev here ensures not to select a memory object from the caller context
Once a victim memory object is selected, it is shared with the caller
context, and behaves as an \textit{implicit} shared memory object.
%The allocated space is never freed until unreferenced by all associated
Unlike an explicit shared memory object presented in
Section~\ref{sec:shared_memory}, however, the implicit shared memory
object needs to evict data when other contexts are accessing it, and
retrieve them later when the corresponding context is resumed.
Thanks to the Gdev API design, we know exactly when contexts could
access the shared memory: it could be accessed when either a family of
\texttt{gmemcpy*} or \texttt{glaunch} is called.
They however need to be handled in a different manner:
\item \texttt{gmemcpy*} accesses a specific range of address space
given by the function arguments.
Hence, we need to evict and retrieve data related to this range.
\item \texttt{glaunch}, on the other hand, does not tell which address
range could be accessed when it is called, especially due to
dynamically allocated memory space.
Hence, we need to evict and retrieve data associated with all
memory objects owned by the corresponding context.
We allocate swap buffers to the host main memory to save the evicted
It uses \texttt{gmemcpy\_from\_device} and \texttt{gmemcpy\_to\_device}
to evict and retrieve data.
These memory swapping procedures are not visible to application programs.
It should be noted that swapping does not occur when downloading data
from the device memory.
Even if the data are evicted, we can copy the data from the host swap
buffer directly.
\textbf{Reducing Latency:}
The swapping latency could be non-trivial depending on the data size.
Given memory copy operations within the device memory faster than
those between the device and host memory by an order of ten, Gdev
reserves a certain amount of device memory space to use as temporal swap
Data can be evicted to this device swap space temporarily, if they fit
the space, to reduce the swapping latency.
The temporarily-evicted data are eventually evicted to the host memory
after a while to free the swap space for other contexts.
Gdev tries to hide the latency of this second data eviction by
overlapping it with the computation launched by the context itself.
To do so, it creates a special GPU context that is dedicated to the
device-to-host data movement, since the compute and DMA units cannot be
used by the same CPU context simultaneously.
This approach is reasonable, since some computation is likely following
the data eviction.
For example, \texttt{glaunch} will apparently launch some computation, and
\texttt{gmemcpy\_to\_device} is also often called prior to \texttt{glaunch}.
The evicted data, if exist, need to be retrieved when \texttt{glaunch}
is called for the associated GPU context after evicting the current data
of some other context sharing the memory space.
If the evicted data still exist in the device swap space, they can be
retrieved quickly.
Else, Gdev retrieves them from the host main memory.
Jump to Line
Something went wrong with that request. Please try again.