Merge branch 'master' of

2 parents dca630f + faa05a1 commit fe9d696272a32dddd761a27ddef162de4cb586e1 Shinpei Kato committed
13 final/Makefile
@@ -0,0 +1,13 @@
+TARGET = main
+
+all:
+	latex $(TARGET)
+	bibtex $(TARGET)
+	latex $(TARGET)
+	latex $(TARGET)
+	dvips -t letter -o $(TARGET).ps $(TARGET).dvi
+	ps2pdf $(TARGET).ps
+	#dvipdf draft.dvi
+
+clean:
+	rm -fr *~ *.aux *.ps *.pdf *.dvi *.log *.bbl *.blg *.ent
32 final/abstract.tex
@@ -0,0 +1,32 @@
+ The graphics processing unit (GPU) has become a very powerful platform
+ embracing a concept of heterogeneous many-core computing.
+ Despite its significant benefits in performance, however, the
+ application domains of GPUs are currently limited, largely due to
+ a lack of first-class resource management primitives to support the GPU
+ in general-purpose time-sharing systems.
+ We present Gdev, a new approach to GPU resource management in the
+ operating system (OS), which allows user-space applications and the OS
+ itself to use GPUs as first-class computing resources.
+ It integrates runtime support into the OS, coordinated with the
+ device driver, to extend a class of applications that can benefit from
+ the GPU.
+ This runtime-unified OS approach also enhances the memory management
+ and scheduling capabilities for the GPU.
+ Specifically, Gdev supports shared memory for inter-process
+ communication among GPU contexts and memory swapping for memory
+ allocation demands that exceed the physical space.
+ Scheduling is also provided to control GPU resource usage for
+ computations and data transmissions.
+ Furthermore, Gdev enables the GPU to be virtualized into logical GPUs,
+ isolating a certain portion of GPU resources from other time-sharing users.
+ We implement an open-source prototype of Gdev for Linux on NVIDIA
+ GPUs, to identify the advantages and disadvantages of using Gdev
+ compared to the proprietary software and previous work.
+ Our detailed experiments show that Gdev can, for instance, gain about
+ 2x speedups for an encrypted filesystem leveraging the GPU, improve the
+ makespans of data-flow programs by up to 49\%, and manage the virtual
+ GPU utilization within an error of 7\%.
19 final/conclusion.tex
@@ -0,0 +1,19 @@
+This paper has presented Gdev, a new approach to GPU resource management
+that integrates runtime support into the OS.
+Gdev also provides memory management and scheduling schemes
+for the GPU to enable GPUs to be used as first-class computing
+resources in time-sharing systems.
+We implemented a prototype of Gdev, and conducted thorough experiments
+to demonstrate the advantages and disadvantages of our Gdev approach.
+Our conclusion is that some basic performance must be compromised by
+moving runtime support into the OS, but in return GPU resources can be
+efficiently shared among users in time-sharing systems using Gdev.
+Our Gdev prototype and the application programs used in our experiments
+are open-source, and may be downloaded from {\sf
119 final/ecosystem.tex
@@ -0,0 +1,119 @@
+\section{Gdev Ecosystem}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.863\hsize]{eps/gdev.eps}\\
+ \vspace{-0.5em}
+ \caption{Logical view of Gdev ecosystem.}
+ \label{fig:gdev}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+Gdev is aimed at extending GPU resource management capabilities and a
+class of applications that can benefit from the GPU.
+To this end, it integrates runtime support into the OS.
+Figure~\ref{fig:gdev} illustrates the overview of the Gdev ecosystem.
+First, Gdev optionally supports a traditional execution model for
+compatibility\footnote{For practical use, we can disable this user-space
+runtime.}, where
+applications make API calls to the user-space runtime library, and the
+library sends GPU commands to the device driver via the system call.
+As demonstrated in previous work~\cite{Kato_ATC11}, however, it is hard
+to analyze at runtime the sequence of GPU commands (there could be
+hundreds of commands for one operation), and it would not be appropriate
+to use resource management primitives along with command calls.
+This argument motivates our Gdev approach.
+Unlike traditional approaches, Gdev employs runtime support in
+the OS, providing GPU resource management primitives along with API calls.
+OS applications can use this runtime directly, while user-space
+applications can also use it through the wrapper library provided by
+Gdev, which simply relays API calls to the OS runtime using the system call.
+This runtime-unified approach enables Gdev to be API-driven, where the
+scheduler, for instance, is invoked for GPU computations and host-device
+data transmissions upon corresponding API calls.
+\textbf{Low-Level API:}
+Gdev provides a set of low-level API functions for GPU programming, as
+listed in Table~\ref{tab:gdev_api}.
+Programmers may use either this low-level API or a high-level API,
+\textit{e.g.}, CUDA, built on top of Gdev API.
+Gdev supports several other API functions, including asynchronous data
+transmission, but they are not within the scope of this paper.
+Table~\ref{tab:gdev_api} presents an initial set of the Gdev API; we
+plan to extend it to make more functions available, such as texture and
+3-D processing.
+\textbf{CUDA Support:}
+In addition to the Gdev API, we support the CUDA Driver
+API~\cite{CUDA40} so that legacy CUDA programs can run on top of Gdev in
+both the user space and the OS.
+The supported API functions are limited to those that can be implemented
+by the current version of the Gdev API, but many legacy CUDA programs
+can execute with this limited set of the API, as we will show in our evaluation.
+Gdev assumes that programs are compiled by the NVIDIA CUDA Compiler (NVCC)~\cite{CUDA40}.
+Since the CUDA Driver API separates the device binary from the host
+binary, Gdev only needs to parse the device binary to acquire static
+information, such as the code size, stack size, shared memory size,
+local memory size, and parameter sizes, when it is loaded by the host binary.
+Global memory is also dynamically allocated by the host binary at runtime.
+It should be noted that CUDA also provides a different type of API,
+called Runtime API, which is more high-level than Driver API.
+Our prototype implementation does not support CUDA Runtime API by
+itself, but we could leverage Ocelot~\cite{Diamos_PACT10} that
+translates Runtime API to Driver API, in order to run CUDA programs
+written in Runtime API.
+\textbf{First-Class Resource Management:}
+Gdev provides first-class memory management and scheduling for the GPU
+to allow time-sharing multi-tasking systems to use the GPU efficiently,
+including shared memory for IPC, swapping for large memory demands,
+resource-based queuing for high throughput, and resource partitioning
+for virtual GPUs.
+Our resource management scheme is also not limited to GPUs, but can be
+generalized for a broad class of heterogeneous compute devices.
+\begin{table}[t]
+ \caption{Representatives of Gdev API.}
+ \vspace{-0.5em}
+ \label{tab:gdev_api}
+ \begin{center}
+ {\footnotesize
+ \begin{tabular}{|l|l|}
+ \hline
+ \textbf{API name} & \textbf{Description}\\
+ \hline
+ \texttt{gopen}/\texttt{gclose} & Open/close the device\\
+ \hline
+ \texttt{gmalloc}/\texttt{gfree} & Allocate/free device memory\\
+ \hline
+ \texttt{gmalloc\_io}/\texttt{gfree\_io} & Allocate/free host I/O memory\\
+ \hline
+ \texttt{gmemcpy\_to\_device} & Copy memory to the device\\
+ \hline
+ \texttt{gmemcpy\_from\_device} & Copy memory from the device\\
+ \hline
+ \texttt{gmemcpy\_in\_device} & Copy memory within the device\\
+ \hline
+ \texttt{glaunch}/\texttt{gsync} & Launch/wait computation\\
+ \hline
+ \texttt{gquery} & Query device information\\
+ \hline
+ \texttt{gshmget}/\texttt{gshmctl} & Manage shared device memory\\
+ \hline
+ \texttt{gshmat}/\texttt{gshmdt} & Share/unshare device memory\\
+ \hline
+ \end{tabular}
+ }
+ \end{center}
+\end{table}
BIN final/eps/ (binary files not shown): basic_performance.eps, chunk.eps, dataflow.eps, dma.eps, ecryptfs_read.eps, ecryptfs_write.eps, ecryptfs_write_multitask.eps, gdev.eps, memcpy.eps, scheduler_overhead.eps, swapping.eps, swapping_vgpu.eps, vgpu_2_band.eps, vgpu_2_band_compute.eps, vgpu_2_band_memory.eps, vgpu_2_credit.eps, vgpu_2_fifo.eps, vgpu_fair_2_band.eps
454 final/evaluation.tex
@@ -0,0 +1,454 @@
+\section{Experimental Evaluation}
+We evaluate our Gdev prototype, using the Rodinia
+benchmarks~\cite{Che_IISWC09}, GPU-accelerated eCryptfs encrypted
+filesystem from KGPU~\cite{Sun_SECURITY11_Poster}, FAST database
+search~\cite{Kim_SIGMOD10}, and some dataflow
+microbenchmarks from PTask~\cite{Rossbach_SOSP11}.
+We show that the basic performance of our prototype is practical
+even compared to proprietary software, and also demonstrate that Gdev
+provides significant benefits for GPU applications in time-sharing
+environments.
+Our experiments are conducted with the Linux kernel 2.6.39 on an NVIDIA
+GeForce~GTX~480 graphics card and an Intel Core~2~Extreme QX9650 processor.
+GPU programs are written in CUDA and compiled by NVCC~\cite{CUDA40},
+while CPU programs are compiled by gcc 4.4.6.
+\subsection{Basic Performance}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.8\hsize]{eps/chunk.eps}\\
+ \vspace{-1.5em}
+ \caption{Impact of the chunk size on DMA speeds.}
+ \label{fig:chunk}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.8\hsize]{eps/dma.eps}\\
+ \vspace{-1.5em}
+ \caption{Relative speed of I/O access to DMA.}
+ \label{fig:io_access}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/memcpy.eps}\\
+ \vspace{-1.5em}
+ \caption{Memory-copy throughput.}
+ \label{fig:memcpy}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+We first investigate the basic performance of standalone applications
+achieved by our Gdev prototype in order to show that the rest of our
+evaluation is practical in the real world.
+To do so, we need to determine what resource parameters can maximize the
+performance of our Gdev prototype.
+Figure~\ref{fig:chunk} shows the impact of the chunk size on memory-copy
+DMA speeds for both host-to-device (HtoD) and device-to-host (DtoH)
+data transmissions, tested by different data sizes.
+The throughput is affected by overhead when the chunk size is small,
+while it is also affected by blocking times when the chunk size is large.
+One may observe that a chunk size of 4MB is the best trade-off
+for both HtoD and DtoH.
+Figure~\ref{fig:io_access} shows the relative speed of direct I/O
+access to DMA for a small size of data.
+Due to some cache effects, HtoD and DtoH behave in a different manner.
+According to the results, the relative speeds of direct I/O access and
+DMA invert around a data size of 4KB for HtoD and 1KB for DtoH.
+Henceforth, we set the chunk size to 4MB, and the data size boundary of
+direct I/O access to 4KB for HtoD and 1KB for DtoH.
+Figure~\ref{fig:memcpy} shows the memory-copy throughput achieved
+by our Gdev prototype compared to the proprietary software.
+``Gdev/User'' employs a runtime library in the user-space, while
+``Gdev'' integrates the runtime support into the OS.
+Interestingly, the user-space runtime achieves higher throughput than
+the OS runtime, especially for DtoH.
+This difference comes from host-to-host \texttt{memcpy} effects.
+In particular, \texttt{memcpy} in the Linux kernel performs slower than
+that in the user-space GNU library, when copying data from the I/O
+memory to the main memory.
+This could be the disadvantage of integrating the runtime support into
+the OS, but a further in-depth investigation is required.
+Apart from DtoH with the OS runtime, our Gdev prototype and the
+proprietary software are almost competitive.
+\begin{table}[t]
+ \caption{List of benchmarks.}
+ \label{tab:benchmarks}
+ \vspace{-0.5em}
+ \begin{center}
+ {\footnotesize
+ \begin{tabular}{|l|l|}
+ \hline
+ \textbf{Benchmark} & \textbf{Description}\\
+ \hline
+ LOOP & Long-loop compute without data \\
+ \hline
+ MADD & 1024x1024 matrix addition\\
+ \hline
+ MMUL & 1024x1024 matrix multiplication\\
+ \hline
+ CPY & 256MB of HtoD and DtoH\\
+ \hline
+ PINCPY & CPY using pinned host I/O memory\\
+ \hline
+ BP & Back propagation (pattern recognition)\\
+ \hline
+ BFS & Breadth-first search (graph algorithm)\\
+ \hline
+ HW & Heart wall (medical imaging)\\
+ \hline
+ HS & Hotspot (physics simulation)\\
+ \hline
+ LUD & LU decomposition (linear algebra)\\
+ \hline
+ NN & K-nearest neighbors (data mining)\\
+ \hline
+ NW & Needleman-Wunsch (bioinformatics)\\
+ \hline
+ SRAD & Speckle reducing anisotropic diffusion (imaging)\\
+ \hline
+ SRAD2 & SRAD with random pseudo-inputs (imaging)\\
+ \hline
+ \end{tabular}
+ }
+ \end{center}
+ \vspace{-1.5em}
+\end{table}
+Figure~\ref{fig:basic_performance} depicts the standalone performance of
+benchmarks achieved by our Gdev prototype compared to the
+proprietary software, using some microbenchmarks and
+Rodinia~\cite{Che_IISWC09} as listed in Table~\ref{tab:benchmarks}.
+First of all, we found that NVIDIA GPUs have some ``performance mode''
+to boost hardware performance, which we have not yet figured out how to
+turn on.
+As observed in LOOP, our open-source implementation incurs about a 20\%
+decrease in performance, compared to the proprietary software.
+The impact on application performance, however, depends on workloads.
+If the workload is very compute-intensive, such as HW and SRAD, the
+impact appears high, while less compute-intensive workloads, such as BFS
+and HS, hide this impact.
+In either case, we claim that this performance disadvantage is due to
+implementation issues, but it does not limit the concept of Gdev.
+These benchmarks also show that our runtime-unified OS approach is not
+well suited to data-intensive workloads.
+For example, BP deals with a very large size of data, while its compute
+demand is not very high.
+Such a workload decreases performance with our Gdev prototype due to the
+slow host-to-host \texttt{memcpy} operation in the OS discussed above.
+PINCPY, on the other hand, exhibits little difference in
+performance, since it does not need the host-to-host \texttt{memcpy} operation.
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/basic_performance.eps}\\
+ \vspace{-1.5em}
+ \caption{Basic standalone performance.}
+ \label{fig:basic_performance}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/scheduler_overhead.eps}\\
+ \vspace{-1.5em}
+ \caption{Unconstrained real-time performance.}
+ \label{fig:scheduler_overhead}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+We next investigate the reliability of GPU schedulers, comparing the OS
+API-driven scheme adopted by Gdev and PTask~\cite{Rossbach_SOSP11}, OS
+command-driven scheme adopted by TimeGraph~\cite{Kato_ATC11} and
+GERM~\cite{Bautin_MCNC08}, and a user-space API-driven scheme.
+We execute each Rodinia benchmark recursively as fast as possible in
+real-time, together with some workloads launching many meaningless GPU
+commands bypassing the runtime library.
+The user-space API-driven scheduler severely suffers from this
+situation, since it cannot schedule such workloads that bypass the
+scheduler itself.
+The OS command-driven scheduler can sustain the interference through
+command scheduling, but still incurs some overhead due to frequent
+scheduler invocations.
+The OS API-driven scheduler, on the other hand, can simply reject such
+workloads, since they are not submitted using the API.
+Gdev and PTask are both API-driven, but PTask exposes the scheduler
+system call to the user space, such as \texttt{sys\_set\_ptask\_prio},
+which could allow misbehaving applications to abuse priorities.
+As a consequence, we believe that Gdev is a self-contained, reliable solution.
+\subsection{GPU Acceleration for the OS}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/ecryptfs_read.eps}\\
+ \vspace{-1.5em}
+ \caption{eCryptfs read throughput.}
+ \label{fig:ecryptfs_read}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/ecryptfs_write.eps}\\
+ \vspace{-1.5em}
+ \caption{eCryptfs write throughput.}
+ \label{fig:ecryptfs_write}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/ecryptfs_write_multitask.eps}\\
+ \vspace{-1.5em}
+ \caption{eCryptfs write throughput with priorities.}
+ \label{fig:ecryptfs_write_multitask}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+We now evaluate the GPU acceleration for the Linux encrypted filesystem,
+using KGPU's implementation of eCryptfs~\cite{Sun_SECURITY11_Poster}.
+KGPU is a framework that allows the OS to access the user-space runtime
+library to use the GPU for its computations.
+We have modified KGPU's eCryptfs to call the CUDA API functions
+provided by Gdev directly, instead of sending requests to the KGPU
+user-space daemon.
+Figures~\ref{fig:ecryptfs_read} and \ref{fig:ecryptfs_write} show the
+read and write throughput of several versions of eCryptfs.
+``CPU'' represents the CPU implementation, while ``KGPU \& NVIDIA'' and
+``KGPU \& Gdev/User'' represent those that use KGPU with NVIDIA's
+library and Gdev's library in the user space, respectively.
+``Gdev'' is our solution that enables the eCryptfs module to use the GPU
+directly in the OS.
+Due to some page cache effects, the read and write throughput are not
+identical, but the advantage of using the GPU is clearly observed.
+One can also observe that our runtime-unified OS approach does not
+provide significant improvements over the KGPU approach.
+However, this is reasonable because the latency improvement achieved by
+our OS approach would be at most on the order of microseconds, while the
+AES/DES operations of eCryptfs performed on the GPU take considerably longer.
+Nonetheless, Gdev is fairly beneficial since OS applications are freed
+from any dependence on the user space.
+A further benefit of Gdev for OS applications appears in multi-tasking scenarios.
+Figure~\ref{fig:ecryptfs_write_multitask} shows the write throughput of
+eCryptfs when the FAST search task~\cite{Kim_SIGMOD10} is competing for
+the GPU.
+Since Gdev supports priorities, we assign eCryptfs the highest priority,
+while the search task is also assigned a higher priority than other tasks.
+What happens in this scenario is that the performance of eCryptfs is
+affected by the search task without a priority scheme, as observed in
+``KGPU \& NVIDIA''.
+Even with priorities, KGPU could suffer from priority inversions where
+the high-priority eCryptfs task is reduced to the KGPU priority level
+when accessing the GPU, while the search task is executing at a
+higher priority level.
+We can assign a high priority to the KGPU daemon, but it affects all
+user-space GPU applications.
+Using Gdev, on the other hand, GPU applications execute at the identical
+priority level, which avoids such priority inversions.
+\subsection{Impact of Shared Memory}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=\hsize]{eps/dataflow.eps}\\
+ \vspace{-1.5em}
+ \caption{Impact of shared memory on dataflow tasks.}
+ \label{fig:dataflow}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.75\hsize]{eps/swapping.eps}\\
+ \vspace{-1.5em}
+ \caption{Impact of swapping latency.}
+ \label{fig:swapping}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.75\hsize]{eps/swapping_vgpu.eps}\\
+ \vspace{-1.5em}
+ \caption{Impact of swapping latency on virtual GPUs.}
+ \label{fig:swapping_vgpu}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \subfigure[FIFO scheduler] {
+ \includegraphics[width=0.319\hsize]{eps/vgpu_2_fifo.eps}
+ }
+ \subfigure[Credit scheduler] {
+ \includegraphics[width=0.319\hsize]{eps/vgpu_2_credit.eps}
+ }
+ \subfigure[Band scheduler] {
+ \includegraphics[width=0.319\hsize]{eps/vgpu_2_band.eps}
+ }
+ \vspace{-1.2em}
+ \caption{Util. of virtual GPUs under unfair workloads.}
+ \label{fig:vgpu_2}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+Figure~\ref{fig:dataflow} shows the speedups of dataflow benchmarks
+brought by Gdev's shared memory scheme.
+We construct each dataflow as a 6x32 tree or a 6x10 rectangle, following
+PTask's setup~\cite{Rossbach_SOSP11}.
+``NVIDIA/modular'' and ``Gdev/modular'' use NVIDIA's CUDA API and Gdev's
+CUDA API respectively to implement a dataflow in such a way that
+allocates a self-contained context to each graph node as a module, and
+connects their output and input by copying data between the host and
+device memory back and forth.
+On the other hand, ``Gdev/shm'' uses shared memory instead of
+host-to-device data communication, \textit{i.e.}, it connects the output
+and input by sharing the same ``key'' associated with the corresponding
+shared memory.
+According to the results, the usage of shared memory is very effective
+for dataflows with a large size of data, \textit{e.g.}, it gains a 49\%
+speedup for the 1024x1024 madd tree.
+Specifically, we have observed that ``Gdev/modular'' took 1424ms while
+``Gdev/shm'' took 953ms to complete this dataflow.
+This makes sense, \textit{i.e.}, the data transfer time for each
+1024x1024 integer value was about 8ms on average, and we can reduce data
+communications by a total of 32+16+8+4+2=62 intermediate nodes for a
+6x32 tree, which leads to a total reduced time of 8x62=496ms.
+It should be noted that PTask achieves more speedups due to advanced
+dataflow scheduling~\cite{Rossbach_SOSP11}.
+However, we provide users with a first-class API primitive for shared
+memory, which could be used as a generic IPC method in different program domains.
+Therefore, we distinguish our contribution from PTask.
+It is also surprising that our Gdev prototype performs much better than
+the proprietary software.
+We suspect that the proprietary one takes a long time to initialize
+contexts when there are many active contexts, though an in-depth
+investigation is required.
+Figure~\ref{fig:swapping} depicts the impact of memory swapping,
+provided by Gdev, on the makespan of multiple 128MB-data FAST search
+tasks, where another 1GB-data FAST search task is running at the highest
+priority level.
+Given that the GPU used in this evaluation supports
+1.6GB of device memory, we cannot create more than three 128MB-data
+search tasks at once without memory swapping.
+``Gdev/User'' hence fails when the number of the small search tasks
+exceeds three, since our prototype does not support shared memory in the
+user space.
+NVIDIA's proprietary software also fails much earlier.
+We suspect that it would reserve a large space of device memory for
+other purposes.
+With memory swapping, however, all the 128MB-data search tasks can
+survive under this memory pressure, though the slope of increase in the
+makespan changes at a different point depending on whether the temporary
+swap space on the device memory is used under Gdev.
+In particular, an inflection point appears clearly when the device swap
+space is not leveraged, as observed in ``Gdev w/o swp'', because the
+swapping latency influences the makespan.
+Using the device swap space, however, Gdev can reduce the impact of
+swapping on the makespan, though the inflection point appears a little
+earlier because the swap space itself reduces the total amount of device
+memory available to applications.
+Figure~\ref{fig:swapping_vgpu} shows the impact of memory swapping on
+virtual GPUs.
+In this experiment, we introduce virtual GPUs, and execute 128MB-data
+search tasks on the first virtual GPU.
+The memory size available for the virtual GPU is more restricted in
+the presence of more virtual GPUs.
+We confirm that the makespans become longer and their inflection points
+appear earlier for a greater number of virtual GPUs, but all the search
+tasks can still complete.
+This demonstrates that memory swapping is also useful on virtual GPUs.
+\subsection{Isolation among Virtual GPUs}
+We now examine Gdev's virtual GPU support.
+Figure~\ref{fig:vgpu_2} shows the actual GPU utilization of two virtual
+GPUs, where the first virtual GPU (VGPU 0) executes short-length compute
+programs using the LUD benchmark, while the other (VGPU 1) executes
+long-length compute programs using the HW benchmark.
+These programs run repeatedly to impose high workloads for 200 seconds.
+VGPU~1 starts its workloads 30 seconds later than VGPU~0.
+We compare three schedulers, using the SDQ scheme.
+The FIFO scheduler represents a lack of virtual GPU support, and
+the Credit scheduler adopts the scheduling policy of the Xen hypervisor.
+The Band scheduler is the one provided by Gdev.
+According to the results, the FIFO scheduler does not respect isolation
+at all by definition.
+The Credit scheduler also does not really work well because it is
+unaware of non-preemptive burst workloads.
+The Band scheduler, however, can provide fairer GPU bandwidth allocations.
+The difference between the utilization of two virtual GPUs is retained
+within 7\% on average.
+We next study the effectiveness of the MRQ scheme that separates the
+queues for compute and memory-copy operations.
+Figure~\ref{fig:vgpu_2_band_mrq} illustrates the utilization of two
+virtual GPUs under the Band scheduler, executing the SRAD benchmark
+tasks with different sizes of image.
+We noticed that the compute and memory-copy operations can be
+overlapped, but they affect each other's run-to-completion times.
+When VGPU 1 uses more compute resources due to a large size of
+computation, the length of memory-copy operations requested by VGPU 0 is
+prolonged due to overlapping.
+As a result, it requires more memory-copy bandwidth.
+However, the available bandwidth is capped by the Band scheduler,
+\textit{i.e.}, both the compute and memory-copy operations are limited
+to about 50\% of bandwidth at most.
+One can also observe that the MRQ scheme allows the sum of compute and
+memory-copy bandwidth to exceed 100\%.
+We finally demonstrate the scalability of our virtual GPU support.
+Figure~\ref{fig:vgpu_fair_4_band} shows the utilization of four virtual
+GPUs under the Band scheduler, where all virtual GPUs execute four
+instances of the LUD benchmark task exhaustively to produce fair workloads.
+The workloads of each virtual GPU begin in turn at an interval of 30 seconds.
+Under such sane workloads, our virtual GPU support can provide fair
+bandwidth allocations, even if the system exhibits non-preemptive burst
+behavior.
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/vgpu_2_band_compute.eps}\\
+ \vspace{-0.5em}
+ \includegraphics[width=0.9\hsize]{eps/vgpu_2_band_memory.eps}\\
+ \vspace{-1.5em}
+ \caption{Util. of virtual GPUs with the MRQ scheme (upper for compute
+ and lower for memory-copy).}
+ \label{fig:vgpu_2_band_mrq}
+ \end{center}
+\end{figure}
+\begin{figure}[t]
+ \begin{center}
+ \includegraphics[width=0.9\hsize]{eps/vgpu_fair_4_band.eps}\\
+ \vspace{-1.5em}
+ \caption{Util. of virtual GPUs under fair workloads.}
+ \label{fig:vgpu_fair_4_band}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
81 final/implementation.tex
@@ -0,0 +1,81 @@
+\section{System Implementation}
+Our prototype implementation is fully open-source and available for the
+Linux kernel 2.6.32 or later, without any modifications to the
+underlying kernel, on NVIDIA Fermi GPUs.
+It does not depend on proprietary software except for compilers.
+%Our device driver implementation is particularly based on what we have
+%been developing in collaboration with PathScale
+%The OS runtime and the CUDA Driver API library, on the other hand, are
+%newly implemented.
+Our prototype implementation is still partly experimental, but it could
+contribute to future research on GPU resource management, given that
+open-source drivers/runtimes for GPUs are very limited today.
+Gdev is a Linux kernel module (a character device driver) comprising
+the device driver and the runtime library.
+The device driver manages low-level hardware resources, such as channels
+and page tables, to operate the GPU.
+The runtime library manages GPU commands and API calls.
+It directly uses the device-driver functions to control hardware
+resource usage for first-class GPU resource management.
+The Gdev API is implemented in this runtime library.
+The kernel symbols of the API functions are exported so that other OS
+modules can call them.
+These API functions are also one-to-one mapped to the \texttt{ioctl}
+commands defined by Gdev so that user-space programs can also be managed
+by Gdev.
+We provide two versions of CUDA Driver API: one for the user space and
+the other for the OS.
+The former is provided as a typical user-space library, while the
+latter is provided as a kernel module, called \textit{kcuda},
+which implements and exports the CUDA API functions.
+Both, however, internally use the Gdev API to access the GPU.
+We use \texttt{/proc} filesystem in Linux to configure Gdev.
+For example, the number of virtual GPUs and their maps to physical GPUs
+are visible to users through \texttt{/proc}.
+The compute and memory bandwidth and memory share for each virtual GPU
+are also configurable at runtime through \texttt{/proc}.
+We further plan to integrate the configuration of priority and
+reserve for each single task into \texttt{/proc}, using the TimeGraph
+approach~\cite{Kato_ATC11}.
+Gdev creates the same number of character device files as virtual GPUs,
+\textit{i.e.}, \texttt{/dev/\{gdev0,gdev1,...\}}.
+When users open one of these device files using Gdev API or CUDA API,
+it behaves as if it were one for the physical GPU.
+\textbf{Resource Parameters:}
+The performance of Gdev is governed by resource parameters, such as the
+page size for virtual memory, temporal swap size, waiting time for the
+Band scheduler, period for virtual GPU budgets, chunk size
+for memory-copy, and boundary between I/O access and DMA.
+We use a page size of 4KB, as the Linux kernel uses the same page size
+for host virtual memory by default.
+The swap size is statically set to 10\% of the physical device memory.
+The waiting time for the Band scheduler is also statically set to 500
+microseconds.
+For the period of virtual GPU budgets, we respect Xen's default setup,
+\textit{i.e.}, we set it to 30ms.
+The rest of the resource parameters will be determined in our evaluation.
+We use Direct Rendering Infrastructure (DRI)~\cite{DRI} -- a Linux
+framework for graphics rendering with the GPU -- to communicate with the
+Linux kernel.
+Hence, some Gdev functionality may be used to manage not only compute
+but also 3-D graphics applications.
+Our implementation approach also abstracts GPU resources by device,
+address space, context, and memory objects, which allows other device
+drivers and GPU architectures to be easily ported.
+%We believe that this portability would help to generalize the concept
+%of Gdev.
123 final/introduction.tex
@@ -0,0 +1,123 @@
+Recent innovations in heterogeneous compute devices with many-core
+technology have achieved an order-of-magnitude gain in computing performance.
+In particular, the graphics processing unit (GPU) receives considerable
+attention as a mature heterogeneous compute device that embraces a
+concept of many-core computing.
+A recent announcement from the TOP500 supercomputing sites in November
+2011~\cite{TOP500} disclosed that three of the top five supercomputers
+use GPUs.
+For example, scientific climate applications can gain 80x speedups using
+such GPU-based supercomputers~\cite{Shimokawabe10}.
+The benefits of GPUs appear not only in high-performance computing but
+also in general-purpose and embedded computing domains.
+According to previous research, GPUs can provide up to an order of 10x
+speedups for software routers~\cite{Han_SIGCOMM10}, 20x speedups for
+encrypted networks~\cite{Jang_NSDI11}, and 15x speedups for motion
+planning in autonomous vehicles~\cite{McNaughton_ICRA11}.
+Such rapid growth of general-purpose computing on GPUs,
+\textit{a.k.a.} GPGPU, has been driven by recent advances in GPU
+programming languages, such as CUDA~\cite{CUDA40}.
+Although the use of GPUs in general-purpose domains provides significant
+performance improvements, system software support for GPUs in the market
+is especially tailored to accelerate particular applications dedicated
+to the system, but is not well-designed to integrate GPUs into more
+general time-sharing systems.
+The research community thereby has articulated the need to manage GPU
+resources in the operating system (OS)~\cite{Bautin_MCNC08, Kato_ATC11,
+However, these pieces of OS support still limit the class of applications
+that can use GPUs due to a lack of first-class resource management
+primitives, such as virtual memory management, inter-process
+communication (IPC), and time-and-space resource partitioning.
+On time-sharing systems where multiple independent users are logged in,
+for instance, if one executes a high-workload GPGPU program, the
+GPU resources available for the rest of users could be limited.
+State-of-the-art GPU schedulers~\cite{Kato_ATC11,
+Rossbach_SOSP11} provide priority schemes for the GPU, yet users
+are required to specify priorities and other parameters by themselves.
+The system thus cannot partition GPU resources in time-sharing systems.
+More fundamentally, the current GPGPU framework does not permit
+users to share memory resources among GPU contexts, and could limit the
+total amount of effective data to the size of physical memory.
+Such constraints may not be acceptable in general-purpose programming.
+The current OS and system support for GPUs also leaves the
+application programming interface (API) to the user space, which
+restricts the availability of GPUs to user-space applications.
+In addition, employing the API in the user-space library implies that
+the device driver exposes its resource management primitives to the user
+space through a system call, which allows malicious programs to abuse
+GPUs using this system call.
+As a matter of fact, non-privileged user-space programs can directly
+allocate memory and launch computation on NVIDIA GPUs, via the
+\texttt{ioctl} system call in Linux.
+This implies that the API should also be protected by the OS.
+This paper presents \textbf{Gdev}, a new GPGPU ecosystem that addresses
+the current limitations of GPU resource management.
+Specifically, Gdev integrates GPU runtime support, including the API,
+into the OS to allow a wide class of user-space applications and the OS
+itself to use GPUs in a reliable manner.
+While OS applications can directly call the API, user-space
+programs can also use it through the system call, such as
+We also provide an implementation of the CUDA API~\cite{CUDA40} wrapping
+the Gdev API so that legacy CUDA applications can work with Gdev in both the
+user space and the OS.
+This runtime-unified OS approach forces GPU applications running in
+the same system to be managed by a single resource management
+entity, which eliminates the concern about malicious programs attempting
+to access GPUs directly.
+The contributions of Gdev also include first-class support for GPU
+resource management in time-sharing systems.
+Specifically, Gdev allows programmers to share device memory resources
+among GPU contexts explicitly using the API.
+We also leverage this shared memory scheme to allow the system to allocate
+data beyond the size of physical memory space, exploiting implicit data
+eviction and reload between the host and device memory.
+Moreover, Gdev devises virtual GPU support to isolate GPU users in
+time-sharing systems, using a new GPU scheduler to deal with the
+non-preemptive and bursty nature of GPU workloads.
+As a proof-of-concept, we finally provide an open-source implementation of Gdev,
+including a device driver and runtime/API library.
+In summary, this paper makes the following contributions:
+ \vspace{-0.25em}
+ \item Identifies the advantage/disadvantage of integrating GPU runtime
+ support into the OS.
+ \vspace{-0.5em}
+ \item Enables the OS to use GPUs for computation.
+ \vspace{-0.5em}
 \item Makes GPUs first-class computing resources in time-sharing
+ systems -- support for shared memory, memory swapping,
+ and virtual GPUs, in addition to device memory management and GPU
+ scheduling.
+ \vspace{-0.5em}
+ \item Provides open-source implementations of the GPU device driver,
+ runtime/API libraries, utility tools, and Gdev resource
+ management components.
+ \vspace{-0.5em}
+ \item Demonstrates the capabilities of Gdev using real-world benchmarks
+ and applications.
+ \vspace{-0.25em}
+The rest of this paper is organized as follows.
+Section~\ref{sec:model} provides the model and assumptions behind
+this paper.
+Section~\ref{sec:ecosystem} outlines the Gdev ecosystem.
+Sections~\ref{sec:memory_management} and \ref{sec:scheduling} propose new
+memory management and scheduling schemes for the GPU.
+Section~\ref{sec:implementation} describes a prototype implementation,
+and its capabilities are demonstrated in Section~\ref{sec:evaluation}.
+Related work is discussed in Section~\ref{sec:related_work}.
+We provide our concluding remarks in Section~\ref{sec:conclusion}.
88 final/main.tex
@@ -0,0 +1,88 @@
+% TEMPLATE for Usenix papers, specifically to meet requirements of
+% USENIX '05
+% originally a template for producing IEEE-format articles using LaTeX.
+% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
+% adapted by David Beazley for his excellent SWIG paper in Proceedings,
+% Tcl 96
+% turned into a smartass generic template by De Clarke, with thanks to
+% both the above pioneers
+% use at your own risk. Complaints to /dev/null.
+% make it two column with no page numbering, default is 10 point
+% Munged by Fred Douglis <> 10/97 to separate
+% the .sty file from the LaTeX source template, so that people can
+% more easily include the .sty file into an existing document. Also
+% changed to more closely follow the style guidelines as represented
+% by the Word sample file.
+% Note that since 2010, USENIX does not require endnotes. If you want
+% foot of page notes, don't include the endnotes package in the
+% usepackage command, below.
+% This version uses the latex2e styles, not the very ancient 2.09 stuff.
+%don't want date printed
+%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
+\title{\Large \bf
+Gdev: First-Class GPU Resource Management in the Operating System
+%for single author (just remove % characters)
+{\rm Shinpei Kato,
+Michael McThrow,
+Carlos Maltzahn,
+and Scott Brandt
+Department of Computer Science, UC Santa Cruz
+%{\rm Second Name}\\
+%Second Institution
+% copy the following lines to add more authors
+% \and
+% {\rm Name}\\
+%Name Institution
+} % end author
+% Use the following at camera-ready time to suppress page numbers.
+% Comment it out when you first submit the paper for review.
+{\scriptsize \bibliographystyle{acm}
213 final/memory_management.tex
@@ -0,0 +1,213 @@
+\section{Device Memory Management}
+Gdev manages the device memory using the virtual memory management unit
+of the GPU.
+%We store a page table for each GPU context in the device memory but not
+%in the host memory to reduce paging latency.
+%For every memory copy operation, Gdev configures DMA engines to write a
+%sequential value to the specified memory area so that it can poll the
+%value to wait for the operations to be completed.
+In addition to generic pieces of memory management, Gdev
+identifies how to increase memory-copy throughput and support shared
+memory and memory swapping for the GPU.
+\subsection{Memory-Copy Optimization}
+The memory-copy throughput can govern overall application performance,
+since data need to be moved across the device and host memory in GPU
+applications.
+Although our goal is to enhance resource management for the GPU in
+time-sharing systems, we still respect standalone performance, since it
+relates to a practical use of our solution in the real world.
+We hence study the characteristics of memory-copy operations for the GPU.
+It should be noted that our problem is different from those considered
+in previous work~\cite{Jablin_PLDI11, Rossbach_SOSP11} in that we are
+looking into a basic single instance of a memory-copy transaction, while
+the previous work addressed more applied situations where multiple
+contexts compete for memory-copy transactions.
+First of all, we have found that the memory-copy API functions provided
+by proprietary software~\cite{CUDA40} are well-optimized for standalone
+performance.
+Hence, in the following, we describe how to realize this optimization.
+\textbf{Split Transaction:}
+In order to copy data between the device and host memory, the data
+typically need to be copied twice, unless users directly allocate
+buffers to the host I/O memory.
+For example, when uploading user buffers from the host to the device
+memory, the data are first copied from the main memory to intermediate
+\textit{bounce} buffers in the I/O memory accessible to the GPU, and
+then copied to the device memory.
+To optimize this two-step memory-copy operation, we split each operation
+into fixed-size chunks, which play the role of ping-pong buffers,
+\textit{i.e.}, while one chunk is transferred between the main and I/O
+memory, the preceding chunk can be transferred between the I/O and
+device memory.
+In this way, only the first and last chunks need to be transferred
+alone, reducing the total makespan almost by half.
+This split transaction also needs only small bounce buffers, equal in
+size to the chunk, reducing the usage of the host I/O memory
+%In Gdev, the chunk size is configurable.
+%The same idea can also be applied to all types of DMA engines described in
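+The pipelining effect of the split transaction can be sketched with a
+toy timing model; this is not Gdev code, and the chunk size, data size,
+and per-stage cost below are illustrative assumptions:

```python
# Toy timing model of the split (ping-pong) transaction described above.
# Stage 1 is main memory <-> bounce buffer, stage 2 is bounce buffer <->
# device memory; t_stage is an abstract per-chunk cost.

def twostep_copy_time(total, chunk, t_stage=1.0):
    """Naive copy: the two stages run back to back for every chunk."""
    n = -(-total // chunk)        # number of chunks (ceiling division)
    return 2 * n * t_stage

def pipelined_copy_time(total, chunk, t_stage=1.0):
    """Split transaction: stage 2 of chunk i overlaps stage 1 of chunk
    i+1, so only the first and last chunks run alone."""
    n = -(-total // chunk)
    return (n + 1) * t_stage

if __name__ == "__main__":
    total, chunk = 64 << 20, 4 << 20            # 64MB in 4MB chunks
    print(twostep_copy_time(total, chunk))      # 32.0
    print(pipelined_copy_time(total, chunk))    # 17.0
```

+For a large number of chunks, the pipelined makespan approaches half of
+the naive one, matching the intuition above.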
+\textbf{Direct I/O Access:}
+The split transaction is effective for large data sizes.
+For small data sizes, however, the use of DMA engines incurs
+non-trivial overhead by itself.
+Hence, we also employ a method that reads/writes data directly by mapping
+device memory space onto host I/O memory space, rather than sending/receiving
+data in burst mode through DMA transactions.
+We have found that such a direct I/O access method is much faster than
+using DMA engines for small data sizes.
+In our experiment presented in Section~\ref{sec:evaluation}, we will
+show a boundary on the data size that inverts the superiority of I/O
+access and DMA, together with the best chunk size, to optimize
+memory-copy throughput.
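+The resulting policy reduces to a simple size test; the threshold below
+is a placeholder assumption, since the actual boundary is determined
+empirically in the evaluation:

```python
# Choose the copy method by data size, as described above: direct I/O
# access for small data, DMA with the split transaction for large data.
# DIRECT_IO_THRESHOLD is an assumed placeholder, not Gdev's measured value.

DIRECT_IO_THRESHOLD = 4 << 10    # e.g., 4KB; the real boundary is measured

def choose_copy_method(size):
    """Return which memory-copy path a transfer of `size` bytes takes."""
    return "direct-io" if size <= DIRECT_IO_THRESHOLD else "dma-split"
```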
+\subsection{Shared Memory Support}
+The current GPU programming model does not support IPC.
+For example, data communication among contexts incurs significant
+overhead by copying data back and forth between the device- and
+host-memory buffers.
+Currently, an OS dataflow abstraction~\cite{Rossbach_SOSP11} is a useful
+tool to minimize such data movement costs; however, users are required to
+use a dataflow programming model and understand that optimization is
+applied implicitly by the OS at runtime.
+It would be nice if users could manage data communication among contexts
+easily using a familiar method, such as a POSIX IPC mechanism.
+Gdev supports shared memory for the GPU, providing a set of API
+functions, as listed in Table~\ref{tab:gdev_api}, based on the POSIX IPC
+standard, \textit{i.e.}, \texttt{gshmget}, \texttt{gshmat},
+\texttt{gshmdt}, and \texttt{gshmctl} correspond to \texttt{shmget},
+\texttt{shmat}, \texttt{shmdt}, and \texttt{shmctl} respectively.
+We have also added \texttt{cuShmGet}, \texttt{cuShmAt},
+\texttt{cuShmDt}, and \texttt{cuShmCtl} to our CUDA API
+implementation, which correspondingly call the Gdev shared memory
+functions, so that CUDA applications can easily leverage Gdev's shared
+memory support.
+Our shared memory design is straightforward, though its implementation
+is challenging.
+Upon the first call to \texttt{gshmget}, new space is allocated in the
+device memory, as with \texttt{gmalloc}, and Gdev holds an identifier
+to this memory object.
+After the first call, however, Gdev only returns this identifier to the
+caller.
+The allocated space is mapped onto the context virtual address space
+when \texttt{gshmat} is called.
+Address mapping is done by setting the page table so that the virtual
+addresses point to the shared physical memory space.
+The allocated space can also be unmapped by \texttt{gshmdt} and freed by
+\texttt{gshmctl}.
+Gdev counts the number of users referencing the shared memory, and frees
+it when unmapped by the last user -- hence the call to \texttt{gshmctl}
+is optional.
+If the shared memory needs to be accessed exclusively, the host
+program itself is responsible for taking care of traditional
+mutex/semaphore mechanisms.
+We believe that our shared memory scheme can be easily integrated into
+GPU programming.
+For example, legacy CUDA applications can use our shared memory scheme
+by replacing \texttt{cuMemAlloc} with \texttt{cuShmGet} and
+\texttt{cuShmAt}, and \texttt{cuMemFree} with \texttt{cuShmDt}.
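+The lifecycle described above (allocate once, map per user, free on the
+last detach) can be modeled in a few lines; this is a host-side sketch
+with assumed semantics, not the driver implementation:

```python
# Toy model of the Gdev shared-memory lifecycle described above:
# gshmget returns one identifier per key, gshmat maps and bumps a
# reference count, and the object is freed when the last user detaches.
# Purely illustrative; no real device memory is involved.

class ShmRegistry:
    def __init__(self):
        self.objects = {}      # key -> {"size": ..., "refs": ...}

    def gshmget(self, key, size):
        # the first call allocates; later calls just return the identifier
        if key not in self.objects:
            self.objects[key] = {"size": size, "refs": 0}
        return key

    def gshmat(self, key):
        self.objects[key]["refs"] += 1
        return f"va:{key}"     # stand-in for a mapped virtual address

    def gshmdt(self, key):
        obj = self.objects[key]
        obj["refs"] -= 1
        if obj["refs"] == 0:   # freed when unmapped by the last user
            del self.objects[key]
```

+Two contexts attaching to the same key receive the same identifier, and
+the object disappears only after both detach, which is why the explicit
+\texttt{gshmctl} call is optional in this model.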
+\subsection{Memory Swapping}
+The proprietary software in Linux~\cite{CUDA40} fails to allocate
+device memory when the memory demand exceeds the physical memory
+capacity, whereas Windows software can somehow swap device memory,
+according to PTask~\cite{Rossbach_SOSP11}.
+In either case, however, it is not well studied how to swap device
+memory and speed up its operation.
+Gdev uses shared memory to achieve device memory swapping.
+When a memory allocation request fails due to a shortage of free memory
+space, Gdev seeks memory objects whose allocated size is greater than
+the requested size, and selects one owned by a low-priority context,
+where ties are broken arbitrarily.
+Gdev here ensures not to select a memory object from the caller context
+itself.
+Once a victim memory object is selected, it is shared with the caller
+context, and behaves as an \textit{implicit} shared memory object.
+%The allocated space is never freed until unreferenced by all associated
+Unlike an explicit shared memory object presented in
+Section~\ref{sec:shared_memory}, however, the implicit shared memory
+object needs to evict data when other contexts are accessing it, and
+retrieve them later when the corresponding context is resumed.
+Thanks to the Gdev API design, we know exactly when contexts could
+access the shared memory: it could be accessed when either a function in
+the \texttt{gmemcpy*} family or \texttt{glaunch} is called.
+They, however, need to be handled in different manners:
+ \vspace{-0.25em}
+ \item \texttt{gmemcpy*} accesses a specific range of address space
+ given by the function arguments.
+ Hence, we need to evict and retrieve data related to this range.
+ \vspace{-0.5em}
+ \item \texttt{glaunch}, on the other hand, does not tell which address
+ range could be accessed when it is called, especially due to
+ dynamically allocated memory space.
+ Hence, we need to evict and retrieve data associated with all
+ memory objects owned by the corresponding context.
+ \vspace{-0.25em}
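+The victim-selection rule described earlier in this subsection can be
+sketched as follows; contexts and memory objects are plain records here,
+and the numeric priority field (lower value means lower priority) is an
+illustrative assumption:

```python
# Pick a swap victim as described above: a memory object whose size is
# at least the requested size, owned by a low-priority context other
# than the caller; ties are broken arbitrarily (min() keeps the first).

def pick_victim(objects, request_size, caller):
    candidates = [o for o in objects
                  if o["size"] >= request_size and o["owner"] != caller]
    if not candidates:
        return None              # nothing fits: the allocation fails
    return min(candidates, key=lambda o: o["priority"])
```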
+We allocate swap buffers in the host main memory to save the evicted
+data.
+Gdev uses \texttt{gmemcpy\_from\_device} and \texttt{gmemcpy\_to\_device}
+to evict and retrieve data.
+These memory swapping procedures are not visible to application programs.
+It should be noted that swapping does not occur when downloading data
+from the device memory.
+Even if the data are evicted, we can copy the data from the host swap
+buffer directly.
+\textbf{Reducing Latency:}
+The swapping latency could be non-trivial depending on the data size.
+Given that memory-copy operations within the device memory are faster than
+those between the device and host memory by an order of magnitude, Gdev
+reserves a certain amount of device memory space to use as temporary swap
+space.
+Data can be evicted to this device swap space temporarily, if they fit
+the space, to reduce the swapping latency.
+The temporarily-evicted data are eventually evicted to the host memory
+after a while to free the swap space for other contexts.
+Gdev tries to hide the latency of this second data eviction by
+overlapping it with the computation launched by the context itself.
+To do so, it creates a special GPU context that is dedicated to the
+device-to-host data movement, since the compute and DMA units cannot be
+used by the same GPU context simultaneously.
+This approach is reasonable, since some computation is likely to follow
+the data eviction.
+For example, \texttt{glaunch} will apparently launch some computation, and
+\texttt{gmemcpy\_to\_device} is also often called prior to \texttt{glaunch}.
+The evicted data, if they exist, need to be retrieved when \texttt{glaunch}
+is called for the associated GPU context after evicting the current data
+of some other context sharing the memory space.
+If the evicted data still exist in the device swap space, they can be
+retrieved quickly.
+Else, Gdev retrieves them from the host main memory.
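+The two-level retrieval path above can be sketched as follows; the
+dictionaries stand in for the reserved device swap space and the host
+swap buffers, and all names are illustrative:

```python
# Minimal model of the two-level swap described above. Evicted data goes
# first to a reserved device swap space (fast to retrieve) and is later
# flushed to host memory (slower to retrieve). Real Gdev moves device
# memory via DMA; here we just move Python objects between dicts.

def flush_to_host(swap, key):
    """Second-stage eviction: device swap space -> host main memory."""
    swap["host"][key] = swap["device"].pop(key)

def retrieve(swap, key):
    """Return (location, data); prefer the fast on-device copy."""
    if key in swap["device"]:
        return "device", swap["device"][key]
    return "host", swap["host"][key]
```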
65 final/model.tex
@@ -0,0 +1,65 @@
+\section{System Model}
+This paper focuses on a system composed of a GPU and a multi-core CPU.
+GPU applications use a set of API functions supported by the system, and
+typically take the following steps:
+(i) allocate space to the device memory,
+(ii) move data to the allocated space on the device memory,
+(iii) launch computation on the GPU,
+(iv) move resultant data back to the host memory, and
+(v) free the allocated space from the device memory.
+%This is the most well-known GPU programming model.
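+Steps (i)--(v) can be sketched with stand-ins for the Gdev-style API
+functions that appear later in the paper (\texttt{gmalloc},
+\texttt{gmemcpy\_to\_device}, \texttt{glaunch},
+\texttt{gmemcpy\_from\_device}); \texttt{gfree} and the simulated device
+heap are assumptions made for illustration only:

```python
# The five steps above, sketched with stand-ins for Gdev-style API
# names. Device memory is simulated with a host-side dict, and the
# "kernel" just doubles each element.

device_heap = {}

def gmalloc(size):
    handle = len(device_heap)
    device_heap[handle] = [0] * size
    return handle

def gmemcpy_to_device(handle, data):
    device_heap[handle][:len(data)] = data

def glaunch(kernel, handle):
    device_heap[handle] = kernel(device_heap[handle])

def gmemcpy_from_device(handle):
    return list(device_heap[handle])

def gfree(handle):
    del device_heap[handle]

# (i) allocate, (ii) upload, (iii) launch, (iv) download, (v) free
h = gmalloc(4)
gmemcpy_to_device(h, [1, 2, 3, 4])
glaunch(lambda buf: [x * 2 for x in buf], h)
result = gmemcpy_from_device(h)
gfree(h)
```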
+We also assume that the GPU is based on NVIDIA's \textit{Fermi}
+architecture.
+The concept of Gdev, however, is not limited to Fermi, but is also
+applicable to other architectures to which the following model applies.
+The GPU is operated by commands.
+The commands are architecture-specific.
+Each GPU context is assigned a FIFO queue into which the CPU programs
+submit GPU commands.
+When the GPU dispatches these commands, the corresponding GPU context
+can execute.
+Each GPU context is assigned a hardware channel.
+Command dispatching and context execution are managed per channel.
+In the Fermi architecture, multiple channels cannot run simultaneously when
+using the same GPU functional unit, while they can when using different
+functional units.
+However, they are allowed to coexist, and the GPU switches the channels
+automatically in hardware.
+\textbf{Address Space:}
+Each GPU context runs in a separate virtual address space, which is also
+associated with the channel.
+The device driver is in charge of setting page tables for the memory
+management unit on the GPU.
+\textbf{I/O Register:}
+The GPU provides a set of memory-mapped I/O registers per context,
+visible to the device driver through the (PCI) I/O bus.
+The device driver needs to manage these registers to send commands and
+set up channels and address space.
+\textbf{Compute Unit:}
+The GPU maps threads assigned by programmers to cores on the compute unit.
+This thread assignment, however, is not visible to the CPU, which
+implies that GPU resource management by the system should be
+Multiple contexts cannot run on the compute unit together, since
+more than one channel cannot access the same functional unit simultaneously,
+though multiple requests spawned from the same context can be processed.
+GPU computation is non-preemptive.
+\textbf{DMA Unit:}
+There are two types of DMA units for data transmission: (i) synchronous
+with the compute unit and (ii) asynchronous.
+Only the latter type of DMA units can overlap their operations with the
+compute unit.
+DMA data transmission is also non-preemptive.
248 final/references.bib
@@ -0,0 +1,248 @@
+author = {P. Barham and B. Dragovic and K. Fraser and S. Hand and T. Harris and A. Ho and R. Neugebauer and I. Pratt and A. Warfield},
+title = {{Xen and the art of virtualization}},
+booktitle = {Proc. of the ACM Symposium on Operating Systems Principles},
+year = {2003}
+author = {M. Bautin and A. Dwarakinath and T. Chiueh},
+title = {{Graphics engine resource management}},
+booktitle = {Proc. of the Annual Multimedia Computing and Networking Conference},
+year = {2008}
+author = {S. Che and M. Boyer and J. Meng and D. Tarjan and J. Sheaffer and S-H. Lee and K. Skadron},
+title = {{Rodinia: A benchmark suite for heterogeneous computing}},
+booktitle = {Proc. of the IEEE International Conference on Workload Characterization},
+pages = {44--54},
+year = {2009}
+author = {L. Chen and O. Villa and S. Krishnamoorthy and G. Gao},
+title = {{Dynamic Load Balancing on Single- and Multi-GPU Systems}},
+booktitle = {Proc. of the IEEE International Parallel and Distributed Processing Symposium},
+year = {2010}
+author = {G. Diamos and A. Kerr and S. Yalamanchili and N. Clark},
+title = {{Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems}},
+booktitle = {Proc. of the ACM International Conference on Parallel Architectures and Compilation Techniques},
+pages = {353--364},
+year = {2010}
+author = {M. Dowty and J. Sugerman},
+title = {{GPU virtualization on VMware's hosted I/O architecture}},
+journal = {ACM Operating Systems Review},
+volume = {43},
+number = {3},
+pages = {73--82},
+year = {2009}
+author = {M. Guevara and C. Gregg and K. Hazelwood and K. Skadron},
+title = {{Enabling Task Parallelism in the CUDA Scheduler}},
+booktitle = {Proc. of the Workshop on Programming Models for Emerging Architectures},
+pages = {69--76},
+year = {2009}
+author = {A. Gulati and I. Ahmad and C. Waldspurger},
+title = {{PARDA: Proportional allocation of resources for distributed storage access}},
+booktitle = {Proc. of the USENIX Conference on File and Storage Technology},
+year = {2009}
+author = {V. Gupta and K. Schwan and N. Tolia and V. Talwar and P. Ranganathan},
+title = {{Pegasus: Coordinated scheduling for virtualized accelerator-based systems}},
+booktitle = {Proc. of the USENIX Annual Technical Conference},
+year = {2011}
+author = {S. Han and K. Jang and K. Park and S. Moon},
+title = {{PacketShader: a GPU-accelerated software router}},
+booktitle = {Proc. of ACM SIGCOMM},
+year = {2010}
+author = {T. Jablin and P. Prabhu and J. Jablin and N. Johnson and S. Beard and D. August},
+title = {{Automatic CPU-GPU communication management and optimization}},
+booktitle = {Proc. of the ACM Conference on Programming Language Design and Implementation},
+year = {2011}
+author = {K. Jang and S. Han and S. Han and S. Moon and K. Park},
+title = {{SSLShader: Cheap SSL acceleration with commodity processors}},
+booktitle = {Proc. of the USENIX Conference on Networked Systems Design and Implementation},
+year = {2011}
+author = {S. Kato and K. Lakshmanan and R. Rajkumar and Y. Ishikawa},
+title = {{TimeGraph: GPU scheduling for real-time multi-tasking environments}},
+booktitle = {Proc. of the USENIX Annual Technical Conference},
+year = {2011}
+author = {S. Kato and S. Brandt and Y. Ishikawa and R. Rajkumar},
+title = {{Operating systems challenges for GPU resource management}},
+booktitle = {Proc. of the International Workshop on Operating Systems Platforms for Embedded Real-Time Applications},
+pages = {23--32},
+year = {2011}
+author = {S. Kato and K. Lakshmanan and Y. Ishikawa and R. Rajkumar},
+title = {{Resource sharing in GPU-accelerated windowing systems}},
+booktitle = {Proc. of the IEEE Real-Time and Embedded Technology and Applications Symposium},
+pages = {191--200},
+year = {2011}
+author = {S. Kato and K. Lakshmanan and A. Kumar and M. Kelkar and Y. Ishikawa and R. Rajkumar},
+title = {{RGEM: A responsive GPGPU execution model for runtime engines}},
+booktitle = {Proc. of the IEEE Real-Time Systems Symposium},
+pages = {57--66},
+year = {2011}
+author = {C. Kim and J. Chhugani and N. Satish and E. Sedlar and A. Nguyen and T. Kaldewey and V. Lee and S. Brandt and P. Dubey},
+title = {{FAST: Fast architecture sensitive tree search on modern CPUs and GPUs}},
+booktitle = {Proc. of the ACM International Conference on Management of Data},
+year = {2010}
+author = {H.A. Lagar-Cavilla and N. Tolia and M. Satyanarayanan and E. de Lara},
+title = {{VMM-independent graphics acceleration}},
+booktitle = {Proc. of the ACM/Usenix International Conference on Virtual Execution Environments},
+pages = {33--43},
+year = {2007}
+author = {M. McNaughton and C. Urmson and J. Dolan and J-W. Lee},
+title = {{Motion Planning for Autonomous Driving with a Conformal Spatiotemporal Lattice}},
+booktitle = {Proc. of the IEEE International Conference on Robotics and Automation},
+pages = {4889--4895},
+year = {2011}
+author = {A. Povzner and T. Kaldewey and S. Brandt and R. Golding and T. Wong and C. Maltzahn},
+title = {{Efficient guaranteed disk request scheduling with Fahrrad}},
+booktitle = {Proc. of the ACM European Conference on Computer Systems},
+pages = {13--25},
+year = {2008}
+author = {C. Rossbach and J. Currey and M. Silberstein and B. Ray and E. Witchel},
+title = {{PTask: Operating system abstractions to manage GPUs as compute devices}},
+booktitle = {Proc. of the ACM Symposium on Operating Systems Principles},
+year = {2011}
+author = {A. Saba and R. Mangharam},
+title = {{Anytime Algorithms for GPU Architectures}},
+booktitle = {Proc. of the IEEE Real-Time Systems Symposium},
+year = {2011}
+author = {T. Shimokawabe and T. Aoki and C. Muroi and J. Ishida and K. Kawano and T. Endo and A. Nukada and N. Maruyama and S. Matsuoka},
+title = {{An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code}},
+booktitle = {Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis},
+year = {2010}
+author = {W. Sun and R. Ricci},
+title = {{Using GPUs for OS kernel security}},
+howpublished = {USENIX Security, Poster Session},
+year = {2011}
+author = {Y. Wang and A. Merchant},
+title = {{Proportional-share scheduling for distributed storage systems}},
+booktitle = {Proc. of the USENIX Conference on File and Storage Technology},
+year = {2007}
+author = {NVIDIA},
+title = {{Linux~X64~Display Driver}},
+howpublished = {\url{}},
+year = {2010}
+author = {NVIDIA},
+title = {{CUDA 4.0}},
+howpublished = {\url{}},
+year = {2011}
+author = {K.E. Martin and R.E. Faith and J. Owen and A. Akin},
+title = {{Direct Rendering Infrastructure, Low-Level Design Document}},
+organization = {Precision Insight, Inc.},
+year = {1999}
+author = {M. Koscielnicki},
+title = {{envytools}},
+howpublished = {\url{git://}},
+year = {2012}
+author = {PathScale},
+title = {{ENZO}},
+howpublished = {\url{}},
+year = {2011}
+author = {NVIDIA},
+title = {{NVIDIA's next generation CUDA compute architecture: Fermi}},
+howpublished = {\url{}},
+year = {2009}
+author = {Mesa3D},
+title = {Gallium3D},
+howpublished = {\url{}},
+year = {2012}
+author = {Khronos Group},
+title = {{OpenCL 1.2}},
+howpublished = {\url{}},
+year = {2011}
+author = {{TOP500~Supercomputing~Site}},
+howpublished = {\url{}},
+year = {2011}
125 final/related_work.tex
@@ -0,0 +1,125 @@
+\section{Related Work}
+\textbf{GPU Resource Management:}
+The GPU is a compute device that essentially requires resource management
+to operate.
+Recently the research community has devised several approaches to GPU
+resource management.
+TimeGraph~\cite{Kato_ATC11} and GERM~\cite{Bautin_MCNC08} are GPU
+command-driven schedulers integrated in the device driver.
+While TimeGraph provides priority and isolation schemes for
+GPU applications, later extended with resource sharing
+schemes~\cite{Kato_RTAS11}, GERM enhances fair-share GPU resource
+The Gdev scheduler also respects priorities, isolation, and fairness
+similar to TimeGraph and GERM, but adopts an API-driven scheduler model,
+which could reduce the number of scheduler invocations at runtime.
+The API-driven scheduler model is also more reliable for controlling GPU
+resource usage, since it accounts only for the time when the GPU is used
+for computation and data transmission, whereas the command-driven
+scheduler model assumes all commands to consume GPU resources.
+PTask~\cite{Rossbach_SOSP11} is an OS abstraction for GPU applications
+to minimize data communication between the host and device memory
+through a data-flow programming model.
+It also addresses a scheduling problem.
+CGCM~\cite{Jablin_PLDI11} is a compiler and runtime library solution to
+dynamically and automatically optimize host-device data communication.
+Gdev does not support such data-flow programming or automatic code
+generation; however, it provides programmers with a first-class IPC
+primitive to share device memory among GPU contexts, which can similarly
+reduce data communication overhead.
+RGEM~\cite{Kato_RTSS11} is a user-space runtime model for real-time
+GPGPU applications.
+It creates preemption points for host-device data transmissions to bound
+the blocking times imposed on high-priority tasks.
+It also separates the context queues to demultiplex the scheduling of
+data transmissions and computations.
+Although it uses a queue design similar to RGEM's, Gdev addresses a core
+challenge of integrating GPU resource management into the OS to overcome
+traditional user-space limitations.
+In addition to the above differences, Gdev provides virtual GPUs that
+enable users to view a physical GPU as multiple logical GPUs for
+resource usage isolation.
+None of the previous work distinguishes compute and memory bandwidth,
+which could cause GPU resources to be significantly underutilized under
+reservation strategies.
+Furthermore, our Gdev prototype design and implementation are
+self-contained, allowing the OS to fully control and even use the GPU as
+a first-class resource, whereas previous work depends, more or less, on
+proprietary software or existing drivers, which could force their
+solutions, if not concepts, to partly adhere to user-space runtimes.
+Comparisons of Gdev and representatives of the above GPU resource
+management approaches are summarized in Table~\ref{tab:related_work}.
+ \caption{Comparisons of Gdev and prior GPU resource management
+ approaches.}
+ \label{tab:related_work}
+ \begin{center}
+ {\sf
+ \begin{tabular}{|l|p{12.8cm}|}
+ \hline
+ \hline
+ \end{tabular}
+ }
+ \end{center}
+\textbf{GPUs as OS Resources:}
+A significant limitation of the current GPU programming framework is
+that GPU applications must reside in the user space.
+KGPU~\cite{Sun_SECURITY11_Poster} is a combination of the OS kernel
+module and user-space daemon, which allows the OS to use GPUs by
+up-calling the user-space daemon from the OS to access the GPU.
+On the other hand, Gdev provides OS modules with a set of traditional
+API functions for GPU programming, such as CUDA.
+Hence, legacy GPU application code can run in the OS without any
+modifications and need not move back and forth between the user
+space and the OS.
+\textbf{GPU Virtualization:}
+VMGL~\cite{Lagar-Cavilla_VEE07} virtualizes GPUs at the OpenGL
+API level, and VMware's Virtual GPU~\cite{Dowty_SIGOPS09} exhibits I/O
+virtualization through graphics runtimes.
+On the other hand, Pegasus~\cite{Gupta_ATC11} uses a hypervisor,
+Xen~\cite{Barham_SOSP03} in particular, to co-schedule GPUs and virtual
+CPUs in VMs.
+Nonetheless, these virtualization systems rely on user-space runtimes
+provided by proprietary software, preventing the system from managing
+GPU resources in a fine-grained manner.
+In addition, they are mainly designed to make GPUs available in
+virtualized environments, but are not tailored to isolate GPU resources
+among users.
+Gdev provides virtual GPUs with strong time and space partitioning, and
+hence could underlie these GPU virtualization systems.
+\textbf{I/O Scheduling:}
+GPU scheduling must deal with the non-preemptive nature of execution,
+as does traditional I/O scheduling.
+Several disk bandwidth-aware schedulers~\cite{Gulati_FAST09,
+Povzner_EUROSYS08, Wang_FAST07}, for example, contain a similar idea to
+the Gdev scheduler.
+Unlike typical I/O devices, however, GPUs are coprocessors operating
+asynchronously with own sets of execution contexts, registers, and memory.
+Therefore, Gdev adopts a scheduling algorithm more appropriate for
+compute-intensive workloads.
+\textbf{Compile-Time and Application Approaches:}
+GPU resource usage can also be managed in user application
+programs~\cite{Chen_IPDPS10,Guevara09,Saba_RTSS11}, but these approaches
+essentially require the programs to be modified or recompiled using
+specific compilers and algorithms.
+Thereby, the generality of programming is compromised.
+Under Gdev, on the other hand, applications can use traditional GPU
+programming languages.
180 final/scheduling.tex
@@ -0,0 +1,180 @@
+\section{GPU Scheduling}
+The goal of the Gdev scheduler is to correctly assign computation and
+data transmission times for each GPU context based on the given
+scheduling policy.
+Although we make use of some previous
+techniques~\cite{Kato_RTSS11,Kato_ATC11}, Gdev provides a new queuing
+scheme and virtual GPU support for time-sharing systems.
+Gdev also propagates the task priority used in the OS to the GPU context.
+\subsection{Scheduling and Queuing}
+Gdev uses a scheme similar to TimeGraph~\cite{Kato_ATC11} for GPU
+scheduling.
+Specifically, it allows GPU contexts to use GPU resources only if no
+other contexts are using the corresponding resources.
+The pending GPU contexts are queued by the scheduler while waiting for
+the current context to finish using the resources.
+To notify the completion of the current context execution, Gdev
+uses additional GPU commands to generate an interrupt from the GPU.
+The highest-priority context is chosen from the queue upon every
+interrupt, and dispatched to the GPU.
+The computation and data transmission times are separately accumulated
+for resource accounting.
+For compute requests, we also allow the same context to launch compute
+instances simultaneously, and the total makespan from the first to the last
+instance is deemed the computation time.
+PTask~\cite{Rossbach_SOSP11} and RGEM~\cite{Kato_RTSS11} also use
+similar mechanisms, but do not use interrupts, and thereby resource
+accounting is managed by the user space via the API.
+Gdev is API-driven, invoking a scheduler only when \texttt{gmemcpy*} or
+\texttt{glaunch} is called, while TimeGraph is command-driven, invoking
+a scheduler whenever GPU commands are flushed.
+In this regard, Gdev is similar to PTask~\cite{Rossbach_SOSP11} and
+RGEM~\cite{Kato_RTSS11}.
+However, Gdev differs even from these prior systems in that it supports
+separate queues for resource accounting of compute and memory-copy
+operations, which we call \textit{Multiple Resource Queues} (MRQ), while
+we call \textit{Single Device Queue} (SDQ) for the previous approach
+where the scheduler supports only a single queue per device for resource
+accounting.
+The MRQ scheme is apparently more efficient than the SDQ scheme, when
+different compute and memory-copy operations can be overlapped.
+Suppose that there are two contexts both requesting 50\% of compute
+and 50\% of memory-copy demands.
+The SDQ scheme considers that the demand of each context is 100\% by
+adding compute and memory-copy demands, and the total demand of the
+two contexts is 200\%.
+This workload thereby looks overloaded under the SDQ scheme.
+The MRQ scheme, on the other hand, does not consider the total workload
+to be overloaded but each resource to be fully utilized.
+Gdev creates scheduler threads to separately control the resource
+usage of the GPU compute unit and DMA unit.
+The compute scheduler thread is invoked by GPU interrupts generated upon
+the completion of each GPU compute operation, while the DMA scheduler
+thread is awakened by the Gdev runtime when the memory-copy operation is
+completed, since we do not use interrupts for memory-copy operations.
+\subsection{Virtual GPU Support}
+Gdev provides virtual GPUs, which virtualize a physical GPU into logical
+GPUs to protect a group of GPU users from others.
+Virtual GPUs are activated by specifying the weights of GPU resources
+assigned to each of them.
+We classify GPU resources into the \textit{memory share}, \textit{memory
+bandwidth}, and \textit{compute bandwidth}.
+The memory share is the weight of the physical memory available for the
+virtual GPU.
+The memory bandwidth is the amount of time in some period given for
+memory-copy operations related to the GPU, and the compute bandwidth is
+that for GPU compute operations.
+For the memory share, Gdev simply partitions the physical memory.
+For the compute and memory-copy bandwidth, however, we leverage the GPU
+schedulers to meet their requirements.
+Considering the similar non-preemptive characteristics of compute and
+memory-copy operations, we apply the same policy to both the compute and
+memory-copy schedulers.
+The challenge for virtual GPU scheduling is raised by the non-preemptive
+and bursty nature of GPU workloads.
+We have implemented the Credit scheduling algorithm supported by Xen
+hypervisor~\cite{Barham_SOSP03} to verify if an existing virtual CPU
+scheduling policy can be applied for a virtual GPU scheduler.
+However, we have found that the Credit scheduler fails to maintain the
+desired bandwidth for the virtual GPU, largely attributed to the fact
+that it presumes preemptive constantly-working CPU workloads, while GPU
+workloads are non-preemptive and bursty.
+\begin{figure}[t]
+ \begin{center}
+ \begin{tabular}{l}
+ \hline
+ \hline
+ {\small \verb|vgpu->bgt|: budget of the virtual GPU.}\\
+ {\small \verb|vgpu->utl|: actual GPU utilization of the virtual GPU.}\\
+ {\small \verb|vgpu->bw|: bandwidth assigned to the virtual GPU.}\\
+ {\small \verb|current/next|: current/next virtual GPU selected for run.}\\
+ \hline
+ {\small \verb|void on_arrival(vgpu, ctx) {|}\\
+ {\small \verb| if (current && current != vgpu)|}\\
+ {\small \verb| suspend(ctx);|}\\
+ {\small \verb| dispatch(ctx);|}\\
+ {\small \verb|}|}\\
+ {\small \verb|vgpu_object* on_completion(vgpu, ctx) {|}\\
+ {\small \verb| if (vgpu->bgt < 0 && vgpu->utl > vgpu->bw)|}\\
+ {\small \verb| move_to_queue_tail(vgpu);|}\\
+ {\small \verb| next = get_queue_head();|}\\
+ {\small \verb| if (!next) return null;|}\\
+ {\small \verb| if (next != vgpu && next->utl > next->bw) {|}\\
+ {\small \verb| wait_for_short();|}\\
+ {\small \verb| if (current) return null;|}\\
+ {\small \verb| }|}\\
+ {\small \verb| return next;|}\\
+ {\small \verb|}|}\\
+ \hline
+ \end{tabular}
+ \caption{Pseudo-code of the Band scheduler.}
+ \label{fig:band}
+ \end{center}
+ \vspace{-1.5em}
+\end{figure}
+To overcome the virtual GPU scheduling problem, we propose a
+\textit{bandwidth-aware non-preemptive device} (Band) scheduling
+algorithm.
+The pseudo-code of the Band scheduler is shown in
+Figure~\ref{fig:band}.
+The \texttt{on\_arrival} function is called when a GPU context
+(\texttt{ctx}) running on a virtual GPU (\texttt{vgpu}) tries to use GPU
+resources via the \texttt{glaunch} or \texttt{gmemcpy*} functions.
+The context can be dispatched to the GPU, only if no other virtual GPUs
+are accessing the GPU.
+Otherwise, the corresponding task is suspended.
+The \texttt{on\_completion} function is called by the scheduler thread
+upon the completion of a GPU context (\texttt{ctx}) assigned to a virtual
+GPU (\texttt{vgpu}), in order to select the next virtual GPU to run.
+The Band scheduler is based on the Credit scheduler, but differs in the
+following two points.
+First, the Band scheduler lowers the priority of the virtual GPU, when
+its budget (credit) is exhausted \textit{and} its actual utilization of
+the GPU is exceeding the assigned bandwidth, whereas the Credit
+scheduler always lowers the priority when the budget is exhausted.
+This prioritization compensates for credit errors caused by
+non-preemptive executions.
+The second modification to the Credit scheduler is that the Band
+scheduler waits for a certain amount of time specified by the system
+designer, if the GPU utilization of the virtual GPU selected
+by the scheduler is exceeding its assigned bandwidth.
+This time-buffering approach works for non-preemptive burst workloads.
+Suppose that the system has two virtual GPUs, both of which run some
+burst-workload GPU contexts, but their non-preemptive execution times
+are different.
+If the contexts arrive in turn, they are also dispatched to the GPU in
+turn, but the GPU utilization could not be fair due to different lengths
+of non-preemptive executions.
+If the scheduler waits for a short interval, however, the context with a
+short length of non-preemptive execution could arrive with the next
+request, and the \texttt{on\_arrival} function can dispatch it to the
+GPU while the scheduler is waiting.
+Thus, resource allocations could become fairer.
+In this case, we need not select the next virtual GPU, since the
+\texttt{on\_arrival} function has already dispatched one.
+If no contexts have arrived, however, we return the selected virtual GPU.
+This situation implies that there are no burst workloads, and hence no
+emergency to meet the bandwidth.
94 final/usenix.sty
@@ -0,0 +1,94 @@
+% usenix.sty - to be used with latex2e for USENIX.
+% To use this style file, look at the template usenix_template.tex
+% $Id: usenix.sty,v 1.2 2005/02/16 22:30:47 maniatis Exp $
+% The following definitions are modifications of standard article.sty
+% definitions, arranged to do a better job of matching the USENIX
+% guidelines.
+% It will automatically select two-column mode and the Times-Roman
+% font.
+% USENIX papers are two-column.
+% Times-Roman font is nice if you can get it (requires NFSS,
+% which is in latex2e).
+\if@twocolumn\else\input twocolumn.sty\fi
+% USENIX wants margins of: 1" sides, 1" bottom, and 1" top.
+% 0.25" gutter between columns.
+% Gives active areas of 6.5" x 9"
+% Usenix wants no page numbers for camera-ready papers, so that they can
+% number them themselves. But submitted papers should have page numbers
+% for the reviewers' convenience.
+% \pagestyle{empty}
+% Usenix titles are in 14-point bold type, with no date, and with no
+% change in the empty page headers. The whole author section is 12 point
+% italic--- you must use {\rm } around the actual author names to get
+% them in roman.
+ \begingroup
+ \renewcommand\thefootnote{\fnsymbol{footnote}}%
+ \def\@makefnmark{\hbox to\z@{$\m@th^{\@thefnmark}$\hss}}%
+ \long\def\@makefntext##1{\parindent 1em\noindent
+ \hbox to1.8em{\hss$\m@th^{\@thefnmark}$}##1}%
+ \if@twocolumn
+ \twocolumn[\@maketitle]%
+ \else \newpage
+ \global\@topnum\z@
+ \@maketitle \fi\@thanks
+ \endgroup
+ \setcounter{footnote}{0}%
+ \let\maketitle\relax
+ \let\@maketitle\relax
+ \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax}
+ \vbox to 1.5in{ %% Modified by shinpei - it was 2.5in originally.
+ \vspace*{\fill}
+ \vskip 2em
+ \begin{center}%
+ {\Large\bf \@title \par}%
+ \vskip 0.375in minus 0.300in
+ {\large\it
+ \lineskip .5em
+ \begin{tabular}[t]{c}\@author
+ \end{tabular}\par}%
+ \end{center}%
+ \par
+ \vspace*{\fill}
+% \vskip 1.5em
+ }
+% The abstract is preceded by a 12-pt bold centered heading
+{\large\bf \abstractname\vspace{-.5em}\vspace{\z@}}%
+% Main section titles are 12-pt bold. Others can be same or smaller.
+\def\section{\@startsection {section}{1}{\z@}{-3.5ex plus-1ex minus
+ -.2ex}{2.3ex plus.2ex}{\reset@font\large\bf}}
