submitted version

1 parent accab93 commit 688511abe22957b48feb8ada4819ce7daa856420 Shinpei Kato committed Nov 19, 2012
@@ -0,0 +1,7 @@
+\section{Acknowledgement}
+
+We thank the members of the HBT-EP Team at Columbia University for sharing
+their knowledge of plasma and fusion use cases.
+Figures 7, 8, 9, and 10 were provided by the HBT-EP Team.
+This work is also supported in part by the U.S. Department of Energy (DOE)
+under Grant DE-FG02-86ER53222.
@@ -18,7 +18,7 @@ \section{Benchmarking}
affects time to completion.
We also showcase effective host read and write throughput for each I/O
processing scheme.
-This benchmarking clarifies the capabilities of the proposed scheme, not
+This benchmarking clarifies the capabilities of the presented scheme, not
specific to plasma control but applicable to generic low-latency
GPU computing.
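
As a point of reference for the throughput numbers reported in this section, the following is a minimal sketch (not the paper's benchmark harness) of how effective host write throughput to a buffer could be measured; mapped_buf is assumed to be a host-visible pointer obtained under whichever I/O processing scheme is being tested.

/* Minimal sketch (not the paper's benchmark code): effective host write
 * throughput, in MiB/s, to a host-visible buffer. "mapped_buf" is assumed
 * to be a pointer obtained under one of the I/O processing schemes. */
#include <string.h>
#include <time.h>

double host_write_mib_per_s(void *mapped_buf, const void *src,
                            size_t size, int iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(mapped_buf, src, size);          /* host write into the buffer */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)size * iters / sec / (1024.0 * 1024.0);
}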
@@ -32,7 +32,7 @@ \section{Case Study}
from the viewpoint of the cycle time of algorithm execution and the
latency of data transfer.
Each scheme is applied as follows:
-\begin{itemize}
+\begin{itemize} \itemsep1pt
\item In the {\hd} scheme, the device driver of the digitizer transfers
the input data set to the buffers allocated on the host memory.
The control program copies this data set to the device memory via
@@ -49,7 +49,7 @@ \section{Case Study}
need to perform data copies.
However, this scheme must compromise the latency of data access
imposed on the GPU when executing the control algorithm.
- \item Similarly to the {\hp} scheme, the {\dm} scheme proposed in this
+ \item Similarly to the {\hp} scheme, the {\dm} scheme presented in this
paper uses pinned PCI-mapped host memory space to allocate the
input and output buffers, and further maps it to the device memory
through PCI BAR space.
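
To make the distinction between the three schemes concrete, the sketch below shows how the input buffer might be set up for each one with the CUDA Driver API. cuMemMap() is the extension introduced by this work; its prototype here, like the helper names, is an assumption rather than the authors' actual code.

/* Sketch of input-buffer setup under the three schemes (CUDA Driver API).
 * The cuMemMap() prototype below is assumed; this excerpt does not give
 * its exact signature. Error checking omitted. */
#include <cuda.h>

CUresult cuMemMap(void **buf, CUdeviceptr dptr, size_t size);   /* assumed prototype */

void setup_hd(CUdeviceptr *d_in, size_t size)   /* {\hd}: plain device memory */
{
    cuMemAlloc(d_in, size);
    /* each cycle: cuMemcpyHtoD(*d_in, host_staging_buf, size); */
}

void setup_hp(CUdeviceptr *d_in, void **h_in, size_t size)   /* {\hp}: pinned host memory */
{
    cuMemHostAlloc(h_in, size, CU_MEMHOSTALLOC_DEVICEMAP);
    cuMemHostGetDevicePointer(d_in, *h_in, 0);  /* GPU reaches it over PCIe */
}

void setup_dm(CUdeviceptr *d_in, void **h_in, size_t size)   /* {\dm}: device memory mapped to host */
{
    cuMemAlloc(d_in, size);
    cuMemMap(h_in, *d_in, size);                /* I/O devices can now DMA to d_in */
}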
@@ -68,7 +68,7 @@ \section{Case Study}
and read the output data from the specified PCI regions through DMA,
respectively.
These PCI regions are directly mapped to the device memory space
-allocated by the control system, using the {\dm} scheme proposed in this
+allocated by the control system, using the {\dm} scheme presented in this
paper.
In consequence, once the input and output modules are configured, and
the device program is launched at the beginning, the algorithm can keep
@@ -87,6 +87,11 @@ \section{Case Study}
system.
Figure~\ref{fig:eval_plasma} shows the result of experiments conducted
under the three different schemes, respectively.
+The algorithm cycle and the data transfer latency are measured
+individually.
+By nature, the {\dm} and {\hd} schemes should exhibit the same
+algorithm cycle, while the {\dm} and {\hp} schemes should exhibit
+the same data transfer latency.
The {\dm} scheme achieves the highest rate in both algorithm
execution and data transfer.
The remaining two schemes, on the other hand, compromise one or the
@@ -104,7 +109,7 @@ \section{Case Study}
We suspect that this latency comes from some interactions among the host
computer, the graphics card, and the I/O modules.
Lessons learned from this evaluation are summarized as follows:
-\begin{itemize}
+\begin{itemize} \itemsep1pt
\item Zero-copy I/O processing is very effective for this control
system, reducing the latency of data transfer from $16\mu$s to
$4\mu$s.
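
To illustrate why the zero-copy path removes the per-cycle transfer cost, here is a rough sketch of the one-time setup under the {\dm} scheme. cuMemMap() and cuMemGetPhysAddr() are the extensions described in this work, with assumed signatures; digitizer_set_dma_write_target() and output_set_dma_read_source() are hypothetical stand-ins for the vendor-specific I/O-module driver interface. This is not the authors' actual control code.

/* Rough sketch of the one-time setup under the {\dm} scheme: once the
 * I/O modules DMA directly to/from the mapped device memory and the
 * control kernel is launched, no per-cycle copies are needed.
 * cuMemMap()/cuMemGetPhysAddr() signatures are assumed; the two DMA
 * configuration calls are hypothetical driver hooks. Error checking
 * omitted. */
#include <cuda.h>

CUresult cuMemMap(void **buf, CUdeviceptr dptr, size_t size);           /* assumed */
CUresult cuMemGetPhysAddr(unsigned long long *addr, void *buf);         /* assumed */
void digitizer_set_dma_write_target(unsigned long long addr, size_t size);  /* hypothetical */
void output_set_dma_read_source(unsigned long long addr, size_t size);      /* hypothetical */

void setup_zero_copy_control(CUfunction algo, size_t in_size, size_t out_size)
{
    CUdeviceptr d_in, d_out;
    void *h_in, *h_out;
    unsigned long long in_phys, out_phys;

    /* Device buffers, mapped into the host address space through the BAR. */
    cuMemAlloc(&d_in, in_size);
    cuMemAlloc(&d_out, out_size);
    cuMemMap(&h_in, d_in, in_size);
    cuMemMap(&h_out, d_out, out_size);

    /* Bus addresses of the mapped regions, for the I/O modules' DMA engines. */
    cuMemGetPhysAddr(&in_phys, h_in);
    cuMemGetPhysAddr(&out_phys, h_out);
    digitizer_set_dma_write_target(in_phys, in_size);   /* digitizer writes into d_in */
    output_set_dma_read_source(out_phys, out_size);     /* output module reads d_out  */

    /* Launch the control algorithm once; it keeps reading d_in and
     * updating d_out without further host intervention. */
    void *args[] = { &d_in, &d_out };
    cuLaunchKernel(algo, 1, 1, 1, 1, 1, 1, 0, NULL, args, NULL);
}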
@@ -13,11 +13,11 @@ \section{Conclusion}
We believe that the contribution of this paper facilitates a grander
vision of CPS using parallel computing technology.
-In future work, we extend our schemes to support multiple contexts.
-This extension is essential to control multiple plants using the
-identical GPU.
-A key challenge is to ensure exclusive direct access to the same memory
-space, because the device driver and the runtime system cannot
-intermidiate in zero-copy I/O processing.
-We also plan to generalize our schemes for arbitrary I/O devices such as
-ethernet and firewire interfaces.
+%In future work, we extend our schemes to support multiple contexts.
+%This extension is essential to control multiple plants using the
+%identical GPU.
+%A key challenge is to ensure exclusive direct access to the same memory
+%space, because the device driver and the runtime system cannot
+%intermediate in zero-copy I/O processing.
+%We also plan to generalize our schemes for arbitrary I/O devices such as
+%ethernet and firewire interfaces.
@@ -89,9 +89,9 @@
\affaddr{Dept. Information Engineering}\\
\affaddr{Nagoya University}
\and
-\alignauthor Nikolaus Rath\\
- \affaddr{Dept. Applied Physics and Applied Mathematics}\\
- \affaddr{Columbia University}
+%\alignauthor Nikolaus Rath\\
+% \affaddr{Dept. Applied Physics and Applied Mathematics}\\
+% \affaddr{Columbia University}
\and
\alignauthor Jason Aumiller and Scott Brandt\\
\affaddr{Dept. Computer Science}\\
@@ -126,6 +126,7 @@
\input{benchmarking.tex}
\input{related_work.tex}
\input{conclusion.tex}
+\input{acknowledgement.tex}
\bibliographystyle{abbrv}
%{\footnotesize
@@ -41,9 +41,9 @@ \section{System Implementation}
functions, {\tt cuMemMap()}, {\tt cuMemUnmap()}, and {\tt
cuMemGetPhysAddr()}, which are compatible with the form of the CUDA Driver
API.
-We now provide the details of these functions below:
+We now provide the details of these functions:
-\begin{itemize}
+\begin{itemize} \itemsep1pt
\item \textbf{Mapping Memory:}
We assign the second PCIe BAR region, \textit{i.e.}, BAR1 (among
the five of those supported by hardware as a window of the device
@@ -3,7 +3,7 @@ \section{Introduction}
\begin{figure}[t]
\centering
- \includegraphics[width=0.82\hsize]{eps/tokamak.eps}
+ \includegraphics[width=0.79\hsize]{eps/tokamak.eps}
\caption{Columbia's HBT-EP ``Tokamak''.}
\label{fig:tokamak}
\end{figure}
@@ -81,9 +81,9 @@ \section{Introduction}
The rest of this paper is organized as follows.
Section~\ref{sec:system_model} describes the system model and
assumptions behind this paper.
-Section~\ref{sec:io_processing} proposes our zero-copy I/O processing
+Section~\ref{sec:io_processing} presents our zero-copy I/O processing
scheme, and differentiates it from the existing schemes.
-Section~\ref{sec:implementation} presents details of system
+Section~\ref{sec:implementation} describes details of system
implementation.
In Section~\ref{sec:case_study}, a case study of plasma control is
provided to demonstrate the real-world impact of our contribution.
@@ -31,7 +31,7 @@ \subsection{Host and Device Memory ({\hd})}
explicitly.
Figure~\ref{fig:hd} illustrates an overview of how this scheme works:
-\begin{enumerate}
+\begin{enumerate} \itemsep1pt
\item The device driver of the input device configures the input device
to send data to the allocated space of the host memory.
\item The device driver of the GPU copies the data to the
@@ -112,7 +112,7 @@ \subsection{Device Memory Mapped to Host ({\dm})}
memory, the system is required to support the following functions to have I/O
devices directly access this memory space:
-\begin{itemize}
+\begin{itemize} \itemsep1pt
\item \textbf{Mapping Memory:}
As technology stands today, PCIe BARs are the most reasonable
windows that see through the device memory from the host and I/O
