updated case_study.tex

commit 2febcf7a8395046554d5a456625bb523a87ab742 1 parent 66e8613
Shinpei Kato authored
21 draft/case_study.tex
@@ -57,6 +57,25 @@ \section{Case Study}
the control algorithm is limited within the device memory.
+This paper does not provide the details of the control algorithm, since
+it is not within the scope of this paper.
+Interested readers are encouraged to reference \cite{Boozer_PP99}.
+The outline of the control system implementation is that the host
+program launches the device program on the GPU once at the beginning
+when the system is loaded.
+The device program is polling until the input data set arrives.
+This is due to a requirement of low-latency computing.
+The input and output modules are configured to write the input data to
+and read the output data from the specified PCI regions through DMA,
+These PCI regions are directly mapped to the device memory space
+allocated by the control system, using the {\dm} scheme proposed in this
+In consequence, once the input and output modules are configured, and
+the device program is launched at the beginning, the algorithm can keep
+executing on the GPU, without accessing the CPU and the host memory at
@@ -64,7 +83,7 @@ \section{Case Study}
-We now demonstrate that our zero-copy I/O processing scheme reduces both
+We now demonstrate that the proposed {\dm} scheme reduces both
the cycle time of algorithm execution and the latency of data transfer
in this control system.
Figure~\ref{fig:eval_plasma} shows the result of experiments conducted
BIN  draft/draft.pdf
Binary file not shown
2  draft/introduction.tex
@@ -21,7 +21,7 @@ \section{Introduction}
University~\cite{Maurer_PPCF11,Rath_FED12} that magnetically controls
the 3-D perturbed equilibrium state of the plasma~\cite{Boozer_PP99}.
It is required to process 96 inputs and 64 outputs of 16-bit data at a
-sampling rate of a few microseconds.
+control rate of a few microseconds.
An initial attempt of the Columbia team employed fast CPUs or FPGAs, but
even the simplified algorithm failed to run within 20$\mu$s.
An alternative approach was to parallelize the algorithm using the
6 draft/io_processing.tex
@@ -55,7 +55,7 @@ \subsection{Host and Device Memory ({\hd})}
the device memory.
This overhead might be a crucial penalty for low-latency GPU computing.
-\subsection{Host Pinned Memory (Hpin)}
+\subsection{Host Pinned Memory ({\hp})}
@@ -83,7 +83,7 @@ \subsection{Host Pinned Memory (Hpin)}
GPU can read and write the data directly.
However, this data access is expensive as is a PCIe communication.
-\subsection{Device Memory Mapped to Host (Dmap)}
+\subsection{Device Memory Mapped to Host ({\dm})}
@@ -143,7 +143,7 @@ \subsection{Device Memory Mapped to Host (Dmap)}
same concept of zero-copy I/O processing can be also applied to any
mapping methods.
-\subsection{Device Memory Mapped Hybrid (DmapH)}
+\subsection{Device Memory Mapped Hybrid ({\dmh})}
This is a hybrid of the {\dm} and the {\hd} schemes, which is in