reflected Scott's comments

1 parent 2febcf7 commit accab935176648399578411da2e99d84e8c6df09 Shinpei Kato committed Oct 28, 2012
@@ -1,30 +1,22 @@
\begin{abstract}
Cyber-physical systems (CPS) aim to control complex real-world
- phenomenon.
- Due to real-time constraints, however, the computational cost of
- control algorithms is becoming a major issue of CPS.
- Parallel computing of the control algorithms using state-of-the-art
- compute devices, such as graphics processing units (GPUs), is a
- promissing approach to reduce this computational cost, given that
- real-world phenomenon by nature compose data parallelism, yet another
- problem is introduced by the overhead of data transfer between the host
- processor and the compute device.
- As a matter of fact, plasma control requires an order of a few
- microseconds for the control rate, while today's systems may take an
- order of ten microseconds to copy the corresponding problem size of
- data between the host processor and the compute device, which is
- unacceptable lantecy.
- In this paper, we propose a zero-copy I/O processing scheme that
- enables sensor and actuator devices to directly transfer data to and
- from the compute device.
- The basic idea behind this scheme is to map I/O address space,
- accessible to sensor and actuator devices, to virtual memory space of
- the compute device.
- The results of experiments using the real-world plasma control system
- demonstrates that the computational cost of the plasma control
- algorithm is reduced by 33\% under our new scheme.
- We further provide the results of microbenchmarking to show that more
- generic matrix computations are completed in 34\% less time than
- current methods, using our new scheme, while effective data throughput
- remains at least as good as the current best performers.
+ phenomena, where computational cost and real-time constraints can be a
+ major challenge.
+ Parallel computing technology using compute devices such as graphics
+ processing units (GPUs) promises to accelerate this computation by
+ leveraging the data parallelism often found in real-world scenarios,
+ but performance is limited by the overhead of data transfer between the
+ host and the device memory.
+ For example, plasma control in the HBT-EP ``Tokamak'' device at Columbia
+ University~\cite{Maurer_PPCF11,Rath_FED12} must run its control
+ algorithm within a few microseconds, yet copying the corresponding data
+ set between the host and the device memory may take tens of microseconds.
+ We present a new zero-copy I/O processing scheme that maps
+ the I/O address space of the system to the virtual address space of the
+ compute device, allowing sensor and actuator devices to transfer data
+ to and from the compute device directly.
+ Experiments with the plasma control system show a 33\% reduction in
+ computational cost, and microbenchmarks with more generic matrix
+ operations show a 34\% reduction; in both cases, effective data
+ throughput remains at least as good as that of the current best
+ performers.
\end{abstract}
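To make the mapping described in the abstract concrete, here is a minimal, hypothetical sketch of how an I/O module's PCI region could be exposed to GPU code. It is not the implementation evaluated in this paper; it relies on a stock CUDA runtime facility (the cudaHostRegisterIoMemory flag, available in CUDA 7.5 and later) that postdates this work, and the sysfs path, region size, and omission of error handling are illustrative assumptions.

/* Hypothetical sketch (not the paper's implementation): expose an I/O
 * module's PCI BAR to GPU code using the stock CUDA runtime.
 * The sysfs path and region size are placeholders. */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define IO_REGION_SIZE (64 * 1024)   /* assumed size of the PCI BAR */

int main(void)
{
    /* Map the I/O module's PCI BAR into the host address space. */
    int fd = open("/sys/bus/pci/devices/0000:05:00.0/resource0",
                  O_RDWR | O_SYNC);
    void *bar = mmap(NULL, IO_REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* Register the MMIO region with CUDA and obtain a device pointer,
     * so GPU kernels can access the sensor data in place. */
    cudaHostRegister(bar, IO_REGION_SIZE,
                     cudaHostRegisterIoMemory | cudaHostRegisterMapped);
    void *dev_ptr = NULL;
    cudaHostGetDevicePointer(&dev_ptr, bar, 0);

    printf("I/O region visible to the GPU at %p\n", dev_ptr);
    /* ... launch the control kernel with dev_ptr as its buffer ... */

    cudaHostUnregister(bar);
    munmap(bar, IO_REGION_SIZE);
    close(fd);
    return 0;
}

The device pointer returned by cudaHostGetDevicePointer can then be handed to a kernel, which reads the sensor data in place rather than waiting for a host-side staging copy.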
@@ -2,8 +2,8 @@ \section{Case Study}
\label{sec:case_study}
In this section, we provide a case study of magnetic control of
-perturbed plasma equilibria using the HBT-EP ``Tokamak'' equipped with
-the GPU and the presented I/O processing schemes.
+perturbed plasma equilibria using the HBT-EP Tokamak equipped with
+the GPU and the aforementioned I/O processing schemes.
This control system requires low latency and high computing capabilities
to achieve a sampling period of the order of ten microseconds, while
processing 96 inputs and 64 outputs of 16-bit data with a complex
@@ -26,9 +26,9 @@ \section{Case Study}
The control input comes from a set of magnetic sensors through a D-TACQ
ACQ196 digitizer, and the resulting control signal is sent to two D-TACQ
AO32 analog output modules to excite control coils.
-These input and output modules are connected to the GPU upon the PCIe
-bus.
-We evaluate three different schemes against this system architecture
+These input and output modules are connected to the NVIDIA Tesla C2050
+GPU (448 cores and 4~GB of memory) via the PCIe bus.
+We evaluate three schemes against this system architecture
from the viewpoint of the cycle time of algorithm execution and the
latency of data transfer.
Each scheme is applied as follows:
@@ -57,13 +57,12 @@ \section{Case Study}
the control algorithm is limited within the device memory.
\end{itemize}
-This paper does not provide the details of the control algorithm, since
-it is not within the scope of this paper.
-Interested readers are encouraged to reference \cite{Boozer_PP99}.
+The details of the control algorithm~\cite{Boozer_PP99} are outside the
+scope of this paper.
The control system implementation is outlined as follows: the host
program launches the device program on the GPU once, when the system is
loaded.
-The device program is polling until the input data set arrives.
+The device program polls until the input data set arrives.
This is due to a requirement of low-latency computing.
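As a rough illustration of this polling design (not the authors' code), a persistent CUDA kernel of the following shape could busy-wait on a flag and process each sample set in place. The `in', `out', and `ready' pointers are assumed to refer to memory that the I/O modules reach directly, for example a region mapped into the GPU's address space as sketched after the abstract.

/* Illustrative persistent polling kernel (not the authors' code). */
__global__ void control_loop(volatile unsigned short *in,   /* 96 inputs  */
                             volatile unsigned short *out,  /* 64 outputs */
                             volatile int *ready)
{
    for (;;) {
        /* Busy-wait until the digitizer signals a new sample set;
         * polling avoids a kernel launch on every control cycle. */
        while (*ready == 0)
            ;

        /* Placeholder for the control algorithm (96 inputs -> 64 outputs). */
        if (threadIdx.x < 64)
            out[threadIdx.x] = in[threadIdx.x];

        __threadfence_system();   /* make the outputs visible over PCIe */
        __syncthreads();
        if (threadIdx.x == 0)
            *ready = 0;           /* re-arm for the next sample period  */
        __syncthreads();
    }
}

Launching such a kernel once, with a single thread block, avoids the per-cycle kernel-launch latency that would otherwise dominate a microsecond-scale control period.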
The input and output modules are configured to write the input data to
and read the output data from the specified PCI regions through DMA,
@@ -83,20 +82,20 @@ \section{Case Study}
\label{fig:eval_plasma}
\end{figure}
-We now demonstrate that the proposed {\dm} scheme reduces both
-the cycle time of algorithm execution and the latency of data transfer
-in this control system.
+We now show that the {\dm} scheme reduces both the cycle time of
+algorithm execution and the latency of data transfer in this control
+system.
Figure~\ref{fig:eval_plasma} shows the results of the experiments
conducted under each of the three schemes.
-Apparently, the {\dm} scheme achieves the highest rate in both algorithm
+The {\dm} scheme achieves the highest rate in both algorithm
execution and data transfer.
-The remaining two schemes, on the other hand, compromise either of them.
-It is fairly reasonable to observe that the {\hd} and the {\dm} schemes
-exhibit the same performance level for algorithm execution since they
-both use the device memory, while the latency of data transfer is
-equivalent between the {\hp} and the {\dm} schemes since they both
-remove data copies.
-Comparing the {\hd} and the {\hp} schemes, one can also observe that the
+The remaining two schemes, on the other hand, each compromise one or
+the other.
+The {\hd} and the {\dm} schemes exhibit the same performance level for
+algorithm execution since they both use the device memory, while the
+data transfer latencies of the {\hp} and the {\dm} schemes are
+equivalent since they both remove data copies.
+Comparing the {\hd} and the {\hp} schemes, one can also see that the
overhead introduced by data copies, \textit{i.e.}, $16\mu$s, has a
greater impact on the overall system performance than the overhead of
the GPU accessing pinned host memory space, \textit{i.e.}, $6\mu$s.
@@ -110,10 +109,10 @@ \section{Case Study}
system, reducing the latency of data transfer from $16\mu$s to
$4\mu$s.
The speed-up ratio is 4$\times$.
- \item Furthermore, the {\dm} scheme proposed in this paper reduces the
- cycle time of algorithm execution from $6\mu$s to $4\mu$s.
+ \item Furthermore, the {\dm} scheme reduces the cycle time of algorithm
+ execution from $6\mu$s to $4\mu$s.
The speed-up ratio is 1.5$\times$.
- Since the HBT-EP ``Tokamak'' accommodates up to 216 inputs,
+ Since the HBT-EP Tokamak accommodates up to 216 inputs,
which means that the cycle time of algorithm execution is increasingly
dominated by data accesses, the benefit of the {\dm} scheme over
the {\hp} scheme would be more significant for a larger scale of
@@ -196,7 +195,7 @@ \section{Case Study}
Finally, we discuss practical findings of plasma control regarding the
HBT-EP Tokamak device.
Figure~\ref{fig:plasma_overview} shows a comparison of the average
-perturbation amplitudes with different phasings.
+perturbation amplitudes with different phasing.
The control system incorporates four arrays of magnetic sensors and
control coils, each of which controls one specific mode.
They are placed at different poloidal angles around the toroidal ring.
@@ -206,5 +205,5 @@ \section{Case Study}
perturbation, while feedback at 100 degrees is the right range for
suppression.
Compared to the no-feedback scenario (``No FB'' in the figure), for
-example, we find that we can supress the strength of the rotating field
+example, we find that we can suppress the strength of the rotating field
by up to 30\% for any mode observed in this experiment.
@@ -0,0 +1,23 @@
+\section{Conclusion}
+\label{sec:conclusion}
+
+In this paper, we have presented a new approach to zero-copy I/O
+processing schemes for low-latency GPU computing.
+This is a significant contribution toward meeting the real-time and
+real-fast requirements of CPS.
+The plasma control system, developed as an example application of CPS,
+demonstrated that our zero-copy I/O processing scheme achieved a
+control period of only $16\mu$s for full plasma control processing.
+Our microbenchmarking evaluation also revealed the detailed properties
+of I/O processing schemes for the GPU.
+We believe that the contribution of this paper facilitates a broader
+vision of CPS built on parallel computing technology.
+
+In future work, we will extend our schemes to support multiple contexts.
+This extension is essential for controlling multiple plants with the
+same GPU.
+A key challenge is to ensure exclusive direct access to the same memory
+space, because the device driver and the runtime system cannot mediate
+zero-copy I/O processing.
+We also plan to generalize our schemes to arbitrary I/O devices such as
+Ethernet and FireWire interfaces.
@@ -42,8 +42,8 @@
\begin{document}
\title{
-%Zero-Copy I/O Processing for Low-Latency GPU Computing
-Zero-Copy I/O for Low-Latency GPU Computing
+Zero-Copy I/O Processing for Low-Latency GPU Computing
+%Zero-Copy I/O for Low-Latency GPU Computing
}
%
% You need the command \numberofauthors to handle the 'placement
@@ -116,7 +116,7 @@
\thispagestyle{empty}
\input{abstract.tex}
-\keywords{GPGPU, Zero-Copy I/O, Fusion and Plasma, CPS}
+\keywords{GPGPU; Low Latency; Fusion and Plasma; Energy CPS}
\input{introduction.tex}
\input{system_model.tex}
@@ -125,8 +125,11 @@
\input{case_study.tex}
\input{benchmarking.tex}
\input{related_work.tex}
+\input{conclusion.tex}
\bibliographystyle{abbrv}
+%{\footnotesize
\bibliography{references}
+%}
\end{document}