updated case_study.tex

1 parent dbd87bf commit 9d211636b7854723d316cf181756057389f4d6bf Shinpei Kato committed Oct 24, 2012
Showing with 141 additions and 19 deletions.
  1. +137 −18 draft/case_study.tex
  2. +2 −1 draft/draft.tex
  3. BIN draft/eps/eval_plasma.eps
  4. +2 −0 draft/system_model.tex
155 draft/case_study.tex
@@ -2,15 +2,15 @@ \section{Case Study}
\label{sec:case_study}
In this section, we provide a case study of magnetic control of
-perturbed plasma equilibria using the HBT-EP ``Tokamak'' equiped with
+perturbed plasma equilibria using the HBT-EP ``Tokamak'' equipped with
the GPU and the presented I/O processing schemes.
This control system requires low latency and high computing capabilities
-to achieve a sampling period of the order of ten microsenconds, while
+to achieve a sampling period of the order of ten microseconds, while
processing 96 inputs and 64 outputs of 16-bit data with a complex
algorithm.
As mentioned in Section~\ref{sec:introduction}, the CPU implementation
of this algorithm has never met the requirement of computing power.
-The case study presented herein, therefore, is signicant in that we look
+The case study presented herein, therefore, is significant in that we look
into a possibility of GPU implementations for the plasma control system.
\begin{figure}[t]
@@ -23,25 +23,83 @@ \section{Case Study}
Figure~\ref{fig:tokamak_sysarch} shows a system architecture used in
this case study.
-The control input comes from a set of magnetic sensors through the
-D-TACQ ACQ196 digitizer, and the resulting control signal is pulled by
-two D-TACQ AO32 analog output modules to excite control coils.
-These input and output module devices are connected to the GPU via the
-PCIe bus.
-We demonstrate that our zero-copy I/O processing scheme reduces both
-the algorithm cycle time and the data transfer latency of feedback
-control under this system architecture.
+The control input comes from a set of magnetic sensors through a D-TACQ
+ACQ196 digitizer, and the resulting control signal is sent to two D-TACQ
+AO32 analog output modules to excite control coils.
+These input and output modules are connected to the GPU via the PCIe
+bus.
+We evaluate three different schemes against this system architecture
+from the viewpoint of the cycle time of algorithm execution and the
+latency of data transfer.
+Each scheme is applied as follows:
+\begin{itemize}
+ \item In the {\hd} scheme, the device driver of the digitizer transfers
+ the input data set to the buffers allocated on the host memory.
+ The control program copies this data set to the device memory via
+ PCI-mapped host memory space, and the parallelized control
+ algorithm runs on the GPU with the data on the device memory.
+ Once the output data set is produced by the GPU, the control
+ program copies it back to the host memory, and it is finally
+ pulled by the device driver of the analog output modules.
+ This is the traditional form of GPU computing.
+ \item The {\hp} scheme pins the input and output buffers to PCI-mapped
+ host memory space.
+ Since the data set pushed and pulled by the device drivers of
+ the I/O modules is directly accessible to the GPU, there is no
+ need to perform data copies.
+ However, this scheme compromises the latency of data access, since
+ the GPU must access pinned host memory space when executing the
+ control algorithm.
+ \item Similarly to the {\hp} scheme, the {\dm} scheme proposed in this
+ paper uses pinned PCI-mapped host memory space to allocate the
+ input and output buffers, and further maps it to the device memory
+ through PCI BAR space.
+ Thus, there is no need for data copies, while the data accesses of
+ the control algorithm are confined to the device memory.
+\end{itemize}
\begin{figure}[t]
\centering
\includegraphics[width=\hsize]{eps/eval_plasma.eps}
- \caption{Cycle time and latency of feedback control using the HBT-EP
- ``Tokamak''.}
+ \caption{Cycle time and latency of the plasma control system.}
\label{fig:eval_plasma}
\end{figure}
-The cycle time and the latency of this feedback control system are shown
-in Figure~\ref{fig:eval_plasma}.
+We now demonstrate that our zero-copy I/O processing scheme reduces both
+the cycle time of algorithm execution and the latency of data transfer
+in this control system.
+Figure~\ref{fig:eval_plasma} shows the results of experiments conducted
+under each of the three schemes.
+Clearly, the {\dm} scheme achieves the best performance in both
+algorithm execution and data transfer.
+The remaining two schemes, on the other hand, compromise one or the
+other.
+It is fairly reasonable to observe that the {\hd} and the {\dm} schemes
+exhibit the same performance level for algorithm execution since they
+both use the device memory, while the latency of data transfer is
+equivalent between the {\hp} and the {\dm} schemes since they both
+remove data copies.
+Comparing the {\hd} and the {\hp} schemes, one can also observe that the
+impact of overhead introduced by data copies, \textit{i.e.}, $16\mu$s, is
+greater than that introduced by the GPU accessing pinned host memory
+space, \textit{i.e.}, $6\mu$s, on the overall system performance.
+Curiously, an additional latency of $4\mu$s is observed when running
+the control system.
+We suspect that this latency comes from some interactions among the host
+computer, the graphics card, and the I/O modules.
+Lessons learned from this evaluation are summarized as follows:
+\begin{itemize}
+ \item Zero-copy I/O processing is very effective for this control
+ system, reducing the latency of data transfer from $16\mu$s to
+ $4\mu$s.
+ The speed-up ratio is 4$\times$.
+ \item Furthermore, the {\dm} scheme proposed in this paper reduces the
+ cycle time of algorithm execution from $6\mu$s to $4\mu$s.
+ The speed-up ratio is 1.5$\times$.
+ The HBT-EP ``Tokamak'' accommodates up to 216 inputs, in which
+ case the cycle time of algorithm execution is dominated even more
+ by data accesses.
+ The benefit of the {\dm} scheme over the {\hp} scheme would
+ therefore be more significant for larger-scale plasma control.
+\end{itemize}
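+The two speed-up ratios quoted above follow directly from the measured
+values.
+As a short worked check, using symbols that are introduced here only
+for illustration and are not notation from the rest of the paper, let
+$L$ denote the latency of data transfer and $C$ the cycle time of
+algorithm execution:
+\begin{displaymath}
+ \frac{L_{\mathrm{copy}}}{L_{\mathrm{zerocopy}}}
+ = \frac{16\mu\mathrm{s}}{4\mu\mathrm{s}} = 4,
+ \qquad
+ \frac{C_{\mathrm{pinned}}}{C_{\mathrm{device}}}
+ = \frac{6\mu\mathrm{s}}{4\mu\mathrm{s}} = 1.5 .
+\end{displaymath}
+Here the subscripts ``copy'' and ``zerocopy'' refer to data transfer
+with and without data copies, while ``pinned'' and ``device'' refer to
+the control algorithm accessing pinned host memory space and the device
+memory, respectively.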
\begin{figure}[t]
\centering
@@ -50,13 +108,58 @@ \section{Case Study}
\label{fig:oscilloscope}
\end{figure}
+The above measurements indicate that the control system can operate at a
+sampling period of $16\mu$s.
+The data transfers from the input modules and to the output modules
+take $4\mu$s each.
+The algorithm execution takes $4\mu$s.
+Adding the additional latency of $4\mu$s, the total control period comes
+to $16\mu$s.
+Figure~\ref{fig:oscilloscope} shows a screenshot of the oscilloscope
+with which we measure the signals of the input and the output modules.
+The top and middle signals represent the input and the output,
+respectively, while the bottom signal indicates the base clock.
+The grid spacing of the X axis is $5\mu$s.
+The time interval from the first falling edge of the clock signal after
+the input signal rises to the instant when the output signal starts to
+rise is almost equal to $16\mu$s.
+This means that the total control processing time is $16\mu$s.
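+As a compact restatement of the latency budget given above, with
+symbols that are introduced here purely for illustration, let
+$T_{\mathrm{in}}$ and $T_{\mathrm{out}}$ be the data transfer times of
+the input and the output modules, $T_{\mathrm{exec}}$ the algorithm
+execution time, and $T_{\mathrm{add}}$ the additional latency:
+\begin{displaymath}
+ T_{\mathrm{total}}
+ = T_{\mathrm{in}} + T_{\mathrm{exec}} + T_{\mathrm{out}}
+ + T_{\mathrm{add}}
+ = (4 + 4 + 4 + 4)\mu\mathrm{s}
+ = 16\mu\mathrm{s},
+\end{displaymath}
+which agrees with the interval read from the oscilloscope.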
+
+We next demonstrate that the control system is running properly at a
+sampling period of $16\mu$s.
+The control input comes from a set of magnetic sensors arranged in a
+ring, as illustrated in Figure~\ref{fig:tokamak_sysarch}, and the
+magnetic field that they measure rotates, with its orientation
+described by a \textit{phase}.
+Ideally, the phase is equal to the product of time and frequency.
+To control this mode, the control system needs to produce a control
+signal that generates an equal and opposite field, which also needs to
+rotate.
+If the rotation frequency were constant, the two fields would have a
+constant phase difference, because it is given by the product of the
+control processing time and the rotation frequency.
+In practice, however, the rotation frequency is not constant but
+changes over time.
+As a result, the phase difference observed with the base output signal
+appears to oscillate, which can be seen as spikes in
+Figure~\ref{fig:phase_base}.
+Now, we measure the phase difference with the output signal time-shifted
+by $16\mu$s.
+In other words, the effective control system latency is reduced by
+$16\mu$s.
+As shown in Figure~\ref{fig:phase_shifted}, the spikes are now all
+removed.
+This indicates that the control system is perfectly in phase with the
+mode, and the effective control system latency now must be zero,
+\textit{i.e.}, the actual latency is $16\mu$s.
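+The phase argument above can be sketched as follows, using notation
+that is assumed here for illustration only: $f$ is the rotation
+frequency, $\tau$ is the control processing time, and the phase $\phi$
+is measured in cycles.
+\begin{displaymath}
+ \phi(t) = f t,
+ \qquad
+ \Delta\phi = \phi(t) - \phi(t - \tau) = f \tau .
+\end{displaymath}
+For a constant $f$, $\Delta\phi$ is constant; when $f$ changes,
+$\Delta\phi$ changes in proportion, and shifting the output signal by
+$\tau = 16\mu$s cancels this term.
+To give a sense of scale, for a hypothetical $f = 5$kHz, a value chosen
+only as an example and not taken from the measurements, we would have
+$\Delta\phi = 5 \times 10^{3} \times 16 \times 10^{-6} = 0.08$ cycles,
+or about 29 degrees.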
+
\begin{figure}[t]
\centering
\includegraphics[width=0.7\hsize]{eps/75221_base.eps}
- \caption{Phase difference observed with the based output signal.}
+ \caption{Phase difference observed with the base output signal.}
\label{fig:phase_base}
\end{figure}
-
\begin{figure}[t]
\centering
\includegraphics[width=0.7\hsize]{eps/75221_shifted.eps}
@@ -67,6 +170,22 @@ \section{Case Study}
\begin{figure}[t]
\centering
\includegraphics[width=0.9\hsize]{eps/overview.eps}
- \caption{New findings of plasma control.}
+ \caption{Practical findings of plasma control.}
\label{fig:plasma_overview}
\end{figure}
+
+Finally, we discuss practical findings of plasma control regarding the
+HBT-EP ``Tokamak'' device.
+Figure~\ref{fig:plasma_overview} shows a comparison of the average
+perturbation amplitudes with different phasings.
+The control system incorporates four arrays of magnetic sensors and
+control coils, each of which controls one specific mode.
+They are placed at different poloidal angles around the toroidal ring.
+Due to their different locations, they measure slightly different
+amplitudes.
+From this experiment, we find that feedback at 280 degrees excites
+perturbation, while feedback at 100 degrees is the right range for
+suppression.
+Compared to the no-feedback scenario (``No FB'' in the figure), for
+example, we find that we can suppress the strength of the rotating field
+by up to 30\% for any mode observed in this experiment.
3 draft/draft.tex
@@ -42,7 +42,8 @@
\begin{document}
\title{
-Zero-Copy I/O Processing for Real-Time GPU Applications
+%Zero-Copy I/O Processing for Low-Latency GPU Computing
+Zero-Copy I/O for Low-Latency GPU Computing
}
%
% You need the command \numberofauthors to handle the 'placement
BIN draft/eps/eval_plasma.eps
Binary file not shown.
2 draft/system_model.tex
@@ -0,0 +1,2 @@
+
+main memory, often referred to as \textit{host memory}.
