
updated abstract, introduction, and system model

1 parent 9d21163 commit dc0130621f11070a33d136c72fd0c5d9e7c9e6ac Shinpei Kato committed Oct 25, 2012
Showing with 128 additions and 66 deletions.
  1. +25 −22 draft/abstract.tex
  2. +2 −1 draft/draft.tex
  3. +54 −42 draft/introduction.tex
  4. 0 draft/io_processing.tex
  5. +47 −1 draft/system_model.tex
47 draft/abstract.tex
@@ -1,27 +1,30 @@
\begin{abstract}
- Cyber-physical systems (CPS) often control complex physical
- phenomenon.
- The computational workload of control algorithms, hence, is becoming a
- core challenge of CPS due to their real-time constraints.
- By nature, control algorithms of CPS exhibit a high degree of data
- parallelism, which can be offloaded to parallel compute devices,
- such as graphics processing units (GPUs).
- Yet another problem is introduced by the communication between the host
+ Cyber-physical systems (CPS) aim to control complex real-world
+ phenomena.
+ Due to real-time constraints, however, the computational cost of
+ control algorithms is becoming a major issue for CPS.
+ Parallel computing of the control algorithms using state-of-the-art
+ compute devices, such as graphics processing units (GPUs), is a
+ promising approach to reducing this computational cost, given that
+ real-world phenomena by nature exhibit data parallelism; yet another
+ problem is introduced by the overhead of data transfer between the host
processor and the compute device.
As a matter of fact, plasma control requires an order of a few
- microseconds for the sampling period, while today's systems may take
- several ten microseconds to copy data between the host and the device
- memory at scale of the required data size.
- In this paper, we present a zero-copy I/O processing scheme that
+ microseconds for the control rate, while today's systems may take on
+ the order of tens of microseconds to copy the corresponding amount of
+ data between the host processor and the compute device, which is an
+ unacceptable latency.
+ In this paper, we propose a zero-copy I/O processing scheme that
enables sensor and actuator devices to directly transfer data to and
- from compute devices without using the host processor.
- The basic idea behind this scheme is to map the I/O address space onto
- the device memory, removing data-copy operations upon the host memory.
- The experimental results from the real-world plasma control
- system demonstrate that a sampling period of plasma control can be
- reduced by 33\% under the zero-copy I/O scheme.
- The microbenchmarking results also show that GPU-accelerated matrix
- computations can be completed in 34\% less time than current methods,
- while effective data throughput is at least as good as the current best
- performers.
+ from the compute device.
+ The basic idea behind this scheme is to map the I/O address space,
+ which is accessible to sensor and actuator devices, onto the virtual
+ memory space of the compute device.
+ The results of experiments using the real-world plasma control system
+ demonstrate that the computational cost of the plasma control
+ algorithm is reduced by 33\% under our new scheme.
+ We further provide microbenchmarking results to show that, using our
+ new scheme, more generic matrix computations are completed in 34\% less
+ time than with current methods, while effective data throughput
+ remains at least as good as that of the current best performers.
\end{abstract}
3 draft/draft.tex
@@ -116,9 +116,10 @@
\thispagestyle{empty}
\input{abstract.tex}
-\keywords{GPGPU, Zero-Copy I/O, Plasma Fusion}
+%\keywords{GPGPU, Zero-Copy I/O, Fusion and Plasma, CPS}
\input{introduction.tex}
+\input{system_model.tex}
\input{case_study.tex}
\input{benchmarking.tex}
96 draft/introduction.tex
@@ -2,15 +2,18 @@ \section{Introduction}
\label{sec:introduction}
Cyber-physical systems (CPS) are next generations of networked and
-embedded systems, tightly coupled with computations and physical
-elements to control physical phenomenon.
+embedded systems, tightly coupling computation and physical
+elements to control real-world phenomena.
Control algorithms of CPS, therefore, are becoming more and more
complex, which makes CPS distinguished from traditional safety-critical
systems.
-In CPS applications, the real-fast is as important as the real-time,
-while only the real-time is a primary concern in safety-critical systems.
-This double-edge requirement of the real-time and the real-fast,
-however, has posed a core challenge of CPS platforms.
+In CPS applications, ``real-fast'' is often as important as ``real-time'',
+while safety-critical systems are likely to have only the real-time
+constraint.
+Such a double-edged real-time and real-fast requirement of CPS, however,
+has imposed a core challenge on systems technology.
+In this paper, we tackle this problem with a specific example of plasma
+control.
\begin{figure}[tb]
\centering
@@ -28,48 +31,57 @@ \section{Introduction}
sampling rate of a few microseconds.
An initial attempt of the Columbia team employed fast CPUs or FPGAs, but
even the simplified algorithm failed to run within 20$\mu$s.
-An alternative approach was to parallelize the algorithm for the
-graphics processing unit (GPU) using CUDA~\cite{CUDA}, the most
+An alternative approach was to parallelize the algorithm using the
+graphics processing unit (GPU) and CUDA~\cite{CUDA} -- the most
successful massively parallel computing technology.
However, the current system for GPU computing is not designed to
-integrate sensor and actuator devices.
-This is largely attributed to the fact that the GPU computing stack is
-independent of I/O device drivers.
-Since it may take tens of microseconds to transfer hundreds of bytes
-between the CPU and the GPU, the current system does not allow plasma
-control to use the GPU.
+integrate sensor and actuator devices with the GPU.
+This is attributed to the fact that the current software stack
+of GPU computing is independent of I/O device drivers.
+Since it might take tens of microseconds to transfer hundreds of bytes
+of data between the CPU and the GPU, the current software stack cannot
+afford to apply the GPU to plasma control in real time.
This is a significant problem not only for plasma control but also for
-many applications of CPS that utilize compute devices with I/O devices.
+any application of CPS that is augmented with compute devices.
-To the best of our knowledge, there is currently no generic support for
-direct communication between the GPU and I/O devices, though a
-specialized proprietary product for InfiniBand networks is
-available~\cite{GPUDirect}.
-There are also pinned memory allocation methods available from current
-programming frameworks to reduce data-copy operations, but it is unclear
-if they are best suited for real-time GPU applications.
-Although GPUs have been increasingly utilized in the domain of
+In order to utilize the GPU for applications of CPS, the system is
+required to support a method of bypassing data transfer between the GPU
+and the CPU, instead connecting the GPU and I/O devices directly.
+To the best of our knowledge, however, there is currently no generic
+systems support for such a direct data transfer mechanism except for
+specialized commercial products for the InfiniBand
+network~\cite{GPUDirect}.
+As a way of eliminating the data transfer cost between the CPU and the
+GPU, current programming frameworks often support pinned host memory
+allocation, which enables the GPU to access data on the host memory;
+however, it is unclear if this scheme is best suited for low-latency GPU
+computing, because access to the host memory from the GPU is
+expensive.
+Given that GPUs are increasingly deployed in the domain of
CPS~\cite{Hirabayashi_REACTION12, Mangharam11, McNaughton_ICRA11,
-Michel_IROS07}, and GPU resource management techniques have been
-invented~\cite{Elliott_RTS12, Elliott_ECRTS12, Kato_RTAS11, Kato_RTSS11,
-Kato_ATC11, Kato_ATC12, Liu_PACT12}, an integration of I/O processing
-and GPUs remains an open problem.
+Michel_IROS07},
+and basic real-time resource management techniques for the GPU have
+started to appear~\cite{Elliott_RTS12, Elliott_ECRTS12, Kato_RTAS11,
+Kato_RTSS11, Kato_ATC11, Kato_ATC12, Liu_PACT12}, it is time to look
+into a tight integration of I/O processing and GPU computing.
+\textbf{Contribution:}
In this paper, we present a zero-copy I/O processing scheme for GPU
-applications.
-This scheme incorporates functions and their application programming
-interface (API) for I/O device drivers to directly transfer data to and
-from GPU memory space, removing additional data-copy operations between
-the CPU and the GPU.
-We also investigate exisiting approaches, and compare them to the
-presented zero-copy I/O processing scheme.
-Our case study uses the Columbia University's Tokamak plasma control
-system to evalaute a reduced sampling rate of plasma control.
-In order to evaluate more generic properties of I/O processing schemes,
-we further provide microbenchmarks, and discuss the pros and cons of each
-scheme.
-By clarifying GPU capabilities, we aim to not only improve the overall
+computing.
+This scheme enables I/O devices to directly transfer a limited amount of
+data to and from the GPU, by coordinating their device drivers.
+We also investigate whether the existing schemes can support
+low-latency GPU computing, and identify the advantages of our new scheme.
+To do so, we provide a case study using Columbia University's
+Tokamak plasma control system that demonstrates the effect of our
+new scheme on the sampling period of plasma control.
+Furthermore, we provide microbenchmarking results to evaluate more
+generic properties of the I/O processing schemes.
+By clarifying these capabilities, we aim to not only improve the overall
performance but also broaden the scope of CPS that can benefit from the
-use of GPU technology.
-
+state-of-the-art GPU computing technology.
+\textbf{Organization:}
+The rest of this paper is organized as follows.
+Section~\ref{sec:system_model} describes the system model and
+assumptions behind this paper.
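The "pinned host memory allocation" scheme that the revised introduction compares against appears to correspond to mapped (zero-copy) pinned memory in the CUDA runtime. The sketch below is only an illustration of that existing scheme, not of the zero-copy I/O scheme the paper contributes; the kernel body and buffer size are placeholders.

    /* Minimal sketch of the existing mapped pinned-memory scheme: the GPU
     * reads and writes pinned host memory over PCIe, so no explicit
     * cudaMemcpy is issued, but every access still crosses the bus. */
    #include <cuda_runtime.h>

    __global__ void control_step(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];      /* placeholder for the control law */
    }

    int main(void)
    {
        const int n = 256;              /* illustrative problem size */
        float *h_in, *h_out, *d_in, *d_out;

        cudaSetDeviceFlags(cudaDeviceMapHost);   /* enable mapped host memory */
        cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        /* sensor input would be written into h_in here */
        control_step<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        /* actuator output would be read from h_out here */

        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

In contrast, the scheme proposed in the paper maps the I/O address space of the sensor and actuator devices to the GPU, so the data path avoids the host memory altogether; that mechanism relies on driver-level support rather than on the runtime calls shown above.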
0 draft/io_processing.tex
No changes.
48 draft/system_model.tex
@@ -1,2 +1,48 @@
+\section{System Model}
+\label{sec:system_model}
-main memory, often referred to as \textit{host memory}.
+This paper assumes that the system is composed of compute devices, input
+sensors, and output actuators, in addition to typical workstation
+components of the host computer.
+In particular, we restrict our attention to the GPU as a compute device,
+which best embraces the concept of many-core computing in the current
+state of the art.
+There are at least three \textit{device drivers} employed by the host
+computer to manage the GPU, the sensors, and the actuators,
+respectively.
+We also assume that these devices are connected to the Peripheral
+Component Interconnect Express (PCIe) bus to communicate with each
+other.
+The contribution of this paper, however, is not limited to the PCIe bus,
+but is also applicable to any interconnect, as long as it is mappable to
+I/O address space.
+There are two kinds of memory associated with this address space.
+One is the host memory, also often referred to as the main memory, which
+belongs to the host computer.
+The other is the device memory, which is encapsulated in the GPU.
+Both memory types must be mappable to the PCIe bus.
+
+The control algorithm is parallelized and offloaded to the GPU.
+We specifically use CUDA to implement the algorithm, but any GPU
+programming language can be used in our system model, because the
+application programming interface (API) considered in this paper does
+not depend on a specific programming language.
+All input data come from the sensor modules, while all output data go to
+the actuator modules.
+The buffers for these data can be allocated anywhere visible to the I/O
+address space.
+Depending on the control system design, these data may or may not be
+further copied to other buffers.
+In either case, they are expressed as data \textit{arrays} in the
+control algorithm.
+
+There are several other assumptions that simplify our system model.
+The system contains only a single real-time process (task) that
+executes the control algorithm, except for trivial background jobs to
+run the system.
+Therefore, we ignore the problem of shared resources among multiple
+tasks.
+We also focus on a single instance of the GPU to implement the control
+algorithm; this is not a conceptual limitation of this paper, and the
+algorithm implementation could use multiple GPUs, as long as the I/O address
+space is unified among them.
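As a concrete illustration of a buffer that is "visible to I/O address space" in this model, the sensor driver could expose its region through a character device that the control task maps into its own address space. The sketch below uses a hypothetical device node (/dev/sensor0) and region size; the actual interface is whatever the sensor's device driver provides, which the paper does not prescribe.

    /* Hedged sketch: mapping a sensor's I/O region so that the control
     * algorithm sees it as an ordinary data array.  "/dev/sensor0" and the
     * 4 KB region size are assumptions made for illustration only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t region_size = 4096;           /* illustrative size */
        int fd = open("/dev/sensor0", O_RDWR);     /* hypothetical device node */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *p = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        /* The mapped region now appears to the control code as a plain
         * array of samples. */
        volatile float *samples = (volatile float *)p;
        printf("first sample: %f\n", samples[0]);

        munmap(p, region_size);
        close(fd);
        return 0;
    }

Whether such a mapped region is then accessed by the GPU directly or first copied into another buffer is the design choice that distinguishes the I/O processing schemes compared later in the paper.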
