
init with Abstract, Introduction, and Platform Technology

Shinpei Kato
Shinpei Kato committed Sep 17, 2012
1 parent 206301d commit d1e4c8c0eee4d4746e30bd8b469907e109fe3afc
2,425 IEEEtran.bst

4,722 IEEEtran.cls

@@ -0,0 +1,12 @@
+TARGET = farm
+TEX = latex
+
+all:
+	$(TEX) $(TARGET)
+	bibtex $(TARGET)
+	$(TEX) $(TARGET)
+	$(TEX) $(TARGET)
+	dvipdfm $(TARGET).dvi
+
+clean:
+	rm -fr *~ *.aux *.ps *.pdf *.dvi *.log *.bbl *.blg *.ent
@@ -0,0 +1,94 @@
+%!TEX root = farm.tex
+% no \IEEEPARstart
+Graphics processing units (GPUs) are becoming increasingly common
+platforms for compute-intensive and data-parallel computing.
+In many application domains, GPU-accelerated systems provide significant
+performance gains over traditional multi-core CPU-based systems.
+As shown in Table~\ref{tab:cpu-gpu}, the peak performance of
+state-of-the-art GPUs exceeds 3,000 GFLOPS, integrating more than 1,500
+cores on a chip, which is nearly 19 times the performance of
+traditional microprocessors such as the Intel Core i7 series.
+Such a rapid growth of GPUs is due to recent advances in programming
+support, such as CUDA\cite{cuda} and OpenCL\cite{opencl}, for
+general-purpose computing on GPUs, also known as GPGPU.
+\begin{table*}[t]
+ \caption{Comparison of the Intel CPU Architectures and the NVIDIA GPU
+ Architectures}
+ \label{tab:cpu-gpu}
+ \begin{center}
+ \hbox to\hsize{\hfil
+ \begin{tabular}{|c|c|c|c|c|c|}\hline
+ & Core i7 980XE & Core i7 3960X & GeForce GTX285 & GeForce GTX480 &
+ GeForce GTX680 \\ \hline
+ \# of processing cores & 6 & 6 & 240 & 480 & 1536 \\ \hline
+ Single-precision performance (GFLOPS) & 108.0 & 158.4 & 933.0 & 1350.0
+ & 3090.0 \\ \hline
+ Memory bandwidth (GB/sec) & 37.55 & 51.2 & 159.0 & 177.0 & 192.2 \\ \hline
+ Power consumption (watt) & 130 & 278 & 183 & 250 & 195 \\ \hline
+ Release date & 2010/03 & 2011/11 & 2009/01 & 2010/04 & 2012/03 \\ \hline
+ \end{tabular}\hfil}
+ \end{center}
+\end{table*}
+In recent years, real-time systems have been augmented with
+the GPU~\cite{Kato_ATC11, Kato_RTSS11, Kato_RTAS11, Basaran_ECRTS12,
+Elliott_ECRTS12, Elliott_RTS12}.
+The motivation for using the GPU in real-time systems is mainly found in
+emerging applications of cyber-physical systems~\cite{Aumiller_CPSNA12,
+McNaughton_ICRA11, Ferreira_JRTIP11}, where a large
+amount of data acquired from the physical world needs to be processed in
+real time.
+Given that the workload of such applications is highly compute-intensive and
+data-parallel, many-core computing on the GPU is best suited to meet the
+real-fast requirements of computation.
+What is challenging in this line of work is to control the GPU under
+real-time constraints.
+The GPU is a coprocessor independent of the CPU, and hence two different
+pieces of code are running concurrently on the GPU and the CPU, respectively.
+This heterogeneity poses a core challenge in resource management.
+Since the GPU is designed to accelerate particular workloads, resource
+management functions are often performed on the CPU.
+In other words, the GPU and the CPU must be synchronized in some way to
+ensure timeliness.
+Unfortunately, this could be a major source of latency that makes
+real-time systems unpredictable~\cite{Kato_ATC11}, though previous
+work has been forced to take this approach due to a lack of functionality
+that enables resource management functions to be offloaded onto the GPU.
+While compute cores or shaders on the GPU are not available to perform
+resource management, recent GPUs integrate on-chip microcontrollers on
+which firmware code is launched to control the functional units of the GPU.
+These microcontrollers are well suited to extend the functionality
+of GPU resource management, launching special pieces of firmware code to
+control GPU executions and data transfers.
+This paper presents a compiler and debugging environment for NVIDIA's
+GPU microcontrollers based on the well-known portable LLVM compiler
+infrastructure.
+The main purpose of this environment is to enhance the productivity of
+GPU firmware development so that the community can facilitate future
+research on fine-grained GPU resource management using microcontrollers.
+Firmware is self-contained within the GPU, and there will be no
+interference from background jobs running on the CPU, once it is
+uploaded by the device driver.
+Therefore, we believe that GPU computing would be more timely and
+reliable for real-time systems, if the firmware can support GPU resource
+management by itself.
+In this paper, we develop an initial stage of the firmware, and evaluate
+its basic performance.
+The rest of this paper is organized as follows.
+Section~\ref{sec:tech} introduces the underlying platform technology.
+Section~\ref{sec:design} describes the design and implementation of our
+compiler and debugging environment for NVIDIA's GPU microcontrollers,
+and Section~\ref{sec:evaluation} evaluates its basic performance.
+Related work is discussed in Section~\ref{sec:related}.
+This paper is concluded in Section~\ref{sec:con}.
@@ -0,0 +1,101 @@
+%!TEX root = farm.tex
+\section{Platform Technology}\label{sec:tech}
+First of all, we describe the platform technology underlying our work.
+We focus intensively on NVIDIA's GPU architectures, while the idea of
+integrating GPU resource management into on-chip microcontrollers is not
+limited to these specific architectures.
+All pieces of technology presented herein are open-source and may be
+downloaded from the corresponding websites.
+\subsection{Assembler for GPU microcontrollers}\label{sec:envy}
+The assembler is included in the envytools suite~\cite{envytools}.
+The envytools suite is a rich set of open-source tools to compile or
+decompile GPU shader code, firmware code, macro code, and so on.
+It is also used to generate header files of GPU command definitions used
+by the device driver and the runtime library.
+There are many other useful tools and much documentation for NVIDIA's
+GPU architectures enclosed in the envytools suite.
+\subsection{GPU Device Driver}\label{sec:driver}
+In general, the application programming interface (API) for the GPU is
+provided by the runtime library.
+GPU resource management, on the other hand, is often supported by the
+device driver and the operating system (OS) module~\cite{Kato_ATC11,
+Kato_ATC12, Bautin_MCNC08}.
+As part of resource management, the device driver communicates with
+microcontrollers integrated on the GPU.
+The communication is typically managed by specific commands, which can
+be handled by firmware running on each microcontroller.
+The firmware is built into the device driver in the form of byte code,
+and is uploaded onto the GPU at boot time.
+To do so, we require open-source software, because we have to build the
+firmware into the device driver.
+In this paper, we use Gdev~\cite{Kato_ATC12}, an open-source module of
+the GPGPU device driver and runtime library.
+\subsection{LLVM Infrastructure}
+The LLVM (Low Level Virtual Machine) project is a collection of
+open-source modular and reusable compiler tool sets.
+Since the microcontroller has its own instruction set architecture, we
+develop an architecture-dependent backend for LLVM so that we can make
+use of all the front-end modules of LLVM.
+Figure \ref{fig:llvm} illustrates the structure of LLVM.
+It first generates the LLVM IR (Intermediate Representation) from the
+source code.
+This IR code is assembled by the LLVM backend.
+The assembly code is finally translated to the object code for the
+target machine.
+\begin{figure}[t]
+ \centering
+ \includegraphics[scale = 0.5]{./img/llvmflow.pdf}
+ \caption{Compiling stages in LLVM.}
+ \label{fig:llvm}
+\end{figure}
+\subsubsection{LLVM IR}
+The LLVM IR is the intermediate language used in LLVM, also called
+bitcode or LLVM assembly language.
+This intermediate language is very powerful, scalable, light-weight, and
+low-level enough to underlie many languages on top of many architectures.
+LLVM IR uses SSA (Static Single Assignment) form, which is
+suitable for many compiler optimization algorithms.
+\subsubsection{LLVM frontend}\label{set:clang}
+The LLVM frontend generates an intermediate language from a high-level
+language in LLVM.
+It is mainly used for code generation and its optimization.
+In particular, we use Clang for our development, which is an open-source
+compiler for the C family of programming languages provided by LLVM.
+\subsubsection{LLVM backend}\label{set:backend}
+The LLVM backend generates target code from an intermediate language in
+LLVM.
+The backend of LLVM features a target-independent code generator that
+may create output for several types of target processors including X86,
+PowerPC, ARM, and SPARC.
+This backend framework may also be used to generate code targeted at
+accelerators such as Cell B/E and GPUs.
+In fact, NVIDIA has recently announced that they use LLVM as the basis
+of their CUDA compiler.
+The backend of LLVM is composed of the LLC (LLVM static Compiler) and
+the LLI (LLVM Interpreter).
+LLI is an interpreter of the LLVM IR, also available as a JIT compiler,
+while LLC is a static compiler to generate code.
+We use this backend part of LLVM to generate code targeted at NVIDIA's
+GPU microcontrollers.
@@ -0,0 +1,161 @@
+%!TEX root = farm.tex
+\section{Compiler and Debugging Environment}\label{sec:design}
+This section describes the design and implementation of our compiler and
+debugging environment for NVIDIA's GPU microcontrollers.
+\subsection{GPU microcontroller}
+\begin{table}[t]
+ \caption{Microcontroller Specifications in GF100}
+ \label{tab:fermi}
+ \begin{center}
+ \begin{tabular}{|l|c|c|}\hline
+ Name & HUB & GPC \\\hline
+ Architecture & Fermi & Fermi \\\hline
+ Number & 1 & 4 \\\hline
+ Bit width & 32 bit & 32 bit \\\hline
+ Code size & 16,384 bytes & 8,192 bytes \\\hline
+ Data size & 4,096 bytes & 2,048 bytes \\\hline
+ \end{tabular}
+ \end{center}
+\end{table}
+This research uses the microcontrollers of NVIDIA's Fermi architecture,
+such as the GF100 (GeForce GTX 480).
+In GF100, a Streaming Multiprocessor (SM) consists of 32 CUDA cores, a
+Graphics Processing Cluster (GPC) consists of 4 SMs, and GF100 consists
+of 4 GPCs. Thus, GF100 carries 512 CUDA cores in total; since one full
+SM is disabled, GF100 has 480 usable CUDA cores.
+Since the maximum code size of the microcontroller is limited to 16KB,
+as indicated in Table \ref{tab:fermi}, developers must design the
+firmware carefully.
+\subsection{The compiler for the GPU microcontroller}
+The compiler for the GPU microcontroller generates object code for the
+GPU microcontrollers manufactured by NVIDIA.
+\subsubsection{The overall flow}\label{section:flow}
+\begin{figure}[t]
+ \caption{Detail of Compiler for GPU Microcontroller}
+ \label{fig:compiler}
+\end{figure}
+The compiler for the GPU microcontroller is implemented using LLVM.
+Figure \ref{fig:compiler} shows an overall view of the compiler for the
+GPU microcontroller.
+The main flow of compilation is as follows: first, Clang generates LLVM
+IR from C source code;
+next, LLC generates assembly code from the LLVM IR.
+The assembly code is then divided into a code part and a data part, and
+the code part is combined with bootstrap code.
+Finally, envyas generates an executable file.
+The executable file can be run using the debugging support tool or the
+device driver.
+Developers using this environment only need to write C code to develop
+the firmware.
+\item[ (1) Clang]\mbox{}\\
+Clang is a frontend that generates LLVM IR from C source code.
+\begin{figure}[t]
+ \caption{Step of code generation in LLC}
+ \label{fig:llc}
+\end{figure}
+\begin{figure}[t]
+ \caption{C source code and generated code. Left: C, Center: LLVM IR,
+ Right: Assembly}
+ \label{fig:llvm_code}
+\end{figure}
+\item[ (2) LLC with nvfuc]\mbox{}\\
+As mentioned in Section \ref{set:backend}, LLC is the backend of LLVM
+that compiles LLVM IR code into assembly language for a specified
+architecture.
+Figure \ref{fig:llc} shows the flow of LLC.
+There are five steps to convert LLVM IR into target-specific assembly
+code: flow analysis, optimization, instruction selection, register
+allocation, and code generation.
+This flow is standardized and does not depend on the target machine.
+LLC reads the configuration of the target machine at the time of
+instruction selection and selects the instructions and registers that
+meet the specifications of each machine.
+A new configuration called nvfuc (NVIDIA Firmware Compiler) for the GPU
+microcontrollers manufactured by NVIDIA is added to the target machine
+configurations.
+\item[ (3) LLVM to envyas]\mbox{}\\
+``LLVM to envyas'' divides the generated assembly code into a code
+section and a data section.
+It combines the code section with bootstrap code, which sets up an
+interrupt handler and calls the main function.
+Further, it replaces labels in the code section with data addresses.
+\item[ (4) envyas]\mbox{}\\
+As mentioned in Section \ref{sec:envy},
+``envyas'' is the assembler for the GPU microcontroller and is included
+in the envytools suite.
+envyas generates the executable file from the code section produced in
+Step (3).
+\item[ (5) hex to bin]\mbox{}\\
+``hex to bin'' converts the data part split off in Step (3) into a
+loadable binary file.
+\item[ (6) Running the firmware]\mbox{}\\
+There are two ways to run the firmware: incorporating the firmware into the device driver, or using the debugging support tool.
+The device driver and the debugging support tool load the binary file of the firmware at boot time.
+\subsubsection{The generated code}
+Figure \ref{fig:llvm_code} shows an example of C source code, LLVM IR
+code, and assembly code.
+The left is the C source code, the center is the LLVM IR generated in
+Step (1) of Section \ref{section:flow}, and the right is the assembly
+code generated in Step (3).
+\begin{figure}[t]
+ \caption{Flowchart of Debugging Support Tool}
+ \label{fig:loader}
+\end{figure}
+\begin{figure}[t]
+ \caption{Flowchart of Our Firmware}
+ \label{fig:firmware}
+\end{figure}
+\subsection{Debugging support tool}
+The debugging support tool loads the firmware, sends commands and data,
+and displays GPU register values.
+Figure \ref{fig:loader} shows the flow of this tool, which we describe
+below.
+The microcontroller's memory space is mapped into the CPU's memory space
+via MMIO (memory-mapped I/O).
+\item[ (1) Load the firmware]\mbox{}\\
+The debugging support tool loads the HUB firmware and GPC firmware
+executable code to the mapped addresses via MMIO.
+After loading completes, the firmware is started by setting a flag in a
+specified register.
+\item[ (2) Sends commands and data]\mbox{}\\
+Processing on the microcontroller is suspended until it receives a
+command.
+The debugging support tool sends the command.
+When an interrupt is raised by the command, processing resumes.
+\item[ (3) Display a register] \mbox{}\\
+The microcontroller has a register that may be used freely by the host
+side.
+The existing firmware uses this register as an execution-completion flag.
+We assume that the register serves the same purpose during debugging, so
+the tool displays the register's value.
+\subsection{Firmware development}
+In this section, we describe the firmware we are developing for the HUB
+microcontroller.
+Figure \ref{fig:firmware} shows the flow chart of the firmware.
+The firmware is started by setting a value in the register.
+\item[ (1) initialize]\mbox{}\\
+When started, the firmware sets up the interrupt handler and fetches its
+data, then proceeds to Step (2).
+\item[ (2) sleep]\mbox{}\\
+The firmware shifts to a standby state, waiting to receive a command
+from the device driver or the debugging support tool.
+When a command is received, a firmware interrupt occurs, which starts
+``ihbody''.
+\item[ (3) ihbody] \mbox{}\\
+``ihbody'' enqueues the command and then releases the firmware from its
+wait state.
+\item[ (4) work] \mbox{}\\
+The ``work'' function is called when the firmware's wait state is
+released.
+``work'' dequeues a command and calls the corresponding function.
+After the function executes, it checks the firmware's end flag.
+In this way, the firmware is controlled by executing the function best
+suited to each received command.