
added experimental results

1 parent cba5743 commit 040222a6d435c9faaf377414cd421b83980a5188 @shinpei0208 committed Apr 3, 2013
Showing with 96 additions and 10 deletions.
  1. +1 −1 draft/abstract.tex
  2. +1 −1 draft/conclusion.tex
  3. BIN draft/draft.pdf
  4. +1 −1 draft/draft.tex
  5. +67 −4 draft/evaluation.tex
  6. +26 −3 draft/implementation.tex
@@ -13,5 +13,5 @@
specific algorithm for practical use.
We apply the presented technique to the real-world vehicle detection
program and demonstrate that our implementation using commodity
-GPUs can achieve speedups of 1.5x to 3x in frame-rate over sequential
+GPUs can achieve speedups of 3x to 5x in frame-rate over sequential
and multithreaded implementations using traditional CPUs.
@@ -11,7 +11,7 @@ \section{Conclusion}
computational blocks.
Our evaluation using a commodity GPU showed that our GPU implementation
can speed up the existing HOG-based vehicle detection program tailored
-to the deformable models by 1.5x to 3x over traditional CPU
+to the deformable models by 3x to 5x over traditional CPU
Given that this performance improvement is obtained from the entire
program runtime rather than particular algorithm parts of the program,
Binary file not shown.
@@ -26,7 +26,7 @@
\IEEEauthorblockN{Kazuya Takeda}
\IEEEauthorblockA{Dept. of Media Science\\Nagoya University}
-\IEEEauthorblockN{Seiichi Mita}
+\IEEEauthorblockN{Taiki Kawano and Seiichi Mita}
\IEEEauthorblockA{Research Center for Smart Vehicles\\Toyota
Technological Institute}
@@ -14,9 +14,9 @@ \subsection{Experimental Setup}
We prepare three variants of the vehicle detection program implemented
using (i) a single core of the multicore CPU, (ii) multiple cores of the
multicore CPU, and (iii) massively parallel compute cores of the GPU.
-The CPU implementations use the Intel Core i7 2700K series while we
+The CPU implementations use the Intel Core i7 3930K series while we
provide several varied GPUs for the GPU implementations: namely NVIDIA
-GeForce GTX 560 Ti, GTX 580, GTX 680, Titan, and K20X.
+GeForce GTX 560 Ti, GTX 580, GTX 680, TITAN, and K20Xm.
The same set of 10 images as in previous work \cite{Niknejad12} is used
as input data, and their average computation time is taken as a major
performance metric.
@@ -35,6 +35,45 @@ \subsection{Experimental Results}
+Fig.~\ref{fig:float_exe_time} shows the execution times of all variants
+of the vehicle detection program configured to use single precision
+for floating-point operations.
+The dimensions of input images are 640$\times$480 pixels.
+``sequential'' uses a single CPU core while ``multicore'' uses multiple
+CPU cores with \textit{pthread}.
+Other labels except for ``best'' represent our GPU implementations using
+corresponding GPUs.
+``best'' is the best combination of the GPU and CPU implementations.
+Most computational blocks benefit from GPUs; only the HOG calculation
+prefers the multicore implementation.
+This is because the HOG calculation contains atomic operations that
+serialize the massively parallel threads of the GPU.
+A comparison of different GPUs provides an interesting observation.
+The GPUs based on the state-of-the-art \textit{Kepler}
+architecture \cite{NVIDIA_Kepler} are inferior to those based on the
+previous \textit{Fermi} architecture \cite{NVIDIA_Fermi}.
+The Kepler GPUs employ a significant number of compute cores with an
+enhanced multithreading mechanism.
+However, they operate at a lower frequency than the Fermi GPUs due to
+their complex architecture.
+Since the vehicle detection program is compute-intensive, as depicted
+in Listings~\ref{lst:score} through~\ref{lst:hog}, the operating
+frequency matters more than the architectural benefit.
+This is a useful finding toward the future development of image
+processing with GPUs.
+As a result, the best performance is obtained from a setup that uses
+the multicore implementation for the HOG calculation and offloads the
+other computational blocks to the GeForce GTX 580 GPU.
+It speeds up the execution of vehicle detection by more than 5x over the
+traditional single-core CPU implementation and by 3x over the multicore
+CPU implementation.
+A 3$\sim$5x speed-up for the overall execution of a complex real-world
+application program is a significant contribution, whereas
+order-of-magnitude speed-ups are often reported only for a particular
+part of a program or an algorithm.
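The way the ``best'' configuration and the reported speed-ups relate to per-block timings can be sketched in a few lines of C. This is a minimal illustration, not the authors' measurement code: it assumes that ``best'' is formed by running each computational block on whichever of the multicore and GPU implementations is faster, and the names `best_total_time` and `speedup` are hypothetical. It also ignores any CPU-GPU data transfer cost incurred when adjacent blocks run on different devices.

```c
#include <assert.h>
#include <stddef.h>

/* Total time of a hybrid configuration that runs every computational
 * block on the faster of two implementations (assumption: blocks can be
 * assigned independently and switching devices costs nothing). */
double best_total_time(const double *multicore_ms, const double *gpu_ms,
                       size_t n_blocks)
{
    assert(multicore_ms != NULL && gpu_ms != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n_blocks; i++) {
        total += (multicore_ms[i] < gpu_ms[i]) ? multicore_ms[i] : gpu_ms[i];
    }
    return total;
}

/* Speed-up of an improved implementation over a baseline, as a ratio of
 * whole-program execution times. */
double speedup(double baseline_ms, double improved_ms)
{
    assert(improved_ms > 0.0);
    return baseline_ms / improved_ms;
}
```

With made-up per-block timings in which one block is faster on the CPU and the others on the GPU, `best_total_time` sums the per-block minima, and `speedup` applied to the sequential total gives the whole-program figure of the kind quoted above.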
@@ -43,12 +82,36 @@ \subsection{Experimental Results}
+Fig.~\ref{fig:double_exe_time} shows the execution times of all variants
+of the vehicle detection program configured to use double precision
+for floating-point operations.
+Unlike the single-precision scenario, the Kepler GPUs outperform the
+Fermi GPUs.
+This indicates that the double-precision performance of GPUs improves
+as the generation of GPUs advances.
+Another notable finding is that the TITAN GPU is slightly faster than
+the K20Xm GPU for our vehicle detection program.
+Given that the TITAN GPU is sold at a consumer price while the K20Xm is
+an expensive product aimed at supercomputing, we suggest that the
+vehicle detection program use the TITAN GPU for better cost performance.
\caption{Impact of the image size on execution times.}
- \label{fig:time_on_image_sizedouble_exe_time}
+ \label{fig:time_on_image_size}
+Fig.~\ref{fig:time_on_image_size} shows the impact of the image size on
+execution times.
+Here we use the program configured to use single precision for
+floating-point operations.
+The GPU implementation uses the GeForce GTX 580 GPU, which is the best
+performer among all the GPUs shown in Fig.~\ref{fig:float_exe_time}.
+The lesson learned from this experiment is that the execution time of
+the vehicle detection program grows in proportion to the input image
+size.
+This suggests that the benefit of our GPU implementations over the
+traditional CPU implementations would also hold for higher-resolution
+image processing.
@@ -38,6 +38,17 @@ \subsection{Basic Understanding}
This paper provides a guideline of how to use GPUs in an efficient way
for vision-based object detection.
+As mentioned earlier, we use CUDA~\cite{NVIDIA_CUDA} for GPU programming.
+A unit of code that is individually launched on the GPU is called
+a \textit{kernel}.
+The kernel is composed of multiple \textit{threads} that execute the
+code in parallel.
+A unit of threads that are co-scheduled by hardware is called a
+\textit{block}, while a collection of blocks for the corresponding
+kernel is called a \textit{grid}.
+The maximum number of threads that can be contained by an individual
+block is defined by the GPU architecture.
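The relation between threads, blocks, and the grid can be made concrete with a little index arithmetic. The sketch below is plain C rather than CUDA, so it only mirrors the computation that a one-dimensional kernel would typically perform with the CUDA built-ins `blockIdx.x`, `blockDim.x`, and `threadIdx.x`; the function names `blocks_needed` and `global_thread_index` are hypothetical and do not come from the paper.

```c
#include <assert.h>

/* Number of blocks a grid needs so that every one of n_items work items
 * gets its own thread (ceiling division, written to avoid overflow). */
unsigned blocks_needed(unsigned n_items, unsigned threads_per_block)
{
    assert(threads_per_block > 0);
    return n_items / threads_per_block
         + (n_items % threads_per_block != 0 ? 1u : 0u);
}

/* Global index of a thread within a one-dimensional grid; mirrors the
 * usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x. */
unsigned global_thread_index(unsigned block_idx, unsigned block_dim,
                             unsigned thread_idx)
{
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

For instance, assigning one thread per pixel of a 640x480 image with 256 threads per block would need `blocks_needed(640 * 480, 256)` blocks. The block size of 256 is only an illustrative choice; as noted above, the admissible maximum depends on the GPU architecture.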
\subsection{Program Analysis}
@@ -76,7 +87,7 @@ \subsection{Program Analysis}
In order to identify computationally time-consuming blocks of the object
detection program, we conducted a preliminary measurement running the
original sequential code \cite{Niknejad12} on a generic Intel Core i7
-2700K CPU.
+3930K CPU (3.2GHz).
The measurement method is straightforward.
We find high-level \textit{for} loops (in case that the program is
written in the C-like language) by scanning the program structure and
@@ -198,8 +209,8 @@ \subsection{GPU Programming}
still providing much better performance than CPU implementations.
An optimization of GPU programming is left open for future work.
-Listing~\ref{lst:detect} illustrates the remainig parts of the program
-that we parallelize using the GPU.
+Listings~\ref{lst:detect}~and~\ref{lst:hog} illustrate the remaining
+parts of the program that we parallelize using the GPU.
We also unroll all the loops of these blocks to accelerate computations
on the GPU.
Due to a space constraint, we skip the details of implementation approaches.
@@ -219,4 +230,16 @@ \subsection{GPU Programming}
+\begin{lstlisting}[caption=The program structure of HOG calculation, label=lst:hog]
+for(int ii=0; ii<interval; ii++) {
+  for(int x=0; x<vis_R[1]; x++) {
+    for(int y=0; y<vis_R[0]; y++) {
+      .....
+      *(Htemp+ ixp_b+iyp)+=vy1*vx1Xv; // need atomicity
+      ......
+    }
+  }
+}
+\end{lstlisting}
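The comment "need atomicity" in the listing refers to the `+=` on the shared histogram `Htemp`: once the loop iterations are spread over parallel threads, two threads may update the same bin at the same time. On the GPU this is normally handled with CUDA's `atomicAdd`. The sketch below is not the authors' kernel; it is a plain C11 stand-in that emulates a floating-point atomic add with a compare-and-swap retry loop, which also hints at why contended updates tend to serialize threads. The names `atomic_bin`, `atomic_add_float`, and `read_bin` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* One histogram bin, stored as raw float bits so that it can be updated
 * with an integer compare-and-swap. */
typedef _Atomic uint32_t atomic_bin;

static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Atomically performs *bin += value. If another thread changes the bin
 * between the load and the swap, the swap fails, old_bits is refreshed
 * with the current contents, and the addition is retried. */
void atomic_add_float(atomic_bin *bin, float value)
{
    assert(bin != NULL);
    uint32_t old_bits = atomic_load(bin);
    uint32_t new_bits;
    do {
        new_bits = float_to_bits(bits_to_float(old_bits) + value);
    } while (!atomic_compare_exchange_weak(bin, &old_bits, new_bits));
}

/* Reads the current value of a bin as a float. */
float read_bin(atomic_bin *bin)
{
    assert(bin != NULL);
    return bits_to_float(atomic_load(bin));
}
```

Under this scheme every conflicting update costs at least one extra loop iteration, so a histogram with a few heavily shared bins gains little from additional threads. That is in line with the observation in the evaluation that the HOG calculation prefers the multicore implementation.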
