# System Software for Big Data Computing

**Cho-Li Wang, The University of Hong Kong**







# **HKU High-Performance Computing Lab.**

Total # of cores: 3004 CPU + 5376 GPU cores

RAM Size: 8.34 TB

Disk storage: 130 TB

Peak computing power: 27.05 TFlops



CS Gideon-II & CC MDRP Clusters



GPU-Cluster (Nvidia M2050, "Tianhe-1a"): 7.62 TFlops

#### 31.45 TFlops (12x in 3.5 years)





# Big Data: The "3Vs" Model

- High Volume (amount of data)
- High Velocity (speed of data in and out)
- High Variety (range of data types and sources)

















2010: 800,000 petabytes (would fill a stack of DVDs reaching from the earth to the moon and back)

By 2020, that pile of DVDs would stretch halfway to Mars.

# **Our Research**

- Heterogeneous Manycore Computing (CPUs + GPUs)
- Big Data Computing on Future Manycore Chips
- Multi-granularity Computation Migration



(1) Heterogeneous Manycore Computing (CPUs + GPUs)

JAPONICA: Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture



# **New GPU & Coprocessors**

| Vendor | Model | Launch Date | Fab. (nm) | # Accelerator Cores (Max.) | GPU Clock (MHz) | TDP (watts) | Memory | Bandwidth (GB/s) | Programming Model | Remarks |
|--------|-------|-------------|-----------|----------------------------|-----------------|-------------|--------|------------------|-------------------|---------|
| Intel | Sandy Bridge | 2011Q1 | 32 | 12 HD Graphics 3000 EUs (8 threads/EU) | 850-1350 | 95 | L3: 8MB; Sys mem (DDR3) | 21 | OpenCL | Bandwidth is system DDR3 memory bandwidth |
| Intel | Ivy Bridge | 2012Q2 | 22 | 16 HD Graphics 4000 EUs (8 threads/EU) | 650-1150 | 77 | L3: 8MB; Sys mem (DDR3) | 25.6 | OpenCL | Bandwidth is system DDR3 memory bandwidth |
| Intel | Xeon Phi | 2012H2 | 22 | 60 x86 cores (with a 512-bit vector unit) | 600-1100 | 300 | 8GB GDDR5 | 320 | OpenMP#, OpenCL*, OpenACC% | Less sensitive to branch-divergent workloads |
| AMD | Brazos 2.0 | 2012Q2 | 40 | 80 Evergreen shader cores | 488-680 | 18 | L2: 1MB; Sys mem (DDR3) | 21 | OpenCL, C++ AMP | |
| AMD | Trinity | 2012Q2 | 32 | 128-384 Northern Islands cores | 723-800 | 17-100 | L2: 4MB; Sys mem (DDR3) | 25 | OpenCL, C++ AMP | APU |
| Nvidia | Fermi | 2010Q1 | 40 | 512 CUDA cores (16 SMs) | 1300 | 238 | L1: 48KB; L2: 768KB; 6GB | 148 | CUDA, OpenCL, OpenACC | |
| Nvidia | Kepler (GK110) | 2012Q4 | 28 | 2880 CUDA cores | 836/876 | 300 | 6GB GDDR5 | 288.5 | CUDA, OpenCL, OpenACC | 3X Perf/Watt, Dynamic Parallelism, HyperQ |

# #1 in Top500 (11/2012): Titan @ Oak Ridge National Lab.

- <u>18,688</u> AMD Opteron 6274 16-core CPUs (32GB DDR3).
- <u>18,688</u> Nvidia Tesla **K20X GPUs**
- Total RAM size: over 710 TB
- Total Storage: 10 PB.
- Peak Performance: 27 Petaflop/s
  - GPU : CPU = 1.311 TF/s : 0.141 TF/s = 9.3 : 1
- Linpack: 17.59 Petaflop/s
- Power Consumption: 8.2 MW
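
The headline figures can be cross-checked from the per-node numbers. This is a back-of-the-envelope sketch, assuming each of the 18,688 nodes pairs one Opteron (0.141 TF/s) with one K20X (1.311 TF/s), and that the 8.2 MW figure applies to the Linpack run:

$$
\begin{aligned}
\text{Peak} &\approx 18{,}688 \times (1.311 + 0.141)\ \text{TF/s} \approx 27.1\ \text{PF/s} \\
\frac{\text{GPU}}{\text{CPU}} &= \frac{1.311}{0.141} \approx 9.3 \\
\text{Linpack efficiency} &\approx \frac{17.59}{27.1} \approx 65\% \\
\text{Energy efficiency} &\approx \frac{17.59\ \text{PF/s}}{8.2\ \text{MW}} \approx 2.1\ \text{GFlop/s per watt}
\end{aligned}
$$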





Titan compute board: 4 AMD Opteron + 4 NVIDIA Tesla K20X GPUs



NVIDIA Tesla K20X (Kepler GK110) GPU: **2688** CUDA cores

## Design Challenge: GPU Can't Handle Dynamic Loops

### **GPU = SIMD/Vector**

Data Dependency Issues (RAW, WAW)



Solutions?

### **Static loops**

```
// Iteration i touches only element i, so all iterations are independent (DO-ALL).
for (int i = 0; i < N; i++)
{
    C[i] = A[i] + B[i];
}
```

### **Dynamic loops**

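```
// w[] and r[] are known only at run time, so iteration i may read or
// overwrite an element written by an earlier iteration (RAW / WAW).
for (int i = 0; i < N; i++)
{
    A[w[i]] = 3 * A[r[i]];
}
```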

Non-deterministic data dependencies inhibit the exploitation of inherent parallelism; only DO-ALL loops or embarrassingly parallel workloads are admitted to GPUs.
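
To make the hazard concrete, here is a minimal sketch of how the cross-iteration dependencies of the dynamic loop could be counted once `w[]` and `r[]` are known at run time. The names `dep_count`, `count_cross_iteration_deps` and `is_do_all` are hypothetical helpers introduced for illustration; they are not from the slides, and the brute-force O(n²) scan is chosen for clarity rather than speed.

```c
#include <stddef.h>

/* Hypothetical helper (not from the slides): counts pairs of iterations
 * (i, j), i < j, of the loop  A[w[i]] = 3 * A[r[i]]  that conflict.
 *   RAW: later iteration j reads the element written by iteration i.
 *   WAW: iterations i and j write the same element.
 *   WAR: later iteration j overwrites the element read by iteration i. */
typedef struct {
    size_t raw;
    size_t waw;
    size_t war;
} dep_count;

dep_count count_cross_iteration_deps(const int *w, const int *r, size_t n)
{
    dep_count d = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (r[j] == w[i]) d.raw++;
            if (w[j] == w[i]) d.waw++;
            if (w[j] == r[i]) d.war++;
        }
    }
    return d;
}

/* The loop is a DO-ALL loop only if no two iterations conflict. */
int is_do_all(const int *w, const int *r, size_t n)
{
    dep_count d = count_cross_iteration_deps(w, r, n);
    return d.raw == 0 && d.waw == 0 && d.war == 0;
}
```

With `w[i] = r[i] = i` the loop behaves like the static case and `is_do_all` returns 1; with `w[i] = i + 1` and `r[i] = i` every iteration reads what its predecessor wrote, so the loop must run in order.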

# Dynamic loops are common in scientific and engineering applications



Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies"

# **GPU-TLS:** Thread-level Speculation on GPU

- **Incremental parallelization**
  - Sliding-window-style execution
- Efficient dependency checking schemes
- **Deferred update**
  - Speculative updates are stored in the write buffer of each thread until commit time.
- 3 phases of execution





- intra-thread RAW
- valid inter-thread RAW in GPU 🗸
- true inter-thread RAW 🗶



GPU: lock-step execution in the same warp (32 threads per warp).
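
The ideas above (sliding window, deferred update, dependency checking) can be sketched as a sequential CPU simulation. This is an illustrative sketch under stated assumptions, not the GPU-TLS implementation: the function name `tls_run`, the window size `WINDOW` (set to one warp), and the phase names (speculative execution, dependency check, commit) are assumptions, since the slides do not name the three phases. It also treats every inter-thread RAW inside a window as a violation and does not model the "valid inter-thread RAW" case that lock-step warp execution can tolerate.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 32  /* assumed window size: one warp of 32 threads */

/* Simulates thread-level speculation for
 *     for (i = 0; i < n; i++) A[w[i]] = 3 * A[r[i]];
 * Each loop iteration plays the role of one GPU thread.
 * Returns the number of windows (rounds) that were needed. */
size_t tls_run(int *A, const int *w, const int *r, size_t n)
{
    assert(A != NULL && w != NULL && r != NULL);

    int write_buf[WINDOW];   /* per-thread write buffer (deferred update) */
    size_t rounds = 0;
    size_t base = 0;         /* first iteration of the sliding window */

    while (base < n) {
        size_t len = (n - base < WINDOW) ? n - base : WINDOW;

        /* Phase 1: speculative execution. Every thread reads committed
         * memory and keeps its result in its own write buffer. */
        for (size_t t = 0; t < len; t++)
            write_buf[t] = 3 * A[r[base + t]];

        /* Phase 2: dependency check. Find the first thread that read an
         * element an earlier thread of this window is about to write. */
        size_t valid = len;
        for (size_t t = 1; t < valid; t++) {
            for (size_t s = 0; s < t; s++) {
                if (r[base + t] == w[base + s]) {
                    valid = t;   /* thread t and later ones are squashed */
                    break;
                }
            }
        }

        /* Phase 3: commit. Buffers of the valid threads are written back
         * in iteration order, which also resolves WAW conflicts. */
        for (size_t t = 0; t < valid; t++)
            A[w[base + t]] = write_buf[t];

        base += valid;           /* squashed threads re-run in the next window */
        rounds++;
    }
    return rounds;
}
```

Thread 0 of a window can never be squashed, so every round commits at least one iteration and the loop always terminates. A fully independent loop finishes in `ceil(n / WINDOW)` rounds, while a dependency chain degenerates to one iteration per round, which is the cost of misspeculation.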

## **JAPONICA: Profile-Guided Work Dispatching**



# **JAPONICA:** System Architecture





(2) Crocodiles: "Cloud Runtime with Object Coherence On Dynamic tILES"



# "General Purpose" Manycore

| Micro-architecture | # of cores | On-Chip Network (Link Bandwidth) | H/W Coherence | L1\$/core | L2\$/core | L3\$ | DDR Controller |
|--------------------|------------|----------------------------------|---------------|-----------|-----------|------|----------------|
| Teraflops Research Chip | 80 (4.0 GHz) | 2D Mesh (256Gb/s) | No | 5KB | 256KB | NA | 3D stacked memory |
| MIT's ATAC (2008) | 1000 (simulation) | 2D (optical) Mesh (32Gb/s) | Yes | NA | NA | NA | NA |
| Single-Chip Cloud (2009) | 48 (1.0 GHz) | 2D Mesh (512Gb/s) | No | 32KB | 256KB + 8KB MPB | Nil | 4 |
| Tilera Tile-GX (2009) | 100 (1.5 GHz) | 2D Mesh (320Gb/s) | Yes | 64KB | 256KB | 26MB (shared) | 4 |
| Godson-T (FPGA, 2011) | 64 (1.0 GHz) | 2D Mesh | Yes | 32KB | 128KB x 16 shared | Nil | 4 |

<u>Tile-based architecture</u>: Cores are connected through a 2D network-on-a-chip

## Crocodiles @ HKU (01/2013-12/2015)

- Crocodiles: "Cloud Runtime with Object Coherence On Dynamic tILES" for future 1000-core tiled processors



### Dynamic Zoning

- Multi-tenant cloud architecture → partitions vary over time, mimicking a "data center on a chip".
- Performance isolation.
- On-demand scaling.
- Power efficiency (high flops/watt).



# Design Challenge: "Off-chip Memory Wall" Problem

DRAM performance (latency) has improved only slowly over the past 40 years.



(a) Gap of DRAM Density & Speed



(b) DRAM Latency Not Improved

Memory density has doubled nearly every two years, while performance has improved slowly (e.g., a memory access still costs 100+ core clock cycles).
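
A rough way to see the impact is the standard average-memory-access-time (AMAT) model. The hit time (2 cycles) and miss rate (5%) below are illustrative assumptions; only the 100-cycle miss penalty comes from the figure quoted above:

$$
\text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}} = 2 + 0.05 \times 100 = 7\ \text{cycles}
$$

Even with a 95% hit rate, 5 of the 7 cycles of an average access are spent waiting on DRAM under these assumptions, which is why data locality matters so much on manycore chips.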

# **Lock Contention in Multicore System**

Physical memory allocation performance sorted by function. As more cores are added, more processing time is spent contending for locks.





# **Challenges and Potential Solutions**

- Cache-aware design
  - Data locality and working-set size are becoming critical!
  - Compiler or runtime techniques to improve data reuse
- Stop multitasking
  - Context switching breaks data locality
  - Time Sharing → Space Sharing

**Macedonian Phalanx manycore operating system**: next-generation operating system for 1000-core processors



# Thanks!

## For more information:



C.L. Wang's webpage:

http://www.cs.hku.hk/~clwang/

### http://i.cs.hku.hk/~clwang/recruit2012.htm

We have a few PhD (or 4-year direct-entry PhD) positions open for self-motivated and academically strong students this year. If you are interested in one of the projects, please contact me at <a href="mailto:clwang@cs.hku.hk">clwang@cs.hku.hk</a>. Interviews will be arranged for qualified students.

1 Crocodiles: Scalable Cloud-on-Chip Runtime Support with Software Coherence for Future 1000-Core Tiled Architectures, HKU 716712E, 9/2012-8/2015, supported by HK RGC.

Scaling parallelism up to 1,000 cores requires a fairly radical rethinking of how system software is designed. As the number of cores grows, providing hardware-level cache coherence becomes increasingly complicated and costly, leading researchers to advocate abandoning it so that future many-core architectures stay inherently scalable. Software then has to take on the role of ensuring data coherence among cores, yet exposing the low-level core-to-core message-passing interfaces to programmers for managing coherence hampers programmability considerably. In this research, we address these issues and propose novel methodologies to build a scalable Cloud-on-Chip (CoC) runtime platform, dubbed *Crocodiles* (*Cloud Runtime with Object Coherence On Dynamic tILES*), for future 1000-core tiled processors. *Crocodiles* involves the development of two important software subsystems: (1) a cache coherence protocol and (2) <u>DVFS</u>-based power management.

2 Ph.D. students (highest priority): strong background in OS kernels, thorough knowledge of the memory subsystem (cache/DRAM, paging) and cache coherence protocols.

1-2 RAs: strong background in software distributed shared memory systems (e.g., TreadMarks, JiaJia, JUMP) and programming experience in multicore power management systems required. Starting date: ASAP.

2 Japonica: Transparent Runtime and Memory Coherence Support for GPU Based Heterogeneous Many-Core Architecture, 11/2011-10/2013, supported by HK RGC.

In this project, we propose a new runtime platform, called **Japonica** (Java with Auto-Parallelization ON GraphIcs Co-processing Architecture), which enables a multithreaded Java program to scale transparently on a GPU-based heterogeneous system. With the transparent runtime support, application developers can utilize both CPU and GPU resources seamlessly with an idiomatic Java programming model. Japonica possesses several unique features: (1) automatic translation from Java bytecode to OpenCL, (2) auto-parallelization of loops with non-deterministic data dependencies, (3) dynamic load scheduling and rebalancing via task migration between CPU and GPU, (4) virtual shared memory between host and device, and (5) speculative coherency protocol for threads running on both CPU and GPU cores. The proposed work explores GPU-friendly ways to support a partial Java heap and STM-based synchronization of shared objects and arrays mirrored in GPU.

1 Ph.D. student: interested in Java Virtual Machines, compilers (e.g., loop parallelization), and software transactional memory. Experience in GPU programming (CUDA or OpenCL) is required.

RAs: We are now building Japonica on a multicore GPU cluster. We need 1 or 2 RAs who are proficient in OpenCL or CUDA programming.

More Information about Distributed Java Virtual Machines: http://www.cs.hku.hk/~clwang/projects/JESSICA2.html

#### 3 Self-Organizing Desktop Cloud (SODC)

We are currently building a P2P Personal Cloud system that relies heavily on Xen virtual machines. The research focus will be on (1) VM I/O performance isolation and (2) live VM migration over WAN. See the WAVNet webpage for more details.

1 PhD student: We are looking for a strong candidate with good knowledge of virtual machine internals (e.g., KVM, Xen), a solid background in the Linux kernel, and good knowledge of NAT/firewall tunneling solutions.

#### 4 Next-generation Operating System for 1000-core Processors

Traditional operating systems are based on the sequential execution model developed in the 1960s. Such operating systems cannot address new many-core parallel hardware architectures without major redevelopment. For instance, how can you harness the power of a next-generation manycore processor with >1,000 cores? We will investigate various perspectives on future OS design toward this goal.

# **Multi-granularity Computation Migration**



# WAVNet: Live VM Migration over WAN

- A P2P Cloud with Live VM Migration over WAN
  - "Virtualized LAN" over the Internet
- High penetration via NAT hole punching
  - Establish direct host-to-host connection
  - Free from proxies, able to traverse most NATs







Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, "WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud," 2011 International Conference on Parallel Processing (ICPP 2011).

# **WAVNet: Experiments at Pacific Rim Areas**



### **JESSICA2: Distributed Java Virtual Machine**

**J**ava-**E**nabled **S**ingle-**S**ystem-**I**mage **C**omputing **A**rchitecture



# **History and Roadmap of JESSICA Project**

- **JESSICA V1.0** (1996-1999)
  - Execution mode: **Interpreter Mode**
  - JVM kernel modification (Kaffe JVM)
  - Global heap: built on top of TreadMarks (Lazy Release Consistency + homeless)
- **JESSICA V2.0 (2000-2006)** 
  - Execution mode: **JIT-Compiler Mode**
  - JVM kernel modification
  - Lazy release consistency + migrating-home protocol
- **JESSICA V3.0 (2008~2010)** 
  - **Built above JVM (via JVMTI)**
  - Support Large Object Space
- **JESSICA V4.0 (2010~)**
  - Japonica: Automatic loop parallelization and speculative execution on GPU and multicore CPU
  - **TrC-DC**: a software transactional memory system on cluster with distributed clocks (not discussed)



**Past Members** 







Kinson Chan



Ricky Ma

# **Stack-on-Demand (SOD)**



### **Elastic Execution Model via SOD**



- (a) "Remote Method Call"
- (b) Mimic thread migration
- (c) "Task Roaming": like a mobile agent roaming over the network or workflow

With such flexible or *composable* execution paths, SOD enables agile and elastic exploitation of distributed resources (storage), a Big Data Solution!

Lightweight, Portable, Adaptable



Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "eXCloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC 2011). (Best Paper Award)