



# OPTIMIZING THREADED CODE PERFORMANCE AND SCALABILITY

Intel Software Developer Conference – London, 2017

# **AGENDA**

- Which tools should I use for threading and scalability?
- Intel® Performance Snapshot
- Intel® VTune™ Amplifier
- Some examples and solutions
- What's new in 2018?



# WHICH TOOL SHOULD I USE FOR THREADING AND SCALABILITY?

- Intel® Performance Snapshot
  - Provides high-level, easy-to-understand metrics
  - Highlights the main bottlenecks
  - Can be easily integrated into the build chain to provide feedback to developers
- Intel® VTune™ Amplifier
  - Goes deeper and gives detailed information down to source lines
  - Dedicated analyses target specific aspects (threading, memory, etc.)








# BEFORE DIVING INTO A PARTICULAR TOOL...

- How do I assess whether my application has potential for performance tuning?
- Which tool should I use first?
- What should I use at large scale to avoid being overwhelmed by huge trace sizes, post-processing time, and collection overhead?
  - On a KNL cluster, customers can end up with more than 1000 ranks on just 8 nodes
- How can I quickly evaluate environment settings or incremental code changes?
- Answer: try Application Performance Snapshot 2018

# APPLICATION PERFORMANCE SNAPSHOT (APS)

- High-level overview of application performance
- Identify primary optimization areas and next steps in analysis
- Easy to use
- Detailed reports available via command line
- Scales to large jobs
- Multiple ways to obtain it
  - Part of Intel® Parallel Studio XE 2018
  - Separate free download from the Performance Snapshot page

# **APS HTML REPORT**

## **Application Performance Snapshot**

- Application: heart\_demo\_avx\_2
- Number of ranks: 22
- Used statistics: /home/vtune/dprohoro/apps/Cardiac/Cardiac/build/stat\_20170605
- Creation date: 2017-06-05 21:33:32

| Metric       | Value                     |
|--------------|---------------------------|
| Elapsed Time | 20.22s                    |
| SP FLOPS     | 60.81                     |
| CPI          | 1.12 (MAX 1.13, MIN 1.12) |

Your application is MPI bound. This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel® Trace Analyzer and Collector to explore performance bottlenecks.

| Metric           | Current run | Target |
|------------------|-------------|--------|
| MPI Time         | 62.60%      | <15%   |
| OpenMP Imbalance | 4.03%       | <10%   |
| Memory Stalls    | 23.33%      | <20%   |
| FPU Utilization  | 0.90%       | >50%   |
| I/O Bound        | 0.00%       | <10%   |

#### **MPI Time**

62.60% of Elapsed Time (12.66s)

MPI Imbalance: 53.16% of Elapsed Time (10.75s)

| Top 5 MPI Functions | %     |
|---------------------|-------|
| Waitall             | 55.30 |
| Barrier             | 5.80  |
| Isend               | 0.28  |
| Irecv               | 0.15  |
| Scatterv            | 0.01  |

#### OpenMP Imbalance

4.03% of Elapsed Time (0.81s)

#### Memory Stalls

23.33% of pipeline slots

- Cache Stalls: 24.45% of cycles
- DRAM Stalls: 0.01% of cycles
- NUMA: 16.03% of remote accesses

#### FPU Utilization

- SP FLOPs per Cycle: 0.28 out of 32.00
- Vector Capacity Usage: 26.21%
- FP Instruction Mix
  - % of Packed FP Instr.: 3.06%
    - % of 128-bit: 2.11%
    - % of 256-bit: 0.96%
  - % of Scalar FP Instr.: 96.94%
- FP Arith/Mem Rd Instr. Ratio: 0.44

#### **Memory Footprint**

- Per node: AVG 11055.40 MB, PEAK 11055.40 MB
- Per rank: AVG 502.52 MB, PEAK 610.43 MB
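The "% of Elapsed Time" figures in this report can be re-derived from the absolute times, which is a quick way to check a run against the targets. The following is a minimal C sketch of that arithmetic; the helper names `percent_of_elapsed` and `exceeds_target` are hypothetical and are not part of APS.

```c
#include <assert.h>
#include <stdbool.h>

/* Percentage of the elapsed time spent in some activity. */
double percent_of_elapsed(double seconds, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return 100.0 * seconds / elapsed_seconds;
}

/* True when a "lower is better" metric is above its target. */
bool exceeds_target(double value_percent, double target_percent) {
    return value_percent > target_percent;
}
```

With the numbers from this run, `percent_of_elapsed(12.66, 20.22)` gives about 62.6%, far above the 15% target for MPI time, while the OpenMP imbalance (0.81s) stays below its 10% target.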



# **APS USAGE**

## Setup Environment

source <APS\_Install\_dir>/apsvars.sh

## **Run Application**

mpirun <mpi options> aps.sh <application and args>

## Generate Report on Results

aps.sh –report <result folder>

## Generate advanced CL reports on Results

aps-report.sh –<option> <result folder>






#### Performance Profiler

#### Where is my application...

#### **Spending Time?**



- Focus tuning on functions taking time
- See call stacks
- See time on source

#### **Wasting Time?**

| Line | Source             | MEM_LOAD<br>LLC_MISS |
|------|--------------------|----------------------|
| 475  | float rx, ry, rz = |                      |
| 476  | float param1 = (AA | 30,000               |
| 477  | float param2 = (AA |                      |
| 478  | bool neg = (rz < 0 |                      |

- See cache misses on your source
- See functions sorted by # of cache misses

#### **Waiting Too Long?**



- See locks by wait time
- Red/Green for CPU utilization during wait

- Windows & Linux
- Low overhead
- No special recompiles

Advanced Profiling For Scalable Multicore Performance

## Tune Applications for Scalable Multicore Performance

#### Fast, Accurate Performance Profiles

- Hotspot (Statistical call tree)
- Call counts (Statistical)
- Hardware-Event Sampling

#### Thread Profiling

- Visualize thread interactions on timeline
- Balance workloads

#### Easy set-up

- Pre-defined performance profiles
- Use a normal production build

#### Find Answers Fast

- Filter extraneous data
- View results on the source / assembly

#### Compatible

- Microsoft, GCC, Intel compilers
- C/C++, Fortran, Assembly, .NET, Java
- Latest Intel® processors and compatible processors¹

#### Windows or Linux

- Visual Studio Integration (Windows)
- Standalone user interface and command line
- 32 and 64-bit





¹ IA-32 and Intel® 64 architectures. Many features work with compatible processors. Event-based sampling requires a genuine Intel® processor.

Analysis Types (based on technology)

| Software Collector<br>Any x86 processor, any virtual machine, no driver | Hardware Collector<br>Higher resolution, lower overhead, system wide |
|--------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| **Basic Hotspots**<br>Which functions use the most time? | **Advanced Hotspots**<br>Which functions use the most time? Where to inline? (Statistical call counts) |
| **Concurrency**<br>Tune parallelism. Colors show number of cores used. | **General Exploration**<br>Where is the biggest opportunity? Cache misses? Branch mispredictions? |
| **Locks and Waits**<br>Tune the #1 cause of slow threaded performance: waiting with idle cores. | **Advanced Analysis**<br>Dig deep to tune bandwidth, cache misses, access contention, etc. |



Software or hardware collector?



Get a quick snapshot

#### Thread Concurrency Histogram

This histogram represents a breakdown of the Elapsed Time. It visualizes the percentage of the wall time during which a specific number of threads were running simultaneously. Threads are considered running if they are either actually running on a CPU or are in the runnable state in the OS scheduler. Essentially, Thread Concurrency is a measurement of the number of threads that were not waiting. Thread Concurrency may be higher than CPU usage if threads are in the runnable state and not consuming CPU time.





#### Look for Common Patterns

- Coarse Grain Locks
- High Lock Contention
- Low Concurrency
- Load Imbalance



## Identify hotspots



#### Find Answers Fast



Timeline Visualizes Thread Behavior



- Optional: Use API to mark frames and user tasks
- Optional: Add a mark during collection









# THREE KEYS TO HPC PERFORMANCE

Threading, Memory Access, Vectorization – Intel VTune™ Amplifier

- Threading: CPU Utilization
  - Serial vs. Parallel time
  - Top OpenMP regions by potential gain
  - Tip: Use hotspot OpenMP region analysis for more detail
- Memory Access Efficiency
  - Stalls by memory hierarchy
  - Bandwidth utilization
  - Tip: Use Memory Access analysis
- Vectorization: FPU Utilization
  - FLOPS† estimates from sampling
  - Tip: Use Intel Advisor for precise metrics and vectorization optimization



† For 3rd, 5th, 6th Generation Intel® Core™ processors and second generation Intel® Xeon Phi™ processor code named Knights Landing.





# **OPTIMIZE MEMORY ACCESS**

Memory Access Analysis - Intel® VTune™ Amplifier 2017

- Tune data structures for performance
  - Attribute cache misses to data structures (not just the code causing the miss)
  - Support for custom memory allocators
- Optimize NUMA latency & scalability
  - True & false sharing optimization
  - Auto detect max system bandwidth
  - Easier tuning of inter-socket bandwidth
- Easier install, Latest processors
  - No special drivers required on Linux\*
  - Intel® Xeon Phi™ processor MCDRAM (high bandwidth memory) analysis



| Bandwidth Domain / Bandwidth Utiliz | CPU Time ▼ | L2 Miss Count |
|-------------------------------------|------------|---------------|
| ▼ DRAM, GB/sec                      | 840.803s   | 6,000,180     |
| ▼ High                              | 508.635s   | 4,000,120     |
| ▶ stream.c:100 (381 MB)             |            | 2,000,060     |
| ▶ stream.c:98 (381 MB)              |            | 2,000,060     |
| ▶ Medium                            | 241.638s   | 0             |
| ▶ Low                               | 90.529s    | 2,000,060     |
| ▶ MCDRAM Flat, GB/sec               | 840.803s   | 6,000,180     |



# User API

#### Enables you to

- Control collection
- Set marks during the execution of specific code
- Specify custom synchronization primitives implemented without standard system APIs

#### To use the user APIs, do the following:

- Include ittnotify.h, located at <install\_dir>/include
- Insert \_\_itt\_\* notifications in your code
- Link to the libittnotify.lib file located at <install\_dir>/lib (libittnotify.a on Linux)

# User API

## Collection control and thread naming

#### Collection Control APIs

| API                         | Description |
|-----------------------------|-------------|
| void \_\_itt\_pause (void)  | Run the application without collecting data. VTune™ Amplifier XE reduces the overhead of collection by collecting only critical information, such as thread and process creation. |
| void \_\_itt\_resume (void) | Resume data collection. VTune™ Amplifier XE resumes collecting all data. |

#### Thread naming APIs

| API                                                          | Description |
|--------------------------------------------------------------|-------------|
| void \_\_itt\_thread\_set\_name (const \_\_itt\_char \*name) | Set the thread name using a char or Unicode string, where name is the thread name. |
| void \_\_itt\_thread\_ignore (void)                          | Indicate that this thread should be excluded from analysis. It will not affect the concurrency of the application and will not be visible in the Timeline pane. |



# User API

## Collection Control Example

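```
#include <ittnotify.h>  /* declares __itt_pause() and __itt_resume() */

int main(int argc, char* argv[])
{
    /* Initialization is not of interest; this assumes the collection
       was started in a paused state. */
    doSomeInitializationWork();

    __itt_resume();   /* region of interest begins: collect data */
    while(gRunning) {
        doSomeDataParallelWork();
    }
    __itt_pause();    /* region of interest ends: stop collecting */

    doSomeFinalizationWork();
    return 0;
}
```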
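One practical concern with the collection control calls is that the code should still build on machines where ittnotify.h is not installed. The following is a minimal C sketch of one way to handle that, under stated assumptions: the `PROFILE_*` macros and `sum_squares_profiled` are hypothetical names introduced only for this illustration, and the fallback simply compiles the calls away.

```c
#include <assert.h>

/* Use the real ITT calls when ittnotify.h can be found; otherwise define
   no-op fallbacks so the code still builds without VTune Amplifier. */
#if defined(__has_include)
#  if __has_include(<ittnotify.h>)
#    include <ittnotify.h>
#    define PROFILE_RESUME()       __itt_resume()
#    define PROFILE_PAUSE()        __itt_pause()
#    define PROFILE_THREAD_NAME(s) __itt_thread_set_name(s)
#  endif
#endif
#ifndef PROFILE_RESUME
#  define PROFILE_RESUME()       ((void)0)
#  define PROFILE_PAUSE()        ((void)0)
#  define PROFILE_THREAD_NAME(s) ((void)(s))
#endif

/* Hypothetical region of interest: sum of squares of 0..n-1. */
long long sum_squares_profiled(int n) {
    assert(n >= 0);
    long long total = 0;
    PROFILE_THREAD_NAME("main worker");  /* label this thread in the timeline */
    PROFILE_RESUME();                    /* collect only around the loop */
    for (int i = 0; i < n; i++) {
        total += (long long)i * i;
    }
    PROFILE_PAUSE();
    return total;
}
```

When the header is found, the program must also be linked against the ittnotify library as described above; when it is not, the function behaves like plain C code.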




# Fibonacci and scheduling

- Very naïve implementation (the goal is only to show a common pattern)
  - We want to fill an array with numbers from the Fibonacci sequence

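```
// Naive recursive Fibonacci: the cost grows exponentially with i.
// Note: a 32-bit int overflows beyond fib(46); use a wider type for larger i.
int fib(int i){
    if(i==0) return 0;
    if(i==1) return 1;
    return fib(i-1) + fib(i-2);
}

#pragma omp parallel for
for(int i=0; i<SIZE; i++){
  fib_array[i] = fib(i);
}
```

By default, most OpenMP implementations use static scheduling for this loop: each thread is assigned the same number of iterations.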



## **CPU Usage Histogram**

Very poor threading (histogram label: 1 thread).

fib(0) is much faster to compute than fib(50)!

Static scheduling creates a very high load imbalance.
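The size of this imbalance can be estimated without running the parallel loop. The following is a minimal C sketch, assuming that the cost of fib(i) is proportional to the number of recursive calls it makes and that static scheduling gives each thread one contiguous block of iterations; `fib_calls` and `last_thread_share` are hypothetical helper names.

```c
#include <assert.h>

/* Number of calls made by the naive recursive fib(i), used as a cost model:
   calls(0) = calls(1) = 1, calls(i) = 1 + calls(i-1) + calls(i-2). */
static double fib_calls(int i) {
    double prev = 1.0, cur = 1.0;   /* calls(0), calls(1) */
    for (int k = 2; k <= i; k++) {
        double next = 1.0 + cur + prev;
        prev = cur;
        cur = next;
    }
    return cur;
}

/* Fraction of the total work that lands on the last thread when `size`
   iterations are split into `nthreads` equal contiguous blocks.
   Assumes size is a multiple of nthreads to keep the sketch short. */
double last_thread_share(int size, int nthreads) {
    assert(nthreads > 0 && size > 0 && size % nthreads == 0);
    int first_of_last = size - size / nthreads;
    double total = 0.0, last = 0.0;
    for (int i = 0; i < size; i++) {
        double c = fib_calls(i);
        total += c;
        if (i >= first_of_last) last += c;
    }
    return last / total;
}
```

Under this model, with 40 iterations and 4 threads, the thread that owns the last 10 iterations performs more than 99% of all calls, so the other threads are idle for almost the entire run.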



- Very naïve implementation (the goal is only to show a common pattern)
  - We want to fill an array with numbers from the Fibonacci sequence

```
// Naive recursive Fibonacci: the cost grows exponentially with i.
int fib(int i){
    if(i==0) return 0;
    if(i==1) return 1;
    return fib(i-1) + fib(i-2);
}

// Guided scheduling: threads take chunks of decreasing size as they become free.
#pragma omp parallel for schedule(guided)
for(int i=0; i<SIZE; i++){
  fib_array[i] = fib(i);
}
```



Just changing the scheduling provides a significant speedup, around 2x for fib(50).

Thread Concurrency histogram labels: 1 thread, 2 threads, 3 threads.
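For readers who want to try the guided variant, here is a minimal, compilable C sketch of the same loop. It is an illustration rather than the code used for the measurements: the function name `fill_fib` is hypothetical, and the element type is changed to `long long` so that larger indices do not overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Naive recursive Fibonacci, as on the slide, with a wider return type. */
static long long fib(int i) {
    if (i == 0) return 0;
    if (i == 1) return 1;
    return fib(i - 1) + fib(i - 2);
}

/* Fill out[0..n-1] with Fibonacci numbers. With schedule(guided), threads
   take chunks of decreasing size as they become free, so the expensive
   high-index iterations are handed out in small pieces. */
void fill_fib(long long *out, int n) {
    assert(out != NULL && n >= 0);
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; i++) {
        out[i] = fib(i);
    }
}
```

The code needs OpenMP enabled at compile time (for example `-fopenmp` with GCC); without it the pragma is ignored and the loop runs serially with the same results.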



# Linear regression and false sharing identification

- Two or more threads read and write the same cache line
  - At least one thread is writing data
  - Other threads want to read different data located in the same cache line
- Linear regression sample (available in the VTune package)



Running the memory analysis shows a bottleneck on the L1 cache system.



#### 1- Look for memory object responsible for latency

| Memory Object                           | Total Latency | Loads          | Stores        | LLC Miss Count |
|-----------------------------------------|---------------|----------------|---------------|----------------|
| linear_regression_pthread.c:136 (512 B) | 64.4%         | 14,058,042,174 | 4,998,074,970 | 9              |
| [Unknown]                               | 28.7%         | 19,104,057,312 | 202,803,042   | 19             |
| linear_regression_pthread.c:118 (54 MB) | 6.9%          | 10,536,031,608 | 0             |                |
| [Stack]                                 | 0.0%          | 0              | 2,400,036     |                |

#### 2- Identify allocation site, object size and average latency

Grouping: Memory Object / Function / Allocation Stack

| Memory Object / Function / Allocation Stack | Loads ▼        | Stores        | Average Latency (cycles) |
|---------------------------------------------|----------------|---------------|--------------------------|
| ▶ [Unknown]                                 | 19,104,057,312 | 202,803,042   | 8                        |
| ▶ linear_regression_pthread.c:136 (512 B)   | 14,058,042,174 | 4,998,074,970 | 37                       |
| ▶ linear_regression_pthread.c:118 (54 MB)   | 10,536,031,608 | 0             | 8                        |
| ▶ [Stack]                                   | 0              | 2,400,036     | 0                        |

#### 3- Look into the code



This structure seems to be responsible. It is 64 bytes, the same size as a cache line, but depending on alignment, two lreg\_args objects can share the same cache line.

```
typedef struct
{
    pthread_t tid;
    POINT_T *points;
    int num_elems;
    long long SX;
    long long SY;
    long long SXX;
    long long SYY;
    long long SXY;
} lreg_args;
```
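Why alignment matters here can be checked with simple address arithmetic. The following is a minimal C sketch, assuming a 64-byte cache line; `elements_share_line` is a hypothetical helper that works on byte offsets rather than real pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Index of the cache line that contains byte address `addr`. */
static size_t line_of(size_t addr) { return addr / CACHE_LINE; }

/* True if elements i and j of an array of `elem_size`-byte objects that
   starts at byte address `base` touch at least one common cache line. */
bool elements_share_line(size_t base, size_t elem_size, size_t i, size_t j) {
    assert(elem_size > 0);
    size_t i_first = line_of(base + i * elem_size);
    size_t i_last  = line_of(base + (i + 1) * elem_size - 1);
    size_t j_first = line_of(base + j * elem_size);
    size_t j_last  = line_of(base + (j + 1) * elem_size - 1);
    return i_first <= j_last && j_first <= i_last;  /* line ranges overlap */
}
```

For 64-byte elements starting at a 64-byte boundary, `elements_share_line(0, 64, 0, 1)` is false. If the array starts 16 bytes past a boundary, `elements_share_line(16, 64, 0, 1)` is true: each element straddles two lines and neighbours overlap on one of them, which is the false sharing pattern seen in the analysis.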



To solve the false sharing, we can add a padding array to the structure so that the data of two lreg\_args objects never share the same cache line.

```
typedef struct
{
    char pad[80];   /* padding: keeps the fields of two objects on different cache lines */
    pthread_t tid;
    POINT_T *points;
    int num_elems;
    long long SX;
    long long SY;
    long long SXX;
    long long SYY;
    long long SXY;
} lreg_args;
```

Bonus, not explained in the sample!

In this test, aligning the data to a 64-byte boundary can also solve the problem!
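The alignment approach can be sketched in C11 as follows. This is an illustration under assumptions, not the sample's code: the type name `lreg_args_aligned` and the helper `cache_line_index` are hypothetical, `POINT_T` is a placeholder whose real definition may differ, and a 64-byte cache line is assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Placeholder point type; the sample's real POINT_T may differ. */
typedef struct { int x; int y; } POINT_T;

/* Aligning the first member to 64 bytes raises the alignment of the whole
   struct, and the compiler rounds sizeof up to a multiple of 64. */
typedef struct
{
    alignas(CACHE_LINE) pthread_t tid;
    POINT_T *points;
    int num_elems;
    long long SX;
    long long SY;
    long long SXX;
    long long SYY;
    long long SXY;
} lreg_args_aligned;

/* Cache line index of an object, for checking the layout at run time. */
uintptr_t cache_line_index(const void *p) {
    return (uintptr_t)p / CACHE_LINE;
}
```

With this layout, consecutive elements of an `lreg_args_aligned` array with static or automatic storage start on different cache lines. For heap-allocated arrays the allocation itself must also be 64-byte aligned, for example with C11 `aligned_alloc`, because plain `malloc` does not guarantee that alignment.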






# APPLICATION PERFORMANCE SNAPSHOT ADDS MPI

All the data in one place: MPI + OpenMP + Memory + Floating Point

# Quick & easy performance overview

Does the app need performance tuning?

# MPI and non-MPI Apps<sup>†</sup>

- Distributed MPI with or without threading
- Shared memory applications

# Popular MPI implementations supported

- Intel® MPI
- MPICH and Cray MPI

# Richer metrics on computation efficiency

- CPU (processor stalls, memory access)
- FPU (vectorization metrics)



† MPI supported only on Linux\*.

# MORE COMPLETE HPC PERFORMANCE OVERVIEW

## MPI metrics added to HPC analysis

#### **MPI** Imbalance Metric

- Metric for performance of rank on critical path
- Computational bottlenecks and outlier rank behavior now available in VTune Amplifier
- For communication pattern problems between ranks use Intel® Trace Analyzer and Collector (ITAC)

#### Threading: CPU Utilization

- Serial vs. Parallel time
- Top OpenMP regions by potential gain
- Tip: Use hotspot OpenMP region analysis for more detail

#### **Memory Access Efficiency**

- Stalls by memory hierarchy
- Bandwidth utilization
- Tip: Use Memory Access analysis

#### Vectorization: FPU Utilization

- FLOPS † estimates from sampling
- Tip: Use Intel Advisor for precise metrics and vectorization optimization



# WHAT'S USING ALL THE MEMORY?

Memory Consumption Analysis

## See What Is Allocating Memory

- Lists top memory consuming functions and objects
- View source to understand cause
- Filter by time using the memory consumption timeline
- Standard & Custom Allocators
  - Recognizes libc malloc/free, memkind and jemalloc libraries
  - Use custom allocators after markup with ITT Notify API

#### Languages

- Python\*
- Linux\*: Native C, C++, Fortran



Native language support is not currently available for Windows\*

# **OPTIMIZE PRIVATE CLOUD-BASED APPLICATIONS**

Profile native & Java apps in containers

# **Profile Enterprise Applications**

- Native C, C++, Fortran
- Attach to running Java services (e.g., Mail)
- Profile Java daemons without restart
- Accurate low-overhead data collection
  - Advanced hotspots and hardware events
  - Memory analysis
  - Accurate stack information for Java and HHVM

# Popular containers supported

- Docker\*
- Mesos\*
- LXC\*



- No container configuration required
- Detection of the container is automatic



Software collectors (e.g. Locks & Waits) and Python profiling are not currently available for containers.

# Legal Disclaimer & Optimization Notice

- INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
- Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
- Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

