# Memory Bandwidth and Latency Analysis: AMD Rome, Intel Cascade Lake and Ice Lake Servers

## Vamsi Sripathi\*

## September 2020

## Contents

- 1 Introduction
  - 1.1 Scope of Study
- 2 Experiments Setup
  - 2.1 Hardware Platforms
  - 2.2 STREAM Benchmark
    - 2.2.1 Target ISA
    - 2.2.2 Cache Write Policy: vmovpd vs vmovntpd
  - 2.3 Intel Memory Latency Checker
  - 2.4 Runtime Settings
- 3 DRAM
  - 3.1 1-Core
    - 3.1.1 Peak Bandwidth with RFO
    - 3.1.2 Peak Bandwidth with NT
    - 3.1.3 RFO vs. NT Stores
  - 3.2 1-Socket
    - 3.2.1 Peak Bandwidth and Scaling with RFO
    - 3.2.2 Peak Bandwidth and Scaling with NT
    - 3.2.3 RFO vs. NT Stores
    - 3.2.4 Compact vs Distributed Pinning
  - 3.3 2-Socket
    - 3.3.1 Peak Bandwidth with RFO
    - 3.3.2 Peak Bandwidth with NT
  - 3.4 NUMA
    - 3.4.1 1-Core, 1-Socket
  - 3.5 Sub-NUMA Clustering
    - 3.5.1 ICX: SNC-1, 2, 4
    - 3.5.2 CLX: SNC-1, 2
- 4 CPU Caches
  - 4.1 Latency: Idle, Loaded
    - 4.1.1 Level-2 Cache
    - 4.1.2 Level-3 Cache
    - 4.1.3 Cache-to-Cache Transfer
  - 4.2 Bandwidth
- Appendices
  - A Rome, CLX, ICX: NUMA with NT Stores
  - B AMD Rome: 7402, 7502, 7742
  - C ICX: SNC with NT Stores

\*Intel Corp IAGS/DXPS/DEE/TCE/AWE

# List of Figures

| Figure | Title |
|--------|-------|
| 1 | Memory sub-system study |
| 2 | STREAM 1-Core DRAM bandwidth |
| 3 | STREAM 1-Socket DRAM bandwidth scaling with RFO and Compact pinning |
| 4 | STREAM 1-Socket DRAM bandwidth scaling with NT Stores and Compact pinning |
| 5 | STREAM 1-Socket DRAM bandwidth scaling, speed-up of NT over RFO Stores |
| 6 | STREAM 1-Socket DRAM bandwidth: Compact vs Distributed Pinning |
| 7 | STREAM 2-socket DRAM bandwidth |
| 8 | STREAM 1-Core, 1-Socket NUMA bandwidth with RFO |
| 9 | STREAM ICX 1-Core DRAM bandwidth: SNC modes |
| 10 | STREAM ICX 2-Socket DRAM bandwidth: SNC modes |
| 11 | STREAM 1-Socket DRAM bandwidth scaling with RFO and Compact pinning: SNC modes |
| 12 | STREAM 1-Socket DRAM bandwidth scaling with RFO with specific affinity: SNC modes |
| 13 | STREAM 1-Core, 1-Socket NUMA bandwidth with RFO: SNC modes |
| 14 | STREAM CLX 1-Core DRAM bandwidth: SNC modes |
| 15 | STREAM CLX 2-Socket DRAM bandwidth: SNC modes |
| 16 | STREAM 1-Socket DRAM bandwidth scaling with RFO and Compact pinning: SNC modes |
| 17 | STREAM 1-Socket DRAM bandwidth scaling with RFO with specific affinity: SNC modes |
| 18 | STREAM 1-Core, 1-Socket NUMA bandwidth with RFO: SNC modes |
| 19 | Level-2 Cache: Idle and Loaded Latencies |
| 20 | Level-2 Cache: Idle and Loaded Latencies |
| 21 | Cache-to-Cache Transfer Latency |
| 22 | Peak Injection Bandwidth |
| 23 | STREAM 1-Core, 1-Socket NUMA bandwidth with NT |
| 24 | STREAM 1-Core DRAM bandwidth: AMD Rome SKUs |
| 25 | STREAM 2-Socket DRAM bandwidth: AMD Rome SKUs |
| 26 | STREAM 1-Socket DRAM bandwidth scaling with NT and Compact pinning: SNC modes |
| 27 | STREAM 1-Socket DRAM bandwidth scaling, speed-up of NT over RFO Stores |
| 28 | STREAM 1-Core, 1-Socket NUMA bandwidth with NT: SNC modes |
| 29 | STREAM 1-Socket DRAM bandwidth scaling with NT and Compact pinning: SNC modes |
| 30 | STREAM 1-Socket DRAM bandwidth scaling, speed-up of NT over RFO Stores |
| 31 | STREAM 1-Core, 1-Socket NUMA bandwidth with NT: SNC modes |

# List of Tables

| Table | Title |
|-------|-------|
| 1 | Hardware Platforms |
| 2 | STREAM Kernels |
| 3 | Runtime Settings |
| 4 | 1-Core peak bandwidth: RFO |
| 5 | 1-Core peak bandwidth: NT |
| 6 | 1-Socket peak bandwidth: RFO |
| 7 | 1-Socket peak bandwidth: NT |
| 8 | 2-socket peak bandwidth: RFO |
| 9 | 2-socket peak bandwidth: NT |
| 10 | ICX 1-Core peak bandwidth: RFO with SNC modes |
| 11 | ICX 1-Core peak bandwidth: NT with SNC modes |
| 12 | ICX 2-Socket peak bandwidth: RFO with SNC modes |
| 13 | ICX 2-Socket peak bandwidth: NT with SNC modes |
| 14 | CLX 1-Core peak bandwidth: RFO with SNC modes |
| 15 | CLX 1-Core peak bandwidth: NT with SNC modes |
| 16 | CLX 2-Socket peak bandwidth: RFO with SNC modes |
| 17 | CLX 2-Socket peak bandwidth: NT with SNC modes |
| 18 | Cache Sub-system |



Figure 1: Memory sub-system study

## 1 Introduction

As Moore's law continues to push compute capabilities, it is imperative to study the bandwidth and latency characteristics of the memory sub-system in order to design a well-balanced CPU architecture. Figure-1 shows one approach to evaluating the memory hierarchy, where each axis represents a control variable that impacts the observed performance:

- 1. Memory Hierarchy: Modern CPUs consist of different layers of memory pools. These include multiple levels of on-chip CPU data caches, on-package multi-channel DRAM (MCDRAM), and off-chip DDR. In addition, many CPUs support non-uniform memory access (NUMA), which acts as another memory tier.
- 2. Application Threads: CPU vendors have taken different architectural routes (e.g., chiplets, large meshes) to scale the number of cores. As the number of cores in modern CPUs increases, implementing cache coherency protocols becomes more expensive. On these systems, varying the number of application threads and how these threads are pinned to the underlying CPU cores exposes the strengths and weaknesses of a CPU architecture.
- 3. Traffic Pattern: The type of memory traffic generated plays a defining role in determining performance. After all, the memory bandwidth requirements of a dense matrix-matrix multiplication are vastly different from those of a stencil kernel. In addition, the cache write policy also impacts performance, depending on whether a non-temporal store instruction is used to write the data back to main memory.

### 1.1 Scope of Study

This report aims to characterize the memory bandwidth and latency for the following 3 CPU architectures:

- 1. AMD Rome ($2^{nd}$ generation EPYC processors)
- 2. Intel Cascade Lake ($2^{nd}$ generation Xeon Scalable processors)
- 3. Intel Ice Lake ($3^{rd}$ generation Xeon Scalable processors)

Since memory bandwidth and latency are a broad topic, we limit the scope of this study to the following aspects:

- 1. More emphasis on DDR memory bandwidth, followed by cache bandwidth and latency.
- 2. Synthetic kernels with contiguous access patterns.

In the following sections, we describe the experimental setup and benchmark methodology. We then present the DRAM bandwidth results for different traffic patterns, thread configurations, types of store instruction and NUMA domains. Finally, we present the cache bandwidth and latency results.

## 2 Experiments Setup

### 2.1 Hardware Platforms

The details of the CPU configurations benchmarked in this study are listed in Table-1. It is important to note that the CPUs differ in the number of memory channels, memory speed, number of cores, etc. In the rest of the document, we refer to the CPU platforms by their code names.

|                                 | AMD EPYC 7742    | Intel Xeon 8268    | Intel ICX XCC    |
|---------------------------------|------------------|--------------------|------------------|
| Code name                       | Rome             | Cascade Lake (CLX) | Ice Lake (ICX)   |
| Number of sockets               | 2                | 2                  | 2                |
| Number of cores per socket      | 64               | 24                 | 32               |
| Total number of cores           | 128              | 48                 | 64               |
| LLC per socket                  | 256 MB           | 35.7 MB            | 48 MB            |
| Total memory                    | 264 GB           | 197 GB             | 264 GB           |
| Memory type                     | DDR4 @ 3200 MT/s | DDR4 @ 2933 MT/s   | DDR4 @ 3200 MT/s |
| Memory channels per socket      | 8                | 6                  | 8                |
| Theoretical Peak B/W per socket | 204.8 GB/s       | 140.8 GB/s         | 204.8 GB/s       |
| Theoretical Peak B/W            | 409.6 GB/s       | 281.6 GB/s         | 409.6 GB/s       |

Table 1: Hardware Platforms

The theoretical peak memory bandwidth is calculated as:

$Peak\ B/W\ (GB/s) = x\ GT/s \times 8\ bytes\_per\_channel \times n\ channels\_per\_socket \times m\ sockets$
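As a quick check of this formula against Table 1, the sketch below evaluates it for the three platforms. The function name `peak_bandwidth_gbs` and its parameters are illustrative and not part of the benchmark source; 1 GB/s is taken as $10^9$ bytes per second.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels_per_socket, sockets=1,
                       bytes_per_transfer=8):
    """Theoretical DRAM peak in GB/s (1 GB = 1e9 bytes).

    Each DDR4 channel is assumed to move 8 bytes (64 bits) per transfer.
    """
    giga_transfers_per_s = mega_transfers_per_s / 1000.0
    return giga_transfers_per_s * bytes_per_transfer * channels_per_socket * sockets


# Memory speed (MT/s) and channels per socket, as listed in Table 1.
platforms = {"Rome": (3200, 8), "CLX": (2933, 6), "ICX": (3200, 8)}

for name, (speed, channels) in platforms.items():
    per_socket = peak_bandwidth_gbs(speed, channels)
    two_socket = peak_bandwidth_gbs(speed, channels, sockets=2)
    print(f"{name}: {per_socket:.1f} GB/s per socket, {two_socket:.1f} GB/s total")
```

Rounded to one decimal place, the results agree with the last two rows of Table 1.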

### 2.2 STREAM Benchmark

STREAM is a widely used standard benchmark for characterizing memory performance. It consists of the following 4 kernels: Copy, Scale, Add and Triad. The characteristics of these kernels are listed in Table-2. As can be observed from the table, Scale and Add process the same number of bytes as Copy and Triad respectively and differ only in the floating-point operations. The difference in floating-point work is insignificant because these kernels are memory-bound. Hence, we do not show the performance of Scale and Add in the remainder of the report.

To cover additional traffic patterns, we developed two more kernels representing 100% read traffic (Reduce) and 100% write traffic (Fill). The Reduce kernel performs a vector reduction with an *add* operator, and the Fill kernel writes a constant scalar value to all elements of a buffer. The source code for these kernels is available in the following internal Git repository: https://gitlab.devtools.intel.com/vsripath/stream/
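The benchmark kernels themselves are compiled C code from the repository above. The following is only a Python restatement of what each of the six kernels computes, following the operations listed in Table 2; the function names are chosen for illustration. It pins down the semantics and is not suitable for measuring bandwidth.

```python
def stream_copy(a, b):
    """COPY: a[i] = b[i]"""
    for i in range(len(a)):
        a[i] = b[i]


def stream_scale(a, b, scalar):
    """SCALE: a[i] = scalar * b[i]"""
    for i in range(len(a)):
        a[i] = scalar * b[i]


def stream_add(c, a, b):
    """ADD: c[i] = a[i] + b[i]"""
    for i in range(len(c)):
        c[i] = a[i] + b[i]


def stream_triad(c, a, b, scalar):
    """TRIAD: c[i] = a[i] + scalar * b[i]"""
    for i in range(len(c)):
        c[i] = a[i] + scalar * b[i]


def stream_reduce(a):
    """REDUCE: sum += a[i]; the array is only read, never written."""
    total = 0.0
    for x in a:
        total += x
    return total


def stream_fill(a, constant):
    """FILL: a[i] = constant; the array is only written, never read."""
    for i in range(len(a)):
        a[i] = constant


n = 8
a, b, c = [0.0] * n, [2.0] * n, [0.0] * n
stream_fill(a, 1.0)          # every element of a becomes 1.0
stream_triad(c, a, b, 3.0)   # c[i] = 1.0 + 3.0 * 2.0 = 7.0
print(stream_reduce(c))      # 8 elements of 7.0 sum to 56.0
```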

#### 2.2.1 Target ISA

The ISA used determines how many bytes are transferred between memory and CPU registers by a single vector instruction. AVX2 and AVX512F vector instructions operate on 256-bit and 512-bit wide registers respectively and hence transfer 32B and 64B per load/store instruction respectively. The STREAM benchmark is compiled with the Intel C Compiler, using the flags that generate the best instruction sequence for the target processor. The highest ISA supported on AMD Rome is AVX2, whereas Intel CLX and ICX support AVX512F. Hence, the binary executed on AMD Rome contains AVX2 instructions and the binaries executed on Intel CLX and ICX contain AVX512F instructions. Refer to the makefile in the aforementioned Git repository for all the compiler flags used to build the binaries in this study.

#### 2.2.2 Cache Write Policy: vmovpd vs vmovntpd

A CPU cache's write policy determines how and when the data in the cache is written back to main memory. Usually, there are two write policies: write-through and write-back. AMD Rome and Intel CLX and ICX all use the write-back (WB) policy. Under WB, when the CPU modifies data, it is only updated in the cache, and the write to main memory is postponed until the cache line containing the updated data is evicted. Hence, a read miss on a WB cache may require two memory accesses: if the to-be-evicted cache line is modified, its contents are first written to main memory, and then the requested data is fetched from main memory. For write misses, there are two approaches: write-allocate and no-write-allocate. Under the write-allocate policy, a store that misses the cache first fetches the cache line containing the data from main memory. Combined with the write-back policy, this results in an additional memory access later, when the updated line is written back to main memory.

If we look at the STREAM kernels, we observe the following two characteristics –

- Full Cache Line update: There are no partial cache line modifications done by the kernels.
- Non-Temporal Locality: The data that is written is never accessed/read again. Hence, there is no need to keep the data in cache.

Considering these aspects of the STREAM kernels, the *write-allocate* policy introduces an unnecessary overhead of fetching the data into the cache. This puts additional pressure on memory bandwidth when all the cores are actively writing data to main memory. To alleviate this, we can use an alternative store instruction called a *non-temporal* store (vmovntpd), which bypasses the cache and writes directly to main memory.
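The traffic difference can be made concrete with a toy model. The sketch below counts DRAM traffic in cache lines for a small direct-mapped, write-back, write-allocate cache; the class `WriteBackCache` and the helper `fill_traffic` are hypothetical names, and the model is a deliberate simplification (real hardware uses set-associative caches and write-combining buffers for non-temporal stores, neither of which is modelled here).

```python
class WriteBackCache:
    """Toy direct-mapped, write-back, write-allocate cache.

    Tracks DRAM traffic in units of whole cache lines.
    """

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # line address held in each slot
        self.dirty = [False] * num_lines
        self.dram_reads = 0
        self.dram_writes = 0

    def _allocate(self, line):
        slot = line % self.num_lines
        if self.tags[slot] == line:
            return slot                   # hit: no DRAM traffic
        if self.tags[slot] is not None and self.dirty[slot]:
            self.dram_writes += 1         # write back the evicted dirty line
        self.dram_reads += 1              # fetch the line (the RFO read for a store)
        self.tags[slot] = line
        self.dirty[slot] = False
        return slot

    def load(self, line):
        self._allocate(line)

    def store(self, line):
        slot = self._allocate(line)
        self.dirty[slot] = True

    def nt_store(self, line):
        # Modelling assumption: a non-temporal store writes the full line
        # straight to DRAM, and the line is not currently cached.
        self.dram_writes += 1

    def flush(self):
        """Write back every dirty line still held in the cache."""
        for slot in range(self.num_lines):
            if self.dirty[slot]:
                self.dram_writes += 1
                self.dirty[slot] = False


def fill_traffic(num_lines, use_nt, cache_lines=64):
    """DRAM (reads, writes) for a Fill-style sweep over num_lines lines."""
    cache = WriteBackCache(cache_lines)
    for line in range(num_lines):
        if use_nt:
            cache.nt_store(line)
        else:
            cache.store(line)
    cache.flush()
    return cache.dram_reads, cache.dram_writes


print(fill_traffic(1000, use_nt=False))  # (1000, 1000): one RFO read per written line
print(fill_traffic(1000, use_nt=True))   # (0, 1000): writes only
```

In this model a streaming write with regular stores moves twice as many lines as the same sweep with non-temporal stores.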

To understand the impact of the type of store instruction on memory bandwidth, the STREAM benchmark is compiled in the following two configurations:

- Regular Stores: This configuration uses regular stores (vmovpd, shorthand here for the vmovapd/vmovupd instructions), which generate two memory accesses on a write miss. We refer to this configuration as Read-For-Ownership (RFO) in the rest of the report. RFO is a cache coherency operation that reads a cache line with the intent to write to that memory address.
- Non-Temporal (NT) Stores: This configuration uses the aforementioned vmovntpd instruction to write directly to main memory without bringing the data into the cache. We refer to this configuration as NT in the rest of the report.

For the different STREAM kernels, we expect the following speed-ups of NT over RFO:

- Copy: The Copy kernel performs a total of 2 memory-ops (1 read, 1 write) with NT stores and 3 memory-ops (2 reads, 1 write) with regular stores, the extra read being the RFO fetch of the destination cache line. Hence, the ideal speed-up of NT over RFO is 1.5x.
- Triad: The Triad kernel performs a total of 3 memory-ops (2 reads, 1 write) with NT stores and 4 memory-ops (3 reads, 1 write) with regular stores. Hence, the ideal speed-up of NT over RFO is 1.33x.
- Reduce: The Reduce kernel has no write ops, so the performance should remain the same with both NT and regular stores.
- Fill: The Fill kernel has only write ops (1 write with NT stores versus 1 RFO read and 1 write with regular stores), so the ideal speed-up of NT over RFO is 2x.

| Kernel | Op                       | Bytes Read (RFO) | Bytes Read (NT) | Bytes Written | FLOPs | Ideal Speed-up of NT over RFO |
|--------|--------------------------|------------------|-----------------|---------------|-------|-------------------------------|
| COPY   | a[i] = b[i]              | 16               | 8               | 8             | 0     | 1.5x                          |
| SCALE  | a[i] = scalar × b[i]     | 16               | 8               | 8             | 1     | 1.5x                          |
| ADD    | c[i] = a[i] + b[i]       | 24               | 16              | 8             | 1     | 1.33x                         |
| TRIAD  | c[i] = a[i] + scalar × b[i] | 24            | 16              | 8             | 2     | 1.33x                         |
| REDUCE | sum += a[i]              | 8                | 8               | 0             | 1     | 1x                            |
| FILL   | a[i] = constant          | 8                | 0               | 8             | 0     | 2x                            |

Table 2: STREAM Kernels
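The byte counts and ideal speed-ups in Table 2 follow from a simple traffic model: every stored element costs one extra read under RFO. A minimal sketch of that arithmetic is shown below, assuming 8-byte double-precision elements; `traffic_bytes` and `ideal_nt_speedup` are illustrative names.

```python
BYTES_PER_ELEMENT = 8  # assumption: double-precision elements


def traffic_bytes(reads, writes):
    """Per-iteration (bytes read with RFO, bytes read with NT, bytes written)."""
    rfo_read = BYTES_PER_ELEMENT * (reads + writes)  # each store adds an RFO read
    nt_read = BYTES_PER_ELEMENT * reads
    written = BYTES_PER_ELEMENT * writes
    return rfo_read, nt_read, written


def ideal_nt_speedup(reads, writes):
    """Ratio of total RFO traffic to total NT traffic per iteration."""
    rfo_read, nt_read, written = traffic_bytes(reads, writes)
    return (rfo_read + written) / (nt_read + written)


# (array reads, array writes) per loop iteration for each kernel.
kernels = {
    "COPY": (1, 1), "SCALE": (1, 1), "ADD": (2, 1),
    "TRIAD": (2, 1), "REDUCE": (1, 0), "FILL": (0, 1),
}

for name, (reads, writes) in kernels.items():
    rfo_read, nt_read, written = traffic_bytes(reads, writes)
    speedup = ideal_nt_speedup(reads, writes)
    print(f"{name:<6} RFO read {rfo_read:>2}B  NT read {nt_read:>2}B  "
          f"written {written}B  ideal speed-up {speedup:.2f}x")
```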

### 2.3 Intel Memory Latency Checker

For cache bandwidth and latency analysis, we rely on the more versatile Intel Memory Latency Checker.

### 2.4 Runtime Settings

Table-3 shows the run-time settings used for most of the results presented in this study. Where we depart from these settings, we will explicitly call out the changes in the corresponding section.

## 3 DRAM

In the following sub-sections, we look at the peak memory bandwidth achieved by the different CPUs in 1-core, 1-socket and 2-socket configurations. We also evaluate the scaling performance within a socket and the impact of NUMA domains and of OpenMP thread to CPU core affinity. Each sub-section contains performance data that primarily answers the following two questions:

- 1. How do the different CPU architectures compare against each other?
- 2. What is the impact of using RFO and NT stores on a given architecture?

| Setting                  | Value (all platforms unless noted)                                |
|--------------------------|-------------------------------------------------------------------|
| NUMA domains per socket  | 4 (AMD Rome 7742), 1 (Intel Xeon 8268), 1 (Intel ICX XCC)         |
| Operating System         | CentOS Linux 7 (Core)                                             |
| Kernel                   | 3.10.0-1127.18.2.el7.crt1.x86_64                                  |
| CPU Turbo Boost          | Enabled                                                           |
| CPU Scaling Governor     | userspace                                                         |
| CPU Scaling Driver       | acpi-cpufreq                                                      |
| Transparent Huge Pages   | Enabled                                                           |
| Intel C Compiler         | ICC 19.1.2.254 20200623                                           |
| Hyper-Threading          | Enabled in HW, but logical threads are not used in benchmarking   |
| OpenMP Thread Affinity   | KMP_AFFINITY = "granularity=fine,compact,1,0"                     |

Table 3: Runtime Settings

### 3.1 1-Core

#### 3.1.1 Peak Bandwidth with RFO

Figure-2a RFO observations:

- 1. Rome vs ICX: Even though these two processors have identical DRAM configuration, Rome delivers higher bandwidth across the kernels.
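- 2. ICX vs CLX: Comparing gen-to-gen Xeon performance, ICX is slower than CLX for all kernels. The gap is largest for the Reduce and Fill kernels, where CLX is roughly 31% and 56% faster respectively (ICX/CLX ratios of 0.76 and 0.64 in Table-4). This is despite the fact that ICX has a higher DDR4 memory speed and more memory channels.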
- 3. Table-4 summarizes the speed-up of all kernels.

| Kernel | Rome  | CLX   | ICX   | ICX/Rome | CLX/Rome | ICX/CLX |
|--------|-------|-------|-------|----------|----------|---------|
| Copy   | 18.22 | 12.85 | 12.16 | 0.67     | 0.71     | 0.95    |
| Triad  | 20.64 | 13.42 | 12.45 | 0.60     | 0.65     | 0.93    |
| Reduce | 19.29 | 13.30 | 10.13 | 0.53     | 0.69     | 0.76    |
| Fill   | 15.29 | 11.77 | 7.57  | 0.50     | 0.77     | 0.64    |

Table 4: 1-Core peak bandwidth: RFO (GB/s)
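The three ratio columns are plain quotients of the per-platform bandwidth columns. The short sketch below recomputes them from the Table 4 values; the dictionary layout and the `ratio` helper are illustrative and not part of the benchmark tooling.

```python
# 1-core RFO bandwidth per kernel and platform, copied from Table 4.
BANDWIDTH = {
    "Copy":   {"Rome": 18.22, "CLX": 12.85, "ICX": 12.16},
    "Triad":  {"Rome": 20.64, "CLX": 13.42, "ICX": 12.45},
    "Reduce": {"Rome": 19.29, "CLX": 13.30, "ICX": 10.13},
    "Fill":   {"Rome": 15.29, "CLX": 11.77, "ICX": 7.57},
}


def ratio(kernel, numerator, denominator):
    """Bandwidth of one platform relative to another, rounded to 2 decimals."""
    row = BANDWIDTH[kernel]
    return round(row[numerator] / row[denominator], 2)


for kernel in BANDWIDTH:
    print(f"{kernel:<6} ICX/Rome {ratio(kernel, 'ICX', 'Rome'):.2f}  "
          f"CLX/Rome {ratio(kernel, 'CLX', 'Rome'):.2f}  "
          f"ICX/CLX {ratio(kernel, 'ICX', 'CLX'):.2f}")
```

A ratio below 1 means the platform in the numerator is slower; for example, an ICX/CLX value of 0.76 corresponds to CLX being about 1.31x faster than ICX.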

#### 3.1.2 Peak Bandwidth with NT

Figure-2b NT observations:

- 1. Rome vs ICX: Similar to the trend observed with RFO stores, Rome delivers higher bandwidth across the kernels, and its speed-up over ICX is even greater for every kernel except Fill.
- 2. ICX vs CLX: On ICX, the Fill kernel using NT stores shows a marked improvement of about 2x over CLX (2.14x in Table-5), while the remaining kernels are slower than on CLX.
- 3. Table-5 summarizes the speed-up of all kernels.



Figure 2: STREAM 1-Core DRAM bandwidth

| Kernel | Rome  | CLX   | ICX   | ICX/Rome | CLX/Rome | ICX/CLX |
|--------|-------|-------|-------|----------|----------|---------|
| Copy   | 36.57 | 10.36 | 9.53  | 0.26     | 0.28     | 0.92    |
| Triad  | 31.37 | 13.06 | 9.04  | 0.29     | 0.42     | 0.69    |
| Reduce | 22.29 | 12.78 | 10.08 | 0.45     | 0.57     | 0.79    |
| Fill   | 23.43 | 7.15  | 15.30 | 0.65     | 0.31     | 2.14    |

Table 5: 1-Core peak bandwidth: NT (GB/s)

#### 3.1.3 RFO vs. NT Stores

Figure-2c observations:

- 1. On CLX and ICX, the Reduce kernel performance remains nearly identical between the NT and RFO configurations (within about 4% in Tables 4 and 5), as expected since it has no writes to main memory. Rome is the exception, with about 16% higher Reduce bandwidth in the NT configuration (22.29 vs 19.29).
- 2. Rome: As expected, NT stores deliver higher performance than regular stores, which generate RFO requests. Interestingly, the speed-up of the Copy kernel with NT stores (about 2x) exceeds the ideal speed-up of 1.5x.
- 3. CLX: None of the kernels shows gains with NT, even though RFO generates more memory traffic than NT. This indicates that memory bandwidth is not the constraint when using 1 core. It leads to the observation that when the system is not memory bandwidth constrained (as is the case here with 1 thread), the latency of NT stores is greater than that of regular stores. This could be because the hardware prefetchers do a good job of fetching the lines for write misses with regular stores. In contrast, with NT stores the hardware prefetchers play no role, since the data is never brought into the cache.
- 4. ICX: Only the Fill kernel shows higher performance with NT stores than with regular stores. The reason for this is still to be determined.

### 3.2 1-Socket



Figure 3: STREAM 1-Socket DRAM bandwidth scaling with RFO and Compact pinning

For these experiments, Rome is configured with 4 NUMA domains per socket (NPS-4 mode), whereas ICX and CLX are configured with 1 NUMA domain per socket. Also, OpenMP threads are pinned to neighbouring cores first, i.e., compact mode. We explore the performance impact of distributed pinning in Section 3.2.4.

#### 3.2.1 Peak Bandwidth and Scaling with RFO

Figure-3 observations:

- Peak Bandwidth: Using all the cores in the socket
  - Rome vs ICX: Contrary to 1-core performance, ICX beats Rome for all kernels using RFO stores.
  - Rome vs CLX: Rome delivers superior performance for all kernels.
  - ICX vs CLX: Comparing gen-to-gen Xeon performance, ICX delivers significant performance improvements over CLX for all kernels.
  - Table-6 shows the speed-up of all kernels.

- Scaling Performance:
  - Rome: Across the 4 kernels, Rome exhibits a similar scaling pattern. Since Rome contains 64 cores per socket partitioned into 4 NUMA domains, each NUMA domain contains 16 cores. The data in the figure shows that performance remains similar from 1 to 16 threads, i.e., while threads are populating NUMA domain 0. Beyond that, performance increases linearly with thread count, and the gains from 16 to 64 threads show ideal scaling behavior.
  - CLX, ICX: In contrast to Rome, both CLX and ICX reach close to peak bandwidth with far fewer threads. In addition, CLX shows higher performance than ICX at some thread counts for the Reduce and Fill kernels.

| Kernel | Rome   | CLX    | ICX    | ICX/Rome | CLX/Rome | ICX/CLX |
|--------|--------|--------|--------|----------|----------|---------|
| Copy   | 102.08 | 71.92  | 148.85 | 1.46     | 0.70     | 2.07    |
| Triad  | 112.16 | 81.35  | 153.71 | 1.37     | 0.73     | 1.89    |
| Reduce | 122.46 | 115.73 | 159.95 | 1.31     | 0.95     | 1.38    |
| Fill   | 86.61  | 51.38  | 150.26 | 1.73     | 0.59     | 2.92    |

Table 6: 1-Socket peak bandwidth: RFO (GB/s)

#### 3.2.2 Peak Bandwidth and Scaling with NT



Figure 4: STREAM 1-Socket DRAM bandwidth scaling with NT Stores and Compact pinning

Figure-4 observations:

- Peak Bandwidth: Using all the cores in the socket
  - Rome vs ICX: Unlike with RFO stores, ICX delivers performance similar to Rome.
  - Rome vs CLX: Rome delivers superior performance for all kernels.
  - ICX vs CLX: Comparing gen-to-gen Xeon performance, ICX delivers 40-80% performance improvements over CLX for all kernels.
  - Table-7 shows the speed-up of all kernels.
- Scaling Performance: The scaling profile looks similar to RFO, with the following exceptions:
  - Rome: The Fill kernel starts to show gains with thread count at 8 threads instead of 16 as observed with RFO stores.
  - CLX, ICX: On ICX, the Fill kernel loses performance when using all the threads in the socket, dropping to about 10% below its peak bandwidth. For thread counts below 10, CLX delivers slightly higher performance than ICX for the Triad and Reduce kernels.

| Kernel | Rome   | CLX    | ICX    | ICX/Rome | CLX/Rome | ICX/CLX |
|--------|--------|--------|--------|----------|----------|---------|
| Copy   | 159.77 | 107.55 | 158.43 | 0.99     | 0.67     | 1.47    |
| Triad  | 157.31 | 110.45 | 158.21 | 1.01     | 0.70     | 1.43    |
| Reduce | 172.57 | 123.45 | 173.41 | 1.00     | 0.72     | 1.40    |
| Fill   | 158.14 | 81.99  | 146.69 | 0.93     | 0.52     | 1.79    |

Table 7: 1-Socket peak bandwidth: NT (GB/s)

#### 3.2.3 RFO vs. NT Stores

- 1. Copy: As listed in Table-2, the ideal speed-up of NT over RFO for the Copy kernel is 1.5x. Rome matches the ideal speed-up. CLX reaches the 1.5x speed-up at higher thread counts, whereas ICX shows far smaller gains with NT stores. Also of note, both CLX and ICX show performance degradation with NT stores at lower thread counts, which isn't the case with Rome.



Figure 5: STREAM 1-Socket DRAM bandwidth scaling, speed-up of NT over RFO Stores

- 2. Triad: The ideal speed-up of NT over RFO for Triad is 1.33x. Rome exhibits the ideal speed-up. CLX shows the expected gains at higher thread counts, whereas ICX shows maximum gains of about 5% with NT stores. On ICX, using NT stores slows performance (even more than on CLX) at lower thread counts.
- 3. Reduce: The performance of the Reduce kernel should remain identical with both NT and regular stores. However, the Reduce kernel on Rome shows up to 40% performance gains in the NT configuration. Deeper investigation of the generated code revealed that the Intel compiler unrolls this kernel aggressively in the regular-store configuration. This results in more in-flight loads and appears to be the root cause of the slowdown with regular stores. Interestingly, even though the Intel compiler does the same aggressive unrolling with regular stores on CLX and ICX, these two architectures seem to be more resilient to the additional in-flight loads and do not lose performance.
- 4. Fill: The ideal speed-up of NT over RFO is 2x. Rome reaches the expected gains once the thread count exceeds the number of cores in a NUMA domain. CLX and ICX exhibit different behavior for this kernel, with ICX showing significant performance loss as the thread count increases.
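To put the peak-bandwidth numbers next to the ideal speed-ups, the sketch below divides the NT values in Table 7 by the RFO values in Table 6 for each kernel and platform. It covers only the tabulated peaks, not the per-thread-count curves in Figure 5, and the names `RFO`, `NT`, `IDEAL` and `nt_speedup` are illustrative.

```python
# 1-socket peak bandwidth copied from Table 6 (RFO) and Table 7 (NT).
RFO = {
    "Copy":   {"Rome": 102.08, "CLX": 71.92,  "ICX": 148.85},
    "Triad":  {"Rome": 112.16, "CLX": 81.35,  "ICX": 153.71},
    "Reduce": {"Rome": 122.46, "CLX": 115.73, "ICX": 159.95},
    "Fill":   {"Rome": 86.61,  "CLX": 51.38,  "ICX": 150.26},
}
NT = {
    "Copy":   {"Rome": 159.77, "CLX": 107.55, "ICX": 158.43},
    "Triad":  {"Rome": 157.31, "CLX": 110.45, "ICX": 158.21},
    "Reduce": {"Rome": 172.57, "CLX": 123.45, "ICX": 173.41},
    "Fill":   {"Rome": 158.14, "CLX": 81.99,  "ICX": 146.69},
}
# Ideal NT-over-RFO speed-ups from Table 2.
IDEAL = {"Copy": 1.5, "Triad": 4 / 3, "Reduce": 1.0, "Fill": 2.0}


def nt_speedup(kernel, cpu):
    """Measured peak-bandwidth ratio of NT over RFO for one kernel and CPU."""
    return NT[kernel][cpu] / RFO[kernel][cpu]


for kernel in RFO:
    cells = ", ".join(f"{cpu} {nt_speedup(kernel, cpu):.2f}x"
                      for cpu in ("Rome", "CLX", "ICX"))
    print(f"{kernel:<6} ideal {IDEAL[kernel]:.2f}x | measured: {cells}")
```

With these table values, Rome lands close to the ideal for Copy and Triad, CLX reaches about 1.5x for Copy, and ICX stays below 1.1x for every kernel.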

#### 3.2.4 Compact vs Distributed Pinning

Experiments with thread affinity settings make for an interesting performance study only when there are multiple NUMA domains in a socket. Since only Rome is configured with 4 NUMA domains per socket in our setup, we limit these experiments to the Rome architecture. As mentioned earlier, compact pinning refers to the thread affinity scheme where OpenMP threads are pinned to cores that are close to each other. In the distributed pinning scheme, we affinitize the OpenMP threads so that they spread across the NUMA domains first. For example, with 4 threads this translates to using a single core in each of the 4 NUMA domains in a socket.
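The two schemes can be sketched as thread-to-core mappings for a Rome socket with 64 cores in 4 NUMA domains of 16 cores each. The sketch assumes that core IDs are numbered contiguously within a NUMA domain, which may not match the actual core enumeration on a given system; the function names are illustrative and this is not how the OpenMP runtime implements affinity.

```python
CORES_PER_SOCKET = 64
NUMA_DOMAINS = 4
CORES_PER_DOMAIN = CORES_PER_SOCKET // NUMA_DOMAINS  # 16


def compact_core(thread_id):
    """Compact: fill one NUMA domain completely before moving to the next."""
    return thread_id


def distributed_core(thread_id):
    """Distributed: place consecutive threads round-robin across NUMA domains."""
    domain = thread_id % NUMA_DOMAINS
    slot = thread_id // NUMA_DOMAINS
    return domain * CORES_PER_DOMAIN + slot


def domain_of(core):
    # Assumption: cores 0-15 are in domain 0, 16-31 in domain 1, and so on.
    return core // CORES_PER_DOMAIN


def domains_in_use(num_threads, placement):
    """Number of distinct NUMA domains touched by the first num_threads threads."""
    return len({domain_of(placement(t)) for t in range(num_threads)})


for n in (1, 4, 8, 16, 32, 64):
    print(f"{n:>2} threads: compact uses {domains_in_use(n, compact_core)} domain(s), "
          f"distributed uses {domains_in_use(n, distributed_core)}")
```

Under this mapping, distributed pinning touches all 4 NUMA domains from 4 threads onward, whereas compact pinning stays within a single domain up to 16 threads.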



Figure 6: STREAM 1-Socket DRAM bandwidth: Compact vs Distributed Pinning

Figure-6 observations:

- For both RFO and NT, distributed pinning reaches close to peak bandwidth with fewer threads than compact pinning. This makes sense since in NPS-4 mode the 8 memory channels are partitioned across the 4 NUMA domains, with each NUMA domain accessing DRAM through 2 channels. Hence, with 8 threads all memory channels are actively used in distributed pinning.
- RFO: The Reduce kernel with distributed affinity reaches peak bandwidth at fewer threads and then loses performance when using all the cores in the socket.
- NT: Compared to the rest of the kernels, the Fill kernel in distributed pinning achieves much lower bandwidth. Once the thread count exceeds 32, the performance is only marginally better than with compact pinning.
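The channel argument in the first observation can be expressed as a small counting model. It assumes that each thread only touches memory local to its own NUMA domain and that a domain's 2 channels become usable as soon as one thread runs in that domain; `cores_per_domain=8` is again a placeholder, and `active_channels` is a hypothetical helper.

```python
def active_channels(n_threads, scheme, n_domains=4,
                    channels_per_domain=2, cores_per_domain=8):
    """Count memory channels reachable by the pinned threads in NPS-4 mode."""
    if scheme == "compact":
        # Ceiling division: domains fill up one after another.
        domains_used = min(n_domains, -(-n_threads // cores_per_domain))
    elif scheme == "distributed":
        # One thread per domain until every domain is occupied.
        domains_used = min(n_domains, n_threads)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return domains_used * channels_per_domain


for scheme in ("compact", "distributed"):
    print(scheme, active_channels(8, scheme))
```

In this model, 8 threads reach all 8 channels under distributed pinning but only 2 channels under compact pinning. The model ignores per-channel saturation, so it indicates reachability rather than achieved bandwidth.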

## 3.3 2-Socket



Figure 7: STREAM 2-socket DRAM bandwidth

Since each socket has its own set of memory controllers, DRAM bandwidth for 2-socket scales by a factor of 2 compared to 1-socket performance. There are no unique observations in the 2-socket data, and all the points made in the 1-socket section are valid here as well.

#### 3.3.1 Peak Bandwidth with RFO

Figure-7 shows the peak achieved bandwidth when all cores in a 2-socket system access DRAM. Table-8 and Table-9 show the corresponding bandwidth and relative speed-up for each architecture with RFO and NT stores, respectively. Comparing ICX to Rome, ICX delivers higher performance for all kernels using regular stores.

| Kernel | Rome   | CLX    | ICX    | ICX/Rome | CLX/Rome | ICX/CLX |
|--------|--------|--------|--------|----------|----------|---------|
| Copy   | 206.17 | 144.30 | 301.02 | 1.46     | 0.70     | 2.09    |
| Triad  | 224.09 | 161.42 | 304.43 | 1.36     | 0.72     | 1.89    |
| Reduce | 181.47 | 228.17 | 307.14 | 1.69     | 1.26     | 1.35    |
| Fill   | 193.33 | 105.40 | 272.65 | 1.41     | 0.55     | 2.59    |

Table 8: 2-socket peak bandwidth: RFO

### 3.3.2 Peak Bandwidth with NT

Figure-7 shows the peak achieved bandwidth using NT stores.

| Kernel | Rome   | CLX    | ICX    | ICX/Rome | CLX/Rome | ICX/CLX |
|--------|--------|--------|--------|----------|----------|---------|
| Copy   | 318.01 | 198.35 | 309.48 | 0.97     | 0.62     | 1.56    |
| Triad  | 312.99 | 217.81 | 316.58 | 1.01     | 0.70     | 1.45    |
| Reduce | 345.37 | 244.26 | 341.96 | 0.99     | 0.71     | 1.40    |
| Fill   | 313.32 | 152.32 | 273.49 | 0.87     | 0.49     | 1.80    |

Table 9: 2-socket peak bandwidth: NT
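Tables 8 and 9 can be combined to obtain the achieved 2-socket speed-up of NT over RFO stores per kernel and architecture. The sketch below only divides the values reported in the two tables; the dictionary layout and the helper name `nt_over_rfo` are illustrative.

```python
# Peak 2-socket bandwidth values copied from Table 8 (RFO) and Table 9 (NT).
RFO = {
    "Copy":   {"Rome": 206.17, "CLX": 144.30, "ICX": 301.02},
    "Triad":  {"Rome": 224.09, "CLX": 161.42, "ICX": 304.43},
    "Reduce": {"Rome": 181.47, "CLX": 228.17, "ICX": 307.14},
    "Fill":   {"Rome": 193.33, "CLX": 105.40, "ICX": 272.65},
}
NT = {
    "Copy":   {"Rome": 318.01, "CLX": 198.35, "ICX": 309.48},
    "Triad":  {"Rome": 312.99, "CLX": 217.81, "ICX": 316.58},
    "Reduce": {"Rome": 345.37, "CLX": 244.26, "ICX": 341.96},
    "Fill":   {"Rome": 313.32, "CLX": 152.32, "ICX": 273.49},
}


def nt_over_rfo(kernel, arch):
    """Achieved speed-up of NT stores over regular stores for one table cell."""
    return NT[kernel][arch] / RFO[kernel][arch]


print(f"{'Kernel':8s}" + "".join(f"{arch:>8s}" for arch in ("Rome", "CLX", "ICX")))
for kernel in RFO:
    row = "".join(f"{nt_over_rfo(kernel, arch):8.2f}" for arch in ("Rome", "CLX", "ICX"))
    print(f"{kernel:8s}{row}")
```

The ICX ratios stay close to 1.0 for Copy, Triad and Fill, which is consistent with the small NT gains noted for ICX in the 1-socket discussion.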

## 3.4 NUMA

In this section, we will explore the performance impact of explicitly pinning memory to a specific NUMA domain. We achieve this by doing the following:

- 1. NUMA Placement: Explicitly specify a single NUMA domain using *numactl*. This forces all the memory accesses to go to that NUMA domain.
- 2. Threads Placement: For both 1-core and 1-socket (utilizing all the cores in a socket), we explicitly pin the OpenMP threads to socket-0 cores.

- 3. The combination of the above two enables us to evaluate the cross-socket interconnect (Intel UPI, AMD Infinity Fabric) bandwidth as well.
- 4. Since Rome is set up in NPS-4 mode while CLX and ICX are set up in NPS-1 mode, this is not a fair one-to-one comparison. Rather, it should be read as showing the trade-offs that a given mode imposes on application performance for a given architecture.
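The placement steps above can be sketched as a command-line builder. This is an assumption-laden illustration: the benchmark binary name `./stream` is hypothetical, and restricting the CPU set with `--physcpubind` is only one way to keep threads on socket-0 (the study may instead have relied on OpenMP affinity settings). `--membind` and `--physcpubind` are standard `numactl` options.

```python
import shlex


def numactl_command(mem_node, cores, binary="./stream"):
    """Build a numactl invocation that binds memory to one NUMA domain
    and restricts execution to the given cores.

    `binary` is a placeholder for the benchmark executable.
    """
    cpu_list = ",".join(str(core) for core in cores)
    return [
        "numactl",
        f"--membind={mem_node}",       # all allocations come from this NUMA domain
        f"--physcpubind={cpu_list}",   # threads may only run on these cores
        binary,
    ]


# Example: threads on the first 8 cores, memory forced to a remote domain (node 4).
print(shlex.join(numactl_command(4, range(8))))
```

Sweeping `mem_node` over all NUMA domains while keeping the core list fixed on socket-0 reproduces the local-versus-remote comparison described in this section.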

#### 3.4.1 1-Core, 1-Socket:

Figure-8 shows the performance of STREAM kernels with regular stores using different NUMA domains. There isn't much difference between RFO and NT stores with respect to NUMA, so we will just focus our discussion on the impact of NUMA. Figure-23 in the appendix section shows the performance with NT stores.

- Rome:
  - Local vs Remote NUMA: Since Rome is set up in NPS-4 mode, it has 4 NUMA domains per socket (0-3) and a total of 8 NUMA domains in a 2-socket system. This naturally translates to higher bandwidth for memory accesses going to the local NUMA domains in socket-0, i.e., NUMA domains 0 to 3. With NUMA domains 4 to 7, the memory requests have to cross to the other socket over the cross-socket interconnect, which leads to lower performance.
  - Local NUMA: In NPS-4, memory accesses going to each NUMA domain are interleaved across only 2 memory channels (4 NUMA domains × 2 channels = 8 channels). This leads to substantially lower memory bandwidth even when using all cores in a socket. This can be observed by comparing the 1-socket performance in figure-8 with figure-3, where memory is distributed across all NUMA domains in a socket.

- CLX, ICX:
  - Local vs Remote NUMA: Comparing gen-to-gen, the cross-socket memory bandwidth shows more than 2x gains in ICX.
  - Local NUMA: Since both CLX and ICX are in NPS-1 mode, memory accesses use all the available channels in a socket. Hence, the 1-socket performance in figure-8 is identical to figure-3.
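The effect of partitioning channels across NUMA domains can be estimated with simple arithmetic. The sketch below assumes DDR4-3200 memory (3200 MT/s, 8 bytes per transfer per channel); the DRAM speed is an assumption made for illustration and should be replaced with the value of the platform under test. The helper name is hypothetical.

```python
def peak_bw_per_domain(channels_per_socket, domains_per_socket,
                       mega_transfers_per_s=3200, bytes_per_transfer=8):
    """Return (channels per NUMA domain, theoretical peak GB/s per NUMA domain).

    Assumption: channels are split evenly across the NUMA domains of a socket.
    """
    channels = channels_per_socket // domains_per_socket
    gb_per_s_per_channel = mega_transfers_per_s * bytes_per_transfer / 1000
    return channels, channels * gb_per_s_per_channel


# Rome with 8 channels per socket: NPS-4 (4 domains) versus a single domain.
for domains in (4, 1):
    channels, peak = peak_bw_per_domain(8, domains)
    print(f"{domains} domain(s): {channels} channels, {peak:.1f} GB/s theoretical peak")
```

Under this assumption a single NPS-4 domain has access to a quarter of the socket's theoretical bandwidth, which is why binding all memory to one domain caps the achievable bandwidth even with every core active.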



Figure 8: STREAM 1-Core, 1-Socket NUMA bandwidth with RFO

## 3.5 Sub-NUMA Clustering

## 3.5.1 ICX: SNC-1, 2, 4



Figure 9: STREAM ICX 1-Core DRAM bandwidth: SNC modes



Figure 10: STREAM ICX 2-Socket DRAM bandwidth: SNC modes



Figure 11: STREAM 1-Socket DRAM bandwidth scaling with RFO and Compact pinning: SNC modes



Figure 12: STREAM 1-Socket DRAM bandwidth scaling with RFO with specific affinity: SNC modes

### Summary

- 1-Core, 2-Socket Memory Bandwidth: SNC yields only marginal gains in DRAM bandwidth.
- NUMA: Since each NUMA cluster is bound to fewer memory channels in SNC-2 and SNC-4 modes, the local and remote NUMA bandwidth is much lower than in SNC-1 mode.



Figure 13: STREAM 1-Core, 1-Socket NUMA bandwidth with RFO: SNC modes

| Kernel | SNC-1 | SNC-2 | SNC-4 | SNC-4/SNC-1 | SNC-2/SNC-1 | SNC-4/SNC-2 |
|--------|-------|-------|-------|-------------|-------------|-------------|
| Copy   | 13.17 | 13.99 | 13.99 | 1.06        | 1.06        | 1.00        |
| Triad  | 16.27 | 16.45 | 15.69 | 0.96        | 1.01        | 0.95        |
| Reduce | 16.77 | 16.75 | 17.28 | 1.03        | 1.00        | 1.03        |
| Fill   | 7.14  | 7.52  | 7.66  | 1.07        | 1.05        | 1.02        |

Table 10: ICX 1-Core peak bandwidth: RFO with SNC modes

| Kernel | SNC-1 | SNC-2 | SNC-4 | SNC-4/SNC-1 | SNC-2/SNC-1 | SNC-4/SNC-2 |
|--------|-------|-------|-------|-------------|-------------|-------------|
| Copy   | 24.70 | 25.50 | 23.16 | 0.94        | 1.03        | 0.91        |
| Triad  | 20.95 | 23.05 | 20.63 | 0.98        | 1.10        | 0.90        |
| Reduce | 16.76 | 16.73 | 17.38 | 1.04        | 1.00        | 1.04        |
| Fill   | 26.29 | 29.81 | 31.62 | 1.20        | 1.13        | 1.06        |

Table 11: ICX 1-Core peak bandwidth: NT with SNC modes

| Kernel | SNC-1  | SNC-2  | SNC-4  | SNC-4/SNC-1 | SNC-2/SNC-1 | SNC-4/SNC-2 |
|--------|--------|--------|--------|-------------|-------------|-------------|
| Copy   | 305.52 | 303.83 | 290.05 | 0.95        | 0.99        | 0.95        |
| Triad  | 313.66 | 312.04 | 306.90 | 0.98        | 0.99        | 0.98        |
| Reduce | 307.77 | 317.58 | 331.93 | 1.08        | 1.03        | 1.05        |
| Fill   | 276.74 | 274.89 | 283.67 | 1.03        | 0.99        | 1.03        |

Table 12: ICX 2-Socket peak bandwidth: RFO with SNC modes

| Kernel | SNC-1  | SNC-2  | SNC-4  | SNC-4/SNC-1 | SNC-2/SNC-1 | SNC-4/SNC-2 |
|--------|--------|--------|--------|-------------|-------------|-------------|
| Copy   | 321.62 | 326.42 | 332.79 | 1.03        | 1.01        | 1.02        |
| Triad  | 324.89 | 330.86 | 338.79 | 1.04        | 1.02        | 1.02        |
| Reduce | 345.71 | 361.04 | 357.46 | 1.03        | 1.04        | 0.99        |
| Fill   | 273.78 | 277.03 | 278.62 | 1.02        | 1.01        | 1.01        |

Table 13: ICX 2-Socket peak bandwidth: NT with SNC modes

## 3.5.2 CLX: SNC-1, 2



Figure 14: STREAM CLX 1-Core DRAM bandwidth: SNC modes



Figure 15: STREAM CLX 2-Socket DRAM bandwidth: SNC modes



Figure 16: STREAM 1-Socket DRAM bandwidth scaling with RFO and Compact pinning: SNC modes



Figure 17: STREAM 1-Socket DRAM bandwidth scaling with RFO with specific affinity: SNC modes

| Kernel | SNC-1 | SNC-2 | SNC-2/SNC-1 |
|--------|-------|-------|-------------|
| Copy   | 12.93 | 14.15 | 1.09        |
| Triad  | 13.65 | 14.39 | 1.05        |
| Reduce | 13.63 | 15.96 | 1.17        |
| Fill   | 11.75 | 12.83 | 1.09        |

Table 14: CLX 1-Core peak bandwidth: RFO with SNC modes



Figure 18: STREAM 1-Core, 1-Socket NUMA bandwidth with RFO: SNC modes

| Kernel | SNC-1 | SNC-2 | SNC-2/SNC-1 |
|--------|-------|-------|-------------|
| Copy   | 10.49 | 11.02 | 1.05        |
| Triad  | 13.29 | 13.54 | 1.02        |
| Reduce | 13.51 | 15.98 | 1.18        |
| Fill   | 7.23  | 7.67  | 1.06        |

Table 15: CLX 1-Core peak bandwidth: NT with SNC modes

| Kernel | SNC-1  | SNC-2  | SNC-2/SNC-1 |
|--------|--------|--------|-------------|
| Copy   | 141.11 | 144.28 | 1.02        |
| Triad  | 157.13 | 159.37 | 1.01        |
| Reduce | 208.37 | 223.97 | 1.07        |
| Fill   | 100.90 | 106.81 | 1.06        |

Table 16: CLX 2-Socket peak bandwidth: RFO with SNC modes

| Kernel | SNC-1  | SNC-2  | SNC-2/SNC-1 |
|--------|--------|--------|-------------|
| Copy   | 203.36 | 209.63 | 1.03        |
| Triad  | 215.35 | 215.68 | 1.00        |
| Reduce | 243.25 | 247.07 | 1.02        |
| Fill   | 156.55 | 162.02 | 1.03        |

Table 17: CLX 2-Socket peak bandwidth: NT with SNC modes

## 4 CPU Caches

Table-18 shows the cache hierarchy of AMD Rome and Intel CLX, ICX systems. We use Intel MLC to evaluate the cache performance on the target systems.

For latency measurements, the workload type is 100% dependent reads with random access within a 512 KiB chunk. The dependent reads prevent pipelining of the loads, the random access pattern defeats the hardware prefetchers, and the 512 KiB chunk minimizes TLB penalties.
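The idea behind such a dependent-read workload is pointer chasing through a randomly ordered ring. The sketch below is not Intel MLC's implementation; it only illustrates the access pattern, assuming one slot per 64-byte cache line and using Sattolo's algorithm so that the random permutation forms a single cycle covering the whole chunk. All names are illustrative.

```python
import random

CHUNK_BYTES = 512 * 1024  # chunk size quoted in the text
LINE_BYTES = 64           # assumed cache-line size


def build_chase_ring(n_slots, seed=0):
    """Sattolo's algorithm: a random permutation consisting of exactly one cycle."""
    rng = random.Random(seed)
    next_slot = list(range(n_slots))
    for i in range(n_slots - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        next_slot[i], next_slot[j] = next_slot[j], next_slot[i]
    return next_slot


def chase(next_slot, steps, start=0):
    """Follow the ring; each access depends on the result of the previous one."""
    index = start
    for _ in range(steps):
        index = next_slot[index]
    return index


n_slots = CHUNK_BYTES // LINE_BYTES
ring = build_chase_ring(n_slots)
# A full lap returns to the start, so every slot was visited exactly once.
print(chase(ring, n_slots) == 0)
```

A real latency measurement would implement the same chase in a compiled language over actual memory addresses and divide the elapsed time by the number of steps; in Python, interpreter overhead dominates, so this version only demonstrates the traversal order.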

|                            | AMD EPYC 7742           | Intel Xeon 8268           | Intel ICX XCC             |
|----------------------------|-------------------------|---------------------------|---------------------------|
| Code name                  | Rome                    | Cascade Lake (CLX)        | Ice Lake (ICX)            |
| Number of sockets          | 2                       | 2                         | 2                         |
| Number of cores per socket | 32                      | 24                        | 32                        |
| Total number of cores      | 64                      | 48                        | 64                        |
| L1 Cache                   | 32 KiB (8-way private)  | 32 KiB (8-way private)    | 48 KiB (12-way private)   |
| L2 Cache                   | 512 KiB (8-way private) | 1024 KiB (16-way private) | 1280 KiB (20-way private) |
| L3 Cache slice per core    | 4 MiB                   | 1.5 MiB                   | 1.5 MiB                   |
| L3 Cache per socket        | 256 MiB (16-way shared) | 35.7 MiB (11-way shared)  | 48 MiB (12-way shared)    |

Table 18: Cache Sub-system

## 4.1 Latency: Idle, Loaded

Idle latency is measured on CPU core-0 while the rest of the cores on the CPU are idle. Although this is an artificial scenario, it represents the best case for a CPU architecture study.

Loaded latency is measured on core-0 while all the remaining cores stress memory bandwidth. The injection delay (in cycles) specifies how long the bandwidth-generating cores sit in an idle spin loop between requests. As the injection delay increases, latency decreases since there is less contention for queueing resources.

## 4.1.1 Level-2 Cache



Figure 19: Level-2 Cache: Idle and Loaded Latencies

### 4.1.2 Level-3 Cache



Figure 20: Level-3 Cache: Idle and Loaded Latencies

## 4.1.3 Cache-to-Cache Transfer

Figure-21 shows the latency to fetch a modified cache line to core-0 from a remote core. On Rome, the latency increases as the distance between core-0 and the remote core increases. We can observe this pattern even within a socket since Rome is configured in NPS-4 mode. On CLX and ICX in NPS-1/SNC-1 mode, the latency remains flat for remote core IDs within socket-0.



Figure 21: Cache-to-Cache Transfer Latency

## 4.2 Bandwidth



Figure 22: Peak Injection Bandwidth

# **Appendices**

## A Rome, CLX, ICX: NUMA with NT Stores



Figure 23: STREAM 1-Core, 1-Socket NUMA bandwidth with NT

## B AMD Rome: 7402, 7502, 7742



Figure 24: STREAM 1-Core DRAM bandwidth: AMD Rome SKUs



Figure 25: STREAM 2-Socket DRAM bandwidth: AMD Rome SKUs

## C ICX: SNC with NT Stores



Figure 26: STREAM 1-Socket DRAM bandwidth scaling with NT and Compact pinning: SNC modes



Figure 27: STREAM 1-Socket DRAM bandwidth scaling, speed-up of NT over RFO Stores



Figure 28: STREAM 1-Core, 1-Socket NUMA bandwidth with NT: SNC modes

## D CLX: SNC with NT Stores



Figure 29: STREAM 1-Socket DRAM bandwidth scaling with NT and Compact pinning: SNC modes



Figure 30: STREAM 1-Socket DRAM bandwidth scaling, speed-up of NT over RFO Stores



Figure 31: STREAM 1-Core, 1-Socket NUMA bandwidth with NT: SNC modes