
Raspberry Pi 5 4GB is faster (sometimes significantly so) than the 8GB version #1854

Open
Brunnis opened this issue Dec 13, 2023 · 50 comments

@Brunnis

Brunnis commented Dec 13, 2023

Describe the bug
The Raspberry Pi 5 4GB performs slightly better (0-10%) than the 8GB version at the default 2.4 GHz, and the gap widens to >100% for certain workloads when overclocked. These workloads see a dramatic reduction in performance when the 8GB board is overclocked; this reduction is not present at all on the 4GB board.

It's unclear to me whether the small performance difference at default clock frequency has the same root cause as the more dramatic one that emerges as the ARM core frequency is increased. For the time being I'm treating them as related and reporting on both in this issue.

To reproduce
I have found two workloads in particular that expose the issue: Geekbench 5 "Text Rendering" multi-core sub-test and stress-ng "numa" stressor. To reproduce this issue, I suggest benchmarking the 4GB and 8GB boards at both 2.4 GHz and 2.8 GHz.

Geekbench 5 is available here: https://www.geekbench.com/preview/

To run stress-ng with profiling info:

  1. Install stress-ng and the "perf" tool: sudo apt install stress-ng linux-perf linux-perf-dbgsym
  2. Run the following to enable perf to work with stress-ng: sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
  3. Perform the test and note the bogo ops/s value: stress-ng --numa 4 --numa-ops 1000 --metrics --perf

Expected behaviour
Ideally the 4GB and 8GB boards would perform close to the same. The smaller difference at stock frequencies could be deemed normal/expected (for example due to different RAM ICs being used), but the dramatically increasing gap in performance as ARM core frequency increases suggests something may be misbehaving.

Actual behaviour
The 4GB board is anywhere from a few percent to more than 100% faster than the 8GB board, depending on clock frequency and workload. Below is a summary of the tests I've run. As can be seen, the 4GB uses Samsung RAM and the 8GB uses Micron RAM.

[Image: "Pi 5 4GB vs 8GB v2" benchmark summary table]

These benchmark results are completely reproducible. I've also looked at other people's submissions of Geekbench 5 results and can see the same reduction in "Text Rendering" scores on overclocked 8GB boards (but not on overclocked 4GB boards), so this is not limited to my specimen.

Below are the Geekbench 5 results at 2.4 and 2.8 GHz for the runs listed in the table above.

4GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028307
8GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028116
4GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028479
8GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028225

Below is the "perf" tool output for the stress-ng runs at 2.4 and 2.8 GHz on both boards:

4GB (2.4 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [3278] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [3278] dispatching hogs: 4 numa
stress-ng: info:  [3279] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [3278] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [3278]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [3278] numa               1000      8.44     33.33      0.05       118.48          29.96        98.85          7040
stress-ng: info:  [3278] numa:
stress-ng: info:  [3278]             81,220,668,668 CPU Cycles                     9.480 B/sec
stress-ng: info:  [3278]              4,757,399,712 Instructions                   0.555 B/sec (0.059 instr. per cycle)
stress-ng: info:  [3278]                    258,736 Branch Misses                 30.199 K/sec ( 0.000%)
stress-ng: info:  [3278]                122,043,476 Stalled Cycles Frontend       14.244 M/sec
stress-ng: info:  [3278]             76,572,843,124 Stalled Cycles Backend         8.937 B/sec
stress-ng: info:  [3278]             81,230,134,176 Bus Cycles                     9.481 B/sec
stress-ng: info:  [3278]              4,349,256,396 Cache References               0.508 B/sec
stress-ng: info:  [3278]                  1,719,788 Cache Misses                   0.201 M/sec ( 0.040%)
stress-ng: info:  [3278]              4,436,001,520 Cache L1D Read                 0.518 B/sec
stress-ng: info:  [3278]                  1,745,244 Cache L1D Read Miss            0.204 M/sec ( 0.039%)
stress-ng: info:  [3278]              1,235,273,488 Cache L1I Read                 0.144 B/sec
stress-ng: info:  [3278]                    665,872 Cache L1I Read Miss           77.718 K/sec
stress-ng: info:  [3278]                  1,549,288 Cache LL Read                  0.181 M/sec
stress-ng: info:  [3278]                  1,187,048 Cache LL Read Miss             0.139 M/sec (76.619%)
stress-ng: info:  [3278]              4,343,855,792 Cache DTLB Read                0.507 B/sec
stress-ng: info:  [3278]                  5,006,956 Cache DTLB Read Miss           0.584 M/sec ( 0.115%)
stress-ng: info:  [3278]              1,160,470,704 Cache BPU Read                 0.135 B/sec
stress-ng: info:  [3278]                    220,720 Cache BPU Read Miss           25.762 K/sec ( 0.019%)
stress-ng: info:  [3278]             33,889,417,812 CPU Clock                      3.955 B/sec
stress-ng: info:  [3278]             33,889,294,388 Task Clock                     3.955 B/sec
stress-ng: info:  [3278]                      1,060 Page Faults Total            123.720 /sec 
stress-ng: info:  [3278]                      1,060 Page Faults Minor            123.720 /sec 
stress-ng: info:  [3278]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [3278]                        228 Context Switches              26.611 /sec 
stress-ng: info:  [3278]                        160 Cgroup Switches               18.675 /sec 
stress-ng: info:  [3278]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [3278]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [3278]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [3278] successful run completed in 8.57s

8GB (2.4 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [2825] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [2825] dispatching hogs: 4 numa
stress-ng: info:  [2826] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2825] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2825]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2825] numa               1000      9.33     37.08      0.06       107.20          26.93        99.51          7072
stress-ng: info:  [2825] numa:
stress-ng: info:  [2825]             87,844,236,208 CPU Cycles                     9.284 B/sec
stress-ng: info:  [2825]              4,775,378,096 Instructions                   0.505 B/sec (0.054 instr. per cycle)
stress-ng: info:  [2825]                    199,680 Branch Misses                 21.103 K/sec ( 0.000%)
stress-ng: info:  [2825]                148,631,000 Stalled Cycles Frontend       15.708 M/sec
stress-ng: info:  [2825]             83,139,642,756 Stalled Cycles Backend         8.786 B/sec
stress-ng: info:  [2825]             87,849,436,396 Bus Cycles                     9.284 B/sec
stress-ng: info:  [2825]              4,360,769,624 Cache References               0.461 B/sec
stress-ng: info:  [2825]                  3,826,552 Cache Misses                   0.404 M/sec ( 0.088%)
stress-ng: info:  [2825]              4,326,723,104 Cache L1D Read                 0.457 B/sec
stress-ng: info:  [2825]                  3,818,744 Cache L1D Read Miss            0.404 M/sec ( 0.088%)
stress-ng: info:  [2825]              1,205,851,084 Cache L1I Read                 0.127 B/sec
stress-ng: info:  [2825]                    688,460 Cache L1I Read Miss           72.759 K/sec
stress-ng: info:  [2825]                  3,388,484 Cache LL Read                  0.358 M/sec
stress-ng: info:  [2825]                  2,846,120 Cache LL Read Miss             0.301 M/sec (83.994%)
stress-ng: info:  [2825]              4,366,253,912 Cache DTLB Read                0.461 B/sec
stress-ng: info:  [2825]                  5,043,928 Cache DTLB Read Miss           0.533 M/sec ( 0.116%)
stress-ng: info:  [2825]              1,179,947,616 Cache BPU Read                 0.125 B/sec
stress-ng: info:  [2825]                    194,724 Cache BPU Read Miss           20.579 K/sec ( 0.017%)
stress-ng: info:  [2825]             36,669,122,012 CPU Clock                      3.875 B/sec
stress-ng: info:  [2825]             36,668,453,648 Task Clock                     3.875 B/sec
stress-ng: info:  [2825]                      1,064 Page Faults Total            112.447 /sec 
stress-ng: info:  [2825]                      1,064 Page Faults Minor            112.447 /sec 
stress-ng: info:  [2825]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [2825]                        360 Context Switches              38.046 /sec 
stress-ng: info:  [2825]                        360 Cgroup Switches               38.046 /sec 
stress-ng: info:  [2825]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [2825]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [2825]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [2825] successful run completed in 9.46s

4GB (2.8 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [2897] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [2897] dispatching hogs: 4 numa
stress-ng: info:  [2898] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2897] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2897]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2897] numa               1000      8.43     33.43      0.06       118.66          29.86        99.35          7024
stress-ng: info:  [2897] numa:
stress-ng: info:  [2897]             94,804,086,272 CPU Cycles                    11.088 B/sec
stress-ng: info:  [2897]              4,743,732,748 Instructions                   0.555 B/sec (0.050 instr. per cycle)
stress-ng: info:  [2897]                    171,072 Branch Misses                 20.008 K/sec ( 0.000%)
stress-ng: info:  [2897]                141,005,824 Stalled Cycles Frontend       16.491 M/sec
stress-ng: info:  [2897]             90,175,677,856 Stalled Cycles Backend        10.546 B/sec
stress-ng: info:  [2897]             94,811,739,556 Bus Cycles                    11.089 B/sec
stress-ng: info:  [2897]              4,351,381,748 Cache References               0.509 B/sec
stress-ng: info:  [2897]                  2,356,576 Cache Misses                   0.276 M/sec ( 0.054%)
stress-ng: info:  [2897]              4,363,258,428 Cache L1D Read                 0.510 B/sec
stress-ng: info:  [2897]                  2,354,384 Cache L1D Read Miss            0.275 M/sec ( 0.054%)
stress-ng: info:  [2897]              1,205,560,624 Cache L1I Read                 0.141 B/sec
stress-ng: info:  [2897]                    579,892 Cache L1I Read Miss           67.821 K/sec
stress-ng: info:  [2897]                  2,040,828 Cache LL Read                  0.239 M/sec
stress-ng: info:  [2897]                  1,692,944 Cache LL Read Miss             0.198 M/sec (82.954%)
stress-ng: info:  [2897]              4,372,534,272 Cache DTLB Read                0.511 B/sec
stress-ng: info:  [2897]                  5,064,992 Cache DTLB Read Miss           0.592 M/sec ( 0.116%)
stress-ng: info:  [2897]              1,175,805,832 Cache BPU Read                 0.138 B/sec
stress-ng: info:  [2897]                    140,216 Cache BPU Read Miss           16.399 K/sec ( 0.012%)
stress-ng: info:  [2897]             33,905,985,500 CPU Clock                      3.965 B/sec
stress-ng: info:  [2897]             33,905,692,244 Task Clock                     3.965 B/sec
stress-ng: info:  [2897]                      1,064 Page Faults Total            124.440 /sec 
stress-ng: info:  [2897]                      1,064 Page Faults Minor            124.440 /sec 
stress-ng: info:  [2897]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [2897]                        224 Context Switches              26.198 /sec 
stress-ng: info:  [2897]                        204 Cgroup Switches               23.859 /sec 
stress-ng: info:  [2897]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [2897]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [2897]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [2897] successful run completed in 8.55s

8GB (2.8 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [2065] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [2065] dispatching hogs: 4 numa
stress-ng: info:  [2066] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2065] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2065]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2065] numa               1000     19.53     77.52      0.19        51.19          12.87        99.45          7024
stress-ng: info:  [2065] numa:
stress-ng: info:  [2065]            213,041,661,036 CPU Cycles                    10.802 B/sec
stress-ng: info:  [2065]              4,850,988,464 Instructions                   0.246 B/sec (0.023 instr. per cycle)
stress-ng: info:  [2065]                    252,376 Branch Misses                 12.796 K/sec ( 0.000%)
stress-ng: info:  [2065]                635,563,820 Stalled Cycles Frontend       32.225 M/sec
stress-ng: info:  [2065]            207,857,257,028 Stalled Cycles Backend        10.539 B/sec
stress-ng: info:  [2065]            213,076,738,404 Bus Cycles                    10.804 B/sec
stress-ng: info:  [2065]              4,413,222,720 Cache References               0.224 B/sec
stress-ng: info:  [2065]                 67,889,660 Cache Misses                   3.442 M/sec ( 1.538%)
stress-ng: info:  [2065]              4,424,491,460 Cache L1D Read                 0.224 B/sec
stress-ng: info:  [2065]                 67,486,816 Cache L1D Read Miss            3.422 M/sec ( 1.525%)
stress-ng: info:  [2065]              1,239,622,236 Cache L1I Read                62.852 M/sec
stress-ng: info:  [2065]                  1,194,376 Cache L1I Read Miss           60.558 K/sec
stress-ng: info:  [2065]                 73,587,652 Cache LL Read                  3.731 M/sec
stress-ng: info:  [2065]                 67,162,612 Cache LL Read Miss             3.405 M/sec (91.269%)
stress-ng: info:  [2065]              4,374,670,156 Cache DTLB Read                0.222 B/sec
stress-ng: info:  [2065]                  5,522,360 Cache DTLB Read Miss           0.280 M/sec ( 0.126%)
stress-ng: info:  [2065]              1,194,694,580 Cache BPU Read                60.574 M/sec
stress-ng: info:  [2065]                    210,600 Cache BPU Read Miss           10.678 K/sec ( 0.018%)
stress-ng: info:  [2065]             76,637,242,788 CPU Clock                      3.886 B/sec
stress-ng: info:  [2065]             76,635,993,260 Task Clock                     3.886 B/sec
stress-ng: info:  [2065]                      1,064 Page Faults Total             53.948 /sec 
stress-ng: info:  [2065]                      1,064 Page Faults Minor             53.948 /sec 
stress-ng: info:  [2065]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [2065]                        556 Context Switches              28.191 /sec 
stress-ng: info:  [2065]                        536 Cgroup Switches               27.177 /sec 
stress-ng: info:  [2065]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [2065]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [2065]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [2065] successful run completed in 19.72s

The 8GB board's 2.8 GHz result sticks out when compared against the 4GB board at the same frequency, due to:

  • Instructions per cycle drops to less than half: 0.050 vs 0.023 instr. per cycle
  • Stalled Cycles Frontend (per second) roughly doubles: 16.491 M/sec vs 32.225 M/sec
  • Cache L1D Read Miss increases from 0.275 M/sec to 3.422 M/sec
  • Cache LL Read & Cache LL Read Miss go from 0.239 M/sec & 0.198 M/sec to 3.731 M/sec & 3.405 M/sec

Finally, I should mention that RAM bandwidth and latency tests do not show any issues.

System
raspinfo output: https://gist.github.com/Brunnis/4d8242cf757f28e1d5331b3f73b3a446

@pelwell
Contributor

pelwell commented Dec 13, 2023

Have you tried with the other 64-bit kernel (kernel=kernel8.img)? It is identical except for the use of 4KB pages where kernel_2712.img uses 16KB pages.

@Brunnis
Author

Brunnis commented Dec 13, 2023

Yes, I've run both the Geekbench 5 and stress-ng tests with kernel8.img and get the very same low results @ 2.8 GHz on the 8GB board, unfortunately.

I should perhaps also mention that all tests are run on the very same SD card/installation. I'm just moving it between the boards. So there are no setup differences that could explain the discrepancies between the boards.

@Brunnis
Author

Brunnis commented Dec 14, 2023

One pretty interesting comparison is the 4GB @ 2.4 GHz vs 8GB @ 2.8 GHz in Geekbench 5. See the full comparison here:

https://browser.geekbench.com/v5/cpu/compare/22028225?baseline=22028307

It is interesting to note that the single-core results mostly go in the 8GB board's favor (which they are of course expected to, given the higher clock frequency). One notable regression is the AES-XTS test, which is down 5% on the 8GB board.

Turning over to the multi-core results, things get a bit more interesting. The overall score of the 8GB board is now 5% lower than the 4GB board, despite being overclocked by almost 17%. Big performance differences (-12 to -35 %) can be seen in 7 out of the 21 sub-tests.

@popcornmix
Contributor

Can you rule out throttling as a factor?
After running one of the "bad" tests (e.g. the overclocked 8GB Pi 5), check that the output of vcgencmd get_throttled is 0x0.

@Brunnis
Author

Brunnis commented Dec 14, 2023

Yes, no throttling detected on either board (checked via vcgencmd). The 8GB board uses the active cooler and the 4GB board is in the official fan case.

I've used force_turbo=1 for all of my benchmarks, to ensure consistency. I've also tried manually setting the overvoltage delta on the 8GB board to give an additional 0.05V on top of the DVFS curve, but it had no effect on the results. Which I expected, but hey, might as well try it all. 😁

I'd recommend that you run the simple stress-ng test on an 8GB board at 2.4 and 2.8 GHz and see if you can reproduce the issue. The test takes only 10 to 20 seconds to run.

@popcornmix
Contributor

Some difference in performance between sdram devices is expected. A read on this may be useful.

This is important:

This means that if a page in a particular bank is already open, you cannot open another page within the same bank without first closing the currently open page. Once a page is open, it can be accessed multiple times without the need to reopen it for subsequent operations.

so physical memory is split into banks, and banks into pages. I think Pi5 has 8 banks, so you can access 8 different pages cheaply if they are all in different banks. As soon as you access a different page in a bank, you have to close the old one and open the new one, which is expensive. This is a form of thrashing (as you may see with a cache) and reduces your usable memory bandwidth, depending on access patterns.

You could naively say, each of the 8 gigabytes is a different bank, but that wouldn't work very well at start of day when perhaps only the first GB is in use, so you effectively only have one bank in use.
So instead some lower address bits are used to segment the banks.

You might imagine a benchmark that uses 8 different buffers. If luckily, they all sit in physical memory in separate banks, you could get no page opens/closes when running and maximal memory bandwidth. If unluckily they all sat in the same bank, performance would be much worse.

I believe the architecture of the 8GB sdram means different address bits are used in the bank segmentation. In theory this is not inherently worse (with my buffers arranged to be in physical memory in a way that maximises bank usage, I'm sure it would be possible to get better results from 8GB than 4GB). In practice it may be that the 4GB layout is typically better, although you tend to find the longer the kernel has been running, the more scattered the virtual->physical mapping gets, and the less likely this is to have a measurable difference.

This is just speculation at what might be one of the effects at play here. It may not be the only effect.

The overclocking result is more surprising. I'll see if I can reproduce.

There may be ways in which multiple cores thrash against each other more destructively when running faster relative to the fixed sdram speed. We can get stats out of the sdram controller that say how busy it is (ratio of active cycles to total cycles), and how efficient it is (ratio of cycles where data is actually read/written, compared to other stuff, like opening/closing pages).

Or it may be something simpler. As well as the arm clock, there is a dsu clock which controls the shared L3 cache the arms use, and is clocked at about 90% of arm's speed. Need to confirm that is running as expected.

@Brunnis
Author

Brunnis commented Dec 14, 2023

Thanks for the detailed response. I agree that part of the issue may be different internal RAM arrangements causing certain performance differences. As you say, though, the adverse effect of overclocking is a bit harder to explain in an intuitive way. Looking forward to your results. As it stands, I'm not sure what else I could do to help in a meaningful way, but let me know if there's anything specific you want me to check.

@popcornmix
Contributor

popcornmix commented Dec 14, 2023

I get (6.1 kernel, no display, stress-ng bogo ops/s real):
8GB part:
arm=2.4GHz dsu=1.8GHz 110
arm=2.8GHz dsu=2.1GHz 49

4GB part
arm=2.4GHz dsu=1.8GHz 115
arm=2.8GHz dsu=2.1GHz 115

So I'm seeing the surprising 8GB + overclock result you see.

@semool

semool commented Dec 14, 2023

I can confirm this:

Kernel 6.6.5-v8+ aarch64
eeprom: 2023/12/06

8GB (active cooled):
2.8GHz: 52.25
2.4GHz: 105.91

4GB (active cooled):
2.8GHz: 109.88
2.4GHz: 110.00

@popcornmix
Contributor

It seems the stress-ng test we are running is basically a multi-threaded memset.

I've created a simpler test that runs memset from a variable number of cores and that shows the same behaviour:

cores      2.4GHz     2.8GHz
1         12.4GB/s   14.7GB/s
2          8.9GB/s    8.4GB/s
3          7.7GB/s    5.8GB/s
4          6.9GB/s    4.1GB/s

It's not unexpected that having more cores thrashing memory reduces overall bandwidth (other cores are interfering by closing pages the original core was using).

In the single core case, the overclock does provide a benefit, but it appears overclocking may be giving the additional cores the ability to interfere more often.

@Brunnis
Author

Brunnis commented Dec 15, 2023

Interesting. So what are the results of the 4GB board in this test?

@seamusdemora

Hmm... I guess this means that there's not much point in paying extra £££ for the 8GB Pi5?

@pelwell
Contributor

pelwell commented Dec 16, 2023

Swapping to handle a large workload is going to more than negate any small speed benefit of the 4GB part. If you need more than 4GB, you need more than 4GB.

@seamusdemora

What sort of tasks would present a "large workload"? A kernel compilation, or ...? Just curious.

@pelwell
Contributor

pelwell commented Dec 18, 2023

Yes, you're a curious fellow.

@AKK9

AKK9 commented Dec 24, 2023

What sort of tasks would present a "large workload"? A kernel compilation, or ...? Just curious.

A large workload, as in, anything that needs that additional 4GB of RAM.

It could be any number of things. And bear in mind that not all use cases are necessarily one single task, like compiling a kernel.

@aravhawk

Does this mean I should refrain from overclocking my Pi 5 8GB? Or is anyone gonna fix this? Or does it mean that I should ditch my Pi 5 8GB and get a Pi 5 4GB instead?

@popcornmix
Contributor

Does this mean I should refrain from overclocking my Pi 5 8GB? Or is anyone gonna fix this? Or does it mean that I should ditch my Pi 5 8GB and get a Pi 5 4GB instead?

If all you care about is the performance of a multicore memset benchmark then yes.
I would imagine for any real world use case overclocking will run faster.
And for any use case that uses significant RAM, the 8GB Pi will be faster than 4GB.

@aravhawk

So overclocking the Raspberry Pi 5 8GB will not reduce its real-world performance, or should I refrain from overclocking? I want the best real-world performance.

@popcornmix
Contributor

It's up to you. Pi5 is generally plenty fast enough that there is little need for overclocking.
But, real world performance will go up with overclocking.

@aravhawk

So then what about these benchmark results?

@Brunnis
Author

Brunnis commented Jan 9, 2024

@popcornmix Did you run your test program yet on the 4GB board to see how it compares?

Is anyone going to look further into this from the Raspberry Pi side? I of course understand it's not going to have the highest priority. While there are some theories in this comment thread as to what might be going on, there are no real conclusions as I see it. I will admit that as a hardware guy (although not with extensive DDR RAM knowledge), I do find the behavior interesting. :-)

One thing I can add that I wasn't clear about in my initial post: sequential RAM bandwidth and latency do not seem to degrade with increased CPU frequency on either the 4GB or 8GB variant (both actually improve slightly as CPU frequency rises). However, with both boards running at the same frequency, the 4GB board enjoys an advantage of around 3% for read bandwidth/latency and ~4.5% for write bandwidth/latency. This corresponds pretty well with several benchmarks/apps (such as DosBox and the Google Octane 2.0 browser benchmark) running ~3% faster on the 4GB board at default frequencies.

I can't tell if this bandwidth/latency discrepancy is related to the larger performance discrepancy for certain workloads/access patterns at higher frequencies.

@Darkflib

Darkflib commented Jan 9, 2024

One thought springs to mind. There is a config.txt option to limit the total memory.

total_mem=4096

This should limit the mem to 4G on both boards. Does this change the result?

I don't have any 8G versions otherwise I would test it myself.

@Brunnis
Author

Brunnis commented Jan 9, 2024

@Darkflib I actually did test that a while ago, but it did not change the results. Well, at least not in terms of the major performance regression at higher frequencies. I don't think I measured pure bandwidth and latency at default frequencies to see if it closed the 3-5% gap that exists between the boards.

@popcornmix
Contributor

We have been investigating this.

The 4GB and 8GB sdram devices have numerous differences in timings, based on the spec.
The most significant is T_RFC (all-bank auto-refresh time), which is higher on the 8GB part.

The 4GB setting is:

sudo busybox devmem 0x107c001004 32 0x0821095a

The 8GB setting is:

sudo busybox devmem 0x107c001004 32 0x08210caa

If you are willing to run unsafely you can switch a single board between the two modes, and typically you find the benchmark behaviour changes in the same way as swapping between a 4GB and 8GB board.

(Running a 4GB board with the 8GB timings is likely safe but sub-optimal. Running an 8GB board with the 4GB timings is not safe, but in my testing has been reliable enough to run benchmarks. YMMV.)

When doing a multicore memset type test it seems possible to get into a state where sdram thrashes (see earlier link about pages and banks) which lowers sdram efficiency. Ideally for perfect efficiency when you open a page, you will access a page size's worth of data (I believe 8KB). But in the thrashing case the statistics suggest we are only writing about 130 bytes per page open before another core gets in and causes that page to close.

This thrashing behaviour seems to occur as you approach 0% idle cycles in the sdram controller. That's pretty hard to achieve typically, but the following effects get you closer:

  • multiple cores writing to sdram as fast as possible (with no actual cpu work)
  • higher arm clock speed (typically needs overclock)
  • higher overhead due to T_RFC timing (which typically costs a few %)

I can replace the arm overclock with an sdram underclock and observe the same behaviour.

We're still trying to understand the exact behaviour, but so far I don't think the dramatic change in benchmark performance will occur in any real world workload (once the arm cores do some actual processing, rather than just writing to memory, you get idle cycles in sdram controller and the thrashing behaviour does not occur).

@Brunnis
Author

Brunnis commented Jan 9, 2024

Thank you for the elaborate response. I did a quick test of the two suggested timing settings on my 8GB board, using Geekbench 5. These tests were run on a fully updated installation of Ubuntu 23.10, as that's the only "throwaway" installation I had up and running. The two tests below were both run at 2.8 GHz, with only the tRFC setting changed between them:

https://browser.geekbench.com/v5/cpu/compare/22120461?baseline=22120453

As can be seen, the multi-core composite score increases by 7% and the "Text Rendering" sub-test increases by 59%. As you say, the scores with the tighter timing seem to fall in line with the 4GB board.

I will experiment with this a bit more on a Raspberry Pi OS installation once I have the time. Nice to know some more details about what causes these effects, even if the behavior is all within spec and not risk-free to mitigate.

@aravhawk

aravhawk commented Jan 9, 2024

I bet these issues should be resolved soon enough, as (hopefully) the devs at Raspberry Pi will see this. Seems to be a big flaw, however. It might be a firmware issue, but I suspect it might also have a lot to do with the hardware, so possibly a new revision (i.e., “Raspberry Pi 5 Model B Revision 2.0”) will solve this.

@JamesH65
Contributor

I bet these issues should be resolved soon enough, as (hopefully) the devs at Raspberry Pi will see this. Seems to be a big flaw, however. It might be a firmware issue, but I suspect that it might also have to do a lot with the hardware, so possibly a new revision (i.e., “Raspberry Pi 5 Model B Revision 2.0”) will solve this.

Two Pi engineers have actually posted in this thread (three now!), and have explained what is causing the results.

@pelwell
Contributor

pelwell commented Jan 10, 2024

Two Pi engineers ... have explained what is causing the results.

Actually we haven't, because it is still actively being investigated, and the mechanism behind the slowdown under benchmark conditions is not yet understood. However, we are really keen to find the explanation.

@JamesH65
Contributor

Soz!

@popcornmix
Contributor

We have a good understanding of what is going on here.

The arm cluster incorporates a streaming cache.

This means that if you read consecutive memory locations, the cache will start to prefetch predicted locations that may be helpful later. However, speculative prefetch is only done when the bus is idle.
So as you approach zero idle cycles in the sdram controller, prefetching stops and you see higher cache miss rates.

The cache also supports write streaming. If the cluster detects you writing enough consecutive locations, it decides you are in a memset/memcpy mode, stops allocating in the cache, and just writes through to memory. We suspect that this behaviour may also be disabled when the bus is busy ("I may as well store it in the cache, as otherwise I'll stall").

These are the causes of the tailing off in performance as the arms are overclocked relative to the sdram (so idle cycles in the sdram controller tend towards zero). As mentioned earlier, a larger SDRAM device (8GB) requires more overhead for refresh, so the threshold for this behaviour is reached a little sooner.
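
As a rough illustration of why the larger device pays more for refresh: overhead is approximately tRFC divided by the refresh interval tREFI, and tRFC grows with die density. The figures below are ballpark JEDEC LPDDR4 values (assumptions for illustration, not taken from the Pi's datasheet or firmware).

```python
# Back-of-envelope refresh overhead, using ballpark JEDEC LPDDR4 figures
# (assumptions, not Pi-specific measurements):
#   tREFI  - average interval between all-bank refresh commands (1x rate)
#   tRFCab - time the device is busy servicing one all-bank refresh;
#            it grows with die density, so the larger part pays more.

T_REFI_NS = 3904.0  # ~3.904 us at the 1x refresh rate (assumed)

def refresh_overhead(t_rfc_ns: float, rate: float = 1.0) -> float:
    """Fraction of sdram time lost to refresh at a given refresh-rate
    multiplier (0.5 means refreshing half as often, halving the cost)."""
    return t_rfc_ns / (T_REFI_NS / rate)

small_die = refresh_overhead(280.0)  # e.g. an 8 Gb die: ~7% overhead
large_die = refresh_overhead(380.0)  # e.g. a 16 Gb die: ~10% overhead
print(f"{small_die:.1%} vs {large_die:.1%}")
print(f"large die at 0.5x refresh: {refresh_overhead(380.0, 0.5):.1%}")
```

Under these assumed numbers, dropping to 0.5x refresh roughly halves the overhead, which is the mechanism the temperature-based adjustment described below exploits.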

While investigating this, I discovered that we can do a little better. The sdram refresh interval currently uses the default data sheet settings. You can actually monitor the temperature of the sdram, and it reports whether refreshing at half or a quarter of the rate is acceptable. That allows the overhead due to refresh to be cut to a half or a quarter, which does improve benchmark results.

I have a test bootloader (for Pi5) which implements this. If anyone is brave they can test this (you can use rpi-imager to create an sdcard from this zip file).

You can monitor the sdram temperature value with:

vcgencmd readmr 4

This is the meaning of the bottom 3 bits:

000b: SDRAM low temperature operating limit exceeded
001b: 4x refresh
010b: 2x refresh
011b: 1x refresh (default)
100b: 0.5x refresh
101b: 0.25x refresh, no derating
110b: 0.25x refresh, with derating
111b: SDRAM high temperature operating limit exceeded

In testing, it typically reports 1 (for a cool pi) or 2 (for a warm pi).
I've only been able to observe a 3 reported with extreme effort (stress test, no fan, no ventilation running for a long time).
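
The bit table above can be turned into a small decoder for the value reported by `vcgencmd readmr 4` (a hypothetical helper for convenience, not part of vcgencmd):

```python
# Hypothetical helper: decode the bottom 3 bits of the MR4 value reported
# by `vcgencmd readmr 4`, per the table in the comment above.

MR4_REFRESH = {
    0b000: "SDRAM low temperature operating limit exceeded",
    0b001: "4x refresh",
    0b010: "2x refresh",
    0b011: "1x refresh (default)",
    0b100: "0.5x refresh",
    0b101: "0.25x refresh, no derating",
    0b110: "0.25x refresh, with derating",
    0b111: "SDRAM high temperature operating limit exceeded",
}

def decode_mr4(value: int) -> str:
    """Return the refresh-rate meaning of the bottom 3 bits of MR4."""
    return MR4_REFRESH[value & 0b111]

print(decode_mr4(1))  # typical for a cool Pi
print(decode_mr4(2))  # typical for a warm Pi
```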

@Brunnis
Author

Brunnis commented Feb 3, 2024

Nice troubleshooting! Quite interesting how the different parts interact.

I might give that bootloader a test later. In the meantime, is there any way of testing the performance improvement of 4x and 2x refresh by issuing devmem commands?

Are you considering implementing this dynamic refresh interval in the default firmware?

@popcornmix
Contributor

Are you considering implementing this dynamic refresh interval in the default firmware?

Yes, this will be included in the default firmware following successful testing.

@Brunnis
Author

Brunnis commented Feb 3, 2024

Okay, good. I plan on testing the bootloader, but it will have to wait a bit as I won’t have access to my Pi 5 in the coming week.

@Brunnis
Author

Brunnis commented Feb 3, 2024

I'm leaving home for a week tomorrow, but was curious enough to stay awake a bit longer tonight to run some initial tests with the beta FW. I checked with "vcgencmd readmr 4" and it starts at 1, then quickly ends up at 2 and stays there, as you said.

I also ran a few benchmarks (during which the reported refresh value was 2) at the stock 2.4 GHz. With the beta FW, and for the benchmarks where there was a meaningful/measurable difference in the first place, the performance of the 8GB board is generally in between the 4GB and the stock 8GB results. The results with the beta FW are ever so slightly better than when using the busybox/devmem command posted earlier.

Notable in comparing the 4GB board to the 8GB with beta FW:

  • Compiling dosbox-staging is quite a bit faster on the 4GB board: 264 vs 282 seconds
  • Still some pretty big differences for some sub-tests of Geekbench 5 and 6, especially the multi-core tests. The huge anomaly with the Geekbench 5 multi-core "Text Rendering" test is fixed, though.
  • Passmark v11.0 has some pretty big differences in favor of the 4GB board. The CPU tests "Prime Numbers", "Sorting" and "Physics" are 20-25% faster on the 4GB. The memory test "Database Operations" is 11% faster on the 4GB.
  • Other than the above, tests like Google Octane 2.0, dosbox-staging, memory bandwidth/latency (via sysbench) and Quake 2 and vkQuake 3 seem to perform the same between the 4GB and the 8GB with beta FW.

@seamusdemora

@Brunnis

Interesting... very interesting. As a buyer/user, the choice of "which Pi5?" has become a far simpler one.

timg236 added a commit to timg236/rpi-eeprom that referenced this issue Feb 8, 2024
…(latest)

* Adjust the SDRAM refresh interval based on the temperature. This
  addresses the gap in performance between the 8GB and 4GB variants.
  See raspberrypi/firmware#1854
* Preliminary support for signed boot
@timg236

timg236 commented Feb 8, 2024

Bootloader changes are now available via "rpi-update"

@aravhawk

aravhawk commented Feb 8, 2024

Great! Will test it later today!

@senothechad

I should have searched reviews first, but I just went for the 8GB Pi 5. I never thought memory could play such a big role in overclocking. I can't get mine to run above 2.7 GHz, while most people can push the 4GB variant above 3 GHz. Should I just get the 4GB and use that as my main?

@geerlingguy

@senothechad - I have a few 8 GB Pi 5s; some work at 3.0 GHz, others are more stable at 2.8... and a few can only get to 2.6 before they get flaky. I don't think the memory issue here has much to do with overclockability.

@aravhawk

aravhawk commented Feb 9, 2024

I agree with @geerlingguy. I'm pretty sure that it's more about the silicon lottery than anything when it comes to overclocking.

@timg236

timg236 commented Feb 9, 2024

Indeed, there is no official support for overclocking; the Pi 5 is a 2.4 GHz product, and some boards will not run reliably faster than 2.4 GHz.
Overclocking provides some interesting architectural insights in this issue, but we aren't saying that running a Pi 5 above 2.4 GHz is something we support.

@senothechad

Is this really just a matter of silicon lottery, or does the type of memory (in this case, the 4GB vs 8GB variants) actually play a role?

@timg236

timg236 commented Feb 9, 2024

Is this really just a matter of silicon lottery, or does the type of memory (in this case, the 4GB vs 8GB variants) actually play a role?

Max ARM CPU speed is purely silicon lottery.

Previously, 4GB / 8GB made a difference to the performance of memcpy-like benchmarks at overclocked speeds, but following this PR to increase available memory bandwidth it's less of a difference.
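
The memcpy-like load being described can be approximated even from Python (a rough sketch only; interpreter overhead makes this a lower bound, and tools like sysbench measure more accurately):

```python
# Rough memcpy-like bandwidth probe (a sketch: Python overhead makes this
# a lower bound; sysbench or similar tools measure more accurately). The
# buffer is much larger than the CPU caches, so traffic reaches SDRAM.
import time

def copy_bandwidth_gbs(size_mb: int = 64, repeats: int = 3) -> float:
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)                       # one big sequential read+write
        best = min(best, time.perf_counter() - t0)
        del dst
    return 2 * size_mb / 1024 / best           # count read + write traffic

print(f"~{copy_bandwidth_gbs():.1f} GB/s")
```

Running this before and after a firmware change gives a crude way to see whether streaming bandwidth moved, without installing a benchmark suite.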

@popcornmix
Contributor

Is this really just a matter of silicon lottery, or does the type of memory (in this case, the 4GB vs 8GB variants) actually play a role?

Note the chip that you overclock (with arm_freq= etc) is the BCM2712.
The SDRAM is a physically separate chip.

You could theoretically remove a 4GB SDRAM chip and replace with an 8GB SDRAM chip and the BCM2712 will still overclock in exactly the same way. There is zero correlation between how far you can overclock and whether you have a 4GB or 8GB SDRAM.

@senothechad

Got it. Though is there any way I can bypass voltage limitations, maybe even with separate hardware? 1.0V is the maximum I can set but this chip could technically handle a little bit more voltage. Just want to try everything for that extra juice.

@timg236

timg236 commented Feb 9, 2024

Got it. Though is there any way I can bypass voltage limitations, maybe even with separate hardware? 1.0V is the maximum I can set but this chip could technically handle a little bit more voltage. Just want to try everything for that extra juice.

No

@senothechad

What if I didn't care about the risks of destroying it?

timg236 added a commit to raspberrypi/rpi-eeprom that referenced this issue Feb 14, 2024
* Adjust the SDRAM refresh interval based on the temperature. This
  addresses the gap in performance between the 8GB and 4GB variants.
  See raspberrypi/firmware#1854
* Preliminary support for signed boot.
popcornmix added a commit that referenced this issue Feb 29, 2024
…river

See: #1854

firmware: arm_loader: mailbox: Optionally return extended board rev
See: #1831

firmware: arm_loader: Set dma-channel-mask as well as brcm,dma-channel-mask

firmware: board_info: Add Compute Module 5 model info string
@popcornmix
Contributor

The adjustment to reduce sdram refresh where possible (which gives a performance benefit), originally done for Pi 5, has now been applied to Pi 4. On Pi 4 this is implemented in the start4.elf firmware rather than in the bootloader.

There should be no obvious change in behaviour, except that sdram-bandwidth-limited tasks may be a few percent faster on Pi 4.

timg236 added a commit to timg236/rpi-eeprom that referenced this issue Apr 18, 2024
Interesting changes since the last automatic update:
* Enable network install
* Enable over-clocking frequencies > 3GHz
  See: https://github.com/raspberrypi/firmware/issues/1876
* Adjust SDRAM refresh rate according to temperature and address a performance
  gap between 4GB and 8GB parts in benchmarks.
  See: raspberrypi/firmware#1854
* Support custom CA certs with HTTPS boot
* Move non Kernel ARM stages back to 512KB
  raspberrypi/firmware#1868
* Assorted HAT+ and NVMe interop improvements.
* Fix TRYBOOT if secure-boot is enabled.
* Preliminary support for D0 and CM5.
timg236 added a commit to raspberrypi/rpi-eeprom that referenced this issue Apr 18, 2024
Interesting changes since the last automatic update:
* Enable network install
* Enable over-clocking frequencies > 3GHz
  See: https://github.com/raspberrypi/firmware/issues/1876
* Adjust SDRAM refresh rate according to temperature and address a performance
  gap between 4GB and 8GB parts in benchmarks.
  See: raspberrypi/firmware#1854
* Support custom CA certs with HTTPS boot
* Move non Kernel ARM stages back to 512KB
  raspberrypi/firmware#1868
* Assorted HAT+ and NVMe interop improvements.
* Fix TRYBOOT if secure-boot is enabled.
* Preliminary support for D0 and CM5.