Test manually profiling STREAM with likwid-perfctr #242

Open
Tracked by #240
tkoskela opened this issue Nov 27, 2023 · 4 comments
@tkoskela
Member

tkoskela commented Nov 27, 2023

Test profiling with likwid-perfctr on a system where we don't have root access (e.g. Kathleen, Young). Use a simple application such as STREAM, and investigate which LIKWID metric groups we can access with user permissions. Build STREAM manually using the Makefile in the GitHub repository. Run in serial as a first step. It's probably fine to run on a login node for debugging, because it only takes a few seconds to run. In the end we need to run on a compute node using the scheduler.

Investigate what metrics are available from LIKWID, e.g.

+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    | 3.522605e-03 |
|    Runtime unhalted [s]    | 1.107221e-04 |
|         Clock [MHz]        | 7.982933e+02 |
|             CPI            | 1.867334e+00 |
|         Branch rate        | 2.191491e-01 |
|  Branch misprediction rate | 1.979745e-02 |
| Branch misprediction ratio | 9.033780e-02 |
|   Instructions per branch  | 4.563103e+00 |
+----------------------------+--------------+
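
A minimal sketch of how the serial run described above could be scripted, so the same invocation can be repeated for several groups. The helper names `perfctr_command` and `run_group` are hypothetical, and the paths in the example are placeholders for wherever LIKWID and STREAM were built; only the `-C` and `-g` options of likwid-perfctr are used.

```python
import shlex
import subprocess
from pathlib import Path


def perfctr_command(likwid_bin, group, executable, cpus="S0:1"):
    """Build the argv list for a pinned, single-group likwid-perfctr run."""
    return [
        str(Path(likwid_bin) / "likwid-perfctr"),
        "-C", cpus,    # pin the application to these hardware threads and measure them
        "-g", group,   # performance group to measure, e.g. BRANCH
        str(executable),
    ]


def run_group(likwid_bin, group, executable, cpus="S0:1"):
    """Run the application under likwid-perfctr and return its combined text output."""
    cmd = perfctr_command(likwid_bin, group, executable, cpus)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths (assumptions); adjust to the local LIKWID and STREAM builds.
    cmd = perfctr_command("../../likwid-5.3.0/bin", "BRANCH", "./stream_c.exe")
    print(shlex.join(cmd))
```

Keeping the command construction separate from the execution makes it easy to paste the same command into a scheduler job script later.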
@tkoskela tkoskela changed the title Test manually profiling some simple application, STREAM for example Test manually profiling a simple application Nov 27, 2023
@tkoskela tkoskela changed the title Test manually profiling a simple application Test manually profiling STREAM with likwid-perfctr Dec 12, 2023
@themkots themkots self-assigned this Dec 13, 2023
@tkoskela tkoskela self-assigned this Dec 13, 2023
@themkots

themkots commented Jan 1, 2024

  • Run of STREAM under likwid-perfctr with group "BRANCH" (branch prediction statistics) on Young's login02:
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login02.ib.young:7 STREAM]$ hostname
login02
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login02.ib.young:7 STREAM]$ pwd
/home/ucaseko/Excalibur/dloads/STREAM
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login02.ib.young:7 STREAM]$ ../../likwid-5.3.0/bin/likwid-perfctr -C S0:1  -g BRANCH ./stream_c.exe
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
CPU type:       Intel Cascadelake SP processor
CPU clock:      2.49 GHz
--------------------------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 8129 microseconds.
   (= 8129 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10042.2     0.016097     0.015933     0.016232
Scale:          13951.9     0.011647     0.011468     0.011744
Add:            15880.2     0.015263     0.015113     0.015490
Triad:          15587.4     0.015524     0.015397     0.015697
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|             Event            | Counter | HWThread 1 |
+------------------------------+---------+------------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 2343230311 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 2288835488 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1550699600 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  367655406 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |       8310 |
+------------------------------+---------+------------+

+----------------------------+--------------+
|           Metric           |  HWThread 1  |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    |       0.6810 |
|    Runtime unhalted [s]    |       0.9177 |
|         Clock [MHz]        |    3681.2784 |
|             CPI            |       0.9768 |
|         Branch rate        |       0.1569 |
|  Branch misprediction rate | 3.546386e-06 |
| Branch misprediction ratio | 2.260269e-05 |
|   Instructions per branch  |       6.3734 |
+----------------------------+--------------+
  • Same run on Young's login01:
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login01.ib.young:0 STREAM]$ hostname
login01
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login01.ib.young:0 STREAM]$ pwd
/home/ucaseko/Excalibur/dloads/STREAM
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login01.ib.young:0 STREAM]$ ../../likwid-5.3.0/bin/likwid-perfctr -C S0:1  -g BRANCH ./stream_c.exe
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
CPU type:       Intel Cascadelake SP processor
CPU clock:      2.49 GHz
--------------------------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7749 microseconds.
   (= 7749 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10362.1     0.015691     0.015441     0.015894
Scale:          14450.7     0.011247     0.011072     0.011325
Add:            16494.9     0.014793     0.014550     0.014897
Triad:          16132.2     0.015059     0.014877     0.015234
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|             Event            | Counter | HWThread 1 |
+------------------------------+---------+------------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 2343225425 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 2222105438 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1513890200 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  367654668 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |       8687 |
+------------------------------+---------+------------+

+----------------------------+--------------+
|           Metric           |  HWThread 1  |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    |       0.7105 |
|    Runtime unhalted [s]    |       0.8912 |
|         Clock [MHz]        |    3659.6906 |
|             CPI            |       0.9483 |
|         Branch rate        |       0.1569 |
|  Branch misprediction rate | 3.707283e-06 |
| Branch misprediction ratio | 2.362815e-05 |
|   Instructions per branch  |       6.3734 |
+----------------------------+--------------+

  • Same run on one of Kathleen's nodes:
(EXCALIBUR) ✔ /lustre/home/ucaseko/Excalibur/dloads/STREAM [master|✔]
[ucaseko@node-c11a-069.kathleen:3 STREAM]$ hostname
node-c11a-069
(EXCALIBUR) ✔ /lustre/home/ucaseko/Excalibur/dloads/STREAM [master|✔]
[ucaseko@node-c11a-069.kathleen:3 STREAM]$ pwd
/lustre/home/ucaseko/Excalibur/dloads/STREAM
(EXCALIBUR) ✔ /lustre/home/ucaseko/Excalibur/dloads/STREAM [master|✔]
[ucaseko@node-c11a-069.kathleen:3 STREAM]$ ../../likwid-5.3.0/bin/likwid-perfctr -C S0:1  -g BRANCH ./stream_c.exe
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
CPU type:       Intel Cascadelake SP processor
CPU clock:      2.49 GHz
--------------------------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 8126 microseconds.
   (= 8126 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10291.3     0.015586     0.015547     0.015614
Scale:          13636.7     0.011751     0.011733     0.011781
Add:            15642.4     0.015392     0.015343     0.015443
Triad:          15381.8     0.015624     0.015603     0.015647
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|             Event            | Counter | HWThread 1 |
+------------------------------+---------+------------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 2343250440 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 2245719147 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1559575000 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  367660744 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |       8244 |
+------------------------------+---------+------------+

+----------------------------+--------------+
|           Metric           |  HWThread 1  |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    |       0.6769 |
|    Runtime unhalted [s]    |       0.9004 |
|         Clock [MHz]        |    3591.3189 |
|             CPI            |       0.9584 |
|         Branch rate        |       0.1569 |
|  Branch misprediction rate | 3.518190e-06 |
| Branch misprediction ratio | 2.242285e-05 |
|   Instructions per branch  |       6.3734 |
+----------------------------+--------------+
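
As a sanity check on the three runs above, the derived BRANCH metrics can be recomputed from the raw event counts. This is a sketch: the formulas are inferred from the numbers in the tables (they reproduce the reported values), not copied from the LIKWID group definition, and the function name `branch_metrics` is made up for illustration.

```python
def branch_metrics(instr, clk_core, br_inst, br_misp):
    """Recompute derived BRANCH-group metrics from raw event counts."""
    return {
        "CPI": clk_core / instr,
        "Branch rate": br_inst / instr,
        "Branch misprediction rate": br_misp / instr,
        "Branch misprediction ratio": br_misp / br_inst,
        "Instructions per branch": instr / br_inst,
    }


# Raw counts copied from the event tables above:
# (INSTR_RETIRED_ANY, CPU_CLK_UNHALTED_CORE,
#  BR_INST_RETIRED_ALL_BRANCHES, BR_MISP_RETIRED_ALL_BRANCHES)
runs = {
    "Young login02": (2343230311, 2288835488, 367655406, 8310),
    "Young login01": (2343225425, 2222105438, 367654668, 8687),
    "Kathleen node-c11a-069": (2343250440, 2245719147, 367660744, 8244),
}

if __name__ == "__main__":
    for name, counters in runs.items():
        print(name)
        for metric, value in branch_metrics(*counters).items():
            print(f"  {metric:<28} {value:.6g}")
```

The instruction and branch counts agree closely across all three hosts, which is what one would expect for the same binary; the cycle counts, and therefore CPI, vary a little more from run to run.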

@themkots

themkots commented Jan 1, 2024

  • ❌ Running STREAM under likwid-perfctr on my laptop fails with an "Unsupported Processor" error:
[ucaseko@littlebeast:18 STREAM]$ pwd
/home/ucaseko/dloads/STREAM
(EXCALIBUR) ✔ ~/dloads/STREAM [master|✔]
[ucaseko@littlebeast:18 STREAM]$ ~/likwid-5.3.0/bin/likwid-perfctr -C S0:1  -g BRANCH ./stream_c.exe
Cannot access directory /home/ucaseko/likwid-5.3.0/share/likwid/perfgroups/unknown
--------------------------------------------------------------------------------
CPU name:       13th Gen Intel(R) Core(TM) i7-1365U
CPU type:       Unknown Intel Processor
CPU clock:      2.69 GHz
ERROR - [/home/ucaseko/dloads/likwid-5.3.0/src/perfmon.c:perfmon_init_maps:1184] Unsupported Processor
ERROR - [/home/ucaseko/dloads/likwid-5.3.0/src/perfmon.c:perfmon_init:2109] No such file or directory.
Failed to initialize event and counter lists for Unknown Intel Processor
  • Excerpt from cat /proc/cpuinfo
processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 186
model name      : 13th Gen Intel(R) Core(TM) i7-1365U
stepping        : 3
microcode       : 0xffffffff
cpu MHz         : 2688.010
cache size      : 12288 KB
physical id     : 0
siblings        : 12
core id         : 5
cpu cores       : 6
apicid          : 11
initial apicid  : 11
fpu             : yes
fpu_exception   : yes
cpuid level     : 28
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
vmx flags       : vnmi invvpid ept_x_only ept_ad ept_1gb tsc_offset vtpr ept vpid unrestricted_guest ept_mode_based_exec tsc_scaling usr_wait_pause
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs retbleed eibrs_pbrsb
bogomips        : 5376.02
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

  • For comparison, similar excerpt from Kathleen's node used above:
processor       : 79
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
stepping        : 7
microcode       : 0x5003003
cpu MHz         : 2500.000
cache size      : 28160 KB
physical id     : 1
siblings        : 40
core id         : 28
cpu cores       : 20
apicid          : 121
initial apicid  : 121
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips        : 5004.96
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

  • Not sure whether the model name is what trips it up. Mine is
    model name : 13th Gen Intel(R) Core(TM) i7-1365U, whereas Kathleen's is
    model name : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
    i.e. mine starts with a number rather than a letter ... 🤷 ... will look into it. Maybe we need to add, or request, new patterns in the LIKWID code for correct CPU identification.
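
To compare the two machines on something other than the model name string, here is a small sketch that pulls the numeric `cpu family` and `model` fields out of the /proc/cpuinfo excerpts above. CPU detection in tools of this kind usually keys on those numbers rather than on the name; whether LIKWID 5.3.0 does so is an assumption here, not something checked against its source. The function name `cpu_identity` is hypothetical.

```python
def cpu_identity(cpuinfo_text):
    """Return (family, model, model_name) for the first CPU in a /proc/cpuinfo dump."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # setdefault keeps the first occurrence, i.e. the first listed processor
            fields.setdefault(key.strip(), value.strip())
    return int(fields["cpu family"]), int(fields["model"]), fields["model name"]


# Lines taken from the two excerpts above
LAPTOP = """\
cpu family      : 6
model           : 186
model name      : 13th Gen Intel(R) Core(TM) i7-1365U
"""

KATHLEEN = """\
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
"""

if __name__ == "__main__":
    for label, text in (("laptop", LAPTOP), ("Kathleen", KATHLEEN)):
        family, model, name = cpu_identity(text)
        print(f"{label}: family {family}, model {model} ({model:#x}), {name}")
```

Both CPUs report family 6, so if the numeric identifiers are what matters, the difference would be model 186 versus model 85 rather than the leading digits in the name.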

@tkoskela
Member Author

The results on Young and Kathleen look promising! Could you check what other performance groups are available on those machines?
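
One possible way to check this, sketched under a few assumptions: the error message earlier in the thread shows that LIKWID looks for group definitions under `share/likwid/perfgroups/<arch>` in its install prefix, so listing that directory gives the groups defined for an architecture. The `.txt` suffix and the example directory name `CLX` are assumptions, and `list_perfgroups` is a made-up helper. Running `likwid-perfctr -a` on the machine should print the equivalent list. Neither approach shows whether a group's counters are actually readable with user permissions, so each group still needs a test run.

```python
import tempfile
from pathlib import Path


def list_perfgroups(likwid_prefix, arch):
    """List group names defined under <prefix>/share/likwid/perfgroups/<arch>."""
    group_dir = Path(likwid_prefix) / "share" / "likwid" / "perfgroups" / arch
    if not group_dir.is_dir():
        return []  # e.g. the "unknown" architecture directory on an unsupported CPU
    # Assumption: each group is one file named <GROUP>.txt
    return sorted(path.stem for path in group_dir.glob("*.txt"))


if __name__ == "__main__":
    # Demonstration on a throwaway directory tree, not a real LIKWID install.
    with tempfile.TemporaryDirectory() as prefix:
        demo_dir = Path(prefix) / "share" / "likwid" / "perfgroups" / "CLX"
        demo_dir.mkdir(parents=True)
        for name in ("BRANCH.txt", "MEM.txt"):
            (demo_dir / name).touch()
        print(list_perfgroups(prefix, "CLX"))
        print(list_perfgroups(prefix, "unknown"))
```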

@tkoskela
Member Author

They have RRZE-HPC/likwid#468 open on support for newer Intel architectures. There does not seem to have been much progress lately, so your laptop CPU might still be unsupported.
