Incorrect arch detection #15151

Closed
quellyn opened this issue Feb 21, 2020 · 7 comments

@quellyn
Contributor

quellyn commented Feb 21, 2020

Hi guys,

On our local Frankencluster, I've noticed an odd inconsistency with Spack's arch detection. This cluster is composed of many flavors of x86_64, Power 9, and ARM nodes, all running CentOS Linux release 7.7.1908 (Core). My particular issue is with our x86_64 nodes.

Example 1: On node cn123, with a fresh Spack instance:

[quellyn@cn123 cn123]$ git clone https://github.com/spack/spack.git
Cloning into 'spack'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 197245 (delta 6), reused 5 (delta 5), pack-reused 197236
Receiving objects: 100% (197245/197245), 72.96 MiB | 31.81 MiB/s, done.
Resolving deltas: 100% (87768/87768), done.
Checking out files: 100% (5637/5637), done.
[quellyn@cn123 cn123]$ source ./spack/share/spack/setup-env.sh 
[quellyn@cn123 cn123]$ echo $SPACK_ROOT
/home/quellyn/Scratch/cn123/spack
[quellyn@cn123 cn123]$ spack arch
linux-centos7-haswell

The node itself agrees with this assessment:

[quellyn@cn123 cn123]$ cat /sys/devices/cpu/caps/pmu_name 
haswell

[quellyn@cn123 cn123]$ grep Intel /proc/cpuinfo | sort -u
model name	: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
vendor_id	: GenuineIntel

Example 2: On node cn141, with a fresh Spack instance:

[quellyn@cn141 cn141]$ git clone https://github.com/spack/spack.git
Cloning into 'spack'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 197245 (delta 6), reused 5 (delta 5), pack-reused 197236
Receiving objects: 100% (197245/197245), 72.96 MiB | 35.81 MiB/s, done.
Resolving deltas: 100% (87768/87768), done.
Checking out files: 100% (5637/5637), done.
[quellyn@cn141 cn141]$ source ./spack/share/spack/setup-env.sh 
[quellyn@cn141 cn141]$ echo $SPACK_ROOT
/home/quellyn/Scratch/cn141/spack
[quellyn@cn141 cn141]$ spack arch
linux-centos7-nehalem

But this node disagrees with Spack; it also reports itself as a Haswell:

[quellyn@cn141 cn141]$ cat /sys/devices/cpu/caps/pmu_name
haswell
[quellyn@cn141 cn141]$                 
[quellyn@cn141 cn141]$ grep Intel /proc/cpuinfo | sort -u
model name	: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
vendor_id	: GenuineIntel

I'm afraid I don't understand the magic of Spack's arch detection well enough to even start looking for a root cause. If you could give me a hint as to where to start that would be great.

Thanks!
Quellyn

P.S. This is my first time opening an issue; please let me know if I've left something out.

@quellyn quellyn added the bug label Feb 21, 2020
@alalazo
Member

alalazo commented Feb 21, 2020

Can you post the "flags" of both nodes in /proc/cpuinfo and see if they differ in any way?

@alalazo alalazo self-assigned this Feb 21, 2020
@quellyn
Contributor Author

quellyn commented Feb 21, 2020

I lost my allocation on my "correct" node (cn123), but I was able to grab another node (cn126) with the same processor (which Spack identifies correctly as a Haswell).

On cn126 (correct arch detection):

[quellyn@cn126 cn126]$ cat /proc/cpuinfo | grep flags | sort -u
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

On cn141 (incorrect arch detection):

[quellyn@cn141 ~]$ cat /proc/cpuinfo | grep flags | sort -u
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt xsave avx f16c rdrand lahf_lm abm epb invpcid_single intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

Looks like cn126 has the flag "aes", while cn141 does not.
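Spotting a single missing flag by eye in lines this long is error-prone. A minimal sketch of doing the comparison with a set difference follows; the helper names `parse_flags` and `diff_flags` are hypothetical, and the two sample lines are shortened stand-ins for the real `flags` output above.

```python
def parse_flags(cpuinfo_line):
    """Return the set of feature names from a 'flags : ...' line of /proc/cpuinfo."""
    # Everything after the first ':' is a whitespace-separated list of flags.
    _, _, names = cpuinfo_line.partition(":")
    return set(names.split())


def diff_flags(line_a, line_b):
    """Return (flags only in A, flags only in B), each as a sorted list."""
    flags_a, flags_b = parse_flags(line_a), parse_flags(line_b)
    return sorted(flags_a - flags_b), sorted(flags_b - flags_a)


# Shortened sample lines (assumption: the real lines are much longer).
cn126 = "flags\t\t: fpu sse4_2 x2apic movbe popcnt aes xsave avx"
cn141 = "flags\t\t: fpu sse4_2 x2apic movbe popcnt xsave avx"

only_cn126, only_cn141 = diff_flags(cn126, cn141)
print("only on cn126:", only_cn126)  # ['aes']
print("only on cn141:", only_cn141)  # []
```

On the real nodes, the two input lines could be taken from `grep flags /proc/cpuinfo | sort -u` on each machine.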

@alalazo
Member

alalazo commented Feb 21, 2020

Our detection mainly targets binary compatibility, so we started from the instruction sets listed in the GCC manual. For haswell, the manual mentions AES:

haswell
               Intel Haswell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT,
               AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2 and F16C instruction set support.

The same holds for ivybridge, sandybridge and westmere. Going down the hierarchy, nehalem is the first architecture that does not require AES, so that is what Spack reports for a node without the aes flag. I would say that detection is working correctly, as it ensures that the generated binaries will run on the node.

Now, it would be interesting to understand why there's no aes on that node... Do they all have the same kernel and OS?
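To illustrate the reasoning above, here is a toy model of feature-based target selection. It is a sketch under stated assumptions, not Spack's actual implementation: the `TARGETS` table and the `best_target` function are hypothetical, and the feature lists are a simplified reading of the GCC manual entries mentioned above, written with the names used in /proc/cpuinfo (for example `pni` for SSE3).

```python
# Hypothetical, simplified table: ordered oldest to newest, each entry lists
# only the features that target ADDS on top of the previous one.
TARGETS = [
    ("nehalem", {"mmx", "sse", "sse2", "pni", "ssse3", "sse4_1", "sse4_2", "popcnt"}),
    ("westmere", {"aes", "pclmulqdq"}),
    ("sandybridge", {"avx"}),
    ("ivybridge", {"fsgsbase", "rdrand", "f16c"}),
    ("haswell", {"movbe", "avx2", "fma", "bmi1", "bmi2"}),
]


def best_target(host_flags):
    """Return the newest target whose cumulative features are all on the host."""
    required = set()
    best = None
    for name, added in TARGETS:
        required |= added
        if not required <= host_flags:
            # A required feature is missing, so every newer target fails too.
            break
        best = name
    return best


# A host exposing every feature in the table, and the same host without 'aes'.
full_host = set().union(*(added for _, added in TARGETS))
print(best_target(full_host))            # haswell
print(best_target(full_host - {"aes"}))  # nehalem
```

In this model a single missing flag such as `aes` is enough to fall all the way back to nehalem, which matches what cn141 reports.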

@quellyn
Contributor Author

quellyn commented Feb 21, 2020 via email

@quellyn
Contributor Author

quellyn commented Feb 21, 2020

After talking to Massimiliano I'm pretty convinced this is NOT a Spack problem. I've put in a ticket with our cluster admins to see if there are BIOS differences between the Haswell nodes.

Thanks so much for the help in troubleshooting!
Q

@quellyn quellyn closed this as completed Feb 21, 2020
@tgamblin
Member

@boegel: FYI -- another reason not to use pmu_name for binary compatibility!

@boegel
Contributor

boegel commented Feb 27, 2020

@tgamblin I just brought up pmu_name because it looked interesting at first sight, but it's clear it's not correct enough for our purpose...
