
Knights Landing support #991

Closed
loveshack opened this issue Oct 20, 2016 · 28 comments

@loveshack

cpuid(1) on a Knights Landing system shows:

   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
      model           = 0x7 (7)
      stepping id     = 0x1 (1)
      extended family = 0x0 (0)
      extended model  = 0x5 (5)
      (simple synth)  = Intel Xeon Phi x200 (Knights Landing), 14nm
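For reference only, a tiny self-contained C calculation (values copied from the cpuid output above, not read from the hardware) showing how the extended-model and model fields combine into the conventional display model 0x57 (87); OpenBLAS's cpuid_x86.c and dynamic.c test the extended model (5) and model (7) fields separately, which is what the patch below extends.

    #include <stdio.h>

    /* Illustrative only: values are copied from the cpuid(1) output
       above rather than queried from the CPU.  For family 6 the
       conventional display model is (extended_model << 4) | model.  */
    int main(void)
    {
        unsigned family = 6, model = 0x7, extended_model = 0x5;
        unsigned display_model = (extended_model << 4) | model;  /* 0x57 = 87 */
        printf("family %u, display model 0x%x (%u)\n",
               family, display_model, display_model);
        return 0;
    }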

So this patch makes OpenBLAS detect it as HASWELL:

--- OpenBLAS-0.2.19/cpuid_x86.c~    2016-09-08 17:46:18.122229447 +0100
+++ OpenBLAS-0.2.19/cpuid_x86.c 2016-10-20 16:50:14.003283533 +0100
@@ -1192,6 +1192,7 @@
           else
       return CPUTYPE_NEHALEM;
   case 5:
+   case 7:         // Knights Landing
         case 14:
     // Skylake
           if(support_avx())
@@ -1700,6 +1701,7 @@
          else
      return CORE_NEHALEM;
   case 5:
+   case 7:         // Knights Landing
   case 14:
     // Skylake
          if(support_avx())
--- OpenBLAS-0.2.19/driver/others/dynamic.c~    2016-09-01 04:58:42.000000000 +0100
+++ OpenBLAS-0.2.19/driver/others/dynamic.c 2016-10-20 16:57:02.108394771 +0100
@@ -286,6 +286,15 @@
       return &gotoblas_NEHALEM; //OS doesn't support AVX. Use old kernels.
     }
   }
+   //Intel Knights Landing
+   if (model == 7) {
+     if(support_avx())
+       return &gotoblas_HASWELL;
+     else{
+       openblas_warning(FALLBACK_VERBOSE, NEHALEM_FALLBACK);
+       return &gotoblas_NEHALEM; //OS doesn't support AVX. Use old kernels.
+     }
+   }
    return NULL;
        }
        case 0xf:

Is there any chance of AVX512 support? (Sorry I couldn't fund it.) Unfortunately, with HASWELL on KNL, OpenBLAS dgemm is about three times slower than MKL.

@brada4 (Contributor) commented Oct 22, 2016

Xeon Phi is not a general-purpose CPU; it does not support some ordinary instructions that have been available since the i386, and from your measurement it looks as though many others are emulated in firmware.

By the way, can you point to any reference that MIC is based on or related to Haswell in any way?

@martin-frbg (Collaborator)

As far as I know, the Knights Landing generation of Phi is binary compatible with Haswell - even if some or even most of that is likely to be microcode emulation trickery, Haswell is probably still a better OpenBLAS target for it than the Atom architecture it is somewhat distantly related to.
I suspect AVX512 support will become important once it is available in more mainstream Skylake-EP or Kaby Lake systems expected for next year. Alas, modifying the current Haswell GEMM kernels to make use of AVX512 looks to be a bit more complicated than just changing the instruction names.
@loveshack, while you cannot provide funds, would you be able to provide access to your system if and when a capable developer becomes available?
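To give an idea of why it is more than renaming instructions: below is a minimal intrinsics sketch (purely illustrative, not taken from the OpenBLAS kernels, which are hand-written assembly) of an 8x8 rank-1 update, the innermost step of a dgemm micro-kernel, written for AVX-512F. The wider registers (8 doubles instead of 4) and the doubled register file (32 zmm instead of 16 ymm) mean the register blocking, prefetching, and loop structure of the Haswell kernel would all need retuning, not just a mechanical rename.

    #include <immintrin.h>   /* compile with -mavx512f */

    /* Hypothetical 8x8 rank-1 update using AVX-512F intrinsics.  One
       zmm register holds a whole column of 8 doubles, so an 8x8 block
       of C occupies only 8 of the 32 zmm registers, leaving room for
       larger register blocking than the AVX2 Haswell kernel can use.
       c is an 8x8 block stored column by column.                      */
    static void rank1_update_8x8(const double *a, const double *b,
                                 double c[8][8])
    {
        __m512d acol = _mm512_loadu_pd(a);            /* 8 doubles of A     */
        for (int j = 0; j < 8; j++) {
            __m512d ccol = _mm512_loadu_pd(c[j]);     /* column j of C      */
            __m512d bj   = _mm512_set1_pd(b[j]);      /* broadcast b[j]     */
            ccol = _mm512_fmadd_pd(acol, bj, ccol);   /* C(:,j) += A * b[j] */
            _mm512_storeu_pd(c[j], ccol);
        }
    }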

@brada4 (Contributor) commented Oct 23, 2016

CPUID says no MMX or SSE (on the other hand, Intel writes that you have to check the CPUID bits before emitting special instructions):
https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf#section.B.8
It would be interesting to at least measure GEMM to find whether SSE2, AVX, or AVX2 works best.
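For what it's worth, here is a minimal sketch (not OpenBLAS's actual support_avx() routine) of querying the relevant feature bits from C with the GCC/Clang __builtin_cpu_supports builtin; a reasonably recent compiler is needed for the avx512f name.

    #include <stdio.h>

    /* Minimal feature probe, illustrative only.  This reflects what the
       CPU advertises; a complete check also needs XGETBV to confirm the
       OS saves the AVX/AVX-512 register state, which OpenBLAS's own
       detection code handles separately.                               */
    int main(void)
    {
        printf("sse2    : %d\n", __builtin_cpu_supports("sse2")    != 0);
        printf("avx     : %d\n", __builtin_cpu_supports("avx")     != 0);
        printf("avx2    : %d\n", __builtin_cpu_supports("avx2")    != 0);
        printf("avx512f : %d\n", __builtin_cpu_supports("avx512f") != 0);
        return 0;
    }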

@loveshack (Author)

Martin Kroeker notifications@github.com writes:

> As far as I know, the Knights Landing generation of Phi is binary
> compatible with Haswell - even if some or even most of that is likely
> to be microcode emulation trickery, Haswell is probably still a better
> OpenBLAS target for it than the Atom architecture it is somewhat
> distantly related to.

Indeed. [As far as I know, all Xeons are microcoded anyhow.]

> I suspect AVX512 support will become important once it is available in
> more mainstream Skylake-EP or Kaby Lake systems expected for next
> year. Alas, modifying the current Haswell GEMM kernels to make use of
> AVX512 looks to be a bit more complicated than just changing the
> instruction names.

For what it's worth, ATLAS has Knights Corner (I think) support, but I
haven't managed to build the development version.

> @loveshack, while you cannot provide funds, would you be able to
> provide access to your system if and when a capable developer becomes
> available?

Unfortunately not, but I could suggest another site to try. Also, there
is an Intel proprietary simulator for KNL, but I don't know how useful
it would be for this.

There is avx512 support for small matrix multiplication in
https://github.com/hfp/libxsmm, but I guess that won't help with the
general case.

@loveshack (Author) commented Oct 25, 2016 via email

@brada4 (Contributor) commented Oct 25, 2016

How do we know? That spec explicitly says there is not even MMX in CPUID, though I don't really believe that. At the least, can you check Sandy Bridge vs Haswell (with timings in seconds, not just "2-5x faster", which we already know)?

An emulator will not help with cache timings...

@martin-frbg (Collaborator)

My knowledge may well be outdated, but isn't ATLAS "just" plain C with some clever functions to tune loop unrolling etc. for the build target (making it easier to "port", but harder to match well-written machine code for a given target)? And thanks for the pointer to libxsmm - the goals and license seem similar enough, but at first glance the code organization looks too complicated to even suggest trying a quick-and-dirty replacement of one of OpenBLAS's Haswell-optimized functions with its AVX512-using equivalent.

@loveshack (Author)

You wrote:

> My knowledge may well be outdated, but isn't ATLAS "just" plain C with
> some clever functions to tune loop unrolling etc. for the build target
> (making it easier to "port", but harder to match well-written machine
> code for a given target)?

It has assembler kernels, if that's what you mean. Knights Corner is
called "avxz" there, so presumably the relevant bits are
tune/blas/gemm/AMMCASES/*avxz.S in the development version (looking at
3.11.39) but I don't know anything much about it. It seems unlikely to
be useful for KNL, unfortunately, per Clint's initial response in
https://sourceforge.net/p/math-atlas/support-requests/1038/ and my
later build attempt.

For what it's worth, I just got a pointer to some level of support for
larger-size GEMM in libxsmm, with current status in
libxsmm/libxsmm#99 (comment)

@loveshack (Author)

You wrote:

> How do we know? That spec explicitly says there is not even MMX in
> CPUID, though I don't really believe that. At the least, can you check
> Sandy Bridge vs Haswell (with timings in seconds, not just "2-5x
> faster", which we already know)?
>
> An emulator will not help with cache timings...

Please stop responding like this; it's only likely to drive people away
from OpenBLAS, which I don't want.

@brada4 (Contributor) commented Oct 28, 2016

OK, on the MIC emulator the fastest kernel is the one that matches the host CPU.
Does that hold for you too?

@jeffhammond

  • Knights Landing (KNL) aka Xeon Phi 72xx is binary-compatible with Haswell (Xeon v3) except for TSX (transactional memory). This was documented long ago (here).
  • Knights Landing: Second-Generation Intel Xeon Phi Product discusses the microarchitecture in detail.
  • Mixing SSE and AVX instructions is a bad idea, but there is no good reason to do that here anyway.
  • LIBXSMM is written by some of the smartest people at Intel and is another great resource for microkernel insight.
  • BLIS already supports KNL (code) well and should be consulted for BLAS implementation insight. The author of that code knows what he is doing and has been in frequent contact with some of the authors of LIBXSMM.
  • Some of the comments in this thread are factually incorrect.

I work for Intel.

@jeffhammond

@brada4

Linux has no trouble figuring out the cpuid bits associated with MMX, etc.

$ cat /proc/cpuinfo  | more 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 87
model name	: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
stepping	: 1
microcode	: 0x12b
cpu MHz		: 1396.793
cache size	: 1024 KB
physical id	: 0
siblings	: 272
core id		: 0
cpu cores	: 68
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
		  clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
		  constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf
		  eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm
		  sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
		  lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority
		  ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf
		  avx512er avx512cd xsaveopt
bogomips	: 2793.58
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

@martin-frbg (Collaborator)

Thanks for your insight. (Actually I am not sure we still need to argue about Haswell compatibility or the chance to deduce it from CPUID features.) Dave Love's patch was committed as part of PR #1010 on November 7, and from similar open issues I consider it likely that not all of the observed performance difference from MKL comes from not using AVX512.

@jeffhammond

@martin-frbg Certainly, not using AVX-512 is the leading-order reason that performance is behind MKL, but one also needs to consider the KNL cache hierarchy carefully (there is no L3, and each tile of two cores shares an L2). Finally, using instruction encodings wider than 8 bytes reduces performance. The Intel manuals and Agner Fog's website have details.

@brada4 (Contributor) commented Jan 19, 2017

(It is already good to know that Knights Landing can run the Haswell kernels.)
Currently such huge core counts will not be handled very well: one core is used until some arbitrary size threshold, then all of them (see the sketch below).
Any memory-access encoding exceeds 8 bytes - is that just the observation that RAM is slower than registers, or is some extra care needed around the FMA instructions?
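A hedged sketch of the threshold behaviour mentioned above - this is not the actual OpenBLAS code, and the threshold value and function name are made up for illustration:

    /* Illustrative heuristic: run single-threaded below a fixed
       problem-size threshold, otherwise use every available core,
       with nothing in between for mid-sized problems.             */
    int choose_threads(int m, int n, int k, int max_threads)
    {
        const long long threshold = 1LL << 16;   /* arbitrary cutoff */
        long long work = (long long)m * n * k;
        return (work < threshold) ? 1 : max_threads;
    }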

@loveshack (Author) commented Jan 19, 2017 via email

@jeffhammond

@loveshack Thanks for the pointer to that FFTW issue; I commented there already. I have seen BLIS performance on KNL, but the data is not mine to share. You might try to measure it yourself if you have a KNL system; I won't have time to do it for a while.

@loveshack (Author)

A while ago I got access to KNL again, got some results, and intended to pursue them further but didn't before losing access again. Here's what I have for serial dgemm with BLIS (knl kernel as of 2017-02-20) on KNL 7290, compared with MKL, OpenBLAS 0.2.19 for Haswell, and what was meant to be libxsmm's own dgemm (which needs checking for suspicious similarity to OB).

[plot: serial dgemm performance on a KNL 7290 comparing BLIS, MKL, OpenBLAS 0.2.19 (Haswell), and libxsmm]

@martin-frbg (Collaborator)

Thanks for that update; a pity indeed that OpenBLAS fares that poorly. I may have "more time to spend with friends and family" in the near future, so maybe I will actually get around to learning assembly some day.

@jeffhammond

LIBXSMM calls BLAS DGEMM when it lacks a native implementation. Because you are using large matrices, it would not surprise me at all if that is what is happening here. You should try to use LIBXSMM with your own cache blocking to ensure that it is called in a way that executes its own code.
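As a rough illustration of the caller-side cache blocking meant here (the block sizes MB/NB/KB are made-up placeholders, and small_dgemm() is a stand-in for a LIBXSMM-dispatched kernel, written as a plain triple loop so the sketch is self-contained):

    #include <stddef.h>

    /* Hypothetical cache-blocked C += A*B for column-major matrices.
       In practice the inner small_dgemm() would be replaced by a
       LIBXSMM-dispatched kernel; the block sizes would need tuning to
       the KNL cache layout (1 MB of L2 shared by each 2-core tile).   */
    enum { MB = 64, NB = 64, KB = 64 };

    static void small_dgemm(int m, int n, int k,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc)
    {
        for (int j = 0; j < n; j++)
            for (int l = 0; l < k; l++)
                for (int i = 0; i < m; i++)
                    C[i + (size_t)j * ldc] += A[i + (size_t)l * lda]
                                            * B[l + (size_t)j * ldb];
    }

    void blocked_dgemm(int M, int N, int K,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
    {
        for (int jb = 0; jb < N; jb += NB)
            for (int kb = 0; kb < K; kb += KB)
                for (int ib = 0; ib < M; ib += MB)
                    small_dgemm(M - ib < MB ? M - ib : MB,
                                N - jb < NB ? N - jb : NB,
                                K - kb < KB ? K - kb : KB,
                                &A[ib + (size_t)kb * lda], lda,
                                &B[kb + (size_t)jb * ldb], ldb,
                                &C[ib + (size_t)jb * ldc], ldc);
    }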

@jeffhammond commented Jun 27, 2017

Also, LIBXSMM will do a lot better than an AVX2 kernel on KNL, so I do not believe you are measuring LIBXSMM here. LIBXSMM usually beats MKL on KNL for small matrices. See publications listed on the LIBXSMM GitHub page for details.

@loveshack (Author) commented Jul 4, 2017 via email

@jeffhammond commented Jul 4, 2017 via email

@loveshack (Author)

I could have sworn I'd updated this a while ago... I guess it's
off-topic, but perhaps worth recording.

It turns out that the libxsmmext library I tried before is only
relevant for threaded operation, and just falls through to the
external BLAS otherwise, although that doesn't explain the blips
relative to OB.

I've tried BLIS again, after getting the 0.3.0 release working on
KNL, and found it does somewhat better relative to MKL than I measured
before, though I don't think the BLIS kernel has changed. I get ~85%
(30200 v. 35600 Mflops) of MKL 2018.1 7000x7000 serial DGEMM
performance on a 7290 KNL with the latency-performance profile,
slightly variable with matrix size around the plateau.

Also, I noticed these slides on KNL DGEMM, which might be of interest:
https://www.ixpug.org/documents/152044396506-RoltaekLim-ixpug2018-rlim.pdf

martin-frbg added this to "To do" in the "0.3.1 and beyond" project on Apr 2, 2018
@martin-frbg (Collaborator)

AVX512 seems to be available in the low-end i3-8121U now, which should provide a much cheaper testbed.

@jeffhammond

@martin-frbg Intel SDE is a free testbed for AVX-512. Sure, it's just an emulator, but it is great for getting the code working.

The performance of AVX-512 will be quite different on Core i3, Knights Landing, and Skylake Xeon, so I don't see a lot of utility in buying a Core i3 for AVX-512 support unless that is your primary target for OpenBLAS. Also, the low-end Xeon Scalable and Xeon W parts with AVX-512 support are pretty cheap (e.g. Xeon Bronze 3104 and Xeon W 2123), although Xeon Bronze with one VPU will have different performance characteristics than the Gold and Platinum parts with two VPUs.

@loveshack (Author) commented May 24, 2018 via email

@jeffhammond

I have a NERSC allocation and will do my best to provide accounts to anyone who is going to port OpenBLAS to KNL. However, I'd like to see those parties demonstrate interest by doing a functional port with SDE before requesting a NERSC account for them.
