
Knights Landing support #991

Closed
loveshack opened this issue Oct 20, 2016 · 28 comments

@loveshack

cpuid(1) on a Knights Landing system shows:

   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
      model           = 0x7 (7)
      stepping id     = 0x1 (1)
      extended family = 0x0 (0)
      extended model  = 0x5 (5)
      (simple synth)  = Intel Xeon Phi x200 (Knights Landing), 14nm
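For reference only, a tiny self-contained C calculation (values copied from the cpuid output above, not read from the hardware) showing how the extended-model and model fields combine into the conventional display model 0x57 (87); OpenBLAS's cpuid_x86.c and dynamic.c test the extended model (5) and model (7) fields separately, which is what the patch below extends.

    #include <stdio.h>

    /* Illustrative only: values are copied from the cpuid(1) output
       above rather than queried from the CPU.  For family 6 the
       conventional display model is (extended_model << 4) | model.  */
    int main(void)
    {
        unsigned family = 6, model = 0x7, extended_model = 0x5;
        unsigned display_model = (extended_model << 4) | model;  /* 0x57 = 87 */
        printf("family %u, display model 0x%x (%u)\n",
               family, display_model, display_model);
        return 0;
    }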

So this patch makes OpenBLAS detect it as HASWELL:

--- OpenBLAS-0.2.19/cpuid_x86.c~    2016-09-08 17:46:18.122229447 +0100
+++ OpenBLAS-0.2.19/cpuid_x86.c 2016-10-20 16:50:14.003283533 +0100
@@ -1192,6 +1192,7 @@
           else
       return CPUTYPE_NEHALEM;
   case 5:
+   case 7:         // Knights Landing
         case 14:
     // Skylake
           if(support_avx())
@@ -1700,6 +1701,7 @@
          else
      return CORE_NEHALEM;
   case 5:
+   case 7:         // Knights Landing
   case 14:
     // Skylake
          if(support_avx())
--- OpenBLAS-0.2.19/driver/others/dynamic.c~    2016-09-01 04:58:42.000000000 +0100
+++ OpenBLAS-0.2.19/driver/others/dynamic.c 2016-10-20 16:57:02.108394771 +0100
@@ -286,6 +286,15 @@
       return &gotoblas_NEHALEM; //OS doesn't support AVX. Use old kernels.
     }
   }
+   //Intel Knights Landing
+   if (model == 7) {
+     if(support_avx())
+       return &gotoblas_HASWELL;
+     else{
+       openblas_warning(FALLBACK_VERBOSE, NEHALEM_FALLBACK);
+       return &gotoblas_NEHALEM; //OS doesn't support AVX. Use old kernels.
+     }
+   }
    return NULL;
        }
        case 0xf:

Is there any chance of AVX512 support? (Sorry I couldn't fund it.) Unfortunately, with HASWELL on KNL, OpenBLAS dgemm is about three times slower than MKL.

@brada4 (Contributor) commented Oct 22, 2016

Xeon Phi is not a general-purpose CPU; it does not support some ordinary instructions that have been available since the i386, and from your measurement it looks as though many others are emulated in firmware.

By the way, can you point to any reference that MIC is based on or related to Haswell in any way?

@martin-frbg (Collaborator)

As far as I know, the Knights Landing generation of Phi is binary compatible with Haswell - even if some or even most of that is likely to be microcode emulation trickery, Haswell is probably still a better OpenBLAS target for it than the Atom architecture it is somewhat distantly related to.
I suspect AVX512 support will become important once it is available in more mainstream Skylake-EP or Kaby Lake systems expected for next year. Alas, modifying the current Haswell GEMM kernels to make use of AVX512 looks to be a bit more complicated than just changing the instruction names.
@loveshack, while you cannot provide funds, would you be able to provide access to your system if and when a capable developer becomes available?
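To give an idea of why it is more than renaming instructions: below is a minimal intrinsics sketch (purely illustrative, not taken from the OpenBLAS kernels, which are hand-written assembly) of an 8x8 rank-1 update, the innermost step of a dgemm micro-kernel, written for AVX-512F. The wider registers (8 doubles instead of 4) and the doubled register file (32 zmm instead of 16 ymm) mean the register blocking, prefetching, and loop structure of the Haswell kernel would all need retuning, not just a mechanical rename.

    #include <immintrin.h>   /* compile with -mavx512f */

    /* Hypothetical 8x8 rank-1 update using AVX-512F intrinsics.  One
       zmm register holds a whole column of 8 doubles, so an 8x8 block
       of C occupies only 8 of the 32 zmm registers, leaving room for
       larger register blocking than the AVX2 Haswell kernel can use.
       c is an 8x8 block stored column by column.                      */
    static void rank1_update_8x8(const double *a, const double *b,
                                 double c[8][8])
    {
        __m512d acol = _mm512_loadu_pd(a);            /* 8 doubles of A     */
        for (int j = 0; j < 8; j++) {
            __m512d ccol = _mm512_loadu_pd(c[j]);     /* column j of C      */
            __m512d bj   = _mm512_set1_pd(b[j]);      /* broadcast b[j]     */
            ccol = _mm512_fmadd_pd(acol, bj, ccol);   /* C(:,j) += A * b[j] */
            _mm512_storeu_pd(c[j], ccol);
        }
    }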

@brada4 (Contributor) commented Oct 23, 2016

CPUID says no MMX or SSE (on the other hand, Intel writes that you have to check the CPUID bits before emitting special instructions):
https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf#section.B.8
It would be interesting to at least measure GEMM to find whether SSE2, AVX, or AVX2 works best.
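For what it's worth, here is a minimal sketch (not OpenBLAS's actual support_avx() routine) of querying the relevant feature bits from C with the GCC/Clang __builtin_cpu_supports builtin; a reasonably recent compiler is needed for the avx512f name.

    #include <stdio.h>

    /* Minimal feature probe, illustrative only.  This reflects what the
       CPU advertises; a complete check also needs XGETBV to confirm the
       OS saves the AVX/AVX-512 register state, which OpenBLAS's own
       detection code handles separately.                               */
    int main(void)
    {
        printf("sse2    : %d\n", __builtin_cpu_supports("sse2")    != 0);
        printf("avx     : %d\n", __builtin_cpu_supports("avx")     != 0);
        printf("avx2    : %d\n", __builtin_cpu_supports("avx2")    != 0);
        printf("avx512f : %d\n", __builtin_cpu_supports("avx512f") != 0);
        return 0;
    }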

@loveshack (Author)

Martin Kroeker notifications@github.com writes:

> As far as I know, the Knights Landing generation of Phi is binary
> compatible with Haswell - even if some or even most of that is likely
> to be microcode emulation trickery, Haswell is probably still a better
> OpenBLAS target for it than the Atom architecture it is somewhat
> distantly related to.

Indeed. [As far as I know, all Xeons are microcoded anyhow.]

> I suspect AVX512 support will become important once it is available in
> more mainstream Skylake-EP or Kaby Lake systems expected for next
> year. Alas, modifying the current Haswell GEMM kernels to make use of
> AVX512 looks to be a bit more complicated than just changing the
> instruction names.

For what it's worth, ATLAS has Knights Corner (I think) support, but I
haven't managed to build the development version.

> @loveshack, while you cannot provide funds, would you be able to
> provide access to your system if and when a capable developer becomes
> available?

Unfortunately not, but I could suggest another site to try. Also, there
is an Intel proprietary simulator for KNL, but I don't know how useful
it would be for this.

There is avx512 support for small matrix multiplication in
https://github.com/hfp/libxsmm, but I guess that won't help with the
general case.

@loveshack (Author) commented Oct 25, 2016 via email

@brada4 (Contributor) commented Oct 25, 2016

How do we know? That spec explicitly says there is not even MMX in CPUID, though I don't really believe that. At the least, can you check Sandy Bridge vs Haswell (with timings in seconds, not just "2-5x faster", which we already know)?

An emulator will not help with cache timings...

@martin-frbg (Collaborator)

My knowledge may well be outdated, but isn't ATLAS "just" plain C with some clever functions to tune loop unrolling etc. for the build target (making it easier to "port", but harder to match well-written machine code for a given target)? And thanks for the pointer to libxsmm - the goals and license seem similar enough, but at first glance the code organization looks too complicated to even suggest trying a quick-and-dirty replacement of one of OpenBLAS's Haswell-optimized functions with its AVX512-using equivalent.

@loveshack (Author)

You wrote:

> My knowledge may well be outdated, but isn't ATLAS "just" plain C with
> some clever functions to tune loop unrolling etc. for the build target
> (making it easier to "port", but harder to match well-written machine
> code for a given target)?

It has assembler kernels, if that's what you mean. Knights Corner is
called "avxz" there, so presumably the relevant bits are
tune/blas/gemm/AMMCASES/*avxz.S in the development version (looking at
3.11.39) but I don't know anything much about it. It seems unlikely to
be useful for KNL, unfortunately, per Clint's initial response in
https://sourceforge.net/p/math-atlas/support-requests/1038/ and my
later build attempt.

For what it's worth, I just got a pointer to some level of support for
larger-size GEMM in libxsmm, with current status in
libxsmm/libxsmm#99 (comment)

@loveshack (Author)

You wrote:

> How do we know? That spec explicitly says there is not even MMX in
> CPUID, though I don't really believe that. At the least, can you check
> Sandy Bridge vs Haswell (with timings in seconds, not just "2-5x
> faster", which we already know)?
>
> An emulator will not help with cache timings...

Please stop responding like this; it's only likely to drive people away
from OpenBLAS, which I don't want.

@brada4 (Contributor) commented Oct 28, 2016

OK, on the MIC emulator the fastest kernel is the one that matches the host CPU.
Does that hold for you too?

@jeffhammond

  • Knights Landing (KNL) aka Xeon Phi 72xx is binary-compatible with Haswell (Xeon v3) except for TSX (transactional memory). This was documented long ago (here).
  • Knights Landing: Second-Generation Intel Xeon Phi Product discusses the microarchitecture in detail.
  • Mixing SSE and AVX instructions is a bad idea, but there is no good reason to do that here anyway.
  • LIBXSMM is written by some of the smartest people at Intel and is another great resource for microkernel insight.
  • BLIS already supports KNL (code) well and should be consulted for BLAS implementation insight. The author of that code knows what he is doing and has been in frequent contact with some of the authors of LIBXSMM.
  • Some of the comments in this thread are factually incorrect.

I work for Intel.

@jeffhammond

@brada4

Linux has no trouble figuring out the cpuid bits associated with MMX, etc.

$ cat /proc/cpuinfo  | more 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 87
model name	: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
stepping	: 1
microcode	: 0x12b
cpu MHz		: 1396.793
cache size	: 1024 KB
physical id	: 0
siblings	: 272
core id		: 0
cpu cores	: 68
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
		  clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
		  constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf
		  eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm
		  sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
		  lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority
		  ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf
		  avx512er avx512cd xsaveopt
bogomips	: 2793.58
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

@martin-frbg (Collaborator)

Thanks for your insight. (Actually I am not sure we still need to argue about Haswell compatibility or the chance to deduce it from CPUID features.) Dave Love's patch was committed as part of PR #1010 on November 7, and from similar open issues I consider it likely that not all of the observed performance difference from MKL comes from not using AVX512.

@jeffhammond

@martin-frbg Certainly, not using AVX-512 is the leading-order reason that performance is behind MKL, but one also needs to consider the KNL cache hierarchy carefully (there is no L3, and each tile of two cores shares an L2). Finally, using instruction encodings wider than 8 bytes reduces performance. The Intel manuals and Agner Fog's website have details.

@brada4 (Contributor) commented Jan 19, 2017

(It is already good to know that Knights Landing can run the Haswell kernels.)
Currently such huge core counts will not be handled very well: one core is used until some arbitrary size threshold, then all of them (see the sketch below).
Any memory-access encoding exceeds 8 bytes - is that just the observation that RAM is slower than registers, or is some extra care needed around the FMA instructions?
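A hedged sketch of the threshold behaviour mentioned above - this is not the actual OpenBLAS code, and the threshold value and function name are made up for illustration:

    /* Illustrative heuristic: run single-threaded below a fixed
       problem-size threshold, otherwise use every available core,
       with nothing in between for mid-sized problems.             */
    int choose_threads(int m, int n, int k, int max_threads)
    {
        const long long threshold = 1LL << 16;   /* arbitrary cutoff */
        long long work = (long long)m * n * k;
        return (work < threshold) ? 1 : max_threads;
    }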

@loveshack (Author) commented Jan 19, 2017 via email

@jeffhammond

@loveshack Thanks for the pointer to that FFTW issue; I commented there already. I have seen BLIS performance on KNL, but the data is not mine to share. You might try to measure it yourself if you have a KNL system; I won't have time to do it for a while.

@loveshack (Author)

A while ago I got access to KNL again, got some results, and intended to pursue them further but didn't before losing access again. Here's what I have for serial dgemm with BLIS (knl kernel as of 2017-02-20) on KNL 7290, compared with MKL, OpenBLAS 0.2.19 for Haswell, and what was meant to be libxsmm's own dgemm (which needs checking for suspicious similarity to OB).

[plot: serial dgemm performance on a KNL 7290 comparing BLIS, MKL, OpenBLAS 0.2.19 (Haswell), and libxsmm]

@martin-frbg (Collaborator)

Thanks for that update; a pity indeed that OpenBLAS fares that poorly. I may have "more time to spend with friends and family" in the near future, so maybe I will actually get around to learning assembly some day.

@jeffhammond

LIBXSMM calls BLAS DGEMM when it lacks a native implementation. Because you are using large matrices, it would not surprise me at all if that is what is happening here. You should try to use LIBXSMM with your own cache blocking to ensure that it is called in a way that executes its own code.
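As a rough illustration of the caller-side cache blocking meant here (the block sizes MB/NB/KB are made-up placeholders, and small_dgemm() is a stand-in for a LIBXSMM-dispatched kernel, written as a plain triple loop so the sketch is self-contained):

    #include <stddef.h>

    /* Hypothetical cache-blocked C += A*B for column-major matrices.
       In practice the inner small_dgemm() would be replaced by a
       LIBXSMM-dispatched kernel; the block sizes would need tuning to
       the KNL cache layout (1 MB of L2 shared by each 2-core tile).   */
    enum { MB = 64, NB = 64, KB = 64 };

    static void small_dgemm(int m, int n, int k,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc)
    {
        for (int j = 0; j < n; j++)
            for (int l = 0; l < k; l++)
                for (int i = 0; i < m; i++)
                    C[i + (size_t)j * ldc] += A[i + (size_t)l * lda]
                                            * B[l + (size_t)j * ldb];
    }

    void blocked_dgemm(int M, int N, int K,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
    {
        for (int jb = 0; jb < N; jb += NB)
            for (int kb = 0; kb < K; kb += KB)
                for (int ib = 0; ib < M; ib += MB)
                    small_dgemm(M - ib < MB ? M - ib : MB,
                                N - jb < NB ? N - jb : NB,
                                K - kb < KB ? K - kb : KB,
                                &A[ib + (size_t)kb * lda], lda,
                                &B[kb + (size_t)jb * ldb], ldb,
                                &C[ib + (size_t)jb * ldc], ldc);
    }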

@jeffhammond commented Jun 27, 2017

Also, LIBXSMM will do a lot better than an AVX2 kernel on KNL, so I do not believe you are measuring LIBXSMM here. LIBXSMM usually beats MKL on KNL for small matrices. See publications listed on the LIBXSMM GitHub page for details.

@loveshack (Author) commented Jul 4, 2017 via email

@jeffhammond commented Jul 4, 2017 via email

@loveshack (Author)

I could have sworn I'd updated this a while ago... I guess it's
off-topic, but perhaps worth recording.

It turns out that the libxsmmext library I tried before is only
relevant for threaded operation, and just falls through to the
external BLAS otherwise, although that doesn't explain the blips
relative to OB.

I've tried BLIS again, after getting the 0.3.0 release working on
KNL, and found it does somewhat better relative to MKL than I measured
before, though I don't think the BLIS kernel has changed. I get ~85%
(30200 v. 35600 Mflops) of MKL 2018.1 7000x7000 serial DGEMM
performance on a 7290 KNL with the latency-performance profile,
slightly variable with matrix size around the plateau.

Also, I noticed these slides on KNL DGEMM, which might be of interest:
https://www.ixpug.org/documents/152044396506-RoltaekLim-ixpug2018-rlim.pdf

martin-frbg added this to "To do" in the "0.3.1 and beyond" project on Apr 2, 2018
@martin-frbg (Collaborator)

AVX512 seems to be available in the low-end i3-8121U now, which should provide a much cheaper testbed.

@jeffhammond

@martin-frbg Intel SDE is a free testbed for AVX-512. Sure, it's just an emulator, but it is great for getting the code working.

The performance of AVX-512 will be quite different on Core i3, Knights Landing, and Skylake Xeon, so I don't see a lot of utility in buying a Core i3 for AVX-512 support unless that is your primary target for OpenBLAS. Also, the low-end Xeon Scalable and Xeon W parts with AVX-512 support are pretty cheap (e.g. Xeon Bronze 3104 and Xeon W 2123), although Xeon Bronze with one VPU will have different performance characteristics than the Gold and Platinum parts with two VPUs.

@loveshack (Author) commented May 24, 2018 via email

@jeffhammond

I have a NERSC allocation and will do my best to provide accounts to anyone who is going to port OpenBLAS to KNL. However, I'd like to see those parties demonstrate interest by doing a functional port with SDE before requesting a NERSC account for them.
