Optimizing Level-3 BLAS on Intel Sandy Bridge #83

Closed
xianyi opened this Issue March 23, 2012 · 28 comments

4 participants

Zhang Xianyi wangqian Viral B. Shah zchothia
Zhang Xianyi
Owner

All Level-3 BLAS functions.

Viral B. Shah

I believe Julia's dgemm performance bug that was reported recently is related to Sandy Bridge optimizations. It would be amazing if support for Sandy Bridge could be added.

JuliaLang/julia#747

Zhang Xianyi
Owner
Viral B. Shah
wangqian referenced this issue from a commit June 19, 2012
wangqian Refs #83 #53. Adding Intel Sandy Bridge (AVX supported) kernel codes …
…for BLAS level 3 functions.
f76f952
Zhang Xianyi
Owner
xianyi commented June 19, 2012

Hi @ViralBShah @zchothia ,

We have already pushed the Sandy Bridge code to GitHub. It only supports x86-64.
So far we have only tested it on Intel Core i7 and Intel Xeon E5-26xx with 64-bit Linux. Could you help us test it?

Thanks

Xianyi

Viral B. Shah
Zhang Xianyi
Owner
xianyi commented June 19, 2012

Hi @ViralBShah ,

The new Intel Core i5 uses the Sandy Bridge architecture. I think OpenBLAS can detect it automatically.
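
For anyone who wants to check a machine by hand, here is a minimal sketch of a runtime AVX check. It assumes GCC 4.8 or later (or a recent Clang) on x86 and uses the compiler builtins `__builtin_cpu_init` and `__builtin_cpu_supports`; the helper name `cpu_has_avx` is made up for illustration, and this is not how OpenBLAS itself detects the CPU.

```c
/* Returns 1 if the compiler's CPU-feature runtime reports AVX,
   0 otherwise (including on non-x86 builds).
   Hypothetical helper for illustration; OpenBLAS has its own detection code. */
int cpu_has_avx(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                         /* populate the feature table */
    return __builtin_cpu_supports("avx") ? 1 : 0; /* AVX arrived with Sandy Bridge */
#else
    return 0;                                     /* builtins not available here */
#endif
}
```

Calling `cpu_has_avx()` from a small `main` and printing the result should be enough to tell whether a given Core i5 can run the AVX kernels at all.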

Thanks

Xianyi

Viral B. Shah
Viral B. Shah
zchothia
Collaborator

Hello @xianyi,

I was able to successfully build the sandybridge code on Windows but the tests failed, for example:

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat3 < ./sblat3.dat
OPENBLAS_NUM_THREADS=2 ./xscblat2 < sin2

I have posted the full build log here: https://gist.github.com/2953434

Processor: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
Operating System: Windows 7 SP1, 64-bit [Version 6.1.7601]
Compiler: MinGW-w64 x86_64-w64-mingw32-gcc-4.6.3-release-win64_rubenvb.7z

--Zaheer

Zhang Xianyi
Owner
xianyi commented June 19, 2012

Hi Viral,

Please check this wiki page: http://en.wikipedia.org/wiki/Sandy_Bridge#Desktop_platform

I think your gcc toolchain doesn't support AVX instructions, so you could upgrade it to version 4.6.

Thanks
Xianyi

Viral B. Shah
Zhang Xianyi
Owner
xianyi commented June 19, 2012

Hi @zchothia ,

Thank you for your report. I think there is a bug related to the Windows ABI.

Xianyi

Viral B. Shah
Viral B. Shah
Zhang Xianyi
Owner
xianyi commented June 19, 2012

Hi Viral,

I think this is an OpenBLAS bug involving ".align 64" on Mac OS X.

Thanks

Xianyi

wangqian
Collaborator
Viral B. Shah
zchothia
Collaborator

@xianyi: I was unable to debug the tests themselves, so I wrote a small program to narrow down the issue on Windows.
It computes C = A * A, where A is a small 5 x 5 matrix. The code is here: https://gist.github.com/2956370

This is the output with sgemm:

sizeof(float) = 4

C = [
  22890.000000, 18900.000000, 15225.000000, 14490.000000, 17220.000000,
  17850.000000, 22575.000000, 17115.000000, 15120.000000, 16065.000000,
  14700.000000, 17640.000000, 24045.000000, 17640.000000, 14700.000000,
  16065.000000, 15120.000000, 17115.000000, 22575.000000, 17850.000000,
  17220.000000, 14490.000000, 15225.000000, 18900.000000, 22890.000000,
]

C ./ expected = [
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
]

... and with dgemm:

sizeof(float) = 8

C = [
  22890.000000, 18900.000000, 15225.000000, 14490.000000, 17220.000000,
  17850.000000, 22575.000000, 17115.000000, 15120.000000, 16065.000000,
  14700.000000, 17640.000000, 24045.000000, 17640.000000, 14700.000000,
  16065.000000, 15120.000000, 17115.000000, 22575.000000, 17850.000000,
  0.000000, 0.000000, 0.000000, 0.000000, 0.000000,
]

C ./ expected = [
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  0.000000, 0.000000, 0.000000, 0.000000, 0.000000,
]

For some reason, it appears as though the results are being scaled by a factor of 21. In addition it looks like dgemm doesn't fill the last row of the output matrix. For reference, this example works fine on Linux (sandybridge branch too).

EDIT: The factor of 21 comes from A[3, 3]. I realised this when I tried the equivalent of this MATLAB snippet and observed that the result was (incorrectly) scaled by 4:

>> A = diag(1:5);
>> B = ones(5,5);
>> C = A * B;
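
A rough C equivalent of this check, as a sketch only: a naive column-major reference multiply that the library result can be compared against. The function names `ref_gemm` and `fill_diag_and_ones` are hypothetical and this does not call OpenBLAS; it just produces the expected values.

```c
#include <stddef.h>

/* Naive column-major reference: C = alpha * A * B for n x n matrices.
   Element (i, j) of a matrix M is stored at M[i + j * n]. */
void ref_gemm(size_t n, double alpha,
              const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i + k * n] * B[k + j * n];
            C[i + j * n] = alpha * acc;
        }
    }
}

/* Fills A = diag(1..n) and B = ones(n, n), as in the MATLAB snippet. */
void fill_diag_and_ones(size_t n, double *A, double *B)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            A[i + j * n] = (i == j) ? (double)(i + 1) : 0.0;
            B[i + j * n] = 1.0;
        }
    }
}
```

With n = 5 and alpha = 1, every entry in row i of the reference result is i + 1, so dividing the library output by it element-wise exposes any uniform scaling factor directly.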

--Zaheer

zchothia
Collaborator

@ViralBShah: if you make these two changes the code should compile and the tests pass (tested with Clang 3.1/Ubuntu 12.04):

diff --git a/kernel/x86_64/cgemm_kernel_4x8_sandy.S b/kernel/x86_64/cgemm_kernel_4x8_sandy.S
index 56ebee1..5987b8e 100644
--- a/kernel/x86_64/cgemm_kernel_4x8_sandy.S
+++ b/kernel/x86_64/cgemm_kernel_4x8_sandy.S
@@ -3578,7 +3578,7 @@ ADDQ      $8*SIZE, ptrba;
 ADDQ   $16*SIZE, ptrbb;
 DECQ   k;
 JG     .L241_bodyB;
-.align
+ALIGN_5
 .L241_loopE:
 #ifndef        TRMMKERNEL
 TEST $2, bk;
diff --git a/kernel/x86_64/sgemm_kernel_8x8_sandy.S b/kernel/x86_64/sgemm_kernel_8x8_sandy.S
index 4d16a60..23eda3a 100644
--- a/kernel/x86_64/sgemm_kernel_8x8_sandy.S
+++ b/kernel/x86_64/sgemm_kernel_8x8_sandy.S
@@ -2412,7 +2412,7 @@ ADDQ      $4*SIZE, ptrba;
 ADDQ   $16*SIZE, ptrbb;
 DECQ   k;
 JG .L241_bodyB;
-.align
+ALIGN_4
 .L241_loopE:
 #ifndef        TRMMKERNEL
 TEST $2, bk;
wangqian
Collaborator
Viral B. Shah
Viral B. Shah
zchothia
Collaborator

@xianyi: I investigated further and the Windows failures are due to incompatible calling conventions.
This table shows where each parameter is passed when calling sgemm_kernel:

Param | Linux  | Windows
-------------------------
M     | %rdi   | %rcx
N     | %rsi   | %rdx
K     | %rdx   | %r8
ALPHA | %xmm0  | %xmm3
SA    | %rcx   | 32(%rsp)
SB    | %r8    | 40(%rsp)
SC    | %r9    | 48(%rsp)
LDC   | (%rsp) | 56(%rsp)

(Incidentally, the tests pass if I build without optimizations but this is just sheer luck - e.g. %xmm0 is used as an intermediate register when placing ALPHA into %xmm3.)

You can try this out by looking at the assembly generated for this program:

$ gcc -S -O2 sgemm_kernel_call.c -o sgemm_kernel_call.S

// From <common.h>
#if defined(_WIN64)
typedef long long BLASLONG;
typedef unsigned long long BLASULONG;
#else
typedef long BLASLONG;
typedef unsigned long BLASULONG;
#endif

// From <common_param.h>
int sgemm_kernel(BLASLONG M, BLASLONG N, BLASLONG K, float ALPHA, float *SA, float *SB, float *SC, BLASLONG LDC);

int main() {
  BLASLONG M = 10;
  BLASLONG N = 20;
  BLASLONG K = 30;
  float ALPHA = 1.0;  // 1065353216 == 0x3f800000
  float *SA = (float*) 0xaaaa;  // 43690
  float *SB = (float*) 0xbbbb;  // 48059
  float *SC = (float*) 0xcccc;  // 52428
  BLASLONG LDC = 40;
  sgemm_kernel(M, N, K, ALPHA, SA, SB, SC, LDC);

  return 0;
}
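
Another way to see both conventions side by side on a single machine, sketched under the assumption of GCC or Clang targeting x86-64: the `ms_abi` and `sysv_abi` function attributes pin one function to each convention. The names `demo_long`, `demo_kernel_ms` and `demo_kernel_sysv` are made up for illustration and are not part of OpenBLAS; on other compilers or targets the macros expand to nothing.

```c
/* Assumption: GCC or Clang on x86-64. Elsewhere both macros are empty and
   the two functions simply use the platform's default convention. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define DEMO_MS_ABI   __attribute__((ms_abi, noinline))
#define DEMO_SYSV_ABI __attribute__((sysv_abi, noinline))
#else
#define DEMO_MS_ABI
#define DEMO_SYSV_ABI
#endif

typedef long long demo_long; /* hypothetical 64-bit stand-in for BLASLONG */

/* Folds every argument into one value so none of them is optimized away. */
#define DEMO_BODY \
    return (double)alpha * (double)(m + n + k + ldc) + sa[0] + sb[0] + sc[0]

/* Same parameter list as sgemm_kernel, Windows x64 convention. */
DEMO_MS_ABI double demo_kernel_ms(demo_long m, demo_long n, demo_long k,
                                  float alpha, float *sa, float *sb,
                                  float *sc, demo_long ldc)
{
    DEMO_BODY;
}

/* Same parameter list as sgemm_kernel, System V (Linux) convention. */
DEMO_SYSV_ABI double demo_kernel_sysv(demo_long m, demo_long n, demo_long k,
                                      float alpha, float *sa, float *sb,
                                      float *sc, demo_long ldc)
{
    DEMO_BODY;
}
```

Compiling this with gcc -S -O2 and comparing the two function bodies should show the first reading M from %rcx and ALPHA from %xmm3, and the second reading them from %rdi and %xmm0, in line with the table.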

--Zaheer

Zhang Xianyi
Owner
xianyi commented June 20, 2012

Hi all,

We just tested the library on a Sandy Bridge Mac running OS X. We found that the SEGFAULT may be related to Clang: given a very simple assembly file, Clang generates the wrong binary code.

Here is the test case: https://gist.github.com/2960279

I think I will test it with Clang on Linux.

Thanks

Xianyi

zchothia
Collaborator

Hello,

Thanks for the calling convention fix, Qian! All the tests pass now on Windows.
Calling OpenBLAS from Visual Studio (2010 and 2012 RC) also works fine for the small test program I posted earlier.

The miscompile with Clang that you mentioned doesn't occur on Linux with a recent build:

$ clang --version
clang version 3.2 (trunk 157931)
Target: x86_64-unknown-linux-gnu
Thread model: posix
$ clang -c test_vaddpd.s -o test_vaddpd.o
$ objdump -d test_vaddpd.o
<snipped>
   0:   c4 21 11 58 6c 13 10    vaddpd 0x10(%rbx,%r10,1),%xmm13,%xmm13
   7:   c3                      retq
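
To double-check by hand that those seven bytes really encode the instruction shown, here is a small sketch that pulls the fields out of the 3-byte VEX prefix, ModRM and SIB bytes. It only handles this one instruction shape (C4 prefix, mod=01, SIB, 8-bit displacement); the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields recovered from a 3-byte VEX prefix plus ModRM/SIB/disp8. */
typedef struct {
    unsigned map;    /* opcode map: 1 means the 0F map */
    unsigned pp;     /* implied prefix: 1 means 66 */
    unsigned vlen;   /* L bit: 0 = 128-bit (xmm) */
    unsigned vvvv;   /* extra source register number */
    unsigned opcode; /* 0x58 with pp=1 in the 0F map is ADDPD */
    unsigned reg;    /* register number from ModRM.reg (destination here) */
    unsigned base;   /* base register number (3 = rbx) */
    unsigned index;  /* index register number (10 = r10) */
    unsigned scale;  /* 1, 2, 4 or 8 */
    int disp;        /* sign-extended 8-bit displacement */
} vex_fields;

/* Decodes exactly: C4 <b1> <b2> <opcode> <modrm> <sib> <disp8>.
   Returns 0 on success, -1 if the bytes do not have that shape. */
int decode_vex3_sib_disp8(const uint8_t *p, vex_fields *out)
{
    if (p[0] != 0xC4)
        return -1;
    uint8_t b1 = p[1], b2 = p[2], modrm = p[4], sib = p[5];
    if ((modrm >> 6) != 1 || (modrm & 7) != 4)
        return -1;                     /* need mod=01 and rm=100 (SIB follows) */

    unsigned r = !((b1 >> 7) & 1);     /* R, X, B are stored inverted */
    unsigned x = !((b1 >> 6) & 1);
    unsigned b = !((b1 >> 5) & 1);

    out->map    = b1 & 0x1F;
    out->pp     = b2 & 3;
    out->vlen   = (b2 >> 2) & 1;
    out->vvvv   = (~(b2 >> 3)) & 0xF;  /* vvvv is stored inverted as well */
    out->opcode = p[3];
    out->reg    = ((modrm >> 3) & 7) | (r << 3);
    out->index  = ((sib >> 3) & 7) | (x << 3);
    out->base   = (sib & 7) | (b << 3);
    out->scale  = 1u << (sib >> 6);
    out->disp   = (int8_t)p[6];
    return 0;
}
```

Fed the bytes c4 21 11 58 6c 13 10, this yields opcode 0x58 with the 66 prefix in the 0F map, vvvv = 13 and reg = 13 (xmm13), base = 3 (rbx), index = 10 (r10), scale 1 and displacement 0x10, which matches the objdump line.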

Clang only gained full support for AVX with version 3.0, so it may be worth trying a newer version. The Chromium project provides up-to-date binaries on a regular basis (for 64-bit Linux and Mac OS X): http://commondatastorage.googleapis.com/chromium-browser-clang/index.html

--Zaheer

wangqian
Collaborator
Zhang Xianyi
Owner
xianyi commented June 20, 2012

Hi all,

I just tested the library with Clang 3.1 on Mac OSX. It works fine.

Thanks

Xianyi

Viral B. Shah

It builds fine for me now and the tests pass using the Apple-provided clang (Apple clang version 3.1 (tags/Apple/clang-318.0.61) (based on LLVM 3.1svn)).

Zhang Xianyi xianyi closed this June 25, 2012