Optimizing Level-3 BLAS on Intel Sandy Bridge #83

Closed
xianyi opened this Issue Mar 23, 2012 · 28 comments

@xianyi
Owner

All Level-3 BLAS functions.

@wangqian wangqian was assigned Mar 23, 2012
@ViralBShah

I believe Julia's dgemm performance bug that was reported recently is related to Sandy Bridge optimizations. It would be amazing if support for Sandy Bridge could be added.

JuliaLang/julia#747

@xianyi
Owner
@ViralBShah
@wangqian wangqian added a commit that referenced this issue Jun 19, 2012
@wangqian wangqian Refs #83 #53. Adding Intel Sandy Bridge (AVX supported) kernel codes for BLAS level 3 functions.
f76f952
@xianyi
Owner

Hi @ViralBShah @zchothia ,

We have already pushed the sandybridge code to GitHub. It only supports x86-64.
So far we have only tested it on Intel Core i7 and Intel Xeon E5-26xx with 64-bit Linux. Could you help us test it?

Thanks

Xianyi

@ViralBShah
@xianyi
Owner

Hi @ViralBShah ,

The new Intel Core i5 uses the Sandy Bridge architecture. I think OpenBLAS can detect it automatically.

Thanks

Xianyi

@ViralBShah
@ViralBShah
@zchothia

Hello @xianyi,

I was able to successfully build the sandybridge code on Windows but the tests failed, for example:

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat3 < ./sblat3.dat
OPENBLAS_NUM_THREADS=2 ./xscblat2 < sin2

I have posted the full build log here: https://gist.github.com/2953434

Processor: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
Operating System: Windows 7 SP1, 64-bit [Version 6.1.7601]
Compiler: MinGW-w64 x86_64-w64-mingw32-gcc-4.6.3-release-win64_rubenvb.7z

--Zaheer

@xianyi
Owner

Hi Viral,

Please check this wiki page http://en.wikipedia.org/wiki/Sandy_Bridge#Desktop_platform

I think your GCC toolchain doesn't support AVX instructions, so you could upgrade it to version 4.6.

Thanks
Xianyi

@ViralBShah
@xianyi
Owner

Hi @zchothia ,

Thank you for your report. I think there is a bug in our Windows ABI handling.

Xianyi

@ViralBShah
@ViralBShah
@xianyi
Owner

Hi Viral,

I think this is an OpenBLAS bug with ".align 64" on Mac OSX.

Thanks

Xianyi

@wangqian
Collaborator
@ViralBShah
@zchothia

@xianyi: I was unable to debug the tests themselves, so I wrote a small program to narrow down the issue on Windows.
It computes C = A * A, where A is a small 5 x 5 matrix. The code is here: https://gist.github.com/2956370

This is the output with sgemm:

sizeof(float) = 4

C = [
  22890.000000, 18900.000000, 15225.000000, 14490.000000, 17220.000000,
  17850.000000, 22575.000000, 17115.000000, 15120.000000, 16065.000000,
  14700.000000, 17640.000000, 24045.000000, 17640.000000, 14700.000000,
  16065.000000, 15120.000000, 17115.000000, 22575.000000, 17850.000000,
  17220.000000, 14490.000000, 15225.000000, 18900.000000, 22890.000000,
]

C ./ expected = [
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
]

... and with dgemm:

sizeof(float) = 8

C = [
  22890.000000, 18900.000000, 15225.000000, 14490.000000, 17220.000000,
  17850.000000, 22575.000000, 17115.000000, 15120.000000, 16065.000000,
  14700.000000, 17640.000000, 24045.000000, 17640.000000, 14700.000000,
  16065.000000, 15120.000000, 17115.000000, 22575.000000, 17850.000000,
  0.000000, 0.000000, 0.000000, 0.000000, 0.000000,
]

C ./ expected = [
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  21.000000, 21.000000, 21.000000, 21.000000, 21.000000,
  0.000000, 0.000000, 0.000000, 0.000000, 0.000000,
]

For some reason, it appears as though the results are being scaled by a factor of 21. In addition it looks like dgemm doesn't fill the last row of the output matrix. For reference, this example works fine on Linux (sandybridge branch too).

EDIT: The factor of 21 comes from A[3, 3]. I realised this when I tried the equivalent of this MATLAB snippet and observed that the result was (incorrectly) scaled by 4:

>> A = diag(1:5);
>> B = ones(5,5);
>> C = A * B;

--Zaheer

@zchothia

@ViralBShah: if you make these two changes the code should compile and the tests pass (tested with Clang 3.1/Ubuntu 12.04):

diff --git a/kernel/x86_64/cgemm_kernel_4x8_sandy.S b/kernel/x86_64/cgemm_kernel_4x8_sandy.S
index 56ebee1..5987b8e 100644
--- a/kernel/x86_64/cgemm_kernel_4x8_sandy.S
+++ b/kernel/x86_64/cgemm_kernel_4x8_sandy.S
@@ -3578,7 +3578,7 @@ ADDQ      $8*SIZE, ptrba;
 ADDQ   $16*SIZE, ptrbb;
 DECQ   k;
 JG     .L241_bodyB;
-.align
+ALIGN_5
 .L241_loopE:
 #ifndef        TRMMKERNEL
 TEST $2, bk;
diff --git a/kernel/x86_64/sgemm_kernel_8x8_sandy.S b/kernel/x86_64/sgemm_kernel_8x8_sandy.S
index 4d16a60..23eda3a 100644
--- a/kernel/x86_64/sgemm_kernel_8x8_sandy.S
+++ b/kernel/x86_64/sgemm_kernel_8x8_sandy.S
@@ -2412,7 +2412,7 @@ ADDQ      $4*SIZE, ptrba;
 ADDQ   $16*SIZE, ptrbb;
 DECQ   k;
 JG .L241_bodyB;
-.align
+ALIGN_4
 .L241_loopE:
 #ifndef        TRMMKERNEL
 TEST $2, bk;
@wangqian
Collaborator
@xianyi xianyi added a commit that referenced this issue Jun 20, 2012
@xianyi Refs #83. Added the missing ALIGN_5 macro on Mac OSX. However, it still exists SEGFAULT bug.
88c272f
@ViralBShah
@ViralBShah
@zchothia

@xianyi: I investigated further and the Windows failures are due to incompatible calling conventions.
This table shows where each parameter is passed when calling sgemm_kernel:

Param | Linux  | Windows
------|--------|---------
M     | %rdi   | %rcx
N     | %rsi   | %rdx
K     | %rdx   | %r8
ALPHA | %xmm0  | %xmm3
SA    | %rcx   | 32(%rsp)
SB    | %r8    | 40(%rsp)
SC    | %r9    | 48(%rsp)
LDC   | (%rsp) | 56(%rsp)

(Incidentally, the tests pass if I build without optimizations but this is just sheer luck - e.g. %xmm0 is used as an intermediate register when placing ALPHA into %xmm3.)

You can try this out by looking at the assembly generated for this program:

$ gcc -S -O2 sgemm_kernel_call.c -o sgemm_kernel_call.S

// From <common.h>
#if defined(_WIN64)
typedef long long BLASLONG;
typedef unsigned long long BLASULONG;
#else
typedef long BLASLONG;
typedef unsigned long BLASULONG;
#endif

// From <common_param.h>
int sgemm_kernel(BLASLONG M, BLASLONG N, BLASLONG K, float ALPHA, float *SA, float *SB, float *SC, BLASLONG LDC);

int main() {
  BLASLONG M = 10;
  BLASLONG N = 20;
  BLASLONG K = 30;
  float ALPHA = 1.0;  // 1065353216 == 0x3f800000
  float *SA = (float*) 0xaaaa;  // 43690
  float *SB = (float*) 0xbbbb;  // 48059
  float *SC = (float*) 0xcccc;  // 52428
  BLASLONG LDC = 40;
  sgemm_kernel(M, N, K, ALPHA, SA, SB, SC, LDC);

  return 0;
}

--Zaheer

@xianyi
Owner

Hi all,

We just tested the library on a Sandy Bridge machine running Mac OSX. We found that the SEGFAULT may be related to Clang: with a very simple assembly file, Clang generates the wrong binary code.

Here is the test case https://gist.github.com/2960279

I think I will test it with Clang on Linux.

Thanks

Xianyi

@zchothia

Hello,

Thanks for the calling convention fix, Qian! All the tests pass now on Windows.
Calling OpenBLAS from Visual Studio (2010 and 2012 RC) also works fine for the small test program I posted earlier.

The miscompile with Clang that you mentioned doesn't occur on Linux with a recent build:

$ clang --version
clang version 3.2 (trunk 157931)
Target: x86_64-unknown-linux-gnu
Thread model: posix
$ clang -c test_vaddpd.s -o test_vaddpd.o
$ objdump -d test_vaddpd.o
<snipped>
   0:   c4 21 11 58 6c 13 10    vaddpd 0x10(%rbx,%r10,1),%xmm13,%xmm13
   7:   c3                      retq

Clang only gained full support for AVX with version 3.0, so it may be worth trying a newer version. The Chromium project provides up-to-date binaries on a regular basis (for 64-bit Linux and Mac OS X): http://commondatastorage.googleapis.com/chromium-browser-clang/index.html

--Zaheer

@wangqian
Collaborator
@xianyi
Owner

Hi all,

I just tested the library with Clang 3.1 on Mac OSX. It works fine.

Thanks

Xianyi

@ViralBShah

It builds fine for me now and the tests pass using the Apple-provided clang (Apple clang version 3.1 (tags/Apple/clang-318.0.61) (based on LLVM 3.1svn)).

@xianyi xianyi closed this Jun 26, 2012