
performance of dgemm(500x2, 2x2) #79

Closed
JeffBezanson opened this Issue Feb 23, 2012 · 12 comments


I see a significant performance gap between openblas and mkl in this case:

Julia, using openblas:

julia> a=rand(500,2)
julia> b=rand(2,2)
julia> @time for i=1:10000; a*b; end
elapsed time: 0.08133101463317871 seconds

Matlab, using MKL:

>> tic();for i=1:10000,a*b;end;toc()
Elapsed time is 0.025024 seconds.

As far as I can tell from CPU usage, MKL is not using multiple threads here, so threading is not the issue.
I don't know whether it's possible to fix this, but it would be great.

I should add:

cpu family      : 6
model           : 37
model name      : Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz
stepping        : 2
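
To rule out threading on the OpenBLAS side as well, one way to reproduce is to pin the BLAS to a single thread before launching Julia. A minimal sketch — the variable names below are the conventional ones, and whether a given 2012-era build honors them is an assumption:

```shell
# Pin the BLAS to one thread before starting Julia, so the @time loop
# above measures single-threaded kernels only (variable names assumed).
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```

Then re-run the `@time` loop from the same shell.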
Owner

xianyi commented Feb 24, 2012

Thank you for the report.

I think we can increase the multi-threading threshold.

BTW, what's your CPU and OS?

Thanks

Xianyi

Linux woz 3.1.7-1-ARCH #1 SMP PREEMPT Wed Jan 4 08:11:16 CET 2012 x86_64 Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz GenuineIntel GNU/Linux

@ghost ghost assigned xianyi Feb 27, 2012

Contributor

ViralBShah commented Mar 19, 2012

@xianyi Is it likely that this performance issue may be sorted out in an upcoming release? Do you know the reason for the poor performance?

Owner

xianyi commented Mar 19, 2012

Hi Viral,

I will fix this issue in the next version.

Xianyi

xianyi added a commit that referenced this issue Mar 22, 2012

Owner

xianyi commented Mar 22, 2012

Hi @JeffBezanson @ViralBShah

I added a threshold to avoid multi-threading for small-matrix GEMM.
You can test it on the develop branch.

Thanks

Xianyi
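
Presumably the threshold works along these lines — a hypothetical sketch of the heuristic, not the actual OpenBLAS code, and the cutoff value of 65536 is made up for illustration:

```shell
# Pick a thread count for an m-by-k times k-by-n GEMM: below a work
# threshold, multi-threading overhead dominates, so use a single thread.
gemm_threads() {
  m=$1; n=$2; k=$3; max_threads=$4
  if [ $((m * n * k)) -lt 65536 ]; then
    echo 1
  else
    echo "$max_threads"
  fi
}
gemm_threads 500 2 2 4    # the case from this issue -> 1 thread
gemm_threads 500 500 2 4  # enough work -> all 4 threads
```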

Owner

xianyi commented Mar 23, 2012

Hi,

Please try the 0.1.0 version. I think I fixed this issue.

Xianyi

@xianyi xianyi closed this Mar 23, 2012

Contributor

ViralBShah commented Mar 23, 2012

Will do in a few days...

-viral


Contributor

ViralBShah commented Mar 25, 2012

@xianyi I notice that this is still about twice as slow as the BLAS used by Matlab (I suspect it is MKL). Also, we are using OpenBLAS with multi-threading disabled, so this really did not make much of a difference.

Owner

xianyi commented Mar 25, 2012

Hi Viral,

OpenBLAS/GotoBLAS cannot automatically adjust the number of threads for
small input matrices or vectors. Intel MKL supports this feature, so it
can drop to 1 thread and outperform multi-threaded OpenBLAS/GotoBLAS.

Thus, the user should explicitly set the number of threads to 1 for small
matrices.

Thanks

Xianyi


Contributor

ViralBShah commented Mar 25, 2012

What I meant is that we are compiling OpenBLAS with USE_THREAD=0. In that case, this should not be an issue, right?

-viral


Owner

xianyi commented Mar 25, 2012

Yes. For small matrices, building with USE_THREAD=0 will perform better than OpenBLAS built with USE_THREAD=1.

Xianyi
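
For reference, a single-threaded build as discussed above might look like this; `USE_THREAD=0` is the flag from this thread, while the install prefix is a hypothetical path:

```shell
# Build OpenBLAS with threading compiled out entirely, as Julia does here.
make USE_THREAD=0
make PREFIX=/opt/openblas install
```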

