Why is THNN so slow?! #1048

Closed

amazingyyc opened this issue Nov 24, 2016 · 21 comments

@amazingyyc

I use Prisma, and I found a libthnn.so inside the Prisma Android app.
So I tested the speed of Prisma's lib against the original THNN.
I found that the original is very, very slow.
For example, the original THNN costs 5 s, but Prisma's lib costs just 50 ms!!!
I also found that Prisma's lib uses OpenBLAS, but OpenBLAS alone can't explain that much of a speedup!
Can anyone explain it?

@soumith
Member

soumith commented Nov 24, 2016

are you talking about on-device performance? i.e. for ARM / Android?

@soumith
Member

soumith commented Nov 24, 2016

It's likely that they have implemented some custom optimizations that have not been pushed back upstream.

@fmassa
Contributor

fmassa commented Nov 24, 2016

What @soumith mentioned is probably the main reason for the difference in performance.
We also probably lost some (maybe not much?) run-time performance with torch/torch7#839, but that change was causing too many compilation problems on some architectures, so it was the better option for maintainability.

@amazingyyc
Author

Yes, I'm talking about the Android platform. The Prisma app (using its libTHNN.so) is so fast that I can hardly believe it!!

@austingg

@amazingyyc Did you test the two THNN builds with the same model? Prisma may simplify the model a lot.
@soumith I found THNN is much slower than cunn and cudnn, about 25 times slower. I don't know whether there is something wrong with how I call THNN (it uses OpenBLAS for GEMM).

@amazingyyc
Author

@austingg I did not test a full model. I just tested a single convolution operation (the first convolution in GoogLeNet). I built a new demo with the lib included in Prisma's Android app and with the original THNN. The original THNN costs about 2 s (in release mode) and 5 s (in debug mode).
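
To make the comparison concrete, a micro-benchmark like this can be timed with a small harness along these lines (a sketch only; RunConv is a placeholder for whichever convolution call is under test, not a THNN or Prisma API):

```cpp
// Minimal timing sketch. RunConv is a stand-in for the convolution call
// being measured; it is not a real THNN/Prisma function.
#include <chrono>

template <typename ConvFn>
double AverageMs(ConvFn&& RunConv, int iters = 10) {
  RunConv();  // warm-up, so one-time initialization is not counted
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) RunConv();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}
```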

@austingg

@amazingyyc Does that mean you tested a Prisma-style model, or that conv1 alone costs 2 s?

@amazingyyc
Author

I can't find the reason, but I guess Prisma uses OpenBLAS to accelerate things and uses uint8 matrix multiplication instead of float32. Just a guess...

@amazingyyc
Author

@austingg
I only tested conv1.
Prisma's lib costs about 50 ms.
The original THNN costs 2 s (in the release APK).

@amazingyyc
Author

@austingg cunn is implemented on the GPU; of course it is much faster than THNN (which is CPU-only).

@austingg

@amazingyyc That's possible. However, THNN also uses OpenBLAS, and OpenBLAS has no int8 GEMM.

@austingg

@amazingyyc I know, it's just far too slow. In other frameworks, the CPU is about 10x slower than the GPU.

@soumith soumith closed this as completed Nov 26, 2016
@austingg

austingg commented Dec 5, 2016

It was my mistake. I compared against cudnn, not cuda, and when compared to cudnn it is normal for torch nn to be 25x slower. Sorry for misleading.

@amazingyyc
Author

I tested Prisma's lib and gemmlowp (ref: https://github.com/google/gemmlowp) on the same-scale convolution2d: Prisma runs the convolution2d, and gemmlowp does a matrix multiply of the same size.
I found that the costs of the two are on the same order of magnitude.
So I think Prisma uses uint8 matrix multiplication instead of float32 (ref: http://ip.cadence.com/uploads/presentations/1100AM_TensorFlow_on_Embedded_Devices_PeteWarden.pdf).
Using gemmlowp would accelerate THNN too.
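
For reference, the 8-bit path looks roughly like this: a minimal sketch following the legacy Gemm() interface from gemmlowp's documentation, where the offset/multiplier/shift values are placeholders that would come from the real quantization ranges:

```cpp
// Sketch of an 8-bit GEMM via gemmlowp's legacy Gemm() interface.
// The quantization parameters below are placeholders.
#include <cstdint>
#include "public/gemmlowp.h"

void QuantizedGemm(const std::uint8_t* lhs_data,  // M x K, row-major
                   const std::uint8_t* rhs_data,  // K x N, col-major
                   std::uint8_t* result_data,     // M x N, col-major
                   int M, int N, int K) {
  gemmlowp::GemmContext context;

  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      lhs(lhs_data, M, K);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      rhs(rhs_data, K, N);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
      result(result_data, M, N);

  // In practice these are derived from the min/max quantization ranges;
  // the numbers here are only placeholders.
  const int lhs_offset = -128, rhs_offset = -128;
  const int result_offset = 128, result_mult_int = 1, result_shift = 8;

  gemmlowp::Gemm<std::uint8_t, gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset,
      result_offset, result_mult_int, result_shift);
}
```

For a convolution, the two matrices would be the filter weights and the im2col-unrolled input, i.e. the same lowering THNN's SpatialConvolutionMM already does before its float GEMM.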

@austingg

austingg commented Dec 7, 2016

@amazingyyc Does it need net surgery to use gemmlowp?

@amazingyyc
Author

@austingg
Yes, you have to replace THNN's float matrix multiply with gemmlowp (convert the floats to uint8 yourself and do the uint8 matrix multiply with gemmlowp).
Conversion method ref: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
gemmlowp is header-only (see https://github.com/google/gemmlowp), so it is easy to integrate.
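
Roughly, the conversion described in that post maps each float buffer linearly from its [min, max] range onto [0, 255]. A minimal sketch (all names here are illustrative, not THNN or gemmlowp APIs):

```cpp
// Sketch of the linear float -> uint8 quantization from the linked post.
// All names are illustrative, not library APIs.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedBuffer {
  std::vector<std::uint8_t> values;
  float min;  // real value represented by quantized 0
  float max;  // real value represented by quantized 255
};

QuantizedBuffer QuantizeToUint8(const std::vector<float>& input) {
  QuantizedBuffer out;
  out.min = *std::min_element(input.begin(), input.end());
  out.max = *std::max_element(input.begin(), input.end());
  const float range = std::max(out.max - out.min, 1e-6f);  // avoid divide-by-zero
  for (float v : input) {
    const int q = static_cast<int>(std::round((v - out.min) * 255.0f / range));
    out.values.push_back(static_cast<std::uint8_t>(std::min(255, std::max(0, q))));
  }
  return out;
}

// Recover an approximate float from a quantized value and its range.
float Dequantize(std::uint8_t q, float min, float max) {
  return min + (max - min) * (static_cast<float>(q) / 255.0f);
}
```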

@amazingyyc
Author

@austingg
One more note: gemmlowp will not use multiple threads for small matrices, but it will automatically use multiple threads for big matrices. So if you want the fastest possible speed, you have to change gemmlowp's code.
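
The thread-count knob itself is on the gemmlowp context; whether a small GEMM actually gets split across threads is decided by heuristics inside gemmlowp's multi-threading code, which is the part that would need editing. A sketch (quantization parameters are placeholders, as before):

```cpp
// Sketch: setting the worker-thread cap on the gemmlowp context before a GEMM.
// Quantization parameters are placeholders, as in the earlier sketch.
#include <cstdint>
#include "public/gemmlowp.h"

void EightBitGemmWithThreads(
    const gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>& lhs,
    const gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>& rhs,
    gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>* result,
    int num_threads) {
  gemmlowp::GemmContext context;
  context.set_max_num_threads(num_threads);

  gemmlowp::Gemm<std::uint8_t, gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, result,
      /*lhs_offset=*/-128, /*rhs_offset=*/-128,
      /*result_offset=*/128, /*result_mult_int=*/1, /*result_shift=*/8);
}
```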

@austingg

austingg commented Dec 7, 2016

@amazingyyc Thank you so much.

@austingg

@amazingyyc Have you seen any other benchmarks of 8-bit GEMM on mobile devices?

@amazingyyc
Author

@austingg Sorry, I don't know of any others.

@austingg

@amazingyyc I have done some research; in the TensorFlow issues, many people complained when they used quantized (8-bit) ops.
