libtorch does not initialize OpenMP/MKL by default #20156
Comments
Can you give us more details about how you built the C++ API example and how you got the binary distributions of PyTorch and libtorch? Also, what happens if you rerun the benchmark with OMP_NUM_THREADS=1?
not enough info here to be "high priority" yet.
Is this CPU-only? What optimisations are you using? Which backend does the C++ build use on the CPU to do the multiplication?
I built the C++ API example following the tutorials on the website, using CMake to build.
I got libtorch from the URL on the website:
I installed PyTorch using conda:
Both are CPU-only. How do I rerun the benchmark with OMP_NUM_THREADS=1? I'm new to PyTorch, so I don't know much about runtime parameters.
It's an environment variable, so you set it in the shell when launching your binary, e.g. `OMP_NUM_THREADS=1 ./<your-binary>`.
cc @ilia-cher @pjh5: is the default multithreading behavior different between libtorch and regular PyTorch builds?
The libtorch builds are actually Python manywheel builds with the C++ bits zipped up, so I don't think there should be a difference.
I reran the C++ example on another machine many times, and the result is not stable.
Sometimes it's similar to the Python example:
And the Python execution time is always stable:
I also ran with OMP_NUM_THREADS=1 on both, so I don't think that's the cause.
In Python, when we initialize the torch module we explicitly call at::init_num_threads() to set up OpenMP/MKL threading.
I added that call at the start of my C++ code. I think the missing initialization is probably the reason for the previous results.
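A minimal sketch of where that call goes in a libtorch program (the headers are the ones shipped with the libtorch distribution; the tensor sizes here are just for illustration):

```cpp
#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  // Initialize OpenMP/MKL threading the same way the Python frontend does
  // when the torch module is imported. Without this, libtorch falls back to
  // the OpenMP/MKL defaults, which can give unstable timings.
  at::init_num_threads();

  std::cout << "intra-op threads: " << at::get_num_threads() << std::endl;

  auto a = torch::randn({1000, 1000});
  auto b = torch::randn({1000, 1000});
  auto c = torch::mm(a, b);  // now runs with the initialized thread settings
  std::cout << c.sum().item<float>() << std::endl;
  return 0;
}
```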
Sorry for another question: how do I build libtorch from source? The libtorch from the website is built in a way that doesn't work for me when I try to use it.
@ezyang I'm not sure the title is correct though - programs do not run single-threaded by default, they just use the default OMP/MKL settings; supposedly that results in better perf. Also note that we did not do any initialization in libtorch before. What makes things complicated is how OpenMP initializes its threads - unfortunately we have to call this function from any new thread created, otherwise OpenMP uses its predefined settings (e.g. an environment variable, or the number of physical cores).
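To make the per-thread point concrete, a sketch along these lines (the worker function and sizes are made up for illustration):

```cpp
#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <thread>
#include <vector>

void worker() {
  // OpenMP state is per-thread, so a thread created by the application has to
  // initialize it again; otherwise OpenMP falls back to its predefined
  // settings (environment variables or the number of physical cores).
  at::init_num_threads();

  auto x = torch::randn({512, 512});
  auto y = torch::mm(x, x);  // uses the settings initialized above
  (void)y;
}

int main() {
  at::init_num_threads();  // main thread

  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) {
    threads.emplace_back(worker);
  }
  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```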
@ezyang Curious what would happen if we put a call to that initialization function somewhere central in libtorch itself.
There was also a recommendation from @mingfeima on how to set the optimal values for the OMP env vars: https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25

```sh
CORES=`lscpu | grep Core | awk '{print $4}'`
SOCKETS=`lscpu | grep Socket | awk '{print $2}'`
TOTAL_CORES=`expr $CORES \* $SOCKETS`
KMP_SETTING="KMP_AFFINITY=granularity=fine,compact,1,0"
KMP_BLOCKTIME=1

export OMP_NUM_THREADS=$TOTAL_CORES
export $KMP_SETTING
export KMP_BLOCKTIME=$KMP_BLOCKTIME
```

cc @mingfeima Would you like to weigh in on this?
@ilia-cher According to @VitalyFedyunin, you have a plan of attack for tackling this problem. Would you mind posting it here? (We were reviewing this issue at the weekly triage meeting.)
@ezyang it's pretty much what I posted above - we do need to call at::init_num_threads() from every new thread that does OpenMP/MKL work.
@ilia-cher We have some other options, right? For example, at::parallel_for and the other parallel primitives could do the initialization themselves.
@ezyang at::parallel_for and the other primitives could check it, but unfortunately the user might just as well call an MKL gemm function from a new thread, and MKL would then use OpenMP with its default settings.
@pietern: The problem is that in C++ there is no single, obvious place to do this initialization.
@cpuhrsch: This is a global thing. Why would we differentiate between the C++ and Python frontends in these decisions?
@gchanan: It seems opt-out is better than opt-in...
@pietern: Well, in Python you always have an import of torch, which gives us a natural place to run the initialization.
Unassigning stale assignee.
I tested the code in this tutorial: https://pytorch.org/cppdocs/installing.html. I searched for a long time, but couldn't make it fast. At first, I used a CUDA-version libtorch from the PyTorch download page; with it, CPU usage was around 1000% (not 600%, not 1200%) and the C++ version was about 3 times slower than the Python version. But when I did as the tutorial says and downloaded the CPU-only libtorch, it worked fine. So my question is: why is the CUDA-version libtorch so strange on CPU?
These CPU-only builds are OK:
https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.4.0%2Bcpu.zip
https://download.pytorch.org/libtorch/nightly/cpu/libtorch-cxx11-abi-shared-with-deps-latest.zip
Anyone who has hit this libtorch performance issue might have a look at: https://zhuanlan.zhihu.com/p/363319763
I had libtorch version 1.10.0, and in my case the error was caused by a PyTorch library installed via apt that conflicted with the local library downloaded from the PyTorch website. So if you have this problem, make sure you don't have the libtorch dev package installed on your machine; remove it with Synaptic.
I find that matrix multiplication is slower in the C++ API, so I wrote the same code in C++ and Python and recorded the execution times. The code is as follows:
C++:
Result:
Python:
Result:
Testing Environment:
It's not a small difference. Why does this happen?
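For reference, a minimal C++ benchmark along the lines described above might look like the sketch below; the matrix size, warm-up, and timing code are assumptions, not the exact code from this report:

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  auto a = torch::randn({2000, 2000});
  auto b = torch::randn({2000, 2000});

  torch::mm(a, b);  // warm-up

  auto start = std::chrono::steady_clock::now();
  auto c = torch::mm(a, b);
  auto end = std::chrono::steady_clock::now();

  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "torch::mm took " << ms << " ms" << std::endl;

  // Note: as discussed in this thread, calling at::init_num_threads()
  // (from <ATen/Parallel.h>) at the top of main() can make these timings
  // line up with the Python version.
  return 0;
}
```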