
libtorch does not initialize OpenMP/MKL by default #20156

Closed

EsdeathYZH opened this issue May 6, 2019 · 25 comments
Assignees
Labels
high priority · module: cpp (Related to C++ API) · module: docs (Related to our documentation, both in docs/ and docblocks) · module: multithreading (Related to issues that occur when running on multiple CPU threads) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@EsdeathYZH

I find that matrix multiplication is slower in the C++ API, so I wrote the same code in C++ and Python and recorded the execution times. The code is as follows:

C++:

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});
    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}

Result:

C++ Operation Time(s) 0.082496s

python:

import time

import torch
import torch.nn as nn
import torch.nn.functional as F

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))

Result:

Python Operation Time(s) 0.0114

Testing Environment:

ubuntu 16.04
gcc version 5.4.0
python version 3.7.3
pytorch version 1.0.1

It's not a small difference. Why does this happen?

@izdeby izdeby added module: cpp Related to C++ API high priority module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 6, 2019
@ezyang
Contributor

ezyang commented May 6, 2019

Can you give us more details about how you built the C++ API example and how you got the binary distributions of PyTorch and libtorch? Also, what happens if you rerun the benchmark with OMP_NUM_THREADS=1 on both samples?

@gchanan
Contributor

gchanan commented May 6, 2019

not enough info here to be "high priority" yet.

@xsacha
Contributor

xsacha commented May 6, 2019

Is this CPU-only? What optimisations are you using? What backend does the CPU use for C++ to do the multiplication?
Sounds like the Python one is using 8 cores and the C++ one is using a single core.

@EsdeathYZH
Author

I built the C++ API example following the tutorial on the website, using CMake to build.
CMakeLists:

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(pytorch-cpp-example)
find_package(Torch REQUIRED)
add_executable(pytorch-cpp-example pytorch-cpp-example.cpp)
target_link_libraries(pytorch-cpp-example "${TORCH_LIBRARIES}")
set_property(TARGET pytorch-cpp-example PROPERTY CXX_STANDARD 11)

I got libtorch from the URL on the website:

wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-latest.zip

I installed PyTorch using conda:

conda install pytorch-cpu torchvision-cpu -c pytorch

Both are CPU-only.

How do I rerun the benchmark with OMP_NUM_THREADS=1? I'm new to PyTorch, so I don't know much about runtime parameters.

@ezyang
Contributor

ezyang commented May 7, 2019

It's an environment variable, so OMP_NUM_THREADS=1 python your_benchmark.py

@ezyang ezyang added the module: binaries Anything related to official binaries that we release to users label May 7, 2019
@ezyang
Contributor

ezyang commented May 7, 2019

cc @ilia-cher @pjh5 is default multithreading behavior different between libtorch and regular PyTorch builds?

@ezyang ezyang added the module: multithreading Related to issues that occur when running on multiple CPU threads label May 7, 2019
@pjh5
Contributor

pjh5 commented May 7, 2019

The libtorch builds are actually the Python manywheel builds with the C++ bits zipped up, so I don't think there should be a difference.

@EsdeathYZH
Author

I reran the C++ example on another machine many times, and the result is not stable.
Sometimes it's slow:

C++ Operation Time(s) 0.0233114s

Sometimes it's similar to the Python example:

C++ Operation Time(s) 0.00362992s

And the Python execution time is always stable:

Python Operation Time(s) 0.003966

I also ran with OMP_NUM_THREADS=1 on both; I don't think this is the cause.

@ilia-cher
Contributor

In Python, when we initialize the torch module we explicitly call at::init_num_threads() to initialize OpenMP/MKL. Could you add the same call at the beginning of your C++ binary and report the results?
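For example, roughly like this (a minimal sketch of the earlier benchmark; the only intended change is the added at::init_num_threads() call, plus the ATen/Parallel.h include that declares it in recent libtorch releases - the header location may differ in older versions):

#include <torch/torch.h>
#include <ATen/Parallel.h>  // declares at::init_num_threads()
#include <chrono>
#include <iostream>

int main() {
    at::init_num_threads();  // initialize OpenMP/MKL thread settings for this thread

    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});

    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "C++ Operation Time(s) "
              << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}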

@EsdeathYZH
Author

I added at::init_num_threads(), and after that the C++ time is similar to Python's.

C++ Operation Time(s) 0.00281327s

I think this is probably the reason for the previous results.

@EsdeathYZH
Author

Sorry for another question: how do I build libtorch from source? I found that the libtorch from the website is built with -D_GLIBCXX_USE_CXX11_ABI=0, which introduces link errors when linking with other dependencies.

I tried python setup.py build, but I don't know how to organize the output files into libtorch's directory structure. Is there a tutorial about how to build libtorch (CPU-only or with CUDA)?

@ezyang ezyang changed the title C++ API is slower than python? libtorch does not initialize OMP/MKL by default, meaning all libtorch programs run singlethreaded by default (performance hazard) May 8, 2019
@ezyang ezyang added module: docs Related to our documentation, both in docs/ and docblocks high priority and removed module: binaries Anything related to official binaries that we release to users module: performance Issues related to performance, either of kernel code or framework glue labels May 8, 2019
@ezyang
Contributor

ezyang commented May 8, 2019

@ilia-cher
Contributor

ilia-cher commented May 8, 2019

@ezyang I'm not sure the title is correct, though - libtorch programs do not all run single-threaded by default, they just use the default OMP/MKL settings.

You can see in init_num_threads that what it does is:
omp_set_num_threads(mkl_get_max_threads());

Supposedly that results in better perf.

Also note that we did not do any initialization in libtorch before (at::init_num_threads used to be called THInferNumThreads), and this function is only called from the Python torch module init, JIT thread init, and autograd thread init. If you create your own thread, you need to call it there.

What makes things complicated is how OpenMP initializes its threads - unfortunately we have to call this function from any new thread created, otherwise OpenMP uses its predefined settings (e.g. env. var, or number of physical threads).
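
For example, a user-created thread would need something like this (just a sketch to illustrate the point):

#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <thread>

int main() {
    at::init_num_threads();  // covers the main thread

    std::thread worker([] {
        at::init_num_threads();  // must be repeated in every thread you create,
                                 // otherwise OpenMP falls back to its own defaults here
        auto a = torch::randn({2708, 1433});
        auto b = torch::randn({1433, 16});
        auto c = a.mm(b);  // now uses the intended OMP/MKL thread settings
    });
    worker.join();
    return 0;
}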

@ilia-cher
Contributor

ilia-cher commented May 8, 2019

@ezyang Curious what would happen if we put a call to init_num_threads in static initialization code in libtorch, i.e. whether that would affect Python's execution thread and the main-function thread - we should check this, though it would not protect against the user creating a new std::thread and launching OMP/MKL-based routines from there.

@yf225
Contributor

yf225 commented May 9, 2019

There was also a recommendation from @mingfeima on how to set the optimal values for OMP env vars: https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25.

CORES=`lscpu | grep Core | awk '{print $4}'`
SOCKETS=`lscpu | grep Socket | awk '{print $2}'`
TOTAL_CORES=`expr $CORES \* $SOCKETS`

KMP_SETTING="KMP_AFFINITY=granularity=fine,compact,1,0"
KMP_BLOCKTIME=1

export OMP_NUM_THREADS=$TOTAL_CORES
export $KMP_SETTING
export KMP_BLOCKTIME=$KMP_BLOCKTIME

cc. @mingfeima Would you like to weigh in on this?

@ilia-cher ilia-cher changed the title libtorch does not initialize OMP/MKL by default, meaning all libtorch programs run singlethreaded by default (performance hazard) libtorch does not initialize OMP/MKL by default May 9, 2019
@ezyang ezyang assigned ilia-cher and unassigned yf225 May 20, 2019
@ezyang
Contributor

ezyang commented May 20, 2019

@ilia-cher According to @VitalyFedyunin, you have a plan of attack for tackling this problem. Would you mind posting it here? (We were reviewing this issue at the weekly triage meeting.)

@ilia-cher
Contributor

@ezyang it's pretty much what I posted above - we do need to call at::init_num_threads from each new thread. We already do this in the places we control (e.g. internal thread pools), but if users create new std::threads then they need to call it there. We might be able to do this automatically for the main thread.

@ezyang ezyang changed the title libtorch does not initialize OMP/MKL by default libtorch does not initialize OpenMP/MKL by default May 21, 2019
@ezyang
Contributor

ezyang commented Jun 4, 2019

@ilia-cher We have some other options, right? For example, at::parallel_for could check whether it is being run from a thread that has never been initialized (using TLS) and then propagate initialization from some global setting, if that is the desired semantics.
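
Roughly something like this (just a sketch of the idea, not the actual at::parallel_for implementation):

#include <ATen/Parallel.h>

namespace {

// thread-local flag: has OMP/MKL been initialized on this thread yet?
thread_local bool num_threads_initialized = false;

void lazy_init_num_threads() {
    if (!num_threads_initialized) {
        at::init_num_threads();
        num_threads_initialized = true;
    }
}

} // namespace

// at::parallel_for (and similar primitives) would call lazy_init_num_threads()
// on entry, before handing work to OpenMP/MKL.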

@ilia-cher
Contributor

ilia-cher commented Jun 6, 2019

@ezyang at::parallel_for and other primitives could check it, but unfortunately the user might just as well call an MKL gemm function from a new thread, and MKL would use OpenMP with its default settings.
UPD: to be more precise, this is not MKL's fault but an issue with OpenMP, which doesn't seem to respect omp_set_num_threads(N) called in a different thread.
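
A small standalone illustration of that OpenMP behaviour (a sketch; compile with -fopenmp, and the exact team sizes printed depend on your machine and environment):

#include <omp.h>
#include <cstdio>
#include <thread>

int main() {
    omp_set_num_threads(8);  // only affects the calling (main) thread's setting

    std::thread t([] {
        // a new thread starts from OpenMP's defaults (OMP_NUM_THREADS or core count),
        // not from the value set above in the main thread
        #pragma omp parallel
        {
            #pragma omp single
            std::printf("worker thread team size: %d\n", omp_get_num_threads());
        }
    });
    t.join();

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("main thread team size: %d\n", omp_get_num_threads());
    }
    return 0;
}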

@ezyang
Contributor

ezyang commented Jun 10, 2019

@pietern says: the problem is that import torch does a lot more stuff anyway, and mimicking it in C++ would require us to have some sort of "initialization" function. There are arguments for and against, but I think doing this explicitly is better than doing it lazily - import torch already implies this initialization, it's just that people don't know about it. I'd prefer it to be an explicit call; then I know I have to remove this initializer.

@cpuhrsch: This is a global thing. Why would we differentiate between C++ and Python frontend in these decisions?

@gchanan: It seems opt out is better than opt in...

@pietern: Well, in Python you always have import torch.

@ezyang
Contributor

ezyang commented Jan 27, 2020

Unassigning stale assignee.

@magicly

magicly commented Feb 20, 2020

I tested the code in this tutorial: https://pytorch.org/tutorials/advanced/cpp_export.html on my Ubuntu 18.04 machine with 6 physical CPU cores and an RTX 2080. It took 80s to finish 1000 model.forward calls in the C++ code, but just 25s in the Python code! I can't call at::init_num_threads, because it complains: error: 'init_num_threads' is not a member of 'at'. What I find strangest is that when I run the C++ code, CPU usage is 1000%, but when I run the Python code, it is 600%!

I searched for a long time, but nothing worked.

But when I followed this tutorial: https://pytorch.org/cppdocs/installing.html and downloaded libtorch from https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip, it works! CPU usage is 1200% when I run the C++ code, and it is as fast as the Python code: 24s!

At first, I used this download from the PyTorch download page:
https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.4.0.zip
So maybe that download only works well for CUDA? I also tested model.forward on CUDA in C++ and in Python, and everything goes as expected: CPU usage is 100%, and it is very fast: 2.4s for 1000 forwards in both C++ and Python, 10 times faster than the CPU version.

So I think the question is why the CUDA version of libtorch is "so strange" on CPU. By "strange" I mean CPU usage is 1000% (not 600%, not 1200%) and it is 3 times slower than the Python version.

@magicly

magicly commented Feb 20, 2020

@gemfield
Contributor

gemfield commented Apr 8, 2021

Anyone who has hit this libtorch/PyTorch issue might have a look at: https://zhuanlan.zhihu.com/p/363319763

@vitormartins01

I had libtorch version 1.10.0, and by mistake I also had a PyTorch library installed from apt that conflicted with the local library downloaded from the PyTorch website. Therefore, if you have this problem, make sure you don't have the libtorch dev package installed on your machine. Remove it with Synaptic.
