
libtorch does not initialize OpenMP/MKL by default #20156

Closed

EsdeathYZH opened this issue May 6, 2019 · 25 comments
Assignees
Labels
high priority · module: cpp (Related to C++ API) · module: docs (Related to our documentation, both in docs/ and docblocks) · module: multithreading (Related to issues that occur when running on multiple CPU threads) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@EsdeathYZH

I find that matrix multiplication is slower in the C++ API, so I wrote the same code in C++ and Python and recorded the execution times. The code is as follows:

C++:

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});
    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}

Result:

C++ Operation Time(s) 0.082496s

python:

import time

import torch
import torch.nn as nn
import torch.nn.functional as F

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))

Result:

Python Operation Time(s) 0.0114

Testing Environment:

ubuntu 16.04
gcc version 5.4.0
python version 3.7.3
pytorch version 1.0.1

It's not a small difference. Why does this happen?

@izdeby izdeby added module: cpp Related to C++ API high priority module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 6, 2019
@ezyang
Contributor

ezyang commented May 6, 2019

Can you give us more details about how you built the C++ API example and how you got the binary distributions of PyTorch and libtorch? Also, what happens if you rerun the benchmark with OMP_NUM_THREADS=1 on both samples?

@gchanan
Contributor

gchanan commented May 6, 2019

not enough info here to be "high priority" yet.

@xsacha
Contributor

xsacha commented May 6, 2019

Is this CPU-only? What optimisations are you using? What backend does the CPU use for C++ to do the multiplication?
Sounds like the Python one is using 8 cores and the C++ one is using a single core.

@EsdeathYZH
Author

I built the C++ API example following the tutorial on the website, using CMake to build.
CMakeLists:

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(pytorch-cpp-example)
find_package(Torch REQUIRED)
add_executable(pytorch-cpp-example pytorch-cpp-example.cpp)
target_link_libraries(pytorch-cpp-example "${TORCH_LIBRARIES}")
set_property(TARGET pytorch-cpp-example PROPERTY CXX_STANDARD 11)

I got libtorch from the URL on the website:

wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-latest.zip

I installed PyTorch using conda:

conda install pytorch-cpu torchvision-cpu -c pytorch

Both are CPU-only.

How do I rerun the benchmark with OMP_NUM_THREADS=1? I'm new to PyTorch, so I don't know much about runtime parameters.

@ezyang
Contributor

ezyang commented May 7, 2019

It's an environment variable, so OMP_NUM_THREADS=1 python your_benchmark.py

@ezyang ezyang added the module: binaries Anything related to official binaries that we release to users label May 7, 2019
@ezyang
Contributor

ezyang commented May 7, 2019

cc @ilia-cher @pjh5 is default multithreading behavior different between libtorch and regular PyTorch builds?

@ezyang ezyang added the module: multithreading Related to issues that occur when running on multiple CPU threads label May 7, 2019
@pjh5
Contributor

pjh5 commented May 7, 2019

The libtorch builds are actually the Python manywheel builds with the C++ bits zipped up, so I don't think there should be a difference.

@EsdeathYZH
Author

I reran the C++ example on another machine many times, and the result is not stable.
Sometimes it's slow:

C++ Operation Time(s) 0.0233114s

Sometimes it's similar to the Python example:

C++ Operation Time(s) 0.00362992s

And the Python execution time is always stable:

Python Operation Time(s) 0.003966

I also ran with OMP_NUM_THREADS=1 on both; I don't think this is the cause.

@ilia-cher
Contributor

In Python, when we initialize the torch module we explicitly call at::init_num_threads() to initialize OpenMP/MKL. Could you add the same call at the beginning of your C++ binary and report the results?
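For example, roughly like this (a minimal sketch of the earlier benchmark; the only intended change is the added at::init_num_threads() call, plus the ATen/Parallel.h include that declares it in recent libtorch releases - the header location may differ in older versions):

#include <torch/torch.h>
#include <ATen/Parallel.h>  // declares at::init_num_threads()
#include <chrono>
#include <iostream>

int main() {
    at::init_num_threads();  // initialize OpenMP/MKL thread settings for this thread

    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});

    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "C++ Operation Time(s) "
              << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}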

@EsdeathYZH
Author

I added at::init_num_threads(), and after that the C++ time is similar to Python's.

C++ Operation Time(s) 0.00281327s

I think this is probably the reason for the previous results.

@EsdeathYZH
Author

Sorry for another question: how do I build libtorch from source? I found that the libtorch from the website is built with -D_GLIBCXX_USE_CXX11_ABI=0, which introduces link errors when linking with other dependencies.

I tried python setup.py build, but I don't know how to organize the output files into libtorch's directory structure. Is there a tutorial about how to build libtorch (CPU-only or with CUDA)?

@ezyang ezyang changed the title C++ API is slower than python? libtorch does not initialize OMP/MKL by default, meaning all libtorch programs run singlethreaded by default (performance hazard) May 8, 2019
@ezyang ezyang added module: docs Related to our documentation, both in docs/ and docblocks high priority and removed module: binaries Anything related to official binaries that we release to users module: performance Issues related to performance, either of kernel code or framework glue labels May 8, 2019
@ezyang
Contributor

ezyang commented May 8, 2019

@ilia-cher
Contributor

ilia-cher commented May 8, 2019

@ezyang I'm not sure the title is correct, though - libtorch programs do not all run single-threaded by default, they just use the default OMP/MKL settings.

You can see in init_num_threads that what it does is:
omp_set_num_threads(mkl_get_max_threads());

Supposedly that results in better perf.

Also note that we did not do any initialization in libtorch before (at::init_num_threads used to be called THInferNumThreads), and this function is only called from the Python torch module init, JIT thread init, and autograd thread init. If you create your own thread, you need to call it there.

What makes things complicated is how OpenMP initializes its threads - unfortunately we have to call this function from any new thread created, otherwise OpenMP uses its predefined settings (e.g. env. var, or number of physical threads).
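
For example, a user-created thread would need something like this (just a sketch to illustrate the point):

#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <thread>

int main() {
    at::init_num_threads();  // covers the main thread

    std::thread worker([] {
        at::init_num_threads();  // must be repeated in every thread you create,
                                 // otherwise OpenMP falls back to its own defaults here
        auto a = torch::randn({2708, 1433});
        auto b = torch::randn({1433, 16});
        auto c = a.mm(b);  // now uses the intended OMP/MKL thread settings
    });
    worker.join();
    return 0;
}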

@ilia-cher
Contributor

ilia-cher commented May 8, 2019

@ezyang Curious what would happen if we put a call to init_num_threads in static initialization code in libtorch, i.e. whether that would affect Python's execution thread and the main-function thread - we should check this, though it would not protect against the user creating a new std::thread and launching OMP/MKL-based routines from there.

@yf225
Contributor

yf225 commented May 9, 2019

There was also a recommendation from @mingfeima on how to set the optimal values for OMP env vars: https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25.

CORES=`lscpu | grep Core | awk '{print $4}'`
SOCKETS=`lscpu | grep Socket | awk '{print $2}'`
TOTAL_CORES=`expr $CORES \* $SOCKETS`

KMP_SETTING="KMP_AFFINITY=granularity=fine,compact,1,0"
KMP_BLOCKTIME=1

export OMP_NUM_THREADS=$TOTAL_CORES
export $KMP_SETTING
export KMP_BLOCKTIME=$KMP_BLOCKTIME

cc. @mingfeima Would you like to weigh in on this?

@ilia-cher ilia-cher changed the title libtorch does not initialize OMP/MKL by default, meaning all libtorch programs run singlethreaded by default (performance hazard) libtorch does not initialize OMP/MKL by default May 9, 2019
@ezyang ezyang assigned ilia-cher and unassigned yf225 May 20, 2019
@ezyang
Contributor

ezyang commented May 20, 2019

@ilia-cher According to @VitalyFedyunin, you have a plan of attack for tackling this problem. Would you mind posting it here? (We were reviewing this issue at the weekly triage meeting.)

@ilia-cher
Contributor

@ezyang it's pretty much what I posted above - we do need to call at::init_num_threads from each new thread. We already do this in the places we control (e.g. internal thread pools), but if users create new std::threads then they need to call it there. We might be able to do this automatically for the main thread.

@ezyang ezyang changed the title libtorch does not initialize OMP/MKL by default libtorch does not initialize OpenMP/MKL by default May 21, 2019
@ezyang
Contributor

ezyang commented Jun 4, 2019

@ilia-cher We have some other options, right? For example, at::parallel_for could check whether it is being run from a thread that has never been initialized (using TLS) and then propagate initialization from some global setting, if that is the desired semantics.
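
Roughly something like this (just a sketch of the idea, not the actual at::parallel_for implementation):

#include <ATen/Parallel.h>

namespace {

// thread-local flag: has OMP/MKL been initialized on this thread yet?
thread_local bool num_threads_initialized = false;

void lazy_init_num_threads() {
    if (!num_threads_initialized) {
        at::init_num_threads();
        num_threads_initialized = true;
    }
}

} // namespace

// at::parallel_for (and similar primitives) would call lazy_init_num_threads()
// on entry, before handing work to OpenMP/MKL.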

@ilia-cher
Contributor

ilia-cher commented Jun 6, 2019

@ezyang at::parallel_for and other primitives could check it, but unfortunately the user might just as well call an MKL gemm function from a new thread, and MKL would use OpenMP with its default settings.
UPD: to be more precise, this is not MKL's fault but an issue with OpenMP, which doesn't seem to respect omp_set_num_threads(N) called in a different thread.
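
A small standalone illustration of that OpenMP behaviour (a sketch; compile with -fopenmp, and the exact team sizes printed depend on your machine and environment):

#include <omp.h>
#include <cstdio>
#include <thread>

int main() {
    omp_set_num_threads(8);  // only affects the calling (main) thread's setting

    std::thread t([] {
        // a new thread starts from OpenMP's defaults (OMP_NUM_THREADS or core count),
        // not from the value set above in the main thread
        #pragma omp parallel
        {
            #pragma omp single
            std::printf("worker thread team size: %d\n", omp_get_num_threads());
        }
    });
    t.join();

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("main thread team size: %d\n", omp_get_num_threads());
    }
    return 0;
}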

@ezyang
Contributor

ezyang commented Jun 10, 2019

@pietern says: the problem is that import torch does a lot more stuff anyway, and mimicking it in C++ would require us to have some sort of "initialization" function. There are arguments for and against, but I think doing this explicitly is better than doing it lazily - import torch already implies this initialization, it's just that people don't know about it. I'd prefer it to be an explicit call; then I know I have to remove this initializer.

@cpuhrsch: This is a global thing. Why would we differentiate between C++ and Python frontend in these decisions?

@gchanan: It seems opt out is better than opt in...

@pietern: Well, in Python you always have import torch.

@ezyang
Contributor

ezyang commented Jan 27, 2020

Unassigning stale assignee.

@magicly

magicly commented Feb 20, 2020

I tested the code in this tutorial: https://pytorch.org/tutorials/advanced/cpp_export.html on my Ubuntu 18.04 machine with 6 physical CPU cores and an RTX 2080. It took 80s to finish 1000 model.forward calls in the C++ code, but just 25s in the Python code! I can't call at::init_num_threads, because it complains: error: 'init_num_threads' is not a member of 'at'. What I find strangest is that when I run the C++ code, CPU usage is 1000%, but when I run the Python code, it is 600%!

I searched for a long time, but nothing worked.

But when I followed this tutorial: https://pytorch.org/cppdocs/installing.html and downloaded libtorch from https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip, it works! CPU usage is 1200% when I run the C++ code, and it is as fast as the Python code: 24s!

At first, I used this download from the PyTorch download page:
https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.4.0.zip
So maybe that download only works well for CUDA? I also tested model.forward on CUDA in C++ and in Python, and everything goes as expected: CPU usage is 100%, and it is very fast: 2.4s for 1000 forwards in both C++ and Python, 10 times faster than the CPU version.

So I think the question is why the CUDA version of libtorch is "so strange" on CPU. By "strange" I mean CPU usage is 1000% (not 600%, not 1200%) and it is 3 times slower than the Python version.

@magicly

magicly commented Feb 20, 2020

@gemfield
Contributor

gemfield commented Apr 8, 2021

Anyone who has hit this libtorch/PyTorch issue might have a look at: https://zhuanlan.zhihu.com/p/363319763

@vitormartins01

I had libtorch version 1.10.0, and by mistake I also had a PyTorch library installed from apt that conflicted with the local library downloaded from the PyTorch website. Therefore, if you have this problem, make sure you don't have the libtorch dev package installed on your machine. Remove it with Synaptic.
