Add SLEEF for libm float and double #6725
Conversation
@@ -77,3 +77,6 @@
[submodule "third_party/onnx"]
	path = third_party/onnx
	url = https://github.com/onnx/onnx.git
[submodule "aten/src/ATen/cpu/sleef"]
ezyang
Apr 22, 2018
Contributor
Can we start adding these libraries to top-level third_party
instead of inside of ATen?
cpuhrsch
Apr 26, 2018
Author
Contributor
Moved it.
Force-pushed from 672eef1 to 9113cd6
Force-pushed from 79ba742 to dba6e93
Timings for newly vectorized functions.
@cpuhrsch Could you provide me with the test scripts? I think it's not a bad idea to compare the perf of this version with that of VML.
Haven't reviewed CMake. Mostly LGTM
backends:
  - CPU
  - CUDA
name: _tan
apaszke
May 2, 2018
Collaborator
Can we call all of them _th_*? It's unclear what's different in a _tan() call compared to tan().
cpuhrsch
May 2, 2018
Author
Contributor
_th_ indicates that it'd otherwise conflict with something from TH. For example tanh.
apaszke
May 2, 2018
Collaborator
Yeah, but right now it's super inconsistent. Some functions will use the _th_ form, others will only have _. Prefixing all of them with _th_ would just seem easier to reason about (and it would be clear what happens when you use them in your code). Not a big deal though.
Tensor& fill_(Tensor& self, const Tensor& value) {
  return self._fill_(value);
}
apaszke
May 2, 2018
Collaborator
Lol why can't we leave fill_ as it was then?
cpuhrsch
May 2, 2018
Author
Contributor
I created this stub to figure out how to replace it correctly (since it takes a scalar value). I'll add the vectorized implementation soon. I'd prefer to keep this and then add all the other stuff quickly in another PR (see the TODO section).
@@ -396,7 +399,8 @@ def cosh(x):
    try:
        return math.cosh(x)
    except OverflowError:
        return float('inf') if x > 0 else float('-inf')
        # http://en.cppreference.com/w/cpp/numeric/math/cosh
apaszke
May 2, 2018
Collaborator
Can you add an extra comment here? It's unclear why someone should go there, and what they should look for.
cpuhrsch
May 2, 2018
Author
Contributor
This reference shows that on overflow std::cosh returns +inf in any case, regardless of the sign of the argument. If this isn't changed the test will fail, and we might as well stick to the standard.
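To make the behavior being discussed concrete, here is a minimal sketch of a Python reference that matches std::cosh on overflow; `ref_cosh` is a hypothetical name for illustration, not the function from the PR.

```python
import math

def ref_cosh(x):
    """Match std::cosh: on overflow, return +inf regardless of the
    sign of x, since cosh is an even function."""
    try:
        return math.cosh(x)
    except OverflowError:
        # http://en.cppreference.com/w/cpp/numeric/math/cosh
        # std::cosh overflows to +HUGE_VAL for large arguments of
        # either sign.
        return float('inf')

print(ref_cosh(1000.0), ref_cosh(-1000.0))  # inf inf
```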
@@ -260,6 +260,9 @@ def _test_math(self, torchfn, mathfn, input=None):
    input = []
    input.append(list(range(-5, 5)))
    input.append([x + 1e-6 for x in range(-5, 5)])
    # Some vectorized implementations don't support large ranges
    input.append([x + 1e10 for x in range(-5, 5)])
    input.append([x - 1e10 for x in range(-5, 5)])
apaszke
May 2, 2018
Collaborator
Aren't we going with SLEEF for large inputs anyway, since the precision at this scale is too low?
cpuhrsch
May 2, 2018
Author
Contributor
Hm, what do you mean by that? SLEEF is just as precise as libm (1 ULP) for all the vectorized functions in this PR.
apaszke
May 2, 2018
Collaborator
Oh ok. I thought SLEEF has those guarantees only when abs(x) < 1e10, so this wouldn't be the case, but maybe I misunderstood something.
cpuhrsch
May 2, 2018
Author
Contributor
That's the case for the trigonometric functions (sin, cos, etc.) and they all have various regions. We'll need to deal with that separately.
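A short sketch of why trigonometric functions need special care at large magnitudes: near 1e10 the spacing between adjacent float32 values is already 1024, which spans many full periods of sin, so the quality of argument reduction rather than the polynomial approximation dominates accuracy there.

```python
import math
import struct

def float32_ulp(x):
    # Distance from x (rounded to float32) to the next representable
    # float32 above it, for positive finite x.
    f = struct.unpack('<f', struct.pack('<f', x))[0]
    bits = struct.unpack('<I', struct.pack('<f', f))[0]
    return struct.unpack('<f', struct.pack('<I', bits + 1))[0] - f

spacing = float32_ulp(1.0e10)
print(spacing)                    # 1024.0
print(spacing / (2.0 * math.pi))  # how many periods of sin one ULP spans
```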
@MlWoo Thank you for offering to look into this. I'm fully convinced Intel MKL VML is faster than this, but it doesn't provide us with short-vector implementations. Intel SVML does, but it's specific to icc as far as I know. Also, VML doesn't provide us with an implementation of softmax or other more ML-specific function approximations as far as I know, so we'll need to add SLEEF in any case. Also, as far as I know, MKL VML is not open source and comes with its own threading library. The plan is to integrate VML in any case, since many users do have access to MKL, but I'd prefer to do that in another PR. I'm adding a lot more timings soon to get a wider understanding of the perf here, and I'll add the benchmark script alongside that.
@MlWoo I used this script to get these timings. I'll still need to check for regressions for functions I touched without vectorizing them: cosh, sinh, log, tanh. Except for erf we see an average speedup of 2x, which is skewed by some of the 10x+ speedups.
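The actual benchmark script isn't included in the thread; as a hypothetical stand-in, a throughput harness for a unary elementwise op over a large contiguous buffer could look like this (`bench` and its parameters are invented for illustration).

```python
import math
import timeit

def bench(fn, n=100_000, repeat=3):
    """Time applying a scalar function over a contiguous buffer and
    return throughput in elements per second (best of `repeat` runs)."""
    data = [x * 1e-3 for x in range(n)]
    best = min(timeit.repeat(lambda: [fn(x) for x in data],
                             number=1, repeat=repeat))
    return n / best

print(bench(math.sin))  # elements/sec; compare branches on the same box
```

Comparing the same harness across master and this branch (and against an MKL VML build) is what the perf discussion above is asking for.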
This branch is usually a bit slower on average than master (~5%) on these non-vectorized functions, however much faster in some cases, presumably because it knows how to sort strides. Note that all of this is restricted to sinh/cosh/tanh.
Why did you modify CMakeLists.txt in sleef?
@shibatch Thanks for looking at this! SLEEF is a fantastic library, thank you very much! I went through many changes because it wouldn't build with the standard CMake setup on all our different platforms. I'm happy to remove all of that again, and it's straightforward to drop in the original build setup. I'd want to do this as part of a separate PR. I'll send one out with all this removed and tag you in it, if you want. Then we can resolve the issues that the CI shows.
CC @orionr on cmake stuff
@cpuhrsch Okay. And I will add VML-style API to the development plan.
@@ -0,0 +1,71 @@
IF(MSVC)
  option(BUILD_SHARED_LIBS "Build shared libs" ON)
ELSE(MSVC)
ezyang
May 2, 2018
Contributor
You don't have to fix this, but FYI modern cmake style is to not repeat the conditional in else/elseif/endif. So:

if(MSVC)
  ...
else()
  ...
endif()
option(SLEEF_SHOW_ERROR_LOG "Show cmake error log." OFF)

set(SLEEF_VERSION_MAJOR 3)
set(SLEEF_VERSION_MINOR 2)
ezyang
May 2, 2018
Contributor
Picture in my head: Someone's gonna upgrade the sleef submodule and forget to update these version numbers. ;)
cpuhrsch
May 2, 2018
Author
Contributor
Yes :(
endif()

# Function used to generate safe command arguments for add_custom_command
function(command_arguments PROPNAME)
ezyang
May 2, 2018
Contributor
This utility is from sleef proper?
cpuhrsch
May 2, 2018
Author
Contributor
Indeed, I didn't write this.
I didn't read the rest of the diff, but I'm signing LGTM for the cmake stuff. It would be nice to have some description of what the changes are, so it's easier to upstream the necessary changes (since you then don't have to go through the diff with a fine-toothed comb to find out what was changed). Also, so you don't forget if you have to put this aside for a bit :)
Merged 88a7055 into pytorch:master
Speeds up vectorized operations on contiguous tensors while preserving the ULP guarantees required by libc's math library. In particular, the "_u10" part of a SLEEF function name means that it has an error bound of 1.0 ULP.
This library comes with different valid ranges per function. For example, sinf has a valid range of [-5e+9, 5e+9]. This needs to be accounted for manually (work in progress).
EDIT: It was decided that we'll add a flag to functions where the corresponding vectorized implementation does not cover the entire range of floating point values, and will disable them by default.
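The flag idea above can be sketched as a range-guarded dispatch: trust the fast vectorized kernel only inside its documented valid range and fall back to scalar libm otherwise. This is a minimal illustration, not the PR's implementation; `guarded_sin`, `fast_sin`, and `USE_VECTORIZED` are invented names, and the bound is modeled on SLEEF's quoted range for sinf.

```python
import math

SINF_VALID_RANGE = 5e9   # assumed bound, per the sinf range quoted above
USE_VECTORIZED = True    # the proposed flag; off would force the fallback

def guarded_sin(xs, fast_sin=math.sin):
    # Dispatch: vectorized path only when every input is in range.
    # fast_sin is a stand-in for a SLEEF-style vectorized kernel.
    if USE_VECTORIZED and all(abs(x) <= SINF_VALID_RANGE for x in xs):
        return [fast_sin(x) for x in xs]  # fast vectorized path
    return [math.sin(x) for x in xs]      # scalar libm fallback

print(guarded_sin([0.0, 1.0]))   # in range: fast path
print(guarded_sin([1e10]))       # out of range: libm fallback
```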