
add AVX2 implementation for sigmoid function #5010

Merged
merged 3 commits into from Mar 23, 2018

Conversation


@vedanuj vedanuj commented Feb 2, 2018

This PR introduces an AVX2 optimization for the float sigmoid function (issue #4929). The benchmark below shows a ~10x speedup.

  • Added an AVX2-vectorized sigmoid using the 8-way vectorized exp (exp256_ps) from avx_mathfun.h.

  • Implemented vector dispatch for sigmoid. Since the sigmoid function is defined only for floats and doubles, for now a preprocessor #ifdef initializes the sigmoid dispatch only for those two types.

  • The vector functions in THVector.h were not being called for all of the basic float/double functions. Changed the LAB_IMPLEMENT_BASIC_FUNCTION macro in THTensorMath.c to use the THVector_(NAME) implementations when the inputs are contiguous; functions without vectorized SIMD implementations fall back to the same default functions from THMath.h.

Benchmark

Non-vectorized sigmoid:

In [1]: import torch
In [2]: x = torch.randn(10000,10000)
In [3]: %time _ = x.sigmoid()
CPU times: user 2.8 s, sys: 130 ms, total: 2.93 s
Wall time: 737 ms
In [1]: import torch
In [2]: x = torch.randn(1000,1000)
In [3]: %time _ = x.sigmoid()
CPU times: user 29.1 ms, sys: 4.16 ms, total: 33.3 ms
Wall time: 8.63 ms

AVX2-vectorized sigmoid:

In [1]: import torch
In [2]: x = torch.randn(10000,10000)
In [3]: %time _ = x.sigmoid()
CPU times: user 206 ms, sys: 106 ms, total: 312 ms
Wall time: 78.2 ms
In [14]: x = torch.randn(1000,1000)
In [15]: %time _ = x.sigmoid()
CPU times: user 179 µs, sys: 2.95 ms, total: 3.13 ms
Wall time: 858 µs

@vedanuj vedanuj changed the title add AVX2 implementation for sigmoid function #4929 add AVX2 implementation for sigmoid function Feb 2, 2018

vedanuj commented Feb 2, 2018

PR for Issue #4929
@zdevito

@vedanuj vedanuj changed the title add AVX2 implementation for sigmoid function add AVX2 implementation for sigmoid function (#4929) Feb 2, 2018
@vedanuj vedanuj changed the title add AVX2 implementation for sigmoid function (#4929) add AVX2 implementation for sigmoid function Feb 2, 2018

soumith commented Feb 2, 2018

@pytorchbot add to whitelist

@vedanuj vedanuj changed the title add AVX2 implementation for sigmoid function [WIP] add AVX2 implementation for sigmoid function Feb 3, 2018
@vedanuj vedanuj changed the title [WIP] add AVX2 implementation for sigmoid function add AVX2 implementation for sigmoid function Feb 3, 2018

@zdevito zdevito left a comment


This looks good! Let's track down what is going on with the THVector_ dispatch to make sure adding it doesn't degrade performance in some way for the non-vectorized functions.

for (i = 0; i <= ((n)-16); i += 16) {
  YMM0 = _mm256_loadu_ps(x + i);
  YMM1 = _mm256_loadu_ps(x + i + 8);
  YMM0 = _mm256_mul_ps(minus_one, YMM0);


} else { \
  int inOMP = omp_in_parallel(); \
  if( (r_Size > TH_OMP_OVERHEAD_THRESHOLD) && (!inOMP) ){ \
    TH_TENSOR_APPLY2_OMP(r_Size, r_Contig, tContig, real, r_, real, t, *r__data = CFUNC(*t_data);); \
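The quoted macro gates OpenMP on both tensor size and nesting. A simplified, self-contained sketch of that dispatch shape (threshold value and all names illustrative, not the actual TH code):

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define TH_OMP_OVERHEAD_THRESHOLD 100000  /* illustrative value */

static float halve(float v) { return v * 0.5f; }

/* Apply f element-wise; parallelize only for large inputs and only when
   not already inside a parallel region (mirroring the macro's checks). */
static void apply_unary(float *r, const float *t, size_t n,
                        float (*f)(float)) {
  int in_omp = 0;
#ifdef _OPENMP
  in_omp = omp_in_parallel();
#endif
  if (n > TH_OMP_OVERHEAD_THRESHOLD && !in_omp) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < (long)n; i++) r[i] = f(t[i]);
  } else {
    for (size_t i = 0; i < n; i++) r[i] = f(t[i]);  /* serial path */
  }
}
```

The threshold avoids paying thread-spawn overhead on small tensors, and the omp_in_parallel check avoids nested parallel regions.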



vedanuj commented Feb 10, 2018

Changes in the new commit:

  • Added a new macro, LAB_IMPLEMENT_VECTORIZED_FUNCTION, for the vectorized basic functions. Currently only sigmoid uses this macro, which redirects to the vectorized implementation.

  • Replaced _mm256_mul_ps(minus_one, YMM0) with _mm256_sub_ps(zero, YMM0), which should be less expensive, although there is no significant performance improvement since the exponential is the computationally dominant operation.
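One nuance of the subtraction form: 0.0f - x yields +0.0f for x = +0.0f, where a true negation gives -0.0f; that is harmless here because expf treats both zeros identically. Many SIMD libraries instead flip the sign bit with an XOR, which is typically the cheapest option of all. A lane-wise scalar sketch of the three equivalent negations (helper names hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Three ways to negate a float, as a vector lane would see them:
   multiply by -1, subtract from 0, and XOR with the sign bit. */
static float neg_mul(float x) { return -1.0f * x; }
static float neg_sub(float x) { return 0.0f - x; }
static float neg_xor(float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);
  bits ^= 0x80000000u;          /* flip the IEEE-754 sign bit */
  memcpy(&x, &bits, sizeof x);
  return x;
}
```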

(Please re-review the code @zdevito @soumith @fmassa)

@goldsborough

@cpuhrsch do you want to take a look at this? I know you're working on CPU improvements.

@cpuhrsch

@goldsborough I'm writing those directly within ATen native, so this code won't conflict with anything I'm writing right now.

@ezyang ezyang merged commit 83de3a0 into pytorch:master Mar 23, 2018

ezyang commented Mar 23, 2018

There is probably more to look into regarding this code path, but I don't see why we shouldn't take such an obvious improvement for sigmoid.

@vedanuj vedanuj deleted the sigmoid_avx2 branch March 24, 2018 06:18
sighingnow added a commit to sighingnow/pytorch that referenced this pull request Mar 25, 2018
* upstream/master: (663 commits)
  Fix "command not found" error in perf test (pytorch#5982)
  add pip mkl-devel to the error message when mkl is found but mkl headers are not (pytorch#5984)
  Support batch LowerCholeskyTransform (pytorch#5980)
  Linearly interpolating upsampling fix (pytorch#5927)
  Store perf numbers in S3 (pytorch#5951)
  Modidy setup docs for Windows (pytorch#5981)
  Group Normalization (pytorch#5968)
  [distributions] Implement Power transform (pytorch#5976)
  Disable TestBottleneck test_cuda on Windows (pytorch#5977)
  Fix crash when cat-ing empty cuda tensors (pytorch#5971)
  Update no_unions flag for nanopb gen and update ONNX proto files (pytorch#5972)
  Expose gradients w.r.t. input & weight for conv1d, conv2d, conv3d in Python (pytorch#5408)
  Fixed non-determinate preprocessing on DataLoader (pytorch#4640)
  add AVX2 implementation for sigmoid function (pytorch#5010)
  Implement torch.util.bottleneck (pytorch#5216)
  Remove pragma once from cpp file (pytorch#5965)
  fix mvn docs (pytorch#5967)
  Fix incorrect rendering of Tensor.index_*_ doc examples. (pytorch#5969)
  Implement range for loop in script (pytorch#5827)
  Add windows doc (pytorch#5859)
  ...

# Conflicts:
#	aten/src/TH/generic/THTensorMath.c
#	torch/_tensor_docs.py
#	torch/csrc/generic/methods/TensorCompare.cwrap