[pt][quant] Optimized qadd_scalar #34925

Closed
wants to merge 6 commits into from

@dskhudia dskhudia commented Mar 17, 2020

Stack from ghstack:

Adds an optimized path for `quantized::add_scalar` (qadd scalar). On the profiled model, total `quantized::add_scalar` time drops from 55.840ms to 4.637ms across 69 calls.

Before

-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.12%            155.807us        0.12%            155.807us        155.807us        1
quantized::conv2d          25.50%           31.981ms         25.50%           31.981ms         273.343us        117
quantized::add_scalar      44.53%           55.840ms         44.53%           55.840ms         809.281us        69
quantized::relu6           1.25%            1.570ms          1.25%            1.570ms          22.749us         69
quantized::mul_scalar      10.73%           13.449ms         10.73%           13.449ms         194.914us        69
quantized::mul             16.67%           20.904ms         16.67%           20.904ms         227.220us        92
adaptive_avg_pool2d        0.03%            41.713us         0.69%            862.922us        35.955us         24
_adaptive_avg_pool2d       0.65%            821.209us        0.65%            821.209us        34.217us         24
sigmoid                    0.15%            182.344us        0.15%            182.344us        7.928us          23
quantized::add             0.34%            431.939us        0.34%            431.939us        26.996us         16
dropout                    0.00%            1.936us          0.00%            1.936us          1.936us          1
view                       0.01%            10.281us         0.01%            10.281us         10.281us         1
dequantize                 0.00%            4.562us          0.00%            4.562us          4.562us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 125.394ms

After

-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.18%            130.534us        0.18%            130.534us        130.534us        1
quantized::conv2d          42.29%           31.267ms         42.29%           31.267ms         267.243us        117
quantized::add_scalar      6.27%            4.637ms          6.27%            4.637ms          67.205us         69
quantized::relu6           1.77%            1.312ms          1.77%            1.312ms          19.008us         69
quantized::mul_scalar      18.92%           13.991ms         18.92%           13.991ms         202.768us        69
quantized::mul             28.49%           21.059ms         28.49%           21.059ms         228.904us        92
adaptive_avg_pool2d        0.06%            45.242us         1.27%            942.522us        39.272us         24
_adaptive_avg_pool2d       1.21%            897.280us        1.21%            897.280us        37.387us         24
sigmoid                    0.22%            160.282us        0.22%            160.282us        6.969us          23
quantized::add             0.56%            416.276us        0.56%            416.276us        26.017us         16
dropout                    0.00%            1.245us          0.00%            1.245us          1.245us          1
view                       0.01%            7.122us          0.01%            7.122us          7.122us          1
dequantize                 0.01%            5.952us          0.01%            5.952us          5.952us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 73.930ms

Differential Revision: D20500848
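
To put the two profiles in proportion, the arithmetic can be sketched as below. The timings are copied from the Before and After tables; the helper and constant names are made up for illustration and are not part of this PR.

```cpp
#include <cassert>

// Ratio of a "before" timing to an "after" timing; both must use the same unit.
inline double speedup(double before, double after) {
  assert(after > 0.0);
  return before / after;
}

// Timings copied from the profiles above (milliseconds).
constexpr double kAddScalarBeforeMs = 55.840;  // quantized::add_scalar, Self CPU total
constexpr double kAddScalarAfterMs = 4.637;
constexpr double kModelBeforeMs = 125.394;     // Self CPU time total
constexpr double kModelAfterMs = 73.930;

// Speedup of the operator itself (roughly 12x).
inline double add_scalar_speedup() {
  return speedup(kAddScalarBeforeMs, kAddScalarAfterMs);
}

// Speedup of the whole profiled run (roughly 1.7x).
inline double model_speedup() {
  return speedup(kModelBeforeMs, kModelAfterMs);
}

// Milliseconds saved in the operator versus in the whole run.
inline double add_scalar_saved_ms() { return kAddScalarBeforeMs - kAddScalarAfterMs; }
inline double model_saved_ms() { return kModelBeforeMs - kModelAfterMs; }
```

The operator saves about 51.2ms and the whole run about 51.5ms, so nearly all of the end-to-end reduction in these two profiles is attributable to `quantized::add_scalar`; the other operators move by run-to-run noise.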

dskhudia added a commit that referenced this pull request Mar 17, 2020
ghstack-source-id: 100341353
Pull Request resolved: #34925
@dskhudia dskhudia requested review from z-a-f and jamesr66a and removed request for z-a-f March 17, 2020 23:03
dr-ci bot commented Mar 18, 2020

💊 CircleCI build failures summary and remediations

As of commit 82de5b5 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no CircleCI failures yet. 💚 💚



Collaborator

@jamesr66a jamesr66a left a comment

LGTM

return Vec256<c10::qint32>::loadu(result_vals);
#endif
}

Collaborator

Can these be implemented as members on `Vec256<c10::qint32>` instead of free functions?

Contributor Author

@dskhudia dskhudia Mar 19, 2020

They are implemented the same way for float. I am not sure if there is a reason for them to be free functions.
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cpu/vec256/vec256_float.h#L244-L262
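
To make the member-versus-free-function question concrete, here is a toy sketch. `Vec8i` is a hypothetical scalar stand-in, not ATen's `Vec256<c10::qint32>`, and `maximum` is used only as an example operation; neither spelling is claimed to match the code in this PR or in `vec256_float.h`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a SIMD wrapper: 8 lanes of int32, scalar code only.
struct Vec8i {
  std::array<int32_t, 8> lanes{};

  // Load 8 values from possibly unaligned memory.
  static Vec8i loadu(const int32_t* ptr) {
    assert(ptr != nullptr);
    Vec8i v;
    for (int i = 0; i < 8; ++i) v.lanes[i] = ptr[i];
    return v;
  }

  // Member spelling: a.maximum(b)
  Vec8i maximum(const Vec8i& other) const {
    Vec8i out;
    for (int i = 0; i < 8; ++i) {
      out.lanes[i] = lanes[i] > other.lanes[i] ? lanes[i] : other.lanes[i];
    }
    return out;
  }
};

// Free-function spelling: maximum(a, b).
// It forwards to the member here, so both spellings behave identically.
inline Vec8i maximum(const Vec8i& a, const Vec8i& b) {
  return a.maximum(b);
}
```

Both spellings compute the same result. One common argument for the free function is that generic kernels can call `maximum(a, b)` uniformly across vector types through overloads or template specializations, while the member form keeps the operation next to the type's data.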

"Only per tensor affine is supported for now!!");
TORCH_CHECK(
self.qscheme() == kPerTensorAffine,
"Only per tensor affine is supported for now!!");
Contributor

@dskhudia I think the code in the comment below is no longer relevant with your changes. Can you update that as well to reflect the new requant flow?

Contributor Author

Thanks. Good catch. I forgot about it.
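
For readers unfamiliar with what a requantization step computes, a scalar reference can be sketched as follows. It assumes 8-bit unsigned storage and per-tensor affine parameters; `QParams`, `quantize`, `dequantize`, and `add_scalar_ref` are hypothetical names. This is not the vectorized kernel from this PR, and it is not necessarily the formula described in the code comment under discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Per-tensor affine parameters: real = scale * (q - zero_point).
struct QParams {
  double scale;
  int32_t zero_point;
};

// Map a real value to uint8, rounding to nearest and saturating to [0, 255].
inline uint8_t quantize(double real, QParams p) {
  assert(p.scale > 0.0);
  const double q = std::nearbyint(real / p.scale) + p.zero_point;
  return static_cast<uint8_t>(std::min(255.0, std::max(0.0, q)));
}

// Map a uint8 value back to the real number it represents.
inline double dequantize(uint8_t q, QParams p) {
  return p.scale * (static_cast<int32_t>(q) - p.zero_point);
}

// Reference add-scalar: dequantize, add c, requantize with the output params.
inline uint8_t add_scalar_ref(uint8_t q, QParams in, double c, QParams out) {
  return quantize(dequantize(q, in) + c, out);
}
```

With `scale = 0.5` and `zero_point = 10`, the stored value 20 represents 5.0; adding 1.5 gives 6.5, which requantizes to 23 under the same parameters. A reference like this is mainly useful as a correctness oracle for a faster path, since it round-trips every element through floating point.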

dskhudia added a commit that referenced this pull request Mar 19, 2020
Pull Request resolved: #34925

ghstack-source-id: 100470384

Differential Revision: [D20500848](https://our.internmc.facebook.com/intern/diff/D20500848/)
Optimized path for qadd scalar. qadd_scalar time goes down from 55.840ms for a model to 4.637ms.

### Before
```
  -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.12%            155.807us        0.12%            155.807us        155.807us        1
quantized::conv2d          25.50%           31.981ms         25.50%           31.981ms         273.343us        117
quantized::add_scalar      44.53%           55.840ms         44.53%           55.840ms         809.281us        69
quantized::relu6           1.25%            1.570ms          1.25%            1.570ms          22.749us         69
quantized::mul_scalar      10.73%           13.449ms         10.73%           13.449ms         194.914us        69
quantized::mul             16.67%           20.904ms         16.67%           20.904ms         227.220us        92
adaptive_avg_pool2d        0.03%            41.713us         0.69%            862.922us        35.955us         24
_adaptive_avg_pool2d       0.65%            821.209us        0.65%            821.209us        34.217us         24
sigmoid                    0.15%            182.344us        0.15%            182.344us        7.928us          23
quantized::add             0.34%            431.939us        0.34%            431.939us        26.996us         16
dropout                    0.00%            1.936us          0.00%            1.936us          1.936us          1
view                       0.01%            10.281us         0.01%            10.281us         10.281us         1
dequantize                 0.00%            4.562us          0.00%            4.562us          4.562us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 125.394ms
```
### After
```
 -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.18%            130.534us        0.18%            130.534us        130.534us        1
quantized::conv2d          42.29%           31.267ms         42.29%           31.267ms         267.243us        117
quantized::add_scalar      6.27%            4.637ms          6.27%            4.637ms          67.205us         69
quantized::relu6           1.77%            1.312ms          1.77%            1.312ms          19.008us         69
quantized::mul_scalar      18.92%           13.991ms         18.92%           13.991ms         202.768us        69
quantized::mul             28.49%           21.059ms         28.49%           21.059ms         228.904us        92
adaptive_avg_pool2d        0.06%            45.242us         1.27%            942.522us        39.272us         24
_adaptive_avg_pool2d       1.21%            897.280us        1.21%            897.280us        37.387us         24
sigmoid                    0.22%            160.282us        0.22%            160.282us        6.969us          23
quantized::add             0.56%            416.276us        0.56%            416.276us        26.017us         16
dropout                    0.00%            1.245us          0.00%            1.245us          1.245us          1
view                       0.01%            7.122us          0.01%            7.122us          7.122us          1
dequantize                 0.01%            5.952us          0.01%            5.952us          5.952us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 73.930ms
```

Differential Revision: [D20500848](https://our.internmc.facebook.com/intern/diff/D20500848/)

[ghstack-poisoned]
Optimized path for qadd scalar. qadd_scalar time goes down from 55.840ms for a model to 4.637ms.

### Before
```
  -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.12%            155.807us        0.12%            155.807us        155.807us        1
quantized::conv2d          25.50%           31.981ms         25.50%           31.981ms         273.343us        117
quantized::add_scalar      44.53%           55.840ms         44.53%           55.840ms         809.281us        69
quantized::relu6           1.25%            1.570ms          1.25%            1.570ms          22.749us         69
quantized::mul_scalar      10.73%           13.449ms         10.73%           13.449ms         194.914us        69
quantized::mul             16.67%           20.904ms         16.67%           20.904ms         227.220us        92
adaptive_avg_pool2d        0.03%            41.713us         0.69%            862.922us        35.955us         24
_adaptive_avg_pool2d       0.65%            821.209us        0.65%            821.209us        34.217us         24
sigmoid                    0.15%            182.344us        0.15%            182.344us        7.928us          23
quantized::add             0.34%            431.939us        0.34%            431.939us        26.996us         16
dropout                    0.00%            1.936us          0.00%            1.936us          1.936us          1
view                       0.01%            10.281us         0.01%            10.281us         10.281us         1
dequantize                 0.00%            4.562us          0.00%            4.562us          4.562us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 125.394ms
```
### After
```
 -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.18%            130.534us        0.18%            130.534us        130.534us        1
quantized::conv2d          42.29%           31.267ms         42.29%           31.267ms         267.243us        117
quantized::add_scalar      6.27%            4.637ms          6.27%            4.637ms          67.205us         69
quantized::relu6           1.77%            1.312ms          1.77%            1.312ms          19.008us         69
quantized::mul_scalar      18.92%           13.991ms         18.92%           13.991ms         202.768us        69
quantized::mul             28.49%           21.059ms         28.49%           21.059ms         228.904us        92
adaptive_avg_pool2d        0.06%            45.242us         1.27%            942.522us        39.272us         24
_adaptive_avg_pool2d       1.21%            897.280us        1.21%            897.280us        37.387us         24
sigmoid                    0.22%            160.282us        0.22%            160.282us        6.969us          23
quantized::add             0.56%            416.276us        0.56%            416.276us        26.017us         16
dropout                    0.00%            1.245us          0.00%            1.245us          1.245us          1
view                       0.01%            7.122us          0.01%            7.122us          7.122us          1
dequantize                 0.01%            5.952us          0.01%            5.952us          5.952us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 73.930ms
```
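The effect of the change can be read off the two tables with a little arithmetic. The snippet below is a minimal sketch, not part of the PR: the variable and function names are made up for illustration, and the only inputs are the `quantized::add_scalar` and `Self CPU time total` figures copied from the Before and After tables above.

```python
# Figures copied from the Before/After profiler tables (milliseconds).
# All names here are illustrative; none of them come from the PR itself.
before = {"add_scalar_ms": 55.840, "total_ms": 125.394, "calls": 69}
after = {"add_scalar_ms": 4.637, "total_ms": 73.930, "calls": 69}


def speedup(old_ms: float, new_ms: float) -> float:
    """Return how many times faster the new timing is than the old one."""
    return old_ms / new_ms


# Speedup of the operator itself and of the whole profiled run.
add_scalar_speedup = speedup(before["add_scalar_ms"], after["add_scalar_ms"])
total_speedup = speedup(before["total_ms"], after["total_ms"])

# Absolute savings, and how much of the end-to-end saving add_scalar explains.
add_scalar_saved_ms = before["add_scalar_ms"] - after["add_scalar_ms"]
total_saved_ms = before["total_ms"] - after["total_ms"]
share_of_saving = add_scalar_saved_ms / total_saved_ms

# Average time per call, in microseconds, for comparison with "CPU time avg".
per_call_before_us = before["add_scalar_ms"] / before["calls"] * 1000
per_call_after_us = after["add_scalar_ms"] / after["calls"] * 1000

if __name__ == "__main__":
    print(f"add_scalar speedup: {add_scalar_speedup:.2f}x")
    print(f"end-to-end speedup: {total_speedup:.2f}x")
    print(f"add_scalar saved:   {add_scalar_saved_ms:.3f} ms")
    print(f"total saved:        {total_saved_ms:.3f} ms")
    print(f"share of saving:    {share_of_saving:.1%}")
    print(f"per call:           {per_call_before_us:.1f} us -> {per_call_after_us:.1f} us")
```

On these numbers, `quantized::add_scalar` gets roughly 12x faster and the profiled run as a whole roughly 1.7x faster, with almost all of the end-to-end saving coming from that one operator. The other rows move by small amounts in both directions, which looks like ordinary run-to-run noise rather than an effect of the change.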

Differential Revision: [D20500848](https://our.internmc.facebook.com/intern/diff/D20500848/)

[ghstack-poisoned]
dskhudia added a commit that referenced this pull request Mar 19, 2020
Pull Request resolved: #34925

ghstack-source-id: 100489126

Differential Revision: [D20500848](https://our.internmc.facebook.com/intern/diff/D20500848/)
dskhudia added a commit that referenced this pull request Mar 20, 2020
Pull Request resolved: #34925

ghstack-source-id: 100559730

Differential Revision: [D20500848](https://our.internmc.facebook.com/intern/diff/D20500848/)
dskhudia added a commit that referenced this pull request Mar 20, 2020
Pull Request resolved: #34925

ghstack-source-id: 100595212

Differential Revision: [D20500848](https://our.internmc.facebook.com/intern/diff/D20500848/)
@facebook-github-bot
Contributor

This pull request has been merged in 506996c.

@facebook-github-bot deleted the gh/dskhudia/18/head branch on March 27, 2020 14:16