
Conversation


supriyar commented Nov 6, 2020

Stack from ghstack:

Summary:

Fixed a bug where we were always setting the device to 'cuda' (irrespective of the device id) in the `calculate_qparams` function.

Fixes #46533

Test Plan:
python test/test_quantization.py TestObserver.test_observer_qparams_respects_device_affinity

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D24800495

Summary:
Previously the scale and zero_point were returned on the CPU even if the input tensor was on the GPU. This is because `copy_()` doesn't respect the device when copying over the tensor.

Also fixed a bug where we were always setting the device to 'cuda' (irrespective of the device id) in the `calculate_qparams` function (see the illustrative sketch below).

Test Plan:
python test/test_quantization.py TestObserver.test_observer_qparams_respects_device_affinity

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
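For reference, a minimal sketch of the device-affinity issue described above. This is illustrative only, not the actual PyTorch diff; the functions below are simplified, hypothetical stand-ins for the observer's `calculate_qparams`.

import torch

# Simplified stand-in for an observer's calculate_qparams (hypothetical,
# not the real torch.quantization implementation).
def calculate_qparams_buggy(min_val: torch.Tensor, max_val: torch.Tensor):
    # Bug: hard-coding 'cuda' drops the device index, so stats collected
    # on cuda:1 produce qparams on cuda:0.
    device = 'cuda' if min_val.is_cuda else 'cpu'
    scale = torch.ones(min_val.numel(), dtype=torch.float32, device=device)
    zero_point = torch.zeros(min_val.numel(), dtype=torch.int64, device=device)
    return scale, zero_point

def calculate_qparams_fixed(min_val: torch.Tensor, max_val: torch.Tensor):
    # Fix: take the full device (type and index) from the collected stats.
    device = min_val.device
    scale = torch.ones(min_val.numel(), dtype=torch.float32, device=device)
    zero_point = torch.zeros(min_val.numel(), dtype=torch.int64, device=device)
    return scale, zero_point

if torch.cuda.device_count() > 1:
    min_val = torch.tensor([0.0], device='cuda:1')
    max_val = torch.tensor([1.0], device='cuda:1')
    scale, _ = calculate_qparams_fixed(min_val, max_val)
    assert scale.device == min_val.device  # qparams respect device affinity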
supriyar added a commit that referenced this pull request Nov 6, 2020
ghstack-source-id: 15d3d50
Pull Request resolved: #47514

dr-ci bot commented Nov 6, 2020

💊 CI failures summary and remediations

As of commit be9df93 (more details on the Dr. CI page):


  • 2/2 failures possibly introduced in this PR
    • 2/2 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


codecov.io: 1 failed




supriyar commented Nov 6, 2020

QAT Benchmark on MobilenetV2
Before change:

PyTorchObserver {"type": "_", "metric": "qnnpack_fp_latency_ms", "unit": "ms", "value": "229.25734519958496"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat0_latency_ms", "unit": "ms", "value": "522.6466655731201"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat1_obs_0_latency_ms", "unit": "ms", "value": "261.6147994995117"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat2_obs_0_bn_0_latency_ms", "unit": "ms", "value": "250.9593963623047"}
PyTorchObserver {"type": "_", "metric": "qnnpack_fp_cuda_max_mem_mib", "unit": "MiB", "value": "246.51806640625"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat0_cuda_max_mem_mib", "unit": "MiB", "value": "314.82080078125"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat1_obs_0_cuda_max_mem_mib", "unit": "MiB", "value": "314.82080078125"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat2_obs_0_bn_0_cuda_max_mem_mib", "unit": "MiB", "value": "314.66650390625"}
/home/supriyar/local/fbsource/fbcode/buck-out/dev/gen/caffe2/caffe2/fb/high_perf_models/pytorch/mobilenetv2/mobilenetv2_bench#link-tree/torch/quantization/observer.py:123: UserWarning: Please use quant_min and quant_max to specify the range for observers. reduce_range will be deprecated in a future release of PyTorch.
PyTorchObserver {"type": "_", "metric": "fbgemm_fp_latency_ms", "unit": "ms", "value": "231.16016387939453"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat0_latency_ms", "unit": "ms", "value": "546.6287136077881"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat1_obs_0_latency_ms", "unit": "ms", "value": "282.5195789337158"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat2_obs_0_bn_0_latency_ms", "unit": "ms", "value": "268.3393955230713"}
PyTorchObserver {"type": "_", "metric": "fbgemm_fp_cuda_max_mem_mib", "unit": "MiB", "value": "246.80126953125"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat0_cuda_max_mem_mib", "unit": "MiB", "value": "315.10400390625"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat1_obs_0_cuda_max_mem_mib", "unit": "MiB", "value": "315.10400390625"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat2_obs_0_bn_0_cuda_max_mem_mib", "unit": "MiB", "value": "314.94970703125"}

After change:

PyTorchObserver {"type": "_", "metric": "qnnpack_fp_latency_ms", "unit": "ms", "value": "221.08221054077148"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat0_latency_ms", "unit": "ms", "value": "506.84404373168945"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat1_obs_0_latency_ms", "unit": "ms", "value": "261.29603385925293"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat2_obs_0_bn_0_latency_ms", "unit": "ms", "value": "239.6218776702881"}
PyTorchObserver {"type": "_", "metric": "qnnpack_fp_cuda_max_mem_mib", "unit": "MiB", "value": "246.51806640625"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat0_cuda_max_mem_mib", "unit": "MiB", "value": "314.82080078125"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat1_obs_0_cuda_max_mem_mib", "unit": "MiB", "value": "314.82080078125"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat2_obs_0_bn_0_cuda_max_mem_mib", "unit": "MiB", "value": "314.66650390625"}
/home/supriyar/local/fbsource/fbcode/buck-out/dev/gen/caffe2/caffe2/fb/high_perf_models/pytorch/mobilenetv2/mobilenetv2_bench#link-tree/torch/quantization/observer.py:123: UserWarning: Please use quant_min and quant_max to specify the range for observers. reduce_range will be deprecated in a future release of PyTorch.
PyTorchObserver {"type": "_", "metric": "fbgemm_fp_latency_ms", "unit": "ms", "value": "214.65778350830078"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat0_latency_ms", "unit": "ms", "value": "502.2757053375244"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat1_obs_0_latency_ms", "unit": "ms", "value": "304.18944358825684"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat2_obs_0_bn_0_latency_ms", "unit": "ms", "value": "258.211612701416"}
PyTorchObserver {"type": "_", "metric": "fbgemm_fp_cuda_max_mem_mib", "unit": "MiB", "value": "246.80126953125"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat0_cuda_max_mem_mib", "unit": "MiB", "value": "315.10400390625"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat1_obs_0_cuda_max_mem_mib", "unit": "MiB", "value": "315.10400390625"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat2_obs_0_bn_0_cuda_max_mem_mib", "unit": "MiB", "value": "314.94970703125"}

supriyar requested review from jerryzh168 and vkuzo November 6, 2020 21:14

vkuzo commented Nov 6, 2020

wow, interesting, great catch

self.min_val = self.min_val.copy_(min_val).to(min_val.device)

is the self.min_val.copy_(min_val).to(min_val.device) code actually doing CUDA -> CPU -> CUDA? If yes, is there a way to remove that?


supriyar commented Nov 6, 2020

wow, interesting, great catch

self.min_val = self.min_val.copy_(min_val).to(min_val.device)

is the self.min_val.copy_(min_val).to(min_val.device) code actually doing CUDA -> CPU -> CUDA? If yes, is there a way to remove that?

Hmm, it might. I think I can use `y = x.cuda(device=x.device)` instead to copy to the same device if the input is a CUDA tensor.

Update: It would appear that using `x.cuda` isn't scriptable and seems to map to the wrong function.
Repro:

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.x = torch.tensor(0)

    def forward(self, input):
        x = input.cuda(device=input.device)
        return x

m = M().eval()
input = torch.rand(5, device='cuda:1')
print(m(input))
script = torch.jit.script(m)
RuntimeError:

aten::cuda(Tensor(a) self) -> (Tensor(b|a)):
Keyword argument device unknown.
:
  File "/home/supriyar/pytorch/test/quantization/test_workflow_module.py", line 514
            def forward(self, input):
                x = input.cuda(device=input.device)
                    ~~~~~~~~~~ <--- HERE
                return x

I can instead do the `to` call first, before the `copy_`, to make sure we move the tensor to CUDA before doing the copy. @vkuzo thoughts on this approach?
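A rough sketch of that ordering on a toy module (illustrative only; `MinMaxBufferSketch` is a made-up name, and the real observer code also handles resizing and per-channel shapes):

import torch

class MinMaxBufferSketch(torch.nn.Module):
    # Toy observer-like module showing the proposed order of operations:
    # move the buffer with .to() first, then copy_() the new value in place.
    def __init__(self):
        super().__init__()
        self.register_buffer('min_val', torch.tensor(float('inf')))

    def update(self, x):
        new_min = torch.min(x)
        # .to() is a no-op when the buffer is already on the right device,
        # so repeated calls avoid a CUDA -> CPU -> CUDA round trip.
        self.min_val = self.min_val.to(new_min.device)
        self.min_val.copy_(new_min)

obs = MinMaxBufferSketch()
obs.update(torch.rand(5))                     # CPU input
if torch.cuda.is_available():
    obs.update(torch.rand(5, device='cuda'))  # buffer follows the input device
    print(obs.min_val.device)                 # prints cuda:0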


vkuzo commented Nov 7, 2020

I can instead do the `to` call first, before the `copy_`, to make sure we move the tensor to CUDA before doing the copy. @vkuzo thoughts on this approach?

that would make sense to me, and the performance metrics don't show a slowdown, which is good.

Doesn't have to block this bugfix, but I think in the future it might make sense to rethink how we do device management in observers. Ideally we can enforce strong assumptions that all observer buffers are on the correct device after initialization and any to calls, and then just throw an error with a stack trace when any mismatch is detected. Then we could remove all the custom code which moves tensors between devices.
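One possible shape of that stricter approach (purely a sketch of the idea, not code from this PR; `assert_same_device` is a hypothetical helper, not part of torch.quantization):

import torch

def assert_same_device(buffer: torch.Tensor, x: torch.Tensor, name: str) -> None:
    # Hypothetical helper: fail loudly when an observer buffer and the
    # observed tensor disagree on device, instead of silently moving data.
    if buffer.device != x.device:
        raise RuntimeError(
            f"Observer buffer '{name}' is on {buffer.device}, but the observed "
            f"tensor is on {x.device}; move the observer with .to() before calling it."
        )

# Usage sketch inside a hypothetical observer forward():
#   assert_same_device(self.min_val, x, 'min_val')
#   assert_same_device(self.max_val, x, 'max_val')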

supriyar added a commit that referenced this pull request Nov 7, 2020
ghstack-source-id: 2cbd6f3
Pull Request resolved: #47514
supriyar added a commit that referenced this pull request Nov 7, 2020
ghstack-source-id: 90eb81d
Pull Request resolved: #47514
vkuzo left a comment


lg, the test failure in QuantizePerChannel4d looks potentially related

supriyar added a commit that referenced this pull request Nov 10, 2020
ghstack-source-id: a3dc910
Pull Request resolved: #47514

codecov bot commented Nov 10, 2020

Codecov Report

Merging #47514 (be9df93) into gh/supriyar/206/base (8b3f1d1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@                  Coverage Diff                  @@
##           gh/supriyar/206/base   #47514   +/-   ##
=====================================================
  Coverage                 80.78%   80.79%           
=====================================================
  Files                      1809     1809           
  Lines                    190036   190036           
=====================================================
+ Hits                     153526   153535    +9     
+ Misses                    36510    36501    -9     

@facebook-github-bot

This pull request has been merged in 6bb18b2.

facebook-github-bot deleted the gh/supriyar/206/head branch November 14, 2020 15:17