Conversation


@firstprayer firstprayer commented Aug 7, 2020

Stack from ghstack:

Differential Revision: D23010859

[ghstack-poisoned]
firstprayer pushed a commit that referenced this pull request Aug 7, 2020
ghstack-source-id: 8dfb2b3
Pull Request resolved: #42754
@firstprayer (Author)

Summary

Address issue #41696

Test Plan:

I used the MNIST example in pytorch/examples, manually changed the zero_grad() call to reset_grad(), and checked whether that affects training precision:

Before:

Train Epoch: 3 [57600/60000 (96%)]      Loss: 0.049052
Train Epoch: 3 [58240/60000 (97%)]      Loss: 0.072383
Train Epoch: 3 [58880/60000 (98%)]      Loss: 0.004111
Train Epoch: 3 [59520/60000 (99%)]      Loss: 0.001328

Test set: Average loss: 0.0368, Accuracy: 9867/10000 (99%)

After:

Train Epoch: 3 [58880/60000 (98%)]      Loss: 0.004111
Train Epoch: 3 [59520/60000 (99%)]      Loss: 0.001328
Test set: Average loss: 0.0368, Accuracy: 9867/10000 (99%)
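For context, the swap being tested can be sketched with a toy stand-in (this is not the MNIST example itself, and it uses the zero_grad(set_to_none=True) API that this PR eventually settled on rather than the earlier reset_grad() name):

```python
import torch
import torch.nn as nn

# Toy stand-in for one training step; the only change under test is how
# gradients are cleared between iterations.
model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

# Old behavior: .grad buffers are zeroed in place.
opt.zero_grad(set_to_none=False)
assert all(torch.equal(p.grad, torch.zeros_like(p)) for p in model.parameters())

# Behavior added by this PR: .grad buffers are dropped entirely.
opt.zero_grad(set_to_none=True)
assert all(p.grad is None for p in model.parameters())
```

Since the optimizer update itself is unchanged, the training curves before and after the swap are expected to match, which is what the logs above show.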


dr-ci bot commented Aug 7, 2020

💊 CI failures summary and remediations

As of commit 82594d2 (more details on the Dr. CI page):



🕵️ 9 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/9)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py 
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_definitions.py 
Auto-merging .circleci/cimodel/data/pytorch_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/9)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py 
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_definitions.py 
Auto-merging .circleci/cimodel/data/pytorch_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (3/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 02 20:32:15 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Sep 02 20:32:15     raise RuntimeError(err) 
Sep 02 20:32:15 RuntimeError: test_type_hints failed! 
Sep 02 20:32:15  
Sep 02 20:32:15 real	33m12.777s 
Sep 02 20:32:15 user	50m13.078s 
Sep 02 20:32:15 sys	2m55.257s 
Sep 02 20:32:15 + cleanup 
Sep 02 20:32:15 =================== sccache compilation log =================== 
Sep 02 20:32:15 + retcode=1 
Sep 02 20:32:15 + set +x 
Sep 02 20:32:15 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Sep 02 20:32:15  
Sep 02 20:32:15 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 02 20:32:16 Compile requests                 0 
Sep 02 20:32:16 Compile requests executed        0 
Sep 02 20:32:16 Cache hits                       0 
Sep 02 20:32:16 Cache misses                     0 
Sep 02 20:32:16 Cache timeouts                   0 
Sep 02 20:32:16 Cache read errors                0 
Sep 02 20:32:16 Forced recaches                  0 
Sep 02 20:32:16 Cache write errors               0 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_dynamic (4/9)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

lExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_dynamic/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n return ((int*)(&strtod_l))[argc];\n ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_dynamic/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" }
Sep 02 19:59:34     input: Tensor) -> Tensor: 
Sep 02 19:59:34     input0 = torch._convolution(input, self.weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True) 
Sep 02 19:59:34              ~~~~~~~~~~~~~~~~~~ <--- HERE 
Sep 02 19:59:34     return input0 
Sep 02 19:59:34  
Sep 02 19:59:34 test/mobile/custom_build/build.sh: line 98: 21532 Aborted                 (core dumped) ./Predictor "${MODEL}" > output.txt 
Sep 02 19:59:34 =================== sccache compilation log =================== 
Sep 02 19:59:34 + sccache_epilogue 
Sep 02 19:59:34 + echo '=================== sccache compilation log ===================' 
Sep 02 19:59:34 + python /var/lib/jenkins/workspace/.jenkins/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 
Exists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_dynamic/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n   return ((int*)(&strtod_l))[argc];\n                   ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_dynamic/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" } 
Sep 02 19:59:34  
Sep 02 19:59:34 + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 
Sep 02 19:59:34 + sccache --show-stats 
Sep 02 19:59:34 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 02 19:59:34 Compile requests              2813 
Sep 02 19:59:34 Compile requests executed     2182 
Sep 02 19:59:34 Cache hits                      22 
Sep 02 19:59:34 Cache misses                  2156 
Sep 02 19:59:34 Cache timeouts                   0 
Sep 02 19:59:34 Cache read errors                0 

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (5/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 02 19:57:57 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in
Sep 02 19:57:57     #7 0x5598541ee7eb in PyEval_EvalCode /tmp/build/80754af9/python_1588903631989/work/Python/ceval.c:731 
Sep 02 19:57:57     #8 0x55985426ee73 in run_mod /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:1025 
Sep 02 19:57:57     #9 0x55985426ef0c in PyRun_StringFlags /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:949 
Sep 02 19:57:57     #10 0x55985426ef6e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:445 
Sep 02 19:57:57     #11 0x559854272d72 in run_command /tmp/build/80754af9/python_1588903631989/work/Modules/main.c:301 
Sep 02 19:57:57     #12 0x559854272d72 in Py_Main /tmp/build/80754af9/python_1588903631989/work/Modules/main.c:749 
Sep 02 19:57:57     #13 0x55985413cf2d in main /tmp/build/80754af9/python_1588903631989/work/Programs/python.c:69 
Sep 02 19:57:57     #14 0x7fcf3981183f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291 
Sep 02 19:57:57     #15 0x55985421c27e in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103 
Sep 02 19:57:57  
Sep 02 19:57:57 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in  
Sep 02 19:57:57 + retcode=1 
Sep 02 19:57:57 + set -e 
Sep 02 19:57:57 + return 1 
Sep 02 19:57:57 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]] 
Sep 02 19:57:57 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]] 
Sep 02 19:57:57 + '[' -n https://github.com/pytorch/pytorch/pull/42754 ']' 
Sep 02 19:57:57 ++ mktemp 
Sep 02 19:57:57 + DETERMINE_FROM=/tmp/tmp.a0JXeZfXGD 
Sep 02 19:57:57 + file_diff_from_base /tmp/tmp.a0JXeZfXGD 
Sep 02 19:57:57 + set +e 

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_test (6/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 02 20:39:24 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:22: error: expected ‘;’ before ‘}’ token\n 2 | int main() { return 0 }\n | ^~\n | ;\n" }
Sep 02 20:39:24     raise RuntimeError(err) 
Sep 02 20:39:24 RuntimeError: test_type_hints failed! 
Sep 02 20:39:24  
Sep 02 20:39:24 real	32m49.888s 
Sep 02 20:39:24 user	35m47.941s 
Sep 02 20:39:24 sys	1m33.169s 
Sep 02 20:39:24 + cleanup 
Sep 02 20:39:24 + retcode=1 
Sep 02 20:39:24 + set +x 
Sep 02 20:39:24 =================== sccache compilation log =================== 
Sep 02 20:39:24 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:22: error: expected ‘;’ before ‘}’ token\n    2 | int main() { return 0 }\n      |                      ^~\n      |                      ;\n" } 
Sep 02 20:39:24  
Sep 02 20:39:24 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 02 20:39:24 Compile requests                 65 
Sep 02 20:39:24 Compile requests executed        35 
Sep 02 20:39:24 Cache hits                       25 
Sep 02 20:39:24 Cache misses                      9 
Sep 02 20:39:24 Cache timeouts                    0 
Sep 02 20:39:24 Cache read errors                 0 
Sep 02 20:39:24 Forced recaches                   0 
Sep 02 20:39:24 Cache write errors                0 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_static (7/9)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

bolExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n return ((int*)(&strtod_l))[argc];\n ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" }
Sep 02 19:37:00     input: Tensor) -> Tensor: 
Sep 02 19:37:00     input0 = torch._convolution(input, self.weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True) 
Sep 02 19:37:00              ~~~~~~~~~~~~~~~~~~ <--- HERE 
Sep 02 19:37:00     return input0 
Sep 02 19:37:00  
Sep 02 19:37:01 test/mobile/custom_build/build.sh: line 98: 17404 Aborted                 (core dumped) ./Predictor "${MODEL}" > output.txt 
Sep 02 19:37:01 + sccache_epilogue 
Sep 02 19:37:01 + echo '=================== sccache compilation log ===================' 
Sep 02 19:37:01 + python /var/lib/jenkins/workspace/.jenkins/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 
Sep 02 19:37:01 =================== sccache compilation log =================== 
olExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n   return ((int*)(&strtod_l))[argc];\n                   ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" } 
Sep 02 19:37:01  
Sep 02 19:37:01 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 02 19:37:01 + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 
Sep 02 19:37:01 + sccache --show-stats 
Sep 02 19:37:01 Compile requests              2231 
Sep 02 19:37:01 Compile requests executed     1600 
Sep 02 19:37:01 Cache hits                      17 
Sep 02 19:37:01 Cache misses                  1579 
Sep 02 19:37:01 Cache timeouts                   0 
Sep 02 19:37:01 Cache read errors                0 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (8/9)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Tmp/CheckSymbolExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n return ((int*)(&strtod_l))[argc];\n ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" }
Sep 02 19:32:37     input: Tensor) -> Tensor: 
Sep 02 19:32:37     input0 = torch._convolution(input, self.weight, None, [2, 2], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True) 
Sep 02 19:32:37              ~~~~~~~~~~~~~~~~~~ <--- HERE 
Sep 02 19:32:37     return input0 
Sep 02 19:32:37  
Sep 02 19:32:37 test/mobile/custom_build/build.sh: line 98: 13392 Aborted                 (core dumped) ./Predictor "${MODEL}" > output.txt 
Sep 02 19:32:37 =================== sccache compilation log =================== 
Sep 02 19:32:37 + sccache_epilogue 
Sep 02 19:32:37 + echo '=================== sccache compilation log ===================' 
Sep 02 19:32:37 + python /var/lib/jenkins/workspace/.jenkins/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 
mp/CheckSymbolExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n   return ((int*)(&strtod_l))[argc];\n                   ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" } 
Sep 02 19:32:37  
Sep 02 19:32:37 + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 
Sep 02 19:32:37 + sccache --show-stats 
Sep 02 19:32:37 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 02 19:32:37 Compile requests              2229 
Sep 02 19:32:37 Compile requests executed     1599 
Sep 02 19:32:37 Cache hits                    1350 
Sep 02 19:32:37 Cache misses                   245 
Sep 02 19:32:37 Cache timeouts                   0 
Sep 02 19:32:37 Cache read errors                0 

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (9/9)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py 
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_definitions.py 
Auto-merging .circleci/cimodel/data/pytorch_build_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py 
Auto-merging .circleci/cimodel/data/dimensions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py 
Auto-merging .circleci/cimodel/data/binary_build_data.py 
Automatic merge failed; fix conflicts and then commit the result. 

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 02 22:17:22 ConnectionResetError: [Errno 104] Connection reset by peer
Sep 02 22:17:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 455, in accept 
Sep 02 22:17:22     deliver_challenge(c, self._authkey) 
Sep 02 22:17:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge 
Sep 02 22:17:22     response = connection.recv_bytes(256)        # reject large message 
Sep 02 22:17:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes 
Sep 02 22:17:22     buf = self._recv_bytes(maxlength) 
Sep 02 22:17:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes 
Sep 02 22:17:22     buf = self._recv(4) 
Sep 02 22:17:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv 
Sep 02 22:17:22     chunk = read(handle, remaining) 
Sep 02 22:17:22 ConnectionResetError: [Errno 104] Connection reset by peer 
Sep 02 22:17:23 /opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown 
Sep 02 22:17:23   len(cache)) 
Sep 02 22:17:25 Process ErrorTrackingProcess-156: 
Sep 02 22:17:25 Traceback (most recent call last): 
Sep 02 22:17:25   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap 
Sep 02 22:17:25     self.run() 
Sep 02 22:17:25   File "/var/lib/jenkins/workspace/test/test_dataloader.py", line 361, in run 
Sep 02 22:17:25     super(ErrorTrackingProcess, self).run() 
Sep 02 22:17:25   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run 
Sep 02 22:17:25     self._target(*self._args, **self._kwargs) 

This comment was automatically generated by Dr. CI.

@ezyang ezyang requested review from albanD and vincentqb August 10, 2020 22:28

mcarilli commented Aug 20, 2020

Optional suggestion: #41696 (comment)

Regardless of which API you choose, you should apply similar treatment to model.zero_grad.

for group in self.param_groups:
    for p in group['params']:
        if p.grad is not None:
            p.grad.detach_()
Collaborator:

Why do we need to detach_()? Afaict the only way this has an effect is if something external also holds a reference to .grad and .grad was created in a create_graph=True backward pass.

Author:

Yeah, that sounds like a possible scenario, so I'm trying to be safe here.

@mcarilli (Collaborator), Aug 24, 2020:

My point is detach_ may be the unsafe option. Because it affects p.grad in place, it silently affects anything else holding a reference to p.grad. Setting p.grad = None without detach_ simply drops our reference. Anything else holding a reference to grad will not see an effect.

Admittedly, I'm not sure why the default zeroing behavior of zero_grad performs detach_ then zero_. I assume it's to avoid building up spurious autograd history if the grad was created with create_graph=True and therefore requires grad. (edit: related, avoids memory leak when grad has grad_fn)

If so, in the alternative set-to-None path, we don't need detach_. We drop the reference to .grad, we don't perform any ops on it, therefore there's no danger of building spurious autograd history. And if we don't detach_, we don't risk silently affecting other references to grad.

tl;dr I think not detaching is the safe option here.

@albanD you're good with these tricky cases... Why is detach_ used with the default zeroing behavior of zero_grad? Also, do you agree not detaching is the safer implementation for the set-to-None path?
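The hazard described above can be sketched as follows (a minimal illustration, assuming the grad was produced by a create_graph=True backward so it carries autograd history):

```python
import torch

p = torch.randn(3, requires_grad=True)
(p * p).sum().backward(create_graph=True)  # .grad now carries autograd history

external_ref = p.grad          # simulate external code also holding the grad tensor
assert external_ref.grad_fn is not None

# In-place detach_ silently changes the externally held tensor too:
p.grad.detach_()
assert external_ref.grad_fn is None

# Setting p.grad = None instead only drops this reference; external_ref
# would have kept its autograd history untouched.
p.grad = None
```

This is why dropping the reference is the less invasive option: it has no observable effect on any other holder of the gradient tensor.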

Collaborator:

The original code is trying, as much as possible, to avoid changing the Tensor referenced by .grad, I think.
And yes, as mentioned below, I do agree that we don't want to detach here.


vincentqb commented Aug 20, 2020

Thanks for running an example. Out of curiosity, did it change the runtime and memory?


vincentqb commented Aug 20, 2020

Responded about API on original issue.

@firstprayer firstprayer requested a review from apaszke as a code owner August 22, 2020 18:58
@firstprayer (Author)

Thanks for running an example. Out of curiosity, did it change the runtime and memory?

It doesn't seem to lead to a noticeable difference in runtime, but since I'm running a toy example it's probably hard to tell. I'm not sure how to measure the memory footprint, though.

firstprayer pushed a commit that referenced this pull request Aug 22, 2020
ghstack-source-id: 9499a9b
Pull Request resolved: #42754

ngimel commented Aug 22, 2020

Can you please modify test_zero_grad in test_nn.py so that it tests both modes?
@mcarilli, do you think it makes sense to modify grad scaling tests in test_cuda.py to use .zero_grad with both options?

firstprayer pushed a commit that referenced this pull request Aug 22, 2020
ghstack-source-id: 60bd467
Pull Request resolved: #42754
@mcarilli (Collaborator)

Can you please modify test_zero_grad in test_nn.py so that it tests both modes?
@mcarilli, do you think it makes sense to modify grad scaling tests in test_cuda.py to use .zero_grad with both options?

Yes, that's a good suggestion. Should be several one-liners in the tests after this PR is merged.
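Such a test could look roughly like the following sketch (names hypothetical; this is not the actual test_nn.py code, just an illustration of covering both modes):

```python
import torch
import torch.nn as nn

def check_zero_grad(set_to_none):
    # Hypothetical sketch of extending test_zero_grad to cover both modes.
    module = nn.Linear(5, 5)
    module(torch.randn(3, 5)).sum().backward()
    assert module.weight.grad is not None

    module.zero_grad(set_to_none=set_to_none)
    if set_to_none:
        # New mode: the grad attribute is dropped entirely.
        assert module.weight.grad is None
    else:
        # Old mode: the grad buffer is kept but zeroed in place.
        assert torch.equal(module.weight.grad,
                           torch.zeros_like(module.weight))

for mode in (False, True):
    check_zero_grad(mode)
```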

if p.grad is not None:
    if p.grad.grad_fn is not None:
        if set_to_none:
            p.grad.detach_()
Collaborator:

I think you can remove this detach() as you remove it anyway.

r"""Sets gradients of all model parameters to zero.

Arguments:
    set_to_none (bool): instead of setting to zero, set the grad to None.
Collaborator:

Can you link to the nn.optim version of this doc that contains more details about the change of behavior?

Collaborator:

torch.optim, right?

Collaborator:

Yes, typo!

@mcarilli (Collaborator), Sep 1, 2020:

Add

See :meth:`torch.optim.optimizer.zero_grad` for details.

if p.grad is not None:
    if p.grad.grad_fn is not None:
        if set_to_none:
            p.grad.detach_()
Collaborator:

Unneeded detach here as well.


-    def zero_grad(self) -> None:
-        r"""Sets gradients of all model parameters to zero."""
+    def zero_grad(self, set_to_none=False) -> None:
Collaborator:

nit: can you add the type for set_to_none: set_to_none: bool = False


-    def zero_grad(self):
-        r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
+    def zero_grad(self, set_to_none=False):
Collaborator:

Can you update optimizer.pyi as well to reflect this change? Just add the new arg and type there.
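The requested stub change would be roughly one line (a sketch of the new signature only; the actual class body in optimizer.pyi is omitted):

```python
# Sketch of the optimizer.pyi stub update: add the new keyword argument.
class Optimizer:
    def zero_grad(self, set_to_none: bool = ...) -> None: ...
```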

A None attribute or a Tensor full of 0s will be different.
2. User can no longer rely on checking if `.grad` is None to see if a tensor
is touched in the backward pass
3. `nn.optim` optimizers have a different behavior if the gradient is 0 or None
Collaborator:

torch.optim

Arguments:
    set_to_none (bool): instead of setting to zero, set the grad to None.
        This will in general have lower memory footprint, but using this
        comes with caveats, to name a few:
@mcarilli (Collaborator), Aug 24, 2020:

I wouldn't say the changes are caveats. Some are benefits, like point 3 (skipping the update is faster if the grad is None).
Instead of

This will in general have lower memory footprint, but using this comes with caveats, to name a few:

I would say

This is will in general have lower memory footprint, and can modestly improve performance.  However, it changes certain behaviors.  For example:

comes with caveats, to name a few:
1. When user tries to access the gradient value and perform manual ops on it.
A None attribute or a Tensor full of 0s will be different.
2. User can no longer rely on checking if `.grad` is None to see if a tensor
Collaborator:

this doesn't make sense. I'd say it's the opposite: if a user sets grads = None before backward, they CAN check if .grad is None to see if the tensor received a gradient in the backward pass. Let's just describe the behavior:

If the user requests `zero_grad(set_to_none=True)` followed by a backward pass, `.grad`\ s are guaranteed to be None for params that did not receive a gradient.
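The guarantee in that suggested wording can be illustrated with a small sketch (parameter names here are hypothetical):

```python
import torch

used = torch.randn(2, requires_grad=True)
unused = torch.randn(2, requires_grad=True)   # never participates in the loss
opt = torch.optim.SGD([used, unused], lr=0.1)

used.sum().backward()
opt.step()
opt.zero_grad(set_to_none=True)

used.sum().backward()
# After zero_grad(set_to_none=True) followed by a backward pass, .grad is
# None exactly for the params that received no gradient:
assert used.grad is not None
assert unused.grad is None
```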

firstprayer pushed a commit that referenced this pull request Aug 29, 2020
ghstack-source-id: a426a12
Pull Request resolved: #42754
Arguments:
set_to_none (bool): instead of setting to zero, set the grad to None.
This is will in general have lower memory footprint, and can modestly improve performance.
Collaborator:

This is will -> This will

This is will in general have lower memory footprint, and can modestly improve performance.
However, it changes certain behaviors. For example:
1. When user tries to access the gradient value and perform manual ops on it.
A None attribute or a Tensor full of 0s will be different.
Collaborator:

combine sentences

When user tries to access the gradient value and perform manual ops on it,
                a None attribute or a Tensor full of 0s will act differently.

However, it changes certain behaviors. For example:
1. When user tries to access the gradient value and perform manual ops on it.
A None attribute or a Tensor full of 0s will be different.
2. If the user requests `zero_grad(set_to_none=True)` followed by a backward pass, `.grad` s
@mcarilli (Collaborator), Sep 1, 2020:

Upon rereading this, I think we should switch the order of points 1 and 2. The current point 2 sets up some context for (what is currently) point 1, so I think it makes sense to say point 2 first.

Otherwise LGTM! Thanks for the PR, making this optimization a first-class citizen is very helpful.


ngimel commented Sep 1, 2020

@albanD, @vincentqb anything holding up this diff? Comments seem to be addressed (other than @mcarilli's doc suggestions which are imo minor).


albanD commented Sep 1, 2020

Nothing blocking on my side. Just the doc update. And make sure that the CI is happy with it after rebase.

@ngimel (Collaborator) left a comment:

Approving, subject to minor doc fixes.

@vincentqb (Contributor) left a comment:

LGTM too, I fixed two of the three doc changes that I see. I'm ok either way with the last one:

Upon rereading this, I think we should switch the order of points 1 and 2. The current point 2 sets up some context for (what is currently) point 1, so I think it makes sense to say point 2 first.

@mcarilli (Collaborator) left a comment.

@mcarilli mcarilli changed the title Add reset_grad() function Add zero_grad(set_to_none=True) Sep 9, 2020
ngimel pushed a commit to ngimel/pytorch that referenced this pull request Sep 10, 2020
Summary:
Pull Request resolved: pytorch#44423

Pull Request resolved: pytorch#42754

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23010859

Pulled By: ngimel

fbshipit-source-id: 760279f7c9cb84d11bef51207c18bf1f362ca7ad
@facebook-github-bot (Contributor)

@ngimel merged this pull request in c515881.

@facebook-github-bot facebook-github-bot deleted the gh/firstprayer/3/head branch September 13, 2020 14:16