Add zero_grad(set_to_none=True) #42754
Conversation
[ghstack-poisoned]
Summary: Address issue #41696. Test Plan: I leveraged the MNIST example in pytorch/examples, manually changed zero_grad() to reset_grad(), and checked whether that affects the training precision. Before:
After:
💊 CI failures summary and remediations
As of commit 82594d2 (more details on the Dr. CI page):
🕵️ 9 new failures recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
Optional suggestion: #41696 (comment). Regardless of which API you choose, you should apply similar treatment to
torch/optim/optimizer.py
Outdated
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
Why do we need to `detach_()`? Afaict the only way this has an effect is if something external also holds a reference to `.grad` and `.grad` was created in a `create_graph=True` backward pass.
Yeah, that sounds like a possible scenario, so I'm trying to be safe here.
My point is `detach_` may be the unsafe option. Because it affects `p.grad` in place, it silently affects anything else holding a reference to `p.grad`. Setting `p.grad = None` without `detach_` simply drops our reference. Anything else holding a reference to `grad` will not see an effect.

Admittedly, I'm not sure why the default zeroing behavior of `zero_grad` performs `detach_` then `zero_`. I assume it's to avoid building up spurious autograd history if the grad was created with `create_graph=True` and therefore requires grad. (Edit: related, it avoids a memory leak when the grad has a `grad_fn`.)

If so, in the alternative set-to-None path, we don't need `detach_`. We drop the reference to `.grad`, we don't perform any ops on it, therefore there's no danger of building spurious autograd history. And if we don't `detach_`, we don't risk silently affecting other references to `grad`.

tl;dr I think not detaching is the safe option here.

@albanD you're good with these tricky cases... Why is `detach_` used with the default zeroing behavior of `zero_grad`? Also, do you agree not detaching is the safer implementation for the set-to-None path?
The original code is trying to make sure that we don't change the Tensor referenced by `.grad` as much as possible, I think.
And yes, as mentioned below, I do agree that we don't want to detach here.
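For context, here is a minimal sketch of the two code paths under discussion, assuming the `set_to_none` flag this PR proposes; the function name and structure are illustrative, not the exact diff:

```python
import torch

def zero_grad_sketch(params, set_to_none=False):
    # Illustrative only: contrasts the default zeroing path with the
    # set-to-None path discussed above.
    for p in params:
        if p.grad is None:
            continue
        if set_to_none:
            # Drop our reference. No in-place op is performed, so anything
            # else holding the old .grad tensor is unaffected, and no
            # spurious autograd history can be built.
            p.grad = None
        else:
            # Default path: detach_ breaks any grad_fn (e.g. from a
            # create_graph=True backward), then zeros the tensor in place.
            p.grad.detach_()
            p.grad.zero_()

# Usage sketch: zero_grad_sketch(model.parameters(), set_to_none=True)
```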
Thanks for running an example. Out of curiosity, did it change the runtime and memory?
Responded about API on original issue.
Differential Revision: [D23010859](https://our.internmc.facebook.com/intern/diff/D23010859) [ghstack-poisoned]
It doesn't seem to lead to a noticeable difference in terms of run time, but again, I am running a toy example here, so it's probably hard to tell. Not sure how to measure the memory footprint, though.
Can you please modify
torch/nn/modules/module.py
Outdated
    if p.grad is not None:
        if p.grad.grad_fn is not None:
            if set_to_none:
                p.grad.detach_()
I think you can remove this `detach()` as you remove it anyways.
r"""Sets gradients of all model parameters to zero. | ||
Arguments: | ||
set_to_none (bool): instead of setting to zero, set the grad to None. |
Can you link to the `nn.optim` version of this doc that contains more details about the change of behavior?
`torch.optim`, right?
Yes, typo!
Add:
    See :meth:`torch.optim.optimizer.zero_grad` for details.
torch/optim/optimizer.py
Outdated
    if p.grad is not None:
        if p.grad.grad_fn is not None:
            if set_to_none:
                p.grad.detach_()
unneeded detach here as well.
torch/nn/modules/module.py
Outdated
    def zero_grad(self) -> None:
        r"""Sets gradients of all model parameters to zero."""
    def zero_grad(self, set_to_none=False) -> None:
nit: can you add the type for `set_to_none`: `set_to_none: bool = False`
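Putting the typing nit and the docstring cross-reference together, the `nn.Module.zero_grad` side might look roughly like the sketch below; this is illustrative only, not the exact code in this PR:

```python
def zero_grad(self, set_to_none: bool = False) -> None:
    r"""Sets gradients of all model parameters to zero.

    See :meth:`torch.optim.optimizer.zero_grad` for details.

    Arguments:
        set_to_none (bool): instead of setting to zero, set the grads to None.
    """
    for p in self.parameters():
        if p.grad is not None:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    # Break ties to autograd history (e.g. from a
                    # create_graph=True backward) before zeroing in place.
                    p.grad.detach_()
                p.grad.zero_()
```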
torch/optim/optimizer.py
Outdated
    def zero_grad(self):
        r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    def zero_grad(self, set_to_none=False):
Can you update `optimizer.pyi` as well to reflect this change? Just add the new arg and type there.
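For illustration, the stub addition might look something like this (a hypothetical excerpt; the real `optimizer.pyi` has more members and may differ in detail):

```python
# Hypothetical excerpt of torch/optim/optimizer.pyi (other members omitted)
class Optimizer:
    ...
    def zero_grad(self, set_to_none: bool = ...) -> None: ...
```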
torch/optim/optimizer.py
Outdated
        A None attribute or a Tensor full of 0s will be different.
    2. User can no longer rely on checking if `.grad` is None to see if a tensor
       is touched in the backward pass
    3. `nn.optim` optimizers have a different behavior if the gradient is 0 or None
torch.optim
torch/optim/optimizer.py
Outdated
    Arguments:
        set_to_none (bool): instead of setting to zero, set the grad to None.
            This will in general have lower memory footprint, but using this
            comes with caveats, to name a few:
I wouldn't say the changes are caveats. Some are benefits, like point 3 (skipping the update is faster if the grad is None).
Instead of
> This will in general have lower memory footprint, but using this comes with caveats, to name a few:
I would say
> This is will in general have lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:
torch/optim/optimizer.py
Outdated
            comes with caveats, to name a few:
    1. When user tries to access the gradient value and perform manual ops on it.
       A None attribute or a Tensor full of 0s will be different.
    2. User can no longer rely on checking if `.grad` is None to see if a tensor
This doesn't make sense. I'd say it's the opposite: if a user sets grads = None before backward, they CAN check if `.grad` is None to see if the tensor received a gradient in the backward pass. Let's just describe the behavior:
> If the user requests `zero_grad(set_to_none=True)` followed by a backward pass, `.grad`\ s are guaranteed to be None for params that did not receive a gradient.
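To make the described behavior concrete, here is a hypothetical usage sketch; the toy model and names are invented for illustration, and it assumes a build that includes this PR's `set_to_none` flag:

```python
import torch
import torch.nn as nn

# Toy model with two heads; only one is used in a given forward pass.
class TwoHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 2)
        self.b = nn.Linear(4, 2)

    def forward(self, x, use_b=False):
        return self.b(x) if use_b else self.a(x)

model = TwoHead()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

opt.zero_grad(set_to_none=True)
model(torch.randn(8, 4), use_b=False).sum().backward()

# After zero_grad(set_to_none=True) + backward, .grad is None exactly for
# the params that received no gradient (here, the unused head `b`).
for name, p in model.named_parameters():
    print(name, "no grad (None)" if p.grad is None else "received grad")
```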
torch/optim/optimizer.py
Outdated
    Arguments:
        set_to_none (bool): instead of setting to zero, set the grad to None.
            This is will in general have lower memory footprint, and can modestly improve performance.
`This is will` -> `This will`
torch/optim/optimizer.py
Outdated
            This is will in general have lower memory footprint, and can modestly improve performance.
            However, it changes certain behaviors. For example:
    1. When user tries to access the gradient value and perform manual ops on it.
       A None attribute or a Tensor full of 0s will be different.
Combine sentences:
> When user tries to access the gradient value and perform manual ops on it, a None attribute or a Tensor full of 0s will act differently.
            However, it changes certain behaviors. For example:
    1. When user tries to access the gradient value and perform manual ops on it.
       A None attribute or a Tensor full of 0s will be different.
    2. If the user requests `zero_grad(set_to_none=True)` followed by a backward pass, `.grad` s
Upon rereading this, I think we should switch the order of points 1 and 2. The current point 2 sets up some context for (what is currently) point 1, so I think it makes sense to say point 2 first.
Otherwise LGTM! Thanks for the PR, making this optimization a first-class citizen is very helpful.
@albanD, @vincentqb anything holding up this diff? Comments seem to be addressed (other than @mcarilli's doc suggestions, which are imo minor).
Nothing blocking on my side. Just the doc update. And make sure that the CI is happy with it after rebase.
Approving, subject to minor doc fixes.
LGTM too, I fixed two of the three doc changes that I see. I'm ok either way with the last one:
> Upon rereading this, I think we should switch the order of points 1 and 2. The current point 2 sets up some context for (what is currently) point 1, so I think it makes sense to say point 2 first.
#42754 (review) is not a big deal imo but don't forget https://github.com/pytorch/pytorch/pull/42754/files#r481400744.
Summary:
Pull Request resolved: pytorch#44423
Pull Request resolved: pytorch#42754
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23010859
Pulled By: ngimel
fbshipit-source-id: 760279f7c9cb84d11bef51207c18bf1f362ca7ad
Stack from ghstack:
Differential Revision: D23010859