
[WIP][FSDP] Option to keep gradients in lower precision #83310

Closed
wants to merge 1 commit

Conversation

@rohan-varma (Member) commented Aug 12, 2022

[ghstack-poisoned]
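
For context, a minimal usage sketch of what this option could look like at the FSDP API level. The `keep_low_precision_grads` flag name and the launch setup are assumptions for illustration and are not confirmed by this PR:

```python
# Hedged sketch: the flag name is assumed, not confirmed by this PR.
# Assumes launch via torchrun so the process-group env vars are set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

mp = MixedPrecision(
    param_dtype=torch.float16,      # parameters cast to fp16 for compute
    reduce_dtype=torch.float16,     # gradients reduced in fp16
    keep_low_precision_grads=True,  # assumed flag: keep reduced grads in fp16
)

model = FSDP(torch.nn.Linear(1024, 1024).cuda(), mixed_precision=mp)
```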
@facebook-github-bot (Contributor) commented Aug 12, 2022


❌ 2 New Failures

As of commit dd76c26 (more details on the Dr. CI page):

  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build Lint / lintrunner (1/2)

Step: "Run lintrunner on all files" (full log | diagnosis details)

2022-08-12T02:35:16.2434097Z ##[error]Process completed with exit code 1.
2022-08-12T02:35:16.2398653Z         2976  |                        output.div_(self.gradient_postdivide_factor)
2022-08-12T02:35:16.2399254Z         2977  |
2022-08-12T02:35:16.2399640Z     >>> 2978  |                    print(f"rv: casting")
2022-08-12T02:35:16.2400111Z         2979  |                    self._cast_grad_to_param_dtype(output, param)
2022-08-12T02:35:16.2400471Z         2980  |
2022-08-12T02:35:16.2400943Z         2981  |                    # To support gradient accumulation outside `no_sync()`, we save
2022-08-12T02:35:16.2403216Z 
2022-08-12T02:35:16.2403346Z 
2022-08-12T02:35:16.2403972Z You can reproduce these results locally by using `lintrunner`.
2022-08-12T02:35:16.2404590Z See https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.
2022-08-12T02:35:16.2434097Z ##[error]Process completed with exit code 1.
2022-08-12T02:35:16.2470759Z ##[group]Run # Use jq to massage the JSON lint output into GitHub Actions workflow commands.
2022-08-12T02:35:16.2471159Z # Use jq to massage the JSON lint output into GitHub Actions workflow commands.
2022-08-12T02:35:16.2471444Z jq --raw-output \
2022-08-12T02:35:16.2471833Z   '"::\(if .severity == "advice" or .severity == "disabled" then "warning" else .severity end) file=\(.path),line=\(.line),col=\(.char),title=\(.code) \(.name)::" + (.description | gsub("\\n"; "%0A"))' \
2022-08-12T02:35:16.2472180Z   lint.json
2022-08-12T02:35:16.2516728Z shell: /usr/bin/bash -e {0}
2022-08-12T02:35:16.2516929Z env:
2022-08-12T02:35:16.2517156Z   pythonLocation: /opt/hostedtoolcache/Python/3.8.13/x64
2022-08-12T02:35:16.2517455Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.13/x64/lib
2022-08-12T02:35:16.2517691Z ##[endgroup]
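
The flagged region sits around `_cast_grad_to_param_dtype`, where FSDP casts the reduced gradient back to the parameter dtype. A self-contained illustration of the behavior the PR title describes, using plain tensors rather than FSDP internals (the option name is an assumption):

```python
import torch

# fp32 parameter alongside an fp16 gradient, as produced by an fp16 reduce-scatter
param = torch.zeros(1024, dtype=torch.float32)
reduced_grad = torch.randn(1024, dtype=torch.float16)

keep_low_precision_grads = True  # assumed option name, for illustration only

if keep_low_precision_grads:
    grad = reduced_grad                  # keep fp16: half the gradient memory
else:
    grad = reduced_grad.to(param.dtype)  # default: cast back to the param dtype (fp32)

print(grad.dtype)  # torch.float16 when the option is on
```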

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge) (2/2)

Step: "Test" (full log | diagnosis details)

2022-08-12T02:45:25.4357070Z AssertionError: Torch not compiled with CUDA enabled
2022-08-12T02:45:25.4353733Z   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 745, in cuda
2022-08-12T02:45:25.4354169Z     return self._apply(lambda t: t.cuda(device))
2022-08-12T02:45:25.4354544Z   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 634, in _apply
2022-08-12T02:45:25.4354803Z     module._apply(fn)
2022-08-12T02:45:25.4355196Z   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 657, in _apply
2022-08-12T02:45:25.4355457Z     param_applied = fn(param)
2022-08-12T02:45:25.4355808Z   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 745, in <lambda>
2022-08-12T02:45:25.4356086Z     return self._apply(lambda t: t.cuda(device))
2022-08-12T02:45:25.4356444Z   File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
2022-08-12T02:45:25.4356810Z     raise AssertionError("Torch not compiled with CUDA enabled")
2022-08-12T02:45:25.4357070Z AssertionError: Torch not compiled with CUDA enabled
2022-08-12T02:45:25.4357224Z 
2022-08-12T02:45:25.4357229Z 
2022-08-12T02:45:25.4357233Z 
2022-08-12T02:45:25.4357442Z ----------------------------------------------------------------------
2022-08-12T02:45:25.4357686Z Ran 46 tests in 75.589s
2022-08-12T02:45:25.4357797Z 
2022-08-12T02:45:25.4357906Z FAILED (errors=1, skipped=42, expected failures=3)
2022-08-12T02:45:25.4358049Z 
2022-08-12T02:45:25.4358119Z Generating XML reports...
2022-08-12T02:45:25.4397933Z Generated XML report: test-reports/python-unittest/distributed.fsdp.test_fsdp_mixed_precision/TEST-TestFSDPMixedPrecisionSharded-20220812024409.xml
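
This failure comes from calling `.cuda()` on a CPU-only build. The actual fix for this PR is not shown in the thread; a generic sketch of skipping a CUDA-only test on such builds (class and test names are hypothetical):

```python
import unittest
import torch

class TestFSDPMixedPrecisionCPUBuild(unittest.TestCase):
    # Skip instead of erroring when the build has no CUDA support; calling
    # .cuda() unconditionally is what raised the AssertionError above.
    @unittest.skipIf(not torch.cuda.is_available(), "requires a CUDA build")
    def test_keep_low_precision_grads(self):
        model = torch.nn.Linear(8, 8).cuda()
        self.assertTrue(next(model.parameters()).is_cuda)

if __name__ == "__main__":
    unittest.main()
```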

This comment was automatically generated by Dr. CI.

rohan-varma added a commit that referenced this pull request Aug 12, 2022
ghstack-source-id: 628cbf14b714e51dc1e47bf4b59dbd3f00a922f1
Pull Request resolved: #83310
@facebook-github-bot added the oncall: distributed label Aug 12, 2022
@rohan-varma changed the title Enable low prec grads → [FSDP] Option to keep gradients in lower precision Aug 12, 2022
@rohan-varma changed the title [FSDP] Option to keep gradients in lower precision → [WIP][FSDP] Option to keep gradients in lower precision Aug 12, 2022
@rohan-varma (Member, Author) commented:

#85062

@facebook-github-bot deleted the gh/rohan-varma/582/head branch June 8, 2023 18:35
Labels: cla signed, oncall: distributed