@eellison eellison commented May 16, 2022

Stack from ghstack (oldest at bottom):

Fix for AOT autograd, where amp has already been traced out and you don't want to re-invoke the amp pass.

cc @anijain2305, @Chillee


facebook-github-bot commented May 16, 2022


❌ 4 New Failures

As of commit 7b7a39e (more details on the Dr. CI page):

  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge) (1/4)

Step: "Test"

2022-05-16T18:06:59.6397668Z + python setup.py install
2022-05-16T18:07:00.5805259Z No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2022-05-16T18:07:00.5917278Z Building torch_xla version: 1.12
2022-05-16T18:07:00.5917745Z XLA Commit ID: 56a52d53359b0862d0ada8d49c4e3dc52ff75d81
2022-05-16T18:07:00.5918207Z PyTorch Commit ID: 7b7a39ee29e45d02a8683579400b5d8bee18146d
2022-05-16T18:07:00.5981688Z /var/lib/jenkins/workspace /var/lib/jenkins/workspace/xla
2022-05-16T18:07:01.7066969Z /var/lib/jenkins/workspace/xla
2022-05-16T18:07:01.7994289Z Traceback (most recent call last):
2022-05-16T18:07:01.7994938Z   File "/var/lib/jenkins/workspace/xla/scripts/gen_lazy_tensor.py", line 84, in <module>
2022-05-16T18:07:01.7995520Z     get_device_fn="torch_xla::bridge::GetXlaDevice")
2022-05-16T18:07:01.7996288Z TypeError: run_gen_lazy_tensor() got an unexpected keyword argument 'get_device_fn'
2022-05-16T18:07:01.8083009Z Failed to generate lazy files: ['python', '/var/lib/jenkins/workspace/xla/scripts/gen_lazy_tensor.py']
2022-05-16T18:07:01.9875003Z + cleanup
2022-05-16T18:07:01.9875242Z + retcode=1
2022-05-16T18:07:01.9875411Z + set +x
2022-05-16T18:07:01.9908093Z ##[error]Process completed with exit code 1.
2022-05-16T18:07:02.0001223Z ##[group]Run pytorch/pytorch/.github/actions/get-workflow-job-id@master
2022-05-16T18:07:02.0001467Z with:
2022-05-16T18:07:02.0001883Z   github-token: ***
2022-05-16T18:07:02.0002055Z env:
2022-05-16T18:07:02.0002198Z   IN_CI: 1

See GitHub Actions build pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge) (2/4)

Step: "Upload test artifacts"

2022-05-16T18:19:10.8769242Z     #10 0x55c9c9da3c81 in run_mod /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:1037
2022-05-16T18:19:10.8771499Z     #11 0x55c9c9daec69 in PyRun_StringFlags /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:961
2022-05-16T18:19:10.8772154Z     #12 0x55c9c9daeccb in PyRun_SimpleStringFlags /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:455
2022-05-16T18:19:10.8773225Z     #13 0x55c9c9daedc8 in pymain_run_command /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:420
2022-05-16T18:19:10.8774068Z     #14 0x55c9c9daedc8 in pymain_run_python /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:2907
2022-05-16T18:19:10.8774481Z     #15 0x55c9c9daedc8 in pymain_main /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:3460
2022-05-16T18:19:10.8775043Z     #16 0x55c9c9daf18b in _Py_UnixMain /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:3495
2022-05-16T18:19:10.9308821Z     #17 0x7f001eab883f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
2022-05-16T18:19:10.9309200Z     #18 0x55c9c9d54039 in _start (/opt/conda/bin/python3.7+0x1d8039)
2022-05-16T18:19:10.9309374Z 
2022-05-16T18:19:10.9309708Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
2022-05-16T18:19:10.9534363Z + retcode=1
2022-05-16T18:19:10.9534690Z + set -e
2022-05-16T18:19:10.9534879Z + return 1
2022-05-16T18:19:10.9538778Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]
2022-05-16T18:19:10.9539346Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X ]]
2022-05-16T18:19:10.9539972Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]
2022-05-16T18:19:10.9540455Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]]
2022-05-16T18:19:10.9541100Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]
2022-05-16T18:19:10.9541904Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\5\1\2 ]]
2022-05-16T18:19:10.9543246Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]

See GitHub Actions build pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (3/4)

Step: "Test"

2022-05-16T18:01:28.3830907Z processing existing schema:  text(__torch__.torch.classes.profiling.SourceRef _0) -> (str _0)
2022-05-16T18:01:28.3832529Z processing existing schema:  count(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-05-16T18:01:28.3833830Z processing existing schema:  duration_ns(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-05-16T18:01:28.3834800Z processing existing schema:  source(__torch__.torch.classes.profiling.SourceStats _0) -> (__torch__.torch.classes.profiling.SourceRef _0)
2022-05-16T18:01:28.3836869Z processing existing schema:  line_map(__torch__.torch.classes.profiling.SourceStats _0) -> (Dict(int, __torch__.torch.classes.profiling.InstructionStats) _0)
2022-05-16T18:01:28.3837924Z processing existing schema:  __init__(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-05-16T18:01:28.3839463Z processing existing schema:  enable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-05-16T18:01:28.3840449Z processing existing schema:  disable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-05-16T18:01:28.3842638Z processing existing schema:  _dump_stats(__torch__.torch.classes.profiling._ScriptProfile _0) -> (__torch__.torch.classes.profiling.SourceStats[] _0)
2022-05-16T18:01:28.3844214Z processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (NoneType _0)
2022-05-16T18:01:28.3844642Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2022-05-16T18:01:28.3844653Z 
2022-05-16T18:01:28.3845577Z Broken ops: [
2022-05-16T18:01:28.3845747Z 	aten::lift(Tensor self) -> (Tensor)
2022-05-16T18:01:28.3845933Z 	aten::ccol_indices(Tensor(a) self) -> (Tensor(a))
2022-05-16T18:01:28.3846100Z 	aten::ccol_indices_copy(Tensor self) -> (Tensor)
2022-05-16T18:01:28.3846391Z 	aten::index_reduce(Tensor self, int dim, Tensor index, Tensor source, str reduce, *, bool include_self=True) -> (Tensor)
2022-05-16T18:01:28.3846723Z 	aten::index_reduce.out(Tensor self, int dim, Tensor index, Tensor source, str reduce, *, bool include_self=True, Tensor(a!) out) -> (Tensor(a!))
2022-05-16T18:01:28.3847026Z 	aten::index_reduce_(Tensor(a!) self, int dim, Tensor index, Tensor source, str reduce, *, bool include_self=True) -> (Tensor(a!))
2022-05-16T18:01:28.3847223Z 	aten::glu_jvp(Tensor glu, Tensor x, Tensor dx, int dim) -> (Tensor)
2022-05-16T18:01:28.3847486Z 	aten::_sparse_addmm(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1) -> (Tensor)

See GitHub Actions build pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu) (4/4)

Step: "Upload test artifacts"

2022-05-16T20:03:15.5220528Z   test_data_parallel_model_device (__main__.TestDataParallel)
2022-05-16T20:03:15.5532726Z Test device[0] check at forward time. ... ok (0.036s)
2022-05-16T20:03:15.6018930Z   test_data_parallel_model_no_refcycles (__main__.TestDataParallel) ... ok (0.048s)
2022-05-16T20:03:15.6069413Z   test_data_parallel_module_zero_inputs (__main__.TestDataParallel) ... ok (0.005s)
2022-05-16T20:03:15.6134068Z   test_data_parallel_multiple_input (__main__.TestDataParallel) ... /opt/conda/lib/python3.7/site-packages/torch/nn/parallel/comm.py:232: UserWarning: Using -1 to represent CPU tensor is deprecated. Please use a device object or string instead, e.g., "cpu".
2022-05-16T20:03:15.6134810Z   'Using -1 to represent CPU tensor is deprecated. Please use a '
2022-05-16T20:03:15.6307357Z ok (0.024s)
2022-05-16T20:03:15.6338980Z   test_data_parallel_nested_input (__main__.TestDataParallel) ... ok (0.003s)
2022-05-16T20:03:15.6406395Z   test_data_parallel_nested_output (__main__.TestDataParallel) ... ok (0.007s)
2022-05-16T20:03:15.6449246Z   test_data_parallel_no_grad (__main__.TestDataParallel) ... ok (0.004s)
2022-05-16T20:03:16.1687246Z   test_data_parallel_rnn (__main__.TestDataParallel) ... Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /usr/local/cuda/lib64/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
2022-05-16T20:03:16.6736076Z ok (1.028s)
2022-05-16T20:03:16.6771315Z   test_data_parallel_small_back (__main__.TestDataParallel) ... ok (0.004s)
2022-05-16T20:03:16.6896305Z   test_data_parallel_sparse (__main__.TestDataParallel) ... ok (0.012s)
2022-05-16T20:03:16.7142864Z   test_gather_cpu (__main__.TestDataParallel) ... /opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
2022-05-16T20:03:16.7143680Z   warnings.warn('Was asked to gather along dimension 0, but all '
2022-05-16T20:03:16.7379797Z ok (0.048s)
2022-05-16T20:03:16.7392818Z   test_gather_different_len_dicts (__main__.TestDataParallel) ... ok (0.001s)
2022-05-16T20:03:16.7872922Z   test_gather_gpu (__main__.TestDataParallel) ... ok (0.048s)
2022-05-16T20:03:16.7930135Z   test_parallel_apply (__main__.TestDataParallel) ... ok (0.006s)
2022-05-16T20:03:16.7991024Z   test_parallel_apply_autocast (__main__.TestDataParallel) ... ok (0.006s)

This comment was automatically generated by Dr. CI.

eellison pushed a commit that referenced this pull request May 16, 2022
ghstack-source-id: 81bc2c7
Pull Request resolved: #77566
@facebook-github-bot facebook-github-bot added the oncall: jit label May 16, 2022
@eellison eellison requested a review from davidberard98 May 16, 2022 17:58
@davidberard98 davidberard98 left a comment

had some small comments, otherwise looks good


// if invoked on a graph that has already traced through amp
// don't invoke amp pass
mutable bool force_no_amp_ = false;
Contributor


does this need to be mutable?

Contributor Author


I think so; otherwise all the const stuff won't compile.

"name",
[](const StrongFunctionPtr& self) { return self.function_->name(); })
.def(
"_set_ignore_amp",
Contributor


do we need anything like this for modules?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since the use case here is just AOT autograd.

@eellison
Contributor Author

@pytorchbot merge this please

@github-actions
Contributor

Hey @eellison.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request May 20, 2022
Summary:
Pull Request resolved: #77566

Approved by: https://github.com/anijain2305, https://github.com/davidberard98

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/05ce0f9be63dd6fadd2fb40c29f8f867f267002b

Reviewed By: seemethere

Differential Revision: D36494147

Pulled By: seemethere

fbshipit-source-id: c09a25d1b606e54646e5d12a6c961f91f26b215e
@facebook-github-bot facebook-github-bot deleted the gh/eellison/289/head branch May 22, 2022 14:17

Labels: cla signed, Merged, oncall: jit