
Conversation

xwang233
Collaborator

This would fix #33485.

cc @ptrblck
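
For context, #33485 reports torch.rfft returning NaNs for some half-precision CUDA inputs. A minimal sketch of the kind of call involved, assuming the old torch.rfft(input, signal_ndim) API in use at the time (the sizes here are illustrative, not the exact reproduction from the issue):

    import torch

    # Half-precision input on CUDA goes through cuFFT; some inputs were
    # reported to produce NaN outputs before this change.
    x = torch.randn(64, 128, device='cuda', dtype=torch.half)
    y = torch.rfft(x, 2)          # 2-D real-to-complex FFT, onesided by default
    print(torch.isnan(y).any())   # True for the problematic inputs reported in the issue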

@xwang233 requested a review from ngimel on March 28, 2020 at 04:48
@dr-ci

dr-ci bot commented Mar 28, 2020

💊 CircleCI build failures summary and remediations

As of commit c2a8ce4 (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR

  • 1/2 broken upstream at merge base ef511d8 from Mar 27 until Mar 28 (23 commits; 0c16ced - a9b540d)

    Please rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase FETCH_HEAD
    

    Check out the recency history of this "viable master" tracking branch.


🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 29 04:45:12 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Mar 29 04:45:12 processing existing schema:  aten::sparse_coo_tensor.size(int[] size, *, int dtype, int layout, Device device, bool pin_memory=False) -> (Tensor) 
Mar 29 04:45:12 processing existing schema:  aten::sparse_coo_tensor.indices(Tensor indices, Tensor values, *, int? dtype=None, int? layout=None, Device? device=None, bool? pin_memory=None) -> (Tensor) 
Mar 29 04:45:12 processing existing schema:  aten::sparse_coo_tensor.indices_size(Tensor indices, Tensor values, int[] size, *, int? dtype=None, int? layout=None, Device? device=None, bool? pin_memory=None) -> (Tensor) 
Mar 29 04:45:12 processing existing schema:  aten::split_with_sizes(Tensor self, int[] split_sizes, int dim=0) -> (Tensor[]) 
Mar 29 04:45:12 processing existing schema:  aten::squeeze(Tensor(a) self) -> (Tensor(a)) 
Mar 29 04:45:12 processing existing schema:  aten::squeeze.dim(Tensor(a) self, int dim) -> (Tensor(a)) 
Mar 29 04:45:12 processing existing schema:  aten::stft(Tensor self, int n_fft, int? hop_length=None, int? win_length=None, Tensor? window=None, bool normalized=False, bool onesided=True) -> (Tensor) 
Mar 29 04:45:12 skipping schema:  aten::sub_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> (Tensor(a!)) 
Mar 29 04:45:12 skipping schema:  aten::sub_.Scalar(Tensor(a!) self, Scalar other, Scalar alpha=1) -> (Tensor(a!)) 
Mar 29 04:45:12 processing existing schema:  aten::t(Tensor(a) self) -> (Tensor(a)) 
Mar 29 04:45:12 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Mar 29 04:45:12  
Mar 29 04:45:12 Broken ops: [ 
Mar 29 04:45:12 	aten::owner(RRef(t) self) -> (__torch__.torch.classes.dist_rpc.WorkerInfo) 
Mar 29 04:45:12 	prepacked::conv2d_clamp_run(Tensor X, __torch__.torch.classes.xnnpack.Conv2dOpContext W_prepack) -> (Tensor Y) 
Mar 29 04:45:12 	prepacked::conv2d_clamp_prepack(Tensor W, Tensor? B, int[2] stride, int[2] padding, int[2] dilation, int groups, float? output_min=None, float? output_max=None) -> (__torch__.torch.classes.xnnpack.Conv2dOpContext) 
Mar 29 04:45:12 	prepacked::linear_clamp_run(Tensor X, __torch__.torch.classes.xnnpack.LinearOpContext W_prepack) -> (Tensor Y) 
Mar 29 04:45:12 	prepacked::linear_clamp_prepack(Tensor W, Tensor? B=None, float? output_min=None, float? output_max=None) -> (__torch__.torch.classes.xnnpack.LinearOpContext) 
Mar 29 04:45:12 ] 
Mar 29 04:45:12 + cleanup 
Mar 29 04:45:12 + retcode=1 

🚧 1 upstream failure:

These were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker.


@vadimkantorov
Contributor

Is the NaN/Inf issue related to "training on CUDA" per se?

From what I understood, it can happen in any setting; it's just related to how cuFFT functions.

@xwang233
Collaborator Author

Thanks for the comment. I have reworded that.

Contributor

@facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in e021c13.

@xwang233
Collaborator Author

@ngimel I found this PR was reverted in a15a4a5. Was that because of the lint failure? Can I reland it once I fix the lint?

@ngimel
Collaborator

ngimel commented Jun 25, 2020

Yeah, it was because of the lint. Sure, you can reland.

facebook-github-bot pushed a commit that referenced this pull request on Jun 26, 2020
Summary:
Reland of #35594
Pull Request resolved: #40551

Reviewed By: ezyang

Differential Revision: D22249831

Pulled By: ngimel

fbshipit-source-id: b221b3c0a490ccaaabba50aa698a2490536e0917

Development

Successfully merging this pull request may close these issues.

torch.rfft returns NaNs for some half precision CUDA inputs
