Conversation

frank-wei
Contributor

@frank-wei frank-wei commented Jun 9, 2022

The fairseq diff is split into two parts.
The first diff (this one)
This diff creates a mask left-align function to check the mask condition for nested tensors. It is necessary for TorchScript deployment.
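For illustration, here is a minimal sketch of the left-alignment condition such a function checks. The helper name below is illustrative only; the actual op added by this PR is the private `_nested_tensor_from_mask_left_aligned`.

```python
import torch

def mask_is_left_aligned(mask: torch.Tensor) -> bool:
    # mask: (batch, seq_len) boolean mask where True marks valid (non-padded) tokens.
    # "Left aligned" means each row's True values form a contiguous prefix,
    # i.e. no valid token appears after the first padding position.
    lengths = mask.sum(dim=1, keepdim=True)                          # valid tokens per row
    prefix = torch.arange(mask.size(1), device=mask.device) < lengths
    return bool((mask == prefix).all())
```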

The second diff (D37082681)
Fork the inference path inside the forward function. If the checkpoint file has been loaded and we are running inference, we deploy BT (BetterTransformer); otherwise, the original fairseq path takes over.
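A rough, hypothetical sketch of what such a forward-path fork could look like (the class, attribute, and method names here are placeholders, not actual fairseq APIs; the real change lives in D37082681):

```python
import torch
import torch.nn as nn

class EncoderWithBTFork(nn.Module):
    # Placeholder module: `bt_encoder` and `fairseq_encoder` stand in for the
    # BetterTransformer fast path and the original fairseq path respectively.
    def __init__(self, bt_encoder: nn.Module, fairseq_encoder: nn.Module):
        super().__init__()
        self.bt_encoder = bt_encoder
        self.fairseq_encoder = fairseq_encoder
        self.checkpoint_loaded = False  # set to True once an exported checkpoint is loaded

    def forward(self, x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
        # Take the BT fast path only at inference time with a loaded checkpoint;
        # otherwise keep the original fairseq path (e.g. for training).
        if not self.training and self.checkpoint_loaded:
            return self.bt_encoder(x, padding_mask)
        return self.fairseq_encoder(x, padding_mask)
```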

Reviewed By: mikekgfb

Differential Revision: D36057338

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 9, 2022

✅ No Failures (0 Pending)

As of commit 9c62434 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D36057338

frank-wei pushed a commit to frank-wei/fairseq that referenced this pull request Jun 10, 2022
Summary:
X-link: pytorch/pytorch#79186

Pull Request resolved: facebookresearch#4468

as titled
Fork the inference path inside the forward function. If the checkpoint file has been loaded and we are running inference, we deploy BT; otherwise, the original fairseq path takes over.

In summary:
Accuracy: there is an accuracy loss due to fp16; the maximum diff is around 0.009. If we set it to fp32, there is no accuracy loss.
Perf: the current fairseq has a similar speed to the vanilla version. After the enablement, the speedup is similar to the standalone BT test.
With batch size = 64:
For V100, the speedup reaches 1.23x
For A100, the speedup reaches 1.38x

After enabling nested tensors:
For V100, the speedup reaches 2.46x

Reviewed By: mikekgfb

Differential Revision: D36057338

fbshipit-source-id: 229e72e6050bf70ddedcda8f47d158526910557f
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D36057338

@frank-wei frank-wei requested review from ngimel and jbschlosser June 10, 2022 17:49
@erichan1 erichan1 self-requested a review June 10, 2022 21:16
@frank-wei frank-wei requested a review from zrphercule June 10, 2022 21:17
Contributor

@erichan1 erichan1 left a comment

LGTM!

Contributor

@jbschlosser jbschlosser left a comment

Confused about this - I just see this PR adding a _nested_tensor_from_mask_left_aligned op. Some questions:

  • How does this relate to the PR summary? AFAICT just adding this op won't enable its use.
  • How does this differ from _nested_tensor_from_mask - does it just avoid checks by making the assumption the mask is left-aligned? If so, couldn't we avoid quite a bit of duplication?

@frank-wei
Contributor Author

frank-wei commented Jun 10, 2022

Confused about this - I just see this PR adding a _nested_tensor_from_mask_left_aligned op. Some questions:

  • How does this relate to the PR summary? AFAICT just adding this op won't enable its use.

The diff has now been split into two parts. Part 1 only includes this op change and is associated with D36057338. Part 2 only includes the fairseq change (D37082681) but depends on Part 1. The internal CI tests show the op runs well.

  • How does this differ from _nested_tensor_from_mask - does it just avoid checks by making the assumption the mask is left-aligned? If so, couldn't we avoid quite a bit of duplication?

It is the front part of the "_nested_tensor_from_mask" implementation, but its purpose is to check left alignment in advance.
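In other words, the check can run ahead of the conversion. A rough sketch of how the two ops could fit together, assuming the private signatures `torch._nested_tensor_from_mask_left_aligned(t, mask) -> bool` and `torch._nested_tensor_from_mask(t, mask) -> Tensor` (the exact signatures and the True-means-padded mask convention are assumptions here, not confirmed by this PR):

```python
import torch

def maybe_convert_to_nested(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    # padding_mask: (batch, seq_len), True for padded positions (assumed fairseq convention).
    # The new op only validates that the mask of valid tokens is left aligned,
    # so it can gate the nested-tensor conversion in a TorchScript-friendly way.
    if torch._nested_tensor_from_mask_left_aligned(x, padding_mask.logical_not()):
        return torch._nested_tensor_from_mask(x, padding_mask.logical_not())
    return x
```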

Summary:
The fairseq diff is split into two parts.
The first diff (this one)
This diff creates a mask left-align function to check the mask condition for nested tensors. It is necessary for TorchScript deployment.

The second diff (D37082681)
Fork the inference path inside the forward function. If the checkpoint file has been loaded and we are running inference, we deploy BT; otherwise, the original fairseq path takes over.

Perf in summary:
Accuracy: there is an accuracy loss due to fp16; the maximum diff is around 0.009. If we set it to fp32, there is no accuracy loss.
Perf: the current fairseq has a similar speed to the vanilla version. After the enablement, the speedup is similar to the standalone BT test.
With batch size = 64:
For V100, the speedup reaches 1.23x
For A100, the speedup reaches 1.38x

After enabling nested tensors:
For V100, the speedup reaches 2.46x

Test Plan: In D37082681

Reviewed By: mikekgfb

Differential Revision: D36057338

fbshipit-source-id: 2b824481481b8972d168f2751afa77cc9b3cbe02
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D36057338

@frank-wei frank-wei changed the title from "[transformer] BT enablement on fairseq" to "[transformer] BT enablement on fairseq - pytorch change" Jun 10, 2022
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

Hey @frank-wei.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 12, 2022
Summary:
Pull Request resolved: #79186

The fairseq diff is split into two parts.
The first diff (this one)
This diff creates a mask left-align function to check the mask condition for nested tensors. It is necessary for TorchScript deployment.

The second diff (D37082681)
Fork the inference path inside the forward function. If the checkpoint file has been loaded and we are running inference, we deploy BT; otherwise, the original fairseq path takes over.

Perf in summary:
Accuracy: there is an accuracy loss due to fp16; the maximum diff is around 0.009. If we set it to fp32, there is no accuracy loss.
Perf: the current fairseq has a similar speed to the vanilla version. After the enablement, the speedup is similar to the standalone BT test.
With batch size = 64:
For V100, the speedup reaches 1.23x
For A100, the speedup reaches 1.38x

After enabling nested tensors:
For V100, the speedup reaches 2.46x

Test Plan: In D37082681

Reviewed By: mikekgfb

Differential Revision: D36057338

fbshipit-source-id: 0ba75c254ccc4b4a29702ab0e18a36b5d0e1d832
@erichan1 erichan1 added the "release notes: nn", "topic: bug fixes", and "topic: performance" labels Jun 13, 2022
@erichan1 erichan1 mentioned this pull request Jul 21, 2022
erichan1 pushed a commit that referenced this pull request Jul 21, 2022
The fairseq diff is split into two parts.
The first diff (this one)
This diff creates a mask left-align function to check the mask condition for nested tensors. It is necessary for TorchScript deployment.

The second diff (D37082681)
Fork the inference path inside the forward function. If the checkpoint file has been loaded and we are running inference, we deploy BT; otherwise, the original fairseq path takes over.

Reviewed By: mikekgfb

Differential Revision: D36057338

Pull Request resolved: #79186
Approved by: https://github.com/erichan1
@erichan1 erichan1 mentioned this pull request Jul 21, 2022