Add wav2vec2.0 model #1529
Conversation
Quantization tests are failing on the macOS CI, but they pass locally with the latest PyTorch nightly.
@vkuzo Have you ever seen an error like this? I suspect it's a CI / PyTorch nightly package issue, but any insight would be helpful.
shape = (batch_size, length, self.num_heads, self.head_dim)
q = self.q_proj(x).view(*shape).transpose(2, 1)      # B, nH, L, Hd
k = self.k_proj(x).view(*shape).permute(0, 2, 3, 1)  # B, nH, Hd, L
nit: Why permute and not transpose?
I think I need to do transpose twice to achieve this, and I thought permute is more readable.
Ah I see, you're merging the transpose needed for the weights below
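For reference, a quick check (not from the PR) that the single permute used for k is equivalent to the two transposes it replaces:

```python
import torch

x = torch.randn(2, 5, 4, 8)  # B, L, nH, Hd

a = x.permute(0, 2, 3, 1)              # B, nH, Hd, L in one call
b = x.transpose(1, 2).transpose(2, 3)  # same layout via two transposes

assert torch.equal(a, b)
assert a.shape == (2, 4, 8, 5)
```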
shape = (batch_size, length, self.num_heads, self.head_dim)
q = self.q_proj(x).view(*shape).transpose(2, 1)  # B, nH, L, Hd
not blocking: All these projections consume the same input, so you could do it using one linear with 3x the embedding dim. You could also call into nn.MHA altogether.
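A minimal sketch of the fused-projection idea suggested above: one Linear with 3x the embedding dim replaces the separate q/k/v projections, then the output is split. The names and sizes here are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

batch_size, length, embed_dim, num_heads = 2, 5, 16, 4
head_dim = embed_dim // num_heads

x = torch.randn(batch_size, length, embed_dim)
qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)  # fused q/k/v projection

# project once, then split into q, k, v along the feature dim
q, k, v = qkv_proj(x).chunk(3, dim=-1)
shape = (batch_size, length, num_heads, head_dim)
q = q.view(*shape).transpose(2, 1)      # B, nH, L, Hd
k = k.view(*shape).permute(0, 2, 3, 1)  # B, nH, Hd, L
v = v.view(*shape).transpose(2, 1)      # B, nH, L, Hd

attn = torch.softmax(q @ k / head_dim ** 0.5, dim=-1) @ v
assert attn.shape == (batch_size, num_heads, length, head_dim)
```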
mask = torch.arange(max_len).expand(batch_size, max_len) >= lengths[:, None]
x[mask] = 0.0
# extend the mask to attention shape and set weight
mask = -10000.0 * mask[:, None, None, :].to(dtype=features.dtype)
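A toy check (not from the PR) of the padding-mask construction above, assuming lengths holds the valid length of each sequence in the batch:

```python
import torch

batch_size, max_len = 2, 4
lengths = torch.tensor([4, 2])  # valid lengths per batch element

# positions at or past each sequence's length are True (i.e. padding)
mask = torch.arange(max_len).expand(batch_size, max_len) >= lengths[:, None]

assert mask.tolist() == [[False, False, False, False],
                         [False, False, True, True]]
```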
not blocking: There's contention around what the right value here is. ParlAI uses a neginf that depends on the input dtype, nn.MHA and fairseq use float("-inf"), which likely has issues with lower-precision dtypes, and FasterTransformer uses -10000 as well. I think the -10000.0 is fine, but I'm curious about your reasoning behind this.
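A small demonstration of the trade-off being discussed (my own sketch, not from the PR): float("-inf") zeroes the masked weights exactly, but produces NaNs via softmax when an entire row is masked, whereas a large finite value like -10000.0 stays well-defined:

```python
import torch

scores = torch.tensor([1.0, 2.0, 3.0])
mask = torch.tensor([False, False, True])  # mask the last position

w_inf = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=0)
w_big = torch.softmax(scores + (-10000.0) * mask.to(scores.dtype), dim=0)

# both drive the masked weight to (effectively) zero
assert w_inf[-1].item() == 0.0
assert w_big[-1].item() < 1e-6

# with a fully masked row, -inf yields NaNs while -10000 stays finite
full = torch.full((3,), True)
w_all_inf = torch.softmax(scores.masked_fill(full, float("-inf")), dim=0)
w_all_big = torch.softmax(scores + (-10000.0) * full.to(scores.dtype), dim=0)
assert torch.isnan(w_all_inf).all()
assert not torch.isnan(w_all_big).any()
```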
This value is from Hugging Face's implementation of Wav2Vec2.0. Let me try float("-inf"), and if the test passes, then I will switch to float("-inf").
I cannot make it work with float("-inf"), so I will stick with -10000.
This means that neither
This PR adds
- Wav2Vec2Model class
- wav2vec2_base
- wav2vec2_large
- wav2vec2_large_lv60k
ref: #1506
supersedes #1525