Add Basic MVDR module #1708
Conversation
I also benchmarked the performance of `einsum` against `matmul`. According to the results, `einsum` is faster than the `matmul` operation.
PyTorch currently doesn't support a batched version of the trace method. There is an ongoing `linalg.trace` PR; we can switch to the PyTorch method once that PR lands.
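Until a batched `linalg.trace` lands, the diagonal sum over the last two dimensions can be emulated with `einsum` (`torch.trace` itself only accepts 2-D input). A minimal sketch:

```python
import torch

# torch.trace only accepts 2-D tensors, so a batched trace can be
# emulated by summing the diagonal of the last two dims with einsum.
def batch_trace(m: torch.Tensor) -> torch.Tensor:
    """Sum the diagonal over the last two dimensions of ``m``."""
    return torch.einsum("...ii->...", m)

m = torch.arange(8.0).reshape(2, 2, 2)
print(batch_trace(m))  # traces of the two 2x2 matrices: 3.0 and 11.0
```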
@Emrys365 Could you help review the MVDR implementation? Thanks!
param(solution="ref_channel"),
param(solution="stv_power"),
# evd will fail since the eigenvalues are not distinct
# param(solution="stv_evd"),
The eigenvalue decomposition solution fails the autograd test. I guess the reason is that the eigenvalues are not distinct (i.e. some eigenvalues are close or identical). Is there a way to generate a masked matrix (STFT * mask) whose PSD matrix has distinct eigenvalues?
Maybe following the normal simulation, i.e. simulating a mixture of white noise and some signal and calculating the ideal ratio mask (IRM)?
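A hypothetical sketch of that simulation idea: mix a clean signal with white noise, take the STFT of each source separately, and compute the IRM from their magnitudes. All names and parameters here are illustrative, not from the PR.

```python
import torch

torch.manual_seed(0)
n_fft = 400
signal = torch.randn(16000)        # stand-in for a clean utterance
noise = 0.1 * torch.randn(16000)   # white noise
window = torch.hann_window(n_fft)

# STFTs of the individual sources (the mixture would be signal + noise)
S = torch.stft(signal, n_fft, window=window, return_complex=True)
N = torch.stft(noise, n_fft, window=window, return_complex=True)

# Magnitude-based ideal ratio mask, bounded in [0, 1]
irm = S.abs() / (S.abs() + N.abs() + 1e-8)
```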
That's a good idea. I may add an utterance and a room impulse response from an open-source dataset and do the multi-channel signal simulation.
If there is a condition that does not support autograd, please document it somewhere.
(we should be doing the same for time stretch as well.)
Addressed in f976eae
A few minor comments:
The MVDR implementation looks good to me. I just made some minor comments.
examples/beamforming/mvdr.py
def get_steering_vector_evd(self, psd_s: torch.Tensor) -> torch.Tensor:
    r"""Estimate the steering vector by eigenvalue decomposition.

    Args:
        psd_s (torch.Tensor): covariance matrix of speech
            Tensor of dimension (..., freq, channel, channel)

    Returns:
        torch.Tensor: the estimated steering vector
            Tensor of dimension (..., freq, channel, 1)
    """
    w, v = torch.linalg.eig(psd_s)  # (..., freq, channel, channel)
    # select the eigenvector of the largest-magnitude eigenvalue
    _, indices = torch.max(w.abs(), dim=-1, keepdim=True)
    indices = indices.unsqueeze(-1)
    stv = v.gather(-1, indices.expand(psd_s.shape[:-1] + (1,)))  # (..., freq, channel, 1)
    return stv
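A standalone sketch of the same eigenvector selection, checked on a rank-1 PSD `a a^H`, where the dominant eigenvector should be collinear with the steering vector `a` (the function name here is illustrative, not the module's API):

```python
import torch

def steering_vector_evd(psd_s: torch.Tensor) -> torch.Tensor:
    # Pick the eigenvector associated with the largest-magnitude eigenvalue.
    w, v = torch.linalg.eig(psd_s)            # (..., freq, channel, channel)
    _, indices = torch.max(w.abs(), dim=-1, keepdim=True)
    indices = indices.unsqueeze(-1)
    return v.gather(-1, indices.expand(psd_s.shape[:-1] + (1,)))

# Rank-1 PSD a a^H: its dominant eigenvector is proportional to a.
a = torch.tensor([[1.0], [2.0], [3.0]], dtype=torch.cdouble)
psd = (a @ a.conj().transpose(-2, -1)).unsqueeze(0)  # (1, 3, 3): one "freq" bin
stv = steering_vector_evd(psd)                       # (1, 3, 1)
```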
I notice there are several conventional approaches to estimating the relative transfer function (RTF) or normalized steering vector (reference paper), e.g. covariance subtraction (CS), covariance subtraction with EVD (CS-EVD), and covariance whitening (CW).
But I'm not sure which one is more robust.
@nateanl, sorry for my late response. I've tried several formulations this spring, and this one [1] was the most robust when training end-to-end.
[1] Souden, M., Benesty, J., & Affes, S. (2009). On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE Transactions on Audio, Speech, and Language Processing, 18(2), 260–276.
Thanks Samuele! The formula refers to the reference channel selection solution in the module, right?
w = (\Phi_{NN}^{-1} \Phi_{SS} u) / trace(\Phi_{NN}^{-1} \Phi_{SS})
I will benchmark the performances of all solutions and compare with yours. They should be consistent.
Yep, it looks like the formula you already implemented as `ref_channel`.
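For reference, a minimal sketch of that reference-channel solution (not the torchaudio implementation; names, the `eps` regularizer, and the identity-PSD example are assumptions for illustration):

```python
import torch

def mvdr_weights_souden(psd_s, psd_n, ref_channel=0, eps=1e-8):
    # w = (Phi_NN^{-1} Phi_SS u) / trace(Phi_NN^{-1} Phi_SS),
    # with u a one-hot vector selecting the reference channel.
    numerator = torch.linalg.solve(psd_n, psd_s)       # Phi_NN^{-1} Phi_SS
    trace = torch.einsum("...ii->...", numerator)[..., None, None]
    w = numerator / (trace + eps)                      # (..., freq, channel, channel)
    return w[..., ref_channel]                         # multiply by u: pick one column

# Toy check with identity PSDs of shape (freq, channel, channel)
psd_s = torch.eye(4, dtype=torch.cdouble).repeat(201, 1, 1)
psd_n = torch.eye(4, dtype=torch.cdouble).repeat(201, 1, 1)
w = mvdr_weights_souden(psd_s, psd_n)                  # (201, 4)
```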
I will keep the "complex-valued mask support" as a future plan since I'm not sure of the formula to normalize the PSD matrix by the mask.
spec = torch.rand((2, 6, 201, 100), dtype=torch.cdouble)

# Single then transform then batch
expected = PSD()(spec).repeat(3, 1, 1, 1)
I am aware that this is taken from the other tests, but this turned out to miss certain cases. (Ref: #1451)
Can you make the samples in the batch different?
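One way to sketch such a test: draw a batch where every sample is distinct, then compare the batched output against per-sample outputs. The `transform` below is a toy per-frequency outer-product PSD standing in for the module under test (an illustrative assumption, not the PR's code):

```python
import torch

def transform(spec):
    # spec: (..., channel, freq, time) -> PSD: (..., freq, channel, channel)
    s = spec.transpose(-3, -2)                         # (..., freq, channel, time)
    return torch.einsum("...ct,...dt->...cd", s, s.conj())

torch.manual_seed(0)
batch = torch.rand(3, 6, 201, 100, dtype=torch.cdouble)  # 3 distinct samples

batched = transform(batch)
itemized = torch.stack([transform(sample) for sample in batch])
assert torch.allclose(batched, itemized)  # batch consistency
```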
(we also need to update the existing batch consistency tests)
Thanks for pointing this out. I updated the test in f976eae.
I will create another PR to update the rest of the batch consistency tests.