Great paper and great repo. My question relates more to the paper itself. In the paper you mention:
The statistics pooling layer in speaker embeddings networks with 2D CNN architectures is a concatenation of the mean and std of each of the F × C frequency-channel pairs
I am a bit confused here. In PyTorch terms, if my ResNet output is B x C x T x F, how exactly do I implement stats pooling?
would it be:
x = x.permute(0, 2, 3, 1)   # B x T x F x C
x = x.reshape(B, T, F * C)  # B x T x (F*C)
followed by a stats pooling layer?
Thank You for the help!
Hi Sreyan,
Thanks for your interest. Stats pooling is very simple.
Your tensor after x = x.reshape(B, T, F * C) has shape B x T x (F*C).
Do something like:
x = torch.cat((torch.mean(x, dim=1), torch.std(x, dim=1)), dim=1)
so that each example in your minibatch is a vector of dim = 2*F*C.
Is that clear?
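Putting the steps above together, a minimal end-to-end sketch (the shapes B, C, T, F below are illustrative values, not from the repo):

```python
import torch

# Illustrative shapes: batch, channels, time frames, frequency bins.
B, C, T, F = 4, 8, 50, 10
x = torch.randn(B, C, T, F)  # stand-in for the ResNet output

x = x.permute(0, 2, 3, 1)    # B x T x F x C
x = x.reshape(B, T, F * C)   # B x T x (F*C)

# Statistics pooling: concatenate mean and std over the time axis.
pooled = torch.cat((x.mean(dim=1), x.std(dim=1)), dim=1)  # B x (2*F*C)
print(pooled.shape)  # torch.Size([4, 160])
```

Note that dim=1 in torch.cat is what keeps the batch axis intact; concatenating along dim 0 would stack the mean and std of different examples on top of each other.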
Thank You so much for your reply. Yes, that is clear. I am currently trying to implement an attentive stats pooling layer. Will update you here once I am able to find a solution. Thank You!
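For reference, one common formulation of attentive statistics pooling replaces the plain mean/std with attention-weighted ones. A sketch under assumed dimensions (not the repo's implementation; the hidden size and input dim are illustrative):

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Sketch of attentive statistics pooling: a small attention
    network scores each frame, and softmax-normalized weights give a
    weighted mean and std over time."""

    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):  # x: B x T x D
        w = torch.softmax(self.attention(x), dim=1)  # B x T x 1
        mean = torch.sum(w * x, dim=1)               # B x D
        var = torch.sum(w * x * x, dim=1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))        # numerical safety
        return torch.cat((mean, std), dim=1)         # B x 2D

pool = AttentiveStatsPooling(in_dim=80)
out = pool(torch.randn(4, 50, 80))
print(out.shape)  # torch.Size([4, 160])
```

The weighted-variance identity E[x²] − E[x]² avoids materializing the centered tensor, and the clamp guards against tiny negative values from floating-point error.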