I want to use frame-level SSAST just for frame-level audio token extraction #16
Hi there, thanks for reaching out.

1/ You are correct; see Lines 125 to 130 in a1a3eec. FYI, you can use

2/ You are correct, but please be cautious about that.

3/ This involves audio-visual learning, while this paper is about pure-audio research. But we do use fbank features as input in this paper; see Lines 126 to 127 in a1a3eec.

Hope these help.
-Yuan
Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I am going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay?
I have never tried other input features. You can pretrain your own model with other input features, but if you plan to use our pretrained model to extract features/embeddings/tokens, then you have to use the same dataloader as ours (which is fully released in this repo); any input distribution shift could cause a dramatic performance difference.
-Yuan
In your ast_models.py, you set cluster=True as the default, but to use frame-level SSAST, cluster should be False. Do I have to turn it off?
If I want to use your pretrained frame-level SSAST for audio token extraction, is the output of self.v.norm(x), excluding the first token, what I should use in the finetuningcls function? Because the first one is the cls token.
One more thing I wonder: could I get the part of the fbank that corresponds to the video frames? A mel spectrogram allows this, but I don't know whether fbank does.
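On the last question: since Kaldi-style fbank features also use a fixed frame shift (10 ms in typical AST setups, an assumption here), mapping a video frame to fbank rows is the same index arithmetic as for a mel spectrogram. A minimal sketch, with illustrative defaults rather than repo values:

```python
def video_frame_to_fbank_rows(video_idx, fps=25.0, frame_shift_ms=10.0):
    """Return the [start, end) fbank row indices covering one video frame.

    fps and frame_shift_ms are illustrative defaults (25 fps video,
    10 ms fbank hop); substitute the actual values of your data.
    """
    start_s = video_idx / fps            # video frame start, seconds
    end_s = (video_idx + 1) / fps        # video frame end, seconds
    start = int(round(start_s * 1000 / frame_shift_ms))
    end = int(round(end_s * 1000 / frame_shift_ms))
    return start, end
```

With these defaults each video frame spans 40 ms, i.e. four fbank rows, so `video_frame_to_fbank_rows(1)` returns `(4, 8)`.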