-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
yamnet architecture for audioset label extractor #379
Conversation
…n logic; log_attributes as dicts
@tyarkoni @adelavega the basic logic should be in place and working now, could you take a quick first look to see if you have any major objections? |
pliers/extractors/audio.py
Outdated
def _extract(self, stim): | ||
|
||
data = stim.data | ||
self.params['SAMPLE_RATE'] = stim.sampling_rate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure its OK to do this? You explained to me that it was only change how the spectrogram is created, so if you're positive, sounds good to me. I'm just noting this comment in params.py:
The following hyperparameters (except PATCH_HOP_SECONDS) were used to train YAMNet,
so expect some variability in performance if you change these. The patch hop can
be changed arbitrarily: a smaller hop should give you more patches from the same
clip and possibly better performance at a larger computational cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth testing the same audio clip with two sampling rates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will make sure to test it thoroughly (and look into an issue with output values I just noticed) and call another round of review :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I've looked into this and I think I'm on top of things.
Tl;dr -> it's okay to have other SAMPLE_RATE
than 16000Hz, as a) size of input patches will still be the same; b) results should not be significantly affected. Caveat is that SAMPLE_RATE
has to be twice as large as MEL_MAX_HZ
. Writing down a lengthy explanation below, mostly as a reminder for myself.
(Why) the model is compatible with different sampling rates
Yamnet is compatible with different sampling rates, although it's been trained with audio sampled at 16000Hz.
The input sampling rate is only relevant in the preprocessing step, where the waveform is first passed through STFT to extract spectrogram, which is then converted into a mel-scale spectrogram.
The sampling rate parameter is used to compute how many samples in the stimulus go into a STFT window, but the size of the window is independent from SAMPLE_RATE
and explicitly encoded in a separate parameter. Therefore, different sampling rates do not affect how the waveform is binned in the temporal domain.
What SAMPLE_RATE
does implicitly influence, though, is the (number, values and resolution of) frequency bins used for the Fourier transform. Higher SAMPLE_RATE
correspond to higher frequency resolution in the spectrogram.
The second pass is transforming the spectrogram into a mel-scale spectrogram. The trick here is that, regardless of how many and which frequency bins you have in the spectrogram, the number of frequency bands in the mel spectrogram is fixed by the parameters MEL_BANDS
(= number of mel-frequency bins, set to 64 for training) and its range is determined by MEL_MIN_HZ
/ MEL_MAX_HZ
(set to 125/7500 Hz for training). Input spectrograms of size [n_freq_bins, n_time_bins] will always output [n_mel_bands, n_time_bins], where n_mel_bands is independent of n_freq_bins. That's why different sampling rates would yield still fit model requirements in terms of input size.
Different sampling rates should have no impact on the results
Although the input shape is not affected, having different sampling rates on the same waveform might still yield slightly differences in the spectrograms (as FFT frequency resolution will differ).
But as long as the mel frequency range (MEL_MIN_HZ
and MEL_MAX_HZ
) is kept close to that used for training these differences shouldn't have any impact.
Which of the params can be changed?
MEL_BANDS
has to be 64 as the model expects this in its inputMEL_MIN_HZ
andMEL_MAX_HZ
can be changed, with the caveat that model performance can actually be affected and thatMEL_MAX_HZ
has to be less than half of the sampling rate (Nyquist freq).SAMPLE_RATE
can be modified and adapted to actual sampling rate, with the constraint that it has to be at least twice as large asMEL_MAX_HZ
.
Re: pliers extractor
- Would probably add a warning if
SAMPLE_RATE
≠ 16000Hz telling the user that the model was trained on 1600Hz, different sampling rates may have a minimal impact on the results and it could be an idea to resample; - Would also add a pliers error if
SAMPLE_RATE
< 2 *MEL_MAX_HZ
, warning that it may be a bad idea to have lower sampling rate and suggesting to either upsample or changeMEL_MAX_HZ
to a suitable value, although this may reduce accuracy of the predictions
Oh, and this works in TF 1 & 2 right? |
Co-Authored-By: Alejandro de la Vega <aleph4@gmail.com>
Yes, found a workaround for the yamnet versioning issue by calling tf graph explicitly, which works both in tf 1 and 2. Will keep ensuring that it's compatible with both. Also raised an issue on yamnet repo, something might happen on that front too (tensorflow/models#8157). |
next to dos:
|
…om exception for sampling rate / max mel freq mismatch
there might be a more general timing issue. |
Onsets/duration issue fixed (the length of the window to which predictions are made is actually slight longer than patch_window_seconds). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good! I left some minor comments.
pliers/extractors/audio.py
Outdated
|
||
def __init__(self, hop_size=0.1, top_n=None, labels=None, | ||
weights_path=None, yamnet_path=None, **yamnet_kwargs): | ||
if yamnet_path is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to allow the user to specify a custom yamnet path? I'm assuming this would be an alternate way to import yamnet... the risk there is that users probably won't want to use this unless they have reason to think something has changed in the yamnet implementation. And if something has changed in the yamnet implementation, there's a good chance it will imply an API change, which could result in things breaking for us. So I guess my inclination would be to remove this check and always point to our own installer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, unless we think sometimes our install would fail. But I think that should be considered a bug to fix, so no need to build a workaround yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the option to specify custom yamnet_path
(that was only to avoid downloading extra stuff in the user already has the yamnet or the whole tf repo somewhere else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably what we should ultimately do is settle on a unified way of handling downloaded models/files, so that users can do something like pliers install yamnet --dir=/path/to/put/files
from the command line, where the storage location is optional (and would default to what we're currently using). Might be worth opening an issue. But since this is the only model like this right now (all the others have their own packages that handle setup themselves), let's not get sidetracked with that right now. This seems fine.
pliers/support/setup_yamnet.py
Outdated
import runpy | ||
|
||
DOWNLOAD_PATH = Path.home() / 'pliers_data' | ||
YAMNET_PATH = DOWNLOAD_PATH / 'models-master' / 'research' / 'audioset' / 'yamnet' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this mirrors the structure in the yamnet repo, but it's a lot of nesting. Unless the yamnet code requires that structure, I'd probably just store this under .pliers/models/yamnet
or something like that (note that I believe we already use the convention of storing data in .pliers
in other places, which is what many other packages do as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's that the whole TF repo is being downloaded. That said, in the download script you could download a subset, or download to a tmp dir, and only copy over what you need to .pliers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now so that the download script fetches the TF repo's zip archive (no way, as far as I can see, to only download a subset), extracts it all into a yamnet_temp
folder in pliers_data
, moves the yamnet part to a flat directory pliers_data/yamnet
and deletes yamnet_temp
.
We could in principle avoid extracting the whole thing, and just looping through files in the zip archive so to only extract the yamnet ones. That would however add a bunch of lines of code, as we would: 1) have to loop; 2) take all the extracted files and move them to a flat folder, as zipfile keeps the file structure when using extract
. And after all, the whole thing is only 125Mb.
Just to make sure though, you are talking about pliers_data
(same default folder in user's home directory as text dictionaries are dumped to), right? Not .pliers
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was under the impression we're putting stuff in .pliers
, but if we're currently putting it in pliers_data
, then yeah, putting this there as well makes sense.
@tyarkoni @adelavega thanks a lot for your comments! The last few commits should address all of them, you're welcome to take a look and make sure it is so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Okay to merge?
Should be, unless @adelavega has further comments. |
Nope, LGTM. Merging! |
wrapper for https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet.py extracting label probability for 521 audio events from AudioSet hierarchical ontology.