
yamnet architecture for audioset label extractor #379

Merged
70 commits merged on Mar 23, 2020

Conversation

@rbroc (Collaborator) commented Feb 17, 2020

Wrapper for https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet.py extracting label probabilities for 521 audio events from the AudioSet hierarchical ontology.
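For context, a minimal usage sketch (the AudiosetLabelExtractor class name and the prior yamnet download are assumptions based on the discussion in this thread, not verbatim from the diff):

from pliers.stimuli import AudioStim
from pliers.extractors import AudiosetLabelExtractor  # class name assumed; see diff

# Minimal sketch: assumes the yamnet model files have already been downloaded
# (see the setup_yamnet.py discussion further down in this thread).
stim = AudioStim('speech_clip.wav')
ext = AudiosetLabelExtractor(hop_size=0.1, top_n=10)  # keep the 10 most probable labels
df = ext.transform(stim).to_df()
print(df.head())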

@coveralls commented Feb 17, 2020

Coverage decreased (-5.2%) to 73.502% when pulling afbe485 on rbroc:audioset into 7e92ad5 on tyarkoni:master.

@rbroc (Collaborator, Author) commented Feb 19, 2020

@tyarkoni @adelavega the basic logic should be in place and working now. Could you take a quick first look to see whether you have any major objections?

pliers/extractors/audio.py
def _extract(self, stim):
    data = stim.data
    self.params['SAMPLE_RATE'] = stim.sampling_rate
Member

Are you sure it's OK to do this? You explained to me that it only changes how the spectrogram is created, so if you're positive, sounds good to me. I'm just noting this comment in params.py:

The following hyperparameters (except PATCH_HOP_SECONDS) were used to train YAMNet,
so expect some variability in performance if you change these. The patch hop can
be changed arbitrarily: a smaller hop should give you more patches from the same
clip and possibly better performance at a larger computational cost.

Member

Might be worth testing the same audio clip with two sampling rates.

Collaborator Author

Will make sure to test it thoroughly (and look into an issue with output values I just noticed) and call another round of review :)

@rbroc (Collaborator, Author) commented Feb 20, 2020

Okay, I've looked into this and I think I'm on top of things.
Tl;dr: it's okay to use a SAMPLE_RATE other than 16000 Hz, as a) the size of the input patches will still be the same; b) results should not be significantly affected. The caveat is that SAMPLE_RATE has to be at least twice as large as MEL_MAX_HZ. Writing down a lengthy explanation below, mostly as a reminder for myself.

(Why) the model is compatible with different sampling rates

Yamnet is compatible with different sampling rates, although it was trained with audio sampled at 16000 Hz.

The input sampling rate is only relevant in the preprocessing step, where the waveform is first passed through an STFT to extract a spectrogram, which is then converted into a mel-scale spectrogram.

The sampling rate parameter is used to compute how many samples of the stimulus go into an STFT window, but the duration of the window is independent of SAMPLE_RATE and explicitly encoded in a separate parameter. Therefore, different sampling rates do not affect how the waveform is binned in the temporal domain.

What SAMPLE_RATE does implicitly influence, though, is the number, values, and resolution of the frequency bins used for the Fourier transform. A higher SAMPLE_RATE corresponds to higher frequency resolution in the spectrogram.

The second step transforms the spectrogram into a mel-scale spectrogram. The trick here is that, regardless of how many and which frequency bins you have in the spectrogram, the number of frequency bands in the mel spectrogram is fixed by the parameter MEL_BANDS (= number of mel-frequency bins, set to 64 for training), and its range is determined by MEL_MIN_HZ / MEL_MAX_HZ (set to 125/7500 Hz for training). Input spectrograms of size [n_freq_bins, n_time_bins] will always yield outputs of size [n_mel_bands, n_time_bins], where n_mel_bands is independent of n_freq_bins. That's why inputs with different sampling rates still fit the model's requirements in terms of input size.
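To make that concrete, here's a small illustrative sketch (plain NumPy with yamnet-style hyperparameter values, not the actual yamnet feature code): the spectrogram height changes with the sampling rate, but the mel spectrogram height is always MEL_BANDS.

import numpy as np

STFT_WINDOW_SECONDS = 0.025   # yamnet-style value, for illustration only
MEL_BANDS = 64

for sample_rate in (16000, 44100):
    window_samples = int(round(sample_rate * STFT_WINDOW_SECONDS))
    n_fft = 2 ** int(np.ceil(np.log2(window_samples)))   # next power of two
    n_freq_bins = n_fft // 2 + 1                          # spectrogram height varies with sample rate
    # the mel filterbank maps n_freq_bins onto a fixed number of MEL_BANDS bands
    print(f'{sample_rate} Hz: {n_freq_bins} FFT bins -> {MEL_BANDS} mel bands')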

Different sampling rates should have no impact on the results

Although the input shape is not affected, different sampling rates on the same waveform might still yield slightly different spectrograms (as the FFT frequency resolution will differ).
But as long as the mel frequency range (MEL_MIN_HZ and MEL_MAX_HZ) is kept close to the one used for training, these differences shouldn't have any impact.

Which of the params can be changed?

  • MEL_BANDS has to be 64, as the model expects this input size
  • MEL_MIN_HZ and MEL_MAX_HZ can be changed, with the caveats that model performance may actually be affected and that MEL_MAX_HZ has to be less than half of the sampling rate (Nyquist frequency).
  • SAMPLE_RATE can be modified and adapted to the actual sampling rate, with the constraint that it has to be at least twice as large as MEL_MAX_HZ.

Re: pliers extractor

  • Would probably add a warning if SAMPLE_RATE ≠ 16000 Hz, telling the user that the model was trained on 16000 Hz, that different sampling rates may have a minimal impact on the results, and that resampling could be a good idea;
  • Would also add a pliers error if SAMPLE_RATE < 2 * MEL_MAX_HZ, warning that such a low sampling rate is a bad idea and suggesting to either upsample or lower MEL_MAX_HZ to a suitable value, although this may reduce the accuracy of the predictions. A sketch of both checks is below.
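A rough sketch of what those checks could look like (function and argument names are illustrative, not the final extractor code):

import warnings

def _check_sampling_rate(sample_rate, mel_max_hz=7500):
    # Illustrative sketch of the proposed checks, not the actual pliers code.
    if sample_rate < 2 * mel_max_hz:
        raise ValueError(
            'SAMPLE_RATE must be at least twice MEL_MAX_HZ (Nyquist); either '
            'upsample the audio or lower MEL_MAX_HZ (which may reduce accuracy).')
    if sample_rate != 16000:
        warnings.warn(
            'YAMNet was trained on audio sampled at 16000 Hz; other sampling '
            'rates may have a small impact on the results. Consider resampling.')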

@adelavega (Member) commented

Oh, and this works in TF 1 & 2 right?

@rbroc (Collaborator, Author) commented Feb 20, 2020

Oh, and this works in TF 1 & 2 right?

Yes, I found a workaround for the yamnet versioning issue by calling the TF graph explicitly, which works in both TF 1 and 2. Will keep making sure it's compatible with both. Also raised an issue on the yamnet repo, so something might happen on that front too (tensorflow/models#8157).
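For reference, this is roughly the pattern, sketched after yamnet's own inference example (the weights path and dummy waveform are placeholders, not the PR code):

import numpy as np
import tensorflow as tf
import params                    # modules shipped with the downloaded yamnet code
import yamnet as yamnet_model

# Build and run the model inside an explicit Graph, so the same calls work
# whether eager execution is on (TF 2.x) or off (TF 1.x).
graph = tf.Graph()
with graph.as_default():
    model = yamnet_model.yamnet_frames_model(params)
    model.load_weights('yamnet.h5')

waveform = np.zeros(16000, dtype=np.float32)   # placeholder: one second of silence
with graph.as_default():
    scores, spectrogram = model.predict(np.reshape(waveform, [1, -1]), steps=1)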

@rbroc (Collaborator, Author) commented Feb 20, 2020

Next to-dos:

  • subset by category name
  • mixin option or add length warning
  • add tests
  • docs on how to download and set up yamnet
  • maybe migrate to models

@rbroc (Collaborator, Author) commented Mar 5, 2020

We determined this bug is due to the fact that one more onset is being included than should be. This does not occur when creating the data, thus the shorter length.

A hacky workaround is to index the onsets by the length of the data:

        onsets = onsets[:preds.shape[0]]

There might be a more general timing issue.
Will look into that asap! :)

@rbroc (Collaborator, Author) commented Mar 5, 2020

We determined this bug is due to the fact that one more onset is being included than should be. This does not occur when creating the data, thus the shorter length.

A hacky workaround is to index the onsets by the length of the data:

        onsets = onsets[:preds.shape[0]]

Onsets/duration issue fixed (the length of the window over which predictions are made is actually slightly longer than patch_window_seconds).

@tyarkoni (Collaborator) left a comment

Overall, looks good! I left some minor comments.

pliers/extractors/audio.py

def __init__(self, hop_size=0.1, top_n=None, labels=None,
             weights_path=None, yamnet_path=None, **yamnet_kwargs):
    if yamnet_path is None:
Collaborator

Is there a reason to allow the user to specify a custom yamnet path? I'm assuming this would be an alternate way to import yamnet... the risk there is that users probably won't want to use this unless they have reason to think something has changed in the yamnet implementation. And if something has changed in the yamnet implementation, there's a good chance it will imply an API change, which could result in things breaking for us. So I guess my inclination would be to remove this check and always point to our own installer.

Member

Agreed, unless we think our install would sometimes fail. But I think that should be considered a bug to fix, so no need to build a workaround yet.

Collaborator Author

Removed the option to specify a custom yamnet_path (that was only to avoid downloading extra stuff in case the user already has yamnet or the whole TF repo somewhere else).

Collaborator

Probably what we should ultimately do is settle on a unified way of handling downloaded models/files, so that users can do something like pliers install yamnet --dir=/path/to/put/files from the command line, where the storage location is optional (and would default to what we're currently using). Might be worth opening an issue. But since this is the only model like this right now (all the others have their own packages that handle setup themselves), let's not get sidetracked with that right now. This seems fine.

import runpy
from pathlib import Path

DOWNLOAD_PATH = Path.home() / 'pliers_data'
YAMNET_PATH = DOWNLOAD_PATH / 'models-master' / 'research' / 'audioset' / 'yamnet'
Collaborator

I assume this mirrors the structure in the yamnet repo, but it's a lot of nesting. Unless the yamnet code requires that structure, I'd probably just store this under .pliers/models/yamnet or something like that (note that I believe we already use the convention of storing data in .pliers in other places, which is what many other packages do as well).

Member

I think it's that the whole TF repo is being downloaded. That said, in the download script you could download a subset, or download to a tmp dir, and only copy over what you need to .pliers

Collaborator Author

It is now set up so that the download script fetches the TF repo's zip archive (as far as I can see, there is no way to download only a subset), extracts it all into a yamnet_temp folder in pliers_data, moves the yamnet part to a flat directory pliers_data/yamnet, and deletes yamnet_temp.
We could in principle avoid extracting the whole thing and just loop through the files in the zip archive so as to only extract the yamnet ones. That would however add a bunch of lines of code, as we would 1) have to loop; 2) take all the extracted files and move them to a flat folder, since zipfile keeps the file structure when using extract. And after all, the whole thing is only 125 MB.
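For reference, this is roughly the flow (URL and paths illustrate the idea; the actual code lives in setup_yamnet.py):

import shutil
import urllib.request
import zipfile
from pathlib import Path

MODELS_URL = 'https://github.com/tensorflow/models/archive/master.zip'
DOWNLOAD_PATH = Path.home() / 'pliers_data'
tmp_dir = DOWNLOAD_PATH / 'yamnet_temp'

# Fetch the whole TF models archive, extract it into a temp folder,
# keep only the yamnet subdirectory in a flat location, then clean up.
archive, _ = urllib.request.urlretrieve(MODELS_URL)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(tmp_dir)
shutil.move(str(tmp_dir / 'models-master' / 'research' / 'audioset' / 'yamnet'),
            str(DOWNLOAD_PATH / 'yamnet'))
shutil.rmtree(tmp_dir)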
Just to make sure though, you are talking about pliers_data (the same default folder in the user's home directory that text dictionaries are dumped to), right? Not .pliers...

Collaborator

Oh, I was under the impression we're putting stuff in .pliers, but if we're currently putting it in pliers_data, then yeah, putting this there as well makes sense.

@rbroc (Collaborator, Author) commented Mar 18, 2020

@tyarkoni @adelavega thanks a lot for your comments! The last few commits should address all of them; you're welcome to take a look and make sure that's the case.
I've made some slightly more substantial changes to the setup_yamnet.py script, so as to only keep the relevant folder and delete all the other TF models (see my reply to Tal's comment).

@tyarkoni (Collaborator) left a comment

LGTM! Okay to merge?

@rbroc (Collaborator, Author) commented Mar 23, 2020

LGTM! Okay to merge?

Should be, unless @adelavega has further comments.

@adelavega (Member) commented

Nope, LGTM. Merging!

@adelavega merged commit 5dedbe7 into PsychoinformaticsLab:master on Mar 23, 2020