
Self-supervision but actually using labels #4

Open
pmorerio opened this issue Apr 10, 2019 · 2 comments

pmorerio commented Apr 10, 2019

Hi,
the work should rely on self-supervision, where no labels are used for training, as stated in the README:

We DO NOT use the labels of the videos in any way during training, and only use them for evaluation.

However, labels are actually used when choosing the negative pairs for the contrastive loss. This is either a bug or an unfair evaluation. That said, I reckon choosing negative pairs randomly should not make much difference, given the high number of classes (roughly a p = 1/50 chance of sampling a same-class pair).
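For concreteness, a label-free sampler could look like the following minimal sketch. This is an illustration, not the repository's actual code; the `(frame, audio)` dataset layout and the balanced-50-class collision estimate are assumptions:

```python
import random


def sample_pairs(videos, rng=random.Random(0)):
    """Build (frame, audio, is_match) pairs without consulting class labels.

    `videos` is a list of (frame, audio) tuples, one per video.
    Positives pair a frame with its own audio; negatives pair it with
    audio drawn uniformly from a *different* video, so no label
    information is ever used.
    """
    pairs = []
    for i, (frame, audio) in enumerate(videos):
        pairs.append((frame, audio, 1))  # positive: same video
        # Draw a uniform index over the other len(videos) - 1 videos.
        j = rng.randrange(len(videos) - 1)
        if j >= i:  # shift past the current video to exclude it
            j += 1
        pairs.append((frame, videos[j][1], 0))  # negative: different video
    return pairs


# With ~50 balanced classes, a random negative happens to share the
# anchor's class with probability of roughly 1/50, matching the
# estimate above.
```

The only supervision here is the video boundary itself, which is exactly the "automatic alignment" the paper describes.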

rohitrango (Owner) commented

The paper (and we) interpret this as not using discriminative labels like "guitar", "piano", etc., while still letting the neural net figure out the semantics on its own. Note that there is no clear boundary between instruments, and some videos have multiple sound sources too, which makes the task more difficult.

I hope this answers your question. If you have any questions please let me know.

pmorerio (Author) commented Jul 8, 2019

Hi,
I respectfully disagree.
Self-supervised methods are engineered precisely to avoid labeling, so using labels goes against the spirit of self-supervision. Even if the labels are not used directly to optimize the loss, that does not mean there is no label supervision; it is simply a different kind of supervision, where the labels are used in a different way (i.e., to choose the pairs).
In the paper they actually state:

The labels for the positives (matching) and negatives (mismatched) pairs are obtained directly, as videos provide an automatic alignment between the visual and the audio streams – frame and audio coming from the same time in a video are positives, while frame and audio coming from different videos are negatives.

This seems to suggest that negative pairs are simply taken from a different video, with no need for labels.
