EmoMatch Task

A simple transfer-learning task based on the VoxCeleb dataset for pretraining networks that work on videos (audio + video). This code requires you to download the VoxCeleb dataset and extract both its audio and its video data.

The idea of this approach is based on the paper Look, Listen, Learn, in which audio and video information were used to pretrain an image encoder network for image classification tasks.

This project tries to extend that approach: instead of training only an image encoder, it pretrains a network that can process both audio and video information. The task the network is meant to solve is rather simple: given an audio sequence and a video sequence, decide whether the two match (i.e., have the same origin).
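The matching task needs labeled pairs, and these can be derived from the recordings themselves without manual annotation. The snippet below is a minimal sketch of one way to build such pairs; the function name `make_pairs`, the `negative_ratio` parameter, and the recording identifiers are assumptions made for illustration and are not taken from this repository.

```python
import random


def make_pairs(recording_ids, negative_ratio=0.5, seed=0):
    """Return (video_id, audio_id, label) triples.

    label 1 = Match (same recording), label 0 = No Match (different recordings).
    Assumes the recording ids are unique.
    """
    if len(recording_ids) < 2:
        raise ValueError("need at least two recordings to build No Match pairs")
    rng = random.Random(seed)
    pairs = []
    for video_id in recording_ids:
        if rng.random() < negative_ratio:
            # No Match: take the audio track from a different recording.
            others = [r for r in recording_ids if r != video_id]
            pairs.append((video_id, rng.choice(others), 0))
        else:
            # Match: audio and video come from the same recording.
            pairs.append((video_id, video_id, 1))
    return pairs


# Hypothetical recording identifiers, for illustration only.
pairs = make_pairs(["rec_a", "rec_b", "rec_c", "rec_d"])
```

Each triple names the recording that supplies the video track, the recording that supplies the audio track, and the resulting label, so the label is always consistent with whether the two identifiers agree.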

The structure of the EmoMatch training procedure is shown in the image below. The left side shows the data preparation, while the right side illustrates the data flow through the network. During data preparation, each video recording is split into its video track and its audio track. The video track is then fed into a VNet and the audio track into an ANet. These two networks serve as encoders that generate features for a classifier network. The classifier then decides whether the audio track originates from the same recording as the video track (Match) or from a different recording (No Match).
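The data flow described above can be sketched in PyTorch as follows. This is an illustrative sketch only, not the architecture used in this repository: the layer choices, the feature size, the input shapes, and the decision to concatenate the two feature vectors before the classifier are all assumptions. Only the names VNet and ANet and the Match / No Match output come from the description.

```python
import torch
from torch import nn


class VNet(nn.Module):
    """Assumed video encoder: (batch, 3, frames, H, W) -> (batch, feat_dim)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool over time and space
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, video):
        return self.fc(self.conv(video).flatten(1))


class ANet(nn.Module):
    """Assumed audio encoder: (batch, 1, n_mels, time) -> (batch, feat_dim)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over frequency and time
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, audio):
        return self.fc(self.conv(audio).flatten(1))


class MatchClassifier(nn.Module):
    """Encodes both tracks and predicts No Match (0) or Match (1)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.vnet = VNet(feat_dim)
        self.anet = ANet(feat_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, video, audio):
        # Assumption: the two feature vectors are fused by concatenation.
        features = torch.cat([self.vnet(video), self.anet(audio)], dim=1)
        return self.head(features)


model = MatchClassifier()
video = torch.randn(4, 3, 8, 32, 32)   # random stand-in for 4 video clips
audio = torch.randn(4, 1, 40, 100)     # random stand-in for 4 spectrograms
labels = torch.tensor([1, 0, 1, 0])    # 1 = Match, 0 = No Match
logits = model(video, audio)
loss = nn.functional.cross_entropy(logits, labels)
```

After pretraining on this task, the two encoders could in principle be reused as feature extractors for a downstream task, which is the transfer-learning setting this project targets.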

About

Unsupervised Audio + Video Network Pretraining using PyTorch
