Multimodal One-Shot Learning of Speech and Images
This repository contains the full code recipe for building models that can acquire novel concepts from only one paired audio-visual example per class, without receiving any hard labels. These models can then be used to match new continuous speech input to the correct visual instance (e.g. the spoken word "lego" is matched to the visual signal of lego, without receiving any textual labels, and after seeing only a single paired speech-image example of a different lego instance). This is multimodal one-shot learning, a new task which we formalise in the following paper:
- R. Eloff, H. A. Engelbrecht, H. Kamper, "Multimodal One-Shot Learning of Speech and Images," arXiv preprint arXiv:1811.03875, 2018. [arXiv]
Please cite this paper if you use the code.
The following datasets are required for these experiments:

- TIDigits speech corpus
- Flickr Audio Caption Corpus
- Flickr8k text corpus
Note that the Flickr8k text corpus is used purely for obtaining train/validation/test splits. The instructions that follow assume that you have obtained these datasets and placed them somewhere sensible (e.g. ../data/tidigits).
The following steps need to be completed before running the experiment scripts:
Install nvidia-docker (version 2.0) for NVIDIA GPU access in docker containers
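As a quick sanity check that nvidia-docker 2.0 is working (this step is not part of the original recipe; the CUDA image tag below is illustrative, so pick one matching your driver version):

```shell
# Run nvidia-smi inside a container using the nvidia runtime;
# it should print your GPU(s) if nvidia-docker 2.0 is set up correctly
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
```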
Pull required images from Docker Hub:
| Docker image | Docker pull command |
| --- | --- |
| Kaldi for extracting speech features | |
| TensorFlow used as base for research environment | |
| Multimodal one-shot research environment | |
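The pull commands in the table did not survive formatting. As a hedged sketch only (the image names and tags below are placeholders, not necessarily this repository's actual images), pulling would look like:

```shell
# Placeholder image names -- substitute the actual images from the table above
docker pull kaldiasr/kaldi:latest               # Kaldi for feature extraction
docker pull tensorflow/tensorflow:latest-gpu    # TensorFlow base image
docker pull <user>/multimodal-one-shot:latest   # research environment (hypothetical name)
```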
Alternatively, you can build these images locally from their Dockerfiles:
Kaldi feature extraction
Extract speech features by running `run_feature_extraction.sh [OPTIONS]` (use the `--help` flag for more information):

```shell
./run_feature_extraction.sh \
    --tidigits=<path to TIDigits> \
    --flickr-audio=<path to Flickr audio> \
    --flickr-text=<path to Flickr8k text> \
    --n-cpu-cores=<number of CPU cores>
```
Replace each path with the full path to the corresponding dataset. The `--n-cpu-cores` flag specifies the number of CPU cores used for feature extraction (defaults to 8); using more cores may speed up the process. For example:

```shell
./run_feature_extraction.sh \
    --tidigits=/home/rpeloff/datasets/datasets/speech/tidigits \
    --flickr-audio=/home/rpeloff/datasets/speech/flickr_audio \
    --flickr-text=/home/rpeloff/datasets/text/Flickr8k_text \
    --n-cpu-cores=8
```
Train and test multimodal models
The multimodal one-shot models are demonstrated in two separate Jupyter notebooks:
- `experiments/nb1_unimodal_train_test.ipynb` trains and tests unimodal models for one-shot speech or image classification
- `experiments/nb2_multimodal_test.ipynb` extends the unimodal models to the multimodal one-shot case, testing on one-shot cross-modal speech-image digit matching
To run these notebooks and reproduce the results in the paper, execute the `run_notebooks.sh [OPTIONS]` script (use the `--help` flag for more information), and navigate to http://127.0.0.1:8888/. Follow the experiment notebooks and execute the code cells to train, test, and summarise the unimodal and multimodal one-shot models.
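The cross-modal matching evaluated in the second notebook can be illustrated with a minimal sketch. In practice the models use learned neural embeddings of speech and images; here plain vectors and cosine distance stand in for them, and the two-step procedure (match the spoken query to the nearest support-set speech example, then match that example's paired image to the test images) is one natural way to realise speech-image matching from a single paired example per class:

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def multimodal_one_shot_match(query_speech_emb, support_set, test_images):
    """Match a spoken query to a test image via a one-shot support set.

    support_set: list of (speech_embedding, image_embedding) pairs,
                 one unlabelled pair per novel class.
    test_images: list of image embeddings (the matching set).
    Returns the index of the predicted test image.
    """
    # Step 1 (unimodal, within speech): find the support pair whose
    # speech embedding is nearest to the query.
    nearest_pair = min(
        support_set,
        key=lambda pair: cosine_distance(query_speech_emb, pair[0]),
    )
    # Step 2 (cross-modal, within images): match that pair's image
    # embedding against the test images.
    return min(
        range(len(test_images)),
        key=lambda i: cosine_distance(nearest_pair[1], test_images[i]),
    )
```

For example, with a two-class support set `[([1, 0], [0, 1]), ([0, 1], [1, 0])]`, a query near the first speech embedding is matched to that pair's image, even though no class labels are ever used.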
All code used for the paper is present in this repo, and the experiment notebooks should reproduce all results.
If you find any mistakes in the code or notebooks, or have any general comments, please let us know by raising an issue!