# Testing of multimodal speech-vision models

**Author:** Ryan Eloff<br>
**Contact:** ryan.peter.eloff@gmail.com<br>
**Date:** October 2018

Experiments notebook 2.

## Overview

*Multimodal one-shot learning* is the problem of learning novel concepts from only *one or a few* examples of features in multiple modalities, with the only supervisory signal being that these features co-occur. 
Here we specifically consider multimodal one-shot learning on a dataset of isolated spoken digits paired with images (although any paired sensory information may be used).

We approach this problem by extending unimodal one-shot models to the multimodal case. Assuming that we have such models that can measure similarity within a modality (see [experiments notebook 1](https://github.com/rpeloff/multimodal-one-shot-learning/blob/master/experiments/nb1_unimodal_train_test.ipynb)), we can perform one-shot cross-modal matching by unimodal comparisons through the multimodal support set.
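
The two-step comparison described above can be sketched as follows. This is a minimal illustration, not the code used in `test_multimodal.py`: the function name `cross_modal_match`, the use of cosine distance, and the toy feature vectors are all assumptions made for this example.

```python
import numpy as np


def cosine_distances(x, ys):
    """Cosine distance between vector x and every row of ys."""
    x = np.asarray(x, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x = x / np.linalg.norm(x)
    ys = ys / np.linalg.norm(ys, axis=1, keepdims=True)
    return 1.0 - ys.dot(x)


def cross_modal_match(speech_query, support_speech, support_images, matching_images):
    """Return the index of the matching-set image predicted for a speech query."""
    # Step 1 (speech to speech): closest spoken item in the support set.
    nearest_support = np.argmin(cosine_distances(speech_query, support_speech))
    # Step 2 (image to image): take that item's paired image and find the
    # closest image in the matching set.
    paired_image = support_images[nearest_support]
    return int(np.argmin(cosine_distances(paired_image, matching_images)))


# Toy support set with two speech-image pairs (hypothetical features).
support_speech = np.array([[1.0, 0.0], [0.0, 1.0]])
support_images = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
matching_images = np.array([[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]])
predicted = cross_modal_match(np.array([0.8, 0.2]), support_speech,
                              support_images, matching_images)
print("Predicted matching image index:", predicted)
```

No comparison is ever made directly between a speech feature and an image feature; the support set pair is the only link between the two modalities.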

This notebook demonstrates how to extend unimodal models to multimodal one-shot learning, and reproduces the one-shot cross-modal speech-image digit matching results presented in [our paper](https://arxiv.org/abs/1811.03875):
R. Eloff, H. A. Engelbrecht, H. Kamper, "Multimodal One-Shot Learning of Speech and Images," 2018.

## Navigation

1. [Generate random model seeds](#seeds)<br>
2. [Multimodal one-shot models](#multimodal)<br>
    2.1. [Test parameters](#test_params)<br>
    2.2. [One-shot cross-modal matching tests](#multimodal_test)<br>
    2.3. [Summaries](#multimodal_summ)<br>

### Imports:

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import os
import sys
import json
import numpy as np


sys.path.append('..')


try:  # check that DTW has been compiled
    from src.dtw.speech_dtw import _dtw
except ImportError:
    print("Building DTW Cython code ...")
    !make clean -C ../src/dtw
    !make -C ../src/dtw
    from src.dtw.speech_dtw import _dtw  # should no longer raise ImportError after building Cython DTW code

### Utility functions:

In [None]:
def test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot):
    print("--------------------------------------------------------------------------------")
    print("Testing multimodal model:\n\t--speech-model-dir={}\n\t--vision-model-dir={}"
          .format(speech_model_dir, vision_model_dir))
    print("--------------------------------------------------------------------------------")
    !python ../src/multimodal/test_multimodal.py \
        --speech-data-dir=../kaldi_features/tidigits \
        --speech-model-dir={speech_model_dir} \
        --vision-model-dir={vision_model_dir} \
        --output-dir={out_dir} \
        --random-seed={random_seed} \
        --zeros-different \
        --n-queries=10 \
        --n-test-episodes=400 \
        --k-shot={k_shot} \
        --l-way=11


def test_speaker_invariance(speech_model_dir, vision_model_dir, out_dir, random_seed):
    print("--------------------------------------------------------------------------------")
    print("Testing multimodal model:\n\t--speech-model-dir={}\n\t--vision-model-dir={}"
          .format(speech_model_dir, vision_model_dir))
    print("--------------------------------------------------------------------------------")
    !python ../src/multimodal/test_multimodal.py \
        --speech-data-dir=../kaldi_features/tidigits \
        --speech-model-dir={speech_model_dir} \
        --vision-model-dir={vision_model_dir} \
        --output-dir={out_dir} \
        --random-seed={random_seed} \
        --zeros-different \
        --originator-type='difficult' \
        --n-queries=1 \
        --n-test-episodes=4000 \
        --k-shot=1 \
        --l-way=11


def summarise_tests(result_dir, result_file='test_result.txt', speaker_invariance=False):
    overall_results = []
    easy_overall_results = []
    dist_overall_results = []
    for root, subdirs, files in os.walk(result_dir):
        subdirs.sort()
        for dirname in subdirs:
            res_file = os.path.join(root, dirname, result_file)
            if os.path.isfile(res_file):
                print("--------------------------------------------------------------------------------")
                print("Model summary: directory={}".format(os.path.join(root, dirname)))
                print("--------------------------------------------------------------------------------")
                with open(res_file, 'r') as fp:
                    results = fp.read()
                print('\tResults: {}'.format(results))
                overall_results.append(float(results.split('\n')[0].split('accuracy: ')[1]))
                if speaker_invariance:
                    invariance_results = results.split('\n')[1].strip().split('\t')
                    easy_overall_results.append(float(invariance_results[0].split('accuracy: ')[1]))
                    dist_overall_results.append(float(invariance_results[1].split('accuracy: ')[1]))
    if not overall_results:  # nothing to summarise; avoids NaN statistics on empty lists
        print("No '{}' files found in: {}".format(result_file, result_dir))
        return
    # 95% confidence interval (normal approximation) over the per-model accuracies
    conf_interval_95 = 1.96 * np.std(overall_results) / np.sqrt(len(overall_results))
    print("================================================================================")
    print("OVERALL: AVERAGE ACCURACY: {:.4f} % +- {:.4f} (total tests: {})"
          .format(np.mean(overall_results)*100, conf_interval_95*100, len(overall_results)))
    if speaker_invariance:
        # These lists are only populated for speaker invariance tests, so compute their intervals here
        easy_conf_interval_95 = 1.96 * np.std(easy_overall_results) / np.sqrt(len(easy_overall_results))
        dist_conf_interval_95 = 1.96 * np.std(dist_overall_results) / np.sqrt(len(dist_overall_results))
        print("\tAVERAGE EASY SPEAKER ACCURACY: {:.4f} % +- {:.4f} (total tests: {})"
              .format(np.mean(easy_overall_results)*100, easy_conf_interval_95*100, len(easy_overall_results)))
        print("\tAVERAGE DISTRACTOR SPEAKER ACCURACY: {:.4f} % +- {:.4f} (total tests: {})"
              .format(np.mean(dist_overall_results)*100, dist_conf_interval_95*100, len(dist_overall_results)))
    print("--------------------------------------------------------------------------------")
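
The string splitting in `summarise_tests` implies a result file whose first line contains one `accuracy: <value>` field, and, for speaker invariance tests, a second line with two tab-separated `accuracy: <value>` fields. The sketch below shows that layout being parsed with a regular expression instead. The helper `parse_accuracies` and the field labels in `sample_result` are hypothetical; the exact text written by `test_multimodal.py` is not shown in this notebook.

```python
import re


def parse_accuracies(result_text):
    """Return every float that follows 'accuracy: ' in a result file's text."""
    return [float(value)
            for value in re.findall(r"accuracy: ([0-9.eE+-]+)", result_text)]


# Hypothetical file contents; only the 'accuracy: ' markers, the newline and
# the tab separator are inferred from summarise_tests.
sample_result = "Test accuracy: 0.7000\nEasy accuracy: 0.8000\tDistractor accuracy: 0.6000"
overall, easy, distractor = parse_accuracies(sample_result)
print(overall, easy, distractor)
```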

## 1. Generate random model seeds
<a id='seeds'></a>

We average results over 10 models trained with different seeds so that we can report average accuracies with 95% confidence intervals.

These seeds are generated as follows:

In [None]:
np.random.seed(42)
random_seeds = np.random.randint(1000, size=10)
print("Random seeds:", random_seeds)
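
The averaging over seeds mentioned above can be sketched as below. It follows the same formula that `summarise_tests` uses (a normal approximation with the population standard deviation). The helper name `mean_with_conf_interval` and the accuracy values are made up for illustration and are not results from the paper.

```python
import numpy as np


def mean_with_conf_interval(accuracies, z=1.96):
    """Mean accuracy and half-width of its confidence interval (z=1.96 for 95%)."""
    accuracies = np.asarray(accuracies, dtype=float)
    mean = accuracies.mean()
    # np.std defaults to the population standard deviation (ddof=0)
    half_width = z * accuracies.std() / np.sqrt(len(accuracies))
    return mean, half_width


# Hypothetical per-seed accuracies, one per trained model
example_accuracies = [0.70, 0.72, 0.68, 0.71, 0.69]
mean, half_width = mean_with_conf_interval(example_accuracies)
print("{:.2f} % +- {:.2f}".format(mean * 100, half_width * 100))
```

With only 10 models, a Student's t interval would be slightly wider than this normal approximation; the sketch keeps the 1.96 factor to stay consistent with the summaries in this notebook.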

## 2. Multimodal one-shot models
<a id='multimodal'></a>

The multimodal one-shot models presented here combine unimodal one-shot speech and vision models that were previously trained on background data which does not overlap with the multimodal one-shot task.
These models require no further training, and we can directly perform one-shot cross-modal matching by unimodal comparisons through the multimodal support set.

We specifically investigate Siamese neural networks trained for one-shot speech or image classification,
and compare to directly matching images (pixels) and extracted speech features (dynamic time warping), as well as to transfer learning with neural network classifiers.
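
For reference, the DTW baseline compares two variable-length feature sequences by finding the lowest-cost frame alignment between them. The sketch below is a plain NumPy version for illustration only. The experiments use the compiled Cython code in `src/dtw`, whose frame distance and any length normalisation may differ; the Euclidean frame distance and the name `dtw_distance` are assumptions.

```python
import numpy as np


def dtw_distance(x, y):
    """DTW alignment cost between feature sequences of shape (frames, dims)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Euclidean distance between every frame of x and every frame of y
    frame_dists = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # cost[i, j] is the best alignment cost of x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = frame_dists[i - 1, j - 1] + min(
                cost[i - 1, j],       # advance in x only
                cost[i, j - 1],       # advance in y only
                cost[i - 1, j - 1])   # advance in both
    return cost[n, m]


# The same contour spoken at two different speeds aligns with zero cost
slow = [[0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(dtw_distance(slow, fast))
```

Pixel matching for images needs no alignment step, since all images have the same size and can be compared directly as flattened vectors.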

## 2.1. Test parameters
<a id='test_params'></a>

The following parameters were used to produce the multimodal one-shot learning results in the paper. In this notebook they are only used to select the corresponding trained models and result directories for testing:

In [None]:
# DTW + pixels
dtw_feats_type = 'mfcc'
# FFNN classifier
ffnn_batch_size = 200  # same for both modalities
# CNN classifier
cnn_batch_size = 200  # same for both modalities
# Siamese CNN (offline)
speech_offline_n_train_episodes = 200
vision_offline_n_train_episodes = 600
# Siamese CNN (online)
speech_online_n_train_episodes = 50
vision_online_n_train_episodes = 150

## 2.2. One-shot cross-modal matching tests
<a id='multimodal_test'></a>

We test the trained multimodal speech-vision models on three tasks, where speech-image pairs are randomly selected from the [TIDigits speech corpus](https://catalog.ldc.upenn.edu/LDC93S10) and [MNIST handwritten digit dataset](http://yann.lecun.com/exdb/mnist/):

1. One-shot 11-way cross-modal speech-image digit matching
2. Five-shot 11-way cross-modal speech-image digit matching
3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors
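
To make the terms concrete: an L-way K-shot episode draws L classes, K support examples per class, and a number of held-out queries, which corresponds to the `--l-way`, `--k-shot` and `--n-queries` flags passed to the test script. The sketch below samples such an episode for a single modality. It is an illustration under assumptions: `sample_episode` is a hypothetical helper, and the way `test_multimodal.py` actually samples and pairs speech with images is not shown in this notebook.

```python
import numpy as np


def sample_episode(labels, l_way, k_shot, n_queries, rng):
    """Sample support and query indices for one L-way K-shot episode."""
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=l_way, replace=False)
    support, pool = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])   # K support examples for this class
        pool.extend(idx[k_shot:])      # remaining examples may become queries
    queries = rng.choice(pool, size=n_queries, replace=False)
    return np.array(support), queries


# Hypothetical dataset: 11 classes with 20 examples each
labels = np.repeat(np.arange(11), 20)
support, queries = sample_episode(labels, l_way=11, k_shot=5, n_queries=10,
                                  rng=np.random.RandomState(0))
print(len(support), "support items,", len(queries), "queries")
```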

### Dynamic Time Warping (DTW) for Speech + Pixel Matching for Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/dtw_pixels/1_shot/{}".format(dtw_feats_type)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/dtw/{}/random_seed={}".format(dtw_feats_type,
                                                                      random_seed)
    vision_model_dir = "./models/vision/pixels/random_seed={}".format(random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=1)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/dtw_pixels/5_shot/{}".format(dtw_feats_type)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/dtw/{}/random_seed={}".format(dtw_feats_type,
                                                                      random_seed)
    vision_model_dir = "./models/vision/pixels/random_seed={}".format(random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=5)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
output_dir = "./results/multimodal/dtw_pixels/speaker_invariance/{}".format(dtw_feats_type)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/dtw/{}/random_seed={}".format(dtw_feats_type,
                                                                      random_seed)
    vision_model_dir = "./models/vision/pixels/random_seed={}".format(random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_speaker_invariance(speech_model_dir, vision_model_dir, out_dir, random_seed)

### Feedforward Neural Network (FFNN) Softmax Classifiers for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/ffnn_softmax/1_shot/batch_size={}".format(ffnn_batch_size)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/ffnn_softmax/batch_size={}/random_seed={}".format(ffnn_batch_size,
                                                                                          random_seed)
    vision_model_dir = "./models/vision/ffnn_softmax/batch_size={}/random_seed={}".format(ffnn_batch_size,
                                                                                          random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=1)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/ffnn_softmax/5_shot/batch_size={}".format(ffnn_batch_size)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/ffnn_softmax/batch_size={}/random_seed={}".format(ffnn_batch_size,
                                                                                          random_seed)
    vision_model_dir = "./models/vision/ffnn_softmax/batch_size={}/random_seed={}".format(ffnn_batch_size,
                                                                                          random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=5)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
output_dir = "./results/multimodal/ffnn_softmax/speaker_invariance/batch_size={}".format(ffnn_batch_size)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/ffnn_softmax/batch_size={}/random_seed={}".format(ffnn_batch_size,
                                                                                          random_seed)
    vision_model_dir = "./models/vision/ffnn_softmax/batch_size={}/random_seed={}".format(ffnn_batch_size,
                                                                                          random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_speaker_invariance(speech_model_dir, vision_model_dir, out_dir, random_seed)

### Convolutional Neural Network (CNN) Softmax Classifiers for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/cnn_softmax/1_shot/batch_size={}".format(cnn_batch_size)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/cnn_softmax/batch_size={}/random_seed={}".format(cnn_batch_size,
                                                                                         random_seed)
    vision_model_dir = "./models/vision/cnn_softmax/batch_size={}/random_seed={}".format(cnn_batch_size,
                                                                                         random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=1)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/cnn_softmax/5_shot/batch_size={}".format(cnn_batch_size)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/cnn_softmax/batch_size={}/random_seed={}".format(cnn_batch_size,
                                                                                         random_seed)
    vision_model_dir = "./models/vision/cnn_softmax/batch_size={}/random_seed={}".format(cnn_batch_size,
                                                                                         random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=5)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
output_dir = "./results/multimodal/cnn_softmax/speaker_invariance/batch_size={}".format(cnn_batch_size)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/cnn_softmax/batch_size={}/random_seed={}".format(cnn_batch_size,
                                                                                         random_seed)
    vision_model_dir = "./models/vision/cnn_softmax/batch_size={}/random_seed={}".format(cnn_batch_size,
                                                                                         random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_speaker_invariance(speech_model_dir, vision_model_dir, out_dir, random_seed)

### Siamese CNN (offline) Comparators for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/siamese_offline/1_shot/n_train_speech={}_vision={}".format(
    speech_offline_n_train_episodes, vision_offline_n_train_episodes)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/siamese_offline/n_train={}/random_seed={}".format(
        speech_offline_n_train_episodes, random_seed)
    vision_model_dir = "./models/vision/siamese_offline/n_train={}/random_seed={}".format(
        vision_offline_n_train_episodes, random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=1)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/siamese_offline/5_shot/n_train_speech={}_vision={}".format(
    speech_offline_n_train_episodes, vision_offline_n_train_episodes)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/siamese_offline/n_train={}/random_seed={}".format(
        speech_offline_n_train_episodes, random_seed)
    vision_model_dir = "./models/vision/siamese_offline/n_train={}/random_seed={}".format(
        vision_offline_n_train_episodes, random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=5)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
output_dir = "./results/multimodal/siamese_offline/speaker_invariance/n_train_speech={}_vision={}".format(
    speech_offline_n_train_episodes, vision_offline_n_train_episodes)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/siamese_offline/n_train={}/random_seed={}".format(
        speech_offline_n_train_episodes, random_seed)
    vision_model_dir = "./models/vision/siamese_offline/n_train={}/random_seed={}".format(
        vision_offline_n_train_episodes, random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_speaker_invariance(speech_model_dir, vision_model_dir, out_dir, random_seed)

### Siamese CNN (online) Comparators for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/siamese_online/1_shot/n_train_speech={}_vision={}".format(
    speech_online_n_train_episodes, vision_online_n_train_episodes)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/siamese_online/n_train={}/random_seed={}".format(
        speech_online_n_train_episodes, random_seed)
    vision_model_dir = "./models/vision/siamese_online/n_train={}/random_seed={}".format(
        vision_online_n_train_episodes, random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=1)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
output_dir = "./results/multimodal/siamese_online/5_shot/n_train_speech={}_vision={}".format(
    speech_online_n_train_episodes, vision_online_n_train_episodes)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/siamese_online/n_train={}/random_seed={}".format(
        speech_online_n_train_episodes, random_seed)
    vision_model_dir = "./models/vision/siamese_online/n_train={}/random_seed={}".format(
        vision_online_n_train_episodes, random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_multimodal_k_shot(speech_model_dir, vision_model_dir, out_dir, random_seed, k_shot=5)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
output_dir = "./results/multimodal/siamese_online/speaker_invariance/n_train_speech={}_vision={}".format(
    speech_online_n_train_episodes, vision_online_n_train_episodes)
for random_seed in random_seeds:
    speech_model_dir = "./models/speech/siamese_online/n_train={}/random_seed={}".format(
        speech_online_n_train_episodes, random_seed)
    vision_model_dir = "./models/vision/siamese_online/n_train={}/random_seed={}".format(
        vision_online_n_train_episodes, random_seed)
    out_dir = os.path.join(output_dir, 'random_seed={}'.format(random_seed))
    test_speaker_invariance(speech_model_dir, vision_model_dir, out_dir, random_seed)

## 2.3. Summaries
<a id='multimodal_summ'></a>

This section summarises the one-shot test results of the multimodal models.

### Dynamic Time Warping (DTW) for Speech + Pixel Matching for Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/dtw_pixels/1_shot/{}".format(dtw_feats_type)
summarise_tests(result_dir)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/dtw_pixels/5_shot/{}".format(dtw_feats_type)
summarise_tests(result_dir)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
result_dir = "./results/multimodal/dtw_pixels/speaker_invariance/{}".format(dtw_feats_type)
summarise_tests(result_dir, speaker_invariance=True)

### Feedforward Neural Network (FFNN) Softmax Classifiers for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/ffnn_softmax/1_shot/batch_size={}".format(ffnn_batch_size)
summarise_tests(result_dir)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/ffnn_softmax/5_shot/batch_size={}".format(ffnn_batch_size)
summarise_tests(result_dir)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
result_dir = "./results/multimodal/ffnn_softmax/speaker_invariance/batch_size={}".format(ffnn_batch_size)
summarise_tests(result_dir, speaker_invariance=True)

### Convolutional Neural Network (CNN) Softmax Classifiers for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/cnn_softmax/1_shot/batch_size={}".format(cnn_batch_size)
summarise_tests(result_dir)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/cnn_softmax/5_shot/batch_size={}".format(cnn_batch_size)
summarise_tests(result_dir)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
result_dir = "./results/multimodal/cnn_softmax/speaker_invariance/batch_size={}".format(cnn_batch_size)
summarise_tests(result_dir, speaker_invariance=True)

### Siamese CNN (offline) Comparators for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/siamese_offline/1_shot/n_train_speech={}_vision={}".format(
    speech_offline_n_train_episodes, vision_offline_n_train_episodes)
summarise_tests(result_dir)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/siamese_offline/5_shot/n_train_speech={}_vision={}".format(
    speech_offline_n_train_episodes, vision_offline_n_train_episodes)
summarise_tests(result_dir)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
result_dir = "./results/multimodal/siamese_offline/speaker_invariance/n_train_speech={}_vision={}".format(
    speech_offline_n_train_episodes, vision_offline_n_train_episodes)
summarise_tests(result_dir, speaker_invariance=True)

### Siamese CNN (online) Comparators for Speech and Images

1. One-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/siamese_online/1_shot/n_train_speech={}_vision={}".format(
    speech_online_n_train_episodes, vision_online_n_train_episodes)
summarise_tests(result_dir)

2. Five-shot 11-way cross-modal speech-image digit matching

In [None]:
result_dir = "./results/multimodal/siamese_online/5_shot/n_train_speech={}_vision={}".format(
    speech_online_n_train_episodes, vision_online_n_train_episodes)
summarise_tests(result_dir)

3. Speaker invariance for one-shot 11-way cross-modal speech-image digit matching in the presence of query speaker distractors

In [None]:
result_dir = "./results/multimodal/siamese_online/speaker_invariance/n_train_speech={}_vision={}".format(
    speech_online_n_train_episodes, vision_online_n_train_episodes)
summarise_tests(result_dir, speaker_invariance=True)