[ACL'19] [PyTorch] Multimodal Transformer

Python 3.6

Multimodal Transformer for Unaligned Multimodal Language Sequences

PyTorch implementation for learning Multimodal Transformer for unaligned multimodal language sequences.

Correspondence to:


Multimodal Transformer for Unaligned Multimodal Language Sequences
Yao-Hung Hubert Tsai *, Shaojie Bai *, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov
Association for Computational Linguistics (ACL), 2019. (*equal contribution)

Please cite our paper if you find our work useful for your research:

  title={Multimodal Transformer for Unaligned Multimodal Language Sequences},
  author={Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J. Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = {7},
  address = {Florence, Italy},
  publisher = {Association for Computational Linguistics},


Overall Architecture for Multimodal Transformer

Multimodal Transformer (MulT) merges multimodal time-series via a feed-forward fusion process from multiple directional pairwise crossmodal transformers. Specifically, each crossmodal transformer serves to repeatedly reinforce a target modality with the low-level features from another source modality by learning the attention across the two modalities' features. A MulT architecture hence models all pairs of modalities with such crossmodal transformers, followed by sequence models (e.g., a self-attention transformer) that predict using the fused features.
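To make "all pairs of modalities" concrete, the sketch below enumerates the directed (source, target) pairs that would each get a crossmodal transformer, and groups the sources by the target they reinforce. The modality names and both helper functions are illustrative assumptions, not identifiers from this repository.

```python
from itertools import permutations


def crossmodal_pairs(modalities):
    """Return every directed (source, target) pair of distinct modalities."""
    return list(permutations(modalities, 2))


def sources_per_target(modalities):
    """Group the source modalities that reinforce each target modality."""
    grouped = {name: [] for name in modalities}
    for source, target in crossmodal_pairs(modalities):
        grouped[target].append(source)
    return grouped


# Assumed modality names, chosen for illustration only.
modalities = ["language", "vision", "audio"]
pairs = crossmodal_pairs(modalities)
groups = sources_per_target(modalities)

for target, sources in groups.items():
    print(f"{target} is reinforced by: {', '.join(sources)}")
```

With three modalities this gives six directed pairs, so each target modality receives features from the two other modalities before fusion.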

Crossmodal Attention for Two Sequences from Distinct Modalities

The core of our proposed model is the crossmodal transformer and its crossmodal attention module.
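As a rough illustration of crossmodal attention, the following sketch computes single-head scaled dot-product attention where the queries come from the target modality and the keys and values come from the source modality. It is written in dependency-free Python so it runs anywhere; it is not the repository's PyTorch implementation, and the function names, weight matrices, and toy inputs are all assumptions made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source, w_q, w_k, w_v):
    """Attend from each target time step to every source time step.

    target is T_tgt x d_tgt and source is T_src x d_src; the two
    sequences may differ in length and in feature size.
    """
    queries = matmul(target, w_q)  # T_tgt x d_k
    keys = matmul(source, w_k)     # T_src x d_k
    values = matmul(source, w_v)   # T_src x d_v
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)  # T_tgt x T_src
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values), weights


# Toy inputs (assumed): 2 target steps with 2 features, 3 source steps with 3 features.
target = [[1.0, 0.0], [0.0, 1.0]]
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]              # d_tgt x d_k
w_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d_src x d_k
w_v = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]  # d_src x d_v

output, weights = crossmodal_attention(target, source, w_q, w_k, w_v)
print(len(output), "target steps, each attending over", len(weights[0]), "source steps")
```

The output keeps the target sequence length while mixing in source content, so the two streams do not need to be aligned step by step beforehand.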




Data files (containing processed MOSI, MOSEI and IEMOCAP datasets) can be downloaded from here.

To retrieve the meta information and the raw data, please refer to the SDK for these datasets.

Run the Code

  1. Create (empty) folders for data and pre-trained models:
mkdir data pre_trained_models

and put the downloaded data in 'data/'.

  2. Run the following command:
python [--FLAGS]

Note that the default arguments are for the unaligned version of MOSEI. For other datasets, please refer to the Supplementary.

If Using CTC

The Transformer requires no CTC module. However, as we describe in the paper, a CTC module offers an alternative way to apply other kinds of sequence models (e.g., recurrent architectures) to unaligned multimodal streams.

If you want to use the CTC module, please install warp-ctc from here.

The quick version:

git clone
cd warp-ctc
mkdir build; cd build
cmake ..
make
cd ../pytorch_binding
python install
export WARP_CTC_PATH=/home/xxx/warp-ctc/build


Some portions of the code were adapted from the fairseq repo.
