Skip to content
Attention-based multimodal fusion for sentiment analysis
Python
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
dataset Update README.md: Added dataset details. Sep 5, 2018
LICENSE Initial commit Jul 6, 2018
README.md Update README.md Mar 8, 2019
atlstm3.jpg Add files via upload Jul 6, 2018
atlstm3.pdf Add files via upload Jul 6, 2018
create_data.py
data_prep.py
iemocap-data.zip
model.py Added more datasets Aug 15, 2018
network.jpg Add files via upload Jul 6, 2018
run.py Update code. Sep 5, 2018
unimodal_iemocap_4way.pickle Add files via upload Aug 15, 2018
unimodal_iemocap_6way.pickle Add files via upload Aug 15, 2018
unimodal_mosei_3way.pickle.zip Added features. Aug 15, 2018
unimodal_mosi_2way.pickle

README.md

Attention-based multimodal fusion for sentiment analysis

Attention-based multimodal fusion for sentiment analysis

Code for the paper

Context-Dependent Sentiment Analysis in User-Generated Videos (ACL 2017).

Multi-level Multiple Attentions for Contextual Multimodal Sentiment Analysis(ICDM 2017).

Alt text

Preprocessing

Edit: the create_data.py is obsolete. The pre-processed datasets have already been provided in the dataset/ folder in the repo. Use them directly.

As data is typically present in utterance format, we combine all the utterances belonging to a video using the following code

python create_data.py

Note: This will create speaker independent train and test splits In dataset/mosei, extract the zip into a folder named 'raw'. Also, extract 'unimodal_mosei_3way.pickle.zip'

Running the model

Sample command:

With attention-based fusion:

python run.py --unimodal True --fusion True
python run.py --unimodal False --fusion True

Without attention-based and with concatenation-based fusion:

python run.py --unimodal True --fusion False
python run.py --unimodal False --fusion False

Utterance level attention:

python run.py --unimodal False --fusion True --attention_2 True
python run.py --unimodal False --fusion True --attention_2 True

Note:

  1. Keeping the unimodal flag as True (default False) shall train all unimodal lstms first (level 1 of the network mentioned in the paper)
  2. Setting --fusion True applies only to multimodal network.

Datasets:

We provide results on the MOSI, MOSEI and IEMOCAP datasets.
Please cite the creators.

We are adding more datasets, stay tuned.

Use --data [mosi|mosei|iemocap] and --classes [2|3|6] in the above commands to test different configurations on different datasets.

mosi: 2 classes
mosei: 3 classes
iemocap: 6 classes

Example:

python run.py --unimodal False --fusion True --attention_2 True --data mosei --classes 3

Dataset details

MOSI:

2 classes: Positive/Negative
Raw Features: (Pickle files)
Audio: dataset/mosi/raw/audio_2way.pickle
Text: dataset/mosi/raw/text_2way.pickle
Video: dataset/mosi/raw/video_2way.pickle

Each file contains:
train_data, train_label, test_data, test_label, maxlen, train_length, test_length

train_data - np.array of dim (62, 63, feature_dim)
train_label - np.array of dim (62, 63, 2)
test_data - np.array of dim (31, 63, feature_dim)
test_label - np.array of dim (31, 63, 2)
maxlen - max utterance length int of value 63
train_length - utterance length of each video in train data.
test_length - utterance length of each video in test data.

Train/Test split: 62/31 videos. Each video has utterances. The videos are padded to 63 utterances.

IEMOCAP:

6 classes: happy/sad/neutral/angry/excited/frustrated
Raw Features: dataset/iemocap/raw/IEMOCAP_features_raw.pkl (Pickle files)
The file contains:
videoIDs[vid] = List of utterance IDs in this video in the order of occurance
videoSpeakers[vid] = List of speaker turns. e.g. [M, M, F, M, F]. here M = Male, F = Female
videoText[vid] = List of textual features for each utterance in video vid.
videoAudio[vid] = List of audio features for each utterance in video vid.
videoVisual[vid] = List of visual features for each utterance in video vid.
videoLabels[vid] = List of label indices for each utterance in video vid.
videoSentence[vid] = List of sentences for each utterance in video vid.
trainVid = List of videos (videos IDs) in train set.
testVid = List of videos (videos IDs) in test set.

Refer to the file dataset/iemocap/raw/loadIEMOCAP.py for more information. We use this data to create a speaker independent train and test splits in the format. (videos x utterances x features)

Train/Test split: 120/31 videos. Each video has utterances. The videos are padded to 110 utterances.

MOSEI:

3 classes: positive/negative/neutral
Raw Features: (Pickle files)
Audio: dataset/mosei/raw/audio_3way.pickle
Text: dataset/mosei/raw/text_3way.pickle
Video: dataset/mosei/raw/video_3way.pickle

The file contains: train_data, train_label, test_data, test_label, maxlen, train_length, test_length

train_data - np.array of dim (2250, 98, feature_dim)
train_label - np.array of dim (62, 63, 2)
test_data - np.array of dim (31, 63, feature_dim)
test_label - np.array of dim (31, 63, 2)
maxlen - max utterance length int of value 98
train_length - utterance length of each video in train data.
test_length - utterance length of each video in test data.

Train/Test split: 2250/678 videos. Each video has utterances. The videos are padded to 98 utterances.

Citation

If using this code, please cite our work using :

@inproceedings{soujanyaacl17,
  title={Context-dependent sentiment analysis in user-generated videos},
  author={Poria, Soujanya  and Cambria, Erik and Hazarika, Devamanyu and Mazumder, Navonil and Zadeh, Amir and Morency, Louis-Philippe},
  booktitle={Association for Computational Linguistics},
  year={2017}
}

@inproceedings{poriaicdm17, 
author={S. Poria and E. Cambria and D. Hazarika and N. Mazumder and A. Zadeh and L. P. Morency}, 
booktitle={2017 IEEE International Conference on Data Mining (ICDM)}, 
title={Multi-level Multiple Attentions for Contextual Multimodal Sentiment Analysis}, 
year={2017},  
pages={1033-1038}, 
keywords={data mining;feature extraction;image classification;image fusion;learning (artificial intelligence);sentiment analysis;attention-based networks;context learning;contextual information;contextual multimodal sentiment;dynamic feature fusion;multilevel multiple attentions;multimodal sentiment analysis;recurrent model;utterances;videos;Context modeling;Feature extraction;Fuses;Sentiment analysis;Social network services;Videos;Visualization}, 
doi={10.1109/ICDM.2017.134}, 
month={Nov},}

Credits

Soujanya Poria

Gangeshwar Krishnamurthy (gangeshwark@gmail.com; Github: @gangeshwark)

You can’t perform that action at this time.