
A Low-Rank Matching Attention Based Cross-Modal Feature Fusion Method for Conversational Emotion Recognition



Abstract: Conversational emotion recognition (CER) is an important research topic in human-computer interactions. Although recent advancements in transformer-based cross-modal fusion methods have shown promise in CER tasks, they tend to overlook the crucial intra-modal and inter-modal emotional interaction or suffer from high computational complexity. To address this, we introduce a novel and lightweight cross-modal feature fusion method called Low-Rank Matching Attention Method (LMAM). LMAM effectively captures contextual emotional semantic information in conversations while mitigating the quadratic complexity issue caused by the self-attention mechanism. Specifically, by setting a matching weight and calculating inter-modal features attention scores row by row, LMAM requires only one-third of the parameters of self-attention methods. We also employ the low-rank decomposition method on the weights to further reduce the number of parameters in LMAM. As a result, LMAM offers a lightweight model while avoiding overfitting problems caused by a large number of parameters. Moreover, LMAM is able to fully exploit the intra-modal emotional contextual information within each modality and integrates complementary emotional semantic information across modalities by computing and fusing similarities of intra-modal and inter-modal features simultaneously. Experimental results verify the superiority of LMAM compared with other popular cross-modal fusion methods on the premise of being more lightweight. Also, LMAM can be embedded into any existing state-of-the-art CER methods in a plug-and-play manner, and can be applied to other multi-modal recognition tasks, e.g., session recommendation and humour detection, demonstrating its remarkable generalization ability.
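To make the idea concrete, below is a rough, hypothetical PyTorch sketch in the spirit of the abstract: a single matching weight, factored into two low-rank matrices, scores one modality's features against another's row by row instead of using separate query/key/value projections. This is an illustration only (the dimensions, rank, residual fusion, and initialization are assumptions), not the paper's exact formulation; see the released code for the real implementation.

import torch
import torch.nn as nn

class LowRankMatchingAttentionSketch(nn.Module):
    """Illustrative only: one matching weight W = A @ B (low rank) replaces the
    separate Q/K/V projections of vanilla self-attention, so far fewer
    projection parameters are needed."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.02)  # low-rank factor (dim x rank)
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)  # low-rank factor (rank x dim)

    def forward(self, x_a, x_b):
        # x_a, x_b: (batch, seq_len, dim) utterance features from two modalities
        w = self.A @ self.B                        # (dim, dim) matching weight
        scores = (x_a @ w) @ x_b.transpose(1, 2)   # row-wise matching scores (batch, len_a, len_b)
        attn = scores.softmax(dim=-1)
        fused = attn @ x_b                         # pull modality-b context into modality-a
        return fused + x_a                         # residual fusion keeps intra-modal information

# toy usage: fuse text and audio utterance features
text = torch.randn(2, 20, 256)
audio = torch.randn(2, 20, 256)
fused = LowRankMatchingAttentionSketch(dim=256, rank=8)(text, audio)
print(fused.shape)  # torch.Size([2, 20, 256])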



This is the official implementation of 'A Low-Rank Matching Attention Based Cross-Modal Feature Fusion Method for Conversational Emotion Recognition' πŸ”₯. For any problems, please contact shouyuntao@stu.xjtu.edu.cn. Pointers to other interesting papers or code are welcome. If you find this repository useful for your research or work, starring it would be greatly appreciated ❀️.

Paper

A Low-Rank Matching Attention Based Cross-Modal Feature Fusion Method for Conversational Emotion Recognition
Shou, Yuntao and Liu, Huan and Cao, Xiangyong and Meng, Deyu and Dong, Bo
IEEE Transactions on Affective Computing, 2024

Please cite our paper if you find our work useful for your research:

@ARTICLE{10753050,
  author={Shou, Yuntao and Liu, Huan and Cao, Xiangyong and Meng, Deyu and Dong, Bo},
  journal={IEEE Transactions on Affective Computing}, 
  title={A Low-Rank Matching Attention Based Cross-Modal Feature Fusion Method for Conversational Emotion Recognition}, 
  year={2024},
  volume={},
  number={},
  pages={1-13},
  doi={10.1109/TAFFC.2024.3498443}}

Usage (Taking IEMOCAP-Six as an Example)

πŸš€ Prerequisites

  • Python 3.8
  • CUDA 10.2
  • PyTorch == 1.8.0
  • torchvision == 0.9.0
  • torch_geometric == 2.0.1
  • fairseq == 0.10.1
  • transformers == 4.5.1
  • pandas == 1.2.5

(see requirements.txt for more details)
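As an optional sanity check after installation, the snippet below simply prints the installed versions so they can be compared against the list above; it is not part of the repository.

import torch
import torchvision
import torch_geometric
import transformers

print('torch:', torch.__version__)                       # expected 1.8.0
print('torchvision:', torchvision.__version__)           # expected 0.9.0
print('torch_geometric:', torch_geometric.__version__)   # expected 2.0.1
print('transformers:', transformers.__version__)         # expected 4.5.1
print('CUDA available:', torch.cuda.is_available())      # should be True with a CUDA 10.2 build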

Pretrained models

## for lexical feature extraction (a loading sketch follows after this block)
https://huggingface.co/microsoft/deberta-large/tree/main  -> ../tools/transformers/deberta-large

## for acoustic feature extraction
https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt  -> ../tools/wav2vec

## for face extractor (OpenFace-win)
https://drive.google.com/file/d/1-O8epcTDYCrRUU_mtXgjrS3OWA4HTp0-/view?usp=share_link  -> ./OpenFace_2.2.0_win_x64

## for visual feature extraction
https://drive.google.com/file/d/1wT2h5sz22SaEL4YTBwTIB3WoL4HUvg5B/view?usp=share_link ->  ../tools/manet

## using ffmpeg for sub-video extraction
https://ffmpeg.org/download.html#build-linux ->  ../tools/ffmpeg-4.4.1-i686-static
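For orientation, here is a minimal, hypothetical sketch of how the deberta-large checkpoint downloaded above could produce an utterance-level lexical feature with Hugging Face transformers. The actual extraction is performed by feature_extraction/text/extract_text_embedding_LZ.py; the mean pooling and the example utterance below are assumptions for illustration only.

import torch
from transformers import AutoTokenizer, AutoModel

model_dir = '../tools/transformers/deberta-large'    # local path from the table above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)
model.eval()

utterance = "I can't believe you did that!"
inputs = tokenizer(utterance, return_tensors='pt')
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, 1024)
feature = hidden.mean(dim=1)                         # utterance-level lexical feature, (1, 1024)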

πŸš€ Datasets

# download IEMOCAP dataset and put it into ../emotion-data/IEMOCAP
https://sail.usc.edu/iemocap/iemocap_release.htm   ->   ../emotion-data/IEMOCAP

# whole video -> subvideo
python preprocess.py split_video_by_start_end_IEMOCAP

# subvideo -> detect face
python detect.py --model='face_detection_yunet_2021sep.onnx' --videofolder='dataset/IEMOCAP/subvideo' --save='dataset/IEMOCAP/subvideofaces' --dataset='IEMOCAP'

# extract visual features
cd feature_extraction/visual
python extract_manet_embedding.py --dataset='IEMOCAPFour' --gpu=0
python preprocess.py feature_compressed_iemocap dataset/IEMOCAP/features/manet dataset/IEMOCAP/features/manet_UTT

# extract acoustic features (a wav2vec sketch follows after this block)
python preprocess.py split_audio_from_video_16k 'dataset/IEMOCAP/subvideo' 'dataset/IEMOCAP/subaudio'
cd feature_extraction/audio
python extract_wav2vec_embedding.py --dataset='IEMOCAPFour' --feature_level='UTTERANCE' --gpu=0

# extract textual features
python preprocess.py generate_transcription_files_IEMOCAP
cd feature_extraction/text
python extract_text_embedding_LZ.py --dataset='IEMOCAPFour' --feature_level='UTTERANCE' --model_name='deberta-large' --gpu=0

###################################################################
# We also provide pre-extracted multimodal features
IEMOCAP:  https://drive.google.com/file/d/1Hn82-ZD0CNqXQtImd982YHHi-3gIX2G3/view?usp=share_link  -> ./dataset/IEMOCAP/features
CMUMOSI:  https://drive.google.com/file/d/1aJxArYfZsA-uLC0sOwIkjl_0ZWxiyPxj/view?usp=share_link  -> ./dataset/CMUMOSI/features
CMUMOSEI: https://drive.google.com/file/d/1L6oDbtpFW2C4MwL5TQsEflY1WHjtv7L5/view?usp=share_link  -> ./dataset/CMUMOSEI/features
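For reference, the snippet below is a minimal sketch of pulling wav2vec features with fairseq's public checkpoint-loading API, roughly what the acoustic step above automates via extract_wav2vec_embedding.py; the dummy waveform and mean pooling are assumptions, not the repository's exact procedure.

import torch
import fairseq

cp_path = '../tools/wav2vec/wav2vec_large.pt'    # checkpoint downloaded above
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
model = models[0]
model.eval()

wav_input_16khz = torch.randn(1, 16000)           # 1 second of dummy 16 kHz audio
with torch.no_grad():
    z = model.feature_extractor(wav_input_16khz)  # local latent features
    c = model.feature_aggregator(z)               # context features, (1, 512, frames)
utt_feature = c.mean(dim=-1)                      # utterance-level acoustic feature, (1, 512)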

CMUMOSI dataset download

πŸš€ Installation

The first step is to download the SDK:

git clone https://github.com/CMU-MultiComp-Lab/CMU-MultimodalSDK.git

Next, install the SDK in your Python environment.

cd CMU-MultimodalSDK
pip install .
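Once the SDK is installed, CMU-MOSI can also be fetched programmatically; the snippet below is a minimal sketch using the SDK's documented dataset recipes (the destination folder name is an arbitrary choice, not one required by this repository).

from mmsdk import mmdatasdk as md

# download the high-level (pre-extracted) CMU-MOSI feature recipe
dataset = md.mmdataset(md.cmu_mosi.highlevel, 'cmumosi/')
# append the opinion/sentiment labels to the same dataset folder
dataset.add_computational_sequences(md.cmu_mosi.labels, 'cmumosi/')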

Training

python train.py
