MIntRec2.0 is a large-scale multimodal multi-party benchmark dataset for intent recognition and out-of-scope detection in conversations. We also provide a benchmark framework and evaluation code.
Date | Announcements |
---|---|
1/2024 | 🎆 🎆 The first large-scale multimodal intent dataset has been released. Refer to the directory MIntRec2.0 for the dataset and codes. Read the paper -- MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations (Published in ICLR 2024). |
10/2022 | 🎆 🎆 The first multimodal intent dataset is published. Refer to the directory MIntRec for the dataset and codes. Read the paper -- MIntRec: A New Dataset for Multimodal Intent Recognition (Published in ACM MM 2022). |
MIntRec2.0 has the following features:
- Large in Scale: Compared with the first version of our multimodal intent recognition dataset (MIntRec), MIntRec2.0 increases the data scale from 2.2K to 15K, with 30 intent classes and 9.3K in-scope and 5.7K out-of-scope annotated utterances in text, video, and audio modalities.
- Multi-turn & Multi-party Dialogues: It contains 1,245 dialogues of continuous conversation, with an average of 12 utterances per dialogue. Every utterance in a dialogue has an intent label and an annotated speaker identity, and each dialogue has at least two different speakers.
- Out-of-scope Detection: Because real-world dialogues take place in open-world scenarios, as suggested in TEXTOIR, we further include an OOS tag for utterances that do not belong to any of the existing intent classes. These utterances can be used for out-of-distribution detection and to improve system robustness.
The brief version of the dataset (text, plus video and audio feature files, 7G) can be downloaded from Zenodo.
We provide video feature files, audio feature files, and text annotation files (9G), which can be downloaded from Google Drive.
We also provide raw video data (13G), which can be downloaded from Google Drive.
- Data sources: The raw videos are collected from three TV series: Superstore, The Big Bang Theory, and Friends.
- Dialogue division: We manually divide dialogues based on scenes and episodes.
- Speaker information: We manually annotate 21, 7, and 6 main characters in Superstore, The Big Bang Theory, and Friends, respectively.
- Intent classes
  - Express emotions or attitudes (16): doubt, acknowledge, refuse, warn, emphasize, complain, praise, apologize, thank, criticize, care, agree, oppose, taunt, flaunt, joke
  - Achieve goals (14): ask for opinions, confirm, explain, invite, plan, inform, advise, arrange, introduce, comfort, leave, prevent, greet, ask for help
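
For code that consumes the annotations, the two-level taxonomy above can be written as a plain mapping. This is only a minimal sketch: the names `INTENT_TAXONOMY`, `OOS_LABEL`, `FINE_TO_COARSE`, and `coarse_of` are illustrative and not taken from the repository, and the label string used for out-of-scope utterances in the released files may differ.

```python
# Illustrative mapping of the 2 coarse-grained classes to the 30 fine-grained
# intents listed above. All identifiers here are hypothetical, not repo APIs.
INTENT_TAXONOMY = {
    "Express emotions or attitudes": [
        "doubt", "acknowledge", "refuse", "warn", "emphasize", "complain",
        "praise", "apologize", "thank", "criticize", "care", "agree",
        "oppose", "taunt", "flaunt", "joke",
    ],
    "Achieve goals": [
        "ask for opinions", "confirm", "explain", "invite", "plan", "inform",
        "advise", "arrange", "introduce", "comfort", "leave", "prevent",
        "greet", "ask for help",
    ],
}

OOS_LABEL = "OOS"  # assumed tag name for out-of-scope utterances

# Reverse lookup: fine-grained intent -> coarse-grained class.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in INTENT_TAXONOMY.items()
    for fine in fines
}


def coarse_of(label):
    """Return the coarse class of a fine-grained label, or OOS_LABEL if unknown."""
    return FINE_TO_COARSE.get(label, OOS_LABEL)
```

A lookup such as `coarse_of("thank")` returns the coarse class name, while any label outside the 30 known intents falls back to the out-of-scope tag.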
Item | Statistics |
---|---|
Number of coarse-grained intents | 2 |
Number of fine-grained intents | 30 |
Number of dialogues | 1,245 |
Number of utterances | 15,040 |
Number of words in utterances | 118,477 |
Number of unique words in utterances | 9,524 |
Average length of utterances | 7.0 |
Maximum length of utterances | 46 |
Average video clip duration | 3.0 (s) |
Maximum video clip duration | 19.9 (s) |
Video hours | 12.3 (h) |
We present a framework to benchmark multimodal intent understanding and out-of-scope detection in both single-turn and multi-turn conversational scenarios.
The framework contains five main modules:
- Data Organization: Single-turn dialogues use utterance-level samples as inputs. Multi-turn dialogues are arranged chronologically, following the order in which the speakers take their turns.
- Multimodal Feature Extraction: Features are extracted from the text, video, and audio modalities. For multi-turn dialogues, we concatenate the context information with the current utterance and separate them with a special token.
- Multimodal Fusion: Multimodal fusion methods (e.g., MAG-BERT, MulT) can be used to fuse the different modalities.
- Training: In-scope data uses a cross-entropy loss, and out-of-scope data uses an outlier exposure loss. The objective may also include a multimodal fusion loss for capturing cross-modal interactions.
- Inference: An open set recognition method (e.g., DOC) can be used to identify K known classes and detect one out-of-scope class.
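
The modules above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names, the `[SEP]` separator, and the fixed 0.5 thresholds are assumptions made for illustration. The out-of-scope loss uses the common outlier exposure form (cross-entropy against a uniform distribution over the K known classes), and the decision rule follows the usual DOC-style idea of rejecting a sample when every per-class sigmoid score falls below its threshold.

```python
import math

SEP = "[SEP]"  # assumed separator token; the framework's actual token may differ


def build_context_input(context_utterances, current_utterance, sep=SEP):
    """Join earlier turns (in chronological order) and the current utterance."""
    return f" {sep} ".join(list(context_utterances) + [current_utterance])


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, label):
    """Loss for an in-scope sample whose class index is known."""
    return -log_softmax(logits)[label]


def outlier_exposure_loss(logits):
    """Loss for an out-of-scope sample: cross-entropy to a uniform target."""
    return -sum(log_softmax(logits)) / len(logits)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def doc_predict(logits, thresholds, oos_index=-1):
    """DOC-style decision: reject when all per-class scores are below threshold."""
    probs = [sigmoid(x) for x in logits]
    if all(p < t for p, t in zip(probs, thresholds)):
        return oos_index  # no known class is confident enough
    return max(range(len(probs)), key=probs.__getitem__)
```

For example, `doc_predict([-2.0, -3.0, -1.0], [0.5, 0.5, 0.5])` rejects the sample as out-of-scope because no class score reaches 0.5, while raising any one logit above zero makes that class the prediction. In practice the thresholds would be tuned or fitted per class rather than fixed.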
- Use Anaconda to create a Python environment:

      conda create --name MIntRec python=3.9
      conda activate MIntRec

- Install PyTorch (CUDA version 11.2):

      conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

- Clone the MIntRec2.0 repository:

      git clone git@github.com:thuiar/MIntRec2.0.git
      cd MIntRec2.0

- Install the required dependencies:

      pip install -r requirements.txt

- Run an example (MAG-BERT is used here; more can be found in the repository):

      sh examples/run_mag_bert_baselines.sh
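
Before launching a training script, it can help to confirm that the packages installed in the steps above are visible to the active interpreter. The helper below is a hypothetical convenience, not part of the repository; it only uses the standard library and reports which of the named packages cannot be found.

```python
import importlib.util
import sys


def missing_packages(required=("torch", "torchvision", "torchaudio")):
    """Return the names in `required` that the current interpreter cannot import."""
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The quick start creates the environment with Python 3.9.
    print("Python version:", ".".join(map(str, sys.version_info[:3])))
    print("Missing packages:", missing_packages() or "none")
```

An empty result means all three packages were found; anything listed points to a step that should be repeated inside the activated conda environment.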
If this work is helpful, or you want to use the code and results in this repo, please cite the following papers:
- MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations
- MIntRec: A New Dataset for Multimodal Intent Recognition
@inproceedings{
zhang2024mintrec,
title={{MI}ntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations},
author={Hanlei Zhang and Xin Wang and Hua Xu and Qianrui Zhou and Kai Gao and Jianhua Su and Jinyue Zhao and Wenrui Li and Yanting Chen},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=nY9nITZQjc}
}
@inproceedings{MIntRec,
author = {Zhang, Hanlei and Xu, Hua and Wang, Xin and Zhou, Qianrui and Zhao, Shaojie and Teng, Jiayan},
title = {MIntRec: A New Dataset for Multimodal Intent Recognition},
year = {2022},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
pages = {1688--1697},
}
The dataset and the camera-ready version of the paper will be updated soon.