What Do Questions Exactly Ask? MFAE: Duplicate Question Identification with Multi-Fusion Asking Emphasis

Paper accepted at SDM 2020

Description

This repository contains the source code for the paper "What Do Questions Exactly Ask? MFAE: Duplicate Question Identification with Multi-Fusion Asking Emphasis", accepted at the SIAM International Conference on Data Mining (SDM20). Please cite our paper if you use this code! 😍 The paper can be downloaded here.

@inproceedings{zhang2020questions,
  title={What Do Questions Exactly Ask? MFAE: Duplicate Question Identification with Multi-Fusion Asking Emphasis},
  author={Zhang, Rong and Zhou, Qifei and Wu, Bo and Li, Weiping and Mo, Tong},
  booktitle={Proceedings of the 2020 SIAM International Conference on Data Mining},
  pages={226--234},
  year={2020},
  organization={SIAM}
}

Model Overview

MFAE encodes each question pair with BERT or ELMo contextual embeddings and applies multi-fusion asking emphasis to capture what the questions are actually asking; see the paper for the full architecture diagram.

Requirements

Python 3

pip install -r requirements.txt

Datasets

The code supports four datasets for duplicate sentence identification.

Duplicate Question Identification Datasets (DQI)

  • Quora Question Pairs
  • CQADupStack

Natural Language Inference Datasets (NLI)

  • SNLI
  • MultiNLI
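
For reference, the original Quora release ships as a tab-separated file (quora_duplicate_questions.tsv) with columns id, qid1, qid2, question1, question2, and is_duplicate. A minimal loading sketch (the file path is an assumption; point it at wherever you downloaded the data):

import pandas as pd

# Path is an assumption; adjust it to your download location.
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
print(df[['question1', 'question2', 'is_duplicate']].head())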

Data Preprocessing

After the datasets have been downloaded, you can preprocess the data.

Preprocess the data for BERT

cd scripts/preprocessing
python process_quora_bert.py
python preprocess_cqadup_bert.py
python preprocess_snli_bert.py
python process_mnli_bert.py
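
These scripts convert the raw sentence pairs into BERT-ready inputs. As a rough illustration of BERT-style WordPiece tokenization (this sketch uses the Hugging Face transformers package, which is not necessarily what the repository's scripts use):

from transformers import BertTokenizer

# Lowercasing tokenizer matching BERT-Base uncased.
tok = BertTokenizer.from_pretrained('bert-base-uncased')
print(tok.tokenize('What do questions exactly ask?'))
# ['what', 'do', 'questions', 'exactly', 'ask', '?']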

Preprocess the data for ELMo

cd scripts/preprocessing
python process_quora.py
python preprocess_snli.py
python preprocess_mnli.py

Training

bert-as-service

If you want to train models with BERT word embeddings, first set up bert-as-service; a minimal usage sketch follows, and then you can run the training scripts below.
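
A minimal sketch of querying the bert-as-service server from Python (the model path and worker count below are assumptions; adjust them to your setup):

# In a separate shell, start the server first, e.g.:
#   bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker=2
from bert_serving.client import BertClient

bc = BertClient()  # connects to localhost:5555/5556 by default
vecs = bc.encode(['How do I learn Python?',
                  'What is the best way to learn Python?'])
print(vecs.shape)  # (2, 768) for BERT-Base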

Train all models

sh -x run.sh

Train with BERT

python bert_quora.py >> log/quora/quora_bert.log
python bert_cqadup.py >> log/cqadup/cqadup_bert.log
python bert_snli.py >> log/snli/snli_bert.log
python bert_mnli.py >> log/mnli/mnli_bert.log
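
Note that these commands append to files under log/; create the log/quora, log/cqadup, log/snli, and log/mnli directories before training if they do not already exist.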

Train with ELMo

python train_quora_elmo.py >> log/quora/quora_elmo.log
python train_snli_elmo.py >> log/snli/snli_elmo.log
python train_mnli_elmo.py >> log/mnli/mnli_elmo.log
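
If you want to inspect ELMo embeddings directly, here is a generic sketch using allennlp's ElmoEmbedder (this assumes allennlp 0.x; the repository's own ELMo loading may differ):

from allennlp.commands.elmo import ElmoEmbedder

# Downloads the default pretrained ELMo weights on first use.
elmo = ElmoEmbedder()
vectors = elmo.embed_sentence(['How', 'do', 'I', 'learn', 'Python', '?'])
print(vectors.shape)  # (3, 6, 1024): three ELMo layers, six tokens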

Testing

After the models have been trained, you can evaluate them as follows.

Test the models with the BERT backbone

python test_bert_quora.py
python test_bert_cqadup.py
python test_bert_snli.py
python test_bert_mnli.py

Test the models with the ELMo backbone

python test_elmo_quora.py
python test_elmo_snli.py
python test_elmo_mnli.py
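
Each test script reports the model's performance on the corresponding test set. If you want to recompute metrics from saved predictions, a minimal sketch (the file names are assumptions; substitute whatever your run produces):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.loadtxt('labels.txt', dtype=int)       # gold labels, one per line
y_pred = np.loadtxt('predictions.txt', dtype=int)  # model predictions
print('accuracy:', accuracy_score(y_true, y_pred))
# Macro-averaged F1 works for both the binary DQI and 3-class NLI tasks.
print('macro F1:', f1_score(y_true, y_pred, average='macro'))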

Report Issues

Please let us know if you encounter any problems.

The contact email is rzhangpku@pku.edu.cn.