silverriver/OOD4NLU

OOD4NLU

This repository contains the code for the paper "Out-of-domain detection for natural language understanding in dialog systems".

The POG folder contains the code for the pseudo OOD sample generation (POG) model. The CNN_KL folder contains the code for a CNN-based text classifier with KL regularization.

All code was developed and tested with TensorFlow 1.12.0 and Python 3.6.5.

Usage of POG

  1. Install the dependencies:
pip install -r requirements.txt
  1. Make a new project folder (for example, project). Copy the config.json file to this folder, and make a new data folder.
mkdir project
cp ${code-folder-of-POG}/config.json project/
cd project
mkdir data
  1. Put the following files in the data folder: ind_train, ind_dev, ood_dev, ind_test, ood_test. Each file is a tab-separated file with two columns. The first column is the intent name, and the second column is the sentence. We have prepared the CLINC150 dataset for you in POG/data. Note that the POG model does not need to use OOD data for training.
translate	in spanish, meet me tomorrow is said how
translate	in french, how do i say, see you later
translate	how do you say hello in japanese
  1. Change the project/config.json file to specify which data file to use.
  • train_file: the training file of IND data, which is ind_train
  • ind_valid_file: the validation file of IND data, which is ind_dev
  • ood_valid_file: the validation file of OOD data, which is ood_dev
  • ind_test_file: the test file of IND data, which is ind_test
  • ood_test_file: the test file of OOD data, which is ood_test

Note that our code will automatically look into the data folder for these files.

You can also specify some important hyperparameters:

  • word_vocab_size: maximum number of words in the vocabulary (important!)
  • pretrained_embed: whether to use the pretrained embedding (You can use the GloVe embedding for example)
  • max_decode_len: the maximum length of the decoded sentence
  • max_epoch: the maximum number of epochs
  • max_utter_len: the maximum length of the input utterance (used in the preprocessing step)
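
As a rough illustration of steps 3 and 4, the sketch below sets the data-file fields listed above in a config file and checks that every data file has exactly two tab-separated columns. The helpers `check_tsv` and `update_config` are hypothetical and are not part of this repository; the real config.json may contain other fields, which the sketch leaves untouched.

```python
import json
from pathlib import Path

# Field names as listed above; the values are file names inside the data folder.
DATA_FIELDS = {
    "train_file": "ind_train",
    "ind_valid_file": "ind_dev",
    "ood_valid_file": "ood_dev",
    "ind_test_file": "ind_test",
    "ood_test_file": "ood_test",
}


def check_tsv(path):
    """Return the number of rows; raise if a row is not `intent<TAB>sentence`."""
    rows = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerating blank lines is an assumption of this sketch
            parts = line.split("\t")
            if len(parts) != 2:
                raise ValueError(
                    f"{path}:{lineno}: expected 2 columns, got {len(parts)}"
                )
            rows += 1
    return rows


def update_config(config_path, data_dir):
    """Set the data-file fields in config.json and validate the files they name."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config.update(DATA_FIELDS)
    for name in DATA_FIELDS.values():
        check_tsv(Path(data_dir) / name)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, `update_config("project/config.json", "project/data")` would fail early on a malformed row instead of during training.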
  1. Change to the code folder of our POG implementation, and use the following command to train the POG model:
cd ${code-folder-of-POG}
python main.py --config ${path-to-project/config.json} --gpu {gpu}
  1. After the POG model is trained, use the following command to sample utterances from the trained model:
python generate.py --config ${path-to-project/config.json} --gpu {gpu} --outfile {outfile} --count {50000} --is_sample True --sample_t 1.0
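
The `--sample_t` flag in the command above presumably sets a sampling temperature. That reading is an assumption based on the flag name; the README does not state it. Under that assumption, the sketch below shows how a temperature rescales logits before a token is drawn. The functions `softmax_with_temperature` and `sample_token` are hypothetical illustrations, not the code in generate.py.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a temperature of 1.0 the model's distribution is sampled unchanged; values below 1.0 favor high-probability tokens, and values above 1.0 produce more varied utterances.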

Usage of CNN_KL

To train the CNN text classifier, you need to prepare a set of OOD data. For example, you can use the POG model above to generate pseudo OOD data, or sample OOD data from any text corpus. Once you have the OOD data, follow the steps below to train the CNN text classifier with KL regularization.

  1. Install the dependencies:
pip install -r requirements.txt
  1. Make a new project folder (for example, project). Copy the config.json file to this folder, and make a new data folder.
mkdir project
cp config.json project/
cd project
mkdir data
  1. Put the following files in the data folder: ind_train, ind_dev, ood_dev, ind_test, ood_test, fake_ood. Each file is a tab-separated file with two columns. The first column is the intent name, and the second column is the sentence. We have prepared the CLINC150 dataset for you in CNN_KL/data, along with a set of pseudo OOD data (i.e., fake_ood) generated using the POG model. The classifier will use these pseudo OOD samples to calculate the KL regularization loss.
translate	in spanish, meet me tomorrow is said how
translate	in french, how do i say, see you later
translate	how do you say hello in japanese
  1. Change the project/config.json file to specify which data file to use.
  • ind_train_data: the training file of IND data, which is data/ind_train
  • ood_train_data: the training file of OOD data, which is data/fake_ood
  • ind_valid_data: the validation file of IND data, which is data/ind_dev
  • ood_valid_data: the validation file of OOD data, which is data/ood_dev
  • ind_test_data: the test file of IND data, which is data/ind_test
  • ood_test_data: the test file of OOD data, which is data/ood_test

Note that you need to retain the data prefix in the path for each data file.

Important hyperparameters:

  • pretrained_embed: whether to use the pretrained embedding
  • max_decode_len: the maximum length of the decoded sentence
  • max_epoch: the maximum number of epochs
  • max_utter_len: the maximum length of the utterance (used in the preprocessing step)
  1. Change to the code folder of our CNN_KL implementation, and use the following command to train the text classifier with different random seeds:
cd ${code-folder-of-CNN_KL}
python multi-seed.py --config ${path-to-project/config.json} --gpu {gpu} --shuffle_data True --seeds 10,20,30,40,50

The above command will train five text classifiers with five random seeds 10, 20, 30, 40, and 50, respectively.

  1. Use the following command to test the trained text classifiers:
python multi-seed.py --config ${path-to-project/config.json} --gpu {gpu} --seeds 10,20,30,40,50 --is_train False
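
To make the role of the pseudo OOD samples in step 3 more concrete, here is a minimal sketch of one common form of KL regularization: cross-entropy on IND samples plus the KL divergence from a uniform distribution over intents to the predicted distribution on OOD samples. This form is an assumption and may differ from the exact loss implemented in CNN_KL; the function names and the `weight` parameter are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one sample."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, label):
    """Standard classification loss for one IND sample."""
    return -log_softmax(logits)[label]


def kl_uniform_to_pred(logits):
    """KL(U || p) for one OOD sample, where U is uniform over the intents."""
    log_probs = log_softmax(logits)
    k = len(log_probs)
    return sum((1.0 / k) * (math.log(1.0 / k) - lp) for lp in log_probs)


def total_loss(ind_batch, ood_batch, weight=1.0):
    """Mean IND cross-entropy plus `weight` times the mean OOD KL term.

    ind_batch: list of (logits, label) pairs; ood_batch: list of logits.
    `weight` is a hypothetical hyperparameter of this sketch.
    """
    ce = sum(cross_entropy(logits, y) for logits, y in ind_batch) / len(ind_batch)
    kl = sum(kl_uniform_to_pred(logits) for logits in ood_batch) / len(ood_batch)
    return ce + weight * kl
```

The KL term is zero when the classifier is maximally uncertain on an OOD sample and grows as the prediction becomes confident, so minimizing it pushes OOD inputs toward low-confidence outputs.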

Note that the code for POG and CNN_KL released here is a re-implementation based on our paper. The original code used in our study is part of Samsung's internal codebase and thus cannot be retrieved.

Please kindly cite our paper if you find this repository useful.

@article{zheng2020out,
  title={Out-of-domain detection for natural language understanding in dialog systems},
  author={Zheng, Yinhe and Chen, Guanyi and Huang, Minlie},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={1198--1209},
  year={2020},
  publisher={IEEE}
}

Response to the re-implemented results reported by Marek et al., 2021

We have noticed a paper published in the NAACL 2021 Industry Track, Marek et al., "OodGAN: Generative Adversarial Network for Out-of-Domain Data Generation", that attempts to re-implement our model. However, the re-implemented results reported in that paper are much lower than ours. We contacted the authors of Marek et al. to ask whether they could share their code so that we could reproduce the results reported in their paper, but we were informed that their code is not publicly available.

We suspect the primary reason for this performance gap is that Marek et al. did not train a good classifier for IND samples, let alone make use of the generated pseudo OOD data. OOD detection performance depends largely on the quality of the IND classifier: if the classifier cannot classify IND samples well, its performance in detecting OOD samples will likely be very low.

In Marek et al., the maximum IND classification accuracy reported on the CLINC150 dataset is 90.11% (see Table 4 in Marek et al.). In our implementation, however, a simple CNN-based text classifier pushes this accuracy above 93.0%. The accuracy could be much higher with a pretrained model (about 97.00% with BERT). We suspect this degraded classifier is the main reason for their low OOD detection performance.

Moreover, the results on the CLINC150 dataset reported by Marek et al. are suspicious because their model underperforms the simplest baseline: a naive text classifier trained on IND samples without KL regularization (i.e., the MSP baseline reported in our paper). Similar results for this simple baseline are reported in various papers.

In most of these papers, this simple baseline reaches an AUROC score of about 93.0, much higher than the results reported by Marek et al.
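
For readers unfamiliar with the MSP baseline and the AUROC metric discussed here, the following is a minimal sketch, not the evaluation code used in our paper. It scores each utterance by its maximum softmax probability and computes AUROC as the probability that a random IND utterance scores higher than a random OOD utterance. The function names are hypothetical, and the result is on a 0 to 1 scale, so 0.93 corresponds to the 93.0 quoted above.

```python
import math


def msp_score(logits):
    """Maximum softmax probability; higher means more confidently in-domain."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def auroc(ind_scores, ood_scores):
    """Probability that a random IND score exceeds a random OOD score.

    Ties count as 0.5. This pairwise form is quadratic in the number of
    samples, which is fine for a sketch but slow for large test sets.
    """
    wins = 0.0
    for s_ind in ind_scores:
        for s_ood in ood_scores:
            if s_ind > s_ood:
                wins += 1.0
            elif s_ind == s_ood:
                wins += 0.5
    return wins / (len(ind_scores) * len(ood_scores))
```

Because the score comes directly from the IND classifier's outputs, a weaker IND classifier tends to produce less separable scores and therefore a lower AUROC.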

More detailed discussions can be found in this paper.
