Our cleaned datasets can be downloaded at:
The training script is train.py. For example, to train the GPT-2 model on the cleaned DailyDialog dataset:
python train.py \
--train-path=path-to-the-training-csv-file \
--eval-path=path-to-the-validation-csv-file \
--num-epochs=50 \
--model-str=gpt2
The LSTM and Transformer baselines are trained with the Fairseq framework.
The training script automatically creates a timestamped logging directory that stores checkpoints and log files. Validation performance can be monitored during training with TensorBoard:
tensorboard --logdir=path-to-the-timestamped-logging-folder
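For reference, the timestamped logging directory can be sketched as below. This is an illustrative snippet, not the actual code from train.py: the log_root default and the strftime pattern are assumptions, and the real script may name folders differently.

```python
import os
from datetime import datetime

def make_log_dir(log_root="logs"):
    """Create a timestamped logging directory, e.g. logs/2024-05-01_13-37-00.

    The naming scheme here is illustrative; train.py may use a different
    pattern, but any unique per-run folder plays the same role: it is the
    target for tensorboard --logdir and for --resume-path when resuming.
    """
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    log_dir = os.path.join(log_root, stamp)
    os.makedirs(log_dir, exist_ok=True)
    return log_dir
```

Pointing tensorboard --logdir at a parent folder of several such runs also works: TensorBoard will show each timestamped run as a separate curve.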
If performance is still improving at the end of training, you can resume from the saved checkpoints with:
python train.py \
--train-path=path-to-the-training-csv-file \
--eval-path=path-to-the-validation-csv-file \
--num-epochs=100 \
--model-str=gpt2 \
--resume-path=path-to-the-timestamped-logging-folder
After the performance has peaked, you can evaluate the model using eval.py:
python eval.py --ckpt=path-to-the-best-validation-checkpoint --eval-path=path-to-the-test-csv-file
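To locate the best validation checkpoint programmatically, one option is to parse validation scores out of the checkpoint filenames. The naming convention below (ckpt-epoch{N}-valloss{L}.pt) is purely hypothetical; adapt the pattern to whatever train.py actually writes into the logging directory.

```python
import os
import re

# Hypothetical checkpoint naming: ckpt-epoch{N}-valloss{L}.pt
_CKPT_RE = re.compile(r"ckpt-epoch(\d+)-valloss([\d.]+)\.pt$")

def best_checkpoint(log_dir):
    """Return the checkpoint path with the lowest validation loss.

    Assumes the (hypothetical) filename pattern above; change the regex
    to match the naming scheme your training run actually produced.
    """
    best_path, best_loss = None, float("inf")
    for name in os.listdir(log_dir):
        m = _CKPT_RE.match(name)
        if m and float(m.group(2)) < best_loss:
            best_loss = float(m.group(2))
            best_path = os.path.join(log_dir, name)
    return best_path
```

The returned path can then be passed directly as the --ckpt argument to eval.py.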