Cleaned E2E NLG Challenge data + supporting scripts

Cleaning Semantic Noise in the E2E dataset

An updated release of the E2E NLG Challenge data with cleaned meaning representations (MRs) and supporting scripts, accompanying the following paper (link coming soon):

Ondřej Dušek, David M. Howcroft, and Verena Rieser (2019): Semantic Noise Matters for Neural Natural Language Generation. In INLG, Tokyo, Japan.

Cleaned data

The fully cleaned E2E NLG Challenge data can be found in cleaned-data. The training and development sets are filtered so that they don't overlap with the test set, hence the no-ol naming.

The partially cleaned data (see paper) are under partially-cleaned-data. Do not use these unless you have a good reason to do so.

Cleaning process

This section only documents how we produced the cleaned data; you do not need to run these steps yourself.

1.) Re-annotate MRs in the data (use -t if you want a partial fix only):

./slot_error.py -f train-fixed.csv path/to/trainset.csv
./slot_error.py -f devel-fixed.csv path/to/devset.csv
./slot_error.py -f test-fixed.csv path/to/testset_w_refs.csv
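
To give a rough idea of what re-annotating an MR against its text means, here is a toy sketch. It is not the logic of slot_error.py, which relies on its own matching rules; the function names `parse_mr` and `naive_reannotate` are hypothetical, and the literal substring match is a deliberate oversimplification. It only assumes the E2E MR notation `slot[value], slot[value]`.

```python
import re


def parse_mr(mr):
    """Parse an E2E-style MR such as 'name[The Eagle], food[French]' into a dict."""
    return {slot: value for slot, value in re.findall(r"(\w+)\[([^\]]*)\]", mr)}


def naive_reannotate(mr, text):
    """Keep only the slots whose value literally appears in the text.

    Assumption: case-insensitive substring matching stands in for the
    real script's matching rules, which are not reproduced here.
    """
    lowered = text.lower()
    kept = {s: v for s, v in parse_mr(mr).items() if v.lower() in lowered}
    # Serialize back into the same slot[value] notation, in original order.
    return ", ".join(f"{s}[{v}]" for s, v in kept.items())


fixed = naive_reannotate(
    "name[The Eagle], food[French], area[riverside]",
    "The Eagle serves French food.",
)
print(fixed)
```

In this example the `area` slot is dropped because the text never mentions the riverside. A literal match like this would fail on paraphrased values, so it should be read as an illustration of the idea only.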

2.) Remove instances with overlapping MRs (after re-annotation). This keeps the test set intact; if an instance overlaps between the training and development sets, it is removed from the training set:

./remove_overlaps.py train-fixed.csv devel-fixed.csv test-fixed.csv
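
The filtering rule described in step 2 can be sketched as follows. This is a minimal illustration, not the code of remove_overlaps.py: it assumes each instance is a dict with an `mr` key and that two MRs overlap when their strings are identical, which may differ from how the actual script compares MRs.

```python
def remove_overlaps(train, devel, test, key="mr"):
    """Filter train and devel so that no MR is shared with the test set,
    and no MR is shared between train and devel.

    Assumption: rows are dicts and MRs are compared as exact strings.
    """
    test_mrs = {row[key] for row in test}
    # The test set stays intact; devel loses anything that overlaps with it.
    devel_kept = [row for row in devel if row[key] not in test_mrs]
    devel_mrs = {row[key] for row in devel_kept}
    # Train loses anything that overlaps with the test set or with devel.
    train_kept = [
        row for row in train
        if row[key] not in test_mrs and row[key] not in devel_mrs
    ]
    return train_kept, devel_kept, test


train = [{"mr": "name[A]"}, {"mr": "name[B]"}, {"mr": "name[C]"}]
devel = [{"mr": "name[B]"}, {"mr": "name[D]"}]
test = [{"mr": "name[C]"}, {"mr": "name[D]"}]
train_kept, devel_kept, test_kept = remove_overlaps(train, devel, test)
```

Rows in this shape could be read with `csv.DictReader`, provided the CSV files have a column holding the MR; the column name `mr` used here is an assumption.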

Experiments with TGen

We used the data with default TGen settings for the E2E Challenge, with validation on the development set (additional training parameter -v input/devel-das.txt,input/devel-text.txt) and evaluation on the test set (both original and cleaned).

To get the plain seq2seq configuration ("TGen-"), we set the classif_filter parameter in the config/config.yaml file to null. To use the slot error script as reranker ("TGen+"), we set classif_filter in the following way:

    classif_filter: {'model': 'e2e_patterns'}

Note that a version of the slot_error.py script is included in the TGen code for simpler usage.
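
The general idea of using a slot error count as a reranker can be sketched as below. This is not TGen's implementation, and TGen may combine the model score and the error signal differently; the `rerank` function, the `(text, score)` candidate format, and the `count_errors` callable are all hypothetical.

```python
def rerank(candidates, count_errors):
    """Order candidates by fewest slot errors, then by highest model score.

    candidates: list of (text, score) pairs, where a higher score is better.
    count_errors: callable mapping a text to an integer error count
    (a stand-in for a slot error check).
    """
    return sorted(candidates, key=lambda c: (count_errors(c[0]), -c[1]))


candidates = [
    ("It is a cheap pub.", -1.0),
    ("The Eagle is a cheap pub.", -2.0),
]
# Toy error counter: one error if the name from the MR is not mentioned.
best_text, best_score = rerank(
    candidates, lambda text: 0 if "The Eagle" in text else 1
)[0]
```

Here the second candidate wins despite its lower model score, because it is the only one that mentions the required name.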
