Cleaned E2E NLG Challenge data + supporting scripts

Cleaning Semantic Noise in the E2E dataset

An updated release of the E2E NLG Challenge data with cleaned MRs and supporting scripts, accompanying the following paper (link coming soon):

Ondřej Dušek, David M. Howcroft, and Verena Rieser (2019): Semantic Noise Matters for Neural Natural Language Generation. In INLG, Tokyo, Japan.

Cleaned data

The fully cleaned E2E NLG Challenge data can be found in cleaned-data. The training and development sets are filtered so that they do not overlap with the test set, hence the no-ol naming.

The partially cleaned data (see paper) are under partially-cleaned-data. Do not use these unless you have a good reason to do so.

Cleaning process

This section only documents how we produced the cleaned data; you do not need to run any of it.

1.) Re-annotate MRs in the data (use -t if you want a partial fix only):

    ./ -f train-fixed.csv path/to/trainset.csv
    ./ -f devel-fixed.csv path/to/devset.csv
    ./ -f test-fixed.csv path/to/testset_w_refs.csv

2.) Remove instances with overlapping MRs (after reannotation). The test set is kept intact; if an instance overlaps between the train and dev sets, it is removed from the train set:

    ./ train-fixed.csv devel-fixed.csv test-fixed.csv
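
The overlap removal in step 2 can be sketched roughly as follows. This is a minimal illustration, not the script from this repository: the function name `remove_overlaps` is hypothetical, instances are assumed to be `(mr, ref)` pairs, and overlap is assumed to mean an exact match of the MR string.

```python
def remove_overlaps(train, devel, test):
    """Filter train and devel so that no MR is shared across the three sets.

    Each argument is a list of (mr, ref) pairs. Assumption: two instances
    overlap when their MR strings are identical.
    """
    # The test set is never modified; its MRs are blocked everywhere else.
    test_mrs = {mr for mr, _ in test}
    devel_kept = [(mr, ref) for mr, ref in devel if mr not in test_mrs]

    # A train/devel overlap is resolved by dropping the train instance.
    blocked = test_mrs | {mr for mr, _ in devel_kept}
    train_kept = [(mr, ref) for mr, ref in train if mr not in blocked]
    return train_kept, devel_kept, test


if __name__ == "__main__":
    train = [("a", "x"), ("b", "y"), ("c", "z")]
    devel = [("b", "q"), ("d", "r")]
    test = [("c", "t"), ("d", "u")]
    print(remove_overlaps(train, devel, test))
```

Filtering devel before train means the priority order is test, then devel, then train, which matches the description above.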

Experiments with TGen

We used the data with default TGen settings for the E2E Challenge, with validation on the development set (additional training parameter -v input/devel-das.txt,input/devel-text.txt) and evaluation on the test set (both original and cleaned).

To get the plain seq2seq configuration ("TGen-"), we set the classif_filter parameter in the config/config.yaml file to null. To use the slot error script as a reranker ("TGen+"), we set classif_filter in the following way:

    classif_filter: {'model': 'e2e_patterns'}
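
To give an intuition for what reranking by slot errors means, here is a minimal sketch. It is not TGen's implementation and not the slot error script from this repository: the functions `count_missing_slots` and `rerank` are hypothetical, the MR is assumed to be a dict of slot values, and a slot counts as realized if its value appears verbatim in the output (the actual script relies on hand-written patterns instead).

```python
def count_missing_slots(mr, text):
    """Count MR values that do not appear in the text (case-insensitive).

    Assumption: plain substring matching stands in for real slot patterns.
    """
    lowered = text.lower()
    return sum(1 for value in mr.values() if value.lower() not in lowered)


def rerank(mr, candidates):
    """Order (text, score) candidates: fewest missing slots first,
    then highest model score."""
    return sorted(
        candidates,
        key=lambda cand: (count_missing_slots(mr, cand[0]), -cand[1]),
    )


if __name__ == "__main__":
    mr = {"name": "The Vaults", "food": "Italian"}
    candidates = [
        ("The Vaults is a restaurant.", -1.0),
        ("The Vaults serves Italian food.", -2.0),
    ]
    print(rerank(mr, candidates)[0][0])
```

In this toy example the second candidate wins despite its lower model score, because it realizes both slots.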

Note that a version of the script is included in the TGen code for simpler usage.
