There are two Python programs here:
- ./default chooses one of the sentences translated by Turkers. Specifically, it always chooses the sentence translated by the first Turker.
- ./grade calculates the sentence-level BLEU score of the generated output, using 'hidden' reference sentences that are available only to the grader.
The commands are designed to work in a pipeline. For instance, this is a valid invocation:
./default | ./grade
To use these programs, you may do the following:
1. Develop your own model and generate output.
2. Use the grade program (./grade) to evaluate your output.
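To give a rough idea of what a sentence-level BLEU score measures, here is a minimal sketch. It is not the code in ./grade: the whitespace tokenization, the add-one smoothing, and the function names `ngrams` and `sentence_bleu` are all assumptions made for illustration, so its numbers may differ from the grader's.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, references, max_n=4):
    """Smoothed sentence-level BLEU of one hypothesis against several references.

    Assumptions (not taken from ./grade): whitespace tokenization and
    add-one smoothing of every n-gram precision.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp:
        return 0.0

    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps a single missing n-gram order from zeroing the score.
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n

    # Brevity penalty uses the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_precision)
```

A hypothesis identical to one of its references scores 1.0 under this sketch, and a hypothesis sharing no words with any reference scores low but not exactly zero because of the smoothing.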
The data-train/ and data-test/ directories contain a training set and a test set, respectively.
- data-train contains the following:
  - lm: language model probabilities for English (n-grams)
  - surveys.tsv: information about each Turker
  - train_postedited_translations.tsv: the translations generated by each Turker, edited by other Turkers residing in the U.S. A total of 10 edits are available for each source sentence. You may use this data to generate additional features.
  - train_translations.tsv: 20% of the original data, with four reference sentences for each source sentence.
- data-test contains the following:
  - test_translations.tsv: the original data, without reference sentences. Your goal is to choose the best of the candidate sentences generated by the Turkers when a source sentence is provided along with worker IDs and four candidate sentences. Note that the first 358 sentences are from train_translations.tsv.
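The selection task can be sketched as a small function with a pluggable scoring rule. This is only an illustration, not the code in ./default: the column layout of test_translations.tsv is not described here, so `candidate_columns` is a hypothetical parameter that you would set after inspecting the file, and `choose_candidate` and `select_from_rows` are made-up names.

```python
def choose_candidate(candidates, score=None):
    """Pick one candidate sentence.

    With no scoring function this mirrors the default behaviour described
    above: return the first Turker's translation.
    """
    if score is None:
        return candidates[0]
    # max() returns the earliest candidate when scores are tied.
    return max(candidates, key=score)


def select_from_rows(rows, candidate_columns, score=None):
    """Yield one chosen sentence per tab-separated row.

    candidate_columns is HYPOTHETICAL: the indices of the fields holding the
    four candidate sentences, which must be checked against the real file.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        candidates = [fields[i] for i in candidate_columns]
        yield choose_candidate(candidates, score)
```

To try your own model, pass a `score` function that maps a candidate sentence to a number (for example, one built from the language model or the Turker survey data), print each selected sentence on its own line, and pipe the result into ./grade.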