
Am I running the correct test file? #4

Closed
Sanqiang opened this issue Oct 9, 2017 · 12 comments

@Sanqiang commented Oct 9, 2017

I am trying to reproduce the results (mostly BLEU) for the encoder-decoder with attention (the paper reports 88% BLEU). I tried the model you provide, "model_0.001.256.2L.we.full.2l.ft0.t7", with the test files of both WikiLarge and WikiSmall. The best result I get is on WikiLarge, and it is 59.80%, which is lower than what you reported in the paper. I wonder whether I am using the correct test file or the correct checkpoint?

```
th generate_pipeline.lua --modelPath ../model_0.001.256.2L.we.full.2l.ft0.t7 --dataPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.test --outPathRaw outraw2.txt --oriDataPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.ori.test --oriMapPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.map.t7

th generate_pipeline.lua --modelPath ../model_0.001.256.2L.we.full.2l.ft0.t7 --dataPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.test --outPathRaw outraw2.txt --oriDataPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.ori.test --oriMapPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.map.t7
```

To be honest, I have also tried your dataset with another encoder-decoder model (a Transformer, with no RNN inside), which should work better, at least in terms of memory. I also have some concerns about your model configuration:
why turn off case folding? I think uppercase versus lowercase makes no difference for the words in this dataset.

Also, looking at your decoded output, I see many short sentences (which hurt performance but may be good for the FK score). The reason is that WikiLarge contains noisy data from the kwoodsen Wikipedia dataset (http://homepages.inf.ed.ac.uk/kwoodsen/wiki.html#wiki-data), and this data teaches the model to generate short sentences.

@XingxingZhang (Owner)

It looks like you are using single-reference BLEU evaluation. There are 8 references in the WikiLarge test set (available here: https://github.com/cocoxu/simplification).

The code I released can be used to produce the system output of EncDecA, DRESS and DRESS-LS. Please follow the evaluation protocols described in our paper. More suggestions can be found here
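
For anyone else trying to reproduce the numbers, here is a minimal sketch of the difference between single-reference and multi-reference BLEU. It is not the evaluation pipeline used in the paper (that relies on mteval/Joshua-style scripts), and the example sentences are made up; it only illustrates why a single-reference score can sit far below the 8-reference one.

```python
# Single- vs multi-reference BLEU, illustration only (requires: pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

output = "the cat sat on the mat".split()

# One reference: every wording choice the output misses is penalised.
single_ref = ["the cat is sitting on the mat".split()]

# Several references (WikiLarge ships 8 per test sentence): the output only
# needs to match n-grams from *some* reference, so scores are usually higher.
multi_refs = [
    "the cat is sitting on the mat".split(),
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
]

smooth = SmoothingFunction().method1
print(sentence_bleu(single_ref, output, smoothing_function=smooth))   # lower
print(sentence_bleu(multi_refs, output, smoothing_function=smooth))   # higher
```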

@XingxingZhang (Owner)

> why turn off case folding? I think uppercase versus lowercase makes no difference for the words in this dataset.

I don't think it matters :)

@Sanqiang (Author) commented Oct 9, 2017

Yes, I also tried the dataset downloaded from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus.
But the 8-reference dataset does not have the NER replacement applied (e.g. PEOPLE@1, LOCATION@1 and that kind of thing), so I cannot use your code directly (I am supposed to feed it test files preprocessed with an NER tool).
I wonder: do you still do the NER replacement for the 8-reference dataset?

Meanwhile, could you show me the command you run for the 8-reference test set? There are 8 references for iBLEU, and Wei Xu's paper does not say how to handle them: do you take the mean or the max over the 8 per-reference scores?

I downloaded the dataset from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus/truecased (because truecased text works better with NER tools) and used the Stanford NER tool to do the replacement. (I think I am doing this correctly, because I get exactly the same output when I go from wiki.full.aner.ori.test to wiki.full.aner.test.) But since your code does not seem to support the 8-reference dataset, I tried my own TensorFlow encoder-decoder model, which follows a similar setting to yours, and still did not get 88% BLEU. My model does reach similar performance on the WikiLarge/WikiSmall test sets (the non-8-reference ones) through mteval-v13a.pl, so perhaps I am just using the wrong script.

In addition, are you still working on this task? So far, it is true that seq2seq without RL prefers to copy the complex text rather than simplify it, and I think you can address that in the evaluation stage (through RL), but I think the major reason is the attention (I am running experiments now to prove it).

@XingxingZhang (Owner)

"
Yes, I also tried dataset download from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus
But the 8 references dataset doesn't do the NER replacement (e.g. PEOPLE@1, LOCATION@1 that kind of things), so I cannot directly use your code (I am supposed to be put preprocess with NER tools test files as input).
I wonder if you still do the NER replacement for 8 references dataset?
"
The "wiki.full.aner.map.t7" file in "data-simplification/wikilarge" folder contains all you need for NER anonymization/de-anonymization. Note that in test set, I only did NER anonymization for complex sentences and one of the reference sentences. But it doesn't matter since your system output will be de-anonymized anyway.

@XingxingZhang (Owner)

"
Meanwhile, I wonder could you show me the command you run for 8 references test set (since there are 8 references for iBLEU, the Xu Wei's paper didn't indicate how to do with 8 references, take the mean or max for 8 reference performances?)?
"
BLEU evaluation in default assumes there are multiple references [1][2]. Please refer to the documentations of Joshua or mt-eval-v13 for how to evaluate BLEU with multiple references.

[1] BLEU: a Method for Automatic Evaluation of Machine Translation
[2] https://en.wikipedia.org/wiki/BLEU

@Sanqiang (Author) commented Oct 9, 2017

I think I can now get results similar to the paper. This is what I do:
(1) I use scripts/mteval-v13a.pl to evaluate your output against the single ground truth, which gives BLEU(I, O) of roughly 60%.
(2) I use scripts/multi-bleu.perl to evaluate your output against the 8 references, which gives BLEU(I, R) of roughly 90%. The original references are all lowercase, so I use the truecased references so that they match.
(3) I combine them as iBLEU = 0.9 * BLEU(I, R) + 0.1 * BLEU(I, O), following Xu's paper, and the result is similar to the one in your paper.

Am I correct, or does this differ from what you did?

@XingxingZhang (Owner)

> Am I correct, or does this differ from what you did?

No. I didn't use iBLEU and didn't mention iBLEU anywhere.

@XingxingZhang (Owner)

Here are the instructions for 8-reference BLEU evaluation on WikiLarge: https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU

Good luck!

@Sanqiang (Author) commented Oct 10, 2017

I got it.
So you use the 8-reference BLEU evaluation from https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU.

Do you use only the 8 references, or 9 references (the 8 references plus the original single ground truth)? Based on the script you provide, I think you use only the 8 references, but I just want to double-check.

@XingxingZhang (Owner)

Did you get the correct BLEU score?
=> BLEU = 0.8885

@Sanqiang (Author)

Yes, I can get the correct BLEU. Thank you.

@XingxingZhang (Owner)

awesome!
