
Am I running the correct test file? #4

Closed
Sanqiang opened this issue Oct 9, 2017 · 12 comments

@Sanqiang commented Oct 9, 2017

I am trying to reproduce the results (mostly BLEU) for the encoder-decoder with attention (the paper reports 88% BLEU). I tried the model you provide, "model_0.001.256.2L.we.full.2l.ft0.t7", with the test files of both WikiLarge and WikiSmall. The best result I get is on WikiLarge, and it is 59.80%, which is lower than what you reported in the paper. I wonder whether I am using the correct test file or the correct checkpoint?

```
th generate_pipeline.lua --modelPath ../model_0.001.256.2L.we.full.2l.ft0.t7 --dataPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.test --outPathRaw outraw2.txt --oriDataPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.ori.test --oriMapPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.map.t7

th generate_pipeline.lua --modelPath ../model_0.001.256.2L.we.full.2l.ft0.t7 --dataPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.test --outPathRaw outraw2.txt --oriDataPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.ori.test --oriMapPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.map.t7
```

To be honest, I have also tried your dataset with another encoder-decoder model (a Transformer, with no RNN inside), which should work better, at least in terms of memory. I also have some concerns about your model configuration:
why turn off case folding? I think uppercase versus lowercase makes no difference for the words in this dataset.

Also, looking at your decoded output, I see many short sentences (which hurt performance but may be good for the FK score). The reason is that WikiLarge contains noisy data from the kwoodsen Wikipedia dataset (http://homepages.inf.ed.ac.uk/kwoodsen/wiki.html#wiki-data), and this data teaches the model to generate short sentences.

@XingxingZhang (Owner)

It looks like you are using single-reference BLEU evaluation. There are 8 references in the WikiLarge test set (available here: https://github.com/cocoxu/simplification).

The code I released can be used to produce the system output of EncDecA, DRESS and DRESS-LS. Please follow the evaluation protocols described in our paper. More suggestions can be found here
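
For anyone else trying to reproduce the numbers, here is a minimal sketch of the difference between single-reference and multi-reference BLEU. It is not the evaluation pipeline used in the paper (that relies on mteval/Joshua-style scripts), and the example sentences are made up; it only illustrates why a single-reference score can sit far below the 8-reference one.

```python
# Single- vs multi-reference BLEU, illustration only (requires: pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

output = "the cat sat on the mat".split()

# One reference: every wording choice the output misses is penalised.
single_ref = ["the cat is sitting on the mat".split()]

# Several references (WikiLarge ships 8 per test sentence): the output only
# needs to match n-grams from *some* reference, so scores are usually higher.
multi_refs = [
    "the cat is sitting on the mat".split(),
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
]

smooth = SmoothingFunction().method1
print(sentence_bleu(single_ref, output, smoothing_function=smooth))   # lower
print(sentence_bleu(multi_refs, output, smoothing_function=smooth))   # higher
```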

@XingxingZhang (Owner)

> why turn off case folding? I think uppercase versus lowercase makes no difference for the words in this dataset.

I don't think it matters :)

@Sanqiang (Author) commented Oct 9, 2017

Yes, I also tried the dataset downloaded from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus.
But the 8-reference dataset does not have the NER replacement applied (e.g. PEOPLE@1, LOCATION@1 and that kind of thing), so I cannot use your code directly (I am supposed to feed it test files preprocessed with an NER tool).
I wonder: do you still do the NER replacement for the 8-reference dataset?

Meanwhile, could you show me the command you run for the 8-reference test set? There are 8 references for iBLEU, and Wei Xu's paper does not say how to handle them: do you take the mean or the max over the 8 per-reference scores?

I downloaded the dataset from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus/truecased (because truecased text works better with NER tools) and used the Stanford NER tool to do the replacement. (I think I am doing this correctly, because I get exactly the same output when I go from wiki.full.aner.ori.test to wiki.full.aner.test.) But since your code does not seem to support the 8-reference dataset, I tried my own TensorFlow encoder-decoder model, which follows a similar setting to yours, and still did not get 88% BLEU. My model does reach similar performance on the WikiLarge/WikiSmall test sets (the non-8-reference ones) through mteval-v13a.pl, so perhaps I am just using the wrong script.

In addition, are you still working on this task? So far, it is true that seq2seq without RL prefers to copy the complex text rather than simplify it, and I think you can address that in the evaluation stage (through RL), but I think the major reason is the attention (I am running experiments now to prove it).

@XingxingZhang (Owner)

"
Yes, I also tried dataset download from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus
But the 8 references dataset doesn't do the NER replacement (e.g. PEOPLE@1, LOCATION@1 that kind of things), so I cannot directly use your code (I am supposed to be put preprocess with NER tools test files as input).
I wonder if you still do the NER replacement for 8 references dataset?
"
The "wiki.full.aner.map.t7" file in "data-simplification/wikilarge" folder contains all you need for NER anonymization/de-anonymization. Note that in test set, I only did NER anonymization for complex sentences and one of the reference sentences. But it doesn't matter since your system output will be de-anonymized anyway.

@XingxingZhang (Owner)

"
Meanwhile, I wonder could you show me the command you run for 8 references test set (since there are 8 references for iBLEU, the Xu Wei's paper didn't indicate how to do with 8 references, take the mean or max for 8 reference performances?)?
"
BLEU evaluation in default assumes there are multiple references [1][2]. Please refer to the documentations of Joshua or mt-eval-v13 for how to evaluate BLEU with multiple references.

[1] BLEU: a Method for Automatic Evaluation of Machine Translation
[2] https://en.wikipedia.org/wiki/BLEU

@Sanqiang (Author) commented Oct 9, 2017

I think I can now get results similar to the paper. This is what I do:
(1) I use scripts/mteval-v13a.pl to evaluate your output against the single ground truth, which gives BLEU(I, O) of roughly 60%.
(2) I use scripts/multi-bleu.perl to evaluate your output against the 8 references, which gives BLEU(I, R) of roughly 90%. The original references are all lowercase, so I use the truecased references so that they match.
(3) I combine them as iBLEU = 0.9 * BLEU(I, R) + 0.1 * BLEU(I, O), following Xu's paper, and the result is similar to the one in your paper.

Am I correct, or does this differ from what you did?

@XingxingZhang (Owner)

> Am I correct, or does this differ from what you did?

No. I didn't use iBLEU and didn't mention iBLEU anywhere.

@XingxingZhang (Owner)

Here are the instructions for 8-reference BLEU evaluation on WikiLarge: https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU

Good luck!

@Sanqiang (Author) commented Oct 10, 2017

I got it.
So you use the 8-reference BLEU evaluation from https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU.

Do you use only the 8 references, or 9 references (the 8 references plus the original single ground truth)? Based on the script you provide, I think you use only the 8 references, but I just want to double-check.

@XingxingZhang (Owner)

Did you get the correct BLEU score?
=> BLEU = 0.8885

@Sanqiang (Author)

Yes, I can get the correct BLEU. Thank you.

@XingxingZhang (Owner)

awesome!
