Am I running the correct test file? #4
It looks like you are using single-reference BLEU evaluation. There are 8 references in the WikiLarge test set (available here: https://github.com/cocoxu/simplification). The code I released can be used to produce the system output of EncDecA, DRESS, and DRESS-LS. Please follow the evaluation protocols described in our paper. More suggestions can be found here.
> why turn off the ignore case? I think uppercase or lowercase makes no difference for the words in this data set.

I don't think it matters :)
Yes, I also tried the dataset downloaded from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus. Meanwhile, could you show me the command you run for the 8-reference test set? Since there are 8 references for iBLEU, and Wei Xu's paper doesn't indicate how to handle them: do you take the mean or the max over the 8 per-reference scores?

I downloaded the dataset from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus/truecased (because true-cased text works better for the NER tool) and used the Stanford NER tool to do the replacement. (I think I am doing this correctly, because I see exactly the same output when I go from wiki.full.aner.ori.test to wiki.full.aner.test.) But since your code doesn't seem to support the 8-reference test set, I tried my own TensorFlow encoder-decoder model, which follows a similar setting to yours, but I still don't get 88% BLEU. My model does reach similar performance on the WikiLarge/WikiSmall test sets (the non-8-reference ones) through mteval-v13a.pl, so I think perhaps I am just using the wrong script.

In addition, are you still working on this task? So far it is true that a Seq2seq model without RL prefers to copy the complex text into the simple side. I think you can address that in the evaluation stage (through RL), but I think the major reason is the attention (I am running experiments now to prove it).

[1] BLEU: a Method for Automatic Evaluation of Machine Translation
I think I can achieve results similar to the paper. Am I correct, or is there any bias from what you did?
> Am I correct, or is there any bias from what you did?
Here are the instructions for 8-reference BLEU evaluation on WikiLarge: https://github.com/XingxingZhang/dress/tree/master/experiments/evaluation/BLEU. Good luck!
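For anyone puzzled by the "mean or max" question above: standard multi-reference BLEU (as computed by mteval-v13a.pl) is neither. Each hypothesis n-gram count is clipped against the maximum count of that n-gram across all references, and the brevity penalty uses the reference length closest to the hypothesis. A minimal pure-Python sketch of that scheme (my own illustration for a single segment, not the actual mteval implementation, which also aggregates counts at the corpus level):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def multi_ref_bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU against several references.

    Hypothesis n-gram counts are clipped against the MAXIMUM count of
    each n-gram over all references (not a mean or max of per-reference
    scores), and the brevity penalty uses the reference whose length is
    closest to the hypothesis length.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        max_ref_counts = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], c)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0  # no n-gram overlap at this order
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    # Brevity penalty against the closest reference length.
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > closest else math.exp(1.0 - closest / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

With a hypothesis identical to any one reference this returns 1.0; otherwise the extra references simply widen the pool of acceptable n-grams, which is why 8-reference BLEU scores are so much higher than single-reference ones.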
I got it. I wonder whether you use only the 8 references, or 9 references (the 8 references plus the original single ground truth)? Based on the script you provide, I think you use the 8-reference setup, but I just want to double-check.
Did you get the correct BLEU score?
Yes, I can get the correct BLEU. Thank you.
awesome! |
I am trying to reproduce the result (mostly BLEU) for the encoder-decoder with attention (the paper reports 88% BLEU). I tried the model you provide, "model_0.001.256.2L.we.full.2l.ft0.t7", with the test files from both WikiLarge and WikiSmall. The best result I got is from WikiLarge (59.80%, which is lower than what you reported in the paper). I wonder if I am using the correct test file or the correct checkpoint?
```
th generate_pipeline.lua --modelPath ../model_0.001.256.2L.we.full.2l.ft0.t7 \
  --dataPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.test \
  --outPathRaw outraw2.txt \
  --oriDataPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.ori.test \
  --oriMapPath ../../text_simplification_data/train/dress/wikismall/PWKP_108016.tag.80.aner.map.t7

th generate_pipeline.lua --modelPath ../model_0.001.256.2L.we.full.2l.ft0.t7 \
  --dataPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.test \
  --outPathRaw outraw2.txt \
  --oriDataPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.ori.test \
  --oriMapPath ../../text_simplification_data/train/dress/wikilarge/wiki.full.aner.map.t7
```
To be honest, I have now tried your dataset with another encoder-decoder model (a Transformer, with no RNN inside), which is supposed to work better, at least on the memory side. I also have some concerns about your model configuration:
Why turn off the ignore-case option? I think uppercase vs. lowercase makes no difference for the words in this dataset.
Also, in your decoded results I see many short sentences (which hurt BLEU but may be good for the FK score). The reason is that WikiLarge contains noisy data (http://homepages.inf.ed.ac.uk/kwoodsen/wiki.html#wiki-data), which teaches the model to generate short sentences.
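For context, the FK score mentioned here is the Flesch-Kincaid grade level, FKGL = 0.39·(words/sentences) + 11.8·(syllables/words) − 15.59, which rewards short sentences and short words, so a model that over-generates short sentences can look better on FK while losing BLEU. A rough sketch (the syllable counter is a crude vowel-group heuristic of my own, not a standard tool):

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(sentences):
    """Flesch-Kincaid grade level over a list of sentence strings."""
    words = [w for s in sentences for w in s.split()]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

On this heuristic, a text of short, simple sentences scores several grade levels below a single long multisyllabic sentence, which is the incentive the noisy short training pairs exploit.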