This repository has been archived by the owner on Dec 11, 2023. It is now read-only.

BPE support seems missing #4

Closed
skyw opened this issue Jul 13, 2017 · 8 comments

@skyw

skyw commented Jul 13, 2017

I'm trying to run wmt16_en_de_gnmt.json.
It first fails with a missing-vocabulary-file error. Looking into the code, it doesn't look for the vocab files with the "bpe.32000" suffix that are created by wmt16_en_de.sh. If I force it to use the right vocab files, the model starts to run and the graph build seems successful. However, it then stops with the error "HashTable has different value for same key. Key <s> has 1 and trying to add value 4".

@skyw
Author

skyw commented Jul 13, 2017

It seems the train/test/dev data loading doesn't use the BPE paths either.
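
For reference, pointing everything at the BPE outputs from wmt16_en_de.sh would look roughly like this; the paths are hypothetical and the exact flag values are assumptions based on the nmt README rather than something verified against this setup:

# Sketch of a run that points the vocab and data prefixes at the BPE files
# produced by wmt16_en_de.sh (paths and flag values are assumptions).
python -m nmt.nmt \
  --src=de --tgt=en \
  --hparams_path=nmt/standard_hparams/wmt16_en_de_gnmt.json \
  --out_dir=/tmp/deen_gnmt \
  --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \
  --train_prefix=/tmp/wmt16/train.tok.clean.bpe.32000 \
  --dev_prefix=/tmp/wmt16/newstest2013.tok.bpe.32000 \
  --test_prefix=/tmp/wmt16/newstest2015.tok.bpe.32000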

@ebrevdo
Contributor

ebrevdo commented Jul 15, 2017

@lmthang Can you PTAL?

@lmthang lmthang assigned oahziur and unassigned lmthang Jul 16, 2017
@lmthang
Contributor

lmthang commented Jul 16, 2017

@skyw: can you provide the error log?

@skyw
Author

skyw commented Jul 17, 2017

test.log.txt

Attached. The command I used is at the beginning of the log.
It looks like the BPE vocab files already contain "<s>" and "</s>", while vocab_utils.py is not aware of either of them.

@oahziur
Contributor

oahziur commented Jul 17, 2017

Hmm, why is the first token in your vocab "-e unk"? Is there a compatibility issue in how the vocab was generated?

I believe the 3 special tokens are appended here:

https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh#L148
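
For context, that part of the script presumably does something along these lines (a rough reconstruction, not a verbatim copy of wmt16_en_de.sh):

# Rough sketch of how the special tokens get prepended (reconstruction, not a
# verbatim copy of the script). If echo does not interpret -e, the literal
# text "-e <unk>" becomes the first line of the vocab file.
echo -e "<unk>\n<s>\n</s>" > vocab.bpe.32000
cat bpe.32000.vocab >> vocab.bpe.32000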

@skyw
Author

skyw commented Jul 17, 2017

Hmm, I didn't even check that. Though they look pretty much the same, I used https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh to generate the data instead of the data generated by tensor2tensor's script, so I suppose there were some compatibility issues.
In any case, I removed the "-e" and it still reports the same error.

My guess is that the error comes from vocab_utils.py trying to add "<s>" when it is already in the vocab file; I haven't tried a manual fix, though.
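
A quick way to check that guess (the vocab file name here is assumed):

# Hypothetical sanity check on the generated vocab: the special tokens should
# be the first three lines and should not appear again further down.
head -n 3 vocab.bpe.32000
grep -c -x '<s>' vocab.bpe.32000   # expect 1; a larger count means a duplicate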

@oahziur
Contributor

oahziur commented Jul 17, 2017

@skyw Can you try removing the "-e " entry and using a fresh out_dir? I think the model will reuse the previously saved vocab_file if you use the same out_dir.

Also, here is the head of the vocab file:

<unk>
<s>
</s>
,
.
the
in
of
and
die
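
Concretely, something along these lines (the out_dir path is just an example):

# Start from a clean out_dir so the model cannot pick up the vocab file saved
# by the previous run (path is hypothetical).
rm -rf /tmp/deen_gnmt && mkdir -p /tmp/deen_gnmt
# ...then rerun with --out_dir=/tmp/deen_gnmt and the corrected vocab files.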

@skyw
Author

skyw commented Jul 17, 2017

Ah, it seems to be working after I deleted the out_dir. Sorry for the chatter; I should have done that.

I also tried to reproduce the issue of "-e" being generated in the vocab file. I think the problem is the space between "#" and "!" on the first line: https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh#L1
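
For what it's worth, here is a minimal way to see why that space matters: with "# !" the first line is an ordinary comment rather than a "#!" interpreter directive, so the script may end up being run by /bin/sh, whose echo (dash on many systems) does not understand -e. The demo.sh file below is just an illustration:

# Minimal illustration (not taken from the original environment): dash's
# builtin echo treats -e as a literal argument, so the first output line
# becomes "-e <unk>", which then lands at the top of the vocab file.
printf 'echo -e "<unk>\\n<s>\\n</s>"\n' > demo.sh
bash demo.sh   # bash interprets -e: prints <unk>, <s>, </s> on separate lines
sh demo.sh     # with dash as /bin/sh: first line is "-e <unk>"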
