This repository was archived by the owner on Jul 7, 2023. It is now read-only.

@vthorsteinsson
Contributor

This PR adds text data generators that read source and target sentences in pairs, separated by tabs (\t), from a single source file.
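As a rough illustration of the tab-separated format described above, here is a minimal sketch in Python. It is not the code from this PR: the function names, the toy character encoder, and the assumption that the end-of-sequence id is 1 are all hypothetical, chosen only to show how one line of the file splits into a source and a target.

```python
import io

EOS_ID = 1  # assumption: id reserved for the end-of-sequence token


def tabbed_pairs(lines):
    """Yield (source, target) string pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip lines that have no source/target separator
        # Split on the first tab only, so any later tabs stay in the target.
        source, target = line.split("\t", 1)
        yield source.strip(), target.strip()


def tabbed_generator(lines, source_encode, target_encode, eos_id=EOS_ID):
    """Yield dicts of token ids, using a separate encoder for each side."""
    for source, target in tabbed_pairs(lines):
        yield {
            "inputs": source_encode(source) + [eos_id],
            "targets": target_encode(target) + [eos_id],
        }


def char_encode(text):
    # Hypothetical character-level encoder: code points shifted past
    # two reserved ids (padding and end-of-sequence).
    return [ord(c) + 2 for c in text]


# An in-memory stand-in for the single source file.
sample = io.StringIO("hello\thalló\nno tab here\nyes\tjá\n")
examples = list(tabbed_generator(sample, char_encode, char_encode))
```

Passing the source and target encoders as separate arguments keeps the two vocabularies from being mixed up, even when both sides happen to use the same encoder, as in this toy example.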

It also adds scaffolding to support parsing of Icelandic text, either from characters or from subword tokens. (More coming later on that front.)

A few smaller fixes are included, such as replacing the magic constants for reserved token ids with more meaningful identifiers, and removing annoying strings from the decoder's output trace messages.
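To make the point about magic constants concrete, here is a small sketch of the general idea. The identifier names and the toy encoder below are assumptions for illustration and may not match what this PR actually uses.

```python
# Reserved tokens are declared once; their ids are derived from the list
# instead of being written as bare numbers throughout the code.
PAD = "<pad>"
EOS = "<EOS>"
RESERVED_TOKENS = [PAD, EOS]
NUM_RESERVED_TOKENS = len(RESERVED_TOKENS)
PAD_ID = RESERVED_TOKENS.index(PAD)
EOS_ID = RESERVED_TOKENS.index(EOS)


def encode_chars(text):
    """Encode text as character ids shifted past the reserved ids, then EOS."""
    return [ord(c) + NUM_RESERVED_TOKENS for c in text] + [EOS_ID]


def decode_chars(ids):
    """Invert encode_chars, dropping any reserved ids such as PAD and EOS."""
    return "".join(
        chr(i - NUM_RESERVED_TOKENS) for i in ids if i >= NUM_RESERVED_TOKENS
    )
```

With this layout, adding another reserved token only means extending `RESERVED_TOKENS`; the offset used by the encoder and decoder follows automatically.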

Contributor

@lukaszkaiser lukaszkaiser left a comment

Thanks Villi. I'd be happy to merge, but there's one line that needs to go away, I think (as tokenizer is a module now, not a class).

vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
return vocab

tokenizer = Tokenizer()
Contributor

This line should be removed.

@vthorsteinsson
Contributor Author

Yes, sorry, I missed this change in the final merge with upstream. Should be OK now.

Contributor

@lukaszkaiser lukaszkaiser left a comment

Thanks!

@lukaszkaiser lukaszkaiser merged commit 43bfb9f into tensorflow:master Jul 14, 2017
@vthorsteinsson
Contributor Author

My pleasure! :-)

@lukaszkaiser
Contributor

I just found one small problem (source_vocab is used instead of target_vocab in wmt.tabbed_generator). I'll correct it in my next PR and release. It'd be great though to add an option to download the Icelandic parsing set instead of assuming it's there in a file. Is the set public?

@vthorsteinsson vthorsteinsson deleted the iceparse branch July 16, 2017 16:06