Source/target pair text files; Icelandic parsing support; fixes #154

vthorsteinsson · 2017-07-14T00:45:46Z

This PR adds text data generators that read source and target sentences in pairs, separated by tabs (\t), from a single source file.
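As a rough illustration of the tab-separated format described above, here is a minimal sketch of such a generator. It is not the code from this PR: the function name `tabbed_pairs`, the toy `char_encode` encoder, and the assumption that the end-of-sequence id is 1 are all hypothetical choices made for the example.

```python
import io

EOS_ID = 1  # assumption: id reserved for the end-of-sequence token


def tabbed_pairs(lines, encode_source, encode_target, eos_id=EOS_ID):
    """Yield {"inputs", "targets"} dicts from lines of the form source<TAB>target."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip lines that have no source/target separator
        # Split on the first tab only, so the target may itself contain tabs.
        source, target = line.split("\t", 1)
        yield {
            "inputs": encode_source(source.strip()) + [eos_id],
            "targets": encode_target(target.strip()) + [eos_id],
        }


def char_encode(text):
    # Toy character-level encoder (hypothetical): shift code points
    # past the two ids assumed to be reserved (0 and 1).
    return [ord(c) + 2 for c in text]


# Any iterable of lines works; an open text file behaves the same way.
sample = io.StringIO("ab\tcd\nno separator here\nx\ty\n")
examples = list(tabbed_pairs(sample, char_encode, char_encode))
```

Passing the source and target encoders as separate arguments keeps the two vocabularies distinct, even when the same encoder happens to be used for both sides.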

It also adds scaffolding to support parsing of Icelandic text, either from characters or from subword tokens. (More coming later on that front.)

A few smaller fixes are included as well, such as replacing magic constants for reserved token ids with more meaningful identifiers, and getting rid of annoying strings in the decoder's output trace messages.
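To show the kind of change meant by replacing magic constants, here is a small sketch that derives reserved token ids from a single list instead of hard-coding 0 and 1 at each call site. The names below (`PAD`, `EOS`, `RESERVED_TOKENS`, `strip_reserved`) are illustrative assumptions and may not match the identifiers used in this PR.

```python
# Assumed reserved tokens; their ids follow from their position in the list.
PAD = "<pad>"
EOS = "<EOS>"
RESERVED_TOKENS = [PAD, EOS]
NUM_RESERVED_TOKENS = len(RESERVED_TOKENS)
PAD_ID = RESERVED_TOKENS.index(PAD)
EOS_ID = RESERVED_TOKENS.index(EOS)


def strip_reserved(ids):
    """Drop padding ids and cut the sequence at the first end-of-sequence id."""
    out = []
    for token_id in ids:
        if token_id == EOS_ID:
            break  # everything after EOS is ignored
        if token_id != PAD_ID:
            out.append(token_id)
    return out
```

With named identifiers, a comparison such as `token_id == EOS_ID` documents its own intent, and adding another reserved token only requires changing the list.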

lukaszkaiser

Thanks Villi. I'd be happy to merge, but there's one line that needs to go away, I think (as tokenizer is a module now, not a class).

lukaszkaiser · 2017-07-14T01:47:27Z

tensor2tensor/data_generators/generator_utils.py

+    vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
+    return vocab
+
+  tokenizer = Tokenizer()


This line should be removed.

vthorsteinsson · 2017-07-14T14:25:54Z

Yes, sorry, I missed this change in the final merge with upstream. Should be OK now.

lukaszkaiser

Thanks!

vthorsteinsson · 2017-07-14T17:34:07Z

My pleasure! :-)

lukaszkaiser · 2017-07-14T17:43:48Z

I just found one small problem (source_vocab is used instead of target_vocab in wmt.tabbed_generator). I'll correct it in my next PR and release. It'd be great though to add an option to download the Icelandic parsing set instead of assuming it's there in a file. Is the set public?

vthorsteinsson added 12 commits June 30, 2017 11:23

Starting iceparse branch

7b4590f

Merge remote-tracking branch 'upstream/master' into iceparse

1b6ef7c

Added +x on t2t-datagen and t2t-trainer

7b91a3c

Better resiliency of utf-8 conversion

599a3e8

Merge remote-tracking branch 'upstream/master' into iceparse

aca3c0e

Icelandic parsing components added

5052414

Merged with upstream/master

80e8a55

Target string displayed; smaller fixes

30887b8

Iceparser adaptations

9dc2826

Upstream merge

6d4e7b4

Cleanup in text_encoder.py

27c6185

Standardized EOS token

7bf4936

lukaszkaiser suggested changes Jul 14, 2017

Adapted to upstream tokenizer change

5a72e5c

lukaszkaiser approved these changes Jul 14, 2017

lukaszkaiser merged commit 43bfb9f into tensorflow:master Jul 14, 2017

vthorsteinsson deleted the iceparse branch July 16, 2017 16:06
