Wordpiece implementation #670

varisd · 2018-03-12T11:45:08Z

Added wordpiece preprocessor and postprocessor
Added vocabulary.from_t2t_vocabulary for reading t2t-generated vocabularies

(TODOs: See issue #669)
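
To illustrate what reading a t2t-generated vocabulary involves, here is a minimal sketch. It assumes the subword vocabulary file stores one subword per line, wrapped in single or double quotes, which is how I recall tensor2tensor writing them. The function name `read_t2t_subwords` is hypothetical and is not the `from_t2t_vocabulary` added in this PR.

```python
from typing import Iterable, List


def read_t2t_subwords(lines: Iterable[str]) -> List[str]:
    """Collect subwords from lines of a t2t-style vocabulary file.

    Assumption: each line holds one subword, optionally wrapped in
    matching single or double quotes.
    """
    subwords = []
    for line in lines:
        token = line.strip()
        # Drop the surrounding quotes if both ends carry the same quote.
        if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
            token = token[1:-1]
        if token:
            subwords.append(token)
    return subwords


sample = ["'the_'\n", "'ing_'\n", "'a'\n"]
print(read_t2t_subwords(sample))
```

With a real file, the same function can be fed an open file object, since iterating over it yields lines.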

jindrahelcl · 2018-03-12T11:46:24Z

neuralmonkey/processors/wordpiece.py

@@ -0,0 +1,88 @@
+


When you deal with pycodestyle, remove this empty line as well.

jindrahelcl

Notes.

Also, a simple unit test needs to be added.

jindrahelcl · 2018-03-12T11:47:55Z

neuralmonkey/processors/wordpiece.py

+ALNUM_CHAR_SET = set(
+    six.unichr(i) for i in range(sys.maxunicode)
+    if (unicodedata.category(six.unichr(i)).startswith("L") or
+        unicodedata.category(six.unichr(i)).startswith("N")))


Drop six. Six is a Python 2/3 compatibility library; six.unichr() is the same as chr() in Python 3.
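
A sketch of the same set built without six, using only the Python 3 built-in `chr()`. It keeps the original range and the same category test (letters and numbers), just written with a single `category` lookup per code point.

```python
import sys
import unicodedata

# All code points whose Unicode general category is a letter (L*)
# or a number (N*). chr() replaces six.unichr() on Python 3.
ALNUM_CHAR_SET = set(
    chr(i) for i in range(sys.maxunicode)
    if unicodedata.category(chr(i))[0] in ("L", "N"))
```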

jindrahelcl · 2018-03-12T11:52:46Z

neuralmonkey/processors/wordpiece.py

+    """Loose implementation of the t2t SubwordTextTokenizer.
+
+    Paper: TODO?
+    Code: TODO


Resolve these TODOs.

jindrahelcl · 2018-03-12T11:53:37Z

neuralmonkey/processors/wordpiece.py

+
+    def __init__(self,
+                 vocabulary: Vocabulary) -> None:
+        log("Initializing wordpiece preprocessor")


Add check_argument_types from the typeguard module above this line.
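
A sketch of what that could look like. `check_argument_types()` is part of the typeguard 2.x API; as far as I know, later major releases no longer ship it, so the import is guarded with a no-op fallback here. The `Vocabulary` class below is a hypothetical stand-in for the project's own class, and the logging call is left out.

```python
try:
    # typeguard 2.x API; assumption: may be missing in newer releases.
    from typeguard import check_argument_types
except ImportError:
    def check_argument_types() -> bool:
        """No-op stand-in so the sketch runs without typeguard 2.x."""
        return True


class Vocabulary:
    """Hypothetical minimal stand-in for the project's Vocabulary."""


class WordpiecePreprocessor:
    def __init__(self, vocabulary: Vocabulary) -> None:
        # Validate the call arguments against the annotations first.
        check_argument_types()
        self.vocabulary = vocabulary
```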

jindrahelcl · 2018-03-12T11:53:56Z

neuralmonkey/processors/wordpiece.py

+    def __init__(self,
+                 vocabulary: Vocabulary) -> None:
+        log("Initializing wordpiece preprocessor")
+


I would remove this empty line.

jindrahelcl · 2018-03-12T11:54:17Z

neuralmonkey/processors/wordpiece.py

+        """See the code.
+
+        TODO
+        """


Resolve the TODO: fix the docstring.

jindrahelcl · 2018-03-12T11:55:19Z

neuralmonkey/processors/wordpiece.py

+            tokens_str += sent[i]
+
+        # Mark the end of each token
+        # TODO: escape the characters properly


Add the issue number,

e.g. TODO(#669)

jindrahelcl · 2018-03-12T11:56:45Z

neuralmonkey/processors/wordpiece.py

+    """
+
+    def __init__(self,
+                 vocabulary: Vocabulary) -> None:


Can't this fit on one line?

jindrahelcl · 2018-03-12T11:57:43Z

neuralmonkey/processors/wordpiece.py

+                        break
+                else:
+                    raise ValueError(
+                        "Subword '{}' (from {}) is not in the vocabulary"


s/ {}/ '{}'/
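
For context, a sketch of the kind of loop this message belongs to: a greedy longest-match split of a token into subwords, where the `else` branch of the `for` loop fires when no prefix is in the vocabulary. This is an illustration under my own assumptions, not the PR's code; `split_token` and its parameters are hypothetical names. The message quotes both placeholders, as the substitution above suggests.

```python
from typing import List, Set


def split_token(token: str, vocab: Set[str]) -> List[str]:
    """Greedily split a token into the longest subwords found in vocab."""
    subwords = []
    start = 0
    while start < len(token):
        # Try the longest candidate first, then shrink it.
        for end in range(len(token), start, -1):
            piece = token[start:end]
            if piece in vocab:
                subwords.append(piece)
                start = end
                break
        else:
            # No prefix of the remainder is in the vocabulary.
            raise ValueError(
                "Subword '{}' (from '{}') is not in the vocabulary"
                .format(token[start:], token))
    return subwords


print(split_token("hello_", {"hel", "lo_", "h"}))
```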

jindrahelcl · 2018-03-12T11:58:14Z

neuralmonkey/processors/wordpiece.py

+
+    # pylint: disable=no-self-use
+    def __init__(self) -> None:
+        pass


This constructor doesn't need to be here at all.

(In that case the no-self-use could be moved right before decode, but don't do that; see below.)

jindrahelcl · 2018-03-12T12:05:18Z

neuralmonkey/processors/wordpiece.py

+    def __call__(self, decoded_sentences: List[List[str]]) -> List[List[str]]:
+        return [self.decode(s) for s in decoded_sentences]
+
+    def decode(self, sentence: List[str]) -> List[str]:


Since the WordpiecePostprocessor is supposed to be a callable, it can probably be a plain function, because it has no internal state. Still, I would keep some syntactic sugar:

    def wordpiece_decode(sentences: List[List[str]]) -> List[List[str]]:
        return [wordpiece_decode_sentence(s) for s in sentences]

    def wordpiece_decode_sentence(sentence: List[str]) -> List[str]:
        # put the body of decode here

    # pylint: disable=invalid-name
    WordpiecePostprocessor = wordpiece_decode

But I admit we don't do it this way anywhere else, so if you don't like it, keep the no-self-use.
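
A runnable sketch of the function-based variant. The decode body is my own simplification, not the PR's: it assumes every word ends with a trailing "_" marker and ignores escaping entirely.

```python
from typing import List


def wordpiece_decode_sentence(sentence: List[str]) -> List[str]:
    """Join subwords and split words on the trailing "_" marker.

    Assumption: each word's last subword ends with "_"; escaped
    characters are not handled in this sketch.
    """
    joined = "".join(sentence)
    return [word for word in joined.split("_") if word]


def wordpiece_decode(sentences: List[List[str]]) -> List[List[str]]:
    return [wordpiece_decode_sentence(s) for s in sentences]


# Alias keeps the class-like name usable where a callable is expected.
# pylint: disable=invalid-name
WordpiecePostprocessor = wordpiece_decode

print(WordpiecePostprocessor([["hel", "lo_", "wor", "ld_"]]))
```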

I added a non-trivial chunk of code, so someone else should take a look at it as well.

varisd

I assume escape_token() and unescape_token() are taken from t2t, right?

Otherwise the rest looks fine.

jindrahelcl · 2018-03-13T20:26:34Z

Yes
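
For readers unfamiliar with the t2t scheme, here is a sketch of the escaping as I recall it from tensor2tensor's text encoder: backslashes and underscores are rewritten, out-of-alphabet characters become `\<codepoint>;`, and a trailing "_" marks the end of the token. This is written from memory as an assumption and is not copied from the PR or from t2t; details may differ.

```python
import re
from typing import Set

# Characters the escaped form itself needs; assumed always in the alphabet.
ESCAPE_CHARS = set("\\_u;0123456789")
_UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);")


def escape_token(token: str, alphabet: Set[str]) -> str:
    """Escape a token so that "_" can safely mark its end."""
    token = token.replace("\\", "\\\\").replace("_", "\\u")
    escaped = [c if c in alphabet and c != "\n" else r"\%d;" % ord(c)
               for c in token]
    return "".join(escaped) + "_"


def unescape_token(escaped_token: str) -> str:
    """Invert escape_token."""
    def replace(match):
        if match.group(1) is None:
            # Either "\u" (an underscore) or "\\" (a backslash).
            return "_" if match.group(0) == "\\u" else "\\"
        try:
            return chr(int(match.group(1)))
        except (ValueError, OverflowError):
            return "\u3013"  # placeholder for an undecodable code point

    if escaped_token.endswith("_"):
        escaped_token = escaped_token[:-1]
    return _UNESCAPE_REGEX.sub(replace, escaped_token)


alphabet = set("fobar") | ESCAPE_CHARS
print(unescape_token(escape_token("foo_bar", alphabet)))
```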

varisd · 2018-03-14T08:46:21Z

Arnošt, ack this :)

jindrahelcl requested changes Mar 12, 2018

jindrahelcl assigned varisd Mar 12, 2018

varisd force-pushed the wordpieces branch 5 times, most recently from eff410b to a32683a Compare March 12, 2018 16:54

jindrahelcl mentioned this pull request Mar 12, 2018

Finish the wordpiece implementation #669

Open

jindrahelcl added the feature label Mar 12, 2018

jindrahelcl previously approved these changes Mar 12, 2018

jindrahelcl requested a review from jlibovicky March 12, 2018 23:45

jindrahelcl force-pushed the wordpieces branch from 95e57b3 to 976ea5a Compare March 13, 2018 00:13

varisd added 2 commits March 13, 2018 11:59

wordpiece processors; t2t vocabulary loading

444e6ac

added unit tests for wordpieces

d4c76ae

varisd commented Mar 13, 2018

varisd and others added 6 commits March 13, 2018 14:49

added unit tests for wordpieces

cbacec8

Refactor for review

ca6b5fd

Add tensor2tensor-based reader

5628b84

refactoring wordpiece preprocessors and tests

2cb5966

nicer wordpiece unit test and dealing with <unk>s

433627f

unit test for t2t reader

86ea5a1

varisd force-pushed the wordpieces branch from 976ea5a to 86ea5a1 Compare March 13, 2018 13:51

jlibovicky approved these changes Mar 14, 2018

jlibovicky merged commit b95b8de into master Mar 14, 2018

jlibovicky deleted the wordpieces branch March 14, 2018 10:06
