-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wordpiece implementation #670
Conversation
@@ -0,0 +1,88 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
až budeš řešit pycodestyle, vyhoď i tuhle prázdnou řádku
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(done)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(dismiss)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poznámky.
Plus je potřeba přidat jednoduchej unit test.
neuralmonkey/processors/wordpiece.py
Outdated
ALNUM_CHAR_SET = set( | ||
six.unichr(i) for i in range(sys.maxunicode) | ||
if (unicodedata.category(six.unichr(i)).startswith("L") or | ||
unicodedata.category(six.unichr(i)).startswith("N"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ten six vyhoď. Six je knihovna kompatibility pythonu 2 a 3. six.unichr()
je to samý co chr()
v pythonu 3
neuralmonkey/processors/wordpiece.py
Outdated
"""Loose implementation of the t2t SubwordTextTokenizer. | ||
|
||
Paper: TODO? | ||
Code: TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tadyty TODOčka pořeš
neuralmonkey/processors/wordpiece.py
Outdated
|
||
def __init__(self, | ||
vocabulary: Vocabulary) -> None: | ||
log("Initializing wordpiece preprocessor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nad tuhle řádku přidat check_argument_types
z modulu typeguard
neuralmonkey/processors/wordpiece.py
Outdated
def __init__(self, | ||
vocabulary: Vocabulary) -> None: | ||
log("Initializing wordpiece preprocessor") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuhle empty line bych vyhodil
neuralmonkey/processors/wordpiece.py
Outdated
"""See the code. | ||
|
||
TODO | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vyřešit todočko: upravit docstring
neuralmonkey/processors/wordpiece.py
Outdated
tokens_str += sent[i] | ||
|
||
# Mark the end of each token | ||
# TODO: escape the characters properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
přidat číslo issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. TODO(#669)
neuralmonkey/processors/wordpiece.py
Outdated
""" | ||
|
||
def __init__(self, | ||
vocabulary: Vocabulary) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tohle nejde dát na jednu řádku?
neuralmonkey/processors/wordpiece.py
Outdated
break | ||
else: | ||
raise ValueError( | ||
"Subword '{}' (from {}) is not in the vocabulary" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ {}/ '{}'/
neuralmonkey/processors/wordpiece.py
Outdated
|
||
# pylint: disable=no-self-use | ||
def __init__(self) -> None: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ten konstruktor tu nemusí vůbec bejt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(a tim pádem to no-self-use
by šlo přesunout až před decode, ale nedělej to - viz dál)
neuralmonkey/processors/wordpiece.py
Outdated
def __call__(self, decoded_sentences: List[List[str]]) -> List[List[str]]: | ||
return [self.decode(s) for s in decoded_sentences] | ||
|
||
def decode(self, sentence: List[str]) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jelikož ten WordpiecePostprocessor
má bejt callable, může to bejt asi rovnou funkce, protože to nemá vnitřní stav. Nicméně nechal bych tam nějakej syntactic sugar:
def wordpiece_decode(sentences: List[List[str]]) -> List[List[str]]:
return [wordpiece_decode_sentence(s) for s in sentences]
def wordpiece_decode_sentence(sentence: List[str]) -> List[str]:
# tady dej body toho decode
# pylint: disable=invalid-name
WordpiecePostprocessor = wordpiece_decode
ale uznávám že takhle to nikde jinde neni a jestli se ti to nelíbí, tak tam nech ten no-self-use
eff410b
to
a32683a
Compare
Přidělal jsem netriviální kus kódu, tak by to chtělo, aby se na to mrknul ještě někdo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Predpokladam, ze escape_token() a unescape_token() jsou prevzate z t2t, ze?
Jinak zbytek vypada v poradku
Jo |
Arnoste, ackni to :) |
(TODOs: See issue #669)