
Wordpiece implementation #670

Merged
merged 8 commits into from
Mar 14, 2018

Conversation


@varisd varisd commented Mar 12, 2018

  • Added wordpiece preprocessor and postprocessor
  • Added vocabulary.from_t2t_vocabulary for reading t2t-generated vocabularies
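
For orientation, a minimal sketch of what reading a t2t-generated subword vocabulary can look like. It assumes the usual t2t file layout of one subtoken per line wrapped in quotes; the helper name `read_t2t_subtokens` is hypothetical and is not the `vocabulary.from_t2t_vocabulary` added in this PR.

```python
from typing import Iterable, List


def read_t2t_subtokens(lines: Iterable[str]) -> List[str]:
    """Strip the surrounding quotes from each line of a t2t vocabulary file.

    Assumption: each line holds exactly one subtoken, wrapped in single or
    double quotes, e.g. 'happy_'.
    """
    subtokens = []
    for line in lines:
        entry = line.rstrip("\n")
        is_single_quoted = entry.startswith("'") and entry.endswith("'")
        is_double_quoted = entry.startswith('"') and entry.endswith('"')
        if len(entry) >= 2 and (is_single_quoted or is_double_quoted):
            entry = entry[1:-1]
        subtokens.append(entry)
    return subtokens
```

With a real file, the call would be along the lines of `read_t2t_subtokens(open(path, encoding="utf-8"))`, where `path` is whatever vocabulary file t2t produced.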

(TODOs: See issue #669)

@@ -0,0 +1,88 @@


When you fix the pycodestyle issues, remove this empty line too.


(done)


(dismiss)


@jindrahelcl jindrahelcl left a comment


Notes.

Also, a simple unit test needs to be added.

ALNUM_CHAR_SET = set(
six.unichr(i) for i in range(sys.maxunicode)
if (unicodedata.category(six.unichr(i)).startswith("L") or
unicodedata.category(six.unichr(i)).startswith("N")))

Drop six. Six is a Python 2/3 compatibility library; six.unichr() is the same as chr() in Python 3.
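
A sketch of how the same set could be built without six, assuming Python 3 only. The helper `build_alnum_char_set` and its `limit` parameter are illustrative additions, not part of the PR; the filter itself mirrors the snippet above.

```python
import sys
import unicodedata
from typing import Set


def build_alnum_char_set(limit: int = sys.maxunicode) -> Set[str]:
    """Collect characters whose Unicode category is a letter (L*) or number (N*)."""
    return {
        chr(i) for i in range(limit)
        # str.startswith accepts a tuple of prefixes.
        if unicodedata.category(chr(i)).startswith(("L", "N"))
    }


# Same range as in the reviewed snippet: range(sys.maxunicode).
ALNUM_CHAR_SET = build_alnum_char_set()
```

In Python 3, `chr()` covers the whole Unicode range, so no compatibility shim is needed.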

"""Loose implementation of the t2t SubwordTextTokenizer.

Paper: TODO?
Code: TODO

Resolve these TODOs.


def __init__(self,
vocabulary: Vocabulary) -> None:
log("Initializing wordpiece preprocessor")

Add check_argument_types from the typeguard module above this line.

def __init__(self,
vocabulary: Vocabulary) -> None:
log("Initializing wordpiece preprocessor")


I would remove this empty line.

"""See the code.

TODO
"""

Resolve the TODO: update the docstring.

tokens_str += sent[i]

# Mark the end of each token
# TODO: escape the characters properly

Add the issue number.


e.g. TODO(#669)

"""

def __init__(self,
vocabulary: Vocabulary) -> None:

Can't this fit on one line?

break
else:
raise ValueError(
"Subword '{}' (from {}) is not in the vocabulary"

s/ {}/ '{}'/
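
To show where such an error message sits, here is a rough sketch of a greedy longest-match subword split that uses the same `for`/`else` shape as the reviewed hunk, with both placeholders quoted as suggested. The function name `split_into_subwords` and the use of a plain `set` as the vocabulary are assumptions for illustration; the actual preprocessor in the PR may be structured differently.

```python
from typing import List, Set


def split_into_subwords(token: str, vocabulary: Set[str]) -> List[str]:
    """Greedily split `token` into the longest subwords found in `vocabulary`."""
    subwords = []
    start = 0
    while start < len(token):
        # Try the longest candidate first, then shrink it from the right.
        for end in range(len(token), start, -1):
            subword = token[start:end]
            if subword in vocabulary:
                subwords.append(subword)
                start = end
                break
        else:
            # The loop ended without a break: not even the single
            # character at `start` is in the vocabulary.
            raise ValueError(
                "Subword '{}' (from '{}') is not in the vocabulary"
                .format(subword, token))
    return subwords
```

Quoting both placeholders makes empty strings and leading or trailing whitespace visible in the message.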


# pylint: disable=no-self-use
def __init__(self) -> None:
pass

This constructor doesn't need to be here at all.


(and in that case the no-self-use could be moved down to just before decode, but don't do that; see below)

def __call__(self, decoded_sentences: List[List[str]]) -> List[List[str]]:
return [self.decode(s) for s in decoded_sentences]

def decode(self, sentence: List[str]) -> List[str]:

Since WordpiecePostprocessor is supposed to be callable, it could probably just be a function, because it has no internal state. Still, I would keep some syntactic sugar:

def wordpiece_decode(sentences: List[List[str]]) -> List[List[str]]:
    return [wordpiece_decode_sentence(s) for s in sentences]

def wordpiece_decode_sentence(sentence: List[str]) -> List[str]:
    # put the body of decode here

# pylint: disable=invalid-name
WordpiecePostprocessor = wordpiece_decode

but I admit this isn't done anywhere else, and if you don't like it, just keep the no-self-use there
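
A runnable sketch of the function-plus-alias pattern suggested above. The body of `wordpiece_decode_sentence` is an assumption: it treats a trailing "_" as the end-of-token marker and ignores escaping entirely, which is simpler than what the PR needs.

```python
from typing import List


def wordpiece_decode_sentence(sentence: List[str]) -> List[str]:
    """Merge wordpieces back into tokens.

    Assumption: every original token ends with a "_" marker, so joining
    the pieces and splitting on "_" recovers the tokens.
    """
    joined = "".join(sentence)
    return [token for token in joined.split("_") if token]


def wordpiece_decode(sentences: List[List[str]]) -> List[List[str]]:
    """Decode a batch of sentences, one list of wordpieces per sentence."""
    return [wordpiece_decode_sentence(s) for s in sentences]


# The alias keeps the class-like name that callers expect.
# pylint: disable=invalid-name
WordpiecePostprocessor = wordpiece_decode
```

Because the alias is a plain function, it is used directly as `WordpiecePostprocessor(sentences)` rather than being instantiated first, which is the trade-off the comment above concedes.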

jindrahelcl previously approved these changes Mar 12, 2018
@jindrahelcl jindrahelcl dismissed their stale review March 12, 2018 23:46

I added a non-trivial piece of code, so someone else should take a look at it as well.


@varisd varisd left a comment


I assume escape_token() and unescape_token() are taken from t2t, right?

Otherwise the rest looks fine.

@jindrahelcl

Yes.
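
For readers unfamiliar with the t2t scheme, a sketch of the escaping idea, written from memory of the tensor2tensor helpers and not copied from this PR: backslashes and underscores get two-character escapes, characters outside the alphabet become numeric escapes, and a trailing "_" marks the end of the token. The constant names and the choice to always allow the escape characters are assumptions.

```python
import re
from typing import Set

# Characters the escape sequences themselves are built from; they must
# never be turned into numeric escapes, or unescaping would break.
ESCAPE_CHARS = set("\\_u;0123456789")
UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);")


def escape_token(token: str, alphabet: Set[str]) -> str:
    """Escape `token` so that "_" can serve as an end-of-token marker."""
    allowed = alphabet | ESCAPE_CHARS
    token = token.replace("\\", "\\\\").replace("_", "\\u")
    chars = [
        c if c in allowed and c != "\n" else "\\{};".format(ord(c))
        for c in token
    ]
    return "".join(chars) + "_"


def unescape_token(escaped: str) -> str:
    """Invert escape_token."""
    def replace(match):
        if match.group(1) is None:
            return "_" if match.group(0) == "\\u" else "\\"
        try:
            return chr(int(match.group(1)))
        except (ValueError, OverflowError):
            return "\u3013"  # placeholder for an undecodable code point

    # Drop the end-of-token marker before decoding the escapes.
    trimmed = escaped[:-1] if escaped.endswith("_") else escaped
    return UNESCAPE_REGEX.sub(replace, trimmed)
```

The two functions are meant to round-trip: `unescape_token(escape_token(t, alphabet))` gives back `t` for any alphabet.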


varisd commented Mar 14, 2018

Arnošt, please ack this :)

@jlibovicky jlibovicky merged commit b95b8de into master Mar 14, 2018
@jlibovicky jlibovicky deleted the wordpieces branch March 14, 2018 10:06