-
Notifications
You must be signed in to change notification settings - Fork 872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
manual MWT control in tokenization #1302
Conversation
27471c0
to
d1e2cfe
Compare
stanza/models/common/doc.py
Outdated
ae = (multi_word_expanded_token_misc.match(token.misc) | ||
if token.misc is not None else None) | ||
|
||
perform_mwt_processing = "flatten" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would normally prefer an enum here instead of string constants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good; addressed by d13e5d4
stanza/models/common/doc.py
Outdated
for i in token.words: | ||
i.id += idx_e | ||
idx_w = token.id[-1] | ||
token.misc = None if token.misc == 'MEXP=Yes' else '|'.join([x for x in token.misc.split('|') if x != 'MEXP=Yes']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the basic idea I had, but I was wondering if there should instead be an attribute on the Token
object instead of a component of the misc field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the benefits/disadvantages of choosing one/another? I wanted to use the misc field because then whomever is observing it can also see the difference relating to MWT/MEXP in the same place, and I didn't want to make an entire attribute just for the very specific case of "user wanted this MWT to be split in a particular way". I would love to get more clarity on why MWT was marked there. Apologies for my confusion here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess I don't have a strong motivation. Was just thinking that it's most likely a temporary annotation & could be forgotten about after the processing. Even in the case of something we want to keep around, I think it would be easier for the other coding tasks if it's kept in a separate field and then added to the misc output when needed. The ner
field is like that, for example
stanza/models/tokenization/utils.py
Outdated
assert token_lens == mwt_lens, "Postprocessor returned token and MWT lists of different length! Token list lengths %s, MWT list lengths %s" % (token_lens, mwt_lens) | ||
|
||
corrected_expansions.append(sent_expansions) | ||
|
||
# recassemble document. offsets and oov shouldn't change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo (reassemble)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Fat fingers. Fixed in bc0e41c
# check postprocessor output | ||
token_lens = [len(i) for i in corrected_words] | ||
mwt_lens = [len(i) for i in corrected_mwts] | ||
assert token_lens == mwt_lens, "Postprocessor returned token and MWT lists of different length! Token list lengths %s, MWT list lengths %s" % (token_lens, mwt_lens) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep the error checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does one extra loop through the input, but the user can never actually supply MWT/token lists of different lengths as they are supplying them in one zipped array:
[['Le', 'prince', 'va', 'manger', ('du', True), 'poulet', ('aux', True), 'les',
'magasins', ("aujourd'hui", ["aujourd'", "hui"]), '.']]
I added this check when before the user was passing in two lists with the postprocessor before. This is no longer the case.
However, I added it back just in case something with these lists goes wrong again: bc0e41c
stanza/models/tokenization/utils.py
Outdated
@@ -392,6 +398,8 @@ def reassemble_doc_from_tokens(tokens, mwts, raw_text): | |||
mwts : List[List[bool]] | |||
Whether or not each of the tokens are MWTs to be analyzed by | |||
the MWT raw. | |||
mwts : List[List[List[str}]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed by bc0e41c
stanza/pipeline/mwt_processor.py
Outdated
@@ -3,6 +3,7 @@ | |||
""" | |||
|
|||
import io | |||
from stanza.resources.common import process_pipeline_parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry about that! addressed by bc0e41c
|
||
text = "Joe Smith lives in California." | ||
|
||
with pytest.raises(ValueError): | ||
utils.reassemble_doc_from_tokens(bad_addition_tokenization, bad_addition_mwts, text) | ||
utils.reassemble_doc_from_tokens(bad_addition_tokenization, bad_addition_mwts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one line would be fine here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed by bc0e41c
mostly happy except for the misc field comments above i do like having stuff which is functional in separate fields so they can be manipulated easier without needing to go through the misc annotation each time |
got it, I will implement this shortly into the weekend and see how it goes |
Done; ready for your review. There's a chance that adding a whole extra field to |
stanza/models/common/doc.py
Outdated
processing for tokens marked manually expanded: | ||
|
||
process_manual_expanded = None - default; doesn't process manually expanded tokens | ||
= True - process only manually expanded tokens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you clarify the distinction between True and False a bit? maybe an example or two would help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, addressed by 556a053
stanza/models/common/doc.py
Outdated
idx_e = 0 | ||
for sentence in self.sentences: | ||
idx_w = 0 | ||
for token in sentence.tokens: | ||
idx_w += 1 | ||
m = (len(token.id) > 1) | ||
n = multi_word_token_misc.match(token.misc) if token.misc is not None else None | ||
if not m and not n: | ||
n = (multi_word_token_misc.match(token.misc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this particular line break is not changing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My emacs config was too aggressive in complaining about long lines; addressed by 556a053. Apologies!
stanza/models/common/doc.py
Outdated
""" | ||
|
||
idx_e = 0 | ||
for sentence in self.sentences: | ||
idx_w = 0 | ||
for token in sentence.tokens: | ||
idx_w += 1 | ||
m = (len(token.id) > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think at this point m, n, ae, etc need real variable names instead. i'm getting a little lost
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies about that! addressed by 556a053
stanza/models/common/doc.py
Outdated
""" | ||
expansions = [] | ||
for sentence in self.sentences: | ||
for token in sentence.tokens: | ||
m = (len(token.id) > 1) | ||
n = multi_word_token_misc.match(token.misc) if token.misc is not None else None | ||
if m or n: | ||
ae = token.manual_expansion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, actual variable names which give a bit more hint as to what is going on with (m and not ae) or n
would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed by 556a053
else: | ||
sent_words.append(word[0]) | ||
sent_mwts.append(True) | ||
sent_expansions.append(" ".join(word[1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question about this, typically MWT are literally something that has been split out of one token, so "".join()
would be more appropriate. however, happy to have my understanding corrected... preferably via comment instead of github reply!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed by 556a053.
@@ -102,7 +102,7 @@ def test_postprocessor_application(): | |||
good_tokenization = [['I', 'am', 'Joe.', '⭆⊱⇞', 'Hi', '.'], ["I'm", 'a', 'chicken', '.']] | |||
text = "I am Joe. ⭆⊱⇞ Hi. I'm a chicken." | |||
|
|||
target_doc = [[{'id': (1,), 'text': 'I', 'start_char': 0, 'end_char': 1}, {'id': (2,), 'text': 'am', 'start_char': 2, 'end_char': 4}, {'id': (3,), 'text': 'Joe.', 'start_char': 5, 'end_char': 9}, {'id': (4,), 'text': '⭆⊱⇞', 'start_char': 10, 'end_char': 13}, {'id': (5,), 'text': 'Hi', 'start_char': 14, 'end_char': 16}, {'id': (6,), 'text': '.', 'start_char': 16, 'end_char': 17}], [{'id': (1,), 'text': "I'm", 'start_char': 18, 'end_char': 21}, {'id': (2,), 'text': 'a', 'start_char': 22, 'end_char': 23}, {'id': (3,), 'text': 'chicken', 'start_char': 24, 'end_char': 31}, {'id': (4,), 'text': '.', 'start_char': 31, 'end_char': 32}]] | |||
target_doc = [[{'id': 1, 'text': 'I', 'start_char': 0, 'end_char': 1}, {'id': 2, 'text': 'am', 'start_char': 2, 'end_char': 4}, {'id': 3, 'text': 'Joe.', 'start_char': 5, 'end_char': 9}, {'id': 4, 'text': '⭆⊱⇞', 'start_char': 10, 'end_char': 13}, {'id': 5, 'text': 'Hi', 'start_char': 14, 'end_char': 16}, {'id': 6, 'text': '.', 'start_char': 16, 'end_char': 17}], [{'id': 1, 'text': "I'm", 'start_char': 18, 'end_char': 21}, {'id': 2, 'text': 'a', 'start_char': 22, 'end_char': 23}, {'id': 3, 'text': 'chicken', 'start_char': 24, 'end_char': 31}, {'id': 4, 'text': '.', 'start_char': 31, 'end_char': 32}]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I must be missing some detail, but why has this changed to now return ints instead of tuples? I believe there may be some downstream code which expects the token IDs to be a tuple of length 1 for non-MWT words
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the system should still handle such cases correctly with multiple IDs, but it seems to default to single ID by default as I simply call doc.to_dict()
.
stanza/stanza/models/common/doc.py
Lines 1038 to 1042 in 7b58099
self._id = word_entry.get(ID, None) | |
if isinstance(self._id, tuple): | |
if len(self._id) == 1: | |
self._id = self._id[0] | |
self._text = word_entry.get(TEXT, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what's happened. The original was manually creating the id
as a tuple. However, now it is creating a Document
in reassemble_doc_from_tokens
and then calling to_dict
on that.
In that case, I withdraw my objection! However, I think what we do need is a test that it is doing the expected thing for MWT. Would you say that test_tokenizer.py::test_postprocessor_mwt
should cover that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this does. The intermediate output from the spot of concern from that test was
[[{'id': 1, 'text': 'Le', 'start_char': 0, 'end_char': 2}, {'id': 2, 'text': 'prince', 'start_char': 3, 'end_char': 9}, {'id': 3, 'text': 'va', 'start_char': 10, 'end_char': 12}, {'id': 4, 'text': 'manger', 'start_char': 13, 'end_char': 19}, {'id': 5, 'text': 'du', 'misc': 'MWT=Yes', 'start_char': 20, 'end_char': 22}, {'id': 6, 'text': 'poulet', 'start_char': 23, 'end_char': 29}, {'id': 7, 'text': 'aux', 'misc': 'MWT=Yes', 'start_char': 30, 'end_char': 33}, {'id': 8, 'text': 'les', 'start_char': 34, 'end_char': 37}, {'id': 9, 'text': 'magasins', 'start_char': 38, 'end_char': 46}, {'id': (10, 11), 'text': "aujourd'hui", 'start_char': 47, 'end_char': 58, 'manual_expansion': True}, {'id': 10, 'text': "aujourd'"}, {'id': 11, 'text': 'hui'}, {'id': 12, 'text': '.', 'start_char': 58, 'end_char': 59}]]
which does contain "aujourd'hui" which is a tuple {'id': (10, 11), 'text': "aujourd'hui",
So I'd say this is good
Double check the test case, then I can squash it down & merge I think |
fb4b555
to
65a12e5
Compare
uses an enum to signify the manual MWT processing manual MWT information is in a field instead of the misc field default of the manual expansion is `None`
Description
This is a followup to #1290 which allows manual control of MWT splitting via the
tokenize_postprocessor
. We implement this via a new attributeMEXP=Yes
to denote pre-deliniated MWT that the MWT processor shouldn't touch.For instance:
whereby, the single argument passed to
tokenize_postprocessor
is a list of lists containing string sentence and word tokenizationsWith this postprocessor return,
du
,aux
are requested to be split via the traditional MWT splitter, whereas a user-defined split is provided foraujourd'hui
.Unit Test Coverage
test_postprocessor_mwt
is created to check for this functionality, andpipeline/test_tokenizer.py
contains updates to support MWT override.