
manual MWT control in tokenization #1302

Merged
merged 1 commit into from Nov 29, 2023

Conversation

@Jemoka (Member) commented Oct 24, 2023:

Description

This is a follow-up to #1290 which allows manual control of MWT splitting via the tokenize_postprocessor. We implement this via a new misc attribute MEXP=Yes to denote pre-delineated MWTs that the MWT processor shouldn't touch.

For instance:

nlp = stanza.Pipeline(lang="fr", processors="tokenize",
                                    tokenize_postprocessor=lambda draft: do_stuff_to_draft(draft))

The single argument passed to tokenize_postprocessor is a list of sentences, where each sentence is a list of tokens; a token is a plain string, a (string, True) tuple requesting MWT splitting, or a (string, [parts]) tuple supplying a manual split:

[['Le', 'prince', 'va', 'manger', ('du', True), 'poulet', ('aux', True), 'les', 
   'magasins', ("aujourd'hui", ["aujourd'", "hui"]), '.']]

With this postprocessor return value, du and aux are requested to be split via the traditional MWT splitter, whereas a user-defined split is provided for aujourd'hui.
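The draft-transforming behavior described above can be sketched as a stand-alone function. This is a minimal illustration, not the actual implementation; MANUAL_SPLITS and MWT_CANDIDATES are hypothetical lookup tables introduced only for this example.

```python
# A minimal postprocessor sketch, assuming the draft format described above:
# each token is a plain string, a (text, True) tuple asking the MWT model to
# split it, or a (text, [parts]) tuple supplying a manual split.
MANUAL_SPLITS = {"aujourd'hui": ["aujourd'", "hui"]}
MWT_CANDIDATES = {"du", "aux"}

def do_stuff_to_draft(draft):
    out = []
    for sentence in draft:
        new_sentence = []
        for token in sentence:
            text = token if isinstance(token, str) else token[0]
            if text in MANUAL_SPLITS:
                # user-defined split: the MWT processor will leave it alone
                new_sentence.append((text, MANUAL_SPLITS[text]))
            elif text in MWT_CANDIDATES:
                # ask the traditional MWT splitter to handle this token
                new_sentence.append((text, True))
            else:
                new_sentence.append(text)
        out.append(new_sentence)
    return out
```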

Unit Test Coverage

test_postprocessor_mwt is created to check for this functionality, and pipeline/test_tokenizer.py contains updates to support MWT override.

ae = (multi_word_expanded_token_misc.match(token.misc)
if token.misc is not None else None)

perform_mwt_processing = "flatten"
Collaborator:

I would normally prefer an enum here instead of string constants

Member Author:

sounds good; addressed by d13e5d4
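The suggestion above (an enum instead of string constants) can be sketched as follows; the class and member names here are assumptions for illustration, not the identifiers actually adopted in d13e5d4.

```python
from enum import Enum

# Hedged sketch of replacing the "flatten" string constant with an enum.
# Dispatching on enum members instead of raw strings catches typos at
# attribute-lookup time rather than silently failing a string comparison.
class MWTProcessing(Enum):
    DEFAULT = "default"
    FLATTEN = "flatten"

perform_mwt_processing = MWTProcessing.FLATTEN
```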

for i in token.words:
i.id += idx_e
idx_w = token.id[-1]
token.misc = None if token.misc == 'MEXP=Yes' else '|'.join([x for x in token.misc.split('|') if x != 'MEXP=Yes'])
Collaborator:

This is the basic idea I had, but I was wondering if there should be an attribute on the Token object instead of a component of the misc field

Member Author:

What would be the benefits and disadvantages of choosing one over the other? I wanted to use the misc field because then whoever is observing it can also see the difference relating to MWT/MEXP in the same place, and I didn't want to make an entire attribute just for the very specific case of "user wanted this MWT to be split in a particular way". I would love to get more clarity on why MWT was marked there. Apologies for my confusion here.

Collaborator:

Yeah, I guess I don't have a strong motivation. I was just thinking that it's most likely a temporary annotation and could be forgotten about after processing. Even in the case of something we want to keep around, I think it would be easier for other coding tasks if it's kept in a separate field and then added to the misc output when needed. The ner field is like that, for example.
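The misc-handling line quoted earlier in this thread can be illustrated in isolation: the temporary MEXP=Yes marker is removed while any other misc features (e.g. MWT=Yes) survive. strip_mexp is a hypothetical helper name, not part of the stanza codebase.

```python
# Stand-alone sketch of the MEXP=Yes cleanup logic from the reviewed diff.
def strip_mexp(misc):
    if misc == 'MEXP=Yes':
        # the marker was the only feature, so the field becomes empty
        return None
    # otherwise drop just the marker, keeping the remaining features
    return '|'.join(x for x in misc.split('|') if x != 'MEXP=Yes')
```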

assert token_lens == mwt_lens, "Postprocessor returned token and MWT lists of different length! Token list lengths %s, MWT list lengths %s" % (token_lens, mwt_lens)

corrected_expansions.append(sent_expansions)

# recassemble document. offsets and oov shouldn't change
Collaborator:

typo (reassemble)

Member Author:

Ack. Fat fingers. Fixed in bc0e41c

# check postprocessor output
token_lens = [len(i) for i in corrected_words]
mwt_lens = [len(i) for i in corrected_mwts]
assert token_lens == mwt_lens, "Postprocessor returned token and MWT lists of different length! Token list lengths %s, MWT list lengths %s" % (token_lens, mwt_lens)
Collaborator:

can we keep the error checking?

Member Author:

This does one extra loop through the input, but the user can never actually supply MWT/token lists of different lengths, since they supply them in one zipped array:

[['Le', 'prince', 'va', 'manger', ('du', True), 'poulet', ('aux', True), 'les', 
   'magasins', ("aujourd'hui", ["aujourd'", "hui"]), '.']]

I originally added this check when the user was passing in two separate lists to the postprocessor; that is no longer the case.

However, I added it back just in case something with these lists goes wrong again: bc0e41c

@@ -392,6 +398,8 @@ def reassemble_doc_from_tokens(tokens, mwts, raw_text):
mwts : List[List[bool]]
Whether or not each of the tokens are MWTs to be analyzed by
the MWT raw.
mwts : List[List[List[str}]]
Collaborator:

typo

Member Author:

addressed by bc0e41c

@@ -3,6 +3,7 @@
"""

import io
from stanza.resources.common import process_pipeline_parameters
Collaborator:

not needed?

Member Author:

sorry about that! addressed by bc0e41c


text = "Joe Smith lives in California."

with pytest.raises(ValueError):
utils.reassemble_doc_from_tokens(bad_addition_tokenization, bad_addition_mwts, text)
utils.reassemble_doc_from_tokens(bad_addition_tokenization, bad_addition_mwts,
Collaborator:

one line would be fine here

Member Author:

addressed by bc0e41c

@AngledLuffa (Collaborator):

mostly happy except for the misc field comments above

i do like having stuff which is functional in separate fields so it can be manipulated more easily without needing to go through the misc annotation each time

@Jemoka (Member Author) commented Nov 17, 2023:

got it, I will implement this shortly into the weekend and see how it goes

@Jemoka (Member Author) commented Nov 19, 2023:

Done; ready for your review. There's a chance that adding a whole extra field to the Token and Word types causes some unit tests to fail if they check the to_dict() output. However, that shouldn't be the case where the documents' string reprs themselves are checked, because I didn't touch those.

processing for tokens marked manually expanded:

process_manual_expanded = None - default; doesn't process manually expanded tokens
= True - process only manually expanded tokens
Collaborator:

would you clarify the distinction between True and False a bit? maybe an example or two would help

Member Author:

Done, addressed by 556a053
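The None/True semantics quoted in the docstring fragment above can be sketched as a predicate. should_process is a hypothetical helper introduced for illustration; the False branch is not shown in the quoted docstring, so only the two documented values are modeled here.

```python
def should_process(is_manually_expanded, process_manual_expanded=None):
    # None (the default): skip tokens the user already expanded manually
    if process_manual_expanded is None:
        return not is_manually_expanded
    # True: process only the manually expanded tokens
    return is_manually_expanded
```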

idx_e = 0
for sentence in self.sentences:
idx_w = 0
for token in sentence.tokens:
idx_w += 1
m = (len(token.id) > 1)
n = multi_word_token_misc.match(token.misc) if token.misc is not None else None
if not m and not n:
n = (multi_word_token_misc.match(token.misc)
Collaborator:

this particular line break is not changing anything?

Member Author:

My emacs config was too aggressive in complaining about long lines; addressed by 556a053. Apologies!

"""

idx_e = 0
for sentence in self.sentences:
idx_w = 0
for token in sentence.tokens:
idx_w += 1
m = (len(token.id) > 1)
Collaborator:

i think at this point m, n, ae, etc need real variable names instead. i'm getting a little lost

Member Author:

apologies about that! addressed by 556a053
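One possible shape of the renaming the reviewer asked for, as a self-contained sketch; the regex and the function/variable names here are assumptions for illustration, not the identifiers actually adopted in 556a053.

```python
import re

# A token is a multi-word token if its id spans multiple word ids (was: m)
# or if its misc field carries the MWT=Yes marker (was: n).
multi_word_token_misc = re.compile(r"\bMWT=Yes\b")

def is_multi_word_token(token_id, misc):
    spans_multiple_ids = len(token_id) > 1
    marked_in_misc = (misc is not None
                      and multi_word_token_misc.search(misc) is not None)
    return spans_multiple_ids or marked_in_misc
```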

"""
expansions = []
for sentence in self.sentences:
for token in sentence.tokens:
m = (len(token.id) > 1)
n = multi_word_token_misc.match(token.misc) if token.misc is not None else None
if m or n:
ae = token.manual_expansion
Collaborator:

same here, actual variable names which give a bit more hint as to what is going on with (m and not ae) or n would be helpful

Member Author:

addressed by 556a053

else:
sent_words.append(word[0])
sent_mwts.append(True)
sent_expansions.append(" ".join(word[1]))
Collaborator:

question about this, typically MWT are literally something that has been split out of one token, so "".join() would be more appropriate. however, happy to have my understanding corrected... preferably via comment instead of github reply!

Member Author:

addressed by 556a053.
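The reviewer's point about the joiner, in miniature: an MWT's surface form is its parts concatenated with no separator, so "".join reconstructs the original token while " ".join inserts a spurious space.

```python
parts = ["aujourd'", "hui"]
with_space = " ".join(parts)     # inserts a space the original text never had
without_space = "".join(parts)   # reconstructs the original surface form
```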

@@ -102,7 +102,7 @@ def test_postprocessor_application():
good_tokenization = [['I', 'am', 'Joe.', '⭆⊱⇞', 'Hi', '.'], ["I'm", 'a', 'chicken', '.']]
text = "I am Joe. ⭆⊱⇞ Hi. I'm a chicken."

target_doc = [[{'id': (1,), 'text': 'I', 'start_char': 0, 'end_char': 1}, {'id': (2,), 'text': 'am', 'start_char': 2, 'end_char': 4}, {'id': (3,), 'text': 'Joe.', 'start_char': 5, 'end_char': 9}, {'id': (4,), 'text': '⭆⊱⇞', 'start_char': 10, 'end_char': 13}, {'id': (5,), 'text': 'Hi', 'start_char': 14, 'end_char': 16}, {'id': (6,), 'text': '.', 'start_char': 16, 'end_char': 17}], [{'id': (1,), 'text': "I'm", 'start_char': 18, 'end_char': 21}, {'id': (2,), 'text': 'a', 'start_char': 22, 'end_char': 23}, {'id': (3,), 'text': 'chicken', 'start_char': 24, 'end_char': 31}, {'id': (4,), 'text': '.', 'start_char': 31, 'end_char': 32}]]
target_doc = [[{'id': 1, 'text': 'I', 'start_char': 0, 'end_char': 1}, {'id': 2, 'text': 'am', 'start_char': 2, 'end_char': 4}, {'id': 3, 'text': 'Joe.', 'start_char': 5, 'end_char': 9}, {'id': 4, 'text': '⭆⊱⇞', 'start_char': 10, 'end_char': 13}, {'id': 5, 'text': 'Hi', 'start_char': 14, 'end_char': 16}, {'id': 6, 'text': '.', 'start_char': 16, 'end_char': 17}], [{'id': 1, 'text': "I'm", 'start_char': 18, 'end_char': 21}, {'id': 2, 'text': 'a', 'start_char': 22, 'end_char': 23}, {'id': 3, 'text': 'chicken', 'start_char': 24, 'end_char': 31}, {'id': 4, 'text': '.', 'start_char': 31, 'end_char': 32}]]
Collaborator:

Again I must be missing some detail, but why has this changed to now return ints instead of tuples? I believe there may be some downstream code which expects the token IDs to be a tuple of length 1 for non-MWT words

Member Author:

the system should still handle cases with multiple IDs correctly, but it defaults to a single ID because I simply call doc.to_dict().

self._id = word_entry.get(ID, None)
if isinstance(self._id, tuple):
if len(self._id) == 1:
self._id = self._id[0]
self._text = word_entry.get(TEXT, None)
is the exact spot where the system defaults to single numbers instead of tuples if there's only one element; let me know if you want me to change this behavior.
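The normalization in the snippet above, extracted into a minimal illustration: a single-element tuple id collapses to a plain int, while multi-element (MWT) ids stay tuples. normalize_id is a hypothetical helper name, not part of the stanza API.

```python
def normalize_id(token_id):
    if isinstance(token_id, tuple) and len(token_id) == 1:
        return token_id[0]   # (3,) -> 3
    return token_id          # (10, 11) and plain ints pass through
```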

Collaborator:

Ah, I see what's happened. The original was manually creating the id as a tuple. However, now it is creating a Document in reassemble_doc_from_tokens and then calling to_dict on that.

In that case, I withdraw my objection! However, I think what we do need is a test that it is doing the expected thing for MWT. Would you say that test_tokenizer.py::test_postprocessor_mwt should cover that?

Member Author (Nov 29, 2023):

Yes, it does. The intermediate output at the spot of concern in that test was:

[[{'id': 1, 'text': 'Le', 'start_char': 0, 'end_char': 2}, {'id': 2, 'text': 'prince', 'start_char': 3, 'end_char': 9}, {'id': 3, 'text': 'va', 'start_char': 10, 'end_char': 12}, {'id': 4, 'text': 'manger', 'start_char': 13, 'end_char': 19}, {'id': 5, 'text': 'du', 'misc': 'MWT=Yes', 'start_char': 20, 'end_char': 22}, {'id': 6, 'text': 'poulet', 'start_char': 23, 'end_char': 29}, {'id': 7, 'text': 'aux', 'misc': 'MWT=Yes', 'start_char': 30, 'end_char': 33}, {'id': 8, 'text': 'les', 'start_char': 34, 'end_char': 37}, {'id': 9, 'text': 'magasins', 'start_char': 38, 'end_char': 46}, {'id': (10, 11), 'text': "aujourd'hui", 'start_char': 47, 'end_char': 58, 'manual_expansion': True}, {'id': 10, 'text': "aujourd'"}, {'id': 11, 'text': 'hui'}, {'id': 12, 'text': '.', 'start_char': 58, 'end_char': 59}]]

which does contain "aujourd'hui" with a tuple id: {'id': (10, 11), 'text': "aujourd'hui", ...}

So I'd say this is good

@AngledLuffa (Collaborator):

Double check the test case, then I can squash it down & merge I think

@AngledLuffa AngledLuffa force-pushed the manual_mwt_control branch 2 times, most recently from fb4b555 to 65a12e5 Compare November 29, 2023 05:36
uses an enum to signify the manual MWT processing

manual MWT information is in a field instead of the misc field

default of the manual expansion is `None`
@AngledLuffa AngledLuffa merged commit 5d56fb9 into dev Nov 29, 2023
1 check passed
@AngledLuffa AngledLuffa deleted the manual_mwt_control branch November 29, 2023 05:52