
manual MWT control in tokenization #1302

Merged
merged 1 commit into from Nov 29, 2023

Conversation

@Jemoka (Member) commented Oct 24, 2023:

Description

This is a follow-up to #1290 which allows manual control of MWT splitting via the tokenize_postprocessor. We implement this via a new misc attribute MEXP=Yes to denote pre-delineated MWTs that the MWT processor shouldn't touch.

For instance:

nlp = stanza.Pipeline(lang="fr", processors="tokenize",
                                    tokenize_postprocessor=lambda draft: do_stuff_to_draft(draft))

The single argument passed to tokenize_postprocessor is a list of sentences, where each sentence is a list of tokens; a token is a plain string, a (string, True) tuple requesting MWT splitting, or a (string, [parts]) tuple supplying a manual split:

[['Le', 'prince', 'va', 'manger', ('du', True), 'poulet', ('aux', True), 'les', 
   'magasins', ("aujourd'hui", ["aujourd'", "hui"]), '.']]

With this postprocessor return value, du and aux are requested to be split via the traditional MWT splitter, whereas a user-defined split is provided for aujourd'hui.
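The draft-transforming behavior described above can be sketched as a stand-alone function. This is a minimal illustration, not the actual implementation; MANUAL_SPLITS and MWT_CANDIDATES are hypothetical lookup tables introduced only for this example.

```python
# A minimal postprocessor sketch, assuming the draft format described above:
# each token is a plain string, a (text, True) tuple asking the MWT model to
# split it, or a (text, [parts]) tuple supplying a manual split.
MANUAL_SPLITS = {"aujourd'hui": ["aujourd'", "hui"]}
MWT_CANDIDATES = {"du", "aux"}

def do_stuff_to_draft(draft):
    out = []
    for sentence in draft:
        new_sentence = []
        for token in sentence:
            text = token if isinstance(token, str) else token[0]
            if text in MANUAL_SPLITS:
                # user-defined split: the MWT processor will leave it alone
                new_sentence.append((text, MANUAL_SPLITS[text]))
            elif text in MWT_CANDIDATES:
                # ask the traditional MWT splitter to handle this token
                new_sentence.append((text, True))
            else:
                new_sentence.append(text)
        out.append(new_sentence)
    return out
```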

Unit Test Coverage

test_postprocessor_mwt is created to check for this functionality, and pipeline/test_tokenizer.py contains updates to support MWT override.

ae = (multi_word_expanded_token_misc.match(token.misc)
if token.misc is not None else None)

perform_mwt_processing = "flatten"
Collaborator:

I would normally prefer an enum here instead of string constants

Member Author:

sounds good; addressed by d13e5d4
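The suggestion above (an enum instead of string constants) can be sketched as follows; the class and member names here are assumptions for illustration, not the identifiers actually adopted in d13e5d4.

```python
from enum import Enum

# Hedged sketch of replacing the "flatten" string constant with an enum.
# Dispatching on enum members instead of raw strings catches typos at
# attribute-lookup time rather than silently failing a string comparison.
class MWTProcessing(Enum):
    DEFAULT = "default"
    FLATTEN = "flatten"

perform_mwt_processing = MWTProcessing.FLATTEN
```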

for i in token.words:
i.id += idx_e
idx_w = token.id[-1]
token.misc = None if token.misc == 'MEXP=Yes' else '|'.join([x for x in token.misc.split('|') if x != 'MEXP=Yes'])
Collaborator:

This is the basic idea I had, but I was wondering if there should be an attribute on the Token object instead of a component of the misc field

Member Author:

What would be the benefits and disadvantages of choosing one over the other? I wanted to use the misc field because then whoever is observing it can also see the difference relating to MWT/MEXP in the same place, and I didn't want to make an entire attribute just for the very specific case of "user wanted this MWT to be split in a particular way". I would love to get more clarity on why MWT was marked there. Apologies for my confusion here.

Collaborator:

Yeah, I guess I don't have a strong motivation. I was just thinking that it's most likely a temporary annotation and could be forgotten about after processing. Even in the case of something we want to keep around, I think it would be easier for other coding tasks if it's kept in a separate field and then added to the misc output when needed. The ner field is like that, for example.
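The misc-handling line quoted earlier in this thread can be illustrated in isolation: the temporary MEXP=Yes marker is removed while any other misc features (e.g. MWT=Yes) survive. strip_mexp is a hypothetical helper name, not part of the stanza codebase.

```python
# Stand-alone sketch of the MEXP=Yes cleanup logic from the reviewed diff.
def strip_mexp(misc):
    if misc == 'MEXP=Yes':
        # the marker was the only feature, so the field becomes empty
        return None
    # otherwise drop just the marker, keeping the remaining features
    return '|'.join(x for x in misc.split('|') if x != 'MEXP=Yes')
```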

assert token_lens == mwt_lens, "Postprocessor returned token and MWT lists of different length! Token list lengths %s, MWT list lengths %s" % (token_lens, mwt_lens)

corrected_expansions.append(sent_expansions)

# recassemble document. offsets and oov shouldn't change
Collaborator:

typo (reassemble)

Member Author:

Ack. Fat fingers. Fixed in bc0e41c

# check postprocessor output
token_lens = [len(i) for i in corrected_words]
mwt_lens = [len(i) for i in corrected_mwts]
assert token_lens == mwt_lens, "Postprocessor returned token and MWT lists of different length! Token list lengths %s, MWT list lengths %s" % (token_lens, mwt_lens)
Collaborator:

can we keep the error checking?

Member Author:

This does one extra loop through the input, but the user can never actually supply MWT/token lists of different lengths, since they supply them in one zipped array:

[['Le', 'prince', 'va', 'manger', ('du', True), 'poulet', ('aux', True), 'les', 
   'magasins', ("aujourd'hui", ["aujourd'", "hui"]), '.']]

I originally added this check when the user was passing in two separate lists to the postprocessor; that is no longer the case.

However, I added it back just in case something with these lists goes wrong again: bc0e41c

@@ -392,6 +398,8 @@ def reassemble_doc_from_tokens(tokens, mwts, raw_text):
mwts : List[List[bool]]
Whether or not each of the tokens are MWTs to be analyzed by
the MWT raw.
mwts : List[List[List[str}]]
Collaborator:

typo

Member Author:

addressed by bc0e41c

@@ -3,6 +3,7 @@
"""

import io
from stanza.resources.common import process_pipeline_parameters
Collaborator:

not needed?

Member Author:

sorry about that! addressed by bc0e41c


text = "Joe Smith lives in California."

with pytest.raises(ValueError):
utils.reassemble_doc_from_tokens(bad_addition_tokenization, bad_addition_mwts, text)
utils.reassemble_doc_from_tokens(bad_addition_tokenization, bad_addition_mwts,
Collaborator:

one line would be fine here

Member Author:

addressed by bc0e41c

@AngledLuffa (Collaborator):

mostly happy except for the misc field comments above

i do like having stuff which is functional in separate fields so it can be manipulated more easily without needing to go through the misc annotation each time

@Jemoka (Member Author) commented Nov 17, 2023:

got it, I will implement this shortly into the weekend and see how it goes

@Jemoka (Member Author) commented Nov 19, 2023:

Done; ready for your review. There's a chance that adding a whole extra field to the Token and Word types causes some unit tests to fail if they check the to_dict() output. However, that shouldn't be the case where the documents' string reprs themselves are checked, because I didn't touch those.

processing for tokens marked manually expanded:

process_manual_expanded = None - default; doesn't process manually expanded tokens
= True - process only manually expanded tokens
Collaborator:

would you clarify the distinction between True and False a bit? maybe an example or two would help

Member Author:

Done, addressed by 556a053
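The None/True semantics quoted in the docstring fragment above can be sketched as a predicate. should_process is a hypothetical helper introduced for illustration; the False branch is not shown in the quoted docstring, so only the two documented values are modeled here.

```python
def should_process(is_manually_expanded, process_manual_expanded=None):
    # None (the default): skip tokens the user already expanded manually
    if process_manual_expanded is None:
        return not is_manually_expanded
    # True: process only the manually expanded tokens
    return is_manually_expanded
```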

idx_e = 0
for sentence in self.sentences:
idx_w = 0
for token in sentence.tokens:
idx_w += 1
m = (len(token.id) > 1)
n = multi_word_token_misc.match(token.misc) if token.misc is not None else None
if not m and not n:
n = (multi_word_token_misc.match(token.misc)
Collaborator:

this particular line break is not changing anything?

Member Author:

My emacs config was too aggressive in complaining about long lines; addressed by 556a053. Apologies!

"""

idx_e = 0
for sentence in self.sentences:
idx_w = 0
for token in sentence.tokens:
idx_w += 1
m = (len(token.id) > 1)
Collaborator:

i think at this point m, n, ae, etc need real variable names instead. i'm getting a little lost

Member Author:

apologies about that! addressed by 556a053
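One possible shape of the renaming the reviewer asked for, as a self-contained sketch; the regex and the function/variable names here are assumptions for illustration, not the identifiers actually adopted in 556a053.

```python
import re

# A token is a multi-word token if its id spans multiple word ids (was: m)
# or if its misc field carries the MWT=Yes marker (was: n).
multi_word_token_misc = re.compile(r"\bMWT=Yes\b")

def is_multi_word_token(token_id, misc):
    spans_multiple_ids = len(token_id) > 1
    marked_in_misc = (misc is not None
                      and multi_word_token_misc.search(misc) is not None)
    return spans_multiple_ids or marked_in_misc
```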

"""
expansions = []
for sentence in self.sentences:
for token in sentence.tokens:
m = (len(token.id) > 1)
n = multi_word_token_misc.match(token.misc) if token.misc is not None else None
if m or n:
ae = token.manual_expansion
Collaborator:

same here, actual variable names which give a bit more hint as to what is going on with (m and not ae) or n would be helpful

Member Author:

addressed by 556a053

else:
sent_words.append(word[0])
sent_mwts.append(True)
sent_expansions.append(" ".join(word[1]))
Collaborator:

question about this, typically MWT are literally something that has been split out of one token, so "".join() would be more appropriate. however, happy to have my understanding corrected... preferably via comment instead of github reply!

Member Author:

addressed by 556a053.
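The reviewer's point about the joiner, in miniature: an MWT's surface form is its parts concatenated with no separator, so "".join reconstructs the original token while " ".join inserts a spurious space.

```python
parts = ["aujourd'", "hui"]
with_space = " ".join(parts)     # inserts a space the original text never had
without_space = "".join(parts)   # reconstructs the original surface form
```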

@@ -102,7 +102,7 @@ def test_postprocessor_application():
good_tokenization = [['I', 'am', 'Joe.', '⭆⊱⇞', 'Hi', '.'], ["I'm", 'a', 'chicken', '.']]
text = "I am Joe. ⭆⊱⇞ Hi. I'm a chicken."

target_doc = [[{'id': (1,), 'text': 'I', 'start_char': 0, 'end_char': 1}, {'id': (2,), 'text': 'am', 'start_char': 2, 'end_char': 4}, {'id': (3,), 'text': 'Joe.', 'start_char': 5, 'end_char': 9}, {'id': (4,), 'text': '⭆⊱⇞', 'start_char': 10, 'end_char': 13}, {'id': (5,), 'text': 'Hi', 'start_char': 14, 'end_char': 16}, {'id': (6,), 'text': '.', 'start_char': 16, 'end_char': 17}], [{'id': (1,), 'text': "I'm", 'start_char': 18, 'end_char': 21}, {'id': (2,), 'text': 'a', 'start_char': 22, 'end_char': 23}, {'id': (3,), 'text': 'chicken', 'start_char': 24, 'end_char': 31}, {'id': (4,), 'text': '.', 'start_char': 31, 'end_char': 32}]]
target_doc = [[{'id': 1, 'text': 'I', 'start_char': 0, 'end_char': 1}, {'id': 2, 'text': 'am', 'start_char': 2, 'end_char': 4}, {'id': 3, 'text': 'Joe.', 'start_char': 5, 'end_char': 9}, {'id': 4, 'text': '⭆⊱⇞', 'start_char': 10, 'end_char': 13}, {'id': 5, 'text': 'Hi', 'start_char': 14, 'end_char': 16}, {'id': 6, 'text': '.', 'start_char': 16, 'end_char': 17}], [{'id': 1, 'text': "I'm", 'start_char': 18, 'end_char': 21}, {'id': 2, 'text': 'a', 'start_char': 22, 'end_char': 23}, {'id': 3, 'text': 'chicken', 'start_char': 24, 'end_char': 31}, {'id': 4, 'text': '.', 'start_char': 31, 'end_char': 32}]]
Collaborator:

Again I must be missing some detail, but why has this changed to now return ints instead of tuples? I believe there may be some downstream code which expects the token IDs to be a tuple of length 1 for non-MWT words

Member Author:

the system should still handle cases with multiple IDs correctly, but it defaults to a single ID because I simply call doc.to_dict().

self._id = word_entry.get(ID, None)
if isinstance(self._id, tuple):
if len(self._id) == 1:
self._id = self._id[0]
self._text = word_entry.get(TEXT, None)
is the exact spot where the system defaults to single numbers instead of tuples if there's only one element; let me know if you want me to change this behavior.
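The normalization in the snippet above, extracted into a minimal illustration: a single-element tuple id collapses to a plain int, while multi-element (MWT) ids stay tuples. normalize_id is a hypothetical helper name, not part of the stanza API.

```python
def normalize_id(token_id):
    if isinstance(token_id, tuple) and len(token_id) == 1:
        return token_id[0]   # (3,) -> 3
    return token_id          # (10, 11) and plain ints pass through
```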

Collaborator:

Ah, I see what's happened. The original was manually creating the id as a tuple. However, now it is creating a Document in reassemble_doc_from_tokens and then calling to_dict on that.

In that case, I withdraw my objection! However, I think what we do need is a test that it is doing the expected thing for MWT. Would you say that test_tokenizer.py::test_postprocessor_mwt should cover that?

Member Author (Nov 29, 2023):

Yes, it does. The intermediate output at the spot of concern in that test was:

[[{'id': 1, 'text': 'Le', 'start_char': 0, 'end_char': 2}, {'id': 2, 'text': 'prince', 'start_char': 3, 'end_char': 9}, {'id': 3, 'text': 'va', 'start_char': 10, 'end_char': 12}, {'id': 4, 'text': 'manger', 'start_char': 13, 'end_char': 19}, {'id': 5, 'text': 'du', 'misc': 'MWT=Yes', 'start_char': 20, 'end_char': 22}, {'id': 6, 'text': 'poulet', 'start_char': 23, 'end_char': 29}, {'id': 7, 'text': 'aux', 'misc': 'MWT=Yes', 'start_char': 30, 'end_char': 33}, {'id': 8, 'text': 'les', 'start_char': 34, 'end_char': 37}, {'id': 9, 'text': 'magasins', 'start_char': 38, 'end_char': 46}, {'id': (10, 11), 'text': "aujourd'hui", 'start_char': 47, 'end_char': 58, 'manual_expansion': True}, {'id': 10, 'text': "aujourd'"}, {'id': 11, 'text': 'hui'}, {'id': 12, 'text': '.', 'start_char': 58, 'end_char': 59}]]

which does contain "aujourd'hui" with a tuple id: {'id': (10, 11), 'text': "aujourd'hui", ...}

So I'd say this is good

@AngledLuffa (Collaborator):

Double check the test case, then I can squash it down & merge I think

@AngledLuffa AngledLuffa force-pushed the manual_mwt_control branch 2 times, most recently from fb4b555 to 65a12e5 Compare November 29, 2023 05:36
uses an enum to signify the manual MWT processing

manual MWT information is in a field instead of the misc field

default of the manual expansion is `None`
@AngledLuffa AngledLuffa merged commit 5d56fb9 into dev Nov 29, 2023
1 check passed
@AngledLuffa AngledLuffa deleted the manual_mwt_control branch November 29, 2023 05:52