"SpaceAfter=No" not being included in misc field of Word objects #1315

tomlup · 2023-12-04T01:58:09Z

Describe the bug
Tokens without a space after them in the original text do not include that info in the misc field of the Word object or in the conllu output format.

To Reproduce

import stanza
from stanza.utils.conll import CoNLL

text = """
A bird hit the car, apparently.
"""

nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse', package='ewt')
doc = nlp(text)
for sent in doc.sentences:
    print(*[f'id: {word.id}\tword: {word.text}\tmisc: {word.misc}' for word in sent.words], sep='\n')
    print('\n')
print(CoNLL.doc2conll_text(doc))

Outputs the following:

id: 1	word: A	misc: None
id: 2	word: bird	misc: None
id: 3	word: hit	misc: None
id: 4	word: the	misc: None
id: 5	word: car	misc: None
id: 6	word: ,	misc: None
id: 7	word: apparently	misc: None
id: 8	word: .	misc: None


# text = A bird hit the car, apparently.
# sent_id = 0
1	A	a	DET	DT	Definite=Ind|PronType=Art	2	det	_	start_char=1|end_char=2
2	bird	bird	NOUN	NN	Number=Sing	3	nsubj	_	start_char=3|end_char=7
3	hit	hit	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	_	start_char=8|end_char=11
4	the	the	DET	DT	Definite=Def|PronType=Art	5	det	_	start_char=12|end_char=15
5	car	car	NOUN	NN	Number=Sing	3	obj	_	start_char=16|end_char=19
6	,	,	PUNCT	,	_	3	punct	_	start_char=19|end_char=20
7	apparently	apparently	ADV	RB	_	3	advmod	_	start_char=21|end_char=31
8	.	.	PUNCT	.	_	3	punct	_	start_char=31|end_char=32

Expected behavior
In the example sentence, tokens "car" and "apparently" should have "SpaceAfter=No" in the misc field.
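
Until this is fixed, the missing annotation can be derived from the character offsets that already appear in the CoNLL-U output above. The following is a minimal sketch, not Stanza's implementation: `space_after_flags` is a hypothetical helper, and it takes plain `(text, start_char, end_char)` tuples so that it runs without loading a pipeline. Leaving the final token unannotated is an assumption of the sketch.

```python
def space_after_flags(tokens):
    """Return the MISC annotation (or None) for each token.

    tokens: list of (text, start_char, end_char) tuples in document order.
    A token gets "SpaceAfter=No" when the next token starts exactly where
    it ends. The final token is left as None (an assumption of this sketch).
    """
    flags = []
    # Pair each token with its successor and compare the offsets.
    for (_, _, end), (_, next_start, _) in zip(tokens, tokens[1:]):
        flags.append("SpaceAfter=No" if end == next_start else None)
    if tokens:
        flags.append(None)  # nothing to compare the last token against
    return flags


# Offsets copied from the CoNLL-U output above.
tokens = [
    ("A", 1, 2), ("bird", 3, 7), ("hit", 8, 11), ("the", 12, 15),
    ("car", 16, 19), (",", 19, 20), ("apparently", 21, 31), (".", 31, 32),
]
flags = space_after_flags(tokens)
for (text, _, _), flag in zip(tokens, flags):
    print(text, flag)
```

With these offsets, only "car" and "apparently" are flagged, which matches the expected behavior described above.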

Environment (please complete the following information):

OS: Windows
Python version: 3.1
Stanza version: 1.6.1

Additional context
Indicated here that "SpaceAfter=No" is included in the conllu output format: #677 (comment)


AngledLuffa · 2023-12-04T02:23:33Z

Yeah, that's a good point. I will hopefully be able to look into it this week.

nschneid · 2023-12-13T19:49:55Z

@AngledLuffa any update on this?

AngledLuffa · 2023-12-13T19:58:45Z

I am thinking about ways of doing it. Currently the SpaceAfter annotations have been kept along with most of the other misc fields in a big map of stuff attached to the words & tokens. I am thinking that perhaps it ought to be separated out into its own field, also along with a SpaceBefore, I suppose, since otherwise the start of a document will be lost. That is a little bit more of an update than I was originally thinking. However, I would say this is a rough outline of a plan and I just need to go ahead and implement it.
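
To make the idea of separate fields concrete, here is a rough sketch of what per-token whitespace storage could look like. `SketchToken` and `rebuild_text` are hypothetical names for illustration only and are not Stanza's classes. The sketch shows why a "spaces before" field matters: without it, leading whitespace at the start of a document cannot be recovered.

```python
from dataclasses import dataclass, field


@dataclass
class SketchToken:
    """Hypothetical stand-in for a token; not Stanza's Token class."""
    text: str
    spaces_before: str = ""   # whitespace preceding the token (mainly the document start)
    spaces_after: str = " "   # whitespace following the token; one space is the default
    misc: dict = field(default_factory=dict)  # all remaining MISC attributes


def rebuild_text(tokens):
    """Reassemble the original string from the tokens and their whitespace."""
    return "".join(t.spaces_before + t.text + t.spaces_after for t in tokens)


tokens = [
    SketchToken("Hi", spaces_before="  ", spaces_after=""),
    SketchToken(",", spaces_after=" "),
    SketchToken("there", spaces_after=""),
    SketchToken(".", spaces_after="  "),
]
print(repr(rebuild_text(tokens)))
```

Dropping `spaces_before` from the first token would lose the two leading spaces, so the round trip would no longer reproduce the input.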

AngledLuffa · 2023-12-15T05:50:43Z

Question about this for either of you. What should the output look like at the end of a sentence or the end of a document? For example, we could do this, with the \s\s at the end, but maybe that's horrible:

import stanza
pipe = stanza.Pipeline('en', processors='tokenize')

doc = pipe("  Jennifer has nice antennae.  ")
print("--------------")
print("{:C}".format(doc))

# text = Jennifer has nice antennae.
# sent_id = 0
1       Jennifer        _       _       _       _       0       _       _       start_char=2|end_char=10
2       has     _       _       _       _       1       _       _       start_char=11|end_char=14
3       nice    _       _       _       _       2       _       _       start_char=15|end_char=19
4       antennae        _       _       _       _       3       _       _       start_char=20|end_char=28|SpaceAfter=No
5       .       _       _       _       _       4       _       _       start_char=28|end_char=29|SpacesAfter=\s\s

There could also be a \s\s between sentences:

doc = pipe("  Jennifer has nice antennae.  Not very nice person, though.  ")
print("{:C}".format(doc))

# text = Jennifer has nice antennae.
# sent_id = 0
1       Jennifer        _       _       _       _       0       _       _       start_char=2|end_char=10
2       has     _       _       _       _       1       _       _       start_char=11|end_char=14
3       nice    _       _       _       _       2       _       _       start_char=15|end_char=19
4       antennae        _       _       _       _       3       _       _       start_char=20|end_char=28|SpaceAfter=No
5       .       _       _       _       _       4       _       _       start_char=28|end_char=29|SpacesAfter=\s\s

# text = Not very nice person, though.
# sent_id = 1
1       Not     _       _       _       _       0       _       _       start_char=31|end_char=34
2       very    _       _       _       _       1       _       _       start_char=35|end_char=39
3       nice    _       _       _       _       2       _       _       start_char=40|end_char=44
4       person  _       _       _       _       3       _       _       start_char=45|end_char=51|SpaceAfter=No
5       ,       _       _       _       _       4       _       _       start_char=51|end_char=52
6       though  _       _       _       _       5       _       _       start_char=53|end_char=59|SpaceAfter=No
7       .       _       _       _       _       6       _       _       start_char=59|end_char=60|SpacesAfter=\s\s

AngledLuffa · 2023-12-15T08:15:51Z

Similarly, what about at the end of a document with no trailing whitespace? Should that have a SpaceAfter=No annotation?

AngledLuffa · 2023-12-15T10:02:41Z

Aside from those questions, I think this is now good to go. One change I still want to make is that currently the spaces annotations are kept as part of the MISC field, and I think they would be better as a separate member of the Token itself.

nschneid · 2023-12-15T14:17:50Z

> Similarly, what about at the end of a document with no trailing whitespace? Should that have a SpaceAfter=No annotation?

I guess it depends on Stanza's expectations regarding documents. Whatever is the "default" ending for a document doesn't need any special annotation IMO.

> For example, we could do this, with the \s\s at the end

If the input contains two spaces after a sentence, then yes, SpacesAfter=\s\s makes sense to me.
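
The mapping discussed here can be sketched as a small function that turns the whitespace following a token into a MISC annotation. This is an illustration under stated assumptions, not the code that was merged: `spaces_after_misc` is a hypothetical name, the `\s` escape follows the output shown earlier in this thread, and the `\t` and `\n` escapes are assumed by analogy. How to treat the end of a document is left to the caller.

```python
def spaces_after_misc(spaces):
    """Render the whitespace after a token as a MISC annotation.

    Returns None for a single space, which is the unmarked default.
    """
    if spaces == " ":
        return None
    if spaces == "":
        return "SpaceAfter=No"
    # Escape table: "\s" as used in this thread; "\t" and "\n" are assumptions.
    escapes = {" ": "\\s", "\t": "\\t", "\n": "\\n"}
    # Characters outside the table pass through unchanged (a simplification).
    return "SpacesAfter=" + "".join(escapes.get(ch, ch) for ch in spaces)


print(spaces_after_misc(""))
print(spaces_after_misc(" "))
print(spaces_after_misc("  "))
```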

AngledLuffa · 2023-12-16T02:05:44Z

SGTM. I made the change to represent the SpacesBefore and SpacesAfter as members of the Token object rather than something that needs to be extracted from the MISC field, so I'm going to call this good. Feel free to LMK if there are suggestions on how to change either the code or the output scheme.

One thing I note is that this change always puts the SpacesAfter and SpacesBefore at the end of the misc column, rather than sometimes at the start, so some of the UD datasets change when the files are read in and then written back out. I'm going to consider that "not a big deal", although it might be nicer if those columns in the UD files had a canonical ordering, such as Spaces always at the start, always at the end, or always sorted with the other MISC attributes.
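
For anyone who wants stable diffs when round-tripping UD files, one of the orderings mentioned above (spacing attributes always at the end) could be applied as a post-processing step. This is only a sketch: `canonical_misc` and `SPACE_KEYS` are hypothetical names, and keeping the relative order of the other attributes is a design choice made here, not something the thread settled on.

```python
# Attribute names treated as spacing annotations in this sketch.
SPACE_KEYS = ("SpaceAfter", "SpacesAfter", "SpacesBefore")


def canonical_misc(misc):
    """Move spacing attributes to the end of a MISC string.

    Relies on sorted() being stable, so all other attributes keep
    their original relative order.
    """
    if misc in ("", "_"):
        return "_"
    parts = misc.split("|")

    def is_spacing(part):
        # False sorts before True, so spacing attributes go last.
        return part.split("=", 1)[0] in SPACE_KEYS

    return "|".join(sorted(parts, key=is_spacing))


print(canonical_misc("SpaceAfter=No|start_char=20|end_char=28"))
```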

tomlup added the bug label Dec 4, 2023

AngledLuffa mentioned this issue Dec 15, 2023

Spaces after #1322

Merged

AngledLuffa added the fixed on dev label Dec 16, 2023
