"SpaceAfter=No" not being included in misc field of Word objects #1315

Open
tomlup opened this issue Dec 4, 2023 · 8 comments

Comments

@tomlup
tomlup commented Dec 4, 2023

Describe the bug
For tokens that are not followed by a space in the original text, that information ("SpaceAfter=No") is missing from both the misc field of the Word object and the CoNLL-U output.

To Reproduce

import stanza
from stanza.utils.conll import CoNLL

text = """
A bird hit the car, apparently.
"""

nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse', package='ewt')
doc = nlp(text)
for sent in doc.sentences:
    print(*[f'id: {word.id}\tword: {word.text}\tmisc: {word.misc}' for word in sent.words], sep='\n')
    print('\n')
print(CoNLL.doc2conll_text(doc))

Outputs the following:

id: 1	word: A	misc: None
id: 2	word: bird	misc: None
id: 3	word: hit	misc: None
id: 4	word: the	misc: None
id: 5	word: car	misc: None
id: 6	word: ,	misc: None
id: 7	word: apparently	misc: None
id: 8	word: .	misc: None


# text = A bird hit the car, apparently.
# sent_id = 0
1	A	a	DET	DT	Definite=Ind|PronType=Art	2	det	_	start_char=1|end_char=2
2	bird	bird	NOUN	NN	Number=Sing	3	nsubj	_	start_char=3|end_char=7
3	hit	hit	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	_	start_char=8|end_char=11
4	the	the	DET	DT	Definite=Def|PronType=Art	5	det	_	start_char=12|end_char=15
5	car	car	NOUN	NN	Number=Sing	3	obj	_	start_char=16|end_char=19
6	,	,	PUNCT	,	_	3	punct	_	start_char=19|end_char=20
7	apparently	apparently	ADV	RB	_	3	advmod	_	start_char=21|end_char=31
8	.	.	PUNCT	.	_	3	punct	_	start_char=31|end_char=32

Expected behavior
In the example sentence, tokens "car" and "apparently" should have "SpaceAfter=No" in the misc field.
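
Until the annotation is available, the expected tokens can be recovered from the character offsets that already appear in the output above. The following is a minimal sketch, not part of Stanza: the helper name `tokens_without_space_after` is hypothetical, and the token tuples are copied by hand from the start_char/end_char values in the CoNLL-U output rather than read from a `Document`. Skipping the annotation at the very end of the text is an assumption of this sketch.

```python
def tokens_without_space_after(text, tokens):
    """Return the texts of tokens immediately followed by a non-whitespace character.

    `tokens` is a list of (token_text, start_char, end_char) tuples, where
    end_char is exclusive, as in the start_char/end_char values shown above.
    """
    flagged = []
    for tok_text, _start, end in tokens:
        # Assumption: nothing is flagged when the token ends the text.
        if end < len(text) and not text[end].isspace():
            flagged.append(tok_text)
    return flagged


# The input text and offsets from the reproduction above.
text = "\nA bird hit the car, apparently.\n"
tokens = [
    ("A", 1, 2), ("bird", 3, 7), ("hit", 8, 11), ("the", 12, 15),
    ("car", 16, 19), (",", 19, 20), ("apparently", 21, 31), (".", 31, 32),
]

print(tokens_without_space_after(text, tokens))  # ['car', 'apparently']
```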

Environment (please complete the following information):

  • OS: Windows
  • Python version: 3.1
  • Stanza version: 1.6.1

Additional context
This comment indicates that "SpaceAfter=No" is included in the CoNLL-U output format: #677 (comment)

@tomlup tomlup added the bug label Dec 4, 2023
@AngledLuffa
Collaborator

AngledLuffa commented Dec 4, 2023 via email

@nschneid

@AngledLuffa any update on this?

@AngledLuffa
Collaborator

AngledLuffa commented Dec 13, 2023 via email

@AngledLuffa
Collaborator

Question about this for either of you. What should the output look like at the end of a sentence or the end of a document? For example, we could do this, with the \s\s at the end, but maybe that's horrible:

import stanza
pipe = stanza.Pipeline('en', processors='tokenize')

doc = pipe("  Jennifer has nice antennae.  ")
print("--------------")
print("{:C}".format(doc))

# text = Jennifer has nice antennae.
# sent_id = 0
1       Jennifer        _       _       _       _       0       _       _       start_char=2|end_char=10
2       has     _       _       _       _       1       _       _       start_char=11|end_char=14
3       nice    _       _       _       _       2       _       _       start_char=15|end_char=19
4       antennae        _       _       _       _       3       _       _       start_char=20|end_char=28|SpaceAfter=No
5       .       _       _       _       _       4       _       _       start_char=28|end_char=29|SpacesAfter=\s\s

There could also be a \s\s between sentences:

doc = pipe("  Jennifer has nice antennae.  Not very nice person, though.  ")
print("{:C}".format(doc))

# text = Jennifer has nice antennae.
# sent_id = 0
1       Jennifer        _       _       _       _       0       _       _       start_char=2|end_char=10
2       has     _       _       _       _       1       _       _       start_char=11|end_char=14
3       nice    _       _       _       _       2       _       _       start_char=15|end_char=19
4       antennae        _       _       _       _       3       _       _       start_char=20|end_char=28|SpaceAfter=No
5       .       _       _       _       _       4       _       _       start_char=28|end_char=29|SpacesAfter=\s\s

# text = Not very nice person, though.
# sent_id = 1
1       Not     _       _       _       _       0       _       _       start_char=31|end_char=34
2       very    _       _       _       _       1       _       _       start_char=35|end_char=39
3       nice    _       _       _       _       2       _       _       start_char=40|end_char=44
4       person  _       _       _       _       3       _       _       start_char=45|end_char=51|SpaceAfter=No
5       ,       _       _       _       _       4       _       _       start_char=51|end_char=52
6       though  _       _       _       _       5       _       _       start_char=53|end_char=59|SpaceAfter=No
7       .       _       _       _       _       6       _       _       start_char=59|end_char=60|SpacesAfter=\s\s

@AngledLuffa
Collaborator

Similarly, what about at the end of a document with no trailing whitespace? Should that have a SpaceAfter=No annotation?

@AngledLuffa
Collaborator

Aside from those questions, I think this is now good to go. One change I still want to make: the space annotations are currently kept as part of the MISC field, and I think they would be better as a separate member of the Token itself.

@nschneid

> Similarly, what about at the end of a document with no trailing whitespace? Should that have a SpaceAfter=No annotation?

I guess it depends on Stanza's expectations regarding documents. Whatever is the "default" ending for a document doesn't need any special annotation IMO.

> For example, we could do this, with the \s\s at the end

If the input contains two spaces after a sentence, then yes, SpacesAfter=\s\s makes sense to me.
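
The scheme discussed so far can be sketched as a small function that maps the gap after a token to a MISC annotation. This is an illustration only, not Stanza's implementation: the name `space_annotation` is hypothetical, the escape table follows the UDPipe-style convention for SpacesAfter values (whether Stanza uses exactly the same table is an assumption), and leaving the last token unannotated when nothing follows it is one possible reading of the "default ending" suggestion above.

```python
# Escapes for SpacesAfter values, following the UDPipe-style convention
# (assumed here; not confirmed against Stanza's code).
_ESCAPES = {"\\": "\\\\", " ": "\\s", "\t": "\\t", "\r": "\\r", "\n": "\\n", "|": "\\p"}


def space_annotation(text, end_char, next_start=None):
    """Return the MISC annotation for the gap after a token, or None.

    `end_char` is the exclusive end offset of the token. `next_start` is the
    start offset of the following token, or None for the last token.
    """
    is_last = next_start is None
    gap = text[end_char:] if is_last else text[end_char:next_start]
    if gap == " ":
        return None  # a single space is the default and needs no annotation
    if gap == "":
        # Assumption: a document ending without whitespace gets no annotation.
        return None if is_last else "SpaceAfter=No"
    return "SpacesAfter=" + "".join(_ESCAPES.get(ch, ch) for ch in gap)


text = "  Jennifer has nice antennae.  "
print(space_annotation(text, 10, 11))  # None (single space after "Jennifer")
print(space_annotation(text, 28, 28))  # SpaceAfter=No (after "antennae")
print(space_annotation(text, 29))      # SpacesAfter=\s\s (after the final ".")
```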

@AngledLuffa
Collaborator

SGTM. I made the change to represent the SpacesBefore and SpacesAfter as members of the Token object rather than something that needs to be extracted from the MISC field, so I'm going to call this good. Feel free to LMK if there are suggestions on how to change either the code or the output scheme.

One thing I note is that this change always puts SpacesAfter and SpacesBefore at the end of the misc column, rather than sometimes at the start, so some of the UD datasets change when the files are read in and then written back out. I'm going to consider that "not a big deal", although it might be nicer if those columns in the UD files had a canonical ordering, such as Spaces always at the start, always at the end, or always sorted with the other MISC attributes.
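
For comparing files before and after a round trip, one of the canonical orderings mentioned above (space attributes always at the end) could be applied to both sides first. This is a hedged sketch with a hypothetical helper name, `spaces_last`, operating on the raw MISC column string; it is not something Stanza or the UD tools are known to provide.

```python
SPACE_KEYS = ("SpaceAfter", "SpacesAfter", "SpacesBefore")


def spaces_last(misc):
    """Move space-related attributes to the end of a MISC column value.

    The relative order of all other attributes is preserved. An empty
    column ("_") is returned unchanged.
    """
    if not misc or misc == "_":
        return misc
    fields = misc.split("|")
    others = [f for f in fields if f.split("=", 1)[0] not in SPACE_KEYS]
    spaces = [f for f in fields if f.split("=", 1)[0] in SPACE_KEYS]
    return "|".join(others + spaces)


print(spaces_last("SpaceAfter=No|start_char=20|end_char=28"))
# start_char=20|end_char=28|SpaceAfter=No
```

Normalizing both the original and the rewritten file this way should make a diff show only real differences rather than attribute reordering.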


3 participants