"SpaceAfter=No" not being included in misc field of Word objects #1315
Comments
Yeah, that's a good point. I will hopefully be able to look into it this week.
|
@AngledLuffa any update on this? |
I am thinking about ways of doing it. Currently the SpaceAfter annotations
have been kept along with most of the other misc fields in a big map of
stuff attached to the words & tokens. I am thinking that perhaps it ought
to be separated out into its own field, also along with a SpaceBefore, I
suppose, since otherwise the start of a document will be lost. That is a
little bit more of an update than I was originally thinking. However, I
would say this is a rough outline of a plan and I just need to go ahead and
implement it.
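
To make the idea of separating whitespace out of the misc map concrete, here is a minimal sketch. `SketchToken`, `spaces_before`, `spaces_after`, and `rebuild_text` are hypothetical names chosen for illustration; this is not Stanza's actual Token class, only one way the plan described above might look.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchToken:
    """Hypothetical stand-in for a token; NOT Stanza's actual Token class."""
    text: str
    # Whitespace preceding the token. Only needed where no earlier token
    # can carry it, e.g. at the very start of a document.
    spaces_before: str = ""
    # Whitespace following the token; a single space is treated as the default.
    spaces_after: str = " "
    # All remaining MISC attributes stay in the generic map.
    misc: Dict[str, str] = field(default_factory=dict)


def rebuild_text(tokens: List[SketchToken]) -> str:
    """Reassemble the original text from the tokens and their whitespace fields."""
    return "".join(t.spaces_before + t.text + t.spaces_after for t in tokens)


tokens = [
    SketchToken("Hello", spaces_before="  ", spaces_after=""),
    SketchToken(",", spaces_after=" "),
    SketchToken("world", spaces_after=""),
]
print(repr(rebuild_text(tokens)))
```

With dedicated fields, the leading whitespace of the document survives a round trip, which is the case a `SpaceAfter`-only scheme would lose.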
|
Question about this for either of you. What should the output look like at the end of a sentence or the end of a document? For example, we could do this, with the
There could also be a
|
Similarly, what about at the end of a document with no trailing whitespace? Should that have a |
Aside from those questions, I think this is now good to go. One change I still want to make is that currently the spaces annotations are kept as part of the MISC field, and I think they would be better as a separate member of the Token itself |
I guess it depends on Stanza's expectations regarding documents. Whatever is the "default" ending for a document doesn't need any special annotation IMO.
If the input contains two spaces after a sentence, then yes, |
SGTM. I made the change to represent the SpacesBefore and SpacesAfter as members of the Token object rather than something that needs to be extracted from the MISC field, so I'm going to call this good. Feel free to LMK if there are suggestions on how to change either the code or the output scheme. One thing I note is that this change puts the SpacesAfter and SpacesBefore at the end of the misc column always, rather than sometimes being at the start, so some of the UD dataset change when the files are read in and then written back out. I'm going to consider that "not a big deal" although it might be nicer if those columns in the UD files had a canonical ordering, such as Spaces always at the start, always at the end, or always sorted with the other MISC attributes. |
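
The ordering question can be illustrated with a small sketch. The function names (`escape_spaces`, `misc_column`) and the `sort_all` flag are hypothetical and are not part of Stanza; the `\s`, `\t`, `\n` escapes follow the convention commonly seen in UD treebanks, and the exact output scheme here is an assumption rather than the implemented one.

```python
def escape_spaces(ws):
    """Escape whitespace roughly the way UD SpacesAfter values are written."""
    return ws.replace(" ", "\\s").replace("\t", "\\t").replace("\n", "\\n")


def misc_column(misc, spaces_before="", spaces_after=" ", sort_all=False):
    """Build a MISC string with the space annotations appended at the end.

    sort_all=True instead sorts every attribute, which is one possible
    canonical ordering (an assumption, not an agreed convention).
    """
    parts = [f"{key}={value}" for key, value in misc.items()]
    if spaces_before:
        parts.append("SpacesBefore=" + escape_spaces(spaces_before))
    if spaces_after == "":
        parts.append("SpaceAfter=No")
    elif spaces_after != " ":
        # A single space is the default and needs no annotation.
        parts.append("SpacesAfter=" + escape_spaces(spaces_after))
    if sort_all:
        parts = sorted(parts)
    return "|".join(parts) if parts else "_"


print(misc_column({"Translit": "x"}, spaces_after=""))
print(misc_column({}, spaces_after="  "))
print(misc_column({}))
```

Appending the space annotations last is simple to implement, but as noted above it can reorder the MISC column of existing UD files on a read/write round trip; sorting all attributes would make the output independent of the input order.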
Describe the bug
Tokens without a space after them in the original text do not include that info in the misc field of the Word object or in the conllu output format.
To Reproduce
Outputs the following:
Expected behavior
In the example sentence, tokens "car" and "apparently" should have "SpaceAfter=No" in the misc field.
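
As an illustration of the expected behavior, the sketch below derives `SpaceAfter=No` from character offsets in plain Python, without calling Stanza. The sentence and tokenization are hypothetical (the original reproduction input is not shown here); they were chosen only so that "car" and "apparently" are followed directly by punctuation. Treating the end of the text as needing no annotation is also an assumption.

```python
def space_after_flags(text, tokens):
    """Return (token, misc) pairs, marking tokens not followed by whitespace."""
    flags = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # locate the token at or after the cursor
        end = start + len(tok)
        # Assumption: the last token of the text gets no annotation.
        no_space = end < len(text) and not text[end].isspace()
        flags.append((tok, "SpaceAfter=No" if no_space else "_"))
        pos = end
    return flags


# Hypothetical input, not the sentence from the original report.
text = "I bought a car, apparently."
tokens = ["I", "bought", "a", "car", ",", "apparently", "."]

for tok, misc in space_after_flags(text, tokens):
    print(f"{tok}\t{misc}")
```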
Environment (please complete the following information):
Additional context
Indicated here that "SpaceAfter=No" is included in the conllu output format: #677 (comment)