Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect behavior of OdfElement.getTextContent() #229

Open
ttaomae opened this issue May 27, 2023 · 3 comments
Open

Incorrect behavior of OdfElement.getTextContent() #229

ttaomae opened this issue May 27, 2023 · 3 comments
Assignees
Milestone

Comments

@ttaomae
Copy link

ttaomae commented May 27, 2023

While investigating #220, I came across OdfTableCell.getOdfElement().getTextContent(), which is defined by org.w3c.dom.Node.getTextContent() and implemented in OdfElement. My interpretation of the Node.getTextContent() documentation is that it should recursively retrieve the text contents of all child nodes and concatenate them together. However, the OdfElement implementation only looks at the children, rather than all descendants.

I think it is also worth noting that OdfWhitespaceParser.getText(Node) appears to do what OdfElement.getTextContent() is supposed to do. Or at least something very similar.

@mistmist
Copy link
Contributor

i don't know the ODFDOM code well, but as a general rule about ODF: inside of a paragraph (text:p/text:h) and its child elements, there is character content which is part of the paragraph text (most elements), and there is character content which is part of some nested object and not part of the paragraph text.

there are a bunch of elements for drawing shapes, draw:frame etc. and also office:annotation, that should comprise most of the latter case - these should probably not be included in any API that promises some sort of text content of "the paragraph" or "the table cell".

the criteria for what is considered paragraph text are in section "6.1.2 White Space Characters" of ODF 1.3 part 3, and i think the text of a table cell should simply be the concatenation of its paragraphs.

@svanteschubert svanteschubert self-assigned this May 30, 2023
@svanteschubert svanteschubert added this to the 1.0.0 milestone May 30, 2023
@svanteschubert
Copy link
Contributor

You are absolutely right, Todd.

OdfElement has the base functionality to concatenate the text content:
https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/pkg/OdfElement.java#L2633

but the every text node containing element like OdfTextSpan should override this method and define its specific behavior.
By this method implementation, the specific behavior - Michael has pointed out it his previous comment - would be reflected.

I am uncertain atm, if we have a way in the ODF Specification to differentiate this behavior, if not it would be worth to add it.

Taking this over issue over!
Will aim to generate this as part of the upcoming release candidate.

Thanks for pointing out!

@svanteschubert
Copy link
Contributor

svanteschubert commented May 30, 2023

The ODF specification explains in the bullet list of 6.1.1 all elements containing text that should not be considered, see
https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#a_6_1_Basic_Text_Content

The funcationality is currently spread and needs to harmonized, e.g. this whitespace collapsing function is implemented in the characters(..) method of ChangesFileSaxHandler
https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/changes/ChangesFileSaxHandler.java#L2252

#220 and this #229 are quite related and will likely be fixed in one patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants