fix!: TextNode text property is not decoded (fixes #134) #135

nonara · 2021-06-30T20:59:02Z

Addressed

TextNode#text is now HTML-decoded
HtmlNode#structuredText now uses HTML-decoded text
Added trimmed-text cache invalidation on rawText update (ensures the cache is up to date after HtmlNode#removeWhitespace is called)
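
To make the text/rawText distinction concrete, here is a minimal sketch of the intended behaviour. `SketchTextNode`, `ENTITY_TABLE`, and `decodeEntities` are hypothetical stand-ins written for this note, not the library's code; the tiny table covers only a handful of entities, whereas a real decoder would handle the full set.

```typescript
// Hypothetical, deliberately tiny entity table (assumption: the real
// library uses a complete decoder; this one is illustrative only).
const ENTITY_TABLE: Record<string, string> = {
	'&amp;': '&',
	'&lt;': '<',
	'&gt;': '>',
	'&quot;': '"',
	'&nbsp;': '\u00a0',
};

// Replace each known entity with the character it stands for.
function decodeEntities(input: string): string {
	return input.replace(/&(?:amp|lt|gt|quot|nbsp);/g, (entity) => ENTITY_TABLE[entity] ?? entity);
}

// Hypothetical stand-in for TextNode, reduced to the two properties discussed here.
class SketchTextNode {
	constructor(public rawText: string) {}

	// Decoded view of the content, as the documentation for `text` describes.
	get text(): string {
		return decodeEntities(this.rawText);
	}
}

const sketchNode = new SketchTextNode('a &lt; b &amp;&amp; c');
console.log(sketchNode.rawText); // a &lt; b &amp;&amp; c
console.log(sketchNode.text);    // a < b && c
```

With this shape, `rawText` stays the single source of truth and `text` is always derived from it.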

Notes

I assume structuredText should not preserve HTML entities. If I'm wrong, let me know and I'll adjust.
⚠️ You will want to bump the major version with this release, as it's likely that someone's code (erroneously) relies on text, instead of the proper rawText, to get HTML entities.

Associated Issue

crosstype/node-html-markdown#14

nonara · 2021-06-30T22:07:10Z

All set for review @taoqf

taoqf · 2021-07-02T05:52:43Z

Thanks for your work. Based on your PR, I dug into it further. Please check commit 840ffda if you have time.
Let me know if I missed anything.

nonara · 2021-07-02T15:33:53Z

Hi. I appreciate your taking the time! I did find a few issues.

Principally, according to the spec, TextNode#textContent should be the raw content, so changing it may cause breaking issues for people (more detail below).

An appeal

As for this proposal, please consider the following:

Each node has a text and a rawText property
The documentation for text claims that it returns the unescaped text
Regular nodes decode the text property, while rawText provides access to the raw (still-escaped) text
Presently, TextNode simply returns rawText for both text and rawText, which is counterintuitive and goes against the documentation

Because of this, I would classify this as an error, as it goes against both the convention established for nodes and the documentation. It also effectively eliminates the purpose of having both properties.

Put more simply, why should text and rawText be different for HtmlNode and the same for TextNode? This bypasses the purpose.

So, although it's technically a breaking change, it is one which makes the code do what the documentation and convention say.

Issue with textContent

The issue with changing textContent is that this property is actually present in the HTML DOM spec, which means that this may actually cause breaking changes for people who are using it. Because it's a standardized property, this may actually affect even more users.

If you absolutely do not want to change the text field, my recommendation is to simply add a new property or to close the issue without fixing it, in which case I'll work around it on my end, but I hope I can persuade you otherwise.

Next steps

I would advise removing your recent commit, for now, and I've given you access to this PR in case you need to make changes.

I've made some corrections to my PR based on your commit. I've also added notes to make it easier to understand what I did. I should have done that before. I apologize.

Please have a look over the comments in the files changed tab, and let me know.

nonara · 2021-07-02T15:39:50Z

src/nodes/html.ts

@@ -458,7 +458,7 @@ export default class HTMLElement extends Node {
 				if ((node as TextNode).isWhitespace) {
 					return;
 				}
-				node.rawText = (<TextNode>node).trimmedText;
+				node.rawText = (<TextNode>node).trimmedRawText;


Because we are setting rawText, we want to be certain that it uses the raw version

nonara · 2021-07-02T15:45:11Z

src/nodes/text.ts

+		this._rawText = text;
+		this._trimmedRawText = void 0;
+		this._trimmedText = void 0;
+	}


Reset caches when rawText changed (connected to previous note)
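
As a rough illustration of the pattern in this hunk, the sketch below resets a lazily computed cache whenever rawText is written. `CachedTextNode` and `computeCount` are hypothetical names used only for this note, and `String#trim` stands in for the library's own trimming rule, which is an assumption rather than its actual behaviour.

```typescript
// Hypothetical stand-in for TextNode; only the caching behaviour is modelled.
class CachedTextNode {
	private _rawText: string;
	private _trimmedRawText: string | undefined = undefined;
	// Counts how often the trimmed value is recomputed, for demonstration only.
	public computeCount = 0;

	constructor(rawText: string) {
		this._rawText = rawText;
	}

	get rawText(): string {
		return this._rawText;
	}

	// Writing rawText drops the cached value so it cannot go stale.
	set rawText(value: string) {
		this._rawText = value;
		this._trimmedRawText = undefined;
	}

	// Lazily computed and cached until rawText is written again.
	get trimmedRawText(): string {
		if (this._trimmedRawText === undefined) {
			this.computeCount++;
			this._trimmedRawText = this._rawText.trim();
		}
		return this._trimmedRawText;
	}
}

const cachedNode = new CachedTextNode('  first  ');
const firstRead = cachedNode.trimmedRawText;  // computed and cached
const secondRead = cachedNode.trimmedRawText; // served from the cache
cachedNode.rawText = '  second  ';            // invalidates the cache
const thirdRead = cachedNode.trimmedRawText;  // recomputed from the new value
console.log(firstRead, secondRead, thirdRead); // first first second
console.log(cachedNode.computeCount);          // 2
```

Without the reset in the setter, the third read would still return the value derived from the old rawText.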

nonara · 2021-07-02T15:45:39Z

src/nodes/text.ts

@@ -61,18 +60,50 @@ export default class TextNode extends Node {
 	 * @return {string} text content
 	 */
 	public get text() {
-		return this.rawText;
+		return decode(this.rawText);


Standardize to convention and documentation

nonara · 2021-07-02T15:45:55Z

src/nodes/text.ts

 	}

 	/**
 	 * Detect if the node contains only white space.
-	 * @return {bool}
+	 * @return {boolean}


This produced an eslint error, so I updated it

nonara · 2021-07-02T15:47:23Z

test/html.js

-				textNode.rawText.should.eql(' 123 ');
+				const textNode = new TextNode('  123&nbsp;  ');
+				textNode.rawText = textNode.trimmedRawText;
+				textNode.rawText.should.eql(' 123&nbsp; ');


I wrote this original test in a previous PR. We're altering it slightly to make sure we're using the proper rawText version of trimmed text. We also add in an &nbsp; to ensure that it does not decode.
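
The expected value in this test suggests that trimming collapses each leading or trailing run of whitespace to a single space and leaves entities such as `&nbsp;` untouched, since in the raw text they are not whitespace characters. A minimal sketch under that assumption follows; `trimRawSketch` is a hypothetical helper, not the library's implementation, and edge cases such as whitespace-only input are not modelled.

```typescript
// Hypothetical trimming rule inferred from the test above: collapse the
// leading and trailing whitespace runs to one space each, touch nothing else.
function trimRawSketch(rawText: string): string {
	return rawText.replace(/^\s+/, ' ').replace(/\s+$/, ' ');
}

// JSON.stringify makes the surviving spaces visible in the output.
console.log(JSON.stringify(trimRawSketch('  123&nbsp;  '))); // " 123&nbsp; "
console.log(JSON.stringify(trimRawSketch('123')));           // "123"
```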

nonara · 2021-07-02T15:48:56Z

src/nodes/text.ts

 	 */
 	public get isWhitespace() {
 		return /^(\s|&nbsp;)*$/.test(this.rawText);
 	}

 	public toString() {
-		return this.text;
+		return this.rawText;
+	}


Necessary to ensure toString does not decode (as it did not before)
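
A short sketch of why serialization should go through rawText: if toString emitted decoded text, escaped markup would turn into real markup on output. `decodeSketch` is a hypothetical three-entity decoder written for this note, and the template strings only imitate serialization; neither is the library's code.

```typescript
// Hypothetical decoder limited to the three entities used below.
function decodeSketch(input: string): string {
	const table: Record<string, string> = { '&lt;': '<', '&gt;': '>', '&amp;': '&' };
	return input.replace(/&(?:lt|gt|amp);/g, (entity) => table[entity] ?? entity);
}

const rawContent = '&lt;b&gt; is the bold tag';

// Serializing the raw text keeps the entities, so re-parsing yields the same text node.
const serializedRaw = `<p>${rawContent}</p>`;
// Serializing the decoded text turns the escaped tag name into an actual, unclosed tag.
const serializedDecoded = `<p>${decodeSketch(rawContent)}</p>`;

console.log(serializedRaw);     // <p>&lt;b&gt; is the bold tag</p>
console.log(serializedDecoded); // <p><b> is the bold tag</p>
```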

nonara · 2021-07-02T15:49:53Z

test/html.js

+
+				const textNode = divNode.firstChild;
+				textNode.rawText.should.eql(content);
+			});


This test ultimately did not need to be rewritten. However, I left my change in because it is more complete and easier to read. I also renamed it so that it is easier to understand what's happening. If you'd prefer, it can be changed back.

nonara · 2021-07-02T15:50:24Z

test/html.js

+
+				const textNode = pNode.firstChild;
+				textNode.text.should.eql(decodedText);
+				textNode.rawText.should.eql(encodedText);
 			});


It made sense to keep this test under the encode/decode heading, as this is core functionality.

nonara · 2021-07-02T16:17:14Z

As a secondary suggestion, I recall your saying you don't have much time for this repo anymore. If you'd like help maintaining it, I'd be happy to come on board.

I currently maintain several large, critical libraries in the parser/compiler space. Because we're looking at a major version bump, if it made you feel better about it, I could also go through and fix the active issues that are bugs, as well as clean up the tests and update them to use jest. That way we can test against uncompiled source. I also can be sure to maintain (and likely improve) performance time, as I do/did with node-html-markdown, obtaining 3x faster than the leading node compiler.

I could also add GitHub Actions to automate testing on pull requests, pushes, etc. That can make reviewing PRs a lot easier.

Let me know if any of that sounds good. Otherwise, just getting this wrapped up is fine.

nonara · 2021-07-02T17:08:59Z

src/nodes/text.ts

 		super(parentNode);
+		this._rawText = rawText;
 	}


This change allows us to automatically reset trimmed text caches on update. I added the caching in a previous PR. This change prevents a future bug report with existing code.

taoqf · 2021-07-05T05:48:01Z

Thank you for all your work.

taoqf · 2021-07-05T05:49:13Z

Should I increase the major version? I have not published a new version yet.

nonara · 2021-07-05T17:27:51Z

@taoqf You're very welcome! Thanks for addressing it so quickly! Yes, I would definitely publish as a new major version.

I'd release as v4.0.0 and add a release with the following notes:

Breaking Changes

TextNode#text now properly returns HTML-decoded text. To access the raw (undecoded) text, use TextNode#rawText
HtmlNode#structuredText now returns HTML-decoded text

taoqf · 2021-07-06T01:11:59Z

https://github.com/taoqf/node-html-parser/releases/tag/v4.0.0

nonara added 2 commits June 30, 2021 16:52

fix: TextNode text property is not decoded (fixes taoqf#134)

7dea8d7

style: lint fixes

e77fe84

nonara marked this pull request as draft June 30, 2021 21:22

nonara added 2 commits June 30, 2021 17:49

fix: TextNode#toString should use rawText

996320e

test: Add / update tests

a4f345b

nonara marked this pull request as ready for review June 30, 2021 22:06

nonara changed the title ~~fix: TextNode text property is not decoded (fixes #134)~~ fix!: TextNode text property is not decoded (fixes #134) Jun 30, 2021

style: Normalize indents to tabs

c445f31

taoqf added a commit that referenced this pull request Jul 2, 2021

decode textContent #135

840ffda

refactor: Remove utils file

7d6c367

style: Normalize indents to tabs

e728280

taoqf merged commit 4f13096 into taoqf:main Jul 5, 2021
