Node.text() does not respect changes from Node.unwrap_tags #68

NixBiks · 2022-09-06T15:40:26Z

It seems that node.text() does not reflect the mutated tree after calling node.unwrap_tags. I'd expect the following to pass:

from selectolax.parser import HTMLParser


tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
tree.unwrap_tags(["strong"])
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

but instead I get:

AssertionError: J ohn Doe  != John Doe

NixBiks · 2022-09-06T15:56:51Z

I currently have a rather ugly workaround (parsing the HTML twice):

pre_tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
pre_tree.unwrap_tags(["strong"])

tree = HTMLParser(pre_tree.html)
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

rushter · 2022-09-07T04:58:05Z

That happens because unwrapping the strong tag leaves two adjacent text nodes:

for node in tree.root.traverse(include_text=True):
    print(node.text(deep=False), node.tag)
 html
 head
 body
 div
John p
J -text
ohn -text
Doe p
Doe -text

I'm not sure what to do about it yet. Technically, I think this behaviour is correct, but for the majority of users it would be unexpected. To make it work, we need to merge two text nodes.
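To make the idea of merging concrete, here is a minimal sketch on a toy tree. It does not use selectolax and is not how the library implements anything: the Element class and the merge_adjacent_text and text helpers are hypothetical names invented for this illustration, and text nodes are modelled as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Toy element node (hypothetical): a tag name plus ordered children."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)


def merge_adjacent_text(node: Element) -> None:
    """Join runs of neighbouring text children in place, recursing into elements."""
    merged: List[Union[Element, str]] = []
    for child in node.children:
        if isinstance(child, str) and merged and isinstance(merged[-1], str):
            merged[-1] += child  # append to the previous text node
        else:
            if isinstance(child, Element):
                merge_adjacent_text(child)
            merged.append(child)
    node.children = merged


def text(node: Element, separator: str = " ") -> str:
    """Collect every text node in document order and join them with `separator`."""
    parts: List[str] = []

    def walk(n: Element) -> None:
        for child in n.children:
            if isinstance(child, str):
                parts.append(child)
            else:
                walk(child)

    walk(node)
    return separator.join(parts)


# Tree state after <strong> has been unwrapped: "J" and "ohn" are sibling text nodes.
div = Element("div", [Element("p", ["J", "ohn"]), Element("p", ["Doe"])])
before = text(div)
merge_adjacent_text(div)
after = text(div)
print(before)  # J ohn Doe
print(after)   # John Doe
```

The separator is inserted between text nodes, not between elements, which is why the split "J" / "ohn" pair shows up as "J ohn" until the two nodes are joined.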

rushter · 2022-09-07T05:05:55Z

I get the same behavior in Chrome:

NixBiks · 2022-09-07T07:02:12Z

Ahh, I see. I guess I'd need to iterate through the node and merge all adjacent -text siblings before calling text(). Not sure if this should be part of node.text?

rushter · 2022-09-07T13:04:31Z

I think adding a separate merge_text_nodes method will be sufficient.

rushter · 2022-09-07T19:00:23Z

Added a PoC:

selectolax/tests/test_nodes.py, line 537 in abc9a3c:

def test_merge_text_nodes(parser):

It needs more tests.

rushter · 2022-09-20T06:43:50Z

I've made a new release that supports merge_text_nodes.

NixBiks · 2022-09-20T06:50:56Z

@rushter you're a legend!

NixBiks · 2022-09-20T07:58:01Z

Shouldn't it be added to parser.pyi as well?

rushter · 2022-09-20T08:03:05Z

Yes, I've updated it. Thank you for checking it.

NixBiks · 2022-09-20T08:22:10Z

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed.
Corrupt value: 0xb0000000129e3f00
python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some HTML:

def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

I'm running on a MacBook Pro M1. Not sure if that is relevant.

rushter · 2022-09-20T08:32:38Z

Is it possible to provide the HTML that crashes it?

NixBiks · 2022-09-20T09:54:50Z

> Is it possible to provide the HTML that crashes it?

Yes, but it is inconsistent. I can send you a JSON file with around 300 HTML documents. When I loop through them, it crashes at random points.

rushter · 2022-09-20T10:09:20Z

OK, please send the HTML to me.

NixBiks · 2022-09-20T10:18:00Z

I've just sent you a JSON file containing an array with a single object that has an html key. If I run the code on that HTML repeatedly, it crashes after a random number of iterations. The script I sent runs over the HTML 300 times. Here it is for reference:

import json
from selectolax.parser import HTMLParser


DEFAULT_UNWRAP_TAGS = [
    "a",
    "abbr",
    "acronym",
    "b",
    "bdo",
    "big",
    "br",
    "button",
    "cite",
    "code",
    "dfn",
    "em",
    "i",
    "img",
    "input",
    "kbd",
    "label",
    "map",
    "object",
    "output",
    "q",
    "samp",
    "script",
    "select",
    "small",
    "span",
    "strong",
    "textarea",
    "time",
    "tt",
    "var",
]
DEFAULT_REMOVE_TAGS = ["sub", "sup", "table"]


with open("html.json", "r") as f:
    data = json.load(f)

html = data[0]["html"]

def html_to_text(html: str) -> str:
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

for i in range(300):
    print(i)
    html_to_text(html)

rushter · 2022-09-20T19:12:25Z

I think I fixed it. Can you please double-check by cloning the git repo and installing the dev version?

NixBiks · 2022-09-21T09:42:26Z

> I think I fixed it. Can you please double-check by cloning the git repo and installing the dev version?

I agree. It works!

rushter closed this as completed Sep 20, 2022

rushter reopened this Sep 20, 2022

rushter closed this as completed Dec 17, 2022
