Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node.text() does not respect changes from Node.unwrap_tags #68

Closed
NixBiks opened this issue Sep 6, 2022 · 17 comments
Closed

Node.text() does not respect changes from Node.unwrap_tags #68

NixBiks opened this issue Sep 6, 2022 · 17 comments

Comments

@NixBiks
Copy link
Contributor

NixBiks commented Sep 6, 2022

It seems that node.text() does not respect the mutated node after calling node.unwrap_tags. I'd expect the following to pass

from selectolax.parser import HTMLParser


tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
tree.unwrap_tags(["strong"])
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

but instead I get

AssertionError: J ohn Doe  != John Doe
@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 6, 2022

I have quite the ugly workaround currently (parsing the html twice)

pre_tree = tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
pre_tree.unwrap_tags(["strong"])

tree = HTMLParser(pre_tree.html)
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

@rushter
Copy link
Owner

rushter commented Sep 7, 2022

That happens because you get two text nodes close to each other when removing the strong tag.

for node in tree.root.traverse(include_text=True):
    print(node.text(deep=False), node.tag)
 html
 head
 body
 div
John p
J -text
ohn -text
Doe p
Doe -text

I'm not sure what to do about it yet. Technically, I think this behaviour is correct, but for the majority of users it would be unexpected. To make it work, we need to merge two text nodes.

@rushter
Copy link
Owner

rushter commented Sep 7, 2022

I get the same behavior in Chrome:
image

image

@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 7, 2022

Ahh I see. I guess I'd need to iterate through the node and merge all -text siblings before. Not sure if this should be part of node.text?

@rushter
Copy link
Owner

rushter commented Sep 7, 2022

I think adding a separate merge_text_nodes method will be sufficient.

@rushter
Copy link
Owner

rushter commented Sep 7, 2022

Added a PoC:

def test_merge_text_nodes(parser):

It needs more tests.

@rushter
Copy link
Owner

rushter commented Sep 20, 2022

I've made a new release that supports merge_text_nodes.

@rushter rushter closed this as completed Sep 20, 2022
@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 20, 2022

@rushter you're a legend !

@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 20, 2022

Shouldn't it be added to parser.pyi as well?

@rushter
Copy link
Owner

rushter commented Sep 20, 2022

Yes, I've updated it. Thank you for checking it.

@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 20, 2022

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed.
Corrupt value: 0xb0000000129e3f00
python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some html

def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

I'm running on Macbook Pro M1 - not sure if that is relevant

@rushter rushter reopened this Sep 20, 2022
@rushter
Copy link
Owner

rushter commented Sep 20, 2022

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed.
Corrupt value: 0xb0000000129e3f00
python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some html

def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

I'm running on Macbook Pro M1 - not sure if that is relevant

Is it possible to provide the HTML that crashes it?

@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 20, 2022

Is it possible to provide the HTML that crashes it?

Yes but it is inconstent. I can send you a JSON file with 300'ish HTML. When I loop through then it crashes at random times.

@rushter
Copy link
Owner

rushter commented Sep 20, 2022

Ok, please send HTML to me.

@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 20, 2022

I've just sent you a JSON with a array with only one object that has html key. If I run the code on that html multiple times then it breaks. It's random how many times. In the script I sent it is running over the html 300 times. Script here for reference

import json
from selectolax.parser import HTMLParser


DEFAULT_UNWRAP_TAGS = [
    "a",
    "abbr",
    "acronym",
    "b",
    "bdo",
    "big",
    "br",
    "button",
    "cite",
    "code",
    "dfn",
    "em",
    "i",
    "img",
    "input",
    "kbd",
    "label",
    "map",
    "object",
    "output",
    "q",
    "samp",
    "script",
    "select",
    "small",
    "span",
    "strong",
    "textarea",
    "time",
    "tt",
    "var",
]
DEFAULT_REMOVE_TAGS = ["sub", "sup", "table"]


with open("html.json", "r") as f:
    data = json.load(f)

html = data[0]["html"]

def html_to_text(html: str) -> str:
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

for i in range(300):
    print(i)
    html_to_text(html)

@rushter
Copy link
Owner

rushter commented Sep 20, 2022

I think I fixed it. Can you please double check by clonning the git repo and installing dev version?

@NixBiks
Copy link
Contributor Author

NixBiks commented Sep 21, 2022

I think I fixed it. Can you please double check by clonning the git repo and installing dev version?

I agree. It works !

@rushter rushter closed this as completed Dec 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants