Node.text() does not respect changes from Node.unwrap_tags #68
Comments
I currently have quite an ugly workaround (parsing the HTML twice):
from selectolax.parser import HTMLParser
pre_tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
pre_tree.unwrap_tags(["strong"])
tree = HTMLParser(pre_tree.html)
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe" |
That happens because you get two text nodes close to each other when removing the
I'm not sure what to do about it yet. Technically, I think this behaviour is correct, but for the majority of users it would be unexpected. To make it work, we need to merge two text nodes. |
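To illustrate why the separator ends up inside "John", here is a minimal toy model of the situation. It is not selectolax's implementation; the `Text` and `Element` classes and the `unwrap`, `merge_text`, and `text` helpers are hypothetical names made up for this sketch. It only shows that unwrapping leaves two adjacent text nodes, and that fusing them before joining gives the expected result.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Element:
    tag: str
    children: List[Union["Element", Text]] = field(default_factory=list)


def unwrap(node: Element, tags: set) -> None:
    """Replace each descendant element whose tag is in `tags` with its children."""
    flat = []
    for child in node.children:
        if isinstance(child, Element):
            unwrap(child, tags)
            if child.tag in tags:
                flat.extend(child.children)
                continue
        flat.append(child)
    node.children = flat


def merge_text(node: Element) -> None:
    """Fuse runs of adjacent Text children into a single Text node."""
    merged = []
    for child in node.children:
        if isinstance(child, Element):
            merge_text(child)
            merged.append(child)
        elif merged and isinstance(merged[-1], Text):
            merged[-1] = Text(merged[-1].value + child.value)
        else:
            merged.append(child)
    node.children = merged


def text(node: Element, separator: str = "") -> str:
    """Join every text node in document order with `separator`."""
    parts = []

    def walk(current: Element) -> None:
        for child in current.children:
            if isinstance(child, Text):
                parts.append(child.value)
            else:
                walk(child)

    walk(node)
    return separator.join(parts)


# Toy equivalent of <div><p><strong>J</strong>ohn</p><p>Doe</p></div>
tree = Element("div", [
    Element("p", [Element("strong", [Text("J")]), Text("ohn")]),
    Element("p", [Text("Doe")]),
])

unwrap(tree, {"strong"})
before = text(tree, separator=" ")  # "J" and "ohn" are still separate nodes
merge_text(tree)
after = text(tree, separator=" ")   # the two nodes are now one

print(repr(before), repr(after))
```

In this model the separator is inserted between every pair of text nodes, so the text only comes out as expected once neighbouring nodes have been fused.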
Ahh I see. I guess I'd need to iterate through the node and merge all |
I think adding a separate |
Added a PoC: selectolax/tests/test_nodes.py Line 537 in abc9a3c
It needs more tests. |
I've made a new release that supports |
@rushter you're a legend! |
Shouldn't it be added to |
Yes, I've updated it. Thank you for checking it. |
I'm afraid there is a memory issue with this thing:
This is what I apply to some HTML:
def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)
I'm running on a MacBook Pro M1; not sure if that is relevant. |
Is it possible to provide the HTML that crashes it? |
Yes, but it is inconsistent. I can send you a JSON file with 300-ish HTML documents. When I loop through them, it crashes at random times. |
Ok, please send HTML to me. |
I've just sent you a JSON file containing an array with only one object, which has an html key. If I run the code on that HTML multiple times, it breaks. The number of runs before it breaks is random. In the script I sent, it runs over the HTML 300 times. Here is the script for reference:
import json
from selectolax.parser import HTMLParser
DEFAULT_UNWRAP_TAGS = [
    "a",
    "abbr",
    "acronym",
    "b",
    "bdo",
    "big",
    "br",
    "button",
    "cite",
    "code",
    "dfn",
    "em",
    "i",
    "img",
    "input",
    "kbd",
    "label",
    "map",
    "object",
    "output",
    "q",
    "samp",
    "script",
    "select",
    "small",
    "span",
    "strong",
    "textarea",
    "time",
    "tt",
    "var",
]
DEFAULT_REMOVE_TAGS = ["sub", "sup", "table"]
with open("html.json", "r") as f:
    data = json.load(f)
html = data[0]["html"]
def html_to_text(html: str) -> str:
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)
for i in range(300):
    print(i)
    html_to_text(html) |
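Because a crash in native code takes the whole interpreter down, a loop like the one above cannot report the failure itself. One possible way to make such a reproduction easier to observe is to run the script in a child process and inspect its exit status. The sketch below uses only the standard library; `run_isolated` and `first_failure` are hypothetical helper names, and the snippets passed in are stand-ins for the real reproduction script.

```python
import subprocess
import sys
from typing import Optional


def run_isolated(code: str, timeout: float = 30.0) -> int:
    """Run `code` in a fresh Python interpreter and return its exit status.

    On POSIX systems a negative status means the child was killed by a
    signal (for example, a segmentation fault).
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode


def first_failure(code: str, attempts: int) -> Optional[int]:
    """Return the index of the first attempt that exits non-zero, or None."""
    for i in range(attempts):
        if run_isolated(code) != 0:
            return i
    return None


# Stand-in snippets: one exits cleanly, the other exits with status 3.
ok_status = run_isolated("pass")
bad_status = run_isolated("import sys; sys.exit(3)")
print(ok_status, bad_status)
```

In practice the child code would be the full loop over the HTML, so that a crash on any iteration shows up as a non-zero status in the parent rather than ending the run silently.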
I think I fixed it. Can you please double-check by cloning the git repo and installing the dev version? |
I agree. It works! |
It seems that node.text() does not respect the mutated node after calling node.unwrap_tags. I'd expect the following to pass
but instead I get