Segmentation fault (core dumped) when calling "attributes" of a node #9

Closed
aleasims opened this issue Feb 12, 2019 · 2 comments


aleasims commented Feb 12, 2019

First of all, thanks for a really good and extremely fast HTML parser for Python.
I ran into a problem with this file:
bad_html.txt
(It was probably corrupted while being dumped, but that doesn't matter here.)

I find the meta tags in it with the HTMLParser.tags() function and then try to access the attributes of the nodes it returns. On the fourth node, Python crashes with a segfault.
Code:

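from selectolax.parser import HTMLParser

html = open(path, 'r').read()  # path points to the attached bad_html.txt
parser = HTMLParser(html)
for tag in parser.tags('meta'):
    print(tag.attributes)  # crashes on the fourth <meta> node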

Result:

{'charset': 'utf-8'}
{'content': 'IE=edge', 'http-equiv': 'X-UA-Compatible'}
{'content': 'Ñника, ÑозеÑки и вÑклÑÑаÑели ÑнайдеÑ, вÑклÑÑаÑели schneider electric, ÑозеÑки ÑнайдеÑ, ÑозеÑки schneider, schneider electric ÑозеÑки и вÑклÑÑаÑели, вÑклÑÑаÑели ÑнайдеÑ, unica ÑозеÑки, unica schneider, вÑклÑÑаÑели unica, schneider electric unica, Ñамки unica, Ð\xa0озеÑки schneider electric, ÑозеÑки ÑÐ½Ð°Ð¹Ð´ÐµÑ ÑлекÑÑик', 'name': 'keywords'}
Segmentation fault (core dumped)

What makes this worse is that a SIGSEGV cannot be handled properly from within Python.
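
One possible way to contain such a crash is to run the risky parsing in a child interpreter and inspect how it exited. The sketch below uses only the standard library; run_isolated and describe_exit are hypothetical helper names, not part of selectolax, and the negative-return-code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=30):
    """Run a Python snippet in a child interpreter (hypothetical helper).

    -X faulthandler makes the child dump a Python traceback to stderr
    if it receives SIGSEGV, before it dies.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


def describe_exit(returncode):
    """Turn a child's return code into a readable status."""
    if returncode < 0:
        # On POSIX, a negative return code means "killed by signal N".
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode
```

With this, the parent process survives even if the child segfaults, and describe_exit(returncode) tells a crash apart from a normal exit.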

I hope you will be able to fix this "fatal error".

My environment:

  • selectolax 0.1.9
  • Python 3.5.2
  • Linux 4.15.0-45-generic (Ubuntu 16.04)

UPDATE
I may have found the weak spot, because this:

html = open(path, 'r').read()
parser = HTMLParser(html)
parser = HTMLParser(parser.html)  # re-parse the parser's own serialized output
for tag in parser.tags('meta'):
    print(tag.attributes)

works! A " character occurs inside a meta tag, and that seems to cause the problem. It may help to compare the original HTML text, parser.html, and HTMLParser(parser.html).html.
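
A rough sketch of how such a comparison could be done with difflib from the standard library. The helper name show_diff is made up for illustration, and splitting on "><" is only a crude way to get roughly one tag per line; the two strings are supplied by the caller.

```python
import difflib


def show_diff(before, after, before_name="original", after_name="reparsed"):
    """Return a unified diff of two HTML strings (hypothetical helper)."""
    def split(text):
        # Crude split so that adjacent tags land on separate lines.
        return text.replace("><", ">\n<").splitlines()

    return "\n".join(
        difflib.unified_diff(
            split(before),
            split(after),
            fromfile=before_name,
            tofile=after_name,
            lineterm="",
        )
    )
```

For the case above, one would call something like show_diff(html, parser.html) and show_diff(parser.html, HTMLParser(parser.html).html) and look at the lines around the meta tag that contains the stray quote.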

rushter added a commit that referenced this issue Feb 13, 2019
Owner

rushter commented Feb 13, 2019

I've fixed your problem.

Please don't close this issue yet. I will investigate it further later, because I'm not sure whether this is expected behavior from the Modest engine.

@aleasims (Author)

Great, it seems to work fine now! Thank you very much for quickly solving the issue.
