Incorrect cleaning of <img> tag #92
Comments
Thanks for pointing this out. This is probably because safehtml doesn't have a way of dealing with self-closing tags (img, input, meta, etc.). We will need to look into a suitable way to handle them.
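For illustration, here is a minimal sketch of one way a cleaner can treat void tags. It is not scrapely's implementation: it uses the standard library's `html.parser`, and the names `VoidAwareCleaner`, `clean`, `ALLOWED_TAGS` and `REPLACEMENTS` are hypothetical. The idea is that a void element is handled entirely at its start tag, so the cleaner never waits for a closing tag that may not exist.

```python
from html import escape
from html.parser import HTMLParser

# Void elements from the HTML specification: they never take a closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}

# Hypothetical whitelist and replacement table, for illustration only.
ALLOWED_TAGS = {"strong", "em", "p", "br"}
REPLACEMENTS = {"b": "strong", "i": "em"}


class VoidAwareCleaner(HTMLParser):
    """Keeps text and whitelisted tags; drops every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        tag = REPLACEMENTS.get(tag, tag)
        if tag in ALLOWED_TAGS:
            # Attributes are discarded on purpose.
            self.parts.append("<%s>" % tag)
        # A disallowed tag (void or not) is dropped; following text is kept.

    def handle_endtag(self, tag):
        tag = REPLACEMENTS.get(tag, tag)
        # Never emit a closing tag for a void element, e.g. a stray </img>.
        if tag in ALLOWED_TAGS and tag not in VOID_ELEMENTS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape them.
        self.parts.append(escape(data, quote=False))


def clean(html):
    parser = VoidAwareCleaner()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace runs left behind by removed tags.
    return " ".join("".join(parser.parts).split())


print(clean('my <img href="http://fake.url"> img is <b>cool</b>'))
# my img is <strong>cool</strong>
```

With this approach the same result comes back whether or not the input contains a stray `</img>`, because the end-tag handler ignores closing tags for void elements.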
Handle void tags when processing as safehtml #92
@ruairif First, sorry for replying so late (it's been a crazy few days). Second, thanks for creating a PR and solving this. Although I didn't reply to your PR, I checked out the code and liked the way you approached the problem. Thanks a lot!
Hi guys, I'm currently facing the same problem pointed out by @victormartinez! Given that the bug was fixed in commit c2878f1, do you know when a new release will be published? Thanks in advance!
I tested here with and without the closing tag and it worked very well:
>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is </img><b>cool</b>')
'my img is <strong>cool</strong>'
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my img is <strong>cool</strong>'
Hi guys, I was looking for an HTML cleaner and found one inside the Scrapely lib. After some trials, I found a bug that I believe is critical.
The img tag is expected to appear in the self-closing form (<img src='github.png' />), but it might also appear like this: <img src='stackoverflow.png'>. In this case, safehtml cleans the text incorrectly. For example, see the test in the terminal:
IMHO, the output was expected to be my img is <strong>cool</strong>. The same behavior is seen with the <input> tag.
Best regards,