Incorrect cleaning of <img> tag #92

Closed
victormartinez opened this issue Jun 20, 2016 · 4 comments

Comments

victormartinez commented Jun 20, 2016

Hi guys, I was looking for an HTML cleaner and found one inside the Scrapely library. After some trials, I found a bug that I believe is critical.

An img tag is expected to appear in self-closing form (<img src='github.png' />), but it can also appear without the trailing slash: <img src='stackoverflow.png'>. In that case, safehtml cleans the text incorrectly and drops everything after the tag. For example, see this test in the terminal:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my'

IMHO, the expected output is my img is <strong>cool</strong>. The same behavior occurs with the <input> tag.

Best regards,

Collaborator

ruairif commented Jun 21, 2016

Thanks for pointing this out. This is probably because safehtml doesn't have a way of dealing with self-closing (void) tags such as img, input, and meta. We will need to look into a suitable way to handle them.
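To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not Scrapely's code: the `clean` function, the `void_aware` flag, and the tag sets are all hypothetical, and it assumes the cleaner discards everything between a purged opening tag and its matching closing tag. Under that assumption, an `<img>` that is never closed swallows the rest of the input unless the cleaner knows that void elements have no closing tag.

```python
import re

# Assumption: tags whose entire content this hypothetical cleaner discards.
PURGE_TAGS = {"script", "img", "input"}
# Void elements as defined by HTML; they never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Hypothetical renames, mirroring the <b> -> <strong> output shown in this thread.
REPLACE_TAGS = {"b": "strong", "i": "em"}
ALLOWED_TAGS = {"strong", "em", "p", "br"}

# Matches either a tag (optional leading "/", name, optional trailing "/")
# or a run of text. Good enough for a sketch, not a real HTML parser.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>|([^<]+)")


def clean(html, void_aware=True):
    """Return html with purged tags dropped and allowed tags normalized."""
    out = []
    skip_until = None  # name of the purged tag whose closing tag we await
    for match in TOKEN_RE.finditer(html):
        closing, name, self_closed, text = match.groups()
        if text is not None:
            if skip_until is None:
                out.append(text)
            continue
        name = name.lower()
        if skip_until is not None:
            # Inside a purged element: drop everything up to its closing tag.
            if closing and name == skip_until:
                skip_until = None
            continue
        if name in PURGE_TAGS:
            is_void = void_aware and name in VOID_TAGS
            # Only start skipping when a closing tag can actually follow.
            if not closing and not self_closed and not is_void:
                skip_until = name
            continue
        name = REPLACE_TAGS.get(name, name)
        if name in ALLOWED_TAGS:
            out.append("</%s>" % name if closing else "<%s>" % name)
    return "".join(out)


sample = 'my <img href="http://fake.url"> img is <b>cool</b>'
print(repr(clean(sample, void_aware=False)))  # 'my '
print(repr(clean(sample, void_aware=True)))   # 'my  img is <strong>cool</strong>'
```

With `void_aware=False` the sketch waits for an `</img>` that never arrives, so only the text before the tag survives, which resembles the reported output. With `void_aware=True` the tag is dropped on its own and the remaining text is kept.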

kmike added a commit that referenced this issue Jun 30, 2016
Handle void tags when processing as safehtml #92
Author

victormartinez commented Aug 9, 2016

@ruairif First, sorry for replying so late (it's been a crazy few days). Second, thanks for creating a PR and solving this. Although I didn't reply on your PR, I checked out the code and liked the way you approached the problem.

Thanks a lot!

DavidPinho commented Aug 9, 2016

Hi guys, I'm currently facing the same problem that @victormartinez pointed out! Since the bug was fixed in commit c2878f1, do you know when a new release will be published?

Thanks in advance!!

@robsonpeixoto
Contributor

I tested it here with and without the closing tag, and it worked well in both cases:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is </img><b>cool</b>')
'my  img is <strong>cool</strong>'
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my  img is <strong>cool</strong>'

ruairif closed this as completed Dec 21, 2016