Incorrect cleaning of <img> tag #92

victormartinez · 2016-06-20T18:00:02Z

Hi guys, I was looking for an HTML cleaner and found one inside the Scrapely lib. After some trials, I found a bug that I believe is critical.

The img tag is expected to appear in self-closing form (<img src='github.png' />), but it may also appear without the trailing slash: <img src='stackoverflow.png'>. In that case, safehtml cleans the text incorrectly and drops everything after the tag. For example, see this test in the terminal:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my'

IMHO, the expected output is 'my img is <strong>cool</strong>'. The same behavior occurs with the <input> tag.

Best regards,

ruairif · 2016-06-21T09:16:31Z

Thanks for pointing this out. This is probably because safehtml doesn't have a way of dealing with void tags, i.e. tags that take no closing tag (img, input, meta, etc.). We will need to look into a suitable way to handle them.
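
To illustrate how a missing closing tag can swallow the rest of the text, here is a minimal sketch built on the standard library's html.parser. It is not Scrapely's implementation: the class name SketchCleaner, the PURGE, KEEP and RENAME sets, and the void_aware flag are all hypothetical and chosen only for this example. The assumption is a cleaner that discards certain tags together with their content by skipping input until the matching closing tag; for a void tag that closing tag never arrives.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML specification: they never take a closing tag.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})

# Hypothetical policy for this sketch only (not Scrapely's real configuration).
PURGE = {"script", "img", "input"}   # removed together with their content
KEEP = {"strong", "em", "p"}         # emitted as-is
RENAME = {"b": "strong", "i": "em"}  # rewritten before the KEEP check


class SketchCleaner(HTMLParser):
    def __init__(self, void_aware=True):
        super().__init__(convert_charrefs=True)
        self.void_aware = void_aware
        self.out = []
        self.skip_until = None  # name of the tag whose closing tag ends a purge

    def handle_starttag(self, tag, attrs):
        if self.skip_until is not None:
            return  # inside purged content
        if tag in PURGE:
            if self.void_aware and tag in VOID_TAGS:
                return  # a void tag has no content, so there is nothing to skip
            self.skip_until = tag
            return
        tag = RENAME.get(tag, tag)
        if tag in KEEP:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if self.skip_until is not None:
            if tag == self.skip_until:
                self.skip_until = None
            return
        tag = RENAME.get(tag, tag)
        if tag in KEEP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_until is None:
            self.out.append(data)


def clean(html, void_aware=True):
    parser = SketchCleaner(void_aware=void_aware)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = 'my <img href="http://fake.url"> img is <b>cool</b>'
print(repr(clean(sample, void_aware=False)))  # 'my '
print(repr(clean(sample)))                    # 'my  img is <strong>cool</strong>'
```

With void_aware=False the sketch waits for an </img> that never comes and discards the remaining text, which resembles the symptom reported above. Checking the tag against a list of void elements before starting the skip avoids that.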

Handle void tags when processing as safehtml #92

victormartinez · 2016-08-09T16:00:00Z

@ruairif First, sorry for taking so long to reply (it's been a crazy few days). Second, thanks for creating a PR and fixing this. Although I didn't reply on your PR, I checked out the code and liked the way you approached the problem.

Thanks a lot!

DavidPinho · 2016-08-09T16:02:23Z

Hi guys, I'm currently facing the same problem pointed out by @victormartinez! Since the bug was fixed in commit c2878f1, do you know when a new release will be published?

Thanks in advance!!

robsonpeixoto · 2016-08-09T16:13:14Z

I tested here with and without the closing tag, and it worked well in both cases:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is </img><b>cool</b>')
'my  img is <strong>cool</strong>'
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my  img is <strong>cool</strong>'

ruairif added a commit that referenced this issue Jun 21, 2016

Handle void tags when processing as safehtml #92

c2878f1

kmike added a commit that referenced this issue Jun 30, 2016

Merge pull request #93 from scrapy/92_handle_void_tags_with_safehtml

1c3e857

ruairif closed this as completed Dec 21, 2016
