@shaneaevans
Member

These functions are loosely based on the encoding code in scrapy.

Main differences:

  • tweaks to the regular expressions for encoding detection in HTML; a single regexp now handles both HTML and XML
  • handles byte order marks
  • better handling of character encoding overrides, with an updated list
  • does not fall back to BeautifulSoup; instead, auto-detection is customizable and disabled by default
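
As a rough illustration of the byte order mark handling mentioned above, here is a minimal sketch using only the standard library. The name `sniff_bom` and the table `_BOMS` are hypothetical and chosen for this example; this is not necessarily how the code in this pull request is written.

```python
import codecs

# Hypothetical lookup table for this sketch. Longer BOMs are listed first
# because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom) if data starts with a known BOM, else (None, None)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, bom
    return None, None
```

The ordering of the table matters: checking UTF-16 LE before UTF-32 LE would misreport UTF-32 LE input as UTF-16.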

This is based on the encoding detection in scrapy
Member

Are Python byte slices cheap?

Just pointing out that rest_of_data is copied and later dropped when bom_enc and enc don't match.
I know it's a rare case, but it wastes memory for big responses, or for sites that send a BOM and a different transport encoding for all their pages.

I don't think this is a merge blocker, but it's worth pointing out. Once it's in the wild, we can optimize if it affects any real case.
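
For context on the question: slicing a `bytes` object in Python creates a new object and copies the sliced data, while slicing a `memoryview` does not. The sketch below shows one way to avoid the wasted copy by slicing only after the BOM encoding and the declared encoding are known to agree. `strip_utf8_bom` is a hypothetical helper written for this illustration, it handles only UTF-8, and it is not the code from this pull request.

```python
import codecs


def strip_utf8_bom(data: bytes, declared_encoding: str) -> bytes:
    """Hypothetical helper: drop a UTF-8 BOM only if the declared encoding agrees."""
    bom = codecs.BOM_UTF8
    has_bom = data.startswith(bom)
    # codecs.lookup normalizes aliases such as "UTF8" to the canonical "utf-8".
    agrees = codecs.lookup(declared_encoding).name == "utf-8"
    if has_bom and agrees:
        return data[len(bom):]  # the only copy, made when it is actually used
    return data  # mismatch or no BOM: return the original object, no copy


body = codecs.BOM_UTF8 + b"<html></html>"

# A memoryview slice refers to the same buffer instead of copying it.
view = memoryview(body)[len(codecs.BOM_UTF8):]
```

Deferring the slice keeps the rare mismatch case free of extra allocation, which is the concern raised here for large responses.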

Member Author

It should be pretty rare that we have BOMs that differ from the encoding in the headers, and it's not that expensive. But still - why take the risk? I'll change it.

The content type header parameter to html_to_unicode has been documented
more clearly.
shaneaevans added a commit that referenced this pull request Feb 14, 2012
Add encoding functions for converting html to unicode
@shaneaevans shaneaevans merged commit 9f39f99 into scrapy:master Feb 14, 2012
wRAR pushed a commit that referenced this pull request Aug 24, 2021
Improve ParseDataURIResult documentation
kmike pushed a commit that referenced this pull request Jun 16, 2022
For issue #162 Add different regex pattern to search for meta tags