Fix UnicodeDecodeError while scraping from non utf-8 page. #810

Closed
wants to merge 7 commits into
from

2 participants
Contributor

methane commented Feb 16, 2013

pip on Python3 raises UnicodeDecodeError.
https://gist.github.com/cocoatomo/4966725

Member

hltbra commented Feb 16, 2013

Do you know of any page or package that makes pip crash?

Contributor

methane commented Feb 16, 2013

@hltbra There are two patterns.

  1. pip install PLY causes a UnicodeDecodeError because http://www.dabeaz.com/ply/ is encoded in latin1.
  2. pip install docutils causes a UnicodeDecodeError because some download links in http://pypi.python.org/simple/docutils/ are not skipped (e.g. http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download)
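
Both patterns come down to bytes that are not valid UTF-8 being decoded as UTF-8. The snippet below is only an illustration of that failure, not pip's code; the sample bytes and the helper name `decodes_as_utf8` are made up for the example.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` can be decoded as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Pattern 1: text from a latin-1 encoded page (hypothetical sample content).
latin1_page = "Café".encode("latin-1")

# Pattern 2: the first bytes of a gzip archive, i.e. binary data, not text.
gzip_header = b"\x1f\x8b\x08\x00"

for sample in (latin1_page, gzip_header):
    print(sample, "valid UTF-8:", decodes_as_utf8(sample))

# Decoding the page with the encoding it actually uses succeeds.
print(latin1_page.decode("latin-1"))
```
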
Contributor

methane commented Feb 16, 2013

UnicodeDecodeError happens only with Python 3.
However, Python 2 also downloads the archives while scraping, which is very bad.

Member

hltbra commented Feb 26, 2013

Just took a look again at the diff and it seems you removed everything but the skipping algorithm. Is that correct?

Contributor

methane commented Feb 26, 2013

Old algorithm in pseudo code:

if filename.endswith(archive_exts):
    response = http.head(url)
    if response['content-type'] != "text/html":
        skip
response = http.get(url)

This algorithm doesn't work for redirecting URLs whose path does not end with an archive extension (e.g. http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download)
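
A rough sketch of why an extension test misses such a URL, assuming the check looks at the end of the URL path. `ARCHIVE_EXTS` and `looks_like_archive` are hypothetical names for this example, the extension list is a guess rather than pip's actual list, and the example.com URL is invented.

```python
from urllib.parse import urlparse

# Hypothetical list of archive extensions; pip's real list may differ.
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip")


def looks_like_archive(url: str) -> bool:
    """Return True if the URL path ends with a known archive extension."""
    return urlparse(url).path.lower().endswith(ARCHIVE_EXTS)


direct = "http://example.com/files/docutils-0.8.1.tar.gz"
redirecting = (
    "http://sourceforge.net/projects/docutils/files/docutils/0.8.1/"
    "docutils-0.8.1.tar.gz/download"
)

# The direct link is recognised, so the HEAD check would run for it.
print(looks_like_archive(direct))
# The sourceforge link ends in "/download", so the HEAD check is bypassed
# and the URL goes straight to GET.
print(looks_like_archive(redirecting))
```
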

New algorithm:

response = http.get(url)
if response['content-type'] != "text/html":
    skip

It uses a GET request but doesn't read the whole response.
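
As a minimal sketch of the new algorithm (not the code in this pull request), the function below issues a GET with `urllib.request.urlopen`, inspects the Content-Type header of the final response, and closes the connection without calling `read()` when the response is not HTML. The function name `fetch_if_html` and the `opener` parameter are assumptions added so the logic can be exercised without a network.

```python
import urllib.request

HTML_TYPES = ("text/html",)


def fetch_if_html(url, opener=urllib.request.urlopen):
    """GET `url` and return the body bytes only if the response is HTML.

    The Content-Type header is checked before the body is read, so a
    non-HTML response (such as an archive) is closed without being
    downloaded in full.
    """
    response = opener(url)
    try:
        # urlopen follows redirects, so these are the headers of the
        # final response rather than of the URL that was requested.
        content_type = response.headers.get("Content-Type", "")
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type not in HTML_TYPES:
            return None  # skip: the body is never read
        return response.read()
    finally:
        response.close()
```

Because the decision is made on the response that the redirect chain ends at, it no longer depends on what the original URL looks like.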

@ghost ghost assigned hltbra Mar 24, 2013

Member

hltbra commented Apr 13, 2013

Pull request #886 does two things differently when decoding the page content:

  1. try to get the charset and, if none is found, default to latin-1;
  2. test for text/plain in addition to text/html.
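
The two points above could look roughly like the sketch below. It is not the code from #886; `parse_content_type`, `decode_page`, and `ACCEPTED_TYPES` are hypothetical names, and the header parsing is deliberately simplistic.

```python
ACCEPTED_TYPES = ("text/html", "text/plain")


def parse_content_type(header: str):
    """Split a Content-Type header into (media_type, charset or None)."""
    parts = [part.strip() for part in header.split(";")]
    media_type = parts[0].lower()
    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            charset = value.strip().strip('"')
    return media_type, charset


def decode_page(content_type_header: str, body: bytes):
    """Return the decoded page text, or None if the page should be skipped."""
    media_type, charset = parse_content_type(content_type_header)
    if media_type not in ACCEPTED_TYPES:
        return None
    # With no declared charset, fall back to latin-1, which maps every
    # byte to a character and therefore cannot raise UnicodeDecodeError.
    return body.decode(charset or "latin-1")
```

An unknown charset name would still raise `LookupError` here; the sketch leaves that case unhandled.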

Closing this issue because of #886.

@hltbra hltbra closed this Apr 13, 2013
