Add Content-Type handling in get_page() to fix UnicodeDecodeError #862

wants to merge 2 commits into



soimort commented Mar 24, 2013

This fixes exactly the same issue as #850 and #810, but in a much simpler way.

When upgrading Sphinx:

$ pip install --upgrade Sphinx
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.3/threading.py", line 639, in _bootstrap_inner
  File "/usr/lib/python3.3/threading.py", line 596, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 286, in _get_queued_page
    page = self._get_page(location, req)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 404, in _get_page
    return HTMLPage.get_page(link, req, cache=self.cache)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 535, in get_page
    inst = cls(u(contents), real_url, headers)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/backwardcompat/__init__.py", line 51, in u
    return s.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

In this case, the URL it's trying to scrape is http://heanet.dl.sourceforge.net/project/docutils/docutils/0.8.1/docutils-0.8.1.tar.gz, which is obviously not an HTML page (the 0x8b byte in the error is the second byte of the gzip magic number), so there's no need for HTMLPage to decode and scrape its contents at all. Its response headers:

Date: Sun, 24 Mar 2013 16:49:34 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Tue, 30 Aug 2011 07:51:04 GMT
ETag: "203350-16e2b8-4abb446046a00"
Accept-Ranges: bytes
Content-Length: 1499832
Connection: close
Content-Type: application/x-gzip
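
For reference, the decode failure can be reproduced without pip. This is a minimal sketch, assuming only that the payload begins with the standard gzip magic bytes 1f 8b; the sample bytes and the `try_decode` helper are illustrative and not taken from the actual download or from pip.

```python
# Illustrative payload: the gzip magic number (0x1f 0x8b) followed by the
# deflate method byte. These are NOT bytes read from the real archive.
gzip_prefix = b"\x1f\x8b\x08"


def try_decode(data):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_decode(gzip_prefix)
# 0x1f is valid ASCII, so decoding only fails at index 1, on 0x8b.
print(type(result).__name__, result.start, hex(gzip_prefix[result.start]))
```

The failing offset and byte match the ones reported in the traceback, which is consistent with the body being a gzip archive rather than text.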

Added a few lines to skip such URLs based on their response headers (any response whose Content-Type is not text/html).
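
The idea can be sketched roughly as follows. This is not the code in the diff: `looks_like_html` is a hypothetical helper, and `headers` is assumed to be any mapping with a `.get()` method, standing in for the object returned by `resp.info()`.

```python
def looks_like_html(headers):
    """Return True if the Content-Type header names the text/html media type.

    `headers` is a hypothetical stand-in for the response headers object;
    any mapping with a .get() method works for this sketch.
    """
    # Fall back to an empty string so a missing header cannot raise.
    content_type = headers.get("Content-Type") or ""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


print(looks_like_html({"Content-Type": "application/x-gzip"}))
print(looks_like_html({"Content-Type": "text/html; charset=UTF-8"}))
```

A missing header is treated as non-HTML here; whether that is the right default for pip is a judgment call this sketch does not settle.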

@ghost ghost assigned hltbra Mar 24, 2013

soimort commented Mar 24, 2013

Just for the record, docutils is using a direct download link (instead of an HTML page) as its Download-URL:

<a href="http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download" rel="download">0.8.1 download_url</a>

which is the real cause of this sort of issue.


pnasrat commented Mar 30, 2013

@hltbra I believe there is some overlap with pull #874 here. What are you intending?


hltbra commented Mar 30, 2013

@pnasrat I need to take a deeper look at #850 and #810 and check if this pull request solves everything those pull requests solve.

@hltbra hltbra commented on the diff Apr 13, 2013

@@ -523,6 +523,13 @@ def get_page(cls, link, req, cache=None, skip_archives=True):
real_url = geturl(resp)
headers = resp.info()
+ content_type = headers.get('Content-Type', None)

hltbra Apr 13, 2013


It is not a good idea to use None as the default and then call content_type.lower() on the next line: if the header is missing, that raises AttributeError.
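
A small sketch of the pitfall and one possible way around it, assuming a plain dict as a hypothetical stand-in for the response headers (the real object comes from `resp.info()`):

```python
headers = {}  # hypothetical response that carries no Content-Type header

# Pattern under review: None as the default, then .lower() on the result.
try:
    headers.get("Content-Type", None).lower()
    failed = False
except AttributeError:
    # None has no .lower() method, so a missing header ends up here.
    failed = True

# One alternative: default to an empty string so .lower() is always valid.
content_type = headers.get("Content-Type", "").lower()

print(failed, repr(content_type))
```

An explicit `if content_type is None` check would work as well; the empty-string default just keeps the following line unchanged.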


hltbra commented Apr 13, 2013

Closing because #886 solves the same issue and others.

@hltbra hltbra closed this Apr 13, 2013
