
Add Content-Type handling in get_page() to fix UnicodeDecodeError #862

Closed
wants to merge 2 commits

3 participants

Mort Yao Paul Nasrat Hugo Lopes Tavares
Mort Yao

This fixes exactly the same issue as #850 and #810, but in a much simpler way.

When upgrading Sphinx:

$ pip install --upgrade Sphinx
...
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.3/threading.py", line 639, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.3/threading.py", line 596, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 286, in _get_queued_page
    page = self._get_page(location, req)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 404, in _get_page
    return HTMLPage.get_page(link, req, cache=self.cache)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 535, in get_page
    inst = cls(u(contents), real_url, headers)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/backwardcompat/__init__.py", line 51, in u
    return s.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

In this case, the URL it's trying to scrape is http://heanet.dl.sourceforge.net/project/docutils/docutils/0.8.1/docutils-0.8.1.tar.gz, which is obviously not an HTML page, so there is no actual need for HTMLPage to decode and scrape its contents at all. The response headers:

Date: Sun, 24 Mar 2013 16:49:34 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Tue, 30 Aug 2011 07:51:04 GMT
ETag: "203350-16e2b8-4abb446046a00"
Accept-Ranges: bytes
Content-Length: 1499832
Connection: close
Content-Type: application/x-gzip
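
The traceback follows directly from that Content-Type: every gzip stream begins with the magic bytes 0x1f 0x8b, and 0x8b can never be the start of a UTF-8 sequence. A minimal reproduction, using nothing beyond the standard library:

gzip_magic = b'\x1f\x8b\x08'  # the opening bytes of any .tar.gz response body

try:
    gzip_magic.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)
    # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte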

Added a few lines to skip invalid URLs based on their response headers (i.e., any URL whose Content-Type is not text/html).
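
The same guard, sketched standalone against the standard library (urllib.request here stands in for pip's own download machinery; names and defaults are illustrative, not pip's actual code):

import urllib.request

def fetch_html(url):
    """Return the page body only if the server declares it as HTML."""
    resp = urllib.request.urlopen(url)
    content_type = resp.headers.get('Content-Type', '')
    if not content_type.lower().startswith('text/html'):
        return None  # archive or other binary payload; nothing to scrape
    return resp.read().decode('utf-8')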

Mort Yao

Just for the record, docutils is using a direct download link (instead of an HTML page) as its Download-URL:

<a href="http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download" rel="download">0.8.1 download_url</a>

which is the real cause of this sort of issue.
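
For reference, that rel="download" link is what a download_url field in the project's packaging metadata turns into. A sketch of the kind of setup() call that produces it (illustrative, not docutils' actual setup.py):

from distutils.core import setup

setup(
    name='docutils',
    version='0.8.1',
    # A download_url that resolves to a tarball makes tools like pip
    # fetch the archive itself when they follow the link; pointing it
    # at an HTML download page avoids that.
    download_url='http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download',
)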

Paul Nasrat
Collaborator

@hltbra I believe there is some overlap with pull #874 here. What are you intending?

Hugo Lopes Tavares
Collaborator

@pnasrat I need to take a deeper look at #850 and #810 and check if this pull request solves everything those pull requests solve.

Preston Holmes ptone referenced this pull request April 07, 2013
Closed

Page charset fix #886

Hugo Lopes Tavares hltbra commented on the diff April 13, 2013
pip/index.py
@@ -523,6 +523,13 @@ def get_page(cls, link, req, cache=None, skip_archives=True):
 
             real_url = geturl(resp)
             headers = resp.info()
+            content_type = headers.get('Content-Type', None)
Hugo Lopes Tavares Collaborator
hltbra added a note April 13, 2013

It is not a good idea to choose None as the default and then call content_type.lower() on the next line.
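
Concretely, the pattern being flagged blows up whenever a server omits the header entirely; defaulting to an empty string keeps the .lower().startswith(...) chain safe. A minimal illustration (plain dict as a stand-in for response headers, not pip code):

headers = {}  # a response that carries no Content-Type header at all

content_type = headers.get('Content-Type', None)
# content_type.lower()  -> AttributeError: 'NoneType' object has no attribute 'lower'

content_type = headers.get('Content-Type', '')  # safer default
assert not content_type.lower().startswith('text/html')  # no crash; page is skipped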

Hugo Lopes Tavares
Collaborator

Closing because #886 solves the same issue and others.

Hugo Lopes Tavares hltbra closed this April 13, 2013
9  pip/index.py
@@ -478,7 +478,7 @@ def __str__(self):
     def get_page(cls, link, req, cache=None, skip_archives=True):
         url = link.url
         url = url.split('#', 1)[0]
-        if cache.too_many_failures(url):
+        if cache is not None and cache.too_many_failures(url):
             return None
 
         # Check for VCS schemes that do not support lookup as web pages.
@@ -523,6 +523,13 @@ def get_page(cls, link, req, cache=None, skip_archives=True):
 
             real_url = geturl(resp)
             headers = resp.info()
+            content_type = headers.get('Content-Type', None)
+            if not content_type.lower().startswith('text/html'):
+                logger.debug('Skipping page %s because of Content-Type: %s' % (link, content_type))
+                if cache is not None:
+                    cache.set_is_archive(url)
+                return None
+
             contents = resp.read()
             encoding = headers.get('Content-Encoding', None)
             #XXX need to handle exceptions and add testing for this
10  tests/test_index.py
@@ -126,4 +126,14 @@ def test_mirror_url_formats():
             assert url == result, str([url, result])
 
 
+def test_non_html_page_should_not_be_scraped():
+    """
+    Test that a url whose content-type is not text/html
+    will never be scraped as an html page.
+    """
+    # Content-type is already set
+    # no need to monkeypatch on response headers
+    url = path_to_url(os.path.join(here, 'indexes', 'empty_with_pkg', 'simple-1.0.tar.gz'))
+    page = HTMLPage.get_page(Link(url), None, cache=None, skip_archives=False)
+    assert page == None
 