
Add Content-Type handling in get_page() to fix UnicodeDecodeError #862

Closed
soimort wants to merge 2 commits

Conversation

soimort commented Mar 24, 2013

This fixes exactly the same issue as #850 and #810, but in a much simpler way.

When upgrading Sphinx:

$ pip install --upgrade Sphinx
...
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.3/threading.py", line 639, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.3/threading.py", line 596, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 286, in _get_queued_page
    page = self._get_page(location, req)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 404, in _get_page
    return HTMLPage.get_page(link, req, cache=self.cache)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 535, in get_page
    inst = cls(u(contents), real_url, headers)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/backwardcompat/__init__.py", line 51, in u
    return s.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

In this case, the URL it's trying to scrape is http://heanet.dl.sourceforge.net/project/docutils/docutils/0.8.1/docutils-0.8.1.tar.gz, which is obviously not a valid HTML page, so there's no actual need for HTMLPage to decode and scrape its contents at all. The response headers:

Date: Sun, 24 Mar 2013 16:49:34 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Tue, 30 Aug 2011 07:51:04 GMT
ETag: "203350-16e2b8-4abb446046a00"
Accept-Ranges: bytes
Content-Length: 1499832
Connection: close
Content-Type: application/x-gzip
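
Note that a gzip stream begins with the magic bytes 0x1f 0x8b, which is exactly where the decoder gives up ("byte 0x8b in position 1"). A quick reproduction (a standalone illustration, not pip code):

# Raw gzip data is not valid UTF-8, hence the UnicodeDecodeError.
import urllib.request

url = ('http://heanet.dl.sourceforge.net/project/docutils/'
       'docutils/0.8.1/docutils-0.8.1.tar.gz')
body = urllib.request.urlopen(url).read()
print(body[:2])       # b'\x1f\x8b' -- the gzip magic number
body.decode('utf-8')  # raises UnicodeDecodeError: invalid start byte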

I added a few lines to skip such an invalid URL based on its response headers (i.e., when its Content-Type is not text/html).
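
The idea, roughly (an illustrative sketch against pip's index.py of the time; logger and geturl are the names used elsewhere in that file, and the exact patch differs in details such as the default value):

# Sketch of the check added to HTMLPage.get_page(); illustrative only.
real_url = geturl(resp)
headers = resp.info()
content_type = headers.get('Content-Type', '')  # the submitted diff defaults to None
if not content_type.lower().startswith('text/html'):
    logger.debug('Skipping page %s because of Content-Type: %s'
                 % (link, content_type))
    return None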

ghost assigned hltbra Mar 24, 2013
soimort (Author) commented Mar 24, 2013

Just for the record, docutils uses a direct download link (instead of an HTML page) as its download_url:

<a href="http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download" rel="download">0.8.1 download_url</a>

which is the real cause of this sort of issue.
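
For illustration, metadata along these lines is what puts such a link on the PyPI page (a hypothetical setup.py excerpt; the actual docutils setup script may differ):

# Hypothetical setup.py excerpt: download_url points directly at a
# tarball instead of at an HTML download page.
from distutils.core import setup

setup(
    name='docutils',
    version='0.8.1',
    download_url='http://sourceforge.net/projects/docutils/files/'
                 'docutils/0.8.1/docutils-0.8.1.tar.gz/download',
    # ... other metadata omitted ...
)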

pnasrat (Contributor) commented Mar 30, 2013

@hltbra I believe there is some overlap with pull #874 here. How do you intend to proceed?

hltbra (Contributor) commented Mar 30, 2013

@pnasrat I need to take a deeper look at #850 and #810 and check whether this pull request solves everything those pull requests solve.

@@ -523,6 +523,13 @@ def get_page(cls, link, req, cache=None, skip_archives=True):

     real_url = geturl(resp)
     headers = resp.info()
+    content_type = headers.get('Content-Type', None)
A contributor commented on this line:

It is not a good idea to choose None as the default and then call content_type.lower() on the next line: if the server sends no Content-Type header, content_type will be None and the .lower() call raises an AttributeError.
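
To make the point concrete (a REPL sketch with a plain dict standing in for resp.info()):

>>> headers = {}
>>> content_type = headers.get('Content-Type', None)
>>> content_type.lower()
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'lower'
>>> headers.get('Content-Type', '').lower()  # an empty-string default is safe
''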

hltbra (Contributor) commented Apr 13, 2013

Closing because #886 solves the same issue and others.

hltbra closed this Apr 13, 2013
lock bot added the auto-locked label Jun 5, 2019
lock bot locked as resolved and limited conversation to collaborators Jun 5, 2019