HTMLPage.get_page doesn't handle gzip Content-Types #850

Closed · wants to merge 3 commits

Conversation

3 participants

mzipay commented Mar 18, 2013

When a response is missing a Content-Encoding header but its Content-Type indicates gzip, HTMLPage.get_page fails when decoding the content, since it has not been gunzipped. Beyond this, the backwardcompat.u function always assumes UTF-8 encoding, but the HTTP default of ISO-8859-1 should also be supported.

While trying to install Sphinx (1.1.3) last night, I ran into both problems:

    Exception in thread Thread-4:
    Traceback (most recent call last):
      File "-------/lib/python3.3/threading.py", line 639, in _bootstrap_inner
            self.run()
      File "-------/lib/python3.3/threading.py", line 596, in run
            self._target(*self._args, **self._kwargs)
      File "-------/lib/python3.3/site-packages/pip-1.2.1-py3.3.egg/pip/index.py", line 245, in _get_queued_page
            page = self._get_page(location, req)
      File "-------/lib/python3.3/site-packages/pip-1.2.1-py3.3.egg/pip/index.py", line 337, in _get_page
            return HTMLPage.get_page(link, req, cache=self.cache)
      File "-------/lib/python3.3/site-packages/pip-1.2.1-py3.3.egg/pip/index.py", line 466, in get_page
            inst = cls(u(contents), real_url, headers)
      File "-------/lib/python3.3/site-packages/pip-1.2.1-py3.3.egg/pip/backwardcompat.py", line 44, in u
            return s.decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This commit addresses both issues.
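For context: the 0x8b byte in the traceback is the second byte of the gzip magic number (1f 8b), i.e. the still-compressed body is being fed straight into the UTF-8 decoder. The following is only a minimal sketch of the kind of handling described above, not pip's actual code; the helper name `decode_response` is hypothetical.

```python
import gzip
import io

def decode_response(content, content_type, content_encoding=None):
    """Hypothetical helper sketching the fix described above.

    Gunzip the body if gzip is signaled via Content-Encoding or, as in
    the buggy case, only via a gzip Content-Type.  Then decode using the
    charset from Content-Type, falling back to ISO-8859-1 (the HTTP
    default) rather than assuming UTF-8.
    """
    if content_encoding == 'gzip' or 'gzip' in (content_type or ''):
        content = gzip.GzipFile(fileobj=io.BytesIO(content)).read()
    # Extract charset=... from the Content-Type, defaulting to ISO-8859-1.
    charset = 'iso-8859-1'
    for part in (content_type or '').split(';'):
        part = part.strip()
        if part.lower().startswith('charset='):
            charset = part.split('=', 1)[1]
    return content.decode(charset)
```

With this, a gzipped body is decompressed before decoding, and a body with no declared charset is decoded as ISO-8859-1 instead of raising UnicodeDecodeError on non-UTF-8 bytes.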

Matthew Zipay added some commits Mar 18, 2013

pnasrat (Contributor) commented Mar 18, 2013

Could you add tests for these cases? You will probably have to provide test pages for each under tests/indexes.

mzipay commented Mar 19, 2013

Confirmed: the test cases fail against the unpatched develop branch and pass with my patches applied.

@ghost ghost assigned hltbra Mar 24, 2013

hltbra (Member) commented Apr 13, 2013

The problem installing Sphinx does not need this kind of fix. When installing Sphinx, pip fetches URLs like "http://sourceforge.net/projects/docutils/files/docutils/0.10/docutils-0.10.tar.gz/download", which redirects to a gzipped tarball. That tarball is not an HTML page; pip does not even need to look at its content, only skip it.

When a page is compressed with gzip, that is signaled in the Content-Encoding header; when the resource itself is a gzip file, that is reflected in the Content-Type header. I think we should not mix the two.
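The distinction drawn above can be sketched as follows; this is only an illustration, not pip code, and the helper name `classify_body` is hypothetical.

```python
def classify_body(headers):
    """Hypothetical helper illustrating the two gzip cases.

    Content-Encoding: gzip   -> the *transfer* is compressed; undo it,
                                then treat the result as the resource.
    Content-Type: .../x-gzip -> the *resource itself* is a gzip file
                                (e.g. a tarball), not an HTML page, so
                                an index scraper should just skip it.
    """
    if headers.get('Content-Encoding') == 'gzip':
        return 'decompress'  # transparent transport compression
    if 'gzip' in headers.get('Content-Type', ''):
        return 'skip'        # a gzip *file*, nothing to parse as HTML
    return 'decode'          # plain body, decode per its charset
```

Under this reading, the docutils tarball above falls into the 'skip' case, which is why no decoding fix is needed for it.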

Pull request #886 seems to be the right path.

P.S.: Your pull request seems to be missing two files you created for test purposes: tests/indexes/gzipped and tests/indexes/iso_8859_1.

@hltbra hltbra closed this Apr 13, 2013
