
Add Content-Type handling in get_page() to fix UnicodeDecodeError #862

Closed
wants to merge 2 commits

3 participants

Mort Yao Paul Nasrat Hugo Lopes Tavares
Mort Yao

This fixes exactly the same issue as #850 and #810, but in a much simpler way.

When upgrading Sphinx:

$ pip install --upgrade Sphinx
...
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.3/threading.py", line 639, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.3/threading.py", line 596, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 286, in _get_queued_page
    page = self._get_page(location, req)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 404, in _get_page
    return HTMLPage.get_page(link, req, cache=self.cache)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/index.py", line 535, in get_page
    inst = cls(u(contents), real_url, headers)
  File "/usr/lib/python3.3/site-packages/pip-1.4.dev1-py3.3.egg/pip/backwardcompat/__init__.py", line 51, in u
    return s.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

In this case, the URL it's trying to scrape is http://heanet.dl.sourceforge.net/project/docutils/docutils/0.8.1/docutils-0.8.1.tar.gz, which is obviously not an HTML page, so there is no actual need for HTMLPage to decode and scrape its contents at all. The response headers:

Date: Sun, 24 Mar 2013 16:49:34 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Tue, 30 Aug 2011 07:51:04 GMT
ETag: "203350-16e2b8-4abb446046a00"
Accept-Ranges: bytes
Content-Length: 1499832
Connection: close
Content-Type: application/x-gzip
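
The traceback follows directly from that Content-Type: every gzip stream begins with the magic bytes 0x1f 0x8b, and 0x8b can never be the start of a UTF-8 sequence. A minimal reproduction, using nothing beyond the standard library:

gzip_magic = b'\x1f\x8b\x08'  # the opening bytes of any .tar.gz response body

try:
    gzip_magic.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)
    # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte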

Added a few lines to skip invalid URLs based on their response headers (i.e., any URL whose Content-Type is not text/html).
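
The same guard, sketched standalone against the standard library (urllib.request here stands in for pip's own download machinery; names and defaults are illustrative, not pip's actual code):

import urllib.request

def fetch_html(url):
    """Return the page body only if the server declares it as HTML."""
    resp = urllib.request.urlopen(url)
    content_type = resp.headers.get('Content-Type', '')
    if not content_type.lower().startswith('text/html'):
        return None  # archive or other binary payload; nothing to scrape
    return resp.read().decode('utf-8')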

Mort Yao

Just for the record, docutils is using a direct download link (instead of an HTML page) as its Download-URL:

<a href="http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download" rel="download">0.8.1 download_url</a>

which is the real cause of this sort of issue.
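
For reference, that rel="download" link is what a download_url field in the project's packaging metadata turns into. A sketch of the kind of setup() call that produces it (illustrative, not docutils' actual setup.py):

from distutils.core import setup

setup(
    name='docutils',
    version='0.8.1',
    # A download_url that resolves to a tarball makes tools like pip
    # fetch the archive itself when they follow the link; pointing it
    # at an HTML download page avoids that.
    download_url='http://sourceforge.net/projects/docutils/files/docutils/0.8.1/docutils-0.8.1.tar.gz/download',
)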

Paul Nasrat
Collaborator

@hltbra I believe there is some overlap with pull #874 here. What are you intending?

Hugo Lopes Tavares
Collaborator

@pnasrat I need to take a deeper look at #850 and #810 and check if this pull request solves everything those pull requests solve.

Preston Holmes ptone referenced this pull request April 07, 2013
Closed

Page charset fix #886

Hugo Lopes Tavares hltbra commented on the diff April 13, 2013
pip/index.py
@@ -523,6 +523,13 @@ def get_page(cls, link, req, cache=None, skip_archives=True):
 
             real_url = geturl(resp)
             headers = resp.info()
+            content_type = headers.get('Content-Type', None)
Hugo Lopes Tavares Collaborator
hltbra added a note April 13, 2013

It is not a good idea to choose None as the default and then call content_type.lower() on the next line.
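
Concretely, the pattern being flagged blows up whenever a server omits the header entirely; defaulting to an empty string keeps the .lower().startswith(...) chain safe. A minimal illustration (plain dict as a stand-in for response headers, not pip code):

headers = {}  # a response that carries no Content-Type header at all

content_type = headers.get('Content-Type', None)
# content_type.lower()  -> AttributeError: 'NoneType' object has no attribute 'lower'

content_type = headers.get('Content-Type', '')  # safer default
assert not content_type.lower().startswith('text/html')  # no crash; page is skipped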

Hugo Lopes Tavares
Collaborator

Closing because #886 solves the same issue and others.

Hugo Lopes Tavares hltbra closed this April 13, 2013
9  pip/index.py
@@ -478,7 +478,7 @@ def __str__(self):
     def get_page(cls, link, req, cache=None, skip_archives=True):
         url = link.url
         url = url.split('#', 1)[0]
-        if cache.too_many_failures(url):
+        if cache is not None and cache.too_many_failures(url):
             return None
 
         # Check for VCS schemes that do not support lookup as web pages.
@@ -523,6 +523,13 @@ def get_page(cls, link, req, cache=None, skip_archives=True):
 
             real_url = geturl(resp)
             headers = resp.info()
+            content_type = headers.get('Content-Type', None)
+            if not content_type.lower().startswith('text/html'):
+                logger.debug('Skipping page %s because of Content-Type: %s' % (link, content_type))
+                if cache is not None:
+                    cache.set_is_archive(url)
+                return None
+
             contents = resp.read()
             encoding = headers.get('Content-Encoding', None)
             #XXX need to handle exceptions and add testing for this
10  tests/test_index.py
@@ -126,4 +126,14 @@ def test_mirror_url_formats():
             assert url == result, str([url, result])
 
 
+def test_non_html_page_should_not_be_scraped():
+    """
+    Test that a url whose content-type is not text/html
+    will never be scraped as an html page.
+    """
+    # Content-type is already set
+    # no need to monkeypatch on response headers
+    url = path_to_url(os.path.join(here, 'indexes', 'empty_with_pkg', 'simple-1.0.tar.gz'))
+    page = HTMLPage.get_page(Link(url), None, cache=None, skip_archives=False)
+    assert page == None
 