Page charset fix #886


ptone commented Apr 7, 2013

So here is my branch to address the issues tackled in #810, #850, #862

It uses the recently committed work in #874

My comments on the other pulls:
#810 uses the backward-compat u()/utf-8 path and falls back to latin-1, with no attempt to get the declared charset.
#850 changes the backward-compat function to handle other encodings, but tries to match every archive content-type instead of simply testing 'not text/html'.
#862 adds additional content-type checks, but leaves the redundant filename-based check in place and doesn't handle the charset param at all.
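For reference, "getting the charset" amounts to reading the charset parameter out of the Content-Type header. A minimal sketch (the helper name is mine, not pip's code), using the stdlib email machinery that HTTPMessage is built on:

```python
from email.message import Message

def charset_from_content_type(content_type):
    # Hypothetical helper: parse a "text/html; charset=utf-8" style header
    # and return the declared charset, or None when the param is absent.
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_param('charset')
```

So `charset_from_content_type('text/html; charset=utf-8')` gives `'utf-8'`, and a bare `'text/html'` gives `None`; on Python 2 the equivalent parameter should be reachable via the HTTPMessage's mimetools `getparam('charset')`.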


ptone commented Apr 7, 2013

I need to pull over one of the good tests from one of the other pulls still, but will have to do that later.


pnasrat commented Apr 7, 2013

Can you update something to trigger a Travis build?

Do you have time to address the testing?


ptone commented Apr 8, 2013

Working on tests now - as far as I can tell, skip_archives is always true, yes?


ptone commented Apr 8, 2013

So I'm a bit stuck on testing this, as the functionality is buried deep in get_page.

I need to find a way to get an HTTPMessage with a set charset.

I've gotten close by uploading a UTF-16 file and pulling it down; it does decode to 'helloworld', but something is still off with the assert and the test is failing.

Mocking is probably what is required here, but it means mocking some internals of Http/MIME message :(
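For what it's worth, the mock may not need to reach into the Http/MIME internals if the code under test only asks the headers object for its charset. A sketch with a stubbed headers object (`decode_body` and the stub's interface are illustrative, not pip's API; `unittest.mock` is the modern stdlib home of the mock library):

```python
from unittest import mock

def decode_body(body, headers):
    # Stand-in for the logic buried in get_page: honour a declared
    # charset, falling back to latin-1 when none is given.
    charset = headers.get_param('charset') or 'latin-1'
    return body.decode(charset)

# Fake only the one method the code under test actually calls.
fake_headers = mock.Mock()
fake_headers.get_param.return_value = 'utf-16'

text = decode_body('helloworld'.encode('utf-16'), fake_headers)
assert text == 'helloworld'
```

This sidesteps the live-upload approach entirely: the UTF-16 body is built in the test itself, so the assert has a known-good expected value.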


pnasrat commented Apr 11, 2013

@ptone I'll try to take a look over the weekend to see if I can help figure out a good approach.


hltbra commented Apr 12, 2013

I am taking a deep look at HTMLPage.get_page() and I see a lot of unused and unneeded stuff. I am adding some FIXMEs here, and then I'll discuss with you what we can do. For instance: cache is never None, and skip_archives is never False...


ptone commented Apr 13, 2013

@hltbra yes - get_page could certainly use a refactor; it would make this change much easier to unit test.

The crux of the problem in testing this patch is that the behavior that changes is very hard to get at in an isolated way, and requires mocking an HTTPMessage-like object, which takes a bit of a dive to do completely enough. But maybe there is a simpler way to generate the required test response.

I too had noticed the vestigial nature of skip_archives and cache - but TBH, pip seems chock full of this sort of thing...


qwcode commented Apr 13, 2013

but TBH - pip seems chock full of this sort of thing...

good that you're among us to notice and help refactor : )


hltbra commented Apr 13, 2013

Pull request #810 removes _get_content_type after removing its call. It would be nice if this pull request did the same, because that function is dead code.


hltbra commented Apr 13, 2013

Only one concern about this pull request, and it was discussed before (comments on #874): should we use latin-1 or utf-8?


methane commented Apr 14, 2013

@hltbra pip's concern is only finding URLs.
latin-1 is a one-byte-to-one-codepoint translation.
Any ASCII-based encoding (including UTF-8) can be decoded as latin-1 without breaking the ASCII strings.

But I'd prefer using bytes instead of unicode to handle HTTP content.
Both Python 2 and 3 support regexes over bytes.
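The property @methane is relying on can be demonstrated directly: latin-1 maps each of the 256 byte values to the codepoint with the same number, so decoding never raises, and ASCII bytes pass through untouched:

```python
# Every possible byte decodes under latin-1, so the decode cannot raise.
every_byte = bytes(range(256))
decoded = every_byte.decode('latin-1')
assert decoded == ''.join(chr(i) for i in range(256))

# ASCII content (e.g. the hrefs pip is scraping for) survives unchanged,
# whatever ASCII-compatible encoding the page was actually written in.
ascii_link = b'<a href="pkg-1.0.tar.gz">'
assert ascii_link.decode('latin-1') == ascii_link.decode('ascii')
```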


hltbra commented Apr 14, 2013

It is a good idea to avoid converting that to unicode, maybe the best so far. Thanks @methane! It seems there are no implications for PackageFinder. The only component that accesses the content value is HTMLPage itself, and it only applies regexes to it.


hltbra commented Apr 14, 2013

By the way, I was taking a look at how the six package converts to unicode using the unicode-escape codec:
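For context, the unicode-escape codec interprets Python-style backslash escapes held in a byte string; this is an illustration of the codec itself, not six's exact source:

```python
# The byte string holds a literal backslash escape (7 bytes: c a f \ x e 9);
# unicode_escape interprets the escape, yielding the single codepoint U+00E9.
raw = b'caf\\xe9'
assert raw.decode('unicode_escape') == 'caf\xe9'   # i.e. 'café'
```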


ptone commented Apr 14, 2013

FTR - I brought this up originally on #874 with a reference to this link:

Since a page can declare a multibyte encoding in its charset param, we can't pass raw bytes to the link-finding regexes.

On the choice of latin-1 vs utf-8 as the default, there's no simple call.

#810 uses the example of

This page does not declare a charset, uses latin-1, and will raise a unicode error:

However, if a page uses multibyte unicode and doesn't declare a charset, latin-1 will cause problems.

Perhaps, given how common utf-8 is, we take this approach:

  1. use the charset if declared
  2. otherwise try utf-8
  3. catch UnicodeDecodeError, fall back to latin-1, and allow any exception to bubble up at that point
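The steps above can be sketched as follows (`decode_page` and its signature are illustrative, not what pip ships):

```python
def decode_page(body, declared_charset=None):
    # 1. trust a declared charset outright
    if declared_charset:
        return body.decode(declared_charset)
    # 2. otherwise assume utf-8, the common case
    try:
        return body.decode('utf-8')
    except UnicodeDecodeError:
        # 3. last resort: latin-1 maps every byte to a codepoint,
        # so this final decode cannot itself raise
        return body.decode('latin-1')
```

A declared charset always wins (`decode_page(b, 'utf-16')`), valid utf-8 round-trips, and bytes that are invalid utf-8 (e.g. `b'\xff\xfe'`) still come back as latin-1 text instead of raising.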

ptone commented Apr 14, 2013

@hltbra good catch on _get_content_type, I thought I had removed that, but didn't.

There is still the issue of the test.


This comes close. It should work, as I have a utf-16 page that decodes correctly - but the assert is failing.

I'd still prefer that the test be narrower - but the way things are written in pip, all I can give get_page to test with is a link. I'm not sure whether get_page could/should be refactored to allow better testing.


hltbra commented Apr 18, 2013

@ptone I like the idea of a try/catch with utf-8 and then latin-1. 👍

I am going to change get_file_content() to have the same behavior later.


hltbra commented May 14, 2013

I am finishing @ptone's work in a branch, and there is a test failing after this pull request:

The test expects this to succeed:

assert "Getting page" in result.stdout

But the output is:

Downloading/unpacking INITools
  Could not fetch URL HTTP Error 404: Not Found
  Will skip URL when looking for download links for INITools
  Getting page
  URLs to search for versions for INITools:
  Could not fetch URL HTTP Error 404: Not Found
  Will skip URL when looking for download links for INITools
  Could not find any downloads that satisfy the requirement INITools
Cleaning up...
  Removing temporary dir /Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/build...
No distributions at all found for INITools
Exception information:
Traceback (most recent call last):
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/", line 134, in main
    status =, args)
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/commands/", line 241, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/", line 1070, in prepare_files
    url = finder.find_requirement(req_to_install, upgrade=self.upgrade)
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/", line 222, in find_requirement
    raise DistributionNotFound('No distributions at all found for %s' % req)
DistributionNotFound: No distributions at all found for INITools

And it seems that some work is duplicated here: the pypi.appspot URL was hit twice.

Does anyone have any idea of what's happening?

@hltbra hltbra commented on the diff May 15, 2013

real_url = geturl(resp)
headers =
+ content_type = headers.get('Content-Type', '')
+ if skip_archives:
+     if cache is not None:
+         if cache.is_archive(url):
+             return None
+     if not (content_type.lower().startswith('text/html') or
+             content_type.lower().startswith('text/plain')):
+         logger.debug('Skipping page %s because of Content-Type: %s' % (link, content_type))
+         if cache is not None:
+             cache.set_is_archive(url)
+         return None
+ logger.debug('Getting page %s' % url)

hltbra May 15, 2013


It is wrong to place this here, because the request is made when urlopen(url) is hit, and if any exception is raised, it is raised up there, not here. So the right thing to do is to move this line to before resp = urlopen(url).

I am doing it on my branch, btw.
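The reordering described here can be shown in miniature; `urlopen` below is a stand-in that fails like the 404s in the log above, and the handler just captures log messages so the effect is visible:

```python
import logging

logger = logging.getLogger('pip-sketch')
logger.setLevel(logging.DEBUG)
messages = []

class _Capture(logging.Handler):
    def emit(self, record):
        messages.append(record.getMessage())

logger.addHandler(_Capture())

def urlopen(url):
    # Stand-in for the real request, failing like a 404.
    raise IOError('HTTP Error 404: Not Found')

def get_page(url):
    # Log *before* the network call: if urlopen() raises, the attempt
    # still shows up in the log, which is what the test asserts on.
    logger.debug('Getting page %s' % url)
    return urlopen(url)

try:
    get_page('http://example.invalid/simple/INITools/')
except IOError:
    pass

assert messages == ['Getting page http://example.invalid/simple/INITools/']
```

With the debug call placed after `urlopen`, the except branch would fire before anything was logged and the `"Getting page"` assertion in the test suite would fail.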


hltbra commented May 15, 2013

I finished the work I think was missing here, and after I am sure all tests are ok, I am going to open a new pull request with @ptone's work and close this one.


ptone commented May 15, 2013

Great to hear! I'll be happy to cast another set of eyes on your revision when posted - can you post a comment on this issue? I'm subbed here but not to everything on pip ;-)


hltbra commented May 16, 2013

I just opened a new pull request with all the commits from this pull request, plus additional test and fix commits: #945

hltbra closed this May 16, 2013
