Page charset fix #886

Contributor

ptone commented Apr 7, 2013

So here is my branch to address the issues tackled in #810, #850, #862

It uses the recently committed work in #874

My comments on the other pulls:
#810 tries to use the backcompat u/utf-8 and falls back on latin-1, with no attempt to read the declared charset
#850 tries to change the backcompat function to handle other encodings, but tries to match all archive content-types instead of just checking 'not text/html'
#862 adds additional content-type checks, but leaves the redundant filename-based check and doesn't handle the charset param at all
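
In contrast, the approach here reads the declared charset off the Content-Type header. A minimal sketch of that step (the helper name and the use of cgi.parse_header are illustrative, not necessarily what this branch does):

    import cgi

    def get_charset(headers, default='utf-8'):
        # e.g. 'text/html; charset=ISO-8859-1' yields 'ISO-8859-1'
        content_type = headers.get('Content-Type', '')
        _, params = cgi.parse_header(content_type)
        return params.get('charset', default)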

Contributor

ptone commented Apr 7, 2013

I still need to pull over one of the good tests from the other pulls, but will have to do that later.

Contributor

pnasrat commented Apr 7, 2013

Can you update something to trigger a Travis build?

Do you have time to address the testing?

Contributor

ptone commented Apr 8, 2013

Working on tests now - as far as I can tell, skip_archives is always true, yes?

Contributor

ptone commented Apr 8, 2013

So I'm a bit stuck on testing this, as the functionality is buried deep in get_page.

I need a way to get an HTTPMessage with a set charset.

I've gotten close by uploading a UTF-16 file and pulling it down; it does decode to 'helloworld', but something is still off in the assert and the test is failing.

Mocking is probably what is required here, but it means mocking some internals of the HTTP/MIME message classes :(
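
If a real HTTPMessage proves too fiddly, a minimal stub might be enough, assuming get_page only touches info(), read(), and geturl() on the response (hypothetical names, not pip's API):

    # Hypothetical stand-in for the urlopen() response that get_page sees;
    # only the parts the charset-handling code touches are stubbed.
    class FakeResponse(object):
        def __init__(self, body, content_type):
            self._body = body
            self._headers = {'Content-Type': content_type}

        def info(self):
            # get_page treats resp.info() like a mapping of headers
            return self._headers

        def read(self):
            return self._body

        def geturl(self):
            return 'http://example.com/simple/'

    # 'helloworld' encoded as UTF-16 (with BOM), plus a matching header
    resp = FakeResponse(u'helloworld'.encode('utf-16'),
                        'text/html; charset=utf-16')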

Contributor

pnasrat commented Apr 11, 2013

@ptone I'll try to take a look over the weekend to see if I can help figure out a good approach.

Member

hltbra commented Apr 12, 2013

I am taking a deep look at HTMLPage.get_page() and I see a lot of unused and unneeded stuff. I am adding some FIXMEs and then I'll discuss with you what we can do. For instance: cache is never None, and skip_archives is never False...

Contributor

ptone commented Apr 13, 2013

@hltbra yes - get_page could for sure use a refactor - it would make this change much easier to unit test.

The crux of the problem in testing this patch is that the behavior that changes is very hard to get at in isolation: it requires mocking an HTTPMessage-like object, which takes a bit of a dive to do completely. But maybe there is a simpler way to generate the required test response.

I too had noticed the vestigial nature of skip_archives and cache - but TBH, pip seems chock full of this sort of thing...

Contributor

qwcode commented Apr 13, 2013

but TBH - pip seems chock full of this sort of thing...

good that you're among us to notice and help refactor : )

Member

hltbra commented Apr 13, 2013

Pull request #810 removes _get_content_type after removing its call. It would be nice if this pull request did the same, because that function is dead code.

Member

hltbra commented Apr 13, 2013

Only one concern about this pull request, and it was discussed before (comments on #874): should we use latin-1 or utf-8?

Contributor

methane commented Apr 14, 2013

@hltbra pip's concern is only finding URLs.
latin-1 is a one-byte-to-one-codepoint translation.
Any ASCII-based encoding (including UTF-8) can be decoded as latin-1 without mangling the ASCII parts.

But I prefer using bytes instead of unicode to handle HTTP content.
Both Python 2 and 3 support regexes over bytes.
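
For example (a quick illustration; the pattern is simplified, not the one pip uses):

    import re

    # A bytes pattern works the same on Python 2 (str) and Python 3 (bytes),
    # so links can be scraped without decoding the body at all.
    HREF_RE = re.compile(br'href=[\'"]?([^\'" >]+)')

    page = b'<a href="INITools-0.3.tar.gz">INITools-0.3.tar.gz</a>'
    print(HREF_RE.findall(page))  # [b'INITools-0.3.tar.gz'] on Python 3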

Member

hltbra commented Apr 14, 2013

It is a good idea to avoid converting to unicode, maybe the best so far. Thanks @methane! It seems there are no implications for PackageFinder: the only component that accesses the content value is HTMLPage itself, and it only applies regexes to it.

Member

hltbra commented Apr 14, 2013

By the way, I was taking a look at how the six package converts to unicode using the unicode-escape codec: http://pythonhosted.org/six/#six.u

Contributor

ptone commented Apr 14, 2013

FTR - I brought this up originally on #874 with a reference to this link:

http://www.w3.org/International/O-HTTP-charset#charset

Since a page can declare a multibyte encoding in its charset param, we can't pass raw bytes to the link-finding regexes.
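
For instance, a quick demonstration of why bytes regexes break on a multibyte page:

    import re

    # In UTF-16 every ASCII character becomes two bytes, so the literal
    # byte sequence b'href' never appears in the encoded page.
    page = u'<a href="pkg-1.0.tar.gz">pkg</a>'.encode('utf-16')
    print(re.search(b'href', page))  # None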

On the choice of latin-1 vs utf-8 as the default, there is no simple call.

#810 uses the example of http://www.dabeaz.com/ply/

This page does not declare a charset and uses latin-1, so decoding it as UTF-8 raises a unicode error: https://gist.github.com/ptone/8dddb3196e59af60908f
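
Concretely, the failure mode is one-directional:

    b'\xe9'.decode('latin-1')  # u'\xe9' ('é') - latin-1 accepts any byte
    b'\xe9'.decode('utf-8')    # raises UnicodeDecodeError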

However, if a page uses a multibyte encoding and doesn't declare a charset, decoding it as latin-1 will silently mangle the text.

Perhaps, given how common utf-8 is, we take this approach (sketched after the list):

  1. use charset if declared
  2. try utf-8
  3. catch UnicodeDecodeError and try latin-1, and allow the exception to bubble up at that point
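
A minimal sketch of that order (the function name is illustrative, not pip's):

    import cgi

    def decode_page(content, headers):
        # 1. honor the declared charset, if any
        _, params = cgi.parse_header(headers.get('Content-Type', ''))
        if 'charset' in params:
            return content.decode(params['charset'])
        # 2. otherwise try utf-8, the most common encoding
        try:
            return content.decode('utf-8')
        except UnicodeDecodeError:
            # 3. fall back to latin-1, which maps every byte to a codepoint
            return content.decode('latin-1')
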
Contributor

ptone commented Apr 14, 2013

@hltbra good catch on _get_content_type, I thought I had removed that, but didn't.

There is still the issue of the test.

ptone/pip@e9aa5e2

That comes close. It should work, since I have a utf-16 page that decodes correctly, but the assert is failing.

I'd still prefer that the test be narrower, but the way things are written in pip, all I can give get_page is a link. I'm not sure whether get_page could or should be refactored to allow better testing.

Member

hltbra commented Apr 18, 2013

@ptone I like the idea of a try/catch with utf-8 and then latin-1. 👍

I am going to change get_file_content() to have the same behavior later.

Member

hltbra commented May 14, 2013

I am finishing @ptone's work in a branch, and there is a test failing after this pull request: test_config.py:test_config_file_override_stack

The test expects this to succeed:

assert "Getting page http://pypi.appspot.com/INITools" in result.stdout

But the output is:

Downloading/unpacking INITools
  Could not fetch URL http://pypi.appspot.com/INITools/: HTTP Error 404: Not Found
  Will skip URL http://pypi.appspot.com/INITools/ when looking for download links for INITools
  Getting page http://pypi.appspot.com/
  URLs to search for versions for INITools:
  * http://pypi.appspot.com/INITools/
  Could not fetch URL http://pypi.appspot.com/INITools/: HTTP Error 404: Not Found
  Will skip URL http://pypi.appspot.com/INITools/ when looking for download links for INITools
  Could not find any downloads that satisfy the requirement INITools
Cleaning up...
  Removing temporary dir /Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/build...
No distributions at all found for INITools
Exception information:
Traceback (most recent call last):
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/basecommand.py", line 134, in main
    status = self.run(options, args)
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/commands/install.py", line 241, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/req.py", line 1070, in prepare_files
    url = finder.find_requirement(req_to_install, upgrade=self.upgrade)
  File "/Users/hugo/github-projects/pip/tests/tests_cache/test_ws/setuptools/.virtualenv/lib/python2.7/site-packages/pip-1.4.dev1-py2.7.egg/pip/index.py", line 222, in find_requirement
    raise DistributionNotFound('No distributions at all found for %s' % req)
DistributionNotFound: No distributions at all found for INITools

And it seems that some work is duplicated here: the pypi.appspot.com URL was hit twice.

Does anyone have any idea what's happening?

hltbra commented on the diff May 15, 2013

pip/index.py
  real_url = geturl(resp)
  headers = resp.info()
+ content_type = headers.get('Content-Type', '')
+ if skip_archives:
+     if cache is not None:
+         if cache.is_archive(url):
+             return None
+     if not (content_type.lower().startswith('text/html') or
+             content_type.lower().startswith('text/plain')):
+         logger.debug('Skipping page %s because of Content-Type: %s' % (link, content_type))
+         if cache is not None:
+             cache.set_is_archive(url)
+         return None
+ logger.debug('Getting page %s' % url)
Member

hltbra commented May 15, 2013

It is wrong to place this check here, because the request has already been made by the time urlopen(url) returns, and if any exception is raised, it is raised up there, not here. So the right thing to do is to move this check before resp = urlopen(url).

I am doing it on my branch, btw.
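
Roughly, the reordered flow would look like this (an abbreviated sketch of get_page, not the exact patch):

    try:
        from urllib2 import urlopen         # Python 2, the era of this patch
    except ImportError:
        from urllib.request import urlopen  # Python 3

    def get_page(url, cache=None, skip_archives=True):
        # Consult the archive cache BEFORE opening the URL, so a cache hit
        # never costs a network request and any urlopen error is raised
        # where the caller expects it.
        if skip_archives and cache is not None and cache.is_archive(url):
            return None
        resp = urlopen(url)
        headers = resp.info()
        content_type = headers.get('Content-Type', '')
        return resp.read(), content_type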

Member

hltbra commented May 15, 2013

I finished the work I think was missing here, and once I am sure all tests pass, I will open a new pull request with @ptone's work and close this one.

Contributor

ptone commented May 15, 2013

Great to hear! I'll be happy to cast another set of eyes on your revision when it's posted - can you post a comment on this issue? I'm subscribed here, not to everything on pip ;-)

Member

hltbra commented May 16, 2013

I just opened a new pull request with all the commits from this pull request, plus more test and fix commits: #945

hltbra closed this May 16, 2013
