"import pkg_resources" fails with UnicodeDecodeError while parsing /usr/lib/pymodules/python2.7/rpl-1.5.5.egg-info #719

Closed
micklat opened this issue Aug 4, 2016 · 9 comments

micklat commented Aug 4, 2016

My Ubuntu system has an egg-info file that pkg_resources fails to read, raising a UnicodeDecodeError. This made all sorts of things fail in my virtual environment.

> git clone https://github.com/pypa/setuptools
...
> cd setuptools/
> python -c "import pkg_resources"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pkg_resources/__init__.py", line 2964, in <module>
    @_call_aside
  File "pkg_resources/__init__.py", line 2950, in _call_aside
    f(*args, **kwargs)
  File "pkg_resources/__init__.py", line 2977, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "pkg_resources/__init__.py", line 626, in _build_master
    ws = cls()
  File "pkg_resources/__init__.py", line 619, in __init__
    self.add_entry(entry)
  File "pkg_resources/__init__.py", line 675, in add_entry
    for dist in find_distributions(entry, True):
  File "pkg_resources/__init__.py", line 1988, in find_on_path
    path_item, entry, metadata, precedence=DEVELOP_DIST
  File "pkg_resources/__init__.py", line 2376, in from_location
    py_version=py_version, platform=platform, **kw
  File "pkg_resources/__init__.py", line 2717, in _reload_version
    md_version = _version_from_file(self._get_metadata(self.PKG_INFO))
  File "pkg_resources/__init__.py", line 2341, in _version_from_file
    line = next(iter(version_lines), '')
  File "pkg_resources/__init__.py", line 2509, in _get_metadata
    for line in self.get_metadata_lines(name):
  File "pkg_resources/__init__.py", line 1879, in get_metadata_lines
    return yield_lines(self.get_metadata(name))
  File "pkg_resources/__init__.py", line 1869, in get_metadata
    metadata = f.read()
  File "/usr/lib/python2.7/codecs.py", line 296, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb6 in position 147: invalid start byte in /usr/lib/pymodules/python2.7/rpl-1.5.5.egg-info

The problematic egg-info file is attached, with an added extension (txt) to fool GitHub into accepting the attachment.
rpl-1.5.5.egg-info.txt

This is with Python 2.7.2 and setuptools aad4a69.
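
The decoding failure can be reproduced outside pkg_resources. This is a minimal sketch, assuming Python 3 and a made-up byte string (not the contents of the attached file); only the 0xb6 byte is taken from the traceback. 0xb6 is a UTF-8 continuation byte, so it cannot start a character, which is what "invalid start byte" refers to.

```python
# Hypothetical sample data; only the 0xb6 byte comes from the traceback above.
raw = b"Author: G\xb6ran"

failure = None
try:
    raw.decode("utf-8")  # strict decoding is the default
except UnicodeDecodeError as exc:
    failure = exc
    # exc.reason names the problem, exc.start is the offset of the bad byte
    print(exc.reason, "at position", exc.start)
```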

To make my virtualenv work properly, I had to patch pkg_resources/__init__.py, replacing the line:

        md_version = _version_from_file(self._get_metadata(self.PKG_INFO))

in EggInfoDistribution._reload_version with this:

        try:
            md_version = _version_from_file(self._get_metadata(self.PKG_INFO))
        except UnicodeDecodeError as e:
            warnings.warn(
                'failure to read version number of %s at %s due to: %s'
                % (self.project_name, self.location, e))
            md_version = None

I can make a pull request out of this if you think this is a good solution.

BTW, the stack trace looks similar to that of #531. You can also see multiple other people running into the same problem on stackoverflow.com, here, here, and here.

micklat changed the title from "import pkg_resources" fails with UnicodeDecodeError /usr/lib/pymodules/python2.7/rpl-1.5.5.egg-info to "import pkg_resources" fails with UnicodeDecodeError while parsing /usr/lib/pymodules/python2.7/rpl-1.5.5.egg-info on Aug 4, 2016
Author

micklat commented Aug 4, 2016

Upon reflection, I think _version_from_file is a better place to catch the error and return None.
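
As a rough illustration of that idea, here is a standalone sketch, assuming Python 3. The names `version_from_lines` and `decoded_lines` are hypothetical stand-ins, not the pkg_resources functions; the point is only that the decode happens lazily while the lines are iterated, so the version lookup is where the error can be caught.

```python
import warnings


def decoded_lines(raw_lines, encoding="utf-8"):
    """Hypothetical helper: decode metadata lines lazily, one at a time."""
    for raw in raw_lines:
        yield raw.decode(encoding)


def version_from_lines(lines):
    """Return the value of the first 'Version:' line, or None.

    A UnicodeDecodeError raised while iterating is reported as a warning
    and treated as "version unknown".
    """
    try:
        for line in lines:
            if line.lower().startswith("version:"):
                return line.partition(":")[2].strip() or None
    except UnicodeDecodeError as exc:
        warnings.warn("could not read version: %s" % exc)
    return None
```

With this shape, a bad byte that appears before the version line yields None plus a warning instead of aborting the import.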

Member

jaraco commented Aug 4, 2016

The stack trace isn't just similar to #531, it's nearly identical... finding the same offending character in the same position. I think your report explains the issue there, and here's what I think was happening:

  1. pip was installing packages and in doing so installed rpl 1.5.5 (or similar).
  2. When pip got around to installing setuptools, that was the first package that happened to invoke pkg_resources,
  3. which triggered the failure you've reported above.

I suspect the issue is that the offending package wasn't properly packaged (or was packaged with an old or defective build system).

I'm not sure we want to patch this by suppressing the error and masking the installed version. I could see augmenting the error to help the user better trace the issue to the source.

I'll dig a bit more into rpl to see what we can learn from it.

Member

jaraco commented Aug 4, 2016

Hmm. Rpl is a 404 on PyPI, and it's difficult to find anything about it. I eventually tracked it to its SourceForge home, where the latest release is from 2007 and there's an open ticket for exactly this issue. When I downloaded the 1.5.5 tgz file from the project page, I was unable to execute setup.py due to an encoding error in the script (the author should be Göran Weinholt, not b'G\xf6ran Weinholt').

Perhaps, though, there's still something setuptools could do here beyond crashing with a nicer error message. Perhaps pkg_resources, when opening a metadata file, should use a lenient decoding, perhaps utf-8 with surrogate escapes or with some other replacement.
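
To make the two lenient options concrete, here is a small sketch, assuming Python 3 (the `surrogateescape` handler is not available on Python 2) and a made-up byte string. It only compares the standard codec error handlers; it is not what pkg_resources does.

```python
# Hypothetical sample data containing one byte that is invalid UTF-8.
raw = b"Author: G\xb6ran"

# 'replace' substitutes U+FFFD, so the original byte is lost.
replaced = raw.decode("utf-8", errors="replace")

# 'surrogateescape' maps the bad byte to a lone surrogate (U+DCB6 here),
# which lets the text be encoded back to the exact original bytes.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

The trade-off is that text produced with `surrogateescape` cannot be encoded strictly (for example when printing), while text produced with `replace` can, at the cost of losing the original byte.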

Member

jaraco commented Aug 4, 2016

> I'm not sure we want to patch this by suppressing the error and masking the installed version. I could see augmenting the error to help the user better trace the issue to the source.

I see this was already done, and was apparent in the error message above.

jaraco added a commit that referenced this issue Aug 4, 2016
…ing PKG_RESOURCES_METADATA_ERRORS='replace'. Ref #719.
Member

jaraco commented Aug 4, 2016

I've created the issue-719 branch in the repo and pushed a possible workaround. This workaround keeps the existing behavior (fail fast and hard), but provides an environment variable that lets environments affected by the issue bypass it. This mechanism would help maintain the de facto expectation (that packages should be properly encoded), while enabling legacy environments or environments with abandoned packages like rpl to continue to function.
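
The general shape of such an opt-in could look like the sketch below, assuming Python 3. The variable name PKG_RESOURCES_METADATA_ERRORS is taken from the commit message referenced above; the function `decode_metadata`, the 'strict' default, and the way the variable is read are assumptions for illustration, not the contents of the branch.

```python
import os


def decode_metadata(raw, environ=os.environ):
    """Hypothetical helper: decode metadata bytes, strictly by default.

    The error handler can be overridden through the environment, e.g.
    PKG_RESOURCES_METADATA_ERRORS=replace.
    """
    # Assumption: an unset variable keeps the fail-fast behavior.
    errors = environ.get("PKG_RESOURCES_METADATA_ERRORS", "strict")
    return raw.decode("utf-8", errors=errors)
```

The `environ` parameter exists only so the sketch can be exercised with a plain dict instead of the real process environment.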

Thoughts?

Author

micklat commented Aug 5, 2016

Your solution does provide a remedy for problems such as mine, and I appreciate that.

I think, though, that setuptools could be kinder to its users. I don't see much point in sending users off to google what this UnicodeDecodeError problem is and what the accepted workaround is ("oh, so I have to set this environment variable and then it works? thanks mate. But why can't things just work? Sigh..."). I mean, what's the benefit of that virtual legwork? setuptools can just catch the error, consult a hard-coded list of "bad egg-infos", see that the egg-info in question is a known troublemaker, and silently disregard the error.

I see why you'd be concerned about incentivizing maintainers of other packages to fix the encoding of the egg-info. For that, I think it sufficient to require that elements are not added to the "bad eggs" list until a ticket is opened against the offending package in the appropriate place. If the package is actively maintained, then the owner is very likely to care that his package is listed in a "hall of shame" such as this, so I think this is incentive enough to fix the root cause.

Consider that my workstation's OS has reached end-of-life. Even if rpl were fixed on SourceForge, it would not help me much (I neither have root access to my machine nor know how to create and install a custom Ubuntu package to replace the egg-info with a good one). There are lots of users who cannot fix the root cause, and I don't see much point in disrupting their work. Hence my suggestion of a "bad eggs" list.

Author

micklat commented Aug 5, 2016

Or maybe a combination of both of our proposals would be best: have a "bad-eggs" list and allow it to be extended through an environment variable. That way users can work around the problem without involving you (the pkg_resources maintainer).
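
A sketch of what that combination might look like, purely for discussion. Everything here is hypothetical: the `KNOWN_BAD_EGGS` set, the `is_known_bad_egg` function, and especially the PKG_RESOURCES_BAD_EGGS variable name, which is invented for this example and does not exist in setuptools.

```python
import os

# Hypothetical built-in list; the single entry is the file from this report.
KNOWN_BAD_EGGS = frozenset(["rpl-1.5.5.egg-info"])


def is_known_bad_egg(path, environ=os.environ):
    """Return True if the egg-info at `path` is on the bad-eggs list.

    Users can extend the list with PKG_RESOURCES_BAD_EGGS (an invented
    name), a list of file names separated by os.pathsep.
    """
    extra = environ.get("PKG_RESOURCES_BAD_EGGS", "")
    names = KNOWN_BAD_EGGS.union(n for n in extra.split(os.pathsep) if n)
    return os.path.basename(path) in names
```

A caller would consult this only after a decoding error, and silently skip the file when it returns True.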

jaraco added a commit that referenced this issue Aug 5, 2016
jaraco added a commit that referenced this issue Aug 5, 2016
Member

jaraco commented Aug 5, 2016

You make some good points. I do want to be cognizant of over-complicating the implementation when it may just be one or two packages that are bad eggs.

And that leads me to wonder, does it really need to fail fast and hard here, or just be noisy?

I've pushed another implementation that I believe will always suppress decoding errors, but will log a warning if such an encoding issue is detected.
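
In outline, "suppress but be noisy" could be sketched as follows, assuming Python 3. The function name `read_metadata`, the use of `warnings.warn`, and the 'replace' fallback are assumptions for illustration and may differ from the pushed implementation.

```python
import warnings


def read_metadata(raw, path):
    """Hypothetical helper: decode metadata bytes without ever raising.

    Well-formed input is decoded strictly. On a decoding error, a warning
    naming the file is emitted and the bad bytes are replaced with U+FFFD.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        warnings.warn("%s in %s; undecodable bytes were replaced" % (exc, path))
        return raw.decode("utf-8", errors="replace")
```

Properly encoded packages take the strict path and produce no output, so the warning only shows up in environments that actually contain a bad egg.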

How does this approach strike you?

Author

micklat commented Aug 5, 2016

This seems to me like an excellent solution. Thank you!
