pip install: UnicodeDecodeError on Windows #1291
Comments
Can you create a pull request with a test case to ensure this bug won't regress? |
@thedrow Can the PR be merged or is there something missing? |
What if those files are not in UTF-8? As far as I know, there's no requirement that they have to be. |
The probability that they are is imho much higher than the probability that they are in the Windows default encoding, because UTF-8 is the default on Mac/Linux and also in Python 3. You could also wrap it in a try/except block and use the default encoding if UTF-8 fails. |
@sscherfke First of all, add tests to the pull request to ensure this bug won't regress. |
@thedrow Try the default and fall back to UTF-8, or the other way around? |
It depends on which one is the most common use case. Ask the core developers. |
The point is that the encoding of such files is not defined. So you can't reliably say that any encoding will be correct (some, like latin1, will never give encoding errors, but that doesn't mean they are necessarily right). Neither the egg-info nor the metadata specs make any mention of encodings, which is unfortunate, but reflects the fact that they were written when Python tended to assume that only ASCII would be used for anything that needed interoperability. Hence this bug. The correct fix (Metadata 2.0) is of course to clearly specify encodings. But that's a way off yet, and in the meantime we have to be prepared to accept arbitrary data. My view is that we should
Whether we use the platform encoding or UTF-8 doesn't really affect (1). In both cases, there is the possibility of invalid data. To address (1) we need to progressively fall back through a series of encodings, finishing with latin-1 (as that is the commonly used encoding that accepts all 256 byte values, and so will never error). Using UTF-8 addresses (2), as UTF-8 is probably the most common encoding we will see (due to its prevalence on Unix systems). For (3) it's really about how the data is used, and I think that's out of scope for this patch. So if you add exception handling and a fallback - I'd suggest the platform default and then latin-1 in that order if UTF-8 fails - I think that would be a good solution. Whether UTF-8 or platform default is the best choice for the initial attempt is something I doubt anyone can tell you. I suspect that a relatively small number of projects go outside ASCII anyway. For those that do, if you're on Unix UTF-8 is the platform default so it makes no difference. On Windows, it boils down to which is the most important case - installing stuff developed by other Windows developers, or installing stuff developed by Unix developers. Honestly, that's going to be an almost totally arbitrary decision. |
I think @pfmoore has a point. I completely agree. |
try:
    # Try utf-8
    with open(filename, 'rb') as fp:
        data = fp.read().decode('utf-8')
except UnicodeDecodeError:
    try:
        # Try the system’s default encoding
        with open(filename, 'r') as fp:
            data = fp.read()
    except UnicodeDecodeError:
        # Our last resort is latin1 which never throws an error
        # (but returns nonsense instead :-))
        with open(filename, 'rb') as fp:
            data = fp.read().decode('latin1')
return data

This surely doesn’t look very friendly but shouldn’t raise any UnicodeDecodeError (as far as I’ve tested it). What would be the preferred way for a pip testcase? To use actual files with varying encodings or to mock open() and pass varying bytes instead? |
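On the testing question, one option is to write real temporary files with varying byte content. A minimal sketch, assuming Python 3 and assuming the snippet above is wrapped in a hypothetical helper named read_with_fallback (not part of pip's API):

```python
import os
import tempfile


def read_with_fallback(filename):
    """Hypothetical helper wrapping the nested try/except snippet."""
    try:
        # Try utf-8 first
        with open(filename, 'rb') as fp:
            return fp.read().decode('utf-8')
    except UnicodeDecodeError:
        pass
    try:
        # Then the system's default encoding (Python 3 text mode)
        with open(filename, 'r') as fp:
            return fp.read()
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this cannot fail
        with open(filename, 'rb') as fp:
            return fp.read().decode('latin1')


def write_temp_file(payload):
    """Write raw bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as fp:
        fp.write(payload)
    return path


def check(payload):
    """Round-trip payload through a real file; always returns text."""
    path = write_temp_file(payload)
    try:
        return read_with_fallback(path)
    finally:
        os.remove(path)
```

A test can then call check() with UTF-8 bytes, with bytes that are invalid UTF-8, and with plain ASCII, and assert that text comes back in every case. Mocking open() would avoid touching the disk, at the cost of testing less of the real I/O path.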
Created a new pull request #1331 which fixes the issue and looks a bit nicer than the snippet I posted above. :) Also added a new unit test. |
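The pull request's code is not reproduced in this thread. Purely as an illustration, a flatter version of the same fallback idea could loop over candidate encodings. The name decode_with_fallback is hypothetical, and locale.getpreferredencoding(False) is assumed to stand in for the platform default:

```python
import locale


def decode_with_fallback(data):
    """Decode bytes by trying UTF-8, the platform default, then latin1."""
    candidates = ['utf-8', locale.getpreferredencoding(False), 'latin1']
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not expected to be reached: latin1 accepts all 256 byte values
    raise AssertionError('latin1 should never fail to decode')
```

Because latin1 never raises, the result for undecodable input may still be nonsense text, as noted earlier in the thread; the loop only guarantees that no UnicodeDecodeError escapes.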
@while0pass I'm not sure if it's related. Try to uninstall pip and install it again. |
I get the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 33: ordinal not in range(128) on Windows. It is caused by a change in the Windows registration table. The way to solve it is to open mimetypes.py in C:\python27\Lib, find the line 'default_encoding = sys.getdefaultencoding()', and add the code below before it: After that it is OK. |
What is registration table? |
@piotr-dobrogost |
It's called Windows Registry (http://en.wikipedia.org/wiki/Windows_Registry) not registration table. |
Can this patch work?

--- C:/Python27/Lib/site-packages/pip/download.py Thu May 28 09:40:28 2015
+++ C:/Python27/Lib/site-packages/pip/download.py Thu May 28 11:34:40 2015
@@ -881,6 +881,13 @@ def _download_http_url(link, session, temp_dir):
         ext = os.path.splitext(resp.url)[1]
         if ext:
             filename += ext
+    try:
+        if isinstance(temp_dir, unicode):
+            temp_dir = temp_dir.encode(sys.getfilesystemencoding())
+        if isinstance(filename, unicode):
+            filename = filename.encode(sys.getfilesystemencoding())
+    except NameError:
+        pass
     file_path = os.path.join(temp_dir, filename)
     with open(file_path, 'wb') as content_file:
         _download_url(resp, link, content_file)
|
FYI, there is a Python standard for specifying the character encoding of a Python module: PEP 263, which you can read here: https://www.python.org/dev/peps/pep-0263/. |
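Note that PEP 263 covers Python source files, not egg-info metadata. A small sketch of what such a declaration looks like and how the standard library's tokenize.detect_encoding reads it (the sample source bytes are made up for illustration):

```python
import io
import tokenize

# Made-up module source with a PEP 263 declaration on its first line
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding() looks for the declaration in the first two lines
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Decode the whole source with the declared encoding
text = source.decode(encoding)
```

Metadata files carry no such declaration, which is why the fallback discussion above is needed for them.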
any word on this? just got this today |
Closing this, I believe that this has been fixed. |
Not for me.... |
Still happening. Is this related or different?:
|
That error seems to be coming from setuptools. Try |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
pip install <package> fails on Windows if the project's description (e.g., its long description) is in UTF-8.

The problem seems to be that req.egg_info_data() (currently line 317) reads the egg-info created by python setup.py egg_info with the system's default encoding, which is not UTF-8 on Windows (but is on most *nix systems).

With Python 3, it should be no problem to use UTF-8 in your README/CHANGES/AUTHORS.txt (or whatever), so pip should read files as unicode by default:
Changing lines 296 and 297 (in pip 1.4.1; 316 and 317 in the repo) to
fixes the problem for me.
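The changed lines themselves are not preserved here. As a rough illustration of the idea only (not the actual patch), reading with an explicit encoding avoids depending on the platform default. The file name and contents are made up, and cp1252 is assumed as a typical Windows default:

```python
import io
import os
import tempfile

# Made-up metadata file containing non-ASCII text, stored as UTF-8
path = os.path.join(tempfile.mkdtemp(), 'PKG-INFO')
with io.open(path, 'w', encoding='utf-8') as fp:
    fp.write(u'Author: \xc1lvaro\n')

# Explicit encoding: independent of the platform default
with io.open(path, encoding='utf-8') as fp:
    text = fp.read()

# Simulate a cp1252 default: the UTF-8 byte 0x81 has no mapping there
try:
    with io.open(path, encoding='cp1252') as fp:
        fp.read()
    failed = False
except UnicodeDecodeError:
    failed = True
```

Not every UTF-8 file fails under cp1252; many decode without error into the wrong characters instead, so the absence of an exception does not mean the text was read correctly.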
The test setup was: