pip install: UnicodeDecodeError on Windows #1291

Closed
sscherfke opened this issue Nov 4, 2013 · 26 comments
Labels
auto-locked Outdated issues that have been locked by automation project: setuptools Related to setuptools type: bug A confirmed bug or unintended behavior

Comments

@sscherfke
Contributor

pip install <package> fails on Windows if the project's description (e.g., its long description) is encoded in UTF-8.

(simpy) C:\Users\sscherfke\Code\simpy>pip install .
Unpacking c:\users\sscherfke\code\simpy
  Running setup.py egg_info for package from file:///c%7C%5Cusers%5Csscherfke%5Ccode%5Csimpy

Cleaning up...
Exception:
Traceback (most recent call last):
  File "C:\Users\sscherfke\Envs\simpy\lib\site-packages\pip\basecommand.py", line 134, in main
    status = self.run(options, args)
  File "C:\Users\sscherfke\Envs\simpy\lib\site-packages\pip\commands\install.py", line 236, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "C:\Users\sscherfke\Envs\simpy\lib\site-packages\pip\req.py", line 1134, in prepare_files
    req_to_install.run_egg_info()
  File "C:\Users\sscherfke\Envs\simpy\lib\site-packages\pip\req.py", line 264, in run_egg_info
    "%(Name)s==%(Version)s" % self.pkg_info())
  File "C:\Users\sscherfke\Envs\simpy\lib\site-packages\pip\req.py", line 357, in pkg_info
    data = self.egg_info_data('PKG-INFO')
  File "C:\Users\sscherfke\Envs\simpy\lib\site-packages\pip\req.py", line 297, in egg_info_data
    data = fp.read()
  File "C:\Users\sscherfke\Envs\simpy\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1235: character maps to <undefined>

Storing complete log in C:\Users\sscherfke\pip\pip.log

The problem seems to be that req.egg_info_data() (currently line 317) reads the egg-info created by python setup.py egg_info with the system's default encoding, which is not UTF-8 on Windows (but is on most *nix systems).

With Python 3, it should be no problem if you use utf-8 in your README/CHANGES/AUTHORS.txt (or whatever), so pip should read files as unicode by default:

Changing lines 296 and 297 (in pip 1.4.1; 316 and 317 in the repo) to

fp = open(filename, 'rb')
data = fp.read().decode('utf-8')

fixes the problem for me.
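
A minimal sketch of why the traceback ends in cp1252.py. It assumes the offending character was a right double quotation mark (U+201D), whose UTF-8 encoding ends in byte 0x9d; the actual content of the PKG-INFO file is not shown in this report, so the sample text is hypothetical.

```python
# Hypothetical PKG-INFO content; the real file from the report is not shown.
raw = "Summary: a \u201cquoted\u201d description\n".encode("utf-8")

# cp1252 assigns no character to byte 0x9d, so decoding with it fails,
# just like the implicit decode in the traceback.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError as exc:
    cp1252_ok = False
    bad_byte = raw[exc.start]  # the byte the codec could not map

# Decoding the same bytes explicitly as UTF-8 succeeds.
text = raw.decode("utf-8")
```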

The test setup was:

  • Windows 7 64bit
  • Python 3.3.1
  • pip 1.4.1
  • setuptools 0.9.8
@thedrow

thedrow commented Nov 4, 2013

Can you create a pull request with a test case to ensure this bug won't regress?

@sscherfke
Contributor Author

@thedrow Can the PR be merged or is there something missing?

@pfmoore
Member

pfmoore commented Nov 13, 2013

With Python 3, it should be no problem if you use utf-8 in your README/CHANGES/AUTHORS.txt (or whatever), so pip should read files as unicode by default:

What if those files are not in UTF-8? As far as I know, there's no requirement that they have to be.

@sscherfke
Contributor Author

The probability that they are is IMHO much higher than that they are in the Windows default encoding, because UTF-8 is the default on Mac/Linux and also in Python 3. You could also wrap it in a try/except block and fall back to the default encoding if UTF-8 fails.

@thedrow

thedrow commented Nov 13, 2013

@sscherfke First of all, add tests to the pull request to ensure this bug won't regress.
And I was about to write that you should wrap it in a try/except that falls back to the default encoding if the data cannot be decoded as UTF-8, but you are ahead of me :)

@sscherfke
Contributor Author

@thedrow Try the default first and fall back to UTF-8, or the other way around?

@thedrow

thedrow commented Nov 13, 2013

It depends on which one is the most common use case. Ask the core developers.

@pfmoore
Member

pfmoore commented Nov 13, 2013

The point is that the encoding of such files is not defined. So you can't reliably say that any encoding will be correct (some, like latin1, will never give encoding errors, but that doesn't mean they are necessarily right).
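
A small sketch of the latin1 point: decoding never fails, but the result can still be wrong. The sample name is hypothetical and only serves to put one non-ASCII character into the data.

```python
original = "Stefan M\u00fcller"  # hypothetical author name containing "ü"
raw = original.encode("utf-8")

# latin-1 maps every byte value to a code point, so this never raises...
as_latin1 = raw.decode("latin-1")

# ...but the two UTF-8 bytes of "ü" come out as two separate characters,
# so the decoded text no longer matches what the author wrote.
round_trips = as_latin1 == original

# Any byte string at all is accepted by latin-1.
all_bytes_text = bytes(range(256)).decode("latin-1")
```

Here `round_trips` is false even though no exception was raised, which is the mojibake case described in this comment.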

Neither the egg-info nor the metadata specs make any mention of encodings, which is unfortunate, but reflects the fact that they were written when Python tended to assume that only ASCII would be used for anything that needed interoperability. Hence this bug.

The correct fix (Metadata 2.0) is of course to clearly specify encodings. But that's a way off yet, and in the meantime we have to be prepared to accept arbitrary data.

My view is that we should

  1. Make sure that we avoid exceptions caused by incompatible encoding assumptions.
  2. Avoid mojibake-related errors caused by guessing the wrong encoding as much as we can.
  3. Try to avoid damaging the data as a last resort.

Whether we use the platform encoding or UTF-8 doesn't really affect (1). In both cases, there is the possibility of invalid data. To address (1) we need to progressively fall back through a series of encodings, finishing with latin-1 (as that is the commonly used encoding that accepts all 256 byte values, and so will never error).

Using UTF-8 addresses (2), as UTF-8 is probably the most common encoding we will see (due to its prevalence on Unix systems).

For (3) it's really about how the data is used, and I think that's out of scope for this patch.

So if you add exception handling and a fallback - I'd suggest the platform default and then latin-1 in that order if UTF-8 fails - I think that would be a good solution.

Whether UTF-8 or platform default is the best choice for the initial attempt is something I doubt anyone can tell you. I suspect that a relatively small number of projects go outside ASCII anyway. For those that do, if you're on Unix UTF-8 is the platform default so it makes no difference. On Windows, it boils down to which is the most important case - installing stuff developed by other Windows developers, or installing stuff developed by Unix developers. Honestly, that's going to be an almost totally arbitrary decision.

@thedrow

thedrow commented Nov 13, 2013

I think @pfmoore has a point. I completely agree.

@sscherfke
Contributor Author

    try:
        # Try utf-8
        with open(filename, 'rb') as fp:
            data = fp.read().decode('utf-8')
    except UnicodeDecodeError:
        try:
            # Try the system’s default encoding
            with open(filename, 'r') as fp:
                data = fp.read()
        except UnicodeDecodeError:
            # Our last resort is latin1 which never throws an error
            # (but returns nonsense instead :-))
            with open(filename, 'rb') as fp:
                data = fp.read().decode('latin1')
    return data

This surely doesn’t look very friendly but shouldn’t raise any UnicodeDecodeError (as far as I’ve tested it).
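
The same fallback order can be sketched more compactly by reading the bytes once and looping over candidate encodings. This is only an illustration, not the code from any pip pull request; the function name `read_text_with_fallback` is hypothetical, and using `locale.getpreferredencoding(False)` as "the system's default encoding" is an assumption.

```python
import locale


def read_text_with_fallback(filename):
    """Decode a file as UTF-8, then the platform encoding, then latin-1."""
    with open(filename, "rb") as fp:
        raw = fp.read()
    # Try the most likely encodings first, in the order discussed above.
    for encoding in ("utf-8", locale.getpreferredencoding(False)):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte value, so this last step cannot fail.
    return raw.decode("latin-1")
```

One difference from the snippet above: decoding raw bytes does not apply the newline translation that text-mode `open(filename, 'r')` performs, so `\r\n` line endings are kept as they are in the file.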

What would be the preferred way for a pip testcase? To use actual files with varying encodings or to mock open() and pass varying bytes instead?

@sscherfke
Contributor Author

Created a new pull request #1331 which fixes the issue and looks a bit nicer than the snippet I posted above. :) Also added a new unit test.

@while0pass

C:\> pip --version
pip 1.5.4 from C:\Python27\lib\site-packages (Python 2.7)
C:\>pip install ipython
Downloading/unpacking ipython
Cleaning up...
Exception:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\pip\basecommand.py", line 122, in main
    status = self.run(options, args)
  File "C:\Python27\lib\site-packages\pip\commands\install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "C:\Python27\lib\site-packages\pip\req.py", line 1229, in prepare_files
    req_to_install.run_egg_info()
  File "C:\Python27\lib\site-packages\pip\req.py", line 292, in run_egg_info
    logger.notify('Running setup.py (path:%s) egg_info for package %s' % (self.setup_py, self.name))
  File "C:\Python27\lib\site-packages\pip\req.py", line 265, in setup_py
    import setuptools
  File "C:\Python27\lib\site-packages\setuptools\__init__.py", line 12, in <module>
    from setuptools.extension import Extension
  File "C:\Python27\lib\site-packages\setuptools\extension.py", line 7, in <module>
    from setuptools.dist import _get_unpatched
  File "C:\Python27\lib\site-packages\setuptools\dist.py", line 15, in <module>
    from setuptools.compat import numeric_types, basestring
  File "C:\Python27\lib\site-packages\setuptools\compat.py", line 19, in <module>
    from SimpleHTTPServer import SimpleHTTPRequestHandler
  File "C:\Python27\lib\SimpleHTTPServer.py", line 27, in <module>
    class SimpleHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
  File "C:\Python27\lib\SimpleHTTPServer.py", line 208, in SimpleHTTPRequestHandler
    mimetypes.init() # try to read system mime.types
  File "C:\Python27\lib\mimetypes.py", line 358, in init
    db.read_windows_registry()
  File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry
    for subkeyname in enum_types(hkcr):
  File "C:\Python27\lib\mimetypes.py", line 249, in enum_types
    ctype = ctype.encode(default_encoding) # omit in 3.x!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 9: ordinal not in range(128)

Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\Scripts\pip.exe\__main__.py", line 9, in <module>
  File "C:\Python27\lib\site-packages\pip\__init__.py", line 185, in main
    return command.main(cmd_args)
  File "C:\Python27\lib\site-packages\pip\basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 68: ordinal not in range(128)
C:\>pip --help
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\Scripts\pip.exe\__main__.py", line 9, in <module>
  File "C:\Python27\lib\site-packages\pip\__init__.py", line 177, in main
    cmd_name, cmd_args = parseopts(initial_args)
  File "C:\Python27\lib\site-packages\pip\__init__.py", line 138, in parseopts
    general_options, args_else = parser.parse_args(args)
  File "C:\Python27\lib\optparse.py", line 1399, in parse_args
    stop = self._process_args(largs, rargs, values)
  File "C:\Python27\lib\optparse.py", line 1439, in _process_args
    self._process_long_opt(rargs, values)
  File "C:\Python27\lib\optparse.py", line 1514, in _process_long_opt
    option.process(opt, value, values, self)
  File "C:\Python27\lib\optparse.py", line 788, in process
    self.action, self.dest, opt, value, values, parser)
  File "C:\Python27\lib\optparse.py", line 810, in take_action
    parser.print_help()
  File "C:\Python27\lib\optparse.py", line 1669, in print_help
    file.write(self.format_help().encode(encoding, "replace"))
  File "C:\Python27\lib\encodings\cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 1257: ordinal not in range(128)
C:\>pip
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\Scripts\pip.exe\__main__.py", line 9, in <module>
  File "C:\Python27\lib\site-packages\pip\__init__.py", line 177, in main
    cmd_name, cmd_args = parseopts(initial_args)
  File "C:\Python27\lib\site-packages\pip\__init__.py", line 148, in parseopts
    parser.print_help()
  File "C:\Python27\lib\optparse.py", line 1669, in print_help
    file.write(self.format_help().encode(encoding, "replace"))
  File "C:\Python27\lib\encodings\cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 1257: ordinal not in range(128)

@thedrow

thedrow commented Mar 19, 2014

@while0pass I'm not sure if it's related. Try to uninstall pip and install it again.
If the issue persists just open a new ticket.

@thedrow

thedrow commented Mar 19, 2014

Why is this issue still open when #1395 & #1396 were already merged? @dstufft

@BlakeWL

BlakeWL commented Mar 6, 2015

I get the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 33: ordinal not in range(128) on Windows. It is caused by a change in the Windows registration table. To work around it, open mimetypes.py in C:\python27\Lib, find the line 'default_encoding = sys.getdefaultencoding()', and add the first three lines below before it:
if sys.getdefaultencoding() != 'gbk':
    reload(sys)
    sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()

After that it works.

@piotr-dobrogost

@BlakeWL

What is registration table?

@BlakeWL

BlakeWL commented Mar 7, 2015

@piotr-dobrogost
When you run regedit in Windows, the registration table will show up.

@piotr-dobrogost

It's called Windows Registry (http://en.wikipedia.org/wiki/Windows_Registry) not registration table.

@tanbro

tanbro commented May 28, 2015

Can this patch work?

--- C:/Python27/Lib/site-packages/pip/download.py   Thu May 28 09:40:28 2015
+++ C:/Python27/Lib/site-packages/pip/download.py   Thu May 28 11:34:40 2015
@@ -881,6 +881,13 @@ def _download_http_url(link, session, temp_dir):
         ext = os.path.splitext(resp.url)[1]
         if ext:
             filename += ext
+    try:
+        if isinstance(temp_dir, unicode):
+            temp_dir = temp_dir.encode(sys.getfilesystemencoding())
+        if isinstance(filename, unicode):
+            filename = filename.encode(sys.getfilesystemencoding())
+    except NameError:
+        pass
     file_path = os.path.join(temp_dir, filename)
     with open(file_path, 'wb') as content_file:
         _download_url(resp, link, content_file)

@krader1961

FYI, there is a Python standard for specifying the character encoding of a Python module, PEP 263, which you can read here: https://www.python.org/dev/peps/pep-0263/.
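
For illustration, a short sketch of how a PEP 263 declaration can be read with the standard library's `tokenize.detect_encoding`. The module source here is hypothetical. Note that PEP 263 covers Python source files only; it says nothing about metadata files such as PKG-INFO, which is where this issue started.

```python
import io
import tokenize

# Hypothetical module source, stored as latin-1 bytes with a PEP 263 cookie.
source = "# -*- coding: latin-1 -*-\nname = 'M\u00fcller'\n".encode("latin-1")

# detect_encoding() inspects at most the first two lines and returns the
# declared encoding (as a normalized name) plus the lines it has read.
encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)

# With the declared encoding known, the bytes can be decoded correctly.
text = source.decode(encoding)
```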

@allenwyma

any word on this? just got this today

@xavfernandez xavfernandez added the type: bug A confirmed bug or unintended behavior label Apr 3, 2016
@dstufft
Member

dstufft commented Mar 24, 2017

Closing this, I believe that this has been fixed.

@dstufft dstufft closed this as completed Mar 24, 2017
@arisobel

Not for me....
Which library do I have to upgrade in order to get rid of this problem?

@mixmastamyk

Still happening. Is this related or different?:

e:\repos\fr>pip install -e .

Obtaining file:///E:/repos/fr
Installing collected packages: fr
  Found existing installation: fr 3.0a0
    Uninstalling fr-3.0a0:
      Successfully uninstalled fr-3.0a0
  Running setup.py develop for fr
    Complete output from command c:\users\TheUser\python36\python.exe -c "import setuptools, tokenize;__file__='E:\\repos\\fr\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
    running develop
    running egg_info
    writing fr.egg-info\PKG-INFO
    writing dependency_links to fr.egg-info\dependency_links.txt
    writing requirements to fr.egg-info\requires.txt
    writing top-level names to fr.egg-info\top_level.txt
    reading manifest file 'fr.egg-info\SOURCES.txt'
    writing manifest file 'fr.egg-info\SOURCES.txt'
    running build_ext
    Creating c:\users\TheUser\python36\lib\site-packages\fr.egg-link (link to .)
    fr 3.0a0 is already the active version in easy-install.pth
    c:\users\TheUser\python36\lib\site-packages\setuptools\dist.py:397: UserWarning: Normalizing '3.00a0' to '3.0a0'
      normalized_version,
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "E:\repos\fr\setup.py", line 62, in <module>
        'Topic :: Utilities',
      File "c:\users\TheUser\python36\lib\site-packages\setuptools\__init__.py", line 129, in setup
        return distutils.core.setup(**attrs)
      File "c:\users\TheUser\python36\lib\distutils\core.py", line 148, in setup
        dist.run_commands()
      File "c:\users\TheUser\python36\lib\distutils\dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "c:\users\TheUser\python36\lib\distutils\dist.py", line 974, in run_command
        cmd_obj.run()
      File "c:\users\TheUser\python36\lib\site-packages\setuptools\command\develop.py", line 36, in run
        self.install_for_development()
      File "c:\users\TheUser\python36\lib\site-packages\setuptools\command\develop.py", line 152, in install_for_development
        self.process_distribution(None, self.dist, not self.no_deps)
      File "c:\users\TheUser\python36\lib\site-packages\setuptools\command\easy_install.py", line 726, in process_distribution
        self.install_egg_scripts(dist)
      File "c:\users\TheUser\python36\lib\site-packages\setuptools\command\develop.py", line 187, in install_egg_scripts
        script_text = strm.read()
      File "c:\users\TheUser\python36\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1303: character maps to <undefined>


    ----------------------------------------
  Rolling back uninstall of fr
Command "c:\users\TheUser\python36\python.exe -c "import setuptools, tokenize;__file__='E:\\repos\\fr\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps" failed with error code 1 in E:\repos\fr\

@pradyunsg
Member

That error seems to be coming from setuptools. Try python setup.py develop; if that errors out, it's a setuptools issue. If it doesn't, could you file a new bug report for it?

@pradyunsg pradyunsg added the project: setuptools Related to setuptools label Jun 13, 2018
@lock

lock bot commented Jun 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 2, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 2, 2019