As Short as Possible Guidelines for Handling Unicode in Python 2
The labels »recommended« and »mandatory« say how important it is to read a section.
The Python modules in the repo complement this guide. Feel free to copy them into your projects and send improvements.
- Write unicode if you mean the Python type.
- Write Unicode if you mean Unicode in general.
Need to Know (mandatory)
- ❃ str, unicode, bytes and Python 2 vs. Python 3
- When dealing with strings and Unicode in Python, there are two types you have to know. str is a plain list of bytes that just happens to be rendered as a string. unicode is a list of Unicode characters. Python 2 → Python 3: str → bytes, unicode → str.
- ❃ default string type
- The default string type in both Pythons is str, but note that str means different things in Python 2 and Python 3. In Python 3 all string variables inside a program are lists of Unicode characters, and we want the same in Python 2, because we are forward-looking.
- ❃ the ideal: every string is unicode
- Therefore, we assume all string variables inside our programs to be unicode.
- ❃ (nearly) everything outside is str
- When communicating with the outside world and some libraries, we have to convert to or from str.
- ❃ Unicode and UTF-8
- Unicode is different from UTF-8. Read the first paragraph in the blue box at the top of https://pythonhosted.org/kitchen/unicode-frustrations.html.
- ❃ encoding and decoding
- To turn a UTF-8-encoded str (list of bytes) into unicode, use .decode('utf-8'). To turn a unicode into a UTF-8-encoded str, use .encode('utf-8').
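A minimal sketch of that round trip. The variable names and the sample word are only for illustration; the utf8_ prefix follows the naming convention of this guide. It assumes the incoming bytes really are UTF-8.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Bytes as they might arrive from outside (type str in Python 2).
utf8_name = b"Erdkr\xc3\xb6te"

# Decode at the boundary: bytes -> Unicode characters.
name = utf8_name.decode('utf-8')

# The byte string is one item longer, because "ö" takes two bytes in UTF-8.
byte_count = len(utf8_name)
char_count = len(name)

# Encode again before handing the value back to the outside world.
utf8_again = name.encode('utf-8')
```

Inside the program only name would be passed around; utf8_name and utf8_again exist only at the edges.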
- ❃ unicode_literals
In every Python file, import unicode_literals:
    from __future__ import unicode_literals
If you don't do this, all string literals in your source code will be str, which is against the »every string is unicode« ideal of the Need to Know. Use b"bla" to write a str literal.
- ❃ string conversion
Use unicode() instead of str() when you want to convert numbers etc. to strings.
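A small sketch of such a conversion. The try/except shim is not part of this guide; it is an assumption added only so that the snippet also runs on Python 3, where the name unicode does not exist.

```python
from __future__ import unicode_literals

try:
    unicode
except NameError:
    # Assumption for portability: on Python 3, str already is the text type.
    unicode = str

count = 3

# unicode() yields text, so the result can be joined with other text safely.
label = unicode(count) + " frogs"

# Encode only when the value has to leave the program.
b_label = label.encode('utf-8')
```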
- ❃ naming convention
If there is a string variable that needs to be of type str inside your program, prefix it with b_ if you don't know the encoding, or with utf8_ if you know it is UTF-8:
    b_company_name = read_company_name_str()
    utf8_company_name = read_company_name_utf8()
- ❃ reading and writing files
When you want to read from or write to a file, use codecs.open() instead of the built-in open():
    >>> from __future__ import unicode_literals
    >>> import codecs
    >>> with codecs.open("bla.txt", 'w', 'utf-8') as f:
    ...     f.write("üüü")
    ...
    >>> with codecs.open("bla.txt", 'r', 'utf-8') as f:
    ...     f.read(3)
    ...
    u'\xfc\xfc\xfc'
    >>> 'ü' * 3
    u'\xfc\xfc\xfc'
Everything that is written to the outside world should be str. This normally includes parameters to print(). To avoid having to encode unicodes all the time, write at the top of every file, but after all imports:
    import sys
    import codecs
    # and other imports

    # Wrap stdout only once, so that re-running this does not stack writers.
    if not isinstance(sys.stdout, codecs.StreamWriter):
        sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    # main code follows
(Don't forget to add imports for sys and codecs if they aren't there already.) This way you can do print(unicode). Note, however, that now it's dangerous to do print(str). Never pass a str to print().
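To see what the UTF-8 writer does, here is a sketch that wraps an in-memory byte stream instead of sys.stdout. The io.BytesIO stand-in and the variable names are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
import io

# Stand-in for sys.stdout: a byte stream we can inspect afterwards.
b_stream = io.BytesIO()
writer = codecs.getwriter('utf-8')(b_stream)

# Text goes in, UTF-8 bytes come out.
writer.write("Erdkröte\n")
written = b_stream.getvalue()

# Already-encoded non-ASCII bytes are rejected: Python 2 tries to decode them
# as ASCII first (UnicodeDecodeError), Python 3 refuses bytes (TypeError).
try:
    writer.write(b"Erdkr\xc3\xb6te\n")
    rejected = False
except (UnicodeDecodeError, TypeError):
    rejected = True
```

The second write is the situation the warning about print(str) is about: the writer expects text and fails on byte strings that are not plain ASCII.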
- ❃ exceptions and warnings
- When raising exceptions or warnings, only pass str. Think twice whether the thing you're passing really is str.
- We don't put a UTF-8 writer in front of sys.stderr, since that would cause even more confusion. So make sure that everything you send there is str.
- ❃ external libraries
- Check whether the library procedures you're calling accept and return
unicode. If they accept and return
str, take care to make the right conversions. Below are notes on which libraries do what.
- ❃ environment variables
Use unicode_environ instead of os.environ. If you need to do anything else with the environment, extend unicode_environ instead of resorting to environment utilities from os.
- ❃ command line arguments
- Command line arguments come as str and you need to convert them. Unfortunately, passing type=unicode to ArgumentParser.add_argument is not enough. Use unicode_argparse.
- ❃ testing
- In your tests, try to break the system by including non-ASCII characters in strings. If you can't succeed, chances are good that you have done the Unicode thing correctly.
- ❃ CONSTANT VIGILANCE!
- When you read data from or write data to somewhere outside your program, make sure it gets converted to the right types.
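One way to keep the conversions at the boundary in one place is a pair of small helpers. This is only a sketch: to_unicode and to_utf8 are hypothetical names, not functions from the modules in this repo, and the simulated argument list assumes UTF-8-encoded command line arguments.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_unicode(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_utf8(value):
    """Hypothetical helper: encode text as UTF-8, pass byte strings through."""
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


# Simulated command line arguments as Python 2 would deliver them.
b_argv = [b"--name", b"t\xc3\xbc\xc3\xbcls"]
argv = [to_unicode(b_arg) for b_arg in b_argv]

# Round trip with non-ASCII data, in the spirit of the testing advice.
utf8_out = to_utf8(argv[1])
```

Feeding non-ASCII values like the one above through every entry and exit point is a cheap way to find places where a conversion was forgotten.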
Exceptions to the rules (recommended)
You may make project-specific exceptions to these rules if they get annoying. Be sure to document them.
Example for a project that uses Pygit2 often:
- ❃ Git SHA1s
- Git SHA1s as returned by Oid.hex are of type str. Since they never contain non-ASCII characters and it would be annoying to convert them all the time, we leave them as str. Since we know that they are str and it is annoying to write prefixes, it is okay to leave off the b_. (Not so sure if this is good, though.)
- ❃ UTF-8-encoded source
In the first or second line of every Python file, put the following:
# -*- coding: utf-8 -*-
Doing this will allow you to use non-ASCII characters in your Python source.
- ❃ unicodification (stringification)
Define __unicode__ and __str__ like this (credits):
    def __unicode__(self):
        return …  # create unicode representation of your object

    def __str__(self):
        # str() must return bytes in Python 2, so encode the unicode form
        return unicode(self).encode('utf-8')
- ❃ writing Unicode utilities
- If you want to write utilities like
unicode_argparse, you might find the functions from
When I write something like »works with unicode arguments«, I mean that it works with arguments of type unicode which can contain arbitrary characters, i.e. ASCII as well as non-ASCII.
Feel free to extend, or correct if things have changed.
codecs.open works with
unicode as well as
httplib2.Http.request works with unicode arguments. However, the results will all contain or be of type str:
    >>> r, c = httplib2.Http(".cache").request("http://de.wikipedia.org/wiki/Erdkröte")
    >>> r['content-type']
    'text/html; charset=UTF-8'
    >>> type(r['content-type'])
    <type 'str'>
    >>> type(c)
    <type 'str'>
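Since the response body arrives as bytes, it has to be decoded with the charset the server announced. The following sketch does not call httplib2 at all; charset_from_content_type is a hypothetical helper, and the header and body values are made up to mirror the session above.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def charset_from_content_type(b_content_type, default='utf-8'):
    """Hypothetical helper: extract the charset parameter of a header value."""
    if isinstance(b_content_type, bytes):
        content_type = b_content_type.decode('ascii')
    else:
        content_type = b_content_type
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(';')[1:]:
        key, _, value = param.strip().partition('=')
        if key.lower() == 'charset' and value:
            return value.strip('"').lower()
    return default


# Made-up values shaped like the response header and body shown above.
b_header = b'text/html; charset=UTF-8'
b_body = b"<h1>Erdkr\xc3\xb6te</h1>"

charset = charset_from_content_type(b_header)
body = b_body.decode(charset)
```

Falling back to UTF-8 when no charset is given is an assumption of this sketch, not something every server honours.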
Things in os are generally safe to use with
unicode. However, note this:
unicode(!!!) If the result contains non-ASCII characters, it will be
str. Isn't it sweet?
PyCurl works solely on str.
- Config values can be
- Paths are
str. However, this is extrapolated from the fact that
str. The API might be inconsistent, so check the thing you're using and add the data here.
- Refspecs should be
Remote.add_fetch doesn't complain when you pass
Remote.fetch_refspecs throws an exception if you added a refspec with non-ASCII characters. Funny enough, though,
Remote.fetch_refspecs is a list of
Repository(path) doesn't work with
unicodes containing non-ASCII characters. In order to be sure, I'd say that all paths passed to Pygit2 methods or the like should be converted to UTF-8
unicode. If you need
str, you can use
    >>> no_r = pygit2.Repository("/tmp/tüüls")  # throws error
    >>> r = pygit2.clone_repository("/tmp/tüüls", "./tüüls")  # works
    >>> r.remotes.url  # throws error
re is completely okay with
unicode if you give it
urllib2 didn't like
unicode for URLs and also returned
str only. Since
urllib is older, I guess it's the same there.
- the documentation of the mentioned modules or libraries
- Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you are in an industrious mood, you can help improve this document and the modules.
- I marked up many things as literal text. It would be nice if you could change this to interpreted text, such as :meth:`pygit2.Diff.merge`. But you'd also have to find the right way to convert this to HTML, since rst2html doesn't like meth (as well as the other Python-specific roles, I guess).
- As stated above, the notes on which libraries do what are always happy to be updated and extended.
Copyright (c) 2015 Richard Möhn
This work is licensed under the Creative Commons Attribution 4.0 International License.