Strings and Unicode
Tornado supports both python2 and python3, which requires care in dealing with strings (see also PEP 3333). Even though each version of python only has two string types, there are logically three types that must be considered when supporting both versions:
- bytes: Represented by the `str` type in python2 and the `bytes` type in python3. To unambiguously refer to this type in tornado code, use `tornado.util.b("")` to create byte literals (byte literal support wasn't added to python until version 2.6, so until we drop support for 2.5 we must use our own aliases).
- unicode: Represented by the `unicode` type in python2 and the `str` type in python3. Tornado code refers to this type with the python2 names, such as `unicode` for the type.
- str: The native string type, called `str` in both versions but equivalent to `bytes` in python2 and `unicode` in python3.
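As a small illustration of the distinction between the logical types, the following sketch classifies values on python3 only. The helper name `logical_type` is hypothetical and is not part of tornado.

```python
def logical_type(value):
    """Name the logical string type of a value (python3-only sketch)."""
    if isinstance(value, bytes):
        return "bytes"
    if isinstance(value, str):
        # On python3 the native str type is the unicode type.
        return "unicode"
    raise TypeError("not a string type: %r" % type(value))


raw = b"caf\xc3\xa9"        # five bytes of UTF-8 encoded data
text = raw.decode("utf-8")  # four unicode characters

print(logical_type(raw), len(raw))
print(logical_type(text), len(text))
```

The same data has a different length depending on whether it is held as bytes or as unicode, which is one reason the two must not be mixed silently.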
Tornado uses UTF-8 as its default encoding, and the `tornado.escape` module provides `utf8`, `to_unicode`, and `native_str` functions to convert arguments to the three string types. In general, tornado methods should accept any string type as arguments. Return values should be native strings when possible. Data from external sources should only be converted to unicode if a definite encoding is known; otherwise it should be left as bytes.
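The "accept any string type, convert on demand" convention can be sketched with a few standalone helpers. These are hypothetical stand-ins written for python3 only, not the code in `tornado.escape`; the names `as_bytes`, `as_unicode`, and `as_native` are assumptions made for this example.

```python
def as_bytes(value, encoding="utf-8"):
    """Return value as bytes; unicode text is encoded (UTF-8 by default)."""
    if isinstance(value, bytes):
        return value  # already bytes: never decoded or re-encoded
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected bytes or str, got %r" % type(value))


def as_unicode(value, encoding="utf-8"):
    """Return value as unicode text; bytes are decoded (UTF-8 by default)."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    raise TypeError("expected bytes or str, got %r" % type(value))


# On python3 the native string type is unicode, so the native conversion
# is the same as the unicode conversion in this sketch.
as_native = as_unicode

print(as_bytes("caf\u00e9"))
print(as_unicode(b"caf\xc3\xa9"))
```

Values already of the target type pass through unchanged, so callers can apply a helper without first checking what they were given.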
- Low-level code such as `IOStream` generally deals solely in bytes.
- Output methods such as `RequestHandler.write` accept either bytes or unicode. Unicode strings will be encoded as utf8, but byte strings will never be decoded, so applications can output non-utf8 data.
- HTTP headers are generally ascii (officially they're latin1, but use of non-ascii is rare), so we mostly represent them (and data derived from them) with native strings. Note that in python2, if a header contains non-ascii data, tornado will decode it as latin1 and re-encode it as utf8!
- Query parameters are sent percent-encoded, but the underlying character set is unspecified. In `HTTPRequest.arguments` the percent-encoding has been undone, resulting in byte strings for the argument values. In `RequestHandler.get_argument` these bytes are decoded according to `RequestHandler.decode_argument`, allowing the application to choose the encoding to be used (default utf8). Note that because keys are nearly always ascii and having byte strings as keys is awkward, the keys are converted to native strings (using latin1 on python3).
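The two-stage handling of query parameters described in the last item (undo the percent-encoding to get bytes, then decode with an encoding chosen by the application) can be sketched using only the python3 standard library. The helpers `parse_query_bytes` and `get_decoded` are hypothetical names for this example and do not reproduce tornado's implementation.

```python
from urllib.parse import unquote_to_bytes


def parse_query_bytes(query):
    """Map native-string keys to lists of byte-string values."""
    arguments = {}
    for pair in query.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        # In form-encoded query strings "+" stands for a space.
        key_bytes = unquote_to_bytes(name.replace("+", " "))
        value_bytes = unquote_to_bytes(value.replace("+", " "))
        # Keys become native strings; latin1 maps every byte to a character.
        key = key_bytes.decode("latin-1")
        arguments.setdefault(key, []).append(value_bytes)
    return arguments


def get_decoded(arguments, name, encoding="utf-8"):
    """Decode one argument value with the encoding the caller chooses."""
    return arguments[name][-1].decode(encoding)


args = parse_query_bytes("q=caf%C3%A9&city=K%F6ln")
print(args["q"], args["city"])
print(get_decoded(args, "q"))
print(get_decoded(args, "city", encoding="latin-1"))
```

Keeping the values as bytes until an encoding is known means a non-utf8 value such as `K%F6ln` is still recoverable; decoding it eagerly as utf8 would raise an error instead.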