Strings and Unicode

bdarnell edited this page May 30, 2011 · 2 revisions


Tornado supports both python2 and python3, which requires care in dealing with strings (see also PEP 3333). Even though each version of python has only two string types, there are logically three types that must be considered when supporting both versions:

  • bytes: Represented by the str type in python2 and the bytes type in python3. To refer to this type unambiguously in tornado code, use tornado.util.bytes_type, and use tornado.util.b("") to create byte literals. Byte literal support wasn't added to python until version 2.6, so until we drop support for 2.5 we must use our own aliases.
  • unicode: Represented by the unicode type in python2 and the str type in python3. Tornado code refers to this type with the python2 names: unicode for the type and u"" for literals.
  • str: The native string type, called str in both versions but equivalent to bytes in python2 and unicode in python3.
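
The three logical types can be illustrated with a small compatibility shim. This is a minimal sketch, not Tornado's source: the names bytes_type, unicode_type, and b below are local stand-ins defined here for illustration, and the behavior of b (encoding an ascii literal as latin1) is an assumption about how such a helper could work.

```python
import sys

# Pick the concrete type behind each logical type for the running interpreter.
if sys.version_info[0] >= 3:
    bytes_type = bytes
    unicode_type = str
else:
    bytes_type = str
    unicode_type = unicode  # only evaluated on python2


def b(literal):
    """Hypothetical byte-literal helper: turn a native ascii literal into bytes."""
    if isinstance(literal, bytes_type):
        return literal  # already bytes (always the case on python2)
    return literal.encode("latin1")


# The native str type is one of the other two, depending on the version.
native_is_bytes = str is bytes_type
native_is_unicode = str is unicode_type
```

On python3 native_is_unicode is true and b("GET") produces a bytes object; on python2 the native type is the bytes type and b returns its argument unchanged.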

Tornado uses UTF-8 as its default encoding, and the tornado.escape module provides the utf8, to_unicode, and native_str functions to convert arguments to bytes, unicode, and str respectively. In general, tornado methods should accept any string type as arguments. Return values should be native strings when possible. Data from external sources should be converted to unicode only if a definite encoding is known; otherwise it should be left as bytes.
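
The conversion rules above can be sketched with three small helpers. These are simplified python3-only stand-ins written for illustration, not the tornado.escape implementations; the names as_utf8_bytes, as_unicode, and as_native_str are hypothetical and correspond loosely to utf8, to_unicode, and native_str.

```python
def as_utf8_bytes(value):
    """Return bytes: unicode input is encoded as UTF-8, bytes pass through."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


def as_unicode(value):
    """Return unicode: bytes input is assumed to be UTF-8 and decoded."""
    if isinstance(value, str):
        return value
    return value.decode("utf-8")


def as_native_str(value):
    """Return the native str type, which is unicode on python3."""
    return as_unicode(value)


# A method that accepts any string type normalizes its argument first.
greeting = as_native_str(b"hello") + as_native_str(" world")

# Data whose encoding is unknown stays as bytes rather than being guessed at.
payload_of_unknown_encoding = b"\xff\xfe\x00"
```

Accepting either type at the boundary and normalizing once keeps callers from having to know which string type a given method prefers.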

Detailed rules

  • Low-level code such as IOStream generally deals solely in bytes.
  • Output methods such as RequestHandler.write accept either bytes or unicode. Unicode strings are encoded as utf8, but byte strings are never decoded, so applications can output non-utf8 data.
  • HTTP headers are generally ascii (officially they're latin1, but use of non-ascii is rare), so we mostly represent them, and data derived from them, with native strings. Note that in python2, if a header contains non-ascii data, tornado will decode it as latin1 and re-encode it as utf8.
  • Query parameters are sent percent-encoded, but the underlying character set is unspecified. In HTTPRequest.arguments the percent-encoding has been undone, resulting in byte strings for the argument values. In RequestHandler.get_argument these bytes are decoded according to RequestHandler.decode_argument, allowing the application to choose the encoding to be used (default utf8). Note that because keys are nearly always ascii and having byte strings as keys is awkward, the keys are converted to native strings (using latin1 on python3).
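
The output, header, and query-parameter rules in the list above can be sketched as follows. This is a python3-only illustration using the standard library, not Tornado's code: encode_output, header_to_native, parse_query, and the standalone decode_argument are hypothetical helpers that only mirror the behavior described in the list.

```python
from urllib.parse import unquote_to_bytes


def encode_output(chunk):
    """Output rule: unicode is encoded as utf8, bytes are written untouched."""
    if isinstance(chunk, bytes):
        return chunk
    return chunk.encode("utf-8")


def header_to_native(raw_value):
    """Header rule on python3: decode as latin1, which accepts every byte."""
    return raw_value.decode("latin1")


def parse_query(query):
    """Query rule: undo percent-encoding; values stay bytes, keys become str."""
    arguments = {}
    for pair in query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # In form encoding '+' stands for a space.
        raw_key = unquote_to_bytes(key.replace("+", " "))
        raw_value = unquote_to_bytes(value.replace("+", " "))
        # Keys are nearly always ascii, so latin1 gives a lossless native str.
        arguments.setdefault(raw_key.decode("latin1"), []).append(raw_value)
    return arguments


def decode_argument(value, encoding="utf-8"):
    """Application-chosen decoding step for one argument value."""
    return value.decode(encoding)


arguments = parse_query("name=caf%C3%A9&tag=a&tag=b+c")
name = decode_argument(arguments["name"][0])
```

Keeping the values as bytes until decode_argument runs is what lets an application that receives, say, latin1-encoded form data pass a different encoding instead of failing on the utf8 default.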