#Text encoding and Unicode in Python 2.7

When browsing the web, you've probably encountered something that looks like this:

In [25]:
normal_inoffensive_seeming_string = "don’t have a cow, man"
print normal_inoffensive_seeming_string.decode('cp1252')

donâ€™t have a cow, man


You find yourself staring at weird garbage like "â€™" where you were expecting a normal, inoffensive character like *’* (that's an apostrophe, if you can't immediately tell).
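The garbage above is what you get when bytes written in one encoding are read back under another. Here's a minimal sketch of that round trip. The variable names are mine, for illustration only, and the text is written with `u"..."` literals and escapes so that it should behave the same under Python 2.7 and Python 3:

```python
# The right single quotation mark is code point U+2019.
original = u"don\u2019t have a cow, man"

# Store the text as UTF-8 bytes, then (wrongly) read those bytes as cp1252.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")

# Undo the mistake: turn the garbage back into bytes, then decode correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
```

The apostrophe takes three bytes in UTF-8, and cp1252 reads each of those bytes as a separate character, which is where the three-character garbage comes from.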

If you haven't yet encountered this kind of problem when you're dealing with data: don't worry, you will. The issue is that there is more than one way to "encode" text as data, and each one of those ways is incompatible with the others. Learning how to deal with text encoded in different ways and how to convert that text from one encoding to another is one of the most common tasks in computer programming. The goal of this tutorial is to show you the basics so you can recognize what you're dealing with and make good decisions about how to solve whatever text encoding problem is before you.

##A brief history of character encoding

One of the earliest digitization tasks that we performed with machines was the task of converting text into digital signals. One of the earliest and most successful schemes for representing text for digital transmission was Morse Code, which represents each letter as a series of short or long on/off signals:

![Morse Code](https://upload.wikimedia.org/wikipedia/commons/b/b5/International_Morse_Code.svg)

(Image from Wikipedia)

The process of communicating with Morse Code looked something like this:

1. Start with original (written) text
2. "Encode" text as Morse Code signals
3. Send signals (e.g., over a telegraph wire)
4. "Decode" text at destination

The result is a text that has all of the same characters as the original text. Morse Code is unlike contemporary digital network transmission in that [it is *not* strictly digital](http://cs.stackexchange.com/a/39922), and the encoding/decoding steps were usually done by human beings instead of machines. But the idea is the same: text starts out in one place, with one representation, is then converted to a code and transmitted, and is then decoded at the destination.

> There were a number of other schemes for transmitting text digitally before the advent of contemporary computers and networking technology, most notably [Baudot Code](https://en.wikipedia.org/wiki/Baudot_code), which encoded each letter in a multiplexed signal of five bits (on/off signals).

###ASCII and ye shall receive

Modern digital computers (say, from 1960 onward) usually store and transmit data in chunks of eight bits. Each of these chunks is called a "byte." A byte, containing eight bits, can store one of 256 distinct values (2 to the 8th power = 256). The earliest schemes for encoding text digitally for storage and transmission took advantage of this fact, and "encoded" text by assigning each letter and symbol to a unique number somewhere in the range of 0 to 255.

Of course, there's no obvious correspondence between characters and numbers. (Maybe "A" should be represented by the number 1, sure, but what number do you use to represent "!" or ")"?) So individual organizations invented their own "standards" for how numbers should be assigned to characters. In the 50s and 60s there was a veritable Cambrian explosion of such standards, with names like Fieldata and EBCDIC. Each of these standards used different numbers for different characters. (Fieldata represented the letter "A" with the number 6; EBCDIC represented "A" with 193.)

In response to this Babel of encodings, the American Standards Association created a committee in the early 1960s to create one character encoding standard to rule them all. The result of the committee was a standard called "ASCII" (American Standard Code for Information Interchange). Here's a table of the characters in ASCII and the numbers they correspond with:

![ASCII table](https://upload.wikimedia.org/wikipedia/commons/d/dd/ASCII-Table.svg)

If you imagine a computer's memory as a sequence of bytes, sort of like a spreadsheet with a large number of columns where each column contains one number, you can visualize ASCII-encoded text as looking like this:

| Cell number | 0  | 1   | 2   | 3   | 4   | 5   | 6   |
| ----------- | -- | --- | --- | --- | --- | --- | --- |
| Cell value  | 65 | 108 | 108 | 105 | 115 | 111 | 110 |
| ASCII char  | A  | l   | l   | i   | s   | o   | n   |
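If you'd like to check the table yourself, here's a small sketch that rebuilds it. The names `cell_values` and `rows` are just for illustration; `bytearray` is used because it exposes each byte as an integer in both Python 2.7 and Python 3:

```python
# A byte string holding ASCII text; bytearray exposes each byte as an integer.
name = b"Allison"
cell_values = list(bytearray(name))

# Pair each cell number with its value and the ASCII character it stands for.
rows = [(i, value, chr(value)) for i, value in enumerate(cell_values)]
```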

###Getting on the same (code) page

ASCII, as a standard, has a lot of virtues, which led to its wide adoption in the late 60s and early 70s. In 1968 U.S. President Lyndon B. Johnson declared that all federal computer systems must be ASCII-compliant. ASCII remains, to this day, the "default" character encoding for a wide variety of software and hardware systems.

But ASCII has a lot of problems, too. The most glaring problem is that the selection of characters is very limited: just the letters of the English alphabet, plus some basic typographical symbols. The standard was popular enough in the U.S. that computer manufacturers in other countries wanted to adopt the standard as well. But in order to represent text written in the language of those countries, those manufacturers needed to either use the standard creatively, by replacing characters in the existing set, or *extend* the standard by adding new characters to it.

Conveniently, ASCII actually only uses 128 characters---enough to fill seven bits, not eight (the eighth bit had previously been reserved for a [parity check](https://en.wikipedia.org/wiki/Parity_bit)). So each manufacturer (and sometimes each *company*) came up with ideas for how to use the remaining 128 values by making use of the spare bit.

These schemes varied *wildly* in which symbols they chose to represent and which numbers they assigned to these characters. Compare, for example, the non-ASCII portions of two popular encodings, [Windows 1252](http://static.decontextualize.com/snaps/cp1252-8thbit.png) (popularly but inaccurately known as Latin-1) and [MacRoman](http://static.decontextualize.com/snaps/macroman-8thbit.png) (used by Macintosh computers up until OSX). A lot of the same characters, but all represented by different numbers.
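To make the disagreement concrete, here's a sketch that reads one and the same byte under both encodings. It assumes Python's built-in codec names `cp1252` and `mac_roman`; the variable names are only for illustration:

```python
# One byte above the ASCII range (0xE9 = 233).
mystery_byte = b"\xe9"

# The same byte, read according to two different code pages.
as_cp1252 = mystery_byte.decode("cp1252")        # U+00E9, e with acute
as_macroman = mystery_byte.decode("mac_roman")   # U+00C8, E with grave

# Going the other way: the same character lands on two different numbers.
e_acute_cp1252 = u"\u00e9".encode("cp1252")
e_acute_macroman = u"\u00e9".encode("mac_roman")
```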

A general term for an extended ASCII encoding is "code page"; you can [see a list of known code pages](https://en.wikipedia.org/wiki/Code_page) on Wikipedia. As you can see, it's kind of a mess: dozens of encodings, sometimes multiple encodings for the same language and region. And the variety of code pages doesn't even encompass text encoding systems for the writing systems of languages like Japanese and Chinese, which can't easily and intuitively be enumerated in an 8-bit number space.

##Unicode to the rescue

In response to this explosion of code pages and conflicting text encoding standards, in 1987 a working group from Apple and Xerox started working on something called "[Unicode](https://en.wikipedia.org/wiki/Unicode)": a single character encoding that would encompass all characters in all written languages across the world. As this idea gained steam, the Unicode Consortium (with stakeholders from other organizations) was created in 1991, and since then the Consortium has been publishing versions of the Unicode Standard, which defines how Unicode works. The current (as of this writing) version of the Unicode standard is [Unicode 8.0.0](http://www.unicode.org/standard/standard.html).

Unicode is, at its core, just a big list of characters. [You can see the list of supported characters here](http://www.unicode.org/charts/). Each character is given a number; this number is known as its "code point." (There are units other than written characters that have "code points" as well, like combining characters, white space, control characters, etc. But the basic rule is one code point per character.) Unicode code points are often written with `U+` as a prefix, followed by a hexadecimal number, so you'll often see things like:

* `U+2192: RIGHTWARDS ARROW` (→)
* `U+0041: LATIN CAPITAL LETTER A` (A)
* `U+308F: HIRAGANA LETTER WA` (わ)

In total, the Unicode standard reserves room for over one million code points (1,114,112 of them, to be exact), and its catalog currently defines well over one hundred thousand characters.
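Python's standard `unicodedata` module knows the official names, so the `U+` notation is easy to reproduce. This is only a sketch; `examples` and `labels` are names I made up for the occasion:

```python
import unicodedata

# The three example characters from the list above, written as escapes.
examples = [u"\u2192", u"A", u"\u308f"]

# Format each one as "U+XXXX: NAME", with the code point in hexadecimal.
labels = ["U+%04X: %s" % (ord(ch), unicodedata.name(ch)) for ch in examples]
```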

Since its creation, Unicode has been widely adopted across the world, gradually supplanting ASCII and vendor-/nation-specific code pages as the standard for transmitting and storing text.

##Encoding Unicode

So Unicode is pretty great! One standard that associates a number with, like, every known character in every human writing system. Problem solved, right? Well, not quite. The problem with Unicode code points is that there are SO MANY OF THEM, which means that it's difficult to digitally represent Unicode in a space-efficient way.

To explain, let's go back to ASCII. There's a reason that ASCII was limited to 128 characters: that's just enough to have all common English characters while still comfortably fitting into the basic unit of computer storage: the byte. One character to one byte is very space-efficient and also very convenient from a programming standpoint. Underneath the hood, most low(ish)-level programming languages operate on bytes directly, which means that storing text data as bytes makes many operations (like iterating over each character in a string, or counting the number of characters in a string, or allocating new memory for a string) very easy.

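However, Unicode defines *too many characters* for each character to take up one byte exactly. In fact, if every character is to take up the same amount of space, you'd need *four* bytes of memory per character. (Three bytes would technically be enough to cover every code point, but computers handle four-byte units far more naturally; 4 bytes = 32 bits = 4294967296 possible values, more than enough for every Unicode code point.)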
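Here's the arithmetic as a quick sketch. It uses UTF-32, an encoding not otherwise covered in this tutorial, purely because it really does store every code point in four bytes; the variable names are illustrative:

```python
text = u"touch\u00e9"

# UTF-32 is a fixed-width encoding: four bytes per code point.
# ("utf-32-be" is used so that no byte-order mark is prepended.)
fixed_width = text.encode("utf-32-be")

# How many distinct values four bytes (32 bits) can hold.
values_in_four_bytes = 2 ** 32
```

Six characters become twenty-four bytes, even though five of the six are plain ASCII letters.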

This, on its face, is enough to make most programmers and system administrators balk. "So if I want to convert all of my data to Unicode, it'll use *quadruple* the amount of space?" they'd say. "And transmitting it over the Internet will take four times the bandwidth? I'll just stick with my code page mess, thank you very much."

###UTF-8

In order to ameliorate the bandwidth and storage problems of sending/storing Unicode characters as four bytes of data, a standard called "[UTF-8](https://en.wikipedia.org/wiki/UTF-8)" was invented. UTF-8 is a variable-length encoding that uses some clever tricks to store Unicode text cleanly and efficiently.

UTF-8 takes advantage of the fact that, because of some finagling in the Unicode standard, the original ASCII characters included in Unicode have the same integer value in ASCII as their Unicode code points. (E.g., 'A' in ASCII is value 0x41 but also Unicode code point U+0041.) In UTF-8, all characters whose code point numbers can be represented in fewer than 8 bits (i.e., all ASCII characters) simply end up in the string as their original numerical values---which means that any ASCII text is, byte for byte, also valid UTF-8 text. Characters with code points whose numbers are larger are encoded with a progressively larger number of bytes, with 4 bytes being the maximum number of bytes used to represent any character.

We won't go over the specifics of the way that UTF-8 represents Unicode text, or attempt to reproduce an algorithm to decode it. [But you can read more on Wikipedia](https://en.wikipedia.org/wiki/UTF-8#Examples).
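Even without the algorithm, you can watch the variable-length behavior directly. A minimal sketch, with sample characters I picked to land in each size class:

```python
samples = [
    u"A",           # U+0041, in the ASCII range
    u"\u00e9",      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    u"\u308f",      # U+308F, HIRAGANA LETTER WA
    u"\U0001F600",  # U+1F600, outside the first 65,536 code points
]

# Number of bytes UTF-8 uses for each sample.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text produces exactly the same bytes in both encodings.
ascii_text = u"bungalow"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```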

##What does this have to do with Python?

So, to summarize: text can be represented in many different ways. Some text encodings, like ASCII and extended ASCII code pages, represent text so that each character can fit into one byte. The Unicode standard can represent (potentially) ANY character, but there are certain encoding schemes needed to represent that data efficiently (i.e., with fewer than four bytes per character). UTF-8 is one such standard.

But what does any of this have to do with Python?

Recall the instructions we used for encoding, transmitting, and decoding Morse Code:

1. Start with original (written) text
2. "Encode" text as Morse Code signals
3. Send signals (e.g., over a telegraph wire)
4. "Decode" text at destination

In Python, we often get text from a variety of sources. That text comes pre-encoded as bytes. It's our job to find out what the encoding was, and then use the appropriate *decoding* procedure to turn that encoded text back into characters that we can work with.

To that end, Python has *two* data types that are used to represent strings of characters. The first is the *string*, which is the data type that you get by default when you, e.g., use a string literal. So, for example:

In [3]:
message = "bungalow"
type(message)

str

Above, I made a variable `message` that contains a value of type `str`. Uncontroversial, right? Behind the scenes, a Python string is actually a *sequence of bytes*. You can determine the integer value associated with a byte in a string with the `ord()` function:

In [31]:
ord(message[0])

98

With that in mind, here's a list comprehension that gives us the integer values for each character in the string:

In [47]:
[ord(item) for item in message]

[98, 117, 110, 103, 97, 108, 111, 119]

The `chr()` function is the inverse of `ord()`: it takes a number and returns the ASCII character for that number:

In [48]:
chr(116)

't'
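Because `ord()` and `chr()` undo each other, a round trip through both should give back the string you started with. A small sketch (`numbers` and `rebuilt` are illustrative names):

```python
message = "bungalow"

# ord() maps each character to its number...
numbers = [ord(ch) for ch in message]

# ...and chr() maps each number back, so the round trip rebuilds the string.
rebuilt = "".join(chr(n) for n in numbers)
```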

###Non-ASCII characters in Python strings

This should all seem pretty unproblematic so far. But what happens when we put a character into a Python string that *isn't* an ASCII value? Just as an example, let's put in `é` (in OSX, if you're using the default US keyboard layout, you can make `é` by hitting Option-E and then immediately hitting the "E" again):

In [43]:
french = "touché"
print french

touché


Okay... that looks fine. But there's something weird about this string:

In [45]:
len(french)

7

Seven? But... there are only six characters in "touché"! What's going on here?

We can find out by checking with `ord()` the value of each character in the string:

In [46]:
[ord(ch) for ch in french]

[116, 111, 117, 99, 104, 195, 169]

Doing some spot-checking with `chr()`, most of this seems right:

In [49]:
print chr(116)
print chr(111)
print chr(117)
print chr(99)
print chr(104)

t
o
u
c
h


But then we get to those last two numbers---195 and 169. Those don't look like ASCII characters (they're greater than 127), and in fact, they are not ASCII characters at all. The `chr()` function gives strange results when we try to pass these values in:

In [50]:
chr(195)

'\xc3'

In [52]:
chr(169)

'\xa9'

If we attempt to print either of these characters on their own, we get even stranger results:

In [53]:
print french[5]

�


In [55]:
print french[6]

�


What is going on here? It turns out that these two values (195, or hexadecimal 0xC3, followed by 169, or hexadecimal 0xA9) are how UTF-8 encodes the character `é`. IPython Notebook is smart enough to interpret your keystroke behind the scenes as UTF-8, and to display that data later as UTF-8, but the data in the string is weird: we expect `"touché"` to have six characters instead of seven.
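You can confirm the two-byte claim without relying on the keyboard or the notebook at all. This sketch spells the character as an escape and asks for its UTF-8 bytes explicitly (the variable names are mine):

```python
e_acute = u"\u00e9"  # the character, as a single code point

utf8_bytes = e_acute.encode("utf-8")
byte_values = list(bytearray(utf8_bytes))

# Neither byte is valid UTF-8 on its own, which is why printing
# french[5] or french[6] by itself produces a placeholder glyph.
try:
    utf8_bytes[:1].decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
```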

###The `unicode` type

It's often useful to work with character data on a byte-by-byte basis, especially when we're working on binary data from unknown sources. But as we've seen above, it can be a hassle when we're working with data that is purely textual.

Ideally what we'd like is a way in Python of working with non-ASCII strings where each index in the string corresponds with a *character*, not with a byte. Python supplies such a type: the `unicode` type. You can create a value of type `unicode` just like you make a value of type `str`: by putting some characters in between quotes. The only difference: to make a `unicode` value, put a `u` before the first quote in the string. For example:

In [57]:
french_str = "touché"
french_unicode = u"touché"
print type(french_str)
print type(french_unicode)

<type 'str'>
<type 'unicode'>


As you can see, the two expressions result in values of different types. The main difference between a `unicode` value and a `str` value is that a `unicode` value gets the length of the string correct:

In [58]:
print len(french_str)
print len(french_unicode)

7
6


In addition, the `unicode` value gives the correct results when we ask for the 6th (index 5) character in the string:

In [59]:
print french_unicode[5]

é


The `unicode` value gets it right because `unicode` values store *Unicode code points*, not bytes. When you're working with a `str`, you're manipulating the bytes of memory directly. When you're working with a `unicode` value, you're essentially manipulating a list of Unicode code points.
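To see the two views side by side, here's a sketch that builds both from the same text. It uses an explicit `.encode("utf-8")` call to get the bytes, so it doesn't depend on how your editor or notebook saves the `é`:

```python
text = u"touch\u00e9"        # code points (the `unicode` type in Python 2.7)
raw = text.encode("utf-8")   # bytes (the `str` type in Python 2.7)

# One number per character versus one number per byte.
code_points = [ord(ch) for ch in text]
byte_values = list(bytearray(raw))
```

The code point list ends in a single 233 (U+00E9), where the byte list ends in the pair 195, 169.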

##Converting raw bytes to Unicode

Even though `str` is the default type in Python, and Python tries its hardest to make `str` values easy to use, what you *almost always* want is a value of type `unicode`. Values of type `unicode` are better because there's no ambiguity about what the characters in the data mean.

The problem is that *not all of the data in the world is stored as Unicode* and even the data that *is* stored as Unicode can be stored in one of various Unicode encodings (the previously-discussed UTF-8 is the most common on the Internet, but there are several others). In your journeys as a data wrangler, you're likely to encounter all manner of poorly or strangely encoded data.

To make things worse: there isn't a good, fail-safe way to *automatically detect* which encoding a particular text uses. For this reason, most built-in Python functions, especially functions that work with files, return `str` values by default. It's up to you to then convert that data from `str` to `unicode`.

An example of a built-in function (from the standard library) that returns `str` values is `urllib.urlopen()`. (We've used this function previously to retrieve values from URLs on the web.) [This file](http://static.decontextualize.com/accent_names_latin1.txt) is a list of names with accents in it; I created this file and saved it with a Latin-1 encoding. (Latin-1 is a common extended ASCII code page.) Here's how to retrieve this data in Python:

In [61]:
import urllib
url = "http://static.decontextualize.com/accent_names_latin1.txt"
data = urllib.urlopen(url).read()
print type(data)

<type 'str'>


You can see that the value returned from reading this data is type `str`. Let's try to print it out:

In [62]:
print data

Em�lia,27
Ir�ne,22
V�in�,33
L�szl�,29
J�rgen,36


Disastrous! All of those lovely non-ASCII accents are displaying as weird question marks. But why? Let's examine the values of the bytes:

In [63]:
# get the first line of the text file
first_line = data.split("\n")[0]
[ord(x) for x in first_line]

[69, 109, 237, 108, 105, 97, 44, 50, 55]

The value `first_line` in the cell above has the string data from the first line of the text returned from `urlopen().read()`. (I used the `.split()` method to split the string into lines, and then grabbed the first item of the list.) You can see the integer values of the individual characters after breaking them up with `ord()`. The third character---what should be `í`---is represented by the number 237.

Now, 237 is *indeed* the number that corresponds to the character `í` in the Latin-1 encoding. The problem is that Python doesn't know that "Emília" is the string we want to end up with. The number 237 could mean anything, depending on which code page we're using. In the Mac OS Roman encoding, 237 means `Ì`, which is a perfectly valid interpretation. ("EmÌlia" could be like... an edgy electronic musician/DJ or something.)
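Here's a rough sketch of that ambiguity. The helper `try_decodings` is hypothetical (it isn't part of Python or any library); it simply tries a list of candidate encodings and keeps every reading that doesn't raise an error:

```python
# The bytes of the first line, as printed above.
raw = bytes(bytearray([69, 109, 237, 108, 105, 97, 44, 50, 55]))

def try_decodings(data, candidates):
    """Return {encoding: text} for every candidate that decodes without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results

readings = try_decodings(raw, ["utf-8", "latin1", "mac_roman"])
```

UTF-8 can be ruled out because the bytes aren't valid UTF-8, but Latin-1 and Mac OS Roman both succeed and disagree. Nothing in the bytes themselves says which reading is the intended one.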

So: we need a way to direct Python to interpret bytes according to a particular encoding, and then return a `unicode` value with the resulting interpretation. The way to do this is with the `.decode()` method of the `str` value. In its simplest form, the `.decode()` method takes a single argument, which is a string that indicates which encoding Python should use to interpret the data in the string:

In [66]:
decoded = data.decode('latin1')
print type(decoded)

<type 'unicode'>


The `.decode()` method evaluates to a `unicode` value where all of the strange numbers in the source text have been interpreted and turned into Unicode code points. The result looks like what we were expecting in the first place:

In [70]:
print decoded

Emília,27
Irène,22
Väinö,33
László,29
Jürgen,36


Now that you've got Unicode data, you can start doing all the fun stuff you want with the string (indexing by number, etc etc etc).

For a full list of encodings that `.decode()` supports, [consult this list in the Python documentation](https://docs.python.org/2/library/codecs.html#standard-encodings).
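It's worth knowing what happens when you guess wrong. `.decode()` also accepts an optional second argument, `errors`, that controls this. The sketch below feeds Latin-1 bytes to the UTF-8 decoder, first with the default behavior and then with `"replace"`:

```python
latin1_bytes = b"Em\xedlia"

# Decoding with the wrong codec raises an exception by default...
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# ...but "replace" substitutes U+FFFD for any undecodable byte.
lossy = latin1_bytes.decode("utf-8", "replace")
```

The `"replace"` mode is lossy: the original byte is gone from the result, so treat it as a way to inspect bad data rather than a way to fix it.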

###Interpreting UTF-8

UTF-8 is a way of representing Unicode strings, but that doesn't mean that Python automatically knows when some sequence of bytes contains UTF-8 data. You still need to convert a `str` that contains Unicode data to an actual `unicode` value.

[Here's another version](http://static.decontextualize.com/accent_names_utf8.txt) of the same data that we used above. This time, however, I encoded the data as UTF-8. We'll load it the same way we loaded the Latin-1 values, with `urllib`:

In [71]:
import urllib
url = "http://static.decontextualize.com/accent_names_utf8.txt"
data = urllib.urlopen(url).read()
print data

Emília,27
Irène,22
Väinö,33
László,29
Jürgen,36


This *looks* fine, but don't be deceived---underneath the hood, this is still a string of bytes:

In [75]:
print type(data)

<type 'str'>


... and if we were to get the first line of data, we'd see that it has the same string-length discrepancy that we saw with `"touché"`:

In [77]:
# get the first line of the text file
first_line = data.split("\n")[0]
print len(first_line)

10


Ten characters, when a quick visual check shows nine. The problem (as you can see below) is that the string contains raw UTF-8 bytes:

In [78]:
[ord(x) for x in first_line]

[69, 109, 195, 173, 108, 105, 97, 44, 50, 55]

How do we fix this? By using the `.decode()` method on the whole string. This time, because we know the original text is in UTF-8 format, we'll pass `'utf8'` as the argument:

In [79]:
decoded = data.decode('utf8')

In [80]:
first_line = decoded.split("\n")[0]
print len(first_line)

9


Now we get nine---the number we'd expect.
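A habit that follows from this: decode once, as soon as the data arrives, and do all of your processing on the decoded text. Here's a sketch using two in-memory records as a stand-in for the downloaded file (so it runs without a network connection); the `records` list is my own illustration:

```python
# Stand-in for the downloaded data: two records encoded as UTF-8 bytes.
data = u"Em\u00edlia,27\nIr\u00e8ne,22\n".encode("utf-8")

# Decode once, up front; everything after this works on characters.
decoded = data.decode("utf-8")

records = []
for line in decoded.splitlines():
    name, age = line.split(u",")
    records.append((name, int(age)))
```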

##Encoding `unicode` in other encodings

Once you have your textual data in a `unicode` string, you may occasionally need to write that data back out to disk, so you can share it with other people (or send it to another programming environment). To accomplish this, you can use the `.encode()` method of a `unicode` value. This method converts a `unicode` value to a byte string (`str`) using the indicated encoding.

For example, say you have a `unicode` value like this:

In [81]:
rad_name = u"Väinö"
print type(rad_name)

<type 'unicode'>


... and you want to convert this to a Latin-1 encoded string:

In [83]:
encoded = rad_name.encode('latin1')
print encoded
print [ord(x) for x in encoded]

V�in�
[86, 228, 105, 110, 246]


(You could then write this to a file with, e.g., the file object's `write` method.)
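Here's one way that might look, as a sketch. The temporary file is only a stand-in for whatever path you'd really write to, and opening the file in binary mode (`"wb"`) is my choice to make sure the bytes land on disk exactly as encoded:

```python
import os
import tempfile

rad_name = u"V\u00e4in\u00f6"
encoded = rad_name.encode("latin1")

# Write the encoded bytes in binary mode so nothing re-encodes them.
# (The temporary file is only a stand-in for a real output path.)
handle, path = tempfile.mkstemp()
os.close(handle)
with open(path, "wb") as outfile:
    outfile.write(encoded)

# Read the bytes back and decode them with the same encoding.
with open(path, "rb") as infile:
    round_tripped = infile.read().decode("latin1")

os.remove(path)
```

Whoever reads the file later has to know it's Latin-1, which is exactly the position we were in when we downloaded the names in the first place.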

The `.encode()` method supports all of the character encodings that `.decode()` supports. See above for a link to a full list.

##Summary

That's a lot of stuff! Let's summarize:

* A system for relating digital numbers with written characters is called a "character encoding." History is littered with dozens and dozens of such systems, all slightly different.
* ASCII is one of the most common; extensions to ASCII include character encodings like MacRoman and Latin-1.
* Unicode is an attempt to unify all of the world's character encodings.
* UTF-8 is a common, space-efficient way to represent Unicode characters as bytes.
* When you read data into Python, you usually get data in the form of the `str` data type---just a list of bytes. Python assumes the integer values of these bytes correspond to their ASCII equivalents, but it doesn't know (by default) about any other encodings.
* Python's `unicode` data type can hold strings that consist of any combination of Unicode characters.
* To get a `unicode` value, use the `.decode()` method of a `str` object. You'll need to figure out which encoding to use on your own (based on your own sleuth work about the data source in question).
* To convert a `unicode` value back to a list of bytes (so you can, e.g., save it to a file), use the `unicode` value's `.encode()` method.

##Further reading and resources

This document contains just the basics! There's much more to learn about character encoding and Unicode and how they operate in Python. Here's some reading to get you started.

* [A Brief History of Character Codes](http://tronweb.super-nova.co.jp/characcodehist.html) is a thorough and fascinating overview of how character encoding developed from the 18th century to present.
* [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) by Joel Spolsky is required reading, and goes into a bit more detail about many of the points discussed above.
* [Python Unicode HOWTO](https://docs.python.org/2/howto/unicode.html) from the official documentation is a good guide to how Unicode works in Python, including more detail about how to work with Unicode characters in source files and how to work with Unicode data on the command-line.
* [Unicode in Python, Completely Demystified](http://farmdev.com/talks/unicode/) is a fantastic presentation about common Unicode errors in Python 2.x, why they happen, and how to fix them.
* This document only describes how Unicode works in Python 2.7. The way that Python handles Unicode (and strings in general) changed significantly in Python 3. [Here's a good overview](http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/) of the changes. (Basically: there is no more separate `unicode` type; `str` values are all Unicode by default; a new `bytes` type fills in the gap left by the old byte-based `str`.)
* [chardet](https://pypi.python.org/pypi/chardet) is a `pip`-installable library that will make intelligent guesses about the encoding of a given stretch of text (the same way a web browser does).