# Mangle Data Like A Pro

## Encoding and Decoding

If all of the data processing you do is inside your Python modules, then you don't need to worry about encoding and decoding your string data since you will always be working in Unicode. However, when you export data or access data from an external source, like an external API, you will need to make sure that the external application can communicate data to your script properly.  This is where encoding schemes come into play. 

There are many encoding schemes out there, including ASCII and latin-1, which you may have seen before in other applications. UTF-8 is a standard encoding scheme that is widely used in many applications throughout the world. By encoding our strings using UTF-8, we maximize the chances that Linux applications, other Python applications, webpages, and so on will be able to properly read our data.

## How to encode strings to UTF-8

Let's first look at the built-in string methods that enable you to encode strings in Python.

The process of encoding takes a string and converts it to a sequence of bytes. Any external application that uses the same encoding will then be able to interpret every character in the string correctly.

For example, let's begin with an emoji character. Like the letters we're used to, this character has its own Unicode code point.

In [25]:
raw_char = '\u2603'
print(raw_char)

☃


As you would expect, the length of `raw_char` will be the number of characters in that string:

In [26]:
len(raw_char)

1

Now let's encode this character in UTF-8 encoding:

In [27]:
encoded_char = raw_char.encode('utf-8')
print(encoded_char)

b'\xe2\x98\x83'


Note the "b" prefix in front of the output. It designates that `encoded_char` is a sequence of bytes (instead of characters). Note also the three "\x" escape sequences, one per byte. Each is followed by the raw value of that byte in hexadecimal (base 16) notation, so "e2" is the value of the first byte in the sequence. If you were to see the variable `encoded_char` in memory it would look something like this:

E29883
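
To inspect those byte values directly, one option is to iterate over the bytes object or call its `hex` method. This is a minimal sketch that reuses the snowman character; the names `byte_values` and `hex_string` are illustrative only.

```python
raw_char = '\u2603'
encoded_char = raw_char.encode('utf-8')

# Iterating over a bytes object yields one integer (0-255) per byte
byte_values = list(encoded_char)
print(byte_values)   # [226, 152, 131]

# bytes.hex() renders each byte as two hexadecimal digits
hex_string = encoded_char.hex()
print(hex_string)    # e29883
```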

You might wonder why we need the encoding step when the original string, like everything else in a computer, was always stored in bytes. It's true that `raw_char` was stored in bytes, but the Python str is an abstraction based on Unicode characters. The bytes Python uses internally to represent a Unicode character are an implementation detail and may change over time. By contrast, after we encode the string, we have access to the actual byte values in memory. We can see what the byte values are, even manipulate them if we wish. Moreover, we can be confident that the byte values assigned by UTF-8 won't change, so the text we encode in this way can always be decoded correctly.
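
As a rough illustration of why the choice of encoding matters, the sketch below encodes the same character with three different encodings and prints the resulting bytes. The UTF-16 and UTF-32 variants are not discussed elsewhere in this section and are used here only for comparison.

```python
raw_char = '\u2603'

# One character, three encodings, three different byte sequences
utf8_bytes = raw_char.encode('utf-8')
utf16_bytes = raw_char.encode('utf-16-le')
utf32_bytes = raw_char.encode('utf-32-le')

for name, data in [('utf-8', utf8_bytes),
                   ('utf-16-le', utf16_bytes),
                   ('utf-32-le', utf32_bytes)]:
    # Show the encoding name, the bytes in hexadecimal, and the byte count
    print(name, data.hex(), len(data))
```

Whichever encoding is chosen, decoding with that same encoding returns the original character.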

Since we encoded our string, the len function will now count bytes instead of characters:

In [28]:
len(encoded_char)

3
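
The byte count depends on the character, because UTF-8 is a variable-width encoding. The following sketch compares a few sample characters; the list `samples` is chosen purely for illustration.

```python
# Four single characters from different parts of the Unicode range
samples = ['A', '\u00e9', '\u2603', '\U0001F600']

# Each is one character long, but UTF-8 needs 1 to 4 bytes to encode it
byte_lengths = [len(ch.encode('utf-8')) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```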

We can also confirm that encoded_char has type bytes.

In [29]:
type(encoded_char)

bytes

As mentioned earlier, there are other encoding schemes besides UTF-8.  In the next cell, we encode a character into ASCII.

In [30]:
ascii_char = 'A'
print(ascii_char.encode('ascii'))

b'A'


Notice that the printed output begins with a 'b', alerting us to the fact that we're looking at individual bytes. Instead of displaying the single byte in hexadecimal notation, however, we see a single character, A. In fact, whenever Python needs to print a byte value, it checks to see whether the value is printable in the ASCII encoding. If it is, Python displays the ASCII representation instead of hexadecimal, making the output a bit easier to read. Remember, though, that the 'A' in this output refers to an actual byte value in memory (\x41 in hexadecimal notation).
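
A quick way to convince yourself of this is to compare the two spellings of the same byte. The sketch below uses a hypothetical `mixed` value to show printable and non-printable bytes side by side.

```python
# The escape \x41 and the literal A denote the same single byte
print(b'\x41' == b'A')   # True
print(b'A'[0])           # 65, which is 0x41

# Printable ASCII bytes are shown as characters, everything else as \x escapes
mixed = bytes([0x41, 0xe2, 0x42])
print(mixed)             # b'A\xe2B'
```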

When encoding to a scheme with a limited character set, such as ASCII or latin-1, you need to make sure that the characters in your string are available in that particular encoding. If an encoding doesn't assign a sequence of bytes to a given character, you will get an error.

In [31]:
snowman_char = '\u2603'
print(snowman_char.encode('ascii'))

UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128)

The `snowman_char` could not be encoded because ASCII defines only 128 characters, and emojis are not among them.
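
If you would rather not let the exception stop your script, one straightforward approach is to catch it. The helper below, `encode_or_none`, is a hypothetical name used for illustration; it relies on the `object`, `start`, and `end` attributes that `UnicodeEncodeError` provides.

```python
def encode_or_none(text, encoding):
    """Return text encoded with the given encoding, or None if that fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as err:
        # err.object is the original string; start/end mark the offending span
        bad = err.object[err.start:err.end]
        print(f"Cannot encode {bad!r} at position {err.start}")
        return None

print(encode_or_none('A', 'ascii'))        # b'A'
print(encode_or_none('\u2603', 'ascii'))   # None, after the message above
```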

## Handling Encoding Errors

Sometimes you want a more elegant way to handle encoding errors; Python provides several options you can use.

### Ignore all characters that are not part of the encoding

Python can simply ignore all characters that do not have a representation in the specified encoding. Those characters are left out of the transformed output:

In [32]:
snowman_char.encode('ascii', 'ignore')

b''

### Replace all characters not part of the encoding with a "?"

You can also substitute a "?" every time the string contains a character that the encoding cannot represent:

In [33]:
snowman_char.encode('ascii', 'replace')

b'?'

### Escape all characters not part of the encoding

When an encoding such as ASCII has no representation for a Unicode character, you might want to replace it with a backslash escape sequence: a backslash followed by the character's Unicode code point. This representation can be understood by some systems.

In [34]:
snowman_char.encode('ascii', 'backslashreplace')

b'\\u2603'

### Produce an XML-friendly string

You can also produce an XML-friendly version of the string, in which each unsupported character becomes a numeric character reference. This will be interpreted correctly by most web browsers.

In [35]:
snowman_char.encode('ascii', 'xmlcharrefreplace')

b'&#9731;'
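
To compare the four handlers side by side, the sketch below applies each of them to a string that mixes ASCII text with the snowman. The sample string `text` and the `results` dictionary are illustrative names only.

```python
text = 'snow \u2603'
handlers = ['ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace']

# Encode the same string to ASCII once per error handler
results = {h: text.encode('ascii', errors=h) for h in handlers}

for handler, encoded in results.items():
    print(f"{handler:18} {encoded}")
```

In every case the ASCII part of the string is preserved unchanged; only the snowman is dropped or substituted.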

## Decoding

Just as we want to be able to send data to external applications accurately, we also want to be able to grab data from external sources without errors. We do this by decoding the data. It is important that we know what encoding was used to encode the data or we may not be able to properly decode the string.

As an example, let's encode another message that includes our snowman character:

In [36]:
my_msg = "Can't wait for the snow to come, going to make my first snowman \u2603\u2603!"
print(my_msg)
type(my_msg)

Can't wait for the snow to come, going to make my first snowman ☃☃!


str

We will encode this message in UTF-8, which will convert the string into a sequence of bytes:

In [37]:
my_msg_bytes = my_msg.encode('utf-8')
print(my_msg_bytes)
type(my_msg_bytes)

b"Can't wait for the snow to come, going to make my first snowman \xe2\x98\x83\xe2\x98\x83!"


bytes

Notice again that only the snowman emojis are printed in raw byte format. The other bytes are conveniently displayed in ASCII format, since they correspond to printable ASCII characters.

Now let's imagine that the bytes object my_msg_bytes is actually data that we received from an external application. We would like to decode it back to a string variable.  We can do that using the decode method:

In [38]:
my_msg_decoded = my_msg_bytes.decode('utf-8')
print(my_msg_decoded)
type(my_msg_decoded)

Can't wait for the snow to come, going to make my first snowman ☃☃!


str

As expected, we get our original string with no errors.

Of course, if we were to use another encoding to decode our data, we would get either garbled text or an outright error.

In [40]:
my_msg_decoded_wrong = my_msg_bytes.decode('latin-1')
print(my_msg_decoded_wrong)

Can't wait for the snow to come, going to make my first snowman ââ!
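
The latin-1 decode raises no exception because latin-1 assigns a character to every possible byte value, so each snowman's three bytes become three unrelated characters (two of them invisible control characters). The sketch below, using a shortened sample string, shows the effect and also that this particular mix-up can be undone, since latin-1 maps bytes and characters one to one.

```python
original = 'snowman \u2603'
utf8_bytes = original.encode('utf-8')

# Wrong codec, but no error: every byte becomes exactly one character
garbled = utf8_bytes.decode('latin-1')
print(len(original), len(garbled))   # 9 11

# Re-encoding with latin-1 recovers the original bytes, which we can
# then decode correctly as UTF-8
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == original)          # True
```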


In [22]:
my_msg_decoded_wrong = my_msg_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 64: ordinal not in range(128)
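
The `decode` method accepts an error handler in the same way `encode` does. As a minimal sketch with a shortened sample message, `'replace'` substitutes the Unicode replacement character (U+FFFD) for each undecodable byte, and `'ignore'` drops those bytes:

```python
data = 'snow \u2603'.encode('utf-8')

# Each of the snowman's three bytes is invalid in ASCII
replaced = data.decode('ascii', errors='replace')
ignored = data.decode('ascii', errors='ignore')

print(replaced.count('\ufffd'))   # 3
print(repr(ignored))              # 'snow '
```

Both handlers lose information, so they are best kept for cases where an approximate result is acceptable.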

Remember to use UTF-8 when possible. It is the closest thing we have to a global encoding standard.

Here's more [history on encoding](http://www.joelonsoftware.com/articles/Unicode.html) if you are interested.

At this point, you know the basics of encoding and decoding, so you can send and grab information from external applications. In the next section, we'll consider techniques for creating well-formatted strings from our variables.