# Mangle Data Like A Pro

## Encoding and Decoding

If all of the data processing you are doing is within your Python modules, then you don't need to worry about encoding and decoding your string data, since you will always be working in Unicode. However, when you are exporting data or accessing data from an external source, such as an external API, you will want to make sure that your script and the external application can exchange data correctly.

This is where encoding schemes come into play. There are several encoding schemes, such as ASCII and latin-1, which you may have seen before in other applications. UTF-8 is a standard encoding scheme that is widely used throughout the world. So if we encode our strings using UTF-8, we can be reasonably confident that other Linux applications, other Python applications, webpages, and so on will be able to read our data properly.
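As a quick preview of why the choice of scheme matters, here is a minimal sketch showing that the same character becomes different bytes under latin-1 and UTF-8. The sample character and the variable names are chosen purely for illustration.

```python
# The same character maps to different bytes under different encoding schemes.
text = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = text.encode('latin-1')  # one byte for this character
utf8_bytes = text.encode('utf-8')      # two bytes for this character

print(latin1_bytes)  # b'\xe9'
print(utf8_bytes)    # b'\xc3\xa9'

# Decoding with the same scheme that was used to encode recovers the text.
print(latin1_bytes.decode('latin-1') == text)  # True
print(utf8_bytes.decode('utf-8') == text)      # True
```

The reader of the bytes therefore has to know which scheme the writer used, which is the reason to agree on one widely supported scheme.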


## How to encode strings to UTF-8

Let's first look at the built-in `encode` method that enables you to encode strings in Python.

The process of encoding takes a string and converts it to a sequence of bytes. This ensures that the external application reading the data will not run into any platform-specific or application-specific representation that could prevent it from decoding the data properly.

For example, we are going to encode an emoji into UTF-8:

In [8]:
raw_char = '\u2603'
print(raw_char)

☃


As you would expect, the length of `raw_char` is the number of "characters" in that string:

In [9]:
len(raw_char)

1

Now let's encode this character using UTF-8:

In [10]:
encoded_char = raw_char.encode('utf-8')
print(encoded_char)

b'\xe2\x98\x83'


Note the "b" in front of the opening quote. It designates that `encoded_char` is a sequence of bytes (instead of characters). Note also the three "\x" escape sequences. Each one gives the raw value of one byte in hexadecimal, so "\xe2" means that the first byte in the sequence has the value "e2". If you were to look at the variable `encoded_char` in memory, it would look something like this (in hexadecimal format):

E29883

Now the length of encoded_char will count bytes instead of characters,

In [11]:
len(encoded_char)

3

since `encoded_char` is a `bytes` object.

In [12]:
type(encoded_char)

bytes
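To inspect the raw byte values yourself, the following sketch uses the `hex()` method and integer indexing on a `bytes` object. The second string, `mixed`, is an extra sample added here for illustration.

```python
encoded_char = '\u2603'.encode('utf-8')

# hex() returns the bytes as a string of hexadecimal digits.
print(encoded_char.hex())  # e29883

# Iterating over or indexing a bytes object yields integers from 0 to 255.
print(list(encoded_char))  # [226, 152, 131]
print(encoded_char[0])     # 226

# Character count and byte count differ as soon as a character needs
# more than one byte in UTF-8.
mixed = 'snow \u2603'
print(len(mixed))                  # 6
print(len(mixed.encode('utf-8')))  # 8
```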

As mentioned earlier, there are other encoding schemes. However, you need to make sure that every character in the string you are trying to encode exists in that particular encoding, or you will run into problems such as the following:

In [13]:
ascii_char = 'A'
print(ascii_char.encode('ascii'))

b'A'


In [14]:
snowman_char = '\u2603'
print(snowman_char.encode('ascii'))

UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128)

The `snowman_char` could not be encoded because ASCII only defines 128 characters (code points 0 through 127), and the snowman (U+2603) is not one of them.
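If you would rather catch the failure than let it stop your script, the exception object carries the details. The sketch below is one way to surface them; `describe_encode_error` is a hypothetical helper name, not something Python provides, while the attributes it reads (`encoding`, `start`, `end`, `object`, `reason`) are standard on `UnicodeEncodeError`.

```python
def describe_encode_error(text, encoding):
    """Return details about the first encoding failure, or None if text encodes cleanly."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return {
            'encoding': err.encoding,
            'start': err.start,
            'end': err.end,
            'bad_text': err.object[err.start:err.end],
            'reason': err.reason,
        }
    return None


details = describe_encode_error('\u2603', 'ascii')
print(details['encoding'])               # ascii
print(details['start'], details['end'])  # 0 1
print(ascii(details['bad_text']))        # '\u2603'
print(details['reason'])                 # ordinal not in range(128)

# A string that fits in the encoding produces no error details.
print(describe_encode_error('A', 'ascii'))  # None
```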

## Handling Encoding Errors

Sometimes you want a more graceful way to handle encoding errors than raising an exception. Python gives you a few choices, selected through the second argument to `encode`:

### Ignore all characters that are not part of the encoding

Python can simply ignore all characters that do not have a representation in the specified encoding. Those characters are left out of the encoded output:

In [15]:
snowman_char.encode('ascii', 'ignore')

b''

### Replace all characters not part of the encoding with a "?"

You can also substitute a "?" every time the string contains a character that the encoding cannot represent:

In [16]:
snowman_char.encode('ascii', 'replace')

b'?'

### Escape all characters not part of the encoding

You can also keep the information about the character by replacing it with its backslash-escaped Unicode code point, written the same way as in a Python string literal:

In [17]:
snowman_char.encode('ascii', 'backslashreplace')

b'\\u2603'

### Produce an XML friendly string

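You can even produce an XML friendly version of the string, in which each unencodable character is replaced by a numeric character reference (9731 is the decimal form of hexadecimal 2603):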

In [18]:
snowman_char.encode('ascii', 'xmlcharrefreplace')

b'&#9731;'
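To compare the four handlers side by side, the following sketch applies each of them to the same sample string. The string `text` and the dictionary `results` are illustrative names.

```python
text = 'snow \u2603'
handlers = ['ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace']

# Encode the same string to ASCII once per error handler.
results = {name: text.encode('ascii', errors=name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)

# Expected values:
#   ignore             b'snow '
#   replace            b'snow ?'
#   backslashreplace   b'snow \\u2603'
#   xmlcharrefreplace  b'snow &#9731;'
```

Only `backslashreplace` and `xmlcharrefreplace` preserve enough information to recover the original character later, so they are the safer choice when the data must not be lost.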

## Decoding

Just as we want to send data to external applications accurately, we also want to read data from external sources faithfully. We do this by decoding the data. It is important to know beforehand which encoding was used; otherwise, in many cases we will not be able to decode the data properly.

For example, let's take our snowman char:

In [19]:
my_msg = "Can't wait for the snow to come, going to make my first snowman \u2603\u2603!"
print(my_msg)
type(my_msg)

Can't wait for the snow to come, going to make my first snowman ☃☃!


str

Now we will encode this message in UTF-8, which will convert the string into a sequence of bytes:

In [20]:
my_msg_bytes = my_msg.encode('utf-8')
print(my_msg_bytes)
type(my_msg_bytes)

b"Can't wait for the snow to come, going to make my first snowman \xe2\x98\x83\xe2\x98\x83!"


bytes

Notice how only the snowman emoji are shown as raw byte escapes. That is because the rest of the message consists of ASCII characters, each of which UTF-8 encodes as a single byte with the same value it has in ASCII. When Python displays a bytes object, it shows every byte that corresponds to a printable ASCII character as that character, and every other byte as a "\x" escape.
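A small sketch makes this display rule concrete. The variable names are illustrative.

```python
# Bytes in the printable ASCII range are displayed as characters.
ascii_bytes = b'\x41\x42\x43'
print(ascii_bytes)          # b'ABC'
print(ascii_bytes == b'ABC')  # True

# Bytes outside that range are displayed as \x escapes.
snowman_bytes = bytes([0xe2, 0x98, 0x83])
print(snowman_bytes)        # b'\xe2\x98\x83'
```

The display form is only a convenience: in both cases the object holds plain byte values, not characters.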

Now let's treat my_msg_bytes as data that we received from an external application. We would like to decode it back to a string variable:

In [21]:
my_msg_decoded = my_msg_bytes.decode('utf-8')
print(my_msg_decoded)
type(my_msg_decoded)

Can't wait for the snow to come, going to make my first snowman ☃☃!


str

And we get our original string!

If we were to decode with a different encoding, we would run into the same kind of problem as we did earlier:

In [22]:
my_msg_decoded_wrong = my_msg_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 64: ordinal not in range(128)
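The `decode` method accepts an error handler as well, and a wrong encoding does not always raise an error. The sketch below illustrates both points using the snowman bytes from earlier; the variable names are illustrative.

```python
snowman_bytes = b'\xe2\x98\x83'

# 'replace' substitutes U+FFFD (the replacement character) for each
# byte that the ASCII codec cannot decode.
replaced = snowman_bytes.decode('ascii', errors='replace')
print(replaced == '\ufffd\ufffd\ufffd')  # True

# 'ignore' drops the undecodable bytes entirely.
ignored = snowman_bytes.decode('ascii', errors='ignore')
print(repr(ignored))  # ''

# latin-1 assigns a character to every possible byte value, so decoding
# succeeds silently but yields three wrong characters instead of one snowman.
garbled = snowman_bytes.decode('latin-1')
print(len(garbled))           # 3
print(garbled == '\u2603')    # False
```

The last case is the more dangerous one, because nothing signals that the text is wrong. That is why knowing the real encoding matters more than suppressing errors.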

Use UTF-8 as much as you possibly can. It is widely supported and is quick to encode and decode.
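When the external source is a file, one common way to apply this advice is to pass the encoding explicitly to `open` rather than relying on the platform default. This is a minimal sketch; the file name `message.txt` and the temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

message = 'first snowman \u2603'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'message.txt')

    # Text mode with an explicit encoding: Python encodes on write.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(message)

    # Binary mode shows the bytes that actually landed on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the same encoding: Python decodes on read.
    with open(path, encoding='utf-8') as f:
        round_tripped = f.read()

print(raw)                       # b'first snowman \xe2\x98\x83'
print(round_tripped == message)  # True
```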

Here's more [history on encoding](http://www.joelonsoftware.com/articles/Unicode.html) if you are interested.

So now we've covered encoding and decoding and you can send and grab information from external applications properly. In the next section we are going to go into text formatting and how you can easily create strings from variables and specify precision and alignment.