# Mangle Data Like a Pro

## Python 3 Unicode Strings

One of the major benefits of Python 3 is that all strings are Unicode strings. By contrast, strings in Python 2 were stored as arrays of bytes. While these worked well for ASCII characters, programmers often ran into errors and frustration as soon as other characters came into play. The shift to Unicode is important because it means that every string in Python 3 can represent any character or symbol defined by the Unicode standard without additional processing.

Let's look at how Python 3 handles Unicode strings:

### Translation between Unicode names, IDs, and values

Every Unicode character has a standard name, an ID (its code point), and, of course, a value. You can find the standard name for a Unicode character in the [Unicode Character Name Index](http://www.unicode.org/charts/charindex.html) and its ID in the [Unicode Code Charts page](http://www.unicode.org/charts/).
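
As a quick sketch of how these three pieces relate, the built-ins `ord()` and `chr()` convert between a character and its integer ID, and `unicodedata.name()` returns the standard name. The variable names here are only illustrative.

```python
import unicodedata

char = "B"
code_point = ord(char)              # character -> integer ID (code point)
code_id = "U+%04X" % code_point     # conventional "U+XXXX" notation
name = unicodedata.name(char)       # character -> standard name

print(code_id, name)                # U+0042 LATIN CAPITAL LETTER B
print(chr(code_point) == char)      # True: integer ID -> character
```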

Let's define a function that demonstrates the relationship between a character's Unicode ID, standard name, and value:

In [1]:
import unicodedata

def unicode_test(value):
    name = unicodedata.name(value)
    print("value=%s, name=%s" % (value, name))
    
unicode_test("B")
unicode_test("\u0042")

value=B, name=LATIN CAPITAL LETTER B
value=B, name=LATIN CAPITAL LETTER B


The function `unicode_test` prints the value of the character we passed it along with its standard Unicode name. Note that passing `"B"` and passing `"\u0042"` yield the same result, because they are the same character: the Unicode character for `"B"`. The `\u` escape sequence, followed by four hexadecimal digits, specifies a character by its Unicode ID. If you really wanted to, you could use Unicode IDs to represent every character you enter. Consider the following code:

In [2]:
print("Great!")
print("\u0047\u0072\u0065\u0061\u0074\u0021")

Great!
Great!


In [3]:
regular_string = "Great!"
unicode_id_string = "\u0047\u0072\u0065\u0061\u0074\u0021"
print(regular_string == unicode_id_string)

True


Because Python 3 represents all strings as Unicode characters, you could theoretically substitute every character with its escaped Unicode ID representation and the resulting string would be equivalent.

This is useful when you want to represent certain characters that are hard or impossible to type using your keyboard:

In [4]:
print("This has a\u000Anewline in the middle of the sentence")

This has a
newline in the middle of the sentence


Note that `"\u000A"` could also be represented by the shortcut `"\n"`.

In [5]:
print("This has a\nnewline in the middle of the sentence")

This has a
newline in the middle of the sentence


Now let's look at the standard Unicode name. It turns out that we can use the standard name in much the same way as the Unicode ID:

In [6]:
unicode_test("B")
unicode_test("\u0042")
unicode_test("\N{LATIN CAPITAL LETTER B}")

value=B, name=LATIN CAPITAL LETTER B
value=B, name=LATIN CAPITAL LETTER B
value=B, name=LATIN CAPITAL LETTER B


We can look up a Unicode character by its standard name by using the escape sequence `\N{...}`, with the standard name between the braces.
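
The `\N{...}` escape only works inside a string literal. When the name is only known at run time, `unicodedata.lookup()` does the same job. The sketch below wraps it in a hypothetical helper, `char_from_name`, which is not part of the original notebook and falls back to a default when the name is unknown.

```python
import unicodedata

def char_from_name(name, default="?"):
    """Hypothetical helper: look up a character by its standard name at run time."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # unicodedata.lookup raises KeyError for names it does not recognize
        return default

print(char_from_name("LATIN CAPITAL LETTER B"))     # B
print(char_from_name("NOT A REAL CHARACTER NAME"))  # ?
```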

### Representing hard-to-type characters in Python 3

Depending on the system that you are developing for, you may be able to cut and paste a hard-to-type character right into the code that you are trying to use:

In [7]:
place = 'café'
print(place)

café


This worked on my computer because I copied from an application using the UTF-8 encoding, but it may not work for you if your editor or terminal uses a different encoding. To avoid worrying about encoding, you can use the Unicode ID to represent the character:

In [8]:
place = 'caf\u00e9'
print(place)

café
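
One caveat worth sketching: `\u` takes exactly four hexadecimal digits, so characters with IDs above `U+FFFF` need the eight-digit `\U` escape instead. The snake character is just an illustrative choice, and the example avoids printing it directly in case your terminal cannot display it.

```python
import unicodedata

# \u takes exactly four hex digits; \U takes exactly eight.
snake = "\U0001F40D"

print(unicodedata.name(snake))    # SNAKE
print(snake == "\N{SNAKE}")       # True
```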


### The len() function and Unicode

Also note that the `len()` function counts Unicode characters rather than bytes:

In [9]:
len('café')

4

In [10]:
len('caf\u00e9')

4

Python 3 counts each Unicode character (code point) as one unit of length, regardless of how many bytes are used to store it.
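
To make the difference between characters and bytes concrete, here is a minimal sketch comparing `len()` on a string with `len()` on its UTF-8 bytes. It also shows one caveat: an accented letter written as a base letter plus a combining accent counts as two characters until it is normalized. The variable names are illustrative only.

```python
import unicodedata

precomposed = "caf\u00e9"     # é as a single code point
combining = "cafe\u0301"      # plain e followed by COMBINING ACUTE ACCENT

print(len(precomposed))                               # 4 characters
print(len(precomposed.encode("utf-8")))               # 5 bytes, since é takes two
print(len(combining))                                 # 5 characters
print(len(unicodedata.normalize("NFC", combining)))   # 4 after composing
```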

We've introduced how Python stores strings in Unicode format and how we can use Unicode standard names and IDs to represent characters. In the next section, we will look into encoding and decoding our string data as UTF-8 to ensure that any data we pass outside our application is translated properly.