# Mangle Data Like a Pro

## Python 3 Unicode Strings

We just learned that Python 3 strings are Unicode strings, rather than the byte arrays that Python 2 used. This matters because every Python string can, in principle, represent any character or symbol in the Unicode standard without additional processing.

Let's look at how Python 3 handles Unicode strings:

### Translation between Unicode names, IDs, and values

Every Unicode character has a standard name, an ID (its code point), and, of course, a value. You can find the standard name for a Unicode character in the [Unicode Character Name Index](http://www.unicode.org/charts/charindex.html) and its ID in the [Unicode Code Charts page](http://www.unicode.org/charts/).

Let's define a function that demonstrates the relationship between a character's Unicode ID, its standard name, and its value:

In [1]:
import unicodedata

def unicode_test(value):
    name = unicodedata.name(value)
    print("value=%s, name=%s" % (value, name))
    
unicode_test("B")
unicode_test("\u0042")

value=B, name=LATIN CAPITAL LETTER B
value=B, name=LATIN CAPITAL LETTER B


The function `unicode_test` prints the value of the character we pass to it, along with its standard Unicode name. Note that passing `"B"` and passing `"\u0042"` yield the same result. That is because `"B"` and `"\u0042"` are the same character, namely the Unicode character for `"B"`. The `"\u"` escape sequence lets you specify a character by its Unicode ID as four hexadecimal digits. You can use Unicode IDs to represent any character if you want to, like the following:

In [2]:
print("Great!")
print("\u0047\u0072\u0065\u0061\u0074\u0021")

Great!
Great!


In [3]:
regular_string = "Great!"
unicode_id_string = "\u0047\u0072\u0065\u0061\u0074\u0021"
print(regular_string == unicode_id_string)

True


Because Python 3 represents every string as a sequence of Unicode characters, you could theoretically substitute each character with its escaped Unicode ID and the resulting string would be equivalent.
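To make that substitution concrete, here is a minimal sketch that builds the escaped form of a string. The helper name `to_unicode_escapes` is hypothetical (it is not part of the standard library), and it assumes every character fits in four hexadecimal digits, which is all the `\u` escape can hold.

```python
def to_unicode_escapes(text):
    # Hypothetical helper: write each character as a \uXXXX escape.
    # Assumes code points up to U+FFFF; larger ones need the
    # eight-digit \UXXXXXXXX form, which this sketch does not handle.
    return "".join("\\u%04x" % ord(ch) for ch in text)

escaped = to_unicode_escapes("Great!")
print(escaped)

# Interpreting the escapes again gives back the original string.
round_trip = escaped.encode("ascii").decode("unicode_escape")
print(round_trip == "Great!")
```

The built-in `ord()` returns a character's Unicode ID as an integer, and the `unicode_escape` codec is one way to turn the escaped text back into the characters it names.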

Escape sequences are useful when you want to represent characters that are hard or impossible to type:

In [4]:
new_line = "\u000A"
print("This has a\u000Anewline in the middle of the sentence")

This has a
newline in the middle of the sentence


Note that `"\u000A"` can also be written with the shortcut `"\n"`, since both escape sequences produce the same Unicode character:

In [5]:
print("This has a\nnewline in the middle of the sentence")

This has a
newline in the middle of the sentence


Now let's look at the standard name. We can specify a character by its name, just as we did with its ID:

In [6]:
unicode_test("B")
unicode_test("\u0042")
unicode_test("\N{LATIN CAPITAL LETTER B}")

value=B, name=LATIN CAPITAL LETTER B
value=B, name=LATIN CAPITAL LETTER B
value=B, name=LATIN CAPITAL LETTER B


We can specify a Unicode character by its standard name with the escape sequence `"\N{standard name}"`.
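The `\N{...}` escape only works inside a string literal. When the name is only known at run time, the `unicodedata` module offers `lookup()`, which is the inverse of the `name()` function used in `unicode_test`. A small sketch of the round trip between name, value, and ID:

```python
import unicodedata

# Name -> value
char = unicodedata.lookup("LATIN CAPITAL LETTER B")
print(char)

# Value -> name
print(unicodedata.name(char))

# Value -> ID (as an integer), and ID -> value
code_point = ord(char)
print("U+%04X" % code_point)
print(chr(code_point) == "\N{LATIN CAPITAL LETTER B}")
```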

### Representing hard-to-type characters in Python 3

Depending on the system you are developing on, you may be able to copy and paste a hard-to-type character directly into your code:

In [7]:
place = 'café'
print(place)

café


In my case this worked because I'm on a computer that uses UTF-8 encoding, but it may not have worked for you. To avoid depending on how your editor or source file is encoded, you can use the Unicode ID to represent the character:

In [8]:
place = 'caf\u00e9'
print(place)

café


### The len() function and Unicode

Also note that the `len()` function counts Unicode characters, not bytes:

In [9]:
len('café')

4

In [10]:
len('caf\u00e9')

4

This is key to understanding that Python 3 treats every string as a sequence of Unicode characters, regardless of how many bytes are used to store each character.

We've introduced how Python stores strings as Unicode and how we can use Unicode standard names and IDs to represent characters. In the next section we will look at encoding and decoding our string data as UTF-8, so that any data we pass outside our application is translated properly.
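As a brief preview, and only as a sketch, the difference between characters and bytes shows up as soon as the string is encoded. UTF-8 is assumed here as the encoding:

```python
place = "caf\u00e9"

# Encoding turns the string into a bytes object.
utf8_bytes = place.encode("utf-8")

print(len(place))       # number of Unicode characters
print(len(utf8_bytes))  # number of bytes after encoding
print(utf8_bytes)
```

The string has 4 characters, but its UTF-8 form is 5 bytes long because `é` takes two bytes in that encoding.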