In [None]:
%load_ext tutormagic

# Example: Strings

Strings are objects. Thus, strings have attributes.

A string has the method `upper`, which converts it to uppercase.

In [13]:
s = 'Hello'
s.upper()

'HELLO'

We can lowercase it,

In [14]:
s.lower()

'hello'

Or even swapcase it,

In [15]:
s.swapcase()

'hELLO'

These methods don't change `s` itself.

In [16]:
s

'Hello'

Each of these methods returns a new string derived from the original. Methods behave like ordinary functions, except that they belong to a particular type of object, here strings.
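As a small sketch of what "returns a new string" means in practice (the variable names `greeting` and `shouted` are only illustrative), we can keep the result under a new name, or rebind the old name to it:

```python
greeting = 'Hello'

# upper() builds and returns a new string; greeting is untouched.
shouted = greeting.upper()
print(shouted)    # HELLO
print(greeting)   # Hello

# To "update" the variable, rebind the name to the returned string.
greeting = greeting.upper()
print(greeting)   # HELLO
```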

Strings are an abstraction that lets us represent text. Underneath, each character in a string is encoded as a number.

## Representing Strings: the ASCII Standard

#### American Standard Code for Information Interchange

<img src = 'ASCII.jpg' width = 600/>

This encoding is shared among many programming languages, not just Python. 

ASCII was one of the first standards to take hold in computing. This table has 8 rows, which matters because 8 is the number of distinct rows we can represent using 3 bits (3 `ones and zeros`).

And there are 16 columns, which can be represented using 4 bits. Together, every ASCII character fits in 7 bits.

The layout was chosen to support sorting by character code: most computer systems sort capital letters before lowercase letters, and put names starting with an exclamation mark `!` at the top and names starting with a tilde `~` at the end.
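The sorting behavior can be checked directly. This is a minimal sketch with a made-up list of words; Python's `sorted` compares strings by character code, so the order follows the ASCII table:

```python
# Hypothetical sample words, chosen to start with '!', 'A', 'b', and '~'.
words = ['banana', '~tilde', 'Apple', '!bang']

# Strings are compared character by character using their codes.
ordered = sorted(words)
print(ordered)

# The code of each first character explains the order.
codes = [ord(word[0]) for word in ordered]
print(codes)
```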

If we didn't have enough bits to represent the entire table, the middle subset (rows 2-5, highlighted green) would still be useful on its own. We would have to type in all caps, but that is better than nothing.

The control characters at the top (highlighted red) were designed for transmission (the "information interchange" in the name). Most of their original meanings are no longer used, although a few, such as `line feed` and `bell`, still are.
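As a quick sketch, Python writes these two control characters with the escape sequences `\a` (bell) and `\n` (line feed). The variable names below are only illustrative:

```python
# Escape sequences for two control characters that are still in use.
bell = '\a'
line_feed = '\n'

for label, ch in [('bell', bell), ('line feed', line_feed)]:
    # repr shows the escape sequence instead of emitting the character.
    print(label, repr(ch), ord(ch), hex(ord(ch)))
```

Both codes are below 32, which places them among the control characters at the top of the table.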

For example, we can see what number corresponds to `A`.

In [17]:
ord('A')

65

Or in hexadecimal format,

In [18]:
hex(ord('A'))

'0x41'

The hexadecimal digits `41` mean row `4`, column `1` in the ASCII table.
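The row and column can also be computed directly. The helper below is a hypothetical sketch (`ascii_position` is not a built-in); it assumes rows and columns are counted from zero, so the row is the upper 3 bits of the code and the column is the lower 4 bits:

```python
def ascii_position(ch):
    """Return (row, column) of a character in the 8 x 16 ASCII table."""
    code = ord(ch)
    if code > 127:
        raise ValueError('not an ASCII character')
    # Dividing by 16 splits the 7-bit code into 3 high bits and 4 low bits.
    return divmod(code, 16)

print(ascii_position('A'))   # (4, 1), matching hex 0x41
print(ascii_position('a'))   # (6, 1), matching hex 0x61
print(ascii_position('~'))   # (7, 14), matching hex 0x7e
```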

We can print the line feed character multiple times,

In [19]:
print('\n\n\n')







If we print the bell character multiple times, the computer should make a sound, although many modern terminals and notebooks ignore it.

In [20]:
print('\a\a\a')




The ASCII standard is specific to English. The Unicode standard was designed to provide a single character set that can be used for different languages.

## Representing Strings: The Unicode Standard

Unicode aims to assign a number, called a code point, to every character of every script in every language in the world. Below is an example of a snippet of Chinese characters represented in Unicode:

<img src = 'unicode.jpg' width = 500/>

Unicode:

1. Has 109,000 characters
2. Has 93 organized scripts
3. Has enumeration of character properties, such as case
4. Supports bidirectional display order (read from the right or left)
5. Has a canonical name for every character
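Several of these properties can be inspected from Python. The sketch below uses `name`, `category`, and `bidirectional` from the standard `unicodedata` module; the sample characters, including the Hebrew letter written as the escape `\u05d0`, are chosen only for illustration:

```python
from unicodedata import bidirectional, category, name

# Two Latin letters and one Hebrew letter (U+05D0).
samples = ['A', 'a', '\u05d0']

for ch in samples:
    # category: 'Lu' = uppercase letter, 'Ll' = lowercase letter,
    # 'Lo' = letter without case.
    # bidirectional: 'L' = left-to-right, 'R' = right-to-left.
    print(ascii(ch), name(ch), category(ch), bidirectional(ch))
```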

Python strings are already Unicode. To look up information about individual characters, we import functions from the `unicodedata` module.

In [21]:
from unicodedata import name, lookup

`name` returns the canonical name of a Unicode character.

In [22]:
name('A')

'LATIN CAPITAL LETTER A'

In [23]:
name('a')

'LATIN SMALL LETTER A'

`lookup` does the opposite. We pass in a name, and it returns the character. 

In [24]:
lookup('WHITE SMILING FACE')

'☺'

In [25]:
lookup('SNOWMAN')

'☃'

In [26]:
lookup('SOCCER BALL')

'⚽'

In [27]:
lookup('BABY')

'👶'

The characters above might be displayed differently depending on our system's font. 

The fact that the baby character corresponds to a particular code is shared across programming languages. We can examine how it is stored by encoding it as bytes; `encode` uses UTF-8 by default.

In [28]:
lookup('BABY').encode()

b'\xf0\x9f\x91\xb6'

In [29]:
'A'.encode()

b'A'
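The two results differ in length because UTF-8 is a variable-length encoding. As a small sketch (the variable names are only illustrative), we can compare the code point with its encoded bytes and check that decoding recovers the original character:

```python
from unicodedata import lookup

baby = lookup('BABY')
encoded = baby.encode('utf-8')    # same result as baby.encode()

print(hex(ord(baby)))             # the code point, 0x1f476
print(len(baby), len(encoded))    # one character, four bytes

# ASCII characters keep their one-byte codes in UTF-8.
print(len('A'.encode('utf-8')), 'A'.encode('utf-8')[0])

# Decoding the bytes gives back the original string.
decoded = encoded.decode('utf-8')
print(decoded == baby)
```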