# What are Strings?

PyTennessee 2020 Open Space

Xavier Villaneau | @xvillaneau | xvillaneau@gmail.com

## What is a string?

- **Goal:**  
  Represent text in a computer

- **Problem:**  
  Computers only understand numbers

- **Solution:**  
  Store letters as numbers!

## String = Sequence of numbers

In [1]:
list(b"Hello World!")

[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

Any structure as long as it's ordered:  
array, linked list, tree…
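
As a small illustration of that point, the same numbers can sit in different ordered containers and still spell the same text. This is only a sketch; the variable names are made up for the example.

```python
from collections import deque

# The numbers from the cell above.
codes = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

# Three different ordered containers holding the same sequence.
containers = (bytes(codes), tuple(codes), deque(codes))

# Reading each one back in order gives the same text.
texts = [bytes(container).decode("ascii") for container in containers]
print(texts)
```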

## How do strings work?

Matching letters to numbers = **Encoding**

Binary text encoding is _much_ older than computers.

History time.

![The Baudot code, from the 1888 patent](Baudot_Code_-_from_1888_patent.png)

## The Baudot Code

* Invented 1870 (patented later)
* 5-bit **stateful** encoding
* Designed for use by humans
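
To make "stateful" concrete, here is a toy decoder in which the meaning of a code depends on the last shift code seen. The code values and the two tiny tables are hypothetical and are not the real Baudot assignments.

```python
# Hypothetical 5-bit codes, NOT the real Baudot table.
LETTERS_SHIFT = 0b11111
FIGURES_SHIFT = 0b11011
LETTERS = {1: "A", 2: "B", 3: "C"}
FIGURES = {1: "1", 2: "2", 3: "3"}


def decode(codes):
    """Decode a sequence of codes, switching tables on shift codes."""
    table = LETTERS  # assumption: the decoder starts in the letters state
    out = []
    for code in codes:
        if code == LETTERS_SHIFT:
            table = LETTERS
        elif code == FIGURES_SHIFT:
            table = FIGURES
        else:
            out.append(table[code])
    return "".join(out)


# The same code (1 or 2) means different things depending on the state.
message = [1, 2, FIGURES_SHIFT, 1, 2, LETTERS_SHIFT, 3]
decoded = decode(message)
print(decoded)
```

With only 5 bits there are 32 codes, so shifting between tables is what makes room for both letters and figures.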

## The Murray Code

* Invented 1901
* Better Baudot code for machines
* Introduced **control characters**

Industry standard up through the 1950s

![Colossus](Colossus.jpg)

A Colossus Mark 2 codebreaking computer, 1943. The operating staff numbered 272 women and 27 men!

## The ASCII Encoding

- 7-bit code (128 characters)
- Published in 1963
- Designed for Teletype / Teleprinters
- Base of practically all other encodings created since

## The ASCII Encoding

In [2]:
import numpy as np
import pandas as pd

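# Rows 0_ and 1_ (and the last cell) use Unicode "control pictures"
# to display the non-printable control characters.
ascii_chars = [
    "␀␁␂␃␄␅␆␇␈␉␊␋␌␍␎␏",
    "␐␑␒␓␔␕␖␗␘␙␚␛␜␝␞␟",
    " !\"#$%&'()*+,-./",
    "0123456789:;<=>?",
    "@ABCDEFGHIJKLMNO",
    "PQRSTUVWXYZ[\\]^_",
    "`abcdefghijklmno",
    "pqrstuvwxyz{|}~␡",
]
# Column labels are the low hex digit, row labels the high hex digit.
ascii_col = [f"_{n:x}" for n in range(16)]
ascii_row = [f"{n:x}_" for n in range(8)]
np_ascii = np.array([list(line) for line in ascii_chars])
ascii_table = pd.DataFrame(np_ascii, columns=ascii_col, index=ascii_row)

In [3]:
ascii_table

|      | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 | _a | _b | _c | _d | _e | _f |
|------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| `0_` | ␀  | ␁  | ␂  | ␃  | ␄  | ␅  | ␆  | ␇  | ␈  | ␉  | ␊  | ␋  | ␌  | ␍  | ␎  | ␏  |
| `1_` | ␐  | ␑  | ␒  | ␓  | ␔  | ␕  | ␖  | ␗  | ␘  | ␙  | ␚  | ␛  | ␜  | ␝  | ␞  | ␟  |
| `2_` |    | !  | "  | #  | $  | %  | &  | '  | (  | )  | *  | +  | ,  | -  | .  | /  |
| `3_` | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | :  | ;  | <  | =  | >  | ?  |
| `4_` | @  | A  | B  | C  | D  | E  | F  | G  | H  | I  | J  | K  | L  | M  | N  | O  |
| `5_` | P  | Q  | R  | S  | T  | U  | V  | W  | X  | Y  | Z  | [  | \\ | ]  | ^  | _  |
| `6_` | \` | a  | b  | c  | d  | e  | f  | g  | h  | i  | j  | k  | l  | m  | n  | o  |
| `7_` | p  | q  | r  | s  | t  | u  | v  | w  | x  | y  | z  | {  | \| | }  | ~  | ␡  |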
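
One way to read the table: the row label is the high hex digit of a character's code and the column label is the low hex digit. The helper below is a hypothetical illustration of that lookup, not part of the original notebook.

```python
def ascii_cell(char):
    """Return the (row, column) labels of an ASCII character in the table."""
    code = ord(char)
    if code > 0x7F:
        raise ValueError(f"{char!r} is not an ASCII character")
    row, col = divmod(code, 16)  # high digit picks the row, low digit the column
    return f"{row:x}_", f"_{col:x}"


cell_A = ascii_cell("A")  # code 0x41
cell_z = ascii_cell("z")  # code 0x7a
print(cell_A, cell_z)
```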


## Meanwhile at IBM…

* Created the EBCDIC 8-bit encoding
* Designed for System/360, announced in 1964
* Totally ASCII-incompatible

System/360 popularized using 8 bits as a byte.
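
Python still ships EBCDIC codecs, so the incompatibility is easy to see. The sketch below assumes `cp037`, which is only one of several EBCDIC code pages.

```python
text = "Hello World!"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # assumption: cp037 as the EBCDIC variant

# Same text, completely different numbers.
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# Decoding with the matching codec recovers the text.
round_trip = ebcdic_bytes.decode("cp037")
```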

## What about other languages?

"Extended ASCII" codes use the 8<sup>th</sup> bit for 128 extra characters.

- ISO-8859
- KOI8
- Windows-1252
- Mac OS Roman
- …
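
The catch with these code pages is that they agree on the first 128 values and disagree on the rest. A quick sketch, using codec names from Python's standard library (the choice of byte `0xE9` is arbitrary):

```python
codecs_to_try = ["latin-1", "koi8_r", "cp1252", "mac_roman"]

ascii_byte = b"A"       # 8th bit clear: plain ASCII
extended_byte = b"\xe9"  # 8th bit set: meaning depends on the code page

ascii_decoded = {name: ascii_byte.decode(name) for name in codecs_to_try}
extended_decoded = {name: extended_byte.decode(name) for name in codecs_to_try}

print(ascii_decoded)
print(extended_decoded)
```

Without knowing which code page was used to write a byte, there is no way to tell which character was meant.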

## Unicode

Having all these encodings is inconvenient, because Internet.

**Unicode** = international standard for encoding text.

As of Unicode 12.1, it includes 137,994 characters! (more soon)

## Unicode

* Code points allowed: 0 - 1,114,111 (`0x10FFFF`)
* Codes 0 - 255 based on Latin-1
* Each block of 65,536 (2<sup>16</sup>) code points is called a _plane_ (17 planes in total)
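
In Python, `ord` and `chr` convert between characters and code points, which makes these limits easy to poke at. A minimal sketch; `plane_of` is a name invented for the example.

```python
MAX_CODE_POINT = 0x10FFFF


def plane_of(char):
    """Plane number = everything above the low 16 bits of the code point."""
    return ord(char) >> 16


planes = {char: plane_of(char) for char in "Aé€👍"}
print(planes)

# chr() refuses anything past the last allowed code point.
try:
    chr(MAX_CODE_POINT + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```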


## UTF-8

UTF-8 is the most popular way to encode Unicode.

- Superset of ASCII
- Uses between 1 and 4 bytes per character
- Never produces null bytes except for `U+0`

## UTF-8

| From      | To         | Byte 1     | Byte 2     | Byte 3     | Byte 4     |
|-----------|------------|------------|------------|------------|------------|
|     `U+0` |     `U+7F` | `0xxxxxxx` |            |            |            |
|    `U+80` |    `U+7FF` | `110xxxxx` | `10xxxxxx` |            |            |
|   `U+800` |   `U+FFFF` | `1110xxxx` | `10xxxxxx` | `10xxxxxx` |            |
| `U+10000` | `U+10FFFF` | `11110xxx` | `10xxxxxx` | `10xxxxxx` | `10xxxxxx` |

## UTF-8 Example

The Thumbs Up (👍) emoji: `U+1F44D`

In binary: `0001.1111.0100.0100.1101`  
Re-arranged: `000.011111.010001.001101`

UTF-8: `11110000.10011111.10010001.10001101`  
Or: `F0.9F.91.8D`
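
The table and the worked example can be turned into a few lines of bit twiddling. This is a teaching sketch rather than a full encoder: it does not reject surrogate code points (`U+D800` to `U+DFFF`), which Python's own encoder refuses.

```python
def utf8_encode(code_point):
    """Encode one code point by following the UTF-8 layout table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


thumbs_up = utf8_encode(0x1F44D)
print(thumbs_up.hex())
```

Comparing against `str.encode("utf-8")` for one character from each row of the table is a quick sanity check.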

## ⚠ Unicode ≠ UTF-8 ⚠

- **Unicode:**  
  Assigns _code points_ to characters
- **UTF-8:**  
  Converts code points into _bytes_
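
One way to see the difference: a single code point turns into different bytes depending on the encoding chosen. A short sketch using Python's built-in codecs:

```python
thumbs_up = "\U0001F44D"  # one code point: U+1F44D

# Same code point, three different byte sequences.
encoded = {
    name: thumbs_up.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}
print(encoded)
```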

## Strings in Python

> Strings are immutable sequences of Unicode code points.

https://docs.python.org/3.8/library/stdtypes.html#textseq
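
Taken literally, that sentence means indexing and `len` count code points rather than bytes, and that items cannot be assigned. A small sketch to check both:

```python
s = "Pandora \U0001F44D"

length = len(s)                     # counts code points, not bytes
last = s[-1]                        # the whole emoji, not a fragment of it
code_points = [ord(c) for c in s]   # the underlying sequence of numbers

# Immutable: item assignment raises TypeError.
try:
    s[0] = "p"
    mutation_rejected = False
except TypeError:
    mutation_rejected = True
```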

## Demonstration

## Demonstration (plan B)

In [6]:
s = "Pandora \U0001F44D"
s

'Pandora 👍'

In [7]:
s.encode()

b'Pandora \xf0\x9f\x91\x8d'

In [8]:
s.encode().decode('cp1252', 'replace')

'Pandora ðŸ‘�'

## A short memory exercise

In [9]:
s = "ewQ;KDWad qwe24a]s[awv;'15324ansdanx!$0(" * 2500

In [10]:
len(s)

100000

In [13]:
import sys
sys.getsizeof(s)

100049

## A short memory exercise

In [15]:
s1 = s + 'é'
s2 = s + '€'
s3 = s + '👍'

In [16]:
sys.getsizeof(s1)

100074

In [17]:
sys.getsizeof(s2)

200076

In [18]:
sys.getsizeof(s3)

400080

## Implementation of strings

> Unicode objects internally use a variety of representations, in order to allow handling the complete range of Unicode characters while staying memory efficient. There are special cases for strings where all code points are below 128, 256, or 65536 \[…\]

https://docs.python.org/3/c-api/unicode.html
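
This is what the memory exercise was showing. A rough way to measure the per-character width is to compare the sizes of two strings that differ by one character. The helper name is made up, and the results are specific to CPython 3.3 or later.

```python
import sys


def bytes_per_code_point(char, n=1000):
    """Extra bytes CPython reports for one more copy of `char` (CPython-specific)."""
    return sys.getsizeof(char * (n + 1)) - sys.getsizeof(char * n)


# One sample character from each range mentioned in the quote.
widths = {char: bytes_per_code_point(char) for char in "aé€👍"}
print(widths)
```

The width is picked for the whole string, based on its largest code point, which is why one emoji quadrupled the size of `s3`.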

## Other languages

- **Haskell:** Linked list of Unicode code points
- **Java / JavaScript:** Sequence of UTF-16 code units
- **Go / Rust:** Sequence of bytes, UTF-8 encoded
- **C:** ¯\\_(ツ)_/¯
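
A practical consequence is that "length" means something different in each family. The sketch below uses Python to count the same string three ways, as an approximation of what each kind of language would report.

```python
s = "Pandora \U0001F44D"

code_points = len(s)                               # Python, Haskell: code points
utf16_units = len(s.encode("utf-16-le")) // 2      # Java, JavaScript: UTF-16 code units
utf8_bytes = len(s.encode("utf-8"))                # Go, Rust: bytes of UTF-8

print(code_points, utf16_units, utf8_bytes)
```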

# Thank You!