# Introduction to Strings

Algorithms PERG – Mar. 5<sup>th</sup>, 2020

by Xavier Villaneau

## What is a string?

- **Goal:**  
  Represent text in a computer

- **Problem:**  
  Computers only understand numbers

- **Solution:**  
  Store letters as numbers!

## String = Sequence of numbers

In [1]:
list(b"Hello World!")

[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

Any data structure works, as long as it keeps the numbers in order:  
array, linked list, tree…
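
To make the "sequence of numbers" idea concrete, here is a minimal sketch using Python's built-in `ord` and `chr`. The variable names are illustrative only.

```python
# Turn each character into its number, then rebuild the text from the numbers.
text = "Hello World!"
codes = [ord(ch) for ch in text]            # characters -> numbers
rebuilt = "".join(chr(n) for n in codes)    # numbers -> characters

print(codes)
print(rebuilt)
```

The list holds the same numbers as `list(b"Hello World!")`, and joining the characters back together restores the original text.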

## How do strings work?

Matching letters to numbers = **Encoding**

Binary text encoding is _much_ older than computers:
* Baudot code: 1874
* Colossus computer: 1944

## The ASCII Encoding

- 7-bit code (128 characters)
- Published in 1963
- Designed for Teletype / Teleprinters
- Base of practically all other encodings created since

## The ASCII Encoding

In [2]:
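import numpy as np
import pandas as pd

# One string per table row: 8 rows of 16 characters each.
# Control characters are shown with their Unicode "control picture" symbols.
ascii_chars = [
    "␀␁␂␃␄␅␆␇␈␉␊␋␌␍␎␏",
    "␐␑␒␓␔␕␖␗␘␙␚␛␜␝␞␟",
    " !\"#$%&'()*+,-./",
    "0123456789:;<=>?",
    "@ABCDEFGHIJKLMNO",
    "PQRSTUVWXYZ[\\]^_",
    "`abcdefghijklmno",
    "pqrstuvwxyz{|}~␡",
]
# Column labels give the low hex digit, row labels the high hex digit.
ascii_col = [f"_{n:x}" for n in range(16)]
ascii_row = [f"{n:x}_" for n in range(8)]
# Split each row into single characters to get an 8x16 grid.
np_ascii = np.array([list(line) for line in ascii_chars])
ascii_table = pd.DataFrame(np_ascii, columns=ascii_col, index=ascii_row)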

In [3]:
ascii_table

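|    | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 | _a | _b | _c | _d | _e | _f |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0_ | ␀ | ␁ | ␂ | ␃ | ␄ | ␅ | ␆ | ␇ | ␈ | ␉ | ␊ | ␋ | ␌ | ␍ | ␎ | ␏ |
| 1_ | ␐ | ␑ | ␒ | ␓ | ␔ | ␕ | ␖ | ␗ | ␘ | ␙ | ␚ | ␛ | ␜ | ␝ | ␞ | ␟ |
| 2_ |   | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 3_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4_ | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5_ | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \\ | ] | ^ | _ |
| 6_ | \` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7_ | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | ␡ |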
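
The row and column labels of the table are the two hex digits of each character's code. The sketch below rebuilds a character from its coordinates; `from_table` is a hypothetical helper written for this illustration, not part of the notebook.

```python
# Hypothetical helper: rebuild a character from its table coordinates.
def from_table(row: str, col: str) -> str:
    """Combine a row label like '4_' and a column label like '_1' into a character."""
    code = int(row[0] + col[1], 16)   # high hex digit from the row, low one from the column
    return chr(code)

print(from_table("4_", "_1"))   # 'A' is 0x41
print(from_table("6_", "_1"))   # 'a' is 0x61

# Upper- and lower-case letters sit two rows apart, so they differ by one bit (0x20).
print(chr(ord("A") | 0x20))     # set the bit: upper -> lower
print(chr(ord("a") & ~0x20))    # clear the bit: lower -> upper
```

This layout is why case conversion of ASCII letters can be done with a single bit operation.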


## What about other languages?

"Extended ASCII" codes use the 8<sup>th</sup> bit for 128 extra characters.

- ISO-8859
- KOI8
- Windows-1252
- Mac OS Roman
- …
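
A quick way to see why these encodings clash: the same byte means a different character in each of them. The sketch below uses codec names from Python's standard library, picking ISO-8859-1 and KOI8-R as representatives of their families; that choice is an assumption made for the example.

```python
# One byte with the 8th bit set, decoded under several "extended ASCII" codecs.
CODECS = ("iso8859_1", "koi8_r", "cp1252", "mac_roman")
byte = bytes([0xE9])

for codec in CODECS:
    print(f"{codec:>10}: {byte.decode(codec)!r}")

# Bytes below 0x80 are plain ASCII and decode identically in all of them.
plain = b"Hello"
print({plain.decode(codec) for codec in CODECS})
```

Without knowing which encoding produced a file, there is no reliable way to tell which of these readings was intended.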

## Unicode

Having all these encodings is inconvenient once text travels over the Internet: sender and receiver must agree on which one is in use.

**Unicode** = international standard for encoding text.  
As of version 12.1 (May 2019), it includes 137,994 characters!
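
In Python, each character's Unicode number (its code point) and official name can be looked up with `ord` and the standard `unicodedata` module. The sample characters below are arbitrary picks for illustration.

```python
import sys
import unicodedata

# A, é, € and 👍, written as escapes so the exact code points are unambiguous.
samples = "A\u00e9\u20ac\U0001F44D"

for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The highest code point Unicode allows.
print(f"U+{sys.maxunicode:X}")
```

Code points are conventionally written as `U+` followed by at least four hex digits.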

## UTF-8

UTF-8 is the most popular way to encode Unicode.

- Superset of ASCII
- Uses between 1 and 4 bytes per character
- Never produces a null byte, except when encoding `U+0000` itself
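
The three properties above can be checked directly with Python's built-in `str.encode`. This is only a spot check on a few sample characters, not a proof.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F44D"]   # A, é, €, 👍

# Between 1 and 4 bytes per character.
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Superset of ASCII: pure ASCII text gives the same bytes either way.
assert "Hello World!".encode("utf-8") == "Hello World!".encode("ascii")

# No null bytes unless the text itself contains U+0000.
assert 0 not in "".join(samples).encode("utf-8")
assert "\u0000".encode("utf-8") == b"\x00"
```

The null-byte property matters for code that treats a zero byte as the end of a string, as C does.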

## ⚠ Unicode ≠ UTF-8 ⚠

- **Unicode:**  
  Assigns _code points_ to characters
- **UTF-8:**  
  Converts code points into _bytes_

Alternatives to UTF-8 exist, e.g. UTF-16.
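
To show the second step in isolation, here is a minimal sketch of a UTF-8 encoder for a single code point. `utf8_encode` is a hypothetical function written for this illustration; in practice Python's built-in `str.encode` does the job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

thumbs_up = "\U0001F44D"
print(utf8_encode(ord(thumbs_up)))       # one code point as UTF-8 bytes
print(thumbs_up.encode("utf-16-be"))     # the same code point as UTF-16 bytes
```

The code point stays the same in both cases; only the byte representation changes.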

## Demonstration

In [4]:
s = "Pandora \U0001F44D\U0001F3FF"
s

'Pandora 👍🏿'

In [5]:
s.encode('utf-8')

b'Pandora \xf0\x9f\x91\x8d\xf0\x9f\x8f\xbf'

In [9]:
s.encode('utf-8').decode('cp1252', 'replace')

'Pandora ðŸ‘�ðŸ�¿'
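
The garbled output comes from reading each UTF-8 byte as if it were one Windows-1252 character. The sketch below decodes the emoji bytes one at a time to show where each symbol comes from; the slice index assumes the 8-character `"Pandora "` prefix.

```python
s = "Pandora \U0001F44D\U0001F3FF"
raw = s.encode("utf-8")

# Decode the emoji bytes one by one, the way a single-byte codec sees them.
for byte in raw[8:]:
    shown = bytes([byte]).decode("cp1252", "replace")
    print(f"0x{byte:02x} -> {shown!r}")

# Decoding with the encoding that produced the bytes restores the text.
print(raw.decode("utf-8") == s)
```

In Python's `cp1252` codec the bytes `0x8d` and `0x8f` are unassigned, so the `'replace'` error handler turns them into U+FFFD (�).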

## Implementations

- **Python 3:** Sequence of Unicode code points
- **Java / JavaScript:** Sequence of UTF-16 code units
- **Go / Rust:** Sequence of bytes, UTF-8 encoded
- **C:** ¯\\_(ツ)_/¯
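
One practical consequence is that the "length" of a string depends on the language. The sketch below uses Python to count the demonstration string in each of the three units; the mapping to other languages' length functions is noted in the comments.

```python
s = "Pandora \U0001F44D\U0001F3FF"

code_points = len(s)                             # Python 3: len() counts code points
utf16_units = len(s.encode("utf-16-le")) // 2    # Java / JavaScript: length counts UTF-16 code units
utf8_bytes = len(s.encode("utf-8"))              # Go len() / Rust str::len(): UTF-8 bytes

print(code_points, utf16_units, utf8_bytes)
```

None of the three numbers matches what a reader sees, which is 8 characters followed by a single emoji.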

# Thank You!