# Lecture 19 Notes

## Data Structures

A **data structure** is an organized collection of data. Python has three main
built-in data structures that we will see in this course:

- *Strings*, which are ordered sequences of characters.
- *Lists*, which are ordered sequences of any values.
- *Dictionaries*, which let you find a value using a key associated with it.
  For example, you could make a dictionary where you can find the name of a
  student given their SFU ID number.
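
As a rough illustration of the three structures, here is a small sketch. The variable names, the marks, and the ID number are all made up for this example:

```python
# One value of each kind; every name and number here is hypothetical.
greeting = 'hello'                       # string: ordered characters
marks = [71, 'absent', 88.5]             # list: ordered values of any type
students = {301234567: 'Ada Lovelace'}   # dictionary: ID number -> name

first_char = greeting[0]                 # first character of the string
first_mark = marks[0]                    # first value in the list
name = students[301234567]               # look up a name by ID number

print(first_char, first_mark, name)
```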

In the next few lectures, we'll look at **strings**, a very useful and important
data structure. Python has good built-in support for strings, and so is a
popular language to use for string processing.


## String Basics

In Python, a **string** is a sequence of 0 or more characters. The type of a
string is `str`:

In [1]:
type('cat')

str

A **string literal** is represented using single-quotes, double-quotes, or
triple quotes. Here are 4 examples of string literals:

```python
'you can put a " in single-quote strings'

"you can put a ' in double-quote strings"

"""Triple-quoted strings can span
multiple lines. They are often used as docstrings
for functions.
"""

'''This is also a triple-quoted string,
using single-quotes instead of double-quotes.
'''
```

As shown, you *can* split a triple-quoted string across multiple lines, but you
*cannot* split a regular single-quoted or double-quoted string across multiple
lines.

The **empty string** is a string with 0 characters, i.e. of length 0. These
all represent the empty string (the first two are by far the most common):

```
''

""

""""""

''''''
```

We will often need to treat the empty string as a special case when processing
strings.
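
Here is a minimal sketch of what treating the empty string as a special case can look like. The function name `describe` is made up for this example, and it uses the built-in `len` function, which returns the number of characters in a string:

```python
def describe(s):
    """Return a short description of the string s."""
    if s == '':                      # special case: no characters at all
        return 'empty string'
    return 'string of length ' + str(len(s))

print(describe(''))      # empty string
print(describe('cat'))   # string of length 3
```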

The *order* of the characters in a string matters. For example, `'abc'` and
`'bac'` are *different* strings.

The *case* of a letter in a string *matters*. For example, `'M'` and `'m'` are
*different* strings, and `'Cat'` and `'cat'` are also different.

**Note** Some programming languages have a special data type for single
characters. For example, in C++ the `char` data type represents a single
character. However, Python does *not* have a character data type.
Single-character strings like `'h'` or `'!'` are regular strings of length 1.
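
A quick check, as a sketch, that a single character is just a string of length 1 (the variable names are only for illustration):

```python
c = 'h'
print(type(c))    # <class 'str'>
print(len(c))     # 1

# Taking one character out of a longer string also gives a str.
first = 'hello'[0]
print(type(first) == type('hello'))   # True
```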

The **length** of a string is the number of characters it contains, and in
Python the built-in `len` function returns this:

```python
>>> len('')
0
>>> len('log')
3
>>> len('a b c')
5
```

In [None]:
print(len(''))       # 0
print(len('log'))    # 3
print(len('a b c'))  # 5

## Special Characters

In a string literal, a '`\`' indicates an **escape character**, which means that
the next character is special in some way. For example, `'\n'` is an escape
character called **newline** that represents a command to send the cursor to the
next line:

```python
>>> print('one\ntwo')
one
two
```

In [3]:
print('one\ntwo')

one
two


The string `'one\ntwo'` has length 7 (not 8!). Even though `'\n'` consists of
two symbols, `\` and `n`, it counts as a single character. Similarly, the
string `'\n\n\n'` has length 3:

In [4]:
print(len('\n\n\n'))

3


Here are the most common escape characters that you will see in Python
strings:

| **escape** | **name** | **common use** | **example** |
|:--:|:------------:|:----------------------------------------:|-----------------------------------------------|
| `\n` | newline | start a new line | `>>> print('a\nb')`<br>`a`<br>`b` |
| `\t` | tab | fixed-width space, for formatting;<br>width of a tab is **not** defined by <br>Python | `>>> print('\thello!')`<br>`    hello!` |
| `\\` | backslash | `\` as a literal character | `>>> print('root\\users')`<br>`root\users` |
| `\'` | single quote | `'` as a literal character | `>>> print('\'-quote')`<br>`'-quote` |
| `\"` | double quote | `"` as a literal character | `>>> print("\"-quote")`<br>`"-quote` |

**Example 1** Make a string that, when printed, displays this:

```
special characters: ' " \
```

Here is one way to do it:

In [5]:
print('special characters: \' " \\')

special characters: ' " \


**Example 2** Make a string that, when printed, displays three backslashes:

```
\\\
```

Each printed `\` needs double-`\` in the string:

```python
print('\\\\\\')  # 6 backslashes
```

In [6]:
print('\\\\\\')  # 6 backslashes

\\\


Note that 5 backslashes *doesn't* work; it is a syntax error:

In [7]:
print('\\\\\')

SyntaxError: EOL while scanning string literal (2955580404.py, line 1)

The problem is that Python reads the string as these three escape characters:
`\\`, `\\`, and `\'`. This means there is no `'`-quote to end the string.
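
One way to check how Python reads these literals is to look at their lengths. This is only a sketch; the variable names are made up:

```python
three = '\\\\\\'       # 6 backslashes typed, 3 characters stored
print(len(three))      # 3
print(three)           # \\\

# 5 backslashes followed by an extra quote IS valid: Python reads it as
# \\ then \\ then \' and the final ' closes the string.
mixed = '\\\\\''
print(len(mixed))      # 3
print(mixed)           # \\'
```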

## Whitespace

A **whitespace character** is a character that doesn't have a visual
representation (and so, when "printed" on a piece of white paper, will look
like empty white space). The three most common whitespace characters are:

- `' '`, a regular space
- `'\n'`, a newline
- `'\t'`, a tab

When programmers say "whitespace", they mean characters like these. Sometimes
whitespace matters, sometimes it doesn't. For instance, the indentation in a
Python program is whitespace that matters: inconsistent indentation can cause a
syntax error. 

When we read strings from the user, we often remove whitespace characters at the
beginning and the end using the `.strip()` method:

In [9]:
print('   done '.strip())
print('   print    name  '.strip())

done
print    name


`strip()` removes all whitespace characters from the beginning and end of a
string, but does not remove whitespace characters in the middle of the string.
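
Since whitespace is hard to see when printed, a sketch like the following uses the built-in `repr` function to show the characters before and after stripping (the variable names are just for illustration):

```python
raw = '\t  print    name \n'   # tab, spaces, and a newline around the text
clean = raw.strip()

print(repr(raw))               # '\t  print    name \n'
print(repr(clean))             # 'print    name'
print(len(raw), len(clean))    # 18 13
```

The four spaces in the middle survive; only the leading and trailing whitespace is removed.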

## Strings are Immutable

In Python, **mutable** means "changeable", and **immutable** means "not
changeable". Python strings are immutable, i.e. there is no way to modify them
or change their length.

String immutability has both pros and cons:

- One good feature of immutable strings is that they are quite efficient for
  most copying operations. Since they never change, it is always safe to have
  different variables refer to the same underlying string.

- One bad feature of immutable strings is that if you need to, say, change one
  character in a very long string, then you need to construct a brand new
  string. If your program does this kind of thing a lot, then it could become
  very inefficient.
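
Both points can be seen in a small sketch. The helper `replace_at` is a made-up name, not a built-in; it builds a brand new string one character at a time, using a loop and indexing:

```python
def replace_at(s, index, new_char):
    """Return a NEW string like s, but with new_char at position index."""
    result = ''
    for i in range(len(s)):
        if i == index:
            result += new_char
        else:
            result += s[i]
    return result

word = 'apple'
alias = word                  # no copy is made: both names refer to one string
print(alias is word)          # True

capitalized = replace_at(word, 0, 'A')
print(capitalized)            # Apple
print(word)                   # apple (the original is unchanged)
```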

## String Concatenation

**Concatenating** two strings means to combine them together to make a new
string. This is easily done in Python with the `+` operator:

In [10]:
print('cat' + 'nap')          # 'catnap'
print('Elon' + ' ' + 'Musk')  # 'Elon Musk'
print('ha' + 'ha' + 'ha')     # 'hahaha'

catnap
Elon Musk
hahaha


The last example with `'ha'` is an example of concatenating a string with
itself. You can also do that with the `*` operator:

In [11]:
print(3 * 'ha')    # 'hahaha'
print('Boat' * 5)  # 'BoatBoatBoatBoatBoat'

hahaha
BoatBoatBoatBoatBoat
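
As a small sketch of combining `+` and `*`, here is a hypothetical helper (the name `underline` is made up) that puts a row of dashes under a title:

```python
def underline(title):
    """Return title followed by a line of dashes of the same length."""
    return title + '\n' + '-' * len(title)

print(underline('String Basics'))
```

This prints the title on one line and 13 dashes on the next, since `'String Basics'` has 13 characters.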


String expressions can get complicated:

In [12]:
s = (5 * 'Meow! ' + '\n') * 6
print(s)

Meow! Meow! Meow! Meow! Meow! 
Meow! Meow! Meow! Meow! Meow! 
Meow! Meow! Meow! Meow! Meow! 
Meow! Meow! Meow! Meow! Meow! 
Meow! Meow! Meow! Meow! Meow! 
Meow! Meow! Meow! Meow! Meow! 



## String Comparisons

If `s` and `t` are variables that refer to strings, then:

- `s == t` evaluates to `True` if `s` and `t` are the same length, and have
  exactly the same characters in the same order. Otherwise, it evaluates to
  `False`.

- `s != t` evaluates to `True` if `s` and `t` are different, and to `False` if
  `s == t`. It returns the same value as `not (s == t)`.

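- `s < t` evaluates to `True` if `s` comes alphabetically *before* `t`, and
  `False` otherwise. Python compares the strings character by character using
  character codes, so every uppercase letter comes before every lowercase
  letter. `s > t` evaluates to the same value as `t < s`.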

- `s <= t` evaluates to `True` if either `s < t` or `s == t`, and `False`
  otherwise. `s >= t` evaluates to the same value as `t <= s`.

For example:

```
>>> s = 'cat'
>>> t = 'dog'

>>> s == t
False
>>> s == s
True
>>> ('c' + 'at' ) == s
True
>>> s != t
True

>>> s < t
True
>>> s <= t
True

>>> s > t
False
>>> s >= t
False
```

In [13]:
s = 'cat'
t = 'dog'

print(s == t)              # False
print(s == s)              # True
print(('c' + 'at' ) == s)  # True
print(s != t)              # True

print(s < t)               # True
print(s <= t)              # True

print(s > t)               # False
print(s >= t)              # False

False
True
True
True
True
True
False
False
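
A few more comparisons, as a sketch, show where "alphabetical" order can be surprising. The `.lower()` string method used in the last line is not covered in these notes; it returns a lowercase copy of a string:

```python
print('Zebra' < 'apple')   # True: 'Z' (code 90) comes before 'a' (code 97)
print('apple' < 'apply')   # True: the first difference is 'e' < 'y'
print('app' < 'apple')     # True: a prefix comes before the longer string
print('10' < '9')          # True: compared as characters, not as numbers

# Comparing lowercase copies ignores the case of the letters.
print('Zebra'.lower() < 'apple'.lower())   # False
```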


## String Indexing

If `s` is a string, then `s[i]` is the character at index location `i` of `s`.
`s[i]` is an example of **string indexing**, and we say we are *indexing into*
`s` using `i`.

For example:

In [14]:
s = 'apple'
print(s[0])  # 'a'
print(s[1])  # 'p'
print(s[2])  # 'p'
print(s[3])  # 'l'
print(s[4])  # 'e'
print(s[5])  # IndexError: index 5 is out of range

a
p
p
l
e


IndexError: string index out of range

In Python, string indexing *always* starts at 0, meaning the first character of
a string is at location 0. This is known as **0-based indexing**. 

The index of the last character is always *one less* than the length of the
string (because the indices start at 0). If you try to access an index location
past the last one, then you get an "index out of range" run-time error as shown
in the last example above.

Diagrams are helpful for understanding indexing:

```
       0     1     2     3     4
    +-----+-----+-----+-----+-----+
 s  | 'a' | 'p' | 'p' | 'l' | 'e' |
    +-----+-----+-----+-----+-----+
     s[0]  s[1]  s[2]  s[3]  s[4] 
```

In general, if `s` is any *non-empty* string, then `s[0]` is its first
character, and `s[len(s)-1]` is its last character. Evaluating `s[len(s)]`
causes an "index out of range" run-time error.

For the empty string, `s[i]` causes an "index out of range" error for any index
`i`.
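
These rules can be put together in a short sketch. The function name `last_char` and the choice to return `''` for an empty string are assumptions made for this example:

```python
def last_char(s):
    """Return the last character of s, or '' if s is empty."""
    if len(s) == 0:          # s[len(s) - 1] would be an error here
        return ''
    return s[len(s) - 1]     # the last valid index is len(s) - 1

print(last_char('apple'))    # e
print(last_char('a'))        # a
print(last_char('') == '')   # True
```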

Since Python strings are immutable (i.e. not changeable), it is an error to
assign a new character to a string:

In [15]:
s = 'apple'
s[0] = 'A'  # error: can't change a string

TypeError: 'str' object does not support item assignment

## Looping Over Strings

The simplest way to loop over a Python string is to use a for-loop to directly get the characters:

In [16]:
s = 'apple'
for c in s:
    print(c)

a
p
p
l
e


You can also use a for-loop with an index:

In [17]:
s = 'apple'
for i in range(len(s)):
    print(s[i])

a
p
p
l
e


Or the same thing using a while-loop:

In [18]:
s = 'apple'
i = 0
while i < len(s):  # < is important; <= would be wrong
    print(s[i])
    i += 1         # += adds a value to a variable

a
p
p
l
e
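
Looping over the characters is useful for more than printing. As a sketch, here is a hypothetical function (the name `count_char` is made up) that counts how many times a character appears in a string:

```python
def count_char(s, target):
    """Return the number of times the character target occurs in s."""
    count = 0
    for c in s:
        if c == target:
            count += 1
    return count

print(count_char('apple', 'p'))  # 2
print(count_char('apple', 'z'))  # 0
print(count_char('', 'a'))       # 0
```

The loop body never runs for the empty string, so no special case is needed here.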


**Example** Write a function that *reverses* a string, e.g.
`reverse('abc')` returns `'cba'`.

Here's one way to do it using a for-loop:

In [22]:
def reverse1(s):
    result = ''
    for c in s:
        result = c + result
    return result

print(reverse1('apple'))  # 'elppa'
print(reverse1('bird'))   # 'drib'

elppa
drib


And with a while-loop:

In [21]:
def reverse2(s):
    result = ''
    i = len(s) - 1
    while i >= 0:
        result += s[i]
        i -= 1
    return result

print(reverse2('apple'))  # 'elppa'
print(reverse2('bird'))   # 'drib'

elppa
drib


Since Python strings are immutable, the line `result += s[i]` in `reverse2`
(like `result = c + result` in `reverse1`) creates a new string every time it
is executed. Each of these intermediate strings is discarded as soon as the
next character from `s` is added. For long strings, these functions can
therefore be quite inefficient.
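
One common way to avoid building so many intermediate strings is to collect the characters in a list and combine them once at the end with the string method `join`. This is a different technique from the two functions above, and the name `reverse3` is made up for this sketch:

```python
def reverse3(s):
    """Return the reverse of s, building the result with a list and join."""
    chars = []                 # lists are mutable, so appending is cheap
    i = len(s) - 1
    while i >= 0:
        chars.append(s[i])
        i -= 1
    return ''.join(chars)      # one final string is created here

print(reverse3('apple'))  # elppa
print(reverse3('bird'))   # drib
```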