# Chapter 3. Built-in Data Structures, Functions, and Files

This chapter covers Python's built-in data structures, functions, and files. The built-in data structures are:
- *tuple* with `()`
- *list* with `[]`
- *dict* with `{'key': value}`
- *set* with `{value1, value2}` or `set()` (an empty `{}` creates a dict, not a set)

## Data Structures and Sequences

### Tuple

A tuple is a fixed-length, immutable sequence of objects, usually created with a comma-separated sequence of values wrapped in parentheses. In many contexts, the parentheses can be omitted. You can convert other iterables such as lists to tuples with `tuple()`. Indexing and slicing use square brackets `[]`, the same as for other sequences.

Although tuples are immutable, if an element in a tuple is mutable (such as a list), you can modify it in place!

In [139]:
my_tuple = (1, [2, 3], "string")
my_tuple # (1, [2, 3], 'string')
my_tuple[1].append(4)
my_tuple # (1, [2, 3, 4], 'string')

# You can concatenate tuples with the + operator or repeat them with *
my_tuple = my_tuple + (6, 7, 8) # (1, [2, 3, 4], 'string', 6, 7, 8)
my_tuple

my_tuple2 = (1, 2, 3)
my_tuple2 * 3 # (1, 2, 3, 1, 2, 3, 1, 2, 3)


(1, 2, 3, 1, 2, 3, 1, 2, 3)

In [140]:
# Unpacking tuples
a, b, c = my_tuple2
a # 1

a, b, c = (1, 2, (3, 4))
c # (3, 4)
a, b, (c, d) = (1, 2, (3, 4))
c # 3

# Unpacking over sequences of tuples/lists
seq = [(1, 2, 3), ("a", "b", "c"), (True, False, False), (7, 8, 9)]
for x, y, z in seq:
    print(f"x = {x}, y = {y}, z = {z}")

x = 1, y = 2, z = 3
x = a, y = b, z = c
x = True, y = False, z = False
x = 7, y = 8, z = 9


In [141]:
# If you only need some values from a tuple
values = 1, 2, 3, 4, 5
a, b, *rest = values # method 1
a, b, *_ = values # method 2

In [142]:
# Tuple methods: there are only two, .count() and .index()
a = (1, 5, 2, 5, 3)
a.count(5) # count the number of occurrences: 2

2
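
To make the immutability rules above concrete, here is a small sketch. The variable names are made up for illustration; it shows that rebinding a slot fails, that a mutable element can still change, and that a one-element tuple needs a trailing comma.

```python
point = (1, [2, 3], "string")

# Rebinding a slot of a tuple raises TypeError
try:
    point[0] = 99
    assignment_allowed = True
except TypeError:
    assignment_allowed = False

# The list stored inside the tuple is still mutable
point[1].append(4)            # point is now (1, [2, 3, 4], 'string')

# .index() returns the position of the first match
position = (1, 5, 2, 5, 3).index(5)   # 1

# A trailing comma is what makes a one-element tuple
single = (42,)                # tuple
not_a_tuple = (42)            # just the int 42
```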

### List

Lists are mutable, so they offer more methods than tuples:
- `append(element)` adds one element at the end of the list
- `extend(iterable)` adds multiple elements at the end
- You can also concatenate lists with `+`, but `extend()` is usually preferred because `+` builds a new list and copies the elements
- `insert(pos, element)` inserts `element` at position `pos`
- `pop(pos)` removes and returns the element at `pos` (the last element if `pos` is omitted)
- `remove(element)` removes the first occurrence of `element`
- `"element" in my_list` tests whether `element` is in `my_list`
- `"element" not in my_list` tests the opposite
- `my_list.sort()` sorts `my_list` in place. You can also provide an optional `key`, a function that produces the value used to sort each object.
- Slicing is done with square brackets `[]`. **Note that negative indices count from the end of the sequence**, which is different from R. A clever way to reverse the order of a list is to use `my_list[::-1]`.

In [143]:
# Slicing examples
my_list = list(range(10))
my_list # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
my_list[-3:] # [7, 8, 9]
my_list[::2] # [0, 2, 4, 6, 8]
my_list[::-1] # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
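
The list methods described above can be sketched in one short run. The data and variable names here are illustrative only.

```python
nums = [3, 1, 2]
nums.append(5)             # [3, 1, 2, 5]
nums.extend([9, 7])        # [3, 1, 2, 5, 9, 7]
nums.insert(1, 100)        # [3, 100, 1, 2, 5, 9, 7]
popped = nums.pop(1)       # removes and returns 100
nums.remove(9)             # removes the first 9 -> [3, 1, 2, 5, 7]
has_five = 5 in nums       # True
nums.sort()                # in place -> [1, 2, 3, 5, 7]

# Sorting with a key function: order the words by their length
words = ["saw", "small", "He", "foxes", "six"]
words.sort(key=len)        # ['He', 'saw', 'six', 'small', 'foxes']
```

The sort is stable, so words of equal length keep their original relative order.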

### Dictionary

The most important data type in Python is arguably the `dict`. In other languages, dictionaries are often called *hash maps* or *associative arrays*. A dictionary is usually created with curly braces `{}`, with each key separated from its value by a colon `:`.

In [144]:
my_dict = {"name": "John", "age": 30, "3": "a strange one"}
my_dict[2] = "insert" # {'name': 'John', 'age': 30, '3': 'a strange one', 2: 'insert'}
my_dict["3"] = "a better one" # {'name': 'John', 'age': 30, '3': 'a better one', 2: 'insert'}

# Check whether a key exists
2 in my_dict # True

# Delete an element
del my_dict[2]
my_dict # {'name': 'John', 'age': 30, '3': 'a better one'}

# Delete and return the value
age = my_dict.pop("age")
age # 30
my_dict # {'name': 'John', '3': 'a better one'}

my_dict["Occupation"] = "AI Engineer"

# Convert keys, values, or both to a list
list(my_dict.keys()) # ['name', '3', 'Occupation']
list(my_dict.values()) # ['John', 'a better one', 'AI Engineer']
list(my_dict.items()) # [('name', 'John'), ('3', 'a better one'), ('Occupation', 'AI Engineer')]

# Merge two dictionaries—it changes the first dict in place
my_dict2 = {"Location": "Toronto", "Salary": 550_000}
my_dict.update(my_dict2)
my_dict

# Zip a dict from two lists
key_list = ["Name", "Age", "City"]
value_list = ["John", 30, "Toronto"]
dict(zip(key_list, value_list)) # {'Name': 'John', 'Age': 30, 'City': 'Toronto'}

{'Name': 'John', 'Age': 30, 'City': 'Toronto'}

**Default values**

Suppose that you want something like:
```
if key in my_dict:
    value = my_dict[key]
else:
    value = default_value
```
You have a few alternatives:

In [145]:
# .get(key, default_value) method
my_dict = {"name": "Alice", "age": 30}
my_dict.get("name", "Unknown") # 'Alice'
my_dict.get("location", "Unknown") # 'Unknown'

'Unknown'

In [146]:
words = ["apple", "bat", "bar", "atom", "book"]
by_letter = {}

# 1. explicit search
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)

by_letter # {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

# 2. use setdefault method (start again from an empty dict)
by_letter = {}
for word in words:
    by_letter.setdefault(word[0], []).append(word)

# 3. use defaultdict from collections module
from collections import defaultdict
by_letter = defaultdict(list)

for word in words:
    by_letter[word[0]].append(word)


The keys in a dictionary have to be immutable objects such as `int`, `float`, `str`, or tuples (all the objects in the tuple have to be immutable as well). In other words, the key has to be *hashable*, which can be tested with the `hash()` function.
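
A quick way to check the hashability rule is to call `hash()` and see whether it raises. This is a minimal sketch with made-up variable names.

```python
hash("string")             # works: str is hashable
hash((1, 2, (3, 4)))       # works: a tuple of immutable objects

# A tuple containing a list is not hashable
try:
    hash((1, 2, [3, 4]))
    tuple_with_list_hashable = True
except TypeError:
    tuple_with_list_hashable = False

# To use a list as a key, convert it to a tuple first
coords = {}
coords[tuple([1, 2])] = "point"   # {(1, 2): 'point'}
```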

### Set

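A set is an *unordered* collection of *unique* elements. It can be created either with `set()` or with a set literal in curly braces, such as `{1, 2, 3}`. An empty pair of braces `{}` creates a dict, so use `set()` for an empty set.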

In [147]:
set([1, 2, 2, 3]) # {1, 2, 3}
{1, 2, 2, 3} # {1, 2, 3}

{1, 2, 3}

In [148]:
a = {1, 2, 3}
b = {3, 4, 5, 6}
c = a.union(b)
c # {1, 2, 3, 4, 5, 6}
c = a | b # same as before
c = a.intersection(b) # {3}
c = a & b # same as before
a |= b # modify a in place
a # {1, 2, 3, 4, 5, 6}

{1, 2, 3, 4, 5, 6}

### Set operators

There are many other set operators:

| Function | Operator | Description |
|----------|----------|-------------|
| `a.add(x)` | N/A | Add `x` to `a` |
| `a.clear()` | N/A | Reset `a` to empty |
| `a.remove(x)` | N/A | Remove `x` from `a` |
| `a.pop()` | N/A | Remove an arbitrary element from `a`, raising `KeyError` if `a` is empty |
| `a.union(b)` | `a \| b` | Union of `a` and `b` |
| `a.update(b)` | `a \|= b` | Union of `a` and `b` in place |
| `a.intersection(b)` | `a & b` | Intersection of `a` and `b` |
| `a.intersection_update(b)` | `a &= b` | Intersection of `a` and `b` in place |
| `a.difference(b)` | `a - b` | Elements in `a` but not `b` |
| `a.difference_update(b)` | `a -= b` | Elements in `a` but not `b` in place |
| `a.symmetric_difference(b)` | `a ^ b` | Elements in either `a` or `b` but not both |
| `a.symmetric_difference_update(b)` | `a ^= b` | Elements in either `a` or `b` but not both, in place |
| `a.issubset(b)` | `a <= b` | `True` if `a` is a subset of `b` |
| `a.issuperset(b)` | `a >= b` | `True` if `a` is a superset of `b` |
| `a.isdisjoint(b)` | N/A | `True` if `a` and `b` have nothing in common |

Like dictionary keys, set elements must be *hashable*, which in practice usually means immutable. If you want to store a list-like sequence in a set, convert it to a tuple first.
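
Here is a short sketch of a few operators from the table, plus the tuple conversion just mentioned. The sets and names are illustrative.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

diff = a - b                       # {1, 2}
sym = a ^ b                        # {1, 2, 5}
small_is_subset = {1, 2} <= a      # True
disjoint = a.isdisjoint({9})       # True

# In-place variant: modify a copy so that a itself stays unchanged
a_copy = a.copy()
a_copy -= b                        # {1, 2}

# A list cannot be a set element, but the equivalent tuple can
my_data = [1, 2, 3]
seen = {tuple(my_data)}            # {(1, 2, 3)}
```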

### Built-in Sequence Functions

There are four useful built-in sequence functions:
- `enumerate()` yields the index and the element at the same time
- `sorted()` returns a new sorted list from the elements of any iterable
- `zip()` pairs up the elements from any number of sequences into tuples. It returns a lazy iterator, so wrap it in `list()` to get a *list of tuples*. The length is determined by the *shortest* sequence.
- `reversed()` iterates over the elements of a sequence in reverse order (it does not sort them)

In [149]:
fruits = ['apple', 'orange', 'banana']

for i, fruit in enumerate(fruits, 1):
    print(f"{i}: {fruit}")

1: apple
2: orange
3: banana


In [150]:
# sorted()
sorted("Hello World!")

[' ', '!', 'H', 'W', 'd', 'e', 'l', 'l', 'l', 'o', 'o', 'r']

In [151]:
# zip
seq1 = ['apple', 'orange', 'banana']
seq2 = (1, 2, 3, 4)
seq3 = ["Toronto", "New York", "Sydney"]

list(zip(seq1, seq2, seq3))
# [('apple', 1, 'Toronto'), ('orange', 2, 'New York'), ('banana', 3, 'Sydney')]

for i, (a, b) in enumerate(zip(seq1, seq2)):
    print(f"{i+1}: fruit {a} with quantity {b}")

1: fruit apple with quantity 1
2: fruit orange with quantity 2
3: fruit banana with quantity 3


In [152]:
# reversed() returns a lazy iterator, so wrap it in list() to see the values
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
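
Two further patterns are worth a sketch: `sorted()` accepts the same `key` argument as `list.sort()` (plus `reverse`), and `zip(*pairs)` "unzips" a list of tuples. The data here is made up for illustration.

```python
words = ["banana", "apple", "Cherry"]

# Default string ordering puts uppercase letters before lowercase ones
by_default = sorted(words)                       # ['Cherry', 'apple', 'banana']
case_insensitive = sorted(words, key=str.lower)  # ['apple', 'banana', 'Cherry']
longest_first = sorted(words, key=len, reverse=True)
# ['banana', 'Cherry', 'apple']

# sorted() returns a new list; the original is left unchanged
unchanged = words == ["banana", "apple", "Cherry"]   # True

# "Unzip" a list of pairs back into separate tuples
pairs = [("apple", 1), ("orange", 2), ("banana", 3)]
names, counts = zip(*pairs)
# names  -> ('apple', 'orange', 'banana')
# counts -> (1, 2, 3)
```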

### List, Set, and Dictionary Comprehension

One of the most powerful features in Python is the *list comprehension*. It lets you concisely form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression. It is usually faster, and often easier to read and write, than the equivalent `for` loop.

```python
# Basic syntax:
[expr for value in collection if condition]

# Equivalent for loop:
result = []
for value in collection:
    if condition:
        result.append(expr)
```
A closely related idea is the *generator expression*, which looks very similar but uses `()` instead of `[]`. Unlike a list comprehension, a generator expression returns a *lazy iterator* that yields values one by one.

A set comprehension is similar: `set_comp = {expr for value in collection if condition}`. Here is the syntax for dict comprehension: `dict_comp = {key-expr: value-expr for value in collection if condition}`.

In [153]:
# Example 1: list comprehension
my_strings = ['a', 'NY', 'cat', 'Python', 'language']
new_list = [x.title() for x in my_strings if len(x) > 2]
# ['Cat', 'Python', 'Language']

# Example 2: set comprehension
new_set = {len(x) for x in my_strings}
# {1, 2, 3, 6, 8}

# Example 2 with map() function
set(map(len, my_strings))

# Example 3: dict comprehension
my_dict_mapping = {key: index for index, key in enumerate(my_strings)}
# {'a': 0, 'NY': 1, 'cat': 2, 'Python': 3, 'language': 4}

# Example 4: generator expression
my_generator = (x.title() for x in my_strings if len(x) > 2)
next(my_generator) # 'Cat'
tuple(my_generator) # remaining items in a tuple: ('Python', 'Language')

('Python', 'Language')

In [154]:
# Nested list comprehensions
all_names = [["John", "Amanda", "Jane"],
             ["Maria", "Natalia", "Juan"]]

# We want to find all names with at least two a's
# Method 1. Using double for loops
name_of_interest1 = []
for lst in all_names:
    for name in lst:
        if name.count("a") >= 2:
            name_of_interest1.append(name)
name_of_interest1 # ['Amanda', 'Maria', 'Natalia']

# Method 2. for loop + list comprehension
name_of_interest2 = []
for lst in all_names:
    temp_list = [name for name in lst if name.count("a") >= 2]
    name_of_interest2.extend(temp_list)
name_of_interest2

# Method 3. nested list comprehension
name_of_interest3 = [
    name
    for lst in all_names
    for name in lst
    if name.count("a") >= 2
]
name_of_interest3

['Amanda', 'Maria', 'Natalia']

## Functions

We have already studied both built-in and user-defined functions in our Real Python tutorial; refer to the Obsidian notes for more details.

### Namespace, Scope, and Local Functions

You need to know the LEGB (local, enclosing, global, and built-in) scoping rule: Python looks up a name in that order. You can also declare a variable `global` or `nonlocal` within your function (but it is probably not a good idea).
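
The lookup order can be sketched as follows. All function and variable names are hypothetical, chosen only to label which scope each value comes from.

```python
x = "global"

def outer():
    x = "enclosing"

    def inner():
        x = "local"          # found in the local scope first
        return x

    def inner_enclosing():
        return x             # not local, so the enclosing scope is used

    return inner(), inner_enclosing()

def read_global():
    return x                 # not local or enclosing, so the global x is used

# global: rebind a module-level name from inside a function
counter = 0
def bump():
    global counter
    counter += 1

bump()
bump()                       # counter is now 2

# nonlocal: rebind a name from the enclosing function
def make_counter():
    count = 0
    def increment():
        nonlocal count
        count += 1
        return count
    return increment

tick = make_counter()
first, second = tick(), tick()   # 1, 2
scopes = outer()                 # ('local', 'enclosing')
```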

### Functions are Objects

In [159]:
# local variable a disappears after the function
def func():
    a = []
    for i in range(5):
        a.append(i)

# define a outside of the function
a = []
def func():    
    for i in range(5):
        a.append(i)
print(a) # []
func()
print(a) # [0, 1, 2, 3, 4]
func()
print(a) # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

[]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]


In [160]:
# Returning multiple values
def func2():
    a = 5
    b = 6
    c = 7
    return a, b, c
my_values = func2() # (5, 6, 7)
x, y, z = func2()
print(x) # 5

# Returning a dict
def func3():
    a = 5
    b = 6
    c = 7
    return {'a': a, 'b': b, 'c': c}
my_dict = func3() # {'a': 5, 'b': 6, 'c': 7}

5


In [161]:
# An example of how to write data cleaning functions
states = ["  Alabama?", "Georgia!", "georgia", "FloriDA", "south carolina##"]
import re

# Method 1. apply a sequence of operations inside one function
def clean_strings1(strings):
    result = []
    for string in strings:
        string = string.strip()
        string = re.sub("[!#?]", "", string)
        string = string.title()
        result.append(string)
    return result

clean_strings1(states) # ['Alabama', 'Georgia', 'Georgia', 'Florida', 'South Carolina']

# Method 2. use a list of operations
def remove_punctuation(string):
    return re.sub("[!#?]", "", string)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings2(strings, ops):
    result = []
    for string in strings:
        for func in ops:
            string = func(string)
        result.append(string)
    return result

clean_strings2(states, clean_ops) # ['Alabama', 'Georgia', 'Georgia', 'Florida', 'South Carolina']

# Method 3. You can also use the map function
for state in map(remove_punctuation, states):
    print(state)

  Alabama
Georgia
georgia
FloriDA
south carolina


### Anonymous (Lambda) Functions

Python has support for in-line/anonymous lambda functions. Here is a simple example of how to use a lambda function to sort a list based on the number of unique letters in each string.

In [None]:
# Here is how set() function works
x = "hello"
set(x) # {'e', 'h', 'l', 'o'}

# An example of using lambda function
strings = ["foo", "card", "bar", "aaaa", "abab"]
strings.sort(key=lambda x: len(set(x)))
# ['aaaa', 'foo', 'abab', 'bar', 'card']

['aaaa', 'foo', 'abab', 'bar', 'card']

### Generators

A *generator* is a convenient way, similar to writing a normal function, to construct a new iterable object. Instead of returning everything at once with `return`, a generator uses the `yield` keyword to produce values one at a time, on request. Generators are typically more memory efficient than building the full list up front.

In [174]:
# A generator example
def squares(n=10):
    for i in range(1, n+1):
        yield i ** 2
gen = squares(5)
gen # <generator object squares at 0x0000016E7414FD80>

for x in gen:
    print(x)

1
4
9
16
25


In [179]:
# A generator expression looks similar to a list comprehension
# except that you use a pair of parentheses
gen = (x**2 for x in range(100))
gen # <generator object <genexpr> at 0x0000016E14E4A260>
sum(gen) # 328350

# another example
dict(("item"+str(i), i**2) for i in range(5))

{'item0': 0, 'item1': 1, 'item2': 4, 'item3': 9, 'item4': 16}
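
One consequence of lazy evaluation is that a generator can be consumed only once. The sketch below reuses the `squares` generator from above to show this.

```python
def squares(n=10):
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(3)
first = next(gen)      # 1: the body runs only up to the first yield
rest = list(gen)       # [4, 9]: the remaining values
again = list(gen)      # []: the generator is now exhausted
```

If you need the values more than once, either call `squares()` again or store the results in a list.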

### `itertools` module

A few useful functions from `itertools`:
- `chain(*iterables)` generates a sequence by chaining iterators together, basically flattening multiple iterables without nesting
- `combinations(iterable, k)` generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement
- `permutations(iterable, k)` generates a sequence of all possible k-tuples of elements in the iterable, respecting order
- `groupby(iterable, keyfunc)` generates a `(key, sub-iterator)` pair for each run of *consecutive* elements that share a key, so sort the input by the same key first if you want one group per unique key
- `product(*iterables, repeat=1)` generates the Cartesian product of the input iterables as tuples, similar to a nested `for` loop

In [180]:
# An example with groupby
import itertools

def first_letter(x):
    return x[0]

names = ["Adam", "Adrian", "Allen", "Betty", "William", "Will", "Christine"]

# groupby only groups adjacent items; names with the same first letter are already adjacent here
for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))

A ['Adam', 'Adrian', 'Allen']
B ['Betty']
W ['William', 'Will']
C ['Christine']


In [182]:
# Another example using combinations
letters = ['a', 'b', 'c', 'd']
results = list(itertools.combinations(letters, 2))
print(results)

[('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
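
The remaining functions from the list can be sketched the same way. The last part shows why `groupby` input usually needs sorting first; the inputs are illustrative.

```python
import itertools

# chain: flatten several iterables into one sequence
flat = list(itertools.chain([1, 2], (3, 4), "ab"))
# [1, 2, 3, 4, 'a', 'b']

# permutations: order matters, so ('a', 'b') and ('b', 'a') both appear
perms = list(itertools.permutations("abc", 2))
# [('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]

# product: like a nested for loop over both inputs
pairs = list(itertools.product([0, 1], "xy"))
# [(0, 'x'), (0, 'y'), (1, 'x'), (1, 'y')]

# groupby on unsorted input yields one group per consecutive run
names = ["Will", "Adam", "William", "Adrian"]
unsorted_groups = [(k, list(g))
                   for k, g in itertools.groupby(names, key=lambda s: s[0])]
# [('W', ['Will']), ('A', ['Adam']), ('W', ['William']), ('A', ['Adrian'])]

# Sorting first gives one group per unique key
sorted_groups = [(k, list(g))
                 for k, g in itertools.groupby(sorted(names), key=lambda s: s[0])]
# [('A', ['Adam', 'Adrian']), ('W', ['Will', 'William'])]
```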


### Error Handling

You can handle errors gracefully with `try` and `except` blocks instead of letting an exception stop the program.

In [None]:
# Example 1. A bare except catches every exception, not just ValueError

def attempt_float(x):
    try:
        return float(x)
    except:
        return x
    
attempt_float("1.234") # 1.234
attempt_float("hello") # 'hello'
attempt_float((1, 2)) # (1, 2)

# Example 2. If you only want to catch ValueError, not TypeError
def attempt_float2(x):
    try:
        return float(x)
    except ValueError:
        return x
    
# attempt_float2((1, 2)) # TypeError

# Example 3. You want to explicitly catch only ValueError and TypeError
def attempt_float3(x):
    try:
        return float(x)
    except (ValueError, TypeError):
        return x
    
attempt_float3((1, 2)) # (1, 2)

(1, 2)
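
A `try` statement can also carry `else` (runs only if no exception was raised) and `finally` (always runs). Here is a minimal sketch; `parse_ratio` is a hypothetical helper written for this note, not a library function.

```python
def parse_ratio(text):
    """Parse a string like '3/4' into a float, recording what happened."""
    events = []
    try:
        num, den = text.split("/")        # ValueError if there is no single '/'
        value = float(num) / float(den)   # ValueError or ZeroDivisionError
    except ValueError:
        events.append("bad input")
        value = None
    except ZeroDivisionError:
        events.append("division by zero")
        value = None
    else:
        events.append("ok")               # only when the try block succeeded
    finally:
        events.append("done")             # always runs
    return value, events

parse_ratio("3/4")   # (0.75, ['ok', 'done'])
parse_ratio("3/0")   # (None, ['division by zero', 'done'])
parse_ratio("abc")   # (None, ['bad input', 'done'])
```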

In [None]:
# Using with context manager to read files

from pathlib import Path
path = Path("fruits2.txt").resolve()
#p = Path(str(path).strip()).resolve()
with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]
lines

['apple', 'banana', 'cherry', 'orange', 'mango', 'pear']

In [211]:
path2 = Path("fruits2.txt")
path2

WindowsPath('fruits2.txt')
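
For completeness, here is a sketch of writing a file and reading it back with the same `with` pattern. It uses a temporary directory and a made-up file name (`fruits_demo.txt`) so that it runs anywhere without touching `fruits2.txt`.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "fruits_demo.txt"   # hypothetical file name

    # Mode "w" creates the file, or overwrites it if it already exists
    with open(path, "w", encoding="utf-8") as f:
        f.write("apple\nbanana\ncherry\n")

    # Iterating over a file object yields one line at a time
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip() for line in f]

    was_closed = f.closed    # True: the with block closed the file

    # pathlib can also read the whole file in one call
    text = path.read_text(encoding="utf-8")

lines   # ['apple', 'banana', 'cherry']
```

The context manager closes the file even if an exception is raised inside the block, which is why it is preferred over calling `f.close()` by hand.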