<h1>Data Structures and Sequences</h1>

Python’s data structures are simple but powerful. Mastering their use is a critical part of becoming a proficient Python programmer. We start with tuple, list, and dictionary, which are some of the most frequently used built-in types.

<h2>Tuple</h2>

A tuple is a fixed-length, immutable sequence of Python objects which, once assigned, cannot be changed. The easiest way to create one is with a comma-separated sequence of values wrapped in parentheses:

In [None]:
tup_data = (43, 54, 16)

tup_data

You can convert any sequence or iterator to a tuple by invoking `tuple`:

In [None]:
tuple([4, 0, 2])

tup = tuple('string')

tup

Elements can be accessed with square brackets [] as with most other sequence types. As in C, C++, Java, and many other languages, sequences are 0-indexed in Python:

In [None]:
tup[0]

When you're defining tuples within more complicated expressions, it’s often necessary to enclose the values in parentheses, as in this example of creating a tuple of tuples:

In [None]:
nested_tup = (4, 5, 6), (7, 8)

nested_tup

In [None]:
nested_tup[1]

While the objects stored in a tuple may be mutable themselves, once the tuple is created it’s not possible to modify which object is stored in each slot:

In [None]:
tup_data = tuple(['foo', [1, 2], True])

In [None]:
# Raises TypeError: tuples do not support item assignment
tup_data[2] = False

If an object inside a tuple is mutable, such as a list, you can modify it in place:

In [None]:
tup_data[1].append(3)

tup_data

<h3>Unpacking tuples</h3>

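If you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the right-hand side of the equals sign:

In [None]: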
tup_data = (43, 54, 126)

a, b, c = tup_data

b

Even sequences with nested tuples can be unpacked:

In [None]:
tup_data = 4, 5, (6, 7)

a, b, (c, d) = tup_data

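Unpacking also makes it easy to swap variable names, a task that in many languages requires a temporary variable. In Python, the swap can be done like this: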

In [None]:
a, b = 41, 23

b, a = a, b

A common use of variable unpacking is iterating over sequences of tuples or lists:

In [None]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

for a, b, c in seq:
    print(f'a={a}, b={b}, c={c}')

There are some situations where you may want to "pluck" a few elements from the beginning of a tuple. There is a special syntax that can do this, *rest, which is also used in function signatures to capture an arbitrarily long list of positional arguments:

In [None]:
values = 1, 2, 3, 4, 5

a, b, *rest = values

rest

This rest bit is sometimes something you want to discard; there is nothing special about the rest name. As a matter of convention, many Python programmers will use the underscore (_) for unwanted variables:

In [None]:
a, b, *_ = values
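
As a small sketch of the starred syntax in both roles mentioned above, the following unpacks the ends of a tuple and defines a function that collects extra positional arguments. The names `first`, `middle`, `last`, and `total` are illustrative and do not come from the original text.

```python
values = 1, 2, 3, 4, 5

# A starred name may sit anywhere in the target list; it always becomes a list
first, *middle, last = values

# In a function signature, *args collects extra positional arguments as a tuple
def total(start, *args):
    return start + sum(args)

result = total(10, *middle)  # same call as total(10, 2, 3, 4)
```

Note that the starred target is a list even when it captures nothing, so `a, b, *rest = (1, 2)` leaves `rest` as an empty list.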

<h2>List</h2>

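In contrast with tuples, lists are variable length and their contents can be modified in place. You can define them using square brackets `[]` or using the `list` type function:

In [None]: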
a_list = [2, 3, 7, None]

tup_data = ("foo", "bar", "baz")

b_list = list(tup_data)

b_list

In [None]:
b_list[1] = "peekaboo"

b_list

Lists and tuples are semantically similar (though tuples cannot be modified) and can be used interchangeably in many functions.

The `list` built-in function is frequently used in data processing as a way to materialize an iterator or generator expression:

In [None]:
gen = range(10)

gen

In [None]:
list(gen)

<h3>Adding and removing elements</h3>

Elements can be appended to the end of the list with the `append` method:

In [None]:
b_list.append("dwarf")

b_list

Using `insert` you can insert an element at a specific location in the list:

In [None]:
b_list.insert(1, "red")

b_list

The inverse operation to `insert` is `pop`, which removes and returns an element at a particular index:

In [None]:
b_list.pop(2)

Elements can be removed by value with `remove`, which locates the first such value and removes it from the list:

In [None]:
b_list.append("foo")


b_list.remove("foo")

Check if a list contains a value using the `in` or `not in` keywords:

In [None]:
"foo" in b_list

In [None]:
"dwarf" not in b_list

<h3>Concatenating and combining lists</h3>

Similar to tuples, adding two lists together with `+` concatenates them:

In [None]:
[4, None, "foo"] + [7, 8, (2, 3)]

If you have a list already defined, you can append multiple elements to it using the `extend` method:

In [None]:
dummy_list = [4, None, "foo"]

dummy_list.extend([7, 8, (2, 3)])

dummy_list

Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using `extend` to append elements to an existing list, especially if you are building up a large list, is usually preferable. Thus:

In [None]:
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)
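
The loop above relies on a `list_of_lists` variable that the text does not define. The sketch below fills it with hypothetical sample data and places the `extend` approach next to repeated concatenation, so the two can be compared directly.

```python
# Hypothetical sample data standing in for list_of_lists
list_of_lists = [[1, 2], [3], [4, 5, 6]]

# Preferred: grow a single list in place
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)

# Slower alternative: every + builds a brand-new list and copies the old one
everything_concat = []
for chunk in list_of_lists:
    everything_concat = everything_concat + chunk
```

Both loops produce the same list; the difference is only in how much copying happens along the way.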

<h3>Sorting</h3>

You can sort a list in place (without creating a new object) by calling its `sort` method:

In [None]:
a = [47, 22, 35, 11, 53]

a.sort()

a

`sort` has a few options that will occasionally come in handy. One is the ability to pass a secondary sort key—that is, a function that produces a value to use to sort the objects. For example, we could sort a collection of strings by their lengths:

In [None]:
b = ["saw", "small", "He", "foxes", "six"]

b.sort(key=len)

b
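
Beyond `key`, `sort` also accepts a `reverse` flag, and a key function may return a tuple to break ties. The sketch below is only an illustration; the `words` and `by_len_then_alpha` names are made up for this example.

```python
words = ["saw", "small", "He", "foxes", "six"]

# Longest first; elements of equal length keep their original relative order
words.sort(key=len, reverse=True)

# A tuple key sorts by length first, then alphabetically ignoring case
by_len_then_alpha = sorted(words, key=lambda w: (len(w), w.lower()))
```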

<h3>Slicing</h3>

You can select sections of most sequence types by using slice notation, which in its basic form consists of `start:stop` passed to the indexing operator []:

In [None]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]

seq[1:5]

Either the `start` or `stop` can be omitted, in which case they default to the start of the sequence and the end of the sequence, respectively:

In [None]:
seq[:5]

In [None]:
seq[2:]

Negative indices slice the sequence relative to the end:

In [None]:
seq[-4:]

In [None]:
seq[-6:-2]

A `step` can also be used after a second colon to, say, take every other element:

In [None]:
seq[::2]

A clever use of this is to pass `-1`, which has the useful effect of reversing a list or tuple:

In [None]:
seq[::-1]
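
To gather the slicing rules in one place, here is a short sketch using the same `seq` list. The result names (`middle`, `last_four`, `backward`) are illustrative only.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

# start is included and stop is excluded, so the result has stop - start elements
middle = seq[1:5]

# Negative indices count from the end of the sequence
last_four = seq[-4:]

# A step of -1 walks the sequence backward, producing a reversed copy
backward = seq[::-1]

# Slicing returns a new list, so seq itself is left unchanged
```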

<h2>Dictionary</h2>

The dictionary or `dict` may be the most important built-in Python data structure. In other programming languages, dictionaries are sometimes called hash maps or associative arrays. A dictionary stores a collection of key-value pairs, where key and value are Python objects. Each key is associated with a value so that a value can be conveniently retrieved, inserted, modified, or deleted given a particular key. One approach for creating a dictionary is to use curly braces `{}` and colons to separate keys and values:

In [None]:
empty_dict = {}

d1 = {"a": "some value", "b": [1, 2, 3, 4]}

d1

You can access, insert, or set elements using the same syntax as for accessing elements of a list or tuple:

In [None]:
d1[7] = "an integer"

d1

You can check if a dictionary contains a key using the same syntax used for checking whether a list or tuple contains a value:

In [None]:
"b" in d1

You can delete values using either the `del` keyword or the `pop` method (which simultaneously returns the value and deletes the key):

In [None]:
d1[5] = "some value"

d1

In [None]:
d1["dummy"] = "another value"

d1

In [None]:
del d1[5]

d1

In [None]:
ret = d1.pop("dummy")

ret

The `keys` and `values` methods give you iterators of the dictionary's keys and values, respectively. The order of the keys depends on the order of their insertion, and these methods output the keys and values in the same respective order:

In [None]:
list(d1.keys())

In [None]:
list(d1.values())

If you need to iterate over both the keys and values, you can use the `items` method to iterate over the keys and values as 2-tuples:

In [None]:
list(d1.items())

You can merge one dictionary into another using the `update` method:

In [None]:
d1.update({"b": "foo", "c": 12})

d1

<h3>Creating dictionaries from sequences</h3>

You will occasionally end up with two sequences that you want to pair up element-wise in a dictionary. As a first cut, you might write code like this:

In [None]:
mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value

Since a dictionary is essentially a collection of 2-tuples, the `dict` function accepts a list of 2-tuples:

In [None]:
tuples = zip(range(5), reversed(range(5)))

mapping = dict(tuples)

mapping
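
The loop earlier in this section uses `key_list` and `value_list` without defining them. As a minimal sketch, assuming two hypothetical sample sequences, the explicit loop and the `dict(zip(...))` shortcut can be checked against each other:

```python
# Hypothetical sample sequences; the text leaves key_list and value_list undefined
key_list = ["a", "b", "c"]
value_list = [1, 2, 3]

# Explicit loop over the paired elements
mapping_loop = {}
for key, value in zip(key_list, value_list):
    mapping_loop[key] = value

# Same result by handing the pairs straight to dict
mapping_direct = dict(zip(key_list, value_list))
```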

<h3>Default values</h3>

It’s common to have logic like:

In [None]:
if key in some_dict:
    value = some_dict[key]
else:
    value = default_value

Thus, the dictionary methods `get` and `pop` can take a default value to be returned, so that the above `if-else` block can be written simply as:

In [None]:
value = some_dict.get(key, default_value)
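
The snippets above use placeholder names (`some_dict`, `key`, `default_value`). A concrete sketch with a one-entry dictionary, chosen purely for illustration, shows how `get` and `pop` behave with and without a default:

```python
some_dict = {"a": 1}

present = some_dict.get("a", 0)     # key exists, so the default is ignored
missing = some_dict.get("z", 0)     # key absent, so the default is returned
no_default = some_dict.get("z")     # None when no default is supplied

popped = some_dict.pop("z", "n/a")  # default returned, nothing is removed

try:
    some_dict.pop("z")              # no default given: raises KeyError
    pop_raised = False
except KeyError:
    pop_raised = True
```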

`get` by default will return `None` if the key is not present, while `pop` will raise an exception. When setting values, the values in a dictionary may themselves be another kind of collection, like a list. For example, you could imagine categorizing a list of words by their first letters as a dictionary of lists:

In [None]:
words = ["apple", "bat", "bar", "atom", "book"]

by_letter = {}

for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)

by_letter

The `setdefault` dictionary method can be used to simplify this workflow. The preceding for loop can be rewritten as:

In [None]:
by_letter = {}

for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)

by_letter
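
A different technique for the same grouping task, not used in the text above, is `defaultdict` from the standard library's `collections` module. It is sketched here only as an alternative to `setdefault`:

```python
from collections import defaultdict

words = ["apple", "bat", "bar", "atom", "book"]

# Missing keys are initialized by calling list(), so no membership check is needed
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
```

One trade-off to keep in mind: merely reading a missing key from a `defaultdict` inserts it, which `setdefault` on a plain dictionary only does when you call it explicitly.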

<h2>Set</h2>

A set is an unordered collection of unique elements. A set can be created in two ways: via the `set` function or via a set literal with curly braces:

In [None]:
set([2, 2, 2, 1, 3, 3])
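
The cell above shows only the `set` function. For completeness, a sketch of the literal form mentioned in the text, with variable names chosen for illustration:

```python
from_function = set([2, 2, 2, 1, 3, 3])
from_literal = {2, 2, 2, 1, 3, 3}

# Empty braces create a dictionary, so an empty set must be written set()
empty_set = set()
empty_braces = {}
```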

Sets support mathematical set operations like union, intersection, difference, and symmetric difference. Consider these two example sets:

In [None]:
a = {1, 2, 3, 4, 5}

b = {3, 4, 5, 6, 7, 8}

a.union(b)

# a | b

# a.intersection(b)

# a & b
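
The cell above covers union and intersection. The remaining two operations named in the text, difference and symmetric difference, can be sketched with the same example sets (the result names are illustrative):

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

union = a | b          # elements in either set
intersection = a & b   # elements in both sets
difference = a - b     # elements in a that are not in b
symmetric = a ^ b      # elements in exactly one of the two sets
```

Each operator has a method spelling as well: `a.difference(b)` and `a.symmetric_difference(b)`.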

All of the logical set operations have in-place counterparts, which enable you to replace the contents of the set on the left side of the operation with the result. For very large sets, this may be more efficient:

In [None]:
c = a.copy()

c |= b

c

In [None]:
d = a.copy()

d &= b

d

You can also check if a set is a subset of (is contained in) or a superset of (contains all elements of) another set:

In [None]:
a_set = {1, 2, 3, 4, 5}

{1, 2, 3}.issubset(a_set)

# a_set.issuperset({1, 2, 3})

<h2>Built-In Sequence Functions</h2>

<h3>enumerate</h3>

It’s common when iterating over a sequence to want to keep track of the index of the current item. A do-it-yourself approach would look like:

In [None]:
index = 0
for value in collection:
   # do something with value
   index += 1

Since this is so common, Python has a built-in function, `enumerate`, which returns a sequence of `(i, value)` tuples:

In [None]:
for index, value in enumerate(collection):
   # do something with value
   continue
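
Both loops above leave `collection` undefined. A runnable sketch, assuming a small hypothetical list, uses `enumerate` to record where each value sits:

```python
collection = ["foo", "bar", "baz"]  # hypothetical sample data

# Map each value to its position in the list
positions = {}
for index, value in enumerate(collection):
    positions[value] = index

# enumerate accepts an optional start value for the counter
numbered = list(enumerate(collection, start=1))
```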

<h3>sorted</h3>

The `sorted` function returns a new sorted list from the elements of any sequence:

In [None]:
sorted([7, 1, 2, 6, 0, 3, 2])

sorted("string")

<h3>zip</h3>

`zip` “pairs” up the elements of a number of lists, tuples, or other sequences to create an iterator of tuples, which `list` can materialize:

In [None]:
seq1 = ["foo", "bar", "baz"]

seq2 = ["one", "two", "three"]

zipped = zip(seq1, seq2)

list(zipped)

`zip` can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence:

In [None]:
seq3 = [False, True]

list(zip(seq1, seq2, seq3))

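A common use of `zip` is simultaneously iterating over multiple sequences, possibly also combined with `enumerate`:

In [None]: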
for index, (a, b) in enumerate(zip(seq1, seq2)):
    print(f"{index}: {a}, {b}")
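
`zip` can also be run in the other direction. As a sketch, assuming a hypothetical list of pairs, unpacking the list with `*` turns rows back into columns:

```python
pairs = [("foo", "one"), ("bar", "two"), ("baz", "three")]

# zip(*pairs) is the same as zip(pairs[0], pairs[1], pairs[2])
firsts, seconds = zip(*pairs)
```

The results are tuples rather than lists, since `zip` always yields tuples.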

<h3>reversed</h3>

`reversed` iterates over the elements of a sequence in reverse order:

In [None]:
list(reversed(range(10)))

<h2>List, Set, and Dictionary Comprehensions</h2>

List `comprehensions` are a convenient and widely used Python language feature. They allow you to concisely form a new list by filtering the elements of a collection, transforming the elements passing the filter into one concise expression. They take the basic form:

[expr for value in collection if condition]

This is equivalent to the following `for` loop:

In [None]:
result = []
for value in collection:
    if condition:
        result.append(expr)

The filter condition can be omitted, leaving only the expression. For example, given a list of strings, we could filter out strings with length 2 or less and convert them to uppercase like this:

In [None]:
strings = ["a", "as", "bat", "car", "dove", "python"]

[x.upper() for x in strings if len(x) > 2]

Set and dictionary comprehensions are a natural extension, producing sets and dictionaries in an idiomatically similar way instead of lists.

A `set` comprehension looks like this:

In [None]:
set_comp = {expr for value in collection if condition}

A `dictionary` comprehension also uses curly braces, but pairs a key expression with a value expression, separated by a colon:

In [None]:
dict_comp = {key_expr: value_expr for value in collection if condition}

As a simple dictionary comprehension example, we could create a lookup map of these strings for their locations in the list:

In [None]:
loc_mapping = {value: index for index, value in enumerate(strings)}

loc_mapping
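
The set and dictionary templates above are placeholders rather than runnable code. A concrete sketch on the same `strings` list, with result names chosen for illustration:

```python
strings = ["a", "as", "bat", "car", "dove", "python"]

# Set comprehension: duplicates collapse, so "bat" and "car" contribute one 3
unique_lengths = {len(x) for x in strings}

# Dictionary comprehension with a filter: map longer strings to their lengths
length_by_string = {x: len(x) for x in strings if len(x) > 2}
```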

<h3>Nested list comprehensions</h3>

Suppose we have a list of lists containing some English and Spanish names:

In [None]:
all_data = [["John", "Emily", "Michael", "Mary", "Steven"],
    ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

Suppose we wanted to get a single list containing all names with two or more a’s in them. We could certainly do this with a simple for loop:

In [None]:
names_of_interest = []

for names in all_data:
    enough_as = [name for name in names if name.count("a") >= 2]
    names_of_interest.extend(enough_as)

names_of_interest

You can wrap this whole operation up in a single nested list comprehension, which looks like this:

In [None]:
# result_element for inner_list in original_list for result_element in innerlist
result = [name for names in all_data for name in names
    if name.count("a") >= 2]

result

At first, nested list comprehensions are a bit hard to wrap your head around. The `for` parts of the list comprehension are arranged according to the order of nesting, and any filter condition is put at the end as before. Here is another example where we “flatten” a list of tuples of integers into a simple list of integers:

In [None]:
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

flattened = [x for tup in some_tuples for x in tup]

flattened

Keep in mind that the order of the `for` expressions would be the same if you wrote a nested `for` loop instead of a list comprehension:

In [None]:
flattened = []

for tup in some_tuples:
    for x in tup:
        flattened.append(x)

<h2>Functions</h2>

Functions are the primary and most important method of code organization and reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function. Functions can also help make your code more readable by giving a name to a group of Python statements.

Functions are declared with the `def` keyword. A function contains a block of code with an optional use of the `return` keyword:

In [None]:
def my_function(x, y):
    return x + y

When a line with `return` is reached, the value or expression after `return` is sent to the context where the function was called, for example:

In [None]:
result = my_function(1, 2)

result

There is no issue with having multiple return statements. If Python reaches the end of a function without encountering a `return` statement, `None` is returned automatically. For example:

In [None]:
def function_without_return(x):
    print(x)

result = function_without_return("hello!")

print(result)  # prints None

Each function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments. Here we will define a function with an optional `z` argument with the default value `1.5`:

In [None]:
def my_function2(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

While keyword arguments are optional, all positional arguments must be specified when calling a function.

You can pass values to the `z` argument with or without the keyword provided, though using the keyword is encouraged:

In [None]:
# my_function2(5, 6, z=0.7)

my_function2(10, 20)
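
The calling styles described above can be compared side by side. This sketch repeats the `my_function2` definition so it runs on its own; the result names are illustrative.

```python
def my_function2(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

default_z = my_function2(10, 20)              # z falls back to 1.5
keyword_z = my_function2(5, 6, z=0.7)         # z passed by keyword
positional_z = my_function2(5, 6, 0.7)        # z passed by position
all_keywords = my_function2(y=6, x=5, z=0.7)  # keywords may appear in any order
```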

<h3>Namespaces, Scope, and Local Functions</h3>

Functions can access variables created inside the function as well as those outside the function in higher (or even global) scopes. An alternative and more descriptive name describing a variable scope in Python is a namespace. Any variables that are assigned within a function by default are assigned to the local namespace. The local namespace is created when the function is called and is immediately populated by the function’s arguments. After the function is finished, the local namespace is destroyed (with some exceptions). Consider the following function:

In [None]:
def func():
    a = []
    for i in range(5):
        a.append(i)

When `func()` is called, the empty list `a` is created, five elements are appended, and then `a` is destroyed when the function exits. Suppose instead we had declared `a` as follows:

In [None]:
a = []

def func():
    for i in range(5):
        a.append(i)

func()

a

Assigning variables outside of the function's scope is possible, but those variables must be declared explicitly using either the `global` or `nonlocal` keywords:

In [None]:
a = None

def bind_a_variable():
    global a
    a = []
    
bind_a_variable()

print(a)
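
The cell above demonstrates `global`. For `nonlocal`, a minimal sketch with a hypothetical `make_counter` function shows an inner function rebinding a variable that belongs to its enclosing function:

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count  # rebind count in the enclosing function's namespace
        count += 1
        return count

    return increment

counter = make_counter()
first = counter()
second = counter()
```

Here `count` survives after `make_counter` returns because `increment` still refers to it, which is one case where a local namespace is not discarded right away.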

<style>
    div {
    margin-bottom: 15px;
    padding: 4px 12px;
    width: 1130px;
    }

    .danger {
    background-color: #ffdddd;
    border-left: 6px solid #f44336;
    }
</style>

<div class="danger">
  <p style="color:black;"><strong>Note!</strong> I generally discourage use of the global keyword. Typically, global variables are used to store some kind of state in a system. If you find yourself using a lot of them, it may indicate a need for object-oriented programming (using classes).</p>
</div>


<h3>Returning Multiple Values</h3>

When I first programmed in Python after having programmed in Java and C++, one of my favorite features was the ability to return multiple values from a function with simple syntax. Here’s an example:

In [None]:
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f()

In data analysis and other scientific applications, you may find yourself doing this often. What’s happening here is that the function is actually just returning one object, a tuple, which is then being unpacked into the result variables. 

<h3>Functions Are Objects</h3>

Since Python functions are objects, many constructs can be easily expressed that are difficult to do in other languages. Suppose we were doing some data cleaning and needed to apply a bunch of transformations.

Anyone who has ever worked with user-submitted survey data has seen messy results like the `states` list defined in the next cell. Lots of things need to happen to make this list of strings uniform and ready for analysis: stripping whitespace, removing punctuation symbols, and standardizing capitalization. One way to do this is to use built-in string methods along with the `re` standard library module for regular expressions:

In [None]:
import re

states = ["   Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda", "south   carolina##", "West virginia?"]

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub("[!#?]", "", value)
        value = value.title()
        result.append(value)
    return result

clean_strings(states)

An alternative approach that you may find useful is to make a list of the operations you want to apply to a particular set of strings:

In [None]:
def remove_punctuation(value):
    return re.sub("[!#?]", "", value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for func in ops:
            value = func(value)
        result.append(value)
    return result

clean_strings(states, clean_ops)

You can use functions as arguments to other functions like the built-in `map` function, which applies a function to a sequence of some kind:

In [None]:
for x in map(remove_punctuation, states):
    print(x)

<h3>Anonymous (Lambda) Functions</h3>

Python has support for so-called anonymous or lambda functions, which are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the `lambda` keyword, which has no meaning other than “we are declaring an anonymous function”:

In [None]:
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2

I usually refer to these as lambda functions in the rest of the book. They are especially convenient in data analysis because, as you’ll see, there are many cases where data transformation functions will take functions as arguments. It’s often less typing (and clearer) to pass a lambda function as opposed to writing a full-out function declaration or even assigning the lambda function to a local variable. Consider this example:

In [None]:
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]

apply_to_list(ints, lambda x: x * 2)

As another example, suppose you wanted to sort a collection of strings by the number of distinct letters in each string. Here we could pass a lambda function to the list’s `sort` method:

In [None]:
strings = ["foo", "card", "bar", "aaaa", "abab"]

strings.sort(key=lambda x: len(set(x)))

strings
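
To make the key function's effect visible, the sketch below uses `sorted` instead of `sort` so the input list is left alone, and repeats the key as a named function for comparison. `by_distinct` and `distinct_letters` are names invented for this example.

```python
strings = ["foo", "card", "bar", "aaaa", "abab"]

# Key: number of distinct letters in each string
by_distinct = sorted(strings, key=lambda x: len(set(x)))

# The same key written as a named function
def distinct_letters(x):
    return len(set(x))

by_distinct_named = sorted(strings, key=distinct_letters)
```

`"foo"` and `"abab"` both have two distinct letters, and they keep their original relative order because Python's sort is stable.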

<h3>Errors and Exception Handling</h3>

Handling Python errors or exceptions gracefully is an important part of building robust programs. In data analysis applications, many functions work only on certain kinds of input. As an example, Python’s `float` function is capable of casting a string to a floating-point number, but it fails with ValueError on improper inputs:

In [None]:
# float("1.2345")

float("string")

Suppose we wanted a version of `float` that fails gracefully, returning the input argument. We can do this by writing a function that encloses the call to float in a `try/except` block (execute this code in IPython):

In [None]:
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

# attempt_float("1.2345")

attempt_float("string")

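You might want to suppress only `ValueError`, since a `TypeError` (the input was not a string or numeric value) might indicate a legitimate bug in your program:

In [None]: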
# def attempt_float(x):
#     try:
#         return float(x)
#     except ValueError:
#         return x
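
If you do want to handle more than one exception type, `except` accepts a tuple of types. The following sketch is one possible variant of `attempt_float`, not the version from the text, that returns the input for both `ValueError` and `TypeError`:

```python
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        # Unlike a bare except, this does not swallow unrelated errors
        return x

parsed = attempt_float("1.2345")
unparsed = attempt_float("string")    # float raises ValueError here
not_a_number = attempt_float((1, 2))  # float raises TypeError here
```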

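In some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether the code in the `try` block succeeds. To do this, use `finally`. Similarly, you can have code that executes only if the `try` block succeeds using `else`. In the following example, the file object `f` will always get closed: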

In [None]:
f = open(path, mode="w")

try:
    write_to_file(f)
except:
    print("Failed")
else:
    print("Succeeded")
finally:
    f.close()
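
The cell above depends on a `path` variable and a `write_to_file` helper that are not defined at this point. The sketch below supplies hypothetical stand-ins (a temporary directory and a one-line writer) and records which clauses run. It also narrows the bare `except` to `OSError`, which is an assumption about the kind of failure worth catching.

```python
import os
import tempfile

def write_to_file(f):
    # Hypothetical stand-in for the helper the text leaves undefined
    f.write("hello\n")

# Hypothetical path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "example.txt")
events = []

f = open(path, mode="w", encoding="utf-8")
try:
    write_to_file(f)
except OSError:
    events.append("failed")
else:
    events.append("succeeded")  # runs only if the try block raised nothing
finally:
    f.close()                   # runs whether or not an exception occurred
    events.append("closed")
```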

<h2>Files and the Operating System</h2>

It’s important to understand the basics of how to work with files in Python. Fortunately, it’s relatively straightforward, which is one reason Python is so popular for text and file munging.

To open a file for reading or writing, use the built-in `open` function with either a relative or absolute file path and an optional file encoding:

In [None]:
path = "examples/segismundo.txt"

f = open(path, encoding="utf-8")

Here, I pass `encoding="utf-8"` as a best practice because the default Unicode encoding for reading files varies from platform to platform.

By default, the file is opened in read-only mode `"r"`. We can then treat the file object `f` like a list and iterate over the lines like so:

In [None]:
for line in f:
    print(line)

When you use `open` to create file objects, it is recommended to close the file when you are finished with it. Closing the file releases its resources back to the operating system:

In [None]:
f.close()

One of the ways to make it easier to clean up open files is to use the `with` statement:

In [None]:
with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]
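
The `with` block closes the file automatically when the block exits. Because `examples/segismundo.txt` may not exist on your machine, the sketch below writes a small hypothetical file into a temporary directory and reads it back using the same pattern:

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, mode="w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]  # rstrip drops the trailing newline

file_is_closed = f.closed  # True once the with block has exited
```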