# Chapter 3 - Built-in Data Structures, Functions and Files

This chapter discusses capabilities built into the Python language that will be used
ubiquitously throughout the book. While add-on libraries like pandas and NumPy
add advanced computational functionality for larger datasets, they are designed to be
used together with Python’s built-in data manipulation tools.

We’ll start with Python’s workhorse data structures: tuples, lists, dicts, and sets. Then,
we’ll discuss creating your own reusable Python functions. Finally, we’ll look at the
mechanics of Python file objects and interacting with your local hard drive.

## 3.1 Data Structures and Sequences

Python’s data structures are simple but powerful. Mastering their use is a critical part
of becoming a proficient Python programmer.

### Tuple

*__A tuple is a fixed-length, immutable sequence of Python objects__*. The easiest way to
create one is with a comma-separated sequence of values:

In [3]:
tup = 4,5,6
print(tup)
type(tup)

(4, 5, 6)


tuple

When you’re defining tuples in more complicated expressions, it’s often necessary to
enclose the values in parentheses, as in this example of creating a tuple of tuples:

In [4]:
nested_tup = (1,2,3), (4,5)
print(nested_tup)

((1, 2, 3), (4, 5))


*__You can convert any sequence or iterator to a tuple by invoking tuple__*:

In [5]:
tuple([1,2,3])

(1, 2, 3)

In [7]:
tup = tuple('string')
print(tup)

('s', 't', 'r', 'i', 'n', 'g')


*__Elements can be accessed with square brackets [ ] as with most other sequence types__*.
As in C, C++, Java, and many other languages, sequences are 0-indexed in Python:

In [8]:
tup[0]

's'

*__While the objects stored in a tuple may be mutable themselves, once the tuple is created it’s not possible to modify which object is stored in each slot__*:

In [9]:
tup = tuple(['foo', [1, 2], True])
print(tup)

('foo', [1, 2], True)


In [10]:
tup[2]

True

In [11]:
tup[2] = False

TypeError: 'tuple' object does not support item assignment

*__If an object inside a tuple is mutable, such as a list, you can modify it in-place__*:

In [12]:
tup[1]

[1, 2]

In [13]:
tup[1].append(3)
print(tup)

('foo', [1, 2, 3], True)


*__You can concatenate tuples using the + operator to produce longer tuples__*:

In [14]:
(4, None, 'foo') + (6, 0) + ('bar',)

(4, None, 'foo', 6, 0, 'bar')

Multiplying a tuple by an integer, as with lists, has the effect of concatenating together
that many copies of the tuple:

In [15]:
('foo', 'bar')*4

('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')

Note that the objects themselves are not copied, only the references to them.
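
A minimal sketch of what "only the references are copied" means in practice. The names inner and repeated are illustrative and not part of the original example:

```python
# Repeating a tuple copies references, not the objects they point to.
inner = [1, 2]
repeated = (inner,) * 3

# All three slots refer to the same list object.
all_same = repeated[0] is repeated[1] is repeated[2]
print(all_same)  # True

# Mutating the list through one slot is visible through every slot.
repeated[0].append(3)
print(repeated)  # ([1, 2, 3], [1, 2, 3], [1, 2, 3])
```

This only matters when the repeated objects are mutable; with strings or numbers the sharing is not observable.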

#### Unpacking Tuples

If you try to assign to a tuple-like expression of variables, Python will attempt to
unpack the value on the right-hand side of the equals sign:

In [16]:
tup = (1,2,3)
a,b,c = tup
print(b)

2


Even sequences with nested tuples can be unpacked:

In [18]:
tup = (1,2,(3,4))
a,b,(c,d) = tup
print(d)

4


Using this functionality you can easily swap the values of two variables, a task which in many
languages might look like:

    tmp = a
    a = b
    b = tmp
    
But, in Python, the swap can be done like this:

In [20]:
a,b = 1,2
print('a =',a,', b =',b)

a = 1 , b = 2


In [21]:
b,a = a,b
print('a =',a,', b =',b)

a = 2 , b = 1


A common use of variable unpacking is iterating over sequences of tuples or lists:

In [34]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

for a,b,c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9


Another common use is returning multiple values from a function. I’ll cover this in
more detail later.

*__The Python language recently acquired some more advanced tuple unpacking to help
with situations where you may want to “pluck” a few elements from the beginning of
a tuple__*. This uses the special syntax \*rest, which is also used in function signatures
to capture an arbitrarily long list of positional arguments:

In [35]:
val = 1,2,3,4,5
a,b, *rest = val

In [37]:
a,b

(1, 2)

In [39]:
rest

[3, 4, 5]

This rest bit is sometimes something you want to discard; there is nothing special
about the rest name. *__As a matter of convention, many Python programmers will use
the underscore (_) for unwanted variables__*:

In [40]:
a,b, *_ = val

In [41]:
_

[3, 4, 5]
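
The starred name does not have to come last. As a small sketch (the variable names here are illustrative), it can sit in the middle of the target list, and it always receives a list:

```python
values = 1, 2, 3, 4, 5

# The starred name may appear anywhere, but only once per assignment.
first, *middle, last = values
print(first, middle, last)  # 1 [2, 3, 4] 5

# The starred name always receives a list, which may be empty.
head, *tail = (10,)
print(head, tail)  # 10 []
```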

#### Tuple Methods

Since the size and contents of a tuple cannot be modified, it is very light on instance
methods. A particularly useful one (also available on lists) is count, which counts the
number of occurrences of a value:

In [42]:
a = (1, 2, 2, 2, 3, 4, 2)
a.count(2)

4

### List

In contrast with tuples, lists are variable-length and their contents can be modified
in-place. You can define them using square brackets [ ] or using the list type function:

In [44]:
a_list = [1,2,3,None]
a_list

[1, 2, 3, None]

In [55]:
tup = ('foo', 'bar', 'baz')
b_list = list(tup)
b_list

['foo', 'bar', 'baz']

In [56]:
b_list[1]

'bar'

In [57]:
b_list[1] = 'peekaboo'
b_list

['foo', 'peekaboo', 'baz']

*__Lists and tuples are semantically similar (though tuples cannot be modified) and can
be used interchangeably in many functions__*.

The list function is frequently used in data processing as a way to materialize an
iterator or generator expression:

In [48]:
gen = range(10)
gen

range(0, 10)

In [49]:
list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#### Adding and Removing Elements

Elements can be appended to the end of the list with the append method:

In [58]:
b_list

['foo', 'peekaboo', 'baz']

In [59]:
b_list.append('dwarf')
b_list

['foo', 'peekaboo', 'baz', 'dwarf']

Using insert you can insert an element at a specific location in the list:

In [60]:
b_list[1]

'peekaboo'

In [61]:
b_list.insert(1,'red')
b_list

['foo', 'red', 'peekaboo', 'baz', 'dwarf']

The insertion index must be between 0 and the length of the list, inclusive.

The inverse operation to insert is pop, which removes and returns an element at a
particular index:

In [62]:
b_list

['foo', 'red', 'peekaboo', 'baz', 'dwarf']

In [63]:
b_list[2]

'peekaboo'

In [64]:
b_list.pop(2)

'peekaboo'

In [65]:
b_list

['foo', 'red', 'baz', 'dwarf']

Elements can be removed by value with remove, which locates the first such value and
removes it from the list:

In [66]:
b_list

['foo', 'red', 'baz', 'dwarf']

In [67]:
b_list.append('foo')
b_list

['foo', 'red', 'baz', 'dwarf', 'foo']

In [68]:
b_list.remove('foo')
b_list

['red', 'baz', 'dwarf', 'foo']

In [69]:
b_list.remove('foo')
b_list

['red', 'baz', 'dwarf']

In [70]:
b_list.remove('baz')
b_list

['red', 'dwarf']

If performance is not a concern, by using append and remove, you can use a Python
list as a perfectly suitable “multiset” data structure.

Check if a list contains a value using the in keyword:

In [71]:
'dwarf' in b_list

True

The keyword not can be used to negate in:

In [72]:
'dwarf' not in b_list

False

Checking whether a list contains a value is a lot slower than doing so with dicts and
sets (to be introduced shortly), as Python makes a linear scan across the values of the
list, whereas it can check the others (based on hash tables) in constant time.
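
One rough way to see the difference is to time both membership checks with the standard timeit module. This is only a sketch: the collection size and repetition count are arbitrary choices, and the absolute timings depend on the machine.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the last element

# Each call runs the membership test 100 times and returns total seconds.
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

On typical hardware the list lookup should be slower by several orders of magnitude, since each check scans the whole list.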

#### Concatenating and Combining Lists

Similar to tuples, adding two lists together with + concatenates them:

In [73]:
[4, None, 'foo'] + [7, 8, (2, 3)]

[4, None, 'foo', 7, 8, (2, 3)]

If you have a list already defined, you can append multiple elements to it using the
extend method:

In [75]:
x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
x

[4, None, 'foo', 7, 8, (2, 3)]

Note that list concatenation by addition is a comparatively expensive operation since
a new list must be created and the objects copied over. *__Using extend to append elements
to an existing list, especially if you are building up a large list, is usually
preferable__*. Thus,

    everything = []
    for chunk in list_of_lists:
        everything.extend(chunk)
    
is faster than the concatenative alternative:

    everything = []
    for chunk in list_of_lists:
        everything = everything + chunk
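
The following sketch makes the difference concrete with a small, made-up list_of_lists. Both loops build the same contents, but extend keeps working on one list object while + creates a new list on every pass:

```python
list_of_lists = [[1, 2], [3], [4, 5, 6]]

# extend mutates the same list object on every iteration.
extended = []
extended_alias = extended
for chunk in list_of_lists:
    extended.extend(chunk)

# + builds a brand-new list each time and rebinds the name to it,
# so the alias still points at the original empty list.
concatenated = []
concatenated_alias = concatenated
for chunk in list_of_lists:
    concatenated = concatenated + chunk

print(extended == concatenated)            # True
print(extended is extended_alias)          # True
print(concatenated is concatenated_alias)  # False
```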

#### Sorting

You can sort a list in-place (without creating a new object) by calling its sort
method:

In [76]:
a = [7, 2, 5, 1, 3]
a.sort()
a

[1, 2, 3, 5, 7]

sort has a few options that will occasionally come in handy. One is the ability to pass
a secondary sort key—that is, a function that produces a value to use to sort the
objects. For example, we could sort a collection of strings by their lengths:

In [77]:
b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b

['He', 'saw', 'six', 'small', 'foxes']

Soon, we’ll look at the sorted function, which can produce a sorted copy of a general
sequence.

#### Binary Search and Maintaining a Sorted List

The built-in bisect module implements binary search and insertion into a sorted list.
bisect.bisect finds the location where an element should be inserted to keep it sorted,
while bisect.insort actually inserts the element into that location:

In [78]:
import bisect

In [83]:
c = [1, 2, 2, 2, 3, 4, 7]

In [84]:
bisect.bisect(c,5)

6

In [85]:
bisect.insort(c,5)
c

[1, 2, 2, 2, 3, 4, 5, 7]

In [86]:
c[6]

5

Note that the bisect module functions do not check whether the list is sorted, as doing so
would be computationally expensive. Thus, using them with an unsorted list will succeed
without error but may lead to incorrect results.
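
A short sketch of both points: where bisect places a value among duplicates, and what happens on unsorted input. The list unsorted is a made-up example.

```python
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

# bisect.bisect is an alias of bisect_right: it lands after equal values,
# while bisect_left lands before them.
left = bisect.bisect_left(c, 2)
right = bisect.bisect_right(c, 2)
print(left, right)  # 1 4

# On unsorted input the call still succeeds, but the list stays unsorted.
unsorted = [7, 1, 5, 2]
bisect.insort(unsorted, 3)
print(unsorted == sorted(unsorted))  # False
```

If there is any doubt about the input, sorting it once before the first bisect call is the safe option.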

#### Slicing

You can select sections of most sequence types by using slice notation, which in its
basic form consists of start:stop passed to the indexing operator []:

In [87]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]

[2, 3, 7, 5]

Slices can also be assigned to with a sequence:

In [88]:
seq[3:4] = [6, 3]
seq

[7, 2, 3, 6, 3, 5, 6, 0, 1]

In [90]:
seq[0:1] = [23]
seq

[23, 2, 3, 6, 3, 5, 6, 0, 1]

While the element at the start index is included, the stop index is not included, so
that the number of elements in the result is stop - start.

Either the start or stop can be omitted, in which case they default to the start of the
sequence and the end of the sequence, respectively:

In [91]:
seq[:5]

[23, 2, 3, 6, 3]

In [92]:
seq[3:]

[6, 3, 5, 6, 0, 1]

Negative indices slice the sequence relative to the end:

In [93]:
seq[-4:]

[5, 6, 0, 1]

In [94]:
seq[-6:-2]

[6, 3, 5, 6]

Slicing semantics takes a bit of getting used to, especially if you’re coming from R or
MATLAB. See Figure 3-1 for a helpful illustration of slicing with positive and negative
integers. In the figure, the indices are shown at the “bin edges” to help show
where the slice selections start and stop using positive or negative indices.

![](slicing.jpg)

In [95]:
seq

[23, 2, 3, 6, 3, 5, 6, 0, 1]

In [98]:
seq[-1:]

[1]

A step can also be used after a second colon to, say, take every other element:

In [99]:
seq[::2]

[23, 3, 3, 6, 1]

A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple:

In [100]:
seq[::1]

[23, 2, 3, 6, 3, 5, 6, 0, 1]

In [101]:
seq[::-1]

[1, 0, 6, 5, 3, 6, 3, 2, 23]
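
A few slicing properties are easy to check directly. This sketch reuses the values of seq from above; the names start, stop, and shallow_copy are illustrative:

```python
seq = [23, 2, 3, 6, 3, 5, 6, 0, 1]

start, stop = 2, 6
# With in-range, non-negative indices, a slice has stop - start elements.
print(len(seq[start:stop]) == stop - start)  # True

# Out-of-range slice bounds are clipped rather than raising IndexError.
print(seq[5:100])  # [5, 6, 0, 1]

# A slice always builds a new list, so seq[:] is a shallow copy.
shallow_copy = seq[:]
shallow_copy[0] = -1
print(seq[0])  # 23

# Combining bounds with a negative step walks backwards.
print(seq[6:2:-2])  # [6, 3]
```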

### Built-in Sequence Functions

Python has a handful of useful sequence functions that you should familiarize yourself
with and use at any opportunity.

#### Enumerate

It’s common when iterating over a sequence to want to keep track of the index of the
current item. A do-it-yourself approach would look like:

    i = 0
    for value in collection:
        # do something with value
        i += 1
    
Since this is so common, *__Python has a built-in function, enumerate, which returns a
sequence of (i, value) tuples__*:

    for i, value in enumerate(collection):
        # do something with value
        pass
    
When you are indexing data, a helpful pattern that uses enumerate is *__computing a
dict mapping the values of a sequence (which are assumed to be unique) to their
locations in the sequence__*:

In [1]:
some_list = ['foo', 'bar', 'baz']
mapping = {}

for i,v in enumerate(some_list):
    mapping[v] = i

In [2]:
mapping

{'foo': 0, 'bar': 1, 'baz': 2}
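
Two details of this pattern are worth a quick sketch: enumerate takes an optional start value, and the uniqueness assumption matters because repeated values overwrite earlier positions. The list with_duplicates is a made-up example.

```python
some_list = ['foo', 'bar', 'baz']

# enumerate accepts an optional start value for the counter.
numbered = list(enumerate(some_list, start=1))
print(numbered)  # [(1, 'foo'), (2, 'bar'), (3, 'baz')]

# If values repeat, later positions overwrite earlier ones in the mapping.
with_duplicates = ['foo', 'bar', 'foo']
mapping = {}
for i, v in enumerate(with_duplicates):
    mapping[v] = i
print(mapping)  # {'foo': 2, 'bar': 1}
```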

#### Sorted

The sorted function returns a new sorted list from the elements of any sequence:

In [3]:
sorted([7, 1, 2, 6, 0, 3, 2])

[0, 1, 2, 2, 3, 6, 7]

In [4]:
sorted('horse race')

[' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']

The sorted function accepts the same arguments as the sort method on lists.
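
For instance, the key and reverse arguments work with sorted just as they do with sort. A small sketch using the list of words from the sorting example:

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# sorted returns a new list and leaves the input untouched.
by_length_desc = sorted(words, key=len, reverse=True)
print(by_length_desc)  # ['small', 'foxes', 'saw', 'six', 'He']
print(words)           # ['saw', 'small', 'He', 'foxes', 'six']

# key can be any one-argument callable, e.g. case-insensitive ordering.
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)  # ['foxes', 'He', 'saw', 'six', 'small']
```

The sort is stable, so words of equal length keep their original relative order even with reverse=True.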

#### zip

zip “pairs” up the elements of a number of lists, tuples, or other sequences. In Python 3
it returns a lazy iterator of tuples, which list can materialize:

In [7]:
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1,seq2)
list(zipped)

[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

zip can take an arbitrary number of sequences, and the number of elements it produces
is determined by the shortest sequence:

In [8]:
seq3 = [False, True]
zipped = zip(seq1,seq2,seq3)
list(zipped)

[('foo', 'one', False), ('bar', 'two', True)]

A very common use of zip is simultaneously iterating over multiple sequences, possibly
also combined with enumerate:

In [9]:
for i, (a,b) in enumerate(zip(seq1,seq2)):
    print('{0}: {1}, {2}'.format(i,a,b))

0: foo, one
1: bar, two
2: baz, three


Given a “zipped” sequence, zip can be applied in a clever way to “unzip” the
sequence. Another way to think about this is converting a list of rows into a list of
columns. The syntax, which looks a bit magical, is:

In [10]:
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ('Schilling', 'Curt')]
first_names, last_names = zip(*pitchers)

In [11]:
first_names

('Nolan', 'Roger', 'Schilling')

In [12]:
last_names

('Ryan', 'Clemens', 'Curt')
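
The rows-to-columns view can be sketched with a small made-up table. The names rows, letters, numbers, and flags are illustrative:

```python
rows = [('a', 1, True), ('b', 2, False), ('c', 3, True)]

# zip(*rows) passes each row as a separate argument, so zip pairs up
# the first items, then the second items, and so on.
letters, numbers, flags = zip(*rows)
print(letters)  # ('a', 'b', 'c')
print(numbers)  # (1, 2, 3)

# Zipping the columns again recovers the original rows.
round_trip = list(zip(letters, numbers, flags))
print(round_trip == rows)  # True
```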

#### Reversed

reversed iterates over the elements of a sequence in reverse order:

In [14]:
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Keep in mind that reversed returns a lazy iterator (much like a generator, to be discussed
in some more detail later), so it does not create the reversed sequence until materialized
(e.g., with list or a for loop).
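
One consequence of this laziness is that the result can only be consumed once. A minimal sketch, with illustrative variable names:

```python
items = [1, 2, 3]
rev = reversed(items)

# The first pass materializes the reversed order.
first_pass = list(rev)
print(first_pass)  # [3, 2, 1]

# The iterator is now exhausted; a second pass yields nothing.
second_pass = list(rev)
print(second_pass)  # []

# The original list is unchanged.
print(items)  # [1, 2, 3]
```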

### Dict

*__dict is likely the most important built-in Python data structure. A more common
name for it is hash map or associative array__*. It is a flexibly sized collection of key-value
pairs, where key and value are Python objects. One approach for creating one is to use
curly braces {} and colons to separate keys and values:

In [17]:
empty_dict = {}

In [18]:
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
d1

{'a': 'some value', 'b': [1, 2, 3, 4]}

You can access, insert, or set elements using the same syntax as for accessing elements
of a list or tuple:

In [19]:
d1[7] = 'an integer'
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

In [21]:
d1['b']

[1, 2, 3, 4]

You can check if a dict contains a key using the same syntax used for checking
whether a list or tuple contains a value:

In [22]:
'b' in d1

True

You can delete values either using the del keyword or the pop method (which simultaneously
returns the value and deletes the key):

In [23]:
d1[5] = 'some value'
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer', 5: 'some value'}

In [24]:
d1['dummy'] = 'another value'
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 5: 'some value',
 'dummy': 'another value'}

In [25]:
del d1[5]
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value'}

In [26]:
ret = d1.pop('dummy')
ret

'another value'

In [27]:
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

The keys and values methods give you iterable views of the dict’s keys and values,
respectively. As of Python 3.7, a dict preserves the order in which its keys were inserted,
and these methods output the keys and values in that same order:

In [28]:
list(d1.keys())

['a', 'b', 7]

In [29]:
list(d1.values())

['some value', [1, 2, 3, 4], 'an integer']

You can merge one dict into another using the update method:

In [30]:
d1.update({'b' : 'foo', 'c' : 12})
d1

{'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

The update method changes dicts in-place, so any existing keys in the data passed to
update will have their old values discarded.
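
If the original dict should stay untouched, one alternative is to merge into a new dict with unpacking. The sketch below uses made-up dicts named base and overrides and compares the two approaches:

```python
base = {'a': 1, 'b': 2}
overrides = {'b': 20, 'c': 30}

# Unpacking into a new dict literal merges without mutating either input;
# on key collisions the right-most value wins, as with update.
merged = {**base, **overrides}
print(merged)  # {'a': 1, 'b': 20, 'c': 30}
print(base)    # {'a': 1, 'b': 2}

# update performs the same merge in place and returns None.
result = base.update(overrides)
print(result, base == merged)  # None True
```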

#### Creating Dicts from Sequences

You will occasionally end up with two sequences that you want to pair up
element-wise in a dict. As a first cut, you might write code like this:

    mapping = {}
    for key, value in zip(key_list, value_list):
        mapping[key] = value
    
Since a dict is essentially a collection of 2-tuples, the dict function accepts a list of
2-tuples:

In [31]:
mapping = dict(zip(range(5),reversed(range(5))))
mapping

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

Later we’ll talk about dict comprehensions, another elegant way to construct dicts.

#### Default Values

It's very common to have logic like:

    if key in some_dict:
        value = some_dict[key]
    else:
        value = default_value
        
Thus, the dict methods get and pop can take a default value to be returned, so that
the above if-else block can be written simply as:

    value = some_dict.get(key, default_value)

get by default will return None if the key is not present, while pop will raise an exception.
With setting values, a common case is for the values in a dict to be other collections,
like lists. For example, you could imagine categorizing a list of words by their
first letters as a dict of lists:

In [40]:
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}    

for word in words:
    letter = word[0] #first letter
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)

In [41]:
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The setdefault dict method is for precisely this purpose. The preceding for loop
can be rewritten as:

In [42]:
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}    

for word in words:
    letter = word[0]
    by_letter.setdefault(letter,[]).append(word)

In [43]:
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The built-in collections module has a useful class, defaultdict, which makes this
even easier. To create one, you pass a type or function for generating the default value
for each slot in the dict:

In [44]:
from collections import defaultdict

words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

In [45]:
by_letter

defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})
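
To round off the section, here is a brief sketch of the default-value behavior of get and pop described earlier, plus defaultdict(int) as a counting variant. The dict some_dict and the name counts are illustrative:

```python
from collections import defaultdict

some_dict = {'a': 1}

print(some_dict.get('missing'))     # None
print(some_dict.get('missing', 0))  # 0

# With a default, pop returns it instead of raising.
popped = some_dict.pop('missing', -1)
print(popped)  # -1

# Without a default, pop raises KeyError for a missing key.
try:
    some_dict.pop('missing')
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# defaultdict(int) starts every missing key at 0, which suits counting.
counts = defaultdict(int)
for word in ['apple', 'bat', 'bar', 'atom', 'book']:
    counts[word[0]] += 1
print(dict(counts))  # {'a': 2, 'b': 3}
```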

#### Valid Dict Key Types

While the values of a dict can be any Python object, *__the keys generally have to be
immutable objects like scalar types (int, float, string) or tuples (all the objects in the
tuple need to be immutable, too). The technical term here is hashability__*. You can
check whether an object is hashable (can be used as a key in a dict) with the hash
function:


In [46]:
hash('string')

-5262084218156641715

In [47]:
hash((1,2,(4,5)))

-1423827714952672830

In [49]:
hash([1,2,3]) # fails because lists are mutable

TypeError: unhashable type: 'list'

To use a list as a key, one option is to convert it to a tuple, which can be hashed as
long as its elements also can:

In [50]:
d = {}
d[tuple([1,2,3])] = 'string'
d

{(1, 2, 3): 'string'}
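
The hash check can be wrapped in a small helper. Note that is_hashable is a hypothetical name introduced here for illustration, not a built-in:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 2, (4, 5))))     # True
print(is_hashable((1, [2, 3])))        # False: the tuple contains a list
print(is_hashable(frozenset({1, 2})))  # True: frozenset is an immutable set
```

The second call shows why a tuple is only a valid key when everything inside it is hashable as well.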

### Set

A set is an unordered collection of unique elements. You can think of them like dicts,
but keys only, no values. A set can be created in two ways: via the set function or via
a set literal with curly braces:

In [51]:
set([2, 2, 2, 1, 3, 3])

{1, 2, 3}

In [52]:
{2, 2, 2, 1, 3, 3}

{1, 2, 3}

Sets support mathematical set operations like union, intersection, difference, and
symmetric difference. Consider these two example sets:

In [53]:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

The union of these two sets is the set of distinct elements occurring in either set. This
can be computed with either the union method or the | binary operator:

In [54]:
a.union(b)

{1, 2, 3, 4, 5, 6, 7, 8}

In [55]:
a|b

{1, 2, 3, 4, 5, 6, 7, 8}

The intersection contains the elements occurring in both sets. The & operator or the
intersection method can be used:

In [56]:
a.intersection(b)

{3, 4, 5}

In [57]:
a & b

{3, 4, 5}

See Table 3-1 for a list of commonly used set methods.

![](table_set.jpg)
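
As a sketch of a few more of these operations on the same two example sets (results are passed through sorted only to make the printed order predictable):

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# Difference: elements in a that are not in b.
print(sorted(a - b))  # [1, 2]

# Symmetric difference: elements in exactly one of the two sets.
print(sorted(a ^ b))  # [1, 2, 6, 7, 8]

# Subset test.
print({1, 2}.issubset(a))  # True

# In-place variants modify the left operand instead of building a new set.
c = a.copy()
c |= b
print(c == a.union(b))  # True
```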

### List, Set, and Dict Comprehensions

List comprehensions are one of the most-loved Python language features. They allow
you to concisely form a new list by filtering the elements of a collection, transforming
the elements passing the filter in one concise expression. They take the basic form:

    [expr for val in collection if condition]
    
This is equivalent to the following for loop:

    result = []
    for val in collection:
        if condition:
            result.append(expr)
    
The filter condition can be omitted, leaving only the expression. For example, given a
list of strings, we could filter out strings with length 2 or less and also convert them to
uppercase like this:

In [58]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
[x.upper() for x in strings if len(x)>2]

['BAT', 'CAR', 'DOVE', 'PYTHON']

In [59]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
[_.upper() for _ in strings if len(_)>2]

['BAT', 'CAR', 'DOVE', 'PYTHON']

Set and dict comprehensions are a natural extension, producing sets and dicts in an
idiomatically similar way instead of lists. A dict comprehension looks like this:

    dict_comp = {key-expr : value-expr for value in collection if condition}
    
A set comprehension looks like the equivalent list comprehension except with curly
braces instead of square brackets:

    set_comp = {expr for value in collection if condition}
    
Like list comprehensions, set and dict comprehensions are mostly conveniences, but
they similarly can make code both easier to write and read. Consider the list of strings
from before. Suppose we wanted a set containing just the lengths of the strings contained
in the collection; we could easily compute this using a set comprehension:

In [60]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
unique_lengths = {len(x) for x in strings}
unique_lengths

{1, 2, 3, 4, 6}

We could also express this more functionally using the map function, introduced
shortly:

In [61]:
set(map(len, strings))

{1, 2, 3, 4, 6}

As a simple dict comprehension example, we could create a lookup map of these
strings to their locations in the list:

In [63]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
loc_mapping = {val:index for index,val in enumerate(strings)}
loc_mapping

{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

#### Nested List Comprehensions

Suppose we have a list of lists containing some English and Spanish names:

In [64]:
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

You might have gotten these names from a couple of files and decided to organize
them by language. Now, suppose we wanted to get a single list containing all names
with two or more e’s in them. We could certainly do this with a simple for loop:

    names_of_interest = []
    for names in all_data:
        enough_es = [name for name in names if name.count('e') >= 2]
        names_of_interest.extend(enough_es)
        
You can actually wrap this whole operation up in a single nested list comprehension,
which will look like:

In [65]:
result = [name for names in all_data for name in names if name.count('e') >= 2]
result

['Steven']

At first, nested list comprehensions are a bit hard to wrap your head around. The for
parts of the list comprehension are arranged according to the order of nesting, and
any filter condition is put at the end as before. Here is another example where we
“flatten” a list of tuples of integers into a simple list of integers:

In [67]:
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
flattened

[1, 2, 3, 4, 5, 6, 7, 8, 9]

Keep in mind that the order of the for expressions would be the same if you wrote a
nested for loop instead of a list comprehension:

    flattened = []
    for tup in some_tuples:
        for x in tup:
            flattened.append(x)
            
You can have arbitrarily many levels of nesting, though if you have more than two or
three levels of nesting you should probably start to question whether this makes sense
from a code readability standpoint. It’s important to distinguish the syntax just shown
from a list comprehension inside a list comprehension, which is also perfectly valid:

In [68]:
[[x for x in tup] for tup in some_tuples]

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

This produces a list of lists, rather than a flattened list of all of the inner elements.

## 3.2 Functions

Functions are the primary and most important method of code organization and
reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or
very similar code more than once, it may be worth writing a reusable function. Functions
can also help make your code more readable by giving a name to a group of
Python statements.

Functions are declared with the def keyword and returned from with the return keyword:

In [69]:
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

In [70]:
my_function(1,2)

4.5

*__Each function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments__*. In
the preceding function, x and y are positional arguments while z is a keyword argument. This means that the function can be called in any of these ways:

    my_function(5, 6, z=0.7)
    my_function(3.14, 7, 3.5)
    my_function(10, 20)
    
The main restriction on function arguments is that the keyword arguments must follow
the positional arguments (if any). You can specify keyword arguments in any
order; this frees you from having to remember the order in which the function arguments
were specified. You only need to remember their names.
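
A brief sketch of calling the same my_function with keywords in a different order. The function is repeated here only so the snippet runs on its own:

```python
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# Keywords can be given in any order, and x and y can be passed by name too.
positional = my_function(5, 6, z=0.7)
by_name = my_function(z=0.7, y=6, x=5)
print(positional == by_name)  # True

# A positional argument after a keyword argument is rejected before the
# code even runs:
# my_function(x=5, 6)  -> SyntaxError
```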

### Namespaces, Scope and Local Functions

Functions can access variables in two different scopes: global and local. An alternative
and more descriptive name describing a variable scope in Python is a namespace. Any
variables that are assigned within a function by default are assigned to the local
namespace. The local namespace is created when the function is called and immediately
populated by the function’s arguments. After the function is finished, the local
namespace is destroyed (with some exceptions that are outside the purview of this
chapter). Consider the following function:

In [71]:
def func():
    a = []
    for i in range(5):
        a.append(i)
    return a

In [73]:
func()

[0, 1, 2, 3, 4]

*__When func() is called, the empty list a is created, five elements are appended, and
then a is destroyed when the function exits__*. Suppose instead we had declared a as
follows:

In [74]:
a = []

def func():
    for i in range(5):
        a.append(i)
    return a

In [75]:
func()

[0, 1, 2, 3, 4]

Assigning variables outside of the function’s scope is possible, but those variables must
be declared as global via the global keyword:

In [76]:
a = None

In [80]:
def bind_a_var():
    global a
    a = []

In [81]:
bind_a_var()

In [82]:
print(a)

[]


I generally discourage use of the global keyword. *__Typically global
variables are used to store some kind of state in a system. If you
find yourself using a lot of them, it may indicate a need for object-oriented programming (using classes)__*.
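
As a rough sketch of that suggestion, the state that a global list would hold can live on an instance instead. Collector is a hypothetical class name invented for this illustration:

```python
class Collector:
    """Hypothetical example class that owns its own list instead of a global."""

    def __init__(self):
        self.items = []

    def fill(self, n):
        # Append 0..n-1 to this instance's list and return it.
        for i in range(n):
            self.items.append(i)
        return self.items

first = Collector()
second = Collector()
first.fill(5)
second.fill(2)

# Each instance keeps its own state, so the two do not interfere.
print(first.items)   # [0, 1, 2, 3, 4]
print(second.items)  # [0, 1]
```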

### Returning Multiple Values

When I first programmed in Python after having programmed in Java and C++, one
of my favorite features was the ability to return multiple values from a function with
simple syntax. Here’s an example:

In [83]:
def f():
    a=1
    b=2
    c=3
    return a,b,c

In [85]:
a,b,c = f()

In data analysis and other scientific applications, you may find yourself doing this
often. What’s happening here is that the function is actually just returning one object,
namely a tuple, which is then being unpacked into the result variables. In the preceding
example, we could have done this instead:

In [86]:
return_value = f()

In [87]:
return_value

(1, 2, 3)

In this case, return_value would be a 3-tuple with the three returned variables. A
potentially attractive alternative to returning multiple values like before might be to
return a dict instead:

In [88]:
def f():
    a = 5
    b = 6
    c = 7
    return {'a':a, 'b':b, 'c':c}

In [89]:
f()

{'a': 5, 'b': 6, 'c': 7}

This alternative technique can be useful depending on what you are trying to do.

### Functions are Objects

Since Python functions are objects, many constructs can be easily expressed that are
difficult to do in other languages. Suppose we were doing some data cleaning and
needed to apply a bunch of transformations to the following list of strings:

In [93]:
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south carolina##', 'West virginia?']
print(states)

[' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south carolina##', 'West virginia?']


Anyone who has ever worked with user-submitted survey data has seen messy results
like these. Lots of things need to happen to make this list of strings uniform and
ready for analysis: stripping whitespace, removing punctuation symbols, and standardizing
on proper capitalization. *__One way to do this is to use built-in string methods
along with the re standard library module for regular expressions__*:

In [107]:
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip() # Leading and trailing whitespaces are removed.
        value = re.sub('[!#?]','', value)
        value = value.title() # The first character in every word is upper case.
        result.append(value)
    return result

In [108]:
clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

*__You can use functions as arguments to other functions like the built-in map function,
which applies a function to a sequence of some kind__*:

In [110]:
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
            'south carolina##', 'West virginia?']

In [111]:
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

In [113]:
states2 = []
for x in map(remove_punctuation, states):
    states2.append(x)
print(states2)

[' Alabama ', 'Georgia', 'Georgia', 'georgia', 'FlOrIda', 'south carolina', 'West virginia']
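
Because functions are objects, another possible approach is to keep the cleaning steps in a list and apply them in turn. In this sketch the name clean_ops and the two-argument form of clean_strings are choices made for the illustration:

```python
import re

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

# Functions are ordinary objects, so they can be stored in a list.
clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        # Apply each operation in order to the running value.
        for function in ops:
            value = function(value)
        result.append(value)
    return result

states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south carolina##', 'West virginia?']
cleaned = clean_strings(states, clean_ops)
print(cleaned)
```

Adding, removing, or reordering a cleaning step then only means editing the list, not the loop.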


### Anonymous (Lambda) Functions

Python has support for so-called anonymous or lambda functions, which are a way of
writing functions consisting of a single expression, the result of which is the return
value. They are defined with *__the lambda keyword, which has no meaning other than
“we are declaring an anonymous function”__*:

In [114]:
def func(x):
    return x*2

In [115]:
equiv_func = lambda x: x*2

I usually refer to these as lambda functions in the rest of the book. They are especially
convenient in data analysis because, as you’ll see, there are many cases where data
transformation functions will take functions as arguments. It’s often less typing (and
clearer) to pass a lambda function as opposed to writing a full-out function declaration
or even assigning the lambda function to a local variable. Consider
this silly example:

In [116]:
def apply_to_list(some_list,f):
    return [f(x) for x in some_list]

In [117]:
ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x*2)

[8, 0, 2, 10, 12]

You could also have written [x * 2 for x in ints], but here we were able to succinctly
pass a custom operator to the apply_to_list function.

As another example, suppose you wanted to sort a collection of strings by the number
of distinct letters in each string:

In [118]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

Here we could pass a lambda function to the list’s sort method:

In [119]:
strings.sort(key=lambda x: len(set(list(x))))

In [120]:
print(strings)

['aaaa', 'foo', 'abab', 'bar', 'card']


One reason lambda functions are called anonymous functions is
that, unlike functions declared with the def keyword, the function
object itself is never given an explicit \_\_name\_\_ attribute.
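
A quick check makes this visible: a lambda still has a \_\_name\_\_ attribute, but it holds a generic placeholder rather than a name of its own. The function names below are illustrative:

```python
def double(x):
    return x * 2

anonymous_double = lambda x: x * 2

print(double.__name__)            # double
print(anonymous_double.__name__)  # <lambda>
```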

### Currying: Partial Argument Application

*__Currying is computer science jargon (named after the mathematician Haskell Curry)
that means deriving new functions from existing ones by partial argument application__*. For example, suppose we had a trivial function that adds two numbers together:

In [122]:
def add_numbers(x,y):
    return x+y

Using this function, we could derive a new function of one variable, add_5, that
adds 5 to its argument:

In [123]:
add_5 = lambda y: add_numbers(5,y)

*__The second argument to add_numbers is said to be curried__*. There’s nothing very fancy
here, as all we’ve really done is define a new function that calls an existing function.
The built-in functools module can simplify this process using the partial function:

In [125]:
from functools import partial
add_5 = partial(add_numbers, 5)

In [126]:
add_5(2)

7
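
partial can also fix keyword arguments, which is handy when the argument you want to pin down is not the first one. As a small sketch, the built-in int accepts a base keyword; parse_binary is a hypothetical name used only here.

```python
from functools import partial

# int accepts an optional base argument; fixing it yields a binary parser
parse_binary = partial(int, base=2)
print(parse_binary('101'))

# The partial object records what it wraps and which arguments were fixed
print(parse_binary.func is int, parse_binary.keywords)
```

Unlike a lambda wrapper, the partial object keeps the wrapped function and the fixed arguments available for inspection.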

### Generators

Having a consistent way to iterate over sequences, like objects in a list or lines in a
file, is an important Python feature. *__This is accomplished by means of the iterator
protocol, a generic way to make objects iterable__*. For example, iterating over a dict
yields the dict keys:

In [127]:
some_dict = {'a': 1, 'b': 2, 'c': 3}

for key in some_dict:
    print(key)

a
b
c


When you write for key in some_dict, the Python interpreter first attempts to create an iterator out of some_dict:

In [128]:
dict_iterator = iter(some_dict)
dict_iterator

<dict_keyiterator at 0x1dbbce4e228>

An iterator is any object that will yield objects to the Python interpreter when used in
a context like a for loop. Most methods expecting a list or list-like object will also
accept any iterable object. This includes built-in methods such as min, max, and sum,
and type constructors like list and tuple:

In [129]:
list(dict_iterator)

['a', 'b', 'c']
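
To make the protocol more concrete, the following sketch drives an iterator by hand with the built-in next function. This is roughly what a for loop does behind the scenes; the variable names are illustrative.

```python
some_dict = {'a': 1, 'b': 2, 'c': 3}
it = iter(some_dict)

print(next(it))
print(next(it))
print(next(it))

# A fourth request signals exhaustion by raising StopIteration,
# which is how a for loop knows when to stop.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)

# An exhausted iterator yields nothing more
print(list(it))
```

This also explains why an iterator can only be consumed once: after list(dict_iterator) above, a second call to list on the same iterator would return an empty list.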

A generator is a concise way to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators return a sequence of
multiple results lazily, pausing after each one until the next one is requested. To create
a generator, use the yield keyword instead of return in a function:

In [144]:
def squares(n=10):
    print('Generating squares from 1 to', n**2)
    for i in range(1, n + 1):
        yield i**2

When you actually call the generator, no code is immediately executed:

In [156]:
gen = squares()

In [157]:
gen

<generator object squares at 0x000001DBBCE1CC00>

It is not until you request elements from the generator that it begins executing its
code:

In [158]:
for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 
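
The pausing behavior can be observed by pulling values one at a time. The sketch below redefines squares so that it is self-contained and uses a smaller n; first, rest, and again are hypothetical names.

```python
def squares(n=10):
    print('Generating squares from 1 to', n**2)
    for i in range(1, n + 1):
        yield i**2

gen = squares(3)
first = next(gen)   # the body runs up to the first yield; the message prints here
rest = list(gen)    # resumes after the first yield and runs to completion
again = list(gen)   # the generator is now exhausted, so nothing is produced
print(first, rest, again)
```

Like any iterator, a generator object is single-use; call the generator function again to get a fresh one.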

#### Generator Expressions

An even more concise way to make a generator is by using a generator expression.
This is a generator analogue to list, dict, and set comprehensions; to create one,
enclose what would otherwise be a list comprehension within parentheses instead of
brackets:

In [159]:
gen = (x**2 for x in range(100))

In [160]:
gen

<generator object <genexpr> at 0x000001DBBCE1CC78>

Generator expressions can be used instead of list comprehensions as function arguments
in many cases:

In [161]:
sum(x**2 for x in range(100))

328350

In [163]:
dict((i,i**2) for i in range(5))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
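
One practical reason to prefer a generator expression as a function argument is that it avoids building the intermediate list. The following is a rough illustration using sys.getsizeof, which reports only the size of the container object itself; exact numbers vary by Python version, so only the comparison is shown.

```python
import sys

squares_list = [x**2 for x in range(100_000)]
squares_gen = (x**2 for x in range(100_000))

# The list holds references to all 100,000 results; the generator holds only its state
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))

# Both produce the same total when consumed
print(sum(squares_list) == sum(squares_gen))
```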

#### Itertools Module

The standard library itertools module has a collection of generators for many common
data algorithms. For example, groupby takes any sequence and a function,
grouping consecutive elements in the sequence by the return value of the function. Here’s
an example:

In [164]:
import itertools

first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))  # group is an iterator

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']


See Table 3-2 for a list of a few other itertools functions I’ve frequently found helpful.
You may like to check out the official Python documentation for more on this
useful built-in utility module.

![](itertools.jpg)
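
Because groupby only merges consecutive elements, the letter A appeared twice in the output above. If one group per key is wanted, a common approach is to sort by the same key first. The sketch below shows that, plus combinations as one more itertools function; grouped and pairs are hypothetical names.

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# Sorting by the same key makes equal keys adjacent, so each letter appears once
sorted_names = sorted(names, key=first_letter)
grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted_names, first_letter)
}
print(grouped)

# combinations yields every unordered pair without repetition
pairs = list(itertools.combinations(['a', 'b', 'c'], 2))
print(pairs)
```

Python's sort is stable, so names within each group keep their original relative order.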

### Errors and Exception Handling

Handling Python errors or exceptions gracefully is an important part of building
robust programs. In data analysis applications, many functions only work on certain
kinds of input. As an example, Python’s float function is capable of casting a string
to a floating-point number, but fails with ValueError on improper inputs:

In [165]:
float('1.2345')

1.2345

In [166]:
float('something')

ValueError: could not convert string to float: 'something'

Suppose we wanted a version of float that fails gracefully, returning the input argument.
We can do this by writing a function that encloses the call to float in a try/except
block:

In [167]:
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

The code in the except part of the block will only be executed if float(x) raises an
exception:

In [168]:
attempt_float('1.2345')

1.2345

In [169]:
attempt_float('something')

'something'

You might notice that float can raise exceptions other than ValueError:

In [170]:
float((1, 2))

TypeError: float() argument must be a string or a number, not 'tuple'
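
A bare except clause hides every kind of error, including ones that may indicate a genuine bug. One way to be more selective, sketched here with two hypothetical helper names, is to name the exception type after except, or to pass a tuple of types when more than one should be suppressed.

```python
def float_or_input(x):
    # Hypothetical variant: suppress only ValueError, let everything else propagate
    try:
        return float(x)
    except ValueError:
        return x

def float_or_input_lenient(x):
    # Hypothetical variant: a tuple of exception types suppresses both kinds
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

print(float_or_input('something'))
print(float_or_input_lenient((1, 2)))

try:
    float_or_input((1, 2))
except TypeError as exc:
    print('TypeError propagated:', exc)
```

Which variant is appropriate depends on whether a non-string, non-numeric input should be treated as bad data or as a programming error.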

## 3.3 Files and the Operating System

Most of this book uses high-level tools like pandas.read_csv to read data files from
disk into Python data structures. However, it’s important to understand the basics of
how to work with files in Python. Fortunately, it’s very simple, which is one reason
why Python is so popular for text and file munging.

In [171]:
path = 'examples/segismundo.txt'
f = open(path)

By default, the file is opened in read-only mode 'r'. We can then treat the file handle
f like a list and iterate over the lines like so:

In [173]:
for line in f:
    pass

The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll
often see code to get an EOL-free list of lines in a file like:

In [174]:
lines = [x.rstrip() for x in open(path)]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

When you use open to create file objects, it is important to explicitly close the file
when you are finished with it. Closing the file releases its resources back to the
operating system:

In [175]:
f.close()

One of the ways to make it easier to clean up open files is to use the with statement:

In [177]:
with open(path) as f:
    lines = [x.rstrip() for x in f]

This will automatically close the file f when exiting the with block.
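
The closing behavior can be checked through the file object's closed attribute. The sketch below writes a small throwaway file with the tempfile module so that it does not depend on examples/segismundo.txt; demo_path is a hypothetical name.

```python
import os
import tempfile

# Create a small throwaway file for the demonstration
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first line\nsecond line\n')
    demo_path = tmp.name

with open(demo_path, encoding='utf-8') as f:
    lines = [x.rstrip() for x in f]
    print(f.closed)   # still open inside the block

print(f.closed)       # closed once the block exits
print(lines)

os.remove(demo_path)
```

The file is closed even if the body of the with block raises an exception, which is the main advantage over calling close by hand.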

If we had typed f = open(path, 'w'), a new file at examples/segismundo.txt would
have been created (be careful!), overwriting any one in its place. There is also the 'x'
file mode, which creates a writable file but fails if the file path already exists. See
Table 3-3 for a list of all valid file read/write modes.

![](file_modes.jpg)
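
As a small sketch of how the 'x' mode protects existing files, the code below creates a file in a temporary directory, then tries to create it again. The path and variable names are hypothetical. It also uses the 'a' mode, which appends rather than truncating.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'new_file.txt')

# First attempt succeeds because the path does not exist yet
with open(demo_path, 'x') as f:
    f.write('created\n')

# Second attempt refuses to clobber the existing file
try:
    with open(demo_path, 'x') as f:
        f.write('overwritten\n')
    clobbered = True
except FileExistsError:
    clobbered = False
print(clobbered)

# 'a' appends to the existing content instead of truncating it
with open(demo_path, 'a') as f:
    f.write('appended\n')
with open(demo_path) as f:
    content = f.read()
print(content)
```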

For readable files, some of the most commonly used methods are read, seek, and
tell. read returns a certain number of characters from the file. What constitutes a
“character” is determined by the file’s encoding (e.g., UTF-8) or simply raw bytes if
the file is opened in binary mode:

In [178]:
f = open(path)

In [179]:
f.read(10)

'Sueña el r'

In [180]:
f2 = open(path,'rb')

In [181]:
f2.read(10)

b'Sue\xc3\xb1a el '

The read method advances the file handle’s position by the number of bytes read.
tell gives you the current position:

In [182]:
f.tell()

11

In [183]:
f2.tell()

10

Even though we read 10 characters from the file, the position is 11 because it took
that many bytes to decode 10 characters using the default encoding. You can check
the default encoding in the sys module:

In [184]:
import sys

sys.getdefaultencoding()

'utf-8'
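
One caveat worth knowing: sys.getdefaultencoding reports the encoding Python uses for converting between str and bytes, whereas open in text mode, when no encoding is passed, has traditionally fallen back to the platform's locale encoding. On some systems (Windows in particular) that may not be UTF-8, in which case 'ñ' stored as UTF-8 can be read back as two garbled characters. The sketch below uses a temporary stand-in file and passes the encoding explicitly; demo_path is a hypothetical name.

```python
import locale
import os
import tempfile

# Write a UTF-8 encoded sample standing in for the start of the example file
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write('Sueña el rico'.encode('utf-8'))
    demo_path = tmp.name

# The locale's preferred encoding (platform dependent)
print(locale.getpreferredencoding(False))

# Naming the encoding makes the result independent of the platform
with open(demo_path, encoding='utf-8') as f:
    chars = f.read(10)
print(chars)

with open(demo_path, 'rb') as f:
    raw = f.read(10)
    position = f.tell()   # in binary mode, tell counts bytes
print(raw, position)

os.remove(demo_path)
```

Ten characters of text cover eleven bytes of the file here, because 'ñ' occupies two bytes in UTF-8, while the binary read stops after exactly ten bytes.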

seek changes the file position to the indicated byte in the file:

In [185]:
f.seek(3)

3

In [186]:
f.read(1)

'ñ'

Lastly, we remember to close the files:

In [188]:
f.close()

In [189]:
f2.close()

To write text to a file, you can use the file’s write or writelines methods. For example,
we could create a version of examples/segismundo.txt with no blank lines like so:

In [190]:
with open('tmp.txt','w') as handle:
    handle.writelines(x for x in open(path) if len(x)>1)

In [191]:
with open('tmp.txt') as f:
    lines = f.readlines()

In [192]:
lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']

See Table 3-4 for many of the most commonly used file methods.

![](file_methods1.jpg)

![](file_methods2.jpg)
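
A detail that is easy to miss is that writelines does not add line separators; it simply writes each string in turn. The following sketch, using a hypothetical temporary path, contrasts it with write.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'out.txt')

with open(demo_path, 'w', encoding='utf-8') as handle:
    count = handle.write('header\n')     # write returns the number of characters written
    handle.writelines(['a\n', 'b\n'])    # newlines must already be present
    handle.writelines(['c', 'd'])        # no separator is added between items

with open(demo_path, encoding='utf-8') as f:
    lines = f.readlines()
print(count, lines)
```

The last two strings end up on the same line, which is why the earlier example relied on the lines read from the source file still carrying their end-of-line markers.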

### Bytes and Unicode with Files

The default behavior for Python files (whether readable or writable) is text mode,
which means that you intend to work with Python strings (i.e., Unicode). This contrasts
with binary mode, which you can obtain by appending b onto the file mode.
Let’s look at the file (which contains non-ASCII characters with UTF-8 encoding)
from the previous section:

In [193]:
with open(path) as f:
    chars = f.read(10)

In [194]:
chars

'Sueña el r'

UTF-8 is a variable-length Unicode encoding, so when I requested some number of
characters from the file, Python reads enough bytes (which could be as few as 10 or as
many as 40 bytes) from the file to decode that many characters. If I open the file in
'rb' mode instead, read requests exact numbers of bytes:

In [195]:
with open(path,'rb') as f:
    data = f.read(10)

In [196]:
data

b'Sue\xc3\xb1a el '

Depending on the text encoding, you may be able to decode the bytes to a str object
yourself, but only if each of the encoded Unicode characters is fully formed:

In [197]:
data.decode('utf8')

'Sueña el '
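
To illustrate what "fully formed" means, the sketch below rebuilds the same ten bytes from a string literal so that it runs on its own, then slices them at two different points.

```python
data = 'Sueña el '.encode('utf-8')   # the same 10 bytes as shown above
print(data[:3].decode('utf-8'))       # ends on a character boundary

try:
    data[:4].decode('utf-8')          # cuts the two-byte ñ in half
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print('decode failed:', exc.reason)
```

The four-byte slice ends with only the first byte of the two-byte sequence for 'ñ', so the decoder cannot produce a complete character.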

Text mode, combined with the encoding option of open, provides a convenient way
to convert from one Unicode encoding to another.
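
As a minimal sketch of such a conversion, the code below reads UTF-8 text and writes it back out as ISO-8859-1 (Latin-1). The temporary paths and the sample line are stand-ins chosen for this illustration, not files from the book's examples directory.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, 'source.txt')
sink_path = os.path.join(work_dir, 'sink.txt')

# Stand-in for the UTF-8 encoded example file
with open(source_path, 'w', encoding='utf-8') as f:
    f.write('Sueña el rico en su riqueza,\n')

# Read as UTF-8, write as Latin-1; 'x' refuses to overwrite an existing sink
with open(source_path, encoding='utf-8') as source:
    with open(sink_path, 'x', encoding='iso-8859-1') as sink:
        sink.write(source.read())

# Inspect the raw bytes: ñ is now stored as a single byte
with open(sink_path, 'rb') as f:
    raw = f.read(5)
print(raw)
```

This only works when every character in the source can be represented in the target encoding; otherwise the write raises UnicodeEncodeError.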

## 3.4 Conclusion

With some of the basics and the Python environment and language now under our
belt, it’s time to move on and learn about NumPy and array-oriented computing in
Python.