# Built-in Data Structures, Functions, and Files

## Data Structures and Sequences

### Tuple

Fixed-length, immutable sequence of Python objects

Possible to define simply with commas:

In [1]:
tup = 4, 5, 6
tup

(4, 5, 6)

Complex, nested tuple

In [2]:
nested_tup = (4, 5, 6), (7, 8)
nested_tup

((4, 5, 6), (7, 8))

In [4]:
tuple([4, 0, 2])

(4, 0, 2)

Strings are sequences of Unicode characters and can be converted to tuples

In [5]:
tup = tuple('string')
tup

('s', 't', 'r', 'i', 'n', 'g')

In [6]:
tup[0]

's'

Tuples are immutable: once a tuple is created, it is not possible to modify which object is stored in each slot

In [7]:
tup = tuple(['foo', [1, 2], True])
tup[2] = False

TypeError: 'tuple' object does not support item assignment

Tuples can contain Python objects of different types. If an object inside a tuple is mutable, such as a list, it can be modified in place:

In [8]:
tup[1].append(3)
tup

('foo', [1, 2, 3], True)

A single-element tuple needs a trailing comma; parentheses alone do not create a tuple:

In [1]:
test_tuple = ('bar')
type(test_tuple)

str

In [2]:
test_tuple2 = ('bar',)
type(test_tuple2)

tuple

Tuples can be concatenated with `+`, and multiplying a tuple by an integer concatenates that many copies of it:

In [9]:
(4, None, 'foo') + (6, 0) + ('bar',)

(4, None, 'foo', 6, 0, 'bar')

In [10]:
('foo', 'bar') * 4

('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')

#### Unpacking tuples

Assigning to a tuple-like expression of variables unpacks the value on the right-hand side of the equals sign

In [12]:
tup = (4, 5, 6)
a, b, c = tup
b

5

In [13]:
tup = 4, 5, (6, 7)
a, b, (c, d) = tup
d

7

In many other languages, swapping the values of two variables looks like this:

tmp = a

a = b

b = tmp

In Python, a swap can be done like this:

In [14]:
a, b = 1, 2
a

1

In [15]:
b

2

In [16]:
b, a = a, b
a

2

In [17]:
b

1

Unpacking is commonly used when iterating over sequences of tuples or lists

In [18]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for a, b, c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9


Using the `*rest` syntax to pluck a few elements from the beginning of a tuple and collect the remainder in a list:

In [3]:
values = 1, 2, 3, 4, 5
a, b, *rest = values
values

(1, 2, 3, 4, 5)

In [20]:
a, b

(1, 2)

In [21]:
rest

[3, 4, 5]

The underscore is commonly used for unwanted variables:

In [23]:
a, b, *_ = values

#### Tuple methods

Count: counts number of occurrences of a value

In [4]:
a = (1, 2, 2, 2, 3, 4, 2)
a.count(2)

4

Use tab completion in Jupyter notebooks to explore the available methods!

In [None]:
a.

### List

Lists are variable-length and their contents can be modified in place

Lists can store Python objects of different types

In [25]:
a_list = [2, 3, 7, None]
tup = ('foo', 'bar', 'baz')
b_list = list(tup)
b_list

['foo', 'bar', 'baz']

Modification of elements is possible (unlike with tuples):

In [26]:
b_list[1] = 'peekaboo'
b_list

['foo', 'peekaboo', 'baz']

In [8]:
gen = range(10)
gen
type(gen)

range

In [9]:
list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
type(gen)

range

#### Adding and removing elements

In [28]:
b_list.append('dwarf')
b_list

['foo', 'peekaboo', 'baz', 'dwarf']

`insert` is computationally expensive compared with `append`, because references to subsequent elements have to be shifted internally to make room for the new element. If you need to insert elements at both the beginning and end of a sequence, you may wish to explore `collections.deque`, a double-ended queue, for this purpose.

In [29]:
b_list.insert(1, 'red')
b_list

['foo', 'red', 'peekaboo', 'baz', 'dwarf']
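
As a rough illustration of the `collections.deque` alternative mentioned above, the sketch below adds and removes elements at both ends. The variable names (`queue`, `first`, `remaining`) are made up for this example.

```python
from collections import deque

# A deque supports fast appends and pops at both ends
queue = deque(['baz', 'dwarf'])
queue.append('foo')       # add on the right
queue.appendleft('red')   # add on the left, no shifting of elements needed
print(queue)

first = queue.popleft()   # remove from the left
remaining = list(queue)
print(first, remaining)
```

A deque is a reasonable choice when elements are mostly added or removed at the ends; for frequent access by index in the middle, a list is usually the better fit.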

In [30]:
b_list.pop(2)
b_list

['foo', 'red', 'baz', 'dwarf']

In [31]:
b_list.append('foo')
b_list

['foo', 'red', 'baz', 'dwarf', 'foo']

`remove` is also computationally expensive, because the list has to be scanned for the value. Note that only the first occurrence is removed:

In [32]:
b_list.remove('foo')
b_list

['red', 'baz', 'dwarf', 'foo']

In [33]:
'dwarf' in b_list

True

In [34]:
'dwarf' not in b_list

False

Checking whether a list contains a value is a lot slower than doing so with dicts and sets, as Python makes a linear scan across the values of the list, whereas it can check the others (based on hash tables) in constant time.
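
The difference can be made visible with the standard `timeit` module. The following is only a sketch: the collection size and the number of repetitions are arbitrary choices, and the measured times will vary from machine to machine.

```python
import timeit

values = list(range(100_000))
values_set = set(values)
target = 99_999  # last element: worst case for the linear scan of the list

# Time 100 membership checks against the list and against the set
list_time = timeit.timeit(lambda: target in values, number=100)
set_time = timeit.timeit(lambda: target in values_set, number=100)

print('list: {0:.5f}s, set: {1:.5f}s'.format(list_time, set_time))
```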

#### Concatenating and combining lists

In [35]:
[4, None, 'foo'] + [7, 8, (2, 3)]

[4, None, 'foo', 7, 8, (2, 3)]

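Concatenating lists with `+` is more expensive than `extend`, because a new list has to be created and the objects copied over. Use `extend` to append elements to an existing list: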

In [36]:
x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
x

[4, None, 'foo', 7, 8, (2, 3)]

```
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)
```

is faster than

```
everything = []
for chunk in list_of_lists:
    everything = everything + chunk
```

#### Sorting

In [37]:
a = [7, 2, 5, 1, 3]
a.sort()
a

[1, 2, 3, 5, 7]

In [38]:
b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b

['He', 'saw', 'six', 'small', 'foxes']

#### Binary search and maintaining a sorted list

The built-in `bisect` module implements binary search and insertion into a sorted list. `bisect.bisect` finds the location where an element should be inserted to keep it sorted, while `bisect.insort` actually inserts the element into that location:

In [45]:
import bisect
c = [1, 2, 2, 2, 3, 4, 7]
bisect.bisect(c, 2)

4

In [46]:
c

[1, 2, 2, 2, 3, 4, 7]

In [47]:
bisect.bisect(c, 5)

6

In [48]:
c

[1, 2, 2, 2, 3, 4, 7]

In [49]:
bisect.insort(c, 6)
c

[1, 2, 2, 2, 3, 4, 6, 7]

!!! The `bisect` module functions do not check whether the list is sorted, as doing so would be computationally expensive. Thus, using them with an unsorted list will succeed without error but may lead to incorrect results. !!!
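
`bisect.bisect` is an alias for `bisect.bisect_right`: for a value that is already present, it returns the position after the existing entries. The sketch below compares it with `bisect.bisect_left` on the list used above; the names `left` and `right` are only for illustration.

```python
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

# Position before the existing 2s
left = bisect.bisect_left(c, 2)
# Position after the existing 2s (same as bisect.bisect)
right = bisect.bisect_right(c, 2)

print(left, right)
```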

#### Slicing

In [50]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]

[2, 3, 7, 5]

In [51]:
seq[3:4] = [6, 3]
seq

[7, 2, 3, 6, 3, 5, 6, 0, 1]

While the element at the start index is included, the stop index is not included, so that the number of elements in the result is stop - start.

In [53]:
seq[:5]

[7, 2, 3, 6, 3]

In [54]:
seq[3:]

[6, 3, 5, 6, 0, 1]

Negative indices slice the sequence relative to the end:

In [55]:
seq[-4:]

[5, 6, 0, 1]

In [56]:
seq[-6:-2]

[6, 3, 5, 6]

A step can also be used after a second colon to take every other element:

In [57]:
seq[::2]

[7, 3, 3, 6, 1]

A step of -1 has the useful effect of reversing a list or tuple:

In [None]:
seq[::-1]

### Built-in Sequence Functions

Python's built-in functions are efficient and should generally be preferred over self-written equivalents

#### enumerate

Keeps track of index of current item

Do-it-yourself approach:

```
i = 0
for value in collection:
   # do something with value
   i += 1
```

Built-in function `enumerate` returns a sequence of `(i, value)` tuples:

```
for i, value in enumerate(collection):
   # do something with value
```

In [58]:
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping

{'foo': 0, 'bar': 1, 'baz': 2}
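
`enumerate` also accepts an optional `start` argument if the counter should not begin at 0. A small sketch, reusing the list from above (the name `numbered` is made up for this example):

```python
some_list = ['foo', 'bar', 'baz']

# Start counting at 1 instead of 0
numbered = list(enumerate(some_list, start=1))
print(numbered)
```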

#### sorted

Returns a new sorted list from the elements of any sequence. Accepts the same arguments as the `sort` method on lists.

In [60]:
sorted([7, 1, 2, 6, 0, 3, 2])

[0, 1, 2, 2, 3, 6, 7]

In [59]:
sorted('horse race')

[' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']
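
Since `sorted` accepts the same arguments as `list.sort`, the `key` and `reverse` arguments can be used here as well. A minimal sketch with the word list from the sorting section (the variable names are chosen for this example):

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# Longest words first; the original list is left untouched
by_length_desc = sorted(words, key=len, reverse=True)
print(by_length_desc)
print(words)
```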

#### zip

`zip` "pairs" up the elements of a number of lists, tuples or other sequences. It returns an iterator of tuples, which `list` turns into a list of tuples:

In [61]:
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2)
list(zipped)

[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

The number of elements it produces is determined by the shortest sequence:

In [62]:
seq3 = [False, True]
list(zip(seq1, seq2, seq3))

[('foo', 'one', False), ('bar', 'two', True)]

Common use together with enumerate, simultaneously iterating over multiple sequences:

In [63]:
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))

0: foo, one
1: bar, two
2: baz, three


Given a “zipped” sequence, zip can be applied in a clever way to “unzip” the sequence. Another way to think about this is converting a list of rows into a list of columns. The syntax, which looks a bit magical, is:

In [65]:
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'),
            ('Curt', 'Schilling')]
first_names, last_names = zip(*pitchers)
first_names
last_names

('Ryan', 'Clemens', 'Schilling')

#### reversed

In [66]:
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

### dict

Other common names for it are hash map or associative array

```
empty_dict = {}
```

In [1]:
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
d1

{'a': 'some value', 'b': [1, 2, 3, 4]}

In [3]:
d1[7] = 'an integer'
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

In [4]:
d1['b']

[1, 2, 3, 4]

In [5]:
'b' in d1

True

In [13]:
d1[5] = 'some value'
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer', 5: 'some value'}

In [14]:
d1['dummy'] = 'another value'
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 5: 'some value',
 'dummy': 'another value'}

In [15]:
del d1[5]
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value'}

`pop` simultaneously returns the value and deletes the key

In [16]:
ret = d1.pop('dummy')
ret

'another value'

In [17]:
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

Since Python 3.7, dicts preserve insertion order (in earlier versions, no particular order was guaranteed). The `keys` and `values` methods output the keys and values in the same order as each other:

In [18]:
list(d1.keys())

['a', 'b', 7]

In [19]:
list(d1.values())

['some value', [1, 2, 3, 4], 'an integer']

Merge one dictionary into another with `update`:

In [21]:
d1.update({'b' : 'foo', 'c' : 12})
d1

{'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

#### Creating dicts from sequences

To pair up two sequences element-wise, one might write code like this:

```
mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value
```

A dict is essentially a collection of 2-tuples, so the `dict` function accepts a list of 2-tuples:

In [22]:
mapping = dict(zip(range(5), reversed(range(5))))
mapping

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

#### Default values

A common pattern:

```
if key in some_dict:
    value = some_dict[key]
else:
    value = default_value
```

The dict methods `get` and `pop` can take a default value to be returned, so that the above if-else block can be written simply as:

```
value = some_dict.get(key, default_value)
```

`get` by default will return `None` if the key is not present, while `pop` will raise an exception
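
The behavior of `get` and `pop` with and without a default can be sketched as follows. The dict contents and variable names are made up for this example.

```python
some_dict = {'a': 1}

missing_get = some_dict.get('b')         # no default: returns None
default_get = some_dict.get('b', 0)      # returns the default instead
default_pop = some_dict.pop('b', 0)      # pop with a default does not raise

try:
    some_dict.pop('b')                   # pop without a default raises
    raised = False
except KeyError:
    raised = True

print(missing_get, default_get, default_pop, raised)
```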

Categorize a list of words by their first letters as a dict of lists:

In [11]:
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The if-else block inside the loop above can be replaced with the `setdefault` method:

In [12]:
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

`defaultdict` from the `collections` module makes this even easier. Pass a type or function for generating the default value for each slot in the dict:

In [13]:
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
by_letter

defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})

#### Valid dict key types

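Values can be any Python object. Keys generally have to be immutable objects, such as scalar types (int, float, string) or tuples whose elements are all immutable as well. The technical term is hashability. It is possible to check if an object is hashable (i.e. suitable to be a key) with `hash`: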

In [28]:
hash('string')

3435059394780840327

In [29]:
hash((1, 2, (2, 3)))

1097636502276347782

In [30]:
hash((1, 2, [2, 3])) # fails because lists are mutable

TypeError: unhashable type: 'list'

A list can be converted into a tuple that can be hashed as long as its elements also can

In [31]:
d = {}
d[tuple([1, 2, 3])] = 5
d

{(1, 2, 3): 5}

### set

Unordered collection of unique elements - like dicts, but keys only

In [32]:
set([2, 2, 2, 1, 3, 3])

{1, 2, 3}

In [33]:
{2, 2, 2, 1, 3, 3}

{1, 2, 3}

In [15]:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

Mathematical `set` operations like union, intersection, difference, symmetric difference:

In [16]:
a.union(b)

{1, 2, 3, 4, 5, 6, 7, 8}

In [17]:
a | b # binary operator

{1, 2, 3, 4, 5, 6, 7, 8}

In [42]:
a.intersection(b)

{3, 4, 5}

In [43]:
a & b

{3, 4, 5}
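
Difference and symmetric difference, mentioned above but not shown, work the same way. A short sketch with the same two sets (the result names are chosen for this example):

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# Elements in a that are not in b
diff_method = a.difference(b)
diff_operator = a - b

# Elements in either a or b, but not in both
sym_method = a.symmetric_difference(b)
sym_operator = a ^ b

print(diff_operator, sym_operator)
```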

All of the logical set operations have in-place counterparts, which replace the contents of the set on the left side of the operation with the result. For very large sets, this may be more efficient:

In [44]:
c = a.copy()
c |= b
c

{1, 2, 3, 4, 5, 6, 7, 8}

In [19]:
d = a.copy()
d &= b # assign intersection of d and b to d
d

{3, 4, 5}

Like dict keys, set elements generally must be immutable. To store list-like elements in a set, convert them to tuples:

In [46]:
my_data = [1, 2, 3, 4]
my_set = {tuple(my_data)}
my_set

{(1, 2, 3, 4)}

Is the set a subset of (is contained in) or a superset of (contains all elements) another set?

In [47]:
a_set = {1, 2, 3, 4, 5}
{1, 2, 3}.issubset(a_set)

True

In [48]:
a_set.issuperset({1, 2, 3})

True

In [49]:
{1, 2, 3} == {3, 2, 1}

True

### List, Set, and Dict Comprehensions

Basic form:

```
[expr for val in collection if condition]
```

This is equivalent to:

```
result = []
for val in collection:
    if condition:
        result.append(expr)
```

Filter out strings with length <= 2 and convert to uppercase:

In [20]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# Make elements longer than 2 upper case
list_comp1 = [x.upper() for x in strings if len(x) > 2]

# Make all elements upper case
list_comp2 = [x.upper() for x in strings]

print(list_comp1, list_comp2)

['BAT', 'CAR', 'DOVE', 'PYTHON'] ['A', 'AS', 'BAT', 'CAR', 'DOVE', 'PYTHON']


dict comprehension:

```
dict_comp = {key-expr : value-expr for value in collection if condition}
```

set comprehensions look almost the same:

```
set_comp = {expr for value in collection if condition}
```

list, set and dict comprehensions can make code easier to write and read

In [53]:
unique_lengths = {len(x) for x in strings}
unique_lengths

{1, 2, 3, 4, 6}

Express the same using `map`

In [55]:
set(map(len, strings))

{1, 2, 3, 4, 6}

dict comprehension example:

In [56]:
loc_mapping = {val : index for index, val in enumerate(strings)}
loc_mapping

{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

#### Nested list comprehensions

In [22]:
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

Get a single list with all names containing two or more e's in them:

In [24]:
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count('e') >= 2]
    names_of_interest.extend(enough_es)
print(names_of_interest)

['Steven']


As single nested list comprehension:

In [58]:
result = [name for names in all_data for name in names
          if name.count('e') >= 2]
result

['Steven']

`for` parts of nested list comprehension are arranged according to the order of nesting. Any filter condition is put at the end.

Another example (flatten a list of tuples of integers into a simple list of integers):

In [59]:
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
flattened

[1, 2, 3, 4, 5, 6, 7, 8, 9]

The order of the `for` expressions is the same as if it were written as nested for loops:

```
flattened = []

for tup in some_tuples:
    for x in tup:
        flattened.append(x)
```

More than two or three levels of nesting might not make sense any more from a code readability standpoint.

A list comprehension inside a list comprehension is also a possibility:

In [60]:
[[x for x in tup] for tup in some_tuples]

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

## Functions

If you need to repeat the same or similar code more than once, write a function

Functions are declared with `def` and returned from with `return`

```
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)
```

It's possible to have several `return` statements. If there is no `return` statement, `None` is returned

Functions can have positional arguments (e.g. `x`, `y` in the function above) and keyword arguments (e.g. `z`), which are most commonly used to specify default values or optional arguments. Keyword arguments _must_ follow the positional arguments.

```
my_function(5, 6, z=0.7)
my_function(3.14, 7, 3.5)
my_function(10, 20)
```
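
Keyword names can also be used for the positional arguments, in which case the order no longer matters. The sketch below repeats the function from above so that it runs on its own; the result names are made up for this example.

```python
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# z falls back to its default value of 1.5
result_default = my_function(10, 20)

# All arguments passed by keyword, in arbitrary order
result_keywords = my_function(y=6, x=5, z=7)

print(result_default, result_keywords)
```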

### Namespaces, Scope, and Local Functions

Namespace = variable scope, e.g. global vs. local.

Variables within a function are by default assigned to the local namespace, e.g. like here:

```
def func():
    a = []
    for i in range(5):
        a.append(i)
```

Here, `a` is declared outside the function, i.e. globally:

```
a = []
def func():
    for i in range(5):
        a.append(i)
```

To assign to a variable outside of the function's scope from within the function, the variable must be declared with the `global` keyword:

In [2]:
a = None
def bind_a_variable():
    global a
    a = []
bind_a_variable()
print(a)

[]


(Use of `global` is generally discouraged. If you need a lot of global state, consider using classes instead.)

### Returning Multiple Values

Useful for data analysis or other scientific applications! The function returns one object under the hood, a tuple, which is then unpacked into the result variables:

```
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f()
```

or as one tuple with three elements:

```
return_value = f()
```

Alternatively, use a dict:

```
def f():
    a = 5
    b = 6
    c = 7
    return {'a' : a, 'b' : b, 'c' : c}
```

### Functions Are Objects

List that needs data cleaning, e.g. stripping whitespace, removing punctuation symbols, standardizing on proper capitalization...

In [3]:
states = ['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south   carolina##', 'West virginia?']

Built-in string methods + `re` standard library module:

In [4]:
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

In [5]:
clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']

Alternatively, make a list of the operations and make the function clean_strings more generic and reusable:

In [8]:
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

In [9]:
clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']

Functions can be arguments for other functions, like the built-in `map` function

In [10]:
for x in map(remove_punctuation, states):
    print(x)

   Alabama 
Georgia
Georgia
georgia
FlOrIda
south   carolina
West virginia


### Anonymous (Lambda) Functions

Functions consisting of a single expression, the result of which is the return value. Defined with the `lambda` keyword ("we are declaring an anonymous function"). Anonymous because the function object is never given an explicit `__name__` attribute.

```
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2
```

Many data transformation functions take functions as arguments. Lambda functions mean less typing.

```
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)
```

(This equals `[x * 2 for x in ints]`)

Sort a collection of strings by the number of distinct letters in each string:

In [11]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

In [12]:
strings.sort(key=lambda x: len(set(list(x))))
strings

['aaaa', 'foo', 'abab', 'bar', 'card']

The same `apply_to_list` function as above, this time executed. It takes a list and a function as input and returns a new list built with a list comprehension:

In [25]:
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

# Create a list of integers
ints = [4, 0, 1, 5, 6]

# Call the function apply_to_list() & apply a simple lambda function
apply_to_list(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

### Currying: Partial Argument Application

_Currying_ is computer science jargon (after the mathematician Haskell Curry) and means deriving new functions from existing ones by _partial argument application_:

```
def add_numbers(x, y):
    return x + y
```

Derive a new function of one variable, `add_five`, that adds 5 to its argument. The second argument to `add_numbers` is said to be _curried_:

```
add_five = lambda y: add_numbers(5, y)
```

Via built-in `functools` module, using the `partial` function:

```
from functools import partial
add_five = partial(add_numbers, 5)
```
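
Both variants can be called like any other function. A minimal sketch that puts the two side by side; the names `add_five_lambda` and `add_five_partial` are chosen here only to tell them apart:

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# Two ways of fixing the first argument to 5
add_five_lambda = lambda y: add_numbers(5, y)
add_five_partial = partial(add_numbers, 5)

print(add_five_lambda(10), add_five_partial(10))
```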

### Generators

Useful if you have a lot of data, so that you don't load everything into memory.

Check https://realpython.com/introduction-to-python-generators/ to learn more.

Iterator protocol

In [1]:
some_dict = {'a': 1, 'b': 2, 'c': 3}
for key in some_dict:
    print(key)

a
b
c


`for key in some_dict` will cause Python to create an iterator out of `some_dict`:

In [2]:
dict_iterator = iter(some_dict)
dict_iterator

<dict_keyiterator at 0x7fc1efdc69a8>

Most methods expecting a list will accept other iterable objects

In [3]:
list(dict_iterator)

['a', 'b', 'c']

Generators: 
- construct new iterable object
- return sequence of multiple results, pausing after each one until the next one is requested
- `yield` keyword instead of `return`

In [5]:
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

When the generator is called, no code is immediately executed

In [6]:
gen = squares()
gen

<generator object squares at 0x7fc1efdc0c50>

Only when you request elements from the generator does it begin executing its code

In [7]:
for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 
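
Elements can also be requested one at a time with the built-in `next`, and a generator can only be consumed once. The following sketch uses a small hypothetical generator, `countdown`, to show both points:

```python
def countdown(n):
    # Yield n, n-1, ..., 1, pausing after each value
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first = next(gen)      # runs the body up to the first yield
rest = list(gen)       # consumes the remaining values
exhausted = list(gen)  # nothing left: the generator is used up

print(first, rest, exhausted)
```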

#### Generator expressions

Analogous to list, dict and set comprehensions, but enclosed within parentheses

In [8]:
gen = (x ** 2 for x in range(100))
gen

<generator object <genexpr> at 0x7fc1efdc0e60>

Equivalent to:

```
def _make_gen():
    for x in range(100):
        yield x ** 2
gen = _make_gen()
```

Generator expressions can be used instead of list comprehensions as function arguments in many cases:

In [10]:
sum(x ** 2 for x in range(100))

328350

In [11]:
dict((i, i **2) for i in range(5))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

#### itertools module

Contains a collection of generators for many common data algorithms. `groupby` takes any sequence and a function and groups consecutive elements in the sequence by the return value of the function:

In [12]:
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
for letter, names in itertools.groupby(names, first_letter):
    print(letter, list(names)) # names is a generator

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
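
Because `groupby` only groups consecutive elements, `'A'` shows up twice in the output above. If one group per letter is wanted, one option is to sort by the same key first. This is a sketch of that idea; `grouped` is a name made up for this example.

```python
import itertools

first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# Sorting by the grouping key makes equal keys adjacent
sorted_names = sorted(names, key=first_letter)

grouped = {letter: list(group)
           for letter, group in itertools.groupby(sorted_names, first_letter)}
print(grouped)
```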


### Errors and Exception Handling

In [13]:
float('1.2345')

1.2345

vs.

In [14]:
float('something')

ValueError: could not convert string to float: 'something'

Version that fails gracefully, returning input argument:

In [15]:
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

In [16]:
attempt_float('1.2345')

1.2345

In [17]:
attempt_float('something')

'something'

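`float` can raise exceptions other than `ValueError`, e.g. a `TypeError` if the input is not a string or number. You might want to suppress only `ValueError`, since a `TypeError` may indicate a legitimate bug in the program: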

In [20]:
float((1, 2))

TypeError: float() argument must be a string or a number, not 'tuple'

In [21]:
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x

In [22]:
attempt_float((1, 2))

TypeError: float() argument must be a string or a number, not 'tuple'

Catching multiple exception types:

In [23]:
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

Example where the exception should not be suppressed but some code should be executed in any case, using `finally`

```
f = open(path, 'w')

try:
    write_to_file(f)
finally:
    f.close()
```

Alternative where code is executed only if the `try` block succeeds:

```
f = open(path, 'w')

try:
    write_to_file(f)
except:
    print('Failed')
else:
    print('Succeeded')
finally:
    f.close()
```
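
A runnable variant of this pattern is sketched below. It makes several assumptions: `write_to_file` is a hypothetical stand-in for the real writing logic, the file is created in a temporary directory, and only `OSError` is caught instead of using a bare `except`.

```python
import os
import tempfile

def write_to_file(f):
    # Hypothetical helper standing in for the real writing logic
    f.write('some text\n')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    f = open(path, 'w')
    try:
        write_to_file(f)
    except OSError:
        status = 'Failed'
    else:
        # Runs only if the try block raised no exception
        status = 'Succeeded'
    finally:
        # Runs in every case, so the file is always closed
        f.close()

print(status, f.closed)
```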

#### Exceptions in IPython

_Skipping that section_

## Files and the Operating System

In [25]:
path = 'examples/segismundo.txt'
f = open(path)

Default: read-only mode `'r'`

```
for line in f:
    pass
```

Lines come out of the file with the end-of-line (EOL) markers intact. Code to get an EOL-free list of lines often looks like this:

In [26]:
lines = [x.rstrip() for x in open(path)]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

It is important to explicitly close the file when finished. Closing the file releases its resources back to the operating system.

In [27]:
f.close()

The `with` statement makes cleanup easier, since the file is closed automatically when the block exits:

In [28]:
with open(path) as f:
    lines = [x.rstrip() for x in f]

Different file modes:
- `'w'` will open a new file, overwriting any one in its place
- `'b'` is binary mode

`read`, `seek`, `tell` 

In [30]:
f = open(path)
f.read(10) # returns certain number of characters from the file

'Sueña el r'

In [31]:
f2 = open(path, 'rb')  # Binary mode
f2.read(10)

b'Sue\xc3\xb1a el '

`tell` gives you the current position of the file handle

In [33]:
f.tell() # 11 because it took that many bytes to decode 10 characters using default encoding

11

In [34]:
f2.tell()

10

Check the default encoding with `sys`

In [35]:
import sys
sys.getdefaultencoding()

'utf-8'

`seek` changes the file handle position to the indicated byte in the file

In [36]:
f.seek(3)

3

In [37]:
f.read(1)

'ñ'

In [38]:
f.close()
f2.close()

Writing to a file: `write` or `writelines`

In [40]:
with open('tmp.txt', 'w') as handle:
    handle.writelines(x for x in open(path) if len(x) > 1)

In [41]:
with open('tmp.txt') as f:
    lines = f.readlines()
lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']

In [42]:
import os
os.remove('tmp.txt')

### Bytes and Unicode with Files

Default behavior: text mode (Python strings, i.e. Unicode). 

Example with non-ASCII characters and UTF-8 encoding:

In [43]:
with open(path) as f:
    chars = f.read(10)
chars

'Sueña el r'

UTF-8 = variable-length Unicode encoding

When some number of characters were requested above, Python read enough bytes to decode that many characters.

Opening the file in `'rb'` mode, `read` will request the exact number of bytes:

In [44]:
with open(path, 'rb') as f:
    data = f.read(10)
data

b'Sue\xc3\xb1a el '
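
The relationship between characters and bytes can be checked without a file by encoding the same ten characters by hand. A small sketch (the variable names are chosen for this example):

```python
text = 'Sueña el r'
encoded = text.encode('utf-8')

n_chars = len(text)       # number of Unicode characters
n_bytes = len(encoded)    # 'ñ' takes two bytes in UTF-8
first_ten = encoded[:10]  # what a 10-byte binary read would return

print(n_chars, n_bytes, first_ten)
```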

Decoding the bytes to a `str` object works only if each of the encoded Unicode characters is fully formed:

In [45]:
data.decode('utf8')
data[:4].decode('utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data

Text mode, combined with the `encoding` option of `open`, provides a convenient way to convert from one Unicode encoding to another:

In [46]:
sink_path = 'sink.txt'
with open(path) as source:
    with open(sink_path, 'xt', encoding='iso-8859-1') as sink:
        sink.write(source.read())
with open(sink_path, encoding='iso-8859-1') as f:
    print(f.read(10))

Sueña el r


In [47]:
os.remove(sink_path)

Beware of using `seek` when a file is opened in any mode other than binary. If the file position falls in the middle of the bytes defining a Unicode character, subsequent reads will result in an error:

In [48]:
f = open(path)
f.read(5)

'Sueña'

In [49]:
f.seek(4)

4

In [50]:
f.read(1)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte

In [51]:
f.close()

