# More Python Fundamentals

In [1]:
## Notebook settings

# multiple lines of output per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

>### Today
>
>- Writing Functions
>
>
>- Handling Exceptions
>
>
>- Classes
>
>
>- File Input/Output

### A little bit of theory: Assignment & Data types

>The following has been extracted from [section 4.1](http://www.nltk.org/book/ch04.html#back-to-the-basics) of Bird et al. (2009)

Assignment always **copies the value of an expression**. The problem: what is a value? We've already said that the value is a memory location, but this has **different consequences on different kinds of data types**. 

<br>

**This is a frequent source of errors in everyone's programs!**

<br>

In [2]:
s = "hello!"

In [3]:
type(s)

str

In [4]:
s = 1

**Case 1: Simple types & copy**

When copying **simple data types** like numbers or strings nothing counter-intuitive happens. 

In what follows, what is copied from `foo` to `bar` is the string "Monty". As a consequence, when `foo` is changed, `bar` is unaffected.

In [5]:
# copy & original value
foo = "Monty"
bar = foo
foo = "Python"
bar
foo

'Monty'

'Python'

**Case 2: Complex types & reference**

However, the value of a structured (or compound, or complex) object such as a list is actually **just a reference to the object**. 

When a list is copied, what is actually copied is not the content of the variable, but only its object reference. 

As a consequence, **updating one will affect the other**.

In [6]:
# mm... what?
foo = ["Monty", "Python"]
bar = foo
foo[0] = "Bodkin"
bar
foo

['Bodkin', 'Python']

['Bodkin', 'Python']

A list `foo` is a reference to an object stored at location 3133 (which is itself a series of pointers to other locations holding strings). 

When we assign `bar = foo`, it is just the object reference 3133 that gets copied.

![alt text](images/array-memory.png)


Source: [Bird et al. (2009)](http://www.nltk.org/book/ch04.html#back-to-the-basics).

We can use the function `id()` to show that

1. even if one of the objects inside the list is replaced

2. the memory location of the list that both names refer to is left unchanged

In [7]:
foo = ["Monty", "Python"]
bar = foo

In [8]:
print(id(bar))
print(id(foo))

4453025024
4453025024


In [9]:
foo[0] = "Bodkin"

In [10]:
print(id(bar))
print(id(foo))

4453025024
4453025024


Working with lists of lists can help to clarify the issue. 

Let's create a list of lists, in which the sublists are copies of each other:

In [11]:
empty = []
nested = [empty, empty, empty]
nested

[[], [], []]

In [12]:
# the empty list might look like a primitive object, but...
nested[1].append('Python')
nested

[['Python'], ['Python'], ['Python']]

Changing one of the items inside our nested list of lists changed them all. 

This is because each of the three elements is actually just a reference to one and the same list in memory.

Note that if we **modify the reference** itself, i.e. assign a different list to one position of `nested`, the other sublists are unaffected:

In [13]:
nested[1] = ["Monty"]
nested

[['Python'], ['Monty'], ['Python']]
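
The sharing described above can be made explicit with the `is` operator, which tests whether two names refer to the same object. The following is a minimal sketch; the names `shared` and `independent` are illustrative and do not appear elsewhere in these notes.

```python
# three references to one and the same list object
empty = []
shared = [empty, empty, empty]
print(shared[0] is shared[1])             # True: both slots point at the same object

# a list comprehension evaluates [] once per iteration,
# so each slot gets its own, separate list
independent = [[] for _ in range(3)]
print(independent[0] is independent[1])   # False

independent[1].append("Python")
print(independent)                        # [[], ['Python'], []]
```

Building the sublists inside a comprehension is one possible way to avoid the surprise when independent sublists are what you want.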

A solution to all these issues is to use the function `copy.deepcopy()` from the module `copy`, which copies the whole structure, including the nested objects, rather than just the object references:

In [14]:
# import the module copy
import copy

In [15]:
# deep copy
foo = ["Monty", "Python"]
bar = copy.deepcopy(foo)

In [16]:
print(id(bar))
print(id(foo))

4452710016
4452708800


In [17]:
# Let's see... yes, much better (depending on your goals, of course)
foo[0] = "Bodkin"
bar

['Monty', 'Python']

In [18]:
print(id(bar))
print(id(foo))

4452710016
4452708800
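
With nested lists the difference between copying references and copying the structure becomes visible. The sketch below contrasts `copy.deepcopy()` with `copy.copy()`, the shallow copy function of the same module, which is not used elsewhere in these notes; the variable names are illustrative.

```python
import copy

nested = [["Monty"], ["Python"]]
shallow = copy.copy(nested)      # new outer list, same inner lists
deep = copy.deepcopy(nested)     # new outer list and new inner lists

nested[0].append("Bodkin")

print(shallow[0])   # ['Monty', 'Bodkin']: the inner list is still shared
print(deep[0])      # ['Monty']: fully independent
```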


## Functions

Functions are constructs that allow us to reuse the same portion of code more than once in a program. 

The alternative way to obtain the same results without functions would be to copy the same portion of code every time it is needed. 

Functions in Python are defined by a `def` statement, following this template:

```python
def function_name(parameters):
    """
    docstring
    """
    function_body
    return result
```

> The list of the parameters required by the function is given in round brackets right after the name of the function. Each function may have **zero or more** parameters. When a function is called, the values passed for its parameters are called **arguments**.
>
> The (optional) documentation string should be placed immediately after the function definition line. There are many ways to format your **docstring**: [PEP 257](https://www.python.org/dev/peps/pep-0257/) describes the general conventions and PEP 287 proposes reStructuredText as markup, but more formats are available. See [this tutorial](http://daouzli.com/blog/docstring.html) for an introduction to the topic.
>
> The **indented** function body contains all the statements that are executed every time the function is called. When a `return` statement is executed, the function exits and its output is the argument of the `return` statement. 
>
> When there is no `return` statement in the function body, or when a `return` statement with no argument is executed, the function returns `None`.

For instance, the following function calculates the number of characters in a string:

In [19]:
def chars(s):
    """
    Calculates the number of characters in a string
    """
    if not type(s) is str:
        return "This is not a string!"
    r = len(s)
    return r

The docstring is stored in the function's `__doc__` attribute and can be accessed by using the `help()` function or the IPython `?`.

In [20]:
# don't use this, it is just to make the point
print(chars.__doc__)


    Calculates the number of characters in a string
    


In [21]:
# use one of these two
help(chars)
chars?

Help on function chars in module __main__:

chars(s)
    Calculates the number of characters in a string



In order to execute the code included in a function, you have to **call the function**, either in your script or in the interactive shell. For instance:

In [22]:
chars("voodoo")

6

In [23]:
chars(1979)

'This is not a string!'
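
To see what happens when a function has no `return` statement, consider the following minimal sketch. The function `greet` is a made-up example, not part of the original notebook.

```python
def greet(name):
    """
    Prints a greeting; there is no return statement
    """
    print("Hello, " + name + "!")

result = greet("Monty")   # prints: Hello, Monty!
print(result)             # None
```

Printing a value and returning it are different things: only a returned value can be stored in a variable and reused later.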

### Parameters

A function can receive any number of parameters:

In [24]:
def higher(n1, n2, n3):
    """
    find the highest of three numbers
    """
    if n1 >= n2 and n1 >= n3:
        return n1
    if n2 >= n3:
        return n2
    else:
        return n3

In [25]:
# a parameter can be passed either by position
higher(4, 2, 8)

8

In [26]:
# or by name
higher(n3 = 8, n1 = 4, n2 = 2)

8

#### Optional Parameters

In some situations it may be useful to have a default parameter value, which is used when a call leaves an argument **unspecified**.

In [27]:
def higher(n1, n2 = 0, n3 = 0):
    """
    find the highest of three numbers
    """
    if n1 >= n2 and n1 >= n3:
        return n1
    if n2 >= n3:
        return n2
    else:
        return n3

In [28]:
higher(9,4)

9

In [29]:
higher(-6)

0

#### Arbitrary Number of Parameters

A different situation is when we want our function to accept an unspecified number of arguments. Python functions admit the so-called "tuple references", marked by an asterisk `*` in front of a parameter name: all the extra positional arguments of a call are collected in this parameter, which becomes a tuple.

In [30]:
def print_params(*params):
    print ("your input:")
    print (params)

In [31]:
print_params("Down from my ceiling", "Drips great noise", "It drips on my head through a hole in the roof") 

your input:
('Down from my ceiling', 'Drips great noise', 'It drips on my head through a hole in the roof')
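
Since the starred parameter is an ordinary tuple, it can be looped over like any other sequence. Here is a minimal sketch, assuming a hypothetical function `highest_of` that requires at least one number and accepts any number of additional ones.

```python
def highest_of(first, *others):
    """
    Returns the highest of one or more numbers
    """
    best = first
    for value in others:      # others is a tuple, possibly empty
        if value > best:
            best = value
    return best

print(highest_of(4))             # 4
print(highest_of(4, 2, 8, 5))    # 8

# a leading * in a call unpacks a sequence into separate arguments
numbers = [3, 9, 1]
print(highest_of(*numbers))      # 9
```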


#### A Note on Parameter Passing

The way in which variables work in Python has a great influence on how they are passed to functions, and on how they are affected by them. 

The following function, for instance, accepts two parameters, a string and a list. 

But while the string is left unaffected by the function, the list is changed. Why?

In [32]:
def reference(another_string, another_list):
    another_string = "new string value"
    another_list.append("new list value")

In [33]:
a_string = "old string value"
a_list = ["old list value"]

reference(a_string, a_list)    
    
print (a_string)
print (a_list)

old string value
['old list value', 'new list value']


The effects of our function on the two variables can be understood by recalling how assignment works in Python. 

In [34]:
# this is what happened to the variable a_string
a_string = "old string value"
another_string = a_string
another_string = "new string value"
a_string

'old string value'

In [35]:
# this is what happened to the variable a_list
a_list = ["old list value"]
another_list = a_list
another_list.append("new list value")
a_list

['old list value', 'new list value']
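
When the caller's list should stay untouched, one option is to let the function work on a copy. The following is a sketch under that assumption; `add_item` is a hypothetical helper and `b_list` an illustrative name.

```python
def add_item(items, new_item):
    """
    Returns a new list with new_item added, leaving the input untouched
    """
    result = list(items)      # shallow copy: a new list object
    result.append(new_item)
    return result

a_list = ["old list value"]
b_list = add_item(a_list, "new list value")

print(a_list)   # ['old list value']
print(b_list)   # ['old list value', 'new list value']
```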

### Variables Scope

The scope of a variable determines where in the program it can be accessed, and depends on where it has been declared.

**Variables declared inside a function are visible only inside the function itself**. That is, they can be used only by the code inside that function. 

Global variables, on the other hand, are those declared outside any function body. 

When looking for the object a name refers to, the Python interpreter follows the **LGB rule** (with nested functions, the scopes of the enclosing functions are searched between the local and the global step):

- LOCAL: first looks in the names locally defined by the function


- GLOBAL: if nothing is found, looks in the names defined globally in the module


- BUILT-IN: if nothing is found, checks if the name is a Python built-in

In [36]:
age = 17  # global variable

def try_to_buy_a_beer():
    age = 18  # local variable
    print("(inside the bar) I'm " + str(age))

try_to_buy_a_beer()
print("(outside the bar) I'm " + str(age))

(inside the bar) I'm 18
(outside the bar) I'm 17
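
The three lookup steps can also be observed in a single example. This is a minimal sketch with made-up function names (`shout`, `whisper`).

```python
greeting = "hello"            # global name

def shout():
    # 'greeting' is not defined locally, so it is found in the global scope;
    # 'len' is neither local nor global, so it is found among the built-ins
    return greeting.upper() + "!" * len(greeting)

def whisper():
    greeting = "psst"         # local name, hides the global one inside whisper()
    return greeting

print(shout())     # HELLO!!!!!
print(whisper())   # psst
print(greeting)    # hello
```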


### (Extra) Recursive Functions

A recursive function is a function that calls itself in its body, usually returning the output of this function call. 

That is, a recursive function must satisfy two conditions:

- it must have a state in which recursion terminates (**base case**) (one or more)


- it must call itself (**recursive call**) (one or more "in parallel")

The Fibonacci sequence is a classic example of recursion: in it every number is found by adding up the two numbers before it (apart from the first two members of the sequence, which are $0$ and $1$ by definition):

$$
f(n) = \begin{cases}
               0               & n = 0\\
               1               & n = 1\\
               f(n-1) + f(n-2) & \text{otherwise}
           \end{cases}
$$


The following sequence of numbers is generated according to this rule:

$$\{\;0,\;1,\;1,\;2,\;3,\;5,\;8,\;13,\;21,\;34,\;55,\;89,\;144,\;233,\;377,\;610,\;987,\;1597,\;2584,\;4181,\;6765\;\ldots\;\}$$

We can calculate the Fibonacci numbers by using a recursive function that mimics the definition of the Fibonacci sequence:

In [37]:
def fibonacci_recursive_illustrated(n):
    """
    recursive function to calculate a position in the Fibonacci sequence
    """
    print ("function has been called for n = " + str(n))
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        print ("- calling f for: {x} and {y}".format(x = n-1, y = n-2))
        res = fibonacci_recursive_illustrated(n-1) + fibonacci_recursive_illustrated(n-2)
        print (">> intermediate result: f({x}) + f({y}) = {r}".format(x = n-1, y = n-2, r = res))
        return res

fibonacci_recursive_illustrated(5)

function has been called for n = 5
- calling f for: 4 and 3
function has been called for n = 4
- calling f for: 3 and 2
function has been called for n = 3
- calling f for: 2 and 1
function has been called for n = 2
- calling f for: 1 and 0
function has been called for n = 1
function has been called for n = 0
>> intermediate result: f(1) + f(0) = 1
function has been called for n = 1
>> intermediate result: f(2) + f(1) = 2
function has been called for n = 2
- calling f for: 1 and 0
function has been called for n = 1
function has been called for n = 0
>> intermediate result: f(1) + f(0) = 1
>> intermediate result: f(3) + f(2) = 3
function has been called for n = 3
- calling f for: 2 and 1
function has been called for n = 2
- calling f for: 1 and 0
function has been called for n = 1
function has been called for n = 0
>> intermediate result: f(1) + f(0) = 1
function has been called for n = 1
>> intermediate result: f(2) + f(1) = 2
>> intermediate result: f(4) + f(3) = 5


5

![alt text](images/fibonacci_tree.png)

Notwithstanding their simplicity, recursive functions can be **inefficient**. Indeed, they require information to be stacked every time a function is called, so that once the function is completed, execution continues from where it left off. 

Note, moreover, that our recursive implementation performs the same calculation many times. In our example, `fibonacci_recursive_illustrated(3)` is called 2 times, `fibonacci_recursive_illustrated(2)` 3 times and so forth. 

Taken together, this sounds pretty inefficient.

### Benchmarking Python Code

Testing the execution time of a function is quite straightforward using the `%timeit` and `%%timeit` IPython magics. 

Let's test our recursive function to calculate a position in the Fibonacci sequence

In [38]:
# let's rewrite our function without the print statements
def fibonacci_recursive(n):
    """
    recursive function to calculate a position in the Fibonacci sequence
    """
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        res = fibonacci_recursive(n-1) + fibonacci_recursive(n-2)
        return res

In [39]:
%timeit fibonacci_recursive(20)

2.78 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


This magic command executes a statement multiple times in order to get reliable estimates. The actual number of calls can be specified with the parameters:

- `-n N`: execute the given statement N times in each run (loop)

- `-r M`: repeat the run M times; the reported mean and standard deviation are computed over these M runs

In [40]:
# let's play with these options
%timeit -n 50 -r 10 fibonacci_recursive(20)

2.58 ms ± 72.8 µs per loop (mean ± std. dev. of 10 runs, 50 loops each)
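
Outside IPython, a comparable measurement can be taken with the standard `timeit` module. The sketch below is an approximation of what the magic does, not a replacement for it. It uses a compact variant of the recursive function and a smaller input (15) only to keep the run short; `timings` and `per_call` are illustrative names.

```python
import timeit

def fibonacci_recursive(n):
    """
    recursive function to calculate a position in the Fibonacci sequence
    """
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)

# number plays the role of -n, repeat the role of -r
timings = timeit.repeat(lambda: fibonacci_recursive(15), number=50, repeat=5)

# each entry is the total time (in seconds) of one run of 50 calls
per_call = [t / 50 for t in timings]
print("mean per call: {:.2e} s".format(sum(per_call) / len(per_call)))
```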


Now let's test our hypothesis about the recursive functions: i.e. that they are inefficient and that an iterative solution would be way faster.

Let's implement an iterative version of our fibonacci function and test it:

In [41]:
def fibonacci_iterative(n):
    """
    iterative function to calculate a position in the Fibonacci sequence
    """
    n1 = 0
    n2 = 1
    for i in range(n):
        n1, n2 = n2, n1 + n2
    return n1

fibonacci_iterative(5)

5

In [42]:
%timeit -n 50 -r 10 fibonacci_iterative(20)

1.19 µs ± 59.1 ns per loop (mean ± std. dev. of 10 runs, 50 loops each)


**Are there any fixes for the efficiency problem of recursive functions?**

Sure. Two relevant keywords here are: 

* Tail recursion
* Dynamic programming

For more, see **[Section 4.7](http://www.nltk.org/book/ch04.html)** of: Bird, S., E. Klein & E. Loper (2009). Natural Language Processing with Python. Analyzing Text with the Natural Language Toolkit, O'Reilly.
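
As a taste of the dynamic programming idea, the sketch below stores every result the first time it is computed, so that each position of the sequence is calculated only once. It relies on `functools.lru_cache` from the standard library, which is one possible way to do this and not necessarily the one used in the book; the name `fibonacci_memoized` is made up.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci_memoized(n):
    """
    recursive Fibonacci that remembers results it has already computed
    """
    if n < 2:
        return n
    return fibonacci_memoized(n - 1) + fibonacci_memoized(n - 2)

print(fibonacci_memoized(20))   # 6765
```

The function body is still the recursive definition. Only the repeated work is removed, which is where most of the time of the plain recursive version goes.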

---

#### Quiz

* Write a function that takes a string as input and returns a dictionary of tokens (sequences of characters separated by whitespace) as keys, and the number of times they occur as values. The `split()` method for strings might be useful.
* Write an alternative version using just comprehensions.
* Time both.

In [43]:
# your code here

---

### (Extra) Profiling Python Code

While being optimal for **benchmarking** (i.e. for comparing different portions of code), the `%timeit` magic isn't suitable for **profiling** (i.e. for investigating a program's behavior). Profiling is vital when you want to identify the bottlenecks of your code in order to speed it up.

The `%prun` and `%%prun` IPython magics can be used to run code through the Python code profiler.

In [44]:
import random
a_dictionary = dict((key, None) for key in random.sample(range(6000000), 600000)) 

In [45]:
def play_with_dict(input_dictionary, n):
    """
    useless code in which you delete all key[x] entries from a dictionary, where 0 <= x < n 
    """
    for a_number in range(n):
        if a_number in input_dictionary.keys():
            del(input_dictionary[a_number])
        else:
            pass

In [46]:
%timeit play_with_dict(a_dictionary, 15)

1.73 µs ± 166 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Let's use the magic `%prun` to see if there may be a bottleneck somewhere:

In [47]:
# the option -s to manipulate how our results are ordered (for more info `%prun?`, as usual)
%prun -s cumtime play_with_dict(a_dictionary, 15)
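
In a plain Python script, a similar report can be produced with the standard `cProfile` and `pstats` modules. The following is a sketch under a few assumptions: the dictionary is much smaller than the one above to keep the example quick, and the function is a slightly simplified restatement of `play_with_dict`.

```python
import cProfile
import pstats
import random

def play_with_dict(input_dictionary, n):
    """
    deletes all the keys x from a dictionary, where 0 <= x < n
    """
    for a_number in range(n):
        if a_number in input_dictionary:
            del input_dictionary[a_number]

# smaller than the dictionary used above, for illustration only
a_dictionary = dict((key, None) for key in random.sample(range(60000), 6000))

profiler = cProfile.Profile()
profiler.enable()
play_with_dict(a_dictionary, 15)
profiler.disable()

# sort by cumulative time, as -s cumtime does, and show the top 5 entries
stats = pstats.Stats(profiler)
stats.sort_stats("cumtime").print_stats(5)
```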

 

> **Recommended Reading**
>
> We won't discuss optimization techniques in this course, but having a look at the [Performance Tips](https://wiki.python.org/moin/PythonSpeed/PerformanceTips) on `python.org` won't hurt.

### (Extra) Debugging

**(For now, just an overview of the commands. Will be used later.)**

To debug some code after it crashed (**post-mortem debugging**), execute the `%debug` magic to launch the interactive debugger.

Among the [commands](https://docs.python.org/2.7/library/pdb.html#debugger-commands) recognized by the debugger, the most important are:

- `w` : print the stack trace, with the most recent frame at the bottom. An arrow indicates the current frame
- `l` : show source code. Without arguments, list 11 lines around the current line. With one argument, list 11 lines around that line.
- `u`/`d` : move up/down in the stack (the stack is the list of all active functions at a given point of the execution)
- `p varname` : print the value of `varname`
- `a` : print the arguments of the current function
- `q` : quit from the debugger
- `h` : print help

## Handling Exceptions in Python

Exceptions are anomalous conditions that modify the intended flow of a program. 

When not foreseen by the programmer, exceptions usually cause the program to end abruptly.

Error handling is the process of responding to such exceptional occurrences by saving the state of execution of the program and executing a specialized portion of code, the so-called **exception handler**.

The simplest way to handle exceptions in Python is to rely on a `try-except` block, i.e. a block of code with the following form:

```python
try:
    code_in_which_the_exception_may_occur
except [ExceptionName]:
    code_executed_in_case_of_error
```

See the documentation for the list of [built-in Python exceptions](https://docs.python.org/2/library/exceptions.html).

In [48]:
# Avoid at all costs!
(x, y) = (9, 0)

x/y

ZeroDivisionError: division by zero

In [49]:
# a common exception is the division by zero
(x, y) = (9, 0)

try:
    x / y
except ZeroDivisionError:
    print ("are you seriously trying to divide by zero?")

are you seriously trying to divide by zero?


The error message from the interpreter can be captured with the following syntax:

In [50]:
(x, y) = (9, 0)

try:
    x / y
except ZeroDivisionError as e:
    print (e)

division by zero


Two optional clauses of the `try-except` block are:

- `else`: a portion of code that is executed if and only if the `try` clause didn't raise any exception


- `finally`: a portion of code that is executed under all circumstances

In [51]:
for x, y in [(9,0), (5,4)]:
    print(x, y, end=" ")
    try:
        res = x / y
    except ZeroDivisionError:
        print("are you seriously trying to divide by zero?", end=" ")
    else:
        print("the result is: " + str(res), end=" ")
    finally: 
        print("done!")

9 0 are you seriously trying to divide by zero? done!
5 4 the result is: 1.25 done!
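
A `try` statement can have several `except` clauses, one for each kind of exception we want to handle. Here is a minimal sketch; the function `safe_divide` and its choice of returning `None` on failure are assumptions made for illustration.

```python
def safe_divide(x, y):
    """
    Returns x / y, or None when the division cannot be carried out
    """
    try:
        res = x / y
    except ZeroDivisionError:
        print("cannot divide by zero")
        return None
    except TypeError as e:
        # e.g. when one of the operands is a string
        print("wrong operand type: " + str(e))
        return None
    else:
        return res

print(safe_divide(5, 4))      # 1.25
print(safe_divide(9, 0))      # prints the warning, then None
print(safe_divide("9", 3))    # prints the type error message, then None
```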


## Classes

A **class** is a user-programmed Python type.



In [52]:
class Room:
    pass

In [53]:
i = int()
type(i)

int

In [54]:
r = Room()
type(r)

__main__.Room

We say that an **object** is an **instance** of a particular **class**.

`__main__` is the name of the scope in which top-level code executes, where we've defined the class Room.

We can add properties to objects directly:

In [55]:
r.size = 100

In [56]:
r.size

100

In [57]:
r2 = Room()
r2.size

AttributeError: 'Room' object has no attribute 'size'

We can add properties and functions to classes directly, in such a way that all objects of that type will have them. We do this via **constructors**.

Classes are thus mostly used to express complex behaviour that many objects share.

In [58]:
class Room(object):
    def __init__(self, name, exits, capacity, occupants=None):
        self.name = name
        # Note the default argument: occupants start empty. A fresh list is
        # created for each room, because a default of [] would be shared by all rooms
        self.occupants = [] if occupants is None else occupants
        self.exits = exits
        self.capacity = capacity

    def overfull(self):
        return len(self.occupants) > self.capacity

In [59]:
r3 = Room("kitchen",2,3)
r4 = Room("bathroom",1,1)

In [60]:
r4.occupants.append("Bill")
r4.occupants.append("Sue")

In [61]:
r4.overfull()

True
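
One thing to watch for when a constructor takes a list with a default value: default values are created only once, when the function is defined, so a default of `[]` would be one list shared by every object, for the same reasons discussed in the section on assignment. The sketch below shows a common way around it, using a hypothetical `Basket` class.

```python
class Basket:
    def __init__(self, items=None):
        # a default of [] would be created once and shared by every Basket;
        # using None and building the list here gives each object its own
        self.items = [] if items is None else items

b1 = Basket()
b2 = Basket()
b1.items.append("apple")

print(b1.items)   # ['apple']
print(b2.items)   # []
```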

### Object-oriented design

In building a computer system to model a problem, therefore, we often want to make:

* classes for each kind of thing in our system
* methods for each capability of that kind
* properties (defined in a constructor) for each piece of information describing that kind

For more: https://alan-turing-institute.github.io/rsd-engineeringcourse/ch00python/101Classes.html

---

#### Quiz

* Write a class representing a university course. It might model things such as:
    - title, abstract, syllabus, room, time, ..
    - enrolment capacity
    - students and their grades?
    - ...

In [62]:
# your code here

---

## File Input/Output

A huge portion of our input data will come from files on disk, and a lot of our results will be saved to files. So, mastering the art of reading and writing is crucial even in programming.

The following code opens a file in our filesystem, prints the first 10 lines and closes the file:

In [63]:
infile = open('data/adams-hhgttg.txt', 'r')
for i, line in enumerate(infile):
    if i == 10:
        break
    print(line)
infile.close()

The Hitch Hiker's Guide to the Galaxy 



for Jonny Brock and Clare Gorst 

and all other Arlingtoniansfor tea, sympathy, and a sofa







Far out in the uncharted backwaters of the unfashionable  end  of

the  western  spiral  arm  of  the Galaxy lies a small unregarded

yellow sun.



The key step here is the call to the `open()` function, which opens a file and returns a **file object**. It is commonly used with the following two parameters:

- **filename**: the name of the file to open

- **mode**: the way in which we want to open the file. The most commonly used values are `r` for **reading** (default), `w` for **writing** (overwriting existing files), and `a` for **appending**. (Note that [the documentation](https://docs.python.org/2/library/functions.html#open) reports other mode values that may be necessary in some exceptional cases)

>**IMPORTANT**: every opened file should be **closed** by calling its `close()` method before the end of the program, or the file could be unavailable for subsequent manipulations or to other programs.

There are other ways to read a text file, among which the use of the methods `read()` and `readlines()`, that would simplify the above code into:

```python
infile = open('data/adams-hhgttg.txt', 'rt')
lines = infile.readlines()      # loads every line of the file into a list
print("".join(lines[:10]))      # the first 10 lines
infile.close()
```

However, these methods **read the whole file at once**, thus creating huge problems when working with big corpora.

In the solution we adopt here the input file is read line by line, so that at any given moment **only one line of text** is loaded into memory. 

---

Writing an output file in Python has a structure that is close to the one we've used in our reading examples above. The main differences are 

- the specification of the **mode** `w`


- the use of the method `write()` for each line of text

In [64]:
outfile = open('stuff/output-test-1.txt', 'w')
outfile.write("My name is:")
outfile.write("John")
outfile.close()

11

4

> When writing line by line, it's up to you to take care of the **newlines** by appending `\n` to each line

In [65]:
outfile = open('stuff/output-test-2.txt', 'w')
outfile.write("My name is:\n")
outfile.write("Alexander")
outfile.close()

12

9

### The `with` statement

A `with` statement is used to wrap the execution of a block with methods defined by a **context manager** (where a context manager is a class that implements `__enter__` and `__exit__` methods). A statement of this sort follows the template:

```python
with controlled_execution() as variable_name:
    code
```

where `controlled_execution()` is an arbitrary expression, `variable_name` is a single assignment target and `code` is the encapsulated code.

What the Python interpreter does when it meets one such statement is:

- it evaluates the expression `controlled_execution()` to obtain a context manager


- it calls the `__enter__` method on the resulting value


- it assigns whatever `__enter__` returns to the variable given by `as`


- it executes the code in the body and...


- ... **no matter what happens**, it calls the `__exit__` method on the resulting value
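
The steps above can be traced with a toy context manager. The sketch below uses a made-up `Announcer` class that simply records the order in which things happen, including the case in which the body raises an error.

```python
class Announcer:
    """
    Toy context manager that records what happens and when
    """
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self               # this is what 'as' binds to

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False              # do not swallow exceptions

announcer = Announcer()
try:
    with announcer as a:
        a.events.append("body")
        1 / 0                     # an error inside the body...
except ZeroDivisionError:
    pass

print(announcer.events)   # ['enter', 'body', 'exit']: __exit__ still ran
```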

Since the file objects returned by `open()` have `__enter__` and `__exit__` methods, they are **context managers**. This allows us to **open a file** adopting the following syntax:

```python
with open(file_name, mode) as opened_file_name:
    code
```

that will open the file, keep it open as long as the code is executing and automatically close it as soon as the interpreter stops executing the nested code, **no matter what happens**.

Using this construction to open files has three major advantages:

- there is no need to explicitly close the file (the file is automatically closed as soon as the nested code exits)


- the file is closed automatically even when unhandled errors cause the program to crash


- the code is way clearer (it is trivial to identify where in the code a file is opened)

In [66]:
# that's how I usually open files
with open("data/adams-hhgttg.txt", "rb") as infile:
    for i, line in enumerate(infile):
        if i == 10:
            break
        print(line)

b"The Hitch Hiker's Guide to the Galaxy \n"
b'\n'
b'for Jonny Brock and Clare Gorst \n'
b'and all other Arlingtoniansfor tea, sympathy, and a sofa\n'
b'\n'
b'\n'
b'\n'
b'Far out in the uncharted backwaters of the unfashionable  end  of\n'
b'the  western  spiral  arm  of  the Galaxy lies a small unregarded\n'
b'yellow sun.\n'
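
The same construction works for writing. Here is a minimal sketch that writes a few lines and reads them back. It uses a temporary directory created with `tempfile.mkdtemp()` so that no existing file is touched; the file name and the words written are arbitrary.

```python
import os
import tempfile

# a temporary location, so that no existing file is touched
path = os.path.join(tempfile.mkdtemp(), "output-test.txt")

with open(path, "w") as outfile:
    for word in ["My", "name", "is", "John"]:
        outfile.write(word + "\n")     # newlines are our responsibility

print(outfile.closed)                  # True: closed on leaving the block

with open(path, "r") as infile:
    lines = [line.rstrip("\n") for line in infile]

print(lines)                           # ['My', 'name', 'is', 'John']
```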


---

#### For Next time:

- Read **[Sections 4.6, 4.8](http://www.nltk.org/book/ch04.html)** of: Bird, S., E. Klein & E. Loper (2009). Natural Language Processing with Python. Analyzing Text with the Natural Language Toolkit, O'Reilly.
- Do the exercises below.

### Exercise 1.

Use the function `id()` to explain what happens to the variable `pizza` in the following code:

In [67]:
def eat(food):
    food.append("ham")
    food = ["pasta", "sugo"]

pizza = ["base"]
eat(pizza)
print(pizza)

['base', 'ham']


In [68]:
# your code here

### Exercise 2.

The [factorial](https://en.wikipedia.org/wiki/Factorial) of an integer $n$, defined as:

$$
n! = \begin{cases}
               1               & n = 1\\
               n * (n-1)! & n > 1
           \end{cases}
$$

is the product of all positive integers less than or equal to $n$. For example:

$$4! = 4 * 3 * 2 * 1$$

$$3! = 3 * 2 * 1$$

The factorial operation can be implemented in Python both as a recursive function and as an iterative function. 

Write an example of both and benchmark them.

In [69]:
# your code here

### Exercise 3.

Read the file `data/adams-hhgttg.txt` and:

- Count the number of lines in the file

- Count the number of non-empty lines

- Read each line of the input file, remove its newline character and write it to file `stuff/adams-output.txt`

- Compute the average number of alphanumeric characters per line

- Identify all the unique words used in the text (no duplicates!) and write them in a text file called `stuff/lexicon.txt` (one word per line)

In [70]:
# your code here

---