# Week 7: Iterators, generators, and introduction to Pandas

This week, we introduce **iterators and generators** as tools to generate sequences. We also introduce the module **pandas**, a very popular package for data analysis.

The best way to learn programming is to write code. Don't hesitate to edit the code in the example cells, or add your own code, to test your understanding. You will find practice exercises throughout the notebook, denoted by 🚩 **Exercise $x$**.

In [None]:
from show_solutions import show, initialise_path
show = initialise_path(show, '../solutions/w07_solutions.md')

---

## Iterables and iterators

**Iterables** in Python are objects whose elements can be visited one at a time. For instance, when we write a loop `for ... in ...`, the object we are looping over is an iterable.
- Some of the examples of iterables that we have seen so far are lists, strings, tuples, and ranges.
- On the other hand, for example, `float`, `bool`, or `int` are not iterables (try starting a loop with `for i in 5:` and watch the error message!)
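
One quick way to check whether an object is iterable is to pass it to the built-in `iter()` function, which raises a `TypeError` for non-iterables. The helper name `is_iterable` below is only an illustrative choice, not a built-in:

```python
def is_iterable(obj):
    """Return True if iter() accepts obj, False otherwise."""
    try:
        iter(obj)
        return True
    except TypeError:
        return False

# Lists, strings, tuples and ranges are iterable; numbers and Booleans are not
for candidate in [[1, 2, 3], 'hello', (1, 2), range(3), 5, 2.5, True]:
    print(f'{candidate!r}: {is_iterable(candidate)}')
```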

**Iterators** are value producers which yield successive values from their associated iterable. In simple words, they are like bookmarks: whenever we call them, they return the value of a particular element of the iterable they are associated with, and then move the bookmark to the next element. The next time we call them, they pick the new element and move the bookmark forward again.  

Below we create a list of powers of two using a list comprehension, and then create an iterator associated with this list using the function `iter()`. Once you have created your iterator, you can call the function `next()` on it: it yields the value of the element that the iterator is pointing at, and moves the bookmark forward. The next time you call `next()`, you get the next element of your list.

In [None]:
# create the list of powers of 2 from 2**0 to 2**11
powers_two = [2**n for n in range(12)]
print(powers_two)

# create an iterator associated with 'powers_two'
my_iter = iter(powers_two)

# call the iterator twice, starting from the beginning, and see what happens
print(next(my_iter))
print(next(my_iter))

In the following loop, we retrieve the remaining values of `powers_two`, starting from where the iterator left off.

In [None]:
for i in range(10):
    print(f'iteration number {i}, and iterator is returning {next(my_iter)}')

What will happen if you call `next()` one more time? Does the error make sense to you?

In [None]:
print(next(my_iter))
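
A `for` loop relies on this same mechanism: it calls `iter()` on the iterable, then calls `next()` repeatedly until a `StopIteration` exception is raised. Here is a rough sketch of the equivalent `while` loop, using a short example list (`letters` and `collected` are names chosen for illustration):

```python
letters = ['a', 'b', 'c']

# Roughly what 'for item in letters:' does behind the scenes
letter_iter = iter(letters)
collected = []
while True:
    try:
        item = next(letter_iter)
    except StopIteration:
        # The iterator is exhausted, so leave the loop
        break
    collected.append(item)

print(collected)
```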

The advantage of iterators is that they use very little memory: they only need to keep track of where they are and how to produce the next value, as opposed to iterables like lists, which store all of their values at once!
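
As a rough illustration, `sys.getsizeof()` reports the size of an object in bytes (for a list, this covers the container itself and not the elements it refers to). The exact numbers depend on your Python version and platform, so treat them as indicative only:

```python
import sys

n = 1_000_000
big_list = list(range(n))      # stores references to all n values
lazy_iter = iter(range(n))     # only tracks the current position

print(f'list:     {sys.getsizeof(big_list)} bytes')
print(f'iterator: {sys.getsizeof(lazy_iter)} bytes')
```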

---
**📚 Learn more:**

- [Iterators](https://docs.python.org/3/tutorial/classes.html#iterators) - The Python tutorial

From the Python documentation:
- [`iter()`](https://docs.python.org/3/library/functions.html#iter)
- [Iterator](https://docs.python.org/3/glossary.html#term-iterator)
- [`next()`](https://docs.python.org/3/library/functions.html#next)
- [`StopIteration` exception](https://docs.python.org/3/library/exceptions.html#StopIteration)

---

## Generators

Another object type that is closely related to iterators is the **generator**. Generators are often counted among Python's strengths, as not every programming language offers an equivalent feature.

Generators are a way of defining a procedure to get the next number in a sequence. They remember where they left off in the sequence, and generate the next number each time they are called. To understand them, let's consider the *infinite* sequence of square numbers

$$
\{1,\ 4,\ 9,\ 16,\ 25,\ \dots,\ n^2,\ \dots\}
$$

Using functions and lists, we cannot produce an infinite sequence. At best, we can write a function that takes an argument $N$ as an input and returns a list of square numbers up to $N^2$. This is not ideal! We want an object that can remember the last square number it calculated, and then generate the next square number the next time we call it. In this way, there is no limit on the number of times we can call this sequence producer (and we don't take up much memory). The object that can do this for us is the **generator**.


To define a generator, we write something that looks a little like a function, but instead of the `return` keyword we use `yield` to give a result from the generator.

Note that each time a function reaches a `return` statement, it returns its output and Python discards the values stored in the function's local variables. The next time we call the function, it starts from the beginning and does not remember anything from the last time it was called. A generator is very different though:
- The first time it is called (using `next()`), it starts from the beginning and executes commands until it reaches the first `yield` statement. At this point, it stops and yields the value.
- The second time it is called, it starts from the previous `yield` statement and continues executing commands until it reaches the next `yield` statement. Then it stops again and yields the new value.
- This procedure goes on each time we call the generator.
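
To watch this pause-and-resume behaviour, here is a small sketch of a generator with `print()` calls between its `yield` statements (the name `chatty_gen` is made up for this example):

```python
def chatty_gen():
    print('Starting')
    yield 1
    print('Resuming after the first yield')
    yield 2
    print('Resuming after the second yield')

gen = chatty_gen()
first = next(gen)    # prints 'Starting', then pauses at 'yield 1'
second = next(gen)   # prints 'Resuming after the first yield', then pauses at 'yield 2'
print(first, second)
```

A third call to `next(gen)` would print the last message and then raise `StopIteration`, because the body of the generator function ends without reaching another `yield`.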

In the example below we create the infinite sequence of square numbers and examine it:

In [None]:
def squares_func():
    num = 1
    while True:
        yield num**2
        num += 1

# squares_func is a generator function:
print(type(squares_func))

# But the output of squares_func is a generator iterator:
squares_gen = squares_func()
print(type(squares_gen))

print('First for-loop:')
for i in range(5):
    print(f'My square number is {next(squares_gen)}')
    

# Do other things in the code
print('\nI am doing other things...')

print(f'\nI am printing the next square number after the first for-loop: {next(squares_gen)}')

print('\nSecond for-loop:')
for i in range(5):
    print(f'My square number is {next(squares_gen)}')

Generators can be a very memory-efficient way to deal with a problem: a list is stored in memory in its entirety, whereas a generator produces its values as you go, so you never have to store the whole sequence at once.
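
When you want a fixed number of values from an infinite generator, `itertools.islice()` from the standard library is one convenient option. The sketch below repeats the definition of the squares generator so that it runs on its own:

```python
from itertools import islice

def squares_func():
    num = 1
    while True:
        yield num**2
        num += 1

# Take the first 5 values; the rest are never computed
first_five = list(islice(squares_func(), 5))
print(first_five)
```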

---
**📚 Learn more:**

- [Generators](https://docs.python.org/3/tutorial/classes.html#generators) - The Python tutorial
- [generator, generator iterator](https://docs.python.org/3/glossary.html#term-generator) - The Python glossary
- [Yield expressions](https://docs.python.org/3/reference/expressions.html#yieldexpr) - Python documentation

---
🚩 **Exercise 1**

In mathematics, the Perrin numbers are defined by the recurrence relation

$$
P(n) = P(n − 2) + P(n − 3), \quad n > 2,
$$
with initial values
$$
P(0) = 3, P(1) = 0, P(2) = 2.
$$

Following this relation, the sequence of Perrin numbers starts with

3, 0, 2, 3, 2, 5, 5, 7, 10, 12, 17, 22, 29, 39, ...

Write a generator that produces an infinite sequence of Perrin numbers.

In [None]:
show('Exercise 1')

---
## Pandas

Pandas is a module which allows the construction of a **dataframe**, an object to store data that looks a little like a spreadsheet (the data is indexed principally by a column name and row name/number). The data contained in a dataframe does *not* have to be of the same type.

Pandas is a very popular module for anything to do with data analysis in Python.
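
As a minimal sketch, a dataframe can be built from a dictionary mapping column names to lists of values. The data below is invented purely for illustration, and shows columns of different types living side by side:

```python
import pandas as pd

# Hypothetical data: each column holds a different type
toy_df = pd.DataFrame({
    'item': ['pen', 'notebook', 'ruler'],
    'quantity': [12, 5, 7],
    'in_stock': [True, False, True],
})

print(toy_df)
print(toy_df.dtypes)
```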

---
**📚 Learn more:**
* [Pandas (main website)](http://pandas.pydata.org/index.html)
* [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)
* [A quick introduction to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
* There is also a fantastic set of community tutorials [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html). There are plenty of supplementary materials which are well worth working through a little if you want a longer introduction to the basic concepts in Pandas, and more practice examples.
---

Here are some basic examples. The first uses the same file `oil_reserve_data.csv` from Week 5.

In [None]:
# First, import the pandas module
import pandas as pd

# Use the read_csv function to read the CSV file into a dataframe
oil_data = pd.read_csv('oil_reserve_data.csv')

# Look at what the dataframe contains
oil_data

In [None]:
# Get the column headers
print('Column headers:')
print(oil_data.columns, '\n')

# Pull the data from a particular column by referring to it by name
print('Data from Germany:')
print(oil_data['Germany'])

We can also specify a column to use as row labels when reading the file -- note the difference with the previous command here:

In [None]:
# Use the first column in the file as row index
oil_data = pd.read_csv('oil_reserve_data.csv', index_col=0)

# Print the column names and the row names
print(oil_data.columns)
print(oil_data.index)

# Look at what the dataframe contains
oil_data

There are plenty of other optional arguments of `pd.read_csv()` which are very helpful to read CSV files with different properties and layouts -- [have a look at the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
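
For instance, the `sep` argument sets the column delimiter. The sketch below reads a small semicolon-separated table from an in-memory string via `io.StringIO` instead of a file; the contents are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, standing in for a file on disk
raw_text = 'month;north;south\n2020M01;10;20\n2020M02;11;19\n'

# sep sets the delimiter; index_col=0 uses the first column as row labels
toy_csv = pd.read_csv(io.StringIO(raw_text), sep=';', index_col=0)
print(toy_csv)
```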

---
### Indexing dataframes

[The user guide in the pandas documentation is a must-read](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html). Here is a summary:

- **`.iloc`** is used for indexing by number (like in NumPy arrays).

In [None]:
oil_data = pd.read_csv('oil_reserve_data.csv', index_col=0)

# Select a row from our data, by numerical index (here, the second row)
print(oil_data.iloc[1], '\n')

# We can also grab a column by number - here, the second column
print(oil_data.iloc[:, 1])

- **`.loc`** is used for indexing by label (row or column header), and also for Boolean indexing.

In the second example below:
- we use `.loc` to return all the data for Bulgaria,
- we create a Boolean Series using `<`, which is `True` where the reserves for Bulgaria are below 1000,
- we index `.index` with this Boolean Series to return the labels of all the **rows** (i.e. the dates) where this happened.

In [None]:
# Print data for June 2019 in Germany, using row and column labels
print(oil_data.loc['2019M06', 'Germany'], '\n')

# Print the dates when the reserves in Bulgaria are below 1000
bulgaria_below_1000 = oil_data.loc[:, 'Bulgaria'] < 1000
print(bulgaria_below_1000, '\n')
print(oil_data.index[bulgaria_below_1000], '\n')

Note that `.columns`, `.index`, `.loc` and `.iloc` **are not functions**: you never call them with parentheses. `.columns` and `.index` are plain attributes, while `.loc` and `.iloc` are used with square brackets.
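
To compare the two indexers side by side, here is a small sketch on an invented dataframe (the labels and values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {'north': [10, 11, 12], 'south': [20, 19, 18]},
    index=['2020M01', '2020M02', '2020M03'],
)

# The same cell, addressed by position and by label
by_position = toy.iloc[1, 0]
by_label = toy.loc['2020M02', 'north']
print(by_position, by_label)

# Boolean indexing: keep only the rows where 'south' is below 20
mask = toy['south'] < 20
print(toy.loc[mask])
```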

---
🚩 **Exercise 2**

Display the oil reserves in Denmark, but only at the dates when the reserves in Germany are below 20000 **and** the reserves in Belgium are above 4000. [You might find this helpful...](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing)

In [None]:
show('Exercise 2')

---
Let us look at a different dataset. The file `r_and_d_spend.csv` comes from an [open European dataset](http://data.europa.eu/euodp/en/data/dataset/Lnlc8Fcv5u1RYlfjnsKxg) describing the GDP spend of different countries on research and development.

In [None]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Read in our data
randd = pd.read_csv('r_and_d_spend.csv')
print(randd)

# Plot the data for Austria
fig, ax = plt.subplots()
ax.plot(randd['Year'], randd['Austria'], 'bo')

# Fit a line through the Austria data
LR_aus = st.linregress(randd['Year'], randd['Austria'])
# (the result can also be indexed: slope, intercept = LR_aus[0], LR_aus[1])
y_aus = LR_aus.intercept + LR_aus.slope * randd['Year']

ax.plot(randd['Year'], y_aus, 'b-')
plt.show()

---
**📚 Learn more:**
* [pandas.DataFrame.loc - Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
* [pandas.DataFrame.iloc - Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)
* [Indexing, iteration - Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#indexing-iteration)
* [Plotting - Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#plotting)
* [Boolean indexing - Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing)

---
🚩 **Exercise 3**

Look at the other columns in the `r_and_d_spend.csv` file. Fit a line of best fit to each of the other columns, and make plots similar to the one shown above. Try putting them all in the same figure but on different subplots. Add legends, titles, and axis labels as necessary.

In [None]:
show('Exercise 3')

---
🚩 **Exercise 4**

Plot the data for all countries on the same graph between 2000 and 2010.

In [None]:
show('Exercise 4')