# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

## Part 2: Pandas 101 (11 points)

We want you to build some functions to implement the melt and cast operations, among other operations. You will build these primitives on top of Pandas's data frames, so let's first start with some basics.

Consider the famous [Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set). It consists of 50 samples from each of three species of Iris (_Iris setosa_, _Iris virginica_, and _Iris versicolor_). Four features were measured from each sample: the lengths and the widths of the [sepals](https://en.wikipedia.org/wiki/Sepal) and [petals](https://en.wikipedia.org/wiki/Petal).

In [None]:
# Some modules you'll need in this part
import pandas as pd
from io import StringIO
from IPython.display import display

irises = pd.read_csv ('iris.csv')
print ("=== Iris data set: {} rows x {} columns. ===".format (irises.shape[0], irises.shape[1]))
display (irises.head ())

In a Pandas data frame, every column has a name (stored as a string) and all values within the column must have the same primitive type. This fact makes columns different from, for instance, lists.
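As a minimal sketch of this point, the snippet below contrasts a plain Python list, which can hold values of mixed types, with Pandas columns, each of which reports a single `dtype` for all of its values. The frame `demo` and its contents are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# A plain Python list may mix types freely.
mixed_list = [1, 2.5]
print([type(v).__name__ for v in mixed_list])

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({'count': [1, 2, 3], 'ratio': [0.5, 1.5, 2.5]})
print(demo.dtypes)  # one dtype per column

# Placing an int and a float in one column: Pandas stores both as floats.
s = pd.Series(mixed_list)
print(s.dtype)
```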

In addition, every data frame has a special column, called its _index_, which holds one value per row. (Try printing `irises.index`.) Any particular index value serves as a name for its row; these index values are usually integers but can be more complex types, like tuples. Separate from the index values (row names), you can also refer to rows by their integer offset from the top, where the first row has an offset of 0 and the last row has an offset of `n-1` if the data frame has `n` rows.

In [None]:
Z = pd.DataFrame (columns=['a', 'b', 'c'])
Z['a'] = pd.Series ([1, 2, 3])
Z['b'] = pd.Series ([4.0, 5.1, 6.2])
Z['c'] = pd.Series (['7', '8.2', 'cat'])
display (Z)
print (Z.index)
print (Z.loc[1])
print (Z.iloc[1])

alt_index = pd.Index ([123, 7, 4])
Z.set_index (alt_index, inplace=True)
display (Z)
display (Z.index)

display (Z['b'])
display (Z.loc[7])
display (Z.iloc[1])

**Exercise 1.** (6 points) Run the following commands and describe what each one does (1 sentence each).

```python
irises.describe ()
irises['sepal length'].head ()
irises[["sepal length", "petal width"]].head ()
irises.iloc[5:10]
irises[irises["sepal length"] > 5.0]
irises["sepal length"].max ()
irises['species'].unique ()
irises.sort_values (by="sepal length", ascending=False).head (1)
irises.sort_values (by="sepal length", ascending=False).iloc[5:10]
irises['x'] = 3.14
irises.rename (columns={'species': 'type'})
del irises['x']
```

In [None]:
# This is a dummy cell, in case you want to try out any of the above statements.

YOUR ANSWER HERE

**Exercise 2.** (1 point) Do the functions `irises.sort_values ()` and `irises.rename()` modify the input object, `irises`?

YOUR ANSWER HERE

## Merging data frames: join operations

Another useful operation on data frames is [merging](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html).

For instance, consider the following two tables, `A` and `B`:

| country     | year | cases  |
|:------------|-----:|-------:|
| Afghanistan | 1999 |    745 |
| Brazil      | 1999 |  37737 |
| China       | 1999 | 212258 |
| Afghanistan | 2000 |   2666 |
| Brazil      | 2000 |  80488 |
| China       | 2000 | 213766 |

| country     | year | population |
|:------------|-----:|-----------:|
| Afghanistan | 1999 |   19987071 |
| Brazil      | 1999 |  172006362 |
| China       | 1999 | 1272915272 |
| Afghanistan | 2000 |   20595360 |
| Brazil      | 2000 |  174504898 |
| China       | 2000 | 1280428583 |

Suppose we wish to combine these into a single table, `C`:

| country     | year | cases  | population |
|:------------|-----:|-------:|-----------:|
| Afghanistan | 1999 |    745 |   19987071 |
| Brazil      | 1999 |  37737 |  172006362 |
| China       | 1999 | 212258 | 1272915272 |
| Afghanistan | 2000 |   2666 |   20595360 |
| Brazil      | 2000 |  80488 |  174504898 |
| China       | 2000 | 213766 | 1280428583 |

In Pandas, you can perform this merge using the [`.merge()` function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html):

```python
C = A.merge (B, on=['country', 'year'])
```

In this call, the `on=` parameter specifies the list of column names to use to align or "match" the two tables, `A` and `B`. By default, `merge()` will only include rows from `A` and `B` where all keys match between the two tables.

The following code cell demonstrates this functionality.

In [None]:
A_csv = """country,year,cases
Afghanistan,1999,745
Brazil,1999,37737
China,1999,212258
Afghanistan,2000,2666
Brazil,2000,80488
China,2000,213766"""

with StringIO (A_csv) as fp:
    A = pd.read_csv (fp)
print ("=== A ===")
display (A)

B_csv = """country,year,population
Afghanistan,1999,19987071
Brazil,1999,172006362
China,1999,1272915272
Afghanistan,2000,20595360
Brazil,2000,174504898
China,2000,1280428583"""

with StringIO (B_csv) as fp:
    B = pd.read_csv (fp)
print ("\n=== B ===")
display (B)

C = A.merge (B, on=['country', 'year'])
print ("\n=== C = merge (A, B) ===")
display (C)
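In the example above, every (country, year) pair appears in both tables, so no rows are lost. As a hedged sketch of what the default does otherwise, the snippet below uses two small made-up frames, `left` and `right` (hypothetical names, not part of the assignment data), in which only some keys match.

```python
import pandas as pd

# Hypothetical frames, for illustration only.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'u': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'v': [20, 30, 40]})

# Default merge: only keys present in both frames ('b' and 'c') survive.
both = left.merge(right, on=['key'])
print(both)
```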

**Joins.** This default behavior of keeping only rows that match both input frames is an example of what relational database systems call an _inner-join_ operation. But there are several other types of joins.

- _Inner-join (`A`, `B`)_ (default): Keep only rows of `A` and `B` where the on-keys match in both.
- _Outer-join (`A`, `B`)_: Keep all rows of both frames, but merge rows when the on-keys match. For non-matches, fill in missing values with not-a-number (`NaN`) values.
- _Left-join (`A`, `B`)_: Keep all rows of `A`. Only merge rows of `B` whose on-keys match `A`.
- _Right-join (`A`, `B`)_: Keep all rows of `B`. Only merge rows of `A` whose on-keys match `B`.

To select the join type, use `merge()`'s `how=...` parameter, which takes one of the string values `'inner'`, `'outer'`, `'left'`, or `'right'`. Here is an example of an outer join.

In [None]:
with StringIO ("""x,y,z
bug,1,d
rug,2,d
lug,3,d
mug,4,d""") as fp:
    D = pd.read_csv (fp)
print ("=== D ===")
display (D)

with StringIO ("""x,y,w
hug,-1,e
smug,-2,e
rug,-3,e
tug,-4,e
bug,1,e""") as fp:
    E = pd.read_csv (fp)
print ("\n=== E ===")
display (E)

print ("\n=== Outer-join (D, E) ===")
display (D.merge (E, on=['x', 'y'], how='outer'))
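To see how the `how=` parameter changes which rows survive, here is a small sketch on two made-up frames, `P` and `Q` (hypothetical names, unrelated to `D` and `E`). It passes `indicator=True`, a `merge()` option that adds a `_merge` column recording whether each row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

# Hypothetical frames, for illustration only; they share just the key k=3.
P = pd.DataFrame({'k': [1, 2, 3], 'p': ['x', 'y', 'z']})
Q = pd.DataFrame({'k': [3, 4], 'q': ['s', 't']})

row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    # indicator=True adds a '_merge' column describing each row's origin.
    J = P.merge(Q, on=['k'], how=how, indicator=True)
    row_counts[how] = len(J)
    print("=== how='{}' ===".format(how))
    print(J)
```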

**Exercise 3** (2 points). Predict the output of left-joining and right-joining `D` with `E`. In the Markdown cell below, you can use the following syntax to draw a table. (This example shows the outer-join of `D` and `E` from above. The placement of colons in the second row indicates left vs. right vs. centered alignment within the given column. Also, you may ignore the row numbers in the frames, as done below.)

```markdown
Data frame: **outer-join** (D, E).

| x    |  y |  z  |  w  |
|:-----|---:|:---:|:---:|
| bug  |  1 |  d  |  e  |
| rug  |  2 |  d  | NaN |
| lug  |  3 |  d  | NaN |
| mug  |  4 |  d  | NaN |
| hug  | -1 | NaN |  e  |
| smug | -2 | NaN |  e  |
| rug  | -3 | NaN |  e  |
| tug  | -4 | NaN |  e  |
```

YOUR ANSWER HERE

## Apply functions to data frames

Another useful primitive is `apply()`, which can apply a function to a data frame or to a series (column of the data frame).

For instance, suppose we wish to convert the year into an abbreviated two-digit form. The following code will do it:

In [None]:
display (C)
G = C.copy ()
G['year'] = G['year'].apply (lambda x: "'{:02d}".format (x % 100))
display (G)
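The example above applies a function to a single column (a series). As a rough sketch of the data-frame case, the snippet below calls `apply()` on a small made-up numeric frame, `T` (hypothetical, not part of the assignment data). By default, `DataFrame.apply()` passes each column to the function as a series and collects one result per column.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
T = pd.DataFrame({'lo': [1, 5, 2], 'hi': [10, 50, 20]})

# Series.apply: the function receives one value at a time.
doubled = T['lo'].apply(lambda v: 2 * v)
print(doubled)

# DataFrame.apply (default axis): the function receives one column at a time.
col_range = T.apply(lambda col: col.max() - col.min())
print(col_range)
```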

**Exercise 4** (2 points). Use `apply()` to add a new variable in `G`, called `'prevalence'`, which is the ratio of cases to the population.

> **Note**: It's actually easier to compute prevalence using,
>
> ```python
> G['prevalence'] = G['cases'] / G['population']
> ```
>
> However, for this exercise we want you to use the `apply` function. You might consult the documentation for `apply()` when applied to data frames: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
display (G)

assert (G['prevalence'] == (G['cases'] / G['population'])).all ()