# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

## Part 3: Tibbles and Bits (35 points)

Now let's start creating and manipulating tibbles.

In [None]:
import pandas as pd  # The suggested idiom
from io import StringIO

from IPython.display import display # For pretty-printing data frames

**Exercise 1** (5 points). Write a function that, given a tibble, returns a new copy in _canonical order_. A tibble, `X`, is in canonical order if it has the following properties.

1. The variables appear in sorted order by name, ascending from left to right.
2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
3. The row labels (`X.index`) go from 0 to `n-1`, where `n` is the number of observations.

For instance, here is a **non-canonical tibble** ...

|   |  c  | a | b |
|:-:|:---:|:-:|:-:|
| 2 | hat | x | 1 |
| 0 | rat | y | 4 |
| 3 | cat | x | 2 |
| 1 | bat | x | 2 |


... and here is its **canonical counterpart.**

|   | a | b |  c  |
|:-:|:-:|:-:|:---:|
| 0 | x | 1 | hat |
| 1 | x | 2 | bat |
| 2 | x | 2 | cat |
| 3 | y | 4 | rat |
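
To make the three properties concrete, here is a minimal sketch of a checker that tests whether a data frame is already in canonical order. The helper name `is_canonical` and the two sample frames are illustrative assumptions, not part of the assignment. The check only inspects a tibble; it does not construct the canonical form.

```python
import pandas as pd

def is_canonical(X):
    """Hypothetical helper: True iff X satisfies Properties 1-3."""
    # Property 1: variables sorted by name, left to right.
    cols = list(X.columns)
    if cols != sorted(cols):
        return False
    # Property 3: row labels are exactly 0, 1, ..., n-1.
    if list(X.index) != list(range(len(X))):
        return False
    # Property 2: rows, read as tuples, are in lexicographic order.
    rows = list(X.itertuples(index=False, name=None))
    return rows == sorted(rows)

# The two example tibbles from the tables above.
non_canonical = pd.DataFrame({'c': ['hat', 'rat', 'cat', 'bat'],
                              'a': ['x', 'y', 'x', 'x'],
                              'b': [1, 4, 2, 2]},
                             index=[2, 0, 3, 1])
canonical = pd.DataFrame({'a': ['x', 'x', 'x', 'y'],
                          'b': [1, 2, 2, 4],
                          'c': ['hat', 'bat', 'cat', 'rat']})

print(is_canonical(non_canonical), is_canonical(canonical))
```

The row comparison assumes that the values within each column are mutually comparable (for example, all strings or all numbers).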

A partial solution appears below, which ensures that Property 1 above holds. Complete the solution to ensure Properties 2 and 3 hold. Feel free to consult the [Pandas API](http://pandas.pydata.org/pandas-docs/stable/api.html).

> **Hint**. For Property 3, you may find this function handy: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html

In [None]:
def canonicalize_tibble (X):
    """Returns a tibble in _canonical order_."""
    # Enforce Property 1:
    var_names = sorted (X.columns)
    Y = X[var_names].copy ()
    
    # Your turn: Enforce Properties 2 and 3 of canonical order!
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return Y

In [None]:
# Test input
canonical_in_csv = """,c,a,b
2,hat,x,1
0,rat,y,4
3,cat,x,2
1,bat,x,2"""

with StringIO (canonical_in_csv) as fp:
    canonical_in = pd.read_csv (fp, index_col=0)
print ("=== Input ===")
display (canonical_in)
print ("")
    
# Test output solution
canonical_soln_csv = """,a,b,c
0,x,1,hat
1,x,2,bat
2,x,2,cat
3,y,4,rat"""

with StringIO (canonical_soln_csv) as fp:
    canonical_soln = pd.read_csv (fp, index_col=0)
print ("=== True solution ===")
display (canonical_soln)
print ("")

canonical_out = canonicalize_tibble (canonical_in)
print ("=== Your computed solution ===")
display (canonical_out)
print ("")

canonical_matches = (canonical_out == canonical_soln)
print ("=== Matches? (Should be all True) ===")
display (canonical_matches)
assert canonical_matches.all ().all ()

print ("\n(Passed.)")

**Exercise 2.** (5 points) Write a function to determine if two tibbles, stored as Pandas data frames, are equivalent. That means they have identical variables and observations, up to permutations.

The last condition, "up to permutations," means that the variables and observations might not appear in the table in the same order. For example, the following two tibbles are equivalent:

| a | b |  c  |
|:-:|:-:|:---:|
| x | 1 | hat |
| y | 2 | cat |
| z | 3 | bat |
| w | 4 | rat |

| b |  c  | a |
|:-:|:---:|:-:|
| 2 | cat | y |
| 3 | bat | z |
| 1 | hat | x |
| 4 | rat | w |

By contrast, the following table would not be equivalent to either of the above tibbles.

| a | b |  c  |
|:-:|:-:|:---:|
| 2 | y | cat |
| 3 | z | bat |
| 1 | x | hat |
| 4 | w | rat |

> **Note**: Unlike Pandas data frames, tibbles conceptually do not have row labels. So you should ignore row labels.
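
As a quick illustration of why a direct comparison is not enough, the sketch below builds a small hypothetical tibble and a permuted copy. Pandas' `DataFrame.equals` is sensitive to column order and row labels, so it reports the two as different. Comparing sorted lists of rows, where each row is a sorted tuple of (variable, value) pairs, ignores both. The names `as_sorted_records`, `A_demo`, and `B_demo` are assumptions made for this example, and this is only one of several possible approaches.

```python
import pandas as pd

A_demo = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [1, 2, 3]})
# Same observations, with columns and rows permuted.
B_demo = A_demo[['b', 'a']].iloc[[2, 0, 1]]

def as_sorted_records(df):
    """Represent each row as a sorted tuple of (variable, value) pairs,
    then sort the rows, so neither column nor row order matters."""
    records = df.to_dict('records')
    return sorted(tuple(sorted(r.items())) for r in records)

naive_equal = A_demo.equals(B_demo)
permutation_equal = as_sorted_records(A_demo) == as_sorted_records(B_demo)
print(naive_equal, permutation_equal)
```

Sorting the rows assumes the values within each variable are mutually comparable.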

In [None]:
def tibbles_are_equivalent (A, B):
    """Given two tidy tables ('tibbles'), returns True iff they are
    equivalent.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Test code
A = pd.DataFrame (columns=['a', 'b', 'c'],
                  data=list (zip (['x', 'y', 'z', 'w'],
                                  [1, 2, 3, 4],
                                  ['hat', 'cat', 'bat', 'rat'])))
print ("=== Tibble A ===")
display (A)

# Permute rows and columns, preserving equivalence
import random

obs_ind_orig = list (range (A.shape[0]))
var_names = list (A.columns)

obs_ind = obs_ind_orig.copy ()
while obs_ind == obs_ind_orig:
    random.shuffle (obs_ind)
    
while var_names == list (A.columns):
    random.shuffle (var_names)

B = A[var_names].copy ()
B = B.iloc[obs_ind]

print ("=== Tibble B == A ===")
display (B)

print ("=== Tibble C != A ===")
C = A.copy ()
C.columns = var_names
display (C)

assert tibbles_are_equivalent (A, B)
assert not tibbles_are_equivalent (A, C)
assert not tibbles_are_equivalent (B, C)
print ("\n(Passed.)")

## Melting

Recall the melting operation.

![Melt example](http://r4ds.had.co.nz/images/tidy-9.png)

To melt the table, you need to do the following.

1. Extract the _column values_ into a new variable. In this case, columns `"1999"` and `"2000"` of `table4` need to become the values of the variable, `"year"`.
2. Convert the values associated with the column values into a new variable as well. In this case, the values formerly in columns `"1999"` and `"2000"` become the values of the `"cases"` variable.

In the context of a melt, let's also refer to `"year"` as the new _key_ variable and `"cases"` as the new _value_ variable.
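
For reference, Pandas provides a built-in `pd.melt` that performs this operation. The sketch below applies it to a small made-up frame (the name `wide_demo` and its numbers are invented for illustration and are not the contents of `table4`), mainly to show the shape of the result you should expect from your own implementation.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide_demo = pd.DataFrame({'country': ['A', 'B'],
                          '1999': [10, 30],
                          '2000': [20, 40]})

long_demo = pd.melt(wide_demo,
                    id_vars=['country'],          # columns left unchanged
                    value_vars=['1999', '2000'],  # the "column values"
                    var_name='year',              # new key variable
                    value_name='cases')           # new value variable
print(long_demo)
```

Each input row produces one output row per melted column, so two countries and two years yield four observations.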

**Exercise 3.** (5 points) Implement a melting function. It should take as arguments the input data frame (e.g., `table4`), a list of the column values, and names for the new columns.

> You may need to refer to the Pandas documentation to figure out how to create and manipulate tables. The bits related to [indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html) and [merging](http://pandas.pydata.org/pandas-docs/stable/merging.html) may be especially helpful.

In [None]:
def melt (df, col_vals, key, value):
    assert type (df) is pd.DataFrame
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
table4a = pd.read_csv ('table4a.csv')
print ("\n=== table4a ===")
display (table4a)

m_4a = melt (table4a, col_vals=['1999', '2000'], key='year', value='cases')
print ("=== melt (table4a) ===")
display (m_4a)

table4b = pd.read_csv ('table4b.csv')
print ("\n=== table4b ===")
display (table4b)

m_4b = melt (table4b, col_vals=['1999', '2000'], key='year', value='population')
print ("=== melt (table4b) ===")
display (m_4b)

m_4 = pd.merge (m_4a, m_4b, on=['country', 'year'])
print ("\n=== inner-join (melt (table4a), melt (table4b)) ===")
display (m_4)

m_4['year'] = m_4['year'].apply (int)

table1 = pd.read_csv ('table1.csv')
print ("=== table1 (target solution) ===")
display (table1)
assert tibbles_are_equivalent (table1, m_4)
print ("\n(Passed.)")

## Casting

Now recall the casting example:

![Cast example](http://r4ds.had.co.nz/images/tidy-8.png)

The signature of a cast is similar to that of melt. However, you only need to know the `key`, which is the column of the input table containing new variable names, and the `value`, which is the column containing corresponding values.

**Exercise 4** (5 points). Implement a function to cast a data frame into a tibble, given a key column containing new variable names and a value column containing the corresponding cells.

We've given you a partial solution that

- verifies that the given `key` and `value` columns are actual columns of the input data frame;
- computes the list of columns, `fixed_vars`, that should remain unchanged; and
- initializes an empty tibble.

Observe that we are asking your `cast()` to accept an optional parameter, `join_how`, that may take the values `'outer'` or `'inner'` (with `'outer'` as the default). Why do you need such a parameter?
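
One way to think about the `join_how` question is to look at what `pd.merge` does when an observation is missing one of the key values. In the sketch below, the frames `cases_demo` and `population_demo` are hypothetical: country `B` has a case count but no population entry. This only illustrates the two join modes; it is not the cast implementation itself.

```python
import pandas as pd

cases_demo = pd.DataFrame({'country': ['A', 'B'], 'cases': [10, 20]})
population_demo = pd.DataFrame({'country': ['A'], 'population': [1000]})

# Outer join: keep every country, filling missing values with NaN.
outer_demo = pd.merge(cases_demo, population_demo, on='country', how='outer')
# Inner join: keep only countries present in both frames.
inner_demo = pd.merge(cases_demo, population_demo, on='country', how='inner')

print(outer_demo)
print(inner_demo)
```

With `'outer'`, incomplete observations survive with missing values. With `'inner'`, they are dropped.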

In [None]:
def cast (df, key, value, join_how='outer'):
    """Casts the input data frame into a tibble,
    given the key column and value column.
    """
    assert type (df) is pd.DataFrame
    assert key in df.columns and value in df.columns
    assert join_how in ['outer', 'inner']
    
    fixed_vars = df.columns.difference ([key, value])
    tibble = pd.DataFrame (columns=fixed_vars) # empty frame
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return tibble

In [None]:
table2 = pd.read_csv ('table2.csv')
print ('=== table2 ===')
display (table2)

print ('\n=== tibble2 = cast (table2, "type", "count") ===')
tibble2 = cast (table2, 'type', 'count')
display (tibble2)

assert tibbles_are_equivalent (table1, tibble2)
print ('\n(Passed.)')

## Separating variables

Recall Table 3:

In [None]:
table3 = pd.read_csv ('table3.csv')
display (table3)

This table has a different problem, which is that the `rate` variable actually combines the `cases` and `population` data. This example is an instance in which we need to _separate_ a column into two variables.

**Exercise 5** (5 points). Write a function that takes a data frame (`df`) and separates an existing column (`key`) into new variables (given by the list of new variable names, `into`).

How will the separation happen? The caller should provide a function, `splitter (x)`, that given a value returns a _list_ containing the components. Observe that the partial solution below defines a default splitter, which uses the regular expression, `(\d+\.?\d+)`, to find all integer or floating-point values in a string input `x`.
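
To see how that regular expression behaves, here is a small standalone sketch. The sample strings are made up for illustration (the first assumes a `cases/population` style value). Note that the pattern needs at least two digits to match, so a lone single-digit number is skipped.

```python
import re

PATTERN = r'(\d+\.?\d+)'  # same pattern as the default splitter

rate_fields = re.findall(PATTERN, '745/19987071')
mixed_fields = re.findall(
    PATTERN, 'Give me $10.52 in exchange for 91 kitten stickers.')
# '7' has only one digit, so only '12' is found here.
short_fields = re.findall(PATTERN, '7 of 12')

print(rate_fields, mixed_fields, short_fields)
```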

In [None]:
import re

def default_splitter (text):
    """Searches the given string for all integer and floating-point
    values, returning them as a list _of strings_.
    
    E.g., the call
    
      default_splitter ('Give me $10.52 in exchange for 91 kitten stickers.')
      
    will return ['10.52', '91'].
    """
    fields = re.findall (r'(\d+\.?\d+)', text)
    return fields

def separate (df, key, into, splitter=default_splitter):
    """Given a data frame, separates one of its columns, the key,
    into new variables.
    """
    assert type (df) is pd.DataFrame
    assert key in df.columns
    
    # Hint: http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
print ("=== Recall: table3 ===")
display (table3)

tibble3 = separate (table3, key='rate', into=['cases', 'population'])
print ("\n=== tibble3 = separate (table3, ...) ===")
display (tibble3)

assert 'cases' in tibble3.columns
assert 'population' in tibble3.columns
assert 'rate' not in tibble3.columns

tibble3['cases'] = tibble3['cases'].apply (int)
tibble3['population'] = tibble3['population'].apply (int)

assert tibbles_are_equivalent (tibble3, table1)
print ("\n(Passed.)")

**Exercise 6** (5 points). Implement the inverse of separate, which is `unite()`. This function should take a data frame (`df`), the set of columns to combine (`cols`), the name of the new column (`new_var`), and a function that takes the subset of `cols` variables from a single observation and returns a new value for that observation.
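
As a hint about the mechanics, `DataFrame.apply` with `axis=1` calls a function once per row, passing that row's values. The sketch below uses a made-up frame, `parts_demo`, and an inline lambda to show the idea on a single pair of columns. It is not a complete `unite()`.

```python
import pandas as pd

# Hypothetical data for illustration only.
parts_demo = pd.DataFrame({'cases': [745, 2666],
                           'population': [19987071, 20595360]})

# axis=1 applies the function to each row (one observation at a time).
combined_demo = parts_demo[['cases', 'population']].apply(
    lambda row: "/".join(str(v) for v in row), axis=1)

print(combined_demo.tolist())
```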

In [None]:
def str_join_elements (x, sep=""):
    assert type (sep) is str
    return sep.join ([str (xi) for xi in x])

def unite (df, cols, new_var, combine=str_join_elements):
    # Hint: http://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
table3_again = unite (tibble3, ['cases', 'population'], 'rate',
                      combine=lambda x: str_join_elements (x, "/"))
display (table3_again)
assert tibbles_are_equivalent (table3, table3_again)
print ("\n(Passed.)")

## Putting it all together

Let's use these primitives to tidy up the original WHO TB data set. First, here is the raw data.

In [None]:
who_raw = pd.read_csv ('who.csv')

print ("=== WHO TB data set: {} rows x {} columns ===".format (who_raw.shape[0],
                                                               who_raw.shape[1]))
print ("Column names:", who_raw.columns)

print ("\n=== A few randomly selected rows ===")
import random
row_sample = sorted (random.sample (range (len (who_raw)), 5))
display (who_raw.iloc[row_sample])

The data set has 7,240 rows and 60 columns. Here is how to decode the columns.
- Columns `'country'`, `'iso2'`, and `'iso3'` are different ways to designate the country and redundant, meaning you only really need to keep one of them.
- Column `'year'` is the year of the report and is a natural variable.
- Among columns `'new_sp_m014'` through `'newrel_f65'`, the `'new...'` prefix indicates that the column's values count new cases of TB. In this particular data set, all the data are for new cases.
- The short codes, `rel`, `ep`, `sn`, and `sp` describe the type of TB case. They stand for relapse, extrapulmonary, pulmonary not detectable by a pulmonary smear test ("smear negative"), and pulmonary detectable by such a test ("smear positive"), respectively.
- The codes `'m'` and `'f'` indicate the gender (male and female, respectively).
- The trailing numeric code indicates the age group: `014` is 0-14 years of age, `1524` for 15-24 years, `2534` for 25-34 years, etc., and `65` stands for 65 years or older.
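
The decoding rules above can be sketched with a regular expression. The helper `decode_column` and the pattern below are assumptions for illustration. They presume that every case column looks like one of the two names quoted above (`'new_sp_m014'` or `'newrel_f65'`), which you should verify against the actual column list.

```python
import re

# Optional underscore after 'new', then the TB type, gender, and age code.
CODE_RE = re.compile(r'^new_?(rel|ep|sn|sp)_([mf])(\d+)$')

def decode_column(name):
    """Hypothetical helper: split a case-column name into
    (type, gender code, age code), or return None if it does not match."""
    m = CODE_RE.match(name)
    if m is None:
        return None
    return m.groups()

decoded_sp = decode_column('new_sp_m014')
decoded_rel = decode_column('newrel_f65')
decoded_other = decode_column('country')
print(decoded_sp, decoded_rel, decoded_other)
```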

In other words, you will likely want to treat the names of the `'new...'` columns as values of multiple variables (case type, gender, and age group)!

**Exercise 7** (3 points). As a first step, start with `who_raw` and create a new data frame, `who2`, with the following properties:

- All the `'new...'` columns of `who_raw` become values of a _single_ variable, `case_type`. Store the counts associated with each `case_type` value as a new variable called `'count'`.
- Remove the `iso2` and `iso3` columns, since they are redundant with `country` (which you should keep!).
- Keep the `year` column as a variable.
- Remove all not-a-number (`NaN`) counts. _Hint_: You can test for a `NaN` using Python's [`math.isnan()`](https://docs.python.org/3/library/math.html).
- Convert the counts to integers. (Because of the presence of NaNs, the counts will otherwise be treated as floating-point values, which is undesirable since you do not expect to see non-integer counts.)
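
The last two bullets can be illustrated on a tiny made-up frame. The name `counts_demo` and its values are invented for this sketch, which shows only the NaN handling and type conversion, not the full construction of `who2`.

```python
from math import isnan
import pandas as pd

counts_demo = pd.DataFrame({'case_type': ['new_sp_m014', 'new_sp_f014', 'newrel_f65'],
                            'count': [52.0, float('nan'), 3.0]})

# A column holding a NaN is stored as floating point.
nan_flags = [isnan(c) for c in counts_demo['count']]

# Drop rows whose count is NaN, then convert the survivors to integers.
clean_demo = counts_demo.dropna(subset=['count']).copy()
clean_demo['count'] = clean_demo['count'].astype(int)

print(nan_flags)
print(clean_demo)
```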

In [None]:
from math import isnan

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print ("=== First few rows of your solution ===")
display (who2.head ())

print ("=== First few rows of the instructor's solution ===")
who2_soln = pd.read_csv ('who2_soln.csv')
display (who2_soln.head ())

# Check it
assert tibbles_are_equivalent (who2, who2_soln)
print ("\n(Passed.)")

**Exercise 8** (5 points). Starting from your `who2` data frame, create a new tibble, `who3`, for which each `case_type` value (the key variable from Exercise 7) is split into three new variables:

- `'type'`, to hold the TB type, having possible values of `rel`, `ep`, `sn`, and `sp`;
- `'gender'`, to hold the gender as a string having possible values of `female` and `male`; and
- `'age_group'`, to hold the age group as a string having possible values of `0-14`, `15-24`, `25-34`, `35-44`, `45-54`, `55-64`, and `65+`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print ("=== First few rows of your solution ===")
display (who3.head ())

who3_soln = pd.read_csv ('who3_soln.csv')
print ("\n=== First few rows of the instructor's solution ===")
display (who3_soln.head ())

assert tibbles_are_equivalent (who3, who3_soln)
print ("\n(Passed.)")