# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Part 1: Tidy data (4 points)

The overall topic for this lab is what we'll refer to as representing data _relationally_. The topic of this part is a specific type of relational representation sometimes referred to as the _tidy_ (as opposed to _untidy_ or _messy_) form. The concept of tidy data was developed by [Hadley Wickham](http://hadley.nz/), a statistician and R programming maestro. Much of this lab is based on his tutorial materials (see below).

If you know [SQL](https://en.wikipedia.org/wiki/SQL), then you are already familiar with relational data representations. However, we might discuss it a little differently from the way you may have encountered the subject previously. The main reason is our overall goal in the class: to build data _analysis_ pipelines. If our end goal is analysis, then we often want to extract or prepare data in a way that makes analysis easier.

You may find it helpful to also refer to the original materials on which this lab is based:

* Wickham's R tutorial on making data tidy: http://r4ds.had.co.nz/tidy-data.html
* The slides from a talk by Wickham on the concept: http://vita.had.co.nz/papers/tidy-data-pres.pdf
* Wickham's more theoretical paper on "tidy" vs. "untidy" data: http://www.jstatsoft.org/v59/i10/paper

------------------------------------------------------------

## What is tidy data?

To build your intuition, consider the following data set collected from a survey or study.

**Representation 1.** [Two-way contingency table](https://en.wikipedia.org/wiki/Contingency_table).

|            | Pregnant | Not pregnant |
|-----------:|:--------:|:------------:|
| **Male**   |     0    |      5       |
| **Female** |     1    |      4       |

**Representation 2.** Observation list or "data frame."

| Gender  | Pregnant | Count |
|:-------:|:--------:|:-----:|
| Male    | Yes      | 0     |
| Male    | No       | 5     |
| Female  | Yes      | 1     |
| Female  | No       | 4     |

These are two entirely equivalent ways of representing the same data. However, each may be suited to a particular task.

For instance, Representation 1 is a typical input format for statistical routines that implement Pearson's $\chi^2$-test, which can check for independence between factors. (Are gender and pregnancy status independent?) By contrast, Representation 2 might be better suited to regression. (Can you predict relative counts from gender and pregnancy status?)
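As a minimal sketch of how the two representations relate, the following Pandas snippet builds Representation 1 from the counts shown above, converts it to Representation 2, and converts it back. The variable names (`contingency`, `observations`, `round_trip`) are illustrative choices, not part of the lab's provided code.

```python
import pandas as pd

# Representation 1: two-way contingency table (rows = gender, columns = pregnancy status).
contingency = pd.DataFrame(
    {"Pregnant": [0, 1], "Not pregnant": [5, 4]},
    index=pd.Index(["Male", "Female"], name="Gender"),
)

# Representation 2: one row per (gender, pregnancy status) pair, with its count.
observations = (
    contingency
    .rename(columns={"Pregnant": "Yes", "Not pregnant": "No"})
    .reset_index()
    .melt(id_vars="Gender", var_name="Pregnant", value_name="Count")
)
print(observations)

# Going back: pivot the observation list into a two-way table again.
round_trip = observations.pivot(index="Gender", columns="Pregnant", values="Count")
print(round_trip)
```

Both frames hold the same four counts; only the layout differs.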

While [Representation 1 has its uses](http://simplystatistics.org/2016/02/17/non-tidy-data/), Wickham argues that Representation 2 is often the cleaner and more general way to supply data to a wide variety of statistical analysis and visualization tasks. He refers to Representation 2 as _tidy_ and Representation 1 as _untidy_ or _messy_.

> The term "messy" is, as Wickham states, not intended to be pejorative, since "messy" representations may be exactly the right ones for particular analysis tasks, as noted above.

More specifically, Wickham defines a tidy data set as one that can be organized into a 2-D table such that

1. each column represents a _variable_;
2. each row represents an _observation_;
3. each entry of the table represents a single _value_, which may come from either categorical (discrete) or continuous spaces.

Here is a visual schematic of this definition, taken from [another source](http://r4ds.had.co.nz/images/tidy-1.png):

![Wickham's illustration of the definition of tidy](http://r4ds.had.co.nz/images/tidy-1.png)

This definition appeals to a statistician's intuitive idea of the data they wish to analyze. It is also consistent with tasks that seek to establish a functional relationship between a response (output) variable and one or more independent variables.

> A computer scientist with a machine learning outlook on life might refer to columns as _features_ and rows as _data points_, especially when all values are numerical (ordinal or continuous).

Here's one more bit of terminology: if a table is tidy, we will call it a tidy table, or _tibble_.

## Setup: The Python Pandas module

In Python, the [Pandas](http://pandas.pydata.org/) module is a convenient way to store tibbles. If you know [R](http://r-project.org), you will see that the design and API of Pandas's data frames derive from [R's data frames](https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html).

Let's use Pandas to load the same data in different formats and study general techniques for transforming data between representations, _a la_ [Wickham's tutorial](http://r4ds.had.co.nz/tidy-data.html#tidy-data-1).

Consider a data set consisting of the number of cases of tuberculosis in different countries and different years compared to the total population in those years. You are told that the variables for analysis are _country_, _year_, _cases_, and _population_.

Run the code cells below to see 4 different representations of these data.

> These examples come from a World Health Organization data set, available at the URL below and in cleaner form as part of the R "tidyr" package:
> - WHO TB data set: http://www.who.int/tb/country/data/download/en/
> - tidyr sources: https://github.com/hadley/tidyr/tree/master/data-raw

In [None]:
import pandas as pd  # The suggested idiom
from IPython.display import display # For pretty-printing tables

In [None]:
table1 = pd.read_csv ('table1.csv')
display (table1.head ()) # peek at the first few rows

In [None]:
# An alternative representation of table1
table2 = pd.read_csv ('table2.csv')
display (table2.head ()) # peek at the first few rows

In [None]:
# Same data stored as rates
table3 = pd.read_csv ('table3.csv')
display (table3.head ()) # peek at the first few rows

In [None]:
# Same data spread across two tables
table4a = pd.read_csv ('table4a.csv')
table4b = pd.read_csv ('table4b.csv')

print ("=== table4a ===")
display (table4a.head ())

print ("=== table4b ===")
display (table4b.head ())

**Exercise 1.** (4 points) Which of these representations is tidy and why?

YOUR ANSWER HERE

## Basic tidying transformations: Melting and casting

Given a data set and a target set of variables, there are at least two common issues that require tidying.

First, values often appear as columns. Table 4a is an example. To tidy up, you want to turn columns into rows:

![Gather example](http://r4ds.had.co.nz/images/tidy-9.png)

Because this operation turns columns into rows, making a "fat" table more tall and skinny, it is sometimes called _melting_.
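Here is a minimal sketch of melting with Pandas's `melt` method. The frame `wide` is a hypothetical stand-in shaped like Table 4a (one column per year); its country labels and counts are placeholders, not the WHO figures, and the output column names `year` and `cases` are assumptions chosen to match the target variables.

```python
import pandas as pd

# Hypothetical stand-in shaped like Table 4a: year values appear as column names.
# The counts are placeholders, not the WHO figures.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1999": [10, 30],
    "2000": [20, 40],
})

# Melt: keep `country` as the identifier and fold the year columns into rows.
tall = wide.melt(id_vars="country", var_name="year", value_name="cases")
tall["year"] = tall["year"].astype(int)  # the column labels were strings
print(tall)
```

The result has one row per (country, year) pair, so the two year columns of two rows become four rows.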


The second most common issue is that an observation might be split across multiple rows. Table 2 is an example. To tidy up, you want to merge rows:

![Spread example](http://r4ds.had.co.nz/images/tidy-8.png)

Because this operation is the moral opposite of melting, and "rebuilds" observations from parts, it is sometimes called _casting_.
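A minimal sketch of casting, assuming a layout like Table 2 in which one column names the variable and another holds its value. The column names `type` and `count` and all the values below are hypothetical placeholders; check the actual columns of `table2` before adapting this.

```python
import pandas as pd

# Hypothetical stand-in shaped like Table 2: each observation is split across two rows.
long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1999, 1999, 1999, 1999],
    "type": ["cases", "population", "cases", "population"],
    "count": [10, 1000, 30, 2000],
})

# Cast: move the `type` level out of the row index and into the columns,
# so each (country, year) pair ends up on a single row.
tidy = (
    long.set_index(["country", "year", "type"])["count"]
        .unstack("type")
        .reset_index()
)
tidy.columns.name = None  # drop the leftover "type" label on the column index
print(tidy)
```

This sketch uses `set_index` followed by `unstack`; Pandas's `pivot` method is another way to express the same reshaping.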

> Melting and casting are Wickham's terms from [his original paper on tidying data](http://www.jstatsoft.org/v59/i10/paper). In his more recent writing, [on which this tutorial is based](http://r4ds.had.co.nz/tidy-data.html), he refers to the same operations as _gathering_ and _spreading_, respectively.