**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 18. Tidy Data

## In this lesson...

- We'll learn about **tidy data**: a way to consistently organize tabular data

- We'll also learn some techniques for making data tidy
    - These techniques are useful in general, too

- The concept of tidy data was originally proposed by [Hadley Wickham](http://hadley.nz/), Chief Scientist at Posit (formerly RStudio)

- Many statistical and visualization packages in Python (and R), such as Altair, are designed to work with tidy data

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## A few example datasets 

- Let's start by importing Pandas and NumPy:

In [None]:
import pandas as pd
import numpy as np

- We'll use the following datasets in this lesson:

In [None]:
table1 = pd.read_csv('data/table1.csv')
table2 = pd.read_csv('data/table2.csv')
table3 = pd.read_csv('data/table3.csv')
table4a = pd.read_csv('data/table4a.csv')
table4b = pd.read_csv('data/table4b.csv')

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## What is tidy data?

- We can represent the same underlying data in multiple ways


- Throughout this course so far, we've used the terms *columns* and *variables* interchangeably, as well as the terms *rows* and *observations*


- However, depending on how the data is organized and the information it contains, this may not be correct


- Below, we have 4 representations of the same data: `table1`, `table2`, `table3` and `table4a` + `table4b`


- In this data, we have four variables: `country`, `year`, `population`, and `cases`


- Each observation corresponds to a `country`-`year` pair


- Each of the 4 representations shows the same values, but organized differently:

In [None]:
table1

In [None]:
table2

In [None]:
table3

In [None]:
# Together with table4b below
table4a

In [None]:
# Together with table4a above
table4b

- A dataset is **tidy** if:
    1. Each variable has its own column
    2. Each observation has its own row
    3. Each value has its own cell

**Question.** Which of the four representations of the dataset above are tidy?

*Write your notes here. Double-click to edit.*

- The principles of tidy data seem obvious, but most data that we encounter in the wild is *not* tidy 


- Given a dataset, we need to first figure out what the variables and observations are; then we can make it tidy


- Next, we'll learn a few techniques that can help make a dataset tidy

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Pivoting from wide to long form

- A common problem is a dataset where some of the column names are not the names of variables, but the *values* of a variable


- For example, let's look at `table4a` again:

In [None]:
table4a

- In this dataset:
    - the column names `1999` and `2000` represent values of the `year` variable
    - the values in these columns represent values of the `cases` variable
    - each row represents 2 observations, not 1

- To tidy a dataset like this, we need to **pivot** the offending columns into a new pair of variables


- We can accomplish this with the `.melt()` DataFrame method

- We need 3 parameters:

    1. `id_vars`: a list of columns to keep as-is
        - The names of the remaining columns are moved into the column named by `var_name`, and their values are moved into the column named by `value_name`
    2. `var_name`: the name of the column to create from the data stored in the column names
    3. `value_name`: the name of the column to create from the data stored in the column values

- Visually:

<img src='img/melt.jpg' width=700 />


- For example, we can pivot the columns `1999` and `2000` in `table4a` into new variables called `year` and `cases`, like this:
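- A minimal sketch of this call. It assumes `table4a` has a `country` column plus columns named `1999` and `2000` (as strings, which is how `pd.read_csv` reads such headers). The small DataFrame below is a hypothetical stand-in for the CSV file, and `table4a_tidy` is a name chosen for illustration:

```python
import pandas as pd

# Hypothetical stand-in for data/table4a.csv; the values are illustrative
table4a = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China'],
    '1999': [745, 37737, 212258],
    '2000': [2666, 80488, 213766],
})

# Keep country as-is; the column names 1999 and 2000 become values of year,
# and the numbers stored in those columns become values of cases
table4a_tidy = table4a.melt(
    id_vars=['country'],
    var_name='year',
    value_name='cases',
)
table4a_tidy
```

- Each original row now appears twice, once per year. Under these assumptions the new `year` column holds strings, because it is built from column names.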

- This is often called **pivoting from wide to long form** because it makes datasets "longer" by increasing the number of rows and decreasing the number of columns

- We can use `.melt()` in a similar fashion to tidy `table4b`, which contains the values of the variable `population`:
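- A sketch of the same idea for `table4b`, assuming it has the same layout as `table4a` but stores populations. The DataFrame below is a hypothetical stand-in for the CSV file, and `table4b_tidy` is an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in for data/table4b.csv; the values are illustrative
table4b = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China'],
    '1999': [19987071, 172006362, 1272915272],
    '2000': [20595360, 174504898, 1280428583],
})

# Same call as before; only the name of the value column changes
table4b_tidy = table4b.melt(
    id_vars=['country'],
    var_name='year',
    value_name='population',
)
table4b_tidy
```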

- We'll learn how to merge these `melt`ed DataFrames into a single DataFrame in a future lesson

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Pivoting from long to wide form

- Another common problem is a dataset with each observation scattered across multiple rows


- For example, consider `table2`: each observation corresponds to a `country`-`year` pair, but each observation is spread across 2 rows

In [None]:
table2

- To tidy this up, we can use the `.pivot_table()` method

- We need 3 parameters:
    1. `index`: the variables that identify a single observation
    2. `columns`: the column to take variable names from
    3. `values`: the column to take values from

- Visually:

<img src='img/pivot_table.jpg' width=800 />


- For example, we can pivot the `type` and `count` columns of `table2` like this:
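- A minimal sketch of this pivot, assuming `table2` has the columns `country`, `year`, `type`, and `count`. The DataFrame below is a hypothetical stand-in for the CSV file, and `table2_tidy` is an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in for data/table2.csv; the values are illustrative
table2 = pd.DataFrame({
    'country': ['Afghanistan'] * 4 + ['Brazil'] * 4,
    'year': [1999, 1999, 2000, 2000] * 2,
    'type': ['cases', 'population'] * 4,
    'count': [745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898],
})

table2_tidy = (
    table2
    # One row per country-year pair; one new column per value of type
    .pivot_table(index=['country', 'year'], columns='type', values='count')
    # Turn the country-year index back into ordinary columns
    .reset_index()
    # Remove the leftover "type" label on the column axis
    .rename_axis(columns=None)
)
table2_tidy
```

- Note that `.pivot_table()` aggregates values that share the same index and column labels, using the mean by default. Here each country-year-type combination appears only once, so the values are unchanged, although they may come back as floats.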

- `.reset_index()` converts the existing index into ordinary columns, and resets the index of the DataFrame to the default one (consecutive integers)


- `.rename_axis(columns=None)` removes the name of the column axis generated by `.pivot_table()`


- These two steps give us an ordinary, flat DataFrame, which is usually easier to work with in later wrangling or analysis

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Separating

- Let's take a look at `table3`:

In [None]:
table3

- To tidy this data, we need to split the contents of `rate` into two columns, `cases` and `population`

- We can accomplish this with the `.str.split()` Series method, with the following keyword arguments:
    - `pat=...` specifies a string to use as a separator
        - If `pat` is not specified, the method will split on whitespace
    - `expand=True` tells the method to output split strings into multiple columns/Series in a DataFrame

- So, we can split the contents of `rate` in `table3` like this:
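- A minimal sketch of this split, assuming `rate` stores strings of the form `cases/population`. The DataFrame below is a hypothetical stand-in for the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for data/table3.csv; the values are illustrative
table3 = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Brazil', 'Brazil'],
    'year': [1999, 2000, 1999, 2000],
    'rate': ['745/19987071', '2666/20595360',
             '37737/172006362', '80488/174504898'],
})

# Split each string at the slash; expand=True returns a DataFrame
# with one column per piece, labeled 0 and 1
table3_cases_pop = table3['rate'].str.split(pat='/', expand=True)
table3_cases_pop
```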

- Note that after splitting `rate`, we still have strings instead of numeric values as output:

In [None]:
table3_cases_pop.info()

- We can add the split contents of `rate` to `table3`, convert them to integers, and drop `rate` from the table, like this:
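- One way to sketch these three steps as a method chain. The DataFrame below is a hypothetical stand-in for the CSV file, `table3_tidy` is an illustrative name, and `'int64'` is one reasonable choice of integer type:

```python
import pandas as pd

# Hypothetical stand-in for data/table3.csv; the values are illustrative
table3 = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Brazil', 'Brazil'],
    'year': [1999, 2000, 1999, 2000],
    'rate': ['745/19987071', '2666/20595360',
             '37737/172006362', '80488/174504898'],
})

# Split rate into two columns of strings, labeled 0 and 1
table3_cases_pop = table3['rate'].str.split(pat='/', expand=True)

table3_tidy = (
    table3
    # Add the two pieces as new columns, converting strings to integers
    .assign(
        cases=table3_cases_pop[0].astype('int64'),
        population=table3_cases_pop[1].astype('int64'),
    )
    # rate is now redundant
    .drop(columns='rate')
)
table3_tidy
```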

- If the data you want to separate doesn't contain a separator character, you can use Python slicing notation with `.str` to take substrings of the data


- For example, we can split `year` into `century` and `year` like this:
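- A minimal sketch of slicing with `.str`. It assumes `year` was read in as integers, so we convert it to strings first. The DataFrame below is a hypothetical stand-in for the CSV file, and `table3_century` is an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in for data/table3.csv; the values are illustrative
table3 = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Brazil', 'Brazil'],
    'year': [1999, 2000, 1999, 2000],
    'rate': ['745/19987071', '2666/20595360',
             '37737/172006362', '80488/174504898'],
})

# .str only works on strings, so convert the integer years first
year_str = table3['year'].astype(str)

table3_century = table3.assign(
    century=year_str.str[:2],  # first two characters, e.g. '19'
    year=year_str.str[2:],     # remaining characters, e.g. '99'
)
table3_century
```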

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Uniting

- The `.str.cat()` Series method is used to concatenate strings
    - `sep=...` specifies the separator to use between the strings 
        - By default, the separator is the empty string `''`

- For example, we can reverse the split we performed above and join `century` and `year` back together, like this:
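- A minimal sketch of `.str.cat()`. The DataFrame `split_years` below is a hypothetical stand-in for the result of the split above, and `full_year` is an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame with year split into two string columns
split_years = pd.DataFrame({
    'century': ['19', '20', '19', '20'],
    'year': ['99', '00', '99', '00'],
})

# Concatenate century and year element by element, with no separator
full_year = split_years['century'].str.cat(split_years['year'], sep='')

# The result is still a Series of strings; convert back to integers if needed
full_year_int = full_year.astype('int64')
full_year_int
```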

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 0

In the `data` folder next to this notebook, there is a CSV file `who.csv`, which contains a subset of data from the World Health Organization Global Tuberculosis Report.

The problems below will walk you through tidying this data.

First, read the CSV file into a DataFrame called `who`. Use `.head()` and `.info()` to get a sense of the data.

### Problem 1

The columns `country`, `iso2`, and `iso3` redundantly specify the country. Verify this by grouping the data by `country` and counting the number of unique values of `iso2` and `iso3` for each group. *Hint.* Use the `nunique` reduction/aggregation method.

### Problem 2

Now that you've established that `iso2` and `iso3` are not needed, create a DataFrame `who1` without those redundant columns.

### Problem 3

The columns from `new_sp_m014` to `newrel_f65` give the number of cases in each country-year, broken down by type of TB, sex, and age group.

Pivot the columns in `who1` from `new_sp_m014` to `newrel_f65` into a new variable called `key` (a generic name for now). Drop all rows in the resulting DataFrame with NA values. Put the result in a new DataFrame called `who2`.

### Problem 4

Next, let's parse the values in the `key` column of `who2`:

- The first 3 letters denote whether the observation represents new or old cases of TB
    - Note that this dataset contains only new cases

- The next 2-3 letters describe the type of TB:
    - `rel` = cases of relapse
    - `ep` = cases of extrapulmonary TB
    - `sn` = cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
    - `sp` = cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)

- The next letter gives the sex of TB patients: `m` for male, `f` for female

- The remaining number gives the age group:
    - `014` = 0-14 years old
    - `1524` = 15-24 years old
    - `2534` = 25-34 years old
    - `3544` = 35-44 years old
    - `4554` = 45-54 years old
    - `5564` = 55-64 years old
    - `65` = 65 or older

Note that the `key` values are slightly inconsistent: most use `_` to separate the first 3 letters from the type of TB, except those that have `newrel` instead of `new_rel`.

Create a new DataFrame called `who3` that replaces `newrel` with `new_rel` in the `key` column of `who2`.

*Hint.* Use the `.str.replace()` method. [Here's the documentation.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html)

### Problem 5

Now that the values in `key` are consistently separated by `_`, split `key` into three columns: `new`, `type`, `sexage`. Drop the `key` column, and put the results into a DataFrame called `who4`.

### Problem 6

Split the column `sexage` into 2 columns: `sex` and `age`. Note that the first character of `sexage` is always either `m` or `f`. Drop the `sexage` column, and put the results into a DataFrame called `who5`.

### Problem 7 

Now the dataset is tidy! Put together your code from Problems 2-6 into a single "tidying" method chain. Merge method calls where appropriate.

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- From the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html):
    - [Reshaping and pivot tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

- Lesson and problems inspired by Chapter 12 of [R for Data Science](https://r4ds.had.co.nz/)    