# Tidy Data

Consider the two imaginary datasets below.  The first DataFrame shows the time until headache relief after taking Aspirin (`AspTime`) and Tylenol (`TylTime`).

In [43]:
import pandas as pd

df1 = pd.DataFrame({'Subject': ['Nathan', 'Paul', 'Caroline'],
                    'AspTime': [None, 16, 3],
                    'TylTime': [2, 11, 1]})

df1

Unnamed: 0,Subject,AspTime,TylTime
0,Nathan,,2
1,Paul,16.0,11
2,Caroline,3.0,1


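NB: `None` is Python's null value, and `NaN` is the default missing-value marker in pandas.  In a numeric column pandas converts `None` to `NaN`, which is why the remaining `AspTime` values are displayed as floating-point numbers.  We will learn more about this later on.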

The same data is shown in the DataFrame below.

In [44]:
df2 = pd.DataFrame({'Treatment': ['Aspirin', 'Tylenol'],
                    'Nathan': [None, 2],
                    'Paul': [16, 11],
                    'Caroline': [3, 1]})

df2

Unnamed: 0,Treatment,Nathan,Paul,Caroline
0,Aspirin,,16,3
1,Tylenol,2.0,11,1


What's the difference?

# Data structure

- A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). 

- Values are organized in two ways. Every **value** belongs to a **variable** and an **observation**. 

- A **variable** contains all **values** that measure the same underlying attribute (like height, temperature, duration) across units. 

- An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

Let's reorganize `df1` into `df3`.

In [45]:
df3 = pd.DataFrame({'Subject': ['Nathan', 'Paul', 'Caroline',
                                'Nathan', 'Paul', 'Caroline'],
                    'Treatment': ['Aspirin', 'Aspirin', 'Aspirin',
                                  'Tylenol', 'Tylenol', 'Tylenol'],
                    'Headache Duration': [None, 16, 3, 2, 11, 1]})
df3

Unnamed: 0,Subject,Treatment,Headache Duration
0,Nathan,Aspirin,
1,Paul,Aspirin,16.0
2,Caroline,Aspirin,3.0
3,Nathan,Tylenol,2.0
4,Paul,Tylenol,11.0
5,Caroline,Tylenol,1.0
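
As a small illustration of the definitions above, the sketch below rebuilds `df3` and pulls out one variable (a column) and one observation (a row). The names `duration` and `observation` are introduced here only for illustration.

```python
import pandas as pd

df3 = pd.DataFrame({'Subject': ['Nathan', 'Paul', 'Caroline'] * 2,
                    'Treatment': ['Aspirin'] * 3 + ['Tylenol'] * 3,
                    'Headache Duration': [None, 16, 3, 2, 11, 1]})

# A variable: all values of one attribute across units (a column).
duration = df3['Headache Duration']

# An observation: all values measured on one unit (a row).
# Row 1 holds Paul's measurement under Aspirin.
observation = df3.iloc[1]

print(duration.tolist())
print(observation.to_dict())
```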


- In this **experiment** every combination of `Subject` and `Treatment` was measured.  

- For a given data set, it is usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general. 

- For example, 

   + If the columns in `df1` were `height` and `weight`, we would have been happy to call them variables.
   
   + If the columns were `height` and `width`, it would be less clear cut, as we might think of `height` and `width` as values of a `dimension` variable. 
   
   + If the columns were `home phone` and `work phone`, we could treat these as two variables, but in a fraud detection environment we might want variables `phone number` and `number type` because the use of one phone number for multiple people might suggest fraud.


- A *general rule of thumb* is that it is easier to describe relationships between variables (e.g., z is the sum of x and y, density is the ratio of weight to volume) than between rows, and it is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns.

- In a given analysis, there may be multiple levels of observation. For example (GGR), in the Statistics Canada General Social Survey, demographic data is collected from each person (`age`, `sex`, `race`), and time-use data is collected from each person on each day (`time spent working`, `time spent sleeping`). As another example (EEB), in a trial of a new allergy medication we might have three types of observational unit: demographic data collected from each person (`age`, `sex`, `race`), medical data collected from each person on each day (`number of sneezes`, `redness of eyes`), and meteorological data collected on each day (`temperature`, `pollen count`).
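
The rule of thumb above can be sketched in pandas. This is a minimal illustration on a hypothetical DataFrame: the name `samples`, its columns, and all of its numbers are made up and do not come from the headache data.

```python
import pandas as pd

# Hypothetical measurements; names and numbers are illustrative only.
samples = pd.DataFrame({
    'group':  ['a', 'a', 'b', 'b'],
    'weight': [2.0, 4.0, 9.0, 3.0],
    'volume': [1.0, 2.0, 3.0, 1.5],
})

# Relationship between variables: an expression over columns.
samples['density'] = samples['weight'] / samples['volume']

# Comparison between groups of observations: aggregate rows by group.
mean_density = samples.groupby('group')['density'].mean()

print(samples)
print(mean_density)
```

Both operations are one line each because every variable sits in its own column and every observation in its own row.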

# Tidy Data

**Tidy data** is a standard way of mapping the meaning of a data set to its structure. A data set is **messy** or **tidy** depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

1. Each **variable** forms a column.
2. Each **observation** forms a row.
3. Each type of **observational unit** forms a table.

**Messy data** is any other arrangement of the data.
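
To make rule 3 concrete, here is one possible sketch of the allergy-trial example with one table per type of observational unit. The table names, the `person_id` and `day` key columns, and every value are assumptions made up for illustration.

```python
import pandas as pd

# All names and values below are hypothetical.
# One row per person.
people = pd.DataFrame({'person_id': [1, 2],
                       'age': [34, 51]})

# One row per day.
days = pd.DataFrame({'day': ['2024-05-01', '2024-05-02'],
                     'pollen_count': [120, 80]})

# One row per person per day.
symptoms = pd.DataFrame({
    'person_id': [1, 1, 2, 2],
    'day': ['2024-05-01', '2024-05-02', '2024-05-01', '2024-05-02'],
    'sneezes': [7, 3, 5, 2],
})

# Join the tables only when an analysis needs variables from several levels.
combined = symptoms.merge(people, on='person_id').merge(days, on='day')
print(combined)
```

Keeping the levels in separate tables means a person's `age` is stored once rather than being repeated on every daily row.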

- Tidy data makes it easy for a data scientist or a computer to extract needed variables because it provides a standard way of structuring a data set. Compare `df3` to `df1`: in `df1` you need to use different strategies to extract different variables.

- Many data analysis operations involve all of the values in a variable (e.g., `sum`, `len`), so it is important to be able to extract these values in a simple, standard way. 

- The order of variables and observations does not affect analysis, although a good ordering makes it easier to scan the raw values. 
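
The comparison between `df3` and `df1` can be sketched as follows. This is only one way to show the contrast; the variable names other than `df1` and `df3` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Subject': ['Nathan', 'Paul', 'Caroline'],
                    'AspTime': [None, 16, 3],
                    'TylTime': [2, 11, 1]})

df3 = pd.DataFrame({'Subject': ['Nathan', 'Paul', 'Caroline'] * 2,
                    'Treatment': ['Aspirin'] * 3 + ['Tylenol'] * 3,
                    'Headache Duration': [None, 16, 3, 2, 11, 1]})

# Tidy: every variable is one column, so extraction always looks the same.
durations = df3['Headache Duration']
treatments = df3['Treatment']

# Messy: the durations are spread over two columns, and the treatment
# names live in the column labels rather than in the data.
durations_messy = pd.concat([df1['AspTime'], df1['TylTime']],
                            ignore_index=True)
treatments_messy = ['AspTime', 'TylTime']

# With the tidy layout, a per-treatment summary is a single grouped call.
mean_by_treatment = df3.groupby('Treatment')['Headache Duration'].mean()
print(mean_by_treatment)
```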

# Tidying Messy Data

- We will be exploring tidying different types of messy data in the next few weeks. 

- Data comes in forms like `df1` and `df2` so often that pandas provides functions to help **transform** it into tidy data.  We will learn more about this later.  

- Try to "*read*" the two pieces of code below and guess what each one is doing.  Ask yourself:
    + Does `id_vars` represent a variable or an observation?
    + Does `value_vars` represent a variable or an observation?

In [46]:
df1

Unnamed: 0,Subject,AspTime,TylTime
0,Nathan,,2
1,Paul,16.0,11
2,Caroline,3.0,1


In [47]:
df1.melt(id_vars = ['Subject'], value_vars = ['AspTime', 'TylTime'])

Unnamed: 0,Subject,variable,value
0,Nathan,AspTime,
1,Paul,AspTime,16.0
2,Caroline,AspTime,3.0
3,Nathan,TylTime,2.0
4,Paul,TylTime,11.0
5,Caroline,TylTime,1.0


In [48]:
df2

Unnamed: 0,Treatment,Nathan,Paul,Caroline
0,Aspirin,,16,3
1,Tylenol,2.0,11,1


In [49]:
df2.melt(id_vars = ['Treatment'], value_vars = ['Nathan', 'Paul', 'Caroline'])

Unnamed: 0,Treatment,variable,value
0,Aspirin,Nathan,
1,Tylenol,Nathan,2.0
2,Aspirin,Paul,16.0
3,Tylenol,Paul,11.0
4,Aspirin,Caroline,3.0
5,Tylenol,Caroline,1.0
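
As a possible next step, `melt` also accepts `var_name` and `value_name` to label the two new columns. The sketch below uses them to turn `df1` into the same layout as `df3`. The name `tidy` and the relabelling of `AspTime` and `TylTime` to the treatment names are choices made here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Subject': ['Nathan', 'Paul', 'Caroline'],
                    'AspTime': [None, 16, 3],
                    'TylTime': [2, 11, 1]})

# var_name and value_name replace the default 'variable' and 'value' labels.
tidy = df1.melt(id_vars=['Subject'],
                value_vars=['AspTime', 'TylTime'],
                var_name='Treatment',
                value_name='Headache Duration')

# Map the original column labels to the treatment names used in df3.
tidy['Treatment'] = tidy['Treatment'].replace({'AspTime': 'Aspirin',
                                               'TylTime': 'Tylenol'})
print(tidy)
```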
