# 01. Tidy Data with `melt`

### Objectives
After this lesson you should be able to...
+ Explain what tidy data is
+ Spot messy data
+ Transform a simple messy dataset into a tidy dataset
+ Use **`melt`** to reshape the data
+ Use parameter **`id_vars`** to keep a column vertical
+ Use parameter **`value_vars`** to melt columns into a single column

### Resources
+ Read Hadley Wickham's paper on [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)
+ Watch Hadley Wickham's talk on [tidy data](https://vimeo.com/33727555)
+ Watch Jeff Leek's video on [tidy data](https://www.youtube.com/watch?v=whDilsFoLVY)
+ Read the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

## Datasets until now
Thus far, we have analyzed several datasets but have not done much work to change their structure or do any preprocessing before computation. We immediately began generating results and answering questions. Producing results is typically not the first step of a data analysis. The vast majority of datasets 'in the wild' will need some amount of inspection and preprocessing. And in some cases, the entire project will just be about cleaning the data so that it can be further processed by someone else. 

For all the work that goes into data preparation for machine learning, there is surprisingly sparse coverage on how to do it. This notebook will use many ideas formulated by Hadley Wickham to **tidy** data before introducing a few more steps in order to prepare it for machine learning and visualization.

There's an infamous data science saying that goes something like this: "data scientists spend 80% of their time cleaning data and the other 20% complaining about cleaning the data."

## The Genesis of Data
Do you know where and how data is generated? Many introductory courses such as this one use premade CSV files. Loading one of these files into your workspace is not the genesis of its data. The data must have come from somewhere. It wasn't just magically put in a CSV file, on a website, or in a database used by an API.

Some original sources of data might be:
+ While you play a mobile game, your smartphone sends game data to a small SQLite instance on the phone and to a large remote Amazon S3 server.
+ You keep track of all your golf scores on paper and copy them to an Excel file after each round.
+ Sensors on industrial equipment continually pour data into an on-premises Hadoop cluster.
+ Facebook quickly writes all its interactions to HBase.
+ City of Houston employees enter personal information into an online web app.

Yes, non-electronic data does exist and is valuable (that was all there was before the 20th century) but for obvious reasons we will only deal with electronic data that can be read by modern computers.

# Tidy Data
Tidy data is a term coined by Hadley Wickham, the creator of many popular R packages, to describe data that is in a specific **structure** that makes for easy analysis. It is highly recommended that you read [his paper][1] to get a fuller understanding of tidy data. The basics will be covered below.

A dataset is tidy when:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

Any dataset that does not meet this definition is considered "messy". The definition is simple but useful, and it will take you a long way in your data analyses.

### First example of messy data
Messy data can appear deceptively clean and tidy, especially if you have not been exposed to it before. The table below holds some data about the weight of different types of fruit in a few states.

[1]: http://vita.had.co.nz/papers/tidy-data.pdf

In [1]:
import pandas as pd
import numpy as np

In [2]:
# looks so nice and clean!
df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})
df

Unnamed: 0,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


### What's wrong?
Even though the dataset presents perfectly readable and acceptable information, it is not technically a tidy dataset. Machine learning would be uninteresting with a dataset this small, but visualization would be easier if the data were tidy. More on this in the plotting notebooks.

The main issue with the above dataset is that some of the column names (**`Apple`**, **`Orange`**, and **`Banana`**) are not variable names at all. They are values of a variable, the type of fruit. At this point, you might be confused as to what exactly is meant by a 'variable'. A simple definition of a variable is **anything that is liable to change**.

### What are the variable names?
Only **`State`** appears as a variable name directly in the DataFrame above. You must infer the others from the context of the problem. The variables are:
+ States
+ Types of fruit 
+ Weight of fruit

### Actual Tidying
To tidy, we simply need to make sure the three tidy rules are followed. Let's start with forcing each variable into a column. The states are already a single column.

The types of fruit are column names and need to be transposed into a single column.

The weights of the fruit are spread across a three-by-three block of cells and need to be collected into a single column.

### Melting
Pandas contains a DataFrame method named **`melt`**, which takes several optional parameters. Two of them are the most important:

+ **`id_vars`** - a list of column names that you want to keep as columns.
+ **`value_vars`** - a list of column names that you would like to reshape into one column

This 'reshaping' into one column is usually referred to as **melting** or **stacking**. The **`id_vars`** will stay in the same column they are currently in, but **repeat** to align with all the newly melted values in the **`value_vars`** columns. 

Let's keep the **`State`** column vertical and melt the fruit columns:

In [3]:
df.melt(id_vars='State', value_vars=['Apple', 'Orange', 'Banana'])

Unnamed: 0,State,variable,value
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190
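
To make the repetition of the **`id_vars`** more concrete, here is a minimal sketch that assembles the same nine rows by hand. It is only an illustration of the resulting layout, not a description of how pandas implements **`melt`** internally. The names `fruit_cols` and `manual` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})

fruit_cols = ['Apple', 'Orange', 'Banana']  # hypothetical helper name

manual = pd.DataFrame({
    # the whole State column is repeated once for each melted column
    'State': np.tile(df['State'].to_numpy(), len(fruit_cols)),
    # each old column name is repeated once for each original row
    'variable': np.repeat(fruit_cols, len(df)),
    # the 3 x 3 block of weights is read one column at a time
    'value': df[fruit_cols].to_numpy().ravel(order='F'),
})

melted = df.melt(id_vars='State', value_vars=fruit_cols)

# compare the cell values of the hand-built frame with the melted one
print((manual.to_numpy() == melted.to_numpy()).all())
```

The three original rows and three melted columns give 3 x 3 = 9 rows in the result, one per state and fruit combination.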


## Renaming the columns with `var_name` and `value_name`
By default, the **`melt`** method will name the column containing the old column names as **`variable`**. The column containing the values of these columns is named **`value`**.

The **`melt`** method provides two additional parameters **`var_name`** and **`value_name`**. Set these parameters equal to the column names you would prefer.

In [4]:
df.melt(id_vars='State', value_vars=['Apple', 'Orange', 'Banana'],
        var_name='Fruit', value_name='Weight')

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


### Alternate Syntax
By default **`melt`** will melt all the columns that are not **`id_vars`**. You don't have to explicitly put them in a list.

In [5]:
df.melt(id_vars='State', var_name='Fruit', value_name='Weight')

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190
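
As a quick check, the sketch below compares the explicit and the default forms on the fruit DataFrame. It also shows what happens when only some columns are listed in **`value_vars`**: columns that appear in neither **`id_vars`** nor **`value_vars`** are left out of the result. The variable names `explicit`, `implicit`, and `subset` are chosen just for this example.

```python
import pandas as pd

df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})

# listing every fruit column explicitly
explicit = df.melt(id_vars='State', value_vars=['Apple', 'Orange', 'Banana'],
                   var_name='Fruit', value_name='Weight')

# letting melt use every column that is not in id_vars
implicit = df.melt(id_vars='State', var_name='Fruit', value_name='Weight')

print(explicit.equals(implicit))

# melting only two of the fruit columns; Banana does not appear in the result
subset = df.melt(id_vars='State', value_vars=['Apple', 'Orange'],
                 var_name='Fruit', value_name='Weight')
print(subset.shape)
```

Relying on the default is convenient, but listing **`value_vars`** explicitly is safer when the DataFrame has other columns that should not be melted.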


### Our first tidy dataset
By ensuring that each variable forms its own column, each observation is also in its own row. We now have tidy data.
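
One way to see why this structure makes analysis easier is to ask a few simple questions of the tidy version. The sketch below assumes the same fruit DataFrame as above and uses `groupby` and boolean selection. The variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})

tidy = df.melt(id_vars='State', var_name='Fruit', value_name='Weight')

# every variable is a column, so each can be grouped or filtered by name
total_by_fruit = tidy.groupby('Fruit')['Weight'].sum()
total_by_state = tidy.groupby('State')['Weight'].sum()

# one row per observation: keep the state and fruit pairs heavier than 10
heavy = tidy[tidy['Weight'] > 10]

print(total_by_fruit)
print(total_by_state)
print(heavy)
```

With the messy version, the fruit totals and the state totals would need two different kinds of operations (a column-wise sum and a row-wise sum), because the fruit type lives in the column names.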

## Key terms - reshaping and restructuring
When you think of tidy data, your brain should think about the terms **reshaping** or **restructuring**. The data is being maneuvered like a jigsaw puzzle so that it ends up in a different form. The actual data values are not changing (though some aspects of tidying data will change the values).
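
To illustrate that melting only rearranges the values, the sketch below melts the fruit DataFrame and then reverses the operation with the **`pivot`** method, which is not covered in this lesson and is used here only as a check. The names `wide`, `original`, and `roundtrip` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})

tidy = df.melt(id_vars='State', var_name='Fruit', value_name='Weight')

# spread the Fruit values back out into one column per fruit
wide = tidy.pivot(index='State', columns='Fruit', values='Weight')

# pivot orders the row and column labels, so realign them with the original
original = df.set_index('State')
roundtrip = wide.loc[original.index, original.columns]

# the same weights come back in the same cells
print((roundtrip.to_numpy() == original.to_numpy()).all())
```

Nothing was added, removed, or recalculated along the way. The nine weights simply moved between a wide layout and a long one.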

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">In this problem, we will only look at the title column and the actor name columns. Restructure the dataset so that there are only three variables - the title of the movie, the actor number (1, 2, or 3), and the actor name. Sort the result by title, assign the DataFrame to a variable and output the first 20 rows.</span>

In [6]:
# execute this cell and use this dataset
movie = pd.read_csv('../data/movie.csv')
movie.head()

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1


In [12]:
# your code here
q1 = movie.melt(id_vars='title', value_vars=['actor1', 'actor2', 'actor3'],
                var_name='actor number', value_name='actor name').sort_values('title')
q1.head(20)

Unnamed: 0,title,actor number,actor name
14181,#Horror,actor3,Lydia Hearst
9265,#Horror,actor2,Balthazar Getty
4349,#Horror,actor1,Timothy Hutton
13461,10 Cloverfield Lane,actor3,Sumalee Montano
8545,10 Cloverfield Lane,actor2,John Gallagher Jr.
3629,10 Cloverfield Lane,actor1,Bradley Cooper
7880,10 Days in a Madhouse,actor2,Kelly LeBrock
12796,10 Days in a Madhouse,actor3,Alexandra Callas
2964,10 Days in a Madhouse,actor1,Christopher Lambert
7715,10 Things I Hate About You,actor2,Heath Ledger


### Problem 2
<span  style="color:green; font-size:16px">Using the original movie dataset (and keeping its structure), attempt to count the total appearances of each actor in the dataset, regardless of whether they are actor 1, 2, or 3. Then repeat this task with your tidy dataset.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Tidy the dataset in the **`tidy/employee_messy1.csv`** file. It contains the count of all employees by race and gender.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Tidy the dataset in the **`tidy/employee_messy2.csv`** file. It contains the count of all employees by department, race and gender.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Tidy the dataset in the **`tidy/employee_salary_stats.csv`** file. Save the tidy dataset to a variable and then select all the median salaries. Then select all the median salaries with the original 'messy' dataset. Which one is easier to read summary statistics from?</span>

In [None]:
# your code here