# 2. Tidying Data for Analysis

## Tidy Data

A concept formalized by Hadley Wickham in 2014. It aims to:
* Formalize the way we describe the shape of data
* Establish a goal when formatting our data
* Provide a standard way to organize data values within a dataset

### Principles of Tidy Data

1. Columns represent separate variables
2. Rows represent individual observations
3. Each type of observational unit forms its own table

**Example of converting to tidy data:**

![tidy.png](attachment:tidy.png)

While tidy data is not the best format for ***reporting***, it is the form of choice for ***data analysis***:

* Tidy data makes it easier to fix common data problems (the code is simpler to write)

One common problem to fix, according to Wickham:
* Column headers that hold values instead of variable names

We fix this by **melting** the data with pandas.

### Melting Data
* It's the process of turning columns of your data into rows of data. 

In [None]:
# Example of melting: we turn the 2 treatment columns into a single one
import pandas as pd

# id_vars names the column(s) to keep fixed
# value_vars=['',''] specifies which columns you want to melt
# var_name='' and value_name='' rename the newly created columns

# If you don't specify value_vars, it'll melt all columns not in id_vars!

# df is assumed to be an existing DataFrame containing these columns
pd.melt(frame=df, id_vars='name',
        value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')

### Pivoting Data

* The opposite of melting
* Turns unique values from rows into separate columns
* Useful for converting an analysis-friendly shape into a report-friendly one
* Use when the dataset violates tidy data principles:
    * Multiple variables stored in the same column

In [None]:
# index='' specifies the columns to stay fixed
# columns='' specifies the columns to be pivoted into new columns
# values='' specifies the values to be used to fill the new columns

df_tidy = df.pivot(index='', columns='', values='')
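As a sketch of how the empty placeholders might be filled in, the example below pivots a small, made-up weather table in which the `element` column mixes two variables (`tmax` and `tmin`). The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: 'element' stores two different variables in one column
df = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'value': [27.8, 14.5, 27.3, 14.4],
})

# Each unique value of 'element' becomes its own column, indexed by 'date'
df_tidy = df.pivot(index='date', columns='element', values='value')

print(df_tidy)
```

After pivoting, `date` is the index and `tmax` and `tmin` are separate columns.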

#### Pivot Table method
* Used when we have duplicate entries
* Has a parameter that specifies how to deal with duplicate values
    * Example: Can aggregate the duplicate values by taking their average

In [None]:
import numpy as np

# aggfunc=np.mean tells pandas how to handle multiple values for the same index/column pair; in this case, it takes their average.

df2_tidy = df.pivot_table(index='', columns='', values='', aggfunc=np.mean)

# If aggfunc is omitted, pivot_table takes the mean by default
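The sketch below illustrates the duplicate-entry case on made-up data: `pivot` fails because one date has two `tmax` readings, while `pivot_table` averages them. It passes the string `'mean'` as `aggfunc`, which is assumed here to behave the same as `np.mean` for this purpose.

```python
import pandas as pd

# Hypothetical data: 2010-01-30 has two 'tmax' readings (a duplicate entry)
df = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmax', 'tmin', 'tmax', 'tmin'],
    'value': [27.0, 29.0, 14.5, 27.3, 14.4],
})

# Plain pivot cannot decide which duplicate to keep, so it raises an error
try:
    df.pivot(index='date', columns='element', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the duplicates by aggregating them
df2_tidy = df.pivot_table(index='date', columns='element',
                          values='value', aggfunc='mean')

print(pivot_failed)
print(df2_tidy)
```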

#### Resetting the index of a dataframe

In [None]:
# To view the index:
df.index
# To reset it (returns a new DataFrame; the old index becomes a regular column):
df.reset_index()
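Resetting the index is typically useful right after pivoting, since the pivoted column ends up as the index. A minimal sketch, again on made-up weather data:

```python
import pandas as pd

# Hypothetical data, pivoted so that 'date' becomes the index
df = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'value': [27.8, 14.5, 27.3, 14.4],
})
pivoted = df.pivot_table(index='date', columns='element', values='value')

# reset_index moves 'date' back into a regular column
# and replaces the index with the default 0..n-1 range
flat = pivoted.reset_index()

print(pivoted.index)
print(flat)
```

Note that `reset_index` does not modify `pivoted` in place; the result has to be assigned to a variable.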

### Beyond melt and pivot

* Melting and Pivoting are the basic tools for cleaning and reshaping data
* Another problem: columns contain multiple bits of information

#### You can split a column into two or more columns

In [None]:
# Melt the df
df_melt = pd.melt(frame=df, id_vars=['country', 'year'])

# Create the 'gender' column
df_melt['gender'] = df_melt.variable.str[0]
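The sketch below runs the same steps on a small, made-up DataFrame whose column names combine a gender letter and an age group (for example `m014`). The data and the extra `age_group` column are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical data: column names such as 'm014' encode gender + age group
df = pd.DataFrame({
    'country': ['AD', 'AE'],
    'year': [2000, 2000],
    'm014': [0, 2],
    'f014': [1, 3],
})

# Melt everything except the id columns; the new columns
# get the default names 'variable' and 'value'
df_melt = pd.melt(frame=df, id_vars=['country', 'year'])

# The first character of 'variable' is the gender
df_melt['gender'] = df_melt.variable.str[0]

# The remaining characters are the age group (illustrative extra column)
df_melt['age_group'] = df_melt.variable.str[1:]

print(df_melt)
```

The `.str` accessor applies string slicing to every row, so one combined column yields two separate variables.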