# Tidy Data V1
What is it?

References: 
* https://vita.had.co.nz/papers/tidy-data.pdf
* https://en.wikipedia.org/wiki/Tidy_data

From Hadley Wickham, in Section 4 of the paper above:
> Tidy data is only worthwhile if it makes analysis easier.

From this, I sense another question. 
_**What are we trying to analyze?**_

In the end, tidy data is contextual. 
There are some common methods to help reach a tidy dataset, but variable and observation definitions change depending on your goal.
In general, a good process for working out how to tidy your data may be:
1. Determine a question
2. Figure out which `variables` are needed to define your `observation`
3. _Tidy your data_
4. Answer your question

**Note that tidying your data doesn't come until step 3!!**

With that, let's look at an example case study on individual-level mortality data from Mexico.

**The Goal**
> Find causes of death with unusual temporal patterns within a day.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

%matplotlib inline

PLOT_TITLE_FONTDICT = {
    'fontsize': 'xx-large',
    'fontweight' : 'heavy'}

PLOT_TICK_FONTDICT = {
    'fontsize': 'x-large',
    'fontweight' : 'demibold'}

In [None]:
deaths = pd.read_csv('./../../data/10_26_2019/deaths08.csv')
deaths.head()

In [None]:
'Dataset has {} rows and {} columns.'.format(*deaths.shape)

<hr>

## Step 1: Asking a question
Let's start with something pretty simple.

### What hour of the day do the most deaths occur?

<hr>

## Step 2: Figure out which `variables` are needed to define your `observation`
First, what are variables and observations?

### Variable
`A variable contains all values that measure the same underlying attribute across units (groups)`

### Observation
`An observation contains all values measured on the same unit across attributes (values specific to a combination of groups)`
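
To make the two definitions concrete, here is a minimal sketch on a tiny made-up table. The column names (`cause`, `hour_0`, `hour_1`) and the counts are hypothetical and are not taken from `deaths08.csv`. It shows one common way to move a variable that is hidden in the column names into a column of its own.

```python
import pandas as pd

# Hypothetical, made-up counts purely for illustration (not from deaths08.csv).
messy = pd.DataFrame({
    'cause': ['A', 'B'],
    'hour_0': [3, 1],
    'hour_1': [5, 2],
})

# In `messy`, the variable "hour" is hidden in the column names.
# Melting moves it into its own column, so each row becomes one observation:
# one (cause, hour) unit together with its measured count.
tidy = messy.melt(id_vars='cause', var_name='hour', value_name='death_count')
tidy['hour'] = tidy['hour'].str.replace('hour_', '', regex=False).astype(int)

print(tidy)
```

After the melt, every column is a variable (`cause`, `hour`, `death_count`) and every row is an observation. Whether this is the "right" shape still depends on the question being asked.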

### What variables are in play?
Only one here. 
* `hod`: Categorical group. The hour of the day, from 0 to 23.

### What is our observation?
* `death count`: the number of deaths in each hour, counted across the entire dataset

## Step 3: Tidy your data
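Group by hour, and get an overall count! The chain of calls does the following, in order:

1. Select the target column we're interested in (`deaths[['hod']]`)
2. Effectively create a blank column to count on (`.reset_index()` turns the row index into an `index` column)
3. Establish the group (`.groupby('hod')`)
4. Perform the aggregation (`.count()`)
5. Go back to a normal dataframe (`.reset_index()`)
6. Rename the columns so they're easy to interpret (`.rename(...)`)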

In [None]:
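# Count deaths per hour: keep `hod`, expose the row index as a column to
# count on, group, count, then flatten and rename for readability.
death_count_per_hour = (
    deaths[['hod']]
    .reset_index()
    .groupby('hod')  # note: rows where `hod` is NaN are dropped by default
    .count()
    .reset_index()
    .rename(columns={'hod': 'hour', 'index': 'death_count'})
)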

death_count_per_hour
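
The same per-hour table can be reached in fewer steps. The sketch below is an alternative, not the method used in this notebook. It runs on a small made-up frame (`toy_deaths`, a hypothetical stand-in for `deaths`) and assumes only the `hod` column matters.

```python
import pandas as pd

# Hypothetical stand-in for `deaths`; only the `hod` column is needed here.
toy_deaths = pd.DataFrame({'hod': [0, 0, 1, 23, 23, 23]})

# groupby(...).size() counts rows per group directly, so no helper
# column is needed; reset_index(name=...) names the count column.
counts = (
    toy_deaths
    .groupby('hod')
    .size()
    .reset_index(name='death_count')
    .rename(columns={'hod': 'hour'})
)

print(counts)
```

Using `.size()` skips the extra `.reset_index()` at the start of the chain, at the cost of being a little less explicit about what is being counted.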

At this point, we have the data we need to answer our question. 
If we wanted to scrub further, we could!

In [None]:
# valid hours: everything between 0 and 23
death_count_per_hour__valid = death_count_per_hour[death_count_per_hour['hour'].isin(range(24))]

# unknown hours: every row that isn't in the valid subset
# (.loc, not .iloc, because index.difference returns index labels, not positions)
death_count_per_hour__unknown = death_count_per_hour.loc[death_count_per_hour.index.difference(death_count_per_hour__valid.index), :]

In [None]:
death_count_per_hour__valid.shape, death_count_per_hour__unknown.shape
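
A single boolean mask is another way to make the valid/unknown split. The sketch below uses made-up counts, and the code `99` for an unknown hour is an assumption for illustration only, so check what your data actually uses. It also shows a caveat worth knowing: `groupby` drops missing keys by default, so rows with a missing hour never reach a grouped table unless `dropna=False` is passed (available in pandas 1.1 and later).

```python
import numpy as np
import pandas as pd

# Hypothetical per-hour counts; 99 stands in for an "unknown" hour code.
# The code 99 is an assumption for illustration only.
counts = pd.DataFrame({'hour': [0, 1, 23, 99], 'death_count': [2, 1, 3, 4]})

# One boolean mask defines both subsets, so they cannot overlap or miss rows.
is_valid = counts['hour'].isin(range(24))
valid = counts[is_valid]
unknown = counts[~is_valid]

# groupby drops NaN keys by default, so rows with a missing hour never reach
# a grouped table; keeping them needs dropna=False (pandas >= 1.1).
raw = pd.DataFrame({'hod': [0, 0, np.nan]})
with_missing = raw.groupby('hod', dropna=False).size()

print(valid.shape, unknown.shape)
print(with_missing)
```

If the source data does contain missing hours, any "unknown" percentage computed from the grouped table alone would leave those rows out.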

### Conveying the results
Some [questions](https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization) to ask when deciding how to present your data:
1. Do you want to compare values? _column, line, bar, pie, scatter, mekko_
2. Do you want to show the composition of something? _pie, stacked bar, mekko, stacked column, area, waterfall_
3. Do you want to understand the distribution of your data? _scatter, mekko, line, column, bar_
4. Are you interested in analyzing trends in your data set? _line, dual-axis line, column_
5. Do you want to better understand the relationship between value sets? _scatter, bubble, line_

Yes to question 4.
`hour` is a time-like value, so we can begin to analyze trends.
Trends are best represented by a [line chart](https://en.wikipedia.org/wiki/Line_chart), and sometimes emphasized with a [scatterplot](https://en.wikipedia.org/wiki/Scatter_plot) overlay.

In [None]:
f, ax = plt.subplots(1, 1, figsize=(16, 4))

# actual plotting: a line for the trend, with a scatter overlay to mark each hour
ax.plot(death_count_per_hour__valid['hour'], death_count_per_hour__valid['death_count'])
ax.scatter(death_count_per_hour__valid['hour'], death_count_per_hour__valid['death_count'])

# setting the title, including the share of deaths with an unknown hour
total_count, _ = deaths.shape
unknown_death_pct = round((death_count_per_hour__unknown['death_count'].sum() / total_count) * 100, 2)
ax.set_title(f'Death Count by Hour ({unknown_death_pct}% unknown)', loc='left', fontdict=PLOT_TITLE_FONTDICT)

# setting xticks: one tick per hour; the labels restart after noon,
# so hours 0-11 and hours 12-23 are each labeled 1-12
ax.set_xticks(death_count_per_hour__valid['hour'].unique())
ax.set_xticklabels(list(range(1, 13)) * 2)

# grid lines
ax.grid();

In [None]:
print('Top 2 Hours')
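top_hours = death_count_per_hour__valid.sort_values(by=['death_count'], ascending=False).head(2)
for hour, death_count in top_hours.values:
    # `hour` is 0-based (0-23), so hour + 1 reports it as the 1st-24th hour of the day;
    # int() avoids printing floats (e.g. 11.0) if `hod` was read in as a float column
    print(f'Hour: {int(hour) + 1:>5} | Count: {int(death_count)}')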
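
If clock-style labels are easier to read than hour positions, a small formatter can help. The sketch below is only an illustration: `hour_label` is a hypothetical helper, the counts are made up, and it assumes `hod` 0 means the hour starting at midnight. It also uses `nlargest` as a shorter way to pick the top rows.

```python
import pandas as pd

def hour_label(hod: int) -> str:
    """Format a 0-23 hour of day as a 12-hour clock label (hypothetical helper)."""
    suffix = 'AM' if hod < 12 else 'PM'
    return f'{hod % 12 or 12} {suffix}'

# Hypothetical counts for illustration only.
counts = pd.DataFrame({'hour': [0, 11, 12, 18], 'death_count': [5, 9, 7, 8]})

# nlargest returns the rows with the highest counts, already sorted.
top_two = counts.nlargest(2, 'death_count')
for row in top_two.itertuples(index=False):
    print(f'Hour: {hour_label(row.hour):>5} | Count: {row.death_count}')
```

The same helper could be mapped over the tick positions to label the x-axis, which would remove the ambiguity of showing 1-12 twice.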