# Fixing Problems with the Golden Rule

In [1]:
import polars as pl

## The Golden Rule of Tabular Data

The **Golden Rule of Tabular Data** states that tables should be arranged with

* One variable per column
* One individual per row

Computer scientists call this [database normalization](https://en.wikipedia.org/wiki/Database_normalization); [Wickham (2014)](https://vita.had.co.nz/papers/tidy-data.pdf) calls this *tidy data*.

## Violations of the golden rule

Wickham identifies the following violations:

* Column headers are values, not variable names.
* Multiple variables are stored in one column.
* Variables are stored in both rows and columns.
* Multiple types of observational units are stored in the same table.
* A single observational unit is stored in multiple tables.


## Example 1 - PEW Income Research
    
**Task:** Load the file `PEW_income_religion.csv`.  Identify the violation of the golden rule.


In [3]:
income = pl.read_csv('./data/PEW_income_religion.csv')
income

religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k
str,i64,i64,i64,i64,i64,i64
"""Agnostic""",27,34,60,81,76,137
"""Atheist""",12,27,37,52,35,70
"""Buddhist""",27,21,30,34,33,58
"""Catholic""",418,617,732,670,638,1116
"""Don't know/ref...",15,14,15,11,10,35
"""Evangelical Pr...",575,869,1064,982,881,1486
"""Hindu""",1,9,7,9,11,34
"""Historically B...",228,244,236,238,197,223
"""Jehovah's Witn...",20,27,24,24,21,30
"""Jewish""",19,19,25,25,30,95


**Problem:** There are measurements (the income categories) in the table heading.

## Measurements in column labels?  Stack!

We can fix issues with informative column labels by stacking the data with `melt`.

In [7]:
?income.melt

In [10]:
(income
  .melt('religion', None, 'income category', 'count')  # value_vars=None: stack every column except the id column
  .head()
)

religion,income category,count
str,str,i64
"""Agnostic""","""<$10k""",27
"""Atheist""","""<$10k""",12
"""Buddhist""","""<$10k""",27
"""Catholic""","""<$10k""",418
"""Don't know/ref...","""<$10k""",15

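To make the stacking step concrete, here is a minimal plain-Python sketch of what a melt does. It is not how polars implements `melt`; `melt_rows` is a hypothetical helper, and the two rows and two income columns are copied from the table above.

```python
# Two rows and two income columns copied from the PEW table above.
wide = [
    {"religion": "Agnostic", "<$10k": 27, "$10-20k": 34},
    {"religion": "Atheist", "<$10k": 12, "$10-20k": 27},
]


def melt_rows(rows, id_col, var_name, value_name):
    """Hypothetical helper: emit one long row per (input row, non-id column) pair."""
    value_cols = [c for c in rows[0] if c != id_col]
    long_rows = []
    # Loop over columns first so the result is ordered column by column,
    # matching the order of the stacked output shown above.
    for col in value_cols:
        for row in rows:
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: row[col]}
            )
    return long_rows


long = melt_rows(wide, "religion", "income category", "count")
for row in long:
    print(row)
```

Each column label becomes a value in the `income category` column, and each cell becomes its own row in `count`.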

##  Example 2
    
Let's look at a dataset based on another example provided by Wickham.

In [11]:
weather = pl.read_csv('https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/weather-raw.csv')
weather.head()

id,year,month,element,d1,d2,d3,d4,d5,d6,d7,d8
str,i64,i64,str,str,f64,f64,str,f64,str,str,str
"""MX17004""",2010,1,"""tmax""",,,,,,,,
"""MX17004""",2010,1,"""tmin""",,,,,,,,
"""MX17004 """,2010,2,"""tmax""",,27.3,24.1,,,,,
"""MX17004""",2010,2,"""tmin""",,14.4,14.4,,,,,
"""MX17004""",2010,3,"""tmax""",,,,,32.1,,,


## Data in the column labels

The columns `d1`, `d2`, etc. are days of the month, so the column labels themselves contain data.

In [15]:
from composable.string import startswith
from composable.strict import filter

days = weather.columns >> filter(startswith('d'))
days

['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8']
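
The same selection can be sketched without the `composable` helpers, using only the standard library. The column names below are copied from the `weather.head()` output; the stricter pattern is an optional variation, not something the original analysis uses.

```python
import re

# Column names copied from the weather table above.
columns = ["id", "year", "month", "element",
           "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

# Prefix match, the same test as the startswith('d') filter.
days_prefix = [c for c in columns if c.startswith("d")]

# Stricter match: "d" followed only by digits, so a hypothetical
# column named "date" would not be picked up by accident.
day_pattern = re.compile(r"d\d+")
days_strict = [c for c in columns if day_pattern.fullmatch(c)]

print(days_prefix == days_strict)
```

For this table both approaches select the same eight columns; the stricter pattern only matters if other column names happen to start with `d`.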

In [16]:
other_cols = [c for c in weather.columns if c not in days]
other_cols

['id', 'year', 'month', 'element']

In [21]:
weather_stacked = (weather
                   .melt(other_cols,
                         days,
                          'day',
                          'measurement')
                  )
weather_stacked.head()

id,year,month,element,day,measurement
str,i64,i64,str,str,str
"""MX17004""",2010,1,"""tmax""","""d1""",
"""MX17004""",2010,1,"""tmin""","""d1""",
"""MX17004 """,2010,2,"""tmax""","""d1""",
"""MX17004""",2010,2,"""tmin""","""d1""",
"""MX17004""",2010,3,"""tmax""","""d1""",


## Still violates the rule!

Note that
* There are a lot of meaningless rows with no actual measurements.
* The `element` column contains *variable names* (`tmax` and `tmin`).
* These variables should be separate columns.

In [23]:
weather_stacked.head()

id,year,month,element,day,measurement
str,i64,i64,str,str,str
"""MX17004""",2010,1,"""tmax""","""d1""",
"""MX17004""",2010,1,"""tmin""","""d1""",
"""MX17004 """,2010,2,"""tmax""","""d1""",
"""MX17004""",2010,2,"""tmin""","""d1""",
"""MX17004""",2010,3,"""tmax""","""d1""",


## Solution: Unstack the column containing variable names

In [32]:
?weather_stacked.pivot

In [37]:
pl.Config.with_columns_kwargs = True

weather_fixed = (weather_stacked 
                 .filter(pl.col('measurement').is_not_null()) 
                 .with_columns(measurement = pl.col('measurement').cast(pl.Float64))
                 .pivot('measurement',
                        ['id', 'month', 'year', 'day'],
                        'element',
                        'first'
                       )
                 .groupby(['id', 'month', 'year', 'day'])
                 .agg([pl.col('tmax').mean().alias('tmax'),
                       pl.col('tmin').mean().alias('tmin'),
                      ])
                )
weather_fixed

id,month,year,day,tmax,tmin
str,i64,i64,str,f64,f64
"""MX17004 """,2,2010,"""d2""",27.3,
"""MX17004""",2,2010,"""d2""",,14.4
"""MX17004 """,2,2010,"""d3""",24.1,
"""MX17004""",2,2010,"""d3""",,14.4
"""MX17004""",3,2010,"""d5""",32.1,14.2

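The output above still has two rows each for `d2` and `d3`, because some `id` values carry a trailing space (`"MX17004 "` versus `"MX17004"`) and are therefore treated as different stations. The plain-Python sketch below illustrates the unstacking step and the effect of stripping that whitespace first. It is not the polars implementation; `unstack` and its `clean_id` flag are hypothetical, and the rows are copied from the non-null measurements shown above.

```python
# Long-form rows copied from the non-null measurements above.
# Note the trailing space in some of the id values.
long_rows = [
    {"id": "MX17004 ", "year": 2010, "month": 2, "day": "d2", "element": "tmax", "measurement": 27.3},
    {"id": "MX17004", "year": 2010, "month": 2, "day": "d2", "element": "tmin", "measurement": 14.4},
    {"id": "MX17004 ", "year": 2010, "month": 2, "day": "d3", "element": "tmax", "measurement": 24.1},
    {"id": "MX17004", "year": 2010, "month": 2, "day": "d3", "element": "tmin", "measurement": 14.4},
    {"id": "MX17004", "year": 2010, "month": 3, "day": "d5", "element": "tmax", "measurement": 32.1},
    {"id": "MX17004", "year": 2010, "month": 3, "day": "d5", "element": "tmin", "measurement": 14.2},
]


def unstack(rows, key_cols, name_col, value_col, clean_id=False):
    """Hypothetical helper: one output row per key, one column per name."""
    wide = {}
    for row in rows:
        # Optionally strip whitespace from the id so equal stations share a key.
        key = tuple(
            row[c].strip() if clean_id and c == "id" else row[c]
            for c in key_cols
        )
        record = wide.setdefault(key, dict(zip(key_cols, key)))
        record[row[name_col]] = row[value_col]
    return list(wide.values())


keys = ["id", "month", "year", "day"]
raw = unstack(long_rows, keys, "element", "measurement")
cleaned = unstack(long_rows, keys, "element", "measurement", clean_id=True)

print(len(raw), len(cleaned))  # 5 3
```

With the ids cleaned, each day has a single row holding both `tmax` and `tmin`. In polars, stripping `id` with a string expression before the pivot should have the same effect; the name of the strip method depends on the polars version.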

## Problems that require we reshape the data

The following problems require reshaping the data:
* Measurement information in column labels
* Measurements on the same individual across multiple rows

## <font color="red"> Exercise 2.6.2 </font>
    
**Task:** Load the `rochester_mins_max_temp_2018.csv` data, which contains weather data for Rochester, MN, and is available at the [DNR website](https://www.dnr.state.mn.us/climate/historical/lcd.html?loc=rst). Note that `SM` and `AV` stand for *sum* and *average*, respectively.

1. Identify the problem with the current format.
2. Use `melt` and `pivot` to fix the issue.

In [None]:
temps = pl.read_csv("./data/Rochester_temps_2019.csv")
temps.head()

> *Your thoughts here*

In [56]:
# Your code here.