# Intro to Data Wrangling and Tidy Data
*Author*: Zach del Rosario


### Learning outcomes
By working through this notebook, you will be able to:

- State the basic ideas of tidy data


In [1]:
import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()

# For downloading data
import os
import requests


The following code downloads the same data you extracted in the previous day's Tabula exercise.


In [2]:
# Filename for local data
filename_data = "./data/tabula-weibull.csv"

# The following code downloads the data, or (after downloaded)
# loads the data from a cached CSV on your machine
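if not os.path.exists(filename_data):
    # Make sure the target folder exists before writing
    os.makedirs(os.path.dirname(filename_data), exist_ok=True)
    # Make request for data
    url_data = "https://raw.githubusercontent.com/zdelrosario/mi101/main/mi101/data/tabula-weibull1939-table4.csv"
    r = requests.get(url_data, allow_redirects=True)
    # Stop here if the download failed, rather than caching an error page
    r.raise_for_status()
    # Write the downloaded bytes, closing the file when done
    with open(filename_data, 'wb') as f:
        f.write(r.content)
    print("   Tabula-extracted data downloaded from GitHub")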
else:
    # Note data already exists
    print("   Tabula-extracted data loaded locally")
    
# Read the data into memory
df_tabula = pd.read_csv(filename_data)


   Tabula-extracted data loaded locally


These are data on the tensile strength of specimens of stearic acid and plaster-of-paris.


In [3]:
df_tabula


Unnamed: 0,No.,Area mm^2,sigma_d kg/mm^2,No..1,Area mm^2.1,sigma_d kg/mm^2.1
0,1,21.5,0.61,14.0,23.1,0.58
1,2,22.31,0.6,15.0,21.91,0.62
2,3,23.0,0.5,16.0,23.23,0.5
3,4,14.18,0.63,17.0,25.8,0.5
4,5,22.03,0.48,18.0,20.68,0.52
5,6,22.79,0.6,19.0,15.9,0.59
6,7,28.88,0.56,20.0,16.47,0.5
7,8,17.79,0.59,21.0,18.75,0.54
8,9,23.6,0.6,22.0,17.91,0.55
9,10,23.6,0.52,23.0,25.55,0.5


Note that Pandas renamed some columns to avoid giving us duplicate column names. The names are useful for holding metadata, but shorter column names are far easier to work with in a computational environment.
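To see this renaming in isolation, here is a minimal sketch that reads a tiny CSV with repeated header names. The CSV text and the name `df_demo` are made up for illustration; only the header names mirror the table above.

```python
import io
import pandas as pd

# Toy CSV text with repeated header names (illustration only)
csv_text = "No.,Area mm^2,No.,Area mm^2\n1,21.5,14,23.1\n"

# read_csv de-duplicates repeated names by appending .1, .2, ...
df_demo = pd.read_csv(io.StringIO(csv_text))
print(list(df_demo.columns))
```

The second copy of each name picks up a `.1` suffix, which is where `No..1` and `Area mm^2.1` come from.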


### __Q1__: Complete the code below to rename the columns

*Hint*: You can click-and-drag on the DataFrame printout above for a less error-prone way of giving the original column names.


In [4]:
###
# TASK: Copy the original column names into the double quotes below
#       to complete the code and rename the columns with shorter
#       names.
###

df_q1 = (
    df_tabula
    >> gr.tf_rename(

        obs_1="No.",
        area_1="Area mm^2",
        sigma_1="sigma_d kg/mm^2",
        obs_2="No..1",
        area_2="Area mm^2.1",
        sigma_2="sigma_d kg/mm^2.1",
    )
)

## NOTE: No need to edit, this will show your renamed data
df_q1

Unnamed: 0,obs_1,area_1,sigma_1,obs_2,area_2,sigma_2
0,1,21.5,0.61,14.0,23.1,0.58
1,2,22.31,0.6,15.0,21.91,0.62
2,3,23.0,0.5,16.0,23.23,0.5
3,4,14.18,0.63,17.0,25.8,0.5
4,5,22.03,0.48,18.0,20.68,0.52
5,6,22.79,0.6,19.0,15.9,0.59
6,7,28.88,0.56,20.0,16.47,0.5
7,8,17.79,0.59,21.0,18.75,0.54
8,9,23.6,0.6,22.0,17.91,0.55
9,10,23.6,0.52,23.0,25.55,0.5


Use the following to check your work:


In [5]:
## NO NEED TO EDIT; use this to check your work
assert(set(df_q1.columns) == {"obs_1", "area_1", "sigma_1", "obs_2", "area_2", "sigma_2"})
print("Success!")


Success!


Now the column names are much shorter, but we've lost the units that the original names recorded.


### __Q2__: Complete the *data dictionary* below to document the units associated with the short column names.

*Note*: Weibull in his (1939) paper reports these stress values in units `kg / mm^2`. For the data to be sensible, his `kg` must refer to a kilogram-force, sometimes denoted `kgf`. One `kgf` is the force exerted by a kilogram in standard gravity (`g = 9.8 m/s^2`). We'll convert to less strange units later!


In [6]:
df_tabula.head()


Unnamed: 0,No.,Area mm^2,sigma_d kg/mm^2,No..1,Area mm^2.1,sigma_d kg/mm^2.1
0,1,21.5,0.61,14.0,23.1,0.58
1,2,22.31,0.6,15.0,21.91,0.62
2,3,23.0,0.5,16.0,23.23,0.5
3,4,14.18,0.63,17.0,25.8,0.5
4,5,22.03,0.48,18.0,20.68,0.52



| Column | Units |
|--------|-------|
| `obs_1`   | (Unitless) |
| `area_1`  | mm^2 |
| `sigma_1` | kgf / mm^2 |
| `obs_2`   | (Unitless) |
| `area_2`  | mm^2 |
| `sigma_2` | kgf / mm^2 |
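A data dictionary can also live next to the code. The sketch below stores the same units in a plain Python dict and checks that every short column name is documented; the names `units` and `undocumented` are hypothetical, chosen only for this example.

```python
# Units for each short column name, mirroring the table above
units = {
    "obs_1": "(Unitless)",
    "area_1": "mm^2",
    "sigma_1": "kgf / mm^2",
    "obs_2": "(Unitless)",
    "area_2": "mm^2",
    "sigma_2": "kgf / mm^2",
}

# The short column names used in this notebook
columns = ["obs_1", "area_1", "sigma_1", "obs_2", "area_2", "sigma_2"]

# Any column without an entry in the dictionary is undocumented
undocumented = [c for c in columns if c not in units]
print(undocumented)
```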


## Pivoting

(TODO Describe the "blocks")

In [7]:
df_q1.head()


Unnamed: 0,obs_1,area_1,sigma_1,obs_2,area_2,sigma_2
0,1,21.5,0.61,14.0,23.1,0.58
1,2,22.31,0.6,15.0,21.91,0.62
2,3,23.0,0.5,16.0,23.23,0.5
3,4,14.18,0.63,17.0,25.8,0.5
4,5,22.03,0.48,18.0,20.68,0.52


Imagine we wanted to compute some simple statistics on these data, say the mean of the stress values. Since the data come in two blocks, we have to access them separately:

In [8]:
sigma_mean_1 = df_q1.sigma_1.mean()
sigma_mean_2 = df_q1.sigma_2.mean()

print("Mean 1: {0:4.3f}".format(sigma_mean_1))
print("Mean 2: {0:4.3f}".format(sigma_mean_2))

Mean 1: 0.569
Mean 2: 0.529


We could do something hacky to combine the two:


In [9]:
sigma_mean_both = (12 * sigma_mean_1 + 11 * sigma_mean_2) / (12 + 11)
print("Mean both: {0:4.3f}".format(sigma_mean_both))

Mean both: 0.550
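Part of what makes this hacky is that the block sizes are typed in by hand. A slightly less fragile sketch counts the non-missing entries instead; it assumes a toy two-block DataFrame (`df_toy`, with made-up contents where the second block is one entry shorter) rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy two-block data (illustration only); block 2 has one fewer entry
df_toy = pd.DataFrame({
    "sigma_1": [0.61, 0.60, 0.50],
    "sigma_2": [0.58, 0.62, np.nan],
})

# Series.count() ignores missing values, so it gives each block's size
n_1 = df_toy["sigma_1"].count()
n_2 = df_toy["sigma_2"].count()

# Weight each block mean by its size to recover the overall mean
mean_both = (
    n_1 * df_toy["sigma_1"].mean() + n_2 * df_toy["sigma_2"].mean()
) / (n_1 + n_2)
print("Mean both: {0:4.3f}".format(mean_both))
```

This still handles each block separately, so it scales poorly as the number of blocks grows.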


But it would be far easier if we could just *combine* all the relevant columns so they're not in "blocks." This is what **pivoting** a dataset allows us to do:


In [10]:
## NOTE: No need to edit; you'll learn how to do this in the evening's notebook
df_long = (
    df_q1
    >> gr.tf_pivot_longer(
        columns=["obs_1", "area_1", "sigma_1", "obs_2", "area_2", "sigma_2"],
        names_to=[".value", "block"],
        names_sep="_",
    )
    >> gr.tf_arrange(DF.obs)
)
df_long

Unnamed: 0,block,area,obs,sigma
0,1,21.5,1.0,0.61
1,1,22.31,2.0,0.6
2,1,23.0,3.0,0.5
3,1,14.18,4.0,0.63
4,1,22.03,5.0,0.48
5,1,22.79,6.0,0.6
6,1,28.88,7.0,0.56
7,1,17.79,8.0,0.59
8,1,23.6,9.0,0.6
9,1,23.6,10.0,0.52
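Once the blocks are stacked, the overall mean is a single column operation. The sketch below shows the idea in plain pandas rather than with `gr.tf_pivot_longer`; it is not how grama implements the pivot, and `df_wide` holds toy values with a deliberately missing last entry in block 2.

```python
import numpy as np
import pandas as pd

# Toy wide data (illustration only); block 2 is one row shorter
df_wide = pd.DataFrame({
    "obs_1": [1, 2, 3],
    "area_1": [21.50, 22.31, 23.00],
    "sigma_1": [0.61, 0.60, 0.50],
    "obs_2": [14, 15, np.nan],
    "area_2": [23.10, 21.91, np.nan],
    "sigma_2": [0.58, 0.62, np.nan],
})

blocks = []
for block in ["1", "2"]:
    suffix = "_" + block
    df_block = (
        # Keep only this block's columns
        df_wide[[c for c in df_wide.columns if c.endswith(suffix)]]
        # Strip the suffix so both blocks share column names
        .rename(columns=lambda c: c[: -len(suffix)])
        # Record which block each row came from
        .assign(block=block)
    )
    blocks.append(df_block)

# Stack the blocks and drop the padding rows with no observation
df_stacked = pd.concat(blocks, ignore_index=True).dropna(subset=["obs"])

# One mean over all specimens, no weights needed
sigma_mean_all = df_stacked["sigma"].mean()
print("Mean all: {0:4.3f}".format(sigma_mean_all))
```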


## Tidy Data


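Tidy data, as formulated by Hadley Wickham, follow a few basic ideas:

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

The pivoted `df_long` above is much closer to tidy than the original: every row is one specimen, and `obs`, `area`, and `sigma` each live in a single column.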


## Data Wrangling

(Part of data wrangling is screaming. Here's an example trying to make sense of the plaster data.)


Weibull reports the $\sigma_d$ values in `kg / mm^2`; if we interpret `kg` as a kilogram (mass) then these can't be stress values! However, suppose for a moment that he were using `kg` to denote a kilogram-force, where $1 \, \text{kgf} = 1 \, \text{kg} \times 9.8 \, \text{m/s}^2$.

Weibull gives a summary value for the same stress in the more interpretable units `g / (cm s^2)`. Let's check this hypothesis by comparing the proposed unit conversion with our data:


In [11]:
# (540 x 10^5 g/(cm s^2)) / (980 cm/s^2) * (kg / 1000 g) * (cm^2 / 100 mm^2)
540e5 / 980 / 1000 / 100


0.5510204081632653

This is very near the `sigma` values we have in our dataset, which lends a great deal of credibility to our interpretation of `kg` as `kgf`. With this, we can make a unit conversion to more standard units.

$$1 \, \text{kgf} / \text{mm}^2 = 9.8 \, \text{MPa}$$
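This factor follows from the definitions: one `kgf` is 9.8 N (using the rounded `g = 9.8 m/s^2` from earlier), and one `mm^2` is `1e-6 m^2`. A quick arithmetic check, with variable names chosen only for this sketch:

```python
g = 9.8                # m/s^2, standard gravity as rounded in this notebook
kgf_in_N = 1.0 * g     # force of one kilogram in standard gravity, in N
mm2_in_m2 = 1e-6       # one square millimeter, in m^2

# Stress of 1 kgf / mm^2 expressed in Pa, then in MPa
stress_Pa = kgf_in_N / mm2_in_m2
stress_MPa = stress_Pa / 1e6
print("1 kgf/mm^2 = {0:3.1f} MPa".format(stress_MPa))
```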


### __Q3__: Convert the units to MPa.


In [12]:
###
# TASK: Replace the 1.0 factor with the correct conversion factor
###


df_q3 = (
    df_long

    >> gr.tf_mutate(sigma_MPa=DF.sigma * 9.8)
)

df_q3.head()

Unnamed: 0,block,area,obs,sigma,sigma_MPa
0,1,21.5,1.0,0.61,5.978
1,1,22.31,2.0,0.6,5.88
2,1,23.0,3.0,0.5,4.9
3,1,14.18,4.0,0.63,6.174
4,1,22.03,5.0,0.48,4.704
