# Programmatic Data Operations

*Author*: Zach del Rosario

The purpose of this exercise is to give you some tools to work with data *programmatically*; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, you will see that doing things programmatically is extremely powerful. 

### Learning Outcomes
By working through this notebook, you will be able to:

- Use Pandas' `DataFrame` object to represent data
- Use DataFrame operations from the package `py-grama` to operate on data
- Use basic data checks: shapes, data types, head and tail
- Perform fundamental data wrangling: type conversions, pivoting, filtering, mutating


In [1]:
import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()

# For downloading data
import os
import requests


## DataFrames

---


A `DataFrame` is a data structure provided by Pandas. In contrast with lists (which we saw in a previous exercise), DataFrames are explicitly designed for data analysis, and they provide a number of features that make data operations easier.

A `DataFrame` is a *rectangular* representation of data -- it consists of rows and columns. If the data are *tidy*, then each *row* represents an *observation*, each *column* represents a *variable*, and each *cell* represents a single measurement. 
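
As a minimal sketch of this layout, the toy DataFrame below has one row per alloy and one column per variable. The values and column names are made up for illustration; they are not taken from the MPEA dataset.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation (an alloy),
# each column is one variable measured on that alloy
df_toy = pd.DataFrame(
    {
        "formula": ["A1 B1", "A1 C1", "B1 C1"],
        "density_g_cm3": [7.2, 6.8, 8.1],
        "hardness_HV": [905.0, 622.0, 537.0],
    }
)

n_rows, n_cols = df_toy.shape  # (number of rows, number of columns)
print(df_toy)
print(n_rows, n_cols)
```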

For instance, the following code chunk downloads an alloy dataset into the DataFrame `df_mpea`; here each row is an alloy, and each column records a property of (or a reference for) that alloy.

In [2]:
# Filename for local data
filename_data = "./data/mpea.csv"

# The following code downloads the data, or (after downloaded)
# loads the data from a cached CSV on your machine
if not os.path.exists(filename_data):
    # Make request for data
    url_data = "https://docs.google.com/spreadsheets/u/1/d/1MsF4_jhWtEuZSvWfXLDHWEqLMScGCVXYWtqHW9Y7Yt0/export?format=csv"
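    r = requests.get(url_data, allow_redirects=True)
    r.raise_for_status()  # Stop here if the download failed
    # Make sure the target folder exists before writing
    os.makedirs(os.path.dirname(filename_data), exist_ok=True)
    with open(filename_data, 'wb') as f:
        f.write(r.content)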
    print("   MPEA data downloaded from public Google sheet")
else:
    # Note data already exists
    print("    MPEA data loaded locally")
    
# Read the data into memory
df_mpea = pd.read_csv(filename_data)

    MPEA data loaded locally


Let's use some basic attributes and methods of the DataFrame to get a few facts about the data:


In [3]:
# Check the shape
df_mpea.shape

(1653, 20)

We have 1653 rows (also called observations) and 20 columns (also called variables or features).


In [4]:
df_mpea.head()

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019


The `head` method shows just the top of the DataFrame; this is useful for "getting a sense" for what's in the data.

In [5]:
df_mpea.dtypes


IDENTIFIER: Reference ID                    int64
FORMULA                                    object
PROPERTY: Microstructure                   object
PROPERTY: Processing method                object
PROPERTY: grain size ($\mu$m)             float64
PROPERTY: ROM Density (g/cm$^3$)          float64
PROPERTY: Exp. Density (g/cm$^3$)         float64
PROPERTY: HV                              float64
PROPERTY: Type of test                     object
PROPERTY: Test temperature ($^\circ$C)    float64
PROPERTY: YS (MPa)                        float64
PROPERTY: UTS (MPa)                       float64
PROPERTY: Elongation (%)                  float64
PROPERTY: Elongation plastic (%)          float64
PROPERTY: Young modulus (GPa)             float64
PROPERTY: O content (wppm)                float64
PROPERTY: N content (wppm)                 object
PROPERTY: C content (wppm)                float64
REFERENCE: doi                             object
REFERENCE: year                             int64


The `dtypes` attribute gives the data type of each column. Depending on the dataset, some columns may load with unexpected data types. This can happen, for instance, if numeric values are stored as strings (e.g. `"1.23"`). You can catch this with a call such as `df_mpea.dtypes`; above, for example, `PROPERTY: N content (wppm)` loaded as `object` rather than as a numeric type.
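
To see how this looks in practice, here is a small sketch with a made-up CSV (the column names and values are hypothetical). One stray text entry in an otherwise numeric column is enough for pandas to load the whole column as a non-numeric type:

```python
import io
import pandas as pd

# Hypothetical CSV text; column `b` mixes numbers with a text entry
csv_text = "a,b\n1.5,1.23\n2.5,undetectable\n3.5,4.56\n"
df_data = pd.read_csv(io.StringIO(csv_text))

# Column `a` parses as a float; column `b` falls back to a
# non-numeric dtype (shown as `object` in the outputs of this notebook)
print(df_data.dtypes)
```

Checking `dtypes` right after loading is a cheap way to spot such columns before they cause confusing errors in later arithmetic.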

### __Q1__: Inspect a DataFrame

Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) (it might be useful to use a page search) and use some basic calls on `df_mpea` to answer the following questions:

- What are the *last* five observations in the DataFrame?
- How many rows are in `df_mpea`? How many columns?
- How would you access the column `PROPERTY: Microstructure`?

In [6]:
###
# TASK: Inspect df_mpea
# TODO: Show the last five observations of df_mpea
###

# -- WRITE YOUR CODE BELOW -----
df_mpea.tail(5)


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
1648,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,50.1,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1649,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,21.9,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1650,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,75.5,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1651,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,,C,25.0,1480.0,,50.0,,,,,,10.1080/02670836.2018.1446267,2018
1652,133,Hf1 Nb1 Ti1 Zr1,,,,,,,C,25.0,879.0,,16.5,,,,,,10.1080/02670836.2018.1446267,2018


In [7]:
###
# TASK: Inspect df_mpea
# TODO: Determine the number of rows and columns in df_mpea
###

# -- WRITE YOUR CODE BELOW -----
df_mpea.shape  # rows, columns


(1653, 20)

In [8]:
###
# TASK: Inspect df_mpea
# TODO: Grab the column `PROPERTY: Microstructure` alone
###

# -- WRITE YOUR CODE BELOW -----
# Note that this returns a Pandas Series
df_mpea["PROPERTY: Microstructure"]
# And this returns a Pandas DataFrame
df_mpea[["PROPERTY: Microstructure"]]


Unnamed: 0,PROPERTY: Microstructure
0,
1,
2,FCC
3,B2+Sec.
4,
...,...
1648,BCC
1649,BCC
1650,BCC
1651,


These manipulations are simple, but they are bread-and-butter for studying new datasets.
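
The difference between the two column-access forms used above can be sketched on a toy DataFrame (the data here are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data
df_toy = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

col_series = df_toy["x"]    # single brackets: a one-dimensional Series
col_frame = df_toy[["x"]]   # double brackets: a one-column DataFrame

print(type(col_series).__name__, col_series.shape)  # Series (3,)
print(type(col_frame).__name__, col_frame.shape)    # DataFrame (3, 1)
```

Note that a notebook cell displays only the value of its last expression, which is why the cell above shows the DataFrame form and not the Series.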

## Grama

---

The `py-grama` package builds on top of Pandas to provide a pipeline-based data (and model) infrastructure. Grama provides many of the same functions as Pandas (really, just different ways to use the same Pandas functions):


In [9]:
(
   df_mpea
   >> gr.tf_head()
)


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019


One of the advantages of using `py-grama` is that we can write *data pipelines* to organize our data operations. For instance, the following code filters the MPEA dataset to only those cases that have a valid Yield Strength (YS) and Ultimate Tensile Strength (UTS), and computes a correlation coefficient between those two quantities.


In [10]:
## NOTE: No need to edit; run and see the result
(
    df_mpea
    >> gr.tf_filter(
        gr.not_nan(DF["PROPERTY: YS (MPa)"]),
        gr.not_nan(DF["PROPERTY: UTS (MPa)"]),
    )
    >> gr.tf_summarize(
        rho_YS_UTS=gr.corr(
            DF["PROPERTY: YS (MPa)"],
            DF["PROPERTY: UTS (MPa)"],
        )
    )
)

Unnamed: 0,rho_YS_UTS
0,0.870465


As we might expect, these two properties are strongly correlated.

This code shows off a few concepts, which we'll explore below: The pipe operator `>>`, Grama verbs (such as `tf_filter`), and the data pronoun `DF`.


### The pipe operator `>>`

It's helpful to think of the pipe operator `>>` as the words "and then". That means code like this:

```
(
    df_mpea
    >> gr.tf_filter( ... )
    >> gr.tf_mutate( ... )
    >> gr.tf_pivot_longer( ... )
)
```

can be read something like an English sentence, where we are using various *verbs* to operate on the data:

```
(
    Start with df_mpea
    and then filter the data
    and then mutate the data
    and then pivot the data into a longer format
)
```

We don't yet know what these verbs do; we'll learn more in the exercises below!
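
For readers curious how an expression like `df >> verb` can work in Python at all, here is a minimal sketch. It is a toy illustration under stated assumptions, not Grama's actual implementation: the `Verb` class and the two example verbs are hypothetical names introduced here. The sketch relies on Python's reflected-operator method `__rrshift__`, which is called on the right-hand object when the left-hand object does not handle `>>`.

```python
import pandas as pd


class Verb:
    """Hypothetical stand-in for a pipeline verb (not Grama's implementation)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, df):
        # Python calls this for `df >> verb` when `df` itself
        # does not know how to handle `>>` with a Verb
        return self.func(df)


# Two illustrative verbs built from ordinary functions
keep_large = Verb(lambda df: df[df["x"] > 1])
add_double = Verb(lambda df: df.assign(x2=df["x"] * 2))

df_toy = pd.DataFrame({"x": [1, 2, 3]})

# Read as: start with df_toy, and then keep large x, and then add x2
df_out = df_toy >> keep_large >> add_double
print(df_out)
```

Because `>>` is evaluated left to right, each verb receives the result of everything before it, which matches the "and then" reading.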

### Selecting

The `tf_select` verb allows us to select one or more columns; this is helpful when we want to focus on just a handful of properties, such as the chemical formulas.


In [11]:
(
    df_mpea
    >> gr.tf_select("FORMULA")
)

Unnamed: 0,FORMULA
0,Cr1 Mo1 Nb1 Ta1 V1 W1
1,Cr1 Mo1 Nb1 Ta1 V1 W1
2,Co1 Cr1 Fe1 Ni1
3,Al1 Cr1 Fe1 Mo1 Ni1
4,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106
...,...
1648,Hf1 Nb1 Ta1 Ti1 Zr1
1649,Hf1 Nb1 Ta1 Ti1 Zr1
1650,Hf1 Nb1 Ta1 Ti1 Zr1
1651,Al1 Nb1 Ti1 V1 Zr0.5


We can select multiple columns by providing multiple arguments.


### __Q2__: Use `tf_select` to select the formula and microstructure columns only.


In [12]:
###
# TASK: Select the formula and microstructure columns only
###

# -- WRITE YOUR CODE BELOW -----
(
    df_mpea
    >> gr.tf_select("FORMULA", "PROPERTY: Microstructure")
)

Unnamed: 0,FORMULA,PROPERTY: Microstructure
0,Cr1 Mo1 Nb1 Ta1 V1 W1,
1,Cr1 Mo1 Nb1 Ta1 V1 W1,
2,Co1 Cr1 Fe1 Ni1,FCC
3,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.
4,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,
...,...,...
1648,Hf1 Nb1 Ta1 Ti1 Zr1,BCC
1649,Hf1 Nb1 Ta1 Ti1 Zr1,BCC
1650,Hf1 Nb1 Ta1 Ti1 Zr1,BCC
1651,Al1 Nb1 Ti1 V1 Zr0.5,


We can also use some *selection helpers* to make `tf_select` even more convenient. For instance, the `gr.everything()` function just selects all the columns, which at first seems silly:


In [13]:
(
    df_mpea
    >> gr.tf_select(gr.everything())
    >> gr.tf_head()
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019


However, when we use `gr.everything()` *along with* specific columns, we can *re-arrange* the columns to make quick comparisons easier. For instance, let's move the DOI column to the left. We could then easily copy the DOIs to find the original reference for each observation.


In [14]:
(
    df_mpea
    >> gr.tf_select("REFERENCE: doi", gr.everything())
    >> gr.tf_head()
)

Unnamed: 0,REFERENCE: doi,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: year
0,10.1016/j.jallcom.2018.11.318,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,2018
1,10.1016/j.jallcom.2018.11.318,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,2018
2,10.1016/j.matdes.2019.107698,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,2019
3,10.1016/j.jallcom.2013.03.253,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,2013
4,10.1016/j.actamat.2019.03.010,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,2019


There are a variety of other selection helpers, including:

- `gr.starts_with(...)` will select all columns that start with a given string
- `gr.ends_with(...)` will select all columns that end with a given string
- `gr.contains(...)` will select all columns that contain a given string
- `gr.matches(...)` will select all columns that match a given [regular expression](https://regexone.com/)
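
For comparison, similar selections can be written in plain pandas. This is only a sketch on made-up data; it does not show how the Grama helpers are implemented, and the helpers may differ in details such as their matching rules.

```python
import pandas as pd

# Hypothetical toy data with prefixed column names
df_toy = pd.DataFrame(
    {
        "FORMULA": ["A1 B1"],
        "PROPERTY: HV": [905.0],
        "PROPERTY: YS (MPa)": [1513.0],
        "REFERENCE: year": [2013],
    }
)

cols = df_toy.columns
starts = df_toy.loc[:, cols.str.startswith("PROPERTY")]  # name starts with
ends = df_toy.loc[:, cols.str.endswith("year")]          # name ends with
contains = df_toy.filter(like="YS")                      # name contains
matches = df_toy.filter(regex=r"^[A-Z]+$")               # name matches a regex

print(list(starts.columns))
print(list(matches.columns))
```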

You'll practice using selection helpers in the next task.


### __Q3__ Use a selection helper to find **all** of the columns with the string `"REFERENCE"`


In [15]:
###
# TASK: Select all columns whose names contain "REFERENCE"
###

# -- WRITE YOUR CODE BELOW -----
(
    df_mpea
    >> gr.tf_select(gr.contains("REFERENCE"))
)

Unnamed: 0,REFERENCE: doi,REFERENCE: year
0,10.1016/j.jallcom.2018.11.318,2018
1,10.1016/j.jallcom.2018.11.318,2018
2,10.1016/j.matdes.2019.107698,2019
3,10.1016/j.jallcom.2013.03.253,2013
4,10.1016/j.actamat.2019.03.010,2019
...,...,...
1648,10.1007/s11661-018-4646-8,2019
1649,10.1007/s11661-018-4646-8,2019
1650,10.1007/s11661-018-4646-8,2019
1651,10.1080/02670836.2018.1446267,2018


### Renaming

Aside from selecting columns, we can also make convenience modifications to the data. The verb `tf_rename` allows us to rename columns, usually to create a more compact, convenient name:


In [16]:
## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
    >> gr.tf_head()
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019


This is a good time to step aside from verbs to talk about the *data pronoun*.


## Interlude: Pipelines and the "Data Pronoun"

---


Imagine we wanted to search through the dataset to find only those materials with an FCC microstructure. Above, we gave the `microstructure` column a new, convenient name. We might like to use that new, convenient name when searching for FCC materials. However, we're going to run into an issue:


In [17]:
## NOTE: Try uncommenting and running the following code; it WILL break!
# (
#     df_mpea
#     >> gr.tf_rename(
#         microstructure="PROPERTY: Microstructure",
#     )
#     >> gr.tf_filter(
#         df_mpea["microstructure"] == "FCC"
#     )
# )

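This code fails because `df_mpea` still refers to the *original* DataFrame, which has no column named `microstructure`; the rename exists only inside the pipeline. If we want to refer to the data *now*, as it currently is in the pipeline, we need a name to refer to that DataFrame. This is where the *data pronoun* comes in; remember when we ran this line way up above in the setup chunk?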

```
DF = gr.Intention()
```

This assigns the data pronoun to the name `DF`. The data pronoun represents a DataFrame, so we can use things like column access `DF["column name"]`. We can use this to take advantage of the new (shorter) name we gave to the microstructure column:

In [18]:
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
    >> gr.tf_filter(
        DF["microstructure"] == "FCC"
    )
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
1,209,Al0.5 Co1 Fe1 Ni1 Ti0.5,FCC,,,,,733.0,,,,,,,,,,,10.1016/j.msea.2019.05.056,2019
2,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,7.89,615.0,C,25.0,,,,,,,,,10.3390/e21020122,2019
3,35,Al1 Cu1 Ni1 Ti1,FCC,CAST,,5.7,,537.0,C,25.0,300.0,536.0,,0.85,108.0,,,,10.1016/j.apsusc.2015.07.207,2015
4,130,C0.429 Co1 Cr1 Fe1 Ni1 W0.429,FCC,POWDER,0.0182,,8.39,531.0,,,,,,,,,,,10.3390/coatings9010016,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272,119,Co1 Cr1 Cu1 Fe1 Ni1,FCC,CAST,,,,,C,25.0,300.0,,50.2,,,,,,10.1007/s11665-018-3837-1,2019
273,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,5.26,,C,25.0,,,,,,,,,10.3390/e21020122,2019
274,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,6.26,,C,25.0,,,,,,,,,10.3390/e21020122,2019
275,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,7.84,,C,25.0,,,,,,,,,10.3390/e21020122,2019


Together, the pipe operator `>>` and the data pronoun `DF` form a powerful team that helps us do sophisticated data operations. 
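
The same idea, referring to the data as it exists at a given step and not to a fixed variable, also appears in plain pandas method chains, where a callable plays a role loosely similar to `DF`. The sketch below uses made-up data and is an analogy only; it does not describe how Grama evaluates `DF`.

```python
import pandas as pd

# Hypothetical toy data
df_toy = pd.DataFrame(
    {
        "FORMULA": ["A1 B1", "A1 C1", "B1 C1"],
        "PROPERTY: Microstructure": ["FCC", "BCC", "FCC"],
    }
)

df_fcc = (
    df_toy
    .rename(columns={"PROPERTY: Microstructure": "microstructure"})
    # The lambda receives the renamed DataFrame, so the new
    # column name is available at this point in the chain
    .loc[lambda d: d["microstructure"] == "FCC"]
)
print(df_fcc)
```

Writing `df_toy["microstructure"]` inside the filter would fail here for the same reason as in the Grama example: the original variable never had that column.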


### __Q4__ Re-write the following code using the pipe operator and data pronoun


In [19]:
###
# TASK: Eliminate the intermediate variables by using the data pronoun
###

# -- NO NEED TO EDIT; REWRITE THIS CODE -----
# Set up a simple dataset
df_initial = gr.df_make(
    A=[1, 2, 3],
    longcolumnname=[4, 5, 6],
)
print(df_initial)

df_new = (
    df_initial
    >> gr.tf_rename(B="longcolumnname")
)

(
    df_new
    >> gr.tf_filter(df_new.B == 5)
)

# -- WRITE YOUR CODE BELOW -----
(
    df_initial

    >> gr.tf_rename(B="longcolumnname")
    >> gr.tf_filter(DF.B == 5)
)

   A  longcolumnname
0  1               4
1  2               5
2  3               6


Unnamed: 0,A,B
0,2,5


## Back to Verbs

---


### Filtering

We saw `tf_filter` above; this allows us to filter a dataset to only those rows satisfying some logical criterion. This makes answering basic questions about the data very easy. For instance, we might be interested in a particular processing method; we could find only those rows matching a specified method:


In [20]:
## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_filter(DF["PROPERTY: Processing method"] == "POWDER")
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.0100,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8000,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,191,Mo1 Nb1 Ta1 W1,BCC,POWDER,,,,830.0,,,,,,,,,,,10.1016/j.msea.2019.138140,2019
3,71,Cr1 Ta1 Ti0.3 V1 W1,BCC+Laves,POWDER,,12.2,,780.0,C,25.0,2050.0,,1.5,,,,,,10.1016/j.matchemphys.2017.06.054,2017
4,110,Al1 Cr1 Cu1 Fe1 Mn1 W0.5,,POWDER,,,,780.0,C,25.0,1510.0,,,,,,,,10.1557/jmr.2019.18,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,5.26,,C,25.0,,,,,,,,,10.3390/e21020122,2019
99,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,6.26,,C,25.0,,,,,,,,,10.3390/e21020122,2019
100,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,7.84,,C,25.0,,,,,,,,,10.3390/e21020122,2019
101,124,Al1 C1 Co1 Cr1 Fe1 Ni1,,POWDER,,,,,,,,,,,,,,,10.1080/00325899.2019.1576389,2019


Notice that not all of the cells have useful values; some have `NaN` as their value (which means Not a Number). These could arise for a number of reasons; for instance, perhaps the original reference did not report that value, in which case the information may exist but is missing from the dataset.

There are some helper functions to help deal with `NaN` values in filters: `gr.not_nan(df.column)` will return `True` when its input is not `NaN`, while `gr.is_nan(df.column)` will do the reverse.
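
To illustrate why dedicated helpers are needed, the sketch below uses plain pandas on made-up data. The pandas methods `notna()` and `isna()` play a similar role to the Grama helpers; the column name `ys` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df_toy = pd.DataFrame({"ys": [300.0, np.nan, 879.0]})

# NaN never compares equal to anything, including itself,
# so an `==` test cannot be used to find missing values
print(np.nan == np.nan)  # False

mask_valid = df_toy["ys"].notna()  # True where a value is present
df_valid = df_toy[mask_valid]

print(len(df_toy), len(df_valid))  # 3 2
```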

### __Q5__ Filter the MPEA dataset to only those rows with a valid Yield Strength. Compare the original number of rows with the number of valid rows.


In [21]:
###
# TASK: Filter the data to find the non-NaN Yield Strength values
###

# -- NO NEED TO EDIT; USE FOR COMPARISON -----
print("Original shape: {}".format(df_mpea.shape))

# -- WRITE YOUR CODE BELOW -----
(
    df_mpea
    >> gr.tf_filter(gr.not_nan(DF["PROPERTY: YS (MPa)"]))
)


Original shape: (1653, 20)


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.00,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.80,,,988.3,C,25.0,3338.0,,1.90,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,181,Al0.667 Co0.833 Cr0.833 Fe0.833 Ni0.833 Ti1,,CAST,22.03,,,856.9,C,25.0,1272.0,,3.15,,,,,,10.1016/j.jallcom.2019.07.100,2019
3,19,Al1 Cr1 Fe1 Mo0.8 Ni1,B2+Sec.,CAST,,7.0,,854.0,C,25.0,1513.0,1513.0,,0.0,,,,,10.1016/j.jallcom.2013.03.253,2013
4,186,Cr1 Fe1 Mo1 Nb1 V1,,CAST,,,,826.0,C,25.0,2663.0,,2.70,,,,,,10.1007/s40195-019-00935-x,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,24.40,,,,T,25.0,1107.0,977.0,13.90,,92.6,,,,10.1007/s11661-018-4646-8,2019
1142,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,31.30,,,,T,25.0,1115.0,830.0,18.50,,92.6,,,,10.1007/s11661-018-4646-8,2019
1143,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,38.40,,,,T,25.0,1114.0,935.0,12.20,,101.3,,,,10.1007/s11661-018-4646-8,2019
1144,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,,C,25.0,1480.0,,50.00,,,,,,10.1080/02670836.2018.1446267,2018


### Mutating

The `tf_mutate` verb allows us to create / modify columns based on existing column values. For instance, we could use a mutation to convert the units in a column:


In [22]:
## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_mutate(
        E_MPa = DF["PROPERTY: Young modulus (GPa)"] * 1000
    )
    >> gr.tf_select("E_MPa", gr.everything())
)

Unnamed: 0,E_MPa,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,...,PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,...,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.80,,,988.3,C,...,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,...,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,...,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,...,,,,,,,,,10.1016/j.actamat.2019.03.010,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1648,,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,50.10,,,,,...,,,,,,,,,10.1007/s11661-018-4646-8,2019
1649,,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,21.90,,,,,...,,,,,,,,,10.1007/s11661-018-4646-8,2019
1650,,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,75.50,,,,,...,,,,,,,,,10.1007/s11661-018-4646-8,2019
1651,,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,,C,...,1480.0,,50.0,,,,,,10.1080/02670836.2018.1446267,2018


This is useful when we want to compare two quantities in the same units; the elastic (Young's) modulus and ultimate tensile strength are somewhat related properties, so we might want to compare them.


In [23]:
## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_mutate(
        E_MPa = DF["PROPERTY: Young modulus (GPa)"] * 1000
    )
    >> gr.tf_rename(
        UTS_MPa = "PROPERTY: UTS (MPa)"
    )
    >> gr.tf_filter(
        gr.not_nan(DF.UTS_MPa),
        gr.not_nan(DF.E_MPa),
    )
    >> gr.tf_select(
        "UTS_MPa",
        "E_MPa",
        gr.everything(),
    )
)

Unnamed: 0,UTS_MPa,E_MPa,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,...,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,1030.0,94000.0,66,Al1 Co0.5 Cr0.5 Fe0.5 Ni0.5 Ti0.5,,CAST,,5.6,,643.0,...,25.0,,,5.0,94.0,,,,10.1016/j.msea.2008.12.053,2008
1,2644.0,205000.0,19,Al1 Cr1 Fe1 Mo0.5 Ni1,,CAST,,6.8,,622.0,...,25.0,1749.0,,13.0,205.0,,,,10.1016/j.jallcom.2013.03.253,2013
2,2368.0,123000.0,39,Al1 Mo0.5 Nb1 Ta0.5 Ti1 Zr1,BCC+B2,OTHER,75.0,7.1,,591.0,...,25.0,2000.0,10.0,,123.0,,,,10.1007/s11837-014-1066-0,2014
3,3285.0,192000.0,8,Al1 Co1 Cr1 Fe1 Nb0.1 Ni1,BCC,CAST,,6.8,,569.0,...,25.0,1641.0,,17.0,192.0,,,,10.1016/j.msea.2011.10.110,2011
4,3222.0,197000.0,19,Al1 Cr1 Fe1 Mo0.2 Ni1,,CAST,,6.5,,549.0,...,25.0,1487.0,,29.0,197.0,,,,10.1016/j.jallcom.2013.03.253,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162,1066.0,105300.0,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,,,,,...,25.0,1094.0,6.0,,105.3,,,,10.1007/s11661-018-4646-8,2019
163,964.0,91600.0,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,,,,,...,25.0,1071.0,4.6,,91.6,,,,10.1007/s11661-018-4646-8,2019
164,977.0,92600.0,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,24.4,,,,...,25.0,1107.0,13.9,,92.6,,,,10.1007/s11661-018-4646-8,2019
165,830.0,92600.0,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,31.3,,,,...,25.0,1115.0,18.5,,92.6,,,,10.1007/s11661-018-4646-8,2019


### __Q6__ Convert the weight parts per million (wppm) of Oxygen to a (weight) percent.


In [24]:
###
# TASK: Convert wppm to a weight percentage
# NOTE: There is some scaffolding code; you need only
#       write the call to tf_mutate
###

(
    df_mpea
    >> gr.tf_filter(
        gr.not_nan(DF["PROPERTY: O content (wppm)"]),
    )
# -- WRITE YOUR CODE HERE -----

    >> gr.tf_mutate(
        O_percent=DF["PROPERTY: O content (wppm)"] * 100 / 1e6
    )
    >> gr.tf_select("O_percent", gr.everything())
    >> gr.tf_head()
)

Unnamed: 0,O_percent,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,...,PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,0.7946,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,...,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,0.7946,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,...,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,0.48,127,Nb1 Ta1 Ti1 V1,BCC,POWDER,38.0,,,510.9,C,...,1373.0,,29.5,,,4800.0,,1900.0,10.1016/j.jallcom.2018.10.230,2018
3,0.7946,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,0.54,,,1072.0,C,...,,,,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
4,0.7946,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,1.24,,,1010.0,C,...,3416.0,,5.3,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
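
The factor in this conversion follows from the definitions: 1 wppm is one part in 1e6 by weight and 1 percent is one part in 1e2, so `percent = wppm * 100 / 1e6 = wppm / 1e4`. A plain-pandas sketch of the same arithmetic, using `assign` on a toy frame, might look like this (the column name `O_wppm` is introduced here for illustration):

```python
import pandas as pd

# Toy frame; both values appear in the O content column above
df_toy = pd.DataFrame({"O_wppm": [7946.0, 4800.0]})

# wppm -> weight percent: multiply by 100 (percent), divide by 1e6 (ppm)
df_pct = df_toy.assign(O_percent=df_toy["O_wppm"] * 100 / 1e6)
print(df_pct)
```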


## Wrangling Data

---

With these basic tools (data pipelines and verbs) we have much of what we need to do *data wrangling*. *Very* frequently, data are messy and unusable as-is; we need to do some *wrangling* to get our data into shape for analysis. The last few tasks will focus on key steps in data wrangling.

### Data Converting

There's something wrong with the Nitrogen content column:


In [25]:
(
    df_mpea
    >> gr.tf_select(gr.contains("content"))
).dtypes

PROPERTY: O content (wppm)    float64
PROPERTY: N content (wppm)     object
PROPERTY: C content (wppm)    float64
dtype: object

The Oxygen and Carbon content columns are fine---they're `float64`, which is a numeric type as we'd expect. But the Nitrogen content is an `object`. Let's see what specific values this column takes:


In [26]:
set(
    df_mpea["PROPERTY: N content (wppm)"]
)

{'5', nan, 'undetectable'}

```{admonition} Use of `set`
The `set` datatype only allows one of each unique value; calling `set()` on a column is a simple way to find all the unique values in a column.
```
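
A small sketch of this idiom on made-up data, alongside the pandas methods `unique()` and `nunique()`, which serve a similar purpose:

```python
import pandas as pd

# Hypothetical column mixing numeric strings, a label, and a missing value
s = pd.Series(["5", "undetectable", None, "5"])

values_set = set(s.dropna())         # unique non-missing values, as a set
values_unique = s.dropna().unique()  # same values, as an array
n_unique = s.nunique()               # count of unique non-missing values

print(values_set, values_unique, n_unique)
```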

It seems that the original data are mixed; some values are numeric wppm values stored as strings (e.g. `'5'`), while others are the qualitative statement `"undetectable"` (and yet others are simply missing). We can use the Grama helper `gr.as_numeric()` to convert the data to a numeric type.


In [27]:
## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_rename(N_wppm="PROPERTY: N content (wppm)")
    >> gr.tf_mutate(N_converted = gr.as_numeric(DF.N_wppm))
    >> gr.tf_select("N_wppm", "N_converted")
).dtypes

N_wppm          object
N_converted    float64
dtype: object

With this type conversion, we can express all three element columns as percentages.
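
This notebook does not show what `gr.as_numeric()` does internally. As a rough pandas analog (an assumption, for illustration only), `pd.to_numeric` with `errors="coerce"` turns numeric strings into numbers and anything it cannot parse into `NaN`:

```python
import numpy as np
import pandas as pd

# Values mirroring the three kinds of entry seen in the N content column
s_raw = pd.Series(["5", np.nan, "undetectable"], dtype=object)

# errors="coerce" replaces entries that cannot be parsed with NaN
s_num = pd.to_numeric(s_raw, errors="coerce")

print(s_num.dtype)     # float64
print(s_num.tolist())  # [5.0, nan, nan]
```

Note that coercion of this kind discards the distinction between `"undetectable"` and a truly missing entry; whether that is acceptable depends on the analysis.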


### __Q7__ Fix the conversion from wppm to a weight percentage


In [28]:
###
# TASK: Fix the conversion of N wppm
# NOTE: There is some scaffolding code; you need only
#       edit one line
###

(
    df_mpea
    >> gr.tf_mutate(
        O_percent=DF["PROPERTY: O content (wppm)"] * 100 / 1e6,
        C_percent=DF["PROPERTY: C content (wppm)"] * 100 / 1e6,
        N_percent=gr.as_numeric(DF["PROPERTY: N content (wppm)"]) * 100 / 1e6,

    )
    >> gr.tf_select(gr.contains("percent"), gr.everything())
)


Unnamed: 0,O_percent,C_percent,N_percent,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),...,PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,0.7946,,,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,...,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,0.7946,,,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.80,,,...,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,,,,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,...,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,,,,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,...,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,,,,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,...,,,,,,,,,10.1016/j.actamat.2019.03.010,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1648,,,,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,50.10,,,...,,,,,,,,,10.1007/s11661-018-4646-8,2019
1649,,,,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,21.90,,,...,,,,,,,,,10.1007/s11661-018-4646-8,2019
1650,,,,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,75.50,,,...,,,,,,,,,10.1007/s11661-018-4646-8,2019
1651,,,,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,...,1480.0,,50.0,,,,,,10.1080/02670836.2018.1446267,2018
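As a quick check on the unit conversion used above: one wppm is a mass fraction of $10^{-6}$, and a percentage is a fraction multiplied by 100. The helper name `wppm_to_percent` below is hypothetical, introduced only for this illustration.

```python
def wppm_to_percent(wppm):
    """Convert weight parts-per-million to weight percent."""
    # 1 wppm is a mass fraction of 1e-6; multiply by 100 for percent
    return wppm * 100 / 1e6


# The oxygen content in the first rows of the table above is 7946 wppm
print(wppm_to_percent(7946.0))
```

The result agrees with the `O_percent` value of `0.7946` shown in the first rows of the output.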


### Pivoting Data

Another common data issue is when our data are in the wrong *shape*. To illustrate, let's look at another dataset:


In [29]:
from grama.data import df_stang_wide
df_stang_wide

Unnamed: 0,thick,E_00,mu_00,E_45,mu_45,E_90,mu_90,alloy
0,0.022,10600,0.321,10700,0.329,10500,0.31,al_24st
1,0.022,10600,0.323,10500,0.331,10700,0.323,al_24st
2,0.032,10400,0.329,10400,0.318,10300,0.322,al_24st
3,0.032,10300,0.319,10500,0.326,10400,0.33,al_24st
4,0.064,10500,0.323,10400,0.331,10400,0.327,al_24st
5,0.064,10700,0.328,10500,0.328,10500,0.32,al_24st
6,0.081,10000,0.315,10000,0.32,9900,0.314,al_24st
7,0.081,10100,0.312,9900,0.312,10000,0.316,al_24st
8,0.081,10000,0.311,-1,-1.0,9900,0.314,al_24st


These are observations on different samples of the same rolled aluminum alloy, with measurements taken at different angles relative to the direction of rolling. Note that the relative angle `0, 45, 90` is in the column names, rather than in cells. This means we would need to write special-purpose code in order to analyze these data.

Rather than re-invent our analysis code for every new dataset, we can instead *reshape* our data into a single, consistent format for a single set of analysis tools. To that end, we are going to reshape the data into a [tidy](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) format. 

Our goal will be to wrangle this messy, wide dataset into tidy, long format, shown below:


In [30]:
from grama.data import df_stang
df_stang

Unnamed: 0,thick,alloy,E,mu,ang
0,0.022,al_24st,10600,0.321,0
1,0.022,al_24st,10600,0.323,0
2,0.032,al_24st,10400,0.329,0
3,0.032,al_24st,10300,0.319,0
4,0.064,al_24st,10500,0.323,0
...,...,...,...,...,...
71,0.064,al_24st,10400,0.327,90
72,0.064,al_24st,10500,0.320,90
73,0.081,al_24st,9900,0.314,90
74,0.081,al_24st,10000,0.316,90
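One payoff of the tidy layout is that a grouped summary becomes a single, generic operation. The sketch below uses plain Pandas (rather than Grama verbs) on a small hand-built DataFrame; the `E` values are copied from the first three rows of `df_stang_wide` above, and the name `df_tidy` is hypothetical.

```python
import pandas as pd

# A few rows in tidy layout: one row per (sample, angle) measurement
df_tidy = pd.DataFrame({
    "ang": [0, 0, 0, 45, 45, 45, 90, 90, 90],
    "E": [10600, 10600, 10400, 10700, 10500, 10400, 10500, 10700, 10300],
})

# Because the angle lives in a column, grouping by it is one call
E_mean = df_tidy.groupby("ang")["E"].mean()
print(E_mean)
```

With the wide layout, the same summary would require naming `E_00`, `E_45`, and `E_90` explicitly, and the code would need to change whenever a new angle is added.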


To carry out this reshaping, we will use a set of *pivoting* tools. As a simple example, `gr.tf_pivot_longer()` takes a wide dataset and makes it longer.


In [31]:
df_tmp = (
    gr.df_make(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=[7, 8, 9],
    )
)
print(df_tmp)

(
    df_tmp
    >> gr.tf_pivot_longer(
        columns=["A", "B", "C"],
        names_to="name",
        values_to="value",
    )
)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


Unnamed: 0,name,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9
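For comparison, Pandas offers a built-in reshape with the same effect: `DataFrame.melt()`. The following sketch repeats the toy example using `melt()`; the argument names differ from Grama's (`var_name` and `value_name` in place of `names_to` and `values_to`).

```python
import pandas as pd

df_tmp = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Stack the three columns into name/value pairs
df_long = df_tmp.melt(
    value_vars=["A", "B", "C"],
    var_name="name",
    value_name="value",
)
print(df_long)
```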


### __Q8__ Pivot `df_stang_wide` longer to put all the angle values in cells

*Hint*: Make sure to add an `observation` column with the `index_to` argument.


In [32]:
###
# TASK: Pivot the data longer
###

# -- WRITE YOUR CODE HERE -----
df_q8 = (
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to="name",
        values_to="value",
        index_to="observation",
    )
)
df_q8

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,observation,thick,alloy,name,value
0,0,0.022,al_24st,E_00,10600.0
1,1,0.022,al_24st,E_00,10600.0
2,2,0.032,al_24st,E_00,10400.0
3,3,0.032,al_24st,E_00,10300.0
4,4,0.064,al_24st,E_00,10500.0
5,5,0.064,al_24st,E_00,10700.0
6,6,0.081,al_24st,E_00,10000.0
7,7,0.081,al_24st,E_00,10100.0
8,8,0.081,al_24st,E_00,10000.0
9,0,0.022,al_24st,mu_00,0.321


Execute the following to check your work.


In [33]:
try:
    assert(df_q8.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not sufficiently long; did you pivot?")
    
try:
    assert(df_q8.shape[1] == 5)
except AssertionError:
    raise AssertionError("The DataFrame should have five columns")
    
try:
    assert("observation" in df_q8.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an `'observation'` column")
    
print("Success!")

Success!
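Why does the `observation` column matter? Several samples share the same `thick` and `alloy` values, so once the data are long there is no other way to tell which rows came from the same sample. The sketch below illustrates the problem in plain Pandas on two hypothetical rows; we do not claim that Grama reports the problem in the same way.

```python
import pandas as pd

# Two samples share the same thickness, as in df_stang_wide
df_wide = pd.DataFrame({
    "thick": [0.022, 0.022],
    "E_00": [10600, 10600],
    "mu_00": [0.321, 0.323],
})

# Without an identifier, long rows cannot be traced back to a sample,
# so reshaping back to wide finds duplicate (thick, name) pairs
df_no_id = df_wide.melt(id_vars=["thick"], var_name="name", value_name="value")
try:
    df_no_id.pivot(index="thick", columns="name", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)

# With an identifier, every (observation, name) pair is unique
df_with_id = (
    df_wide
    .reset_index()
    .rename(columns={"index": "observation"})
    .melt(id_vars=["observation", "thick"], var_name="name", value_name="value")
)
df_back = df_with_id.pivot(
    index=["observation", "thick"], columns="name", values="value"
).reset_index()
print(df_back)
```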


Next, we need to separate the measurement identifiers `"E", "mu"` from the angle measurements. For that, we can use the `gr.tf_separate()` verb. This allows us to take string values and split them into separate columns, based on a separator character:


In [34]:
(
    gr.df_make(
        combined=["a-1", "b-2", "c-3"],
    )
    >> gr.tf_separate(
        column="combined",
        into=["letter", "number"],
        sep="-",
    )
)

Unnamed: 0,letter,number
0,a,1
1,b,2
2,c,3
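A rough Pandas analogue of this operation is `Series.str.split()` with `expand=True`, sketched below on the same toy data. This is for comparison only and is not necessarily how `gr.tf_separate()` is implemented.

```python
import pandas as pd

df_combined = pd.DataFrame({"combined": ["a-1", "b-2", "c-3"]})

# expand=True returns one column per piece instead of a list per row
parts = df_combined["combined"].str.split("-", expand=True)
parts.columns = ["letter", "number"]
print(parts)
```

In this Pandas version the `number` column still holds strings (`"1"`, `"2"`, `"3"`); a type conversion is needed before doing arithmetic on it.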


### __Q9__ Use `gr.tf_separate()` to separate the measurement identifiers `"E", "mu"` from the measurement angles. Make sure to call the angle column `"angle"`.


In [35]:
###
# TASK: Separate the measurement identifiers from the angles
###

df_q9 = (
    df_q8
# -- WRITE YOUR CODE HERE -----

    >> gr.tf_separate(
        column="name",
        into=["var", "angle"],
        sep="_",
    )
)
df_q9

Unnamed: 0,observation,thick,alloy,value,var,angle
0,0,0.022,al_24st,10600.0,E,0
1,1,0.022,al_24st,10600.0,E,0
2,2,0.032,al_24st,10400.0,E,0
3,3,0.032,al_24st,10300.0,E,0
4,4,0.064,al_24st,10500.0,E,0
5,5,0.064,al_24st,10700.0,E,0
6,6,0.081,al_24st,10000.0,E,0
7,7,0.081,al_24st,10100.0,E,0
8,8,0.081,al_24st,10000.0,E,0
9,0,0.022,al_24st,0.321,mu,0


Use the following code to check your work.


In [36]:
try:
    assert(df_q9.shape[0] == df_q8.shape[0])
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert(df_q9.shape[1] == 6)
except AssertionError:
    raise AssertionError("The DataFrame should have six columns")
    
try:
    assert("angle" in df_q9.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
print("Success!")

Success!


We're nearly there! Finally, we need to turn the measurement identifiers `"E", "mu"` back into column names. We can do this by pivoting wider.

### __Q10__ Pivot the data wider to turn the measurement identifiers `"E", "mu"` back into column names.

*Hint*: You should only need to set the `names_from` and `values_from` arguments with this function.


In [37]:
###
# TASK: Pivot the data wider
###

df_q10 = (
    df_q9
# -- WRITE YOUR CODE HERE -----

    >> gr.tf_pivot_wider(
        names_from="var",
        values_from="value",
    )
)
df_q10

Unnamed: 0,observation,thick,alloy,angle,E,mu
0,0,0.022,al_24st,0,10600.0,0.321
1,0,0.022,al_24st,45,10700.0,0.329
2,0,0.022,al_24st,90,10500.0,0.31
3,1,0.022,al_24st,0,10600.0,0.323
4,1,0.022,al_24st,45,10500.0,0.331
5,1,0.022,al_24st,90,10700.0,0.323
6,2,0.032,al_24st,0,10400.0,0.329
7,2,0.032,al_24st,45,10400.0,0.318
8,2,0.032,al_24st,90,10300.0,0.322
9,3,0.032,al_24st,0,10300.0,0.319


Use the following code to check your work.


In [38]:
try:
    assert(df_q10.shape[0] == 27)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert("angle" in df_q10.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
try:
    assert("E" in df_q10.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'E' column")
    
try:
    assert("mu" in df_q10.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have a 'mu' column")
    
print("Success!")

Success!
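For comparison, the corresponding plain-Pandas reshape is `DataFrame.pivot()`. The sketch below uses a hypothetical four-row long DataFrame (values taken from the first two samples at angle `00`). Unlike the Grama verb, `pivot()` needs us to state explicitly which columns identify a row.

```python
import pandas as pd

# Hypothetical long data: two samples, two measured quantities each
df_long = pd.DataFrame({
    "observation": [0, 0, 1, 1],
    "angle": ["00", "00", "00", "00"],
    "var": ["E", "mu", "E", "mu"],
    "value": [10600.0, 0.321, 10600.0, 0.323],
})

# Spread the entries of "var" into columns, filled from "value"
df_wider = (
    df_long
    .pivot(index=["observation", "angle"], columns="var", values="value")
    .reset_index()
)
df_wider.columns.name = None  # drop the leftover "var" label on the columns
print(df_wider)
```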


### Bonus: One-step pivot

As a closing, bonus demonstration, we illustrate some advanced options in `gr.tf_pivot_longer()` that allow you to reshape the data in a single call. This uses the special `".value"` entry in `names_to` to signal that the strings that would otherwise be placed in that `names_to` column should instead become column names themselves; put differently, the `".value"` keyword is like adding a pivot wider after the pivot longer.


In [39]:
## NOTE: No need to edit
(
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to=[".value", "angle"],
        names_sep="_",
    )
)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,thick,alloy,angle,E,mu
0,0.022,al_24st,0,10600.0,0.321
1,0.022,al_24st,45,10700.0,0.329
2,0.022,al_24st,90,10500.0,0.31
3,0.022,al_24st,0,10600.0,0.323
4,0.022,al_24st,45,10500.0,0.331
5,0.022,al_24st,90,10700.0,0.323
6,0.032,al_24st,0,10400.0,0.329
7,0.032,al_24st,45,10400.0,0.318
8,0.032,al_24st,90,10300.0,0.322
9,0.032,al_24st,0,10300.0,0.319
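Pandas' `melt()` has no `".value"` keyword, but `pd.wide_to_long()` performs a similar one-step reshape when the column names follow a `stub_suffix` pattern. The sketch below applies it to two rows copied from `df_stang_wide` (angles `00` and `45` only). The explicit `observation` column is added by hand here, since `wide_to_long()` requires an identifier that is unique per row.

```python
import pandas as pd

# Two rows copied from df_stang_wide, restricted to two angles
df_wide = pd.DataFrame({
    "thick": [0.022, 0.032],
    "E_00": [10600, 10400],
    "mu_00": [0.321, 0.329],
    "E_45": [10700, 10400],
    "mu_45": [0.329, 0.318],
})
df_wide["observation"] = range(len(df_wide))  # unique row identifier

# Stubs "E" and "mu" become columns; the numeric suffix becomes "angle"
df_long = pd.wide_to_long(
    df_wide,
    stubnames=["E", "mu"],
    i="observation",
    j="angle",
    sep="_",
    suffix=r"\d+",
).reset_index()
df_long["angle"] = df_long["angle"].astype(int)
print(df_long)
```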


## Endnotes

- The data portions of Grama make heavy use of ideas from the [Tidyverse](https://www.tidyverse.org/), specifically the [dplyr](https://dplyr.tidyverse.org/) package. However, those packages are for the R programming language.
