## Programmatic Data Operations

*Authors: Zach del Rosario*

The purpose of this exercise is to give you some tools to work with data *programmatically*; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, you will see that doing things programmatically is extremely powerful. 

### Learning Outcomes
By working through this notebook, you will be able to:

- Carry out some basic *data wrangling*
- Use DataFrame operations from the package `py-grama`


In [1]:
import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()

# For downloading data
import os
import requests



## DataFrames

---


A `DataFrame` is a data structure provided by Pandas. In contrast with `lists` (which we saw in the previous exercise), DataFrames are explicitly designed to facilitate data analysis. Accordingly, they provide a number of helpful features that aid in data analysis and operations.

A `DataFrame` is a *rectangular* representation of data -- it consists of rows and columns. Each *row* represents an *observation* -- a single instance of data. Each *column* represents a *variable* -- a particular attribute of the observation. 
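
To make the row/column picture concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `df_example`, its column names, and its values are made up for illustration; they are not part of the dataset used below.

```python
import pandas as pd

# Hypothetical alloy data: each row is one observation (an alloy),
# and each column is one variable (a property of that alloy)
df_example = pd.DataFrame({
    "alloy": ["A", "B", "C"],
    "density_g_cm3": [7.2, 5.9, 6.2],
    "hardness_HV": [905.0, 800.0, 790.0],
})

# shape reports (number of rows, number of columns)
n_rows, n_cols = df_example.shape
print(n_rows, "observations;", n_cols, "variables")
print(df_example.columns.tolist())
```

Because the data are rectangular, every observation has a slot for every variable, even if that slot is empty (missing).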

For instance, the following code chunk downloads an alloy dataset into the DataFrame `df_mpea` -- here each row is an alloy, and each column is some physical property of that alloy.

In [2]:
# Filename for local data
filename_data = "./data/mpea.csv"

# The following code downloads the data, or (after downloaded)
# loads the data from a cached CSV on your machine
if not os.path.exists(filename_data):
    # Make request for data
    url_data = "https://docs.google.com/spreadsheets/u/1/d/1MsF4_jhWtEuZSvWfXLDHWEqLMScGCVXYWtqHW9Y7Yt0/export?format=csv"
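    r = requests.get(url_data, allow_redirects=True)
    r.raise_for_status()  # Stop here if the download failed
    # Make sure the target folder exists before writing
    os.makedirs(os.path.dirname(filename_data), exist_ok=True)
    with open(filename_data, "wb") as f:
        f.write(r.content)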
    print("   MPEA data downloaded from public Google sheet")
else:
    # Note data already exists
    print("    MPEA data loaded locally")
    
# Read the data into memory
df_mpea = pd.read_csv(filename_data)

# Check basic facts
print(df_mpea.shape)
df_mpea.head()


    MPEA data loaded locally
(1653, 20)


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019


### __Q1__: Inspecting a DataFrame
Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) (it might be useful to use a page search) and use some basic calls on `df_mpea` to answer the following questions:

- What are the *last* five observations in the DataFrame?
- How many rows are in `df_mpea`? How many columns?
- How would we access the column `PROPERTY: Microstructure`?

In [3]:
###
# TASK: Inspect df_mpea
# TODO: Show the last five observations of df_mpea
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
df_mpea.tail(5)
# solution-end


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
1648,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,50.1,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1649,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,21.9,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1650,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,75.5,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1651,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,,C,25.0,1480.0,,50.0,,,,,,10.1080/02670836.2018.1446267,2018
1652,133,Hf1 Nb1 Ti1 Zr1,,,,,,,C,25.0,879.0,,16.5,,,,,,10.1080/02670836.2018.1446267,2018


In [4]:
###
# TASK: Inspect df_mpea
# TODO: Determine the number of rows and columns in df_mpea
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
df_mpea.shape  # rows, columns
# solution-end


(1653, 20)

In [27]:
###
# TASK: Inspect df_mpea
# TODO: Grab the column `PROPERTY: Microstructure` alone
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
# Note that this returns a Pandas Series
df_mpea["PROPERTY: Microstructure"]
# And this returns a Pandas DataFrame
df_mpea[["PROPERTY: Microstructure"]]
# solution-end


Unnamed: 0,PROPERTY: Microstructure
0,
1,
2,FCC
3,B2+Sec.
4,
...,...
1648,BCC
1649,BCC
1650,BCC
1651,


These manipulations are simple, but they are bread-and-butter for studying new datasets.
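
The Series-versus-DataFrame note in the last task is easy to check directly. The sketch below uses a small stand-in DataFrame (`df_small`, a made-up name with two rows copied from the table above) rather than the full dataset.

```python
import pandas as pd

# Small stand-in for df_mpea, for illustration only
df_small = pd.DataFrame({
    "FORMULA": ["Co1 Cr1 Fe1 Ni1", "Al1 Cr1 Fe1 Mo1 Ni1"],
    "PROPERTY: Microstructure": ["FCC", "B2+Sec."],
})

col_series = df_small["PROPERTY: Microstructure"]   # single brackets
col_frame = df_small[["PROPERTY: Microstructure"]]  # double brackets

print(type(col_series).__name__)
print(type(col_frame).__name__)
print(col_series.shape, col_frame.shape)
```

Single brackets give back a one-dimensional object, while double brackets (a list of column names) give back a two-dimensional table that happens to have one column.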

## Grama

---

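The `py-grama` package builds on top of Pandas to provide a pipeline-based data (and model) infrastructure.

Grama provides a set of functions for transforming DataFrames (prefixed with `tf_`), which we chain together with the pipe operator `>>`. For example, the following is the Grama way to show the first few rows of `df_mpea`: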


In [5]:
(
   df_mpea
   >> gr.tf_head()
)


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.8,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019


It's helpful to think of the `>>` symbol as meaning "and then". That means code like this:

```
(
    df_mpea
    >> gr.tf_filter( ... )
    >> gr.tf_mutate( ... )
    >> gr.tf_pivot_longer( ... )
)
```

Can be read something like an English sentence, where we are using various *verbs* to operate on the data:

```
(
    Start with df_mpea
    and then filter the data
    and then mutate the data
    and then pivot the data into a longer format
)
```

We don't yet know what these verbs do; we'll learn more in the exercises below!
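
For comparison, plain pandas can express the same "and then" idea with its `.pipe()` method. The sketch below is not Grama code; the DataFrame `df_demo` and the helper functions `keep_large_x` and `add_ratio` are made up for illustration.

```python
import pandas as pd

# Made-up data, for illustration only
df_demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

def keep_large_x(df):
    """Keep rows where x is greater than 1."""
    return df[df["x"] > 1]

def add_ratio(df):
    """Add a column holding y divided by x."""
    return df.assign(ratio=df["y"] / df["x"])

df_result = (
    df_demo               # Start with df_demo
    .pipe(keep_large_x)   # and then keep the rows with large x
    .pipe(add_ratio)      # and then add a new column
    .head(2)              # and then take the first two rows
)
print(df_result)
```

Each step receives the output of the step before it, which is the same way the `>>` operator hands data from one verb to the next.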


### Selecting

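The verb `gr.tf_select()` keeps only the columns we ask for and drops the rest. For example, the following selects the `FORMULA` column alone: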


In [6]:
(
    df_mpea
    >> gr.tf_select("FORMULA")
)

Unnamed: 0,FORMULA
0,Cr1 Mo1 Nb1 Ta1 V1 W1
1,Cr1 Mo1 Nb1 Ta1 V1 W1
2,Co1 Cr1 Fe1 Ni1
3,Al1 Cr1 Fe1 Mo1 Ni1
4,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106
...,...
1648,Hf1 Nb1 Ta1 Ti1 Zr1
1649,Hf1 Nb1 Ta1 Ti1 Zr1
1650,Hf1 Nb1 Ta1 Ti1 Zr1
1651,Al1 Nb1 Ti1 V1 Zr0.5


### __qX__


In [7]:
(
    df_mpea
    >> gr.tf_select("FORMULA", "PROPERTY: Microstructure")
)

Unnamed: 0,FORMULA,PROPERTY: Microstructure
0,Cr1 Mo1 Nb1 Ta1 V1 W1,
1,Cr1 Mo1 Nb1 Ta1 V1 W1,
2,Co1 Cr1 Fe1 Ni1,FCC
3,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.
4,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,
...,...,...
1648,Hf1 Nb1 Ta1 Ti1 Zr1,BCC
1649,Hf1 Nb1 Ta1 Ti1 Zr1,BCC
1650,Hf1 Nb1 Ta1 Ti1 Zr1,BCC
1651,Al1 Nb1 Ti1 V1 Zr0.5,


### __qX__


In [8]:
(
    df_mpea
    >> gr.tf_select(gr.contains("REFERENCE"))
)

Unnamed: 0,REFERENCE: doi,REFERENCE: year
0,10.1016/j.jallcom.2018.11.318,2018
1,10.1016/j.jallcom.2018.11.318,2018
2,10.1016/j.matdes.2019.107698,2019
3,10.1016/j.jallcom.2013.03.253,2013
4,10.1016/j.actamat.2019.03.010,2019
...,...,...
1648,10.1007/s11661-018-4646-8,2019
1649,10.1007/s11661-018-4646-8,2019
1650,10.1007/s11661-018-4646-8,2019
1651,10.1080/02670836.2018.1446267,2018
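
As a point of comparison, here is a rough plain-pandas sketch of the same two kinds of selection: by explicit names, and by a substring match on the column names. It uses `DataFrame.filter(like=...)`, which is a pandas method and not what Grama calls internally (as far as this notebook shows); `df_demo` is a made-up stand-in for `df_mpea`.

```python
import pandas as pd

# Made-up stand-in with a few of the real column names
df_demo = pd.DataFrame({
    "FORMULA": ["Co1 Cr1 Fe1 Ni1"],
    "PROPERTY: Microstructure": ["FCC"],
    "REFERENCE: doi": ["10.1016/j.matdes.2019.107698"],
    "REFERENCE: year": [2019],
})

# Select by explicit names: pass a list of column names
df_named = df_demo[["FORMULA", "PROPERTY: Microstructure"]]

# Select by substring: keep every column whose name contains "REFERENCE"
df_refs = df_demo.filter(like="REFERENCE")

print(df_named.columns.tolist())
print(df_refs.columns.tolist())
```

Selecting by pattern is handy when related columns share a prefix, as the `PROPERTY:` and `REFERENCE:` columns do here.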


### Renaming


In [25]:
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.0,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.80,,,988.3,C,25.0,3338.0,,1.9,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
3,19,Al1 Cr1 Fe1 Mo1 Ni1,B2+Sec.,CAST,,7.2,,905.0,,25.0,,,,,,,,,10.1016/j.jallcom.2013.03.253,2013
4,155,Al1 Co0.426 Cr0.383 Cu0.106 Fe0.106 Ni0.106,,,,,,883.0,,,,,,,,,,,10.1016/j.actamat.2019.03.010,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1648,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,50.10,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1649,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,21.90,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1650,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,75.50,,,,,,,,,,,,,,10.1007/s11661-018-4646-8,2019
1651,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,,C,25.0,1480.0,,50.0,,,,,,10.1080/02670836.2018.1446267,2018


## Interlude: Pipelines and the "Data Pronoun"

---


(Illustrate the use of the data pronoun)

Imagine we wanted to search through the dataset to find only those materials with an FCC microstructure. Above, we gave the `PROPERTY: Microstructure` column a new, more convenient name: `microstructure`. We might like to use that new name when searching for FCC materials. However, we're going to run into an issue:


In [29]:
## NOTE: Try uncommenting and running the following code; it WILL break!
# (
#     df_mpea
#     >> gr.tf_rename(
#         microstructure="PROPERTY: Microstructure",
#     )
#     >> gr.tf_filter(
#         df_mpea["microstructure"] == "FCC"
#     )
# )



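The code above breaks because `df_mpea` still refers to the *original* DataFrame, which has no column named `microstructure`; the rename exists only inside the pipeline. If we want to refer to the data *now* (as it currently is in the pipeline), we need a name for that DataFrame. This is where the *data pronoun* comes in; remember when we ran this line way up above in the setup chunk?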

```
DF = gr.Intention()
```

This assigns the data pronoun to the name `DF`. We can use this to take advantage of the new (shorter) name we gave to the microstructure column:

In [29]:
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
    >> gr.tf_filter(
        DF["microstructure"] == "FCC"
    )
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,137,Co1 Cr1 Fe1 Ni1,FCC,CAST,,,,94.7,,,,,,,,,,,10.1016/j.matdes.2019.107698,2019
1,209,Al0.5 Co1 Fe1 Ni1 Ti0.5,FCC,,,,,733.0,,,,,,,,,,,10.1016/j.msea.2019.05.056,2019
2,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,7.89,615.0,C,25.0,,,,,,,,,10.3390/e21020122,2019
3,35,Al1 Cu1 Ni1 Ti1,FCC,CAST,,5.7,,537.0,C,25.0,300.0,536.0,,0.85,108.0,,,,10.1016/j.apsusc.2015.07.207,2015
4,130,C0.429 Co1 Cr1 Fe1 Ni1 W0.429,FCC,POWDER,0.0182,,8.39,531.0,,,,,,,,,,,10.3390/coatings9010016,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272,119,Co1 Cr1 Cu1 Fe1 Ni1,FCC,CAST,,,,,C,25.0,300.0,,50.2,,,,,,10.1007/s11665-018-3837-1,2019
273,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,5.26,,C,25.0,,,,,,,,,10.3390/e21020122,2019
274,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,6.26,,C,25.0,,,,,,,,,10.3390/e21020122,2019
275,120,Co1 Cr1 Cu1 Ni1 Zn1,FCC,POWDER,500.0000,,7.84,,C,25.0,,,,,,,,,10.3390/e21020122,2019


Together, the pipe operator `>>` and the data pronoun `DF` form a powerful team that helps us do sophisticated data operations. 
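
The same problem, and a similar fix, show up in plain pandas. In the sketch below (not Grama code; `df_raw` is a made-up two-row stand-in for `df_mpea`), a `lambda` plays a role loosely analogous to the data pronoun: it receives the DataFrame as it exists at that point in the chain.

```python
import pandas as pd

# Made-up stand-in for df_mpea
df_raw = pd.DataFrame({
    "FORMULA": ["Co1 Cr1 Fe1 Ni1", "Hf1 Nb1 Ta1 Ti1 Zr1"],
    "PROPERTY: Microstructure": ["FCC", "BCC"],
})

# Referring to the original DataFrame fails: it never had the new column name
try:
    df_raw["microstructure"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("Lookup on original failed:", lookup_failed)

# A callable is handed the renamed DataFrame, so the new name is available
df_fcc = (
    df_raw
    .rename(columns={"PROPERTY: Microstructure": "microstructure"})
    .loc[lambda d: d["microstructure"] == "FCC"]
)
print(df_fcc)
```

In both libraries the key idea is the same: inside a pipeline, refer to the data flowing through the pipeline, not to the variable the pipeline started from.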


### __qX__ 


In [None]:
## TODO: Eliminate the intermediate variables by using the data pronoun


## Back to Verbs

---


### Filtering

The verb `gr.tf_filter()` keeps only the rows (observations) that satisfy a logical condition.


### __qX__


In [10]:
print("Original shape: {}".format(df_mpea.shape))

(
    df_mpea
    >> gr.tf_filter(gr.not_nan(DF["PROPERTY: YS (MPa)"]))
)



Original shape: (1653, 20)


Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),PROPERTY: YS (MPa),PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year
0,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,5.01,,,991.3,C,25.0,3410.0,,2.00,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
1,146,Cr1 Mo1 Nb1 Ta1 V1 W1,,POWDER,10.80,,,988.3,C,25.0,3338.0,,1.90,,,7946.0,,,10.1016/j.jallcom.2018.11.318,2018
2,181,Al0.667 Co0.833 Cr0.833 Fe0.833 Ni0.833 Ti1,,CAST,22.03,,,856.9,C,25.0,1272.0,,3.15,,,,,,10.1016/j.jallcom.2019.07.100,2019
3,19,Al1 Cr1 Fe1 Mo0.8 Ni1,B2+Sec.,CAST,,7.0,,854.0,C,25.0,1513.0,1513.0,,0.0,,,,,10.1016/j.jallcom.2013.03.253,2013
4,186,Cr1 Fe1 Mo1 Nb1 V1,,CAST,,,,826.0,C,25.0,2663.0,,2.70,,,,,,10.1007/s40195-019-00935-x,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,24.40,,,,T,25.0,1107.0,977.0,13.90,,92.6,,,,10.1007/s11661-018-4646-8,2019
1142,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,31.30,,,,T,25.0,1115.0,830.0,18.50,,92.6,,,,10.1007/s11661-018-4646-8,2019
1143,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,38.40,,,,T,25.0,1114.0,935.0,12.20,,101.3,,,,10.1007/s11661-018-4646-8,2019
1144,133,Al1 Nb1 Ti1 V1 Zr0.5,,,,,5.64,,C,25.0,1480.0,,50.00,,,,,,10.1080/02670836.2018.1446267,2018


### Mutating

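The verb `gr.tf_mutate()` adds new columns computed from existing ones. For example, the following converts the Young's modulus from GPa to MPa, then drops the rows where that value is missing: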


In [11]:
(
    df_mpea
    >> gr.tf_mutate(
        E_MPa = DF["PROPERTY: Young modulus (GPa)"] * 1000
    )
    >> gr.tf_filter(gr.not_nan(DF.E_MPa))
)

Unnamed: 0,IDENTIFIER: Reference ID,FORMULA,PROPERTY: Microstructure,PROPERTY: Processing method,PROPERTY: grain size ($\mu$m),PROPERTY: ROM Density (g/cm$^3$),PROPERTY: Exp. Density (g/cm$^3$),PROPERTY: HV,PROPERTY: Type of test,PROPERTY: Test temperature ($^\circ$C),...,PROPERTY: UTS (MPa),PROPERTY: Elongation (%),PROPERTY: Elongation plastic (%),PROPERTY: Young modulus (GPa),PROPERTY: O content (wppm),PROPERTY: N content (wppm),PROPERTY: C content (wppm),REFERENCE: doi,REFERENCE: year,E_MPa
0,15,Co1 Cr1 Fe1 Mo1 Ni1 Ti1 V1 Zr1,,CAST,,7.3,,850.0,,25.0,...,,,,193.0,,,,10.1002/adem.200300567,2004,193000.0
1,15,Al1 Fe1 Ni1 Ti1 V1 Zr1,BCC,CAST,,5.9,,800.0,,25.0,...,,,,132.0,,,,10.1002/adem.200300567,2004,132000.0
2,15,Al1 Co1 Fe1 Ni1 Ti1 V1 Zr1,BCC,CAST,,6.2,,790.0,,25.0,...,,,,143.0,,,,10.1002/adem.200300567,2004,143000.0
3,26,Al1 Co1 Cu1 Fe1 Ni1 Si1,FCC+BCC,CAST,,5.9,,682.0,,25.0,...,,,,145.0,,,,10.1016/j.msea.2012.07.003,2012,145000.0
4,15,Co1 Cr1 Cu1 Fe1 Ni1 Ti1 V1 Zr1,,CAST,,7.1,,680.0,,25.0,...,,,,168.0,,,,10.1002/adem.200300567,2004,168000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,,,,,T,25.0,...,1066.0,6.0,,105.3,,,,10.1007/s11661-018-4646-8,2019,105300.0
386,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,,,,,T,25.0,...,964.0,4.6,,91.6,,,,10.1007/s11661-018-4646-8,2019,91600.0
387,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,24.4,,,,T,25.0,...,977.0,13.9,,92.6,,,,10.1007/s11661-018-4646-8,2019,92600.0
388,132,Hf1 Nb1 Ta1 Ti1 Zr1,BCC,OTHER,31.3,,,,T,25.0,...,830.0,18.5,,92.6,,,,10.1007/s11661-018-4646-8,2019,92600.0
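
For reference, a rough plain-pandas counterpart of this mutate-then-filter step might look like the sketch below. It uses `assign()` and `notna()` from pandas, not Grama; `df_demo` and its three values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in: one modulus value is missing
df_demo = pd.DataFrame({
    "PROPERTY: Young modulus (GPa)": [193.0, np.nan, 132.0],
})

df_mpa = (
    df_demo
    # Add a new column: 1 GPa = 1000 MPa
    .assign(E_MPa=lambda d: d["PROPERTY: Young modulus (GPa)"] * 1000)
    # Keep only the rows where the new column is not missing
    .loc[lambda d: d["E_MPa"].notna()]
)
print(df_mpa)
```

Note that arithmetic on a missing value stays missing, which is why the filter on the new column removes the same rows a filter on the original column would.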


## Pivoting Data

---

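Pivoting reshapes a dataset between *wide* and *long* formats without changing the values it holds. For example, the following dataset is in a wide format: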

In [12]:
from grama.data import df_stang_wide
df_stang_wide

Unnamed: 0,thick,E_00,mu_00,E_45,mu_45,E_90,mu_90,alloy
0,0.022,10600,0.321,10700,0.329,10500,0.31,al_24st
1,0.022,10600,0.323,10500,0.331,10700,0.323,al_24st
2,0.032,10400,0.329,10400,0.318,10300,0.322,al_24st
3,0.032,10300,0.319,10500,0.326,10400,0.33,al_24st
4,0.064,10500,0.323,10400,0.331,10400,0.327,al_24st
5,0.064,10700,0.328,10500,0.328,10500,0.32,al_24st
6,0.081,10000,0.315,10000,0.32,9900,0.314,al_24st
7,0.081,10100,0.312,9900,0.312,10000,0.316,al_24st
8,0.081,10000,0.311,-1,-1.0,9900,0.314,al_24st


Our goal will be to wrangle this messy, wide dataset into a tidy, long format, like the following:

In [13]:
from grama.data import df_stang
df_stang

Unnamed: 0,thick,alloy,E,mu,ang
0,0.022,al_24st,10600,0.321,0
1,0.022,al_24st,10600,0.323,0
2,0.032,al_24st,10400,0.329,0
3,0.032,al_24st,10300,0.319,0
4,0.064,al_24st,10500,0.323,0
...,...,...,...,...,...
71,0.064,al_24st,10400,0.327,90
72,0.064,al_24st,10500,0.320,90
73,0.081,al_24st,9900,0.314,90
74,0.081,al_24st,10000,0.316,90


(What does pivoting look like? Here's an example.)


In [14]:
df_tmp = (
    gr.df_make(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=[7, 8, 9],
    )
)
print(df_tmp)

(
    df_tmp
    >> gr.tf_pivot_longer(
        columns=["A", "B", "C"],
        names_to="name",
        values_to="value",
    )
)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


Unnamed: 0,name,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9
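
If you have used pandas before, it may help to know that pandas has its own method for this kind of reshaping, `melt()`. The sketch below reproduces the toy example above in plain pandas; it is offered as a comparison, not as a description of how Grama works internally.

```python
import pandas as pd

df_wide = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Stack columns A, B, C into two columns: one for the old column
# name and one for the value that was stored there
df_long = df_wide.melt(
    value_vars=["A", "B", "C"],
    var_name="name",
    value_name="value",
)
print(df_long)
```

The wide table had 3 rows and 3 columns; the long table has 9 rows, one per original cell.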


### __QX__ 

(Make sure to add an `observation` column with the `index_to` argument.)


In [15]:
# solution-begin
df_qX = (
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to="name",
        values_to="value",
        index_to="observation",
    )
)
# solution-end
df_qX

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,observation,thick,alloy,name,value
0,0,0.022,al_24st,E_00,10600.0
1,1,0.022,al_24st,E_00,10600.0
2,2,0.032,al_24st,E_00,10400.0
3,3,0.032,al_24st,E_00,10300.0
4,4,0.064,al_24st,E_00,10500.0
5,5,0.064,al_24st,E_00,10700.0
6,6,0.081,al_24st,E_00,10000.0
7,7,0.081,al_24st,E_00,10100.0
8,8,0.081,al_24st,E_00,10000.0
9,0,0.022,al_24st,mu_00,0.321


Execute the following to check your work.


In [16]:
try:
    assert(df_qX.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not sufficiently long; did you pivot?")
    
try:
    assert(df_qX.shape[1] == 5)
except AssertionError:
    raise AssertionError("The DataFrame should have five columns")
    
try:
    assert("observation" in df_qX.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'observation' column")
    
print("Success!")

Success!


### __QY__


In [17]:
df_qY = (
    df_qX
# solution-begin
    >> gr.tf_separate(
        column="name",
        into=["var", "angle"],
        sep="_",
    )
# solution-end
)
df_qY

Unnamed: 0,observation,thick,alloy,value,var,angle
0,0,0.022,al_24st,10600.0,E,0
1,1,0.022,al_24st,10600.0,E,0
2,2,0.032,al_24st,10400.0,E,0
3,3,0.032,al_24st,10300.0,E,0
4,4,0.064,al_24st,10500.0,E,0
5,5,0.064,al_24st,10700.0,E,0
6,6,0.081,al_24st,10000.0,E,0
7,7,0.081,al_24st,10100.0,E,0
8,8,0.081,al_24st,10000.0,E,0
9,0,0.022,al_24st,0.321,mu,0


In [18]:
try:
    assert(df_qY.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert(df_qY.shape[1] == 6)
except AssertionError:
    raise AssertionError("The DataFrame should have six columns")
    
try:
    assert("angle" in df_qY.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
print("Success!")

Success!


### __QZ__

*Hint*: You should only need to set the `names_from` and `values_from` arguments with this function.


In [19]:
df_qZ = (
    df_qY
# solution-begin
    >> gr.tf_pivot_wider(
        names_from="var",
        values_from="value",
    )
# solution-end
)
df_qZ

Unnamed: 0,observation,thick,alloy,angle,E,mu
0,0,0.022,al_24st,0,10600.0,0.321
1,0,0.022,al_24st,45,10700.0,0.329
2,0,0.022,al_24st,90,10500.0,0.31
3,1,0.022,al_24st,0,10600.0,0.323
4,1,0.022,al_24st,45,10500.0,0.331
5,1,0.022,al_24st,90,10700.0,0.323
6,2,0.032,al_24st,0,10400.0,0.329
7,2,0.032,al_24st,45,10400.0,0.318
8,2,0.032,al_24st,90,10300.0,0.322
9,3,0.032,al_24st,0,10300.0,0.319


In [20]:
try:
    assert(df_qZ.shape[0] == 27)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert("angle" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
try:
    assert("E" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'E' column")
    
try:
    assert("mu" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have a 'mu' column")
    
print("Success!")

Success!
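
Pivoting wider is the reverse of pivoting longer: the entries of one column become new column names. As a comparison, here is a plain-pandas sketch using `DataFrame.pivot()` on a made-up four-row slice of the long data (`df_long_demo` is a hypothetical name; the values come from the first observation shown above).

```python
import pandas as pd

# Made-up slice of the long data: one observation at two angles
df_long_demo = pd.DataFrame({
    "observation": [0, 0, 0, 0],
    "angle": ["00", "00", "45", "45"],
    "var": ["E", "mu", "E", "mu"],
    "value": [10600.0, 0.321, 10700.0, 0.329],
})

df_wider = (
    df_long_demo
    # Rows are identified by (observation, angle); entries of `var`
    # become the new column names, filled from `value`
    .pivot(index=["observation", "angle"], columns="var", values="value")
    .reset_index()
)
df_wider.columns.name = None  # Drop the leftover "var" label on the columns
print(df_wider)
```

Four long rows collapse into two wide rows, because each (observation, angle) pair had one `E` and one `mu` value.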


### Bonus: One-step pivot

In [21]:
(
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to=[".value", "angle"],
        names_sep="_",
        values_to="value",
    )
)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,thick,alloy,angle,E,mu
0,0.022,al_24st,0,10600.0,0.321
1,0.022,al_24st,45,10700.0,0.329
2,0.022,al_24st,90,10500.0,0.31
3,0.022,al_24st,0,10600.0,0.323
4,0.022,al_24st,45,10500.0,0.331
5,0.022,al_24st,90,10700.0,0.323
6,0.032,al_24st,0,10400.0,0.329
7,0.032,al_24st,45,10400.0,0.318
8,0.032,al_24st,90,10300.0,0.322
9,0.032,al_24st,0,10300.0,0.319


## Wrangling Data
[Hadley Wickham](http://hadley.nz/) -- author of the `tidyverse` and data science superstar -- notes that "wrangling data is 80% boredom and 20% screaming". To give you a sense of why this stuff is hard (but hopefully avoid the screaming), I'm leaving one of the wrangling steps in the workflow here:

It's not obvious from the exercises above, but *there's an issue with these data*.

In [22]:
df_data.dtypes


NameError: name 'df_data' is not defined

All of the entries are objects, not numbers! We'll need to convert these to numeric values. The following slightly-mysterious call will cast every column of `df_data` to a numeric type and modify the DataFrame.
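
To see what this call does in isolation, here is a minimal sketch on a made-up DataFrame whose numbers are stored as text. The name `df_text` and its values are hypothetical; only the column names are borrowed from the exercises below.

```python
import pandas as pd

# Made-up data: the values look like numbers but are stored as strings
df_text = pd.DataFrame({
    "Diffusion time": ["0", "70.2", "0"],
    "Fatigue Strength": ["232", "235", "640"],
})
print(df_text.dtypes)

# apply() runs pd.to_numeric on each column in turn and returns a new
# DataFrame; the original is only replaced if we reassign the result
df_numeric = df_text.apply(pd.to_numeric)
print(df_numeric.dtypes)

# Arithmetic now behaves as expected
print(df_numeric["Fatigue Strength"].sum())
```

Note that `pd.to_numeric` raises an error by default if a column contains text that cannot be parsed as a number, so this call is only appropriate when every column is meant to be numeric.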

In [None]:
df_data = df_data.apply(pd.to_numeric)


Let's check the data types again:

In [None]:
df_data.dtypes


These are numbers we can work with!

## Basic DataFrame Operations

With the numerical issues above sorted out, we can carry out *quantitative* operations on the dataframe. One useful thing we can do is compute a set of *summaries* on the data using `describe()`.

In [None]:
df_data.describe()


These summaries include things like the `mean` and standard deviation (`std`), as well as quartiles of the data. These give us a sense of *typical* values; for instance, we can see that a large fraction of observations have a "Diffusion time" of zero, but at least one observation has a value `> 70`.
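
The sketch below shows how to read that kind of pattern off a `describe()` table, using a made-up column (`df_demo`, with hypothetical values) that is mostly zero with one large entry.

```python
import pandas as pd

# Made-up data: mostly zeros, with one large value
df_demo = pd.DataFrame({"Diffusion time": [0.0, 0.0, 0.0, 72.0]})

summary = df_demo.describe()
print(summary)

# Individual summaries can be pulled out by row label
median = summary.loc["50%", "Diffusion time"]
largest = summary.loc["max", "Diffusion time"]
print("median:", median, "max:", largest)
```

A median of zero tells us at least half the observations are zero, while the `max` row reveals the large value that the mean alone would blur.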

### Special indexing
One of the most powerful features of pandas is the ability to do *logical indexing*; we may provide an array of `True` or `False` values to select only those rows with `True` values. For instance, we could do the following to select the third row.

In [None]:
idx_boolean = [False] * df_data.shape[0]  # Mostly-false array
idx_boolean[2] = True  # Make the third entry True
df_data[idx_boolean]


This kind of *logical indexing* becomes helpful when we chain it with the conditionals we learned in the previous exercise. We can apply a condition *to one of the columns* to effectively "filter" for observations that meet that condition. For instance, the following will filter for nonzero "Carburization Time".

In [None]:
df_data[df_data["Carburization Time"] > 0].head()
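
It can help to look at the intermediate True/False values on their own. The sketch below splits the one-liner above into two steps on a made-up DataFrame (`df_demo`, with hypothetical values).

```python
import pandas as pd

# Made-up data, for illustration only
df_demo = pd.DataFrame({
    "Carburization Time": [0.0, 3.5, 0.0, 8.0],
    "Fatigue Strength": [232, 640, 235, 700],
})

# Step 1: the comparison produces one True/False value per row
mask = df_demo["Carburization Time"] > 0
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
df_nonzero = df_demo[mask]
print(df_nonzero)
```

The filtered rows keep their original index labels, which makes it easy to trace them back to the full dataset.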


### Q5: Basic data operations
Once more, use the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) to learn how to do the following tasks:

- Select only those rows for which "Diffusion time" is greater than 70
- Sort df_data in descending order by "Fatigue Strength" and return the top 10
- Take the average of "Normalizing Temperature" and "Tempering Temperature" and add the column "avg_tmp" (You may need to Google how to do this one!)

In [None]:
###
# TASK: Basic data operations
# TODO: Select rows for which "Diffusion time" > 70
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
df_data[df_data["Diffusion time"] > 70]
# solution-end


In [None]:
###
# TASK: Basic data operations
# TODO: Sort by "Fatigue Strength" in descending order, take the top-10
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
df_data.sort_values(by="Fatigue Strength", ascending=False).head(10)
# solution-end


In [None]:
###
# TASK: Basic data operations
# TODO: Average "Normalizing Temperature" and "Tempering Temperature" into the column "avg_tmp", return the head
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
df_data.assign(
    avg_tmp=0.5 * (df_data["Normalizing Temperature"] + df_data["Tempering Temperature"])).head()
# solution-end
