## Programmatic Data Operations

*Authors: Zach del Rosario*

The purpose of this exercise is to give you some tools to work with data *programmatically*; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, you will see that doing things programmatically is extremely powerful. 

### Learning Outcomes
By working through this notebook, you will be able to:

- Carry out some basics of *data wrangling*
- Use DataFrame operations from the package `py-grama`


In [None]:
import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()

# For downloading data
import os
import requests



## DataFrames

---


A `DataFrame` is a data structure provided by Pandas. In contrast with `lists` (which we saw in the previous exercise), DataFrames are explicitly designed to facilitate data analysis. Accordingly, they provide a number of helpful features that aid in data analysis and operations.

A `DataFrame` is a *rectangular* representation of data -- it consists of rows and columns. Each *row* represents an *observation* -- a single instance of data. Each *column* represents a *variable* -- a particular attribute of the observation. 
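
As a minimal sketch of this row/column structure, the toy DataFrame below uses made-up names and values (`df_toy`, `alloy`, `density`, and `hardness` are illustrative assumptions, not part of any real dataset):

```python
import pandas as pd

# Made-up toy data: each row is one observation, each column is one variable
df_toy = pd.DataFrame({
    "alloy": ["A", "B", "C"],
    "density": [7.9, 4.5, 2.7],
    "hardness": [150, 300, 90],
})

n_rows, n_cols = df_toy.shape   # (number of observations, number of variables)
first_obs = df_toy.iloc[0]      # one row: a single observation
densities = df_toy["density"]   # one column: a single variable

print(n_rows, n_cols)
print(first_obs)
print(densities)
```

Here `iloc[0]` pulls out a single observation (a row), while `df_toy["density"]` pulls out a single variable (a column).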

For instance, the following code chunk downloads an alloy dataset into the DataFrame `df_mpea` -- here each row is an alloy, and each column is some physical property of that alloy.

In [None]:
# Filename for local data
filename_data = "./data/mpea.csv"

# The following code downloads the data, or (after downloaded)
# loads the data from a cached CSV on your machine
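if not os.path.exists(filename_data):
    # Make sure the target directory exists before writing
    os.makedirs(os.path.dirname(filename_data), exist_ok=True)
    # Make request for data
    url_data = "https://docs.google.com/spreadsheets/u/1/d/1MsF4_jhWtEuZSvWfXLDHWEqLMScGCVXYWtqHW9Y7Yt0/export?format=csv"
    r = requests.get(url_data, allow_redirects=True)
    r.raise_for_status()  # Stop early if the download failed
    # Write the downloaded bytes; the `with` block closes the file
    with open(filename_data, "wb") as f:
        f.write(r.content)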
    print("   MPEA data downloaded from public Google sheet")
else:
    # Note data already exists
    print("    MPEA data loaded locally")
    
# Read the data into memory
df_mpea = pd.read_csv(filename_data)

# Check basic facts
print(df_mpea.shape)
df_mpea.head()


### __Q1__: Inspecting a DataFrame
Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) (it might be useful to use a page search) and use some basic calls on `df_mpea` to answer the following questions:

- What are the *last* five observations in the DataFrame?
- How many rows are in `df_mpea`? How many columns?
- How would we access the column `PROPERTY: Microstructure`?

In [None]:
###
# TASK: Inspect df_mpea
# TODO: Show the last five observations of df_mpea
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Inspect df_mpea
# TODO: Determine the number of rows and columns in df_mpea
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Inspect df_mpea
# TODO: Grab the column `PROPERTY: Microstructure` alone
###

# -- WRITE YOUR CODE BELOW -----



These manipulations are simple, but they are bread-and-butter for studying new datasets.

## Grama

---

The `py-grama` package builds on top of Pandas to provide a pipeline-based data (and model) infrastructure.

Grama provides the pipe operator `>>`, which passes a DataFrame into a verb such as `gr.tf_head()`:


In [None]:
(
   df_mpea
   >> gr.tf_head()
)


It's helpful to think of the `>>` symbol as meaning "and then". That means code like this:

```
(
    df_mpea
    >> gr.tf_filter( ... )
    >> gr.tf_mutate( ... )
    >> gr.tf_pivot_longer( ... )
)
```

Can be read something like an English sentence, where we are using various *verbs* to operate on the data:

```
(
    Start with df_mpea
    and then filter the data
    and then mutate the data
    and then pivot the data into a longer format
)
```

We don't yet know what these verbs do; we'll learn more in the exercises below!
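
To make the "and then" reading concrete, here is a toy sketch of how a `>>` pipe can work in Python. The `Verb` class and the two verbs `keep_positive` and `add_double` are hypothetical illustrations written for this example; they are not grama's implementation, and real grama verbs do much more.

```python
import pandas as pd


class Verb:
    """Hypothetical stand-in for a pipeline verb: wraps a function of a DataFrame."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, df):
        # Python calls this for `df >> verb` when `df` does not handle `>>` itself
        return self.func(df)


# Two toy verbs (illustrative names, not part of grama)
keep_positive = Verb(lambda df: df[df["x"] > 0])
add_double = Verb(lambda df: df.assign(x2=2 * df["x"]))

df_toy = pd.DataFrame({"x": [-1, 2, 3]})

df_out = (
    df_toy
    >> keep_positive  # start with df_toy, and then keep rows with x > 0
    >> add_double     # and then add a column x2 equal to 2 * x
)
print(df_out)
```

Each `>>` hands the result of everything to its left to the verb on its right, which is why the pipeline reads top to bottom like a sentence.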


### Selecting

TODO


In [None]:
(
    df_mpea
    >> gr.tf_select("FORMULA")
)

### __qX__


In [None]:
(
    df_mpea
    >> gr.tf_select("FORMULA", "PROPERTY: Microstructure")
)

### __qX__


In [None]:
(
    df_mpea
    >> gr.tf_select(gr.contains("REFERENCE"))
)

### Renaming


In [None]:
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
)

## Interlude: Pipelines and the "Data Pronoun"

---


(Illustrate the use of the data pronoun)

Imagine we wanted to search through the dataset to find only those materials with an FCC microstructure. Above, we gave the `microstructure` column a new, convenient name. We might like to use that new, convenient name when searching for FCC materials. However, we're going to run into an issue:


In [None]:
## NOTE: Try uncommenting and running the following code; it WILL break!
# (
#     df_mpea
#     >> gr.tf_rename(
#         microstructure="PROPERTY: Microstructure",
#     )
#     >> gr.tf_filter(
#         df_mpea["microstructure"] == "FCC"
#     )
# )

If we want to refer to the data *now*---as it is currently in the pipeline---we need a name to refer to that DataFrame. This is where the *data pronoun* comes in; remember when we ran this line way up above in the setup chunk?

```
DF = gr.Intention()
```

This assigns the data pronoun to the name `DF`. We can use this to take advantage of the new (shorter) name we gave to the microstructure column:

In [None]:
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
    >> gr.tf_filter(
        DF["microstructure"] == "FCC"
    )
)

Together, the pipe operator `>>` and the data pronoun `DF` form a powerful team that helps us do sophisticated data operations. 
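
The same issue, and a similar fix, can be sketched in plain pandas. The example below is only an analogy under stated assumptions: `df_toy` and its values are made up, and a callable passed to `.loc` plays a role comparable to the data pronoun because it receives the DataFrame as it exists at that step of the chain.

```python
import pandas as pd

# Made-up toy data that mimics the column name used above
df_toy = pd.DataFrame({
    "FORMULA": ["alloy_a", "alloy_b", "alloy_c"],
    "PROPERTY: Microstructure": ["BCC", "FCC", "BCC"],
})

# The original frame never received the new name, so this lookup fails
try:
    df_toy["microstructure"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("Lookup on the original frame failed:", lookup_failed)

# A callable inside .loc is evaluated on the renamed frame,
# so the new column name is available to it
df_fcc = (
    df_toy
    .rename(columns={"PROPERTY: Microstructure": "microstructure"})
    .loc[lambda df: df["microstructure"] == "FCC"]
)
print(df_fcc)
```

Note that pandas `rename` maps old names to new names, whereas the `gr.tf_rename` call above is written as `new="old"`.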


### __qX__ 


In [None]:
## TODO: Eliminate the intermediate variables by using the data pronoun


## Back to Verbs

---


### Filtering

TODO


### __qX__


In [None]:
print("Original shape: {}".format(df_mpea.shape))

(
    df_mpea
    >> gr.tf_filter(gr.not_nan(DF["PROPERTY: YS (MPa)"]))
)



### Mutating

TODO


In [None]:
(
    df_mpea
    >> gr.tf_mutate(
        E_MPa = DF["PROPERTY: Young modulus (GPa)"] * 1000
    )
    >> gr.tf_filter(gr.not_nan(DF.E_MPa))
)

## Pivoting Data

---

TODO

In [None]:
from grama.data import df_stang_wide
df_stang_wide

Our goal will be to wrangle this messy, wide dataset into tidy, long format.

In [None]:
from grama.data import df_stang
df_stang

(What does pivoting look like? Here's an example.)


In [None]:
df_tmp = (
    gr.df_make(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=[7, 8, 9],
    )
)
print(df_tmp)

(
    df_tmp
    >> gr.tf_pivot_longer(
        columns=["A", "B", "C"],
        names_to="name",
        values_to="value",
    )
)
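
For comparison, plain pandas offers a similar reshaping through `DataFrame.melt`. The sketch below rebuilds the same toy data with `pd.DataFrame` and assumes `melt` is a reasonable stand-in for `gr.tf_pivot_longer` in this simple case; the two are not guaranteed to agree on row ordering or on extra arguments.

```python
import pandas as pd

# Same toy data as above, built directly with pandas
df_tmp = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
})

# Stack the three columns: old column names go to "name", cell values to "value"
df_long = df_tmp.melt(
    value_vars=["A", "B", "C"],
    var_name="name",
    value_name="value",
)
print(df_tmp.shape, "->", df_long.shape)
print(df_long)
```

The wide data has three rows and three columns; the long version has one row per original cell, so nine rows and two columns.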

### __QX__ 

(Make sure to add an `observation` column with the `index_to` argument.)


In [None]:

df_qX

Execute the following to check your work.


In [None]:
try:
    assert(df_qX.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not sufficiently long; did you pivot?")
    
try:
    assert(df_qX.shape[1] == 5)
except AssertionError:
    raise AssertionError("The DataFrame should have five columns")
    
try:
    assert("observation" in df_qX.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'observation' column")
    
print("Success!")

### __QY__


In [None]:
df_qY = (
    df_qX

)
df_qY

In [None]:
try:
    assert(df_qY.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert(df_qY.shape[1] == 6)
except AssertionError:
    raise AssertionError("The DataFrame should have six columns")
    
try:
    assert("angle" in df_qY.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
print("Success!")

### __QZ__

*Hint*: You should only need to set the `names_from` and `values_from` arguments with this function.
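
To see what pivoting wider does in general, here is a sketch in plain pandas using `DataFrame.pivot` on made-up data. The names `df_long_toy`, `sample`, `name`, and `value` are assumptions for illustration; this is a pandas analogy rather than the grama call the exercise asks for.

```python
import pandas as pd

# Made-up long data: two measured quantities per sample
df_long_toy = pd.DataFrame({
    "sample": [1, 1, 2, 2],
    "name": ["E", "mu", "E", "mu"],
    "value": [10.0, 0.3, 12.0, 0.25],
})

# Spread the entries of "name" into new columns, filled from "value"
df_wide_toy = (
    df_long_toy
    .pivot(index="sample", columns="name", values="value")
    .reset_index()
)
df_wide_toy.columns.name = None  # Drop the leftover "name" label on the columns
print(df_wide_toy)
```

Pivoting wider halves the number of rows here: each sample's two measurements end up side by side in one row.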


In [None]:
df_qZ = (
    df_qY

)
df_qZ

In [None]:
try:
    assert(df_qZ.shape[0] == 27)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert("angle" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
try:
    assert("E" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'E' column")
    
try:
    assert("mu" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have a 'mu' column")
    
print("Success!")

### Bonus: One-step pivot

In [None]:
(
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to=[".value", "angle"],
        names_sep="_",
        values_to="value",
    )
)

## Wrangling Data
[Hadley Wickham](http://hadley.nz/) -- author of the `tidyverse` and data science superstar -- notes that "wrangling data is 80% boredom and 20% screaming". To give you a sense of why this stuff is hard (but hopefully avoid the screaming), I'm leaving one of the wrangling steps in the workflow here:

It's not obvious from the exercises above, but *there's an issue with these data*.

In [None]:
df_data.dtypes


All of the entries are objects, not numbers! We'll need to convert these to numeric values. The following slightly-mysterious call will cast every column of `df_data` to a numeric type and modify the DataFrame.

In [None]:
df_data = df_data.apply(pd.to_numeric)


Let's check the data types again:

In [None]:
df_data.dtypes


These are numbers we can work with!
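
The conversion is easier to follow on a tiny example. The sketch below uses made-up values in a hypothetical `df_text`, stored as text on purpose, and applies the same `pd.to_numeric` call column by column.

```python
import pandas as pd

# Made-up data where every entry is a string, stored with the object dtype
df_text = pd.DataFrame(
    {"temperature": ["870", "900"], "strength": ["232.5", "250.0"]},
    dtype=object,
)
print(df_text.dtypes)

# apply() calls pd.to_numeric on each column in turn
df_num = df_text.apply(pd.to_numeric)
print(df_num.dtypes)

# Entries that cannot be parsed raise an error by default;
# errors="coerce" replaces them with NaN instead
df_messy = pd.DataFrame({"temperature": ["870", "n/a"]}, dtype=object)
df_coerced = df_messy.apply(pd.to_numeric, errors="coerce")
print(df_coerced)
```

Whether to coerce unparseable entries to `NaN` or to stop with an error depends on the dataset; silently coercing can hide data-entry problems.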

## Basic DataFrame Operations

With the numerical issues above sorted out, we can carry out *quantitative* operations on the dataframe. One useful thing we can do is compute a set of *summaries* on the data using `describe()`.

In [None]:
df_data.describe()


These summaries include things like the `mean` and standard deviation (`std`), as well as quartiles of the data. These give us a sense of *typical* values; for instance, we can see that a large fraction of observations have a "Diffusion time" of zero, but at least one observation has a value `> 70`.
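
As a small sketch of how to read these summaries, the made-up column below (a hypothetical `time` variable in `df_toy`) is mostly zero with one large value, so the quartiles and the maximum tell very different stories.

```python
import pandas as pd

# Made-up data: mostly zeros, with a single large value
df_toy = pd.DataFrame({"time": [0.0, 0.0, 0.0, 75.0]})

summary = df_toy.describe()
print(summary)

# Individual summaries can be looked up by row label and column name
median_time = summary.loc["50%", "time"]
max_time = summary.loc["max", "time"]
mean_time = summary.loc["mean", "time"]
print(median_time, max_time, mean_time)
```

A median of zero alongside a large maximum is the kind of pattern that signals a few unusual observations worth a closer look.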

### Special indexing
One of the most powerful features of pandas is the ability to do *logical indexing*; we may provide an array of `True` or `False` values to select only those rows with `True` values. For instance, we could do the following to select the third row.

In [None]:
idx_boolean = [False] * df_data.shape[0]  # Mostly-false array
idx_boolean[2] = True  # Make the third entry True
df_data[idx_boolean]


This kind of *logical indexing* becomes especially helpful when we combine it with the conditionals we learned in the previous exercise. We can apply a condition *to one of the columns* to effectively "filter" for observations that meet that condition. For instance, the following will filter for nonzero "Carburization Time".

In [None]:
df_data[df_data["Carburization Time"] > 0].head()
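
The sketch below shows what the condition produces on its own and how two conditions can be combined. The DataFrame `df_toy` and its columns `time` and `temp` are made-up assumptions for illustration.

```python
import pandas as pd

# Made-up toy data
df_toy = pd.DataFrame({
    "time": [0, 5, 0, 12],
    "temp": [30, 900, 30, 930],
})

# A comparison on a column yields one True/False value per row
mask = df_toy["time"] > 0
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True
df_nonzero = df_toy[mask]
print(df_nonzero)

# Combine conditions with & (and) or | (or); each condition needs parentheses
df_both = df_toy[(df_toy["time"] > 0) & (df_toy["temp"] > 910)]
print(df_both)
```

The parentheses matter because `&` binds more tightly than `>` in Python, so leaving them out changes the meaning of the expression.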


### Q5: Basic data operations
Once more, use the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) to learn how to do the following tasks:

- Select only those rows for which "Diffusion time" is greater than 70
- Sort df_data in descending order by "Fatigue Strength" and return the top 10
- Take the average of "Normalizing Temperature" and "Tempering Temperature" and add the column "avg_temp" (You may need to Google how to do this one!)

In [None]:
###
# TASK: Basic data operations
# TODO: Select rows for which "Diffusion time" > 70
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Basic data operations
# TODO: Sort by "Fatigue Strength" in descending order, take the top-10
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Basic data operations
# TODO: Average "Normalizing Temperature" and "Tempering Temperature" into the column "avg_temp", return the head
###

# -- WRITE YOUR CODE BELOW -----

