## Programmatic Data Operations

*Authors: Zach del Rosario*

The purpose of this exercise is to give you some tools to work with data *programmatically*; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, you will see that doing things programmatically is extremely powerful. 

### Learning Outcomes
By working through this notebook, you will be able to:

- Apply some basics of *data wrangling*
- Use DataFrame operations from the package `py-grama`


In [None]:
import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()


## DataFrames

A `DataFrame` is a data structure provided by pandas. In contrast with lists (which we saw in the previous exercise), DataFrames are explicitly designed for data analysis, and they provide many features that simplify common data operations.

A `DataFrame` is a *rectangular* representation of data -- it consists of rows and columns. Each *row* represents an *observation* -- a single instance of data. Each *column* represents a *variable* -- a particular attribute of the observation. 
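
To make the row/column vocabulary concrete, here is a minimal sketch using plain pandas. The data and the names `df_example`, `sample`, `thickness`, and `strength` are made up for illustration; they are not part of the exercise dataset.

```python
import pandas as pd

# Hypothetical toy data (not the exercise dataset):
# three observations (rows) and three variables (columns)
df_example = pd.DataFrame(
    {
        "sample": ["a", "b", "c"],
        "thickness": [0.022, 0.032, 0.064],
        "strength": [310.0, 295.0, 305.0],
    }
)

print(list(df_example.columns))  # the variables
print(df_example.iloc[0])        # the first observation: one value per variable
```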

**TODO** (Update) For instance, we have loaded some alloy data into the DataFrame `df_data` -- here each row is an alloy, and each column is some physical property of that alloy.


### __Q1__: Inspecting a DataFrame
Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) (it might be useful to use a page search) and use some basic calls on `df_data` to answer the following questions:

- What are the *last* five observations in the DataFrame?
- How many rows are in `df_data`? How many columns?
- How can you select the column "Normalizing Temperature"?
- How can you select the columns "Normalizing Temperature" and "Fatigue Strength"?

In [None]:
###
# TASK: Inspect df_data
# TODO: Show the last five observations of df_data
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Inspect df_data
# TODO: Determine the number of rows and columns in df_data
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Inspect df_data
# TODO: Select the column "Normalizing Temperature"
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Inspect df_data
# TODO: Select the columns "Normalizing Temperature" and "Fatigue Strength"
###

# -- WRITE YOUR CODE BELOW -----



These manipulations are simple, but they are bread-and-butter for studying new datasets.

### Pivoting

TODO

In [None]:
from grama.data import df_stang_wide
df_stang_wide

Our goal will be to wrangle this messy, wide dataset into tidy, long format.

In [None]:
from grama.data import df_stang
df_stang

(What does pivoting look like? Here's an example.)


In [None]:
df_tmp = (
    gr.df_make(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=[7, 8, 9],
    )
)
print(df_tmp)

(
    df_tmp
    >> gr.tf_pivot_longer(
        columns=["A", "B", "C"],
        names_to="name",
        values_to="value",
    )
)
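
For comparison, a similar wide-to-long reshaping can be sketched with plain pandas using `DataFrame.melt`. This is an illustration of the idea, not a description of how `gr.tf_pivot_longer` is implemented, and the row order of the two results may differ. The names `df_wide` and `df_long` are chosen here for illustration.

```python
import pandas as pd

# Same toy data as above, built directly with pandas
df_wide = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Stack the three columns into (name, value) pairs: 3 columns x 3 rows -> 9 rows
df_long = df_wide.melt(
    value_vars=["A", "B", "C"],
    var_name="name",     # old column names end up in this column
    value_name="value",  # old cell values end up in this column
)
print(df_long)
```

Each original cell becomes its own row, which is why the long DataFrame has as many rows as the wide one had cells in the pivoted columns.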

### __QX__ 

(Make sure to add an `observation` column with the `index_to` argument.)


In [None]:

df_qX

Execute the following to check your work.


In [None]:
try:
    assert(df_qX.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not sufficiently long; did you pivot?")
    
try:
    assert(df_qX.shape[1] == 5)
except AssertionError:
    raise AssertionError("The DataFrame should have five columns")
    
try:
    assert("observation" in df_qX.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'observation' column")
    
print("Success!")

### __QY__


In [None]:
df_qY = (
    df_qX

)
df_qY

In [None]:
try:
    assert(df_qY.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert(df_qY.shape[1] == 6)
except AssertionError:
    raise AssertionError("The DataFrame should have six columns")
    
try:
    assert("angle" in df_qY.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
print("Success!")

### __QZ__

*Hint*: You should only need to set the `names_from` and `values_from` arguments with this function.
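
To illustrate the idea of taking new column names from one column and their values from another, here is a minimal sketch on made-up data using plain pandas `DataFrame.pivot`. It is an analogy only; the py-grama function used in this exercise has its own arguments, and the names `df_long_toy`, `df_wide_toy`, and `id` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: two measurements ("x" and "y") per id
df_long_toy = pd.DataFrame(
    {
        "id": [0, 0, 1, 1],
        "name": ["x", "y", "x", "y"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# New column names come from "name"; their entries come from "value"
df_wide_toy = (
    df_long_toy
    .pivot(index="id", columns="name", values="value")
    .reset_index()
)
print(df_wide_toy)
```

Note that the wide result has half as many rows as the long input, because each pair of rows sharing an `id` collapses into one.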


In [None]:
df_qZ = (
    df_qY

)
df_qZ

In [None]:
try:
    assert(df_qZ.shape[0] == 27)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert("angle" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
try:
    assert("E" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'E' column")
    
try:
    assert("mu" in df_qZ.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have a 'mu' column")
    
print("Success!")

### Bonus: One-step pivot

In [None]:
(
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to=[".value", "angle"],
        names_sep="_",
        values_to="value",
    )
)

## Wrangling Data
[Hadley Wickham](http://hadley.nz/) -- author of the `tidyverse` and data science superstar -- notes that "wrangling data is 80% boredom and 20% screaming". To give you a sense of why this stuff is hard (but hopefully avoid the screaming), I'm leaving one of the wrangling steps in the workflow here:

It's not obvious from the exercises above, but *there's an issue with these data*.

In [None]:
df_data.dtypes


All of the entries are objects, not numbers! We'll need to convert these to numeric values. The following slightly-mysterious call applies `pd.to_numeric` to every column of `df_data` and stores the converted result back in `df_data`.

In [None]:
df_data = df_data.apply(pd.to_numeric)


Let's check the data types again:

In [None]:
df_data.dtypes


These are numbers we can work with!
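
The same conversion can be seen on a small made-up DataFrame. The names `df_text`, `df_num`, `temp`, and `time` are hypothetical and chosen only for this sketch.

```python
import pandas as pd

# Hypothetical data where numbers were read in as text
df_text = pd.DataFrame({"temp": ["870", "900"], "time": ["0.5", "1.5"]})
print(df_text.dtypes)  # text columns (not numeric)

# Apply pd.to_numeric column by column; this returns a new DataFrame
df_num = df_text.apply(pd.to_numeric)
print(df_num.dtypes)   # numeric columns

# Arithmetic now behaves as expected
print(df_num["time"].sum())
```

By default `pd.to_numeric` raises an error when it meets a value it cannot parse; passing `errors="coerce"` turns such values into `NaN` instead.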

## Basic DataFrame Operations

With the numerical issues above sorted out, we can carry out *quantitative* operations on the DataFrame. One useful thing we can do is compute a set of *summaries* on the data using `describe()`.

In [None]:
df_data.describe()


These summaries include things like the `mean` and standard deviation (`std`), as well as quartiles of the data. These give us a sense of *typical* values; for instance, we can see that a large fraction of observations have a "Diffusion time" of zero, but at least one observation has a value `> 70`.

### Special indexing
One of the most powerful features of pandas is the ability to do *logical indexing*; we may provide an array of `True` or `False` values to select only those rows with `True` values. For instance, we could do the following to select the third row.

In [None]:
idx_boolean = [False] * df_data.shape[0]  # Mostly-false array
idx_boolean[2] = True  # Make the third entry True
df_data[idx_boolean]


This kind of *logical indexing* becomes helpful when we combine it with the conditionals we learned in the previous exercise. A comparison on *one of the columns* produces a `True` or `False` value for every row, which we can use to "filter" for observations that meet some condition. For instance, the following will filter for nonzero "Carburization Time".

In [None]:
df_data[df_data["Carburization Time"] > 0].head()
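
To see what happens inside the brackets, it can help to split the filter into two steps on a small made-up DataFrame. The names `df_toy`, `mask`, `label`, and `time` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
df_toy = pd.DataFrame(
    {
        "label": ["a", "b", "c", "d"],
        "time": [0.0, 2.5, 0.0, 7.0],
    }
)

# Step 1: the comparison yields one True/False value per row
mask = df_toy["time"] > 0
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
df_filtered = df_toy[mask]
print(df_filtered)
```

The filtered DataFrame keeps the original row labels of the selected rows, which is useful for tracing results back to the full dataset.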


### __Q5__: Basic data operations
Once more, use the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) to learn how to do the following tasks:

- Select only those rows for which "Diffusion time" is greater than 70
- Sort df_data in descending order by "Fatigue Strength" and return the top 10
- Take the average of "Normalizing Temperature" and "Tempering Temperature" and add the column "avg_temp" (You may need to Google how to do this one!)

In [None]:
###
# TASK: Basic data operations
# TODO: Select rows for which "Diffusion time" > 70
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Basic data operations
# TODO: Sort by "Fatigue Strength" in descending order, take the top-10
###

# -- WRITE YOUR CODE BELOW -----



In [None]:
###
# TASK: Basic data operations
# TODO: Average "Normalizing Temperature" and "Tempering Temperature" into the column "avg_temp", return the head
###

# -- WRITE YOUR CODE BELOW -----

