![Banner logo](https://raw.githubusercontent.com/CitrineInformatics/community-tools/master/templates/fig/citrine_banner_2.png "Banner logo")

## Programmatic Data Operations

*Authors: Zach del Rosario (zdelrosario@citrine.io)*

The purpose of this exercise is to give you some tools to work with data *programmatically*; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, you will see that doing things programmatically is extremely powerful. 

### Learning Outcomes
By working through this notebook, you will be able to:

- Build self-sufficiency by consulting documentation to learn new programming concepts
- Inspect Python objects with `dir()`
- Learn some basics of *data wrangling*
- Use DataFrame operations in the Python package `pandas`

(Note: This is a *scavenger hunt*! You will have to follow the links below to finish these examples.)

### Q1: Setting up the Citrination client
Using your previous API work or the [learn-citrination](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/1_data_client_api_tutorial.ipynb) workbook as an example, set up the citrination client below.

In [None]:
## Import some relevant packages
import os
# Scientific computation
import numpy as np
import pandas as pd

# Workshop-specific tools
from workshop_utils import pifs2df, ddir, getAPIKey

# Third-party packages
from citrination_client import CitrinationClient
from citrination_client import PifSystemReturningQuery, PifSystemQuery
from citrination_client import DataQuery, DatasetQuery, DatasetReturningQuery, ChemicalFieldQuery
from citrination_client import PropertyQuery, FieldQuery
from citrination_client import ChemicalFilter, Filter

## TASK: Initialize the client below...
## You will need to provide `client` as a python object


### Q2: Obtaining a known dataset 
Search [citrination datasets](https://citrination.com/datasets) for the "Agrawal IMMI" dataset, find its `ID`, and load the data into memory. 

In [None]:
dataset_id = 1      # TASK: Identify the proper dataset id, use this below


## The following demonstrates how to use the search client API
search_client = client.search
query_agrawal = \
    PifSystemReturningQuery(
        size=500, 
        query=DataQuery(
            dataset=DatasetQuery(
                id=Filter(equal=str(dataset_id))
            )
        )
    )

## Perform checks
query_result = search_client.pif_search(query_agrawal)
print("Found {} PIFs in dataset {}.".format(query_result.total_num_hits, dataset_id))
print("(Should be 437 PIFs)")

Citrination stores data in [physical information files](http://citrineinformatics.github.io/pif-documentation/) (PIFs). 
 
### Reading a query result 
The code below demonstrates how to investigate a python object -- you can apply the same techniques to studying mysterious python packages in the future.

In [None]:
# query_result has a few useful attributes
dir(query_result)

In [None]:
# The __stuff__ attributes are python built-ins; the
# other attributes are features provided by the object.
# total_num_hits was used above to count the number of search hits
# hits gives the content of the query hits
query_result.hits[:5]

In [None]:
# The query hits are themselves objects; we'll need to access *their* attributes as well
ddir(query_result.hits[0]) # Helper function filters names with "_" prefix
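
The same inspection technique works on any Python object. The sketch below uses a hypothetical `Hit` class (a stand-in, not the Citrination class) and a hypothetical `public_names` helper to show how `dir()` output can be filtered to hide underscore-prefixed names. The source of `ddir` is not shown in this notebook, so treat this as an assumption about how such a helper might work rather than its actual implementation.

```python
class Hit:
    """Hypothetical stand-in for a query hit; not the Citrination class."""

    def __init__(self):
        self.id = "abc"
        self.system = {"name": "steel"}
        self._cache = None  # "private" by convention


def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


hit = Hit()
print(public_names(hit))
```

Filtering this way hides both the Python built-ins (`__init__`, `__class__`, ...) and attributes marked private by convention (`_cache`), leaving only the names you are likely to need.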

### Q3: Extract the PIFs
Complete the following code by *extracting* the PIFs from the `query_result`. You will need to use a loop or list comprehension.

In [None]:
# It's not at all obvious from the name, but the `system` attribute returns the actual PIF
query_result.hits[0].system

## TASK: Build a list of all the PIFs in query_result, and store it in `pifs`
pifs = []

# Utility function will tabularize PIFs into a plot-able form
df_data = pifs2df(pifs)
df_data.head(5)

## DataFrames

A `DataFrame` is a data structure provided by Pandas. In contrast with a `list` (which we saw in the previous exercise), a DataFrame is explicitly designed for data analysis, and it provides a number of features for inspecting, filtering, and summarizing data.

A `DataFrame` is a *rectangular* representation of data -- it consists of rows and columns. Each *row* represents an *observation* -- a single instance of data. Each *column* represents a *variable* -- a particular attribute of the observation. For instance, we have loaded some alloy data into the DataFrame `df_data` -- here each row is an alloy, and each column is some physical property of that alloy.
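
As a minimal sketch of this row/column layout, the following builds a tiny DataFrame by hand. The column names and values are made up for illustration and are not taken from the Agrawal dataset.

```python
import pandas as pd

# Hypothetical alloy records; the values are invented for illustration
df_toy = pd.DataFrame(
    {
        "alloy": ["A", "B", "C"],
        "hardness": [210.0, 185.0, 240.0],
        "density": [7.85, 7.80, 7.90],
    }
)

# Each row is one observation (one alloy); each column is one variable
print(df_toy)
print(list(df_toy.columns))
```

Building a DataFrame from a dictionary maps each key to a column, so every column must have the same number of entries; that requirement is what makes the data rectangular.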

Below, we will use pandas functions to study the alloy data using DataFrame operations.

### Q4: Inspecting a DataFrame
Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) (it might be useful to use a page search) and use some basic calls on `df_data` to answer the following questions:

- What are the *last* five observations in the DataFrame?
- How many rows are in `df_data`? How many columns?
- How can you select the column "Normalizing Temperature"?
- How can you select the columns "Normalizing Temperature" and "Fatigue Strength"?

In [None]:
## Task: Show last five observations of df_data


In [None]:
## Task: Determine the number of rows and columns in df_data


In [None]:
## Task: Select the column "Normalizing Temperature"


In [None]:
## Task: Select the columns "Normalizing Temperature" and "Fatigue Strength"


These manipulations are simple, but they are bread-and-butter for studying new datasets.

## Wrangling Data
[Hadley Wickham](http://hadley.nz/) -- author of the `tidyverse` and data science superstar -- notes that "wrangling data is 80% boredom and 20% screaming". To give you a sense of why this stuff is hard (but hopefully avoid the screaming), I'm leaving one of the wrangling steps in the workflow here.

It's not obvious from the exercises above, but *there's an issue with these data*.

In [None]:
df_data.dtypes

All of the columns have dtype `object`, not a numeric type! We'll need to convert them to numeric values. The following slightly-mysterious call applies `pd.to_numeric` to every column of `df_data` and assigns the converted DataFrame back to `df_data`.

In [None]:
df_data = df_data.apply(pd.to_numeric)

Let's check the data types again:

In [None]:
df_data.dtypes

These are numbers we can work with!
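
To see what the conversion does on a smaller scale, here is a hedged sketch on a made-up DataFrame of strings (the column names `temp` and `time` are hypothetical). It also passes `errors="coerce"`, an optional argument of `pd.to_numeric` that the call above does not use, to show one way of handling entries that cannot be parsed.

```python
import pandas as pd

# Numbers stored as text, with one unparseable entry
df_text = pd.DataFrame(
    {
        "temp": ["870", "900", "n/a"],
        "time": ["30", "45", "60"],
    }
)

# apply() calls pd.to_numeric once per column; errors="coerce" turns
# unparseable strings into NaN instead of raising an error
df_num = df_text.apply(pd.to_numeric, errors="coerce")
print(df_num.dtypes)
print(df_num)
```

Without `errors="coerce"`, the `"n/a"` entry would raise a `ValueError`, which is often the safer default because it forces you to look at the offending values.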

## Basic DataFrame Operations

With the numerical issues above sorted out, we can carry out *quantitative* operations on the dataframe. One useful thing we can do is compute a set of *summaries* on the data using `describe()`.

In [None]:
df_data.describe()

These summaries include the `mean` and standard deviation (`std`), as well as the quartiles of the data. They give us a sense of *typical* values; for instance, we can see that a large fraction of observations have a "Diffusion time" of zero, but at least one observation has a value `> 70`.
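
The following sketch shows how to read such a summary on a single made-up column (the name `diffusion_time` and its values are hypothetical). A median of zero alongside a large maximum is the same pattern described above: most values are zero, but a few are not.

```python
import pandas as pd

# Mostly-zero column with one large value; invented for illustration
s_toy = pd.Series([0.0, 0.0, 0.0, 80.0], name="diffusion_time")

summary = s_toy.describe()
print(summary)

# Individual statistics can be pulled out by label
print(summary["50%"], summary["max"])
```

Comparing the `mean` against the `50%` row (the median) is a quick check for skew: here the mean is pulled well above the median by a single large observation.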

### Special indexing
One of the most powerful features of pandas is the ability to do *logical indexing*; we may provide an array of `True` or `False` values to select only those rows with `True` values. For instance, we could do the following to select the third row.

In [None]:
idx_boolean = [False] * df_data.shape[0] # Mostly-false array
idx_boolean[2] = True # Make the third entry True
df_data[idx_boolean]

This kind of *logical indexing* becomes especially helpful when we combine it with the conditionals we learned in the previous exercise. We can apply a condition *to one of the columns* to effectively "filter" for the observations (rows) that meet it. For instance, the following will filter for nonzero "Carburization Time".

In [None]:
df_data[df_data["Carburization Time"] > 0].head()
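
It can help to look at the intermediate step: comparing a column against a value produces a boolean Series, and that Series is what selects the rows. The sketch below uses a made-up DataFrame (the columns `time` and `temp` are hypothetical) and also shows how two conditions might be combined with `&`.

```python
import pandas as pd

df_toy = pd.DataFrame(
    {
        "time": [0, 5, 0, 12],
        "temp": [850, 900, 870, 930],
    }
)

# The comparison is evaluated row by row and yields True/False values
mask = df_toy["time"] > 0
print(mask.tolist())

# Combine conditions with & (and) or | (or); each condition needs its
# own parentheses because & binds more tightly than the comparisons
both = df_toy[(df_toy["time"] > 0) & (df_toy["temp"] > 900)]
print(both)
```

Note that Python's `and` / `or` keywords do not work element-wise on a Series; the `&` and `|` operators are the ones to use here.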

### Q5: Basic data operations
Once more, use the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) to learn how to do the following tasks:

- Select only those rows for which "Diffusion time" is greater than 70
- Sort df_data in descending order by "Fatigue Strength" and return the top 10
- Take the average of "Normalizing Temperature" and "Tempering Temperature" and add the column "avg_temp" (You may need to Google how to do this one!)

In [None]:
## TASK: Select rows for which "Diffusion time" > 70


In [None]:
## TASK: Sort by "Fatigue Strength" in descending order, take the top-10


In [None]:
## TASK: Average "Normalizing Temperature" and "Tempering Temperature" into the column "avg_temp", return the head
