## Programmatic Data Manipulations

The purpose of this exercise is to give you some tools to manipulate data *programmatically*; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, doing them programmatically makes the same steps repeatable and lets them scale to datasets too large to edit by hand.

The specific tasks you'll learn to do in this exercise are:

- Initialize the Citrination application programming interface (API) and obtain data
- Inspect Python objects with `dir()`
- Learn some basics of *data wrangling*
- Use DataFrame operations in the Python package `pandas`
- Learn the basics of *featurization* to support training machine learning models

(Note: This is a *scavenger hunt*! You will have to follow the links below to finish these examples.)

__Q1: Setting up the Citrination client__
Using the [learn-citrination](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/1_data_client_api_tutorial.ipynb) workbook as an example, set up the Citrination client below.

In [1]:
## Import some relevant packages
import os
# Scientific computation
import numpy as np
import pandas as pd

# Workshop-specific tools
from workshop_utils import pifs2df, ddir

# Third-party packages
from citrination_client import CitrinationClient
from citrination_client import PifSystemReturningQuery, PifSystemQuery
from citrination_client import DataQuery, DatasetQuery, DatasetReturningQuery, ChemicalFieldQuery
from citrination_client import PropertyQuery, FieldQuery
from citrination_client import ChemicalFilter, Filter

## TASK: Initialize the client below...
## You will need to provide `client` as a Python object
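
A minimal sketch of one way to initialize the client. It assumes your API key is stored in an environment variable named `CITRINATION_API_KEY` (use whatever name you chose) and that the site URL is `https://citrination.com`. The helper names `read_api_key` and `make_client` are hypothetical and not part of `citrination_client`. The import is deferred inside `make_client` only so that the snippet loads even where the package is not installed.

```python
import os


def read_api_key(env=os.environ, var_name="CITRINATION_API_KEY"):
    """Return the API key stored in `env`, or raise a clear error if it is missing."""
    # `var_name` is an assumed variable name, not something the library requires
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {} environment variable to your API key.".format(var_name)
        )
    return key


def make_client(site="https://citrination.com"):
    """Build a CitrinationClient from the key found in the environment."""
    # Deferred import: the function definition does not need the package
    from citrination_client import CitrinationClient

    return CitrinationClient(api_key=read_api_key(), site=site)
```

In the notebook, `client = make_client()` would then provide the `client` object that the later cells expect. Keeping the key in the environment avoids pasting it into a notebook that might be shared.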


__Q2: Obtaining a known dataset__ Search [citrination datasets](https://citrination.com/datasets) for the "Agrawal IMMI" dataset, find its `ID`, and load the data into memory. 

In [2]:
dataset_id = 1      # TASK: Identify the proper dataset id, use this below

search_client = client.search
query_agrawal = \
    PifSystemReturningQuery(
        size=500, 
        query=DataQuery(
            dataset=DatasetQuery(
                id=Filter(equal=str(dataset_id))
            )
        )
    )

## Perform checks
query_result = search_client.pif_search(query_agrawal)
print("Found {} PIFs in dataset {}.".format(query_result.total_num_hits, dataset_id))
print("(Should be 437 PIFs)")

Found 437 PIFs in dataset 150670.
(Should be 437 PIFs)


Citrination stores data in [physical information files](http://citrineinformatics.github.io/pif-documentation/) (PIFs). 
 
__Q3: Reading a Query Result__ (Turn the PIFs above into rectangular data)

In [3]:
# query_result has a few useful attributes
dir(query_result)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_convert_to_dictionary',
 '_get_object',
 '_hits',
 '_max_score',
 '_took',
 '_total_num_hits',
 'as_dictionary',
 'hits',
 'max_score',
 'took',
 'total_num_hits']

In [4]:
# The __stuff__ attributes are Python built-ins; the
# other attributes are features provided by the object.
# `total_num_hits` was used above to count the number of search hits;
# `hits` gives the content of the query hits.
query_result.hits[:5]

[<citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x115798f60>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x115813e80>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x115813080>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x11729af28>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x11729a278>]

In [5]:
# The query hits are themselves objects; we'll need to access *their* attributes as well
ddir(query_result.hits[0]) # Helper function filters names with "_" prefix

['as_dictionary',
 'dataset',
 'dataset_version',
 'extracted',
 'extracted_path',
 'id',
 'score',
 'system',
 'updated_at']
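
The `ddir` helper comes from `workshop_utils`, and its source is not shown here. The comment above says it filters names with a `_` prefix, so a stand-in can be sketched as follows. This is an assumption about its behavior, not its actual implementation; `public_names` and `Hit` are hypothetical names used only for illustration.

```python
def public_names(obj):
    """List the attribute names of `obj` that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Hit:
    """Stand-in object with one private and two public attributes."""

    def __init__(self):
        self._score = 1.0
        self.id = "abc"
        self.system = {"uid": "abc"}


names = public_names(Hit())
print(names)
```

Filtering on the leading underscore hides both the double-underscore built-ins and the single-underscore private attributes, which is why the list above is so much shorter than the raw `dir()` output.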

In [6]:
# It's not at all obvious from the name, but the `system` attribute returns the actual PIF
query_result.hits[0].system

## TASK: Build a list of all the PIFs in query_result, and store it in `pifs`
pifs = []

# The utility function `pifs2df` tabulates the PIFs into a rectangular DataFrame
df_data = pifs2df(pifs)
df_data.head(5)

Unnamed: 0,Area Proportion of Isolated Inclusions,Normalizing Temperature,Tempering Temperature,Cooling Rate for Through Hardening,Area Proportion of Inclusions Deformed by Plastic Work,Fatigue Strength,Through Hardening Temperature,Cooling Rate for Tempering,Area Proportion of Inclusions Occurring in Discontinuous Array,Diffusion time,Diffusion Temperature,Sample Number,Carburization Temperature,Carburization Time,Reduction Ratio (Ingot to Bar),Through Hardening Time,Tempering Time,Quenching Media Temperature (for Carburization)
0,0.01,870.0,550.0,8.0,0.02,451.0,845.0,24.0,0.0,0.0,30.0,228.0,30.0,0.0,530.0,30.0,60.0,30.0
1,0.03,870.0,550.0,8.0,0.04,631.0,855.0,24.0,0.0,0.0,30.0,193.0,30.0,0.0,510.0,30.0,60.0,30.0
2,0.01,870.0,600.0,8.0,0.03,406.0,845.0,24.0,0.0,0.0,30.0,233.0,30.0,0.0,610.0,30.0,60.0,30.0
3,0.0,865.0,550.0,24.0,0.1,433.0,865.0,24.0,0.0,0.0,30.0,22.0,30.0,0.0,1740.0,30.0,60.0,30.0
4,0.01,870.0,650.0,8.0,0.03,385.0,845.0,24.0,0.0,0.0,30.0,240.0,30.0,0.0,610.0,30.0,60.0,30.0
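
As a hint for the task above, the pattern of pulling one attribute off every element of a list can be sketched with stand-in objects. The `SimpleNamespace` objects below only imitate the `hits`, `system`, and `total_num_hits` attributes seen earlier; they are not real `PifSearchHit` objects, and the `_demo` names are hypothetical.

```python
from types import SimpleNamespace

# Stand-ins that imitate the attributes inspected above
hits_demo = [
    SimpleNamespace(id=str(i), system={"uid": "pif-{}".format(i)})
    for i in range(3)
]
query_result_demo = SimpleNamespace(hits=hits_demo, total_num_hits=len(hits_demo))

# One list entry per hit, holding that hit's `system` attribute
pifs_demo = [hit.system for hit in query_result_demo.hits]

# Sanity check: the list should account for every hit the query reported
assert len(pifs_demo) == query_result_demo.total_num_hits
print(pifs_demo)
```

The length check is worth keeping with real data: the query above asks for at most 500 results and the dataset has 437 PIFs, so a single query suffices here, but a larger dataset would need more than one query to collect every hit.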


## DataFrames

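A pandas `DataFrame` is a two-dimensional, labeled table: each row is an observation, each column is a named variable, and each column has its own data type. `df_data` above is a DataFrame with one row per PIF.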

**Q4: Inspecting a DataFrame** Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) and use some basic calls on `df_data` to answer the following questions:

- What are the *last* five observations in the DataFrame?
- How many rows are in `df_data`? How many columns?
- How can you select the column "Normalizing Temperature"?
- How can you select the columns "Normalizing Temperature" and "Fatigue Strength"?

In [13]:
## Task: Show last five observations of df_data


Unnamed: 0,Area Proportion of Isolated Inclusions,Normalizing Temperature,Tempering Temperature,Cooling Rate for Through Hardening,Area Proportion of Inclusions Deformed by Plastic Work,Fatigue Strength,Through Hardening Temperature,Cooling Rate for Tempering,Area Proportion of Inclusions Occurring in Discontinuous Array,Diffusion time,Diffusion Temperature,Sample Number,Carburization Temperature,Carburization Time,Reduction Ratio (Ingot to Bar),Through Hardening Time,Tempering Time,Quenching Media Temperature (for Carburization)
432,0.0,870.0,650.0,8.0,0.03,490.0,855.0,24.0,0.0,0.0,30.0,210.0,30.0,0.0,530.0,30.0,60.0,30.0
433,0.0,845.0,600.0,24.0,0.08,463.0,845.0,24.0,0.0,0.0,30.0,64.0,30.0,0.0,1740.0,30.0,60.0,30.0
434,0.0,870.0,550.0,8.0,0.1,592.0,855.0,24.0,0.0,0.0,30.0,148.0,30.0,0.0,820.0,30.0,60.0,30.0
435,0.02,885.0,30.0,0.0,0.06,245.0,30.0,0.0,0.02,0.0,30.0,6.0,30.0,0.0,825.0,0.0,0.0,30.0
436,0.0,870.0,550.0,8.0,0.02,526.0,855.0,24.0,0.0,0.0,30.0,141.0,30.0,0.0,530.0,30.0,60.0,30.0


In [14]:
## Task: Determine the number of rows and columns in df_data


(437, 18)

In [15]:
## Task: Select the column "Normalizing Temperature"


Unnamed: 0,Normalizing Temperature
0,870.0
1,870.0
2,870.0
3,865.0
4,870.0


In [16]:
## Task: Select the columns "Normalizing Temperature" and "Fatigue Strength"


Unnamed: 0,Normalizing Temperature,Fatigue Strength
0,870.0,451.0
1,870.0,631.0
2,870.0,406.0
3,865.0,433.0
4,870.0,385.0


These manipulations are simple, but they are bread-and-butter for studying new datasets.
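
The four lookups from Q4 can be sketched on a small toy DataFrame. The values below are made up for illustration and are not taken from the Agrawal dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Toy table with made-up values; column names borrowed from df_data
df_toy = pd.DataFrame(
    {
        "Normalizing Temperature": [870.0, 870.0, 865.0, 870.0, 845.0, 885.0],
        "Fatigue Strength": [450.0, 630.0, 430.0, 385.0, 460.0, 245.0],
        "Sample Number": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Last five observations
last_five = df_toy.tail(5)

# (rows, columns) as a tuple
n_rows, n_cols = df_toy.shape

# A single label returns a Series
one_col = df_toy["Normalizing Temperature"]

# A list of labels returns a DataFrame
two_cols = df_toy[["Normalizing Temperature", "Fatigue Strength"]]

print(last_five)
print(n_rows, n_cols)
```

Note the double brackets in the last selection: the outer pair indexes the DataFrame and the inner pair is an ordinary Python list of column names.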

## Wrangling Data
[Hadley Wickham](http://hadley.nz/) -- author of the `tidyverse` and data science superstar -- notes that "wrangling data is 80% boredom and 20% screaming". To give you a sense of why this stuff is hard (but hopefully avoid the screaming), I'm leaving one of the wrangling steps in the workflow here:

It's not obvious from the exercises above, but *there's an issue with these data*.

In [7]:
df_data.dtypes

Area Proportion of Isolated Inclusions                            object
Normalizing Temperature                                           object
Tempering Temperature                                             object
Cooling Rate for Through Hardening                                object
Area Proportion of Inclusions Deformed by Plastic Work            object
Fatigue Strength                                                  object
Through Hardening Temperature                                     object
Cooling Rate for Tempering                                        object
Area Proportion of Inclusions Occurring in Discontinuous Array    object
Diffusion time                                                    object
Diffusion Temperature                                             object
Sample Number                                                     object
Carburization Temperature                                         object
Carburization Time                                 

Every column has dtype `object`, which is how pandas labels columns that hold generic Python objects (often strings) rather than numbers. We'll need to convert these to numeric values. The following call applies `pd.to_numeric` to each column of `df_data` and assigns the converted DataFrame back to the same name.

In [8]:
df_data = df_data.apply(pd.to_numeric)

Let's check the data types again:

In [9]:
df_data.dtypes

Area Proportion of Isolated Inclusions                            float64
Normalizing Temperature                                           float64
Tempering Temperature                                             float64
Cooling Rate for Through Hardening                                float64
Area Proportion of Inclusions Deformed by Plastic Work            float64
Fatigue Strength                                                  float64
Through Hardening Temperature                                     float64
Cooling Rate for Tempering                                        float64
Area Proportion of Inclusions Occurring in Discontinuous Array    float64
Diffusion time                                                    float64
Diffusion Temperature                                             float64
Sample Number                                                     float64
Carburization Temperature                                         float64
Carburization Time                    

These are numbers we can work with!
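
The conversion can be reproduced on a toy DataFrame whose values are stored as strings. The column names and values below are made up for illustration. The second half shows the `errors="coerce"` option of `pd.to_numeric`, which the cell above did not need but which can help when a column contains entries that cannot be parsed.

```python
import pandas as pd

# Toy data stored as strings, forced to dtype `object` to mimic df_data
df_raw = pd.DataFrame(
    {"temperature": ["870.0", "865.0"], "n_samples": ["3", "4"]},
    dtype=object,
)
print(df_raw.dtypes)

# Apply pd.to_numeric column by column; the result is a new DataFrame
df_num = df_raw.apply(pd.to_numeric)
print(df_num.dtypes)

# With errors="coerce", unparseable entries become NaN instead of raising
messy = pd.Series(["30.0", "n/a"], dtype=object)
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```

Each column gets the narrowest numeric type that fits its values, so a column of whole numbers becomes an integer column while one with decimals becomes a float column.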

## Basic DataFrame Operations

## Featurization