## Pulling Data from Citrination

__Q1: Setting up the Citrination client__
Using the [learn-citrination](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/1_data_client_api_tutorial.ipynb) notebook as an example, set up the Citrination client below.

In [1]:
## Import some relevant packages
import os
# Scientific computation
import numpy as np
import pandas as pd

# Workshop-specific tools
from workshop_utils import pifs2df

# Third-party packages
from citrination_client import CitrinationClient
from citrination_client import PifSystemReturningQuery, PifSystemQuery
from citrination_client import DataQuery, DatasetQuery, DatasetReturningQuery, ChemicalFieldQuery
from citrination_client import PropertyQuery, FieldQuery
from citrination_client import ChemicalFilter, Filter

## TASK: Initialize the client below...
## You will need to provide `client` as a python object
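
One way to approach this task, sketched under assumptions: the API key is stored in an environment variable named `CITRINATION_API_KEY` (a name chosen here for illustration, not mandated by the library), and the public site `https://citrination.com` is the target. The helper `citrination_settings` is hypothetical; it only gathers the two arguments, and the commented line shows how they might be passed to `CitrinationClient`.

```python
import os


def citrination_settings(env=None):
    """Collect the arguments needed to build a CitrinationClient.

    Hypothetical helper: the environment variable names are assumptions.
    """
    env = os.environ if env is None else env
    api_key = env.get("CITRINATION_API_KEY")
    if not api_key:
        raise RuntimeError("Set CITRINATION_API_KEY to your Citrination API key.")
    site = env.get("CITRINATION_SITE", "https://citrination.com")
    return {"api_key": api_key, "site": site}


# With the key set in your environment, the client could then be built as:
# client = CitrinationClient(**citrination_settings())
```

Reading the key from the environment keeps it out of the notebook, which matters if the notebook is shared or committed to version control.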


__Q2: Obtaining a known dataset__ Search [citrination datasets](https://citrination.com/datasets) for the "Agrawal IMMI" dataset, find its `ID`, and load the data into memory. 

In [2]:
dataset_id = 1      # TASK: Identify the proper dataset id, use this below

search_client = client.search
query_agrawal = \
    PifSystemReturningQuery(
        size=500, 
        query=DataQuery(
            dataset=DatasetQuery(
                id=Filter(equal=str(dataset_id))
            )
        )
    )

## Perform checks
query_result = search_client.pif_search(query_agrawal)
print("Found {} PIFs in dataset {}.".format(query_result.total_num_hits, dataset_id))
print("(Should be 437 PIFs)")

Found 437 PIFs in dataset 150670.
(Should be 437 PIFs)


Citrination stores data in [physical information files](http://citrineinformatics.github.io/pif-documentation/) (PIFs). 
 
__Q3: Reading a Query Result__ (Turn the PIFs above into rectangular data)

In [3]:
# query_result has a few useful attributes
dir(query_result)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_convert_to_dictionary',
 '_get_object',
 '_hits',
 '_max_score',
 '_took',
 '_total_num_hits',
 'as_dictionary',
 'hits',
 'max_score',
 'took',
 'total_num_hits']

In [4]:
# The __stuff__ attributes are Python built-ins; the
# other attributes are features provided by the object.
# total_num_hits was used above to count the number of search hits
# hits gives the content of the query hits
query_result.hits[:5]

[<citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x1191c7128>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x1191c7048>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x1191c7550>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x11ac08940>,
 <citrination_client.search.pif.result.pif_search_hit.PifSearchHit at 0x11ac18630>]

In [5]:
# The query hits are themselves objects; we'll need to access *their* attributes as well
list(filter(lambda s: s[0] != "_", dir(query_result.hits[0]))) # Skip the "_"-prefixed entries

['as_dictionary',
 'dataset',
 'dataset_version',
 'extracted',
 'extracted_path',
 'id',
 'score',
 'system',
 'updated_at']

In [6]:
# It's not at all obvious from the name, but the `system` attribute returns the actual PIF
query_result.hits[0].system

## TASK: Build a list of all the PIFs in query_result, and store it in `pifs`
pifs = []

# Utility function will tabularize PIFs into a plot-able form
df_data = pifs2df(pifs)
df_data.head(5)

| | Carburization Time | Quenching Media Temperature (for Carburization) | Through Hardening Temperature | Tempering Temperature | Fatigue Strength | Sample Number | Normalizing Temperature | Cooling Rate for Tempering | Reduction Ratio (Ingot to Bar) | Diffusion time | Area Proportion of Isolated Inclusions | Carburization Temperature | Tempering Time | Area Proportion of Inclusions Deformed by Plastic Work | Cooling Rate for Through Hardening | Diffusion Temperature | Through Hardening Time | Area Proportion of Inclusions Occurring in Discontinuous Array |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 30.0 | 845.0 | 550.0 | 451.0 | 228.0 | 870.0 | 24.0 | 530.0 | 0.0 | 0.01 | 30.0 | 60.0 | 0.02 | 8.0 | 30.0 | 30.0 | 0.0 |
| 1 | 0.0 | 30.0 | 855.0 | 550.0 | 631.0 | 193.0 | 870.0 | 24.0 | 510.0 | 0.0 | 0.03 | 30.0 | 60.0 | 0.04 | 8.0 | 30.0 | 30.0 | 0.0 |
| 2 | 0.0 | 30.0 | 845.0 | 600.0 | 406.0 | 233.0 | 870.0 | 24.0 | 610.0 | 0.0 | 0.01 | 30.0 | 60.0 | 0.03 | 8.0 | 30.0 | 30.0 | 0.0 |
| 3 | 0.0 | 30.0 | 865.0 | 550.0 | 433.0 | 22.0 | 865.0 | 24.0 | 1740.0 | 0.0 | 0.0 | 30.0 | 60.0 | 0.1 | 24.0 | 30.0 | 30.0 | 0.0 |
| 4 | 0.0 | 30.0 | 845.0 | 650.0 | 385.0 | 240.0 | 870.0 | 24.0 | 610.0 | 0.0 | 0.01 | 30.0 | 60.0 | 0.03 | 8.0 | 30.0 | 30.0 | 0.0 |
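
For reference, one way to complete the `pifs` task is a list comprehension over the hits. The sketch below uses stand-in objects built with `SimpleNamespace` instead of a real query result, so it runs without a Citrination connection. `fake_result`, `collect_pifs`, and the dictionary contents are made up for illustration; only the `hits` and `system` attribute names come from the cells above.

```python
from types import SimpleNamespace

# Stand-in for `query_result`: each hit carries its PIF in `.system`.
fake_result = SimpleNamespace(
    hits=[
        SimpleNamespace(system={"uid": "pif-a"}),
        SimpleNamespace(system={"uid": "pif-b"}),
    ]
)


def collect_pifs(result):
    """Return the PIF stored on each search hit, in hit order."""
    return [hit.system for hit in result.hits]


pifs_demo = collect_pifs(fake_result)
print(len(pifs_demo))
```

With the real result, `pifs = [hit.system for hit in query_result.hits]` should produce the list that is passed to `pifs2df`.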


## Wrangling Data
[Hadley Wickham](http://hadley.nz/), author of the `tidyverse` and data science superstar, notes that "wrangling data is 80% boredom and 20% screaming". To give you a sense of why this stuff is hard (while hopefully avoiding the screaming), I'm leaving one of the wrangling steps in the workflow here.

It's not obvious from the table above, but there's an issue with these data.

In [7]:
df_data.dtypes

Carburization Time                                                object
Quenching Media Temperature (for Carburization)                   object
Through Hardening Temperature                                     object
Tempering Temperature                                             object
Fatigue Strength                                                  object
Sample Number                                                     object
Normalizing Temperature                                           object
Cooling Rate for Tempering                                        object
Reduction Ratio (Ingot to Bar)                                    object
Diffusion time                                                    object
Area Proportion of Isolated Inclusions                            object
Carburization Temperature                                         object
Tempering Time                                                    object
Area Proportion of Inclusions Deformed by Plastic W

All of the columns have the generic `object` dtype, not a numeric one! We'll need to convert them to numeric values. The following slightly mysterious call converts every column of `df_data` with `pd.to_numeric`, then rebinds `df_data` to the new DataFrame that `assign` returns.

In [8]:
kwargs = {col: pd.to_numeric(df_data[col]) for col in df_data.columns}
df_data = df_data.assign(**kwargs)
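
To unpack that call, here is a small sketch on a toy DataFrame. The values are invented, and storing them as strings is an assumption about why the real columns show up as `object`; the cell above only tells us the dtype, not the underlying representation.

```python
import pandas as pd

# Toy stand-in for df_data: numeric values stored as strings (object dtype).
df_toy = pd.DataFrame(
    {"Tempering Time": ["60.0", "90.5"], "Sample Number": ["228", "193"]},
    dtype=object,
)

# Build {column name: converted Series}, then hand each pair to assign()
# as a keyword argument; assign() returns a new DataFrame.
kwargs = {col: pd.to_numeric(df_toy[col]) for col in df_toy.columns}
df_converted = df_toy.assign(**kwargs)

# A shorter spelling that applies the same conversion column by column.
df_applied = df_toy.apply(pd.to_numeric)

print(df_converted.dtypes)
```

The `**kwargs` form works even though the column names contain spaces, because the names are passed as dictionary keys rather than written out as Python identifiers.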

Let's check the data types again:

In [9]:
df_data.dtypes

Carburization Time                                                float64
Quenching Media Temperature (for Carburization)                   float64
Through Hardening Temperature                                     float64
Tempering Temperature                                             float64
Fatigue Strength                                                  float64
Sample Number                                                     float64
Normalizing Temperature                                           float64
Cooling Rate for Tempering                                        float64
Reduction Ratio (Ingot to Bar)                                    float64
Diffusion time                                                    float64
Area Proportion of Isolated Inclusions                            float64
Carburization Temperature                                         float64
Tempering Time                                                    float64
Area Proportion of Inclusions Deformed

These are numbers we can work with!