![Citrine banner](https://raw.githubusercontent.com/CitrineInformatics/community-tools/master/templates/fig/citrine_banner_2.png)

# Introduction to the Python Citrination Client (PyCC)
*Authors: Enze Chen, Eddie Kim, Julia Ling, Zachary del Rosario*

In this notebook, we will cover how to use the **Citrination API** and the core components of the PyCC. We will demonstrate this by building an ML model for predicting band gaps.

$$ \text{Chemical formula (inorganic descriptor)} \longrightarrow \boxed{\text{ML Model}} \longrightarrow \text{Band gap (real descriptor)} $$

Along the way, we will explain various features of the Python programming language. To give you some hands-on experience, there are blanks for you to fill in below, marked by `TASK` and `TODO` comments in the code cells.

### Learning outcomes
As a result of this exercise, you will learn to:

* Use the Citrination application programming interface (API) to upload data, train a model, and make machine learning predictions
* Find example code on [learn-citrination](https://github.com/CitrineInformatics/learn-citrination) to use as a reference for your future work

**Note: In this exercise, you will have to consult example code to finish the exercises. You are *not expected* to be able to complete these exercises without consulting the linked tutorials.**

### Client Components
Before learning to use PyCC, we will first give a bit of orientation to the CitrinationClient.

<img src="./incl/02_pycc_hierarchy.png" width="600px">

**CitrinationClient**
- Top-level API interface
- We must initialize this to use any other clients

**DataClient**
- Provides access to data sets; we can upload and download data using this
- Must be instantiated from CitrinationClient

**DataViewsClient**
- Provides tools to visually inspect a dataset, as well as inspect ML model results
- Must be instantiated from CitrinationClient

**ModelsClient**
- Provides tools to train ML models
- Must be instantiated from CitrinationClient

**SearchClient**
- Provides tools to search for relevant data & datasets
- Must be instantiated from CitrinationClient

## Step 1: Python package imports
The following Python packages are needed to run this notebook, so we'll import them at the very beginning. Like many other languages (Java, C++, etc.), even if you have a package installed, you still need to explicitly import it.

In [None]:
# Standard packages
import os
import numpy as np
import matplotlib.pyplot as plt
from time import sleep  # wait time
from uuid import uuid4  # generate random strings

# Workshop-specific tools
from workshop_utils import getAPIKey

# Citrine packages
from citrination_client import *
from citrination_client.views.data_view_builder import DataViewBuilder
from pypif import pif


## Step 2: Initialize the CitrinationClient
In order to initialize the PyCC, you will need your **API key**, which should already be stored in your environment variables. There are some instructions available on the [workshop setup guide](https://citrineinformatics.github.io/ga-tech-workshop/setup.html). If you do not have your API key set up, we recommend pairing up with someone who has it working properly so that we can move forward through this exercise.

### Q1: Initialize the client
Follow [this link](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/1_data_client_api_tutorial.ipynb) -- using the Jupyter notebook at that link as an example, set up the citrination client below.

In [None]:
###
# TASK: Set up the citrination client
# TODO: Use the appropriate function from citrination_client to
# initialize, assign this to the variable `client`
###

# -- NO NEED TO MODIFY THIS CODE -----
# Helper function will load your API key
api_key = getAPIKey()

# -- WRITE YOUR CODE BELOW -----
# solution-begin
# site you want to access; we'll use the public site
site = "https://citrination.com"
client = CitrinationClient(
    api_key=api_key,
    site=site
)
# solution-end
# -- SHOW THE RESULT -----
client  # reveal attributes of the CitrinationClient


The first argument into the `CitrinationClient` constructor is your API key, which you've stored in your system environment, and the second argument is your deployment URL. Different deployments have different API keys, so pay attention to what you have listed in your system environment and/or `~/.bash_profile`.

**Key takeaway**: Never expose your API key in your code.
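
One way to keep the key out of your code is to read it from an environment variable at run time. The sketch below is a minimal illustration and not the workshop's `getAPIKey()` helper: the function names `read_api_key` and `mask` are hypothetical, and the variable name `CITRINATION_API_KEY` is an assumption, so substitute whatever name you used during setup.

```python
import os


def read_api_key(var_name="CITRINATION_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is an assumed variable name; use whatever name you
    chose when you set up your environment.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            "Environment variable {} is not set.".format(var_name))
    return key


def mask(key, visible=4):
    """Hide all but the last few characters, e.g. for log messages."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Demo with a dummy value so the sketch runs on its own.
# Remove this line when a real key is already in your environment.
os.environ.setdefault("CITRINATION_API_KEY", "dummy-key-1234")
print(mask(read_api_key()))
```

Printing only a masked version lets you confirm that a key was found without revealing it in the notebook output.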

## Step 3: DataClient
The [`DataClient`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/data/client.py) is used to create new datasets and upload data to datasets on Citrination. Once the base client is initialized, the `DataClient` can be easily accessed using the `.data` attribute of the `CitrinationClient`. We will start with the `DataClient` to create a new dataset and upload data.

### Q2: Initialize the data client
Still using the [learn-citrination](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/1_data_client_api_tutorial.ipynb) tutorial, initialize the data client and assign it to the variable `data_client`.

In [None]:
###
# TASK: Initialize the data client
# TODO: Access the DataClient through the .data attribute
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
data_client = client.data
# solution-end
# -- SHOW THE RESULT -----
data_client  # reveal methods


### Create a dataset
Before you can upload data, you have to create an empty dataset to store the files in. The `create_dataset()` method of the `DataClient` does exactly this and returns a [`Dataset`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/data/dataset.py) object. The method has the following inputs:

* `name`: A string for the name of the dataset. It cannot be the same as that of an existing dataset that you own.
* `description`: A string for the description of the dataset.
* `public`: A Boolean indicating whether to make the dataset public (`default=False`).

We will now create a dataset for the band gaps of various materials.

### Q3: Create an empty dataset
Complete the code below to create an empty dataset.

In [None]:
###
# TASK: Create an empty dataset
# TODO: Create a name and description for your dataset.
# uncomment the code below to begin
###

# -- UNCOMMENT AND MODIFY THIS CODE -----
# task-begin
# base_name = ???
# data_desc = ???
# task-end
# solution-begin
base_name = 'test'
data_desc = 'Test dataset for GATW'
# solution-end

# -- NO NEED TO MODIFY THIS -----
# To avoid name clashes, we add a random string
random_string = str(uuid4())[:6]
data_name = base_name + random_string

# Create the dataset on Citrination using the create_dataset() method
dataset = data_client.create_dataset(
    name=data_name,
    description=data_desc,
    public=False
)
# Check dataset
if dataset is not None:
    print("Dataset created")


Once you've created the `Dataset` object, you can obtain the dataset ID from the `.id` attribute of `Dataset`. You will need this ID for subsequent operations.

### Q4: Access your dataset ID
Still using the [learn-citrination](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/1_data_client_api_tutorial.ipynb) tutorial, retrieve the dataset ID and assign it to the variable `dataset_id`.

In [None]:
###
# TASK: Access your dataset ID
# TODO: Obtain the ID of the dataset using the attribute
###

# -- WRITE YOUR CODE BELOW -----
# task-begin
# TODO: Determine how to obtain the id from `dataset`
dataset_id = 0
# task-end
# solution-begin
dataset_id = dataset.id
# solution-end

# -- NO NEED TO MODIFY BELOW -----
# Special string formatting in Python
print('The dataset ID for "{}" is {}.'.format(data_name, dataset_id))
print('Dataset URL: {}/datasets/{}'.format(site, dataset_id))


If you click on the above URL, it will take you to the dataset on Citrination, which at this point should be empty. Jupyter will automatically render URLs—nifty!
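
The `print` calls in the cell above rely on `str.format()`, which fills each pair of braces with one of its arguments. The short sketch below shows the three forms used in this notebook; the name, ID, and numbers are placeholder values for illustration only.

```python
# Placeholder values for illustration only.
name = "test3f9a1c"
dataset_id = 12345
site = "https://citrination.com"

# Empty braces are filled left to right.
print('The dataset ID for "{}" is {}.'.format(name, dataset_id))

# Numbered braces pick arguments by position and may be reused.
url = "{0}/datasets/{1}".format(site, dataset_id)
print(url)

# A format spec after the colon controls the presentation;
# ".4f" means fixed-point with four digits after the decimal point.
line = "{0} = {1:.4f} +/- {2:.4f}".format("x", 1.5, 0.25)
print(line)
```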

### Upload data to a dataset
The `upload()` method of the `DataClient` allows you to upload a file or a directory to a dataset. The method has the following inputs:

* `dataset_id`: The ID of the dataset to which you will be uploading data. (We just found this!)
* `source_path`: The path to the file or directory on your machine that you want to upload.

*Note*: Any file format can be uploaded, but the current `CitrinationClient` (v5.2.0) only supports the ingestion (i.e. processing) of PIF files.
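
Because only some files are ingested after upload, it can help to check a file locally before sending it. The sketch below is one possible pre-upload check, assuming the file is a JSON document; `count_json_records` is a hypothetical helper, not part of PyCC, and the demo record is a placeholder rather than an excerpt from the workshop data.

```python
import json
import os
import tempfile


def count_json_records(path):
    """Parse a JSON file and return the number of top-level records.

    A file holding a single JSON object counts as one record.
    Raises an error if the file is missing or is not valid JSON.
    """
    with open(path) as f:
        content = json.load(f)
    return len(content) if isinstance(content, list) else 1


# Demo on a throwaway file; the record below is a placeholder,
# not an excerpt from pycc_intro_pif.json.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.json")
with open(demo_path, "w") as f:
    json.dump([{"category": "system.chemical",
                "chemicalFormula": "CdTe"}], f)

print("{} record(s) in {}".format(count_json_records(demo_path), demo_path))
```

This check only confirms that the file parses as JSON; it does not validate the records against the PIF schema.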

### Q5: Upload data to your dataset
Complete the following code. Make sure you have downloaded the example `pycc_intro_pif.json` file and placed it in a folder called `data/`; this is the data we will upload.

In [None]:
###
# TASK: Upload data to your dataset
# TODO: Use the dataset_id you obtained above to upload the data
###

# Specify file path and call the upload() method
# os.path.join() is needed for Windows/Mac compatibility
source_path = os.path.join('data', 'pycc_intro_pif.json')

# -- UNCOMMENT AND COMPLETE THIS CODE -----
# task-begin
# upload_result = data_client.upload(
#     dataset_id = ???,
#     source_path = ???
# )
# task-end
# solution-begin
upload_result = data_client.upload(
    dataset_id=dataset_id,
    source_path=source_path
)
# solution-end

# -- NO NEED TO MODIFY BELOW -----
# Boolean; True if none fail
print('Successful upload? {}'.format(upload_result.successful()))

# Check ingest status with a loop; not required, but very useful!
while True:
    ingest_status = data_client.get_ingest_status(dataset_id=dataset_id)

    if ingest_status == 'Finished':
        print('Ingestion complete!')
        print('Dataset URL: {}/datasets/{}'.format(site, dataset_id))
        break
    else:
        print('Waiting for data ingest...')
        sleep(10)


**Verify**: If you go back to the dataset in the UI and refresh the page, you should find it populated with PIF records!
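
The status loop in the cell above keeps polling until ingestion finishes, so it never stops if something goes wrong on the server. A possible variation, sketched below, adds a timeout. The helper `wait_for` and its parameters are hypothetical and not part of PyCC, and the demo uses a stand-in status sequence instead of a real API call.

```python
import time


def wait_for(check, interval=10.0, timeout=600.0, sleep=time.sleep):
    """Call `check()` until it returns True or `timeout` seconds pass.

    Returns True if the check succeeded, False if we gave up.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        # Stop if the next wait would run past the deadline.
        if time.monotonic() + interval > deadline:
            return False
        sleep(interval)


# Stand-in for a real status call: reports "Finished" on the third poll.
statuses = iter(["Processing", "Processing", "Finished"])
done = wait_for(lambda: next(statuses) == "Finished",
                interval=0.01, timeout=5.0)
print("Ingestion complete!" if done else "Gave up waiting.")
```

In this notebook, the `check` argument could be `lambda: data_client.get_ingest_status(dataset_id=dataset_id) == 'Finished'`, which is the same test the original loop performs.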

## Step 4: DataViewsClient
Data views provide the configuration necessary in order to perform data analysis and machine learning (ML). We will demonstrate this functionality using our band gaps dataset, where we will create a data view with an ML model that takes a chemical formula as input and predicts the band gap as an output:

$$ \text{Chemical formula (inorganic descriptor)} \longrightarrow \boxed{\text{ML Model}} \longrightarrow \text{Band gap (real descriptor)} $$

The [`DataViewsClient`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/views/client.py) can be accessed as an attribute of the main `CitrinationClient` with the `.data_views` attribute. **Completing this step will train a machine learning model, which you can then use to do useful work!**

### Q6: Access the DataViewsClient
Use the [data views tutorial](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/2_data_views_client_api_tutorial.ipynb) to learn how to complete the following.

In [None]:
###
# TASK: Access the data views client
# TODO: Use the correct attribute of `client` to access the views client
###

# task-begin
# -- MODIFY THIS CODE -----
views_client = None
# task-end
# solution-begin
views_client = client.data_views
# solution-end

# -- NO NEED TO MODIFY BELOW -----
views_client  # reveal methods


### DataViewBuilder

*Note: From here, the API syntax gets a bit more complicated. Please take your time, and do ask questions if you get stuck.*

The [`DataViewBuilder`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/views/data_view_builder.py) class handles the configuration for data views and returns a configuration object that is an input for the data views client. The configuration specifies:
* The datasets you want to include.
* The ML model you want to use. 
* Which properties you want to use as descriptors. 

Some of the important parameters to note are:

* `dataset_ids`: An array of strings, one for each dataset ID that should be included in the view.
* `descriptors`: A descriptor instance, which is one of `{RealDescriptor, InorganicDescriptor, OrganicDescriptor, CategoricalDescriptor, AlloyCompositionDescriptor}`.
    * *Note*: Chemical formulas for the API take the key `"formula"`.
    * *Note*: The output is formatted as `"Property [property name]"`
    * *Note*: Strings are **case-sensitive!**
* `roles`: A role for each descriptor, as a string, which is one of `{'input', 'output', 'latentVariable', 'ignored'}`.
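
Because keys and roles are plain, case-sensitive strings, a typo is easy to make. The sketch below shows one way to build the output key and check a role before handing them to the builder. Both helpers (`property_key`, `check_role`) are hypothetical conveniences written for this example, not part of PyCC, which may perform its own validation.

```python
VALID_ROLES = {"input", "output", "latentVariable", "ignored"}


def property_key(name):
    """Build the descriptor key for a property, e.g. 'Property Band gap'."""
    return "Property {}".format(name)


def check_role(role):
    """Return `role` unchanged if it is one of the accepted strings."""
    if role not in VALID_ROLES:
        raise ValueError(
            "Unknown role {!r}; expected one of {}".format(
                role, sorted(VALID_ROLES)))
    return role


print(property_key("Band gap"))
print(check_role("output"))

# Strings are case-sensitive, so "Output" is rejected.
try:
    check_role("Output")
except ValueError as err:
    print(err)
```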

### Q7: Configure a dataview
Use the [data views tutorial](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/2_data_views_client_api_tutorial.ipynb) and the instructions in the code comments to complete the cell below.

In [None]:
###
# TASK: Configure a dataview
# TODO: Pass the following variables to RealDescriptor()
# and add_descriptor() to complete the code.
###

# -- NO NEED TO MODIFY THIS -----
# Create a DataViewBuilder object and set dataset_ids
view_builder = DataViewBuilder()
view_builder.dataset_ids(dataset_ids=[dataset_id])

# Define an inorganic chemical formula input descriptor
# InorganicDescriptor(key=, threshold=)
desc_formula = InorganicDescriptor(
    key="formula",
    threshold=1.0
)
# Add the descriptor to the DataViewBuilder
# and specify that it's an input
view_builder.add_descriptor(
    descriptor=desc_formula,
    role="input"
)

# Define a Property Band gap output with units of eV
# RealDescriptor(key=, lower_bound=, upper_bound=, units=)
property_key = "Property Band gap"

# task-begin
# -- UNCOMMENT AND FINISH THIS CODE -----
# desc_bg = RealDescriptor(
#     key = ???,
#     lower_bound = ???,
#     upper_bound = ???,
#     units = ???
# )
# Add the descriptor to the DataViewBuilder
# and specify that it's an output
# view_builder.add_descriptor(
#     descriptor = ???,
#     role = ???
# )
# task-end
# solution-begin
desc_bg = RealDescriptor(
    key=property_key,
    lower_bound=0,
    upper_bound=100,
    units="eV"
)

view_builder.add_descriptor(
    descriptor=desc_bg,
    role="output"
)
# solution-end

# -- NO NEED TO MODIFY BELOW -----

# Build the configuration once all the pieces are in place
view_config = view_builder.build()
view_config  # Inspect the configuration


### Create a view
After obtaining your customized configuration, you can create a data view from the configuration you built. The `create()` method for the `DataViewsClient` takes as input:
* `configuration`: A view configuration, like the template you created above.
* `name`: A name for the data view (must be unique among your data views).
* `description`: A description for the data view.

and returns the ID for the data view, which you will need for subsequent analyses.

### Q8: Create a dataview
Use the [data views tutorial](https://github.com/CitrineInformatics/learn-citrination/blob/master/citrination_api_examples/clients_sequence/2_data_views_client_api_tutorial.ipynb) and the instructions in the code comments to complete the cell below.


In [None]:
###
# TASK: Create a dataview
# TODO: Uncomment and finish the code below
###

# -- UNCOMMENT AND COMPLETE THE CODE BELOW
# task-begin
# Specify the view name and description
# To avoid name clashes, include: + random_string
# random_string = str(uuid4())[:6]
# view_name = ???
# view_desc = ???

# Create the data view using the create() method.
# The configuration is the final object from the previous cell.
# view_id = views_client.create(
#     configuration = ???,
#     name = ???,
#     description = ???
# )
# task-end
# solution-begin
random_string = str(uuid4())[:6]
view_name = "test_view" + random_string
view_desc = "Test dataview for GATW"

# Create the data view using the create() method.
# The configuration is the final object from the previous cell.
view_id = views_client.create(
    configuration=view_config,
    name=view_name,
    description=view_desc
)
# solution-end

# -- NO NEED TO MODIFY BELOW -----

# Check status of view services in a loop
while True:
    view_status = views_client.get_data_view_service_status(
        data_view_id=view_id)

    # Design and Predict are most important endpoints to check
    if (view_status.experimental_design.ready and
            view_status.predict.event.normalized_progress == 1.0):
        print("Data view ready!")
        print("Data view URL: {}/data_views/{}".format(site, view_id))
        break
    else:
        print("Waiting for data view services...")
        sleep(10)


Clicking the above URL will take you to the data view you just created on your deployment of Citrination. From that page, you can inspect the model reports, and use **Predict** and **Design** functionality, just as if you had trained a model through the website.

## Step 5: ModelsClient
Once a data view has been created and our ML models have been trained, we can use the [`ModelsClient`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/models/client.py) to access the **Predict** and **Design** endpoints. We will demonstrate the Predict endpoint here and save the Design endpoint for a later tutorial. 

The `ModelsClient` can be accessed through the `.models` attribute of the main `CitrinationClient`.

### Q9: Access the models client
Complete the code below using the `CitrinationClient` object you initialized at the beginning of this exercise.

In [None]:
###
# TASK: Access the models client
# TODO: Access the ModelsClient through the .models attribute
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
models_client = client.models
# solution-end

# -- NO NEED TO MODIFY BELOW -----
models_client  # reveal methods


With access to the models client, we have a wide variety of functions available to us. The following sections demonstrate these functions.

### Predict
Predictions through the `ModelsClient` can be made using the `predict()` method, which takes as inputs:
* `data_view_id`: The ID for the data view containing the ML model to use for prediction.
* `candidates`: A list of candidates (each as a `dict`) to make predictions on.

The method returns a `list` of [`PredictionResult`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/models/prediction_result.py) objects, one for each candidate; each result holds the predicted values for the output `Property` keys.

In [None]:
view_status = models_client.get_data_view_service_status(view_id)
view_status.predict.ready


If the status above returns `True`, then the model is ready to make predictions.

In [None]:
# Input parameters for prediction on a candidate material.
candidate_formula = 'CdTe'  # choose your favorite compound
candidates = [{'formula': candidate_formula}]

# Make a prediction using the predict() method
predict_results = models_client.predict(
    data_view_id=view_id,
    candidates=candidates
)
predict_results  # inspect the list of results


For each `PredictionResult`, the `get_value()` method takes in a `key` for the `Property` name and returns a [`PredictedValue`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/models/predicted_value.py) object.

In [None]:
# Get the predicted value and uncertainty.
# We defined property_key earlier
predict_value = predict_results[0].get_value(key=property_key)
print('{0} has a predicted {1} of {2:.4f} +/- {3:.4f}.'.format(
    candidate_formula,
    predict_value.key,
    predict_value.value,
    predict_value.loss))


### Data inspection: t-SNE
t-SNE is short for t-Distributed Stochastic Neighbor Embedding. This technique was developed about [a decade ago](https://lvdmaaten.github.io/tsne/) ([simpler explanation](https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/)) and it's a powerful projection tool because nearby points in high dimensional space remain close in 2D while distant points remain far apart. Therefore, t-SNE plots are helpful for identifying clusters and structures in your data. We employ t-SNE as a *dimensionality reduction* technique to project the data onto 2 dimensions for ease of visualization.

A [`Tsne`](http://citrineinformatics.github.io/python-citrination-client/modules/models/tsne.html) object contains many [`Projection`](http://citrineinformatics.github.io/python-citrination-client/modules/models/tsne.html) objects (one for each output `Property`) with the following properties:
* `xs`: A list of $x$ values of the projection.
* `ys`: A list of $y$ values of the projection.
* `responses`: A list of $z$ (`Property`) values of the projection.
* `tags`: A list of tags for the projected points.
* `uids`: A list of record UIDs for the projected points.

You can create the t-SNE plot from the coordinates and values. Further analysis with t-SNE is performed in a later part of the workshop ([05_vis_exercise](./05_vis_exercise.ipynb)).
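
The lists in a projection are aligned by index, so one index selects the coordinates, response, tag, and UID of the same record. The sketch below illustrates this with a stand-in `namedtuple` that mirrors the property names listed above; it is not the PyCC `Projection` class, and all values are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same field names as the properties listed above.
Projection = namedtuple(
    "Projection", ["xs", "ys", "responses", "tags", "uids"])

# Made-up coordinates and responses, for illustration only.
projection = Projection(
    xs=[0.0, 1.5, -2.0],
    ys=[1.0, -0.5, 2.5],
    responses=[0.3, 2.1, 1.2],
    tags=["A", "B", "C"],
    uids=["uid-a", "uid-b", "uid-c"],
)

# Index of the largest response; the lists are aligned, so the same
# index selects the matching coordinates, tag, and UID.
max_index = max(range(len(projection.responses)),
                key=projection.responses.__getitem__)

print(projection.tags[max_index])
print((projection.xs[max_index], projection.ys[max_index]))
```

The notebook cell below does the same lookup with `np.argmax`, which returns the index of the largest value in an array.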

In [None]:
# Get the Tsne object
tsne = models_client.tsne(view_id)

# Get first output Property in dict_keys object
projection_key = list(tsne.projections())[0]

# Get the t-SNE projection from the key
projection = tsne.get_projection(projection_key)
max_index, max_value = (np.argmax(projection.responses),
                        max(projection.responses))

print('Highest band gap material: \t{0}.'.format(projection.tags[max_index]))
print('It has projected coordinates: \t({0:.3f}, {1:.3f}).'.format(
    projection.xs[max_index],
    projection.ys[max_index]))


Once the t-SNE data has been generated, we can use it for plotting.

In [None]:
# Plot results
fig, ax = plt.subplots()
plt.plot(
    projection.xs[max_index], projection.ys[max_index],
    marker="o",
    markersize=7,
    color="red",
    zorder=0
)
plt.scatter(projection.xs, projection.ys, c=projection.responses)
ax.set_aspect(aspect='equal', adjustable='datalim')
plt.colorbar(label='Band gap (eV)')
plt.show()


We will discuss `matplotlib` and visualizing in Python later in the workshop.

## Step 6: SearchClient—OPTIONAL
The [`SearchClient`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/search/client.py) can be used to search for and filter datasets on Citrination based on a **query language** that you construct. This query language is quite sophisticated, so we will only give a brief introduction below.

First, we can access the `SearchClient` through the `.search` attribute of the main `CitrinationClient`.

In [None]:
###
# TASK: Access the search client
# TODO: Access the SearchClient through the .search attribute
###

# -- UNCOMMENT AND FINISH THE CODE BELOW
# search_client = ???

# -- NO NEED TO MODIFY BELOW -----
search_client  # reveal methods


### Query language
Each of the methods above will execute a search by submitting a [query](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/search/core/query/base_returning_query.py) against Citrination. In this demo, we will search for PIF records, and so we will construct a [`PifSystemReturningQuery`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/search/pif/query/pif_system_returning_query.py) as input to the `pif_search()` method. The structure of the query will resemble the following:

![Query structure](https://raw.githubusercontent.com/CitrineInformatics/learn-citrination/master/citrination_api_examples/fig/query_structure.png "Query structure")

As you can see, there are many different query objects (black and orange) that one can construct to narrow the search. Each one has a different set of parameters to query against. You'll notice that the most nested object is a [`Filter`](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/search/core/query/filter.py) (blue) that performs the matching.

First, let's just see if we can get all the PIFs in the dataset. We'll include everything except the `system` information from above.

In [None]:
### FINISH THE CODE BELOW; replace ??? ###

# Search within the dataset you created above by filtering the ID
# Size can be anything less than 10000.
system_query = PifSystemReturningQuery(
    size=999,
    query=DataQuery(
        dataset=DatasetQuery(
            id=Filter(
                equal=???))))

#---------------------------#

query_result = search_client.pif_search(system_query)
print("Found {} total PIFs in dataset {}.".format(
    query_result.total_num_hits, 
    dataset_id))

# Depending on what you put for size, this number may differ.
pif_hits = query_result.hits
print("{} PIFs were returned.".format(len(pif_hits)))

Now let's restrict our search to only binary oxides within this dataset. We've constructed the query for you; you just have to run the cell below.

In [None]:
# Search within the dataset you created above for binary oxides
system_query = PifSystemReturningQuery(
    size=999,
    query=DataQuery(
        dataset=DatasetQuery(
            id=Filter(
                equal=dataset_id)),
        system=PifSystemQuery(
            chemical_formula=ChemicalFieldQuery(
                filter=ChemicalFilter(
                    equal="?xOy")))))

query_result = search_client.pif_search(system_query)
print("Found {} total PIFs in dataset {}.".format(
    query_result.total_num_hits, 
    dataset_id))
print("The first PIF is:\n{}".format(pif.dumps(query_result.hits[0], indent=4)))

## Conclusion
This concludes the Intro to PyCC tutorial. You should now feel comfortable:
* Initializing the PyCC and accessing its sub-clients.
* Creating new datasets and uploading data through the API.
* Training ML models and submitting prediction queries.

More API tutorial notebooks can be found on our [`learn-citrination`](https://github.com/CitrineInformatics/learn-citrination/tree/master/citrination_api_examples) GitHub repo.