In [2]:
from rcsbapi.data.query import Query
from rcsbapi.data.query import SCHEMA
from pprint import pprint

## RCSB PDB Data API: Quick-start

This quick-start notebook walks through the basics of making queries with this package using a simple example. For more in-depth documentation, see the [README](https://github.com/rcsb/py-rcsb-api/blob/master/README.md).

\
Install the package:

```pip install rcsbdataapi```

\
In this notebook, we will be working with the GraphQL query below and with closely related queries. It requests the non-polymer components bound to a structure (ions, cofactors, etc.).

```
{
  entry(entry_id: "4HHB") {
    rcsb_entry_info {
      nonpolymer_bound_components
    }
  }
}
```
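For comparison, the raw GraphQL query above could be sent without this package as a plain HTTP POST. The sketch below only builds the request and does not send it. The endpoint URL `https://data.rcsb.org/graphql` and the helper name `build_request` are assumptions that do not come from this notebook.

```python
import json
import urllib.request

# Assumed public GraphQL endpoint of the RCSB PDB Data API (not stated in this notebook).
DATA_API_URL = "https://data.rcsb.org/graphql"

GRAPHQL_QUERY = """
{
  entry(entry_id: "4HHB") {
    rcsb_entry_info {
      nonpolymer_bound_components
    }
  }
}
"""


def build_request(query: str, url: str = DATA_API_URL) -> urllib.request.Request:
    """Wrap a GraphQL query string in a JSON POST request (nothing is sent yet)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(GRAPHQL_QUERY)

# To actually send it (requires network access), something like:
# with urllib.request.urlopen(request) as reply:
#     print(json.load(reply))
```

The Query object described in the next section takes care of both steps, generating the query text and making the request.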

## Making Queries

You would make the equivalent query in this package by creating a Query object as shown below. 

The Query object automatically generates a query and makes a request to our Data API. The JSON response is stored in the `response` attribute of the query object.

In [3]:
# "entry_id" is the key expected in input_ids for the "entry" input_type
query = Query(input_ids={"entry_id": "4HHB"}, input_type="entry", return_data_list=["nonpolymer_bound_components"])
pprint(query.response)

{'data': {'entry': {'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}


Making a query requires three arguments: `input_ids`, `input_type`, and `return_data_list`.

### input_ids
`input_ids` can be given as a dictionary or as a list of PDB-format IDs. A dictionary must use the specific keys that belong to the chosen `input_type` (entry, polymer_entity, etc.). To get the keys associated with an `input_type`, use the "__" method. #TODO

In [71]:
# A polymer_entity_instance needs two keys to identify it: entry_id and asym_id
query = Query(input_ids={"entry_id": "4HHB", "asym_id": "A"}, input_type="polymer_entity_instance", return_data_list=["nonpolymer_bound_components"])
pprint(query.response)
# Note: this returns the same information as the entry query above, but it has to
# traverse polymer_entity back to entry to reach it. Starting from
# input_type="entry" is the more direct route.

{'data': {'polymer_entity_instance': {'polymer_entity': {'entry': {'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}}}
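Because the nesting of the response depends on the `input_type`, pulling a field out by hard-coded keys differs from query to query. A minimal sketch of one way around this is a small recursive search over the response dictionary. The helper `find_key` is hypothetical and not part of the package, and the two dictionaries are copied from the outputs shown in this notebook.

```python
from typing import Any, Iterator


def find_key(data: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth of nested dicts and lists."""
    if isinstance(data, dict):
        for name, value in data.items():
            if name == key:
                yield value
            else:
                yield from find_key(value, key)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, key)


# Responses copied from the two queries shown in this notebook.
entry_response = {
    "data": {"entry": {"rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]}}}
}
instance_response = {
    "data": {
        "polymer_entity_instance": {
            "polymer_entity": {
                "entry": {"rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]}}
            }
        }
    }
}

entry_components = next(find_key(entry_response, "nonpolymer_bound_components"))
instance_components = next(find_key(instance_response, "nonpolymer_bound_components"))
print(entry_components == instance_components)  # True
```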


When `input_ids` is a list, each ID must be in PDB ID format:

| Type | Format | Example |
|---|---|---|
| entry | [entry_id] | 4HHB |
| polymer, branched, or non-polymer entities | [entry_id]_[entity_id] | 4HHB_1 |
| polymer, branched, or non-polymer entity instances | [entry_id].[asym_id] | 4HHB.A |
| biological assemblies | [entry_id]-[assembly_id] | 4HHB-1 |

The examples below pass `input_ids` as a list and are equivalent to the dictionary examples above. Note that even with a single input ID, the argument must be a list, not a string.

In [None]:
query = Query(input_ids=["4HHB"],input_type="entry", return_data_list=["nonpolymer_bound_components"])
pprint(query.response)

In [None]:
# uses PDB ID format
query = Query(input_ids=["4HHB.A"],input_type="polymer_entity_instance", return_data_list=["nonpolymer_bound_components"])  # error
pprint(query.response)
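To make the relationship between the two `input_ids` forms concrete, the sketch below splits an ID in one of the tabulated formats into dictionary keys. The helper `split_pdb_id` is hypothetical and is not how the package parses IDs. The key names `entity_id` and `assembly_id` are assumptions taken from the table's format column, and the patterns only cover purely alphanumeric IDs.

```python
import re
from typing import Dict, Tuple

# One pattern per ID format in the table; named groups double as dictionary keys.
ID_PATTERNS = [
    ("entity", re.compile(r"(?P<entry_id>[A-Za-z0-9]+)_(?P<entity_id>[A-Za-z0-9]+)")),
    ("instance", re.compile(r"(?P<entry_id>[A-Za-z0-9]+)\.(?P<asym_id>[A-Za-z0-9]+)")),
    ("assembly", re.compile(r"(?P<entry_id>[A-Za-z0-9]+)-(?P<assembly_id>[A-Za-z0-9]+)")),
    ("entry", re.compile(r"(?P<entry_id>[A-Za-z0-9]+)")),
]


def split_pdb_id(pdb_id: str) -> Tuple[str, Dict[str, str]]:
    """Return the kind of ID and its components as a dictionary."""
    for kind, pattern in ID_PATTERNS:
        match = pattern.fullmatch(pdb_id)
        if match:
            return kind, match.groupdict()
    raise ValueError(f"Unrecognized ID format: {pdb_id!r}")


print(split_pdb_id("4HHB.A"))  # ('instance', {'entry_id': '4HHB', 'asym_id': 'A'})
```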

### input_type
The `input_type` is the designated point where your query begins. Some examples are entry, polymer_entity, and polymer_entity_instance. You can also begin your search with uniprot or pubmed using their IDs. For a full list of input types, see the [README](https://github.com/rcsb/py-rcsb-api/blob/dev-it-schema-parse/README.md#input_types).

If you're unsure of which input_type would be best and are using a PDB ID (4HHB, 4HHB_1, 4HHB.A, 4HHB-1), you can generally begin at entry. This may produce a more verbose query that can later be refined.

### return_data_list
`return_data_list` is the list of fields (data) you are requesting in your query. <!--You can explore possible fields by using the search method on a string--> 

Some field names are not specific enough on their own and must be identified using dot notation ([type].[field_name]). You can look up the dot notation of a field with the `get_unique_fields()` method of `SCHEMA`.

In [None]:
# "polymer_composition" on its own isn't specific enough, so this raises a ValueError
query = Query(input_ids=["4HHB.A"], input_type="polymer_entity_instance", return_data_list=["nonpolymer_bound_components", "polymer_composition"])

In [None]:
# Look up the dot-notation names available for "polymer_composition"
SCHEMA.get_unique_fields("polymer_composition")

In [None]:
# Pick the intended field from that list and pass it in dot notation
query = Query(input_ids={"entry_id": "4HHB"}, input_type="entry", return_data_list=["nonpolymer_bound_components", "RcsbEntryInfo.polymer_composition"])
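The reason dot notation is needed can be illustrated with a toy lookup: a bare field name is only usable when exactly one type contains it. This is a sketch of the idea, not the package's implementation. `TOY_SCHEMA`, `qualified_names`, and `resolve` are hypothetical, and the toy type-to-field mapping is illustrative rather than a copy of the real Data API schema.

```python
from typing import Dict, List

# Toy stand-in for a schema: type name -> field names (illustrative only).
TOY_SCHEMA: Dict[str, List[str]] = {
    "RcsbEntryInfo": ["nonpolymer_bound_components", "polymer_composition"],
    "RcsbAssemblyInfo": ["polymer_composition"],
    "Citation": ["title"],
}


def qualified_names(field: str, schema: Dict[str, List[str]]) -> List[str]:
    """List every [type].[field_name] spelling that matches a bare field name."""
    return [f"{type_name}.{field}" for type_name, fields in schema.items() if field in fields]


def resolve(field: str, schema: Dict[str, List[str]]) -> str:
    """Return the dot-notation name of a field, or raise if it is unknown or ambiguous."""
    if "." in field:
        type_name, name = field.split(".", 1)
        if name not in schema.get(type_name, []):
            raise ValueError(f"Unknown field: {field}")
        return field
    matches = qualified_names(field, schema)
    if len(matches) != 1:
        raise ValueError(f"{field!r} is not specific enough; candidates: {matches}")
    return matches[0]


print(resolve("nonpolymer_bound_components", TOY_SCHEMA))
print(qualified_names("polymer_composition", TOY_SCHEMA))
```

In this toy schema, `nonpolymer_bound_components` resolves on its own, while `polymer_composition` matches two types and has to be written with its type prefix.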

### More Complex Queries

You can make more complex queries by searching multiple IDs at once or by adding more fields to `return_data_list`.

In [68]:
# search multiple ids. Note the input_type changed from "entry" to "entries"
query = Query(input_ids={"entry_ids": ["4HHB", "12CA", "3PQR"]},input_type="entries", return_data_list=["nonpolymer_bound_components"])
pprint(query.response)

{'data': {'entries': [{'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}},
                      {'rcsb_entry_info': {'nonpolymer_bound_components': ['ZN']}},
                      {'rcsb_entry_info': {'nonpolymer_bound_components': ['NAG',
                                                                           'PLM',
                                                                           'RET']}}]}}
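The multi-ID response above is a list with one item per entry, and the items do not carry their IDs. A minimal sketch of pairing each result with its ID is shown here, assuming the results come back in the same order as the input IDs, as they do in the output above. The response dictionary is copied from that output.

```python
entry_ids = ["4HHB", "12CA", "3PQR"]

# Response copied from the multi-ID query output shown in this notebook.
response = {
    "data": {
        "entries": [
            {"rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]}},
            {"rcsb_entry_info": {"nonpolymer_bound_components": ["ZN"]}},
            {"rcsb_entry_info": {"nonpolymer_bound_components": ["NAG", "PLM", "RET"]}},
        ]
    }
}

# Assumes result order matches input order.
components_by_entry = {
    entry_id: entry["rcsb_entry_info"]["nonpolymer_bound_components"]
    for entry_id, entry in zip(entry_ids, response["data"]["entries"])
}

print(components_by_entry["12CA"])  # ['ZN']
```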


In [None]:
# search multiple fields
query = Query(input_ids={"entry_id": "4HHB"}, input_type="entry", return_data_list=["Citation.title", "nonpolymer_bound_components", "RcsbEntryInfo.polymer_composition"])
pprint(query.response)