<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/data_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RCSB PDB Data API: Quickstart

This Quickstart notebook walks through the basics of creating and executing queries using the `rcsbapi.data` module of the `rcsb-api` package. For more in-depth documentation, see the [readthedocs page](https://rcsbapi.readthedocs.io/en/latest/data_api/quickstart.html).

Before beginning, you must install the package:

```pip install rcsb-api```

In [None]:
%pip install rcsb-api

In [25]:
from rcsbapi.data import DataQuery as Query
import json  # for easy-to-read output

## Creating and executing queries

To create a `Query` object, you need to provide three arguments:
- `input_type`: the starting point of your query. Some examples are `entries`, `polymer_entities`, and `polymer_entity_instances`. For a full list of input types, see the [readthedocs](https://rcsbapi.readthedocs.io/en/latest/data_api/query_construction.html#input-type).
- `input_ids`: a list or dictionary of PDB-formatted IDs.
- `return_data_list`: a list of data items to return. Each item must be a unique field name or a unique path segment (field names joined by dots). This is explained further [below](#Providing-specific-and-unique-field-names/paths).

(More details on input arguments can be found in [readthedocs: Query Construction](https://rcsbapi.readthedocs.io/en/latest/data_api/query_construction.html).)

For example, to create a `Query` object requesting all non-polymer components of a structure (ions, cofactors, etc):

In [None]:
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["nonpolymer_bound_components"]  # must be unique field or unique path segment
)

# Note: When the package autocompletes a path, it prints a warning message.
# To suppress this warning, either use the fully qualified path ("rcsb_entry_info.nonpolymer_bound_components")
# or set `suppress_autocomplete_warning` to True.


After creating a `Query` object, you can run it with `.exec()` or view the GraphQL query with `.get_editor_link()`:

In [None]:
# Execute the query and print the results
return_data = query.exec()
print(json.dumps(return_data, indent=2))  # prints return_data with easy-to-read formatting

## Expected Output:
# {
#  "data": {
#    "entries": [
#      {
#        "rcsb_id": "4HHB",
#        "rcsb_entry_info": {
#          "nonpolymer_bound_components": [
#            "HEM"
#          ]
#        }
#      }
#    ]
#  }
# }
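
The returned value is a plain nested dictionary, so individual values can be pulled out with ordinary indexing. The sketch below is one way to do that; it hard-codes the expected output shown above instead of calling the API, and the helper name `components_by_entry` is made up for illustration. It also assumes that a missing or null `rcsb_entry_info` should be treated as "no components".

```python
# Sample response copied from the expected output above for entry 4HHB.
return_data = {
    "data": {
        "entries": [
            {
                "rcsb_id": "4HHB",
                "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]},
            }
        ]
    }
}


def components_by_entry(response):
    """Map each entry ID to its list of non-polymer component IDs.

    Hypothetical helper, not part of rcsb-api.
    """
    result = {}
    for entry in response["data"]["entries"]:
        # Assumption: a missing or null value means "no components".
        info = entry.get("rcsb_entry_info") or {}
        result[entry["rcsb_id"]] = info.get("nonpolymer_bound_components") or []
    return result


components = components_by_entry(return_data)
print(components)  # {'4HHB': ['HEM']}
```

The same loop works unchanged when the response contains several entries.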

In [None]:
# Print the GraphQL editor URL
query.get_editor_link()

### Querying multiple IDs
You can search multiple entries by starting from `input_type` "entries" and passing in a list of `input_ids`.

In [None]:
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=["nonpolymer_bound_components"]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

### Querying multiple data items
You can also request multiple data items by adding to the `return_data_list`.

In [None]:
# Query multiple fields in return_data_list
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=[
        "nonpolymer_bound_components",
        "citation.title",
        "rcsb_entry_info.polymer_composition"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))
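
Each dotted path in `return_data_list` mirrors the nesting of the returned dictionary, so the same path can be reused to read a value back out. The following is a minimal sketch of such a lookup. The helper `get_by_path` is hypothetical (not part of `rcsb-api`), the 4HHB entry is copied from the expected output earlier in this notebook, and the `citation` example uses placeholder titles and assumes that `citation` comes back as a list of objects.

```python
def get_by_path(node, path):
    """Follow a dot-separated path through nested dicts, mapping over lists.

    Hypothetical helper for illustration only.
    """
    keys = path.split(".") if isinstance(path, str) else path
    if not keys:
        return node
    if isinstance(node, list):
        # Apply the remaining path to every element of the list.
        return [get_by_path(item, keys) for item in node]
    if isinstance(node, dict):
        return get_by_path(node.get(keys[0]), keys[1:])
    return None  # path does not exist in this branch


# Entry copied from the expected output shown earlier.
entry = {
    "rcsb_id": "4HHB",
    "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]},
}
print(get_by_path(entry, "rcsb_entry_info.nonpolymer_bound_components"))  # ['HEM']

# Placeholder data (not real API output) to show how lists are handled.
placeholder = {"citation": [{"title": "<placeholder title 1>"}, {"title": "<placeholder title 2>"}]}
titles = get_by_path(placeholder, "citation.title")
print(titles)  # ['<placeholder title 1>', '<placeholder title 2>']
```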

### Autocompletion of nested fields
If there are fields nested under a requested data item in `return_data_list`, the package will add all sub-fields to the query. This allows you to make more general requests to get all information under that field (e.g., `"exptl"`). If you would like a more precise query, you can request specific fields (e.g., `"exptl.method"`).

In [None]:
# Requesting "exptl" gets all fields underneath that field
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["exptl"]  # requests exptl.crystals_number, exptl.method, etc.
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

In [None]:
# To view the generated GraphQL query:
query.get_editor_link()
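
To picture what autocompletion does, the toy sketch below expands a requested field into all leaf paths underneath it. It is only an illustration of the idea, not the package's implementation: `TOY_SCHEMA` and `expand` are made-up names, and the schema fragment lists only the two `exptl` sub-fields named above (the real node has more).

```python
# Toy schema fragment: None marks a leaf field. Only the two sub-fields
# mentioned above are included; the real "exptl" node contains more.
TOY_SCHEMA = {"exptl": {"crystals_number": None, "method": None}}


def expand(schema, path):
    """Return every leaf path at or below `path` in the toy schema."""
    node = schema
    for key in path.split("."):
        node = node[key]
    if node is None:
        return [path]  # already a leaf, nothing to expand
    leaves = []
    for child in node:
        leaves.extend(expand(schema, f"{path}.{child}"))
    return leaves


print(expand(TOY_SCHEMA, "exptl"))         # ['exptl.crystals_number', 'exptl.method']
print(expand(TOY_SCHEMA, "exptl.method"))  # ['exptl.method']
```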

### Querying different `input_types`
You can also start queries from other `input_types` (e.g., `polymer_entities`, `polymer_entity_instances`, `uniprot`). For more examples, see [readthedocs: Additional Examples](https://rcsbapi.readthedocs.io/en/latest/data_api/additional_examples.html).

In [None]:
# Search from input_type "polymer_entities"
query = Query(
    input_type="polymer_entities",
    input_ids=["2CPK_1", "3WHM_1", "2D5Z_1"],
    return_data_list=[
        "polymer_entities.rcsb_id",
        "rcsb_entity_source_organism.ncbi_taxonomy_id",
        "rcsb_entity_source_organism.ncbi_scientific_name",
        "cluster_id",
        "identity"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

In [None]:
# Search from input_type "polymer_entity_instances"
query = Query(
    input_type="polymer_entity_instances",
    input_ids=["4HHB.A", "12CA.A", "3PQR.A"],
    return_data_list=[
        "polymer_entity_instances.rcsb_id",
        "rcsb_polymer_instance_annotation.annotation_id",
        "rcsb_polymer_instance_annotation.name",
        "rcsb_polymer_instance_annotation.type"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

In [None]:
# Search from input_type "uniprot"
query = Query(
    input_type="uniprot",
    input_ids=["P68871"],
    return_data_list=[
        "rcsb_uniprot_annotation"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))
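
The IDs in the `polymer_entities` and `polymer_entity_instances` examples above follow a visible pattern: an entry ID joined to an entity number with an underscore (`2CPK_1`), or to a chain label with a period (`4HHB.A`). Here is a small sketch for building such IDs from entry IDs. The helper names are made up for illustration, and nothing here checks that a given entity or chain actually exists in the entry.

```python
def entity_id(entry_id, entity_number):
    """Build a polymer entity ID such as "2CPK_1" (hypothetical helper)."""
    return f"{entry_id}_{entity_number}"


def instance_id(entry_id, chain_label):
    """Build a polymer entity instance ID such as "4HHB.A" (hypothetical helper)."""
    return f"{entry_id}.{chain_label}"


entry_ids = ["4HHB", "12CA", "3PQR"]
instance_ids = [instance_id(entry, "A") for entry in entry_ids]
print(instance_ids)          # ['4HHB.A', '12CA.A', '3PQR.A']
print(entity_id("2CPK", 1))  # 2CPK_1
```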

## Determining fields for `return_data_list`

### Providing specific and unique field names/paths
Some fields must be specified with a longer path (several field names separated by dots), because the same field name can appear in more than one place in our GraphQL Data API schema. For example, “id” appears over 50 times.

Likewise, the field `"polymer_composition"` appears under several nodes:

In [None]:
# The field "polymer_composition" isn't specific enough
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["polymer_composition"]
)

# This raises a ValueError whose message lists up to 10 valid paths that you can use instead

```
ValueError: Given path  "polymer_composition" not specific enough. Use one or more of these paths in return_data_list argument:

3 of 3 possible paths:
  assemblies.interfaces.rcsb_interface_info.polymer_composition
  assemblies.rcsb_assembly_info.polymer_composition
  rcsb_entry_info.polymer_composition
```

To get a list of all possible paths for a given field name, you can use the `DataSchema().find_paths()` method:
```python
from rcsbapi.data import DataSchema
schema = DataSchema()
schema.find_paths(input_type, field_name_or_path_segment)
```
For example:

In [None]:
# Find all paths:
from rcsbapi.data import DataSchema

schema = DataSchema()
schema.find_paths(input_type="entries", return_data_name="polymer_composition")

In [None]:
# By looking through the list, find the intended field path
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["rcsb_entry_info.polymer_composition"]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))
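
One way to reason about whether a field name or path segment is specific enough is to count how many full paths end with it. The sketch below does this for the three paths listed in the error message above. It is a simplified model for intuition only (the package's actual matching rules may differ), and `matching_paths` is a made-up helper.

```python
# The three full paths listed in the ValueError message above.
PATHS = [
    "assemblies.interfaces.rcsb_interface_info.polymer_composition",
    "assemblies.rcsb_assembly_info.polymer_composition",
    "rcsb_entry_info.polymer_composition",
]


def matching_paths(paths, segment):
    """Return the paths whose trailing field names equal `segment`.

    Simplified model: compares whole dot-separated names, not substrings.
    """
    wanted = segment.split(".")
    return [p for p in paths if p.split(".")[-len(wanted):] == wanted]


print(len(matching_paths(PATHS, "polymer_composition")))  # 3, so ambiguous
print(matching_paths(PATHS, "rcsb_entry_info.polymer_composition"))
# ['rcsb_entry_info.polymer_composition'], so unique
```

Adding one more leading field name is enough to narrow three candidates down to one in this case.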

### Discovering field names
If you're unsure which fields exist, you can call `DataSchema().find_field_names(search_substring)`.

For example, to find all fields containing `"comp"`:

In [None]:
from rcsbapi.data import DataSchema

schema = DataSchema()
schema.find_field_names("comp")

Note that once you identify which field you want to use, you may also need to run the `find_paths()` method described above on that field name to identify the set of possible paths for `return_data_list`.

In [None]:
# Find all paths for the field `"chem_comp"`:
schema.find_paths(input_type="entries", return_data_name="chem_comp")

For more in-depth documentation, go to [readthedocs](https://rcsbapi.readthedocs.io/en/latest/data_api/quickstart.html).