<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/data_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RCSB PDB Data API: Quickstart

This Quickstart notebook walks through the basics of creating and executing queries using the `rcsbapi.data` module of the `rcsb-api` package. For more in-depth documentation, see the [readthedocs page](https://rcsbapi.readthedocs.io/en/latest/data_api/quickstart.html).

Before beginning, you must install the package:

```pip install rcsb-api```

In [None]:
%pip install rcsb-api

In [25]:
from rcsbapi.data import DataQuery as Query
import json  # for easy-to-read output

## Creating and executing queries

To create a `Query` object, you need to provide three arguments:
- `input_type`: input_types are points where you can begin your query. Some examples are entries, polymer_entities, and polymer_entity_instances. For a full list of input_types see the [readthedocs](https://rcsbapi.readthedocs.io/en/latest/data_api/query_construction.html#input-type).
- `input_ids`: input_ids are accepted as a list or dictionary of PDB-formatted IDs.
- `return_data_list`: list of data items to return. Each item must be a unique field name or a unique path segment (names separated by dots), as explained [below](#Providing-specific-and-unique-field-names/paths).

(More details on input arguments can be found in [readthedocs: Query Construction](https://rcsbapi.readthedocs.io/en/latest/data_api/query_construction.html).)

For example, to create a `Query` object requesting all non-polymer components of a structure (ions, cofactors, etc.):

In [26]:
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["nonpolymer_bound_components"]  # must be unique field or unique path segment
)

# Note: When the package autocompletes a path, it prints a warning message.
# To suppress this warning, either use the fully qualified path ("rcsb_entry_info.nonpolymer_bound_components"),
# or set `suppress_autocomplete_warning` to True.


Some paths are being autocompleted based on the current API. If this code is meant for long-term use, use the set of fully qualified paths below:
    [
        "rcsb_entry_info.nonpolymer_bound_components",
    ]


After creating a `Query` object, you can run it with `.exec()` or view the GraphQL query with `.get_editor_link()`:

In [27]:
# Execute the query and print the results
return_data = query.exec()
print(json.dumps(return_data, indent=2))  # prints return_data with easy-to-read formatting

{
  "data": {
    "entries": [
      {
        "rcsb_id": "4HHB",
        "rcsb_entry_info": {
          "nonpolymer_bound_components": [
            "HEM"
          ]
        }
      }
    ]
  }
}
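
The value returned by `.exec()` is an ordinary Python dictionary that mirrors the JSON above, so individual values can be pulled out with normal indexing. A minimal sketch, using a hard-coded copy of the output shown above in place of a live call (the variable names `first_entry` and `components` are only illustrative):

```python
# Hard-coded copy of the response printed above (stands in for query.exec())
return_data = {
    "data": {
        "entries": [
            {
                "rcsb_id": "4HHB",
                "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]},
            }
        ]
    }
}

# Results for an "entries" query sit under data -> entries, one dict per ID
first_entry = return_data["data"]["entries"][0]
components = first_entry["rcsb_entry_info"]["nonpolymer_bound_components"]
print(first_entry["rcsb_id"], components)
```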


In [28]:
# Print the GraphQL editor URL
query.get_editor_link()

'https://data.rcsb.org/graphql/index.html?query=%7B%20entries%28entry_ids%3A%20%5B%224HHB%22%5D%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20rcsb_entry_info%7B%0A%20%20%20%20%20%20nonpolymer_bound_components%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A'
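
The link carries the GraphQL query as a percent-encoded URL parameter. If you only want to read the query text without opening a browser, one option is to decode it with the standard library. This is a sketch that assumes the link format shown above (a single `query` parameter); the URL is copied from the output of the previous cell:

```python
from urllib.parse import parse_qs, urlparse

# Editor link copied from the output above, split across lines for readability
editor_link = (
    "https://data.rcsb.org/graphql/index.html?query="
    "%7B%20entries%28entry_ids%3A%20%5B%224HHB%22%5D%29%20%7B%0A"
    "%20%20rcsb_id%0A"
    "%20%20%20%20rcsb_entry_info%7B%0A"
    "%20%20%20%20%20%20nonpolymer_bound_components%0A"
    "%20%20%20%20%20%20%7D%0A"
    "%20%7D%0A"
    "%7D%0A"
)

# parse_qs percent-decodes the value of the "query" parameter
graphql_text = parse_qs(urlparse(editor_link).query)["query"][0]
print(graphql_text)
```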

### Querying multiple IDs
You can search multiple entries by starting from `input_type` "entries" and passing in a list of `input_ids`.

In [29]:
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=["nonpolymer_bound_components"]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

Some paths are being autocompleted based on the current API. If this code is meant for long-term use, use the set of fully qualified paths below:
    [
        "rcsb_entry_info.nonpolymer_bound_components",
    ]


{
  "data": {
    "entries": [
      {
        "rcsb_id": "4HHB",
        "rcsb_entry_info": {
          "nonpolymer_bound_components": [
            "HEM"
          ]
        }
      },
      {
        "rcsb_id": "12CA",
        "rcsb_entry_info": {
          "nonpolymer_bound_components": [
            "ZN"
          ]
        }
      },
      {
        "rcsb_id": "3PQR",
        "rcsb_entry_info": {
          "nonpolymer_bound_components": [
            "NAG",
            "PLM",
            "RET"
          ]
        }
      }
    ]
  }
}
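
With several IDs in one response, it is often convenient to index the results by `rcsb_id`. A minimal sketch, again using a hard-coded copy of the output above rather than a live call (`components_by_id` is an illustrative name):

```python
# Hard-coded copy of the three-entry response printed above
return_data = {
    "data": {
        "entries": [
            {"rcsb_id": "4HHB", "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]}},
            {"rcsb_id": "12CA", "rcsb_entry_info": {"nonpolymer_bound_components": ["ZN"]}},
            {"rcsb_id": "3PQR", "rcsb_entry_info": {"nonpolymer_bound_components": ["NAG", "PLM", "RET"]}},
        ]
    }
}

# Build a lookup table: entry ID -> list of bound non-polymer components
components_by_id = {
    entry["rcsb_id"]: entry["rcsb_entry_info"]["nonpolymer_bound_components"]
    for entry in return_data["data"]["entries"]
}
print(components_by_id["3PQR"])
```

Keying on `rcsb_id` rather than list position is the safer choice: in the `polymer_entities` example later in this notebook, the results come back in a different order from the input IDs.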


### Querying multiple data items
You can also request multiple data items by adding to the `return_data_list`.

In [30]:
# Query multiple fields in return_data_list
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=[
        "nonpolymer_bound_components",
        "citation.title",
        "rcsb_entry_info.polymer_composition"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

Some paths are being autocompleted based on the current API. If this code is meant for long-term use, use the set of fully qualified paths below:
    [
        "rcsb_entry_info.nonpolymer_bound_components",
        "citation.title",
        "rcsb_entry_info.polymer_composition",
    ]


{
  "data": {
    "entries": [
      {
        "rcsb_id": "4HHB",
        "rcsb_entry_info": {
          "nonpolymer_bound_components": [
            "HEM"
          ],
          "polymer_composition": "heteromeric protein"
        },
        "citation": [
          {
            "title": "The crystal structure of human deoxyhaemoglobin at 1.74 A resolution"
          },
          {
            "title": "Stereochemistry of Iron in Deoxyhaemoglobin"
          },
          {
            "title": "Regulation of Oxygen Affinity of Hemoglobin. Influence of Structure of the Globin on the Heme Iron"
          },
          {
            "title": "Three-Dimensional Fourier Synthesis of Human Deoxyhemoglobin at 2.5 Angstroms Resolution, I.X-Ray Analysis"
          },
          {
            "title": "Three-Dimensional Fourier Synthesis of Human Deoxyhaemoglobin at 2.5 Angstroms Resolution, Refinement of the Atomic Model"
          },
          {
            "title": "Three-Dimensional Fourier Sy

### Autocompletion of nested fields
If there are fields nested under a requested data item in `return_data_list`, the package will add all sub-fields to the query. This allows you to make more general requests to get all information under that field (e.g., `"exptl"`). If you would like a more precise query, you can still request specific fields (e.g., `"exptl.method"`).

In [31]:
# Requesting "exptl" gets all fields underneath that field
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["exptl"]  # requests exptl.crystals_number, exptl.method, etc.
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

{
  "data": {
    "entries": [
      {
        "rcsb_id": "4HHB",
        "exptl": [
          {
            "method_details": null,
            "crystals_number": null,
            "method": "X-RAY DIFFRACTION",
            "details": null
          }
        ]
      }
    ]
  }
}
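
Because every sub-field is requested, many of the returned values may be `null` (`None` in Python). A minimal sketch of dropping the empty sub-fields after the fact, using a hard-coded copy of the `exptl` list shown above (`populated` is an illustrative name):

```python
# Hard-coded copy of the "exptl" list from the output above
exptl_records = [
    {
        "method_details": None,
        "crystals_number": None,
        "method": "X-RAY DIFFRACTION",
        "details": None,
    }
]

# Keep only the sub-fields that actually carry a value
populated = [
    {name: value for name, value in record.items() if value is not None}
    for record in exptl_records
]
print(populated)
```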


In [32]:
# To view the generated GraphQL query:
query.get_editor_link()

'https://data.rcsb.org/graphql/index.html?query=%7B%20entries%28entry_ids%3A%20%5B%224HHB%22%5D%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20exptl%7B%0A%20%20%20%20%20%20%20%20method_details%0A%20%20%20%20%20%20%20%20crystals_number%0A%20%20%20%20%20%20%20%20method%0A%20%20%20%20%20%20%20%20details%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A'

### Querying different `input_types`
You can also start queries from other `input_types` (e.g., `polymer_entities`, `polymer_entity_instances`, `uniprot`). For more examples, see [readthedocs: Additional Examples](https://rcsbapi.readthedocs.io/en/latest/data_api/additional_examples.html).

In [33]:
# Search from input_type "polymer_entities"
query = Query(
    input_type="polymer_entities",
    input_ids=["2CPK_1", "3WHM_1", "2D5Z_1"],
    return_data_list=[
        "polymer_entities.rcsb_id",
        "rcsb_entity_source_organism.ncbi_taxonomy_id",
        "rcsb_entity_source_organism.ncbi_scientific_name",
        "cluster_id",
        "identity"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

Some paths are being autocompleted based on the current API. If this code is meant for long-term use, use the set of fully qualified paths below:
    [
        "rcsb_id",
        "rcsb_entity_source_organism.ncbi_taxonomy_id",
        "rcsb_entity_source_organism.ncbi_scientific_name",
        "rcsb_cluster_membership.cluster_id",
        "rcsb_cluster_membership.identity",
    ]


{
  "data": {
    "polymer_entities": [
      {
        "rcsb_id": "3WHM_1",
        "rcsb_entity_source_organism": [
          {
            "ncbi_taxonomy_id": 9606,
            "ncbi_scientific_name": "Homo sapiens"
          }
        ],
        "rcsb_cluster_membership": [
          {
            "cluster_id": 111,
            "identity": 100
          },
          {
            "cluster_id": 118,
            "identity": 95
          },
          {
            "cluster_id": 110,
            "identity": 90
          },
          {
            "cluster_id": 63,
            "identity": 70
          },
          {
            "cluster_id": 24,
            "identity": 50
          },
          {
            "cluster_id": 38,
            "identity": 30
          }
        ]
      },
      {
        "rcsb_id": "2D5Z_1",
        "rcsb_entity_source_organism": [
          {
            "ncbi_taxonomy_id": 9606,
            "ncbi_scientific_name": "Homo sapiens"
          }
        ],
     

In [34]:
# Search from input_type "polymer_entity_instances"
query = Query(
    input_type="polymer_entity_instances",
    input_ids=["4HHB.A", "12CA.A", "3PQR.A"],
    return_data_list=[
        "polymer_entity_instances.rcsb_id",
        "rcsb_polymer_instance_annotation.annotation_id",
        "rcsb_polymer_instance_annotation.name",
        "rcsb_polymer_instance_annotation.type"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

{
  "data": {
    "polymer_entity_instances": [
      {
        "rcsb_id": "4HHB.A",
        "rcsb_polymer_instance_annotation": [
          {
            "annotation_id": "1.10.490.10",
            "name": "Globins",
            "type": "CATH"
          },
          {
            "annotation_id": "d4hhba_",
            "name": "Hemoglobin, alpha-chain",
            "type": "SCOP"
          },
          {
            "annotation_id": "8039836",
            "name": "Globin-like",
            "type": "SCOP2"
          },
          {
            "annotation_id": "e4hhbA1",
            "name": "Globin",
            "type": "ECOD"
          }
        ]
      },
      {
        "rcsb_id": "12CA.A",
        "rcsb_polymer_instance_annotation": [
          {
            "annotation_id": "3.10.200.10",
            "name": "Alpha carbonic anhydrase",
            "type": "CATH"
          },
          {
            "annotation_id": "d12caa_",
            "name": "Carbonic anhydrase",
            "t

In [35]:
# Search from input_type "uniprot"
query = Query(
    input_type="uniprot",
    input_ids=["P68871"],
    return_data_list=[
        "rcsb_uniprot_annotation"
    ]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

{
  "data": {
    "uniprot": {
      "rcsb_id": "P68871",
      "rcsb_uniprot_annotation": [
        {
          "provenance_source": "UNIPROT",
          "assignment_version": null,
          "name": "hemoglobin complex",
          "type": "GO",
          "annotation_id": "GO:0005833",
          "description": null,
          "additional_properties": null,
          "annotation_lineage": [
            {
              "depth": null,
              "id": "GO:0032991",
              "name": "protein-containing complex"
            },
            {
              "depth": null,
              "id": "GO:0005575",
              "name": "cellular_component"
            },
            {
              "depth": null,
              "id": "GO:0005622",
              "name": "intracellular anatomical structure"
            },
            {
              "depth": null,
              "id": "GO:0005833",
              "name": "hemoglobin complex"
            },
            {
              "depth": null,

## Determining fields for `return_data_list`

### Providing specific and unique field names/paths
Some fields must be specified with a longer path (multiple field names separated by dots). This is because some field names are not unique within the GraphQL Data API schema. For example, `"id"` appears over 50 times.

Similarly, the field `"polymer_composition"` appears under several nodes:

In [36]:
# The field "polymer_composition" isn't specific enough
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["polymer_composition"]
)

# This raises a ValueError whose message lists up to 10 valid paths that you can use instead:

ValueError: Given path  "polymer_composition" not specific enough. Use one or more of these paths in return_data_list argument:

3 of 3 possible paths:
  assemblies.interfaces.rcsb_interface_info.polymer_composition
  assemblies.rcsb_assembly_info.polymer_composition
  rcsb_entry_info.polymer_composition

To get a list of all possible paths for a given field name, you can use the `DataSchema().find_paths()` method:
```python
from rcsbapi.data import DataSchema
schema = DataSchema()
schema.find_paths(input_type, field_name_or_path_segment)
```
For example:

In [37]:
# Find all paths:
from rcsbapi.data import DataSchema
schema = DataSchema()
schema.find_paths(input_type="entries", return_data_name="polymer_composition")

['assemblies.interfaces.rcsb_interface_info.polymer_composition',
 'assemblies.rcsb_assembly_info.polymer_composition',
 'rcsb_entry_info.polymer_composition']
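
To see why a longer path resolves the ambiguity, the check can be sketched in plain Python. This is only an illustration, not the package's implementation: `matching_paths` is a hypothetical helper, and it assumes a path segment is compared against the trailing names of each full path.

```python
def matching_paths(all_paths, segment):
    """Return the paths whose trailing dot-separated names equal `segment`.

    Hypothetical helper for illustration; not part of rcsb-api.
    """
    wanted = segment.split(".")
    return [p for p in all_paths if p.split(".")[-len(wanted):] == wanted]


# The three paths returned by find_paths() above
paths = [
    "assemblies.interfaces.rcsb_interface_info.polymer_composition",
    "assemblies.rcsb_assembly_info.polymer_composition",
    "rcsb_entry_info.polymer_composition",
]

ambiguous = matching_paths(paths, "polymer_composition")
unique = matching_paths(paths, "rcsb_entry_info.polymer_composition")
print(len(ambiguous), "matches for the bare field name")
print(len(unique), "match once the parent field is included")
```

Under that assumption, the bare field name matches all three paths, while adding one parent name narrows it to a single path.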

In [38]:
# By looking through the list, find the intended field path
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["rcsb_entry_info.polymer_composition"]
)
return_data = query.exec()
print(json.dumps(return_data, indent=2))

{
  "data": {
    "entries": [
      {
        "rcsb_id": "4HHB",
        "rcsb_entry_info": {
          "polymer_composition": "heteromeric protein"
        }
      }
    ]
  }
}


### Discovering field names
If you're unsure which fields exist, you can call `DataSchema().find_field_names(search_substring)`.

For example, to find all fields containing `"comp"`:

In [39]:
from rcsbapi.data import DataSchema
schema = DataSchema()
schema.find_field_names("comp")

['chem_comps',
 'chem_comp',
 'label_comp_id',
 'chem_comp_monomers',
 'chem_comp_nstd_monomers',
 'pdbx_chem_comp_audit',
 'pdbx_chem_comp_descriptor',
 'pdbx_chem_comp_feature',
 'pdbx_chem_comp_identifier',
 'rcsb_chem_comp_annotation',
 'rcsb_chem_comp_container_identifiers',
 'rcsb_chem_comp_descriptor',
 'rcsb_chem_comp_info',
 'rcsb_chem_comp_related',
 'rcsb_chem_comp_synonyms',
 'rcsb_chem_comp_target',
 'mon_nstd_parent_comp_id',
 'pdbx_subcomponent_list',
 'comp_id',
 'component_id',
 'comp_id_1',
 'comp_id_2',
 'chem_comp_id',
 'compound_details',
 'subcomponent_ids',
 'rcsb_comp_model_provenance',
 'rcsb_branched_component_count',
 'beg_comp_id',
 'ligand_comp_id',
 'polymer_composition',
 'nonpolymer_comp',
 'nonpolymer_comp_id',
 'completeness',
 'target_comp_id',
 'pdb_format_compatible',
 'nonpolymer_bound_components',
 'cofactor_chem_comp_id']
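
The result is a plain list of strings, so a broad search can be narrowed further with ordinary list filtering. A minimal sketch, using a hand-picked subset of the names returned above instead of a live call (`polymer_related` is an illustrative name):

```python
# A subset of the field names returned by find_field_names("comp") above
field_names = [
    "chem_comp",
    "comp_id",
    "polymer_composition",
    "nonpolymer_comp",
    "nonpolymer_comp_id",
    "completeness",
    "nonpolymer_bound_components",
]

# Narrow the list to names that also mention "polymer"
polymer_related = [name for name in field_names if "polymer" in name]
print(polymer_related)
```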

Note that once you identify which field you want to use, you may also need to run the `find_paths()` method described above on the field name to identify the set of possible paths for `return_data_list`.

In [40]:
# Find all paths for the field `"chem_comp"`:
schema.find_paths(input_type="entries", return_data_name="chem_comp")

['assemblies.branched_entity_instances.branched_entity.chem_comp_monomers.chem_comp',
 'assemblies.branched_entity_instances.branched_entity.prd.chem_comp',
 'assemblies.nonpolymer_entity_instances.nonpolymer_entity.nonpolymer_comp.chem_comp',
 'assemblies.nonpolymer_entity_instances.nonpolymer_entity.prd.chem_comp',
 'assemblies.polymer_entity_instances.polymer_entity.chem_comp_monomers.chem_comp',
 'assemblies.polymer_entity_instances.polymer_entity.chem_comp_nstd_monomers.chem_comp',
 'assemblies.polymer_entity_instances.polymer_entity.prd.chem_comp',
 'branched_entities.chem_comp_monomers.chem_comp',
 'branched_entities.prd.chem_comp',
 'nonpolymer_entities.nonpolymer_comp.chem_comp',
 'nonpolymer_entities.prd.chem_comp',
 'polymer_entities.chem_comp_monomers.chem_comp',
 'polymer_entities.chem_comp_nstd_monomers.chem_comp',
 'polymer_entities.prd.chem_comp']
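
When `find_paths()` returns many candidates, grouping them by their first path segment can make the list easier to scan. A minimal sketch using a hard-coded copy of the list above (`counts` is an illustrative name):

```python
from collections import Counter

# Hard-coded copy of the paths returned by find_paths() above
paths = [
    "assemblies.branched_entity_instances.branched_entity.chem_comp_monomers.chem_comp",
    "assemblies.branched_entity_instances.branched_entity.prd.chem_comp",
    "assemblies.nonpolymer_entity_instances.nonpolymer_entity.nonpolymer_comp.chem_comp",
    "assemblies.nonpolymer_entity_instances.nonpolymer_entity.prd.chem_comp",
    "assemblies.polymer_entity_instances.polymer_entity.chem_comp_monomers.chem_comp",
    "assemblies.polymer_entity_instances.polymer_entity.chem_comp_nstd_monomers.chem_comp",
    "assemblies.polymer_entity_instances.polymer_entity.prd.chem_comp",
    "branched_entities.chem_comp_monomers.chem_comp",
    "branched_entities.prd.chem_comp",
    "nonpolymer_entities.nonpolymer_comp.chem_comp",
    "nonpolymer_entities.prd.chem_comp",
    "polymer_entities.chem_comp_monomers.chem_comp",
    "polymer_entities.chem_comp_nstd_monomers.chem_comp",
    "polymer_entities.prd.chem_comp",
]

# Count how many candidate paths start from each top-level field
counts = Counter(path.split(".")[0] for path in paths)
for top_level, n in counts.items():
    print(f"{top_level}: {n} path(s)")
```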