<a href="https://colab.research.google.com/github/rcsb/rcsb-training-resources/blob/master/training-events/2025/search_api_streamlining_access_to_rcsb_pdb_apis_with_python/data_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using `rcsb-api` to access RCSB PDB's Data API

In [2]:
# Install `rcsb-api`
%pip install --upgrade rcsb-api


Note: you may need to restart the kernel to use updated packages.


## Creating a Data API Query

We'll start by making a Data API query to find the experimental method used to determine PDB entry 4HHB.

A few arguments are required to create a query:

`input_type`: defines the starting point of your query. Some examples include `entries`, `polymer_entities`, and `chem_comps`. If you're unsure which `input_type` to choose, you can usually use `entries`.

`input_ids`: the identifiers, of the given `input_type`, that you would like to request data for. Each `input_type` has its own ID format:

|Type|PDB ID Format|Example|
|---|---|---|
|entries|entry id|4HHB|
|polymer, branched, or non-polymer entities|[entry_id]_[entity_id]|4HHB_1|
|polymer, branched, or non-polymer entity instances|[entry_id].[asym_id]|4HHB.A|
|biological assemblies|[entry_id]-[assembly_id]|4HHB-1|
|interface|[entry_id]-[assembly_id].[interface_id]|4HHB-1.1|
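The ID formats in the table can be checked before a query is built. The sketch below is a hypothetical helper, not part of `rcsb-api`: the name `classify_id` and the regular expressions are assumptions derived from the table, and they only cover classic 4-character PDB entry IDs with numeric entity, assembly, and interface IDs (CSM IDs are not handled).

```python
import re
from typing import Optional

# Hypothetical patterns derived from the ID format table; not part of rcsb-api.
# Assumption: entry IDs are classic 4-character PDB IDs (a digit + 3 alphanumerics).
ENTRY = r"[0-9][A-Za-z0-9]{3}"
ID_PATTERNS = {
    "entry": re.compile(rf"^{ENTRY}$"),                 # 4HHB
    "entity": re.compile(rf"^{ENTRY}_\d+$"),            # 4HHB_1
    "instance": re.compile(rf"^{ENTRY}\.[A-Za-z0-9]+$"),  # 4HHB.A
    "assembly": re.compile(rf"^{ENTRY}-\d+$"),          # 4HHB-1
    "interface": re.compile(rf"^{ENTRY}-\d+\.\d+$"),    # 4HHB-1.1
}


def classify_id(identifier: str) -> Optional[str]:
    """Return the kind of ID that `identifier` looks like, or None if no pattern matches."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.match(identifier):
            return kind
    return None


for example in ["4HHB", "4HHB_1", "4HHB.A", "4HHB-1", "4HHB-1.1"]:
    print(example, "->", classify_id(example))
```

A check like this can catch a mismatch between `input_type` and `input_ids` (for example, passing `4HHB_1` with `input_type="entries"`) before any request is sent.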

`return_data_list`: the data to request for each of the given `input_ids`

In [4]:
from rcsbapi.data import DataQuery as Query

# Create a `DataQuery`/`Query` object
query = Query(
    input_type="entries",
    input_ids=["4HHB"],  # CSM IDs can be used as well
    return_data_list=["exptl.method"]
)

In [5]:
# Execute the query using the `.exec` method
results = query.exec()

In [6]:
# Response is returned by `.exec`
print(results)

{'data': {'entries': [{'rcsb_id': '4HHB', 'exptl': [{'method': 'X-RAY DIFFRACTION'}]}]}}


In [7]:
# You can also access the response through the object
print(query.get_response())

{'data': {'entries': [{'rcsb_id': '4HHB', 'exptl': [{'method': 'X-RAY DIFFRACTION'}]}]}}
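The response is a plain nested dictionary, so individual values can be pulled out with ordinary indexing. The sketch below works on a copy of the response printed above; the helper name `methods_by_entry` is hypothetical and not part of `rcsb-api`.

```python
# Response copied from the output above, so this cell runs without a network call.
response = {
    "data": {
        "entries": [
            {"rcsb_id": "4HHB", "exptl": [{"method": "X-RAY DIFFRACTION"}]}
        ]
    }
}


def methods_by_entry(response):
    """Map each entry ID to the list of experimental methods reported for it."""
    methods = {}
    for entry in response["data"]["entries"]:
        # `exptl` is a list because an entry can report more than one method.
        methods[entry["rcsb_id"]] = [e["method"] for e in entry.get("exptl") or []]
    return methods


print(methods_by_entry(response))
# → {'4HHB': ['X-RAY DIFFRACTION']}
```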


By using the Search API and Data API together, you can first refine a list of IDs that are of interest and then request data on those particular structures.

In the example below, I select human structures associated with the phrase "interleukin" that include investigational or experimental drugs. Having narrowed down the structures of interest, I then request each structure's experimental method and resolution.

In [9]:
from rcsbapi.search import TextQuery
from rcsbapi.search import search_attributes as attrs

# Query for structures associated with phrase "interleukin" from Homo sapiens with investigational or experimental drugs
q1 = TextQuery("interleukin")
q2 = attrs.rcsb_entity_source_organism.scientific_name == "Homo sapiens"
q3 = attrs.drugbank_info.drug_groups == "investigational"
q4 = attrs.drugbank_info.drug_groups == "experimental"

search_query = q1 & q2 & (q3 | q4)
results = search_query()

# Get first 50 IDs from Search API query
id_list = list(results)[:50]

In [None]:
from rcsbapi.data import DataQuery as Query

# Use `id_list` to make Data API query
data_query = Query(
    input_type="entries",
    input_ids=id_list,
    return_data_list=["exptl.method", "diffrn_resolution_high.value"]
)

results = data_query.exec()
print(results)

Some paths are being autocompleted based on the current API. If this code is meant for long-term use, use the set of fully qualified paths below:
    [
        "exptl.method",
        "rcsb_entry_info.diffrn_resolution_high.value",
    ]


{'data': {'entries': [{'rcsb_id': '5HN1', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 2.25}}}, {'rcsb_id': '7OX5', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 3.09}}}, {'rcsb_id': '6GG1', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 1.3}}}, {'rcsb_id': '7S2S', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 1.92}}}, {'rcsb_id': '5FB8', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 2.07}}}, {'rcsb_id': '9H4O', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 1.97}}}, {'rcsb_id': '6O4P', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_info': {'diffrn_resolution_high': {'value': 3.43}}}, {'rcsb_id': '7OX3', 'exptl': [{'method': 'X-RAY DIFFRACTION'}], 'rcsb_entry_
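A response with many entries is easier to read once it is flattened into rows. The sketch below uses the first three entries from the output above as sample data; the helper name `to_rows` is hypothetical, and the handling of missing values assumes that a field with no data may come back as `None` (for example, a resolution for a non-diffraction structure).

```python
# First three entries copied from the output above.
sample = {"data": {"entries": [
    {"rcsb_id": "5HN1", "exptl": [{"method": "X-RAY DIFFRACTION"}],
     "rcsb_entry_info": {"diffrn_resolution_high": {"value": 2.25}}},
    {"rcsb_id": "7OX5", "exptl": [{"method": "X-RAY DIFFRACTION"}],
     "rcsb_entry_info": {"diffrn_resolution_high": {"value": 3.09}}},
    {"rcsb_id": "6GG1", "exptl": [{"method": "X-RAY DIFFRACTION"}],
     "rcsb_entry_info": {"diffrn_resolution_high": {"value": 1.3}}},
]}}


def to_rows(response):
    """Flatten a response into (rcsb_id, methods, resolution) tuples."""
    rows = []
    for entry in response["data"]["entries"]:
        # Assumption: any level of the nested structure may be missing or None.
        info = entry.get("rcsb_entry_info") or {}
        resolution = (info.get("diffrn_resolution_high") or {}).get("value")
        methods = ", ".join(e["method"] for e in entry.get("exptl") or [])
        rows.append((entry["rcsb_id"], methods, resolution))
    return rows


# Sort by resolution (lower is better), placing entries without a value last.
rows = sorted(to_rows(sample), key=lambda r: (r[2] is None, r[2] or 0.0))
for rcsb_id, methods, resolution in rows:
    print(rcsb_id, methods, resolution)
```

The same rows could be passed to a table library if one is already in use; plain tuples are used here to keep the example dependency-free.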

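If you're interested in archive-wide data, you can use `ALL_STRUCTURES` to request fields for every `entry` or `chem_comp` in the PDB. Note that these queries take longer to complete than queries over fewer structures. If the import fails with an `ImportError`, as in the output below, the installed `rcsb-api` version most likely does not include `ALL_STRUCTURES`; upgrade the package and restart the kernel.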

In [12]:
from rcsbapi.data import ALL_STRUCTURES
from rcsbapi.data import DataQuery as Query

query = Query(
    input_type="chem_comps",
    input_ids=ALL_STRUCTURES,
    return_data_list=["drugbank_info.drugbank_id"]
)

# Set progress_bar to True to track query's progress
# progress bar shows number of completed batches
query.exec(progress_bar=True)

ImportError: cannot import name 'ALL_STRUCTURES' from 'rcsbapi.data' (/Users/itruong/.pyenv/versions/3.12.8/lib/python3.12/site-packages/rcsbapi/data/__init__.py)
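The `ImportError` above suggests that the running kernel is using an `rcsb-api` version without `ALL_STRUCTURES`, for example because the kernel was not restarted after upgrading. A minimal sketch for checking this before running an archive-wide query is shown below; the variable names are illustrative, and the distribution name `rcsb-api` is taken from the install command at the top of this notebook.

```python
from importlib import metadata

# Try the import and record whether the installed package provides ALL_STRUCTURES.
# A missing package raises ModuleNotFoundError, which is a subclass of ImportError.
try:
    from rcsbapi.data import ALL_STRUCTURES
    has_all_structures = True
except ImportError:
    ALL_STRUCTURES = None
    has_all_structures = False

# Look up the installed version of the distribution, if any.
try:
    installed_version = metadata.version("rcsb-api")
except metadata.PackageNotFoundError:
    installed_version = None

print(f"rcsb-api version: {installed_version}")
print(f"ALL_STRUCTURES available: {has_all_structures}")
```

If `ALL_STRUCTURES` is reported as unavailable, rerunning the install cell and restarting the kernel is the first thing to try.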

## Visualizing and Manipulating Queries

Once you have constructed a query, you can visualize it in our Data API query editor by using the `get_editor_link` method.

In [13]:
from rcsbapi.data import DataQuery as Query

query = Query(input_type="entries", input_ids=["4HHB"], return_data_list=["exptl.method"])
print(query.get_editor_link())

https://data.rcsb.org/graphql/index.html?query=%7B%20entries%28entry_ids%3A%20%5B%224HHB%22%5D%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20exptl%7B%0A%20%20%20%20%20%20method%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A
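The editor link carries the GraphQL query as a percent-encoded `query` parameter. To inspect the query without opening a browser, the link can be decoded with the standard library. The sketch below uses the link printed above, split across several string literals only for readability.

```python
from urllib.parse import parse_qs, urlsplit

# Editor link copied from the output above.
editor_link = (
    "https://data.rcsb.org/graphql/index.html?query="
    "%7B%20entries%28entry_ids%3A%20%5B%224HHB%22%5D%29%20%7B%0A"
    "%20%20rcsb_id%0A"
    "%20%20%20%20exptl%7B%0A"
    "%20%20%20%20%20%20method%0A"
    "%20%20%20%20%20%20%7D%0A"
    "%20%7D%0A%7D%0A"
)

# `parse_qs` splits the query string into parameters and percent-decodes the values.
graphql_query = parse_qs(urlsplit(editor_link).query)["query"][0]
print(graphql_query)
```

The decoded text shows that `rcsb_id` is requested alongside `exptl.method`, which matches the `rcsb_id` key seen in the responses earlier in this notebook.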


## Working with the Schema

To explore the Data API schema through the package, you can use the `find_field_names` and `find_paths` methods of a `DataSchema` object.

In [17]:
from rcsbapi.data import DataSchema

# Initialize a schema object
schema = DataSchema()

# To search for fields use `find_field_names`
schema.find_field_names("ligand")

['rcsb_ligand_neighbors',
 'ligand_alt_id',
 'ligand_asym_id',
 'ligand_atom_id',
 'ligand_comp_id',
 'ligand_entity_id',
 'ligand_is_bound',
 'ligand_model_id',
 'ligands_for_buster_report',
 'pdbx_B_iso_mean_ligand',
 'pdbx_number_atoms_ligand']

In [18]:
# Pick your intended field and find the path from your desired `input_type` using `find_paths`
schema.find_paths(
    input_type="entries",
    return_data_name="rcsb_ligand_neighbors"
)

['assemblies.branched_entity_instances.rcsb_ligand_neighbors',
 'assemblies.polymer_entity_instances.rcsb_ligand_neighbors',
 'branched_entities.branched_entity_instances.rcsb_ligand_neighbors',
 'polymer_entities.polymer_entity_instances.rcsb_ligand_neighbors']
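When `find_paths` returns several paths, one of them has to be chosen for `return_data_list`. The sketch below copies the paths from the output above and keeps the one that goes through `polymer_entities`; that preference is only an illustration, and the right path depends on which objects you want the field reported for.

```python
# Paths copied from the `find_paths` output above.
paths = [
    "assemblies.branched_entity_instances.rcsb_ligand_neighbors",
    "assemblies.polymer_entity_instances.rcsb_ligand_neighbors",
    "branched_entities.branched_entity_instances.rcsb_ligand_neighbors",
    "polymer_entities.polymer_entity_instances.rcsb_ligand_neighbors",
]

# Illustrative choice: keep only paths that start from the entry's polymer entities.
preferred = [p for p in paths if p.startswith("polymer_entities.")]

# A fully qualified path like this can then be used as an item of `return_data_list`.
return_data_list = preferred[:1]
print(return_data_list)
# → ['polymer_entities.polymer_entity_instances.rcsb_ligand_neighbors']
```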

You can also explore our schema through the Documentation Explorer in our [Data API query editor](https://data.rcsb.org/graphql/index.html).

## Further Documentation

For more extensive examples and implementation details, visit our [readthedocs](https://rcsbapi.readthedocs.io/en/latest/data_api/quickstart.html).