<a href="https://colab.research.google.com/github/rcsb/rcsb-training-resources/blob/master/Accessing_RCSB_PDB_search_and_data_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leveraging RCSB PDB APIs for Bioinformatics Analyses and Machine Learning

This Colab notebook explains how to interact with the RCSB Search & Data APIs. All code is written in Python.


## Basics of Google Colab & Jupyter Notebooks

Google Colab allows you to create documents that contain both text and code, with code that can be executed directly on this page. Colab notebooks also provide some quality-of-life features such as code completion and tools for debugging.

Code is defined in "cells", which can be executed by clicking the "play" button in the top-left corner of a cell.

### Performing an HTTP Request

The following cell contains some Python code that makes an HTTP request to the [RCSB.org](https://www.rcsb.org) homepage. It has four components:
- 1st: an import statement that allows us to use the `requests` library
- 2nd: define a variable that holds the URL of interest
- 3rd: use the `requests` library to dispatch a request to this URL
- 4th: a print statement that outputs the response code, with `200` meaning "all's well"

In [None]:
# this is a comment: this cell launches an HTTP query and talks to the RCSB.org homepage
import requests

example_url = 'https://www.rcsb.org/'
r = requests.get(example_url)
print('Status code:', r.status_code)

Status code: 200


The output is shown beneath each cell that has been executed. You can clear the output by hovering over it and clicking the "x" icon to its left.

Cells can be run multiple times by hovering over a cell and clicking the "play" button again.

You can freely change the code of each cell and adapt it to your needs or experiment with inputs. How does the output change if you switch the URL to `https://google.com/`? How about `https://not-the-rcsb.org/`?

If things go wrong, it can be helpful to sprinkle `print()` statements between lines.
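
When experimenting with other URLs, the status code will not always be `200`. The following is a minimal sketch of how a status code could be turned into a readable message using Python's standard `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and not part of the original notebook.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code.

    `describe_status` is an illustrative helper, not part of any library.
    """
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # the code is not one of the registered HTTP status codes
        phrase = "Unknown"

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational or non-standard"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(404))
```

In the cell above, you could call `describe_status(r.status_code)` instead of printing the bare number. Note that a URL that cannot be reached at all produces no status code; `requests` raises an exception in that case.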

### Key Points on Colab

*   You can execute the code snippets right in your browser.
*   Be sure to execute all previous cells as cells may depend on predecessors.
*   You can make temporary edits, but changes won't affect other users.
*   If you want to save your changes you must create a copy of this notebook at some point (use the `File` menu in the top-left).

## Interacting with the RCSB Search API
The [Search API](https://search.rcsb.org) lets you compose complex search queries that combine an arbitrary number of individual search conditions in a flexible and extensible fashion. It is also a powerful tool for compiling archive-wide statistics, such as the distribution of resolution values across all X-ray structures.

### Defining the Search API Query

Search queries are defined in a domain-specific language, which is tailored to RCSB.org. A simple query that filters for X-ray structures looks like this:

In [None]:
# a query for all X-ray structures -- execute this cell to assign the query to a variable
query_xray = {
  "query": {
    "type": "terminal",
    "label": "text",
    "service": "text",
    "parameters": {
      "attribute": "rcsb_entry_info.experimental_method",
      "operator": "exact_match",
      "value": "X-ray"
    }
  },
  "return_type": "entry"
}
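
To illustrate how several search conditions might be combined, here is a hedged sketch that wraps two conditions in a group node. The helpers `terminal` and `group` are hypothetical names introduced for this example. The `group` node layout, the `rcsb_entry_info.resolution_combined` attribute, and the `less_or_equal` operator are assumptions that should be checked against the Search API documentation before use.

```python
import json


def terminal(attribute, operator, value):
    """Build a single search condition in the same shape as `query_xray` above."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": attribute,
            "operator": operator,
            "value": value,
        },
    }


def group(logical_operator, *nodes):
    """Combine several conditions (assumed group-node layout, see lead-in)."""
    return {
        "type": "group",
        "logical_operator": logical_operator,
        "nodes": list(nodes),
    }


# X-ray structures with a resolution of 2.0 or better
query_xray_high_res = {
    "query": group(
        "and",
        terminal("rcsb_entry_info.experimental_method", "exact_match", "X-ray"),
        # assumption: attribute and operator names, verify in the Search API docs
        terminal("rcsb_entry_info.resolution_combined", "less_or_equal", 2.0),
    ),
    "return_type": "entry",
}

print(json.dumps(query_xray_high_res, indent=2))
```

Building queries from small helper functions keeps each condition readable and makes it easy to add or remove conditions while experimenting.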

### Using the Query Object to Make a Request to Search API

The Search API endpoint executes this query. To use it, serialize the query to JSON and pass it as the `json` URL parameter. The Search API responds with JSON as well. We use the `json` package to convert to and from JSON.

In [None]:
import json

url_search_api = 'https://search.rcsb.org/rcsbsearch/v2/query?json='

# json.dumps serializes the query object from above into a JSON string; requests percent-encodes the URL when sending it
url_search_xray = url_search_api + json.dumps(query_xray)

r = requests.get(url_search_xray)
# .json() returns the part of the response that we care about, the payload with the search result
result = r.json()
# json.dumps can also help with printing JSON in a style that is easier to read
print(json.dumps(result, indent=2))

{
  "query_id": "c4cc0742-4b3e-4936-a0d8-623f4b50a834",
  "result_type": "entry",
  "total_count": 179141,
  "result_set": [
    {
      "identifier": "100D",
      "score": 1.0
    },
    {
      "identifier": "101D",
      "score": 1.0
    },
    {
      "identifier": "101M",
      "score": 1.0
    },
    {
      "identifier": "102D",
      "score": 1.0
    },
    {
      "identifier": "102L",
      "score": 1.0
    },
    {
      "identifier": "102M",
      "score": 1.0
    },
    {
      "identifier": "103L",
      "score": 1.0
    },
    {
      "identifier": "103M",
      "score": 1.0
    },
    {
      "identifier": "104L",
      "score": 1.0
    },
    {
      "identifier": "104M",
      "score": 1.0
    }
  ]
}
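
The request above relies on `requests` to percent-encode the JSON string inside the URL. As a sketch of what happens to the query on its way into the URL, the encoding can also be done explicitly with the standard library. The helper name `build_search_url` is hypothetical; the endpoint URL is the one used above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query, base_url="https://search.rcsb.org/rcsbsearch/v2/query"):
    """Serialize the query to JSON and attach it as a percent-encoded `json` parameter."""
    return base_url + "?" + urlencode({"json": json.dumps(query)})


# same shape as `query_xray` above, repeated so this cell runs on its own
example_query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_entry_info.experimental_method",
            "operator": "exact_match",
            "value": "X-ray",
        },
    },
    "return_type": "entry",
}

example_url = build_search_url(example_query)
print(example_url)

# decoding the URL again gives back the original query object
decoded = json.loads(parse_qs(urlsplit(example_url).query)["json"][0])
print(decoded == example_query)
```

With `requests`, passing `params={"json": json.dumps(query)}` to `requests.get` achieves the same encoding without building the URL by hand.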


The result reports the `total_count` of items that matched the search condition. It shows that there are ~180,000 entries that were determined using X-ray crystallography.

The actual results are accessible via the `result_set` property. In this case, the first 10 matching identifiers are reported. Identifiers are sorted by score, with more relevant items appearing first. Here, all items have a score of `1.0` because an entry is either based on X-ray or it isn't.
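
For downstream analyses you usually want a plain list of identifiers, and more than the first 10 hits. The sketch below shows one way to pull identifiers out of a response and to request a different page of results. The helper names are hypothetical, and the `request_options`/`paginate` layout is an assumption to verify against the Search API documentation. The sample response is a shortened copy of the output above.

```python
import copy


def extract_identifiers(result):
    """Collect the identifiers from a Search API response."""
    return [hit["identifier"] for hit in result.get("result_set", [])]


def with_pagination(query, start=0, rows=25):
    """Return a copy of the query that asks for `rows` hits beginning at `start`.

    Assumption: the Search API reads paging from request_options.paginate.
    """
    paged = copy.deepcopy(query)
    paged.setdefault("request_options", {})["paginate"] = {"start": start, "rows": rows}
    return paged


# shortened copy of the response shown above
sample_result = {
    "result_type": "entry",
    "total_count": 179141,
    "result_set": [
        {"identifier": "100D", "score": 1.0},
        {"identifier": "101D", "score": 1.0},
    ],
}
print(extract_identifiers(sample_result))

base_query = {"query": {"type": "terminal", "service": "text"}, "return_type": "entry"}
second_page = with_pagination(base_query, start=25, rows=25)
print(second_page["request_options"])
```

`with_pagination` copies the query instead of modifying it, so the same base query can be reused for every page.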

### FYI: The RCSB Search API Package is an Alternative Way to Interact with Search API
As an alternative, we offer a dedicated Python library that makes it easy to interact with the Search API: https://github.com/rcsb/py-rcsbsearchapi.

Install it via:

In [None]:
!pip install rcsbsearchapi

Collecting rcsbsearchapi
  Downloading rcsbsearchapi-1.4.2.tar.gz (177 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rcsbsearchapi
  Building wheel for rcsbsearchapi (setup.py) ... done
  Created wheel for rcsbsearchapi: filename=rcsbsearchapi-1.4.2-py2.py3-none-any.whl size=163534 sha256=367b387be220321e2db34528c6e2558853a9f745204adf3e2eb08c5690e8eab8
  Stored in directory: /root/.cache/pip/wheels/26/fe/3f/a1d2a0110ddf201fc3810c9c2097454f39d9ef227ea66c41c5
Successfully built rcsbsearchapi
Installing collected packages: rcsbsearchapi
Successfully i

The request above has the following structure when using the RCSB Search API package:

In [None]:
# some package-specific imports
from rcsbsearchapi.search import Attr

# define the query
q = Attr('rcsb_entry_info.experimental_method').exact_match('X-ray')

# execute it and limit the results to 10
limit = 10
for entry_id in q('entry'):
  if limit <= 0:
    break
  print(entry_id)
  limit = limit - 1

print('In total, there are %s matching entries' % q.count('entry'))

100D
101D
101M
102D
102L
102M
103L
103M
104L
104M
In total, there are 179141 matching entries
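
The manual counter in the loop above can be replaced by `itertools.islice` from the standard library, which stops after a fixed number of items without consuming the rest of the iterator. The sketch below uses a stand-in generator, `fake_results`, instead of a live query so that it runs without network access; `first_n` is a hypothetical helper name.

```python
from itertools import islice


def first_n(results, n):
    """Return at most the first `n` items of any iterable as a list."""
    return list(islice(results, n))


def fake_results():
    """Stand-in for an iterable of entry IDs such as `q('entry')` above."""
    for entry_id in ["100D", "101D", "101M", "102D", "102L"]:
        yield entry_id


for entry_id in first_n(fake_results(), 3):
    print(entry_id)
```

In the notebook, `first_n(q('entry'), 10)` would take the place of the loop with the `limit` variable.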


## Interacting with the RCSB Data API
The [Data API](https://data.rcsb.org) provides static information on individual entries of the PDB archive. It also allows querying information on constituents of these entries such as assemblies, entities, chains, and ligands.

Constituents are organized in a tree data structure. Often you are interested in only one particular piece of information and don't want to wade through a deluge of data. The RCSB Data API therefore makes use of GraphQL, another domain-specific language, which lets you specify exactly which information to return.



### Defining the Data API Query
Two arguments are needed to retrieve data from the Data API:

1.  A GraphQL query that lists the properties of interest
2.  The identifier(s) of the entries of interest

In [None]:
query_method = '''
query Method($entry_ids: [String!]!) {
  entries(entry_ids: $entry_ids) {
    rcsb_id
    exptl {
      method
    }
  }
}
'''
# the query declares $entry_ids as a list of strings, so pass a list (add more IDs as needed)
query_variables = { 'entry_ids': ['100D'] }

### Installing a GraphQL client
Interacting with GraphQL APIs has some pitfalls. Let's make use of a dedicated client to help with that. This client isn't part of the core modules, thus we need to install it using `pip`.

In [None]:
!pip install python-graphql-client

Collecting python-graphql-client
  Downloading python_graphql_client-0.4.3-py3-none-any.whl (4.9 kB)
Collecting websockets>=5.0 (from python-graphql-client)
  Downloading websockets-12.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (130 kB)
Installing collected packages: websockets, python-graphql-client
Successfully installed python-graphql-client-0.4.3 websockets-12.0


### Using the Query Objects to Make a Request to Data API

Use the query and the desired entry identifier to make a request to the Data API using the GraphQL client.

In [None]:
from python_graphql_client import GraphqlClient

url_data_api = 'https://data.rcsb.org/graphql'
# instantiate client with the RCSB Data API endpoint
client = GraphqlClient(endpoint = url_data_api)

result = client.execute(query=query_method, variables=query_variables)
# actual result is wrapped in a `data` attribute, let's unwrap it right here
result = result['data']

# json.dumps can also help with printing JSON in a style that is easier to read
print(json.dumps(result, indent=2))

{
  "entries": [
    {
      "rcsb_id": "100D",
      "exptl": [
        {
          "method": "X-RAY DIFFRACTION"
        }
      ]
    }
  ]
}
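
For readers who prefer not to install an extra client, a GraphQL request can also be sent with plain `requests`: per the GraphQL-over-HTTP convention, the query and its variables travel as a JSON body in a POST request. The sketch below only builds the body and unwraps a response, so it runs offline. The helper names are hypothetical. One pitfall it addresses is that GraphQL servers typically report problems in an `errors` field of the response body rather than only through the HTTP status code.

```python
import json


def build_graphql_payload(query, variables):
    """Bundle a GraphQL query and its variables into a JSON-serializable request body."""
    return {"query": query, "variables": variables}


def unwrap_graphql_response(response_json):
    """Return the `data` part of a GraphQL response, raising if errors were reported."""
    if response_json.get("errors"):
        messages = "; ".join(str(e.get("message", e)) for e in response_json["errors"])
        raise RuntimeError("GraphQL request failed: " + messages)
    return response_json["data"]


example_query = '''
query Method($entry_ids: [String!]!) {
  entries(entry_ids: $entry_ids) {
    rcsb_id
    exptl {
      method
    }
  }
}
'''

payload = build_graphql_payload(example_query, {"entry_ids": ["100D"]})
print(json.dumps(payload["variables"]))

# Not executed here (needs network access):
#   r = requests.post('https://data.rcsb.org/graphql', json=payload)
#   data = unwrap_graphql_response(r.json())
```

The dedicated client used above takes care of the request body for you; this sketch is only meant to show what is sent over the wire.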


The Data API response mirrors the structure of the query: only the requested properties are returned, nested in the same way. The schema of the Data API closely follows the mmCIF dictionary. You can also use the [GraphiQL interface](https://data.rcsb.org/graphql/index.html?query=%7B%0A%20%20entry(entry_id%3A%20%22101d%22)%20%7B%0A%20%20%20%20rcsb_id%0A%20%20%20%20exptl%20%7B%0A%20%20%20%20%20%20method%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D%0A&variables=%7B%0A%20%20%22id%22%3A%20%22101d%22%0A%7D) to explore supported properties.
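
Because the response is nested, it often helps to flatten it before feeding it into an analysis or a machine learning pipeline. The following sketch turns a response of the shape shown above into a simple mapping from entry ID to experimental methods. `methods_by_entry` is a hypothetical helper, and the sample data is copied from the output above.

```python
def methods_by_entry(data):
    """Map each entry ID to the list of its experimental methods."""
    table = {}
    for entry in data.get("entries") or []:
        # `exptl` is a list because an entry can report more than one method
        methods = [item["method"] for item in entry.get("exptl") or []]
        table[entry["rcsb_id"]] = methods
    return table


# copied from the Data API output shown above
sample_data = {
    "entries": [
        {
            "rcsb_id": "100D",
            "exptl": [
                {"method": "X-RAY DIFFRACTION"}
            ],
        }
    ]
}

print(methods_by_entry(sample_data))
```

The `or []` fallbacks keep the helper from failing when a field is missing or `null` in the response.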

## Conclusion
The examples above provide the boilerplate code necessary to interact with the RCSB Search & Data APIs. Adapt these examples to your use case.

In [None]:
# you can either edit the cells above or add your code below

