<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/search_data_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install rcsb-api
!pip install rcsbsearchapi

In [None]:
from rcsbapi.data import Query
from rcsbsearchapi import rcsb_attributes as attrs
from pprint import pprint

## RCSB PDB Data API: Search and Data API Workflow Demo

This quick-start notebook walks through the basics of making queries with this package using a simple example. For more in-depth documentation, see the [README](https://github.com/rcsb/py-rcsb-api/blob/master/README.md).

\
Install the package:

```pip install rcsb-api```

\
In this demo, we are interested in finding potential drugs to treat COVID-19 and the associated literature in order to conduct further research. To do this, we will:
 1. Construct a query to find COVID-19 virus structures with ligands bound (Python Search API package)
 2. Find information about each ligand (PDB ID, associated publication titles, links to publications) (Python Data API package)
 3. Parse our results and output in an easy-to-read format

### Python Search API: Find COVID-19 Structures with Ligand Bound
To learn more about using the Search API Python package, read the documentation [here](https://rcsbsearchapi.readthedocs.io/en/latest/).

We'll start by constructing a query with three criteria:
- The source organism is "COVID-19 virus"
- A nonpolymer entity is the subject of investigation in the structure
- A modified chemical component is present

In [None]:
# Create each subquery
q1 = attrs.rcsb_entity_source_organism.taxonomy_lineage.name == "COVID-19 virus"
q2 = attrs.rcsb_nonpolymer_entity_annotation.type == "SUBJECT_OF_INVESTIGATION"
q3 = attrs.rcsb_polymer_entity_feature_summary.type == "modified_monomer"

# Combine using bitwise operators (&, |, ~, etc)
query = q1 & q2 & q3

# Call the query to execute it
result_list = query()

# Save and print the first ten results
short_result_list = list(result_list)[0:10]
print(short_result_list)
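
The cell above keeps only the first ten IDs. To work through a longer result list, one option is to split it into fixed-size batches and pass each batch to its own Data API query. The sketch below covers only the batching step in plain Python. The helper name `chunked`, the batch size, and the IDs are hypothetical placeholders chosen for illustration; none of them come from either package.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Placeholder IDs, not real search results
example_ids = ["ID01", "ID02", "ID03", "ID04", "ID05"]
batches = list(chunked(example_ids, 2))
print(batches)
```

Each batch could then take the place of `short_result_list` in the next section.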


### Python Data API: Find Information About Structures

Once we have the PDB IDs, we can query them using the Data API for information related to the structure. 

In this case, we will find the following for the first 10 results:
- ID
- Chemical component IDs
- Whether the chemical component is the subject of investigation
- Title of associated publication
- Digital Object Identifier (DOI) if applicable

In [None]:
query = Query(
    input_type="entries",
    input_ids=short_result_list,
    return_data_list=[
        "entries.rcsb_id",
        "rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
        "is_subject_of_investigation",
        "citation.title",
        "citation.pdbx_database_id_DOI"
    ]  
)
query.exec()
pprint(query.get_response())

### Parsing the Result

The result of the request is returned in JSON format. We can refer to the JSON output to understand the data structure and then parse it for the information that is useful to us.
In this case, we will:
- Confirm the subject of investigation and find the ID if it exists (comp_id)
- Find the publication title 
- Construct a link to the publication using the DOI
- Put these data into a dictionary

In [None]:
entries = query.get_response()["data"]["entries"]
output_dict = {}
base_link = "https://doi.org/"

# Iterate through the result of each entry requested
for entry_dict in entries:
    rcsb_id = entry_dict["rcsb_id"]

    # Check for a non-polymer subject of investigation and record its comp_id.
    # Reset for every entry so a value from a previous entry is never reused.
    comp_id = None
    for entity_dict in entry_dict["nonpolymer_entities"] or []:
        for instance_dict in entity_dict["nonpolymer_entity_instances"] or []:
            scores = instance_dict["rcsb_nonpolymer_instance_validation_score"] or []
            if scores and scores[0]["is_subject_of_investigation"] == "Y":
                comp_id = instance_dict["rcsb_nonpolymer_entity_instance_container_identifiers"]["comp_id"]

    # Find publication title
    title = entry_dict["citation"][0]["title"]

    # Construct link from DOI (only exists if paper has been published or is on a preprint server)
    doi = entry_dict["citation"][0]["pdbx_database_id_DOI"]
    doi_link = base_link + doi if doi is not None else ""

    # Add to dictionary
    output_dict[rcsb_id] = {"title": title, "link": doi_link, "subject_of_investigation": comp_id}

pprint(output_dict)
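
The parsing logic above can also be packaged as a function and tried offline, which makes it easier to check how missing values are handled. The sketch below is one way to do that. The function name `summarize_entry` is hypothetical, and `mock_entry` is a hand-written stand-in with placeholder values that mirrors only the keys read by the parsing cell; it is not a real API response.

```python
from typing import Any, Dict, Optional

DOI_BASE = "https://doi.org/"


def summarize_entry(entry: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Collect title, DOI link, and subject-of-investigation ID for one entry."""
    comp_id = None
    # Missing or null lists are treated as empty
    for entity in entry.get("nonpolymer_entities") or []:
        for instance in entity.get("nonpolymer_entity_instances") or []:
            scores = instance.get("rcsb_nonpolymer_instance_validation_score") or []
            if scores and scores[0].get("is_subject_of_investigation") == "Y":
                ids = instance["rcsb_nonpolymer_entity_instance_container_identifiers"]
                comp_id = ids["comp_id"]

    citations = entry.get("citation") or [{}]
    doi = citations[0].get("pdbx_database_id_DOI")
    return {
        "title": citations[0].get("title"),
        "link": DOI_BASE + doi if doi else "",
        "subject_of_investigation": comp_id,
    }


# Hand-written mock with placeholder values (assumption, not real data)
mock_entry = {
    "rcsb_id": "XXXX",
    "nonpolymer_entities": [
        {
            "nonpolymer_entity_instances": [
                {
                    "rcsb_nonpolymer_instance_validation_score": [
                        {"is_subject_of_investigation": "Y"}
                    ],
                    "rcsb_nonpolymer_entity_instance_container_identifiers": {
                        "comp_id": "ABC"
                    },
                }
            ]
        }
    ],
    "citation": [
        {"title": "Placeholder title", "pdbx_database_id_DOI": "10.0000/placeholder"}
    ],
}

summary = summarize_entry(mock_entry)
print(summary)
```

Returning `None` for a missing ligand and an empty string for a missing DOI keeps every entry in the output instead of raising partway through the loop.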

### Try it for yourself
Combining the Search and Data API packages can make programmatic access to RCSB PDB easier than ever!
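
As one possible next step, the parsed results could be saved in a tabular format for sharing. The sketch below renders a dictionary shaped like `output_dict` as CSV text using only the standard library. The function name `to_csv`, the column order, and the row in `example_output` are hypothetical placeholders, not output from the APIs.

```python
import csv
import io

# Placeholder row in the same shape as output_dict (assumption, not real data)
example_output = {
    "XXXX": {
        "title": "Placeholder title",
        "link": "https://doi.org/10.0000/placeholder",
        "subject_of_investigation": "ABC",
    },
}


def to_csv(results: dict) -> str:
    """Render {id: {title, link, subject_of_investigation}} as CSV text."""
    buffer = io.StringIO()
    fields = ["rcsb_id", "subject_of_investigation", "title", "link"]
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for rcsb_id, info in results.items():
        # Merge the ID into the row so it becomes its own column
        writer.writerow({"rcsb_id": rcsb_id, **info})
    return buffer.getvalue()


csv_text = to_csv(example_output)
print(csv_text)
```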