<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install rcsb-api






In [2]:
from rcsbapi.data import Schema, Query
from pprint import pprint

## RCSB PDB Data API: Quick-start

This quick-start notebook walks through the basics of making queries with this package using a simple example. For more in-depth documentation, see the [readthedocs page](https://py-rcsb-api.readthedocs.io/en/latest/index.html).

Install the package:

```pip install rcsb-api```

In this notebook, we will work with the query below and closely related queries. This GraphQL query requests the bound non-polymer components of a structure (ions, cofactors, etc.).

```
{
  entry(entry_id: "4HHB") {
    rcsb_entry_info {
      nonpolymer_bound_components
    }
  }
}
```
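For comparison, here is a minimal sketch of how the raw query above could be packaged by hand. It assumes the common GraphQL-over-HTTP convention of a JSON body with a single `query` key; `build_payload` is a hypothetical helper, not part of rcsb-api, and the sketch only builds the body without sending it.

```python
import json

# The GraphQL query from above, as a plain string.
GRAPHQL_QUERY = """
{
  entry(entry_id: "4HHB") {
    rcsb_entry_info {
      nonpolymer_bound_components
    }
  }
}
"""


def build_payload(query: str) -> str:
    """Wrap a GraphQL query string in a JSON request body (hypothetical helper)."""
    # Assumption: the server expects {"query": "<graphql text>"} as the POST body.
    return json.dumps({"query": query.strip()})


payload = build_payload(GRAPHQL_QUERY)
print(payload)
```

The rest of this notebook lets the package build and send this request for you.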

## Making Queries

To make the equivalent query with this package, create a `Query` object as shown below.

The `Query` object automatically generates the GraphQL query, and calling `exec()` sends the request to our Data API. The JSON response can then be accessed with the `get_response()` method.

In [3]:
# "entry_id" is the key in input_ids
query = Query(input_ids={"entry_id": "4HHB"}, input_type="entry", return_data_list=["nonpolymer_bound_components"])
query.exec()
pprint(query.get_response())

{'data': {'entry': {'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}
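The response is a nested dictionary, so the requested values sit several keys deep. The sketch below shows one way to pull them out defensively. The `response` literal is copied from the output above; `dig` is a hypothetical helper (not part of rcsb-api), and treating missing or `None` values as "no data" is an assumption about how you may want to handle sparse entries.

```python
# Response copied from the output above.
response = {"data": {"entry": {"rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]}}}}


def dig(obj, *keys, default=None):
    """Follow a chain of dictionary keys; return `default` if a key is missing or the value is None."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj if obj is not None else default


components = dig(
    response, "data", "entry", "rcsb_entry_info", "nonpolymer_bound_components", default=[]
)
print(components)  # ['HEM']
```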


Making a query requires three arguments: `input_ids`, `input_type`, and `return_data_list`.

### input_ids
`input_ids` are accepted as a dictionary or as a list of PDB-format IDs. An `input_ids` dictionary has specific keys that depend on the `input_type` (entry, polymer_entity, etc.). To get the keys associated with an `input_type`, use the `get_input_id_dict(<input_type>)` method.

In [4]:
# a polymer_entity_instance requires multiple keys to identify it
query = Query(input_ids={"entry_id": "4HHB", "asym_id": "A"}, input_type="polymer_entity_instance", return_data_list=["nonpolymer_bound_components"])
query.exec()
pprint(query.get_response())
# Note: this query returns the same information as the entry query above, but it has to
# navigate from the instance back up to the entry to reach it. Using input_type="entry" is more direct.

{'data': {'polymer_entity_instance': {'polymer_entity': {'entry': {'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}}}


In [5]:
# to get the dictionary keys and descriptions for a given input_type, use the get_input_id_dict method
schema = Schema()  # create an instance of the API Schema
pprint(schema.get_input_id_dict("polymer_entity_instance"))

{'asym_id': "ENTITY INSTANCE ID, e.g. 'A', 'B'. Identifies structural element "
            'in the asymmetric unit (_struct_asym.id)',
 'entry_id': 'ID'}
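The key listing above can be used to check an `input_ids` dictionary before building a query. This is a minimal sketch: `expected_keys` is copied from the output above, and `check_input_ids` is a hypothetical helper rather than something the package provides.

```python
# Key descriptions copied from the get_input_id_dict output above.
expected_keys = {
    "asym_id": "ENTITY INSTANCE ID, e.g. 'A', 'B'. Identifies structural element "
               "in the asymmetric unit (_struct_asym.id)",
    "entry_id": "ID",
}


def check_input_ids(input_ids: dict, expected: dict) -> None:
    """Raise ValueError if input_ids is missing keys or has unexpected ones (hypothetical helper)."""
    missing = sorted(set(expected) - set(input_ids))
    extra = sorted(set(input_ids) - set(expected))
    if missing or extra:
        raise ValueError(f"missing keys: {missing}, unexpected keys: {extra}")


check_input_ids({"entry_id": "4HHB", "asym_id": "A"}, expected_keys)  # passes silently

try:
    check_input_ids({"entry_id": "4HHB"}, expected_keys)
except ValueError as err:
    print(err)  # missing keys: ['asym_id'], unexpected keys: []
```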


When `input_ids` is a list, each ID must be in PDB ID format:

| Type | Format | Example |
|---|---|---|
| entry | [entry_id] | 4HHB |
| polymer, branched, or non-polymer entities | [entry_id]_[entity_id] | 4HHB_1 |
| polymer, branched, or non-polymer entity instances | [entry_id].[asym_id] | 4HHB.A |
| biological assemblies | [entry_id]-[assembly_id] | 4HHB-1 |
| interface | [entry_id]-[assembly_id].[interface_id] | 4HHB-1.1 |

The examples below pass `input_ids` as lists and are equivalent to the dictionary examples above. Note that even when there is only one input ID, the argument must be a list, not a string.

In [6]:
query = Query(input_ids=["4HHB"], input_type="entry", return_data_list=["nonpolymer_bound_components"])
query.exec()
pprint(query.get_response())

{'data': {'entry': {'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}


In [7]:
# uses PDB ID format
query = Query(input_ids=["4HHB.A"], input_type="polymer_entity_instance", return_data_list=["nonpolymer_bound_components"])
query.exec()
pprint(query.get_response())

{'data': {'polymer_entity_instance': {'polymer_entity': {'entry': {'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}}}
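To make the relationship between the list form and the dictionary form concrete, here is a sketch that splits a PDB-format ID into named parts, following the example IDs in the table above. `parse_pdb_id` is a hypothetical helper, not part of rcsb-api. The key names `entity_id`, `assembly_id`, and `interface_id` are taken from the table's format column and are assumed, not confirmed, to match the keys the package expects. The sketch also assumes classic four-character entry IDs that contain no separators themselves.

```python
def parse_pdb_id(pdb_id: str) -> dict:
    """Split a PDB-format ID into its parts (hypothetical helper, not part of rcsb-api)."""
    if "_" in pdb_id:  # entity, e.g. 4HHB_1
        entry_id, entity_id = pdb_id.split("_", 1)
        return {"entry_id": entry_id, "entity_id": entity_id}
    if "-" in pdb_id:  # assembly (4HHB-1) or interface (4HHB-1.1)
        entry_id, rest = pdb_id.split("-", 1)
        if "." in rest:
            assembly_id, interface_id = rest.split(".", 1)
            return {"entry_id": entry_id, "assembly_id": assembly_id, "interface_id": interface_id}
        return {"entry_id": entry_id, "assembly_id": rest}
    if "." in pdb_id:  # entity instance, e.g. 4HHB.A
        entry_id, asym_id = pdb_id.split(".", 1)
        return {"entry_id": entry_id, "asym_id": asym_id}
    return {"entry_id": pdb_id}  # plain entry, e.g. 4HHB


print(parse_pdb_id("4HHB.A"))  # {'entry_id': '4HHB', 'asym_id': 'A'}
```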


### input_types
input_types are designated points where you can begin your query. Some examples are entry, polymer_entity, and polymer_entity_instance. You can also begin your search with uniprot or pubmed using their IDs. For a full list of input_types see the [README](https://github.com/rcsb/py-rcsb-api/blob/dev-it-schema-parse/README.md#input_types).

If you're unsure of which input_type would be best and are using a PDB ID (4HHB, 4HHB_1, 4HHB.A, 4HHB-1), you can generally begin at entry. This may produce a more verbose query that can later be refined.

### return_data_list
`return_data_list` lists the fields you are requesting in your query.

There are some fields that must be identified using dot notation ([type].[field_name]). You can search for the dot notation of a field by using the `get_unique_fields(<field name>)` method.

In [8]:
# return_data_list isn't specific enough, so this raises a ValueError
query = Query(input_ids=["4HHB.A"], input_type="polymer_entity_instance", return_data_list=["polymer_composition"])

ValueError: "polymer_composition" exists, but is not a unique field, must specify further. To find valid fields with this name, run: get_unique_fields("polymer_composition")

In [9]:
# run get_unique_fields("polymer_composition")
schema = Schema()
schema.get_unique_fields("polymer_composition")

['rcsb_interface_info.polymer_composition',
 'rcsb_assembly_info.polymer_composition',
 'rcsb_entry_info.polymer_composition']

In [10]:
# By looking through the list, find the intended field
query = Query(input_ids={"entry_id": "4HHB"}, input_type="entry", return_data_list=["rcsb_entry_info.polymer_composition"])
query.exec()
pprint(query.get_response())

{'data': {'entry': {'rcsb_entry_info': {'polymer_composition': 'heteromeric '
                                                               'protein'}}}}
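Picking the intended field out of the candidates can also be done programmatically when you already know which parent type you want. The sketch below works on the list copied from the `get_unique_fields` output above; `pick_field` is a hypothetical helper, not part of rcsb-api.

```python
# Candidates copied from the get_unique_fields output above.
candidates = [
    "rcsb_interface_info.polymer_composition",
    "rcsb_assembly_info.polymer_composition",
    "rcsb_entry_info.polymer_composition",
]


def pick_field(names, parent):
    """Return the single dot-notation field whose parent type is `parent` (hypothetical helper)."""
    matches = [name for name in names if name.split(".", 1)[0] == parent]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one field under {parent!r}, found {matches}")
    return matches[0]


print(pick_field(candidates, "rcsb_entry_info"))  # rcsb_entry_info.polymer_composition
```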


If you're unsure which field to use, you can call `find_field_names(<search string>)` to search for fields and their descriptions. You can also search for fields using incomplete or partially matching names.

In [11]:
schema = Schema()
pprint(schema.find_field_names("polymer_composition"))

{'rcsb_assembly_info.polymer_composition': 'Categories describing the polymer '
                                           'entity composition for the '
                                           'generated assembly.\n'
                                           '\n'
                                           'Allowable values:\n'
                                           'DNA, DNA/RNA, NA-hybrid, '
                                           'NA/oligosaccharide, RNA, '
                                           'heteromeric protein, homomeric '
                                           'protein, oligosaccharide, other, '
                                           'other type composition, other type '
                                           'pair, protein/NA, '
                                           'protein/NA/oligosaccharide, '
                                           'protein/oligosaccharide\n',
 'rcsb_entry_info.polymer_composition': 'Categories describing the polymer '
 

In [12]:
# searching an incomplete field name
schema = Schema()
pprint(schema.find_field_names("comp"))

{'Query.chem_comp': 'Get a chemical component given the CHEMICAL COMPONENT ID, '
                    "e.g. 'CFF', 'HEM', 'FE'.For nucleic acid polymer "
                    'entities, use the one-letter code for the base.',
 'Query.chem_comps': 'Get a list of chemical components given the list of '
                     "CHEMICAL COMPONENT ID, e.g. 'CFF', 'HEM', 'FE'.For "
                     'nucleic acid polymer entities, use the one-letter code '
                     'for the base.',
 'branched_entities.chem_comp_monomers': 'Get all unique monomers described in '
                                         'this branched entity.',
 'branched_entity.chem_comp_monomers': 'Get all unique monomers described in '
                                       'this branched entity.',
 'chem_comp.chem_comp': None,
 'chem_comp.formula': 'The formula for the chemical component. Formulae are '
                      'written\n'
                      ' according to the following rules:\n'
               

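One way to picture partial-name matching is as a case-insensitive substring search over dot-notation field names. The sketch below is only an illustration of that idea, not how `find_field_names` is actually implemented (the package may match differently). The field names are taken from outputs in this notebook, and `search_names` is a hypothetical helper.

```python
# A few dot-notation names that appear in this notebook's outputs.
field_names = [
    "Query.chem_comp",
    "chem_comp.formula",
    "rcsb_entry_info.polymer_composition",
    "citation.title",
]


def search_names(names, text):
    """Return the names containing `text`, ignoring case (illustrative only)."""
    needle = text.lower()
    return [name for name in names if needle in name.lower()]


print(search_names(field_names, "comp"))
# ['Query.chem_comp', 'chem_comp.formula', 'rcsb_entry_info.polymer_composition']
```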
### More Complex Queries

You can make more complex queries by searching multiple IDs at once or by adding more fields to the `return_data_list`.

In [13]:
# search multiple IDs. Note the input_type changed from "entry" to "entries"
query = Query(input_ids={"entry_ids": ["4HHB", "12CA", "3PQR"]}, input_type="entries", return_data_list=["nonpolymer_bound_components"])
query.exec()
pprint(query.get_response())

{'data': {'entries': [{'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}},
                      {'rcsb_entry_info': {'nonpolymer_bound_components': ['ZN']}},
                      {'rcsb_entry_info': {'nonpolymer_bound_components': ['NAG',
                                                                           'PLM',
                                                                           'RET']}}]}}
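With several entries in one response, it is often handy to key the results by entry ID. The sketch below uses the response copied from the output above. It assumes the entries come back in the same order as the requested IDs, which holds for this output but is an assumption rather than a documented guarantee; requesting an identifier field alongside the data would be a safer way to match them up.

```python
entry_ids = ["4HHB", "12CA", "3PQR"]

# Response copied from the output above.
response = {"data": {"entries": [
    {"rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]}},
    {"rcsb_entry_info": {"nonpolymer_bound_components": ["ZN"]}},
    {"rcsb_entry_info": {"nonpolymer_bound_components": ["NAG", "PLM", "RET"]}},
]}}

entries = response["data"]["entries"]
if len(entries) != len(entry_ids):
    raise ValueError("response length does not match the number of requested IDs")

# ASSUMPTION: response order matches request order (not verified against the API docs).
components_by_id = {
    entry_id: entry["rcsb_entry_info"]["nonpolymer_bound_components"]
    for entry_id, entry in zip(entry_ids, entries)
}
print(components_by_id["3PQR"])  # ['NAG', 'PLM', 'RET']
```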


In [14]:
# request multiple fields
query = Query(input_ids={"entry_id": "4HHB"}, input_type="entry", return_data_list=["citation.title", "nonpolymer_bound_components", "rcsb_entry_info.polymer_composition"])
query.exec()
pprint(query.get_response())

{'data': {'entry': {'citation': [{'title': 'The crystal structure of human '
                                           'deoxyhaemoglobin at 1.74 A '
                                           'resolution'},
                                 {'title': 'Stereochemistry of Iron in '
                                           'Deoxyhaemoglobin'},
                                 {'title': 'Regulation of Oxygen Affinity of '
                                           'Hemoglobin. Influence of Structure '
                                           'of the Globin on the Heme Iron'},
                                 {'title': 'Three-Dimensional Fourier '
                                           'Synthesis of Human Deoxyhemoglobin '
                                           'at 2.5 Angstroms Resolution, '
                                           'I.X-Ray Analysis'},
                                 {'title': 'Three-Dimensional Fourier '
                                           'Synthesis