<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RCSB PDB Data API: Quick-start

This quick-start notebook walks through the basics of making queries with this package. For more in-depth documentation, see the [readthedocs page](https://py-rcsb-api.readthedocs.io/en/latest/index.html).

Install the package:

```pip install rcsb-api```

In [None]:
%pip install rcsb-api

## Creating and executing queries

In [79]:
from rcsbapi.data import Schema, Query

query = Query(
    input_type="entry",
    input_ids=["4HHB"],  # input_ids can be a list or a dictionary
    return_data_list=["nonpolymer_bound_components"]  # must be a unique field name or a unique path segment
)

To create a `Query` object, provide three arguments:
- `input_type`: the starting point of the query. Examples include `entry`, `polymer_entity`, and `polymer_entity_instance`. For a full list of input_types, see the [readthedocs](https://rcsbapi.readthedocs.io/en/latest/query_construction.html#input-type).
- `input_ids`: the IDs to query, given as a list of PDB-format IDs or as a dictionary.
- `return_data_list`: a list of data items to return. Each item must be a unique field name or a unique path segment (field names joined with dots). This is explained further [below](#providing-specific-unique-field-namespaths).

For more details on input arguments, see [readthedocs: Query Construction](https://rcsbapi.readthedocs.io/en/latest/query_construction.html#query-objects).

The query above requests all non-polymer components of a structure (ions, cofactors, etc.).

You can run this query with `.exec()`:

In [80]:
return_data = query.exec()
print(return_data)

{'data': {'entry': {'rcsb_id': '4HHB', 'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}
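
The response is a plain nested dictionary, so individual values can be read with ordinary key lookups. The sketch below hard-codes the output printed above instead of calling the API, and assumes a live `query.exec()` call returns the same shape.

```python
# Response copied from the output above (hard-coded so the snippet runs offline).
return_data = {
    "data": {
        "entry": {
            "rcsb_id": "4HHB",
            "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]},
        }
    }
}

# Walk down the same nesting that the query requested.
entry = return_data["data"]["entry"]
components = entry["rcsb_entry_info"]["nonpolymer_bound_components"]
print(f"{entry['rcsb_id']}: {components}")
```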


To view the GraphQL query in an interactive editor, use `get_editor_link()`:

In [81]:
query.get_editor_link()

'https://data.rcsb.org/graphql/index.html?query=%7B%20entry%28entry_id%3A%20%224HHB%22%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20rcsb_entry_info%7B%0A%20%20%20%20%20%20nonpolymer_bound_components%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A'
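
The link carries the GraphQL query as a percent-encoded `query` parameter. If you only want to read the query text without opening a browser, it can be decoded with the standard library. This is a minimal sketch that uses the link printed above; it does not call the package.

```python
from urllib.parse import parse_qs, urlsplit

# Link copied from the output above, split across lines for readability.
editor_link = (
    "https://data.rcsb.org/graphql/index.html?query="
    "%7B%20entry%28entry_id%3A%20%224HHB%22%29%20%7B%0A"
    "%20%20rcsb_id%0A"
    "%20%20%20%20rcsb_entry_info%7B%0A"
    "%20%20%20%20%20%20nonpolymer_bound_components%0A"
    "%20%20%20%20%20%20%7D%0A"
    "%20%7D%0A"
    "%7D%0A"
)

# parse_qs percent-decodes the value of the "query" parameter.
graphql_query = parse_qs(urlsplit(editor_link).query)["query"][0]
print(graphql_query)
```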

### Creating more complex queries

You can search multiple entries by setting `input_type` to "entries". You can also request multiple data items by adding them to `return_data_list`.

In [82]:
from rcsbapi.data import Schema, Query
from pprint import pprint  # for easier-to-read output

# search multiple entries by starting at "entries" instead of "entry"
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=["exptl"]
)
return_data = query.exec()
pprint(return_data)

{'data': {'entries': [{'exptl': [{'crystals_number': None,
                                  'details': None,
                                  'method': 'X-RAY DIFFRACTION',
                                  'method_details': None}],
                       'rcsb_id': '4HHB'},
                      {'exptl': [{'crystals_number': None,
                                  'details': None,
                                  'method': 'X-RAY DIFFRACTION',
                                  'method_details': None}],
                       'rcsb_id': '12CA'},
                      {'exptl': [{'crystals_number': 1,
                                  'details': None,
                                  'method': 'X-RAY DIFFRACTION',
                                  'method_details': None}],
                       'rcsb_id': '3PQR'}]}}
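
With a plural input_type, the results arrive as a list under `data.entries`, one dictionary per ID. The sketch below builds a lookup from entry ID to experimental method. It works on a trimmed, hard-coded copy of the output above and assumes a live response has the same shape.

```python
# Trimmed copy of the response above; only the keys used here are kept.
return_data = {
    "data": {
        "entries": [
            {"rcsb_id": "4HHB", "exptl": [{"method": "X-RAY DIFFRACTION"}]},
            {"rcsb_id": "12CA", "exptl": [{"method": "X-RAY DIFFRACTION"}]},
            {"rcsb_id": "3PQR", "exptl": [{"method": "X-RAY DIFFRACTION"}]},
        ]
    }
}

# "exptl" is a list, so collect every method reported for each entry.
methods_by_id = {
    entry["rcsb_id"]: [exptl["method"] for exptl in entry["exptl"]]
    for entry in return_data["data"]["entries"]
}
print(methods_by_id)
```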


In [83]:
from rcsbapi.data import Schema, Query

# request multiple fields in return_data_list
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=[
        "nonpolymer_bound_components",
        "exptl",
        "citation.title",
        "rcsb_entry_info.polymer_composition"
    ]
)
return_data = query.exec()
pprint(return_data)

{'data': {'entries': [{'citation': [{'title': 'The crystal structure of human '
                                              'deoxyhaemoglobin at 1.74 A '
                                              'resolution'},
                                    {'title': 'Stereochemistry of Iron in '
                                              'Deoxyhaemoglobin'},
                                    {'title': 'Regulation of Oxygen Affinity '
                                              'of Hemoglobin. Influence of '
                                              'Structure of the Globin on the '
                                              'Heme Iron'},
                                    {'title': 'Three-Dimensional Fourier '
                                              'Synthesis of Human '
                                              'Deoxyhemoglobin at 2.5 '
                                              'Angstroms Resolution, I.X-Ray '
                                              'Ana

### Changing query input_type

You can also start queries from various input_types. 


This can result in shorter queries. For example, requesting "polymer_entity_instances" from input_type "polymer_entity" will be a shorter query than starting at input_type "entry". Also, some data are only accessible from certain input_types. For example, "rcsb_uniprot_entry_name" can only be accessed from input_type "uniprot". 

Below are some examples using different input_types.

See more examples on [readthedocs: Additional Examples](https://rcsbapi.readthedocs.io/en/latest/additional_examples.html)
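
Each input_type expects its own ID format. The examples in this notebook use entry IDs such as `4HHB`, polymer entity IDs such as `2CPK_1`, and polymer entity instance IDs such as `4HHB.A`. The helper functions below are hypothetical and not part of rcsb-api; they only sketch how those IDs are composed from an entry ID.

```python
def polymer_entity_id(entry_id: str, entity_number: int) -> str:
    """Hypothetical helper: entry ID and entity number joined by an underscore."""
    return f"{entry_id}_{entity_number}"


def polymer_entity_instance_id(entry_id: str, asym_id: str) -> str:
    """Hypothetical helper: entry ID and instance (chain) label joined by a dot."""
    return f"{entry_id}.{asym_id}"


entity_ids = [polymer_entity_id(e, 1) for e in ["2CPK", "3WHM", "2D5Z"]]
instance_ids = [polymer_entity_instance_id(e, "A") for e in ["4HHB", "12CA", "3PQR"]]
print(entity_ids)
print(instance_ids)
```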

In [84]:
from rcsbapi.data import Schema, Query

# search from input_type "polymer_entities"
query = Query(
    input_type="polymer_entities",
    input_ids=["2CPK_1", "3WHM_1", "2D5Z_1"],
    return_data_list=[
        "polymer_entities.rcsb_id",
        "rcsb_entity_source_organism.ncbi_taxonomy_id",
        "rcsb_entity_source_organism.ncbi_scientific_name",
        "cluster_id",
        "identity"
    ]
)
return_data = query.exec()
pprint(return_data)

{'data': {'polymer_entities': [{'rcsb_cluster_membership': [{'cluster_id': 1415,
                                                             'identity': 100},
                                                            {'cluster_id': 116,
                                                             'identity': 95},
                                                            {'cluster_id': 117,
                                                             'identity': 90},
                                                            {'cluster_id': 155,
                                                             'identity': 70},
                                                            {'cluster_id': 246,
                                                             'identity': 50},
                                                            {'cluster_id': 2,
                                                             'identity': 30}],
                                'rcsb_entity_source

In [85]:
from rcsbapi.data import Schema, Query

# search from input_type "polymer_entity_instances"
query = Query(
    input_type="polymer_entity_instances",
    input_ids=["4HHB.A", "12CA.A", "3PQR.A"],
    return_data_list=[
        "polymer_entity_instances.rcsb_id",
        "rcsb_polymer_instance_annotation.annotation_id",
        "rcsb_polymer_instance_annotation.name",
        "rcsb_polymer_instance_annotation.type"
    ]
)
return_data = query.exec()
pprint(return_data)

{'data': {'polymer_entity_instances': [{'rcsb_id': '4HHB.A',
                                        'rcsb_polymer_instance_annotation': [{'annotation_id': '1.10.490.10',
                                                                              'name': 'Globins',
                                                                              'type': 'CATH'},
                                                                             {'annotation_id': 'd4hhba_',
                                                                              'name': 'Hemoglobin, '
                                                                                      'alpha-chain',
                                                                              'type': 'SCOP'},
                                                                             {'annotation_id': '8039836',
                                                                              'name': 'Globin-like',
                           

In [86]:
from rcsbapi.data import Schema, Query

# search from input_type "uniprot"
query = Query(
    input_type="uniprot",
    input_ids=["P68871"],
    return_data_list=[
        "rcsb_uniprot_annotation"
    ]
)
return_data = query.exec()
pprint(return_data)

{'data': {'uniprot': {'rcsb_id': 'P68871',
                      'rcsb_uniprot_annotation': [{'additional_properties': None,
                                                   'annotation_id': 'GO:0005833',
                                                   'annotation_lineage': [{'depth': None,
                                                                           'id': 'GO:0032991',
                                                                           'name': 'protein-containing '
                                                                                   'complex'},
                                                                          {'depth': None,
                                                                           'id': 'GO:0005575',
                                                                           'name': 'cellular_component'},
                                                                          {'depth': None,
                           

## Determining fields for `return_data_list`

### Autocompletion of nested fields
If a data item requested in `return_data_list` has fields nested under it, the package adds all of those fields to the query.

This lets you make a general request for everything under a field (e.g. "exptl"). For a more precise query, you can still request specific fields (e.g. "exptl.method").

In [87]:
from rcsbapi.data import Schema, Query

# requesting "exptl" gets all fields underneath that field
query = Query(
    input_type="entry",
    input_ids=["4HHB"],
    return_data_list=["exptl"] # requests exptl.crystals_number, exptl.method, etc
)
return_data = query.exec()
pprint(return_data)

{'data': {'entry': {'exptl': [{'crystals_number': None,
                               'details': None,
                               'method': 'X-RAY DIFFRACTION',
                               'method_details': None}],
                    'rcsb_id': '4HHB'}}}


In [88]:
# to look at the query in an interactive editor
query.get_editor_link()

'https://data.rcsb.org/graphql/index.html?query=%7B%20entry%28entry_id%3A%20%224HHB%22%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20exptl%7B%0A%20%20%20%20%20%20%20%20crystals_number%0A%20%20%20%20%20%20%20%20method%0A%20%20%20%20%20%20%20%20method_details%0A%20%20%20%20%20%20%20%20details%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A'
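
The autocompletion described in this section can be pictured with a toy model. The sketch below is not the package's implementation: `toy_schema` and `expand` are made-up names, and the sub-field names are simply the ones shown in the `exptl` output above. It only illustrates how one requested field fans out into dot-separated paths.

```python
# Toy schema fragment: a dict maps a field to its sub-fields, None marks a leaf.
# The sub-field names under "exptl" are taken from the output above.
toy_schema = {
    "exptl": {
        "crystals_number": None,
        "details": None,
        "method": None,
        "method_details": None,
    }
}


def expand(field, subtree):
    """Return dot-separated paths for every leaf under `field`."""
    if subtree is None:
        return [field]
    paths = []
    for name, child in subtree.items():
        paths.extend(expand(f"{field}.{name}", child))
    return paths


expanded = expand("exptl", toy_schema["exptl"])
print(expanded)
```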

### Providing specific (unique) field names/paths
Some field names are not unique and must be specified further with a path of field names separated by dots. You can look up the dot notation for a field with `find_paths(input_type, return_data_name)`.

In [89]:
from rcsbapi.data import Schema, Query

# "polymer_composition" isn't specific enough, throws a ValueError. The ValueError will list up to 10 valid paths.
query = Query(
    input_type="polymer_entity_instance",
    input_ids=["4HHB.A"],
    return_data_list=["polymer_composition"]
)

ValueError: "polymer_composition" exists, but is not a unique field, must specify further.
3 of 3 possible paths:
  polymer_entity.entry.assemblies.interfaces.rcsb_interface_info.polymer_composition
  polymer_entity.entry.assemblies.rcsb_assembly_info.polymer_composition
  polymer_entity.entry.rcsb_entry_info.polymer_composition

For all paths run:
  from rcsbapi.data import Schema
  schema = Schema()
  schema.find_paths("polymer_entity_instance", "polymer_composition")

In [None]:
# run find_paths
schema = Schema()
schema.find_paths(input_type="polymer_entity_instance", return_data_name="polymer_composition")

['polymer_entity.entry.assemblies.interfaces.rcsb_interface_info.polymer_composition',
 'polymer_entity.entry.assemblies.rcsb_assembly_info.polymer_composition',
 'polymer_entity.entry.rcsb_entry_info.polymer_composition']
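
When many paths come back, a plain list comprehension can narrow them down. The sketch below hard-codes the three paths returned above and keeps only those that pass through `rcsb_entry_info`; the filtering is ordinary Python and not a feature of the package.

```python
# Paths copied from the find_paths output above.
paths = [
    "polymer_entity.entry.assemblies.interfaces.rcsb_interface_info.polymer_composition",
    "polymer_entity.entry.assemblies.rcsb_assembly_info.polymer_composition",
    "polymer_entity.entry.rcsb_entry_info.polymer_composition",
]

# Keep only the entry-level path.
entry_level = [p for p in paths if p.endswith("rcsb_entry_info.polymer_composition")]
print(entry_level)
```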

In [None]:
# By looking through the list, find the intended field
query = Query(
    input_type="polymer_entity_instance",
    input_ids=["4HHB.A"],
    return_data_list=["polymer_entity.entry.rcsb_entry_info.polymer_composition"]
)
return_data = query.exec()
pprint(return_data)

{'data': {'polymer_entity_instance': {'polymer_entity': {'entry': {'rcsb_entry_info': {'polymer_composition': 'heteromeric '
                                                                                                              'protein'}}},
                                      'rcsb_id': '4HHB.A'}}}
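
The dot path used in `return_data_list` mirrors the nesting of the result. As a sketch, the same path string can be followed through the response to reach the value. `get_by_path` is a hypothetical helper, not part of rcsb-api, and it assumes every step along the path is a dictionary (it would need extra handling for lists such as `exptl`).

```python
from functools import reduce

# Response copied from the output above.
return_data = {
    "data": {
        "polymer_entity_instance": {
            "rcsb_id": "4HHB.A",
            "polymer_entity": {
                "entry": {"rcsb_entry_info": {"polymer_composition": "heteromeric protein"}}
            },
        }
    }
}


def get_by_path(node, dotted_path):
    """Follow a dot-separated path through nested dictionaries."""
    return reduce(lambda current, key: current[key], dotted_path.split("."), node)


root = return_data["data"]["polymer_entity_instance"]
composition = get_by_path(root, "polymer_entity.entry.rcsb_entry_info.polymer_composition")
print(composition)
```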


### Searching for field names
If you're unsure which field to use in `return_data_list`, you can call `find_field_names(search_string)`. This method will also return partial matches.

In [None]:
from rcsbapi.data import Schema, Query

schema = Schema()
schema.find_field_names("comp")

['chem_comps',
 'chem_comp',
 'label_comp_id',
 'chem_comp_monomers',
 'chem_comp_nstd_monomers',
 'pdbx_chem_comp_audit',
 'pdbx_chem_comp_descriptor',
 'pdbx_chem_comp_feature',
 'pdbx_chem_comp_identifier',
 'rcsb_chem_comp_annotation',
 'rcsb_chem_comp_container_identifiers',
 'rcsb_chem_comp_descriptor',
 'rcsb_chem_comp_info',
 'rcsb_chem_comp_related',
 'rcsb_chem_comp_synonyms',
 'rcsb_chem_comp_target',
 'mon_nstd_parent_comp_id',
 'pdbx_subcomponent_list',
 'comp_id',
 'component_id',
 'comp_id_1',
 'comp_id_2',
 'chem_comp_id',
 'compound_details',
 'subcomponent_ids',
 'rcsb_comp_model_provenance',
 'rcsb_branched_component_count',
 'beg_comp_id',
 'ligand_comp_id',
 'polymer_composition',
 'nonpolymer_comp',
 'nonpolymer_comp_id',
 'completeness',
 'target_comp_id',
 'pdb_format_compatible',
 'nonpolymer_bound_components',
 'cofactor_chem_comp_id']
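
Because partial matches are included, a short search string can return a long list. One way to narrow it is to filter the result in plain Python. The sketch below uses a hard-coded subset of the names returned above, not a live call.

```python
# Subset of the names returned by find_field_names("comp") above.
field_names = [
    "chem_comp",
    "comp_id",
    "polymer_composition",
    "nonpolymer_comp",
    "nonpolymer_comp_id",
    "completeness",
    "nonpolymer_bound_components",
]

# Keep only the names that refer to non-polymer data.
nonpolymer_fields = [name for name in field_names if name.startswith("nonpolymer")]
print(nonpolymer_fields)
```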