<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RCSB PDB Data API: Quick-start

This quick-start notebook walks through the basics of making queries with this package using a simple example. For more in-depth documentation, see the [readthedocs page](https://py-rcsb-api.readthedocs.io/en/latest/index.html).

Install the package:

```pip install rcsb-api```

In [None]:
%pip install rcsb-api

## Making Queries

To create a Query object, you need to provide three arguments:
- `input_type`: the point where your query begins. Some examples are `entry`, `polymer_entity`, and `polymer_entity_instance`. For a full list of input types, see the [readthedocs](https://rcsbapi.readthedocs.io/en/latest/query_construction.html).
- `input_ids`: the IDs to query, given as a list of PDB-format IDs or as a dictionary.
- `return_data_list`: a list of data items to return. Each item must be a unique field name or a unique path segment (names separated by dots). This is explained further [below](#being-specific).

For more details on input arguments, see [readthedocs: Query Construction](https://rcsbapi.readthedocs.io/en/staging/query_construction.html#query-objects).

To create a Query object requesting all non-polymer components of a structure (ions, cofactors, etc.):

In [27]:
from rcsbapi.data import Schema, Query

query = Query(
    input_type="entry",
    input_ids=["4HHB"],  # input_ids can be dictionaries or lists
    return_data_list=["nonpolymer_bound_components"]  # must be a unique field or a unique path segment
)

You can run this query with `.exec()`:

In [28]:
query.exec()

{'data': {'entry': {'rcsb_id': '4HHB',
   'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM']}}}}
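
The value returned by `.exec()` is a plain nested dictionary, so individual values can be read with ordinary key lookups. The following is a minimal sketch that uses a hand-copied version of the response above in place of a live call; the variable names are illustrative only.

```python
# Copy of the response printed above, used here instead of a live API call.
response = {
    "data": {
        "entry": {
            "rcsb_id": "4HHB",
            "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]},
        }
    }
}

# Walk down the nesting: data -> entry -> rcsb_entry_info -> field.
entry = response["data"]["entry"]
components = entry["rcsb_entry_info"]["nonpolymer_bound_components"]
print(entry["rcsb_id"], components)  # prints: 4HHB ['HEM']
```

Note that the field comes back nested under `rcsb_entry_info` even though only the leaf name was requested.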

To view the GraphQL query in an interactive editor, use `get_editor_link()`:

In [29]:
query.get_editor_link()

'https://data.rcsb.org/graphql/index.html?query=%7B%20entry%28entry_id%3A%20%224HHB%22%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20rcsb_entry_info%7B%0A%20%20%20%20%20%20nonpolymer_bound_components%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A'
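
The editor link carries the GraphQL query as a percent-encoded `query` parameter. If you only want to read the query text without opening a browser, it can be decoded with the standard library. This sketch uses the link copied from the output above; `editor_link` and `graphql_query` are illustrative names.

```python
from urllib.parse import parse_qs, urlparse

# Editor link copied from the output above.
editor_link = (
    "https://data.rcsb.org/graphql/index.html?query=%7B%20entry%28entry_id%3A%20"
    "%224HHB%22%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20rcsb_entry_info%7B%0A"
    "%20%20%20%20%20%20nonpolymer_bound_components%0A%20%20%20%20%20%20%7D%0A"
    "%20%7D%0A%7D%0A"
)

# parse_qs splits the query string and percent-decodes each value.
graphql_query = parse_qs(urlparse(editor_link).query)["query"][0]
print(graphql_query)
```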

## Finding Fields for `return_data_list`
### Being Specific
Some fields must be specified further using a path of multiple field names separated by dots. You can search for the dot notation of a field by using `find_paths(input_type, field_name)`.

In [33]:
from rcsbapi.data import Schema, Query

# "polymer_composition" isn't specific enough, so this raises a ValueError. The ValueError lists up to 10 valid paths.
query = Query(
    input_type="polymer_entity_instance",
    input_ids=["4HHB.A"],
    return_data_list=["polymer_composition"]
)

ValueError: "polymer_composition" exists, but is not a unique field, must specify further.
3 of 3 possible paths:
  polymer_entity.entry.assemblies.interfaces.rcsb_interface_info.polymer_composition
  polymer_entity.entry.assemblies.rcsb_assembly_info.polymer_composition
  polymer_entity.entry.rcsb_entry_info.polymer_composition

For all paths run:
  from rcsbapi.data import Schema
  schema = Schema()
  schema.find_paths("polymer_entity_instance", "polymer_composition")

In [34]:
# run find_paths("polymer_composition")
schema = Schema()
schema.find_paths(input_type="polymer_entity_instance", return_data_name="polymer_composition")

['polymer_entity.entry.assemblies.interfaces.rcsb_interface_info.polymer_composition',
 'polymer_entity.entry.assemblies.rcsb_assembly_info.polymer_composition',
 'polymer_entity.entry.rcsb_entry_info.polymer_composition']

In [35]:
# By looking through the list, find the intended field
query = Query(
    input_type="polymer_entity_instance",
    input_ids=["4HHB.A"],
    return_data_list=["polymer_entity.entry.rcsb_entry_info.polymer_composition"]
)
query.exec()

{'data': {'polymer_entity_instance': {'rcsb_id': '4HHB.A',
   'polymer_entity': {'entry': {'rcsb_entry_info': {'polymer_composition': 'heteromeric protein'}}}}}}
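
The dotted path passed in `return_data_list` mirrors the nesting of the result, so the same string can be reused to read the value back out. This is a minimal sketch using a hand-copied version of the response above; `get_by_path` is a hypothetical helper, not part of the package, and it only handles nested dictionaries (not lists).

```python
from functools import reduce

# Copy of the response printed above, used here instead of a live API call.
response = {
    "data": {
        "polymer_entity_instance": {
            "rcsb_id": "4HHB.A",
            "polymer_entity": {
                "entry": {
                    "rcsb_entry_info": {"polymer_composition": "heteromeric protein"}
                }
            },
        }
    }
}


def get_by_path(record, dotted_path):
    """Follow a dot-separated path through nested dictionaries (hypothetical helper)."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), record)


record = response["data"]["polymer_entity_instance"]
value = get_by_path(record, "polymer_entity.entry.rcsb_entry_info.polymer_composition")
print(value)  # prints: heteromeric protein
```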

### Searching for Fields
If you're unsure which field to use in `return_data_list`, you can call `find_field_names(search_string)`. This method will also return partial matches.

In [36]:
from rcsbapi.data import Schema, Query

schema = Schema()
schema.find_field_names("comp")

['chem_comps',
 'chem_comp',
 'label_comp_id',
 'chem_comp_monomers',
 'chem_comp_nstd_monomers',
 'pdbx_chem_comp_audit',
 'pdbx_chem_comp_descriptor',
 'pdbx_chem_comp_feature',
 'pdbx_chem_comp_identifier',
 'rcsb_chem_comp_annotation',
 'rcsb_chem_comp_container_identifiers',
 'rcsb_chem_comp_descriptor',
 'rcsb_chem_comp_info',
 'rcsb_chem_comp_related',
 'rcsb_chem_comp_synonyms',
 'rcsb_chem_comp_target',
 'mon_nstd_parent_comp_id',
 'pdbx_subcomponent_list',
 'comp_id',
 'component_id',
 'comp_id_1',
 'comp_id_2',
 'chem_comp_id',
 'compound_details',
 'subcomponent_ids',
 'rcsb_comp_model_provenance',
 'rcsb_branched_component_count',
 'beg_comp_id',
 'ligand_comp_id',
 'polymer_composition',
 'nonpolymer_comp',
 'nonpolymer_comp_id',
 'completeness',
 'target_comp_id',
 'pdb_format_compatible',
 'nonpolymer_bound_components',
 'cofactor_chem_comp_id']
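
A short search string can return a long list. Because the result is an ordinary list of strings, it can be narrowed further with a list comprehension. The sketch below uses a hand-picked subset of the names above rather than a live call; the second filter term is just an example.

```python
# A few of the names returned above, copied by hand for illustration.
field_names = [
    "chem_comp",
    "polymer_composition",
    "nonpolymer_comp",
    "nonpolymer_comp_id",
    "completeness",
    "pdb_format_compatible",
    "nonpolymer_bound_components",
]

# Keep only the names that also mention "nonpolymer".
nonpolymer_fields = [name for name in field_names if "nonpolymer" in name]
print(nonpolymer_fields)
# prints: ['nonpolymer_comp', 'nonpolymer_comp_id', 'nonpolymer_bound_components']
```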

### Autocompletion of Queries
If a data item requested in `return_data_list` has fields beneath it, the package adds all of those fields to the query.

This lets you make general requests that return everything under a field (e.g., "exptl"). If you would like a more precise query, you can still request specific fields (e.g., "exptl.method").

In [37]:
from rcsbapi.data import Schema, Query

# requesting "exptl" gets all fields underneath that field
query = Query(
    input_type="entry",
    input_ids=["4HHB"],
    return_data_list=["exptl"]  # requests exptl.crystals_number, exptl.method, etc.
)
query.exec()

{'data': {'entry': {'rcsb_id': '4HHB',
   'exptl': [{'crystals_number': None,
     'method': 'X-RAY DIFFRACTION',
     'method_details': None,
     'details': None}]}}}
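
An autocompleted request can return many sub-fields that are `None`. One way to tidy the result, sketched here on a hand-copied version of the response above, is to drop the empty sub-fields after the fact. Note that `exptl` is a list, so each element is processed in turn.

```python
# Copy of the response printed above, used here instead of a live API call.
response = {
    "data": {
        "entry": {
            "rcsb_id": "4HHB",
            "exptl": [
                {
                    "crystals_number": None,
                    "method": "X-RAY DIFFRACTION",
                    "method_details": None,
                    "details": None,
                }
            ],
        }
    }
}

exptl_items = response["data"]["entry"]["exptl"]

# Collect one sub-field from every element of the list.
methods = [item["method"] for item in exptl_items]

# Drop sub-fields whose value is None.
populated = [
    {key: value for key, value in item.items() if value is not None}
    for item in exptl_items
]
print(methods)    # prints: ['X-RAY DIFFRACTION']
print(populated)  # prints: [{'method': 'X-RAY DIFFRACTION'}]
```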

In [38]:
# to look at the query in an interactive editor
query.get_editor_link()

'https://data.rcsb.org/graphql/index.html?query=%7B%20entry%28entry_id%3A%20%224HHB%22%29%20%7B%0A%20%20rcsb_id%0A%20%20%20%20exptl%7B%0A%20%20%20%20%20%20%20%20crystals_number%0A%20%20%20%20%20%20%20%20method%0A%20%20%20%20%20%20%20%20method_details%0A%20%20%20%20%20%20%20%20details%0A%20%20%20%20%20%20%7D%0A%20%7D%0A%7D%0A'

## More Complex Queries

You can make more complex queries by searching multiple IDs at once, adding more fields to `return_data_list`, or starting at different input types.

See more examples on [readthedocs: Additional Examples](https://rcsbapi.readthedocs.io/en/latest/additional_examples.html).

In [39]:
from rcsbapi.data import Schema, Query

# search multiple IDs and fields. Note that the input_type changed from "entry" to "entries"
query = Query(
    input_type="entries",
    input_ids=["4HHB", "12CA", "3PQR"],
    return_data_list=[
        "nonpolymer_bound_components",
        "citation.title",
        "rcsb_entry_info.polymer_composition"
    ]
)
query.exec()

{'data': {'entries': [{'rcsb_id': '4HHB',
    'rcsb_entry_info': {'nonpolymer_bound_components': ['HEM'],
     'polymer_composition': 'heteromeric protein'},
    'citation': [{'title': 'The crystal structure of human deoxyhaemoglobin at 1.74 A resolution'},
     {'title': 'Stereochemistry of Iron in Deoxyhaemoglobin'},
     {'title': 'Regulation of Oxygen Affinity of Hemoglobin. Influence of Structure of the Globin on the Heme Iron'},
     {'title': 'Three-Dimensional Fourier Synthesis of Human Deoxyhemoglobin at 2.5 Angstroms Resolution, I.X-Ray Analysis'},
     {'title': 'Three-Dimensional Fourier Synthesis of Human Deoxyhaemoglobin at 2.5 Angstroms Resolution, Refinement of the Atomic Model'},
     {'title': 'Three-Dimensional Fourier Synthesis of Human Deoxyhaemoglobin at 3.5 Angstroms Resolution'},
     {'title': None},
     {'title': None},
     {'title': None}]},
   {'rcsb_id': '12CA',
    'rcsb_entry_info': {'nonpolymer_bound_components': ['ZN'],
     'polymer_composition': 'ho
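
With a plural input type such as `entries`, the result holds a list with one dictionary per ID. A common next step is to turn that list into a lookup keyed by `rcsb_id`. This sketch uses an excerpt of the output above, trimmed by hand to one field; the `or []` fallback is an assumption about entries that report no bound components, not documented behavior.

```python
# Excerpt of the response above, trimmed by hand to the fields used here.
response = {
    "data": {
        "entries": [
            {
                "rcsb_id": "4HHB",
                "rcsb_entry_info": {"nonpolymer_bound_components": ["HEM"]},
            },
            {
                "rcsb_id": "12CA",
                "rcsb_entry_info": {"nonpolymer_bound_components": ["ZN"]},
            },
        ]
    }
}

# Map each entry ID to its list of bound components.
# Assumption: a missing value would appear as None, hence the "or []".
components_by_id = {
    entry["rcsb_id"]: entry["rcsb_entry_info"]["nonpolymer_bound_components"] or []
    for entry in response["data"]["entries"]
}
print(components_by_id)  # prints: {'4HHB': ['HEM'], '12CA': ['ZN']}
```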

In [40]:
from rcsbapi.data import Schema, Query

# search from input_type "polymer_entities"
query = Query(
    input_type="polymer_entities",
    input_ids=["2CPK_1", "3WHM_1", "2D5Z_1"],
    return_data_list=[
        "polymer_entities.rcsb_id",
        "rcsb_entity_source_organism.ncbi_taxonomy_id",
        "rcsb_entity_source_organism.ncbi_scientific_name",
        "cluster_id",
        "identity"
    ]
)
query.exec()

{'data': {'polymer_entities': [{'rcsb_id': '2CPK_1',
    'rcsb_entity_source_organism': [{'ncbi_taxonomy_id': 10090,
      'ncbi_scientific_name': 'Mus musculus'}],
    'rcsb_cluster_membership': [{'cluster_id': 1415, 'identity': 100},
     {'cluster_id': 116, 'identity': 95},
     {'cluster_id': 117, 'identity': 90},
     {'cluster_id': 155, 'identity': 70},
     {'cluster_id': 246, 'identity': 50},
     {'cluster_id': 2, 'identity': 30}]},
   {'rcsb_id': '2D5Z_1',
    'rcsb_entity_source_organism': [{'ncbi_taxonomy_id': 9606,
      'ncbi_scientific_name': 'Homo sapiens'}],
    'rcsb_cluster_membership': [{'cluster_id': 109, 'identity': 100},
     {'cluster_id': 115, 'identity': 95},
     {'cluster_id': 108, 'identity': 90},
     {'cluster_id': 56, 'identity': 70},
     {'cluster_id': 24, 'identity': 50},
     {'cluster_id': 35, 'identity': 30}]},
   {'rcsb_id': '3WHM_1',
    'rcsb_entity_source_organism': [{'ncbi_taxonomy_id': 9606,
      'ncbi_scientific_name': 'Homo sapiens'}],
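
In the output above, the results are not in the same order as `input_ids` (`2D5Z_1` appears before `3WHM_1`), so it is safer to key results by `rcsb_id` than to rely on position. The sketch below does that for an excerpt of the output and looks up the sequence cluster at one identity level; `cluster_at` is a hypothetical helper, not part of the package.

```python
# Excerpt of the response above, trimmed by hand to the fields used here.
response = {
    "data": {
        "polymer_entities": [
            {
                "rcsb_id": "2CPK_1",
                "rcsb_cluster_membership": [
                    {"cluster_id": 1415, "identity": 100},
                    {"cluster_id": 116, "identity": 95},
                ],
            },
            {
                "rcsb_id": "2D5Z_1",
                "rcsb_cluster_membership": [
                    {"cluster_id": 109, "identity": 100},
                    {"cluster_id": 115, "identity": 95},
                ],
            },
        ]
    }
}


def cluster_at(entity, identity):
    """Return the cluster_id at the given identity level, or None (hypothetical helper)."""
    for membership in entity["rcsb_cluster_membership"]:
        if membership["identity"] == identity:
            return membership["cluster_id"]
    return None


# Key the results by rcsb_id instead of relying on list order.
clusters_95 = {
    entity["rcsb_id"]: cluster_at(entity, 95)
    for entity in response["data"]["polymer_entities"]
}
print(clusters_95)  # prints: {'2CPK_1': 116, '2D5Z_1': 115}
```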
   

In [41]:
from rcsbapi.data import Schema, Query

# search from input_type "polymer_entity_instances"
query = Query(
    input_type="polymer_entity_instances",
    input_ids=["4HHB.A", "12CA.A", "3PQR.A"],
    return_data_list=[
        "polymer_entity_instances.rcsb_id",
        "rcsb_polymer_instance_annotation.annotation_id",
        "rcsb_polymer_instance_annotation.name",
        "rcsb_polymer_instance_annotation.type"
    ]
)
query.exec()

{'data': {'polymer_entity_instances': [{'rcsb_id': '4HHB.A',
    'rcsb_polymer_instance_annotation': [{'annotation_id': '1.10.490.10',
      'name': 'Globins',
      'type': 'CATH'},
     {'annotation_id': 'd4hhba_',
      'name': 'Hemoglobin, alpha-chain',
      'type': 'SCOP'},
     {'annotation_id': '8039836', 'name': 'Globin-like', 'type': 'SCOP2'},
     {'annotation_id': 'e4hhbA1', 'name': 'Globin', 'type': 'ECOD'}]},
   {'rcsb_id': '12CA.A',
    'rcsb_polymer_instance_annotation': [{'annotation_id': '3.10.200.10',
      'name': 'Alpha carbonic anhydrase',
      'type': 'CATH'},
     {'annotation_id': 'd12caa_',
      'name': 'Carbonic anhydrase',
      'type': 'SCOP'},
     {'annotation_id': '8036258',
      'name': 'Carbonic anhydrase',
      'type': 'SCOP2'},
     {'annotation_id': 'e12caA1', 'name': 'Carb_anhydrase', 'type': 'ECOD'}]},
   {'rcsb_id': '3PQR.A',
    'rcsb_polymer_instance_annotation': [{'annotation_id': '1.20.1070.10',
      'name': 'Rhodopsin 7-helix transmembr
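
Each instance carries a list of annotations from several classification schemes. To compare instances within a single scheme, the nested lists can be regrouped by annotation `type`. This is a sketch on a hand-trimmed excerpt of the output above; the grouping structure is one illustrative choice among many.

```python
from collections import defaultdict

# Excerpt of the response above, trimmed by hand to two annotation types.
response = {
    "data": {
        "polymer_entity_instances": [
            {
                "rcsb_id": "4HHB.A",
                "rcsb_polymer_instance_annotation": [
                    {"annotation_id": "1.10.490.10", "name": "Globins", "type": "CATH"},
                    {"annotation_id": "d4hhba_", "name": "Hemoglobin, alpha-chain", "type": "SCOP"},
                ],
            },
            {
                "rcsb_id": "12CA.A",
                "rcsb_polymer_instance_annotation": [
                    {"annotation_id": "3.10.200.10", "name": "Alpha carbonic anhydrase", "type": "CATH"},
                    {"annotation_id": "d12caa_", "name": "Carbonic anhydrase", "type": "SCOP"},
                ],
            },
        ]
    }
}

# Group (instance ID, annotation name) pairs by annotation type.
annotations_by_type = defaultdict(list)
for instance in response["data"]["polymer_entity_instances"]:
    for annotation in instance["rcsb_polymer_instance_annotation"]:
        annotations_by_type[annotation["type"]].append(
            (instance["rcsb_id"], annotation["name"])
        )

print(annotations_by_type["CATH"])
# prints: [('4HHB.A', 'Globins'), ('12CA.A', 'Alpha carbonic anhydrase')]
```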

In [42]:
from rcsbapi.data import Schema, Query

# search from input_type "uniprot"
query = Query(
    input_type="uniprot",
    input_ids=["P68871"],
    return_data_list=[
        "rcsb_uniprot_annotation"
    ]
)
query.exec()

{'data': {'uniprot': {'rcsb_id': 'P68871',
   'rcsb_uniprot_annotation': [{'assignment_version': None,
     'description': None,
     'type': 'GO',
     'additional_properties': None,
     'annotation_id': 'GO:0005833',
     'provenance_source': 'UNIPROT',
     'annotation_lineage': [{'name': 'protein-containing complex',
       'depth': None,
       'id': 'GO:0032991'},
      {'name': 'cellular_component', 'depth': None, 'id': 'GO:0005575'},
      {'name': 'intracellular anatomical structure',
       'depth': None,
       'id': 'GO:0005622'},
      {'name': 'hemoglobin complex', 'depth': None, 'id': 'GO:0005833'},
      {'name': 'cytoplasm', 'depth': None, 'id': 'GO:0005737'},
      {'name': 'cellular anatomical entity',
       'depth': None,
       'id': 'GO:0110165'},
      {'name': 'cytosol', 'depth': None, 'id': 'GO:0005829'}],
     'name': 'hemoglobin complex'},
    {'assignment_version': None,
     'description': None,
     'type': 'GO',
     'additional_properties': None,
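
Each annotation in this result includes an `annotation_lineage` list of related terms. As a sketch of how to work with it, the snippet below builds an ID-to-name lookup from a hand-trimmed excerpt of the first annotation shown above; the variable names are illustrative.

```python
# Excerpt of the first annotation above, trimmed by hand to three lineage terms.
annotation = {
    "type": "GO",
    "annotation_id": "GO:0005833",
    "name": "hemoglobin complex",
    "annotation_lineage": [
        {"name": "protein-containing complex", "depth": None, "id": "GO:0032991"},
        {"name": "cellular_component", "depth": None, "id": "GO:0005575"},
        {"name": "hemoglobin complex", "depth": None, "id": "GO:0005833"},
    ],
}

# Map each lineage term ID to its name.
lineage_names = {node["id"]: node["name"] for node in annotation["annotation_lineage"]}
print(lineage_names["GO:0005575"])  # prints: cellular_component
```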
     