# rcsbsearchapi quickstart

This notebook contains examples from the rcsbsearchapi [quickstart](https://rcsbsearchapi.readthedocs.io/en/latest/quickstart.html)

In [1]:
from rcsbsearchapi.search import TextQuery, AttributeQuery, SequenceQuery, SeqMotifQuery, StructSimilarityQuery, Attr
from rcsbsearchapi import rcsb_attributes as attrs

## Operator syntax

Here is an example from the [RCSB PDB Search API](http://search.rcsb.org/#search-example-1) page, using the operator syntax. This query finds symmetric dimers having a twofold rotation with the DNA-binding domain of a heat-shock transcription factor. Full text and attribute search are used.

Note the use of standard comparison operators (`==`, `>`, etc.) for RCSB attributes and bitwise operators (`&`, `|`, `~`) for combining queries.

In [2]:
# Create terminals for each query
q1 = TextQuery("heat-shock transcription factor")
q2 = attrs.rcsb_struct_symmetry.symbol == "C2"
q3 = attrs.rcsb_struct_symmetry.kind == "Global Symmetry"
q4 = attrs.rcsb_entry_info.polymer_entity_count_DNA >= 1

# combined using bitwise operators (&, |, ~, etc)
query = q1 & (q2 & q3 & q4) # AND of all queries

# Call the query to execute it
for assemblyid in query("assembly"): # return type specified as "assembly"
    print(assemblyid)


1FYL-1
1FYL-2
1FYM-1
1FYK-1
3HTS-1
5D8K-1
5D8L-1
5D8L-2
5D5W-1
7DCI-1
5D5X-1
7DCT-1
7DCT-2
5D5U-1
7DCJ-1
5HDN-1
5HDN-2
5D5V-1
7LBX-1
7LBW-1
4NNU-1
8HKC-1
3FYL-1
3G6P-1
3G6Q-1
3G6Q-2
3G6R-1
3G6T-1
3G6U-1
3G8U-1
3G8X-1
3G97-1
3G99-1
3G9I-1
3G9I-2
3G9J-1
3G9M-1
3G9O-1
3G9P-1
1GLU-1
1LAT-1
1R4O-1
1R4R-1
7UBM-1
6XAS-1
6XAV-1
5J5Q-1
5J5Q-2
1NFK-1
8FTD-1
8JO2-1
8TO1-1
8TO6-1
8TO8-1
8TOE-1
8TOM-1
8TKL-1
8TKM-1
8TKN-1
5NSS-1
6PSQ-1
6PSR-1
6PSS-1
6PST-1
6PSU-1
8U3B-1
6XL9-1
6K4Y-1
5H3R-1
6JNX-1
6PSV-1
6PSW-1
6XL5-1
6XLL-1
6OUL-1
7CHW-1
6XLM-1
7KHB-1
6CA0-1
7MKD-1
7MKE-1
7MKI-1
7MKJ-1
6N62-1
7N4E-1
7SZJ-1
7SZK-1
8AD1-1
7XUI-1
7KHI-1
8Y6U-1
6LDI-1
6XH8-1
7C17-1
6XH7-1
7KHE-1
3IYD-1
6KOO-1
6KOP-1
5ZX2-1
6JCX-1
6KON-1
6KOQ-1
6P18-1
6XLJ-1
7DY6-1
6WMU-1
7C97-1
5NSR-1
6GH5-1
6GH6-1
6JCY-1
7UIK-1
6B6H-1
6RI7-1
2V2T-1
6PB4-1
6PB5-1
6PB6-1
6RIN-1
3N97-1
4YLP-1
4YLP-2
4YLP-3
6OMF-1
6N60-1
6N61-1
6UU0-1
6UU2-1
6UU4-1
6UUC-1
5EMC-1
5EMP-1
5EMQ-1
6JBQ-1
6UTZ-1
6UU1-1
6UU3-1
6UU9-1
6UUA-1
6KJ6-1
6UTW-1
6UTX-1
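To make the operator syntax less magical, here is a minimal, library-independent sketch of how `&`, `|`, and `~` can be overloaded to build a query tree. The classes `Node`, `Term`, `Group`, and `Not` are hypothetical names invented for this illustration; they are not rcsbsearchapi classes, and the library's own implementation may differ.

```python
from dataclasses import dataclass
from typing import Tuple


class Node:
    """Base class: gives every query node the &, | and ~ operators."""

    def __and__(self, other: "Node") -> "Group":
        return Group("and", (self, other))

    def __or__(self, other: "Node") -> "Group":
        return Group("or", (self, other))

    def __invert__(self) -> "Not":
        return Not(self)


@dataclass(frozen=True)
class Term(Node):
    """A leaf of the tree, e.g. one text or attribute condition."""
    label: str

    def describe(self) -> str:
        return self.label


@dataclass(frozen=True)
class Group(Node):
    """Two child nodes joined by a logical operator."""
    op: str
    children: Tuple[Node, ...]

    def describe(self) -> str:
        inner = f" {self.op.upper()} ".join(c.describe() for c in self.children)
        return f"({inner})"


@dataclass(frozen=True)
class Not(Node):
    """Negation of a single child node."""
    child: Node

    def describe(self) -> str:
        return f"NOT {self.child.describe()}"


text = Term("text")
symbol = Term("symbol == C2")
kind = Term("kind == Global Symmetry")
dna = Term("DNA count >= 1")

# Same shape as `q1 & (q2 & q3 & q4)` in the cell above.
combined = text & (symbol & kind & dna)
print(combined.describe())
print((~text | dna).describe())
```

Because `~` binds more tightly than `&`, and `&` more tightly than `|`, parentheses are worth adding whenever operators are mixed.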

Attribute names can be found in the [RCSB PDB schema](http://search.rcsb.org/rcsbsearch/v2/metadata/schema). They can also be found via tab completion, or by iterating:

In [3]:
[a.attribute for a in attrs if "authors" in a.attribute]

['citation.rcsb_authors',
 'pdbx_nmr_software.authors',
 'rcsb_primary_citation.rcsb_authors',
 'rcsb_bird_citation.rcsb_authors']

## Fluent syntax

Here is the same example using the [fluent](https://en.wikipedia.org/wiki/Fluent_interface) syntax:

In [6]:
# Start with an Attr or TextQuery, then add terms
results = TextQuery("heat-shock transcription factor").and_(
    # Add attribute node as fully-formed AttributeQuery
    AttributeQuery(attribute="rcsb_struct_symmetry.symbol", operator="exact_match", value="C2") \
    # Add attribute node as Attr with chained operations
    .and_(Attr("rcsb_struct_symmetry.kind", "text")).exact_match("Global Symmetry") \
    # Add attribute node by name (converted to Attr) with chained operations
    .and_("rcsb_entry_info.polymer_entity_count_DNA").greater_or_equal(1)
    ).exec("assembly")
# Exec produces an iterator of IDs

for assemblyid in results:
    print(assemblyid)

1FYL-1
1FYL-2
1FYM-1
1FYK-1
3HTS-1
5D8K-1
5D8L-1
5D8L-2
5D5W-1
7DCI-1
5D5X-1
7DCT-1
7DCT-2
5D5U-1
7DCJ-1
5HDN-1
5HDN-2
5D5V-1
7LBX-1
7LBW-1
4NNU-1
8HKC-1
3FYL-1
3G6P-1
3G6Q-1
3G6Q-2
3G6R-1
3G6T-1
3G6U-1
3G8U-1
3G8X-1
3G97-1
3G99-1
3G9I-1
3G9I-2
3G9J-1
3G9M-1
3G9O-1
3G9P-1
1GLU-1
1LAT-1
1R4O-1
1R4R-1
7UBM-1
6XAS-1
6XAV-1
5J5Q-1
5J5Q-2
1NFK-1
8FTD-1
8JO2-1
8TO1-1
8TO6-1
8TO8-1
8TOE-1
8TOM-1
8TKL-1
8TKM-1
8TKN-1
5NSS-1
6PSQ-1
6PSR-1
6PSS-1
6PST-1
6PSU-1
8U3B-1
6XL9-1
6K4Y-1
5H3R-1
6JNX-1
6PSV-1
6PSW-1
6XL5-1
6XLL-1
6OUL-1
7CHW-1
6XLM-1
7KHB-1
6CA0-1
7MKD-1
7MKE-1
7MKI-1
7MKJ-1
6N62-1
7N4E-1
7SZJ-1
7SZK-1
8AD1-1
7XUI-1
7KHI-1
8Y6U-1
6LDI-1
6XH8-1
7C17-1
6XH7-1
7KHE-1
3IYD-1
6KOO-1
6KOP-1
5ZX2-1
6JCX-1
6KON-1
6KOQ-1
6P18-1
6XLJ-1
7DY6-1
6WMU-1
7C97-1
5NSR-1
6GH5-1
6GH6-1
6JCY-1
7UIK-1
6B6H-1
6RI7-1
2V2T-1
6PB4-1
6PB5-1
6PB6-1
6RIN-1
3N97-1
4YLP-1
4YLP-2
4YLP-3
6OMF-1
6N60-1
6N61-1
6UU0-1
6UU2-1
6UU4-1
6UUC-1
5EMC-1
5EMP-1
5EMQ-1
6JBQ-1
6UTZ-1
6UU1-1
6UU3-1
6UU9-1
6UUA-1
6KJ6-1
6UTW-1
6UTX-1
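The fluent style relies on each method returning a query object, so calls can be chained. The sketch below shows that idea with a hypothetical `FluentQuery` class written for this illustration only; it does not reproduce the rcsbsearchapi classes or contact the search service.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FluentQuery:
    """Toy query that accumulates conditions through chained calls."""

    conditions: Tuple[str, ...] = ()

    def and_(self, condition: str) -> "FluentQuery":
        # Return a new object instead of mutating, so partial chains can be reused.
        return FluentQuery(self.conditions + (condition,))

    def describe(self) -> str:
        return " AND ".join(self.conditions)


# Wrapping the chain in parentheses avoids backslash line continuations.
q = (
    FluentQuery()
    .and_("text: heat-shock transcription factor")
    .and_("symbol == C2")
    .and_("kind == Global Symmetry")
)
print(q.describe())
```

Returning a fresh object from every call keeps intermediate queries intact, which is one common design choice for fluent interfaces.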

## Attribute search

Structural Attributes and Chemical Attributes can be searched using `AttributeQuery`s. Whether an attribute is structural or chemical is determined automatically.

More details on available attributes can be found on the [RCSB PDB Search API](https://search.rcsb.org/#search-attributes) page.

In [7]:
# Structure attribute search
q1 = AttributeQuery("exptl.method", "exact_match", "electron microscopy")
# Chemical attribute search
q2 = AttributeQuery("drugbank_info.brand_names", "contains_phrase", "tylenol")

query = q1 & q2 # combining queries

list(query())


['5T9V',
 '5TA3',
 '5TAL',
 '5TAM',
 '5TAN',
 '5TAP',
 '5TAQ',
 '5TAS',
 '5TAT',
 '5TAU',
 '5TAV',
 '6JHN',
 '6JI0',
 '6JII',
 '6JIU',
 '6JIY',
 '6JRR',
 '6JRS',
 '6M2W',
 '7M6A',
 '7M6L',
 '7TZC',
 '8DRP',
 '8DTB',
 '8DUJ',
 '8DVE',
 '8VJK',
 '8VK4',
 '8ET7',
 '8X63',
 '7YTW',
 '8JEW']

## Computed Structure Models

The [RCSB PDB Search API](https://search.rcsb.org/#results_content_type) page explains how to include Computed Structure Models (CSMs) in a search query. The example below shows how.

This query returns IDs for experimental and computed structure models associated with "hemoglobin". Queries for *only* computed models or *only* experimental models (the default) can also be made.

In [8]:
q1 = TextQuery("hemoglobin")

# pass return_content_type as a list containing "computational", "experimental", or both
q2 = q1(return_content_type=["computational", "experimental"])

list(q2)

['2PGH',
 '3PEL',
 '3GOU',
 '1NGK',
 '6IHX',
 '2ZFB',
 '3WR1',
 '4YU3',
 '1G08',
 '1G09',
 '1G0A',
 '1XQ5',
 '2QSP',
 '5C6E',
 '3CIU',
 '1V75',
 '1WMU',
 '2QLS',
 '3PI8',
 '3PI9',
 '3PIA',
 '1FSX',
 '1QPW',
 '3GKV',
 '6II1',
 '2H8D',
 '2H8F',
 '8WIY',
 '1HV4',
 '1G0B',
 '2PEG',
 '2ZLU',
 '4YU4',
 '1FHJ',
 '1HBR',
 '2AA1',
 '2D5X',
 '2QSS',
 '2ZLW',
 '3D4X',
 '3FS4',
 '3GQG',
 '3NFE',
 '3NG6',
 '6R2O',
 '1GCV',
 '1GCW',
 '1C40',
 '3MJU',
 '8PUQ',
 '1A4F',
 '1LA6',
 '2Z6N',
 '3K8B',
 '1FAW',
 '1HDS',
 '1V4U',
 '1V4W',
 '1V4X',
 '3GQP',
 '4G51',
 '2RAO',
 '5LFG',
 '2GTL',
 'AF_AFP07409F1',
 'AF_AFQ9YGW1F1',
 '1IBE',
 '1SPG',
 '8PUR',
 '8WIX',
 '8WIZ',
 '1HDA',
 '1S5X',
 '1T1N',
 '3EOK',
 '6SVA',
 '1IWH',
 '1S5Y',
 '2ZLT',
 '2ZLV',
 '3EU1',
 '3WTG',
 '6RP5',
 '2ZLX',
 '3MJP',
 '2QRW',
 '1NS6',
 '1NS9',
 '2B7H',
 '3D1A',
 '3GDJ',
 '6ZMX',
 '3CY5',
 '4IRO',
 'AF_AFP08422F1',
 'AF_AFB0M2T2F1',
 'AF_AFB2KHZ4F1',
 'AF_AFB3EWC7F1',
 'AF_AFB3EWC8F1',
 'AF_AFB3EWC9F1',
 'AF_AFB3EWD0F1',
 'AF_AFB3E
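When both content types are requested, the result list mixes the two kinds of identifier. The sketch below separates them after the fact. It assumes, based only on the output above, that experimental entries use classic four-character PDB IDs starting with a digit and that everything else (such as the `AF_`-prefixed IDs) is a computed model. The helper name `split_by_origin` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: classic PDB IDs are four characters and start with a digit.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def split_by_origin(ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (experimental_ids, computed_ids), preserving input order."""
    experimental: List[str] = []
    computed: List[str] = []
    for identifier in ids:
        if PDB_ID.match(identifier):
            experimental.append(identifier)
        else:
            computed.append(identifier)
    return experimental, computed


# A few identifiers copied from the output above.
sample = ["2GTL", "AF_AFP07409F1", "AF_AFQ9YGW1F1", "1IBE"]
experimental, computed = split_by_origin(sample)
print(experimental)
print(computed)
```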

## Return Types and Attribute Search

A search query can return different result types when a return type is specified. Below are examples of specifying the return types Polymer Entities, Non-polymer Entities, Polymer Instances, and Molecular Definitions, using a Structure Attribute query. More information on return types can be found on the [RCSB PDB Search API](https://search.rcsb.org/#building-search-request) page.

In [9]:
q1 = AttributeQuery("rcsb_entry_container_identifiers.entry_id", "in", ["4HHB"]) # query for 4HHB deoxyhemoglobin

print("Polymer Entities:")
for poly in q1("polymer_entity"): # include return type as a string parameter for query object
    print(poly)

print("Non-polymer Entities:")
for nonPoly in q1("non_polymer_entity"):
    print(nonPoly)

print("Polymer Instances:")
for polyInst in q1("polymer_instance"):
    print(polyInst)

print("Molecular Definitions:")
for mol in q1("mol_definition"):
    print(mol)

Polymer Entities:
4HHB_1
4HHB_2
Non-polymer Entities:
4HHB_3
4HHB_4
Polymer Instances:
4HHB.A
4HHB.B
4HHB.C
4HHB.D
Molecular Definitions:
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HEM
HIS
LEU
LYS
MET
PHE
PO4
PRO
SER
THR
TRP
TYR
VAL
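The identifiers printed above follow a visible pattern: entities use an underscore (`4HHB_1`), instances use a dot (`4HHB.A`), and assemblies, seen in earlier cells, use a hyphen (`4HHB-1`). The sketch below splits such identifiers into their parts. The function name `split_identifier` and the returned labels are assumptions made for this illustration, and the function is only meant for experimental entry IDs (computed model IDs such as `AF_...` contain an underscore of their own).

```python
from typing import Tuple


def split_identifier(identifier: str) -> Tuple[str, str, str]:
    """Return (entry_id, kind, suffix) for an ID in the formats printed above."""
    for separator, kind in (("_", "entity"), (".", "instance"), ("-", "assembly")):
        entry_id, found, suffix = identifier.partition(separator)
        if found:
            return entry_id, kind, suffix
    # No separator: a plain entry ID or a molecular definition such as "HEM".
    return identifier, "entry_or_molecule", ""


for example in ["4HHB_1", "4HHB.A", "4HHB-1", "HEM"]:
    print(example, "->", split_identifier(example))
```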


## Counting Results

If only the number of results is desired, the `return_counts` request_option can be used. This query returns the number of experimental models associated with "hemoglobin".

In [10]:
q1 = TextQuery("hemoglobin")

# Set return_counts to True at execution
q1(return_counts=True)

3654

## Obtaining Scores for Results

Results can be returned alongside additional metadata, including result scores. To return this metadata, set the `results_verbosity` parameter to "verbose" (all metadata), "minimal" (scores only), or "compact" (default, no metadata). If set to "verbose" or "minimal", results will be returned as a list of dictionaries. For example, here we get all experimental models associated with "hemoglobin", along with their scores.

In [11]:
q1 = TextQuery("hemoglobin")
for idscore in list(q1(results_verbosity="minimal")):
    print(idscore)

{'identifier': '2PGH', 'score': 1.0}
{'identifier': '3PEL', 'score': 0.9994659834960687}
{'identifier': '3GOU', 'score': 0.9988801157942307}
{'identifier': '1NGK', 'score': 0.9983549953291561}
{'identifier': '6IHX', 'score': 0.9983549953291561}
{'identifier': '2ZFB', 'score': 0.998232992510552}
{'identifier': '3WR1', 'score': 0.998232992510552}
{'identifier': '4YU3', 'score': 0.998232992510552}
{'identifier': '1G08', 'score': 0.9977007552143919}
{'identifier': '1G09', 'score': 0.9977007552143919}
{'identifier': '1G0A', 'score': 0.9977007552143919}
{'identifier': '1XQ5', 'score': 0.9977007552143919}
{'identifier': '2QSP', 'score': 0.9977007552143919}
{'identifier': '5C6E', 'score': 0.9977007552143919}
{'identifier': '3CIU', 'score': 0.997225028946016}
{'identifier': '1V75', 'score': 0.9969287484900033}
{'identifier': '1WMU', 'score': 0.9969287484900033}
{'identifier': '2QLS', 'score': 0.9964570466201575}
{'identifier': '3PI8', 'score': 0.9964570466201575}
{'identifier': '3PI9', 'score':
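Since "minimal" verbosity yields a list of dictionaries with `identifier` and `score` keys, ordinary Python is enough to post-process the results. The sketch below keeps only hits at or above a score threshold. The sample data is copied from the output above; the helper `ids_above` and the threshold value are assumptions chosen for illustration.

```python
from typing import Dict, List

# A few hits copied from the output above.
hits: List[Dict] = [
    {"identifier": "2PGH", "score": 1.0},
    {"identifier": "3PEL", "score": 0.9994659834960687},
    {"identifier": "3GOU", "score": 0.9988801157942307},
    {"identifier": "1NGK", "score": 0.9983549953291561},
]


def ids_above(hits: List[Dict], threshold: float) -> List[str]:
    """Return identifiers whose score is at least `threshold`, in input order."""
    return [hit["identifier"] for hit in hits if hit["score"] >= threshold]


top = ids_above(hits, 0.999)
print(top)
```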

## Sequence Query

Below is an example from the [RCSB PDB Search API](https://search.rcsb.org/#search-example-3) page. Queries can be made using DNA, RNA, and protein sequences with the `SequenceQuery` class. In this example, we are finding macromolecular PDB entities that share 90% sequence identity with GTPase HRas protein from *Gallus gallus* (chicken).

In [12]:
# Use the SequenceQuery class; the positional arguments after the sequence
# are the E-value cutoff (1) and the identity cutoff (0.9)
results = SequenceQuery("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGET" +
                        "CLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQI" +
                        "KRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQ" +
                        "GVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS", 1, 0.9)

# results("polymer_entity") produces an iterator of IDs with return type - polymer entities
for polyid in results("polymer_entity"):
    print(polyid)

4Q21_1
8ELK_1
8ELR_1
8ELS_1
8ELT_1
8ELU_1
8ELV_1
8ELW_1
8ELX_1
8ELY_1
8ELZ_1
8EM0_1
5X9S_1
1AA9_1
1IOZ_1
1Q21_1
6Q21_1
7VV9_1
2Q21_1
6AMB_1
6KYH_2
1LFD_2
121P_1
1BKD_1
1CRP_1
1CRQ_1
1CRR_1
1CTQ_1
1GNP_1
1GNQ_1
1GNR_1
1K8R_1
1NVV_2
1NVW_1
1P2S_1
1P2T_1
1P2U_1
1P2V_1
1QRA_1
1WQ1_1
1XCM_1
1XD2_2
2RGE_1
3K8Y_1
3KUD_1
3L8Y_1
3L8Z_1
3LBH_1
3LBI_1
3LBN_1
3RRY_1
3RRZ_1
3RS0_1
3RS2_1
3RS3_1
3RS4_1
3RS5_1
3RS7_1
3RSL_1
3RSO_1
3TGP_1
4DLS_1
4DLT_1
4DLU_1
4DLW_1
4EFL_1
4G0N_1
4NYI_2
4NYJ_2
4NYM_2
4RSG_1
4URU_1
4URV_1
4URW_1
4URX_1
4URY_1
4URZ_1
4US0_1
4US1_1
4US2_1
5B2Z_1
5B30_1
5E95_2
5P21_1
5WDO_1
5WFO_1
5WFP_1
5WFQ_1
5WFR_1
6AXG_2
6BVI_3
6BVJ_1
6BVJ_3
6BVK_3
6BVL_3
6BVM_3
6CUO_3
6CUP_3
6CUR_3
6D55_3
6D56_3
6D59_3
6D5E_3
6D5G_3
6D5H_3
6D5J_3
6D5L_3
6D5M_2
6D5V_2
6D5W_2
6V94_3
6V9F_3
6V9J_3
6V9L_3
6V9M_3
6V9N_3
6V9O_3
6ZL3_1
7L0F_1
7OG9_1
7OGA_1
7OGB_1
7OGC_1
8BE6_1
8BE7_1
8BE8_1
8BE9_1
8BEA_1
8BOS_1
8BWG_1
8CNJ_1
8CNN_1
8OSM_1
8OSN_1
8OSO_1
8TBG_1
1IAQ_1
1LF0_1
1LF5_1
1NVU_1
1NVX_1
221P_1
2LCF_1

## Sequence Motif Query

Below is an example from the [RCSB PDB Search API](https://search.rcsb.org/#search-example-6) page, using the sequence motif search function. 
This query retrieves occurrences of the His2/Cys2 Zinc Finger DNA-binding domain as represented by its PROSITE signature.

In [13]:
# Use SeqMotifQuery class and add parameters
results = SeqMotifQuery("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.",
                        pattern_type="prosite",
                        sequence_type="protein")

# results("polymer_entity") produces an iterator of IDs with return type - polymer entities
for polyid in results("polymer_entity"):
    print(polyid)

1A1F_3
1A1G_3
1A1H_3
1A1I_3
1A1J_3
1A1K_3
1A1L_3
1AAY_3
1ARD_1
1ARE_1
1ARF_1
1BBO_1
1BHI_1
1E39_1
1EJ6_2
1F2I_2
1FN9_1
1G2D_3
1G2F_3
1JK1_3
1JK2_3
1JMU_3
1JN7_1
1JRX_1
1JRY_1
1JRZ_1
1KSS_1
1KSU_1
1LJ1_1
1LLM_2
1M64_1
1MEY_3
1NCS_1
1NJQ_1
1NX9_1
1P2E_1
1P2H_1
1P47_3
1P7A_1
1PAA_1
1Q9I_1
1QJD_1
1RIK_1
1RIM_1
1RYY_1
1SP1_1
1SP2_1
1SRK_1
1TF3_3
1TF6_3
1U85_1
1U86_1
1UBD_3
1UN6_1
1VA1_1
1VA2_1
1VA3_1
1WIR_1
1WJP_1
1X3C_1
1X5W_1
1X6E_1
1X6F_1
1X6H_1
1XF7_1
1Y0P_1
1YUI_3
1YUJ_3
1Z60_1
1ZAA_3
1ZFD_1
1ZNF_1
1ZNM_1
1ZR9_1
1ZU1_1
2AB3_1
2AB7_1
2ADR_1
2B4K_1
2B7R_1
2B7S_1
2B9V_1
2COT_1
2CSE_2
2CSE_4
2CSH_1
2CT1_1
2CTD_1
2D9H_1
2DLK_1
2DLQ_1
2DMD_1
2DMI_1
2DRP_3
2EBT_1
2EE8_1
2EL4_1
2EL5_1
2EL6_1
2ELO_1
2ELR_1
2ELS_1
2ELU_1
2ELV_1
2ELW_1
2ELX_1
2ELY_1
2ELZ_1
2EM0_1
2EM1_1
2EM2_1
2EM3_1
2EM4_1
2EM5_1
2EM6_1
2EM7_1
2EM8_1
2EM9_1
2EMA_1
2EMB_1
2EMC_1
2EME_1
2EMF_1
2EMG_1
2EMH_1
2EMI_1
2EMJ_1
2EMK_1
2EML_1
2EMM_1
2EMP_1
2EMV_1
2EMW_1
2EMX_1
2EMY_1
2EMZ_1
2EN0_1
2EN1_1
2EN2_1
2EN3_1
2EN4_1
2EN6_1
2EN7_1

Below is an example query for the zinc finger motif that binds Zn in a DNA-binding domain:

In [14]:
results = SeqMotifQuery("C.{2,4}C.{12}H.{3,5}H", pattern_type="regex", sequence_type="protein")

for polyid in results("polymer_entity"):
    print(polyid)

1A1F_3
1A1G_3
1A1H_3
1A1I_3
1A1J_3
1A1K_3
1A1L_3
1AAY_3
1ARD_1
1ARE_1
1ARF_1
1BBO_1
1BHI_1
1E08_1
1E39_1
1EJ6_2
1F2I_2
1FN9_1
1G2D_3
1G2F_3
1GX7_1
1H9H_2
1H9I_2
1HFE_2
1JJD_1
1JK1_3
1JK2_3
1JMU_3
1JN7_1
1JRX_1
1JRY_1
1JRZ_1
1KLR_1
1KLS_1
1KSS_1
1KSU_1
1LJ1_1
1LLM_2
1M64_1
1MEY_3
1NCS_1
1NJQ_1
1NX9_1
1P2E_1
1P2H_1
1P47_3
1P7A_1
1PAA_1
1Q9I_1
1QJD_1
1RIK_1
1RIM_1
1RYY_1
1SP1_1
1SP2_1
1SRK_1
1TF3_3
1TF6_3
1U85_1
1U86_1
1UBD_3
1UN6_1
1V4N_1
1VA1_1
1VA2_1
1VA3_1
1VH9_1
1W0R_1
1W0S_1
1WIR_1
1WJP_1
1X3C_1
1X5W_1
1X6E_1
1X6F_1
1X6H_1
1XF7_1
1XRZ_1
1Y0P_1
1YUI_3
1YUJ_3
1Z60_1
1ZAA_3
1ZFD_1
1ZNF_1
1ZNM_1
1ZR9_1
1ZU1_1
1ZW8_1
2AB3_1
2AB7_1
2ADR_1
2B4K_1
2B7R_1
2B7S_1
2B9V_1
2COT_1
2CSE_2
2CSE_4
2CSH_1
2CSX_2
2CT1_1
2CT8_2
2CTD_1
2D74_2
2D8S_1
2D8V_1
2D9H_1
2DCU_2
2DLK_1
2DLQ_1
2DMD_1
2DMI_1
2DRP_3
2EBT_1
2EE8_1
2EL4_1
2EL5_1
2EL6_1
2ELM_1
2ELN_1
2ELO_1
2ELQ_1
2ELR_1
2ELS_1
2ELT_1
2ELU_1
2ELV_1
2ELW_1
2ELX_1
2ELY_1
2ELZ_1
2EM0_1
2EM1_1
2EM2_1
2EM3_1
2EM4_1
2EM5_1
2EM6_1
2EM7_1
2EM8_1
2EM9_1
2EMA_1
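The PROSITE and regex queries above are closely related: the regex collapses `x(3)-[LIVMFYWC]-x(8)` into twelve arbitrary residues (3 + 1 + 8 = 12), so it is the looser of the two, which is consistent with the extra hits in its output. The sketch below converts a basic PROSITE pattern to a Python regular expression so that patterns can be tried on local sequences before querying. The converter `prosite_to_regex` is a hypothetical helper that handles only the common syntax elements, and both test sequences are artificial.

```python
import re

# One PROSITE element: optional "<" anchor, a residue spec, optional repeat, optional ">" anchor.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."  # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} means none of these residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


strict = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
loose = "C.{2,4}C.{12}H.{3,5}H"  # the regex used in the query above

# Artificial sequences: identical spacing, differing only at the hydrophobic position.
finger = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
no_hydrophobic = "C" + "AA" + "C" + "A" * 12 + "H" + "AAA" + "H"

print(strict)
print(bool(re.search(strict, finger)), bool(re.search(loose, finger)))
print(bool(re.search(strict, no_hydrophobic)), bool(re.search(loose, no_hydrophobic)))
```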

Below is an example query for SH3 domains:

In [15]:
# By default, the pattern_type argument is "simple" and the sequence_type argument is "protein".
results = SeqMotifQuery("XPPXP")  # X is used as a "variable residue" and can be any amino acid. 

for polyid in results("polymer_entity"):
    print(polyid)

13PK_1
16PK_1
16VP_1
1A0L_1
1A3I_1
1A3J_1
1A4R_1
1A5Y_1
1A7M_1
1AAX_1
1ABO_2
1AD3_1
1ADQ_1
1AFV_1
1AIG_2
1AIJ_2
1AJE_1
1AK4_2
1AKA_1
1AKB_1
1AKC_1
1AKZ_1
1AM4_1
1AMA_1
1AN0_1
1AOL_1
1AOZ_1
1ARO_1
1ASO_1
1ASP_1
1ASQ_1
1AW9_1
1AWI_2
1AWJ_1
1AZE_2
1B0L_1
1B1X_1
1B41_1
1B70_2
1B7U_1
1B7Y_2
1B7Z_1
1B80_1
1B82_1
1B85_1
1B8X_1
1BBZ_2
1BE3_2
1BEA_1
1BFA_1
1BGY_2
1BKA_1
1BKV_1
1BL9_1
1BLS_1
1BQ3_1
1BQ4_1
1BQS_1
1BX0_1
1BX1_1
1BZC_1
1C04_4
1C2B_1
1C2O_1
1C2P_1
1C3B_1
1C3V_1
1C8T_1
1CAG_1
1CB6_1
1CC1_1
1CCH_1
1CEZ_3
1CF0_2
1CF4_1
1CF9_1
1CGD_1
1CJF_2
1CKB_2
1CKR_1
1CN3_1
1CNS_1
1COR_1
1CPO_1
1CQZ_1
1CR6_1
1CRK_1
1CSG_1
1CSJ_1
1CXV_1
1D2B_1
1D2T_1
1D8D_1
1D8E_1
1DCT_3
1DD1_1
1DDV_2
1DHN_1
1DN2_1
1DOA_1
1DPB_1
1DPC_1
1DPD_1
1DS8_2
1DSN_1
1DT0_1
1DT6_1
1DTZ_1
1DV3_2
1DV6_2
1DZI_2
1DZL_1
1E07_1
1E0A_1
1E0C_1
1E14_3
1E1C_1
1E3D_1
1E4K_1
1E6D_3
1E6J_3
1E7U_1
1E7V_1
1E8W_1
1E8X_1
1E8Y_1
1E8Z_1
1E90_1
1EAA_1
1EAB_1
1EAC_1
1EAD_1
1EAE_1
1EAF_1
1EEM_1
1EEN_1
1EEO_1
1EER_2
1EFD_1
1EFW_2
1EG0_14
1EH3_1
1EH5_

All three pattern types can also be used to search DNA and RNA sequences.
The two queries below, one DNA and one RNA, use the simple pattern type:

In [16]:
from rcsbsearchapi.search import SeqMotifQuery

# DNA query: this is a query for a T-Box.
dna = SeqMotifQuery("TCACACCT", sequence_type="dna")

print("DNA results:")
for polyid in dna("polymer_entity"):
    print(polyid)

# RNA query: 6C RNA motif
rna = SeqMotifQuery("CCCCCC", sequence_type="rna")
print("RNA results:")
for polyid in rna("polymer_entity"):
    print(polyid)

DNA results:
1H6F_2
1XBR_1
2X6V_2
4A04_2
4ROC_4
4S0H_3
5FLV_2
5T1J_2
6F58_1
6F59_1
8CDN_3
RNA results:
1A60_1
1ML5_4
1QCU_2
1VVJ_25
1VY4_26
1VY5_26
1VY6_25
1VY7_26
2A8V_2
2GTT_2
2GTT_3
2JEA_3
2LC8_1
2WJ8_2
2YHM_2
3AM1_2
3IYQ_1
3IYR_1
3J3V_5
3J3W_1
3J7O_1
3J7P_1
3J7P_49
3J7Q_1
3J7R_1
3J7R_50
3J92_49
3J9W_33
3JAG_48
3JAG_51
3JAH_48
3JAH_51
3JAI_48
3JAI_51
3JAJ_32
3JAJ_54
3JAN_32
3JAN_53
3PO2_14
3PU0_2
4BKK_1
4BXX_14
4D5L_1
4D5Y_44
4D61_1
4D67_44
4L47_25
4L71_22
4LEL_25
4LFZ_25
4LNT_25
4LSK_25
4LT8_25
4P6F_25
4P70_22
4TUA_25
4TUB_25
4TUC_25
4TUD_25
4TUE_25
4UFT_2
4UG0_1
4UG0_47
4UJC_3
4UJC_4
4UJC_50
4UJD_1
4UJD_49
4UJD_50
4UJE_4
4UJE_81
4V42_25
4V4I_1
4V4J_1
4V4N_10
4V4N_38
4V4N_39
4V4P_2
4V4R_27
4V4S_27
4V4T_26
4V4X_24
4V4Y_24
4V4Z_25
4V51_34
4V5A_35
4V5C_36
4V5D_35
4V5E_35
4V5F_35
4V5G_36
4V5J_35
4V5K_35
4V5L_36
4V5M_35
4V5N_35
4V5P_36
4V5Q_36
4V5R_36
4V5S_36
4V5V_2
4V5Z_1
4V5Z_26
4V63_25
4V67_25
4V68_38
4V6A_25
4V6F_1
4V6F_55
4V6G_24
4V6U_11
4V6U_67
4V6U_68
4V6X_36
4V6X_85
4V7J_57
4V7K

## Structure Similarity Query

The PDB archive can be queried using the 3D shape of a protein structure. A structure similarity query needs four pieces of information: the 3D protein structure data used as input, a chain ID or assembly ID, the target search space (Assemblies or Polymer Entity Instances, i.e. chains), and the search type (strict or relaxed). More information on how Structure Similarity Queries work can be found on the [RCSB PDB Structure Similarity Search](https://www.rcsb.org/docs/search-and-browse/advanced-search/structure-similarity-search) page.

In [17]:
# Basic query: querying using entry ID and default values assembly ID "1", operator "strict", and target search space "Assemblies"
q1 = StructSimilarityQuery(entry_id="4HHB")

# Same example but with parameters explicitly specified
q1 = StructSimilarityQuery(structure_search_type="entry_id",
                           entry_id="4HHB",
                           structure_input_type="assembly_id",
                           assembly_id="1",
                           operator="strict_shape_match",
                           target_search_space="assembly"
                           )
for rid in q1("assembly"):
    print(rid)

4HHB-1
1G9V-1
2HHB-1
1BZ0-1
1K0Y-1
1COH-1
3HHB-1
1QSH-1
1VWT-1
1BZZ-1
1GBV-1
2DN2-1
1A01-1
1C7D-1
1Y35-1
1GLI-1
1RQ3-1
1Y4Q-1
1BAB-1
1O1O-1
1Y4B-1
1O1P-1
1DXU-1
1Y4R-1
1DXV-1
1O1M-1
1YHE-1
1BZ1-1
1QSI-1
1J7S-1
1GBU-1
1A0U-1
1RQ4-1
1Y0T-1
1O1J-1
1HBB-1
1O1N-1
1Y31-1
1O1L-1
1XZ2-1
1Y7Z-1
1Y7G-1
1XZV-1
1C7C-1
1Y45-1
1DXT-1
1XXT-1
1Y09-1
1Y4V-1
1XZU-1
6HBW-1
1YH9-1
1Y2Z-1
1Y22-1
1Y0C-1
5KSJ-1
5KSI-1
1Y0A-1
1C7B-1
1XYE-1
1Y0W-1
1A0Z-1
1QI8-1
1J7W-1
1Y46-1
1O1K-1
1YE2-1
1B86-1
1HDB-1
1Y4F-1
1LFL-2
1Y5F-1
2HBS-1
1Y5J-1
1Y7C-1
1GZX-1
1HBA-1
3DUT-1
1Y85-1
1RPS-1
1A00-1
1HBS-1
1R1Y-1
1HGC-1
2W72-1
1XZ5-1
3NMM-1
1Y4P-1
1YIH-1
1XY0-1
3HXN-1
1XZ7-1
1KD2-1
2W6V-1
1YGF-1
2HHD-1
1Y83-1
2HBS-2
1A3N-1
1Y0D-1
5KDQ-1
1HGB-1
1HGA-1
3QJD-1
1Y5K-1
3QJE-1
7UD7-1
6BWP-1
1LFL-1
1XZ4-1
1Y4G-1
1CLS-1
1J7Y-1
1HDA-1
7DY4-2
2YRS-1
1NIH-1
1Y7D-1
1THB-1
7DY4-1
1HBS-2
2HHE-1
2D60-1
3KMF-1
4ROM-1
2DXM-1
5E29-1
2YRS-2
1DKE-1
3WCP-1
7DY3-2
7DY3-1
1FN3-1
1FDH-1
1A3O-1
1ABW-1
4L7Y-1
6LCW-1
6KAE-1
1J40-2
6LCX-2
1YE0-1
6LCW-2

Below is a more complex example that uses a chain ID, the relaxed search operator, and polymer entity instances as the target search space. The input structure type must match the identifier you supply: for example, specifying chain ID as the input structure type but passing an assembly ID can lead to an error.

In [18]:
# More complex query with entry ID value "4HHB", chain ID "B", operator "relaxed", and target search space "Chains"
q2 = StructSimilarityQuery(structure_search_type="entry_id",
                           entry_id="4HHB",
                           structure_input_type="chain_id",
                           chain_id="B",
                           operator="relaxed_shape_match",
                           target_search_space="polymer_entity_instance")
list(q2())

['4HHB',
 '1Y35',
 '1Y45',
 '3HHB',
 '1BAB',
 '1Y4R',
 '1O1O',
 '1BZZ',
 '1QSI',
 '1DXV',
 '2HHB',
 '1BZ0',
 '1RQ3',
 '1A0U',
 '1HBB',
 '1K0Y',
 '1YE1',
 '1Y5F',
 '1Y4Q',
 '1XXT',
 '1Y09',
 '1DXU',
 '1BZ1',
 '1VWT',
 '1QSH',
 '1XZV',
 '1KD2',
 '1XZ2',
 '1Y7C',
 '1A0Z',
 '1Y46',
 '1Y4V',
 '1GBV',
 '3QJD',
 '1COH',
 '1GLI',
 '1Y4P',
 '1Y7D',
 '1Y7G',
 '1Y0T',
 '1Y4B',
 '1XYE',
 '1Y5K',
 '2HBS',
 '1G9V',
 '1YGF',
 '1Y4F',
 '1Y0W',
 '1XZ5',
 '1Y4G',
 '1Y31',
 '1A00',
 '1O1N',
 '1HBA',
 '1O1L',
 '1HDB',
 '1Y0D',
 '1XY0',
 '1YH9',
 '3NMM',
 '1Y83',
 '1Y2Z',
 '1GBU',
 '8WJ2',
 '1YHE',
 '2DN2',
 '1YIH',
 '1O1J',
 '1RPS',
 '1B86',
 '1O1P',
 '1XZ7',
 '1YIE',
 '1Y22',
 '2W6V',
 '6BWP',
 '1Y5J',
 '1Y0C',
 '2YRS',
 '1XZ4',
 '1YHR',
 '1YGD',
 '1CLS',
 '1BIJ',
 '1A01',
 '2W72',
 '1C7D',
 '5KSI',
 '6KAI',
 '1RQA',
 '1NIH',
 '7DY4',
 '1C7C',
 '1J40',
 '6LCX',
 '6KAE',
 '6KA9',
 '7UD7',
 '1J7W',
 '3QJE',
 '5KDQ',
 '1O1M',
 '1HGB',
 '1YDZ',
 '1YEO',
 '1Y0A',
 '3WCP',
 '1Y8W',
 '1YE2',
 '1GZX',
 '1A3O',
 

Structure similarity queries also accept a structure file, either uploaded from the local machine or referenced by URL, to search the PDB archive for similar proteins. The file holds the target protein structure in one of the formats "cif", "bcif", "pdb", "cif.gz", or "pdb.gz". For a file URL query, you must specify the structure search type, the URL, and the file format. A file upload works the same way, except that the value is the absolute path to the file on the local machine. The example below uses a file URL for 4HHB (hemoglobin).

In [20]:
q3 = StructSimilarityQuery(structure_search_type="file_url",
                           file_url="https://files.rcsb.org/view/4HHB.cif",
                           file_format="cif")
list(q3())

# If you want to upload your own structure file for similarity search, you can do so by using the `file_path` parameter:
q4 = StructSimilarityQuery(structure_search_type="file_upload",
                           file_path="/PATH/TO/FILE.cif",  # specify local model file path
                           file_format="cif")
list(q4())

['4HHB',
 '1G9V',
 '2HHB',
 '1BZ0',
 '1K0Y',
 '1COH',
 '3HHB',
 '1QSH',
 '1VWT',
 '1BZZ',
 '1GBV',
 '2DN2',
 '1A01',
 '1C7D',
 '1Y35',
 '1GLI',
 '1RQ3',
 '1Y4Q',
 '1BAB',
 '1O1O',
 '1Y4B',
 '1O1P',
 '1DXU',
 '1Y4R',
 '1DXV',
 '1O1M',
 '1YHE',
 '1BZ1',
 '1QSI',
 '1J7S',
 '1GBU',
 '1A0U',
 '1RQ4',
 '1Y0T',
 '1O1J',
 '1HBB',
 '1O1N',
 '1Y31',
 '1O1L',
 '1XZ2',
 '1Y7Z',
 '1Y7G',
 '1XZV',
 '1C7C',
 '1Y45',
 '1DXT',
 '1XXT',
 '1Y09',
 '1Y4V',
 '1XZU',
 '6HBW',
 '1YH9',
 '1Y2Z',
 '1Y22',
 '1Y0C',
 '5KSJ',
 '5KSI',
 '1Y0A',
 '1C7B',
 '1XYE',
 '1Y0W',
 '1A0Z',
 '1QI8',
 '1J7W',
 '1Y46',
 '1O1K',
 '1YE2',
 '1B86',
 '1HDB',
 '1Y4F',
 '1LFL',
 '1Y5F',
 '2HBS',
 '1Y5J',
 '1Y7C',
 '1GZX',
 '1HBA',
 '3DUT',
 '1Y85',
 '1RPS',
 '1A00',
 '1HBS',
 '1R1Y',
 '1HGC',
 '2W72',
 '1XZ5',
 '3NMM',
 '1Y4P',
 '1YIH',
 '1XY0',
 '3HXN',
 '1XZ7',
 '1KD2',
 '2W6V',
 '1YGF',
 '2HHD',
 '1Y83',
 '1A3N',
 '1Y0D',
 '5KDQ',
 '1HGB',
 '1HGA',
 '3QJD',
 '1Y5K',
 '3QJE',
 '7UD7',
 '6BWP',
 '1XZ4',
 '1Y4G',
 '1CLS',
 '1J7Y',
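Because `file_format` has to agree with the file being supplied, it can be convenient to derive it from the file name. The sketch below does this for the five formats listed above. The helper `guess_file_format` is hypothetical and not part of rcsbsearchapi; it only inspects the name and never opens the file.

```python
from urllib.parse import urlparse

# Formats listed in the text above; compressed variants are checked first.
SUPPORTED_FORMATS = ("cif.gz", "pdb.gz", "bcif", "cif", "pdb")


def guess_file_format(location: str) -> str:
    """Infer the file_format string from a URL or local path."""
    path = urlparse(location).path.lower()
    for fmt in SUPPORTED_FORMATS:
        if path.endswith("." + fmt):
            return fmt
    raise ValueError(f"cannot infer a supported format from {location!r}")


print(guess_file_format("https://files.rcsb.org/view/4HHB.cif"))
print(guess_file_format("/PATH/TO/FILE.pdb.gz"))
```

The returned string could then be passed as the `file_format` argument shown in the cell above.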
 

### Structure Motif Query Examples

The PDB archive can also be queried using a "motif" found in these 3D structures. This type of query requires an entry_id or a file URL/path, along with residues (parts of the 3D structure). This is the bare minimum needed to make a search, but many other parameters can be added to a Structure Motif Query (see the [full search schema](https://search.rcsb.org/redoc/index.html)).

To make a Structure Motif Query, you must first define between 2 and 10 "residues" that will be used in the query. Each residue has a Chain ID, Operator, Residue Number, and (optionally) Exchanges. These can be passed in that order as positional arguments, or by name using the `chain_id`, `struct_oper_id`, `label_seq_id`, and `exchanges` keyword arguments. All 3 of the required parameters must be included, or the package will throw an AssertionError.

Each residue can have a maximum of 4 Exchanges, and each query can only have 16 exchanges total. Violating any of these rules will cause the package to throw an AssertionError. 

Examples of how to instantiate Residues can be found below. These can then be put into a list and passed through to a Structure Motif Query.

In [22]:
from rcsbsearchapi.search import StructureMotifResidue

# construct a Residue with a Chain ID of A, an operator of 1, a residue 
# number of 192, and Exchanges of "LYS" and "HIS"
Res1 = StructureMotifResidue("A", "1", 192, ["LYS", "HIS"])
# as for what is a valid "Exchange", the package provides these as a literal,
# and they should be type checked. 

# you can also specify the arguments:
# this query is the same as above. 
Res2 = StructureMotifResidue(struct_oper_id="1", chain_id="A", exchanges=["LYS", "HIS"], label_seq_id=192)

# after declaring between 2 and 10 residues, put them into a list for use in the query itself:
Res3 = StructureMotifResidue("A", "1", 162)  # exchanges are optional

ResList = [Res1, Res3]

From there, these Residues can be used in a query. As stated before, a query can only include 2-10 residues. If you fail to provide residues for a query, or provide the wrong number, the package will throw a ValueError.

For a Structure Motif Query using an entry_id, the only other necessary value that must be passed into the query is the residue list. The default type of query is an entry_id query. 

As this type of query has a lot of optional parameters, do *not* use positional arguments as more than likely an error will occur. 

Below is an example of a basic entry_id Structure Motif Query, with the residues declared earlier:

In [27]:
from rcsbsearchapi.search import StructMotifQuery

q1 = StructMotifQuery(entry_id="2MNR", residue_ids=ResList)
q1_res = list(q1(return_type='polymer_entity'))
print(len(q1_res))

9761
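The limits described above (2 to 10 residues per query, at most 4 exchanges per residue, at most 16 exchanges in total) can be checked locally before a query is built. The sketch below uses a hypothetical stand-in class `ResidueSpec` and a hypothetical helper `check_residues`, so that it runs without the package; it is not how rcsbsearchapi performs its own checks.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ResidueSpec:
    """Stand-in for a motif residue: same fields as described in the text."""
    chain_id: str
    struct_oper_id: str
    label_seq_id: int
    exchanges: Sequence[str] = ()


def check_residues(residues: Sequence[ResidueSpec]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all limits hold."""
    problems: List[str] = []
    if not 2 <= len(residues) <= 10:
        problems.append(f"expected 2-10 residues, got {len(residues)}")
    total_exchanges = 0
    for residue in residues:
        count = len(residue.exchanges)
        total_exchanges += count
        if count > 4:
            problems.append(
                f"residue {residue.label_seq_id} has {count} exchanges (maximum is 4)"
            )
    if total_exchanges > 16:
        problems.append(f"{total_exchanges} exchanges in total (maximum is 16)")
    return problems


residues = [
    ResidueSpec("A", "1", 192, ("LYS", "HIS")),
    ResidueSpec("A", "1", 162),
]
print(check_residues(residues))
print(check_residues(residues[:1]))
```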


Like with Structure Similarity Queries, a file url or filepath can also be provided to the program. These can take the place of an entry_id. 

For a file url query, you *must* provide both a valid file URL (a string), and the file's file extension (also as a string). Failure to provide these elements correctly will cause the package to throw an AssertionError. 

Below is an example of the same query as above, only this time providing a file url:

In [28]:
link = "https://files.rcsb.org/view/2MNR.cif"
q2 = StructMotifQuery(structure_search_type="file_url", url=link, file_extension="cif", residue_ids=ResList)
# structure_search_type MUST be provided. A mismatched query type will cause an error. 
list(q2())

['1YTG',
 '3J79',
 '3P5T',
 '4O0C',
 '6JDU',
 '6O81',
 '6SC2',
 '6VYT',
 '6Y98',
 '7C5W',
 '7UWD',
 '8EWA',
 '8F4R',
 '8PTK',
 '8U6B',
 '8VRD',
 '9BRB',
 '1Q0P',
 '1S2D',
 '2J92',
 '2X4F',
 '3LVQ',
 '3QQ1',
 '4YLC',
 '5DOX',
 '5GGF',
 '5GIX',
 '5NZZ',
 '5TN7',
 '6JWP',
 '6L61',
 '6REY',
 '6Z8G',
 '7DQC',
 '7JTH',
 '7PER',
 '7QXI',
 '7UPQ',
 '8CXY',
 '8FKP',
 '8OFF',
 '8V2E',
 '8WC9',
 '2JKI',
 '2NLK',
 '3UTE',
 '4OL8',
 '4PH7',
 '4S0F',
 '5DZ3',
 '5J8K',
 '5T0Z',
 '5TM6',
 '6A7P',
 '6CKE',
 '6ULO',
 '7JQL',
 '7KEO',
 '7OBR',
 '7OIW',
 '7RCC',
 '7SF7',
 '8G0U',
 '8SL8',
 '8UKU',
 '8UVS',
 '1XKZ',
 '2C95',
 '2W2X',
 '4Q10',
 '6BKF',
 '6JIU',
 '6JU4',
 '6P5B',
 '6W62',
 '6WOV',
 '7KPW',
 '7L58',
 '8UQL',
 '1LHT',
 '2JFK',
 '3Q8C',
 '4JS5',
 '5VFO',
 '6J9E',
 '6PUU',
 '7ASP',
 '7EU1',
 '8G5O',
 '8JZZ',
 '8QMY',
 '8RX1',
 '8UNU',
 '1MXW',
 '5SY2',
 '6EIF',
 '6GD0',
 '7UCX',
 '8BTN',
 '8E0O',
 '8G0E',
 '3V7W',
 '5F3X',
 '5KK2',
 '6AA9',
 '6DAH',
 '6OF1',
 '6V75',
 '7F2X',
 '7JPD',
 '8GES',
 

Like with Structure Similarity Queries, a filepath to a file may also be provided. This file must be a valid file accepted by the search API. A file extension must also be provided with the file upload. 

The query would look something like this. Note that this is abstracted for the purpose of notebook portability.
```python
filepath = "/absolute/path/to/file.cif"
q3 = StructMotifQuery(structure_search_type="file_upload", file_path=filepath, file_extension="cif", residue_ids=ResList)

list(q3())
```
There are many additional parameters that Structure Motif Query supports. These include a variety of features such as backbone distance tolerance, side chain distance tolerance, angle tolerance, RMSD cutoff, limits (stop searching after this many hits), atom pairing schemes, motif pruning strategy, allowed structures, and excluded structures. These can be mixed and matched as needed to make accurate and useful queries. All of these have some default value which is used when a parameter isn't provided. These parameters conform to the defaults used by the Search API. 

The cell below demonstrates how to set these parameters using keyword arguments:

In [29]:
# specifying backbone distance tolerance: 0-3, default is 1
# allowed backbone distance tolerance in Angstrom. 
backbone = StructMotifQuery(entry_id="2MNR", backbone_distance_tolerance=2, residue_ids=ResList)
list(backbone())

# specifying sidechain distance tolerance: 0-3, default is 1
# allowed side-chain distance tolerance in Angstrom.
sidechain = StructMotifQuery(entry_id="2MNR", side_chain_distance_tolerance=2, residue_ids=ResList)
list(sidechain())

# specifying angle tolerance: 0-3, default is 1
# allowed angle tolerance in multiples of 20 degrees. 
angle = StructMotifQuery(entry_id="2MNR", angle_tolerance=2, residue_ids=ResList)
list(angle())

# specifying RMSD cutoff: >=0, default is 2
# Threshold above which hits will be filtered by RMSD
rmsd = StructMotifQuery(entry_id="2MNR", rmsd_cutoff=1, residue_ids=ResList)
list(rmsd())

# specifying limit: >=0, default excluded
# Stop accepting results after this many hits. 
limit = StructMotifQuery(entry_id="2MNR", limit=100, residue_ids=ResList)
list(limit())

# specifying atom pairing scheme, default = "SIDE_CHAIN"
# ENUM: "ALL", "BACKBONE", "SIDE_CHAIN", "PSUEDO_ATOMS"
# this is typechecked by a literal. 
# Which atoms to consider to compute RMSD scores and transformations. 
atom = StructMotifQuery(entry_id="2MNR", atom_pairing_scheme="ALL", residue_ids=ResList)
list(atom())

# specifying motif pruning strategy, default = "KRUSKAL"
# ENUM: "NONE", "KRUSKAL"
# this is typechecked by a literal in the package. 
# Specifies how many query motifs are "pruned". KRUSKAL leads to less stringent queries, and faster results.
pruning = StructMotifQuery(entry_id="2MNR", motif_pruning_strategy="NONE", residue_ids=ResList)
list(pruning())

# specifying allowed structures, default excluded
# specify the structures you wish to allow in the return result. As an example,
# we could only allow the results from the limited query we ran earlier. 
allowed = StructMotifQuery(entry_id="2MNR", allowed_structures=list(limit()), residue_ids=ResList)
list(allowed())

# specifying structures to exclude, default excluded
# specify structures to exclude from a query. We could, for example,
# exclude the results of the previous allowed query. 
excluded = StructMotifQuery(entry_id="2MNR", excluded_structures=list(allowed()), residue_ids=ResList)
list(excluded())

['1MXV',
 '2VR3',
 '3RMI',
 '5BW2',
 '5NUI',
 '6C3E',
 '6JG3',
 '6XNS',
 '6Y98',
 '7K5B',
 '7QP7',
 '8PTK',
 '8U6B',
 '1A36',
 '3GZN',
 '3LVQ',
 '3TVI',
 '5GGF',
 '5GPN',
 '5HR4',
 '5NZZ',
 '5TJ2',
 '5XG3',
 '5Z62',
 '6E7D',
 '6Z8G',
 '7JTH',
 '7UPQ',
 '8CXT',
 '8CXY',
 '8DGE',
 '8E9W',
 '8KG6',
 '8RBZ',
 '8V2E',
 '8XVZ',
 '8Z5J',
 '1HK1',
 '2NLK',
 '4JM0',
 '4OL8',
 '5AUQ',
 '5DGE',
 '5J8K',
 '5KCC',
 '5OEJ',
 '5T0Z',
 '6CKE',
 '6HJG',
 '7OIW',
 '7SF7',
 '7WAA',
 '8P3X',
 '1S2L',
 '2C95',
 '3GQY',
 '5J46',
 '6BKF',
 '6P5B',
 '6QDV',
 '7KG6',
 '7KPW',
 '8GG7',
 '8I2M',
 '1LHT',
 '3Q8C',
 '3RBM',
 '4JS5',
 '5VFO',
 '6BAT',
 '6BHF',
 '6G0C',
 '6K0R',
 '6LPC',
 '6PDG',
 '6TLT',
 '7EU1',
 '7FIT',
 '7PCF',
 '7R07',
 '7S5N',
 '7WOW',
 '8GCP',
 '8QMY',
 '8VQ3',
 '1DCF',
 '1FTL',
 '1MH3',
 '1MXW',
 '1ZCZ',
 '2AH8',
 '3ZXT',
 '4BQ4',
 '4OY2',
 '4RDO',
 '5SY2',
 '7EM3',
 '7MB6',
 '8E4P',
 '8G0E',
 '8GCJ',
 '8I1Y',
 '2Z6T',
 '3KHD',
 '3V7W',
 '5EDU',
 '5F3X',
 '5LRW',
 '5RT4',
 '6BBV',
 '6E3L',
 

The Structure Motif Query can be used to make very specific queries. Below is an example of a query that retrieves occurrences of the enolase superfamily, a group of proteins diverse in sequence and structure that are all capable of abstracting a proton from a carboxylic acid. Position-specific exchanges are crucial to represent this superfamily accurately.

In [31]:
Res1 = StructureMotifResidue("A", "1", 162, ["LYS", "HIS"])
Res2 = StructureMotifResidue("A", "1", 193)
Res3 = StructureMotifResidue("A", "1", 219)
Res4 = StructureMotifResidue("A", "1", 245, ["GLU", "ASP", "ASN"])
Res5 = StructureMotifResidue("A", "1", 295, ["HIS", "LYS"])

ResList = [Res1, Res2, Res3, Res4, Res5]

query = StructMotifQuery(entry_id="2MNR", residue_ids=ResList)

print(query(return_counts=True))

114
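
To make the idea of position-specific exchanges concrete, here is a minimal plain-Python sketch that does not call rcsbsearchapi. `residue_matches` is a hypothetical helper, the reference residue type `"LYS"` is illustrative only, and the rule that a position without exchanges accepts only the reference residue type is an assumption about how the search behaves.

```python
from typing import Optional, Sequence


def residue_matches(candidate: str, reference_type: str,
                    exchanges: Optional[Sequence[str]] = None) -> bool:
    """Return True if `candidate` is acceptable at a motif position."""
    # Assumption: with no exchanges, only the reference residue type is accepted.
    allowed = set(exchanges) if exchanges else {reference_type}
    return candidate in allowed


# Exchange list taken from Res1 above; the reference type "LYS" is illustrative.
lys_or_his = ["LYS", "HIS"]
ok_his = residue_matches("HIS", "LYS", lys_or_his)   # HIS is a listed exchange
ok_arg = residue_matches("ARG", "LYS", lys_or_his)   # ARG is not listed
no_exchange = residue_matches("HIS", "LYS")          # no exchanges given
```

Under this reading, widening the exchange list at a position makes the motif less strict at that position only, which is what lets one query cover a diverse superfamily.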


## Faceted Queries

Faceted queries (facets) group search results and compute statistics over PDB data with a single search query. Facets arrange search results into categories (buckets) based on the requested field values. More information on faceted queries can be found [here](https://search.rcsb.org/#using-facets). Every facet must be given `name`, `aggregation_type`, and `attribute` values; depending on the aggregation type, other parameters must also be specified. In the example below, the query `q` is executed with the specified facet, and the `facets` attribute of the result holds a list of dictionaries:


In [32]:
from rcsbsearchapi.search import Facet

q = AttributeQuery("rcsb_accession_info.initial_release_date", operator="greater", value="2019-08-20")
q(facets=Facet(name="Methods", aggregation_type="terms", attribute="exptl.method")).facets

[{'name': 'Methods',
  'buckets': [{'label': 'X-RAY DIFFRACTION', 'population': 50181},
   {'label': 'ELECTRON MICROSCOPY', 'population': 18957},
   {'label': 'SOLUTION NMR', 'population': 1643},
   {'label': 'ELECTRON CRYSTALLOGRAPHY', 'population': 118},
   {'label': 'NEUTRON DIFFRACTION', 'population': 72},
   {'label': 'SOLID-STATE NMR', 'population': 62},
   {'label': 'SOLUTION SCATTERING', 'population': 22},
   {'label': 'POWDER DIFFRACTION', 'population': 2}]}]
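
Because the result is ordinary Python data, it can be post-processed without further API calls. The sketch below copies the first three buckets from the output above and converts them to a label-to-count mapping; `buckets_to_dict` is a hypothetical helper, not part of rcsbsearchapi.

```python
# First three buckets copied from the output above.
facet_result = [{
    "name": "Methods",
    "buckets": [
        {"label": "X-RAY DIFFRACTION", "population": 50181},
        {"label": "ELECTRON MICROSCOPY", "population": 18957},
        {"label": "SOLUTION NMR", "population": 1643},
    ],
}]


def buckets_to_dict(facets, name):
    """Map bucket label -> population for the facet called `name`."""
    for facet in facets:
        if facet["name"] == name:
            return {b["label"]: b["population"] for b in facet["buckets"]}
    raise KeyError(name)


counts = buckets_to_dict(facet_result, "Methods")
total = sum(counts.values())
# Fraction of the listed entries contributed by each method.
shares = {label: n / total for label, n in counts.items()}
```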

#### Term Facets
Terms faceting is a multi-bucket aggregation where buckets are dynamically built - one per unique value. We can specify the minimum count (`>= 0`) for a bucket to be returned using the parameter `min_interval_population` (default value `1`). We can also control the number of buckets returned (`<= 65336`) using the parameter `max_num_intervals` (default value `65336`).

In [33]:
# This is the default query, used by the RCSB Search API when no query is explicitly specified.
# This default query will be used for most of the examples found below for faceted queries.
base_q = AttributeQuery("rcsb_entry_info.structure_determination_methodology", operator="exact_match", value="experimental") 

base_q(facets=Facet(name="Journals", aggregation_type="terms", attribute="rcsb_primary_citation.rcsb_journal_abbrev", min_interval_population=1000)).facets

[{'name': 'Journals',
  'buckets': [{'label': 'To be published', 'population': 35152},
   {'label': 'J Biol Chem', 'population': 14043},
   {'label': 'J Mol Biol', 'population': 11591},
   {'label': 'Nat Commun', 'population': 11367},
   {'label': 'Biochemistry', 'population': 11108},
   {'label': 'Proc Natl Acad Sci U S A', 'population': 11074},
   {'label': 'J Med Chem', 'population': 8981},
   {'label': 'Structure', 'population': 7414},
   {'label': 'Nature', 'population': 6076},
   {'label': 'Acta Crystallogr D Biol Crystallogr', 'population': 4387},
   {'label': 'Science', 'population': 4217},
   {'label': 'Nucleic Acids Res', 'population': 3991},
   {'label': 'Nat Struct Mol Biol', 'population': 3802},
   {'label': 'J Am Chem Soc', 'population': 3601},
   {'label': 'Protein Sci', 'population': 3235},
   {'label': 'Cell', 'population': 2935},
   {'label': 'Sci Rep', 'population': 2765},
   {'label': 'Mol Cell', 'population': 2717},
   {'label': 'EMBO J', 'population': 2509},
   {'
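
As a rough mental model of terms faceting, the following plain-Python sketch builds one bucket per unique value and then applies the two parameters described above. It is an illustration only, not the service's implementation: `term_buckets` is a hypothetical function, the sample journal list is made up, and ordering buckets by descending population is an assumption based on the output shown.

```python
from collections import Counter


def term_buckets(values, min_interval_population=1, max_num_intervals=65336):
    """One bucket per unique value, largest first, filtered and truncated."""
    counts = Counter(values)
    buckets = [
        {"label": label, "population": n}
        for label, n in counts.most_common()
        if n >= min_interval_population  # drop buckets that are too small
    ]
    return buckets[:max_num_intervals]   # cap the number of buckets returned


journals = ["Nature", "Science", "Nature", "Cell", "Nature", "Science"]  # made-up sample
top = term_buckets(journals, min_interval_population=2)
```

Here `"Cell"` occurs once and falls below the minimum population of 2, so it is left out of `top`.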

#### Histogram Facets
Histogram facets build fixed-sized buckets (intervals) over numeric values. The size of the intervals must be specified in the parameter `interval`. We can also specify `min_interval_population` if desired.

In [35]:
base_q(
    return_type="polymer_entity",
    facets=Facet(name="Formula Weight",
                 aggregation_type="histogram",
                 attribute="rcsb_polymer_entity.formula_weight",
                 interval=50,
                 min_interval_population=1
    )
).facets

[{'name': 'Formula Weight',
  'buckets': [{'label': '0.0', 'population': 420895},
   {'label': '50.0', 'population': 50567},
   {'label': '100.0', 'population': 9576},
   {'label': '150.0', 'population': 2605},
   {'label': '200.0', 'population': 727},
   {'label': '250.0', 'population': 430},
   {'label': '300.0', 'population': 250},
   {'label': '350.0', 'population': 90},
   {'label': '400.0', 'population': 38},
   {'label': '450.0', 'population': 1124},
   {'label': '500.0', 'population': 317},
   {'label': '550.0', 'population': 345},
   {'label': '600.0', 'population': 187},
   {'label': '650.0', 'population': 5},
   {'label': '700.0', 'population': 17},
   {'label': '750.0', 'population': 5},
   {'label': '800.0', 'population': 5},
   {'label': '850.0', 'population': 21},
   {'label': '900.0', 'population': 1067},
   {'label': '950.0', 'population': 30},
   {'label': '1000.0', 'population': 56},
   {'label': '1050.0', 'population': 239},
   {'label': '1100.0', 'population': 43},
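
The fixed-size bucketing can be pictured with a short plain-Python sketch. It assumes each value is assigned to the bucket whose lower bound is the largest multiple of `interval` not exceeding it, and that the bucket label is that lower bound; `histogram_buckets` is a hypothetical function and the sample weights are made up.

```python
import math
from collections import Counter


def histogram_buckets(values, interval, min_interval_population=1):
    """Group numeric values into buckets of width `interval`."""
    # Lower bound of the bucket each value falls into.
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return [
        {"label": str(float(start)), "population": n}
        for start, n in sorted(counts.items())
        if n >= min_interval_population
    ]


weights = [12.3, 49.9, 50.0, 75.2, 101.0]  # made-up values
hist = histogram_buckets(weights, interval=50)
```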

#### Date Histogram Facets
Similar to histogram facets, date histogram facets build buckets over date values. For date histogram aggregations, we must specify `interval="year"`. Again, we may also specify `min_interval_population`.

In [None]:
base_q(facets=Facet(name="Release Date", aggregation_type="date_histogram", attribute="rcsb_accession_info.initial_release_date", interval="year", min_interval_population=1)).facets
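
Conceptually, a yearly date histogram counts how many dates fall in each calendar year. The plain-Python sketch below shows that grouping on made-up release dates; `year_buckets` is a hypothetical function, and the year-string labels are an assumption about the output format.

```python
from collections import Counter
from datetime import date


def year_buckets(dates, min_interval_population=1):
    """Count dates per calendar year, in chronological order."""
    counts = Counter(d.year for d in dates)
    return [
        {"label": str(year), "population": n}
        for year, n in sorted(counts.items())
        if n >= min_interval_population
    ]


release_dates = [date(2019, 8, 21), date(2019, 12, 4), date(2020, 1, 15)]  # made-up
by_year = year_buckets(release_dates)
```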

#### Range Facets
We can define the buckets ourselves by using range facets. In order to specify the ranges, we use the `Range` class. Note that the range includes the `start` value and excludes the `end` value (`include_lower` and `include_upper` should not be specified). If the `start` or `end` is omitted, the minimum or maximum boundaries will be used by default. The buckets should be provided as a list of `Range` objects to the `ranges` parameter. 

In [36]:
from rcsbsearchapi.search import Range

base_q(
    facets=Facet(
        name="Resolution Combined",
        aggregation_type="range",
        attribute="rcsb_entry_info.resolution_combined",
        ranges=[Range(start=None,end=2), Range(start=2, end=2.2), Range(start=2.2, end=2.4), Range(start=4.6, end=None)]
    )
).facets

[{'name': 'Resolution Combined',
  'buckets': [{'label': '*-2.0', 'population': 85290},
   {'label': '2.0-2.2', 'population': 27451},
   {'label': '2.2-2.4', 'population': 22144},
   {'label': '4.6-*', 'population': 3496}]}]
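
The start-inclusive, end-exclusive rule is easy to get wrong at the boundaries, so here is a small plain-Python sketch of it. `in_range` and `range_buckets` are hypothetical helpers, the resolution values are made up, and `None` stands for an omitted bound as in the `Range` objects above.

```python
def in_range(value, start=None, end=None):
    """Start-inclusive, end-exclusive membership; None means unbounded."""
    return (start is None or value >= start) and (end is None or value < end)


def range_buckets(values, ranges):
    """Count the values falling in each (start, end) pair."""
    out = []
    for start, end in ranges:
        lo = "*" if start is None else float(start)
        hi = "*" if end is None else float(end)
        population = sum(in_range(v, start, end) for v in values)
        out.append({"label": f"{lo}-{hi}", "population": population})
    return out


resolutions = [1.8, 2.0, 2.19, 2.2, 3.0, 5.1]  # made-up values
res_buckets = range_buckets(
    resolutions, [(None, 2), (2, 2.2), (2.2, 2.4), (4.6, None)]
)
```

A value of exactly 2.0 lands in the `2.0-2.2` bucket rather than `*-2.0`, and 3.0 is counted nowhere because no range covers it.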

#### Date Range Facets
Date range facets allow us to specify date values as bucket ranges, using [date math expressions](https://search.rcsb.org/#date-math-expressions).

In [38]:
base_q(
    facets=Facet(name="Release Date",
                 aggregation_type="date_range",
                 attribute="rcsb_accession_info.initial_release_date",
                 ranges=[Range(start=None,end="2020-06-01||-12M"), Range(start="2020-06-01", end="2020-06-01||+12M"), Range(start="2020-06-01||+12M", end=None)]
    )
).facets

[{'name': 'Release Date',
  'buckets': [{'label': '*-2019-06-01T00:00:00.000Z', 'population': 152114},
   {'label': '2020-06-01T00:00:00.000Z-2021-06-01T00:00:00.000Z',
    'population': 13828},
   {'label': '2021-06-01T00:00:00.000Z-*', 'population': 47672}]}]
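
The bucket labels above show how the date math expressions were resolved: `2020-06-01||-12M` became 2019-06-01 and `2020-06-01||+12M` became 2021-06-01. The sketch below reproduces just that arithmetic in plain Python. `resolve_date_math` is a hypothetical helper that supports only an anchor date with an optional month offset (`M`); the full expression syntax accepted by the service is broader and is not modelled here.

```python
import re
from datetime import date


def resolve_date_math(expr: str) -> date:
    """Resolve 'YYYY-MM-DD' optionally followed by '||+NM' or '||-NM'."""
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})(?:\|\|([+-]\d+)M)?", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    year, month, day = int(m[1]), int(m[2]), int(m[3])
    offset = int(m[4]) if m[4] else 0
    # Work in "months since year 0" so the offset can cross year boundaries.
    months = year * 12 + (month - 1) + offset
    # Day-of-month clamping (e.g. Jan 31 plus one month) is not handled here.
    return date(months // 12, months % 12 + 1, day)


lower = resolve_date_math("2020-06-01||-12M")
upper = resolve_date_math("2020-06-01||+12M")
```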

#### Cardinality Facets 
Cardinality facets return a single value: the count of distinct values returned for a given field. A `precision_threshold` (`<= 40000`, default value `40000`) may be specified.

In [39]:
base_q(
    facets=Facet(
        name="Organism Names Count",
        aggregation_type="cardinality",
        attribute="rcsb_entity_source_organism.ncbi_scientific_name"
    )
).facets

[{'name': 'Organism Names Count', 'value': 9141}]
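
In its simplest form, a cardinality is the size of the set of values. The sketch below computes it exactly on a made-up list of organism names; the existence of a `precision_threshold` parameter suggests the service may trade exactness for memory on large fields, so treat this as a conceptual model rather than a description of the server-side computation.

```python
def cardinality(values):
    """Exact number of distinct values."""
    return len(set(values))


organisms = ["Homo sapiens", "Mus musculus", "Homo sapiens", "Escherichia coli"]  # made-up
distinct = cardinality(organisms)
```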

#### Multidimensional Facets
Complex, multi-dimensional aggregations are possible by specifying additional facets in the `nested_facets` parameter, as in the example below:

In [40]:
f1 = Facet(name="Polymer Entity Types", aggregation_type="terms", attribute="rcsb_entry_info.selected_polymer_entity_types")
f2 = Facet(name="Release Date", aggregation_type="date_histogram", attribute="rcsb_accession_info.initial_release_date", interval="year")
base_q(facets=Facet(name="Experimental Method", aggregation_type="terms", attribute="rcsb_entry_info.experimental_method", nested_facets=[f1, f2])).facets

[{'name': 'Experimental Method',
  'buckets': [{'label': 'X-ray',
    'population': 188172,
    'facets': [{'name': 'Polymer Entity Types',
      'buckets': [{'label': 'Protein (only)', 'population': 166790},
       {'label': 'Protein/Oligosaccharide', 'population': 9624},
       {'label': 'Protein/NA', 'population': 8710},
       {'label': 'Nucleic acid (only)', 'population': 2867},
       {'label': 'Other', 'population': 170},
       {'label': 'Oligosaccharide (only)', 'population': 11}]},
     {'name': 'Release Date',
      'buckets': [{'label': '1976', 'population': 13},
       {'label': '1977', 'population': 23},
       {'label': '1978', 'population': 6},
       {'label': '1979', 'population': 11},
       {'label': '1980', 'population': 16},
       {'label': '1981', 'population': 16},
       {'label': '1982', 'population': 32},
       {'label': '1983', 'population': 36},
       {'label': '1984', 'population': 21},
       {'label': '1985', 'population': 19},
       {'label': '1986'
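
Nested results can be awkward to read as raw dictionaries. One way to work with them is to flatten the tree into rows, as in the sketch below. The sample data is a small excerpt of the output above, and `flatten` is a hypothetical helper, not part of rcsbsearchapi.

```python
# Small excerpt of the nested output above.
nested_result = [{
    "name": "Experimental Method",
    "buckets": [{
        "label": "X-ray",
        "population": 188172,
        "facets": [{
            "name": "Polymer Entity Types",
            "buckets": [
                {"label": "Protein (only)", "population": 166790},
                {"label": "Protein/NA", "population": 8710},
            ],
        }],
    }],
}]


def flatten(facets, path=()):
    """Yield (tuple_of_labels, population) for every bucket, depth-first."""
    for facet in facets:
        for bucket in facet.get("buckets", []):
            here = path + (bucket["label"],)
            yield here, bucket["population"]
            # Recurse into any facets nested inside this bucket.
            yield from flatten(bucket.get("facets", []), here)


rows = list(flatten(nested_result))
```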

#### Filter Facets
Filters restrict which documents contribute to the bucket counts. As with queries, several `TerminalFilter`s can be grouped into a single `GroupFilter`, and a filter is combined with a facet using the `FilterFacet` class. A terminal filter specifies an `attribute` and an `operator`, and optionally a `value`, whether it is a `negation`, and whether it is `case_sensitive`. A group filter specifies a `logical_operator` (either `"and"` or `"or"`) and a list of filters (`nodes`) to combine. Finally, the `FilterFacet` is given a filter and a facet or list of facets. Here are some examples:

In [None]:
from rcsbsearchapi.search import TerminalFilter, GroupFilter, FilterFacet

# Example 1: nested filter facets that restrict the counts to two CATH domains
tf1 = TerminalFilter(attribute="rcsb_polymer_instance_annotation.type", operator="exact_match", value="CATH")
tf2 = TerminalFilter(attribute="rcsb_polymer_instance_annotation.annotation_lineage.id", operator="in", value=["2.140.10.30", "2.120.10.80"])
ff2 = FilterFacet(filters=tf2, facets=Facet("CATH Domains", "terms", "rcsb_polymer_instance_annotation.annotation_lineage.id", min_interval_population=1))
ff1 = FilterFacet(filters=tf1, facets=ff2)
base_q(
    return_type="polymer_instance",
    facets=ff1
).facets

tf1 = TerminalFilter(attribute="rcsb_struct_symmetry.kind", operator="exact_match", value="Global Symmetry", negation=False)
f2 = Facet(name="ec_terms", aggregation_type="terms", attribute="rcsb_polymer_entity.rcsb_ec_lineage.id")
f1 = Facet(name="sym_symbol_terms", aggregation_type="terms", attribute="rcsb_struct_symmetry.symbol", nested_facets=f2)
ff = FilterFacet(filters=tf1, facets=f1)
q1 = AttributeQuery("rcsb_assembly_info.polymer_entity_count", operator="equals", value=1)
q2 = AttributeQuery("rcsb_assembly_info.polymer_entity_instance_count", operator="greater", value=1)
q = q1 & q2
q(return_type="assembly", facets=ff).facets

tf1 = TerminalFilter(attribute="rcsb_polymer_entity_group_membership.aggregation_method", operator="exact_match", value="sequence_identity")
tf2 = TerminalFilter(attribute="rcsb_polymer_entity_group_membership.similarity_cutoff", operator="equals", value=100)
gf = GroupFilter(logical_operator="and", nodes=[tf1, tf2])
ff = FilterFacet(filters=gf, facets=Facet("Distinct Protein Sequence Count", "cardinality", "rcsb_polymer_entity_group_membership.group_id"))
base_q(return_type="polymer_entity", facets=ff).facets
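
The effect of a filter on a facet can be pictured as "filter first, then count". The plain-Python sketch below mimics a terminal filter combined with a terms facet on made-up annotation records. `filtered_term_buckets` is a hypothetical function, the record layout is invented for illustration, and this is a conceptual model rather than the service's implementation.

```python
from collections import Counter


def filtered_term_buckets(docs, predicate, attribute):
    """Count `attribute` values only over the docs that pass `predicate`."""
    counts = Counter(d[attribute] for d in docs if predicate(d))
    return [{"label": k, "population": n} for k, n in counts.most_common()]


docs = [  # made-up annotation records
    {"type": "CATH", "lineage_id": "2.140.10.30"},
    {"type": "CATH", "lineage_id": "2.140.10.30"},
    {"type": "CATH", "lineage_id": "2.120.10.80"},
    {"type": "SCOP", "lineage_id": "b.68.1.1"},
]

# Equivalent in spirit to a terminal filter with operator "exact_match".
cath_only = filtered_term_buckets(docs, lambda d: d["type"] == "CATH", "lineage_id")
```

A group filter with `logical_operator="and"` would correspond to a predicate that requires every member filter to pass, and `"or"` to one that requires at least one.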

For a more practical example, see the [Covid-19 notebook](covid.ipynb)