<a href="https://colab.research.google.com/github/rcsb/py-rcsb-api/blob/master/notebooks/data_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RCSB PDB Search API: Additional Examples

This notebook contains the examples listed in [readthedocs: Additional Examples](https://rcsbapi.readthedocs.io/en/latest/search_api/additional_examples.html) for the Search API sub-package.

If you're looking for an introduction, please refer to the `search_quickstart` notebook or [readthedocs: Quickstart](https://rcsbapi.readthedocs.io/en/dev-it-docs/search_api/quickstart.html).

Start by installing the package:

```pip install rcsb-api```

In [None]:
%pip install rcsb-api

## Sequence Motif Search Examples

In [Query Construction](https://rcsbapi.readthedocs.io/en/dev-it-docs/data_api/query_construction.html#query-construction), you saw an example query using a PROSITE signature.
You can also use a regular expression (RegEx) to make a sequence motif search. As an example, here is a query for the zinc finger motif that binds Zn in a DNA-binding domain:

In [2]:
from rcsbapi.search import SeqMotifQuery

results = SeqMotifQuery(
    "C.{2,4}C.{12}H.{3,5}H",
    pattern_type="regex",
    sequence_type="protein")

for polyid in results("polymer_entity"):
    print(polyid)

1A1F_3
1A1G_3
1A1H_3
1A1I_3
1A1J_3
1A1K_3
1A1L_3
1AAY_3
1ARD_1
1ARE_1
1ARF_1
1BBO_1
1BHI_1
1E08_1
1E39_1
1EJ6_2
1F2I_2
1FN9_1
1G2D_3
1G2F_3
1GX7_1
1H9H_2
1H9I_2
1HFE_2
1JJD_1
1JK1_3
1JK2_3
1JMU_3
1JN7_1
1JRX_1
1JRY_1
1JRZ_1
1KLR_1
1KLS_1
1KSS_1
1KSU_1
1LJ1_1
1LLM_2
1M64_1
1MEY_3
1NCS_1
1NJQ_1
1NX9_1
1P2E_1
1P2H_1
1P47_3
1P7A_1
1PAA_1
1Q9I_1
1QJD_1
1RIK_1
1RIM_1
1RYY_1
1SP1_1
1SP2_1
1SRK_1
1TF3_3
1TF6_3
1U85_1
1U86_1
1UBD_3
1UN6_1
1V4N_1
1VA1_1
1VA2_1
1VA3_1
1VH9_1
1W0R_1
1W0S_1
1WIR_1
1WJP_1
1X3C_1
1X5W_1
1X6E_1
1X6F_1
1X6H_1
1XF7_1
1XRZ_1
1Y0P_1
1YUI_3
1YUJ_3
1Z60_1
1ZAA_3
1ZFD_1
1ZNF_1
1ZNM_1
1ZR9_1
1ZU1_1
1ZW8_1
2AB3_1
2AB7_1
2ADR_1
2B4K_1
2B7R_1
2B7S_1
2B9V_1
2COT_1
2CSE_2
2CSE_4
2CSH_1
2CSX_2
2CT1_1
2CT8_2
2CTD_1
2D74_2
2D8S_1
2D8V_1
2D9H_1
2DCU_2
2DLK_1
2DLQ_1
2DMD_1
2DMI_1
2DRP_3
2EBT_1
2EE8_1
2EL4_1
2EL5_1
2EL6_1
2ELM_1
2ELN_1
2ELO_1
2ELQ_1
2ELR_1
2ELS_1
2ELT_1
2ELU_1
2ELV_1
2ELW_1
2ELX_1
2ELY_1
2ELZ_1
2EM0_1
2EM1_1
2EM2_1
2EM3_1
2EM4_1
2EM5_1
2EM6_1
2EM7_1
2EM8_1
2EM9_1
2EMA_1
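
To see what this pattern accepts before sending it to the server, the same expression can be tried locally with Python's `re` module. This is only a sketch: `toy_sequence` is a hypothetical sequence built to satisfy the pattern, not a real PDB entity, and it assumes that Python's regex syntax agrees with the server's for a pattern this simple.

```python
import re

# The same pattern that was passed to SeqMotifQuery above.
ZINC_FINGER = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

# Hypothetical sequence: C, 2 residues, C, 12 residues, H, 3 residues, H,
# padded with two residues on each side.
toy_sequence = "MK" + "C" + "PE" + "C" + "GKAFSRSDELTR" + "H" + "IRT" + "H" + "TG"

match = ZINC_FINGER.search(toy_sequence)
if match:
    print(f"motif spans positions {match.start()}-{match.end()}: {match.group()}")
```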

You can also use a plain amino acid sequence for a sequence motif search.
`X` allows any amino acid at that position.
As an example, here is a query for SH3 domains:

In [3]:
from rcsbapi.search import SeqMotifQuery

# The default pattern_type argument is "simple" and the sequence_type argument is "protein".
# X is used as a "variable residue" and can be any amino acid. 
results = SeqMotifQuery("XPPXP")

for polyid in results("polymer_entity"):
    print(polyid)

13PK_1
16PK_1
16VP_1
1A0L_1
1A3I_1
1A3J_1
1A4R_1
1A5Y_1
1A7M_1
1AAX_1
1ABO_2
1AD3_1
1ADQ_1
1AFV_1
1AIG_2
1AIJ_2
1AJE_1
1AK4_2
1AKA_1
1AKB_1
1AKC_1
1AKZ_1
1AM4_1
1AMA_1
1AN0_1
1AOL_1
1AOZ_1
1ARO_1
1ASO_1
1ASP_1
1ASQ_1
1AW9_1
1AWI_2
1AWJ_1
1AZE_2
1B0L_1
1B1X_1
1B41_1
1B70_2
1B7U_1
1B7Y_2
1B7Z_1
1B80_1
1B82_1
1B85_1
1B8X_1
1BBZ_2
1BE3_2
1BEA_1
1BFA_1
1BGY_2
1BKA_1
1BKV_1
1BL9_1
1BLS_1
1BQ3_1
1BQ4_1
1BQS_1
1BX0_1
1BX1_1
1BZC_1
1C04_4
1C2B_1
1C2O_1
1C2P_1
1C3B_1
1C3V_1
1C8T_1
1CAG_1
1CB6_1
1CC1_1
1CCH_1
1CEZ_3
1CF0_2
1CF4_1
1CF9_1
1CGD_1
1CJF_2
1CKB_2
1CKR_1
1CN3_1
1CNS_1
1COR_1
1CPO_1
1CQZ_1
1CR6_1
1CRK_1
1CSG_1
1CSJ_1
1CXV_1
1D2B_1
1D2T_1
1D8D_1
1D8E_1
1DCT_3
1DD1_1
1DDV_2
1DHN_1
1DN2_1
1DOA_1
1DPB_1
1DPC_1
1DPD_1
1DS8_2
1DSN_1
1DT0_1
1DT6_1
1DTZ_1
1DV3_2
1DV6_2
1DZI_2
1DZL_1
1E07_1
1E0A_1
1E0C_1
1E14_3
1E1C_1
1E3D_1
1E4K_1
1E6D_3
1E6J_3
1E7U_1
1E7V_1
1E8W_1
1E8X_1
1E8Y_1
1E8Z_1
1E90_1
1EAA_1
1EAB_1
1EAC_1
1EAD_1
1EAE_1
1EAF_1
1EEM_1
1EEN_1
1EEO_1
1EER_2
1EFD_1
1EFW_2
1EG0_14
1EH3_1
1EH5_
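
One way to reason about the `simple` pattern type is to translate it into a regular expression in which `X` becomes a single-character wildcard. The helper below, `simple_motif_to_regex`, is hypothetical and not part of `rcsb-api`; it only illustrates the wildcard behaviour described above on a made-up sequence.

```python
import re


def simple_motif_to_regex(pattern: str) -> re.Pattern:
    """Translate a 'simple' motif into a regex, treating X as any residue.

    Hypothetical helper for local experimentation only.
    """
    parts = ("." if ch == "X" else re.escape(ch) for ch in pattern.upper())
    return re.compile("".join(parts))


motif = simple_motif_to_regex("XPPXP")
toy_sequence = "GSAPPLPRR"  # hypothetical sequence, not taken from the PDB
hits = [m.start() for m in motif.finditer(toy_sequence)]
print(hits)
```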

All three pattern types (`simple`, `prosite`, and `regex`) can also be used to search DNA and RNA sequences.
Here are two queries, one DNA and one RNA, using the `simple` pattern type:

In [4]:
from rcsbapi.search import SeqMotifQuery

# DNA query: this is a query for a T-Box.
dna = SeqMotifQuery("TCACACCT", sequence_type="dna")

print("DNA results:")
for polyid in dna("polymer_entity"):
    print(polyid)

# RNA query: 6C RNA motif
rna = SeqMotifQuery("CCCCCC", sequence_type="rna")
print("RNA results:")
for polyid in rna("polymer_entity"):
    print(polyid)

DNA results:
1H6F_2
1XBR_1
2X6V_2
4A04_2
4ROC_4
4S0H_3
5FLV_2
5T1J_2
6F58_1
6F59_1
8CDN_3
RNA results:
1A60_1
1ML5_4
1QCU_2
1VVJ_25
1VY4_26
1VY5_26
1VY6_25
1VY7_26
2A8V_2
2GTT_2
2GTT_3
2JEA_3
2LC8_1
2WJ8_2
2YHM_2
3AM1_2
3IYQ_1
3IYR_1
3J3V_5
3J3W_1
3J7O_1
3J7P_1
3J7P_49
3J7Q_1
3J7R_1
3J7R_50
3J92_49
3J9W_33
3JAG_48
3JAG_51
3JAH_48
3JAH_51
3JAI_48
3JAI_51
3JAJ_32
3JAJ_54
3JAN_32
3JAN_53
3PO2_14
3PU0_2
4BKK_1
4BXX_14
4D5L_1
4D5Y_44
4D61_1
4D67_44
4L47_25
4L71_22
4LEL_25
4LFZ_25
4LNT_25
4LSK_25
4LT8_25
4P6F_25
4P70_22
4TUA_25
4TUB_25
4TUC_25
4TUD_25
4TUE_25
4UFT_2
4UG0_1
4UG0_47
4UJC_3
4UJC_4
4UJC_50
4UJD_1
4UJD_49
4UJD_50
4UJE_4
4UJE_81
4V42_25
4V4I_1
4V4J_1
4V4N_10
4V4N_38
4V4N_39
4V4P_2
4V4R_27
4V4S_27
4V4T_26
4V4X_24
4V4Y_24
4V4Z_25
4V51_34
4V5A_35
4V5C_36
4V5D_35
4V5E_35
4V5F_35
4V5G_36
4V5J_35
4V5K_35
4V5L_36
4V5M_35
4V5N_35
4V5P_36
4V5Q_36
4V5R_36
4V5S_36
4V5V_2
4V5Z_1
4V5Z_26
4V63_25
4V67_25
4V68_38
4V6A_25
4V6F_1
4V6F_55
4V6G_24
4V6U_11
4V6U_67
4V6U_68
4V6X_36
4V6X_85
4V7J_57
4V7K
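
The identifiers returned for `polymer_entity` combine an entry ID and an entity ID, separated by an underscore. A small sketch of grouping such results by entry, using a handful of IDs copied from the RNA output above (the variable names are illustrative):

```python
from collections import defaultdict

# A few polymer entity IDs copied from the RNA results above.
polymer_entity_ids = ["1A60_1", "2GTT_2", "2GTT_3", "3J7P_1", "3J7P_49"]

entities_by_entry = defaultdict(list)
for polymer_entity_id in polymer_entity_ids:
    # Split only on the first underscore: "<entry ID>_<entity ID>".
    entry_id, entity_id = polymer_entity_id.split("_", 1)
    entities_by_entry[entry_id].append(entity_id)

print(dict(entities_by_entry))
```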

## Structure Similarity Search Examples
This is a more complex example that utilizes `chain_id`, the `relaxed_shape_match` operator, and a `target_search_space` of `polymer_entity_instance`. Specifying whether the input structure type is `chain_id` or `assembly_id` is very important. For example, specifying `chain_id` as the input structure type but inputting an assembly ID can lead to an error.

In [5]:
from rcsbapi.search import StructSimilarityQuery

# More complex query:
# Entry ID value "4HHB", chain ID "B", operator "relaxed", and target search space "Chains"
q2 = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB",
    structure_input_type="chain_id",
    chain_id="B",
    operator="relaxed_shape_match",
    target_search_space="polymer_entity_instance"
)
list(q2())

['4HHB',
 '1Y35',
 '1Y45',
 '3HHB',
 '1BAB',
 '1Y4R',
 '1O1O',
 '1BZZ',
 '1QSI',
 '1DXV',
 '2HHB',
 '1BZ0',
 '1RQ3',
 '1A0U',
 '1HBB',
 '1K0Y',
 '1YE1',
 '1Y5F',
 '1Y4Q',
 '1XXT',
 '1Y09',
 '1DXU',
 '1BZ1',
 '1VWT',
 '1QSH',
 '1XZV',
 '1KD2',
 '1XZ2',
 '1Y7C',
 '1A0Z',
 '1Y46',
 '1Y4V',
 '1GBV',
 '3QJD',
 '1COH',
 '1GLI',
 '1Y4P',
 '1Y7D',
 '1Y7G',
 '1Y0T',
 '1Y4B',
 '1XYE',
 '1Y5K',
 '2HBS',
 '1G9V',
 '1YGF',
 '1Y4F',
 '1Y0W',
 '1XZ5',
 '1Y4G',
 '1Y31',
 '1A00',
 '1O1N',
 '1HBA',
 '1O1L',
 '1HDB',
 '1Y0D',
 '1XY0',
 '1YH9',
 '3NMM',
 '1Y83',
 '1Y2Z',
 '1GBU',
 '8WJ2',
 '1YHE',
 '2DN2',
 '1YIH',
 '1O1J',
 '1RPS',
 '1B86',
 '1O1P',
 '1XZ7',
 '1YIE',
 '1Y22',
 '2W6V',
 '6BWP',
 '1Y5J',
 '1Y0C',
 '2YRS',
 '1XZ4',
 '1YHR',
 '1YGD',
 '1CLS',
 '1BIJ',
 '1A01',
 '2W72',
 '1C7D',
 '5KSI',
 '6KAI',
 '1RQA',
 '1NIH',
 '7DY4',
 '1C7C',
 '1J40',
 '6LCX',
 '6KAE',
 '6KA9',
 '7UD7',
 '1J7W',
 '3QJE',
 '5KDQ',
 '1O1M',
 '1HGB',
 '1YDZ',
 '1YEO',
 '1Y0A',
 '3WCP',
 '1Y8W',
 '1YE2',
 '1GZX',
 '1A3O',
 

Structure similarity queries also accept a structure file, either uploaded from your local machine or referenced by URL, and search the PDB archive for similar structures. The file holds the target structure in one of the formats "cif", "bcif", "pdb", "cif.gz", or "pdb.gz". For a URL-based query, you must specify the `structure_search_type`, the `file_url`, and the `file_format`. A file upload works the same way, except that you provide the absolute path to the file on the local machine (`file_path`) instead of a URL.

In [7]:
from rcsbapi.search import StructSimilarityQuery

# Using file_url
q3 = StructSimilarityQuery(
    structure_search_type="file_url",
    file_url="https://files.rcsb.org/view/4HHB.cif",
    file_format="cif"
)
list(q3())


['4HHB',
 '1G9V',
 '2HHB',
 '1BZ0',
 '1K0Y',
 '1COH',
 '3HHB',
 '1QSH',
 '1VWT',
 '1BZZ',
 '1GBV',
 '2DN2',
 '1A01',
 '1C7D',
 '1Y35',
 '1GLI',
 '1RQ3',
 '1Y4Q',
 '1BAB',
 '1O1O',
 '1Y4B',
 '1O1P',
 '1DXU',
 '1Y4R',
 '1DXV',
 '1O1M',
 '1YHE',
 '1BZ1',
 '1QSI',
 '1J7S',
 '1GBU',
 '1A0U',
 '1RQ4',
 '1Y0T',
 '1O1J',
 '1HBB',
 '1O1N',
 '1Y31',
 '1O1L',
 '1XZ2',
 '1Y7Z',
 '1Y7G',
 '1XZV',
 '1C7C',
 '1Y45',
 '1DXT',
 '1XXT',
 '1Y09',
 '1Y4V',
 '1XZU',
 '6HBW',
 '1YH9',
 '1Y2Z',
 '1Y22',
 '1Y0C',
 '5KSJ',
 '5KSI',
 '1Y0A',
 '1C7B',
 '1XYE',
 '1Y0W',
 '1A0Z',
 '1QI8',
 '1J7W',
 '1Y46',
 '1O1K',
 '1YE2',
 '1B86',
 '1HDB',
 '1Y4F',
 '1LFL',
 '1Y5F',
 '2HBS',
 '1Y5J',
 '1Y7C',
 '1GZX',
 '1HBA',
 '3DUT',
 '1Y85',
 '1RPS',
 '1A00',
 '1HBS',
 '1R1Y',
 '1HGC',
 '2W72',
 '1XZ5',
 '3NMM',
 '1Y4P',
 '1YIH',
 '1XY0',
 '3HXN',
 '1XZ7',
 '1KD2',
 '2W6V',
 '1YGF',
 '2HHD',
 '1Y83',
 '1A3N',
 '1Y0D',
 '5KDQ',
 '1HGB',
 '1HGA',
 '3QJD',
 '1Y5K',
 '3QJE',
 '7UD7',
 '6BWP',
 '1XZ4',
 '1Y4G',
 '1CLS',
 '1J7Y',
 

In [None]:
from rcsbapi.search import StructSimilarityQuery

# Using `file_path`
q4 = StructSimilarityQuery(
    structure_search_type="file_upload",
    file_path="/PATH/TO/FILE.cif",  # specify local model file path
    file_format="cif"
)
list(q4())
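
Because `file_format` has to agree with the file that is actually supplied, it can be convenient to derive it from the file name. The helper below is a hypothetical sketch, not part of `rcsb-api`; it only knows the five formats listed above and raises an error for anything else.

```python
# Formats listed above; compound suffixes come first so "x.cif.gz" is not read as "gz".
SUPPORTED_FORMATS = ("cif.gz", "pdb.gz", "bcif", "cif", "pdb")


def guess_file_format(path_or_url: str) -> str:
    """Return the file_format string matching a file name (hypothetical helper)."""
    name = path_or_url.rsplit("/", 1)[-1].lower()
    for fmt in SUPPORTED_FORMATS:
        if name.endswith("." + fmt):
            return fmt
    raise ValueError(f"unsupported structure file: {name!r}")


print(guess_file_format("https://files.rcsb.org/view/4HHB.cif"))
print(guess_file_format("/PATH/TO/FILE.pdb.gz"))
```

The returned string could then be passed as `file_format` alongside `file_url` or `file_path`.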

## Structure Motif Search Examples

As with structure similarity queries, a `file_url` or `file_path` can be provided in place of an `entry_id`.

For a `file_url` query, you *must* provide both a valid file URL (a string) and the file's extension (also a string). If either is missing, the package raises an `AssertionError`.

Below is the same query as shown in [Query Construction](https://rcsbapi.readthedocs.io/en/dev-it-docs/search_api/query_construction.html#structure-motif-search), only this time providing a file URL:

In [10]:
from rcsbapi.search import StructMotifQuery, StructMotifResidue

# Construct a Residue with:
# Chain ID of A, an operator of 1, residue number 192, and Exchanges of "LYS" and "HIS".
# Valid "Exchange" values are defined by the package as a Literal type,
# so a type checker can validate them.
Res1 = StructMotifResidue(
    struct_oper_id="1",
    chain_id="A",
    exchanges=["LYS", "HIS"],  # exchanges are optional
    label_seq_id=192
)

Res2 = StructMotifResidue(
    struct_oper_id="1",
    chain_id="A",
    label_seq_id=162
)

# After declaring a minimum of 2 and as many as 10 residues,
# they can be passed into a list for use in the query itself:
ResList = [Res1, Res2]

link = "https://files.rcsb.org/view/2MNR.cif"
q2 = StructMotifQuery(
    structure_search_type="file_url",
    url=link,
    file_extension="cif",
    residue_ids=ResList
)
# structure_search_type MUST be provided. A mismatched query type will cause an error. 
list(q2())

['2V4I',
 '3JCM',
 '4E9O',
 '4GA5',
 '4QSL',
 '5TN9',
 '5Z1F',
 '6ZU9',
 '7BN9',
 '7QDZ',
 '7SQO',
 '7UWA',
 '8CY3',
 '8GU6',
 '1I4Z',
 '1VZ6',
 '1ZO8',
 '2VFJ',
 '3J8C',
 '4F9L',
 '4JHM',
 '5GGF',
 '5NZZ',
 '5RUH',
 '5X52',
 '5XLS',
 '6AOM',
 '6GCK',
 '6GCO',
 '6JWP',
 '6RX6',
 '6TBM',
 '8E9W',
 '8QCF',
 '8STS',
 '8Z5J',
 '8ZDY',
 '9BAP',
 '1FT6',
 '3UTF',
 '4OL8',
 '4P79',
 '5KRO',
 '5VWT',
 '5ZH2',
 '6C21',
 '6L73',
 '6QFT',
 '7E5B',
 '7JMH',
 '7OBR',
 '7OGT',
 '7R5S',
 '7U0G',
 '8CVS',
 '8DIC',
 '8FS2',
 '8GET',
 '8IT9',
 '8TCL',
 '1TC2',
 '2W2X',
 '3NOG',
 '3O2U',
 '3OHB',
 '4E1J',
 '4Z8C',
 '5JJT',
 '6AHD',
 '6PS6',
 '7ENI',
 '7KO2',
 '7UTZ',
 '8GBB',
 '8OU1',
 '8WAQ',
 '2JF9',
 '3RBM',
 '4JPG',
 '5CE6',
 '5EZK',
 '5HD1',
 '5VFO',
 '5X23',
 '5ZEA',
 '6BLJ',
 '6DL8',
 '6JE4',
 '6LPC',
 '7ASP',
 '7D89',
 '7QWK',
 '8UO3',
 '1FTL',
 '3FSU',
 '3LVR',
 '5D1S',
 '5EMZ',
 '6JUY',
 '7K36',
 '8DSY',
 '8URO',
 '3G45',
 '3HKB',
 '4U8O',
 '4WP9',
 '5J6D',
 '5VQS',
 '5X11',
 '6COZ',
 '6K4U',
 

A query using `file_path` would look something like this:

In [None]:
from rcsbapi.search import StructMotifQuery

file_path = "/absolute/path/to/file.cif"
q3 = StructMotifQuery(
    structure_search_type="file_upload",
    file_path=file_path,
    file_extension="cif",
    residue_ids=ResList
)
list(q3())

Structure Motif Query supports many additional parameters: `backbone_distance_tolerance`, `side_chain_distance_tolerance`, `angle_tolerance`, `rmsd_cutoff`, `limit` (stop searching after this many hits), `atom_pairing_scheme`, `motif_pruning_strategy`, `allowed_structures`, and `excluded_structures`. These can be mixed and matched as needed to make accurate and useful queries. Each has a default value that is used when the parameter isn't provided (see [Query Construction](query_construction.md#structure-motif-search)); these defaults conform to those used by the Search API.

The examples below demonstrate how to define these parameters:

In [None]:
from rcsbapi.search import StructMotifQuery

# Specifying backbone distance tolerance: 0-3, default is 1
# Allowed backbone distance tolerance in Angstrom. 
backbone = StructMotifQuery(
    entry_id="2MNR",
    backbone_distance_tolerance=2,
    residue_ids=ResList
)
list(backbone())

# Specifying sidechain distance tolerance: 0-3, default is 1
# Allowed side-chain distance tolerance in Angstrom.
sidechain = StructMotifQuery(
    entry_id="2MNR",
    side_chain_distance_tolerance=2,
    residue_ids=ResList
)
list(sidechain())

# Specifying angle tolerance: 0-3, default is 1
# Allowed angle tolerance in multiples of 20 degrees. 
angle = StructMotifQuery(
    entry_id="2MNR",
    angle_tolerance=2,
    residue_ids=ResList
)
list(angle())

# Specifying RMSD cutoff: >=0, default is 2
# Threshold above which hits will be filtered by RMSD
rmsd = StructMotifQuery(
    entry_id="2MNR",
    rmsd_cutoff=1,
    residue_ids=ResList
)
list(rmsd())

# Specifying limit: >=0, default excluded
# Stop accepting results after this many hits. 
limit = StructMotifQuery(
    entry_id="2MNR",
    limit=100,
    residue_ids=ResList
)
list(limit())

# Specifying atom pairing scheme, default = "SIDE_CHAIN"
# ENUM: "ALL", "BACKBONE", "SIDE_CHAIN", "PSEUDO_ATOMS"
# This is typechecked by a literal. 
# Which atoms to consider to compute RMSD scores and transformations. 
atom = StructMotifQuery(
    entry_id="2MNR",
    atom_pairing_scheme="ALL",
    residue_ids=ResList
)
list(atom())

# Specifying motif pruning strategy, default = "KRUSKAL"
# ENUM: "NONE", "KRUSKAL"
# This is typechecked by a literal in the package. 
# Specifies how many query motifs are "pruned".
# KRUSKAL leads to less stringent queries, and faster results.
pruning = StructMotifQuery(
    entry_id="2MNR",
    motif_pruning_strategy="NONE",
    residue_ids=ResList
)
list(pruning())

# Specifying allowed structures, default excluded
# Specify the structures you wish to allow in the returned results. As an example,
# we could allow only the results from the limited query we ran earlier.
allowed = StructMotifQuery(
    entry_id="2MNR",
    allowed_structures=list(limit()),
    residue_ids=ResList
)
list(allowed())

# Specifying structures to exclude, default excluded
# Specify structures to exclude from a query. We could, for example,
# exclude the results of the previous allowed query.
excluded = StructMotifQuery(
    entry_id="2MNR",
    excluded_structures=list(allowed()),
    residue_ids=ResList
)
list(excluded())
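
The value ranges noted in the comments above can be checked locally before a query is sent. The function below is a hypothetical pre-flight check, not something `rcsb-api` provides; its bounds are copied from the comments in the preceding cells (tolerances from 0 to 3, a non-negative RMSD cutoff, and between 2 and 10 residues).

```python
def check_motif_params(n_residues, backbone=1, side_chain=1, angle=1, rmsd_cutoff=2):
    """Return a list of problems; an empty list means the values look valid.

    Hypothetical helper: bounds are taken from the comments in the cells above.
    """
    problems = []
    if not 2 <= n_residues <= 10:
        problems.append("residue_ids needs between 2 and 10 residues")
    tolerances = {
        "backbone_distance_tolerance": backbone,
        "side_chain_distance_tolerance": side_chain,
        "angle_tolerance": angle,
    }
    for name, value in tolerances.items():
        if not 0 <= value <= 3:
            problems.append(f"{name} must be between 0 and 3, got {value}")
    if rmsd_cutoff < 0:
        problems.append("rmsd_cutoff must be >= 0")
    return problems


print(check_motif_params(n_residues=2, angle=2))     # no problems expected
print(check_motif_params(n_residues=1, backbone=4))  # two problems expected
```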

The Structure Motif Query can be used to make some very specific queries. Below is an example of a query that retrieves occurrences of the enolase superfamily, a group of proteins diverse in sequence and structure that are all capable of abstracting a proton from a carboxylic acid. Position-specific exchanges are crucial to represent this superfamily accurately.

In [None]:
from rcsbapi.search import StructMotifQuery, StructMotifResidue

Res1 = StructMotifResidue("A", "1", 162, ["LYS", "HIS"])
Res2 = StructMotifResidue("A", "1", 193)
Res3 = StructMotifResidue("A", "1", 219)
Res4 = StructMotifResidue("A", "1", 245, ["GLU", "ASP", "ASN"])
Res5 = StructMotifResidue("A", "1", 295, ["HIS", "LYS"])

ResList = [Res1, Res2, Res3, Res4, Res5]

query = StructMotifQuery(entry_id="2MNR", residue_ids=ResList)

list(query())

## Chemical Similarity Search Examples

In [None]:
from rcsbapi.search import ChemSimilarityQuery

# Basic query with default values: query type = formula and match subset = False
q1 = ChemSimilarityQuery(value="C12 H17 N4 O S")

# Same example but with all the parameters listed
q1 = ChemSimilarityQuery(
    value="C12 H17 N4 O S",
    query_type="formula",
    match_subset=False
)
list(q1())

Below are two examples of using the query option `descriptor`. Both `descriptor_type`s are shown.

In [None]:
from rcsbapi.search import ChemSimilarityQuery

# Query with descriptor_type SMILES,
# match_type = "graph-relaxed-stereo" (similar ligands (stereospecific))
q2 = ChemSimilarityQuery(
    value="Cc1c(sc[n+]1Cc2cnc(nc2N)C)CCO",
    query_type="descriptor",
    descriptor_type="SMILES",
    match_type="graph-relaxed-stereo"
)
list(q2())

In [None]:
from rcsbapi.search import ChemSimilarityQuery

# Query descriptor_type InChI,
# match_type = "sub-struct-graph-relaxed-stereo" (substructure (stereospecific))
q3 = ChemSimilarityQuery(
    value="InChI=1S/C13H10N2O4/c16-10-6-5-9(11(17)14-10)15-12(18)7-3-1-2-4-8(7)13(15)19/h1-4,9H,5-6H2,(H,14,16,17)/t9-/m0/s1",
    query_type="descriptor",
    descriptor_type="InChI",
    match_type="sub-struct-graph-relaxed-stereo"
)
list(q3())

## Faceted Query Examples
For more details on arguments, see the [API reference](api.rst)

### Terms Facets
Terms faceting is a multi-bucket aggregation where buckets are dynamically built - one per unique value. We can specify the minimum count (`>= 0`) for a bucket to be returned using the parameter `min_interval_population` (default value `1`). We can also control the number of buckets returned using the parameter `max_num_intervals` (default value `65336`).

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet

# This is the default query used by the RCSB Search API when no query is specified.
# This default query will be used for most of the examples found below for faceted queries.
q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental",
) 

q(
    facets=Facet(
        name="Journals",
        aggregation_type="terms",
        attribute="rcsb_primary_citation.rcsb_journal_abbrev",
        min_interval_population=1000
    )
).facets
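
As a rough local analogue of what a terms aggregation computes, the sketch below counts unique values and drops buckets below a minimum population. The journal list is hypothetical data, `terms_buckets` is an illustrative helper rather than part of `rcsb-api`, and the ordering of buckets here is not meant to mirror the server's.

```python
from collections import Counter

# Hypothetical values of the faceted attribute, one per matching entry.
journals = ["J Mol Biol", "Nature", "J Mol Biol", "Science", "Nature", "J Mol Biol"]


def terms_buckets(values, min_interval_population=1, max_num_intervals=None):
    """One bucket per unique value, keeping only sufficiently populated buckets."""
    counts = Counter(values)
    kept = [(label, n) for label, n in counts.most_common() if n >= min_interval_population]
    return kept[:max_num_intervals]  # slicing with None keeps every bucket


print(terms_buckets(journals, min_interval_population=2))
```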

### Histogram Facets
Histogram facets build fixed-sized buckets (intervals) over numeric values. The size of the intervals must be specified in the parameter `interval`. We can also specify `min_interval_population` if desired.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental"
) 

q(
    return_type="polymer_entity",
    facets=Facet(
        name="Formula Weight",
        aggregation_type="histogram",
        attribute="rcsb_polymer_entity.formula_weight",
        interval=50,
        min_interval_population=1
    )
).facets
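
The bucketing behind a histogram facet can be sketched locally. This example assumes that buckets are aligned to multiples of `interval` and keyed by their lower bound, which is a common convention but is not confirmed here for the Search API; the weights are made-up values.

```python
import math
from collections import Counter


def histogram_buckets(values, interval, min_interval_population=1):
    """Count values per fixed-size bucket, keyed by the bucket's lower bound."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return {start: n for start, n in sorted(counts.items()) if n >= min_interval_population}


weights = [12.4, 49.9, 50.0, 73.2, 149.0]  # hypothetical formula weights
print(histogram_buckets(weights, interval=50))
```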

### Date Histogram Facets
Similar to histogram facets, date histogram facets build buckets over date values. For date histogram aggregations, we must specify `interval="year"`. Again, we may also specify `min_interval_population`.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental"
) 

q(
    facets=Facet(
        name="Release Date",
        aggregation_type="date_histogram",
        attribute="rcsb_accession_info.initial_release_date",
        interval="year",
        min_interval_population=1
    )
).facets

### Range Facets
We can define the buckets ourselves by using range facets. In order to specify the ranges, we use the `FacetRange` class. Note that the range includes the `start` value and excludes the `end` value (`include_lower` and `include_upper` should not be specified). If the `start` or `end` is omitted, the minimum or maximum boundaries will be used by default. The buckets should be provided as a list of `FacetRange` objects to the `ranges` parameter.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet, FacetRange

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental"
)

q(
    facets=Facet(
        name="Resolution Combined",
        aggregation_type="range",
        attribute="rcsb_entry_info.resolution_combined",
        ranges=[
            FacetRange(start=None, end=2),
            FacetRange(start=2, end=2.2),
            FacetRange(start=2.2, end=2.4),
            FacetRange(start=4.6, end=None)
        ]
    )
).facets
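
The half-open behaviour described above (`start` included, `end` excluded, `None` meaning unbounded) can be illustrated with a few lines of plain Python. The bounds are the same as in the query above; `in_range` and `bucket_for` are illustrative names only.

```python
def in_range(value, start=None, end=None):
    """Half-open test: start is inclusive, end is exclusive, None is unbounded."""
    return (start is None or value >= start) and (end is None or value < end)


# Same bounds as the FacetRange objects in the query above.
ranges = [(None, 2), (2, 2.2), (2.2, 2.4), (4.6, None)]


def bucket_for(value):
    """Return the first range containing value, or None if no range matches."""
    for start, end in ranges:
        if in_range(value, start, end):
            return (start, end)
    return None


for resolution in (1.99, 2, 2.2, 3.0, 4.6):
    print(resolution, "->", bucket_for(resolution))
```

Note that a value such as 3.0 falls in none of the ranges, so it would not contribute to any bucket.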

### Date Range Facets
Date range facets allow us to specify date values as bucket ranges, using [date math expressions](https://search.rcsb.org/#date-math-expressions).

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet, FacetRange

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental"
)

q(
    facets=Facet(
        name="Release Date",
        aggregation_type="date_range",
        attribute="rcsb_accession_info.initial_release_date",
        ranges=[
            FacetRange(start=None, end="2020-06-01||-12M"),
            FacetRange(start="2020-06-01", end="2020-06-01||+12M"),
            FacetRange(start="2020-06-01||+12M", end=None)
        ]
    )
).facets
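
To make the date math expressions in this query concrete, the sketch below resolves the two forms used above: an anchor date optionally followed by `||+NM` or `||-NM`. It is a deliberately narrow, hypothetical helper that handles month offsets only; the full expression syntax is described in the linked documentation, and clamping the day to the end of the month is a local choice made here.

```python
import calendar
import re
from datetime import date


def resolve_month_math(expr: str) -> date:
    """Resolve 'YYYY-MM-DD' optionally followed by '||+NM' or '||-NM'.

    Hypothetical helper: only whole-month offsets are supported.
    """
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2})(?:\|\|([+-]\d+)M)?", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    anchor = date.fromisoformat(m.group(1))
    months = int(m.group(2) or 0)
    # Work in "months since year 0" so that year boundaries are handled by divmod.
    total = anchor.year * 12 + (anchor.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(anchor.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


for expr in ("2020-06-01||-12M", "2020-06-01", "2020-06-01||+12M"):
    print(expr, "->", resolve_month_math(expr))
```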

### Cardinality Facets 
Cardinality facets return a single value: the count of distinct values returned for a given field. A `precision_threshold` (`<= 40000`, default value `40000`) may be specified.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental",
) 

q(
    facets=Facet(
        name="Organism Names Count",
        aggregation_type="cardinality",
        attribute="rcsb_entity_source_organism.ncbi_scientific_name"
    )
).facets

### Multidimensional Facets
Complex, multi-dimensional aggregations are possible by specifying additional facets in the `nested_facets` parameter, as in the example below:

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Facet

f1 = Facet(
    name="Polymer Entity Types",
    aggregation_type="terms",
    attribute="rcsb_entry_info.selected_polymer_entity_types"
)
f2 = Facet(
    name="Release Date",
    aggregation_type="date_histogram",
    attribute="rcsb_accession_info.initial_release_date",
    interval="year"
)

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental",
) 

q(
    facets=Facet(
        name="Experimental Method",
        aggregation_type="terms",
        attribute="rcsb_entry_info.experimental_method",
        nested_facets=[f1, f2]
    )
).facets
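
Conceptually, nested facets compute the inner aggregations separately inside every bucket of the outer one. The sketch below mimics that on a few hypothetical records; it is a local analogue of the idea and does not reproduce the structure of the API response.

```python
from collections import Counter, defaultdict

# Hypothetical records: (experimental method, polymer entity type, release year).
records = [
    ("X-ray", "Protein", 2020),
    ("X-ray", "Protein", 2021),
    ("EM", "Protein", 2021),
    ("X-ray", "Nucleic acid", 2020),
]

# Outer bucket per method; inside each, one counter per nested facet.
nested = defaultdict(lambda: {"Polymer Entity Types": Counter(), "Release Date": Counter()})
for method, polymer_type, year in records:
    nested[method]["Polymer Entity Types"][polymer_type] += 1
    nested[method]["Release Date"][year] += 1

for method, inner in nested.items():
    print(method, dict(inner["Polymer Entity Types"]), dict(inner["Release Date"]))
```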

### Filter Facets
Filters allow us to restrict which documents contribute to the bucket counts. Similar to queries, we can group several `TerminalFilter`s into a single `GroupFilter`. We can combine a filter with a facet using the `FilterFacet` class. Terminal filters should specify an `attribute` and `operator`, and possibly a `value` and whether the filter should be a `negation` and/or `case_sensitive`. Group filters should specify a `logical_operator` (which should be either `"and"` or `"or"`) and a list of filters (`nodes`) that should be combined. Finally, the `FilterFacet` should be provided with a filter and a (list of) facet(s).

Here is an example that filters only protein chains which adopt 2 different beta propeller arrangements according to the CATH classification.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import TerminalFilter, GroupFilter, FilterFacet, Facet

tf1 = TerminalFilter(
    attribute="rcsb_polymer_instance_annotation.type",
    operator="exact_match",
    value="CATH"
)
tf2 = TerminalFilter(
    attribute="rcsb_polymer_instance_annotation.annotation_lineage.id",
    operator="in",
    value=["2.140.10.30", "2.120.10.80"]
)
ff2 = FilterFacet(
    filter=tf2,
    facets=Facet(
        name="CATH Domains",
        aggregation_type="terms",
        attribute="rcsb_polymer_instance_annotation.annotation_lineage.id",
        min_interval_population=1
    )
)

q = AttributeQuery(
    attribute="rcsb_entry_info.structure_determination_methodology",
    operator="exact_match",
    value="experimental"
) 

q(
    return_type="polymer_instance",
    facets=FilterFacet(filter=tf1, facets=ff2)
).facets
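
The effect of stacking a filter in front of a terms facet can be pictured as "filter first, then count". The sketch below does this on hypothetical annotation records; the non-CATH record is deliberately artificial so that the type filter has something to exclude.

```python
from collections import Counter

# Hypothetical annotation records; only "type" and "id" matter here.
annotations = [
    {"type": "CATH", "id": "2.140.10.30"},
    {"type": "CATH", "id": "2.120.10.80"},
    {"type": "CATH", "id": "2.140.10.30"},
    {"type": "OTHER", "id": "2.140.10.30"},  # excluded by the type filter
    {"type": "CATH", "id": "1.10.8.10"},     # excluded by the ID filter
]
wanted_ids = {"2.140.10.30", "2.120.10.80"}

# Apply both filters, then count what is left per ID (the terms facet).
counts = Counter(
    a["id"] for a in annotations if a["type"] == "CATH" and a["id"] in wanted_ids
)
print(dict(counts))
```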

This example shows how to get assembly counts per symmetry type, further broken down by Enzyme Classification (EC) classes.
The assemblies are first filtered to homo-oligomers only.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import TerminalFilter, GroupFilter, FilterFacet, Facet

tf1 = TerminalFilter(
    attribute="rcsb_struct_symmetry.kind",
    operator="exact_match",
    value="Global Symmetry",
    negation=False
)
f2 = Facet(
    name="ec_terms",
    aggregation_type="terms",
    attribute="rcsb_polymer_entity.rcsb_ec_lineage.id"
)
f1 = Facet(
    name="sym_symbol_terms",
    aggregation_type="terms",
    attribute="rcsb_struct_symmetry.symbol",
    nested_facets=f2
)

ff = FilterFacet(filter=tf1, facets=f1)
q1 = AttributeQuery(
    attribute="rcsb_assembly_info.polymer_entity_count",
    operator="equals",
    value=1
)
q2 = AttributeQuery(
    attribute="rcsb_assembly_info.polymer_entity_instance_count",
    operator="greater",
    value=1
)
q = q1 & q2
q(return_type="assembly", facets=ff).facets

This example shows how to get the number of distinct protein sequences in the PDB archive.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import TerminalFilter, GroupFilter, FilterFacet, Facet

tf1 = TerminalFilter(
    attribute="rcsb_polymer_entity_group_membership.aggregation_method",
    operator="exact_match",
    value="sequence_identity"
)
tf2 = TerminalFilter(
    attribute="rcsb_polymer_entity_group_membership.similarity_cutoff",
    operator="equals",
    value=100)
gf = GroupFilter(logical_operator="and", nodes=[tf1, tf2])
ff = FilterFacet(
    filter=gf,
    facets=Facet(
        "Distinct Protein Sequence Count",
        "cardinality",
        "rcsb_polymer_entity_group_membership.group_id"
    )
)
q = AttributeQuery(
    attribute="rcsb_assembly_info.polymer_entity_count",
    operator="equals",
    value=1,
)
q(return_type="polymer_entity", facets=ff).facets

## GroupBy Example
For more details on arguments to create `RequestOption` objects, see the [API reference](api.rst).

The Sequence Identity and Matching Uniprot Accession examples below are taken from the [Search API Documentation](https://search.rcsb.org/#group-by-return-type).

### Matching Deposit Group ID
Aggregation method `matching_deposit_group_id` groups on the basis of a common identifier for a group of entries deposited as a collection.

This example searches for entries associated with "interleukin" from humans with investigational or experimental drugs bound.
Since `group_by_return_type` is specified as `representatives`, one representative structure per group is returned.

In [None]:
from rcsbapi.search import AttributeQuery, TextQuery
from rcsbapi.search import search_attributes as attrs
from rcsbapi.search import GroupBy

q1 = TextQuery("interleukin")
q2 = attrs.rcsb_entity_source_organism.scientific_name == "Homo sapiens"
q3 = attrs.drugbank_info.drug_groups == "investigational"
q4 = attrs.drugbank_info.drug_groups == "experimental"

query = q1 & q2 & (q3 | q4)
list(
    query(
        group_by=GroupBy(aggregation_method="matching_deposit_group_id"),
        # "representatives" means that only a single search hit is returned per group
        group_by_return_type="representatives"
    )
)

### Sequence Identity
Aggregation method `sequence_identity` is used to group search hits on the basis of protein sequence clusters that meet a predefined identity threshold.

This example groups together identical human sequences from high-resolution (1.0-2.0Å) structures determined by X-ray crystallography. Among the resulting groups, there is a cluster of human glutathione transferases in complex with different substrates.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import search_attributes as attrs
from rcsbapi.search import GroupBy, RankingCriteriaType

q1 = attrs.rcsb_entity_source_organism.taxonomy_lineage.name == "Homo sapiens"
q2 = attrs.exptl.method == "X-RAY DIFFRACTION"
q3 = attrs.rcsb_entry_info.resolution_combined >= 1
q4 = attrs.rcsb_entry_info.resolution_combined <= 2

query = q1 & q2 & q3 & q4

list(query(
    # "sequence_identity" aggregation method must use return_type "polymer_entity"
    # If it is not, return_type will be changed and a warning will be raised.
    return_type="polymer_entity",
    group_by=GroupBy(
        aggregation_method="sequence_identity",
        similarity_cutoff=100,  # 100, 95, 90, 70, 50, or 30
        ranking_criteria_type=RankingCriteriaType(
                sort_by="entity_poly.rcsb_sample_sequence_length",
                direction="desc"
        )
    ),
    group_by_return_type="groups"  # divide into groups returned with all associated hits
))
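
The difference between the two `group_by_return_type` values can be sketched on plain data. The hits below are hypothetical placeholders listed in ranked order, and taking the first-ranked member as the representative is an assumption made for this illustration, not a statement about how the server chooses.

```python
# Hypothetical ranked hits: (group ID, polymer entity ID).
ranked_hits = [
    ("group_A", "AAAA_1"),
    ("group_B", "BBBB_1"),
    ("group_A", "CCCC_1"),
]

# "groups": every hit is kept, organised under its group.
groups = {}
for group_id, hit_id in ranked_hits:
    groups.setdefault(group_id, []).append(hit_id)

# "representatives": a single hit per group (assumed here to be the first-ranked one).
representatives = {group_id: members[0] for group_id, members in groups.items()}

print(groups)
print(representatives)
```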

### Matching Uniprot Accession
This example demonstrates how to use `matching_uniprot_accession` grouping to get distinct Spike protein S1 proteins released from the beginning of 2020. Here, all entities are represented by distinct groups of SARS-CoV, SARS-CoV-2 and Pangolin coronavirus spike proteins.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import search_attributes as attrs
from rcsbapi.search import GroupBy, RankingCriteriaType

q1 = AttributeQuery(
    attribute="rcsb_polymer_entity.pdbx_description",
    operator="contains_phrase",
    value="Spike protein S1"
)
q2 = attrs.rcsb_accession_info.initial_release_date > "2020-01-01"

query = q1 & q2
list(query(
    # "matching_uniprot_accession" aggregation method
    # must use return type "polymer_entity"
    return_type="polymer_entity",
    group_by=GroupBy(
        aggregation_method="matching_uniprot_accession",
        ranking_criteria_type=RankingCriteriaType(
            sort_by="coverage"
        )
    ),
    group_by_return_type="groups"
))

## Sort Example
The `sort` request option can be used to control sorting of results. By default, results are sorted by `score` in descending order.
You can also sort by attribute name and apply filters.

Example from [RCSB PDB Search API](https://search.rcsb.org/#sorting) page.

In [None]:
from rcsbapi.search import AttributeQuery
from rcsbapi.search import Sort

query = AttributeQuery(
    attribute="struct.title",
    operator="contains_phrase",
    value="hiv protease",
)

list(query(
    sort=Sort(
        sort_by="rcsb_accession_info.initial_release_date",
        direction="desc"
    )
))