# pypdb demos

This is a set of basic examples showing the usage and outputs of the individual functions included in pypdb. There are generally two types of functions:

+ Functions that perform searches and return lists of PDB IDs
+ Functions that get information about specific PDB IDs

The list of supported search types, and of the kinds of information that can be returned for a given PDB ID, is large (and growing); it is enumerated completely in the docstrings of pypdb.py. The PDB supports a very wide range of queries, so an option that is not yet available can likely be implemented with little effort by following the structure of the query types that have already been implemented. I appreciate any feedback and pull requests.

**Another notebook in this directory, advanced_demos.ipynb, includes more in-depth uses of multiple functions, including the tutorial on graphing the popularity of CRISPR that was originally included in this notebook.**

### Preamble

In [1]:
%pylab inline
from IPython.display import HTML

## Import from local directory
import sys
sys.path.insert(0, '../pypdb')
from pypdb import *

## Import from installed package
# from pypdb import *

import pprint

Populating the interactive namespace from numpy and matplotlib


# 1. Search functions that return lists of PDB IDs

### Get a list of PDBs for a specific search term

In [2]:
search_dict = make_query('actin network')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1D7M', '3W3D']


### Search by PubMed ID Number

In [3]:
search_dict = make_query('27499440','PubmedIdQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['5IMT', '5IMW', '5IMY']


### Search by a specific modified structure

In [4]:
search_dict = make_query('3W3D',querytype='ModifiedStructuresQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['2M53', '2RKZ', '3CAL', '4YWF', '5CCF', '5CK7', '5DIR', '5DJL', '5DJM', '5FOV', '5FOW', '5IQ5', '5IZJ', '5J5X', '5LQG', '5MMK', '5MML', '5MTA', '5MTG', '5NYS', '5NYT', '5NYU', '5OPH', '5T7R', '5XO0', '5XO1', '5YTS', '5YTT', '5YTV', '5YTX', '5YV8', '5YV9', '5YVA', '5YVB', '5YVC', '5ZKI', '5ZKJ', '5ZYX', '6AE8', '6AI0', '6AI1', '6AI2', '6AI3', '6BBT', '6BBW', '6BO4', '6BO5', '6BOF', '6BQT', '6BRB', '6BRJ', '6BSD', '6DZT', '6E0C', '6E0P', '6E55', '6E66', '6EUS', '6FMS', '6FN8', '6FOI', '6FOL', '6FON', '6FP6', '6G4B', '6G4C', '6G4D', '6G4E', '6G4F', '6GC2', '6GHB', '6GHO', '6GIV', '6GL4', '6GUA', '6GVK', '6GVL', '6GVX', '6GZV', '6H5T', '6H86', '6HAX', '6HAY', '6HAZ', '6HG7', '6HPO', '6HR2', '6HWT', '6HWU', '6HWV', '6HYC', '6I1R', '6I1Z', '6I20', '6I21', '6I22', '6I23', '6I24', '6I3M', '6I4X', '6I5J', '6I5N', '6I7T', '6IID', '6J44', '6J6J', '6J6K', '6J7J', '6J7K', '6J7L', '6J7M', '6JKN', '6JM9', '6JMA', '6JP1', '6JVV', '6JVW', '6JYV', '6MHY', '6MJ1', '6MRR', '6MRS', '6MSP', '6MTI', '6MU4',

### Search by Author

In [5]:
search_dict = make_query('Perutz, M.F.',querytype='AdvancedAuthorQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1CQ4', '1FDH', '1GDJ', '1HDA', '1PBX', '2DHB', '2GDM', '2HHB', '2MHB', '3HHB', '4HHB']


### Search by Motif

In [6]:
search_dict = make_query('T[AG]AGGY',querytype='MotifQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['3LEZ', '3SGH:1', '4F47:1']
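
Some of the hits above carry a suffix after a colon (`3SGH:1`), while others are bare IDs. The following is a minimal sketch for separating the two parts before passing IDs to other functions. The helper name `split_hit` is hypothetical and not part of pypdb, and reading the suffix as an entity number is an assumption based only on the output shown.

```python
def split_hit(hit):
    """Split a search hit such as '3SGH:1' into (pdb_id, suffix).

    The suffix is None for bare IDs. Interpreting it as an entity number
    is an assumption; this helper only converts it to an int.
    """
    pdb_id, sep, suffix = hit.partition(':')
    return pdb_id, (int(suffix) if sep else None)


hits = ['3LEZ', '3SGH:1', '4F47:1']  # output of the motif search above
parsed = [split_hit(h) for h in hits]
pdb_ids = [pdb_id for pdb_id, _ in parsed]
print(parsed)
print(pdb_ids)
```

Keeping the suffix as a separate value means the bare four-character IDs can be reused directly as inputs to the per-entry functions.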


### Search by a specific experimental method

In [7]:
search_dict = make_query('SOLID-STATE NMR',querytype='ExpTypeQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1CEK', '1EQ8', '1M8M', '1MAG', '1MP6', '1MZT', '1NH4', '1NYJ', '1PI7', '1PI8', '1PJD', '1PJE', '1PJF', '1Q7O', '1RVS', '1XSW', '1ZN5', '1ZY6', '2C0X', '2CZP', '2E8D', '2H3O', '2H95', '2JSV', '2JU6', '2JZZ', '2K0P', '2KAD', '2KB7', '2KHT', '2KIB', '2KJ3', '2KLR', '2KQ4', '2KQT', '2KRJ', '2KSJ', '2KWD', '2KYV', '2L0J', '2L3Z', '2LBU', '2LEG', '2LGI', '2LJ2', '2LME', '2LMN', '2LMO', '2LMP', '2LMQ', '2LNL', '2LNQ', '2LNY', '2LPZ', '2LTQ', '2LU5', '2M02', '2M3B', '2M3G', '2M4J', '2M5K', '2M5M', '2M5N', '2M67', '2MC7', '2MCU', '2MCV', '2MCW', '2MCX', '2MEX', '2MJZ', '2MME', '2MMU', '2MPX', '2MPZ', '2MS7', '2MSG', '2MTZ', '2MVX', '2MXU', '2N0A', '2N0R', '2N1E', '2N1F', '2N28', '2N3D', '2N70', '2N7H', '2NNT', '2RLZ', '2UVS', '2W0N', '2XKM', '3J07', '3ZPK', '5IRT', '5JR0', '5JXV', '5JZR', '5KK3', '5LCB', '5MWV', '5UGK', '5UK6', '5V7Z', '5W3N', '6DLN', '6EKA', '6EWU', '6F3K', '6F3V', '6F3W', '6F3X', '6F3Y', '6GVT', '6NZN', '6OC9', '6QEB']


### Search by whether it has free ligands

In [8]:
search_dict = make_query('', querytype='NoLigandQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs[:10])

['100D', '101D', '101M', '102D', '102L', '102M', '103L', '103M', '104M', '105M']


### Search by protein symmetry group

In [9]:
kk = do_protsym_search('C9', min_rmsd=0.0, max_rmsd=1.0)
print(kk[:5])

['1KZU', '1NKZ', '2FKW', '3B8M', '3B8N']


# Information Search functions

While the basic functions described in the previous section are useful for looking up individual entries, the functions in this section are intended to be more user-facing: they take search keywords and return lists of authors, papers, or dates.

### Find most common authors for a given keyword

In [10]:
top_authors = find_authors('crispr', max_results=100)
pprint.pprint(top_authors[:5])

['Doudna, J.A.', 'Jinek, M.', 'Li, H.', 'Nam, K.H.', 'Ke, A.']


### Find papers for a given keyword

In [11]:
matching_papers = find_papers('crispr',max_results=3)
pprint.pprint(matching_papers)

['Crystal structure of a CRISPR-associated protein from thermus thermophilus',
 'Crystal structure of hypothetical protein sso1404 from Sulfolobus '
 'solfataricus P2',
 'NMR solution structure of a CRISPR repeat binding protein']


# 2. Functions that return information about single PDB entries

### Get the full PDB file

In [14]:
pdb_file = get_pdb_file('4lza', filetype='cif', compression=False)
print(pdb_file[:400])

data_4LZA
# 
_entry.id   4LZA 
# 
_audit_conform.dict_name       mmcif_pdbx.dic 
_audit_conform.dict_version    5.281 
_audit_conform.dict_location   http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 
# 
loop_
_database_2.database_id 
_database_2.database_code 
PDB   4LZA         
RCSB  RCSB081269   
WWPDB D_1000081269 
# 
_pdbx_database_related.db_name        TargetTrack 
_pdbx_database_rela


### Get a general description of the entry's metadata

In [15]:
describe_pdb('4lza')

{'structureId': '4LZA',
 'title': 'Crystal structure of adenine phosphoribosyltransferase from Thermoanaerobacter pseudethanolicus ATCC 33223, NYSGRC Target 029700.',
 'expMethod': 'X-RAY DIFFRACTION',
 'resolution': '1.84',
 'keywords': 'TRANSFERASE',
 'nr_entities': '1',
 'nr_residues': '390',
 'nr_atoms': '2681',
 'deposition_date': '2013-07-31',
 'release_date': '2013-08-14',
 'last_modification_date': '2013-08-14',
 'structure_authors': 'Malashkevich, V.N., Bhosle, R., Toro, R., Hillerich, B., Gizzi, A., Garforth, S., Kar, A., Chan, M.K., Lafluer, J., Patel, H., Matikainen, B., Chamala, S., Lim, S., Celikgil, A., Villegas, G., Evans, B., Love, J., Fiser, A., Khafizov, K., Seidel, R., Bonanno, J.B., Almo, S.C., New York Structural Genomics Research Consortium (NYSGRC)',
 'citation_authors': 'Malashkevich, V.N., Bhosle, R., Toro, R., Hillerich, B., Gizzi, A., Garforth, S., Kar, A., Chan, M.K., Lafluer, J., Patel, H., Matikainen, B., Chamala, S., Lim, S., Celikgil, A., Villegas, G., 
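
Every value in the dictionary above is a string, including the numeric fields and dates. The sketch below shows one way to convert a few of them before doing arithmetic; the `desc` literal is a hand-copied subset of the output above, so the example runs without a network call.

```python
from datetime import date

# Subset of the describe_pdb('4lza') output shown above (all values are strings).
desc = {
    'structureId': '4LZA',
    'resolution': '1.84',
    'nr_residues': '390',
    'nr_atoms': '2681',
    'deposition_date': '2013-07-31',
    'release_date': '2013-08-14',
}

resolution = float(desc['resolution'])
n_atoms = int(desc['nr_atoms'])

# The dates are in ISO format, so date.fromisoformat can parse them directly.
deposited = date.fromisoformat(desc['deposition_date'])
released = date.fromisoformat(desc['release_date'])
days_to_release = (released - deposited).days

print(resolution, n_atoms, days_to_release)
```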

### Get all of the information deposited in a PDB entry

In [16]:
all_info = get_all_info('4lza')
print(all_info)

{'polymer': {'@entityNr': '1', '@length': '195', '@type': 'protein', '@weight': '22023.7', 'chain': [{'@id': 'A'}, {'@id': 'B'}], 'Taxonomy': {'@name': 'Thermoanaerobacter pseudethanolicus', '@id': '496866'}, 'synonym': {'@name': 'APRT'}, 'macroMolecule': {'@name': 'Adenine phosphoribosyltransferase', 'accession': {'@id': 'B0K969'}}, 'polymerDescription': {'@description': 'Adenine phosphoribosyltransferase'}, 'enzClass': {'@ec': '2.4.2.7'}}, 'id': '4LZA'}


In [17]:
results = get_all_info('2F5N')
first_polymer = results['polymer'][0]
first_polymer['polymerDescription']

{'@description': "5'-D(*AP*GP*GP*TP*AP*GP*AP*CP*CP*TP*GP*GP*AP*CP*GP*C)-3'"}
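
Note that the two calls above return different shapes: for 4lza the `'polymer'` value is a single dict, while for 2F5N it is a list that can be indexed. A minimal sketch of a normalizing helper follows; `as_list` is a hypothetical name, not a pypdb function, and the two dictionaries are trimmed stand-ins with placeholder entries rather than full results.

```python
def as_list(value):
    """Return value unchanged if it is a list, wrap it otherwise; None -> []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Trimmed stand-ins for the two shapes seen above (placeholder contents).
single = {'polymer': {'@entityNr': '1', '@type': 'protein'}, 'id': '4LZA'}
multi = {'polymer': [{'@entityNr': '1'}, {'@entityNr': '2'}], 'id': '2F5N'}

for info in (single, multi):
    polymers = as_list(info['polymer'])
    print(info['id'], len(polymers), polymers[0]['@entityNr'])
```

With this wrapper, code that loops over polymers does not need a separate branch for single-entity entries.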

### Run a BLAST search on an entry

There are two options here. The first function, get_blast(), returns a dict() like most of the functions in this section, but the metadata associated with a BLAST search makes that dictionary deeply nested. A simpler function, get_blast2(), parses the text of the raw output page and returns a tuple consisting of (1) a ranked list of the PDB IDs that were hits and (2) a list of the actual BLAST alignments and similarity scores.

In [18]:
blast_results = get_blast('2F5N', chain_id='A')
just_hits = blast_results['BlastOutput_iterations']['Iteration']['Iteration_hits']['Hit']
print(just_hits[50]['Hit_hsps']['Hsp']['Hsp_hseq'])

PELPEVETVRRELEKRIVGQKIISIEATYPRMVL--TGFEQLKKELTGKTIQGISRRGKYLIFEIGDDFRLISHLRMEGKYRLATLDAPREKHDHLTMKFADG-QLIYADVRKFGTWELISTDQVLPYFLKKKIGPEPTYEDFDEKLFREKLRKSTKKIKPYLLEQTLVAGLGNIYVDEVLWLAKIHPEKETNQLIESSIHLLHDSIIEILQKAIKLGGSSIRTY-SALGSTGKMQNELQVYGKTGEKCSRCGAEIQKIKVAGRGTHFCPVCQQ
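
Chained lookups like the one above raise an exception as soon as one key or index is missing, for example when a search returns fewer than 51 hits. The sketch below shows one defensive way to walk the nested result. `dig` is a hypothetical helper, and the small `blast_results` literal only mimics the key layout used above; its sequence value is a placeholder, not real BLAST output.

```python
def dig(data, *keys, default=None):
    """Follow a path of dict keys and list indices; return default on a miss."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Stand-in with the same key layout as the get_blast() result above.
blast_results = {
    'BlastOutput_iterations': {
        'Iteration': {
            'Iteration_hits': {
                'Hit': [{'Hit_hsps': {'Hsp': {'Hsp_hseq': 'PLACEHOLDER'}}}]
            }
        }
    }
}

hit_path = ('BlastOutput_iterations', 'Iteration', 'Iteration_hits', 'Hit')
first = dig(blast_results, *hit_path, 0, 'Hit_hsps', 'Hsp', 'Hsp_hseq')
missing = dig(blast_results, *hit_path, 50, 'Hit_hsps', 'Hsp', 'Hsp_hseq')
print(first)
print(missing)
```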


In [19]:
blast_results = get_blast2('2F5N', chain_id='A', output_form='HTML')
print('Total Results: ' + str(len(blast_results[0])) +'\n')
pprint.pprint(blast_results[1][0])

Total Results: 94

<pre>
&gt;<a name="45278"></a>2F5P:3:A|pdbid|entity|chain(s)|sequence
          Length = 274

 Score =  545 bits (1404), Expect = e-155,   Method: Composition-based stats.
 Identities = 274/274 (100%), Positives = 274/274 (100%)

Query: 1   MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK 60
           MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK
Sbjct: 1   MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK 60

Query: 61  FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY 120
           FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY
Sbjct: 61  FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY 120

Query: 121 AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES 180
           AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES
Sbjct: 121 AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES 180

Query: 181 LFRAGILPGRPAASLSSKEIERLHEEMVATIGEAVMKGGSTVRTYVNTQGEAGTFQHHLY 240
  

### Get PFAM information about an entry

In [20]:
pfam_info = get_pfam('2LME')
print(pfam_info)

{'pfamHit': {'@structureId': '2LME', '@chainId': 'A', '@pdbResNumStart': '46', '@pdbResNumEnd': '105', '@pfamAcc': 'PF03895.14', '@pfamName': 'YadA_anchor', '@pfamDesc': 'YadA-like membrane anchor domain', '@eValue': '4.7E-16'}}
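
The residue numbers and E-value in the result above are strings. As a rough sketch, they can be converted like this; the dictionary literal is copied from the output above, and the code assumes a single `'pfamHit'` dict and contiguous residue numbering, neither of which is guaranteed for other entries.

```python
# Copied from the get_pfam('2LME') output above.
pfam_info = {'pfamHit': {'@structureId': '2LME', '@chainId': 'A',
                         '@pdbResNumStart': '46', '@pdbResNumEnd': '105',
                         '@pfamAcc': 'PF03895.14', '@pfamName': 'YadA_anchor',
                         '@pfamDesc': 'YadA-like membrane anchor domain',
                         '@eValue': '4.7E-16'}}

hit = pfam_info['pfamHit']
start = int(hit['@pdbResNumStart'])
end = int(hit['@pdbResNumEnd'])
span = end - start + 1          # inclusive span of residue numbers
e_value = float(hit['@eValue'])  # float() accepts scientific notation

# Split the accession into the family ID and the version after the dot.
family, _, version = hit['@pfamAcc'].partition('.')

print(hit['@pfamName'], family, version, span, e_value)
```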


### Get chemical info

This function takes the ID of the chemical (for example, NAG), not a PDB ID.

In [21]:
chem_desc = describe_chemical('NAG')
pprint.pprint(chem_desc)

{'describeHet': {'ligandInfo': {'ligand': {'@chemicalID': 'NAG',
                                           '@molecularWeight': '221.208',
                                           '@type': 'D-saccharide',
                                           'InChI': 'InChI=1S/C8H15NO6/c1-3(11)9-5-7(13)6(12)4(2-10)15-8(5)14/h4-8,10,12-14H,2H2,1H3,(H,9,11)/t4-,5-,6-,7-,8-/m1/s1',
                                           'InChIKey': 'OVRNDRQMDRJTHS-FMDGEEDCSA-N',
                                           'chemicalName': 'N-ACETYL-D-GLUCOSAMINE',
                                           'formula': 'C8 H15 N O6',
                                           'smiles': 'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O'}}}}
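
The `'formula'` field above is a space-separated string. A minimal sketch for turning it into per-element counts is shown below; `parse_formula` is a hypothetical helper, and it assumes every token is an element symbol followed by an optional count, raising an error on anything else.

```python
import re


def parse_formula(formula):
    """Parse a formula such as 'C8 H15 N O6' into a dict of element counts."""
    counts = {}
    for token in formula.split():
        match = re.fullmatch(r'([A-Z][a-z]?)(\d*)', token)
        if match is None:
            raise ValueError(f'Unrecognized formula token: {token!r}')
        element, number = match.groups()
        # A missing number means a single atom of that element.
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


nag = parse_formula('C8 H15 N O6')  # formula from the output above
print(nag)
print(sum(nag.values()), 'atoms in total')
```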


### Get ligand info if present


In [22]:
ligand_dict = get_ligands('100D')
pprint.pprint(ligand_dict)

{'id': '100D',
 'ligandInfo': {'ligand': {'@chemicalID': 'SPM',
                           '@molecularWeight': '202.34',
                           '@structureId': '100D',
                           '@type': 'non-polymer',
                           'InChI': 'InChI=1S/C10H26N4/c11-5-3-9-13-7-1-2-8-14-10-4-6-12/h13-14H,1-12H2',
                           'InChIKey': 'PFNFFQXMRSDOHW-UHFFFAOYSA-N',
                           'chemicalName': 'SPERMINE',
                           'formula': 'C10 H26 N4',
                           'smiles': 'C(CCNCCCN)CNCCCN'}}}


### Get gene ontology info

In [23]:
gene_info = get_gene_onto('4Z0L')
pprint.pprint(gene_info['term'][0])

{'@chainId': 'A',
 '@id': 'GO:0001516',
 '@structureId': '4Z0L',
 'detail': {'@definition': 'The chemical reactions and pathways resulting in '
                           'the formation of prostaglandins, any of a group of '
                           'biologically active metabolites which contain a '
                           'cyclopentane ring.',
            '@name': 'prostaglandin biosynthetic process',
            '@ontology': 'B',
            '@synonyms': 'prostaglandin anabolism, prostaglandin biosynthesis, '
                         'prostaglandin formation, prostaglandin synthesis'}}


### Get sequence clusters by chain

In [24]:
sclust = get_seq_cluster('2F5N.A')
pprint.pprint(sclust['pdbChain'][:10]) # Just look at the top 10

[{'@name': '6FL1.A', '@rank': '1'},
 {'@name': '4PD2.A', '@rank': '2'},
 {'@name': '3U6P.A', '@rank': '3'},
 {'@name': '4PCZ.A', '@rank': '4'},
 {'@name': '3GPU.A', '@rank': '5'},
 {'@name': '3JR5.A', '@rank': '6'},
 {'@name': '3SAU.A', '@rank': '7'},
 {'@name': '3GQ4.A', '@rank': '8'},
 {'@name': '1R2Z.A', '@rank': '9'},
 {'@name': '3U6E.A', '@rank': '10'}]
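
Both the chain names and the ranks in the output above are strings, so sorting by `'@rank'` directly would place `'10'` before `'2'`. The sketch below sorts numerically and splits each name into a PDB ID and a chain ID; the sample list reuses four entries from the output above in shuffled order.

```python
# Four entries from the output above, deliberately out of order.
chains = [
    {'@name': '3U6E.A', '@rank': '10'},
    {'@name': '4PD2.A', '@rank': '2'},
    {'@name': '6FL1.A', '@rank': '1'},
    {'@name': '3U6P.A', '@rank': '3'},
]

# Convert the rank to int so that 10 sorts after 2 and 3.
ranked = sorted(chains, key=lambda c: int(c['@rank']))

# 'XXXX.A' -> ('XXXX', 'A')
pairs = [tuple(c['@name'].split('.', 1)) for c in ranked]

# dict.fromkeys removes duplicate IDs while keeping the ranked order.
unique_ids = list(dict.fromkeys(pdb_id for pdb_id, _ in pairs))

print(pairs)
print(unique_ids)
```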


### Get the representative for a chain

In [25]:
clusts = get_clusters('4hhb.A')
print(clusts)

{'pdbChain': {'@name': '2W72.A'}}


### List all taxa associated with a list of IDs

In [26]:
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)
pprint.pprint(list_taxa(crispr_results[:10]))

['Thermus thermophilus',
 'Saccharolobus solfataricus',
 'Hyperthermus butylicus',
 'Pseudomonas phage JBD30',
 'Saccharolobus solfataricus',
 'Pseudomonas aeruginosa',
 'Pseudomonas aeruginosa',
 'Pseudomonas aeruginosa',
 'Saccharolobus solfataricus',
 'Thermus thermophilus']
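
Because the list above has one entry per PDB ID, the same organism can appear several times. A short sketch of tallying it with `collections.Counter` follows; the `taxa` list is copied from the output above.

```python
from collections import Counter

# Copied from the list_taxa output above.
taxa = [
    'Thermus thermophilus',
    'Saccharolobus solfataricus',
    'Hyperthermus butylicus',
    'Pseudomonas phage JBD30',
    'Saccharolobus solfataricus',
    'Pseudomonas aeruginosa',
    'Pseudomonas aeruginosa',
    'Pseudomonas aeruginosa',
    'Saccharolobus solfataricus',
    'Thermus thermophilus',
]

taxa_counts = Counter(taxa)
for name, count in taxa_counts.most_common():
    print(f'{count:2d}  {name}')
```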


### List data types for a list of IDs

In [27]:
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)
pprint.pprint(list_types(crispr_results[:20]))

['protein',
 'protein',
 'protein',
 'protein',
 'protein',
 'rna',
 'rna',
 'rna',
 'protein',
 'rna',
 'rna',
 'rna',
 'protein',
 'protein',
 'protein',
 'protein',
 'protein',
 'protein',
 'rna',
 'protein']
