# Tutorial on accessing, filtering, and downloading structures from the Protein Data Bank

Sometimes one needs to download many structures from the Protein Data Bank (PDB) based on some criteria. One option is to go to the PDB webpage and filter the structures manually. However, this procedure can be automated.

Here we will see how to access structures from the Protein Data Bank (PDB) with Python. The good thing about using Python for this task instead of going to the webpage yourself is that the approach is repeatable and all steps are documented. The next time you need to perform the same task, you simply rerun the code.

The approach presented here may not be the best, but it is what I use so far. You may modify it for your needs.

**Comment:** There are different packages one can use to download structures from the PDB. I know of Bio.PDB, pypdb, and PyMOL.
Here I use Bio.PDB for downloading and parsing (and biotite for the search). However, if you plan to perform further analysis on the structures, I would recommend the PyMOL package.

Let's get started. 


## Preparation

Before we run any command in this Jupyter Notebook, we need to install all the necessary packages.
Run the following conda commands in your terminal to install them:

*Optional.* Create a new conda environment to prevent conflicts with existing Python packages. Here we create an environment called 'pdb'.

    conda create --name pdb
    conda activate pdb
    
*Necessary.* Install the necessary packages.

    conda install jupyter notebook
    conda install -c conda-forge biopython
    conda install -c conda-forge biotite
    conda install -c conda-forge tqdm

After you have successfully installed the packages above, proceed with this notebook.

## Import libraries

In [6]:
# here we import necessary packages to the current session
import biotite.database.rcsb as rcsb
from Bio.PDB import PDBList
from Bio.PDB import PDBParser
from tqdm.auto import tqdm
import datetime

*Comment*. If you are using Python 3.9, the first run of the above cell will give you an error:

    'ImportError: cannot import name 'gcd' from 'fractions' (/home/dstepanenko/anaconda3/envs/my_tutorial/lib/python3.9/fractions.py)'

Just rerun the cell one more time; the error should go away.

## Main part

After importing the necessary packages, we start working on our main goal.

**Firstly**, we get the list of available structures based on uniprot_id and other filter criteria. One can set different filter criteria based on one's interests and aims: rcsb.FieldQuery() accepts many different fields and comparison operators.

Here I select SARS-CoV-2 spike protein structures with a resolution no worse than 4 Å and a minimum molecular weight of 400 kDa. The weight filter keeps only full spike structures and gets rid of structures that contain only the Receptor Binding Domain (RBD).

In [7]:
def get_pdbs_ids(uniprot_id="P0DTC2", min_weight=400, max_resolution=4.0):
    """
    Return the list of PDB IDs for the given uniprot_id that satisfy the
    weight and resolution filters.

    Defaults: uniprot_id of the SARS-CoV-2 spike, min_weight = 400 and
    max_resolution = 4.0 (in angstroms).
    min_weight is the minimum mass of the structure in kDa, used to get rid
    of RBD-only structures; the spike mass is 429 kDa.
    """
    query_by_uniprot_id = rcsb.FieldQuery(
        "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession",
        exact_match=uniprot_id,
    )
    today = datetime.datetime.now()
    print(
        f"Number of structures with defined uniprot_id on {today.year}-{today.month}-{today.day}: {rcsb.count(query_by_uniprot_id)}"
    )

  
    query_by_resolution = rcsb.FieldQuery(
        "rcsb_entry_info.resolution_combined", less_or_equal=max_resolution
    )
    print(
        f"Number of structures with resolution less than or equal to {max_resolution}: {rcsb.count(query_by_resolution)}"
    )


    # greater_or_equal matches the "more than or equal to" message printed below
    query_by_polymer_weight = rcsb.FieldQuery(
        "rcsb_entry_info.molecular_weight", greater_or_equal=min_weight
    )
    print(
        f"Number of structures with mass more than or equal to {min_weight}: {rcsb.count(query_by_polymer_weight)}"
    )


    query = rcsb.CompositeQuery(
        [
            query_by_uniprot_id,
            query_by_resolution,
            query_by_polymer_weight,
        ],
        "and",
    )
    pdb_ids = rcsb.search(query)
    print(f"Number of spike matches: {len(pdb_ids)}")
    print("Selected PDB IDs:")
    print(*pdb_ids)
    return pdb_ids

#test
#pdb_ids = get_pdbs_ids()
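
The PDB grows over time, so the same query can return a different list on another day, which is why the function above prints the date. As a minimal sketch, the selected IDs could be written to a dated text file so that a later session can reuse exactly the same snapshot. The helper names save_pdb_ids and read_pdb_ids and the file name pattern are hypothetical; they are not part of biotite or Bio.PDB.

```python
import datetime
from pathlib import Path


def save_pdb_ids(pdb_ids, directory=".", prefix="pdb_ids"):
    """Write one PDB ID per line to '<prefix>_<YYYY-MM-DD>.txt' and return the path."""
    stamp = datetime.date.today().isoformat()
    path = Path(directory) / f"{prefix}_{stamp}.txt"
    path.write_text("\n".join(pdb_ids) + "\n")
    return path


def read_pdb_ids(path):
    """Read the IDs back from a file written by save_pdb_ids, skipping empty lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, one could call save_pdb_ids(pdb_ids) right after the search and read_pdb_ids(path) before the download step.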


**Secondly**, we download the PDB files of interest to the local computer.

In [8]:
def pdbs_download(pdb_ids):
    '''
    Take a list of pdb_ids and download all coordinates to the directory 'PDB' in pdb format.
    Some structures are released in .cif format instead of pdb.
    Remove 'file_format' in 'pdbl.retrieve_pdb_file' to download structures in .cif format.
    '''
    pdbl = PDBList()
    for pdb_id in tqdm(pdb_ids):
        pdbl.retrieve_pdb_file(pdb_id, pdir='PDB', file_format='pdb')

#test
#pdbs_download(['6xm0', '7cak'])        
#pdbs_download(pdb_ids)
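
Since some structures are not available in pdb format, it can be useful to check afterwards which IDs have no file in the 'PDB' directory. The sketch below assumes the file naming used by the loading step of this tutorial ('pdb' + ID + '.ent') and that the ID in the file name is lowercase, as in the '6xm0' example. The function names expected_pdb_path and missing_downloads are hypothetical helpers, not part of Bio.PDB.

```python
from pathlib import Path


def expected_pdb_path(pdb_id, pdir="PDB"):
    """Path where a pdb-format download is expected: <pdir>/pdb<lowercase id>.ent."""
    return Path(pdir) / f"pdb{pdb_id.lower()}.ent"


def missing_downloads(pdb_ids, pdir="PDB"):
    """Return the IDs (in input order) that have no pdb-format file in pdir."""
    return [
        pdb_id
        for pdb_id in pdb_ids
        if not expected_pdb_path(pdb_id, pdir).is_file()
    ]
```

The returned IDs are candidates for a second download attempt in .cif format.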

**Thirdly**, we load the coordinates into the current Python session if necessary. We can also get some detailed information about the structure, such as the authors, the paper title, and so on.


In [9]:
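def pdbs_load(pdb_id):
    """
    Load the coordinates of one structure from ./PDB/ into this notebook for further work.
    """
    # For .cif files, use MMCIFParser instead (requires: from Bio.PDB import MMCIFParser)
    #parser = MMCIFParser(QUIET = True)
    #structure = parser.get_structure(pdb_id, './PDB/' + pdb_id + '.cif')
    parser = PDBParser(QUIET=True)  # for files in pdb format
    # pdb-format files are saved as 'pdb<lowercase id>.ent', so lowercase the ID
    structure = parser.get_structure(pdb_id, './PDB/pdb' + pdb_id.lower() + '.ent')
    return structure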


#test
#structure = pdbs_load('6xm0')
#print(structure.header.keys())
#print(structure.header['idcode'])
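
Once a structure is loaded, a quick summary of its chains is often a useful first check. The sketch below assumes that the object returned by pdbs_load behaves like a Bio.PDB structure: iterating over it yields models, iterating over a model yields chains, each chain has an id attribute, and len(chain) is its number of residues. The function name chain_summary is a hypothetical helper.

```python
def chain_summary(structure):
    """Return {chain_id: residue_count} for the first model of a structure.

    The count is simply len(chain), so it includes every residue stored in
    the chain object, not only standard amino acids.
    """
    first_model = next(iter(structure))
    return {chain.id: len(chain) for chain in first_model}


# Usage (after the download step):
# summary = chain_summary(pdbs_load('6xm0'))
# print(summary)
```

Only the first model is inspected here, which is a design choice for brevity.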

We can combine the three functions above and run them one after another.

In [10]:
def main():
    pdb_ids = get_pdbs_ids()
    pdbs_download(pdb_ids)
    structure = pdbs_load('6xm0')
    print(structure.header.keys())
    return 'done'

main()

Number of structures with defined uniprot_id on 2022-3-11: 808
Number of structures with resolution less than or equal to 4.0: 170161
Number of structures with mass more than or equal to 400: 8232
Number of spike matches: 425
Selected PDB IDs:
6VSB 6VXX 6VYB 6WPS 6WPT 6X29 6X2A 6X2B 6X2C 6X6P 6X79 6XCM 6XCN 6XEY 6XF5 6XF6 6XKL 6XLU 6XM0 6XM3 6XM4 6XM5 6XR8 6XRA 6XS6 6Z43 6Z97 6ZB4 6ZB5 6ZDH 6ZGE 6ZGG 6ZGI 6ZHD 6ZOW 6ZOX 6ZOY 6ZOZ 6ZP0 6ZP1 6ZP5 6ZP7 6ZWV 6ZXN 7A25 7A29 7A4N 7A94 7AD1 7AKD 7B18 7BNM 7BNN 7BYR 7C2L 7CAB 7CAC 7CAI 7CAK 7CHH 7CN4 7CT5 7CWL 7CWM 7CWN 7CWS 7CWT 7CWU 7CYP 7CZP 7CZQ 7CZR 7CZS 7CZT 7CZU 7CZV 7CZW 7CZX 7CZY 7CZZ 7D00 7D03 7D0B 7D0C 7D0D 7DDD 7DF3 7DF4 7DK4 7DWY 7DWZ 7DX0 7DX1 7DX2 7DX3 7DX5 7DX6 7DX7 7DX8 7DX9 7DZW 7DZX 7DZY 7E3K 7E3L 7E5R 7E5S 7E7B 7E7D 7E8C 7E9O 7E9Q 7EAZ 7EB0 7EB3 7EB4 7EB5 7EDF 7EDG 7EDH 7EDI 7EDJ 7EH5 7EJ4 7EJ5 7FAE 7FAF 7FCD 7FCE 7FET 7JJI 7JV4 7JV6 7JVC 7JWB 7JWY 7JZL 7JZN 7K43 7K4N 7K8S 7K8T 7K8U 7K8V 7K8W 7K8X 7K8Z 7K90 7K9H 7K9J 7KDG 7

  0%|          | 0/425 [00:00<?, ?it/s]

Downloading PDB structure '6VSB'...
Downloading PDB structure '6VXX'...
Downloading PDB structure '6VYB'...
Downloading PDB structure '6WPS'...
Downloading PDB structure '6WPT'...
Downloading PDB structure '6X29'...
Downloading PDB structure '6X2A'...
Downloading PDB structure '6X2B'...
Downloading PDB structure '6X2C'...
Downloading PDB structure '6X6P'...
Downloading PDB structure '6X79'...
Downloading PDB structure '6XCM'...
Downloading PDB structure '6XCN'...
Downloading PDB structure '6XEY'...
Downloading PDB structure '6XF5'...
Downloading PDB structure '6XF6'...
Downloading PDB structure '6XKL'...
Downloading PDB structure '6XLU'...
Downloading PDB structure '6XM0'...
Downloading PDB structure '6XM3'...
Downloading PDB structure '6XM4'...
Downloading PDB structure '6XM5'...
Downloading PDB structure '6XR8'...
Desired structure doesn't exists
Downloading PDB structure '6XRA'...
Downloading PDB structure '6XS6'...
Downloading PDB structure '6Z43'...
Downloading PDB structure '6Z97

Downloading PDB structure '7M0J'...
Downloading PDB structure '7M6E'...
Downloading PDB structure '7M6F'...
Downloading PDB structure '7M6G'...
Downloading PDB structure '7M6H'...
Downloading PDB structure '7M6I'...
Downloading PDB structure '7MJG'...
Downloading PDB structure '7MJH'...
Downloading PDB structure '7MJJ'...
Downloading PDB structure '7MJK'...
Downloading PDB structure '7MJM'...
Downloading PDB structure '7MKL'...
Downloading PDB structure '7MM0'...
Downloading PDB structure '7MTC'...
Downloading PDB structure '7MTD'...
Downloading PDB structure '7MTE'...
Downloading PDB structure '7MY2'...
Downloading PDB structure '7MY3'...
Downloading PDB structure '7N0G'...
Downloading PDB structure '7N0H'...
Downloading PDB structure '7N1Q'...
Downloading PDB structure '7N1T'...
Downloading PDB structure '7N1U'...
Downloading PDB structure '7N1V'...
Downloading PDB structure '7N1W'...
Downloading PDB structure '7N1X'...
Downloading PDB structure '7N5H'...
Downloading PDB structure '7

'done'

That is it. Happy coding!