# CryptoBench tutorial
This Jupyter Notebook shows some basic examples of loading and parsing the CryptoBench dataset. CryptoBench is a dataset of structures containing cryptic binding sites (BSs). One ligand-free (**apo**) structure can be associated with more than one ligand-bound (**holo**) structure, as there can be more than one BS, each associated with a different holo structure (or one BS can bind different ligands).

## Download the dataset
First, download the dataset (either manually [from here](https://osf.io/pz4a9/) or using the commands below).

In [None]:
!wget -O cryptobench.zip https://files.de-1.osf.io/v1/resources/pz4a9/providers/osfstorage/?zip= --no-check-certificate
!unzip cryptobench.zip
!rm cryptobench.zip

## Initialize the notebook

In [1]:
import json
from typing import TypeAlias, Dict, List

CRYPTOBENCH_PATH = './cryptobench'

JSON: TypeAlias = dict[str, "JSON"] | list["JSON"] | str | int | float | bool | None

## HOW TO: Load dataset and get APO binding residues

In [12]:
def load_dataset(path: str) -> JSON:
    """Loads dataset from JSON file.

    Args:
        path (str): Path to JSON file

    Returns:
        JSON: dataset
    """
    with open(path) as f:
        dataset = json.load(f)
    return dataset


def get_apo_binding_residues(dataset: JSON) -> Dict[str, set[str]]:
    """Collects binding residues for each APO structure in the dataset. As an APO structure can be associated with more than one HOLO structure, you need to loop over all of its HOLO structures to collect every binding residue.

    Args:
        dataset (JSON): dataset

    Returns:
        Dict[str, set[str]]: apo_pdb_id is key, set of binding residues ("{auth_asym_id}_{auth_seq_id}") is value
    """
    apo_residues = {}
    for apo_pdb_id, holo_structures in dataset.items():
        apo_residues[apo_pdb_id] = set()
        for holo_structure in holo_structures:
            apo_residues[apo_pdb_id].update(holo_structure['apo_pocket_selection'])
    return apo_residues


# you can load the whole dataset or any of the subsets, for example 'cryptobench-dataset/folds/test.json'
dataset = load_dataset(f'{CRYPTOBENCH_PATH}/cryptobench-dataset/dataset.json')

apo_binding_residues = get_apo_binding_residues(dataset)
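
The functions above imply a layout in which each APO PDB ID maps to a list of HOLO records. As a minimal sketch of that layout, the toy dataset below counts the records per APO structure. The PDB IDs and residues are invented, the helper `summarize_records` is hypothetical, and the example assumes that one HOLO ID may appear in more than one record; only the field names are taken from this notebook.

```python
# Hypothetical toy dataset: the PDB IDs and residues below are invented.
toy_dataset = {
    "1abc": [
        {"holo_pdb_id": "2xyz", "apo_pocket_selection": ["A_10", "A_11"]},
        {"holo_pdb_id": "2xyz", "apo_pocket_selection": ["A_40", "A_41"]},
        {"holo_pdb_id": "3xyz", "apo_pocket_selection": ["A_11", "A_42"]},
    ],
    "4def": [
        {"holo_pdb_id": "5uvw", "apo_pocket_selection": ["B_7"]},
    ],
}


def summarize_records(dataset: dict) -> dict[str, tuple[int, int]]:
    """Map each APO ID to (number of HOLO records, number of distinct HOLO IDs)."""
    summary = {}
    for apo_pdb_id, holo_records in dataset.items():
        distinct_holo_ids = {record["holo_pdb_id"] for record in holo_records}
        summary[apo_pdb_id] = (len(holo_records), len(distinct_holo_ids))
    return summary


toy_summary = summarize_records(toy_dataset)
print(toy_summary)  # {'1abc': (3, 2), '4def': (1, 1)}
```

The same helper should work on the loaded `dataset`, provided every record carries a `holo_pdb_id` field as the rest of this notebook assumes.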

## HOW TO: Show cryptic binding residues for Cobyrinic acid a,c diamide synthase (PDB ID: 4pfs)
This enzyme is involved in the biosynthesis of cobalamin (vitamin B12) in anaerobic bacteria. The binding residues have the format `"{auth_asym_id}_{auth_seq_id}"`. The `auth_asym_id` is necessary because some BSs might stretch over multiple chains.

In [22]:
APO_STRUCTURE_OF_INTEREST = '4pfs'
apo_binding_residues[APO_STRUCTURE_OF_INTEREST]

{'B_135',
 'B_19',
 'B_192',
 'B_20',
 'B_21',
 'B_22',
 'B_220',
 'B_221',
 'B_222',
 'B_223',
 'B_23',
 'B_24',
 'B_25',
 'B_26',
 'B_48'}
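
Because each residue is stored as `"{auth_asym_id}_{auth_seq_id}"`, it is often handy to split the identifiers and group them by chain. The sketch below is one possible way to do this. The helpers `split_residue_id` and `group_by_chain` are hypothetical, and the code assumes that the chain ID itself contains no underscore. The residue number is kept as a string, since an `auth_seq_id` is not guaranteed to be a plain integer.

```python
from collections import defaultdict


def split_residue_id(residue_id: str) -> tuple[str, str]:
    """Split '{auth_asym_id}_{auth_seq_id}' at the first underscore."""
    chain, _, seq_id = residue_id.partition("_")
    return chain, seq_id


def group_by_chain(residue_ids) -> dict[str, list[str]]:
    """Group residue numbers by chain, preserving the input order."""
    by_chain = defaultdict(list)
    for residue_id in residue_ids:
        chain, seq_id = split_residue_id(residue_id)
        by_chain[chain].append(seq_id)
    return dict(by_chain)


# A list keeps the order deterministic; a set would work as well.
example_groups = group_by_chain(["B_19", "B_135", "A_7"])
print(example_groups)  # {'B': ['19', '135'], 'A': ['7']}
```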

## HOW TO: Get `main` HOLO structures
When an analysis has to be performed on HOLO structures, it might be difficult to pick the right HOLO structure for each APO structure. In our CryptoBench manuscript, we took the following approach when selecting representative HOLO structures for the performance evaluation of P2Rank:

Each APO structure is associated with a HOLO structure that is *responsible* for the inclusion of the APO structure in the dataset (e.g. the HOLO structure with the largest pocket RMSD). We selected those HOLO structures as the main HOLO representatives. They are marked with the `is_main_holo_structure` flag.

In [14]:
def get_main_holo_structures(dataset: JSON) -> Dict[str, str]:
    """Retrieves 'main' HOLO structure for each APO structure

    Args:
        dataset (JSON): dataset

    Returns:
        Dict[str, str]: apo_pdb_id is key, holo_pdb_id is value
    """
    apo_to_holo = {}
    for apo_pdb_id, holo_structures in dataset.items():
        for holo_structure in holo_structures:
            holo_pdb_id = holo_structure['holo_pdb_id']
            if holo_structure['is_main_holo_structure']:
                apo_to_holo[apo_pdb_id] = holo_pdb_id
        assert apo_pdb_id in apo_to_holo
    return apo_to_holo


main_holo_structures = get_main_holo_structures(dataset)

## HOW TO: Show `main` HOLO structure for Cobyrinic acid a,c diamide synthase (PDB ID: 4pfs)

In [23]:
# main holo structure for Cobyrinic acid a,c diamide synthase
HOLO_STRUCTURE_OF_INTEREST = main_holo_structures[APO_STRUCTURE_OF_INTEREST] 
HOLO_STRUCTURE_OF_INTEREST

'5ihp'

## HOW TO: Load binding residues for each HOLO structure

In [20]:
def get_holo_binding_residues(dataset: JSON) -> Dict[str, set[str]]:
    """Get holo binding residues for each HOLO structure in the dataset

    Args:
        dataset (JSON): dataset

    Returns:
        Dict[str, set[str]]: holo_pdb_id is key, set of binding residues is value
    """
    holo_residues = {}
    for apo_pdb_id, holo_structures in dataset.items():
        for holo_structure in holo_structures:
            holo_pdb_id = holo_structure['holo_pdb_id']
            if holo_pdb_id not in holo_residues:
                holo_residues[holo_pdb_id] = set()
            holo_residues[holo_pdb_id].update(holo_structure['holo_pocket_selection'])
    return holo_residues

holo_binding_residues = get_holo_binding_residues(dataset)

## HOW TO: Show holo binding residues for PDB ID: 5ihp (the `main` HOLO structure for Cobyrinic acid a,c diamide synthase; PDB ID: 4pfs)

In [21]:
holo_binding_residues[HOLO_STRUCTURE_OF_INTEREST]

{'A_19',
 'A_192',
 'A_20',
 'A_21',
 'A_22',
 'A_220',
 'A_221',
 'A_222',
 'A_223',
 'A_23',
 'A_24',
 'A_25',
 'A_26'}
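
The two outputs shown in this notebook can be compared directly once the chain part is dropped. The sketch below does so for the `4pfs` and `5ihp` residue sets, copied from the outputs above. It assumes that the `auth_seq_id` numbering is comparable between the two structures, which is not guaranteed in general, and the helper `residue_numbers` is hypothetical.

```python
def residue_numbers(residue_ids) -> set[str]:
    """Drop the chain part of '{auth_asym_id}_{auth_seq_id}' identifiers."""
    return {residue_id.partition("_")[2] for residue_id in residue_ids}


# Values copied from the two outputs shown in this notebook.
apo_4pfs = {
    'B_135', 'B_19', 'B_192', 'B_20', 'B_21', 'B_22', 'B_220', 'B_221',
    'B_222', 'B_223', 'B_23', 'B_24', 'B_25', 'B_26', 'B_48',
}
holo_5ihp = {
    'A_19', 'A_192', 'A_20', 'A_21', 'A_22', 'A_220', 'A_221', 'A_222',
    'A_223', 'A_23', 'A_24', 'A_25', 'A_26',
}

only_in_apo = residue_numbers(apo_4pfs) - residue_numbers(holo_5ihp)
only_in_holo = residue_numbers(holo_5ihp) - residue_numbers(apo_4pfs)

print(sorted(only_in_apo, key=int))   # ['48', '135']
print(sorted(only_in_holo, key=int))  # []
```

The APO set is a union over all HOLO records of `4pfs`, so the residues that appear only on the APO side may come from records other than `5ihp`.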

## HOW TO: Get ligands from a particular pair
This example retrieves ligand information for the `4pfs`-`5ihp` pair. The `5ihp` structure binds ADP, an ATP analog.

In [24]:
def get_ligand_information(dataset: JSON, apo_pdb_id: str, holo_pdb_id: str) -> List[tuple[str, str, str]]:
    """Retrieve information about ligand in apo-holo pair. The ligand is present only in the HOLO form, therefore the information is applicable only for the HOLO structure as there is no ligand in APO structure.

    Args:
        dataset (JSON): dataset
        apo_pdb_id (str): APO structure
        holo_pdb_id (str): HOLO structure

    Returns:
        List[tuple[str, str, str]]: List of (ligand acronym, ligand auth_seq_id, ligand auth_asym_id) tuples
    """
    assert apo_pdb_id in dataset, f'{apo_pdb_id} is not present in the dataset'
    ligands = []
    for holo_structure in dataset[apo_pdb_id]:
        this_holo_pdb_id = holo_structure['holo_pdb_id']
        this_ligand = holo_structure['ligand']
        this_ligand_index = holo_structure['ligand_index']
        this_ligand_chain = holo_structure['ligand_chain']
        if this_holo_pdb_id == holo_pdb_id:
            ligands.append((this_ligand, this_ligand_index, this_ligand_chain))
    return ligands       

get_ligand_information(dataset, APO_STRUCTURE_OF_INTEREST, HOLO_STRUCTURE_OF_INTEREST)

[('ADP', '1001', 'A')]

## HOW TO: Retrieve all apo-holo pairs from CryptoBench

In [None]:
def get_apo_holo_pairs(dataset: JSON) -> set[tuple[str, str]]:
    """Retrieves every apo-holo pair from the dataset

    Args:
        dataset (JSON): dataset

    Returns:
        set[tuple[str, str]]: every apo-holo pair
    """
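    pairs = set()
    for apo_pdb_id, holo_structures in dataset.items():
        for holo_structure in holo_structures:
            # a set keeps each (apo, holo) pair only once
            pairs.add((apo_pdb_id, holo_structure['holo_pdb_id']))
    return pairs


apo_holo_pairs = get_apo_holo_pairs(dataset)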
