# `7w19A` prediction with evaluation
If you want to perform only the cryptic binding site prediction without running the evaluation, please refer to the `run-1jwpA-prediction.ipynb` notebook instead. The present notebook runs the prediction on `7w19A` and compares the results with the confirmed pocket.

## Define the structure and pocket
The `pocket` annotation was extracted from the CryptoBench dataset.

In [3]:
pdb_id = '7w19'
chain_id = 'A'
pocket = 'A_N22 A_M24 A_G25 A_I26 A_D27 A_F28 A_V30 A_F32 A_L42 A_R44 A_Y86 A_I88 A_L89 A_D91 A_N92 A_P93 A_N96 A_D192 A_G196 A_H197 A_L199 A_I210 A_D211'
pocket = [int(i.split('_')[1][1:]) for i in pocket.split(' ')]

Retrieve the sequence and binding indices within the sequence

In [4]:
import biotite.database.rcsb as rcsb
import biotite.structure.io.pdbx as pdbx
from biotite.structure.io.pdbx import get_structure
from biotite.sequence import ProteinSequence
import numpy as np

CIF_FILES_PATH = '/path/to/your/cif/files'

cif_file_path = rcsb.fetch(pdb_id, "cif", CIF_FILES_PATH)
cif_file = pdbx.CIFFile.read(cif_file_path)

protein = get_structure(cif_file, model=1)
protein = protein[(protein.atom_name == "CA")
                  & (protein.element == "C")
                  & (protein.chain_id == chain_id)]

sequence = ''.join([ProteinSequence.convert_letter_3to1(residue.res_name) for residue in protein])
binding_indices = [f'{residue.chain_id}_{ProteinSequence.convert_letter_3to1(residue.res_name)}{i}' for i, residue in enumerate(protein) if residue.res_id in pocket]
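The index bookkeeping in the cell above is easy to misread: the number stored in each annotation token is the residue's 0-based position in the extracted chain, not its residue number from the structure file. The following is a minimal sketch of that logic on a toy chain, using plain tuples instead of biotite atoms. `THREE_TO_ONE` and `build_binding_tokens` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical stand-in for ProteinSequence.convert_letter_3to1
# (covers only the residues used in the toy chain below).
THREE_TO_ONE = {"THR": "T", "ILE": "I", "GLN": "Q", "ASP": "D"}


def build_binding_tokens(residues, pocket_res_ids, chain_id):
    """Return the one-letter sequence and tokens such as 'A_I1'.

    residues: list of (res_id, three_letter_name) tuples, one per C-alpha atom.
    pocket_res_ids: residue numbers that belong to the pocket.
    """
    sequence = "".join(THREE_TO_ONE[name] for _, name in residues)
    tokens = [
        f"{chain_id}_{THREE_TO_ONE[name]}{index}"
        for index, (res_id, name) in enumerate(residues)
        if res_id in pocket_res_ids
    ]
    return sequence, tokens


# Toy chain whose numbering starts at 2, so residue number and index differ.
toy_residues = [(2, "THR"), (3, "ILE"), (4, "GLN"), (5, "ASP")]
sequence, tokens = build_binding_tokens(toy_residues, {3, 5}, "A")
print(sequence)  # TIQD
print(tokens)    # ['A_I1', 'A_D3']
```

Residue 3 ends up as index 1 because the chain starts at residue 2; the embeddings are later addressed by this index.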

Create the annotation and sequence files

In [5]:
assert len(binding_indices) == len(pocket)

with open(f'data/{pdb_id}{chain_id}.txt', 'w') as f:
    f.write(sequence)

with open(f'data/{pdb_id}{chain_id}_annotation.txt', 'w') as f:
    f.write(f'{pdb_id};{chain_id};UNK;{" ".join(binding_indices)}')
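The annotation file is a single semicolon-separated line whose fourth field holds the space-separated binding tokens. As a rough sketch of that format, the snippet below builds such a line and parses it back. The helper names `format_annotation` and `parse_annotation` are hypothetical, the tokens are toy values, and the parsing mirrors what the evaluation cell does later in this notebook.

```python
def format_annotation(pdb_id, chain_id, tokens):
    """Build the one-line annotation record: pdb_id;chain_id;UNK;tokens."""
    return f"{pdb_id};{chain_id};UNK;{' '.join(tokens)}"


def parse_annotation(line):
    """Extract the 0-based sequence indices from the fourth field."""
    tokens = line.strip().split(";")[3].split(" ")
    # 'A_I1' -> 'I1' -> 1 (drop the chain prefix and the one-letter residue code)
    return [int(token.split("_")[1][1:]) for token in tokens]


line = format_annotation("7w19", "A", ["A_I1", "A_D3"])
indices = parse_annotation(line)
print(line)     # 7w19;A;UNK;A_I1 A_D3
print(indices)  # [1, 3]
```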

## ⚠️ CAUTION: ESM2-3B embedding computation required!
For optimal performance, use a **GPU-equipped machine** when computing ESM2-3B embeddings, especially if processing multiple structures. While computation on a CPU-only machine should be possible, I haven't tested it. 

*Note: Computation of the ESM2 embedding is not part of this script. To generate embeddings, you may find [this script](https://github.com/skrhakv/esm2-generator/blob/master/compute-esm.py) in the [esm2-generator repository](https://github.com/skrhakv/esm2-generator) useful.*
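Because the evaluation below pairs each embedding row with one residue, a quick shape check before running the model can catch a mismatched or truncated embedding file. The sketch below is not part of the CryptoBench scripts: `check_embedding_shape` is a hypothetical name, and the default width of 2560 assumes the hidden size of the ESM2-3B model (`esm2_t36_3B_UR50D`).

```python
def check_embedding_shape(shape, sequence_length, expected_dim=2560):
    """Raise ValueError unless shape is (sequence_length, expected_dim).

    expected_dim=2560 assumes per-residue ESM2-3B embeddings with the
    special start/end tokens already stripped.
    """
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D array, got shape {shape}")
    rows, dim = shape
    if rows != sequence_length:
        raise ValueError(f"{rows} embedding rows for {sequence_length} residues")
    if dim != expected_dim:
        raise ValueError(f"embedding width {dim}, expected {expected_dim}")


# Usage with the files from this notebook (assumed to exist):
#   check_embedding_shape(np.load('data/7w19A.npy').shape, len(sequence))
check_embedding_shape((4, 2560), sequence_length=4)  # passes silently
```

If the embedding generator you use keeps the start and end tokens, the row count will exceed the sequence length and the check will fail, which is the intended signal to strip them first.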


## Run the prediction and evaluate

In [9]:
# This script is similar to the script provided in the CryptoBench dataset repository (https://osf.io/pz4a9/).

import numpy as np
from tensorflow import keras
import tensorflow_addons as tfa
import sys

# CAUTION: You need to specify the path to the CryptoBench dataset! It is available at: https://osf.io/pz4a9/
CRYPTOBENCH_PATH = '/path/to/cryptobench'
sys.path.append(f'{CRYPTOBENCH_PATH}/scripts')
from Protein import Protein

MODEL_PATH = f'{CRYPTOBENCH_PATH}/benchmark/best_trained'
STRUCTURE_ID = f'{pdb_id}{chain_id}'

# 0.95 decision threshold was used in the CryptoBench paper
DECISION_THRESHOLD = 0.95


def load_model():
    print("Loading CryptoBench model ...")
    return keras.models.load_model(MODEL_PATH,
                                   custom_objects={
                                       'MatthewsCorrelationCoefficient': tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)},
                                   compile=False)


def predict(X, model):
    print("Making prediction ...")
    return model.predict(X)


def load_data():
    print("Loading data - embeddings and annotations ...")
    embeddings = np.load(f'data/{STRUCTURE_ID}.npy')

    with open(f'data/{STRUCTURE_ID}_annotation.txt', 'r') as f:
        annotations = f.read().split(';')[3].split(' ')

    # the format of each annotation is as follows: 
    # 'A_G210' denotes a single binding residue, which belongs to the 'A' chain,
    # 'G' denotes that the residue is Glycine, and the corresponding embedding
    # can be found at index 210 in the embeddings array
    annotations = [int(i.split('_')[1][1:]) for i in annotations]
    y = [0] * embeddings.shape[0]
    for ix in annotations:
        y[ix] = 1

    return embeddings, y


def print_evaluation(evaluation):
    print(
        f'\n\n\nEvaluation for {evaluation.id} with decision threshold = {DECISION_THRESHOLD}:\n')
    print(f'AUC: {evaluation.auc}')
    print(f'AUPRC: {evaluation.auprc}')
    print(f'ACC: {evaluation.accuracy}')
    print(f'TPR: {evaluation.get_TPR()}')
    print(f'FPR: {evaluation.get_FPR()}')
    print(f'MCC: {evaluation.mcc}')
    print(f'F1: {evaluation.f1}')


def evaluate(prediction, actual):
    evaluation = Protein(STRUCTURE_ID, prediction, actual,
                         threshold=DECISION_THRESHOLD)
    print_evaluation(evaluation)
    return evaluation

model = load_model()
embeddings, annotations = load_data()
predictions = predict(embeddings, model)
evaluation = evaluate(predictions, annotations)




Loading CryptoBench model ...
Loading data - embeddings and annotations ...
Making prediction ...



Evaluation for 7w19A with decision threshold = 0.95:

AUC: 0.8685990338164251
AUPRC: 0.4806823218753024
ACC: 0.9112627986348123
TPR: 0.4782608695652174
FPR: 0.05185185185185185
MCC: 0.4105204339643005
F1: 0.909579045975699
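To make the reported numbers easier to interpret, the sketch below recomputes TPR, FPR, accuracy, and MCC from a confusion matrix using the standard definitions. The counts (11 true positives, 12 false negatives, 14 false positives, 256 true negatives) are inferred here from the printed rates and the 23 annotated pocket residues; the notebook does not print them, and the `Protein` class may compute its metrics differently. `confusion_counts` and `basic_metrics` are hypothetical helper names.

```python
import math


def confusion_counts(probabilities, labels, threshold):
    """Count TP, FP, TN, FN; a residue is positive if its probability exceeds threshold."""
    tp = fp = tn = fn = 0
    for probability, label in zip(probabilities, labels):
        predicted = probability > threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def basic_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / mcc_denominator,
    }


# Synthetic per-residue scores that reproduce the inferred counts.
probabilities = [0.99] * 11 + [0.50] * 12 + [0.99] * 14 + [0.10] * 256
labels = [1] * 23 + [0] * 270
metrics = basic_metrics(*confusion_counts(probabilities, labels, threshold=0.95))
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the sketch agrees with the printed TPR, FPR, ACC, and MCC to the shown precision. The sketch does not guard against empty classes, where the ratios would divide by zero.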


## Prediction examination
At the 0.95 decision threshold, the prediction on the `7w19A` structure achieved a true positive rate (TPR) of 0.48 and a false positive rate (FPR) of 0.05, with an AUC of 0.87 and an F1 score of 0.91. Let's examine the results. The cell below prints, for each residue, whether the method predicts it as part of a cryptic binding site (True) or not (False).

In [8]:
for prediction, residue in zip(predictions, protein):
    print(f"{residue.res_id} {residue.res_name} {'False' if prediction[1] <= DECISION_THRESHOLD else 'True'}")

2 THR False
3 ILE False
4 GLN False
5 ASP False
6 ILE False
7 GLN False
8 SER False
9 LEU False
10 ALA False
11 GLU False
12 ALA False
13 HIS False
14 GLY False
15 LEU False
16 LEU False
17 LEU False
18 THR False
19 ASP False
20 LYS False
21 MET False
22 ASN False
23 PHE False
24 ASN False
25 GLU False
26 MET True
27 GLY True
28 ILE True
29 ASP True
30 PHE True
31 LYS False
32 VAL False
33 VAL False
34 PHE False
35 ALA False
36 LEU False
37 ASP False
38 THR False
39 LYS False
40 GLY False
41 GLN False
42 GLN False
43 TRP False
44 LEU True
45 LEU False
46 ARG True
47 ILE False
48 PRO False
49 ARG False
50 ARG False
51 ASP False
52 GLY False
53 MET False
54 ARG False
55 GLU False
56 GLN False
57 ILE False
58 LYS False
59 LYS False
60 GLU True
61 LYS False
62 ARG False
63 ILE False
64 LEU False
65 GLU False
66 LEU False
67 VAL False
68 LYS False
69 LYS False
70 HIS False
71 LEU False
72 SER False
73 VAL False
74 GLU False
75 VAL False
76 PRO True
77 ASP False
78 TRP False
79 ARG False
80 