# ClassyPose: A Machine-Learning Classification Model for Correct Ligand Pose Selection

Please cite: Tran-Nguyen, V.K., Camproux, A.C. & Taboureau, O. ClassyPose: A Machine-Learning Classification Model for Ligand Pose Selection Applied to Virtual Screening in Drug Discovery.

Set up the **protocol-env** environment before running this notebook, using the file **protocol-env.yml** from our **MLSF-protocol** repository: https://github.com/vktrannguyen/MLSF-protocol.

### Step 1: Importing all Python dependencies

In [None]:
import os
import numpy as np
import pandas as pd
import oddt
import oddt.pandas as opd
from sklearn.svm import SVC

### Step 2: Loading CSV data files for training and test sets 

**1. For the pose selection/classification task**: where the **Real Class of the pose** (good or bad pose) is already known

In [None]:
train_data = pd.read_csv("Pathway_to_the_training_data_file_:_training_data_poses.csv")
Train_Class = train_data['Classification']
test_data = pd.read_csv("Pathway_to_the_test_data_file")
Test_Class = test_data['Classification']

**2. For the virtual screening task**: where the **Real Class of the pose** (good or bad pose) is not known

Note: here we refer to the **Real Class of the pose**, **not** the Real Class of the screened molecule (active or inactive/decoy).

In [None]:
train_data = pd.read_csv("Pathway_to_the_training_data_file_:_training_data_poses.csv")
Train_Class = train_data['Classification']
test_data = pd.read_csv("Pathway_to_the_test_data_file")
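
The later steps read the pose name from the first column of each data file, the RMSD from the second column, and the label from a column named `Classification`. A minimal sketch of a file with that layout is shown below. The header names `Pose` and `RMSD`, the pose names, and all values are hypothetical and only serve as an illustration; for the virtual screening task the test file would not carry the `Classification` column.

```python
import io
import pandas as pd

# Toy stand-in for a training data file (hypothetical names and values).
toy_csv = io.StringIO(
    "Pose,RMSD,Classification\n"
    "ligand1_pose1,0.8,Good\n"
    "ligand1_pose2,5.3,Bad\n"
)
toy_data = pd.read_csv(toy_csv)

# Same access pattern as in this notebook:
toy_class = toy_data["Classification"]   # label column, accessed by name
pose_names = toy_data.iloc[:, 0]         # first column: pose names
rmsd_values = toy_data.iloc[:, 1]        # second column: RMSD values
```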

### Step 3: Loading PLEC fingerprints of training and test data 

In [None]:
d_train_csv = pd.read_csv('Pathway_to_the_PLEC_fingerprints_of_training_data_:_training_data_PLEC.csv', header=None)
d_test_csv = pd.read_csv('Pathway_to_the_PLEC_fingerprints_of_test_data', header=None)
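
Because the fingerprint files are read with `header=None`, row *i* of each fingerprint table is matched to row *i* of the corresponding data file purely by position. The sketch below is one possible sanity check, not part of the original protocol; `check_alignment` is a hypothetical helper and the toy tables are invented for illustration.

```python
import pandas as pd

def check_alignment(fingerprints, data, name):
    """Raise if a fingerprint table and its data table differ in row count."""
    if len(fingerprints) != len(data):
        raise ValueError(
            f"{name}: {len(fingerprints)} fingerprint rows "
            f"but {len(data)} data rows"
        )
    return fingerprints.shape

# Toy example: two poses, three fingerprint columns.
toy_fp = pd.DataFrame([[0, 1, 2], [1, 0, 0]])
toy_labels = pd.DataFrame({"Pose": ["p1", "p2"],
                           "Classification": ["Good", "Bad"]})
shape = check_alignment(toy_fp, toy_labels, "toy set")
```

In this notebook the equivalent calls would be `check_alignment(d_train_csv, train_data, "training")` and `check_alignment(d_test_csv, test_data, "test")`. The training and test fingerprint tables should also have the same number of columns, since the model is applied to both.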

### Step 4: Training and applying ClassyPose

**1. For the pose selection/classification task**: where the **Real Class of the pose** (good or bad pose) is already known

In [None]:
#Train ClassyPose on the training set poses
#(degree is only used by the "poly" kernel, so it has no effect with kernel = "rbf"):
svm_plec = SVC(degree = 3, kernel = "rbf", gamma = 'scale', probability = True)
svm_plec.fit(d_train_csv, Train_Class)

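#Predict the Good Pose Probability for the test set poses
#(column 1 of predict_proba corresponds to the second label in svm_plec.classes_, assumed here to be the good pose label):
prediction_test_svm_plec_prob = svm_plec.predict_proba(d_test_csv)
plec_result_svm = pd.DataFrame({"Good_Pose_Prob": prediction_test_svm_plec_prob[:, 1], "Real_Class": Test_Class})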

#Classify the test set poses ("Good" if the Good Pose Probability exceeds 0.5, "Bad" otherwise):
plec_result_svm['Predicted_Class'] = np.where(plec_result_svm["Good_Pose_Prob"] > 0.5, "Good", "Bad")

#Save the output as a CSV file
#(column 0 of the test data holds the pose names, column 1 the RMSD values):
rmsd = test_data.iloc[:, 1]
pose = test_data.iloc[:, 0]
plec_result_svm['RMSD'] = rmsd
plec_result_svm['Pose'] = pose
plec_result_svm.to_csv("Pathway_to_the_CSV_result_file")
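
Since the result table for this task holds both the real and the predicted class of every pose, the two can be compared directly. The sketch below shows one way to do so with a cross-tabulation and an overall accuracy. It is not part of the original protocol: `summarize_classification` is a hypothetical helper, the toy table is invented, and the sketch assumes that the `Classification` column uses the same `"Good"`/`"Bad"` strings as `Predicted_Class`.

```python
import pandas as pd

def summarize_classification(results, real_col="Real_Class",
                             pred_col="Predicted_Class"):
    """Return a real-vs-predicted count table and the overall accuracy."""
    table = pd.crosstab(results[real_col], results[pred_col])
    accuracy = (results[real_col] == results[pred_col]).mean()
    return table, accuracy

# Toy result table (invented values, for illustration only).
toy_results = pd.DataFrame({
    "Real_Class":      ["Good", "Good", "Bad", "Bad"],
    "Predicted_Class": ["Good", "Bad",  "Bad", "Bad"],
})
confusion, accuracy = summarize_classification(toy_results)
```

With the table built in this notebook, the call would be `summarize_classification(plec_result_svm)`.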

**2. For the virtual screening task**: where the **Real Class of the pose** (good or bad pose) is not known

In [None]:
#Train ClassyPose on the training set poses
#(degree is only used by the "poly" kernel, so it has no effect with kernel = "rbf"):
svm_plec = SVC(degree = 3, kernel = "rbf", gamma = 'scale', probability = True)
svm_plec.fit(d_train_csv, Train_Class)

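#Predict the Good Pose Probability for the test set poses
#(column 1 of predict_proba corresponds to the second label in svm_plec.classes_, assumed here to be the good pose label):
prediction_test_svm_plec_prob = svm_plec.predict_proba(d_test_csv)
plec_result_svm = pd.DataFrame({"Good_Pose_Prob": prediction_test_svm_plec_prob[:, 1]})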

#Save the output as a CSV file
#(column 0 of the test data holds the pose names):
pose = test_data.iloc[:, 0]
plec_result_svm['Pose'] = pose
plec_result_svm.to_csv("Pathway_to_the_CSV_result_file")
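
Both tasks take column 1 of the `predict_proba` output as the Good Pose Probability. scikit-learn orders the probability columns according to the fitted model's `classes_` attribute, so it can be worth confirming which column belongs to the good pose label. The sketch below is a hypothetical helper, not part of the original protocol, and it assumes the label is the string `"Good"`; adjust `good_label` if the `Classification` column is encoded differently.

```python
def good_pose_column(classes, good_label="Good"):
    """Return the predict_proba column index that belongs to good_label."""
    classes = list(classes)
    if good_label not in classes:
        raise ValueError(f"{good_label!r} not found among classes {classes}")
    return classes.index(good_label)

# Toy example: string labels are sorted alphabetically in classes_,
# so "Bad" comes first and "Good" second.
toy_classes = ["Bad", "Good"]
col = good_pose_column(toy_classes)
```

In this notebook the call would be `good_pose_column(svm_plec.classes_)`, and the returned index could replace the hard-coded `1` in `prediction_test_svm_plec_prob[:, 1]`.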