<a href="https://colab.research.google.com/github/valsson-group/UNT-ChemicalApplicationsOfMachineLearning-Spring2026/blob/main/Lecture-10_February-24-2026/Lecture-10_BinaryClassification-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lecture 10 - Binary Classification

Here, we continue with binary classification, using the same data as in Lecture 9.

We will consider data from this paper:
- Enhancing Permeability Prediction of Heterobifunctional Degraders Using Machine Learning and Metadynamics-Informed 3D Molecular Descriptors - [DOI:10.1021/acs.jcim.5c01600](https://doi.org/10.1021/acs.jcim.5c01600)

The authors study the permeability of so-called PROTAC compounds, which are large and flexible molecules used in targeted protein degradation.

All the datasets used in the paper, and the code used to obtain the results, are available in the following GitHub repository:
- https://github.com/brykimjh/degrader-permeability-ml3d-metaD  

The specific dataset that we use contains 32 PROTACs with measured passive permeability (given in nm/s) and includes 17 features calculated with RDKit (see [here](https://github.com/brykimjh/degrader-permeability-ml3d-metaD/blob/main/data/calculate_2d_properties.py) for the script used to calculate them).

The target value is the experimentally measured passive permeability.

The dataset can be seen here:
- https://github.com/brykimjh/degrader-permeability-ml3d-metaD/blob/main/data/2d_features.csv

The passive permeability is given in the column `P_app`.

In [None]:
%%bash
# Download dataset (the %%bash cell magic has to be the first line of the cell)
dataset_url="https://raw.githubusercontent.com/brykimjh/degrader-permeability-ml3d-metaD/refs/heads/main/data/2d_features.csv"
wget "${dataset_url}"
ls

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
dataset = pd.read_csv("2d_features.csv")

We now turn the problem into a classification problem by separating the molecules into a high-permeability class and a low-permeability class, using a cutoff of 7 nm/s, which splits the dataset equally.
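
As a rough illustration of how a cutoff turns a continuous target into balanced binary labels, the sketch below thresholds a small array at its median. The permeability values are made up for the example and are not taken from the dataset; they are only chosen so that the median happens to be 7.0.

```python
import numpy as np

# Hypothetical permeability values in nm/s (NOT from the dataset)
p_app = np.array([0.5, 2.0, 3.5, 6.0, 8.0, 12.0, 20.0, 45.0])

# Using the median as the cutoff puts half of the values on each side
cutoff = np.median(p_app)

# Label 1 = high permeability (strictly above the cutoff), 0 = low permeability
labels = np.where(p_app > cutoff, 1, 0)

n_high = int(np.sum(labels == 1))
n_low = int(np.sum(labels == 0))
print(f"cutoff = {cutoff:.1f} nm/s, high = {n_high}, low = {n_low}")
```

A balanced split is convenient here because accuracy is easier to interpret when neither class dominates.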



In [None]:
Permeable_cutoff = 7.0
Low_label = 0
High_label = +1
Permeable_key_str = f'Permeability High({High_label:})/Low({Low_label:})'
dataset[Permeable_key_str] = [High_label if p > Permeable_cutoff else Low_label for p in dataset['P_app']]

Number_Permeable_High = np.sum(dataset[Permeable_key_str] == High_label)
Number_Permeable_Low = np.sum(dataset[Permeable_key_str] == Low_label)

print("Key:",Permeable_key_str)

print("Number with high permeability (above {:.1f} nm/s): {:d}".format(Permeable_cutoff,Number_Permeable_High))
print("Number with low permeability (at or below {:.1f} nm/s): {:d}".format(Permeable_cutoff,Number_Permeable_Low))

print("")

dataset[['P_app', Permeable_key_str] ]

In [None]:
print(dataset.keys())

In [None]:
# generate a data frame with just the features and target values
features = dataset.drop(columns=['Index',
                                 'Compound',
                                 'P_app AB (nm/s)',
                                 'P_app BA (nm/s)',
                                 'P_app',
                                 'Smiles',
                                 Permeable_key_str])  # label column created above
target = dataset[Permeable_key_str]

In [None]:
features

In [None]:
target

### k Nearest Neighbors
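
A k-nearest-neighbors classifier assigns a sample to the majority class among the k training samples closest to it. Because the decision depends on distances, features with large numerical ranges can dominate, which is why the features are standardized before the classifier is applied. The sketch below illustrates the idea with plain NumPy on made-up 2D points. The helper `knn_predict` is a hypothetical name used only for illustration; it is not how scikit-learn implements `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k training points closest to x_query (binary labels 0/1)."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the votes for class 0 and class 1 and return the winner
    votes = np.bincount(y_train[nearest], minlength=2)
    return int(np.argmax(votes))

# Made-up training data: two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.15, 0.15]))  # close to the class-0 cluster
pred_b = knn_predict(X_train, y_train, np.array([0.95, 1.00]))  # close to the class-1 cluster
print(pred_a, pred_b)
```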




In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate,ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

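# Hold out half of the compounds as a test set (the split is random, so the results vary between runs)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.5)

n_neighbors=5

# Standardize the features before kNN, since the classifier relies on distances between samples
model = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=n_neighbors))]
)

model.fit(features_train,target_train)

# Predict the class labels for the held-out compounds
target_test_predicted = model.predict(features_test)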

print("Accuracy:                 {:.4f}".format(metrics.accuracy_score(target_test,target_test_predicted)))
print("Precision:                {:.4f}".format(metrics.precision_score(target_test,target_test_predicted)))
print("Recall:                   {:.4f}".format(metrics.recall_score(target_test,target_test_predicted)))

cfm = metrics.ConfusionMatrixDisplay.from_predictions(target_test,target_test_predicted)
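
The three scores printed above can all be read off the confusion matrix. As a small sketch, using made-up labels rather than the model predictions, the counts of true negatives, false positives, false negatives, and true positives give accuracy, precision, and recall directly:

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels (1 = high permeability, 0 = low permeability)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

# For binary labels, the flattened confusion matrix is ordered as tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # fraction of predicted positives that are correct
recall = tp / (tp + fn)                      # fraction of actual positives that are found

print(f"accuracy = {accuracy:.3f}, precision = {precision:.3f}, recall = {recall:.3f}")
```

The hand-computed values should agree with `metrics.accuracy_score`, `metrics.precision_score`, and `metrics.recall_score` on the same labels.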


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate,ShuffleSplit,StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

n_neighbors=5

model = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=n_neighbors))]
)


scoring = {'accuracy':'accuracy',
            'recall': metrics.make_scorer(metrics.recall_score, zero_division=np.nan),
            'precision': metrics.make_scorer(metrics.precision_score, zero_division=np.nan),
           'roc_auc': 'roc_auc'
}


# Employ stratified 4-fold CV (n_splits=4), which keeps the class ratio similar in every fold
scores_fold = cross_validate(
    model,
    features, target,
    scoring=scoring,
    cv=StratifiedKFold(n_splits=4, shuffle=True),
    return_train_score=True,
    return_estimator=True,
    return_indices=True
)

# Evaluate the model using cross-validation with repeated random 50/50 train/test splits
NumSplits=100
cv_random = ShuffleSplit(n_splits=NumSplits, test_size=0.5)
scores_random = cross_validate(
    model,
    features, target,
    scoring=scoring,
    cv=cv_random,
    return_train_score=True,
    return_estimator=True,
    return_indices=True
)

# metrics.RocCurveDisplay.from_cv_results(scores_random,
#                                         features,
#                                         target)


print("Accuracy - Test")
print("- 4-Fold CV                   : {:.3f} +- {:.3f}".format(scores_fold['test_accuracy'].mean(),scores_fold['test_accuracy'].std()))
print("- Random Splits ({:d} splits) : {:.3f} +- {:.3f}".format(NumSplits, scores_random['test_accuracy'].mean(), scores_random['test_accuracy'].std()))

print("ROC AUC - Test")
print("- 4-Fold CV                   : {:.3f} +- {:.3f}".format(scores_fold['test_roc_auc'].mean(),scores_fold['test_roc_auc'].std()))
print("- Random Splits ({:d} splits) : {:.3f} +- {:.3f}".format(NumSplits, scores_random['test_roc_auc'].mean(), scores_random['test_roc_auc'].std()))

# Precision is NaN for splits without any predicted positives (zero_division=np.nan), so use the NaN-aware statistics
print("Precision - Test")
print("- 4-Fold CV                   : {:.3f} +- {:.3f}".format( np.nanmean(scores_fold['test_precision']),np.nanstd(scores_fold['test_precision'])))
print("- Random Splits ({:d} splits) : {:.3f} +- {:.3f}".format(NumSplits, np.nanmean(scores_random['test_precision']), np.nanstd(scores_random['test_precision'])))
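
The two splitting schemes used above differ in how they treat the class balance. `StratifiedKFold` keeps the class ratio the same in every test fold, whereas `ShuffleSplit` draws the test set at random, so the number of high-permeability compounds in it can vary from split to split. The sketch below shows this on placeholder data: 32 labels split 16/16 stand in for the dataset, and the feature matrix is a dummy array of zeros.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedKFold

# Placeholder labels: 16 low (0) and 16 high (1); the features are dummies
y = np.array([0] * 16 + [1] * 16)
X = np.zeros((len(y), 1))

# Stratified 4-fold CV: each test fold holds 8 samples with the same class ratio
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_high = [int(y[test].sum()) for _, test in skf.split(X, y)]

# Random 50/50 splits: each test set holds 16 samples, but the class ratio is not controlled
ss = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
shuffle_sizes = [len(test) for _, test in ss.split(X)]
shuffle_high = [int(y[test].sum()) for _, test in ss.split(X)]

print("High-class count per stratified test fold:", fold_high)
print("High-class count per random test split   :", shuffle_high)
```

With only 32 compounds, an unlucky random split can leave few samples of one class in the test set, which is one reason the spread of the scores over random splits can be large.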


