<a href="https://colab.research.google.com/github/valsson-group/UNT-ChemicalApplicationsOfMachineLearning-Spring2026/blob/main/Lecture-9_February-17-2026/Lecture-9_BinaryClassification-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lecture 9 - Binary Classification

Here, we take the first step into supervised learning and consider a classification problem.

We will consider data from this paper:
- Enhancing Permeability Prediction of Heterobifunctional Degraders Using Machine Learning and Metadynamics-Informed 3D Molecular Descriptors - [DOI:10.1021/acs.jcim.5c01600](https://doi.org/10.1021/acs.jcim.5c01600)

The authors study the permeability of so-called PROTAC compounds, which are large and flexible molecules used in targeted protein degradation.

All datasets used in the paper, and the code used to obtain the results, are available in the following GitHub repository:
- https://github.com/brykimjh/degrader-permeability-ml3d-metaD  

The specific dataset that we use contains 32 PROTACs with measured passive permeability (given in nm/s) and includes 17 features calculated with RDKit (see [here](https://github.com/brykimjh/degrader-permeability-ml3d-metaD/blob/main/data/calculate_2d_properties.py) for the script used to calculate them).

The target value is the experimentally measured passive permeability.

The dataset can be seen here:
- https://github.com/brykimjh/degrader-permeability-ml3d-metaD/blob/main/data/2d_features.csv

The passive permeability is given in the column `P_app`.

In [None]:
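%%bash
# Download dataset (the %%bash cell magic must be the first line of the cell)
dataset_url="https://raw.githubusercontent.com/brykimjh/degrader-permeability-ml3d-metaD/refs/heads/main/data/2d_features.csv"
# -nc: do not download again if the file already exists
wget -nc "${dataset_url}"
ls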

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
dataset = pd.read_csv("2d_features.csv")

In [None]:
dataset

In [None]:
dataset.describe()

In [None]:
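# Distribution of the measured passive permeability
plt.hist(dataset['P_app'],bins=20)
plt.xlabel('P_app (nm/s)')
plt.ylabel('Number of compounds')
plt.show()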

### Task 1

Use the mols2grid package to create a visualization of all molecules and show the passive permeability on the grid with the correct units.

We now turn the problem into a classification problem by separating the molecules into a high-permeability class and a low-permeability class, using a cutoff of 7 nm/s that splits the dataset equally.



In [None]:
Permeable_cutoff = 7.0
Low_label = 0
High_label = +1
Permeable_key_str = f'Permeability High({High_label:})/Low({Low_label:})'
dataset[Permeable_key_str] = [High_label if p > Permeable_cutoff else Low_label for p in dataset['P_app']]

# Count the members of each class using the label variables defined above
Number_Permeable_High = np.sum(dataset[Permeable_key_str] == High_label)
Number_Permeable_Low = np.sum(dataset[Permeable_key_str] == Low_label)

print("Key:",Permeable_key_str)

print("Number with high permeability (above {:.1f} nm/s): {:d}".format(Permeable_cutoff,Number_Permeable_High))
print("Number with low permeability (at or below {:.1f} nm/s): {:d}".format(Permeable_cutoff,Number_Permeable_Low))

print("")

dataset[['P_app', Permeable_key_str] ]

In [None]:
# generate a data frame with just the features and target values
features = dataset.drop(columns=['Index','Compound','P_app AB (nm/s)','P_app BA (nm/s)','P_app','Smiles', Permeable_key_str])
target = dataset[Permeable_key_str]

In [None]:
features

In [None]:
target

### Decision Trees

Decision trees are a simple supervised learning method that can be used for classification. They build a tree of rules that repeatedly split the dataset into subsets until each subset contains only one class.
- https://en.wikipedia.org/wiki/Decision_tree_learning
- https://scikit-learn.org/stable/modules/tree.html

The trained tree can then be used to make predictions. The obtained tree can also be visualized.
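
To make the idea of rule-based splitting concrete, here is a minimal sketch on a small made-up dataset (the arrays `X_toy` and `y_toy` and the feature names `a` and `b` are assumptions for illustration and are not part of the PROTAC dataset). It fits a tree and prints the learned rules as text using `export_text` from scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: the label is 1 only when both features are above 0.5
X_toy = np.array([
    [0.1, 0.2],
    [0.2, 0.9],
    [0.3, 0.4],
    [0.7, 0.1],
    [0.8, 0.3],
    [0.6, 0.8],
    [0.9, 0.7],
    [0.7, 0.9],
])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1])

# Without a depth limit, the tree keeps splitting until every leaf is pure
toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(X_toy, y_toy)

# Print the learned splitting rules
print(export_text(toy_tree, feature_names=["a", "b"]))

# Because every leaf contains a single class, the training data is reproduced exactly
train_accuracy = toy_tree.score(X_toy, y_toy)
print("Training accuracy:", train_accuracy)
```

A perfect score on the training data says nothing about how well the tree generalizes, which is why the performance is measured on data that was not used for training.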

It is a good idea to visualize the predicted values using a confusion matrix.

The confusion matrix contains the numbers of:
- True Positives (TP)
- False Positives (FP)
- True Negatives (TN)
- False Negatives (FN)

The choice of what is positive and what is negative is often arbitrary. Here, we consider high permeability (> 7.0 nm/s) to be positive. In scikit-learn, the value of +1 is considered the positive label by default.
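
As a small illustration of the four counts, the sketch below uses made-up label vectors (`y_true` and `y_pred` are assumptions for illustration, with 1 as the positive class) and reads the counts from `sklearn.metrics.confusion_matrix`. In scikit-learn, the rows of the matrix are the true labels and the columns are the predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Rows: true labels, columns: predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# With labels=[0, 1] the flattened matrix is ordered as TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```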

To measure the performance, we can define different metrics:

- Accuracy: (TP+TN) / (TP+TN+FN+FP)
- Precision: (TP) / (TP+FP)
- Recall: (TP) / (TP+FN)
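
The formulas above can be checked against scikit-learn on a small made-up example (the label vectors `y_true` and `y_pred` are assumptions for illustration). The sketch also shows that the argument order matters: the scikit-learn metric functions take the true labels first and the predicted labels second, and swapping them exchanges precision and recall.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Count the four outcomes directly from the definitions
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fn = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (tp + tn) / (tp + tn + fn + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Compare with scikit-learn: true labels first, predicted labels second
print("Accuracy: ", accuracy, metrics.accuracy_score(y_true, y_pred))
print("Precision:", precision, metrics.precision_score(y_true, y_pred))
print("Recall:   ", recall, metrics.recall_score(y_true, y_pred))

# Swapping the arguments turns precision into recall (and vice versa)
swapped_precision = metrics.precision_score(y_pred, y_true)
print("Precision with swapped arguments:", swapped_precision)
```

Accuracy is symmetric in its two arguments, so a swapped call goes unnoticed there, while precision and recall silently trade places.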

Further information and other performance metrics can be found in the [sklearn manual](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-string-names)

[Reminder on the difference between accuracy and precision](https://en.wikipedia.org/wiki/Accuracy_and_precision)



In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn import metrics

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.5)

model = DecisionTreeClassifier()

model.fit(features_train,target_train)

target_test_predicted = model.predict(features_test)

# scikit-learn metrics take the true labels first, then the predicted labels
print("Accuracy:                 {:.4f}".format(metrics.accuracy_score(target_test, target_test_predicted)))
print("Precision:                {:.4f}".format(metrics.precision_score(target_test, target_test_predicted)))
# print("Precision (pos_label=1):  {:.4f}".format(metrics.precision_score(target_test, target_test_predicted, pos_label=1)))
print("Recall:                   {:.4f}".format(metrics.recall_score(target_test, target_test_predicted)))
# print("Recall (pos_label=1):     {:.4f}".format(metrics.recall_score(target_test, target_test_predicted, pos_label=1)))

# Visualize the fitted tree and the confusion matrix for the test set
tree_plt = plot_tree(model,
                     feature_names=list(features.columns),
                     fontsize=8)
cfm = metrics.ConfusionMatrixDisplay.from_predictions(target_test,target_test_predicted)

In [None]:
# A nicer way to get the tree
import graphviz
dot_data = export_graphviz(model, out_file=None,
                     feature_names=list(features.columns),
                     filled=True, rounded=True,
                     special_characters=True)
graph = graphviz.Source(dot_data)
graph

Here we use cross-validation, which evaluates the model on several different train/test splits instead of a single one.
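
To see what the two splitting strategies used here do, the sketch below prints the test indices they generate for a small made-up label vector (`y_demo` and `X_demo` are assumptions for illustration). When `cv` is an integer and the model is a classifier, scikit-learn uses stratified K-fold, so each sample ends up in exactly one test fold and the class proportions are preserved. `ShuffleSplit` instead draws an independent random split each time.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, ShuffleSplit

# Made-up labels: 10 samples, 5 per class; the feature values do not affect the splitting
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_demo = np.zeros((len(y_demo), 1))

# 5-fold stratified CV: every sample is used for testing exactly once
kfold = StratifiedKFold(n_splits=5)
kfold_tests = [test_idx for _, test_idx in kfold.split(X_demo, y_demo)]
for i, test_idx in enumerate(kfold_tests):
    print(f"fold {i}: test samples {test_idx}, classes {y_demo[test_idx]}")

# Random splits: each split is drawn independently, so test sets can overlap
random_splits = ShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
random_tests = [test_idx for _, test_idx in random_splits.split(X_demo)]
for i, test_idx in enumerate(random_tests):
    print(f"random split {i}: test samples {np.sort(test_idx)}")
```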

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate,ShuffleSplit


model = DecisionTreeClassifier()

# Metrics to evaluate in the cross-validation

scoring = {'accuracy':'accuracy',
           'recall': 'recall',
           'precision':  metrics.make_scorer(metrics.precision_score, zero_division=np.nan)
}



# employ 5-fold CV
scores_fold = cross_validate(
    model,
    features, target,
    scoring=scoring,
    cv=5
)

# employ repeated random train/test splits (50% of the data in the test set)
NumSplits=1000
cv_random = ShuffleSplit(n_splits=NumSplits, test_size=0.5)
scores_random = cross_validate(
    model,
    features, target,
    scoring=scoring,
    cv=cv_random
)

print("Accuracy")
print("- 5-Fold CV                   : {:.3f} +- {:.3f}".format(scores_fold['test_accuracy'].mean(),scores_fold['test_accuracy'].std()))
print("- Random Splits ({:d} splits) : {:.3f} +- {:.3f}".format(NumSplits, scores_random['test_accuracy'].mean(), scores_random['test_accuracy'].std()))





### Task 2

Repeat the analysis above using Random Forest
- https://scikit-learn.org/stable/modules/ensemble.html#random-forests
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

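This is an ensemble method that builds many different decision trees and combines their predictions. In classification, the predicted class is the one favored by the majority of the trees (scikit-learn implements this by averaging the class probabilities predicted by the individual trees).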

The main hyperparameter in Random Forest is the number of trees (`n_estimators`). Use the default value of 100 decision trees.
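
As a hint on how the ensemble combines its trees, here is a minimal sketch on synthetic data generated with `make_classification` (the synthetic data and the variable names are assumptions for illustration; this is not the analysis of the PROTAC dataset asked for in the task). It checks that the forest's class probabilities are the average over the individual trees and that the predicted class is the one with the highest averaged probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the PROTAC dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# n_estimators=100 is the scikit-learn default, written out here for clarity
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)
print("Number of trees:", len(forest.estimators_))

# Average the class probabilities predicted by the individual trees
X_new = X_syn[:5]
tree_average = np.mean([tree.predict_proba(X_new) for tree in forest.estimators_], axis=0)
forest_proba = forest.predict_proba(X_new)
print("Averaged over trees:\n", tree_average)
print("Forest probabilities:\n", forest_proba)

# The predicted class is the one with the highest averaged probability
pred_from_proba = forest.classes_[np.argmax(forest_proba, axis=1)]
print("Predicted classes:", forest.predict(X_new), pred_from_proba)
```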