# Other interesting features

Scikit-fingerprints contains many other features, large and small. We cover three of them here: working with peptides, checking the applicability domain (AD), and writing custom fingerprints.

## Peptides

Scikit-fingerprints makes it easy to work with peptide data. These small biologics are of great interest as therapeutics, particularly given the success of semaglutide.

RDKit can read natural amino acid sequences out-of-the-box and parse them as `Mol` objects. We used this capability [in our publication](https://arxiv.org/abs/2501.17901) to show that molecular fingerprints are simple, yet very effective tools for peptide property prediction.
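As a minimal sketch of this parsing step, the snippet below uses RDKit's `Chem.MolFromSequence` to turn a sequence of one-letter amino acid codes into a `Mol`. The example sequence is arbitrary and chosen only for illustration; it is not taken from the benchmark.

```python
from rdkit import Chem

# arbitrary short peptide, used only for illustration
sequence = "GAVL"

# parse one-letter amino acid codes into an RDKit molecule
mol = Chem.MolFromSequence(sequence)

print(f"Heavy atoms: {mol.GetNumAtoms()}")
print(f"SMILES: {Chem.MolToSmiles(mol)}")
```

Once a peptide is a regular `Mol` object, any fingerprint that accepts molecules can be computed for it.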

Here, we will use the antimicrobial peptides (AMPs) [benchmark by Xu et al.](https://academic.oup.com/bib/article/22/5/bbab083/6189771) (also known as XuAMP). The goal is to predict whether a peptide shows general antimicrobial effect or not.

In [None]:
import numpy as np
from Bio import SeqIO


def read_seqs_from_fasta(path: str) -> list[str]:
    return [str(record.seq) for record in SeqIO.parse(path, "fasta")]


train_pos = read_seqs_from_fasta("data/Xu_AMP/train_positive.fasta")
train_neg = read_seqs_from_fasta("data/Xu_AMP/train_negative.fasta")
test_pos = read_seqs_from_fasta("data/Xu_AMP/test_positive.fasta")
test_neg = read_seqs_from_fasta("data/Xu_AMP/test_negative.fasta")

labels_train = np.array([1] * len(train_pos) + [0] * len(train_neg))
labels_test = np.array([1] * len(test_pos) + [0] * len(test_neg))

seqs_train = train_pos + train_neg
seqs_test = test_pos + test_neg

In [None]:
print(f"Sample sequence: {seqs_train[0]}")
print()
print(f"Training peptides: {len(seqs_train)}")
print(f"Test peptides: {len(seqs_test)}")

**Exercise 1**

Transform amino acid sequences into molecules using `MolFromAminoseqTransformer`. This is very similar to `MolFromSmilesTransformer`. See the [documentation](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.preprocessing.MolFromAminoseqTransformer.html) if necessary.

Using multiple processes here is very useful, since peptides are larger molecules and the dataset is quite large.

**Exercise 2**

Let's test a few fingerprints for the peptide function prediction task:

1. Try ECFP, Topological Torsion, and RDKit fingerprints ([documentation](https://scikit-fingerprints.readthedocs.io/latest/modules/fingerprints.html)). Compare their binary and count variants. Remember to parallelize fingerprints with `n_jobs`.
2. Use Random Forest classifier. Using `n_jobs` and setting `random_state=0` will be useful.
3. Evaluate using AUROC.

This yields 6 scores, allowing a comparison of the fingerprints themselves and of their binary vs. count variants.
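To make the metric concrete, here is a small NumPy sketch of what AUROC measures: the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The helper name `auroc_pairwise` and the toy labels and scores are hypothetical. In the exercise itself you would typically pass predicted probabilities to scikit-learn's `roc_auc_score` instead.

```python
import numpy as np


def auroc_pairwise(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]

    # compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]

    # correctly ranked pairs count as 1, ties as 0.5
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# toy labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(f"AUROC: {auroc_pairwise(y_true, y_score):.2f}")
```

The pairwise form is quadratic in the number of samples, so it is only meant to explain the metric, not to replace a library implementation.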

## Applicability domain checking

Applicability domain (AD) is the area of chemical space spanned by the training dataset, also known as in-domain samples in machine learning. Predictions for out-of-distribution molecules, outside this AD, are inherently less reliable and should be trusted less. Thus, falling inside or outside of the AD is a basic indicator of prediction certainty.

There are various algorithms to check this, based on descriptor space geometry, variance analysis and feature reduction, or ensemble classifier probability distribution, among others. See e.g. [this paper](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0230-2) for an overview.

As our main training dataset we will use the [BACE dataset from the MoleculeNet benchmark](https://scikit-fingerprints.readthedocs.io/latest/modules/datasets/generated/skfp.datasets.moleculenet.load_bace.html), where we need to predict the ability of molecules to inhibit the beta-secretase 1 enzyme. This enzyme is suspected to be related to the development of Alzheimer's disease.

Further, we will use the [ApisTox dataset](http://doi.org/10.1038/s41597-024-04232-w) to provide pesticide compounds, which in most cases are clearly outside the domain of the BACE dataset.

In [None]:
from skfp.datasets.moleculenet import load_bace
import pandas as pd

bace_smiles, _ = load_bace()

apistox_smiles = pd.read_csv("data/ApisTox/dataset_final.csv")["SMILES"]

In [None]:
from skfp.fingerprints import ECFPFingerprint

ecfp = ECFPFingerprint(n_jobs=-1, count=True)

X_bace = ecfp.transform(bace_smiles)
X_apistox = ecfp.transform(apistox_smiles)

Let's check the internal diversity of those datasets, using the average Tanimoto similarity between molecules. We also used this technique [in the ApisTox follow-up paper](https://arxiv.org/abs/2503.24305).

In [None]:
import numpy as np
from skfp.distances import bulk_tanimoto_count_similarity


bace_self_sim = bulk_tanimoto_count_similarity(X_bace)
apistox_self_sim = bulk_tanimoto_count_similarity(X_apistox)
cross_dataset_sim = bulk_tanimoto_count_similarity(X_bace, X_apistox)

# select upper triangle of similarity matrix for self-similarity
bace_self_mean = bace_self_sim[np.triu_indices_from(bace_self_sim, k=1)].mean()
apistox_self_mean = apistox_self_sim[np.triu_indices_from(apistox_self_sim, k=1)].mean()
cross_dataset_mean = cross_dataset_sim.mean()

print(f"Avg BACE Tanimoto self-similarity: {bace_self_mean:.3f}")
print(f"Avg ApisTox Tanimoto self-similarity: {apistox_self_mean:.3f}")
print(f"Avg BACE-ApisTox Tanimoto similarity: {cross_dataset_mean:.3f}")
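For intuition about the numbers above, the sketch below computes a count-based Tanimoto similarity by hand on toy vectors. It assumes the dot-product form, `a·b / (|a|² + |b|² - a·b)`, which is one common generalization to count vectors; the library function may differ in implementation details. The helper names and the toy matrix are hypothetical.

```python
import numpy as np


def tanimoto_count(a: np.ndarray, b: np.ndarray) -> float:
    """Dot-product form of Tanimoto similarity for count vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = float(a @ b)
    denom = float(a @ a + b @ b - dot)
    # convention assumed here: two all-zero vectors are treated as identical
    return 1.0 if denom == 0 else dot / denom


def mean_pairwise_similarity(X: np.ndarray) -> float:
    """Average similarity over all unordered pairs of distinct rows."""
    n = len(X)
    sims = np.array(
        [[tanimoto_count(X[i], X[j]) for j in range(n)] for i in range(n)]
    )
    # upper triangle without the diagonal: each pair once, no self-pairs
    return float(sims[np.triu_indices(n, k=1)].mean())


# toy count fingerprints: two identical rows and one disjoint row
X_toy = np.array([[1, 2, 0], [1, 2, 0], [0, 0, 3]])
print(f"Mean pairwise similarity: {mean_pairwise_similarity(X_toy):.3f}")
```

Excluding the diagonal matters for self-similarity, since every molecule has similarity 1 to itself and would inflate the mean.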

Let's use the [bounding box method](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.applicability_domain.BoundingBoxADChecker.html) for checking the applicability domain. It simply checks whether each feature value falls inside the min-max range seen in the training data, which amounts to univariate outlier detection. Note that many AD checks, including this one, do not work well on fingerprints, so we will use continuous RDKit descriptors instead.

AD checkers, in contrast to fingerprints, need to be fitted to training data with the `.fit()` method. Then we can use the `.predict()` method to get binary outside/inside AD predictions (inside = class 1). Some algorithms also implement an interpretable `.score_samples()` method, which returns a continuous score of how far a molecule is from the applicability domain.

In [None]:
from skfp.fingerprints import RDKit2DDescriptorsFingerprint
from sklearn.impute import SimpleImputer


fp = RDKit2DDescriptorsFingerprint(n_jobs=-1)
X_bace = fp.transform(bace_smiles)
X_apistox = fp.transform(apistox_smiles)

# some RDKit descriptors may have missing values in some cases
imputer = SimpleImputer()
X_bace = imputer.fit_transform(X_bace)
X_apistox = imputer.transform(X_apistox)

In [None]:
from skfp.applicability_domain import BoundingBoxADChecker


ad_checker = BoundingBoxADChecker()
ad_checker.fit(X_bace)

ad_preds = ad_checker.predict(X_apistox)

num_inside_ad = ad_preds.sum()
num_outside_ad = (~ad_preds).sum()

print(f"ApisTox size: {len(X_apistox)}")
print(f"Num inside AD: {num_inside_ad}")
print(f"Num outside AD: {num_outside_ad}")
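The idea behind the bounding box check can be sketched in a few lines of NumPy. The class below is a hypothetical toy, not the library implementation: `ToyBoundingBoxChecker`, its attributes, and its scoring rule (fraction of features inside the training range) are all assumptions made for illustration.

```python
import numpy as np


class ToyBoundingBoxChecker:
    """Hypothetical min-max range check, for illustration only."""

    def fit(self, X: np.ndarray) -> "ToyBoundingBoxChecker":
        X = np.asarray(X, dtype=float)
        # remember the per-feature range seen in training data
        self.lower_ = X.min(axis=0)
        self.upper_ = X.max(axis=0)
        return self

    def _inside_mask(self, X: np.ndarray) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        return (X >= self.lower_) & (X <= self.upper_)

    def predict(self, X: np.ndarray) -> np.ndarray:
        # inside AD only if every feature is within its training range
        return self._inside_mask(X).all(axis=1)

    def score_samples(self, X: np.ndarray) -> np.ndarray:
        # assumed score: fraction of features within the training range
        return self._inside_mask(X).mean(axis=1)


X_train = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
X_new = np.array([[1.5, 15.0], [3.0, 15.0], [-1.0, 40.0]])

checker = ToyBoundingBoxChecker().fit(X_train)
print(checker.predict(X_new))
print(checker.score_samples(X_new))
```

Because a single out-of-range feature is enough to reject a sample, this check gets stricter as the number of features grows.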

## Customizing scikit-fingerprints

To see how customizable scikit-fingerprints is, we will create two custom fingerprints.

**SMARTS example**

First, we will create `CustomSMARTSFingerprint`, a fingerprint that checks for occurrences of given SMARTS patterns.

In [None]:
import numpy as np

from skfp.bases import BaseSubstructureFingerprint


class CustomSMARTSFingerprint(BaseSubstructureFingerprint):
    def __init__(
        self,
        # arguments required for compatibility with parent class
        # you could ignore those, but they provide full functionality
        count: bool = False,
        sparse: bool = False,
        n_jobs: int | None = None,
        batch_size: int | None = None,
        verbose: int | dict = 0,
    ):
        # SMARTS patterns for our fingerprint, subset of PubChem fingerprint
        smarts_patterns = [
            "[#6]-,:[#6]-,:[#6]#[#6]",
            "[#8]-,:[#6]-,:[#6]=,:[#7]",
            "[#8]-,:[#6]-,:[#6]=,:[#8]",
            "[#7]:[#6]-,:[#16&!H0]",
            "[#7]-,:[#6]-,:[#6]=,:[#6]",
            "[#8]=,:[#16]-,:[#6]-,:[#6]",
            "[#7]#[#6]-,:[#6]=,:[#6]",
            "[#6]=,:[#7]-,:[#7]-,:[#6]",
            "[#8]=,:[#16]-,:[#6]-,:[#7]",
            "[#16]-,:[#16]-,:[#6]:[#6]",
            "[#6]:[#6]-,:[#6]=,:[#6]",
        ]
        self._feature_names = smarts_patterns

        # passing patterns to parent class is the only required element
        super().__init__(
            patterns=smarts_patterns,
            count=count,
            sparse=sparse,
            n_jobs=n_jobs,
            batch_size=batch_size,
            verbose=verbose,
        )

    # you should implement this method if you can, but it is not necessary
    def get_feature_names_out(self, input_features=None) -> np.ndarray:
        return np.array(self._feature_names, dtype=object)


In [None]:
custom_smarts = CustomSMARTSFingerprint(n_jobs=-1, count=True)

X = custom_smarts.transform(apistox_smiles)
X
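Conceptually, each feature of such a fingerprint corresponds to one substructure search. The sketch below shows this for a single pattern using plain RDKit calls (`Chem.MolFromSmarts` and `Mol.GetSubstructMatches`). The hydroxyl pattern, the example molecules, and the helper `count_matches` are illustrative assumptions, and the base class may compute matches differently internally.

```python
from rdkit import Chem

# hydroxyl group pattern, chosen only for illustration
pattern = Chem.MolFromSmarts("[OX2H]")


def count_matches(smiles: str, pattern: Chem.Mol) -> int:
    """Number of unique substructure matches of the pattern in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return len(mol.GetSubstructMatches(pattern))


for smiles in ["CCO", "OCCO", "CCC"]:
    n_matches = count_matches(smiles, pattern)
    # count variant keeps the number, binary variant only marks presence
    print(f"{smiles}: count={n_matches}, binary={int(n_matches > 0)}")
```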

**General fingerprint example**

In the general case, e.g. when a fingerprint includes more complex checks or features, we need to implement the `._calculate_fingerprint()` method.

In `CustomAtomTypeFingerprint`, we will implement a simple fingerprint based on counting atom element types.

In [None]:
from collections.abc import Sequence

from rdkit.Chem import Mol
from scipy.sparse import csr_array
from skfp.bases import BaseFingerprintTransformer
from skfp.utils import ensure_mols


class CustomAtomTypeFingerprint(BaseFingerprintTransformer):
    def __init__(
        self,
        count: bool = False,
        sparse: bool = False,
        n_jobs: int | None = None,
        batch_size: int | None = None,
        verbose: int | dict = 0,
    ):
        super().__init__(
            n_features_out=10,
            count=count,
            sparse=sparse,
            n_jobs=n_jobs,
            batch_size=batch_size,
            verbose=verbose,
        )

        # select some atom types
        self.atom_types = ["C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "OTHER"]

        # utility for converting atom to feature list index
        self.atom_to_idx = {atom: i for i, atom in enumerate(self.atom_types)}

    # for simplicity, we omit `.get_feature_names_out()` this time

    # `_calculate_fingerprint()` must be implemented in every custom fingerprint
    def _calculate_fingerprint(self, X: Sequence[str | Mol]) -> np.ndarray | csr_array:
        # make sure we have proper Mol inputs
        X = ensure_mols(X)

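        # compute per-molecule atom counts
        X = np.vstack([self._get_per_molecule_atom_counts(mol) for mol in X])

        # binary variant only marks presence, so binarize unless `count=True`
        if not self.count:
            X = (X > 0).astype(int)

        # return dense or sparse outputs
        return csr_array(X) if self.sparse else X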

    def _get_per_molecule_atom_counts(self, mol: Mol) -> np.ndarray:
        counts = np.zeros(len(self.atom_types), dtype=int)
        for atom in mol.GetAtoms():
            symbol = atom.GetSymbol()
            if symbol in self.atom_to_idx:
                counts[self.atom_to_idx[symbol]] += 1
            else:
                counts[self.atom_to_idx["OTHER"]] += 1
        
        return counts


In [None]:
custom_atom_type = CustomAtomTypeFingerprint(n_jobs=-1, count=True)

X = custom_atom_type.transform(apistox_smiles)
X