
CLMeval

CLMeval is an R package that enables the holistic evaluation of chemical language models (that is, recurrent neural network-based models of textual representations of molecules). This is achieved by integrating multiple orthogonal sources of information about model performance via principal component analysis.

The main function of the CLMeval package, calculate_PC1, allows the user to evaluate a chemical language model by integrating five orthogonal metrics of model performance. This is accomplished by principal component analysis of a dataset in which the major dimension of variance is model performance (that is, models segregate along the first principal component based on their ability to match the chemical space of the training set). The function performs PCA on a reference matrix of chemical outcomes, then uses the base R predict function to project a model of interest onto the same principal components.
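The fit-then-project approach described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the reference matrix here is simulated, the object names (ref, pca, new_model) are hypothetical, and centering and scaling the metrics before PCA is an assumption.

```r
# Simulated stand-in for a reference matrix: rows are models, columns are the
# five metrics. These values are made up; they are NOT the CLMeval reference data.
set.seed(1)
n <- 50
quality <- runif(n)  # latent "model performance" that drives all five metrics
ref <- data.frame(
  pct_valid         = 0.5 + 0.4 * quality + rnorm(n, sd = 0.02),
  FCD               = 10 * (1 - quality) + rnorm(n, sd = 0.5),
  JSD_stereocenters = 0.5 * (1 - quality) + rnorm(n, sd = 0.02),
  JSD_murcko        = 0.5 * (1 - quality) + rnorm(n, sd = 0.02),
  JSD_NP            = 0.5 * (1 - quality) + rnorm(n, sd = 0.02)
)

# Fit PCA on the reference matrix (centering/scaling is an assumption here)
pca <- prcomp(ref, center = TRUE, scale. = TRUE)

# Project a new model onto the same principal components with predict()
new_model <- data.frame(pct_valid = 0.85, FCD = 1.5, JSD_stereocenters = 0.08,
                        JSD_murcko = 0.07, JSD_NP = 0.06)
new_pc1 <- predict(pca, newdata = new_model)[, "PC1"]
```

Because predict() reuses the centering, scaling, and rotation stored in the fitted prcomp object, the new model's score is directly comparable to the reference models' scores in pca$x. Note that the sign of a principal component is arbitrary, so which direction of PC1 means "better" has to be checked against the reference models.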

Installation

Install CLMeval with devtools:

devtools::install_github("skinnider/CLMeval")

Usage

The main function of the CLMeval package, calculate_PC1, takes as input five metrics that have been calculated for a generative model of chemical structures. These metrics were chosen because they were found to be robustly correlated with the number of molecules in the training set across a series of benchmarking analyses. The five metrics are as follows:

the proportion of valid molecules generated by the trained model;
the Frechet ChemNet distance to the training set;
the Jensen-Shannon distance between the number of stereocenters in molecules sampled from the trained model vs. the training set;
the Jensen-Shannon distance between the frequency distribution of Murcko scaffolds within molecules sampled from the trained model vs. the training set;
the Jensen-Shannon distance between the natural product-likeness scores of molecules sampled from the trained model vs. the training set.
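Three of the metrics above are Jensen-Shannon distances between a property distribution in generated molecules and the same distribution in the training set. As a rough sketch of how such a distance can be computed for a discrete property such as the number of stereocenters, the helper below (js_distance, a hypothetical name that is not part of CLMeval) takes two count vectors over the same categories. It assumes the distance is the square root of the Jensen-Shannon divergence with base-2 logarithms; the source does not state which convention was used, and continuous properties such as natural product-likeness scores would first need to be binned or otherwise discretized.

```r
# Jensen-Shannon distance between two discrete distributions given as counts
# or probabilities over the same categories (hypothetical helper).
js_distance <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  m <- (p + q) / 2
  # Kullback-Leibler divergence in bits; terms with a == 0 contribute 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  sqrt((kl(p, m) + kl(q, m)) / 2)
}

# Made-up stereocenter counts for a handful of molecules
train_counts  <- c(0, 0, 1, 1, 2, 3)
sample_counts <- c(0, 1, 1, 2, 2, 4)

# Tabulate both vectors over a shared set of levels so the bins line up
lv <- 0:max(train_counts, sample_counts)
p <- as.numeric(table(factor(train_counts, levels = lv)))
q <- as.numeric(table(factor(sample_counts, levels = lv)))
jsd_stereo <- js_distance(p, q)
```

With base-2 logarithms the distance is 0 for identical distributions and 1 for distributions with no overlap.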

The reference matrix used to perform PCA contains metrics for a total of 440 chemical language models. These were obtained by training recurrent neural network-based models on SMILES strings from the ChEMBL, COCONUT, GDB, and ZINC databases. The number of molecules sampled from each database varied between 1,000 and 500,000, in eleven increments, and ten random samples of each size were drawn from each database (4 databases × 11 sizes × 10 samples = 440 models). For further details, see the reference documentation.
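The total of 440 follows from crossing four databases, eleven sample sizes, and ten random samples per size. A small sketch of that design grid is shown below; the eleven sizes are not listed here, so they are represented only by an index, and the column names are hypothetical.

```r
# One row per reference model: database x sample-size index x replicate
design <- expand.grid(
  database   = c("ChEMBL", "COCONUT", "GDB", "ZINC"),
  size_index = 1:11,  # eleven sizes between 1,000 and 500,000 (values not listed here)
  replicate  = 1:10   # ten random samples of each size
)
n_models <- nrow(design)
```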

To obtain the PC1 score for a model of interest, the calculate_PC1 function can be used as follows:

library(CLMeval)
# each argument is one of the five metrics above, computed beforehand for the model of interest
pc1 <- calculate_PC1(pct_valid, FCD, JSD_stereocenters, JSD_murcko, JSD_NP)

