reymond-group/Coconut-TMAP-SVM

Chemical Space Map and Machine Learning Classification of Natural Products in the COCONUT Database

Figure: The COCONUT microbial and plant natural products MAP4 TMAP, colored by plant (green), fungal (blue), or bacterial (orange) origin.

After cloning the repo:

  • cat data/COCONUT_DB.sdf.gz.part-* | gunzip -d -c > data/COCONUT_DB.sdf
  • cat data/MAP4-SVM-coconut.all.pkl.gz-part-* | gunzip -d -c > data/MAP4-SVM-coconut.all.pkl
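On systems without `cat` and `gunzip`, the same reassembly can be sketched in Python with the standard library only. The function name `reassemble` and the temporary-file suffix are illustrative choices, not part of the repository; the sketch assumes the parts sort lexicographically into the right order, as the shell glob above also assumes.

```python
import glob
import gzip
import os
import shutil


def reassemble(part_pattern: str, out_path: str) -> list:
    """Join split gzip parts in sorted order, then decompress the result."""
    parts = sorted(glob.glob(part_pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern!r}")

    # Hypothetical temporary name for the rejoined archive.
    joined_path = out_path + ".joined.tmp"
    with open(joined_path, "wb") as joined:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, joined)

    # Stream-decompress so the full file never has to fit in memory.
    with gzip.open(joined_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    os.remove(joined_path)
    return parts
```

For example, `reassemble("data/COCONUT_DB.sdf.gz.part-*", "data/COCONUT_DB.sdf")` would correspond to the first command above.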

Jupyter Notebook Description:

1. Properties Calculation

The February 2021 version of the COCONUT database was downloaded. The 60,171 COCONUT entries with a publication source and annotated as fungal, bacterial, or plant NPs were extracted. The numbers of carbon, oxygen, and nitrogen atoms, the total number of atoms, and the number of bonds were extracted from the database. MW, fraction of sp3 carbons, hydrogen bond donor (HBD) and acceptor (HBA) counts, logP calculated with the Crippen method (AlogP), and topological polar surface area (TPSA) were calculated using RDKit. Glycosylated and/or peptidic structures were identified using Daylight SMARTS patterns. Molecules that violated more than one Lipinski rule were labeled as non-Lipinski. The MAP4 fingerprint was calculated in 1024 dimensions.
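The non-Lipinski labeling can be illustrated with a small sketch. It assumes the usual rule-of-five thresholds (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10) applied to the descriptors listed above; the function names and the "Lipinski" label string are hypothetical, and the notebook may compute the rules differently.

```python
def lipinski_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count how many rule-of-five criteria a molecule violates."""
    return sum([
        mw > 500,    # molecular weight
        logp > 5,    # calculated logP (AlogP here, by assumption)
        hbd > 5,     # hydrogen bond donors
        hba > 10,    # hydrogen bond acceptors
    ])


def lipinski_label(mw: float, logp: float, hbd: int, hba: int) -> str:
    """Label a molecule non-Lipinski when more than one rule is violated."""
    if lipinski_violations(mw, logp, hbd, hba) > 1:
        return "non-Lipinski"
    return "Lipinski"
```

A single violation (for example, MW of 600 with all other values in range) therefore still counts as Lipinski under this rule.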

2. MAP4 SVM Classifiers

The entries of the COCONUT subset were assigned to the training or test set with a 50% random split. The SVM was trained on the MAP4 fingerprints of the training set using a custom kernel.
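A 50% random split of this kind could look like the following sketch. The function name, the fixed seed, and the index-based return values are assumptions made for illustration; the notebook may split the data differently.

```python
import random


def random_split(n_items: int, train_fraction: float = 0.5, seed: int = 0):
    """Shuffle item indices and cut them into training and test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an assumption, for reproducibility
    cut = int(n_items * train_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])
```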

Please note that when using MAP4 for machine learning a custom kernel (or a custom loss function) is needed because the similarity/dissimilarity between two MinHashed fingerprints cannot be assessed with "standard" Jaccard, Manhattan, or Cosine functions. In fact, due to MinHashing, the order of the features matters, and the distance cannot be calculated "feature-wise". There is a well-written blog post that explains it.

Custom Kernels

  • The custom kernel implemented for the SVM models calculates the similarity matrix between two lists of MinHashed fingerprints. The similarity of fingerprint a and fingerprint b is calculated by (1) counting the elements that have the same value at the same index in a and b, and (2) dividing that count by the number of elements of fingerprint a.
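The kernel described above can be sketched with NumPy as follows. This is a minimal illustration of the stated definition, not necessarily the repository's implementation; the name `minhash_kernel` is hypothetical, and all fingerprints are assumed to have the same length.

```python
import numpy as np


def minhash_kernel(A, B):
    """Similarity matrix between two lists of MinHashed fingerprints.

    Entry (i, j) is the fraction of positions at which A[i] and B[j]
    hold the same value.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    K = np.empty((len(A), len(B)))
    for i, a in enumerate(A):
        # Compare a against every row of B position by position;
        # the row-wise mean is (number of matches) / (fingerprint length).
        K[i] = (B == a).mean(axis=1)
    return K


# The same values in a different order are NOT treated as similar:
# [1, 2, 3, 4] vs [1, 2, 4, 3] match at only two of four positions.
print(minhash_kernel([[1, 2, 3, 4]], [[1, 2, 4, 3]]))
```

scikit-learn's `SVC` accepts a callable as its `kernel` argument, so a function with this shape could be passed as `SVC(kernel=minhash_kernel)`. The row-wise loop keeps memory use proportional to one row of the matrix at a time.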

The class weights were inversely proportional to the class frequency, and the hyperparameter C was optimized using 5-fold cross-validation. During the hyperparameter optimization, 20% of the training set was left out as a validation set, and the balanced accuracy on the validation set was maximized. C was optimized among the values 0.1, 1, 10, 100, and 1000, resulting in C = 1.
The classifier was implemented using scikit-learn with the “one versus rest” strategy. After the evaluation process, a second version of the MAP4 SVM classifier was trained using both the training and test sets to learn from all 60 thousand curated data points. This version of the MAP4 SVM classifier can be used here.
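The two quantities used during optimization can be made concrete with a short sketch. The weight normalization below mirrors scikit-learn's `class_weight="balanced"` heuristic (n_samples / (n_classes × class count)); whether the notebooks use exactly that option is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recall, so each class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[cls] / totals[cls] for cls in totals) / len(totals)
```

With these definitions, rare classes receive larger weights, and a classifier that only predicts the majority class scores poorly on balanced accuracy even if its plain accuracy is high.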

3. The COCONUT microbial and plants natural products MAP4 TMAP

Using the indices generated by the MinHashing procedure of the MAP4 calculation, an LSH forest was generated and used to lay out the TMAP. The resulting TMAP can be found here.

4. MAP4, ECFP4, RDKit AP and properties SVMs comparison

The MAP4, ECFP4, and RDKit AP fingerprints and a set of 11 properties (MW, fraction of sp3 carbons, HBD and HBA counts, AlogP, numbers of carbon, oxygen, and nitrogen atoms, total number of atoms, number of bonds, and TPSA) were used to train four different SVM classifiers in a 5-fold cross-validation. For all classifiers, the class weights were inversely proportional to the class frequency, and the hyperparameters were optimized using 10% of the available data to maximize the balanced accuracy (Table 4). For the properties SVM, the 11 values were scaled to zero mean and unit variance. All classifiers were implemented using scikit-learn with the “one versus rest” strategy.
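Scaling to zero mean and unit variance can be sketched as below. Fitting the statistics on the training fold only and reusing them for the test fold is an assumption about the procedure, not something stated above; the function name `standardize` is hypothetical. The population standard deviation is used, as in scikit-learn's `StandardScaler`.

```python
import numpy as np


def standardize(train, test):
    """Scale each property column using the training mean and standard deviation."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std
```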

5. Test and use the MAP4 classifier in Python

Required environment installation:

  • conda env create -f environment.yml
  • conda activate aipep
