tblaschke/reinvent-multi-target

Data set and source code for generative multi-target compound modeling

This repository holds the data sets and the code used for fine-tuning and sampling of multi-target compounds using REINVENT.

The scripts and folders are the following:

  1. The extracted PubChem data sets can be found in the ./data folder.
  2. The Python notebooks and scripts contain all the code used for transfer learning, sampling, and the analysis of the sampled compounds.
  3. ./reinvent folder: Contains a modified version of REINVENT.
  4. ./sampled folder: Contains all configurations used for sampling of the multi-target compounds. The configuration files are needed to reproduce the sampling results.

Data Set description

We provide a data set with 2,809 multi-target (MT), 61,928 single-target (ST), and 295,395 inactive (no-target, NT) compounds. The compounds were extracted from PubChem assays using the following criteria:

  1. Assays for individual human targets
  2. Only qualitative activity annotations (‘active’ or ‘inactive’) were considered
  3. Inconsistently annotated or revoked assays were disregarded, as were assays imported from other databases (for external assays, negative test results were mostly missing)
  4. Assays with an unusually high hit rate (> 2%) were eliminated
  5. Screening compounds with aggregation or other assay interference (artifact) potential were discarded
  6. Compounds were categorized into three groups (see the sketch after this list):
     - MT: screening compounds with activity against five or more targets.
     - ST: activity against only one target and confirmed inactivity against at least four other targets.
     - NT: no reported activity, but confirmed inactivity against at least five targets.
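
The assignment rule in point 6 can be summarized in a few lines of Python. This is only an illustrative sketch, assuming per-compound counts of confirmed active and inactive targets are available; the function and variable names are ours and are not part of the repository's code.

def categorize_compound(n_active_targets, n_inactive_targets):
    # n_active_targets / n_inactive_targets are assumed per-compound counts of
    # targets with confirmed 'active' / 'inactive' annotations (hypothetical names).
    if n_active_targets >= 5:
        return "MT"  # multi-target: active against five or more targets
    if n_active_targets == 1 and n_inactive_targets >= 4:
        return "ST"  # single-target: one activity, inactive against >= 4 other targets
    if n_active_targets == 0 and n_inactive_targets >= 5:
        return "NT"  # no-target: no reported activity, inactive against >= 5 targets
    return None      # compound does not fall into any of the three categories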

The collection of all extracted compounds can be found in ./data/all_extracted_pubchem_mt_st_nt_compounds.tsv.gz.

./data/pubchem_assay_compounds_processed.tsv contains all MT and ST compounds and a random subset of the NT compounds. This file was used for the analysis of whether generative multi-target compound modeling was successful. The SMILES used for fine-tuning of REINVENT can be found in ./data/pubchem_assay_compounds_processed_training.smi.
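
For orientation, the fine-tuning SMILES can be read with a few lines of Python. This sketch assumes the usual .smi convention of one SMILES per line, optionally followed by an identifier; adjust if the file is formatted differently.

with open("./data/pubchem_assay_compounds_processed_training.smi") as fh:
    training_smiles = [line.split()[0] for line in fh if line.strip()]
print(len(training_smiles), "SMILES read for fine-tuning")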

Differences between the included REINVENT and the upstream REINVENT

We had to modify the publicly available version of REINVENT to be able to perform this analysis. The following changes were made and will be proposed to the upstream REINVENT repository:

  1. Bug fixes in the NLL calculation inside REINVENT. The upstream version calculates an incorrect NLL for the longest SMILES in a batch.
  2. Added support for random seeds. We had to modify the configuration parser to allow for an additional parameter and added a helper function to set the seed.
  3. Made logging of SMILES sampling optional. Sampling many SMILES (200M) was too slow because the TensorBoard logger was creating too much disk I/O.
  4. Skipped gradient computation when evaluating NLLs. REINVENT was always evaluating the gradient when computing the NLL for SMILES (a sketch of points 2 and 4 follows this list).
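
As an illustration of points 2 and 4, a minimal PyTorch-style sketch is shown below. The names (set_seed, evaluate_nll, model.likelihood) are illustrative placeholders and do not correspond to the actual API of the modified REINVENT.

import random
import numpy as np
import torch

def set_seed(seed):
    # Point 2: fix all relevant random number generators so that sampling is reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def evaluate_nll(model, batch):
    # Point 4: NLL evaluation does not need gradients; disabling autograd avoids
    # building the computation graph and saves time and memory during sampling.
    # `model.likelihood` is a placeholder, not the actual method name in REINVENT.
    with torch.no_grad():
        return model.likelihood(batch)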

General Code Usage

The repository includes a Conda environment.yml file to create an Anaconda environment with all the required software dependencies.

git clone https://github.com/tblaschke/reinvent-multi-target
cd reinvent-multi-target
conda env create -f environment.yml
conda activate reinvent_shared.v2.1

From here, start a Jupyter notebook server and check out the notebooks and their descriptions.

Generated Molecules and Fine-Tuned Models

Due to file size restrictions on GitHub, we had to deposit the fine-tuned models and the 200M sampled compounds, including their fingerprints and descriptors, on Zenodo (https://zenodo.org/record/4594647/).

You can download the files and unpack them in this repository by executing:

cd reinvent-multi-target
curl -O https://zenodo.org/record/4594647/files/fine_tuned_models.tar.gz
curl -O https://zenodo.org/record/4594647/files/fingerprints_and_descriptors.tar.gz
curl -O https://zenodo.org/record/4594647/files/sampled_and_processed_multi_target_compounds.tar.gz

tar -xzvf fine_tuned_models.tar.gz
tar -xzvf fingerprints_and_descriptors.tar.gz
tar -xzvf sampled_and_processed_multi_target_compounds.tar.gz

rm  fine_tuned_models.tar.gz fingerprints_and_descriptors.tar.gz sampled_and_processed_multi_target_compounds.tar.gz

Support

If you have any questions, please feel free to open an issue on GitHub or send an email to thomas.blaschke@uni-bonn.de.
