# Predicting the Solubility of Small Molecules using DeepChem

## Overview

This tutorial provides a practical introduction to predicting small molecule solubility using DeepChem, an open-source toolkit that combines cheminformatics with machine learning. We'll walk through the essential steps of building predictive models, starting with how to represent molecular structures in formats suitable for machine learning algorithms.

## Learning Objectives

- Learn how to use DeepChem for molecular property prediction
- Understand how to process and prepare molecular data for machine learning
- Build and train deep learning models for solubility prediction
- Evaluate model performance on chemical datasets

### Tasks to complete

- Load and process molecular data
- Build a DeepChem model
- Train the model on solubility data
- Evaluate predictions and model performance

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts

## Get Started

- Select the "conda_tensorflow2_p310" kernel in the SageMaker JupyterLab notebook.

### Import necessary libraries

Note that you will likely get some warnings about missing dependencies and removed features. This is expected, since this tutorial does not use the full capabilities of DeepChem.

In [1]:
# Install the pre-release version of the deepchem library with TensorFlow support using pip.
# The package spec is quoted so the shell does not treat the square brackets as a glob pattern.
%pip install --pre "deepchem[tensorflow]"

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import the DeepChem library, which provides tools for deep learning in chemistry and drug discovery.
import deepchem as dc

# Import the 'warnings' module to manage warning messages during code execution.
import warnings
# Filter out and ignore all warning messages that might be generated during the execution of the code.
warnings.filterwarnings('ignore')

# Print the version of the DeepChem library that is currently installed.
dc.__version__

## Training a Model with DeepChem

Deep learning can be used to solve many sorts of problems, but the basic workflow is usually the same.  Here are the typical steps you follow.

1. Select the data set you will train your model on (or create a new data set if there isn't an existing suitable one).
2. Create the model.
3. Train the model on the data.
4. Evaluate the model on an independent test set to see how well it works.
5. Use the model to make predictions about new data.

With DeepChem, each of these steps can be as little as one or two lines of Python code.  In this tutorial we will walk through a basic example showing the complete workflow to solve a real world scientific problem.

The problem we will solve is predicting the solubility of small molecules from their chemical structures. Solubility is a very important property in drug development: if a proposed drug isn't soluble enough, you probably won't be able to get enough of it into the patient's bloodstream to have a therapeutic effect.

The first thing we need is a data set of measured solubilities for real molecules. One of the core components of DeepChem is MoleculeNet, a diverse collection of chemical and molecular data sets. For this tutorial, we can use the Delaney solubility data set. Solubility in this data set is reported as log(solubility), where solubility is measured in moles/liter.
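
To make the log scale concrete, here is a minimal sketch of converting a log(solubility) value back to a concentration. It assumes the logarithm is base 10, and the example value and molar mass are hypothetical numbers chosen for illustration, not entries from the data set.

```python
def log_solubility_to_molar(log_s: float) -> float:
    """Convert a base-10 log(solubility) value to mol/L (base 10 is an assumption)."""
    return 10.0 ** log_s


def molar_to_grams_per_liter(molar: float, molar_mass: float) -> float:
    """Convert a concentration in mol/L to g/L, given a molar mass in g/mol."""
    return molar * molar_mass


# Hypothetical values for illustration only.
example_log_s = -2.0          # log(solubility)
example_molar_mass = 150.0    # g/mol

molar = log_solubility_to_molar(example_log_s)
grams = molar_to_grams_per_liter(molar, example_molar_mass)
print(f"{molar:.4f} mol/L is about {grams:.2f} g/L")
```

Because the scale is logarithmic, a difference of one unit corresponds to a tenfold change in concentration.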

In [None]:
# The Delaney (ESOL) dataset is a regression dataset containing structures and
# water solubility data for 1128 compounds. The dataset is widely used to
# validate machine learning models on estimating solubility directly from
# molecular structures (as encoded in SMILES strings).

# Load the Delaney dataset using DeepChem's molnet module.
#   - tasks: List of tasks in the dataset (in this case, solubility prediction).
#   - datasets: Tuple containing training, validation, and test datasets.
#   - transformers: List of transformers used for data preprocessing (used later when evaluating the model).
#   - featurizer="GraphConv": Specifies that the 'GraphConv' featurizer should be used to convert molecular SMILES strings into graph-based features.
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="GraphConv")

# Unpack the datasets tuple into separate variables for training, validation, and test sets.
train_dataset, valid_dataset, test_dataset = datasets

In [None]:
print(test_dataset.ids)

First, notice the `featurizer` argument passed to the `load_delaney()` function.  Molecules can be represented in many ways.  We therefore tell it which representation we want to use, or in more technical language, how to "featurize" the data.  Second, notice that we actually get three different data sets: a training set, a validation set, and a test set.  Each of these serves a different function in the standard deep learning workflow.
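
The idea of three separate data sets can be sketched without DeepChem. The example below shuffles sample indices and cuts them into training, validation, and test portions. The 80/10/10 fractions, the random shuffling, and the `random_split` helper are assumptions made for illustration; the loader may use a different splitting strategy.

```python
import numpy as np


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Return disjoint index arrays for training, validation, and test sets.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train_idx = indices[:n_train]
    valid_idx = indices[n_train:n_train + n_valid]
    test_idx = indices[n_train + n_valid:]  # whatever remains
    return train_idx, valid_idx, test_idx


# 1128 is the number of compounds in the Delaney data set.
train_idx, valid_idx, test_idx = random_split(1128)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The training set is used to fit the model, the validation set to tune choices such as hyperparameters, and the test set only for the final performance estimate.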

### Create model

Now that we have our data, the next step is to create a model. We will use a particular kind of model called a "graph convolutional network", or "graphconv" for short. The cell below initializes one using DeepChem's GraphConvModel, configured for regression because solubility is a continuous property. The model uses graph convolutions to learn directly from molecular structures represented as graphs, where atoms are nodes and bonds are edges. Each atom's features are repeatedly combined with those of its bonded neighbors, so the learned representation reflects local chemical environments, which makes this architecture well suited to molecular property prediction. Once created, the model is ready for training on the featurized Delaney dataset.

In [None]:
# Build a graph convolutional model for regression.
# Graph convolutions start with a set of descriptors for each atom in a
# molecule, then combine and recombine these descriptors over convolutional layers.

model = dc.models.GraphConvModel( # Initialize a Graph Convolutional Model from DeepChem.
    n_tasks=1,       # Specify the number of tasks the model will predict (1 for single regression task).
    mode="regression",  # Set the model mode to 'regression' for predicting continuous values.
    dropout=0.2      # Apply dropout regularization with a probability of 0.2 to prevent overfitting.
)
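
To give a feel for what "combining per-atom descriptors over a graph" means, here is a deliberately simplified sketch of one graph convolution step in NumPy. It is not DeepChem's implementation: the two-element atom features, the random weights, and the `graph_conv_layer` helper are all assumptions made for illustration.

```python
import numpy as np

# Heavy-atom graph of ethanol (SMILES "CCO"): C0 - C1 - O2.
adjacency = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)

# Toy per-atom descriptors: one-hot [is_carbon, is_oxygen].
atom_features = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
], dtype=float)


def graph_conv_layer(features, adjacency, w_self, w_neigh):
    """One simplified graph convolution: mix each atom with the sum of its neighbors."""
    neighbor_sum = adjacency @ features
    return np.maximum(0.0, features @ w_self + neighbor_sum @ w_neigh)  # ReLU


rng = np.random.default_rng(0)
w_self = rng.normal(size=(2, 4))   # untrained, random weights
w_neigh = rng.normal(size=(2, 4))

hidden = graph_conv_layer(atom_features, adjacency, w_self, w_neigh)

# Summing over atoms gives a fixed-size vector for the whole molecule,
# regardless of how many atoms it has or how they are numbered.
molecule_vector = hidden.sum(axis=0)
print(hidden.shape, molecule_vector.shape)
```

In a trained model the weights are learned, several such layers are stacked, and the molecule-level vector feeds a final layer that outputs the predicted property.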

### Train model

To train our model on the prepared dataset, we'll use the `fit()` method, which handles the entire training process. Here we pass it two arguments: the training dataset and the number of training epochs (`nb_epoch`). Each epoch is one complete pass through the entire training dataset, so the model's parameters are refined a little more with every pass. For this solubility prediction task, we'll train for 200 epochs, which gives the model many passes over this small dataset to learn the relationship between molecular structure and solubility.

In [None]:
# Suppress warning messages during code execution to keep the output cleaner.
warnings.filterwarnings("ignore")

# Train the model on the training dataset.
# - train_dataset: the featurized molecules and their solubility labels.
# - nb_epoch=200: the number of training epochs (complete passes over the training dataset).
model.fit(train_dataset, nb_epoch=200)
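
What an epoch means can be shown with a small stand-alone training loop. The sketch below fits a linear model to synthetic data with mini-batch gradient descent; the data, learning rate, batch size, and epoch count are all assumptions chosen for illustration and have nothing to do with the graph convolutional model's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: y is an exact linear function of X.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
learning_rate = 0.1
batch_size = 20
nb_epoch = 50
losses = []

for epoch in range(nb_epoch):
    # One epoch = one complete pass over the training data, in shuffled batches.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        error = X[batch] @ w - y[batch]
        gradient = 2.0 * X[batch].T @ error / len(batch)
        w -= learning_rate * gradient
    # Record the mean squared error on the full training set after each epoch.
    losses.append(float(np.mean((X @ w - y) ** 2)))

print(f"loss after first epoch: {losses[0]:.4f}")
print(f"loss after last epoch:  {losses[-1]:.6f}")
```

The loss falls as the epochs accumulate. On real data, too many epochs can also fit noise in the training set, which is one reason to monitor performance on held-out data.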

### Evaluate model

Now that we've completed training, it's time to evaluate the model. We'll assess how well it predicts solubility values on both the training data (to check how well it has learned) and the test set (to measure how well it generalizes). For this evaluation, we'll use the Pearson R² (r-squared) score, the square of the Pearson correlation coefficient between predicted and measured values. A score of 1 represents perfect correlation and 0 indicates no correlation.

In [None]:
# Initialize a Metric object from DeepChem's metrics module.
# This metric will be used to evaluate the model's performance.
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

# Evaluate the model on the training dataset using the specified metric (Pearson R^2 score).
# Print the training set score, which indicates how well the model performs on the data it was trained on.
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))

# Evaluate the model on the test dataset using the same metric (Pearson R^2 score).
# Print the test set score, which indicates how well the model generalizes to unseen data.
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

Notice that it has a higher score on the training set than the test set.  Models usually perform better on the particular data they were trained on than they do on similar but independent data.  This is called "overfitting", and it is the reason it is essential to evaluate your model on an independent test set.

Our model still has quite respectable performance on the test set.  For comparison, a model that produced totally random outputs would have a correlation of 0, while one that made perfect predictions would have a correlation of 1.  Our model does quite well, so now we can use it to make predictions about other molecules we care about.
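
For intuition about the metric, here is a minimal NumPy sketch of a squared Pearson correlation alongside the root-mean-square error (RMSE). The `pearson_r2` and `rmse` helpers and the example values are assumptions written for illustration; they are not DeepChem's own functions.

```python
import numpy as np


def pearson_r2(y_true, y_pred):
    """Square of the Pearson correlation coefficient between two value arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)


def rmse(y_true, y_pred):
    """Root-mean-square error between two value arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical measured values and two sets of hypothetical predictions.
measured = np.array([-1.0, -2.0, -3.0, -4.0])
exact = measured.copy()            # perfect predictions
biased = 2.0 * measured + 5.0      # perfectly correlated but systematically off

print(f"exact:  r2={pearson_r2(measured, exact):.3f}, rmse={rmse(measured, exact):.3f}")
print(f"biased: r2={pearson_r2(measured, biased):.3f}, rmse={rmse(measured, biased):.3f}")
```

Both prediction sets reach a correlation score of essentially 1, yet only the first has zero error. A correlation-based score is therefore best read together with an error metric when absolute accuracy matters.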

### Make predictions

To demonstrate our model's predictive capabilities, we'll examine its performance on a representative subset of molecules. The code below analyzes the first ten compounds from the test set, displaying each molecule's SMILES string (a text-based representation of its chemical structure), the model's predicted log(solubility) value, and the corresponding experimental measurement from the test set for direct comparison.

This side-by-side comparison serves multiple purposes:
- Provides immediate, interpretable validation of the model's predictions at the molecular level
- Helps identify any systematic prediction errors (e.g., consistently overestimating certain chemical classes)
- Offers tangible examples of the model's performance beyond aggregate metrics
- Allows quick visual inspection of whether prediction errors correlate with specific structural features

The output format clearly distinguishes between the model's predictions and ground truth values, enabling researchers to assess predictive accuracy for individual compounds while maintaining the context of experimental measurements.

In [None]:
# Predict solubilities for the first 10 molecules in the test dataset.
solubilities = model.predict_on_batch(test_dataset.X[:10])

# Iterate over the first 10 molecules with their predicted and measured values.
# Slicing ids and y explicitly makes it clear that only the first 10 entries are used.
# Note: if the loader applied transformers to the labels, the predictions and
# test_dataset.y are both in that transformed space, so they remain comparable.
for molecule, solubility, test_solubility in zip(test_dataset.ids[:10], solubilities, test_dataset.y[:10]):
    # Print the predicted value, the measured value, and the molecule's SMILES string.
    print(solubility, test_solubility, molecule)
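
The `transformers` returned by the loader may have rescaled the labels, for example by normalizing them to zero mean and unit variance. Whether that applies depends on the loader's settings in your installed version, so treat it as an assumption to check. If it does, raw model outputs need to be mapped back before they can be read as log(solubility). The sketch below shows the arithmetic with hypothetical values and a hypothetical `untransform` helper; DeepChem's transformers provide their own way to undo a transformation.

```python
import numpy as np

# Hypothetical raw log(solubility) labels, for illustration only.
raw_log_s = np.array([-0.5, -2.0, -3.5, -5.0])

# Normalization: subtract the mean and divide by the standard deviation.
mean = raw_log_s.mean()
std = raw_log_s.std()
normalized = (raw_log_s - mean) / std


def untransform(values, mean, std):
    """Map normalized values back to the original scale (hypothetical helper)."""
    return values * std + mean


recovered = untransform(normalized, mean, std)
print(normalized)
print(recovered)
```

The same inverse mapping would be applied to the model's predictions, using the mean and standard deviation computed from the training labels.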

## Conclusion

In this tutorial, we learned how to:

- Work with molecular data using DeepChem
- Build deep learning models for property prediction
- Process chemical structures for machine learning
- Make predictions about molecular solubility

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.

