# Predicting the Solubility of Small Molecules using DeepChem

## Overview

This tutorial provides a practical introduction to predicting small molecule solubility using DeepChem, an open-source toolkit that combines cheminformatics with machine learning. We'll walk through the essential steps of building predictive models, starting with how to represent molecular structures in formats suitable for machine learning algorithms.

## Learning Objectives

- Learn how to use DeepChem for molecular property prediction
- Understand how to process and prepare molecular data for machine learning
- Build and train deep learning models for solubility prediction
- Evaluate model performance on chemical datasets

### Tasks to complete

- Load and process molecular data
- Build a DeepChem model
- Train the model on solubility data
- Evaluate predictions and model performance

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts

## Get Started

- Select the "conda_tensorflow2_p310" kernel in your SageMaker JupyterLab notebook.

### Import necessary libraries

You will likely see some warnings about missing dependencies and removed features. This is expected, since this tutorial does not use the full capabilities of DeepChem.

In [None]:
# Install the pre-release version of the deepchem library with tensorflow support using pip.
%pip install --pre deepchem[tensorflow]

In [None]:
import os
# Set these environment variables before DeepChem (and therefore TensorFlow) is imported,
# otherwise they may not take effect.
# Hide TensorFlow's C++ log messages (INFO, WARNING, ERROR).
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Disable GPU usage to avoid CUDA initialization errors.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# Import the 'warnings' module and ignore all Python warnings to keep the output clean.
import warnings
warnings.filterwarnings('ignore')

# Import the DeepChem library, which provides tools for deep learning in chemistry and drug discovery.
import deepchem as dc

# Print the version of the DeepChem library that is currently installed.
print(dc.__version__)

## Training a Model with DeepChem

Deep learning can solve diverse problems through a consistent workflow. Here's the typical process:

### Basic Workflow Steps

1. **Data Selection**  
   Choose an existing dataset or create a new one suitable for your task.

2. **Model Creation**  
   Design the architecture of your deep learning model.

3. **Model Training**  
   Fit the model to your training data.

4. **Model Evaluation**  
   Test the model's performance on an independent test set.

5. **Model Deployment**  
   Use the trained model to make predictions on new data.

### DeepChem Implementation

With DeepChem, each step can be implemented in just 1-2 lines of Python code. This tutorial demonstrates the complete workflow for a real-world scientific problem.

### Problem Statement: Solubility Prediction

**Objective**: Predict small molecule solubility from chemical structures, encoded as SMILES strings.  

**Importance**: Solubility is crucial in drug development - insufficient solubility may prevent therapeutic concentrations in the bloodstream.

### Dataset

We'll use the **Delaney solubility dataset** from MoleculeNet (a DeepChem component offering diverse chemical datasets).  

**Key Details**:
- Labels are the logarithm of water solubility, log(solubility)
- Units of the underlying solubility: moles/liter
- Contains experimentally measured values for real molecules
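
To make the label scale concrete, the sketch below converts a log-solubility value back to a concentration. It assumes the logarithm is base 10, which is the usual convention for this dataset but is not stated above. `log_solubility_to_molar` is a hypothetical helper name, not part of DeepChem, and the sample values are illustrative rather than taken from the dataset.

```python
def log_solubility_to_molar(log_s: float) -> float:
    """Convert log10(solubility) to solubility in moles/liter (assumes base-10 logs)."""
    return 10.0 ** log_s


# Illustrative values only, not taken from the dataset.
for log_s in (0.0, -2.0, -4.5):
    molar = log_solubility_to_molar(log_s)
    print(f"log(solubility) = {log_s:5.1f}  ->  {molar:.2e} mol/L")
```

Under this assumption, a difference of one unit in the label corresponds to a tenfold change in concentration.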

In [None]:
# The Delaney (ESOL) dataset is a regression dataset containing structures and
# water solubility data for 1128 compounds. The dataset is widely used to
# validate machine learning models on estimating solubility directly from
# molecular structures (as encoded in SMILES strings).

# Load the Delaney dataset using DeepChem's molnet module.
#   - tasks:  List of tasks in the dataset (in this case, solubility prediction).
#   - datasets: Tuple containing training, validation, and test datasets.
#   - transformers: List of transformers used for data preprocessing (not used in this line but returned).
#   - featurizer="GraphConv": Specifies that the 'GraphConv' featurizer should be used to convert molecular SMILES strings into graph-based features.
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="GraphConv")

# Unpack the datasets tuple into separate variables for training, validation, and test sets.
train_dataset, valid_dataset, test_dataset = datasets

In [None]:
print(test_dataset.ids)

The `load_delaney()` function accepts a `featurizer` argument, which specifies how to convert molecular structures into machine-readable representations. Since molecules can be represented in multiple ways, this parameter determines the specific featurization method. 

The function returns three distinct datasets:
   - **Training set**: Used for model fitting
   - **Validation set**: Used for hyperparameter tuning
   - **Test set**: Used for final performance evaluation  
   
This tripartite split follows standard deep learning practices, where each subset serves a unique purpose in the model development workflow.
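
The roles of the three subsets can be illustrated with a plain-Python split. This is a simplified sketch that shuffles indices at random using assumed 80/10/10 fractions. `split_indices` is a hypothetical helper, and the MoleculeNet loader applies its own splitting strategy, which may differ from a random shuffle.

```python
import random


def split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle sample indices and cut them into train/valid/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains
    return train, valid, test


# 1128 is the number of compounds quoted in the loading cell above.
train_idx, valid_idx, test_idx = split_indices(1128)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The important property is that the three index lists never overlap, so the test set stays unseen until the final evaluation.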

### Create Model

With our data prepared, we proceed to model creation using a **Graph Convolutional Network (GCN)**. 

#### GraphConvModel Implementation
- **Model Type**: `GraphConvModel` from DeepChem  
- **Task**: Regression (predicting molecular solubility)  
- **Architecture**:  
  - Processes molecular structures as graphs (atoms=nodes, bonds=edges)  
  - Uses graph convolutions to capture each atom's local bonding environment  
  - Optimized for molecular property prediction  
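
The graph view of a molecule can be sketched without any chemistry library. The example below hand-codes ethanol (SMILES `CCO`, heavy atoms only) as nodes and edges and performs one toy neighbor-aggregation step. It illustrates the general idea only: the feature vectors, the helper names, and the sum-based update are assumptions made for this sketch and are not the actual layers or atom descriptors used by `GraphConvModel`.

```python
# Ethanol (SMILES "CCO"), heavy atoms only: nodes are atoms, edges are bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]


def build_adjacency(n_atoms, bonds):
    """Return a dict mapping each atom index to the indices of its bonded neighbors."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def one_hot(symbol, vocabulary=("C", "O")):
    """Toy per-atom descriptor: a one-hot vector over the element vocabulary."""
    return [1.0 if symbol == s else 0.0 for s in vocabulary]


def aggregate_neighbors(features, adjacency):
    """One toy convolution step: each atom's vector plus the sum of its neighbors' vectors."""
    updated = []
    for i, own in enumerate(features):
        total = list(own)
        for j in adjacency[i]:
            total = [t + f for t, f in zip(total, features[j])]
        updated.append(total)
    return updated


adjacency = build_adjacency(len(atoms), bonds)
features = [one_hot(symbol) for symbol in atoms]
updated = aggregate_neighbors(features, adjacency)
for symbol, vector in zip(atoms, updated):
    print(symbol, vector)
```

After one step, each atom's vector already reflects its bonded neighbors. Stacking several such steps lets information travel further along the bond graph.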

#### Data Featurization  
- The `load_delaney()` function accepts a `featurizer` argument to specify molecular representation  
- Featurization converts raw molecular data into model-digestible formats  

#### Dataset Splits  
Three distinct datasets are generated:  
1. **Training set**: Model learning  
2. **Validation set**: Hyperparameter tuning  
3. **Test set**: Final performance evaluation  


In [None]:
# Build a graph convolutional model for regression.
# Graph convolutions start with a set of descriptors for each atom in a
# molecule, then combine and recombine these descriptors over
# convolutional layers.

model = dc.models.GraphConvModel( # Initialize a Graph Convolutional Model from DeepChem.
    n_tasks=1,       # Specify the number of tasks the model will predict (1 for single regression task).
    mode="regression",  # Set the model mode to 'regression' for predicting continuous values.
    dropout=0.2      # Apply dropout regularization with a probability of 0.2 to prevent overfitting.
)

### Train model

The model is now initialized and ready for training on the featurized Delaney dataset.

To train our model on the prepared dataset:

- Use the `fit()` method to handle the entire training process
- Key parameters:
  - Training dataset (features and labels)
  - Number of epochs (`nb_epoch`)

**Training Details:**
- Each epoch = one complete pass through the training data
- Model iteratively adjusts parameters during training
- For this solubility prediction task:
  - Training duration: **200 epochs**
  - Allows the model to adequately learn complex structure-solubility relationships
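
What an epoch means can be shown with a toy model. The sketch below is not how `GraphConvModel` trains internally: it fits a single weight with plain stochastic gradient descent on made-up data. `train_linear` and `learning_rate` are hypothetical names chosen for illustration, and only the `nb_epoch` argument mirrors the DeepChem call.

```python
def train_linear(xs, ys, nb_epoch, learning_rate=0.05):
    """Fit y = weight * x by stochastic gradient descent on squared error."""
    weight = 0.0
    losses = []
    for _ in range(nb_epoch):
        # One epoch: a single complete pass over every training example.
        for x, y in zip(xs, ys):
            error = weight * x - y
            weight -= learning_rate * 2.0 * error * x  # gradient of error**2
        # Record the mean squared error at the end of the epoch.
        loss = sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        losses.append(loss)
    return weight, losses


# Made-up data following y = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]
weight, losses = train_linear(xs, ys, nb_epoch=20)
print("learned weight:", weight)
print("loss after first and last epoch:", losses[0], losses[-1])
```

More epochs give the parameters more opportunities to improve, at the cost of longer training and a greater risk of fitting noise in the training set.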

In [None]:
# Suppresses warning messages during code execution to keep the output cleaner.
warnings.filterwarnings("ignore")

# Train the model on the training dataset.
# - train_dataset: The training split loaded above, containing the featurized molecules and their solubility labels.
# - nb_epoch=200: The number of training epochs (complete passes over the training dataset), set to 200 here.
model.fit(train_dataset, nb_epoch=200)

### Model Evaluation

With training complete, we now validate our model's performance through comprehensive evaluation. This involves:

- **Dual Assessment**:
  - Testing on **training data** to verify learning efficacy
  - Testing on **holdout test set** to measure generalization capability

- **Primary Metric**: 
  - Pearson R² (r-squared) score, the squared Pearson correlation between predicted and measured values
    - Range: 0 (no correlation) to 1 (perfect correlation)
    - Measures how strongly predictions track the actual solubility values linearly, rather than the size of the errors

This evaluation framework ensures we quantify both memorization capacity and true predictive power.
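
The metric itself is short enough to compute by hand. The following is a minimal pure-Python sketch of a squared Pearson correlation. `pearson_r2` is a hypothetical helper written for this illustration, and the numbers are made up. It is expected to agree with `dc.metrics.pearson_r2_score` on the same inputs, though that has not been checked here.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_true * var_pred)
    return r * r


measured = [-1.0, -2.0, -3.0, -4.0]           # made-up labels
rescaled = [2.0 * y + 1.0 for y in measured]  # wrong values, perfectly correlated
shuffled = [-2.0, -1.0, -4.0, -3.0]           # partially correlated

print(pearson_r2(measured, measured))
print(pearson_r2(measured, rescaled))
print(pearson_r2(measured, shuffled))
```

Because the score is unchanged by a linear rescaling of the predictions, a high Pearson R² does not by itself guarantee small absolute errors. An error-based metric such as RMSE is a useful complement.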

In [None]:
# Initialize a Metric object from DeepChem's metrics module.
# This metric will be used to evaluate the model's performance.
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

# Evaluate the model on the training dataset using the specified metric (Pearson R^2 score).
# Print the training set score, which indicates how well the model performs on the data it was trained on.
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))

# Evaluate the model on the test dataset using the same metric (Pearson R^2 score).
# Print the test set score, which indicates how well the model generalizes to unseen data.
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

#### Model Performance Analysis  

- **Training vs. Test Scores**:  
  The model shows a higher score on the training set compared to the test set. This is expected because models typically perform better on the data they were trained on than on independent data. When the gap is large, the model has memorized details of the training set that do not generalize, a problem known as **overfitting**.  
  **Key Takeaway**: Always evaluate models on a separate test set to detect overfitting.  

- **Test Set Performance**:  
  Despite the drop in performance, the model should still achieve a **respectable score** on the test set. For context:  
  - Predictions uncorrelated with the measured values would yield a score near **0**.  
  - Perfect predictions would score **1**.  
  A test score well above 0 indicates that the model has captured a genuine relationship between structure and solubility, although the exact value varies from run to run.  

- **Next Steps**:  
  With validation complete, the model is now ready to generate predictions for new molecules of interest.  

### Make predictions

To demonstrate our model's predictive capabilities, we'll examine its performance on a representative subset of molecules. The code below analyzes the first ten compounds from the test set, displaying each molecule's SMILES string (a text-based representation of its chemical structure), the model's predicted log(solubility) value, and the corresponding experimental measurement from the test set for direct comparison.

This side-by-side comparison serves multiple purposes:
- Provides immediate, interpretable validation of the model's predictions at the molecular level
- Helps identify any systematic prediction errors (e.g., consistently overestimating certain chemical classes)
- Offers tangible examples of the model's performance beyond aggregate metrics
- Allows quick visual inspection of whether prediction errors correlate with specific structural features

The output format clearly distinguishes between the model's predictions and ground truth values, enabling researchers to assess predictive accuracy for individual compounds while maintaining the context of experimental measurements.

In [None]:
# Predict solubilities for the first 10 molecules in the test dataset.
solubilities = model.predict_on_batch(test_dataset.X[:10])

# Iterate over the first 10 molecules, pairing each SMILES string with its
# predicted value and the label stored in the test set. zip() stops at the
# shortest input, so only 10 rows are printed.
for molecule, solubility, test_solubility in zip(test_dataset.ids, solubilities, test_dataset.y):
    # Print the predicted value, the test-set label, and the SMILES string.
    print(solubility, test_solubility, molecule)
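
One caveat when reading the printed numbers: the `transformers` list returned by `load_delaney()` records any preprocessing applied to the data, so the labels and predictions may be on a transformed scale rather than raw log(solubility). Assuming that preprocessing is a z-score normalization of the labels (an assumption; inspect `transformers` to confirm), the sketch below shows how such a transform is inverted. The helper names and values are hypothetical. DeepChem's own `dc.trans.undo_transforms` is intended for this purpose on real datasets.

```python
def normalize(values):
    """Z-score normalize a list of labels; also return the mean and std used."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values], mean, std


def denormalize(values, mean, std):
    """Invert the z-score normalization to recover values on the original scale."""
    return [v * std + mean for v in values]


# Made-up log(solubility) labels, for illustration only.
raw_labels = [-0.5, -1.5, -3.0, -5.0]
scaled, mean, std = normalize(raw_labels)
recovered = denormalize(scaled, mean, std)
print("scaled:   ", scaled)
print("recovered:", recovered)
```

Predictions and labels printed side by side remain comparable either way, because both are on the same scale. The inverse transform matters only when values need to be reported in log(solubility) units.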

## Conclusion

In this tutorial, we learned how to:

- Work with molecular data using DeepChem
- Build deep learning models for property prediction
- Process chemical structures for machine learning
- Make predictions about molecular solubility

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.

