Basic Tools

In [1]:
!pip install --pre deepchem[tensorflow]

Collecting deepchem[tensorflow]
  Downloading deepchem-2.7.2.dev20240209195529-py3-none-any.whl.metadata (1.9 kB)
Collecting rdkit (from deepchem[tensorflow])
  Downloading rdkit-2023.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting tensorflow (from deepchem[tensorflow])
  Downloading tensorflow-2.15.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting tensorflow-probability (from deepchem[tensorflow])
  Downloading tensorflow_probability-0.23.0-py2.py3-none-any.whl.metadata (13 kB)
Collecting tensorflow-addons (from deepchem[tensorflow])
  Downloading tensorflow_addons-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting absl-py>=1.0.0 (from tensorflow->deepchem[tensorflow])
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow->deepchem[tensorflow])
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12

In [2]:
import deepchem as dc
dc.__version__

No normalization for SPS. Feature removed!
No normalization for AvgIpc. Feature removed!


Error: Unable to import pysam. Please make sure it is installed.
Error: Unable to import pysam. Please make sure it is installed.
Error: Unable to import pysam. Please make sure it is installed.


2024-02-10 06:57:33.275860: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-10 06:57:33.275915: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-10 06:57:33.277434: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-10 06:57:33.285102: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with transformers dependency. No module named 'transformers'
cannot import name 'HuggingFaceModel' from 'deepchem.models.torch_models' (/home/codespace/.python/current/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/home/codespace/.python/current/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


'2.7.2.dev'

The problem we will solve is predicting the solubility of small molecules given their chemical formulas.
The first thing we need is a data set of measured solubilities for real molecules. One of the core components of DeepChem is MoleculeNet, a diverse collection of chemical and molecular data sets. For this tutorial, we can use the Delaney solubility data set. The property of solubility in this data set is reported in log(solubility) where solubility is measured in moles/liter.
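
To make the units concrete, here is a minimal sketch that converts a solubility in moles/liter to a log value and back. It assumes a base-10 logarithm, which the text above does not state explicitly, and both helper names are hypothetical (they are not part of DeepChem).

```python
import math

def to_log_solubility(mol_per_liter: float) -> float:
    """Convert a solubility in mol/L to log10(solubility). Base 10 is an assumption."""
    if mol_per_liter <= 0:
        raise ValueError("solubility must be positive")
    return math.log10(mol_per_liter)

def from_log_solubility(log_s: float) -> float:
    """Invert the conversion: recover mol/L from a log10 value."""
    return 10.0 ** log_s

print(to_log_solubility(0.01))    # a compound that dissolves at 0.01 mol/L
print(from_log_solubility(-3.0))  # the mol/L value behind a log value of -3
```

On this scale, each step of one unit corresponds to a tenfold change in solubility, so more negative values mean less soluble molecules.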

In [3]:
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

1. The featurizer argument passed to load_delaney()
POINT: it tells the loader which representation we want to use, or in more technical language, how to "featurize" the data.
2. We actually get three different data sets: a training set, a validation set, and a test set. Each of these serves a different function in the standard deep learning workflow.
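
The idea of three data sets can be sketched in plain Python. This is only an illustration: the function name, the 80/10/10 fractions, and the random shuffle are assumptions, and load_delaney() may split the data differently.

```python
import random

def three_way_split(items, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle items and cut them into train, validation, and test lists.

    The fractions and the random shuffle are illustrative assumptions.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(frac_train * len(items))
    n_valid = int(frac_valid * len(items))
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]   # everything left over
    return train, valid, test

train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

The training set is what the model learns from, the validation set is typically used to compare settings while tuning, and the test set is held back for the final evaluation.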


NEXT STEP: create a model to train on our new data,
using a "graph convolutional network", or "graphconv" for short.

In [4]:
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

Now, TRAIN the model on the data set.
To train: call model.fit(), giving it the data set and the number of epochs of training to perform (aka how many complete passes through the data set).
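
To show what "epochs" means, here is a toy training loop for a straight-line fit. It is a sketch under stated assumptions, not how GraphConvModel trains internally: the function name, batch size, and learning rate are all made up for illustration.

```python
def train_linear(xs, ys, nb_epoch=500, batch_size=2, learning_rate=0.05):
    """Fit y = w*x + b by mini-batch gradient descent; returns (w, b, n_updates)."""
    w, b, n_updates = 0.0, 0.0, 0
    for _ in range(nb_epoch):                          # one epoch = one full pass over the data
        for start in range(0, len(xs), batch_size):    # walk through the data batch by batch
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradients of the mean squared error over this batch
            errors = [(w * x + b) - y for x, y in zip(bx, by)]
            grad_w = 2 * sum(e * x for e, x in zip(errors, bx)) / len(bx)
            grad_b = 2 * sum(errors) / len(bx)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            n_updates += 1
    return w, b, n_updates

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]        # data generated from w=2, b=1
w, b, n_updates = train_linear(xs, ys)
print(round(w, 3), round(b, 3), n_updates)
```

With four data points and a batch size of two, each epoch performs two parameter updates, so the number of updates grows in proportion to the number of epochs.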

In [5]:
model.fit(train_dataset, nb_epoch=100)

0.1126149559020996

To check how well the trained model works, evaluate it on a test set.
Select an evaluation metric and call evaluate() on the model.
Using the squared Pearson correlation coefficient (r^2) as our metric.
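
For reference, the metric can be written out by hand. The function below is a plain-Python sketch of the squared Pearson correlation; its name is hypothetical, and it is meant to illustrate the formula rather than reproduce DeepChem's implementation.

```python
import math

def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)   # undefined if either input is constant
    return r * r

y_true = [1.0, 2.0, 3.0, 4.0]
print(pearson_r2(y_true, [2.0, 4.0, 6.0, 8.0]))   # perfectly correlated
print(pearson_r2(y_true, [4.0, 3.0, 2.0, 1.0]))   # perfectly anti-correlated
```

Both calls give a score of 1 (up to floating-point rounding). Because the correlation is squared, the metric ignores the sign of the relationship, and it also ignores any constant scale or offset between predictions and measurements.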

In [6]:
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

Training set score: {'pearson_r2_score': 0.9200403257692505}
Test set score: {'pearson_r2_score': 0.6497356268010457}


The training set has a higher score than the test set.
Models typically perform better on the data they were trained on than on similar but independent data (term: overfitting*).
*This is why we MUST evaluate the model on an independent test set.
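
An extreme toy case makes the point. The class below is hypothetical and deliberately silly: it memorizes the training labels and falls back to the training mean for anything it has not seen.

```python
class MemorizingModel:
    """Toy model that memorizes training labels; unseen inputs get the training mean."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        self.fallback = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.table.get(x, self.fallback) for x in xs]

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

train_x, train_y = [1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0]
test_x, test_y = [5, 6], [5.0, 6.0]

toy_model = MemorizingModel().fit(train_x, train_y)
train_mse = mean_squared_error(train_y, toy_model.predict(train_x))
test_mse = mean_squared_error(test_y, toy_model.predict(test_x))
print(train_mse, test_mse)   # 0.0 9.25
```

Judged on the training data alone, this model looks perfect. Only the held-out data reveals that it has learned nothing that generalizes.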

Our model still has quite respectable performance on the test set. For comparison, a model that produced totally random outputs would have a correlation of about 0, while one that made perfect predictions would have a correlation of 1. With a test set score of about 0.65, our model does quite well, so now we can use it to make predictions about other molecules we care about.

Since this is just a tutorial and we don't have any other molecules we specifically want to predict, let's just use the first ten molecules from the test set. For each one we print out the chemical structure (represented as a SMILES string) and the predicted log(solubility). To put these predictions in context, we print out the log(solubility) values from the test set as well.

In [7]:
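# Predict for the first ten test molecules; zip() stops at the shortest input, so only ten rows print.
solubilities = model.predict_on_batch(test_dataset.X[:10])
for molecule, solubility, test_solubility in zip(test_dataset.ids, solubilities, test_dataset.y):
    # Columns: predicted value, value stored in the test set, SMILES string
    print(solubility, test_solubility, molecule)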

[-1.6660576] [-1.60114461] c1cc2ccc3cccc4ccc(c1)c2c34
[0.82121] [0.20848251] Cc1cc(=O)[nH]c(=S)[nH]1
[-0.5756947] [-0.01602738] Oc1ccc(cc1)C2(OC(=O)c3ccccc23)c4ccc(O)cc4 
[-2.1604166] [-2.82191713] c1ccc2c(c1)cc3ccc4cccc5ccc2c3c45
[-1.5736535] [-0.52891635] C1=Cc2cccc3cccc1c23
[1.2720398] [1.10168349] CC1CO1
[-0.00350919] [-0.88987406] CCN2c1ccccc1N(C)C(=S)c3cccnc23 
[-0.9291749] [-0.52649706] CC12CCC3C(CCc4cc(O)ccc34)C2CCC1=O
[-1.1816086] [-0.76358725] Cn2cc(c1ccccc1)c(=O)c(c2)c3cccc(c3)C(F)(F)F
[0.1285029] [-0.64020358] ClC(Cl)(Cl)C(NC=O)N1C=CN(C=C1)C(NC=O)C(Cl)(Cl)Cl 
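
To summarize how close these ten predictions are, the sketch below copies the two numeric columns from the output above and computes the mean absolute error. The variable names are assumptions, and the errors are in whatever units the data set stores for its labels.

```python
# First column of the output above: model predictions
predicted = [-1.6660576, 0.82121, -0.5756947, -2.1604166, -1.5736535,
             1.2720398, -0.00350919, -0.9291749, -1.1816086, 0.1285029]
# Second column of the output above: values stored in the test set
measured = [-1.60114461, 0.20848251, -0.01602738, -2.82191713, -0.52891635,
            1.10168349, -0.88987406, -0.52649706, -0.76358725, -0.64020358]

abs_errors = [abs(p - m) for p, m in zip(predicted, measured)]
mean_abs_error = sum(abs_errors) / len(abs_errors)
worst_index = max(range(len(abs_errors)), key=abs_errors.__getitem__)

print(f"mean absolute error: {mean_abs_error:.3f}")
print(f"largest error: {abs_errors[worst_index]:.3f} at row {worst_index}")
```

Rows are counted from zero, so the row index can be matched against the printed SMILES strings to see which molecule the model struggles with most. Ten molecules are far too few to judge the model, so this is a sanity check rather than an evaluation.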
