In [3]:
!conda create -n ftudd_chem --file requirements_chem.txt -c pytorch -c rdkit -c conda-forge -c rmg -y
# This will take a couple of minutes.
# After the installation has finished, start the notebook in the new environment.

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.4
  latest version: 4.11.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/m275696/.conda/envs/ftudd_chem

  added / updated specs:
    - boost
    - boost-cpp
    - descriptastorus
    - jupyterlab
    - numpy
    - numpy-base
    - pandas
    - python
    - pytorch
    - rdkit
    - readline
    - scikit-learn
    - scipy
    - tensorboard
    - torchvision
    - tqdm
    - transformers
    - typing


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
  absl-py            conda-forge/noarch::absl-py-1.0.0-pyhd8ed1ab_0
  aiohttp            conda-forge/linux-64::aiohttp-3.8.1-py37h5e8e339_0
  aiosignal          conda-forge/noarch::aiosignal-1.2.0-pyhd8ed1ab_0
  a

### Grover model

Code and pretrained weights are available at https://github.com/tencent-ailab/grover. We will use a fork of it in which I have already fixed a bug in the code base.

Implementation of Rong et al., Self-Supervised Graph Transformer on Large-Scale Molecular Data, NeurIPS 2020

Grover is a graph neural network. It is pretrained in a self-supervised way, i.e. on unlabeled molecules, and maps each molecule to an embedding vector. It can be fine-tuned for downstream tasks.

In [23]:
# clone the Grover repository and download the pretrained weights
import sys
!git clone https://github.com/emdgroup/grover.git ../grover
!mkdir -p ../grover/data/
!wget https://ai.tencent.com/ailab/ml/ml-data/grover-models/pretrain/grover_large.tar.gz -O ../grover/data/grover_large.tar.gz
!tar -xzf ../grover/data/grover_large.tar.gz -C ../grover/data/
# make the Grover code importable from this notebook
sys.path.append('../grover')

Cloning into '../grover'...
remote: Enumerating objects: 76, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 76 (delta 14), reused 61 (delta 5), pack-reused 0
Unpacking objects: 100% (76/76), done.
--2022-03-04 12:57:50--  https://ai.tencent.com/ailab/ml/ml-data/grover-models/pretrain/grover_large.tar.gz
Resolving ai.tencent.com (ai.tencent.com)... 116.128.164.87
Connecting to ai.tencent.com (ai.tencent.com)|116.128.164.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 399013496 (381M) [application/octet-stream]
Saving to: ‘../grover/data/grover_large.tar.gz’


2022-03-04 12:59:14 (4.61 MB/s) - ‘../grover/data/grover_large.tar.gz’ saved [399013496/399013496]



In [16]:
# Grover has a command line interface; let us use it to generate embeddings of two sample molecules
import pandas as pd
smiles = ['CC(=O)O', 'CCCCCC']
pd.DataFrame({'SMILES': smiles}).to_csv('test_smiles.csv', index=False)

In [17]:
%%bash
python ../grover/scripts/save_features.py --data_path test_smiles.csv --save_path test_features.npz --features_generator rdkit_2d_normalized --restart 
python ../grover/main.py fingerprint --data_path test_smiles.csv --features_path test_features.npz --checkpoint_path ../grover/data/grover_large.pt --fingerprint_source both --output test_fingerprints.npz

Loading data


100%|██████████| 2/2 [00:00<00:00, 34.99it/s]
Total size = 2
Generating...
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_k.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.1.mpn_k.W_h.weight".
Loading pretrained parameter "

In [22]:
# load the fingerprints
import numpy as np
fingerprints = np.load('test_fingerprints.npz')["fps"]
print(fingerprints)
print(fingerprints.shape)

[[ 6.3896969e-02  3.3971572e-01 -3.9031762e-01 ...  4.7035982e-08
   1.6663340e-01  2.6578519e-01]
 [-2.4661005e-01  1.2936553e-01  7.6430368e-01 ...  9.6070671e-01
   1.6663340e-01  3.0714130e-01]]
(2, 5000)
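To embed a larger file such as the AqSolDB SMILES, the two command-line calls above can be parameterized. The following is a minimal sketch, not part of the Grover code base: the helper names `grover_fingerprint_commands` and `run_grover_fingerprints` are hypothetical, and the flags are copied unchanged from the cell above.

```python
import subprocess
from pathlib import Path

# Location of the cloned repository, as used in the clone cell above.
GROVER_DIR = Path("../grover")


def grover_fingerprint_commands(csv_path, features_path, output_path,
                                checkpoint_path=GROVER_DIR / "data" / "grover_large.pt"):
    """Build the two Grover CLI calls for an arbitrary SMILES CSV (hypothetical helper)."""
    # Step 1: compute the normalized RDKit 2D features for every molecule.
    save_features = [
        "python", str(GROVER_DIR / "scripts" / "save_features.py"),
        "--data_path", str(csv_path),
        "--save_path", str(features_path),
        "--features_generator", "rdkit_2d_normalized",
        "--restart",
    ]
    # Step 2: run the pretrained model in fingerprint mode.
    fingerprint = [
        "python", str(GROVER_DIR / "main.py"), "fingerprint",
        "--data_path", str(csv_path),
        "--features_path", str(features_path),
        "--checkpoint_path", str(checkpoint_path),
        "--fingerprint_source", "both",
        "--output", str(output_path),
    ]
    return [save_features, fingerprint]


def run_grover_fingerprints(csv_path, features_path, output_path):
    """Run both commands in order and stop at the first failure."""
    for command in grover_fingerprint_commands(csv_path, features_path, output_path):
        subprocess.run(command, check=True)


# Example (not executed here):
# run_grover_fingerprints("test_smiles.csv", "test_features.npz", "test_fingerprints.npz")
```

The resulting file can then be loaded with `np.load(...)["fps"]` exactly as in the cell above.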


### ChemBERTa model

ChemBERTa is based on RoBERTa, a variant of the BERT NLP model, and treats SMILES strings as text that can be modeled. Most NLP models are nicely wrapped by the Hugging Face `transformers` library, so we can leverage its API. Further details on ChemBERTa can be found in the paper:

Chithrananda et al., ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction, arXiv 2020

or on GitHub: https://github.com/seyonechithrananda/bert-loves-chemistry

In [7]:
# Download the pretrained ChemBERTa model and its tokenizer
from transformers import AutoTokenizer, AutoModelForMaskedLM
chemberta_model_name = 'seyonec/ChemBERTa-zinc-base-v1'
chemberta_tokenizer = AutoTokenizer.from_pretrained(chemberta_model_name)
chemberta_model = AutoModelForMaskedLM.from_pretrained(chemberta_model_name)

In [9]:
import torch
def embed_smiles(smiles, tokenizer, model, layers):
    """
    Returns the embedding of a SMILES string: the hidden states of the
    selected layers, averaged over all tokens and concatenated across layers.
    """
    # Tokenize the SMILES string
    tokenized_input = tokenizer(smiles, return_tensors='pt')
    # Run the model without tracking gradients and keep all hidden states
    with torch.no_grad():
        output = model(**tokenized_input, output_hidden_states=True)
    # Stack the selected layers, average over the batch and token axes, then flatten
    states = torch.stack([output.hidden_states[l] for l in layers]).mean([1,2]).view(-1)
    return states.detach().numpy()

test_embedding = embed_smiles('CC(=O)O', chemberta_tokenizer, chemberta_model, [-1])
print(test_embedding.shape)
print(test_embedding[:10])

(768,)
[ 0.30163023  0.50087255 -0.67029685 -1.5062698   0.09748616 -0.6335993
 -0.02147115  0.14238326 -1.3668206   0.44067818]
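`embed_smiles` handles one SMILES string at a time and averages over every token. If several SMILES strings are tokenized together with padding, the padded positions should be left out of the average. The sketch below shows that masked mean in NumPy. The function name `masked_mean_pool` is hypothetical, and it assumes a hidden-state array of shape `(batch, seq_len, hidden)` plus the `attention_mask` that the tokenizer returns (1 for real tokens, 0 for padding).

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per molecule, ignoring padded positions.

    hidden_states:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy example: two "molecules", three token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [1.0, 1.0], [4.0, 4.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

With PyTorch tensors, the same function can be applied after converting the hidden states and the mask with `.numpy()`.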


In [None]:
# Load the AqSolDB data (expects curated-solubility-dataset.csv in the working directory)
import pandas as pd
df_aqsol = pd.read_csv('curated-solubility-dataset.csv')
print(df_aqsol.head(4))
smiles = df_aqsol['SMILES'].values
targets = df_aqsol['Solubility'].values

### Your tasks
1. Create embeddings for the molecules in the AqSolDB dataset using both the pretrained ChemBERTa model and the Grover model
2. Train a suitable scikit-learn model on top of these embeddings to predict the solubility
3. Experiment with this setting and summarize your findings
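
As a possible starting point for task 2, here is a minimal sketch of a cross-validated scikit-learn baseline. It is one option among many, not a reference solution. The random `X` and `y` are placeholders so the snippet runs on its own; in the notebook they would be replaced by the embedding matrix (one row per molecule) and the `targets` array. The choice of ridge regression and of `alpha=1.0` is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: replace X with the real embeddings and y with `targets`.
X = rng.normal(size=(200, 768))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=200)

# Scaling lives inside the pipeline so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# scikit-learn reports the negated RMSE so that larger is better; flip the sign.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

Comparing the mean RMSE across the ChemBERTa embeddings, the Grover embeddings, and different regressors gives a simple basis for task 3.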

### The advanced stuff
4. Fine-tune Grover and ChemBERTa on the AqSolDB solubility prediction task
5. Experiment and summarize your findings

For fine-tuning Grover, have a look at their GitHub page. For fine-tuning a language model from Hugging Face, see https://huggingface.co/docs/transformers/training