## MolBERT: Molecular representation learning with language models and domain-relevant auxiliary tasks

ABSTRACT: We apply a Transformer architecture, specifically BERT, to learn flexible and high quality molecular representations for drug discovery problems, and study the impact of using different combinations of self-supervised tasks for pre-training. Our results on established Virtual Screening and QSAR benchmarks show that: i) The selection of appropriate self-supervised task(s) for pre-training has a significant impact on performance in subsequent downstream tasks such as Virtual Screening. ii) Using auxiliary tasks with more domain relevance for Chemistry, such as learning to predict calculated molecular properties, increases the fidelity of our learnt representations. iii) Finally, we show that molecular representations learnt by our model ‘MOLBERT’ improve upon the current state of the art on the benchmark datasets.

Link to paper: https://arxiv.org/pdf/2011.13230v1.pdf

Credit: https://github.com/BenevolentAI/MolBERT

In [29]:
# Clone the repository and cd into directory
!git clone https://github.com/BenevolentAI/MolBERT.git
%cd MolBERT

/content/MolBERT


In [None]:
# Install requirements / dependencies
!pip install -e .

# Install pinned versions of torchvision and torchtext
!pip install torchvision==0.6.0 torchtext==0.6.0

# Install RDKit 
!pip install rdkit-pypi==2021.3.1.5

### Load pretrained model
You can download the pretrained model [here](https://ndownloader.figshare.com/files/25611290).

After downloading the weights, you can follow `scripts/featurize.py` to load the model and use it as a featurizer (you just need to replace the path in the script).
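
The featurizer workflow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's definitive usage: the import path `molbert.utils.featurizer.molbert_featurizer.MolBertFeaturizer` and the `(features, masks)` return value of `transform` are assumptions about the repository's API, so check `scripts/featurize.py` before relying on them. The helper names `load_featurizer` and `featurize_valid` and the checkpoint path are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def load_featurizer(checkpoint_path: str) -> Any:
    """Build a MolBERT featurizer from a checkpoint (assumed API, see text above)."""
    # Imported lazily so this module can be imported without molbert installed.
    from molbert.utils.featurizer.molbert_featurizer import MolBertFeaturizer

    return MolBertFeaturizer(checkpoint_path)


def featurize_valid(featurizer: Any, smiles: Sequence[str]) -> List[Tuple[str, Any]]:
    """Return (smiles, embedding) pairs for molecules the featurizer accepted.

    Assumes featurizer.transform(smiles) returns (features, masks), where
    masks[i] is truthy when smiles[i] was featurized successfully.
    """
    features, masks = featurizer.transform(list(smiles))
    return [(s, f) for s, f, ok in zip(smiles, features, masks) if ok]


# Hypothetical usage (requires molbert and the downloaded weights):
# featurizer = load_featurizer("path/to/checkpoint.ckpt")
# pairs = featurize_valid(featurizer, ["CCO", "c1ccccc1"])
```

Filtering on the mask keeps each SMILES string aligned with its embedding, so inputs that could not be featurized do not silently shift the rows.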

### Train model from scratch

You can train on the GuacaMol dataset.

In [None]:
# download the guacamol dataset
!wget https://raw.githubusercontent.com/BenevolentAI/guacamol_baselines/master/fetch_guacamol_dataset.sh
!sh fetch_guacamol_dataset.sh

In [None]:
!python molbert/apps/smiles.py \
    --train_file data/guacamol_v1_train.smiles \
    --valid_file data/guacamol_v1_valid.smiles \
    --max_seq_length 128 \
    --batch_size 64 \
    --masked_lm 1 \
    --max_epochs 20 \
    --num_workers 8 \
    --val_check_interval 1 \
    --gpus 1 \
    --tiny

The command above passes `--tiny`, which trains a smaller model (useful when training on a CPU); remove it to train the full-size model. Add the `--fast_dev_run` flag for testing purposes. For the full list of options, see `molbert/apps/args.py` and `molbert/apps/smiles.py`.
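
Before launching a long run, it can help to sanity-check the training file against the `--max_seq_length` value. The sketch below assumes the `.smiles` file holds one SMILES string per line, and it uses character length only as a rough proxy for token count: multi-character tokens such as `Cl` make the real token count smaller, while any special tokens the tokenizer adds make it larger. The function name `summarize_smiles_file` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def summarize_smiles_file(path: str, max_seq_length: int = 128) -> Dict[str, int]:
    """Count molecules and flag strings longer than max_seq_length characters."""
    n_molecules = 0
    longest = 0
    n_too_long = 0
    with Path(path).open() as handle:
        for line in handle:
            smiles = line.strip()
            if not smiles:  # skip blank lines
                continue
            n_molecules += 1
            longest = max(longest, len(smiles))
            if len(smiles) > max_seq_length:
                n_too_long += 1
    return {"n_molecules": n_molecules, "longest": longest, "n_too_long": n_too_long}


# Hypothetical usage, once the dataset has been fetched:
# print(summarize_smiles_file("data/guacamol_v1_train.smiles", max_seq_length=128))
```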

### Finetune

Once you have a trained model, you can finetune it on a specific training set with the `FinetuneSmilesMolbertApp` class to specialize the model for your task.

For classification, set the mode to `classification` and the `output_size` to 2.

```shell script
python molbert/apps/finetune.py \
    --train_file path/to/train.csv \
    --valid_file path/to/valid.csv \
    --test_file path/to/test.csv \
    --mode classification \
    --output_size 2 \
    --pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
    --label_column my_label_column
```
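
The three CSV files passed to the command above have to be prepared beforehand. A minimal sketch of one way to produce them follows. It uses a plain random split (not a scaffold split), and the `SMILES` header for the structure column is an assumption about what the finetuning app expects, so verify it against the repository. The function names `split_rows` and `write_csv` are hypothetical.

```python
import csv
import random
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, int]  # (SMILES string, class label)


def split_rows(
    rows: Sequence[Row], valid_frac: float = 0.1, test_frac: float = 0.1, seed: int = 0
) -> Dict[str, List[Row]]:
    """Shuffle rows reproducibly and cut them into train/valid/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "valid": shuffled[:n_valid],
        "test": shuffled[n_valid:n_valid + n_test],
        "train": shuffled[n_valid + n_test:],
    }


def write_csv(path: str, rows: Sequence[Row], label_column: str = "my_label_column") -> None:
    """Write rows to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # "SMILES" as the structure column name is an assumption, see text above.
        writer.writerow(["SMILES", label_column])
        writer.writerows(rows)


# Hypothetical usage:
# splits = split_rows([("CCO", 1), ("c1ccccc1", 0), ...])
# for name, subset in splits.items():
#     write_csv(f"path/to/{name}.csv", subset)
```

The value given to `label_column` should match the `--label_column` argument of the finetuning command.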

For regression set the mode to `regression` and the `output_size` to 1.

```shell script
python molbert/apps/finetune.py \
    --train_file path/to/train.csv \
    --valid_file path/to/valid.csv \
    --test_file path/to/test.csv \
    --mode regression \
    --output_size 1 \
    --pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
    --label_column pIC50
```
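
The regression example uses `pIC50` as its label column. If the raw measurements are IC50 values, they need converting first; pIC50 is defined as the negative base-10 logarithm of the IC50 in mol/L. The sketch below assumes the raw values are in nanomolar, which is an assumption about your data rather than anything the repository requires, and the function name `ic50_nm_to_pic50` is hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)
```

For example, an IC50 of 1000 nM (1 µM) corresponds to a pIC50 of 6, and more potent compounds get larger values.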

To reproduce the finetuning experiments, use `scripts/run_qsar_test_molbert.py` and `scripts/run_finetuning.py`.
Both scripts rely on the [ChemBench](https://github.com/shenwanxiang/ChemBench) and, optionally, the [CDDD](https://github.com/jrwnter/cddd) repositories.
Please follow the installation instructions in their READMEs.