<a href="https://colab.research.google.com/github/seyonechithrananda/bert-loves-chemistry/blob/master/HuggingFace_ZINC_ROBERTA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip uninstall -y tensorflow
!pip install transformers



In [2]:
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer


CPU times: user 6.41 ms, sys: 0 ns, total: 6.41 ms
Wall time: 8.14 ms


In [0]:
tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files='/content/drive/My Drive/Project De Novo/100k_rndm_zinc_drugs_clean.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Let's make a new directory, `BERT_loves_chemistry`, to store our tokenizer files.

In [4]:
!mkdir BERT_loves_chemistry
tokenizer.save("BERT_loves_chemistry")

mkdir: cannot create directory ‘BERT_loves_chemistry’: File exists


['BERT_loves_chemistry/vocab.json', 'BERT_loves_chemistry/merges.txt']

In [0]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing 

In [0]:
tokenizer = ByteLevelBPETokenizer(
    "/content/drive/My Drive/Project De Novo/BERT_loves_chemistry/vocab.json",
    "/content/drive/My Drive/Project De Novo/BERT_loves_chemistry/merges.txt",
)

In [0]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

tokenizer.enable_truncation(max_length=512)
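
As a rough illustration of what the post-processor and truncation settings above do to a sequence, here is a minimal pure-Python sketch. It does not call the `tokenizers` library. The helper `wrap_and_truncate` is hypothetical. The ids 0 and 2 for `<s>` and `</s>` are an assumption based on the order of `special_tokens` used during training (check with `tokenizer.token_to_id`). We also assume that `max_length` counts the two special tokens.

```python
# Conceptual sketch only: mimics the effect of BertProcessing + truncation
# on a plain list of token ids. It does not use the tokenizers library.

def wrap_and_truncate(ids, cls_id, sep_id, max_length=512):
    """Truncate `ids` so that, once <s> and </s> are added,
    the total length does not exceed `max_length`."""
    room = max_length - 2  # two slots reserved for <s> and </s>
    return [cls_id] + list(ids[:room]) + [sep_id]


# Assumed ids: "<s>" -> 0 and "</s>" -> 2 (order of special_tokens above).
CLS_ID, SEP_ID = 0, 2

short = wrap_and_truncate([10, 11, 12], CLS_ID, SEP_ID)
long = wrap_and_truncate(range(5, 1005), CLS_ID, SEP_ID)

print(short)      # [0, 10, 11, 12, 2]
print(len(long))  # 512
```

Short sequences are only wrapped, while long ones are cut so that the wrapped result fits the 512-token limit.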

In [8]:
#tokenize remdesivir SMILES to test the tokenizer!
tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1")


Encoding(num_tokens=76, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])

In [9]:
tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1").tokens


['<s>',
 'CCC',
 '(',
 'CC',
 ')',
 'COC',
 '(=',
 'O',
 ')[',
 'C',
 '@',
 'H',
 '](',
 'C',
 ')',
 'N',
 '[',
 'P',
 '@](=',
 'O',
 ')(',
 'OC',
 '[',
 'C',
 '@',
 'H',
 ']',
 '1',
 'O',
 '[',
 'C',
 '@](',
 'C',
 '#',
 'N',
 ')([',
 'C',
 '@',
 'H',
 '](',
 'O',
 ')[',
 'C',
 '@@',
 'H',
 ']',
 '1',
 'O',
 ')',
 'C',
 '1',
 '=',
 'CC',
 '=',
 'C',
 '2',
 'N',
 '1',
 'N',
 '=',
 'CN',
 '=',
 'C',
 '2',
 'N',
 ')',
 'OC',
 '1',
 '=',
 'CC',
 '=',
 'CC',
 '=',
 'C',
 '1',
 '</s>']
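
A quick sanity check on the output above: once the special tokens are dropped, the tokens should concatenate back into the original SMILES string. The sketch below checks this on the first few tokens only. It relies on the input being plain ASCII with no spaces. Byte-level BPE remaps spaces and non-ASCII bytes, so in general `tokenizer.decode` is the safer route.

```python
# First few tokens copied from the output above (special token included).
tokens = ['<s>', 'CCC', '(', 'CC', ')', 'COC', '(=', 'O', ')[',
          'C', '@', 'H', '](', 'C', ')', 'N']

# Matching prefix of the remdesivir SMILES string that was encoded.
smiles_prefix = "CCC(CC)COC(=O)[C@H](C)N"

SPECIAL = {"<s>", "<pad>", "</s>", "<unk>", "<mask>"}

# Drop special tokens and glue the remaining pieces back together.
rebuilt = "".join(t for t in tokens if t not in SPECIAL)

print(rebuilt == smiles_prefix)  # True
```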

## 3. Train a language model from scratch

We will now train our language model using the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) script from `transformers` (recently renamed from `run_lm_finetuning.py`, as it now supports training from scratch more seamlessly). Just remember to leave `--model_name_or_path` unset (i.e. `None`) to train from scratch rather than from an existing model or checkpoint.

> We’ll train a BERT-for-chemistry model with the help of our tokenizer, trained on the ZINC 250k dataset we used.

As the model is BERT-like, we’ll train it on the task of *masked language modeling*, i.e. predicting how to fill in arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
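
To make the masking step concrete, here is a minimal pure-Python sketch of the scheme from the original BERT paper, which we assume the example script follows: about 15% of non-special tokens are selected; of those, 80% become `<mask>`, 10% become a random token, and 10% stay unchanged. The function `mask_tokens` below is a hypothetical helper, not the script's own code. The label `-100` marks positions to ignore, matching the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random


def mask_tokens(ids, mask_id, vocab_size, special_ids, mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required
    and -100 everywhere else."""
    rng = rng or random.Random()
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        # Never select special tokens; select others with prob. mlm_prob.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Toy sequence: <s> (0), forty ordinary ids, </s> (2). Ids are assumptions.
ids = [0] + list(range(5, 45)) + [2]
inputs, labels = mask_tokens(
    ids, mask_id=4, vocab_size=52000,
    special_ids={0, 1, 2, 3, 4}, rng=random.Random(42),
)
print(sum(label != -100 for label in labels), "tokens selected for prediction")
```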

In [10]:
!nvidia-smi #check GPU 

Tue Apr  7 02:09:37 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [11]:
import torch
torch.cuda.is_available() #checking if CUDA + Colab GPU works

True

In [12]:
# Get the example scripts.
!wget -c https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py

--2020-04-07 02:09:41--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [0]:
import json
config = {
    "architectures": [
        "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 514,
    "model_type": "roberta",
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "type_vocab_size": 1,
    "vocab_size": 52000
}
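
To get a feel for how large this configuration is, the sketch below estimates the number of trainable parameters from the config values alone. It is a back-of-the-envelope count under stated assumptions (standard Transformer encoder layers, an LM head with one dense layer, a LayerNorm and an output bias, decoder weights tied to the embeddings, no pooler). The exact number reported by `transformers` may differ slightly. The function `roberta_param_count` is a hypothetical helper.

```python
def roberta_param_count(cfg):
    """Rough parameter count for a RoBERTa-style masked LM (assumptions above)."""
    h, i = cfg["hidden_size"], cfg["intermediate_size"]
    v = cfg["vocab_size"]
    p = cfg["max_position_embeddings"]
    t = cfg["type_vocab_size"]
    layer_norm = 2 * h                            # scale + bias

    embeddings = (v + p + t) * h + layer_norm     # word + position + type
    attention = 4 * (h * h + h) + layer_norm      # Q, K, V, output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    lm_head = (h * h + h) + layer_norm + v        # dense + LayerNorm + bias

    return {
        "embeddings": embeddings,
        "encoder": encoder,
        "lm_head": lm_head,
        "total": embeddings + encoder + lm_head,
    }


# Only the size-related fields of the config above are needed here.
cfg = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "num_hidden_layers": 6,
    "type_vocab_size": 1,
    "vocab_size": 52000,
}

counts = roberta_param_count(cfg)
for name, n in counts.items():
    print(f"{name:>10}: {n / 1e6:.1f}M")
```

With a 52,000-token vocabulary and only 6 layers, the embedding matrix accounts for close to half of the estimate, which is worth keeping in mind when choosing `vocab_size`.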


In [0]:
with open("./BERT_loves_chemistry/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {
    "max_len": 512
}
with open("./BERT_loves_chemistry/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)

In [15]:
cd /content/drive/My Drive/Project De Novo

/content/drive/My Drive/Project De Novo


In [0]:
# run script
cmd = """
  python run_language_modeling.py
  --train_data_file ./100k_rndm_zinc_drugs_clean.txt
  --output_dir ./output_dir
  --model_type roberta
  --mlm
  --config_name ./BERT_loves_chemistry
  --tokenizer_name ./BERT_loves_chemistry
  --do_train
  --line_by_line
  --learning_rate 1e-4
  --num_train_epochs 12
  --save_total_limit 2
  --save_steps 2000
  --per_gpu_train_batch_size 16
  --seed 42
""".replace("\n", " ")
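
The string above works because every path in it is relative and free of spaces. If you swap in absolute paths under a directory such as `My Drive`, the shell would split them on the space. One possible alternative, shown here as a sketch with the same flags, is to keep the arguments in a list and quote each one with `shlex.quote`. The variable names `args` and `quoted_cmd` are our own.

```python
import shlex

# Same flags as the command above, one list element per argument.
args = [
    "python", "run_language_modeling.py",
    "--train_data_file", "./100k_rndm_zinc_drugs_clean.txt",
    "--output_dir", "./output_dir",
    "--model_type", "roberta",
    "--mlm",
    "--config_name", "./BERT_loves_chemistry",
    "--tokenizer_name", "./BERT_loves_chemistry",
    "--do_train",
    "--line_by_line",
    "--learning_rate", "1e-4",
    "--num_train_epochs", "12",
    "--save_total_limit", "2",
    "--save_steps", "2000",
    "--per_gpu_train_batch_size", "16",
    "--seed", "42",
]

# Quote each argument so paths containing spaces survive the shell.
quoted_cmd = " ".join(shlex.quote(a) for a in args)
print(quoted_cmd)
```

The resulting `quoted_cmd` can be run the same way, with `!{quoted_cmd}`.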

In [20]:
%%time
!{cmd}

KeyboardInterrupt: ignored

To visualize the training progress, we can use `!tensorboard dev upload --logdir ./path/to/runs`

# Exporting the model
To share the model with the NLP community (where language models trained on chemical data are scarce), you need to export it in the appropriate format. Let's prepare the path to the model’s latest checkpoint and then run the following code:




In [0]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import os


directory = "/path/to/your/model/checkpoint-30000"

model = AutoModelWithLMHead.from_pretrained(directory)
tokenizer = AutoTokenizer.from_pretrained(directory)
out = "ChemBERTa-zinc-base-v1"
os.makedirs(out, exist_ok=True)
model.save_pretrained(out)
tokenizer.save_pretrained(out)
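
Rather than typing the checkpoint path by hand, you could locate the latest checkpoint programmatically. The sketch below assumes the training script writes checkpoints as `checkpoint-<step>` subdirectories of the output directory, as the placeholder path above suggests. `latest_checkpoint` is a hypothetical helper, and the temporary directory only stands in for a real `output_dir`.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step,
    or None if there is none."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo on a throwaway directory standing in for ./output_dir.
with tempfile.TemporaryDirectory() as tmp:
    for step in (2000, 10000, 4000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    found = latest_checkpoint(tmp).name

print(found)  # checkpoint-10000
```

Comparing the step numbers as integers matters here: a plain string sort would rank `checkpoint-4000` above `checkpoint-10000`.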


From here, we'll upload the weights to HuggingFace using the `transformers-cli` utility:



In [0]:
!transformers-cli upload ./ChemBERTa-zinc-base-v1/
