<a href="https://colab.research.google.com/github/seyonechithrananda/bert-loves-chemistry/blob/master/HuggingFace_ZINC_ROBERTA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip uninstall -y tensorflow
!pip install transformers

Uninstalling tensorflow-2.1.0:
  Successfully uninstalled tensorflow-2.1.0


In [0]:
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer


CPU times: user 17 µs, sys: 4 µs, total: 21 µs
Wall time: 25.3 µs


In [0]:
tokenizer = ByteLevelBPETokenizer()

In [0]:
tokenizer.train(files='/content/drive/My Drive/Project De Novo/100k_rndm_zinc_drugs_clean.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

In [0]:
!mkdir BERT_loves_chemistry
tokenizer.save("BERT_loves_chemistry")

mkdir: cannot create directory ‘BERT_loves_chemistry’: File exists


['BERT_loves_chemistry/vocab.json', 'BERT_loves_chemistry/merges.txt']

In [0]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing 

In [0]:
tokenizer = ByteLevelBPETokenizer(
    "/content/drive/My Drive/Project De Novo/BERT_loves_chemistry/vocab.json",
    "/content/drive/My Drive/Project De Novo/BERT_loves_chemistry/merges.txt",
)

In [0]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

tokenizer.enable_truncation(max_length=512)
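
To make the effect of the two calls above concrete, here is a minimal pure-Python sketch of what they do to a token sequence. It assumes the truncation length counts the two special tokens, so at most 510 content tokens survive. The helper `wrap_and_truncate` is hypothetical and is not part of `tokenizers`. Note that `BertProcessing` takes the separator pair (`</s>`) first and the classifier pair (`<s>`) second, which is why the arguments above appear in that order.

```python
def wrap_and_truncate(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Hypothetical helper: mimic truncation followed by BERT-style post-processing."""
    # Assumption: the length limit includes the two special tokens,
    # so we reserve two slots before cutting the sequence.
    budget = max_length - 2
    if budget < 0:
        raise ValueError("max_length must leave room for both special tokens")
    return [bos] + list(tokens[:budget]) + [eos]


short = wrap_and_truncate(["CCC", "(", "CC", ")"])
long = wrap_and_truncate(["C"] * 1000)

print(short)      # a short sequence is only wrapped
print(len(long))  # an over-long sequence is cut to max_length
```

This is why the remdesivir example in the next cell starts with `<s>` and ends with `</s>`.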

In [0]:
#tokenize remdesivir SMILES to test the tokenizer!
tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1")


Encoding(num_tokens=76, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])

In [0]:
tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1").tokens


['<s>',
 'CCC',
 '(',
 'CC',
 ')',
 'COC',
 '(=',
 'O',
 ')[',
 'C',
 '@',
 'H',
 '](',
 'C',
 ')',
 'N',
 '[',
 'P',
 '@](=',
 'O',
 ')(',
 'OC',
 '[',
 'C',
 '@',
 'H',
 ']',
 '1',
 'O',
 '[',
 'C',
 '@](',
 'C',
 '#',
 'N',
 ')([',
 'C',
 '@',
 'H',
 '](',
 'O',
 ')[',
 'C',
 '@@',
 'H',
 ']',
 '1',
 'O',
 ')',
 'C',
 '1',
 '=',
 'CC',
 '=',
 'C',
 '2',
 'N',
 '1',
 'N',
 '=',
 'CN',
 '=',
 'C',
 '2',
 'N',
 ')',
 'OC',
 '1',
 '=',
 'CC',
 '=',
 'CC',
 '=',
 'C',
 '1',
 '</s>']

## 3. Train a language model from scratch

We will now train our language model using the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) script from `transformers` (newly renamed from `run_lm_finetuning.py`, as it now supports training from scratch more seamlessly). Just remember to leave `--model_name_or_path` unset (`None`) to train from scratch rather than from an existing model or checkpoint.

> We’ll train a BERT-like model for chemistry, with the help of the tokenizer we trained on the ZINC 250k dataset.

As the model is BERT-like, we’ll train it on the task of *masked language modeling*, i.e. predicting how to fill in arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
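
As a rough illustration of the masking step, here is a minimal sketch on plain token lists. It assumes the standard BERT recipe: about 15% of tokens are selected, and of those 80% become `<mask>`, 10% become a random token and 10% are left unchanged. The function `mask_tokens` below is our own illustrative helper, not the script's code, which works on tensors of token ids.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", vocab=None, mlm_probability=0.15, seed=0):
    """Illustrative masking on a list of string tokens (assumed 80/10/10 recipe)."""
    rng = random.Random(seed)
    special = {"<s>", "</s>", "<pad>"}
    # Fall back to the tokens in the sequence itself as the "vocabulary".
    vocab = vocab or sorted(set(tokens) - special)
    inputs, labels = [], []
    for tok in tokens:
        # Special tokens are never masked; most ordinary tokens are kept too.
        if tok in special or rng.random() >= mlm_probability:
            inputs.append(tok)
            labels.append(None)  # None means "not scored by the loss"
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(mask_token)         # 80%: replace with <mask>
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            inputs.append(tok)                # 10%: keep the token as is
    return inputs, labels


tokens = ["<s>", "CCC", "(", "CC", ")", "COC", "(=", "O", ")", "</s>"]
# A high probability is used here only so that a short example shows some masks.
inputs, labels = mask_tokens(tokens, mlm_probability=0.5)
print(inputs)
print(labels)
```

Only the positions with a non-`None` label contribute to the loss; everything else is context.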

In [0]:
!nvidia-smi #check GPU 

Wed Mar 25 04:10:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     7W /  75W |     10MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
import torch
torch.cuda.is_available() #checking if CUDA + Colab GPU works

True

In [0]:
# Get the example scripts.
!wget -c https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py

--2020-03-25 04:10:55--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [0]:
import json 
config = {
    "architectures": [
        "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 514,
    "model_type": "roberta",
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "type_vocab_size": 1,
    "vocab_size": 52000
}
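
To get a feel for the size of the model this config describes, we can estimate its parameter count by hand. This is a back-of-the-envelope sketch, assuming a standard RoBERTa-style encoder with a masked-LM head whose output weights are tied to the token embeddings. The function `roberta_param_count` and the dict `sketch_config` are illustrative names, and the exact figure reported by `transformers` may differ slightly.

```python
def roberta_param_count(cfg):
    """Rough parameter estimate for a RoBERTa-style masked LM (assumed layout)."""
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    layer_norm = 2 * h  # one weight and one bias vector

    # Token, position and token-type embedding tables plus one LayerNorm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm

    # Query, key, value and output projections (with biases) plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers of the feed-forward block (with biases) plus LayerNorm.
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)

    # LM head: dense layer, LayerNorm and an output bias; the output
    # projection is assumed to reuse the token embedding matrix.
    lm_head = (h * h + h) + layer_norm + cfg["vocab_size"]

    return {
        "embeddings": embeddings,
        "encoder": encoder,
        "lm_head": lm_head,
        "total": embeddings + encoder + lm_head,
    }


# Only the size-related keys of the config above are needed.
sketch_config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "num_hidden_layers": 6,
    "type_vocab_size": 1,
    "vocab_size": 52000,
}

counts = roberta_param_count(sketch_config)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Under these assumptions the total lands at roughly 83.5M parameters, with the 52,000-token embedding table accounting for almost half of it.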


In [0]:
with open("./BERT_loves_chemistry/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {
	"max_len": 512
}
with open("./BERT_loves_chemistry/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)

In [0]:
cd /content/drive/My Drive/Project De Novo

/content/drive/My Drive/Project De Novo


In [0]:
# run script
cmd = """
    python run_language_modeling.py
    --train_data_file ./100k_rndm_zinc_drugs_clean.txt
    --output_dir ./output_dir
    --model_type roberta
    --mlm
    --config_name ./BERT_loves_chemistry
    --tokenizer_name ./BERT_loves_chemistry
    --do_train
    --line_by_line
    --learning_rate 1e-4
    --num_train_epochs 1
    --save_total_limit 2
    --save_steps 2000
    --per_gpu_train_batch_size 16
    --seed 42
""".replace("\n", " ")
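
As a quick sanity check on these flags, we can work out how many optimisation steps one epoch takes and which checkpoints should remain on disk. This is a sketch under stated assumptions: the file holds 100,000 SMILES lines (inferred from its name), training runs on a single GPU, and `--save_total_limit` keeps only the most recent checkpoints. The helper `training_schedule` is hypothetical.

```python
import math


def training_schedule(num_examples, batch_size, epochs, save_steps, save_total_limit):
    """Hypothetical helper: derive step and checkpoint counts from the CLI flags."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # A checkpoint is written every `save_steps` optimisation steps.
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    # Assumption: only the newest `save_total_limit` checkpoints are kept.
    kept = checkpoints[-save_total_limit:] if save_total_limit > 0 else checkpoints
    return steps_per_epoch, total_steps, checkpoints, kept


steps_per_epoch, total_steps, checkpoints, kept = training_schedule(
    num_examples=100_000,  # assumed from the file name 100k_rndm_zinc_drugs_clean.txt
    batch_size=16,         # --per_gpu_train_batch_size, single GPU assumed
    epochs=1,              # --num_train_epochs
    save_steps=2000,       # --save_steps
    save_total_limit=2,    # --save_total_limit
)
print(steps_per_epoch, total_steps)
print(checkpoints, kept)
```

With these inputs one epoch is 6,250 steps, which matches the `6250` shown in the progress bar of the training log.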

In [0]:
%%time
!{cmd}

Streaming output truncated to the last 5000 lines.
Iteration:   8% 491/6250 [00:54<10:49,  8.87it/s]
Iteration:   8% 492/6250 [00:54<11:04,  8.67it/s]
Iteration:   8% 493/6250 [00:54<10:45,  8.91it/s]
Iteration:   8% 495/6250 [00:54<10:18,  9.30it/s]
Iteration:   8% 496/6250 [00:54<10:40,  8.99it/s]
Iteration:   8% 497/6250 [00:54<11:12,  8.56it/s]

Iteration:   8% 500/6250 [00:55<10:41,  8.96it/s]
Iteration:   8% 502/6250 [00:55<10:15,  9.34it/s]
Iteration:   8% 503/6250 [00:55<10:10,  9.41it/s]
Iteration:   8% 504/6250 [00:55<10:50,  8.83it/s]
Iteration:   8% 505/6250 [00:55<11:15,  8.50it/s]
Iteration:   8% 506/6250 [00:55<11:13,  8.53it/s]
Iteration:   8% 508/6250 [00:55<10:47,  8.87it/s]
Iteration:   8% 509/6250 [00:56<10:35,  9.04it/s]
Iteration:   8% 510/6250 [00:56<10:24,  9.18it/s]
Iteration:   8% 511/6250 [00:56<10:43,  8.92it/s]
Iteration:   8% 512/6250 [00:56<10:32,  9.08it/s]
Iteration:   8% 513/6250 [00:56<1