- First, install the required dependencies using:

  ```bash
  pip install -r requirements.txt
  ```

  You can find the `requirements.txt` file here.
- Download the Guacamol dataset using the script found in `src/1_pre_training/data/guacamol_dataset_downloader.py`:

  ```bash
  python guacamol_dataset_downloader.py
  ```
- Next, train the tokenizer using the script found in `src/1_pre_training/train_bert_tokenizer.py`. After adjusting the `MODEL_NAME` variable in the script, execute:

  ```bash
  python train_bert_tokenizer.py
  ```
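For orientation, this is roughly what the tokenizer-training step amounts to, sketched with the Hugging Face `tokenizers` library; the corpus filename, vocabulary size, and `MODEL_NAME` value are assumptions for illustration and may differ from the actual script.

```python
import os

from tokenizers import BertWordPieceTokenizer

MODEL_NAME = "smole-bert"  # assumed value; set this to match MODEL_NAME in the script

# Train a WordPiece tokenizer on the SMILES corpus downloaded in the previous step
# (the corpus filename below is an assumption).
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["guacamol_v1_train.smiles"], vocab_size=4096, min_frequency=2)

# Save the vocabulary so the pre-training scripts can load the tokenizer by name.
os.makedirs(MODEL_NAME, exist_ok=True)
tokenizer.save_model(MODEL_NAME)
```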
To reproduce our pre-training, please follow the steps below:
- Make sure you have adjusted the tokenizer name and trained the tokenizer using the instructions above.
- Finally, start pre-training the BERT model with the masked language modeling (MLM) objective using the script in `src/1_pre_training/mlm_pre_training/mlm_pre_train_bert.py`:

  ```bash
  python mlm_pre_train_bert.py
  ```
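As a rough picture of the MLM objective (not the actual contents of `mlm_pre_train_bert.py`): a fraction of SMILES tokens is masked and BERT is trained to reconstruct them. The tokenizer path, model size, toy corpus, and training arguments below are assumptions.

```python
from datasets import Dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Tokenizer trained in the previous step (path is an assumption).
tokenizer = BertTokenizerFast.from_pretrained("smole-bert")
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# Toy SMILES stand-in; the real run iterates over the Guacamol training set.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
dataset = Dataset.from_dict({"text": smiles}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# The collator masks 15% of input tokens; the model learns to predict them back.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="smole-bert", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```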
- Make sure you have adjusted the tokenizer name and trained the tokenizer using the instructions above.
- In addition to downloading the dataset and training the tokenizer, for MTR you also need to prepare physicochemical properties as labels and compute normalization values (mean and std) over the extracted properties. You can do this by executing the script `src/1_pre_training/data/prepare_mtr_dataset.py`:

  ```bash
  python prepare_mtr_dataset.py
  ```
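For intuition, the label-preparation step boils down to something like the following; the choice of RDKit, the specific descriptors, and the toy SMILES list are assumptions for illustration, not the exact contents of `prepare_mtr_dataset.py`.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

# Toy SMILES list; the real script runs over the Guacamol dataset.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]

# Compute a few physicochemical descriptors per molecule to use as regression labels.
props = np.array([
    [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]
    for mol in (Chem.MolFromSmiles(s) for s in smiles)
])

# Normalization statistics (mean and std) used to standardize the regression targets.
mean, std = props.mean(axis=0), props.std(axis=0)
normalized_labels = (props - mean) / std
```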
- Then, start pre-training the BERT model with the Multi-Task Regression (MTR) objective using the script in `src/1_pre_training/mtr_pretraining/mtr_pre_train_bert.py`:

  ```bash
  python mtr_pre_train_bert.py
  ```
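To make the MTR objective concrete, below is a minimal sketch of a BERT encoder with a regression head trained against the normalized property labels; the class name, number of properties, initialization, and use of the pooled output are assumptions, not the actual implementation in `mtr_pre_train_bert.py`.

```python
from torch import nn
from transformers import BertConfig, BertModel


class BertForMultiTaskRegression(nn.Module):
    """BERT encoder with a linear head predicting several physicochemical properties at once."""

    def __init__(self, vocab_size: int, num_properties: int = 3):
        super().__init__()
        self.bert = BertModel(BertConfig(vocab_size=vocab_size))
        self.head = nn.Linear(self.bert.config.hidden_size, num_properties)

    def forward(self, input_ids, attention_mask, labels=None):
        pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        preds = self.head(pooled)
        # Mean-squared error against the normalized property labels from the previous step.
        loss = nn.functional.mse_loss(preds, labels) if labels is not None else None
        return loss, preds
```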
After training completes, you can find the pre-trained models for MLM and MTR pre-training in `src/1_pre_training/smole-bert` and `src/1_pre_training/smole-bert-mtr`, respectively.
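Assuming the output directories contain standard Hugging Face checkpoints, the pre-trained encoders can then be loaded for downstream use, for example:

```python
from transformers import AutoModel, AutoTokenizer

# Load the MLM-pre-trained encoder; use src/1_pre_training/smole-bert-mtr for the MTR variant.
tokenizer = AutoTokenizer.from_pretrained("src/1_pre_training/smole-bert")
model = AutoModel.from_pretrained("src/1_pre_training/smole-bert")

inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")
embeddings = model(**inputs).last_hidden_state
```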
- The steps above can be skipped if you only want to pre-train BART.
- Download the Guacamol dataset tailored for Seq2Seq BART using the script found in `src/1_pre_training/data/download_bart_dataset.py`:

  ```bash
  python download_bart_dataset.py
  ```
- Then, start pre-training the BART model with the denoising objective (masking and permutation) using the script in `src/1_pre_training/bart_pre_training/denoise_pre_train_bart.py`:

  ```bash
  python denoise_pre_train_bart.py --model_name_or_path='shahrukhx01/smole-bart' \
      --dataset_name='guacamol_data' --per_device_train_batch_size=16 \
      --output_dir=smole-bart-mask-permute --num_train_epochs=10 --masking_noise=1
  ```
- To pre-train the BART model with the denoising objective without masking (permutation only), use the same script in `src/1_pre_training/bart_pre_training/denoise_pre_train_bart.py`:

  ```bash
  python denoise_pre_train_bart.py --model_name_or_path='shahrukhx01/smole-bart' \
      --dataset_name='guacamol_data' --per_device_train_batch_size=16 \
      --output_dir=smole-bart-permute-only --num_train_epochs=10 --masking_noise=0
  ```
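As a quick sanity check of the denoising setup (assuming the `shahrukhx01/smole-bart` checkpoint referenced above is available on the Hugging Face Hub), you can feed the seq2seq model a corrupted SMILES string and let it attempt a reconstruction; an exact reconstruction of this toy input is not guaranteed.

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("shahrukhx01/smole-bart")
model = BartForConditionalGeneration.from_pretrained("shahrukhx01/smole-bart")

# Denoising setup: the encoder sees a corrupted SMILES (here with a masked span)
# and the decoder is trained to reconstruct the original string.
corrupted = "CC(=O)Oc1ccccc1" + tokenizer.mask_token
inputs = tokenizer(corrupted, return_tensors="pt")
generated = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```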