# MegaMolBART Embedding Consistency

Make sure the model weights are available in `/workspace/bionemo/models`. This notebook uses the interactive inference mode, so there is no need to launch a local Triton server explicitly.

In [36]:
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import numpy as np

In [37]:
from typing import List
from pathlib import Path
import os

try:
    BIONEMO_HOME: Path = Path(os.environ['BIONEMO_HOME']).absolute()
except KeyError:
    print("Must have BIONEMO_HOME set in the environment! See docs for instructions.")
    raise

config_path = BIONEMO_HOME / "examples" / "molecule" / "megamolbart" / "conf"
print(f"Using model configuration at: {config_path}")
assert config_path.is_dir()

Using model configuration at: /workspace/bionemo/examples/molecule/megamolbart/conf


### Setup and Test Data

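`load_model_for_inference` builds the inference object from the loaded configuration. With `interactive=True` it returns a `MegaMolBARTInference` instance that can be queried directly from the notebook, without a separate inference service.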

In [38]:
from bionemo.utils.hydra import load_model_config

cfg = load_model_config(config_name="infer.yaml", config_path=config_path)

In [39]:
from bionemo.triton.utils import load_model_for_inference
from bionemo.model.molecule.megamolbart.infer import MegaMolBARTInference

inferer = load_model_for_inference(cfg, interactive=True)

print(f"Loaded a {type(inferer)}")
assert isinstance(inferer, MegaMolBARTInference)

[NeMo I 2024-05-22 21:52:55 utils:490] pytorch DDP is not initialized. Initializing with pytorch-lightening...


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2024-05-22 21:52:55 utils:333] Restoring model from /workspace/bionemo/models/molecule/megamolbart/megamolbart.nemo
[NeMo I 2024-05-22 21:52:55 utils:337] Loading model class: bionemo.model.molecule.megamolbart.megamolbart_model.MegaMolBARTModel


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Interactive mode selected, using strategy='auto'
[NeMo I 2024-05-22 21:52:56 exp_manager:394] Experiments will be logged at /workspace/bionemo/examples/molecule/megamolbart/nbs/nemo_experiments/MegaMolBART_Inference/2024-05-22_21-47-12
[NeMo I 2024-05-22 21:52:56 exp_manager:835] TensorboardLogger has been set up
[NeMo I 2024-05-22 21:52:56 utils:306] 
    
    ************** Trainer configuration ***********
[NeMo I 2024-05-22 21:52:56 utils:307] 
    name: MegaMolBART_Inference
    desc: Minimum configuration for initializing a MegaMolBART model for inference.
    trainer:
      precision: 16-mixed
      devices: 1
      num_nodes: 1
      accelerator: gpu
      logger: false
      accumulate_grad_batches: 1
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_checkpoint_callback: false
    model:
      name: small_span_aug
      micro_batch_size: ${model.data.batch_size}
      global_batch_size: 128
      tensor_model_parallel_size: 1
  

[NeMo W 2024-05-22 21:52:56 megatron_base_model:821] The model: MegaMolBARTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-22 21:52:56 megatron_base_model:821] The model: MegaMolBARTModel() does not have field.name: virtual_pipeline_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-22 21:52:56 megatron_base_model:821] The model: MegaMolBARTModel() does not have field.name: sequence_parallel in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-22 21:52:56 megatron_base_model:821] The model: MegaMolBARTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-22 21:52:56 megatron_base_model:821] The model: MegaMolBARTModel() does not have field.name: use_cpu_initializat

[NeMo I 2024-05-22 21:52:56 megatron_init:234] Rank 0 has data parallel group: [0]
[NeMo I 2024-05-22 21:52:56 megatron_init:237] All data parallel group ranks: [[0]]
[NeMo I 2024-05-22 21:52:56 megatron_init:238] Ranks 0 has data parallel rank: 0
[NeMo I 2024-05-22 21:52:56 megatron_init:246] Rank 0 has model parallel group: [0]
[NeMo I 2024-05-22 21:52:56 megatron_init:247] All model parallel group ranks: [[0]]
[NeMo I 2024-05-22 21:52:56 megatron_init:257] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-05-22 21:52:56 megatron_init:261] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-05-22 21:52:56 megatron_init:262] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-05-22 21:52:56 megatron_init:276] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-05-22 21:52:56 megatron_init:288] Rank 0 has embedding group: [0]
[NeMo I 2024-05-22 21:52:56 megatron_init:294] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-05-22 21:52:56 megatron_init:295]

[NeMo I 2024-05-22 21:52:56 tokenizer_utils:199] Using regex tokenization
[NeMo I 2024-05-22 21:52:56 regex_tokenizer:240] Loading vocabulary from file = /tmp/tmpcfus9d0j/36b36f49c3e64962a7b54f1a1ba2b580_megamolbart.vocab
[NeMo I 2024-05-22 21:52:56 regex_tokenizer:254] Loading regex from file = /tmp/tmpcfus9d0j/111b90cc2819425382967ab999101096_megamolbart.model
[NeMo I 2024-05-22 21:52:56 megatron_base_model:315] Padded vocab_size: 640, original vocab_size: 523, dummy tokens: 117.
[NeMo I 2024-05-22 21:52:56 nlp_overrides:752] Model MegaMolBARTModel was successfully restored from /workspace/bionemo/models/molecule/megamolbart/megamolbart.nemo.
[NeMo I 2024-05-22 21:52:56 megatron_lm_encoder_decoder_model:1195] Decoding using the greedy-search method...
Loaded a <class 'bionemo.model.molecule.megamolbart.infer.MegaMolBARTInference'>


### SMILES to Embedding

`seq_to_embeddings` queries the model to fetch the encoder embedding for each input SMILES string.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
from rdkit import Chem

In [87]:
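def canonicalize_smiles(smiles: str) -> str:
    """Return the RDKit canonical form of the input SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, canonical=True)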

In [88]:
s = 'c1cc2ccccc2cc1'
cs = canonicalize_smiles(s)

# Test 1: absolute difference

In [89]:
smis1 = [s]  # original
smis2 = [cs]  # original, canonicalized
smis3 = [s, s]  # original, original
smis4 = [s, 'COc1cc2nc(N3CCN(C(=O)c4ccco4)CC3)nc(N)c2cc1OC']  # original, random molecule
smis5 = [s, 'CC(C)CC1=CC=C(C=C1)C2=CC=CC=C2C3=CC=C(C=C3)C4=CC=CC=C4']  # original, random molecule
smis6 = [s, 'Nc1nc(cs1)C(=NOCC(O)=O)C(=O)N[C@H]1[C@H]2SCC(C=C)=C(N2C1=O)C(O)=O', 'COc1cc2nc(N3CCN(C(=O)c4ccco4)CC3)nc(N)c2cc1OC']  # original, random, random

In [90]:

test1 = inferer.seq_to_embeddings(smis1)[0, :].cpu()
test2 = inferer.seq_to_embeddings(smis2)[0, :].cpu()
test3 = inferer.seq_to_embeddings(smis3)[0, :].cpu()
test4 = inferer.seq_to_embeddings(smis4)[0, :].cpu()
test5 = inferer.seq_to_embeddings(smis5)[0, :].cpu()
test6 = inferer.seq_to_embeddings(smis6)[0, :].cpu()

In [91]:
# original [s] vs. canonicalized [cs]: the embeddings differ by more than 1e-1 in at least one element
np.allclose(test1, test2, atol=1e-01, rtol=0)

False

In [92]:
# [s] vs. [s, s]: the embedding of s agrees within an absolute tolerance of 1e-3
np.allclose(test1, test3, atol=1e-03, rtol=0)

True

In [93]:
# [s] vs. [s, random]: the embedding of s agrees within an absolute tolerance of 1e-3
np.allclose(test1, test4, atol=1e-03, rtol=0)

True

In [94]:
# [s] vs. [s, random]: the embedding of s agrees within an absolute tolerance of 1e-3
np.allclose(test1, test5, atol=1e-03, rtol=0)

True

In [95]:
# [s] vs. [s, random, random]: the embedding of s agrees within an absolute tolerance of 1e-3
np.allclose(test1, test6, atol=1e-03, rtol=0)

True
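
With `rtol=0`, `np.allclose(a, b, atol=t)` reduces to checking that the largest elementwise absolute difference is at most `t`. Reporting that largest gap directly can be more informative than a bare `True`/`False`. The following is a minimal sketch on synthetic vectors: the arrays are made-up values, not model embeddings, and `max_abs_diff` is a helper name introduced here for illustration.

```python
import numpy as np


def max_abs_diff(a, b) -> float:
    """Largest elementwise absolute difference between two equally shaped arrays."""
    # Work in float64 so that low-precision inputs are not rounded again during subtraction.
    a64 = np.asarray(a, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a64 - b64)))


# Synthetic stand-ins for two embeddings of the same molecule (hypothetical values).
emb_a = np.array([0.5, -0.25, 1.0])
emb_b = np.array([0.5004, -0.25, 1.0])

gap = max_abs_diff(emb_a, emb_b)

# The tolerance check and the explicit gap tell the same story.
passes_1e3 = np.allclose(emb_a, emb_b, atol=1e-3, rtol=0)
passes_1e4 = np.allclose(emb_a, emb_b, atol=1e-4, rtol=0)
print(f"max abs diff = {gap:.6f}, within 1e-3: {passes_1e3}, within 1e-4: {passes_1e4}")
```

Applied to the tensors above, the same helper would show how far each batched embedding of `s` actually is from `test1`, rather than only whether it clears the chosen threshold.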

# Test 2: cosine similarity

In [96]:
# [s] vs [cs]
cos_sim = cosine_similarity(test1.reshape(-1, 1), test2.reshape(-1, 1))
np.mean(np.diag(cos_sim))

0.9140625

In [97]:
cos_sim = cosine_similarity(test1.reshape(-1, 1), test3.reshape(-1, 1))
np.mean(np.diag(cos_sim))

0.99609375

In [98]:
cos_sim = cosine_similarity(test1.reshape(-1, 1), test4.reshape(-1, 1))
np.mean(np.diag(cos_sim))

1.0

In [99]:
cos_sim = cosine_similarity(test1.reshape(-1, 1), test5.reshape(-1, 1))
np.mean(np.diag(cos_sim))

1.0

In [100]:
cos_sim = cosine_similarity(test1.reshape(-1, 1), test6.reshape(-1, 1))
np.mean(np.diag(cos_sim))

1.0
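
A note on shapes: `cosine_similarity` treats each row as one sample. After `reshape(-1, 1)`, every embedding dimension becomes its own one-element sample, so each diagonal entry is +1 or -1 (or 0 if an element is zero), and the mean of the diagonal measures how often the two embeddings agree in sign rather than the angle between the full vectors. The sketch below uses synthetic vectors (not model output) and plain NumPy to contrast the two quantities; `sign_agreement` and `vector_cosine` are names introduced here for illustration.

```python
import numpy as np


def sign_agreement(a, b) -> float:
    """Mean of sign(a_i) * sign(b_i): the quantity the per-element (n, 1) layout yields."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean(np.sign(a) * np.sign(b)))


def vector_cosine(a, b) -> float:
    """Cosine similarity between two whole vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic vectors: the third element has opposite sign in u and v.
u = np.array([1.0, 2.0, -3.0, 4.0])
v = np.array([2.0, 1.0, 3.0, 4.0])

agreement = sign_agreement(u, v)  # three elements agree in sign, one disagrees
cosine = vector_cosine(u, v)      # dot product 11 divided by norms sqrt(30) * sqrt(30)
print(agreement, cosine)
```

To compare whole embeddings with scikit-learn, the rows would need to be the vectors themselves, for example `cosine_similarity(test1.reshape(1, -1), test2.reshape(1, -1))`. The values recorded in this notebook were produced with the per-element layout.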

# Test 3: How to always generate the same result for a given molecule

To get identical results across runs, use a batch size of 1 for inference (i.e. embed one molecule per call).

In [103]:
N = 10  # number of repeated single-molecule inference calls

for _ in range(N):
    result = inferer.seq_to_embeddings(smis1)[0, :].cpu()
    
    # exactly the same array
    assert np.array_equal(test1, result)
    
    # cosine = 1
    assert np.mean(np.diag(cosine_similarity(test1.reshape(-1, 1), result.reshape(-1, 1)))) == 1
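
One way to apply this in practice is to wrap the embedding call so that every molecule is sent as its own batch. The helper below is a hypothetical sketch: `embed_one_at_a_time` and the `embed_batch` callable are names introduced here, and `embed_batch` is assumed to return a `(batch, hidden)` array-like, in the way `inferer.seq_to_embeddings(...).cpu()` is used above. A toy embedder stands in for the model so that the snippet runs on its own; it is deliberately batch-dependent for illustration and says nothing about why the real model's output varies with batch composition.

```python
from typing import Callable, List, Sequence

import numpy as np


def embed_one_at_a_time(
    smiles: Sequence[str],
    embed_batch: Callable[[List[str]], np.ndarray],
) -> np.ndarray:
    """Embed each SMILES in its own batch of size 1 and stack the results."""
    rows = []
    for smi in smiles:
        out = np.asarray(embed_batch([smi]))  # batch of exactly one molecule
        rows.append(out[0])
    return np.stack(rows)


def toy_embed_batch(batch: List[str]) -> np.ndarray:
    """Toy stand-in for the model (assumption): the second feature depends on the whole batch."""
    longest = max(len(s) for s in batch)
    return np.array([[float(len(s)), float(longest)] for s in batch])


smiles = ["CCO", "c1ccccc1"]

batched = toy_embed_batch(smiles)                      # depends on batch composition
single = embed_one_at_a_time(smiles, toy_embed_batch)  # independent of the other inputs

print(batched[0], single[0])
```

With the real model, the callable could be something like `lambda b: inferer.seq_to_embeddings(b).cpu()`, at the cost of one forward pass per molecule.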