# ESM2nv embedding consistency

In [47]:
import warnings
from pathlib import Path
import os
import numpy as np

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Load model 

In [1]:

try:
    BIONEMO_HOME: Path = Path(os.environ['BIONEMO_HOME']).absolute()
except KeyError:
    print("Must have BIONEMO_HOME set in the environment! See docs for instructions.")
    raise

config_path = BIONEMO_HOME / "examples" / "protein" / "esm2nv" / "conf"
print(f"Using model configuration at: {config_path}")
assert config_path.is_dir()

Using model configuration at: /workspace/bionemo/examples/protein/esm2nv/conf


In [5]:
from bionemo.utils.hydra import load_model_config

# The config file at /workspace/bionemo/examples/protein/esm2nv/conf/infer.yaml uses the 650M model,
# as shown by this line in the YAML file: restore_from_path: "${oc.env:BIONEMO_HOME}/models/protein/esm2nv/esm2nv_650M_converted.nemo"
cfg = load_model_config(config_name="infer.yaml", config_path=config_path)


In [6]:
from bionemo.triton.utils import load_model_for_inference
from bionemo.model.protein.esm1nv.infer import ESM1nvInference

inferer = load_model_for_inference(cfg, interactive=True)

print(f"Loaded a {type(inferer)}")
assert isinstance(inferer, ESM1nvInference)

INFO:rdkit:Enabling RDKit 2023.09.1 jupyter extensions
INFO:datasets:PyTorch version 2.1.0a0+32f93b1 available.


[NeMo I 2024-05-30 19:32:14 megatron_hiddens:110] Registered hidden transform sampled_var_cond_gaussian at bionemo.model.core.hiddens_support.SampledVarGaussianHiddenTransform
[NeMo I 2024-05-30 19:32:14 megatron_hiddens:110] Registered hidden transform interp_var_cond_gaussian at bionemo.model.core.hiddens_support.InterpVarGaussianHiddenTransform
[NeMo I 2024-05-30 19:32:14 utils:490] pytorch DDP is not initialized. Initializing with pytorch-lightening...


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2024-05-30 19:32:14 utils:333] Restoring model from /workspace/bionemo/models/protein/esm2nv/esm2nv_650M_converted.nemo
[NeMo I 2024-05-30 19:32:14 utils:337] Loading model class: bionemo.model.protein.esm1nv.esm1nv_model.ESM2nvModel


Interactive mode selected, using strategy='auto'
[NeMo I 2024-05-30 19:32:14 exp_manager:394] Experiments will be logged at /workspace/bionemo/examples/protein/esm2nv/nbs/nemo_experiments/ESM2nv_Inference/2024-05-30_19-32-14
[NeMo I 2024-05-30 19:32:14 exp_manager:835] TensorboardLogger has been set up
[NeMo I 2024-05-30 19:32:14 utils:306] 
    
    ************** Trainer configuration ***********
[NeMo I 2024-05-30 19:32:14 utils:307] 
    name: ESM2nv_Inference
    desc: Minimum configuration for initializing a ESM2nv model for inference.
    trainer:
      precision: 16-mixed
      devices: 1
      num_nodes: 1
      accelerator: gpu
      logger: false
      accumulate_grad_batches: 1
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_checkpoint_callback: false
    model:
      micro_batch_size: ${model.data.batch_size}
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      seq_length: 1024
      max_position

[NeMo W 2024-05-30 19:32:17 megatron_base_model:821] The model: ESM2nvModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-30 19:32:17 megatron_base_model:821] The model: ESM2nvModel() does not have field.name: virtual_pipeline_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-30 19:32:17 megatron_base_model:821] The model: ESM2nvModel() does not have field.name: sequence_parallel in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-30 19:32:17 megatron_base_model:821] The model: ESM2nvModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-30 19:32:17 megatron_base_model:821] The model: ESM2nvModel() does not have field.name: gradient_accumulation_fusion in its cfg. Add

[NeMo I 2024-05-30 19:32:17 megatron_init:234] Rank 0 has data parallel group: [0]
[NeMo I 2024-05-30 19:32:17 megatron_init:237] All data parallel group ranks: [[0]]
[NeMo I 2024-05-30 19:32:17 megatron_init:238] Ranks 0 has data parallel rank: 0
[NeMo I 2024-05-30 19:32:17 megatron_init:246] Rank 0 has model parallel group: [0]
[NeMo I 2024-05-30 19:32:17 megatron_init:247] All model parallel group ranks: [[0]]
[NeMo I 2024-05-30 19:32:17 megatron_init:257] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-05-30 19:32:17 megatron_init:261] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-05-30 19:32:17 megatron_init:262] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-05-30 19:32:17 megatron_init:276] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-05-30 19:32:17 megatron_init:288] Rank 0 has embedding group: [0]
[NeMo I 2024-05-30 19:32:17 megatron_init:294] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-05-30 19:32:17 megatron_init:295]

[NeMo I 2024-05-30 19:32:17 tokenizer_utils:182] Getting HuggingFace AutoTokenizer with pretrained_model_name: facebook/esm2_t33_650M_UR50D
[NeMo I 2024-05-30 19:32:17 megatron_base_model:315] Padded vocab_size: 128, original vocab_size: 33, dummy tokens: 95.


[NeMo W 2024-05-30 19:32:17 base:108] Using custom ESM2 Embeddings instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 attention:87] Using custom ESM2 attention instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 mlp:192] Using custom ESM2 GELU function instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 attention:87] Using custom ESM2 attention instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 mlp:192] Using custom ESM2 GELU function instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 attention:87] Using custom ESM2 attention instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 mlp:192] Using custom ESM2 GELU function instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 attention:87] Using custom ESM2 attention instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 mlp:192] Using custom ESM2 GELU function instead of the default NeMo version
[NeMo W 2024-05-30 19:32:17 attention:87] Using custo

[NeMo I 2024-05-30 19:32:31 nlp_overrides:752] Model ESM2nvModel was successfully restored from /workspace/bionemo/models/protein/esm2nv/esm2nv_650M_converted.nemo.


      return bias_dropout_add_fused_inference_(*args)
    


Loaded a <class 'bionemo.model.protein.esm1nv.infer.ESM1nvInference'>


# Test embedding consistency

## Input data

In [26]:
short_seq = 'MQLYLVLLLISYLLTPIGASILGRCTVAKMLYDGGLNYFEGYSLENWVCLAYFESKFNPSAVYEDPQDGSTGFGLFQIRDNEWCGHGKNLCSVSCTALLNPNLKDTIQCAKKIVKGKHGMGAWPIWSKNCQLSDVLDRWLDGCDL'
long_seq = 'MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA'

input1 = [short_seq]
input2 = [short_seq, short_seq]
input3 = [short_seq, long_seq]

print(len(short_seq))

145


## Run tests

### Batch Effects

In [99]:
result1 = inferer.seq_to_hiddens(input1)[0][0, :].cpu()  # short_seq alone in the batch
result2a = inferer.seq_to_hiddens(input2)[0][0, :].cpu()  # same sequence, at index 0 of the batch
result2b = inferer.seq_to_hiddens(input2)[0][1, :].cpu()  # same sequence, at index 1 of the batch
result3 = inferer.seq_to_hiddens(input3)[0][0, :].cpu()  # short_seq batched with long_seq

In [100]:
# result1 must be exactly equal to both result2a and result2b
assert np.array_equal(result1, result2a)
assert np.array_equal(result1, result2b)

In [101]:
print(result1.shape)
print(result3.shape)

torch.Size([147, 1280])
torch.Size([609, 1280])


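Note that `result3` has a different shape: the short sequence was padded to the length of the longest sequence in the batch. The 145-residue sequence occupies 147 positions when embedded alone (two more than its residue count, presumably the special start and end tokens), but 609 positions when batched with `long_seq`.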
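The padding described above can be stripped once the number of token positions per sequence is known. The following is a minimal sketch on synthetic data, not BioNeMo code: `strip_padding` and `token_counts` are hypothetical names, and it assumes that valid positions come first and padding follows, as the `result3[:147, :]` slice in this notebook implies.

```python
import numpy as np


def strip_padding(hiddens, token_counts):
    """Return one (n_tokens, dim) array per sequence, dropping padded rows.

    Hypothetical helper: assumes padding sits at the end of each sequence.
    """
    return [hiddens[i, :n, :] for i, n in enumerate(token_counts)]


# Synthetic stand-in for a padded batch of hidden states:
# 2 sequences, padded to 609 positions, 1280 dimensions each.
batch = np.zeros((2, 609, 1280), dtype=np.float16)

# Positions actually occupied by each sequence (the shapes printed above).
token_counts = [147, 609]

unpadded = strip_padding(batch, token_counts)
print([u.shape for u in unpadded])
```

With real model output, the token counts would have to come from the tokenizer or an attention mask rather than being hard-coded.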

In [102]:
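# Drop the padded positions so that result3 has the same shape as result1
result3_paddings_removed = result3[:147, :]

# The embeddings in result1 and result3 are NOT exactly the same
assert not np.array_equal(result1, result3_paddings_removed)

# but they agree to within an absolute tolerance of 0.02
# (compared on positions 1 to 145; the first and last of the 147 positions are skipped)
assert np.allclose(result1[1:146, :], result3[1:146, :], atol=0.02, rtol=0)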
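To see how a tolerance such as `atol=0.02` relates to the actual discrepancy, it can help to measure the largest element-wise difference directly. The sketch below uses synthetic half-precision arrays rather than real embeddings; `max_abs_diff` is a hypothetical helper, and the noise level of 0.01 is an arbitrary choice made for illustration only.

```python
import numpy as np


def max_abs_diff(a, b):
    """Largest element-wise absolute difference, computed in float32."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.max(np.abs(a - b)))


# Synthetic stand-ins for two nearly identical embedding matrices.
rng = np.random.default_rng(0)
reference = rng.standard_normal((145, 1280)).astype(np.float16)
noise = rng.uniform(-0.01, 0.01, size=reference.shape)
perturbed = (reference.astype(np.float32) + noise).astype(np.float16)

diff = max_abs_diff(reference, perturbed)
print(f"max abs difference: {diff:.4f}")

# Exact equality fails, but a tolerance-based comparison passes.
exactly_equal = np.array_equal(reference, perturbed)
close_enough = np.allclose(
    reference.astype(np.float32), perturbed.astype(np.float32), atol=0.02, rtol=0
)
print(exactly_equal, close_enough)
```

Applied to `result1` and `result3_paddings_removed`, the same measurement would show how much headroom the 0.02 tolerance leaves.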

### Running N=1 Batch

We run a batch of size 1 ten times on the same protein and check that every run returns the same embeddings.

In [103]:
N = 10
# Run inference 10 times
for n in range(N):
    # Each time, embed only 1 protein
    result = inferer.seq_to_hiddens(input1)[0][0, :].cpu()
    # This produces exactly the same array on every round
    assert np.array_equal(result1, result)

For the same protein, running a batch of size 1 gives a **deterministic** result.

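**Conclusion**: To ensure that the embeddings are exactly the same on every round, submit one protein sequence per query instead of putting several sequences into a list. As shown above, batching a sequence with a longer one changes its embeddings slightly (within an absolute tolerance of 0.02).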
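One way to follow this recommendation is to wrap the batch call in a loop that submits a single sequence at a time. The sketch below is an illustration under stated assumptions, not part of BioNeMo: `embed_one_at_a_time` is a hypothetical helper, and `fake_embed_batch` is a dummy stand-in for the model that only mimics the padding behaviour, so the example runs without a GPU.

```python
import numpy as np


def embed_one_at_a_time(sequences, embed_batch):
    """Call `embed_batch` once per sequence so no sequence is padded to another's length.

    `embed_batch` is any callable mapping a list of sequences to an array of
    shape (batch, positions, dim). Hypothetical helper for illustration.
    """
    results = []
    for seq in sequences:
        batch_out = embed_batch([seq])  # batch of size 1
        results.append(np.asarray(batch_out[0]))
    return results


def fake_embed_batch(seqs):
    """Dummy model: pads every sequence to the longest one plus 2 extra positions."""
    longest = max(len(s) for s in seqs) + 2
    return np.zeros((len(seqs), longest, 4), dtype=np.float32)


embeddings = embed_one_at_a_time(["MQLY", "MKWVTFIS"], fake_embed_batch)
print([e.shape for e in embeddings])
```

In this notebook, `embed_batch` could be something like `lambda seqs: inferer.seq_to_hiddens(seqs)[0].cpu()`, mirroring the calls used in the cells above. The trade-off is throughput: one forward pass per sequence is slower than a single padded batch.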