<a href="https://colab.research.google.com/github/shannonfernandes25/bioinformatics-BPRI-Bioinformatics/blob/main/negative_embedding_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

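transformers - Hugging Face library that provides the tokenizers and pretrained models used to encode our alphabetical (amino-acid) data in numeric form

torch - PyTorch, the tensor library the models run on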

tqdm - progress bars

In [None]:
!pip install transformers tqdm biopython torch

Collecting biopython
  Downloading biopython-1.85-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 47.5 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.85


In [None]:
import torch
from tqdm import tqdm
import pandas as pd
from transformers import AutoTokenizer, AutoModel

In [None]:
# Load the negative dataset from the Excel file
df = pd.read_excel("/content/shannon_negative_data.xlsx")

# Extract the 'Sequence' column as a Python list (raises KeyError if the column is missing)
sequences = df["Sequence"].tolist()

Protein embedding models:
1. ProtBERT

   a. protbert_03

   b. protbert_05

2. ESM

   a. ESM_t6
   
   b. ESM_t12

max_length = 1024  (maximum number of tokens kept per sequence)

truncation = True  (sequences longer than max_length tokens are cut to fit)
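To see how many sequences these two settings would actually shorten, a rough check can be run before embedding. This is a minimal sketch, not part of the original notebook: `count_truncated` and `n_special_tokens` are hypothetical names, and it assumes one token per residue plus two special tokens added by the tokenizer (for example `[CLS]`/`[SEP]` or `<cls>`/`<eos>`), so at most 1022 residues fit into 1024 tokens.

```python
def count_truncated(sequences, max_length=1024, n_special_tokens=2):
    """Count sequences longer than the residue budget left after special tokens.

    Assumption: one token per residue and `n_special_tokens` extra tokens
    added by the tokenizer. Both are illustrative defaults.
    """
    residue_budget = max_length - n_special_tokens
    return sum(1 for seq in sequences if len(seq) > residue_budget)


# Toy sequences of length 10, 1022, 1023 and 3000 residues.
toy_sequences = ["M" * 10, "M" * 1022, "M" * 1023, "M" * 3000]
print(count_truncated(toy_sequences))  # 2
```

If the count is large for a real dataset, the mean embedding of those proteins only reflects the part of the sequence that survived truncation.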


In [None]:
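# generate mean embeddings
def generate_embeddings(model_name, sequences, device):
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = AutoModel.from_pretrained(model_name).to(device)
    model.eval()

    # ProtBERT tokenizers expect residues separated by spaces, with the rare
    # amino acids U, Z, O and B mapped to X; ESM-2 takes the raw string.
    is_protbert = "prot_bert" in model_name
    rare_to_x = str.maketrans("UZOB", "XXXX")

    embeddings = []
    with torch.no_grad():
        for seq in tqdm(sequences):
            if is_protbert:
                seq = " ".join(seq.translate(rare_to_x))
            # Tokenize and convert to tensor
            tokens = tokenizer(seq, return_tensors='pt', truncation=True, max_length=1024)
            tokens = {k: v.to(device) for k, v in tokens.items()}

            outputs = model(**tokens)
            # Mean over the token axis (includes the special tokens the tokenizer adds)
            emb = outputs.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()
            embeddings.append(emb)
    return embeddings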

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

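Every encoding model generates one embedding per residue (token), but we want one embedding per protein. To get that, we take the mean of the per-residue embeddings along the sequence axis, which gives a single fixed-length vector for each protein regardless of its length.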
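The averaging step can be illustrated on a toy example. This is a sketch in plain Python so the arithmetic is visible; `mean_pool` is a hypothetical helper, and the nested list stands in for the `(residues, hidden_size)` slice of `last_hidden_state` that the notebook averages with `.mean(dim=1)`.

```python
def mean_pool(residue_embeddings):
    """Average L per-residue vectors (each of length D) into one D-length vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    n_residues = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [
        sum(row[j] for row in residue_embeddings) / n_residues
        for j in range(dim)
    ]


# Toy protein: 3 residues, each with a 2-dimensional embedding.
per_residue = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
per_protein = mean_pool(per_residue)
print(per_protein)  # [3.0, 4.0]
```

The output length depends only on the embedding dimension, not on the number of residues, which is what lets proteins of different lengths share the same columns in the final table.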

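device - where the model parameters and input tensors are placed: the GPU (cuda) if one is available, otherwise the CPU

model.eval() - puts the model in evaluation mode, which turns off training-only behaviour such as dropout

torch.no_grad() - a context manager that disables gradient computation, so inference does not use extra memory for tracking gradients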
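The effect of these three settings can be checked on a small stand-in model. This is a minimal sketch under stated assumptions: `toy_model` is a hypothetical linear layer with dropout, not one of the protein encoders, and it is used only because it runs without downloading any weights.

```python
import torch
from torch import nn

# Pick the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for an encoder: a linear layer followed by dropout.
toy_model = nn.Sequential(nn.Linear(4, 3), nn.Dropout(p=0.5)).to(device)
x = torch.ones(2, 4, device=device)

toy_model.eval()  # dropout becomes a no-op, so repeated calls agree
with torch.no_grad():  # no autograd graph is recorded inside this block
    out_a = toy_model(x)
    out_b = toy_model(x)

print(torch.equal(out_a, out_b))  # True
print(out_a.requires_grad)        # False
```

Without `eval()`, dropout would randomly zero activations and the two outputs would generally differ; without `no_grad()`, each output would keep a reference to the computation graph.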

In [None]:
# Generate embeddings for each model
models = {
    "ProtBERT03": "Rostlab/prot_bert_bfd",
    "ProtBERT05": "Rostlab/prot_bert",
    "ESM_t6": "facebook/esm2_t6_8M_UR50D",
    "ESM_t12": "facebook/esm2_t12_35M_UR50D"
}

# Generate and store
all_embeddings = {}
for label, model_name in models.items():
    print(f"\n🔹 Generating {label} embeddings...")
    all_embeddings[label] = generate_embeddings(model_name, sequences, device)

# Combine all embeddings into one DataFrame
for label in all_embeddings:
    emb_df = pd.DataFrame(all_embeddings[label])
    emb_df.columns = [f"{label}_{i}" for i in range(emb_df.shape[1])]
    df = pd.concat([df, emb_df], axis=1)

# Save to CSV
output_path = "protein_embeddings.csv"
df.to_csv(output_path, index=False)
print(f"\n✅ Embeddings saved to: {output_path}")



🔹 Generating ProtBERT03 embeddings...




🔹 Generating ProtBERT05 embeddings...







🔹 Generating ESM_t6 embeddings...



Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 510/510 [00:15<00:00, 33.58it/s]


🔹 Generating ESM_t12 embeddings...






Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 510/510 [00:50<00:00, 10.07it/s]



✅ Embeddings saved to: protein_embeddings.csv
