<a href="https://colab.research.google.com/github/shannonfernandes25/bioinformatics-BPRI-Bioinformatics/blob/main/positive_embeddings_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers tqdm biopython torch

Collecting biopython
  Downloading biopython-1.85-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


In [2]:
import torch
from tqdm import tqdm
import pandas as pd
from transformers import AutoTokenizer, AutoModel

In [3]:
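# Load the positive dataset
df = pd.read_excel("/content/positive_dataset.xlsx")

# Fail early if the expected 'sequence' column is missing
if "sequence" not in df.columns:
    raise KeyError("Expected a 'sequence' column in positive_dataset.xlsx")

# Raw sequences as a list (the next cell cleans the column and rebuilds this list)
sequences = df["sequence"].tolist()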

In [4]:
# Convert every entry to a string and strip leading/trailing whitespace
df['sequence'] = df['sequence'].astype(str).str.strip()

# Drop empty strings
df = df[df['sequence'].str.len() > 0]

# Drop textual missing-value markers such as "nan", "None", or "null".
# astype(str) turns real NaN cells into the string "nan", so they are caught here.
# str.lower() makes the comparison case-insensitive; the tilde ~ negates the mask.
df = df[~df['sequence'].str.lower().isin(['nan', 'none', 'null'])]

# Renumber rows 0..n-1 so that later column-wise concatenation lines up with the embeddings
df = df.reset_index(drop=True)

# Final list of sequences
sequences = df['sequence'].tolist()
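
The cleaning step above removes empty and missing entries but does not check what characters the remaining sequences contain. A minimal sketch of such a check follows, assuming the data uses one-letter amino-acid codes; the allowed alphabet and the helper name `find_invalid_sequences` are illustrative assumptions, not part of the original notebook.

```python
import re

# Assumption: the 20 standard one-letter codes, plus X (unknown) and the
# rare/ambiguous codes U, Z, O, B, are considered acceptable.
VALID_RESIDUES = re.compile(r"[ACDEFGHIKLMNPQRSTVWYXUZOB]+")

def find_invalid_sequences(seqs):
    """Return (index, sequence) pairs that contain characters outside the allowed alphabet."""
    return [
        (i, s) for i, s in enumerate(seqs)
        if not VALID_RESIDUES.fullmatch(s.upper())
    ]

# Toy input: gaps, stop markers, and internal spaces are flagged; lowercase is tolerated
example = ["MKTAYIAK", "MK-TA*", "acdefg", "M K T"]
print(find_invalid_sequences(example))
```

Running it on `sequences` before embedding would list any rows worth inspecting by hand.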

In [5]:
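# Generate one mean-pooled embedding per sequence
import re

def generate_embeddings(model_name, sequences, device):
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = AutoModel.from_pretrained(model_name).to(device)
    model.eval()

    # ProtBERT's tokenizer expects space-separated residues with the rare amino
    # acids U, Z, O, B mapped to X (per the Rostlab model cards); ESM-2 takes raw strings.
    is_protbert = "prot_bert" in model_name.lower()

    embeddings = []
    with torch.no_grad():
        for seq in tqdm(sequences):
            if is_protbert:
                seq = " ".join(re.sub(r"[UZOB]", "X", seq))
            # Tokenize (truncating to 1024 tokens) and move the tensors to the device
            tokens = tokenizer(seq, return_tensors='pt', truncation=True, max_length=1024)
            tokens = {k: v.to(device) for k, v in tokens.items()}

            outputs = model(**tokens)
            # Mean over all token positions, which includes special tokens
            # such as [CLS]/[SEP] (ProtBERT) or <cls>/<eos> (ESM-2)
            emb = outputs.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()
            embeddings.append(emb)
    return embeddings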

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
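
The function above embeds one sequence at a time, so no padding is involved and a plain mean over positions is well defined. If the sequences were instead tokenized in batches with padding, a plain mean would also average over pad positions. The following is a minimal sketch of mask-weighted mean pooling under that assumption; `masked_mean_pool` is a hypothetical helper, and the toy tensors stand in for `outputs.last_hidden_state` and the tokenizer's `attention_mask`.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask == 1."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # guard against division by zero
    return summed / counts

# Toy stand-ins: batch of 2, 3 positions each, hidden size 2.
# The last position of the first row is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)  # one row per sequence, one column per hidden unit
```

Note that the attention mask is also 1 at special-token positions, so this variant still includes them in the average; excluding them would need an additional mask.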

In [6]:
# Generate embeddings for each model
models = {
    "ProtBERT03": "Rostlab/prot_bert_bfd",
    "ProtBERT05": "Rostlab/prot_bert",
    "ESM_t6": "facebook/esm2_t6_8M_UR50D",
    "ESM_t12": "facebook/esm2_t12_35M_UR50D"
}

# Generate and store
all_embeddings = {}
for label, model_name in models.items():
    print(f"\n🔹 Generating {label} embeddings...")
    all_embeddings[label] = generate_embeddings(model_name, sequences, device)

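# Combine all embeddings into one DataFrame.
# pd.concat(axis=1) aligns rows by index, and df may have gaps in its index after
# filtering, so renumber it to match the 0..n-1 index of each embedding frame.
frames = [df.reset_index(drop=True)]
for label, embs in all_embeddings.items():
    emb_df = pd.DataFrame(embs)
    emb_df.columns = [f"{label}_{i}" for i in range(emb_df.shape[1])]
    frames.append(emb_df)
df = pd.concat(frames, axis=1)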

# Save to CSV
output_path = "positive_protein_embeddings.csv"
df.to_csv(output_path, index=False)
print(f"\n✅ Embeddings saved to: {output_path}")


🔹 Generating ProtBERT03 embeddings...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/86.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/361 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/81.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.68G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.68G [00:00<?, ?B/s]


  0%|          | 0/886 [00:00<?, ?it/s]
  2%|▏         | 18/886 [00:06<07:13,  2.00it/s]
  2%|▏         | 19/886 [00:0


🔹 Generating ProtBERT05 embeddings...


tokenizer_config.json:   0%|          | 0.00/86.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/361 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/81.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.68G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.68G [00:00<?, ?B/s]


  0%|          | 0/886 [00:00<?, ?it/s]
  2%|▏         | 18/886 [00:05<04:03,  3.57it/s]
  2%|▏         | 19/886 [00:0


🔹 Generating ESM_t6 embeddings...


tokenizer_config.json:   0%|          | 0.00/95.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/93.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/775 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/31.4M [00:00<?, ?B/s]

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 886/886 [00:12<00:00, 71.22it/s]



🔹 Generating ESM_t12 embeddings...


tokenizer_config.json:   0%|          | 0.00/95.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/93.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/778 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/136M [00:00<?, ?B/s]

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 886/886 [00:39<00:00, 22.42it/s]



✅ Embeddings saved to: positive_protein_embeddings.csv
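
Because each embedding column is named `<label>_<i>`, the saved CSV can be split back into one matrix per model by column prefix. A minimal sketch follows, using a small in-memory CSV as a stand-in for `positive_protein_embeddings.csv`; the helper name `embedding_matrix` and the toy values are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy stand-in for the saved file (the real file has many more columns per model)
toy_csv = io.StringIO(
    "sequence,ESM_t6_0,ESM_t6_1,ESM_t12_0,ESM_t12_1,ESM_t12_2\n"
    "MKTAYIAK,0.1,0.2,0.3,0.4,0.5\n"
    "GSHMLEDP,0.6,0.7,0.8,0.9,1.0\n"
)
loaded = pd.read_csv(toy_csv)

def embedding_matrix(frame, label):
    """Return the columns named '<label>_<i>' as an array of shape (n_sequences, dim)."""
    cols = [c for c in frame.columns if c.startswith(f"{label}_")]
    return frame[cols].to_numpy()

esm_t6 = embedding_matrix(loaded, "ESM_t6")
esm_t12 = embedding_matrix(loaded, "ESM_t12")
print(esm_t6.shape, esm_t12.shape)
```

With the real file, replacing `toy_csv` with the path `"positive_protein_embeddings.csv"` should give one matrix per label in `models`.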
