<a href="https://colab.research.google.com/github/suresh2014/Bioinformatics_resources/blob/main/CodonTransformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center">
  <img src="https://github.com/Adibvafa/CodonTransformer/raw/main/src/banner_final.png" alt="CodonTransformer Logo" width="100%" height="100%" style="vertical-align: middle;"/>
</p>

<p align="center">
  <a href="https://www.biorxiv.org/content/10.1101/2024.09.13.612903" target="_blank"><img src="https://img.shields.io/badge/bioRxiv-Paper-FF6B6B?style=for-the-badge&logo=arxiv&logoColor=white" alt="bioRxiv"></a>
  <a href="https://github.com/Adibvafa/CodonTransformer"><img src="https://img.shields.io/badge/GitHub-Code-4A90E2?style=for-the-badge&logo=github&logoColor=white" alt="GitHub"></a>
  <a href="https://adibvafa.github.io/CodonTransformer/"><img src="https://img.shields.io/badge/Website-Online-00B89E?style=for-the-badge&logo=internet-explorer&logoColor=white" alt="Website"></a>
  <a href="https://huggingface.co/adibvafa/CodonTransformer"><img src="https://img.shields.io/badge/HuggingFace-Model-FFBF00?style=for-the-badge&logo=huggingface&logoColor=white" alt="HuggingFace Model"></a>
  <a href="https://adibvafa.github.io/CodonTransformer/GoogleColab"><img src="https://img.shields.io/badge/Colab-Notebook-e2006a?style=for-the-badge&logo=googlecolab&logoColor=white" alt="Colab"></a>
</p>

**Welcome to the CodonTransformer Google Colab!**

- Select `File` -> `Save a copy in Drive` to save this notebook.
- You can run each cell by clicking on the ▶️ icon.
- Use these sections for [single sequence](#scrollTo=yI_HL2WdPxVn) or [multiple sequence](#scrollTo=YNR7DPR9CUjn) optimization.

--------------------------------------------------------------------------
# **Setup Notebook**

In [None]:
#@title Install the Package
%%capture
!pip install --upgrade CodonTransformer

In [None]:
#@title Import the Package
import torch
import pandas as pd

import warnings
from tqdm import tqdm

from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import UserContainer, format_model_output

from google.colab import output
output.enable_custom_widget_manager()
warnings.filterwarnings("ignore")

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Packages Imported Successfully.")

In [None]:
#@title Load Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(DEVICE)
print("Model and Tokenizer Loaded Successfully.")

--------------------------------------------------------------------------
# **Optimizing a Single Sequence**

In [None]:
#@title **Enter Protein Sequence and Organism**
#@markdown Input your protein sequence and organism, then run this cell. <br>
#@markdown CodonTransformer is finetuned on [major organisms](https://github.com/Adibvafa/CodonTransformer/blob/main/CodonTransformer/CodonUtils.py#L250).
#@markdown Check our paper for more information! <br> <br>

protein_sequence = 'MALWMRLLPLLALLALWGPDPAAAFVNQHTPKTRREAEDLQVGQVELGG' #@param {type:"string"}
organism = "Escherichia coli general" # @param ["Arabidopsis thaliana", "Bacillus subtilis", "Caenorhabditis elegans", "Chlamydomonas reinhardtii", "Chlamydomonas reinhardtii chloroplast", "Danio rerio", "Drosophila melanogaster", "Escherichia coli general", "Homo sapiens", "Mus musculus", "Nicotiana tabacum", "Nicotiana tabacum chloroplast", "Pseudomonas putida", "Saccharomyces cerevisiae", "Atlantibacter hermannii", "Brenneria goodwinii", "Buchnera aphidicola (Schizaphis graminum)", "Candidatus Erwinia haradaeae", "Candidatus Hamiltonella defensa 5AT (Acyrthosiphon pisum)", "Citrobacter amalonaticus", "Citrobacter braakii", "Citrobacter cronae", "Citrobacter europaeus", "Citrobacter farmeri", "Citrobacter freundii", "Citrobacter koseri ATCC BAA-895", "Citrobacter portucalensis", "Citrobacter werkmanii", "Citrobacter youngae", "Cronobacter dublinensis subsp. dublinensis LMG 23823", "Cronobacter malonaticus LMG 23826", "Cronobacter sakazakii", "Cronobacter turicensis", "Dickeya dadantii 3937", "Dickeya dianthicola", "Dickeya fangzhongdai", "Dickeya solani", "Dickeya zeae", "Edwardsiella anguillarum ET080813", "Edwardsiella ictaluri", "Edwardsiella piscicida", "Edwardsiella tarda", "Enterobacter asburiae", "Enterobacter bugandensis", "Enterobacter cancerogenus", "Enterobacter chengduensis", "Enterobacter cloacae", "Enterobacter hormaechei", "Enterobacter kobei", "Enterobacter ludwigii", "Enterobacter mori", "Enterobacter quasiroggenkampii", "Enterobacter roggenkampii", "Enterobacter sichuanensis", "Erwinia amylovora CFBP1430", "Erwinia persicina", "Escherichia albertii", "Escherichia coli O157-H7 str. Sakai", "Escherichia coli str. K-12 substr. MG1655", "Escherichia fergusonii", "Escherichia marmotae", "Escherichia ruysiae", "Ewingella americana", "Hafnia alvei", "Hafnia paralvei", "Kalamiella piersonii", "Klebsiella aerogenes", "Klebsiella grimontii", "Klebsiella michiganensis", "Klebsiella oxytoca", "Klebsiella pasteurii", "Klebsiella pneumoniae subsp. pneumoniae HS11286", "Klebsiella quasipneumoniae", "Klebsiella quasivariicola", "Klebsiella variicola", "Kosakonia cowanii", "Kosakonia radicincitans", "Leclercia adecarboxylata", "Lelliottia amnigena", "Lonsdalea populi", "Moellerella wisconsensis", "Morganella morganii", "Obesumbacterium proteus", "Pantoea agglomerans", "Pantoea allii", "Pantoea ananatis PA13", "Pantoea dispersa", "Pantoea stewartii", "Pantoea vagans", "Pectobacterium aroidearum", "Pectobacterium atrosepticum", "Pectobacterium brasiliense", "Pectobacterium carotovorum", "Pectobacterium odoriferum", "Pectobacterium parmentieri", "Pectobacterium polaris", "Pectobacterium versatile", "Photorhabdus laumondii subsp. laumondii TTO1", "Plesiomonas shigelloides", "Pluralibacter gergoviae", "Proteus faecis", "Proteus mirabilis HI4320", "Proteus penneri", "Proteus terrae subsp. cibarius", "Proteus vulgaris", "Providencia alcalifaciens", "Providencia heimbachae", "Providencia rettgeri", "Providencia rustigianii", "Providencia stuartii", "Providencia thailandensis", "Pyrococcus furiosus", "Pyrococcus horikoshii", "Pyrococcus yayanosii", "Rahnella aquatilis CIP 78.65 = ATCC 33071", "Raoultella ornithinolytica", "Raoultella planticola", "Raoultella terrigena", "Rosenbergiella epipactidis", "Rouxiella badensis", "Saccharolobus solfataricus", "Salmonella bongori N268-08", "Salmonella enterica subsp. enterica serovar Typhimurium str. LT2", "Serratia bockelmannii", "Serratia entomophila", "Serratia ficaria", "Serratia fonticola", "Serratia grimesii", "Serratia liquefaciens", "Serratia marcescens", "Serratia nevei", "Serratia plymuthica AS9", "Serratia proteamaculans", "Serratia quinivorans", "Serratia rubidaea", "Serratia ureilytica", "Shigella boydii", "Shigella dysenteriae", "Shigella flexneri 2a str. 301", "Shigella sonnei", "Thermoccoccus kodakarensis", "Thermococcus barophilus MPT", "Thermococcus chitonophagus", "Thermococcus gammatolerans", "Thermococcus litoralis", "Thermococcus onnurineus", "Thermococcus sibiricus", "Xenorhabdus bovienii str. feltiae Florida", "Yersinia aldovae 670-83", "Yersinia aleksiciae", "Yersinia alsatica", "Yersinia enterocolitica", "Yersinia frederiksenii ATCC 33641", "Yersinia intermedia", "Yersinia kristensenii", "Yersinia massiliensis CCUG 53443", "Yersinia mollaretii ATCC 43969", "Yersinia pestis A1122", "Yersinia proxima", "Yersinia pseudotuberculosis IP 32953", "Yersinia rochesterensis", "Yersinia rohdei", "Yersinia ruckeri", "Yokenella regensburgei"]

user = UserContainer()
user.protein = protein_sequence.upper().strip().replace("\n", "").replace(" ", "").replace("\t", "")
user.organism = organism

print("Ready!")

In [None]:
#@title **Run CodonTransformer!**
#@markdown The predicted DNA sequence is the codon-optimized sequence that CodonTransformer generates for the selected organism.

# Named `prediction` so it does not shadow the `output` module imported from google.colab.
prediction = predict_dna_sequence(
    protein=user.protein,
    organism=user.organism,
    device=DEVICE,
    tokenizer=tokenizer,
    model=model,
    attention_type="original_full",
)

print(format_model_output(prediction))
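
A quick sanity check on any prediction is to translate the DNA back and compare it with the input protein. The sketch below is not part of the CodonTransformer package: `translate` and `encodes_protein` are hypothetical helpers, and they assume the standard genetic code (NCBI translation table 1), which may not be appropriate for every organism in the list.

```python
from itertools import product

# Assumption: standard genetic code (NCBI translation table 1),
# with codons enumerated in T, C, A, G order to match AMINO_ACIDS.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon; '*' marks a stop codon."""
    dna = dna.upper().strip()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3.")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def encodes_protein(dna: str, protein: str) -> bool:
    """Return True if `dna` translates to `protein`, ignoring stop markers."""
    return translate(dna).rstrip("*") == protein.upper().rstrip("*_")


print(translate("ATGGCTCTGTGGTAA"))  # MALW*
```

In this notebook, the same check could be applied to the `predicted_dna` attribute of the object returned by `predict_dna_sequence`, together with `user.protein`.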

--------------------------------------------------------------------------
# **Optimizing Multiple Sequences**

You can download the [inference template](https://github.com/Adibvafa/CodonTransformer/raw/main/src/CodonTransformer_inference_template.xlsx) to create the input dataset.

Your input dataset should be saved as a CSV file with the following columns:

- `protein_sequence`: Protein sequences
- `organism`: Target organism (use the template to select)

Notes:
- For E. coli, use `Escherichia coli general`.
- A protein sequence may end with `*` or `_`, or with no terminating character at all.
- The CSV may contain any additional columns; they are written back unchanged to the output file.
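
As a minimal sketch of the expected layout, the snippet below builds a two-row dataset with pandas and saves it as a CSV. The file name `my_proteins.csv` and the example sequences are placeholders. Because the next cell reads the file with `index_col=0`, the sketch keeps the default row index so that the first CSV column is the index rather than `protein_sequence`.

```python
import pandas as pd

# Placeholder rows: replace with your own sequences and target organisms.
dataset = pd.DataFrame(
    {
        "protein_sequence": ["MALWMRLLPLLALLALWGPDPAAA", "MKTAYIAKQR*"],
        "organism": ["Escherichia coli general", "Homo sapiens"],
    }
)

# to_csv writes the row index as the first column by default,
# which matches reading the file back with index_col=0.
dataset.to_csv("my_proteins.csv")

reloaded = pd.read_csv("my_proteins.csv", index_col=0)
print(list(reloaded.columns))  # ['protein_sequence', 'organism']
```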

In [None]:
#@title **Enter Dataset and Output Path**
dataset_path = 'Upload your CSV file to `files` and enter its path here.' #@param {type:"string"}
output_path = 'Enter your desired filename to save the predictions under `files`.' #@param {type:"string"}

dataset_path = dataset_path.strip()
output_path = output_path.strip()

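# index_col=0 treats the first CSV column as the row index, not as data.
dataset = pd.read_csv(dataset_path, index_col=0)
# Placeholder column that the prediction loop fills in.
dataset["predicted_dna"] = None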
dataset.head()
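
If the CSV was written without an index column, `index_col=0` turns `protein_sequence` into the row index and the prediction loop fails with a `KeyError`. A small check such as the following sketch can catch that before any predictions run. `check_dataset` is a hypothetical helper, not part of CodonTransformer; its whitespace cleanup mirrors what the single-sequence cell does.

```python
import pandas as pd

REQUIRED_COLUMNS = ("protein_sequence", "organism")


def check_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Validate required columns and return a copy with cleaned sequences."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing required column(s): {missing}")

    cleaned = df.copy()
    # Upper-case the sequences and drop spaces, tabs, and newlines.
    cleaned["protein_sequence"] = (
        cleaned["protein_sequence"]
        .astype(str)
        .str.upper()
        .str.replace(r"\s+", "", regex=True)
    )
    # Organism names only need surrounding whitespace removed.
    cleaned["organism"] = cleaned["organism"].astype(str).str.strip()
    return cleaned


example = pd.DataFrame(
    {"protein_sequence": [" malw mrl\n"], "organism": [" Homo sapiens "]}
)
print(check_dataset(example).iloc[0].tolist())  # ['MALWMRL', 'Homo sapiens']
```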

In [None]:
#@title **Run CodonTransformer!**

for index, data in tqdm(
    dataset.iterrows(),
    desc="CodonTransformer Predicting",
    unit=" Sequences",
    total=dataset.shape[0],
):

    outputs = predict_dna_sequence(
        protein=data["protein_sequence"],
        organism=data["organism"],
        device=DEVICE,
        tokenizer=tokenizer,
        model=model,
    )
    dataset.loc[index, "predicted_dna"] = outputs.predicted_dna

dataset.to_csv(output_path)
dataset.head()