<a href="https://colab.research.google.com/github/xuebingwu/ESM-Scan/blob/main/ColabVEP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='MediumSlateBlue '> **ESM-Scan**  </font>
## Predict the impact of every possible mutation in any protein using Evolutionary Scale Modeling ([ESM](https://github.com/facebookresearch/esm))
---
[Xuebing Wu lab @ Columbia](https://xuebingwu.github.io/)     |     [GitHub repository](https://github.com/xuebingwu/ESMScan)

In [2]:
##@title Analyze your protein

import os
from google.colab import files
import datetime
import re

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

########## input
INPUT = "MSHRKFSAPRHGHLGFLSHRHRGKVKTWPRDDPSQPVHLTAFLGYKAGMTHTLREVHRPGLKISKREEVEAVTIVETPPLVVVGVVGYVATPRGLRSFKTIFAEHLSDECRRRFYKDWHKSKKKAFTKACKRWRDTDGKKQLQKDFAAMKKYCKVIRVIVHTQMKLLPFRQKKAHIMEIQLNGGTVAEKVAWAQARLEKQVPVHSVFSQSEVIDVIAVTKGRGVKGVTSRWHTKKLPRKTHKGLRKVACIGAWHPARVGCSIARAGQKGYHHRTELNKKIFRIGRGPHMEDGKLVKNNASTSYDVTAKSITPLGGFPHYGEVNNDFVMLKGCIAGTKKRVITLRKSLLVHHSRQAVENIELKFIDTTSKFGHGRFQTAQEKRAFMGPQKKHLEKETPETSGDL"#@param ["RPL3L", "MYC"] {allow-input: true}

#@markdown - Input format: one raw protein sequence; space allowed
#@markdown - Example: copy & paste a multi-line sequence from a FASTA file (without the header)
#@markdown - To run: click `Runtime` -> `Run all` in the menu bar, or click the triangle play/run button on the left

seq = INPUT

if seq == "RPL3L":
  seq = "MSHRKFSAPRHGHLGFLPHKRSHRHRGKVKTWPRDDPSQPVHLTAFLGYKAGMTHTLREVHRPGLKISKREEVEAVTIVETPPLVVVGVVGYVATPRGLRSFKTIFAEHLSDECRRRFYKDWHKSKKKAFTKACKRWRDTDGKKQLQKDFAAMKKYCKVIRVIVHTQMKLLPFRQKKAHIMEIQLNGGTVAEKVAWAQARLEKQVPVHSVFSQSEVIDVIAVTKGRGVKGVTSRWHTKKLPRKTHKGLRKVACIGAWHPARVGCSIARAGQKGYHHRTELNKKIFRIGRGPHMEDGKLVKNNASTSYDVTAKSITPLGGFPHYGEVNNDFVMLKGCIAGTKKRVITLRKSLLVHHSRQAVENIELKFIDTTSKFGHGRFQTAQEKRAFMGPQKKHLEKETPETSGDL"
elif seq == "MYC":
  seq = "MDFFRVVENQQPPATMPLNVSFTNRNYDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIWKKFELLPTPPLSPSRRSGLCSPSYVAVTPFSLRGDNDGGGGSFSTADQLEMVTELLGGDMVNQSFICDPDDETFIKNIIIQDCMWSGFSAAAKLVSEKLASYQAARKDSGSPNPARGHSVCSTSSLYLQDLSAAASECIDPSVVFPYPLNDSSSPKSCASQDSSAFSPSSDSLLSSTESSPQGSPEPLVLHEETPPTTSSDSEEEQEDEEEIDVVSVEKRQAPGKRSESGSPSAGGHSKPPHSPLVLKRCHVSTHQHNYAAPPSTRKDYPAAKRVKLDSVRVLRQISNNRKCTSPRSSDTEENVKRRTHNVLERQRRNELKRSFFALRDQIPELENNEKAPKVVILKKATAYILSVQAEEQKLISEEDLLRKRREQLKHKLEQLRNSCA"
else: # user input
  # clean up sequence: upper case, remove all whitespace (spaces, tabs, line breaks)
  seq = re.sub(r'\s+', '', seq.upper())
  # if contains non aa letters:
  if not all(char in 'ACDEFGHIKLMNPQRSTVWY' for char in seq):
    print("\n\n")
    print('\n'+ bcolors.BOLD +bcolors.FAIL + "WARNING: Your sequence contains letters other than ACDEFGHIKLMNPQRSTVWY!"+bcolors.ENDC)
    L0  = len(seq)
    seq = re.sub('[^ACDEFGHIKLMNPQRSTVWY]+', '', seq)
    L1 = len(seq)
    print(L0-L1,'non-aa letters removed!'+bcolors.ENDC)
    exit()

######### options

# set model
MODEL = "esm1b_t33_650M_UR50S" #@param ["esm1v_t33_650M_UR90S_1", "esm1b_t33_650M_UR50S"]

# remove files from a previous run
if os.path.exists("ESMScan-all-mutants.txt"):
  print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+': Removing files from a previous run')
  # -f: do not complain if some of these files are missing
  !rm -f ESMScan-* res.zip run.sh

if not os.path.exists("ESM-Scan"):
  print("\n")
  print('\n\n'+ bcolors.BOLD +bcolors.OKBLUE + "Installing packages"  +bcolors.ENDC)
  print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
  print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
  !pip install biopython
  !pip install fair-esm
  !git clone https://github.com/xuebingwu/ESM-Scan.git
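  # `!cd` runs in a throwaway subshell and has no lasting effect; change directory in Python instead
  os.chdir("/content")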
  !mv /content/ESM-Scan/esm1b_t33_650M_UR50S-contact-regression.pt /content/
  print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

model_location="/content/"+MODEL+".pt"
if not os.path.exists(model_location ):
  print('\n\n'+ bcolors.BOLD +bcolors.OKBLUE + "Downloading pre-trained ESM model"  +bcolors.ENDC)
  if MODEL == "esm1b_t33_650M_UR50S":
    !wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
  else:
    !wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt

print('\n\n'+ bcolors.BOLD +bcolors.OKBLUE + "Running saturation mutagenesis"  +bcolors.ENDC)

cmd="python /content/ESM-Scan/esmscan.py --model-location "+model_location+" --sequence "+seq

print(cmd)

with open("run.sh",'w') as f:
  f.write(cmd+'\n')

!chmod +x /content/run.sh
!/content/run.sh

'''
import subprocess

proc = subprocess.Popen([cmd], stdout=subprocess.PIPE, shell=True)

(out, err) = proc.communicate()
print("Screen output:", out)
print("Screen error:", err)
'''
#os.system(cmd)

print('\n\n'+ bcolors.BOLD +bcolors.OKBLUE + "Downloading results"  +bcolors.ENDC)

if os.path.exists('ESMScan-res-in-matrix.csv'):
  os.system('zip res.zip *.pdf *.csv')
  files.download("res.zip")
  print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+': Done! Please see results in res.zip')
else:
  print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+': No output files generated')


2024-02-20 02:47:58: Removing files from a previous run


[1m[94mRunning saturation mutagenesis[0m
python /content/ESM-Scan/esmscan.py --model-location /content/esm1b_t33_650M_UR50S.pt --sequence MSHRKFSAPRHGHLGFLSHRHRGKVKTWPRDDPSQPVHLTAFLGYKAGMTHTLREVHRPGLKISKREEVEAVTIVETPPLVVVGVVGYVATPRGLRSFKTIFAEHLSDECRRRFYKDWHKSKKKAFTKACKRWRDTDGKKQLQKDFAAMKKYCKVIRVIVHTQMKLLPFRQKKAHIMEIQLNGGTVAEKVAWAQARLEKQVPVHSVFSQSEVIDVIAVTKGRGVKGVTSRWHTKKLPRKTHKGLRKVACIGAWHPARVGCSIARAGQKGYHHRTELNKKIFRIGRGPHMEDGKLVKNNASTSYDVTAKSITPLGGFPHYGEVNNDFVMLKGCIAGTKKRVITLRKSLLVHHSRQAVENIELKFIDTTSKFGHGRFQTAQEKRAFMGPQKKHLEKETPETSGDL
Transferred model to GPU


[1m[94mDownloading results[0m


2024-02-20 02:48:54: Done! Please see results in res.zip


# About <a name="Instructions"></a>

**Applications**
* Assess the impact of all possible mutations in a protein.

**Input**
* A single protein sequence
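
The clean-up applied to user input can be sketched as a standalone helper. This is a minimal illustration rather than the notebook's exact code: the function name `clean_sequence` is hypothetical, and dropping `>` header lines is an added convenience (the notebook itself expects the sequence without the header).

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def clean_sequence(text):
    """Return (cleaned_sequence, n_removed) from raw or FASTA-style text.

    Hypothetical helper: header handling is an assumption, not notebook behavior.
    """
    # Drop FASTA header lines (those starting with '>'), if any were pasted.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    # Upper-case and remove all whitespace, including line breaks.
    seq = re.sub(r"\s+", "", "".join(lines).upper())
    # Remove every character that is not a standard amino acid letter.
    cleaned = re.sub(f"[^{AMINO_ACIDS}]", "", seq)
    return cleaned, len(seq) - len(cleaned)

cleaned, n_removed = clean_sequence(">example\nmsh rk\nFSX*")
print(cleaned, n_removed)  # MSHRKFS 2
```

Returning the number of removed characters lets the caller decide whether to warn and continue or to stop.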

**Output**
* Data: CSV files containing the predicted effect of each mutation. More negative scores indicate more deleterious mutations.
* Visualizations: A heatmap color-coding the effect of every possible mutation (20 columns, one per amino acid) at each position in the protein (rows). Blue means more deleterious. A box plot for each position is also included.

<img src="https://github.com/xuebingwu/ESM-Scan/blob/main/example-output.png" height="400" align="center">
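
To make "all possible mutations" concrete, the sketch below enumerates every single-residue substitution of a sequence: each position can be changed to 19 other amino acids, which is why the heatmap has one row per position and 20 columns. The function name and the `M1A`-style labels (wild type, 1-based position, mutant) are illustrative assumptions and may not match the labels in the CSV files.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def all_single_mutants(seq):
    """Yield (label, mutant_sequence) for every single substitution.

    Hypothetical helper for illustration; not part of ESM-Scan.
    """
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue  # skip the wild-type residue itself
            # Label uses 1-based positions, e.g. M1A = Met at position 1 -> Ala.
            yield f"{wt}{i + 1}{aa}", seq[:i] + aa + seq[i + 1:]

mutants = list(all_single_mutants("MSH"))
print(len(mutants))  # 57 (3 positions x 19 substitutions)
print(mutants[0])    # ('M1A', 'ASH')
```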

**Methods**
* Please see the following preprint for more details:
[Language models enable zero-shot prediction of the effects of mutations on protein function](https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2).
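
The preprint scores a substitution by the log-odds of the mutant versus the wild-type amino acid under the language model's predicted distribution at that position. The sketch below shows only that arithmetic on made-up probabilities; the numbers and the function name are hypothetical, and whether `esmscan.py` follows exactly this scoring strategy is an assumption.

```python
import math

def mutation_score(probs, wt, mt):
    """Log-odds of the mutant vs. the wild-type residue at one position.

    `probs` maps amino acid letters to model probabilities at that position.
    A negative score means the model considers the mutant less likely.
    """
    return math.log(probs[mt]) - math.log(probs[wt])

# Toy probabilities at a single position (made up for illustration).
toy_probs = {"L": 0.70, "I": 0.20, "P": 0.01}

for mt in ("L", "I", "P"):
    print(f"L->{mt}: {mutation_score(toy_probs, 'L', mt):.2f}")
```

With these toy values, the conservative change to `I` scores closer to zero than the change to `P`, which matches the convention that more negative means more deleterious.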

**Limitations**
* A Google account is required to run Google Colab notebooks.
* This notebook was designed for analyzing a single sequence.
* Only sequences of about 400 aa have been tested. Longer sequences may fail due to insufficient memory.
* The first run is slow because the pre-trained ESM model must be downloaded.
* A GPU is required, and one may not always be available on Colab.
* Your browser may block the pop-up for downloading the result file. If that happens, download the file manually: click the folder icon on the left, navigate to `res.zip`, right-click it, and select "Download".


**Bugs**
- If you encounter any bugs, please report the issue by emailing Xuebing Wu (xw2629 at cumc dot columbia dot edu)

**License**

* The source code of this notebook is licensed under [MIT](https://raw.githubusercontent.com/sokrypton/ColabFold/main/LICENSE).

**Acknowledgments**
- We thank the [ESM](https://github.com/facebookresearch/esm) team for developing an excellent model and open sourcing the software.

- This notebook is modeled after the [ColabFold notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb).
