<a href="https://colab.research.google.com/github/xuebingwu/ESMScan/blob/main/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# <font color='MediumSlateBlue'> **ESMScan** </font>
## Complete saturation mutagenesis using ESM
---
[Xuebing Wu lab @ Columbia](https://xuebingwu.github.io/)     |     [GitHub repository](https://github.com/xuebingwu/ESMScan) 


In [3]:
##@title Analyze your protein

import os
from google.colab import files
import datetime
import re

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

########## input 
INPUT = "MSHRKFSAPRHGHLGFLPHKRSHRHRGKVKTWPRDDPSQPVHLTAFLGYKAGMTHTLREVHRPGLKISKREEVEAVTIVETPPLVVVGVVGYVATPRGLRSFKTIFAEHLSDECRRRFYKDWHKSKKKAFTKACKRWRDTDGKKQLQKDFAAMKKYCKVIRVIVHTQMKLLPFRQKKAHIMEIQLNGGTVAEKVAWAQARLEKQVPVHSVFSQSEVIDVIAVTKGRGVKGVTSRWHTKKLPRKTHKGLRKVACIGAWHPARVGCSIARAGQKGYHHRTELNKKIFRIGRGPHMEDGKLVKNNASTSYDVTAKSITPLGGFPHYGEVNNDFVMLKGCIAGTKKRVITLRKSLLVHHSRQAVENIELKFIDTTSKFGHGRFQTAQEKRAFMGPQKKHLEKETPETSGDL"#@param ["RPL3L", "MYC"] {allow-input: true}

#@markdown - To run: click `Runtime` -> `Run all` in the menu bar, or click the triangle play/run button on the left
#@markdown - Input format: one raw protein sequence; spaces are allowed
#@markdown - Example: copy & paste a multi-line sequence from a FASTA file (without the header)

seq = INPUT

if seq == "RPL3L":
  seq = "MSHRKFSAPRHGHLGFLPHKRSHRHRGKVKTWPRDDPSQPVHLTAFLGYKAGMTHTLREVHRPGLKISKREEVEAVTIVETPPLVVVGVVGYVATPRGLRSFKTIFAEHLSDECRRRFYKDWHKSKKKAFTKACKRWRDTDGKKQLQKDFAAMKKYCKVIRVIVHTQMKLLPFRQKKAHIMEIQLNGGTVAEKVAWAQARLEKQVPVHSVFSQSEVIDVIAVTKGRGVKGVTSRWHTKKLPRKTHKGLRKVACIGAWHPARVGCSIARAGQKGYHHRTELNKKIFRIGRGPHMEDGKLVKNNASTSYDVTAKSITPLGGFPHYGEVNNDFVMLKGCIAGTKKRVITLRKSLLVHHSRQAVENIELKFIDTTSKFGHGRFQTAQEKRAFMGPQKKHLEKETPETSGDL"
elif seq == "MYC":
  seq = ""
else: # user input 
  # clean up sequence: upper case, remove whitespace (spaces and line breaks)
  seq = re.sub(r'\s+', '', seq.upper())
  # if the sequence contains non-amino-acid letters, warn and remove them
  if not all(char in 'ACDEFGHIKLMNPQRSTVWY' for char in seq):
    print("\n\n")
    print('\n'+ bcolors.BOLD +bcolors.FAIL + "WARNING: Your sequence contains letters other than ACDEFGHIKLMNPQRSTVWY!"+bcolors.ENDC)
    L0  = len(seq)
    seq = re.sub('[^ACDEFGHIKLMNPQRSTVWY]+', '', seq)
    L1 = len(seq)
    # continue with the cleaned sequence instead of stopping the kernel
    print(L0-L1,'non-aa letters removed!')

print('\n\n'+ bcolors.BOLD +bcolors.OKBLUE + "OUTPUT"  +bcolors.ENDC)

######### options

# set model
MODEL = "esm1v_t33_650M_UR90S_1" #@param ["esm1v_t33_650M_UR90S_1", "Other"]

if not os.path.exists("ESMScan"):
  print("\n")
  print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+': Loading ESMScan')
  print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
  !pip install fair-esm 
  !git clone https://github.com/xuebingwu/ESMScan.git
  print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

# build the ESMScan command line
cmd = (
    "python /content/ESMScan/esmscan.py"
    " --model-location " + MODEL +
    " --sequence " + seq +
    " --output-prefix ESMScan"
    " --scoring-strategy wt-marginals"
)

print(cmd)
os.system(cmd)

# zip and download the result files, if any were produced
import glob
if glob.glob('*.pdf') or glob.glob('*.csv'):
  os.system('zip res.zip *.pdf *.csv')
  files.download("res.zip")



[1m[94mOUTPUT[0m


2023-03-31 04:00:54: Loading ESMScan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Cloning into 'ESMScan'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 13 (delta 4), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (13/13), 29.81 KiB | 4.97 MiB/s, done.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
python /content/ESMScan/esmscan.py  	--model-location esm1v_t33_650M_UR90S_1 	--sequence MSHRKFSAPRHGHLGFLPHKRSHRHRGKVKTWPRDDPSQPVHLTAFLGYKAGMTHTLREVHRPGLKISKREEVEAVTIVETPPLVVVGVVGYVATPRGLRSFKTIFAEHLSDECRRRFYKDWHKSKKKAFTKACKRWRDTDGKKQLQKDFAAMKKYCKVIRVIVHTQMKLLPFRQKKAHIMEIQLNGGTVAEKVAWAQARLEKQVPVHSVFSQSEVIDVIAVTKGRGVKGVTSRWHTKKLPRKTHKGLRKVACIGAWHPARVGCSIARAGQKGYHHRTELNKKIFRIGRGPHMEDGKLVKNNASTSYDVTAKSITPLGGFPHYGEVNN
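
The input clean-up done in the cell above can also be written as a small helper, which makes it easier to check on short test strings before running the model. This is a minimal sketch, not part of ESMScan: the name `clean_sequence` is hypothetical, and it assumes that all whitespace (including line breaks from a pasted FASTA body) should be dropped along with any letter outside the 20 standard amino acids.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def clean_sequence(raw):
    """Return (cleaned sequence, number of non-amino-acid characters removed).

    Hypothetical helper for illustration; not a function from ESMScan.
    """
    # Upper-case and strip all whitespace, including newlines from pasted FASTA bodies.
    seq = re.sub(r"\s+", "", raw.upper())
    # Drop anything that is not one of the 20 standard amino-acid letters.
    cleaned = re.sub(f"[^{AMINO_ACIDS}]", "", seq)
    return cleaned, len(seq) - len(cleaned)

seq, n_removed = clean_sequence("mshrk fsap\nrhgx*")
print(seq, n_removed)
```

Returning the number of removed characters keeps the warning logic separate from the clean-up itself, so the caller can decide whether to print a message or stop.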

# About <a name="Instructions"></a>

**Applications**
* Predict the effect of every possible single amino acid substitution in a protein (complete saturation mutagenesis *in silico*).
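
The notebook calls `esmscan.py` with `--scoring-strategy wt-marginals`. As a rough illustration of what such a score can look like, the sketch below assumes the definition used in the ESM variant-prediction example: one model pass over the unmodified wild-type sequence, then each substitution is scored as log p(mutant) minus log p(wild type) at its position. Whether `esmscan.py` follows this exactly is an assumption. The function names are hypothetical, and `toy_log_probs` returns placeholder numbers, not model output.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def wt_marginal_scores(seq, log_probs):
    """Score every single substitution as log p(mutant) - log p(wild type).

    log_probs[i] maps each amino acid to its log-probability at position i,
    as would be obtained from one model pass over the wild-type sequence.
    """
    scores = {}
    for i, wt in enumerate(seq):
        for mt in AMINO_ACIDS:
            if mt == wt:
                continue  # skip the wild-type residue itself
            # Label in wild type / 1-based position / mutant form, e.g. "M1A".
            scores[f"{wt}{i + 1}{mt}"] = log_probs[i][mt] - log_probs[i][wt]
    return scores

def toy_log_probs(seq, p_wt=0.62):
    """Placeholder distribution (assumption): p_wt on the wild type, rest spread evenly."""
    p_other = (1.0 - p_wt) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: math.log(p_wt if aa == wt else p_other) for aa in AMINO_ACIDS}
        for wt in seq
    ]

seq = "MSHRK"
scores = wt_marginal_scores(seq, toy_log_probs(seq))
print(len(scores))  # one score per position and non-wild-type amino acid
```

Under this definition a negative score means the model considers the mutant residue less likely than the wild type at that position, and a full scan yields 19 scores per residue.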


**Limitations**
* A Google account is required to run Google Colab notebooks.
* This notebook was designed for analyzing a single sequence. 
* Long sequences may fail due to a lack of memory.
* The first run is slow because the `fair-esm` package must be installed and the ESMScan repository cloned.
* GPU may not be available and running the prediction on CPU will be significantly slower. 
* Your browser may block the pop-up for downloading the result file. If that happens, download the file manually: click the folder icon on the left, navigate to `res.zip`, right-click it and select "Download".
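
If the download pop-up is blocked, or `res.zip` is missing, the archive can be rebuilt from Python rather than through the shell `zip` call. The following is a minimal sketch using only the standard library. `bundle_results` is a hypothetical helper, and it assumes, like the notebook's `zip` command, that the results are the `.pdf` and `.csv` files in the working directory.

```python
import glob
import os
import zipfile

def bundle_results(directory=".", archive="res.zip"):
    """Zip every .pdf and .csv in `directory`; return the archived file names."""
    paths = sorted(
        glob.glob(os.path.join(directory, "*.pdf"))
        + glob.glob(os.path.join(directory, "*.csv"))
    )
    if not paths:
        return []  # nothing to archive, so no zip file is written
    target = os.path.join(directory, archive)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # Store each file under its base name, without the directory prefix.
            zf.write(path, arcname=os.path.basename(path))
    return [os.path.basename(p) for p in paths]
```

Calling `bundle_results()` in the notebook's working directory returns the list of files that were archived; an empty list means no result files were found, which usually points to the prediction step having failed.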


**Bugs**
- If you encounter any bugs, please report the issue by emailing Xuebing Wu (xw2629 at cumc dot columbia dot edu)

**License**

* The source code of this notebook is licensed under [MIT](https://raw.githubusercontent.com/sokrypton/ColabFold/main/LICENSE). See details of the license for Pangolin [here](https://github.com/tkzeng/Pangolin/blob/main/LICENSE).

**Acknowledgments**
- We thank the [Pangolin](https://doi.org/10.1186/s13059-022-02664-4) team for developing an excellent model and open sourcing the software. 

- This notebook is modeled after the [ColabFold notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb).
