# PhosKing

PhosKing is a predictor of protein phosphorylation that uses ESM-2 embeddings as the basis for the prediction. [ESM-2](https://github.com/facebookresearch/esm) is a large protein language model released in August 2022 by Meta's AI research division, and pretrained weights are available for download.

Phosphorylation is one of the most common and widely studied post-translational modifications (PTMs). It involves the addition of a phosphoryl group to a protein, which commonly alters the protein's structure and regulates a variety of processes, from metabolic activity to signaling cascades. This can have a wide range of effects on cellular activity.

The main PhosKing model integrates convolutional, Long Short-Term Memory (LSTM) and feedforward (FF) modules to produce its predictions. The input to this model is the ESM-2 embeddings of the target amino acid and of the 16 amino acids on each side of it (a 33-residue window). The embeddings are given as input to 2 independent convolutional layers, and their outputs are concatenated with the original ESM-2 embeddings. This new tensor is passed to a 2-layer bidirectional LSTM module, which enables the model to capture contextual information about the protein sequence. Finally, the output of the LSTM module is passed to an FF module to produce the final output. This architecture is inspired by NetSurfP-3.0 ([paper](https://doi.org/10.1093/nar/gkac439)).
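As a rough illustration of the input window described above, the sketch below slices a 33-row window (target residue plus 16 on each side) out of a per-residue embedding matrix. The function name `extract_window` and the zero-padding at the ends of the protein are assumptions made for this example; the source does not state how PhosKing handles residues near the termini.

```python
import numpy as np

WINDOW = 16  # residues on each side of the target, as described above


def extract_window(embeddings: np.ndarray, position: int, window: int = WINDOW) -> np.ndarray:
    """Return the (2 * window + 1, dim) slice centred on `position` (0-based).

    Positions that fall outside the protein are filled with zeros.
    This padding strategy is an assumption, not PhosKing's documented behaviour.
    """
    length, dim = embeddings.shape
    if not 0 <= position < length:
        raise IndexError(f"position {position} is outside a sequence of length {length}")

    out = np.zeros((2 * window + 1, dim), dtype=embeddings.dtype)
    start = max(0, position - window)           # first real residue in the window
    end = min(length, position + window + 1)    # one past the last real residue
    offset = start - (position - window)        # number of padded rows at the top
    out[offset:offset + (end - start)] = embeddings[start:end]
    return out


# Toy embeddings: 40 residues x 8 dimensions (real ESM-2 embeddings are much wider).
toy = np.arange(40 * 8, dtype=np.float32).reshape(40, 8)
win_start = extract_window(toy, position=3)
win_end = extract_window(toy, position=39)
print(win_start.shape)
```

The target residue always ends up in row 16 of the window, so a downstream model can read its prediction from a fixed position.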

<center><img width=900 src="model.png"></center>
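The architecture described above can be sketched in PyTorch roughly as follows. This is not the PhosKing implementation: the class name `PhosphoSketch`, the kernel sizes, channel counts, hidden sizes, the default embedding width of 1280, and the choice to read the prediction from the central position of the LSTM output are all assumptions made for illustration. Only the overall layout (2 independent convolutions, concatenation with the original embeddings, a 2-layer bidirectional LSTM, then a feedforward module) comes from the description.

```python
import torch
from torch import nn


class PhosphoSketch(nn.Module):
    """Illustrative conv + BiLSTM + feedforward stack; all sizes are hypothetical."""

    def __init__(self, embed_dim=1280, conv_channels=32, kernel_sizes=(5, 9),
                 lstm_hidden=64, window=16):
        super().__init__()
        # Two independent convolutions over the sequence axis. Odd kernels with
        # padding k // 2 keep the window length unchanged.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, conv_channels, k, padding=k // 2) for k in kernel_sizes]
        )
        lstm_in = embed_dim + conv_channels * len(kernel_sizes)
        self.lstm = nn.LSTM(lstm_in, lstm_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.ff = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 32), nn.ReLU(), nn.Linear(32, 1)
        )
        self.center = window  # index of the target residue inside the window

    def forward(self, x):
        # x: (batch, 2 * window + 1, embed_dim)
        channels_first = x.transpose(1, 2)  # Conv1d expects (batch, channels, length)
        conv_out = [torch.relu(conv(channels_first)).transpose(1, 2) for conv in self.convs]
        # Concatenate the convolution outputs with the original embeddings.
        features = torch.cat([x, *conv_out], dim=-1)
        lstm_out, _ = self.lstm(features)    # (batch, length, 2 * lstm_hidden)
        target = lstm_out[:, self.center]    # assumption: use the central position
        return torch.sigmoid(self.ff(target)).squeeze(-1)


# Small random input: 4 windows of 33 residues with 16-dimensional embeddings.
model = PhosphoSketch(embed_dim=16)
scores = model(torch.randn(4, 33, 16))
print(scores.shape)
```

Each input window yields one score between 0 and 1 for its central residue.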

### Demo

Here is a demonstration of the model. To start, install the `fair-esm` package if needed, then enter some protein names and sequences in the cell below and run it.

In the output, the symbols mean **high confidence (*), medium confidence (+) and low confidence (.)**, based on our custom thresholds. If you install the kinase model (optional), only kinases with a score higher than 0.1 are shown, up to a maximum of the top 3.
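To make the output conventions concrete, here is a minimal sketch of how scores could be mapped to symbols and how kinase scores could be filtered. The threshold values for `*`, `+` and `.` are hypothetical placeholders, since the actual custom thresholds are not stated here. The function names and the example scores are also invented for illustration. Only the kinase cutoff of 0.1 and the limit of 3 come from the description above.

```python
# Hypothetical thresholds: PhosKing's real cutoffs are not given in this document.
HIGH_THRESHOLD = 0.9
MEDIUM_THRESHOLD = 0.7
LOW_THRESHOLD = 0.5


def confidence_symbol(score: float) -> str:
    """Map a phosphorylation score to the symbol used in the output."""
    if score >= HIGH_THRESHOLD:
        return "*"
    if score >= MEDIUM_THRESHOLD:
        return "+"
    if score >= LOW_THRESHOLD:
        return "."
    return " "  # below every threshold: nothing is drawn


def top_kinases(kinase_scores: dict, min_score: float = 0.1, max_shown: int = 3) -> list:
    """Keep kinases scoring strictly above `min_score`, best first, at most `max_shown`."""
    kept = [(name, score) for name, score in kinase_scores.items() if score > min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_shown]


# Made-up scores, for illustration only.
symbols = "".join(confidence_symbol(s) for s in [0.95, 0.2, 0.75, 0.55])
shown = top_kinases({"PKA": 0.62, "CK2": 0.08, "CDK1": 0.31, "AKT1": 0.12, "SRC": 0.10})
print(repr(symbols))
print(shown)
```

With these placeholder thresholds, the four example scores render as `"* +."`, and only `PKA`, `CDK1` and `AKT1` pass the kinase filter.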

**NOTE: The first time you run the prediction, the ESM-2 weights will be downloaded locally. This takes a few minutes.**



In [1]:
# SETUP

# This is the only package requirement (apart from PyTorch and numpy)
# !pip install fair-esm

# (Optional) Install kinase model (1.6 GB)
# !wget -P states_dicts/ https://services.healthtech.dtu.dk/Data/kinase_model.pth

In [2]:
sequences = [('seq_name_1', 'MNAEPERKFGVVVVGVGRAGSVRMRDLRNPHPSSAFLNLIGFVSRRELGSIDGVQQISLEDALSSQEVEVAYICSESSSHEDYIRQFLNAGKHVLVEYPMTLSLAAAQELWELAEQKGKVLHEEHVELLMEEFAFLKKEVVGKDLLKGSLLFTAGPLEEERFGFPAFSGISRLTWLVSLFGELSLVSATLEERKEDQYMKMTVCLETEKKSPLSWIEEKGPGLKRNRYLSFHFKSGSLENVPNVGVNKNIFLKDQNIFVQKLLGQFSEKELAAEKKRILHCLGLAEEIQKYCCSRK'),
             ('seq_name_2', 'MELENIVANTVLLKAREGGGGNRKGKSKKWRQMLQFPHISQCEELRLSLERDYHSLCERQPIGRLLFREFCATRPELSRCVAFLDGVAEYEVTPDDKRKACGRQLTQNFLSHTGPDLIPEVPRQLVTNCTQRLEQGPCKDLFQELTRLTHEYLSVAPFADYLDSIYFNRFLQWKWLERQPVTKNTFRQYRVLGKGGFGEVCACQVRATGKMYACKKLEKKRIKKRKGEAMALNEKQILEKVNSRFVVSLAYAYETKDALCLVLTLMNGGDLKFHIYHMGQAGFPEARAVFYAAEICCGLEDLHRERIVYRDLKPENILLDDHGHIRISDLGLAVHVPEGQTIKGRVGTVGYMAPEVVKNERYTFSPDWWALGCLLYEMIAGQSPFQQRKKKIKREEVERLVKEVPEEYSERFSPQARSLCSQLLCKDPAERLGCRGGSAREVKEHPLFKKLNFKRLGAGMLEPPFKPDPQAIYCKDVLDIEQFSTVKGVELEPTDQDFYQKFATGSVPIPWQNEMVETECFQELNVFGLDGSVPPDLDWKGQPPAPPKKGLLQRLFSRQDCCGNCSDSEEELPTRL'),
             ('seq_name_3', 'MPNPSCTSSPGPLPEEIRNLLADVETFVADTLKGENLSKKAKEKRESLIKKIKDVKSVYLQEFQDKGDAEDGDEYDDPFAGPADTISLASERYDKDDDGPSDGNQFPPIAAQDLPFVIKAGYLEKRRKDHSFLGFEWQKRWCALSKTVFYYYGSDKDKQQKGEFAIDGYDVRMNNTLRKDGKKDCCFEICAPDKRIYQFTAASPKDAEEWVQQLKFILQDLGSDVIPEDDEERGELYDDVDHPAAVSSPQRSQPIDDEIYEELPEEEEDTASVKMDEQGKGSRDSVHHTSGDKSTDYANFYQGLWDCTGALSDELSFKRGDVIYILSKEYNRYGWWVGEMKGAIGLVPKAYLMEMYDI')]

import phosking.predict as phosking

# Turn on force_cpu if you run out of GPU memory (for normal sequences, even laptop GPUs should 
# be able to handle it just fine, but ESM's memory usage grows a lot with sequence length)
predictions = phosking.predict(sequences, force_cpu=False)
phosking.format_predictions(sequences, predictions)

Using PyTorch version 1.12.1
Using torch device of type cuda: NVIDIA GeForce RTX 3050 Laptop GPU
Loading ESM-2...
Loaded ESM-2. Computing embeddings...
Finished computing embeddings
Loading PhosKing model (phosphorylation)
Computing phosphorylation predictions
Loading PhosKing model (kinase)
Computing kinase predictions
Finished
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
 > seq_name_1
                    *           **         +     *       *     **               
MNAEPERKFGVVVVGVGRAGSVRMRDLRNPHPSSAFLNLIGFVSRRELGSIDGVQQISLEDALSSQEVEVAYICSESSSH
         |10       |20       |30       |40       |50       |60       |70       |80       
  *                                                                 *           
EDYIRQFLNAGKHVLVEYPMTLSLAAAQELWELAEQKGKVLHEEHVELLMEEFAFLKKEVVGKDLLKGSLLFTAGPLEEE
         |90       |100      |110      |120      |130      |140      |150      |160      
             *         .             *        +   +                *