# UbiNet Position Weight Matrices (PWMs) validation with ELM-Manual degrons

This notebook validates UbiNet's PWMs against the corresponding degrons in the ELM-Manual database, which contains only experimentally validated degrons.

The validation was done in two steps:
1. For every UbiNet PWM that has a corresponding set of true degrons in ELM-Manual, scan those degrons with the PWM.
2. Manually compare the ELM-Manual degrons with those used for motif generation in UbiNet.

## Import libraries

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import gzip

## my modules ##
sys.path.append("../scripts/Utils/")    # modules folder
from fasta_utils import readFasta_gzip
from motif_utils import motif_scan_v1 as motif_scan

## Define variables and paths

In [3]:
# paths
data_path = "../data/"
external_data_path = "external/"
results_path = "../results/"
logs_path = "../logs/"
tests_path = "../tests/"
cluster_data_path = "/workspace/projects/degrons/data/"

ubinet_path = "ubinet/"
meme_path = "meme/"
elm_manual_path = "elm_manual/"

weight_m_path = "motif_matrices/PWM/"
pos_set_path = "positive_set/"

ubinet_meme_validation_path = "ubinet_meme_validation/elm_manual_substrates_to_ubinet_meme/"

# files names
ubinet_meme_validation_wth_elm_manual_file = "ubinet_meme_validation_wth_elm_manual/ubinet_meme_validation_wth_elm_manual.tsv"
elm_manual_degrons_file = "elm_manual_instances_E3ligases.tsv"
proteome_file = "uniprot_proteome_UP000005640.fasta.gz"

In [2]:
# paths
base = "../"

data = "data/"

weight_m_path = os.path.join(base, data, "ubinet/motif_matrices/PWM/")                 
elm_manual_E3_true_degrons_path = os.path.join(base, data, "elm_manual/true_degrons_E3_ubinet/")
ubinet_validation_path = os.path.join(base, data, "ubinet/ubinet_elm_manual_validation.tsv")
elm_manual_E3_path = os.path.join(base, data, "elm_manual/elm_manual_instances_E3ligases.tsv")
ubinet_degrons_path = os.path.join(base, data, "ubinet/positive_set/")

## Define functions

In [3]:
def motif_substrate_score(E3, gene, sequence, start, end, weight_m_dir):
    """ 
    Scans a sequence with a PWM considering the difference in length and performing a 
    sequence extension if necessary. This function was specifically created to validate
    UbiNet PWMs with ELM_manual degron instances.
    
    Parameters
    ----------
    E3: str
           E3 ligase to perform the scan with
    gene: str
            Name of the protein which is scanned
    sequence: str
            Instance sequence
    start: int
            Starting position of the sequence instance in the protein sequence
    end: int
            Ending position of the sequence instance in the protein sequence
    weight_m_dir: str
            Path to the folder where the E3 ligase PWM is located
    
    Returns
    -------
    E3: str 
        E3 ligase to perform the scan with
    substrate: str 
        Name of the protein which is scanned
    max_score: float
            Maximum score a sequence instance obtains after being scanned with an
            E3 ligase PWM
    """
    
    # Load weight matrix for motif scanning
    weight_m = pd.read_csv(weight_m_dir+E3+".tsv", sep = "\t")
        
    # Case 1: sequence longer than or equal to the motif
    if len(sequence) >= len(weight_m):
        
        scores = motif_scan(sequence, weight_m)
        max_score = max(scores)
    
    # Case 2: sequence shorter than motif
    else:
        
        diff = len(weight_m) - len(sequence)
        
        # Case 2.1: degron starts in protein position 0
        if start == 0:
            
            sequence_ext = proteome[gene][start:end+diff]

            scores = motif_scan(sequence_ext, weight_m)
            max_score = max(scores)
        
        # Case 2.2: degron ends at the last protein position
        elif end == len(proteome[gene]):           
                        
            sequence_ext = proteome[gene][start-diff:end]

            scores = motif_scan(sequence_ext, weight_m)
            max_score = max(scores)
        
        # Case 2.3: degron starts in diff range
        elif start in range(1, diff):
            
            all_scores = []
                        
            for i in range(start+1):
                            
                sequence_ext = proteome[gene][0+i:end+(diff-start)+i]

                scores = motif_scan(sequence_ext, weight_m)
                all_scores.append(max(scores))
                        
            max_score = max(all_scores)
        
        # Case 2.4: degron ends in diff range
        elif end in range(len(proteome[gene])-diff+1, len(proteome[gene])): 
            
            all_scores = []
                        
            for i in range(len(proteome[gene])-end+1):
                            
                sequence_ext = proteome[gene][start-(diff-(len(proteome[gene])-end))-i:len(proteome[gene])-i]

                scores = motif_scan(sequence_ext, weight_m)
                all_scores.append(max(scores))
                        
            max_score = max(all_scores)
            
        # Case 2.5: degron is in any other position
        else:
            
            all_scores = []
                        
            for i in range(diff+1):
                            
                sequence_ext = proteome[gene][start-diff+i:end+i]

                scores = motif_scan(sequence_ext, weight_m)
                all_scores.append(max(scores))
                        
            max_score = max(all_scores)
            
    # Return the `gene` argument; `substrate` is not defined inside this function
    return (E3, gene, max_score)
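
The five sub-cases of Case 2 all enumerate the motif-length windows that fully contain the degron while staying inside the protein. A minimal sketch of that idea follows. `extension_windows` is a hypothetical helper introduced here for illustration, not part of the original `motif_utils` module, and it assumes the same 0-based, end-exclusive coordinates as the function above.

```python
def extension_windows(protein_len, start, end, motif_len):
    """Hypothetical helper: list (lo, hi) slices of length motif_len that
    fully contain the degron [start, end) and fit inside the protein."""
    diff = motif_len - (end - start)
    if diff <= 0:
        # Degron is at least as long as the motif: no extension needed
        return [(start, end)]
    lo_min = max(0, start - diff)                # cannot start before position 0
    lo_max = min(start, protein_len - motif_len)  # cannot run past the protein end
    return [(lo, lo + motif_len) for lo in range(lo_min, lo_max + 1)]


# Degron in the middle of the protein (Case 2.5): diff + 1 windows
middle = extension_windows(100, 10, 15, 8)
# Degron at position 0 (Case 2.1) and at the protein end (Case 2.2): one window each
at_start = extension_windows(100, 0, 5, 8)
at_end = extension_windows(100, 95, 100, 8)
# Degron close to the N-terminus (Case 2.3): start + 1 windows
near_start = extension_windows(100, 2, 7, 8)
```

Each returned pair could be used to slice `proteome[gene]` before scanning, which would replace the separate branches with a single loop. If the protein is shorter than the motif, the helper returns an empty list.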

## 1. Motifs that can be validated

In [6]:
# E3 ligases which have validation degrons in ELM_manual

E3s_for_validation = [E3.split(".")[0] for E3 in os.listdir(elm_manual_E3_true_degrons_path)]
E3s_for_validation

['O43791',
 'P22681',
 'P40337',
 'Q00987',
 'Q13309',
 'Q14145',
 'Q5XUX0',
 'Q8IUQ4',
 'Q8NHY2',
 'Q969H0',
 'Q96J02',
 'Q9NZJ0',
 'Q9Y297']

In [7]:
len(E3s_for_validation)

13

## 2. Scan ELM-Manual degrons with UbiNet's PWMs

Considerations:
1. We scanned the entire protein sequence, not the degron directly, to test the ability of the PWM to recover the degron. We took the maximum score returned by the PWM as the "found degron" in that protein sequence.
2. UbiNet PWMs contain very low negative values wherever an amino acid is very unlikely at a given position. We therefore consider a degron "found" when the maximum score is above 0.
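
The scoring rule described above can be sketched as a sliding-window sum. This is only an illustration: `scan_with_pwm` is a hypothetical function, the PWM layout (one dict of amino acid scores per motif position) is an assumption, and the actual `motif_scan_v1` in `motif_utils` may work differently. The toy matrix uses a two-letter alphabet and invented values.

```python
def scan_with_pwm(sequence, pwm):
    """Hypothetical scanner: score every window of len(pwm) residues by
    summing the per-position weights of its amino acids."""
    k = len(pwm)
    scores = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        scores.append(sum(pwm[pos][aa] for pos, aa in enumerate(window)))
    return scores


# Toy PWM (invented values): position 0 favours "A", position 1 favours "C"
toy_pwm = [
    {"A": 1.0, "C": -5.0},
    {"A": -5.0, "C": 1.0},
]

scores = scan_with_pwm("ACCA", toy_pwm)   # windows: AC, CC, CA
max_score = max(scores)
found_degron = max_score > 0              # threshold used in this notebook
```

Because a single unlikely residue contributes a large negative weight, any window that mismatches the motif drops well below 0, which is why 0 serves as the cut-off here.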

In [8]:
# Load proteome to extract substrates sequences

proteome = readFasta_gzip(data_path+external_data_path+proteome_file)

# Expected number of read sequences: 78120 (https://www.uniprot.org/proteomes/UP000005640)

Number of retrieved sequences: 78120



In [18]:
# Compute scores for UbiNet PWMs on their ELM_manual substrates (code for df generation)

temp = []

# Analyze each PWM that has ELM-Manual degrons to be validated
for E3 in E3s_for_validation:
    
    validation_set = pd.read_csv(elm_manual_E3_true_degrons_path+E3+".tsv", sep = "\t")
    
    # Scan each substrate containing a degron
    for index, row in validation_set.iterrows():
        
        substrate = row.Substrate.split("-")[0]  # drop the isoform suffix after "-" to keep the canonical accession
        
        if substrate in proteome.keys():

            sequence = proteome[substrate][row.Start-1:row.End]  # 1-based inclusive -> 0-based slice
            start = row.Start-1
            end = row.End
                
            weight_m_dir = data_path+ubinet_path+weight_m_path
                
            E3, substrate, max_score = \
            motif_substrate_score(E3, substrate, sequence, start, end, weight_m_dir)
                
            if max_score < 0:
                max_score = 0   # for better visualization: transform very low negative values to zeros
                
            temp.append([E3, substrate, max_score])
         

In [9]:
# Generate validation scores df

elm_manual_validation_scores = pd.DataFrame(temp, 
                                            columns = ['E3_ligase',
                                                       'Substrate', 'Score'])


In [19]:
# Save validation scores df

elm_manual_validation_scores.to_csv(ubinet_validation_path, sep = "\t", index = False)

Visualization in: ``Figures.ipynb``

## 3. Manually check differences between UbiNet and ELM-Manual substrates and degrons

In [22]:
# Load ELM-Manual database

elm_manual_degrons_df = pd.read_csv(elm_manual_E3_path, sep = "\t")
elm_manual_degrons_df

Unnamed: 0,Degron,Substrate,Start,End,Database,Sequence,Sequence_amplified,Start_amplified,End_amplified,E3_ligase
0,DEG_SCF_FBXO31_1,P30281,286,292,ELM,DVTAIHL,SSSQGPSQTSTPTDVTAIHL,273.0,292.0,Q5XUX0
1,DEG_SCF_FBXO31_1,P30279,283,289,ELM,DVRDIDL,KSEDELDQASTPTDVRDIDL,270.0,289.0,Q5XUX0
2,DEG_SCF_FBXO31_1,P24385,289,295,ELM,DVRDVDI,EEEEEVDLACTPTDVRDVDI,276.0,295.0,Q5XUX0
3,DEG_COP1_1,P14921,273,283,ELM,SFNSLQRVPSY,WSSQSSFNSLQRVPSYDSFD,268.0,287.0,Q8NHY2
4,DEG_COP1_1,P15036,301,311,ELM,SLLDVQRVPSF,WNSQSSLLDVQRVPSFESFE,296.0,315.0,Q8NHY2
...,...,...,...,...,...,...,...,...,...,...
225,DEG_CRL4_CDT2_1,Q9NQR1,178,190,Manual,PPKTPPSSCDSTN,EAAEPPKTPPSSCDSTNAAI,174.0,193.0,Q9NZJ0
226,CBL_MET,Q9UIW2,441,443,Manual,DYR,DGLTAVAAYDYRGRTVVFAG,432.0,451.0,P22681
227,DEG_APCC_TPR_1,Q9UM11,491,493,Manual,LFT,SKTRSTKVKWESVSVLNLFT,474.0,493.0,
228,CBL_MET,Q9UQQ2,88,90,Manual,DYR,VRDGRAPGRDYRDTGRGPPA,79.0,98.0,P22681


Because the number of cases is small, we manually checked whether any of the presumed degrons UbiNet used to build the PWMs match real ELM-Manual degrons. Below is a code example with E3 ligase Q9Y297:

In [25]:
# Recall E3s that can be validated
E3s_for_validation

['O43791',
 'P22681',
 'P40337',
 'Q00987',
 'Q13309',
 'Q14145',
 'Q5XUX0',
 'Q8IUQ4',
 'Q8NHY2',
 'Q969H0',
 'Q96J02',
 'Q9NZJ0',
 'Q9Y297']

In [26]:
E3 = "Q9Y297"

In [27]:
# ELM-Manual real degrons

elm_manual_set = pd.read_csv(elm_manual_E3_true_degrons_path+E3+".tsv", sep = "\t")
elm_manual_set.rename(columns = {"Substrate": "gene", "Start": "start", "End": "end"}, inplace = True)

In [28]:
# UbiNet presumed degrons

ubinet_set = pd.read_csv(ubinet_degrons_path+E3+".tsv", sep = "\t")

In [29]:
# Merge (substrates coincidence)
merge = pd.merge(elm_manual_set, ubinet_set, on = ['gene'], how = 'inner')
merge

Unnamed: 0,gene,start_x,end_x,sequence,start_y,end_y
0,P17181,534,539,GPPEV,129,133
1,Q5JSP0,75,80,SSTPP,7,11
2,P98174,282,287,PPPPS,181,185
3,Q53EL6,70,76,SGVPV,198,202
4,P16471,348,353,LPQPS,418,422
5,Q6PGQ7,496,501,LPTPG,50,54
6,O95863,95,100,PPSPP,102,106
7,Q9UKT4,144,149,GPLPG,432,436
8,Q12959,597,602,LPSPP,167,171
9,P18848,218,224,LPSPG,246,250


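#### Results for all UbiNet motifs having any match with ELM-Manual degrons

Entries follow the column order of the merge above: gene, start_x, end_x, sequence, start_y, end_y.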

- Q14145 (KEAP1). Common degrons:

    * Q86YC2	89	94	DEETGE	89	94
    * O14920	34	39	NQETGE	34	39
    * Q13501	347	352	DPSTGE	347	352
    * Q16236	77	82	DEETGE	77	82
    * Q96HS1	77	82	NVESGE	77	82
    
- Q8IUQ4 (SIAH1). Common degrons:

    * Q9HB71	59	67	PAAVVAP	60	66 (almost)
    
- Q969H0 (**FBXW7**). Common degrons:

    * P04198	57	62	PTPPLSPSRG	57	66
    * P01106	55	62	PTPPLSPSRR	57	66 (almost)
    * P05412	236	243	ETPPLSPIDM	238	247 (almost)
    * P24864	393	399	LTPPQSGKKQ	394	403 (almost)
    * P24864	394	399	LTPPQSGKKQ	394	403 (almost)
    * Q13887	300	307	PSPPSSEPGS	302	311 (almost)
    
- Q96J02 (**ITCH**). Common degrons:

    * Q9H3D4	540	543	HCTPPPPY	536	543 (almost)
    
- Q9UKB1 (FBXW11). Common degrons:

    * P16471	347	351	LDPDTDSG	343	350 (almost)
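
The manual comparison above could be approximated with a simple coordinate check. The sketch below is an assumption about how such a rule might look, not the procedure used to produce the list: `classify_overlap` is a hypothetical helper that treats both coordinate pairs as 1-based and inclusive, and the labels in the list were assigned by hand, so they need not agree with this rule in every case.

```python
def classify_overlap(start_x, end_x, start_y, end_y):
    """Hypothetical rule: compare two inclusive intervals on the same protein."""
    if (start_x, end_x) == (start_y, end_y):
        return "exact"
    if start_x <= end_y and start_y <= end_x:
        return "partial"
    return "none"


# Coordinates taken from the tables in this notebook
keap1 = classify_overlap(89, 94, 89, 94)      # Q86YC2 with Q14145
siah1 = classify_overlap(59, 67, 60, 66)      # Q9HB71 with Q8IUQ4
btrc = classify_overlap(534, 539, 129, 133)   # P17181 with Q9Y297
```

Applied row by row to the merged DataFrame, a rule like this would flag candidate matches for manual inspection.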