# Preprocess the ELM-Manual database. Generate true degron sets for UbiNet PWM validation. Generate ELM-Manual true degron sets.

This notebook contains the code to preprocess the ELM-Manual database, compiled from ELM, Mészáros *et al*. (2017) and Martínez-Jiménez *et al*. (2019). It also generates the ELM-Manual true degron sets used to validate the UbiNet Position Weight Matrices (PWMs). Note that this database was edited after performing the alignments for some degron motifs.

Steps performed:
1. **Map ELM-Manual motif IDs to the corresponding UbiNet E3 ligases**. In ELM, degron motifs are not defined per ligase, so we add an `E3_ligase` column to the dataframe.
2. **Create the true degron sets from ELM-Manual for UbiNet PWM validation**.
3. **Define motif consensus IDs**: add a `Degron_consensusID` column to the dataframe. Consensus IDs group motif IDs from different databases that refer to the same degron motif.
4. **Extract the true degron set for each consensus motif**. Note that the positive sets from step 2 are defined per UbiNet E3 ligase, whereas these are defined per ELM-Manual consensus motif.

**Important note**: UbiNet PWMs need to have already been generated to run this notebook.

## Import libraries

In [2]:
import os
import pandas as pd
import urllib.parse
import urllib.request
import json
from pprint import pprint

## Define variables and paths

In [3]:
base = "../../"

data = "data/"

elm_manual_path = os.path.join(base, data, "elm_manual/elm_manual_instances.tsv")
weight_m_path = os.path.join(base, data, "ubinet/motif_matrices/PWM/")                 
elm_manual_E3_path = os.path.join(base, data, "elm_manual/elm_manual_instances_E3ligases.tsv")
elm_manual_E3_true_degrons_path = os.path.join(base, data, "elm_manual/true_degrons_E3_ubinet/")
elm_manual_E3_consensusID_path = os.path.join(base, data, "elm_manual/elm_manual_instances_E3ligases_consensusID.tsv")
elm_manual_true_degrons_path = os.path.join(base, data, "elm_manual/positive_set/")

## Load data: ELM-Manual database

In [4]:
elm_manual_degrons_df = pd.read_csv(elm_manual_path, sep = "\t")
elm_manual_degrons_df

Unnamed: 0,Degron,Substrate,Start,End,Database,Sequence,Sequence_amplified,Start_amplified,End_amplified
0,DEG_SCF_FBXO31_1,P30281,286,292,ELM,DVTAIHL,SSSQGPSQTSTPTDVTAIHL,273.0,292.0
1,DEG_SCF_FBXO31_1,P30279,283,289,ELM,DVRDIDL,KSEDELDQASTPTDVRDIDL,270.0,289.0
2,DEG_SCF_FBXO31_1,P24385,289,295,ELM,DVRDVDI,EEEEEVDLACTPTDVRDVDI,276.0,295.0
3,DEG_COP1_1,P14921,273,283,ELM,SFNSLQRVPSY,WSSQSSFNSLQRVPSYDSFD,268.0,287.0
4,DEG_COP1_1,P15036,301,311,ELM,SLLDVQRVPSF,WNSQSSLLDVQRVPSFESFE,296.0,315.0
...,...,...,...,...,...,...,...,...,...
225,DEG_CRL4_CDT2_1,Q9NQR1,178,190,Manual,PPKTPPSSCDSTN,EAAEPPKTPPSSCDSTNAAI,174.0,193.0
226,CBL_MET,Q9UIW2,441,443,Manual,DYR,DGLTAVAAYDYRGRTVVFAG,432.0,451.0
227,DEG_APCC_TPR_1,Q9UM11,491,493,Manual,LFT,SKTRSTKVKWESVSVLNLFT,474.0,493.0
228,CBL_MET,Q9UQQ2,88,90,Manual,DYR,VRDGRAPGRDYRDTGRGPPA,79.0,98.0


Number of different motifs:

In [5]:
# Degron motifs IDs

elm_manual_degrons_motifs = elm_manual_degrons_df.Degron.unique()

print(f'Number of motifs: {len(elm_manual_degrons_motifs)}\n')
pprint(elm_manual_degrons_motifs)


Number of motifs: 34

array(['DEG_SCF_FBXO31_1', 'DEG_COP1_1', 'DEG_SPOP_SBC_1',
       'DEG_MDM2_SWIB_1', 'DEG_Kelch_KLHL3_1', 'DEG_Kelch_Keap1_1',
       'DEG_APCC_KENBOX_2', 'DEG_APCC_DBOX_1', 'DEG_APCC_TPR_1',
       'DEG_SCF_TRCP1_1', 'DEG_SCF_SKP2-CKS1_1', 'DEG_SCF_FBW7_1',
       'DEG_CRL4_CDT2_1', 'DEG_SCF_FBW7_2', 'DEG_ODPH_VHL_1',
       'DEG_SIAH_1', 'D-box', 'Other', 'ABBA', 'SCF_FBXO31', 'SCF_Fbw7',
       'KEN-box', 'CBLL1', 'CBL', 'DEG_Kelch_actinfilin_1',
       'FBW7_predicted', 'DEG_SPOP', 'DEG_Kelch_Keap1_2', 'SCF_beta-TrCP',
       'DEG_Nend_UBRbox_4', 'CBL_PTK', 'CBL_MET', 'LIG_APCC_ABBA_1',
       'ITCH'], dtype=object)


## 1. Map UbiNet E3 ligases with ELM-Manual degron motifs

First, extract all unique E3 ligases that have a UbiNet PWM.

In [6]:
# UbiNet E3 ligases AC

E3_ligases = [E3.split(".")[0] for E3 in os.listdir(weight_m_path)]
E3_ligases_AC = " ".join(E3_ligases)   # the UniProt ID mapping tool expects a single space-separated string

print(f'Number of E3 ligases: {len(E3_ligases)}\n')
print(E3_ligases_AC)

Number of E3 ligases: 104

Q969K3 Q00987 Q969V5 Q9Y4L5 Q9H4P4 Q9UKC9 Q9HCE7 Q9Y4K3 Q6ZNA4 O95155 Q99942 Q8TEB7 Q13489 P29372 Q9NTX7 P78317 O14543 Q5T0T0 Q96EP0 Q9H0M0 P0C2W1 Q9Y2M5 O00308 P22681 P40337 Q96EP1 Q9H2C0 O76064 Q86YJ5 Q96EQ8 Q05086 Q96PU5 Q8TCQ1 Q9NX47 Q9NVW2 Q9Y4B6 Q99728 Q969P5 Q7L5Y6 Q8WVD3 Q9BYM8 Q13309 Q86TM6 P10523 Q96FA3 Q13049 Q9UKT5 O43164 O15524 Q53G59 O60858 P49427 O43791 P46934 Q969H0 Q7Z6Z7 Q99496 Q96J02 P19474 Q8TBB1 Q13233 Q9UK22 P35226 Q8IYU2 Q9NWF9 Q12933 Q96PM5 Q12834 Q5XUX0 Q8N3Y1 Q5T447 Q9Y297 Q9NZJ0 Q9HAU4 Q9BV68 Q96AX9 Q13490 O95071 Q86YT6 Q14669 Q8NG27 Q13191 Q9UNE7 Q92844 Q9NXK8 Q96F44 Q6VVB1 Q9UKV5 Q9NS91 P98170 Q9UKA1 Q14145 Q96SW2 O60260 Q14258 Q969Q1 Q8IUQ4 Q7Z6J0 Q9UKB1 Q8WZ73 P38398 O43255 Q9UM11 Q9UHC7
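
Because the accession is taken as everything before the first dot of each directory entry, a stray entry such as a hidden `.ipynb_checkpoints` folder would yield an empty accession. Below is a minimal sketch of a stricter listing, assuming each PWM file is named `<accession>.<extension>`. The helper name `list_pwm_accessions` and the temporary files are illustrative and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def list_pwm_accessions(pwm_dir):
    """Return sorted accessions for regular, non-hidden files in pwm_dir.

    Assumes each PWM file is named '<accession>.<extension>' (an assumption
    about the UbiNet PWM folder layout, not confirmed by the notebook).
    """
    accessions = []
    for entry in Path(pwm_dir).iterdir():
        # Skip sub-directories and hidden entries such as '.ipynb_checkpoints'
        if not entry.is_file() or entry.name.startswith("."):
            continue
        accessions.append(entry.name.split(".")[0])
    return sorted(accessions)


# Toy demonstration on a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Q00987.tsv", "P22681.tsv", ".DS_Store"):
        (Path(tmp) / name).touch()
    (Path(tmp) / ".ipynb_checkpoints").mkdir()
    demo_accessions = list_pwm_accessions(tmp)

print(demo_accessions)
```

Sorting the result also makes the printed list reproducible, since `os.listdir` returns entries in arbitrary order.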


ELM-Manual degron motif IDs sometimes contain the gene name of the E3 ligase, so we map the E3 ligase ACs to their gene names. Part of the following code is copied directly from the UniProt API manual.

In [None]:
# Map UbiNet ACs to UniProt gene names

E3_ligases_map = {}    # dictionary to store the GeneName-AC mapping

##### UniProt code ##########################
# Note: this is UniProt's legacy ID-mapping endpoint; it has since been superseded by
# the REST API at rest.uniprot.org, so this request may need adapting to run today.
url = 'https://www.uniprot.org/uploadlists/'

params = {
    'from': 'ID',
    'to': 'GENENAME',
    'format': 'tab',
    'query': E3_ligases_AC
}

# Named `payload` (not `data`) to avoid overwriting the `data` path variable defined above
payload = urllib.parse.urlencode(params).encode('utf-8')
req = urllib.request.Request(url, payload)

with urllib.request.urlopen(req) as f:
    response = f.read().decode('utf-8')
############################################

# Skip the header line and the trailing empty line of the tab-separated response
for ac_genename in response.split("\n")[1:-1]:

    ac, genename = ac_genename.split("\t")[:2]

    E3_ligases_map[genename] = ac
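
The parsing loop above can be isolated into a small function and tested without network access. This is a minimal sketch, assuming the response is a two-column, tab-separated table with a header row, as the loop implies. The function name `parse_mapping_response` and the example response are hypothetical and not taken from UniProt.

```python
def parse_mapping_response(response_text):
    """Parse a two-column 'From<TAB>To' table into {gene_name: accession}.

    Assumes the first line is a header, as in the loop above. Blank or
    malformed lines are skipped instead of raising an IndexError.
    """
    gene_to_ac = {}
    for line in response_text.splitlines()[1:]:
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue
        accession, gene_name = fields[0], fields[1]
        # If a gene name appears more than once, the last accession wins
        gene_to_ac[gene_name] = accession
    return gene_to_ac


# Hand-written example response (not real UniProt output)
example_response = "From\tTo\nQ00987\tMDM2\nP22681\tCBL\n\n"
example_map = parse_mapping_response(example_response)
print(example_map)
```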

With all E3 ligase ACs mapped to their gene names, we map ELM-Manual motifs to UbiNet E3 ligases. First, we do it automatically by looking for matches between gene names and motif IDs.

In [42]:
# Generate the mapping dictionary: ELM_manual_motif:E3_ligase

elm_manual_ubinet_map = {}

print("First round splitting motif terms by underscore\n")
for motif in elm_manual_degrons_motifs:
    
    motif_terms = motif.upper().split("_")[1:-1]   # capital letter to match gene names which are in capital letters
    
    for term in motif_terms:
        for E3 in E3_ligases_map.keys():
            if term == E3:
                print(f'ELM_manual motif {term, motif}: corresponding {E3_ligases_map[E3]} ligase from UbiNet')
                elm_manual_ubinet_map[motif] = E3_ligases_map[E3]

                
print("---------------------------------------------------------------------------------")   
print("Second round of splitting motif terms to preserve number at the end (for SIAH1 and others)\n")   
for motif in elm_manual_degrons_motifs:
    
    motif_terms = motif.upper().split("_")[1:]
    motif_term = "".join(motif_terms)

    for E3 in E3_ligases_map.keys():
        if motif_term == E3:
            print(f'ELM motif {motif_term, motif}: corresponding {E3_ligases_map[E3]} ligase from UbiNet')
            elm_manual_ubinet_map[motif] = E3_ligases_map[E3]

print("---------------------------------------------------------------------------------")   
print("Third round without splitting for those motifs not coming from ELM (which do not have underscores)\n") 
for motif in elm_manual_degrons_motifs:
    
    for E3 in E3_ligases_map.keys():
        if motif == E3:
            print(f'ELM motif {motif}: corresponding {E3_ligases_map[E3]} ligase from UbiNet')
            elm_manual_ubinet_map[motif] = E3_ligases_map[E3]
    

First round splitting motif terms by underscore

ELM_manual motif ('FBXO31', 'DEG_SCF_FBXO31_1'): corresponding Q5XUX0 ligase from UbiNet
ELM_manual motif ('COP1', 'DEG_COP1_1'): corresponding Q8NHY2 ligase from UbiNet
ELM_manual motif ('SPOP', 'DEG_SPOP_SBC_1'): corresponding O43791 ligase from UbiNet
ELM_manual motif ('MDM2', 'DEG_MDM2_SWIB_1'): corresponding Q00987 ligase from UbiNet
ELM_manual motif ('KEAP1', 'DEG_Kelch_Keap1_1'): corresponding Q14145 ligase from UbiNet
ELM_manual motif ('VHL', 'DEG_ODPH_VHL_1'): corresponding P40337 ligase from UbiNet
ELM_manual motif ('KEAP1', 'DEG_Kelch_Keap1_2'): corresponding Q14145 ligase from UbiNet
---------------------------------------------------------------------------------
Second round of splitting motif terms to preserve number at the end (for SIAH1 and others)

ELM motif ('SIAH1', 'DEG_SIAH_1'): corresponding Q8IUQ4 ligase from UbiNet
ELM motif ('FBXO31', 'SCF_FBXO31'): corresponding Q5XUX0 ligase from UbiNet
ELM motif ('SPOP', 'DEG

In [43]:
# Show mapped ELM_manual IDs with UbiNet E3 ligases

print(json.dumps(elm_manual_ubinet_map, indent = 4))

{
    "DEG_SCF_FBXO31_1": "Q5XUX0",
    "DEG_COP1_1": "Q8NHY2",
    "DEG_SPOP_SBC_1": "O43791",
    "DEG_MDM2_SWIB_1": "Q00987",
    "DEG_Kelch_Keap1_1": "Q14145",
    "DEG_ODPH_VHL_1": "P40337",
    "DEG_Kelch_Keap1_2": "Q14145",
    "DEG_SIAH_1": "Q8IUQ4",
    "SCF_FBXO31": "Q5XUX0",
    "DEG_SPOP": "O43791",
    "CBL": "P22681",
    "ITCH": "Q96J02"
}
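
The three matching rounds can be condensed into a single helper, which makes the rule easier to test on individual motif IDs. This is a sketch with a hypothetical function name (`match_motif_to_gene`) and a toy gene-name set. Unlike the loops above, it returns the first match instead of letting later rounds overwrite earlier ones.

```python
def match_motif_to_gene(motif, gene_names):
    """Return the first gene name matched by a motif ID, or None.

    Mirrors the three rounds above: (1) inner underscore-separated terms,
    (2) all terms after the first joined together, (3) the raw motif ID.
    """
    terms = motif.upper().split("_")
    candidates = terms[1:-1] + ["".join(terms[1:]), motif]
    for candidate in candidates:
        if candidate in gene_names:
            return candidate
    return None


# Toy gene-name set; in the notebook the keys come from E3_ligases_map
toy_genes = {"FBXO31", "SIAH1", "CBL", "KEAP1"}
toy_motifs = ["DEG_SCF_FBXO31_1", "DEG_SIAH_1", "CBL", "DEG_Kelch_Keap1_2", "D-box"]
toy_matches = {motif: match_motif_to_gene(motif, toy_genes) for motif in toy_motifs}
print(toy_matches)
```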


Motifs that did not map to any E3 ligase are checked manually.

In [44]:
# Non-mapped ELM_manual IDs with UbiNet E3 ligases


elm_manual_ubinet_nomap = set(elm_manual_degrons_motifs) - set(elm_manual_ubinet_map.keys())
pprint(elm_manual_ubinet_nomap) 

{'ABBA',
 'CBLL1',
 'CBL_MET',
 'CBL_PTK',
 'D-box',
 'DEG_APCC_DBOX_1',
 'DEG_APCC_KENBOX_2',
 'DEG_APCC_TPR_1',
 'DEG_CRL4_CDT2_1',
 'DEG_Kelch_KLHL3_1',
 'DEG_Kelch_actinfilin_1',
 'DEG_Nend_UBRbox_4',
 'DEG_SCF_FBW7_1',
 'DEG_SCF_FBW7_2',
 'DEG_SCF_SKP2-CKS1_1',
 'DEG_SCF_TRCP1_1',
 'FBW7_predicted',
 'KEN-box',
 'LIG_APCC_ABBA_1',
 'Other',
 'SCF_Fbw7',
 'SCF_beta-TrCP'}


In [45]:
# Manually assigned by looking at the ELM descriptions + the Degrons in cancer review

elm_manual_ubinet_map['DEG_SCF_SKP2-CKS1_1'] = E3_ligases_map['SKP2']
elm_manual_ubinet_map['DEG_SCF_FBW7_2'] = E3_ligases_map['FBXW7']  # FBW7 == FBXW7 (entry in ELM)
elm_manual_ubinet_map['DEG_SCF_FBW7_1'] = E3_ligases_map['FBXW7']
elm_manual_ubinet_map['SCF_Fbw7'] = E3_ligases_map['FBXW7']
elm_manual_ubinet_map['FBW7_predicted'] = E3_ligases_map['FBXW7']
elm_manual_ubinet_map['CBL_APS'] = E3_ligases_map["CBL"]
elm_manual_ubinet_map['DEG_CRL4_CDT2_1'] = E3_ligases_map["DTL"]
elm_manual_ubinet_map['CRL4_Cdt2'] = E3_ligases_map["DTL"]
elm_manual_ubinet_map['CBL_PTK'] = E3_ligases_map["CBL"]
elm_manual_ubinet_map['SCF_beta-TrCP2'] = E3_ligases_map["FBXW11"]
elm_manual_ubinet_map['CBL_MET'] = E3_ligases_map["CBL"]
elm_manual_ubinet_map['SCF_beta-TrCP'] = E3_ligases_map["BTRC"]
elm_manual_ubinet_map['DEG_SCF_TRCP1_1'] = E3_ligases_map["BTRC"]

Important note: motifs without a UbiNet E3 ligase counterpart are not added to this dictionary.

In [47]:
# Final ELM_manual-UbiNet mapping dictionary

print(json.dumps(elm_manual_ubinet_map, indent = 4))

{
    "DEG_SCF_FBXO31_1": "Q5XUX0",
    "DEG_COP1_1": "Q8NHY2",
    "DEG_SPOP_SBC_1": "O43791",
    "DEG_MDM2_SWIB_1": "Q00987",
    "DEG_Kelch_Keap1_1": "Q14145",
    "DEG_ODPH_VHL_1": "P40337",
    "DEG_Kelch_Keap1_2": "Q14145",
    "DEG_SIAH_1": "Q8IUQ4",
    "SCF_FBXO31": "Q5XUX0",
    "DEG_SPOP": "O43791",
    "CBL": "P22681",
    "ITCH": "Q96J02",
    "DEG_SCF_SKP2-CKS1_1": "Q13309",
    "DEG_SCF_FBW7_2": "Q969H0",
    "DEG_SCF_FBW7_1": "Q969H0",
    "SCF_Fbw7": "Q969H0",
    "FBW7_predicted": "Q969H0",
    "CBL_APS": "P22681",
    "DEG_CRL4_CDT2_1": "Q9NZJ0",
    "CRL4_Cdt2": "Q9NZJ0",
    "CBL_PTK": "P22681",
    "SCF_beta-TrCP2": "Q9UKB1",
    "CBL_MET": "P22681",
    "SCF_beta-TrCP": "Q9Y297",
    "DEG_SCF_TRCP1_1": "Q9Y297"
}


### 1.1 Add UbiNet E3 ligases ACs to the corresponding motif

In [49]:
# Add UbiNet ACs according to mapping dictionary

elm_manual_degrons_df['E3_ligase'] = elm_manual_degrons_df['Degron'].map(elm_manual_ubinet_map)

# Save
elm_manual_degrons_df.to_csv(elm_manual_E3_path, sep = "\t", index = False)

In [50]:
elm_manual_degrons_df

Unnamed: 0,Degron,Substrate,Start,End,Database,Sequence,Sequence_amplified,Start_amplified,End_amplified,E3_ligase
0,DEG_SCF_FBXO31_1,P30281,286,292,ELM,DVTAIHL,SSSQGPSQTSTPTDVTAIHL,273.0,292.0,Q5XUX0
1,DEG_SCF_FBXO31_1,P30279,283,289,ELM,DVRDIDL,KSEDELDQASTPTDVRDIDL,270.0,289.0,Q5XUX0
2,DEG_SCF_FBXO31_1,P24385,289,295,ELM,DVRDVDI,EEEEEVDLACTPTDVRDVDI,276.0,295.0,Q5XUX0
3,DEG_COP1_1,P14921,273,283,ELM,SFNSLQRVPSY,WSSQSSFNSLQRVPSYDSFD,268.0,287.0,Q8NHY2
4,DEG_COP1_1,P15036,301,311,ELM,SLLDVQRVPSF,WNSQSSLLDVQRVPSFESFE,296.0,315.0,Q8NHY2
...,...,...,...,...,...,...,...,...,...,...
225,DEG_CRL4_CDT2_1,Q9NQR1,178,190,Manual,PPKTPPSSCDSTN,EAAEPPKTPPSSCDSTNAAI,174.0,193.0,Q9NZJ0
226,CBL_MET,Q9UIW2,441,443,Manual,DYR,DGLTAVAAYDYRGRTVVFAG,432.0,451.0,P22681
227,DEG_APCC_TPR_1,Q9UM11,491,493,Manual,LFT,SKTRSTKVKWESVSVLNLFT,474.0,493.0,
228,CBL_MET,Q9UQQ2,88,90,Manual,DYR,VRDGRAPGRDYRDTGRGPPA,79.0,98.0,P22681
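
Motifs absent from the mapping dictionary, such as `DEG_APCC_TPR_1` in the table above, end up with an empty `E3_ligase` because `Series.map` returns `NaN` for keys that the dictionary does not contain. A small sketch on a toy dataframe (the rows are illustrative; only the column names follow the notebook) shows how to list such motifs:

```python
import pandas as pd

toy_df = pd.DataFrame({"Degron": ["CBL_MET", "DEG_APCC_TPR_1", "CBL_MET"]})
toy_map = {"CBL_MET": "P22681"}  # DEG_APCC_TPR_1 intentionally left unmapped

# Keys absent from the dictionary become NaN
toy_df["E3_ligase"] = toy_df["Degron"].map(toy_map)

# List the motifs that received no E3 ligase
unmapped_motifs = sorted(toy_df.loc[toy_df["E3_ligase"].isna(), "Degron"].unique())
print(unmapped_motifs)
```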


## 2. For UbiNet PWMs validation: generate true degrons sets per E3 ligase

In [51]:
# Generate a dataframe for each E3 ligase ELM_manual degrons (coordinates)

grouped = elm_manual_degrons_df.groupby('E3_ligase')

for E3, substrates in grouped:
    
    E3_subs = substrates[["Substrate", "Start", "End"]]
    E3_subs.to_csv(elm_manual_E3_true_degrons_path+E3+".tsv", index = False, sep = "\t")
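
Two details of the loop above are easy to miss: `groupby` silently drops rows whose `E3_ligase` is missing, and the output folder must already exist. The sketch below illustrates both on a toy dataframe and writes to a temporary directory. The variable names are illustrative, and `os.path.join` is used here instead of string concatenation.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame(
    {
        "Substrate": ["P30281", "Q9UM11", "Q9UIW2"],
        "Start": [286, 491, 441],
        "End": [292, 493, 443],
        "E3_ligase": ["Q5XUX0", None, "P22681"],
    }
)

with tempfile.TemporaryDirectory() as out_dir:
    # exist_ok=True makes the cell safe to re-run
    os.makedirs(out_dir, exist_ok=True)
    # groupby drops rows whose key is missing (dropna=True is the default)
    for e3_ac, rows in toy_df.groupby("E3_ligase"):
        out_path = os.path.join(out_dir, f"{e3_ac}.tsv")
        rows[["Substrate", "Start", "End"]].to_csv(out_path, sep="\t", index=False)
    written_files = sorted(os.listdir(out_dir))

print(written_files)
```

No file is written for the substrate without an E3 ligase, which matches the intent of validating only motifs that have a UbiNet counterpart.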

## 3. Generate motifs consensus IDs

We need to group motifs that refer to the same degron instances but carry different IDs because they come from different databases.

In [5]:
# Load data again (for this part of the nb to be independent of the previous)
elm_manual_degrons_df = pd.read_csv(elm_manual_E3_path, sep = "\t")
elm_manual_degrons_df

Unnamed: 0,Degron,Substrate,Start,End,Database,Sequence,Sequence_amplified,Start_amplified,End_amplified,E3_ligase
0,DEG_SCF_FBXO31_1,P30281,286,292,ELM,DVTAIHL,SSSQGPSQTSTPTDVTAIHL,273.0,292.0,Q5XUX0
1,DEG_SCF_FBXO31_1,P30279,283,289,ELM,DVRDIDL,KSEDELDQASTPTDVRDIDL,270.0,289.0,Q5XUX0
2,DEG_SCF_FBXO31_1,P24385,289,295,ELM,DVRDVDI,EEEEEVDLACTPTDVRDVDI,276.0,295.0,Q5XUX0
3,DEG_COP1_1,P14921,273,283,ELM,SFNSLQRVPSY,WSSQSSFNSLQRVPSYDSFD,268.0,287.0,Q8NHY2
4,DEG_COP1_1,P15036,301,311,ELM,SLLDVQRVPSF,WNSQSSLLDVQRVPSFESFE,296.0,315.0,Q8NHY2
...,...,...,...,...,...,...,...,...,...,...
225,DEG_CRL4_CDT2_1,Q9NQR1,178,190,Manual,PPKTPPSSCDSTN,EAAEPPKTPPSSCDSTNAAI,174.0,193.0,Q9NZJ0
226,CBL_MET,Q9UIW2,441,443,Manual,DYR,DGLTAVAAYDYRGRTVVFAG,432.0,451.0,P22681
227,DEG_APCC_TPR_1,Q9UM11,491,493,Manual,LFT,SKTRSTKVKWESVSVLNLFT,474.0,493.0,
228,CBL_MET,Q9UQQ2,88,90,Manual,DYR,VRDGRAPGRDYRDTGRGPPA,79.0,98.0,P22681


Show all unique IDs for each database (ELM as reference):

In [7]:
# Degron motifs IDs according to database

previous_db_motifs = []   # to obtain only unique motif IDs

# Obtain unique motif IDs from each database
for db in elm_manual_degrons_df.Database.unique():
    
    unique_db_motifs = []
    
    db_motifs = elm_manual_degrons_df[elm_manual_degrons_df.Database == db].Degron.unique()
    
    for motif in db_motifs:
        if motif not in previous_db_motifs:
            unique_db_motifs.append(motif)
            
    previous_db_motifs = previous_db_motifs + unique_db_motifs
    
    print(f'Number of unique {db} motifs: {len(unique_db_motifs)}\n')
    pprint(unique_db_motifs)
    print()


Number of unique ELM motifs: 16

['DEG_SCF_FBXO31_1',
 'DEG_COP1_1',
 'DEG_SPOP_SBC_1',
 'DEG_MDM2_SWIB_1',
 'DEG_Kelch_KLHL3_1',
 'DEG_Kelch_Keap1_1',
 'DEG_APCC_KENBOX_2',
 'DEG_APCC_DBOX_1',
 'DEG_APCC_TPR_1',
 'DEG_SCF_TRCP1_1',
 'DEG_SCF_SKP2-CKS1_1',
 'DEG_SCF_FBW7_1',
 'DEG_CRL4_CDT2_1',
 'DEG_SCF_FBW7_2',
 'DEG_ODPH_VHL_1',
 'DEG_SIAH_1']

Number of unique Degrons_cancer motifs: 14

['D-box',
 'Other',
 'ABBA',
 'SCF_FBXO31',
 'SCF_Fbw7',
 'KEN-box',
 'CBLL1',
 'CBL',
 'DEG_Kelch_actinfilin_1',
 'FBW7_predicted',
 'DEG_SPOP',
 'DEG_Kelch_Keap1_2',
 'SCF_beta-TrCP',
 'DEG_Nend_UBRbox_4']

Number of unique Manual motifs: 4

['CBL_PTK', 'CBL_MET', 'LIG_APCC_ABBA_1', 'ITCH']
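
The counts above are cumulative: a motif ID is reported only under the first database that contains it, so an ID shared by ELM and Manual is counted under ELM. A pure-Python sketch of the same rule, with a hypothetical function name and toy records, makes this explicit:

```python
def unique_motifs_per_database(records):
    """Assign each motif ID to the first database that contains it.

    `records` is a list of (database, motif_id) pairs in table order.
    Databases are visited in order of first appearance, mirroring
    `Database.unique()` in the loop above.
    """
    databases = list(dict.fromkeys(database for database, _ in records))
    seen = set()
    unique_by_db = {}
    for database in databases:
        new_motifs = []
        for record_db, motif_id in records:
            if record_db == database and motif_id not in seen:
                seen.add(motif_id)
                new_motifs.append(motif_id)
        unique_by_db[database] = new_motifs
    return unique_by_db


# Toy records (illustrative subset, not the full table)
toy_records = [
    ("ELM", "DEG_SIAH_1"),
    ("ELM", "DEG_SPOP_SBC_1"),
    ("Degrons_cancer", "DEG_SPOP_SBC_1"),
    ("Degrons_cancer", "DEG_SPOP"),
    ("Manual", "DEG_SIAH_1"),
    ("Manual", "ITCH"),
]
toy_unique = unique_motifs_per_database(toy_records)
print(toy_unique)
```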



Manually check the mapping between databases and between motifs of the same E3 ligase. With this information, build a manual mapping dictionary. Note that when a motif has no match in the other databases, its consensus name is the motif ID used in its own database.

First, we handle the motifs that have already been grouped because they have a UbiNet counterpart and their motifs have already been generated (info here: `Degrons_project_update_23-12-21 presentation`). Some notes:
- KEAP1 (Q14145): DEG_Kelch_Keap1_1 (ELM and Manual), DEG_Kelch_Keap1_2 (Degrons_cancer). The two ELM motifs make sense together when we generate the consensus motif.
- SPOP (O43791): DEG_SPOP_SBC_1 (Degrons_cancer and ELM), DEG_SPOP (Degrons_cancer). DEG_SPOP comes from Degrons in cancer, so it has no motif, but the alignment makes sense for generating a common consensus motif.
- VHL (P40337): DEG_ODPH_VHL_1 (ELM and Manual).
- FBXO31 (Q5XUX0): DEG_SCF_FBXO31_1 (ELM), SCF_FBXO31 (Degrons_cancer). The two motifs make sense together when we generate the consensus motif.
- SIAH1 (Q8IUQ4): DEG_SIAH_1 (ELM and Manual).
- COP1 (Q8NHY2): DEG_COP1_1 (ELM and Degrons_cancer).
- DTL (Q9NZJ0): DEG_CRL4_CDT2_1 (ELM and Manual).
- FBXW7 (Q969H0): DEG_SCF_FBW7_1 (ELM and Manual), DEG_SCF_FBW7_2 (Degrons_cancer and ELM), SCF_Fbw7 (Degrons_cancer), FBW7_predicted (Degrons_cancer). The two ELM motifs make sense together when we generate the consensus motif.
- MDM2 (Q00987): DEG_MDM2_SWIB_1 (ELM).
- CBL (P22681): split into 3 alignments. Motifs: CBL (Degrons_cancer), CBL_MET (Manual), CBL_PTK (Manual).

In [8]:
# Mapping dictionary for motifs consensus names
# has to be: {ID-in-the-table: new-ID}

motifs_consensus = {
    "DEG_Kelch_Keap1_1": "KEAP1",
    "DEG_Kelch_Keap1_2": "KEAP1",
    "DEG_SPOP_SBC_1": "SPOP",
    "DEG_SPOP": "SPOP",
    "DEG_ODPH_VHL_1": "VHL",
    "DEG_SCF_FBXO31_1": "FBXO31",
    "SCF_FBXO31": "FBXO31",
    "DEG_SIAH_1": "SIAH1",
    "DEG_COP1_1": "COP1",
    "DEG_CRL4_CDT2_1": "DTL",
    "DEG_SCF_FBW7_1": "FBXW7",
    "DEG_SCF_FBW7_2": "FBXW7",
    "SCF_Fbw7": "FBXW7",
    "FBW7_predicted": "FBXW7",
    "DEG_MDM2_SWIB_1": "MDM2",    
}
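
To review which motif IDs each consensus ID gathers, the dictionary can be inverted. This is a minimal sketch using only the standard library. The helper name `group_by_consensus` is hypothetical, and the toy dictionary repeats a few entries from the cell above.

```python
from collections import defaultdict


def group_by_consensus(motif_to_consensus):
    """Invert {motif_id: consensus_id} into {consensus_id: [motif_id, ...]}."""
    groups = defaultdict(list)
    for motif_id, consensus_id in motif_to_consensus.items():
        groups[consensus_id].append(motif_id)
    return dict(groups)


toy_consensus = {
    "DEG_Kelch_Keap1_1": "KEAP1",
    "DEG_Kelch_Keap1_2": "KEAP1",
    "DEG_SPOP_SBC_1": "SPOP",
    "DEG_SPOP": "SPOP",
    "DEG_ODPH_VHL_1": "VHL",
}
consensus_groups = group_by_consensus(toy_consensus)
print(consensus_groups)
```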

Remaining motifs:

In [9]:
# Degron motif IDs per database, restricted to those not yet in motifs_consensus

previous_db_motifs = []          # to obtain only unique motif IDs

# Obtain unique motif IDs from each database
for db in elm_manual_degrons_df.Database.unique():
    
    unique_db_motifs = []    # motifs not already seen in a previous database
    remain_db_motifs = []    # motifs not yet present in motifs_consensus
    
    db_motifs = elm_manual_degrons_df[elm_manual_degrons_df.Database == db].Degron.unique()
    
    # Unique motif between databases
    for motif in db_motifs:
        if motif not in previous_db_motifs:
            unique_db_motifs.append(motif)
            
    previous_db_motifs = previous_db_motifs + unique_db_motifs
    
    # Motifs which have not been mapped yet belonging to this database
    for motif in unique_db_motifs:
        if motif not in motifs_consensus:
            remain_db_motifs.append(motif)
    
    print(f'Number of unique (and remaining) {db} motifs: {len(remain_db_motifs)}\n')
    pprint(remain_db_motifs)
    print()

Number of unique (and remaining) ELM motifs: 6

['DEG_Kelch_KLHL3_1',
 'DEG_APCC_KENBOX_2',
 'DEG_APCC_DBOX_1',
 'DEG_APCC_TPR_1',
 'DEG_SCF_TRCP1_1',
 'DEG_SCF_SKP2-CKS1_1']

Number of unique (and remaining) Degrons_cancer motifs: 9

['D-box',
 'Other',
 'ABBA',
 'KEN-box',
 'CBLL1',
 'CBL',
 'DEG_Kelch_actinfilin_1',
 'SCF_beta-TrCP',
 'DEG_Nend_UBRbox_4']

Number of unique (and remaining) Manual motifs: 4

['CBL_PTK', 'CBL_MET', 'LIG_APCC_ABBA_1', 'ITCH']



Add the remaining motifs to the mapping dictionary. Some notes:
- DEG_Kelch_KLHL3_1 (ELM): http://elm.eu.org/elms/DEG_Kelch_KLHL3_1. It keeps its own ID as consensus name.
- APC motifs: split the APC motifs into the 3 ELM subtypes plus ABBA, since their regular expressions look different.
- BTRC: SCF_beta-TrCP corresponds to the F-box protein BTRC (Degrons in cancer).
- DEG_Nend_UBRbox_4: N-terminal E3 ligase

In [10]:
# Mapping dictionary for motifs consensus names
# has to be: {ID-in-the-table: new-ID}
# Adding more entries

motifs_consensus["DEG_Kelch_KLHL3_1"] = "DEG_Kelch_KLHL3_1"
motifs_consensus["DEG_APCC_KENBOX_2"] = "APC_KENBOX"
motifs_consensus["DEG_APCC_DBOX_1"] = "APC_DBOX"
motifs_consensus["DEG_APCC_TPR_1"] = "DEG_APCC_TPR_1"
motifs_consensus["KEN-box"] = "APC_KENBOX"   # Almost certainly APC
motifs_consensus["D-box"] = "APC_DBOX"   # Almost certainly APC
motifs_consensus["ABBA"] = "APC_ABBA"   # Almost certainly APC
motifs_consensus["LIG_APCC_ABBA_1"] = "APC_ABBA"  # ABBA motif
motifs_consensus["DEG_SCF_TRCP1_1"] = "BTRC"
motifs_consensus["SCF_beta-TrCP"] = "BTRC"
motifs_consensus["DEG_SCF_SKP2-CKS1_1"] = "DEG_SCF_SKP2-CKS1_1"
motifs_consensus["Other"] = "Other"
motifs_consensus["CBLL1"] = "CBLL1" 
motifs_consensus["DEG_Kelch_actinfilin_1"] = "DEG_Kelch_actinfilin_1"
motifs_consensus["DEG_Nend_UBRbox_4"] = "DEG_Nend_UBRbox_4"
motifs_consensus["ITCH"] = "ITCH"

**The special case of CBL...**

We generated three different alignments from ELM CBL degrons and analyzed UbiNet CBL substrates with the PWMs derived from each of them. The first two alignments were apparently more specific for the ELM motifs CBL_MET and CBL_PTK, whereas the third gathered both types. Hence, the next step is to map CBL_PTK and CBL_MET independently and try to align the CBL degrons to both alignments, selecting the one that the sequence fits better. For now, we map the IDs as below:

In [11]:
motifs_consensus["CBL_MET"] = "CBL_MET"
motifs_consensus["CBL_PTK"] = "CBL_PTK"
motifs_consensus["CBL"] = "CBL"
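
Before the dictionary is applied to the dataframe, it may be worth confirming that every motif ID has an entry, since any missing ID would be left without a consensus ID. A minimal sketch, with a hypothetical helper name and a toy subset of IDs:

```python
def missing_consensus_ids(motif_ids, motif_to_consensus):
    """Return the motif IDs that have no entry in the consensus dictionary."""
    return sorted(set(motif_ids) - set(motif_to_consensus))


# Toy inputs; in the notebook these would be the Degron column and motifs_consensus
toy_motif_ids = ["CBL", "CBL_MET", "CBL_PTK", "ITCH", "Other"]
toy_consensus = {"CBL": "CBL", "CBL_MET": "CBL_MET", "CBL_PTK": "CBL_PTK", "ITCH": "ITCH"}
missing = missing_consensus_ids(toy_motif_ids, toy_consensus)
print(missing)
```

An empty list would indicate that all motif IDs are covered.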

Add the `Degron_consensusID` column to the ELM-Manual dataframe:

In [12]:
# Add consensus names according to mapping dictionary

elm_manual_degrons_df['Degron_consensusID'] = elm_manual_degrons_df['Degron'].map(motifs_consensus)
elm_manual_degrons_df


Unnamed: 0,Degron,Substrate,Start,End,Database,Sequence,Sequence_amplified,Start_amplified,End_amplified,E3_ligase,Degron_consensusID
0,DEG_SCF_FBXO31_1,P30281,286,292,ELM,DVTAIHL,SSSQGPSQTSTPTDVTAIHL,273.0,292.0,Q5XUX0,FBXO31
1,DEG_SCF_FBXO31_1,P30279,283,289,ELM,DVRDIDL,KSEDELDQASTPTDVRDIDL,270.0,289.0,Q5XUX0,FBXO31
2,DEG_SCF_FBXO31_1,P24385,289,295,ELM,DVRDVDI,EEEEEVDLACTPTDVRDVDI,276.0,295.0,Q5XUX0,FBXO31
3,DEG_COP1_1,P14921,273,283,ELM,SFNSLQRVPSY,WSSQSSFNSLQRVPSYDSFD,268.0,287.0,Q8NHY2,COP1
4,DEG_COP1_1,P15036,301,311,ELM,SLLDVQRVPSF,WNSQSSLLDVQRVPSFESFE,296.0,315.0,Q8NHY2,COP1
...,...,...,...,...,...,...,...,...,...,...,...
225,DEG_CRL4_CDT2_1,Q9NQR1,178,190,Manual,PPKTPPSSCDSTN,EAAEPPKTPPSSCDSTNAAI,174.0,193.0,Q9NZJ0,DTL
226,CBL_MET,Q9UIW2,441,443,Manual,DYR,DGLTAVAAYDYRGRTVVFAG,432.0,451.0,P22681,CBL_MET
227,DEG_APCC_TPR_1,Q9UM11,491,493,Manual,LFT,SKTRSTKVKWESVSVLNLFT,474.0,493.0,,DEG_APCC_TPR_1
228,CBL_MET,Q9UQQ2,88,90,Manual,DYR,VRDGRAPGRDYRDTGRGPPA,79.0,98.0,P22681,CBL_MET


After performing the alignments, we manually reassign CBL degrons to CBL_MET or CBL_PTK:

CBL_MET:

In [14]:
# CLEARLY CBL_MET (previous alignment)

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P51805') & 
                          (elm_manual_degrons_df['Start_amplified'] == 1289) &
                          (elm_manual_degrons_df['End_amplified'] == 1308) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_MET"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q9UIW2') & 
                          (elm_manual_degrons_df['Start_amplified'] == 1314) &
                          (elm_manual_degrons_df['End_amplified'] == 1333) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_MET"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P08138') & 
                          (elm_manual_degrons_df['Start_amplified'] == 326) &
                          (elm_manual_degrons_df['End_amplified'] == 345) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_MET"

# New degrons to fit
elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q04912') & 
                          (elm_manual_degrons_df['Start_amplified'] == 1007) &
                          (elm_manual_degrons_df['End_amplified'] == 1026) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_MET"

CBL_PTK:

In [15]:
# CLEARLY CBL_PTK (previous alignment)

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'A0A0C4DFS6') & 
                          (elm_manual_degrons_df['Start_amplified'] == 65) &
                          (elm_manual_degrons_df['End_amplified'] == 84) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'O43597') & 
                          (elm_manual_degrons_df['Start_amplified'] == 45) &
                          (elm_manual_degrons_df['End_amplified'] == 64) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'O43609') & 
                          (elm_manual_degrons_df['Start_amplified'] == 43) &
                          (elm_manual_degrons_df['End_amplified'] == 62) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P00533') & 
                          (elm_manual_degrons_df['Start_amplified'] == 1059) &
                          (elm_manual_degrons_df['End_amplified'] == 1078) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P09619') & 
                          (elm_manual_degrons_df['Start_amplified'] == 870) &
                          (elm_manual_degrons_df['End_amplified'] == 889) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P10721') & 
                          (elm_manual_degrons_df['Start_amplified'] == 558) &
                          (elm_manual_degrons_df['End_amplified'] == 577) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P16234') & 
                          (elm_manual_degrons_df['Start_amplified'] == 862) &
                          (elm_manual_degrons_df['End_amplified'] == 881) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

# New degrons to fit
elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P06241') & 
                          (elm_manual_degrons_df['Start_amplified'] == 410) &
                          (elm_manual_degrons_df['End_amplified'] == 429) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P43403') & 
                          (elm_manual_degrons_df['Start_amplified'] == 282) &
                          (elm_manual_degrons_df['End_amplified'] == 301) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_PTK"

Degrons fitting neither alignment:

In [16]:
elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P07333') & 
                          (elm_manual_degrons_df['Start_amplified'] == 953) &
                          (elm_manual_degrons_df['End_amplified'] == 972) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_unknown"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'P10721') & 
                          (elm_manual_degrons_df['Start_amplified'] == 926) &
                          (elm_manual_degrons_df['End_amplified'] == 945) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_unknown"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q9NRF2') & 
                          (elm_manual_degrons_df['Start_amplified'] == 737) &
                          (elm_manual_degrons_df['End_amplified'] == 756) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_unknown"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q9UQQ2') & 
                          (elm_manual_degrons_df['Start_amplified'] == 556) &
                          (elm_manual_degrons_df['End_amplified'] == 575) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL')), 
                          'Degron_consensusID'] = "CBL_unknown"

Probably misannotated degrons in CBL_MET:

In [17]:
elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q04912') & 
                          (elm_manual_degrons_df['Start_amplified'] == 901) &
                          (elm_manual_degrons_df['End_amplified'] == 920) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL_MET')), 
                          'Degron_consensusID'] = "CBL_unknown"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q04912') & 
                          (elm_manual_degrons_df['Start_amplified'] == 958) &
                          (elm_manual_degrons_df['End_amplified'] == 977) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL_MET')), 
                          'Degron_consensusID'] = "CBL_unknown"

elm_manual_degrons_df.loc[((elm_manual_degrons_df['Substrate'] == 'Q14247') & 
                          (elm_manual_degrons_df['Start_amplified'] == 406) &
                          (elm_manual_degrons_df['End_amplified'] == 425) &
                          (elm_manual_degrons_df['Degron_consensusID'] == 'CBL_MET')), 
                          'Degron_consensusID'] = "CBL_unknown"
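
The reassignments above all follow the same pattern (match substrate, amplified coordinates and current label, then overwrite the label), so they could be driven from a list of tuples. The sketch below is one possible refactoring, not the notebook's own code. The helper name `reassign_consensus` is hypothetical, and the toy rows reuse three coordinates from the cells above.

```python
import pandas as pd


def reassign_consensus(df, reassignments, new_id, current_id="CBL"):
    """Set Degron_consensusID to new_id for rows matching each reassignment.

    `reassignments` is a list of (substrate, start_amplified, end_amplified)
    tuples; only rows currently labelled `current_id` are changed.
    Returns the number of rows modified.
    """
    changed = 0
    for substrate, start, end in reassignments:
        mask = (
            (df["Substrate"] == substrate)
            & (df["Start_amplified"] == start)
            & (df["End_amplified"] == end)
            & (df["Degron_consensusID"] == current_id)
        )
        df.loc[mask, "Degron_consensusID"] = new_id
        changed += int(mask.sum())
    return changed


toy_df = pd.DataFrame(
    {
        "Substrate": ["P51805", "P00533", "P07333"],
        "Start_amplified": [1289.0, 1059.0, 953.0],
        "End_amplified": [1308.0, 1078.0, 972.0],
        "Degron_consensusID": ["CBL", "CBL", "CBL"],
    }
)
n_met = reassign_consensus(toy_df, [("P51805", 1289, 1308)], "CBL_MET")
n_ptk = reassign_consensus(toy_df, [("P00533", 1059, 1078)], "CBL_PTK")
n_unknown = reassign_consensus(toy_df, [("P07333", 953, 972)], "CBL_unknown")
print(toy_df["Degron_consensusID"].tolist())
```

Returning the number of modified rows makes it easy to spot a reassignment that matched nothing, for example because of a mistyped coordinate.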

In [18]:
# Save
elm_manual_degrons_df.to_csv(elm_manual_E3_consensusID_path, sep = "\t", index = False)

## 4. Generate ELM-Manual true degron sets for consensus ID motifs

These sets are used later to derive the positivity threshold of each PWM.

In [6]:
# Load data again (for this part of the nb to be independent of the previous)
elm_manual_degrons_df = pd.read_csv(elm_manual_E3_consensusID_path, sep = "\t")
elm_manual_degrons_df

Unnamed: 0,Degron,Substrate,Start,End,Database,Sequence,Sequence_amplified,Start_amplified,End_amplified,E3_ligase,Degron_consensusID
0,DEG_SCF_FBXO31_1,P30281,286,292,ELM,DVTAIHL,SSSQGPSQTSTPTDVTAIHL,273.0,292.0,Q5XUX0,FBXO31
1,DEG_SCF_FBXO31_1,P30279,283,289,ELM,DVRDIDL,KSEDELDQASTPTDVRDIDL,270.0,289.0,Q5XUX0,FBXO31
2,DEG_SCF_FBXO31_1,P24385,289,295,ELM,DVRDVDI,EEEEEVDLACTPTDVRDVDI,276.0,295.0,Q5XUX0,FBXO31
3,DEG_COP1_1,P14921,273,283,ELM,SFNSLQRVPSY,WSSQSSFNSLQRVPSYDSFD,268.0,287.0,Q8NHY2,COP1
4,DEG_COP1_1,P15036,301,311,ELM,SLLDVQRVPSF,WNSQSSLLDVQRVPSFESFE,296.0,315.0,Q8NHY2,COP1
...,...,...,...,...,...,...,...,...,...,...,...
225,DEG_CRL4_CDT2_1,Q9NQR1,178,190,Manual,PPKTPPSSCDSTN,EAAEPPKTPPSSCDSTNAAI,174.0,193.0,Q9NZJ0,DTL
226,CBL_MET,Q9UIW2,441,443,Manual,DYR,DGLTAVAAYDYRGRTVVFAG,432.0,451.0,P22681,CBL_MET
227,DEG_APCC_TPR_1,Q9UM11,491,493,Manual,LFT,SKTRSTKVKWESVSVLNLFT,474.0,493.0,,DEG_APCC_TPR_1
228,CBL_MET,Q9UQQ2,88,90,Manual,DYR,VRDGRAPGRDYRDTGRGPPA,79.0,98.0,P22681,CBL_MET


In [9]:
# Generate a dataframe for each motif (consensus ID) gathering all of their degrons (positive set)

grouped = elm_manual_degrons_df.groupby('Degron_consensusID')

for E3, substrates in grouped:
    
    E3_subs = substrates[["Substrate", "Sequence", "Start", "End"]].copy()
    E3_subs.rename(columns = {"Substrate": "gene", "Sequence": "sequence", "Start": "start", "End": "end"}, 
                  inplace = True)
    E3_subs.reset_index(drop = True, inplace = True)
    E3_subs.to_csv(elm_manual_true_degrons_path+E3+".tsv", index = False, sep = "\t")