# ELM-Manual motifs. From degron sequences to alignments to Position Weight Matrices (PWMs)

This notebook contains the code that generates the ELM-Manual PWMs used in the downstream analyses.

The process of generating PWMs from ELM-Manual motifs consists of:
- Retrieving each motif's degron sequences in FASTA format from the ELM-Manual database.
- Performing a multiple sequence alignment (MSA) with the EBI Clustal Omega tool.
- Manually curating the alignments in AliView so that gaps are removed, degron sequences are preserved (we use extended degron sequences) and conserved positions are maximized (this step is done outside this notebook).
- Generating PWMs from these alignments.

## Import libraries

In [9]:
# automatically reload changes made to the scripts
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
import os
import sys
import pandas as pd
import numpy as np
import logomaker
import xmltramp2

## my modules ##
sys.path.append("../scripts/Utils/")    # modules folder
from fasta_utils import generate_fasta_from_df

## Define variables and paths


In [1]:
base = "../../"

data = "data/"

elm_manual_E3_consensusID_path = os.path.join(base, data, "elm_manual/elm_manual_instances_E3ligases_consensusID.tsv")
elm_manual_true_degrons_fasta_path = os.path.join(base, data, "elm_manual/positive_set_fasta/")
elm_manual_align_clustal_path = os.path.join(base, data, "elm_manual/alignments/clustal")
elm_manual_align_curated_path = os.path.join(base, data, "elm_manual/alignments/curated")
weight_m_path = os.path.join(base, data, "elm_manual/motif_matrices/PWM/")                 
prob_m_path = os.path.join(base, data, "elm_manual/motif_matrices/PPM/")       

In [12]:
# variables

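# aa background probabilities (sorted by aa so they match the sorted matrix columns)
# NOTE: data_path, external_data_path and aa_bg_file are not defined in the cells above;
# they must be set before running this cell.
bg_matrix = pd.read_table(data_path+external_data_path+aa_bg_file).sort_values(by = "Aminoacid")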

aa_probs = bg_matrix["Frequency"].to_numpy()            # array with aa background frequencies
aa = bg_matrix["Aminoacid"].to_numpy()                  # array with aa names
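
Because the background vector is later passed as a plain array to the weight transformation, its order has to match the alphabetically sorted amino-acid columns of the count matrix. Below is a minimal sanity check. It assumes one-letter codes in the `Aminoacid` column, and it uses a made-up uniform background in place of the real file. The function name `check_background` and the tolerance are illustrative, not part of the notebook.

```python
import math

# The 20 standard residues in alphabetical order (one-letter codes)
AMINO_ACIDS = sorted("ACDEFGHIKLMNPQRSTVWY")


def check_background(names, freqs, tol=1e-3):
    """Return True if names are the 20 sorted residues and freqs are positive and sum to ~1."""
    same_order = list(names) == AMINO_ACIDS
    sums_to_one = math.isclose(sum(freqs), 1.0, abs_tol=tol)
    all_positive = all(f > 0 for f in freqs)
    return same_order and sums_to_one and all_positive


# Hypothetical uniform background, for illustration only
example_freqs = [1 / 20] * 20
print(check_background(AMINO_ACIDS, example_freqs))
```

In the notebook, the equivalent call would be `check_background(aa, aa_probs)`.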

## Define functions

In [13]:
def run_clustal(dir_fasta, dir_output, name_file):
    """
    Given the path to a FASTA file (dir_fasta), run clustalo.py and save the result in dir_output.
    The client appends the format suffix, e.g. <name_file>.aln-clustal_num.clustal_num.
    Only works in a Jupyter notebook (it relies on the `!` shell escape).
    """
    output_file = os.path.join(dir_output, name_file)
    command = f"python ../scripts/external/clustalo.py --email raquel.blanco@irbbarcelona.org --stype protein --sequence {dir_fasta} --outfile {output_file} --outformat aln-clustal_num"
    ! $command
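
Outside Jupyter the `!` shell escape is not available. The following sketch shows one way to issue the same call with `subprocess`, reusing only the flags that appear in `run_clustal`. The function names, the `email` parameter and the split into a "build" and a "run" step are assumptions made for illustration.

```python
import os
import subprocess
import sys


def build_clustalo_command(fasta_file, dir_output, name_file, email,
                           script="../scripts/external/clustalo.py"):
    """Build the argument list for the clustalo.py client (same flags as run_clustal)."""
    output_file = os.path.join(dir_output, name_file)
    return [
        sys.executable, script,          # run the client with the current interpreter
        "--email", email,
        "--stype", "protein",
        "--sequence", fasta_file,
        "--outfile", output_file,
        "--outformat", "aln-clustal_num",
    ]


def run_clustal_subprocess(fasta_file, dir_output, name_file, email):
    """Run the client; raises CalledProcessError if it exits with a non-zero status."""
    command = build_clustalo_command(fasta_file, dir_output, name_file, email)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.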

## 1. Retrieve each motif's degron sequences in FASTA format from the ELM-Manual table

Specifically, the extended sequences (`Sequence_amplified` column).

In [12]:
# Load data

elm_manual_degrons_df = pd.read_csv(elm_manual_E3_consensusID_path, sep = "\t")
elm_manual_degrons_df

| Unnamed: 0 | Degron | Substrate | Start | End | Database | Sequence | Sequence_amplified | Start_amplified | End_amplified | E3_ligase | Degron_consensusID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DEG_SCF_FBXO31_1 | P30281 | 286 | 292 | ELM | DVTAIHL | SSSQGPSQTSTPTDVTAIHL | 273.0 | 292.0 | Q5XUX0 | FBXO31 |
| 1 | DEG_SCF_FBXO31_1 | P30279 | 283 | 289 | ELM | DVRDIDL | KSEDELDQASTPTDVRDIDL | 270.0 | 289.0 | Q5XUX0 | FBXO31 |
| 2 | DEG_SCF_FBXO31_1 | P24385 | 289 | 295 | ELM | DVRDVDI | EEEEEVDLACTPTDVRDVDI | 276.0 | 295.0 | Q5XUX0 | FBXO31 |
| 3 | DEG_COP1_1 | P14921 | 273 | 283 | ELM | SFNSLQRVPSY | WSSQSSFNSLQRVPSYDSFD | 268.0 | 287.0 | Q8NHY2 | COP1 |
| 4 | DEG_COP1_1 | P15036 | 301 | 311 | ELM | SLLDVQRVPSF | WNSQSSLLDVQRVPSFESFE | 296.0 | 315.0 | Q8NHY2 | COP1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 225 | DEG_CRL4_CDT2_1 | Q9NQR1 | 178 | 190 | Manual | PPKTPPSSCDSTN | EAAEPPKTPPSSCDSTNAAI | 174.0 | 193.0 | Q9NZJ0 | DTL |
| 226 | CBL_MET | Q9UIW2 | 441 | 443 | Manual | DYR | DGLTAVAAYDYRGRTVVFAG | 432.0 | 451.0 | P22681 | CBL_MET |
| 227 | DEG_APCC_TPR_1 | Q9UM11 | 491 | 493 | Manual | LFT | SKTRSTKVKWESVSVLNLFT | 474.0 | 493.0 |  | DEG_APCC_TPR_1 |
| 228 | CBL_MET | Q9UQQ2 | 88 | 90 | Manual | DYR | VRDGRAPGRDYRDTGRGPPA | 79.0 | 98.0 | P22681 | CBL_MET |


Generate fasta files for extended degron sequences

In [13]:
# Unique degrons consensus IDs (remove NaN)

motifs = elm_manual_degrons_df.Degron_consensusID.dropna().unique()   # dropna to avoid NAs

# Generate a fasta file per motif
counter = 0

for motif in motifs:
    
    counter += 1
    print(f'{counter}: {motif}')
    
    subset = elm_manual_degrons_df[elm_manual_degrons_df.Degron_consensusID == motif].copy() # motif's subset
    subset.reset_index(inplace = True, drop = True)
    
    generate_fasta_from_df(elm_manual_true_degrons_fasta_path, subset, "Substrate", "Start_amplified",
                           "End_amplified", "Sequence_amplified")


1: FBXO31
2: COP1
3: SPOP
4: MDM2
5: DEG_Kelch_KLHL3_1
6: KEAP1
7: APC_KENBOX
8: APC_DBOX
9: DEG_APCC_TPR_1
10: BTRC
11: DEG_SCF_SKP2-CKS1_1
12: FBXW7
13: DTL
14: VHL
15: SIAH1
16: Other
17: APC_ABBA
18: CBLL1
19: CBL_unknown
20: CBL_PTK
21: DEG_Kelch_actinfilin_1
22: CBL_MET
23: DEG_Nend_UBRbox_4
24: ITCH
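
The helper `generate_fasta_from_df` lives in the author's own `fasta_utils` module and its implementation is not shown here. As a rough idea of what such a step involves, the sketch below writes one FASTA entry per row. The function name `write_fasta` and the header layout (`>Substrate_Start_End`) are assumptions; the real helper may name files and headers differently. The two records are the first two FBXO31 rows from the table above.

```python
import os
import tempfile


def write_fasta(records, fasta_path):
    """Write one FASTA entry per record (dicts keyed by the notebook's column names).

    The header layout (>Substrate_Start_End) is an assumption made for this sketch.
    """
    with open(fasta_path, "w") as handle:
        for rec in records:
            start = int(rec["Start_amplified"])   # positions are stored as floats in the table
            end = int(rec["End_amplified"])
            handle.write(f">{rec['Substrate']}_{start}_{end}\n")
            handle.write(f"{rec['Sequence_amplified']}\n")


# First two FBXO31 rows from the ELM-Manual table
records = [
    {"Substrate": "P30281", "Start_amplified": 273.0, "End_amplified": 292.0,
     "Sequence_amplified": "SSSQGPSQTSTPTDVTAIHL"},
    {"Substrate": "P30279", "Start_amplified": 270.0, "End_amplified": 289.0,
     "Sequence_amplified": "KSEDELDQASTPTDVRDIDL"},
]

out_path = os.path.join(tempfile.mkdtemp(), "FBXO31.fasta")
write_fasta(records, out_path)
```

With a DataFrame, `subset.to_dict("records")` yields records in this shape.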


## 2. Perform MSA using Clustal Omega

The alignment is performed with the EBI `clustalo.py` client script, wrapped in a small helper function (`run_clustal`) that retrieves only the alignment file.

In [27]:
for motif in motifs:
        
    print(motif)
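    # assumes the previous step wrote one <motif>.fasta file per motif in this folder
    fasta_path = os.path.join(elm_manual_true_degrons_fasta_path, motif + ".fasta")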
    run_clustal(fasta_path, elm_manual_align_clustal_path, motif)
    print()
    
# Note: the output below is illustrative; not all motifs were run in this notebook.

DEG_Kelch_KLHL3_1
JobId: clustalo-R20220116-180143-0799-31972980-p2m
FINISHED
Creating result file: ../data/elm_manual/elm_manual_alignments/elm_manual_NO_ubinet/DEG_Kelch_KLHL3_1.aln-clustal_num.clustal_num

APC_KENBOX
JobId: clustalo-R20220116-180151-0261-71819121-p2m
FINISHED
Creating result file: ../data/elm_manual/elm_manual_alignments/elm_manual_NO_ubinet/APC_KENBOX.aln-clustal_num.clustal_num

APC_DBOX
JobId: clustalo-R20220116-180159-0011-11353832-p1m
FINISHED
Creating result file: ../data/elm_manual/elm_manual_alignments/elm_manual_NO_ubinet/APC_DBOX.aln-clustal_num.clustal_num

DEG_APCC_TPR_1
JobId: clustalo-R20220116-180206-0353-71728104-p2m
FINISHED
Creating result file: ../data/elm_manual/elm_manual_alignments/elm_manual_NO_ubinet/DEG_APCC_TPR_1.aln-clustal_num.clustal_num

Other
JobId: clustalo-R20220116-180213-0477-1103967-p1m
FINISHED
Creating result file: ../data/elm_manual/elm_manual_alignments/elm_manual_NO_ubinet/Other.aln-clustal_num.clustal_num

APC_ABBA
JobId: cl

## 3. Generate PPMs and PWMs from the alignments 

PPMs and PWMs are generated from the manually curated alignments.

In [1]:
E3s = np.unique(np.array([E3.split(".")[0] for E3 in os.listdir(elm_manual_align_curated_path)]))

for E3 in E3s:
    
    # Gather the aligned sequences in a list (logomaker requirement)
    seqs = []
    # os.path.join is needed here: the curated path has no trailing slash
    with open(os.path.join(elm_manual_align_curated_path, E3 + ".fasta"), 'r') as f:
        for line in f:
            if line[0] != ">":
                seqs.append(line.strip())
    
    # Generate the count matrix and add columns for amino acids absent from the alignment
    # (needed so that the columns match the 20 background frequencies in the weight step)
    count_m = logomaker.alignment_to_matrix(seqs, to_type = "counts")
    
    if count_m.shape[1] != 20:
        
        diff_aa = set(aa) - set(count_m.columns)
        for d in diff_aa:
            count_m[d] = 0.0
            
        count_m.sort_index(axis = 1, inplace = True)
    
    # Generate probability matrix 
    prob_m = logomaker.transform_matrix(count_m, from_type = 'counts', 
                                          to_type = 'probability')
    prob_m.to_csv(prob_m_path+E3+".tsv", 
                    sep = "\t", header = True, index = False)
    
    # Generate weight matrix considering bg aa probabilities
    weight_m = logomaker.transform_matrix(count_m, from_type = 'counts', 
                                          to_type = 'weight', background = aa_probs)
    weight_m.to_csv(weight_m_path+E3+".tsv", 
                    sep = "\t", header = True, index = False)
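
To make the two transformations concrete, here is a library-free sketch of the usual definitions: a PPM as pseudocount-smoothed column frequencies, and a PWM as the log2 odds of those frequencies against the background. The pseudocount of 1, the uniform background and the toy alignment (the three FBXO31 core degrons from the table in section 1) are assumptions for illustration. The function names are hypothetical, and logomaker's internals may differ in detail.

```python
import math
from collections import Counter

AMINO_ACIDS = sorted("ACDEFGHIKLMNPQRSTVWY")  # fixed 20-letter alphabet, alphabetical


def count_matrix(seqs):
    """Per-position residue counts over the 20 amino acids; other characters (e.g. gaps) are ignored."""
    rows = []
    for pos in range(len(seqs[0])):
        column = Counter(seq[pos] for seq in seqs if seq[pos] in AMINO_ACIDS)
        rows.append([column.get(a, 0) for a in AMINO_ACIDS])
    return rows


def probability_matrix(counts, pseudocount=1.0):
    """Smooth each row with a pseudocount and normalise it to sum to 1."""
    rows = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        rows.append([(c + pseudocount) / total for c in row])
    return rows


def weight_matrix(probs, background):
    """Log2 odds of each position-specific probability against the background."""
    return [[math.log2(p / q) for p, q in zip(row, background)] for row in probs]


seqs = ["DVTAIHL", "DVRDIDL", "DVRDVDI"]   # toy alignment, equal-length sequences
background = [1 / 20] * 20                 # hypothetical uniform background

counts = count_matrix(seqs)
ppm = probability_matrix(counts)
pwm = weight_matrix(ppm, background)

d = AMINO_ACIDS.index("D")
print(round(ppm[0][d], 3), round(pwm[0][d], 3))
```

Counting over a fixed alphabet gives every matrix the same 20 columns, which is what the notebook achieves by adding zero-filled columns for missing amino acids.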
    