# UbiNet motifs. From Position Probability Matrices (PPMs) to Position Weight Matrices (PWMs) 

This notebook contains the code to extract the degron motif Position Probability Matrices (PPMs) from the [UbiNet 2.0 database](https://awi.cuhk.edu.cn/~ubinet/index.php) and transform them into Position Weight Matrices (PWMs).

## Import libraries

In [3]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm           
from bs4 import BeautifulSoup      # html parsing library
import re                          # regular expressions library
import requests                    # allows sending http requests
import logomaker                   # for probability matrix transformation


## Define variables and paths

In [2]:
# paths
base = "../"

data = "data/"

prob_m_path = os.path.join(base, data, "ubinet/motif_matrices/PPM/")                      
weight_m_path = os.path.join(base, data, "ubinet/motif_matrices/PWM/")                 
aa_bg_path = os.path.join(base, data, "external/aminoacid_frequency.txt")  
html_ubinet_path = os.path.join(base, data, "external/ubinet/browseE3_ubinet.php")
url_motifs_path = os.path.join(base, data, "ubinet/motif_links_ubinet.txt")

In [3]:
# variables

# aa background probabilities (sorted by aa)
bg_matrix = pd.read_table(aa_bg_path).sort_values(by = "Aminoacid")

aa_probs = bg_matrix["Frequency"].to_numpy()            # array with aa background frequencies
aa = bg_matrix["Aminoacid"].to_numpy()                  # array with aa names

## Define functions

In [4]:
# Test OK!
def retrieve_motif_links(html_path, links_path, regex, save = True):
    """
    Retrieves motif hyperlinks matching a regular expression from an HTML file and saves
    these links in a text file, printing the total number of links
    
    Parameters
    ----------
    html_path: str
                Path to the HTML file
    links_path: str
                Path to the text file where all hyperlinks will be written
    regex: str
                Regular expression matching every link that contains a motif
    save: boolean (default: True)
                If True, the links text file is generated. If False, only prints the number
                of retrieved links
                
    Returns
    -------
    None
    
    """
    
    # Read website HTML file and apply HTML parser to it
    with open(html_path) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    
    # Collect every hyperlink whose href matches the regex
    links = soup.find_all(href = re.compile(regex))
    counter = len(links)        # number of retrieved motif links

    if save:
        # Save the hyperlinks in a text file, one per line
        with open(links_path, "w") as fp:
            for link in links:
                # get('href') returns only the link target
                fp.write(link.get('href')[1:]+"\n")        # Note: [1:] drops the dot at the beginning
    
    print(f'{counter} retrieved links')
        
    return None
        
    

Note: the `<a>` tag defines a hyperlink in HTML. Its most important attribute is `href`, which indicates the link's destination.
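
As a minimal illustration of the `href` filtering idea, the sketch below collects matching link targets using only the Python standard library (`html.parser`) so that it runs without extra dependencies. It is not the method used in this notebook, which relies on BeautifulSoup. The `HrefCollector` class, the HTML fragment and the `/motifs/` pattern are hypothetical and were not taken from UbiNet.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href values of <a> tags that match a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only <a> tags define hyperlinks
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep the link only if it has an href that matches the pattern
        if href and self.pattern.search(href):
            self.links.append(href)


# Hypothetical HTML fragment (not copied from UbiNet)
sample_html = """
<a href="./data/motifs/P12345.html">P12345</a>
<a href="./about.html">About</a>
<a name="top">anchor without href</a>
"""

collector = HrefCollector(r"/motifs/")
collector.feed(sample_html)
motif_links = collector.links
print(motif_links)
```

Only the first link is kept: the second does not match the pattern and the third has no `href` attribute.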

In [5]:
# Tests OK!
def retrieve_prob_matrices(links_path, prob_m_path, urlbase, regex, one_pm = True, save = True):
    """
    Retrieves motif probability matrices through HTTP requests to the motif hyperlinks 
    and saves each matrix in a separate file, whose name is the E3-ligase AC. 
    Also prints the number of retrieved matrices
    
    Parameters
    ----------
    links_path: str
                Path to the text file with all hyperlinks
    prob_m_path: str
                Path to the folder where the probability matrix files will be located
    urlbase: str
                Constant part of the URL used to access the website
    regex: str
                Regular expression to find the probability matrix in the HTML tree
    one_pm: boolean (default: True)
                If True, only the first probability matrix instance per HTML page is retrieved.
                If False, every probability matrix instance is retrieved
    save: boolean (default: True)
                If True, each probability matrix is saved in its own text file.
                If False, only prints the number of retrieved matrices
                
    Returns
    -------
    None
    """
    
    # Access each motif's url and find the first frequency matrix instance
    if one_pm:
        
        counter = 0                 # (optional) Monitor number of E3-ligases

        with open(links_path) as fp:
            for url in tqdm(fp):
                counter += 1
                E3_id = url.split("/")[3]   # E3-ligase AC position in the URL
                r = (requests.get(urlbase+url.strip())).text
                soup = BeautifulSoup(r, "html.parser")
                matrix = ((soup.find(value = re.compile(regex))).get("value")).split("=")[-1][3:]

                # Store each probability matrix independently
                if save:
                    # Use a separate name for the output handle so the links file handle (fp) is not shadowed
                    with open(prob_m_path+E3_id+".tsv", "w") as out_fp:
                        out_fp.write(matrix)
                        
    
    
    # Access each motif's url and find every frequency matrix instance
    else:
        
        counter = 0                       # (optional) Monitor number of E3-ligases
        
        with open(links_path) as fp:
            for url in tqdm(fp):
                E3_id = url.split("/")[3]   # E3-ligase AC position in the URL
                r = (requests.get(urlbase+url.strip())).text
                soup = BeautifulSoup(r, "html.parser")
                matrices = soup.find_all(value = re.compile(regex))
    
                # First matrix
                matrix_1 = str(matrices[0]).split("=")[-1][3:].split('"')[0]
                counter += 1
                if save:
                    # Write the first matrix string (matrix_1) to its own file
                    with open(prob_m_path+E3_id+".tsv", "w") as out_fp:
                        out_fp.write(matrix_1)
    
                # Remaining matrices. A second instance is always present: when the page has a
                # single matrix, the second instance is a replicate of the first
                for i, matrix in enumerate(matrices[1:]):
                    matrix_n = str(matrix).split("=")[-1][3:].split('"')[0]
                    
                        
                    # An additional, different matrix exists
                    if matrix_1 != matrix_n:
                        counter += 1
                        if save:
                            # Write the parsed matrix string (matrix_n), not the HTML tag object
                            with open(prob_m_path+E3_id+"_"+str(i+2)+".tsv", "w") as out_fp:
                                out_fp.write(matrix_n)
    
    
    print(f'{counter} retrieved probability matrices')
                        
    
        
    return None
            
            
    

Note on probability matrix parsing: 
- **For one matrix retrieval**: the parsing line finds the first matrix instance in the tree (the first element whose `value` attribute matches the regex) and gets its value, which contains the probability matrix itself. We split on `=` and index with `[-1]` to keep the end of the string. The beginning of that string still contains an extra number that is not part of the matrix, so we keep the string from index 3 onwards (`[3:]`). In some cases an additional blank line remains at the beginning of the matrix, but it does not seem to cause problems when the file is later loaded as a Pandas dataframe, so index 3 is preferred over index 4.
- **For more than one matrix retrieval**: here the parsing line finds every matrix instance in the tree that matches the regex. The search returns a list of elements, and each matrix string can be isolated by splitting on `"`. I have checked that a page with a single matrix always contains a replica of it and that, apparently, there are at most two different matrices per page, although the code handles any number.
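
The fixed offset `[3:]` depends on the exact text that precedes the matrix. As a sketch of a different technique (not the one used in this notebook), the rows could instead be selected by their shape: keep only the lines that consist of exactly the expected number of numeric fields. The function name `extract_matrix_rows`, the header text and the 4-column toy matrix below are hypothetical; the real UbiNet value string may be laid out differently.

```python
def extract_matrix_rows(raw_value, n_columns=20):
    """Return the numeric rows found in raw_value as lists of floats.

    A line is kept only if it has exactly n_columns whitespace-separated
    fields and every field parses as a float. Header lines and blank
    lines are skipped.
    """
    rows = []
    for line in raw_value.splitlines():
        fields = line.split()
        if len(fields) != n_columns:
            continue
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            # Right number of fields, but not all numeric: not a matrix row
            continue
    return rows


# Hypothetical value string with a header line and a leading blank line
raw = "some header: width= 2\n\n0.1  0.2  0.3  0.4\n0.25  0.25  0.25  0.25\n"
rows = extract_matrix_rows(raw, n_columns=4)
print(rows)
```

With real matrices the default `n_columns=20` would correspond to the 20 amino acids. This approach trades the compactness of the one-line slice for independence from the header length.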

In [13]:
# Tests OK!
def from_prob_to_weight_matrix(prob_m_path, weight_m_path, aa_bg):
    """
    Transforms every probability matrix in a folder into a weight matrix according to the
    provided amino acid background probabilities
    
    Parameters
    ----------
    prob_m_path: str
                Path to the folder where the probability matrices are located. Each file has to be tab-separated.
    weight_m_path: str
                Path to the folder where the weight matrices will be located. Each file is written tab-separated.
    aa_bg: numpy.ndarray
                Amino acid background probabilities, sorted in the same order as the matrix columns
                
                
    Returns
    -------
    None
    """
    
    # Retrieve probability matrices files names 
    E3_ligases = os.listdir(prob_m_path)
    
    # Transform every probability matrix into a weight matrix
    
    counter = 0                                    # (optional) Monitor number of E3-ligases
    
    for E3_ligase in E3_ligases:
        
        counter += 1
        
        prob_m = pd.read_csv(prob_m_path+E3_ligase, sep = "\t")
        weight_m = logomaker.transform_matrix(prob_m, from_type = 'probability', to_type = 'weight', background = aa_bg) 
        # following line commented to avoid file saving (testing run for modifications)
        #weight_m.to_csv(weight_m_path+E3_ligase, sep = "\t", header = True, index = False) # index = False to avoid keeping first column (motif positions)
                                                                                           # header = True to maintain aa letter names
        
    
    print(f'{counter} transformed probability matrices to weight matrices')
    
    return None
    
    


## Data generation

### 1. Fetch URL of PPM-containing motifs

First, the HTML file of the 'Browse E3 ligases' section in UbiNet is saved using the browser's inspector. Then, we find the motif links that contain a PPM:

In [7]:
print("Motifs links")
print("----------------")
regex = "./data/UbiNet2.0_Motifs_1129/[^NoMotif.html]"

retrieve_motif_links(html_ubinet_path, url_motifs_path, regex,
                    save = True) 

Motifs links
----------------
104 retrieved links
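
A caveat on the pattern above: `[^NoMotif.html]` is a negated character class, so it rejects any link whose next character is one of `N`, `o`, `M`, `t`, `i`, `f`, `.`, `h`, `m` or `l`, rather than rejecting the literal file name `NoMotif.html`. The sketch below compares it with a negative lookahead on hypothetical link targets (the file names are invented for illustration). Whether the difference affects the real UbiNet page has not been checked here, and the count of 104 links was obtained with the original pattern.

```python
import re

prefix = "./data/UbiNet2.0_Motifs_1129/"

# Hypothetical link targets, invented for illustration
hrefs = [
    prefix + "P12345.html",
    prefix + "NoMotif.html",
    prefix + "M0ABC1.html",   # starts with a character listed in the class
]

# Pattern used in the cell above: negated character class
char_class = re.compile(r"./data/UbiNet2.0_Motifs_1129/[^NoMotif.html]")

# Alternative: negative lookahead that excludes only the literal "NoMotif.html"
lookahead = re.compile(r"\./data/UbiNet2\.0_Motifs_1129/(?!NoMotif\.html)")

kept_by_class = [h for h in hrefs if char_class.search(h)]
kept_by_lookahead = [h for h in hrefs if lookahead.search(h)]

print(kept_by_class)
print(kept_by_lookahead)
```

In this toy list the character class drops the identifier starting with `M` along with `NoMotif.html`, while the lookahead drops only `NoMotif.html`.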


### 2. Fetch PPMs

Only one probability matrix per E3-ligase is retrieved, although in some cases more than one is available. The retrieved matrix is intended to be the one with the highest score. The code does not check scores explicitly, because the first matrix seems to always be the one with the highest score.

In [10]:
regex = "letter-probability matrix"
urlbase = "https://awi.cuhk.edu.cn/~ubinet"

retrieve_prob_matrices(url_motifs_path, prob_m_path, urlbase, regex,
                       save = True)

104it [04:35,  2.65s/it]

104 retrieved probability matrices





To simplify later automated processing, we read back all the probability matrices and store them again as tab-delimited files with the amino acid names as column headers.

In [7]:
E3_ligases = os.listdir(prob_m_path)

for E3_ligase in E3_ligases:
    
    prob_m = pd.read_csv(prob_m_path+E3_ligase, sep = "  ", header = None, names = aa, engine = 'python')
    prob_m.to_csv(prob_m_path+E3_ligase, sep = "\t", header = True, index = False)

### 3. Probability matrices transformation to weight matrices

The transformation is performed with the `logomaker.transform_matrix` function from the logomaker library.

Requirement: the amino acid background probabilities must be sorted alphabetically by amino acid so that they match the column order of the matrices.
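
For intuition, a weight matrix entry is commonly defined as the log-odds $\log_2(p/q)$ between the motif probability $p$ and the background probability $q$ of the same amino acid at that position. The sketch below computes this for one matrix row in plain Python. It is not logomaker's implementation: the helper name `ppm_row_to_pwm_row`, the small constant `eps` used to avoid taking the logarithm of zero, and the 4-letter toy alphabet are assumptions made for illustration.

```python
import math


def ppm_row_to_pwm_row(probs, background, eps=1e-12):
    """Convert one row of probabilities into log2-odds weights.

    probs and background must list the residues in the same order.
    eps is an assumed small constant that keeps log2 finite for zeros.
    """
    if len(probs) != len(background):
        raise ValueError("probs and background must have the same length")
    return [math.log2((p + eps) / (q + eps)) for p, q in zip(probs, background)]


# Toy example with a 4-letter alphabet and a uniform background
background = [0.25, 0.25, 0.25, 0.25]
ppm_row = [0.5, 0.25, 0.125, 0.125]

pwm_row = ppm_row_to_pwm_row(ppm_row, background)
print([round(w, 3) for w in pwm_row])
```

Residues that are enriched relative to the background get positive weights, residues at background level get weights near zero, and depleted residues get negative weights. This is also why the background order must match the matrix columns: a shifted order would pair each probability with the wrong background value.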

In [14]:
from_prob_to_weight_matrix(prob_m_path, weight_m_path, aa_probs)

104 transformed probability matrices to weight matrices
