### Search for Potential Cis-Regulation in lncRNA-mRNA Pairs

The following script integrates the output of RNAContigProfiler with a table derived from the differential expression (DE) analysis that contains only the genes identified as DE. In this study, an lncRNA is considered exclusive to a tissue when it is consistently overexpressed in that tissue relative to all other tissues. We therefore look for neighbouring mRNAs detected by RNAContigProfiler that show the same expression pattern. The files needed to run the notebooks can be found in the notebooks_files folder.

*Executed with Python v3.11.4*

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yahelGB/M.Sc_thesis/blob/main/notebooks/cis_regulation_module.ipynb)

In [1]:
# import libraries
import pandas as pd

In [2]:
# define characteristic expression patterns for each tissue.
# each tissue key (e.g., "hepatopancreas") maps to a list of expression patterns 
# that are specific to that tissue in this study. These patterns are used to filter 
# mRNAs based on their expression behaviour relative to other tissues, helping 
# to identify potential cis-regulatory interactions with neighbouring lncRNAs
tissue_patterns = {
    "hepatopancreas": ["Up_hep_Down_gills", "Up_hep_Down_hemocytes", "Down_muscle_Up_hep"],
    "gills": ["Down_hemocytes_Up_gills", "Down_hep_Up_gills", "Down_muscle_Up_gills"],
    "hemocytes": ["Up_hemocytes_Down_gills", "Down_hep_Up_hemocytes", "Down_muscle_Up_hemocytes"],
    "muscle": ["Up_muscle_Down_gills", "Up_muscle_Down_hemocytes", "Up_muscle_Down_hep"]
}
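
To illustrate how these labels are used, the sketch below checks which tissue a single pattern string is assigned to by substring matching, the same test that `load_expression_patterns` applies later in the notebook. The helper name `tissues_for_pattern` and the example strings are hypothetical and serve only as an illustration.

```python
# same dictionary as in the cell above, repeated so the example runs on its own
tissue_patterns = {
    "hepatopancreas": ["Up_hep_Down_gills", "Up_hep_Down_hemocytes", "Down_muscle_Up_hep"],
    "gills": ["Down_hemocytes_Up_gills", "Down_hep_Up_gills", "Down_muscle_Up_gills"],
    "hemocytes": ["Up_hemocytes_Down_gills", "Down_hep_Up_hemocytes", "Down_muscle_Up_hemocytes"],
    "muscle": ["Up_muscle_Down_gills", "Up_muscle_Down_hemocytes", "Up_muscle_Down_hep"],
}

def tissues_for_pattern(expression_pattern):
    """return the tissues whose characteristic patterns occur in the given string"""
    return [
        tissue
        for tissue, patterns in tissue_patterns.items()
        if any(pattern in expression_pattern for pattern in patterns)
    ]

print(tissues_for_pattern("Up_hep_Down_gills"))     # ['hepatopancreas']
print(tissues_for_pattern("Down_muscle_Up_gills"))  # ['gills']
print(tissues_for_pattern("not_a_pattern"))         # []
```

Because the test is a substring match, a longer string that merely contains one of the labels is also assigned to that tissue.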

In [3]:
# path to the combined_intersect.txt file (mRNA expression patterns)
expression_patterns_file = "/Users/yahelgb/Desktop/Project_Lvannamei/vscode/combined_intersect.txt"

In [4]:
# define the paths of the files generated by RNAContigProfiler 
tissues = {
    "hepatopancreas": "hepatopancreas_exclusive_lncRNAs_rnacontigprofiler_output.txt",
    "gills": "gills_exclusive_lncRNAs_rnacontigprofiler_output.txt",
    "hemocytes": "hemocytes_exclusive_lncRNAs_rnacontigprofiler_output.txt",
    "muscle": "muscle_exclusive_lncRNAs_rnacontigprofiler_output.txt"
}

In [5]:
def load_expression_patterns(file_path, tissue):
    """
    loads mRNA expression patterns from a file and keeps only the
    patterns that are specific to a given tissue (see tissue_patterns)

    parameters:
    - file_path: path to the combined_intersect.txt file
    - tissue: name of the tissue for which patterns are sought

    returns:
    - a dictionary mapping each mRNA ID to the list of its matching expression patterns
    """
    expression_patterns = {}
    with open(file_path, "r") as file:
        for line in file:
            cols = line.strip().split("\t")
            gene_id = cols[0]
            expression_pattern = cols[-1]
            
            # keep the entry only if its expression pattern matches one of the tissue's patterns
            if any(pattern in expression_pattern for pattern in tissue_patterns[tissue]):
                if gene_id not in expression_patterns:
                    expression_patterns[gene_id] = []
                expression_patterns[gene_id].append(expression_pattern)
    return expression_patterns
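
As a quick check of what this function returns, the sketch below runs a condensed copy of it on a small temporary file. The gene IDs and rows are made up, and the two-column layout (gene ID first, expression pattern last) is an assumption; the real combined_intersect.txt may contain further columns in between.

```python
import os
import tempfile

# patterns for one tissue only, copied from tissue_patterns
tissue_patterns = {
    "gills": ["Down_hemocytes_Up_gills", "Down_hep_Up_gills", "Down_muscle_Up_gills"],
}

def load_expression_patterns(file_path, tissue):
    """condensed copy of the function above, so the example is self-contained"""
    expression_patterns = {}
    with open(file_path, "r") as file:
        for line in file:
            cols = line.strip().split("\t")
            gene_id, expression_pattern = cols[0], cols[-1]
            if any(p in expression_pattern for p in tissue_patterns[tissue]):
                expression_patterns.setdefault(gene_id, []).append(expression_pattern)
    return expression_patterns

# made-up rows: gene ID in the first column, expression pattern in the last
toy_rows = [
    "gene_A\tDown_hep_Up_gills",
    "gene_A\tDown_muscle_Up_gills",
    "gene_B\tUp_muscle_Down_gills",
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(toy_rows) + "\n")
    toy_path = tmp.name

toy_result = load_expression_patterns(toy_path, "gills")
os.remove(toy_path)  # clean up the temporary file
print(toy_result)
# {'gene_A': ['Down_hep_Up_gills', 'Down_muscle_Up_gills']}
```

`gene_B` is dropped because its pattern is characteristic of muscle, not gills.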

In [6]:
# main function
def evaluate_patterns(lncrna_file, expression_patterns_file, output_file, tissue):
    """
    evaluates expression patterns for lncRNAs based on neighbouring mRNAs

    parameters:
    - lncrna_file: input file containing information on lncRNAs
    - expression_patterns_file: file with mRNA expression patterns
    - output_file: name of the output file
    - tissue: specific tissue for which patterns are evaluated

    processes the lncRNA data and saves the count of pattern matches in the output file
    """
    # load the expression patterns for the specified tissue
    expression_patterns = load_expression_patterns(expression_patterns_file, tissue)

    # load lncRNAs data into a df
    df_lncrna = pd.read_csv(lncrna_file, sep="\t")
    
    # replace "not DE" with "not_DE" so that the status label contains no spaces
    df_lncrna['DE_status'] = df_lncrna['DE_status'].replace("not DE", "not_DE")
    
    output_rows = []
    
    # evaluate expression patterns for each lncRNA
    for _, row in df_lncrna.iterrows():
        lncRNA_ID = row['lncRNA_ID']
        lncRNA_strand = row['strand']
        mRNA_ID = row['mRNA_ID']
        mRNA_strand = row['mRNA_strand']
        mRNA_position = row['mRNA_Position']
        DE_status = row['DE_status']
        distance_kb = row['distance_Kb']
        product = row['product']
        
        # search for mRNA-associated expression patterns
        mRNA_Patterns = expression_patterns.get(mRNA_ID, [])
        match_count = len(mRNA_Patterns)
        
        # add the information to the list of results
        output_rows.append({
            "lncRNA_ID": lncRNA_ID,
            "lncRNA_strand": lncRNA_strand,
            "mRNA_ID": mRNA_ID,
            "mRNA_strand": mRNA_strand,
            "mRNA_position": mRNA_position,
            "DE_status": DE_status,
            "match_count": match_count,
            "distance_Kb": distance_kb,
            "product": product
        })
    
    # convert the results to a DataFrame and export to an output file
    df_output = pd.DataFrame(output_rows)
    df_output.to_csv(output_file, sep="\t", index=False)
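
The row-by-row loop above is easy to follow; for larger tables, the same `match_count` column can also be computed without `iterrows`. The sketch below shows one possible vectorised variant using `Series.map`. The toy dictionary and DataFrame are made up and include only a few of the columns of the RNAContigProfiler output.

```python
import pandas as pd

# made-up stand-ins for the dictionary returned by load_expression_patterns
# and for a few columns of the RNAContigProfiler table
expression_patterns = {"mRNA_1": ["Down_hep_Up_gills", "Down_muscle_Up_gills"]}
df_lncrna = pd.DataFrame({
    "lncRNA_ID": ["lnc_1", "lnc_1", "lnc_2"],
    "mRNA_ID": ["mRNA_1", "mRNA_2", "mRNA_1"],
    "DE_status": ["DE", "not DE", "DE"],
})

# number of tissue-specific patterns recorded for each mRNA
pattern_counts = {gene_id: len(patterns) for gene_id, patterns in expression_patterns.items()}

df_out = df_lncrna.copy()
df_out["DE_status"] = df_out["DE_status"].replace("not DE", "not_DE")
# mRNAs that are absent from the dictionary get a count of 0
df_out["match_count"] = df_out["mRNA_ID"].map(pattern_counts).fillna(0).astype(int)
print(df_out)
```

Unlike `evaluate_patterns`, this variant keeps the original column names and does not rename `strand` or `mRNA_Position`.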

In [8]:
# iterate over each tissue and its tissue-specific lncRNA file
for tissue, lncrna_file in tissues.items():
    output_file = f"{tissue}_output_result.txt"
    evaluate_patterns(lncrna_file, expression_patterns_file, output_file, tissue)
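
Once the per-tissue tables are written, candidate pairs can be pulled out with a simple filter. The sketch below is one possible follow-up, not part of the original analysis. It assumes that a pair is of interest when the neighbouring mRNA matched at least one tissue pattern (`match_count > 0`); both this threshold and the toy rows are assumptions.

```python
import pandas as pd

# made-up rows standing in for one "<tissue>_output_result.txt" table
df_result = pd.DataFrame({
    "lncRNA_ID": ["lnc_1", "lnc_1", "lnc_2"],
    "mRNA_ID": ["mRNA_1", "mRNA_2", "mRNA_3"],
    "match_count": [2, 0, 1],
    "distance_Kb": [12.5, 3.0, 48.0],
})

# assumption: keep pairs whose mRNA matched at least one tissue pattern,
# ordered by their distance to the lncRNA
candidates = (
    df_result[df_result["match_count"] > 0]
    .sort_values("distance_Kb")
    .reset_index(drop=True)
)
print(candidates)
```

With real data, `df_result` would instead be read from one of the files written above, for example `pd.read_csv("gills_output_result.txt", sep="\t")`.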