# Parse the output of the MATLAB subsystem annotation

Parse the KEGG subsystem mapping produced by matlab_files/subsystem_annotation.
"Duplicated" refers to the output file: a reaction that belongs to several subsystems appears several times in it, once per subsystem.

In [None]:
save_file = True

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
mappings = pd.read_csv(
    "../data/subsystem_assignation/subsystem_mapping.csv", index_col=0
)

mappings

Start by removing the biomass and exchange reactions.

In [None]:
biomass_reac = [reac for reac in mappings.index if "biomass" in reac]
exch_reac = [reac for reac in mappings.index if "EX_" in reac]
mappings = mappings.drop(biomass_reac + exch_reac)
mappings
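
The substring test `"EX_" in reac` also matches identifiers that merely contain `EX_` somewhere in the middle. Below is a minimal sketch of a stricter, prefix-based filter. It assumes that exchange reactions always start with `EX_`, and the reaction IDs in `toy` are hypothetical, not taken from the mapping file.

```python
import pandas as pd

# Hypothetical reaction IDs, for illustration only
toy = pd.DataFrame(
    {"subsKEGG": ["a", "b", "c", "d"]},
    index=["EX_glc__D_e", "biomass_core", "PGI", "HEX_1"],
)

# Literal substring match for biomass reactions
is_biomass = toy.index.str.contains("biomass", regex=False)
# Prefix match only: "HEX_1" contains "EX_" but does not start with it
is_exchange = toy.index.str.startswith("EX_")

filtered = toy[~(is_biomass | is_exchange)]
print(filtered.index.tolist())
```

Whether the prefix test is preferable depends on the naming scheme of the model, so the substring version may be perfectly adequate here.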

Replace the NA values with an "Unknown" subsystem.
Then keep only the reactions with a known KEGG subsystem assignment, since KEGG is the standard we will use.

In [None]:
mappings_filled = mappings.fillna("Unknown")
mappings_kegg = mappings_filled[mappings_filled["subsKEGG"] != "Unknown"]
mappings_kegg
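
As a small sanity check of the step above, the sketch below shows on a toy table that filling with "Unknown" and then filtering gives the same result as dropping the rows whose `subsKEGG` is missing and filling afterwards. The `other_source` column and the annotation strings are hypothetical placeholders.

```python
import pandas as pd

# Toy mapping: R2 has no KEGG annotation (column "other_source" is hypothetical)
toy = pd.DataFrame(
    {"subsKEGG": ["L1;L2;L3", None], "other_source": [None, "X"]},
    index=["R1", "R2"],
)

# Approach used in this notebook: fill, then filter on the placeholder
via_fill = toy.fillna("Unknown")
via_fill = via_fill[via_fill["subsKEGG"] != "Unknown"]

# Equivalent alternative: drop rows lacking a KEGG annotation, then fill the rest
via_drop = toy.dropna(subset=["subsKEGG"]).fillna("Unknown")

print(via_fill.equals(via_drop))
```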

Check how many reactions have multiple assignments. The counts below are the number of "|" separators per reaction, so 0 means a single subsystem.

In [None]:
kegg_subsys = mappings_kegg["subsKEGG"]
# Number of "|" separators per reaction (0 means a single subsystem)
has_multiple_map = kegg_subsys.str.count(r"\|")

has_multiple_map.value_counts()
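
To make the relation between separators and annotations explicit, here is a minimal sketch on toy strings. The reaction IDs and level names are hypothetical placeholders that only mimic the `level1;level2;level3|...` layout.

```python
import pandas as pd

# Hypothetical annotations: 1, 2 and 3 subsystems respectively
toy = pd.Series(
    {
        "R1": "L1a;L2a;L3a",
        "R2": "L1a;L2a;L3a|L1b;L2b;L3b",
        "R3": "L1a;L2a;L3a|L1b;L2b;L3b|L1c;L2c;L3c",
    }
)

n_separators = toy.str.count(r"\|")  # "|" must be escaped: the pattern is a regex
n_annotations = n_separators + 1     # n separators delimit n + 1 annotations
print(n_annotations.to_dict())
```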

# Duplicating reactions across subsystems

Since these pathways are all valid KEGG annotations, we consider that each reaction takes part in every pathway it is annotated with.
KEGG has 3 levels of subsystem: we will use the third one.
One level 1 annotation, "Not Included in Pathway or Brite", has a level 3 annotation that is not "Unknown". However, all reactions with this annotation except one (NTP12) also have other subsystems. We will thus remove the rows corresponding to this "Not Included in Pathway or Brite" subsystem, which also drops that one reaction from the table.

In [None]:
rows = []
for reaction, subsys_str in mappings_kegg["subsKEGG"].items():
    # A reaction can carry several annotations separated by "|"
    subsys_mult = subsys_str.split("|")
    for s in subsys_mult:
        # Each annotation is "level1;level2;level3"
        levels = s.split(";")
        rows.append(
            {
                "rxn": reaction,
                "level1": levels[0],
                "level2": levels[1],
                "level3": levels[2],
                "num_diff_subsys": len(subsys_mult),
            }
        )
# One row per (reaction, subsystem) pair
subsys_table = pd.DataFrame(rows)
subsys_table
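
The same long-format table can also be built without an explicit loop. The sketch below is one possible vectorized variant based on `str.split` and `explode`, shown on toy data. The reaction IDs and level names are hypothetical, and it assumes every annotation has at least three ";"-separated levels.

```python
import pandas as pd

# Toy input: R1 has two annotations, R2 has one
toy = pd.DataFrame(
    {"subsKEGG": ["A1;A2;A3|B1;B2;B3", "C1;C2;C3"]},
    index=pd.Index(["R1", "R2"], name="rxn"),
)

long = (
    toy["subsKEGG"]
    .str.split("|")                      # list of annotations per reaction
    .to_frame("annotation")
    .assign(num_diff_subsys=lambda d: d["annotation"].str.len())
    .explode("annotation")               # one row per (reaction, annotation)
    .reset_index()
)

# Split each annotation into its three levels
levels = long["annotation"].str.split(";", expand=True).iloc[:, :3]
levels.columns = ["level1", "level2", "level3"]

toy_table = pd.concat([long[["rxn"]], levels, long[["num_diff_subsys"]]], axis=1)
print(toy_table)
```

The loop version is easier to read step by step, while this variant tends to scale better on large tables.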

In [None]:
# All reactions annotated "Not Included in Pathway or Brite" except one also have
# other subsystems: list the reactions for which it is the only annotation.
not_included = subsys_table[
    subsys_table["level1"] == "Not Included in Pathway or Brite"
]
not_included[not_included["num_diff_subsys"] == 1]

In [None]:
subsys_table[subsys_table["rxn"] == "NTP10"]

In [None]:
to_remove = subsys_table[subsys_table["level1"] == "Not Included in Pathway or Brite"]
subsys_table = subsys_table.drop(list(to_remove.index))
subsys_table
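
Dropping these rows has two side effects worth checking: a reaction whose only annotation was "Not Included in Pathway or Brite" disappears entirely, and `num_diff_subsys` still reflects the count before removal. The sketch below illustrates both on a toy table. The reaction IDs are hypothetical and `num_remaining` is an illustrative column name, not part of the saved file.

```python
import pandas as pd

NOT_INCLUDED = "Not Included in Pathway or Brite"

# Toy table: R1 has one real subsystem plus "Not Included", R2 has only "Not Included"
toy_table = pd.DataFrame(
    {
        "rxn": ["R1", "R1", "R2"],
        "level1": ["Metabolism", NOT_INCLUDED, NOT_INCLUDED],
        "num_diff_subsys": [2, 2, 1],
    }
)

kept = toy_table[toy_table["level1"] != NOT_INCLUDED].copy()

# Reactions that no longer appear at all after the removal
lost = sorted(set(toy_table["rxn"]) - set(kept["rxn"]))

# Number of rows actually left per reaction, next to the original count
kept["num_remaining"] = kept.groupby("rxn")["rxn"].transform("size")

print(lost)
print(kept)
```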

In [None]:
if save_file:
    subsys_table.to_csv("../data/processed_files/subsystem_duplicated.csv")