# Parsing SwissLipids into a network for LipiNet

In [None]:
import lipinet.databases  # Import the module
import importlib

# Reload the submodule so that local changes to it are picked up
importlib.reload(lipinet.databases)

# The functions can now be imported from the reloaded module
from lipinet.databases import get_prior_knowledge
from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf

import pandas as pd

## Parsing the manual way

LipiNet offers convenient functions to parse prior knowledge resources straight into networks. To show what is happening behind the scenes, this notebook goes through the data and each of the steps. This may also be helpful if you need to customise the networks in a way that LipiNet does not yet support directly.

In [3]:
df_swisslipids = get_prior_knowledge('swisslipids', verbose=True)
df_swisslipids['from_layer_col'] = 'swisslipids'
df_swisslipids

File found locally at /Users/agjanyunlu/Documents/Metabolomics/lipinet/lipinet/.data/downloaded/swisslipids_lipids.tsv. Loading data...
Before cleaning, number of values in lipid class column with trailing space: Lipid class*
False    779171
True         76
Name: count, dtype: int64
After cleaning, number of values in lipid class column with trailing space: Lipid class*
False    779247
Name: count, dtype: int64


Unnamed: 0,Lipid ID,Level,Name,Abbreviation*,Synonyms*,Lipid class*,Parent,Components*,SMILES (pH7.3),InChI (pH7.3),...,Exact m/z of [M+NH4]+,Exact m/z of [M-H]-,Exact m/z of [M+Cl]-,Exact m/z of [M+OAc]-,CHEBI,LIPID MAPS,HMDB,MetaNetX,PMID,from_layer_col
0,SLM:000000002,Class,Ceramide (iso-d17:1(4E)),Cer(iso-d17:1(4E)),N-acyl-15-methylhexadecasphing-4-enine,SLM:000399814,,,CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O,InChI=none,...,,,,,70846,,,MNXM97012,| 11443131 | 14685263 | 18390550 | 21325339 |...,swisslipids
1,SLM:000000003,Isomeric subspecies,15-methylhexadecasphing-4-enine,,,SLM:000390097,,,CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@@H]([NH3+])CO,InChI=1S/C17H35NO2/c1-15(2)12-10-8-6-4-3-5-7-9...,...,303.300605,284.259503,320.236181,344.280632,70771,,,MNXM57784,19372430,swisslipids
2,SLM:000000006,Isomeric subspecies,15-methylhexadecasphinganine,,,SLM:000390097,,,CC(C)CCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO,InChI=1S/C17H37NO2/c1-15(2)12-10-8-6-4-3-5-7-9...,...,305.316255,286.275153,322.251831,346.296282,70829,,,MNXM97029,19372430,swisslipids
3,SLM:000000007,Class,Sphingomyelin (iso-d17:1(4E)),SM(iso-d17:1(4E)),N-acyl-15-methylhexadecasphing-4-enine-1-phosp...,SLM:000001000,,,CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O...,InChI=none,...,,,,,70775,,,MNXM97113,14685263 | 21926990 | 9603947,swisslipids
4,SLM:000000035,Isomeric subspecies,sphinganine,,,SLM:000390097,,,CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO,InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12...,...,319.331905,300.290803,336.267481,360.311932,57817,LMSP01020001,HMDB00269,MNXM302,10652340 | 10702247 | 10751414 | 10802064 | 10...,swisslipids
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
779244,SLM:000782324,,apo carotenoid,,,SLM:000508864,,,,,...,,,,,53183,,,,,swisslipids
779245,SLM:000782325,,terpenoid,,,SLM:000508864,,,,,...,,,,,26873,,,,,swisslipids
779246,SLM:000782326,,C-45 isoprenoid,,,SLM:000508864,,,,,...,,,,,87168,,,,,swisslipids
779247,SLM:000782327,,gamma-lactone,,,SLM:000782238,,,O1C(C(C(C1=O)*)*)*,,...,,,,,37581,,,,,swisslipids


If we take a closer look at the data, especially the `Lipid class*` column, we will see that some of the values contain multiple entries separated by `|`. For example, Ceramide phosphoinositol is a Class level entry that itself belongs to both the SLM:000000834 and SLM:000399815 classes.

In [4]:
# na=False treats missing classes as "no match", so no separate dropna is needed
df_swisslipids[df_swisslipids['Lipid class*'].str.contains('|', regex=False, na=False)]

Unnamed: 0,Lipid ID,Level,Name,Abbreviation*,Synonyms*,Lipid class*,Parent,Components*,SMILES (pH7.3),InChI (pH7.3),...,Exact m/z of [M+NH4]+,Exact m/z of [M-H]-,Exact m/z of [M+Cl]-,Exact m/z of [M+OAc]-,CHEBI,LIPID MAPS,HMDB,MetaNetX,PMID,from_layer_col
142,SLM:000000392,Class,Ceramide phosphoinositol,IPC,Inositol-1-phosphoceramide,SLM:000000834 | SLM:000399815,,,O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[...,InChI=none,...,,,,,64916,,,,10888667 | 20727985,swisslipids
234,SLM:000000509,Isomeric subspecies,All-trans-retinyl hexadecanoate,,all-trans-retinyl palmitate,SLM:000000982 | SLM:000508854,,,CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C...,InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1...,...,,,,,17616,,HMDB03648,,10769148 | 10819989 | 12230550 | 15550674 | 15...,swisslipids
315,SLM:000000612,,tetracosenoyl-CoA,,,SLM:000390051 | SLM:000782334,,,CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...,,...,,,,,74146,,,,18541923 | 20110363 | 20937905,swisslipids
317,SLM:000000614,,hexacosenoyl-CoA,,,SLM:000390051 | SLM:000782334,,,CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...,,...,,,,,74161,,,,18165233,swisslipids
319,SLM:000000621,,2-hydroxy-tetracosenoyl-CoA,,,SLM:000390051 | SLM:000782334,,,CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...,,...,,,,,74215,,,,18541923,swisslipids
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755324,SLM:000758294,Class,Globoside,Globo,Globo-series,SLM:000000834 | SLM:000399813,,,,,...,,,,,61360,,,,,swisslipids
755325,SLM:000758295,Class,Isogloboside,Isoglobo,Isoglobo-series,SLM:000000834 | SLM:000399813,,,,,...,,,,,78257,,,,,swisslipids
779141,SLM:000782221,,Resolvin E,RvE,,SLM:000501332 | SLM:000508853,,,,InChI=none,...,,,,,,LMFA0314,,,,swisslipids
779142,SLM:000782222,,Resolvin D,RvD,,SLM:000501331 | SLM:000508853,,,,InChI=none,...,,,,,,LMFA0403,,,,swisslipids
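
The `regex=False` argument in the filter above matters. As a small illustration on a toy Series (values copied from the rows shown here, not the full table), a bare `|` interpreted as a regular expression is an alternation of two empty patterns and therefore matches every string:

```python
import pandas as pd

# Toy stand-in for the `Lipid class*` column.
classes = pd.Series(["SLM:000000834 | SLM:000399815", "SLM:000390097", None])

# Literal match: only the entry that really contains a pipe is flagged.
literal = classes.str.contains("|", regex=False, na=False)

# Regex match: "|" means "empty OR empty", which matches any non-missing string.
as_regex = classes.str.contains("|", regex=True, na=False)

print(literal.tolist())
print(as_regex.tolist())
```

With the literal match only the first entry is selected, whereas the regex version would also select the single-class entry.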


These double entries will be important to take into account for our class hierarchy. If we ignore them, many of these Class level entries will become disconnected in the ontology. To handle them, we split each multi-class entry into one row per class using the `split_and_expand_large` utility function.

In [5]:
df_swisslipids_splitexp = split_and_expand_large(df_swisslipids, 
                       split_col='Lipid class*', 
                       expand_cols=['Lipid ID','Level', 'Name', 'Abbreviation*',
                                    'CHEBI', 'LIPID MAPS', 'HMDB', 'MetaNetX', 'PMID', 'from_layer_col'],
                       delimiter='|')
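
We have not looked inside `split_and_expand_large`, so the following is only a rough sketch of the idea using plain pandas. The helper name `split_and_expand_sketch` is hypothetical, and it ignores the `expand_cols` argument and any chunking the real utility may do for large tables:

```python
import pandas as pd

# Toy stand-in for the SwissLipids table (values copied from the rows shown above).
toy = pd.DataFrame({
    "Lipid ID": ["SLM:000000392", "SLM:000000035"],
    "Lipid class*": ["SLM:000000834 | SLM:000399815", "SLM:000390097"],
})

def split_and_expand_sketch(df, split_col, delimiter="|"):
    """Return one row per delimited value in `split_col` (hypothetical helper)."""
    out = df.copy()
    # Split each cell into a list of parent classes; missing cells stay missing.
    out[split_col] = out[split_col].str.split(delimiter)
    # One row per list element, repeating the values of the other columns.
    out = out.explode(split_col, ignore_index=True)
    # Remove the whitespace that surrounded the delimiter.
    out[split_col] = out[split_col].str.strip()
    return out

expanded = split_and_expand_sketch(toy, "Lipid class*")
print(expanded)
```

The entry with two classes becomes two rows that share the same `Lipid ID`, which is what lets each parent class get its own edge later on.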

Now we can melt the cross-reference and class columns into long format to start creating the edges df.

In [6]:
df_swisslipids_edges = pd.melt(df_swisslipids_splitexp, 
                id_vars=['Lipid ID'], 
                value_vars=['CHEBI','LIPID MAPS','HMDB','MetaNetX','PMID','Lipid class*'], 
                var_name='melted_column', value_name='value')
df_swisslipids_edges

Unnamed: 0,Lipid ID,melted_column,value
0,SLM:000000002,CHEBI,70846
1,SLM:000000003,CHEBI,70771
2,SLM:000000006,CHEBI,70829
3,SLM:000000007,CHEBI,70775
4,SLM:000000035,CHEBI,57817
...,...,...,...
4676203,SLM:000782324,Lipid class*,SLM:000508864
4676204,SLM:000782325,Lipid class*,SLM:000508864
4676205,SLM:000782326,Lipid class*,SLM:000508864
4676206,SLM:000782327,Lipid class*,SLM:000782238
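
To make the reshaping concrete, here is `pd.melt` on a two-row toy table (values taken from the rows shown earlier, with only two of the six value columns). Each input row yields one output row per value column, which is why the melted frame above is six times as long as its input:

```python
import pandas as pd

# Wide format: one row per lipid, one column per external resource.
wide = pd.DataFrame({
    "Lipid ID": ["SLM:000000035", "SLM:000782324"],
    "CHEBI": ["57817", "53183"],
    "HMDB": ["HMDB00269", None],
})

# Long format: one row per (lipid, resource) pair.
long = pd.melt(
    wide,
    id_vars=["Lipid ID"],
    value_vars=["CHEBI", "HMDB"],
    var_name="melted_column",
    value_name="value",
)
print(long)
```

Missing cross-references are kept as rows with a missing `value`, so they have to be dealt with at a later step.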


There are still a few things to tidy up before the edges are in a suitable format for OnionNet.

In [7]:
df_swisslipids_edges['source_layer'] = 'swisslipids'
df_swisslipids_edges.rename(columns={'Lipid ID':'source_id', 'melted_column':'target_layer', 'value':'target_id'}, inplace=True)
# Take an explicit copy so the assignments below do not operate on a slice of the original
df_swisslipids_edges = df_swisslipids_edges[['source_layer','source_id','target_layer','target_id']].copy()
df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: 'swisslipids' if x=='Lipid class*' else x)
df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: str(x).replace(' ','').strip('*').lower())
df_swisslipids_edges

Unnamed: 0,source_layer,source_id,target_layer,target_id
0,swisslipids,SLM:000000002,chebi,70846
1,swisslipids,SLM:000000003,chebi,70771
2,swisslipids,SLM:000000006,chebi,70829
3,swisslipids,SLM:000000007,chebi,70775
4,swisslipids,SLM:000000035,chebi,57817
...,...,...,...,...
4676203,swisslipids,SLM:000782324,swisslipids,SLM:000508864
4676204,swisslipids,SLM:000782325,swisslipids,SLM:000508864
4676205,swisslipids,SLM:000782326,swisslipids,SLM:000508864
4676206,swisslipids,SLM:000782327,swisslipids,SLM:000782238


For the rows that link swisslipids to swisslipids, we want to reverse the direction of the edge. At the moment the target is the parent class, but it is preferable for the parent to point towards its children. That way the root node is the one with multiple outgoing edges and no incoming edges.

Be sure to run this cell only once. Both layers are `swisslipids` on these rows, so running it a second time would swap them back again.

In [8]:
# Identify rows where both source_layer and target_layer are 'swisslipids'
condition = (df_swisslipids_edges["source_layer"] == "swisslipids") & (df_swisslipids_edges["target_layer"] == "swisslipids")

# Swap the columns for rows satisfying the condition
df_swisslipids_edges.loc[condition, ["source_layer", "source_id", "target_layer", "target_id"]] = df_swisslipids_edges.loc[condition, ["target_layer", "target_id", "source_layer", "source_id"]].values

# Output the modified DataFrame
df_swisslipids_edges

Unnamed: 0,source_layer,source_id,target_layer,target_id
0,swisslipids,SLM:000000002,chebi,70846
1,swisslipids,SLM:000000003,chebi,70771
2,swisslipids,SLM:000000006,chebi,70829
3,swisslipids,SLM:000000007,chebi,70775
4,swisslipids,SLM:000000035,chebi,57817
...,...,...,...,...
4676203,swisslipids,SLM:000508864,swisslipids,SLM:000782324
4676204,swisslipids,SLM:000508864,swisslipids,SLM:000782325
4676205,swisslipids,SLM:000508864,swisslipids,SLM:000782326
4676206,swisslipids,SLM:000782238,swisslipids,SLM:000782327
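
Because rerunning the swap flips the edges back, one option is to wrap it in a function that always starts from the unswapped frame and returns a new one. The sketch below is only an illustration on toy data; the function name `orient_parent_to_child` and the variable names are hypothetical and not part of LipiNet. It also shows how the root can then be found as a source that never appears as a target:

```python
import pandas as pd

# Toy edges in the child -> parent orientation produced by the melt.
edges_raw = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids", "swisslipids"],
    "source_id": ["SLM:000782324", "SLM:000782325", "SLM:000000035"],
    "target_layer": ["swisslipids", "swisslipids", "chebi"],
    "target_id": ["SLM:000508864", "SLM:000508864", "57817"],
})

def orient_parent_to_child(raw, layer="swisslipids"):
    """Return a copy of `raw` with intralayer edges pointing parent -> child."""
    out = raw.copy()
    mask = (out["source_layer"] == layer) & (out["target_layer"] == layer)
    # Both layers are identical on these rows, so only the ids need swapping.
    out.loc[mask, ["source_id", "target_id"]] = (
        out.loc[mask, ["target_id", "source_id"]].to_numpy()
    )
    return out

oriented = orient_parent_to_child(edges_raw)
oriented_again = orient_parent_to_child(edges_raw)  # rerunning gives the same result

# Roots: intralayer sources that are never the target of an intralayer edge.
intra = oriented[oriented["source_layer"] == oriented["target_layer"]]
roots = set(intra["source_id"]) - set(intra["target_id"])
print(roots)
```

Since `edges_raw` is never modified, the cell can be executed any number of times without the direction flipping.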


Now we will determine whether each edge lies within a single layer (intralayer) or connects two different layers (interlayer).

In [9]:
def assess_edge_layertype(df):
    interlayer = df['source_layer']!=df['target_layer']
    df['interlayer'] = interlayer
    return df 

df_swisslipids_edges = assess_edge_layertype(df_swisslipids_edges)
df_swisslipids_edges

Unnamed: 0,source_layer,source_id,target_layer,target_id,interlayer
0,swisslipids,SLM:000000002,chebi,70846,True
1,swisslipids,SLM:000000003,chebi,70771,True
2,swisslipids,SLM:000000006,chebi,70829,True
3,swisslipids,SLM:000000007,chebi,70775,True
4,swisslipids,SLM:000000035,chebi,57817,True
...,...,...,...,...,...
4676203,swisslipids,SLM:000508864,swisslipids,SLM:000782324,False
4676204,swisslipids,SLM:000508864,swisslipids,SLM:000782325,False
4676205,swisslipids,SLM:000508864,swisslipids,SLM:000782326,False
4676206,swisslipids,SLM:000782238,swisslipids,SLM:000782327,False


In [10]:
df_swisslipids_edges.groupby('target_layer').value_counts(subset=['target_id'], dropna=False)

target_layer  target_id    
chebi         NaN              774975
              131943                2
              131944                2
              131990                2
              132079                2
                                ...  
swisslipids   SLM:000782324         1
              SLM:000782325         1
              SLM:000782326         1
              SLM:000782327         1
              SLM:000782328         1
Name: count, Length: 1319208, dtype: int64

We can see above that many of the `target_id` values are NaN (774,975 for `chebi` alone), because the corresponding lipids have no cross-reference in ChEBI, MetaNetX, etc.

This shouldn't be an issue, because we can drop these rows when adding the edges, since they don't tell us anything anyway.
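
Dropping those rows is a one-liner in pandas. A minimal sketch on toy edges (this is our illustration of the idea, not necessarily how LipiNet does it internally):

```python
import pandas as pd

edges = pd.DataFrame({
    "source_layer": ["swisslipids"] * 3,
    "source_id": ["SLM:000000035", "SLM:000782324", "SLM:000782324"],
    "target_layer": ["hmdb", "hmdb", "chebi"],
    "target_id": ["HMDB00269", None, "53183"],
})

# Keep only the edges that actually point at something.
edges_clean = edges.dropna(subset=["target_id"]).reset_index(drop=True)
print(edges_clean)
```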

Now we will create the node df.

In [11]:
df_swisslipids_nodes = create_nodedf_from_edgedf(edge_df=df_swisslipids_edges, props=['layer', 'id'], cols=['layer', 'node_id'])
df_swisslipids_nodes

Unnamed: 0,layer,node_id
0,swisslipids,SLM:000000002
1,swisslipids,SLM:000000003
2,swisslipids,SLM:000000006
3,swisslipids,SLM:000000007
4,swisslipids,SLM:000000035
...,...,...
9352411,swisslipids,SLM:000782324
9352412,swisslipids,SLM:000782325
9352413,swisslipids,SLM:000782326
9352414,swisslipids,SLM:000782327
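
The source of `create_nodedf_from_edgedf` is not shown here, but the output above has roughly twice as many rows as the edges df, which is what you would get by stacking the source columns on top of the target columns. A sketch under that assumption, with the hypothetical name `nodes_from_edges_sketch`:

```python
import pandas as pd

edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:000000035", "SLM:000508864"],
    "target_layer": ["chebi", "swisslipids"],
    "target_id": ["57817", "SLM:000782324"],
})

def nodes_from_edges_sketch(edge_df):
    """Stack source and target endpoints into a (layer, node_id) table."""
    sources = edge_df[["source_layer", "source_id"]].rename(
        columns={"source_layer": "layer", "source_id": "node_id"}
    )
    targets = edge_df[["target_layer", "target_id"]].rename(
        columns={"target_layer": "layer", "target_id": "node_id"}
    )
    # Every edge contributes two rows, so repeated nodes are expected here.
    return pd.concat([sources, targets], ignore_index=True)

nodes = nodes_from_edges_sketch(edges)
print(nodes)
```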


Let's also see how many times each node is repeated.

In [12]:
df_swisslipids_nodes.value_counts(dropna=True)

layer        node_id      
swisslipids  SLM:000000353    132658
             SLM:000000377     98794
             SLM:000000102     80215
             SLM:000117148     46826
             SLM:000000400     38520
                               ...  
metanetx     MNXM536149            1
             MNXM536150            1
             MNXM536151            1
             MNXM536152            1
chebi        10036                 1
Name: count, Length: 1319298, dtype: int64

Now let's merge the nodes with the information from earlier to create richer node attributes.

In [13]:
df_swisslipids_nodes = pd.merge(df_swisslipids_nodes, df_swisslipids_splitexp,
                                left_on=['layer','node_id'], right_on=['from_layer_col','Lipid ID'],
                                how='outer')
df_swisslipids_nodes

Unnamed: 0,layer,node_id,Lipid ID,Level,Name,Abbreviation*,CHEBI,LIPID MAPS,HMDB,MetaNetX,PMID,from_layer_col,Lipid class*
0,chebi,10036,,,,,,,,,,,
1,chebi,10362,,,,,,,,,,,
2,chebi,111515,,,,,,,,,,,
3,chebi,11152,,,,,,,,,,,
4,chebi,1156,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9362046,swisslipids,SLM:000782331,,,,,,,,,,,
9362047,swisslipids,SLM:000782331,,,,,,,,,,,
9362048,swisslipids,SLM:000782332,,,,,,,,,,,
9362049,swisslipids,,,,,,,,,,,,
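
When a merge like this produces unexpected rows, the `indicator` argument of `pd.merge` can show where each row came from. The toy example below uses `how='left'` purely for illustration, which is a different choice from the `how='outer'` used in the cell above:

```python
import pandas as pd

nodes = pd.DataFrame({
    "layer": ["swisslipids", "chebi"],
    "node_id": ["SLM:000000035", "57817"],
})
annotations = pd.DataFrame({
    "from_layer_col": ["swisslipids"],
    "Lipid ID": ["SLM:000000035"],
    "Name": ["sphinganine"],
})

merged = pd.merge(
    nodes, annotations,
    left_on=["layer", "node_id"], right_on=["from_layer_col", "Lipid ID"],
    how="left",
    indicator=True,  # adds a `_merge` column: both / left_only / right_only
)
print(merged[["layer", "node_id", "Name", "_merge"]])
```

Nodes from other layers, such as the ChEBI node here, have no annotation row and are marked `left_only`.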


This has a lot of duplicates in it, so let's remove them. We also drop `from_layer_col`, which means nothing here and is just a relic of joining back to the initial df we used to create the edges (this could probably be tidied up).

In [14]:
df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes = df_swisslipids_nodes.drop(columns='from_layer_col')
df_swisslipids_nodes

Unnamed: 0,layer,node_id,Lipid ID,Level,Name,Abbreviation*,CHEBI,LIPID MAPS,HMDB,MetaNetX,PMID,Lipid class*
0,chebi,10036,,,,,,,,,,
1,chebi,10362,,,,,,,,,,
2,chebi,111515,,,,,,,,,,
3,chebi,11152,,,,,,,,,,
4,chebi,1156,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
9362041,swisslipids,SLM:000782329,,,,,,,,,,
9362044,swisslipids,SLM:000782330,,,,,,,,,,
9362045,swisslipids,SLM:000782331,,,,,,,,,,
9362048,swisslipids,SLM:000782332,,,,,,,,,,
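
Note that `drop_duplicates()` compares whole rows. A lipid that was split into two parent classes earlier may therefore still appear twice, once per `Lipid class*` value. The following sketch on toy data shows one way to check for that, and one possible way to collapse such rows (our suggestion, not something the notebook or LipiNet is known to do):

```python
import pandas as pd

nodes = pd.DataFrame({
    "layer": ["swisslipids", "swisslipids", "swisslipids"],
    "node_id": ["SLM:000000392", "SLM:000000392", "SLM:000000035"],
    "Lipid class*": ["SLM:000000834", "SLM:000399815", "SLM:000390097"],
})

# Whole-row comparison: both SLM:000000392 rows survive.
deduped = nodes.drop_duplicates()

# Rows whose (layer, node_id) key occurs more than once.
repeated = deduped[deduped.duplicated(subset=["layer", "node_id"], keep=False)]

# One option: a single row per node, with the parent classes joined back together.
one_per_node = (
    deduped.groupby(["layer", "node_id"], sort=False)["Lipid class*"]
    .agg(" | ".join)
    .reset_index()
)
print(one_per_node)
```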


Now we have the nodes and edges dfs for swisslipids and understand how we arrived at them. In practice you don't have to go through this process every time. LipiNet offers a convenient function that does exactly this, if you are interested in the same network setup.

## Using the LipiNet `parse_swisslipids_data` function

The `parse_swisslipids_data` function in the `lipinet.parse_swisslipids` module automatically runs through all of the steps we have just covered.

In [None]:
from lipinet.parse_swisslipids import parse_swisslipids_data

sl_results = parse_swisslipids_data(verbose=False)
df_sl_nodes = sl_results['df_nodes']
df_sl_edges = sl_results['df_edges']

We can also check that the two results match, starting with an individual entry.

In [20]:
df_swisslipids_nodes.iloc[0]

layer            chebi
node_id          10036
Lipid ID           NaN
Level              NaN
Name               NaN
Abbreviation*      NaN
CHEBI              NaN
LIPID MAPS         NaN
HMDB               NaN
MetaNetX           NaN
PMID               NaN
Lipid class*       NaN
Name: 0, dtype: object

In [19]:
df_sl_nodes.iloc[0]

layer            chebi
node_id          10036
Lipid ID           NaN
Level              NaN
Name               NaN
Abbreviation*      NaN
CHEBI              NaN
LIPID MAPS         NaN
HMDB               NaN
MetaNetX           NaN
PMID               NaN
Lipid class*       NaN
Name: 0, dtype: object

For the first entry it looks good, but what about the entire df? We can use `pd.testing.assert_frame_equal` to check this. It raises an `AssertionError` when the two frames differ.

First we will run a negative control by comparing `df_swisslipids_nodes` with `df_swisslipids_edges`. These obviously differ, so the assertion should fail and the error message is printed.

In [27]:
try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_swisslipids_edges)
    print('DataFrames are equal')
except AssertionError as e:
    print(e)

DataFrame are different

DataFrame shape mismatch
[left]:  (1319422, 12)
[right]: (4676208, 5)


Now let's compare `df_swisslipids_nodes` with `df_sl_nodes`, which should hopefully pass without raising an error. We will also test the edges df while we're at it.

In [31]:
try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_sl_nodes)
    print('DataFrames for nodes are equal')
except AssertionError as e:
    print(e)

DataFrames for nodes are equal


In [32]:
try:
    pd.testing.assert_frame_equal(df_swisslipids_edges, df_sl_edges)
    print('DataFrames for edges are equal')
except AssertionError as e:
    print(e)

DataFrames for edges are equal
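
If you want to reuse this kind of check elsewhere, the try/except can be wrapped in a small helper. The name `frames_equal` is our own and not part of pandas or LipiNet. The toy example also shows that `assert_frame_equal` is sensitive to row order and index, while treating NaN values in the same position as equal:

```python
import pandas as pd

def frames_equal(left, right, **kwargs):
    """Return True if assert_frame_equal passes, False otherwise (hypothetical helper)."""
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True

a = pd.DataFrame({"layer": ["chebi", "hmdb"], "node_id": ["10036", "HMDB00269"]})
b = a.iloc[::-1]  # same rows, reversed order (and reversed index)

reversed_equal = frames_equal(a, b)
# Sorting and resetting the index makes the comparison independent of row order.
sorted_equal = frames_equal(a, b.sort_values("node_id").reset_index(drop=True))

# NaN values in matching positions do not count as a difference.
c = pd.DataFrame({"layer": ["chebi"], "Name": [float("nan")]})
nan_equal = frames_equal(c, c.copy())

print(reversed_equal, sorted_equal, nan_equal)
```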


Great! It looks like both approaches achieve the same df. We will use these dfs in other parts of the package.