1. pyRhea
* who is it for
* what is it for
* how we used it here
 
  
2. Plugging gaps in this set through manual curation (e.g. curation of Rhea class reactions)
* How we identify them
* How many gaps we filled

3. BioREx – analysis and possibility to plug gaps


## Answers:

**pyRheaDB** is useful to anyone who wants to combine cheminformatics and network-level analysis to gain insight into the metabolic network underlying the Rhea database.

Example: analysis of lipids in Rhea + analysis of potential integration with BioREx

* Easy export of RheaDB to Python objects. In this notebook: dataframes (pandas) and graphs (networkx).

In [None]:
from pyrheadb.RheaDB import RheaDB
from pyrheadb.ReactionNetwork import ReactionNetwork

In [None]:
rdb = RheaDB()

In [None]:
rxnnet = ReactionNetwork()
rxnnet.build_network_from_long_format_table(rdb.rhea_reaction_long_format_smiles_chebi)

In [None]:
print('nodes:', len(rxnnet.participants_graph.nodes()))
print('edges:', len(rxnnet.participants_graph.edges()))
# Example nodes of the graph
list(rxnnet.participants_graph.nodes())[:5]

## One advantage for Recon4IMD: transforming RheaDB into a format that is easy to merge with SwissLipids

In [None]:
import pandas as pd
df_swisslipids = pd.read_csv('../../swisslipids.tsv', sep='\t', encoding='latin-1',
                usecols=['Lipid ID', 'CHEBI', 'Level', 'Lipid class*', 'Components*', 'SMILES (pH7.3)'],
                dtype={'Lipid ID': str, 'CHEBI':str, 'Level': str, 'Lipid class*': str, 'Components*': str,
                       'SMILES (pH7.3)': str})

In [None]:
rdb.rhea_reaction_long_format_smiles_chebi['chebinum']=rdb.rhea_reaction_long_format_smiles_chebi['chebiid'].apply(
    lambda x: x.split(':')[1]
)
df_common_compounds = rdb.rhea_reaction_long_format_smiles_chebi.merge(df_swisslipids, 
                                                                       left_on='chebinum',
                                                                      right_on='CHEBI',
                                                                      how='inner')
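
The ID normalisation and merge above can be illustrated on toy data. This is a minimal sketch, not the notebook's actual tables: the frames `toy_rhea` and `toy_swisslipids` and all IDs in them are invented for illustration. It uses the vectorised `.str` accessor rather than `apply`, which gives the same result for well-formed `CHEBI:<number>` strings.

```python
import pandas as pd

# Hypothetical stand-ins for the Rhea long-format table and the SwissLipids table
toy_rhea = pd.DataFrame({
    "chebiid": ["CHEBI:111", "CHEBI:222", "CHEBI:333"],
    "MASTER_ID": [1, 1, 2],
})
toy_swisslipids = pd.DataFrame({
    "CHEBI": ["222", "444"],
    "Lipid ID": ["SLM:A", "SLM:B"],
})

# Strip the 'CHEBI:' prefix so both tables share the same key format
toy_rhea["chebinum"] = toy_rhea["chebiid"].str.split(":").str[1]

# The inner merge keeps only Rhea participants that are also SwissLipids entries
toy_common = toy_rhea.merge(
    toy_swisslipids, left_on="chebinum", right_on="CHEBI", how="inner"
)
print(toy_common[["chebiid", "MASTER_ID", "Lipid ID"]])
```
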

In [None]:
df_swisslipids

## Lipid enumeration

1. The number of lipid class reactions from Rhea as templates

## Rhea reactions with at least one compound from SwissLipids

In [None]:
df_common_compounds.head()

## Number of unique lipid ChEBI IDs (defined + class compounds)

In [None]:
len(df_common_compounds['CHEBI'].unique())

## Number of unique Rhea MASTER IDs (defined + class reactions)

In [None]:
len(df_common_compounds['MASTER_ID'].unique())

Filter out the rows for lipids with defined structures, keeping only class compounds, whose SMILES contain a `*` wildcard.

In [None]:
df_common_compounds['star'] = df_common_compounds['smiles'].apply(lambda x: '*' in x)
df_common_compounds_template = df_common_compounds[df_common_compounds['star']==True]

## Total number of lipid class IDs in Rhea (as unique ChEBI IDs)

In [None]:
len(df_common_compounds_template['CHEBI'].unique())

## Total number of lipid class reactions:

In [None]:
len(df_common_compounds_template['MASTER_ID'].unique())

2. How connected are the lipid class reactions? Gaps in this network propagate to every reaction enumerated from it: a template reaction in Rhea that is not connected to the rest of Rhea gives rise to child reactions that are not connected either. We can therefore identify the unconnected class reactions and target them for curation. But do we know which lipid classes to focus on?
To connect lipids, we only need to curate their classes. Knowing how the classes are connected therefore gives a more accurate estimate of the manual work required.
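
As an illustration of how unconnected classes could be listed for curation, here is a minimal sketch on a toy graph. The graph `toy_graph` and its node names are hypothetical; in the notebook the same idea would apply to the lipid class subgraph built in the next section.

```python
import networkx as nx

# Toy stand-in for the lipid class graph; node names are hypothetical
toy_graph = nx.Graph()
toy_graph.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")])
toy_graph.add_node("G")  # a class compound with no links to any other class

# Sort connected components from largest to smallest
toy_components = sorted(nx.connected_components(toy_graph), key=len, reverse=True)
main_component = toy_components[0]

# Every node outside the largest component is a candidate for curation
curation_candidates = sorted(
    node for component in toy_components[1:] for node in component
)
# Nodes with no edges at all are the most obvious gaps
isolated_nodes = sorted(nx.isolates(toy_graph))

print("candidates:", curation_candidates)
print("isolated:", isolated_nodes)
```
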

## Connectivity

In [None]:
lipid_class_subgraph = rxnnet.participants_graph.subgraph(set(df_common_compounds_template['chebiid'].to_list())).copy()

In [None]:
import networkx as nx

In [None]:
nx.draw_spring(lipid_class_subgraph)

In [None]:
print('nodes:', len(lipid_class_subgraph.nodes()))
print('edges:', len(lipid_class_subgraph.edges()))

In [None]:
component_sizes = [len(c) for c in sorted(nx.connected_components(lipid_class_subgraph), key=len, reverse=True)]
sizes = list(set(component_sizes))
sizes.sort(reverse=True)

In [None]:
print('Biggest subgraph: ', sizes[0], 'class compounds')

In [None]:
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

## The largest subgraph is excluded from the bar chart of component sizes because it is much bigger than the rest and would make them invisible

In [None]:
def plot_subgraphs(sizes, component_sizes):
    count_subgraphs=[]
    for s in sizes:
        count_subgraphs.append(component_sizes.count(s))
        
    plt.bar(sizes[1:], count_subgraphs[1:])
    plt.xlabel('Compounds per subgraph')
    plt.ylabel('Number of subgraphs')
    for i in range(len(count_subgraphs[1:])):
        plt.text(sizes[1:][i]-0.5,count_subgraphs[1:][i]+0.5,sizes[1:][i])
    plt.show()


In [None]:
plot_subgraphs(sizes, component_sizes)
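
The tally behind the bar chart can also be produced without plotting. The sketch below is an alternative formulation under stated assumptions: the toy graph is invented, and `collections.Counter` replaces the repeated `list.count` calls used in `plot_subgraphs`. As in the chart, the largest component size is left out.

```python
from collections import Counter

import networkx as nx

# Hypothetical graph with components of size 4, 2, 2 and 1
toy_graph = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("G", "H")])
toy_graph.add_node("I")

toy_component_sizes = [len(c) for c in nx.connected_components(toy_graph)]

# Map each component size to the number of components of that size
size_counts = Counter(toy_component_sizes)

# Drop the largest size, mirroring sizes[1:] in the bar chart
largest_size = max(size_counts)
small_size_counts = {
    size: count for size, count in sorted(size_counts.items()) if size != largest_size
}
print(small_size_counts)
```
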

In [None]:
components = sorted(nx.connected_components(lipid_class_subgraph), key=len, reverse=True)

In [None]:
# components[1] - second biggest component
second_biggest_component = rdb.df_chebi_cmpname.merge(pd.DataFrame(components[1], columns=['chebiid']), on='chebiid')

In [None]:
second_biggest_component

## Members of the remaining subgraphs (1 to 6 compounds each)

In [None]:
# Flatten every component after the two largest into a single list of ChEBI IDs
flatlist = [chebiid for component in components[2:] for chebiid in component]
all_the_rest = rdb.df_chebi_cmpname.merge(pd.DataFrame(flatlist, columns=['chebiid']), on='chebiid')

In [None]:
print(len(list(all_the_rest['cmpname'].unique())))
list(all_the_rest['cmpname'].unique())

## BioREx

 

We have thousands of compound pairs from BioREx that are not in Rhea. We suggest using these for gap-filling, specifically in the lipid subnetwork. How many of these unique BioREx pairs include lipids? These would be the first for curators to look at.

One could simply report the number.

TODO: show a TMAP with the lipid pairs coloured.
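
Reporting the number could look like the following minimal sketch. The pair table and the set of lipid IDs are invented for illustration; in the notebook the lipid IDs would come from SwissLipids and the pairs from the BioREx file.

```python
import pandas as pd

# Hypothetical compound pairs and a hypothetical set of lipid ChEBI IDs
toy_pairs = pd.DataFrame({
    "chebiid1": ["CHEBI:1", "CHEBI:2", "CHEBI:3"],
    "chebiid2": ["CHEBI:2", "CHEBI:4", "CHEBI:5"],
})
toy_lipid_ids = {"CHEBI:1", "CHEBI:2"}

# Boolean masks: is each member of the pair a known lipid?
first_is_lipid = toy_pairs["chebiid1"].isin(toy_lipid_ids)
second_is_lipid = toy_pairs["chebiid2"].isin(toy_lipid_ids)

# Pairs with at least one lipid, and pairs where both members are lipids
n_pairs_any_lipid = int((first_is_lipid | second_is_lipid).sum())
n_pairs_both_lipids = int((first_is_lipid & second_is_lipid).sum())

print("at least one lipid:", n_pairs_any_lipid)
print("both lipids:", n_pairs_both_lipids)
```
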


In [None]:
df_biorex_pairs = pd.read_csv("/scratch/biorex/data/pairs_all.tsv", sep='\t', usecols=['chebiid1', 'chebiid2'])

In [None]:
df_biorex_pairs['chebinum1']=df_biorex_pairs['chebiid1'].apply(
    lambda x: x.split(':')[1]
)
df_biorex_pairs['chebinum2']=df_biorex_pairs['chebiid2'].apply(
    lambda x: x.split(':')[1]
)
chebiids = set(df_biorex_pairs['chebinum2'].to_list()+df_biorex_pairs['chebinum1'].to_list())
sl = df_swisslipids[['CHEBI', 'Lipid ID', 'SMILES (pH7.3)']].copy()
sl['star_smiles'] = sl['SMILES (pH7.3)'].apply(lambda x: '*' in str(x))
sl.drop(columns=['SMILES (pH7.3)'], inplace=True)
df_biorex_pairs_sl = df_biorex_pairs.merge(sl,how='left', left_on='chebinum1', right_on='CHEBI').copy()
df_biorex_pairs_sl = df_biorex_pairs_sl.merge(sl,how='left', left_on='chebinum2', right_on='CHEBI')
df_biorex_pairs_sl.drop(columns=['chebinum1', 'chebinum2', 'CHEBI_y', 'CHEBI_x'], inplace=True)

In [None]:
biorex_lipids = df_swisslipids.merge(pd.DataFrame(chebiids, columns=['CHEBI']), how='inner')
len(biorex_lipids['CHEBI'].unique())

In [None]:
df_biorex_pairs_sl.dropna(how='all', subset=['Lipid ID_x', 'Lipid ID_y'], inplace=True)

## BioREx pairs (chebiid1, chebiid2) with at least one lipid

In [None]:
df_biorex_pairs_sl

## Pairs in which both compounds are lipids

In [None]:
df_biorex_pairs_sl.dropna(how='any', subset=['Lipid ID_x', 'Lipid ID_y'])

## Pairs in which both compounds are class lipids

In [None]:
lipid_class_reactions_biorex=df_biorex_pairs_sl.query('star_smiles_x==True&star_smiles_y==True')
lipid_class_reactions_biorex

## Inject the BioREx pairs into the lipid class subgraph

In [None]:
# Represent each undirected edge as a canonical 'id1;id2' string (IDs sorted),
# so that (a, b) and (b, a) compare equal
edges_old_pair = {';'.join(sorted(edge)) for edge in lipid_class_subgraph.edges()}
new_edges_class_pair = {
    ';'.join(sorted(pair))
    for pair in zip(lipid_class_reactions_biorex['chebiid1'], lipid_class_reactions_biorex['chebiid2'])
}
print(len(edges_old_pair))                         # edges already in the Rhea lipid class subgraph
print(len(new_edges_class_pair))                   # unique BioREx class-class pairs
print(len(new_edges_class_pair & edges_old_pair))  # BioREx pairs already present in Rhea
# Keep only pairs that are new to Rhea. Tuples (rather than sets) keep both IDs
# even when a pair links a compound to itself.
new_edges_class = [tuple(pair.split(';')) for pair in sorted(new_edges_class_pair - edges_old_pair)]

In [None]:
len(new_edges_class)

In [None]:
df_edges = pd.DataFrame(new_edges_class, columns=['chebi1', 'chebi2'])
# Attach compound names for both ends of each new edge
df_edges_m = df_edges.merge(rdb.df_chebi_cmpname, left_on='chebi1', right_on='chebiid')
df_edges_m = df_edges_m.merge(rdb.df_chebi_cmpname, left_on='chebi2', right_on='chebiid')
# Drop self-pairs and the redundant key columns (no in-place edit on a filtered frame)
df_edges_m = df_edges_m.query('chebi1!=chebi2').drop(columns=['chebiid_x', 'chebiid_y'])
df_edges_m.to_csv('edges_new.tsv', sep='\t', index=False)

In [None]:
lipid_class_subgraph.add_edges_from(new_edges_class)

In [None]:
nx.draw_spring(lipid_class_subgraph)

In [None]:
print('nodes:', len(lipid_class_subgraph.nodes()))
print('edges:', len(lipid_class_subgraph.edges()))

In [None]:
component_sizes = [len(c) for c in sorted(nx.connected_components(lipid_class_subgraph), key=len, reverse=True)]
sizes = list(set(component_sizes))
sizes.sort(reverse=True)

In [None]:
print('Biggest subgraph: ', sizes[0], 'class compounds')

In [None]:
plot_subgraphs(sizes, component_sizes)
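
One way to summarise the effect of the injection is to compare the number of connected components before and after adding the new edges. The sketch below does this on a toy graph; the graph and the candidate pairs are hypothetical, and `frozenset` is used here as an order-independent edge key instead of the sorted strings used above.

```python
import networkx as nx

# Hypothetical class graph with three components: {A, B}, {C, D}, {E}
toy_graph = nx.Graph([("A", "B"), ("C", "D")])
toy_graph.add_node("E")

# Hypothetical candidate pairs; the first duplicates an existing edge in reverse order
candidate_pairs = [("B", "A"), ("B", "C"), ("D", "E")]

# frozenset makes (u, v) and (v, u) compare equal for an undirected graph
existing_edges = {frozenset(edge) for edge in toy_graph.edges()}
novel_pairs = [pair for pair in candidate_pairs if frozenset(pair) not in existing_edges]

components_before = nx.number_connected_components(toy_graph)
toy_graph.add_edges_from(novel_pairs)
components_after = nx.number_connected_components(toy_graph)

print("novel pairs:", novel_pairs)
print("components:", components_before, "->", components_after)
```
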

## BioREx pairs in which both compounds are defined lipids

In [None]:
lipid_defined_reactions_biorex=df_biorex_pairs_sl.query('star_smiles_x==False&star_smiles_y==False')
lipid_defined_reactions_biorex

## Analyse the defined lipid structures subgraph

In [None]:
# Keep only compounds with fully defined structures (no '*' wildcard in the SMILES)
df_common_compounds_defined = df_common_compounds[~df_common_compounds['star']]
print(len(df_common_compounds_defined['CHEBI'].unique()))      # unique defined lipid ChEBI IDs
print(len(df_common_compounds_defined['MASTER_ID'].unique()))  # unique Rhea MASTER IDs involving them

In [None]:
lipid_defined_subgraph = rxnnet.participants_graph.subgraph(set(df_common_compounds_defined['chebiid'].to_list())).copy()

In [None]:
nx.draw_spring(lipid_defined_subgraph)

In [None]:
print('nodes:', len(lipid_defined_subgraph.nodes()))
print('edges:', len(lipid_defined_subgraph.edges()))
component_sizes = [len(c) for c in sorted(nx.connected_components(lipid_defined_subgraph), key=len, reverse=True)]
sizes = list(set(component_sizes))
sizes.sort(reverse=True)
print('Biggest subgraph: ', sizes[0], 'defined compounds')

In [None]:
plot_subgraphs(sizes, component_sizes)

## Inject the BioREx defined-lipid pairs

In [None]:
# Each edge links the two members of a BioREx pair (chebiid1 to chebiid2)
new_edges = list(zip(lipid_defined_reactions_biorex['chebiid1'], lipid_defined_reactions_biorex['chebiid2']))
lipid_defined_subgraph.add_edges_from(new_edges)

In [None]:
nx.draw_spring(lipid_defined_subgraph)

In [None]:
print('nodes:', len(lipid_defined_subgraph.nodes()))
print('edges:', len(lipid_defined_subgraph.edges()))
component_sizes = [len(c) for c in sorted(nx.connected_components(lipid_defined_subgraph), key=len, reverse=True)]
sizes = list(set(component_sizes))
sizes.sort(reverse=True)
print('Biggest subgraph: ', sizes[0], 'defined compounds')

In [None]:
plot_subgraphs(sizes, component_sizes)