This notebook loads all nodes and edges of the selected knowledge graph. It then builds new dataframes in which every node has an integer index and every edge carries the label, semantic class and index of its head and tail nodes.

# Import Libraries

In [52]:
import pandas as pd

# Set Parameters

Specify which dataset is used.
*   1 (Original knowledge graph)
*   2 (Restructured knowledge graph)

In [53]:
dataset_nr = 2
assert dataset_nr in (1, 2)  # `== 1 or 2` is always truthy and never fails

disease_prefix = 'hd'
assert disease_prefix in ('dmd', 'hd', 'oi')  # membership test instead of chained `or`

if dataset_nr == 1:
    prefix = 'prev'
else:
    prefix = 'restr'
    
FILE_EDGES = f'{prefix}_{disease_prefix}_kg_edges.csv'
FILE_NODES = f'{prefix}_{disease_prefix}_kg_nodes.csv'
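
The naming scheme above can also be wrapped in a small helper that rejects unknown parameter values before any file is read. This is only a sketch: `kg_paths`, `DATASET_PREFIXES`, `DISEASE_PREFIXES` and the `base_dir` argument are hypothetical names introduced for illustration, while the file name pattern and the `../output` folder are taken from the cells in this notebook.

```python
from pathlib import Path

# Hypothetical lookup tables mirroring the parameter cell above.
DATASET_PREFIXES = {1: 'prev', 2: 'restr'}
DISEASE_PREFIXES = ('dmd', 'hd', 'oi')

def kg_paths(dataset_nr, disease_prefix, base_dir='../output'):
    """Return (edges_path, nodes_path) for a validated parameter pair."""
    if dataset_nr not in DATASET_PREFIXES:
        raise ValueError(f'unknown dataset_nr: {dataset_nr!r}')
    if disease_prefix not in DISEASE_PREFIXES:
        raise ValueError(f'unknown disease_prefix: {disease_prefix!r}')
    prefix = DATASET_PREFIXES[dataset_nr]
    folder = Path(base_dir) / disease_prefix
    return (folder / f'{prefix}_{disease_prefix}_kg_edges.csv',
            folder / f'{prefix}_{disease_prefix}_kg_nodes.csv')

edges_path, nodes_path = kg_paths(2, 'hd')
print(edges_path.name, nodes_path.name)
```

Raising `ValueError` rather than using `assert` keeps the check active even when Python runs with assertions disabled.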

# Load Edges

Load all edges of the knowledge graph.

In [54]:
edges = pd.read_csv(f'../output/{disease_prefix}/{FILE_EDGES}', header=0)
edges.head(10)

Unnamed: 0,id,subject,object,relation_id,relation_label,relation_iri
0,c5f18132a9521c15c8abd0240da2c89c,HGNC:11280,HGNC:9539,RO:0002434,interacts with,
1,2f4182da8cf6af6ca49912ea3ceabfee,FlyBase:FBgn0264855,MGI:101921,RO:HOM0000017,in orthology relationship with,
2,5ca2f0ea7df2c1f6973774b80a1bcade,SGD:S000001709,WormBase:WBGene00001134,RO:HOM0000017,in orthology relationship with,
3,0f2324357c6e3d10ae1e4573da055ae6,RGD:1595923,ENSEMBL:ENSDARG00000115781,RO:HOM0000017,in orthology relationship with,
4,d7253620ceec8e39bcc3ab6cae816df6,WormBase:WBGene00012735,FlyBase:CG5669,RO:HOM0000017,in orthology relationship with,
5,24902093f06314c017791733102c32b5,Xenbase:XB-GENE-486558,Xenbase:XB-GENE-967617,RO:0002434,interacts with,
6,3fa27d22ca82a4a9a92d8691fda05019,ENSEMBL:ENSCAFG00000017516,FlyBase:FBgn0034774,RO:HOM0000017,in orthology relationship with,
7,077bfbc256527fabd59ce71824c99298,ENSEMBL:ENSECAG00000022119,ENSEMBL:ENSMODG00000015091,RO:HOM0000017,in orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000017
8,1cdd6ffe016bea4ca3ae5d547e89742f,ENSEMBL:ENSRNOG00000049403,ENSEMBL:ENSCAFG00000029076,RO:HOM0000017,in orthology relationship with,
9,426d243f415d40b2833d6a3799625954,Xenbase:XB-GENE-952286,HGNC:1856,RO:HOM0000017,in orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000017


Extract the relevant columns and rename them

In [55]:
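# Select and rename in one step so that `edges` is a new dataframe rather than
# a slice of the original (avoids SettingWithCopyWarning in later assignments).
edges = edges[['subject', 'relation_label', 'object']].rename(
    columns={'subject': 'head', 'relation_label': 'relation', 'object': 'tail'})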
edges.head(10)

Unnamed: 0,head,relation,tail
0,HGNC:11280,interacts with,HGNC:9539
1,FlyBase:FBgn0264855,in orthology relationship with,MGI:101921
2,SGD:S000001709,in orthology relationship with,WormBase:WBGene00001134
3,RGD:1595923,in orthology relationship with,ENSEMBL:ENSDARG00000115781
4,WormBase:WBGene00012735,in orthology relationship with,FlyBase:CG5669
5,Xenbase:XB-GENE-486558,interacts with,Xenbase:XB-GENE-967617
6,ENSEMBL:ENSCAFG00000017516,in orthology relationship with,FlyBase:FBgn0034774
7,ENSEMBL:ENSECAG00000022119,in orthology relationship with,ENSEMBL:ENSMODG00000015091
8,ENSEMBL:ENSRNOG00000049403,in orthology relationship with,ENSEMBL:ENSCAFG00000029076
9,Xenbase:XB-GENE-952286,in orthology relationship with,HGNC:1856


In [56]:
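def convert_to_int_str(value):
    """Return numeric-looking values as integer strings (e.g. '123.0' -> '123').

    Values that cannot be parsed as a number, such as 'HGNC:11280',
    are returned unchanged.
    """
    try:
        return str(int(float(value)))
    except ValueError:
        return value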

edges['head'] = edges['head'].apply(convert_to_int_str)
edges['tail'] = edges['tail'].apply(convert_to_int_str)
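
As a small illustration of what `convert_to_int_str` does, the sketch below applies it to a few made-up values. The sample inputs are assumptions chosen for the example and are not taken from the dataset: numeric-looking identifiers are normalized to integer strings, while CURIE-style identifiers pass through unchanged.

```python
def convert_to_int_str(value):
    # Same logic as the notebook cell above.
    try:
        return str(int(float(value)))
    except ValueError:
        return value

# Hypothetical inputs: a float-like string, an integer string,
# an actual float, and a prefixed identifier.
samples = ['123.0', '42', 7.0, 'HGNC:11280']
converted = [convert_to_int_str(v) for v in samples]
print(converted)  # ['123', '42', '7', 'HGNC:11280']
```

Only `ValueError` is caught, so a value of `None` would raise a `TypeError` from `float()` instead of being returned as-is.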

In [57]:
print(f'There are {edges.shape[0]} edges')

There are 230228 edges


# Load Nodes

Load all nodes of the knowledge graph.

In [58]:
nodes = pd.read_csv(f'../output/{disease_prefix}/{FILE_NODES}', header=0)
nodes['index_id'] = nodes.index
nodes.head(10)

Unnamed: 0,id,label,iri,semantic,index_id
0,ENSEMBL:ENSSSCG00000033987,UBE2E3,,gene,0
1,MP:0000790,abnormal stratification in cerebral cortex,,phenotype,1
2,ZP:0011626,"cell trunk apoptotic, abnormal",,phenotype,2
3,ZP:0019272,pronephric nephron tubule epithelial cell diff...,,phenotype,3
4,ENSEMBL:ENSACAG00000007327,ENSEMBL:ENSACAG00000007327,,gene,4
5,MP:0009154,pancreatic acinar hypoplasia,,phenotype,5
6,ENSEMBL:ENSOANG00000047629,ENSEMBL:ENSOANG00000047629,,gene,6
7,WormBase:WBGene00002583,let-363,,gene,7
8,ZP:0000943,"eye morphology, abnormal",,phenotype,8
9,ENSEMBL:ENSSSCG00000012913,PPP1CA,,gene,9


Extract relevant columns

In [59]:
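# .copy() makes `nodes` an independent dataframe, so columns can be added later
# without triggering SettingWithCopyWarning.
nodes = nodes[['index_id', 'id', 'semantic', 'label']].copy()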
nodes.head(10)

Unnamed: 0,index_id,id,semantic,label
0,0,ENSEMBL:ENSSSCG00000033987,gene,UBE2E3
1,1,MP:0000790,phenotype,abnormal stratification in cerebral cortex
2,2,ZP:0011626,phenotype,"cell trunk apoptotic, abnormal"
3,3,ZP:0019272,phenotype,pronephric nephron tubule epithelial cell diff...
4,4,ENSEMBL:ENSACAG00000007327,gene,ENSEMBL:ENSACAG00000007327
5,5,MP:0009154,phenotype,pancreatic acinar hypoplasia
6,6,ENSEMBL:ENSOANG00000047629,gene,ENSEMBL:ENSOANG00000047629
7,7,WormBase:WBGene00002583,gene,let-363
8,8,ZP:0000943,phenotype,"eye morphology, abnormal"
9,9,ENSEMBL:ENSSSCG00000012913,gene,PPP1CA


In [60]:
print(f'There are {nodes.shape[0]} nodes')

There are 14882 nodes


Convert semantic groups from categorical to numerical values (`semantic_id`)

In [61]:
nodes['semantic'] = nodes['semantic'].astype('category')
nodes['semantic_id'] = nodes['semantic'].cat.codes
nodes.head(10)

Unnamed: 0,index_id,id,semantic,label,semantic_id
0,0,ENSEMBL:ENSSSCG00000033987,gene,UBE2E3,4
1,1,MP:0000790,phenotype,abnormal stratification in cerebral cortex,8
2,2,ZP:0011626,phenotype,"cell trunk apoptotic, abnormal",8
3,3,ZP:0019272,phenotype,pronephric nephron tubule epithelial cell diff...,8
4,4,ENSEMBL:ENSACAG00000007327,gene,ENSEMBL:ENSACAG00000007327,4
5,5,MP:0009154,phenotype,pancreatic acinar hypoplasia,8
6,6,ENSEMBL:ENSOANG00000047629,gene,ENSEMBL:ENSOANG00000047629,4
7,7,WormBase:WBGene00002583,gene,let-363,4
8,8,ZP:0000943,phenotype,"eye morphology, abnormal",8
9,9,ENSEMBL:ENSSSCG00000012913,gene,PPP1CA,4
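
To make the encoding step concrete, here is a minimal sketch on a toy dataframe (the three-category `toy` frame is an assumption for illustration, not the real node table). With string data, `astype('category')` orders the categories alphabetically, and `cat.codes` is each value's position in that ordering.

```python
import pandas as pd

# Toy stand-in for the `semantic` column of the node table.
toy = pd.DataFrame({'semantic': ['gene', 'phenotype', 'disease', 'gene']})
toy['semantic'] = toy['semantic'].astype('category')
toy['semantic_id'] = toy['semantic'].cat.codes

categories = list(toy['semantic'].cat.categories)
codes = toy['semantic_id'].tolist()
print(categories)  # ['disease', 'gene', 'phenotype']
print(codes)       # [1, 2, 0, 1]
```

Because the codes depend on which categories are present, the same label can receive a different `semantic_id` in a dataset with a different set of semantic groups.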


Create a dictionary that can be used as mapping between `semantic_id` and `semantic`

In [62]:
node_semantic_dict = dict(enumerate(nodes['semantic'].cat.categories))
node_semantic_dict

{0: 'biological artifact',
 1: 'biological process',
 2: 'disease',
 3: 'drug',
 4: 'gene',
 5: 'gene product',
 6: 'genotype',
 7: 'molecular function',
 8: 'phenotype',
 9: 'taxon',
 10: 'variant'}
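
A dictionary of this shape can be used to translate numeric ids back into labels, or inverted to look up the id for a label. The sketch below uses a shortened, made-up dictionary and a hypothetical `semantic_to_id` name; it is one possible way to use the mapping, not a step performed in this notebook.

```python
import pandas as pd

# Shortened toy version of the mapping (assumed values for illustration).
node_semantic_dict = {0: 'disease', 1: 'gene', 2: 'phenotype'}

# Inverse lookup: label -> id (hypothetical helper name).
semantic_to_id = {label: code for code, label in node_semantic_dict.items()}

# Decode a column of numeric ids back into readable labels.
ids = pd.Series([1, 2, 0, 1])
labels = ids.map(node_semantic_dict)
print(labels.tolist())        # ['gene', 'phenotype', 'disease', 'gene']
print(semantic_to_id['gene'])  # 1
```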

Save new dataframe containing all nodes into csv file

In [63]:
nodes.to_csv(f'../output/{disease_prefix}/{prefix}_{disease_prefix}_indexed_nodes.csv', index=False)

# Joining Dataframes

Merge the edge and node dataframes so that each edge carries the label, semantic class and index of its head node and of its tail node.

In [64]:
kg_df = pd.merge(edges, nodes,
                 left_on='head', right_on='id',
                 how='inner')
# Select and rename in one step to get a new dataframe instead of modifying a slice in place.
kg_df = kg_df[['head', 'label', 'semantic_id', 'index_id', 'relation', 'tail']].rename(
    columns={'label': 'label_head', 'semantic_id': 'class_head', 'index_id': 'index_head'})
kg_df.head(10)

Unnamed: 0,head,label_head,class_head,index_head,relation,tail
0,HGNC:11280,SQSTM1,4,9467,interacts with,HGNC:9539
1,FlyBase:FBgn0264855,AP-2alpha,4,6589,in orthology relationship with,MGI:101921
2,SGD:S000001709,VPS1,4,9926,in orthology relationship with,WormBase:WBGene00001134
3,RGD:1595923,RT1-A2,4,9824,in orthology relationship with,ENSEMBL:ENSDARG00000115781
4,WormBase:WBGene00012735,sptf-3,4,1165,in orthology relationship with,FlyBase:CG5669
5,Xenbase:XB-GENE-486558,pias1,4,8551,interacts with,Xenbase:XB-GENE-967617
6,ENSEMBL:ENSCAFG00000017516,CALM1,4,10763,in orthology relationship with,FlyBase:FBgn0034774
7,ENSEMBL:ENSECAG00000022119,ENSEMBL:ENSECAG00000022119,4,8715,in orthology relationship with,ENSEMBL:ENSMODG00000015091
8,ENSEMBL:ENSRNOG00000049403,ENSEMBL:ENSRNOG00000049403,4,2590,in orthology relationship with,ENSEMBL:ENSCAFG00000029076
9,Xenbase:XB-GENE-952286,cenpe,4,8623,in orthology relationship with,HGNC:1856


In [65]:
kg_df = pd.merge(kg_df, nodes,
                 left_on='tail', right_on='id',
                 how='left')
# Select and rename in one step to get a new dataframe instead of modifying a slice in place.
kg_df = kg_df[['head', 'label_head', 'class_head', 'index_head', 'relation', 'tail', 'label', 'semantic_id', 'index_id']].rename(
    columns={'label': 'label_tail', 'semantic_id': 'class_tail', 'index_id': 'index_tail'})
kg_df.head(10)

Unnamed: 0,head,label_head,class_head,index_head,relation,tail,label_tail,class_tail,index_tail
0,HGNC:11280,SQSTM1,4,9467,interacts with,HGNC:9539,PSMB2,4,12355
1,FlyBase:FBgn0264855,AP-2alpha,4,6589,in orthology relationship with,MGI:101921,Ap2a1,4,10645
2,SGD:S000001709,VPS1,4,9926,in orthology relationship with,WormBase:WBGene00001134,eat-3,4,8881
3,RGD:1595923,RT1-A2,4,9824,in orthology relationship with,ENSEMBL:ENSDARG00000115781,CR339041.3,4,14166
4,WormBase:WBGene00012735,sptf-3,4,1165,in orthology relationship with,FlyBase:CG5669,FlyBase:CG5669,4,3446
5,Xenbase:XB-GENE-486558,pias1,4,8551,interacts with,Xenbase:XB-GENE-967617,cetn1,4,6209
6,ENSEMBL:ENSCAFG00000017516,CALM1,4,10763,in orthology relationship with,FlyBase:FBgn0034774,CG13526,4,1670
7,ENSEMBL:ENSECAG00000022119,ENSEMBL:ENSECAG00000022119,4,8715,in orthology relationship with,ENSEMBL:ENSMODG00000015091,ENSEMBL:ENSMODG00000015091,4,3536
8,ENSEMBL:ENSRNOG00000049403,ENSEMBL:ENSRNOG00000049403,4,2590,in orthology relationship with,ENSEMBL:ENSCAFG00000029076,CCDC126,4,10793
9,Xenbase:XB-GENE-952286,cenpe,4,8623,in orthology relationship with,HGNC:1856,CENPE,4,3187
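
The two merges treat missing nodes differently: the head merge uses `how='inner'`, which drops edges whose head is not in the node table, while the tail merge uses `how='left'`, which keeps such edges with missing values in the tail columns. One way to check for unmatched identifiers is the `indicator` parameter of `merge`. The sketch below runs on toy frames (the data and the `checked` / `unmatched` names are assumptions for illustration).

```python
import pandas as pd

# Toy edge and node tables; tail 'X' has no entry in the node table.
edges = pd.DataFrame({'head': ['A', 'B'],
                      'relation': ['r', 'r'],
                      'tail': ['B', 'X']})
nodes = pd.DataFrame({'id': ['A', 'B'], 'index_id': [0, 1]})

# indicator=True adds a `_merge` column telling where each row was found.
checked = edges.merge(nodes, left_on='tail', right_on='id',
                      how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['head', 'relation', 'tail']])
```

An empty `unmatched` frame would indicate that every tail identifier was found in the node table.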


Encode string values of relation labels as numeric values.

In [66]:
kg_df['type'], relation_labels = pd.factorize(kg_df['relation'])
kg_df.head(10)

Unnamed: 0,head,label_head,class_head,index_head,relation,tail,label_tail,class_tail,index_tail,type
0,HGNC:11280,SQSTM1,4,9467,interacts with,HGNC:9539,PSMB2,4,12355,0
1,FlyBase:FBgn0264855,AP-2alpha,4,6589,in orthology relationship with,MGI:101921,Ap2a1,4,10645,1
2,SGD:S000001709,VPS1,4,9926,in orthology relationship with,WormBase:WBGene00001134,eat-3,4,8881,1
3,RGD:1595923,RT1-A2,4,9824,in orthology relationship with,ENSEMBL:ENSDARG00000115781,CR339041.3,4,14166,1
4,WormBase:WBGene00012735,sptf-3,4,1165,in orthology relationship with,FlyBase:CG5669,FlyBase:CG5669,4,3446,1
5,Xenbase:XB-GENE-486558,pias1,4,8551,interacts with,Xenbase:XB-GENE-967617,cetn1,4,6209,0
6,ENSEMBL:ENSCAFG00000017516,CALM1,4,10763,in orthology relationship with,FlyBase:FBgn0034774,CG13526,4,1670,1
7,ENSEMBL:ENSECAG00000022119,ENSEMBL:ENSECAG00000022119,4,8715,in orthology relationship with,ENSEMBL:ENSMODG00000015091,ENSEMBL:ENSMODG00000015091,4,3536,1
8,ENSEMBL:ENSRNOG00000049403,ENSEMBL:ENSRNOG00000049403,4,2590,in orthology relationship with,ENSEMBL:ENSCAFG00000029076,CCDC126,4,10793,1
9,Xenbase:XB-GENE-952286,cenpe,4,8623,in orthology relationship with,HGNC:1856,CENPE,4,3187,1


In [67]:
relation_labels

Index(['interacts with', 'in orthology relationship with', 'causes condition',
       'is variant in', 'enables', 'found in', 'contributes to condition',
       'colocalizes with', 'has role in modeling', 'expresses gene',
       'associated with phenotype', 'is of', 'targets',
       'is substance that treats', 'has affected feature', 'is product of',
       'is allele of', 'has genotype', 'involved in'],
      dtype='object')
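
For reference, `pd.factorize` numbers the labels in order of first appearance and returns the unique labels alongside the codes, so the codes can be decoded again by indexing into the uniques. A minimal sketch on a toy series (the sample relations are picked from the list above, the series itself is made up):

```python
import pandas as pd

relations = pd.Series(['interacts with',
                       'in orthology relationship with',
                       'interacts with',
                       'targets'])

# codes[i] is the position of relations[i] in `uniques`.
codes, uniques = pd.factorize(relations)
print(codes.tolist())  # [0, 1, 0, 2]

# Decode the integer codes back into relation labels.
decoded = list(uniques[codes])
print(decoded == relations.tolist())  # True
```

Since the numbering follows first appearance, the code assigned to a relation depends on the row order of the input.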

Save new dataframe containing all edges into csv file

In [68]:
kg_df.to_csv(f'../output/{disease_prefix}/{prefix}_{disease_prefix}_indexed_edges.csv', index=False)