# Subsample Hetionet edges to determine effect on performance

The previous version preserved any edge whose removal would have disconnected a gold standard node. This version does not artificially limit which edges are removed, so gold standard nodes may disappear from the network. Such nodes are counted as unable to be predicted.

Here we remove entire edge types to see whether performance diminishes.

In [1]:
import pandas as pd
from collections import defaultdict
from tqdm import tqdm

import numpy as np
from itertools import product

In [2]:
np.random.seed(20180329)

## Read Hetionet

Using Hetionet for now because it should be faster to run than SemMedDB.

In [3]:
hnodes = (pd
    .read_csv("../../merge/data/hetionet/hetnet_nodes.csv", sep=',')
    .drop("identifier:string", axis=1)
    .rename(columns={
        ":ID": "node_id",
        "name:string": "name",
        ":LABEL": "het_type"
    })
)

In [4]:
hedges = (pd
    .read_csv("../../merge/data/hetionet/hetnet_edges.csv", sep=',')
    .rename(columns={
        ":START_ID": "start_id",
        ":END_ID": "end_id",
        ":TYPE": "het_etype"
    })
)

In [5]:
hnodes.head()

Unnamed: 0,node_id,name,het_type
0,DB00795,Sulfasalazine,Compound
1,N0000000151,Histamine H2 Receptor Antagonists,Pharmacologic Class
2,100996420,DNM1P50,Gene
3,DB04898,Ximelagatran,Compound
4,C0278151,Facial spasm,Side Effect


In [6]:
hedges.head()

Unnamed: 0,start_id,end_id,het_etype
0,8568,GO:0042254,PARTICIPATES_GpBP
1,UBERON:0002450,51316,EXPRESSES_AeG
2,4893,PC7_8430,PARTICIPATES_GpPW
3,UBERON:0002185,55186,EXPRESSES_AeG
4,119504,GO:0033047,PARTICIPATES_GpBP


In [7]:
gold = (pd
    .read_csv("../hetio_dw/filtered_semmed_gold_for_hetionet.tsv", sep='\t')
    .rename(columns={
        "chemical_id": "chemical_cui",
        "disease_id": "disease_cui",
        "chemical_hetid": "chemical_id",
        "disease_hetid": "disease_id"
    })
)

In [8]:
gold.shape

(2924, 9)

In [9]:
gold.head()

Unnamed: 0,chemical_cui,chemical_name,disease_cui,disease_name,etype,chemical_id,chemical_htype,disease_id,disease_htype
0,UMLS:C0000477,fampridine,UMLS:C0026769,multiple sclerosis,TREATS_CDtDO,DB06637,Compound,C0026769,Side Effect
1,UMLS:C0000545,eicosapentaenoic acid,UMLS:C0032961,"Pregnancy, function",TREATS_CDtDO,DB00159,Compound,C0032961,Side Effect
2,UMLS:C0000618,mercaptopurine,UMLS:C0023449,acute lymphocytic leukemia,TREATS_CDtDO,DB01033,Compound,C0023449,Side Effect
3,UMLS:C0000618,mercaptopurine,UMLS:C0023487,acute promyelocytic leukemia,TREATS_CDtDO,DB01033,Compound,C0023487,Side Effect
4,UMLS:C0000956,acenocoumarol,UMLS:C0034065,Pulmonary embolism,TREATS_CDtDO,DB01418,Compound,C0034065,Side Effect


In [10]:
gold_nodes = set(gold["chemical_id"]) | set(gold["disease_id"])

In [11]:
len(gold_nodes)

1342

---

In [12]:
edge_nodes = set(hedges["start_id"]) | set(hedges["end_id"])

In [13]:
len(edge_nodes)

45158

In [14]:
edge_nodes >= gold_nodes

False

---

In [15]:
set(hnodes["node_id"]) >= gold_nodes

True

In [16]:
len(set(hnodes["node_id"]) - edge_nodes)

1873

In [17]:
# Parentheses make the intended grouping explicit: `-` binds tighter than `&`
len((set(hnodes["node_id"]) & gold_nodes) - edge_nodes)

3

Only three gold standard nodes have no edges at all in the original Hetionet. This is few enough that we don't need to treat them differently. The vast majority of the 1873 edgeless nodes can be safely ignored since they do not appear in the gold standard.

---

## Add node types to Hetionet edges

In [18]:
edges = (hedges
    .merge(
        hnodes, how="inner",
        left_on="start_id", right_on="node_id"
    )
    .drop(["node_id", "name"], axis=1)
    .rename(columns={"het_type": "start_htype"})

    .merge(
        hnodes, how="inner",
        left_on="end_id", right_on="node_id"
    )
    .drop(["node_id", "name"], axis=1)
    .rename(columns={"het_type": "end_htype"})
)

In [19]:
edges.head()

Unnamed: 0,start_id,end_id,het_etype,start_htype,end_htype
0,8568,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process
1,6201,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process
2,6223,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process
3,6202,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process
4,65003,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process


In [20]:
edges.groupby(["start_htype", "end_htype"]).size()

start_htype          end_htype         
Anatomy              Gene                  726495
Compound             Compound                6486
                     Gene                   51429
                     Side Effect           138944
Disease              Anatomy                 3602
                     Disease                  543
                     Gene                   27977
                     Symptom                 3357
Gene                 Biological Process    559504
                     Cellular Component     73566
                     Gene                  474526
                     Molecular Function     97222
                     Pathway                84372
Pharmacologic Class  Compound                1029
dtype: int64

In [21]:
edges.groupby(["start_htype", "end_htype"]).size().shape

(14,)

There are 14 unique node-type pairs. Since deepwalk collapses the edge semantics, let's subsample edges based on the pair of node types they link, rather than on the semantic edge type that links the nodes.
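As a rough illustration of subsampling by node-type pair, the sketch below keeps a chosen fraction of the edges that link one `(start_htype, end_htype)` pair and leaves every other edge untouched. The function name `subsample_pair`, its parameters, and the toy data are hypothetical and not part of this notebook. It also assumes the edge table has a unique index.

```python
import pandas as pd


def subsample_pair(edges, shtype, ehtype, keep_frac, seed=0):
    """Keep `keep_frac` of the edges linking (shtype, ehtype); keep all others.

    Hypothetical helper: assumes `edges` has a unique index and the
    `start_htype` / `end_htype` columns used in this notebook.
    """
    in_pair = (edges["start_htype"] == shtype) & (edges["end_htype"] == ehtype)
    # Randomly choose which edges of this node-type pair survive
    kept_idx = edges[in_pair].sample(frac=keep_frac, random_state=seed).index
    return edges[~in_pair | edges.index.isin(kept_idx)]


# Toy data: four Compound-Gene edges and one Disease-Gene edge
toy = pd.DataFrame({
    "start_id": ["c1", "c1", "c2", "c2", "d1"],
    "end_id": ["g1", "g2", "g1", "g2", "g1"],
    "start_htype": ["Compound"] * 4 + ["Disease"],
    "end_htype": ["Gene"] * 5,
})

half = subsample_pair(toy, "Compound", "Gene", keep_frac=0.5)
none = subsample_pair(toy, "Compound", "Gene", keep_frac=0.0)
```

Setting `keep_frac=0.0` corresponds to removing the whole node-type pair, which is the case this notebook runs.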

---

## Assign deepwalk metadata

In [22]:
fhnodes = (hnodes
    .reset_index()
    .rename(columns={
        "index": "node_uid"
    })
)

In [23]:
fhnodes.head()

Unnamed: 0,node_uid,node_id,name,het_type
0,0,DB00795,Sulfasalazine,Compound
1,1,N0000000151,Histamine H2 Receptor Antagonists,Pharmacologic Class
2,2,100996420,DNM1P50,Gene
3,3,DB04898,Ximelagatran,Compound
4,4,C0278151,Facial spasm,Side Effect


### Add deepwalk ids to edges

In [24]:
fedges = (edges
    .merge(
        fhnodes[["node_uid", "node_id"]],
        how="inner", left_on="start_id", right_on="node_id"
    )
    .drop("node_id", axis=1)
    .rename(columns={"node_uid": "source_uid"})

    .merge(
        fhnodes[["node_uid", "node_id"]],
        how="inner", left_on="end_id", right_on="node_id"
    )
    .drop("node_id", axis=1)
    .rename(columns={"node_uid": "target_uid"})
)

In [25]:
fedges.head()

Unnamed: 0,start_id,end_id,het_etype,start_htype,end_htype,source_uid,target_uid
0,8568,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process,26311,5975
1,6201,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process,9449,5975
2,6223,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process,32586,5975
3,6202,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process,21669,5975
4,65003,GO:0042254,PARTICIPATES_GpBP,Gene,Biological Process,20653,5975


---

## Subsample edges

This new version allows a gold standard node to be removed if the removed edge was the only thing linking that node into the network.
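One way to see which gold standard nodes drop out of a subsampled network, and which gold pairs can therefore no longer be predicted, is sketched below. The helper names `lost_gold_nodes` and `unpredictable_pairs` and the toy data are hypothetical. The sketch assumes the `start_id` / `end_id` and `chemical_id` / `disease_id` columns used in this notebook.

```python
import pandas as pd


def lost_gold_nodes(edges, gold):
    """Gold standard node ids that no longer touch any edge (hypothetical helper)."""
    gold_nodes = set(gold["chemical_id"]) | set(gold["disease_id"])
    remaining = set(edges["start_id"]) | set(edges["end_id"])
    return gold_nodes - remaining


def unpredictable_pairs(edges, gold):
    """Gold pairs with at least one endpoint missing from the edge list."""
    lost = lost_gold_nodes(edges, gold)
    mask = gold["chemical_id"].isin(lost) | gold["disease_id"].isin(lost)
    return gold[mask]


# Toy data: DB3 has no edges left, so the (DB3, D1) pair cannot be predicted
toy_edges = pd.DataFrame({"start_id": ["DB1", "DB2"], "end_id": ["G1", "D1"]})
toy_gold = pd.DataFrame({
    "chemical_id": ["DB1", "DB3"],
    "disease_id": ["D1", "D1"],
})

lost = lost_gold_nodes(toy_edges, toy_gold)
unpred = unpredictable_pairs(toy_edges, toy_gold)
```

Running the same check on each subsampled edge file would give the number of gold pairs to count as missed for that file.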

In [26]:
fhnodes.to_csv("tmp/hetionet_nodes.tsv", sep='\t', index=False)

In [27]:
gold.to_csv("tmp/hetionet_gold.tsv", sep='\t', index=False)

---

## Remove all edges of a certain type

In [28]:
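# Write one edge file per node-type pair, with every edge of that pair removed.
# Assumes the tmp/edges/ directory already exists.
for node_link, edge_pool in fedges.groupby(["start_htype", "end_htype"]):
    print(node_link)

    shtype, ehtype = node_link

    # All edges that do not link this (start, end) node-type pair
    others = fedges.query(
        "start_htype != @shtype or end_htype != @ehtype"
    )

    # Keep 0% of this pair's edges, i.e. drop the entire edge pool
    keep_pct = 0
    new_edges = others

    fname = "tmp/edges/{}_{}_{}_kept.tsv".format(shtype, ehtype, keep_pct)
    new_edges.to_csv(fname, sep='\t', index=False)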

('Anatomy', 'Gene')
('Compound', 'Compound')
('Compound', 'Gene')
('Compound', 'Side Effect')
('Disease', 'Anatomy')
('Disease', 'Disease')
('Disease', 'Gene')
('Disease', 'Symptom')
('Gene', 'Biological Process')
('Gene', 'Cellular Component')
('Gene', 'Gene')
('Gene', 'Molecular Function')
('Gene', 'Pathway')
('Pharmacologic Class', 'Compound')
