# Determine how to merge Hetionet with SemmedDB

Executive decision after discussion: overwrite Hetionet node types with SemmedDB types. For Hetionet nodes whose CUIs are not in SemmedDB, change the type to the SemmedDB type most commonly associated with that Hetionet type.

Keep any Hetionet nodes without a SemmedDB CUI in the Hetionet id space.

**New:**

Try to merge SemmedDB types and edges into Hetionet space. This reduces complexity and lets us try running Rephetio on the resulting network.

In [1]:
import pandas as pd

## Read Hetionet nodes

In [2]:
hnodes = pd.read_csv("../pipeline/hetionet/hetionet_nodes_umls.tsv", sep='\t')

In [3]:
hnodes.shape

(58700, 4)

In [4]:
hnodes.head()

Unnamed: 0,hetio_id,name,het_type,cui
0,1,A1BG,Gene,UMLS:C1412045
1,10,NAT2,Gene,UMLS:C0796518
2,100,ADA,Gene,UMLS:C1412179
3,1000,CDH2,Gene,UMLS:C1413277
4,10000,AKT3,Gene,UMLS:C1332074


---

## Read SemmedDB nodes

In [5]:
snodes = (pd
    .read_csv("../data/semmeddb/no_treats/semmeddb_no_treats_nodes.tsv", sep='\t')
    .rename(columns={
        "node_type": "sem_type",
        "node_id": "cui"
    })
)

In [6]:
snodes.shape

(210375, 3)

In [7]:
snodes.head()

Unnamed: 0,cui,node_name,sem_type
0,DOID:0050545,visceral heterotaxy,Disorders
1,DOID:0050591,tooth agenesis,Disorders
2,DOID:0060061,cutaneous T cell lymphoma,Disorders
3,DOID:0060073,lymphatic system cancer,Disorders
4,DOID:0060119,pharynx cancer,Disorders


---

## Determine the most common SemmedDB node type for each Hetionet node type

In [8]:
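ntype_map = (hnodes
    .drop("name", axis=1)
    .merge(snodes, how="inner", on="cui")
    .drop("node_name", axis=1)

    # most frequent SemmedDB type among the matched nodes of each Hetionet type
    .groupby("het_type")
    .apply(lambda df: df["sem_type"].value_counts().index[0])
    .to_frame("sem_type")
    .reset_index()

    # Pathway is absent from the grouped result, so add an identity mapping by hand
    # (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
    .pipe(lambda df: pd.concat([
        df,
        pd.DataFrame(
            [("Pathway", "Pathway")],
            columns=["het_type", "sem_type"]
        )
    ]))
    .reset_index(drop=True)
)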

In [9]:
ntype_map

Unnamed: 0,het_type,sem_type
0,Anatomy,Anatomy
1,Biological Process,Physiology
2,Cellular Component,Anatomy
3,Compound,Chemicals & Drugs
4,Disease,Disorders
5,Gene,Genes & Molecular Sequences
6,Molecular Function,Physiology
7,Pharmacologic Class,Chemicals & Drugs
8,Side Effect,Disorders
9,Symptom,Disorders


Hetionet nodes whose CUIs exist in SemmedDB are mapped to those CUIs.

Hetionet nodes with no CUI in SemmedDB keep their original Hetionet id. Their node type is converted to the SemmedDB type most commonly associated with that Hetionet node type (determined from the direct CUI mappings above).
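
The type rule described above can be illustrated on a toy example. This is only a sketch: the `toy_*` frames and their values are made up for illustration and do not come from Hetionet or SemmedDB. A node with a matching CUI takes the SemmedDB type directly, and a node without one falls back to the most common SemmedDB type for its Hetionet type.

```python
import pandas as pd

# Toy stand-ins for hnodes and snodes; all values are hypothetical.
toy_hnodes = pd.DataFrame({
    "hetio_id": ["g1", "g2", "g3", "d1"],
    "het_type": ["Gene", "Gene", "Gene", "Disease"],
    "cui": ["C1", "C2", "C9", "C3"],  # C9 has no SemmedDB counterpart
})
toy_snodes = pd.DataFrame({
    "cui": ["C1", "C2", "C3"],
    "sem_type": [
        "Genes & Molecular Sequences",
        "Genes & Molecular Sequences",
        "Disorders",
    ],
})

# Most common SemmedDB type per Hetionet type, using only CUIs found in both.
toy_type_map = (toy_hnodes
    .merge(toy_snodes, how="inner", on="cui")
    .groupby("het_type")["sem_type"]
    .agg(lambda s: s.value_counts().index[0])
    .reset_index()
)

# Direct type where the CUI matched, otherwise the per-type fallback.
toy_final = (toy_hnodes
    .merge(
        toy_snodes.rename(columns={"sem_type": "direct_type"}),
        how="left", on="cui"
    )
    .merge(toy_type_map, how="left", on="het_type")
    .assign(fin_type=lambda df: df["direct_type"].fillna(df["sem_type"]))
    [["hetio_id", "het_type", "fin_type"]]
)
print(toy_final)
```

Here `g3` has no matching CUI, so it inherits the type that the other `Gene` nodes received through their direct mappings.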

---

## Determine how Hetionet nodes will be identified in the merged network

Rules:
* If none of the Hetionet node's CUIs are in SemmedDB: use the original Hetionet id
* If the Hetionet node has N CUIs, retain only those that exist in SemmedDB (discard the rest)
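
As a rough illustration of these two rules, the sketch below applies them to a made-up table. The frame `toy`, the set `toy_sem_cuis`, and the use of `groupby(...).transform("any")` are assumptions chosen for the example; this is an alternative vectorized formulation, not the notebook's own implementation.

```python
import pandas as pd

# Hypothetical (node, CUI) rows: n1 and n2 each have two CUIs, n3 has one.
toy = pd.DataFrame({
    "hetio_id": ["n1", "n1", "n2", "n2", "n3"],
    "cui": ["C1", "C2", "C3", "C4", "C5"],
})
toy_sem_cuis = {"C1", "C4"}  # CUIs assumed to exist in SemmedDB

toy = toy.assign(in_sem=toy["cui"].isin(toy_sem_cuis))

# True on every row of a node that has at least one CUI in SemmedDB.
node_has_match = toy.groupby("hetio_id")["in_sem"].transform("any")
# First row of each node, kept as a placeholder when nothing matched.
is_first_row = ~toy.duplicated("hetio_id")

toy_map = (toy[toy["in_sem"] | (~node_has_match & is_first_row)]
    # matched rows use the CUI, unmatched rows keep the Hetionet id
    .assign(final_id=lambda df: df["cui"].where(df["in_sem"], df["hetio_id"]))
    .drop(columns="cui")
    .reset_index(drop=True)
)
print(toy_map)
```

In this toy case `n1` and `n2` each keep only their SemmedDB CUI, while `n3` keeps its Hetionet id.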

In [10]:
all_scuis = set(snodes["cui"])

hnode_map = (hnodes
    # flag each (node, CUI) row whose CUI is a SemmedDB node
    .assign(in_sem = lambda df: df["cui"].isin(all_scuis))
    .groupby("hetio_id")
    # keep the matching rows, or a single placeholder row if none match
    .apply(
        lambda df: df.query("in_sem") if df["in_sem"].any() else df.head(1)
    )
    .reset_index(drop=True)
    # matched rows take the CUI as id, unmatched rows keep the Hetionet id
    .assign(final_id = lambda df:
        df["cui"].where(df["in_sem"], df["hetio_id"])
    )
    .drop("cui", axis=1)
)

In [11]:
hnode_map.shape

(48276, 5)

In [12]:
hnode_map.head()

Unnamed: 0,hetio_id,name,het_type,in_sem,final_id
0,1,A1BG,Gene,True,UMLS:C1412045
1,10,NAT2,Gene,True,UMLS:C0796518
2,100,ADA,Gene,True,UMLS:C1412179
3,1000,CDH2,Gene,True,UMLS:C1413277
4,10000,AKT3,Gene,True,UMLS:C1332074


---

### Resulting stats

In [13]:
hnode_map["in_sem"].value_counts()

True     27122
False    21154
Name: in_sem, dtype: int64

In [14]:
hnode_map["hetio_id"].nunique()

47031

In [15]:
hnode_map.query("~in_sem")["het_type"].value_counts()

Biological Process     9647
Gene                   4914
Molecular Function     1934
Pathway                1822
Side Effect            1457
Cellular Component     1030
Pharmacologic Class     216
Compound                101
Symptom                  20
Disease                   8
Anatomy                   5
Name: het_type, dtype: int64

### Number of identifiers per Hetionet id (as a % of nodes)

In [16]:
(hnode_map
    .groupby("hetio_id")
    .size()
    .value_counts(normalize=True)
    .multiply(100)
    .sort_index()
)

1     98.396802
2      1.156684
3      0.236015
4      0.087177
5      0.048904
6      0.019136
7      0.021263
8      0.006379
9      0.010631
10     0.004253
12     0.004253
14     0.004253
16     0.002126
18     0.002126
dtype: float64

---

## Change Hetionet semantic types to SemmedDB semantic types

In [17]:
fin_hnode_map = (hnode_map
    .merge(
        snodes.drop("node_name", axis=1), how="left",
        left_on="final_id", right_on="cui"
    )
    .drop("cui", axis=1)
    .rename(columns={"sem_type": "orig_stype"})
    .merge(ntype_map, how="left", on="het_type")
    .assign(
        fin_type = lambda df: df["orig_stype"].fillna(df["sem_type"])
    )
    .drop(["orig_stype", "sem_type"], axis=1)
    .reset_index(drop=True)
)

In [18]:
fin_hnode_map.shape

(48276, 6)

In [19]:
fin_hnode_map.head()

Unnamed: 0,hetio_id,name,het_type,in_sem,final_id,fin_type
0,1,A1BG,Gene,True,UMLS:C1412045,Genes & Molecular Sequences
1,10,NAT2,Gene,True,UMLS:C0796518,Genes & Molecular Sequences
2,100,ADA,Gene,True,UMLS:C1412179,Genes & Molecular Sequences
3,1000,CDH2,Gene,True,UMLS:C1413277,Genes & Molecular Sequences
4,10000,AKT3,Gene,True,UMLS:C1332074,Genes & Molecular Sequences


## Save to disk

In [20]:
fin_hnode_map.to_csv(
    "../pipeline/hetionet/final_hetionet_nodes_umls.tsv", sep='\t', index=False
)

---

# Determine how to map Hetionet edges

Hetionet edges need to be converted to the edge types used in SemmedDB.

## Map Hetionet edges to SemmedDB edges

Adjust the semantics of the edges accordingly, and simplify where possible.

In [21]:
hedges = (pd
    .read_csv("../data/hetionet/raw/hetionet_edges.csv", sep=',')
    .rename(columns={
        ":START_ID": "source_id",
        ":END_ID": "target_id",
        ":TYPE": "het_etype"
    })
)

In [22]:
hedges.shape

(2249052, 3)

In [23]:
hedges.head()

Unnamed: 0,source_id,target_id,het_etype
0,8568,GO:0042254,PARTICIPATES_GpBP
1,UBERON:0002450,51316,EXPRESSES_AeG
2,4893,PC7_8430,PARTICIPATES_GpPW
3,UBERON:0002185,55186,EXPRESSES_AeG
4,119504,GO:0033047,PARTICIPATES_GpBP


---

### Annotate Hetionet edges with the final node types of their endpoints

In [24]:
het_etypes = (hedges
    .merge(
        fin_hnode_map[["hetio_id", "het_type", "fin_type"]],
        how="inner", left_on="source_id", right_on="hetio_id"
    )
    .drop("hetio_id", axis=1)
    .rename(columns={
        "het_type": "source_htype",
        "fin_type": "source_ftype"
    })

    .merge(
        fin_hnode_map[["hetio_id", "het_type", "fin_type"]],
        how="inner", left_on="target_id", right_on="hetio_id"
    )
    .drop("hetio_id", axis=1)
    .rename(columns={
        "het_type": "target_htype",
        "fin_type": "target_ftype"
    })
    .drop_duplicates()
    .reset_index(drop=True)
)

In [25]:
het_etypes.shape

(2250659, 7)
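
The annotated table has 2,250,659 rows, more than the 2,249,052 raw edges. One plausible explanation (an assumption, not verified against the data) is that some `hetio_id` values carry more than one `fin_type` in `fin_hnode_map`, so the merge emits one copy of an edge per type and `drop_duplicates` cannot collapse them. The sketch below shows how such fan-out could be detected; all `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Toy data; every value is made up for illustration.
toy_edges = pd.DataFrame({
    "source_id": ["a", "a", "b"],
    "target_id": ["x", "y", "x"],
    "het_etype": ["E1", "E1", "E2"],
})
toy_node_map = pd.DataFrame({
    "hetio_id": ["a", "a", "b", "x", "y"],
    "fin_type": ["Disorders", "Physiology", "Anatomy", "Anatomy", "Anatomy"],
})

# Distinct final types per node; a count above 1 multiplies that node's edges.
types_per_node = toy_node_map.groupby("hetio_id")["fin_type"].nunique()
ambiguous = types_per_node[types_per_node > 1]

# Same pattern as the source-side merge: inner join, then drop exact duplicates.
merged = (toy_edges
    .merge(toy_node_map, how="inner", left_on="source_id", right_on="hetio_id")
    .drop(columns="hetio_id")
    .drop_duplicates()
)

print(ambiguous)
print(len(toy_edges), "->", len(merged))
```

Running the same `nunique` check on `fin_hnode_map` would show whether this explanation holds for the real data.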

In [26]:
het_etypes.head()

Unnamed: 0,source_id,target_id,het_etype,source_htype,source_ftype,target_htype,target_ftype
0,8568,GO:0042254,PARTICIPATES_GpBP,Gene,Genes & Molecular Sequences,Biological Process,Physiology
1,6201,GO:0042254,PARTICIPATES_GpBP,Gene,Genes & Molecular Sequences,Biological Process,Physiology
2,6223,GO:0042254,PARTICIPATES_GpBP,Gene,Genes & Molecular Sequences,Biological Process,Physiology
3,6202,GO:0042254,PARTICIPATES_GpBP,Gene,Genes & Molecular Sequences,Biological Process,Physiology
4,65003,GO:0042254,PARTICIPATES_GpBP,Gene,Genes & Molecular Sequences,Biological Process,Physiology


### Group Hetionet edges by node type and count edges

In [27]:
h_etype_count = (het_etypes
    .groupby(["source_ftype", "target_ftype"])
    ["het_etype"]
    .value_counts()
    .to_frame("count")
    .reset_index()
)

In [28]:
h_etype_count

Unnamed: 0,source_ftype,target_ftype,het_etype,count
0,Anatomy,Genes & Molecular Sequences,EXPRESSES_AeG,526407
1,Anatomy,Genes & Molecular Sequences,DOWNREGULATES_AdG,102240
2,Anatomy,Genes & Molecular Sequences,UPREGULATES_AuG,97848
3,Chemicals & Drugs,Anatomy,CAUSES_CcSE,36
4,Chemicals & Drugs,Chemicals & Drugs,RESEMBLES_CrC,6486
5,Chemicals & Drugs,Chemicals & Drugs,INCLUDES_PCiC,920
6,Chemicals & Drugs,Disorders,CAUSES_CcSE,137458
7,Chemicals & Drugs,Genes & Molecular Sequences,DOWNREGULATES_CdG,21102
8,Chemicals & Drugs,Genes & Molecular Sequences,UPREGULATES_CuG,18756
9,Chemicals & Drugs,Genes & Molecular Sequences,BINDS_CbG,11571


---

## Read SemmedDB edges

In [29]:
sedges = (pd
    .read_csv("../data/semmeddb/no_treats/semmeddb_no_treats_edges.tsv", sep='\t')
    .drop(["pmids", "n_pmids"], axis=1)
)

In [30]:
sedges.shape

(9646780, 3)

In [31]:
sedges.head()

Unnamed: 0,source_id,target_id,edge_type
0,DOID:0050545,UMLS:C0000768,ASSOCIATED_WITH_DOawDO
1,DOID:0050545,UMLS:C0000772,ASSOCIATED_WITH_DOawDO
2,DOID:0050545,UMLS:C0001792,AFFECTS_DOafLB
3,DOID:0050545,UMLS:C0002871,ASSOCIATED_WITH_DOawDO
4,DOID:0050545,UMLS:C0004245,ASSOCIATED_WITH_DOawDO


---

### Add in node types for SemmedDB

In [32]:
sem_etypes = (sedges
    .merge(
        snodes.drop("node_name", axis=1),
        how="inner", left_on="source_id", right_on="cui"
    )
    .drop("cui", axis=1)
    .rename(columns={"sem_type": "source_stype"})

    .merge(
        snodes.drop("node_name", axis=1),
        how="inner", left_on="target_id", right_on="cui"
    )
    .drop("cui", axis=1)
    .rename(columns={"sem_type": "target_stype"})
)

In [33]:
sem_etypes.shape

(9646780, 5)

In [34]:
sem_etypes.head()

Unnamed: 0,source_id,target_id,edge_type,source_stype,target_stype
0,DOID:0050545,UMLS:C0000768,ASSOCIATED_WITH_DOawDO,Disorders,Disorders
1,DOID:0050591,UMLS:C0000768,ASSOCIATED_WITH_DOawDO,Disorders,Disorders
2,DOID:0060668,UMLS:C0000768,ASSOCIATED_WITH_DOawDO,Disorders,Disorders
3,UMLS:C0000726,UMLS:C0000768,LOCATION_OF_AloDO,Anatomy,Disorders
4,UMLS:C0000727,UMLS:C0000768,ASSOCIATED_WITH_DOawDO,Disorders,Disorders


---

### Group SemmedDB edges by node type and count edges

In [35]:
s_etype_count = (sem_etypes
    .groupby(["source_stype", "target_stype"])
    ["edge_type"]
    .value_counts()
    .to_frame("count")
    .reset_index()
)

In [36]:
s_etype_count

Unnamed: 0,source_stype,target_stype,edge_type,count
0,Anatomy,Anatomy,LOCATION_OF_AloA,267214
1,Anatomy,Chemicals & Drugs,LOCATION_OF_AloCD,996348
2,Anatomy,Disorders,LOCATION_OF_AloDO,520885
3,Anatomy,Genes & Molecular Sequences,LOCATION_OF_AloG,534310
4,Anatomy,Living Beings,PART_OF_Apo>LB,133351
5,Anatomy,Living Beings,LOCATION_OF_AloLB,36757
6,Anatomy,Physiology,LOCATION_OF_AloPS,29184
7,Chemicals & Drugs,Anatomy,AFFECTS_CDafA,313572
8,Chemicals & Drugs,Chemicals & Drugs,REGULATES_CDreg>CD,814972
9,Chemicals & Drugs,Chemicals & Drugs,RELATED_TO_CDrtCD,216035


---

## Combine all edges from both networks and manually decide which ones to retain

In [37]:
res = (h_etype_count
    .rename(columns={
        "source_ftype": "source_stype",
        "target_ftype": "target_stype",
        "het_etype": "edge_type"
    })
    .assign(orig = "h")
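    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    .pipe(lambda df: pd.concat(
        [df, s_etype_count.assign(orig = "semmed")]
    ))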
    .sort_values(
        by=["source_stype", "target_stype", "count"],
        ascending=[True, True, False]
    )
)

In [38]:
res

Unnamed: 0,source_stype,target_stype,edge_type,count,orig
0,Anatomy,Anatomy,LOCATION_OF_AloA,267214,semmed
1,Anatomy,Chemicals & Drugs,LOCATION_OF_AloCD,996348,semmed
2,Anatomy,Disorders,LOCATION_OF_AloDO,520885,semmed
3,Anatomy,Genes & Molecular Sequences,LOCATION_OF_AloG,534310,semmed
0,Anatomy,Genes & Molecular Sequences,EXPRESSES_AeG,526407,h
1,Anatomy,Genes & Molecular Sequences,DOWNREGULATES_AdG,102240,h
2,Anatomy,Genes & Molecular Sequences,UPREGULATES_AuG,97848,h
4,Anatomy,Living Beings,PART_OF_Apo>LB,133351,semmed
5,Anatomy,Living Beings,LOCATION_OF_AloLB,36757,semmed
6,Anatomy,Physiology,LOCATION_OF_AloPS,29184,semmed


In [39]:
res.to_csv("../pipeline/manual_review/edge_merge.tsv", sep='\t', index=False)