# Condensing Edge Semantics

The computational complexity of the Rephetio algorithm that will be used for machine learning is highly dependent on the number of potential metapaths between the source and target metanodes of the edge to be predicted.
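
To give a rough sense of this dependence, the sketch below counts metapaths in a toy metagraph. It assumes that every pair of metanodes (including a metanode paired with itself) is joined by the same number of metaedge types, and it ignores edge direction. `count_metapaths` and its parameters are illustrative names, not part of the Rephetio code.

```python
def count_metapaths(n_metanodes, edges_per_pair, max_length, source=0, target=1):
    """Count metapaths of length 1..max_length from source to target in a toy metagraph.

    Assumption: every ordered pair of metanodes is connected by
    `edges_per_pair` distinct metaedge types.
    """
    # walks[i] = number of metapaths of the current length that end at metanode i
    walks = [0] * n_metanodes
    walks[source] = 1
    total = 0
    for _ in range(max_length):
        reachable = sum(walks)
        # Any metanode can be reached from any other via `edges_per_pair` metaedges
        walks = [reachable * edges_per_pair for _ in range(n_metanodes)]
        total += walks[target]
    return total


many_types = count_metapaths(n_metanodes=3, edges_per_pair=10, max_length=3)
few_types = count_metapaths(n_metanodes=3, edges_per_pair=3, max_length=3)
print(many_types, few_types)
```

In this toy setting, going from 10 metaedge types per pair to 3 reduces the number of metapaths of up to 3 steps from 9,310 to 273.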

For this reason, we will try to reduce the number of different semantic edge types that connect any two metanodes. We will attempt to condense to at most 3 metaedges connecting any given pair of metanodes: one reflecting a positive association (increases, augments, causes, upregulates, etc.), one a negative association (decreases, disrupts, mitigates, downregulates, etc.), and one a neutral association (associated with, affects, method of, etc.). Some semantic types are not useful in this context and will be removed (any negating concepts, higher than, compared with, etc.).

[This Map File](https://github.com/mmayers12/semmed/blob/master/data/edge_condense_map.csv) details all of the mappings applied during the semantic condensation.
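
As a minimal sketch of what applying such a map involves, the example below condenses a handful of made-up edges. The rows in `toy_map`, the condensed type names, and the `condense` helper are all hypothetical. It is also an assumption here that a removed edge type is marked by an empty `condensed_to` value.

```python
# Hypothetical rows in the spirit of the map file; none are real entries
toy_map = [
    {'original_edge': 'AUGMENTS', 'condensed_to': 'POSITIVELY_REGULATES', 'reverse': False},
    {'original_edge': 'DISRUPTS', 'condensed_to': 'NEGATIVELY_REGULATES', 'reverse': False},
    {'original_edge': 'CAUSED_BY', 'condensed_to': 'POSITIVELY_REGULATES', 'reverse': True},
    {'original_edge': 'COMPARED_WITH', 'condensed_to': None, 'reverse': False},
]


def condense(edge_list, mapping):
    """Apply a condensation map to (start, end, type) tuples."""
    lookup = {row['original_edge']: row for row in mapping}
    out = []
    for start, end, kind in edge_list:
        row = lookup.get(kind)
        if row is None:
            # Edge type is not in the map: keep it unchanged
            out.append((start, end, kind))
        elif row['condensed_to'] is not None:
            # Swap direction when the map says the semantics are reversed
            if row['reverse']:
                start, end = end, start
            out.append((start, end, row['condensed_to']))
        # Otherwise the edge type is not useful and the edge is dropped
    return out


toy_edges = [
    ('A', 'B', 'AUGMENTS'),
    ('B', 'C', 'DISRUPTS'),
    ('C', 'A', 'CAUSED_BY'),
    ('A', 'C', 'COMPARED_WITH'),
]
condensed = condense(toy_edges, toy_map)
print(condensed)
```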

In [1]:
import os
import pickle
%matplotlib inline
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from collections import defaultdict

import sys
sys.path.append('../../hetnet-ml/src')
import graph_tools as gt

In [2]:
nodes = gt.remove_colons(pd.read_csv('../data/nodes_VER31_R_nodes_consolidated.csv'))
edges = gt.remove_colons(pd.read_csv('../data/edges_VER31_R_nodes_consolidated.csv', converters={'pmids':eval}))

In [3]:
start_edge_num = len(edges)

In [4]:
nodes.head()

Unnamed: 0,id,name,label,id_source
0,C0418940,Change of employment,Activities & Behaviors,UMLS
1,C0871147,Professional Development,Activities & Behaviors,UMLS
2,D013221,State Health Plans,Activities & Behaviors,MeSH
3,C0336931,Waterskiing,Activities & Behaviors,UMLS
4,C0678998,literacy skills,Activities & Behaviors,UMLS


In [5]:
edges.head(10)

Unnamed: 0,start_id,end_id,type,pmids,n_pmids
0,C0556656,D066231,ADMINISTERED_TO_ABatLB,{26223185},1
1,C0441648,C0687735,ADMINISTERED_TO_ABatLB,{3434797},1
2,D012648,C0450068,ADMINISTERED_TO_ABatLB,{25086797},1
3,D012648,C0341618,ADMINISTERED_TO_ABatLB,{1789375},1
4,D012648,C0302833,ADMINISTERED_TO_ABatLB,{28382666},1
5,D012648,C0282122,ADMINISTERED_TO_ABatLB,"{23815873, 24393417, 28092492, 21689455, 26313...",6
6,D012648,C0237419,ADMINISTERED_TO_ABatLB,{25648778},1
7,D012648,C0221191,ADMINISTERED_TO_ABatLB,{24526037},1
8,C0441648,C0851286,ADMINISTERED_TO_ABatLB,{29093281},1
9,C0018581,C0870704,ADMINISTERED_TO_ABatLB,{25870425},1


In [6]:
def sanitize(x):
    """Some pmids look like '2015332 [3]' for some reason; keep only the leading ID."""
    if isinstance(x, str) and ' ' in x:
        x = x.split(' ')[0]
    return x

# Some pmids appear as strings (e.g. row 6); they should all be int
edges['pmids'] = edges['pmids'].apply(lambda ids: {int(sanitize(x)) for x in ids})
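
As a quick illustration of this cleaning step in isolation, the sketch below applies the same idea to a small hand-made collection. `clean_pmids` is a hypothetical helper, and the input values are chosen only for the example.

```python
def clean_pmids(ids):
    """Return a set of int PMIDs, dropping any trailing annotation such as ' [3]'."""
    cleaned = set()
    for x in ids:
        if isinstance(x, str):
            # Keep only the part before the first space
            x = x.split(' ')[0]
        cleaned.add(int(x))
    return cleaned


example = clean_pmids(['2015332 [3]', 26223185, '3434797'])
print(example)
```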

In [7]:
edge_map = pd.read_csv('../data/edge_condense_map.csv')

In [8]:
edge_map.head(2)

Unnamed: 0,original_edge,condensed_to,relationship,reverse,node_semtypes
0,AFFECTS_ABafAB,AFFECTS_ABafAB,neutral,False,Activities & Behaviors --- Activities & Behaviors
1,PREDISPOSES_ABpsAB,AFFECTS_ABafAB,neutral,False,Activities & Behaviors --- Activities & Behaviors


In [9]:
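def change_edge_type(from_type, to_type, swap=False):
    """Rename all edges of `from_type` to `to_type`, optionally swapping start and end nodes."""
    idx = edges.query('type == @from_type').index
    edges.loc[idx, 'type'] = to_type
    if swap:
        # Copy first so the original start IDs are not overwritten mid-swap
        tmp = edges.loc[idx, 'start_id'].copy()
        edges.loc[idx, 'start_id'] = edges.loc[idx, 'end_id']
        edges.loc[idx, 'end_id'] = tmp

def merge_edge_types(from_list, to_type, swap=False):
    """Rename every edge type in `from_list` to `to_type`."""
    for from_type in from_list:
        change_edge_type(from_type, to_type, swap=swap)

def drop_edges_from_list(drop_edges):
    """Drop, in place, all edges whose type is in `drop_edges`."""
    idx = edges.query('type in @drop_edges').index
    edges.drop(idx, inplace=True)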

In [10]:
edges['type'].nunique()

2900

In [11]:
# Order is important here
# Previous iterations of this pipeline had multiple rounds of edge condensation
# so some edges will be changed multiple times, and going through the .csv in row order
# ensures that these changes are all applied correctly.
for row in tqdm(edge_map.itertuples(), total=len(edge_map)):
    change_edge_type(row.original_edge, row.condensed_to, swap=row.reverse)
edges = edges.dropna(subset=['type']).reset_index(drop=True)

100%|██████████| 292/292 [05:28<00:00,  1.06s/it]
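
To see why row order matters, consider a toy chain of two renames in which the output of the first is the input of the second. The type names and the `apply_map_in_order` helper below are hypothetical and only serve to illustrate the point.

```python
def apply_map_in_order(edge_types, mapping_rows):
    """Apply (original, condensed) renames one row at a time, in the order given."""
    result = list(edge_types)
    for original, condensed in mapping_rows:
        result = [condensed if t == original else t for t in result]
    return result


types = ['PREDISPOSES', 'AFFECTS', 'CAUSES']
rows = [('PREDISPOSES', 'AFFECTS'), ('AFFECTS', 'ASSOCIATED_WITH')]

in_order = apply_map_in_order(types, rows)
wrong_order = apply_map_in_order(types, rows[::-1])
print(in_order)
print(wrong_order)
```

Applied in row order, both `PREDISPOSES` and `AFFECTS` end up as `ASSOCIATED_WITH`. With the rows reversed, the former `PREDISPOSES` edges are left stranded as `AFFECTS`.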


In [12]:
edges['type'].nunique()

2686

## Fix Potential problems of duplicated undirected edges

Similar to the issue at the end of notebook `01-building-the-hetnet`, switching some of the semantics may have produced instances where the start and end metanodes of one edge of a given metaedge type are in the opposite order from those of another edge of the same type.

The `reverse` option in the `edge_map` should have accounted for this, but we will do a thorough check to make sure that the nodes of each semantic type are in the correct location for every edge type.
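
The idea behind the check can be sketched on a single edge. The abbreviations used here (`AB`, `LB`, `at`) are the ones visible in the `ADMINISTERED_TO_ABatLB` sample rows earlier in this notebook. The helpers `expected_abbrev` and `is_consistent` are hypothetical, and the real abbreviation table comes from `gt.get_abbrev_dict_and_edge_tuples`.

```python
# Toy abbreviation table covering one edge type only
abv = {'Activities & Behaviors': 'AB', 'Living Beings': 'LB', 'ADMINISTERED_TO': 'at'}


def expected_abbrev(edge_type, start_label, end_label):
    """Rebuild the abbreviation implied by the node labels actually on the edge."""
    sem, _, abbrev = edge_type.rpartition('_')
    arrow = '>' if '>' in abbrev else ''  # directed edges carry a '>' marker
    return abv[start_label] + abv[sem] + arrow + abv[end_label]


def is_consistent(edge_type, start_label, end_label):
    """True when the abbreviation in the type name matches the node labels."""
    stored = edge_type.rsplit('_', 1)[-1]
    return stored == expected_abbrev(edge_type, start_label, end_label)


ok = is_consistent('ADMINISTERED_TO_ABatLB', 'Activities & Behaviors', 'Living Beings')
flipped = is_consistent('ADMINISTERED_TO_ABatLB', 'Living Beings', 'Activities & Behaviors')
print(ok, flipped)
```

An edge whose start and end nodes were left in the wrong order would produce `LBatAB` rather than `ABatLB`, so it would be flagged by the comparison.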

In [13]:
abv, met = gt.get_abbrev_dict_and_edge_tuples(gt.add_colons(nodes), gt.add_colons(edges))

In [14]:
id_to_label = nodes.set_index('id')['label'].to_dict()

In [15]:
edges['start_label'] = edges['start_id'].map(lambda c: id_to_label[c])
edges['end_label'] = edges['end_id'].map(lambda c: id_to_label[c])
edges['sem'] = edges['type'].map(lambda e: '_'.join(e.split('_')[:-1]))

edges['abbrev'] = edges['type'].map(lambda e: e.split('_')[-1])

proper_abbrevs = []
for e in tqdm(edges.itertuples(), total=len(edges)):
    if '>' in e.abbrev:
        abbrev = abv[e.start_label] + abv[e.sem] + '>' + abv[e.end_label]
    else:
        abbrev = abv[e.start_label] + abv[e.sem] + abv[e.end_label]
    proper_abbrevs.append(abbrev)
    
edges['calc_abbrev'] = proper_abbrevs

100%|██████████| 14042303/14042303 [00:34<00:00, 412058.58it/s]


In [16]:
edges.head(2)

Unnamed: 0,start_id,end_id,type,pmids,n_pmids,start_label,end_label,sem,abbrev,calc_abbrev
0,C0556656,D066231,ADMINISTERED_TO_ABatLB,{26223185},1,Activities & Behaviors,Living Beings,ADMINISTERED_TO,ABatLB,ABatLB
1,C0441648,C0687735,ADMINISTERED_TO_ABatLB,{3434797},1,Activities & Behaviors,Living Beings,ADMINISTERED_TO,ABatLB,ABatLB


In [17]:
idx = edges['calc_abbrev'] != edges['abbrev']
idx.sum()  # This should be Zero! If so then there's no issue!

0

### Undirected Edges between two nodes of the same type should have only 1 instance

Example: If we had `Disease_1 -- Associated_with -- Disease_2` and `Disease_2 -- Affects -- Disease_1`, and both types were merged to `Associated_with`, the undirected edge would now be duplicated in the forward and reverse directions. The cells below examine this potential issue and standardize such pairs so that only the `Disease_1 -- Associated_with -- Disease_2` version of the edge remains.
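
A minimal sketch of this standardization, using made-up IDs and PMIDs: sorting the two node IDs gives every undirected edge a single canonical key, and the evidence from both directions is combined under that key. `merge_undirected` is an illustrative name.

```python
from collections import defaultdict


def merge_undirected(edge_list):
    """Collapse A -- B and B -- A into one edge, combining their PMIDs."""
    merged = defaultdict(set)
    for start, end, pmids in edge_list:
        # Sorting puts the lowest ID first, regardless of original direction
        key = tuple(sorted((start, end)))
        merged[key] |= set(pmids)
    return dict(merged)


toy = [
    ('C00002', 'C00001', {111}),
    ('C00001', 'C00002', {111, 222}),
    ('C00003', 'C00001', {333}),
]
merged = merge_undirected(toy)
print(merged)
```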

In [18]:
# Get the edges that are un-directed, between same type
idx = edges['start_label'] == edges['end_label']

self_refferential_types = edges.loc[idx, 'type'].unique()
self_refferential_types = [e for e in self_refferential_types if '>' not in e]

self_ref_idx = edges.query('type in @self_refferential_types').index

In [19]:
# Need to keep and combine the PMIDs for Node_1 -- Node_2 and Node_2 -- Node_1
# Get a map of this structure {edge_type: {edge_id: [pmids]}}

edge_map = {}

# Look at the self-referential types
for kind in tqdm(self_refferential_types):

    # Need map from edges to pmids
    pmid_map = defaultdict(set)
    # Only look at 1 kind of edge
    subedges = edges.query('type == @kind')

    for row in subedges.itertuples():
        # Grab the edge ID, sorting, so lowest CUI first:
        #     If both 'C00001 -- C00002' and 'C00002 -- C00001' exist, effectively standardizes to
        #     'C00001 -- C00002' while combining the PMID evidence
        edge_id = tuple(sorted([row.start_id, row.end_id]))

        # Store the pmids for that edge
        pmid_map[edge_id] = pmid_map[edge_id].union(row.pmids)

    # Keep all the mappings for this edge type
    edge_map[kind] = pmid_map

100%|██████████| 286/286 [01:45<00:00,  2.96it/s]


In [20]:
# Convert back to a DataFrame
kinds = []
start_ids = []
end_ids = []
pmids = []

for kind, e_dict in edge_map.items():
    # Restructure as lists for easy dataframe generation
    for (s_id, e_id), pms in e_dict.items():
        kinds.append(kind)
        start_ids.append(s_id)
        end_ids.append(e_id)
        pmids.append(pms)
        
fixed_edges = pd.DataFrame({'start_id': start_ids, 'end_id': end_ids, 'type': kinds, 'pmids': pmids})

In [21]:
print('Before De-duplication: {:,} Edges between nodes of the same type'.format(len(edges.loc[self_ref_idx])))
print('After De-duplication: {:,} Edges between nodes of the same type'.format(len(fixed_edges)))

Before De-duplication: 1,924,184 Edges between nodes of the same type
After De-duplication: 1,511,577 Edges between nodes of the same type


In [22]:
# Remove all the potential duplicated edges
print('Total Edges: {:,}'.format(len(edges)))
edges.drop(self_ref_idx, inplace=True)
print('Edges between two different Metanodes or Directed Edges: {:,}'.format(len(edges)))

# Then add back in all de-duplicated edges
edges = pd.concat([edges, fixed_edges], sort=False)
print('Total edges with De-duped edges added back: {:,}'.format(len(edges)))

Total Edges: 14,042,303
Edges between two different Metanodes or Directed Edges: 12,118,119
Total edges with De-duped edges added back: 13,629,696
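
These counts are consistent with one another, which can be confirmed with a little arithmetic. The variable names below are for illustration only; the numbers are copied from the cell outputs.

```python
total_before = 14_042_303      # all edges before this step
same_type_before = 1_924_184   # undirected same-type edges before de-duplication
same_type_after = 1_511_577    # undirected same-type edges after de-duplication

remaining = total_before - same_type_before
total_after = remaining + same_type_after
removed = same_type_before - same_type_after

print('{:,} {:,} {:,}'.format(remaining, total_after, removed))
```

So 412,607 reverse-direction duplicates were folded into their forward counterparts at this stage.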


In [23]:
edges = edges[['start_id', 'end_id', 'type', 'pmids']]

In [24]:
edges.head()

Unnamed: 0,start_id,end_id,type,pmids
0,C0556656,D066231,ADMINISTERED_TO_ABatLB,{26223185}
1,C0441648,C0687735,ADMINISTERED_TO_ABatLB,{3434797}
2,D012648,C0450068,ADMINISTERED_TO_ABatLB,{25086797}
3,D012648,C0341618,ADMINISTERED_TO_ABatLB,{1789375}
4,D012648,C0302833,ADMINISTERED_TO_ABatLB,{28382666}


Finish the de-duplication, merging the PMIDs of any edges that are still duplicated.
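
A dictionary-based sketch of this merge on made-up edges is shown below. It is not the approach used in the next cell, which relies on a pandas `groupby`. `merge_duplicates` and the toy IDs are illustrative only.

```python
from collections import defaultdict


def merge_duplicates(rows):
    """Merge rows sharing (start_id, end_id, type), combining and recounting PMIDs."""
    merged = defaultdict(set)
    for start_id, end_id, kind, pmids in rows:
        merged[(start_id, end_id, kind)] |= pmids
    return [
        {'start_id': s, 'end_id': e, 'type': k, 'pmids': p, 'n_pmids': len(p)}
        for (s, e, k), p in merged.items()
    ]


toy_rows = [
    ('C1', 'C2', 'AFFECTS', {101}),
    ('C1', 'C2', 'AFFECTS', {101, 102}),  # same triple, e.g. after two types were merged
    ('C2', 'C1', 'AFFECTS', {103}),       # different direction, so kept separate here
]
result = merge_duplicates(toy_rows)
print(result)
```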

In [25]:
%%time

before_dedup = len(edges)

# Some edges are now duplicated; de-duplicate and combine pmids
grpd = edges.groupby(['start_id', 'end_id', 'type'])
edges = grpd['pmids'].apply(lambda s: set.union(*s.values)).reset_index()

# Re-count the pmids for each edge
edges['n_pmids'] = edges['pmids'].apply(len)

after_dedup = len(edges)

CPU times: user 21min 15s, sys: 24.7 s, total: 21min 39s
Wall time: 21min 39s


In [26]:
print('Edges before final Deduplication: {:,}'.format(before_dedup))
print('Edges after final Deduplication: {:,}'.format(after_dedup))

Edges before final Deduplication: 13,629,696
Edges after final Deduplication: 12,270,779


In [27]:
# Sort values before writing to disk
nodes = nodes.sort_values('label')
edges = edges.sort_values('type')

# Add in colons required by neo4j
nodes = gt.add_colons(nodes)
edges = gt.add_colons(edges)

nodes.to_csv('../data/nodes_VER31_R_consolidated_condensed.csv', index=False)
edges.to_csv('../data/edges_VER31_R_consolidated_condensed.csv', index=False)