# Initial Data Cleaning of SemmedDB

This notebook will go through the SemmedDB database and clean it of most of the errors before it can be
processed into a hetnet.


There are two main problems found when first looking into SemmedDB:

1. Several rows contain multiple subjects or objects, separated by a pipe `|` character.
2. The database is not entirely in CUI space; some concepts are given Entrez Gene IDs instead.


There are also two minor problems that we will clear up:

1. A small amount of the data is corrupted. This affects less than 0.001% of the rows, so these rows will simply be removed when identified.
2. Some of the CUIs contained in SemmedDB are either deprecated or have been merged with other CUIs. These will be resolved.

In [1]:
import os
import pickle
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import sys
sys.path.append('../tools')
import load_umls

from collections import defaultdict as ddict

In [2]:
sem_df = pd.read_csv('../data/semmedVER43_R.csv',encoding='latin1')

In [3]:
sem_df = sem_df.drop(columns = ['extra00','extra01','extra02'])

In [4]:
sem_df.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,10592604,16,16530475,PROCESS_OF,C0003725,Arboviruses,virs,1,C0999630,Lepus capensis,mamm,1
1,10592697,17,16530475,ISA,C0039258,Tahyna virus,virs,1,C0446169,California Group Viruses,virs,1
2,10592728,17,16530475,ISA,C0318627,Eyach virus,virs,1,C0206590,Coltivirus,virs,1
3,10592759,17,16530475,ISA,C0446169,California Group Viruses,virs,1,C0003725,Arboviruses,virs,1
4,10592832,18,16530475,PROCESS_OF,C0012634,Disease,dsyn,0,C0020114,Human,humn,0


In [5]:
print('Rows: {:,}'.format(sem_df.shape[0]))
print('Cols: {}'.format(sem_df.shape[1]))

Rows: 112,796,186
Cols: 12


#### Version 31 Stats
* Rows: 96,363,098
* Cols: 12

In [6]:
# Get all the pmids and save them to a file
pmids = set(sem_df['PMID'])
out = []
print('{:,}'.format(len(pmids)))
for pmid in pmids:
    try:
        # PMIDs should be convertible to int; if not, the value is probably corrupted, so skip it
        out.append(int(pmid))
    except (TypeError, ValueError):
        pass
print('{:,}'.format(len(out)))
with open('../data/pmid_list_ver43.txt', 'w') as out_file:
    for pmid in out:
        out_file.write(str(pmid)+'\n')
print('Done!')

20,750,278
20,750,278
Done!


In [7]:
# Clear memory
pmids=None

#### Version 31
* initial PMID 17,899,155
* After removal 17,898,897


# Expanding the pipes in subjects and objects

One of the first things noticed when looking at the data in SemmedDB was that the subjects and objects of some extracted statements contain the pipe character `|`, indicating multiple concepts in the sentence.

## Examining Pipes in Subject/Object IDs

The first thing to do is examine some of these pipes and look at their corresponding sentences in the database, to see whether they do in fact correspond to two concepts.
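
A small aside on why the pipe check in this notebook passes `regex=False`: in a regular expression, `|` is the alternation operator, so a bare `|` pattern matches the empty string in every value. The sketch below shows this in plain Python on a few subject IDs copied from this notebook's outputs; `has_pipe` is a hypothetical helper name, not part of the notebook.

```python
import re

# Subject IDs copied from rows shown in this notebook
subject_ids = ["C0003725", "C0056207|3075", "C0812258|1869|7332"]


def has_pipe(value):
    """Literal substring test, the plain-Python analogue of regex=False."""
    return "|" in value


literal_hits = [has_pipe(v) for v in subject_ids]

# As a regex, a bare "|" means "empty OR empty", which matches every string
regex_hits = [re.search("|", v) is not None for v in subject_ids]

# Number of concepts encoded in each field: number of pipes + 1
concept_counts = [v.count("|") + 1 for v in subject_ids]

print(literal_hits)
print(regex_hits)
print(concept_counts)
```

With a regex match, every row would be flagged as containing a pipe, which is why the literal test is the one to use here.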

initially: There are 3,645,614 lines that contain a pipe in the subject

In [8]:
multi_subject = sem_df[sem_df['SUBJECT_CUI'].str.contains('|', regex=False)]
print("There are {:,} lines that contain a pipe in the subject".format(multi_subject.shape[0]))

sentence_ids = multi_subject['SENTENCE_ID'].values

There are 4,181,817 lines that contain a pipe in the subject


In [9]:
multi_subject.iloc[:5]

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
8,10593243,26,16530476,PART_OF,C0056207|3075,COMPLEMENT FACTOR H|CFH,aapp,1,C0006034,Borrelia burgdorferi,bact,1
47,10597756,105,16530483,INTERACTS_WITH,C0027893|4852,neuropeptide Y|NPY,gngm,1,C0039194,T-Lymphocyte,cell,1
48,10597803,105,16530483,STIMULATES,C0027893|4852,neuropeptide Y|NPY,aapp,1,C0003315,Antigen-Presenting Cells,cell,1
53,10598263,112,16530485,CAUSES,C0812258|1869|7332,E2F1 gene|E2F1|UBE2L3,gngm,1,C0162638,Apoptosis,celf,1
54,10598476,116,16530485,INTERACTS_WITH,C0812258|1869|7332,E2F1 gene|E2F1|UBE2L3,gngm,1,C0013227,Pharmaceutical Preparations,phsu,0


In [10]:
# clear the variable
multi_subject = None 

Before we go any further, let's drop any rows that contain NaN values, as these are corrupted rows that have no good data.

In [11]:
# Remove any NaN values
print('Rows before NaN removal {:,}'.format(sem_df.shape[0]))
sem_df = sem_df.dropna()
print('Rows after NaN removal {:,}'.format(sem_df.shape[0]))

Rows before NaN removal 112,796,186
Rows after NaN removal 112,796,186


No rows were removed

In [12]:
# Boolean Series, True where the condition is satisfied. Dimensions are the same for both.
multi_start = sem_df['SUBJECT_CUI'].str.contains('|', regex=False)
multi_end = sem_df['OBJECT_CUI'].str.contains('|', regex=False)

In [13]:
multi_start.shape

(112796186,)

In [14]:
multi_end.shape

(112796186,)

In [15]:
sum(multi_start)

4181817

In [16]:
sum(multi_end)

3792760

In [17]:
pipe_lines = sem_df[multi_start | multi_end]
good_lines = sem_df[~multi_start & ~multi_end]
print('Rows with multiple subjects or objects {:,}'.format(len(pipe_lines)))
print('Rows with only 1 subject AND only 1 object {:,}'.format(len(good_lines)))

Rows with multiple subjects or objects 7,590,109
Rows with only 1 subject AND only 1 object 105,206,077


In [18]:
len(pipe_lines)+len(good_lines)==len(sem_df)

True

Lines with a pipe in the subject OR a pipe in the object can be dealt with in a rather straightforward manner.

Those with a pipe in both the subject AND the object will require a slightly different algorithm, so we'll separate those out.
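
To make the three cases concrete, here is a minimal sketch in plain Python that labels rows by where their pipes are. The four ID pairs are reconstructed from rows shown in this notebook; `classify` is a hypothetical helper, and the notebook itself does this with boolean masks rather than a loop. The second half just checks that the counts printed in this notebook are consistent with each other.

```python
from collections import Counter

# (SUBJECT_CUI, OBJECT_CUI) pairs reconstructed from rows shown in this notebook
rows = [
    ("C0003725", "C0999630"),            # no pipes
    ("C0056207|3075", "C0006034"),       # pipe in subject only
    ("C0087111", "C0020063|5741"),       # pipe in object only
    ("C1414462|2100", "C1414461|2099"),  # pipes in both
]


def classify(subject_cui, object_cui):
    """Label a row by where its pipes are."""
    multi_start = "|" in subject_cui
    multi_end = "|" in object_cui
    if multi_start and multi_end:
        return "both"
    if multi_start:
        return "start_only"
    if multi_end:
        return "end_only"
    return "good"


counts = Counter(classify(s, o) for s, o in rows)

# Consistency check on the totals reported in this notebook
multi_start_total = 4_181_817
multi_end_total = 3_792_760
both_total = 384_468
start_only = multi_start_total - both_total
end_only = multi_end_total - both_total
piped_total = start_only + end_only + both_total

print(dict(counts))
print(start_only, end_only, piped_total)
```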

In [19]:
# get indices for those only with a multi start, multi end, and those with both a multi start and multi end
multi_start_subset = multi_start[multi_start | multi_end] # OR statement removes the non-piped entries
multi_end_subset = multi_end[multi_start | multi_end]
multi_both_subset = multi_start_subset & multi_end_subset

In [20]:
print('start: ', sum(multi_start_subset))
print('end: ', sum(multi_end_subset))
print('both: ', sum(multi_both_subset))

start:  4181817
end:  3792760
both:  384468


In [21]:
start_only_subset = multi_start_subset & ~multi_end_subset
end_only_subset = multi_end_subset & ~multi_start_subset

In [22]:
print(sum(start_only_subset))
print(sum(end_only_subset))

3797349
3408292


In [23]:
# Clear variables
multi_start = None
multi_end = None
multi_start_subset = None
multi_end_subset = None

### Splitting the IDs of the Subjects OR Objects

To split the IDs, the IDs and names will be split into `n+1` rows where `n` is the number of pipes `|`, then the data from the rest of the columns will be duplicated across these new rows.
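
As a rough illustration of the `n+1` expansion, the sketch below splits a single row in plain Python. The row is reconstructed from the notebook output and trimmed to four columns; the variable names are only for this example. The notebook does the same thing column-wise on the whole DataFrame.

```python
# One row reconstructed from the notebook output, trimmed to four columns
row = {
    "PREDICATION_ID": 10598263,
    "PREDICATE": "CAUSES",
    "SUBJECT_CUI": "C0812258|1869|7332",
    "SUBJECT_NAME": "E2F1 gene|E2F1|UBE2L3",
}

ids = row["SUBJECT_CUI"].split("|")
names = row["SUBJECT_NAME"].split("|")
n_new = len(ids)  # n pipes -> n + 1 rows

# Repeat the untouched columns once per new row ([x] * n, as in the notebook)
repeated = {col: [row[col]] * n_new for col in ("PREDICATION_ID", "PREDICATE")}

# Stitch the repeated columns back together with the split IDs and names
expanded = [
    {"PREDICATION_ID": pid, "PREDICATE": pred, "SUBJECT_CUI": cui, "SUBJECT_NAME": name}
    for pid, pred, cui, name in zip(
        repeated["PREDICATION_ID"], repeated["PREDICATE"], ids, names
    )
]

for new_row in expanded:
    print(new_row)
```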

In [24]:
from itertools import chain

In [25]:
# Split the IDs and Names
start_id_split = pipe_lines.loc[start_only_subset, 'SUBJECT_CUI'].str.split('|')
start_name_split = pipe_lines.loc[start_only_subset, 'SUBJECT_NAME'].str.split('|')

In [26]:
start_id_split.head()

8           [C0056207, 3075]
47          [C0027893, 4852]
48          [C0027893, 4852]
53    [C0812258, 1869, 7332]
54    [C0812258, 1869, 7332]
Name: SUBJECT_CUI, dtype: object

In [27]:
start_id_split.shape

(3797349,)

In [28]:
start_name_split.shape

(3797349,)

In [29]:
# Get the number of items after splitting
start_lens = start_id_split.apply(len)

In [31]:
# Need the column names for duplicating the data
all_cols = list(pipe_lines.columns)

# Copy the columns and only keep those where the data will be duped
start_cols = all_cols[:]
start_cols.remove('SUBJECT_CUI')
start_cols.remove('SUBJECT_NAME')

In [32]:
# Retaining the same order, repeat each row's data once per new row created by the split
new_starts = dict()
for c in start_cols:
    tmp = pipe_lines.loc[start_only_subset, c].apply(lambda x: [x]) * start_lens
    new_starts[c] = [x for x in chain(*tmp.values)]

In [33]:
# Now we have the expanded rows with everything except the subject CUIs and Names
fixed_starts = pd.DataFrame(new_starts)
fixed_starts.head(10)

Unnamed: 0,OBJECT_CUI,OBJECT_NAME,OBJECT_NOVELTY,OBJECT_SEMTYPE,PMID,PREDICATE,PREDICATION_ID,SENTENCE_ID,SUBJECT_NOVELTY,SUBJECT_SEMTYPE
0,C0006034,Borrelia burgdorferi,1,bact,16530476,PART_OF,10593243,26,1,aapp
1,C0006034,Borrelia burgdorferi,1,bact,16530476,PART_OF,10593243,26,1,aapp
2,C0039194,T-Lymphocyte,1,cell,16530483,INTERACTS_WITH,10597756,105,1,gngm
3,C0039194,T-Lymphocyte,1,cell,16530483,INTERACTS_WITH,10597756,105,1,gngm
4,C0003315,Antigen-Presenting Cells,1,cell,16530483,STIMULATES,10597803,105,1,aapp
5,C0003315,Antigen-Presenting Cells,1,cell,16530483,STIMULATES,10597803,105,1,aapp
6,C0162638,Apoptosis,1,celf,16530485,CAUSES,10598263,112,1,gngm
7,C0162638,Apoptosis,1,celf,16530485,CAUSES,10598263,112,1,gngm
8,C0162638,Apoptosis,1,celf,16530485,CAUSES,10598263,112,1,gngm
9,C0013227,Pharmaceutical Preparations,0,phsu,16530485,INTERACTS_WITH,10598476,116,1,gngm


In [34]:
# Add in the subject CUIs and Names
fixed_starts['SUBJECT_CUI'] = [x for x in chain(*start_id_split.values)]
fixed_starts['SUBJECT_NAME'] = [x for x in chain(*start_name_split.values)]

fixed_starts = fixed_starts[all_cols]

In [35]:
fixed_starts.head(5)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,10593243,26,16530476,PART_OF,C0056207,COMPLEMENT FACTOR H,aapp,1,C0006034,Borrelia burgdorferi,bact,1
1,10593243,26,16530476,PART_OF,3075,CFH,aapp,1,C0006034,Borrelia burgdorferi,bact,1
2,10597756,105,16530483,INTERACTS_WITH,C0027893,neuropeptide Y,gngm,1,C0039194,T-Lymphocyte,cell,1
3,10597756,105,16530483,INTERACTS_WITH,4852,NPY,gngm,1,C0039194,T-Lymphocyte,cell,1
4,10597803,105,16530483,STIMULATES,C0027893,neuropeptide Y,aapp,1,C0003315,Antigen-Presenting Cells,cell,1


In [36]:
start_only_subset = None

#### Fixing the lines where the Objects contain pipes

In [37]:
end_id_split = pipe_lines.loc[end_only_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[end_only_subset, 'OBJECT_NAME'].str.split('|')

In [38]:
end_id_split.shape

(3408292,)

In [39]:
end_name_split.shape

(3408292,)

When examining the data, we can see that some of the lines were not parsed correctly.  This must have happened before the data was downloaded, because MySQL shows the same issues when the dump is loaded and queried.

These lines will be dropped since there aren't many and they're pretty much garbage.
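
The corruption check amounts to asking whether the ID field and the name field split into the same number of parts. A minimal sketch in plain Python, using one clean object and one corrupted object copied from rows in this notebook (`split_is_consistent` is a hypothetical helper name):

```python
def split_is_consistent(cui_field, name_field):
    """True when the ID and name fields split into the same number of parts."""
    return len(cui_field.split("|")) == len(name_field.split("|"))


# (OBJECT_CUI, OBJECT_NAME) pairs copied from rows in this notebook
pairs = [
    ("C0020063|5741", "PTH protein, human|PTH"),  # clean row
    ("1|medd", "C0175723"),                       # columns shifted by a bad parse
]

flags = [split_is_consistent(cui, name) for cui, name in pairs]
print(flags)
```

Note that this only catches corruption that changes the number of parts; a shifted row whose fields happen to split evenly would pass.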

In [40]:
end_lens = end_id_split.apply(len)
end_lens1 = end_name_split.apply(len)

print('There are {} lines with data corrupted in this manner'.format(sum(end_lens != end_lens1)))

pipe_lines.loc[end_only_subset][(end_lens != end_lens1)]

There are 6 lines with data corrupted in this manner


Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
7154043,80264874,35392205,26191840,PREP,1756,DMD,gngm,1,1|medd,C0175723,medd,1
7154067,80264901,35392205,26191840,PREP,1756,DMD,gngm,1,1|medd,C0175723,medd,1
35698402,109882731,83393692,11116804,241,C0024141,"Lupus Erythematosus, Systemic",dsyn,1,235|Patients,1|podg|humn,C0030705,1
48339696,123980473,106727258,15221490,PREP,C0334168,New bone formation,ortf,1,1|anim,C0599779,anim,1
60007190,137779669,253076136,27557071,NOM,C0600324,Serodiagnosis,lbpr,1,1|dsyn,C0085293,dsyn,1
80202343,160180312,297142654,23023178,VERB,C0019291,"Hernia, Hiatal",dsyn,1,1|humn,C0030705,"podg,humn",1


In [41]:
# get the index for the bad lines
bad_lines = pipe_lines.loc[end_only_subset][(end_lens != end_lens1)].index

# Remove them from the main dataframe
pipe_lines = pipe_lines.drop(bad_lines)

# Remove from the indices that still need to be used as well...
end_only_subset = end_only_subset.drop(bad_lines)
multi_both_subset = multi_both_subset.drop(bad_lines)

Now the splitting algorithm is identical to that used for the Subject lines

In [42]:
end_id_split = pipe_lines.loc[end_only_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[end_only_subset, 'OBJECT_NAME'].str.split('|')
end_lens = end_id_split.apply(len)

In [43]:
end_cols = all_cols[:]
end_cols.remove('OBJECT_CUI')
end_cols.remove('OBJECT_NAME')

In [44]:
new_ends = dict()
for c in end_cols:
    tmp = pipe_lines.loc[end_only_subset, c].apply(lambda x: [x]) * end_lens
    new_ends[c] = [x for x in chain(*tmp.values)]

In [45]:
fixed_ends = pd.DataFrame(new_ends)
fixed_ends['OBJECT_CUI'] = [x for x in chain(*end_id_split.values)]
fixed_ends['OBJECT_NAME'] = [x for x in chain(*end_name_split.values)]

fixed_ends = fixed_ends[all_cols]

In [46]:
fixed_ends.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,10603809,199,16530496,USES,C0087111,Therapeutic procedure,topp,0,C0020063,"PTH protein, human",aapp,1
1,10603809,199,16530496,USES,C0087111,Therapeutic procedure,topp,0,5741,PTH,aapp,1
2,10604007,200,16530497,ASSOCIATED_WITH,C0699748,Pathogenesis,patf,1,C1414461,ESR1 gene,gngm,1
3,10604007,200,16530497,ASSOCIATED_WITH,C0699748,Pathogenesis,patf,1,2099,ESR1,gngm,1
4,10604038,200,16530497,ASSOCIATED_WITH,C0699748,Pathogenesis,patf,1,C1414462,ESR2 gene,gngm,1


In [47]:
fixed_ends.shape

(7472299, 12)

In [48]:
end_only_subset = None

### Splitting of lines where both the subject and object contain pipes

These differ slightly in the way that they will have to be treated.  First, the number of pipes in the subject and object can be different.  The total number of new lines to be created is `(n+1) * (m+1)` where `n` is the number of pipes in the subject and `m` is the number of pipes in the object.

Secondly, every possible combination of subject and object will need to be made.  Given subjects `A` and `B`, objects `X` and `Y`, and predicate `p`, you will need rows for the following combinations: `ApX`, `ApY`, `BpX`, `BpY`.
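
The combinations described above are a Cartesian product. As a minimal sketch (not the approach the notebook takes on the full DataFrame), `itertools.product` gives them directly for a single row. The fields are those of one row shown in this notebook (PREDICATION_ID 10604461); IDs are zipped with their names first so each name stays attached to its ID.

```python
from itertools import product

subject_cui, subject_name = "C1414462|2100", "ESR2 gene|ESR2"
object_cui, object_name = "C1414461|2099", "ESR1 gene|ESR1"
predicate = "INTERACTS_WITH"

# Pair each ID with its name before combining
subjects = list(zip(subject_cui.split("|"), subject_name.split("|")))
objects = list(zip(object_cui.split("|"), object_name.split("|")))

combos = [
    (s_id, s_name, predicate, o_id, o_name)
    for (s_id, s_name), (o_id, o_name) in product(subjects, objects)
]

n = subject_cui.count("|")  # pipes in the subject
m = object_cui.count("|")   # pipes in the object

for combo in combos:
    print(combo)
```

The row order here follows the original subject order, whereas the notebook sorts the subject IDs, so the same rows come out in a different order.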

In [49]:
start_id_split = pipe_lines.loc[multi_both_subset, 'SUBJECT_CUI'].str.split('|')
start_name_split = pipe_lines.loc[multi_both_subset, 'SUBJECT_NAME'].str.split('|')
start_lens = start_id_split.apply(len)

end_id_split = pipe_lines.loc[multi_both_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[multi_both_subset, 'OBJECT_NAME'].str.split('|')
end_lens = end_id_split.apply(len)

In [50]:
# Quick check to see if the name split is equal for start
[len(i) for i in start_name_split] == [len(i) for i in start_id_split]

True

In [51]:
# Quick check to see if the name split is equal for end. In our case it's not
[len(i) for i in end_name_split] == [len(i) for i in end_id_split]

False

In [52]:
end_name_split_len = [len(i) for i in end_name_split]
end_id_split_len = [len(i) for i in end_id_split]

In [53]:
# Let's find out why it's false.
for j,v in enumerate(end_name_split_len):
    if v==end_id_split_len[j]:
        pass
    else:
        print('index: ',j,
              '\nlen end name split: ',v,
              '\nlen end id split: ',end_id_split_len[j],
              '\nsplit id entry: ',list(end_id_split)[j],
              '\nsplit name entry: ',list(end_name_split)[j])

index:  376203 
len end name split:  1 
len end id split:  2 
split id entry:  ['1', 'neop'] 
split name entry:  ['C0017636']


In [54]:
# replace entry with equal length named entry
for j,v in enumerate(end_name_split_len):
    if v!=end_id_split_len[j]:
        len_id = end_id_split_len[j]
        len_name = end_name_split_len[j]
        
        end_id_entry = list(end_id_split)[j]
        end_name_entry = list(end_name_split)[j]
        
        print(end_name_entry)
        end_name_split.iloc[j]=len_id*[end_name_entry[0]]
     
        print(end_name_split.iloc[[j]])

['C0017636']
111403679    [C0017636, C0017636]
Name: OBJECT_NAME, dtype: object


In [55]:
end_lens = end_id_split.apply(len)

In [56]:
# Multiply the start splits by the end length, so you get end_len*start_len total rows
start_id_split = start_id_split * end_lens
start_name_split = start_name_split * end_lens

end_id_split = end_id_split * start_lens
end_name_split = end_name_split * start_lens

In [57]:
# only sort the starts so that you get all possible combinations....
# For example right now we have start = [A, B, C, A, B, C] and end = [X, Y, X, Y, X, Y]
# By sorting the start we will have start = [A, A, B, B, C, C] and end = [X, Y, X, Y, X, Y]
# Therefore when combined element-wise, all possible combinations will arise

sorting_df = pd.DataFrame()
sorting_df['ID'] = start_id_split
sorting_df['NAME'] = start_name_split

sorted_start_id_split = sorting_df['ID'].apply(lambda x: sorted(x))
# Need to sort the names based on IDs so that the same name still corresponds to the same ID
sorted_start_name_split = sorting_df.apply(lambda row: [x for y,x in sorted(zip(row['ID'], row['NAME']))], axis = 1)

In [58]:
sorting_df.head()

Unnamed: 0,ID,NAME
117,"[C1414462, 2100, C1414462, 2100]","[ESR2 gene, ESR2, ESR2 gene, ESR2]"
441,"[C0166418, 5465, C0166418, 5465]","[Peroxisome Proliferator-Activated Receptors, ..."
974,"[C0051980, 959, C0051980, 959]","[anti-IgM, CD40LG, anti-IgM, CD40LG]"
1492,"[C1419746, 9252, C1419746, 9252, C1419746, 925...","[RPS6KA5 gene, RPS6KA5, RPS6KA5 gene, RPS6KA5,..."
2034,"[C0020063, 5741, C0020063, 5741]","[PTH protein, human, PTH, PTH protein, human, ..."


In [59]:
sorted_start_id_split.head()

117                      [2100, 2100, C1414462, C1414462]
441                      [5465, 5465, C0166418, C0166418]
974                        [959, 959, C0051980, C0051980]
1492    [9252, 9252, 9252, 9252, 9252, 9252, 9252, 925...
2034                     [5741, 5741, C0020063, C0020063]
Name: ID, dtype: object

In [60]:
sorted_start_name_split.head()

117                    [ESR2, ESR2, ESR2 gene, ESR2 gene]
441     [PPARA, PPARA, Peroxisome Proliferator-Activat...
974                  [CD40LG, CD40LG, anti-IgM, anti-IgM]
1492    [RPS6KA5, RPS6KA5, RPS6KA5, RPS6KA5, RPS6KA5, ...
2034    [PTH, PTH, PTH protein, human, PTH protein, hu...
dtype: object
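
To see why the sort matters, a toy case with two subjects and two objects is enough. This is a sketch in plain Python with placeholder labels `A`, `B`, `X`, `Y`; the last part uses one ID/name pair from the output above to show that sorting `(ID, NAME)` tuples keeps each name next to its ID.

```python
from itertools import product

starts = ["B", "A"]
ends = ["X", "Y"]

# Repeat each list by the length of the other, as the notebook does
rep_starts = starts * len(ends)  # ['B', 'A', 'B', 'A']
rep_ends = ends * len(starts)    # ['X', 'Y', 'X', 'Y']

# Without sorting, element-wise pairing repeats the same two pairs
unsorted_pairs = set(zip(rep_starts, rep_ends))

# Sorting only the starts groups equal starts together, so each meets every end
sorted_pairs = set(zip(sorted(rep_starts), rep_ends))

# Sorting (ID, NAME) tuples keeps each name aligned with its ID
ids = ["C1414462", "2100"] * 2
names = ["ESR2 gene", "ESR2"] * 2
aligned = sorted(zip(ids, names))

print(unsorted_pairs)
print(sorted_pairs)
print(aligned)
```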

Now the algorithm continues in a similar manner to that of the subject-only and object-only corrections.

In [61]:
both_cols = all_cols[:]
both_cols.remove('SUBJECT_CUI')
both_cols.remove('SUBJECT_NAME')
both_cols.remove('OBJECT_CUI')
both_cols.remove('OBJECT_NAME')

In [62]:
new_both = dict()
for c in both_cols:
    tmp = pipe_lines.loc[multi_both_subset, c].apply(lambda x: [x]) * (start_lens * end_lens)
    new_both[c] = [x for x in chain(*tmp.values)]

In [63]:
end_name_split.values

array([list(['ESR1 gene', 'ESR1', 'ESR1 gene', 'ESR1']),
       list(['PLA2G2A', 'PLA2G10', 'PLA2G2A', 'PLA2G10']),
       list(['cyclin D2', 'CCND2', 'cyclin D2', 'CCND2']), ...,
       list(['PPARGC1A gene', 'PPARGC1A', 'PPARGC1A gene', 'PPARGC1A']),
       list(['TNFSF13B gene', 'TNFSF13B', 'TNFSF13B gene', 'TNFSF13B']),
       list(['STAT3 gene', 'STAT3', 'STAT3 gene', 'STAT3'])], dtype=object)

In [64]:
ob_nm = [x for x in chain(*end_name_split.values)]
ob_id = [x for x in chain(*end_id_split.values)]
sub_nm =[x for x in chain(*sorted_start_name_split.values)]
sub_id = [x for x in chain(*sorted_start_id_split.values)]

In [65]:
# quick check if the # of indices are the same
print('object name len: ', len(end_name_split))
print('object id len: ', len(end_id_split))
print('sub name len: ', len(sorted_start_name_split))
print('sub id len: ', len(sorted_start_id_split))

object name len:  384468
object id len:  384468
sub name len:  384468
sub id len:  384468


In [66]:
# quick check that, when unrolled, all the lengths are the same
print('object name len: ', len(ob_nm))
print('object id len: ', len(ob_id))
print('sub name len: ', len(sub_nm))
print('sub id len: ', len(sub_id))

object name len:  1948232
object id len:  1948232
sub name len:  1948232
sub id len:  1948232


In [67]:
fixed_both = pd.DataFrame(new_both)

fixed_both['SUBJECT_CUI'] = [x for x in chain(*sorted_start_id_split.values)]
fixed_both['SUBJECT_NAME'] = [x for x in chain(*sorted_start_name_split.values)]

fixed_both['OBJECT_CUI'] = [x for x in chain(*end_id_split.values)]
fixed_both['OBJECT_NAME'] = [x for x in chain(*end_name_split.values)]

fixed_both = fixed_both[all_cols]

In [68]:
fixed_both.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,10604461,206,16530497,INTERACTS_WITH,2100,ESR2,gngm,1,C1414461,ESR1 gene,gngm,1
1,10604461,206,16530497,INTERACTS_WITH,2100,ESR2,gngm,1,2099,ESR1,gngm,1
2,10604461,206,16530497,INTERACTS_WITH,C1414462,ESR2 gene,gngm,1,C1414461,ESR1 gene,gngm,1
3,10604461,206,16530497,INTERACTS_WITH,C1414462,ESR2 gene,gngm,1,2099,ESR1,gngm,1
4,10645011,923,16530726,INHIBITS,5465,PPARA,aapp,1,5320,PLA2G2A,gngm,1


In [69]:
# Clear Variables
start_id_split = None
start_name_split = None
end_id_split = None
end_name_split = None
sorted_start_name_split = None
sorted_start_id_split = None

Recombine these lines into the new dataframe. 

VER31: 104,929,678 rows

In [70]:
sem_df = pd.concat([good_lines, fixed_starts, fixed_ends, fixed_both]).reset_index(drop=True)

In [71]:
# Clear Variables
good_lines = None
fixed_starts = None
fixed_ends = None
fixed_both = None

In [72]:
print('The data now contains {:,} rows'.format(sem_df.shape[0]))

The data now contains 122,935,992 rows


# Normalizing Gene IDs to CUIs

Sometimes genes appear with a CUI as an identifier; other times they have an Entrez Gene ID.
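
A quick way to tell the two kinds of identifier apart is sketched below in plain Python. It assumes the usual UMLS convention that a CUI is the letter `C` followed by seven digits, which is stricter than the `startswith('C')` test used in this notebook; `id_kind` is a hypothetical helper name. The IDs are taken from rows shown in this notebook, including one corrupted value.

```python
import re

# Assumption: a CUI is "C" followed by exactly seven digits
CUI_PATTERN = re.compile(r"C\d{7}")


def id_kind(identifier):
    """Classify an identifier as a CUI, an Entrez Gene ID, or something else."""
    if CUI_PATTERN.fullmatch(identifier):
        return "cui"
    if identifier.isdigit():
        return "entrez"
    return "other"


# POMC appears both as CUI C1337111 and as Entrez Gene ID 5443
ids = ["C1337111", "5443", "C0812258", "1|medd"]
kinds = [id_kind(i) for i in ids]
print(kinds)
```

The extra `other` bucket is a design choice for this sketch: it separates leftover corrupted values from genuine Entrez IDs instead of treating everything that is not a CUI as a gene.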

In [73]:
# POMC Gene is 5443 and has CUI C1337111
sem_df.query('SUBJECT_CUI == "C1337111"').head(2)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
105219320,29597661,54478476,1259607,AUGMENTS,C1337111,POMC gene,gngm,1,C0040132,Thyroid Gland,bpoc,1
105231299,67156299,121330064,18977407,AFFECTS,C1337111,POMC gene,gngm,1,C0031715,Phosphorylation,moft,1


In [74]:
sem_df.query('SUBJECT_CUI == "5443"').head(2)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
406208,72228418,18820740,26476089,NEG_INTERACTS_WITH,5443,POMC,gngm,1,C0002520,Amino Acids,aapp,1
406231,72228448,18820740,26476089,NEG_INTERACTS_WITH,5443,POMC,gngm,1,C0033684,Proteins,aapp,0


In [75]:
#expand the sem_df to remove the lists within the series
sem_df.shape

(122935992, 12)

In [76]:
sem_df.explode(column='SUBJECT_CUI').shape

(122935992, 12)

In [77]:
sem_df.explode(column='SUBJECT_CUI').explode(column='OBJECT_CUI').shape

(122935992, 12)

## Strategy for normalizing to CUI

mygene.info now includes UMLS data, so we will use it as a reliable, up-to-date source for mapping.  For genes that cannot be mapped via mygene.info, we will use HGNC mappings, as UMLS contains those values.

In [78]:
import mygene
mg = mygene.MyGeneInfo()

In [79]:
gene_lines = ~sem_df['SUBJECT_CUI'].str.startswith('C')
genes_entrez = set(sem_df.loc[gene_lines, 'SUBJECT_CUI'])

gene_lines1 = ~sem_df['OBJECT_CUI'].str.startswith('C')
genes_entrez.update(set(sem_df.loc[gene_lines1, 'OBJECT_CUI']))

genes_need_fixing = gene_lines | gene_lines1

In [80]:
print("{} genes appear with Entrez IDs that will need to be mapped".format(len(genes_entrez)))

21428 genes appear with Entrez IDs that will need to be mapped


In [81]:
# Query Mygene.info and make the result a DataFrame
mg_result = mg.getgenes(list(genes_entrez), fields='symbol,namel,umls,HGNC', dotfield=True)
mg_result = pd.DataFrame(mg_result)
mg_result.columns = [c.replace('.', '_') for c in mg_result.columns]

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-21428...done.


In [82]:
e_to_cui = mg_result.dropna(subset=['umls_cui']).set_index('query')['umls_cui'].to_dict()
print("{} out of {} Entrez IDs can be mapped to CUI via mygene.info".format(len(e_to_cui), len(genes_entrez)))

20495 out of 21428 Entrez IDs can be mapped to CUI via mygene.info


In [83]:
mg = None

### Get some more mappings from umls

Although UMLS does not have direct mappings from Entrez Gene IDs to UMLS CUIs, it does have HGNC IDs.  Some Entrez to HGNC values were picked up from mygene, so they will be used to further increase the map's size.
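
The two-step fallback can be pictured as a chained dictionary lookup. The sketch below uses made-up identifiers throughout (none of these are real Entrez, HGNC, or CUI values), and `map_entrez` is a hypothetical helper; the notebook builds the equivalent dictionaries from mygene.info and MRCONSO.

```python
# All identifiers below are invented for illustration only
entrez_to_cui_direct = {"111": "C9000001"}                  # stands in for the mygene.info mapping
entrez_to_hgnc = {"222": "HGNC:0002", "333": "HGNC:0003"}   # stands in for Entrez -> HGNC
hgnc_to_cui = {"HGNC:0002": "C9000002"}                     # stands in for HGNC -> CUI from UMLS


def map_entrez(entrez_id):
    """Try the direct mapping first, then fall back to Entrez -> HGNC -> CUI."""
    if entrez_id in entrez_to_cui_direct:
        return entrez_to_cui_direct[entrez_id]
    hgnc_id = entrez_to_hgnc.get(entrez_id)
    return hgnc_to_cui.get(hgnc_id)  # None when either step is missing


mapped = {e: map_entrez(e) for e in ["111", "222", "333", "444"]}
unmapped = [e for e, cui in mapped.items() if cui is None]

print(mapped)
print(unmapped)
```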

In [84]:
# Fix mygene result so it has HGNC: at start of HGNC ids
mg_result['HGNC'] = 'HGNC:' + mg_result['HGNC']

In [85]:
need_map = genes_entrez - set(e_to_cui.keys())
e_to_hgnc = mg_result.query('query in @need_map').dropna(subset=['HGNC']).set_index('query')['HGNC'].to_dict()

hgnc_ids = list(e_to_hgnc.values())

len(hgnc_ids)

25

In [86]:
# Get the values from the UMLS metathesaurus
conso = load_umls.open_mrconso()
q_res = conso.query('SCUI in @hgnc_ids and TTY == "MTH_ACR"')
len(q_res)

14

In [87]:
hgnc_to_cui = q_res.set_index('SCUI')['CUI'].to_dict()
e_to_cui_1 = {k: hgnc_to_cui[v] for k, v in e_to_hgnc.items() if v in hgnc_to_cui.keys()}

e_to_cui = {**e_to_cui_1, **e_to_cui}

In [88]:
print("{} out of {} Entrez IDs now mapped".format(len(e_to_cui), len(genes_entrez)))

20509 out of 21428 Entrez IDs now mapped


In [89]:
# Some Entrez IDs map to multiple CUIs (a list), so the df needs to be exploded every time after mapping with e_to_cui.
counting_dict = ddict(int)
for i in list(e_to_cui.values()):
    if type(i)==list:
        counting_dict[tuple(i)]+=1

In [90]:
# lists in e_to_cui and the number of times e's are mapped to the cui
counting_dict

defaultdict(int,
            {('C1416655', 'C0920288'): 1,
             ('C1425504', 'C1422570'): 1,
             ('C1425505', 'C1422567'): 1,
             ('C1537454', 'C4722638'): 3})
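
Because some values in the map are lists, applying it can turn one row into several. The sketch below shows the idea in plain Python on toy rows. The `5443` to `C1337111` pair is the POMC example from this notebook and the list value is one of those shown above, but the Entrez ID `900001` and the toy rows are made up for illustration; `as_list` is a hypothetical helper.

```python
# Mapping values may be a single CUI or a list of CUIs.
# "900001" is a made-up Entrez ID used only for this example.
e_to_cui = {
    "5443": "C1337111",
    "900001": ["C1416655", "C0920288"],
}

# Toy (SUBJECT_CUI, PREDICATE, OBJECT_CUI) rows
rows = [
    ("5443", "AFFECTS", "C0031715"),
    ("900001", "AFFECTS", "C0031715"),
    ("C0003725", "PROCESS_OF", "C0999630"),
]


def as_list(value):
    """Wrap single values so that every mapping result can be iterated."""
    return value if isinstance(value, list) else [value]


expanded = []
for subj, pred, obj in rows:
    # IDs that are not in the map (already CUIs) pass through unchanged
    for cui in as_list(e_to_cui.get(subj, subj)):
        expanded.append((cui, pred, obj))

for new_row in expanded:
    print(new_row)
```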

### Generate a CUI to name map


Once the Entrez IDs are changed to CUIs, the names will not be the true name that is associated with that CUI.  We will use the UMLS API to get the correct names. This will ensure the correct UMLS name appears with the CUI.
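
The name map in this notebook is built from several sources merged with `{**a, **b}`, where the right-hand dictionary wins on shared keys. A minimal sketch of that precedence, using made-up CUIs and placeholder names; `ChainMap` is shown only as an equivalent way to read the same priority order and is not used by the notebook.

```python
from collections import ChainMap

# Made-up CUIs and placeholder names, for illustration only
semmed_names = {"C9000001": "name from SemmedDB"}
umls_names = {"C9000001": "name from UMLS", "C9000002": "name from UMLS"}
symbol_names = {"C9000002": "SYMBOL gene", "C9000003": "SYMBOL gene"}

# Later dictionaries override earlier ones, so SemmedDB names win,
# then UMLS names, then the symbol + ' gene' fallback
merged = {**symbol_names, **umls_names, **semmed_names}

# ChainMap searches left to right, giving the same priority order
layered = ChainMap(semmed_names, umls_names, symbol_names)

print(merged)
```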


In [91]:
# First get the names from semmed for everything that already has a CUI
d = sem_df[sem_df['SUBJECT_CUI'].str.startswith('C')].set_index('SUBJECT_CUI')['SUBJECT_NAME'].to_dict()
od = sem_df[sem_df['OBJECT_CUI'].str.startswith('C')].set_index('OBJECT_CUI')['OBJECT_NAME'].to_dict()

c_to_name_dict = {**d, **od}

In [92]:
len(c_to_name_dict)

322978

In [93]:
# generate a flat list of CUI values; nested lists cannot be hashed, so they cannot go into a set
e_to_cui_vals = []
for i in (list(e_to_cui.values())):
    if type(i) == str: 
        e_to_cui_vals.append(i)
    else:
        for j in i:
            e_to_cui_vals.append(j)

In [94]:
# use the flattened list because some values are lists, which cannot be hashed
need_name = set(e_to_cui_vals) - set(c_to_name_dict.keys())
len(need_name)

3895

In [95]:
# Get as many names as possible directly from UMLS
# ISPREF == "Y" keeps only the atoms flagged as preferred
c_to_name_1 = conso.query('CUI in @need_name and ISPREF == "Y"').set_index('CUI')['STR'].to_dict()
len(c_to_name_1)

3862

In [96]:
# Add the two dictionaries together. The missing values can be used to query
c_to_name_dict = {**c_to_name_1, **c_to_name_dict}
to_query = set(e_to_cui_vals) - set(c_to_name_dict.keys())
len(to_query)

33

In [97]:
# Check to see if there is a list in the data
mg_umls_cui=[]

for i in mg_result.umls_cui:
    if type(i)==list:
        print(i)

['C1425504', 'C1422570']
['C1416655', 'C0920288']
['C1537454', 'C4722638']
['C1537454', 'C4722638']
['C1537454', 'C4722638']
['C1425505', 'C1422567']


In [98]:
# From mg_result, find the rows whose umls_cui is a list, so they can be split out, expanded, and added back
# Store each entry's type (despite the column name, this is a type, not a length)
mg_result['umls_cui_len'] = mg_result.umls_cui.apply(lambda x: type(x))

In [99]:
mg_result.head()

Unnamed: 0,HGNC,_id,_version,notfound,query,symbol,umls_cui,umls_protein_cui,umls_cui_len
0,HGNC:2518,1497,2.0,,1497,CTNS,C1413803,,<class 'str'>
1,HGNC:15595,84561,2.0,,84561,SLC12A8,C1423604,,<class 'str'>
2,HGNC:53531,100008588,2.0,,100008588,RNA18SN5,C4555093,,<class 'str'>
3,HGNC:674,393,2.0,,393,ARHGAP4,C1412521,,<class 'str'>
4,HGNC:43874,100507257,1.0,,100507257,MEG9,C3470947,,<class 'str'>


In [100]:
# Most names are the Gene symbol + 'gene' so we'll use that for the remainder
name_from_mygene = (
    pd
    .concat( # concatenates types of lists and not_lists
        [mg_result[mg_result['umls_cui_len']!=list],
         mg_result[mg_result['umls_cui_len']==list].explode('umls_cui')] # expanded mg_results
    )
    .query('umls_cui in @to_query')
    .set_index('umls_cui')['symbol']
    .to_dict()
)
# returns a dict{umls_cui:symbol}

In [101]:
# Make sure those mapped from mygene via HGNC have names
to_query_1 = [k for k, v in hgnc_to_cui.items() if v in to_query]

In [102]:

hgnc_to_name = (
    pd
    .concat( # concatenates types of lists and not_lists
        [mg_result[mg_result['umls_cui_len']!=list],
         mg_result[mg_result['umls_cui_len']==list].explode('umls_cui')] # expanded mg_results
    )
    .query('HGNC in @to_query_1').set_index('HGNC')['symbol'].to_dict()
)

In [103]:
# nothing is matched
hgnc_to_name

{}

In [104]:
name_from_mygene.update({hgnc_to_cui[k]: v for k, v in hgnc_to_name.items()})

name_from_mygene = {k: v+' gene' for k, v in name_from_mygene.items()}

In [105]:
name_from_mygene

{'C1415391': 'H1-10 gene',
 'C1415392': 'H2AC8 gene',
 'C1415393': 'H2AC13 gene',
 'C1415395': 'H2AC14 gene',
 'C1415400': 'H2AC6 gene',
 'C1415402': 'H2AC17 gene',
 'C1415404': 'H2AC11 gene',
 'C1415405': 'H2AC20 gene',
 'C1415410': 'H2BC8 gene',
 'C1415412': 'H2BC13 gene',
 'C1415413': 'H2BC15 gene',
 'C1415414': 'H2BC14 gene',
 'C1415416': 'H2BC7 gene',
 'C1415417': 'H2BC6 gene',
 'C1415419': 'H2BC9 gene',
 'C1415420': 'H2BC10 gene',
 'C1415422': 'H2BC17 gene',
 'C1415425': 'H2BC11 gene',
 'C1415426': 'H2BS1 gene',
 'C1422138': 'H2AC12 gene',
 'C1422664': 'H2AJ gene',
 'C1425743': 'H2AC1 gene',
 'C1425744': 'H2BC1 gene',
 'C1426978': 'H2AW gene',
 'C1426979': 'H2AC21 gene',
 'C1426986': 'H2BC19P gene',
 'C1539441': 'DENND10 gene',
 'C1824474': 'LINC02875 gene',
 'C1825439': 'H1-7 gene',
 'C1825474': 'H1-9P gene',
 'C2829551': 'H2AZP1 gene',
 'C3815640': 'LDC1P gene',
 'C3891383': 'COMETT gene'}

In [106]:
# Ensure that all mappable genes now have a mappable name
c_to_name_dict = {**name_from_mygene, **c_to_name_dict}
to_query = set(e_to_cui_vals) - set(c_to_name_dict.keys())
len(to_query)

0

In [107]:
len(c_to_name_dict)

326873

In [108]:
pickle.dump(c_to_name_dict, open( "../data/cui_to_name.pkl", "wb" ) )
pickle.dump(e_to_cui, open( "../data/entrez_to_cui.pkl", "wb" ) )

In [109]:
# Check that the mapper produces the correct name when given the CUI for POMC gene
c_to_name_dict[e_to_cui['5443']]

'POMC gene'

In [110]:
# clear variables to make space in memory
mg_result = None
to_query_1 = None
name_from_mygene = None
hgnc_to_cui = None

### Apply the Changes

In [111]:
genes_need_fixing = gene_lines | gene_lines1

sem_df[genes_need_fixing].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
7,10593208,26,16530476,PART_OF,C0242210,Binding Protein,bacs,1,2273,FHL1,aapp,1
9,10593621,34,16530477,PART_OF,7523,XS,gngm,1,C0006034,Borrelia burgdorferi,bact,1
10,10593815,39,16530477,PRODUCES,7523,XS,gngm,1,C0019878,homocysteine,aapp,1
66,10600813,151,16530491,LOCATION_OF,C0013935,Embryo,emst,1,5539,PPY,aapp,1
68,10600962,153,16530491,PART_OF,5539,PPY,gngm,1,C0026591,Mothers,humn,1
70,10601037,153,16530491,LOCATION_OF,C0010834,Cytoplasm,celc,1,5539,PPY,aapp,1
72,10601184,155,16530491,LOCATION_OF,C0021358,Inferior Colliculus,bpoc,1,5539,PPY,aapp,1
300,10634790,767,16530706,PREDISPOSES,8720,MBTPS1,gngm,1,C0178874,Neoplasm progression,neop,1
301,10634871,768,16530706,ASSOCIATED_WITH,8720,MBTPS1,gngm,1,C1326912,Tumorigenesis,neop,1
303,10635049,771,16530706,AUGMENTS,7422,VEGFA,gngm,1,C0005847,Blood Vessels,bpoc,1


In [112]:
# Check for list-valued cells in sem_df; count each distinct list found among the subjects
type_in_sub = ddict(int)
for i in list(sem_df[genes_need_fixing]['SUBJECT_CUI']):
    if isinstance(i, list):
        i = tuple(i)  # lists are unhashable, so use a tuple as the dict key
        type_in_sub[i] += 1

type_in_sub

defaultdict(int, {})

In [113]:
# Check for list-valued cells in sem_df; count each distinct list found among the objects
type_in_obj = ddict(int)
for i in list(sem_df[genes_need_fixing]['OBJECT_CUI']):
    if isinstance(i, list):
        i = tuple(i)  # lists are unhashable, so use a tuple as the dict key
        type_in_obj[i] += 1

type_in_obj

defaultdict(int, {})

In [114]:
sem_df.shape

(122935992, 12)

In [115]:
sem_df.explode('SUBJECT_CUI').shape

(122935992, 12)

In [116]:
sem_df.explode('SUBJECT_CUI').explode('OBJECT_CUI').shape

(122935992, 12)

In [117]:
sem_df=sem_df.explode('SUBJECT_CUI').explode('OBJECT_CUI')

In [None]:
gene_lines = ~sem_df['SUBJECT_CUI'].str.startswith('C')
genes_entrez = set(sem_df.loc[gene_lines, 'SUBJECT_CUI'])

gene_lines1 = ~sem_df['OBJECT_CUI'].str.startswith('C')
genes_entrez.update(set(sem_df.loc[gene_lines1, 'OBJECT_CUI']))

genes_need_fixing = gene_lines | gene_lines1

In [None]:
genes_need_fixing

In [None]:
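# Convert Entrez gene IDs in the subject/object columns to CUIs using e_to_cui.
# An ID that maps to several CUIs yields a list; explode is applied afterwards to split those lists into rows.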
sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'] = sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'].apply(lambda e: e_to_cui.get(e,e))
sem_df.loc[genes_need_fixing, 'OBJECT_CUI'] = sem_df.loc[genes_need_fixing, 'OBJECT_CUI'].apply(lambda e: e_to_cui.get(e,e))

In [None]:
sem_df.shape

In [122]:
sem_df = sem_df.explode('SUBJECT_CUI')

In [123]:
# Shape after exploding SUBJECT_CUI
sem_df.shape

(122942852, 12)

In [124]:
# Second explosion: split list-valued OBJECT_CUI cells into rows
sem_df = sem_df.explode('OBJECT_CUI')

In [125]:
sem_df.shape

(122964075, 12)
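
The row count grows after each explode because a cell holding a list of CUIs is split into one row per CUI. A small sketch of that behavior on a toy frame, assuming (as the shapes above suggest) that some Entrez IDs were mapped to lists of CUIs; the identifiers below are placeholders:

```python
import pandas as pd

# Toy frame with placeholder CUIs: one list-valued subject, one list-valued object.
toy = pd.DataFrame({
    'SUBJECT_CUI': [['C0000001', 'C0000002'], 'C0000003'],
    'PREDICATE': ['PART_OF', 'ISA'],
    'OBJECT_CUI': ['C0000004', ['C0000005', 'C0000006']],
})

# Scalars pass through unchanged; each list becomes one row per element.
exploded = toy.explode('SUBJECT_CUI').explode('OBJECT_CUI')

print(exploded.shape)           # two rows became four
print(exploded.index.tolist())  # original index labels are repeated
```

Note that `explode` keeps the original index label on every row it produces, so the index is no longer unique afterwards.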

In [126]:
gene_lines = ~sem_df['SUBJECT_CUI'].str.startswith('C')
genes_entrez = set(sem_df.loc[gene_lines, 'SUBJECT_CUI'])

gene_lines1 = ~sem_df['OBJECT_CUI'].str.startswith('C')
genes_entrez.update(set(sem_df.loc[gene_lines1, 'OBJECT_CUI']))

genes_need_fixing = gene_lines | gene_lines1

In [127]:
# Fill in subject/object names from their CUIs (IDs without a known name are left as-is)
sem_df.loc[genes_need_fixing, 'SUBJECT_NAME'] = sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'].apply(lambda e: c_to_name_dict.get(e,e))
sem_df.loc[genes_need_fixing, 'OBJECT_NAME'] = sem_df.loc[genes_need_fixing, 'OBJECT_CUI'].apply(lambda e: c_to_name_dict.get(e,e))

In [128]:
sem_df[genes_need_fixing].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
9,10593621,34,16530477,PART_OF,7523,7523,gngm,1,C0006034,Borrelia burgdorferi,bact,1
10,10593815,39,16530477,PRODUCES,7523,7523,gngm,1,C0019878,homocysteine,aapp,1
8615,11811555,21179,16534425,ASSOCIATED_WITH,50941,50941,gngm,1,C0021390,Inflammatory Bowel Diseases,dsyn,1
10932,12165524,27984,16535373,PART_OF,6268,6268,gngm,1,C0085470,Pseudomonas putida,bact,1
10933,12165566,27984,16535373,INTERACTS_WITH,C0015690,"Fatty Acids, Unsaturated",lipd,1,6268,6268,gngm,1
10934,12165655,27985,16535373,PART_OF,6268,6268,gngm,1,C0085470,Pseudomonas putida,bact,1
11149,12197584,28648,16535458,USES,C0042196,Vaccination,topp,1,474222,474222,aapp,1
16638,13017663,43841,16538004,LOCATION_OF,C0043381,Y Chromosome,celc,1,560,560,aapp,1
16666,13020696,43873,16538008,LOCATION_OF,C0043381,Y Chromosome,celc,1,560,560,aapp,1
17533,13135322,45778,16538385,CAUSES,780896,780896,gngm,1,C0162638,Apoptosis,celf,1


### The lines that got no Query Result

Some of the IDs produced no query result, so their SUBJECT_CUI or OBJECT_CUI is still a bare number rather than a CUI. We'll examine those rows to see whether they offer any insight.

In [129]:
gene_lines2 = ~sem_df['SUBJECT_CUI'].str.startswith('C')
gene_lines3 = ~sem_df['OBJECT_CUI'].str.startswith('C')

genes_need_fixing1 = gene_lines2 | gene_lines3

sem_df[genes_need_fixing1].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
9,10593621,34,16530477,PART_OF,7523,7523,gngm,1,C0006034,Borrelia burgdorferi,bact,1
10,10593815,39,16530477,PRODUCES,7523,7523,gngm,1,C0019878,homocysteine,aapp,1
8615,11811555,21179,16534425,ASSOCIATED_WITH,50941,50941,gngm,1,C0021390,Inflammatory Bowel Diseases,dsyn,1
10932,12165524,27984,16535373,PART_OF,6268,6268,gngm,1,C0085470,Pseudomonas putida,bact,1
10933,12165566,27984,16535373,INTERACTS_WITH,C0015690,"Fatty Acids, Unsaturated",lipd,1,6268,6268,gngm,1
10934,12165655,27985,16535373,PART_OF,6268,6268,gngm,1,C0085470,Pseudomonas putida,bact,1
11149,12197584,28648,16535458,USES,C0042196,Vaccination,topp,1,474222,474222,aapp,1
16638,13017663,43841,16538004,LOCATION_OF,C0043381,Y Chromosome,celc,1,560,560,aapp,1
16666,13020696,43873,16538008,LOCATION_OF,C0043381,Y Chromosome,celc,1,560,560,aapp,1
17533,13135322,45778,16538385,CAUSES,780896,780896,gngm,1,C0162638,Apoptosis,celf,1


In [130]:
sem_df[genes_need_fixing1].shape

(145656, 12)

In [131]:
len(sem_df[genes_need_fixing1])

145656
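
One way to examine these rows further is to count which unmapped IDs occur most often across both columns. The sketch below does this on a toy frame that reuses a few IDs from the preview above; the variable names are illustrative and the counts are not those of the real data:

```python
import pandas as pd

# Toy stand-in for sem_df[genes_need_fixing1], with a few IDs from the preview.
toy = pd.DataFrame({
    'SUBJECT_CUI': ['7523', '7523', '6268', 'C0015690', '6268', 'C0042196'],
    'OBJECT_CUI': ['C0006034', 'C0019878', 'C0085470', '6268', 'C0085470', '474222'],
})

# Stack both columns, keep values that do not look like CUIs, and count them.
ids = pd.concat([toy['SUBJECT_CUI'], toy['OBJECT_CUI']])
unmapped_counts = ids[~ids.str.startswith('C')].value_counts()

print(unmapped_counts)
```

If a handful of IDs account for most of the rows, they could be looked up by hand rather than discarded.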

For now we will save these to their own file and remove them from the 'cleaned' data.

In [132]:
sem_df[genes_need_fixing1].to_csv('../data/semmedVER43_R_no_CUI.csv', index=False)

In [133]:
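# Note: explode() leaves repeated index labels, so dropping by label also removes
# any other row that shares a label with an unmapped row.
sem_df = sem_df.drop(sem_df[genes_need_fixing1].index)
sem_df.to_csv('../data/semmedVER43_R_clean.csv', index=False)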
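
Because `explode` repeats index labels, dropping by index label and filtering with the boolean mask are not always equivalent. A toy comparison follows; the IDs are placeholders, and whether the difference affects any rows in the real data is not checked here:

```python
import pandas as pd

# One subject cell holds a mapped CUI and an unmapped ID (both placeholders).
toy = pd.DataFrame({
    'SUBJECT_CUI': [['C0000001', '999'], 'C0000003'],
    'OBJECT_CUI': ['C0000004', 'C0000005'],
}).explode('SUBJECT_CUI')

unmapped = ~toy['SUBJECT_CUI'].str.startswith('C')

by_label = toy.drop(toy[unmapped].index)  # removes every row labelled 0
by_mask = toy[~unmapped]                  # removes only the '999' row

print(len(by_label), len(by_mask))
```

Dropping by label is the stricter choice: it discards the whole original predication when any of its exploded rows is unmapped. The mask keeps the rows that did map.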

## Remove Deprecated CUIs

Some CUIs in the database are deprecated. They may have newer versions to which they have not yet been mapped. However, UMLS keeps a record of these deprecated values, which can be used to map them to their current CUIs.

In [134]:
# Get the map from old CUIs to new CUIs
retired_cui = load_umls.open_mrcui()
retired_cui.head(2)

Unnamed: 0,CUI1,VER,REL,RELA,MAPREASON,CUI2,MAPIN
0,C0000002,2000AC,SY,,,C0007404,Y
1,C0000003,1999AA,SY,,,C0010504,Y


In [135]:
# Make a mapper from the old to the new
cui_map = retired_cui.set_index('CUI1')['CUI2'].to_dict()

# Ensure we have names for all the new values
no_name = set(cui_map.values()) - set(c_to_name_dict.keys())
query_result = {}  # stays empty if every new CUI already has a name
if len(no_name) > 0:
    query_result = conso.query('CUI in @no_name and ISPREF == "Y"').set_index('CUI')['STR'].to_dict()
    c_to_name_dict.update(query_result)
    pickle.dump(c_to_name_dict, open( "../data/cui_to_name.pkl", "wb" ) )
print('{} concepts identifiers could not be mapped to a name'.format(len(no_name) - len(query_result)))

1 concepts identifiers could not be mapped to a name


In [136]:
# How many unique S-P-O triples before de-deprecation?
'{:,} Unique S-P-O triples before de-deprecation'.format(len(sem_df.drop_duplicates(subset=['SUBJECT_CUI', 'PREDICATE', 'OBJECT_CUI'])))

'25,264,949 Unique S-P-O triples before de-deprecation'

In [137]:
# Map the deprecated values to their new CUIs
sem_df['SUBJECT_CUI'] = sem_df['SUBJECT_CUI'].apply(lambda c: cui_map.get(c, c))
sem_df['OBJECT_CUI'] = sem_df['OBJECT_CUI'].apply(lambda c: cui_map.get(c, c))

# Any removed CUIs should be taken out
sem_df = sem_df.dropna(subset=['SUBJECT_CUI', 'OBJECT_CUI'])

# Ensure the names are now corrected
sem_df['SUBJECT_NAME'] = sem_df['SUBJECT_CUI'].apply(lambda c: c_to_name_dict.get(c, c))
sem_df['OBJECT_NAME'] = sem_df['OBJECT_CUI'].apply(lambda c: c_to_name_dict.get(c, c))

# How many unique spo triples after the corrections?
'{:,} Unique S-P-O triples after de-deprecation'.format(len(sem_df.drop_duplicates(subset=['SUBJECT_CUI', 'PREDICATE', 'OBJECT_CUI'])))

'24,964,980 Unique S-P-O triples after de-deprecation'
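
The drop in unique triples is what we would expect when retired CUIs collapse onto their replacements. A toy sketch of the same steps: the two `SY` rows reuse the mappings shown in the `retired_cui` preview, while the `DEL` row with an empty `CUI2` is a hypothetical stand-in for a removed concept:

```python
import pandas as pd

# Miniature retired-CUI table; the third row is hypothetical.
retired = pd.DataFrame({
    'CUI1': ['C0000002', 'C0000003', 'C0000009'],
    'REL': ['SY', 'SY', 'DEL'],
    'CUI2': ['C0007404', 'C0010504', None],
})
cui_map = retired.set_index('CUI1')['CUI2'].to_dict()

# Toy triples: rows 0 and 1 differ only in using old vs. new CUIs.
triples = pd.DataFrame({
    'SUBJECT_CUI': ['C0000002', 'C0007404', 'C0000009'],
    'PREDICATE': ['ISA', 'ISA', 'ISA'],
    'OBJECT_CUI': ['C0010504', 'C0000003', 'C0010504'],
})

for col in ['SUBJECT_CUI', 'OBJECT_CUI']:
    triples[col] = triples[col].apply(lambda c: cui_map.get(c, c))

# A removed concept maps to a missing value and is dropped here.
triples = triples.dropna(subset=['SUBJECT_CUI', 'OBJECT_CUI'])
unique = triples.drop_duplicates(subset=['SUBJECT_CUI', 'PREDICATE', 'OBJECT_CUI'])

print(len(triples), len(unique))
```

In this sketch two distinct-looking rows turn out to be the same triple once both use current CUIs, and the row with a removed subject disappears entirely.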

In [138]:
sem_df.to_csv('../data/semmedVER43_R_clean_de-depricate.csv', index=False)