## Go from MeSH IDs to Names via UMLS

In the notebook `03-umls_cui_to_mesh_descriptorID` we built a map from UMLS CUIs to MeSH IDs. We also grabbed some concept names in case we needed them later. However, there's no guarantee that these are the 'preferred' names, so here we'll go a bit deeper and try to get the preferred ones first.

In [1]:
import pickle
import pandas as pd

import sys
sys.path.append('../tools/')
import load_umls

In [2]:
conso = load_umls.open_mrconso()


In [3]:
conso.head(2)

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAUI,SCUI,SDUI,SAB,TTY,CODE,STR,SRL,SUPPRESS,CVF
0,C0000005,ENG,P,L0000005,PF,S0007492,Y,A26634265,,M0019694,D012711,MSH,PEP,D012711,(131)I-Macroaggregated Albumin,0,N,256.0
1,C0000005,ENG,S,L0270109,PF,S0007491,Y,A26634266,,M0019694,D012711,MSH,ET,D012711,(131)I-MAA,0,N,256.0


In [4]:
msh_rows = conso.query('SAB == "MSH"')

In [5]:
len(msh_rows)

997781

TTY values hint at which names are the preferred ones.  
In particular, MH and NM come very close to a 1-to-1 relationship between ID and name.

See [abbreviations here](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html) for meanings of all the abbreviations
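
As a small illustration of the ratio used here, the sketch below computes names per unique ID for every TTY at once on a toy table. The rows, IDs, and the `groupby` approach are assumptions made up for illustration; they are not taken from MRCONSO.

```python
import pandas as pd

# Toy stand-in for the MSH rows (hypothetical values, not real MRCONSO data)
toy = pd.DataFrame({
    'SDUI': ['D000001', 'D000001', 'D000002', 'D000001', 'D000002'],
    'TTY':  ['PEP',     'PEP',     'PEP',     'MH',      'MH'],
    'STR':  ['name a',  'name b',  'name c',  'Main 1',  'Main 2'],
})

# Rows per TTY divided by unique IDs per TTY:
# a ratio of 1.0 means exactly one name per ID for that TTY
grouped = toy.groupby('TTY')['SDUI']
ratios = grouped.size() / grouped.nunique()
print(ratios)
```

A ratio above 1.0 means at least one ID carries several names under that TTY, so that TTY alone cannot pick a single name.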

In [6]:
for tty in ['PXQ', 'PEP', 'PCE', 'HT', 'QEV',  'NM', 'MH']:
    print('Ratio of names to unique IDs for TTY == "{}": '.format(tty), end='')
    q_res = msh_rows.query('LAT == "ENG" and TTY == @tty')
    print('{}'.format(q_res.shape[0] / q_res['SDUI'].nunique()))

Ratio of names to unique IDs for TTY == "PXQ": 3.0392156862745097
Ratio of names to unique IDs for TTY == "PEP": 2.6377830533235938
Ratio of names to unique IDs for TTY == "PCE": 1.6984903395068591
Ratio of names to unique IDs for TTY == "HT": 1.0
Ratio of names to unique IDs for TTY == "QEV": 1.0263157894736843
Ratio of names to unique IDs for TTY == "NM": 1.0
Ratio of names to unique IDs for TTY == "MH": 1.0


In [7]:
msh_to_name = {}

# Ordered from least to most preferred, so names from later TTYs overwrite earlier ones
for tty in ['PXQ', 'PEP', 'PCE', 'HT', 'QEV',  'NM', 'MH']:
    q_res = msh_rows.query('LAT == "ENG" and TTY == @tty')
    msh_to_name.update(q_res.set_index('SDUI')['STR'].to_dict())
len(msh_to_name) == msh_rows['SDUI'].nunique()

True
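
The order of the loop matters because each `update` overwrites any name already stored for the same ID. A minimal sketch on made-up rows (the IDs and strings are hypothetical) shows a later TTY winning over an earlier one:

```python
import pandas as pd

# Hypothetical rows: D000001 has both an entry term (PEP) and a main heading (MH)
toy = pd.DataFrame({
    'SDUI': ['D000001', 'D000001', 'D000002'],
    'TTY':  ['PEP', 'MH', 'PEP'],
    'STR':  ['entry term', 'Main Heading', 'only entry term'],
})

names = {}
# PEP is processed first, MH last, so MH names replace PEP names for shared IDs
for tty in ['PEP', 'MH']:
    rows = toy[toy['TTY'] == tty]
    names.update(rows.set_index('SDUI')['STR'].to_dict())

print(names)
```

IDs that only appear under a less preferred TTY still keep that name, which is why the full list of TTYs is included.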

Add in the old mappings (overwritten by the new ones wherever both exist) in case any IDs were missed.
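
For reference, dictionary unpacking gives precedence to the right-most dictionary, which is what lets the new names win. A minimal sketch with made-up keys and values:

```python
# Hypothetical mappings for illustration only
old = {'D000001': 'quick name', 'D000003': 'only in old'}
new = {'D000001': 'Preferred Name', 'D000002': 'only in new'}

# Keys in `new` overwrite the same keys from `old`; keys unique to either are kept
merged = {**old, **new}
print(merged)
```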

In [8]:
with open('../data/MeSH_to_name_quick_n_dirty.pkl', 'rb') as f:
    msh_to_name_old = pickle.load(f)

In [9]:
msh_to_name_final = {**msh_to_name_old, **msh_to_name}

In [10]:
print('Quick and Dirty MeSH concepts mapped to names: {:,}'.format(len(msh_to_name_old)))
print('MeSH concepts mapped to names now: {:,}'.format(len(msh_to_name)))
print('Total MeSH concepts with mapped names: {:,}'.format(len(msh_to_name_final)))

Quick and Dirty MeSH concepts mapped to names: 347,511
MeSH concepts mapped to names now: 347,565
Total MeSH concepts with mapped names: 347,565


In [11]:
with open('../data/MeSH_id_to_name_via_UMLS.pkl', 'wb') as f:
    pickle.dump(msh_to_name_final, f)