# Automated CUI Bank Generation Using the Unified Medical Language System (UMLS) API



# Method

## CUI Bank

For each search term, we created a list of relevant concepts extracted from the Unified Medical Language System (UMLS) through its application programming interface (API). The list was generated with a graph-search module that uses the UMLS API to traverse the UMLS graph from a given concept to all of its descendants.
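The traversal idea can be illustrated on a toy hierarchy. The sketch below is a plain breadth-first search over an in-memory dictionary; the node names and the `collect_descendants` helper are hypothetical and stand in for the API-backed graph search, which this notebook performs through the UMLS descendants endpoint instead.

```python
from collections import deque

# Hypothetical toy hierarchy: parent concept -> direct children.
# The identifiers are placeholders, not real CUIs.
TOY_CHILDREN = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}


def collect_descendants(start, children_of):
    """Return every concept reachable from `start`, excluding `start` itself."""
    seen = {start}          # guards against revisiting shared or cyclic nodes
    order = []              # descendants in the order they are discovered
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children_of.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


descendants = collect_descendants("A", TOY_CHILDREN)
print(descendants)
```

Note that "D" is reachable through both "B" and "C" but is reported only once, which is the behavior one would want when a concept has several parents.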

For example, for the term "anxiety", we generated a list of all UMLS concepts, each identified by a concept unique identifier (CUI), that are returned either directly by a search for "anxiety" or as a descendant of one of those results. For this search term, the returned concepts included C0520683 ("Occupation-related stress disorder") and C0497238 ("Fear of hypertension"). Some names map to multiple CUIs from different vocabularies; for example, C0424154 and C0497237 both refer to "Fear of heart attack".

We refer to the list of CUIs extracted from a particular search term as the CUI-bank for that term.
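As a minimal sketch of this definition, a CUI bank can be viewed as a mapping from a search term to the CUIs collected for it. The variable names below are illustrative only; the records are the four examples quoted above.

```python
from collections import defaultdict

# (search term, CUI, concept name) triples taken from the example above.
records = [
    ("anxiety", "C0520683", "Occupation-related stress disorder"),
    ("anxiety", "C0497238", "Fear of hypertension"),
    ("anxiety", "C0424154", "Fear of heart attack"),
    ("anxiety", "C0497237", "Fear of heart attack"),
]

cui_bank = defaultdict(list)      # search term -> list of CUIs
names_to_cuis = defaultdict(set)  # concept name -> set of CUIs sharing it

for term, cui, name in records:
    cui_bank[term].append(cui)
    names_to_cuis[name].add(cui)

# Names that are represented by more than one CUI
shared = {name: sorted(cuis) for name, cuis in names_to_cuis.items() if len(cuis) > 1}
print(dict(cui_bank))
print(shared)
```

Keeping the name-to-CUI view alongside the bank makes it easy to spot names such as "Fear of heart attack" that appear under several CUIs.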


## Code

This program uses the UMLS API to create a CUI bank for a specified list of search terms. Before running this code, copy the API key available on your UMLS account profile page into a .txt file (the code expects `apikey.txt`) and save it in the same folder as the code. Then set the hyperparameters, which include:
1. List of search terms (example: "anxiety", "depression", "heart disease")
2. List of vocabularies to consider (example: SNOMEDCT_US, ICD10CM, RXNORM, etc.)
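Reading the key file can be sketched as follows. The `read_api_key` helper is hypothetical (the notebook reads the file inline), and stripping whitespace is an added precaution against a trailing newline in the file, not something the original code does. The demonstration writes a placeholder key to a temporary directory so that it runs without a real key.

```python
import tempfile
from pathlib import Path


def read_api_key(path="apikey.txt"):
    """Read the API key from a text file, dropping surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key


# Demonstration with a placeholder key in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "apikey.txt"
    key_file.write_text("not-a-real-key\n", encoding="utf-8")
    demo_key = read_api_key(key_file)

print(repr(demo_key))
```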


In [1]:
!pip install simplejson



In [2]:
import pandas as pd
import time
import os
import requests
import lxml.html as lh
from lxml.html import fromstring
from bs4 import BeautifulSoup
import simplejson
import re
import copy
from IPython.display import clear_output
import pickle
import datetime


In [32]:
os.listdir()

['CUI_BANK_2020-10-19 14:46:19.211295.csv',
 'Untitled1.ipynb',
 '.DS_Store',
 'ICD10_guidelines.pdf',
 'CUI_BANK_2020-10-30 10:22:40.379944.csv',
 'CUI_BANK_2020-10-29 17:45:17.913307.csv',
 'cuicounts_sb.csv',
 'old codes',
 'Untitled.ipynb',
 'CUI_BANK_2020-10-19 14:51:41.350281.csv',
 'Concept_bank.pdf',
 'apikey.txt',
 'ICD-10.ipynb',
 'UMLS_manual.pdf',
 'CUI_BANK_2020-10-30 10:01:19.433252.csv',
 'CUI_BANK_2020-10-29 17:25:55.977396.csv',
 'CUI_BANK_anxiety_depression_dementia_falls_heart disease_weakness.csv',
 '.ipynb_checkpoints',
 'create_CUI_bank_final.ipynb',
 'CUI_BANK_2020-10-30 13:05:22.594204.csv',
 'CUI_BANK_2020-10-20 15:17:28.620386.csv',
 'Method_CUI_bank.ipynb',
 'seed_terms_Aug4_sb.xlsx']

# Launch

In [4]:
# time
start_time = datetime.datetime.now()
str(start_time)

'2020-10-30 13:05:22.594204'

# Set Hyperparameters

These parameters must be set by the user before the program is run. 

1. **term_list**: The list of terms to search for; the descendants of each result are also retrieved from UMLS. For example, "falls" retrieves every CUI that is either returned directly by this search query or is a descendant of one of the returned CUIs.
2. **sabs**: The list of source vocabularies to consider
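Turning a table of seed terms into `term_list` amounts to flattening the cells, dropping empty ones, and (optionally) removing repeats. The sketch below uses plain lists so it runs without the spreadsheet; `flatten_terms` is a hypothetical helper, the rows are the first rows of the seed-term table, and the `nan` cell is inserted purely to illustrate padding in columns of unequal length. De-duplication is an assumption on my part: the notebook keeps repeats such as "dizziness", which appears in two columns.

```python
import math


def is_missing(value):
    """True for the float NaN that spreadsheet readers use for empty cells."""
    return isinstance(value, float) and math.isnan(value)


def flatten_terms(rows):
    """Flatten rows into a list of unique, non-empty terms in first-seen order."""
    seen = set()
    terms = []
    for row in rows:
        for cell in row:
            if is_missing(cell):
                continue
            term = str(cell).strip()
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return terms


seed_rows = [
    ["cardiac", "falls", "frailty", "anxiety", "dementia"],
    ["heart", "fracture", "weakness", "depression", "senile"],
    ["arrhythmia", "unsteady", "dizziness", "psychiatric", "alzheimer"],
    ["myocardial", "dizziness", "fatigue", "panic", float("nan")],  # NaN added for illustration
]

demo_terms = flatten_terms(seed_rows)
print(len(demo_terms), demo_terms)
```

Removing repeats avoids issuing the same search twice, at the cost of losing the information about which column a repeated term came from.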

In [5]:
df = pd.read_excel('seed_terms_Aug4_sb.xlsx')
df.head()

Unnamed: 0,cardiovascular,falls,frailty,anxiety_depression,dementia
0,cardiac,falls,frailty,anxiety,dementia
1,heart,fracture,weakness,depression,senile
2,arrhythmia,unsteady,dizziness,psychiatric,alzheimer
3,myocardial,dizziness,fatigue,panic,altered mental status
4,cholesterol,balance,malaise,mood,cognition disorders


In [6]:
terms = list(df.values.ravel())
term_list = [x for x in terms if str(x)!= 'nan']
#term_list = ['insomnia']
print(term_list)

['insomnia']


In [7]:
sabs =  ['SNOMEDCT_US', 'ICD10CM','LNC','NCI','RXNORM', 'MSH']

**Once the hyperparameters have been set, go to the Menu bar of this notebook and click "Restart and Run All".**

## Set Option Parameters
1. Set **get_children = True** if you would like to get children of the search terms from the term_list (True by default)
2. Set **get_sem_type = True** if you would like to get Semantic Type
3. Set **get_parents = True** if you would like to obtain immediate parents of a CUI

In [31]:
get_children = True
get_sem_type = True
get_parents = True

cui_bank_name = 'CUI_BANK_' + str(start_time) + '.csv'
print(cui_bank_name.split())

['CUI_BANK_2020-10-30', '13:05:22.594204.csv']
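The file name built from `str(start_time)` contains a space and colons, which some file systems and command-line tools handle poorly. One possible alternative, shown here as a sketch with a hypothetical `make_cui_bank_name` helper, is to format the timestamp explicitly with `strftime`.

```python
import datetime


def make_cui_bank_name(when):
    """Build a CUI bank file name without spaces or colons."""
    stamp = when.strftime("%Y-%m-%d_%H-%M-%S")
    return f"CUI_BANK_{stamp}.csv"


example_name = make_cui_bank_name(datetime.datetime(2020, 10, 30, 13, 5, 22, 594204))
print(example_name)  # CUI_BANK_2020-10-30_13-05-22.csv
```

This drops the microseconds, so two runs started within the same second would share a name.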


# UMLS API

The UMLS API requires an API key, which is freely available to anyone who has a UMLS account. Save the API key in a text file that the program can read.

For any search query (example: "falls"), the code performs a UMLS search by term.
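To make the shape of a term search concrete, the sketch below builds the request URL without sending it. `build_search_url` is a hypothetical helper; the endpoint path and parameter names (`string`, `sabs`, `pageSize`, `ticket`) are the ones used in the code in this notebook. Joining the vocabularies with commas is an assumption about what the service expects (the notebook passes a Python list and lets `requests` encode it), and the ticket value is a placeholder.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URI = "https://uts-ws.nlm.nih.gov/rest"


def build_search_url(term, sabs, ticket, page_size=5000):
    """Return the full URL for a search-by-term request (nothing is sent)."""
    params = {
        "string": term,            # the search term, URL-encoded by urlencode
        "sabs": ",".join(sabs),    # assumption: comma-separated vocabulary list
        "pageSize": page_size,
        "ticket": ticket,          # single-use service ticket
    }
    return f"{BASE_URI}/search/current?{urlencode(params)}"


demo_url = build_search_url("heart disease", ["SNOMEDCT_US", "ICD10CM"], "ST-placeholder")
demo_query = parse_qs(urlsplit(demo_url).query)
print(demo_url)
```

Passing the term through `urlencode` rather than interpolating it into the path keeps multi-word terms such as "heart disease" correctly escaped.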

In [9]:
# Code from: https://github.com/HHS/uts-rest-api/blob/master/samples/python/Authentication.py
#!/usr/bin/python
## 6/16/2017 - remove PyQuery dependency
## 5/19/2016 - update to allow for authentication based on api-key, rather than username/pw
## See https://documentation.uts.nlm.nih.gov/rest/authentication.html for full explanation

import requests
import lxml.html as lh
from lxml.html import fromstring
from bs4 import BeautifulSoup
import simplejson

uri="https://utslogin.nlm.nih.gov"
#option 1 - username/pw authentication at /cas/v1/tickets
#auth_endpoint = "/cas/v1/tickets/"
#option 2 - api key authentication at /cas/v1/api-key
auth_endpoint = "/cas/v1/api-key"

class Authentication:

   #def __init__(self, username,password):
   def __init__(self, apikey):
    #self.username=username
    #self.password=password
    self.apikey=apikey
    self.service="http://umlsks.nlm.nih.gov"

   def gettgt(self):
     #params = {'username': self.username,'password': self.password}
     params = {'apikey': self.apikey}
     h = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain", "User-Agent":"python" }
     r = requests.post(uri+auth_endpoint,data=params,headers=h)
     response = fromstring(r.text)
     ## extract the entire URL needed from the HTML form (action attribute) returned - looks similar to https://utslogin.nlm.nih.gov/cas/v1/tickets/TGT-36471-aYqNLN2rFIJPXKzxwdTNC5ZT7z3B3cTAKfSc5ndHQcUxeaDOLN-cas
     ## we make a POST call to this URL in the getst method
     tgt = response.xpath('//form/@action')[0]
     return tgt

   def getst(self,tgt):

     params = {'service': self.service}
     h = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain", "User-Agent":"python" }
     r = requests.post(tgt,data=params,headers=h)
     st = r.text
     return st

    
def get_cuis_list_from_term_list(TERM_LIST,tgt):
    """Accepts a list of search terms and returns search results (cuis and terms)
    """
    
    # Validate the argument itself (not the global term_list) before searching
    if not isinstance(TERM_LIST, list) or len(TERM_LIST) == 0:
        print("Term list is empty or not a list; nothing to search.")
        return []
    
    # Build CUI list
    print("Getting cui list")
    CUI_LIST = []
    for target_term in TERM_LIST:
        print(target_term)
        query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
        endpoint = f'/search/current?string={target_term}'

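        try:
            r = requests.get(base_uri + endpoint, params=query)
        except requests.exceptions.RequestException:
            # generate_tgt() returns (tgt, AuthClient); request a fresh service ticket as well
            tgt, _ = generate_tgt()
            query['ticket'] = AuthClient.getst(tgt)
            r = requests.get(base_uri + endpoint, params=query)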
            
        cui_results = simplejson.loads(r.text)
        for result in cui_results['result']['results']:
            if 'rootSource' not in result:
                continue
            if result['rootSource'] in sabs + ['MTH']:
                CUI_LIST.append((target_term, result['ui'], result['name']))
    return CUI_LIST



def get_cui_list_from_term(target_term, tgt):
    """Accepts a target_term and returns the CUIs associated with that term
    """
    
    query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
    endpoint = f'/search/current?string={target_term}'

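    try:
        r = requests.get(base_uri + endpoint, params=query)
    except requests.exceptions.RequestException:
        # generate_tgt() returns (tgt, AuthClient); request a fresh service ticket as well
        tgt, _ = generate_tgt()
        query['ticket'] = AuthClient.getst(tgt)
        r = requests.get(base_uri + endpoint, params=query)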

    cui_results = simplejson.loads(r.text)
    cui_list = []
    
    for result in cui_results['result']['results']:
        cui_list.append(result['ui'])
        
    return ';'.join(cui_list)



def get_atoms_from_cuis(CUI_LIST, tgt, verbose = False):
    """Obtains the atoms for a list of CUIs
    """
    
    
    CODE_LIST = {}
    for c, (term, cui, name) in enumerate(CUI_LIST):
        
        # time.sleep(wait_for_confirmation)
        if c%10 == 0:
            clear_output()
        
        # Initialize for specific term-cui pair

        print(f"Obtaining ATOMS {c} out of {len(CUI_LIST)}")

        query = {'pageSize': 5000, 'sabs': [], 'ticket':AuthClient.getst(tgt)}
        endpoint = f'/content/current/CUI/{cui}/atoms?language=ENG'


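        try:
            r = requests.get(base_uri + endpoint, params=query)
        except requests.exceptions.RequestException:
            # generate_tgt() returns (tgt, AuthClient); request a fresh service ticket as well
            tgt, _ = generate_tgt()
            query['ticket'] = AuthClient.getst(tgt)
            r = requests.get(base_uri + endpoint, params=query)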
            
        atoms = simplejson.loads(r.text)
        if 'result' not in atoms:
            continue
        # len(atoms['result'])

        # each atom can have many or no children
        for atom in atoms['result']:
            
            if atom['language'].find('ENG') < 0:
                continue

            code = re.findall(r'SNOMEDCT_US/(\d+)', atom['code'])
            name = atom['name']

            if atom['code'].find('SNOMEDCT_US') == -1:
                continue
                
            if verbose:
                print()
                print("Name of this atom is:", atom['name'])
                print("root_source is:", atom['rootSource'])
                
                
            if len(code)>0:
                if verbose:
                    print("code is:", code[0])
                if (term, cui, name) not in CODE_LIST:
                    CODE_LIST[(term, cui, name)] = []

                CODE_LIST[(term, cui, name)].append(code[0])
            else:
                if verbose:
                    # The original appended to an undefined `comments` variable here
                    print(f"No code found for {term}/{cui}")
    
    print(f"Codes have been obtained for {len(CODE_LIST)} cui-term pairs")

    # CHILDREN_LIST= get_descendants_of_atoms(CODE_LIST, tgt)
                
    return CODE_LIST


def get_descendants_of_atoms(CODE_LIST, tgt, verbose = False):
    """Retrieves the descendants of every SNOMED CT code in CODE_LIST
    """

    ALL_CHILDREN = dict()
    for cl, (term,cui,name) in enumerate(CODE_LIST.keys()):

        
        if cl%10 == 0:
            clear_output()
            print("Getting children")
            print("Children obtained so far:", len(ALL_CHILDREN))
        print(f"{cl} out of {len(CODE_LIST)}")
        
        for code in CODE_LIST[(term,cui,name)]:

            query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
            endpoint = f'/content/current/source/SNOMEDCT_US/{code}/descendants'

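            try:
                r = requests.get(base_uri + endpoint, params=query)
            except requests.exceptions.RequestException:
                # generate_tgt() returns (tgt, AuthClient); request a fresh service ticket as well
                tgt, _ = generate_tgt()
                query['ticket'] = AuthClient.getst(tgt)
                r = requests.get(base_uri + endpoint, params=query)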

            children = simplejson.loads(r.text)



            if 'result' not in children or children['result'] == None:
                continue

            if verbose:
                print(f"Total number of children of {name} are", len(children['result']))



            if (term, cui,name) not in ALL_CHILDREN:
                ALL_CHILDREN[(term,cui,name)] = []

            for child in children['result']:

                if verbose:
                    print(f"getting children {child['name']} and code {child['ui']} of {(term,cui,name)}")
                    print()

                if 'rootSource' not in child:
                    # Keep three fields so downstream code can unpack (name, code, cui)
                    ALL_CHILDREN[(term,cui,name)].append(('missing', 'missing', 'missing'))
                    continue
                cui_descendants = get_cui_list_from_term(child['name'], tgt)
                ALL_CHILDREN[(term,cui,name)].append((child['name'], child['ui'], cui_descendants))
    return ALL_CHILDREN


                          
def generate_tgt(sabs =  sabs, base_uri =  'https://uts-ws.nlm.nih.gov/rest'):
    """Generate a ticket-granting ticket (TGT), valid for ~8 hours.

    Returns a (tgt, AuthClient) tuple. The sabs and base_uri arguments are unused.
    """

    # Read the API key from apikey.txt in the working directory and authenticate
    file = os.path.join(os.getcwd(), 'apikey.txt')
    with open(file) as f:
        apikey = f.read().strip()  # drop any trailing newline
    AuthClient = Authentication(apikey)
    tgt = AuthClient.gettgt()
    return tgt, AuthClient
                          
                          
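def get_tui_from_cui(cui, query):
    """Return the name of the first semantic type of a CUI, or None if unavailable.

    query must already contain a valid single-use service ticket.
    """
    endpoint = f'/content/current/CUI/{cui}/'

    try:
        r = requests.get(base_uri + endpoint, params=query)
    except requests.exceptions.RequestException:
        # generate_tgt() returns (tgt, AuthClient); request a fresh service ticket as well
        tgt, _ = generate_tgt()
        query = dict(query, ticket=AuthClient.getst(tgt))
        r = requests.get(base_uri + endpoint, params=query)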
                          
    cui_results = simplejson.loads(r.text)
    if ('result' in cui_results) and ('semanticTypes' in cui_results['result']):
        return cui_results['result']['semanticTypes'][0]['name']
    else:
        return None


def update_cui_term_dict(CUI_LIST):
    for f,c,n in CUI_LIST:
        if c not in CUI_TERM_DICT:
            CUI_TERM_DICT[c] = n


## Initialize

In [10]:
# Initialize
base_uri = 'https://uts-ws.nlm.nih.gov/rest'
# Set parameters
# Get API key and authenticate and initialize
# The tgt works for 8 hrs
file = 'apikey.txt'
with open(file) as f:    
    apikey = f.read()
AuthClient = Authentication(apikey)
tgt = AuthClient.gettgt()
version = '2020AA'
base_uri = 'https://uts-ws.nlm.nih.gov/rest'
sabs = ['SNOMEDCT_US', 'ICD10CM','LNC','NCI','RXNORM', 'MSH']

In [11]:
global AuthClient

# Implementation

In [12]:
this_directory = os.getcwd()
file = 'apikey.txt'
file = os.path.join(this_directory, file)
with open(file) as f:    
    apikey = f.read()
AuthClient = Authentication(apikey)
tgt = AuthClient.gettgt()

### Create Preliminary List of CUIs

In [13]:
tgt, Auth = generate_tgt()
base_uri = 'https://uts-ws.nlm.nih.gov/rest'
# ['frailty', 'diabetes', 'heart attack', 'anxiety depression'] #['falls', 'anxiety', 'depression']
CUI_LIST = get_cuis_list_from_term_list(term_list, tgt)



Getting cui list
insomnia


In [14]:
len(CUI_LIST), CUI_LIST[:15]

(124,
 [('insomnia', 'C0917801', 'Sleeplessness'),
  ('insomnia', 'C4082202', 'Sleep Quality Question'),
  ('insomnia', 'C0029645', 'Other insomnia'),
  ('insomnia', 'C0033139', 'Primary Insomnia'),
  ('insomnia', 'C0752286', 'Sleep State Misperception'),
  ('insomnia', 'C1561841', 'Adjustment insomnia'),
  ('insomnia', 'C1561842', 'Idiopathic insomnia'),
  ('insomnia', 'C1960036', 'Psychophysiologic insomnia'),
  ('insomnia', 'C0349255', 'Nonorganic Insomnia'),
  ('insomnia', 'C0393760', 'Initial insomnia'),
  ('insomnia', 'C0393761', 'Middle insomnia'),
  ('insomnia', 'C0751249', 'Chronic Insomnia'),
  ('insomnia', 'C1333141', 'Conditioned Insomnia'),
  ('insomnia', 'C1963237', 'Insomnia, CTCAE 3.0'),
  ('insomnia', 'C3640481', 'HAMA - Insomnia')])

### Get Atoms
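The atom step keeps only atoms whose `code` field points at SNOMEDCT_US and extracts the numeric code with a regular expression. A minimal offline sketch of that extraction follows. The `snomed_code` helper is hypothetical, the atom dictionaries are hand-written stand-ins, and the exact shape of the `code` URL is an assumption inferred from the pattern used in the notebook; the non-SNOMED entry is a placeholder.

```python
import re

SNOMED_CODE = re.compile(r"SNOMEDCT_US/(\d+)")


def snomed_code(atom):
    """Return the SNOMED CT code embedded in an atom's `code` field, or None."""
    match = SNOMED_CODE.search(atom.get("code", ""))
    return match.group(1) if match else None


# Hand-written stand-ins for API results (assumed URL shape)
demo_atoms = [
    {"name": "Insomnia",
     "code": "https://uts-ws.nlm.nih.gov/rest/content/2020AA/source/SNOMEDCT_US/193462001"},
    {"name": "Insomnia",
     "code": "https://uts-ws.nlm.nih.gov/rest/content/2020AA/source/MSH/PLACEHOLDER"},
]

demo_codes = [snomed_code(atom) for atom in demo_atoms]
print(demo_codes)
```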

In [15]:
%%time
tgt, Auth = generate_tgt()
atoms = get_atoms_from_cuis(CUI_LIST, tgt)
len(atoms)

Obtaining ATOMS 120 out of 124
Obtaining ATOMS 121 out of 124
Obtaining ATOMS 122 out of 124
Obtaining ATOMS 123 out of 124
Codes have been obtained for 129 cui-term pairs
CPU times: user 7.68 s, sys: 342 ms, total: 8.02 s
Wall time: 1min 44s


129

In [16]:
atoms

{('insomnia', 'C0917801', 'Insomnia'): ['193462001'],
 ('insomnia', 'C0917801', 'Insomnia (disorder)'): ['193462001'],
 ('insomnia', 'C0917801', 'Sleeplessness'): ['193462001'],
 ('insomnia', 'C0033139', 'Primary insomnia'): ['3972004'],
 ('insomnia', 'C0033139', 'Primary insomnia (disorder)'): ['3972004'],
 ('insomnia', 'C0752286', 'Paradoxical insomnia'): ['427745001'],
 ('insomnia', 'C0752286', 'Sleep state misperception'): ['427745001'],
 ('insomnia',
  'C0752286',
  'Sleep state misperception (finding)'): ['427745001'],
 ('insomnia', 'C1561841', 'Adjustment insomnia'): ['472819006'],
 ('insomnia', 'C1561841', 'Adjustment insomnia (disorder)'): ['472819006'],
 ('insomnia', 'C1561842', 'Idiopathic insomnia'): ['3972004'],
 ('insomnia', 'C1960036', 'Psychophysiologic insomnia'): ['425832009'],
 ('insomnia',
  'C1960036',
  'Psychophysiologic insomnia (disorder)'): ['425832009'],
 ('insomnia', 'C0349255', 'Nonorganic insomnia'): ['192454004'],
 ('insomnia', 'C0349255', 'Nonorganic ins

### Get Children
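Several atom names of the same CUI share one SNOMED CT code, so the descendants endpoint is queried repeatedly for identical codes. One possible optimization, sketched here with a hypothetical `unique_cui_codes` helper, is to collapse the atoms to unique (CUI, code) pairs first. The sample entries are copied from the atoms output above. This is only a sketch: the CUI bank construction later in this notebook looks children up by (term, CUI, name), so that lookup would need adjusting if the keys changed.

```python
def unique_cui_codes(atoms):
    """Collapse {(term, cui, name): [codes]} to unique (cui, code) pairs."""
    seen = set()
    pairs = []
    for (term, cui, name), codes in atoms.items():
        for code in codes:
            if (cui, code) not in seen:
                seen.add((cui, code))
                pairs.append((cui, code))
    return pairs


sample_atoms = {
    ("insomnia", "C0917801", "Insomnia"): ["193462001"],
    ("insomnia", "C0917801", "Insomnia (disorder)"): ["193462001"],
    ("insomnia", "C0917801", "Sleeplessness"): ["193462001"],
    ("insomnia", "C0033139", "Primary insomnia"): ["3972004"],
    ("insomnia", "C0033139", "Primary insomnia (disorder)"): ["3972004"],
}

pairs = unique_cui_codes(sample_atoms)
print(pairs)
```

For this sample, five atom entries reduce to two descendant queries.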

In [17]:
%%time
tgt, Auth = generate_tgt()
children = get_descendants_of_atoms(atoms, tgt)
len(children)

Getting children
Children obtained so far: 21
120 out of 129
121 out of 129
122 out of 129
123 out of 129
124 out of 129
125 out of 129
126 out of 129
127 out of 129
128 out of 129
CPU times: user 16 s, sys: 663 ms, total: 16.7 s
Wall time: 2min 57s


21

In [18]:
total_children = 0
for parent in children:
    print(f"{parent} has {len(children[parent])} descendant")
    total_children = total_children + len(children[parent])
print()
print("Total number of children:", total_children)

('insomnia', 'C0917801', 'Insomnia') has 34 descendant
('insomnia', 'C0917801', 'Insomnia (disorder)') has 34 descendant
('insomnia', 'C0917801', 'Sleeplessness') has 34 descendant
('insomnia', 'C0541798', 'Early waking') has 1 descendant
('insomnia', 'C0541798', 'Matutinal insomnia') has 1 descendant
('insomnia', 'C0541798', 'Terminal insomnia') has 1 descendant
('insomnia', 'C0541798', 'Terminal insomnia (disorder)') has 1 descendant
('insomnia', 'C0541798', 'Wakes and cannot sleep again') has 1 descendant
('insomnia', 'C0541798', 'Wakes early') has 1 descendant
('insomnia', 'C0233505', 'Mood insomnia') has 3 descendant
('insomnia', 'C0233505', 'Mood insomnia (finding)') has 3 descendant
('insomnia', 'C1561701', 'Behavioral insomnia of childhood') has 3 descendant
('insomnia', 'C1561701', 'Behavioral insomnia of childhood (disorder)') has 3 descendant
('insomnia', 'C1561701', 'Behavioral sleep problem') has 3 descendant
('insomnia', 'C1561701', 'Behavioural insomnia of childhood') ha

In [19]:
children.keys()

dict_keys([('insomnia', 'C0917801', 'Insomnia'), ('insomnia', 'C0917801', 'Insomnia (disorder)'), ('insomnia', 'C0917801', 'Sleeplessness'), ('insomnia', 'C0541798', 'Early waking'), ('insomnia', 'C0541798', 'Matutinal insomnia'), ('insomnia', 'C0541798', 'Terminal insomnia'), ('insomnia', 'C0541798', 'Terminal insomnia (disorder)'), ('insomnia', 'C0541798', 'Wakes and cannot sleep again'), ('insomnia', 'C0541798', 'Wakes early'), ('insomnia', 'C0233505', 'Mood insomnia'), ('insomnia', 'C0233505', 'Mood insomnia (finding)'), ('insomnia', 'C1561701', 'Behavioral insomnia of childhood'), ('insomnia', 'C1561701', 'Behavioral insomnia of childhood (disorder)'), ('insomnia', 'C1561701', 'Behavioral sleep problem'), ('insomnia', 'C1561701', 'Behavioural insomnia of childhood'), ('insomnia', 'C1561701', 'Behavioural sleep problem'), ('insomnia', 'C1561839', 'Drug-induced insomnia'), ('insomnia', 'C1561839', 'Drug-induced insomnia (disorder)'), ('insomnia', 'C1561850', 'Insomnia disorder relat

## Create CUI Bank

The following format is followed:

root_term | CUI | CUI_name | root_list | semType | Is-a_CUI (Only if root_list is False, what is its Parent CUI) | is-a CUI_Name| is-a semType |

Example:
Cardiovascular | C0155626 | Acute myocardial infarction |  False | Disease or Syndrome |  C0027051 | Myocardial Infarction | Disease or Syndrome |
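
When a child name maps to several CUIs, the lookup returns them as one `;`-separated string, and each CUI becomes its own row. The sketch below shows that expansion with plain dictionaries; `expand_child_rows` is a hypothetical helper, and the sample values are taken from the "Late insomnia" rows of the resulting table. Skipping empty CUI strings is my addition, since splitting an empty string would otherwise yield a row with a blank CUI.

```python
def expand_child_rows(root_term, root_cui, root_name, child_name, child_cuis):
    """Return one row (as a dict) per CUI in a ';'-separated CUI string."""
    rows = []
    for cui in child_cuis.split(";"):
        if not cui:
            continue  # skip empty entries instead of writing a blank CUI
        rows.append({
            "root_term": root_term,
            "CUI": cui,
            "CUI_name": child_name,
            "root_list": False,   # a descendant, not a direct search result
            "root_cui": root_cui,
            "root_name": root_name,
        })
    return rows


demo_rows = expand_child_rows(
    "insomnia", "C0917801", "Sleeplessness", "Late insomnia", "C0581874;C3640497"
)
for row in demo_rows:
    print(row)
```

Collecting rows in a list and creating the table once at the end (for example with `pd.DataFrame(rows)`) is generally faster than growing a DataFrame one `.loc` assignment at a time.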

In [20]:
%%time
# Initialize
DF_CUI_BANK = pd.DataFrame(columns = ['root_term','CUI', 'CUI_name', 'root_list', 'root_cui', 'root_name'])

for root,cui,name in CUI_LIST:
    idx = len(DF_CUI_BANK)
    DF_CUI_BANK.loc[idx,"root_term"] = root
    DF_CUI_BANK.loc[idx,"CUI"] = cui
    DF_CUI_BANK.loc[idx,"CUI_name"] = name
    DF_CUI_BANK.loc[idx,"root_list"] = True
    DF_CUI_BANK.loc[idx,"root_cui"] = None
    DF_CUI_BANK.loc[idx,"root_name"] = None
    
    if (root,cui,name) in children:
        for child_name, child_code, child_cui in children[(root,cui,name)]:
            
            if len(child_cui.split(';')) == 1:
                idx = len(DF_CUI_BANK)
                DF_CUI_BANK.loc[idx,"root_term"] = root
                DF_CUI_BANK.loc[idx,"CUI"] = child_cui
                DF_CUI_BANK.loc[idx,"CUI_name"] = child_name
                DF_CUI_BANK.loc[idx,"root_list"] = False
                DF_CUI_BANK.loc[idx,"root_cui"] = cui
                DF_CUI_BANK.loc[idx,"root_name"] = name
            elif len(child_cui.split(';')) > 1:
                
                for child in child_cui.split(';'):
                    idx = len(DF_CUI_BANK)
                    DF_CUI_BANK.loc[idx,"root_term"] = root
                    DF_CUI_BANK.loc[idx,"CUI"] = child
                    DF_CUI_BANK.loc[idx,"CUI_name"] = child_name
                    DF_CUI_BANK.loc[idx,"root_list"] = False
                    DF_CUI_BANK.loc[idx,"root_cui"] = cui
                    DF_CUI_BANK.loc[idx,"root_name"] = name
            

CPU times: user 262 ms, sys: 8.48 ms, total: 271 ms
Wall time: 266 ms


In [21]:
DF_CUI_BANK.head()

Unnamed: 0,root_term,CUI,CUI_name,root_list,root_cui,root_name
0,insomnia,C0917801,Sleeplessness,True,,
1,insomnia,C3531724,Primary hyposomnia,False,C0917801,Sleeplessness
2,insomnia,C3662857,Insomnia caused by alcohol,False,C0917801,Sleeplessness
3,insomnia,C0581874,Late insomnia,False,C0917801,Sleeplessness
4,insomnia,C3640497,Late insomnia,False,C0917801,Sleeplessness


In [22]:
%%time
if get_sem_type == True:
    
    DF_CUI_BANK['semType'] = ''
    for row in DF_CUI_BANK.iterrows():
        idx = row[0]
        cui = row[1].CUI
        root_cui = row[1].root_cui

        if idx%20 == 0:
            clear_output()
        print(f"{idx} of {len(DF_CUI_BANK)}")

        query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
        semType = get_tui_from_cui(cui, query)
        DF_CUI_BANK.loc[idx, 'semType'] = semType
        
#         if root_cui == None:
#             continue
#         else:
#             query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
#             parent_semType = get_tui_from_cui(root_cui, query)
#             DF_CUI_BANK.loc[idx, 'root_semType'] = parent_semType
            
            
DF_CUI_BANK = DF_CUI_BANK[['root_term', 'CUI', 'CUI_name', 'semType',
                           'root_list', 
                           'root_cui','root_name']]

180 of 192
181 of 192
182 of 192
183 of 192
184 of 192
185 of 192
186 of 192
187 of 192
188 of 192
189 of 192
190 of 192
191 of 192
CPU times: user 17.7 s, sys: 754 ms, total: 18.4 s
Wall time: 2min 31s


In [23]:
DF_CUI_BANK

Unnamed: 0,root_term,CUI,CUI_name,semType,root_list,root_cui,root_name
0,insomnia,C0917801,Sleeplessness,Sign or Symptom,True,,
1,insomnia,C3531724,Primary hyposomnia,Finding,False,C0917801,Sleeplessness
2,insomnia,C3662857,Insomnia caused by alcohol,Mental or Behavioral Dysfunction,False,C0917801,Sleeplessness
3,insomnia,C0581874,Late insomnia,Mental or Behavioral Dysfunction,False,C0917801,Sleeplessness
4,insomnia,C3640497,Late insomnia,Intellectual Product,False,C0917801,Sleeplessness
...,...,...,...,...,...,...,...
187,insomnia,C3166625,What number describes how much your insomnia h...,Intellectual Product,True,,
188,insomnia,C3166626,What number describes how much your insomnia h...,Intellectual Product,True,,
189,insomnia,C3175407,How old were you the last time you experienced...,Clinical Attribute,True,,
190,insomnia,C3175408,How old were you the last time you experienced...,Intellectual Product,True,,


In [24]:
# break_here

# Get Immediate Parents
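The parent lookup resolves the same SNOMED CT codes to CUIs many times, because many rows share parents. A possible way to cut the number of requests, shown as a sketch rather than as part of the original method, is to memoize the lookup with `functools.lru_cache`. `slow_lookup` is a hypothetical stand-in for the network call, and `FAKE_INDEX` holds one code-to-CUI pair taken from the atoms output earlier in this notebook.

```python
from functools import lru_cache

FAKE_INDEX = {"193462001": "C0917801"}  # stand-in for the remote service
lookup_calls = []                        # records every "network" call made


def slow_lookup(code):
    """Hypothetical stand-in for a code-to-CUI request."""
    lookup_calls.append(code)
    return FAKE_INDEX.get(code)


@lru_cache(maxsize=None)
def cached_cui_from_code(code):
    """Resolve a code to a CUI, remembering earlier answers."""
    return slow_lookup(code)


results = [cached_cui_from_code("193462001") for _ in range(3)]
print(results, len(lookup_calls))
```

The cache lives for the lifetime of the Python session, which suits a notebook run against a single UMLS release.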

In [25]:
def get_code_from_cui(cui):
    # Get CUI atoms
    endpoint = f'/content/current/CUI/{cui}/atoms'
    tgt, Auth = generate_tgt()
    query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
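    try:
        r = requests.get(base_uri + endpoint, params=query)
    except requests.exceptions.RequestException:
        # generate_tgt() returns (tgt, AuthClient); request a fresh service ticket as well
        tgt, Auth = generate_tgt()
        query['ticket'] = AuthClient.getst(tgt)
        r = requests.get(base_uri + endpoint, params=query)

    atoms = simplejson.loads(r.text)

    # Return the first SNOMED CT code found among the atoms, if any
    for atom in atoms.get('result') or []:
        code = re.findall(r'SNOMEDCT_US/(\d+)', atom['code'])
        if len(code)>0:
            return code[0]
    return None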

def get_parent_terms_from_code(code):

    tgt, Auth = generate_tgt()
    query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
    endpoint = f'/content/2020AA/source/SNOMEDCT_US/{code}/parents'
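    try:
        r = requests.get(base_uri + endpoint, params=query)
    except requests.exceptions.RequestException:
        # generate_tgt() returns (tgt, AuthClient), so unpack before requesting a ticket
        tgt, Auth = generate_tgt()
        query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
        r = requests.get(base_uri + endpoint, params=query)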
      
    
    all_parents = []
    parents = simplejson.loads(r.text)
    if 'result' not in parents: 
        return all_parents

    for parent in parents['result']:
        all_parents.append((parent['name'], parent['ui'], get_cui_from_code(parent['ui'])))
        
        
    return all_parents

def get_cui_from_code(code):
    tgt, Auth = generate_tgt()
    query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
    endpoint = f'/search/current?string={code}&inputType=sourceUi&searchType=exact&sabs=SNOMEDCT_US'
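    try:
        r = requests.get(base_uri + endpoint, params=query)
    except requests.exceptions.RequestException:
        # generate_tgt() returns (tgt, AuthClient), so unpack before requesting a ticket
        tgt, Auth = generate_tgt()
        query = {'pageSize': 5000, 'sabs': sabs, 'ticket':AuthClient.getst(tgt)}
        r = requests.get(base_uri + endpoint, params=query)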
    result = simplejson.loads(r.text)
    return result['result']['results'][0]['ui']

In [26]:
%%time

if get_parents:
    DF_CUI_BANK['parent_term'] = ''
    DF_CUI_BANK['parent_cui'] = ''

    for row in DF_CUI_BANK.iterrows():

        idx = row[0]
        cui = row[1].CUI
        print(f"In row {idx} out of {len(DF_CUI_BANK)}")
        if idx%10 == 0:
            clear_output()

        parent_cuis = []
        parent_terms = []
        # Use distinct loop names so the row's own CUI is not shadowed
        for (parent_term, _, parent_cui) in get_parent_terms_from_code(get_code_from_cui(cui)):
            parent_cuis.append(parent_cui)
            parent_terms.append(parent_term)

        DF_CUI_BANK.loc[idx, 'parent_term'] = ';'.join(parent_terms)
        DF_CUI_BANK.loc[idx, 'parent_cui'] = ';'.join(parent_cuis)
    

In row 191 out of 192
CPU times: user 46.2 s, sys: 1.86 s, total: 48.1 s
Wall time: 7min 49s


In [27]:
DF_CUI_BANK.head(140).tail()

Unnamed: 0,root_term,CUI,CUI_name,semType,root_list,root_cui,root_name,parent_term,parent_cui
135,insomnia,C3662857,Insomnia caused by alcohol,Mental or Behavioral Dysfunction,False,C1561839,Drug-induced insomnia,Alcohol-induced sleep disorder;Drug-induced in...,C0236662;C1561839
136,insomnia,C1561850,Insomnia due to mental disorder,Mental or Behavioral Dysfunction,True,,,Insomnia;Mental disorder,C0917801;C0004936
137,insomnia,C3661804,Insomnia due to anxiety and fear,Mental or Behavioral Dysfunction,False,C1561850,Insomnia due to mental disorder,Insomnia disorder related to another mental di...,C1561850
138,insomnia,C3661804,Insomnia due to anxiety and fear,Mental or Behavioral Dysfunction,True,,,Insomnia disorder related to another mental di...,C1561850
139,insomnia,C3662857,Insomnia caused by alcohol,Mental or Behavioral Dysfunction,True,,,Alcohol-induced sleep disorder;Drug-induced in...,C0236662;C1561839


# Save CUI_bank File

In [28]:
DF_CUI_BANK.to_csv(cui_bank_name)

In [29]:
stop_time =  datetime.datetime.now()

delta_time = stop_time - start_time
print(f"It took {delta_time.seconds//60} mins and {delta_time.seconds%60} secs")

It took 15 mins and 3 secs


# Summary

In [30]:
print("Total Number of search terms returned:", len(DF_CUI_BANK[DF_CUI_BANK.root_list == True]))
print("Total number of CUIs in CUI Bank:", len(DF_CUI_BANK.CUI.unique()))
print("Number of children for each search term:")
DF_CUI_BANK.root_term.value_counts()
print("Total time taken:", stop_time - start_time)

Total Number of search terms returned: 124
Total number of CUIs in CUI Bank: 131
Number of children for each search term:
Total time taken: 0:15:03.629849
