# Criteria2Query+ - Extension of Criteria2Query
## Background
Criteria2Query (C2Q) is a tool that extracts eligibility criteria from ClinicalTrials.gov. It uses CoreNLP from Stanford to extract and identify candidate entities for concept mapping to OMOP CDM Standard Vocabularies (such as LOINC and SNOMED-CT). After extraction, the entities are mapped to concepts using a Lucene-based tool, Usagi. The best candidate concept is then assigned a concept set in OHDSI's ATLAS tool and all possible candidates are provided to the user. The user can then either select the most appropriate concept set, or allow C2Q to automatically generate a JSON formatted query for ATLAS. After selection, the query is submitted to ATLAS so that the user now has a cohort for querying a variety of databases.

## Motivation
Though Usagi has some successes, its downside is a reliance on string-based similarity for scoring mappings. For example, when sent to ConceptHub, the text "_neurological disease_" can map to "_neurological disorder_" and "_urologic disease_," but "_urologic disease_" has a score of 0.90, due to the limited character variance, so Usagi will provide "_urologic disease_" as its best option for mapping.
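
To make the failure mode concrete, here is a small sketch that ranks the two candidates by character overlap alone. It uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in: it is not the scoring function used by Usagi or ConceptHub, so the numbers will not match the 0.90 quoted above. The helper name `char_similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in, not Usagi's scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

term = "neurological disease"
candidates = ["neurological disorder", "urologic disease"]

# Score each candidate against the term and keep the highest-scoring one
similarity = {c: char_similarity(term, c) for c in candidates}
best = max(similarity, key=similarity.get)

for candidate, score in similarity.items():
    print(f"{candidate}: {score:.3f}")
print("best by character overlap:", best)
```

Even with this simple measure, the clinically unrelated candidate ranks first, which is the behaviour C2Q+ tries to correct with a second mapper.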

## Methodology
As an alternative first step, I propose using MetaMap, as it generally performs better than Usagi at mapping concepts. Criteria2Query+ (C2Q+) is a tool that could be used as an add-on to C2Q. C2Q+ takes the parsed eligibility criteria from C2Q and maps those terms with MetaMap, using a series of options that are roughly equivalent to those of CoreNLP and the OMOP CDM tools. After obtaining the UMLS CUIs associated with the OMOP CDM Standard Vocabularies, C2Q+ maps the CUIs back to their respective Standard Vocabulary. C2Q+ then compares the MetaMap and Usagi mappings, as well as their scores, for a given term and returns the better of the two to the user as the "best option." It still provides all possible concept sets for manual selection.

## Evaluation
For my dataset, I relied on the 18 clinical trials that were chosen for evaluation by Yuan et al. (2017). After the entities were extracted in the NLP step and marked with their respective domains, I compared each term with the mappings provided by ConceptHub and MetaMap, classifying each mapping as correct, partially-correct, incorrect, or N/A for entities that were not mapped. For my analysis, I calculated both loose and strict recall, precision, and F1 scores under the following definitions:
* **Loose includes partially-correct mappings**
* **Strict excludes partially-correct mappings**
* $\text{Recall} = \frac{\text{correct mappings}}{\text{total existing mappings}}$
* $\text{Precision} = \frac{\text{correct mappings}}{\text{possible correct mappings}}$
* $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
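
As a minimal sketch of how these definitions could be computed from the manual classifications, the function below takes the counts directly. The function name `mapping_scores` and its parameters are hypothetical, how the two denominators are counted is left to the caller, and the counts in the example are made up for illustration (they are not results from this evaluation).

```python
def mapping_scores(correct, partial, total_existing, possible_correct, loose=False):
    """Return (recall, precision, f1) following the definitions listed above.

    loose=True counts partially-correct mappings as correct; loose=False
    (strict) ignores them. Both denominators are supplied by the caller.
    """
    hits = correct + partial if loose else correct
    recall = hits / total_existing if total_existing else 0.0
    precision = hits / possible_correct if possible_correct else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1

# Made-up counts, for illustration only
strict_scores = mapping_scores(correct=6, partial=2, total_existing=10, possible_correct=8)
loose_scores = mapping_scores(correct=6, partial=2, total_existing=10, possible_correct=8, loose=True)
print("strict:", strict_scores)
print("loose: ", loose_scores)
```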

## Get list of NCTIDs to review

In [None]:
import csv

#Uncomment the below code to create a list of NCTIDs with user input
#ids = input("Enter NCTIDs separated by spaces: ")
#idsList = ids.split()
#print(idsList)

#Read csv file, keeping the first column (the NCTID) of each non-empty row
with open('trials.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    trials = [row[0] for row in reader if row]

## Use Selenium to fetch each trial, parse the criteria, and download JSON file
Each downloaded file is renamed to the form: [trial-ID].json

In [None]:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
import shutil
import time

#Not very reliable for len(trials) > 4
#To attempt full automation, replace "for i in range(x, y)" with "for t in trials" and "trials[i]" with "t"
#The current form is meant for small batches, since the scraping is not guaranteed to work every time

for i in range(0, 4):

    #Connect to C2Q
    driver = webdriver.Chrome("/Users/sal/Downloads/chromedriver")
    driver.get("http://www.ohdsi.org/web/criteria2query/")

    #Load clinical trial
    nctid = driver.find_element(By.ID, 'nctid')
    fetchct = driver.find_element(By.ID, 'fetchct')
    nctid.send_keys(trials[i])
    ActionChains(driver).click(fetchct).perform()

    #Parse the criteria
    wait = WebDriverWait(driver, 20)
    parse = wait.until(EC.element_to_be_clickable((By.ID, 'start')))
    ActionChains(driver).click(parse).perform()

    #Extract JSON text
    download = wait.until(EC.element_to_be_clickable((By.ID, 'downloadfile')))
    ActionChains(driver).click(download).perform()

    time.sleep(5)
    driver.quit()

    #Move the file from Downloads into the JSON_Formatted_Trials folder,
    #renaming it to [trial-ID].json in the same step
    source = '/Users/sal/Downloads/Criteria2Query.json'
    destination = '/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/JSON_Formatted_Trials'

    shutil.move(source, os.path.join(destination, trials[i] + '.json'))
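
The fixed `time.sleep(5)` in the cell above is one likely reason the scraping is unreliable, since the download may not have finished when the browser is closed. A possible alternative, sketched here under that assumption, is to poll for the file until it exists and its size stops changing. The helper `wait_for_file` and its `timeout` and `poll` parameters are hypothetical and not part of Selenium.

```python
import os
import time

def wait_for_file(path, timeout=30.0, poll=0.5):
    """Poll until `path` exists and its size is unchanged between two checks.

    Returns True if the file looks complete before `timeout` seconds pass,
    otherwise False.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

In the loop above, `wait_for_file(source)` could be called in place of `time.sleep(5)`, skipping the trial when it returns `False`.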

## Run extraction on each trial

In [None]:
import json
import pandas as pd

#Reformat the terms from the .json file
#extraction[] holds one list per trial, in the format [{text : [[term, domain], ...]}, ...]
extraction = []

for nct in trials:
    with open("./JSON_Formatted_Trials/"+ nct + ".json", "r") as read_file:
        data = json.load(read_file)

    trial = pd.DataFrame.from_dict(data)    
    #trial.iat[0,0] is the exclusion criteria
    #trial.iat[1,0] is the inclusion criteria
    exclusion = trial.iat[0,0]
    inclusion = trial.iat[1,0]

    m = []
    #Exclusion criteria are processed first, then inclusion criteria
    for criterion in list(exclusion) + list(inclusion):
        #Only the first sentence of each criterion is used
        sent = criterion.get('sents')[0]
        phrase = sent.get('text')
        #'categorey' is the key as it is spelled in the JSON file
        q = [[term.get('text'), term.get('categorey')] for term in sent.get('terms')]
        m.append({phrase:q})
    
    extraction.append(m)

In [None]:
extraction[2]
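
For orientation, each element of `extraction` is a list of single-key dictionaries that map a criterion sentence to its `[entity, domain]` pairs. The sketch below builds a made-up sample in that shape (the sentences and domain labels are illustrative, not taken from a real trial) and flattens it into pairs, which is what the later MetaMap and ConceptHub cells iterate over. The helper name `flatten_terms` is hypothetical.

```python
# Made-up sample in the same shape as one element of `extraction`
sample_trial = [
    {"History of neurological disease": [["neurological disease", "Condition"]]},
    {"Age 18 years or older": [["Age", "Demographic"], ["18 years or older", "Value"]]},
]

def flatten_terms(trial_entries):
    """Return every (entity, domain) pair of a trial, in document order."""
    pairs = []
    for entry in trial_entries:
        for term_list in entry.values():
            for entity, domain in term_list:
                pairs.append((entity, domain))
    return pairs

pairs = flatten_terms(sample_trial)
print(pairs)
```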

## Open MetaMap servers and send terms to MetaMap in a txt file
Include line breaks and end each file with '\n'

Export the output of each to (NCTID)\_MM.txt

In [None]:
#Starts the SKR/Medpost Part-of-Speech Tagger and Word Sense Disambiguation Servers
#The WSD server takes about a minute to fully load, so the call is given a 60 second timeout and
#the resulting TimeoutExpired exception is caught in order to move on to the next cell
import subprocess

mm_path = "/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/public_mm"
subprocess.run(mm_path + "/bin/skrmedpostctl start", shell = True, capture_output=True)
try:
    subprocess.run(mm_path + "/bin/wsdserverctl start", shell = True, capture_output=True, timeout = 60)
except subprocess.TimeoutExpired:
    pass

In [None]:
#Running MetaMap
run_mm = mm_path + "/bin/metamap18 -AIGy+ --negex -R CVX,HCPCS,ICD10PCS,LNC,RXNORM,SNOMEDCT_US,SNOMEDCT_VET --conj -V USAbase"

#Retrieve MetaMap results for each trial
for e,nct in zip(extraction, trials):
    
    #Pull out the terms for the trial
    terms = []
    for i in range(0, len(e)):
        for text in e[i].values():
            for t in text:
                terms.append(t[0])
    
    with open('temp.txt', "w") as mm_input:
        for t in terms:
            mm_input.write(t)
            mm_input.write("\n\n")
    
    #Convert all non-ASCII characters using UMLS provided .jar file
    subprocess.run("java -jar replace_utf8.jar temp.txt > input.txt", shell = True)
    
    #Submit to MetaMap
    subprocess.run(run_mm + " input.txt ./MM_Results/" + nct + "_MM.txt", shell = True)


In [None]:
#Close the MetaMap Servers
subprocess.run([mm_path + "/bin/skrmedpostctl stop"], shell = True)
subprocess.run([mm_path + "/bin/wsdserverctl stop"], shell = True)

## Send term with domain to ConceptHub and save concept, vocab, and score
Export the output of each to (NCTID)\_ch.txt

In [None]:
import requests as r
import json

url = 'http://149.28.237.139:8002/concepthub/omop/searchOneEntityByTermAndDomain'
for e,nct in zip(extraction, trials):
    
    #Send the terms and domains pairs for the trial to ConceptHub and save query and mapping
    responses = []
    for i in range(0, len(e)):
        for text in e[i].values():
            for t in text:
                query = {'term':t[0], 'domain':t[1]}
                reply = r.post(url, json = query)
                responses.append([query,reply.text])
    
    #Store the ConceptHub mappings for the trial
    with open('./ConceptHub_Results/'+ nct + '_ch.txt', "w") as ch_output:
        for res in responses:
            ch_output.write(str(res[0])+"\n")
            #Catch HTTP Status 500 Error when term is not found
            if (res[1][:15] == "<!doctype html>"):
                ch_output.write("N/A")
            else:
                ch_output.write(res[1])
            ch_output.write("\n\n")

In [None]:
import ast

#List of dataframes for each trial
ch_dataframes = []

#Opens each results txt file and extracts the relevant information for the dataframe
ch_results = []
for nct in trials:
    trial = [nct]
    buffer = []
    with open('./ConceptHub_Results/' + nct + '_ch.txt', "r") as ch_file:
        for line in ch_file:
            #Skip the blank separator lines
            if line != "\n":
                buffer.append(line)
                if line.startswith("{\"matchScore") or line.startswith("N/A"):
                    trial.append("".join(buffer))
                    buffer = []    
    ch_results.append(trial)

for t in ch_results:
    nctid = t[0]
    df = pd.DataFrame()
    
    #Go through each query and convert the text into dictionaries for easier parsing
    #Also converts mappings to dictionaries
    for i in range(1, len(t)):
        s = t[i].splitlines()
        
        #The query line was written with str(dict), so it is a Python literal rather than JSON
        query = ast.literal_eval(s[0])
        term = query.get('term')
        domain = query.get('domain')
        
        #Account for unmapped queries
        if (s[1] != "N/A"):
            #The ConceptHub reply is JSON, so json.loads parses it (null becomes None) without eval()
            match = json.loads(s[1])
            score = match.get('matchScore')
            c = match.get('concept')
            c_ID = c.get('conceptId')
            concept = c.get('conceptName')
            vocab = c.get('vocabularyId')
            c_class = c.get('conceptClassId')
            standard = c.get('standardConcept')
            source_ID = c.get('conceptCode')
            
            d = pd.DataFrame({'NCTID':[nctid], 'Term':[term], 'Domain':[domain], 
                                  'Score':[float(score)], 'Concept ID':[c_ID], 'Concept Name':[concept], 
                                  'Vocab':[vocab], 'Source Code':[source_ID], 'Standard':[standard],
                                  'Concept Class ID':[c_class]})
        else:
            d = pd.DataFrame({'NCTID':[nctid], 'Term':[term], 'Domain':[domain], 
                                  'Score':[float(0)], 'Concept ID':['N/A'], 'Concept Name':['N/A'], 
                                  'Vocab':['N/A'], 'Source Code':['N/A'], 'Standard':['N/A'],
                                  'Concept Class ID':['N/A']})
        
        df = pd.concat([df, d], ignore_index = True)
    
    ch_dataframes.append(df)

## Extract MetaMap score, CUI, source vocab

In [None]:
#Returns the index of the first string in l that contains substring, or None if there is none
def index_of(substring, l):
    for x, elem in enumerate(l):
        if substring in elem:
            return x
    return None

In [None]:
#List of dataframes for each trial
mm_dataframes = []

#Opens each results txt file and extracts the relevant information for the dataframe
mm_results = []
for nct in trials:
    trial = [nct]
    with open('./MM_Results/' + nct + '_MM.txt', "r") as mm_file:
        buffer = []
        for line in mm_file:
            if line.startswith("Processing") and "<<<<< Phrase\n" in buffer:
                trial.append("".join(buffer))
                buffer = []
            
            buffer.append(line)
            if line.startswith("<<<<< Mappings"):
                trial.append("".join(buffer))
                buffer = []    
    mm_results.append(trial)

for t in mm_results:
    nctid = t[0]
    df = pd.DataFrame()
    
    #p indicates how far back from a multi-phrase input to look for the input text
    p = 0
    
    #Checks to see if there are multiple mappings for a given entry
    for i in range(1, len(t)):
        
        # Under the options passed to MetaMap in run_mm, all individual mappings have the same score
        # as the overall score. So we count occurrences of that score surrounded by spaces, so as
        # not to confuse it with any CUI that contains the score as a substring. We also need to account for a given
        # input being broken down into multiple phrases.
        
        s = t[i].splitlines()

        #If the input has been broken down, s[0] will be '', otherwise 'Processing input...'
        if (s[0] == ''):
            p = p + 1
        else:
            p = 0
        
        #Input - if it's been broken down, pull from the entry p positions back, t[i-p]
        if (p != 0):
            sp = t[i-p].splitlines()
            input_txt = sp[0][sp[0].index(":")+2:]
        else: 
            input_txt = s[0][s[0].index(":")+2:]
        
        #j indicates the index of interest
        j = index_of('Phrase: ', s)
        
        #Phrase - for exact matching with C2Q entity
        phrase = s[j][s[j].index(":")+2:]

        #Some inputs don't map, so those must be accounted for as well
        if ("Mappings" not in t[i]):
            d = pd.DataFrame({'NCTID':[nctid], 'Input':[input_txt], 'Phrase':[phrase], 
                              'Score':[0], 'CUI':["N/A"], 'String':["N/A"], 
                              'Preferred Name':["N/A"], 'Vocab':["N/A"], 'Semantic Type':["N/A"]})
            df = pd.concat([df, d], ignore_index = True)
        else:
            j = index_of('Meta Mapping', s)
            score = s[j][s[j].index("(")+1:s[j].index(")")]
            #Formatted for count function
            f_score = " " + score + " "

            #c is used to determine which mapping to pull
            for c in range(t[i].count(f_score)):
                j = index_of(f_score,s) + c
                
                #CUI
                cui = s[j][s[j].index(score)+6:s[j].index(":")].strip()

                #Not all mappings include a specified UMLS preferred name, so it must be checked for
                if "})" in s[j]:
                    #UMLS Matched String
                    if ") (" in s[j]:
                        ms = s[j][s[j].index(":")+1:s[j].index(") ")]
                    else:
                        ms = s[j][s[j].index(":")+1:s[j].index(" (")]
                    #UMLS Preferred Name
                    name = s[j][s[j].index("(")+1:s[j].index(" {")]
                #If it is not present, the matched string will be used as the preferred name
                else:
                    if "(" in s[j]:
                        ms = s[j][s[j].index(":")+1:s[j].index(")")+1]
                    else:
                        ms = s[j][s[j].index(":")+1:s[j].index("{")]
                    name = ms

                #Source Vocabularies
                vocab = s[j][s[j].index("{")+1:s[j].index("}")]

                #Semantic Type
                s_type = s[j][s[j].index("[")+1:s[j].index("]")]

                #Add query to the dataframe
                d = pd.DataFrame({'NCTID':[nctid], 'Input':[input_txt], 'Phrase':[phrase], 
                                  'Score':[int(score)], 'CUI':[cui], 'String':[ms], 
                                  'Preferred Name':[name], 'Vocab':[vocab], 'Semantic Type':[s_type]})
                df = pd.concat([df, d], ignore_index = True)
                
    #Add the dataframe for this trial to the list of dataframes
    mm_dataframes.append(df)     
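
The slicing in the cell above can be hard to follow, so here is a hedged sketch of the same idea using a single regular expression. The line layout it assumes, `score CUI:matched string (preferred name {vocabularies}) [semantic type]`, is inferred from the slicing logic above (for example the `"})"` check) rather than from the MetaMap documentation, and it deliberately ignores the case where the matched string itself contains parentheses. The sample lines and their CUIs are made-up placeholders, and the names `CANDIDATE` and `parse_candidate` are hypothetical.

```python
import re

# Assumed layout: "<score> <CUI>:<string> (<name> {<vocabs>}) [<semantic type>]"
# or, without a preferred name: "<score> <CUI>:<string> {<vocabs>} [<semantic type>]"
CANDIDATE = re.compile(
    r"^\s*(?P<score>\d+)\s+(?P<cui>C\d+):(?P<string>.*?)\s"
    r"(?:\((?P<name>.*)\s)?\{(?P<vocab>[^}]*)\}\)?\s\[(?P<stype>[^\]]*)\]\s*$"
)

def parse_candidate(line):
    """Return a dict of the fields in one candidate line, or None if it does not match."""
    found = CANDIDATE.match(line)
    if found is None:
        return None
    fields = found.groupdict()
    fields["score"] = int(fields["score"])
    # Fall back to the matched string when no preferred name is given
    if fields["name"] is None:
        fields["name"] = fields["string"]
    return fields

# Made-up lines with placeholder CUIs, for illustration only
with_name = parse_candidate(
    "  1000   C1234567:Nervous system disorder (Nervous System Diseases {SNOMEDCT_US}) [Disease or Syndrome]"
)
without_name = parse_candidate(
    "   861   C7654321:Urologic disease {SNOMEDCT_US} [Disease or Syndrome]"
)
print(with_name)
print(without_name)
```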

In [None]:
mm_dataframes[1]

In [None]:
ch_dataframes[0]

## Send CUIs to UMLS API to Convert Into AUI
Optional component to be worked on later
For My_API_Key, follow the steps at: https://documentation.uts.nlm.nih.gov/rest/authentication.html

In [None]:
#MetaMap Results
for df in mm_dataframes:
    df.to_csv(r''+df.iloc[0]['NCTID']+'MM_table.csv', index = False)

#Authentication.py from Steven P. Emrick with UMLS
#from pyquery import PyQuery as pq
import lxml.html as lh
from lxml.html import fromstring

uri="https://utslogin.nlm.nih.gov"
#option 1 - username/pw authentication at /cas/v1/tickets
#auth_endpoint = "/cas/v1/tickets/"
#option 2 - api key authentication at /cas/v1/api-key
auth_endpoint = "/cas/v1/api-key"

class Authentication:

    #def __init__(self, username,password):
    def __init__(self, apikey):
        #self.username=username
        #self.password=password
        self.apikey=apikey
        self.service="http://umlsks.nlm.nih.gov"
    
    def gettgt(self):
        #params = {'username': self.username,'password': self.password}
        params = {'apikey': self.apikey}
        h = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain", "User-Agent":"python" }
        rep = r.post(uri+auth_endpoint,data=params,headers=h)
        response = fromstring(rep.text)
        ## extract the entire URL needed from the HTML form (action attribute) returned - looks similar to https://utslogin.nlm.nih.gov/cas/v1/tickets/TGT-36471-aYqNLN2rFIJPXKzxwdTNC5ZT7z3B3cTAKfSc5ndHQcUxeaDOLN-cas
        ## we make a POST call to this URL in the getst method
        tgt = response.xpath('//form/@action')[0]
        return tgt

    def getst(self,tgt):
        params = {'service': self.service}
        h = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain", "User-Agent":"python" }
        rep = r.post(tgt,data=params,headers=h)
        st = rep.text
        return st

#Need to use the API key to retrieve Ticket Granting Ticket and 
#a unique one-time use Service Ticket for each call to the API

My_API_Key = "replace-with-your-key"
auth = Authentication(My_API_Key)
tgt = auth.gettgt()
uri = "https://uts-ws.nlm.nih.gov/rest/"

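#Go through each trial and look up the source code and vocabulary for each mapped CUI
for m in mm_dataframes:
    for i in range(len(m)):
        cui = m.iloc[i]['CUI']
        #Rows without a MetaMap mapping have no CUI to look up
        if cui == "N/A":
            continue
        
        st = auth.getst(tgt)
        #uri already ends with "/", so the endpoint must not start with one
        content_endpoint = "content/current/CUI/"+cui+"/atoms"
        query = {'sabs':'CVX, HCPCS, ICD10PCS, LNC, RXNORM, SNOMEDCT_US, SNOMEDCT_VET',
                 'ttys':'PT, LN, CN, LA, PSN','ticket':st}
        
        p = r.get(uri+content_endpoint, params = query)
        #The reply is JSON, so parse it directly instead of calling eval()
        parsed = p.json()
        #Skip CUIs for which no atoms were returned
        if not parsed.get('result'):
            continue
        code = parsed.get('result')[0].get('code')
        s_code = code[code.rfind('/')+1:]
        s_vocab = parsed.get('result')[0].get('rootSource')
        print (m.iloc[i]['Phrase'],s_code, s_vocab)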

from owlready2 import *
from owlready2.pymedtermino2 import *
from owlready2.pymedtermino2.umls import *

import_umls("umls-2019AB-metathesaurus.zip")
default_world.save()

## Parse dataframes to extract best concepts
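Creates three different dataframes:
1. MetaMap results, supplemented by ConceptHub where MetaMap has no mapping
2. ConceptHub results, supplemented by MetaMap where ConceptHub has no mapping
3. The higher-scoring of the two mappings, with MetaMap preferred on ties (only for terms that both tools mapped)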

In [None]:
import re

c2qp_1 = []
c2qp_2 = []
c2qp_3 = []

for m, u in zip(mm_dataframes, ch_dataframes):
    #Dataframe for each of the methods
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    df3 = pd.DataFrame()
    
    #Relies on ConceptHub for the terms (or 'input', in MetaMap)
    terms = u.T.loc['Term']
    for i in range(len(terms)):
        t = terms[i]
        #Convert ° to "degrees " before escaping any regex characters
        if "°" in t:
            t = t[:t.index("°")] + "degrees " + t[t.index("°")+1:]
        
        #Escapes any regex characters before using str.contains
        t = re.escape(t)
        
        #Rows of interest for parsing
        m_temp = m[m['Input'].str.contains(t)]
        u_temp = u.iloc[i]
        
        #Method 1
        for row in m_temp.iterrows():
            nctid = row[1]['NCTID']
            text = row[1]['Input']
            if row[1]['CUI'] != "N/A":
                domain = row[1]['Semantic Type']
                concept_id = row[1]['CUI']
                concept = row[1]['String']
                score = row[1]['Score']
                
                d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain], 'Mapping':['M'], 'Concept ID':[concept_id],
                                 'Concept':[concept],'Score':[score]})
                df1 = pd.concat([df1, d], ignore_index = True)
            elif u_temp['Concept ID'] != "N/A":
                domain = u_temp['Domain']
                concept_id = u_temp['Concept ID']
                concept = u_temp['Concept Name']
                #To match MetaMap scoring format
                score = int(u_temp['Score']*1000)
                
                d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain], 'Mapping':['C'], 'Concept ID':[concept_id],
                                 'Concept':[concept],'Score':[score]})
                df1 = pd.concat([df1, d], ignore_index = True)
            else:
                #Neither tool produced a mapping, so fall back to the C2Q domain
                domain = u_temp['Domain']
                d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain],'Mapping':['N'], 'Concept ID':["N/A"],
                                 'Concept':["N/A"],'Score':[0]})
                df1 = pd.concat([df1, d], ignore_index = True)
        
        #Method 2
        for row in m_temp.iterrows():
            nctid = row[1]['NCTID']
            text = row[1]['Input']
            if u_temp['Concept ID'] != "N/A":
                domain = u_temp['Domain']
                concept_id = u_temp['Concept ID']
                concept = u_temp['Concept Name']
                #To match MetaMap scoring format
                score = int(u_temp['Score']*1000)
                
                d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain], 'Mapping':['C'], 'Concept ID':[concept_id],
                                 'Concept':[concept],'Score':[score]})
                df2 = pd.concat([df2, d], ignore_index = True)
            elif row[1]['CUI'] != "N/A":
                domain = row[1]['Semantic Type']
                concept_id = row[1]['CUI']
                concept = row[1]['String']
                score = row[1]['Score']
                
                d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain], 'Mapping':['M'], 'Concept ID':[concept_id],
                                 'Concept':[concept],'Score':[score]})
                df2 = pd.concat([df2, d], ignore_index = True)
            else:
                #Neither tool produced a mapping, so fall back to the C2Q domain
                domain = u_temp['Domain']
                d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain],'Mapping':['N'], 'Concept ID':["N/A"],
                                 'Concept':["N/A"],'Score':[0]})
                df2 = pd.concat([df2, d], ignore_index = True)
        
        #Method 3
        for row in m_temp.iterrows():
            nctid = row[1]['NCTID']
            text = row[1]['Input']
            if u_temp['Concept ID'] != "N/A" and row[1]['CUI'] != "N/A":
                u_score = int(u_temp['Score']*1000)
                m_score = row[1]['Score']
                
                #Assumes that the MetaMap mapping is better if scores are equal
                if u_score > m_score:
                    domain = u_temp['Domain']
                    concept_id = u_temp['Concept ID']
                    concept = u_temp['Concept Name']
                    score = u_score

                    d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain], 'Mapping':['C'], 'Concept ID':[concept_id],
                                     'Concept':[concept],'Score':[score]})
                    df3 = pd.concat([df3, d], ignore_index = True)
                else:
                    domain = row[1]['Semantic Type']
                    concept_id = row[1]['CUI']
                    concept = row[1]['String']
                    score = m_score

                    d = pd.DataFrame({'NCTID':[nctid], 'Text':[text], 'Domain':[domain], 'Mapping':['M'], 'Concept ID':[concept_id],
                                     'Concept':[concept],'Score':[score]})
                    df3 = pd.concat([df3, d], ignore_index = True)

    c2qp_1.append(df1)
    c2qp_2.append(df2)
    c2qp_3.append(df3)
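
The selection rule used by Method 3 can be isolated into a small function, sketched here. It rescales the ConceptHub score from the 0 to 1 range to MetaMap's 0 to 1000 range and prefers MetaMap on ties, as the cell above does. One deliberate difference is that it uses `round()` where the cell truncates with `int()`, to avoid losing a point to floating-point error. The function name `pick_best` is hypothetical.

```python
def pick_best(ch_score, mm_score):
    """Return (source, score) for the better mapping on MetaMap's 0-1000 scale.

    ch_score: ConceptHub match score in [0, 1]
    mm_score: MetaMap score in [0, 1000]
    source is 'C' for ConceptHub or 'M' for MetaMap.
    """
    ch_scaled = round(ch_score * 1000)  # the cell above truncates with int()
    if ch_scaled > mm_score:
        return "C", ch_scaled
    # Equal scores fall through to MetaMap
    return "M", mm_score

print(pick_best(0.90, 861))
print(pick_best(0.90, 900))
```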

In [None]:
#Remove duplicate rows to simplify the concept review process
for d1, d2, d3 in zip(c2qp_1, c2qp_2, c2qp_3):
    d1.drop_duplicates(inplace = True)
    d2.drop_duplicates(inplace = True)
    d3.drop_duplicates(inplace = True)

## Create Excel file from each dataframe
Format: C2QP_[Method].xlsx

In [None]:
#One workbook per method, with one sheet per trial named after its NCTID
workbooks = [('C2QP_MC.xlsx', c2qp_1), ('C2QP_CM.xlsx', c2qp_2), ('C2QP_BS.xlsx', c2qp_3)]
for filename, frames in workbooks:
    with pd.ExcelWriter(filename) as writer:
        for df in frames:
            #Skip trials with no rows, which have no NCTID to name the sheet after
            if df.empty:
                continue
            df.to_excel(writer, sheet_name = df.iloc[0]['NCTID'], index = False)