# Criteria2Query+ - Extension of Criteria2Query
## Background
Criteria2Query (C2Q) is a tool that extracts eligibility criteria from ClinicalTrials.gov. It uses CoreNLP from Stanford to extract and identify candidate entities for concept mapping to OMOP CDM Standard Vocabularies (such as LOINC and SNOMED-CT). After extraction, the entities are mapped to concepts using a Lucene-based tool, Usagi. The best candidate concept is then assigned a concept set in OHDSI's ATLAS tool and all possible candidates are provided to the user. The user can then either select the most appropriate concept set, or allow C2Q to automatically generate a JSON formatted query for ATLAS. After selection, the query is submitted to ATLAS so that the user now has a cohort for querying a variety of databases.

## Motivation
Though Usagi has some successes, its downside is a reliance on string-based similarity for scoring mappings. For example, the text "_neurological disease_" can map to "_neurological disorder_" and "_urologic disease_," but "_urologic disease_" has a score of 0.90, due to the limited character variance, so Usagi will provide "_urologic disease_" as its best option for mapping.

## Methodology
As an alternative first step, I propose using MetaMap as it generally performs better than Usagi at mapping concepts. Criteria2Query+ is a tool that could be used as an add-on to C2Q. C2Q+ would take the parsed eligibility criteria and then map those terms to MetaMap using a series of options that are relatively synonymous to those of CoreNLP and OMOP CDM tools. After obtaining the CUI from the UMLS associated with OMOP CDM Standard Vocabularies, C2Q+ then maps the CUIs back to the source vocabularies. C2Q+ then compares the MetaMap score and Usagi score for a given term and takes the better of the two to return to the user for a mapping as the "best option." It will still provide all possible concept sets for manual selection. 

## Evaluation
For my gold-standard, I relied on the 18 clinical trials that were chosen for evaluation by Yuan et al. 2017. After the entities were extracted from the NLP step and marked with their respective domains, I compared these tokens to the corresponding mappings linked in ATLAS, which I classified as correct, partially-correct, incorrect, or N/A for entities that were not mapped. For my analysis, I calculated both loose and strict precision, recall, and F-1 scores under the following definitions:
* **Loose includes partially-correct mappings**
* **Strict excludes partially-correct mappings**
* $Precision = \frac{correct \hspace{1mm} mappings}{total \hspace{1mm} possible \hspace{1mm} mappings}$
* $Recall = \frac{correct \hspace{1mm} mappings}{total \hspace{1mm} mappings}$
* $F_1 \hspace{1mm} Score = 2 * (\frac{precision \hspace{1mm} * \hspace{1mm} recall}{precision \hspace{1mm} + \hspace{1mm} recall})$

## Get list of NCTIDs to review

In [14]:
import csv

#Uncomment the below code to create a list of NCTIDs with user input
#ids = input("Enter NCTIDs separated by spaces: ")
#idsList = ids.split()
#print(idsList)

#Read csv file
with open('trials.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    trials = list(reader)

#Eliminate exterior list
t = []
for nct in trials:
    t.append(nct[0])

trials = t

## Use Selenium to fetch each trial, parse the criteria, and download JSON file
Rename the files of the form: [trial-ID].json

In [138]:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import os
import shutil
import time

#Not very reliable for len(trials) > 4
#To attempt full automation, replace for i in range(x, y) with for t in trials and trials[i] with t
#Current form is for small batches, which is better considering that the scrapping is not entirely reliable and
#is not guaranteed to work 100% of the time.

for i in range(4, 5):

    #Connect to C2Q
    driver = webdriver.Chrome("/Users/sal/Downloads/chromedriver")
    driver.get("http://www.ohdsi.org/web/criteria2query/")

    #Load clinical trial
    nctid = driver.find_element_by_id('nctid')
    fetchct = driver.find_element_by_id('fetchct')
    nctid.send_keys(trials[i])
    ActionChains(driver).click(fetchct).perform()

    #Parse the criteria
    wait = WebDriverWait(driver, 20)
    parse = wait.until(EC.element_to_be_clickable((By.ID, 'start')))
    ActionChains(driver).click(parse).perform()

    #Extract JSON text
    download = wait.until(EC.element_to_be_clickable((By.ID, 'downloadfile')))
    ActionChains(driver).click(download).perform()

    time.sleep(5)
    driver.quit()

    #Move the file from downloads to the JSON_Formatted_Trials folder
    source = '/Users/sal/Downloads/Criteria2Query.json'
    destination = '/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/JSON_Formatted_Trials'

    shutil.move(source, destination)

    #Rename the file in the JSON folder
    if os.path.exists('Criteria2Query.json'):
        src = os.path.realpath('Criteria2Query.json')
        os.rename('Criteria2Query.json', trials[i] +'.json')

## Run extraction of on each trial

In [177]:
import json
import pandas as pd

#Reformat the terms from the .json file
#extraction[] contains all of the trials in the format {text : [{term : domain}]}
extraction = []

for nct in trials:
    with open("./JSON_Formatted_Trials/"+ nct + ".json", "r") as read_file:
        data = json.load(read_file)

    trial = pd.DataFrame.from_dict(data)    
    exclusion = trial.iat[0,0]
    inclusion = trial.iat[1,0]

    #trial[0,0] is the exclusion criteria
    #trial[1,0] is the inclusion criteria
    
    m = []
    for e in exclusion:
        phrase = e.get('sents')[0].get('text')
        q = []
        for t in e.get('sents')[0].get('terms'):
            entity = t.get('text')
            domain = t.get('categorey')
            q.append([entity, domain])
        m.append({phrase:q})

    for i in inclusion:
        phrase = i.get('sents')[0].get('text')
        q = []
        for t in i.get('sents')[0].get('terms'):
            entity = t.get('text')
            domain = t.get('categorey')
            q.append([entity, domain])
        m.append({phrase:q})
    
    extraction.append(m)

In [178]:
extraction[0]

[{' Contraindication for the use of the study medication or other beta-lactam antibiotics , e.g. patients with advanced renal impairment or patients requiring hemodialysis ': [['Contraindication',
    'Condition'],
   ['other beta-lactam antibiotics', 'Drug'],
   ['advanced renal impairment', 'Condition'],
   ['hemodialysis', 'Procedure']]},
 {' Antibiotic therapy in the two weeks prior to the start of the study ': [['Antibiotic therapy',
    'Procedure'],
   ['in the two weeks', 'Temporal']]},
 {' Patients with an advanced incurable disease ': [['disease', 'Condition']]},
 {' Patients with a hematologic/oncologic disease ( leukemia , lymphoma ) ': [['hematologic/oncologic disease',
    'Condition'],
   ['leukemia', 'Condition'],
   ['lymphoma', 'Condition']]},
 {' Patients on immunosuppressants ': [['immunosuppressants', 'Drug']]},
 {' Complications of sigmoid diverticulitis leading to an immediate indication for surgery ': [['Complications',
    'Condition'],
   ['sigmoid diverticuli

## Open MetaMap servers and send terms to MetaMap in a txt file
Include line breaks and end each file with '\n'

Export the output of each to (NCTID)\_MM.txt

In [202]:
#Starts the SKR/Medpost Part-of-Speech Tagger and Word Sense Disambiguation Servers
#The WSD server takes about a minute to fully load, so the TimeoutExpired exception is just used to move onto the next
import subprocess

mm_path = "/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/public_mm"
subprocess.run([mm_path + "/bin/skrmedpostctl start"], shell = True, capture_output = True)
subprocess.run([mm_path + "/bin/wsdserverctl start"], shell = True, capture_output = True, timeout = 60)

TimeoutExpired: Command '['/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/public_mm/bin/wsdserverctl start']' timed out after 60 seconds

In [207]:
#Running MetaMap
run_mm = mm_path + "/bin/metamap18 -AIGy+ --negex -R ICD10CM,ICD10PCS,ICD9CM,ICD9CM,LNC,LNC,SNOMEDCT_US,SNOMEDCT_US,SNOMEDCT_US --conj -V USAbase"

#Retrieve MetaMap results for each trial
for e,nct in zip(extraction, trials):
    
    #Pull out the terms for the trial
    terms = []
    for i in range(0, len(e)):
        for text in e[i].values():
            for t in text:
                terms.append(t[0])
    
    with open('temp.txt', "w") as mm_input:
        for t in terms:
            mm_input.write(t)
            mm_input.write("\n\n")
    
    #Convert all non-ASCII characters using UMLS provided .jar file
    subprocess.run("java -jar replace_utf8.jar temp.txt > input.txt", shell = True, capture_output = True)
    
    #Submit to MetaMap
    subprocess.run(run_mm + " input.txt ./MM_Results/" + nct + "_MM.txt", shell = True, capture_output = True)


In [208]:
#Close the MetaMap Servers
subprocess.run([mm_path + "/bin/skrmedpostctl stop"], shell = True, capture_output = True)
subprocess.run([mm_path + "/bin/wsdserverctl stop"], shell = True, capture_output = True)

CompletedProcess(args=['/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/public_mm/bin/wsdserverctl stop'], returncode=0, stdout=b'Stopping wsdserverctl: \nStopping WSD Server process..\nProcess 23567 stopped\n', stderr=b'/Users/sal/Desktop/DBMI/SymbolicMethods/ConceptMapping/public_mm/bin/wsdserverctl: line 55: kill: (23567) - No such process\n')

## Send term with domain to Usagi and save concept, vocab, and score
Export the output of each to (NCTID)\_usagi.txt

In [196]:
import requests as r
import json

url = 'http://149.28.237.139:8002/concepthub/omop/searchOneEntityByTermAndDomain'
for e,nct in zip(extraction, trials):
    
    #Send the terms and domains pairs for the trial to Usagi and save query and mapping
    responses = []
    for i in range(0, len(e)):
        for text in e[i].values():
            for t in text:
                query = {'term':t[0], 'domain':t[1]}
                reply = r.post(url, json = query)
                responses.append([query,reply.text])
    
    #Store the Usagi mappings for the trial
    with open('./Usagi_Results/'+ nct + '_usagi.txt', "w") as usagi_output:
        for res in responses:
            usagi_output.write(str(res[0])+"\n")
            #Catch HTTP Status 500 Error when term is not found
            if (res[1][:15] == "<!doctype html>"):
                usagi_output.write("N/A")
            else:
                usagi_output.write(res[1])
            usagi_output.write("\n\n")

## Extract MetaMap score, CUI, source vocab

In [215]:
#List of dataframes for each trial
mm_results = []
cols = ["ID", "Input", "Phrase", "Score", "String", "Preferred Name", "Vocab", "Semantic Type"]

for nct in trials:
    data = [nct]
    
    #Opens each results txt file and extracts each relevant information for dataframe
    with open('./MM_Results/' + nct + '_MM.txt', "r") as mm_file:
        for line in mm_file:
            #Input
            if line.startswith("Processing input"):
            
            #Phrase
            if line.startswith():
            
            #MetaMap Score
            if line.startswith():
            
            #UMLS Identified String
            if line.startswith():
            
            #Preferred Name
            if line.startswith():
            
            #Source Vocabularies
            if line.startswith():
            
            #Semantic Type
            if line.startswith("<<<<< Mappings"):
                data.append("".join(buffer))
                buffer = []
                
    trial = pd.DataFrame(tuple(data), columns = cols)
    mm_results.append(trial)

print(mm_results[0])

['NCT00097734', 'Processing input.txt.tx.1: Contraindication\n\nPhrase: Contraindication\n>>>>> Phrase\ncontraindication\n<<<<< Phrase\n>>>>> Mappings\nMeta Mapping (1000):\n  1000   C0522473:Contraindication {SNOMEDCT_US,SNOMEDCT_US,SNOMEDCT_US} [Qualitative Concept]\n<<<<< Mappings\n', 'Processing input.txt.tx.1: other beta-lactam antibiotics\n\nPhrase: other beta-lactam antibiotics\n>>>>> Phrase\nother beta lactam antibiotics\n<<<<< Phrase\n>>>>> Mappings\nMeta Mapping (923):\n   923   C0026458:Monobactam (Monobactams {SNOMEDCT_US,SNOMEDCT_US,SNOMEDCT_US}) [Antibiotic,Organic Chemical]\n<<<<< Mappings\n', 'Processing input.txt.tx.1: advanced renal impairment\n\nPhrase: advanced renal impairment\n>>>>> Phrase\nadvanced renal impairment\n<<<<< Phrase\n>>>>> Mappings\nMeta Mapping (901):\n   901   C1565489:Impaired renal function (Renal Insufficiency {SNOMEDCT_US,SNOMEDCT_US,SNOMEDCT_US}) [Disease or Syndrome]\n   901   C0205179:Advanced phase {SNOMEDCT_US,SNOMEDCT_US,SNOMEDCT_US} [Qua

## Convert MetaMap CUI to AUI using UMLS REST API

## Check equality of MetaMap concept and C2Q concept
Use a conversion table for the domains

## Store information from extracted entities in a dataframe per trial

## Parse dataframe to extract best concept
* If MM concept exists, provide that term
* Else if, provide Usagi
* Else, provide 'N/A'

## Export dataframe with best concept marking added
Format: (NCTID)\_mastersheet.xlsx

## Create csv/Excel file from master sheet
Format: (NCTID)\_eval.xlsx