## Azure Text Analytics for Health
  
**Goal of this notebook:**
- Extract medical concepts and their synonymous terms (according to UMLS) from the NOTEEVENTS `TEXT` field in the MIMIC-III dataset
  
The extracted data will be used to...
- Build or supplement a dataset for fine-tuning
- Supplement Note Text to improve Language Model (LM) understanding and reasoning for medical coding classification
  
**Requirements**
- Set up an [Azure Language Resource](https://learn.microsoft.com/en-us/azure/ai-services/language-service/text-analytics-for-health/overview?tabs=ner)
- Set up [UMLS API](https://documentation.uts.nlm.nih.gov/rest/home.html) access
- Populate a `.env` file (see [.env.sample](./.env.sample))

**Azure Text Analytics for Health**  
  
Text Analytics for health is one of the prebuilt features offered by Azure AI Language. It is a cloud-based API service that applies machine-learning intelligence to extract and label relevant medical information from a variety of unstructured texts such as doctor's notes, discharge summaries, clinical documents, and electronic health records.

**UMLS Metathesaurus API**
  
The UMLS Metathesaurus is a large biomedical thesaurus that is organized by concept, or meaning. It links synonymous names from over 200 different source vocabularies. The Metathesaurus also identifies useful relationships between concepts and preserves the meanings, concept names, and relationships from each vocabulary. 

In [None]:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv, find_dotenv
import pandas as pd
import os 
import re
import requests

load_dotenv(find_dotenv(), override=True)

key = os.getenv("LANGUAGE_KEY")
endpoint = os.getenv("LANGUAGE_ENDPOINT")
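
A missing `.env` entry otherwise only surfaces later as an authentication or connection error, so it can help to fail early. A minimal sketch, assuming the variable names used in this notebook (`LANGUAGE_KEY`, `LANGUAGE_ENDPOINT`, `UMLS_API_KEY`); the helper name `require_env` is hypothetical:

```python
import os

def require_env(names):
    """Return the requested environment variables, raising if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}

# Hypothetical usage, after load_dotenv(...) has run:
# settings = require_env(["LANGUAGE_KEY", "LANGUAGE_ENDPOINT", "UMLS_API_KEY"])
```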


In [None]:
# Read a subsample (the first 20,000 rows) of the MIMIC-III NOTEEVENTS.csv table
data_folder = "data/raw/"
note_events = pd.read_csv(data_folder + 'NOTEEVENTS.csv.gz', usecols=['HADM_ID', 'TEXT'], nrows=20000)
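
The service limits how many documents one request may contain and how long each document may be, so the full subsample cannot be sent in a single call. A minimal sketch of splitting notes into chunks and batches; `chunk_text`, `batched`, and both default sizes are assumptions made for illustration, not values taken from the service documentation, so check the limits for your resource first:

```python
def chunk_text(text, max_chars=5000):
    """Split text into consecutive pieces of at most max_chars characters."""
    # max_chars=5000 is a placeholder, not a documented service limit
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def batched(items, batch_size=10):
    """Yield successive lists of at most batch_size items."""
    # batch_size=10 is a placeholder, not a documented service limit
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage with the objects defined in this notebook:
# for batch in batched(note_events["TEXT"].tolist()):
#     az_ta_fh(client, batch)
```

Splitting on a fixed character count can cut a word or entity in half; splitting on sentence or section boundaries would be a gentler choice for clinical notes.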

#### Extract Medical Concepts with Azure Text Analytics

In [None]:
# Authenticate the client using your key and endpoint 
ta_credential = AzureKeyCredential(key)
client = TextAnalyticsClient(
        endpoint=endpoint, 
        credential=ta_credential)

In [None]:
# Simple example function for extracting information from healthcare-related text

def az_ta_fh(client, documents):
    # Start the long-running analysis and block until it completes
    poller = client.begin_analyze_healthcare_entities(documents)
    result = poller.result()

    for doc in result:
        # Report failed documents instead of silently dropping them
        if doc.is_error:
            print(f"Error {doc.error.code}: {doc.error.message}")
            continue

        for entity in doc.entities:
            print(f"{entity}")

        for relation in doc.entity_relations:
            print(f"{relation}")
            for role in relation.roles:
                print(f"{role}")



documents = ["Infectious and Parasitic Diseases", "Patient was given 50mg of ibuprofen"]
az_ta_fh(client, documents)

In [None]:
# Example with the MIMIC-III NOTEEVENTS TEXT column

print(f"*******NOTE******:\n{note_events['TEXT'][0]}\n******END NOTE********\n")

mimic_docs = [note_events['TEXT'][0]]
az_ta_fh(client, mimic_docs)

# NOTE: Filtering by entity.category can simplify results. For medical coding, 'SymptomOrSign' and 'Diagnosis' are useful categories.
# For relations, 'ExaminationFindsCondition' and 'DirectionOfCondition' are useful relation types.
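
The filtering described in the note can be sketched as two small helpers. This assumes the service spells the category `SymptomOrSign` and that entities and relations expose `category` and `relation_type` attributes, as the SDK result objects do; the helper names and the stand-in objects are hypothetical:

```python
from types import SimpleNamespace

USEFUL_CATEGORIES = ("SymptomOrSign", "Diagnosis")
USEFUL_RELATIONS = ("ExaminationFindsCondition", "DirectionOfCondition")

def filter_entities(entities, categories=USEFUL_CATEGORIES):
    """Keep only entities whose category is in the allowed list."""
    return [entity for entity in entities if entity.category in categories]

def filter_relations(relations, relation_types=USEFUL_RELATIONS):
    """Keep only relations whose relation_type is in the allowed list."""
    return [rel for rel in relations if rel.relation_type in relation_types]

# Stand-in objects that mimic the attributes read above (not real service output)
demo_entities = [
    SimpleNamespace(text="fever", category="SymptomOrSign"),
    SimpleNamespace(text="ibuprofen", category="MedicationName"),
]
kept = filter_entities(demo_entities)
```

In the notebook these helpers would be applied to `doc.entities` and `doc.entity_relations` for each successful document.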

In [None]:
# Get all UMLS concepts from the Azure Text Analytics API
def get_umls_concepts(client, documents):
    umls_concepts = []
    poller = client.begin_analyze_healthcare_entities(documents)
    result = poller.result()

    # Skip documents the service could not process
    docs = [doc for doc in result if not doc.is_error]

    for doc in docs:
        for entity in doc.entities:
            # Keep only symptom and diagnosis entities that are linked to a vocabulary
            if entity.data_sources and entity.category in ['SymptomOrSign', 'Diagnosis']:
                for data_source in entity.data_sources:
                    if data_source.name == "UMLS":
                        # Store (CUI, matched text) pairs
                        umls_concepts.append((data_source.entity_id, entity.text))

    return umls_concepts

umls_concepts = get_umls_concepts(client, mimic_docs)
print(umls_concepts)
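
A note often mentions the same concept several times, so the list of `(CUI, text)` pairs tends to contain repeats. A minimal sketch of grouping mentions by CUI before querying UMLS; the helper name `group_mentions_by_cui` and the sample pairs are illustrative only:

```python
from collections import defaultdict

def group_mentions_by_cui(concepts):
    """Map each CUI to the sorted, de-duplicated, lower-cased mentions found in the text."""
    grouped = defaultdict(set)
    for cui, mention in concepts:
        grouped[cui].add(mention.lower())
    return {cui: sorted(mentions) for cui, mentions in grouped.items()}

# Illustrative input only; real pairs come from get_umls_concepts(...)
sample = [("C0000001", "Term A"), ("C0000001", "term a"), ("C0000002", "Term B")]
grouped = group_mentions_by_cui(sample)
```

Each distinct CUI then needs only one UMLS lookup, regardless of how often it appears in the note.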

#### Supplement Medical Concepts with UMLS Metathesaurus

In [None]:
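def get_umls_atoms(cuid):
    synonyms = []
    # Restrict results to the source vocabularies of interest
    sabs = ['ICD10', 'ICD10CM', 'ICD9CM', 'SNOMEDCT_US', 'MDR']
    atom_uri = f"https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/{cuid}/atoms"
    page = 0
    try:
        while True:
            page += 1
            atom_query = {'apiKey': os.getenv("UMLS_API_KEY"), 'pageNumber': page, 'language': 'ENG', 'sabs': ','.join(sabs)}
            a = requests.get(atom_uri, params=atom_query)
            a.encoding = 'utf-8'

            # A non-200 status ends pagination (the loop relies on the API
            # returning one once the pages run out)
            if a.status_code != 200:
                break

            all_atoms = a.json()

            # Guard against an empty page so the loop cannot run forever
            if not all_atoms.get('result'):
                break

            for atom in all_atoms['result']:
                # Drop parenthesized or bracketed qualifiers, then normalize case and whitespace
                synonyms.append(re.sub(r"[\(\[].*?[\)\]]", "", atom['name']).lower().strip())

    except Exception as except_error:
        print(except_error)

    # Return whatever was collected, even if a later page failed
    return synonyms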


synonyms = get_umls_atoms('C0033785')
print(list(set(synonyms)))
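
The function above is written as a page loop: request page 1, 2, 3, and so on until the API stops returning results. The same pattern, separated from the HTTP call so it can be exercised offline, might look like the sketch below. `iter_pages` and `fake_fetch` are hypothetical names, the placeholder terms are not real UMLS output, and the convention that the fetcher returns `None` or an empty list once pages run out is an assumption of this sketch:

```python
def iter_pages(fetch_page, max_pages=100):
    """Yield items from fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            return
        yield from items

# Offline stand-in for a paged endpoint (placeholder terms only)
_pages = {1: ["term a", "term b"], 2: ["term c"]}

def fake_fetch(page):
    return _pages.get(page)

all_items = list(iter_pages(fake_fetch))
```

The `max_pages` cap is a safety net so that a misbehaving endpoint cannot keep the loop running indefinitely.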

In [None]:
# Get the UMLS concept definitions

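def umls_define(cuid):
    definitions = []
    umls_uri = f"https://uts-ws.nlm.nih.gov/rest/content/current/CUI/{cuid}/definitions"
    page = 0
    try:
        while True:
            page += 1
            query = {'apiKey': os.getenv("UMLS_API_KEY"), 'pageNumber': page}
            a = requests.get(umls_uri, params=query)
            a.encoding = 'utf-8'

            # A non-200 status ends pagination (the loop relies on the API
            # returning one once the pages run out)
            if a.status_code != 200:
                break
            result = a.json()

            # Guard against an empty page so the loop cannot run forever
            if not result.get('result'):
                break

            for value in result['result']:
                # Store (definition text, source vocabulary) pairs
                definitions.append((value['value'].lower().rstrip(), value['rootSource']))
                print(value['value'])
                print(value['rootSource'])

    except Exception as except_error:
        print(except_error)

    # Return whatever was collected, even if a later page failed
    return definitions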
    

definitions = umls_define('C0033785')
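
One stated goal of this notebook is to supplement the note text with the extracted terminology. A minimal sketch of one way to do that, appending a short block of related terms to the note; the function name `supplement_note`, the `Related terms:` layout, and the cap of five synonyms are all hypothetical choices, not something the notebook prescribes:

```python
def supplement_note(text, concept_synonyms, max_synonyms=5):
    """Append a block listing each extracted mention with a few of its synonyms."""
    lines = []
    for mention, synonyms in concept_synonyms.items():
        # Drop empty strings and the mention itself, then de-duplicate and cap the list
        unique = sorted({s for s in synonyms if s and s != mention.lower()})[:max_synonyms]
        if unique:
            lines.append(f"- {mention}: {', '.join(unique)}")
    if not lines:
        return text
    return text.rstrip() + "\n\nRelated terms:\n" + "\n".join(lines)

# Placeholder input; real synonyms would come from get_umls_atoms(...)
augmented = supplement_note("Patient note text.", {"Term A": ["term a", "alias one", "alias two"]})
```

Keeping the original note untouched and appending the terms at the end makes it easy to strip the supplement again when comparing model behavior with and without it.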