## Azure Text Analytics for Health
  
**Goal of this notebook:**
- Extract medical concepts and their synonym terminology (according to UMLS) from the TEXT field of the NOTEEVENTS table in the MIMIC-III dataset
  
**The extracted data will be used to...**
- Build or supplement a dataset for fine-tuning
- Supplement the note text to improve language model (LM) understanding and reasoning for medical coding classification
  
**Requirements**
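- UMLS setup and API access
- An Azure AI Language resource, with its key and endpoint available as the `LANGUAGE_KEY` and `LANGUAGE_ENDPOINT` environment variables (loaded from a `.env` file)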

**Overview of [Azure Text Analytics for Health](https://learn.microsoft.com/en-us/azure/ai-services/language-service/text-analytics-for-health/overview?tabs=ner)**  
  
Text Analytics for health is one of the prebuilt features offered by Azure AI Language. It is a cloud-based API service that applies machine-learning intelligence to extract and label relevant medical information from a variety of unstructured texts such as doctor's notes, discharge summaries, clinical documents, and electronic health records.

**Overview of [UMLS Metathesaurus API](https://documentation.uts.nlm.nih.gov/rest/home.html)**
  
The UMLS Metathesaurus is a large biomedical thesaurus that is organized by concept, or meaning. It links synonymous names from over 200 different source vocabularies. The Metathesaurus also identifies useful relationships between concepts and preserves the meanings, concept names, and relationships from each vocabulary. 
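
As a rough illustration of how synonymous names can be looked up by concept, the sketch below builds a request URL for the atoms of one concept identifier (CUI). It assumes the UTS REST base URL `https://uts-ws.nlm.nih.gov/rest`, the `/content/{version}/CUI/{cui}/atoms` path, and an `apiKey` query parameter; check these against the linked documentation. `build_atoms_url` is a hypothetical helper, the key is a placeholder, and the CUI is a sample value. No request is sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the UTS REST API; verify against the linked documentation
UTS_BASE_URL = "https://uts-ws.nlm.nih.gov/rest"

def build_atoms_url(cui, api_key, version="current"):
    """Build the URL that lists the atoms (source-vocabulary names) of one CUI."""
    path = f"{UTS_BASE_URL}/content/{version}/CUI/{cui}/atoms"
    query = urlencode({"apiKey": api_key})
    return f"{path}?{query}"

# Sample CUI and placeholder key, for illustration only
url = build_atoms_url("C0020740", api_key="YOUR_UMLS_API_KEY")
print(url)
```

The resulting URL could be passed to `requests.get` once a real API key is available.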

In [8]:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv, find_dotenv
import pandas as pd
import os 
import requests

# Load the Azure AI Language key and endpoint from a .env file
load_dotenv(find_dotenv(), override=True)

key = os.getenv("LANGUAGE_KEY")
endpoint = os.getenv("LANGUAGE_ENDPOINT")


In [None]:
# Read a subsample (first 20,000 rows) of the MIMIC-III NOTEEVENTS.csv table
data_folder = "data/raw/"
note_events = pd.read_csv(data_folder + "NOTEEVENTS.csv.gz", usecols=["HADM_ID", "TEXT"], nrows=20000)
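
Since the service accepts only a limited number of documents per request, the notes read above would need to be sent in small batches. A minimal sketch, assuming a limit of 25 documents per request (verify this against the current service limits for your resource); `batched` and `MAX_DOCS_PER_REQUEST` are hypothetical names, not part of the Azure SDK, and the notes here are placeholder strings:

```python
from itertools import islice

# Assumed per-request document limit; verify against the current service limits
MAX_DOCS_PER_REQUEST = 25

def batched(items, size=MAX_DOCS_PER_REQUEST):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Placeholder strings standing in for note text
example_notes = [f"note {i}" for i in range(60)]
batch_sizes = [len(batch) for batch in batched(example_notes)]
print(batch_sizes)  # [25, 25, 10]
```

With the real data, each batch drawn from `note_events["TEXT"]` could be passed to `begin_analyze_healthcare_entities` in place of a single example document.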

#### Extract Medical Concepts with Azure Text Analytics

In [3]:
# Authenticate the client using your key and endpoint 
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

In [7]:
# Example function for extracting information from healthcare-related text 

def health_example(client):
    documents = [
        """
        Patient needs to take 50 mg of ibuprofen.
        """
    ]

    poller = client.begin_analyze_healthcare_entities(documents)
    result = poller.result()

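    # Skip documents the service could not process
    docs = [doc for doc in result if not doc.is_error]

    for doc in docs:
        # Each entity carries its category, confidence score, and linked vocabulary IDs
        for entity in doc.entities:
            print(entity)

        # Relations link entities to each other; each role names one participant
        for relation in doc.entity_relations:
            print(relation)
            for role in relation.roles:
                print(role)
        print("------------------------------------------")

health_example(client)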

{'text': '50 mg', 'normalized_text': None, 'category': 'Dosage', 'subcategory': None, 'assertion': None, 'length': 5, 'offset': 31, 'confidence_score': 1.0, 'data_sources': None}
{'text': 'ibuprofen', 'normalized_text': 'ibuprofen', 'category': 'MedicationName', 'subcategory': None, 'assertion': None, 'length': 9, 'offset': 40, 'confidence_score': 1.0, 'data_sources': [HealthcareEntityDataSource(entity_id=C0020740, name=UMLS), HealthcareEntityDataSource(entity_id=0000019879, name=AOD), HealthcareEntityDataSource(entity_id=M01AE01, name=ATC), HealthcareEntityDataSource(entity_id=0046165, name=CCPSS), HealthcareEntityDataSource(entity_id=0000006519, name=CHV), HealthcareEntityDataSource(entity_id=2270-2077, name=CSP), HealthcareEntityDataSource(entity_id=DB01050, name=DRUGBANK), HealthcareEntityDataSource(entity_id=1611, name=GS), HealthcareEntityDataSource(entity_id=sh97005926, name=LCH_NW), HealthcareEntityDataSource(entity_id=LP16165-0, name=LNC), HealthcareEntityDataSource(entity_id=
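
The output above shows that each entity carries a `data_sources` list whose items have `name` and `entity_id` attributes, and that the entry named `UMLS` holds the concept identifier. A minimal sketch of collecting these into flat records follows; `entity_to_record` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the attributes printed above, so the example runs without calling the service.

```python
from types import SimpleNamespace

def entity_to_record(entity):
    """Flatten one entity into a dict, pulling out its UMLS ID if present."""
    sources = entity.data_sources or []  # data_sources is None when nothing is linked
    umls_id = next((s.entity_id for s in sources if s.name == "UMLS"), None)
    return {
        "text": entity.text,
        "category": entity.category,
        "confidence_score": entity.confidence_score,
        "umls_cui": umls_id,
    }

# Stand-in entities with values copied from the printed output
dosage = SimpleNamespace(
    text="50 mg", category="Dosage", confidence_score=1.0, data_sources=None
)
ibuprofen = SimpleNamespace(
    text="ibuprofen",
    category="MedicationName",
    confidence_score=1.0,
    data_sources=[
        SimpleNamespace(entity_id="C0020740", name="UMLS"),
        SimpleNamespace(entity_id="M01AE01", name="ATC"),
    ],
)

records = [entity_to_record(e) for e in (dosage, ibuprofen)]
for record in records:
    print(record)
```

A list of such records can be passed to `pd.DataFrame` to get one row per extracted entity.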

#### Supplement Medical Concepts with UMLS Metathesaurus