# 2. Building the model to classify electronic medical records (EMR)

In the batch data processing notebook [HHBatchDataProcessing.ipynb](./HHBatchDataProcessing.ipynb), I prepared a dataset by extracting the MTSamples medical records whose medical speciality falls into one of the categories listed below. Those records were passed through Comprehend Medical to extract medical keywords, and the result was converted to a flat file containing the feature set and the label.

    1: "Cardiovascular / Pulmonary"
    2: "Orthopedic"
    3: "Radiology"
    4: "General Medicine"
    5: "Gastroenterology"
    6: "Neurology"


In this notebook, I will be using the extracted dataset to create a classification model.

The goal of this experiment is **next-step prediction**: predicting the speciality needed for a patient with certain diseases. In practice, the model could analyze a medical transcription in real time and use the result to recommend referrals to the relevant specialist, provide medical information related to the health condition, or suggest nutrition, supplements, exercises, or available therapies that can help improve quality of life and lifestyle decisions. In this way it could serve as a portal that connects health care providers with patients.

The input for the prediction is the EMR as a PDF file containing a doctor's notes about the patient, or the patient's own notes about their illness, in free form. This unstructured free-form text is passed through Comprehend Medical to extract the medical terms, which are then used to predict the medical speciality with the trained model.

---

## Contents

1. [Objective](#Objective)
1. [Setup Environment](#Setup-Environment)
1. [Load and Explore the Dataset](#Load-and-Explore-Dataset)
1. [Prepare Dataset for Model Training](#Prepare-Dataset-for-Model-Training)
1. [Linear learner Algorithm](#Linear-learner-Algorithm)
1. [Train the Model](#Train-the-Model)
1. [Deploy and Evaluate the Model](#Deploy-and-Evaluate-the-Model)
1. [Hyperparameter Optimization](#Hyperparameter-Optimization)
1. [Inference Example](#Inference-Example)
1. [Conclusion](#Conclusion)
1. [Clean up resources](#Clean-up-resources)



---
## Objective
Predict the health condition category (medical speciality) from the EMR.

Input: Free text about the patient's health condition, written by the patient or taken from a prescription or a doctor's transcript.

Final goal: Based on the predicted medical speciality, provide health recommendations and information about that speciality. (This program ends at the prediction stage, but in a product implementation it could be integrated with a health care provider database that supplies information about illnesses, lists of doctors, nutrition or supplement lists, therapies, etc.)

Challenges:
- The dataset is limited; a larger dataset would help train a more accurate model.
- The dataset covers a limited number of health conditions.

---
## Setup Environment

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services
- **Upgrade** SageMaker to the latest version


In [None]:
%pip install --upgrade sagemaker

In [None]:
%pip install textract-trp

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
pd.set_option('display.max_colwidth', None)

import time
import os

# reuse functions from medical document processing notebooks
#from util.classification_report import generate_classification_report, predict_from_numpy_V2  # helper function for classification reports
from util.Pipeline import extractTextract, extractMedical
from util.preprocess import *

# import record processing functions
from sklearn.model_selection import train_test_split
from sagemaker.amazon.amazon_estimator import RecordSet
from sagemaker import image_uris  # SDK v2 lookup for built-in algorithm images

# setting up SageMaker parameters
import pkg_resources
pkg_resources.require("sagemaker>2.9.2") 
import sagemaker
import boto3

from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import seaborn as sns

boto_session = boto3.Session()
region = boto_session.region_name
bucket_name = sagemaker.Session().default_bucket()
bucket_prefix = "emr-mtSample"  # Location in the bucket to store our files
sgmk_session = sagemaker.Session()
sgmk_client = boto_session.client("sagemaker")
sgmk_role = sagemaker.get_execution_role()

---
## Load and Explore Dataset

Load the dataset prepared from the previous notebook [BatchDataProcessing](./BatchDataProcessing.ipynb). This dataset contains labelled data based on the medical speciality selected above and the medical features that were extracted from the electronic medical reports.

You can find the processed dataset at `./data/processed_combined_extract.csv`.

*Demographics:*
* `ID`: ID of the patient (int)
* `Label`: the medical speciality category (one of the 6 chosen categories, coded 1-6)
* The remaining columns, e.g. `fever`, `wheezing`: medical conditions extracted from the notes. Each value is the confidence score of that condition (float). There are 113 features in this dataset.

In [None]:
df_wide_full=pd.read_csv("./data/processed_combined_extract.csv")
df_wide_full.head()

#### Explore the correlation between the input variables and the output label


In [None]:
corrPlot(df_wide_full)
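
`corrPlot` comes from `util.preprocess` and its implementation is not shown in this notebook. As a rough sketch of the same idea, assuming a wide frame with `ID`, `Label`, and one confidence column per medical term, the features can be ranked by their absolute correlation with the label. The toy frame and the helper name `rank_features_by_label_correlation` are hypothetical, and because the label is categorical, a Pearson correlation against it is only a coarse signal.

```python
import pandas as pd


def rank_features_by_label_correlation(df, label_col="Label", id_col="ID"):
    """Rank feature columns by absolute Pearson correlation with the label."""
    features = df.drop(columns=[label_col, id_col])
    correlations = features.corrwith(df[label_col])
    return correlations.abs().sort_values(ascending=False)


# Hypothetical toy data in the same wide layout as the processed dataset
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Label": [1, 1, 2, 2],
        "fever": [0.9, 0.8, 0.1, 0.0],
        "wheezing": [0.5, 0.4, 0.5, 0.4],
    }
)

ranking = rank_features_by_label_correlation(toy)
print(ranking)
```

In this toy frame `fever` separates the two labels almost perfectly, while `wheezing` carries no linear signal, so it lands at the bottom of the ranking.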

---
## Prepare Dataset for Model Training

1. Convert `Label` to start from 0 rather than 1, as required by linear learner.
2. Shuffle and split the data into **Training (80%)**, **Validation (10%)**, and **Test (10%)** sets.
3. Visualize data to see the number of records per category.

The training and validation datasets will be used during the training (and tuning) phase, while the 'holdout' test set will be used afterwards to evaluate the model.
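
With a small dataset and six classes, a purely random split can leave a class under-represented in the validation or test set. The following is a minimal sketch of an 80/10/10 split that additionally stratifies on the label; stratification is an optional variation, not what this notebook's own cell does. The helper name `split_80_10_10` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_80_10_10(X, y, seed=0):
    """Split into 80% train, 10% validation, 10% test, stratified on y."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Halve the 20% holdout into validation and test sets
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic stand-in: 120 records, 4 features, 6 balanced classes
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((120, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = pd.Series(np.repeat(np.arange(6), 20), name="Label")

train, val, test = split_80_10_10(X_demo, y_demo)
print(len(train[0]), len(val[0]), len(test[0]))
```

Passing `random_state` explicitly also makes the split reproducible without relying on NumPy's global seed.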

In [None]:
# use AWS classifier - linear learner multi classifier

# transform labels to 0 index as it is required by linear learner to have labels starting 0
df_wide_full['Label'] -= 1

df_wide_full=df_wide_full.apply(pd.to_numeric, downcast='float', errors='coerce')

# remove the id column and drop label for X dataset
X=df_wide_full.drop(['Label', 'ID'], axis=1)
y=df_wide_full['Label'] # chose Label for y dataset

# shuffle and split into train and test sets
np.random.seed(0)
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size=0.2)
# further split the test set into validation and test sets
val_features, test_features, val_labels, test_labels = train_test_split(
    test_features, test_labels, test_size=0.5
)

In [None]:
# Visualize data
# assign label names and count label frequencies

label_map = {
    0: "Cardiovascular / Pulmonary",
    1: "Orthopedic",
    2: "Radiology",
    3: "General Medicine",
    4: "Gastroenterology",
    5: "Neurology",
}

label_counts = (
    train_labels.map(label_map).value_counts(sort=False).sort_index(ascending=False)
)

label_counts.plot(kind="barh", color="tomato", title="Label Counts")

---
## Linear learner Algorithm


### Define Hyperparameters & Algorithm
Use the SageMaker SDK's `LinearLearner` estimator, which accepts the common [estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) arguments, to configure the following. The names below are the SDK v2 names, which this notebook requires; SDK v1 prefixed several of them with `train_`.

* `instance_type` - Type of instance to use for training.
* `instance_count` - The number of instances to run the training job. For algorithms that support distributed training, set an instance count of more than 1.
* `role` - IAM role used to run the training job.
* `use_spot_instances` - Specify whether to use spot instances. For more information about spot training, refer to the following url: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html
* `max_run` - Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.
* `max_wait` - Timeout in seconds waiting for the spot training job to complete; it must be at least `max_run`.
* hyperparameters - The hyperparameters used to train the model (for `LinearLearner`, passed as keyword arguments such as `epochs`).
* `predictor_type` - `multiclass_classifier`
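
Because the spot wait timeout has to cover the training timeout, it can help to validate the two values before building the estimator. This is a minimal sketch under that assumption; the helper name `spot_training_kwargs` is hypothetical, and the 20- and 30-minute values mirror the ones that appear in this notebook's estimator cell.

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds):
    """Return estimator keyword arguments for managed spot training.

    Raises ValueError when the wait timeout is shorter than the run timeout.
    """
    if max_run_seconds <= 0:
        raise ValueError("max_run must be positive")
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
    }


spot_kwargs = spot_training_kwargs(20 * 60, 30 * 60)
print(spot_kwargs)
```

The resulting dictionary could then be unpacked into the estimator constructor, for example `sagemaker.LinearLearner(..., **spot_kwargs)`.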

In [None]:
from sagemaker import image_uris

# container image for the built-in linear learner algorithm (SDK v2 lookup)
training_image = image_uris.retrieve("linear-learner", boto3.Session().region_name)

# NOTE: apart from init_method, these are XGBoost-style settings kept for reference only.
# They are not linear learner hyperparameters and are not passed to the estimator below.
hyperparameters = {
    "num_round": "150",     # int: [1,300]
    "max_depth": "6",     # int: [1,10]
    "alpha": "2.5",         # float: [0,5]
    "eta": "0.2",           # float: [0,1]
#    "objective": "binary:logistic", # binary classification
    "objective": "multi:softmax",    # multi class
    "num_class": "8",
    "gamma": "4",
    "min_child_weight": "6",
    "init_method": "normal", # uniform or normal
}


# instantiate the LinearLearner estimator object
multiclass_estimator = sagemaker.LinearLearner(
    image_uri=training_image, # newly added!!
    role=sagemaker.get_execution_role(), # IAM role to be used
    instance_count=1,               # SDK v2 name (v1: train_instance_count)
    instance_type="ml.m4.xlarge",   # SDK v2 name (v1: train_instance_type)
    predictor_type="multiclass_classifier",
    num_classes=6,                  # six categories, labels 0-5 (see label_map)
    epochs=50,
    num_models=32,                  # max models to test is 32
    use_spot_instances=True,        # Use spot instances to reduce cost
    max_run=20*60,                  # Maximum allowed active runtime
    max_wait=30*60,                 # Maximum clock time (including spot delays); must be >= max_run
)

In [None]:
# LL - wrap data in RecordSet objects as required by linear learner
train_records = multiclass_estimator.record_set(train_features.values, train_labels.values, channel="train")
val_records = multiclass_estimator.record_set(val_features.values, val_labels.values, channel="validation")
test_records = multiclass_estimator.record_set(test_features.values, test_labels.values, channel="test")

---
## Train the Model

To start the training job, call the `estimator.fit()` function. This starts a SageMaker training job in the background. You can also see your training job in the AWS console by going to SageMaker -> Training jobs.

Once the training job is completed, proceed to the next step.

In [None]:
# LL - start a training job
multiclass_estimator.fit([train_records, val_records, test_records])

## Deploy and Evaluate the Model
After training the model, deploy it (host it behind a real-time endpoint) so that we can run predictions in real time. This is done with the `estimator.deploy()` function; see https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html.

The deployment might take a few minutes, and by default the code waits for it to complete.

+ Use the Endpoints page of the SageMaker Console to check the status of the deployment.
+ You can start the Hyperparameter Optimization job in parallel, which will take a while to run too.
+ Predictions have to wait until the endpoint deployment is complete.

In [None]:
# LL - deploy a model hosting endpoint
# No predictor_cls is passed: the estimator's default linear learner predictor
# returns the protobuf records that the evaluation code below parses.
multiclass_predictor = multiclass_estimator.deploy(
    endpoint_name="hhmulticlass",
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)

In [None]:
# Metric evaluation
def evaluate_metrics(predictor, test_features, test_labels):
    """
    Evaluate a model on a test set using the given prediction endpoint. Display classification metrics.
    """
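    # convert inputs to NumPy arrays: the predictor serializes arrays, not DataFrames
    test_features = np.asarray(test_features, dtype="float32")
    test_labels = np.asarray(test_labels)

    # split the test dataset into batches and evaluate using prediction endpoint
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 10)]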

    # parse protobuf responses to extract predicted labels
    extract_label = lambda x: x.label['predicted_label'].float32_tensor.values
    test_preds = np.concatenate([np.array([extract_label(x) for x in batch]) for batch in prediction_batches])
    test_preds = test_preds.reshape((-1,))
    
    # calculate accuracy
    accuracy = (test_preds == test_labels).sum() / test_labels.shape[0]
    
    # calculate recall for each class
    recall_per_class, classes = [], []
    for target_label in np.unique(test_labels):
        recall_numerator = np.logical_and(test_preds == target_label, test_labels == target_label).sum()
        recall_denominator = (test_labels == target_label).sum()
        recall_per_class.append(recall_numerator / recall_denominator)
        classes.append(label_map[target_label])
    recall = pd.DataFrame({'recall': recall_per_class, 'class_label': classes})
    recall.sort_values('class_label', ascending=False, inplace=True)

    # calculate confusion matrix
    label_mapper = np.vectorize(lambda x: label_map[x])
    confusion_matrix = pd.crosstab(label_mapper(test_labels), label_mapper(test_preds), 
                                   rownames=['Actuals'], colnames=['Predictions'], normalize='index')

    # display results
    sns.heatmap(confusion_matrix, annot=True, fmt='.2f', cmap="YlGnBu").set_title('Confusion Matrix')  
    ax = recall.plot(kind='barh', x='class_label', y='recall', color='steelblue', title='Recall', legend=False)
    ax.set_ylabel('')
    print('Accuracy: {:.3f}'.format(accuracy))

## Run predictions

Once the SageMaker endpoint has been deployed, we can run some predictions to test it. Let us run predictions on our test data and evaluate the results.

In [None]:
# LL predict (pass the NumPy values, not the DataFrame itself)
result=multiclass_predictor.predict(test_features.values)
print(result)

In [None]:
#LL - evaluate metrics of the model trained with default hyperparameters
evaluate_metrics(multiclass_predictor, test_features, test_labels)

In [None]:
# Evaluate classification report
extract_label = lambda x: x.label['predicted_label'].float32_tensor.values
test_preds = np.array([extract_label(x) for x in result]).reshape((-1,))
print(classification_report(
    test_labels,
    test_preds,
    labels=list(label_map.keys()),           # numeric class ids 0-5
    target_names=list(label_map.values()),   # readable speciality names
))
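
The metrics used in this section can be checked offline on small arrays without calling the endpoint. This is a minimal sketch with made-up labels and predictions (three classes, six records), showing accuracy, per-class recall, and the row-normalized confusion matrix computed the same way as in `evaluate_metrics`.

```python
import numpy as np
import pandas as pd

# Made-up ground truth and predictions for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Accuracy: share of records whose predicted label matches the actual label
accuracy = (y_true == y_pred).mean()

# Recall per class: correctly predicted records of a class / all records of that class
recall = {
    int(c): float(((y_pred == c) & (y_true == c)).sum() / (y_true == c).sum())
    for c in np.unique(y_true)
}

# Confusion matrix normalized per row, so each row of actuals sums to 1
confusion = pd.crosstab(
    y_true, y_pred, rownames=["Actuals"], colnames=["Predictions"], normalize="index"
)

print(f"Accuracy: {accuracy:.3f}")
print(recall)
print(confusion)
```

With row normalization, the diagonal of the confusion matrix equals the per-class recall, which is a quick consistency check between the two plots.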

### Create model from the training job

A completed training job leaves a model artifact in S3; to register it as a named SageMaker model that can be reused for hosting, create a model from the training job. Check training jobs and models in your SageMaker Console. To create a model from a training job, refer to the documentation for the *[create_model API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model)*.

In [None]:
## create a primary container with the trained model
model_data=multiclass_estimator.create_model().model_data
primary_container = {
    'Image': training_image,
    'ModelDataUrl': model_data
}

In [None]:
## Prepare a model for hosting to run inference
model_name = 'LL-HH-model' ## new model name (must start with an alphanumeric character)
create_model_response = sgmk_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sgmk_role,
    PrimaryContainer = primary_container,
)

---
## Inference Example
A simplified pipeline to process an electronic health record: combine Textract, Comprehend Medical, and the SageMaker endpoint to process an electronic medical report.

In [None]:
from importlib import reload  # the imp module is deprecated and removed in Python 3.12
from util.Pipeline import extractTextract, extractMedical

### Step 1: Extract data from Textract

Two use cases are covered below.

1- A medical report in English: choose the first document and comment out the second one. On this path, after running the first code block below you can skip the language detection and translation blocks and move straight to Step 2: Extract data from Comprehend Medical.

2- A medical report in German: choose the second document and comment out the first one
    - first use Translate to create an English translation

In [None]:
PDFprefix='hhtestdata' # bucket name if you use test data from s3- customize name if you test this code

# Check the 2 use cases separately (choose either use case 1 or use case 2)
# If you choose use case 1, you can skip the next few blocks and go directly to Step 2: Extract data from Comprehend Medical
# Use case 1 - English language report
fileName =  'sample_report_1.pdf' 

# Use case 2 - German language report
#fileName =  'sample_report_2.pdf' 

fileUploadPath = os.path.join("./data", fileName) # if you upload from working dir
#fileUploadPath = os.path.join(PDFprefix, fileName) # if you upload from a s3 bucket
print("EHR file to be processed is at ", fileUploadPath)

boto3.Session().resource("s3").Bucket(bucket_name).Object(fileName).upload_file(
    fileUploadPath
)

doc=extractTextract(bucket_name,fileName) # extract pdf file 

In [None]:
# read full text
print("Total length of document is", len(doc.pages))
idx = 1
full_text = ""
for page in doc.pages:
    print(f"Results from page {idx}: \n", page.text)
    full_text += page.text
    idx = idx + 1

In [None]:
# detect language
comprehend_client = boto3.client(service_name="comprehend", region_name="us-east-1")
response = comprehend_client.detect_dominant_language(Text=full_text).get(
    "Languages", []
)
for language in response:
    print(
        f"Detected language is {language.get('LanguageCode', [])}, with a confidence score of {language.get('Score', [])}"
    )

In [None]:
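# if the dominant (highest-scoring) language is de then translate to en
dominant = max(response, key=lambda lang: lang.get('Score', 0)) if response else {}
if dominant.get('LanguageCode')=='de':
    translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
    # only the first 5000 characters are sent; longer reports are truncated here
    result = translate.translate_text(Text=full_text[:5000], SourceLanguageCode="de", TargetLanguageCode="en")
    enFullText = result.get('TranslatedText')
    print('TranslatedText: ' + enFullText)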
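
The translation cell sends only `full_text[:5000]`, so a longer report would lose everything after that point. One way around this, sketched here under the assumption that each request must stay below a size limit, is to split the text at whitespace and translate the pieces one by one. The helper name `chunk_text` is hypothetical, and the 5000-character default simply mirrors the slice used in this notebook; the service limit is expressed in bytes, so text with many non-ASCII characters may need a smaller value.

```python
def chunk_text(text, max_chars=5000):
    """Split text into pieces of at most max_chars, breaking at spaces when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space inside the window so words stay intact
            split_at = text.rfind(" ", start, end)
            if split_at > start:
                end = split_at + 1
        chunks.append(text[start:end])
        start = end
    return chunks


# Synthetic 15,000-character text standing in for a long report
sample_text = "wort " * 3000
pieces = chunk_text(sample_text)
print(len(pieces), [len(p) for p in pieces])
```

Each piece could then be passed to `translate.translate_text` in turn and the translated pieces joined back together.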

### Step 2: Extract data from Comprehend Medical

In [None]:
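# use the translated text only when the dominant (highest-scoring) language is German
dominant = max(response, key=lambda lang: lang.get('Score', 0)) if response else {}
if dominant.get('LanguageCode')=='de':
    comprehendResponse=extractMedical(enFullText)
else:
    comprehendResponse=extractMedical(doc)
df_cm=extractMC_v2(comprehendResponse[0]) # create dataframe with feature set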

### Step 3: Organize the extracted json file into dataframe

In [None]:
mclist, df_cm2=retrieve_mcList(df_cm, nFeature=40,threshold=0.8) # use same nfeatures and threshold as before
df_cm2=df_mc_generator_slim(df_cm2)
df_cm2

### Step 4: Prediction with the endpoint

In [None]:
# Create an empty dataset with same feature list as in our dataset used to train
df_final=test_features.iloc[0:0,0:] 
#print(df_final)

# from the Comprehend Medical extracted features, keep only those present in the training dataset
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_final=pd.concat([df_final, df_cm2[df_cm2.columns.intersection(df_final.columns)]])

df_final=df_final.fillna(0)
df_final=df_final.apply(pd.to_numeric, downcast='float', errors='coerce')
#print(df_final)
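
The alignment step matters because the endpoint expects exactly the training columns, in the training order. As a compact alternative sketch, `DataFrame.reindex` can drop unseen terms and zero-fill missing ones in a single call. The helper name `align_to_training_columns`, the column names, and the confidence values below are hypothetical.

```python
import pandas as pd


def align_to_training_columns(extracted, training_columns):
    """Keep only the training columns, in training order, filling gaps with 0."""
    aligned = extracted.reindex(columns=training_columns, fill_value=0.0)
    return aligned.fillna(0.0).astype("float32")


# Hypothetical feature list from training and one newly extracted record
training_columns = ["fever", "wheezing", "chest pain"]
extracted = pd.DataFrame([{"fever": 0.93, "nausea": 0.88}])

aligned = align_to_training_columns(extracted, training_columns)
print(aligned)
```

Here `nausea` is dropped because the model never saw it during training, while `wheezing` and `chest pain` are added with a confidence of 0.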

In [None]:
# 1. predict with trained model
result=multiclass_predictor.predict(df_final.values)
# result=LL-HH-model.predict(df_final.values) # using setup model
print(result)

In [None]:
# create a CSV-format input string (one line per record) to predict using the endpoint
import json  # used in the next cell to parse the endpoint response

td = "\n".join(",".join(str(v) for v in row) for row in df_final.values.tolist())
print(td)

In [None]:
# a test record
#td='0.0, 0.0, 0.0, 0.0, 0.0, 0.6807340383529663, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9984696507453918, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0'

# 2. predict using the endpoint
endpoint = 'hhmulticlass'
runtime = boto3.Session().client('sagemaker-runtime')

# Send input data to get prediction via InvokeEndpoint API
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType='text/csv', Body=td)

# Unpack response
result = json.loads(response['Body'].read().decode())
print(result)
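
To turn the endpoint output into a readable recommendation, the predicted class id can be mapped back to its speciality name. This is a minimal sketch that assumes the JSON response has a `predictions` list whose entries carry `predicted_label` and a per-class `score` list; that layout and the sample response are assumptions for illustration, so compare them with what `print(result)` shows. The helper name `top_prediction` is hypothetical.

```python
label_map = {
    0: "Cardiovascular / Pulmonary",
    1: "Orthopedic",
    2: "Radiology",
    3: "General Medicine",
    4: "Gastroenterology",
    5: "Neurology",
}


def top_prediction(response_json, label_map):
    """Return (speciality name, score) for the first record in the response."""
    prediction = response_json["predictions"][0]
    label_id = int(prediction["predicted_label"])
    scores = prediction.get("score", [])
    # The score is optional: fall back to None when it is absent
    confidence = float(scores[label_id]) if label_id < len(scores) else None
    return label_map[label_id], confidence


# Made-up response in the assumed layout
sample_response = {
    "predictions": [
        {"score": [0.05, 0.1, 0.6, 0.1, 0.1, 0.05], "predicted_label": 2}
    ]
}

speciality, confidence = top_prediction(sample_response, label_map)
print(speciality, confidence)
```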

---
## Conclusion
The SageMaker linear learner algorithm performed well on the limited dataset. It would be interesting to compare the results with other models.

Finally, the inference results showed the predicted classification, which can be used to provide health recommendations.

---
## Clean up resources
### Delete the endpoint and configuration if needed

In [None]:
# in SDK v2 the endpoint is deleted through the predictor, not the estimator
multiclass_predictor.delete_endpoint(delete_endpoint_config=True)

### Delete the generated files from the S3 bucket

In [None]:
## Delete all the content in the emr-mtSample folder. Check S3 before deleting it
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)
bucket.objects.filter(Prefix=bucket_prefix).delete()

In [None]:
### Delete all the content in the PDF folder 
bucket.objects.filter(Prefix=PDFprefix).delete()

### Best Practice:
 1. Delete the buckets created from testing
 2. Shut down your notebook instance if you are not planning to explore more