# 2.2 Building the model to classify electronic medical records (EMR)
- Use XGBoost to compare with the previous Linear Learner model

In the batch data processing notebook [HHBatchDataProcessing.ipynb](./HHBatchDataProcessing.ipynb), I prepared a dataset by extracting the MTSamples medical records whose medical speciality falls into one of the categories below. Those records were passed through Comprehend Medical to extract medical keywords, and the result was converted to a flat file containing the feature set and the label.

    1: "Cardiovascular / Pulmonary"
    2: "Orthopedic"
    3: "Radiology"
    4: "General Medicine"
    5: "Gastroenterology"
    6: "Neurology"


In this notebook, I will be using the extracted dataset to create a classification model.

The goal of this experiment is **next-step prediction**: predicting the speciality a patient needs given certain diseases. In practice, the model could analyze a medical transcription in real time and recommend referrals to the respective specialists, provide medical information related to the health condition, or suggest nutrition, supplements, exercises, or available therapies that can improve quality of life and lifestyle decisions. In this way it can serve as a portal that connects health care providers with patients.

The input for the prediction is the EMR as a PDF file containing the doctor's notes about the patient, or the patient's own notes about their illness, in free form. This unstructured free-form text is passed through Comprehend Medical to extract the medical terms, which are then used to predict the medical speciality with the trained model.

---

## Contents

1. [Objective](#Objective)
1. [Setup Environment](#Setup-Environment)
1. [Load and Explore the Dataset](#Load-and-Explore-Dataset)
1. [Prepare Dataset for Model Training](#Prepare-Dataset-for-Model-Training)
1. [XGBoost Algorithm](#XGBoost-Algorithm)
1. [Train the Model](#Train-the-Model)
1. [Deploy and Evaluate the Model](#Deploy-and-Evaluate-the-Model)
1. [Hyperparameter Optimization](#Hyperparameter-Optimization)
1. [Inference Example](#Inference-Example)
1. [Conclusion](#Conclusion)
1. [Clean up resources](#Clean-up-resources)



---
## Objective
Predict the health condition according to the EMR.

Input: free text describing the patient's health condition, written by the patient or taken from a prescription or a doctor's transcript.

Final goal: according to the predicted health speciality, provide information about health recommendations and the medical speciality. (This program ends at the prediction stage, but in a product implementation it could be integrated with a health care provider database that supplies information about illnesses, lists of doctors, nutrition or supplement lists, therapies, etc.)

Challenges:
- The dataset is limited; a larger dataset would help train a more accurate model.
- The dataset covers only a limited number of health conditions.

---
## Setup Environment

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services
- **Upgrade** SageMaker to the latest version

In [None]:
%pip install --upgrade sagemaker

In [None]:
%pip install textract-trp

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
pd.set_option('display.max_colwidth', None)

import time
import os

# import self-defined functions
#from util.classification_report import generate_classification_report, predict_from_numpy_V2  # helper function for classification reports
from util.Pipeline import extractTextract, extractMedical
from util.preprocess import *
from sklearn.model_selection import train_test_split
from sagemaker.amazon.amazon_estimator import RecordSet

# setting up SageMaker parameters
import pkg_resources
pkg_resources.require("sagemaker>2.9.2") 
import sagemaker
import boto3

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report

boto_session = boto3.Session()
region = boto_session.region_name
sgmk_session = sagemaker.Session()  # create the SageMaker session once and reuse it
bucket_name = sgmk_session.default_bucket()
bucket_prefix = "emr-mtSample"  # Location in the bucket to store our files
sgmk_client = boto_session.client("sagemaker")
sgmk_role = sagemaker.get_execution_role()

---
## Load and Explore Dataset

Load the dataset prepared from the previous notebook [BatchDataProcessing](./BatchDataProcessing.ipynb). This dataset contains labelled data based on the medical speciality selected above and the medical features that were extracted from the electronic medical reports.

You can find the processed dataset at `./data/processed_combined_extract.csv`.

*Demographics:*
* `ID`: ID of the patient (int)
* `Label`: the medical speciality (one of the 6 chosen categories, coded 1-6)
* The remaining columns, e.g. `fever`, `wheezing`: medical conditions extracted from the notes. The number indicates the confidence of the symptom (float). There are 113 features in this dataset.
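
To make the column layout concrete, here is a minimal sketch of how a list of detected entities could be collapsed into one row of confidence scores. The entity dictionaries are hypothetical and only mimic the `Text` and `Score` fields of a Comprehend Medical `detect_entities_v2` response; the helper name `entities_to_features` and the 0.8 threshold are assumptions for illustration, not the code in `util.preprocess`.

```python
from typing import Dict, List


def entities_to_features(entities: List[dict], threshold: float = 0.8) -> Dict[str, float]:
    """Collapse detected entities into {term: highest confidence score}."""
    features: Dict[str, float] = {}
    for entity in entities:
        term = entity["Text"].strip().lower()
        score = float(entity["Score"])
        if score < threshold:
            continue  # drop low-confidence detections
        # keep the highest score when a term is detected more than once
        features[term] = max(score, features.get(term, 0.0))
    return features


# Hypothetical entities, shaped like the "Entities" list of a detect_entities_v2 response
sample_entities = [
    {"Text": "Fever", "Category": "MEDICAL_CONDITION", "Score": 0.97},
    {"Text": "wheezing", "Category": "MEDICAL_CONDITION", "Score": 0.91},
    {"Text": "fever", "Category": "MEDICAL_CONDITION", "Score": 0.88},
    {"Text": "nausea", "Category": "MEDICAL_CONDITION", "Score": 0.42},
]
feature_row = entities_to_features(sample_entities)
print(feature_row)
```

Each key of the resulting dictionary corresponds to one feature column, and the value is the confidence stored in that column.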

In [None]:
df_wide_full=pd.read_csv("./data/processed_combined_extract.csv")
df_wide_full.head()

#### Explore the correlation between the input variables and the output variable

In [None]:
corrPlot(df_wide_full)

---
## Prepare Dataset for Model Training

1. Shuffle and split the data into **Training (80%)**, **Validation (10%)**, and **Test (10%)** sets
2. Convert the data to the format the algorithm expects (e.g. CSV)
3. Upload the data to S3
4. Create `TrainingInput` objects defining the data sources for the SageMaker SDK

The training and validation datasets will be used during the training (and tuning) phase, while the 'holdout' test set will be used afterwards to evaluate the model.

The SageMaker XGBoost algorithm expects training data in the **libSVM** or **CSV** format. For CSV input, it expects:
- The target variable in the first column, and
- No header row
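
The split and the CSV layout described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the records are made up, and `shuffle_split` and `to_headerless_csv` are hypothetical helper names, not part of this notebook or of the SageMaker SDK.

```python
import csv
import io
import random


def shuffle_split(rows, train_frac=0.8, val_frac=0.1, seed=123):
    """Shuffle the rows and cut them into train / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(train_frac * len(rows))
    n_val = n_train + int(val_frac * len(rows))
    return rows[:n_train], rows[n_train:n_val], rows[n_val:]


def to_headerless_csv(rows):
    """Write one line per record: label first, then the features, no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for label, features in rows:
        writer.writerow([label, *features])
    return buf.getvalue()


# Hypothetical records: (label, [feature confidences])
records = [(i % 6, [round(0.1 * i, 1), 0.0]) for i in range(20)]
train, val, test = shuffle_split(records)
print(len(train), len(val), len(test))
print(to_headerless_csv(train[:2]))
```

The test part is kept apart from training and tuning, which is why only the first two parts are uploaded for the training job.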

In [None]:
# XGBoost data prep
# transform labels to a 0-based index (required for multi-class XGBoost,
# and comparable with the LL model results)
df_wide_full['Label'] -= 1

# all feature data should be float32
df_wide_full = df_wide_full.apply(pd.to_numeric, downcast='float', errors='coerce')

# remove the id column (after the transformations above, so that they carry over)
df_combined_model = df_wide_full.iloc[:, 1:]

# Shuffle and split the dataset: 80% train, 10% validation, 10% test
shuffled = df_combined_model.sample(frac=1, random_state=123)
n_train = int(0.8 * len(shuffled))
n_val = int(0.9 * len(shuffled))
train_data = shuffled.iloc[:n_train]
validation_data = shuffled.iloc[n_train:n_val]
test_data = shuffled.iloc[n_val:]

# Create CSV files for Train / Validation / Test
# (train and validation without header, as expected by SageMaker XGBoost)
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
test_data.to_csv("data/test.csv", index=False, header=True)

### Upload dataset to S3

In [None]:
# XG boost
# Upload CSV files to S3 for SageMaker training
train_uri = sgmk_session.upload_data(
    path="data/train.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix
)
val_uri = sgmk_session.upload_data(
    path="data/validation.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix
)

# Create s3_inputs
s3_input_train = sagemaker.TrainingInput(s3_data=train_uri, content_type="csv")
s3_input_validation = sagemaker.TrainingInput(s3_data=val_uri, content_type="csv")

print(f"{s3_input_train.config}\n\n{s3_input_validation.config}")

## XGBoost Algorithm
**`XGBoost`** stands for e**X**treme **G**radient **Boosting**. It implements the gradient boosting decision tree algorithm, which is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction.

The two major advantages of using XGBoost are:

1. Fast Execution Speed: Generally, XGBoost is faster when compared to other implementations of gradient boosting.
2. High Model Performance: XGBoost has excelled on structured or tabular datasets for classification and regression predictive modeling problems.
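
To illustrate the idea of fitting each new model to the residuals of the previous ones, here is a deliberately tiny gradient-boosting sketch for one-dimensional regression with decision stumps. It is not how XGBoost is implemented (no regularization, no second-order gradients, no multi-class objective); the data and the names `fit_stump` and `boost` are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on x that best fits the residuals with two constants.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, rounds=50, eta=0.2):
    """Additive model: start from the mean, then add stumps fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * stump(x) for x, p in zip(xs, preds)]  # shrink each step by eta
    return lambda x: base + eta * sum(s(x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

The `eta` and `rounds` arguments play the same role as the `eta` and `num_round` hyperparameters used later in this notebook: a smaller `eta` takes smaller steps and usually needs more rounds.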

In [None]:
# XGBoost
from sagemaker.image_uris import retrieve

training_image = retrieve(framework="xgboost", region=region, version="1.5-1") # for multi class

print(training_image)

### Define Hyperparameters & Algorithm
* image_uri - Training image to use; in this case we will be using the XGBoost training image
* instance_type - Type of instance to use.
* instance_count - The number of instances to run the training job. For suitable algorithms that support distributed training, set an instance count of more than 1.
* role - IAM role used to run the training job
* use_spot_instances - Specify whether to use spot instances.
* max_run - Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.
* max_wait - Timeout in seconds waiting for spot training instances
* hyperparameters - Our hyperparameters used to train the model
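
Two of these settings interact with the data and with each other: for a multi-class objective every label must lie in `[0, num_class)`, and with spot instances `max_wait` must not be smaller than `max_run`. The following is a minimal sketch of a local sanity check that could be run before starting a job; `check_training_config` is a hypothetical helper, not part of the SageMaker SDK.

```python
def check_training_config(hyperparameters, labels, max_run, max_wait, use_spot_instances):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    num_class = int(hyperparameters["num_class"])
    out_of_range = sorted({label for label in labels if not 0 <= label < num_class})
    if out_of_range:
        problems.append(f"labels outside [0, {num_class}): {out_of_range}")
    if use_spot_instances and max_wait < max_run:
        problems.append("max_wait must be greater than or equal to max_run for spot training")
    return problems


# Hypothetical values for illustration
ok = check_training_config({"num_class": "6"}, [0, 1, 2, 3, 4, 5], 20 * 60, 30 * 60, True)
bad = check_training_config({"num_class": "6"}, [1, 2, 3, 4, 5, 6], 30 * 60, 20 * 60, True)
print(ok)
print(bad)
```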

In [None]:
hyperparameters = {
    "num_round": "150",     # int: [1,300]
    "max_depth": "6",     # int: [1,10]
    "alpha": "2.5",         # float: [0,5]
    "eta": "0.2",           # float: [0,1]
#    "objective": "binary:logistic", # binary classification
    "objective": "multi:softmax",    # multi class
    "num_class": "8",                # must be larger than the highest label value (the dataset has 6 categories)
    "gamma": "4",
    "min_child_weight": "6",
}

# Instantiate an XGBoost estimator object
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,           # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,              # number of instances to be used
    role=sgmk_role,                      # IAM role to be used
    use_spot_instances=True,       # Use spot instances to reduce cost
    max_run=20*60,                 # Maximum allowed active runtime
    max_wait=30*60,                # Maximum clock time (including spot delays)
    hyperparameters=hyperparameters
)

---
## Train the Model

In [None]:
# start a training (fitting) job
estimator.fit({ "train": s3_input_train, "validation": s3_input_validation })

## Deploy and Evaluate the Model
After training the model, deploy it (host it behind a real-time endpoint) so that we can run predictions in real time. This can be done with the `estimator.deploy()` function; see the [SageMaker deployment documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html).

This deployment might take a few minutes, and by default the code will wait for the deployment to complete.

+ Use the Endpoints page of the SageMaker Console to check the status of the deployment
+ You can start the Hyperparameter Optimization job in parallel, which will take a while to run too.
+ Predictions have to wait until the endpoint deployment is complete.

In [None]:
predictor = estimator.deploy(
    endpoint_name='hhxgbmulti',
    initial_instance_count=1,
    instance_type="ml.m5.large",
    #inference_response_keys=inference_response_keys,
    predictor_cls=sagemaker.predictor.Predictor,
    serializer=sagemaker.serializers.CSVSerializer(),  # send NumPy arrays to the endpoint as CSV
    #wait=False
)

## Run predictions

Once the SageMaker endpoint has been deployed, we can run some predictions to test it. Let us run predictions on our test data and evaluate the results.
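
The endpoint answers with raw bytes that contain one predicted class per record. As a minimal sketch, assuming the classes arrive as newline- or comma-separated numbers, the response can be decoded and mapped to speciality names like this; the payload below is made up, and `decode_predictions` is a hypothetical helper.

```python
LABEL_NAMES = {
    0: "Cardiovascular / Pulmonary",
    1: "Orthopedic",
    2: "Radiology",
    3: "General Medicine",
    4: "Gastroenterology",
    5: "Neurology",
}


def decode_predictions(payload: bytes):
    """Turn a bytes response such as b"2.0\n0.0\n" into a list of integer class ids."""
    text = payload.decode("utf-8").replace(",", "\n")  # accept both separators
    return [int(float(token)) for token in text.split()]


# Hypothetical response payload for three records
class_ids = decode_predictions(b"2.0\n0.0\n5.0\n")
names = [LABEL_NAMES[i] for i in class_ids]
print(class_ids, names)
```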

In [None]:
# predict for test data
resultxgb=predictor.predict(test_data.iloc[:,1:].values)
print(resultxgb) # this result is in byte format
#print(test_data.iloc[:,0:1].size)

In [None]:
pred=resultxgb.decode().strip().split('\n') # split to get predicted labels (one per line)
print(pred)
df_pred=pd.DataFrame(pred) # convert to dataframe
df_pred=df_pred.apply(pd.to_numeric, downcast='float', errors='coerce') # convert to float
print(df_pred.size)

In [None]:
label_map = {
    0: "Cardiovascular / Pulmonary",
    1: "Orthopedic",
    2: "Radiology",
    3: "General Medicine",
    4: "Gastroenterology",
    5: "Neurology",
}
label_mapper = np.vectorize(lambda x: label_map[x])

In [None]:
from sklearn.metrics import classification_report
# classification report
print("Label category (0-5):", list(label_mapper(list(label_map.keys()))))
print(classification_report(test_data.iloc[:, 0].values, df_pred.values.ravel(), labels=list(label_map.keys())))
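
For readers who want to see what the report contains, here is a minimal from-scratch sketch of per-class precision, recall and support. It uses made-up labels and a hypothetical helper `per_class_metrics`; in the notebook itself the scikit-learn `classification_report` above does this work.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, support)} for each requested label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class received a wrong record
            fn[truth] += 1  # true class missed one of its records
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        report[label] = (precision, recall, actual)
    return report


# Made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = per_class_metrics(y_true, y_pred, labels=[0, 1, 2])
for label, (precision, recall, support) in metrics.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} support={support}")
```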

### Create model from the training job

A finished training job stores its model artifacts in S3, but training alone does not create a SageMaker model resource. Check the training jobs and models in your SageMaker Console. To create a model from a training job, refer to the documentation for the *[create_model API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model)*.

In [None]:
## create a primary container with the trained model
model_data = estimator.create_model().model_data
primary_container = {
    'Image': training_image,
    'ModelDataUrl': model_data
}

In [None]:
## Prepare a model for hosting to run inference
create_model_response = sgmk_client.create_model(
    ModelName='hhxgbmulti',
    ExecutionRoleArn=sgmk_role,
    PrimaryContainer=primary_container,
)


---
## Hyperparameter Optimization

**TODO: this section has not been tested yet.**

We can check whether the model improves with SageMaker Hyperparameter Optimization (HPO), which automates the search for optimal hyperparameters. We **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each hyperparameter that we plan to tune.
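
To show what searching over ranges means, here is a minimal local sketch that uses plain random search. Note that this is a swapped-in, simpler technique than the Bayesian strategy configured for SageMaker HPO, and that `toy_objective` is a made-up stand-in for "train a model and return its validation metric".

```python
import random


def random_search(evaluate, ranges, max_jobs=10, seed=0):
    """Sample max_jobs parameter sets from the ranges and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(max_jobs):
        params = {}
        for name, (low, high) in ranges.items():
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)   # integer range, both ends included
            else:
                params[name] = rng.uniform(low, high)   # continuous range
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Same ranges as in the tuner configuration of this section
search_ranges = {
    "num_round": (100, 300),
    "max_depth": (1, 10),
    "alpha": (0.0, 5.0),
    "eta": (0.0, 1.0),
}


def toy_objective(params):
    """Made-up score that peaks at max_depth=6 and eta=0.2 (higher is better)."""
    return -abs(params["max_depth"] - 6) - abs(params["eta"] - 0.2)


best_params, best_score = random_search(toy_objective, search_ranges)
print(best_params, best_score)
```

A Bayesian strategy differs in that it uses the scores of earlier jobs to decide which parameter sets to try next, instead of sampling them independently.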

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# set up hyperparameter ranges
ranges = {
    "num_round": IntegerParameter(100, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1),
}

# set up the objective metric
# multi:softmax is evaluated with the multiclass error rate (merror), which must be minimized
objective = "validation:merror"
#objective = "validation:accuracy"  # alternative; use objective_type="Maximize" with it
# instantiate a HPO object
tuner = HyperparameterTuner(
    estimator=estimator,              # the SageMaker estimator object
    hyperparameter_ranges=ranges,     # the range of hyperparameters
    max_jobs=10,                      # total number of HPO jobs
    max_parallel_jobs=2,              # how many HPO jobs can run in parallel
    strategy="Bayesian",              # the internal optimization strategy of HPO
    objective_metric_name=objective,  # the objective metric to be used for HPO
    objective_type="Minimize",        # maximize or minimize the objective metric
)


In [None]:
%%time
# start HPO
tuner.fit({ "train": s3_input_train, "validation": s3_input_validation })

In [None]:
%%time
# wait, until HPO is finished
hpo_state = "InProgress"

while hpo_state == "InProgress":
    hpo_state = sgmk_client.describe_hyper_parameter_tuning_job(
                HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)["HyperParameterTuningJobStatus"]
    print("-", end="")
    time.sleep(60)  # poll once every 1 min

print("\nHPO state:", hpo_state)



In [None]:
%%time
# deploy the best model from HPO
hpo_predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge",predictor_cls=sagemaker.predictor.Predictor,
    serializer = sagemaker.serializers.CSVSerializer())

In [None]:
hpo_predictor.deserializer=sagemaker.deserializers.CSVDeserializer()

In [None]:
# getting the predicted class labels of the best model
# (the CSV deserializer returns rows of strings)
hpo_result = hpo_predictor.predict(test_data.drop(["Label"], axis=1).values)
hpo_predictions = np.array(hpo_result, dtype="float32").ravel()
print(hpo_predictions)

# generate report for the best model
print("Best model (with HPO)")
print(classification_report(
    test_data["Label"].values,
    hpo_predictions,
    labels=list(label_map.keys()),
))

---
##  Inference Example

A simplified pipeline to process an Electronic Health Record: combine Textract, Comprehend Medical and the SageMaker endpoint to process an electronic medical report.

In [None]:
from importlib import reload  # the imp module is deprecated
from util.Pipeline import extractTextract, extractMedical

### Step 1: Extract data from Textract

In [None]:
PDFprefix='hhtestdata' # bucket name if you use test data from S3; customize it to your own S3 location if you test this code

# Check the 2 use cases separately (choose either use case 1 or use case 2).
# If you choose use case 1, you can skip the next few blocks and go directly to Step 2: Extract data from Comprehend Medical
# Use case 1 - English language report
#fileName =  'sample_report_1.pdf'

# Use case 2 - German language report
fileName =  'sample_report_2.pdf'

fileUploadPath = os.path.join("./data", fileName) # if you upload from working dir
#fileUploadPath = os.path.join(PDFprefix, fileName) # if you upload from a s3 bucket
print("EHR file to be processed is at ", fileUploadPath)

boto3.Session().resource("s3").Bucket(bucket_name).Object(fileName).upload_file(
    fileUploadPath
)

doc=extractTextract(bucket_name, fileName) # extract pdf file 

In [None]:
# read full text
print("Total number of pages in the document is", len(doc.pages))
idx = 1
full_text = ""
for page in doc.pages:
    print(f"Results from page {idx}: \n", page.text)
    full_text += page.text
    idx = idx + 1

In [None]:
# detect language
comprehend_client = boto3.client(service_name="comprehend", region_name="us-east-1")
response = comprehend_client.detect_dominant_language(Text=full_text).get(
    "Languages", []
)
for language in response:
    print(
        f"Detected language is {language.get('LanguageCode')}, with a confidence score of {language.get('Score')}"
    )

# keep the highest-scoring language for the following steps
language = max(response, key=lambda lang: lang.get("Score", 0.0), default={})

In [None]:
# if language is de then translate to en
if language.get('LanguageCode', [])=='de':
    translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
    result = translate.translate_text(Text=full_text[:5000], SourceLanguageCode="de", TargetLanguageCode="en")
    enFullText = result.get('TranslatedText')
    print('TranslatedText: ' + enFullText)

### Step 2: Extract data from Comprehend Medical

In [None]:
if language.get('LanguageCode', [])=='de':
    comprehend_medical_client = boto3.client(service_name='comprehendmedical', region_name='us-east-1')
    comprehendResponse = comprehend_medical_client.detect_entities_v2(Text=enFullText)
    df_cm=extractMC_v2(comprehendResponse) # create dataframe with feature set
else:
    comprehendResponse=extractMedical(doc)
    df_cm=extractMC_v2(comprehendResponse[0]) # create dataframe with feature set

### Step 3: Organize the extracted json file into dataframe

In [None]:
mclist, df_cm2=retrieve_mcList(df_cm, nFeature=40,threshold=0.8) # use same nfeatures and threshold as before
df_cm2=df_mc_generator_slim(df_cm2)
df_cm2

### Step 4: Prediction with the endpoint

In [None]:
# Create an empty dataset with the same feature list as the dataset used for training
# (the feature columns of the test split, i.e. everything except the label)
df_final = test_data.drop(["Label"], axis=1).iloc[0:0]
#print(df_final)

# keep only the Comprehend Medical features that also exist in the training dataset
df_final = pd.concat([df_final, df_cm2[df_cm2.columns.intersection(df_final.columns)]])

df_final=df_final.fillna(0)
df_final=df_final.apply(pd.to_numeric, downcast='float', errors='coerce')
#print(df_final)
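
The alignment step above can be pictured without pandas: every training column gets the extracted confidence if the term was found in the new document, and 0 otherwise, while terms that the model never saw are dropped. A minimal sketch with made-up column names and a hypothetical helper `align_features`:

```python
def align_features(extracted, training_columns):
    """Order the extracted scores like the training columns; missing terms become 0.0."""
    return [float(extracted.get(column, 0.0)) for column in training_columns]


# Hypothetical training columns and extraction result
training_columns = ["fever", "wheezing", "chest pain"]
extracted = {"wheezing": 0.91, "headache": 0.88}  # "headache" is unknown to the model

aligned_row = align_features(extracted, training_columns)
csv_payload = ",".join(str(value) for value in aligned_row)
print(aligned_row)
print(csv_payload)
```

The comma-separated string has the same shape as the payload sent to the endpoint in the cells below.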

In [None]:
# 1. predict with trained model
result=predictor.predict(df_final.values)
# result=LL-HH-model.predict(df_final.values) # using setup model
print(result)

In [None]:
# create a CSV-format input string to predict using the endpoint
import json  # also used below to unpack the endpoint response
#print(df_final.values)
# one comma-separated line per record
td = "\n".join(",".join(str(v) for v in row) for row in df_final.values.tolist())
print(td)

In [None]:
#td='0.0, 0.0, 0.0, 0.0, 0.0, 0.6807340383529663, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9984696507453918, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0'
#td-de=0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7928379774093628, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

# predict with new model using setup endpoint
# xgbmulticlass
endpoint = 'hhxgbmulti'
runtime = boto3.Session().client('sagemaker-runtime')
# Send the CSV payload via the InvokeEndpoint API
responsexgb = runtime.invoke_endpoint(EndpointName=endpoint, ContentType='text/csv', Body=td)

# Unpack response
resultxgb = json.loads(responsexgb['Body'].read().decode())
print(resultxgb)

---
## Conclusion
Compared with the previous Linear Learner (LL) model: LL reached an average accuracy of 0.55, with 0.68 for its best category, while XGBoost with default parameters reached 0.51, with 0.61 for its best category.

**Parameter tuning and model testing are still to be done (WIP).**

Finally, the inference example returned a predicted classification, which can be used to provide health recommendations.

---
## Clean up resources
### Delete the endpoint and configuration if needed

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
# only needed if the Hyperparameter Optimization section was run
hpo_predictor.delete_endpoint(delete_endpoint_config=True)

### Delete the generated files from the S3 bucket

In [None]:
## Delete all the content in the emr-mtSample folder. Check S3 before deleting it
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)
bucket.objects.filter(Prefix=bucket_prefix).delete()

In [None]:
### Delete all the content in the PDF folder 

bucket.objects.filter(Prefix=PDFprefix).delete()

### Best Practice:
 1. Delete the buckets created from testing
 2. Shut down your notebook instance if you are not planning to explore more