# 1. Batch Data Processing
Batch processing of Electronic Medical Reports (EMR) using Amazon Comprehend Medical

Assumption: A raw dataset such as MTSamples exists.
- The medical reports (in PDF form) can be processed with Amazon Textract, and the extracted text can be inserted into a dataset similar to MTSamples. The example notebook [1.Data_Processing.ipynb](https://github.com/aws-samples/amazon-textract-and-comprehend-medical-document-processing) from the medical document processing workshop can be used to extract the transcript text.
- Data can also be inserted directly from online user input.
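
As a rough illustration of the Textract route, the sketch below joins the `LINE` blocks of a Textract text-detection response into one transcript string. The helper name `textract_lines_to_text` and the sample response are hypothetical stand-ins written for this example; the workshop notebook linked above remains the reference for the full extraction flow.

```python
def textract_lines_to_text(response):
    """Join the LINE blocks of a Textract response into one transcript string."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


# Hand-written stand-in for a Textract response (not real service output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "CHIEF COMPLAINT: Chest pain."},
        {"BlockType": "WORD", "Text": "CHIEF"},
        {"BlockType": "LINE", "Text": "HISTORY: Patient reports wheezing."},
    ]
}

transcript = textract_lines_to_text(sample_response)
print(transcript)
```

In a live setup, the response would come from a call such as `boto3.client("textract").detect_document_text(...)`, and the resulting string would be stored in the `transcription` column.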

# Contents

1. [Objective](#Objective)
1. [Background](#Background)
1. [Setup Environment](#Setup-Environment)
1. [Load and Explore Data](#Load-and-Explore-Data)
1. [Data Sampling for modeling](#Data-Sampling-for-modeling)
1. [Combine the dataset](#Combine-the-dataset)
1. [Save the processed file](#Save-the-processed-file)

---
## Objective 
This notebook is the preprocessing step that prepares a batch of medical records for model training. It uses Amazon Comprehend Medical to extract medical keywords (e.g. fever, wheezing, injury) from doctors' transcripts and patient input, and organizes them into a data frame whose columns are used as features in model training. The trained model is then used to classify the medical specialty of a new transcription text. In a real-life use case, the model predictions could be used for automatic referral to the respective specialist, or to provide medical information and recommendations on nutrition, supplements, relevant exercises, and therapies.

---

## Background

**Dataset**: Medical transcription data scraped from mtsamples.com (`./data/mtsamples.csv`). This dataset is also used in the medical document processing sample notebook. You can find the raw dataset on [kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions).

**Amazon Comprehend Medical**: Comprehend Medical detects useful information in unstructured clinical text. As much as 75% of all health record data is found in unstructured text. Amazon Comprehend Medical uses Natural Language Processing (NLP) models to sort through this text for valuable information.

**Supported Languages**: Amazon Comprehend Medical only detects medical entities in English-language text. However, I have included a sample prediction use case in the ModelDeployment notebook that uses a German transcript, which is first translated to English and then used for prediction.
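
The translate-then-detect flow can be sketched as follows. This is a minimal illustration, not the code of the ModelDeployment notebook: the function name `detect_entities_from_foreign_text` is hypothetical, and the two stub classes are hand-written stand-ins that let the example run without AWS credentials. With real clients, `translate_client` would be `boto3.client("translate")` and `cm_client` would be `boto3.client("comprehendmedical")`.

```python
def detect_entities_from_foreign_text(text, source_lang, translate_client, cm_client):
    """Translate text to English, then run Comprehend Medical entity detection."""
    translated = translate_client.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode="en",
    )["TranslatedText"]
    response = cm_client.detect_entities_v2(Text=translated)
    return translated, response["Entities"]


class StubTranslate:
    """Offline stand-in for the Amazon Translate client (canned answer)."""

    def translate_text(self, Text, SourceLanguageCode, TargetLanguageCode):
        return {"TranslatedText": "The patient has a fever."}


class StubComprehendMedical:
    """Offline stand-in for the Comprehend Medical client (canned answer)."""

    def detect_entities_v2(self, Text):
        return {
            "Entities": [
                {"Text": "fever", "Category": "MEDICAL_CONDITION", "Score": 0.97}
            ]
        }


english, entities = detect_entities_from_foreign_text(
    "Der Patient hat Fieber.", "de", StubTranslate(), StubComprehendMedical()
)
print(english, [e["Text"] for e in entities])
```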

---
## Setup Environment


- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to Amazon Comprehend Medical (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and to SageMaker (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)) to use the cloud services


In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import time
import os

# reuse functions from medical document processing notebooks
from util.preprocess import *  # helper function for classification reports

# setting up SageMaker parameters
import sagemaker
import boto3

import matplotlib.pyplot as plt
import seaborn as sns

boto_session = boto3.Session()
region = boto_session.region_name
bucket_name = sagemaker.Session().default_bucket()
#bucket_prefix = "emr-mtSample"  # Location in the bucket to store our files
sgmk_session = sagemaker.Session()

sgmk_client = boto_session.client("sagemaker")  ## API for sagemaker
cm_client = boto3.client(service_name='comprehendmedical', use_ssl=True, region_name = 'us-east-1') ## API for comprehend medical

---
## Load and Explore Data

MTSamples dataset (`./data/mtsamples.csv`)

**Columns in the dataset**:

* `description`: Short description of transcription (string)
* `medical_specialty`: Medical specialty classification of transcription (string)
* `sample_name`: Transcription title
* `transcription`: Transcribed doctors' notes
* `keywords`: Relevant keywords from transcription

To train the model, the features are extracted by processing the `transcription` column with Amazon Comprehend Medical. The `medical_specialty` column is used as the label for the classification model.

In [None]:
df=pd.read_csv("./data/mtsamples.csv")
df.head()

### Clean up dataset

Check each column for missing values and remove the affected rows.

In [None]:
df.isnull().sum(axis=0) ## check for missing information

Remove the *33* rows where `transcription` is null.

In [None]:
df=df[df['transcription'].notnull()].reset_index()
df.isnull().sum(axis=0) 

### Explore dataset by medical speciality

Observe the distribution of medical reports by medical speciality

In [None]:
## add patient_id for reference
df['id']=df.index
sns.set(rc={'figure.figsize':(15,10)})
sns.countplot(y='medical_specialty',order=df['medical_specialty'].value_counts().index, data=df)  #df.medical_specialty.value_counts()

---
## Data Sampling for modeling

#### Business Question:
How often have we searched for information about illnesses, prescriptions, and symptoms to find health information?
Any patient would be interested in knowing about possible illnesses, what can improve the situation with respect to nutrition, supplements, exercises, and therapies, or which medical specialist should be consulted.

#### ML problem to resolve:
Multiclass classification of patient input based on medical conditions.

#### Why do we sample data at this step?

For demo purposes and because of data limitations, 6 medical specialities are chosen. The 6 categories are:

    1: "Cardiovascular / Pulmonary"   
    2: "Orthopedic"   
    3: "Radiology"   
    4: "General Medicine"    
    5: "Gastroenterology"   
    6: "Neurology"
    
A sample of 200 records is selected randomly from each category. The Surgery and Consultation categories are removed because they can belong to multiple medical specialities (e.g. surgery can be performed on different organs). Categories with fewer than 200 records are not used.
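
The sampling rule described above can be sketched with plain pandas. This is an illustration on toy data, not the implementation of `subpopulation_comprehend` from `util.preprocess`; the helper name `sample_per_specialty`, its parameters, and the fixed seed are assumptions made for the example. The toy labels also omit the leading space that the specialty strings passed in the notebook's calls carry.

```python
import pandas as pd


def sample_per_specialty(df, specialties, n, label_col="medical_specialty", seed=42):
    """Draw n random rows from each chosen specialty that has at least n rows."""
    subset = df[df[label_col].isin(specialties)]
    counts = subset[label_col].value_counts()
    eligible = counts[counts >= n].index  # skip categories that are too small
    subset = subset[subset[label_col].isin(eligible)]
    return subset.groupby(label_col).sample(n=n, random_state=seed)


# Toy data: Surgery is deliberately not selected, Urology has too few rows.
toy = pd.DataFrame(
    {
        "medical_specialty": ["Neurology"] * 5
        + ["Radiology"] * 4
        + ["Surgery"] * 6
        + ["Urology"] * 2,
        "transcription": [f"note {i}" for i in range(17)],
    }
)

sampled = sample_per_specialty(toy, ["Neurology", "Radiology", "Urology"], n=3)
print(sampled["medical_specialty"].value_counts())
```

Fixing the random seed keeps the sample reproducible between runs, which matters here because each sampled record is sent to a paid API.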


### Data Sampling from 6 categories: 200 samples each

Extract the medical conditions using Amazon Comprehend Medical.

In [None]:
%%time
nSample=200 ## number of records per specialty to process with Comprehend Medical
# Do not run with 200 if you just want to test, as this will cost around $100!
# Use 20 to save time and cost if you only want to test the function; however, this is not enough for good model accuracy.

df_list_cp, patient_ids_cp = subpopulation_comprehend(df, ' Cardiovascular / Pulmonary',sampleSize=nSample)
df_list_or, patient_ids_or = subpopulation_comprehend(df, ' Orthopedic',sampleSize=nSample)
df_list_ra, patient_ids_ra = subpopulation_comprehend(df, ' Radiology',sampleSize=nSample)
df_list_gm, patient_ids_gm = subpopulation_comprehend(df, ' General Medicine',sampleSize=nSample)
df_list_ga, patient_ids_ga = subpopulation_comprehend(df, ' Gastroenterology',sampleSize=nSample)
df_list_nu, patient_ids_nu = subpopulation_comprehend(df, ' Neurology',sampleSize=nSample)

Batch processing using Amazon Comprehend Medical 

Extract all the medical_conditions for each patient, together with the confidence score 
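
To make the per-record extraction concrete, the sketch below turns one Comprehend Medical `detect_entities_v2` response into a long-format data frame of medical conditions and confidence scores. It is not the `extractMC_v2` helper from `util.preprocess`; the function name `medical_conditions_to_frame`, the `Score` column name, and the sample response are assumptions made for illustration.

```python
import pandas as pd


def medical_conditions_to_frame(response, patient_id):
    """Keep only MEDICAL_CONDITION entities, one row per entity."""
    rows = [
        {"ID": patient_id, "MEDICAL_CONDITION": ent["Text"], "Score": ent["Score"]}
        for ent in response.get("Entities", [])
        if ent.get("Category") == "MEDICAL_CONDITION"
    ]
    return pd.DataFrame(rows, columns=["ID", "MEDICAL_CONDITION", "Score"])


# Hand-written stand-in for a detect_entities_v2 response (not real output).
sample_response = {
    "Entities": [
        {"Text": "fever", "Category": "MEDICAL_CONDITION", "Score": 0.97},
        {"Text": "aspirin", "Category": "MEDICATION", "Score": 0.99},
        {"Text": "wheezing", "Category": "MEDICAL_CONDITION", "Score": 0.91},
    ]
}

df_one_patient = medical_conditions_to_frame(sample_response, patient_id=7)
print(df_one_patient)
```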


In [None]:
## Function to process multiple records
def extractMCbatch(transcriptionList, patientIDList):
    if len(transcriptionList) != len(patientIDList):
        raise ValueError("transcriptionList and patientIDList must have the same length")

    ## In this for loop, collect the extracted medical conditions from each item, together with the corresponding ID
    frames = []
    for item, patient_id in zip(transcriptionList, patientIDList):
        # print("processing patient_id:", patient_id)
        df_ind = extractMC_v2(item)
        df_ind['ID'] = patient_id
        frames.append(df_ind)

    # nothing to combine for empty input
    if not frames:
        return pd.DataFrame()

    # DataFrame.append was removed in pandas 2.0, so concatenate once instead
    df_final = pd.concat(frames)

    # remove the duplicated entries if any
    df_final = df_final.sort_values(by=['ID', 'MEDICAL_CONDITION']).drop_duplicates(['ID', 'MEDICAL_CONDITION'], keep='last')

    return df_final

### Test function *extractMCbatch* and visualize it

In [None]:
df_extracted_cp=extractMCbatch(df_list_cp,patient_ids_cp)
df_extracted_or=extractMCbatch(df_list_or,patient_ids_or)
df_extracted_ra=extractMCbatch(df_list_ra,patient_ids_ra)
df_extracted_gm=extractMCbatch(df_list_gm,patient_ids_gm)
df_extracted_ga=extractMCbatch(df_list_ga,patient_ids_ga)
df_extracted_nu=extractMCbatch(df_list_nu,patient_ids_nu)

## plot the results
topN=40 ## the number for top conditions
threshold_score=0.8 ##the threshold of confidence score
df_cp_plot=mc_barplot(df_extracted_cp, threshold_score,topN)

---
## Combine the dataset

There are 6 datasets, one per speciality. These 6 datasets need to be consolidated for model training.

### Gaps:
The dataset is in long format, meaning that each row represents a single medical condition for one patient. If a patient *John* has 10 medical conditions, there will be 10 rows. Thus, the number of rows varies from patient to patient.
### Solutions:
To make the dataset easier for an ML algorithm to handle, it needs to be converted into wide format, with one row per patient. Instead of keeping all the extracted medical conditions, I have selected the top 40 medical conditions from each category as input features. Note that `40` is an arbitrary number chosen after a few tests; with 20, model accuracy was close to 0.
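
As a small illustration of the long-to-wide conversion, the sketch below pivots a toy long-format frame so that each patient becomes one row and each medical condition becomes one column. It is not the `df_mc_generator` helper used later; the `Score` column name, the choice of storing the confidence score (rather than a 0/1 flag), and filling absent conditions with 0 are assumptions made for this example.

```python
import pandas as pd

# Toy long-format data: one row per (patient, medical condition).
long_df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 3, 3],
        "MEDICAL_CONDITION": ["fever", "wheezing", "fever", "injury", "fever"],
        "Score": [0.95, 0.88, 0.91, 0.99, 0.85],
    }
)

# One row per patient, one column per condition; absent conditions become 0.
wide_df = long_df.pivot_table(
    index="ID",
    columns="MEDICAL_CONDITION",
    values="Score",
    aggfunc="max",
    fill_value=0.0,
).reset_index()
wide_df.columns.name = None  # drop the leftover "MEDICAL_CONDITION" axis label

print(wide_df)
```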

In the following cell, function *`retrieve_mcList(df, nFeature=40,threshold=0.8)`* helps to retrieve the features from each subset with `nFeature`(default=40)  as specified number of features and `threshold`(default=0.8) as the confidence threshold. Outputs from *`retrieve_mcList()`*:

+ top medical conditions list,
+ cleaned dataframe (converted to lower case, merged, *etc*.).
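
The kind of selection that *`retrieve_mcList()`* performs can be approximated as below. This is a hedged sketch, not the helper's actual code: the function name `top_conditions`, the `Score` column name, and the use of `>=` for the threshold comparison are assumptions.

```python
import pandas as pd


def top_conditions(df_long, n_feature=40, threshold=0.8):
    """Return the most frequent conditions above a confidence threshold."""
    cleaned = df_long[df_long["Score"] >= threshold].copy()
    cleaned["MEDICAL_CONDITION"] = cleaned["MEDICAL_CONDITION"].str.lower().str.strip()
    # After lower-casing, the same condition may appear twice for one patient.
    cleaned = cleaned.drop_duplicates(["ID", "MEDICAL_CONDITION"])
    top = cleaned["MEDICAL_CONDITION"].value_counts().head(n_feature).index.tolist()
    return top, cleaned[cleaned["MEDICAL_CONDITION"].isin(top)]


toy = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 3, 3, 4],
        "MEDICAL_CONDITION": [
            "Fever", "fever", "cough", "fever", "wheezing", "wheezing", "injury", "fever",
        ],
        "Score": [0.90, 0.95, 0.50, 0.85, 0.90, 0.81, 0.99, 0.90],
    }
)

mc_list, df_clean = top_conditions(toy, n_feature=2, threshold=0.8)
print(mc_list)
```

Counting each condition at most once per patient keeps a single verbose transcript from dominating the feature list.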

`Target column`: as this is a classification problem, a new column called `Label` is created, holding the number assigned to the speciality.

In [None]:
# Extract relevant records
mcListcp, df_grpcp=retrieve_mcList(df_extracted_cp, 40)
mcListor, df_grpor=retrieve_mcList(df_extracted_or, 40)
mcListra, df_grpra=retrieve_mcList(df_extracted_ra, 40)
mcListgm, df_grpgm=retrieve_mcList(df_extracted_gm, 40)
mcListga, df_grpga=retrieve_mcList(df_extracted_ga, 40)
mcListnu, df_grpnu=retrieve_mcList(df_extracted_nu, 40)

In [None]:
df_grpcp['Label']=1 # 'Cardiovascular / Pulmonary'
df_grpor['Label']=2 # 'Orthopedic'
df_grpra['Label']=3 # 'Radiology'
df_grpgm['Label']=4 # 'General Medicine'
df_grpga['Label']=5 # 'Gastroenterology'
df_grpnu['Label']=6 # 'Neurology'

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df_fulllist = pd.concat([df_grpcp, df_grpor, df_grpra, df_grpgm, df_grpga, df_grpnu])
fullmcList=list(set(mcListcp+mcListor+mcListra+mcListgm+mcListga+mcListnu))
df_combined_full=df_mc_generator(df_fulllist, fullmcList ,colname_other=['ID',"Label"] )

---
## Save the processed file

In [None]:
df_combined_full.to_csv("./data/processed_combined_extract.csv",index=False)

# Upload to s3 for future use - customize the bucket name given here(hhtestdata) if you try out the code
fileUploadPath = os.path.join("./data", "processed_combined_extract.csv")
boto3.Session().resource("s3").Bucket('hhtestdata').Object("processed_combined_extract.csv").upload_file(
    fileUploadPath
)