# Hugging Face on Amazon SageMaker

Hugging Face Deep Learning Containers (DLCs) make it easier than ever to train Transformer models in SageMaker. Here is why you should consider using Hugging Face DLCs to train and deploy your next machine learning model:
* One command is all you need
* Accelerate machine learning from science to production
* Built-in performance

# Problem description

**Diagnostics prediction** aims to automatically predict the diagnostics a patient needs, given their anamnesis.
The anamnesis is a raw text file containing the doctor's notes about the patient, including the patient's age, complaints described in free form, medical history and so on. It is unstructured: sections present in one patient's anamnesis may be absent from another's.

The target label is the name of the required diagnostic procedure.

The solution can help a doctor choose the best order of diagnostics. The patient saves time and money, and the doctor can serve patients more efficiently by skipping unnecessary diagnostics. Moreover, in difficult cases, the algorithm may help a doctor reach a diagnosis faster, which can be extremely valuable, up to saving lives.

In theory, regularities found by the algorithm may also give medical researchers ideas for treating some diseases, based on non-obvious connections with certain symptoms.

# Installation

In [1]:
# make sure the Amazon SageMaker SDK is updated
!pip install "sagemaker" --upgrade
!pip install transformers
!pip install datasets[s3]

Collecting sagemaker
  Downloading sagemaker-2.91.1.tar.gz (534 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.91.1-py2.py3-none-any.whl size=742450 sha256=ce79b96c744859cc6dd861dba821b99dfc1f62f96a2634ee4da973ee6da7b56f
  Stored in directory: /home/ec2-user/.cache/pip/wheels/49/c1/2e/5d7bcd98cc65e1db77f617e66ec2577082381ba5570282474f
Successfully built sagemaker
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.86.2
    Uninstalling sagemaker-2.86.2:
      Successfully uninstalled sagemaker-2.86.2
Successfully installed sagemaker-2.91.1
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)

In [2]:
# import a few libraries that will be needed

from sklearn.model_selection import train_test_split
import torch

import sagemaker
from sagemaker.huggingface import HuggingFace
from datasets import load_dataset
import boto3

import pandas as pd
import os, time, tarfile
import io
import re
import json

import warnings
warnings.filterwarnings("ignore")

# Permissions

You need access to an IAM role with the permissions required for SageMaker.

In [3]:
# gets role for executing training job and set a few variables
sagemaker_session = sagemaker.Session()
bucket = 'medical-transcription-repo'
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

sagemaker bucket: medical-transcription-repo
sagemaker session region: us-east-1


# Dataset
This dataset contains sample medical transcriptions for various medical specialties.

In [4]:
bucket = 'medical-transcription-repo'
key = 'dataset/mtsamples.csv'
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

| | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|
| 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
| 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
| 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
| 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
| 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |


In [5]:
df.isna().sum()

description             0
medical_specialty       0
sample_name             0
transcription          33
keywords             1068
dtype: int64

# Preprocessing

After preprocessing, the dataset is uploaded to our S3 bucket for use in the training job.

In [6]:
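# fill missing transcriptions with the description; assign back instead of relying on inplace on a column
df['transcription'] = df['transcription'].fillna(df['description'])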

In [7]:
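# strip a single leading space; regex=True is needed so that \A is treated as an anchor
df['medical_specialty'] = df['medical_specialty'].str.replace(r'\A ', '', regex=True)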

In [8]:
feature = 'medical_specialty'
df[feature] = df[feature].str.strip()
# group individual specialties into broader supergroups
new_feature = 'medical_specialty_supergroup'
new_class = 'Internal Medicine'
df[new_feature] = df[feature].copy()
# Grouping all internal medicine specialties
df[new_feature].mask(df[new_feature] == 'Hospice - Palliative Care', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Pain Management', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Sleep Medicine', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Endocrinology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Gastroenterology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Hematology - Oncology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Nephrology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Rheumatology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Cardiovascular / Pulmonary', new_class, inplace=True)
# General medicine is also known as Internal Medicine
df[new_feature].mask(df[new_feature] == 'General Medicine', new_class, inplace=True)

new_class = 'Surgery'
# Grouping all surgery specialties
df[new_feature].mask(df[new_feature] == 'Surgery', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Cosmetic / Plastic Surgery', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Neurosurgery', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'ENT - Otolaryngology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Obstetrics / Gynecology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Urology', new_class, inplace=True)

new_class = 'Medical Records'
# Grouping all documents
df[new_feature].mask(df[new_feature] == 'Consult - History and Phy.', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Discharge Summary', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Emergency Room Reports', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'IME-QME-Work Comp etc.', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Letters', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Office Notes', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'SOAP / Chart / Progress Notes', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Radiology', new_class, inplace=True)

new_class = 'Other' 
# Grouping less popular specialties and specialties with the least data points
df[new_feature].mask(df[new_feature] == 'Diets and Nutritions', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Bariatrics', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Dentistry', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Ophthalmology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Pediatrics - Neonatal', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Dermatology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Allergy / Immunology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Speech - Language', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Psychiatry / Psychology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Autopsy', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Lab Medicine - Pathology', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Physical Medicine - Rehab', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Orthopedic', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Chiropractic', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Podiatry', new_class, inplace=True)
df[new_feature].mask(df[new_feature] == 'Neurology', new_class, inplace=True)
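
The repeated `mask` calls above amount to a lookup table from specialty to supergroup. As a minimal sketch (not the notebook's own code), the same grouping can be expressed as one dictionary. The names `SUPERGROUPS`, `SPECIALTY_TO_SUPERGROUP` and `to_supergroup` are made up for this illustration; specialties that are not listed keep their original name, as in the cell above.

```python
# Supergroup -> member specialties, copied from the mask() calls above.
SUPERGROUPS = {
    "Internal Medicine": [
        "Hospice - Palliative Care", "Pain Management", "Sleep Medicine",
        "Endocrinology", "Gastroenterology", "Hematology - Oncology",
        "Nephrology", "Rheumatology", "Cardiovascular / Pulmonary",
        "General Medicine",
    ],
    "Surgery": [
        "Surgery", "Cosmetic / Plastic Surgery", "Neurosurgery",
        "ENT - Otolaryngology", "Obstetrics / Gynecology", "Urology",
    ],
    "Medical Records": [
        "Consult - History and Phy.", "Discharge Summary",
        "Emergency Room Reports", "IME-QME-Work Comp etc.", "Letters",
        "Office Notes", "SOAP / Chart / Progress Notes", "Radiology",
    ],
    "Other": [
        "Diets and Nutritions", "Bariatrics", "Dentistry", "Ophthalmology",
        "Pediatrics - Neonatal", "Dermatology", "Allergy / Immunology",
        "Speech - Language", "Psychiatry / Psychology", "Autopsy",
        "Lab Medicine - Pathology", "Physical Medicine - Rehab", "Orthopedic",
        "Chiropractic", "Podiatry", "Neurology",
    ],
}

# Invert to specialty -> supergroup for direct lookup.
SPECIALTY_TO_SUPERGROUP = {
    specialty: group
    for group, members in SUPERGROUPS.items()
    for specialty in members
}


def to_supergroup(specialty):
    """Return the supergroup, or the specialty itself when it is not listed."""
    return SPECIALTY_TO_SUPERGROUP.get(specialty, specialty)


print(to_supergroup("Nephrology"))  # Internal Medicine
print(to_supergroup("Unlisted"))    # Unlisted
```

With a DataFrame, `df[feature].map(to_supergroup)` would apply the same lookup to every row.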

In [9]:
df.head()

| | description | medical_specialty | sample_name | transcription | keywords | medical_specialty_supergroup |
|---|---|---|---|---|---|---|
| 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... | Other |
| 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... | Other |
| 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... | Other |
| 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... | Internal Medicine |
| 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... | Internal Medicine |


In [10]:
df['target'] = df['medical_specialty'].copy()
data_categories  = df.groupby(df['target'])
i = 1
print('===========Original Categories =======================')
for catName,dataCategory in data_categories:
    print('Cat:'+str(i)+' '+catName + ' : '+ str(len(dataCategory)) )
    i = i+1
print('==================================')

Cat:1 Allergy / Immunology : 7
Cat:2 Autopsy : 8
Cat:3 Bariatrics : 18
Cat:4 Cardiovascular / Pulmonary : 372
Cat:5 Chiropractic : 14
Cat:6 Consult - History and Phy. : 516
Cat:7 Cosmetic / Plastic Surgery : 27
Cat:8 Dentistry : 27
Cat:9 Dermatology : 29
Cat:10 Diets and Nutritions : 10
Cat:11 Discharge Summary : 108
Cat:12 ENT - Otolaryngology : 98
Cat:13 Emergency Room Reports : 75
Cat:14 Endocrinology : 19
Cat:15 Gastroenterology : 230
Cat:16 General Medicine : 259
Cat:17 Hematology - Oncology : 90
Cat:18 Hospice - Palliative Care : 6
Cat:19 IME-QME-Work Comp etc. : 16
Cat:20 Lab Medicine - Pathology : 8
Cat:21 Letters : 23
Cat:22 Nephrology : 81
Cat:23 Neurology : 223
Cat:24 Neurosurgery : 94
Cat:25 Obstetrics / Gynecology : 160
Cat:26 Office Notes : 51
Cat:27 Ophthalmology : 83
Cat:28 Orthopedic : 355
Cat:29 Pain Management : 62
Cat:30 Pediatrics - Neonatal : 70
Cat:31 Physical Medicine - Rehab : 21
Cat:32 Podiatry : 47
Cat:33 Psychiatry / Psychology : 53
Cat:34 Radiology : 273
Ca

In [11]:
counts = df['target'].value_counts()
others = [k for k,v in counts.items() if v<50]
for each_spec in others:
    df.loc[df['target']==each_spec,'target']=' others' 
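
The cell above folds every specialty with fewer than 50 samples into a single `' others'` class. The same idea on a plain list, as a small sketch: `merge_rare_labels` is a hypothetical helper, and the toy data uses a lower threshold so the effect is visible. The leading space in `" others"` is kept as in the notebook.

```python
from collections import Counter


def merge_rare_labels(labels, min_count=50, other_label=" others"):
    """Replace every label that occurs fewer than min_count times with other_label."""
    counts = Counter(labels)
    rare = {label for label, n in counts.items() if n < min_count}
    return [other_label if label in rare else label for label in labels]


# Toy data: two frequent specialties and two rare ones.
labels = ["Surgery"] * 5 + ["Radiology"] * 3 + ["Autopsy"] * 1 + ["Letters"] * 2
merged = merge_rare_labels(labels, min_count=3)
print(Counter(merged))
```

Here `Autopsy` and `Letters` fall below the threshold and end up together in `" others"`, while `Surgery` and `Radiology` are unchanged.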

In [12]:
final_data_categories = df.groupby(df['target'])
i=1
print('============Reduced Categories ======================')
for catName,dataCategory in final_data_categories:
    print('Cat:'+str(i)+' '+catName + ' : '+ str(len(dataCategory)) )
    i = i+1

print('============ Reduced Categories ======================')

Cat:1  others : 319
Cat:2 Cardiovascular / Pulmonary : 372
Cat:3 Consult - History and Phy. : 516
Cat:4 Discharge Summary : 108
Cat:5 ENT - Otolaryngology : 98
Cat:6 Emergency Room Reports : 75
Cat:7 Gastroenterology : 230
Cat:8 General Medicine : 259
Cat:9 Hematology - Oncology : 90
Cat:10 Nephrology : 81
Cat:11 Neurology : 223
Cat:12 Neurosurgery : 94
Cat:13 Obstetrics / Gynecology : 160
Cat:14 Office Notes : 51
Cat:15 Ophthalmology : 83
Cat:16 Orthopedic : 355
Cat:17 Pain Management : 62
Cat:18 Pediatrics - Neonatal : 70
Cat:19 Psychiatry / Psychology : 53
Cat:20 Radiology : 273
Cat:21 SOAP / Chart / Progress Notes : 166
Cat:22 Surgery : 1103
Cat:23 Urology : 158


In [13]:
df.to_csv('pre-process.csv', index=False)

In [14]:
s3 = boto3.resource('s3')
s3.meta.client.upload_file('pre-process.csv', bucket, 'preprocess/pre-processed/pre-process.csv')

In [15]:
feat = ['target', 'transcription']
df = df[feat]
df.columns = ['target', 'text']
df.head()

| | target | text |
|---|---|---|
| 0 | others | SUBJECTIVE:, This 23-year-old white female pr... |
| 1 | others | PAST MEDICAL HISTORY:, He has difficulty climb... |
| 2 | others | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... |
| 3 | Cardiovascular / Pulmonary | 2-D M-MODE: , ,1. Left atrial enlargement wit... |
| 4 | Cardiovascular / Pulmonary | 1. The left ventricular cavity size and wall ... |


In [16]:
def pre_process(df):
    df = df.copy()  # work on a copy so the caller's frame (a column slice) is not modified
    # raw strings keep the backslash escapes intact
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    df['text'] = df['text'].str.lower()
    # despite its name, REPLACE_BY_SPACE_RE matches are deleted here; pass ' ' instead of '' to substitute a space
    df['text'] = df['text'].apply(lambda x: REPLACE_BY_SPACE_RE.sub('', x))
    # delete every remaining character that is not a digit, lowercase letter, space, '#', '+' or '_'
    df['text'] = df['text'].apply(lambda x: BAD_SYMBOLS_RE.sub('', x))
    df['text'] = df['text'].str.strip()

    return df

df = pre_process(df)
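
To see what this cleaning does to a single string, here is a standalone sketch using the same two regular expressions. `clean_text` and the sample string are made up for illustration. The cell above substitutes the empty string for punctuation such as commas, which glues neighbouring tokens together; the optional `replacement` argument shows what substituting a space would give instead.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")


def clean_text(text, replacement=""):
    """Lowercase, handle separator punctuation, then drop all other symbols."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(replacement, text)
    text = BAD_SYMBOLS_RE.sub("", text)
    return text.strip()


sample = "ANATOMICAL SUMMARY:,1. Sharp force wound of neck"  # made-up sample
print(clean_text(sample))       # anatomical summary1 sharp force wound of neck
print(clean_text(sample, " "))  # anatomical summary 1 sharp force wound of neck
```

Note that hyphens are also removed by the second expression, so a phrase like `23-year-old` becomes `23yearold` in either variant.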

In [17]:
df['target'] = df['target'].astype('category').cat.codes

## Estimating class weights
With imbalanced classes, each batch should contain enough examples to cover all classes, including the rare ones. Otherwise, training may degenerate.

Here we use class weights to handle the imbalanced dataset.

In [18]:
from sklearn.utils import class_weight
import numpy as np
class_weights = dict(enumerate(class_weight.compute_class_weight('balanced',
                                                         classes=np.unique(df['target']),
                                                         y=df['target'])))
class_weights

{0: 0.6813411476080141,
 1: 0.5842683496961197,
 2: 0.421216717222784,
 3: 2.012479871175523,
 4: 2.217834960070985,
 5: 2.8979710144927537,
 6: 0.9449905482041588,
 7: 0.8391807957025348,
 8: 2.414975845410628,
 9: 2.6833064949006977,
 10: 0.9746539286410606,
 11: 2.3122109158186865,
 12: 1.3584239130434783,
 13: 4.261722080136402,
 14: 2.6186485070717653,
 15: 0.6122473974280466,
 16: 3.505610098176718,
 17: 3.104968944099379,
 18: 4.10090237899918,
 19: 0.7961458831024049,
 20: 1.3093242535358827,
 21: 0.19705151957113012,
 22: 1.375619152449092}
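
For reference, scikit-learn's `'balanced'` mode computes each weight as `n_samples / (n_classes * count_of_class)`, so rare classes get weights above 1 and frequent ones below 1. The sketch below re-implements that formula with the standard library only; `balanced_class_weights` is a hypothetical helper, not part of scikit-learn. As a sanity check it also recomputes one value from this notebook, assuming the 4,999 rows and 23 classes given by the per-category counts printed earlier.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])"""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}


# Toy labels: class 0 is frequent, class 2 is rare.
toy = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy)
print(weights)

# Surgery has 1103 of the 4999 rows, spread over 23 classes.
surgery_weight = 4999 / (23 * 1103)
print(round(surgery_weight, 5))  # 0.19705
```

The recomputed Surgery weight agrees with entry `21` in the output above.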

In [19]:
train, test = train_test_split(df, 
                               test_size=0.2,
                               #stratify=df['target'],
                               random_state=42)

train.reset_index(inplace = True, drop = True)
test.reset_index(inplace = True, drop = True)

train.to_csv('med-train.csv', index=False)
test.to_csv('med-test.csv', index=False)
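
The `stratify=df['target']` argument is commented out in the cell above, so the split is purely random. With imbalanced classes, a stratified split keeps each class's share roughly the same in the train and test parts. The following standard-library sketch illustrates the idea only; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical name. In scikit-learn, passing `stratify=df['target']` to `train_test_split` requests this behaviour.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test share separately per label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # at least one test example per label
        n_test = max(1, round(len(idxs) * test_size))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 50 examples of class 0 and 10 of class 1.
labels = [0] * 50 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == 1 for i in test_idx))
```

Each class contributes 20% of its rows to the test part, so the rare class is guaranteed to appear in both parts.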

In [20]:
#train_dataset = load_dataset('csv', data_files='med-train.csv', delimiter=",", split="train")
#test_dataset = load_dataset('csv', data_files='med-test.csv',  delimiter=",", split="test")
dataset = load_dataset('csv', data_files={
    "train": 'med-train.csv',
    "test": 'med-test.csv'
})

Using custom data configuration default-3807c6c770e47b96


Downloading and preparing dataset csv/default to /home/ec2-user/.cache/huggingface/datasets/csv/default-3807c6c770e47b96/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/csv/default-3807c6c770e47b96/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

## Tokenization

In [21]:
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'],padding=True, truncation=True)

# tokenize dataset
train_dataset = dataset['train'].map(tokenize, batched=True)
test_dataset = dataset['test'].map(tokenize, batched=True)

# set format for pytorch
train_dataset =  train_dataset.rename_column("target", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("target", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
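
In the `tokenize` helper above, `truncation=True` cuts each text to the model's maximum length and `padding=True` pads every sequence to the longest one in the batch handed to the tokenizer. The toy function below mimics that on lists of made-up token ids so the resulting `input_ids` and `attention_mask` shapes are easy to see. It is a simplified illustration, not the tokenizer's implementation: real tokenizers also keep special tokens such as the final separator when truncating, whereas this sketch just slices.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 102]]  # made-up token ids
out = pad_and_truncate(batch, max_length=4)
print(out["input_ids"])       # [[101, 7592, 102, 0], [101, 2023, 2003, 1037]]
print(out["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```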

## Uploading data to the S3 bucket

In [22]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()  
prefix = 'preprocess'

# save train_dataset to s3
training_input_path = f's3://{bucket}/{prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{bucket}/{prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)

# Fine-tuning & starting SageMaker Training Job

In [23]:
!pygmentize ./scripts/train.py

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument(

## Creating an Estimator and starting a training job

This estimator runs a Hugging Face training script in a SageMaker training environment.

The estimator initiates the SageMaker-managed Hugging Face environment by using the pre-built Hugging Face Docker container and runs the Hugging Face training script that the user provides through the `entry_point` argument.

In [24]:
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig
# initialize the SageMaker Training Compiler
compiler_config=TrainingCompilerConfig()

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 10,
                 'train_batch_size': 64,
                 'eval_batch_size': 32,
                 'learning_rate': 3e-5, 
                 'model_name':'distilbert-base-uncased'
                 }
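
SageMaker hands these hyperparameters to the entry-point script as command-line arguments and exposes the input channels and model directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST` and `SM_MODEL_DIR`. The notebook's `train.py` is only shown in part, so the following is a hypothetical sketch of how such a script could read them, not the actual script. The argument names mirror the keys of the `hyperparameters` dict; the default values and local fallback paths are assumptions.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser()
    # hyperparameters from the estimator's `hyperparameters` dict (defaults are illustrative)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--model_name", type=str)
    # locations SageMaker exposes via environment variables, with local fallbacks
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "./train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "./test"))
    return parser


# Simulate the command line SageMaker would build from the dict above.
args = build_parser().parse_args([
    "--epochs", "10", "--train_batch_size", "64", "--eval_batch_size", "32",
    "--learning_rate", "3e-5", "--model_name", "distilbert-base-uncased",
])
print(args.epochs, args.learning_rate, args.model_name)
```

The channel variables are derived from the keys passed to `fit()`, so a `'train'` channel becomes `SM_CHANNEL_TRAIN` inside the container.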

In [25]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.11.0',
                            pytorch_version='1.9.0',
                            py_version='py38',
                            output_path='s3://{}/models'.format(bucket),
                            hyperparameters = hyperparameters,
                            compiler_config = compiler_config
                                   )

In [26]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

2022-05-23 13:59:34 Starting - Starting the training job...
2022-05-23 14:00:00 Starting - Preparing the instances for trainingProfilerReport-1653314373: InProgress
.........
2022-05-23 14:01:32 Downloading - Downloading input data
2022-05-23 14:01:32 Training - Downloading the training image.......................................
2022-05-23 14:08:00 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-05-23 14:08:03,329 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-05-23 14:08:03,358 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-05-23 14:08:03,366 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-05-23 14:08:03,978 sagemaker-training-toolkit INFO     Invoking user script

# Deploying the endpoint

To deploy our endpoint, we call deploy() on our HuggingFace estimator object, passing in our desired number of instances and instance type.

For inference, you can deploy either your own trained Hugging Face model or one of the pretrained Hugging Face models with SageMaker. In both cases, a single line of code is enough.

In [27]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

--------------!

In [28]:
sentiment_input= {"inputs":'HISTORY OF PRESENT ILLNESS:,  The patient is well known to me for a history of iron-deficiency anemia due to chronic blood loss from colitis.  We corrected her hematocrit last year with intravenous (IV) iron.  Ultimately, she had a total proctocolectomy done on 03/14/2007 to treat her colitis.  Her course has been very complicated since then with needing multiple surgeries for removal of hematoma.  This is partly because she was on anticoagulation for a right arm deep venous thrombosis (DVT) she had early this year, complicated by septic phlebitis.,Chart was reviewed, and I will not reiterate her complex history.,I am asked to see the patient again because of concerns for coagulopathy.,She had surgery again last month to evacuate a pelvic hematoma, and was found to have vancomycin resistant enterococcus, for which she is on multiple antibiotics and followed by infectious disease now.,She is on total parenteral nutrition (TPN) as well.,LABORATORY DATA:,  Labs today showed a white blood count of 7.9, hemoglobin 11.0, hematocrit 32.8, and platelets 1,121,000.  MCV is 89.  Her platelets have been elevated for at least the past week, with counts initially at the 600,000 to 700,000 range and in the last couple of day rising above 1,000,000.  Her hematocrit has been essentially stable for the past month or so.  White blood count has improved.,PT has been markedly elevated and today is 44.9 with an INR of 5.0.  This is despite stopping Coumadin on 05/31/2007, and with administration of vitamin K via the TPN, as well as additional doses IV.'}
predictor.predict(sentiment_input)

[{'label': 'LABEL_8', 'score': 0.5507590770721436}]
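
The endpoint returns a generic `LABEL_<n>` string, where `<n>` is the integer code assigned to the target earlier. A minimal sketch for turning it back into a specialty name follows. It assumes the codes follow the alphabetical order of the reduced categories printed earlier (that is how pandas orders string categories); `CATEGORIES` and `decode_prediction` are names introduced here for illustration.

```python
# Reduced categories in the order printed earlier (list index = encoded label).
CATEGORIES = [
    " others", "Cardiovascular / Pulmonary", "Consult - History and Phy.",
    "Discharge Summary", "ENT - Otolaryngology", "Emergency Room Reports",
    "Gastroenterology", "General Medicine", "Hematology - Oncology",
    "Nephrology", "Neurology", "Neurosurgery", "Obstetrics / Gynecology",
    "Office Notes", "Ophthalmology", "Orthopedic", "Pain Management",
    "Pediatrics - Neonatal", "Psychiatry / Psychology", "Radiology",
    "SOAP / Chart / Progress Notes", "Surgery", "Urology",
]


def decode_prediction(prediction):
    """Map {'label': 'LABEL_8', 'score': ...} to a readable specialty name."""
    index = int(prediction["label"].split("_")[1])
    return {"specialty": CATEGORIES[index].strip(), "score": prediction["score"]}


response = [{"label": "LABEL_8", "score": 0.5507590770721436}]
print(decode_prediction(response[0])["specialty"])  # Hematology - Oncology
```

Under that assumption, the prediction above corresponds to Hematology - Oncology, which fits the anemia-related note that was sent to the endpoint.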

# Evaluation

The model's predictions on the evaluation set are uploaded to the S3 bucket for use in an Amazon QuickSight dashboard.

In [29]:
eval_df = pd.read_csv('med-test.csv')
#eval_df = pre_process(eval_df)
eval_df['pred'] = 0
eval_df['score'] = 0.0
max_length = 512  # keep only the first 512 characters of each text
for i in range(len(eval_df)):
    sentiment_input = {"inputs": eval_df['text'].iloc[i][:max_length]}
    # call the endpoint once per row and reuse the result for both columns
    result = predictor.predict(sentiment_input)[0]
    # .loc avoids chained assignment, which may silently write to a copy
    eval_df.loc[i, 'pred'] = int(result['label'].split('_')[1])
    eval_df.loc[i, 'score'] = float(result['score'])

In [30]:
eval_df.head()

| | target | text | pred | score |
|---|---|---|---|---|
| 0 | 22 | operative note the patient was taken to the o... | 22 | 0.42861 |
| 1 | 13 | vital signs reveal a blood pressure of tempe... | 13 | 0.724876 |
| 2 | 10 | history neurologic consultation was requested... | 10 | 0.30007 |
| 3 | 0 | anatomical summary1 sharp force wound of neck ... | 0 | 0.629299 |
| 4 | 1 | indications for procedure the patient has pres... | 1 | 0.542188 |
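
Once `target` and `pred` sit side by side, summary metrics can be computed directly from the two columns. The sketch below uses plain lists and the standard library. The `targets` and `preds` values are made up for illustration and are not results from this model; `accuracy` and `per_class_recall` are hypothetical helpers.

```python
from collections import Counter


def accuracy(targets, preds):
    """Share of rows where the prediction equals the target."""
    return sum(t == p for t, p in zip(targets, preds)) / len(targets)


def per_class_recall(targets, preds):
    """For each true class, the share of its rows that were predicted correctly."""
    support = Counter(targets)
    hits = Counter(t for t, p in zip(targets, preds) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


# Made-up example: one of three class-21 rows is misclassified as class 2.
targets = [22, 13, 10, 0, 1, 21, 21, 21]
preds = [22, 13, 10, 0, 1, 21, 21, 2]
print(accuracy(targets, preds))              # 0.875
print(per_class_recall(targets, preds)[21])
```

With the evaluation frame, the inputs would be `eval_df['target'].tolist()` and `eval_df['pred'].tolist()`. Per-class recall is worth checking alongside accuracy because the classes are imbalanced.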


In [31]:
eval_df.to_csv("model-performance.csv", index=False)
s3 = boto3.resource('s3')
s3.meta.client.upload_file('model-performance.csv', bucket, 'preprocess/pre-processed/eval/model-performance.csv')