## Error Analysis for Hyperlipidemia Tag Predictions (BERT Augmented)

In [1]:
import os
import string
import random
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt

In [2]:
# scikit-learn metrics used for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score


In [3]:
import numpy as np

### Test LABELS for TOKENS in TEST Dataset against BERT Outputs

The BERT classifier has returned results for the tokens passed in the 'test.tsv' file. The returned values are class probabilities, which need to be converted into class labels by taking the class with the highest probability. Each predicted label is then compared against the actual label, which was extracted from the XML annotation files using IO coding. This is a brute-force, manual way of verifying the validity of the predictions.
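
The conversion from probabilities to labels can be sketched as follows. This is a minimal illustration on made-up probabilities, not the real BERT output; the column names and the label mapping mirror the ones used for the hyperlipidemia indicator later in this notebook, and `class_to_label` is a name introduced here for illustration.

```python
import pandas as pd

# Toy probabilities standing in for the BERT output (values are illustrative only).
probs = pd.DataFrame(
    [[0.97, 0.01, 0.01, 0.01],
     [0.10, 0.70, 0.10, 0.10],
     [0.05, 0.05, 0.10, 0.80]],
    columns=["Class0", "Class1", "Class2", "Class3"],
)

# Mapping from output column to tag value (same mapping as HI_set_labels).
class_to_label = {
    "Class0": "Other",
    "Class1": "high LDL",
    "Class2": "high chol.",
    "Class3": "mention",
}

# idxmax(axis=1) returns, for each row, the name of the column with the largest value.
pred_class = probs.idxmax(axis=1)
pred_label = pred_class.map(class_to_label)
print(pred_label.tolist())
```

Using a dictionary with `Series.map` keeps the mapping in one place. A column name missing from the dictionary would surface as `NaN` rather than silently falling back to 'Other'.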


Read in results from BERT predictions to the above dataset
The above dataset is derived by applying IO coding in the same way as for the training set, so its labels reflect the annotation process. Now we have to read in the predictions from BERT, which are a set of class probabilities across all four classes ('Other' plus three tag values), and merge them with the above dataset for comparison and error analysis.
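
Because the prediction file has no sentence identifier, the merge is done by row position, which is only valid if both files have the same number of rows in the same order. A minimal sketch of a guarded merge is shown below; `combine_by_position` is a hypothetical helper, not part of the notebook, and the two small frames are made-up stand-ins for the real files.

```python
import pandas as pd

def combine_by_position(labels_df, preds_df):
    """Join gold labels and predictions row by row after checking that they line up."""
    if len(labels_df) != len(preds_df):
        raise ValueError(
            f"row count mismatch: {len(labels_df)} labels vs {len(preds_df)} predictions"
        )
    # Reset both indexes so concat aligns on position rather than on index values.
    return pd.concat(
        [labels_df.reset_index(drop=True), preds_df.reset_index(drop=True)],
        axis=1,
    )

# Made-up stand-ins for the labelled test file and the prediction file.
gold = pd.DataFrame({"sentence": ["LDL 138", "CODE: FULL"], "label": ["high LDL", "Other"]})
pred = pd.DataFrame({"predLabel": ["high LDL", "Other"]})

combined = combine_by_position(gold, pred)
print(combined)
```

Without the length check, `pd.concat(..., axis=1)` would pad the shorter frame with `NaN` instead of raising, and the mismatch would only show up later in the metrics.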

### Data File Names

* Test files with Labels and Filenames : /data_for_bert_sent/test_files_with_labels/*_testfile.csv
* BERT label mapping : /data_for_bert_sent/test_files_with_labels/*_labelmapping.csv
* BERT evaluation : /data_for_bert_sent/BERT_run_results/*_eval_results.txt


In [4]:
# __file__ is not defined in a notebook, so print the current working directory instead
print(os.getcwd())

C:\Users\Kalyan\Documents\Anu\W266 - NLP\Final Project\lheart-disease-risk-prediction\Code


### Hyperlipidemia Indicator

In [5]:
# read in the test files with labels

HI_test = pd.read_csv("data_for_bert_augmented/test_files_with_labels/hyperlipidemia_ind_testfile.csv")

In [6]:
HI_test.rename( columns={'Unnamed: 0' :'sentenceId'}, inplace=True )

In [7]:
HI_test.head(10)

Unnamed: 0,sentenceId,sentence,label,file
0,0,Record date: 2080-02-18,Other,110-03.xml
1,1,SDU JAR Admission Note,Other,110-03.xml
2,2,Name: \t Yosef Villegas,Other,110-03.xml
3,3,MR:\t8249813,Other,110-03.xml
4,4,DOA: \t2/17/80,Other,110-03.xml
5,5,PCP: Gilbert Perez,Other,110-03.xml
6,6,Attending: YBARRA,Other,110-03.xml
7,7,CODE: FULL,Other,110-03.xml
8,8,HPI: 70 yo M with NIDDM admitted for cath aft...,Other,110-03.xml
9,9,Pt has had increasing CP and SOB on exertion f...,Other,110-03.xml


In [8]:
# read in the test results captured for BERT Augmented Hyperlipidemia model and specify columns as the actual file has no header
bert_HI_results = pd.read_csv("data_for_bert_augmented/bert_augmented_run_results/bert_aug_data_output_data_hyperlipidemia_ind_output_results_test_results.tsv", sep='\t',header=None)
 
bert_HI_results.columns=["Class0", "Class1", "Class2", "Class3"]

In [9]:
bert_HI_results.head()

Unnamed: 0,Class0,Class1,Class2,Class3
0,0.999954,1.7e-05,1e-05,1.9e-05
1,0.999954,1.5e-05,1e-05,2.1e-05
2,0.99995,1.7e-05,8e-06,2.6e-05
3,0.999883,2.3e-05,1.1e-05,8.3e-05
4,0.994832,0.000295,0.000111,0.004762


In [10]:
bert_HI_results['predClass'] = bert_HI_results.idxmax(axis=1)

In [11]:
bert_HI_results.head()

Unnamed: 0,Class0,Class1,Class2,Class3,predClass
0,0.999954,1.7e-05,1e-05,1.9e-05,Class0
1,0.999954,1.5e-05,1e-05,2.1e-05,Class0
2,0.99995,1.7e-05,8e-06,2.6e-05,Class0
3,0.999883,2.3e-05,1.1e-05,8.3e-05,Class0
4,0.994832,0.000295,0.000111,0.004762,Class0


In [12]:
bert_HI_results['predClass'].value_counts()

Class0    24910
Class3      347
Class1       22
Class2        2
Name: predClass, dtype: int64

In [13]:
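def HI_set_labels(classlabel):
    """Map a BERT output column name to its hyperlipidemia indicator label."""
    if classlabel == 'Class1':
        return 'high LDL'
    elif classlabel == 'Class2':
        return 'high chol.'
    elif classlabel == 'Class3':
        return 'mention'
    else:
        # Class0 (and anything unexpected) is treated as the background class
        return 'Other'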

In [14]:
bert_HI_results['predLabel'] = bert_HI_results['predClass'].apply(HI_set_labels)

bert_HI_results.head(10)


Unnamed: 0,Class0,Class1,Class2,Class3,predClass,predLabel
0,0.999954,1.7e-05,1e-05,1.9e-05,Class0,Other
1,0.999954,1.5e-05,1e-05,2.1e-05,Class0,Other
2,0.99995,1.7e-05,8e-06,2.6e-05,Class0,Other
3,0.999883,2.3e-05,1.1e-05,8.3e-05,Class0,Other
4,0.994832,0.000295,0.000111,0.004762,Class0,Other
5,0.999956,1.5e-05,9e-06,2e-05,Class0,Other
6,0.999955,1.4e-05,9e-06,2.1e-05,Class0,Other
7,0.999955,1.7e-05,1e-05,1.9e-05,Class0,Other
8,0.999955,1.3e-05,9e-06,2.3e-05,Class0,Other
9,0.999954,1.5e-05,9e-06,2.2e-05,Class0,Other


In [15]:
# validating the counts by label
bert_HI_results['predLabel'].value_counts()

Other         24910
mention         347
high LDL         22
high chol.        2
Name: predLabel, dtype: int64

In [16]:
HI_combined = pd.concat([HI_test, bert_HI_results], axis=1)

In [17]:
HI_combined.head()

Unnamed: 0,sentenceId,sentence,label,file,Class0,Class1,Class2,Class3,predClass,predLabel
0,0,Record date: 2080-02-18,Other,110-03.xml,0.999954,1.7e-05,1e-05,1.9e-05,Class0,Other
1,1,SDU JAR Admission Note,Other,110-03.xml,0.999954,1.5e-05,1e-05,2.1e-05,Class0,Other
2,2,Name: \t Yosef Villegas,Other,110-03.xml,0.99995,1.7e-05,8e-06,2.6e-05,Class0,Other
3,3,MR:\t8249813,Other,110-03.xml,0.999883,2.3e-05,1.1e-05,8.3e-05,Class0,Other
4,4,DOA: \t2/17/80,Other,110-03.xml,0.994832,0.000295,0.000111,0.004762,Class0,Other


In [18]:
HI_combined[HI_combined['predLabel']!='Other']

Unnamed: 0,sentenceId,sentence,label,file,Class0,Class1,Class2,Class3,predClass,predLabel
18,18,Hyperlipidemia,mention,110-03.xml,0.002504,0.000209,0.000205,0.997083,Class3,mention
104,104,hyperlipidemia,mention,110-04.xml,0.002504,0.000209,0.000205,0.997083,Class3,mention
185,185,His past medical history is significant for hy...,mention,112-02.xml,0.002225,0.000202,0.000216,0.997357,Class3,mention
227,227,His past medical history is significant for hy...,mention,112-03.xml,0.002225,0.000202,0.000216,0.997357,Class3,mention
265,265,"He is a 54-year-old man with obesity, dyslipid...",mention,112-04.xml,0.002099,0.000194,0.000223,0.997484,Class3,mention
310,310,High cholesterol.,mention,114-03.xml,0.004025,0.000316,0.000198,0.995460,Class3,mention
357,357,Mr. Slater is an 83 yo w/ h/o bull...,mention,114-04.xml,0.002201,0.000201,0.000221,0.997377,Class3,mention
376,376,&#183; Hypercholesterolemia,mention,114-04.xml,0.002187,0.000202,0.000209,0.997402,Class3,mention
430,430,: Mr. Slater is an 83 yo w/ h/o bullous pemphi...,mention,114-04.xml,0.002185,0.000200,0.000221,0.997393,Class3,mention
497,497,Hyperlipidemia MAJOR,mention,115-04.xml,0.002561,0.000210,0.000203,0.997025,Class3,mention


In [19]:
HI_test_labels = HI_combined['label']
HI_pred_labels = HI_combined['predLabel']

#print(type(HI_test_labels))

In [20]:
accuracy_score(HI_test_labels, HI_pred_labels)

0.9962422372532732

In [21]:
# classification_report expects (y_true, y_pred), in that order
print(classification_report(HI_test_labels, HI_pred_labels))

              precision    recall  f1-score   support

       Other       1.00      1.00      1.00     24899
    high LDL       0.73      0.48      0.58        33
  high chol.       0.50      0.14      0.22         7
     mention       0.90      0.91      0.90       342

   micro avg       1.00      1.00      1.00     25281
   macro avg       0.78      0.63      0.68     25281
weighted avg       1.00      1.00      1.00     25281



In [22]:
unique_label = np.unique(HI_test_labels)
print(pd.DataFrame(confusion_matrix(HI_test_labels, HI_pred_labels, labels=unique_label), 
                   index=['true:{:}'.format(x) for x in unique_label], 
                   columns=['pred:{:}'.format(x) for x in unique_label]))

                 pred:Other  pred:high LDL  pred:high chol.  pred:mention
true:Other            24858              6                1            34
true:high LDL            16             16                0             1
true:high chol.           5              0                1             1
true:mention             31              0                0           311
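
Per-class precision and recall can be read directly off this confusion matrix: the diagonal holds the correct predictions, column sums give the number of predictions per class, and row sums give the number of true examples per class. The sketch below copies the matrix values from the output above; the variable names are introduced here for illustration.

```python
import numpy as np

labels = ["Other", "high LDL", "high chol.", "mention"]

# Rows are true labels, columns are predicted labels (copied from the matrix above).
cm = np.array([
    [24858, 6, 1, 34],
    [16, 16, 0, 1],
    [5, 0, 1, 1],
    [31, 0, 0, 311],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums: predictions made per class
recall = true_pos / cm.sum(axis=1)     # row sums: true examples per class

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f}")
```

For 'high LDL', for example, recall is 16/33 and precision is 16/22, which shows that the overall accuracy is dominated by the 'Other' class.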


In [23]:
HI_combined[HI_combined['label'] =='high LDL']

Unnamed: 0,sentenceId,sentence,label,file,Class0,Class1,Class2,Class3,predClass,predLabel
1641,1641,181/39/112 WITH TG 149 11/85.,high LDL,131-01.xml,0.999955,1.5e-05,8e-06,2.2e-05,Class0,Other
1642,1642,194/42/123/4.6 WITH TG 147 7/86.,high LDL,131-01.xml,0.999955,1.4e-05,9e-06,2.2e-05,Class0,Other
1643,1643,12/88 188/42/118/4.5.,high LDL,131-01.xml,0.999956,1.5e-05,9e-06,2e-05,Class0,Other
2825,2825,Cholesterol-LDL 05/15/2090 165,high LDL,134-03.xml,0.000151,0.999499,0.000215,0.000135,Class1,high LDL
3759,3759,Please see prior notes for full lipid analysis...,high LDL,138-03.xml,0.002036,0.997454,0.000256,0.000255,Class1,high LDL
3891,3891,LDL 138,high LDL,139-02.xml,0.000178,0.999498,0.00018,0.000145,Class1,high LDL
4543,4543,"I restarted her on lipitor 20 mg po qd, given ...",high LDL,162-04.xml,0.999865,2.8e-05,1.7e-05,9e-05,Class0,Other
4753,4753,and LDL from 09/15/83 was 154 with a total cho...,high LDL,163-03.xml,0.000268,0.999434,0.000148,0.00015,Class1,high LDL
6186,6186,"However, cholesterol now of 186, HDL 46, LDL 105.",high LDL,169-01.xml,0.000139,0.999488,0.000249,0.000124,Class1,high LDL
6994,6994,"11/95 TC 199, HDL 42, LDL 122, TG 171, and sim...",high LDL,193-05.xml,0.999933,1.8e-05,1e-05,3.8e-05,Class0,Other


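The 'high LDL' class is predicted poorly: of the 33 sentences labelled 'high LDL', only 16 are predicted correctly and 16 are predicted as 'Other'. The model does not appear to learn the numeric limits that determine a high LDL level. Several of the misses listed above are lipid panels written as slash-separated numbers (for example, 181/39/112), in which the LDL value is not labelled.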
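
One way to probe whether an explicit numeric limit would help is a small rule-based check run alongside the model. The sketch below is illustrative only: the cutoff of 100 mg/dL is an assumption (it is consistent with 'LDL 105' being labelled 'high LDL' above, but the annotation guideline's value should be substituted), and `extract_ldl`, `is_high_ldl`, and the regular expression are hypothetical helpers that only handle explicitly labelled values.

```python
import re

# Assumed cutoff in mg/dL, for illustration only; use the annotation guideline's value.
LDL_THRESHOLD = 100

# Matches forms such as "LDL 138", "LDL: 138", "LDL was 154".
# The lookahead skips numbers that begin a date, e.g. "LDL 05/15/2090".
LDL_PATTERN = re.compile(r"\bLDL\s*(?:of|was|is|:)?\s*(\d{2,3})(?![/\d])", re.IGNORECASE)

def extract_ldl(sentence):
    """Return the first explicitly labelled LDL value in the sentence, or None."""
    match = LDL_PATTERN.search(sentence)
    return int(match.group(1)) if match else None

def is_high_ldl(sentence, threshold=LDL_THRESHOLD):
    """True if an explicitly labelled LDL value exceeds the threshold."""
    value = extract_ldl(sentence)
    return value is not None and value > threshold

examples = [
    "LDL 138",
    "However, cholesterol now of 186, HDL 46, LDL 105.",
    "181/39/112 WITH TG 149 11/85.",
]
for text in examples:
    print(text, "->", extract_ldl(text), is_high_ldl(text))
```

The slash-separated panel yields no value, so a rule like this would miss the same sentences the model misses unless the panel layout is parsed as well.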

### Hyperlipidemia Time

In [48]:
# read in the test files with labels

HT_test = pd.read_csv("data_for_bert_augmented/test_files_with_labels/hyperlipidemia_tim_testfile.csv")

In [49]:
HT_test.rename( columns={'Unnamed: 0' :'sentenceId'}, inplace=True )

In [50]:
HT_test.head(10)

Unnamed: 0,sentenceId,sentence,label,file
0,0,Record date: 2080-02-18,Other,110-03.xml
1,1,SDU JAR Admission Note,Other,110-03.xml
2,2,Name: \t Yosef Villegas,Other,110-03.xml
3,3,MR:\t8249813,Other,110-03.xml
4,4,DOA: \t2/17/80,Other,110-03.xml
5,5,PCP: Gilbert Perez,Other,110-03.xml
6,6,Attending: YBARRA,Other,110-03.xml
7,7,CODE: FULL,Other,110-03.xml
8,8,HPI: 70 yo M with NIDDM admitted for cath aft...,Other,110-03.xml
9,9,Pt has had increasing CP and SOB on exertion f...,Other,110-03.xml


In [51]:
# read in the test results captured for BERT Augmented Hyperlipidemia Time model and specify columns as the actual file has no header
bert_aug_HT_results = pd.read_csv("data_for_bert_augmented/bert_augmented_run_results/bert_aug_data_output_data_hyperlipidemia_time_output_results_test_results.tsv", sep='\t',header=None)
 
bert_aug_HT_results.columns=["Class0", "Class1", "Class2", "Class3"]

In [52]:
bert_aug_HT_results.head()

Unnamed: 0,Class0,Class1,Class2,Class3
0,0.999958,1.2e-05,1.3e-05,1.8e-05
1,0.999957,1.1e-05,1.3e-05,1.9e-05
2,0.999951,1e-05,1.5e-05,2.4e-05
3,0.999879,1.4e-05,3.9e-05,6.8e-05
4,0.999838,1.9e-05,5e-05,9.2e-05


In [53]:
bert_aug_HT_results['predClass'] = bert_aug_HT_results.idxmax(axis=1)

In [54]:
bert_aug_HT_results.head()

Unnamed: 0,Class0,Class1,Class2,Class3,predClass
0,0.999958,1.2e-05,1.3e-05,1.8e-05,Class0
1,0.999957,1.1e-05,1.3e-05,1.9e-05,Class0
2,0.999951,1e-05,1.5e-05,2.4e-05,Class0
3,0.999879,1.4e-05,3.9e-05,6.8e-05,Class0
4,0.999838,1.9e-05,5e-05,9.2e-05,Class0


In [55]:
bert_aug_HT_results['predClass'].value_counts()

Class0    24883
Class3      202
Class2      123
Class1       73
Name: predClass, dtype: int64

In [68]:
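def HT_set_labels(classlabel):
    """Map a BERT output column name to its hyperlipidemia time label."""
    if classlabel == 'Class1':
        return 'after DCT'
    elif classlabel == 'Class2':
        return 'before DCT'
    elif classlabel == 'Class3':
        return 'during DCT'
    else:
        # Class0 (and anything unexpected) is treated as the background class
        return 'Other'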

In [69]:
bert_aug_HT_results['predLabel'] = bert_aug_HT_results['predClass'].apply(HT_set_labels)

bert_aug_HT_results.head(10)


Unnamed: 0,Class0,Class1,Class2,Class3,predClass,predLabel
0,0.999958,1.2e-05,1.3e-05,1.8e-05,Class0,Other
1,0.999957,1.1e-05,1.3e-05,1.9e-05,Class0,Other
2,0.999951,1e-05,1.5e-05,2.4e-05,Class0,Other
3,0.999879,1.4e-05,3.9e-05,6.8e-05,Class0,Other
4,0.999838,1.9e-05,5e-05,9.2e-05,Class0,Other
5,0.999957,1.3e-05,1.3e-05,1.8e-05,Class0,Other
6,0.999955,1.1e-05,1.4e-05,2e-05,Class0,Other
7,0.999957,1.1e-05,1.3e-05,2e-05,Class0,Other
8,0.999958,1.1e-05,1.4e-05,1.6e-05,Class0,Other
9,0.999953,1.4e-05,1.6e-05,1.7e-05,Class0,Other


In [70]:
# validating the counts by label
bert_aug_HT_results['predLabel'].value_counts()

Other         24883
during DCT      202
before DCT      123
after DCT        73
Name: predLabel, dtype: int64

In [71]:
HT_combined = pd.concat([HT_test, bert_aug_HT_results], axis=1)

In [72]:
HT_combined.head()

Unnamed: 0,sentenceId,sentence,label,file,Class0,Class1,Class2,Class3,predClass,predLabel
0,0,Record date: 2080-02-18,Other,110-03.xml,0.999958,1.2e-05,1.3e-05,1.8e-05,Class0,Other
1,1,SDU JAR Admission Note,Other,110-03.xml,0.999957,1.1e-05,1.3e-05,1.9e-05,Class0,Other
2,2,Name: \t Yosef Villegas,Other,110-03.xml,0.999951,1e-05,1.5e-05,2.4e-05,Class0,Other
3,3,MR:\t8249813,Other,110-03.xml,0.999879,1.4e-05,3.9e-05,6.8e-05,Class0,Other
4,4,DOA: \t2/17/80,Other,110-03.xml,0.999838,1.9e-05,5e-05,9.2e-05,Class0,Other


In [73]:
HT_combined[HT_combined['predLabel']!='Other']

Unnamed: 0,sentenceId,sentence,label,file,Class0,Class1,Class2,Class3,predClass,predLabel
18,18,Hyperlipidemia,before DCT,110-03.xml,0.003151,0.017987,0.032193,0.946669,Class3,during DCT
104,104,hyperlipidemia,after DCT,110-04.xml,0.003151,0.017987,0.032193,0.946669,Class3,during DCT
185,185,His past medical history is significant for hy...,after DCT,112-02.xml,0.012428,0.324499,0.657111,0.005962,Class2,before DCT
227,227,His past medical history is significant for hy...,before DCT,112-03.xml,0.012428,0.324499,0.657111,0.005962,Class2,before DCT
265,265,"He is a 54-year-old man with obesity, dyslipid...",during DCT,112-04.xml,0.000782,0.001913,0.002474,0.994831,Class3,during DCT
310,310,High cholesterol.,after DCT,114-03.xml,0.000795,0.406887,0.589512,0.002807,Class2,before DCT
357,357,Mr. Slater is an 83 yo w/ h/o bull...,during DCT,114-04.xml,0.000779,0.001524,0.002448,0.995250,Class3,during DCT
376,376,&#183; Hypercholesterolemia,during DCT,114-04.xml,0.001389,0.000462,0.004218,0.993931,Class3,during DCT
430,430,: Mr. Slater is an 83 yo w/ h/o bullous pemphi...,during DCT,114-04.xml,0.000716,0.001242,0.002449,0.995593,Class3,during DCT
496,496,Hyperuricemia MAJOR,Other,115-04.xml,0.001924,0.000457,0.004667,0.992953,Class3,during DCT


In [74]:
HT_test_labels = HT_combined['label']
HT_pred_labels = HT_combined['predLabel']

#print(type(HT_test_labels))

In [75]:
accuracy_score(HT_test_labels, HT_pred_labels)

0.990664926229184

In [76]:
# classification_report expects (y_true, y_pred), in that order
print(classification_report(HT_test_labels, HT_pred_labels))

              precision    recall  f1-score   support

       Other       1.00      1.00      1.00     24899
   after DCT       0.21      0.30      0.24        50
  before DCT       0.49      0.39      0.43       154
  during DCT       0.68      0.78      0.73       178

   micro avg       0.99      0.99      0.99     25281
   macro avg       0.59      0.62      0.60     25281
weighted avg       0.99      0.99      0.99     25281



In [77]:
unique_label = np.unique(HT_test_labels)
print(pd.DataFrame(confusion_matrix(HT_test_labels, HT_pred_labels, labels=unique_label), 
                   index=['true:{:}'.format(x) for x in unique_label], 
                   columns=['pred:{:}'.format(x) for x in unique_label]))

                 pred:Other  pred:after DCT  pred:before DCT  pred:during DCT
true:Other            24832              25               22               20
true:after DCT           13              15               15                7
true:before DCT          31              26               60               37
true:during DCT           7               7               26              138
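
The time matrix can be split into two kinds of error for sentences that carry a time tag: misses, where the sentence is predicted as 'Other', and time confusions, where a tag is predicted but with the wrong time value. The sketch below copies the matrix values from the output above; the variable names are introduced here for illustration.

```python
import numpy as np

labels = ["Other", "after DCT", "before DCT", "during DCT"]

# Rows are true labels, columns are predicted labels (copied from the matrix above).
cm = np.array([
    [24832, 25, 22, 20],
    [13, 15, 15, 7],
    [31, 26, 60, 37],
    [7, 7, 26, 138],
])

tagged = cm[1:, :]                 # rows whose true label is a time tag
missed = tagged[:, 0].sum()        # tagged sentences predicted as 'Other'
detected = tagged[:, 1:].sum()     # tagged sentences predicted as some time tag
right_time = np.trace(cm[1:, 1:])  # predicted with the correct time tag
wrong_time = detected - right_time # predicted with a different time tag

print(f"tagged sentences: {tagged.sum()}")
print(f"missed (predicted Other): {missed}")
print(f"detected with correct time: {right_time}")
print(f"detected with wrong time: {wrong_time}")
```

Of the 382 tagged sentences, 51 are missed and 118 are detected with the wrong time value, so confusion between the three time classes accounts for more errors than missed detections.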


