## i2b2 Dataset - Heart Disease Risk Prediction

This project uses the i2b2 dataset to assess heart disease risk factors with NLP techniques.  It builds on the 2014 competition, which applied NLP techniques to the i2b2 dataset to identify risk factors associated with heart disease.  Several models, including CNN, RNN, and bi-directional LSTM, have already been explored.  Our goal is to apply BERT and other NLP techniques to the i2b2 dataset and compare their performance against previous studies.


In [2]:
import xml.etree.ElementTree as ET
from xml.dom.minidom import parse, Node

### Parse through XML

The data (clinical text) is stored in XML format.  Each file contains the clinical text itself as well as the tags derived from the annotations.
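As a rough illustration of that layout, the sketch below separates the note text from its annotation tags. The miniature record is hypothetical, and the attribute names (`id`, `start`, `end`, `text`, `TYPE`) and the reading of `start`/`end` as character offsets into the note are assumptions, not something confirmed by the files shown here. `split_record` is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature record. The attribute names and the offset
# convention are assumptions made for illustration only.
SAMPLE = """<deIdi2b2>
<TEXT><![CDATA[Record date: 2067-05-03
55 yo woman who presents for f/u]]></TEXT>
<TAGS>
<DATE id="P0" start="13" end="23" text="2067-05-03" TYPE="DATE" />
<AGE id="P1" start="24" end="26" text="55" TYPE="AGE" />
</TAGS>
</deIdi2b2>"""


def split_record(xml_string):
    """Return (note_text, annotations) for one record given as a string."""
    root = ET.fromstring(xml_string)
    # The free-text note lives in the <TEXT> element.
    note = root.findtext("TEXT", default="")
    # Each child of <TAGS> is one annotation; keep its tag name and attributes.
    tags_parent = root.find("TAGS")
    if tags_parent is None:
        annotations = []
    else:
        annotations = [(tag.tag, dict(tag.attrib)) for tag in tags_parent]
    return note, annotations


note, annotations = split_record(SAMPLE)
for tag_name, attrs in annotations:
    # Under the offset assumption, slicing the note recovers the tagged span.
    span = note[int(attrs["start"]):int(attrs["end"])]
    print(tag_name, repr(span))
```

If the offset assumption holds for the real files, the sliced span should match each tag's `text` attribute, which makes for a cheap sanity check before any modeling.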

In [3]:
xmlTree = parse("training-PHI-Gold-Set1/220-01.xml")

In [4]:
for node1 in xmlTree.getElementsByTagName("TEXT"):
    print(" node value: ")
    print(node1.firstChild.nodeValue)

 node value: 



Record date: 2067-05-03

Narrative History

   55 yo woman who presents for f/u 

   

   Seen in Cardiac rehab locally last week and BP 170/80.  They called us and we increased her HCTZ to 25 mg from 12.5 mg.  States her BP's were fine there since - 130-140/70-80.

   

   

   Saw Dr Oakley 4/5/67 - she was happy with results of ETT at Clarkfield.  To f/u 7/67.  No CP's since last admit.

   

   Back to work and starting to walk.  No wt loss and discouraged by this, but just starting to exercise.

   

   No smoking for 3 months now!

   

   Still with hotflashes, wakes her up at night.

Problems

      FH breast cancer   37 yo s 



      FH myocardial infarction   mother died 66 yo 



      Hypertension



      Uterine fibroids   u/s 2062 



      Smoking



      hyperlipidemia   CRF mild chol, cigs, HTN, Fhx and known hx CAD in pt. 



      borderline diabetes mellitus   4/63 125 , follow hgbaic 



      VPB   2065 - ETT showed freq PVC 



      coronary 

In [5]:
xmlTree = ET.parse("training-PHI-Gold-Set1/220-01.xml")
elemList = []

for elem in xmlTree.iter():
    elemList.append(elem.tag) 

# Remove duplicates by converting to a set and back to a list
elemList = list(set(elemList))

# Just printing out the result
print(elemList)

['AGE', 'LOCATION', 'TAGS', 'NAME', 'TEXT', 'deIdi2b2', 'DATE']
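Converting to a set shows which tags exist but discards how often each one occurs. A possible variation, sketched here on a tiny hypothetical document rather than the real file, is to count tags with `collections.Counter`. `count_tags` is an illustrative helper name.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(root):
    """Count how many times each element tag occurs under (and including) root."""
    return Counter(elem.tag for elem in root.iter())


# Hypothetical stand-in for a parsed record; the real files are much larger.
sample_root = ET.fromstring(
    "<deIdi2b2><TEXT>note</TEXT><TAGS><DATE /><DATE /><AGE /></TAGS></deIdi2b2>"
)
tag_counts = count_tags(sample_root)

# most_common() lists tags from most to least frequent.
print(tag_counts.most_common())
```

On the real data, the same function could be called with `xmlTree.getroot()` to see which annotation types dominate a record.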


In [8]:
root = xmlTree.getroot()
root

<Element 'deIdi2b2' at 0x0000028127DE17C8>

In [9]:
root.tag

'deIdi2b2'

In [10]:
[(elem.tag, elem.attrib, elem.text) for elem in root.iter()]

[('deIdi2b2', {}, '\n'),
 ('TEXT',
  {},
  "\n\n\nRecord date: 2067-05-03\n\nNarrative History\n\n   55 yo woman who presents for f/u \n\n   \n\n   Seen in Cardiac rehab locally last week and BP 170/80.  They called us and we increased her HCTZ to 25 mg from 12.5 mg.  States her BP's were fine there since - 130-140/70-80.\n\n   \n\n   \n\n   Saw Dr Oakley 4/5/67 - she was happy with results of ETT at Clarkfield.  To f/u 7/67.  No CP's since last admit.\n\n   \n\n   Back to work and starting to walk.  No wt loss and discouraged by this, but just starting to exercise.\n\n   \n\n   No smoking for 3 months now!\n\n   \n\n   Still with hotflashes, wakes her up at night.\n\nProblems\n\n      FH breast cancer   37 yo s \n\n\n\n      FH myocardial infarction   mother died 66 yo \n\n\n\n      Hypertension\n\n\n\n      Uterine fibroids   u/s 2062 \n\n\n\n      Smoking\n\n\n\n      hyperlipidemia   CRF mild chol, cigs, HTN, Fhx and known hx CAD in pt. \n\n\n\n      borderline diabetes mellitus   

In [11]:
from xml.dom.minidom import parse, Node

xmlTree = parse("training-PHI-Gold-Set1/220-01.xml")
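# Walk the direct children of the root <deIdi2b2> element
for node1 in xmlTree.getElementsByTagName("deIdi2b2"):
    for node2 in node1.childNodes:
        # Prints the same list of TEXT elements once per child node
        print(xmlTree.getElementsByTagName("TEXT"))
        # Whitespace between elements shows up as text nodes
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)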


[<DOM Element: TEXT at 0x28127e05c28>]


[<DOM Element: TEXT at 0x28127e05c28>]
[<DOM Element: TEXT at 0x28127e05c28>]


[<DOM Element: TEXT at 0x28127e05c28>]
[<DOM Element: TEXT at 0x28127e05c28>]




In [12]:
xmlstr = ET.tostring(root, encoding='utf8', method='xml')

In [13]:
xmlstr

b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<deIdi2b2>\n<TEXT>\n\n\nRecord date: 2067-05-03\n\nNarrative History\n\n   55 yo woman who presents for f/u \n\n   \n\n   Seen in Cardiac rehab locally last week and BP 170/80.  They called us and we increased her HCTZ to 25 mg from 12.5 mg.  States her BP\'s were fine there since - 130-140/70-80.\n\n   \n\n   \n\n   Saw Dr Oakley 4/5/67 - she was happy with results of ETT at Clarkfield.  To f/u 7/67.  No CP\'s since last admit.\n\n   \n\n   Back to work and starting to walk.  No wt loss and discouraged by this, but just starting to exercise.\n\n   \n\n   No smoking for 3 months now!\n\n   \n\n   Still with hotflashes, wakes her up at night.\n\nProblems\n\n      FH breast cancer   37 yo s \n\n\n\n      FH myocardial infarction   mother died 66 yo \n\n\n\n      Hypertension\n\n\n\n      Uterine fibroids   u/s 2062 \n\n\n\n      Smoking\n\n\n\n      hyperlipidemia   CRF mild chol, cigs, HTN, Fhx and known hx CAD in pt. \n\n\n\n      borderline

## Read files in Directory

In [14]:
import os

path = 'C:/Users/sudha/Box/UCBerkeley-MIDS/Sprint-2019/W-266/W266-Final-Project-Related/i2b2-Data-Downloads/training-PHI-Gold-Set1'

folder = os.fsencode(path)
print(folder)

filenames = []
xml_contents = []

# Sort the listing so filenames and xml_contents stay in the same order
for file in sorted(os.listdir(folder)):
    filename = os.fsdecode(file)
    if filename.endswith('.xml'):  # select XML files only
        filenames.append(filename)
        # Parse from the directory that was listed, not a separate relative path
        xmlTree = ET.parse(os.path.join(path, filename))
        root = xmlTree.getroot()
        xml_elem = [(elem.tag, elem.attrib, elem.text) for elem in root.iter()]
        xml_contents.append(xml_elem)

b'C:/Users/sudha/Box/UCBerkeley-MIDS/Sprint-2019/W-266/W266-Final-Project-Related/i2b2-Data-Downloads/training-PHI-Gold-Set1'


In [15]:
filenames

['220-01.xml',
 '220-02.xml',
 '220-03.xml',
 '220-04.xml',
 '220-05.xml',
 '221-01.xml',
 '221-02.xml',
 '221-03.xml',
 '221-04.xml',
 '221-05.xml',
 '222-01.xml',
 '222-02.xml',
 '222-03.xml',
 '222-04.xml',
 '222-05.xml',
 '223-01.xml',
 '223-02.xml',
 '223-03.xml',
 '223-04.xml',
 '224-01.xml',
 '224-02.xml',
 '224-03.xml',
 '224-04.xml',
 '225-01.xml',
 '225-02.xml',
 '225-03.xml',
 '225-04.xml',
 '226-01.xml',
 '226-02.xml',
 '226-03.xml',
 '226-04.xml',
 '226-05.xml',
 '227-01.xml',
 '227-02.xml',
 '227-03.xml',
 '227-04.xml',
 '227-05.xml',
 '228-01.xml',
 '228-02.xml',
 '228-03.xml',
 '228-04.xml',
 '228-05.xml',
 '229-01.xml',
 '229-02.xml',
 '229-03.xml',
 '240-01.xml',
 '240-02.xml',
 '240-03.xml',
 '240-04.xml',
 '241-01.xml',
 '241-02.xml',
 '241-03.xml',
 '241-04.xml',
 '242-01.xml',
 '242-02.xml',
 '242-03.xml',
 '242-04.xml',
 '242-05.xml',
 '243-01.xml',
 '243-02.xml',
 '243-03.xml',
 '243-04.xml',
 '244-01.xml',
 '244-02.xml',
 '244-03.xml',
 '244-04.xml',
 '246-01.x

In [16]:
len(xml_contents)

521
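Keeping two parallel lists (`filenames` and `xml_contents`) works, but the pairing between a file and its contents is only implicit. One possible alternative, shown as a minimal sketch, is a dictionary keyed by filename that stores just the note text. The helper `load_notes` and the temporary files are hypothetical and stand in for the real dataset directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def load_notes(directory):
    """Map each XML filename in a directory to the text of its <TEXT> element."""
    notes = {}
    # Sorting keeps the iteration order deterministic across platforms.
    for xml_path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        notes[xml_path.name] = root.findtext("TEXT", default="")
    return notes


# Demonstrate on throwaway files instead of the real dataset directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("b.xml", "second"), ("a.xml", "first")]:
        Path(tmp, name).write_text(
            f"<deIdi2b2><TEXT>{body}</TEXT></deIdi2b2>", encoding="utf-8"
        )
    # Non-XML files are skipped by the glob pattern.
    Path(tmp, "readme.txt").write_text("ignored", encoding="utf-8")
    notes = load_notes(tmp)

print(list(notes))
```

With this shape, a record can be looked up by name (for example `notes["a.xml"]`) rather than by list position.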

In [17]:
xml_contents[0]

[('deIdi2b2', {}, '\n'),
 ('TEXT',
  {},
  "\n\n\nRecord date: 2067-05-03\n\nNarrative History\n\n   55 yo woman who presents for f/u \n\n   \n\n   Seen in Cardiac rehab locally last week and BP 170/80.  They called us and we increased her HCTZ to 25 mg from 12.5 mg.  States her BP's were fine there since - 130-140/70-80.\n\n   \n\n   \n\n   Saw Dr Oakley 4/5/67 - she was happy with results of ETT at Clarkfield.  To f/u 7/67.  No CP's since last admit.\n\n   \n\n   Back to work and starting to walk.  No wt loss and discouraged by this, but just starting to exercise.\n\n   \n\n   No smoking for 3 months now!\n\n   \n\n   Still with hotflashes, wakes her up at night.\n\nProblems\n\n      FH breast cancer   37 yo s \n\n\n\n      FH myocardial infarction   mother died 66 yo \n\n\n\n      Hypertension\n\n\n\n      Uterine fibroids   u/s 2062 \n\n\n\n      Smoking\n\n\n\n      hyperlipidemia   CRF mild chol, cigs, HTN, Fhx and known hx CAD in pt. \n\n\n\n      borderline diabetes mellitus   