## Exploring the XML files from the dataset

The dataset contains the following XML files, each of which needs to be parsed individually:

* `deid_surrogate_test_all_groundtruth_version2.xml`:

    Annotations for each entity type for each Record ID
    
<br />
    
* `deid_surrogate_test_all_version2.xml`:

    Corresponding Text for each of the above records
    
<br />
    
* `deid_surrogate_train_all_version2.xml`:

    File error; no parser is defined for this file below
    
<br />
    
* `smokers_surrogate_test_all_groundtruth_version2.xml`:

    ID, Smoking status and text from the case notes
    
<br />
    
* `smokers_surrogate_test_all_version2.xml`:

    ID, case notes without a label of the smoking status
    
<br />
    
* `smokers_surrogate_train_all_version2.xml`:

    ID, smoking status and text from the case notes
    
<br />
    
* `unannotated_records_deid_smoking.xml`:

    File error; no parser is defined for this file below
    
<br />
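
Two of the files above are listed with a file error. As a minimal sketch, one way to see which files parse cleanly is to attempt `ET.parse` on each and record any failure. The helper name `check_xml_files` is hypothetical, and the `data/` path is assumed from the calls used later in this notebook; the actual cause of the errors is not stated here.

```python
import xml.etree.ElementTree as ET


def check_xml_files(paths):
    """Return {path: None if the file parses, else the error message}."""
    results = {}
    for path in paths:
        try:
            ET.parse(path)
            results[path] = None
        except (ET.ParseError, OSError) as err:
            # ParseError: malformed XML; OSError: missing or unreadable file
            results[path] = str(err)
    return results


if __name__ == "__main__":
    # Assumed location of one of the files reported as failing.
    report = check_xml_files(["data/deid_surrogate_train_all_version2.xml"])
    for path, error in report.items():
        print(path, "->", "OK" if error is None else error)
```

The error message usually includes a line and column number, which helps locate the malformed part of the file.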
    
    
The functions defined below parse each XML file into a dictionary keyed by record ID. We will use these dictionaries to build NER models.

In [1]:
import xml.etree.ElementTree as ET

def xml_parse_deid_surrogate_test_all_groundtruth_version2(file):
    # Record ID -> list of (entity TYPE, annotated text) pairs,
    # read from the children of each record's first element
    tree = ET.parse(file)
    root = tree.getroot()
    return {child.attrib['ID']: [(i.attrib['TYPE'], i.text) for i in child[0]] for child in root}

def xml_parse_deid_surrogate_test_all_version2(file):
    # Record ID -> list of (tag, text) pairs, one per child of the record
    tree = ET.parse(file)
    root = tree.getroot()
    return {child.attrib['ID']: [(i.tag, i.text) for i in child] for child in root}

def xml_parse_smokers_surrogate_test_all_groundtruth_version2(file):
    # Record ID -> (attributes of first child, e.g. STATUS; tag and text of second child)
    tree = ET.parse(file)
    root = tree.getroot()
    return {child.attrib['ID']: (child[0].attrib, child[1].tag, child[1].text) for child in root}

def xml_parse_smokers_surrogate_test_all_version2(file):
    # Record ID -> (tag, text) of the first child; this file carries no smoking status
    tree = ET.parse(file)
    root = tree.getroot()
    return {child.attrib['ID']: (child[0].tag, child[0].text) for child in root}

def xml_parse_smokers_surrogate_train_all_version2(file):
    # Same layout as the smokers test ground truth: (attributes, tag, text) per record ID
    tree = ET.parse(file)
    root = tree.getroot()
    return {child.attrib['ID']: (child[0].attrib, child[1].tag, child[1].text) for child in root}

In [37]:
xml_parse_deid_surrogate_test_all_groundtruth_version2("data/deid_surrogate_test_all_groundtruth_version2.xml")

{'111': [('ID', '081039790'),
  ('HOSPITAL', 'EH'),
  ('ID', '05861967'),
  ('ID', '2214142'),
  ('DATE', '7/23'),
  ('DATE', '07/23'),
  ('DATE', '08/05'),
  ('DATE', '08/05'),
  ('DOCTOR', 'FLOW VIZEUBELB'),
  ('DATE', '07/24'),
  ('DATE', '07/24'),
  ('DOCTOR', 'Flow Vizeubelb'),
  ('HOSPITAL', 'Totonleyash Clandsdallca Center'),
  ('DATE', '07/31'),
  ('DATE', '08/01'),
  ('DATE', '08/03'),
  ('ID', '9-4443854 AIPlkbn'),
  ('DOCTOR', 'KOTEJESC , NAE FARIETTEJARRED'),
  ('DOCTOR', 'VIZEUBELB , MANSHIRL'),
  ('ID', '2434925'),
  ('DATE', '08/05'),
  ('DATE', '08/05')],
 '135': [('ID', '194830718'),
  ('HOSPITAL', 'YC'),
  ('ID', '29157608'),
  ('ID', '520377'),
  ('DATE', '10/25'),
  ('DATE', '10/25'),
  ('DATE', '11/03'),
  ('DATE', '12'),
  ('DATE', '08'),
  ('DATE', '09'),
  ('DATE', '09'),
  ('DATE', '11'),
  ('DATE', '10/28'),
  ('DOCTOR', 'Jesc'),
  ('DOCTOR', 'BELL REXBEATHEFARST'),
  ('ID', 'NM82'),
  ('DOCTOR', 'LINEMASE D. JESC'),
  ('ID', 'ZP2 SF477/5317'),
  ('ID', '31967
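
Since each record maps to a list of `(TYPE, text)` pairs, a quick way to get a feel for the label distribution is to count entity types per record. The sketch below is illustrative only: `count_entity_types` is a hypothetical helper, and `sample` holds a few pairs copied from record `111` in the output above rather than the full file.

```python
from collections import Counter


def count_entity_types(annotations):
    """Map each record ID to a Counter of its entity types."""
    return {
        record_id: Counter(entity_type for entity_type, _ in pairs)
        for record_id, pairs in annotations.items()
    }


# A small excerpt of record '111' from the output above (not the full record).
sample = {
    "111": [
        ("ID", "081039790"),
        ("HOSPITAL", "EH"),
        ("ID", "05861967"),
        ("DATE", "7/23"),
        ("DOCTOR", "FLOW VIZEUBELB"),
    ]
}

counts = count_entity_types(sample)
print(counts["111"])
```

The same function can be applied directly to the dictionary returned by `xml_parse_deid_surrogate_test_all_groundtruth_version2`.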

In [38]:
xml_parse_deid_surrogate_test_all_version2('data/deid_surrogate_test_all_version2.xml')

{'111': [('TEXT',
   '\n081039790 EH\n05861967\n2214142\n7/23/2003 12:00:00 AM\nSUBARACHNOID HEMORRHAGE\nSigned\nDIS\nAdmission Date :\n07/23/2003\nReport Status :\nSigned\nDischarge Date :\n08/05/2003\nDate of Discharge :\n08/05/2003\nATTENDING :\nFLOW VIZEUBELB M.D.\nPRINCIPAL DIAGNOSIS :\nAnterior communicating artery aneurysm , subarachnoid hemorrhage .\nOTHER PROBLEMS :\nNone .\nHISTORY OF PRESENT ILLNESS :\nThis is a 52-year-old female who complains of 5 days of headache that has worsened over the past 2 days .\nShe has complained of nausea and vomiting since Wednesday , and currently , the headache is 9/10 with photophobia .\nShe called her primary care physician who originally thought that the nausea and vomiting were suggestive of a cold .\nPAST MEDICAL HISTORY :\nPolio as a child , history of alcohol abuse but quit in 1993 , and hypertension no longer on medications .\nSOCIAL HISTORY :\nThe patient denies ethanol since 1993 .\nShe is married .\nShe denies tobacco history and 

In [132]:
xml_parse_smokers_surrogate_test_all_groundtruth_version2('data/smokers_surrogate_test_all_groundtruth_version2.xml')

{'1': ({'STATUS': 'CURRENT SMOKER'},
  'TEXT',
  '\n726132880\nDH\n9749099\n947532\n7473533\n12/7/2006 12:00:00 AM\nDischarge Summary\nSigned\nDIS\nReport Status :\nSigned\nDISCHARGE SUMMARY\nNAME :\nSTERPMOONE , NY\nUNIT NUMBER :\n636-48-57\nADMISSION DATE :\n12/07/2006\nDISCHARGE DATE :\n12/09/2006\nPRINCIPAL DIAGNOSIS :\nHyperkalemia .\nASSOCIATED DIAGNOSES :\n1. Endstage renal disease .\n2. Thrombosed dialysis arteriovenous graft .\n3. Anemia .\nOPERATIONS AND PROCEDURES :\n1. Dialysis .\n2. AV graft thrombectomy .\n3. Tunneled hemodialysis catheter .\n4. Hemodialysis .\nDISCHARGE MEDICATIONS :\n1. Aspirin 81 mg every day .\n2. Amitriptyline 25 mg at bedtime .\n3. Atenolol 50 mg per day .\n4. Lipitor 10 mg per day .\n5. Calcium acetate three tablets three times a day with meals .\n6. Celexa 40 mg per day .\n7. Nexium 20 mg per day .\n8. Mirapex 0.5 mg pre-dialysis .\n9. Quinine 325 mg per day .\n10. Renagel 800 mg four times per day .\nBRIEF HISTORY :\nShe is a 57-year-old chronic 

In [133]:
xml_parse_smokers_surrogate_test_all_version2('data/smokers_surrogate_test_all_version2.xml')

{'1': ('TEXT',
  '\n726132880\nDH\n9749099\n947532\n7473533\n12/7/2006 12:00:00 AM\nDischarge Summary\nSigned\nDIS\nReport Status :\nSigned\nDISCHARGE SUMMARY\nNAME :\nSTERPMOONE , NY\nUNIT NUMBER :\n636-48-57\nADMISSION DATE :\n12/07/2006\nDISCHARGE DATE :\n12/09/2006\nPRINCIPAL DIAGNOSIS :\nHyperkalemia .\nASSOCIATED DIAGNOSES :\n1. Endstage renal disease .\n2. Thrombosed dialysis arteriovenous graft .\n3. Anemia .\nOPERATIONS AND PROCEDURES :\n1. Dialysis .\n2. AV graft thrombectomy .\n3. Tunneled hemodialysis catheter .\n4. Hemodialysis .\nDISCHARGE MEDICATIONS :\n1. Aspirin 81 mg every day .\n2. Amitriptyline 25 mg at bedtime .\n3. Atenolol 50 mg per day .\n4. Lipitor 10 mg per day .\n5. Calcium acetate three tablets three times a day with meals .\n6. Celexa 40 mg per day .\n7. Nexium 20 mg per day .\n8. Mirapex 0.5 mg pre-dialysis .\n9. Quinine 325 mg per day .\n10. Renagel 800 mg four times per day .\nBRIEF HISTORY :\nShe is a 57-year-old chronic dialysis patient who was admitte

In [134]:
xml_parse_smokers_surrogate_train_all_version2('data/smokers_surrogate_train_all_version2.xml')

{'10': ({'STATUS': 'UNKNOWN'},
  'TEXT',
  '\n688127038 EH\n47449520\n204512\n3/5/2002 12:00:00 AM\nTCA overdose\nDIS\nAdmission Date :\n03/05/2002\nReport Status :\nDischarge Date :\n03/07/2002\n****** DISCHARGE ORDERS ******\nHULLEKOTE , AU 329-72-53-3 Z23\nRoom :\n29Z-191\nService :\nMED\nDISCHARGE PATIENT ON :\n03/07/02 AT 11:00 AM\nCONTINGENT UPON\nNot Applicable\nWILL D / C ORDER BE USED AS THE D / C SUMMARY :\nYES\nAttending :\nUPHMADRE , CIOUSALE N. , M.D.\nCODE STATUS :\nFull code\nDISPOSITION :\nRehabilitation\nDISCHARGE MEDICATIONS :\nFOLATE ( FOLIC ACID ) 1 MG PO QD ZANTAC ( RANITIDINE HCL ) 150 MG PO BID MVI THERAPEUTIC W / MINERALS ( THERAP VITS / MINERALS ) 1 TAB PO QD THIAMINE HCL 100 MG PO QD\nDIET :\nNo Restrictions Activity - As tolerated\nFOLLOW UP APPOINTMENT ( S ) :\nDr. Ee , patient to arrange , No Known Allergies\nADMIT DIAGNOSIS :\nTCA overdose\nPRINCIPAL DISCHARGE DIAGNOSIS ; Responsible After Study for Causing Admission ) TCA overdose\nOTHER DIAGNOSIS ; Condi
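
Each value in this dictionary is an `(attributes, tag, text)` triple, with the label stored under the `STATUS` key. As a rough sketch, the records could be split into parallel lists of texts and labels as follows. `to_texts_and_labels` is a hypothetical helper, and the toy records use placeholder text rather than real case notes.

```python
def to_texts_and_labels(records):
    """Split {id: (attrib, tag, text)} into parallel lists of texts and labels."""
    texts, labels = [], []
    for attrib, _tag, text in records.values():
        # Element text can be None for an empty element, so guard before stripping.
        texts.append((text or "").strip())
        labels.append(attrib["STATUS"])
    return texts, labels


# Toy records in the same shape as the output above; the texts are placeholders.
toy = {
    "10": ({"STATUS": "UNKNOWN"}, "TEXT", "\nfirst placeholder note\n"),
    "2": ({"STATUS": "CURRENT SMOKER"}, "TEXT", "\nsecond placeholder note\n"),
}

texts, labels = to_texts_and_labels(toy)
print(labels)
```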

In [5]:
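# Assumption: `a` is the dictionary returned by
# xml_parse_deid_surrogate_test_all_version2 (it is not defined in the cells shown)
a = xml_parse_deid_surrogate_test_all_version2('data/deid_surrogate_test_all_version2.xml')
for key in a:
    tags = a[key]

tags  # display the entry for the last record visited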

[('TEXT', '\n113416550\nPRGH\n13523357\n630190\n6/7/1999 12:00:00 AM\nDischarge Summary\nSigned\nDIS\nAdmission Date :\n06/07/1999\nReport Status :\nSigned\nDischarge Date :\n06/13/1999\nHISTORY OF PRESENT ILLNESS :\nEssentially , Mr. Cornea is a 60 year old male who noted the onset of dark urine during early January .\nHe underwent CT and ERCP at the Lisonatemi Faylandsburgnic, Community Hospital with a stent placement and resolution of jaundice .\nHe underwent an ECHO and endoscopy at Ingree and Ot of Weamanshy Medical Center on April 28 .\nHe was found to have a large , bulging , extrinsic mass in the lesser curvature of his stomach .\nFine needle aspiration showed atypical cells , positively reactive mesothelial cells .\nAbdominal CT on April 14 , showed a 12 x 8 x 8 cm mass in the region of the left liver , and appeared to be from the lesser curvature of the stomach or left liver .\nHe denied any nausea , vomiting , anorexia , or weight loss .\nHe states that his color in urine or