In [1]:
!pip3 install xmltodict

Collecting xmltodict
  Using cached https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0


# Week 6 Exercises

_McKinney 6.1_

There are multiple ways to solve the problems below.  You can use any one of several approaches.  For example, you can read CSV files using Pandas or the csv module.  Your score won't depend on which modules you choose to use unless explicitly noted below, but your programming style will still matter.

### 30.1 List of Allergies

In the /data directory on the Jupyter server, there is a file called `allergies.json` that contains a list of patient allergies.  It is taken from sample data provided by the EHR vendor, Epic, here: https://open.epic.com/Clinical/Allergy

Take some time to look at the structure of the file.  You can open it directly in Jupyter by clicking the _Home_ icon, then the _from_instructor_ folder, and then the _data_ folder.

Within the file, you'll see that it is a dictionary with many items in it.  One of those items is called `entry` and that item is a list of things.  You can tell that because the item name is immediately followed by an opening square bracket, signifying the start of a list.  It's line 11 of the file: `  "entry": [`

Write a function named `allergy_count(json_file)` that takes as one parameter the name of the JSON file and returns an integer number of entries in that file.  Your function should open the file, read the json into a Python object, and return how many items there are in the list of `entry`s.
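Before writing the function, it can help to load a small JSON document and look at its top-level structure. The sketch below uses a made-up, in-memory bundle (the `sample_bundle` text and its contents are illustrative assumptions, not the actual contents of `allergies.json`) to show how a JSON object becomes a Python dictionary and a JSON array becomes a list.

```python
import json

# Hypothetical miniature bundle; the real allergies.json has many more fields.
sample_bundle = '''
{
  "resourceType": "Bundle",
  "entry": [
    {"resource": {"id": "a1"}},
    {"resource": {"id": "a2"}}
  ]
}
'''

bundle = json.loads(sample_bundle)      # JSON object -> dict
top_level_keys = sorted(bundle.keys())  # names of the top-level items
entries = bundle.get("entry", [])       # JSON array -> list; [] if the key is missing
entry_count = len(entries)

print(top_level_keys)
print(type(entries).__name__, entry_count)
```

The same calls work on a file if `json.loads(text)` is swapped for `json.load(file_object)`.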

In [2]:
import json
from pathlib import Path
HOME = str(Path.home())

ALLERGIES_FILE="/data/allergies.json"

In [3]:
### BEGIN SOLUTION
def allergy_count(file):
    """(str) -> int
    Open the JSON file named by `file` and return the number of items in its
    'entry' list.
    """
    with open(file) as f:
        allergyinfo = json.load(f)
        allergylist = allergyinfo.get("entry")
     
    return len(allergylist)
### END SOLUTION

In [4]:
allergy_count(ALLERGIES_FILE)

4

In [5]:
assert type(allergy_count(ALLERGIES_FILE)) == int
assert allergy_count(ALLERGIES_FILE) == 4

In [6]:
#import doctest
#doctest.run_docstring_examples(allergy_count, globals(), verbose=True)

### 30.2 Number of Patients

If you dig a little bit deeper into this list of allergies, you'll see that each result has a patient associated with it.  Create a function called `patient_count(json_file)` that will count how many unique patients we have in this JSON structure.
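One convenient way to count distinct values is to collect them in a set, which silently drops duplicates. The sketch below is only an illustration of that idea: `sample_entries` is a hypothetical, hand-written list shaped like the `entry` items, and the short `Patient/...` references are made up.

```python
# Hypothetical entries; the real file uses long URL-style patient references.
sample_entries = [
    {"resource": {"patient": {"reference": "Patient/1", "display": "Patient One"}}},
    {"resource": {"patient": {"reference": "Patient/2", "display": "Patient Two"}}},
    {"resource": {"patient": {"reference": "Patient/1", "display": "Patient One"}}},
]

# A set comprehension keeps each reference only once.
unique_patients = {
    entry["resource"]["patient"]["reference"] for entry in sample_entries
}
unique_count = len(unique_patients)

print(unique_count)
```

Counting on the `reference` rather than the `display` name avoids merging two different patients who happen to share a name.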

In [7]:
### BEGIN SOLUTION
def patient_count(file):
    """(file) -> filename
    This function opens a file as specified by filename and returns how many items exist in the 'entry'
    list
    """
    patients = []
    with open(file) as f:
        allergyinfo = json.load(f)
        entries = allergyinfo.get("entry")
        
        for entry in entries:
            pid = entry.get("resource").get("patient").get("reference")
            if pid not in patients:
                patients.append(pid)
     
    return len(patients)
### END SOLUTION

In [8]:
patient_count(ALLERGIES_FILE)

2

### 30.3 How Many Allergies per Patient

Although each entry is a separate allergy, several of them are for the same patient.  Write a function called `allergy_per_patient(json_file)` that counts up how many allergies each patient has.
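Tallying occurrences per key is a common pattern, and `collections.Counter` from the standard library is one possible way to do it. This is a minimal sketch under the assumption that the patient references have already been pulled out into a list; the `references` values are made up for illustration.

```python
from collections import Counter

# Hypothetical patient references, one per allergy entry.
references = ["Patient/1", "Patient/2", "Patient/1", "Patient/1"]

# Counter counts how many times each reference appears.
allergies_per_patient = Counter(references)

# Convert to a plain dict if the caller expects one.
result = dict(allergies_per_patient)
print(result)
```

A plain dictionary with an `if key not in ...` check gives the same counts; `Counter` just removes the bookkeeping.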


In [9]:
### BEGIN SOLUTION
def allergy_per_patient(json_file):
    """(str) -> dict
    Open the JSON file named by json_file and return a dictionary that maps
    each patient reference to the number of allergy entries for that patient.
    """
    patients = {}
    with open(json_file) as f:
        allergyinfo = json.load(f)
        entries = allergyinfo.get("entry")
        
        for entry in entries:
            pid = entry.get("resource").get("patient").get("reference")
            if pid not in patients:
                patients[pid] = 1
            else:
                patients[pid] += 1
     
    return patients
### END SOLUTION

In [10]:
allergy_per_patient(ALLERGIES_FILE)

{'https://open-ic.epic.com/FHIR/api/FHIR/DSTU2/Patient/Tbt3KuCY0B5PSrJvCu2j-PlK.aiHsu2xUjUM8bWpetXoB': 3,
 'https://open-ic.epic.com/FHIR/api/FHIR/DSTU2/Patient/Tbt3KuCY0B5PSrJvCu2j-PlK.aiHsu2xUjUM8bWpetXoZ': 1}

### 30.4 Patient Allergies and Reaction

You'll see in the file that each of the items in the `entry` list has several other attributes, including a patient name, a substance text representation, and a reaction manifestation.  Create a function named `allergy_list(json_file)` that will create an output list that has patient name, allergy, and reaction for each `entry`.  The actual result you should get will be:

```python
[['Jason Argonaut', 'PENICILLIN G', 'Hives'],
 ['Paul Boal', 'PENICILLIN G', 'Bruising'],
 ['Jason Argonaut', 'SHELLFISH-DERIVED PRODUCTS', 'Itching'],
 ['Jason Argonaut', 'STRAWBERRY', 'Anaphylaxis']]
```

You'll notice that the reaction and the manifestation of that reaction are lists.  You only need to capture the first reaction and the first manifestation of that reaction.  That is, if there is a list of things, just output the first one.
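Taking "the first one" from nested lists can be sketched as below. The helper name `first_manifestation` and the `sample_resource` dictionary are hypothetical; the sketch also assumes that returning `None` is acceptable when a list is missing or empty, which the exercise itself does not require.

```python
def first_manifestation(resource):
    """Return the text of the first manifestation of the first reaction, or None."""
    reactions = resource.get("reaction") or []
    if not reactions:
        return None
    manifestations = reactions[0].get("manifestation") or []
    if not manifestations:
        return None
    return manifestations[0].get("text")


# Hypothetical resource with two reactions; only the first one is used.
sample_resource = {
    "reaction": [
        {"manifestation": [{"text": "Hives"}, {"text": "Itching"}]},
        {"manifestation": [{"text": "Bruising"}]},
    ]
}

first = first_manifestation(sample_resource)
missing = first_manifestation({})
print(first, missing)
```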

In [11]:
import json
### BEGIN SOLUTION
def allergy_list(json_file):
    """(str) -> list
    Open the JSON file named by json_file and return a list with one
    [patient name, substance, reaction] item for each item in the 'entry' list.
    """
    patients = []
    with open(json_file) as f:
        allergyinfo = json.load(f)
        entries = allergyinfo.get("entry")
        
        for entry in entries:
            pname = entry.get("resource").get("patient").get("display")
            substance = entry.get("resource").get("substance").get("text")
            # Only the first manifestation of the first reaction is captured
            reaction = entry.get("resource").get("reaction")[0]['manifestation'][0]['text']
            patients.append([pname, substance, reaction])
     
    return patients
### END SOLUTION

In [12]:
output=[['Jason Argonaut', 'PENICILLIN G', 'Hives'],
 ['Paul Boal', 'PENICILLIN G', 'Bruising'],
 ['Jason Argonaut', 'SHELLFISH-DERIVED PRODUCTS', 'Itching'],
 ['Jason Argonaut', 'STRAWBERRY', 'Anaphylaxis']]

assert allergy_list(ALLERGIES_FILE) == output


### 30.5 Allergy Reaction

Write a function called `allergy_reaction(json_file, patient, substance)` that takes three parameters and returns the reaction that will happen if the patient takes the specified substance.  Solve this, in part, by calling your `allergy_list` function inside your new `allergy_reaction` function.

If the substance is not found in the allergy list, the function should return None.
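One alternative to scanning the list on every call is to build a dictionary keyed by `(patient, substance)`, so that a missing pair naturally yields `None` through `dict.get`. The sketch below assumes rows shaped like the `allergy_list` output shown earlier; `rows` and `lookup` are illustrative names, not part of the assignment.

```python
# Two rows copied from the expected allergy_list output above.
rows = [
    ["Jason Argonaut", "PENICILLIN G", "Hives"],
    ["Paul Boal", "PENICILLIN G", "Bruising"],
]

# Map (patient, substance) -> reaction. If a pair appears twice, the last one wins.
lookup = {(name, substance): reaction for name, substance, reaction in rows}

found = lookup.get(("Jason Argonaut", "PENICILLIN G"))
not_found = lookup.get(("Jason Argonaut", "PENICILLIN"))  # exact match only
print(found, not_found)
```

Note that the match is exact: `'PENICILLIN'` does not match `'PENICILLIN G'`, which is consistent with the assertions in the test cell below.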

In [13]:
import json
### BEGIN SOLUTION
def allergy_reaction(json_file, patient, substance):
    """(str, str, str) -> str or None
    Return the reaction recorded for the given patient and substance in the
    JSON file, or None if no matching allergy is found.
    """
    nidx = 0
    subidx = 1
    ridx = 2
    reaction = None
    patient_list = allergy_list(json_file)
    for pl in patient_list:
        if (pl[nidx] == patient) and (substance == pl[subidx]):
            reaction = pl[ridx]

    return reaction
### END SOLUTION

In [14]:
assert allergy_reaction(ALLERGIES_FILE, 'Jason Argonaut', 'PENICILLIN G') == 'Hives'
assert allergy_reaction(ALLERGIES_FILE, 'Jason Argonaut', 'SHELLFISH-DERIVED PRODUCTS') == 'Itching'
assert allergy_reaction(ALLERGIES_FILE, 'Jason Argonaut', 'STRAWBERRY') == 'Anaphylaxis'
assert allergy_reaction(ALLERGIES_FILE, 'Jason Argonaut', 'PENICILLIN') is None
assert allergy_reaction(ALLERGIES_FILE, 'Paul Boal', 'PENICILLIN G') == 'Bruising'

---
---

# Stretch (Extra) Problems

Work on either of the stretch problems below can earn you up to 25 free points toward the midterm assignment.  That is, if you complete one of these extra problems successfully, you can skip 1 of the problems that will appear on the midterm exam coming up next week.

The midterm will be distributed 3/4 and due 3/14.



---
---

### STRETCH for March 2021 - For those looking for an additional challenge

As I've mentioned in class, CMS is now enforcing a rule around price transparency.  Every facility that takes Medicare payments is required to publish a "machine readable" file with its pricing information for a number of common procedures across all of the payers they work with.  There are two examples of such files in the `/data/` directory: `whiteriver.json` and `saline.xml`.

If you want to compare contracted prices across these two hospitals, you'll need to read in the information from both of those files into some kind of data structure, then merge the data together from those two files.  See what you can do.

See if you can create an output file that has the following fields:
* HOSPITAL
* PROCEDURE_CODE
* PAYER
* AMOUNT

If you choose to work on this, you may get stuck at some point and you won't know if you're _doing it right_. Make some assumptions. Document your questions in this notebook.
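One way to reach the four requested fields is to reshape "wide" rows (one column per payer) into "long" rows (one row per payer). The sketch below is a rough illustration using only the standard library. The `wide_rows` data is made up, and the key names `hospital`, `code`, `descr`, and `grosscharge` are assumed to match the row dictionaries built later in this notebook; every other key is treated as a payer.

```python
import csv
import io

# Hypothetical wide rows: fixed keys plus one key per payer.
wide_rows = [
    {"hospital": "Hospital A", "code": "X0001", "descr": "Sample procedure",
     "grosscharge": 150.0, "Aetna": 90.0, "QualChoice": 85.5},
    {"hospital": "Hospital B", "code": "X0001", "descr": "Sample procedure",
     "grosscharge": 140.0, "Aetna_Outpatient": 95.0},
]

# Keys that describe the procedure rather than a payer (an assumption).
NON_PAYER_KEYS = {"hospital", "code", "descr", "grosscharge"}


def to_long_rows(rows):
    """Turn one wide row per procedure into one long row per payer."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key in NON_PAYER_KEYS:
                continue
            long_rows.append({
                "HOSPITAL": row["hospital"],
                "PROCEDURE_CODE": row["code"],
                "PAYER": key,
                "AMOUNT": float(value),
            })
    return long_rows


long_rows = to_long_rows(wide_rows)

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["HOSPITAL", "PROCEDURE_CODE", "PAYER", "AMOUNT"]
)
writer.writeheader()
writer.writerows(long_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The long layout makes it easier to filter on a single payer or procedure code across hospitals, at the cost of more rows.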



```
Procedure Code |  Description  |  Gross Charges  |  Aetna  |  QualChoice
```

### Assumptions:
* Only charges > 0 for plans are picked up from the data
* DRG data is ignored
* The fields below are ignored since they are not plan prices: they are descriptive fields or appear to be aggregate calculations/statistics over the set of plans.
  Description, ProcedureCode, Modifier, RevenueCode, MSDRG, NDC, InpatientGrossCharge, OutpatientGrossCharge, EmergencyRoomGrossCharge, MSDRGAverageGrossCharge, DiscountedCashPrice, MinimumNegotiatedCharge, MaximumNegotiatedCharge

### What is NOT done:
* Standardization of names of health plans. The two data files appear to use different terminology for plan names, and no attempt is made to standardize it.



In [15]:
def summary_price_json(file):
    """(str) -> dict
    Open a JSON file and return a dictionary with one key named "results" whose
    value is a list of dictionaries, one per entry in StandardCharges.
    """
    ignore_keys = ["Description", "ProcedureCode", "Modifier", "RevenueCode", "MSDRG", "NDC", "InpatientGrossCharge", "OutpatientGrossCharge",
                   "EmergencyRoomGrossCharge", "MSDRGAverageGrossCharge", "DiscountedCashPrice", "MinimumNegotiatedCharge", "MaximumNegotiatedCharge"]
    retlist = []
    #retlist = {}
    with open(file) as f:
        data = json.load(f)
        hospital = data.get("root").get("HospitalorFacilityName")
        chargelist = data.get("root").get("StandardCharges")
        
        for cl in chargelist:
            code = cl.get("ProcedureCode")
            descr = cl.get("Description")
            grosscharge = cl.get("OutpatientGrossCharge")
            row = {}
            row["hospital"] = hospital
            row["descr"] = descr
            row["code"] = code
            row["grosscharge"] = grosscharge
            for plan in cl.keys():
                #print("Plan is {}".format(plan))
                if plan in ignore_keys:
                    continue
                #print(cl.get(plan), float(cl.get(plan)))
                if float(cl.get(plan)) > 0:
                    row[plan] = float(cl.get(plan))
                    #print("Plan is {}, Price is {}".format(plan, cl.get(plan)))
            #print(row)        
            retlist.append(row)
            #retlist[row["code"]] = row

    return {"results": retlist}

#print(chargesummary1)
                    

### NOT USED 


In [16]:
from lxml import objectify

def summary_price_xml(file):
    """(str) -> dict
    Open an XML file and return a dictionary with one key named "results" whose
    value is a list of dictionaries, one per HCPCS item.
    """
    retlist = []
    with open(file, "rb") as f:
        xml = objectify.parse(f)
    # Use the first Facility element under the document root
    facility = xml.getroot().find("Facility")
    hospital = facility.attrib.get("Name")
    # Patient types
    for pt in facility.findall("Patient"):
        patientservicetype = pt.attrib.get("Type")
        for charge in pt.findall("Charge"):
            # Only HCPCS charges are considered; DRG charges are skipped
            if charge.attrib.get("Type") == "DRG":
                continue
            # This is the core loop to get the charges for each item code
            for i in charge.findall("Item"):
                row = {}
                row["hospital"] = hospital
                row["descr"] = i.findtext("Description")
                row["code"] = i.attrib.get("Code")
                row["grosscharge"] = i.findtext("GrossCharge")
                for c in i.find("Contracts").findall("Contract"):
                    row[c.attrib.get("Payer") + "_" + patientservicetype] = c.attrib.get("Charge")
                retlist.append(row)
    return {"results": retlist}
        
#print(chargesummary2)


### Assumptions:
* A key of the form "healthplanname_patientservicetype" is constructed for each plan entry against the code
* Only Charge types of HCPCS are considered. DRG types are not considered.
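The composite key described above can be built and later taken apart again, as sketched here. The helper names `make_plan_key` and `split_plan_key` are hypothetical, and the sketch assumes the patient service type itself contains no underscore, so splitting on the last underscore is safe even when the plan name contains one.

```python
def make_plan_key(plan_name, patient_service_type):
    """Build the 'healthplanname_patientservicetype' key."""
    return f"{plan_name}_{patient_service_type}"


def split_plan_key(key):
    """Split a composite key on its last underscore."""
    plan_name, _, patient_service_type = key.rpartition("_")
    return plan_name, patient_service_type


# Made-up plan name that itself contains an underscore.
key = make_plan_key("Blue_Cross", "Inpatient")
parts = split_plan_key(key)
print(key, parts)
```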


In [17]:
import xmltodict
import json
def get_contracts(contract, patientservicetype):
    """(dict or list, str) -> dict
    contract is the parsed Contract data: a dict for one contract or a list of dicts.
    patientservicetype is the patient service type, e.g. emergency, inpatient.
    Returns a dictionary with "payer_patientservicetype" as the key and the charge as the value.
    """
    val = {}
    if isinstance(contract, dict):
        # There is only one contract
        val[contract["@Payer"]+"_"+patientservicetype] = contract["@Charge"]
    elif isinstance(contract, list):
        # There is a list of contracts
        for co in contract:
            val[co["@Payer"]+"_"+patientservicetype] = co["@Charge"]
    return val

def summary_price_xmljson(file):
    """(str) -> dict
    Open an XML file and return a dictionary with one key named "results" whose
    value is a list of dictionaries. The approach taken is to convert the XML
    document to a dictionary with xmltodict.
    """
    with open(file) as fd:
        doc = xmltodict.parse(fd.read())
    retlist = []
    facility = doc.get("StandardCharges").get("Facility")
    hospital = facility.get("@Name")
    for p in facility.get("Patient"):
        patientservicetype = p.get("@Type")
        # xmltodict gives a dict for a single child element and a list for
        # repeated ones, so normalize to a list before looping
        charges = p.get("Charge") or []
        if not isinstance(charges, list):
            charges = [charges]
        for c in charges:
            # Only do HCPCS types
            if c.get("@Type") == "DRG":
                continue
            items = c.get("Item") or []
            if not isinstance(items, list):
                items = [items]
            for i in items:
                row = {}
                row["hospital"] = hospital
                row["descr"] = i.get("Description")
                row["code"] = i.get("@Code")
                row["grosscharge"] = i.get("GrossCharge")
                contract = i.get("Contracts").get("Contract")
                row.update(get_contracts(contract, patientservicetype))
                retlist.append(row)
    return {"results": retlist}
        

## Final Results
The expected output of this section is one CSV file named "summary.csv" that summarizes all of the charges:
- No DRG data
- No standardization of plan names
- No sorting/arrangement of data in the output

It is expected that an Excel or Google Sheets user can apply the appropriate filters to study the CSV and compare plans.

In [19]:
import pandas as pd

chargesummary1 = summary_price_json("/data/whiteriver.json") 
df = pd.json_normalize(chargesummary1['results'])
df.to_csv("samplecsv1.csv")

chargesummary2 = summary_price_xmljson("/data/saline.xml") 
df = pd.json_normalize(chargesummary2['results'])
df.to_csv("samplecsv2.csv")

# Merge the 2 returned structures 
mergeddata = {"results": chargesummary1["results"] + chargesummary2["results"]}
df = pd.json_normalize(mergeddata['results'])
df.to_csv("summary.csv")




## Questions

* What are MSDRG and DRG in the two data files?
* How can plan names be standardized for ease of comparison?
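One possible starting point for the plan-name question is to normalize case and punctuation and then map known variants to a single label. The sketch below is only an illustration: the `ALIASES` table and the example names are invented, and a real mapping would have to be built by inspecting the plan names that actually appear in the two files.

```python
import re

# Hypothetical alias table: normalized variant -> standard label.
ALIASES = {
    "aetna": "AETNA",
    "aetna health": "AETNA",
}


def normalize_plan_name(name):
    """Lowercase the name and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def standardize_plan_name(name):
    """Return the standard label for a plan name, falling back to the uppercased normalized name."""
    key = normalize_plan_name(name)
    return ALIASES.get(key, key.upper())


examples = ["Aetna ", "AETNA-Health", "QualChoice"]
standardized = [standardize_plan_name(n) for n in examples]
print(standardized)
```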



---
---

### STRETCH from March 2020 - For those looking for an additional challenge

The Coronavirus is creating quite the stir right now.  There are some sources suggesting that trends show it is going to be significantly more serious than SARS was back in the 2002 timeframe.  Here's one visualization trying to demonstrate that: https://www.reddit.com/r/China_Flu/comments/ev2b4v/i_updated_some_charts_comparing_this_outbreak/

Someone on Kaggle has generously already compiled a dataset based on information from Johns Hopkins about the Coronavirus outbreak.  https://www.kaggle.com/brendaso/2019-coronavirus-dataset-01212020-01262020  Create a Kaggle account, if you don't already have one.  Download this data set and then upload it to your Jupyter Home folder.  (The "up arrow" button is for uploading a file.)

Use Python's built-in `csv` module to read the data from this file and generate the following information: **what are the total confirmed cases in all of Mainland China as of the latest information in the data set?**  Some important things to note:
* Each entry for a given city has the **cumulative** number of cases.  So that column is not additive (it cannot be summed).  You'll have to find a way to filter your data for the last day for each city, then total those up.
* If you choose to parse the date column, you will want to look up how to do that using Python's `datetime` module, especially the `strptime` function.  https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior  Hint: you can parse a date string in the format 2/17/2020 using the code below.  The link above explains what format codes like `%m` and `%Y` mean.

```
from datetime import datetime
d = datetime.strptime('2/17/2020', '%m/%d/%Y')
```
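Because the counts are cumulative, one approach is to keep only the most recent row per location and then add those up. The sketch below shows that idea on invented data. The column names (`Province/State`, `Country/Region`, `Date last updated`, `Confirmed`) and every number are assumptions for illustration, so check the header row of the real file before reusing this.

```python
import csv
import io
from datetime import datetime

# Made-up sample standing in for the downloaded file; all values are invented.
sample_csv = """Province/State,Country/Region,Date last updated,Confirmed
Province A,Mainland China,1/25/2020,10
Province A,Mainland China,1/26/2020,25
Province B,Mainland China,1/25/2020,5
Province B,Mainland China,1/26/2020,8
Elsewhere,Other Country,1/26/2020,1
"""

latest = {}  # location -> (date, cumulative confirmed count)
for row in csv.DictReader(io.StringIO(sample_csv)):
    if row["Country/Region"] != "Mainland China":
        continue
    when = datetime.strptime(row["Date last updated"], "%m/%d/%Y")
    confirmed = int(row["Confirmed"])
    location = row["Province/State"]
    # Keep the row only if it is newer than what we have for this location.
    if location not in latest or when > latest[location][0]:
        latest[location] = (when, confirmed)

total_confirmed = sum(count for _, count in latest.values())
print(total_confirmed)
```

With a real file, `io.StringIO(sample_csv)` would be replaced by an open file handle.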

If you want to take this another step, **create a list of tuples that contain (observation date, total confirmed) totalled over all locations represented in the data**
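A per-date total can be sketched by grouping on the parsed date and summing across locations. This assumes each location has at most one row per date, so adding cumulative counts across locations for a single date is valid. The `records` list is invented data, not values from the Kaggle file.

```python
from collections import defaultdict
from datetime import datetime

# Invented (date text, location, cumulative confirmed) records.
records = [
    ("1/25/2020", "Province A", 10),
    ("1/26/2020", "Province A", 25),
    ("1/25/2020", "Province B", 5),
    ("1/26/2020", "Province B", 8),
]

totals = defaultdict(int)  # date -> total confirmed across locations
for date_text, _location, confirmed in records:
    observed = datetime.strptime(date_text, "%m/%d/%Y").date()
    totals[observed] += confirmed

# Sort by date to get a chronological list of (date, total) tuples.
totals_by_date = sorted(totals.items())
print(totals_by_date)
```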

---

## Check your work above

If you didn't get them all correct, take a few minutes to think through those that aren't correct.


## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions in the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [None]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week06_assignment_2.ipynb
    !git commit -a -m "Submitting the week 6 programming exercises"
    !git push
else:
    print('''
    
OK. We can wait.
''')


---

If the message above says something like _Submitting the week 6 programming exercises_ or _Everything up-to-date_, then your work was submitted correctly.