# Analyzing Data Using the FHIR Bulk Data Export API

The Bulk Data API is still very early in its development, which you can follow here: https://github.com/smart-on-fhir/bulk-data-server.git.

In this notebook, we'll implement a very simple client that can access and download [FHIR bulk data](http://wiki.hl7.org/index.php?title=201801_Bulk_Data) from the [Demo SMART Bulk Data Server](https://bulk-data.smarthealthit.org). This notebook is based on a [FHIR Connectathon Project](https://github.com/plangthorne/python-fhir/blob/master/demo/BulkDataDemo.ipynb) that was also set up to implement simple authentication according to the [SMART Authorization Guide protocol](http://docs.smarthealthit.org/authorization/backend-services/), but we're ignoring this aspect since it's a bit technical and out of scope for this class.

## Initial Setup
1. Download and install Anaconda: https://www.anaconda.com/distribution/

2. Install the `requests` package: `pip install requests`

3. Clone the project from GitHub: `git clone https://github.com/uw-fhir/bulk-fhir-tutorial.git`

4. Open `bulk-fhir-tutorial.ipynb` in JupyterLab or Jupyter Notebook

## Server Configuration

We'll start by reading required config parameters, which define the FHIR server and other options. We'll be testing against the [Demo SMART Bulk Data Server](https://bulk-data.smarthealthit.org). 

In [60]:
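import yaml

# safe_load parses plain YAML without constructing arbitrary Python objects;
# recent PyYAML versions also require an explicit Loader for yaml.load
with open('config.yaml') as f:
    config = yaml.safe_load(f)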

# Generating a Data Analysis Pipeline

Creating maintainable and comprehensive pipelines for data analysis, where the process is as automated as possible, is incredibly useful for rapid iteration, reproducibility, updates, clarity, and documentation.

To create this type of pipeline, we want to be able to query for the required data at the source, automatically transform this data into a suitable format, and load it into the desired analysis software.

In this app, we'll demo this approach using the nascent Bulk FHIR API and Python to:
1. Request the data we want to analyze from the FHIR server
2. Access and transform the data for analysis
3. Analyze the data using R

## 1. Explore the Dataset

First, we're going to use the [SMART Patient Browser](https://patient-browser.smarthealthit.org/index.html?config=r3#/) tool to explore the patients and associated data. 

Click on the tool and play around with it a bit, clicking on the different listed [Patients](https://www.hl7.org/fhir/patient.html) and then exploring their associated FHIR Resources like [Immunizations](https://www.hl7.org/fhir/immunization.html) or [Encounters](https://www.hl7.org/fhir/encounter.html). 

## 2. Generate the Query

Now that we have a feel for the data, we need to decide what specific resources we're interested in and generate a query by using the proper [FHIR Bulk Data query parameters](https://github.com/smart-on-fhir/fhir-bulk-data-docs/blob/master/export.md#query-parameters). 

We can first try out our downloads by using a very simple tool made by the SMART folks: [The FHIR Bulk Downloader](https://bulk-data.smarthealthit.org/sample-app/index.html?server=https%3A%2F%2Fbulk-data.smarthealthit.org%2FeyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MX0%2Ffhir).

We'll focus on the Patient-level export since we're only interested in data that is in some way associated with a Patient (see https://github.com/smart-on-fhir/fhir-bulk-data-docs/blob/master/export.md#query-parameters for more information), and want to download resources that will be useful for analysis. 

In this example, I'll be looking at [Immunizations](https://www.hl7.org/fhir/immunization.html). 

1. Go to [The FHIR Bulk Downloader](https://bulk-data.smarthealthit.org/sample-app/index.html?server=https%3A%2F%2Fbulk-data.smarthealthit.org%2FeyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MX0%2Ffhir)

2. Select the desired resources, patient groups, and time frame. 

3. Notice how each selection modifies the download link url.

4. Try modifying the url yourself using the available query parameters, and see what happens after pressing `Download`. 
   
   For example, type this in: `https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MX0/fhir/Patient/$export?_type=Patient,Immunization&_typeFilter=Immunization%3Fvaccine-code%3D140,Patient%3Fgender=female`
   
5. Note this generated url string for use later in the tutorial to get the desired dataset. I will be using the query above. 
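The same kind of URL can also be assembled in code. The following is a minimal sketch using only the standard library; the function name `build_export_url` and the `https://example.org/fhir` base URL are hypothetical and used purely for illustration. It assumes that each `_typeFilter` value should be fully percent-encoded, so the output differs slightly from the partially encoded example above.

```python
from urllib.parse import quote

def build_export_url(base_url, types, type_filters, export_level="/Patient/$export"):
    """Assemble a Bulk Data export URL from resource types and type filters."""
    # Each filter is itself a FHIR search query, e.g. "Immunization?vaccine-code=140".
    # Percent-encode every filter so its "?" and "=" are not mistaken for parts
    # of the outer query string; the commas separating filters stay literal.
    encoded_filters = ",".join(quote(f, safe="") for f in type_filters)
    query = "_type=" + ",".join(types) + "&_typeFilter=" + encoded_filters
    return base_url.rstrip("/") + export_level + "?" + query

# Hypothetical base URL, not the demo server
url = build_export_url(
    "https://example.org/fhir",
    ["Patient", "Immunization"],
    ["Immunization?vaccine-code=140", "Patient?gender=female"],
)
print(url)
```

Swapping in the demo server's base URL and your own resource types should give a string comparable to the one produced by the downloader tool.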

## 3. Send Bulk Data Request

Now that we know what query string we'll use to access our desired dataset, we can send a request to the FHIR endpoint which will tell the endpoint to start compiling the data we require. 

In [62]:
import requests

fhir_endpoint = config["server"]
export_level = '/Patient/$export'
types = ["Patient", "Immunization"]
typeFilters = ["Patient?gender=female", "Immunization?vaccine-code=140"]

# 'Prefer: respond-async' asks the server to run the export in the background
headers = {'Accept': 'application/fhir+json', "Prefer": 'respond-async'}

# Build the query string by hand and pass it as a string, so that requests
# keeps the commas and '?' characters readable (compare r.url in the output)
payload = {'_type': ",".join(types), '_typeFilter': ",".join(typeFilters)}
payload_str = "&".join("%s=%s" % (k,v) for k,v in payload.items())

request_url = fhir_endpoint + export_level

r = requests.get(request_url, headers=headers, params=payload_str)

display(r)
display(r.url)
display(r.json())

<Response [202]>

'https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MX0/fhir/Patient/$export?_type=Patient,Immunization&_typeFilter=Patient?gender=female,Immunization?vaccine-code=140'

{'resourceType': 'OperationOutcome',
 'text': {'status': 'generated',
  'div': '<div xmlns="http://www.w3.org/1999/xhtml"><h1>Operation Outcome</h1><table border="0"><tr><td style="font-weight:bold;">information</td><td>[]</td><td><pre>Your request have been accepted. You can check it\'s status at &quot;https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwidHlwZSI6IlBhdGllbnQsSW1tdW5pemF0aW9uIiwiaWQiOiI2NzcwMDFjYjY5YTRkOGNjMjE2ZGEyYjk0YjNkN2JiNDhlYTI4MmQ5NGFmYzlmOTllN2UxZDQzZTkyYTI3ZTAzIiwicmVxdWVzdFN0YXJ0IjoxNTQ5MzUwNDQyNjI5LCJzZWN1cmUiOmZhbHNlLCJvdXRwdXRGb3JtYXQiOiJuZGpzb24iLCJncm91cCI6bnVsbCwicmVxdWVzdCI6Imh0dHA6Ly9idWxrLWRhdGEuc21hcnRoZWFsdGhpdC5vcmcvZXlKbGNuSWlPaUlpTENKd1lXZGxJam94TURBd01Dd2laSFZ5SWpveE1Dd2lkR3gwSWpveE5Td2liU0k2TVgwL2ZoaXIvUGF0aWVudC8kZXhwb3J0P190eXBlPVBhdGllbnQsSW1tdW5pemF0aW9uJl90eXBlRmlsdGVyPVBhdGllbnQ_Z2VuZGVyPWZlbWFsZSxJbW11bml6YXRpb24_dmFjY2luZS1jb2RlPTE0MCJ9/fhir/bulkstatus&quot;</pre></td></tr></table></div>'},

## 4. Wait for Data

Now that the server is compiling the datasets, we need to wait until the endpoint is ready to send the data. We repeatedly poll the status URL that the server returned in the `Content-Location` header until the export is done.

In [64]:
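from time import sleep

def parse_manifest(response):
    # The completed response lists one entry per generated file
    return [entry.get('url') for entry in response.json()['output']]

# The kick-off response tells us where to poll for the export status
location = r.headers['Content-Location']

while True:
    sleep(0.5)
    response = requests.get(location)
    display(response.status_code)
    # Stop with an exception on 4xx/5xx instead of polling forever
    response.raise_for_status()
    if response.status_code == 200:
        manifest = parse_manifest(response)
        break

display(manifest)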

200

['https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwidHlwZSI6IlBhdGllbnQsSW1tdW5pemF0aW9uIiwiaWQiOiI2NzcwMDFjYjY5YTRkOGNjMjE2ZGEyYjk0YjNkN2JiNDhlYTI4MmQ5NGFmYzlmOTllN2UxZDQzZTkyYTI3ZTAzIiwicmVxdWVzdFN0YXJ0IjoxNTQ5MzUwNDQyNjI5LCJzZWN1cmUiOmZhbHNlLCJvdXRwdXRGb3JtYXQiOiJuZGpzb24iLCJncm91cCI6bnVsbCwib2Zmc2V0IjowLCJsaW1pdCI6MTAwMDB9/fhir/bulkfiles/1.Immunization.ndjson',
 'https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwidHlwZSI6IlBhdGllbnQsSW1tdW5pemF0aW9uIiwiaWQiOiI2NzcwMDFjYjY5YTRkOGNjMjE2ZGEyYjk0YjNkN2JiNDhlYTI4MmQ5NGFmYzlmOTllN2UxZDQzZTkyYTI3ZTAzIiwicmVxdWVzdFN0YXJ0IjoxNTQ5MzUwNDQyNjI5LCJzZWN1cmUiOmZhbHNlLCJvdXRwdXRGb3JtYXQiOiJuZGpzb24iLCJncm91cCI6bnVsbCwib2Zmc2V0IjowLCJsaW1pdCI6MTAwMDB9/fhir/bulkfiles/1.Patient.ndjson']
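
The loop above polls at a fixed interval with no upper bound on the wait. For larger exports, a slightly more defensive version might look like the sketch below. The function name `wait_for_manifest` and its parameters are hypothetical, and the sketch assumes the server may send a `Retry-After` header expressed in seconds; when the header is absent it falls back to a fixed delay. The status request is passed in as a callable so the function does not depend on any particular HTTP client.

```python
import time

def wait_for_manifest(fetch_status, timeout=300.0, default_delay=1.0, sleep=time.sleep):
    """Poll until the export finishes, then return the list of file URLs.

    fetch_status is any zero-argument callable returning an object with
    .status_code, .headers and .json(), e.g. lambda: requests.get(location).
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status()
        if response.status_code == 200:
            # Export complete: the body is the manifest of output files
            return [entry["url"] for entry in response.json()["output"]]
        if response.status_code != 202:
            raise RuntimeError("Export failed with HTTP status %d" % response.status_code)
        # Still in progress: honour Retry-After when it is a number of seconds
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else default_delay
        if time.monotonic() + delay > deadline:
            raise TimeoutError("Export did not finish within %s seconds" % timeout)
        sleep(delay)
```

In this notebook it could be called as `manifest = wait_for_manifest(lambda: requests.get(location))`.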

## 5. Download the Data

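Now that we have a list of bulk data files, we can download each one and parse it line by line, since every line of an NDJSON file is a single JSON-encoded FHIR resource.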

We should close the connection in the underlying session prior to releasing the client. Alternatively the client also functions as a context manager for simplicity.

In [69]:
import json

def iterate_over_json(manifest):
    # Each manifest URL points to an NDJSON file: one JSON resource per line
    for url in manifest:
        data = requests.get(url)
        for line in data.iter_lines():
            if line:  # skip blank lines
                yield json.loads(line)

json_data = iterate_over_json(manifest)
next(json_data)


{'resourceType': 'Immunization',
 'id': 'bc109526-a37d-4e28-89af-d6794e1ea5f3',
 'status': 'completed',
 'notGiven': False,
 'vaccineCode': {'coding': [{'system': 'http://hl7.org/fhir/sid/cvx',
    'code': '140',
    'display': 'Influenza, seasonal, injectable, preservative free'}],
  'text': 'Influenza, seasonal, injectable, preservative free'},
 'patient': {'reference': 'Patient/ddf5ae5c-5646-4a76-9efd-f7e697f3b728'},
 'encounter': {'reference': 'Encounter/0780aaee-2233-4ee2-9033-7929ce78e5c0'},
 'date': '2005-04-23T16:33:20+00:00',
 'primarySource': True}

In [70]:
next(json_data)

{'resourceType': 'Immunization',
 'id': '2334bd5e-0af2-417c-bb35-e70646a1d7ac',
 'status': 'completed',
 'notGiven': False,
 'vaccineCode': {'coding': [{'system': 'http://hl7.org/fhir/sid/cvx',
    'code': '140',
    'display': 'Influenza, seasonal, injectable, preservative free'}],
  'text': 'Influenza, seasonal, injectable, preservative free'},
 'patient': {'reference': 'Patient/ddf5ae5c-5646-4a76-9efd-f7e697f3b728'},
 'encounter': {'reference': 'Encounter/1c093c58-6f5f-4da6-8d9b-8513b3accf46'},
 'date': '2006-04-29T16:33:20+00:00',
 'primarySource': True}
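
The records above are nested JSON, while the analysis step of the pipeline expects tabular data. One possible way to bridge the two is to flatten each Immunization into a row and write the rows as CSV, which R can read directly. The sketch below is only an illustration: the function names and the choice of columns are assumptions, and the sample record is a trimmed copy of the first resource shown above.

```python
import csv
import io

FIELDNAMES = ["id", "patient_id", "vaccine_code", "vaccine_display", "date", "status"]

def immunization_to_row(resource):
    """Flatten one Immunization resource into a flat dict of analysis columns."""
    # Use the first coding if present; fall back to an empty one otherwise
    codings = resource.get("vaccineCode", {}).get("coding") or [{}]
    coding = codings[0]
    return {
        "id": resource.get("id"),
        # "Patient/<id>" -> "<id>", so rows can be joined to Patient records
        "patient_id": resource.get("patient", {}).get("reference", "").split("/")[-1],
        "vaccine_code": coding.get("code"),
        "vaccine_display": coding.get("display"),
        "date": resource.get("date"),
        "status": resource.get("status"),
    }

def write_csv(resources, out):
    """Write the Immunization resources from an iterable to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for resource in resources:
        # The export mixes resource types, so keep only Immunizations here
        if resource.get("resourceType") == "Immunization":
            writer.writerow(immunization_to_row(resource))

# Trimmed copy of the first record shown above
sample = {
    "resourceType": "Immunization",
    "id": "bc109526-a37d-4e28-89af-d6794e1ea5f3",
    "status": "completed",
    "vaccineCode": {"coding": [{"system": "http://hl7.org/fhir/sid/cvx",
                                "code": "140",
                                "display": "Influenza, seasonal, injectable, preservative free"}]},
    "patient": {"reference": "Patient/ddf5ae5c-5646-4a76-9efd-f7e697f3b728"},
    "date": "2005-04-23T16:33:20+00:00",
}

buffer = io.StringIO()
write_csv([sample], buffer)
print(buffer.getvalue())
```

In the notebook, `write_csv(iterate_over_json(manifest), f)` with `f` opened on a file of your choosing (using `newline=""`, as the `csv` module recommends) would produce a table that can be loaded in R with `read.csv`.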