# 1. Project info

**Project title**: Identify patients with suspected infection using the electronic health record.
<!---Max 41 characters.-->

**Name:** JoAnn Alvarez

**E-mail:** joannmrudd@hotmail.com

**GitHub username**: ruddjm

**Link to prior writing**: https://github.com/ruddjm/data.table/blob/master/dataDotTable.ipynb ; Several posted here: http://biostat.mc.vanderbilt.edu/wiki/Main/JoAnnAlvarez
<!---Please supply a link/reference to some of your own already published educational writing. For example, in the form of a blog post, notebook, article, book or internal case study.-->

**Short description**: Flex your data.table muscle to classify patients with suspected infection using their electronic health record. <!--(111 chr. 110 limit.)-->

#### Long description ####

Sepsis is a deadly illness accounting for a large portion of in-hospital deaths. It occurs when a person's organs shut down in response to a severe infection. This public health problem is a major target for research, and hospital records represent a great opportunity for sepsis research. In this R project, you will identify hospital patients with severe infection using medical record data. This project assumes you know how to work with data frames using `data.table`.

<!---A longer description of the project, around four sentences in length. 
This will be read by the students on the DataCamp platform **before** deciding to start the project. It should mention some of the major prerequisites for completing the project (for example "familiarity with data frame" or "know how to use the `lm` function")-->

#### Datasets used ####

The project uses two data frames that I simulated in R to mimic the format of tables from EHR databases. Medication administration records are in `antibioticDT`, and blood culture records are in `blood_cultureDT`. The code and R data are located at https://github.com/ruddjm/datacamp_projects_identify_infection_ehr. I envision these data frames already being loaded in the workspace.

<!---Short description (and ideally links) to the datasets used in the project. This will be read by me (Rasmus) only.-->

#### Assumed student background ####

   * Students should have some experience using R and have been introduced to `data.table`. 
      * They should understand how indexing works in R (`[]`). 
      * The `merge` function in R. They need to understand the concept of merging, inner joins, left joins, and `by` variables (but not necessarily the `join` terminology).
      * Assignment in `data.table` using `:=`
      * Grouped aggregations in `data.table`
      * `shift` function in `data.table`
      
This should not require any knowledge of healthcare or medicine. 

<!---What background knowledge you assume the student doing this project will have. The more specific the better. This will be read by me (Rasmus) only. Please list things like modules, tools, functions, methods, and statistical concepts and jargon.

Not so useful: "The student has a basic familiarity with data frames."

More useful: "The student knows how to read in a csv file as a data frame and how to compute grouped summary statistics using `dplyr`."-->

# 2. Project narrative intro

## 1. This patient may have sepsis.

Sepsis is a deadly syndrome where a patient has a severe infection that causes organ failure. The medical community believes that the sooner septic patients are treated, the more likely they are to survive, but recognizing sepsis is difficult. Now that hospitals have a constant flow of data, it may be possible to use machine learning to automatically flag patients who are likely to be septic. Before any predictive algorithms can be developed, however, we need a reliable way to pick out the patients who had sepsis. 
 
In this project, we will find out which patients were suspected to have a severe infection using electronic health record (EHR) data. In other words, we will look into the hospital's records to see what happened during a patient's hospital stay, and try to figure out whether he or she had a severe infection. We will check whether the doctor ordered a blood test to look for bacteria (a blood culture) and also gave the patient a series of antibiotics. The basic idea is that a patient starts antibiotics within two days of a blood culture, and is given antibiotics for at least 4 days.

### Criteria for Suspected Infection*
   * Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed.
   * The sequence must start with a “new antibiotic” (not given in the prior 2 days).
   * The sequence must start within +/-2 days of a blood culture.  
   * There must be at least one IV antibiotic within the +/-2 day window period. (An IV drug is one that is given intravenously, not by mouth.)

Let's take a look at the antibiotic data, which is in a CSV file called `antibioticDT.csv`. We'll import it using `fread`, the `data.table` function for reading in data, and look at a few rows. Each row represents one time the patient was given a drug. The variables are the patient ID, the day the drug was administered, the type of drug, and the route of administration. For example, patient 10 received doxycycline by mouth (PO) on the first day of her stay.

In [6]:
library(data.table)
# load('ehr_dataframes.Rdata', verbose = TRUE)
antibioticDT = fread('antibioticDT.csv')
#blood_cultureDT = fread('blood_cultureDT.csv')
antibioticDT[1:30]

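| patient_id | day_given | antibiotic_type | route |
|---|---|---|---|
| 1 | 4 | ciprofloxacin | IV |
| 1 | 4 | penicillin | IV |
| 1 | 5 | penicillin | IV |
| 1 | 8 | penicillin | IV |
| 1 | 9 | doxycycline | IV |
| 1 | 11 | doxycycline | IV |
| 10 | 1 | doxycycline | PO |
| 10 | 2 | amoxicillin | IV |
| 10 | 4 | amoxicillin | IV |
| 10 | 4 | doxycycline | IV |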
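
As a side note on the import step, here is a minimal sketch of how `fread` handles ID columns that are written with leading zeros. The three-line input is made up for illustration and is not the project file; whether `antibioticDT.csv` actually stores IDs this way is an assumption.

```r
library(data.table)

# Hypothetical CSV content; IDs and days are invented for this example.
csv_lines = c("patient_id,day_given", "0001,4", "0010,1")

# By default the ID column is read as a number, so "0010" becomes 10.
ids_as_number = fread(text = csv_lines)

# Forcing the column to character keeps the IDs exactly as written.
ids_as_character = fread(text = csv_lines,
  colClasses = c(patient_id = "character"))

ids_as_number$patient_id
ids_as_character$patient_id
```

Either representation works for the steps that follow, as long as both tables store `patient_id` the same way before they are merged.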


## 2. Which antibiotics are "new"?

First let's find which rows represent 'new' antibiotics, by checking whether that particular antibiotic type was given in the past 2 days. We can use the `shift` function to look at data from other rows. Let's look at the data sorted by id, then antibiotic type, and finally day to visualize the task.

In [4]:
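# Sort so that repeat administrations of the same drug are adjacent.
setorder(antibioticDT, patient_id, antibiotic_type, day_given)
antibioticDT[1:40]

# Previous day the same patient received the same drug (NA if none).
antibioticDT[ , lastAdministrationDay := shift(day_given, 1), 
  by = .(patient_id, antibiotic_type)]

# Days since that previous administration.
antibioticDT[ , daysSinceLastAdmin := day_given - lastAdministrationDay]

# Flag as new unless the same drug was given within the prior 2 days.
antibioticDT[ , antibiotic_new := 1]
antibioticDT[daysSinceLastAdmin <= 2, antibiotic_new := 0]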
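
To see what the grouped `shift` does, the same steps can be run on a tiny made-up table. This is only an illustrative sketch: `toyDT` and its values are hypothetical and are not taken from `antibioticDT`.

```r
library(data.table)

# Made-up administrations for one patient.
toyDT = data.table(
  patient_id = 1,
  antibiotic_type = c("penicillin", "penicillin", "penicillin", "doxycycline"),
  day_given = c(4, 5, 8, 9))

setorder(toyDT, patient_id, antibiotic_type, day_given)

# Within each patient/drug group, shift() returns the previous row's day.
# The first administration of each drug has no previous row, so it gets NA.
toyDT[ , lastAdministrationDay := shift(day_given, 1),
  by = .(patient_id, antibiotic_type)]
toyDT[ , daysSinceLastAdmin := day_given - lastAdministrationDay]

# NA gaps are treated as FALSE in the row filter, so first administrations
# keep antibiotic_new = 1.
toyDT[ , antibiotic_new := 1]
toyDT[daysSinceLastAdmin <= 2, antibiotic_new := 0]
toyDT
```

In this toy table, only the penicillin dose on day 5 is flagged as not new, because the same drug was given one day earlier. The dose on day 8 counts as new again because the gap is three days.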

## 3. Looking at the blood culture data. 

Now let's look at the blood culture data. These data are in `blood_cultureDT.csv`. Let's start by reading it into the workspace and having a look at a few rows. Each row represents one day that a patient had a blood culture. For example, patient 6 had a culture on the first day of his hospitalization and again on the ninth. Notice that some patients from the antibiotic data are not in this data and vice versa. Some patients are in neither, because they received neither antibiotics nor a blood culture.

In [10]:
blood_cultureDT = fread('blood_cultureDT.csv')
blood_cultureDT[1:30]

| patient_id | blood_culture_day |
|---|---|
| 6 | 1 |
| 6 | 9 |
| 11 | 2 |
| 11 | 5 |
| 11 | 6 |
| 11 | 7 |
| 11 | 10 |
| 11 | 11 |
| 11 | 13 |
| 11 | 16 |


## 4. Combine the antibiotic data and the blood culture data

To find which antibiotics were given close to a blood culture, we'll need to combine the drug administration data with the blood culture data. Let's keep only patients that are still candidates for infection, so only those in both data sets.

A tricky part is that many patients will have had blood cultures on several different days. For each one of them, we are going to see if there's a sequence of antibiotic days close to it. For a given patient, we will match each blood culture to all of his antibiotics.

In [14]:
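# Inner join on patient_id; each culture is paired with each antibiotic row.
# allow.cartesian = TRUE permits this many-to-many result.
combinedDT = merge(
  blood_cultureDT,
  antibioticDT,
  all = FALSE,
  by = 'patient_id',
  allow.cartesian = TRUE)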

combinedDT
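
The many-to-many matching described above can be sketched on a small made-up example. The `toy_` tables are hypothetical. `allow.cartesian = TRUE` is included on the assumption that the real merge may produce more rows than `data.table` allows by default for joins on duplicated keys.

```r
library(data.table)

# Made-up records: patient 1 has 2 cultures and 3 antibiotic rows;
# patient 2 has an antibiotic row but no culture.
toy_cultureDT = data.table(
  patient_id = c(1, 1),
  blood_culture_day = c(3, 10))
toy_abxDT = data.table(
  patient_id = c(1, 1, 1, 2),
  day_given = c(4, 5, 11, 1))

# Inner join: patient 2 is dropped, and every culture for patient 1 is
# paired with every antibiotic row, giving 2 * 3 = 6 rows.
toy_combinedDT = merge(
  toy_cultureDT,
  toy_abxDT,
  all = FALSE,
  by = 'patient_id',
  allow.cartesian = TRUE)

nrow(toy_combinedDT)
```

Each row of the result is one (blood culture, antibiotic administration) pair, which is the unit the later window check operates on.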

## 5. Determine whether each row is in-window

Let's make a new variable indicating whether the antibiotic in a given row was administered within two days (before or after) of the blood culture it is paired with.

In [17]:
combinedDT[ , 
  drug_in_bcx_window := 
    day_given - blood_culture_day <= 2 
    & 
    day_given - blood_culture_day >= -2]
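
As a quick check of the window logic, the comparison can be tried on a few made-up culture/antibiotic pairs. This is a sketch only: `toyDT` and the column `in_window_abs` are hypothetical names, and the `abs()` form is shown as an equivalent alternative, not as the project's required solution.

```r
library(data.table)

# Made-up pairs; the day differences are 1, 2, 3, and -2.
toyDT = data.table(
  blood_culture_day = c(3, 3, 3, 10),
  day_given = c(4, 5, 6, 8))

# Two-sided comparison, written the same way as in the cell above.
toyDT[ , drug_in_bcx_window :=
  day_given - blood_culture_day <= 2 &
  day_given - blood_culture_day >= -2]

# Equivalent form using the absolute difference.
toyDT[ , in_window_abs := abs(day_given - blood_culture_day) <= 2]
toyDT
```

Only the pair with a difference of 3 days falls outside the window. Both columns agree on every row.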

*These criteria are a very simplified version of the ones given in a 2017 JAMA article by Rhee and others.

Rhee C, Dantes R, Epstein L, Murphy DJ, Seymour CW, Iwashyna TJ, Kadri SS, Angus DC, Danner RL, Fiore AE, Jernigan JA, Martin GS, Septimus E, Warren DK, Karcz A, Chan C, Menchaca JT, Wang R, Gruber S, Klompas M. Incidence and Trends of Sepsis in US Hospitals Using Clinical vs Claims Data, 2009-2014. JAMA. 2017;318(13):1241-1249.