This repository contains all the code related to the 911 Nurse Triage Line randomized controlled trial. It will be updated as more analyses are done.
The code in this repository is written in Python and R. We used Python 3.8.6 and R 4.1.2. We manage dependencies using `poetry` and `renv`. Once Python and R are installed, you should be able to download and install all requirements with the following commands on Mac/Linux:

```sh
Rscript -e 'install.packages("renv")'
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python3 -
```
Or alternatively on Windows:

```powershell
Rscript -e 'install.packages("renv")'
(Invoke-WebRequest -Uri https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py -UseBasicParsing).Content | python3 -
```
Once these dependency managers are installed, you can then run the following commands to install all dependencies:
```sh
poetry install
Rscript -e 'renv::restore()'
```
Once dependencies have been installed, you can run the following commands to perform all analyses that went into the final paper:

```sh
poetry run ntl run-all -s 3  # NOTE: This will take a _long_ time to run
poetry run ntl run-all -s 4
```
There are several computations that are performed in this repository. Here we index them.
The notebook `DroppedCalls.ipynb` corresponds to Appendix C of our Pre-analysis Plan. It computes the proportion of calls assigned to the Nurse Triage Line that would go unanswered because all nurses were busy. This computation is based on historical data.
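The notebook works from historical call data, and its exact method is not restated here. As a rough illustration of the quantity being estimated, the classic Erlang-B formula gives the probability that an arriving call finds all servers (here, nurses) busy under Poisson arrivals; the staffing and call-volume numbers below are hypothetical, not from the study:

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability: chance an arriving call finds all servers busy.

    offered_load = arrival_rate * mean_handle_time (in Erlangs).
    Uses the standard numerically stable recurrence.
    """
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b

# E.g., 6 NTL calls/hour, 20-minute average nurse handle time, 3 nurses
# on duty (all numbers invented for illustration):
load = 6 * (20 / 60)            # 2 Erlangs of offered load
p_all_busy = erlang_b(3, load)  # ~0.21 under these made-up inputs
```

This is only a sketch of the underlying queueing idea; the notebook's estimate comes from historical data rather than a closed-form model.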
The notebook `PowerCalculations.ipynb` computes the minimum detectable effect based on a range of base rates and sample sizes that are plausible for our study. It corresponds to Appendix G of our Pre-analysis Plan.
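The standard minimum-detectable-effect approximation for a two-arm comparison of proportions can be sketched as follows; the base rate and sample size below are placeholders, not the study's actual values:

```python
from math import sqrt
from statistics import NormalDist

def mde_two_proportions(base_rate: float, n_per_arm: int,
                        alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate minimum detectable effect (in percentage points)
    for a two-arm trial with equal arm sizes and a binary outcome."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    se = sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)
    return z * se

# Hypothetical inputs for illustration:
mde = mde_two_proportions(base_rate=0.10, n_per_arm=1000)  # ~3.8 pp
```

The MDE shrinks with the square root of the sample size, which is why the notebook sweeps over a grid of base rates and sample sizes.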
The data contain PII, and we do not include these data in the repo. Instead, one can reproduce our results using the following directory structure for internal files:

- `data/private_data`: contains the raw form of files that we do not post publicly in the repo
- `data/public_data`: contains some publicly-accessible files, like the mapping of ICD codes to likely emergent/non-emergent status
- `data/intermediate_objects`: contains derived data produced by earlier scripts and read in by later scripts
We also make available all calls for EMS service in the District during 2016. This csv (`data/2016_EMS_Events.csv`) has only two columns: the time the call was received and the classification of the call.
- `000_constants.R`: defines base directories and the names of data directories/files. The function `get_mostrec` gets the most recently modified version of a file with a given prefix.
- `001_viz_utils.R`: data visualization utilities
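The `get_mostrec` helper is written in R; a minimal Python analog of the same idea (pick the most recently modified file matching a prefix) might look like this:

```python
from pathlib import Path

def get_mostrec(directory, prefix):
    """Return the most recently modified file in `directory` whose
    name starts with `prefix`, or None if no file matches.

    A Python sketch of the repo's R helper of the same name."""
    candidates = [p for p in Path(directory).iterdir()
                  if p.is_file() and p.name.startswith(prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

This pattern lets downstream scripts pick up the latest timestamped output of an upstream script without hard-coding dates in file names.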
- Takes in:
  - Credentials to access the Common Events Database (CAD)
  - A .csv file prepared by OCTO/OUC data scientist Nicole Donnelly (contains manually-fixed codes for what happens to calls or event codes)
- What it does:
  - If an argument to pull from the raw database is set to `True`, reads in raw data directly from the database and the OUC-cleaned data
  - Performs a left join that retains all rows from the raw data and adds the reconciled event codes. The id in the raw database is called `num_1`; in the OUC-cleaned data it is called `agency_event`. The output is called `ntl_summary_eval`
  - Calculates descriptive statistics about the N of participants in each group over time
  - Prints ids of NTL participants in the format needed for the FEMS SafetyPAD user interface (i.e., F1..., F1..., etc.). The user needs to log into the SafetyPAD user interface at dcfems.safetypad.com to then export the SafetyPAD data related to those ids. Note, we do this in batches of 1,000 IDs, since the SafetyPAD data is longform (each ID can have multiple rows corresponding to updates like ambulance dispatched, ambulance sent, etc.). The SafetyPAD UI has an export limit of 10,000 rows, so the batches of 1,000 help us conservatively stay under that limit
  - After the SafetyPAD data is exported and saved using the naming convention `safetypad_idsearch_batch*.csv`, reads in the SafetyPAD files and rowbinds them
  - Performs a left join that retains all rows from `ntl_summary_eval` and merges with the SafetyPAD data. The id for the left dataset is `num_1`; the id for the right dataset is `eResponse_03_Incident_Number`
- Outputs:
  - `ntl_withsafetypad.[pkl|parquet]`: result of the left join in step 6; output in either pkl or parquet form depending on a parameter
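The all-rows-retained left joins described above can be sketched with pandas. The key column names (`num_1`, `agency_event`) follow the description; the values and the `disposition`/`reconciled_code` columns are invented for illustration:

```python
import pandas as pd

# Toy stand-ins for the raw CAD pull and the OUC-cleaned event codes.
raw = pd.DataFrame({"num_1": ["F190001", "F190002", "F190003"],
                    "disposition": ["NTL", "NTL", "BLS"]})
cleaned = pd.DataFrame({"agency_event": ["F190001", "F190003"],
                        "reconciled_code": ["fixed_a", "fixed_b"]})

# Left join: every raw row is retained; unmatched rows get NaN codes.
# indicator=True adds a _merge column showing which side each row came from.
ntl_summary_eval = raw.merge(cleaned, how="left",
                             left_on="num_1", right_on="agency_event",
                             indicator=True)
```

The `_merge` indicator is a convenient way to audit how many raw calls lacked a reconciled code after the join.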
- Takes in:
  - `ntl_withsafetypad`: created in the previous script
  - AMR data (`amr_df.xlsx`; see email re data sources for token)
- What it does:
  - Reads in each dataset
  - Creates flags for which ids are in which dataset (for the AMR data, FEMSID is the identifier)
  - Renames AMR columns other than the ID with an `amr` prefix
  - Left joins the NTL base data (output from the previous notebook) to the AMR data
- Outputs:
  - `ntl_withsafetypad_withamr.pkl`: result of the left join in step 4
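The flag-then-prefix-then-join pattern in steps 2-4 can be sketched with pandas; `FEMSID` and `num_1` are the identifiers named above, while the other columns and values are invented:

```python
import pandas as pd

# Miniature invented AMR extract; FEMSID is its identifier per the notes above.
amr = pd.DataFrame({"FEMSID": ["F190001", "F190009"],
                    "unit": ["A12", "A07"]})
base = pd.DataFrame({"num_1": ["F190001", "F190002"]})

# Flag which base ids appear in the AMR data.
base["in_amr"] = base["num_1"].isin(amr["FEMSID"])

# Prefix every AMR column except the id, then left join onto the base data.
amr = amr.rename(columns={c: f"amr_{c}" for c in amr.columns if c != "FEMSID"})
ntl_withsafetypad_withamr = base.merge(amr, how="left",
                                       left_on="num_1", right_on="FEMSID")
```

Prefixing before the merge keeps the provenance of each column obvious in the combined file.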
- Takes in:
  - `ntl_withsafetypad_withamr.pkl`: output from the previous notebook
  - `dem_fromsafetyPAD.csv` and `dem_fromsafetyPAD_201911115.csv`: pulls from SafetyPAD utilizing the scripts in Section 200. Note that these files contain an identical set of people. However, the latter contains more fields and was pulled at a later date. We utilize both because the former was used for matching and validation purposes and we wish to exactly preserve our process
- What it does:
  - Cleans names from SafetyPAD
  - Cleans names from AMR
  - Cleans addresses
  - Cleans phone numbers
  - Categorizes which people have different identifiers (names from either source; addresses; phone numbers)
  - Creates a name string (FIRSTNAME_LASTNAME) and then does fuzzy matching to find high-probability matches of that string
- Outputs:
  - `identifiers_fordhcr` and `df_fordhcr_DOBsadded`: data with cleaned identifiers for DC's DHCF to probabilistically match to Medicaid records
  - `df_forrepeatcalls`: data used for the repeat calls analysis
  - `df_forambulanceuse`: data with cleaned identifiers to use for the ambulance use analysis in the script that follows
  - `df_forfuzzy`: for posterity, the data frame that feeds into the fuzzy matching analysis
  - `data_withmatches_amrupdates`: for posterity, the data frame that comes out of the fuzzy matching analysis
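The fuzzy matching of FIRSTNAME_LASTNAME strings can be illustrated with the standard library's `difflib`; the actual script may use a different string-similarity library, and the names below are invented:

```python
from difflib import SequenceMatcher

def best_fuzzy_match(query, candidates):
    """Return the candidate name string most similar to `query`,
    together with its similarity ratio (0-1)."""
    scored = [(c, SequenceMatcher(None, query, c).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

# Invented FIRSTNAME_LASTNAME strings:
names = ["JOHN_SMITH", "JON_SMITH", "MARY_JONES"]
match, score = best_fuzzy_match("JOHN_SMYTH", names)  # a likely typo match
```

In practice one would also set a minimum-score threshold below which candidates are treated as non-matches rather than accepting the best score unconditionally.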
- Takes in:
  - `df_forambulanceuse`: output from the previous script; contains all callers regardless of Medicaid match status
- What it does:
  - Since each call has multiple timestamped status events tied to it (so the 6,053 calls are tied to 9,599 events), aggregates to the call level two definitions of "ambulance sent": (1) an ambulance is sent at any point among the timestamped events; (2) an ambulance is the last timestamped event
  - Plots descriptive rates between T and C of three categories: (1) an ambulance is sent/dispatched and it transports the caller; (2) an ambulance is sent/dispatched but it does not transport the caller; (3) an ambulance is neither sent/dispatched nor transports the caller. It also breaks these statuses down by the call response code (e.g., advanced life support (very few calls), basic life support, etc.)
  - Estimates regressions, pooled across the entire period and separated by month, of the effect of randomization to treatment on: (1) whether an ambulance is dispatched (significantly lower in the treatment group); (2) whether an ambulance transports an individual (significantly lower in the treatment group)
- Output:
  - `callresponse_forposttx`: file used in later post-treatment bias diagnostic analyses; it has detailed ambulance statuses that allow us to investigate the extent to which failure to match to Medicaid records is differentially biased by treatment status
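The event-to-call aggregation with the two "ambulance sent" definitions can be sketched with pandas. The status labels and counts here are invented stand-ins for the SafetyPAD statuses:

```python
import pandas as pd

# Invented longform status events (one row per timestamped update).
events = pd.DataFrame({
    "call_id": [1, 1, 1, 2, 2],
    "ts":      [1, 2, 3, 1, 2],  # event order within a call
    "status":  ["received", "ambulance_sent", "transported",
                "received", "referred_to_nurse"],
})
sent_statuses = {"ambulance_sent", "transported"}

events = events.sort_values(["call_id", "ts"])
calls = events.groupby("call_id")["status"].agg(
    sent_any=lambda s: s.isin(sent_statuses).any(),   # definition (1)
    sent_last=lambda s: s.iloc[-1] in sent_statuses,  # definition (2)
)
```

Sorting by timestamp within each call before aggregating is what makes the "last event" definition well-defined.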
- Takes in:
  - `Member_Matches_wDHCF.xlsx`: Medicaid beneficiary file that contains all matches
  - `MedicareEnrollmentForNTLMembersList.csv`: data on Medicare enrollment
  - `df_fordhcr_DOBsadded`: for reference, a file created in 020
  - `callresponse_forposttx.csv`: most recent ids and create_date variable (treatment statuses)
  - `df_forrepeatcalls.csv`: constructed IDs from fuzzy matching (treatment statuses)
  - Several hand-reviewed matching files
- What it does:
  - Examines two types of matches: (1) cases where a single name_dob_id from the NTL call logs matches to multiple MedicaidSystemIDs from the beneficiary file, and (2) cases where a single MedicaidSystemID matches to multiple name_dob_ids. Indicators of matched first name, last name, and DOB are created to help adjudicate between cases that have multiple matches. The goal is to deduplicate matches within the data.
  - In the first case, in which one NTL ID matches multiple Medicaid IDs, the goal is to deduplicate these matches since they are not due to repeated NTL calls. These matches are further examined and deduplicated depending on the nature of their match (cases where there is no unique Medicaid ID for the top match value; cases where there is a unique Medicaid ID for the top match value and that Medicaid ID is the top match for only one NTL ID; and cases where there is a unique Medicaid ID but that Medicaid ID is the top match for multiple NTL IDs)
  - In the second case, in which one Medicaid ID matches multiple NTL IDs, there could be true matches due to repeat calls. Here, these matches are hand-coded and deduplicated accordingly.
  - Deduplicated participant data from the two cases above are then merged into one file, which is then merged with treatment statuses received from FEMS.
- Output:
  - `ntl_withmedicaidIDS_{}.csv`: final deduplicated, decorated NTL participants data, used in a later script to construct outcomes
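The score-and-keep-best step for case (1) (one NTL ID matching multiple Medicaid IDs) can be sketched as below. The column names `name_dob_id` and `MedicaidSystemID` come from the description above; the agreement-indicator columns and values are invented, and the real script additionally handles ties and hand-review:

```python
import pandas as pd

# Invented one-to-many matches: NTL id "A" matched two Medicaid ids.
matches = pd.DataFrame({
    "name_dob_id":      ["A",  "A",  "B"],
    "MedicaidSystemID": ["M1", "M2", "M3"],
    "first_match":      [1, 1, 1],
    "last_match":       [1, 0, 1],
    "dob_match":        [1, 1, 1],
})

# Score each candidate pair by how many identifier fields agree,
# then keep only the best-scoring Medicaid id per NTL id.
matches["score"] = matches[["first_match", "last_match", "dob_match"]].sum(axis=1)
dedup = (matches.sort_values("score", ascending=False)
                .drop_duplicates("name_dob_id", keep="first"))
```

This captures the adjudication logic only in miniature; the cases enumerated above (non-unique top matches, shared top matches) need the extra handling the script describes.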
- `061_subset_medclaims_outcomeswindow.R` (relies on a function in `060_clean_ntl_data.R`)
- Takes in:
  - `claimsdata_2018031920190301.csv`: original claims data
  - `ClaimsDataWithAdditionalFields20170901_To_20190930.csv`: more fields for the claims data
  - `ntl_withmedicaidIDS_{}.csv`: NTL participants data from the previous script
  - `MedicaidEnrollmentForNTLMembersList.xlsx`: Medicaid spells data
- What it does:
  - Since the claims data is large, this script subsets it to a manageable size by reducing the number of fields; the result is then dealt with in a subsequent script.
  - The previous merging script also used the Medicare file but did NOT merge the Medicaid enrollment spells due to their more complicated data structure. That is dealt with here; the main feature of those data is trying to get a measure of a person's length of time in Medicaid to adjust expenditures by.
- Output:
  - `Medicaid_analytic_peoplewclaims_%s.csv`: the main outcomes data, which includes beneficiaries with any claims within 6 months of the call and assorted beneficiary information
  - `Medicaid_analytic_precallclaims_%s.csv`: second dataset with claims in the 6 months before the call (for heterogeneous effects analysis)
  - `Medicaid_staticattributes_%s.csv`: third dataset with beneficiary demographic information (regardless of whether the beneficiary had claims 6 months before or after the call) and spell information. Note that since static attributes were provided alongside claims data, they are only observed for beneficiaries with claims
  - `all_analytic_firstcall_%s.csv`: dataset with participants' first call, used to filter
- Takes in:
  - `Medicaid_analytic_peoplewclaims_%s.csv`: main outcomes data created in the previous script
  - `ntl_withmedicaidIDS_{}.csv`: NTL participants data from the 050 script
  - `all_analytic_firstcall_%s.csv`: dataset from the previous script with participants' first call, used to filter
  - `Medicaid_analytic_precallclaims_%s.csv`: beneficiary data from the previous script with claims in the 6 months before the call
  - `Medicaid_staticattributes_%s.csv`: dataset from the previous script with beneficiary demographic information (regardless of whether the beneficiary had claims 6 months before or after the call)
  - `nyu_ed.xlsx`: public ED visit codes from NYU
- What it does:
  - Three analytical datasets are created: (1) people who matched to the beneficiaries file with claims within a 24-hour or 6-month window of their call; (2) people who matched to the beneficiaries file with no claims in either window; and (3) people who did not match to the beneficiaries file.
  - Next, claims associated with an ED visit and a general-care, non-ED use measure are coded. Then the appropriateness of each visit is coded using the NYU ED codes.
  - Aggregating up from the visit level to the patient level, binary outcomes are coded for all three groups mentioned in (1). The three groups are then rowbound, summarized, and beneficiaries are merged on MedicaidSystemID.
  - The same process as in (2) and (3) is used to code whether each line item is a PCP visit.
  - Expenditures are aggregated by beneficiary, information is added for beneficiaries with no claims, and the outcomes are rowbound and summarized. Indicators are created for different quantiles of pre-call expenditures.
  - Finally, all outcomes are merged into one dataset (beneficiaries with imputed values for non-matches).
- Output:
  - `ptlevel_beneficonly.csv`: final outcome dataset (beneficiaries only)
  - `ptlevel_forrobust.csv`: final outcome dataset (beneficiaries with imputed values for non-matches)
- Takes in:
  - `ptlevel_beneficonly.csv`: final outcome dataset from the previous script (beneficiaries only; main analytic data)
  - `ptlevel_forrobust.csv`: final outcome dataset from the previous script (beneficiaries with imputed values for non-matches; bounding data)
- What it does:
  - Cleans the main analytic data and bounding data for binary outcomes
  - Creates a basic descriptive plot of the joined data
  - Runs regressions per our specification of the formula outcome ~ treatment, first with all binary outcomes and then with continuous outcomes
  - Plots the binary outcomes
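For a single binary treatment regressor and no covariates, the OLS slope in outcome ~ treatment equals the treatment-control difference in means, which the following stdlib sketch computes (the outcome vectors are invented; the actual scripts run full regressions in R):

```python
from math import sqrt
from statistics import mean, pvariance

# Invented binary outcomes for the two randomized arms:
y_treat   = [1, 0, 0, 0, 1, 0]
y_control = [1, 1, 0, 1, 0, 1]

# OLS coefficient on treatment == difference in group means.
effect = mean(y_treat) - mean(y_control)

# Large-sample standard error for the difference in means.
se = sqrt(pvariance(y_treat) / len(y_treat) +
          pvariance(y_control) / len(y_control))
```

This equivalence is why the simple outcome ~ treatment specification suffices under randomization: the coefficient is the unadjusted difference between arms.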
This code was written by Chrysanthi Hatzimasoura (chrysanthi.hatzimasoura@dc.gov), Rebecca Johnson (rebecca.johnson@dc.gov), Ryan T. Moore (@rtm-dc), and Kevin H. Wilson.
See LICENSE.md.