# BME 590 Final Project 

For your final project, you will combine many of the tools and skills that you have learned so far in the class to perform an analysis on the MIMIC-III database. Since its release, the MIMIC-III database has been used to study many applications of machine learning and has led to many research publications. Your task will be to draw upon this research and perform your own analysis, or to choose your own problem and solve it using methods you have learned in the class.

# Due Date:

All items noted in the deliverables section are due at 2:00PM EST on May 3rd, 2019. Instructions on submitting your work are also described in the deliverables section.

## Goals

The goals of the course, as laid out in the syllabus, are to be able to perform the following:

 * Ask appropriate questions/Generate hypotheses
 * Extract raw data
 * Perform cleaning, reshaping, and exploratory data analysis
 * Statistical and Machine Learning Modeling
 * Assess model results
 * Report on results in reproducible fashion
 * Present results verbally to an audience

The final project will ask you to apply most of these skills.

## Project Teams

For the final project, you may work in teams or individually, although working in teams is encouraged. Group sizes can range from 1 to 4 people. We understand that it is easier to accomplish more as a group than as an individual, and this will be taken into account when grading the final project.

Please fill out your group members here as soon as possible:

https://docs.google.com/spreadsheets/d/1rXfac7lzC4cRJLDv8t9gCnGFXi3qEbzBBY-GsueNG3c/edit#gid=1353955844



# Project Tasks

As we have discussed in the course, there are many machine learning methods that can be used to improve hospital operations and to deliver better and more efficient care to patients. The MIMIC-III database has been provided to help answer some of these questions, and features data from intensive care unit hospital stays. 

Your task is to choose a specific problem of interest and to create a machine learning model that will predict an outcome that could improve a hospital. You will work with raw data from the MIMIC-III database and tune a model for a specific use case. Then, you will describe how this model may be used as a potential solution to the problem you are trying to solve. 

For all the tasks, there have been examples of groups that have published or otherwise released their results, which you are welcome to draw upon. More information on this is provided in the Resources section.

Below are three such tasks that may be of interest:

### Inpatient or 30-day Mortality - Short Term Mortality Prediction

Both inpatient (during a hospital admission) and 30-day mortality are important measures on which hospitals are graded across the country. Mortality rates in the inpatient setting and within 30 days of admission are key factors that can influence how well a hospital is perceived. Therefore, many hospitals are working to predict which patients will die within a patient encounter or within 30 days of admission. If a patient's risk could be made available to a physician at key points during the encounter, a hospital may make certain decisions to try to mitigate the risk of a patient being counted against them, such as consulting palliative care or other administrative measures.

#### Task:
Using the data from the MIMIC-III dataset, create a machine learning model that predicts whether a patient will die within a hospital stay or within 30 days of a patient admission. When doing this, keep in mind *when* the model will be used. For example, if all of the data from the encounter is used to predict if a patient dies at the end of the encounter, then using that model in production may be too late, as the patient will have already died.
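One way to respect the timing constraint described above is to build features only from events recorded before the moment the prediction would be made. The sketch below is a minimal illustration rather than a required approach. It assumes an events table with `HADM_ID` and `CHARTTIME` columns (as in `CHARTEVENTS` or `LABEVENTS`) and an admissions table with `HADM_ID` and `ADMITTIME`; please verify these names against the table documentation. The function name `events_before_cutoff`, the 24-hour default, and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for an events table and ADMISSIONS; values are made up.
events = pd.DataFrame({
    "HADM_ID": [100, 100, 100, 200],
    "CHARTTIME": ["2150-01-01 10:00:00", "2150-01-02 07:00:00",
                  "2150-01-03 09:00:00", "2150-03-10 13:00:00"],
    "VALUENUM": [80, 95, 110, 72],
})
admissions = pd.DataFrame({
    "HADM_ID": [100, 200],
    "ADMITTIME": ["2150-01-01 08:00:00", "2150-03-10 12:00:00"],
})


def events_before_cutoff(events, admissions, hours=24):
    """Keep only events charted no later than `hours` after admission."""
    merged = events.merge(
        admissions[["HADM_ID", "ADMITTIME"]], on="HADM_ID", how="left"
    )
    charttime = pd.to_datetime(merged["CHARTTIME"])
    cutoff = pd.to_datetime(merged["ADMITTIME"]) + pd.Timedelta(hours=hours)
    # Events without a matching admission have a missing cutoff and are dropped.
    mask = charttime <= cutoff
    return merged.loc[mask, events.columns].reset_index(drop=True)


first_day = events_before_cutoff(events, admissions, hours=24)
print(first_day)
```

The cutoff you choose (for example 24 or 48 hours after admission) defines when the model could realistically be used, so it is worth stating explicitly in your process book.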

### 1-year Mortality - Long Term Mortality Prediction

Oftentimes, health systems are interested in something known as population health, where the health outcomes of patients are tracked at the population level, and there is an emphasis on [continuity of care](https://www.aafp.org/about/policies/all/definition-care.html). In these settings, it is important to know what a patient's health will look like 6 months or even 1 year from the current time. 

In many settings, it can be easy to simply try to treat the condition that the patient presents with. However, if physicians could be alerted to the risk of the patient dying within the next year, there may be a specialized course of action that could be recommended, such as home health, palliative care consults, or regular subspecialty visits to manage the patient's health outside of the hospital. 

#### Task:
Using the data from the MIMIC-III dataset, create a machine learning model that predicts whether a patient will die within a year of their admission.
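A mortality label can be derived by comparing each admission time with the patient's date of death. The sketch below is one possible way to do this, not the required method. It assumes that `ADMISSIONS` provides `SUBJECT_ID`, `HADM_ID`, `ADMITTIME`, and `HOSPITAL_EXPIRE_FLAG`, and that `PATIENTS` provides `SUBJECT_ID` and `DOD`; please confirm these against the table documentation. The function name `add_mortality_labels`, the output column names, and the toy values are hypothetical. The same function covers the 30-day task by changing `window_days`.

```python
import pandas as pd

# Toy stand-ins for ADMISSIONS and PATIENTS; values are made up.
admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3],
    "HADM_ID": [100, 200, 300],
    "ADMITTIME": ["2150-01-01 08:00:00", "2150-03-10 12:00:00",
                  "2150-06-05 09:30:00"],
    "HOSPITAL_EXPIRE_FLAG": [0, 1, 0],
})
patients = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3],
    "DOD": [None, "2150-03-15 00:00:00", "2150-12-01 00:00:00"],
})


def add_mortality_labels(admissions, patients, window_days=365):
    """Add in-hospital and within-window mortality labels per admission."""
    df = admissions.merge(
        patients[["SUBJECT_ID", "DOD"]], on="SUBJECT_ID", how="left"
    )
    admit = pd.to_datetime(df["ADMITTIME"])
    dod = pd.to_datetime(df["DOD"])
    df["died_in_hospital"] = df["HOSPITAL_EXPIRE_FLAG"].astype(int)
    # A missing date of death gives NaN days, which compares as False (label 0).
    days_to_death = (dod - admit).dt.total_seconds() / 86400
    df["died_within_window"] = (days_to_death <= window_days).astype(int)
    return df


one_year = add_mortality_labels(admissions, patients, window_days=365)
thirty_day = add_mortality_labels(admissions, patients, window_days=30)
print(one_year[["HADM_ID", "died_in_hospital", "died_within_window"]])
```

Note that a patient with several admissions gets one label per admission, so you may want to decide up front whether to model every admission or only one per patient.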

### 30-Day Readmission

The Centers for Medicare & Medicaid Services (CMS) has placed an emphasis on [reducing 30-day readmissions](https://www.cms.gov/medicare/quality-initiatives-patient-assessment-instruments/hospitalqualityinits/outcomemeasures.html). 30-day readmissions are particularly bad because if a patient is sick enough to come back within 30 days, it is likely that they were not stable enough to be discharged in the first place. Therefore, there are large financial penalties for having too many 30-day readmissions. 

One potential solution that may help solve this problem is to provide physicians with a list of patients that are at high risk of readmission each day, so that they can take special care of those patients and be cautious about discharging them too early. In this way, the hospital can avoid having 30-day readmissions, which are often unnecessary and can waste space in the hospital.

#### Task:

Using the data from the MIMIC-III dataset, create a machine learning model that predicts whether a patient will be readmitted within 30 days. 
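A readmission label can be built by looking, for each admission, at the same patient's next admission. The sketch below is a minimal illustration under stated assumptions: it expects `SUBJECT_ID`, `HADM_ID`, `ADMITTIME`, and `DISCHTIME` columns as in the `ADMISSIONS` table. The function name `add_readmission_label`, the output column name, and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for ADMISSIONS; values are made up.
admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1, 2],
    "HADM_ID": [10, 11, 12, 20],
    "ADMITTIME": ["2150-01-01", "2150-01-20", "2150-06-01", "2150-02-01"],
    "DISCHTIME": ["2150-01-05", "2150-01-25", "2150-06-04", "2150-02-03"],
})


def add_readmission_label(admissions, window_days=30):
    """Flag admissions followed by another admission within `window_days`."""
    df = admissions.copy()
    df["ADMITTIME"] = pd.to_datetime(df["ADMITTIME"])
    df["DISCHTIME"] = pd.to_datetime(df["DISCHTIME"])
    df = df.sort_values(["SUBJECT_ID", "ADMITTIME"]).reset_index(drop=True)
    # Next admission time for the same patient (missing for the last stay).
    next_admit = df.groupby("SUBJECT_ID")["ADMITTIME"].shift(-1)
    gap_days = (next_admit - df["DISCHTIME"]).dt.total_seconds() / 86400
    # A missing gap compares as False, so last stays are labeled 0.
    df["readmit_within_window"] = (gap_days <= window_days).astype(int)
    return df


labeled = add_readmission_label(admissions)
print(labeled[["SUBJECT_ID", "HADM_ID", "readmit_within_window"]])
```

This sketch does not handle patients who died during the admission (and therefore cannot be readmitted) or planned readmissions. How to treat those cases is a modeling decision worth explaining in your process book.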


# Deliverables:

There are two main deliverables for this project: 
 * A Process Book in the form of a Jupyter notebook (or several)
 * A verbal presentation of your work (described below)

## Process Book Instructions:

In the process book, please include the following information:

 * __Overview and Background:__ Describe the problem and what the motivation for solving the problem is. Imagine that your reader will be someone who is not familiar with the problem.
 * __Data:__ Describe how you extracted, cleaned, and transformed the data to answer the question.
 * __Exploratory Data Analysis:__ Explain the different ways that you looked at the data to understand it better. Provide visualizations that help to present the data in a way that is compelling.
 * __Modeling:__ Explain how you set up the modeling problem, which models you tried, how you set up parameter tuning, and what your results/metrics were.
 * __Suggestions:__ If the model performed well, how would you implement it to solve a problem? If it does not perform well enough, explain why you think that is. 

If you choose to host your process book on GitHub, __PLEASE MAKE SURE YOU DO NOT COMMIT THE RAW DATA__. The MIMIC dataset requires special permission to use and should not be made available to the general public, as this is a violation of the agreement that you signed at the beginning of the year. Use `.gitignore` files to make sure that you do not accidentally commit the data. If you do accidentally commit the data, please immediately delete your repository and create a new one.

## Verbal Presentation:

In addition to creating a process book, you will present your work using a PowerPoint presentation to a panel with experience working in healthcare. The presentation will be **10** minutes long and will consist of the following:

 * ~7 minutes on the overall problem, and your solution, geared towards a _non-technical_ audience. 
 * Remaining ~3 minutes on technical details of model implementation, choice of model, hyperparameters, cross-validation, etc.
 
Information about how to schedule your session will be released as soon as groups are finalized.

## Submission:

There are two main components to the submission:
  * Process Book
  * Verbal Presentation

Please put all Jupyter notebook files and any presentation materials into a folder, compress it into a zip archive, and name the archive as follows:

__lastname-firstname-final.zip__
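If you prefer to build the archive from Python, the standard library can do it. The sketch below is only one way to package the folder; the helper name `package_submission` and the placeholder file names are hypothetical. Before zipping, double-check that the folder does not contain any raw MIMIC data.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_submission(folder, lastname, firstname):
    """Zip the contents of `folder` as lastname-firstname-final.zip next to it."""
    folder = Path(folder)
    base_name = folder.parent / f"{lastname}-{firstname}-final"
    # make_archive appends the ".zip" extension and returns the archive path.
    archive = shutil.make_archive(str(base_name), "zip", root_dir=folder)
    return Path(archive)


# Demo in a temporary directory with placeholder files.
workdir = Path(tempfile.mkdtemp())
project = workdir / "final_project"
project.mkdir()
(project / "process_book.ipynb").write_text("{}")
(project / "slides.pptx").write_bytes(b"")

archive = package_submission(project, "lastname", "firstname")
with zipfile.ZipFile(archive) as zf:
    archive_contents = sorted(zf.namelist())
print(archive.name, archive_contents)
```

The archive is written next to the project folder rather than inside it, so it never ends up containing itself.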

Each group only needs to submit one copy of the final project, so please designate one person to submit the assignment on Sakai. We will pull the remaining group members from the group IDs.


# Grading

Your grade will be split between the content of the process book and the verbal presentation. 

85% of the grade will come from the process book, where we will assess your understanding of the problem, your approach to solving it, clarity of code, and other factors.

The other 15% of the grade will come from the presentation, and will assess the clarity and effectiveness of your presentation along with the appropriateness of the information during the non-technical versus technical portions. 

If you are working in a group, each member must present roughly equally during the presentation. We will take into account familiarity with English when grading. 


# Data

The data is available through the Physionet website as `.csv` files at the following location:
 https://physionet.org/pnw/login
 
You will need to use the login that you received when you first registered for MIMIC to log in.

The database of interest here is the MIMIC-III Clinical Database.

Details about the tables and what data they contain can be found here:
https://mimic.physionet.org/mimictables/admissions/

If you wish to access the data through a Duke resource, please send a note to michael.gao@duke.edu and I will provide you with access to a Box Folder that contains the raw data. 

You may notice that the data is compressed into `.csv.gz` files. To decompress them, use a utility such as [7-zip](https://www.7-zip.org/) on Windows, or run `gzip -d <name_of_file.csv.gz>` on the command line, replacing the text between the brackets with the name of the file you want to decompress.
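The same decompression can also be done from Python with the standard library. The sketch below is a minimal example; the helper name `gunzip` and the sample file are hypothetical. Unlike `gzip -d`, it leaves the compressed file in place.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file next to the original and return the new path."""
    src = Path(src)
    # "EXAMPLE.csv.gz" -> "EXAMPLE.csv" when no destination is given.
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams; never loads the whole file
    return dest


# Demo with a tiny made-up file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "EXAMPLE.csv.gz"
with gzip.open(sample, "wt", newline="") as f:
    f.write("ROW_ID,SUBJECT_ID\n1,10\n2,11\n")

extracted = gunzip(sample)
print(extracted.name)
print(extracted.read_text())
```

For the smaller tables, decompressing may not be necessary at all: `pd.read_csv()` infers gzip compression from the `.gz` extension and can read the compressed file directly.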

Please note that some of these files are *quite* large. The notable offender is the `chartevents.csv.gz` file. Uncompressed, this file may take up more than 33GB of disk space, so be careful. 

If you wish to use something from this file, you most likely will not be able to read the whole file into memory using `pd.read_csv()`. In this case, you may have to use command line utilities to extract the rows of interest. You may look into tools like `grep`, `awk`, and `sed` on Linux/Mac, or `findstr` on Windows. If you run into issues, please post on Piazza and we will assist in this process.
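As an alternative to the command line tools, the rows of interest can be streamed out of the compressed file in Python without ever holding the whole table in memory. The sketch below is one possible approach, assuming the file has a header row and an `ITEMID` column as in `CHARTEVENTS`. The helper name `filter_rows`, the file names, and the `ITEMID` values in the demo are placeholders; look up the item IDs you actually need in the `D_ITEMS` table.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def filter_rows(src_gz, dest_csv, column, keep_values):
    """Stream a gzipped CSV and write only rows whose `column` is in keep_values."""
    keep = {str(v) for v in keep_values}
    kept = 0
    with gzip.open(src_gz, "rt", newline="", encoding="utf-8") as f_in, \
         open(dest_csv, "w", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:  # one row at a time, so memory use stays small
            if row[column] in keep:
                writer.writerow(row)
                kept += 1
    return kept


# Demo with a tiny made-up file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
source = workdir / "events_sample.csv.gz"
with gzip.open(source, "wt", newline="", encoding="utf-8") as f:
    f.write("ROW_ID,SUBJECT_ID,ITEMID,VALUENUM\n"
            "1,10,211,80\n2,10,618,18\n3,11,220045,75\n4,11,618,20\n")

subset = workdir / "events_subset.csv"
n_kept = filter_rows(source, subset, "ITEMID", [211, 220045])
subset_lines = subset.read_text(encoding="utf-8").splitlines()
print(n_kept, subset_lines)
```

If you would rather stay in pandas, `pd.read_csv()` also accepts a `chunksize` argument that returns the file in pieces, which can be filtered one chunk at a time in the same spirit.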



# Resources and Citing

As mentioned before, a significant amount of work has been done on the MIMIC dataset, and there are many useful repositories on GitHub that can help you define features and serve as a point of reference if you are just getting started. Below are some publications and GitHub repositories with some notes to help you get started.

If you use any resources from these repositories or papers, **make sure you cite them in your process book**. 


### GitHub Repos

https://github.com/YerevaNN/mimic3-benchmarks/tree/master/mimic3benchmark/resources

This GitHub repository builds out many of the benchmarks for MIMIC-III that other researchers can use to test their models. However, the method that it uses is quite complicated. The part of this repository that is extremely useful is the resources section, which contains a map from itemids to variables, which can be used to build features that would otherwise be difficult to build.

https://github.com/YaronBlinder/MIMIC-III_readmission

If you work on the readmission model, this is a good starting resource for building out the readmission work. However, much of the code in this repository is unfinished and can be improved upon. It also provides some AUC graphs for the gradient boosted tree model that was run, which can be a helpful reference for your own performance.

https://github.com/alistairewj/reproducibility-mimic/tree/master/queries/mp

Some of the SQL queries in this repository may also help when thinking of how to build features from the data.

### Papers 

Real-time mortality prediction in the Intensive Care Unit:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977709/

Predicting Mortality in Diabetic ICU Patients Using Machine Learning and Severity Indices:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961793/

Mortality Prediction in the ICU Based on MIMIC-II Results from the Super ICU Learner Algorithm (SICULA) Project:
https://link.springer.com/chapter/10.1007/978-3-319-43742-2_20

Early hospital mortality prediction using vital signals:
https://www.sciencedirect.com/science/article/pii/S2352648318300357

Scalable and accurate deep learning with electronic health records:
https://arxiv.org/pdf/1801.07860.pdf (Google paper; not on MIMIC, but a good paper nonetheless)