Hospital Chargemaster Analysis
This is a small analysis using the hospital chargemaster Dinosaur Dataset. See the notebook for a quick example that shows there is interesting signal in the data for one hospital. We can build a linear model, perform feature selection with Lasso, and try to predict prices based on descriptor terms.
For this analysis, we want to try predicting the price of a given item (possibly for a given hospital) based on the chargemaster data. For example, I would expect items with the terms "brain" or "heart" to be more expensive than general medications like Advil (ibuprofen).
The approach we will take is to try a simple linear regression. I don't want to do the ultimate analysis, but rather to show you that the data is interesting.
- We start with data from one hospital. This keeps the data frame small enough to share on GitHub, and fast enough to run on my tiny local machine.
- We will then do stop word removal and make all terms lowercase.
- Then we will create a sparse data frame of words (columns) by the unique identifiers (rows). We can use scikit-learn to create this data frame.
- The first model we will train is linear regression (possibly with Lasso, which shrinks more coefficients to exactly zero).
Given over one hundred hospitals, there are definitely more interesting models to build and things to try! And you need validation. I leave this up to you, dear data scientist.
1. Data Preparation
The data required for the dummy demo is provided in the repository. Here is how I produced it:
git clone https://www.github.com/vsoch/hospital-chargemasters
cd hospital-chargemasters
Then use the script 1.prepare-data.py to read in the latest datasets.