# Predicting Healthcare-Related Infections
### Capstone 2

Data collection, wrangling, joining, initial checks


In [6]:
#Imports
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

#### Significance:

"At any given time, about 1 in 25 inpatients have an infection related to hospital care. These infections lead to tens of thousands of deaths and cost the U.S. health care system billions of dollars each year." - https://health.gov/our-work/health-care-quality/health-care-associated-infections

Additional resources
    - overview of HAI https://www.healthypeople.gov/2020/topics-objectives/topic/healthcare-associated-infections

#### Data summary
**Healthcare Associated Infections (HAIs)** (https://data.cms.gov/provider-data/dataset/77hc-ibv8)
<br> - How often patients get an infection while in the hospital. This measure is broken out by type and source of infection (related to equipment, procedures, or location of infection). It is also compared to a national benchmark for that type of hospital and is adjusted, to some degree, for factors such as the number of beds at the hospital, the lab methods used, affiliation with a medical school, patient age, and others. Top-priority HAIs are central line-associated bloodstream infections (CLABSI) and methicillin-resistant Staphylococcus aureus (MRSA) infections.

**Patient survey (HCAHPS)** (https://data.cms.gov/provider-data/topics/hospitals/hcahps#hcahps-star-ratings) 
<br> - This survey is administered to randomly selected patients (not just Medicare patients). It has 19 questions about the hospital plus 10 demographic and screening questions. (details on questions here: https://data.cms.gov/provider-data/topics/hospitals/hcahps#about-the-hcahps-survey) 

**Star rating (from HCAHPS survey results)** 
<br> - The star rating summarizes the patient survey responses by category; these are rolled into a single 'summary star' rating per facility. (details here: https://data.cms.gov/provider-data/topics/hospitals/hcahps#hcahps-star-ratings)

**Timely and Effective Care** (https://data.cms.gov/provider-data/dataset/yv7e-xc69)
<br> - Includes several measures about specific topics; each topic is given a rating based on what has been shown to be best practice or most important for that procedure. Data are collected from records of Medicare and non-Medicare patients. Measures include: cataract surgery outcome, colonoscopy follow-up, heart attack care, emergency department care, preventive care, pregnancy and delivery care, and cancer care. Each category uses different units (percentage, number of minutes, etc.). **The most relevant measure here is sepsis: "percentage of patients with severe sepsis or septic shock for which a hospital provides appropriate care".** (more details about the data: https://data.cms.gov/provider-data/topics/hospitals/timely-effective-care)

**Related data** (https://data.cms.gov/provider-data/dataset/yv7e-xc69)
<br> - In case those weren't enough links, here's one more. Other datasets that are used for comparing hospitals but have not been used in this study. 

## Initial imports

In [13]:
#Initial load of files (stored locally, downloaded 2/8/2021)
HAI_raw = pd.read_csv('./data/Healthcare_Associated_infections_-_Hospital.csv', na_values="Not Available")

display(HAI_raw.shape)
display(HAI_raw.sort_values(by="Facility ID").head(2))

In [24]:
survey_raw = pd.read_csv('./data/HCAHPS-Hospital.csv', na_values="Not Available", dtype={12:object, 14:object, 17:object, 19:object},  parse_dates=True, infer_datetime_format=True)
display(survey_raw.shape)
display(survey_raw.sort_values(by="Facility ID").head(2))

(454026, 22)

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,HCAHPS Measure ID,HCAHPS Question,...,Patient Survey Star Rating Footnote,HCAHPS Answer Percent,HCAHPS Answer Percent Footnote,HCAHPS Linear Mean Value,Number of Completed Surveys,Number of Completed Surveys Footnote,Survey Response Rate Percent,Survey Response Rate Percent Footnote,Start Date,End Date
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,H_COMP_1_A_P,"Patients who reported that their nurses ""Alway...",...,,77,,Not Applicable,507.0,,21.0,,01/01/2019,12/31/2019
67,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,H_CT_UNDER_D_SD,"Patients who ""Disagree"" or ""Strongly Disagree""...",...,,6,,Not Applicable,507.0,,21.0,,01/01/2019,12/31/2019


In [23]:
care_raw = pd.read_csv('./data/Timely_and_Effective_Care-Hospital.csv', na_values="Not Available", parse_dates=True, infer_datetime_format=True)
display(care_raw.shape)
display(care_raw.sort_values(by="Facility ID").head(2))

(80665, 16)

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Condition,Measure ID,Measure Name,Score,Sample,Footnote,Start Date,End Date
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,Emergency Department,EDV,Emergency department volume,high,,,01/01/2019,12/31/2019
16,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,Sepsis Care,SEV_SEP_6HR,Severe Sepsis 6-Hour Bundle,92,63.0,2.0,01/01/2019,12/31/2019


We now have the following imported as dataframes

 * HAI_raw
 * survey_raw
 * care_raw

## File QA

In [28]:
profile_HAI = ProfileReport(HAI_raw, title="Infections risk table Pandas Profiling Report")
profile_HAI.to_widgets()


In [29]:
profile_survey = ProfileReport(survey_raw, title="Survey data Pandas Profiling Report")
profile_survey.to_widgets()


(using `df.profile_report(correlations={"cramers": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')

In [30]:
profile_care = ProfileReport(care_raw, title="Timely and effective care table Pandas Profiling Report")
profile_care.to_widgets()


### Summary from pandas-profiling
**HAI (Healthcare-associated infections)**
* Constant start and end date (1/1/2019 - 12/31/2019)
* Score and "compared to national" (key columns for prediction) are missing from >40% and 60% of the rows, respectively
* There are 6 types of infections (HAIs), each with 6 metrics, for a total of 6x6=36 unique measures that need to be moved to columns

**Survey**
* Constant start and end date (1/1/2019 - 12/31/2019)
* 93 distinct questions

**Care**
* Start and end date are MOSTLY the same with a few outliers from other quarters.
* Score is missing in 53% of rows
* One of the conditions is "sepsis care", which likely has a high correlation with infections since sepsis is caused by infections.
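
The missing-value percentages quoted above come from the profiling reports. As a quick cross-check without the report, a minimal sketch along the following lines could be used. The helper name `missing_share` and the toy frame are hypothetical, and the toy values are made up for illustration, not taken from the CMS files.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for HAI_raw (hypothetical values, illustration only)
toy = pd.DataFrame({
    "Facility ID": ["10001", "10001", "10005", "10005"],
    "Score": [1.2, np.nan, np.nan, 0.8],
    "Compared to National": [np.nan, np.nan, np.nan, "Better"],
})

shares = missing_share(toy)
print(shares)
```

On the real data the same call would be `missing_share(HAI_raw)`, assuming the files were read with `na_values="Not Available"` as in the import cells.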
    

## Reshape files 

Files need to be formatted as 1 row per facility. The resulting files will be wider rather than longer, to prepare for joining on Facility ID.

In [None]:
#First reshape the HAI table
#Take the 36 unique Measure IDs and convert them to columns showing the score.

#Add a single column per infection type (6) for compared to national average

#How to handle footnotes??? Probably case-by-case

#Drop Measure Name, as this can be looked up with the Measure ID, which is more concise for column names.
#  Save a cross-reference table of Measure ID and Measure Name for future readability and reference

#Final table should have 36+6 new columns, and only one row per facility.  
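
The plan in the comments above can be sketched on a toy table. This is one possible approach under stated assumptions, not the notebook's final implementation: the column names `Measure ID`, `Measure Name`, `Compared to National` and `Score` are assumed to match the HAI file, the measure IDs shown (`HAI_1_SIR` and so on) and the measure names are illustrative, and the infection type is assumed to be the measure ID with its last `_`-separated token removed. Footnotes are left out, since the plan treats them case by case.

```python
import numpy as np
import pandas as pd

# Toy long-format table: one row per facility and measure (hypothetical values)
toy = pd.DataFrame({
    "Facility ID": ["10001"] * 4 + ["10005"] * 4,
    "Measure ID": ["HAI_1_SIR", "HAI_1_NUMERATOR",
                   "HAI_2_SIR", "HAI_2_NUMERATOR"] * 2,
    "Measure Name": ["CLABSI SIR", "CLABSI observed cases",
                     "CAUTI SIR", "CAUTI observed cases"] * 2,
    "Compared to National": ["Better", np.nan, "No Different", np.nan,
                             "Worse", np.nan, np.nan, np.nan],
    "Score": [0.5, 3.0, 1.0, 7.0, 1.8, 12.0, np.nan, np.nan],
})


def reshape_hai(long_df: pd.DataFrame):
    """Return (wide table with one row per facility, Measure ID/Name lookup)."""
    # Cross-reference table so Measure Name can be dropped from the wide table
    xref = (long_df[["Measure ID", "Measure Name"]]
            .drop_duplicates()
            .reset_index(drop=True))

    # One score column per Measure ID; pivot raises ValueError if a
    # (facility, measure) pair appears more than once
    scores = long_df.pivot(index="Facility ID", columns="Measure ID",
                           values="Score")

    # Assumed convention: infection type = Measure ID minus its last token
    infection = long_df["Measure ID"].str.rsplit("_", n=1).str[0]

    # One "compared to national" column per infection type
    # (first non-missing value within each facility/type group)
    compared = (long_df.assign(infection=infection)
                .groupby(["Facility ID", "infection"])["Compared to National"]
                .first()
                .unstack("infection")
                .add_suffix("_compared"))

    wide = scores.join(compared).reset_index()
    return wide, xref


wide, xref = reshape_hai(toy)
print(wide)
```

With the real file, 36 measure IDs and 6 infection types would give the 36+6 new columns described above, provided the naming assumption holds.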

Check that there are no duplicate Facility IDs in each file (for best join results).
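
A minimal sketch of such a check, assuming each reshaped table keeps a `Facility ID` column. The helper name `duplicated_facilities` and the toy data are hypothetical.

```python
import pandas as pd


def duplicated_facilities(df: pd.DataFrame, key: str = "Facility ID") -> list:
    """Return the key values that occur on more than one row."""
    counts = df[key].value_counts()
    return counts[counts > 1].index.tolist()


# Toy wide table with one facility repeated (illustration only)
toy_wide = pd.DataFrame({
    "Facility ID": ["10001", "10005", "10005"],
    "some_measure": [1, 2, 3],
})

dupes = duplicated_facilities(toy_wide)
print(dupes)
```

An empty list means the table is safe to join one-to-one on `Facility ID`.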

## Join files on Facility ID
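
A sketch of how the join might look once each table has one row per facility. The frame names `hai_wide`, `survey_wide` and `care_wide`, the column names, and the values are placeholders for illustration. The choice of an outer join is an assumption: it keeps facilities that are missing from one of the files, whereas `how="inner"` would keep only facilities present in all three.

```python
import pandas as pd

# Toy stand-ins for the three reshaped tables (hypothetical values)
hai_wide = pd.DataFrame({"Facility ID": ["10001", "10005"],
                         "HAI_1_SIR": [0.5, 1.8]})
survey_wide = pd.DataFrame({"Facility ID": ["10001", "10005"],
                            "H_STAR_RATING": [3, 4]})
care_wide = pd.DataFrame({"Facility ID": ["10001", "10006"],
                          "SEV_SEP_6HR": [92, 88]})

# validate="one_to_one" makes pandas raise MergeError if either side
# has a repeated Facility ID, so duplicates cannot silently multiply rows
combined = (
    hai_wide
    .merge(survey_wide, on="Facility ID", how="outer", validate="one_to_one")
    .merge(care_wide, on="Facility ID", how="outer", validate="one_to_one")
)
print(combined)
```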