# CMS Hospital Rating

# Fun Facts:
## The probability that a randomly picked hospital in the USA is a 5-star hospital in Texas is 0.35%.
## The probability that a randomly picked hospital in the USA is a 1-star hospital in New York is 1.1%.

# Description:
### It is every hospital's dream to have a 5-star rating, as it signifies the quality of patient care provided. The rating is one of many ways to showcase quality recognition and reputation. The dataset, published by the Centers for Medicare & Medicaid Services (CMS), was downloaded from Kaggle. This data can help us compare the quality of care among hospitals in the USA. This is an interesting project as the data can be used for predictive modeling using Python. The results of the machine learning algorithms can have implications for:

### *Hospitals - better resource-allocation strategy
### *Insurance - improved overall policies/practices and network coverage strategy
### *Patients - better understanding of their healthcare facilities and more informed decisions

# Project Objective:
### The objective of this project is to predict the hospital's rating, thus, the target variable or the y-variable is "Hospital overall rating".

# Techniques:
### The project included comprehensive exploratory data analysis, data cleansing, and data visualization. Predictive models built with scikit-learn included K-nearest neighbor, Support Vector Machines, and Random Forest. The end result is a comparison of these models' performance.

###  Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Opening and reading downloaded dataset

In [None]:
data = pd.read_csv("../input/cms-ratingcsv/Hospital General Information.csv")

In [None]:
data.head(3)

### Checking the info on the dataset to see what I am dealing with. Noted that features ending with *footnote have few non-null values. Almost all features are objects; only Provider ID, ZIP Code, and Phone Number are numeric, which makes sense. The dataset is categorical-heavy, so categorical graphs will be used frequently during the exploratory data analysis.

In [None]:
data.info()

### Checking out the total null counts helps me decide if they should be dropped completely or replaced with relevant values. 

In [None]:
data.isnull().sum()

###  Diving in to see what is in *footnote - mostly descriptions, not very informative for our work here.

In [None]:
data["Hospital overall rating footnote"].value_counts()

###  Those *footnote features will be dropped since the information is not relevant to this project.
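
As a hedged alternative to listing every footnote column by hand, the columns could be selected by their shared suffix. The sketch below assumes all unwanted columns end with "footnote"; the small `demo` frame and its values are hypothetical stand-ins, not the CMS data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the CMS data (values are made up).
demo = pd.DataFrame({
    "Hospital overall rating": ["3", "4"],
    "Hospital overall rating footnote": [None, None],
    "Mortality national comparison": ["Above", "Below"],
    "Mortality national comparison footnote": [None, "x"],
})

# Collect every column whose name ends with "footnote", then drop them together.
footnote_columns = [col for col in demo.columns if col.endswith("footnote")]
demo_clean = demo.drop(columns=footnote_columns)

print(list(demo_clean.columns))
```

Selecting by suffix avoids accidentally listing the same column twice or missing one, but it only works while the naming pattern holds.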

In [None]:
# Drop every *footnote column (each column listed once)
data = data.drop(["Hospital overall rating footnote",
                  "Mortality national comparison footnote",
                  "Safety of care national comparison footnote",
                  "Readmission national comparison footnote",
                  "Patient experience national comparison footnote",
                  "Effectiveness of care national comparison footnote",
                  "Timeliness of care national comparison footnote",
                  "Efficient use of medical imaging national comparison footnote"], axis=1)

### Rechecking what is left - The length of the dataset is ~4800. I have decided to just drop the 15 null values on county name. Need to look into "Meets criteria for meaningful use of EHRs" and decide if dropping 143 null values will make sense. 

In [None]:
data.isnull().sum()

In [None]:
data["Meets criteria for meaningful use of EHRs"].value_counts()

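### Almost 90% of the values for the feature "Meets criteria for meaningful use of EHRs" are Yes, and 1 is Not Available. Since the feature carries little information, I will drop it later along with the other unhelpful columns. For now, the rows that still contain null values are dropped.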

In [None]:
data = data.dropna()
data.isnull().sum()

# The exploratory data analysis begins here. The features listed in the top line will be used as my roadmap for exploration. First, I will do a basic descriptive analysis. 

In [None]:
data.describe()

### The descriptive analysis does not help much. The "State" column already gives us the geographical location, so ZIP Code is not needed. I suspect Provider ID is a unique identifier (primary key), and phone numbers don't help much. I plan to drop Provider ID if it turns out to be a unique identifier, since that is not helpful in this project. I will also drop phone numbers and ZIP Code.

In [None]:
data["Provider ID"].nunique()

### My dropping list now contains Provider ID, Phone Number, and Meets criteria for meaningful use of EHRs. Let's check county name if it is a useful feature.

In [None]:
data["County Name"].value_counts()

### The County Name feature has ~1500 unique values, so it is not feasible to keep it for machine learning later; I plan to drop it. The ZIP Code counts below show ~4000 unique values, which is also not feasible to keep. My dropping list now contains ZIP Code, County Name, Provider ID, Phone Number, and EHRs.

In [None]:
data["ZIP Code"].value_counts()

In [None]:
data = data.drop(["County Name", "ZIP Code", "Meets criteria for meaningful use of EHRs", 
                  "Provider ID", "Phone Number"], axis=1)

In [None]:
data.info()

### There are a few more features to consider dropping before moving to the core areas - Hospital Name and Address. I'm not too sure if these features are useful in this project. Noted that all the data now is of object type. 

In [None]:
data["Hospital Name"].value_counts()

In [None]:
data["Address"].value_counts()

In [None]:
data = data.drop(["Address", "Hospital Name"], axis=1)

## Exploratory Data Analysis

In [None]:
data["Hospital Type"].value_counts()

###### The majority of the hospitals in the USA are acute care hospitals, followed by critical access hospitals and children's hospitals.

In [None]:
plt.figure(figsize=(12,5))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x=data["Hospital Type"], data=data, color="red", alpha=0.4)

In [None]:
data["Emergency Services"].value_counts()

###### The majority of acute care and critical access hospitals have some sort of Emergency Services, unlike children's hospitals. This makes sense, as a children's hospital is a specialty facility, usually affiliated with a large acute care hospital in a health system.

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x="Hospital Type", data=data, hue="Emergency Services", palette="ocean_r")

In [None]:
plt.figure(figsize=(9,5))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x="Emergency Services", data=data, color="blue", alpha=0.4)

In [None]:
data["Hospital Ownership"].value_counts()

### The majority of the hospitals in the USA are private non-profit, government, or faith-based. These are followed by physician-owned hospitals, with tribal-owned hospitals at the bottom of the list. This is probably because the total tribal population is relatively low compared to the non-tribal population. The laws of supply and demand in healthcare apply here.

In [None]:
plt.figure(figsize=(12, 7))
sns.set_context("paper", font_scale=1.5)
order = data["Hospital Ownership"].value_counts().sort_values(ascending=False).index
sns.countplot(y="Hospital Ownership", data=data, color="green", alpha=0.4, order=order)

In [None]:
data["Hospital Ownership"].value_counts(normalize=True)*100

In [None]:
data["Hospital overall rating"].value_counts()

### I plan to drop the Not Available rows even though their count is significant. Replacing them with the mean value would not give us an accurate picture of hospital ratings in this country.

In [None]:
data = data.drop(data[data["Hospital overall rating"] == "Not Available"].index)

In [None]:
data["Hospital overall rating"].value_counts(normalize=True)*100

###### As expected, the majority of the hospitals in the USA fall under the average rating of 3 (49%). The extreme ratings, 1 and 5 (with 5 being the best), have the lowest percentages. This is expected: only a very small number of hospitals (3%) are actually under-performing, and it is very difficult for hospitals to obtain a 5-star rating (only 2%) as performance is based on CMS value-based programs and so forth. At this stage, it is hard to tell whether the barrier is hospitals' resource constraints or CMS requirements that are too high.

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("poster", font_scale=1)
sns.countplot(x=data["Hospital overall rating"], palette="plasma")

### Based on the graph above, it is safe to assume that the majority of the hospitals in the USA lean toward average to higher ratings (average performers and over-achievers), with a smaller percentage of them on the low end. In general, more hospitals perform well than poorly based on CMS requirements, with less than a quarter (~21%) under-performing. This does not reflect the full picture of hospital performance; tracking over time is required to see if hospitals are improving or declining.

### Continuing analysis - will drop Not Available counts in all features. 
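
The notebook removes "Not Available" rows one feature at a time so the counts can be reviewed in between. As a rough sketch of the same idea in a single pass, the rows could be filtered with one boolean mask. This assumes every measure column ends with "national comparison"; the `demo` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset, values are invented.
demo = pd.DataFrame({
    "Hospital Type": ["Acute Care", "Acute Care", "Critical Access", "Acute Care"],
    "Mortality national comparison": ["Same", "Not Available", "Above", "Below"],
    "Safety of care national comparison": ["Not Available", "Same", "Same", "Above"],
})

# Every CMS measure column in this sketch ends with "national comparison".
measure_columns = [col for col in demo.columns if col.endswith("national comparison")]

# Flag rows where any measure is "Not Available", then keep the rest.
has_missing = (demo[measure_columns] == "Not Available").any(axis=1)
demo_clean = demo[~has_missing]

print(len(demo), "->", len(demo_clean))
```

The one-pass version is shorter, but the step-by-step approach makes it easier to see how many rows each feature costs.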

In [None]:
data["Mortality national comparison"].value_counts()

In [None]:
data = data.drop(data[data["Mortality national comparison"] == "Not Available"].index)

In [None]:
data["Mortality national comparison"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=1.5)
mortality_order = data["Mortality national comparison"].value_counts().sort_values(ascending=False).index
sns.countplot(x=data["Mortality national comparison"], color="pink", order=mortality_order)

In [None]:
data["Safety of care national comparison"].value_counts()

In [None]:
data = data.drop(data[data["Safety of care national comparison"] == "Not Available"].index)

In [None]:
data["Safety of care national comparison"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x=data["Safety of care national comparison"], color="orange")

In [None]:
data["Readmission national comparison"].value_counts()

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x=data["Readmission national comparison"], color="grey")

In [None]:
data["Patient experience national comparison"].value_counts()

In [None]:
data = data.drop(data[data["Patient experience national comparison"] == "Not Available"].index)

In [None]:
data["Patient experience national comparison"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x=data["Patient experience national comparison"], color="pink")

In [None]:
data["Effectiveness of care national comparison"].value_counts()

In [None]:
data = data.drop(data[data["Effectiveness of care national comparison"] == "Not Available"].index)

In [None]:
data["Effectiveness of care national comparison"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x=data["Effectiveness of care national comparison"], palette="viridis")

In [None]:
data["Timeliness of care national comparison"].value_counts()

In [None]:
data = data.drop(data[data["Timeliness of care national comparison"] == "Not Available"].index)

In [None]:
data["Timeliness of care national comparison"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x=data["Timeliness of care national comparison"], palette="plasma")

In [None]:
data["Efficient use of medical imaging national comparison"].value_counts()

In [None]:
data = data.drop(data[data["Efficient use of medical imaging national comparison"] == "Not Available"].index)

In [None]:
data["Efficient use of medical imaging national comparison"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x=data["Efficient use of medical imaging national comparison"], palette="cividis")

### Exploratory data analysis on the CMS measures is complete. I need to drill down into the basic features before moving on to machine learning.

In [None]:
data["City"].value_counts()

### The City feature has ~1400 unique values, so it is not feasible to create dummies for it. Grouping the cities into fewer categories is not necessary as we have State as the location.

In [None]:
data = data.drop("City", axis=1)

In [None]:
data["State"].value_counts(normalize=True)*100

### It looks like the top 3 states with the highest number of hospitals are CA, TX, and FL. This is possibly due to population size, but I do not have the data to support this. It could also be higher needs from a sicker population in general, or just excess resources. These are just preliminary impressions.

In [None]:
plt.figure(figsize=(15,30))
state_order = data["State"].value_counts().sort_values(ascending=False).index
sns.countplot(y=data["State"], palette="coolwarm", order=state_order)
plt.title("Number of Hospitals in Each State")
plt.xlabel("Number of Hospitals")

### Let's focus on the top 10 states with the highest number of hospitals and their overall ratings. At a glance, TX has the highest number of hospitals with a 5-star rating compared to the rest. In contrast, NY has the highest number of hospitals with a 1-star rating compared to its counterparts.

In [None]:
plt.figure(figsize=(12,18))
plt.title("Top 10 States")
state_order = data["State"].value_counts().sort_values(ascending=False)[:10].index
sns.countplot(y=data["State"], palette="coolwarm", order=state_order, data=data, hue="Hospital overall rating")
plt.legend(bbox_to_anchor=(1.2, 0.5), title="CMS Rating")

### Just out of curiosity, let's drill down into TX, NY, and CA.

In [None]:
len(data[data["State"] == "TX"])

In [None]:
data[data["State"] == "TX"]["Hospital overall rating"].value_counts()

In [None]:
len(data[data["State"] == "NY"])

In [None]:
data[data["State"] == "NY"]["Hospital overall rating"].value_counts()

In [None]:
len(data[data["State"] == "CA"])

In [None]:
data[data["State"] == "CA"]["Hospital overall rating"].value_counts()

In [None]:
len(data["Hospital overall rating"])

In [None]:
data["Hospital overall rating"].value_counts()

# There are a total of 59 hospitals with 5-star rating out of total 2297 hospitals in the USA (after data cleansing). 

#  Fun fact: Given the dataset, the probability of picking a 5-star rating hospital in the USA is 2.57%.

In [None]:
high_star_prob = 59/len(data["Hospital overall rating"])*100
high_star_prob

# Fun fact: Given the dataset, the probability of picking a 1-star rating hospital in the USA is 4.1%.

In [None]:
low_star_prob = 94/len(data["Hospital overall rating"])*100
low_star_prob

# Fun fact: Given the dataset, the probability that a randomly picked hospital in the USA is a 5-star hospital in Texas is 0.35%.

In [None]:
texas_high_prob = (8/164)
texas_usa_prob = (164/2297)
texas_high_usa_prob = (texas_high_prob*texas_usa_prob)*100
texas_high_usa_prob               

# Fun fact: Given the dataset, the probability that a randomly picked hospital in the USA is a 1-star hospital in New York is 1.1%.

In [None]:
newyork_low_prob = 25/110 
newyork_usa_prob = 110/2297
newyork_low_usa_prob = (newyork_low_prob*newyork_usa_prob)*100
newyork_low_usa_prob
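
The two state-level calculations above multiply a conditional probability by a marginal one, which is the chain rule for a joint probability: P(5-star and TX) = P(5-star | TX) × P(TX). The short sketch below reuses the Texas counts quoted in the notebook to show that the state count cancels, so the result equals a direct ratio. The variable names are mine, not the notebook's.

```python
# Counts quoted in the notebook after data cleansing.
total_hospitals = 2297
texas_hospitals = 164
texas_five_star = 8

# Chain rule: P(5-star and TX) = P(5-star | TX) * P(TX).
p_five_given_texas = texas_five_star / texas_hospitals
p_texas = texas_hospitals / total_hospitals
joint_via_chain_rule = p_five_given_texas * p_texas

# The Texas count cancels, so the joint probability is a direct ratio.
joint_direct = texas_five_star / total_hospitals

print(f"{joint_via_chain_rule * 100:.2f}%")  # 0.35%
```

The same cancellation applies to New York: 25 one-star hospitals out of 2297 gives about 1.1%.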

# It's time to look into the data types and prepare for machine learning. Noted that all remaining features are objects.

In [None]:
data.info()

# Getting dummy data on features

In [None]:
clean_state = pd.get_dummies(data["State"], prefix="State_", drop_first=True, dtype=int)
data = pd.concat([data.drop("State", axis=1), clean_state], axis=1)
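
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row column. With two categories, only one indicator column is kept, because the dropped category is implied when the indicator is 0.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
demo = pd.DataFrame({"Emergency Services": ["Yes", "No", "Yes"]})

# Categories are sorted ("No", "Yes"); drop_first removes the "No" indicator.
encoded = pd.get_dummies(demo, drop_first=True, dtype=int)

print(encoded)
```

Dropping the first level avoids a redundant column, which keeps the feature matrix smaller when a feature such as State has many levels.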

In [None]:
data.head(3)

In [None]:
data.info()

In [None]:
dummy_column = data.iloc[:, 0:3]

In [None]:
clean_column = pd.get_dummies(dummy_column, drop_first=True, dtype=int)
# Drop the original categorical columns by name, then append their dummy-encoded versions
data = pd.concat([data.drop(dummy_column.columns, axis=1), clean_column], axis=1)

In [None]:
data.info()

In [None]:
dummy_columns = data.iloc[:, 1:8]

In [None]:
clean_columns = pd.get_dummies(dummy_columns, drop_first=True, dtype=int)
# Drop the original CMS measure columns by name, then append their dummy-encoded versions
data = pd.concat([data.drop(dummy_columns.columns, axis=1), clean_columns], axis=1)

In [None]:
data.info()

# Our y-variable ("Hospital overall rating") will be converted from object to int.

In [None]:
data["Hospital overall rating"] = data["Hospital overall rating"].astype(str).astype(int)

# Binary Classification on "Hospital overall rating"

# I received an error at the end of this project saying that sklearn.metrics does not support multiclass output (it works in my Jupyter notebook). So, the multiclass rating will be converted to a binary classification: 0 = low to average; 1 = above average and high. The mapping is as follows: ratings 1, 2, and 3 will be converted to 0, and ratings 4 and 5 will be converted to 1.

In [None]:
data["Hospital overall rating"] = data["Hospital overall rating"].map({1:0, 2:0, 3:0, 4:1, 5:1})

In [None]:
data["Hospital overall rating"].value_counts()

In [None]:
data.describe()

# Importing scikit-learn libraries. A preprocessing scaler is not necessary as all features are 0s and 1s. The data will be split into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

# Creating variables for X and y. Our target is to predict the hospital rating, thus our y variable is "Hospital overall rating". The rest will be our X variables.

In [None]:
X = data.drop("Hospital overall rating", axis=1)
y = data["Hospital overall rating"]

# Setting data for training at 80%, test data at 20%. Random state will be used so it will produce the same random sequence each time. An arbitrary number 42 will be used as the random state (I heard it is THE number for Life, universe, and everything :)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
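
Because the two classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. This is a sketch of that option on synthetic data, not what the notebook does; `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 70 zeros and 30 ones (made up for illustration).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo preserves the 70/30 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of class 1 in the test set can drift from the overall share purely by chance, which makes per-class metrics noisier.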

# K-nearest neighbor

## K-nearest neighbor - an arbitrary K-value (n_neighbors) of 10 will be used.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
neighbor = KNeighborsClassifier(n_neighbors=10)

In [None]:
neighbor.fit(X_train, y_train)

In [None]:
neighbor_predict = neighbor.predict(X_test)

#  Training and prediction complete. Now let's evaluate

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(y_test, neighbor_predict))
print(classification_report(y_test, neighbor_predict))

## The accuracy is 85% with n_neighbors set to 10, which was an arbitrary number. Noted that the recall on 1 is only 0.59 and the F1-score is only 68%. However, we have to consider the imbalanced data - class 1 (the better rating group) makes up only about 30% of the data.
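
For readers less familiar with how precision, recall, and F1 relate to the confusion matrix, here is a small worked sketch. The four counts are hypothetical and are not the notebook's actual output; they only illustrate how a model can have high accuracy while recall on the minority class stays low.

```python
# Hypothetical confusion-matrix counts for class 1 (not the notebook's actual output).
true_positive = 60
false_positive = 15
false_negative = 40
true_negative = 345

# Precision: of everything predicted as 1, how much really is 1.
precision = true_positive / (true_positive + false_positive)

# Recall: of everything that really is 1, how much was found.
recall = true_positive / (true_positive + false_negative)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

total = true_positive + false_positive + false_negative + true_negative
accuracy = (true_positive + true_negative) / total

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

In this toy case the large number of correctly predicted 0s lifts accuracy to 0.88 even though 40% of the true 1s are missed.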

###### Now I will try to optimize the model by finding the best K-value in the range 1 to 39, which is also arbitrary, and see if we can improve the score.

In [None]:
error_rate = []

for i in range(1,40):
    neighbor = KNeighborsClassifier(n_neighbors = i)
    neighbor.fit(X_train, y_train)
    knnpredict = neighbor.predict(X_test)
    error_rate.append(np.mean(knnpredict != y_test))

### The graph below shows the error rate trend used in the elbow method. Noted that K-values from 10 to 39 give a steady trend, with the error rate hovering between 0.14 and 0.16. It is logical to pick the K-value that gives us the lowest error rate. In this case, I will pick 19 since it is the starting point of a downtrend. n_neighbors will be set to 19 and the model retrained.

In [None]:
plt.figure(figsize=(15,7))
plt.plot(range(1,40), error_rate, color="blue", ls="dashed", marker="o", markerfacecolor="red", markersize=10)
plt.title("Error Rate")
plt.xlabel("K-value")
plt.ylabel("Error Rate")
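
The error rates above are measured on the test set, so the chosen K has effectively seen the test data. A common alternative, sketched here under the assumption that only the training data is used for tuning, is to estimate the error with cross-validation. The data below is synthetic (`make_classification`); in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; in the notebook this would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy, converted to an error rate.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_error.append(1 - scores.mean())

# Pick the K with the lowest cross-validated error.
best_k = k_values[int(np.argmin(cv_error))]
print(best_k, min(cv_error))
```

The test set then stays untouched until the final evaluation, so the reported score is a fairer estimate of performance on unseen hospitals.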

In [None]:
neighbor = KNeighborsClassifier(n_neighbors=19)
neighbor.fit(X_train, y_train)
knnpredict = neighbor.predict(X_test)

In [None]:
print(classification_report(y_test, knnpredict))
print(confusion_matrix(y_test, knnpredict))

### The KNN model using the elbow method resulted in a slight increase in accuracy, from 85% to 86%. The recall and F1-score on 1 have improved as well, which is much better than before IMO. Moving on to Support Vector Machines, which I think are a close counterpart to KNN.

## Support Vector Machines

In [None]:
from sklearn.svm import SVC

In [None]:
support = SVC(random_state=42)

In [None]:
support.fit(X_train, y_train)

In [None]:
support_predict = support.predict(X_test)

In [None]:
print(classification_report(y_test, support_predict))
print(confusion_matrix(y_test, support_predict))

### The SVM model yields 87% accuracy. Noted that it has 90% precision on rating 0 but only 79% on 1. The weighted average is 87% across the board. 

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
random_forest = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=42)
random_forest.fit(X_train, y_train)

In [None]:
random_forest_predict = random_forest.predict(X_test)

In [None]:
print(classification_report(y_test, random_forest_predict))
print(confusion_matrix(y_test, random_forest_predict))

### The Random Forest model yields an 86% accuracy with 79% precision and 67% recall on 1.

In [None]:
report = [["Support Vector Machines", 0.87, 0.87, 0.87, 0.87], ["Random Forest", 0.86, 0.85, 0.86, 0.85], 
          ["K-nearest neighbor", 0.86, 0.85, 0.86, 0.85]]
overall_result = pd.DataFrame(report, columns=["Model", "Accuracy Score", "Precision", "Recall", "F1-score"])
overall_result.sort_values("F1-score", ascending=False)
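
The table above is typed in by hand from the classification reports. As a hedged sketch, the same kind of row could be computed directly from predictions. This assumes the reported figures are weighted averages; `summarize` is a hypothetical helper, and the toy labels are invented.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(model_name, y_true, y_pred):
    """Return one row of weighted-average metrics for a set of predictions."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "Model": model_name,
        "Accuracy Score": accuracy_score(y_true, y_pred),
        "Precision": precision,
        "Recall": recall,
        "F1-score": f1,
    }

# Toy labels for illustration only; in the notebook these would be
# y_test and each model's predictions.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 0, 1, 0]
rows = [summarize("Toy model", toy_true, toy_pred)]
toy_result = pd.DataFrame(rows).round(2)
print(toy_result)
```

Calling the helper once per model (for example with `support_predict`, `random_forest_predict`, and `knnpredict`) would keep the table in sync with the models if anything upstream changes.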

# Overall, SVM yields the highest accuracy and F1-score, followed by Random Forest and K-nearest neighbor, which tie. I believe the models could do better if the 0s and 1s were more balanced (recall that rating 3 was the majority class and accounts for almost 50%). This was an interesting personal project.