# Evaluating Hospital Effectiveness - Data Generation and Balancing


Thien Nguyen

## Abstract

There are ubiquitous problems within the US healthcare system, and one of the biggest problems revolves around costs. While every single hospital in the US has a chargemaster, a list of costs for all billable procedures, accessing this esoteric list as a consumer is nigh impossible, as [Vox](https://www.youtube.com/watch?v=Tct38KwROdw) has demonstrated.

As a result, market forces that are supposed to drive down prices via competition are ineffective due to the secrecy of these prices. This leads to wild discrepancies between institutions. Because people treat their health as vital, demand for many healthcare services and products is inelastic. This also fosters a misconception about utility and price, i.e., consumers believe that paying higher prices for healthcare services actually results in better services.

This investigation seeks to evaluate whether this belief is true or simply a misconception. Since chargemasters are secretive, it would greatly benefit the public if other factors could be used for cost-benefit analysis of healthcare. Shedding light on this matter may provide the public with critical information needed to drive down costs in an industry littered with problems.

If a highly accurate model can be established, many consumers would likely pay for such information. Of course, this need not take the form of monetary payment. Offering the model as a free service would lead to high traffic, which would in turn attract merchants to buy advertising space.

In [1]:
#Create a seed for reproducible results
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
np.random.seed(0)



In [29]:
df = pd.read_csv('null_df.csv')
print(df.shape)
df.head()

(4444, 137)


Unnamed: 0,Provider_ID,Rate of complications for hip/knee replacement patients_Denominator,Rate of complications for hip/knee replacement patients_Score,Rate of complications for hip/knee replacement patients_Lower Estimate,Rate of complications for hip/knee replacement patients_Higher Estimate,Death rate for heart attack patients_Denominator,Death rate for heart attack patients_Score,Death rate for heart attack patients_Lower Estimate,Death rate for heart attack patients_Higher Estimate,Death rate for CABG surgery patients_Denominator,...,Value of Care Heart Failure measur_Higher estimate,Value of Care Pneumonia measure_Denominator,Value of Care Pneumonia measure_Payment,Value of Care Pneumonia measure_Lower estimate,Value of Care Pneumonia measure_Higher estimate,Value of Care hip/knee replacement_Denominator,Value of Care hip/knee replacement_Payment,Value of Care hip/knee replacement_Lower estimate,Value of Care hip/knee replacement_Higher estimate,Spending_Score
0,10001,292.0,3.2,2.1,4.8,688.0,13.0,11.0,15.5,291.0,...,18523.0,531.0,19203.0,18191.0,20214.0,284.0,24984.0,23894.0,26172.0,0.99
1,10005,257.0,2.8,1.7,4.4,80.0,14.8,11.6,18.8,,...,18165.0,669.0,15973.0,15206.0,16718.0,253.0,22051.0,21041.0,23103.0,1.01
2,10006,389.0,2.6,1.7,4.0,441.0,15.4,12.8,18.3,145.0,...,18143.0,426.0,25008.0,23445.0,26652.0,642.0,16820.0,15992.0,17657.0,0.99
3,10007,31.0,2.8,1.6,4.8,,,,,,...,16788.0,209.0,16469.0,15038.0,17786.0,30.0,22066.0,19589.0,24807.0,1.08
4,10008,,,,,,,,,,...,17018.0,47.0,14702.0,12387.0,17048.0,,,,,1.06


In [30]:
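# Flag every row: 1 if it has at least one missing value, 0 if it is complete
df['null'] = df.isna().any(axis=1).astype(int)

# Positional indices of the complete rows, used below to draw the hold-out set
no_nulls = np.flatnonzero(df['null'] == 0).tolist()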

df.head()

Unnamed: 0,Provider_ID,Rate of complications for hip/knee replacement patients_Denominator,Rate of complications for hip/knee replacement patients_Score,Rate of complications for hip/knee replacement patients_Lower Estimate,Rate of complications for hip/knee replacement patients_Higher Estimate,Death rate for heart attack patients_Denominator,Death rate for heart attack patients_Score,Death rate for heart attack patients_Lower Estimate,Death rate for heart attack patients_Higher Estimate,Death rate for CABG surgery patients_Denominator,...,Value of Care Pneumonia measure_Denominator,Value of Care Pneumonia measure_Payment,Value of Care Pneumonia measure_Lower estimate,Value of Care Pneumonia measure_Higher estimate,Value of Care hip/knee replacement_Denominator,Value of Care hip/knee replacement_Payment,Value of Care hip/knee replacement_Lower estimate,Value of Care hip/knee replacement_Higher estimate,Spending_Score,null
0,10001,292.0,3.2,2.1,4.8,688.0,13.0,11.0,15.5,291.0,...,531.0,19203.0,18191.0,20214.0,284.0,24984.0,23894.0,26172.0,0.99,1
1,10005,257.0,2.8,1.7,4.4,80.0,14.8,11.6,18.8,,...,669.0,15973.0,15206.0,16718.0,253.0,22051.0,21041.0,23103.0,1.01,1
2,10006,389.0,2.6,1.7,4.0,441.0,15.4,12.8,18.3,145.0,...,426.0,25008.0,23445.0,26652.0,642.0,16820.0,15992.0,17657.0,0.99,1
3,10007,31.0,2.8,1.6,4.8,,,,,,...,209.0,16469.0,15038.0,17786.0,30.0,22066.0,19589.0,24807.0,1.08,1
4,10008,,,,,,,,,,...,47.0,14702.0,12387.0,17048.0,,,,,1.06,1


In [31]:
no_nulls_df = df.iloc[no_nulls]
test = no_nulls_df.sample(frac=.33, random_state = 0)

In [32]:
df.drop(test.index, axis = 0, inplace = True)
print(df.shape)

(4335, 138)


In [33]:
print(test.shape)

(109, 138)


In [34]:
print('total entries with no missing data: {}'.format(len(df[df.null == 0])))
print('total entries with missing data: {}'.format(len(df[df.null == 1])))


total entries with no missing data: 222
total entries with missing data: 4113


In [45]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X = df.loc[:, df.columns != 'null']
y = df.null

X_resampled, y_resampled = ros.fit_resample(X, y)
columns = X.columns

os_data_X = pd.DataFrame(data=X_resampled,columns=columns )
os_data_y= pd.DataFrame(y_resampled,columns=['null'])

Unnamed: 0,Provider_ID,Rate of complications for hip/knee replacement patients_Denominator,Rate of complications for hip/knee replacement patients_Score,Rate of complications for hip/knee replacement patients_Lower Estimate,Rate of complications for hip/knee replacement patients_Higher Estimate,Death rate for heart attack patients_Denominator,Death rate for heart attack patients_Score,Death rate for heart attack patients_Lower Estimate,Death rate for heart attack patients_Higher Estimate,Death rate for CABG surgery patients_Denominator,...,Value of Care Heart Failure measur_Higher estimate,Value of Care Pneumonia measure_Denominator,Value of Care Pneumonia measure_Payment,Value of Care Pneumonia measure_Lower estimate,Value of Care Pneumonia measure_Higher estimate,Value of Care hip/knee replacement_Denominator,Value of Care hip/knee replacement_Payment,Value of Care hip/knee replacement_Lower estimate,Value of Care hip/knee replacement_Higher estimate,Spending_Score
0,10001,292.0,3.2,2.1,4.8,688.0,13.0,11.0,15.5,291.0,...,18523.0,531.0,19203.0,18191.0,20214.0,284.0,24984.0,23894.0,26172.0,0.99
1,10005,257.0,2.8,1.7,4.4,80.0,14.8,11.6,18.8,,...,18165.0,669.0,15973.0,15206.0,16718.0,253.0,22051.0,21041.0,23103.0,1.01
2,10006,389.0,2.6,1.7,4.0,441.0,15.4,12.8,18.3,145.0,...,18143.0,426.0,25008.0,23445.0,26652.0,642.0,16820.0,15992.0,17657.0,0.99
3,10007,31.0,2.8,1.6,4.8,,,,,,...,16788.0,209.0,16469.0,15038.0,17786.0,30.0,22066.0,19589.0,24807.0,1.08
4,10008,,,,,,,,,,...,17018.0,47.0,14702.0,12387.0,17048.0,,,,,1.06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8221,450056,856.0,1.8,1.2,2.8,139.0,14.0,10.9,17.6,86.0,...,21889.0,491.0,19275.0,18241.0,20337.0,852.0,20711.0,20196.0,21243.0,1.06
8222,100087,2626.0,1.8,1.4,2.3,799.0,11.4,9.6,13.3,407.0,...,21086.0,1575.0,18213.0,17636.0,18754.0,2607.0,21645.0,21307.0,21982.0,1.05
8223,360180,650.0,2.8,2.0,3.9,494.0,9.5,7.7,11.6,377.0,...,17102.0,335.0,16921.0,15908.0,17969.0,636.0,21776.0,21111.0,22448.0,0.97
8224,500014,564.0,2.3,1.5,3.4,402.0,12.8,10.6,15.5,154.0,...,16300.0,659.0,16936.0,16124.0,17727.0,557.0,18641.0,18108.0,19184.0,0.93


In [46]:
# Check the numbers (null == 0: complete row, null == 1: row with missing data)
print("Length of oversampled data:", len(os_data_X))
print("Number of complete rows in oversampled data:", len(os_data_y[os_data_y['null']==0]))
print("Number of rows with missing data:", len(os_data_y[os_data_y['null']==1]))
print("Proportion of complete rows in oversampled data:", len(os_data_y[os_data_y['null']==0])/len(os_data_X))
print("Proportion of rows with missing data in oversampled data:", len(os_data_y[os_data_y['null']==1])/len(os_data_X))

Length of oversampled data: 8226
Number of complete rows in oversampled data: 4113
Number of rows with missing data: 4113
Proportion of complete rows in oversampled data: 0.5
Proportion of rows with missing data in oversampled data: 0.5


### Analysis
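We've successfully oversampled the "minority" class, i.e., the entries with no missing data. However, this sampling method is naive: random oversampling only duplicates existing rows, so the 222 complete rows are repeated until they match the 4113 rows with missing data.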
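
To make the limitation concrete, here is a minimal sketch of what random oversampling does, written in plain Python rather than with `imblearn`. The helper `random_oversample` and the toy rows are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows belonging to this class in the original data
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # an exact copy, no new information
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data: 2 complete rows (label 0) against 6 rows with gaps (label 1)
rows = [("a", 1.0), ("b", 2.0)] + [("m%d" % i, None) for i in range(6)]
labels = [0, 0] + [1] * 6

new_rows, new_labels = random_oversample(rows, labels)
class_counts = Counter(new_labels)
distinct_minority = {r for r, l in zip(new_rows, new_labels) if l == 0}
print(class_counts)            # both classes now have 6 rows
print(len(distinct_minority))  # still only 2 distinct complete rows
```

The classes end up balanced, but the number of distinct minority rows does not grow, which is why a model trained on such data can overfit the duplicated rows.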

In [50]:
final = os_data_X

In [51]:
# Detect rows whose share of missing values is at or above a threshold
def detect_row_nulls(df, threshold):
    # Fraction of missing values in each row of the frame that was passed in
    null_fraction = df.isna().mean(axis=1)
    above_threshold = df.index[null_fraction >= threshold].tolist()
    print('Number of rows above threshold: {} \n percentage of total df with {}% missing:{}'.format(len(above_threshold), round(threshold*100,2), round(len(above_threshold)/(df.shape[0]),2)*100))
    return above_threshold

In [52]:
eighty_nulls = detect_row_nulls(final, .8)
fifty_nulls = detect_row_nulls(final, .5)
thirty_nulls = detect_row_nulls(final, .3)

Number of rows above threshold: 784 
 percentage of total df with 80.0% missing:10.0
Number of rows above threshold: 1586 
 percentage of total df with 50.0% missing:19.0
Number of rows above threshold: 1935 
 percentage of total df with 30.0% missing:24.0


In [112]:
from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.model_selection import cross_val_score

#create pseudolabels and train a model with them
class PseudoLabeler(BaseEstimator, RegressorMixin):
    
    def __init__(self, model, test, features, target, sample_rate=0.2, seed=42):
        self.sample_rate = sample_rate
        self.seed = seed
        self.model = model
        self.model.seed = seed
        
        self.test = test
        self.features = features
        self.target = target
        
    def get_params(self, deep=True):
        return {
            "sample_rate": self.sample_rate,
            "seed": self.seed,
            "model": self.model,
            "test": self.test,
            "features": self.features,
            "target": self.target
        }
    
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
        
    def fit(self, X, y):
        if self.sample_rate > 0.0:
            augmented_train = self.__create_augmented_train(X, y)
            self.model.fit(
                augmented_train[self.features],
                augmented_train[self.target]
            )
        else:
            self.model.fit(X, y)
        
        return self
    
    def __create_augmented_train(self, X, y):
        # Use the test set stored on the instance, not a global variable
        num_of_samples = int(len(self.test) * self.sample_rate)
        
        # Train the model and create the pseudo-labels
        self.model.fit(X, y)
        pseudo_labels = self.model.predict(self.test[self.features])
        
        # Add the pseudo-labels to a copy of the test set
        augmented_test = self.test.copy(deep=True)
        augmented_test[self.target] = pseudo_labels
        
        # Take a subset of the test set with pseudo-labels and append it onto
        # the training set
        sampled_test = augmented_test.sample(n=num_of_samples, random_state=self.seed)
        temp_train = pd.concat([X, y], axis=1)
        augmented_train = pd.concat([sampled_test, temp_train])
        return shuffle(augmented_train, random_state=self.seed)
        
    def predict(self, X):
        return self.model.predict(X)
    
    def get_model_name(self):
        return self.model.__class__.__name__

ModuleNotFoundError: No module named 'xgboost'
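
The pseudo-labeling steps in the class above (fit on labeled rows, predict the unlabeled rows, sample a fraction of those predictions, refit on the combined data) can be sketched without `xgboost` or `sklearn`. This is an illustration under assumptions: the one-feature least-squares model `fit_line`, the helper `pseudo_label_fit`, and the toy numbers are hypothetical stand-ins, not the regressor or data used in this notebook.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def pseudo_label_fit(x_lab, y_lab, x_unlab, sample_rate=0.2, seed=42):
    rng = random.Random(seed)
    # 1. Fit on the labeled rows only
    a, b = fit_line(x_lab, y_lab)
    # 2. Predict pseudo-labels for the unlabeled rows
    pseudo = [(x, a * x + b) for x in x_unlab]
    # 3. Keep a random fraction of the pseudo-labeled rows
    k = int(len(pseudo) * sample_rate)
    sampled = rng.sample(pseudo, k)
    # 4. Refit on labeled rows plus the sampled pseudo-labeled rows
    xs = list(x_lab) + [x for x, _ in sampled]
    ys = list(y_lab) + [y for _, y in sampled]
    return fit_line(xs, ys), k

x_lab = [0.0, 1.0, 2.0, 3.0]
y_lab = [1.0, 3.1, 4.9, 7.0]
x_unlab = [float(i) for i in range(4, 14)]  # 10 rows without a target

(slope, intercept), n_added = pseudo_label_fit(x_lab, y_lab, x_unlab)
print(n_added, round(slope, 3), round(intercept, 3))
```

With a linear base model the pseudo-labeled points lie exactly on the first fit, so the refit line is unchanged up to rounding. A non-linear base model, such as a tree ensemble, can shift after the pseudo-labeled rows are added, and that shift is the effect the technique relies on.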

In [None]:
target = 'Spending_Score'
features = final.columns[1:-1]

train = final[final[target].notna()]
test = final[final[target].isna()]

# Preprocess the data
X_train = train[features]
y_train = train[target]

X_test = test[features]

# Create the PseudoLabeler with XGBRegressor as the base regressor
model = PseudoLabeler(
    xgb.XGBRegressor(nthread=1),
    test,
    features,
    target
)


In [None]:
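from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# NOTE: this cell expects a binary 'uninsured' column, which does not appear
# in the column preview of the hospital data above
X = df.loc[:, df.columns != 'uninsured']
y = df.uninsured

smote = SMOTE(random_state=69)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)

columns = X_train.columns

# Oversample the training split only (fit_resample replaces the older fit_sample)
os_data_X, os_data_y = smote.fit_resample(X_train, y_train)

os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=['uninsured'])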

# Check the numbers
print("length of oversampled data is ",len(os_data_X))
print("Number of insured in oversampled data",len(os_data_y[os_data_y['uninsured']==0]))
print("Number of uninsured",len(os_data_y[os_data_y['uninsured']==1]))
print("Proportion of insured in oversampled data is ",len(os_data_y[os_data_y['uninsured']==0])/len(os_data_X))
print("Proportion of uninsured data in oversampled data is ",len(os_data_y[os_data_y['uninsured']==1])/len(os_data_X))
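
Unlike random oversampling, SMOTE creates new minority points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. The helper `smote_like`, its parameters, and the toy points are hypothetical and simplified; this is not `imblearn`'s implementation.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point (Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
synthetic = smote_like(minority, n_new=5)
for point in synthetic:
    print(point)
```

Each synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies. The interpolation is only defined for numeric features without gaps, so missing values would have to be handled before a step like this.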

In [102]:
ar1 = [100,100, 50, 40, 40, 20, 10]
ar2 = [5, 25, 50,125]

In [98]:
sorted(ar1,reverse=True)

[100, 100, 50, 40, 40, 20, 10]

In [99]:
min(ar1)

10

In [108]:
def climbingLeaderboard(scores, alice):
    # Dense ranking: equal scores share a rank, so rank against the distinct scores
    ranks = sorted(set(scores), reverse=True)
    board = []
    current_index = len(ranks) - 1
    # Assumes alice is sorted ascending (as in ar2), so the pointer only moves toward the top
    for play in alice:
        while current_index >= 0 and play >= ranks[current_index]:
            current_index -= 1
        board.append(current_index + 2)

    return board

climbingLeaderboard(ar1, ar2)

[6, 4, 2, 1]