# Title: Identifying Most Impactful Metrics in Qualified Health Plan-Eligible Uninsured Adults
## Author: Sooho Myoung
## Date: September 26th - October 2nd 2021

### Data Selection
* I chose a dataset containing estimates and key demographic features of the uninsured population of the United States at the state, county, and local (PUMA) level. The dataset was supplied by Elliot Inman of SAS Institute and is based on the Census Bureau's 2019 American Community Survey.
    * From the dataset, I specifically selected the data from Subsidized QHP-Eligible Adults by Number.
* My main objective was to build a model that accurately identifies the most impactful demographic features of qualified health plan (QHP)-eligible uninsured adults.

### Data Pre-Processing & Explorative Analysis

In [1]:
# imports
import numpy as np # linear algebra
import pandas as pd # data processing

import os # retrieve filename
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
df = pd.read_csv("/kaggle/input/subsidized-qhp-eligible/subsidized-qhp-eligible.csv") # read CSV file into DataFrame
df.head() # return first 5 rows

In [3]:
df.info() # summary of DataFrame

* Data mostly consists of the 'object' datatype
* Need to convert to a numerical datatype
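
The conversion steps in the next cells can be previewed on a small example. The sketch below uses a hypothetical toy frame (its values are invented, not taken from the dataset) and assumes the raw file stores numbers as strings with thousands separators and marks suppressed cells with `**`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the raw file: numbers stored as strings,
# thousands separators, and '**' marking suppressed values.
raw = pd.DataFrame({
    "Male": ["1,200", "**", "350"],
    "Female": ["1,050", "400", "**"],
})

cleaned = (
    raw.replace("**", np.nan)                  # suppressed cells become NaN
       .replace({",": ""}, regex=True)         # "1,200" becomes "1200"
       .apply(pd.to_numeric, errors="coerce")  # object columns become float64
)

print(cleaned.dtypes)
print(cleaned)
```

Both columns end up as `float64`, with NaN wherever the raw file had `**`.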

In [4]:
df = df.replace('**', np.nan) # replaces '**' with NaN in the DataFrame
df.head()

In [5]:
df.isnull().sum(axis = 0) # counts how many NaN values are in each column

* I removed the categorical columns.
* Several columns contain many NaN values, so I dropped the columns in which more than 5% of the values are missing.
    * The low counts behind these gaps can be explained intuitively (e.g., there are few Russian-speaking households in the U.S.)
* To avoid shrinking the dataset, I replaced the remaining NaN values with the mean of each column instead of dropping rows.
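
The notebook drops columns by position. As an alternative, here is a minimal sketch of selecting them by their share of missing values, then filling the remaining gaps with the column mean (as the code in cell In [8] does). The toy data and the column name `Russian Speaking HH` are hypothetical; only the 5% cutoff comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows, one column with 2.5% NaN and one with 75% NaN.
toy = pd.DataFrame({
    "Male": [10.0, 12.0, np.nan] + [11.0] * 37,
    "Russian Speaking HH": [np.nan] * 30 + [1.0] * 10,
})

nan_share = toy.isna().mean()          # fraction of NaN values per column
kept = toy.loc[:, nan_share <= 0.05]   # keep columns with at most 5% NaN
imputed = kept.fillna(kept.mean())     # fill remaining gaps with column means

print(nan_share)
print(list(kept.columns))
```

Selecting by threshold rather than by position keeps the cell valid if the column order of the CSV changes.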

In [6]:
df.drop(df.columns[[0,1,2,5,6,10]], axis=1, inplace=True) # drop the categorical columns and the columns with more than 5% NaN values
df.head()

In [7]:
df = df.replace({',':''},regex=True)  # removes the comma(s) in between the numbers
df.head()

In [8]:
df = df.apply(pd.to_numeric, errors='coerce') # convert each column to a numeric dtype; unparseable values become NaN
df = df.apply(lambda x: x.fillna(x.mean())) # fill remaining NaN values with the column mean
df.info()

* Confirmed that the datatype is now numerical and that no NaN values remain

In [9]:
df.head(10)

* Data is now clean and ready for analysis

### Random Forest Regression
#### Model Creation


In [10]:
# imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

x = df.loc[ : , df.columns != 'Uninsured Population (Excluding Undocumented)'] # features
y = df['Uninsured Population (Excluding Undocumented)'] # labels

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3) # split dataset into training (70%) and test (30%)
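
A quick way to confirm how `test_size` maps to split sizes is to run the split on stand-in data. The arrays below are hypothetical, and `random_state=42` is an added assumption that makes the split reproducible; the cell above does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42  # fixed seed is an assumption
)

print(len(X_tr), len(X_te))  # number of training rows, number of test rows
```

With 100 rows and `test_size=0.3`, 70 rows go to training and 30 to testing.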

In [11]:
clf = RandomForestRegressor(n_estimators=1000,verbose=3,n_jobs=-1, bootstrap=False, warm_start=True) # create a random forest regressor with 1,000 trees

clf.fit(x_train,y_train) # train the model using training sets

y_pred = clf.predict(x_test) # predict y using model with input of test data

In [12]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred)) 
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred)) 
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
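
For reference, the three metrics printed above can be computed by hand. The sketch below uses hypothetical true and predicted values, not results from this model.

```python
import numpy as np

# Hypothetical true and predicted values, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_hat        # per-row residuals: [-10, 10, -30, 0]
mae = np.mean(np.abs(errors))  # mean absolute error: (10 + 10 + 30 + 0) / 4
mse = np.mean(errors ** 2)     # mean squared error: (100 + 100 + 900 + 0) / 4
rmse = np.sqrt(mse)            # root mean squared error, in the units of y

print(mae, mse, rmse)
```

Because MSE squares each residual, the single 30-unit miss dominates it, which is why RMSE comes out larger than MAE here.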

In [13]:
col = list(df.columns.values) # create a new variable "col" for index
col.remove('Uninsured Population (Excluding Undocumented)')

feature_imp = pd.Series(clf.feature_importances_,index=col).sort_values(ascending=False)
feature_imp

* The five demographic features with the highest importance for the dependent variable "Uninsured Population," which specifies subsidized QHP-eligible uninsured adults, are:
    * Full-time Worker in Family
    * English Spoken in HH
    * Female
    * High School Diploma
    * Male
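
The ranking above comes from `feature_importances_`, which scikit-learn derives from impurity reduction inside the trees. One possible cross-check, not part of the original analysis, is permutation importance on held-out data. The sketch below runs it on synthetic data with hypothetical feature names, so its numbers say nothing about the QHP dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = 5.0 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for name, score in zip(["feature_0", "feature_1", "feature_2"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

If both methods agree on the top features, the ranking is less likely to be an artifact of how the trees were split.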

#### Data Visualization

In [14]:
# imports
import seaborn as sns # graphics
import matplotlib.pyplot as plt # graphics

%matplotlib inline

# creating a bar plot
sns.barplot(x=feature_imp[:10], y=feature_imp.index[:10])

# add labels to the graph
plt.xlabel('Feature Importance Score')
plt.title("Visualizing Important Features")
plt.show()

### Light Gradient Boosting (LGB)
#### Model Creation

In [15]:
# imports
import lightgbm as lgb # gradient boosting
import scipy # model fitting
from sklearn.model_selection import train_test_split # train & test data for LGB

In [16]:
x2 = df.drop("Uninsured Population (Excluding Undocumented)" ,axis= 1) # create a copy of df without dependent

In [17]:
y2 = df[['Uninsured Population (Excluding Undocumented)']] # create a copy of dependent only

In [18]:
col2 = list(df.columns.values) # make a list of column names
col2.remove('Uninsured Population (Excluding Undocumented)') # remove dependent from list

In [19]:
x2, x2_test, y2, y2_test = train_test_split(x2, y2, test_size=0.3, random_state=42) # split train and test data

In [20]:
train_data = lgb.Dataset(x2, label=y2,feature_name=col2) # designate train data

In [21]:
test_data = lgb.Dataset(x2_test, label=y2_test,feature_name=col2) # designate test data

In [22]:
parameters = { # set parameters (i.e., base values) for LGB
    'objective': 'regression', # the target is a continuous count
    'metric': 'rmse', # evaluate with RMSE, which suits a continuous target
    'boosting': 'gbdt',
    'num_leaves': 100,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

In [23]:
model = lgb.train(parameters, # train until the validation RMSE has not improved for 100 rounds
                       train_data,
                       valid_sets=[test_data],
                       num_boost_round=5000,
                       callbacks=[lgb.early_stopping(stopping_rounds=100)]) # early stopping via callback

In [24]:
model.save_model('model2.txt', num_iteration=model.best_iteration) # save model as a TXT file

In [25]:
y2_pred = model.predict(x2_test) # predict y using model with input of test data

In [26]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y2_test, y2_pred)) 
print('Mean Squared Error:', metrics.mean_squared_error(y2_test, y2_pred)) 
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y2_test, y2_pred)))

#### Data Visualization

In [27]:
lgb.plot_importance(model, max_num_features=10) # plot the feature importance

In [28]:
lgb.create_tree_digraph(model, tree_index=0, show_info=None, precision=3, orientation='horizontal') # example tree digraph

In [29]:
# imports
from ggplot import * # for graphics

In [30]:
ggplot(df, aes(x='Full-time Worker in Family', y='English Spoken in HH', color='Uninsured Population (Excluding Undocumented)')) +\
    geom_point(alpha=.5) +\
    xlim(0,255500) +\
    ylim(0, 371500) +\
    theme_bw() +\
    ggtitle("QHP-Eligible Uninsured Population by English Spoken in HH and Full-time Worker in Family")

* This plot suggests that the QHP-eligible uninsured population tends to be larger where more households speak English and more families include a full-time worker.