# Executive Report - Kyle Lombardo

## Part I: Research Question

With the recent uptick in hospital readmissions due to the addition of COPD and Hip & Knee measurement and historical trends, there is a need to understand why there has been a historically high number of readmissions in the hospital system. This executive report builds on the previous reports and attempts to answer the question of whether certain demographics, locales, pre-existing conditions, or other variables result in a higher likelihood of readmission.

The Exploratory Data Analysis report showed a statistical relationship between 'ReAdmis' and each of 'Population', 'Children', 'Initial_days', and 'TotalCharge' when compared using a t-test. These results were similar to those of the first report, which found a relationship between 'ReAdmis' and 'TotalCharge', 'Initial_days', and 'VitD_levels' when analyzed using PCA.

In the previous report, logistic regression found a very strong relationship between 'ReAdmis' and the predictors 'Item8' and 'Initial_days', with an accuracy of 98%, an ROC AUC of 0.998, and the following confusion matrix:

X | Predict ReAdmis | Predict Not
----- | ----- | ----- 
Actual ReAdmis | 1241 | 25
Actual Not | 25 | 709


This report will use K-Nearest Neighbors (KNN), a supervised machine learning algorithm, to see whether it is more accurate than logistic regression at predicting whether a patient will be readmitted to the hospital after being discharged. One major stipulation of KNN is that the features must be numeric, because the algorithm works on distances between data points. Therefore, the data set will need to be cleaned accordingly. This report will also try to optimize the KNN model using hyperparameter tuning to provide the best accuracy possible.

The data used for this report consists of 10,000 patient records described by the following criteria from 'D207 D208 D209 Medical Data Consideration and Dictionary.pdf'. All strings that could not be re-expressed were removed, as they cannot be used for KNN. Redundant data, such as 'Lat' and 'Lng' compared to 'Zip', and meaningless data, such as 'Customer_id', were removed. Lastly, because KNN is only compatible with numeric values, many features will need to be re-expressed as well.

(Criteria marked with a * were removed for this report)


Criteria | Data Type | Example | Description
----- | ----- | ----- | -----
*CaseOrder | Integer | 1034 | Index Values
*Customer_id	|String |‘D550524’|	Unique identifier for patient
*Interaction |	String|	‘8cd49b13-f45a-4b47-a2bd-173ffa932c2f’|	Unique Identifier for transaction, procedure, and admission
*UID	|String|	‘3a83ddb66e2ae73798bdf1d705dc0932’	|Unique Identifier for transaction, procedure, and admission
*City	|String	|‘Braggs’	|Patient city address from billing
*State	|String|	‘AL’|	Patient state address from billing
*County	|String|	‘Morgan’|	Patient county address from billing
Zip | Integer | 35621 | Patient zip code from billing
*Lat	|Float	|-86.5404	|Patient latitude address from billing
*Lng	|Float	|-81.1272	|Patient longitude address from billing
Population	|Integer|	281	|Patient city address population from billing
Area	|String	|‘Urban’|	Patient address from billing zoning type
*TimeZone | String | ‘America/New_York’ | Patient address from billing time zone locale
*Job	|String	|‘Actuary’	|Patient or primary insurance holder’s occupation 
Children|	Integer|	2|	Number of children in patient’s household
Age	|Integer|	53	|Patient’s age
Income|	Float|	88126.93|	Patient or primary insurance holder’s yearly income
Marital|	String|	‘Married’|	Patient’s marital status
Gender	|String|	‘Female’	|Patient’s self-identification of gender
ReAdmis	|String	|‘Yes’	|Whether patient has been readmitted within a month of last release
VitD_levels | Float | 47.81348 | Patient’s vitamin D level (ng/mL)
Doc_visits|	Integer|	5	|Number of visits by primary physician during initial hospitalization
Full_meals_eaten|	Integer	|1|	Number of full meals the patient ate while in hospital.
vitD_supp | Integer | 2 | Number of times vitamin D supplement was administered to patient
Soft_drink	|String|	‘Yes’|	Whether patient consumes >= 3 soft drinks in a day
Initial_admin	|String	|‘Observation Admission’	|Means by which patient was admitted to hospital initially
HighBlood	|String	|‘Yes’	|Whether patient has high blood pressure
Stroke	|String|	‘No’	|Whether patient has had a stroke
Complication_risk|	String	|‘High’	|Complication risk level as determined by primary physician
Overweight|	String	|‘Yes’	|Whether patient is considered obese considering his or her demographics
Arthritis|	String	|‘No’|	Whether patient has arthritis
Diabetes	|String	|‘Yes’|	Whether patient has diabetes
Hyperlipidemia	|String|	‘No’|	Whether patient has hyperlipidemia
BackPain	|String	|‘Yes’|	Whether patient has chronic back pain
Anxiety	|String	|‘Yes’	|Whether patient has anxiety
Allergic_rhinitis	|String	|‘No’|	Whether patient has allergic rhinitis
Reflux_esophagitis	|String	|‘Yes’|	Whether patient has reflux esophagitis
Asthma	|String|	‘No’|	Whether patient has asthma
Services	|String|	‘CT Scan’	|Primary service patient received from hospital
Initial_days|	Float	|7.302395	|Number of days patient stayed in hospital on initial visit
TotalCharge	|Float	|2631.702|	Average amount charged to patient daily 
Additional_charges|	Float|	14382.23|	Average amount for miscellaneous procedures, treatments, medicines, etc

Patient survey responses rating the importance of each item (1 = most important, 8 = least important)

Criteria | Data Type | Example | Description
----- | ----- | ----- | -----
Item1 |Integer| 3| Timely admission
Item2	|Integer|2|	Timely treatment
Item3	|Integer|5|	Timely visits
Item4	|Integer|6|	Reliability
Item5	|Integer|1|	Options
Item6	|Integer|2|	Hours of treatment
Item7	|Integer|6|	Courteous staff
Item8	|Integer|4|	Evidence of active listening from doctor

## Part II: Method Justification

K-nearest neighbors (KNN) is a machine learning method that predicts a categorical target variable from one or more numeric variables in a data set. KNN first calculates the distance between the data point to be classified and every labeled point in the training set. It then takes the user-defined number K of nearest training points, checks which class most of them belong to, and outputs that class as the prediction. In this case, it will predict whether or not a patient is readmitted.
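The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the scikit-learn implementation used later in the report; the function name `knn_predict` and the toy arrays are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every labeled training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority label among those k neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two numeric features and a binary label
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # near the first cluster
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3))  # near the second cluster
```

A point close to the first cluster receives label 0 and a point close to the second receives label 1, because all three of its nearest neighbors share that label.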

For KNN to work properly, all data fed into it as predictors must be cleaned, with no missing values. All feature variables must be numeric, so categorical data in the form of strings cannot be used directly. All categorical data will be converted using one-hot encoding through the pandas function `pd.get_dummies()`. This results in a much higher number of features in the final data set, as each categorical variable is split into n binary features, where n = number of unique categories - 1 (one category is dropped as redundant).<sup>1</sup>
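As a small illustration of the n = categories - 1 rule, the sketch below encodes a toy frame. The column names mirror the data dictionary, but the rows and the 'Rural' and 'Suburban' values are made up for the example.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the data dictionary
toy = pd.DataFrame({
    "Age": [53, 41, 67],
    "Area": ["Urban", "Rural", "Suburban"],   # 3 categories -> 2 dummy columns
    "ReAdmis": ["Yes", "No", "No"],           # 2 categories -> 1 dummy column
})

# drop_first=True removes the first (alphabetical) level of each categorical column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

The numeric column `Age` passes through unchanged, `Area` becomes `Area_Suburban` and `Area_Urban` (with 'Rural' as the dropped baseline), and `ReAdmis` becomes the single column `ReAdmis_Yes`.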

In this report Python will be used. While R could be used, Python's scikit-learn library has many useful tools, including hyperparameter tuning with grid search, pipelines, and scaling techniques, that make KNN easy and quick to use.

The following libraries are used in this report. Pandas and NumPy are crucial tools for importing, re-expressing, and manipulating the data. Scikit-learn is used for several things in this report, including:

<ol>
    <li><strong><em>Pipeline</em></strong>: a way to chain several scikit-learn steps together in a clean and quick fashion.</li>
    <li><strong><em>GridSearchCV</em></strong>: used for hyperparameter tuning. It tries each combination of hyperparameters in a given grid using cross-validation and reports the best one according to a score. This score will be set to accuracy for KNN.</li>
    <li><strong><em>train_test_split</em></strong>: used to split the feature and target variables into separate training and testing sets. This is best practice so that the model can be tested on data that was not used in training.</li>
    <li><strong><em>StandardScaler</em></strong>: used to scale all the features into z-scores. KNN can easily over-emphasize variables with larger scales; this puts all the features on an even scale.</li>
    <li><strong><em>KNeighborsClassifier</em></strong>: the KNN model described above.</li>
    <li><strong><em>classification_report</em></strong>: gives a summary of precision, recall, F1-score, and accuracy.</li>
    <li><strong><em>confusion_matrix</em></strong>: a table of the counts of true positives, true negatives, false positives, and false negatives, comparing the predicted values with the actual values.</li>
    <li><strong><em>roc_auc_score</em></strong>: area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. A perfect classifier gives an area of 1.0, while random guessing gives about 0.5.</li>
</ol>
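To make the scaling point in the list above concrete, here is a minimal sketch of how `StandardScaler` can change which point counts as the nearest neighbor. The four rows are made-up values for two features on very different scales (an income-like column and a children-like column), and `nearest_index` is a helper written only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [income-like value, children-like value]
X_toy = np.array([[30000.0, 0.0],
                  [30500.0, 8.0],
                  [45000.0, 1.0],
                  [60000.0, 2.0]])

def nearest_index(query, points):
    """Index of the row in points closest to query (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))

# Unscaled: the large-valued first column dominates the distance
raw_nearest = nearest_index(X_toy[0], X_toy[1:])

# Scaled to z-scores: both columns contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_nearest = nearest_index(X_scaled[0], X_scaled[1:])

print(raw_nearest, scaled_nearest)
```

Before scaling, the nearest neighbor of the first row is the row with a similar first-column value even though its second column differs by 8. After scaling, that difference of 8 carries real weight and a different row becomes the nearest.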

The data set is imported here as `raw_data` using pandas.

In [23]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

raw_data = pd.read_csv('medical_clean.csv')

## Part III: Data Preparation

As stated above, the data must be in the correct format and data types in order to be run properly in KNN. The data set will need to undergo a data cleaning step as follows: 

First, all variables not being used will be removed from the data set. Only numeric variables, or variables that can be re-expressed as numeric, will remain, since this is the only kind of data KNN is able to use. Next, all categorical variables, including the target variable, will be re-expressed with one-hot encoding using the pandas function `get_dummies()`. This breaks each categorical variable up into separate binary variables. The first binary variable of each is dropped because it is redundant.

Second, any missing values must be imputed using either an average or the most frequent value.
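The two imputation options mentioned above can be sketched with pandas alone. This is only an illustration on a made-up frame with gaps; the report's data turns out to need no imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "Income": [40000.0, np.nan, 60000.0, 50000.0],
    "Area": ["Urban", "Rural", np.nan, "Urban"],
})

# Numeric column: fill gaps with the column average
toy["Income"] = toy["Income"].fillna(toy["Income"].mean())

# Categorical column: fill gaps with the most frequent value (the mode)
toy["Area"] = toy["Area"].fillna(toy["Area"].mode()[0])

print(toy)
```

The mean suits numeric columns without heavy skew, while the mode is the usual choice for categorical columns, where an average has no meaning.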

### Removing unnecessary and redundant data

In [24]:
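# Drop identifiers and location/occupation strings that cannot be re-expressed
df = raw_data.drop(['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State',
                    'County', 'Lat', 'Lng', 'TimeZone', 'Job'], axis=1)
# One-hot encode the remaining categorical columns, dropping the first level of each
df = pd.get_dummies(df, drop_first=True)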

### Re-expressing data to numeric data types and missing values

A quick way to see how many missing values are in the data set is `df.info()`. It shows that there are no missing values, so imputation will not need to be performed on this data set. Also, all data types are either integers or floats and are ready to be run using KNN.

With the data set `df` cleaned and in the correct format, the mean and standard deviation of each feature can be calculated using `df.describe()`.
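The same checks can also be made programmatically instead of being read off printed output. The sketch below uses a small made-up frame in place of `df`; note that `describe()` reports the standard deviation, so the variance is requested explicitly through `agg`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the prepared data set
toy = pd.DataFrame({
    "Children": [0, 1, 3, 2],
    "Initial_days": [7.3, 35.8, 61.2, 10.5],
})

# Count of missing values per column (all zeros means no imputation is needed)
missing_per_column = toy.isna().sum()

# True only if every column has an integer or float dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in toy.dtypes)

# Mean, standard deviation, and variance of each feature
summary = toy.agg(["mean", "std", "var"])

print(missing_per_column)
print(all_numeric)
print(summary)
```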

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 48 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Zip                                  10000 non-null  int64  
 1   Population                           10000 non-null  int64  
 2   Children                             10000 non-null  int64  
 3   Age                                  10000 non-null  int64  
 4   Income                               10000 non-null  float64
 5   VitD_levels                          10000 non-null  float64
 6   Doc_visits                           10000 non-null  int64  
 7   Full_meals_eaten                     10000 non-null  int64  
 8   vitD_supp                            10000 non-null  int64  
 9   Initial_days                         10000 non-null  float64
 10  TotalCharge                          10000 non-null  float64
 11  Additional_charges           

In [26]:
df.describe(include='all')

Statistic | Zip | Population | Children | Age | Income | VitD_levels | Doc_visits | Full_meals_eaten | vitD_supp | Initial_days | ... | Diabetes_Yes | Hyperlipidemia_Yes | BackPain_Yes | Anxiety_Yes | Allergic_rhinitis_Yes | Reflux_esophagitis_Yes | Asthma_Yes | Services_CT Scan | Services_Intravenous | Services_MRI
----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | ... | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0
mean | 50159.3239 | 9965.2538 | 2.0972 | 53.5117 | 40490.49516 | 17.964262 | 5.0122 | 1.0014 | 0.3989 | 34.455299 | ... | 0.2738 | 0.3372 | 0.4114 | 0.3215 | 0.3941 | 0.4135 | 0.2893 | 0.1225 | 0.313 | 0.038
std | 27469.588208 | 14824.758614 | 2.163659 | 20.638538 | 28521.153293 | 2.017231 | 1.045734 | 1.008117 | 0.628505 | 26.309341 | ... | 0.44593 | 0.472777 | 0.492112 | 0.467076 | 0.488681 | 0.492486 | 0.45346 | 0.327879 | 0.463738 | 0.191206
min | 610.0 | 0.0 | 0.0 | 18.0 | 154.08 | 9.806483 | 1.0 | 0.0 | 0.0 | 1.001981 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
25% | 27592.0 | 694.75 | 0.0 | 36.0 | 19598.775 | 16.626439 | 4.0 | 0.0 | 0.0 | 7.896215 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
50% | 50207.0 | 2769.0 | 1.0 | 53.0 | 33768.42 | 17.951122 | 5.0 | 1.0 | 0.0 | 35.836244 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
75% | 72411.75 | 13945.0 | 3.0 | 71.0 | 54296.4025 | 19.347963 | 6.0 | 2.0 | 1.0 | 61.16102 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0
max | 99929.0 | 122814.0 | 10.0 | 89.0 | 207249.1 | 26.394449 | 9.0 | 7.0 | 5.0 | 71.98149 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0


Finally, for the last step in the data preparation stage, the cleaned and prepared dataset is exported for the final report.

In [27]:
df.to_csv(r'C:\Users\kylel\OneDrive\Desktop\Learning\School\D209 - Data Mining I\PA1\prepped_data.csv')

## Part IV: Analysis

With the data cleaned and prepared, it is time to run the KNN model. The data will need to be split into `X` for features and `y` as the target. 

First, a pipeline will be set up to keep things organized. This pipeline will include `StandardScaler()` and `KNeighborsClassifier()`. `StandardScaler()` rescales every column of `X` to a mean of 0 and a standard deviation of 1, so that features measured on large scales do not dominate the distance calculation over features on small scales, such as those re-expressed as 0s and 1s.

Next, the data will be split into training and testing sets using `train_test_split()`. This is a crucial step in the modeling process. If all of `X` were used for training, there would be no way to test the model's accuracy. Splitting the data allows unseen data (the test set) to be used to test for accuracy.

After that, `GridSearchCV()` will be used to tune the `n_neighbors` hyperparameter. It cycles through a range of K values and selects the `n_neighbors` value that gives the highest cross-validated accuracy. The search was run a few times beforehand to find the best range to iterate through. When no `scoring` argument is given, GridSearchCV falls back on the estimator's own `score` method, which is accuracy for classifiers and R<sup>2</sup> for regressors; accuracy is set explicitly here because it suits a binary target.

Finally, the model will be fit and then evaluated.<sup>1</sup>
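One way to find a good range of K values, as described above, is to run a coarse grid first and inspect the cross-validated accuracy for each candidate. The sketch below does this on synthetic data from `make_classification`, which stands in for the patient data; the grid values and data shape are assumptions chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and binary target
X_syn, y_syn = make_classification(n_samples=600, n_features=10,
                                   n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("KNN", KNeighborsClassifier())])

# Coarse grid: K = 1, 11, 21, 31, 41, 51
grid = {"KNN__n_neighbors": np.arange(1, 52, 10)}
search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy for every K that was tried
scores = pd.DataFrame({
    "k": [p["KNN__n_neighbors"] for p in search.cv_results_["params"]],
    "mean_cv_accuracy": search.cv_results_["mean_test_score"],
})
print(scores)
print(search.best_params_)
```

Once the coarse grid shows where accuracy peaks, a second search with a narrow `np.arange` around that value gives the final K.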

In [28]:
X = df.drop(['ReAdmis_Yes'], axis=1)
y = df['ReAdmis_Yes']

In [29]:
# Scale the features, then classify with KNN
steps = [('scaler', StandardScaler()), ('KNN', KNeighborsClassifier())]
pipeline = Pipeline(steps)

# Hold out 20% of the rows as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=41)

# Search K from 235 to 244 with 3-fold cross-validation on the training set
parameters = {'KNN__n_neighbors': np.arange(235,245)}
cv = GridSearchCV(pipeline, parameters, cv=3, scoring='accuracy')

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

In [33]:
print(classification_report(y_test, y_pred))
print('Tuned Model Parameters: {}'.format(cv.best_params_))

              precision    recall  f1-score   support

           0       0.97      0.96      0.96      1279
           1       0.94      0.94      0.94       721

    accuracy                           0.95      2000
   macro avg       0.95      0.95      0.95      2000
weighted avg       0.96      0.95      0.96      2000

Tuned Model Parameters: {'KNN__n_neighbors': 241}


In [34]:
confuse_matrix = confusion_matrix(y_test, y_pred)
confuse_matrix

array([[1232,   47],
       [  43,  678]], dtype=int64)

In [35]:
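# y_pred holds hard 0/1 labels, so this value equals balanced accuracy;
# passing cv.predict_proba(X_test)[:, 1] would give the threshold-free ROC AUC
roc = roc_auc_score(y_test, y_pred)
print('Auc score: {:.3f}'.format(roc))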

Auc score: 0.952


## Part V: Data Summary and Implications

Because the best K-value turned out higher than expected, the range in the `parameters` variable was changed a number of times to find the value that works best for this model. A K-value of 241 is much higher than what is typically used, but it gave the highest cross-validated accuracy in the range searched.

A `classification_report` was also used to see what sort of accuracy this model gives. The 95% accuracy is lower than the 98% the logistic regression provided, but that model was tuned for a smaller number of features. The confusion matrix was also unsurprisingly worse, with more false positives (47) and false negatives (43) than the 25 of each seen in the logistic regression.
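The figures in the classification report can be recovered directly from the KNN confusion matrix shown in Part IV. The short check below uses only the four counts from that output; the variable names are chosen for the example.

```python
import numpy as np

# KNN confusion matrix from Part IV: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[1232, 47],
               [43, 678]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all test patients classified correctly
precision_1 = tp / (tp + fp)         # of those predicted readmitted, share actually readmitted
recall_1 = tp / (tp + fn)            # of those actually readmitted, share that were caught
recall_0 = tn / (tn + fp)            # of those not readmitted, share correctly identified

print(f"accuracy={accuracy:.3f}, precision={precision_1:.3f}, "
      f"recall_1={recall_1:.3f}, recall_0={recall_0:.3f}")
```

Rounded to two decimals, the precision and recall values agree with the report's rows for classes 0 and 1, and the accuracy works out to 1910 correct predictions out of 2000.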

The difference between the two models can also be seen using `roc_auc_score`. As stated above, the area under the ROC curve indicates how well the model trades off the true positive rate against the false positive rate. For this model the AUC score was 0.952. Again, this is a good value, but not as good as the logistic model's value of 0.998. One caveat is that the score here was computed from the hard 0/1 predictions rather than from predicted probabilities, which makes it equal to the average of the two class recalls, so the two AUC values may not be strictly like for like.
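The sketch below illustrates why it matters what is passed to `roc_auc_score`. The first part uses only the counts from the KNN confusion matrix in Part IV; the second part uses synthetic data from `make_classification` and an arbitrary K of 15, both assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Part 1: average of the two class recalls from the report's confusion matrix
recall_0 = 1232 / (1232 + 47)
recall_1 = 678 / (43 + 678)
label_auc_from_cm = (recall_0 + recall_1) / 2
print(f"{label_auc_from_cm:.3f}")

# Part 2: synthetic data, comparing hard labels with predicted probabilities
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=0)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
model.fit(X_tr, y_tr)

auc_labels = roc_auc_score(y_te, model.predict(X_te))              # hard 0/1 labels
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])  # class-1 probabilities
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

print(auc_labels, bal_acc, auc_scores)
```

With hard labels the ROC curve has a single interior point, so the area reduces to the mean of the two class recalls, which is the balanced accuracy. Passing the class-1 probabilities instead evaluates the ranking across all thresholds, which is the usual meaning of ROC AUC.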

With all that said, this was still a good, quick prediction of whether a patient will be readmitted to the hospital. With a bit more fine tuning and a possible reduction in the number of features, this model could be improved upon. The fact that all variables were used for the KNN model is certainly its biggest limitation. Another limitation is the data preparation required before the model can even be run: KNN needs scaled features and no missing values, and it is sensitive to outliers.<sup>2</sup>

The recommended actions at this point are either to use the logistic regression covered in the previous report or to fine tune the KNN model to have fewer, more important features in the `X` set. While the `n_neighbors` hyperparameter was tuned for this model using GridSearch, the features were not touched at all.
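One possible way to reduce the feature set, sketched here under assumptions, is to add a univariate selection step to the pipeline and tune the number of kept features alongside K. `SelectKBest` with `f_classif` is just one option and was not used in this report; the synthetic data and grid values below are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 3 of which carry signal
X_syn, y_syn = make_classification(n_samples=600, n_features=20, n_informative=3,
                                   n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),  # keep the top-scoring features
    ("KNN", KNeighborsClassifier()),
])

# Tune the number of kept features together with the number of neighbors
grid = {"select__k": [3, 5, 10, 20], "KNN__n_neighbors": [5, 15, 25]}
search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# Boolean mask of the features kept by the best pipeline
kept = search.best_estimator_.named_steps["select"].get_support()
print(search.best_params_)
print(int(kept.sum()), "features kept")
print(search.score(X_te, y_te))
```

Because the selection step sits inside the pipeline, it is refit on each cross-validation fold, which avoids choosing features with knowledge of the held-out data.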

If it is necessary to use this particular model, the 95% accuracy indicates that it would serve well, but it would make more prediction errors than the logistic model.

## Resources

1. Hugo Bowne-Anderson, <em>Supervised Learning with scikit-learn</em>, DataCamp, accessed 12 October 2021, <https://app.datacamp.com/learn/courses/supervised-learning-with-scikit-learn>.
2. Genesis, <em>Pros and Cons of K-Nearest Neighbors</em>, Genesis, accessed 15 October 2021, <https://www.fromthegenesis.com/pros-and-cons-of-k-nearest-neighbors/>.