## *Will a patient be readmitted to a hospital within 30 days?*

![](https://dsuqs7rf8y6gn.cloudfront.net/wp-content/uploads/2019/03/patient-based.png)

# Patient Readmittance Analysis

In this notebook, I analyze whether a patient will be readmitted to a hospital, given certain medical and demographic data. Understanding this data requires some healthcare domain knowledge, such as ICD-9 diagnosis codes. I used pandas for EDA and data preprocessing and PySpark for modelling. Models used: Logistic Regression and Random Forest. I also extracted the coefficients and feature importances from both models to find out which parameters carry the most weight in determining whether a patient will be readmitted.

### *If you have any questions or feedback, please comment! And if you find this notebook helpful, do leave an upvote.*


### Importing required libraries

In [None]:
!pip install pyspark
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pyspark
from pyspark import sql
from pyspark.sql import SparkSession, functions as fn, Row
from pyspark.ml import feature, regression, classification, Pipeline, evaluation

# Start (or reuse) a Spark session and keep a handle on its context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [None]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

### Data Exploration and Data Cleaning

In [None]:
df = pd.read_csv('../input/diabetes/diabetic_data.csv')


In [None]:
df

In [None]:
import pandas_profiling

In [None]:
df.profile_report()

In [None]:
df.columns

In [None]:
df["readmitted"].value_counts()

The columns at positions 24 through 46 (`range(24, 47)`) are medications, each recorded as a category. Most of the columns are nominal and categorical, so we handle them by creating dummy variables. Notice that `drop_first=True` drops one dummy per column. During interpretation, the coefficients of the remaining dummies must therefore be read relative to the dropped (reference) level.

In [None]:
df = pd.get_dummies(df,columns = [df.columns.values[i] for i in range(24,47) ], prefix=[df.columns.values[i] for i in range(24,47)], prefix_sep='_',drop_first=True) 
##Dummy reference Medication Down
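
To make the reference-level idea concrete, here is a minimal plain-Python sketch of what dropping the first dummy does. The helper `one_hot_drop_first` is hypothetical (it is not a pandas function), and it assumes the levels are ordered alphabetically before the first one is dropped, which is consistent with `Down` being the reference level noted above.

```python
def one_hot_drop_first(values, prefix):
    """Return (reference_level, rows); each row is a dict of 0/1 indicators."""
    levels = sorted(set(values))            # assumed ordering: alphabetical
    reference, kept = levels[0], levels[1:]  # the first level gets no column
    rows = [{f"{prefix}_{lvl}": int(v == lvl) for lvl in kept} for v in values]
    return reference, rows

reference, rows = one_hot_drop_first(["No", "Steady", "Down", "Up", "No"], "insulin")
print(reference)  # the dropped (reference) level
print(rows[2])    # the 'Down' row: every remaining indicator is 0
```

A row belonging to the reference level is encoded as all zeros, which is why every other coefficient is read as a difference from that level.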

In [None]:
df.shape

In [None]:
df.columns

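This is our target variable. We first encode `readmitted` as 0 (`NO`), 1 (`<30`) and 2 (`>30`), then derive `readmittedbinary` for binary classification: 0 means no readmission and 1 means any readmission, whether within 30 days or later.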

In [None]:
df['readmitted'] = df['readmitted'].map({'NO': 0, '<30': 1, ">30":2})
df['readmittedbinary'] = df['readmitted'].map({0: 0, 1: 1, 2:1})

In [None]:
df = pd.get_dummies(df, columns=["change",'max_glu_serum','A1Cresult','diabetesMed'], prefix = ["change",'max_glu_serum','A1Cresult','diabetesMed'],prefix_sep='_',drop_first=True)
## Dummy Reference A1Cresult_>7,max_glu_serum>200,diabetesMed=No, change=ch

In [None]:
df['age'] = df['age'].map({'[0-10)':5,'[10-20)':15, '[20-30)':25,'[30-40)':35,'[40-50)':45,'[50-60)':55,'[60-70)':65,'[70-80)':75,'[80-90)':85,'[90-100)':95})
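
The mapping above replaces each ten-year age bracket with its midpoint. As a small sketch, the same dictionary can be derived from the labels themselves; `bucket_midpoint` is a hypothetical helper name, and the sketch assumes every label has the form `[low-high)`.

```python
def bucket_midpoint(label):
    """Convert an interval label such as '[40-50)' to its integer midpoint."""
    low, high = label.strip("[)").split("-")
    return (int(low) + int(high)) // 2

# Rebuild the ten bracket labels used by the dataset and map each to its midpoint
age_labels = [f"[{lo}-{lo + 10})" for lo in range(0, 100, 10)]
age_map = {label: bucket_midpoint(label) for label in age_labels}
print(age_map)
```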

In [None]:
df.drop(['encounter_id','patient_nbr','weight','admission_type_id','discharge_disposition_id','admission_source_id','medical_specialty','payer_code'],axis=1,inplace=True)



In [None]:
df = df.loc[df['gender'].isin(['Male', 'Female'])]

In [None]:
df.replace('?', np.nan, inplace = True)

In [None]:
df= df.dropna()##Clean pandas df without dummy variables 

In [None]:
df.columns

### Data Visualization

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
a = sns.countplot(x = df['age'], hue= df['readmitted'])

In [None]:
d = sns.countplot(x = df['readmitted'])

In [None]:
b = sns.countplot(x = df['gender'], hue= df['readmitted'])

In [None]:
c = sns.countplot(x = df['race'], hue= df['readmitted'])

In [None]:
count_of_y = df["age"].groupby(df["readmitted"]).value_counts().rename("counts").reset_index()
count_of_y
fig = sns.lineplot(x="age", y="counts", hue="readmitted", data=count_of_y)

In [None]:
sns.heatmap(df.corr())

In [None]:
plt.figure(figsize=(25, 8))
a = df.corr()
b = a['readmitted']
c= b.to_frame()
type(c)
c.sort_values(by = ['readmitted'], ascending = False , inplace = True)
pos = c.head(8)
c.sort_values(by = ['readmitted'], ascending = True , inplace = True)
neg = c.head(8)
neg


In [None]:
pos.index.name = 'feature'
pos.reset_index(inplace=True)
pos


In [None]:
neg.index.name = 'feature'
neg.reset_index(inplace=True)
neg

In [None]:
pos=pos.drop(pos.index[0:2])
pos

In [None]:
posplot = sns.barplot(x='feature', y="readmitted", data=pos)
posplot.set_xticklabels(posplot.get_xticklabels(),rotation=30)

In [None]:
negplot = sns.barplot(x='feature', y="readmitted", data=neg)
negplot
negplot.set_xticklabels(negplot.get_xticklabels(),rotation=40)

In [None]:
df.to_csv("clean.csv")

In [None]:
sorted(df.columns)

In [None]:
df

### Creating a Spark DataFrame

In [None]:
spark_df = spark.read.csv('clean.csv', header=True, inferSchema=True)

In [None]:
spark_df.show()

In [None]:
#spark_df=spark_df.withColumnRenamed("glimepiride-poglitazone","glimepiridepoglitazone").withColumnRenamed("glyburide-metformin","glyburidemetformin").withColumnRenamed("glipizide-metformin","glipizidemetformin").withColumnRenamed("glimepiride-pioglitazone","glimepiridepioglitazone").withColumnRenamed("metformin-rosiglitazone","metforminrosiglitazone").withColumnRenamed("metformin-pioglitazone","metforminpioglitazone")

### Feature Engineering

Here, I created a function to deal with ICD-9 codes. Please read the data description for more information about these codes.

In [None]:
def amit(row):    
    ma=0
    mb=0
    md=0
    me=0
    sa=0
    dr=dd=ddt=di=dm=dg=dn=dr2=dd2=ddt2=di2=dm2=dg2=dn2=dr3=dd3=ddt3=di3=dm3=dg3=dn3=0
    
    if "V" in row.diag_1 or "E" in row.diag_1:
        dr=dd=ddt=di=dm=dg=dn=0
    #elif 390 <= float(row.diag_1) <= 459 or float(row.diag_1) == 785: #DUMMY DROPPED REFERENCE
    #    dc =1 ##Circulatory
    elif 460 <= float(row.diag_1) <= 519 or float(row.diag_1) == 786:
        dr =1 #Respiratory
    elif 520 <= float(row.diag_1) <= 579 or float(row.diag_1) == 787:
        dd =1 #Digestive
    elif 250 <= float(row.diag_1) <= 250.999:
        ddt =1 #Diabetes
    elif 800 <= float(row.diag_1) <= 999:
        di =1 #Injury
    elif 710 <= float(row.diag_1) <= 739:
        dm =1 #musculoskeletal
    elif 580 <= float(row.diag_1) <= 629 or float(row.diag_1) == 788:
        dg =1 #Genitourinary
    elif 140 <= float(row.diag_1) <= 239:
        dn =1 #Neoplasms
    else:
        dr=dd=ddt=di=dm=dg=dn=0
        #do=1#others
        
    if "V" in row.diag_2 or "E" in row.diag_2:
        #do2=1
        dr2=dd2=ddt2=di2=dm2=dg2=dn2=0
    #elif 390 <= float(row.diag_2) <= 459 or float(row.diag_2) == 785: #DUMMY DROPPED REFERENCE
    #    dc2 =1 ##Circulatory
    elif 460 <= float(row.diag_2) <= 519 or float(row.diag_2) == 786:
        dr2 =1 #Respiratory
    elif 520 <= float(row.diag_2) <= 579 or float(row.diag_2) == 787:
        dd2 =1 #Digestive
    elif 250 <= float(row.diag_2) <= 250.999:
        ddt2 =1 #Diabetes
    elif 800 <= float(row.diag_2) <= 999:
        di2 =1 #Injury
    elif 710 <= float(row.diag_2) <= 739:
        dm2 =1 #musculoskeletal
    elif 580 <= float(row.diag_2) <= 629 or float(row.diag_2) == 788:
        dg2 =1 #Genitourinary
    elif 140 <= float(row.diag_2) <= 239:
        dn2 =1 #Neoplasms
    else:
        #do2=1#others
        dr2=dd2=ddt2=di2=dm2=dg2=dn2=0
        
    if "V" in row.diag_3 or "E" in row.diag_3:
        #do3=1
        dr3=dd3=ddt3=di3=dm3=dg3=dn3=0
    #elif 390 <= float(row.diag_3) <= 459 or float(row.diag_3) == 785:#DUMMY DROPPED REFERENCE
    #    dc3 =1 ##Circulatory
    elif 460 <= float(row.diag_3) <= 519 or float(row.diag_3) == 786:
        dr3 =1 #Respiratory
    elif 520 <= float(row.diag_3) <= 579 or float(row.diag_3) == 787:
        dd3 =1 #Digestive
    elif 250 <= float(row.diag_3) <= 250.999:
        ddt3 =1 #Diabetes
    elif 800 <= float(row.diag_3) <= 999:
        di3 =1 #Injury
    elif 710 <= float(row.diag_3) <= 739:
        dm3 =1 #musculoskeletal
    elif 580 <= float(row.diag_3) <= 629 or float(row.diag_3) == 788:
        dg3 =1 #Genitourinary
    elif 140 <= float(row.diag_3) <= 239:
        dn3 =1 #Neoplasms
    else:
        dr3=dd3=ddt3=di3=dm3=dg3=dn3=0
        #do3=1#others  

    
    if row.race == "Caucasian":
        ma = 1
    elif row.race == "Asian":
        mb = 1
    #elif row.race == "AfricanAmerican": #DUMMY DROPPED REFERENCE
    #    mc = 1
    elif row.race =="Hispanic":
        me = 1
    else:# :
        ma=0
        mb=0
        me = 0


    if row.gender == "Male":
        sa = 1
    #elif row.gender == "Female": #DROPPED DUMMY REFERENCE
    #    sb = 1
    
    
    r = Row(Caucasian=int(ma) ,Asian=int(mb) ,Hispanic=int(me),male=float(sa),
            Respiratory=dr,
            Digestive= dd,
            Diabetes = ddt,
            Injury= di,
            Muscuskeletal= dm,
            Neoplasms=dn,
            Genitourinary = dg,
            
            Respiratory2=dr2,
            Digestive2= dd2,
            Diabetes2 = ddt2,
            Injury2= di2,
            Muscuskeletal2= dm2,
            Neoplasms2=dn2,
            Genitourinary2 = dg2,
            
            Respiratory3=dr3,
            Digestive3= dd3,
            Diabetes3 = ddt3,
            Injury3= di3,
            Muscuskeletal3= dm3,
            Neoplasms3=dn3,
            Genitourinary3 = dg3,
            
      )
    return(r)
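
The three near-identical `if`/`elif` chains above all apply the same ICD-9 ranges. As a sketch of how that lookup could be factored out, the table below reuses the ranges and check order from the function; the names `ICD9_GROUPS` and `icd9_group` are hypothetical and not part of the notebook. Codes containing `V` or `E`, Circulatory codes, and anything outside the listed ranges all fall into the baseline, as they do above.

```python
# (low, high, extra_code, label), in the same order the function checks them
ICD9_GROUPS = [
    (460, 519, 786, "Respiratory"),
    (520, 579, 787, "Digestive"),
    (250, 250.999, None, "Diabetes"),
    (800, 999, None, "Injury"),
    (710, 739, None, "Musculoskeletal"),
    (580, 629, 788, "Genitourinary"),
    (140, 239, None, "Neoplasms"),
]

def icd9_group(code):
    """Return the diagnosis group for an ICD-9 code string, or None for the baseline."""
    if "V" in code or "E" in code:   # supplementary codes are not numeric
        return None
    value = float(code)
    for low, high, extra, label in ICD9_GROUPS:
        if low <= value <= high or value == extra:
            return label
    return None                      # Circulatory and all other codes

print(icd9_group("250.83"), icd9_group("786"), icd9_group("V27"), icd9_group("428"))
```

Calling one helper for `diag_1`, `diag_2` and `diag_3` keeps the three diagnosis columns consistent with each other.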

In [None]:
dummy_df = spark.createDataFrame(spark_df.rdd.map(amit))

In [None]:
spark_df.show()

In [None]:
dummy_df.show()

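I join the two Spark DataFrames here on a generated row number. This relies on both DataFrames keeping the same row order, which is expected here because `dummy_df` was derived row by row from `spark_df`. The result is our final data with all dummy variables.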

In [None]:
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window
# since there is no common column between these two dataframes add row_index so that it can be joined
spark_df=spark_df.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
dummy_df=dummy_df.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))

dummy_df = dummy_df.join(spark_df, on=["row_index"]).drop("row_index")
dummy_df.show()


Dropping these columns as we already have their dummies

In [None]:
dummy_df = dummy_df.drop("diag_1","diag_2","diag_3","gender","race")

In [None]:
dummy_df = dummy_df.drop("_c0")

In [None]:
#Checking dummies

In [None]:
dummy_df.select('male',"Caucasian","Hispanic","Asian").show()#,'Male','Female', 'Circulatory','Circulatory2','Circulatory3').show()

In [None]:
dummy_df.printSchema()

### Splitting into training, validation and testing

In [None]:
training_df, validation_df, testing_df = dummy_df.randomSplit([0.6, 0.3, 0.1], seed=0)

### Looking at our final columns

In [None]:
dummy_df.columns

In [None]:
featlist = ['Asian',
 'Caucasian',
 'Diabetes',
 'Diabetes2',
 'Diabetes3',
 'Digestive',
 'Digestive2',
 'Digestive3',
 'Genitourinary',
 'Genitourinary2',
 'Genitourinary3',
 'Hispanic',
 'Injury',
 'Injury2',
 'Injury3',
 'Muscuskeletal',
 'Muscuskeletal2',
 'Muscuskeletal3',
 'Neoplasms',
 'Neoplasms2',
 'Neoplasms3',
 'Respiratory',
 'Respiratory2',
 'Respiratory3',
 'male',
 'age',
 'time_in_hospital',
 'num_lab_procedures',
 'num_procedures',
 'num_medications',
 'number_outpatient',
 'number_emergency',
 'number_inpatient',
 'number_diagnoses',
 'metformin_No',
 'metformin_Steady',
 'metformin_Up',
 'repaglinide_No',
 'repaglinide_Steady',
 'repaglinide_Up',
 'nateglinide_No',
 'nateglinide_Steady',
 'nateglinide_Up',
 'chlorpropamide_No',
 'chlorpropamide_Steady',
 'chlorpropamide_Up',
 'glimepiride_No',
 'glimepiride_Steady',
 'glimepiride_Up',
 'acetohexamide_Steady',
 'glipizide_No',
 'glipizide_Steady',
 'glipizide_Up',
 'glyburide_No',
 'glyburide_Steady',
 'glyburide_Up',
 'tolbutamide_Steady',
 'pioglitazone_No',
 'pioglitazone_Steady',
 'pioglitazone_Up',
 'rosiglitazone_No',
 'rosiglitazone_Steady',
 'rosiglitazone_Up',
 'acarbose_No',
 'acarbose_Steady',
 'acarbose_Up',
 'miglitol_No',
 'miglitol_Steady',
 'miglitol_Up',
 'troglitazone_Steady',
 'tolazamide_Steady',
 'tolazamide_Up',
 'insulin_No',
 'insulin_Steady',
 'insulin_Up',
 'glyburide-metformin_No',
 'glyburide-metformin_Steady',
 'glyburide-metformin_Up',
 'glipizide-metformin_Steady',
 'glimepiride-pioglitazone_Steady',
 'metformin-rosiglitazone_Steady',
 'metformin-pioglitazone_Steady',
 'change_No',
 'max_glu_serum_>300',
 'max_glu_serum_None',
 'max_glu_serum_Norm',
 'A1Cresult_>8',
 'A1Cresult_None',
 'A1Cresult_Norm',
 'diabetesMed_Yes']

### Model Building and Evaluation

### Logistic model 1 with all features, no parameter tuning

In [None]:
model1 = Pipeline(stages=[feature.VectorAssembler(inputCols=featlist,
                                        outputCol='features'),feature.StandardScaler(inputCol='features',outputCol = 'sdfeatures'),
                 classification.LogisticRegression(labelCol='readmittedbinary', featuresCol='sdfeatures')])


In [None]:
pipe_model = model1.fit(training_df)

In [None]:
pipe_modeldf = pipe_model.transform(validation_df).select("readmittedbinary","prediction")
pipe_modeldf.show()

In [None]:
tp = pipe_modeldf[(pipe_modeldf.readmittedbinary == 1) & (pipe_modeldf.prediction == 1)].count()
tn = pipe_modeldf[(pipe_modeldf.readmittedbinary == 0) & (pipe_modeldf.prediction == 0)].count()
fp = pipe_modeldf[(pipe_modeldf.readmittedbinary == 0) & (pipe_modeldf.prediction == 1)].count()
fn = pipe_modeldf[(pipe_modeldf.readmittedbinary == 1) & (pipe_modeldf.prediction == 0)].count()
print ("True Positives:", tp)
print ("True Negatives:", tn)
print ("False Positives:", fp)
print ("False Negatives:", fn)
print ("Total", dummy_df.count())

r = (tp)/(tp + fn)
print ("recall", r)

p = float(tp) / (tp + fp)
print ("precision", p)

Let us look closely at the results on the validation set. The confusion matrix shows 5742 true positives: the model _correctly_ predicted 5742 readmitted patients as readmitted. There are 12317 true negatives: the model _correctly_ predicted 12317 patients who were NOT readmitted as NOT readmitted. The model _incorrectly_ predicted 3351 patients who were NOT readmitted as readmitted (false positives), and _incorrectly_ predicted 7962 patients who were readmitted as NOT readmitted (false negatives).
The recall value tells us that our model correctly identifies 42% of the readmitted cases (positives). Let's calculate specificity next.

In [None]:
specificity = tn/(tn+fp)
print("specificity",specificity)

Interesting! Our model correctly identifies 79% of the patients who will not require readmittance.
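
As a quick check of the figures above, the four counts quoted from the confusion matrix can be turned into rates with plain Python. The function name `confusion_metrics` is a hypothetical helper; accuracy is added here only for completeness.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the usual rates from the four confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),        # share of readmitted patients that were caught
        "precision": tp / (tp + fp),     # share of predicted readmissions that were right
        "specificity": tn / (tn + fp),   # share of non-readmitted patients that were cleared
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

metrics = confusion_metrics(tp=5742, tn=12317, fp=3351, fn=7962)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall and specificity come out at roughly 0.42 and 0.79, matching the percentages discussed above.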

In [None]:
evaluator = evaluation.BinaryClassificationEvaluator(labelCol='readmittedbinary')
AUC1 = evaluator.evaluate(pipe_model.transform(validation_df))

In [None]:
AUC1
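
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted one. The sketch below computes that pairwise quantity directly on made-up labels and scores; `roc_auc` is a hypothetical helper, not Spark's implementation.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data for illustration only
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the value gives a threshold-independent view of the model.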

In [None]:
pd.DataFrame(list(zip(featlist, pipe_model.stages[-1].coefficients.toArray())),
            columns = ['column', 'Coefficients']).sort_values('Coefficients',ascending = False).head(10)

In [None]:
print("Intercept: " + str(pipe_model.stages[-1].intercept))

### Model interpretation (Intercept and Coefficients)

Now that we have all the coefficients and the intercept, let's interpret them. Remember that logistic regression does not predict the actual value the way linear regression does. Instead, we predict the probability of getting 1 or 0: readmitted or not readmitted. To do this we fit the data to a sigmoid function. To express the outcome as a linear combination of the x variables, we take the logit, which is log(p/(1-p)). Hence, the coefficients in logistic regression are in terms of log odds.
Let us look at the intercept first. The intercept has a value of -0.1306. Remember that we dropped one dummy per categorical variable as the reference. So we interpret the intercept as follows: the log odds (logit) of readmission for an African American female with diag1, diag2 and diag3 as Circulatory, all medications as Down, and the characteristics A1Cresult_>7, max_glu_serum>200, diabetesMed=No, change=ch, is -0.1306. We can find the probability by plugging the value into the sigmoid function.
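
Written out, the relationship between the logit and the probability described above is as follows, where $\beta_0$ is the intercept and $\beta_i$, $x_i$ are the coefficients and feature values (notation introduced here for illustration):

$$
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_i \beta_i x_i
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_i \beta_i x_i\right)}}
$$

For the reference patient every $x_i$ is zero, so with $\beta_0 = -0.1306$:

$$
p = \frac{1}{1 + e^{0.1306}} \approx \frac{1}{2.1395} \approx 0.467
$$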

In [None]:
prob = 1/(1+np.exp(-pipe_model.stages[-1].intercept))
prob

We can say that, with everything else being zero, an African American female with diag1, diag2 and diag3 as Circulatory, all medications as Down, and the characteristics A1Cresult_>7, max_glu_serum>200, diabetesMed=No, change=ch, has a 46.73% probability of being readmitted to the hospital.

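Now consider the coefficient of number_inpatient, which has a value of 0.4582. We interpret it as: for a one-unit increase in number_inpatient, the logit of being readmitted increases by 0.4582, everything else held constant. Because the model is fitted on the output of StandardScaler, which by default divides each feature by its standard deviation, one unit here is one standard deviation of number_inpatient rather than one additional inpatient visit. We can compute the odds ratio by taking the exponent of the coefficient.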

In [None]:
np.exp(0.4582)

The odds ratio is the odds of being readmitted when number_inpatient is n+1 divided by the odds of being readmitted when number_inpatient is n. Hence we can say that, holding all the other variables fixed, increasing number_inpatient by one unit raises the odds of readmission by about 58%.
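
The claim that the odds ratio does not depend on the starting value n can be checked numerically. The sketch below uses a simplified one-feature model with the intercept and coefficient quoted above; ignoring all other features is an assumption made purely for illustration, and `odds` is a hypothetical helper.

```python
import math

def odds(beta0, beta, x):
    """Odds p/(1-p) implied by a one-feature logistic model."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta * x)))
    return p / (1.0 - p)

beta0, beta = -0.1306, 0.4582   # intercept and number_inpatient coefficient from the text
ratios = [odds(beta0, beta, n + 1) / odds(beta0, beta, n) for n in range(4)]

print([round(r, 4) for r in ratios])  # identical at every starting point n
print(round(math.exp(beta), 4))       # and equal to exp(coefficient)
```

Each ratio equals exp(0.4582), which is why a single number summarizes the effect of the feature on the odds.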

In [None]:
pd.DataFrame(list(zip(featlist, pipe_model.stages[-1].coefficients.toArray())),
            columns = ['column', 'Coefficients']).sort_values('Coefficients').head(10)

Let us look at coefficient of num_procedures. 

In [None]:
np.exp(-0.068755)

Every one-unit increase in num_procedures (measured on the scaled feature) multiplies the odds of being readmitted by about 0.93, which is a decrease of about 7 percent.
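
Since the pipeline scales features with `StandardScaler` before the logistic regression, a coefficient can be re-expressed per raw unit by dividing it by the feature's standard deviation. The sketch below shows the arithmetic; the standard deviation of 1.7 is a made-up value for illustration only (the real one would have to be read from the data), and `per_raw_unit` is a hypothetical helper.

```python
import math

def per_raw_unit(scaled_coef, feature_std):
    """Rescale a coefficient fitted on x/std so that it applies to one raw unit of x."""
    return scaled_coef / feature_std

scaled_coef = -0.068755   # num_procedures coefficient quoted above
feature_std = 1.7         # hypothetical standard deviation, NOT taken from the data
raw_coef = per_raw_unit(scaled_coef, feature_std)

print(f"odds ratio per scaled unit: {math.exp(scaled_coef):.4f}")
print(f"odds ratio per raw unit (assumed std): {math.exp(raw_coef):.4f}")
```

This follows from the identity coef * (x / std) = (coef / std) * x, so the fitted model is unchanged and only the unit of interpretation differs.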

In [None]:
beta = pipe_model.stages[-1].coefficients.toArray()  # convert the Spark vector to a NumPy array
plt.plot(beta)
plt.ylabel('Coefficients')
plt.show()

In [None]:
trainingSummary = pipe_model.stages[-1].summary
roc = trainingSummary.roc.toPandas()
plt.plot(roc['FPR'],roc['TPR'])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve')
plt.show()
print('Training set areaUnderROC: ' + str(trainingSummary.areaUnderROC))
roc

In [None]:
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
maxFMeasure
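
The cell above looks up the threshold that maximizes the F-measure on the training summary. The idea can be sketched in plain Python on made-up scores: compute F1 at each candidate threshold and keep the best. `f1_at_threshold` is a hypothetical helper, and the data is for illustration only.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels and predicted probabilities
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.35, 0.1]

candidates = sorted(set(scores))
best = max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))
print(best, f1_at_threshold(labels, scores, best))
```

Note that the notebook computes `bestThreshold` but the later predictions still use the default threshold, so the recall and precision reported earlier are not based on it.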

In [None]:
pr = trainingSummary.pr.toPandas()
plt.plot(pr['recall'],pr['precision'])
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.show()
#pr['recall']

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [None]:
va =feature.VectorAssembler(inputCols=featlist ,  outputCol='features')

In [None]:
sd = feature.StandardScaler(inputCol='features',outputCol = 'sdfeatures')

### Logistic model 2 with all features and reg param

In [None]:
lr = classification.LogisticRegression(labelCol='readmittedbinary', featuresCol='sdfeatures')

In [None]:
pipe_model2 = Pipeline(stages=[va,sd, lr])

In [None]:
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.3, 0.5]).addGrid(lr.elasticNetParam, [0.2, 0.8, 0.5]).addGrid(lr.maxIter, [15, 30, 50]).build())

In [None]:
cv = CrossValidator(estimator=pipe_model2, \
                    estimatorParamMaps=paramGrid, \
                    evaluator=evaluator, \
                    numFolds=5)
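
Before fitting, it helps to estimate how much work this cross-validation implies. The sketch below counts the parameter combinations in the grid defined above and the resulting number of model fits, including the final refit of the best combination on the full training data.

```python
from itertools import product

# Same values as the ParamGridBuilder grid above
grid = {
    "regParam": [0.1, 0.3, 0.5],
    "elasticNetParam": [0.2, 0.8, 0.5],
    "maxIter": [15, 30, 50],
}
num_folds = 5

combinations = list(product(*grid.values()))
total_fits = len(combinations) * num_folds + 1   # +1 for the final refit of the best model
print(len(combinations), total_fits)
```

Each extra value added to any parameter list multiplies the number of combinations, so the grid is best grown gradually.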

In [None]:
cvModel = cv.fit(training_df)

In [None]:
AUC2 = evaluator.evaluate(cvModel.transform(validation_df))
AUC2

In [None]:
param_dict = cvModel.bestModel.stages[-1].extractParamMap()

sane_dict = {}
for k, v in param_dict.items():
    #print(k)
    sane_dict[k.name] = v

best_reg = sane_dict["regParam"]
best_elastic_net = sane_dict["elasticNetParam"]
best_max_iter = sane_dict["maxIter"]
print(best_reg)
print(best_elastic_net)
print(best_max_iter)

### Random forest model 1

In [None]:
modelrf = Pipeline(stages= [va, classification.RandomForestClassifier(labelCol='readmittedbinary', featuresCol="features")])

In [None]:
modelrffit= modelrf.fit(training_df)

In [None]:
modelrfdf = modelrffit.transform(validation_df)

In [None]:
modelrfdf

In [None]:
AUCrf = evaluator.evaluate(modelrffit.transform(validation_df))

In [None]:
AUCrf

In [None]:
pd.DataFrame(list(zip(featlist, modelrffit.stages[1].featureImportances.toArray())),
            columns = ['column', 'weight']).sort_values('weight', ascending = False).head(10)

In [None]:
tp = modelrfdf[(modelrfdf.readmittedbinary == 1) & (modelrfdf.prediction == 1)].count()
tn = modelrfdf[(modelrfdf.readmittedbinary == 0) & (modelrfdf.prediction == 0)].count()
fp = modelrfdf[(modelrfdf.readmittedbinary == 0) & (modelrfdf.prediction == 1)].count()
fn = modelrfdf[(modelrfdf.readmittedbinary == 1) & (modelrfdf.prediction == 0)].count()
print ("True Positives:", tp)
print ("True Negatives:", tn)
print ("False Positives:", fp)
print ("False Negatives:", fn)
#print ("Total", df.count())

r = (tp)/(tp + fn)
print ("recall", r)

p = float(tp) / (tp + fp)
print ("precision", p)

In [None]:
specificity = tn/(tn+fp)  # tn/(tn+fp) is specificity; sensitivity is the recall computed above
print("specificity",specificity)

In [None]:
beta = modelrffit.stages[-1].featureImportances.toArray()  # convert the Spark vector to a NumPy array
plt.plot(beta)
plt.ylabel('Importance')
plt.show()

### Cross-Validation Random Forest Model

In [None]:
rf=classification.RandomForestClassifier(labelCol='readmittedbinary', featuresCol="features")

In [None]:
mrf = Pipeline(stages=[va,rf])

In [None]:
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [40]).addGrid(rf.maxDepth,[5,10,15,30]).build())



In [None]:
cvrf = CrossValidator(estimator=mrf, \
                    estimatorParamMaps=paramGrid, \
                    evaluator=evaluator, \
                    numFolds=3)

In [None]:
cvrf1 = cvrf.fit(training_df)

In [None]:
AUCn = evaluator.evaluate(cvrf1.transform(validation_df))
AUCn

In [None]:
param_dict1 = cvrf1.bestModel.stages[-1].extractParamMap()

sane_dict1 = {}
for k, v in param_dict1.items():
    #print(k)
    sane_dict1[k.name] = v


best_max_depth = sane_dict1["maxDepth"]
print(best_max_depth)

### Multiclass classification for Readmitted

In [None]:
modelrfmc = Pipeline(stages= [va, classification.RandomForestClassifier(labelCol='readmitted', featuresCol="features",maxDepth=15, numTrees=30)])

In [None]:
modelrffitmc= modelrfmc.fit(training_df)

In [None]:
# Note: despite the AUC-style variable name, this evaluator reports multiclass accuracy
evaluatormc = evaluation.MulticlassClassificationEvaluator(labelCol='readmitted',predictionCol="prediction",metricName="accuracy")
AUCrfmc = evaluatormc.evaluate(modelrffitmc.transform(validation_df))

In [None]:
AUCrfmc

In [None]:
pd.DataFrame(list(zip(featlist, modelrffitmc.stages[1].featureImportances.toArray())),
            columns = ['column', 'weight']).sort_values('weight', ascending = False).head(10)

### Logistic Regression on top features of original Logistic Regression

In [None]:
model_new_lr2 = Pipeline(stages=[feature.VectorAssembler(inputCols=['number_inpatient', 'number_emergency', 'number_diagnoses', 'Diabetes', 'number_outpatient','time_in_hospital','diabetesMed_Yes','age','rosiglitazone_Steady','Caucasian'],
                                        outputCol='features'),sd,
                 classification.LogisticRegression(labelCol='readmittedbinary', featuresCol='sdfeatures')])


In [None]:
pipe_model3 = model_new_lr2.fit(training_df)
AUC5 = evaluator.evaluate(pipe_model3.transform(validation_df))
AUC5

### Random Forest on top features of original Logistic Regression

In [None]:
modelrfselected = Pipeline(stages= [feature.VectorAssembler(inputCols=['number_inpatient', 'number_emergency', 'number_diagnoses', 'Diabetes', 'number_outpatient','time_in_hospital','diabetesMed_Yes','age','rosiglitazone_Steady','Caucasian'],
                                        outputCol='features'), classification.RandomForestClassifier(labelCol='readmittedbinary', featuresCol="features",maxDepth=15, numTrees=30)])

In [None]:
modelrffitselected= modelrfselected.fit(training_df)

In [None]:
AUCrfselected = evaluator.evaluate(modelrffitselected.transform(validation_df))
AUCrfselected

### Logistic Regression on top features of Random Forest

In [None]:
model_new_lr3 = Pipeline(stages=[feature.VectorAssembler(inputCols=['number_inpatient','num_medications','num_lab_procedures','number_diagnoses','time_in_hospital','age','number_emergency', 'number_outpatient','num_procedures','male'],
                                        outputCol='features'),sd,
                 classification.LogisticRegression(labelCol='readmittedbinary', featuresCol='sdfeatures')])


In [None]:
pipe_model4 = model_new_lr3.fit(training_df)
AUC6 = evaluator.evaluate(pipe_model4.transform(validation_df))
AUC6

### Best model

In [None]:
best_model = Pipeline(stages= [va, classification.RandomForestClassifier(labelCol='readmittedbinary', featuresCol="features",maxDepth=10, numTrees=40)])

In [None]:
bestmodel_fit= best_model.fit(training_df)

Evaluating the best model so far on the held-out testing data

In [None]:
AUCfinal = evaluator.evaluate(bestmodel_fit.transform(testing_df))

In [None]:
AUCfinal

In [None]:
pd.DataFrame(list(zip(featlist, bestmodel_fit.stages[1].featureImportances.toArray())),
            columns = ['column', 'weight']).sort_values('weight', ascending = False).head(10)

Fit the model to the entire dataset

In [None]:
bestmodel_final_fit= best_model.fit(dummy_df)