## Problem Description

Knowing whether a patient is likely to be readmitted to hospital is valuable, because the treatment can then be adjusted in order to avoid the readmission.

The dataset distinguishes 3 different outcomes:

- No readmission;
- A readmission in less than 30 days (a bad outcome, because it may indicate that the treatment was not appropriate);
- A readmission after more than 30 days (also undesirable, though less so than the previous case, because the cause may be the state of the patient rather than the treatment).

# Coding 

### 1. Import Libraries

In [1]:
# General Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt


### 2. Read Data

In [2]:
!ls ''

In [3]:
df = pd.read_csv('/kaggle/input/diabetes/diabetic_data.csv')

In [4]:
df.head()

### 3. Data Analysis, Visualization and Cleaning

<b>Shape of the data ?</b>

In [5]:
print('The shape of the Dataset is :', df.shape, 'with', df.shape[0], 'records and', df.shape[1], 'columns')

<b>Check the columns of the dataset?</b>

In [6]:
df.columns

<b> Number of columns in the data?</b>

In [7]:
print('There are total', len(df.columns), 'columns in the dataset.')

Of the 50 columns, 49 (such as encounter_id, patient_nbr, etc.) are treated as independent variables, and the column <b>"readmitted"</b> is the dependent variable, i.e. the label of the data.

<b>Statistics of the Data ?</b>

In [8]:
df.describe(include = 'all').T

<b> How many Null Values in Data? </b>

The data contains missing values, but they are encoded as the string "?" rather than as true nulls, so we count the occurrences of '?' in each column.

In [9]:
for i in df.columns:
    print(i, df[df[i] == '?'].shape[0])

Columns such as "medical_specialty", "race" and "payer_code" contain many missing values, so we will have to fill these values or drop the affected rows or columns.
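The "?" counting above can also be expressed by converting the placeholder to a real missing value first, which makes the usual pandas tools (`isna`, `dropna`, `fillna`) available. This is a minimal sketch on a small made-up frame; the `demo` data and the summary column names are illustrative assumptions, and only the "?" convention comes from the notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real data (hypothetical values).
demo = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "?"],
    "payer_code": ["?", "MC", "?", "?"],
    "time_in_hospital": [1, 3, 2, 5],
})

# Turn the "?" placeholder into a real missing value.
demo_clean = demo.replace("?", np.nan)

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": demo_clean.isna().sum(),
    "pct_missing": (demo_clean.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

On the real file, the same effect can be obtained at load time by passing `na_values='?'` to `pd.read_csv`.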

We analyze the columns sequentially and drill down into the data to look for insights. We start with the number of patients: the <b>"patient_nbr"</b> column tells us how many unique patients are in the data.

In [10]:
print('There are', len(df['patient_nbr'].unique()), 'unique patients in the data.')

In [11]:
print('There are', len(df['encounter_id'].unique()), 'unique encounters in the data.')

- Every time a patient visits the hospital, it is called an <b>encounter</b>.
- So we have multiple encounters per patient.


We will treat this as a simple classification problem rather than a <b>Time Series</b> problem, because there are not many encounters per patient in the data.

<b>Encounter per patient?</b>

In [12]:
# Dividing the total number of encounters by the total number of patients gives the average encounters per patient.
len(df['encounter_id'].unique())/len(df['patient_nbr'].unique())

- So we have about <b>1.4 encounters</b> per patient on average, which suggests that most patients have only 1 encounter in the data.
- <b>Let's check this with the statistics.</b>


In [13]:
df_encounters_check = df.groupby(['patient_nbr']).agg(encounters = ('encounter_id', 'count')).reset_index().sort_values(['encounters'], ascending = False)

In [14]:
df_encounters_check[df_encounters_check['encounters']==1]

In [None]:
# Of the 71518 patients, 54745 have only 1 encounter in the data.
# The remaining patients have more than 1 encounter in the data.
# So we treat the data as plain tabular data, not as time series data.
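The check above (roughly three quarters of the patients have a single encounter) can be summarised in one pass instead of being read off a filtered table. A minimal sketch on made-up IDs; the `demo` frame and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical encounters: patient 1 has 2, patient 3 has 3, the others have 1.
demo = pd.DataFrame({
    "patient_nbr": [1, 1, 2, 3, 3, 3, 4],
    "encounter_id": [10, 11, 12, 13, 14, 15, 16],
})

# Number of distinct encounters for each patient.
encounters_per_patient = demo.groupby("patient_nbr")["encounter_id"].nunique()

# Share of patients with exactly one encounter.
share_single = (encounters_per_patient == 1).mean()

# How many patients have 1, 2, 3, ... encounters.
distribution = encounters_per_patient.value_counts().sort_index()

print(f"{share_single:.0%} of patients have a single encounter")
print(distribution)
```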

<b>Let's analyze the label column</b>

In [None]:
# First of all, let's check the distribution of the label column.

In [15]:
df['readmitted'].value_counts()

In [16]:
ax = sns.barplot(x=df['readmitted'].value_counts().index,   y=df['readmitted'].value_counts())
plt.xlabel('labels', size = 12)
plt.ylabel('# of Readmitted', size = 12)
plt.title('Class Distribution \n', size = 12)
plt.show()

- Roughly half of the records belong to the "NO" class, and each of the other two classes has far fewer records.
- This creates a class imbalance problem, so we will treat this as a 2-class problem.
- We will only try to predict whether the patient will be readmitted or not, and ignore the distinction between less than 30 days and more than 30 days.
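To put numbers on the imbalance, the class shares can be computed directly rather than estimated from a bar chart. The sketch below uses a short made-up label series (an assumption, not the real distribution) and shows the shares before and after merging "<30" and ">30" into one class.

```python
import pandas as pd

# Hypothetical labels using the same three values as the "readmitted" column.
labels = pd.Series(
    ["NO", "NO", ">30", "<30", "NO", ">30", "NO", ">30", "NO", "<30"],
    name="readmitted",
)

# Share of each original class.
shares = labels.value_counts(normalize=True)

# Merge "<30" and ">30" into "Yes"; everything else becomes "No".
binary = labels.isin(["<30", ">30"]).map({True: "Yes", False: "No"})
binary_shares = binary.value_counts(normalize=True)

print(shares)
print(binary_shares)
```

Note that merging the two readmission classes gives up the "<30" versus ">30" distinction from the problem description, which is the trade-off accepted here for a more balanced target.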

<b>Create a 2-class label:</b> We create another label that maps <30 and >30 to a single class for easier analysis and classification.

In [17]:
df['readmitted'].unique()

In [18]:
# Create another column for the 2-class problem: label "<30" and ">30" as "Yes" and "NO" as "No".

def check_label(text):
    if text == '>30' or text =='<30':
        return 'Yes'
    else:
        return 'No'
    
df['readmitted_2'] =df['readmitted'].apply(check_label) 

In [19]:
ax = sns.countplot(x='readmitted_2',   data= df)
plt.xlabel('Readmitted', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Distribution of Readmission Class  \n\n', size = 12)
plt.show()

<b> Race Column</b>

The race feature gives the race of the patient.
According to the documentation, the values for race can be:

- Caucasian 
- Asian
- African American  
- Hispanic
- other

In [20]:
ax = sns.barplot(x=df['race'].value_counts().index,   y=df['race'].value_counts())
plt.xlabel('Race', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Distribution of Race of Patients \n', size = 12)
plt.show()

- The majority of the patients are Caucasian, i.e. people of European ancestry.

- There are "?" entries in the data, which means the race column contains missing values.
- We need to either remove these rows or assign them to the "Other" category.

In [21]:
df.loc[df['race'] == '?', 'race'] = 'Other'

In [22]:
ax = sns.barplot(x=df['race'].value_counts().index,   y=df['race'].value_counts())
plt.xlabel('Race', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Distribution of Race of Patients \n', size = 12)
plt.show()

We replaced the '?' values in the race column with "Other".

<b> What is the Gender Distribution in Data?</b>

According to the documentation, the values can be:

- male 
- female  
- unknown/invalid

In [23]:
ax = sns.countplot(x='gender',   data= df)
plt.xlabel('Gender', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Gender Distribution \n', size = 12)
plt.show()

- The figure above shows that females are the larger group, with roughly 54,700 encounters.
- Males account for roughly 47,000 encounters.
- There are a few encounters where the gender is unknown; we can drop these rows because they are so few.

In [24]:
df['gender'].value_counts()

- There are only 3 encounters for which the gender is unknown. They could distort the distribution of the data.
- So it is better to drop these rows from the data.


In [25]:
df[df['gender']!='Unknown/Invalid']

In [26]:
# Drop the "Unknown/Invalid" gender of the data.
df.drop(df[df['gender'] == 'Unknown/Invalid'].index, inplace = True)

In [27]:
df.reset_index(inplace = True, drop = True)

In [28]:
df.head()

<b>Relationship of Gender and Readmitted Overall</b>

In [29]:
ax = sns.countplot(x="gender", hue="readmitted_2", data=df)
plt.xlabel('Gender', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Gender vs Readmitted \n', size = 12)
plt.show()

<b>What Age of People are there in data?</b>

In [30]:
ax = sns.countplot(x='age',   data= df)
plt.xlabel('Age', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Age Distribution \n', size = 12)
plt.show()

- According to the literature, readmission is more common in older people.

<b>Relationship between Age and Readmission?</b>

In [31]:
ax = sns.countplot(x="age", hue="readmitted_2", data=df)
plt.xlabel('Age', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Age vs Readmitted \n', size = 12)
plt.show()

- As mentioned above, the literature reports a strong relationship between age and readmission, with older patients at higher risk.

- The plot also shows that most readmitted patients are older, while younger patients are rarely readmitted.
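Because the age groups differ a lot in size, raw counts mostly show where the patients are, not where the risk is. A readmission rate per age group makes the comparison fairer. This is a minimal sketch on made-up rows; the `demo` data is an assumption, while the column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows using the notebook's column names and label values.
demo = pd.DataFrame({
    "age": ["[20-30)", "[20-30)", "[70-80)", "[70-80)", "[70-80)", "[70-80)"],
    "readmitted_2": ["No", "Yes", "Yes", "Yes", "No", "Yes"],
})

# Fraction of encounters labelled "Yes" within each age group.
rate_by_age = (
    demo["readmitted_2"].eq("Yes")
    .groupby(demo["age"])
    .mean()
    .rename("readmission_rate")
)

print(rate_by_age)
```

The same pattern applies to any categorical feature in the notebook (gender, race, A1Cresult, and so on) by swapping the grouping column.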

<b>Let's analyze the weight of the patient</b>

In [32]:
df.shape

In [33]:
df['weight'].value_counts()

- From the value counts we can see that, out of around 101,000 records, 98,569 have no weight value.
- So we will drop this column.
- Trying to fill this column could distort the distribution of the data.

In [34]:
# Lets drop this column. 
df.drop(columns = ['weight'], inplace = True)

<b>Understanding of admission_type_id column.</b>

As per the documentation, this is an integer identifier corresponding to 9 distinct values, for example:
- emergency
- urgent
- elective
- newborn
- not available

This represents the type of admission of the patient, i.e. the circumstances under which the patient was admitted at the time of the encounter.

As the specific IDs are not defined even in the documentation, we cannot map these values to admission types for better understanding.
We will only look at which ID has the most encounters.

In [35]:
ax = sns.countplot(x='admission_type_id',   data= df)
plt.xlabel('Admission Type ID', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Admission Type Id Distribution \n', size = 12)
plt.show()

We can see in the above graph that ID 1 has most of the encounters. From the literature review, I assumed that this value corresponds to an inpatient encounter, because patients admitted to the inpatient department are often readmitted after some procedure.

<b>What is the Discharge Disposition?</b>

As per the documentation, this is an integer identifier corresponding to 29 distinct values, for example:
- discharged to home
- expired
- not available 

According to the literature, the discharge disposition is the facility to which the patient is discharged. For example, a patient can be discharged to home health care.

In [36]:
len(df['discharge_disposition_id'].unique())

There are 26 discharge dispositions in the data, and we do not have a mapping for them either.

<b>What is Admission Source ID? </b>


As per the literature, this is an integer identifier corresponding to 21 distinct values, for example:
- physician referral,
- emergency room,  
- transfer from a hospital

The admission source is where the patient came from, for example a physician referral or another source.

In [37]:
df['admission_source_id'].unique()  

In [38]:
print('There are', len(df['admission_source_id'].unique()), 'unique Admission Sources from which patient can be admitted.')

<b>What is the meaning of time_in_hospital?</b>


As per the literature, it is the integer number of days between admission and discharge.

In [39]:
df['time_in_hospital'].unique()

In [40]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='time_in_hospital',   data= df)
plt.xlabel('Time In Hospital', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Time in Hospital Distribution \n', size = 12)
plt.show()

In [41]:
df['time_in_hospital'].mean()

From the graph and the mean of the time in hospital, we find that the majority of patients stay in hospital for 3-4 days.

<b>What is the Relation of Stay in Hospital and Readmission? </b>

In [42]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='time_in_hospital',  hue= 'readmitted_2',  data= df)
plt.xlabel('Time In Hospital', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Readmitted Count', size = 12)
plt.title('Time in Hospital vs Readmission \n', size = 12)
plt.show()

In [43]:
sns.set(rc={'figure.figsize':(18, 8.2)})
sns.displot(df, x="time_in_hospital", hue = 'readmitted_2', kind="kde")
plt.title('Relationship between Time in Hospital and Readmission \n\n', size  = 14)
plt.show()

The typical time in hospital is about the same for readmitted and not-readmitted patients, which suggests that this feature will add little value to our model.

<b>What is the payer code?</b>

From the literature, this is an integer identifier corresponding to 23 distinct values, for example:
- Blue Cross/Blue Shield,
- Medicare,
- self-pay

This represents the payer of the hospital bill.

In [44]:
df['payer_code'].value_counts()


We can see that there are <b>40256</b> missing values here, so we will remove this column from the data.


In [45]:
df.drop(columns = ['payer_code'], inplace = True)

<b>What is the medical specialty?</b>

Integer identifier of the specialty of the admitting physician, corresponding to 84 distinct values, for example:

- cardiology
- internal medicine
- family/general practice
- surgeon

In [46]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='medical_specialty',   data= df)
plt.xlabel('Medical Speciality', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 12)
plt.title('Medical Speciality Distribution \n', size = 12)
plt.show()

- The graph shows that this column also has many missing values, so we will remove it.
- Handling a column with so many missing values would not be easy.

In [47]:
df.drop(columns =['medical_specialty'], inplace = True)

<b>What is num_lab_procedures ? </b>

Number of lab tests performed during the encounter


In [48]:
sns.displot(df, x="num_lab_procedures", kind="kde")
plt.title('Distribution of Lab Procedures \n\n', size = 13)
plt.show()

- The distribution plot shows that the majority of patients have around 30 to 50 lab procedures. Let's look at it with respect to the class.

<b>Trend of Lab Procedures with Readmission ?</b>

In [49]:
sns.displot(df, x="num_lab_procedures", hue= 'readmitted_2', kind="kde")
plt.title('Relationship of Lab Procedures with Readmission \n\n', size = 13)
plt.show()

- The distributions for readmitted and not-readmitted patients follow the same trend.
- The number of lab procedures is therefore unlikely to help distinguish readmitted from not-readmitted patients.

<b>What is the relation of Number of Procedures and Readmission? </b>

Number of procedures (other than lab tests) performed during the encounter

In [50]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='num_procedures',  hue= 'readmitted_2',  data= df)
plt.xlabel('Number of Procedures', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Readmitted Count', size = 12)
plt.title('Number of Procedures vs Readmission \n', size = 14)
plt.show()

- The number of procedures also shows no clear sign that readmission becomes more likely as the number of procedures increases.
- The majority of patients have 0 procedures, and they include both readmitted and not-readmitted patients.

<b>What is the contribution of num_medications in the data? </b>

Number of distinct generic names administered during the encounter

In [51]:
sns.displot(df, x="num_medications", hue= 'readmitted_2', kind="kde")
plt.title('Number of Medications VS Readmission \n\n')
plt.show()

- Again, looking at the distribution of the number of medications a patient takes, the trend looks the same for both classes.
- This suggests that this feature will also add little discriminative information.

<b>What is the trend of Outpatient Visits w.r.t. Readmission?</b>

Number of outpatient visits of the patient in the year preceding the encounter

In [52]:
sns.displot(df, x="number_outpatient", hue= 'readmitted_2', kind ='kde')
plt.title('Relationship of Outpatient Visits with Readmission \n\n', size = 13)
plt.show()

- The distribution of the number of outpatient visits is very skewed, so we cannot read the trend from this plot.
- The majority of patients have 0 outpatient visits.
- We will drill down into the data and look at it in more detail.

<b>First, look at Outpatient Visits below 5.</b>

In [53]:
sns.displot(df.loc[df['number_outpatient']<5], x="number_outpatient", hue= 'readmitted_2', kind ='kde')
plt.title('Relation of Outpatient Visits less than 5 w.r.t Readmission \n\n', size = 13)
plt.show()

We identified that patients with 0 outpatient visits are usually not readmitted.

<b>Next, look at Outpatient Visits of 5 or more.</b>

In [54]:
sns.displot(df.loc[df['number_outpatient']>=5], x="number_outpatient", hue= 'readmitted_2', kind ='kde')
plt.title('Relation of Outpatient Visits >= 5 w.r.t Readmission \n\n', size = 13)
plt.show()

- In the data with 5 or more outpatient visits, more patients are readmitted than not readmitted.
- We can conclude that certain ranges of this feature are likely to yield useful rules.
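The "below 5" and "5 or more" slices can be made explicit by binning the visit counts and computing the readmission rate per bin, which avoids reading a KDE curve drawn over integer counts. This is a minimal sketch; the bin edges, the bin labels and the `demo` rows are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the notebook's column names.
demo = pd.DataFrame({
    "number_outpatient": [0, 0, 0, 1, 2, 4, 5, 7, 12],
    "readmitted_2": ["No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Bins: exactly 0 visits, 1 to 4 visits, 5 or more visits (right edges inclusive).
bins = [-1, 0, 4, np.inf]
bin_labels = ["0", "1-4", "5+"]
demo["outpatient_bin"] = pd.cut(demo["number_outpatient"], bins=bins, labels=bin_labels)

# Number of encounters and share of "Yes" labels in each bin.
summary = demo.groupby("outpatient_bin", observed=True).agg(
    n=("readmitted_2", "size"),
    readmission_rate=("readmitted_2", lambda s: (s == "Yes").mean()),
)

print(summary)
```

The same binning could be applied to `number_emergency` and `number_inpatient`, and the resulting bin column could itself serve as a model feature.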

<b> What is the trend of Number of Emergency Visits ?</b>

Number of emergency visits of the patient in the year preceding the encounter

In [55]:
sns.displot(df, x="number_emergency", hue= 'readmitted_2', kind='kde')
plt.title('Relation of Emergency Visits w.r.t Readmission \n\n', size = 13)
plt.show()

- We can see that the distribution of emergency visits is very skewed.
- The majority of patients have 0 emergency visits.
- We will slice the data and look at the trend in detail.

<b>What is the relation when we look at Emergency Visits less than 5?</b>

In [56]:
sns.displot(df.loc[df['number_emergency']<5], x="number_emergency", hue= 'readmitted_2', kind='kde')
plt.title('Relationship of Emergency Visits < 5 w.r.t Readmission \n\n', size = 13)
plt.show()

- At 0 emergency visits, not-readmitted patients outnumber readmitted patients.
- Now let's look at the patients with 5 or more emergency visits.

<b>What is the relation when we look at Emergency Visits of 5 or more?</b>

In [57]:
sns.displot(df.loc[df['number_emergency']>=5], x="number_emergency", hue= 'readmitted_2', kind='kde')
plt.title('Relationship of Emergency Visits >= 5 w.r.t Readmission \n\n', size = 13)
plt.show()

- We can see that the majority of these encounters have close to 10 emergency visits, and these patients are readmitted to hospital.
- We can conclude that as the number of emergency visits increases, the patient becomes more likely to be readmitted.

<b>What is the pattern of Number of Inpatient Visits ? </b> 

In [58]:
sns.displot(df, x="number_inpatient", hue= 'readmitted_2', kind='kde')
plt.title('Relationship of Inpatient Visits w.r.t Readmission \n\n')
plt.show()

- We can see from the above graph that the number of inpatient visits is also 0 for the majority of patients.
- Now we drill down into the number of inpatient visits for better understanding.

In [59]:
sns.displot(df.loc[df['number_inpatient']<5], x="number_inpatient", hue= 'readmitted_2', kind='kde')
plt.title('Relationship Inpatient Visits < 5 w.r.t Readmission \n\n', size = 13)
plt.show()

- From the above graph, we see that patients with fewer than 5 inpatient visits are mostly not readmitted.
- Also, the majority of patients have 0 inpatient visits.


In [60]:
sns.displot(df.loc[df['number_inpatient']>=5], x="number_inpatient", hue= 'readmitted_2', kind='kde')
plt.title(' Inpatient Visits >= 5 w.r.t Readmission \n\n')
plt.show()

Now, if we look at the data for 5 or more inpatient visits, the patients are most likely to be readmitted, so this feature can become a deciding criterion for the model.

<b>What are diag_1, diag_2 and diag_3?</b>

- There are three columns which contain the diagnosis codes for the encounter.
- Each time a patient is admitted to the hospital, a diagnosis code is assigned to the encounter.
- It indicates the problem for which the patient came to the hospital.

- <b>Diagnosis 1</b>: Nominal. The primary diagnosis (coded as first three digits of ICD9); 848 distinct values; 0% missing
- <b>Diagnosis 2</b>: Nominal. Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values; 0% missing
- <b>Diagnosis 3</b>: Nominal. Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values

As per the documentation, these are ICD-9 diagnosis codes; each code represents a disease.
- diag_1 is the primary diagnosis of the patient, i.e. the diagnosis on which the patient was admitted to the hospital.
- diag_2 is the secondary diagnosis. According to CMS documentation, secondary diagnoses are "conditions that coexist at the time of admission, that develop subsequently, or that affect the treatment received and/or length of stay. These diagnoses are vital to documentation and have the potential to impact a patient's severity of illness and risk of mortality".
- diag_3 is the additional secondary diagnosis.

In [61]:
len(df['diag_1'].unique()), len(df['diag_2'].unique()), len(df['diag_3'].unique())

<b>As there are too many unique values, we analyze the top diagnoses in each class.</b>

In [62]:
df[df['readmitted_2'] == 'Yes']['diag_1'].value_counts()

<b>Top 20 Diagnoses for Readmitted = Yes</b>

In [63]:
ax = sns.barplot(x=df[df['readmitted_2'] == 'Yes']['diag_1'].value_counts().index[:20],
                 y=df[df['readmitted_2'] == 'Yes']['diag_1'].value_counts()[:20])
plt.xlabel('Primary Diagnosis Codes', size = 12)
plt.ylabel('Count', size = 12)
plt.title('Top 20 Primary Diagnosis Codes in Readmission = YES \n', size = 12)
plt.show()

The top diagnosis codes among the readmitted patients include 428, 414, 786 and 486.
Looking them up in the ICD-9 dictionary gives:
- 428 = Congestive heart failure
- 414 = Ischemic heart disease
- 786 = Symptoms involving respiratory system and other chest symptoms
- 486 = Pneumonia, organism unspecified

So heart and chest diseases are the most common primary diagnoses among the readmitted patients.
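With hundreds of distinct codes, one common way to make the diagnosis columns manageable is to collapse the codes into a handful of broad groups based on ICD-9 chapter ranges. The sketch below is one such grouping; the function name `icd9_group`, the group names and the exact ranges are assumptions for illustration and should be checked against the dataset documentation before use.

```python
import pandas as pd

def icd9_group(code: str) -> str:
    """Map a raw ICD-9 code string to a broad group (hypothetical grouping)."""
    if code == "?":
        return "Missing"
    if code.startswith(("V", "E")):
        # Supplementary V/E codes are not numeric; keep them in a catch-all group.
        return "Other"
    value = float(code)
    if 390 <= value <= 459 or value == 785:
        return "Circulatory"
    if 460 <= value <= 519 or value == 786:
        return "Respiratory"
    if 520 <= value <= 579 or value == 787:
        return "Digestive"
    if 250 <= value < 251:
        return "Diabetes"
    if 800 <= value <= 999:
        return "Injury"
    if 710 <= value <= 739:
        return "Musculoskeletal"
    if 580 <= value <= 629 or value == 788:
        return "Genitourinary"
    if 140 <= value <= 239:
        return "Neoplasms"
    return "Other"

# Example on the codes discussed above plus a few edge cases.
codes = pd.Series(["428", "414", "786", "486", "250.83", "V57", "?"])
groups = codes.map(icd9_group)
print(pd.DataFrame({"code": codes, "group": groups}))
```

Applied to `diag_1`, `diag_2` and `diag_3`, this would replace several hundred categories per column with about ten, which is easier to plot and to encode.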

In [64]:
ax = sns.barplot(x=df[df['readmitted_2'] == 'No']['diag_1'].value_counts().index[:20],
                 y=df[df['readmitted_2'] == 'No']['diag_1'].value_counts()[:20])
plt.xlabel('Primary Diagnosis Codes', size = 12)
plt.ylabel('Count', size = 12)
plt.title('Top 20 Primary Diagnosis Codes in Readmission = No \n', size = 12)
plt.show()

We can see from the graph that chest and heart diseases are also common in patients who were not readmitted.

<b>Lets Analyze Number of Diagnosis Column </b>

In [65]:
sns.displot(df, x="number_diagnoses", hue= 'readmitted_2', kind='kde')
plt.title('Number of Diagnosis vs Readmission \n\n')
plt.show()

- From the above plot we can see that there is no clear difference between readmitted and not-readmitted patients.
- There is a minor difference where the number of diagnoses is between 8 and 10: there the patient is more likely to be readmitted.
- As the pattern for more than 10 diagnoses is hidden, we will look at it in detail.


In [66]:
sns.displot(df[df['number_diagnoses']>10], x="number_diagnoses", hue= 'readmitted_2', kind='kde')
plt.title('Number of Diagnosis vs Readmission \n\n')
plt.show()

- The trend is approximately the same for both classes.

<b>What is the behaviour of max_glu_serum?</b>

Indicates the range of the result or if the test was not taken.
Values:
- “>200”
- “>300”
- “normal”
- “none” if not measured

In [67]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='max_glu_serum',   data= df)
plt.xlabel('Max Glu Serum', size = 14)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 14)
plt.title('Distribution of Max Glu Serum \n', size = 14)
plt.show()

- None means the max_glu_serum test was not taken, and almost 96,000 encounters did not have this test.
- Let's analyze the trend of the other 3 values with respect to readmission.

In [68]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='max_glu_serum',  hue= 'readmitted_2', data= df[df['max_glu_serum']!='None'])
plt.xlabel('Max Glu Serum Value', size = 14)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 14)
plt.title('Max Glu Serum vs Readmission \n', size = 14)
plt.show()

- As per the above graph, if max_glu_serum is greater than 300, there is a high chance of readmission.

<b> What is A1Cresult ? </b>

Indicates the range of the result or if the test was not taken. 
Values:

- “>8” if the result was greater than 8%,
- “>7” if the result was greater than 7% but less than 8%,
- “normal” if the result was less than 7%,
- and “none” if not measured

In [69]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='A1Cresult', data= df)
plt.xlabel('A1Cresult Values', size = 14)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 14)
plt.title('Distribution of A1Cresult \n', size = 14)
plt.show()

- The majority of the patients have the value None, which means this parameter was not measured.
- Let's look at the other 3 values and check the relationship.

In [70]:
sns.set(rc={'figure.figsize':(18,8.2)})
ax = sns.countplot(x='A1Cresult', hue = 'readmitted_2', data=df[df['A1Cresult']!='None'])
plt.xlabel('A1Cresult Values', size = 14)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Count', size = 14)
plt.title('A1Cresult vs Readmission \n', size = 14)
plt.show()

There are more not-readmitted than readmitted patients in each category.

<b>What is change?</b>

Indicates if there was a change in diabetic medications (either dosage or generic
name). Values:

- “change”  
- “no change”

In [71]:
df['change'].value_counts()

In [72]:
ax = sns.countplot(x='change',  hue= 'readmitted_2',  data= df)
plt.xlabel('Change in Medication', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Readmitted Count', size = 12)
plt.title('Change in Medication vs Readmission \n', size = 12)
plt.show()

<b>What is Diabetes Med ? </b>

Indicates if there was any diabetic medication prescribed. Values:

- “yes”
- “no”

In [73]:
ax = sns.countplot(x='diabetesMed',  hue= 'readmitted_2',  data= df)
plt.xlabel('Diabetes Med', size = 12)
plt.xticks(rotation=90, size = 12)
plt.ylabel('Readmitted Count', size = 12)
plt.title('Diabetes Med vs Readmission \n', size = 12)
plt.show()

- From the above figure we can see that patients who were prescribed diabetes medication have a higher number of readmissions.

<b>24 Features for Medications?</b>

For the generic names: 
1. metformin 
2. repaglinide 
3. nateglinide
4. chlorpropamide
5. glimepiride
6. acetohexamide
7. glipizide
8. glyburide
9. tolbutamide 
10. pioglitazone
11. rosiglitazone
12. acarbose
13. miglitol
14. troglitazone
15. tolazamide
16. examide
17. sitagliptin (this appears to correspond to the `citoglipton` column in the data)
18. insulin
19. glyburide-metformin
20. glipizide-metformin
21. glimepiride-pioglitazone
22. metformin-rosiglitazone
23. metformin-pioglitazone

The feature indicates whether the drug was prescribed or there was a change in the dosage. Values:

- “up” if the dosage was increased during the encounter
- “down” if the dosage was decreased
- “steady” if the dosage did not change
- “no” if the drug was not prescribed

<b>Let's analyze the distribution of each value in these columns!</b>

In [75]:
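# Positions 21 to 43 hold the 23 medication columns (metformin ... metformin-pioglitazone)
# now that weight, payer_code and medical_specialty have been dropped.
for i in df.iloc[:, 21:44].columns:
    ax = sns.countplot(x=i, data= df)
#     plt.xticks(rotation=90, size = 12)
    plt.ylabel('Count', size = 14)
    plt.show()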

- From the above count plots, we can see that the majority of the medicines are not prescribed to most patients.
- When a medicine is prescribed, it is usually prescribed to very few people.
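Selecting the medication columns by position works only as long as the column order stays the same, so a name-based selection is a more robust alternative. The sketch below also tabulates the value counts per medication and lists the columns that are never prescribed. The `demo` frame and the shortened `medication_columns` list are assumptions for illustration; on the real data the list would contain all medication column names.

```python
import pandas as pd

# Hypothetical frame with three medication columns and two other columns.
demo = pd.DataFrame({
    "age": ["[50-60)", "[50-60)", "[60-70)"],
    "metformin": ["No", "Steady", "Up"],
    "insulin": ["No", "No", "Down"],
    "examide": ["No", "No", "No"],
    "change": ["No", "Ch", "Ch"],
})

# Name-based selection (shortened list for the demo).
medication_columns = ["metformin", "insulin", "examide"]
present = [c for c in medication_columns if c in demo.columns]

# One row per medication, one column per dosage status, counts as integers.
usage = (
    pd.DataFrame({c: demo[c].value_counts() for c in present})
    .fillna(0)
    .astype(int)
    .T
)

# Medications whose value is "No" for every encounter carry no information.
never_prescribed = [c for c in present if demo[c].eq("No").all()]

print(usage)
print("Never prescribed:", never_prescribed)
```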

<b> Analyze Medicines with Class Variable Readmission </b>

In [76]:
for columnName in df.iloc[:, 21:44].columns:
    g = sns.FacetGrid(df, col=columnName)
    g.map(sns.histplot, "readmitted_2")
#     plt.title(str(columnName) + 'vs Readmission', size = 13)
    plt.show()

For all of these features, the majority of the population has the value "No", meaning that the patients were not prescribed these medicines.

- <b>insulin</b>: About half of the population is prescribed insulin.
- <b>metformin</b>: Almost 18,000 patients are prescribed metformin.
- <b>examide & citoglipton</b>: No one is prescribed these medicines; all the values are "No". So we will drop these columns.
- <b>glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, metformin-pioglitazone</b>: These medicines are prescribed to very few people.
- <b>glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, metformin-pioglitazone</b>: These are prescribed to only 1 or 2 patients, which adds no information. So we will drop these columns.

- <b>troglitazone & tolazamide</b>: These are prescribed to very few people and will not add value. We will drop these as well.
- <b>miglitol</b>: This is prescribed to only 40 patients, but all of them are readmitted, so it may add some value. We will keep this feature.
- <b>acarbose</b>: This is prescribed to 309 patients, most of whom are readmitted. We will keep this feature.
- <b>pioglitazone & rosiglitazone</b>: These are prescribed to 6-7 thousand patients, most of whom are readmitted. So we will keep these features.
- <b>tolbutamide</b>: This medicine is prescribed to only 24 people, so it will not add any specific value.
- <b>glipizide & glyburide</b>: These medicines are prescribed to 10-12 thousand patients. They will add some variance.
- <b>acetohexamide</b>: This medicine is prescribed in only 1 encounter. We will drop it as well.
- <b>glimepiride</b>: This medicine is prescribed to almost 5,000 patients. So we will keep this feature.
- <b>chlorpropamide</b>: This medicine is prescribed to only 87 patients. It will not add much value, but we will keep it because the majority of the patients who take it are readmitted.
- <b>nateglinide</b>: This medicine is prescribed to 704 patients, but the majority of them are not readmitted.

- <b>repaglinide</b>: This medicine is prescribed to almost 1,500 patients, about 50 percent of whom are readmitted.
- <b>metformin</b>: This medicine seems to have an important relationship with readmission. It is prescribed to nearly 20,000 patients, almost 9,000 of whom are readmitted.



<b> Dropping Columns with almost no Information</b>

In [77]:
df.drop(columns = ['acetohexamide', 'tolbutamide', 'troglitazone', 'tolazamide', 'examide', 'citoglipton',
                   'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone',
                   'metformin-pioglitazone'], inplace = True)

In [78]:
df.shape

<b> Drop Diagnosis Codes with empty values </b>

- As we found earlier, the diagnosis codes are missing ("?") in around 1,500 rows.
- So we will drop these rows.

In [79]:
df = df[~((df['diag_1'] == "?") | (df['diag_2'] == "?") | (df['diag_3'] == "?"))]

In [80]:
df.shape

In [None]:
# df.to_csv('PreparedData.csv')

At the start of the analysis we had 50 columns; 13 of them (weight, payer_code, medical_specialty and 10 medication columns) were dropped from the dataset because they did not provide useful information.

In [81]:
# Make copy of data.
df_ = df.copy()

### 4. Transform the Categorical Features

In [82]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()

<b> Transform Categorical Features </b>

In [83]:
categorical_features =['race', 'gender', 'age',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses',
       'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide',
       'chlorpropamide', 'glimepiride', 'glipizide', 'glyburide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'insulin',
       'glyburide-metformin', 'change', 'diabetesMed'] 

for i in categorical_features:
    df_[i] = le.fit_transform(df_[i])

In [84]:
df_.head()

We can now see in the dataframe that the categorical values are encoded.
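Label encoding assigns integers in sorted order of the category names, which implies an ordering that nominal features such as race do not have. Distance-based and linear models (logistic regression, SVM) can be sensitive to that. One-hot encoding is a common alternative, sketched below on a made-up frame; the `demo` data is an assumption and this is not the encoding used in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows: one nominal column and one numeric column.
demo = pd.DataFrame({
    "race": ["Caucasian", "Asian", "Other"],
    "time_in_hospital": [3, 1, 7],
})

# Label encoding: integers follow the alphabetical order of the categories.
label_codes = LabelEncoder().fit_transform(demo["race"])

# One-hot encoding: one 0/1 column per category, no implied order.
encoded = pd.get_dummies(demo, columns=["race"], dtype=int)

print(label_codes)
print(encoded)
```

For very high-cardinality columns such as the diagnosis codes, one-hot encoding produces many columns, so grouping the categories first is usually worthwhile.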

<b> Transform Label Columns </b>

In [85]:
label = le.fit(df_['readmitted_2'])

In [86]:
df_['readmitted_2_encoded'] = label.transform(df_['readmitted_2'])

After label encoding, the values assigned to the classes are:

- 0 for No
- 1 for Yes

### 5. Features Correlation

In [87]:
df_ = df_.drop(columns= ['encounter_id', 'patient_nbr', 'readmitted','readmitted_2'])

In [88]:
df_

<b>Correlation between Numerical Features</b>

In [89]:
df_[['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 
   'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']].corr()

In [90]:
df_.columns

#### Split the Dependent and Independent Variables

In [91]:
X = df_.drop(columns= ['readmitted_2_encoded'])
Y = df_['readmitted_2_encoded']

### 6. Feature Scaling

In [92]:
from sklearn import preprocessing
scaled_X = preprocessing.StandardScaler().fit_transform(X)
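Fitting the scaler on the full matrix lets the test rows influence the mean and standard deviation used for training, which is a mild form of data leakage. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on random data; the `X_demo` and `y_demo` arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# ... then learn the scaling parameters from the training rows only
# and apply the same transformation to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.mean(axis=0).round(3))
```

Tree-based models such as random forests and XGBoost do not need scaling at all, so this step mainly matters for logistic regression and SVM.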

### 7. Train Test Split

In [93]:
from sklearn.model_selection import train_test_split

In [94]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, Y, test_size=0.25, random_state=42)

In [95]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
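Because some patients have several encounters, a purely random split can put encounters of the same patient into both the training and the test set. A group-aware split keeps each patient on one side. The sketch below uses scikit-learn's `GroupShuffleSplit` on made-up arrays; `patient_ids`, `X_demo` and `y_demo` are assumptions, and in the notebook `patient_nbr` would have to be set aside before it is dropped.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 encounters belonging to 6 patients.
patient_ids = np.array([1, 1, 2, 3, 3, 3, 4, 5, 5, 6])
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])

# One split in which whole patients go to either the training or the test side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=patient_ids))

# No patient should appear on both sides.
overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])

print("train patients:", sorted(set(patient_ids[train_idx])))
print("test patients:", sorted(set(patient_ids[test_idx])))
print("overlap:", overlap)
```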

### 8. Machine Learning Modeling

<b>Import Libraries for Evaluation of the Models</b>

In [96]:
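from sklearn.metrics import classification_report, confusion_matrix
# The confusion matrices below are drawn with seaborn heatmaps, so no sklearn plotting
# helper is needed (plot_confusion_matrix was removed in scikit-learn 1.2).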

###### Logistic Regression

In [97]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# Training
lr.fit(X_train, y_train)

# Prediction
lr_prediction = lr.predict(X_test)

In [98]:
print(classification_report(y_test, lr_prediction))

In [99]:
ax = sns.heatmap(confusion_matrix(y_test, lr_prediction), annot=True, fmt='', cmap='Blues')
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

ax.set_title('Confusion Matrix of Logistic Regression \n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ')

plt.show()
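The same predict, report and confusion-matrix steps are repeated for every model in this section, so they can be wrapped in a small helper. The sketch below also adds ROC AUC, which is less sensitive to the decision threshold than accuracy. The function name `evaluate` and the random demo data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

def evaluate(model, X_test, y_test):
    """Return the confusion matrix, text report and ROC AUC for a fitted classifier."""
    predictions = model.predict(X_test)
    # Probability of the positive class (column 1) for ROC AUC.
    scores = model.predict_proba(X_test)[:, 1]
    return {
        "confusion_matrix": confusion_matrix(y_test, predictions),
        "report": classification_report(y_test, predictions, zero_division=0),
        "roc_auc": roc_auc_score(y_test, scores),
    }

# Hypothetical data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X_demo[:200], y_demo[:200])
result = evaluate(model, X_demo[200:], y_demo[200:])

print(result["report"])
print("ROC AUC:", round(result["roc_auc"], 3))
```

The helper relies on `predict_proba`, which `SVC` only provides when it is created with `probability=True`.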

###### Random Forest Classifier

In [100]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 450, max_depth=9, random_state=43)
rf.fit(X_train, y_train)

In [102]:
rf_prediction =  rf.predict(X_test)

In [103]:
print(classification_report(y_test, rf_prediction))

In [104]:
ax = sns.heatmap(confusion_matrix(y_test, rf_prediction), annot=True, fmt='', cmap='Blues')
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

ax.set_title('Confusion Matrix of Random Forest \n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ')

plt.show()

###### Xgboost Classifier

In [105]:
import xgboost
xgb =  xgboost.XGBClassifier()
xgb.fit(X_train, y_train)

In [106]:
xgb_prediction = xgb.predict(X_test)

In [107]:
print(classification_report(y_test, xgb_prediction))

In [108]:
ax = sns.heatmap(confusion_matrix(y_test, xgb_prediction), annot=True, fmt='', cmap='Blues')
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

ax.set_title('Confusion Matrix of Xgboost Model \n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ')

plt.show()

###### Support Vector Machines

In [None]:
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)

In [None]:
svm_prediction = clf.predict(X_test)

In [None]:
print(classification_report(y_test, svm_prediction))

In [None]:
ax = sns.heatmap(confusion_matrix(y_test, svm_prediction), annot=True, fmt='', cmap='Blues')
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

ax.set_title('Confusion Matrix of Support Vector Machines\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ')

plt.show()