Importing Libraries

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import io
import warnings
warnings.filterwarnings("ignore")

Importing the required Dataset

# Objective
The data was collected over a 2-month period in India and has 25 features (e.g., red blood cell count, white blood cell count). The target is 'classification', which is either 'ckd' or 'notckd' (ckd = chronic kidney disease). The goal is to use machine learning techniques to predict whether a patient has chronic kidney disease.

In [None]:
#Uploading dataset
df = pd.read_csv('/kaggle/input/chronic-kidney-disease/kidney_disease_train.csv')

In [None]:

# Getting an overview of the DataFrame
df.head()

In [None]:
df.shape

 We can see that our dataset has 280 observations and 26 columns

In [None]:
df.columns

26 Columns

Attribute Information:

We use 24 + class = 25 ( 11 numeric ,14 nominal)

    Age(numerical) - age in years
    Blood Pressure(numerical) - bp in mm/Hg
    Specific Gravity(nominal) - sg - (1.005,1.010,1.015,1.020,1.025)
    Albumin(nominal) - al - (0,1,2,3,4,5)
    Sugar(nominal) - su - (0,1,2,3,4,5)
    Red Blood Cells(nominal) - rbc - (normal,abnormal)
    Pus Cell (nominal) - pc - (normal,abnormal)
    Pus Cell clumps(nominal) - pcc - (present,notpresent)
    Bacteria(nominal) - ba - (present,notpresent)
    Blood Glucose Random(numerical) - bgr in mgs/dl
    Blood Urea(numerical) -bu in mgs/dl
    Serum Creatinine(numerical) - sc in mgs/dl
    Sodium(numerical) - sod in mEq/L
    Potassium(numerical) - pot in mEq/L
    Hemoglobin(numerical) - hemo in gms
    Packed Cell Volume(numerical) - pcv
    White Blood Cell Count(numerical) - wc in cells/cumm
    Red Blood Cell Count(numerical) - rc in millions/cmm
    Hypertension(nominal) - htn - (yes,no)
    Diabetes Mellitus(nominal) - dm - (yes,no)
    Coronary Artery Disease(nominal) - cad - (yes,no)
    Appetite(nominal) - appet - (good,poor)
    Pedal Edema(nominal) - pe - (yes,no)
    Anemia(nominal) - ane - (yes,no)
    Class (nominal)- class - (ckd,notckd)



In [None]:
df.info()

We can see that our columns have **missing values**. 
**Classification** is our target variable.
We have **integer and float** numeric columns. They can hold **continuous numeric, discrete numeric** and also **categorical variables**.
We also have **object columns** that store string values. They mostly hold categorical variables or free-text values.

We have a column **id with no missing values**. It could be an auto-increment or unique identifier column. We will confirm this based on our later findings.

Two columns - **rc and wc** - are listed as numeric in the attribute list, but above they have the object type. We need to look into this further.

In [None]:
df.shape

In [None]:
# Percentage of missing values
(df.isnull().sum()/df.shape[0])*100

Observations:
 1. Some columns have **no missing values**.
 2. Columns like **age, pcc, ba, bu, sc, htn, dm, cad, bp have less than 5% of their data missing**. We can substitute these with the **mean/median/mode**, or, better, find some **systematic mechanism** that explains the missing values.
 3. Other variables have a **high share of missing values**. Here we need to **understand why they are missing and what the best way to handle them is**. Otherwise, imputation can bias our data if we miss the underlying logic of the missing values.
 4. The share of missing values ranges from as low as 0.3% to as high as 38%. This wide range needs closer attention.
 5. We should also **confirm each column's significance** to decide whether to drop it or how to impute its missing values and treat its outliers.
 6. The **distribution of each feature** will be an important factor in how we impute it.
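
The grouping described in these observations can be made explicit in code. The sketch below uses a small made-up frame (not the notebook's `df`) and sorts columns into "none", "low" and "high" buckets by their share of missing values. The 5% cut-off comes from the text; the bucket labels and the `summary` name are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up).
toy = pd.DataFrame({
    "age": [48, 7, np.nan, 62] + [50] * 36,   # 1 of 40 missing
    "rbc": [np.nan] * 15 + ["normal"] * 25,   # 15 of 40 missing
    "id": range(40),                          # nothing missing
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Bucket the columns by how much is missing (5% cut-off as used in the text)
summary = pd.DataFrame({
    "missing_pct": missing_pct.round(1),
    "bucket": pd.cut(missing_pct, bins=[-0.1, 0, 5, 100],
                     labels=["none", "low (<=5%)", "high"]),
})
print(summary.sort_values("missing_pct", ascending=False))
```

On the real data the same two statements could be run on `df` directly, and the "high" bucket would be the list of columns that need a deliberate imputation strategy.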

In [None]:
# checking if any row has all missing values
df[df.isnull().all(1)]

We can also see that no row has all values missing, so there is no fully empty row to drop.

In [None]:
# Checking count of row with missing values for columns
df.isnull().sum(1).value_counts()

- **No row has all values missing**.
- We have **3 rows with 10 variables missing and 5 rows with 11 variables missing.**
- **6 rows have 8 variables missing**. We need to decide how to handle them and understand whether they still carry useful information.
- We can **drop them** if they provide no useful information and **do not improve the computational efficiency or predictive power** of the model.




In [None]:
# Statistical parameters for the numeric columns
df.describe()

By seeing the above table, we can conclude:
 1. We have **13 numeric** variables.
 2. Same conclusion as above - all numeric columns except id have missing values.
 3. Id column - We assumed it to be a unique identifier, but then the values should lie between 0 and 280, which is not the case as its max value is 399. So we need to look deeper to understand what it actually is.
 4. Age column - **Min age is 2 and max is 90. A minimum age of 2 is an interesting case for chronic kidney disease**, which is not usually expected at that age.
 5. bp - ranges from 50 to 180. So we can conclude that there are **cases of both high and low blood pressure**.
 6. sg - It is a discrete numerical variable.
 7. al - It can take values from 0 to 5. However, **75% of the people have an al value of 0, 1 or 2, which is interesting**. It will be interesting to see how low or high values of al relate to kidney disease.
 8. su - Again, **75% of the people have a sugar level of 0, which indicates that most people have low sugar**, or it could be a data capture error.
 9. bgr - has a high std dev of 70.
 10. bu - With a max value of 390, it is likely to have outliers.
 11. sc - **most of the values are as low as 3, but the max value of 76 needs to be investigated**.
 12. sod - seems to be **normally distributed**.
 13. pot - most of the values are low, but the max value of 47 needs to be investigated.
 14. hemo and pcv - look rather symmetric.
 
 We can conclude that the features **bgr, su, bp, bu, sc, sod, al have asymmetric, skewed distributions. Their mean and median values do not coincide, so we should be aware of outliers and choose the measure of central tendency for such variables carefully.**
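
The mean-versus-median comparison above can be put next to a skewness figure in one small table. This is a minimal sketch on invented values (one roughly symmetric column, one with a long right tail), not the notebook's data; `report` is just an illustrative name.

```python
import pandas as pd

# Made-up values: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "hemo": [9.5, 11.0, 12.5, 13.0, 13.5, 15.0, 16.5],
    "sc":   [0.8, 0.9, 1.0, 1.2, 1.5, 3.0, 76.0],
})

report = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),   # > 0 means a longer right tail
})
report["mean_minus_median"] = report["mean"] - report["median"]
print(report.round(2))
```

A large positive `skew` together with a mean well above the median is the pattern described for columns such as sc, and it is a hint to prefer the median over the mean when imputing.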

In [None]:
# Object type describe
df.describe(include = 'O')

Observations: **Checking the unique values and the number of categories**
 1. We have **13 categorical variables**.
 2. rbc - many missing values. 2 unique values.
 3. pc - 2 unique values, with normal dominating.
 4. **pcc, ba - 2 unique values, with notpresent occurring about 90% of the time**. These two may also be related.
 5. **wc, rc - These are numerical variables.** They hold the counts of white and red blood cells. They are wrongly typed as object and should be numeric. The object type also suggests that some unusual characters occur in these columns.
 6. htn - 2 unique values.
 7. dm, cad - **Have 4 and 3 unique values, although per the data dictionary they should only take yes or no.**
 8. appet, pe, ane - nothing unusual.
 9. classification - Target variable. It seems rather **balanced**, with 2 unique values.


Having gained a first understanding of the variables, we should dig deeper to **reach some conclusions and remove some biases**. Therefore, the next logical step is **Exploratory Data Analysis.**

We will start with Univariate Analysis.
To make the process more organised, I will divide the data set into numeric and categorical variables.

In [None]:
# Dividing into numeric and categorical variables
df_cont = df.select_dtypes(exclude = 'object')
df_cat = df.select_dtypes(include = 'object')

In [None]:
# EDA Numeric Variable: density, box plot and swarm plot for each numeric column
for col in df_cont.columns:
  fig, ax = plt.subplots(1,3, figsize = (13,5))
  df[col].plot(kind = 'kde', ax = ax[0])
  ax[0].set_ylim(bottom = 0)
  # Pass the column by keyword; newer seaborn versions accept only `data` positionally
  sns.boxplot(y = col, data = df_cont, ax = ax[1])
  sns.swarmplot(y = col, data = df_cont, ax = ax[2])

Approach: We study the distribution of our variables using the KDE plot, box plot and swarm plot.
**KDE plot** - gives an idea of the **distribution** of a variable: where the data is concentrated and the shape of the distribution.
**Boxplot** - helps to understand the **median, the quartiles and outliers**.
**Swarm plot** - shows the **actual values taken by the variable, which helps us see how the observations are distributed.**

Observations:
 1. Id - Symmetric distribution, no outliers, and values spread across its whole range.
 2. Age - It is slightly **left skewed with a high concentration of data between 40 and 60**. There are a few outliers at the lower end of the age range.
 3. Bp - It has a long right tail that indicates the presence of outliers, as confirmed by the boxplot. **One interesting observation is that bp takes discrete values only.**
 4. sg, al - as observed earlier, they are discrete variables. Values tend to concentrate at the lower end.
 5. **su - most of the values are 0**. The plot flags outliers, but we should not treat them as such because they are true values. They only appear as outliers because the distribution is so skewed.
 6. bgr - Right skewed, with outliers. Concentrated between 0 and 150.
 7. bu - Right skewed, with outliers. Concentrated between 0 and 50.
 8. **sc - Concentrated between 0 and 8. We need to see why some values are so high for this variable.**
 9. sod - Has a left tail, which is interesting. It has outliers with very low values near 0, which could be an error or a severe sodium deficiency.
 10. Pot - Few outliers.
 11. Hemo, pcv - Both are slightly left skewed with a distribution close to normal.

**Almost all the variables have outliers, ranging from a few to many.**
 

In [None]:
# Seeing values of id column
df_cont['id'].sort_values(ascending = True)

Id column is some form of **numeric identifier**. It **does not provide any qualitative information**. So I have decided to drop it and no further analysis will be done related to this variable.

In [None]:
#Unique values of age
np.sort(df_cont['age'].unique())

In [None]:
# Dataset with age less than 11
df[df['age'] < 11]

We can see that people **younger than 11 are also suffering from chronic kidney disease.** 
Also, Age does not contain any value that looks like an error, other than NaN. So we **should not treat the outliers for this variable** and should fill the missing values with the median, or with another imputation technique if we are able to relate the column to some other variable (which is not the case so far).

Checking unique value counts of various discrete variables to understand their distribution

In [None]:
# Unique values of bp
df['bp'].value_counts()

In [None]:
# Unique values of sg
df['sg'].value_counts(dropna = False)

In [None]:
# Unique values of al - Albumin
df['al'].value_counts(dropna = False)

In [None]:
# Checking rows that have both al and sg missing
df[(df['al'].isnull()) & (df['sg'].isnull())].count()

We can see that su, al, sg each have around 35 missing values and there are **33 rows where all three are missing.** This is interesting. **rbc and pc are also missing in these rows.**

In [None]:
# Overview of dataset for missing value of sg and al 
df_sg = df[(df['al'].isnull()) & (df['sg'].isnull())]
df_sg

Looking at the dataset, **we could not confirm** whether the missing values belong to a particular age group or show particular characteristics in terms of the other variables. 

One interesting observation is that in the rows where these 5 columns are missing, **pcc and ba take only the value notpresent**. However, **notpresent dominates these variables overall, as can be confirmed from the describe statistics above**.

In [None]:
# Trying to see the values of other variables for the missing values of column sg and al,
# to understand if we can relate the variables somehow and this can help in missing value 
# imputation
for col in df_sg.columns:
  if df_sg[col].dtypes != 'object':
    fig, ax = plt.subplots(1,1, figsize =(6,3))
    sns.swarmplot(x = col, data = df_sg, ax = ax)

We are **not able to find a relation** between any other column and the missing values of the 5 columns considered above - sg, al, rbc, su, pc. We will try to impute them suitably. It would also be better to create **a column that flags the missing values of these 5 columns** - as they overlap, such a **column can capture the missingness of these 5 variables** in one go.
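
A minimal sketch of such an indicator column, on invented rows. The column names `block_missing` and `block_missing_count` are hypothetical and not part of the dataset; the five source columns are the ones named above.

```python
import numpy as np
import pandas as pd

# Toy rows; column names match the notebook, values are invented.
toy = pd.DataFrame({
    "sg":  [1.02, np.nan, 1.01, np.nan],
    "al":  [0.0, np.nan, 2.0, 1.0],
    "su":  [0.0, np.nan, 0.0, 0.0],
    "rbc": ["normal", np.nan, np.nan, "abnormal"],
    "pc":  ["normal", np.nan, "normal", "normal"],
})

block_cols = ["sg", "al", "su", "rbc", "pc"]

# 1 when all five values are missing together, 0 otherwise
toy["block_missing"] = toy[block_cols].isnull().all(axis=1).astype(int)
# Number of missing values among the five, in case partial overlap matters
toy["block_missing_count"] = toy[block_cols].isnull().sum(axis=1)
print(toy[["block_missing", "block_missing_count"]])
```

The flag has to be created before any imputation is applied to the five columns, otherwise the information about which rows were missing is lost.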

In [None]:
# Checking the value count for Sugar(su)
df.su.value_counts(dropna = False)

**0** is the most common value, so the **imputation** can be done using it.

In [None]:

sns.countplot(x = 'su', hue = 'dm', data = df)
g = sns.FacetGrid(data = df, hue = 'su', aspect = 2)
g.map(sns.kdeplot, 'bgr')
plt.legend()

In [None]:
# Checking the values of the columns Blood Glucose Random (bgr) and Blood Urea (bu)
fig = plt.figure(figsize = (10,4))
plt.subplot(1,2,1)
df.bgr.hist()
plt.title('bgr')
plt.subplot(1,2,2)
df.bu.hist()
plt.title('bu')

The two have a **similar distribution**. This can **point towards correlation**, which we will check later using the correlation matrix.

Both variables are also **right skewed**. They have long tails, which point to **high kurtosis** and the existence of outliers. 

In [None]:
#Checking for high values of bgr (abnormality)
df[df['bgr']> 400]

We can see that some rows have very high blood glucose random values, which is abnormal. **bgr > 200 tends to indicate diabetes**, and we can also see that all these cases have chronic kidney disease.
So it would not be advisable to treat them as erroneous values and drop them. **They are outliers but do provide valuable information.**

In [None]:
# Checking for abnormal high values of bu
df[df.bu > 200]

As above, the **high values coincide with ckd**. We would need to **check** whether this indicates a relation between bu and chronic kidney disease or just some one-off observations.

In [None]:
# Generating insights using Serum Creatinine(sc)
df.sc.hist()

The normal range for creatinine in the blood may be 0.84 to 1.21 milligrams per deciliter (Mayo Clinic). 
**Values higher than that can indicate kidney malfunction**. So it can be a strong indicator for the classification problem. We can confirm this better by visualizing the relation between sc and the target later.

Also, the very far-off values of sc certainly point to very **abnormal outliers**, which need domain knowledge to be handled properly.

In [None]:
# Understanding Potassium(pot), Sodium(sod), Hemoglobin(hemo) data structure more closely
fig, ax = plt.subplots(1,3, figsize = (10,4))
df.sod.hist(ax = ax[0])
ax[0].set_title('Sodium')
df.pot.hist(ax = ax[1])
ax[1].set_title('Potassium Distribution')
df.hemo.hist(ax = ax[2])
ax[2].set_title('Hemoglobin')

- We can see **abnormal values near 0** in sodium, which should not be the case. 
- We can also see some **abnormal values of potassium around 40.**
- Hemoglobin has a slight left skew, but overall we see a distribution **close to normal**.
- Hemoglobin normally lies in the range of 11 to 17. So we also **need to see whether the lower values of hemoglobin point to something**.

In [None]:
#Checking abnormality of sodium and instances of ckd
ckd = list(df.classification.unique())
plt.figure(figsize = (10,6))
for c in ckd:
  sns.distplot(df['sod'][df['classification'] == c], label = c)
  plt.legend()

Sodium values lower than 130 and higher than 160 both occur mostly in ckd cases.

In [None]:
# Abnormality of Potassium and ckd
plt.figure(figsize = (6,4))
for c in ckd:
  sns.distplot(df['pot'][df['classification'] == c], label = c)
  plt.legend()

**Abnormality in pot points to ckd.**

In [None]:
# Checking the dataset for abnormal values of Sodium and Potassium
df[(df.sod < 50) | (df.pot > 10)]

All these values are very abnormal and all occur in ckd cases. We can either replace these outliers with more plausible values or drop the rows. 
We also need to confirm whether this points to a pattern or is just random chance.

In [None]:
# Distribution of Packed Cell Volume (pcv)
df_cont.pcv.hist()
plt.title('pcv')

## CATEGORICAL VARIABLES

In [None]:
df_cat.columns

Pasting Column Meaning again just for Quick References

Age(numerical) - age in years
Blood Pressure(numerical) - bp in mm/Hg
Specific Gravity(nominal) - sg - (1.005,1.010,1.015,1.020,1.025)
Albumin(nominal) - al - (0,1,2,3,4,5)
Sugar(nominal) - su - (0,1,2,3,4,5)
Red Blood Cells(nominal) - rbc - (normal,abnormal)
Pus Cell (nominal) - pc - (normal,abnormal)
Pus Cell clumps(nominal) - pcc - (present,notpresent)
Bacteria(nominal) - ba - (present,notpresent)
Blood Glucose Random(numerical) - bgr in mgs/dl
Blood Urea(numerical) -bu in mgs/dl
Serum Creatinine(numerical) - sc in mgs/dl
Sodium(numerical) - sod in mEq/L
Potassium(numerical) - pot in mEq/L
Hemoglobin(numerical) - hemo in gms
Packed Cell Volume(numerical)
White Blood Cell Count(numerical) - wc in cells/cumm
Red Blood Cell Count(numerical) - rc in millions/cmm
Hypertension(nominal) - htn - (yes,no)
Diabetes Mellitus(nominal) - dm - (yes,no)
Coronary Artery Disease(nominal) - cad - (yes,no)
Appetite(nominal) - appet - (good,poor)
Pedal Edema(nominal) - pe - (yes,no)
Anemia(nominal) - ane - (yes,no)
Class (nominal)- class - (ckd,notckd)
  
  



In [None]:
# Count plot of Categorical Variables
sns.catplot(x = 'rbc', estimator = None, data = df_cat, kind = 'count')

Conclusion: 
- Lots of missing values in the variable.
- **Imbalanced** towards normal.
- If we **impute this variable with the mode, the feature can lose predictive power, as it will become highly imbalanced**.
- It would be better to derive the missing values accurately in some way, or even to create a **new category - 'Other'** to capture the importance of the missing values.
- **Red blood cells can have a relationship with hemoglobin**, as hemoglobin is carried by red blood cells. We can visualise this relation to understand whether our data shows it, which will also help us in feature engineering and imputation.

In [None]:
# creating list of categories of rbc
label = list(x for x in df['rbc'].unique())
label.remove(np.nan)
label

In [None]:
# Plotting different categories of rbc with hemoglobin
for z in label:
  subset = df['hemo'][df['rbc'] == z]
  # Label each curve with its own category so the legend cannot be mismatched
  sns.distplot(a = subset, label = z, rug = True)
plt.legend()

The distributions overlap in the middle but show **quite good separation. This can help us in imputation.** We can consider 12.5 as the point of separation for the two rbc categories.

(Source: Mayo Clinic)
The normal range for hemoglobin is:

    For men, 13.5 to 17.5 grams per deciliter
    For women, 12.0 to 15.5 grams per deciliter
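
As a rough illustration of how the 12.5 separation point mentioned above could drive imputation, the sketch below fills missing rbc from hemoglobin on invented rows. It is only one possible rule under that assumption: `HEMO_CUTOFF` and `rbc_imputed` are hypothetical names, and the cut-off itself is read off a plot rather than validated.

```python
import numpy as np
import pandas as pd

# Invented rows: only the first one has a known rbc value.
toy = pd.DataFrame({
    "hemo": [15.4, 9.6, 11.2, 13.8, np.nan],
    "rbc":  ["normal", np.nan, np.nan, np.nan, np.nan],
})

HEMO_CUTOFF = 12.5  # separation point read off the plot; an assumption

# Guess from hemoglobin: below the cut-off -> abnormal, otherwise normal
guess = pd.Series(
    np.where(toy["hemo"] < HEMO_CUTOFF, "abnormal", "normal"),
    index=toy.index,
).where(toy["hemo"].notna())   # no guess when hemo itself is missing

# Known rbc values are kept; only the gaps take the guess
toy["rbc_imputed"] = toy["rbc"].fillna(guess)
print(toy)
```

Rows where hemoglobin is also missing stay missing, so they would still need a fallback such as a separate category.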


In [None]:
# Same plot as above, drawn with FacetGrid for comparison
grid = sns.FacetGrid(df, hue="rbc", aspect = 2)
grid.map(sns.kdeplot, 'hemo')
grid.add_legend()

In [None]:
# Distribution of Pus Cell and Pus Cell Clumps
plt.figure(figsize = (10,5))
plt.subplot(1,2,1)
sns.countplot(x = 'pc', data = df_cat)
plt.title('Pus Cell')
plt.subplot(1,2,2)
sns.countplot(x = 'pcc', data = df_cat)
plt.title('Pus Cells Clump')

In [None]:
#Check overlap
sns.countplot(x = 'pcc', hue = 'pc', data = df_cat, saturation = 1)

We can see that **pcc notpresent has a high overlap with normal pc** (as we assumed), and pcc present has a high overlap with abnormal pc.

In [None]:
# Proportion of Bacteria present or not
df_cat.ba.value_counts(normalize = True).plot(kind = 'bar', colormap = 'cool')
plt.axhline(y = 0.93, color = 'r', label = '0.93')

About **93 percent of the values** belong to a single category, notpresent. This indicates a very imbalanced variable, which is not very strong in terms of predictive power.

**rc and wc are numerical** variables with string values in them. So we should **visualise them as continuous variables.**

In [None]:
# Show the values in the rc column that are not plain numbers
s = df.rc.apply(lambda x : str(x).replace('.','').isdigit())
# Select by boolean mask with .loc, which does not depend on a default integer index
df.loc[~s, 'rc'].unique()

In [None]:
# Missing counts
df.rc.isnull().sum()

In [None]:
# Dropping the Nan and other non numeric values and seeing distribution
plt.figure(figsize = (15,5))
sns.distplot(df['rc'][(df['rc'] != '\t?') & (~df['rc'].isnull())])
plt.title('RC Distribution')

In [None]:
df['rc'] = df['rc'].replace({'\t?' : np.nan})
df['wc'] = df['wc'].replace({'\t?' : np.nan , '\t8400' : 8400})

In [None]:
df.rc.isnull().sum()
df.wc.isnull().sum()

The rc plot above shows a neat **Gaussian distribution**. We need to **take special care in the imputation of this variable** - it has a large number of missing values, so simple imputation can introduce real bias in its distribution.

We can also see the distribution of the variable with rbc - normal or abnormal. This can help us in analysing this feature more closely.

In [None]:
label = list(x for x in df['rbc'].unique())
label.remove(np.nan)
label

In [None]:
# Plotting different categories of rbc with rc - Count of Red blood cells
df_rc = df[(df['rc'] != '\t?') & (~df['rc'].isnull())]
for z in label:
  # Filter on df_rc itself so the mask and the data share the same index
  subset = df_rc['rc'][df_rc['rbc'] == z].astype(float)
  sns.distplot(a = subset, label = z, rug = True)
plt.legend()

We can see the separation **between the two classes of rbc on the rc variables.**

In [None]:
# WC Analysis
# Show the values in the wc column that are not plain integers
s = df.wc.apply(lambda x : str(x).isdigit())
# Select by boolean mask with .loc, which does not depend on a default integer index
df.loc[~s, 'wc'].unique()

These values are not valid in the wc column. We have to deal with nan and '\t?'. The '\t8400' value can simply be read as 8400, as it falls within the scale of wc.
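
One way to handle all three cases in a single step is to strip whitespace and then coerce to numbers. The sketch below runs on an invented Series that mimics the odd values listed above; it is an alternative to explicit replacement, not the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Invented values mimicking the oddities found in wc
raw_wc = pd.Series(["7800", "\t8400", "\t?", np.nan, "11000"], name="wc")

# Strip stray whitespace (tabs included), then coerce: anything that is
# still not a number (here the '?') becomes NaN.
clean_wc = pd.to_numeric(raw_wc.str.strip(), errors="coerce")
print(clean_wc)
```

The advantage is that previously unseen junk values are also turned into NaN instead of silently staying as strings; the cost is that they are no longer inspected one by one.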

In [None]:
sns.distplot(df['wc'][(df['wc'] != '\t?') & (~df['wc'].isnull()) & (df['wc'] != '\t8400')], 
             color = 'Orange')
plt.title('WC Distribution')

It is slightly right skewed and shows the presence of outliers, though the shape is close to a Gaussian curve.

In [None]:
# Distribution of HyperTension and DM and CAD
plt.figure(figsize = (15,4))
plt.subplot(1,3,1)
sns.countplot(x = 'htn', data = df_cat)
plt.title('HyperTension')
plt.subplot(1,3,2)
sns.countplot(x = 'dm', data = df_cat)
plt.title('Diabetes Mellitus')
plt.subplot(1,3,3)
sns.countplot(x = 'cad', data = df_cat)
plt.title('Coronary Artery Disease')



We have **4 categories in dm and 3 in cad**, which should not be the case. But the extra categories are just variants of yes and no, as can be deduced from the suffixes of these categories.

These variants are very few and will not affect the distribution of the variables much.

There are more cases of 'no' hypertension. We can look at this distribution across age and some other factors to understand the hypertension cases.
dm also has more cases of 'no'. The same goes for cad - very few people actually have coronary artery disease.

Correcting dm and cad values - as we are more or less sure about how to fix them

In [None]:
df.dm.value_counts()

In [None]:
# Correcting the values of the variables dm and cad
df['dm'] = np.where(df.dm == '\tno', 'no', df['dm'])
df['dm'] = np.where(df.dm == '\tyes', 'yes', df['dm'])
df.dm.value_counts()

In [None]:
df.cad.value_counts()

In [None]:
# Correcting the values of cad
df['cad'] = np.where(df.cad == '\tno', 'no', df['cad'])
df.cad.value_counts()
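
The corrections above target specific values. If the stray characters are only leading or trailing whitespace, which is an assumption to verify with `value_counts()` first, the same cleanup can be written once for several columns. A sketch on invented values:

```python
import numpy as np
import pandas as pd

# Invented values with tab-prefixed variants of yes/no
toy = pd.DataFrame({
    "dm":  ["yes", "\tno", "\tyes", "no", np.nan],
    "cad": ["no", "\tno", "yes", "no", "no"],
})

for col in ["dm", "cad"]:
    # .str methods skip NaN, so missing values stay missing
    toy[col] = toy[col].str.strip()

print(toy["dm"].value_counts(dropna=False))
```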

In [None]:
# See relation of HyperTension with age
sns.swarmplot(x = 'htn', y = 'age', data = df)

As expected, most of the cases of hypertension are in the age bracket above 40.

In [None]:
# Relation of dm with blood glucose variable can show some trend
g = sns.FacetGrid(df, hue = 'dm', aspect = 2)
g.map(sns.kdeplot, 'bgr')
g.add_legend()

We can see a **separation above the 170 mark**. Most of the cases with dm as yes lie above 170. The cases without dm are concentrated between **50 and 170**, which can be taken as the **normal bgr range**.

In [None]:
# Distribution of dm with su(sugar)

sns.countplot(x = 'su', hue = 'dm', data = df)

We can see that most of the 'no' cases of diabetes coincide with a sugar level of 0. So we can conclude that a sugar level above 0 makes dm more likely.

In [None]:
# Distribution of Appetite, Pedal Edema, Anemia

plt.figure(figsize = (15,4))
plt.subplot(1,3,1)
sns.countplot(x = 'appet', data = df_cat)
plt.title('Appetite')
plt.subplot(1,3,2)
sns.countplot(x = 'pe', data = df_cat)
plt.title('Pedal Edema')
plt.subplot(1,3,3)
sns.countplot(x = 'ane', data = df_cat)
plt.title('Anemia')

- Most of the people have a good appetite.
- Pedal edema - indicates swelling of the feet. This can be a direct indicator of kidney issues. 'No' is the majority status here as well.
- Anemia - indicates a deficiency of red blood cells/hemoglobin. Most of the people do not have anemia.

In [None]:
#Anemia relation with red blood cell counts and hemoglobin level
plt.figure(figsize = (10,4))
plt.subplot(1,2,1)
sns.countplot(x = 'ane', hue = 'rbc', data = df)
plt.title('Anemia vs Red BC Counts')
plt.subplot(1,2,2)
sns.swarmplot(x = 'ane', y = 'hemo', data = df)
plt.title('Anemia vs Hemo')

 - From Anemia vs RBC - we can conclude that most cases of **no anemia coincide with normal rbc** (as should be the case).
 - Anemia vs Hemo - **Below a hemo level of 10-11 the chances of being anemic are higher** (as should be the case).

In [None]:
#Target variable distribution
df.classification.value_counts().plot(kind = 'bar')
plt.title('Classification')

There are more cases of chronic kidney disease, which is **favourable**, as it should help us predict the ckd category more accurately.

# Bivariate Analysis

In [None]:
##Rectifying some variables to make Bivariate analysis better
df['rc'] = df['rc'].replace({'\t?' : np.nan})
df['wc'] = df['wc'].replace({'\t?' : np.nan , '\t8400' : 8400})

df['dm'] = np.where(df.dm == '\tno', 'no', df['dm'])
df['dm'] = np.where(df.dm == '\tyes', 'yes', df['dm'])
df.dm.value_counts()

# Correcting the values of the variables dm and cad
df['cad'] = np.where(df.cad == '\tno', 'no', df['cad'])
df.cad.value_counts()

df.rc = df.rc.astype(float)
df.wc = df.wc.astype(float)

In [None]:
plt.figure(figsize = (15,15))
# Number the subplots consecutively so the grid has no gaps
cat_cols = df_cat.drop(['rc','wc','classification'], axis = 1).columns
for i, col in enumerate(cat_cols):
  plt.subplot(4,3,i+1)
  sns.countplot(x = col, hue = 'classification', data = df)


Observations:

- If rbc is normal, there is a lower chance of ckd. But **abnormal rbc is strongly associated with ckd**.
- The same holds for **pus cell, pus cell clumps and bacteria** - with normal pus cells, the chance of ckd or notckd is almost equal. But **abnormal pc or presence of ba or pcc clearly points to ckd**. We can assume that these will be good predictors of ckd.
- **Presence of hypertension (htn), diabetes (dm), coronary artery disease, swollen feet (pe), anemia or poor appetite is associated with ckd**. But ckd also occurs in the absence of hypertension.
- From the above analysis, we can see **that the presence of even one** of these - abnormal red blood cells, bacteria, hypertension, abnormal pus cells, diabetes, coronary artery disease, poor appetite, anemia, pedal edema - **increases the chance of ckd substantially**.

In [None]:
# Dropping id and the discrete numeric variables from the pairplot; classification is used as hue
sns.pairplot(data = df.drop(['id','su','al','sg'], axis = 1), 
             hue = 'classification', 
             corner = True, height = 4)

Some observations -
- Linear relation between hemo and pcv. Lower pcv values (10-40) tend to go with ckd.
- For **hemo, values below 10 also go with ckd (as we assumed earlier, the normal hemo range is above 10)**.
- For pot and sod there is no clear separation of ckd.
- sc is not clear; we need more detail on it.
- For blood urea, values above 50 indicate the presence of ckd.
- bgr above 150 indicates ckd. 
- Both very **low and very high bp are associated with ckd**. However, high bp is the more prevalent case.
- **Older people are more susceptible to ckd**, but there is also a lot of overlap across all age groups.

In [None]:
# Relation of sc with Ckd can show some trend
g = sns.FacetGrid(df, hue = 'classification', aspect = 2)
g.map(sns.kdeplot, 'sc')
g.add_legend()

For sc, there were **many unusual values**, as also noted earlier, so we checked it separately.
All values above 4 tend to show ckd. So **outlier treatment** can easily be done for this variable by **capping** the values.
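
Capping can be sketched with `Series.clip`. The values below are invented, the threshold of 4 is the one read off the plot above, and `sc_was_capped` is a hypothetical helper flag that keeps track of which rows were extreme.

```python
import pandas as pd

# Invented serum creatinine values with a few extreme ones
sc = pd.Series([0.9, 1.2, 3.9, 7.5, 24.0, 76.0], name="sc")

SC_CAP = 4.0  # threshold taken from the observation above; an assumption

# Values above the cap are pulled down to the cap, the rest are untouched
sc_capped = sc.clip(upper=SC_CAP)

# Flag the capped rows so the "was extreme" information is not lost
sc_was_capped = (sc > SC_CAP).astype(int)

print(pd.DataFrame({"sc": sc, "sc_capped": sc_capped, "flag": sc_was_capped}))
```

Since every value above the cap is a ckd case in this data, capping keeps the class signal while limiting the influence of the most extreme values on scale-sensitive models.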


In [None]:
# Plotting discrete variables with classification
for col in ['su','al','sg']:
  plt.subplots(1,1)
  sns.countplot(x = col, hue = 'classification', data = df)

- A sugar (su) level above 0 is associated with ckd.
- An albumin (al) level above 0 indicates ckd.
- sg values of 1.005, 1.01 and 1.015 indicate the presence of ckd. With values of 1.02 and 1.025, there is less chance of belonging to the ckd category.

In [None]:
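# Correlation Matrix (numeric columns only; newer pandas versions raise an error on object columns)
plt.figure(figsize = (10,10))
sns.heatmap(df.select_dtypes(exclude = 'object').corr(), annot = True)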

Observations - 
- High correlation between **pcv and hemo** (as seen from their linear relation).
- The correlation among the other variables is not that high.
- **bgr and su** have a high correlation of 0.65, as we also assumed earlier.
- sodium and serum creatinine have a negative correlation of -0.71.
- We made some earlier assumptions about rc, rbc and hemo being correlated. Since corr() only takes numeric variables, we cannot confirm the relations that involve categorical variables such as rbc from here. 
- Also, the relation between bgr and dm could not be confirmed.
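
One way to bring a binary categorical variable into the correlation matrix is to encode it as 0/1 first; the Pearson correlation between a 0/1 column and a numeric column is the point-biserial correlation. The sketch below uses invented values, and `rbc_abnormal` is a hypothetical column name.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "rbc":  ["normal", "normal", "abnormal", "abnormal", "normal", "abnormal"],
    "hemo": [15.2, 14.1, 9.8, 10.5, 13.0, 8.9],
    "rc":   [5.2, 4.9, 3.4, 3.9, 4.6, 3.1],
})

# map() leaves missing categories as NaN, and corr() then ignores those pairs
encoded = toy.assign(rbc_abnormal=toy["rbc"].map({"normal": 0, "abnormal": 1}))
corr = encoded[["rbc_abnormal", "hemo", "rc"]].corr()
print(corr.round(2))
```

The same encoding could be applied to dm to check its relation with bgr.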


We also calculate the Spearman correlation, as we have some ordinal variables. Looking at their relation in terms of Spearman rank correlation makes more sense.

In [None]:
# Spearman Correlation 
plt.figure(figsize = (10,10))
sns.heatmap(df.select_dtypes(exclude = 'object').corr(method = 'spearman'), annot = True)

### Data PreProcessing

Missing Values Treatment

In [None]:
# Count of Missing Values
df.isnull().sum()

In [None]:
# Scatter of age with bp
sns.scatterplot(x = 'age', y = 'bp', data = df)

In [None]:
df_m = df.copy()
df_m = df_m.drop(['id'], axis = 1)

In [None]:
for col in df_m.columns:
  if df_m[col].dtypes == 'object':
    df_m[col] = df_m[col].fillna(2)


for col in df_m.columns:
  if df_m[col].dtypes != 'object':
    df_m[col] = df_m[col].fillna(999)

In [None]:
df_m.rbc = df_m.rbc.replace({'normal':0, 'abnormal':1})
df_m.pc = df_m.pc.replace({'normal':0, 'abnormal' : 1})
df_m.pcc = df_m.pcc.replace({'notpresent':0, 'present' : 1})
df_m.ba = df_m.ba.replace({'notpresent':0, 'present' : 1})
df_m.htn = df_m.htn.replace({'no':0,'yes' : 1})
df_m.dm = df_m.dm.replace({'no': 0,'yes' : 1})
df_m.cad = df_m.cad.replace({'no':0,'yes' : 1})
df_m.appet = df_m.appet.replace({'good':0,'poor' : 1})
df_m.pe = df_m.pe.replace({'no':0,'yes' : 1})
df_m.ane = df_m.ane.replace({'no':0,'yes' : 1})
df_m.classification = df_m.classification.replace({'notckd':0,'ckd' : 1})

In [None]:
# Imputation for age
x_train = df_m[df_m.age != 999].drop(['age'], axis = 1)
y_train = df_m['age'][df_m.age != 999]
x_test = df_m[df_m.age == 999].drop(['age'], axis = 1)
y_test = df_m['age'][df_m.age == 999]

In [None]:
x_test

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 100, max_features = 10, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))
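
The scores in the cell above are computed on the same rows the model was fitted on, so they tend to be optimistic. A hedged sketch of a cross-validated check follows. It uses synthetic data as a stand-in for `x_train` / `y_train`, and the 5-fold setting is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train (not the notebook's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 50 + 10 * X[:, 0] + rng.normal(scale=5, size=120)

rf = RandomForestRegressor(n_estimators=100, random_state=1)

# R^2 on the data the model was fitted on (the kind of score reported above)
train_r2 = rf.fit(X, y).score(X, y)

# R^2 on held-out folds; cross_val_score refits a clone of rf for each fold
cv_r2 = cross_val_score(rf, X, y, cv=5, scoring="r2")

print(f"train R^2: {train_r2:.2f}, 5-fold CV R^2: {cv_r2.mean():.2f}")
```

If the cross-validated score for age turns out to be low, a simple median fill may be as defensible as the model-based one.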


In [None]:
# imp = pd.Series(model.feature_importances_)
# imp.index = x_train.columns
# imp.sort_values(ascending = False)

In [None]:
model.predict(x_test)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
# Training-set R^2 of a KNN regressor for k = 1..49
score = []
for k in range(1,50,1):
  knn = KNeighborsRegressor(n_neighbors= k)
  model2 = knn.fit(x_train, y_train)
  score.append(r2_score(y_train, model2.predict(x_train)))


plt.plot(range(1,50,1), score)

In [None]:
# Filling age values from the Random Forest Regressor predictions (index : predicted age)
df['age'] = df['age'].fillna({81 : 57, 91: 36, 95: 53, 247: 57, 257: 56})
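
Typing the predictions into a dictionary by hand is easy to get wrong when the data changes. A possible alternative is a small helper that fits on the known rows and writes the predictions straight into the missing rows. This is only a sketch: impute_with_model is a hypothetical helper, the toy frame is made up, and it assumes the feature columns have no missing values in the rows being filled.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_model(frame, target, features, model):
    """Fit `model` on rows where `target` is known and fill the missing rows in place."""
    missing = frame[target].isnull()
    if missing.any():
        model.fit(frame.loc[~missing, features], frame.loc[~missing, target])
        frame.loc[missing, target] = model.predict(frame.loc[missing, features])
    return frame

# Toy frame (hypothetical values) where age = 0.5 * bp + 10 on the known rows
toy = pd.DataFrame({"bp": [60.0, 70.0, 80.0, 90.0, 100.0],
                    "age": [40.0, 45.0, np.nan, 55.0, np.nan]})
impute_with_model(toy, "age", ["bp"], LinearRegression())
print(toy)
```

Any regressor with fit and predict can be passed in, so the same helper would cover the Random Forest and Linear Regression imputations used in this notebook.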

In [None]:
# Filling missing value of bp. We saw no particular variable had good correlation with bp.
# Highest was pcv with a -0.32 value. So it would be better if we use Regressor as we did for age

x_train = df_m[df_m.bp != 999].drop(['bp'], axis = 1)
y_train = df_m['bp'][df_m.bp != 999]
x_test = df_m[df_m.bp == 999].drop(['bp'], axis = 1)
y_test = df_m['bp'][df_m.bp == 999]

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 50, max_features = 10, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
model.predict(x_test)

In [None]:
df['bp'] = df['bp'].fillna({38 : 71, 89: 71, 101: 73, 169: 67, 183 : 74, 209: 81, 
                            246: 74, 258: 76, 274: 66}
                , axis = 0)


Filling the missing values of htn, dm and cad. They are categorical and each has only one missing value, so a simple approach such as filling with the mode is enough.

Seeing the row with missing value of dm

In [None]:
df[df.dm.isnull()]

- Seeing the data, we can see that this is the row with htn, dm, cad missing value.

- We can see that this person is not ckd, so he is more likely not to have hypertension, coronary disease or diabetes (as we saw in the categorical bar plots above).

- Also the blood sugar random (bgr) = 70, which is a normal value. So dm should take 'no' as a value.

Replacing with 'no' (which is the mode) for these three variables

In [None]:
df['htn'].fillna(df.htn.mode()[0], inplace = True)
df['cad'].fillna(df.cad.mode()[0], inplace = True)
df['dm'].fillna(df.dm.mode()[0], inplace = True)

In [None]:
df[df.dm.isnull()]

We had earlier seen that the five columns (sg, al, su, rbc, pc) have missing values for the same observations (though rbc and pc have missing values in a larger number of observations).

We can also create a new binary column that marks these observations with 1, to capture the significance of the missing values, and then impute them with simple methods. However, if we are able to impute with reasonable certainty, then we need not create such a column.
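
A minimal sketch of such an indicator column, on a made-up toy frame (the column name block_missing is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the five columns that go missing together
toy = pd.DataFrame({
    "sg":  [1.020, np.nan, 1.010, np.nan],
    "al":  [0.0,   np.nan, 2.0,   np.nan],
    "su":  [0.0,   np.nan, 1.0,   np.nan],
    "rbc": ["normal", np.nan, np.nan, np.nan],
    "pc":  ["normal", np.nan, "abnormal", np.nan],
})
block_cols = ["sg", "al", "su", "rbc", "pc"]

# 1 where all five columns are missing in the same row, 0 otherwise
toy["block_missing"] = toy[block_cols].isnull().all(axis=1).astype(int)
print(toy["block_missing"].tolist())
```

The indicator has to be created before any of the five columns is imputed, otherwise the information about which rows were missing is lost.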

In [None]:
# Concurrence of missing values
df[df.su.isnull()]

Imputation of su
As we saw in the EDA above, su is related to dm (diabetes) and bgr (blood glucose random).
Also, su and bgr have a correlation of 0.65, which points towards a positive relation. So we will use these two variables for imputation.



In [None]:
x_train = df_m[['bgr','dm']][df_m.su != 999]
y_train = df_m['su'][df_m.su != 999]
x_test = df_m[['bgr','dm']][df_m.su == 999]
y_test = df_m['su'][df_m.su == 999]

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 50, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
model.predict(x_test)

In [None]:
import pandas as pd
series = pd.Series(model.predict(x_test))
series.index = x_test.index
series = np.round(series)
df2 = series
df_joint = pd.concat([x_test, df2], axis = 1)
df_joint = df_joint.rename(columns = {df_joint.columns[2] : 'su'})

In [None]:
df_joint.su.value_counts()

In [None]:
df.loc[df.su.isnull(), 'su'] = df_joint.loc[:,'su']

In [None]:
# Values of Sugar Column after Imputation
df.su.value_counts()

In [None]:
# We can see that for bgr values greater than 140, su takes the levels 1, 2 and 3.

plt.subplots(1,1)
# sns.distplot(df['bgr'][df.su == 0], kde = False, label = 0, color = 'Blue')
sns.distplot(df['bgr'][df.su == 1], kde = False, label = 1, color = 'Red')
sns.distplot(df['bgr'][df.su == 2], kde = False, label = 2, color = 'Orange')
sns.distplot(df['bgr'][df.su == 3], kde = False, label = 3, color = 'Yellow')
# sns.distplot(df['bgr'][df.su == 4], kde = False)
plt.legend()

Imputing Value of Al

In [None]:
df.al.isnull().sum()

sc, hemo, pcv and rc have a strong correlation with al, as we saw in the correlation table above. So we can use this relation to impute the values.

In [None]:
# Creating train and test split for al variable
x_train = df_m[['rc','pcv','hemo','sc']][df_m.al != 999]
y_train = df_m['al'][df_m.al != 999]
x_test = df_m[['rc','pcv','hemo','sc']][df_m.al == 999]
y_test = df_m['al'][df_m.al == 999]

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 51, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
series = pd.Series(model.predict(x_test))
series.index = x_test.index
series = np.round(series)
df2 = series
df_joint = pd.concat([x_test, df2], axis = 1)
df_joint = df_joint.rename(columns = {df_joint.columns[4] : 'al'})
df.loc[df.al.isnull(), 'al'] = df_joint.loc[:,'al']

In [None]:
# Imputation tried using KNN - FAILED
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
x_train_scaled = StandardScaler().fit_transform(x_train)
score = []
for k in range(1,10,1):
  rf = KNeighborsClassifier(n_neighbors= k)
  model2 = rf.fit(x_train_scaled, y_train)
  score.append(accuracy_score(y_train, model2.predict(x_train_scaled)))


plt.plot(range(1,10,1), score)

In [None]:
# Imputation using KNN IMPUTER - Failed
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
x_train_scaled = StandardScaler().fit_transform(x_train)
score = []
for k in range(1,10,1):
  model2 = KNNImputer(n_neighbors = k, weights = 'distance')
  model2 = rf.fit(x_train_scaled, y_train)
  score.append(accuracy_score(y_train, model2.predict(x_train_scaled)))


plt.plot(range(1,10,1), score)

Imputation for SG

In [None]:
df.sg.isnull().sum()

In [None]:
df.sg.value_counts()

In [None]:
# pcv, hemo, rc and sod have high correlation with this variable


# Creating train and test split for sg variable
x_train = df_m[['rc','pcv','hemo','sc']][df_m.sg != 999]
y_train = df_m['sg'][df_m.sg != 999]
x_test = df_m[['rc','pcv','hemo','sc']][df_m.sg == 999]
y_test = df_m['sg'][df_m.sg == 999]

Both sg and al have high correlation with the same variables,
yet their own correlation with each other is not that high

In [None]:
# Running Random Forest Regressor Model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 51, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
a = list(np.round(model.predict(x_test),3))
a = [round(round(b/0.005)*0.005,3) for b in a]
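
Rounding to the nearest multiple of 0.005 works here because the sg levels are evenly spaced and a random forest cannot predict outside the range of its training targets. For levels that are not evenly spaced, or for a model that can extrapolate, a more general option is to snap each prediction to the closest allowed level. A sketch, where snap_to_levels is a hypothetical helper and the levels are the ones listed in the attribute information:

```python
import numpy as np

# The five sg levels listed in the attribute information
SG_LEVELS = np.array([1.005, 1.010, 1.015, 1.020, 1.025])

def snap_to_levels(values, levels=SG_LEVELS):
    """Map each value to the closest entry of `levels`."""
    values = np.asarray(values, dtype=float)
    # Distance of every value to every level, then pick the closest level per value
    nearest = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
    return levels[nearest]

# Made-up predictions, including one above the largest level
print(snap_to_levels([1.0074, 1.0126, 1.0300]))
```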

In [None]:
# Imputing the values
a = pd.Series(a)
a.index = x_test.index
df_joint = pd.concat([x_test, a], axis = 1)
df_joint = df_joint.rename(columns = {df_joint.columns[4] : 'sg'})
df.loc[df.sg.isnull(), 'sg'] = df_joint.loc[:,'sg']

Imputing values of sc

In [None]:
df.sc.isnull().sum()

In [None]:
# Replacing with mode
df['sc'].fillna(df['sc'].mode()[0], inplace = True)

Filling Values of pc and pcc. These two are related, so using their values to impute

In [None]:
df.pc.isnull().sum()

In [None]:
observed = pd.crosstab(df['pcc'], df['pc'])

In [None]:
# Calculating CramerV
from scipy.stats import chi2_contingency
chi_stats = chi2_contingency(observed)[0]
n = np.sum(observed).sum()
dof = np.min([observed.shape[0],observed.shape[1]]) - 1 
cramerv = np.sqrt(chi_stats/(n * dof))
cramerv

In [None]:
# Removing the over approximation in CramerV Statistics
def cramers_v(x, y):
    import scipy.stats as ss
    confusion_matrix = pd.crosstab(x,y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))
  

cramers_v(df['pcc'], df['pc'])

It is interesting that Cramér's V shows only a moderate association even though the bar plots showed a strong relation between the two. This can be understood from its definition.
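
To see why the definition leads to this, here is a small worked example with made-up counts (not the notebook's data). Every 'present' row is 'abnormal', which looks like a strong relation in a bar plot, yet most 'abnormal' rows are 'notpresent', so the association measured over the whole table is only moderate. The function below is a plain NumPy sketch without Yates or bias correction; note that chi2_contingency applies Yates' correction to 2x2 tables by default, which lowers the statistic further.

```python
import numpy as np

def cramers_v_uncorrected(table):
    """Cramér's V from a contingency table, without Yates or bias correction."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Hypothetical counts: rows = pcc (notpresent, present), columns = pc (normal, abnormal)
toy_table = [[200, 30],
             [0, 20]]
print(round(cramers_v_uncorrected(toy_table), 3))
```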

A pcc value is available for every missing value of pc, except one.

In [None]:
# Imputing the missing values of pc as per pcc
df.loc[(df.pcc == 'notpresent') & (df.pc.isnull()), 'pc'] = 'normal'
df.loc[(df.pcc == 'present') & (df.pc.isnull()), 'pc'] = 'abnormal'

In [None]:
# One imputation of pc is left, as pcc also has a null value there.
df[df.pc.isnull()]

The presence of pus cells (pc) is an indication of chronic kidney disease. In the observation above the status is notckd, so pc should be normal and pcc should be notpresent.

In [None]:
df.loc[df.pc.isnull(), 'pc'] = 'normal'

Imputing Value of PCC - Using values of PC

In [None]:
df[df.pcc.isnull()]

In [None]:
# Imputing with not present
df.loc[df.pcc.isnull(),'pcc'] = 'notpresent'

Imputing PCV 

All three rc, pcv, hemo are highly correlated. So we need to impute one using some other variables and then using this variable, impute the other two

We should first impute the variable that has highest correlation with other variables.

Impute pcv using - sc , al, sg.

In [None]:
# Creating train and test split for pcv variable
x_train = df[['sc','al','sg']][~df.pcv.isnull()]
y_train = df['pcv'][~df.pcv.isnull()]
x_test = df[['sc','al','sg']][df.pcv.isnull()]
y_test = df['pcv'][df.pcv.isnull()]

In [None]:
# Running Random Forest Regressor Model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 51, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
model.predict(x_test)

In [None]:
# Imputing the values
a = pd.Series(model.predict(x_test))
a.index = x_test.index
df_joint = pd.concat([x_test, a], axis = 1)
df_joint = df_joint.rename(columns = {df_joint.columns[3] : 'pcv'})
df.loc[df.pcv.isnull(), 'pcv'] = df_joint.loc[:,'pcv']

In [None]:
df.groupby(['sc','sg','al'])['pcv'].agg(['mean','median'])

Imputation of Hemo - Highly Correlated with PCV.

In [None]:
# Creating train and test split for Hemo variable
x_train = df[['pcv']][~df.hemo.isnull()]
y_train = df['hemo'][~df.hemo.isnull()]
x_test = df[['pcv']][df.hemo.isnull()]
y_test = df['hemo'][df.hemo.isnull()]

In [None]:
# Running Linear Regression Model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = LinearRegression()
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
model.coef_, model.intercept_

In [None]:
# Imputing the values
a = pd.Series(model.predict(x_test))
a.index = df[df.hemo.isnull()].index
df_joint = pd.DataFrame(a, columns= ['hemo'])
df.loc[df.hemo.isnull(), 'hemo'] = df_joint.loc[:,'hemo']

Imputation of RC

In [None]:
# Creating train and test split for rc variable
x_train = df[['pcv','hemo']][~df.rc.isnull()]
y_train = df['rc'][~df.rc.isnull()]
x_test = df[['pcv','hemo']][df.rc.isnull()]
y_test = df['rc'][df.rc.isnull()]

In [None]:
# Running Linear Regressor Model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = LinearRegression()
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
# Running Random Forest Regressor Model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 50, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
plt.subplots(1,1)
sns.distplot(pd.Series(model.predict(x_test)), color = 'Yellow')
sns.distplot(df.rc, color = 'Orange')

In [None]:
# Imputing the values
a = pd.Series(model.predict(x_test))
a.index = df[df.rc.isnull()].index
df_joint = pd.DataFrame(a, columns= ['rc'])
df.loc[df.rc.isnull(), 'rc'] = df_joint.loc[:,'rc']

Imputation of rbc

In [None]:
df.rbc.value_counts(dropna = False)

RBC has a direct relation with rc - that shows numerical value of red blood cell counts. So if the range of rc is in normal range, rbc will take normal as status or otherwise. This can be confirmed by boxplot.

In [None]:
sns.boxplot(x = 'rc', y = 'rbc', data = df)

We can see a separation in the concentrated part for the two categories of rbc.

However, there is an issue: rc itself has a lot of null values. So we have to think of some other way.

In [None]:
# Calculating Point Biserial Correlation between rbc and rc

from scipy.stats import pointbiserialr
df_pbr = df[~((df.rc.isnull()) | (df.rbc.isnull()))]
df_pbr.rbc = df_pbr.rbc.replace({'normal' : 0, 'abnormal' : 1})
stats = pointbiserialr(df_pbr['rbc'], df_pbr['rc'])
stats

Point Biserial is showing a weak correlation between the dichotomous and the continuous variable

In [None]:
fig, ax = plt.subplots(1,2, figsize = (10,3))
sns.boxplot(x = df['hemo'], y = df['rbc'], ax = ax[0])
sns.boxplot(x = df['pcv'], y = df['rbc'], ax = ax[1])

There is also good separation between the two categories of rbc for hemo and pcv.

In [None]:
from scipy.stats import pointbiserialr
df_pbr = df[~((df.hemo.isnull()) | (df.rbc.isnull()))]
df_pbr.rbc = df_pbr.rbc.replace({'normal' : 0, 'abnormal' : 1})
stats = pointbiserialr(df_pbr['hemo'], df_pbr['rbc'])
stats

In [None]:
from scipy.stats import pointbiserialr
df_pbr = df[~((df.pcv.isnull()) | (df.rbc.isnull()))]
df_pbr.rbc = df_pbr.rbc.replace({'normal' : 0, 'abnormal' : 1})
stats = pointbiserialr(df_pbr['rbc'], df_pbr['pcv'])
stats

Seeing from above graphs and Correlation, it would be better if we use pcv and hemo values for filling of rbc values

Conditions , hemo > 12 and pcv > 38, rbc takes normal value. And for hemo < 12 and pcv < 38, it takes abnormal value.

In [None]:
df[['rbc','hemo','pcv']][df.rbc.isnull()].isnull().all(axis = 1).sum()

In [None]:
df_new = df[['rbc','hemo','pcv']].copy()
# Flag 1 when the value is at or below the threshold (or missing), 0 when above it
df_new['pcv_1'] = np.where(df.pcv > 38, 0, 1)
df_new['hemo_1'] = np.where(df.hemo > 12, 0, 1)
pd.crosstab(df_new.pcv_1, df_new.hemo_1, values = df_new.rbc, aggfunc = 'count')

In [None]:
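# Fill missing rbc from hemo: below 12 is treated as abnormal, otherwise normal
rbc_missing = df.rbc.isnull()
df.loc[rbc_missing, 'rbc'] = np.where(df.loc[rbc_missing, 'hemo'] < 12, 'abnormal', 'normal')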

In [None]:
# Spearman Correlation 
plt.figure(figsize = (10,10))
sns.heatmap(df.select_dtypes(exclude = 'object').corr(method = 'spearman'), annot = True)

Imputing Bgr

In [None]:
sns.boxplot(x = 'su', y = 'bgr', data = df)

Using su(sugar) for imputing bgr value

In [None]:
# Creating train and test split for Bgr variable
x_train = df[['su','al','sc']][~df.bgr.isnull()]
y_train = df['bgr'][~df.bgr.isnull()]
x_test = df[['su','al','sc']][df.bgr.isnull()]
y_test = df['bgr'][df.bgr.isnull()]

In [None]:
# Running Random Forest Regressor Model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 50, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
plt.subplots(1,1)
sns.distplot(pd.Series(model.predict(x_test)), color = 'Yellow')
sns.distplot(df.bgr, color = 'Orange')

In [None]:
# Imputing the values
a = pd.Series(model.predict(x_test))
a.index = df[df.bgr.isnull()].index
df_joint = pd.DataFrame(a, columns= ['bgr'])
df.loc[df.bgr.isnull(), 'bgr'] = df_joint.loc[:,'bgr']

Imputing Values of ba

In [None]:
df[df.ba.isnull()]

All four cases have good health stats such as good appetite, normal potassium and sodium, and no heart disease. So it is better to fill bacteria as notpresent, since the presence of bacteria is a symptom of chronic kidney disease.

In [None]:
df['classification'][df.ba == 'present'].value_counts()

In [None]:
df.loc[df.ba.isnull(),'ba'] = 'notpresent'

Imputing bu (blood urea) values

bu has high correlation with sc

In [None]:
# Creating train and test split for bu variable
x_train = df[['pcv','al','sc']][~df.bu.isnull()]
y_train = df['bu'][~df.bu.isnull()]
x_test = df[['pcv','al','sc']][df.bu.isnull()]
y_test = df['bu'][df.bu.isnull()]

In [None]:
# Running Random Forest Regressor Model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 50, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
# Imputing the values
a = pd.Series(model.predict(x_test))
a.index = df[df.bu.isnull()].index
df_joint = pd.DataFrame(a, columns= ['bu'])
df.loc[df.bu.isnull(), 'bu'] = df_joint.loc[:,'bu']

Imputing Potassium

This variable does not have a high correlation with any other variable. So we impute it with a KNN Regressor.

Encoding df Categorical Variable

In [None]:
df.rbc = df.rbc.replace({'normal':0, 'abnormal':1})
df.pc = df.pc.replace({'normal':0, 'abnormal' : 1})
df.pcc = df.pcc.replace({'notpresent':0, 'present' : 1})
df.ba = df.ba.replace({'notpresent':0, 'present' : 1})
df.htn = df.htn.replace({'no':0,'yes' : 1})
df.dm = df.dm.replace({'no': 0,'yes' : 1})
df.cad = df.cad.replace({'no':0,'yes' : 1})
df.appet = df.appet.replace({'good':0,'poor' : 1})
df.pe = df.pe.replace({'no':0,'yes' : 1})
df.ane = df.ane.replace({'no':0,'yes' : 1})
df.classification = df.classification.replace({'notckd':0,'ckd' : 1})

In [None]:
# Creating train and test split for pot variable
x_train = df.drop(['sod','wc','pot'], axis = 1)[~df.pot.isnull()]
y_train = df['pot'][~df.pot.isnull()]
x_test = df.drop(['sod','wc','pot'], axis = 1)[df.pot.isnull()]
y_test = df['pot'][df.pot.isnull()]

In [None]:
y_train.shape

In [None]:
# Imputation using KNN Regressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
# Fit the scaler on the training rows only and reuse it for the rows to impute
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
score = []
for k in range(1,5,1):
  model2 = KNeighborsRegressor(n_neighbors = k)
  model2.fit(x_train_scaled, y_train)
  np.sqrt(mean_squared_error(y_train, model2.predict(x_train_scaled)))
  score.append(r2_score(y_train, model2.predict(x_train_scaled)))


plt.plot(range(1,5,1), score)
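
The scores plotted above are measured on the training rows. With k = 1 every training row is its own nearest neighbour, so the training R2 is 1 unless rows with identical features have different targets, and the curve always favours small k. A sketch of choosing k by cross-validation instead, on synthetic stand-in data (x_known, y_known and k_grid are hypothetical names):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train (assumption, not the CKD data)
rng = np.random.default_rng(0)
x_known = rng.normal(loc=50, scale=10, size=(120, 4))
y_known = 0.05 * x_known[:, 0] + rng.normal(scale=0.3, size=120)

# Scaling happens inside the pipeline, so each fold is scaled with its own training rows
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
k_grid = [1, 2, 3, 4, 5, 7, 9]
search = GridSearchCV(knn_pipe, {"kneighborsregressor__n_neighbors": k_grid},
                      cv=5, scoring="r2")
search.fit(x_known, y_known)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

After fitting, search.predict can be called on the rows to impute; it uses the pipeline refitted with the best k.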
  

In [None]:
model2 = KNeighborsRegressor(n_neighbors = 2)
# Fit the scaler on the training rows only and reuse it for the rows to impute
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
model2.fit(x_train_scaled, y_train)
a = pd.Series(model2.predict(x_test_scaled))
a.index = df[df.pot.isnull()].index
df_joint = pd.DataFrame(a, columns= ['pot'])
df.loc[df.pot.isnull(), 'pot'] = df_joint.loc[:,'pot']

Imputing WC (White Blood Cells) - Again no Significant Correlation with other variables

In [None]:
# Creating train and test split for wc variable
x_train = df.drop(['sod','wc'], axis = 1)[~df.wc.isnull()]
y_train = df['wc'][~df.wc.isnull()]
x_test = df.drop(['sod','wc'], axis = 1)[df.wc.isnull()]
y_test = df['wc'][df.wc.isnull()]

In [None]:
# Imputation using KNN Regressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
# Fit the scaler on the training rows only and reuse it for the rows to impute
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
score = []
for k in range(1,5,1):
  model2 = KNeighborsRegressor(n_neighbors = k)
  model2.fit(x_train_scaled, y_train)
  np.sqrt(mean_squared_error(y_train, model2.predict(x_train_scaled)))
  score.append(r2_score(y_train, model2.predict(x_train_scaled)))


plt.plot(range(1,5,1), score)
  

In [None]:
# Running Random Forest Regressor Model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 50, random_state = 1)
model = rf.fit(x_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
r2_score(y_train, model.predict(x_train))

In [None]:
a = pd.Series(model.predict(x_test))
a.index = df[df.wc.isnull()].index
df_joint = pd.DataFrame(a, columns= ['wc'])
df.loc[df.wc.isnull(), 'wc'] = df_joint.loc[:,'wc']

Imputing sod value using KNN Imputer

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors = 3)
df_filled = imputer.fit_transform(df)
df_filled = pd.DataFrame(df_filled, columns = df.columns)
df = df_filled.copy()
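
KNNImputer measures distances on the raw column values, so in the cell above columns with large numbers (wc, and also id) dominate the choice of neighbours. One possible variation is to standardize first, impute, and transform back. A sketch on a made-up toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values): wc is in the thousands, sod around 140
toy = pd.DataFrame({
    "wc":   [7800.0, 6900.0, 9600.0, 12100.0, 8400.0, 7300.0],
    "hemo": [15.4, 11.3, 9.6, 11.2, 14.8, 12.6],
    "sod":  [141.0, 137.0, np.nan, 135.0, 142.0, np.nan],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)   # NaNs are ignored in fit and kept in transform
filled_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)

# Back to the original units
filled = pd.DataFrame(scaler.inverse_transform(filled_scaled),
                      columns=toy.columns, index=toy.index)
print(filled.round(1))
```

Whether this changes the imputed sod values much on the real data would need to be checked; the sketch only shows the mechanics.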

In [None]:
df.isnull().sum()

### Outliers Treatment

The id column is irrelevant.

Age has outliers but no unnatural values, so we will not be treating its outliers. We could bin it so that the effect of the age outliers is reduced, but for the time being we will leave it as it is, since there are no large outliers.

sg, al and su take discrete values, so they do not have any outliers. They only have a count/frequency per level, which does not signify any number range.



In [None]:
# EDA Numeric Variable

for col in df.columns:
  fig, ax = plt.subplots(1,1, figsize = (5,3))
  sns.boxplot(y = df[col], ax = ax)

Treating the outliers of the bp variable. They are actually true values, so it is better to bin/discretize bp to show different levels.

Using DecisionTreeClassifier as a discretizer, as it has these benefits -
- Creating monotonic relation with target variable
- Decreases Entropy within groups
- Treats Outliers

In [None]:
# from feature_engine.discretisers import DecisionTreeDiscretiser
# from sklearn.model_selection import train_test_split
# disc = DecisionTreeDiscretiser(cv = 5, scoring = 'accuracy',
#                                       variables = ['bp'], param_grid = {'max_depth' : [1,2,3]}
#                                       , regression = False, random_state = 1, )

# X_train, X_test, y_train, y_test =  train_test_split(
#             df.drop(['id','classification'], axis = 1),
#             df['classification'], test_size=0.3, random_state=0)

# disc.fit(X_train, y_train)
# df.disc_bp = disc.transform(df['bp'])

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =  train_test_split(
            df[['bp']], df['classification'], test_size=0.3, random_state=0)

tree_model = DecisionTreeClassifier(max_depth=2)
tree_model.fit(X_train, y_train)
X_train['bp_tree']=tree_model.predict_proba(X_train)[:,1] 
X_train.head(10)

In [None]:
df['bp_o'] = tree_model.predict_proba(df[['bp']])[:,1]
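
With max_depth = 2 the tree has at most four leaves, so bp_o can take at most four distinct values, one per bin. To see where the bins start and end, the cut points can be read from the fitted tree. A sketch on synthetic data (bp_demo, target_demo and tree_demo are hypothetical stand-ins):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for bp and the target (assumption, not the CKD data)
rng = np.random.default_rng(0)
bp_demo = rng.choice([60, 70, 80, 90, 100, 110], size=300).reshape(-1, 1)
target_demo = (bp_demo.ravel() + rng.normal(scale=15, size=300) > 85).astype(int)

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(bp_demo, target_demo)

# Internal nodes carry a feature index >= 0; leaves are marked with a negative value
is_split = tree_demo.tree_.feature >= 0
cut_points = np.sort(tree_demo.tree_.threshold[is_split])

# Each distinct predicted probability corresponds to one bin of bp
bin_values = np.unique(tree_demo.predict_proba(bp_demo)[:, 1])
print(cut_points, bin_values)
```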

Treating Outliers of bgr and bu

We will use a transformation here, as these variables have a continuous stream of outliers. Capping should not be used, as it would restrict the natural values of the variables.

In [None]:
df['bgr_o'] = np.log(df['bgr']) 
sns.boxplot(np.log(df['bgr']))

In [None]:
df['bu_o'] = np.log(df['bu'])
sns.boxplot(np.log(df['bu']))
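
Besides the boxplots, the effect of the log transform can be checked by counting the points outside the boxplot whiskers before and after. A sketch on a synthetic right-skewed sample (count_iqr_outliers is a hypothetical helper and the sample is not the CKD data):

```python
import numpy as np

def count_iqr_outliers(values):
    """Number of points outside the 1.5 * IQR whiskers of a boxplot."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return int(((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).sum())

# Synthetic right-skewed sample standing in for bgr or bu (assumption)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=4.8, sigma=0.5, size=1000)

before = count_iqr_outliers(skewed)
after = count_iqr_outliers(np.log(skewed))
print(before, after)
```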

In [None]:
# df['sc'].idxmax()
# df_sc = df['sc'].copy()
# df_sc[df_sc.index == 253]
# iqr = (1.5*(df_sc.quantile(0.75) - df_sc.quantile(0.25)) + df_sc.quantile(0.75))

Using the KMeans discretizer for sc, as this way most of the outliers can be grouped. The most abnormal values will be grouped in one bin, and the outliers in between in another.

In [None]:
sns.boxplot(df['sc'])

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
discretizer.fit(df[['sc']])
discretizer.transform(df[['sc']])
pd.concat((pd.DataFrame(discretizer.transform(df[['sc']]), index = df.index, columns = ['sc_disc']), df['sc']), 
          axis = 1).groupby(['sc_disc'])['sc'].agg(['min','max'])
# pd.concat(pd.Series(discretizer.transform(df[['sc']])), df['sc'])

We can see how well our discretizer has performed.

In [None]:
df['sc_o'] = discretizer.transform(df[['sc']])

We will perform the same KMeans discretization for sodium, as it has a distribution similar to sc.

In [None]:
disodretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
disodretizer.fit(df[['sod']])
disodretizer.transform(df[['sod']])
pd.concat((pd.DataFrame(disodretizer.transform(df[['sod']]), index = df.index, columns = ['sod_disod']), df['sod']), 
          axis = 1).groupby(['sod_disod'])['sod'].agg(['min','max'])
df['sod_o'] = disodretizer.transform(df[['sod']])

hemo and pcv each have a single outlier, and those values are not very abnormal or deviant. So we will not be treating these variables, as the value is just above the upper outlier bound.

Potassium has 3 outliers - we will be capping these values

In [None]:
ub_pot = (1.5*(df['pot'].quantile(0.75) - df['pot'].quantile(0.25)) + df['pot'].quantile(0.75))
sns.boxplot(np.where(df['pot'] > ub_pot, ub_pot, df['pot']))
df['pot_o'] =  np.where(df['pot'] > ub_pot, ub_pot, df['pot'])

Treating the outliers of wc through transformation.
The Box-Cox transformation will help in making the distribution more normal and also in decreasing the effect of outliers.

In [None]:
from scipy import stats
sns.boxplot(stats.boxcox(df['wc'])[0])
df['wc_o'] = stats.boxcox(df['wc'])[0]
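
Two things are worth keeping in mind with stats.boxcox: the input must be strictly positive, and the second return value is the fitted lambda, which is needed to map transformed values back to the original scale. A sketch on synthetic data (wc_demo is a hypothetical stand-in for wc):

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic right-skewed, strictly positive sample standing in for wc (assumption)
rng = np.random.default_rng(0)
wc_demo = rng.lognormal(mean=9.0, sigma=0.3, size=500)

# Box-Cox is only defined for strictly positive data
assert (wc_demo > 0).all()

transformed, fitted_lambda = stats.boxcox(wc_demo)

# Keeping lambda allows the transform to be undone later
restored = inv_boxcox(transformed, fitted_lambda)
print(round(fitted_lambda, 3), np.allclose(restored, wc_demo))
```

The same lambda would also have to be reused if new observations are transformed later, rather than fitting a new one.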

### Feature Selection and Engineering

We know that some variables, like hemo, pcv and rc, are highly correlated. So we can
try to perform PCA on these. All three are also continuous.
We have also treated their outliers. So we should look at the results and then decide
whether to proceed with the PCA results or not.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
df_pca = df[['rc','hemo','pcv']]
scale = StandardScaler()
df_pca_scaled = scale.fit_transform(df_pca)
pca = PCA(n_components = 3)
array_pca = pca.fit_transform(df_pca_scaled)
df_pca_done = pd.DataFrame(data = array_pca, columns = ['pc1','pc2','pc3'])
pca.explained_variance_ratio_

In [None]:
plt.plot(range(3),pca.explained_variance_ratio_)

We can see that a single principal component captures about 90% of the variance. However, we also need to see how significant these variables are. This will help us decide whether to keep the original variables or the reduced principal component.
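
The number of components needed for a given share of variance can also be read off the cumulative explained variance. A sketch on three synthetic, highly correlated columns (demo is a hypothetical stand-in for rc, hemo and pcv):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three noisy copies of one signal, standing in for rc, hemo and pcv (assumption)
rng = np.random.default_rng(0)
base = rng.normal(size=300)
demo = np.column_stack([base + rng.normal(scale=0.3, size=300) for _ in range(3)])

pca_demo = PCA(n_components=3).fit(StandardScaler().fit_transform(demo))
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_keep)
```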

In [None]:
df.columns

Above we see that we have all the variables, with Outlier Treatment and also Without outlier Treatment. This will help us to see whether our treatment helped to increase importance of that variable in prediction.

  Feature Importance as per Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 50, random_state = 1)
rf.fit(df.drop(['id','classification'], axis = 1), df.classification)
imp = pd.Series(rf.feature_importances_)
imp.index = df.drop(['id','classification'], axis = 1).columns
imp.sort_values(ascending = False)

From above we can see that the importance of most of the variables has increased after transformation (var_o), except for sc. We need to work on this, as we do not want to reduce the importance of any variable.

Among the correlated variables, we can see that hemo is the most important, followed by pcv and rc. So we can drop pcv and rc to reduce collinearity.

Checking feature importance by Lasso-style (L1-penalised) Logistic Regression

In [None]:
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['classification', 'id'], axis=1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

In [None]:
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver = 'liblinear'))
sel_.fit(X_train, y_train)

In [None]:
selected_feat = df.drop(['id','classification'], axis = 1).columns[(sel_.get_support())]
selected_feat

In [None]:
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrank to zero: {}'.format(
      np.sum(sel_.estimator_.coef_ == 0)))

The L1 penalty shrank the coefficients of 19 features to zero. That is a significant reduction. But we also need to see how well the remaining features perform in terms of prediction results.

Feature Selection by Shuffling method

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['classification', 'id'], axis=1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

In [None]:
# Building a Basic Random Forest Model to record base ROC-AUC score
from sklearn.metrics import roc_auc_score
rf = RandomForestClassifier(
    n_estimators=100, max_depth = 2, random_state=1)
 
rf.fit(X_train, y_train)
 
# print roc-auc in train and testing sets
print('train auc score: ',
      roc_auc_score(y_train, (rf.predict_proba(X_train)[:, 1])))
print('test auc score: ',
      roc_auc_score(y_test, (rf.predict_proba(X_test)[:, 1])))

In [None]:
from sklearn.metrics import confusion_matrix
sns.heatmap(confusion_matrix(y_test, rf.predict(X_test)), annot = True)

In [None]:
# overall train roc-auc: using all the features
train_auc = roc_auc_score(y_train, (rf.predict_proba(X_train)[:, 1]))
 
# dictionary to capture the features and the drop in auc that they
# cause when shuffled
feature_dict = {}
 
# selection  logic
for feature in X_train.columns:
    X_train_c = X_train.copy()
    
    # shuffle individual feature
    # to_numpy() drops the index, so the shuffled values are assigned by position
    X_train_c[feature] = X_train_c[feature].sample(frac=1, random_state = 1).to_numpy()
    
    # make prediction with shuffled feature and calculate roc-auc
    shuff_auc = roc_auc_score(y_train,
                              (rf.predict_proba(X_train_c.fillna(0)))[:, 1])
    
    # save the drop in roc-auc
    feature_dict[feature] = (train_auc - shuff_auc)

In [None]:
feature_importance = pd.Series(feature_dict).reset_index()
feature_importance.columns = ['feature', 'auc_drop']
feature_importance.auc_drop = np.round(feature_importance.auc_drop, 5)
feature_importance

Our shuffling method could not prove anything conclusive.
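
A single shuffle scored on the training set can be noisy. scikit-learn ships the same idea as permutation_importance, which repeats the shuffle several times and can be scored on the held-out set. A sketch on synthetic data where only the first feature matters (all names here are hypothetical stand-ins):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the target depends on feature 0 only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=1).fit(X_tr, y_tr)

# Mean drop in ROC-AUC over 20 shuffles of each feature, measured on the test split
result = permutation_importance(rf_demo, X_te, y_te, scoring="roc_auc",
                                n_repeats=20, random_state=1)
print(result.importances_mean.round(3))
```

result.importances_std gives the spread over the repeats, which helps to judge whether a small drop is just noise.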

### Logistic Regression Model

Building model using the Variables we got after treating Missing Values and Outliers

In [None]:
# Dividing into Training and Test set 
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['wc','bgr','bu','sod','sc','pot','bp','classification'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

We will not be standardizing here, as we won't be able to make sense of the coefficients anyway, due to the presence of multicollinearity.

In [None]:
from sklearn.metrics import classification_report
LogR = LogisticRegression(random_state = 1, 
                            verbose = 1)
LogR.fit(X_train, y_train)
y_train_pred = LogR.predict(X_train)
y_test_pred = LogR.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

In [None]:
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots(1,2, figsize = (10,4))
sns.heatmap(confusion_matrix(y_train, y_train_pred), annot = True, fmt = 'd' ,ax = ax[0])
ax[0].set_title('Train Data')
sns.heatmap(confusion_matrix(y_test, y_test_pred), annot = True, fmt = 'd', ax = ax[1])
ax[1].set_title('Test Data')

Next we will use Logistic Regression to understand the significance of the coefficients.

Before building that model, we need to complete some required checks, especially for multicollinearity.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
X = df.drop(['classification','sc','bp','bgr','bu','pot','sod','wc','id'], axis = 1)
calc_vif(X)

Many variables show high collinearity, but 5 variables (sg, bgr_o, wc_o, hemo, pcv) have especially high VIF values.

Dropping the sg variable, as it shows the highest value, and running the VIF test again.

In [None]:
X = df.drop(['classification','sc','bp','bgr','bu','pot','sod','wc','id','sg'], axis = 1)
calc_vif(X)

Many variables still have very high VIF values. Dropping wc_o (as it has the highest VIF), and also pcv and rc, which we have already seen are highly correlated with hemo.


In [None]:
X = df.drop(['classification','sc','bp','bgr','bu','pot','sod',
             'wc','id','rc','sg','pcv','wc_o', 'bgr_o', 'pot_o','sod_o','bu_o','bp_o'], axis = 1)
calc_vif(X)

I tried dropping variables iteratively and checking how much the VIF values decreased. Going iteratively and dropping the variable with the highest VIF each time, many variables have to be dropped.
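
The iterative procedure described above can be sketched as a small loop. This is a minimal illustration, not the exact steps used in this notebook: it computes VIF directly from its definition, VIF_j = 1 / (1 - R_j^2), using NumPy least squares with an intercept, whereas statsmodels' `variance_inflation_factor` uses the design matrix exactly as given, so the numbers can differ from `calc_vif`. The helper names `vif_table` and `drop_high_vif`, the threshold of 10 (a common rule of thumb) and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns (with an intercept)."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)

def drop_high_vif(X, threshold=10.0):
    """Drop the column with the largest VIF until all are below threshold."""
    X = X.copy()
    dropped = []
    while X.shape[1] > 1:
        vifs = vif_table(X)
        if vifs.max() < threshold:
            break
        worst = vifs.idxmax()
        dropped.append(worst)
        X = X.drop(columns=worst)
    return X, dropped

# Synthetic example (assumption): 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a,
                     "b": a + rng.normal(scale=0.05, size=200),
                     "c": rng.normal(size=200)})
reduced, dropped = drop_high_vif(demo, threshold=10.0)
print(dropped, list(reduced.columns))
```

Recomputing the VIFs after every single drop matters, because removing one variable of a correlated pair usually brings the other one back to an acceptable value.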

Variables dropped - 'rc','sg','pcv','wc_o', 'bgr_o', 'pot_o','sod_o','bu_o','bp_o'

Building the Logistic Regression model with scaled variables to understand the significance of the variables


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['classification','sc','bp','bgr','bu','pot','sod',
             'wc','id','rc','sg','pcv','wc_o', 'bgr_o', 'pot_o','sod_o','bu_o','bp_o'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape


In [None]:
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)
# reuse the training-set mean and variance on the test set instead of refitting
X_test_scaled = scale.transform(X_test)

In [None]:
from sklearn.metrics import classification_report
LogR = LogisticRegression(random_state = 1, 
                            verbose = 1)
LogR.fit(X_train_scaled, y_train)
y_train_pred = LogR.predict(X_train_scaled)
y_test_pred = LogR.predict(X_test_scaled)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

In [None]:
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots(1,2, figsize = (10,4))
sns.heatmap(confusion_matrix(y_train, y_train_pred), annot = True, fmt = 'd' ,ax = ax[0])
ax[0].set_title('Train Data')
sns.heatmap(confusion_matrix(y_test, y_test_pred), annot = True, fmt = 'd', ax = ax[1])
ax[1].set_title('Test Data')

The original Logistic Regression gave us an accuracy of 1, and even after reducing the multicollinearity the accuracy has not dropped substantially.
We are still getting an accuracy of 0.95 on the test set.
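
The scaling and fitting steps above can also be wrapped in a scikit-learn pipeline, which guarantees that the scaler is fitted on the training rows only and then reused unchanged on the test rows. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption); the names `pipe`, `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# fit() learns the scaling statistics from Xtr only; score() on Xte reuses them.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipe.fit(Xtr, ytr)

train_acc = pipe.score(Xtr, ytr)
test_acc = pipe.score(Xte, yte)
print(train_acc, test_acc)

# The stored mean is the training mean, not the test mean.
print(np.allclose(pipe.named_steps["standardscaler"].mean_, Xtr.mean(axis=0)))
```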

In [None]:
feature_coef = pd.Series(np.exp(LogR.coef_[0]))
feature_coef.index = X_train.columns
feature_coef.sort_values(ascending = False)

The exponentiated coefficients (odds ratios) give us an indication of the predictive power of each variable.
al, dm and htn are the most important predictors for class = 1 (Chronic Kidney Disease). They shift the odds in favour of class 1.
An increase in hemo shifts the prediction towards class = 0 (No Chronic Kidney Disease).
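
To make the reading of `np.exp(coef)` concrete: because the features were standardized, each exponentiated coefficient is the factor by which the odds of class 1 are multiplied when that feature rises by one standard deviation, with the others held fixed. The coefficient values below are purely illustrative assumptions, not results from this model.

```python
import numpy as np

# Hypothetical coefficients on standardized features (illustrative values only).
coefs = {"feature_up": 0.7, "feature_neutral": 0.0, "feature_down": -0.7}

# exp(coef) > 1 pushes the odds towards class 1, < 1 towards class 0.
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

Here a coefficient of 0.7 roughly doubles the odds (about 2.01), a coefficient of -0.7 roughly halves them (about 0.50), and a coefficient of 0 leaves them unchanged.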

In [None]:
# # Dividing into Training and Test set 
# X_train, X_test, y_train, y_test = train_test_split(df.drop(['id','wc','bgr','bu','sod','sc','pot','bp','classification'], axis = 1),
#     df['classification'],
#     test_size=0.3,
#     random_state=0)
# X_train.shape, X_test.shape
# # Scaling before Ridge Regression
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# from sklearn.metrics import classification_report
# modelRR = LogisticRegression(penalty = 'l2', C = 1, random_state = 1, 
#                              solver = 'liblinear', verbose = 1)
# modelRR.fit(X_train_scaled, y_train)
# y_train_pred = modelRR.predict(X_train_scaled)
# y_test_pred = modelRR.predict(X_test_scaled)
# print('\n')
# print(classification_report(y_train, y_train_pred))
# print(classification_report(y_test, y_test_pred))

LASSO REGRESSION MODEL

Building the model using Lasso (L1-penalized) Logistic Regression, as it helps to understand coefficient significance and also helps in dropping less significant variables

In [None]:
# Dividing into Training and Test set 
X_train, X_test, y_train, y_test = train_test_split(df.drop(['classification'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)

# Scaling before Lasso Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# reuse the training-set statistics on the test set instead of refitting
X_test_scaled = scaler.transform(X_test)

In [None]:
from sklearn.metrics import classification_report
modelLR = LogisticRegression(penalty = 'l1', C = 1, random_state = 1, 
                             solver = 'liblinear', verbose = 1)
modelLR.fit(X_train_scaled, y_train)
y_train_pred = modelLR.predict(X_train_scaled)
y_test_pred = modelLR.predict(X_test_scaled)
print('\n')
print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

In [None]:
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots(1,2, figsize = (10,4))
sns.heatmap(confusion_matrix(y_train, y_train_pred), annot = True, fmt = 'd' ,ax = ax[0])
ax[0].set_title('Train Data')
sns.heatmap(confusion_matrix(y_test, y_test_pred), annot = True, fmt = 'd', ax = ax[1])
ax[1].set_title('Test Data')

In [None]:
feature_coef = pd.Series(modelLR.coef_[0])
feature_coef.index = X_train.columns
feature_coef.sort_values(ascending = False)

We see that the coefficients of many variables shrink to 0. We can remove these variables if we want to decrease the complexity of the model, at the cost of a higher bias error.
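
How many coefficients are driven to zero depends on the regularization strength `C` (smaller `C` means a stronger penalty). The sketch below, on synthetic stand-in data (an assumption), counts the non-zero coefficients for a few values of `C`; the grid of values and the `_demo` names are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 20 features, only 4 informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero_counts = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear", random_state=1)
    model.fit(X_demo, y_demo)
    # number of features the L1 penalty has kept in the model
    nonzero_counts[C] = int(np.count_nonzero(model.coef_[0]))
print(nonzero_counts)
```

Comparing test scores across the same grid is one way to see how much sparsity the model can tolerate before the added bias starts to hurt.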

In [None]:
feature_coef = pd.Series(modelLR.coef_[0])
feature_coef.index = X_train.columns
np.exp(feature_coef[feature_coef != 0]).sort_values(ascending = False)

Most Important Features:
- Class 1 Prediction - al, dm, bgr_o
- Class 0 Prediction - sg, hemo

Before building the Decision Tree model, we will drop the original versions of the variables that we transformed to make them more robust.
For sc, feature selection showed that sc was more important than sc_o, so we will keep the original variable here.

In [None]:
df = df.drop(['wc','bgr','bu','sod','sc_o','pot','bp','id'], axis = 1)

# Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(df.drop(['classification'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape


In [None]:
tree = DecisionTreeClassifier(random_state = 1)
tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

Feature Importance

In [None]:
imp = pd.Series(data = tree.feature_importances_, index = X_train.columns)
imp.sort_values( ascending = False)

Quite amazing. Only 3 variables contribute to the splits, which helps explain how the tree reaches perfect accuracy on both the train and test data.

Building Decision Tree Model keeping only the three variables - hemo, sg, su

In [None]:
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(df.loc[:,['sg','su','hemo']],
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

In [None]:
tree = DecisionTreeClassifier(random_state = 1)
tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

As assumed, a perfect score.
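
When a tree relies on only a few variables, printing its rules shows exactly which thresholds do the work. A minimal sketch with scikit-learn's `export_text` follows; it uses synthetic data and hypothetical feature names (`x0`, `x1`, `x2`), so the printed rules are not those of the kidney disease tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 3 features, 2 of them informative.
X_demo, y_demo = make_classification(n_samples=200, n_features=3, n_informative=2,
                                     n_redundant=0, random_state=0)
names = ["x0", "x1", "x2"]

# A shallow tree keeps the printed rules short.
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_demo, y_demo)

# Text rendering of the split thresholds and leaf classes.
rules = export_text(tree_demo, feature_names=names)
print(rules)
```

Applied to the fitted `tree` with `feature_names=list(X_train.columns)`, the same call would list the actual splits on the three variables.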

# Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(df.drop(['classification'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

In [None]:
rf = RandomForestClassifier(n_estimators = 100, random_state = 1)
rf.fit(X_train, y_train)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

# Naive Bayes 

In [None]:
from sklearn.naive_bayes import GaussianNB
X_train, X_test, y_train, y_test = train_test_split(df.drop(['classification'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

In [None]:
NB = GaussianNB()
NB.fit(X_train, y_train)
y_train_pred = NB.predict(X_train)
y_test_pred = NB.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

# Support Vector Machine

In [None]:
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(df.drop(['classification'], axis = 1),
    df['classification'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape


In [None]:
classifier = SVC(kernel = 'linear', random_state = 1)
classifier.fit(X_train, y_train)
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

The linear SVM separates the data very well.

In [None]:
classifier = SVC(kernel = 'poly',degree = 2, random_state = 1)
classifier.fit(X_train, y_train)
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

In [None]:
classifier = SVC(kernel = 'rbf', random_state = 1)
classifier.fit(X_train, y_train)
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))

We can see that the rbf and polynomial SVMs perform worse than the linear SVM.
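
One possible factor, not verified on this dataset, is that the SVMs above are fitted on unscaled features, and the rbf and polynomial kernels are sensitive to feature scale. The sketch below compares each kernel with and without a `StandardScaler` on synthetic stand-in data (an assumption); whether scaling closes the gap for the kidney disease data would need to be checked separately. The name `svm_scores` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); one feature is put on a much larger
# scale to mimic unscaled lab measurements.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 100.0
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

svm_scores = {}
for kernel in ["poly", "rbf"]:
    raw = SVC(kernel=kernel, degree=2, random_state=1).fit(Xtr, ytr)
    scaled = make_pipeline(StandardScaler(),
                           SVC(kernel=kernel, degree=2, random_state=1)).fit(Xtr, ytr)
    # (test accuracy without scaling, test accuracy with scaling)
    svm_scores[kernel] = (raw.score(Xte, yte), scaled.score(Xte, yte))
print(svm_scores)
```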