### Chronic Kidney Disease

Abstract: This dataset can be used to predict chronic kidney disease. It was collected at a hospital over a period of nearly 2 months.


In [1]:
import pandas as pd
import numpy as np
KD = pd.read_csv('kidneyChronic.csv', na_values=['?', '\t?'])
numeric_datatype = ['age', 'bp', 'bgr', 'bu', 'sc',
                    'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc']
nominal_datatype = ['sg', 'al', 'su', 'rbc', 'pc', 'pcc',
                    'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']

In [2]:
KD.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd


### Type of each column in the dataset
The output below lists each column of the dataset with its type and non-null count. The non-null counts already show which columns have missing values.

In [3]:
KD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
age      391 non-null float64
bp       388 non-null float64
sg       353 non-null float64
al       354 non-null float64
su       351 non-null float64
rbc      248 non-null object
pc       335 non-null object
pcc      396 non-null object
ba       396 non-null object
bgr      356 non-null float64
bu       381 non-null float64
sc       383 non-null float64
sod      313 non-null float64
pot      312 non-null float64
hemo     348 non-null float64
pcv      329 non-null float64
wbcc     294 non-null float64
rbcc     269 non-null float64
htn      398 non-null object
dm       398 non-null object
cad      398 non-null object
appet    399 non-null object
pe       399 non-null object
ane      399 non-null object
class    400 non-null object
dtypes: float64(14), object(11)
memory usage: 78.2+ KB


### Missing values
Every column except class contains missing values, and these need to be filled before modelling. We first look at the dispersion of the numeric data with box plots to understand its variation, and then decide how to fill the missing values.

In [4]:
KD.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
from ipywidgets import interact

### Boxplot

The box plots show that the mean and median are close for most columns. The clearest exception is sc, which contains many outliers (mean 3.07, median 1.3, maximum 76.0). Based on some background reading (https://www.researchgate.net/publication/310952883_Missing_Data_Analysis_Using_Multiple_Imputation_in_Relation_to_Parkinson's_Disease, https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779), filling the missing values with the column mean is time efficient and gives fairly accurate results for a small dataset like this one.

In [6]:
def select_type(data):
    KD[[data]].boxplot(showmeans=True)
    plt.show()
    print(KD[data].describe())
interact(select_type, data=numeric_datatype)

interactive(children=(Dropdown(description='data', options=('age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hem…

<function __main__.select_type(data)>

In [7]:
KD[numeric_datatype].describe()

Unnamed: 0,age,bp,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc
count,391.0,388.0,356.0,381.0,383.0,313.0,312.0,348.0,329.0,294.0,269.0
mean,51.483376,76.469072,148.036517,57.425722,3.072454,137.528754,4.627244,12.526437,38.884498,8406.122449,4.707435
std,17.169714,13.683637,79.281714,50.503006,5.741126,10.408752,3.193904,2.912587,8.990105,2944.47419,1.025323
min,2.0,50.0,22.0,1.5,0.4,4.5,2.5,3.1,9.0,2200.0,2.1
25%,42.0,70.0,99.0,27.0,0.9,135.0,3.8,10.3,32.0,6500.0,3.9
50%,55.0,80.0,121.0,42.0,1.3,138.0,4.4,12.65,40.0,8000.0,4.8
75%,64.5,80.0,163.0,66.0,2.8,142.0,4.9,15.0,45.0,9800.0,5.4
max,90.0,180.0,490.0,391.0,76.0,163.0,47.0,17.8,54.0,26400.0,8.0


### Filling missing values for numeric datatype
As discussed above, we fill the missing values in the numeric columns with the column mean.

In [8]:
for numerical in numeric_datatype:
    print("Filling na with mean value for %s numerical data" % numerical)
    # Assign back rather than calling fillna(inplace=True) on a column selection
    KD[numerical] = KD[numerical].fillna(KD[numerical].mean())

Filling na with mean value for age numerical data
Filling na with mean value for bp numerical data
Filling na with mean value for bgr numerical data
Filling na with mean value for bu numerical data
Filling na with mean value for sc numerical data
Filling na with mean value for sod numerical data
Filling na with mean value for pot numerical data
Filling na with mean value for hemo numerical data
Filling na with mean value for pcv numerical data
Filling na with mean value for wbcc numerical data
Filling na with mean value for rbcc numerical data
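
As a minimal sketch of the trade-off discussed above, the toy example below fills a skewed column with the mean and with the median. The values are made up to mimic a column like sc with one large outlier; they are not taken from the dataset, and the names `toy`, `filled_mean` and `filled_median` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy column loosely mimicking a right-skewed lab value such as sc.
# These numbers are invented for illustration only.
toy = pd.DataFrame({"sc": [0.9, 1.1, 1.3, 2.8, 76.0, np.nan, np.nan]})

mean_value = toy["sc"].mean()      # NaN values are skipped by default
median_value = toy["sc"].median()  # much less affected by the 76.0 outlier

# fillna returns a new Series; the original column is left untouched here
filled_mean = toy["sc"].fillna(mean_value)
filled_median = toy["sc"].fillna(median_value)

print("value used for mean fill:  ", round(mean_value, 2))
print("value used for median fill:", median_value)
```

For a column with heavy outliers, the mean can sit far from the typical value, so the median may be worth comparing before settling on mean imputation.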


### Analyzing Nominal datatype
With the numeric data sorted out, we fill the missing entries in the nominal columns with the most frequent value. Before filling, we check which values each column contains.

In [9]:
for nominal in nominal_datatype:
    print(KD[nominal].value_counts(dropna=False))

1.020    106
1.010     84
1.025     81
1.015     75
NaN       47
1.005      7
Name: sg, dtype: int64
0.0    199
NaN     46
1.0     44
3.0     43
2.0     43
4.0     24
5.0      1
Name: al, dtype: int64
0.0    290
NaN     49
2.0     18
3.0     14
1.0     13
4.0     13
5.0      3
Name: su, dtype: int64
normal      201
NaN         152
abnormal     47
Name: rbc, dtype: int64
normal      259
abnormal     76
NaN          65
Name: pc, dtype: int64
notpresent    354
present        42
NaN             4
Name: pcc, dtype: int64
notpresent    374
present        22
NaN             4
Name: ba, dtype: int64
no     251
yes    147
NaN      2
Name: htn, dtype: int64
no       258
yes      134
\tno       3
\tyes      2
NaN        2
 yes       1
Name: dm, dtype: int64
no      362
yes      34
\tno      2
NaN       2
Name: cad, dtype: int64
good    317
poor     82
NaN       1
Name: appet, dtype: int64
no     323
yes     76
NaN      1
Name: pe, dtype: int64
no     339
yes     60
NaN      1
Name: ane, dtype: int64

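### Rectify Nominal Data type
The value counts above show stray tab and space characters in dm ('\tno', '\tyes', ' yes') and in cad ('\tno'); htn is clean. The cell below rectifies '\tno' and ' yes' in dm manually. It leaves '\tyes' in dm and '\tno' in cad in place, so each of those columns still carries a third category.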

In [10]:
KD['htn'] = KD['htn'].replace('\tno', 'no')
KD['htn'] = KD['htn'].replace('\tyes', 'yes')
KD['dm'] = KD['dm'].replace('\tno', 'no')
KD['dm'] = KD['dm'].replace(' yes', 'yes')
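
One way to avoid listing every bad value by hand is to strip whitespace from all text columns at once. The sketch below does this on a toy frame that reproduces the kinds of stray characters seen in dm and cad; the frame `toy` and the column list `text_columns` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same kinds of stray characters as dm and cad
toy = pd.DataFrame({
    "dm": ["no", "\tno", "yes", "\tyes", " yes", None],
    "cad": ["no", "\tno", "yes", "no", "no", None],
})

# Hypothetical list of text columns to clean
text_columns = ["dm", "cad"]

for col in text_columns:
    # str.strip removes leading/trailing whitespace, including tabs;
    # missing values stay missing
    toy[col] = toy[col].str.strip()

print(toy["dm"].value_counts(dropna=False))
print(toy["cad"].value_counts(dropna=False))
```

Applied to the real data, the same loop could run over `nominal_datatype`, restricted to the columns that hold text.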

### Filling missing value of nominal data type
To fill the missing values in the nominal columns we use the mode of each column. We chose this technique because it is simple and gives good results here.

In [11]:
for nominal in nominal_datatype:
    mode = KD[nominal].value_counts().index[0]
    print("Filling na with mode value %s for %s nominal data" % (mode, nominal))
    KD[nominal] = KD[nominal].fillna(mode)

Filling na with mode value 1.02 for sg nominal data
Filling na with mode value 0.0 for al nominal data
Filling na with mode value 0.0 for su nominal data
Filling na with mode value normal for rbc nominal data
Filling na with mode value normal for pc nominal data
Filling na with mode value notpresent for pcc nominal data
Filling na with mode value notpresent for ba nominal data
Filling na with mode value no for htn nominal data
Filling na with mode value no for dm nominal data
Filling na with mode value no for cad nominal data
Filling na with mode value good for appet nominal data
Filling na with mode value no for pe nominal data
Filling na with mode value no for ane nominal data
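
The mode fill can be sketched on a single toy column as follows. Here `Series.mode` is used in place of `value_counts().index[0]`; the values are invented and the names `toy`, `most_common` and `filled` are illustrative only.

```python
import pandas as pd

# Toy nominal column with missing entries (values invented for illustration)
toy = pd.Series(["normal", "abnormal", "normal", None, "normal", None], name="rbc")

# mode() returns a Series because ties are possible; take the first entry
most_common = toy.mode(dropna=True)[0]

# fillna returns a new Series with every missing entry set to the mode
filled = toy.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

Mode imputation pushes every missing entry into the majority class, which matters most for columns with many gaps such as rbc, where 152 of the 400 values are missing.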


### Filled missing values
All missing values are now filled, as the output below confirms.

In [12]:
KD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
age      400 non-null float64
bp       400 non-null float64
sg       400 non-null float64
al       400 non-null float64
su       400 non-null float64
rbc      400 non-null object
pc       400 non-null object
pcc      400 non-null object
ba       400 non-null object
bgr      400 non-null float64
bu       400 non-null float64
sc       400 non-null float64
sod      400 non-null float64
pot      400 non-null float64
hemo     400 non-null float64
pcv      400 non-null float64
wbcc     400 non-null float64
rbcc     400 non-null float64
htn      400 non-null object
dm       400 non-null object
cad      400 non-null object
appet    400 non-null object
pe       400 non-null object
ane      400 non-null object
class    400 non-null object
dtypes: float64(14), object(11)
memory usage: 78.2+ KB


In [13]:
KD.isnull().sum()

age      0
bp       0
sg       0
al       0
su       0
rbc      0
pc       0
pcc      0
ba       0
bgr      0
bu       0
sc       0
sod      0
pot      0
hemo     0
pcv      0
wbcc     0
rbcc     0
htn      0
dm       0
cad      0
appet    0
pe       0
ane      0
class    0
dtype: int64

# Separating feature variable and target variable

In [14]:
X = KD.drop('class', axis='columns')
Y = KD['class']

# Handling categorical columns in feature variable
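Here we convert the nominal columns into indicator (dummy) columns so that they can be fed to the classifiers. With drop_first=True, a two-valued column such as htn becomes a single column, htn_yes. cad and dm each become two columns (cad_no and cad_yes, dm_no and dm_yes) because an uncleaned tab-prefixed value remains in each of them as a third category.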

In [15]:
X = pd.get_dummies(X, drop_first=True)

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
age            400 non-null float64
bp             400 non-null float64
sg             400 non-null float64
al             400 non-null float64
su             400 non-null float64
bgr            400 non-null float64
bu             400 non-null float64
sc             400 non-null float64
sod            400 non-null float64
pot            400 non-null float64
hemo           400 non-null float64
pcv            400 non-null float64
wbcc           400 non-null float64
rbcc           400 non-null float64
rbc_normal     400 non-null uint8
pc_normal      400 non-null uint8
pcc_present    400 non-null uint8
ba_present     400 non-null uint8
htn_yes        400 non-null uint8
dm_no          400 non-null uint8
dm_yes         400 non-null uint8
cad_no         400 non-null uint8
cad_yes        400 non-null uint8
appet_poor     400 non-null uint8
pe_yes         400 non-null uint8
ane_yes        400 non-null uint8
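
The effect of drop_first=True, and of a leftover tab-prefixed value, can be seen on a small toy frame. This is a sketch with made-up rows; `toy`, `encoded`, `clean` and `encoded_clean` are illustrative names.

```python
import pandas as pd

# htn is clean, cad still contains one tab-prefixed value
toy = pd.DataFrame({
    "htn": ["yes", "no", "no", "yes"],
    "cad": ["no", "\tno", "yes", "no"],
})

# drop_first=True drops the first category (in sorted order) of each column
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))

# After stripping whitespace, cad has two categories and yields one column
clean = toy.assign(cad=toy["cad"].str.strip())
encoded_clean = pd.get_dummies(clean, drop_first=True)
print(list(encoded_clean.columns))
```

In the first encoding the tab-prefixed value sorts first and is the category that gets dropped, which is why both cad_no and cad_yes remain.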

# Handling missing value in target variable
The target class has no missing values.

In [17]:
Y.value_counts(dropna=False)

ckd       250
notckd    150
Name: class, dtype: int64

# Training and predicting 
We first use a random forest and then a decision tree classifier to classify and predict our dataset. Note that test_size=0.8 holds out 320 of the 400 rows for testing, so both models are trained on only 80 rows.
### Random Forest Classifier

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.8, random_state=42)
rf = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=0.2, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

In [19]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

         ckd       0.99      0.95      0.97       205
      notckd       0.92      0.98      0.95       115

   micro avg       0.96      0.96      0.96       320
   macro avg       0.95      0.97      0.96       320
weighted avg       0.96      0.96      0.96       320



In [20]:
print("Accuracy of the RandomForestClassifier model is : {}".format(accuracy_score(y_test, pred)))

Accuracy of the RandomForestClassifier model is : 0.9625
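
The support column in the report sums to 320 because test_size is the fraction of rows held out for testing. The sketch below checks the split sizes on placeholder data with the same number of rows and the same class counts; `rows` and `labels` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 400 rows, 250 ckd and 150 notckd labels
rows = np.arange(400).reshape(-1, 1)
labels = np.array(["ckd"] * 250 + ["notckd"] * 150)

# test_size=0.8 as in the notebook: 20% for training, 80% for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.8, random_state=42)
print(len(X_tr), len(X_te))

# test_size=0.2 gives the more common 80/20 split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    rows, labels, test_size=0.2, random_state=42)
print(len(X_tr2), len(X_te2))
```

If the intent was to train on most of the data, test_size=0.2 would be the value to use, though the reported scores would then need to be recomputed.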


### Decision Tree Classifier

In [21]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)

In [22]:
clf.fit(X=X_train, y=y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

### Feature importance
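The importances are listed in the same order as the columns in the X.info() output above. The four non-zero entries correspond to sg, bgr, sc and wbcc, with sc the largest at about 0.58.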

In [23]:
clf.feature_importances_

array([0.        , 0.        , 0.0963777 , 0.        , 0.        ,
       0.23127924, 0.        , 0.57656075, 0.        , 0.        ,
       0.        , 0.        , 0.09578231, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])
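
To make the array easier to read, the importances can be paired with the column names and sorted. The sketch below hardcodes the values printed above and the column order from X.info() so that it runs on its own; the names `columns`, `importances` and `ranked` are illustrative.

```python
import pandas as pd

# Column order as listed by X.info()
columns = [
    "age", "bp", "sg", "al", "su", "bgr", "bu", "sc", "sod", "pot",
    "hemo", "pcv", "wbcc", "rbcc", "rbc_normal", "pc_normal",
    "pcc_present", "ba_present", "htn_yes", "dm_no", "dm_yes",
    "cad_no", "cad_yes", "appet_poor", "pe_yes", "ane_yes",
]

# Values copied from the feature_importances_ output above
importances = [
    0.0, 0.0, 0.0963777, 0.0, 0.0,
    0.23127924, 0.0, 0.57656075, 0.0, 0.0,
    0.0, 0.0, 0.09578231,
] + [0.0] * 13

ranked = pd.Series(importances, index=columns).sort_values(ascending=False)

# Show only the features the tree actually used
print(ranked[ranked > 0])
```

In the notebook itself the same result could be obtained with `pd.Series(clf.feature_importances_, index=X.columns)`.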

In [24]:
print("Accuracy of the DecisionTreeClassifier model is : {}".format(clf.score(X=X_test, y=y_test)))

Accuracy of the DecisionTreeClassifier model is : 0.896875


# Conclusion 

In this chronic kidney disease classification problem we saw that real-world datasets come with many problems, such as missing values and garbled characters. We filled the missing values using the mean for numeric columns and the mode for nominal columns. Once the dataset was clean and all missing values were filled, we compared the accuracy of two classification techniques: the Random Forest Classifier reached 0.9625 and the Decision Tree Classifier reached 0.896875. We have therefore chosen the Random Forest Classifier as our classification technique.