# Chronic Kidney Disease

# Objective: Implement several classification models and ensemble them to classify patients as having chronic kidney disease (ckd) or not (notckd)

## Dataset
The data was collected over a 2-month period in India and has 25 attributes: 24 features (e.g., red blood cell count, white blood cell count) plus the target. The target is 'classification', which is either 'ckd' (chronic kidney disease) or 'notckd'.

## Attributes
1.	Age(numerical)
age in years
2.	Blood Pressure(numerical)
bp in mmHg
3.	Specific Gravity(nominal)
sg - (1.005,1.010,1.015,1.020,1.025)
4.	Albumin(nominal)
al - (0,1,2,3,4,5)
5.	Sugar(nominal)
su - (0,1,2,3,4,5)
6.	Red Blood Cells(nominal)
rbc - (normal,abnormal)
7.	Pus Cell (nominal)
pc - (normal,abnormal)
8.	Pus Cell clumps(nominal)
pcc - (present,notpresent)
9.	Bacteria(nominal)
ba - (present,notpresent)
10.	Blood Glucose Random(numerical)
bgr in mg/dL
11.	Blood Urea(numerical)
bu in mg/dL
12.	Serum Creatinine(numerical)
sc in mg/dL
13.	Sodium(numerical)
sod in mEq/L
14.	Potassium(numerical)
pot in mEq/L
15.	Hemoglobin(numerical)
hemo in gms
16.	Packed Cell Volume(numerical)
pcv
17.	White Blood Cell Count(numerical)
wc in cells/cumm
18.	Red Blood Cell Count(numerical)
rc in millions/cmm
19.	Hypertension(nominal)
htn - (yes,no)
20.	Diabetes Mellitus(nominal)
dm - (yes,no)
21.	Coronary Artery Disease(nominal)
cad - (yes,no)
22.	Appetite(nominal)
appet - (good,poor)
23.	Pedal Edema(nominal)
pe - (yes,no)
24.	Anemia(nominal)
ane - (yes,no)

## Target Class
Class (nominal)
class - (ckd,notckd)

## Source: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

Tasks:
1.	Obtain the dataset
2.	Apply pre-processing techniques (if any)
3.	Split the dataset into training and testing sets.
4.	Implement SVM, Logistic regression, Decision tree and KNN models.
5.	Ensemble SVM, Logistic regression, Decision tree and KNN models. 
6.	Evaluate accuracy, precision, recall and f-measure for all models.
7.	Plot the results while varying the hyperparameters of the models above.
8.	Summarize the conclusions


Helpful links: 

https://scikit-learn.org/stable/modules/ensemble.html

https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/

https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/

https://www.datacamp.com/community/tutorials/ensemble-learning-python







## Task 1: Implement classification models on Chronic Kidney Disease dataset 

In [1]:
# Load the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler,LabelEncoder,OrdinalEncoder
from scipy.io import arff
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score,recall_score,mean_squared_error,precision_score,f1_score
from sklearn.feature_selection import SelectKBest,f_classif

In [2]:
# Load the dataset 

data = pd.read_csv('kidney_disease.csv')
data.head()

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd


In [3]:
data.shape

(400, 26)

In [4]:
# Drop the id column: it is a row identifier and carries no predictive information
data.drop('id', axis = 1, inplace = True)

In [5]:
# sanity check
data.shape

(400, 25)

In [6]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)

# There are many missing values, so we can't simply drop the affected rows
data.isna().sum()

age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                70
wc                105
rc                130
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             391 non-null    float64
 1   bp              388 non-null    float64
 2   sg              353 non-null    float64
 3   al              354 non-null    float64
 4   su              351 non-null    float64
 5   rbc             248 non-null    object 
 6   pc              335 non-null    object 
 7   pcc             396 non-null    object 
 8   ba              396 non-null    object 
 9   bgr             356 non-null    float64
 10  bu              381 non-null    float64
 11  sc              383 non-null    float64
 12  sod             313 non-null    float64
 13  pot             312 non-null    float64
 14  hemo            348 non-null    float64
 15  pcv             330 non-null    object 
 16  wc              295 non-null    object 
 17  rc              270 non-null    obj

In [8]:
data.describe()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,hemo
count,391.0,388.0,353.0,354.0,351.0,356.0,381.0,383.0,313.0,312.0,348.0
mean,51.483376,76.469072,1.017408,1.016949,0.450142,148.036517,57.425722,3.072454,137.528754,4.627244,12.526437
std,17.169714,13.683637,0.005717,1.352679,1.099191,79.281714,50.503006,5.741126,10.408752,3.193904,2.912587
min,2.0,50.0,1.005,0.0,0.0,22.0,1.5,0.4,4.5,2.5,3.1
25%,42.0,70.0,1.01,0.0,0.0,99.0,27.0,0.9,135.0,3.8,10.3
50%,55.0,80.0,1.02,0.0,0.0,121.0,42.0,1.3,138.0,4.4,12.65
75%,64.5,80.0,1.02,2.0,0.0,163.0,66.0,2.8,142.0,4.9,15.0
max,90.0,180.0,1.025,5.0,5.0,490.0,391.0,76.0,163.0,47.0,17.8


In [9]:
for col in data.columns:
    print(col, data[col].unique())

age [48.  7. 62. 51. 60. 68. 24. 52. 53. 50. 63. 40. 47. 61. 21. 42. 75. 69.
 nan 73. 70. 65. 76. 72. 82. 46. 45. 35. 54. 11. 59. 67. 15. 55. 44. 26.
 64. 56.  5. 74. 38. 58. 71. 34. 17. 12. 43. 41. 57.  8. 39. 66. 81. 14.
 27. 83. 30.  4.  3.  6. 32. 80. 49. 90. 78. 19.  2. 33. 36. 37. 23. 25.
 20. 29. 28. 22. 79.]
bp [ 80.  50.  70.  90.  nan 100.  60. 110. 140. 180. 120.]
sg [1.02  1.01  1.005 1.015   nan 1.025]
al [ 1.  4.  2.  3.  0. nan  5.]
su [ 0.  3.  4.  1. nan  2.  5.]
rbc [nan 'normal' 'abnormal']
pc ['normal' 'abnormal' nan]
pcc ['notpresent' 'present' nan]
ba ['notpresent' 'present' nan]
bgr [121.  nan 423. 117. 106.  74. 100. 410. 138.  70. 490. 380. 208.  98.
 157.  76.  99. 114. 263. 173.  95. 108. 156. 264. 123.  93. 107. 159.
 140. 171. 270.  92. 137. 204.  79. 207. 124. 144.  91. 162. 246. 253.
 141. 182.  86. 150. 146. 425. 112. 250. 360. 163. 129. 133. 102. 158.
 165. 132. 104. 127. 415. 169. 251. 109. 280. 210. 219. 295.  94. 172.
 101. 298. 153.  88. 226. 143. 1

In [10]:
# sample check
data['wc'].shape[0]

400

In [11]:
data.loc[0, 'wc']

'7800'

In [12]:
data['wc'][0]

'7800'

In [13]:
# Replace the '\t?' placeholder with 0 in the string-typed numeric columns
for col in ['wc', 'rc', 'pcv']:
    data.loc[data[col] == '\t?', col] = np.float64(0)
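
The replacement above targets only the exact placeholder '\t?', and the three columns keep their object dtype afterwards. A more general cleanup is sketched below on a small made-up frame (the values are illustrative, not rows from the dataset): strip stray whitespace, then let `pd.to_numeric` turn anything unparseable into NaN so the columns end up as floats. Whether NaN or 0 is the better stand-in for '?' is a separate decision; this only shows the mechanics.

```python
import pandas as pd

# Hypothetical sample mimicking the string-typed columns; not real dataset rows
sample = pd.DataFrame({
    "wc": ["7800", "\t6200", "\t?", None],
    "pcv": ["44", "\t43", "\t?", "38"],
})

cleaned = sample.copy()
for col in ["wc", "pcv"]:
    # Strip stray whitespace, then coerce unparseable entries such as '?' to NaN
    cleaned[col] = pd.to_numeric(cleaned[col].str.strip(), errors="coerce")

print(cleaned.dtypes)
print(cleaned)
```

After this step the columns are detected as numeric by a dtype check, so they would not need to be moved from the categorical list to the numeric list by hand.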

In [14]:
cat_cols, num_cols = [], []
for col in data.columns:
    if(data[col].dtype == np.float64):
        num_cols.append(col)
    else:
        cat_cols.append(col)

In [15]:
cat_cols

['rbc',
 'pc',
 'pcc',
 'ba',
 'pcv',
 'wc',
 'rc',
 'htn',
 'dm',
 'cad',
 'appet',
 'pe',
 'ane',
 'classification']

In [16]:
for i in ['wc', 'rc', 'pcv']:
    cat_cols.remove(i)
print(len(cat_cols))

11


In [17]:
num_cols.extend(['wc', 'rc', 'pcv'])

In [18]:
num_cols

['age',
 'bp',
 'sg',
 'al',
 'su',
 'bgr',
 'bu',
 'sc',
 'sod',
 'pot',
 'hemo',
 'wc',
 'rc',
 'pcv']

In [19]:
print(len(cat_cols))
print(len(num_cols))

11
14


In [20]:
# Fill the missing numeric values with 0
data[num_cols] = data[num_cols].fillna(0.0)

In [21]:
# Normal/Abnormal
data[['rbc','pc']] = data[['rbc','pc']].fillna('abnormal')

# Present/Notpresent
data[['pcc','ba']] = data[['pcc','ba']].fillna('notpresent')

# Yes/No
data[['htn','dm','cad', 'pe','ane']] = data[['htn','dm','cad', 'pe','ane']].fillna('no')

# Good/Poor
data[['appet']] = data[['appet']].fillna('poor')

In [22]:
# sanity check
data.isna().sum()

age               0
bp                0
sg                0
al                0
su                0
rbc               0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sc                0
sod               0
pot               0
hemo              0
pcv               0
wc                0
rc                0
htn               0
dm                0
cad               0
appet             0
pe                0
ane               0
classification    0
dtype: int64
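
Filling numeric gaps with 0 introduces values that cannot occur in practice (a hemoglobin or blood pressure of 0, for instance), which may distort distance-based models such as KNN and SVM. A common alternative, sketched here with scikit-learn's `SimpleImputer` on a made-up frame, is to use the median for numeric columns and the most frequent label for categorical ones. The toy values and the choice of strategies are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; column names follow the dataset, values are made up
toy = pd.DataFrame({
    "hemo": [15.4, np.nan, 9.6, 11.2],
    "bp": [80.0, 50.0, np.nan, 70.0],
    "htn": ["yes", "no", np.nan, "no"],
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

filled = toy.copy()
# Median of the observed values replaces each numeric gap
filled[["hemo", "bp"]] = num_imputer.fit_transform(toy[["hemo", "bp"]])
# The most common label replaces each categorical gap
filled[["htn"]] = cat_imputer.fit_transform(toy[["htn"]])

print(filled)
```

In a full pipeline the imputers would be fitted on the training split only and then applied to the test split, so that test data does not influence the fill values.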

In [32]:
c = 'classification'
# Some labels carry a trailing tab ('ckd\t'); strip it so only 'ckd' and 'notckd' remain
data[c] = data[c].str.strip()

In [34]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data[cat_cols] = data[cat_cols].apply(le.fit_transform)

data.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,rc,htn,dm,cad,appet,pe,ane,classification
0,48.0,80.0,1.02,1.0,0.0,0,1,0,0,121.0,...,5.2,1,4,1,0,0,0,0
1,7.0,50.0,1.02,4.0,0.0,0,1,0,0,0.0,...,0.0,0,3,1,0,0,0,0
2,62.0,80.0,1.01,2.0,3.0,1,1,0,0,423.0,...,0.0,0,4,1,1,0,1,0
3,48.0,70.0,1.005,4.0,0.0,1,0,1,0,117.0,...,3.9,1,3,1,1,1,1,0
4,51.0,80.0,1.01,2.0,0.0,1,1,0,0,106.0,...,4.6,0,3,1,0,0,0,0


In [43]:
data[['classification']] = data[['classification']].apply(le.fit_transform)

In [58]:
# Divide the dataset into training and testing sets
X = data.drop('classification', axis = 1)
y = data['classification']

In [59]:
print("X shape : ", X.shape)
print("y shape : ", y.shape)

X shape :  (400, 24)
y shape :  (400,)


In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, shuffle = True)

In [61]:
print("X train shape : ", X_train.shape)
print("y train shape : ", y_train.shape)
print("X test shape : ", X_test.shape)
print("y test shape : ", y_test.shape)

X train shape :  (320, 24)
y train shape :  (320,)
X test shape :  (80, 24)
y test shape :  (80,)


In [62]:
# Implement the classification models: SVM, logistic regression, decision tree and KNN
# All of them are available in sklearn
from sklearn.neighbors import KNeighborsClassifier

lr = LogisticRegression()
dt = DecisionTreeClassifier()
svc = SVC()
knn = KNeighborsClassifier()

In [63]:
lr.fit(X_train,y_train)
dt.fit(X_train,y_train)
svc.fit(X_train, y_train)
knn.fit(X_train, y_train)
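
In the encoded `data.head()` output above, `dm` takes the codes 3 and 4 even though it is listed as a two-valued yes/no attribute. That suggests some categorical cells carry stray whitespace, as `classification` does with 'ckd\t', so that variants of the same label receive different codes. The sketch below uses made-up values (not copied from the dataset) to show how stripping whitespace before encoding collapses such variants.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values illustrating whitespace variants; not real dataset rows
toy = pd.DataFrame({
    "dm": ["yes", "no", " yes", "\tno", "\tyes"],
    "classification": ["ckd", "ckd\t", "notckd", "ckd", "notckd"],
})

# Strip leading and trailing whitespace so variants collapse into one label each
stripped = toy.apply(lambda s: s.str.strip())

# Use a fresh encoder per column so each column gets its own class mapping
raw_codes = toy.apply(lambda s: LabelEncoder().fit_transform(s))
clean_codes = stripped.apply(lambda s: LabelEncoder().fit_transform(s))

print(raw_codes.nunique().to_dict())
print(clean_codes.nunique().to_dict())
```

Creating one `LabelEncoder` per column is a deliberate choice here: a single shared encoder only remembers the classes of the last column it was fitted on, which makes it hard to decode the others later.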

In [64]:
# Train and test the models
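
One way this cell could be filled in is sketched below. It runs on synthetic data from `make_classification` as a stand-in for the cleaned, fully numeric feature matrix, so the names `X_demo`, `y_demo` and the printed scores are illustrative assumptions and say nothing about performance on the kidney dataset. Scaling is placed inside a pipeline for the models that are sensitive to feature ranges (logistic regression, SVM, KNN).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=24, n_informative=8, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "lr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "dt": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    # Collect the four metrics requested in the task list
    results[name] = {
        "accuracy": accuracy_score(yte, pred),
        "precision": precision_score(yte, pred),
        "recall": recall_score(yte, pred),
        "f1": f1_score(yte, pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With the real data, `X_train`, `X_test`, `y_train` and `y_test` from the split above would take the place of the synthetic arrays.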



## Task 2: Ensembling of classification models

In [28]:
# Obtain the results of various models from Task 1



In [29]:
# Apply ensembling of various classification models such as SVM, Logistic regression, Decision tree and KNN
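
A minimal sketch of one ensembling option follows: scikit-learn's `VotingClassifier` with hard (majority) voting over the four model types. It again uses synthetic data as a placeholder, and hard voting is an assumption; the notebook does not say which ensembling scheme is intended. Soft voting would additionally require `SVC(probability=True)`, because `SVC` does not expose `predict_proba` by default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=24, n_informative=8, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("svm", make_pipeline(StandardScaler(), SVC())),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
ensemble.fit(Xtr, ytr)
ens_pred = ensemble.predict(Xte)

ens_scores = {
    "accuracy": accuracy_score(yte, ens_pred),
    "precision": precision_score(yte, ens_pred),
    "recall": recall_score(yte, ens_pred),
    "f1": f1_score(yte, ens_pred),
}
print({k: round(v, 3) for k, v in ens_scores.items()})
```

With four voters a 2-2 tie is possible; in hard voting scikit-learn resolves ties in favour of the class label that sorts first, so an odd number of estimators or soft voting may be preferable.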

In [30]:
# Build training analysis graphs for the evaluation metrics, i.e., accuracy, precision, recall and F-measure, for all models.

In [31]:
# Build testing analysis graphs for the evaluation metrics, i.e., accuracy, precision, recall and F-measure, for all models.
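
The training and testing graphs could be produced along the lines of the sketch below, which sweeps one hyperparameter (`n_neighbors` for KNN) and plots the F1 score on both splits. The data is synthetic, the list of `k` values and the output file name `knn_f1_vs_k.png` are hypothetical, and the same loop could be repeated for other metrics or for hyperparameters such as `C` for the SVM or `max_depth` for the decision tree.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature matrix (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=24, n_informative=8, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

k_values = [1, 3, 5, 7, 9, 11]
train_f1, test_f1 = [], []
for k in k_values:
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(Xtr, ytr)
    # Score the same fitted model on both splits to compare fit and generalization
    train_f1.append(f1_score(ytr, knn.predict(Xtr)))
    test_f1.append(f1_score(yte, knn.predict(Xte)))

fig, ax = plt.subplots()
ax.plot(k_values, train_f1, marker="o", label="train")
ax.plot(k_values, test_f1, marker="o", label="test")
ax.set_xlabel("n_neighbors")
ax.set_ylabel("F1 score")
ax.set_title("KNN: F1 score vs. number of neighbors")
ax.legend()
fig.savefig("knn_f1_vs_k.png")
```

A large gap between the training and testing curves at small `k` would point to overfitting, which is the kind of observation the concluding cell can draw on.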

# Conclude the results