In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree


In [2]:
df = pd.read_csv("kidney_disease.csv")
df.head(10)

,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,...,44,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,...,38,6000.0,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500.0,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300.0,4.6,no,no,no,good,no,no,ckd
5,5,60.0,90.0,1.015,3.0,0.0,,,notpresent,notpresent,...,39,7800.0,4.4,yes,yes,no,good,yes,no,ckd
6,6,68.0,70.0,1.01,0.0,0.0,,normal,notpresent,notpresent,...,36,,,no,no,no,good,no,no,ckd
7,7,24.0,,1.015,2.0,4.0,normal,abnormal,notpresent,notpresent,...,44,6900.0,5.0,no,yes,no,good,yes,no,ckd
8,8,52.0,100.0,1.015,3.0,0.0,normal,abnormal,present,notpresent,...,33,9600.0,4.0,yes,yes,no,good,no,yes,ckd
9,9,53.0,90.0,1.02,2.0,0.0,abnormal,abnormal,present,notpresent,...,29,12100.0,3.7,yes,yes,no,poor,no,yes,ckd


1. Classification Problem

We would like to use supervised machine learning to predict chronic kidney disease, as well as understand which variables help with the diagnosis of it.

2. Variable Transformation

In [3]:
df.dtypes

id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object

In [4]:
df['sg'] = pd.Categorical(df['sg'])
df['al'] = pd.Categorical(df['al'])
df['su'] = pd.Categorical(df['su'])


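pcv, wc and rc are numeric measurements and should be stored as float64 or int64, yet pandas reads them as object. Missing values alone would not cause this, because a numeric column with N/A values is still read as float64, so these columns most likely also contain entries that cannot be parsed as numbers. We will not transform them just yet; later in this analysis we will drop the N/A values. Some of the float64 variables hold whole numbers only and could be stored as int64, but pandas keeps them as float64 because a plain int64 column cannot hold missing values. This makes no difference to our calculations (for example, 4 is equivalent to 4.0).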
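When the time comes to convert these columns, one possible approach is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a small hypothetical frame (`toy`), not on the real data, and the stray `"\t?"` entry is only an assumed example of a value that cannot be parsed.

```python
import pandas as pd

# Hypothetical toy frame standing in for pcv, wc and rc. The "\t?" entry is
# an assumed example of a non-numeric value, not taken from the dataset.
toy = pd.DataFrame({
    "pcv": ["44", "38", "\t?", None],
    "wc": ["7800", "6000", None, "6700"],
    "rc": ["5.2", None, "3.9", "4.6"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# so each column becomes float64 and bad entries join the other N/A values.
for col in ["pcv", "wc", "rc"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

Coercing first and dropping N/A values afterwards keeps the cleaning in one place, at the cost of silently discarding whatever the unparseable entries originally meant.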

3. Dataset Overview

In [6]:
df.describe()

,id,age,bp,bgr,bu,sc,sod,pot,hemo
count,400.0,391.0,388.0,356.0,381.0,383.0,313.0,312.0,348.0
mean,199.5,51.483376,76.469072,148.036517,57.425722,3.072454,137.528754,4.627244,12.526437
std,115.614301,17.169714,13.683637,79.281714,50.503006,5.741126,10.408752,3.193904,2.912587
min,0.0,2.0,50.0,22.0,1.5,0.4,4.5,2.5,3.1
25%,99.75,42.0,70.0,99.0,27.0,0.9,135.0,3.8,10.3
50%,199.5,55.0,80.0,121.0,42.0,1.3,138.0,4.4,12.65
75%,299.25,64.5,80.0,163.0,66.0,2.8,142.0,4.9,15.0
max,399.0,90.0,180.0,490.0,391.0,76.0,163.0,47.0,17.8


This dataset consists of 400 observations, which we treat as 400 individuals. The youngest individual is 2 years old and the oldest is 90 years old; the average age is about 51.5 years. As seen in part 2, there is a fairly even balance between object (mostly binary) variables and numerical (integer and float) variables in this dataset. The average sodium level across individuals is 137.53 milliequivalents per litre, with a standard deviation of 10.41. The count row is below 400 for every variable except id, so each of these variables has missing values.
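The counts in the summary table differ from column to column because of missing values. A minimal sketch of how the missing entries and the dtype balance could be tallied is shown here on a hypothetical toy frame (`toy`); its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not the real data.
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 48.0],
    "sod": [np.nan, 135.0, 138.0, np.nan],
    "htn": ["yes", "no", "no", None],
})

# Number of missing entries per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)

# How many columns fall under each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

print(missing)
print(dtype_counts)
```

Applied to `df`, the same two calls would show which variables lose the most rows when N/A values are dropped.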

4. Association Between Variables

In [28]:
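# Split the columns by dtype: float64 measurements and the categoricals from part 2.
df_numeric = df.select_dtypes(include=['float64'])
df_category = df.select_dtypes(include=['category'])
# Both frames share df's index, so a column-wise concat lines the rows up.
df_numcat = pd.concat([df_numeric, df_category], axis=1)
# Pairwise Pearson correlations (the pandas default method).
df_numcat.corr()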


,age,bp,bgr,bu,sc,sod,pot,hemo,sg,al,su
age,1.0,0.15948,0.244992,0.196985,0.132531,-0.100046,0.058377,-0.192928,-0.191096,0.122091,0.220866
bp,0.15948,1.0,0.160193,0.188517,0.146222,-0.116422,0.075151,-0.30654,-0.218836,0.160689,0.222576
bgr,0.244992,0.160193,1.0,0.143322,0.114875,-0.267848,0.066966,-0.306189,-0.37471,0.379464,0.717827
bu,0.196985,0.188517,0.143322,1.0,0.586368,-0.323054,0.357049,-0.61036,-0.314295,0.453528,0.168583
sc,0.132531,0.146222,0.114875,0.586368,1.0,-0.690158,0.326107,-0.40167,-0.361473,0.399198,0.223244
sod,-0.100046,-0.116422,-0.267848,-0.323054,-0.690158,1.0,0.097887,0.365183,0.41219,-0.459896,-0.131776
pot,0.058377,0.075151,0.066966,0.357049,0.326107,0.097887,1.0,-0.133746,-0.072787,0.129038,0.21945
hemo,-0.192928,-0.30654,-0.306189,-0.61036,-0.40167,0.365183,-0.133746,1.0,0.602582,-0.634632,-0.224775
sg,-0.191096,-0.218836,-0.37471,-0.314295,-0.361473,0.41219,-0.072787,0.602582,1.0,-0.46976,-0.296234
al,0.122091,0.160689,0.379464,0.453528,0.399198,-0.459896,0.129038,-0.634632,-0.46976,1.0,0.269305


Strong associations exist between serum creatinine and sodium (-0.69), as well as between hemoglobin and blood urea (-0.61). We see an even stronger association between sugar and blood glucose random (0.72). Hemoglobin also has fairly strong associations with albumin (-0.63) and specific gravity (0.60).
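Reading the strongest pairs off a correlation matrix by eye is error-prone, so the ranking can be sketched in code. The example below rebuilds a 5x5 slice of the table above (coefficients copied from it) and sorts each pair once by absolute value; the names `corr`, `pairs` and `ranked` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# A subset of coefficients copied from the correlation table above.
labels = ["bgr", "bu", "sc", "sod", "hemo"]
corr = pd.DataFrame(
    [
        [1.0, 0.143322, 0.114875, -0.267848, -0.306189],
        [0.143322, 1.0, 0.586368, -0.323054, -0.61036],
        [0.114875, 0.586368, 1.0, -0.690158, -0.40167],
        [-0.267848, -0.323054, -0.690158, 1.0, 0.365183],
        [-0.306189, -0.61036, -0.40167, 0.365183, 1.0],
    ],
    index=labels,
    columns=labels,
)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value: a strong negative correlation matters as much as a positive one.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(3))
```

On the full matrix the same steps would work with `df_numcat.corr()` in place of the hand-built `corr`.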

Given these strong associations, our feature selection and extraction should include hemoglobin because of its relationship with multiple variables. Sodium, serum creatinine, albumin, specific gravity and blood urea are other variables we can use. Sugar and blood glucose random are strongly associated with each other, but their associations with the other variables are weaker. We expect these variables to help us predict whether or not an individual has chronic kidney disease.
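One way such a shortlist might later be checked is to fit a decision tree on the selected columns and inspect its feature importances. The sketch below is an assumption about how that could look, not the analysis itself: it uses synthetic stand-in data, an invented label rule, and an assumed `"notckd"` label for the negative class.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Feature names follow the shortlist discussed above.
features = ["hemo", "sod", "sc", "al", "sg", "bu"]

# Synthetic stand-in data, NOT the kidney dataset. The label rule
# (hemo < 12 means "ckd") is invented purely so the example runs.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, len(features))), columns=features)
X["hemo"] = rng.uniform(3, 18, size=200)
y = np.where(X["hemo"] < 12, "ckd", "notckd")

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importances sum to 1; larger values mean the tree relied more on that feature.
importances = pd.Series(tree.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the synthetic label depends only on `hemo`, the tree assigns all importance to that column; on real data the importances would be spread according to how useful each variable actually is.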