
<div><h1><b><center>Exploratory Data Analysis of Chronic Kidney Disease</b></h1></div>

 ![](https://static.healthcare.siemens.com/siemens_hwem-hwem_ssxa_websites-context-root/wcm/idc/groups/public/@global/documents/image/mda3/nzux/~edisp/ckd_2-04895028/~renditions/ckd_2-04895028~8.jpg)

![Co-learning Lounge](https://s3.ap-south-1.amazonaws.com/townscript-production/images/2545d2c7-a6e8-486e-97e6-737c42cef670.jpg)
<div><h4><b><center>Thanks to the Co-learning Lounge for encouraging the creation of learning content on Kaggle problems.</b></h4></div>

<div><h4><b><center>You can find the most up-to-date and comprehensive learning material in their community.<br>
For more information, join and follow the <a href="https://linktr.ee/colearninglounge">Co-learning Lounge</a></b></h4></div>

* Chronic kidney disease (CKD), also known as chronic renal disease, involves conditions that damage your kidneys and decrease their ability to keep you healthy.
* You may develop complications like high blood pressure, anemia (low blood count), weak bones, poor nutritional health and nerve damage.
* Early detection and treatment can often keep chronic kidney disease from getting worse.

<div><h2><b><center>Data Dictionary</b></h2></div>

1. **age**		-	age
1. **bp**		-	blood pressure
1. **sg**		-	specific gravity
1. **al**		-   albumin
1. **su**		-	sugar
1. **rbc**		-	red blood cells
1. **pc**		-	pus cell
1. **pcc**		-	pus cell clumps
1. **ba**		-	bacteria
1. **bgr**		-	blood glucose random
1. **bu**		-	blood urea
1. **sc**		-	serum creatinine
1. **sod**		-	sodium
1. **pot**		-	potassium
1. **hemo**		-	hemoglobin
1. **pcv**		-	packed cell volume
1. **wc**		-	white blood cell count
1. **rc**		-	red blood cell count
1. **htn**		-	hypertension
1. **dm**		-	diabetes mellitus
1. **cad**		-	coronary artery disease
1. **appet**		-	appetite
1. **pe**		-	pedal edema
1. **ane**		-	anemia
1. **classification**		-	class

<div><h2><b><center>Attribute Information</b></h2></div>

1. Age(numerical) age in years
1. Blood Pressure(numerical) bp in mmHg
1. Specific Gravity(nominal) sg - (1.005,1.010,1.015,1.020,1.025)
1. Albumin(nominal)al - (0,1,2,3,4,5)
1. Sugar(nominal) su - (0,1,2,3,4,5)
1. Red Blood Cells(nominal) rbc - (normal,abnormal)
1. Pus Cell (nominal)pc - (normal,abnormal)
1. Pus Cell clumps(nominal)pcc - (present,notpresent)
1. Bacteria(nominal) ba  - (present,notpresent)
1. Blood Glucose Random(numerical) bgr in mgs/dl
1. Blood Urea(numerical) bu in mgs/dl
1. Serum Creatinine(numerical) sc in mgs/dl
1. Sodium(numerical) sod in mEq/L
1. Potassium(numerical)	pot in mEq/L
1. Hemoglobin(numerical) hemo in gms
1. Packed Cell Volume(numerical) pcv
1. White Blood Cell Count(numerical) wc in cells/cumm
1. Red Blood Cell Count(numerical) rc in millions/cmm
1. Hypertension(nominal) htn - (yes,no)
1. Diabetes Mellitus(nominal) dm - (yes,no)
1. Coronary Artery Disease(nominal)	cad - (yes,no)
1. Appetite(nominal) appet - (good,poor)
1. Pedal Edema(nominal)	pe - (yes,no)	
1. Anemia(nominal)ane	- (yes,no)
1. Class (nominal) class	 - (ckd,notckd)

In [None]:
!pip install plotly

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

<div><h2><b><center>Loading train and test Dataset</b></h2></div>

In [None]:
train=pd.read_csv('../input/chronic-kidney-disease/kidney_disease_train.csv')
test=pd.read_csv('../input/chronic-kidney-disease/kidney_disease_test.csv')

<div><h2><b><center>Renaming the columns to have meaningful names</b></h2></div>

In [None]:
col={"bp":"blood_pressure",
          "sg":"specific_gravity",
          "al":"albumin",
          "su":"sugar",
          "rbc":"red_blood_cells",
          "pc":"pus_cell",
          "pcc":"pus_cell_clumps",
          "ba":"bacteria",
          "bgr":"blood_glucose_random",
          "bu":"blood_urea",
          "sc":"serum_creatinine",
          "sod":"sodium",
          "pot":"potassium",
          "hemo":"hemoglobin",
          "pcv":"packed_cell_volume",
          "wc":"white_blood_cell_count",
          "rc":"red_blood_cell_count",
          "htn":"hypertension",
          "dm":"diabetes_mellitus",
          "cad":"coronary_artery_disease",
          "appet":"appetite",
          "pe":"pedal_edema",
          "ane":"anemia"}

train.rename(columns=col, inplace=True)
test.rename(columns=col, inplace=True)

More info about [data click here](https://github.com/rylativity/CKD_model/blob/master/chronic_kidney_disease.pdf) 

In [None]:
print('We have total {} train sample and {} test sample'.format(train.shape[0],test.shape[0]))

In [None]:
train.info()

In [None]:
train.isnull().sum()

From the above we can see that several columns have missing values. Let's check the percentage of missing values per column.

In [None]:
# Percentage of missing values
((train.isnull().sum()/train.shape[0])*100).sort_values(ascending=False)
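
The count and the percentage of missing values can also be read side by side. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `train`; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing-value count and percentage per column, largest first.

    Hypothetical helper, not part of the original notebook.
    """
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        # The mean of a boolean mask is the fraction of True values
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("missing_pct", ascending=False)

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "age": [48, 62, np.nan, 51],
    "red_blood_cells": ["normal", None, None, "abnormal"],
    "sodium": [137.0, 140.0, 135.0, 142.0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(train)` should give the same percentages as the cell above, with the raw counts next to them.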

In [None]:
#drop id column
train.drop(["id"],axis=1,inplace=True) 

* The **id** column is a unique identifier for each row, so we drop it; it won't help us find any insights in the data.

In [None]:
train['red_blood_cell_count'] = pd.to_numeric(train['red_blood_cell_count'], errors='coerce')
train['white_blood_cell_count'] = pd.to_numeric(train['white_blood_cell_count'], errors='coerce')

We have now converted the columns **'red_blood_cell_count'** and **'white_blood_cell_count'** to float. Any entry that cannot be parsed as a number becomes a missing value.
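
To see what `errors='coerce'` does, here is a small sketch on made-up values. The stray entry `'\t?'` is an assumption about what unparseable text in such a column might look like, not something confirmed from the dataset.

```python
import pandas as pd

# Made-up raw values: numbers stored as text, one unparseable entry, one missing
raw = pd.Series(["5.2", "4.7", "\t?", "3.9", None])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present as text but became NaN during conversion
newly_missing = raw.notnull() & converted.isnull()

print(converted.dtype)
print("values lost to coercion:", int(newly_missing.sum()))
```

Counting `newly_missing` before and after the conversion is a cheap way to check how many values were silently turned into missing data.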

In [None]:
train.describe(include='all').T

In [None]:
for i in train.columns:
    print('{} has unique values {}'.format(i,train[i].unique()),'\n')

Observation:

* There are multiple incorrect values in the columns **diabetes_mellitus** and **coronary_artery_disease**, such as `\tyes` and `\tno` (entries with a stray leading tab).
 
 
**Let’s replace those values with correct values**

In [None]:
#Replace incorrect values
train['diabetes_mellitus'] =train['diabetes_mellitus'].replace(to_replace={'\tno':'no','\tyes':'yes',' yes':'yes'})
train['coronary_artery_disease'] = train['coronary_artery_disease'].replace(to_replace='\tno',value='no')
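
Since the bad entries differ from the valid ones only by surrounding whitespace, stripping whitespace is a possible alternative to listing each variant by hand. This is a sketch on a made-up frame, not the notebook's method.

```python
import pandas as pd

# Toy frame with the kind of stray tabs and spaces seen above (values are made up)
toy = pd.DataFrame({
    "diabetes_mellitus": ["yes", "\tno", " yes", "no", None],
    "coronary_artery_disease": ["no", "\tno", "yes", "no", "no"],
})

cleaned = toy.copy()
for column in ["diabetes_mellitus", "coronary_artery_disease"]:
    # .str.strip() removes leading/trailing whitespace and leaves missing values missing
    cleaned[column] = cleaned[column].str.strip()

print(cleaned["diabetes_mellitus"].unique())
```

The trade-off is that stripping fixes any whitespace variant, including ones not yet seen, while an explicit `replace` mapping documents exactly which values were corrected.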

Now let's check the distribution of target variable first

In [None]:
sns.countplot(x='classification',data=train)
plt.xlabel("classification")
plt.ylabel("Count")
plt.title("target Class")
plt.show()
print('Percent of chronic kidney disease sample: ',round(len(train[train['classification']=='ckd'])/len(train['classification'])*100,2),"%")
print('Percent of not a chronic kidney disease sample: ',round(len(train[train['classification']=='notckd'])/len(train['classification'])*100,2),"%")
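
The two percentages can also be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up labels standing in for train['classification']
labels = pd.Series(["ckd"] * 5 + ["notckd"] * 3, name="classification")

# normalize=True returns fractions instead of counts
share = (labels.value_counts(normalize=True) * 100).round(2)
print(share)
```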

<div><h2><b><center>Correlation</b></h2></div>

In [None]:
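# Correlation is only defined for numeric columns; recent pandas versions
# raise an error if text columns are passed to corr()
corr_df = train.select_dtypes(include='number').corr()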
f,ax=plt.subplots(figsize=(15,15))
sns.heatmap(corr_df,annot=True,fmt=".2f",ax=ax,linewidths=0.5,linecolor="orange")
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.title('Correlations between different predictors')
plt.show()

## Positive Correlation

* **hemoglobin** -> red_blood_cell_count,packed_cell_volume, specific_gravity
* **red_blood_cell_count** -> packed_cell_volume,specific_gravity
* **specific_gravity** -> packed_cell_volume
* **blood_glucose_random** -> sugar
* **serum_creatinine** -> blood_urea

## Negative correlation

* **Albumin** -> hemoglobin, packed_cell_volume,specific_gravity,red_blood_cell_count
* **serum_creatinine** -> sodium
* **blood_urea** -> hemoglobin, packed_cell_volume,red_blood_cell_count
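
The pairs listed above were read off the heatmap. They could also be extracted programmatically. The sketch below assumes a hypothetical helper `strongest_pairs`, an arbitrary cut-off of 0.5, and made-up data; none of these come from the original notebook.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, threshold=0.5):
    """List column pairs whose absolute correlation is at least `threshold`.

    Hypothetical helper; the threshold is an arbitrary choice.
    """
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of sign
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data (made up): hemoglobin and packed_cell_volume rise together, albumin falls
toy = pd.DataFrame({
    "hemoglobin": [8, 10, 12, 14, 16],
    "packed_cell_volume": [25, 30, 37, 42, 48],
    "albumin": [4, 3, 1, 0, 0],
    "age": [50, 30, 60, 40, 55],
})
pairs = strongest_pairs(toy)
print(pairs)
```

Positive values in the result correspond to the "Positive Correlation" list and negative values to the "Negative correlation" list.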

In [None]:
numerical_features=[feature for feature in train.columns if train[feature].dtypes=='float64']
print('total numerical column :',len(numerical_features))
print(numerical_features)

In [None]:
categorical_features=[feature for feature in train.columns if train[feature].dtypes=='O']
print('total categorical column :',len(categorical_features))
print(categorical_features)

In [None]:
train[categorical_features].describe(include='all').T

### Observation
* **red_blood_cells** has the most missing values: 107 are missing
* **pus_cell** has 50 missing values
* **pus_cell_clumps** and **bacteria** each have 4 missing values
* **hypertension**, **diabetes_mellitus** and **coronary_artery_disease** each have only 1 missing value

In [None]:
train[numerical_features].describe(include='all').T

<div><h2><b><center>Percentage of missing values</b></h2></div>

In [None]:
((train[numerical_features].isnull().sum()/train.shape[0])*100).sort_values(ascending=False)

In [None]:
def violin(col): 
    fig = px.violin(train, y=col, x="classification", color="classification", box=True, points="all", hover_data=train.columns)
    return fig.show()
def kde_plot(feature):
    grid = sns.FacetGrid(train, hue="classification", aspect = 2)
    grid.map(sns.kdeplot, feature)
    grid.add_legend()

## RBC, PCV, Hemoglobin

* red_blood_cell_count has the highest share of missing values: 34% are missing.
* packed_cell_volume has 18% and hemoglobin has 14% of values missing.
* From the correlation graph, we have seen that red_blood_cell_count has a high positive correlation with hemoglobin and packed_cell_volume.

In [None]:
kde_plot('red_blood_cell_count')

In [None]:
train.groupby(['classification'])['red_blood_cell_count'].agg(['mean','median'])

* The two distributions are very different. The CKD distribution has a long tail, meaning a noticeable share of its values lies far from the "head" or central part of the distribution.
* For both CKD and not-CKD the mean and median are almost equal, which suggests the distributions are roughly symmetric; this alone does not prove that they are normal.
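
Comparing mean and median is a quick symmetry check; sample skewness gives a number for the same idea. A minimal sketch on made-up values (the toy frame is an assumption, not the real data):

```python
import pandas as pd

# Made-up values: the ckd group has one large value forming a right tail,
# the notckd group is symmetric around 5.2
toy = pd.DataFrame({
    "classification": ["ckd"] * 6 + ["notckd"] * 6,
    "red_blood_cell_count": [2.1, 3.0, 3.4, 3.9, 4.1, 6.5,
                             4.6, 4.9, 5.2, 5.2, 5.5, 5.8],
})

# Skewness near 0 means roughly symmetric; positive means a longer right tail
shape = toy.groupby("classification")["red_blood_cell_count"].agg(["mean", "median", "skew"])
print(shape)
```

On the real data, replacing `toy` with `train` would add a skewness column to the mean/median table above.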

In [None]:
kde_plot('packed_cell_volume')

* The two distributions are quite different: the CKD distribution looks roughly normal and evenly spread, while the not-CKD distribution is slightly left-skewed but still close to normal.

In [None]:
kde_plot('hemoglobin')

In [None]:
train.groupby(['classification'])['hemoglobin'].agg(['mean','median'])

In [None]:
fig = px.scatter(train, x="red_blood_cell_count", y="hemoglobin", color="classification")
fig.show() 

* As red_blood_cell_count increases, hemoglobin also increases (high positive correlation of 0.79).
* People with an RBC count from ~2 to <4.5 and hemoglobin from 3 to <13 are mostly classified as positive for chronic kidney disease (i.e. ckd).
* People with an RBC count from >4.5 to ~6.1 and hemoglobin from >13 to 17.8 are classified as negative for chronic kidney disease (i.e. not-CKD).
* There are a few cases where a person has a normal RBC count and hemoglobin but still has chronic kidney disease because of other factors; we will check this later.

In [None]:
fig = px.scatter(train, x="red_blood_cell_count", y="packed_cell_volume", color="classification")
fig.show()

In [None]:
fig = px.scatter(train, x="packed_cell_volume", y="red_blood_cell_count", color="classification")
fig.show()

In [None]:
fig = px.bar(train, x="red_blood_cells", y="red_blood_cell_count",color='classification', barmode='group',height=400)
fig.show()

In [None]:
train.groupby(['red_blood_cells','classification'])['red_blood_cell_count'].agg(['count','mean','median','min','max'])

Observation
* As red_blood_cell_count increases, packed_cell_volume also increases.
* People with an RBC count from 2 to <4.5 and packed_cell_volume from 9 to <40 are **mostly** classified as chronic kidney disease patients.
* People with an RBC count from >4.5 to 6.2 and packed_cell_volume from >40 to 54 are classified as normal, i.e. not chronic kidney disease.
* There are a few cases where a person has a normal RBC count and packed_cell_volume but still has chronic kidney disease because of other factors; we will check this later.
* We have also seen in the plots that people with hemoglobin in the normal range, i.e. 13 to 17.8, are mostly non-CKD.
* Everyone with an abnormal red_blood_cells level is suffering from chronic kidney disease. People whose red_blood_cells are normal but whose red_blood_cell_count is high or low are also prone to CKD.

In [None]:
violin('red_blood_cell_count')

In [None]:
violin('packed_cell_volume')

In [None]:
violin('hemoglobin')

Observation:
* red_blood_cell_count, packed_cell_volume, and hemoglobin each have a few outliers and inliers.
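
The outliers above were judged visually from the violin plots. One common way to flag them numerically is the 1.5 × IQR rule used by box plots. The sketch below is an assumption about how this could be done (hypothetical helper `iqr_outliers`, made-up values), not the notebook's method.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Hypothetical helper; k=1.5 is the conventional box-plot choice.
    """
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up hemoglobin values with one very low reading
hemoglobin = pd.Series([3.1, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5])
outliers = iqr_outliers(hemoglobin)
print(outliers)
```

Applying it per class, for example on `train.loc[train['classification'] == 'ckd', 'hemoglobin']`, would mirror the per-class box shown inside each violin.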

In [None]:
def missing_value(feature): 
    a = train[(train[feature].isnull())]
    return a.groupby(['classification'])['classification'].agg(['count'])
    

In [None]:
print('missing values in RBC column:\n\n',missing_value('red_blood_cell_count'),'\n')
print('missing values in Packed cell volume column:\n\n',missing_value('packed_cell_volume'),'\n')
print('missing values in Hemoglobin column:\n\n',missing_value('hemoglobin'),'\n')
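
The same breakdown can be shown as a single table with both the missing and the present counts per class. A minimal sketch using `pd.crosstab` on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "classification": ["ckd", "ckd", "ckd", "notckd", "notckd"],
    "hemoglobin": [9.5, np.nan, np.nan, 15.0, np.nan],
})

# Label each row as "missing" or "present", then cross-tabulate against the class
status = toy["hemoglobin"].isnull().map({True: "missing", False: "present"})
table = pd.crosstab(toy["classification"], status.rename("hemoglobin_status"))
print(table)
```

Seeing both columns makes it easier to judge whether values are missing more often for one class than the other.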

## Albumin

Let's check how albumin's negative correlation with hemoglobin (-0.64) and packed_cell_volume (-0.62) shows up in the data.

In [None]:
fig = px.bar(train, x="albumin", y="packed_cell_volume",color='classification', barmode='group',height=400)
fig.show()

In [None]:
train.groupby(['albumin','classification'])['albumin'].count()

* An albumin level above 0 is a symptom of CKD.
* packed_cell_volume is pretty much in the normal range in both cases, CKD and non-CKD.

In [None]:
fig = px.bar(train, x="albumin", y="hemoglobin",color='classification', barmode='group',height=400)
fig.show()

* Most people with low hemoglobin (<13) and an albumin level >0 are suffering from chronic kidney disease.

In [None]:
violin('albumin')

## Specific gravity

In [None]:
kde_plot('specific_gravity')

In [None]:
fig = px.bar(train, x="specific_gravity", y="packed_cell_volume",
             color='classification', barmode='group',
             height=400)
fig.show()

In [None]:
print("number of patient who's having packed cell volume<40 and specific gravity <1.02:\n\n",train[(train['packed_cell_volume']<40)&(train['specific_gravity']<1.02)].groupby(['classification'])['classification'].agg(['count']))
print("packed cell volume >=40 and specific gravity >=1.02:\n\n",train[(train['packed_cell_volume']>=40)&(train['specific_gravity']>=1.02)].groupby(['classification'])['classification'].agg(['count']))

* The higher the specific_gravity, the lower the chances of having CKD.
* From the above stats we can clearly say that people with packed_cell_volume <40 and specific gravity <1.02 are all CKD patients.
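
The threshold checks in this section all follow one pattern: build a boolean mask, then count rows per class. A small helper keeps that readable. The sketch below assumes a hypothetical function `count_by_class` and made-up data.

```python
import pandas as pd

def count_by_class(df, mask, target="classification"):
    """Count rows per class among the rows selected by a boolean mask.

    Hypothetical helper, not part of the original notebook.
    """
    return df.loc[mask, target].value_counts()

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "packed_cell_volume": [28, 33, 38, 44, 47, 52],
    "specific_gravity": [1.010, 1.015, 1.010, 1.020, 1.025, 1.020],
    "classification": ["ckd", "ckd", "ckd", "notckd", "notckd", "ckd"],
})

low = (toy["packed_cell_volume"] < 40) & (toy["specific_gravity"] < 1.02)
high = (toy["packed_cell_volume"] >= 40) & (toy["specific_gravity"] >= 1.02)

low_counts = count_by_class(toy, low)
high_counts = count_by_class(toy, high)
print(low_counts)
print(high_counts)
```

Naming the masks also makes it harder to filter on one column while the printed label mentions another.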

In [None]:
fig = px.bar(train, x="specific_gravity", y="hemoglobin",
             color='classification', barmode='group',
             height=400)
fig.show()

In [None]:
print("number of patient who's having hemoglobin <12 and specific gravity <1.02:\n\n",train[(train['hemoglobin']<12)&(train['specific_gravity']<1.02)].groupby(['classification'])['classification'].agg(['count']))
print("hemoglobin >=12 and specific gravity >=1.02:\n\n",train[(train['hemoglobin']>=12)&(train['specific_gravity']>=1.02)].groupby(['classification'])['classification'].agg(['count']))

* The higher the specific_gravity, the lower the chances of having CKD.
* The chances of having CKD are high if a person has a specific gravity level of 1.005, 1.01 or 1.015 and hemoglobin below the normal range, i.e. <13.
* There are a few CKD patients who have hemoglobin in the normal range but low specific_gravity.

In [None]:
fig = px.bar(train, x="specific_gravity", y="red_blood_cell_count",
             color='classification', barmode='group',
             height=400)
fig.show()

In [None]:
print("number of patient who's having RBC <3.9 and specific gravity <1.02:\n\n",train[(train['red_blood_cell_count']<3.9)&(train['specific_gravity']<1.02)].groupby(['classification'])['classification'].agg(['count']))
print("RBC >=3.9 and specific gravity >=1.02:\n\n",train[(train['red_blood_cell_count']>=3.9)&(train['specific_gravity']>=1.02)].groupby(['classification'])['classification'].agg(['count']))

In [None]:
train[(train['packed_cell_volume']<40)&(train['specific_gravity']<1.02)&(train['hemoglobin']<12)&(train['red_blood_cell_count']<3.9)].groupby(['classification'])['classification'].agg(['count'])

In [None]:
violin('specific_gravity')

## White blood cell count

White blood cells are vital components of the blood. Their role is to fight infection, and they are essential for health and well-being. The normal number of WBCs in the blood is 4,500 to 11,000 WBCs per microliter (4.5 to 11.0 × 10^9/L).

In [None]:
kde_plot('white_blood_cell_count')

* The patterns of CKD and not-CKD are different.
* The CKD distribution is slightly skewed towards the left, while the non-CKD distribution is pretty normal.

In [None]:
# CKD patients whose WBC count lies in the normal range quoted above (4,500 to 11,000)
train[(train['white_blood_cell_count']>=4500) &(train['white_blood_cell_count']<=11000)&(train['classification']=='ckd')]

In [None]:
violin('white_blood_cell_count')

* From the above data, we can say that people with a white_blood_cell_count in the normal range can also have chronic kidney disease.
* white_blood_cell_count does not show a notable correlation with any other column.
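
To put a number on the first point, one could compute the share of each class whose count falls inside the reference range. A minimal sketch on made-up data, using the 4,500 to 11,000 range quoted in this section:

```python
import pandas as pd

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "white_blood_cell_count": [3800, 6700, 9800, 12400, 7200, 8100],
    "classification": ["ckd", "ckd", "ckd", "ckd", "notckd", "notckd"],
})

# between() is inclusive on both ends by default
in_range = toy["white_blood_cell_count"].between(4500, 11000)

# The mean of a boolean column per class is the fraction of that class in range
share_in_range = in_range.groupby(toy["classification"]).mean()
print(share_in_range)
```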

## potassium, blood urea, pus cell, pus cell clumps

In [None]:
kde_plot('potassium')

In [None]:
kde_plot('blood_urea')

* The above plots show that potassium follows a similar pattern for chronic kidney disease and non-chronic kidney disease.
* The distributions are very different in the case of blood urea.
* From the correlation plot, we can see that potassium has a positive relation with blood_urea (0.4) and serum_creatinine (0.36). Let's check whether this helps to find any pattern or certain behavior.

In [None]:
fig = px.scatter(train, x="potassium", y="blood_urea", color="classification")
fig.show()

* Here potassium ranges from 2.7 to 6.5 and blood urea from 10 to 391, including both CKD and not-CKD.
* We have 2 extreme CKD cases with very high potassium and blood urea.
* People with blood urea in the range 10 to 50 are mostly classified as not-CKD; those with blood urea above 50 are classified as CKD patients.
* There are a few cases where a person has potassium and blood urea within range but still suffers from CKD.

In [None]:
fig = px.scatter(train, x="potassium", y="serum_creatinine", color="classification")
fig.show()

In [None]:
violin('potassium')

* Your body uses the potassium it needs. The extra potassium that your body does not need is removed from your blood by your kidneys. When you have kidney disease, your kidneys cannot remove extra potassium in the right way, and too much potassium can stay in your blood.
* From the above two scatter plots we can see that people suffering from chronic kidney disease tend to have higher potassium, since their kidneys are unable to remove the extra potassium; we have also observed that people can still have the disease even with potassium in the normal range.
* Even though serum_creatinine and blood_urea have a slight positive relation with potassium, they do not give many insights about it.
* We have 2 abnormal potassium values as outliers (39, 47).

In [None]:
violin('blood_urea')

In [None]:
fig = px.bar(train, x="pus_cell", y="blood_urea",color='classification', barmode='group',height=400)
fig.show()

In [None]:
fig = px.bar(train, x="pus_cell_clumps", y="blood_urea",color='classification', barmode='group',height=400)
fig.show()

* From the above 2 plots we can say that blood urea is not related to pus cell or pus cell clumps; the distribution of blood urea is pretty even in the normal and abnormal cases.
* About 90% of the time the person is suffering from CKD, irrespective of the blood_urea range.

## Sodium, serum creatinine, Blood pressure, hypertension

In [None]:
kde_plot('sodium')

* A normal blood sodium level is between 135 and 145 milliequivalents per liter (mEq/L).
* Sodium helps maintain normal blood pressure, supports the work of your nerves and muscles, and regulates your body's fluid balance. Let's check the relation between blood pressure and sodium.
* Sodium has a negative correlation with serum creatinine.

In [None]:
kde_plot('serum_creatinine')

Creatinine is a waste product that comes from normal wear and tear on the muscles of the body. Creatinine levels in the blood can vary depending on age, race, and body size. A creatinine level of greater than 1.2 for women and greater than 1.4 for men may be an early sign that the kidneys are not working properly. As kidney disease progresses, the level of creatinine in the blood rises.

In [None]:
fig = px.scatter(train, x="sodium", y="blood_pressure", color="classification")
fig.show()

* People with blood pressure below 60 or above 80 are prone to chronic kidney disease.

In [None]:
fig = px.scatter(train, x="sodium", y="serum_creatinine", color="classification")
fig.show()

* People with serum creatinine below 0.5 or above 1.2 are prone to chronic kidney disease.
* Serum creatinine is strongly related to chronic kidney disease.

In [None]:
train[train['classification']=='notckd']['serum_creatinine'].agg(['min','max'])

* The normal range for creatinine in the blood may be 0.84 to 1.21 milligrams per deciliter.
* From the above we can say that every not-CKD person has serum creatinine in the normal range.
* Even a person with sodium in the normal range can be suffering from CKD.

In [None]:
kde_plot('blood_pressure')

In [None]:
train.groupby(['classification','hypertension'])['blood_pressure'].agg(['min','max','count','mean','median'])

In [None]:
fig = px.bar(train, x="hypertension", y="blood_pressure",color='classification', barmode='group',height=400)
fig.show()

* Most of the people suffering from hypertension have chronic kidney disease. No correlation with blood pressure was found.

In [None]:
violin('blood_pressure')

## blood glucose random, sugar, diabetes_mellitus

In [None]:
kde_plot('blood_glucose_random')

In [None]:
fig = px.bar(train, x="sugar", y="blood_glucose_random",
             color='classification', barmode='group',
             height=400)
fig.show()

* The higher the glucose and sugar level (above 1), the higher the chances of having chronic kidney disease.

In [None]:
fig = px.bar(train, x="diabetes_mellitus", y="blood_glucose_random",
             color='classification', barmode='group',
             height=400)
fig.show()

In [None]:
train.groupby(['classification'])['blood_glucose_random'].agg(['min','max','median','mean'])

* CKD patients have a higher range of glucose than non-CKD patients.
* From the KDE plot, we can clearly see different distributions for CKD and non-CKD: non-CKD looks leptokurtic whereas CKD looks platykurtic.
* The CKD distribution is right-skewed.
* High blood sugar levels are mostly associated with diabetes mellitus.
* People not suffering from diabetes mellitus have blood glucose in the range of 70 to 140.
* From the above bar plot we can say that people suffering from diabetes mellitus also tend to suffer from chronic kidney disease.
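
The leptokurtic versus platykurtic reading of the KDE plot can be checked with excess kurtosis: positive values indicate a sharper peak with heavier tails than a normal distribution, negative values a flatter shape. A minimal sketch on made-up values (pandas' `Series.kurt` returns excess kurtosis, where a normal distribution scores 0):

```python
import pandas as pd

# Made-up values: notckd is tightly concentrated with two far-off readings,
# ckd is spread evenly over a wide range
toy = pd.DataFrame({
    "classification": ["notckd"] * 8 + ["ckd"] * 8,
    "blood_glucose_random": [60, 99, 100, 100, 100, 100, 101, 140,
                             70, 100, 130, 160, 190, 220, 250, 280],
})

# Excess kurtosis per class: > 0 leptokurtic, < 0 platykurtic
kurt = toy.groupby("classification")["blood_glucose_random"].apply(pd.Series.kurt)
print(kurt)
```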

In [None]:
violin('blood_glucose_random')

## Age

In [None]:
kde_plot('age')

In [None]:
train.groupby(['classification'])['age'].agg(['min','max','count','mean','median'])

In [None]:
violin('age')

* Age ranges from 2 to 90; that is, even a 2-year-old child and a 90-year-old person suffer from chronic kidney disease, so through visualisation we can say that no specific age group is affected by CKD.
* The CKD distribution is slightly left-skewed with a high concentration of data between 40 and 76. There are a few outliers on the lower end of the age spectrum.
* There is no correlation between age and other variables.
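
Whether a particular age group stands out can also be checked by binning age and computing the CKD share per bin. The bin edges and the data below are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "age": [2, 15, 34, 45, 52, 61, 68, 76, 83, 90],
    "classification": ["ckd", "notckd", "notckd", "ckd", "notckd",
                       "ckd", "ckd", "ckd", "notckd", "ckd"],
})

# 20-year bands; the edges are an arbitrary choice
age_band = pd.cut(toy["age"], bins=[0, 20, 40, 60, 80, 100])

# Fraction of CKD cases within each band
is_ckd = toy["classification"] == "ckd"
ckd_rate = is_ckd.groupby(age_band, observed=False).mean()
print(ckd_rate)
```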

In [None]:
col = ['bacteria','coronary_artery_disease', 'appetite', 'pedal_edema','anemia']
for i in col:
    # Pass the column as x=...; recent seaborn versions no longer accept it positionally
    sns.countplot(x=i, hue='classification', data=train)
    plt.show()


### conclusion

* RBC count is highly correlated with hemoglobin and packed_cell_volume.
* If the red blood cell level is abnormal and the RBC count is higher or lower than the normal range, the person is likely to have CKD.
* white_blood_cell_count, sodium, potassium and blood_pressure show no such relation with other variables.
* If a person has low specific gravity (levels 1.005, 1.01, 1.015) together with low hemoglobin (<13) and low packed cell volume (<40), the chances of having CKD are higher.
* Albumin and hemoglobin have a negative correlation (correlation matrix); an albumin level above 0 together with hemoglobin lower than the normal range is mostly an indication of CKD.
* If the blood urea level is higher than 150, then there are higher chances of having chronic kidney disease.
* The higher the serum creatinine level, i.e. >1.2, the more likely people are to have chronic kidney disease.
* People with blood pressure below 60 or above 80 are prone to chronic kidney disease.
* People with a high range of blood glucose random, a sugar level above 1 and diabetes_mellitus are mostly classified as chronic kidney disease.
* Age has no such correlation with other variables.
* red_blood_cell - red_blood_cell_count
* From the above analysis, we can see that the presence of even one of the following increases the chances of CKD to a substantial level: abnormal red cell count, bacteria, hypertension, pus cells, diabetes, coronary disease, lack of appetite, anemia.
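
The last point could be quantified by computing the share of CKD cases within each level of a categorical flag. A minimal sketch with `pd.crosstab(..., normalize="index")` on made-up data:

```python
import pandas as pd

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "hypertension": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "classification": ["ckd", "ckd", "ckd", "ckd", "ckd", "notckd", "notckd", "notckd"],
})

# normalize="index" makes each row sum to 1, giving the class share per flag value
rates = pd.crosstab(toy["hypertension"], toy["classification"], normalize="index")
print(rates)
```

Repeating this for bacteria, appetite, pedal_edema and anemia on the real data would give one comparable rate per flag instead of a visual impression from the count plots.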
