### Feature Description:
* id -id
* age -age
* bp -blood pressure
* sg -specific gravity
* al -albumin
* su -sugar
* rbc -red blood cells
* pc - pus cell
* pcc -pus cell clumps
* ba -bacteria
* bgr -blood glucose random
* bu -blood urea
* sc -serum creatinine
* sod -sodium
* pot -potassium
* hemo -haemoglobin
* pcv -packed cell volume
* wc -white blood cell count
* rc -red blood cell count
* htn -hypertension
* dm -diabetes mellitus
* cad -coronary artery disease
* appet -appetite
* pe -pedal edema
* ane -anemia
* classification -class

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
kidney=pd.read_csv('../input/ckdisease/kidney_disease.csv')

In [None]:
kidney.head()

In [None]:
kidney.info()

In [None]:
columns=pd.read_csv("../input/kidney-kronicle/data_description.txt",sep='-')
columns=columns.reset_index()

In [None]:
columns.columns=['cols','abb_col_names']

In [None]:
columns

In [None]:
kidney.head()

In [None]:
kidney.columns=columns['abb_col_names'].values

In [None]:
kidney.head()

In [None]:
kidney.describe()

In [None]:
def convert_dtype(kidney,feature):
    kidney[feature]=pd.to_numeric(kidney[feature],errors='coerce')    # errors='coerce' turns values that cannot be parsed as numbers into NaN

In [None]:
features=['packed cell volume','white blood cell count','red blood cell count']
for i in features:
    convert_dtype(kidney,i)
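
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the dataset). Strings that parse as numbers are converted, and everything else becomes `NaN`.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, plus two unparseable entries
raw = pd.Series(['44', '?', '5.2', 'abc'])

# errors='coerce' converts what it can and replaces the rest with NaN
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 2
```

The coerced `NaN` values then show up in `isnull().sum()` and are handled together with the other missing values.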

In [None]:
kidney.dtypes

In [None]:
kidney.drop('id',inplace=True,axis=1)

## Data cleaning

In [None]:
def extract_cat_num(kidney):
    cat_col=[col for col in kidney.columns if kidney[col].dtype=='O']
    num_col=[col for col in kidney.columns if kidney[col].dtype!='O']
    return cat_col,num_col

In [None]:
cat_col,num_col=extract_cat_num(kidney)
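
The same split can probably be obtained with pandas' built-in `select_dtypes`. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration) rather than the real data.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column
toy = pd.DataFrame({
    'age': [48.0, 62.0, 51.0],
    'red blood cells': ['normal', 'abnormal', 'normal'],
    'specific gravity': [1.02, 1.01, 1.005],
})

# Numeric columns, and everything that is not numeric
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(num_cols)  # ['age', 'specific gravity']
print(cat_cols)  # ['red blood cells']
```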

In [None]:
cat_col

In [None]:
num_col

In [None]:
# dirtiness in categorical data
for col in cat_col:
    print('{} has {} values'.format(col,kidney[col].unique()))
    print("\n")

In [None]:
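# assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not propagate to the DataFrame
kidney['diabetes mellitus']=kidney['diabetes mellitus'].replace(to_replace={'\tno':'no','\tyes':'yes'})
kidney['coronary artery disease']=kidney['coronary artery disease'].replace(to_replace={'\tno':'no'})
kidney['class']=kidney['class'].replace(to_replace={'ckd\t':'ckd'})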
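
Since the dirty labels differ from the clean ones only by stray whitespace, a more general alternative is to strip whitespace from the whole column. This is a sketch on made-up values, not the author's method.

```python
import pandas as pd

# Hypothetical column with tab- and space-contaminated labels
dirty = pd.Series(['no', '\tno', 'yes', '\tyes', ' yes'])

# str.strip() removes leading and trailing whitespace, including tabs
clean = dirty.str.strip()

print(sorted(clean.unique()))  # ['no', 'yes']
```

This avoids listing every dirty variant by hand, at the cost of touching every value in the column.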

In [None]:
# no dirtiness
for col in cat_col:
    print('{} has {} values'.format(col,kidney[col].unique()))
    print("\n")

## Exploratory Data Analysis

Analysing the distribution of every column

In [None]:
len(num_col)

In [None]:
plt.figure(figsize=(30,30))
for i,feature in enumerate(num_col):
    plt.subplot(5,3,i+1)            
    kidney[feature].hist()
    plt.title(feature)

##### Check Label distribution of categorical Data

In [None]:
len(cat_col)

In [None]:
plt.figure(figsize=(20,20))

for i,feature in enumerate(cat_col):
    plt.subplot(4,3,i+1)
    sns.countplot(x=kidney[feature])    # pass the data as x; recent seaborn versions reject a positional Series

In [None]:
# There are many warnings, so we suppress them for convenience
import warnings
from warnings import filterwarnings
filterwarnings("ignore")

In [None]:
plt.figure(figsize=(20,20))

for i,feature in enumerate(cat_col):
    plt.subplot(4,3,i+1)
    sns.countplot(x=kidney[feature],hue=kidney['class'])

In [None]:
sns.countplot(x=kidney['class'])

## Correlation between features

In [None]:
kidney.corr(numeric_only=True)    # restrict to numeric columns; the categorical columns are still strings here

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(kidney.corr(method='pearson',numeric_only=True),cbar=True,cmap='BuPu',annot=True)

* Red blood cell count is positively correlated with specific gravity, haemoglobin and packed cell volume
* Red blood cell count is negatively correlated with albumin and blood urea
* Packed cell volume and haemoglobin are strongly positively correlated
* Packed cell volume is negatively correlated with albumin and blood urea
* Haemoglobin and albumin are negatively correlated
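
Reading pairs off a large heatmap is error-prone, so one option is to list them programmatically. The sketch below uses made-up measurements (the `toy` frame is an assumption, not the kidney data) to show how the correlation matrix can be flattened into a ranked list of pairs.

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration only
toy = pd.DataFrame({
    'haemoglobin': [15.0, 13.5, 11.0, 9.5, 8.0],
    'packed cell volume': [45, 40, 33, 29, 24],
    'blood urea': [30, 45, 80, 110, 150],
})

corr = toy.corr(method='pearson')

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Strongest relationships first, regardless of sign
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Applied to the real data, the same steps would take `kidney[num_col]` in place of `toy`.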

In [None]:
kidney.groupby(['red blood cells','class'])['red blood cell count'].agg(['count','mean','median','min','max'])

We can observe that the non-diseased group has 134 records and a high mean red blood cell count, whereas the diseased groups have only 25-40 records each and a lower mean.

#### Relationship between haemoglobin and packed cell volume

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(x=kidney.haemoglobin,y=kidney['packed cell volume'])
plt.xlabel('Haemoglobin')
plt.ylabel('packed cell volume')
plt.title('Relationship between haemoglobin and packed cell volume')


We can see that there is a linear relationship between haemoglobin and packed cell volume.

### Analyse distribution of red blood cell count for chronic and non-chronic cases

In [None]:
grid=sns.FacetGrid(kidney,hue='class',aspect=2)
grid.map(sns.kdeplot,'red blood cell count')
grid.add_legend()

From the plot above, we can say that a person with a lower red blood cell count has a higher chance of having chronic disease.

In [None]:
grid=sns.FacetGrid(kidney,hue='class',aspect=2)
grid.map(sns.kdeplot,'haemoglobin')
grid.add_legend()

In [None]:
plt.figure(figsize=(12,10))
sns.scatterplot(x=kidney['red blood cell count'],y=kidney['packed cell volume'],hue=kidney['class'])
plt.xlabel('red blood cell count')
plt.ylabel('packed cell volume')
plt.title('Relationship between red blood cell count and packed cell volume')


In [None]:
plt.figure(figsize=(12,10))
sns.scatterplot(x=kidney['red blood cell count'],y=kidney['haemoglobin'],hue=kidney['class'])
plt.xlabel('red blood cell count')
plt.ylabel('haemoglobin')
plt.title('Relationship between haemoglobin and red blood cell count')

* We can see that there is some degree of linearity in all of these relationships
* Patients with haemoglobin below roughly 13-14 are positive for chronic disease, whereas patients with haemoglobin near 18 are negative

## Handling Missing Values

In [None]:
kidney.isnull().sum()

In [None]:
kidney.isnull().sum().sort_values(ascending=False)

We can fill these missing values with the mean, median or standard deviation.

In [None]:
plt.subplot(1,2,1)
sns.boxplot(x=kidney['class'],y=kidney['age'])

In [None]:
list(enumerate(cat_col))

In [None]:
plt.figure(figsize=(15,15))
for i,feature in enumerate(num_col):
    plt.subplot(4,4,i+1)
    sns.boxplot(x='class',y=feature,data=kidney)

There are outliers in the dataset, so filling missing values with the mean is not appropriate; I will use the median instead.
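
A small worked example of why the median is preferred here, using made-up numbers: a single outlier pulls the mean far from the typical value, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Four typical values, one outlier and one missing entry (illustrative only)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

print(values.mean())    # 22.0 -> dragged up by the outlier
print(values.median())  # 3.0  -> still representative

# Fill the missing entry with the median
filled = values.fillna(values.median())
```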

In [None]:
kidney[num_col].mean()    # mean of the numeric columns only; the categorical columns are still strings

In [None]:
kidney.isnull().sum()

In [None]:
for i in num_col:
    kidney[i]=kidney[i].fillna(kidney[i].median())    # assign back rather than relying on inplace=True on a column selection

In [None]:
kidney.isnull().sum()

In [None]:
kidney.describe()

#### Filling missing values in categorical columns using random values

In [None]:
kidney['red blood cells'].isnull().sum()

In [None]:
random_sample=kidney['red blood cells'].dropna().sample(152)

In [None]:
random_sample

In [None]:
kidney[kidney['red blood cells'].isnull()].index

In [None]:
random_sample.index

We can see that the indexes are different. For the random values to be assigned to the right rows, the index of the sample must match the index of the missing rows.

In [None]:
random_sample.index=kidney[kidney['red blood cells'].isnull()].index    # re-label the sample so its index matches the rows with missing values

In [None]:
random_sample.index

In [None]:
kidney.loc[kidney['red blood cells'].isnull(),'red blood cells']=random_sample

In [None]:
kidney.head()

In [None]:
kidney['red blood cells'].isnull().sum()

In [None]:
sns.countplot(x=kidney['red blood cells'])       # check that the ratio did not change after filling missing values

The ratio did not change.

In [None]:
#filling random values in all categorical columns
def Random_value_Imputation(feature):
    random_sample=kidney[feature].dropna().sample(kidney[feature].isnull().sum())
    random_sample.index=kidney[kidney[feature].isnull()].index
    kidney.loc[kidney[feature].isnull(),feature]=random_sample
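
The function above depends on the global `kidney` frame and draws a different sample on every run. Below is a hedged, self-contained variant on made-up data. The name `random_value_imputation`, the `random_state` argument and the use of `replace=True` are assumptions added for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def random_value_imputation(df, feature, random_state=0):
    """Fill missing entries of `feature` with values drawn from its observed entries."""
    missing = df[feature].isnull()
    # replace=True (an assumption) lets this work even when more values are
    # missing than observed; random_state makes the draw reproducible
    sample = df[feature].dropna().sample(
        missing.sum(), replace=True, random_state=random_state
    )
    # Re-label the sample so it aligns with the rows that are missing
    sample.index = df.index[missing]
    df.loc[missing, feature] = sample
    return df

# Hypothetical column with two missing entries
toy = pd.DataFrame({'rbc': ['normal', np.nan, 'abnormal', 'normal', np.nan, 'normal']})
toy = random_value_imputation(toy, 'rbc')
print(toy['rbc'].isnull().sum())  # 0
```

Because the fill values are drawn from the observed entries, the category proportions stay roughly the same, which is the property checked with the count plot earlier.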

In [None]:
Random_value_Imputation(' pus cell')     # only this column, because it has a higher number of missing values

In [None]:
kidney.isnull().sum()

For categorical variables with only a few missing values, we can fill the gaps with the mode.

In [None]:
def impute_mode(feature):
    mode=kidney[feature].mode()[0]
    kidney[feature]=kidney[feature].fillna(mode)

In [None]:
for col in cat_col:
    impute_mode(col)

In [None]:
kidney[cat_col].isnull().sum()

In [None]:
kidney.isnull().sum()

We can see that there are no missing values now.

## Feature Encoding

In [None]:
for col in cat_col:
    print('{} has {} categories'.format(col,kidney[col].nunique()))

In [None]:
## Label Encoding  ---> Because each column has only a few categories

## LabelEncoder assigns codes in sorted (alphabetical) order:
## abnormal -- 0
## normal -- 1

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le=LabelEncoder()

In [None]:
for col in cat_col:
    kidney[col]=le.fit_transform(kidney[col])
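
`LabelEncoder` assigns integer codes in sorted order of the labels, so it is worth inspecting `classes_` to confirm which code each category received. A minimal sketch on made-up values:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, not read from the dataset
values = ['normal', 'abnormal', 'normal']

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the labels in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'abnormal': 0, 'normal': 1}
print(codes.tolist())  # [1, 0, 1]
```

Note that the loop above refits the same encoder for every column, so only the mapping of the last column remains available afterwards. Keeping one encoder per column would be needed to invert the codes later.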

In [None]:
kidney.head()

## Selecting important features

In [None]:
from sklearn.feature_selection import SelectKBest

In [None]:
from sklearn.feature_selection import chi2

In [None]:
ind_col=[col for col in kidney.columns if col!='class']
dep_col='class'

In [None]:
X=kidney[ind_col]
y=kidney[dep_col]

In [None]:
X.head()

In [None]:
imp_features=SelectKBest(score_func=chi2,k=20)

In [None]:
imp_features=imp_features.fit(X,y)

In [None]:
imp_features

In [None]:
imp_features.scores_

In [None]:
datascore=pd.DataFrame(imp_features.scores_,columns=['Score'])

In [None]:
datascore

In [None]:
X.columns

In [None]:
dfcols=pd.DataFrame(X.columns)

In [None]:
dfcols

In [None]:
features_rank=pd.concat([dfcols,datascore],axis=1)
features_rank

In [None]:
features_rank.columns=['features','score']

In [None]:
features_rank

In [None]:
features_rank.nlargest(10,'score')

In [None]:
selected=features_rank.nlargest(10,'score')['features'].values

In [None]:
selected
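
The ranking table built step by step above can also be produced more compactly by pairing the scores with the column names in one Series. The sketch below uses a made-up two-feature dataset (`X_toy`, `y_toy` are assumptions) in which one feature separates the classes and the other does not.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 'signal' differs between classes, 'noise' has the same values in both
X_toy = pd.DataFrame({
    'signal': [0, 0, 0, 10, 10, 10],
    'noise': [1, 2, 3, 1, 2, 3],
})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# Scores labelled by feature name, highest first
ranking = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(ranking)

# get_support() marks the k features the selector would keep
chosen = X_toy.columns[selector.get_support()].tolist()
print(chosen)  # ['signal']
```

Keep in mind that `chi2` expects non-negative feature values, which holds for the encoded and imputed columns used here.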

In [None]:
X_new=kidney[selected]

In [None]:
X_new.head()

In [None]:
len(X_new)

In [None]:
X_new.shape

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_new,y,random_state=0,test_size=0.3)

In [None]:
X_train.shape

In [None]:
y_train.value_counts()    # check for class imbalance
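
If the class counts turn out to be uneven, one option is to pass `stratify` to `train_test_split` so that both splits keep the original class ratio. This is a sketch on made-up labels, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 12 samples of class 0 and 8 samples of class 1 (60% / 40%)
X_toy = pd.DataFrame({'feature': np.arange(20)})
y_toy = pd.Series([0] * 12 + [1] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Both splits preserve the 60/40 ratio
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```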

## XGBoost Classifier

Since XGBoost is a tree-based model, feature scaling is not required.

In [None]:
from xgboost import XGBClassifier

In [None]:
params={'learning_rate':[0.05,0.20,0.25],    # key needs an underscore to be recognised; 0.05 is assumed in place of the original "0,0.5"
        'max_depth':[5,8,10],
       'min_child_weight':[1,3,5,7],
       'gamma':[0.0,0.1,0.2,0.4],
       'colsample_bytree':[0.3,0.4,0.7]}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
classifier=XGBClassifier()

In [None]:
random_search=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

In [None]:
random_search.fit(X_train,y_train)

In [None]:
random_search.best_estimator_    #Checking for best model

In [None]:
random_search.best_params_

In [None]:
classifier=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, gamma=0.2, gpu_id=-1,
              importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0,
              max_depth=5, min_child_weight=1,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [None]:
classifier.fit(X_train,y_train)
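
With the default `refit=True`, `RandomizedSearchCV` retrains the best configuration on the full training data, so the tuned model can be taken from `best_estimator_` instead of copying its parameters by hand. The sketch below illustrates this with synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost; the data, grid and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data for illustration
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# Small hypothetical grid; note the underscore in 'learning_rate'
param_grid = {'learning_rate': [0.05, 0.1, 0.2], 'max_depth': [2, 3]}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    scoring='roc_auc',
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# Already fitted on all of X_demo with the best parameters found
best_model = search.best_estimator_
preds = best_model.predict(X_demo)
print(search.best_params_)
```

Reusing `best_estimator_` also avoids carrying over version-specific arguments that appear in the printed representation of the model.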

## Prediction

In [None]:
y_pred=classifier.predict(X_test)

In [None]:
y_pred

## Evaluation of the model

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
accuracy_score(y_test,y_pred)

The XGBoost model achieves very good accuracy on the test set.
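
Accuracy alone does not show which class the errors fall in, which matters for a medical screening task. As a sketch on made-up labels (not the notebook's predictions), the confusion matrix can be unpacked into precision and recall:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative true and predicted labels (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
print(precision, recall)  # 0.75 0.75
```

Which label counts as the positive class in the real data depends on how `LabelEncoder` coded the `class` column, so that mapping should be checked before interpreting these numbers.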