# Preprocessing of Chronic Kidney Disease Data

Chronic kidney disease (CKD), also known as chronic renal disease, involves conditions that damage your kidneys and decrease their ability to keep you healthy.

* This is my second notebook, an extension of the EDA one.
* The first notebook can be found here for your reference: https://www.kaggle.com/chayan8/chronic-kidney-disease-explored

## Here I will be using MICE (Multivariate Imputation by Chained Equations)

**Impyute** and **FancyImpute** are libraries designed specifically for imputation. Two of their most useful techniques are KNN and MICE.

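KNN (K-Nearest Neighbors) finds the rows most similar to the one with a missing value, based on the features that are observed, and imputes the missing value with the average of those neighbors' values.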
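To make the idea concrete, here is a minimal NumPy sketch of nearest-neighbor imputation. It is not FancyImpute's implementation: the helper `knn_impute_value` and the toy array are hypothetical, and it assumes an unweighted mean over fully observed rows only.

```python
import numpy as np

def knn_impute_value(X, row, col, k=2):
    """Impute X[row, col] as the mean of that column over the k nearest complete rows."""
    # Candidate donors: rows where every feature is observed
    donors = X[~np.isnan(X).any(axis=1)]
    # Measure distance only on the features observed in the target row
    observed = ~np.isnan(X[row])
    dists = np.sqrt(((donors[:, observed] - X[row, observed]) ** 2).sum(axis=1))
    # Average the target column over the k closest donors
    nearest = np.argsort(dists)[:k]
    return donors[nearest, col].mean()

X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [9.0, 90.0],
    [1.5, np.nan],   # second feature is missing here
])
imputed = knn_impute_value(X, row=3, col=1, k=2)
print(imputed)  # 15.0, the mean of the two closest rows (10.0 and 20.0)
```

The row with the value 9.0 is far away in the first feature, so it does not influence the imputed value.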

MICE (Multivariate Imputation by Chained Equations): What a heavy name!

Simply put, MICE considers the feature with missing values as a dependent variable, and the remaining features as the predictors.

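MICE cycles through the features with missing values, fits a regression model for each one, and fills the gaps with that model's predictions. The cycle is repeated, each time using the latest imputed values as predictors, until the imputations stabilize.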
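The chained-equations idea can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions (ordinary least squares, a fixed number of passes, mean initialization), not Impyute's actual `mice` implementation; the function `chained_impute` and the toy data are hypothetical.

```python
import numpy as np

def chained_impute(X, n_iter=10):
    """Toy chained-equations imputer: regress each incomplete column on the others."""
    X = np.array(X, dtype=float)          # work on a copy
    missing = np.isnan(X)
    # Start by filling every gap with its column mean
    col_means = np.nanmean(X, axis=0)
    X[missing] = col_means[np.where(missing)[1]]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Predictors: an intercept plus all other columns (with current imputations)
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            # Fit on rows where column j was observed, then predict the missing rows
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ coef
    return X

data = np.array([
    [1.0, 2.1, 3.0],
    [2.0, np.nan, 6.1],
    [3.0, 6.2, np.nan],
    [4.0, 7.9, 12.0],
    [5.0, 10.1, 15.2],
])
filled = chained_impute(data)
print(filled.round(2))
```

Observed entries are never overwritten; only the positions that were originally missing are updated on each pass.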

I will be using MICE for numerical data and KNN for categorical data.

Results? Usually better than mean imputation, because the imputed values take the relationships between features into account.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install impyute

In [None]:
from impyute.imputation.cs import mice
from sklearn.preprocessing import OrdinalEncoder

In [None]:
df = pd.read_csv('../input/ckdisease/kidney_disease.csv')

In [None]:
cols_names={"bp":"blood_pressure",
          "sg":"specific_gravity",
          "al":"albumin",
          "su":"sugar",
          "rbc":"red_blood_cells",
          "pc":"pus_cell",
          "pcc":"pus_cell_clumps",
          "ba":"bacteria",
          "bgr":"blood_glucose_random",
          "bu":"blood_urea",
          "sc":"serum_creatinine",
          "sod":"sodium",
          "pot":"potassium",
          "hemo":"haemoglobin",
          "pcv":"packed_cell_volume",
          "wc":"white_blood_cell_count",
          "rc":"red_blood_cell_count",
          "htn":"hypertension",
          "dm":"diabetes_mellitus",
          "cad":"coronary_artery_disease",
          "appet":"appetite",
          "pe":"pedal_edema",
          "ane":"anemia"}

df.rename(columns=cols_names, inplace=True)

In [None]:
df['red_blood_cell_count'] = pd.to_numeric(df['red_blood_cell_count'], errors='coerce')
df['packed_cell_volume'] = pd.to_numeric(df['packed_cell_volume'], errors='coerce')
df['white_blood_cell_count'] = pd.to_numeric(df['white_blood_cell_count'], errors='coerce')

In [None]:
df.drop(["id"],axis=1,inplace=True)

In [None]:
numerical_features = []
categorical_features = []

for i in df.drop('classification', axis=1).columns:
    if df[i].nunique()>7:
        numerical_features.append(i)
    else:
        categorical_features.append(i)

In [None]:
# Replace incorrect values (labels with stray tabs or spaces)
df['diabetes_mellitus'] = df['diabetes_mellitus'].replace(to_replace = {'\tno':'no','\tyes':'yes',' yes':'yes'})
df['coronary_artery_disease'] = df['coronary_artery_disease'].replace(to_replace = '\tno', value='no')
df['classification'] = df['classification'].replace(to_replace = 'ckd\t', value = 'ckd')

In [None]:
df.loc[:,categorical_features].isnull().sum().sort_values(ascending=False)

In [None]:
df.loc[:,numerical_features].isnull().sum().sort_values(ascending=False)

## Encoding categorical features with Object type

In [None]:
to_encode = [feat for feat in categorical_features if df[feat].dtype=='object']

In [None]:
to_encode

In [None]:
ode = OrdinalEncoder(dtype = int)

In [None]:
def encode(data):
    '''function to encode the non-null values of a column, leaving NaNs in place'''
    # work on a copy so the caller decides where the result is stored
    data = data.copy()
    # retain only non-null values
    nonulls = np.array(data.dropna())
    # reshape to a single column, as OrdinalEncoder expects 2-D input
    impute_reshape = nonulls.reshape(-1,1)
    # encode data
    impute_ordinal = ode.fit_transform(impute_reshape)
    # assign encoded values back to the non-null positions
    data.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return data

# iterate through each column to encode and write the result back to df
for columns in to_encode:
    df[columns] = encode(df[columns])

In [None]:
df.loc[:, categorical_features].head(10)

So, they're ordinal-encoded now, and the missing values are still NaN, ready to be imputed.
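For comparison, the same "encode everything except the missing values" idea can be sketched with pandas categorical codes. This is an alternative to the `OrdinalEncoder` approach used here, shown on a hypothetical toy Series; pandas marks missing entries with the code -1, which is turned back into NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series(["yes", np.nan, "no", "yes"])

# Categories are sorted, so "no" -> 0 and "yes" -> 1; missing values get code -1
codes = s.astype("category").cat.codes.astype("float64")

# Restore NaN wherever the original value was missing
codes = codes.where(codes >= 0)
print(codes.tolist())  # [1.0, nan, 0.0, 1.0]
```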

In [None]:
X = df.drop('classification', axis=1)

In [None]:
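# .loc slicing includes both endpoints, so use .iloc to keep row 300 out of the training set
X_train = X.iloc[:300].copy()
X_test = X.iloc[300:].copy()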
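One detail worth keeping in mind when splitting by row position: label-based slicing with `.loc` includes both endpoints, while position-based slicing with `.iloc` excludes the stop. The small sketch below, on a hypothetical ten-row frame, shows the difference.

```python
import pandas as pd

X_toy = pd.DataFrame({"value": range(10)})

# Label-based slices: row 5 ends up in both parts
loc_train, loc_test = X_toy.loc[:5], X_toy.loc[5:]
shared = loc_train.index.intersection(loc_test.index)
print(len(loc_train), len(loc_test), list(shared))   # 6 5 [5]

# Position-based slices: the two parts are disjoint
iloc_train, iloc_test = X_toy.iloc[:5], X_toy.iloc[5:]
print(len(iloc_train), len(iloc_test))               # 5 5
```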

## Imputing numerical features using MICE

In [None]:
# MICE requires float values
X_train_numerical = X_train.loc[:,numerical_features].astype('float64')

In [None]:
# Passing the numpy arrays to mice
X_train_numerical_imputed = mice(X_train_numerical.values)

In [None]:
X_train.loc[:,numerical_features].isna().sum().sort_values(ascending=False)

In [None]:
X_train.loc[:,numerical_features] = X_train_numerical_imputed

In [None]:
X_train.loc[:,numerical_features].isna().sum().sort_values(ascending=False)

Now, all the numerical features for training data are imputed. Let's take a look at the categorical features now.

## Imputing Categorical features

Here I'll be using the KNN function from FancyImpute for the task. Note that KNN outputs float values, so I'll round them to integers to preserve their categorical nature.

In [None]:
from fancyimpute import KNN

In [None]:
imputer = KNN()

The process is quick, and progress is printed every 100 rows. We need to round the values because KNN will produce floats. Rounding the whole DataFrame would round the numerical columns as well, so be sure to leave any features you do not want rounded out of the rounding step.

In [None]:
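X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train), columns = X_train.columns)
X_train_imputed[to_encode] = X_train_imputed[to_encode].round()  # round only the encoded categorical columns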

In [None]:
X_train_imputed.isnull().sum()

Now, the data is imputed.

## Scaling Data

In [None]:
X_train_imputed.describe().T

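Let's scale the data now, as the ranges vary widely across features. Here I'll use MinMaxScaler, as it rescales each feature linearly and so does not change the shape of the underlying distribution. Outliers are kept as they are, although they do determine the minimum and maximum used for scaling.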
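Min-max scaling maps each value to `(x - min) / (max - min)`, computed per feature. The NumPy sketch below applies that formula to a hypothetical column containing one outlier; the helper `min_max_scale` is illustrative and assumes the column is not constant.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array linearly to the range [0, 1]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    # Assumes hi > lo; a constant column would divide by zero here
    return (column - lo) / (hi - lo)

values = [10.0, 20.0, 30.0, 110.0]   # 110.0 plays the role of an outlier
scaled = min_max_scale(values)
print(scaled)
```

The ordering and relative spacing of the values are preserved, but the outlier pushes the remaining values toward the lower end of the range.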

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
scaler.fit(X_train_imputed)
X_train_scaled = scaler.transform(X_train_imputed)

In [None]:
X_train_scaled = pd.DataFrame(data=X_train_scaled, columns = X_train.columns)

In [None]:
X_train_scaled.describe()

Now, the data is on similar scales and ready to be modeled. The same steps will also be applied to the test set.
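When the same scaling is applied to a test set, a common practice is to reuse the minimum and maximum learned from the training data rather than fitting the scaler again. The sketch below illustrates this with scikit-learn's `MinMaxScaler` on hypothetical toy arrays; `toy_scaler` is a separate object from the scaler used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.0], [12.0]])

# Learn min and max from the training data only
toy_scaler = MinMaxScaler()
toy_scaler.fit(train)

# Apply the training-set range to the test data without refitting
test_scaled = toy_scaler.transform(test)
print(test_scaled.ravel())
```

Test values outside the training range can fall outside [0, 1], which is expected when the training statistics are reused.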

## Test Data

In [None]:
# MICE requires float values
X_test_numerical = X_test.loc[:,numerical_features].astype('float64')

In [None]:
X_test_numerical_imputed = mice(X_test_numerical.values)
X_test.loc[:,numerical_features] = X_test_numerical_imputed

In [None]:
X_test_imputed = pd.DataFrame(imputer.fit_transform(X_test), columns = X_test.columns)
X_test_imputed[to_encode] = X_test_imputed[to_encode].round()  # round only the encoded categorical columns

In [None]:
# Reuse the scaler fitted on the training data; do not refit it on the test set
X_test_scaled = scaler.transform(X_test_imputed)

In [None]:
X_test_scaled = pd.DataFrame(data=X_test_scaled, columns = X_test.columns)

Now, both the train and test data are ready. In my next notebook, I will be trying different models and doing hyperparameter tuning using Hyperopt. Thank you!

# Please upvote if you liked my work :-)