Description
The data was collected over a 2-month period in India and has 25 features (e.g., red blood cell count, white blood cell count). The target is 'classification', which is either 'ckd' (chronic kidney disease) or 'notckd'. The goal is to use machine learning techniques to predict whether or not a patient has chronic kidney disease.

Credit goes to Mansoor Iqbal (https://www.kaggle.com/mansoordaku), from whom the dataset was obtained. Certain modifications have been made to the dataset for the purpose of creating a challenge.

The original dataset is available at Chronic Kidney Disease (https://www.kaggle.com/mansoordaku/ckdisease).

## **Feature Details**
Attribute Information:

We use 25 features + class = 26 columns (12 numeric, 14 nominal)

Id(numerical) - Patient Id

Age(numerical) - age in years

Blood Pressure(numerical) - bp in mm/Hg

Specific Gravity(nominal) - sg - (1.005,1.010,1.015,1.020,1.025)

Albumin(nominal) - al - (0,1,2,3,4,5)

Sugar(nominal) - su - (0,1,2,3,4,5)

Red Blood Cells(nominal) - rbc - (normal,abnormal)

Pus Cell (nominal) - pc - (normal,abnormal)

Pus Cell clumps(nominal) - pcc - (present,notpresent)

Bacteria(nominal) - ba - (present,notpresent)

Blood Glucose Random(numerical) - bgr in mgs/dl

Blood Urea(numerical) - bu in mgs/dl

Serum Creatinine(numerical) - sc in mgs/dl

Sodium(numerical) - sod in mEq/L

Potassium(numerical) - pot in mEq/L

Hemoglobin(numerical) - hemo in gms

Packed Cell Volume(numerical) - pcv

White Blood Cell Count(numerical) - wc in cells/cumm

Red Blood Cell Count(numerical) - rc in millions/cmm

Hypertension(nominal) - htn - (yes,no)

Diabetes Mellitus(nominal) - dm - (yes,no)

Coronary Artery Disease(nominal) - cad - (yes,no)

Appetite(nominal) - appet - (good,poor)

Pedal Edema(nominal) - pe - (yes,no)

Anemia(nominal) - ane - (yes,no)

Class (nominal)- class - (ckd,notckd)

Acknowledgements
https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease






In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
Raw_train_df=pd.read_csv("/kaggle/input/chronic-kidney-disease/kidney_disease_train.csv")
Raw_test_df=pd.read_csv("/kaggle/input/chronic-kidney-disease/kidney_disease_test.csv")

In [None]:
#Raw_train_df=pd.read_csv("kidney_disease_train.csv")
#Raw_test_df=pd.read_csv("kidney_disease_test.csv")

In [None]:
Raw_test_df.head()

In [None]:
Raw_train_df.head()

In [None]:
print(f'Shape of Train Data {Raw_train_df.shape}')
print(f'\nShape of Test Data {Raw_test_df.shape}')

The train data has 280 rows and 26 columns (the "classification" column is the label).

The test data has 120 rows and 25 columns (without the label column, which needs to be predicted).

In [None]:
# The train data has one extra column compared with the test data: the "classification" label
Raw_train_df.shape


In [None]:
Raw_train_df.describe().T

In [None]:
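# numeric_only=True skips the object columns, which recent pandas versions no longer drop silently
Raw_train_df.skew(numeric_only=True)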

From the describe output above (which covers only the columns that pandas detects as numerical; other numerical variables may be read as objects because of bad entries in the data), we can make the following observations.

1) Ideally each column should have a count of 280, so the "count" field reveals the missing values.

2) id is just a numerical identifier for each patient, starting at 0 and ending at 399. It carries no predictive information, so it can be removed.

3) age varies from 2 to 90 years. The mean is slightly less than the median, and the skewness confirms that the distribution is slightly left skewed.

4) bp varies from 50 to 180. The 2nd and 3rd quartiles are the same, which means that many entries (at least 25% of the data) have a bp value of 70. The mean is greater than the median, which indicates that the distribution is right skewed, with skewness > 2.

5) sg, al and su are categorical variables. The feature details list them as nominal, but their levels have a natural order, so they are treated as ordinal later on.

6) bgr ranges from 22 to 490 and is right skewed, with skewness 1.96.

7) bu ranges from 10 to 391 and its mean is greater than its median, so it is right skewed, with skewness 2.95.

8) sc ranges from 0.4 to 76 and is heavily right skewed, with skewness 8.28.

9) sod appears to be left skewed, with skewness -7.1, but it has many missing values.

10) pot is right skewed (skewness 9.86) and has many missing values.

11) hemo is slightly left skewed.

12) pcv is slightly left skewed.
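The relationship between mean, median and skewness used in these observations can be checked on a toy column. The sketch below uses made-up values rather than the kidney data, and the name `toy` is purely illustrative.

```python
import pandas as pd

# Made-up, right-skewed values: most are small, a few are very large
toy = pd.Series([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 15.0, 40.0])

mean_val = toy.mean()
median_val = toy.median()
skew_val = toy.skew()  # sample skewness; a positive value means a longer right tail

# In a right-skewed column the large values pull the mean above the median
print(f"mean={mean_val:.2f}, median={median_val:.2f}, skew={skew_val:.2f}")
```

A mean above the median together with a positive skewness is the pattern reported for bp, bgr, bu, sc and pot; the reverse pattern corresponds to the left-skewed columns.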

In [None]:
# From the data description in the problem statement, these are the numerical and categorical (nominal) columns/variables.
# The target variable "classification" is excluded from both lists.
cat_var=['sg','al','su','rbc','pc','pcc','ba','htn','dm','cad','appet','pe','ane']
num_var=['id','age','bp','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc']
print(f'Number of categorical variables excluding the label is {len(cat_var)}')
print(f'Number of numerical variables excluding the label is {len(num_var)}')



In [None]:
Raw_train_df.dtypes

Looking at the dtypes of the numerical variables, 'wc' and 'rc' are identified by pandas as objects instead of integers/floats. Let's explore the reason for this.

In [None]:
Raw_train_df[Raw_train_df['wc'].map(lambda x:type(x)==str)].wc.value_counts()


In [None]:
Raw_train_df[Raw_train_df['rc'].map(lambda x:type(x)==str)].rc.value_counts()

The 'rc' and 'wc' columns contain garbage characters ("\t" and "?"). We need to replace these with NaN before casting the columns to numeric.

In [None]:
train_df=Raw_train_df.copy()
train_df['wc']=pd.to_numeric(train_df['wc'],errors='coerce')
train_df['rc']=pd.to_numeric(train_df['rc'], errors='coerce')
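To make the effect of `errors='coerce'` concrete, here is a minimal sketch on made-up strings that imitate the garbage described above. The explicit `str.strip()` is an added precaution for the tab prefix and is not part of the original cell; the name `raw` is hypothetical.

```python
import pandas as pd

# Made-up entries imitating the garbage seen in 'wc'/'rc': a tab prefix and a lone '?'
raw = pd.Series(["7800", "\t6200", "\t?", None, "5.2"])

# Strip surrounding whitespace explicitly, then turn anything non-numeric into NaN
cleaned = pd.to_numeric(raw.str.strip(), errors="coerce")

print(cleaned.tolist())
print("missing after cleaning:", int(cleaned.isna().sum()))
```

The "?" entry and the original missing value both end up as NaN, so they can be handled together in the imputation step.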


In [None]:
#Now all numeric columns/variables are of type integer/float.
train_df.dtypes

In [None]:
#Missing values for each column (count in the True column).
train_df.apply(lambda x: x.isna().value_counts()).T

Many columns have missing values, with 'rbc' having the most (107). Since we cannot ignore columns with a large number of missing values, we need to impute them properly during preprocessing.
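A more compact view of the same information is a per-column count and percentage of missing values, sorted so the worst columns come first. The sketch below runs on a small made-up frame (`toy` is a hypothetical name), not on `train_df`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps
toy = pd.DataFrame({
    "rbc": ["normal", np.nan, np.nan, "abnormal"],
    "age": [48.0, 62.0, np.nan, 51.0],
    "bp": [80.0, 70.0, 90.0, 70.0],
})

# Count missing values per column, most affected columns first
missing = toy.isna().sum().sort_values(ascending=False)
missing_pct = (missing / len(toy) * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```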

#### Analysing the Distribution of Categorical Variables with Respect to the Target Variable

In [None]:
for feature in cat_var:
  plt.figure(figsize = (5,5))
  sns.countplot(x = feature, hue = 'classification', data = train_df, order = train_df[feature].value_counts().index)
  plt.title(feature)


1) For "sg" values less than or equal to 1.015, there are only ckd cases (that is, the lower the sg value, the higher the chance of chronic kidney disease).

2) For "al"/"su" values greater than or equal to 1, there are only ckd cases (that is, the higher the al/su value, the higher the chance of chronic kidney disease).

3) For "rbc"/"pc", if the value is abnormal, then it is CKD.

4) For "pcc"/"ba", if the value is present, then it is CKD.

5) There seem to be extra whitespace characters in the "dm"/"cad" columns. We need to trim the values in these columns and correct them.

6) For "htn"/"dm"/"cad"/"pe"/"ane", if the value is yes, then it is CKD.

7) For "appet", if the value is poor, then it is CKD.

###  Missing Values Imputation

In the earlier section, we removed the garbage data ("?", "\t") from the numerical columns by replacing it with NaN.

We now need to treat the categorical variables as well. In the EDA, we observed that the "dm"/"cad" columns contain extra whitespace. Let's fix them.

In [None]:
# Checking for garbage or wrong values in Categorical variables.
for col in cat_var:
    print(f"Values counts for {col} are \n {train_df[col].value_counts()}")

In [None]:
#Removing the Extra tab character
train_df['dm']=train_df.dm.replace("\tno","no")
train_df['dm']=train_df.dm.replace("\tyes","yes")

train_df['cad']=train_df.cad.replace("\tno","no")
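An alternative to replacing each bad value one by one is to strip whitespace from the whole column, which also covers variants that have not been spotted yet. This is a sketch on made-up values, with the hypothetical names `dm` and `dm_clean`.

```python
import pandas as pd

# Made-up values imitating the stray tabs and spaces seen in 'dm'/'cad'
dm = pd.Series(["yes", "\tno", " yes", "no", "\tyes", None])

# str.strip() removes leading/trailing whitespace (tabs included) and leaves missing values missing
dm_clean = dm.str.strip()

print(dm_clean.value_counts())
```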

In [None]:
#Cross Checking the Replacement
print(train_df.dm.value_counts())
print(train_df.cad.value_counts())

In [None]:
#Missing Values For Each Column (Count in True Column). 

train_df.apply(lambda x: x.isna().value_counts()).T

Since the train data is small, with just 280 rows, and many columns have missing values, we cannot simply delete the rows with missing values.
Instead, we impute the missing values with the median for continuous variables and the mode for categorical variables.

In [None]:
for col in num_var:
    print(f'Imputing for {col} with {train_df[col].median()}')
    train_df[col]=train_df[col].fillna(train_df[col].median())


In [None]:
for col in cat_var:
    print(f'Imputing for {col} with {train_df[col].mode()[0]}')
    train_df[col]=train_df[col].fillna(train_df[col].mode()[0])
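If the same imputation later has to be applied to the test file, one option is to store the fill values learned from the train data and reuse them. The sketch below shows the idea on two tiny made-up frames; `train`, `test` and `fill_values` are hypothetical names, and this is only one possible way to organise the step.

```python
import numpy as np
import pandas as pd

# Made-up train/test frames with one numerical and one categorical column
train = pd.DataFrame({"age": [40.0, 50.0, np.nan, 70.0],
                      "appet": ["good", "good", None, "poor"]})
test = pd.DataFrame({"age": [np.nan, 65.0],
                     "appet": [None, "poor"]})

num_cols, cat_cols = ["age"], ["appet"]

# Learn the fill values from the train data only: median for numerical, mode for categorical
fill_values = {col: train[col].median() for col in num_cols}
fill_values.update({col: train[col].mode()[0] for col in cat_cols})

# Apply the same values to both frames
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```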

In [None]:
#Cross checking for any more Missing Values For Each Column (Count in True Column). 

train_df.apply(lambda x: x.isna().value_counts()).T

### Outlier Treatment

In [None]:
train_num_df=train_df[num_var].copy()

In [None]:
train_num_df.head()

In [None]:
train_num_df.shape

In [None]:
from scipy.stats import zscore
train_num_zscore=train_num_df.apply(zscore)

In [None]:
train_num_zscore[~(np.abs(train_num_zscore) < 3).all(axis=1)].shape

In [None]:

train_num_zscore[~(np.abs(train_num_zscore) < 3).all(axis=1)]

In [None]:
(~(np.abs(train_num_zscore) < 3)).sum(axis=0)

Based on the z-score analysis, around 25 rows contain at least one outlier.

The number of outliers in each column is shown above.

The bgr and bu columns have the highest number of outliers (7).

In [None]:
from scipy import stats
# Replace every value whose absolute z-score is 3 or more with the column median
for col in num_var:
    print(f'Replacing outliers in {col} with {train_df[col].median()}')
    train_df.loc[(np.abs(stats.zscore(train_num_df[col])) >= 3), col] = train_df[col].median()
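The z-score rule used above can be traced by hand on a small made-up column. The sketch computes the z-score directly with pandas, using the population standard deviation (`ddof=0`), which is the default convention of `scipy.stats.zscore`; the names `values` and `treated` are hypothetical.

```python
import pandas as pd

# Made-up column: 20 ordinary readings plus one extreme value
values = pd.Series([48.0, 50.0, 52.0, 49.0, 51.0] * 4 + [500.0])

# Population z-score (ddof=0)
z = (values - values.mean()) / values.std(ddof=0)
is_outlier = z.abs() >= 3

# Replace flagged values with the column median, leaving the rest untouched
median_val = values.median()
treated = values.mask(is_outlier, median_val)

print("outliers found:", int(is_outlier.sum()))
print("max before:", values.max(), "max after:", treated.max())
```

Note that a single extreme value inflates both the mean and the standard deviation, so in very short columns even a large value may not reach a z-score of 3.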


## Dimensionality Reduction

In [None]:
train_df.head()

In [None]:
train_df.dtypes

#### Converting Non Numeric Categorical Variables into Numeric.

In [None]:
cat_var

Out of the 14 categorical variables, 3 (sg, al, su) are ordinal. These are already in numeric form.

The remaining 11 categorical variables (rbc, pc, pcc, ba, htn, dm, cad, appet, pe, ane, classification) are nominal. They have only 2 values each (i.e. normal/abnormal, yes/no, etc.), so we can convert them to binary 0/1 values.

In [None]:
# Creating a Dictionary to Replace the Non Numerical values with Numeric for the Categorical variables identified above.
cat_nom_dict = {"rbc":     {"normal": 1, "abnormal": 0},
                "pc":     {"normal": 1, "abnormal": 0},
                "pcc":     {"present": 1, "notpresent": 0},
                "ba":     {"present": 1, "notpresent": 0},
                "htn":     {"yes": 1, "no": 0},
                "dm":     {"yes": 1, "no": 0},
                "cad":     {"yes": 1, "no": 0},
                "pe":     {"yes": 1, "no": 0},
                "ane":     {"yes": 1, "no": 0},
                "appet":     {"good": 1, "poor": 0},
                "classification":     {"ckd": 1, "notckd": 0} 
               }

In [None]:
cat_nom_dict

In [None]:
train_df.replace(cat_nom_dict, inplace=True)
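`DataFrame.replace` leaves any value that is missing from the dictionary unchanged, so a stray spelling would silently stay a string. A possible safeguard, sketched here on a made-up frame, is to encode with `Series.map` (which turns unmapped values into NaN) and check for leftovers. The names `toy`, `mapping` and `encoded` are hypothetical.

```python
import pandas as pd

# Made-up frame with two binary categorical columns
toy = pd.DataFrame({"htn": ["yes", "no", "yes"],
                    "appet": ["good", "poor", "good"]})
mapping = {"htn": {"yes": 1, "no": 0},
           "appet": {"good": 1, "poor": 0}}

encoded = toy.copy()
for col, codes in mapping.items():
    encoded[col] = toy[col].map(codes)
    # Any value missing from the dictionary becomes NaN under map, so list such values
    unmapped = toy.loc[encoded[col].isna(), col].unique()
    assert len(unmapped) == 0, f"unmapped values in {col}: {unmapped}"

print(encoded)
```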

In [None]:
train_df.head()

In [None]:
train_df.dtypes

The id column is just an identifier, so we can drop it.

In [None]:
train_df.drop('id',axis=1,inplace=True)

In [None]:
train_df.head()

In [None]:
# Plotting a heat map with the correlation between each pair of features. The default is the Pearson correlation coefficient
plt.figure(figsize=(25, 25))
Train_df_corr = train_df.corr()
sns.heatmap(Train_df_corr, 
            xticklabels = Train_df_corr.columns.values,
            yticklabels = Train_df_corr.columns.values,
            annot = True);

Considering absolute coefficients greater than 0.6:

1) pcv,hemo are highly correlated with 0.82

2) rc,hemo are highly correlated with 0.67

3) rc,pcv are highly correlated with 0.72

4) dm,htn are highly correlated with 0.64

5) hemo,htn are highly correlated with 0.61

We can drop the hemo, pcv and htn features.
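Instead of reading the pairs off the heat map, they can be listed programmatically. The sketch below does this for a made-up frame with the same 0.6 threshold on the absolute Pearson coefficient; `toy`, `threshold` and `high_pairs` are hypothetical names.

```python
from itertools import combinations

import pandas as pd

# Made-up frame: "b" is almost a multiple of "a", "c" is an unrelated alternating pattern
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [2, 4, 6, 8, 10, 12, 14, 16, 18, 21],
    "c": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
})

corr = toy.corr()  # Pearson correlation by default
threshold = 0.6

# Visit each unordered pair of columns once and keep those above the threshold
high_pairs = [
    (x, y, round(float(corr.loc[x, y]), 2))
    for x, y in combinations(corr.columns, 2)
    if abs(corr.loc[x, y]) > threshold
]

print(high_pairs)
```

Which feature of a correlated pair to drop remains a judgment call; the list only shows where the redundancy is.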


In [None]:
drop_feat=['hemo','pcv','htn']
train_df.drop(drop_feat,axis=1,inplace=True)

In [None]:
train_df.head()

## Model Building

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
y = train_df['classification']
X = train_df.drop(['classification'], axis = 1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
print('X train shape: ', X_train.shape)
print('X test shape: ', X_test.shape)
print('y train shape: ', y_train.shape)
print('y test shape: ', y_test.shape)

In [None]:
norm = MinMaxScaler().fit(X_train)

In [None]:
X_train_norm = norm.transform(X_train)

In [None]:
type(X_train_norm)

In [None]:
X_train_norm

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf = LogisticRegression(random_state=0,max_iter=500).fit(X_train, y_train)
y_pred = clf.predict(X_test)
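In the cells above, `X_train_norm` is computed but the model is fit on the unscaled `X_train`. If scaling is meant to be part of the model, one way to keep the two together is a scikit-learn pipeline, so that the scaler is fit on the training split and applied automatically at prediction time. This is a minimal sketch on synthetic data from `make_classification`, not a rerun of the notebook; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed features and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# The scaler is fit on the training split only and reused when predicting
scaled_clf = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=500, random_state=0))
scaled_clf.fit(X_tr, y_tr)

test_accuracy = scaled_clf.score(X_te, y_te)
print(f"hold-out accuracy: {test_accuracy:.2f}")
```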

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(y_test, y_pred))

From this output we can infer that false negatives outnumber false positives, so we should try to reduce them.

In [None]:
print(classification_report( y_test, y_pred))

This model gives reasonably good accuracy. Recall is 0.97 for class 0 and 0.90 for class 1; we will try to improve the latter.
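The recall values in the report follow directly from the confusion matrix. The sketch below shows the arithmetic on made-up labels (not the notebook's predictions), using scikit-learn's layout in which rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 positives (1) and 6 negatives (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall_pos = tp / (tp + fn)  # share of true class-1 cases that were found
recall_neg = tn / (tn + fp)  # share of true class-0 cases that were found

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"recall for 1: {recall_pos:.2f}, recall for 0: {recall_neg:.2f}")
```

Each false negative is a ckd patient predicted as notckd, which is why recall for class 1 is the figure to watch here.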

In [None]:
param_grid = [    
    {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [100, 1000,2500, 5000]
    }
]

Using these hyperparameters, we try to improve our model.
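Not every solver supports every penalty (for example, `lbfgs` and `newton-cg` work with `l2` but not `l1`, while `liblinear` supports both), so a single cross-product grid contains combinations that fail to fit. One possible alternative, sketched here on synthetic data, is a list of smaller grids that only pair compatible options. The grid values and names below are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

# Each dictionary is searched separately, so only compatible solver/penalty pairs are tried
compatible_grid = [
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": np.logspace(-2, 2, 5)},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": np.logspace(-2, 2, 5)},
]

search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=0),
                      param_grid=compatible_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
```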

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
clf_random = RandomizedSearchCV(clf, param_distributions = param_grid, cv = 3, verbose=True, n_jobs=-1)

In [None]:
best_clf_random = clf_random.fit(X_train,y_train)

In [None]:
best_clf_random.best_estimator_

In [None]:
# Refit with the best parameters found by the randomized search; all other arguments keep their defaults
clf = LogisticRegression(C=4.281332398719396, penalty='l2', solver='liblinear',
                         max_iter=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

Our model has improved slightly, from 5 false negatives to 4.

In [None]:
print(classification_report( y_test, y_pred))

Here we can see that recall for class 1 has improved from 0.90 to 0.92, which is reasonably good.

In [None]:
param_grid_cv = [    
    {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
    'C' : np.logspace(3, 4, 20),
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [80, 100, 120, 150]
    }
]

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
clf_grid = GridSearchCV(clf, param_grid= param_grid_cv, cv = 3, verbose=True, n_jobs=-1)
best_clf_grid = clf_grid.fit(X_train, y_train)

In [None]:
best_clf_grid.best_estimator_

In [None]:
y_pred = best_clf_grid.best_estimator_.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

Using grid search, we have improved our model.

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=23)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred_dt))

In [None]:
print(classification_report(y_test, y_pred_dt))

In [None]:
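param_grid_random = [    
    {'splitter' : ['best', 'random'],
     # max_depth must be an integer, so use a range rather than np.linspace (which yields floats)
     'max_depth' : list(range(1, 33)),
     # an integer min_samples_split must be at least 2
     'min_samples_split' : list(range(2, 11)),
     # float values are interpreted as a fraction of the training samples
     'min_samples_leaf' : np.linspace(0.1, 0.5, 10, endpoint=True),
     'max_features' : list(range(1,X_train.shape[1])),
    }
]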

In [None]:
dt_model_random = RandomizedSearchCV(dt_model, param_distributions = param_grid_random, cv = 3, verbose=True, n_jobs=-1)
dt_model_random.fit(X_train, y_train)

In [None]:
y_pred = dt_model_random.best_estimator_.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Running Random Forest Regressor Model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
rf = RandomForestRegressor(n_estimators = 51, random_state = 1)
model = rf.fit(X_train, y_train)
np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
r2_score(y_train, model.predict(X_train))

In [None]:
rf = RandomForestClassifier(n_estimators = 100, random_state = 1)
rf.fit(X_train, y_train)
# Keep the train predictions in their own variable so the true labels in y_train are not overwritten
y_train_pred = rf.predict(X_train)
y_pred = rf.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.svm import SVC

# Split the preprocessed features and labels; the raw frame still contains strings and missing values, which SVC cannot fit
X_train, X_test, y_train, y_test = train_test_split(X,
    y,
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

In [None]:
classifier = SVC(kernel = 'linear', random_state = 1)
classifier.fit(X_train, y_train)
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
print('\n')
print('Classification Report for Train Set\n')
print(classification_report(y_train, y_train_pred))
print('Classification Report for Test Set\n')
print(classification_report(y_test, y_test_pred))


In [None]:
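# Pair each hold-out prediction of the last model with its patient id (row index -> 'id' in the raw frame)
submission_df = pd.DataFrame({'PatientId': Raw_train_df.loc[y_test.index, 'id'].values, 'class': y_test_pred.tolist()})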

In [None]:
submission_df.to_csv("submission.csv", header = True, index= False)
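A submission file needs exactly one prediction per patient id, in matching order. The sketch below shows that shape on made-up ids and predictions. The column names `PatientId` and `class` come from the cell above; whether the challenge expects the original string labels or the 0/1 codes is not stated in this notebook, so mapping the codes back to 'ckd'/'notckd' is an assumption, and `test_ids` and `pred_codes` are hypothetical stand-ins for the test ids and the model output.

```python
import pandas as pd

# Hypothetical ids from the test file and hypothetical model output (1 = ckd, 0 = notckd)
test_ids = pd.Series([3, 17, 42], name="id")
pred_codes = [1, 0, 1]

# Assumption: the submission uses the original string labels
code_to_label = {1: "ckd", 0: "notckd"}

submission = pd.DataFrame({
    "PatientId": test_ids.to_numpy(),
    "class": [code_to_label[code] for code in pred_codes],
})

# One row per id, so the lengths must agree
assert len(submission) == len(test_ids)
print(submission)
```

To score the real test file, the same cleaning, imputation, encoding and feature dropping applied to the train data would first have to be applied to `Raw_test_df`.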