#### ML Analysis #1: Build a model that accurately classifies whether a patient has chronic kidney disease (CKD).

Yash Dhore

In [15]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report)

In [16]:
# Load the (preprocessed) data
df = pd.read_csv("ckd_preprocessed.csv")

Encode categorical variables as integers so that the model can be trained on them.

In [None]:
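# Columns stored as strings (dtype "object") need to be encoded
object_columns_list = df.select_dtypes(include=['object']).columns.tolist()

# Fit a separate encoder per column so each label-to-integer mapping stays recoverable
label_encoders = {}
for object_column in object_columns_list:
    label_encoder = LabelEncoder()
    # Strip stray whitespace so that e.g. "yes" and " yes" map to the same code
    df[object_column] = label_encoder.fit_transform(df[object_column].str.strip())
    label_encoders[object_column] = label_encoder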

Object columns: []
      age    bp     sg   al   su  rbc  pc  pcc  ba         bgr  ...   pcv  \
0    48.0  80.0  1.020  1.0  0.0    1   1    0   0  121.000000  ...  44.0   
1     7.0  50.0  1.020  4.0  0.0    1   1    0   0  148.036517  ...  38.0   
2    62.0  80.0  1.010  2.0  3.0    1   1    0   0  423.000000  ...  31.0   
3    48.0  70.0  1.005  4.0  0.0    1   0    1   0  117.000000  ...  32.0   
4    51.0  80.0  1.010  2.0  0.0    1   1    0   0  106.000000  ...  35.0   
..    ...   ...    ...  ...  ...  ...  ..  ...  ..         ...  ...   ...   
395  55.0  80.0  1.020  0.0  0.0    1   1    0   0  140.000000  ...  47.0   
396  42.0  70.0  1.025  0.0  0.0    1   1    0   0   75.000000  ...  54.0   
397  12.0  80.0  1.020  0.0  0.0    1   1    0   0  100.000000  ...  49.0   
398  17.0  60.0  1.025  0.0  0.0    1   1    0   0  114.000000  ...  51.0   
399  58.0  80.0  1.025  0.0  0.0    1   1    0   0  131.000000  ...  53.0   

         wc        rc  htn  dm  cad  appet  pe  ane  cla
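
A `LabelEncoder` assigns integer codes in sorted label order, and its fitted `classes_` attribute records that order. The sketch below uses a hypothetical toy frame (not the CKD data; the column values are made up for illustration) to show how one encoder per column lets the mapping be read back later, which helps when interpreting the sign of a coefficient.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for raw, un-encoded data.
# dtype=object is set explicitly so the string column is detected below.
toy = pd.DataFrame({
    "rbc": pd.Series(["normal", " abnormal", "normal "], dtype=object),
    "age": [48.0, 7.0, 62.0],
})

encoders = {}  # column name -> fitted LabelEncoder
for col in toy.select_dtypes(include=["object"]).columns.tolist():
    enc = LabelEncoder()
    # Strip whitespace first so "normal" and "normal " share one code
    toy[col] = enc.fit_transform(toy[col].str.strip())
    encoders[col] = enc

# classes_ lists the original labels; the position of a label is its integer code
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)
print(toy)
```

Reusing a single encoder across columns would also work for the transformation itself, but only the mapping of the last column would remain available afterwards.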

Prepare the data by splitting it into x and y, then into train/val/test sets.

In [35]:
train_split = 0.75
val_split = 0.15
test_split = 0.10

x = df.drop('classification', axis=1)
y = df['classification']

x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=1 - train_split) # split into train and temp
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=test_split / (test_split + val_split)) # split temp into val and test
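
The split above is unseeded and unstratified, so the class balance of each subset and the resulting scores can change between runs. As a sketch of one alternative, the example below passes `stratify` and `random_state` to `train_test_split`. The data is synthetic and the 250/150 class balance is an assumption made for illustration, not a property read from `ckd_preprocessed.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

train_split, val_split, test_split = 0.75, 0.15, 0.10

# Synthetic stand-in data: 400 rows, 3 features, assumed 250/150 class balance
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 3))
y_demo = np.array([0] * 250 + [1] * 150)

# First split: train vs. temporary hold-out, preserving class proportions
x_tr, x_tmp, y_tr, y_tmp = train_test_split(
    x_demo, y_demo, test_size=1 - train_split, stratify=y_demo, random_state=42
)
# Second split: divide the hold-out into validation and test sets
x_va, x_te, y_va, y_te = train_test_split(
    x_tmp, y_tmp, test_size=test_split / (test_split + val_split),
    stratify=y_tmp, random_state=42
)

print("Sizes (train/val/test):", len(x_tr), len(x_va), len(x_te))
print("Share of class 1:", y_tr.mean(), y_va.mean(), y_te.mean())
```

Fixing `random_state` makes a run reproducible; it does not make a single split more representative, so it is still worth checking results on more than one seed.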

Baseline model that always predicts the most frequent class in the training set

In [36]:
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(x_train, y_train)

y_baseline_pred = baseline_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_baseline_pred))
print(classification_report(y_test, y_baseline_pred, zero_division=1))
cm = confusion_matrix(y_test, y_baseline_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 0.65
              precision    recall  f1-score   support

           0       0.65      1.00      0.79        26
           1       1.00      0.00      0.00        14

    accuracy                           0.65        40
   macro avg       0.82      0.50      0.39        40
weighted avg       0.77      0.65      0.51        40

Confusion Matrix:
 [[26  0]
 [14  0]]


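As expected, this is not a useful model. It predicts class 0 for every patient, so it reaches 0.65 accuracy only because 26 of the 40 test patients belong to class 0, and its recall for class 1 is 0. The precision of 1.00 reported for class 1 is an artifact of `zero_division=1`, since the model never predicts that class.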

Let's try using a logistic regression model.

In [37]:
model = LogisticRegression(max_iter=9999) # increase limit on the number of iterations
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=1))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        26
           1       1.00      1.00      1.00        14

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

Confusion Matrix:
 [[26  0]
 [ 0 14]]


The logistic regression model achieved perfect accuracy on our test set (0.975 on some runs, since the split is random). That is clearly better than the 0.65 of the baseline model.
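
Because the score moves between 1.0 and 0.975 depending on the split, a single 40-row test set gives only a rough estimate. One way to quantify that variation is stratified k-fold cross-validation, sketched below. The data comes from `make_classification` and is purely synthetic (an assumption for illustration), so the printed numbers say nothing about the CKD dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same number of rows as the notebook's data
x_syn, y_syn = make_classification(
    n_samples=400, n_features=24, n_informative=6, random_state=0
)

# Five stratified folds: each fold keeps roughly the overall class balance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=9999), x_syn, y_syn, cv=cv, scoring="accuracy"
)

print("Fold accuracies:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

Reporting the mean and spread over folds is less sensitive to one lucky or unlucky split than a single test-set accuracy.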

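Recall matters here because a false negative (predicting that a patient does not have CKD when they do) is costly. Recall is 1.00 for both classes in this run, as it must be when accuracy is 1. On runs where accuracy is 0.975, exactly one of the 40 test patients is misclassified, so recall remains high.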
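
To make the link between the confusion matrix and recall explicit, the sketch below recomputes per-class recall and accuracy from the baseline model's confusion matrix printed earlier. It relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Baseline confusion matrix from the output above (rows: true, columns: predicted)
cm_baseline = np.array([[26, 0],
                        [14, 0]])

# Recall for class k = correctly predicted k / all true k (diagonal over row sum)
recall_per_class = cm_baseline.diagonal() / cm_baseline.sum(axis=1)

# Accuracy = all correct predictions / all predictions
accuracy = cm_baseline.trace() / cm_baseline.sum()

print("Recall per class:", recall_per_class)
print("Accuracy:", accuracy)
```

The same two lines applied to the logistic regression matrix give a recall of 1.0 for both classes, because every off-diagonal entry is zero.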

In [39]:
coefficients = model.coef_[0]
feature_importance = pd.DataFrame({'Feature': x.columns, 'Coefficient': coefficients})

feature_importance['Absolute Coefficient'] = feature_importance['Coefficient'].abs()
feature_importance = feature_importance.sort_values(by='Absolute Coefficient', ascending=False)

print(feature_importance)

   Feature  Coefficient  Absolute Coefficient
14    hemo     1.238881              1.238881
3       al    -1.176276              1.176276
19      dm    -1.039291              1.039291
18     htn    -1.005726              1.005726
11      sc    -0.923304              0.923304
22      pe    -0.857655              0.857655
21   appet    -0.812574              0.812574
17      rc     0.784960              0.784960
5      rbc     0.630143              0.630143
6       pc     0.623922              0.623922
4       su    -0.469025              0.469025
23     ane    -0.228043              0.228043
15     pcv     0.176473              0.176473
13     pot     0.142994              0.142994
7      pcc    -0.121348              0.121348
8       ba    -0.084545              0.084545
12     sod     0.068796              0.068796
1       bp    -0.054672              0.054672
20     cad    -0.038520              0.038520
2       sg     0.038275              0.038275
9      bgr    -0.018043           

As the EDA suggested, serum creatinine (sc), albumin (al), hemoglobin (hemo), and red blood cell count (rc) are strong indicators of whether a patient has CKD: all four are among the largest absolute coefficients.

However, packed cell volume (pcv) and specific gravity (sg), which the EDA also flagged, did not turn out to be strong indicators: their absolute coefficients are only about 0.18 and 0.04.

The exact coefficients change from one training run to the next because the train/test split is random, but the observations above held across several runs.
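
The stability claim can be checked directly by refitting over several seeds and summarizing each coefficient's mean and spread, as sketched below. The sketch also standardizes the features first, which is an addition not present in the notebook: raw logistic regression coefficients depend on each feature's units, so a feature with a very narrow numeric range can receive a small coefficient without being uninformative. The data and the feature names `f0` to `f5` are synthetic placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
x_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(x_syn.shape[1])]

rows = []
for seed in range(10):
    # A different random split per seed mimics rerunning the notebook
    x_tr, _, y_tr, _ = train_test_split(
        x_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=seed
    )
    # Scale features to zero mean and unit variance before fitting
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=9999))
    pipe.fit(x_tr, y_tr)
    rows.append(pipe.named_steps["logisticregression"].coef_[0])

# One row per run, one column per feature
coef_runs = pd.DataFrame(rows, columns=feature_names)

# Mean and standard deviation of each coefficient across runs
summary = pd.DataFrame({"mean": coef_runs.mean(), "std": coef_runs.std()})
summary = summary.sort_values(by="mean", key=lambda s: s.abs(), ascending=False)
print(summary)
```

A feature whose mean coefficient is large relative to its standard deviation across runs is a more defensible "strong indicator" than one ranked from a single fit.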