#### ML Analysis #1: Building a model to classify whether a patient has chronic kidney disease (CKD)

Yash Dhore

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report)

In [2]:
# Load the (preprocessed) data
df = pd.read_csv("ckd_preprocessed.csv")

Encode the categorical variables as integers so that the model can train on them

In [3]:
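label_encoder = LabelEncoder()

# Columns with dtype 'object' hold the categorical (string) values
object_columns_list = df.select_dtypes(include=['object']).columns.tolist()

# The encoder is refit on every column, so afterwards it only remembers the last column's classes
for object_column in object_columns_list:
    df[object_column] = label_encoder.fit_transform(df[object_column])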
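
To read the later reports it helps to know which integer each class received. `LabelEncoder` stores its classes in sorted order, so the position of a label in `classes_` is its code. The sketch below keeps one encoder per column so that the mapping can be looked up afterwards; the toy frame and its values (`"ckd"`, `"notckd"`, `"yes"`, `"no"`) are illustrative assumptions, not values confirmed from `ckd_preprocessed.csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; column values are illustrative only
toy = pd.DataFrame({
    "htn": ["yes", "no", "yes"],
    "classification": ["ckd", "notckd", "ckd"],
})

# Keep one fitted encoder per column so each mapping stays available
encoders = {}
for column in ["htn", "classification"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ is sorted, so the index of a label in it is that label's integer code
class_mapping = {
    str(label): code
    for code, label in enumerate(encoders["classification"].classes_)
}
print(class_mapping)
```

With string labels like these, the alphabetically first label receives code 0.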

Prepare the data by splitting it into x and y, then into train/val/test sets

In [4]:
train_split = 0.75
val_split = 0.15
test_split = 0.10

x = df.drop('classification', axis=1)
y = df['classification']

# Split into train and temp
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=1 - train_split)
# Split temp into val and test
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=test_split / (test_split + val_split))
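
Because the split above is not seeded, each run produces different sets. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` and stratification on the label are acceptable; neither is used in the cell above. The data here is synthetic (400 rows with a made-up class mix), and `x_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

train_split, val_split, test_split = 0.75, 0.15, 0.10

# Synthetic stand-in for the features and labels (the class mix is made up)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 3))
y_demo = np.array([0] * 250 + [1] * 150)

# random_state fixes the shuffle; stratify keeps the class ratio in every subset
x_train, x_temp, y_train, y_temp = train_test_split(
    x_demo, y_demo, test_size=1 - train_split, stratify=y_demo, random_state=42
)
x_val, x_test, y_val, y_test = train_test_split(
    x_temp, y_temp, test_size=test_split / (test_split + val_split),
    stratify=y_temp, random_state=42
)

print(len(x_train), len(x_val), len(x_test))
```

The second `test_size` is the test share of the held-out 25%, which is why it is `0.10 / 0.25` rather than `0.10`.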

Baseline model that always predicts the most frequent class in the training set

In [5]:
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(x_train, y_train)

y_baseline_pred = baseline_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_baseline_pred))
print(classification_report(y_test, y_baseline_pred, zero_division=1))
cm = confusion_matrix(y_test, y_baseline_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 0.7
              precision    recall  f1-score   support

           0       0.70      1.00      0.82        28
           1       1.00      0.00      0.00        12

    accuracy                           0.70        40
   macro avg       0.85      0.50      0.41        40
weighted avg       0.79      0.70      0.58        40

Confusion Matrix:
 [[28  0]
 [12  0]]


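Not a very good model, of course: it predicts class 0 for every sample, so its recall for class 1 is 0. The precision of 1.00 reported for class 1 is only the `zero_division=1` fallback, since class 1 is never predicted.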
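
The effect of `zero_division` on the baseline report can be checked in isolation. The sketch below rebuilds labels that match the baseline confusion matrix (28 samples of class 0, 12 of class 1, every prediction equal to 0) and computes precision for class 1 under both fallback values; the variable names are illustrative.

```python
from sklearn.metrics import precision_score

# Labels mirroring the baseline confusion matrix: 28 of class 0, 12 of class 1
y_true = [0] * 28 + [1] * 12
# A most-frequent baseline predicts class 0 for every sample
y_all_zero = [0] * 40

# Class 1 is never predicted, so its precision is 0/0 and the fallback value is reported
precision_as_one = precision_score(y_true, y_all_zero, pos_label=1, zero_division=1)
precision_as_zero = precision_score(y_true, y_all_zero, pos_label=1, zero_division=0)
print(precision_as_one, precision_as_zero)
```

Passing `zero_division=0` makes the report show 0 for a class that is never predicted, which may be the less surprising choice for a baseline.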

Let's try using a logistic regression model.

In [6]:
model = LogisticRegression(max_iter=9999) # increase limit on the number of iterations
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=1))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        28
           1       1.00      1.00      1.00        12

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

Confusion Matrix:
 [[28  0]
 [ 0 12]]


The logistic regression model achieves perfect accuracy on this test set, well above the baseline's 0.70. On other runs the accuracy is sometimes 0.975, because the train/test split is not seeded and the test set has only 40 samples.
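
Since a single 40-sample test set gives a noisy estimate, one option is to average accuracy over several folds. This is a sketch of stratified 5-fold cross-validation, not something the notebook does; it runs on synthetic data from `make_classification`, and `x_demo` and `y_demo` are hypothetical stand-ins for `x` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the CKD features and labels
x_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Five stratified folds; every sample is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=9999), x_demo, y_demo, cv=cv, scoring="accuracy"
)

print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

The standard deviation across folds gives a rough sense of how much a single split can move the reported accuracy.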

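Recall matters here because a false negative (predicting that a patient does not have CKD when they do) is costly. In this run recall is 1.00 for both classes, which follows directly from the perfect accuracy. In the runs where accuracy is 0.975, exactly one of the 40 test samples is misclassified.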
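
To track false negatives for CKD specifically, recall can be computed with CKD as the positive class. The sketch below assumes CKD was encoded as 0, which should be checked against the encoder's `classes_`. The labels and predictions are hypothetical, with one missed CKD case.

```python
from sklearn.metrics import confusion_matrix, recall_score

ckd_label = 0  # assumption: CKD was encoded as 0; verify against the encoder
other_label = 1

# Hypothetical labels and predictions with one missed CKD case
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1]

# Recall for CKD = correctly identified CKD cases / all actual CKD cases
ckd_recall = recall_score(y_true, y_hat, pos_label=ckd_label)

# Ordering labels as [negative, positive] makes ravel() yield tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(
    y_true, y_hat, labels=[other_label, ckd_label]
).ravel()

print("CKD recall:", ckd_recall, "False negatives:", fn)
```

Here `fn` counts actual CKD patients predicted as not having CKD, which is the costly error described above.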

In [7]:
coefficients = model.coef_[0]
feature_importance = pd.DataFrame({'Feature': x.columns, 'Coefficient': coefficients})

feature_importance['Absolute Coefficient'] = feature_importance['Coefficient'].abs()
feature_importance = feature_importance.sort_values(by='Absolute Coefficient', ascending=False)

print(feature_importance)

   Feature  Coefficient  Absolute Coefficient
3       al    -1.628010              1.628010
14    hemo     1.171623              1.171623
18     htn    -1.040737              1.040737
11      sc    -1.031119              1.031119
19      dm    -1.030906              1.030906
4       su    -0.782523              0.782523
22      pe    -0.737404              0.737404
17      rc     0.719852              0.719852
21   appet    -0.633714              0.633714
5      rbc     0.302533              0.302533
13     pot    -0.248275              0.248275
23     ane    -0.225777              0.225777
6       pc     0.197410              0.197410
15     pcv     0.119346              0.119346
1       bp    -0.074432              0.074432
12     sod     0.071938              0.071938
7      pcc    -0.056572              0.056572
2       sg     0.044477              0.044477
20     cad    -0.042363              0.042363
10      bu     0.026738              0.026738
9      bgr    -0.022563              0.022563

As the EDA suggested, serum creatinine (sc), albumin (al), hemoglobin (hemo), and red blood cell count (rc) are strong indicators of whether a patient has CKD.

However, packed cell volume (pcv) and specific gravity (sg), which the EDA also flagged, have small coefficients in this model.

The exact coefficients change from run to run because the train/test split is random, but the observations above hold across several training runs.
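
Coefficient magnitudes are only comparable across features when the features share a common scale. If the preprocessed features are not already standardized, one option is to scale them inside a pipeline before reading the coefficients. The following is a minimal sketch on synthetic data, not the notebook's method; `x_demo`, `scaled_model`, and the `feature_i` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CKD features and labels
x_arr, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
x_demo = pd.DataFrame(x_arr, columns=[f"feature_{i}" for i in range(5)])

# Standardize each feature to zero mean and unit variance, then fit the classifier
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=9999))
scaled_model.fit(x_demo, y_demo)

# make_pipeline names each step after its lowercased class name
scaled_coefficients = scaled_model.named_steps["logisticregression"].coef_[0]

scaled_importance = pd.DataFrame(
    {"Feature": x_demo.columns, "Coefficient": scaled_coefficients}
)
scaled_importance["Absolute Coefficient"] = scaled_importance["Coefficient"].abs()
scaled_importance = scaled_importance.sort_values(
    by="Absolute Coefficient", ascending=False
)

print(scaled_importance)
```

With standardized inputs, each coefficient describes the change in log-odds per standard deviation of its feature, so the ranking no longer depends on measurement units.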