#### ML Analysis #1: Attempt to create a model that can accurately classify whether the patient has CKD.

Yash Dhore

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report)

In [2]:
# Load the (preprocessed) data
df = pd.read_csv("ckd_preprocessed.csv")

Encode categorical variables as integers so that the model can be trained on them

In [3]:
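# LabelEncoder maps the sorted unique values of a column to 0, 1, 2, ...
label_encoder = LabelEncoder()

object_columns_list = df.select_dtypes(include=['object']).columns.tolist()

# the encoder is refit on every column, so only the last column's mapping is retained
for object_column in object_columns_list:
    df[object_column] = label_encoder.fit_transform(df[object_column])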
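
The loop above reuses a single encoder, so it is hard to tell afterwards which integer stands for which label (for example, which value of `classification` means CKD). A minimal sketch of one way to keep that information, assuming a toy DataFrame whose column values (`yes`/`no`, `ckd`/`notckd`) are guesses about the real data; `toy_df`, `encoders`, and `mappings` are hypothetical names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; the column values are assumptions
toy_df = pd.DataFrame({
    "htn": ["yes", "no", "no", "yes"],
    "classification": ["ckd", "notckd", "notckd", "ckd"],
})

encoders = {}  # one fitted encoder per column
for column in ["htn", "classification"]:
    encoder = LabelEncoder()
    toy_df[column] = encoder.fit_transform(toy_df[column])
    encoders[column] = encoder

# Recover the label -> integer mapping of every encoded column
mappings = {
    column: {str(label): code for code, label in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}
print(mappings)
```

Because `LabelEncoder` sorts the labels before numbering them, the mapping depends only on the set of labels in each column, not on row order.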

Prepare the data by splitting it into x and y, then into train/val/test sets

In [None]:
# split data 80:20 into (temp, test)
# then split temp 80:20 into (train, val), giving 64:16:20 overall

x = df.drop('classification', axis=1)
y = df['classification']

x_temp, x_test, y_temp, y_test = train_test_split(x, y, test_size=0.2) # split into temp and test
x_train, x_val, y_train, y_val = train_test_split(x_temp, y_temp, test_size=0.2) # split temp into train and val
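
The split above is unseeded, so every run produces different sets. A sketch of a reproducible, class-balanced variant, assuming synthetic data of 400 rows (250 of class 0, 150 of class 1); `x_demo`, `y_demo`, and `SEED` are hypothetical names, and the seed value is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 250 of class 0 and 150 of class 1 (assumed sizes)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 5))
y_demo = np.array([0] * 250 + [1] * 150)

SEED = 42  # arbitrary fixed seed so the split is repeatable

# stratify keeps the class ratio the same in every subset
x_temp, x_test, y_temp, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=SEED)
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.2, stratify=y_temp, random_state=SEED)

print(len(x_train), len(x_val), len(x_test))
```

Fixing `random_state` trades run-to-run variety for repeatable numbers; whether that is desirable here is a judgment call.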

Baseline model that always predicts the most frequent class

In [27]:
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(x_train, y_train)

y_baseline_pred = baseline_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_baseline_pred))
print(classification_report(y_test, y_baseline_pred, zero_division=1))
cm = confusion_matrix(y_test, y_baseline_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 0.6625
              precision    recall  f1-score   support

           0       0.66      1.00      0.80        53
           1       1.00      0.00      0.00        27

    accuracy                           0.66        80
   macro avg       0.83      0.50      0.40        80
weighted avg       0.78      0.66      0.53        80

Confusion Matrix:
 [[53  0]
 [27  0]]


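Not a very good model, of course: it predicts class 0 for every patient, so its accuracy (0.6625) is just the share of class 0 in the test set (53 of 80), and its recall for class 1 is 0. The precision of 1.00 reported for class 1 is an artifact of zero_division=1, since the model never predicts that class.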
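
The baseline figures can be re-derived by hand from its confusion matrix, where rows are true classes and columns are predicted classes. A small worked sketch in plain Python, using the matrix printed above:

```python
# Confusion matrix of the baseline: rows are true classes, columns are predictions
cm = [[53, 0],
      [27, 0]]

total = sum(sum(row) for row in cm)

# accuracy = correct predictions (the diagonal) over all predictions
accuracy = (cm[0][0] + cm[1][1]) / total

# recall of class i = correctly predicted members of class i over all true members
recall_per_class = [cm[i][i] / sum(cm[i]) for i in range(2)]

print(accuracy)          # 0.6625
print(recall_per_class)  # [1.0, 0.0]
```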

Let's try using a logistic regression model.

In [28]:
model = LogisticRegression(max_iter=99999) # increase limit on the number of iterations
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=1))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        53
           1       1.00      1.00      1.00        27

    accuracy                           1.00        80
   macro avg       1.00      1.00      1.00        80
weighted avg       1.00      1.00      1.00        80

Confusion Matrix:
 [[53  0]
 [ 0 27]]


The logistic regression model achieves perfect accuracy on the test set in this run (0.975 in some other runs, because the splits are not seeded). That is clearly better than the baseline's 0.6625.

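We care about recall in particular, because a false negative (predicting that a patient does not have CKD when they do) is costly. Recall is 1.00 for both classes in this run, which follows directly from the perfect accuracy. In runs where accuracy is 0.975, two of the 80 test patients are misclassified, so recall for the CKD class should be checked separately.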
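
Since the score moves between runs, one way to get a steadier estimate of recall is cross-validation. A sketch on synthetic data from `make_classification`, not on the CKD data; `POSITIVE_LABEL = 0` is an assumption about which integer the encoder assigned to the CKD class and should be verified against the real labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem standing in for the real features and labels
x_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

POSITIVE_LABEL = 0  # assumption: the integer that encodes "has CKD"
recall_scorer = make_scorer(recall_score, pos_label=POSITIVE_LABEL)

# Five stratified folds; each fold serves once as the held-out set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), x_demo, y_demo, cv=cv, scoring=recall_scorer)

print("recall per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

The spread across folds gives a rough sense of how much a single test-set score can be trusted.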

In [31]:
coefficients = model.coef_[0]
feature_importance = pd.DataFrame({'Feature': x.columns, 'Coefficient': coefficients})

feature_importance['Absolute Coefficient'] = feature_importance['Coefficient'].abs()
feature_importance = feature_importance.sort_values(by='Absolute Coefficient', ascending=False)

display(feature_importance)

   Feature  Coefficient  Absolute Coefficient
18     htn    -1.349632              1.349632
19      dm    -1.320199              1.320199
3       al    -1.228003              1.228003
14    hemo     1.050057              1.050057
11      sc    -1.046498              1.046498
22      pe    -0.739518              0.739518
4       su    -0.723406              0.723406
17      rc     0.677558              0.677558
5      rbc     0.481642              0.481642
6       pc     0.466600              0.466600


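As predicted from the EDA, serum creatinine (sc), albumin (al), hemoglobin (hemo), and red blood cell count (rc) are strong indicators of whether a patient has CKD: all four appear among the ten largest coefficients. Hypertension (htn) and diabetes mellitus (dm) have the largest coefficients of all.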

However, packed cell volume (pcv) and specific gravity (sg), which the EDA also flagged, are not among the ten largest coefficients, so the model does not treat them as strong indicators.
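
One caveat when ranking features by coefficient: the magnitude of a logistic regression coefficient depends on the scale of its feature, so the ranking is only comparable when features share a scale. Whether ckd_preprocessed.csv is already scaled is not stated here, so the following is only a sketch of how standardized coefficients could be obtained, on synthetic data with hypothetical feature names `f0` to `f5`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features
x_arr, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
feature_names = [f"f{i}" for i in range(6)]
x_demo = pd.DataFrame(x_arr, columns=feature_names)
x_demo["f0"] *= 1000  # put one feature on a much larger scale on purpose

# Standardize every feature to zero mean and unit variance before fitting
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(x_demo, y_demo)

# make_pipeline names each step after its lowercased class name
coefs = pipeline.named_steps["logisticregression"].coef_[0]
ranking = pd.Series(coefs, index=feature_names).abs().sort_values(ascending=False)
print(ranking)
```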

The coefficients change from one training run to the next because the splits are random, but the observations above held across several runs.
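
The run-to-run variation can be quantified rather than checked by eye. A sketch that refits the model on ten differently seeded splits and reports the mean and standard deviation of each coefficient; it uses synthetic data, and the number of runs and the names `coef_runs`, `mean_coef`, and `std_coef` are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features and labels
x_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

coef_runs = []
for seed in range(10):
    # A different seed gives a different training subset each time
    x_train, _, y_train, _ = train_test_split(
        x_demo, y_demo, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    coef_runs.append(model.coef_[0])

coef_runs = np.array(coef_runs)   # shape: (runs, features)
mean_coef = coef_runs.mean(axis=0)
std_coef = coef_runs.std(axis=0)

# List features from largest to smallest mean absolute effect
for i in np.argsort(-np.abs(mean_coef)):
    print(f"feature {i}: mean={mean_coef[i]:+.3f}, std={std_coef[i]:.3f}")
```

A feature whose standard deviation is small relative to its mean keeps roughly the same weight regardless of the split.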