#### ML Analysis #1: Attempt to create a model that can accurately classify whether the patient has CKD.

Yash Dhore

Encode the categorical variables as integers so that the model can be trained on them

In [None]:
label_encoder = LabelEncoder()

object_columns_list = df.select_dtypes(include=['object']).columns.tolist()

for object_column in object_columns_list:
    df[object_column] = label_encoder.fit_transform(df[object_column])
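
Since `LabelEncoder` assigns integers in sorted order of the original labels, it can be useful to keep one fitted encoder per column so the mapping can be looked up later (for example, to see which integer stands for CKD). A minimal sketch, assuming a toy frame `toy_df` with made-up values in place of the real `df`; the `encoders` dictionary and `class_mapping` are illustrative names, not part of the original notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the notebook's df (values are made up)
toy_df = pd.DataFrame(
    {
        "htn": ["yes", "no", "yes", "no"],
        "classification": ["ckd", "notckd", "ckd", "ckd"],
    },
    dtype=object,
)

encoders = {}  # one fitted encoder per column, so each mapping stays available
for column in toy_df.select_dtypes(include=["object"]).columns:
    encoder = LabelEncoder()
    toy_df[column] = encoder.fit_transform(toy_df[column])
    encoders[column] = encoder

# classes_[i] is the original label that was mapped to integer i
class_mapping = {
    label: code for code, label in enumerate(encoders["classification"].classes_)
}
print(class_mapping)
```

With labels like these, the alphabetically first label receives 0, which matters when reading the per-class rows of a classification report.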

Prepare the data by splitting it into x and y, then into train/val/test sets

In [None]:
train_split = 0.75
val_split = 0.15
test_split = 0.10

x = df.drop('classification', axis=1)
y = df['classification']

x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=1 - train_split) # split into train and temp
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=test_split / (test_split + val_split)) # split temp into val and test
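
The split above is random on every run, which is one reason results can differ between runs. A possible variant, sketched here on synthetic stand-in data (`x_demo` and `y_demo` are hypothetical, as is the seed value 42), fixes `random_state` and uses `stratify` so that each subset keeps roughly the same class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

train_split, val_split, test_split = 0.75, 0.15, 0.10

# Hypothetical stand-in data: 400 rows, 5 features, imbalanced binary target
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 5))
y_demo = np.array([0] * 250 + [1] * 150)

# First split: train vs. temp (temp holds val + test)
x_train, x_temp, y_train, y_temp = train_test_split(
    x_demo, y_demo,
    test_size=1 - train_split,
    stratify=y_demo,      # preserve class proportions
    random_state=42,      # assumed seed, any fixed integer works
)

# Second split: temp into val and test
x_val, x_test, y_val, y_test = train_test_split(
    x_temp, y_temp,
    test_size=test_split / (test_split + val_split),
    stratify=y_temp,
    random_state=42,
)

print(len(y_train), len(y_val), len(y_test))
```

Whether a fixed seed is desirable depends on the goal: it makes a single run reproducible, but it hides the run-to-run variation that repeated random splits reveal.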

Baseline model that always predicts the most frequent class

In [None]:
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(x_train, y_train)

y_baseline_pred = baseline_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_baseline_pred))
print(classification_report(y_test, y_baseline_pred, zero_division=1))
cm = confusion_matrix(y_test, y_baseline_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 0.55
              precision    recall  f1-score   support

           0       0.55      1.00      0.71        22
           1       1.00      0.00      0.00        18

    accuracy                           0.55        40
   macro avg       0.78      0.50      0.35        40
weighted avg       0.75      0.55      0.39        40

Confusion Matrix:
 [[22  0]
 [18  0]]


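Not a very good model, of course. It always predicts class 0, so recall for class 1 is 0. The precision of 1.00 reported for class 1 is only an artifact of zero_division=1, since the model never predicts that class.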

Let's try using a logistic regression model.

In [None]:
model = LogisticRegression(max_iter=9999) # increase limit on the number of iterations
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=1))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       1.00      1.00      1.00        18

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

Confusion Matrix:
 [[22  0]
 [ 0 18]]


The logistic regression model achieved perfect accuracy on our test set (0.975 on some runs). That is definitely better than the baseline model's 0.55.

We care about recall in particular, because a false negative (incorrectly predicting that the patient does not have CKD) is costly. Recall is high as well: an accuracy of 1 (or 0.975) leaves at most one misclassified patient in the 40-sample test set.
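
To look at false negatives for CKD directly, recall can be computed for that one class rather than read off the overall accuracy. A minimal sketch on made-up predictions, assuming the raw labels are `ckd` and `notckd` so that `LabelEncoder` maps CKD to 0 (worth checking against the encoder's `classes_`); `CKD_LABEL`, `y_true_demo`, and `y_pred_demo` are hypothetical names:

```python
from sklearn.metrics import confusion_matrix, recall_score

CKD_LABEL = 0      # assumption: 'ckd' sorts before 'notckd', so it is encoded as 0
NOT_CKD_LABEL = 1

# Made-up labels: four CKD patients, four non-CKD patients
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1]  # one CKD patient is missed

# Recall for the CKD class: detected CKD cases / all actual CKD cases
ckd_recall = recall_score(y_true_demo, y_pred_demo, pos_label=CKD_LABEL)

# Order the matrix so that row 0 / column 0 correspond to CKD
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[CKD_LABEL, NOT_CKD_LABEL])
false_negatives = cm_demo[0, 1]  # actual CKD, predicted not CKD

print("CKD recall:", ckd_recall)
print("False negatives:", false_negatives)
```

In this toy case accuracy is 0.875 while CKD recall is 0.75, which shows why the per-class number is the one to watch when missed CKD cases are the costly error.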

In [None]:
coefficients = model.coef_[0]
feature_importance = pd.DataFrame({'Feature': x.columns, 'Coefficient': coefficients})

feature_importance['Absolute Coefficient'] = feature_importance['Coefficient'].abs()
feature_importance = feature_importance.sort_values(by='Absolute Coefficient', ascending=False)

print(feature_importance)

   Feature  Coefficient  Absolute Coefficient
3       al    -1.597840              1.597840
19      dm    -1.349247              1.349247
11      sc    -1.121542              1.121542
14    hemo     1.107820              1.107820
18     htn    -1.034853              1.034853
4       su    -0.772791              0.772791
22      pe    -0.732845              0.732845
21   appet    -0.618613              0.618613
17      rc     0.595850              0.595850
6       pc     0.552777              0.552777
5      rbc     0.353435              0.353435
23     ane    -0.209415              0.209415
15     pcv     0.125468              0.125468
13     pot    -0.081918              0.081918
1       bp    -0.068919              0.068919
12     sod     0.060713              0.060713
7      pcc    -0.056624              0.056624
20     cad    -0.048529              0.048529
2       sg     0.047061              0.047061
9      bgr    -0.019876              0.019876
0      age     0.010391              0.010391

As the EDA suggested, serum creatinine (sc), albumin (al), hemoglobin (hemo), and red blood cell count (rc) are strong indicators of whether a patient has CKD.

However, packed cell volume (pcv) and specific gravity (sg), which the EDA also pointed to, were not strong indicators here.

The coefficients differ from one training run to the next, but the observations above held across several runs.
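
One way to check how stable the coefficients are is to refit the model on several random splits and summarise each feature's coefficient by its mean and standard deviation. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the CKD data) and also standardises the features first, since raw logistic regression coefficients depend on each feature's scale. The feature names, the number of seeds, and the pipeline are all assumptions for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be x and y
x_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"f{i}" for i in range(x_demo.shape[1])]  # hypothetical names

rows = []
for seed in range(10):  # ten different random train/test splits
    x_tr, _, y_tr, _ = train_test_split(
        x_demo, y_demo, test_size=0.25, random_state=seed
    )
    # Standardise so coefficients are comparable across features
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(x_tr, y_tr)
    rows.append(pipeline[-1].coef_[0])  # coefficients of the fitted classifier

coef_runs = pd.DataFrame(rows, columns=feature_names)
summary = pd.DataFrame({"mean": coef_runs.mean(), "std": coef_runs.std()})
summary = summary.sort_values(by="mean", key=lambda s: s.abs(), ascending=False)

print(summary)
```

A feature whose mean coefficient is large relative to its standard deviation is a consistently strong indicator across splits, while one whose sign flips between runs is not.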