# Introduction

I've already done the data exploration, cleaning, and visualization in a previous notebook (www.kaggle.com/rogges23/ap-liver-data-analysis), so I'll move straight into using my cleaned dataset and evaluating models.

The general plan is to:
1. Load the data, get rid of any artifacts, and do a stratified train-test split.

2. Scale the numerical data; the protein/enzyme values especially are extremely variable and could probably benefit from standardization.

3. Use 10-fold cross-validation to compare a LinearSVC, SVC, Logistic Regression, K Neighbors Classifier, and Random Forest Classifier.

4. Compare the mean and standard deviation of each model's accuracy across the 10 folds, then pick the best one for hyperparameter tuning and final training. Ideally, I'd compare the models with a classification report (precision/recall/F1 score), but I'll save that for the final model.

5. Tune, train, and test the final model.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split, RandomizedSearchCV, GridSearchCV, KFold, cross_val_score
)
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv("/kaggle/input/liverpatientclean/liver-patient-clean.csv")

After I cleaned my data and saved it as a csv, an Unnamed: 0 column appeared. That's the DataFrame index, which to_csv writes by default unless index=False is passed, so I drop it here. Both the Liver Patient and Gender categorical variables are severely imbalanced (more males, and more people with liver disease), so it's important to stratify the train and test sets.

In [None]:
data.drop(["Unnamed: 0"], axis = 1, inplace = True)
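
To show where that column comes from, here's a minimal sketch using a tiny made-up frame (the values and the `demo` name are hypothetical, not the real dataset). It round-trips the frame through CSV text with and without `index=False`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned liver dataset.
demo = pd.DataFrame({"Age": [65, 62], "Liver Patient": [1, 0]})

# By default, to_csv writes the index as a first column with an empty header,
# which read_csv then names "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# Passing index=False leaves the index out of the file entirely.
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Saving with `index=False` in the cleaning notebook, or reading with `index_col=0` here, would make the drop step unnecessary.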

In [None]:
to_strat = data[["Liver Patient", "Gender_Female"]]

x_train, x_test, y_train, y_test = train_test_split(data.drop(["Liver Patient"], axis = 1),
                                                    data["Liver Patient"],
                                                    test_size = 0.2,
                                                    stratify = to_strat,
                                                    random_state = 1)
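
As a sanity check on stratifying by two columns at once, the sketch below builds a synthetic frame (the class proportions, `toy`, and `strata_shares` are my own assumptions for illustration) and compares the share of each Liver Patient / Gender_Female combination in the two splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (proportions are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Liver Patient": rng.choice([0, 1], size=n, p=[0.3, 0.7]),
    "Gender_Female": rng.choice([0, 1], size=n, p=[0.75, 0.25]),
    "Age": rng.integers(18, 90, size=n),
})

# Passing two columns to stratify treats each (label, gender) pair as one stratum.
strata = toy[["Liver Patient", "Gender_Female"]]
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, stratify=strata, random_state=1
)

def strata_shares(frame):
    """Fraction of rows that falls into each (Liver Patient, Gender_Female) pair."""
    return frame.groupby(["Liver Patient", "Gender_Female"]).size() / len(frame)

comparison = pd.DataFrame(
    {"train": strata_shares(train_toy), "test": strata_shares(test_toy)}
)
print(comparison.round(3))
```

If the stratification is working, the train and test columns should be nearly identical for every combination.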

I fit the StandardScaler on the training data only and then use that fitted scaler to transform both the training and test data, so nothing about the test set leaks into the scaling.

In [None]:
to_scale = ["Age", "Total_Bilirubin", "Direct_Bilirubin", "ALP", "ALT", "AST", "Protein", "Albumin", "AGR"]

scaler = StandardScaler()
scaler.fit(x_train[to_scale])
x_train[to_scale] = scaler.transform(x_train[to_scale])
x_test[to_scale] = scaler.transform(x_test[to_scale])
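
To make explicit what "transform the test data with the training fit" means, here's a small sketch with made-up numbers. It checks that StandardScaler applies z = (x - mean) / std using the training mean and the training population standard deviation; the array names are just for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train_vals = np.array([[10.0], [20.0], [30.0], [40.0]])
test_vals = np.array([[25.0], [100.0]])

# Fit on the training values only.
toy_scaler = StandardScaler().fit(train_vals)

# Manual z-score using the *training* statistics (np.std defaults to ddof=0,
# which matches what StandardScaler uses).
manual = (test_vals - train_vals.mean(axis=0)) / train_vals.std(axis=0)

scaled_test = toy_scaler.transform(test_vals)
print(np.allclose(scaled_test, manual))
```

A consequence is that the scaled test set generally won't have mean 0 and standard deviation 1, which is expected.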

To evaluate the models, I first put an instance of each into a dictionary. I then wrote a general function that takes the data, the dictionary of models, the number of folds, and the desired metric, and prints each model's mean score and standard deviation so I can compare them all.

In [None]:
models = {"LSVC": LinearSVC(), "SVC": SVC(), "K Neighbors": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier(),
          "Logistic Regression": LogisticRegression()}

In [None]:
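def compare_models(models, X_data, y_data, n_fold, score):
    """Print the mean (and standard deviation) of each model's CV score."""
    # The same unshuffled K-fold splitter is reused for every model.
    kfold = KFold(n_splits = n_fold)
    for name, model in models.items():
        cv_results = cross_val_score(model, X_data, y_data, cv = kfold, scoring = score)
        eval_msg = f"{name}: {cv_results.mean():.3f} ({cv_results.std():.3f})"
        print(eval_msg)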

In [None]:
compare_models(models, x_train, y_train, 10, 'accuracy')
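
One possible variation, not what I ran above: since the classes are imbalanced, a stratified and shuffled splitter keeps the class ratio similar in every fold, and putting the scaler inside a pipeline re-fits it on each fold's training portion. The sketch below uses synthetic data from `make_classification` and a single model, purely as an assumption-laden illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the liver dataset.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.3, 0.7], random_state=0
)

# The scaler is fit inside each fold, on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class ratio; shuffling removes any row-order effect.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scores = cross_val_score(pipe, X_toy, y_toy, cv=folds, scoring="accuracy")
print(f"{scores.mean():.3f} ({scores.std():.3f})")
```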

# Final Model - Hyperparameter Tuning and Training

The Random Forest Classifier came out consistently ahead on my system across the multiple runs I did while working out an acceptable workflow, so I'm choosing RFC. Based on those runs, or even just this iteration, the choice shouldn't matter much as long as you don't choose K Neighbors. Choosing RFC also means the standard scaling was ultimately unnecessary, since tree-based models aren't sensitive to feature scale, but it definitely made the comparison phase more competitive.

The following function is just an automated way of finding a good set of hyperparameters for RFC with RandomizedSearchCV. It'll be easy to copy and paste into future projects. 

In [None]:
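def random_hyper_RFC(X_Train, Y_Train, cross_val = 5):
    """Return the best RFC hyperparameters found by a randomized search."""
    hyperparameters = {'n_estimators': [100, 300, 500, 800, 1200],
                       'max_depth': [5, 10, 15, 25, 30],
                       'min_samples_split': [2, 5, 10, 25, 100],
                       'min_samples_leaf': [1, 2, 5, 10]}
    # With the default n_iter = 10, only 10 random combinations are tried,
    # and without a random_state the result can change between runs.
    random_search = RandomizedSearchCV(RandomForestClassifier(), hyperparameters, cv = cross_val)
    best_fit = random_search.fit(X_Train, Y_Train)

    return best_fit.best_params_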

In [None]:
new_RFC_params = random_hyper_RFC(x_train, y_train)
tuned_RFC = RandomForestClassifier(**new_RFC_params)
tuned_RFC.fit(x_train, y_train)
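
A possible variant of the search, sketched on synthetic data with a deliberately tiny grid so it runs quickly (the grid, `n_iter`, and `scoring` choices here are illustrative assumptions, not the settings used above). It fixes `random_state` for repeatability and reuses `best_estimator_`, which RandomizedSearchCV already refits on the full training data by default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative search space.
param_space = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_space,
    n_iter=4,          # number of random combinations to try
    cv=3,
    scoring="f1",      # pick the metric that matters for the problem
    random_state=0,    # makes the sampled combinations repeatable
)
search.fit(X_toy, y_toy)

# refit=True (the default) means this model is already trained on all of X_toy.
best_model = search.best_estimator_
print(search.best_params_)
```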

In [None]:
report = classification_report(y_test, tuned_RFC.predict(x_test), output_dict = True)
report_simple = {"No Liver Disease": report["0"], "Liver Disease": report["1"]}
pd.DataFrame(report_simple)

In [None]:
sns.heatmap(pd.DataFrame(report).iloc[:-1, :].T, annot=True, cmap = 'Blues')
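
A confusion matrix is another way to read the same results, and it makes the precision/recall arithmetic visible. This is a sketch with made-up labels and predictions (not my model's output), just to show how the four counts map onto the scores for the liver disease class.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions: 1 = liver disease, 0 = no liver disease.
y_true_toy = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 1, 1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

precision_liver = tp / (tp + fp)  # of those predicted diseased, how many are
recall_liver = tp / (tp + fn)     # of those actually diseased, how many were found

cm_table = pd.DataFrame(
    cm,
    index=["True: No Liver Disease", "True: Liver Disease"],
    columns=["Pred: No Liver Disease", "Pred: Liver Disease"],
)
print(cm_table)
print(precision_liver, recall_liver)
```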

# Removing Some Redundant Features

I'm going to remove some features based on clinical knowledge and then repeat the general flow for the RFC.

Total_Bilirubin and Direct_Bilirubin are heavily correlated because they're two of a set of three values (Indirect Bilirubin being the third): given any two, you can calculate the third. Total Bilirubin is the sum of Direct Bilirubin and Indirect Bilirubin. Of the three, I'd keep Direct Bilirubin since that's the bilirubin that's been conjugated by the liver.
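
The relationship can be written out directly. The values below are hypothetical lab results in their original units; note this only holds for raw values, not for the standardized columns used in the models above.

```python
import pandas as pd

# Hypothetical raw (unscaled) lab values.
labs = pd.DataFrame({
    "Total_Bilirubin": [0.7, 10.9, 7.3],
    "Direct_Bilirubin": [0.1, 5.5, 4.1],
})

# Total = Direct + Indirect, so the third value follows from the other two.
labs["Indirect_Bilirubin"] = labs["Total_Bilirubin"] - labs["Direct_Bilirubin"]
print(labs)
```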

I also don't need Protein or AGR because Albumin, which I already have as a feature, is likely the biggest contributor to both. Albumin is the most common plasma protein, so if you have decreased plasma protein, you most likely have decreased albumin as well, and vice versa. AGR is the Albumin to Globulin ratio and, clinically speaking, is most often decreased. Decreased AGR can occur due to a decrease in Albumin (as in liver disease) or an increase in Globulins (e.g. infection/neoplasia). In this setting, AGR is probably even more tied to Albumin than normal.

I could have done this with either a feature selection method or a correlation matrix but I thought it might be interesting to just try using clinical knowledge to make the decision. 
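
For comparison, the correlation matrix route might look something like the sketch below. The data is synthetic (I generate a direct and an indirect component and sum them), and the 0.8 cutoff is an arbitrary assumption, so treat it as an outline rather than what I'd necessarily run on the real features.

```python
import numpy as np
import pandas as pd

# Synthetic features: Total is built as Direct + an independent indirect part.
rng = np.random.default_rng(0)
direct = rng.gamma(2.0, 1.0, 300)
indirect = rng.gamma(2.0, 0.5, 300)
toy = pd.DataFrame({
    "Direct_Bilirubin": direct,
    "Total_Bilirubin": direct + indirect,
    "Age": rng.integers(18, 90, 300),
})

# Absolute pairwise correlations, keeping only the upper triangle
# so each pair of features appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Flag pairs above an (arbitrary) threshold as candidates for removal.
flagged = pairs[pairs > 0.8]
print(flagged)
```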

In [None]:
x_train2 = x_train.drop(["Total_Bilirubin", "Protein", "AGR"], axis = 1).copy()
x_test2 = x_test.drop(["Total_Bilirubin", "Protein", "AGR"], axis = 1).copy()

new_RFC_params2 = random_hyper_RFC(x_train2, y_train)
tuned_RFC2 = RandomForestClassifier(**new_RFC_params2)
tuned_RFC2.fit(x_train2, y_train)

In [None]:
report2 = classification_report(y_test, tuned_RFC2.predict(x_test2), output_dict = True)
report_simple2 = {"No Liver Disease": report2["0"], "Liver Disease": report2["1"]}
pd.DataFrame(report_simple2)

In [None]:
sns.heatmap(pd.DataFrame(report2).iloc[:-1, :].T, annot=True, cmap = "Blues")

# Impressions

Despite the poorer scores on predicting people without liver disease, those are the results I'm actually more impressed by. The class imbalance in this dataset is considerable (approx. 400 liver patients and 150 non-liver patients). I could've just made something that predicts everyone has liver disease, which would've given me a liver patient recall of 1 and a precision of approximately 0.72. The non-liver patient recall and precision scores would have been 0, however.
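
That "predict everyone has liver disease" baseline can be checked with a DummyClassifier. The sketch below uses the rounded counts from the paragraph above (400 and 150), not the exact dataset, so the numbers are approximate by construction.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

# Rounded class counts: 1 = liver patient, 0 = non-liver patient.
y_all = np.array([1] * 400 + [0] * 150)
X_dummy = np.zeros((len(y_all), 1))  # features are ignored by this baseline

# Always predicts the most frequent class, i.e. "liver disease" for everyone.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_all)
pred = baseline.predict(X_dummy)

liver_recall = recall_score(y_all, pred, pos_label=1)        # every patient is caught
liver_precision = precision_score(y_all, pred, pos_label=1)  # 400 / 550
healthy_recall = recall_score(y_all, pred, pos_label=0)      # nobody is predicted healthy
# zero_division=0 avoids a warning, since class 0 is never predicted.
healthy_precision = precision_score(y_all, pred, pos_label=0, zero_division=0)

print(liver_recall, round(liver_precision, 3), healthy_recall, healthy_precision)
```

Any real model has to beat these numbers on the non-liver class to be worth anything.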

Both models with the scaled features performed better on precision and recall for non-liver patients than when I was just messing around without processing the data (0.1-0.3 precision and 0.01-0.15 recall). The better performance of the model with the redundant features removed could be because my clinical knowledge helped improve the model, or it could just be a fluke based on the random_state in the train_test_split. I'd like to believe it was the clinical knowledge though.

Going forward, I'd probably look into undersampling or oversampling to help address the class imbalance, and use hypothesis tests to determine whether the features I thought extraneous were actually hindering the model. There may also be some value in engineering a new feature (the ratio of Direct Bilirubin to Indirect Bilirubin).
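
As a rough idea of what the oversampling step could look like, here's a sketch using `sklearn.utils.resample` on a tiny made-up training frame (the frame and its `feature` column are hypothetical). It randomly duplicates minority rows until the classes match, and it should only ever be applied to the training set, never the test set.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Made-up training frame: 15 liver patients, 5 non-liver patients.
train_toy = pd.DataFrame({
    "feature": np.arange(20.0),
    "Liver Patient": [1] * 15 + [0] * 5,
})

majority = train_toy[train_toy["Liver Patient"] == 1]
minority = train_toy[train_toy["Liver Patient"] == 0]

# Sample the minority class with replacement up to the majority size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)

# Recombine and shuffle so the classes aren't in blocks.
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=1)
print(balanced["Liver Patient"].value_counts())
```

Passing `class_weight="balanced"` to RandomForestClassifier is a lighter-weight alternative that reweights the classes without duplicating rows.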