# Practical Delivery 2: Learning Decision Models
#### *by Sindre Øyen*

---

---

# 1 Introduction

    This paper will present a comprehensive analysis of the Hepatitis C Virus (HCV) dataset from the University of California, Irvine (UCI) Machine Learning Repository [1]. The focus of the paper will be to apply and evaluate various decision modeling techniques - encompassing preprocessing, model construction, and model evaluation. The paper will explore multiple preprocessing strategies, including handling missing values and feature modification, to prepare the dataset for the use in different machine learning models. These models will range from instance-based learning and decision trees to ensemble learning with trees and neural networks. The evaluation of the models will be based on a balanced construction, performance metrics, and yield curves, offering insights into their applicability in healthcare data analysis.

*To initialize this study, the dataset itself can be loaded from the ICU database as such:*

In [467]:
from ucimlrepo import fetch_ucirepo 
  
# Loading in the dataset
hcv_data = fetch_ucirepo(id=571) 
  
# Separating the features and target
X = hcv_data.data.features 
y = hcv_data.data.targets 
  
def printInfo():
    print(hcv_data.metadata)  
    print(hcv_data.variables)

---

# 2 Preprocessing of the Dataset

In this section, I will perform a reasoned construction of various versions of the dataset. These will be possible to differentiate in the later performed work with the data to evaluate performance on different versions.

In [468]:
X

Unnamed: 0,Age,Sex,ALB,ALP,AST,BIL,CHE,CHOL,CREA,CGT,PROT,ALT
0,32,m,38.5,52.5,22.1,7.5,6.93,3.23,106.0,12.1,69.0,7.7
1,32,m,38.5,70.3,24.7,3.9,11.17,4.80,74.0,15.6,76.5,18.0
2,32,m,46.9,74.7,52.6,6.1,8.84,5.20,86.0,33.2,79.3,36.2
3,32,m,43.2,52.0,22.6,18.9,7.33,4.74,80.0,33.8,75.7,30.6
4,32,m,39.2,74.1,24.8,9.6,9.15,4.32,76.0,29.9,68.7,32.6
...,...,...,...,...,...,...,...,...,...,...,...,...
610,62,f,32.0,416.6,110.3,50.0,5.57,6.30,55.7,650.9,68.5,5.9
611,64,f,24.0,102.8,44.4,20.0,1.54,3.02,63.0,35.9,71.3,2.9
612,64,f,29.0,87.3,99.0,48.0,1.66,3.63,66.7,64.2,82.0,3.5
613,46,f,33.0,,62.0,20.0,3.56,4.20,52.0,50.0,71.0,39.0


In [469]:
X.describe()

Unnamed: 0,Age,ALB,ALP,AST,BIL,CHE,CHOL,CREA,CGT,PROT,ALT
count,615.0,614.0,597.0,615.0,615.0,615.0,605.0,615.0,615.0,614.0,614.0
mean,47.40813,41.620195,68.28392,34.786341,11.396748,8.196634,5.368099,81.287805,39.533171,72.044137,28.450814
std,10.055105,5.780629,26.028315,33.09069,19.67315,2.205657,1.132728,49.756166,54.661071,5.402636,25.469689
min,19.0,14.9,11.3,10.6,0.8,1.42,1.43,8.0,4.5,44.8,0.9
25%,39.0,38.8,52.5,21.6,5.3,6.935,4.61,67.0,15.7,69.3,16.4
50%,47.0,41.95,66.2,25.9,7.3,8.26,5.3,77.0,23.3,72.2,23.0
75%,54.0,45.2,80.1,32.9,11.2,9.59,6.06,88.0,40.2,75.4,33.075
max,77.0,82.2,416.6,324.0,254.0,16.41,9.67,1079.1,650.9,90.0,325.3


### 2.1 Analysis of the Dataset

In this section I will seek to understand and elaborate further on the HCV dataset and the data that is in it. By understanding missing values, statistical parameters, the types of characteristics, as well as the classification values, the aim is to better plan how to work efficiently with the dataset. As illustrated at it's web page at the UC Irvine's Machine Learning Repository, the data in the dataset is in the following format [1]:

| Variable Name | Role     | Type       | Demographic | Description | Units | Missing Values |
|---------------|----------|------------|-------------|-------------|-------|----------------|
| ID            | ID       | Integer    |             |  Patient ID |       | no             |
| Age           | Feature  | Integer    | Age         |             | years | no             |
| Sex           | Feature  | Binary     | Sex         |             |       | no             |
| ALB           | Feature  | Continuous |             |             |       | yes            |
| ALP           | Feature  | Continuous |             |             |       | yes            |
| AST           | Feature  | Continuous |             |             |       | yes            |
| BIL           | Feature  | Continuous |             |             |       | no             |
| CHE           | Feature  | Continuous |             |             |       | no             |
| CHOL          | Feature  | Continuous |             |             |       | yes            |
| CREA          | Feature  | Continuous |             |             |       | no             |
| CGT           | Feature  | Continuous |             |             |       | no             |
| PROT          | Feature  | Continuous |             |             |       | yes            |
| Category      | Target   | Categorical|             | values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis' |       | no             |
| ALT           | Feature  | Continuous |             |             |       | no             |


A quick note to be taken: For the features here the values are mostly continuous, which is somewhat expected as these are measurements gathered from the patients. The ID is a unique integer value, whilst the age is also of course a numerical-discrete value in integer form. The sex of the patients is a binary value of either 'm' or 'f'.
This will all be further discussed later in section 2.

#### 2.1.1 Missing Values Study
As can be read in the table, the HCV dataset from UC Irvine's Machine Learning Repository contains several variables, some of which have missing values. In this section, I will focus on analyzing these missing values to understand their impact on the dataset and how they should be handled for effective machine learning model development. The variables with missing values are: ALB, ALP, AST, CHOL, and PROT. These are all continuous features, indicating that they are likely to represent some quantitative measurements. 

Understanding the extent of which values are missing within certain variables is crucial. It is important to calculate the proportion of missing values for each variable. If a significant proportion of data is missing in a particular variable, it might impact the reliability of any analysis involving that variable. So, let's dive deeper into this:

In [470]:
import numpy as np
import pandas as pd

In [471]:
# ALB, ALP, AST, CHOL, and PROT has missing values
# Lets find the percentage of missing values for each of these variables
def missing_percentage(df):
    '''
    This function takes a DataFrame(df) as input and returns two columns, total missing values and total missing values percentage.

    Parameters
    ----------
    df : DataFrame
        The pandas object holding the data.

    Returns
    -------
    missing_values : Series
        Total missing values of each feature.
    '''
    # Get the count of non null values of each feature
    total = df.notnull().sum().sort_values(ascending = False)[df.isnull().sum().sort_values(ascending = False) != 0]
    total_null = df.isnull().sum().sort_values(ascending = False)[df.isnull().sum().sort_values(ascending = False) != 0]
    percent = np.round(df.isnull().sum().sort_values(ascending = False)/len(df)*100, 2)[np.round(df.isnull().sum().sort_values(ascending = False)/len(df)*100, 2) != 0]
    return pd.concat([total, total_null, percent], axis = 1, keys = ['No. values', 'No. null', '%'])

missing_percentage(X)

Unnamed: 0,No. values,No. null,%
ALB,614,1,0.16
PROT,614,1,0.16
ALT,614,1,0.16
CHOL,605,10,1.63
ALP,597,18,2.93


In the result table above, it is apparent that the degree of which the values are missing is varying. I would devise my strategy based on this information. I also note that in the source it appears that ALT and AST have been swapped, and that it is in fact ALT that has a missing value.

Firstly, since in the cases of the ALB, PROT, and ALT variables there are only one missing value for each, we could either perform a simple imputation strategy or just delete the missing instances. Whether deletion or imputation is the preferred choice, should depend on how skewed the dataset is. Let's understand this further:

In [472]:
# Find the range of each variable that has missing values
def describe_missing_vals(vals = ['ALB', 'PROT', 'ALT', 'CHOL', 'ALP'], data = X):
    print(data[vals].describe())

describe_missing_vals()

              ALB        PROT         ALT        CHOL         ALP
count  614.000000  614.000000  614.000000  605.000000  597.000000
mean    41.620195   72.044137   28.450814    5.368099   68.283920
std      5.780629    5.402636   25.469689    1.132728   26.028315
min     14.900000   44.800000    0.900000    1.430000   11.300000
25%     38.800000   69.300000   16.400000    4.610000   52.500000
50%     41.950000   72.200000   23.000000    5.300000   66.200000
75%     45.200000   75.400000   33.075000    6.060000   80.100000
max     82.200000   90.000000  325.300000    9.670000  416.600000


Let's break down these numbers!

Albumin (ALB):
- 1 missing value.
- The mean and median are close, indicating a relatively symmetrical distribution.
- Imputation is likely a good strategy here. Given the low percentage of missing data, mean or median imputation could work without significantly affecting the distribution.


Protein (PROT):
- 1 missing value.
- Similar to ALB, the distribution seems symmetrical.
- Again, mean or median imputation would be suitable due to the low percentage of missing data.


Alanine Aminotransferase (ALT):
- 1 missing value.
- The standard deviation is quite large relative to the mean, indicating variability.
- Given the data variability and that we have only a single missing entry, it might be a good solution to eliminate the entry with the missing data.


Cholesterol (CHOL):
- 10 missing values.
- The mean and median are fairly close, but the standard deviation is relatively high - relative to the mean value. 
- Given the slightly higher percentage of missing data (1.63%), a simple imputation might still be reasonable. However, I will attempt to use a more sofisticated imputation for this value. 


Alkaline Phosphatase (ALP):
- 18 missing values.
- There's a significant difference between the mean and the median, indicating a skewed distribution.
- The percentage of missing data is higher (2.93%). For skewed distributions, median imputation, or a more complex imputation method like regression imputation, could be considered to avoid introducing bias.

In [473]:
# ALB and PROT
# Imputing the missing value for ALB and PROT
X['ALB'].fillna(X['ALB'].mean(), inplace = True)
X['PROT'].fillna(X['PROT'].mean(), inplace = True)

# ALT
# Eliminating the entry with missing value for ALT
X.dropna(subset = ['ALT'], inplace = True)

# Create different versions of the dataset with different imputers
from sklearn.impute import KNNImputer, SimpleImputer

# Imputing the missing values for CHOL and ALP using different imputers to create different versions of the dataset
imputer_knn = KNNImputer(n_neighbors = 8)
imputer_regression = SimpleImputer(strategy = 'mean')

def get_imputed_data(imputer, df):
    '''
    This function takes an imputer object and a DataFrame as input and returns a DataFrame with the missing values imputed.

    Parameters
    ----------
    imputer : Imputer
        An imputer object.
    df : DataFrame
        The pandas object holding the data.

    Returns
    -------
    df : DataFrame
        The pandas object holding the data with the missing values imputed.
    '''
    # Imputing CHOL and ALP
    df_copy = df.copy()
    df_copy[['CHOL', 'ALP']] = imputer.fit_transform(df_copy[['CHOL', 'ALP']])
    return df_copy

print("Original:")
describe_missing_vals(['CHOL', 'ALP'])

print("\n Imputed with Regression:")
describe_missing_vals(['CHOL', 'ALP'], get_imputed_data(imputer_regression, X))

print("\nImputed with KNN:")
describe_missing_vals(['CHOL', 'ALP'], get_imputed_data(imputer_knn, X))


Original:
             CHOL         ALP
count  604.000000  596.000000
mean     5.367053   68.304027
std      1.133375   26.045538
min      1.430000   11.300000
25%      4.607500   52.500000
50%      5.300000   66.250000
75%      6.065000   80.125000
max      9.670000  416.600000

 Imputed with Regression:
             CHOL         ALP
count  614.000000  614.000000
mean     5.367053   68.304027
std      1.124092   25.660291
min      1.430000   11.300000
25%      4.620000   52.925000
50%      5.305000   66.700000
75%      6.057500   79.300000
max      9.670000  416.600000

Imputed with KNN:
             CHOL         ALP
count  614.000000  614.000000
mean     5.366708   68.242507
std      1.124769   25.682554
min      1.430000   11.300000
25%      4.620000   52.900000
50%      5.300000   66.300000
75%      6.060000   79.300000
max      9.670000  416.600000


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['ALB'].fillna(X['ALB'].mean(), inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['PROT'].fillna(X['PROT'].mean(), inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X.dropna(subset = ['ALT'], inplace = True)


With the data from after imputing, the data seem to have kept a distribution that is close to the original. I will keep these versions and utilize them for other purposes later on.

### 2.2 Feature Modification of the Dataset Versions

##### 2.2.1 Encoding

As most of the values are continuous and not binary or numerical, there are not a lot of encoding needed for this dataset. However, the Sex feature can be label-encoded or one-hot encoded instead of being categorical with strings as identifiers.

Now, there are two main concerns to consider: 
- label-encoding implies an ordinal relationship between the two values. However, even though the different sexes can each be assigned either a 0 or a 1 as a classification - there is no real mathematical relation or distance between the two. Thus, label-encoding provides a more simple transformation than that of a one-hot encoder, but may create biased results because of the assumption of a mathematical relation, even though 0='male' !< 1='female' and vice versa. 
- One-hot encoding introduces more dimensionality into the dataset, which might slow down some models. However, this should not be too much of a concern given that this dataset only introduces two classifications for sex. 

Given these reflections, since one-hot encoding does not introduce too great of a dimensionality increase, I will stick with one-hot encoding to avoid artificial ordinality and biased results based on a non-existent mathematical relation between the two sexes. 

*Below I am encoding the data using the ColumnTransformer in the sklearn library as suggested by Sunny Srinidhi [2].*

In [474]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Assuming your dataframe is named 'df'

# Create the ColumnTransformer
transformer = ColumnTransformer(transformers=[('column_transformer', OneHotEncoder(), ['Sex'])], remainder='passthrough')

# Fit and transform the data
encoded_data = transformer.fit_transform(X)

# Get the column names for the one-hot encoded columns
encoded_feature_names = transformer.named_transformers_['column_transformer'].get_feature_names_out()

# Combine the new and original column names, excluding 'Sex' as it is now one-hot encoded
final_columns = list(encoded_feature_names) + list(X.drop('Sex', axis=1).columns)

# Convert the numpy array back to a DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=final_columns)

# Convert the one-hot encoded columns to integers and the Age column to int
encoded_df[encoded_feature_names] = encoded_df[encoded_feature_names].astype(int)
encoded_df['Age'] = encoded_df['Age'].astype(int)

# Replace the original features with the encoded features
X = encoded_df

X 

Unnamed: 0,Sex_f,Sex_m,Age,ALB,ALP,AST,BIL,CHE,CHOL,CREA,CGT,PROT,ALT
0,0,1,32,38.5,52.5,22.1,7.5,6.93,3.23,106.0,12.1,69.0,7.7
1,0,1,32,38.5,70.3,24.7,3.9,11.17,4.80,74.0,15.6,76.5,18.0
2,0,1,32,46.9,74.7,52.6,6.1,8.84,5.20,86.0,33.2,79.3,36.2
3,0,1,32,43.2,52.0,22.6,18.9,7.33,4.74,80.0,33.8,75.7,30.6
4,0,1,32,39.2,74.1,24.8,9.6,9.15,4.32,76.0,29.9,68.7,32.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,1,0,62,32.0,416.6,110.3,50.0,5.57,6.30,55.7,650.9,68.5,5.9
610,1,0,64,24.0,102.8,44.4,20.0,1.54,3.02,63.0,35.9,71.3,2.9
611,1,0,64,29.0,87.3,99.0,48.0,1.66,3.63,66.7,64.2,82.0,3.5
612,1,0,46,33.0,,62.0,20.0,3.56,4.20,52.0,50.0,71.0,39.0


In [475]:
# We now have two versions of the dataset
# 1. The dataset with missing values imputed using KNN
# 2. The dataset with missing values imputed using regression
dataset_versions = [get_imputed_data(imputer_knn, X), get_imputed_data(imputer_regression, X)]

##### 2.2.2 Outlier Detection

Another important element to consider in using datasets for machine learning models, especially within the medical scene - is outliers in the dataset. Given that this dataset is based on results to identify HCV or more severe liver disease, concluding on whether outliers are disruptive or indicative is a little difficult.

In this section I seek to gain an overview of the outlier scenery, without making any changes just yet. However, this knowledge can be important to understand undefined behaviour, false-postives, and false-negatives later on.

In [497]:
from sklearn.ensemble import IsolationForest

# Create the isolation forest model
outlier_detector = IsolationForest(contamination=0.1, random_state=86)
outlier_detector.fit(dataset_versions[0])

# Predict the outliers
outlier_predictions = outlier_detector.predict(dataset_versions[0])

print("Number of entries within each target value:")
print(y.value_counts())
print("\n")
print("Number of outliers within each target value:")
print(y.iloc[np.where(outlier_predictions == -1)[0]].value_counts())


Number of entries within each target value:
Category              
0=Blood Donor             533
3=Cirrhosis                30
1=Hepatitis                24
2=Fibrosis                 21
0s=suspect Blood Donor      7
Name: count, dtype: int64


Number of outliers within each target value:
Category              
3=Cirrhosis               26
0=Blood Donor             13
1=Hepatitis               10
0s=suspect Blood Donor     7
2=Fibrosis                 6
Name: count, dtype: int64


##### 2.2.3 Discretization

To further build on the analysis of this dataset and which preprocessing that may be plausible, I want to further analyse the variables in the dataset. These can, at this step, be described as such:

In [None]:
dataset_versions[0].describe() # Only displaying the first version of the dataset for brevity

Unnamed: 0,Sex_f,Sex_m,Age,ALB,ALP,AST,BIL,CHE,CHOL,CREA,CGT,PROT,ALT
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,0.387622,0.612378,47.423453,41.614691,68.242507,34.789088,11.403909,8.194381,5.366708,81.293322,39.566775,72.058867,28.450814
std,0.487605,0.487605,10.056115,5.779015,25.682554,33.1176,19.688388,2.206747,1.124769,49.796545,54.69928,5.390252,25.469689
min,0.0,0.0,19.0,14.9,11.3,10.6,0.8,1.42,1.43,8.0,4.5,44.8,0.9
25%,0.0,0.0,39.0,38.8,52.9,21.6,5.3,6.9325,4.62,67.0,15.7,69.3,16.4
50%,0.0,1.0,47.0,41.9,66.3,25.9,7.3,8.26,5.3,76.85,23.3,72.2,23.0
75%,1.0,1.0,54.0,45.2,79.3,32.9,11.2,9.5925,6.06,88.0,40.2,75.4,33.075
max,1.0,1.0,77.0,82.2,416.6,324.0,254.0,16.41,9.67,1079.1,650.9,90.0,325.3


Basing purely on the description of the data - ALP, AST, BIL, CHE, CREA, CGT, and ALT are potential candidates for discretization due to their skewed distributions and the presence of outliers. Discretization of these variables might enhance the model's ability to identify patterns and robustness. Thus, I want to create two new versions for each of the previous versions -> discretized and non-discretized.

In [None]:
# Discretizing ALP, AST, BIL, CHE, CREA, CGT, and ALT

from sklearn.preprocessing import KBinsDiscretizer


##### 2.2.4 Standardization

We can also create more versions of the dataset with standardization:

In [None]:
# We will work further on these to create different versions of the dataset
from sklearn.preprocessing import StandardScaler, RobustScaler


# Create different versions of the dataset with different scalers
# Here I am using StandardScaler and RobustScaler because RobustScaler is more robust to outliers
# and StandardScaler is more sensitive to outliers. This will help us study the effect of outliers on the dataset.
scalers = [StandardScaler(), RobustScaler()]

def get_scaled_data(scaler, df):
    '''
    This function takes a scaler object and a DataFrame as input and returns a DataFrame with the features scaled.

    Parameters
    ----------
    scaler : Scaler
        A scaler object.
    df : DataFrame
        The pandas object holding the data.

    Returns
    -------
    df : DataFrame
        The pandas object holding the data with the features scaled.
    '''
    # Scaling the features except the one-hot encoded features
    df_copy = df.copy()
    df_copy[df.drop(encoded_feature_names, axis=1).columns] = scaler.fit_transform(df.drop(encoded_feature_names, axis=1))
    return df_copy

# Create different versions of the dataset with different scalers
dataset_versions_scaled = [get_scaled_data(scaler, df) for df in dataset_versions for scaler in scalers]

# Print the differences between the original dataset and the different versions of the dataset
def __print_scaled_versions():
    for i in range(0, len(dataset_versions_scaled)):
        print("#"*20, "\nDataset version", i+1, ":\n")
        print(dataset_versions_scaled[i].describe(), "\n")

# Uncomment if you want to study the differences between the datasets
#__print_scaled_versions() # Commented out for brevity

With our now *4 versions* of the dataset, we can move on to work further with the different datasets.

These are the different datasets:
1. The dataset with missing values imputed using KNN and scaled using StandardScaler
4. The dataset with missing values imputed using KNN and scaled using RobustScaler
5. The dataset with missing values imputed using regression and scaled using StandardScaler
8. The dataset with missing values imputed using regression and scaled using RobustScaler

These can now be accessed using `dataset_versions_scaled[i-1]`

### 2.3 Feature Count Reduction of the Dataset Versions 

In [None]:
# Let's perform a principal component analysis on the dataset
from sklearn.decomposition import PCA

# Create a PCA object
pca = PCA(n_components = 0.95)

# Fit and transform the dataset
pca_data = pca.fit_transform(dataset_versions_scaled[0])

# Print the number of components
print("Number of components:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Explained variance ratio sum:", sum(pca.explained_variance_ratio_))

Number of components: 11
Explained variance ratio: [0.21460663 0.16079638 0.12031029 0.09615263 0.0840006  0.06764563
 0.06268    0.05108518 0.04729501 0.03891188 0.02907983]
Explained variance ratio sum: 0.9725640698098349


### 2.4 Example Set Modification of the Dataset Versions

In [None]:
# Splitting the dataset into train and test sets
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify=y)

---

# 3 Construction of Decision Models

### 3.1 Instance Based Learning

### 3.2 Decision Trees

### 3.3 Ensemble Learning with Trees

### 3.4 Linear Models

### 3.5 Neural Networks

---

# 4 Model Evaluation

### 4.1 Balanced Model Construction

### 4.2 Performance Evaluation

### 4.3 Yield Curves

---

# 5 Presentation and Defense

---

# 6 Bibliography

[1] Lichtinghagen,Ralf, Klawonn,Frank, and Hoffmann,Georg. (2020). *HCV data*. UCI Machine Learning Repository. Accessed 13.01.24. https://doi.org/10.24432/C5D612.

[2] Srinidhi, Sunny. (2019). *Use ColumnTransformer in SciKit instead of LabelEncoding and OneHotEncoding for data preprocessing in Machine Learning*. Towards Data Science. Accessed 31.01.24. https://towardsdatascience.com/columntransformer-in-scikit-for-labelencoding-and-onehotencoding-in-machine-learning-c6255952731b