# Practical Delivery 2: Learning Decision Models
#### *by Sindre Øyen*

---


# 1 Introduction

This paper presents an analysis of the Hepatitis C Virus (HCV) dataset from the University of California, Irvine (UCI) Machine Learning Repository [1]. The focus is on applying and evaluating various decision modeling techniques, covering preprocessing, model construction, and model evaluation. The paper explores multiple preprocessing strategies, including the handling of missing values and feature modification, to prepare the dataset for use in different machine learning models. These models range from instance-based learning and decision trees to ensemble learning with trees and neural networks. The models are evaluated on the basis of balanced model construction, performance metrics, and yield curves, offering insights into their applicability in healthcare data analysis.

*To begin the study, the dataset can be loaded from the UCI repository as follows:*

In [37]:
from ucimlrepo import fetch_ucirepo 
  
# Loading in the dataset
hcv_data = fetch_ucirepo(id=571) 
  
# Separating the features and target
X = hcv_data.data.features 
y = hcv_data.data.targets 
  
def printInfo():
    print(hcv_data.metadata)  
    print(hcv_data.variables)

In [38]:
# Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify=y)
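
The effect of `stratify` in the split above can be illustrated on a small synthetic example. This is only a sketch: the label names, class sizes, and the single feature `x` are hypothetical stand-ins and are not taken from the HCV data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for the HCV target column
labels = pd.Series(["donor"] * 80 + ["hepatitis"] * 10 + ["cirrhosis"] * 10,
                   name="Category")
features = pd.DataFrame({"x": range(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

# Class shares in the full data and in each split; with stratification
# the three columns should be close to each other
split_shares = pd.DataFrame({
    "full": labels.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(split_shares.round(2))
```

Without `stratify`, a rare class could by chance end up almost entirely in one of the two splits, which matters for a dataset where most entries belong to a single class.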

---

# 2 Preprocessing of the Dataset

In this section, I will construct several versions of the dataset, giving the reasoning behind each. The versions are kept distinguishable so that model performance can later be compared across them.

In [39]:
hcv_data.data.features

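|     | Age | Sex | ALB  | ALP   | AST   | BIL  | CHE   | CHOL | CREA  | CGT   | PROT | ALT  |
|-----|-----|-----|------|-------|-------|------|-------|------|-------|-------|------|------|
| 0   | 32  | m   | 38.5 | 52.5  | 22.1  | 7.5  | 6.93  | 3.23 | 106.0 | 12.1  | 69.0 | 7.7  |
| 1   | 32  | m   | 38.5 | 70.3  | 24.7  | 3.9  | 11.17 | 4.80 | 74.0  | 15.6  | 76.5 | 18.0 |
| 2   | 32  | m   | 46.9 | 74.7  | 52.6  | 6.1  | 8.84  | 5.20 | 86.0  | 33.2  | 79.3 | 36.2 |
| 3   | 32  | m   | 43.2 | 52.0  | 22.6  | 18.9 | 7.33  | 4.74 | 80.0  | 33.8  | 75.7 | 30.6 |
| 4   | 32  | m   | 39.2 | 74.1  | 24.8  | 9.6  | 9.15  | 4.32 | 76.0  | 29.9  | 68.7 | 32.6 |
| ... | ... | ... | ...  | ...   | ...   | ...  | ...   | ...  | ...   | ...   | ...  | ...  |
| 610 | 62  | f   | 32.0 | 416.6 | 110.3 | 50.0 | 5.57  | 6.30 | 55.7  | 650.9 | 68.5 | 5.9  |
| 611 | 64  | f   | 24.0 | 102.8 | 44.4  | 20.0 | 1.54  | 3.02 | 63.0  | 35.9  | 71.3 | 2.9  |
| 612 | 64  | f   | 29.0 | 87.3  | 99.0  | 48.0 | 1.66  | 3.63 | 66.7  | 64.2  | 82.0 | 3.5  |
| 613 | 46  | f   | 33.0 |       | 62.0  | 20.0 | 3.56  | 4.20 | 52.0  | 50.0  | 71.0 | 39.0 |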


### 2.1 Analysis of the Dataset

In this section I will examine the HCV dataset and its contents in more detail. By studying the missing values, the statistical parameters, the types of the features, and the classification values, the aim is to plan how to work efficiently with the dataset. As documented on its web page in the UC Irvine Machine Learning Repository, the dataset has the following format [1]:

| Variable Name | Role     | Type       | Demographic | Description | Units | Missing Values |
|---------------|----------|------------|-------------|-------------|-------|----------------|
| ID            | ID       | Integer    |             |  Patient ID |       | no             |
| Age           | Feature  | Integer    | Age         |             | years | no             |
| Sex           | Feature  | Binary     | Sex         |             |       | no             |
| ALB           | Feature  | Continuous |             |             |       | yes            |
| ALP           | Feature  | Continuous |             |             |       | yes            |
| AST           | Feature  | Continuous |             |             |       | yes            |
| BIL           | Feature  | Continuous |             |             |       | no             |
| CHE           | Feature  | Continuous |             |             |       | no             |
| CHOL          | Feature  | Continuous |             |             |       | yes            |
| CREA          | Feature  | Continuous |             |             |       | no             |
| CGT           | Feature  | Continuous |             |             |       | no             |
| PROT          | Feature  | Continuous |             |             |       | yes            |
| Category      | Target   | Categorical|             | values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis' |       | no             |
| ALT           | Feature  | Continuous |             |             |       | no             |


#### 2.1.1 Missing Values Study
As can be read in the table, the HCV dataset from UC Irvine's Machine Learning Repository contains several variables, some of which have missing values. In this section, I will focus on analyzing these missing values to understand their impact on the dataset and how they should be handled for effective machine learning model development. The variables with missing values are: ALB, ALP, AST, CHOL, and PROT. These are all continuous features, indicating that they are likely to represent some quantitative measurements. 

Understanding the extent to which values are missing within each variable is crucial, so the proportion of missing values should be calculated per variable. If a significant proportion of the data is missing for a particular variable, this may affect the reliability of any analysis involving that variable. So, let's dive deeper into this:

In [40]:
import numpy as np
import pandas as pd

In [41]:
# ALB, ALP, AST, CHOL, and PROT are listed as having missing values
# Let's find the percentage of missing values for each variable that has any
def missing_percentage(df):
    '''
    Summarise the missing values of a DataFrame, listing only the features that have any.

    Parameters
    ----------
    df : DataFrame
        The pandas object holding the data.

    Returns
    -------
    summary : DataFrame
        One row per feature with missing values and three columns: the number of
        non-null values, the number of null values, and the percentage of null values.
    '''
    # Keep only the features with at least one missing value, fewest missing first
    null_counts = df.isnull().sum()
    null_counts = null_counts[null_counts != 0].sort_values()
    total = df.notnull().sum()[null_counts.index]
    percent = np.round(null_counts / len(df) * 100, 2)
    return pd.concat([total, null_counts, percent], axis = 1, keys = ['No. values', 'No. null', '%'])

missing_percentage(hcv_data.data.features)

|      | No. values | No. null | %    |
|------|------------|----------|------|
| ALB  | 614        | 1        | 0.16 |
| PROT | 614        | 1        | 0.16 |
| ALT  | 614        | 1        | 0.16 |
| CHOL | 605        | 10       | 1.63 |
| ALP  | 597        | 18       | 2.93 |


The result table above shows that the amount of missing data varies between the variables, and I will base my strategy on this information. I also note that the source documentation appears to have swapped ALT and AST: it is in fact ALT, not AST, that has a missing value.

Firstly, since ALB, PROT, and ALT each have only one missing value, we could either apply a simple imputation strategy or delete the affected instances. Whether deletion or imputation is the better choice should depend on how skewed each variable is. Let's understand this further:

In [42]:
# Find the range and central tendency of each variable that has missing values
for i in ['ALB', 'PROT', 'ALT', 'CHOL', 'ALP']:
    col = hcv_data.data.features[i]
    print(i, ':', 'min =', col.min(), ', max =', col.max())
    print("Mode value for", i, "is", col.mode()[0])
    print("Median value for", i, "is", col.median())
    print("Mean value for", i, "is", col.mean())
    print("The mean value is this percentage less than the max value:", (col.max() - col.mean())/col.max()*100)
    print("The mean value is this percentage more than the min value:", (col.mean() - col.min())/col.min()*100, "\n")

ALB : min = 14.9 , max = 82.2
Mode value for ALB is 39.0
Median value for ALB is 41.95
Mean value for ALB is 41.62019543973941
The mean value is this percentage less than the max value: 49.3671588324338
The mean value is this percentage more than the min value: 179.33017073650612 

PROT : min = 44.8 , max = 90.0
Mode value for PROT is 71.9
Median value for PROT is 72.2
Mean value for PROT is 72.0441368078176
The mean value is this percentage less than the max value: 19.950959102424896
The mean value is this percentage more than the min value: 60.81280537459285 

ALT : min = 0.9 , max = 325.3
Mode value for ALT is 16.6
Median value for ALT is 23.0
Mean value for ALT is 28.450814332247557
The mean value is this percentage less than the max value: 91.25397653481477
The mean value is this percentage more than the min value: 3061.201592471951 

CHOL : min = 1.43 , max = 9.67
Mode value for CHOL is 5.07
Median value for CHOL is 5.3
Mean value for CHOL is 5.368099173553719
The mean value is t

Let's break down these numbers!

*One missing value*
- One data entry has a missing ALB value. The mode, median, and mean values are fairly close to each other. However, the range is wide: the mean lies about 179% above the minimum and about 49% below the maximum. Replacing the missing value with the mean could therefore misrepresent the entry, so it may be better to eliminate the entry with the missing data.
- With regard to PROT, the mode, median, and mean are very close (71.9, 72.2, and 72.04) and the range is comparatively narrow, so imputation with the mean value is acceptable here.
- With regard to ALT, the mode, median, and mean differ noticeably (16.6, 23.0, and 28.45), and the mean lies about 91% below the maximum, which indicates a long right tail. It is therefore better to eliminate this entry.
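
The comparison of mean, median, and range above is an informal way of judging skewness. As a possible complement, pandas can compute a sample skewness directly with `Series.skew()`. The sketch below uses synthetic columns (hypothetical stand-ins, not the HCV variables) to show how a roughly symmetric column and a right-tailed column differ on these statistics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns: one roughly symmetric, one with a long right tail
symmetric = pd.Series(rng.normal(loc=72.0, scale=5.0, size=600))
right_tailed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=600))

def summarize(col):
    """Return mean, median, and sample skewness of a numeric Series."""
    return {"mean": col.mean(), "median": col.median(), "skew": col.skew()}

sym_stats = summarize(symmetric)
tail_stats = summarize(right_tailed)

print("symmetric:   ", sym_stats)
print("right-tailed:", tail_stats)
```

A skewness near zero together with a mean close to the median suggests that mean imputation is a reasonable default, while a clearly positive skewness with the mean well above the median suggests that the mean is pulled up by a few large values.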

*Several missing values*

CHOL and ALP both have a larger number of missing values, so it is important to try to understand why. I will attempt a more sophisticated imputation approach for these variables, such as regression imputation, and examine whether the missingness is correlated with the other variables in the data.
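
As a minimal sketch of what regression-based imputation could look like, the example below uses scikit-learn's `IterativeImputer`, which predicts each missing value from the other columns. The columns `a` and `b`, the synthetic data, and the choice of `IterativeImputer` are assumptions made for illustration; this is not the pipeline used in the cell that follows. The example also shows one simple way to check whether missingness is related to another variable, by correlating a missing-value indicator with that variable.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: column b depends almost linearly on column a
a = rng.normal(50, 10, n)
b = 2.0 * a + rng.normal(0, 1, n)
toy = pd.DataFrame({"a": a, "b": b})
true_b = toy["b"].copy()

# Remove 20 values of b at random positions
missing_idx = rng.choice(n, size=20, replace=False)
toy.loc[missing_idx, "b"] = np.nan

# Correlation between "b is missing" and a; values near 0 suggest
# that the missingness does not depend on a
missing_flag = toy["b"].isna().astype(int)
print("corr(missing flag, a):", round(missing_flag.corr(toy["a"]), 3))

# Regression-style imputation: b is predicted from a
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)

# Compare against plain mean imputation on the removed entries
reg_error = (imputed.loc[missing_idx, "b"] - true_b.loc[missing_idx]).abs().mean()
mean_error = (toy["b"].mean() - true_b.loc[missing_idx]).abs().mean()
print("mean abs. error, regression imputation:", round(reg_error, 2))
print("mean abs. error, mean imputation:      ", round(mean_error, 2))
```

When the incomplete column is strongly related to the others, as in this toy setup, the regression-based estimate should land much closer to the removed values than the column mean does. On real data the gain depends on how well the other variables predict the missing one.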

In [43]:
# Work on a copy so that the original DataFrame (and the earlier split) is left untouched
features = hcv_data.data.features.copy()
# Eliminating the entries with missing values for ALB and ALT
features = features.dropna(subset = ['ALB', 'ALT'])
# Imputing the missing value for PROT
features['PROT'] = features['PROT'].fillna(features['PROT'].mean())

# Create different versions of the dataset with different imputers
from sklearn.impute import KNNImputer, SimpleImputer

# Imputing the missing values for CHOL and ALP using different imputers to create different versions of the dataset
imputer_knn = KNNImputer(n_neighbors = 5)
# Note: SimpleImputer with strategy 'mean' is plain mean imputation, not regression imputation
imputer_mean = SimpleImputer(strategy = 'mean')

def get_imputed_data(imputer, df):
    '''
    This function takes an imputer object and a DataFrame as input and returns a copy of the DataFrame with CHOL and ALP imputed.

    Parameters
    ----------
    imputer : Imputer
        An imputer object.
    df : DataFrame
        The pandas object holding the data.

    Returns
    -------
    df : DataFrame
        A copy of the data with the missing CHOL and ALP values imputed.
    '''
    df = df.copy()
    # Fit on all numeric columns so that KNNImputer can use the other features to find neighbours
    numeric_cols = df.select_dtypes(include = 'number').columns
    imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns = numeric_cols, index = df.index)
    # Imputing CHOL and ALP
    df[['CHOL', 'ALP']] = imputed[['CHOL', 'ALP']]
    return df

missing_percentage(get_imputed_data(imputer_knn, features))

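
One point worth checking when using `KNNImputer` is which columns it is fitted on. Neighbours are found by comparing the other, observed features, so an imputer fitted on a single column has nothing to measure distances with. The sketch below, on a hypothetical two-column table, compares fitting on the incomplete column alone with fitting on both columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical table: b is roughly 10 * a, with one value of b missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "b": [10.0, 20.0, np.nan, 100.0, 110.0, 120.0],
})

# Fitted on column b alone: no other feature is available to find neighbours
single = KNNImputer(n_neighbors=2).fit_transform(toy[["b"]])[2, 0]

# Fitted on both columns: neighbours are chosen by their distance in a
both = KNNImputer(n_neighbors=2).fit_transform(toy)[2, 1]

print("imputed from b alone:   ", single)
print("imputed from a and b:   ", both)
print("mean of observed b:     ", toy["b"].mean())
```

In this toy case the single-column result coincides with the mean of the observed `b` values, while the two-column result is the average of `b` over the two rows whose `a` is closest to the incomplete row. This suggests passing all relevant numeric columns to the imputer and then keeping only the imputed columns of interest.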



#### 2.1.3 Type of Characteristics in the Dataset

#### 2.1.4 Study of the Classification Values

### 2.3 Missing Value Processing of the Dataset Versions

### 2.4 Feature Modification of the Dataset Versions

### 2.5 Feature Count Reduction of the Dataset Versions 

### 2.6 Example Set Modification of the Dataset Versions

---

# 3 Construction of Decision Models

### 3.1 Instance Based Learning

### 3.2 Decision Trees

### 3.3 Ensemble Learning with Trees

### 3.4 Linear Models

### 3.5 Neural Networks

---

# 4 Model Evaluation

### 4.1 Balanced Model Construction

### 4.2 Performance Evaluation

### 4.3 Yield Curves

---

# 5 Presentation and Defense

---

# 6 Bibliography

[1] Lichtinghagen, Ralf, Klawonn, Frank, and Hoffmann, Georg. (2020). HCV data. UCI Machine Learning Repository. https://doi.org/10.24432/C5D612.