# Practical Delivery 2: Learning Decision Models
#### *by Sindre Øyen*

---


# 1 Introduction

This paper presents an analysis of the Hepatitis C Virus (HCV) dataset from the University of California, Irvine (UCI) Machine Learning Repository [1]. The focus is on applying and evaluating decision modeling techniques across three stages: preprocessing, model construction, and model evaluation. Several preprocessing strategies are explored, including missing-value handling and feature modification, to prepare the dataset for use in different machine learning models. These models range from instance-based learning and decision trees to ensemble learning with trees, linear models, and neural networks. The models are evaluated on balanced model construction, performance metrics, and yield curves, offering insight into their applicability to healthcare data analysis.

The dataset contains the following variables [1]:

| Variable Name | Role     | Type       | Demographic | Description | Units | Missing Values |
|---------------|----------|------------|-------------|-------------|-------|----------------|
| ID            | ID       | Integer    |             |  Patient ID |       | no             |
| Age           | Feature  | Integer    | Age         |             | years | no             |
| Sex           | Feature  | Binary     | Sex         |             |       | no             |
| ALB           | Feature  | Continuous |             |             |       | yes            |
| ALP           | Feature  | Continuous |             |             |       | yes            |
| AST           | Feature  | Continuous |             |             |       | yes            |
| BIL           | Feature  | Continuous |             |             |       | no             |
| CHE           | Feature  | Continuous |             |             |       | no             |
| CHOL          | Feature  | Continuous |             |             |       | yes            |
| CREA          | Feature  | Continuous |             |             |       | no             |
| CGT           | Feature  | Continuous |             |             |       | no             |
| PROT          | Feature  | Continuous |             |             |       | yes            |
| Category      | Target   | Categorical|             | values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis' |       | no             |
| ALT           | Feature  | Continuous |             |             |       | no             |
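
The `Category` labels in the table combine a short code and a readable name in a single string. As a minimal sketch, assuming the labels arrive exactly in the form listed above, they could be split into the two parts as follows. The helper `split_category` and the `code_to_name` dictionary are hypothetical names introduced here for illustration only.

```python
def split_category(label):
    """Split a label such as '1=Hepatitis' into (code, name)."""
    code, _, name = label.partition("=")
    return code.strip(), name.strip()


# The five target values as listed in the variable table
category_labels = [
    "0=Blood Donor",
    "0s=suspect Blood Donor",
    "1=Hepatitis",
    "2=Fibrosis",
    "3=Cirrhosis",
]

parsed = [split_category(label) for label in category_labels]

# Lookup from code to readable name; note that '0s' is not numeric,
# so the codes are kept as strings rather than converted to integers
code_to_name = dict(parsed)

for code, name in parsed:
    print(f"{code!r:>5} -> {name}")
```

Keeping the codes as strings avoids losing the distinction between `0` and `0s`, which a naive integer conversion would either reject or collapse.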


[1] Lichtinghagen, Ralf, Klawonn, Frank, and Hoffmann, Georg. (2020). HCV data. UCI Machine Learning Repository. https://doi.org/10.24432/C5D612.

*To begin the study, the dataset can be loaded from the UCI repository as follows:*

In [1]:
from ucimlrepo import fetch_ucirepo

# Load the HCV dataset from the UCI repository
hcv_data = fetch_ucirepo(id=571)

# Separate the features and the target
X = hcv_data.data.features
y = hcv_data.data.targets

def printInfo():
    """Print the dataset metadata and the variable descriptions."""
    print(hcv_data.metadata)
    print(hcv_data.variables)

In [2]:
from sklearn.model_selection import train_test_split

# Split into train (70%) and test (30%) sets, stratified on the target
# so that both sets keep the class proportions of the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
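
The effect of `stratify` in the split above can be illustrated on toy data. The sketch below does not use the HCV dataset: the labels, the class sizes, and the helper `class_shares` are hypothetical and chosen only to show that a stratified split keeps the class shares roughly equal in the train and test sets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced toy labels (NOT the HCV data)
labels = ["donor"] * 80 + ["hepatitis"] * 10 + ["cirrhosis"] * 10
rows = list(range(len(labels)))  # stand-in for feature rows

rows_train, rows_test, labels_train, labels_test = train_test_split(
    rows, labels, test_size=0.3, random_state=42, stratify=labels
)


def class_shares(values):
    """Return the share of each class label in `values`."""
    counts = Counter(values)
    total = len(values)
    return {label: count / total for label, count in counts.items()}


train_shares = class_shares(labels_train)
test_shares = class_shares(labels_test)

print("train:", train_shares)
print("test: ", test_shares)
```

Without `stratify`, a rare class could by chance end up almost entirely in one of the two sets, which would distort both training and evaluation.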

---

# 2 Preprocessing of the Dataset

### 2.1 Work with Several Versions of the Dataset

### 2.2 Analysis of the Dataset

### 2.3 Missing Value Processing of the Dataset Versions

### 2.4 Feature Modification of the Dataset Versions

### 2.5 Feature Count Reduction of the Dataset Versions 

### 2.6 Example Set Modification of the Dataset Versions

---

# 3 Construction of Decision Models

### 3.1 Instance Based Learning

### 3.2 Decision Trees

### 3.3 Ensemble Learning with Trees

### 3.4 Linear Models

### 3.5 Neural Networks

---

# 4 Model Evaluation

### 4.1 Balanced Model Construction

### 4.2 Performance Evaluation

### 4.3 Yield Curves

---

# 5 Presentation and Defense