# Project 1: Interpretable and Explainable Classification for Medical Data

Project 1 consists of three parts. The first part uses tabular data, while the second uses imaging data. In the third part, you will recap your findings and answer some general questions about the methods you have seen. You will explore techniques that enable interpretable and explainable classification using shallow and deep machine-learning methods. Before starting the project, be aware of the following points:

- The report has a word limit of 5000 words excluding references. There is no restriction on the number of plots.
- The report must be handed in as a PDF.
- The report has to be self-contained, i.e., no references to code.
- Underlined sections within questions specify how many points can be achieved by solving that specific subquestion.
- You will also need to hand in your code. Please include a requirements.txt or similar for your Python environment and a README.md explaining how to run your code.
- Use train/validation splits for training and tuning only. Report results on the test set. Note that the performance of the different methods can vary a lot.
- Using publicly available code is okay, but properly reference repositories when you use them. Of course, you are not allowed to use the code of other teams from the course.
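
The split discipline described in the list above can be sketched as follows. This is only a minimal illustration on synthetic data: the k-nearest-neighbour classifier, the helper names `knn_predict` and `accuracy`, the candidate values of `k`, and the split sizes are all assumptions made for the example, not requirements of the project.

```python
import numpy as np

# Synthetic binary classification data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Disjoint index sets: test data is set aside before any tuning happens
idx = rng.permutation(n)
test_idx, val_idx, train_idx = idx[:60], idx[60:120], idx[120:]

def knn_predict(X_ref, y_ref, X_query, k):
    """Majority vote among the k nearest reference points (Euclidean distance)."""
    dists = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (y_ref[nearest].mean(axis=1) > 0.5).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Tune the hyperparameter k on the validation split only
val_scores = {
    k: accuracy(y[val_idx], knn_predict(X[train_idx], y[train_idx], X[val_idx], k))
    for k in (1, 3, 5, 7, 9)  # odd k avoids tied votes
}
best_k = max(val_scores, key=val_scores.get)

# Touch the test split once, with the selected hyperparameter
test_accuracy = accuracy(
    y[test_idx], knn_predict(X[train_idx], y[train_idx], X[test_idx], best_k)
)
print(f"selected k={best_k}, test accuracy={test_accuracy:.3f}")
```

The point of the sketch is that the test indices never influence the choice of `k`; only the final number reported comes from the test split.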


# Part 1: Heart Disease Prediction Dataset (20 Pts)

For Part 1, we will provide you, via Moodle, with train and test splits of the Kaggle Heart Failure Prediction Dataset, which is aggregated from the UCI Machine Learning Repository.

### Q1: Exploratory Data Analysis (3 Pts)
Get familiar with the dataset by exploring the different features, their distributions, and the labels (1 Pt). Check for common pitfalls like missing or nonsensical data, unusual feature distributions, outliers, or class imbalance, and describe how to handle them (1 Pt). After having familiarized yourself with the data, explain how you preprocess the dataset for the remaining tasks of Part 1 (1 Pt). Keep in mind that interpretability and explainability aim at gaining insights about the data beyond just optimizing predictive performance.
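
The pitfall checks named above (missing values, nonsensical values, class imbalance) can be sketched on a small toy table. The data below is made up for illustration, and the plausibility ranges in `valid_ranges` are assumptions rather than clinical reference values; only the column names mirror those used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "Age": [54, 61, 47, 150, 39, 58],
    "Cholesterol": [230.0, 0.0, np.nan, 310.0, 198.0, 275.0],
    "HeartDisease": [1, 1, 0, 1, 0, 1],
})

# 1) Missing values per column
missing_per_column = toy.isnull().sum()

# 2) Nonsensical values: count non-missing entries outside an assumed plausible range
valid_ranges = {"Age": (0, 120), "Cholesterol": (1, 600)}  # illustrative bounds
implausible = {
    col: int((~toy[col].between(lo, hi) & toy[col].notna()).sum())
    for col, (lo, hi) in valid_ranges.items()
}

# 3) Class imbalance: relative share of each label
class_share = toy["HeartDisease"].value_counts(normalize=True)

print(missing_per_column)
print(implausible)
print(class_share)
```

Counting implausible values separately from missing ones is a deliberate choice here: a value such as a cholesterol reading of zero is present in the table but may still need to be treated as missing.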

Task 1: EDA

In [1]:
# Load and explore the heart disease prediction dataset
# The dataset is loaded into a pandas DataFrame called 'heart_data'

import pandas as pd
import numpy as np
from IPython.display import display  # built in within Jupyter; needed when run as a plain script

heart_data = pd.read_csv("data/heart_failure/train_val_split.csv")
display(heart_data.head())

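# Q1: Exploratory Data Analysis
# Check for missing values (count per column)
missing_values = heart_data.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Check for class imbalance (absolute counts and relative shares)
class_distribution = heart_data['HeartDisease'].value_counts()
print("Class distribution:")
print(class_distribution)
print(heart_data['HeartDisease'].value_counts(normalize=True))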

# Visualize feature distributions (here: Age, split by label)
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.histplot(data=heart_data, x='Age', hue='HeartDisease', bins=20, kde=True)
plt.title('Distribution of Age by Target')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# Q1 (pitfalls): Handle missing or nonsensical data
# Drop rows with missing values
heart_data.dropna(inplace=True)

# Handle outliers
# Assuming 'Age' and 'Cholesterol' contain outliers; the cut-offs below are illustrative
heart_data = heart_data[(heart_data['Age'] < 100) & (heart_data['Cholesterol'] < 400)]

# Handle class imbalance if present
# Oversample or undersample the minority class

# Q1 (preprocessing): Prepare the dataset for the remaining tasks
# One-hot encode categorical variables; by default, pd.get_dummies encodes all
# object-, string- and category-typed columns and leaves numeric columns unchanged
heart_data = pd.get_dummies(heart_data)

# Split the data into features and target variable ('HeartDisease', as used above)
X = heart_data.drop(columns=['HeartDisease'])
y = heart_data['HeartDisease']

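
The cell above leaves the class-imbalance step as a comment. One possible way to fill it in is random oversampling of the minority class, sketched below on made-up data. The helper name `oversample_minority` and the toy table are hypothetical; class weights or undersampling would be equally valid alternatives. If used, resampling should be applied to the training split only, so that validation and test data keep their original class distribution.

```python
import pandas as pd

# Hypothetical imbalanced training data (6 positives, 2 negatives)
toy_train = pd.DataFrame({
    "Age": [54, 61, 47, 66, 39, 58, 43, 50],
    "HeartDisease": [1, 1, 1, 1, 1, 1, 0, 0],
})

def oversample_minority(df, label_col, random_state=0):
    """Resample smaller classes with replacement until all classes have equal size."""
    counts = df[label_col].value_counts()
    target_size = int(counts.max())
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        if count < target_size:
            # Draw the missing rows at random, with replacement, from the same class
            extra = group.sample(
                n=target_size - int(count), replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so that the classes are not stored in contiguous blocks
    balanced = pd.concat(parts).sample(frac=1.0, random_state=random_state)
    return balanced.reset_index(drop=True)

balanced_train = oversample_minority(toy_train, "HeartDisease")
print(balanced_train["HeartDisease"].value_counts())
```

Because the added rows are exact duplicates, oversampling does not add information; it only changes how much weight the minority class receives during training.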