# Data Mining CMP-7023B
## Lab 6: Supervised Learning - Classification Part 2 - more advanced practice sample solutions

## Heart Disease UCI
In this practice sheet we use the Heart Disease dataset from the UCI Machine Learning Repository.

### Content

This database contains 76 attributes, but all published experiments use a subset of 14 of them: https://archive.ics.uci.edu/ml/datasets/heart+disease

Attribute Information:
- age : age in years
- sex: sex (1 = male; 0 = female)
- cp : chest pain type (4 values)
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trestbps: resting blood pressure
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl  (1 = true; 0 = false)
- restecg: resting electrocardiographic results (values 0,1,2)
    - Value 0: normal
    - Value 1: having ST-T wave abnormality
    - Value 2: showing left ventricular hypertrophy
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- target (num): diagnosis of heart disease (angiographic disease status) (<b> the predicted attribute </b>)
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing

### Objective:
The objective of this lab exercise is to familiarize students with classification tasks, parameter tuning, and model evaluation using the Heart Disease UCI dataset.

## Tasks:
### Task 1: Data Exploration:
- Load the Heart Disease UCI dataset.
- Perform initial Exploratory Data Analysis (EDA).
- Identify features and the target variable.
- Check for missing values and outliers, and perform data cleaning.

#### Starting out: loading data and libraries
We begin by loading the necessary libraries for the work we are going to do in this lab.

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, PowerTransformer, QuantileTransformer, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import warnings

warnings.filterwarnings('ignore') #ignore warnings

- Download the dataset from Blackboard and read it.

In [None]:
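# Designate the path where you saved the heart disease data.
# A raw string (r"...") stops Python from treating the backslashes as escape characters.
heart_data_path = r"C:\DM-DATA\heart.csv"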

#Load the data using pandas read_csv function. 
orig_data = pd.read_csv(heart_data_path)

# Features: every column except the last one, which holds the target.
X = orig_data.iloc[:, :-1]
# Target: the "target" column.
y = orig_data["target"]

print(X)
print(y)

#### Check data types and missing values

In [None]:
# Check data types and missing values
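
One possible way to do this check is sketched below. The small `demo` frame is a made-up stand-in for `orig_data` (its values are invented for illustration), so that the cell runs without the CSV file; on the real data you would call the same methods on `orig_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of heart.csv; the values are made up.
demo = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, np.nan, 204.0, 236.0],
    "thalach": [150, 187, 172, 178],
    "target": [1, 1, 0, 0],
})

# Data type of each column.
print(demo.dtypes)

# Number of missing values in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

`demo.info()` gives a similar overview (types and non-null counts) in a single call.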


### Explore the Data

#### Visualize the distribution of the target variable

In [None]:
# Visualize the distribution of the target variable
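
A minimal sketch of one way to look at the class balance. The `target` series below is made up and stands in for `orig_data["target"]`; the bar chart is only one of several reasonable plot choices.

```python
import pandas as pd

# Made-up labels standing in for orig_data["target"].
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="target")

# Absolute and relative class frequencies.
class_counts = target.value_counts().sort_index()
class_share = target.value_counts(normalize=True).sort_index()
print(class_counts)
print(class_share.round(3))

# The same information as a bar chart.
ax = class_counts.plot(kind="bar", title="Target distribution")
ax.set_xlabel("target")
ax.set_ylabel("count")
```

If one class is much rarer than the other, accuracy alone can be misleading, which is why precision, recall and F1 are also used later in this lab.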


#### Visualize the distribution of numerical features

In [None]:
# Visualize the distribution of numerical features
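
As a sketch, `DataFrame.hist` draws one histogram per numerical column. The `demo` frame below contains randomly generated values, not the real heart data, and the column names are borrowed from the attribute list only for illustration.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in values; not the real heart.csv data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.normal(54, 9, 200).round(),
    "chol": rng.normal(246, 50, 200).round(),
    "thalach": rng.normal(150, 23, 200).round(),
})

# One histogram per numerical column; in a notebook the figure
# is rendered at the end of the cell.
axes = demo.hist(bins=20, figsize=(10, 6))
```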


#### Check for missing values

In [None]:
# Check for missing values



#### Look at the distributions or histograms of individual attributes.
Data descriptions:

#### Examine the mean and standard deviation for each attribute.
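
A small sketch of how the two statistics can be computed, using a made-up three-row frame in place of `orig_data`. Note that pandas reports the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up values standing in for two columns of orig_data.
demo = pd.DataFrame({"age": [50, 60, 70], "chol": [200, 220, 270]})

# Mean and sample standard deviation of every column.
summary = demo.agg(["mean", "std"])
print(summary)
```

`demo.describe()` returns the same two rows together with the count, minimum, maximum and quartiles.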

#### Check for outliers using box plots
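
A box plot marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically to a made-up `chol` series and then draws the box plot; the values are invented for illustration.

```python
import pandas as pd

# Made-up cholesterol values with one extreme reading.
chol = pd.Series([200, 210, 220, 230, 240, 600], name="chol")

# Quartiles and the 1.5 * IQR fences used by a standard box plot.
q1, q3 = chol.quantile(0.25), chol.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = chol[(chol < lower) | (chol > upper)]
print(outliers)

# The corresponding box plot.
ax = chol.plot(kind="box")
```

Whether a flagged value should be removed, capped or kept is a judgment call that depends on whether it is a measurement error or a genuine extreme case.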

#### Perform data cleaning (handle missing values, outliers, etc.)
For simplicity, you can fill missing values with the mean for numerical features

#### Verify that missing values are filled
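
The mean imputation described above, followed by the verification step, might look like the sketch below. The `demo` frame is made up; on the real data the same calls would be applied to the notebook's own data frame.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column.
demo = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0],
    "thalach": [150.0, 170.0, np.nan],
})

# Fill missing values in numerical columns with the column mean.
numeric_cols = demo.select_dtypes(include="number").columns
filled = demo.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Verify that no missing values remain.
print(filled.isna().sum())
```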

### Task 2: Classification Task: 
**Note:** Assume preprocessing has been done before this task.
- Split the dataset into training and testing sets.
  * random_state=41: setting a random_state ensures reproducibility. If you use the same random seed, you will get the same results each time you run the code.
  * stratify=y: ensures that the proportion of each class in y is maintained in both the training and testing sets. This is particularly important when dealing with imbalanced datasets or when the distribution of classes in the target variable is essential for model training and evaluation.
- Choose a classification algorithm (or several), e.g. Decision Tree, SVM, Neural Network, Random Forest.
- Train the model using the training set.
- Make predictions on the testing set.
- Evaluate the model's performance using appropriate metrics (accuracy, precision, recall, F1-score).
- Generate the classification report and confusion matrix.

In [None]:
# Split the dataset into training, testing, and validation sets using stratified sampling
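
A minimal sketch of a stratified train/test split. Here `X_demo` and `y_demo` are synthetic data from `make_classification` standing in for `X` and `y`, and `test_size=0.2` is an assumed value that the lab sheet does not prescribe.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for X and y (not the heart data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, weights=[0.7, 0.3], random_state=0
)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41, stratify=y_demo
)

# The class proportions should be nearly identical in both sets.
print(Counter(y_train), Counter(y_test))
```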
                                    

In [None]:
# Choose a classification algorithm (Decision Tree Classifier as an example)


In [None]:
# Train the model using the training set


In [None]:
# Make predictions on the testing set


In [None]:
# Evaluate the model's performance using appropriate metrics


In [None]:
# Generate the classification report
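
The train, predict and evaluate steps of this task can be sketched end to end as follows. The data is synthetic (`make_classification`), and the Decision Tree with default hyperparameters is just the example classifier named above, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41, stratify=y_demo
)

# Train on the training set, then predict on the held-out testing set.
clf = DecisionTreeClassifier(random_state=41)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Headline metrics for the positive class (label 1).
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)

# Per-class report and confusion matrix (rows = true class, columns = predicted).
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)
```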


### Task 3: Parameter Tuning:
- Investigate the hyperparameters of the chosen algorithm.
- Create a pipeline with scaling/transformation and classification stages.
- Use grid search or random search to find the optimal hyperparameters. 
- Re-train the model with the tuned hyperparameters. 
- Evaluate the tuned model on the testing set.
- Assess overfitting or underfitting using the accuracy metric (the tuned model).
- Plot the precision and recall for the built pipeline.

In [None]:
# Choose classification algorithms (e.g Decision Tree)


# Define hyperparameter grids for grid search


# Create a pipeline with scaling/transformation and classification stages


# Perform Grid Search to find the optimal hyperparameters 


# Get the best hyperparameters


# Retrain the model with the tuned hyperparameters


# Evaluate the tuned model on the testing set


# Evaluate the model's performance 
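
One way the steps in this cell could fit together is sketched below, on synthetic data. The grid values, the choice of `StandardScaler` and the 5-fold cross-validation are assumptions made for illustration; the step names `scaler` and `clf` are arbitrary labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41, stratify=y_demo
)

# Pipeline: scaling stage followed by the classifier.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=41)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "clf__max_depth": [2, 4, 6, None],
    "clf__min_samples_split": [2, 10],
    "clf__criterion": ["gini", "entropy"],
}

# Grid search with cross-validation on the training set only.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)

# With the default refit=True, best_estimator_ is already retrained
# on the whole training set using the best hyperparameters.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(test_accuracy)
```

Keeping the scaler inside the pipeline means it is fitted only on the training folds during cross-validation, which avoids leaking information from the validation folds.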



In [None]:
# Generate the classification report


In [None]:
# Assess overfitting or underfitting using the accuracy metric (the tuned model).

# Make predictions on both training and testing sets

# Evaluate the model's performance on both sets

# Additional metrics

# Display the accuracies and additional metrics for both sets

# Assess overfitting or underfitting
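
A simple way to assess overfitting is to compare training and testing accuracy: a large gap suggests the model has memorised the training set. The sketch below contrasts an unconstrained tree with a shallow one on synthetic data; the depth limit of 3 is an assumed value chosen only to make the contrast visible.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41, stratify=y_demo
)

models = {
    "unconstrained": DecisionTreeClassifier(random_state=41),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=41),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[name] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

Low accuracy on both sets would instead point to underfitting.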


#### Plot the precision and recall for the pipeline you have built

In [None]:
# Predict probabilities for each class for Decision Tree


# Predict probabilities for each class for SVC


# Plot Precision-Recall curves
plt.figure(figsize=(10, 6))



# Plot settings
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
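
A complete version of such a plot might look like the sketch below, again on synthetic data. As an assumption made here, the SVC curve is drawn from `decision_function` scores rather than class probabilities, since `precision_recall_curve` only needs a score that ranks the samples; the hyperparameters are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41, stratify=y_demo
)

tree_pipe = Pipeline([("scaler", StandardScaler()),
                      ("clf", DecisionTreeClassifier(max_depth=4, random_state=41))])
svc_pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
tree_pipe.fit(X_train, y_train)
svc_pipe.fit(X_train, y_train)

# Scores for the positive class: probabilities for the tree,
# signed distances to the decision boundary for the SVC.
scores = {
    "Decision Tree": tree_pipe.predict_proba(X_test)[:, 1],
    "SVC": svc_pipe.decision_function(X_test),
}

plt.figure(figsize=(10, 6))
ap_scores = {}
for name, score in scores.items():
    precision, recall, _ = precision_recall_curve(y_test, score)
    ap_scores[name] = average_precision_score(y_test, score)
    plt.plot(recall, precision, label=f"{name} (AP = {ap_scores[name]:.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.grid(True)
```

In a notebook the figure is rendered at the end of the cell; in a script, add `plt.show()`.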

### You can also try another classification algorithm (e.g. Random Forest Classifier if you have already done SVC)

### Task 4: Model Comparison 
- Choose another classification algorithm (e.g. SVC, Decision Tree, Random Forest).
- Train the second model using the same training set.
- Compare the performance of the two models using appropriate metrics:
    - Accuracy
    - Precision
    - Recall
    - F1-Score
- Select the best-performing model.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


# Choose another classification algorithm (Random Forest as an example)
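
A sketch of the comparison on synthetic data: both models are trained on the same training set and scored on the same testing set. Picking the winner by F1-score is an assumption made for this example; which metric matters most depends on the application.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41, stratify=y_demo
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=41),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=41),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    })

# One row per model, one column per metric.
comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))

# Assumed selection rule: highest F1-score on the testing set.
best_model_name = comparison["f1"].idxmax()
print("Best model by F1:", best_model_name)
```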
