### Libraries

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

### Loading the dataset file

In [None]:
df = pd.read_csv(r'./dataset.csv')

### Data Summary and Info  
- **`df.describe()`**: Provides statistical insights like mean, min, max, and percentiles for numerical columns.  
- **`df.info()`**: Displays the DataFrame's structure, data types, and non-null value counts.


In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.head()

#### Since `Sex` is a categorical attribute, we convert it to a numerical one.

#### Encoding `Male` as 1 and `Female` as 0 (any value other than `Male` also becomes 0)

In [None]:
df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'Male' else 0)

In [None]:
df.head()
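
The encoding step can also be written with an explicit mapping. The sketch below uses made-up toy values (not the real dataset) and assumes the labels are spelled exactly `Male` and `Female`; it shows that `map()` leaves unexpected labels as `NaN` instead of silently turning them into 0.

```python
import pandas as pd

# Toy data for illustration only; 'female' (lowercase) stands in for an unexpected label.
toy = pd.DataFrame({"Sex": ["Male", "Female", "female", "Male"]})

# The lambda-based encoding sends every value that is not exactly 'Male' to 0.
encoded_apply = toy["Sex"].apply(lambda x: 1 if x == "Male" else 0)

# An explicit mapping leaves unknown labels as NaN so they can be inspected.
encoded_map = toy["Sex"].map({"Male": 1, "Female": 0})

print(encoded_apply.tolist())
print(int(encoded_map.isna().sum()), "unmapped value(s)")
```

If the unmapped count is above zero, the raw labels are worth cleaning before encoding.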

### Class Distribution

- **`df['Status'].value_counts()`**: Displays the number of rows (frequency) for each unique class in the `Status` column.  
  This is useful for understanding the distribution of data across different categories and checking for class imbalance in the dataset.


In [None]:
# number of rows of each class
df['Status'].value_counts()
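
Class imbalance is often easier to judge from proportions than from raw counts. A minimal sketch on a made-up `Status` column (the values are hypothetical) using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical labels for illustration: six rows of class 0, two rows of class 1.
toy_status = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Status")

counts = toy_status.value_counts()                # absolute frequency per class
shares = toy_status.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```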

### EDA: Understanding the Data Through Plots

Here’s what we did to explore the data and understand how different features relate to the `Status` class:  

1. **Age Distribution**  
   - We plotted the `Age` values for each class in the `Status` column.  
   - This helps us see if certain age groups are more common in each class.  

2. **BMI Distribution**  
   - This shows how `BMI` (Body Mass Index) is spread out for each class.  
   - It helps us check if BMI has any role in differentiating between the classes.  

3. **Temperature Distribution**  
   - We looked at how `Temperature` changes for each class.  
   - This can tell us if people in different classes have noticeable differences in body temperature.  

4. **Heart Rate Distribution**  
   - This plot shows how `Heart_rate` values are distributed for each class.  
   - It helps us understand if heart rate is an important factor.  

5. **SPO2 Distribution**  
   - We explored the `SPO2` (oxygen levels) to see if it varies across classes.  
   - It’s useful for checking if oxygen levels are related to the classes.  

6. **ECG Distribution**  
   - Finally, we looked at the `Ecg` (electrocardiogram) values for each class.  
   - This can help us see if ECG readings differ between the groups.  

By plotting these features, we get a clearer idea of how the data behaves and what might be important for analysis!


In [None]:
# eda

# Plot the distribution of 'Age' for each class
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Age', hue='Status', kde=True, bins=30)
plt.title("Distribution of Age for each CAD Status")
plt.show()

In [None]:
# Plot the distribution of 'BMI' for each class
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='BMI', hue='Status', kde=True, bins=30)
plt.title("Distribution of BMI for each CAD Status")
plt.show()

In [None]:
# Plot the distribution of 'Temperature' for each class
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Temperature', hue='Status', kde=True, bins=30)
plt.title("Distribution of Temperature for each CAD Status")
plt.show()

In [None]:
# Plot the distribution of 'Heart_rate' for each class
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Heart_rate', hue='Status', kde=True, bins=30)
plt.title("Distribution of Heart Rate for each CAD Status")
plt.show()

In [None]:
# Plot the distribution of 'SPO2' for each class
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='SPO2', hue='Status', kde=True, bins=30)
plt.title("Distribution of SPO2 for each CAD Status")
plt.show()

In [None]:
# Plot the distribution of 'Ecg' for each class
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Ecg', hue='Status', kde=True, bins=30)
plt.title("Distribution of ECG for each CAD Status")
plt.show()
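
The histograms can be complemented with a numeric summary per class. The sketch below assumes column names like those used above but runs on made-up toy values, so the numbers are not results from the real dataset.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "Status": [0, 0, 0, 1, 1, 1],
    "Heart_rate": [70, 72, 74, 90, 92, 94],
    "SPO2": [98, 97, 99, 93, 94, 92],
})

# Mean and standard deviation of each feature within each Status class.
summary = toy.groupby("Status")[["Heart_rate", "SPO2"]].agg(["mean", "std"])
print(summary)
```

Large gaps between the class means, relative to the spread, point to features that may separate the classes.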

### Correlation Matrix  

- A **correlation matrix** shows how strongly different features in the dataset are related to each other.  
- We used a heatmap to visualize the correlations, where:  
  - Values range from **-1 to 1**:
    - **1** means perfect positive correlation (features increase together).  
    - **-1** means perfect negative correlation (one feature decreases as the other increases).  
    - **0** means no linear correlation (a non-linear relationship may still exist).  
  - The colors help make it easy to spot strong positive (red) or negative (blue) relationships.  

- This plot is useful to identify which features are strongly linked to each other or the target variable (`Status`), guiding us in feature selection.


In [None]:
# correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix")
plt.show()
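
To read off the link between each feature and the target without scanning the whole heatmap, the `Status` column of the correlation matrix can be sorted by absolute value. This is a sketch on synthetic data; the column names mirror the notebook but the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
status = rng.integers(0, 2, size=n)

# Synthetic data: Heart_rate depends on Status, Age does not.
toy = pd.DataFrame({
    "Status": status,
    "Heart_rate": 70 + 15 * status + rng.normal(0, 5, n),
    "Age": rng.normal(50, 10, n),
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = toy.corr()["Status"].drop("Status").sort_values(key=abs, ascending=False)
print(target_corr)
```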

### Dropping Low-Correlation Features  

- We removed the following columns: **`Age`**, **`Weight`**, **`Height`**, **`Sex`**, and **`BMI`**.  
- **Reason**:  
  - These features showed **low correlation** with the target variable (`Status`) in the correlation matrix.  
  - Features with little linear correlation to the target often add little to a linear model, although they can still carry non-linear or interaction effects.  
  - Keeping them can add noise and increase computation time.  

- Dropping these features leaves a smaller, more focused feature set; whether accuracy actually improves should be confirmed by comparing model scores with and without them.


In [None]:
# drop low correlation features
df = df.drop(columns=['Age', 'Weight', 'Height', 'Sex', 'BMI'])
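
Low Pearson correlation only rules out a linear relationship. The sketch below, on synthetic data rather than the real dataset, builds a feature whose correlation with the label is close to zero although the label is fully determined by it, and compares that with scikit-learn's `mutual_info_classif` as one possible complementary check.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

# The label depends on |x|, a symmetric (non-linear) function of the feature.
y = (np.abs(x) > 1).astype(int)

pearson_r = np.corrcoef(x, y)[0, 1]
mi = mutual_info_classif(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r: {pearson_r:.3f}")
print(f"Mutual information: {mi:.3f}")
```

A feature like this would be dropped by a correlation threshold even though it is informative, which is one reason to double-check before removing columns.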

### Updated Correlation Matrix  

- After dropping the low-correlation features, we plot the **correlation matrix** again for the remaining columns.  
- **Why check again?**  
  - Pairwise correlations between the remaining columns do not change when other columns are dropped, so the values match the first heatmap.  
  - The smaller heatmap is easier to read and confirms which columns are left to relate to the target variable (`Status`).  

- The updated heatmap makes it easier to identify the key features for further analysis or modeling.


In [None]:
# correlation matrix of the remaining columns
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix (Remaining Features)")
plt.show()

### Splitting the Data into Training and Testing Sets  

- **`X`**: This is the feature set, where we removed the target variable (`Status`) because it is what we want to predict.  
- **`y`**: This is the target variable (`Status`), which we want to predict using the features.  

- We used the **`train_test_split()`** function to split the data:  
  - **Training Set**: 80% of the data used to train the model.  
  - **Testing Set**: 20% of the data set aside to evaluate the model's performance.  
  - **`random_state=42`**: Ensures that the split is reproducible, so you get the same training and test sets each time.  

This split is crucial for training and validating the model, helping us assess how well it will perform on unseen data.


In [None]:
# test train split
X = df.drop(columns=['Status'])
y = df['Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
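
When the classes are imbalanced, a plain random split can shift the class ratio between the training and test sets. A minimal sketch of the `stratify` option of `train_test_split`, on made-up data with 20% positives (the arrays `X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20 of them in class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the class ratio the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("positives in test:", int(y_te.sum()), "of", len(y_te))
```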

### Testing different models on the dataset and calculating their accuracy

### Logistic Regression  

1. **What is Logistic Regression?**  
   Logistic Regression is a classification model that predicts the probability of a binary outcome (e.g., presence or absence of CAD). It uses the logistic (sigmoid) function to map predictions to probabilities between 0 and 1.  

2. **Steps in the Code**:  
   - **Model Initialization**: The Logistic Regression model is created with `random_state=42` for consistent results and `max_iter=100` (the scikit-learn default), which caps the number of solver iterations.  
   - **Training**: The model is trained on the `X_train` (features) and `y_train` (labels).  
   - **Accuracy Calculation**: The model's performance is evaluated using `score()` on the test dataset, showing the percentage of correct predictions.  

3. **Output**:  
   The model's accuracy is printed, representing how well it predicts CAD (Coronary Artery Disease) based on the input features.  

**Logistic Regression** is a simple, interpretable baseline that works best when the classes can be separated reasonably well by a linear boundary.


In [None]:
# logistic regression
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=42, max_iter=100)
log_reg.fit(X_train, y_train)
log_reg_score = log_reg.score(X_test, y_test)
print(f"Logistic Regression Model Accuracy: {log_reg_score * 100:.2f}%")
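
The sigmoid mapping can be checked directly: for a binary scikit-learn `LogisticRegression`, the class-1 probability equals the sigmoid of the linear score returned by `decision_function`. The sketch below uses synthetic data from `make_classification`, not the CAD dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 4 features.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

z = clf.decision_function(X_toy)          # linear score w.x + b
manual_proba = 1.0 / (1.0 + np.exp(-z))   # sigmoid applied by hand
model_proba = clf.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, model_proba))
```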

### Support Vector Machine (SVM)  

1. **What is SVM?**  
   SVM is a supervised learning algorithm used for classification tasks. It works by finding the best hyperplane that separates data points of different classes.  
   - The **RBF Kernel** (Radial Basis Function) allows SVM to handle non-linear relationships by transforming data into higher dimensions.  

2. **Steps in the Code**:  
   - **Model Initialization**: The SVM model is set up with an **RBF kernel**, `random_state=42` for reproducibility, and `probability=True` to enable prediction probabilities (useful for ensemble methods).  
   - **Training**: The model is trained on `X_train` (features) and `y_train` (labels).  
   - **Accuracy Calculation**: The model is tested on `X_test` and `y_test` to calculate its accuracy (percentage of correct predictions).  

3. **Output**:  
   The accuracy of the SVM model is printed, indicating its performance in predicting CAD (Coronary Artery Disease).  

**SVM with an RBF kernel** is powerful for complex datasets with non-linear decision boundaries.


In [None]:
# svm
from sklearn import svm
svm_model = svm.SVC(kernel='rbf', random_state=42, probability=True)
svm_model.fit(X_train, y_train)
svm_score = svm_model.score(X_test, y_test)
print(f"SVM Model Accuracy: {svm_score * 100:.2f}%")
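
RBF kernels work with distances between samples, so features on very different scales can dominate the result. The sketch below is an illustration on synthetic `make_moons` data with one feature deliberately inflated; it compares an unscaled SVM with a `StandardScaler` + `SVC` pipeline. Whether scaling helps on the CAD features is an assumption to test, not something the notebook shows.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic non-linear data; the first feature is put on a much larger scale.
X_toy, y_toy = make_moons(n_samples=400, noise=0.2, random_state=0)
X_toy[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Same model with and without standardization; the scaler is fitted on training data only.
raw_svm = SVC(kernel="rbf", random_state=42).fit(X_tr, y_tr)
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42)).fit(X_tr, y_tr)

raw_score = raw_svm.score(X_te, y_te)
scaled_score = scaled_svm.score(X_te, y_te)
print(f"Unscaled SVM accuracy: {raw_score * 100:.2f}%")
print(f"Scaled SVM accuracy:   {scaled_score * 100:.2f}%")
```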

### Artificial Neural Network (ANN)  

1. **What is ANN?**  
   - An Artificial Neural Network is a machine learning model inspired by the structure of the human brain.  
   - It consists of layers of interconnected nodes (neurons) that process data and extract patterns, making it highly suitable for complex and non-linear relationships in data.

2. **Steps in the Code**:  
   - **Model Definition**:  
     - `Sequential`: The ANN is defined as a stack of layers.  
     - **Layers**:  
       - `Dense(256, activation='relu', input_shape=(4,))`: First layer with 256 neurons, using ReLU activation, and expecting 4 input features.  
       - Additional hidden layers with 128 and 32 neurons, also using ReLU activation.  
       - `Dense(1, activation='sigmoid')`: Output layer with a sigmoid activation for binary classification.  
   - **Compilation**:  
     - **Optimizer**: `adam`, a popular optimizer that adjusts weights efficiently.  
     - **Loss Function**: `binary_crossentropy`, appropriate for binary classification tasks.  
     - **Metrics**: Tracks accuracy during training.  
   - **Training**:  
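     - `fit()`: The model is trained for 50 epochs with a batch size of 32. The test set is passed as `validation_data`, so its loss and accuracy are reported after each epoch (for monitoring only; the weights are not updated on it).  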
   - **Evaluation**:  
     - The model is evaluated on the test set, providing the accuracy score.  

3. **Output**:  
   - The ANN accuracy score indicates how well the model predicts CAD (Coronary Artery Disease) based on the features.

**ANNs** are particularly effective for datasets with intricate patterns and relationships that simpler models may not capture.


In [None]:
# ann
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(256, activation='relu', input_shape=(4,)),
    Dense(128, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
_, ann_score = model.evaluate(X_test, y_test)
print(f"ANN Model Accuracy: {ann_score * 100:.2f}%")
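
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each `Dense` layer has `inputs * units` weights plus `units` biases. A small sketch for the 4 → 256 → 128 → 32 → 1 architecture above:

```python
# Layer widths: 4 input features, three hidden layers, one output unit.
layer_sizes = [4, 256, 128, 32, 1]

# Each Dense layer has (inputs * units) weights plus one bias per unit.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [1280, 32896, 4128, 33]
print(total_params)      # 38337
```

With roughly 38,000 parameters and only 4 input features, the network has plenty of capacity, so comparing training and validation accuracy for signs of overfitting is worthwhile.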

### Stacked Ensemble Model  

1. **Combining Predictions**:  
   - Predicted probabilities from Logistic Regression, SVM, and ANN are treated as features for the ensemble.  
   - `np.column_stack()` combines these predictions into a single feature matrix for the meta-model.  

2. **Training the Meta-Model**:  
   - A second Logistic Regression (the meta-model) is fitted on the base-model predictions for the training set, so the test labels are never used for training.  

3. **Testing the Ensemble**:  
   - The ensemble's accuracy is calculated using the `accuracy_score` function, based on the meta-model's final predictions for the test set.  

This approach aims to combine the strengths of the individual models; whether it beats the best single model has to be checked against the scores above.


In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def base_model_features(X):
    """Stack the class-1 probabilities of the three base models column-wise."""
    pred_log = log_reg.predict_proba(X)[:, 1]    # Logistic Regression probabilities
    pred_svm = svm_model.predict_proba(X)[:, 1]  # SVM probabilities
    pred_ann = model.predict(X).flatten()        # ANN probabilities (ensure 1D array)
    return np.column_stack((pred_log, pred_svm, pred_ann))

# Build the meta-features from the training set so that no test label leaks into training.
# Note: the base models have already seen X_train, so these probabilities are optimistic;
# out-of-fold predictions would be a stricter choice.
train_features = base_model_features(X_train)

# Train the meta-model (logistic regression in this case) on training labels only
meta_model = LogisticRegression(random_state=42)
meta_model.fit(train_features, y_train)

# Final predictions from the meta-model on the held-out test set
test_features = base_model_features(X_test)
final_preds = meta_model.predict(test_features)

# Evaluate the ensemble
ensemble_accuracy = accuracy_score(y_test, final_preds)
print(f"Stacked Ensemble Accuracy: {ensemble_accuracy * 100:.2f}%")
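
A stricter variant of stacking trains the meta-model on out-of-fold predictions, so every meta-feature comes from a model that never saw that row. The sketch below assumes scikit-learn's `cross_val_predict`, runs on synthetic data, and leaves out the ANN to stay self-contained; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

base_models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel="rbf", probability=True, random_state=42),
]

# Out-of-fold class-1 probabilities: each row is predicted by a model fitted on the other folds.
oof_features = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model sees only training labels.
meta = LogisticRegression().fit(oof_features, y_tr)

# Refit each base model on the full training set, then build the test meta-features.
test_features_toy = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

stack_acc = accuracy_score(y_te, meta.predict(test_features_toy))
print(f"Out-of-fold stacked accuracy: {stack_acc * 100:.2f}%")
```

scikit-learn also provides `StackingClassifier`, which wraps the same idea for estimators that follow its API.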


### Predicting New User Inputs  

1. **Process**:  
   - Each user input is passed through base models (Logistic Regression, SVM, ANN) to get probabilities.  
   - These probabilities are combined into a single feature vector.  
   - The final model predicts the class and probability based on the combined features.  

2. **Class Labels**:  
   - **Class 0**: Refers to **No CAD (Coronary Artery Disease)**.  
   - **Class 1**: Refers to **CAD (Coronary Artery Disease)**.  

3. **Output**:  
   - For each input, the final **class label** (0 or 1) and **probability** of having CAD are displayed.  

This method combines predictions from multiple models to improve accuracy.


In [None]:
# Example user inputs: one row each, with values in the same order as the training columns
user_input = np.array([[28, 60, 95, 0]])    # Replace with actual user input (expected class 1)
user_input1 = np.array([[32, 71, 100, 0]])  # Replace with actual user input (expected class 0)
# Preprocess user input here if required (e.g., scaling, normalization, etc.)

def predict_user(user_values):
    """Return the ensemble's class label and class-1 probability for one input row."""
    # Step 1: Get class-1 probabilities from the base models
    log_reg_pred = log_reg.predict_proba(user_values)[:, 1]  # Logistic Regression probability
    svm_pred = svm_model.predict_proba(user_values)[:, 1]    # SVM probability
    ann_pred = model.predict(user_values).flatten()          # ANN probability

    # Step 2: Combine the predictions into a single feature vector
    stacked_input = np.column_stack((log_reg_pred, svm_pred, ann_pred))

    # Step 3: Pass the feature vector to the meta-model for the final prediction
    final_prediction = meta_model.predict(stacked_input)[0]           # Class label (0 or 1)
    final_probability = meta_model.predict_proba(stacked_input)[0, 1]  # Probability of class 1
    return final_prediction, final_probability

# Output the result for each example input
for label, values in [("First User Input (Class 1):", user_input),
                      ("Second User Input (Class 0):", user_input1)]:
    final_prediction, final_probability = predict_user(values)
    print(label, values)
    print(f"Final Prediction (Class): {final_prediction}")
    print(f"Final Prediction (Probability of Class 1): {final_probability:.2f}")
