# Model Building and Evaluation Report

## 1. Objective
The objective of this task is to build a machine learning model to predict disease outcome using a clinical dataset.  
The cleaned and merged dataset prepared in Task 1 is used to train and evaluate a **Logistic Regression** classification model.

---

## 2. Dataset Overview
- **Total Records (Patients):** 180  
- **Total Features:** 17  
- **Target Variable:** Disease Outcome (Binary Classification)

The dataset was created by merging two clinical data files. Data cleaning, preprocessing, and feature engineering were performed before model training.

---

## 3. Data Preprocessing
The following preprocessing steps were applied:

- Handling missing values  
- Encoding categorical variables  
- Feature scaling using **StandardScaler**  
- Splitting the dataset into:
  - **Training Set:** 80% (144 samples)
  - **Testing Set:** 20% (36 samples)
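The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the data is a synthetic stand-in (random numbers with the same shape as the dataset), the median imputation strategy and the stratified split are assumptions, and categorical encoding is assumed to have been done already.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset: 180 patients, 17 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 17))
y = rng.integers(0, 2, size=180)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

# Hold out 36 samples (20%); stratifying keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=36, stratify=y, random_state=42
)

# Fit the imputer and scaler on the training set only, then reuse them on the
# test set so that no test-set statistics leak into preprocessing.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_train_prep = prep.fit_transform(X_train)
X_test_prep = prep.transform(X_test)

print(X_train_prep.shape, X_test_prep.shape)  # (144, 17) (36, 17)
```

Fitting the preprocessing steps on the training portion alone is a common design choice because it keeps the test set a fair estimate of performance on unseen patients.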

---

## 4. Model Implemented
### Logistic Regression
Logistic Regression is a supervised learning algorithm widely used for binary classification problems.  
It predicts the probability of an outcome using a logistic (sigmoid) function.
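
As a small illustration of the sigmoid step, the sketch below turns a weighted sum of features into a probability and then into a class label. The weights, bias, feature values, and the 0.5 threshold are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical two-feature example.
weights, bias = [0.8, -0.5], 0.1
sample = [1.0, 2.0]
print(predict_proba(sample, weights, bias))
print(predict_label(sample, weights, bias))  # 0
```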

This model was chosen because:
- It works well with small to medium-sized datasets  
- It is easy to interpret  
- It is suitable for binary classification tasks  

---

## 5. Evaluation Metrics
The performance of the model was evaluated using the following metrics:

- Accuracy  
- Precision  
- Recall  
- F1-Score  
- Confusion Matrix  

These metrics help assess both overall performance and class-wise prediction quality.
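
For reference, all four scalar metrics follow from the four cells of the confusion matrix. The sketch below shows the arithmetic; the counts in the example are hypothetical and are not the project's results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Fall back to 0.0 where a ratio is undefined (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts: 8 true positives, 2 false positives,
# 7 true negatives, 3 false negatives.
metrics = classification_metrics(tp=8, fp=2, tn=7, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```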

---

## 6. Model Performance

The Logistic Regression model achieved the following results on the test dataset:

| Metric        | Value |
|---------------|-------|
| Accuracy      | 44.44% |
| Precision     | 0.44  |
| Recall        | 1.00  |
| F1-Score      | 0.62  |

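The model attains perfect recall on the positive class but does not correctly classify the negative class: the reported values are consistent with every test sample being predicted as positive (16 true positives and 20 false positives out of 36). This indicates a strong bias toward the positive class, possibly related to class imbalance in the training data.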
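
The confusion matrix itself is not listed in the table, but it can be inferred from the reported values. The sketch below is a consistency check, not the original evaluation code: it enumerates every possible confusion matrix for 36 test samples and keeps those whose rounded metrics match the table. The rounding convention (two decimals) is an assumption.

```python
from itertools import product

N_TEST = 36  # test-set size from the report
REPORTED = {"accuracy": 44.44, "precision": 0.44, "recall": 1.00, "f1": 0.62}

def rounded_metrics(tp, fp, tn, fn):
    """Metrics rounded as in the report, or None if precision/recall is undefined."""
    if tp + fp == 0 or tp + fn == 0:
        return None
    return {
        "accuracy": round(100 * (tp + tn) / (tp + fp + tn + fn), 2),
        "precision": round(tp / (tp + fp), 2),
        "recall": round(tp / (tp + fn), 2),
        "f1": round(2 * tp / (2 * tp + fp + fn), 2),
    }

matches = []
for tp, fp, tn in product(range(N_TEST + 1), repeat=3):
    fn = N_TEST - tp - fp - tn
    if fn < 0:
        continue  # counts must sum to the test-set size
    if rounded_metrics(tp, fp, tn, fn) == REPORTED:
        matches.append((tp, fp, tn, fn))

print(matches)  # [(16, 20, 0, 0)] as (TP, FP, TN, FN)
```

Under these assumptions, the only matching matrix has 16 true positives, 20 false positives, and no negative predictions at all, which means the model labelled all 36 test samples as positive.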

---

## 7. ROC Curve and AUC Score
The Receiver Operating Characteristic (ROC) curve was used to evaluate the model's ability to distinguish between the two classes.

- **ROC–AUC Score:** 0.50

An AUC of 0.50 indicates that the model has no discriminative power on the test set: it ranks positive and negative samples no better than random guessing. This result is consistent with the biased predictions toward the positive class reported in Section 6.

**The ROC–AUC score is low because the model predicts almost all samples as positive, so its outputs barely separate the two classes.** If the score was computed from hard class labels rather than predicted probabilities, a model that labels every sample positive yields exactly 0.50.
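
To illustrate why identical predictions lead to an AUC of 0.50, the sketch below computes AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half. The 16/20 class split is an assumption inferred from the reported accuracy and precision, and the scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Assumed test set: 16 positive and 20 negative samples.
y_true = [1] * 16 + [0] * 20

# Every sample gets the same output, so all positive/negative pairs are tied.
constant_scores = [1.0] * 36
print(roc_auc(y_true, constant_scores))  # 0.5

# For contrast: scores that rank every positive above every negative.
separating_scores = [0.9] * 16 + [0.1] * 20
print(roc_auc(y_true, separating_scores))  # 1.0
```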

---

## 8. Final Prediction

Once trained, the Logistic Regression model outputs one of two labels:

- **1 → Heart Disease Present**  
- **0 → No Heart Disease**  

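A model of this kind is intended to support doctors in early diagnosis and to reduce the risk of severe complications. With an accuracy of 44.44% and a ROC–AUC of 0.50, the current model is not yet reliable enough for that purpose.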

---

## 9. Observations
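- Feature scaling was applied before training, although its benefit cannot be confirmed from the reported results  
- Logistic Regression is easy to interpret, but on this test set it predicted the positive class for almost every sample  
- The low accuracy (44.44%) and ROC–AUC (0.50) show that the model does not separate the two classes  
- The limited sample size (144 training and 36 test samples) makes both training and evaluation less reliable  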

---

## 10. Conclusion
In this project, a **Logistic Regression** model was built and evaluated using a clinical dataset with **17 features**.  
The model reached a recall of 1.00 but only 44.44% accuracy, 0.44 precision, and a ROC–AUC of 0.50, so in its current form it is not suitable for disease outcome prediction.

Future improvements may include:
- Hyperparameter tuning  
- Trying additional classification models  
- Applying cross-validation techniques  
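
One possible way to combine hyperparameter tuning with cross-validation is sketched below. It is only an illustration under stated assumptions: the data is a synthetic stand-in, and the parameter grid, the `class_weight` option, the five-fold setup, and ROC–AUC as the selection metric are choices made for this example rather than part of the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (180 x 17).
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 17))
y = rng.integers(0, 2, size=180)

# Scaling lives inside the pipeline so it is re-fitted on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the stand-in labels are random, the scores printed here carry no meaning. With the real dataset, the mean cross-validated ROC–AUC would give a steadier estimate than a single 36-sample test split.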

---

## Final Summary

- **Model: Logistic Regression**
- **Features: 17**
- **Train/Test: 144 / 36**
- **Accuracy: 44.44%**
- **Precision: 0.44**
- **Recall: 1.00**
- **F1-score: 0.62**
- **ROC–AUC: 0.50**