### **Logistic Regression: Predicting Heart Disease**

#### Objective:
Your task is to build a **Logistic Regression** model to predict whether a person has heart disease based on their medical and demographic information. This homework will help you understand how Logistic Regression is used for binary classification tasks.

---

#### Dataset:
Use the **Heart Disease dataset**, which records each patient's health parameters and whether the patient has heart disease.

---

#### Dataset Overview:
The dataset contains the following key features:
- **Age**: Age of the person (in years).
- **Sex**: Sex of the person (1 = male, 0 = female).
- **CP**: Chest pain type (categorical: 0, 1, 2, 3).
- **Trestbps**: Resting blood pressure (in mm Hg).
- **Chol**: Serum cholesterol (in mg/dL).
- **FBS**: Fasting blood sugar > 120 mg/dL (1 = true, 0 = false).
- **Restecg**: Resting electrocardiographic results (0, 1, 2).
- **Thalach**: Maximum heart rate achieved.
- **Exang**: Exercise-induced angina (1 = yes, 0 = no).
- **Oldpeak**: ST depression induced by exercise relative to rest.
- **Slope**: Slope of the peak exercise ST segment (categorical: 0, 1, 2).
- **CA**: Number of major vessels colored by fluoroscopy (0–3).
- **Thal**: Thalassemia (categorical: 0, 1, 2, 3).
- **Target**: **1** if the person has heart disease, **0** otherwise (binary target variable).

---

#### Steps to Complete:

1. **Data Loading and Exploration**
   - Load the dataset using `pandas`.
   - Explore the data:
     - Display the first few rows and column information.
     - Check for missing values and handle them appropriately.
     - Understand the distribution of each feature using histograms and box plots.
     - Visualize the correlation between features and the target variable.
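The exploration step can be sketched roughly as follows. Since the assignment does not fix a file name, the sketch builds a small synthetic frame with lowercase versions of a few of the column names (an assumption, not the real data); with the actual file you would load it with something like `pd.read_csv("heart.csv")`, where the path is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic stand-in for the real dataset, which in practice
# would be loaded with pd.read_csv("heart.csv") (hypothetical path).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),
    "trestbps": rng.integers(94, 200, n),
    "chol": rng.integers(126, 400, n).astype(float),
    "thalach": rng.integers(71, 202, n),
    "target": rng.integers(0, 2, n),
})
df.loc[[3, 17], "chol"] = np.nan  # inject two missing values for the demo

# First rows and column information
print(df.head())
df.info()

# Count missing values per column
missing = df.isna().sum()
print(missing)

# One simple way to handle them: fill numeric gaps with the column median
df["chol"] = df["chol"].fillna(df["chol"].median())

# Correlation of each feature with the target, sorted for readability
target_corr = df.corr()["target"].drop("target").sort_values()
print(target_corr)
```

For the plots, `df.hist()` draws one histogram per numeric column, and `seaborn.boxplot` and `seaborn.heatmap` can be used for box plots and the correlation matrix.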

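2. **Data Preprocessing**
   - Convert categorical variables (e.g., `CP`, `Slope`, `Thal`) into numerical indicator columns using one-hot encoding.
   - Split the dataset into training and test sets (e.g., 80% training, 20% testing).
   - Scale the features using `StandardScaler` or `MinMaxScaler`. Fit the scaler on the training set only and apply it to both sets, so that no test-set information leaks into training. Scaling helps the solver converge and makes regularization treat all features evenly.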
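A minimal sketch of the preprocessing step, assuming a synthetic frame with lowercase column names in place of the real data. The scaler is fitted on the training split only and then applied to both splits, which keeps test-set statistics out of training. `drop_first=True` is a design choice in this sketch that drops one redundant indicator per categorical variable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the real dataset
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "chol": rng.integers(126, 400, n),
    "cp": rng.integers(0, 4, n),
    "slope": rng.integers(0, 3, n),
    "thal": rng.integers(0, 4, n),
    "target": rng.integers(0, 2, n),
})

# One-hot encode the categorical columns
X = pd.get_dummies(
    df.drop(columns="target"),
    columns=["cp", "slope", "thal"],
    drop_first=True,
)
y = df["target"]

# 80/20 split, stratified so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```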

3. **Model Building**
   - Train a Logistic Regression model using `LogisticRegression` from `sklearn.linear_model`.
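   - Use different solvers (e.g., `liblinear`, `lbfgs`) and experiment with regularization techniques (`l1`, `l2` penalties). Not every solver supports every penalty: `lbfgs` does not support `l1`, while `liblinear` supports both `l1` and `l2`.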
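One way to run the solver and penalty experiments is to loop over a few valid combinations. This sketch uses `make_classification` as a synthetic stand-in for the heart data (an assumption), and the `configs` list is only an illustrative selection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data with 13 features in place of the real dataset
X, y = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Only solver/penalty pairs that scikit-learn accepts
configs = [
    {"solver": "lbfgs", "penalty": "l2"},
    {"solver": "liblinear", "penalty": "l2"},
    {"solver": "liblinear", "penalty": "l1"},
]

results = {}
for cfg in configs:
    # C is the inverse regularization strength (smaller = stronger penalty)
    model = LogisticRegression(C=1.0, max_iter=1000, **cfg)
    model.fit(X_train, y_train)
    results[(cfg["solver"], cfg["penalty"])] = model.score(X_test, y_test)

for key, acc in results.items():
    print(key, round(acc, 3))
```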

4. **Model Evaluation**
   - Evaluate the model's performance on the test set using:
     - **Accuracy Score**
     - **Precision, Recall, and F1-Score**
     - **Confusion Matrix**
     - **ROC Curve and AUC Score**
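   - Interpret the coefficients of the logistic regression model: the sign of a coefficient gives the direction of the feature's effect on the log-odds of heart disease, and, for scaled features, its magnitude indicates the feature's relative influence.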
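The metrics listed above all live in `sklearn.metrics`. The following sketch computes them on synthetic data (an assumption standing in for the heart dataset) and converts the coefficients to odds ratios as one possible aid to interpretation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Assumption: synthetic data in place of the real dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)        # rows: true class, cols: predicted
auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("AUC:", auc)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class

# exp(coefficient) is the multiplicative change in the odds of class 1
# per one-unit increase in the feature, holding the others fixed
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)
```

Plotting `fpr` against `tpr` with `matplotlib` gives the ROC curve.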

5. **Feature Importance**
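   - Perform feature selection by analyzing p-values (scikit-learn's `LogisticRegression` does not report them, so a statistics package such as `statsmodels` is needed) or by using techniques like Recursive Feature Elimination (RFE).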
   - Retrain the model with the most important features and compare performance.
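A sketch of the RFE route, again on synthetic data (an assumption). Keeping five features is an arbitrary choice made here for illustration; the right number for the real dataset is something to experiment with.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data with 13 features in place of the real dataset
X, y = make_classification(
    n_samples=400, n_features=13, n_informative=5, random_state=2
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Baseline: all features
full = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# RFE repeatedly drops the feature with the smallest absolute coefficient
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_train, y_train)
print("Selected feature mask:", selector.support_)

# Retrain on the selected features only and compare
reduced = LogisticRegression(max_iter=1000).fit(
    selector.transform(X_train), y_train
)
acc_full = full.score(X_test, y_test)
acc_reduced = reduced.score(selector.transform(X_test), y_test)
print("All features:", acc_full, "| Selected features:", acc_reduced)
```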

6. **Prediction**
   - Predict whether a hypothetical patient with specific characteristics (provided in the problem) has heart disease.
   - Justify the prediction based on the model's coefficients.
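The prediction step might look like the sketch below. Everything here is hypothetical: the training data is synthetic, the labelling rule is invented purely so the model has something to learn, and the `patient` values are placeholders rather than the patient specified in the assignment. The per-feature contributions (scaled value times coefficient) add up, together with the intercept, to the predicted log-odds, which is one way to justify a prediction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic features with a few of the dataset's column names
rng = np.random.default_rng(7)
n = 300
X = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "trestbps": rng.integers(94, 200, n),
    "chol": rng.integers(126, 400, n),
    "thalach": rng.integers(71, 202, n),
})
# Invented labelling rule plus noise; illustrative only, not a medical claim
score = 0.05 * (X["age"] - 54) - 0.03 * (X["thalach"] - 150)
y = (score + rng.normal(0, 1, n) > 0).astype(int)

# Pipeline so the patient is scaled exactly like the training data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# Hypothetical patient (placeholder values)
patient = pd.DataFrame(
    [{"age": 63, "trestbps": 145, "chol": 233, "thalach": 150}]
)
pred = pipe.predict(patient)[0]
prob = pipe.predict_proba(patient)[0, 1]
print("Predicted class:", pred, "| P(heart disease):", round(prob, 3))

# Contribution of each feature to the log-odds for this patient
scaled = pipe.named_steps["standardscaler"].transform(patient)[0]
clf = pipe.named_steps["logisticregression"]
contributions = pd.Series(scaled * clf.coef_[0], index=X.columns)
print(contributions.sort_values())
```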

---

#### Bonus Challenges (Optional):

1. **Hyperparameter Tuning**
   - Use `GridSearchCV` or `RandomizedSearchCV` to find the best combination of hyperparameters for the Logistic Regression model.
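A possible starting point for the tuning challenge. The grid values are illustrative assumptions, and the data is synthetic. Wrapping the scaler and the model in a `Pipeline` means the scaler is re-fitted inside each cross-validation fold, which avoids leakage during the search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data in place of the real dataset
X, y = make_classification(n_samples=400, n_features=13, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=4
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # liblinear supports both l1 and l2, so one solver covers the grid
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Step-name prefix "clf__" routes each parameter to the classifier
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)
print("Test accuracy of best model:", search.score(X_test, y_test) if False else
      search.best_estimator_.score(X_test, y_test))
```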

2. **Multiclass Classification**
   - Extend the task by using a dataset where the target variable has more than two classes (e.g., predicting chest pain types instead of heart disease).
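`LogisticRegression` accepts a multiclass target without extra configuration. The sketch below uses a synthetic four-class problem as a stand-in (an assumption); with the real data, the target could be a column such as `CP`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic four-class data in place of a real multiclass target
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=5,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One probability per class; each row sums to 1
proba = model.predict_proba(X_test)
predicted_class = np.argmax(proba, axis=1)

print(classification_report(y_test, model.predict(X_test)))
```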

3. **Comparison with Other Models**
   - Compare Logistic Regression's performance with other classification algorithms like:
     - Support Vector Machines (SVM)
     - Random Forest
     - k-Nearest Neighbors (kNN)
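The comparison can be kept fair by scoring every model with the same cross-validation scheme. This is a sketch on synthetic data (an assumption), with default or near-default hyperparameters chosen only for illustration. SVM, kNN, and logistic regression are wrapped with a scaler because they are sensitive to feature scale; random forests are not.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic data in place of the real dataset
X, y = make_classification(n_samples=400, n_features=13, random_state=6)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean 5-fold cross-validated accuracy per model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```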

4. **Imbalanced Dataset Handling**
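   - Introduce imbalance in the training data (e.g., reduce the number of samples for the target = 1 class) and apply techniques like:
     - Oversampling using SMOTE (available in the `imbalanced-learn` package), applied to the training set only.
     - Class weighting in the Logistic Regression model (the `class_weight` parameter).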
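A sketch of the class-weighting route, on synthetic data (an assumption). The imbalance is introduced only in the training split by keeping one tenth of the positive samples, an arbitrary ratio picked for this example; the test set is left untouched so recall is measured on the original distribution. SMOTE is left out of the code because it requires the separate `imbalanced-learn` package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data in place of the real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

# Introduce imbalance: keep all negatives but only 10% of the positives
rng = np.random.default_rng(3)
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
keep_pos = rng.choice(pos, size=len(pos) // 10, replace=False)
idx = np.concatenate([neg, keep_pos])
X_imb, y_imb = X_train[idx], y_train[idx]

# Unweighted model versus a model that reweights classes inversely
# to their frequency in the training data
plain = LogisticRegression(max_iter=1000).fit(X_imb, y_imb)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_imb, y_imb
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("Recall (no weighting):", recall_plain)
print("Recall (balanced weights):", recall_weighted)
```

Recall on the positive class is the metric to watch here, since accuracy can look high on imbalanced data even when most positives are missed.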

---

#### Deliverables:
- A Python script or Jupyter Notebook containing:
  - Code for loading, preprocessing, and visualizing the dataset.
  - Implementation of Logistic Regression with evaluation results.
  - Plots (e.g., ROC curve, confusion matrix) and performance metrics.
- A brief report addressing:
  - What factors were most predictive of heart disease?
  - How did hyperparameter tuning and feature selection impact the model?
  - Insights into the model's strengths and limitations.

---

#### Useful Hints:
- Use `seaborn` and `matplotlib` for data visualization.
- Use `sklearn.metrics` for model evaluation metrics.
- Start with a simple logistic regression model and progressively enhance it.