## Libraries Used and Their Detailed Explanation

### 1. Data Cleaning
- **pandas**  
  - Purpose: To load datasets, merge multiple CSV files, handle missing values, remove duplicates, rename columns, and perform overall data manipulation.  
  - Why: Pandas provides powerful dataframes which make it easy to explore and clean structured tabular data.  
  - Example Use: `pd.read_csv()`, `df.dropna()`, `df.duplicated()` (to flag duplicates), `df.drop_duplicates()` (to remove them).

- **numpy**  
  - Purpose: For numerical operations and handling arrays efficiently.  
  - Why: Many cleaning operations (like filling missing values or numeric calculations) require fast array operations, which numpy provides.  
  - Example Use: `np.mean()`, `np.where()`.
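
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two in-memory CSV strings, the column names (`age`, `chol`, `sex`), and the rule that a negative cholesterol value is invalid are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# Two tiny in-memory "CSV files" standing in for the real dataset files (hypothetical data).
csv_a = io.StringIO("age,chol,sex\n63,233,Male\n37,,Female\n")
csv_b = io.StringIO("age,chol,sex\n37,,Female\n41,204,Female\n56,-1,Male\n")

# Load each file and stack the rows into one dataframe.
df = pd.concat([pd.read_csv(csv_a), pd.read_csv(csv_b)], ignore_index=True)

# Remove exact duplicate rows (the "37,,Female" row appears in both files).
df = df.drop_duplicates().reset_index(drop=True)

# Treat impossible values as missing (assumed rule), then fill gaps with the column mean.
df["chol"] = np.where(df["chol"] < 0, np.nan, df["chol"])
df["chol"] = df["chol"].fillna(df["chol"].mean())

# Rename a column to something more descriptive.
df = df.rename(columns={"chol": "cholesterol"})
print(df)
```

Mean imputation is only one option; median imputation or dropping the affected rows with `df.dropna()` may suit skewed features better.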

---
### 2. Data Preprocessing
- **Encoding Categorical Features:**  
  - **Label Encoding** or **One-Hot Encoding** is applied to categorical variables to convert them into the numeric form that machine learning models require.  
  - Examples: Gender (`Male`/`Female`) → 1/0, Chest Pain type → multiple binary columns (one-hot).  
  - Why: scikit-learn estimators such as Logistic Regression expect numeric input and cannot work with string labels directly. One-hot encoding is used for features with more than two unordered categories, because integer labels would imply an ordering that does not exist.
- **Feature Scaling:**  
  - **StandardScaler** from `scikit-learn` is applied to numerical features to standardize them to zero mean and unit variance.  
  - Why: Scaling helps the solver converge for models like Logistic Regression and keeps features with larger ranges from dominating the fit.
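
As a rough sketch of the encoding and scaling steps above, assuming a small hypothetical dataframe (the column names and values are illustrative, and `pd.get_dummies` is used here as one common way to one-hot encode). For brevity the scaler is fit on the whole sample; in a real workflow it would normally be fit on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample; column names are illustrative only.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "chest_pain": ["typical", "atypical", "non-anginal", "typical"],
    "age": [63, 37, 41, 56],
    "cholesterol": [233, 250, 204, 236],
})

# Binary feature: map the two labels directly to 1/0.
df["sex"] = df["sex"].map({"Male": 1, "Female": 0})

# Multi-class nominal feature: one binary column per category.
df = pd.get_dummies(df, columns=["chest_pain"], dtype=int)

# Standardize numeric columns to zero mean and unit variance.
num_cols = ["age", "cholesterol"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df.round(2))
```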

---

### 3. Exploratory Data Analysis (EDA)
- **pandas**  
  - Purpose: To generate descriptive statistics, summarize data, check datatypes, and analyze distributions.  
  - Why: Helps understand dataset structure, identify skewed features, and detect anomalies before modeling.  
  - Example Use: `df.describe()`, `df.info()`.

- **numpy**  
  - Purpose: Used for statistical calculations during EDA.  
  - Why: Helps compute aggregates like mean, median, standard deviation quickly for numerical features.  
  - Example Use: `np.std()`, `np.percentile()`.
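
A short sketch of these EDA calls on a hypothetical sample (the column names and values are made up for illustration). One detail worth knowing: `np.std()` defaults to the population standard deviation (`ddof=0`), while pandas uses the sample version (`ddof=1`), so the two only agree when `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real dataset's columns may differ.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "cholesterol": [233, 250, 204, 236, 354, 263],
    "target": [1, 1, 0, 1, 0, 0],
})

df.info()                            # column dtypes and non-null counts
print(df.describe())                 # count, mean, std, min, quartiles, max
print(df["target"].value_counts())   # class balance

# NumPy equivalents for individual statistics.
chol = df["cholesterol"].to_numpy()
print("std:", np.std(chol, ddof=1))                     # sample std, same as pandas
print("quartiles:", np.percentile(chol, [25, 50, 75]))
print("skewness:", df["cholesterol"].skew())            # > 0 suggests a right-skewed feature
```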

---

### 4. Data Visualization 
- **matplotlib**  
  - Purpose: To create basic plots such as histograms, bar charts, scatter plots, and line plots.  
  - Why: Helps visually explore distributions, trends, and relationships between features.  
  - Example Use: `plt.hist()`, `plt.scatter()`.

- **seaborn**  
  - Purpose: To create advanced and aesthetically pleasing visualizations like count plots, box plots, and heatmaps.  
  - Why: Makes it easier to analyze class imbalance, feature relationships, and correlations visually.  
  - Example Use: `sns.countplot()`, `sns.heatmap()`.
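
The plots mentioned above might be produced along these lines. The data, column names, and the output file name `eda_plots.png` are assumptions for the sake of a runnable example; the `Agg` backend is selected only so the script also works without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 61],
    "cholesterol": [233, 250, 204, 236, 354, 263, 199, 330],
    "target": [1, 1, 0, 1, 0, 0, 1, 0],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# matplotlib: distribution of a single numeric feature.
axes[0].hist(df["age"], bins=5)
axes[0].set_title("Age distribution")

# seaborn: class balance of the target.
sns.countplot(x="target", data=df, ax=axes[1])
axes[1].set_title("Class balance")

# seaborn: pairwise correlations between numeric columns.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", ax=axes[2])
axes[2].set_title("Correlation matrix")

fig.tight_layout()
fig.savefig("eda_plots.png")  # hypothetical output file name
```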

---

### 5. Model Building
- **scikit-learn (sklearn)**  
  - Purpose: To build the machine learning model (Logistic Regression in this project).  
  - Why: Provides a simple, consistent interface to implement classification algorithms and preprocessing tools.  
  - Example Use: `LogisticRegression()`, `StandardScaler()`.
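
A minimal sketch of the model-building step, using synthetic data in place of the real heart disease dataset. The feature distributions, the labeling rule, and the simple slice-based hold-out are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (not the project's data):
# two numeric features on very different scales and a binary label.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(55, 9, 200), rng.normal(240, 50, 200)])
y = (X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 5, 200) > 79).astype(int)

# Simple hold-out slice for brevity: first 160 rows train, last 40 test.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Fit the scaler on the training rows only, then reuse it for the test rows,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

print("coefficients:", model.coef_)
print("predicted classes:", model.predict(X_test_scaled[:5]))
print("P(class 1):", model.predict_proba(X_test_scaled[:5])[:, 1])
```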

---

### 6. Model Evaluation
- **scikit-learn (sklearn)**  
  - Purpose: To evaluate model performance using metrics like accuracy, precision, recall, F1-score, confusion matrix, and ROC–AUC.  
  - Why: Provides reliable and widely accepted evaluation metrics for classification tasks.  
  - Example Use: `classification_report()`, `confusion_matrix()`, `roc_auc_score()`, `roc_curve()`.
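
To illustrate how these metrics are called, here is a small sketch on hand-written labels and probabilities (hypothetical values, not results from this project). Note that `roc_auc_score()` and `roc_curve()` take predicted probabilities, while the confusion matrix and classification report take thresholded class predictions.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Hypothetical true labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 decision threshold

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, F1-score, and support, plus overall accuracy.
print(classification_report(y_true, y_pred))

# AUC is computed from probabilities, not from the thresholded predictions.
print("ROC AUC:", roc_auc_score(y_true, y_prob))

# roc_curve returns the points of the curve; plotting them is a separate step.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
```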

---

### 7. scikit-learn Modules Used
- **train_test_split** – To divide the dataset into training and testing sets (80:20).  
- **StandardScaler** – To scale numerical features for models that are sensitive to feature magnitude.  
- **LogisticRegression** – To train the classification model predicting heart disease.  
- **classification_report** – To calculate per-class precision, recall, F1-score, and support.  
- **confusion_matrix** – To count correct and incorrect predictions per class (true/false positives and negatives).  
- **roc_auc_score** – To calculate the Area Under the ROC Curve (AUC) from predicted probabilities.  
- **roc_curve** – To compute the false positive and true positive rates at each threshold, which are then plotted as the ROC curve.
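
The 80:20 split from the list above could look like the sketch below. The data is synthetic, and the `stratify` and `random_state` arguments are assumptions added for illustration; the source does not state whether the project uses them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 30% positive class.
X = np.arange(300).reshape(100, 3)
y = np.array([1] * 30 + [0] * 70)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80:20 split
    stratify=y,        # assumption: keep the class ratio equal in both parts
    random_state=42,   # assumption: fixed seed for a reproducible split
)

print(X_train.shape, X_test.shape)
print("positive rate:", y_train.mean(), y_test.mean())
```

Stratifying is most useful when the classes are imbalanced, since a purely random split can leave the test set with too few positive cases to evaluate reliably.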

---

### 8. Overall Summary
These libraries together form a complete Python data science toolkit:  
- **pandas & numpy** → Data handling, cleaning, analysis  
- **matplotlib & seaborn** → Visualization and insights  
- **scikit-learn** → Model building and evaluation  

They were selected because they are **industry-standard, well-documented, and efficient**, making the entire workflow from raw data to model evaluation smooth and reproducible.
