# Heart Disease Prediction - Exploratory Data Analysis (EDA)

## MLOps Assignment - Task 1: Data Acquisition & EDA

**Objective:** Analyze the Heart Disease UCI Dataset to understand patterns, distributions, and relationships in the data.

**Dataset:** Heart Disease UCI Dataset from the UCI Machine Learning Repository
- 303 instances from the Cleveland Clinic Foundation
- 14 attributes (13 features + 1 target)
- Binary classification: Presence/Absence of heart disease

---

## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
from pathlib import Path
import sys

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully!")


## 2. Load Dataset

In [None]:
# Load the dataset (pd and Path are already imported in Section 1)

# Resolve the data file relative to the project root (parent of the notebook directory)
BASE_DIR = Path.cwd().parent
DATA_PATH = BASE_DIR / "data" / "processed" / "heart_disease.csv"

print(f"Loading data from: {DATA_PATH}")

# Stop early if the file is missing, since every later cell needs `df`
if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data file not found: {DATA_PATH}\n"
        "Please run: python ../src/download_data.py"
    )

df = pd.read_csv(DATA_PATH)
print("✓ Data loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"Number of samples: {len(df)}")
print(f"Number of features: {len(df.columns) - 1}")  # Excluding target
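
Before exploring, it can help to confirm that the loaded frame contains the columns that later cells reference by name. The following is a minimal sketch, assuming the processed CSV uses the column names that appear later in this notebook (`age`, `trestbps`, `chol`, `thalach`, `oldpeak`, `target`). `validate_columns` is a hypothetical helper, and the small frame is made-up illustration data, not the real dataset.

```python
import pandas as pd

# Columns that later cells in this notebook reference by name (assumed schema).
REQUIRED_COLUMNS = ("age", "trestbps", "chol", "thalach", "oldpeak", "target")


def validate_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required columns missing from `frame` (empty list if none)."""
    return [col for col in required if col not in frame.columns]


# Made-up toy frame with one required column deliberately left out.
toy = pd.DataFrame({
    "age": [63, 41],
    "trestbps": [145, 130],
    "chol": [233, 204],
    "thalach": [150, 172],
    "target": [1, 0],
})
missing = validate_columns(toy)
print(f"Missing columns: {missing}")
```

On the real data this could be called as `validate_columns(df)` right after loading.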

## 3. Initial Data Exploration

In [None]:
# Display first few rows
print("="*80)
print("FIRST 10 ROWS OF DATASET")
print("="*80)
display(df.head(10))

print("\n" + "="*80)
print("LAST 5 ROWS OF DATASET")
print("="*80)
display(df.tail())

print("\n" + "="*80)
print("DATASET INFO")
print("="*80)
df.info()

print("\n" + "="*80)
print("COLUMN NAMES AND DATA TYPES")
print("="*80)
for col in df.columns:
    print(f"{col:15s} : {str(df[col].dtype):10s} | Unique values: {df[col].nunique()}")

## 4. Descriptive Statistics

In [None]:
# Descriptive statistics
print("="*80)
print("DESCRIPTIVE STATISTICS - ALL FEATURES")
print("="*80)
display(df.describe().T.style.background_gradient(cmap='coolwarm'))

print("\n" + "="*80)
print("DESCRIPTIVE STATISTICS - TARGET VARIABLE")
print("="*80)
print(df['target'].describe())
print("\nTarget value counts:")
print(df['target'].value_counts().sort_index())
print("\nTarget distribution (%):")
print(df['target'].value_counts(normalize=True).sort_index() * 100)

## 5. Missing Values Analysis

In [None]:
# Check for missing values
print("="*80)
print("MISSING VALUES ANALYSIS")
print("="*80)

missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
})
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_data) > 0:
    print("\nColumns with missing values:")
    display(missing_data)
    
    # Visualize missing values
    fig, ax = plt.subplots(figsize=(10, 6))
    missing_data.plot(x='Column', y='Missing_Percentage', kind='bar', ax=ax, color='coral')
    plt.title('Missing Values Percentage by Column', fontsize=14, fontweight='bold')
    plt.xlabel('Column')
    plt.ylabel('Missing Percentage (%)')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("\n✓ No missing values found in the dataset!")
    print("This is excellent - the data is complete!")

print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"Total data points: {df.shape[0] * df.shape[1]}")
print(f"Missing percentage: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100:.2f}%")

## 6. Class Balance Analysis (TARGET VARIABLE)

In [None]:
# Class balance analysis
print("="*80)
print("CLASS BALANCE ANALYSIS")
print("="*80)

target_counts = df['target'].value_counts().sort_index()
target_percentages = df['target'].value_counts(normalize=True).sort_index() * 100

print("\nClass Distribution:")
print(f"Class 0 (No Disease):  {target_counts[0]:3d} samples ({target_percentages[0]:.2f}%)")
print(f"Class 1 (Disease):     {target_counts[1]:3d} samples ({target_percentages[1]:.2f}%)")
print(f"\nClass Ratio (Disease : No Disease): {target_counts[1] / target_counts[0]:.2f}:1")

# Visualize class distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Bar plot
axes[0].bar(['No Disease (0)', 'Disease (1)'], target_counts.values, color=['#2ecc71', '#e74c3c'], alpha=0.7, edgecolor='black')
axes[0].set_title('Class Distribution - Bar Chart', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(target_counts.values):
    axes[0].text(i, v + 5, str(v), ha='center', fontweight='bold')

# Pie chart
colors = ['#2ecc71', '#e74c3c']
axes[1].pie(target_counts.values, labels=['No Disease (0)', 'Disease (1)'], 
           autopct='%1.1f%%', startangle=90, colors=colors, explode=(0.05, 0.05))
axes[1].set_title('Class Distribution - Pie Chart', fontsize=14, fontweight='bold')

# Count plot with seaborn
sns.countplot(data=df, x='target', ax=axes[2], palette=['#2ecc71', '#e74c3c'])
axes[2].set_title('Class Distribution - Count Plot', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Target (0: No Disease, 1: Disease)')
axes[2].set_ylabel('Count')
axes[2].set_xticks([0, 1])  # Fix tick positions before relabeling them
axes[2].set_xticklabels(['No Disease (0)', 'Disease (1)'])
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Balance assessment
if abs(target_percentages[0] - target_percentages[1]) < 10:
    print("\n✓ Dataset is WELL BALANCED (difference < 10%)")
elif abs(target_percentages[0] - target_percentages[1]) < 20:
    print("\n⚠ Dataset is SLIGHTLY IMBALANCED (difference 10-20%)")
else:
    print("\n⚠ Dataset is IMBALANCED (difference > 20%)")
    print("  Consider using stratified sampling or class weights during training")
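
The imbalanced branch above suggests stratified sampling or class weights. The following is a minimal sketch of both using pandas only, on made-up labels rather than the real dataset. The weight formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic (the same one scikit-learn uses for `class_weight='balanced'`); the 50% hold-out fraction is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up imbalanced labels: 8 negatives, 2 positives.
toy = pd.DataFrame({"target": [0] * 8 + [1] * 2, "x": range(10)})

# "Balanced" class weights: n_samples / (n_classes * class_count).
counts = toy["target"].value_counts()
class_weights = (len(toy) / (counts.size * counts)).to_dict()
print(class_weights)

# Stratified hold-out: sample the same fraction from each class,
# so both parts keep the original class proportions.
test_part = toy.groupby("target").sample(frac=0.5, random_state=42)
train_part = toy.drop(test_part.index)
print(test_part["target"].value_counts().to_dict())
```

The rarer class receives the larger weight, and the hold-out keeps the 4:1 class ratio of the toy labels.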

## 7. Distribution Analysis - Histograms for All Features

In [None]:
# Distribution of all features
print("="*80)
print("FEATURE DISTRIBUTIONS")
print("="*80)

# Get all feature columns (exclude target)
feature_cols = [col for col in df.columns if col != 'target']

# Create subplots for histograms
n_features = len(feature_cols)
n_cols = 4
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, n_rows * 4))
axes = axes.flatten()

for idx, col in enumerate(feature_cols):
    axes[idx].hist(df[col].dropna(), bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{col} Distribution', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(axis='y', alpha=0.3)
    
    # Add mean and median lines
    mean_val = df[col].mean()
    median_val = df[col].median()
    axes[idx].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.2f}')
    axes[idx].legend(fontsize=8)

# Hide extra subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.suptitle('Distribution of All Features', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n✓ Histograms show the distribution shape for all features")
print("  - Red line: Mean")
print("  - Green line: Median")

## 8. Correlation Analysis - Heatmap

In [None]:
# Correlation matrix and heatmap
print("="*80)
print("CORRELATION ANALYSIS")
print("="*80)

# Calculate correlation matrix
correlation_matrix = df.corr()

# Display correlations with target, strongest first regardless of sign
print("\nCorrelations with Target Variable (sorted by absolute value):")
target_corr = correlation_matrix['target'].sort_values(key=abs, ascending=False)
print(target_corr)

# Visualize correlation matrix
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Full correlation heatmap
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8}, ax=axes[0])
axes[0].set_title('Correlation Heatmap - All Features', fontsize=14, fontweight='bold')

# Target correlation bar plot
target_corr_abs = target_corr.drop('target').abs().sort_values(ascending=True)
target_corr_abs.plot(kind='barh', ax=axes[1], color='teal')
axes[1].set_title('Feature Correlations with Target (Absolute Values)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Absolute Correlation')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

# Identify highly correlated features
print("\n" + "="*80)
print("HIGHLY CORRELATED FEATURE PAIRS (|correlation| > 0.5)")
print("="*80)
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.5:
            high_corr.append({
                'Feature 1': correlation_matrix.columns[i],
                'Feature 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

if high_corr:
    high_corr_df = pd.DataFrame(high_corr).sort_values('Correlation', ascending=False, key=abs)
    display(high_corr_df)
else:
    print("No highly correlated feature pairs found (threshold: 0.5)")

print("\n✓ Correlation analysis complete")

## 9. Box Plots - Outlier Detection

In [None]:
# Box plots for outlier detection
print("="*80)
print("OUTLIER DETECTION - BOX PLOTS")
print("="*80)

# Create box plots for all numeric features
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, n_rows * 4))
axes = axes.flatten()

for idx, col in enumerate(feature_cols):
    axes[idx].boxplot(df[col].dropna(), vert=True, patch_artist=True,
                     boxprops=dict(facecolor='lightblue', alpha=0.7),
                     medianprops=dict(color='red', linewidth=2),
                     whiskerprops=dict(color='blue', linewidth=1.5),
                     capprops=dict(color='blue', linewidth=1.5))
    axes[idx].set_title(f'{col} - Box Plot', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel(col)
    axes[idx].grid(axis='y', alpha=0.3)

# Hide extra subplots
for idx in range(n_features, len(axes)):
    axes[idx].axis('off')

plt.suptitle('Box Plots for All Features (Outlier Detection)', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Calculate outliers using IQR method
print("\n" + "="*80)
print("OUTLIER STATISTICS (IQR Method)")
print("="*80)

outlier_summary = []
for col in feature_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    outlier_count = len(outliers)
    outlier_percentage = (outlier_count / len(df)) * 100
    
    outlier_summary.append({
        'Feature': col,
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'Lower_Bound': lower_bound,
        'Upper_Bound': upper_bound,
        'Outlier_Count': outlier_count,
        'Outlier_Percentage': outlier_percentage
    })

outlier_df = pd.DataFrame(outlier_summary)
outlier_df = outlier_df[outlier_df['Outlier_Count'] > 0].sort_values('Outlier_Count', ascending=False)

if len(outlier_df) > 0:
    print("\nFeatures with outliers:")
    display(outlier_df)
else:
    print("\n✓ No outliers detected using IQR method!")

print("\n✓ Box plot analysis complete")

## 10. Feature Relationships with Target Variable

Analyze how each feature relates to the target variable (heart disease presence).

In [None]:
# Feature relationships with target variable
print("="*80)
print("FEATURE-TARGET RELATIONSHIPS")
print("="*80)

# Create violin plots for continuous variables by target
continuous_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
n_continuous = len(continuous_features)
n_cols_violin = 3
n_rows_violin = (n_continuous + n_cols_violin - 1) // n_cols_violin

fig, axes = plt.subplots(n_rows_violin, n_cols_violin, figsize=(18, n_rows_violin * 4))
axes = axes.flatten()

for idx, col in enumerate(continuous_features):
    sns.violinplot(data=df, x='target', y=col, ax=axes[idx], palette='Set2')
    axes[idx].set_title(f'{col} vs Target', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Target (0=No Disease, 1=Disease)')
    axes[idx].set_ylabel(col)
    axes[idx].grid(axis='y', alpha=0.3)

# Hide extra subplots
for idx in range(n_continuous, len(axes)):
    axes[idx].axis('off')

plt.suptitle('Continuous Features vs Target Variable (Violin Plots)', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Create count plots for categorical variables by target
categorical_features = [col for col in feature_cols if col not in continuous_features]
n_categorical = len(categorical_features)
n_rows_cat = (n_categorical + 2) // 3

fig, axes = plt.subplots(n_rows_cat, 3, figsize=(18, n_rows_cat * 4))
axes = axes.flatten()

for idx, col in enumerate(categorical_features):
    # Grouped bar plot
    pd.crosstab(df[col], df['target'], normalize='index').plot(
        kind='bar', ax=axes[idx], color=['#FF6B6B', '#4ECDC4'], alpha=0.8
    )
    axes[idx].set_title(f'{col} vs Target (Normalized)', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Proportion')
    axes[idx].legend(['No Disease', 'Disease'])
    axes[idx].grid(axis='y', alpha=0.3)
    axes[idx].set_xticklabels(axes[idx].get_xticklabels(), rotation=45)

# Hide extra subplots
for idx in range(n_categorical, len(axes)):
    axes[idx].axis('off')

plt.suptitle('Categorical Features vs Target Variable (Normalized Bar Plots)', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Statistical summary by target
print("\n" + "="*80)
print("MEAN VALUES BY TARGET CLASS")
print("="*80)
target_means = df.groupby('target')[continuous_features].mean()
display(target_means.style.background_gradient(cmap='RdYlGn', axis=0).format("{:.2f}"))

print("\n✓ Feature-target relationship analysis complete")
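
`chi2_contingency` is imported in Section 1, while the normalized bar plots above are judged only by eye. One way to quantify a categorical feature's association with the target is a chi-square test of independence on the crosstab. The following is a minimal sketch on made-up data; `chi_square_vs_target` is a hypothetical helper, and the test is only a rough guide when some expected cell counts are small.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_vs_target(frame, feature, target="target"):
    """Chi-square test of independence between one categorical feature and the target."""
    table = pd.crosstab(frame[feature], frame[target])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return {"feature": feature, "chi2": chi2, "p_value": p_value, "dof": dof}


# Made-up data: `flag` agrees with the target in 36 of 40 rows.
toy = pd.DataFrame({
    "flag": [0] * 18 + [1] * 2 + [0] * 2 + [1] * 18,
    "target": [0] * 20 + [1] * 20,
})
result = chi_square_vs_target(toy, "flag")
print(result)
```

On the real data this could be looped over `categorical_features`, for example `pd.DataFrame([chi_square_vs_target(df, col) for col in categorical_features])`.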

## 11. Pairwise Feature Relationships

Explore relationships between key features using pairplot.

In [None]:
# Pairplot for key continuous features
print("="*80)
print("PAIRWISE RELATIONSHIPS - PAIRPLOT")
print("="*80)

# Select key features for pairplot (to keep visualization manageable)
key_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'target']

print(f"\nGenerating pairplot for key features: {key_features[:-1]}")
print("This may take a moment...\n")

# Create pairplot
pairplot_fig = sns.pairplot(
    df[key_features], 
    hue='target',
    palette='Set1',
    diag_kind='kde',
    plot_kws={'alpha': 0.6},
    height=2.5,
    aspect=1.2
)

pairplot_fig.figure.suptitle('Pairwise Relationships of Key Features (colored by Target)',
                             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n✓ Pairplot analysis complete")
print("\nHow to read the pairplot:")
print("- Diagonal: KDE plots show distribution of each feature by target class")
print("- Off-diagonal: Scatter plots show pairwise relationships")
print("- Red: No heart disease (target=0)")
print("- Blue: Heart disease present (target=1)")

## 12. Summary and Key Insights

### Dataset Overview
- **Total Records**: 303 patients
- **Total Features**: 13 predictive features + 1 target variable
- **Target Distribution**: Balanced dataset (presence vs. absence of heart disease)

### Data Quality
- **Missing Values**: No missing values detected ✓
- **Data Types**: Mixed (numerical and categorical)
- **Outliers**: Identified using IQR method in features like chol, trestbps, thalach

### Key Findings from EDA

#### 1. **Age Distribution**
- Age ranges from 29 to 77 years
- Mean age: ~54 years
- Heart disease patients tend to be slightly older on average

#### 2. **Sex-based Observations**
- Dataset has more male patients
- Disease prevalence differs between the sexes

#### 3. **Chest Pain Type (cp)**
- Strong predictor of heart disease
- Different chest pain types show varying association with disease presence

#### 4. **Blood Pressure & Cholesterol**
- Resting blood pressure (trestbps): Mean ~131 mmHg
- Serum cholesterol (chol): Mean ~246 mg/dl
- Some outliers detected, but clinically plausible

#### 5. **Maximum Heart Rate (thalach)**
- Mean ~150 bpm
- The thalach distribution differs between target classes (see the violin plots in Section 10)

#### 6. **ST Depression (oldpeak)**
- Key indicator of heart disease
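- Among the features most strongly correlated with the target; the sign depends on how the target is encoded, so confirm it against the Section 8 output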

#### 7. **Correlation Insights**
- Strongest positive correlations with target: cp, thalach, slope
- Strongest negative correlations with target: exang, oldpeak, ca
- Some features show multicollinearity (consider feature engineering)
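
The multicollinearity remark can be made concrete with variance inflation factors (VIF). For standardized predictors, the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The following is a minimal sketch on made-up data; `vif_from_correlation` is a hypothetical helper, and the usual reading that values above roughly 5 to 10 deserve attention is a rule of thumb, not a hard threshold.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(features):
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns, name="VIF")


# Made-up data: `b` is nearly a copy of `a`, while `c` is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
vif = vif_from_correlation(toy)
print(vif.round(2))
```

On the real data this could be run as `vif_from_correlation(df[feature_cols])`, provided no feature is constant or an exact linear combination of the others.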

### Recommendations for Modeling
1. **Feature Engineering**: Consider interaction terms between highly correlated features
2. **Scaling Required**: Features have different scales (age: 29-77, chol: 126-564)
3. **Encoding**: Multi-valued categorical variables (cp, restecg, slope, etc.) benefit from one-hot encoding; binary variables (sex, fbs, exang) are already coded 0/1
4. **Class Balance**: Dataset is relatively balanced - no need for SMOTE/undersampling
5. **Model Selection**: Classification algorithms suitable (Logistic Regression, Random Forest, Gradient Boosting)
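
The scaling and encoding recommendations above can be sketched with pandas alone. This is a minimal illustration on made-up rows that reuse column names from this notebook, not the actual preprocessing pipeline; the choice of z-score scaling and one-hot encoding is an assumption.

```python
import pandas as pd

# Made-up rows using column names from this notebook.
toy = pd.DataFrame({
    "age": [29, 54, 77],
    "chol": [126, 246, 564],
    "cp": [0, 2, 3],
    "target": [0, 1, 1],
})

continuous = ["age", "chol"]

# Standardize continuous columns to zero mean and unit (population) variance.
scaled = (toy[continuous] - toy[continuous].mean()) / toy[continuous].std(ddof=0)

# One-hot encode the multi-valued categorical column.
encoded = pd.get_dummies(toy["cp"], prefix="cp")

X = pd.concat([scaled, encoded], axis=1)
y = toy["target"]
print(X.columns.tolist())
```

In a real pipeline the mean and standard deviation would be computed on the training split only and then reused for validation and test data, to avoid leaking information.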

### Next Steps
- Data preprocessing (scaling, encoding)
- Feature engineering
- Model training with multiple algorithms
- Hyperparameter tuning
- Model evaluation with cross-validation

---

**EDA Complete** ✅