
# Heart Disease Prediction


## 1. Introduction
This project predicts whether a patient is likely to develop heart disease within the next 10 years, based on health-related attributes. The motivation is the rising prevalence of cardiovascular disease and the need for early prediction systems.

## 2. Dataset Description

In [None]:

import pandas as pd
df = pd.read_csv("Heart Disease.csv")
df.info()
df.head()


In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

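# Summary statistics; printed explicitly because a notebook cell only
# auto-displays its last expression
print(df.describe(include='all'))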

# Correlation heatmap
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Features')
plt.show()


## 3. Exploratory Data Analysis

In [None]:

output_col = 'Heart Disease (in next 10 years)'
df[output_col].value_counts().plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Class Distribution in Output Feature')
plt.xlabel('Heart Disease in 10 Years')
plt.ylabel('Count')
plt.show()
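
The bar chart shows the imbalance visually; the sketch below puts numbers on it. It runs on a small hypothetical label series standing in for `df[output_col]`, and `class_balance` is an illustrative helper that is not part of the notebook.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, sorted by class label."""
    counts = labels.value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical toy labels (assumption); in the notebook this would be df[output_col]
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
summary = class_balance(toy_labels)
print(summary)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = summary["share"].max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

The largest share is the accuracy a model would reach by always predicting the majority class, which gives a floor for reading the accuracy scores in the later sections.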


## 4. Dataset Pre-processing

In [None]:

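# Handle missing values. Results are assigned back to the column: calling
# fillna(inplace=True) on df[col] is chained assignment, which recent pandas
# versions warn about and which does not update df under copy-on-write.
df['education'] = df['education'].fillna(df['education'].mode()[0])    # most frequent level
df['cigsPerDay'] = df['cigsPerDay'].fillna(df['cigsPerDay'].median())  # median
df['BPMeds'] = df['BPMeds'].fillna(0)                                  # filled with 0
df['totChol'] = df['totChol'].fillna(df['totChol'].mean())             # mean
df['BMI'] = df['BMI'].fillna(df['BMI'].mean())                         # mean
df['glucose'] = df['glucose'].fillna(df['glucose'].mean())             # mean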

# Encoding categorical
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})

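# Standardization (zero mean, unit variance per feature)
# NOTE: the scaler is fit on all rows before the train/test split, so test-set
# statistics influence the scaling. For a stricter evaluation, fit the scaler
# on the training split only and apply it to the test split.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = df.drop(columns=['Heart Disease (in next 10 years)'])
y = df['Heart Disease (in next 10 years)']
X_scaled = scaler.fit_transform(X)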


## 5. Dataset Splitting

In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, stratify=y, random_state=42)
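
Imputation and scaling statistics computed on the full table let information from the test rows reach the training step. One way to avoid this, sketched here on synthetic data rather than the heart-disease CSV, is to split first and wrap the preprocessing in a scikit-learn `Pipeline`, so every statistic is learned from the training split only. The names `X_demo`, `y_demo`, `X_tr`, `X_te` and `pipe` are illustrative assumptions, and mean imputation for every column is a simplification of the per-column choices made above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real table (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# Split BEFORE any statistics are computed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Imputer and scaler are fit on X_tr only when pipe.fit is called
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# The test split is transformed with the training-split statistics
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, which then repeat the fit-on-train-only logic inside every fold.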


## 6. Model Training & Testing

In [None]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

models = {
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': GaussianNB()
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]  # positive-class probabilities for ROC AUC
    results[name] = {
        'accuracy': accuracy_score(y_test, preds),
        # zero_division=0 reports 0 instead of warning if no positives are predicted
        'precision': precision_score(y_test, preds, zero_division=0),
        'recall': recall_score(y_test, preds),
        'confusion_matrix': confusion_matrix(y_test, preds),
        'roc_auc': roc_auc_score(y_test, probs)
    }

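# Neural Network: two hidden ReLU layers and a sigmoid output for binary classification
from tensorflow.keras import Input
nn = Sequential()
# Explicit input layer; recent Keras versions warn when input_dim is passed to Dense
nn.add(Input(shape=(X_train.shape[1],)))
nn.add(Dense(16, activation='relu'))
nn.add(Dense(8, activation='relu'))
nn.add(Dense(1, activation='sigmoid'))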

nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
nn.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
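nn_probs = nn.predict(X_test).ravel()        # predicted probabilities, shape (n_samples,)
nn_preds = (nn_probs > 0.5).astype("int32")  # class labels at a 0.5 threshold

results['Neural Network'] = {
    'accuracy': accuracy_score(y_test, nn_preds),
    'precision': precision_score(y_test, nn_preds),
    'recall': recall_score(y_test, nn_preds),
    'confusion_matrix': confusion_matrix(y_test, nn_preds),
    # ROC AUC is computed from probabilities, as for the other models;
    # thresholded labels would reduce the curve to a single operating point
    'roc_auc': roc_auc_score(y_test, nn_probs)
}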
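
A `results` dictionary of this shape is easier to compare as a table. The sketch below shows one possible way to flatten it with pandas; `results_table` is an illustrative helper, and the numbers in `demo_results` are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd


def results_table(results: dict) -> pd.DataFrame:
    """Collect the scalar metrics of each model into one row per model."""
    scalar_metrics = ["accuracy", "precision", "recall", "roc_auc"]
    rows = {
        name: {metric: metrics[metric] for metric in scalar_metrics}
        for name, metrics in results.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index").round(3)


# Hypothetical placeholder values (assumption), same keys as the notebook's dict
demo_results = {
    "Model A": {"accuracy": 0.80, "precision": 0.50, "recall": 0.25,
                "roc_auc": 0.70, "confusion_matrix": [[15, 1], [3, 1]]},
    "Model B": {"accuracy": 0.75, "precision": 0.40, "recall": 0.50,
                "roc_auc": 0.72, "confusion_matrix": [[13, 3], [2, 2]]},
}

# Confusion matrices are skipped because they are not scalar values
table = results_table(demo_results).sort_values("roc_auc", ascending=False)
print(table)
```

Sorting by ROC AUC rather than accuracy is a choice made here because accuracy can look high on an imbalanced target even when few positives are found.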


## 7. Model Selection/Comparison

In [None]:

# Plotting accuracy
import matplotlib.pyplot as plt
names = list(results.keys())
accuracies = [results[name]['accuracy'] for name in names]

plt.figure(figsize=(8,6))
plt.bar(names, accuracies, color='teal')
plt.ylabel('Accuracy')
plt.title('Model Comparison - Accuracy')
plt.show()

# Optional: ROC curves
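
The ROC comparison left as optional above could look roughly like the following. This is a minimal sketch on synthetic data with two scikit-learn models; `plot_roc_curves`, `demo_models` and the `*_demo` variables are illustrative names, and in the notebook the inputs would be `y_test` and each model's positive-class probabilities.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def plot_roc_curves(y_true, scores_by_model):
    """Draw one ROC curve per model from positive-class scores."""
    fig, ax = plt.subplots(figsize=(8, 6))
    aucs = {}
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        aucs[name] = roc_auc_score(y_true, scores)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")
    # Diagonal reference line: the expected curve of a random classifier
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("Model Comparison - ROC Curves")
    ax.legend(loc="lower right")
    return ax, aucs


# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}
scores = {
    name: model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for name, model in demo_models.items()
}
ax, aucs = plot_roc_curves(y_te, scores)
```

In a notebook the figure renders at the end of the cell; in a script, `plt.show()` would be called afterwards. Scores must be probabilities or decision values, not thresholded class labels.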


## 8. Conclusion
Based on the evaluation metrics, the Neural Network performs competitively with the traditional models (Logistic Regression, KNN, and Naive Bayes). The main challenges were handling missing data and the imbalanced class distribution; because of the imbalance, accuracy should be read together with precision, recall, and ROC AUC. Further improvements could come from more data or feature engineering.