### Predicting Heart Disease
In this example, we build and evaluate several machine learning classifiers that predict heart disease from diagnostic measurements, using the Cleveland subset of the Heart Disease dataset from the UCI Machine Learning Repository.


##### Objective:
Predict whether a patient has heart disease based on diagnostic measurements.
##### Dataset:
Heart Disease dataset (Cleveland subset, file processed.cleveland.data) from the UCI Machine Learning Repository.

**Data Description**
The dataset contains 13 diagnostic measurements per patient, plus a target variable indicating the presence or absence of heart disease.

##### Features: #####

    1. age: Age in years.
    2. sex: Sex (1 = male; 0 = female).
    3. cp: Chest pain type (1-4 in processed.cleveland.data).
    4. trestbps: Resting blood pressure (in mm Hg on admission to the hospital).
    5. chol: Serum cholesterol in mg/dl.
    6. fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false).
    7. restecg: Resting electrocardiographic results (0-2).
    8. thalach: Maximum heart rate achieved.
    9. exang: Exercise-induced angina (1 = yes; 0 = no).
    10. oldpeak: ST depression induced by exercise relative to rest.
    11. slope: The slope of the peak exercise ST segment (1-3 in processed.cleveland.data).
    12. ca: Number of major vessels (0-3) colored by fluoroscopy.
    13. thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect).
    14. target: Diagnosis of heart disease (0 = absence; 1-4 = presence in the raw file, commonly collapsed to 1 = presence).
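
Several of the features above (for example cp and restecg) are category codes rather than quantities. Tree-based models can usually work with the codes directly, but linear and distance-based models may benefit from one-hot encoding. The following is a minimal sketch of that idea on a few made-up values; the `toy` frame and the choice of columns are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up values for illustration only (not real patient records)
toy = pd.DataFrame({
    'age': [60, 45, 52],
    'cp': [1, 2, 3],        # chest pain type codes
    'restecg': [0, 2, 0],   # resting ECG result codes
})

# Columns treated as categorical in this sketch (an assumption)
categorical = ['cp', 'restecg']

# Replace each categorical column with one 0/1 indicator column per code
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

print(encoded.columns.tolist())
```

Each code becomes its own indicator column, so a model no longer treats, say, chest pain type 3 as "three times" type 1.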

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Define column names for the dataset (the raw file has no header row)
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# Load the dataset; the raw file marks missing values with '?'
data = pd.read_csv('processed.cleveland.data', header=None, names=column_names, na_values='?')

# Display the first few rows of the dataset
print(data.head())

# processed.cleveland.data contains only Cleveland patients, so no location filter is needed.
# Drop the few rows with missing values so every model can be trained on the same data.
cleveland_data = data.dropna().copy()

# The raw target ranges from 0 (no disease) to 4; collapse it to 0 = absence, 1 = presence
cleveland_data['target'] = (cleveland_data['target'] > 0).astype(int)

# Display the first few rows of the prepared dataset
print(cleveland_data.head())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Check for missing values
print(cleveland_data.isnull().sum())

# Summary statistics
print(cleveland_data.describe())

# Correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(cleveland_data.corr(), annot=True, cmap='coolwarm')
plt.show()

# Pairplot
sns.pairplot(cleveland_data)
plt.show()

Model Building and Training

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Split the data into training and testing sets
X = cleveland_data.drop('target', axis=1)
y = cleveland_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model (a higher max_iter helps the solver converge on unscaled features)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Train a Decision Tree model with GridSearchCV
param_grid_dt = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10]}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=3)
grid_dt.fit(X_train, y_train)
dt = grid_dt.best_estimator_

# Train a Random Forest model (fixed random_state for reproducible results)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Train a Gradient Boosting model
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

# Train a Bagging Classifier
bagging = BaggingClassifier(random_state=42)
bagging.fit(X_train, y_train)

# Train a Support Vector Classifier with GridSearchCV
param_grid_svc = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01]}
grid_svc = GridSearchCV(SVC(), param_grid_svc, cv=3)
grid_svc.fit(X_train, y_train)
svc = grid_svc.best_estimator_

# Train a K-Nearest Neighbors model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
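
The SVC and K-Nearest Neighbors models above are sensitive to feature scale, and the features here span very different ranges (for example chol versus oldpeak). One common option is to standardize features inside a Pipeline so that the scaler is refit on each cross-validation fold. The sketch below is not part of the original analysis; it uses synthetic data from `make_classification` as a stand-in, and the names `X_demo`, `pipe`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 13 features (not the heart disease data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=42
)
X_demo[:, 0] *= 100  # exaggerate one feature's scale to mimic mixed units

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling happens inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [1, 0.1, 0.01]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f'Test accuracy: {test_accuracy:.2f}')
```

The same pattern would apply to K-Nearest Neighbors by swapping the final pipeline step.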

Results and Evaluation

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluate the models
models = {
    'Logistic Regression': lr,
    'Decision Tree': dt,
    'Random Forest': rf,
    'Gradient Boosting': gb,
    'Bagging': bagging,
    'Support Vector Classifier': svc,
    'K-Nearest Neighbors': knn
}

for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f'{name} - Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1-Score: {f1:.2f}')
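
To make the four metrics above concrete, the sketch below derives them by hand from confusion-matrix counts and checks the result against scikit-learn. The labels are made up for illustration and are not model output from this notebook.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration (1 = disease present, 0 = absent)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of predicted positives, how many are correct
recall_manual = tp / (tp + fn)      # of actual positives, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f'TP={tp}, FP={fp}, FN={fn}, TN={tn}')
print(f'Accuracy: {accuracy_manual:.2f}, Precision: {precision_manual:.2f}, '
      f'Recall: {recall_manual:.2f}, F1-Score: {f1_manual:.2f}')

# The manual values agree with scikit-learn's implementations
assert abs(accuracy_manual - accuracy_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(precision_manual - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall_manual - recall_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(f1_manual - f1_score(y_true_demo, y_pred_demo)) < 1e-12
```

In a screening setting, recall is often the metric to watch, since a false negative means a patient with disease is missed.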

In [None]:
# Display feature importances for the Random Forest model, most important first
feature_importances = pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_})
feature_importances = feature_importances.sort_values('importance', ascending=False)
print(feature_importances)

# Heat map for Decision Tree
results_dt = pd.DataFrame(grid_dt.cv_results_)
scores_dt = results_dt.pivot(index="param_max_depth", columns="param_min_samples_split", values="mean_test_score")
plt.figure(figsize=(8, 6))
sns.heatmap(scores_dt, annot=True, cmap='viridis')
plt.title('Decision Tree Grid Search Scores')
plt.show()

# Heat map for Support Vector Classifier
results_svc = pd.DataFrame(grid_svc.cv_results_)
scores_svc = results_svc.pivot(index="param_C", columns="param_gamma", values="mean_test_score")
plt.figure(figsize=(8, 6))
sns.heatmap(scores_svc, annot=True, cmap='viridis')
plt.title('Support Vector Classifier Grid Search Scores')
plt.show()