### Overview
Cardiovascular diseases (CVDs), including heart disease, are the leading cause of death worldwide. Early detection of heart disease is critical for preventing serious health outcomes and improving the quality of life for patients. With the increasing availability of medical data, machine learning models can be used to predict whether a patient is likely to develop heart disease based on certain health indicators.

In this project, you will build a classification model that predicts whether an individual has heart disease. The dataset includes health and demographic factors such as age, blood pressure, cholesterol levels, and lifestyle habits (e.g., smoking and alcohol consumption). The goal is to train a model that uses these features to identify which individuals have heart disease.


### Problem Statement
You are provided with a dataset that contains health-related information about individuals. Your task is to develop a machine learning model that can predict the presence of heart disease based on the provided features. The target variable in the dataset is "disease," which indicates whether a person has heart disease (1) or not (0).

Refer to **Data information.pdf** for details on the variables before importing the data into the notebook.

In [None]:
# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import warnings

warnings.filterwarnings("ignore")

In [None]:
# Load the dataset
df = pd.read_csv('data_file.csv')

In [None]:
# Display the first few rows of the dataset
df.head()

### Data Preprocessing

In [None]:
# Drop unnecessary columns such as 'id' and 'date'
# (errors='ignore' skips any column that is not present in the file)
df.drop(['id', 'date'], axis=1, inplace=True, errors='ignore')

In [None]:
# Checking for null values
df.isnull().sum()
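
If the null check reports missing values, they need to be handled before modelling. The sketch below shows one possible approach and is not part of the original notebook: numeric gaps are filled with the column median and categorical gaps with the most frequent value. The toy DataFrame and the helper name `fill_missing` are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs set to the median and other NaNs to the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if filled[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [50.0, np.nan, 60.0, 70.0],
    "occupation": ["Nurse", "Teacher", None, "Nurse"],
})
toy_filled = fill_missing(toy)
print(toy_filled.isnull().sum())
```

For a stricter workflow, the fill values would be computed on the training split only and then applied to the test split.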

In [None]:
# Convert categorical variables to numerical ones.
# Use a separate encoder per column so each fitted mapping can be reused later.
label_encoders = {}
for col in ['country', 'occupation']:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
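
Label encoding assigns each category an arbitrary integer, which logistic regression and SVMs may read as an ordering. A common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical toy frame; the category values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real columns may hold different categories
toy = pd.DataFrame({
    "age": [45, 61, 38],
    "country": ["India", "Brazil", "India"],
    "occupation": ["Teacher", "Driver", "Nurse"],
})

# One indicator column per category; drop_first removes one redundant column per feature
encoded = pd.get_dummies(
    toy, columns=["country", "occupation"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

One-hot encoding increases the number of columns, so it suits features with a modest number of distinct categories.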

In [None]:
# Define features (X) and target variable (y)
X = df.drop('disease', axis=1)
y = df['disease']

In [None]:
# Train-test split (stratify=y keeps the class proportions equal in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Standardize features (zero mean, unit variance); fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
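
Fitting the scaler on the training set only, as above, keeps test-set statistics out of training. Wrapping the scaler and the model in a scikit-learn `Pipeline` extends the same guarantee to cross-validation. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` rather than the project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42)),
])

# Within each fold, the scaler is re-fit on that fold's training part only
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")

pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```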

### Exploratory Data Analysis (EDA)

In [None]:
# Visualize the distribution of key features
plt.figure(figsize=(12,6))
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
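
Two quick numeric checks complement the plots: the share of each class in the target, and how strongly each feature correlates with it. The sketch below runs on a hypothetical toy frame; only the column name `disease` comes from the problem statement, and the other values are invented.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [40, 55, 63, 47, 70, 35],
    "cholesterol": [180, 240, 260, 200, 280, 170],
    "disease": [0, 1, 1, 0, 1, 0],
})

# Share of each class in the target; a strong imbalance makes accuracy misleading
class_share = toy["disease"].value_counts(normalize=True).sort_index()
print(class_share)

# Correlation of each numeric feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["disease"]
    .drop("disease")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```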


### Model Building

#### 1. Logistic Regression

In [None]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_log = log_reg.predict(X_test)

In [None]:
# Accuracy score for Logistic Regression
accuracy_log = accuracy_score(y_test, y_pred_log)
print(f"Logistic Regression Accuracy: {accuracy_log * 100:.2f}%")

#### 2. Decision Tree

In [None]:
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_tree = tree.predict(X_test)

In [None]:
# Accuracy score for Decision Tree
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print(f"Decision Tree Accuracy: {accuracy_tree * 100:.2f}%")


#### 3. Support Vector Machine (SVM)

In [None]:
svm = SVC()
svm.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_svm = svm.predict(X_test)

In [None]:
# Accuracy score for SVM
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {accuracy_svm * 100:.2f}%")


### Model Evaluation

In [None]:
# Define a helper function that plots the confusion matrix and prints the classification report

def evaluate_model(y_test, y_pred, model_name):
    # Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,4))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{model_name} - Confusion Matrix')
    plt.show()

    # Classification Report
    print(f"{model_name} - Classification Report:")
    print(classification_report(y_test, y_pred))
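
To make the classification report easier to read, the sketch below derives precision, recall, and F1 for the positive class directly from the four confusion-matrix counts. The label arrays are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical labels: 1 = heart disease, 0 = no heart disease
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a disease-screening setting, recall on the positive class is often the figure to watch, because a false negative means a missed case.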

In [None]:
# Evaluate each model

evaluate_model(y_test, y_pred_log, "Logistic Regression")
evaluate_model(y_test, y_pred_tree, "Decision Tree")
evaluate_model(y_test, y_pred_svm, "SVM")
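
The notebook imports `roc_auc_score` and `roc_curve` but does not use them. The sketch below shows how they could be applied to predicted probabilities. It is an illustration on synthetic data from `make_classification`, not a result for the project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)

# ROC metrics need scores, not hard labels: take the positive-class probability
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
fpr, tpr, thresholds = roc_curve(y_te, proba)
print(f"ROC AUC: {auc:.3f}")
```

Note that `SVC` exposes `predict_proba` only when constructed with `probability=True`; otherwise the output of `decision_function` can be passed to `roc_auc_score` instead.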

### Hyperparameter Tuning with GridSearchCV

#### 1. Logistic Regression

In [None]:
param_grid_log = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']}
# A higher max_iter gives the 'saga' solver room to converge
grid_log_reg = GridSearchCV(LogisticRegression(random_state=42, max_iter=1000), param_grid_log, cv=5, scoring='accuracy')
grid_log_reg.fit(X_train, y_train)

print(f"Best Logistic Regression Parameters: {grid_log_reg.best_params_}")
print(f"Best Logistic Regression Accuracy: {grid_log_reg.best_score_ * 100:.2f}%")


#### 2. Decision Tree

In [None]:
param_grid_tree = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_tree = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_tree, cv=5, scoring='accuracy')
grid_tree.fit(X_train, y_train)

print(f"Best Decision Tree Parameters: {grid_tree.best_params_}")
print(f"Best Decision Tree Accuracy: {grid_tree.best_score_ * 100:.2f}%")


#### 3. SVM

In [None]:
# SVM Hyperparameter Tuning
param_grid_svm = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_svm = GridSearchCV(SVC(random_state=42), param_grid_svm, cv=5, scoring='accuracy')
grid_svm.fit(X_train, y_train)

print(f"Best SVM Parameters: {grid_svm.best_params_}")
print(f"Best SVM Accuracy: {grid_svm.best_score_ * 100:.2f}%")

# This cell can take a long time to execute. If it takes too long, interrupt the kernel and re-execute it.
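
The `best_score_` values above are cross-validated accuracies on the training data. A final check on the held-out test set uses `best_estimator_`, which `GridSearchCV` refits on the full training set by default. The sketch below demonstrates this on synthetic data with a deliberately small, hypothetical parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Small illustrative grid; not the grid used in the notebook
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

# best_estimator_ has already been refit on all of X_tr (refit=True is the default)
best_model = grid.best_estimator_
test_acc = accuracy_score(y_te, best_model.predict(X_te))
print(f"Best params: {grid.best_params_}")
print(f"CV accuracy: {grid.best_score_:.3f}, test accuracy: {test_acc:.3f}")
```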