Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.

Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link




In [None]:
!pip install pandas scikit-learn matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

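# Load the dataset.
# The Drive link opens a viewer page rather than the raw CSV, so it cannot be passed
# to pd.read_csv directly. Download the file and save it as 'dataset.csv' beside this notebook.
data_url = 'https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing'
data_path = 'dataset.csv'
df = pd.read_csv(data_path)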

# Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the numerical features if necessary.


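# Handle missing values by filling each numeric column with its mean.
# numeric_only=True avoids an error if any column is non-numeric.
# No scaling is applied: tree-based models split on thresholds, so feature scale does not matter.
df = df.fillna(df.mean(numeric_only=True))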

# Encode categorical variables (one-hot encoding for 'cp' column)
df = pd.get_dummies(df, columns=['cp'], drop_first=True)

# Splitting the dataset into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Q2. Split the dataset into a training set (70%) and a test set (30%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each tree. Use the default values for other hyperparameters.

rf_model_default = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model_default.fit(X_train, y_train)

# Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.
y_pred = rf_model_default.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Performance Metrics for Default Random Forest:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart disease risk. Visualise the feature importances using a bar chart.
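feature_importances = rf_model_default.feature_importances_
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})
top_5_features = importance_df.nlargest(5, 'Importance')

print("\nTop 5 most important features:")
print(top_5_features)

# Bar chart of all feature importances, largest first
sns.barplot(data=importance_df.sort_values('Importance', ascending=False), x='Importance', y='Feature')
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()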

# Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try different values of the number of trees, maximum depth, minimum samples split, and minimum samples leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.


# 3 * 4 * 3 * 3 = 108 parameter combinations, each fitted on 5 folds (540 fits in total)
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_model_tuned = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf_model_tuned, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Q7. Report the best set of hyperparameters found by the search and the corresponding performance metrics. Compare the performance of the tuned model with the default model.

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
best_accuracy = grid_search.best_score_

print("\nBest hyperparameters found by grid search:")
print(best_params)
print("Best mean cross-validated accuracy:", best_accuracy)

# Evaluate the performance of the best model on the test set
y_pred_tuned = best_model.predict(X_test)

accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned)
recall_tuned = recall_score(y_test, y_pred_tuned)
f1_tuned = f1_score(y_test, y_pred_tuned)

print("\nPerformance Metrics for Tuned Random Forest:")
print("Accuracy:", accuracy_tuned)
print("Precision:", precision_tuned)
print("Recall:", recall_tuned)
print("F1 Score:", f1_tuned)

# Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the decision boundaries on a scatter plot of two of the most important features. Discuss the insights and limitations of the model for predicting heart disease risk.
# Scatter plot of the two most important features, coloured by target.
# Note: this shows how the classes overlap; it does not draw the fitted model's decision regions.
sns.scatterplot(data=df, x=top_5_features.iloc[0, 0], y=top_5_features.iloc[1, 0], hue='target', palette='Set1')
plt.xlabel(top_5_features.iloc[0, 0])
plt.ylabel(top_5_features.iloc[1, 0])
plt.title("Two Most Important Features, Coloured by Target")
plt.legend(title='Target', labels=['No Heart Disease', 'Heart Disease'])
plt.show()






Performance Metrics for Default Random Forest:
Accuracy: 0.8241758241758241
Precision: 0.8269230769230769
Recall: 0.86
F1 Score: 0.8431372549019608
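
The four printed values can be cross-checked by hand. The notebook never prints a confusion matrix, so the counts below are an inference: they were back-solved from the printed metrics, assuming the 30% test split of 303 rows contains 91 rows. A minimal sketch:

```python
import math

# Inferred counts (assumption: not printed anywhere in the notebook output).
tp, fp, fn, tn = 43, 9, 7, 32
n_test = tp + fp + fn + tn  # 91 rows, the 30% test split of 303

accuracy = (tp + tn) / n_test          # share of all predictions that are correct
precision = tp / (tp + fp)             # share of predicted positives that are real
recall = tp / (tp + fn)                # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

Under these counts the four values agree with the printed metrics to floating-point rounding. Recall is higher than precision here, meaning the default model misses fewer true cases (7) than it raises false alarms (9).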

Top 5 most important features:
    Feature  Importance
10       ca    0.131861
6   thalach    0.122198
11     thal    0.119966
8   oldpeak    0.104684
0       age    0.099240
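
Q8 asks for decision boundaries, which can only be drawn in two dimensions. One common approach is to refit a forest on two features alone and shade its predictions over a dense grid. The sketch below uses synthetic stand-in data from `make_classification`, because the dataset file is not available here. The names `X2`, `y2`, `rf_2d` and the axis labels are placeholders, not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): two informative numeric features, binary target.
# With the real data, use e.g. X2 = df[['thalach', 'oldpeak']].to_numpy()
# and y2 = df['target'].to_numpy().
X2, y2 = make_classification(
    n_samples=303, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)

# Refit on the two features only, with the same tree count and depth as the default model.
rf_2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_2d.fit(X2, y2)

# Predict the class at every point of a dense grid covering the data range.
pad = 0.5
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - pad, X2[:, 0].max() + pad, 200),
    np.linspace(X2[:, 1].min() - pad, X2[:, 1].max() + pad, 200),
)
grid_pred = rf_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions and overlay the observed points.
fig, ax = plt.subplots(figsize=(6, 5))
ax.contourf(xx, yy, grid_pred, alpha=0.3, cmap="coolwarm")
ax.scatter(X2[:, 0], X2[:, 1], c=y2, cmap="coolwarm", edgecolor="k", s=20)
ax.set_xlabel("feature A (placeholder)")
ax.set_ylabel("feature B (placeholder)")
ax.set_title("Decision regions of a two-feature random forest")
plt.show()
```

The shaded regions belong to the two-feature model, not to the forest trained on all features, so they illustrate how those two variables separate the classes rather than how the full model decides. Random forest regions are built from axis-aligned splits, so expect blocky, rectangular boundaries, often with small isolated pockets around individual training points.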


                      -------------------------------------------------------------------