In [None]:
Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.

Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link

In [None]:
Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

Before training the classifier, the dataset needs to be preprocessed: handle missing values, encode categorical variables, and scale numerical features where necessary. Tree-based models such as random forests do not depend on feature scale, so scaling is optional here and is included for completeness. The steps are:

1. **Load the Dataset:**
   - Download the dataset from the provided link.
   - Read the dataset into a pandas DataFrame.

2. **Handle Missing Values:**
   - Check for any missing values in the dataset.
   - Handle missing values by either imputing them with a suitable value (e.g., mean, median) or removing rows or columns containing missing values.

3. **Encode Categorical Variables:**
   - Identify categorical variables in the dataset (if any).
   - Encode categorical variables using one-hot encoding or label encoding, depending on the nature of the categorical variables.

4. **Scale Numerical Features:**
   - If numerical features are on different scales, consider scaling them to ensure that they have similar ranges.
   - Common scaling techniques include Min-Max scaling or Standardization.

Let's implement these preprocessing steps in Python:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
url = "https://drive.google.com/uc?id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ"
heart_disease_data = pd.read_csv(url)

# Separate features (X) and target variable (y)
X = heart_disease_data.drop(columns=['target'])
y = heart_disease_data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Numeric columns: impute missing values with the mean, then standardize
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Categorical (string-typed) columns: impute with the most frequent value, then one-hot encode.
# Note: this selection is dtype-based, so categories stored as integer codes are treated as numeric.
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline: preprocessing followed by the classifier
# (random_state is fixed so that repeated runs give the same result)
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42))])

# Fit the model
clf.fit(X_train, y_train)

# Evaluate the model
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

This code performs the following steps:
- Loads the dataset.
- Splits the dataset into features (X) and target variable (y).
- Splits the data into training and testing sets.
- Defines preprocessing steps using pipelines and transformers.
- Creates a preprocessing pipeline that handles missing values, encodes categorical variables, and scales numerical features.
- Builds a random forest classifier pipeline.
- Fits the model on the training data.
- Evaluates the model's accuracy on the testing data.

This code provides a basic example of preprocessing and building a random forest classifier for the heart disease prediction task. You may need to adjust the preprocessing steps or model hyperparameters based on the specific characteristics of the dataset and the requirements of the task.
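
Steps 2 and 3 above (checking for missing values and identifying categorical columns) are handled implicitly by the pipeline and never inspected directly. The following is a minimal sketch of that inspection. It uses a small made-up DataFrame rather than the real file, and the column names (`age`, `chol`, `thal`) and values are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the heart disease file
toy = pd.DataFrame({
    "age": [63, 37, 41, np.nan, 57],
    "chol": [233.0, 250.0, np.nan, 236.0, 354.0],
    "thal": ["normal", "fixed", None, "normal", "reversible"],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Separate columns by dtype, similar to the dtype-based selection in the pipeline
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Running the same two checks on `heart_disease_data` shows whether the imputers and the one-hot encoder in the pipeline have anything to do for this particular file.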

In [None]:
Q2. Split the dataset into a training set (70%) and a test set (30%).

To split the dataset into a training set and a test set using Python, we can use the `train_test_split` function from the `sklearn.model_selection` module. Here's how you can do it:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://drive.google.com/uc?id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ"
heart_disease_data = pd.read_csv(url)

# Separate features (X) and target variable (y)
X = heart_disease_data.drop(columns=['target'])
y = heart_disease_data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

This code performs the following steps:
1. Loads the dataset from the provided URL.
2. Separates the features (X) and the target variable (y).
3. Uses the `train_test_split` function to split the data into training and testing sets, with a test size of 30% and a random state of 42.
4. Prints the shapes of the training and testing sets to verify the split.

Adjust the `test_size` parameter in the `train_test_split` function call if you want to change the proportion of the test set. In this example, 70% of the data is used for training, and 30% is used for testing.
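
As a sanity check on the 70/30 split, the sketch below runs `train_test_split` on placeholder data with the same number of rows as the dataset (303). The random features and labels are assumptions for illustration, not the real data. It also passes `stratify`, an optional argument not used in the code above, which keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (303)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# stratify=y_demo keeps the share of each class roughly equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Train rows:", X_tr.shape[0], "Test rows:", X_te.shape[0])
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```

With 303 rows and `test_size=0.3`, the test set receives 91 rows and the training set 212, because the test size is rounded up.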

In [None]:
Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

To train a Random Forest Classifier on the training set using 100 trees and a maximum depth of 10 for each tree, you can use the `RandomForestClassifier` class from the `sklearn.ensemble` module. Here's how you can do it:

from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the classifier on the training set
rf_classifier.fit(X_train, y_train)

This code creates a Random Forest Classifier with 100 trees (`n_estimators=100`) and a maximum depth of 10 for each tree (`max_depth=10`). The `random_state` parameter is set to 42 for reproducibility.

Then, the classifier is trained on the training set using the `fit` method, with the features (`X_train`) and the target variable (`y_train`).

The default values for other hyperparameters will be used, such as the criterion for splitting (`gini`), the minimum samples required to split an internal node (`min_samples_split=2`), the minimum samples required to be at a leaf node (`min_samples_leaf=1`), etc.

After training the classifier, you can use it to make predictions on new data or evaluate its performance on the test set.
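
To confirm that the two requested hyperparameters took effect, the fitted forest can be inspected tree by tree. This is a minimal sketch on synthetic data generated with `make_classification`; the data and the variable names (`forest`, `X_demo`, `y_demo`) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is stored in the `estimators_` list
n_trees = len(forest.estimators_)
depths = [tree.get_depth() for tree in forest.estimators_]

print("Number of trees:", n_trees)
print("Deepest tree:", max(depths))
```

`max_depth` is an upper bound, so individual trees may be shallower than 10 if they run out of samples to split.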

In [None]:
Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

To evaluate the performance of the trained Random Forest Classifier on the test set, we can calculate various performance metrics such as accuracy, precision, recall, and F1 score. We can use the `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` functions from the `sklearn.metrics` module to compute these metrics. Here's how you can do it:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision
precision = precision_score(y_test, y_pred)

# Calculate recall
recall = recall_score(y_test, y_pred)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

This code:
- Makes predictions on the test set using the trained Random Forest Classifier (`rf_classifier.predict(X_test)`).
- Calculates accuracy, precision, recall, and F1 score using the ground truth labels (`y_test`) and the predicted labels (`y_pred`).
- Prints out the computed performance metrics.

These metrics describe different aspects of performance on the test set. With scikit-learn's defaults for binary labels, precision, recall, and F1 are computed for the positive class (label 1):
- Accuracy is the proportion of all predictions that are correct.
- Precision is the proportion of predicted positives that are actually positive.
- Recall is the proportion of actual positives that the model predicts as positive.
- F1 score is the harmonic mean of precision and recall, which balances the two.
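
The metric definitions above can be checked by hand from a confusion matrix. The sketch below uses ten made-up labels and predictions (assumptions for illustration, not model output) and compares the manual formulas with the scikit-learn functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels and predictions for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Manual: ", accuracy_manual, precision_manual, recall_manual, f1_manual)
print("sklearn:", accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Here there are 4 true positives, 2 false positives, 1 false negative, and 3 true negatives, so accuracy is 7/10, precision is 4/6, recall is 4/5, and F1 is 8/11.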

In [None]:
Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

To identify the top 5 most important features in predicting heart disease risk using the feature importance scores from the trained Random Forest Classifier, we can access the `feature_importances_` attribute of the classifier. Then, we can visualize the feature importances using a bar chart. Here's how you can do it:

import matplotlib.pyplot as plt

# Get feature importances
importances = rf_classifier.feature_importances_

# Get the indices of the top 5 most important features
indices = (-importances).argsort()[:5]

# Get the names of the top 5 most important features
top_features = X.columns[indices]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(indices)), importances[indices], align='center')
plt.xticks(range(len(indices)), top_features, rotation=45)
plt.xlabel('Feature')
plt.ylabel('Importance Score')
plt.title('Top 5 Most Important Features in Predicting Heart Disease Risk')
plt.show()

This code:
- Retrieves the feature importances from the trained Random Forest Classifier using the `feature_importances_` attribute.
- Identifies the indices of the top 5 most important features by sorting the feature importances in descending order.
- Gets the names of the top 5 most important features.
- Plots the feature importances using a bar chart, with the names of the top features on the x-axis and their importance scores on the y-axis.

This bar chart visualization helps identify which features are most influential in predicting heart disease risk according to the trained Random Forest Classifier. Features with higher importance scores contribute more to the model's predictions.
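
An alternative way to rank the features, sketched below, is to wrap the importances in a pandas Series indexed by column name and sort it. The example runs on synthetic data, and the feature names (`feature_0` to `feature_12`) are hypothetical placeholders for the real column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dataset, with hypothetical column names
X_arr, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)
feature_names = [f"feature_{i}" for i in range(13)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
top5 = ranked.head(5)
print(top5)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity reduction attributed to that feature.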

In [None]:
Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

To tune the hyperparameters of the Random Forest Classifier using either grid search or random search with cross-validation, we can utilize the `GridSearchCV` or `RandomizedSearchCV` classes from the `sklearn.model_selection` module. Here's how you can perform hyperparameter tuning using grid search with 5-fold cross-validation:

from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees in the forest
    'max_depth': [5, 10, 15],         # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]     # Minimum number of samples required to be at a leaf node
}

# Create the grid search object
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Get the best model
best_rf_classifier = grid_search.best_estimator_

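# Evaluate the best model on the test set
# (stored under a separate name so the Q4 variable `accuracy` is not overwritten)
tuned_accuracy = best_rf_classifier.score(X_test, y_test)
print("Tuned model accuracy:", tuned_accuracy)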

This code:
- Defines a grid of hyperparameters to tune, including `n_estimators`, `max_depth`, `min_samples_split`, and `min_samples_leaf`.
- Creates a `GridSearchCV` object with the Random Forest Classifier, the parameter grid, 5-fold cross-validation (`cv=5`), and accuracy as the scoring metric (`scoring='accuracy'`).
- Performs grid search using the training data.
- Retrieves the best hyperparameters and the best model.
- Evaluates the best model's accuracy on the test set.

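You can adjust the range of values for each hyperparameter in the `param_grid` dictionary based on your requirements. This grid contains 3 × 3 × 3 × 3 = 81 combinations, so 5-fold cross-validation fits 405 models before refitting the best one on the full training set. For larger grids, `RandomizedSearchCV` can be used instead of `GridSearchCV` to sample a fixed number of combinations. The `n_jobs` parameter specifies the number of CPU cores to use for parallelization (-1 for all available cores).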
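
Since randomized search is mentioned as an alternative, here is a minimal sketch of it using the same candidate values as the grid. It runs on synthetic data, and `n_iter=10` is an arbitrary choice for illustration; the variable names (`random_search`, `param_distributions`, `X_demo`, `y_demo`) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Same candidate values as the grid; lists are sampled uniformly
param_distributions = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Evaluate only 10 randomly chosen combinations with 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Best hyperparameters:", random_search.best_params_)
print("Best cross-validated accuracy:", random_search.best_score_)
```

This evaluates 10 combinations instead of all 81, trading some coverage of the search space for a shorter run time.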

In [None]:
Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

To report the best set of hyperparameters found by the hyperparameter search and the corresponding performance metrics, we can print out the best parameters and evaluate the performance of the tuned model. Additionally, we can compare the performance of the tuned model with the default model. Here's how you can do it:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Report the best set of hyperparameters
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Evaluate the best model on the test set
y_pred_tuned = best_rf_classifier.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned)
recall_tuned = recall_score(y_test, y_pred_tuned)
f1_tuned = f1_score(y_test, y_pred_tuned)

# Print performance metrics for the tuned model
print("\nPerformance Metrics for Tuned Model:")
print("Accuracy:", accuracy_tuned)
print("Precision:", precision_tuned)
print("Recall:", recall_tuned)
print("F1 Score:", f1_tuned)

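# Compare with the default model (the Q3 classifier), recomputing its metrics
# from the Q4 predictions `y_pred` so they cannot be confused with variables
# reassigned in later cells
print("\nPerformance Metrics for Default Model:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))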

This code:
- Prints out the best set of hyperparameters found by the grid search (`grid_search.best_params_`).
- Evaluates the performance of the best model on the test set and computes accuracy, precision, recall, and F1 score.
- Prints the performance metrics for the tuned model.
- Compares the performance metrics of the tuned model with those of the default model.

Comparing the two sets of metrics shows whether hyperparameter tuning improved the model on the test set. Treat small differences with caution: the test set contains only about 90 instances, so a change of one or two percentage points can come from a handful of predictions and may not reflect a real improvement. It is also possible for the tuned model to score the same as or lower than the default model on the test set, because the search optimizes cross-validated accuracy on the training data.
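
A side-by-side table can make the comparison easier to read. The sketch below uses a hypothetical helper, `summarize`, together with synthetic data. The "candidate" settings are made up to give the table a second row and are not the result of any search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def summarize(model, X_eval, y_eval):
    """Return the four evaluation metrics for a fitted binary classifier."""
    pred = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, pred),
        "precision": precision_score(y_eval, pred),
        "recall": recall_score(y_eval, pred),
        "f1": f1_score(y_eval, pred),
    }


# Synthetic stand-in for the dataset
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

baseline = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
baseline.fit(X_tr, y_tr)

# Hypothetical alternative settings, chosen only for illustration
candidate = RandomForestClassifier(
    n_estimators=150, max_depth=5, min_samples_leaf=2, random_state=42
)
candidate.fit(X_tr, y_tr)

# One row per model, one column per metric
comparison = pd.DataFrame({
    "baseline": summarize(baseline, X_te, y_te),
    "candidate": summarize(candidate, X_te, y_te),
}).T
print(comparison.round(3))
```

In the notebook, the same helper could be called with `rf_classifier` and `best_rf_classifier` on `X_test` and `y_test`.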

In [None]:
Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

Interpreting the decision boundaries of a Random Forest Classifier can provide insights into how the model separates different classes in the feature space. However, due to the ensemble nature of Random Forests and the high dimensionality of the feature space, visualizing decision boundaries directly is not straightforward. One common approach is to visualize decision boundaries using a scatter plot of two of the most important features and then overlay the regions where the model predicts each class.

Here's how you can plot the decision boundaries on a scatter plot of two of the most important features:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Select the two most important features according to the classifier trained in Q3
top_two = X_train.columns[np.argsort(rf_classifier.feature_importances_)[::-1][:2]]
X_two = X_train[top_two].to_numpy()

# Train a Random Forest Classifier with the best hyperparameters on these two features only,
# so that it can make predictions on a two-dimensional grid
rf_classifier_2d = RandomForestClassifier(**grid_search.best_params_, random_state=42)
rf_classifier_2d.fit(X_two, y_train)

# Build a 300 x 300 mesh covering the range of both features
x_min, x_max = X_two[:, 0].min() - 1, X_two[:, 0].max() + 1
y_min, y_max = X_two[:, 1].min() - 1, X_two[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
Z = rf_classifier_2d.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)

# Plot data points
scatter = plt.scatter(X_two[:, 0], X_two[:, 1], c=y_train, cmap=plt.cm.coolwarm, edgecolors='k')
plt.xlabel(top_two[0])
plt.ylabel(top_two[1])
plt.title('Decision Boundaries of Random Forest Classifier')
plt.colorbar(scatter, label='Class')
plt.show()

This code:
- Selects the two features with the highest importance scores from the classifier trained in Q3.
- Trains a Random Forest Classifier with the best hyperparameters found by the grid search, using only those two features. A model trained on all features cannot make predictions on a two-dimensional grid.
- Predicts the class at every grid point and draws the predicted regions as filled contours.
- Overlays the training points on the plot, where each point represents an instance in the training set, colored by its class label.

Interpreting the decision boundaries and insights of the model:
- Each shaded region is the set of feature values for which the model predicts one class. The decision boundary is the border between regions.
- Because each tree splits on one feature at a time, the boundaries of a random forest are made up of axis-aligned segments. Small isolated regions around individual points can be a sign of overfitting.
- Areas where points of both classes overlap show where these two features alone cannot separate patients with and without heart disease, and where the model is most likely to make errors.
- Limitations: a two-dimensional plot shows only part of the picture, because the full model uses all features and their interactions. The dataset is also small (303 instances), so the boundaries and the reported metrics may change noticeably with a different train/test split.

Decision boundary plots should therefore be read cautiously and treated as a partial view of the model's behavior. Feature importance analysis and other model interpretation methods can provide complementary insights.
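
One possible complementary technique is permutation importance, available as `sklearn.inspection.permutation_importance`. It measures how much the test accuracy drops when the values of a single feature are shuffled. The sketch below runs on synthetic data, and the variable names (`forest`, `result`, `X_demo`, `y_demo`) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out data and record the drop in accuracy
result = permutation_importance(
    forest, X_te, y_te, scoring='accuracy', n_repeats=10, random_state=42
)

# Report the five features whose shuffling hurts accuracy the most
order = result.importances_mean.argsort()[::-1]
for i in order[:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Unlike `feature_importances_`, these scores are computed on held-out data, so they reflect how much the model relies on each feature when predicting unseen instances.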