Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.
Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link

Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

Ans. Preprocessing the dataset is an important step in machine learning: it ensures that the data is clean
and in the right format for the model to learn from. Here are the steps to preprocess the dataset:

Handling missing values: Missing values can be handled by either removing the rows with missing values or by imputing
the missing values. For example, we can use the mean, median or mode to impute missing numerical values, while for categorical
values, we can use the most frequent value.

Encoding categorical variables: Categorical variables need to be encoded into numerical values for the model to learn from.
We can use one-hot encoding or label encoding to convert categorical variables into numerical values.

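Scaling the numerical features: Scaling numerical features can improve the performance of distance-based or margin-based
models such as KNN and SVM. We can use standard scaling or min-max scaling to scale the numerical features. Tree-based
models such as random forests split on feature thresholds and are insensitive to feature scale, so scaling is optional here.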
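
Here's a minimal sketch of these three steps as a single Scikit-learn pipeline. The small dataframe is a made-up stand-in for
the real file, and the column names (age, chol, thalach, cp, target) and the numeric/categorical grouping are assumptions
rather than details confirmed by the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for the heart disease data; column names and values are assumptions
df = pd.DataFrame({
    "age": [63, 37, 41, 56, np.nan, 57],
    "chol": [233, 250, np.nan, 236, 354, 192],
    "thalach": [150, 187, 172, 178, 163, 148],
    "cp": [3, 2, 1, 1, 0, np.nan],
    "target": [1, 1, 1, 1, 0, 0],
})

numeric_cols = ["age", "chol", "thalach"]
categorical_cols = ["cp"]

# Numerical features: impute with the median, then standardise
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: impute with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X = df.drop(columns="target")
y = df["target"]
X_processed = preprocessor.fit_transform(X)
print(X_processed.shape)
```

In practice the preprocessor should be fitted on the training set only and then applied to the test set, so that no
information from the test set leaks into the imputation or scaling statistics.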

Q2. Split the dataset into a training set (70%) and a test set (30%).
Ans. Splitting the dataset into a training set and a test set is important to evaluate the performance of the model 
on unseen data. Here are the steps to split the dataset:

Import the necessary libraries, such as Pandas and Scikit-learn.
Load the dataset into a Pandas dataframe.
Split the dataset into a training set and a test set using the train_test_split function from Scikit-learn. 
We can specify the test size and random state to ensure reproducibility.
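
The steps above can be sketched as follows. The dataframe is synthetic (only the row count of 303 comes from the dataset
description), and the column names, random_state=42, and the stratify option are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (303); values are random
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, size=303),
    "thalach": rng.integers(70, 203, size=303),
    "target": rng.integers(0, 2, size=303),
})

X = df.drop(columns="target")
y = df["target"]

# 70% training / 30% test; stratify keeps the class balance similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so that later comparisons between models use the same test set.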

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

Ans. To train a random forest classifier on the training set, we can follow these steps:

Import Pandas and the RandomForestClassifier class from Scikit-learn's ensemble module.
Load the dataset into a Pandas dataframe and split it into a training set and a test set.
Create an instance of the RandomForestClassifier class and specify the hyperparameters, such as the number of trees and maximum depth.
Fit the model on the training set using the fit method.
Predict the labels of the test set using the predict method.
Evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1-score.
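
Here's a minimal sketch of the training step. It uses synthetic data from make_classification in place of the real dataset
(303 rows and 13 predictor columns are assumptions about its shape), and random_state is added only for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumed shape: 303 rows, 13 predictors)
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 100 trees with a maximum depth of 10; all other hyperparameters keep their defaults
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)

# Predict the labels of the test set and report the accuracy
y_pred = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```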

Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.
Ans.  To evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score, we can follow these steps:

Import the necessary libraries, such as Scikit-learn.
Use the test set held out in Q2 (X_test and y_test).
Predict the labels of the test set using the predict method of the trained random forest classifier.
Calculate the accuracy, precision, recall, and F1 score using the accuracy_score, precision_score, recall_score,
and f1_score functions from Scikit-learn.

Here's some example code to do this:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the test set
X_test = ...
y_test = ...

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Calculate the performance metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)

Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

Ans. To use the feature importance scores to identify the top 5 most important features in predicting heart disease
risk and visualise the feature importances using a bar chart, we can follow these steps:

Retrieve the feature importance scores from the trained random forest classifier using the feature_importances_ attribute.
Sort the feature importance scores in descending order.
Identify the top 5 most important features based on their importance scores.
Create a bar chart to visualise the feature importances.

Here's some example code to do this:

import numpy as np
import matplotlib.pyplot as plt

# Retrieve the feature importances
importances = clf.feature_importances_

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]

# Identify the top 5 most important features
top_features = X_train.columns[indices][:5]
top_importances = importances[indices][:5]

# Create a bar chart to visualise the feature importances
plt.bar(top_features, top_importances)
plt.title("Top 5 most important features")
plt.xlabel("Features")
plt.ylabel("Importance score")
plt.show()

Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try different values of the number
of trees, maximum depth, minimum samples split, and minimum samples leaf. Use 5-fold cross-validation to evaluate the performance of
each set of hyperparameters.

Ans. To tune the hyperparameters of the random forest classifier using grid search or random search and evaluate
the performance of each set of hyperparameters using 5-fold cross-validation, we can follow these steps:

Import the necessary libraries, such as Scikit-learn.
Load the dataset into a Pandas dataframe and split it into a training set and a test set.
Define a parameter grid containing the hyperparameters to be tuned and their possible values.
Create an instance of the RandomForestClassifier class.
Create an instance of the GridSearchCV or RandomizedSearchCV class, specifying the estimator, parameter grid, number of 
cross-validation folds, and scoring metric.
Fit the search object on the training set using the fit method.
Retrieve the best set of hyperparameters and the corresponding performance metric from the search object.

Here's some example code to do this using GridSearchCV:

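from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Load the dataset and split it into a training set and a test set
X_train, X_test, y_train, y_test = ...

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Run a 5-fold cross-validated grid search (accuracy as the scoring metric is an assumption)
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)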
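
For the random search alternative, here's a minimal sketch using RandomizedSearchCV. It samples a fixed number of
hyperparameter combinations instead of trying every one. The synthetic data, the sampling ranges, n_iter=5, and the accuracy
scoring metric are all assumptions chosen to keep the example small.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the heart disease data (assumed shape: 303 rows, 13 predictors)
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Distributions (or lists) to sample each hyperparameter from; ranges are illustrative
param_distributions = {
    "n_estimators": randint(50, 201),
    "max_depth": [5, 10, 20, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

# Evaluate 5 sampled combinations with 5-fold cross-validation
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_train, y_train)

print("Best hyperparameters:", random_search.best_params_)
print("Best cross-validation score:", random_search.best_score_)
```

Random search is usually cheaper than grid search when the grid is large, because the number of fits is set by n_iter
rather than by the product of all the value lists.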

Q7. Report the best set of hyperparameters found by the search and the corresponding performance metrics. Compare the performance
of the tuned model with the default model.

Ans. To report the best set of hyperparameters found by the grid search and the corresponding performance metrics, we can
use the best_params_ and best_score_ attributes of the GridSearchCV object.

Here's some example code:
                         
# Print the best hyperparameters and corresponding performance metrics
print("Best hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)
print("Test set score: ", grid_search.best_estimator_.score(X_test, y_test))
print("Test set accuracy: ", accuracy_score(y_test, grid_search.best_estimator_.predict(X_test)))
print("Test set precision: ", precision_score(y_test, grid_search.best_estimator_.predict(X_test)))
print("Test set recall: ", recall_score(y_test, grid_search.best_estimator_.predict(X_test)))
print("Test set F1 score: ", f1_score(y_test, grid_search.best_estimator_.predict(X_test)))

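This will print out the best set of hyperparameters found by the grid search, the corresponding best score, and the
test set performance metrics. Note that best_score_ is the mean score across the 5 cross-validation folds of the
training set, so it is not directly comparable to the test set metrics.

To compare the performance of the tuned model with the default model, we can compute the same metrics for both models
on the same test set. Higher test set metrics for the tuned model suggest that tuning helped, but with a test set of
roughly 90 patients (30% of 303), small differences may simply be noise.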

Alternatively, we can also perform a statistical significance test, such as a paired t-test or McNemar's test, to determine 
if the difference in performance between the two models is statistically significant.
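
Here's a minimal sketch of an exact McNemar's test, which compares two models on the same test set by looking only at the
samples where exactly one of them is correct. The helper name mcnemar_exact and the three label arrays are made up for
illustration. In practice the predictions would come from the default and tuned classifiers.

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar's test on paired predictions (hypothetical helper)."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_correct = pred_a == y_true
    b_correct = pred_b == y_true
    # Discordant pairs: samples where exactly one of the two models is correct
    only_a = int(np.sum(a_correct & ~b_correct))
    only_b = int(np.sum(~a_correct & b_correct))
    n_discordant = only_a + only_b
    if n_discordant == 0:
        return only_a, only_b, 1.0  # the models never disagree on correctness
    # Under the null hypothesis, each discordant pair favours either model with probability 0.5
    p_value = binomtest(only_a, n_discordant, p=0.5).pvalue
    return only_a, only_b, p_value

# Made-up labels and predictions for illustration only
y_true       = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_default = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
pred_tuned   = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

only_default, only_tuned, p_value = mcnemar_exact(y_true, pred_default, pred_tuned)
print("Only default correct:", only_default)
print("Only tuned correct:", only_tuned)
print("p-value:", p_value)
```

A large p-value means the observed difference between the two models could plausibly be due to chance on a test set of
this size.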

Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the decision boundaries on
a scatter plot of two of the most important features. Discuss the insights and limitations of the model for predicting heart disease risk.

Ans. To plot the decision boundaries of the random forest classifier, we first need to select two of the most important 
features identified in the previous step. Let's assume that we selected "thalach" (maximum heart rate achieved) and
"cp" (chest pain type) as the two features.

We can create a scatter plot of these two features and colour the points based on their true class (0 or 1). Then,
we can generate a grid of points covering the range of the two features and use the random forest classifier to predict the
class of each point in the grid. Finally, we can plot the decision boundaries by colouring the points in the grid based on
their predicted class.

Here's the code to generate the scatter plot and decision boundaries:

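import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Select the two features by name (assumes X_train is a Pandas dataframe, as in Q5)
X = X_train[['thalach', 'cp']].to_numpy()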

# Generate a scatter plot of the two features
plt.scatter(X[:, 0], X[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Maximum Heart Rate Achieved')
plt.ylabel('Chest Pain Type')
plt.title('Scatter Plot of Two Most Important Features')
plt.show()

# Generate a grid of points covering the range of the two features
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
xx, yy = np.meshgrid(x1, x2)
X_grid = np.column_stack([xx.ravel(), yy.ravel()])

# Use the random forest classifier to predict the class of each point in the grid
rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf.fit(X, y_train)
y_grid = rf.predict(X_grid)

# Plot the decision regions first, then overlay the training points so they stay visible
plt.contourf(xx, yy, y_grid.reshape(xx.shape), alpha=0.5, cmap='viridis')
plt.scatter(X[:, 0], X[:, 1], c=y_train, cmap='viridis', edgecolors='k')
plt.xlabel('Maximum Heart Rate Achieved')
plt.ylabel('Chest Pain Type')
plt.title('Decision Boundaries of Random Forest Classifier')
plt.show()

The resulting plot should show the scatter plot of the two features with points coloured based on their true class, as well
as the decision boundaries of the random forest classifier.

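Interpreting the decision boundaries of a random forest classifier can be challenging because the forest combines many trees.
However, some general observations hold. Each tree splits on one feature at a time, so the boundaries are axis-aligned and
step-like rather than smooth curves, and the forest as a whole can still separate the classes in a non-linear way. Small
isolated islands of one class around individual training points suggest overfitting, but the plot alone cannot confirm or
rule it out; comparing training and test scores is more reliable. Note also that "cp" is a categorical code, so regions
between its integer values have no clinical meaning, and that this two-feature forest is a simplification of the full model
trained on all features.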

One limitation of the model is that it may not be able to capture all the factors that contribute to heart disease risk, as
there may be other important features that were not included in the analysis. Additionally, the model may not perform well 
on data that is significantly different from the training data, such as data from a different population or collected using
different methods. Finally, the model may not be able to capture the full complexity of the underlying biological processes 
that contribute to heart disease risk, which may limit its ability to provide mechanistic insights.