Building a Random Forest Classifier for predicting the risk of heart disease requires the following steps:

1. Import the necessary libraries.
2. Load the dataset.
3. Preprocess the data (handle missing values, encode categorical variables, etc.).
4. Split the dataset into training and testing sets.
5. Create and train the Random Forest Classifier.
6. Evaluate the model on the testing set.
7. Fine-tune the hyperparameters if needed.

The dataset itself is not included here, so the code below is a general outline that assumes a CSV file with a binary `target` column:

```python
# Step 1: Import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 2: Load the dataset
data = pd.read_csv("heart_disease_dataset.csv")

# Step 3: Preprocess the data (if needed)

# Separate the features (X) from the target (y)
X = data.drop("target", axis=1)
y = data["target"]

# Step 4: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Create and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Step 6: Evaluate the model on the testing set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report and confusion matrix
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```

Please note that you need to replace "heart_disease_dataset.csv" with the actual filename of your dataset. Also, make sure to preprocess the data according to the characteristics of your dataset, including handling missing values and encoding categorical variables if required.

Remember to fine-tune the hyperparameters of the Random Forest Classifier (e.g., `n_estimators`, `max_depth`, etc.) based on cross-validation to optimize the model's performance on your specific dataset. Additionally, it's crucial to interpret the results, analyze feature importance, and ensure the model's generalizability before deploying it in a real-world scenario.
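The point about checking generalizability with cross-validation can be sketched as follows. This is a minimal illustration rather than a result on the real data: `make_classification` generates a synthetic stand-in for the heart disease table, and the sample and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart disease data (assumption: 300 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy; the spread across folds hints at
# how stable the performance estimate is
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores)
print("Mean accuracy: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

A large gap between the fold scores suggests that a single train/test split may give an overly optimistic or pessimistic picture.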




Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.


ANS-1


To preprocess the dataset, we'll perform the following steps:

1. Handle Missing Values: We'll check for missing values in the dataset and handle them appropriately, either by imputing the missing values or removing the rows/columns with missing data.

2. Encode Categorical Variables: If the dataset contains categorical variables, we'll encode them into numerical format so that the Random Forest Classifier can process them.

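3. Scale Numerical Features: Tree-based models such as Random Forests split on thresholds and are largely insensitive to feature scale, so scaling is optional for this classifier. We include it because the question asks for it and because it matters if the same features are later fed to scale-sensitive models (e.g., logistic regression or k-NN).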

Let's assume the dataset has already been loaded as "data". We'll now perform the preprocessing steps:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Step 1: Handle Missing Values
# Check for missing values in each column
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)

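# If there are missing values, impute or remove them as needed
# For example, mean imputation on the numeric columns only
# (numeric_only=True avoids an error when non-numeric columns are present):
data.fillna(data.mean(numeric_only=True), inplace=True)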

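# Step 2: Encode Categorical Variables
# Assuming the categorical variable is in the 'sex' column and has exactly two categories
# We'll use OneHotEncoder with drop='first' to encode 'sex' into a single binary column
# (scikit-learn >= 1.2 uses `sparse_output`; older versions used `sparse`)
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_sex = encoder.fit_transform(data[['sex']])
# Reuse data's index so the concat below aligns row by row
encoded_sex_df = pd.DataFrame(encoded_sex, columns=['sex_encoded'], index=data.index)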
data = pd.concat([data, encoded_sex_df], axis=1)
data.drop('sex', axis=1, inplace=True)

# Step 3: Scale Numerical Features
# Assuming the numerical features are in columns 'age', 'resting_blood_pressure', etc.
numerical_features = ['age', 'resting_blood_pressure', 'serum_cholesterol', 'maximum_heart_rate_achieved']

# Create a column transformer to scale the numerical features
scaler = StandardScaler()
preprocessor = ColumnTransformer(transformers=[('num', scaler, numerical_features)], remainder='passthrough')

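# Apply the preprocessor to scale the numerical features
scaled_data = preprocessor.fit_transform(data)

# ColumnTransformer puts the scaled columns first, followed by the passthrough columns,
# so rebuild the column names in that order before converting back to a DataFrame
passthrough_features = [col for col in data.columns if col not in numerical_features]
data = pd.DataFrame(scaled_data, columns=numerical_features + passthrough_features, index=data.index)

# The transformer returns floats, so restore the integer class labels
data["target"] = data["target"].astype(int)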

# Split the dataset into features (X) and target (y)
X = data.drop("target", axis=1)
y = data["target"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

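Please make sure to adapt the code to your specific dataset and column names. The preprocessing steps may vary depending on the characteristics of your data: if the dataset has no missing values, or if you only plan to use tree-based models, you may skip the corresponding imputation or scaling steps. Note also that fitting the imputation and scaling on the full dataset before splitting lets test-set statistics influence training; for a stricter evaluation, split first and fit the preprocessing on the training set only.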
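One way to keep preprocessing statistics from leaking across the split is to wrap the steps in a `Pipeline`, so that imputation, encoding, and scaling are fitted on the training rows only. The sketch below is illustrative: the toy DataFrame, its column names, and its values are all made up, and `SimpleImputer` is swapped in for the `fillna` call used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are made up for illustration
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 80, size=n).astype(float),
    "serum_cholesterol": rng.normal(240, 40, size=n),
    "sex": rng.choice(["male", "female"], size=n),
    "target": rng.integers(0, 2, size=n),
})
toy.loc[::25, "serum_cholesterol"] = np.nan  # inject a few missing values

numeric_cols = ["age", "serum_cholesterol"]
categorical_cols = ["sex"]

# Numeric columns: mean imputation followed by standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_toy = toy.drop(columns="target")
y_toy = toy["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Imputation means, scaling statistics, and category levels are learned from X_tr only
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
print("Test accuracy on toy data:", model.score(X_te, y_te))
```

Because the toy labels are random, the accuracy itself is meaningless; the point is only that `fit` sees the training rows and `predict` reuses the statistics learned there.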




Q2. Split the dataset into a training set (70%) and a test set (30%).



ANS-2


To split the dataset into a training set and a test set, we can use the `train_test_split` function from the `sklearn.model_selection` module. We'll pass the features (X) and the target variable (y) along with the test_size parameter to determine the size of the test set.

Let's split the dataset into a training set (70%) and a test set (30%):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assuming you have already preprocessed the dataset and obtained X and y

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Verify the shapes of the training and test sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
```

The `test_size` parameter is set to 0.3, which means that 30% of the data will be used for testing, and the remaining 70% will be used for training the model. The `random_state` parameter is set to 42 for reproducibility, ensuring that the same split is obtained when running the code multiple times. Adjust the `test_size` value according to your preferences and the size of your dataset.
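For classification problems with uneven class counts, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch, using hypothetical labels (80 negatives, 20 positives) rather than the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives, 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 80/20 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Positive rate overall:", y_demo.mean())
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```

Without stratification, a small test set can end up with noticeably more or fewer positive cases than the training set, which distorts precision and recall.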





Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.



ANS-3



To train a Random Forest Classifier on the training set with 100 trees and a maximum depth of 10 for each tree, we can use the `RandomForestClassifier` from the `sklearn.ensemble` module. We'll set the `n_estimators` and `max_depth` hyperparameters accordingly.

```python
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest Classifier with 100 trees and max depth of 10
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the classifier on the training set
rf_classifier.fit(X_train, y_train)
```

Here, we set the `n_estimators` hyperparameter to 100, which means the Random Forest will consist of 100 decision trees. We also set the `max_depth` hyperparameter to 10, which limits the maximum depth of each decision tree to 10 levels. Setting a maximum depth can help prevent overfitting and improve the generalization of the model.

The `random_state` parameter is set to 42 for reproducibility: it fixes the bootstrap samples and the random feature subsets drawn for each tree, so repeated runs produce the same forest. You can adjust `n_estimators`, `max_depth`, and other hyperparameters based on your specific problem and dataset. Because these settings affect the model's performance, it is good practice to tune them with a systematic method such as grid search or random search.
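The effect of `n_estimators` and `max_depth` can be checked directly on a fitted forest, since each tree is available through `estimators_`. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the heart disease dataset) and compares a depth-capped forest with an uncapped one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: 500 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=13, n_informative=6, random_state=42
)

capped = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
capped.fit(X_demo, y_demo)
uncapped = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
uncapped.fit(X_demo, y_demo)

# Each fitted tree reports its own depth
capped_depths = [tree.get_depth() for tree in capped.estimators_]
uncapped_depths = [tree.get_depth() for tree in uncapped.estimators_]

print("Number of trees:", len(capped.estimators_))
print("Deepest tree with max_depth=10:", max(capped_depths))
print("Deepest tree with max_depth=None:", max(uncapped_depths))
```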





Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.



ANS-4



To evaluate the performance of the Random Forest Classifier on the test set, we can use the following metrics: accuracy, precision, recall, and F1 score. We'll use the predictions made by the model on the test set and compare them to the true labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the target labels on the test set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Calculate recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)
```

Each of these metrics provides different insights into the performance of the classifier:

- **Accuracy:** The proportion of correct predictions among all predictions made by the model. It is a general measure of how well the classifier is performing overall.

- **Precision:** The proportion of true positive predictions (correctly predicted positive instances) among all positive predictions made by the model. It indicates how many of the predicted positive instances are actually positive.

- **Recall:** Also known as sensitivity or true positive rate, it is the proportion of true positive predictions among all actual positive instances. It measures the ability of the model to correctly identify positive instances.

- **F1 Score:** The harmonic mean of precision and recall. It provides a balance between precision and recall and is particularly useful when the class distribution is imbalanced.

Remember that the interpretation of these metrics depends on the specific problem and on the relative cost of false positives and false negatives in your application. For example, in medical diagnostics, recall (sensitivity) is often more critical than precision (positive predictive value): missing a true case is usually costlier than a false alarm, so a model may be tuned to catch as many positive cases as possible even if that produces more false positives.
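To make the definitions concrete, the metrics can be computed by hand from the four confusion-matrix counts. The labels below are hypothetical and chosen only to keep the arithmetic simple. The sketch also computes specificity (the true negative rate), which is a separate quantity from precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions, chosen only to make the arithmetic easy to follow
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()  # tn=5, fp=1, fn=1, tp=3

precision = tp / (tp + fp)    # 3 / 4 = 0.75
recall = tp / (tp + fn)       # 3 / 4 = 0.75
specificity = tn / (tn + fp)  # 5 / 6, about 0.833
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print("Precision:", precision, "(sklearn:", precision_score(y_true, y_hat), ")")
print("Recall:", recall, "(sklearn:", recall_score(y_true, y_hat), ")")
print("Specificity:", specificity)
print("F1 Score:", f1, "(sklearn:", f1_score(y_true, y_hat), ")")
```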





Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.



ANS-5

To identify the top 5 most important features in predicting heart disease risk using the Random Forest Classifier, we can access the feature importances after training the model. Then, we'll visualize the feature importances using a bar chart.

```python
import matplotlib.pyplot as plt

# Get feature importances from the trained model
importances = rf_classifier.feature_importances_

# Create a DataFrame with feature names and their importances
feature_importances_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': importances})

# Sort the features based on their importances in descending order
sorted_feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)

# Get the top 5 most important features
top_5_features = sorted_feature_importances_df.head(5)

# Visualize the feature importances using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(top_5_features['Feature'], top_5_features['Importance'])
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Top 5 Most Important Features for Heart Disease Prediction')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

In this code, we first retrieve the feature importances from the trained Random Forest model. Then, we create a DataFrame containing the feature names and their corresponding importances. After sorting the features based on their importances in descending order, we select the top 5 most important features. Finally, we visualize the top 5 features using a bar chart to highlight their relative importance in predicting heart disease risk.

The bar chart will show the importance of each feature, with higher bars indicating more significant contributions to the model's prediction. The top 5 features will be displayed on the x-axis, and their corresponding importances will be represented by the height of the bars on the y-axis.
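The `feature_importances_` attribute is impurity-based and is computed from the training data, so it can be worth cross-checking against permutation importance measured on held-out data. The sketch below is one way to do that with `sklearn.inspection.permutation_importance`; the data is synthetic and the feature names (`feature_0`, `feature_1`, ...) are placeholders, not columns from the heart disease dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names (assumption)
X_arr, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in the test set and measure the resulting drop in accuracy
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)

comparison = pd.DataFrame(
    {
        "impurity_importance": rf_demo.feature_importances_,
        "permutation_importance": result.importances_mean,
    },
    index=X_demo.columns,
).sort_values("permutation_importance", ascending=False)
print(comparison.head(5))
```

If the two rankings disagree strongly, the impurity-based ranking alone is probably not a reliable basis for choosing the "top 5" features.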




Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.



ANS-6



To tune the hyperparameters of the Random Forest Classifier using Grid Search and perform 5-fold cross-validation to evaluate the performance, we can use the `GridSearchCV` from the `sklearn.model_selection` module. Grid Search allows us to specify different values for the hyperparameters and exhaustively try all combinations to find the best set of hyperparameters.

Here's how to perform hyperparameter tuning using Grid Search:

```python
from sklearn.model_selection import GridSearchCV

# Create the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define hyperparameters and their possible values
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, n_jobs=-1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found by the grid search
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Get the best model from the grid search
best_rf_classifier = grid_search.best_estimator_

# Evaluate the performance of the best model on the test set
y_pred = best_rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Calculate recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)
```

In this code, we first define a dictionary `param_grid` that contains different values for the hyperparameters `n_estimators`, `max_depth`, `min_samples_split`, and `min_samples_leaf`. The `GridSearchCV` will perform 5-fold cross-validation for each combination of hyperparameters and evaluate the performance of the model.

After the grid search, we obtain the best hyperparameters found by `GridSearchCV` and create the best model using those hyperparameters. Finally, we evaluate the performance of the best model on the test set and calculate the accuracy, precision, recall, and F1 score.
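The grid above contains 3 × 4 × 3 × 3 = 108 combinations, which means 540 model fits with 5-fold cross-validation. Random search, the other option named in the question, samples a fixed number of combinations instead. The following is a minimal sketch with `RandomizedSearchCV`; the data is synthetic, and the sampling ranges and `n_iter=6` are illustrative assumptions chosen to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: 200 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=42
)

# Distributions to sample from; randint(a, b) draws integers in [a, b)
param_distributions = {
    "n_estimators": randint(50, 151),
    "max_depth": [5, 10, 15, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=6,            # number of sampled combinations (kept small for speed)
    cv=5,                # 5-fold cross-validation, as in the grid search
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Best Hyperparameters:", random_search.best_params_)
print("Best cross-validated accuracy:", random_search.best_score_)
```

In practice `n_iter` would be set larger; the trade-off is that random search may miss the best grid point but covers wide ranges at a fraction of the cost.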




Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.



ANS-7



To report the best set of hyperparameters found by the Grid Search and the corresponding performance metrics, we can use the `best_params_` attribute of the `GridSearchCV` object. Additionally, we will compare the performance of the tuned model with the default model.

Here's how to do it:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Print the best hyperparameters found by the grid search
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Get the best model from the grid search
best_rf_classifier = grid_search.best_estimator_

# Evaluate the performance of the best model on the test set
y_pred_tuned = best_rf_classifier.predict(X_test)

# Calculate evaluation metrics for the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned)
recall_tuned = recall_score(y_test, y_pred_tuned)
f1_tuned = f1_score(y_test, y_pred_tuned)

# Print the evaluation metrics for the tuned model
print("Tuned Model Metrics:")
print("Accuracy:", accuracy_tuned)
print("Precision:", precision_tuned)
print("Recall:", recall_tuned)
print("F1 Score:", f1_tuned)

# Train the default Random Forest Classifier
default_rf_classifier = RandomForestClassifier(random_state=42)
default_rf_classifier.fit(X_train, y_train)

# Evaluate the performance of the default model on the test set
y_pred_default = default_rf_classifier.predict(X_test)

# Calculate evaluation metrics for the default model
accuracy_default = accuracy_score(y_test, y_pred_default)
precision_default = precision_score(y_test, y_pred_default)
recall_default = recall_score(y_test, y_pred_default)
f1_default = f1_score(y_test, y_pred_default)

# Print the evaluation metrics for the default model
print("Default Model Metrics:")
print("Accuracy:", accuracy_default)
print("Precision:", precision_default)
print("Recall:", recall_default)
print("F1 Score:", f1_default)
```

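In this code, we print the best hyperparameters found by the Grid Search using `grid_search.best_params_`. Then, we get the best model from the grid search and evaluate its performance on the test set, calculating accuracy, precision, recall, and F1 score. Finally, we train a Random Forest with default hyperparameters on the same training set and compute the same metrics on the same test set, so the two models can be compared side by side.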
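The tuned-versus-default comparison can be easier to read as a single table. The sketch below is an assumption-laden illustration: the data is synthetic, the helper `summarize` is hypothetical, and the "tuned" settings are placeholders standing in for whatever `grid_search.best_params_` returns on the real data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def summarize(y_true, y_pred):
    """Hypothetical helper: collect the four metrics in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

models = {
    "default": RandomForestClassifier(random_state=42),
    # Placeholder "tuned" settings; in practice use grid_search.best_params_
    "tuned": RandomForestClassifier(
        n_estimators=150, max_depth=10, min_samples_leaf=2, random_state=42
    ),
}

# Fit each model on the same training set and score it on the same test set
rows = {name: summarize(y_te, m.fit(X_tr, y_tr).predict(X_te)) for name, m in models.items()}
comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(3))
```

Small differences between the two rows should be read with caution, since a single test split of this size carries noticeable sampling noise.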





          
Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.


ANS-8
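A decision boundary can only be drawn in two dimensions, so one possible approach is to retrain a forest on just two features and colour a dense grid by the predicted class. The sketch below assumes synthetic two-feature data from `make_classification` as a stand-in for the two most important features; with the real dataset, the two columns would be taken from the top of the feature importance ranking.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the two most important features (assumption)
X_2d, y_2d = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

rf_2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_2d.fit(X_2d, y_2d)

# Build a dense grid covering the range of both features
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

# Predict the class at every grid point and reshape back to the grid
grid_pred = rf_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions and overlay the actual samples
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, grid_pred, alpha=0.3, cmap="coolwarm")
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, cmap="coolwarm", edgecolor="k", s=20)
plt.xlabel("Feature 1 (placeholder)")
plt.ylabel("Feature 2 (placeholder)")
plt.title("Random Forest Decision Boundary on Two Features")
plt.tight_layout()
plt.show()
```

Some hedged observations on what such a plot can and cannot show:

- **Insights:** Random Forest boundaries are built from axis-aligned splits, so the regions look blocky rather than smooth. Small isolated islands around single points usually indicate that the trees are fitting noise.
- **Limitations:** A model trained on two features is not the same model as the one trained on all features, so the plot is an approximation of its behaviour. Feature importance and decision regions describe associations in the data, not causes of heart disease, and any conclusions depend on how representative the dataset is of the population where the model would be used.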




