## Import Libraries

Import the libraries used throughout the notebook.

In [132]:
import pandas as pd
from scipy.stats import uniform
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import joblib

## Load The Data

The preprocessed data set is stored in the file 'preprocessed_heart.csv'.

In [133]:
# Load the preprocessed data into a DataFrame
data = pd.read_csv('preprocessed_heart.csv')

print(data.head())

    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
1  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
2  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
3  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   
4  56.0  1.0  2.0     120.0  236.0  0.0      0.0    178.0    0.0      0.8   

   slope   ca thal  target  
0    2.0  3.0  3.0       2  
1    2.0  2.0  7.0       1  
2    3.0  0.0  3.0       0  
3    1.0  0.0  3.0       0  
4    1.0  0.0  3.0       0  


The above code loads the preprocessed data from the CSV file 'preprocessed_heart.csv' into a new DataFrame called data and then prints its first five rows with head().

## Identify the correlations between the variables

A correlation matrix is used to show correlation coefficients between variables. Each cell in the table displays the correlation between two variables. The values range from -1 to 1. A value closer to 1 indicates a strong positive correlation, while a value closer to -1 indicates a strong negative correlation. A value close to 0 means little to no correlation.
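To make the range of the coefficient concrete, here is a minimal sketch on a toy DataFrame (the column names and values are made up for illustration and are not part of the heart disease data). One column rises perfectly with `x` and another falls perfectly with it.

```python
import pandas as pd

# Toy data, purely illustrative
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],    # increases linearly with x
    "down": [10, 8, 6, 4, 2],  # decreases linearly with x
})

# DataFrame.corr() computes pairwise Pearson correlation by default
corr = toy.corr()
print(corr["x"])
```

The `up` column should show a coefficient of 1 against `x`, and the `down` column a coefficient of -1.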


In [134]:
correlation_matrix = data.corr()
print(correlation_matrix['target'].sort_values())


thalach    -0.415399
fbs         0.065937
chol        0.070315
trestbps    0.159978
restecg     0.186769
age         0.225809
sex         0.226601
slope       0.387417
exang       0.395996
cp          0.405182
oldpeak     0.508330
target      1.000000
Name: target, dtype: float64




The output lists each column's correlation with the 'target' variable, sorted in ascending order, showing which features are most and least correlated with the target.


To select features for machine learning models such as logistic regression, SVM, and decision trees, it helps to focus on the features that have a relatively strong correlation with the target variable. Grouping the correlation values above (with respect to the 'target' variable):

Strong positive correlation:

- oldpeak (0.508330)
- cp (0.405182)
- exang (0.395996)
- slope (0.387417)

Negative correlation of similar magnitude:

- thalach (-0.415399)

Weak positive correlation:

- sex (0.226601)
- age (0.225809)
- restecg (0.186769)
- trestbps (0.159978)

Very weak correlation (close to 0):

- chol (0.070315)
- fbs (0.065937)

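Based on these correlation values, the four features with the strongest positive correlation with the target variable are:

- oldpeak
- cp
- exang
- slope

These four features are used to predict the target. Note that thalach has a correlation of comparable magnitude (-0.415399), but it is negative and is not included in the feature set used below.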
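As an illustration of ranking features by correlation strength, the sketch below reuses the correlation values printed earlier and ranks them by absolute value. The `threshold` of 0.35 is a hypothetical cut-off chosen for this example, not a value from the notebook.

```python
import pandas as pd

# Correlations with 'target', copied from the printed output earlier
target_corr = pd.Series({
    "thalach": -0.415399, "fbs": 0.065937, "chol": 0.070315,
    "trestbps": 0.159978, "restecg": 0.186769, "age": 0.225809,
    "sex": 0.226601, "slope": 0.387417, "exang": 0.395996,
    "cp": 0.405182, "oldpeak": 0.508330,
})

threshold = 0.35  # hypothetical cut-off, an assumption for this sketch

# Rank by magnitude so that strong negative correlations are not overlooked
ranked = target_corr.abs().sort_values(ascending=False)
selected = ranked[ranked >= threshold].index.tolist()
print(selected)
```

Ranking by absolute value also surfaces thalach, which the sign-based grouping leaves out of the selected set.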

## Select features and target variable

Select the features and the target variable based on the correlation matrix.

In [135]:
# Select features and target variable
features = ['oldpeak', 'cp', 'exang', 'slope']
X = data[features]
y = data['target']

## Split the data into training and test sets

In the heart disease data set, a split ratio of 80% for training and 20% for testing was used (test_size=0.2). This means 80% of the data is used for training the model, and 20% is used for testing its performance.

In [136]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
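
As a small check of what `test_size=0.2` does, the sketch below applies the same call to a toy array of 10 rows (made-up data, not the heart disease set). The second call adds `stratify`, which is an optional variation and an assumption here; the notebook itself does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, two balanced classes (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

# Same arguments as the notebook's split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Optional variation (not used in the notebook): keep class proportions
# the same in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(sorted(y_te_s.tolist()))
```

With 10 rows, 8 go to training and 2 to testing; the stratified variant places one row of each class in the test set.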

## Standardize the features

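Standardizing the features rescales each one to zero mean and unit variance, which matters for algorithms that are sensitive to the scale of the input features, such as SVM and regularized logistic regression. The scaler is fitted on the training data only and then applied to the test data.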

In [137]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
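
The transformation can be checked by hand: each value has the training mean subtracted and is divided by the training standard deviation. A minimal sketch on toy numbers (the names `toy_scaler`, `X_train_toy`, and `X_test_toy` are made up for this example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative only)
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# Fit on the training data only, then reuse the same statistics for the test data
toy_scaler = StandardScaler().fit(X_train_toy)
scaled_test = toy_scaler.transform(X_test_toy)

# Manual z-score using the training mean and (population) standard deviation
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)
print(scaled_test, manual)
```

Using the training statistics for the test set avoids leaking information from the test data into the model.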

## Randomized Search for Logistic regression, SVM, Decision Tree

Randomized Search samples a fixed number of hyperparameter combinations from the specified distributions instead of evaluating every combination. This makes it an efficient way to tune the Logistic Regression, SVM, and Decision Tree models without the computational cost of an exhaustive search. Each cell below fits the search on the scaled training data and reports the tuned model's accuracy, precision, recall, and F1-score on the test set.

### Logistic regression

In [138]:
# Randomized Search for Logistic Regression
logreg_params = {'C': uniform(0.1, 10)}
logreg_random = RandomizedSearchCV(LogisticRegression(), logreg_params, n_iter=100, cv=5, random_state=42)
logreg_random.fit(X_train_scaled, y_train)
best_logreg_random = logreg_random.best_estimator_
logreg_random_pred = best_logreg_random.predict(X_test_scaled)
logreg_random_accuracy = accuracy_score(y_test, logreg_random_pred)
logreg_random_precision = precision_score(y_test, logreg_random_pred, average='weighted')
logreg_random_recall = recall_score(y_test, logreg_random_pred, average='weighted')
logreg_random_f1 = f1_score(y_test, logreg_random_pred, average='weighted')

# Print accuracy, precision, recall, and F1-score for Randomized Search model - Logistic Regression
print("Randomized Search - Logistic Regression:")
print("Accuracy: {:.2f}%".format(logreg_random_accuracy * 100))
print("Precision: {:.2f}".format(logreg_random_precision))
print("Recall: {:.2f}".format(logreg_random_recall))
print("F1-Score: {:.2f}".format(logreg_random_f1))
print()

Randomized Search - Logistic Regression:
Accuracy: 59.02%
Precision: 0.54
Recall: 0.59
F1-Score: 0.54





The search samples 100 values of 'C' from uniform(0.1, 10), which in SciPy's loc/scale parameterization covers the range 0.1 to 10.1. It evaluates each value using 5-fold cross-validation, selects the best estimator, and computes accuracy and weighted precision, recall, and F1-score on the test data. The results are then printed for evaluation.
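
SciPy's `uniform(loc, scale)` describes the interval from `loc` to `loc + scale`, so `uniform(0.1, 10)` does not stop at 10. A minimal sketch that checks the bounds and draws samples the way the search would (the sample size of 1000 is an arbitrary choice for illustration):

```python
from scipy.stats import uniform

# Same distribution as logreg_params['C'] in the notebook
dist = uniform(0.1, 10)  # loc=0.1, scale=10

# The 0% and 100% quantiles give the lower and upper bounds
low, high = dist.ppf([0, 1])
print(low, high)

# Draw samples reproducibly
samples = dist.rvs(size=1000, random_state=42)
print(samples.min(), samples.max())
```

To sample strictly between 0.1 and 10, the distribution would be written as `uniform(0.1, 9.9)`.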

### SVM

Apply Randomized Search to tune the SVM hyperparameters, fit the model, and compute the scoring metrics.

In [139]:
# Randomized Search for SVM
svm_params = {'C': uniform(0.1, 10), 'gamma': uniform(0.1, 1)}
svm_random = RandomizedSearchCV(SVC(), svm_params, n_iter=100, cv=5, random_state=42)
svm_random.fit(X_train_scaled, y_train)
best_svm_random = svm_random.best_estimator_
# The selected hyperparameters are available as svm_random.best_params_
svm_random_pred = best_svm_random.predict(X_test_scaled)
svm_random_accuracy = accuracy_score(y_test, svm_random_pred)
svm_random_precision = precision_score(y_test, svm_random_pred, average='weighted')
svm_random_recall = recall_score(y_test, svm_random_pred, average='weighted')
svm_random_f1 = f1_score(y_test, svm_random_pred, average='weighted')

# Print accuracy, precision, recall, and F1-score for Randomized Search model - SVM
print("Randomized Search - SVM:")
print("Accuracy: {:.2f}%".format(svm_random_accuracy * 100))
print("Precision: {:.2f}".format(svm_random_precision))
print("Recall: {:.2f}".format(svm_random_recall))
print("F1-Score: {:.2f}".format(svm_random_f1))
print()

Randomized Search - SVM:
Accuracy: 55.74%
Precision: 0.37
Recall: 0.56
F1-Score: 0.44





The above code snippet uses Randomized Search to tune the hyperparameters of a Support Vector Machine (SVM) classifier. It samples 'C' (regularization parameter) from uniform(0.1, 10) and 'gamma' (kernel coefficient) from uniform(0.1, 1), performs 100 iterations with 5-fold cross-validation, selects the best estimator, and evaluates its performance on the test data, displaying accuracy, precision, recall, and F1-score.

### Decision Tree

Apply Randomized Search to tune the Decision Tree hyperparameters, fit the model, and compute the scoring metrics.

In [140]:

# Randomized Search for Decision Tree
dt_params = {'max_depth': [3, 5, 7, 10]}
dt_random = RandomizedSearchCV(DecisionTreeClassifier(), dt_params, n_iter=100, cv=5, random_state=42)
dt_random.fit(X_train_scaled, y_train)
best_dt_random = dt_random.best_estimator_
dt_random_pred = best_dt_random.predict(X_test_scaled)
dt_random_accuracy = accuracy_score(y_test, dt_random_pred)
dt_random_precision = precision_score(y_test, dt_random_pred, average='weighted')
dt_random_recall = recall_score(y_test, dt_random_pred, average='weighted')
dt_random_f1 = f1_score(y_test, dt_random_pred, average='weighted')

# Print accuracy, precision, recall, and F1-score for Randomized Search model - Decision Tree
print("Randomized Search - Decision Tree:")
print("Accuracy: {:.2f}%".format(dt_random_accuracy * 100))
print("Precision: {:.2f}".format(dt_random_precision))
print("Recall: {:.2f}".format(dt_random_recall))
print("F1-Score: {:.2f}".format(dt_random_f1))
print()



Randomized Search - Decision Tree:
Accuracy: 54.10%
Precision: 0.35
Recall: 0.54
F1-Score: 0.42





The search space contains only four 'max_depth' values (3, 5, 7, 10), so although n_iter=100 is requested, scikit-learn evaluates each of the four candidates once with 5-fold cross-validation, which makes this search equivalent to a grid search. After fitting on the scaled training data, the best estimator is evaluated on the test data using accuracy, precision, recall, and F1-score.
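
When every parameter is given as a list, the number of candidates a randomized search can draw is capped by the size of the list. A minimal sketch using scikit-learn's `ParameterSampler`, the helper that `RandomizedSearchCV` uses to generate candidates (the name `dt_space` is made up for this example):

```python
import warnings
from sklearn.model_selection import ParameterSampler

# Same search space as dt_params in the notebook
dt_space = {"max_depth": [3, 5, 7, 10]}
sampler = ParameterSampler(dt_space, n_iter=100, random_state=42)

with warnings.catch_warnings():
    # scikit-learn warns that n_iter exceeds the size of the space
    warnings.simplefilter("ignore")
    candidates = list(sampler)

print(len(candidates))
print(sorted(c["max_depth"] for c in candidates))
```

Only four candidates are produced, one per depth, regardless of the requested 100 iterations.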

## Grid Search for Logistic regression, SVM, Decision Tree


Grid Search exhaustively explores all combinations of hyperparameter values, making it a suitable choice for small and manageable hyperparameter spaces. However, it can become impractical for larger spaces due to its computational cost.
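
The cost of a grid search is the number of combinations multiplied by the number of cross-validation folds. A minimal sketch that counts both for the grids used in this section, with `ParameterGrid` from scikit-learn (the names `logreg_grid_space`, `svm_grid_space`, and `cv_folds` are made up for this example):

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cells of this section
logreg_grid_space = {"C": [0.1, 1, 10]}
svm_grid_space = {"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]}
cv_folds = 5

for name, space in [("Logistic Regression", logreg_grid_space), ("SVM", svm_grid_space)]:
    n_candidates = len(ParameterGrid(space))
    print(name, n_candidates, "candidates,", n_candidates * cv_folds, "cross-validation fits")
```

Each added hyperparameter multiplies the number of candidates, which is why the cost grows quickly for larger spaces.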

### Logistic regression

In [141]:

# Grid Search for Logistic Regression
logreg_params = {'C': [0.1, 1, 10]}
logreg_grid = GridSearchCV(LogisticRegression(), logreg_params, cv=5)
logreg_grid.fit(X_train_scaled, y_train)
best_logreg_grid = logreg_grid.best_estimator_
logreg_grid_pred = best_logreg_grid.predict(X_test_scaled)
logreg_grid_accuracy = accuracy_score(y_test, logreg_grid_pred)
logreg_precision = precision_score(y_test, logreg_grid_pred, average='weighted')
logreg_recall = recall_score(y_test, logreg_grid_pred, average='weighted')
logreg_f1 = f1_score(y_test, logreg_grid_pred, average='weighted')

# Print accuracy, precision, recall, and F1-score for Grid Search model - Logistic Regression
print("Grid Search - Logistic Regression:")
print("Accuracy: {:.2f}%".format(logreg_grid_accuracy * 100))
print("Precision: {:.2f}".format(logreg_precision))
print("Recall: {:.2f}".format(logreg_recall))
print("F1-Score: {:.2f}".format(logreg_f1))
print()

Grid Search - Logistic Regression:
Accuracy: 60.66%
Precision: 0.64
Recall: 0.61
F1-Score: 0.55





The code performs Grid Search for Logistic Regression over three 'C' values (0.1, 1, 10), using 5-fold cross-validation to pick the best one. It then evaluates the selected model's accuracy, precision, recall, and F1-score on the test data. At 60.66% accuracy, the result is slightly better than the Randomized Search result for Logistic Regression (59.02%).

### SVM

In [142]:

# Grid Search for SVM
svm_params = {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}
svm_grid = GridSearchCV(SVC(), svm_params, cv=5)
svm_grid.fit(X_train_scaled, y_train)
best_svm_grid = svm_grid.best_estimator_
svm_grid_pred = best_svm_grid.predict(X_test_scaled)
svm_grid_accuracy = accuracy_score(y_test, svm_grid_pred)
svm_precision = precision_score(y_test, svm_grid_pred, average='weighted')
svm_recall = recall_score(y_test, svm_grid_pred, average='weighted')
svm_f1 = f1_score(y_test, svm_grid_pred, average='weighted')

# Print accuracy, precision, recall, and F1-score for Grid Search model - SVM
print("Grid Search - SVM:")
print("Accuracy: {:.2f}%".format(svm_grid_accuracy * 100))
print("Precision: {:.2f}".format(svm_precision))
print("Recall: {:.2f}".format(svm_recall))
print("F1-Score: {:.2f}".format(svm_f1))
print()

Grid Search - SVM:
Accuracy: 55.74%
Precision: 0.37
Recall: 0.56
F1-Score: 0.45





The code conducts Grid Search for SVM over all nine combinations of 'C' (0.1, 1, 10) and 'gamma' (0.1, 1, 10). Using 5-fold cross-validation, it identifies the best SVM model (best_svm_grid), which is refitted on the scaled training data and then evaluated on the test set for accuracy, precision, recall, and F1-score. The tuned SVM reaches 55.74% accuracy, the same as the Randomized Search SVM.

### Decision Tree

In [143]:
# Grid Search for Decision Tree
dt_params = {'max_depth': [3, 5, 7, 10]}
dt_grid = GridSearchCV(DecisionTreeClassifier(), dt_params, cv=5)
dt_grid.fit(X_train_scaled, y_train)
best_dt_grid = dt_grid.best_estimator_
dt_grid_pred = best_dt_grid.predict(X_test_scaled)
dt_grid_accuracy = accuracy_score(y_test, dt_grid_pred)
dt_precision = precision_score(y_test, dt_grid_pred, average='weighted')
dt_recall = recall_score(y_test, dt_grid_pred, average='weighted')
dt_f1 = f1_score(y_test, dt_grid_pred, average='weighted')

# Print accuracy, precision, recall, and F1-score for Grid Search model - Decision Tree
print("Grid Search - Decision Tree:")
print("Accuracy: {:.2f}%".format(dt_grid_accuracy * 100))
print("Precision: {:.2f}".format(dt_precision))
print("Recall: {:.2f}".format(dt_recall))
print("F1-Score: {:.2f}".format(dt_f1))



Grid Search - Decision Tree:
Accuracy: 54.10%
Precision: 0.35
Recall: 0.54
F1-Score: 0.42




The code implements Grid Search for the Decision Tree over four 'max_depth' values (3, 5, 7, 10). It evaluates each depth using 5-fold cross-validation and identifies the best-performing Decision Tree model (best_dt_grid). The selected model is then tested on the scaled test data, and its accuracy, precision, recall, and F1-score are printed. The tuned tree reaches 54.10% accuracy, the lowest of the three model types.
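
The tuning cells above repeat the same fit, predict, and score steps for each model. One way to reduce the repetition is a small helper like the one sketched below. The function name `evaluate_search` is hypothetical, the data is a synthetic stand-in generated with `make_classification` rather than the heart disease set, and `zero_division=0` is an added assumption to keep undefined per-class scores at 0 without a warning.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split


def evaluate_search(search, X_tr, y_tr, X_te, y_te):
    """Fit a search object and score its best estimator on held-out data."""
    search.fit(X_tr, y_tr)
    pred = search.best_estimator_.predict(X_te)
    return {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, average="weighted", zero_division=0),
        "recall": recall_score(y_te, pred, average="weighted", zero_division=0),
        "f1": f1_score(y_te, pred, average="weighted", zero_division=0),
    }


# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scores = evaluate_search(
    GridSearchCV(LogisticRegression(), {"C": [0.1, 1, 10]}, cv=5),
    X_tr, y_tr, X_te, y_te,
)
print(scores)
```

The same helper accepts a `RandomizedSearchCV` object, since both search classes expose `fit`, `best_estimator_`, and `best_params_`.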

## Identifying Best Model

In [144]:
# Find the best accuracy and corresponding model
best_accuracy = max(logreg_random_accuracy, svm_random_accuracy, dt_random_accuracy, logreg_grid_accuracy, svm_grid_accuracy, dt_grid_accuracy)

if best_accuracy == logreg_random_accuracy:
    best_model_type = "Randomized Search - Logistic Regression"
elif best_accuracy == svm_random_accuracy:
    best_model_type = "Randomized Search - SVM"
elif best_accuracy == dt_random_accuracy:
    best_model_type = "Randomized Search - Decision Tree"
elif best_accuracy == logreg_grid_accuracy:
    best_model_type = "Grid Search - Logistic Regression"
elif best_accuracy == svm_grid_accuracy:
    best_model_type = "Grid Search - SVM"
else:
    best_model_type = "Grid Search - Decision Tree"

print("Best Model Type: {}".format(best_model_type))
print("Best Accuracy: {:.2f}".format(best_accuracy))


Best Model Type: Grid Search - Logistic Regression
Best Accuracy: 0.61


The code snippet compares the accuracy of all six tuned models and reports the highest. The if/elif chain maps the best score back to a model label; if two models tie, the first match in the chain is reported. Here the best model is "Grid Search - Logistic Regression", with an accuracy of 0.61.

In [145]:
# Find the best precision and corresponding model
best_precision = max(logreg_random_precision, svm_random_precision, dt_random_precision, logreg_precision, svm_precision, dt_precision)

if best_precision == logreg_random_precision:
    best_model_type = "Randomized Search - Logistic Regression"
elif best_precision == svm_random_precision:
    best_model_type = "Randomized Search - SVM"
elif best_precision == dt_random_precision:
    best_model_type = "Randomized Search - Decision Tree"
elif best_precision == logreg_precision:
    best_model_type = "Grid Search - Logistic Regression"
elif best_precision == svm_precision:
    best_model_type = "Grid Search - SVM"
else:
    best_model_type = "Grid Search - Decision Tree"

print("Best Model Type: {}".format(best_model_type))
print("Best Precision: {:.2f}".format(best_precision))


Best Model Type: Grid Search - Logistic Regression
Best Precision: 0.64



The same comparison is repeated for precision. Because the target has more than two classes, precision is computed per class and averaged with weights proportional to each class's support (average='weighted'). Grid Search - Logistic Regression again scores highest, with a weighted precision of 0.64.
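
To show how weighted precision behaves when a class is never predicted, here is a minimal sketch on made-up labels. The `zero_division=0` argument is an assumption added for this example; it sets the undefined per-class precision to 0 without emitting a warning.

```python
from sklearn.metrics import precision_score

# Made-up labels: classes 1 and 2 are never predicted
y_true_toy = [0, 0, 1, 2]
y_pred_toy = [0, 0, 0, 0]

# Class 0: 2 correct out of 4 predictions -> 0.5, weighted by support 2/4
# Classes 1 and 2: no predictions -> precision set to 0 by zero_division=0
weighted_precision = precision_score(
    y_true_toy, y_pred_toy, average="weighted", zero_division=0
)
print(weighted_precision)
```

A model that ignores minority classes is penalized in weighted precision through the zero scores of those classes, which helps explain why precision can sit well below accuracy.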

In [146]:


# Find the best recall and corresponding model
best_recall = max(logreg_random_recall, svm_random_recall, dt_random_recall, logreg_recall, svm_recall, dt_recall)

if best_recall == logreg_random_recall:
    best_model_type = "Randomized Search - Logistic Regression"
elif best_recall == svm_random_recall:
    best_model_type = "Randomized Search - SVM"
elif best_recall == dt_random_recall:
    best_model_type = "Randomized Search - Decision Tree"
elif best_recall == logreg_recall:
    best_model_type = "Grid Search - Logistic Regression"
elif best_recall == svm_recall:
    best_model_type = "Grid Search - SVM"
else:
    best_model_type = "Grid Search - Decision Tree"

print("Best Model Type: {}".format(best_model_type))
print("Best Recall: {:.2f}".format(best_recall))


Best Model Type: Grid Search - Logistic Regression
Best Recall: 0.61



The comparison is repeated for recall across the Randomized and Grid Search versions of the Logistic Regression, SVM, and Decision Tree models. Recall matters most where false negatives are costly, such as in healthcare or fraud detection. With average='weighted', recall is equal to accuracy, which is why the recall values match the accuracy values reported earlier. Grid Search - Logistic Regression has the highest recall, 0.61.

In [147]:


# Find the best F1-score and corresponding model
best_f1_score = max(logreg_random_f1, svm_random_f1, dt_random_f1, logreg_f1, svm_f1, dt_f1)

if best_f1_score == logreg_random_f1:
    best_model_type = "Randomized Search - Logistic Regression"
elif best_f1_score == svm_random_f1:
    best_model_type = "Randomized Search - SVM"
elif best_f1_score == dt_random_f1:
    best_model_type = "Randomized Search - Decision Tree"
elif best_f1_score == logreg_f1:
    best_model_type = "Grid Search - Logistic Regression"
elif best_f1_score == svm_f1:
    best_model_type = "Grid Search - SVM"
else:
    best_model_type = "Grid Search - Decision Tree"

print("Best Model Type: {}".format(best_model_type))
print("Best F1-Score: {:.2f}".format(best_f1_score))


Best Model Type: Grid Search - Logistic Regression
Best F1-Score: 0.55


All four scoring metrics identify the same best model: Grid Search - Logistic Regression.


The last code snippet computes the best F1-score across the six models. The F1-score balances precision and recall, which matters in applications where both false positives and false negatives are costly, such as medical diagnosis or spam email detection. Grid Search - Logistic Regression has the highest weighted F1-score, 0.55.
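
The four comparison cells can also be written as a single lookup. The sketch below is one possible alternative to the if/elif chains, not the notebook's method; it stores the scores printed in the outputs above (rounded as displayed) in a dictionary named `results`, which is a name made up for this example.

```python
# Scores copied from the printed outputs above, rounded as displayed
results = {
    "Randomized Search - Logistic Regression": {"accuracy": 0.5902, "precision": 0.54, "recall": 0.59, "f1": 0.54},
    "Randomized Search - SVM": {"accuracy": 0.5574, "precision": 0.37, "recall": 0.56, "f1": 0.44},
    "Randomized Search - Decision Tree": {"accuracy": 0.5410, "precision": 0.35, "recall": 0.54, "f1": 0.42},
    "Grid Search - Logistic Regression": {"accuracy": 0.6066, "precision": 0.64, "recall": 0.61, "f1": 0.55},
    "Grid Search - SVM": {"accuracy": 0.5574, "precision": 0.37, "recall": 0.56, "f1": 0.45},
    "Grid Search - Decision Tree": {"accuracy": 0.5410, "precision": 0.35, "recall": 0.54, "f1": 0.42},
}

# For each metric, pick the model name with the highest score
best_by_metric = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in ("accuracy", "precision", "recall", "f1")
}

for metric, name in best_by_metric.items():
    print("Best {}: {} ({:.2f})".format(metric, name, results[name][metric]))
```

Keeping the scores in one structure avoids referring to a differently named variable in each cell and makes ties easier to inspect.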

The analysis suggests that, of the six models compared, the Grid Search Logistic Regression model trained on the selected features gives the best prediction of the heart disease target, with 60.66% accuracy and a weighted F1-score of 0.55. With further improvement, a model of this kind could support clinical decision-making and patient risk assessment.

The Grid Search-tuned Logistic Regression model, combined with thoughtful feature selection, is the strongest candidate found here for heart disease diagnosis and risk assessment. Deployed in clinical settings and guided by ethical principles and continuous improvement, such a model has the potential to enhance healthcare outcomes and patient well-being.

