Skip to content

A comprehensive ML framework to detect heart disease using the Cleveland dataset

Notifications You must be signed in to change notification settings

Vayuputra2401/Heart_Arrythmia_Prediction

Repository files navigation

Heart Disease Prediction-v1

Introduction

This is the documentation for the first version of the heart disease prediction program, which uses the Cleveland Heart Disease dataset. The goal of this program is to predict the presence of heart disease in patients based on various features.

Dataset

The dataset used for this project is the Cleveland Heart Disease dataset, which contains 303 instances with 14 attributes. Each instance represents a patient and includes information such as age, sex, chest pain type, resting blood pressure, serum cholesterol level, fasting blood sugar, and more. The target variable is the presence of heart disease, where a value of 0 indicates no heart disease and a value greater than 0 indicates the presence of heart disease.

Preprocessing Techniques

Several preprocessing techniques were applied to the dataset to prepare it for training the predictive models. These techniques are as follows:

  1. Handling Missing Values: Any missing values in the dataset were identified and handled appropriately. This may involve imputation techniques like mean, median, or mode imputation, or removing instances with missing values depending on the situation.

  2. Feature Scaling: Some features in the dataset may have different scales, which can adversely affect the performance of certain machine learning algorithms. To address this, feature scaling techniques such as standardization or normalization were applied to bring all features to a similar scale.

  3. Feature Encoding: Categorical features in the dataset were encoded into numerical values to enable their use in the predictive models. This was done using techniques such as one-hot encoding or label encoding, depending on the nature of the categorical variables.

  4. Feature Selection: In some cases, certain features may not contribute significantly to the prediction task or may introduce noise. Feature selection techniques, such as correlation analysis or recursive feature elimination, were employed to identify and select the most relevant features for the models.

Data Visualization Techniques

Data visualization techniques were utilized to gain insights and a better understanding of the dataset. The following visualization techniques were applied:

  1. Histograms: Histograms were used to visualize the distribution of numerical variables in the dataset. This helped identify any skewed or abnormal distributions.

  2. Bar Charts: Bar charts were used to visualize the distribution of categorical variables. This provided an overview of the frequency of different categories within each feature.

  3. Correlation Matrix: A correlation matrix was constructed to examine the relationships between the features and the target variable. This helped identify any strong correlations that could indicate predictive power.

  4. Scatter Plots: Scatter plots were used to visualize the relationships between pairs of numerical variables. This allowed for the identification of any patterns or trends in the data.

Models Used for Prediction

Two models were employed for predicting heart disease presence: K-Nearest Neighbors (KNN), Logistic Regression, and Random Forest. These models were chosen based on their effectiveness in classification tasks. The accuracy achieved by each model after hyperparameter tuning is as follows:

  • K-Nearest Neighbors (KNN): 75%
  • Logistic Regression: 83%
  • Random Forest: 98%

Model Hyperparameter Tuning

To improve the performance of the models, hyperparameter tuning was performed. This involved systematically searching for the best combination of hyperparameters for each model. Techniques such as grid search or random search were utilized to find the optimal hyperparameters that maximize the model's accuracy.

Classification Report and Evaluation Metrics

A classification report was generated to evaluate the performance of the models. The report includes metrics such as precision, recall, and F1-score, which provide insights into the models' predictive capabilities. These metrics were calculated for each class (presence or absence of heart disease) and averaged to provide an overall performance summary.

Future Enhancements

In the future versions of this program, several enhancements can be considered

, including:

  • Exploring additional machine learning algorithms, such as support vector machines or gradient boosting, to further improve the predictive performance.
  • Applying advanced feature engineering techniques to derive more informative features from the existing dataset.
  • Incorporating cross-validation techniques to obtain a more reliable estimate of the models' performance.
  • Conducting more in-depth analysis of the feature importance and conducting sensitivity analysis to assess the robustness of the models.

Conclusion

This documentation provides an overview of the heart disease prediction program using the Cleveland Heart Disease dataset. It describes the preprocessing techniques, data visualization methods, models used, hyperparameter tuning, and evaluation metrics. The program's first version achieved promising accuracy rates for predicting heart disease presence, and future enhancements can be explored to further improve the models' performance.

About

A comprehensive ML framework to detect heart disease using the Cleveland dataset

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published