# Heart Disease Dataset

## Overview
This dataset contains patient records with attributes related to heart disease, collected at the Cleveland Clinic and made available on Kaggle. It includes both qualitative and quantitative variables, which makes it suitable for descriptive statistics, confidence intervals, hypothesis testing, correlation analysis, and multiple linear regression.

**Source**: [Kaggle - Heart Disease Data](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data)

---

## Variables

### Quantitative Variables
- **id**: Unique identifier for each patient (a row label rather than a measurement, so it should be excluded from numerical summaries)
- **age**: Age of the patient in years
- **trestbps**: Resting blood pressure in mm Hg
- **chol**: Serum cholesterol level in mg/dl
- **thalch**: Maximum heart rate achieved
- **oldpeak**: ST depression induced by exercise relative to rest
- **ca**: Number of major vessels (0-3) colored by fluoroscopy
- **num**: Diagnosis of heart disease (angiographic disease status), where `0` indicates no disease and values `1`-`4` indicate the presence of disease

### Qualitative Variables
- **sex**: Sex of the patient, either `Male` or `Female`
- **dataset**: Source of the data, e.g., Cleveland
- **cp**: Chest pain type, with categories `typical angina`, `asymptomatic`, `non-anginal`, or `atypical angina`
- **fbs**: Fasting blood sugar > 120 mg/dl, represented as `TRUE` if true and `FALSE` otherwise
- **restecg**: Resting electrocardiographic results, with categories `normal`, `st-t abnormality`, or `lv hypertrophy` (left ventricular hypertrophy)
- **exang**: Exercise-induced angina, with `TRUE` if present and `FALSE` otherwise
- **slope**: Slope of the peak exercise ST segment, categorized as `upsloping`, `flat`, or `downsloping`
- **thal**: Type of thalassemia, with values `normal`, `fixed defect`, or `reversable defect` (the spelling used in the data)
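
The variable lists above can be captured in code so that later steps treat each column appropriately. The following is a minimal sketch, not part of the original notebook: the names `IDENTIFIER`, `QUANTITATIVE`, `QUALITATIVE`, and `classify_columns` are hypothetical, and keeping `id` in its own group is a choice made for this sketch.

```python
# Column groups following the variable lists above. Keeping `id` separate is
# an assumption of this sketch: it labels rows rather than measuring anything.
IDENTIFIER = ["id"]
QUANTITATIVE = ["age", "trestbps", "chol", "thalch", "oldpeak", "ca", "num"]
QUALITATIVE = ["sex", "dataset", "cp", "fbs", "restecg", "exang", "slope", "thal"]


def classify_columns(columns):
    """Group column names by kind; unrecognized names go under 'unknown'."""
    groups = {"identifier": [], "quantitative": [], "qualitative": [], "unknown": []}
    for name in columns:
        if name in IDENTIFIER:
            groups["identifier"].append(name)
        elif name in QUANTITATIVE:
            groups["quantitative"].append(name)
        elif name in QUALITATIVE:
            groups["qualitative"].append(name)
        else:
            groups["unknown"].append(name)
    return groups
```

Calling `classify_columns` on the header of the loaded table is a quick way to spot columns that are missing from, or not covered by, the lists above.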

---

## Potential Analyses

This dataset provides opportunities to explore relationships and patterns in heart disease-related factors. Potential analyses include:

1. **Descriptive Statistics**: Calculate summary statistics for both quantitative and qualitative variables.
2. **Confidence Intervals**: Compute confidence intervals for variables like `trestbps`, `chol`, and `thalch`.
3. **T-test/ANOVA**: Conduct group comparisons (e.g., chest pain types or sex) to explore mean differences.
4. **Non-Parametric Test**: Compare groups with non-parametric alternatives (e.g., Mann-Whitney U or Kruskal-Wallis) when the normality assumption of a t-test or ANOVA is doubtful.
5. **Correlation Analysis**: Examine the relationships between `chol`, `trestbps`, `thalch`, and other variables.
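6. **Multiple Linear Regression**: Use factors like `age`, `chol`, and `trestbps` to predict the heart disease diagnosis `num`. If the goal is to predict presence versus absence of disease, a binary outcome, logistic regression is the more conventional model.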
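
Several of the analyses listed above reduce to short formulas. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the numbers are made up for illustration and are not taken from the dataset, and the interval uses a large-sample normal approximation (for small samples a t-based interval, such as the one in `scipy.stats`, would be the usual choice).

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for a mean (large samples)."""
    center = mean(values)
    std_err = stdev(values) / math.sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * std_err, center + z * std_err


def welch_t_statistic(group_a, group_b):
    """Welch's t statistic for a difference in means (unequal variances)."""
    var_a = stdev(group_a) ** 2 / len(group_a)
    var_b = stdev(group_b) ** 2 / len(group_b)
    return (mean(group_a) - mean(group_b)) / math.sqrt(var_a + var_b)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up illustrative values, NOT rows from the dataset.
toy_age = [41, 48, 52, 57, 60, 63, 67]
toy_thalch = [172, 168, 160, 150, 155, 140, 129]
toy_chol_group_a = [233, 286, 229, 250, 204, 236]
toy_chol_group_b = [268, 354, 254, 203, 192, 294]

low, high = mean_confidence_interval(toy_thalch)
print(f"95% CI for mean thalch (toy data): ({low:.1f}, {high:.1f})")
print(f"Welch t (toy chol groups): {welch_t_statistic(toy_chol_group_a, toy_chol_group_b):.3f}")
print(f"Pearson r, age vs thalch (toy data): {pearson_r(toy_age, toy_thalch):.3f}")
```

On the real data, the same functions would take columns such as `chol` split by `sex` or `cp`; the statistic from `welch_t_statistic` still needs a reference distribution to yield a p-value, which this sketch leaves out.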

---

This dataset provides a robust foundation for statistical analyses to study factors related to heart disease.


In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("redwankarimsony/heart-disease-data")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Wahaj\.cache\kagglehub\datasets\redwankarimsony\heart-disease-data\versions\6
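
Once the files are downloaded, the data can be read into a DataFrame. The following is a minimal sketch, assuming the download directory contains at least one CSV file and that pandas is installed. The helper name `load_first_csv` is hypothetical, and because the file name inside the download is not stated here, the code searches for it instead of hard-coding it.

```python
from pathlib import Path

import pandas as pd


def load_first_csv(dataset_dir):
    """Read the first CSV file found under dataset_dir into a DataFrame."""
    csv_files = sorted(Path(dataset_dir).rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found under {dataset_dir}")
    return pd.read_csv(csv_files[0])
```

With the `path` returned by `kagglehub.dataset_download` above, `df = load_first_csv(path)` followed by `df.head()` gives a first look at the columns.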


In [2]:
# Requires the ucimlrepo package: pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# Fetch dataset. Note: id=529 is the UCI "Early Stage Diabetes Risk Prediction"
# dataset, not the heart disease data described above.
early_stage_diabetes_risk_prediction = fetch_ucirepo(id=529)

# Data (as pandas DataFrames)
X = early_stage_diabetes_risk_prediction.data.features
y = early_stage_diabetes_risk_prediction.data.targets

# Metadata
print(early_stage_diabetes_risk_prediction.metadata)

# Variable information
print(early_stage_diabetes_risk_prediction.variables)


ModuleNotFoundError: No module named 'ucimlrepo'