# **Heart Disease Prediction**

## Data Collection

This notebook focuses on downloading and collecting the dataset required for the project. The goal is to gather the necessary data, ensure its integrity, and prepare it for further processing and analysis.

`Simón Correa Marín`

### **1. Import Libraries**

In [1]:
# Base libraries for data science
from pathlib import Path
import pandas as pd


### **2. Collect Data**

- **What is the objective of the Heart Disease problem?**

    The main objective is to develop a machine learning model that can predict the presence of heart disease in patients based on clinical and demographic attributes. This model will help identify individuals at risk, enabling early medical interventions and potentially saving lives.

    >One of the major tasks on this dataset is to predict based on the given attributes of a patient that whether that particular person has heart disease or not and other is the experimental task to diagnose and find out various insights from this dataset which could help in understanding the problem more.

- **How will the solution be used?**

    The solution will be implemented as a decision-support tool for healthcare professionals, allowing them to quickly assess a patient’s risk of heart disease. It could be integrated into healthcare systems to monitor populations and prioritize resources for those at higher risk.

- **What are the current solutions?**

    Several predictive tools and models for heart disease already exist, built with machine learning algorithms and statistical analysis. Many of them are based on the Cleveland dataset and are published on Kaggle.


In [2]:
url_data = "https://github.com/JoseRZapata/Data_analysis_notebooks/raw/refs/heads/main/data/datasets/corazon_data.csv"
heart_df = pd.read_csv(url_data, low_memory=False)  # read the file in one pass so dtype inference is consistent
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rest_ecg     6403 non-null   object 
 1   ca           6482 non-null   object 
 2   thal         6557 non-null   object 
 3   max_hr       6456 non-null   object 
 4   exang        6499 non-null   object 
 5   old_peak     6496 non-null   object 
 6   chol         6649 non-null   object 
 7   rest_bp      6661 non-null   object 
 8   data_source  6845 non-null   object 
 9   chest_pain   6654 non-null   object 
 10  disease      6602 non-null   object 
 11  value        6845 non-null   float64
 12  chest_pain   6654 non-null   object 
 13  csv          0 non-null      float64
 14  sex          6700 non-null   object 
 15  thal         6557 non-null   object 
 16  fbs          6624 non-null   float64
 17  slope        6495 non-null   object 
 18  rest_ecg     6403 non-null   object 
 19  diseas

**Problems and comments**

- 'rest_ecg' is duplicated
- 'data_source' holds a single value ("dataset") plus NaN across the whole column
- 'disease' has only 2 valid values and is duplicated
- 'value' carries no useful information (a single value plus NaN)
- 'chest_pain' is duplicated
- 'csv' is a column full of NaN values
- 'thal' is duplicated
- 'max_hr' corresponds to 'thalach' in the original Cleveland naming
- 'rest_bp' corresponds to 'trestbps' in the original Cleveland naming
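Checks like the ones listed above can also be automated. The following is a minimal sketch, assuming a small made-up frame in place of `heart_df`; the helper name `summarize_column_problems` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_column_problems(df: pd.DataFrame) -> dict:
    """Flag columns that are entirely NaN or hold a single distinct value."""
    all_nan = [col for col in df.columns if df[col].isna().all()]
    constant = [
        col
        for col in df.columns
        if col not in all_nan and df[col].nunique(dropna=True) == 1
    ]
    return {"all_nan": all_nan, "constant": constant}


# Toy frame that mimics the issues listed above (values are made up)
toy_df = pd.DataFrame(
    {
        "csv": [None, None, None],
        "data_source": ["dataset", "dataset", None],
        "chol": [230, 244, 271],
    }
)

problems = summarize_column_problems(toy_df)
print(problems)
```

Running the same helper on `heart_df` should flag 'csv' as empty and 'data_source' and 'value' as constant, if the observations above hold.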

In [3]:
pd.set_option('display.max_columns', None)  # Show all columns
heart_df.sample(10)

Unnamed: 0,rest_ecg,ca,thal,max_hr,exang,old_peak,chol,rest_bp,data_source,chest_pain,disease,value,chest_pain.1,csv,sex,thal.1,fbs,slope,rest_ecg.1,disease.1,age
2235,left ventricular hypertrophy,1.0,reversable,165,0,2.5,230,112,dataset,nonanginal,1,105137.0,nonanginal,,Male,reversable,0.0,2,left ventricular hypertrophy,1,58
6823,normal,0.0,normal,162,0,1.1,244,120,dataset,nontypical,0,105137.0,nontypical,,Female,normal,0.0,1,normal,0,50
4616,normal,0.0,normal,162,0,0.0,271,134,dataset,nontypical,0,105137.0,nontypical,,Female,normal,0.0,2,normal,0,49
4698,normal,1.0,reversable,140,1,3.6,260,120,dataset,asymptomatic,1,105137.0,asymptomatic,,Male,reversable,0.0,2,normal,1,61
4400,left ventricular hypertrophy,0.0,normal,144,1,1.8,211,110,dataset,typical,0,105137.0,typical,,Male,normal,0.0,2,left ventricular hypertrophy,0,64
2214,left ventricular hypertrophy,1.0,reversable,147,0,1.4,254,130,dataset,asymptomatic,1,105137.0,asymptomatic,,Male,reversable,0.0,2,left ventricular hypertrophy,1,63
805,normal,0.0,normal,173,0,0.0,209,120,dataset,nonanginal,0,105137.0,nonanginal,,Female,normal,0.0,2,normal,0,42
632,normal,0.0,normal,187,0,3.5,250,130,dataset,nonanginal,0,105137.0,nonanginal,,Male,normal,0.0,3,normal,0,37
3953,normal,0.0,reversable,161,0,0.0,211,110,dataset,asymptomatic,0,105137.0,asymptomatic,,Male,reversable,0.0,1,normal,0,43
2120,left ventricular hypertrophy,0.0,normal,148,1,3.0,208,104,dataset,asymptomatic,0,105137.0,asymptomatic,,Male,normal,0.0,2,left ventricular hypertrophy,0,45


#### **Dataset Variables**

- **rest_ecg (ECG observation at resting condition):** Resting electrocardiogram results. Possible values:
    - Normal
    - ST-T abnormality (ST-T segment abnormalities, which may indicate heart problems).
    - Left ventricular hypertrophy (thickening of the heart’s left ventricle).

- **ca (number of major vessels colored by fluoroscopy):** Number of major blood vessels (0-3) visible using fluoroscopy.

- **thal (thalassemia test results):** Thalassemia test result, which assesses a blood disorder. Possible values:
    - Normal
    - Fixed defect (permanent blood flow issue).
    - Reversible defect (a defect that may improve with treatment).

- **max_hr (maximum heart rate achieved):** Maximum heart rate achieved during a stress test.

- **exang (exercise-induced angina):** Indicates whether the patient experienced angina due to exercise (1 = yes, 0 = no).

- **old_peak (ST depression induced by exercise relative to rest):** ST-segment depression during exercise compared to rest (indicates ischemia).

- **chol (cholesterol measure):** Serum cholesterol level in mg/dl.

- **rest_bp (resting blood pressure):** Blood pressure while at rest (measured in mm Hg at hospital admission).

- **data_source:** Source of the data.

- **chest_pain (chest pain type):** Type of chest pain the patient experiences:
    - Typical angina: Chest pain typically caused by heart disease.
    - Atypical angina: Chest pain that does not follow the usual pattern.
    - Non-anginal: Chest pain not related to angina.
    - Asymptomatic: No chest pain symptoms.

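- **value:** No documented meaning; the column holds a single value plus NaN.

- **csv:** No documented meaning; the column is entirely NaN.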

- **sex (gender):** The patient’s gender (Male/Female).

- **fbs (fasting blood sugar):** Blood sugar level when fasting (>120 mg/dl is considered high, represented as 1 = true, 0 = false).

- **slope (the slope of the peak exercise ST segment):** The slope of the ST segment during peak exercise. Possible values:
    - Upsloping (generally normal).
    - Flat (may indicate heart disease).
    - Downsloping (suggests ischemia).

- **age:** The patient’s age in years.

- **disease (target):** The target variable indicating the presence and severity of heart disease:
    - 0 = No heart disease.
    - 1, 2, 3, 4 = Different stages of heart disease.

### **3. Data Information**

- **How should this problem be framed?**

    This is a supervised learning problem, as we have a labeled dataset where the target variable indicates the presence or absence of heart disease. The approach will be offline, training the model with historical data before deploying it for real-time predictions.

- **How should the solution’s performance be measured, as an initial intuition?**

    The model’s performance will be evaluated using metrics such as accuracy, precision, recall (sensitivity), F1-score, and the area under the ROC curve (AUC-ROC). These metrics will provide a comprehensive view of the model’s ability to correctly identify both patients with and without heart disease. From what I have heard, **recall** is one of the most widely used metrics in healthcare, since a false negative means a patient with the disease goes undetected.

- **Is the performance metric aligned with the problem’s objective?**

    Yes, the selected metrics align with the goal of correctly identifying patients at risk of heart disease while minimizing both false positives and false negatives.

- **What is the minimum performance required to achieve the problem’s objective?**

    While 100% performance is ideal, in practice, a model with accuracy and recall above 85% could be considered acceptable for clinical applications, provided its results are validated in different patient cohorts.

- **What are similar problems? Can existing experiences or tools be reused?**

    Similar problems include predicting other cardiovascular diseases, diabetes, and chronic conditions. Models and approaches developed for these cases can be leveraged and adapted to the specific context of heart disease.

- **Is there existing experience with this problem?**

    Yes, there is extensive literature and previous studies on heart disease prediction using machine learning techniques. ML researchers have used the Cleveland dataset to approach this problem before.

- **How can the problem be solved manually?**

    Traditionally, doctors assess the risk of heart disease by collecting medical history, conducting physical exams, and performing diagnostic tests such as electrocardiograms and lab analyses. Based on their experience and clinical guidelines, they determine the likelihood of disease presence.
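To make the metrics mentioned above concrete, here is a minimal sketch that computes accuracy, precision, recall and F1 from binary labels in plain Python. The labels are made up for illustration, and the helper name `classification_metrics` is hypothetical; in practice a library such as scikit-learn would likely be used instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary classification metrics (1 = disease, 0 = no disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, predicted sick
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, predicted healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, predicted sick
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, predicted healthy

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, only to exercise the function
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall here is the fraction of patients with the disease that the model actually catches, which is why it matters when a missed case is the costliest error.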


In [4]:
# Which columns are going to be used in the project?
heart_df.columns

Index(['rest_ecg', 'ca', 'thal', 'max_hr', 'exang', 'old_peak', 'chol',
       'rest_bp', 'data_source', 'chest_pain', 'disease', 'value',
       'chest_pain ', 'csv', 'sex', 'thal ', 'fbs', 'slope', 'rest_ecg ',
       'disease ', 'age'],
      dtype='object')

The duplicated columns have a blank space at the end of their name.
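One way to confirm this is to strip the whitespace from every name and count the repeats. The sketch below works on the list of names copied from the output above, so it does not need the dataset itself.

```python
from collections import Counter

# Column names copied from the `heart_df.columns` output above
raw_columns = [
    'rest_ecg', 'ca', 'thal', 'max_hr', 'exang', 'old_peak', 'chol',
    'rest_bp', 'data_source', 'chest_pain', 'disease', 'value',
    'chest_pain ', 'csv', 'sex', 'thal ', 'fbs', 'slope', 'rest_ecg ',
    'disease ', 'age',
]

# Names that differ from their stripped version carry extra whitespace
padded_names = [name for name in raw_columns if name != name.strip()]

# After stripping, any name that appears more than once is a duplicate
counts = Counter(name.strip() for name in raw_columns)
duplicated_names = sorted(name for name, n in counts.items() if n > 1)

print(padded_names)
print(duplicated_names)  # ['chest_pain', 'disease', 'rest_ecg', 'thal']
```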

In [5]:
columns_to_use = [
    'age',
    'sex',
    'chest_pain',
    'rest_bp',
    'chol',
    'fbs',
    'rest_ecg',
    'max_hr',
    'exang',
    'old_peak',
    'slope',
    'ca',
    'thal',
    'disease'
]

**Current Assumptions**
- The available data is representative of the target population.
- The selected variables have a significant correlation with the presence of heart disease.
- There are no significant biases in data collection.
- The model will be able to generalize well to unseen data.

### **4. Data Download**

In [6]:
heart_final_df = pd.read_csv(url_data, usecols=columns_to_use, low_memory=False)
heart_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   rest_ecg    6403 non-null   object 
 1   ca          6482 non-null   object 
 2   thal        6557 non-null   object 
 3   max_hr      6456 non-null   object 
 4   exang       6499 non-null   object 
 5   old_peak    6496 non-null   object 
 6   chol        6649 non-null   object 
 7   rest_bp     6661 non-null   object 
 8   chest_pain  6654 non-null   object 
 9   disease     6602 non-null   object 
 10  sex         6700 non-null   object 
 11  fbs         6624 non-null   float64
 12  slope       6495 non-null   object 
 13  age         6769 non-null   object 
dtypes: float64(1), object(13)
memory usage: 749.1+ KB
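The summary above shows that most numeric variables were loaded with the `object` dtype. As a rough sketch of how this could be inspected later, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN. The toy values, including the "unknown" token, are made up, and `numeric_columns` is a hypothetical selection, not the notebook's actual cleaning step.

```python
import pandas as pd

# Toy frame with numbers stored as strings (values are made up)
toy_df = pd.DataFrame(
    {
        "max_hr": ["165", "162", "unknown"],
        "chol": ["230", None, "271"],
    }
)

numeric_columns = ["max_hr", "chol"]  # hypothetical selection

converted_df = toy_df.copy()
for col in numeric_columns:
    # Entries that cannot be parsed as numbers become NaN
    converted_df[col] = pd.to_numeric(converted_df[col], errors="coerce")

# How many values per column were lost in the conversion
new_missing = converted_df.isna().sum() - toy_df.isna().sum()

print(converted_df.dtypes)
print(new_missing)
```

Comparing missing-value counts before and after the conversion shows how many entries were not valid numbers.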


In [7]:
# Save locally
# Define folder and file path (data/01_raw under the parent of the working directory)
DATA_DIR = Path.cwd().resolve().parents[0] / "data" / "01_raw"
file_path = DATA_DIR / "heartdisease_raw.csv"

# Create the folder if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
heart_final_df.to_csv(file_path, index=False)
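Since one goal of this notebook is to ensure the integrity of the data, a quick round-trip check after saving can be useful. The sketch below writes a small made-up frame to a temporary folder and reads it back; with the real data, `toy_df` and the temporary path would be replaced by `heart_final_df` and `file_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy_df = pd.DataFrame(
    {
        "age": [58, 50, 49],
        "sex": ["Male", "Female", "Female"],
        "fbs": [0.0, 0.0, 1.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir) / "heartdisease_raw.csv"
    toy_df.to_csv(tmp_path, index=False)
    file_was_written = tmp_path.exists()
    reloaded_df = pd.read_csv(tmp_path)

# The reloaded file should have the same shape and column names
same_shape = reloaded_df.shape == toy_df.shape
same_columns = list(reloaded_df.columns) == list(toy_df.columns)

print(file_was_written, same_shape, same_columns)
```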
