# **Heart Disease Prediction**

## Data Collection

This notebook downloads the dataset required for the project. The goal is to gather the necessary data, check its integrity, and store a raw copy for further processing and analysis.

`Simón Correa Marín`

### **1. Import Libraries**

In [1]:
# base libraries for data science
from pathlib import Path
import pandas as pd


### **2. Collect Data**

In [2]:
url_data = "https://github.com/JoseRZapata/Data_analysis_notebooks/raw/refs/heads/main/data/datasets/corazon_data.csv"
heart_df = pd.read_csv(url_data, low_memory=False)  # read the file in one pass so each column gets a single inferred dtype
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rest_ecg     6403 non-null   object 
 1   ca           6482 non-null   object 
 2   thal         6557 non-null   object 
 3   max_hr       6456 non-null   object 
 4   exang        6499 non-null   object 
 5   old_peak     6496 non-null   object 
 6   chol         6649 non-null   object 
 7   rest_bp      6661 non-null   object 
 8   data_source  6845 non-null   object 
 9   chest_pain   6654 non-null   object 
 10  disease      6602 non-null   object 
 11  value        6845 non-null   float64
 12  chest_pain   6654 non-null   object 
 13  csv          0 non-null      float64
 14  sex          6700 non-null   object 
 15  thal         6557 non-null   object 
 16  fbs          6624 non-null   float64
 17  slope        6495 non-null   object 
 18  rest_ecg     6403 non-null   object 
 19  diseas
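
The `info()` output above shows many columns with missing values and one column (`csv`) with no values at all. A minimal sketch of how that could be summarized per column follows. It runs on a small made-up frame (`toy_df`) rather than on `heart_df`, and the names `missing_summary` and `empty_columns` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for heart_df (values are made up for illustration)
toy_df = pd.DataFrame(
    {
        "chol": ["199", "250", None, "305"],
        "csv": [np.nan, np.nan, np.nan, np.nan],
        "age": [39, 37, 51, 48],
    }
)

# Count and percentage of missing values per column
missing_summary = pd.DataFrame(
    {
        "n_missing": toy_df.isna().sum(),
        "pct_missing": toy_df.isna().mean().mul(100).round(1),
    }
)

# Columns that contain no data at all
empty_columns = missing_summary.index[
    missing_summary["n_missing"] == len(toy_df)
].tolist()

print(missing_summary)
print(empty_columns)
```

Applied to the real data, the same two steps would be run on `heart_df` instead of `toy_df`.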

In [3]:
heart_df.sample(10)

| | rest_ecg | ca | thal | max_hr | exang | old_peak | chol | rest_bp | data_source | chest_pain | ... | value | chest_pain.1 | csv | sex | thal.1 | fbs | slope | rest_ecg.1 | disease | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5977 | normal | 0.0 | normal | 179 | 0 | 0.0 | 199 | 94 | dataset | nonanginal | ... | 105137.0 | nonanginal | | Female | normal | 0.0 | 1 | normal | 0 | 39 |
| 4001 | normal | 0.0 | normal | 187 | 0 | 3.5 | 250 | 130 | dataset | nonanginal | ... | 105137.0 | nonanginal | | Male | normal | 0.0 | 3 | normal | 0 | 37 |
| 6445 | normal | 0.0 | reversable | 142 | 1 | 1.2 | 305 | 130 | dataset | asymptomatic | ... | 105137.0 | asymptomatic | | Female | reversable | 0.0 | 2 | normal | 1 | 51 |
| 4239 | left ventricular hypertrophy | 0.0 | normal | 186 | 0 | 0.0 | 222 | 122 | dataset | asymptomatic | ... | 105137.0 | asymptomatic | | Male | normal | 0.0 | 1 | left ventricular hypertrophy | 0 | 48 |
| 5959 | left ventricular hypertrophy | 0.0 | reversable | 156 | 1 | 0.0 | 282 | 126 | dataset | asymptomatic | ... | 105137.0 | asymptomatic | | Male | reversable | 0.0 | 1 | left ventricular hypertrophy | 1 | 35 |
| 4637 | left ventricular hypertrophy | 3.0 | reversable | 124 | 0 | 1.0 | 289 | 165 | dataset | asymptomatic | ... | 105137.0 | asymptomatic | | Male | reversable | 1.0 | 2 | left ventricular hypertrophy | 1 | 57 |
| 3990 | normal | 1.0 | reversable | 156 | 0 | 0.1 | 234 | 100 | dataset | asymptomatic | ... | 105137.0 | asymptomatic | | Male | reversable | 0.0 | 1 | normal | 1 | 58 |
| 6013 | normal | 1.0 | reversable | 97 | 0 | 1.2 | 263 | 130 | dataset | nonanginal | ... | 105137.0 | nonanginal | | Female | reversable | 0.0 | 2 | normal | 1 | 62 |
| 3147 | left ventricular hypertrophy | 0.0 | reversable | 128 | 0 | 2.6 | 243 | 150 | dataset | asymptomatic | ... | 105137.0 | asymptomatic | | Male | reversable | 0.0 | 2 | left ventricular hypertrophy | 1 | 50 |
| 4910 | normal | 0.0 | reversable | 132 | 0 | 1.2 | 264 | 110 | dataset | typical | ... | 105137.0 | typical | | Male | reversable | 0.0 | 2 | normal | 1 | 45 |


### **3. Data Information**

In [5]:
# Which columns are going to be used in the project?
heart_df.columns

Index(['rest_ecg', 'ca', 'thal', 'max_hr', 'exang', 'old_peak', 'chol',
       'rest_bp', 'data_source', 'chest_pain', 'disease', 'value',
       'chest_pain ', 'csv', 'sex', 'thal ', 'fbs', 'slope', 'rest_ecg ',
       'disease ', 'age'],
      dtype='object')

In [None]:
columns_to_use =
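
The column list is not filled in here. One possible way to derive it, given that the columns printed earlier include names repeated with a trailing space (`'thal '`, `'chest_pain '`) and an empty `csv` column, is sketched below. This is an assumption about a reasonable selection rule, not the notebook's actual choice: it keeps the first occurrence of each name (ignoring surrounding whitespace) and skips columns with no data, without checking that same-named columns hold identical values. `toy_df` is a made-up stand-in for `heart_df`.

```python
import pandas as pd

# Toy frame with the same kinds of problems as heart_df:
# a column repeated under a name with a trailing space, and an empty column
toy_df = pd.DataFrame(
    {
        "thal": ["normal", "reversable"],
        "thal ": ["normal", "reversable"],
        "csv": [None, None],
        "age": [39, 37],
    }
)

seen = set()
columns_to_use = []
for col in toy_df.columns:
    clean = col.strip()
    if clean in seen:
        continue  # same name as a column already kept
    if toy_df[col].isna().all():
        continue  # column holds no data
    seen.add(clean)
    columns_to_use.append(col)

print(columns_to_use)  # ['thal', 'age']
```

The resulting names are kept exactly as they appear in the file, so they can be passed on to `usecols` unchanged.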

### **4. Data Download**

In [None]:
heart_final_df = pd.read_csv(url_data, usecols=columns_to_use, low_memory=False)
heart_final_df.info()
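
As a small illustration of how `usecols` behaves, the sketch below reads a made-up in-memory CSV (`csv_text`, not the project data). It shows that only the requested columns are loaded, and that a name absent from the file (here a hypothetical `"thal "` with a trailing space) raises a `ValueError`, which is why the names must match the file's header exactly.

```python
from io import StringIO

import pandas as pd

# Made-up CSV content used only for this illustration
csv_text = "age,sex,csv,chol\n39,Female,,199\n37,Male,,250\n"

# Load only the requested columns
wanted = ["age", "sex", "chol"]
subset_df = pd.read_csv(StringIO(csv_text), usecols=wanted)
print(subset_df.columns.tolist())

# A requested name that is not in the header makes read_csv fail
try:
    pd.read_csv(StringIO(csv_text), usecols=["age", "thal "])
except ValueError as err:
    missing_error = str(err)
    print(missing_error)
```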

In [None]:
# Save locally
# Define the target directory (data/01_raw next to the notebook folder) and the file
DATA_DIR = Path.cwd().resolve().parents[0] / "data/01_raw"

file_path = DATA_DIR / "titanic_raw.csv"

# Create the directory if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
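
The cell above prepares the directory and the file path but does not show the write itself. A minimal sketch of that last step, under stated assumptions, follows: it uses a temporary directory instead of the project tree, a made-up frame (`toy_df`) instead of `heart_final_df`, and a hypothetical file name (`heart_raw.csv`). Reading the file back is one simple way to confirm that the saved copy matches what was in memory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for heart_final_df
toy_df = pd.DataFrame({"age": [39, 37], "disease": [0, 0]})

with tempfile.TemporaryDirectory() as tmp:
    # Same layout as above, but rooted in a temporary directory
    data_dir = Path(tmp) / "data" / "01_raw"
    data_dir.mkdir(parents=True, exist_ok=True)

    file_path = data_dir / "heart_raw.csv"  # hypothetical file name
    toy_df.to_csv(file_path, index=False)  # index=False avoids an extra unnamed column

    # Read the file back and compare it with the in-memory frame
    reloaded = pd.read_csv(file_path)
    round_trip_ok = reloaded.equals(toy_df)

print(round_trip_ok)
```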
