### Heart attack risk prediction dataset

The **heart attack risk classification** dataset contains **8,763 records** of patients from different countries, with a response column that indicates the presence or absence of heart attack risk.
It includes the following attributes:

- **Patient ID** - Unique identifier for each patient
- **Age** - Age of the patient
- **Sex** - Sex of the patient (Male/Female)
- **Cholesterol** - Cholesterol level of the patient
- **Blood Pressure** - Blood pressure of the patient (systolic/diastolic)
- **Heart Rate** - Heart rate of the patient
- **Diabetes** - Whether the patient has diabetes (1: Yes, 0: No)
- **Family History** - Family history of heart problems (1: Yes, 0: No)
- **Smoking** - Smoking status of the patient (1: Smoker, 0: Non-smoker)
- **Obesity** - Obesity status of the patient (1: Obese, 0: Not obese)
- **Alcohol Consumption** - Alcohol consumption level of the patient
- **Exercise Hours Per Week** - Number of hours of exercise per week
- **Diet** - Dietary habits of the patient (Healthy/Average/Unhealthy)
- **Previous Heart Problems** - Previous heart problems of the patient (1: Yes, 0: No)
- **Medication Use** - Medication use by the patient (1: Yes, 0: No)
- **Stress Level** - Stress level reported by the patient (1-10)
- **Sedentary Hours Per Day** - Hours of sedentary activity per day
- **Income** - Income level of the patient
- **BMI** - Body Mass Index (BMI) of the patient
- **Triglycerides** - Triglyceride level of the patient
- **Physical Activity Days Per Week** - Days of physical activity per week
- **Sleep Hours Per Day** - Hours of sleep per day
- **Country** - Country of the patient
- **Continent** - Continent where the patient resides
- **Hemisphere** - Hemisphere where the patient resides
- **Heart Attack Risk** - Presence of heart attack risk (1: Yes, 0: No)

In [40]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [41]:
raw_data = pd.read_csv("heart_attack_prediction_dataset.csv")
raw_data.head(5)

Unnamed: 0,Patient ID,Age,Sex,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,...,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk
0,BMW7812,67,Male,208,158/88,72,0,0,1,0,...,6.615001,261404,31.251233,286,0,6,Argentina,South America,Southern Hemisphere,0
1,CZE1114,21,Male,389,165/93,98,1,1,1,1,...,4.963459,285768,27.194973,235,1,7,Canada,North America,Northern Hemisphere,0
2,BNI9906,21,Female,324,174/99,72,1,0,0,0,...,9.463426,235282,28.176571,587,4,4,France,Europe,Northern Hemisphere,0
3,JLN3497,84,Male,383,163/100,73,1,1,1,0,...,7.648981,125640,36.464704,378,3,4,Canada,North America,Northern Hemisphere,0
4,GFO8847,66,Male,318,91/88,93,1,1,1,1,...,1.514821,160555,21.809144,231,1,5,Thailand,Asia,Northern Hemisphere,0


In [42]:
raw_data.shape

(8763, 26)

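#### Check for null values

If dropping every row that contains a null leaves the shape unchanged, the dataset has no missing values.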

In [43]:
raw_data.dropna().shape 

(8763, 26)
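Comparing shapes only says whether any row has a null. A per-column count shows where the nulls are. The following is a minimal sketch on a toy frame (the values are hypothetical and only stand in for `raw_data`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "Age": [67, 21, np.nan],
    "Cholesterol": [208, 389, 324],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# True if at least one value is missing anywhere in the frame.
has_nulls = bool(toy.isna().any().any())
print(has_nulls)
```

On the real data, `raw_data.isna().sum()` would be expected to report zero for every column, given the unchanged shape above.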

#### Check for duplicate rows

If dropping duplicates leaves the shape unchanged, no row is an exact copy of another.

In [44]:
raw_data.drop_duplicates().shape

(8763, 26)
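`drop_duplicates()` only catches rows that match in every column. A repeated identifier with different measurements would go unnoticed. The sketch below, on hypothetical toy data, counts both kinds of duplicate:

```python
import pandas as pd

# Toy frame (hypothetical IDs and ages, not taken from the dataset).
toy = pd.DataFrame({
    "Patient ID": ["AAA0001", "BBB0002", "AAA0001", "BBB0002"],
    "Age": [67, 21, 67, 30],
})

# Rows that repeat an earlier row in every column.
n_duplicate_rows = int(toy.duplicated().sum())

# Rows whose Patient ID already appeared, regardless of the other columns.
n_duplicate_ids = int(toy["Patient ID"].duplicated().sum())

print(n_duplicate_rows, n_duplicate_ids)
```

Running the same two checks on `raw_data` before dropping `Patient ID` is an optional extra step, not something the original notebook does.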

#### Inspect columns

In [45]:
raw_data.columns

Index(['Patient ID', 'Age', 'Sex', 'Cholesterol', 'Blood Pressure',
       'Heart Rate', 'Diabetes', 'Family History', 'Smoking', 'Obesity',
       'Alcohol Consumption', 'Exercise Hours Per Week', 'Diet',
       'Previous Heart Problems', 'Medication Use', 'Stress Level',
       'Sedentary Hours Per Day', 'Income', 'BMI', 'Triglycerides',
       'Physical Activity Days Per Week', 'Sleep Hours Per Day', 'Country',
       'Continent', 'Hemisphere', 'Heart Attack Risk'],
      dtype='object')

In [46]:
raw_data.dtypes

Patient ID                          object
Age                                  int64
Sex                                 object
Cholesterol                          int64
Blood Pressure                      object
Heart Rate                           int64
Diabetes                             int64
Family History                       int64
Smoking                              int64
Obesity                              int64
Alcohol Consumption                  int64
Exercise Hours Per Week            float64
Diet                                object
Previous Heart Problems              int64
Medication Use                       int64
Stress Level                         int64
Sedentary Hours Per Day            float64
Income                               int64
BMI                                float64
Triglycerides                        int64
Physical Activity Days Per Week      int64
Sleep Hours Per Day                  int64
Country                             object
Continent                           object
Hemisphere                          object
Heart Attack Risk                    int64
dtype: object

#### Split the blood pressure column

In [47]:
raw_data[['Systolic', 'Diastolic']] = raw_data['Blood Pressure'].str.split('/', expand=True)
raw_data.drop('Blood Pressure', axis=1, inplace=True)
raw_data['Systolic'] = pd.to_numeric(raw_data['Systolic'])
raw_data['Diastolic'] = pd.to_numeric(raw_data['Diastolic'])
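The same split can be checked on a few readings copied from the preview above. This sketch assumes every reading has the form `systolic/diastolic`. Passing `errors="coerce"` is an added safeguard, not part of the original cell: a malformed reading becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Three readings taken from the first rows of the preview.
toy = pd.DataFrame({"Blood Pressure": ["158/88", "165/93", "91/88"]})

# Split once on "/" into two string columns (0 = systolic, 1 = diastolic).
parts = toy["Blood Pressure"].str.split("/", expand=True)

# Convert to numbers; values that cannot be parsed become NaN.
toy["Systolic"] = pd.to_numeric(parts[0], errors="coerce")
toy["Diastolic"] = pd.to_numeric(parts[1], errors="coerce")
toy = toy.drop(columns="Blood Pressure")

# Count readings that failed to parse.
n_bad = int(toy[["Systolic", "Diastolic"]].isna().sum().sum())
print(toy)
print(n_bad)
```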

#### Convert categorical column values to numeric codes

In [48]:
raw_data['Sex'] = raw_data['Sex'].replace({'Male': 1, 'Female': 0})

In [49]:
raw_data['Diet'] = raw_data['Diet'].replace({'Healthy': 2, 'Average': 1, 'Unhealthy': 0})
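`replace` leaves any label missing from the dictionary untouched, so a misspelled or unexpected category stays in the column as a string. One possible alternative, sketched here on toy data, is `Series.map` followed by an explicit check for unmapped labels:

```python
import pandas as pd

# Toy column using the three Diet labels from the dataset description.
toy = pd.DataFrame({"Diet": ["Healthy", "Average", "Unhealthy", "Average"]})
diet_codes = {"Healthy": 2, "Average": 1, "Unhealthy": 0}

# map() returns NaN for any label that is not a key of the dictionary.
encoded = toy["Diet"].map(diet_codes)

# Fail loudly if a label was not covered by the mapping.
unmapped = toy.loc[encoded.isna(), "Diet"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped Diet labels: {list(unmapped)}")

toy["Diet"] = encoded.astype("int64")
print(toy["Diet"].tolist())
```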

In [50]:
countries = {
    'Argentina': 0, 'Australia': 1, 'Brazil': 2, 'Canada': 3, 'China': 4,
    'Colombia': 5, 'France': 6, 'Germany': 7, 'India': 8, 'Italy': 9,
    'Japan': 10, 'New Zealand': 11, 'Nigeria': 12, 'South Africa': 13,
    'South Korea': 14, 'Spain': 15, 'Thailand': 16, 'United Kingdom': 17,
    'United States': 18, 'Vietnam': 19
}

raw_data['Country'] = raw_data['Country'].replace(countries)

In [51]:
continent = {
    'Africa': 0, 'Asia': 1, 'Australia': 2, 'Europe': 3,
    'North America': 4, 'South America': 5
}

raw_data['Continent'] = raw_data['Continent'].replace(continent)

In [52]:
raw_data['Hemisphere'] = raw_data['Hemisphere'].replace({'Northern Hemisphere': 0, 'Southern Hemisphere': 1})
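Integer codes for `Country` and `Continent` follow alphabetical order, which has no real meaning. Some models, linear ones for example, may treat those codes as magnitudes. One-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on toy data and is not the approach the notebook takes:

```python
import pandas as pd

# Toy column with continent labels that appear in the dataset preview.
toy = pd.DataFrame(
    {"Continent": ["South America", "North America", "Europe", "Asia"]}
)

# One indicator column per category, stored as 0/1 integers.
one_hot = pd.get_dummies(toy, columns=["Continent"], dtype=int)
print(one_hot)
```

Tree-based models are usually less sensitive to arbitrary integer codes, so whether one-hot encoding helps depends on the model trained afterwards.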

In [53]:
raw_data

Unnamed: 0,Patient ID,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk,Systolic,Diastolic
0,BMW7812,67,1,208,72,0,0,1,0,0,...,31.251233,286,0,6,0,5,1,0,158,88
1,CZE1114,21,1,389,98,1,1,1,1,1,...,27.194973,235,1,7,3,4,0,0,165,93
2,BNI9906,21,0,324,72,1,0,0,0,0,...,28.176571,587,4,4,6,3,0,0,174,99
3,JLN3497,84,1,383,73,1,1,1,0,1,...,36.464704,378,3,4,3,4,0,0,163,100
4,GFO8847,66,1,318,93,1,1,1,1,0,...,21.809144,231,1,5,16,1,0,0,91,88
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8758,MSV9918,60,1,121,61,1,1,1,0,1,...,19.655895,67,7,7,16,1,0,0,94,76
8759,QSV6764,28,0,120,73,1,0,0,1,0,...,23.993866,617,4,9,3,4,0,0,157,102
8760,XKA5925,47,1,250,105,0,1,1,1,1,...,35.406146,527,4,4,2,5,1,1,161,75
8761,EPE6801,36,1,178,60,1,0,1,0,0,...,27.294020,114,2,8,2,5,1,0,119,67


#### Drop the ID column

In [54]:
raw_data.drop(columns=['Patient ID'], inplace=True)

#### Balance of the response column (Heart Attack Risk)

In [55]:
raw_data['Heart Attack Risk'].value_counts()

0    5624
1    3139
Name: Heart Attack Risk, dtype: int64
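The counts are easier to read as proportions. This sketch rebuilds them from the two counts shown above. On the real data, `raw_data['Heart Attack Risk'].value_counts(normalize=True)` should give the same result.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 5624, 1: 3139}, name="Heart Attack Risk")

# Share of each class, rounded to four decimals.
proportions = (counts / counts.sum()).round(4)
print(proportions)

# Ratio of negative to positive cases.
imbalance_ratio = counts[0] / counts[1]
print(round(imbalance_ratio, 2))
```

About 64% of patients fall in class 0 and 36% in class 1. The response is therefore moderately imbalanced.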

#### Separate the response column from the data

In [56]:
response = raw_data['Heart Attack Risk']
data = raw_data.drop(columns=['Heart Attack Risk'])

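#### Split the dataset into training, validation, and test sets

The split is done in two steps: 10% of the data is held out for testing, then 20% of the remaining 90% is held out for validation. This gives roughly 72% training, 18% validation, and 10% test.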

In [57]:
X_train, X_test, y_train, y_test = train_test_split(data, response, test_size=0.1, random_state=42)

In [58]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
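With a response split of about 64/36, a plain random split can leave each subset with a slightly different class ratio. Passing `stratify=` to `train_test_split` keeps the ratio nearly constant. The sketch below uses synthetic features with the same class counts as the dataset. The `stratify` argument and the `X_trainval` / `y_trainval` names are additions for illustration and are not part of the original cells.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the dataset, dummy feature values.
y = pd.Series(np.r_[np.zeros(5624, dtype=int), np.ones(3139, dtype=int)])
X = pd.DataFrame({"feature": np.arange(len(y))})

# Step 1: hold out 10% for testing, preserving the class ratio.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Step 2: hold out 20% of the remainder for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42, stratify=y_trainval
)

# Subset sizes and positive-class rate in each subset.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), round(labels.mean(), 3))
```

Using a separate name for the intermediate set also avoids overwriting `X_train` between the two steps, which makes the cells safer to rerun out of order.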

#### Save the data and the response column

In [59]:
X_train.to_csv("process_dataset/train_data.csv", index=False)
X_val.to_csv("process_dataset/val_data.csv", index=False)
X_test.to_csv("process_dataset/test_data.csv", index=False)
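`to_csv` does not create missing directories, so these calls fail if `process_dataset/` does not exist yet. A small guard can create it first. The sketch below writes into a temporary directory so that it runs anywhere. In the notebook, the assumed equivalent would be `Path("process_dataset")`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary base directory for the example; the notebook would use the
# working directory instead (assumption).
base = Path(tempfile.mkdtemp())
out_dir = base / "process_dataset"

# Create the folder if needed; no error if it already exists.
out_dir.mkdir(parents=True, exist_ok=True)

# Toy frame (hypothetical values) written without the index column.
toy = pd.DataFrame({"Age": [67, 21], "Cholesterol": [208, 389]})
csv_path = out_dir / "toy_data.csv"
toy.to_csv(csv_path, index=False)

print(csv_path.exists())
```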

In [60]:
with open("process_dataset/y_train.npy", 'wb') as doc_export:
    np.save(doc_export, y_train)

In [61]:
with open("process_dataset/y_val.npy", 'wb') as doc_export:
    np.save(doc_export, y_val)

In [62]:
with open("process_dataset/y_test.npy", 'wb') as doc_export:
    np.save(doc_export, y_test)
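`np.save` stores only the values of a pandas Series. The index and the name are dropped. A quick round trip, sketched here with toy labels and a temporary file, confirms that the saved array loads back unchanged:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy labels (hypothetical) with a shuffled index, as after train_test_split.
y_toy = pd.Series([0, 1, 0, 0], index=[7, 2, 9, 4], name="Heart Attack Risk")

# Save the underlying values; np.save accepts a path directly.
path = Path(tempfile.mkdtemp()) / "y_toy.npy"
np.save(path, y_toy.to_numpy())

# Load the array back and compare it with the original values.
loaded = np.load(path)
print(loaded)
print(np.array_equal(loaded, y_toy.to_numpy()))
```

Because the index is lost, the label arrays line up with the saved CSV files only by row order. That holds here because both are written from the same split without reordering.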