### Heart attack risk prediction dataset

The **heart attack risk patient classification** dataset contains **8,763 patient records** from different countries, with a response column that indicates the presence or absence of heart attack risk.
It includes the following attributes:

- **Patient ID** - Unique identifier for each patient
- **Age** - Age of the patient
- **Sex** - Sex of the patient (Male/Female)
- **Cholesterol** - Cholesterol level of the patient
- **Blood Pressure** - Blood pressure of the patient (systolic/diastolic)
- **Heart Rate** - Heart rate of the patient
- **Diabetes** - Whether the patient has diabetes (1: Yes, 0: No)
- **Family History** - Family history of heart problems (1: Yes, 0: No)
- **Smoking** - Smoking status of the patient (1: Smoker, 0: Non-smoker)
- **Obesity** - Obesity status of the patient (1: Obese, 0: Not obese)
- **Alcohol Consumption** - Alcohol consumption of the patient (stored as 0/1 in this file)
- **Exercise Hours Per Week** - Number of hours of exercise per week
- **Diet** - Dietary habits of the patient (Healthy/Average/Unhealthy)
- **Previous Heart Problems** - Previous heart problems of the patient (1: Yes, 0: No)
- **Medication Use** - Medication use by the patient (1: Yes, 0: No)
- **Stress Level** - Stress level reported by the patient (1-10)
- **Sedentary Hours Per Day** - Hours of sedentary activity per day
- **Income** - Income level of the patient
- **BMI** - Body Mass Index (BMI) of the patient
- **Triglycerides** - Triglyceride level of the patient
- **Physical Activity Days Per Week** - Days of physical activity per week
- **Sleep Hours Per Day** - Hours of sleep per day
- **Country** - Country of the patient
- **Continent** - Continent where the patient resides
- **Hemisphere** - Hemisphere where the patient resides
- **Heart Attack Risk** - Presence of heart attack risk (1: Yes, 0: No)

In [16]:
import numpy as np
import pandas as pd

In [17]:
raw_data = pd.read_csv("heart_attack_prediction_dataset.csv")
raw_data.head(5)

,Patient ID,Age,Sex,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,...,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk
0,BMW7812,67,Male,208,158/88,72,0,0,1,0,...,6.615001,261404,31.251233,286,0,6,Argentina,South America,Southern Hemisphere,0
1,CZE1114,21,Male,389,165/93,98,1,1,1,1,...,4.963459,285768,27.194973,235,1,7,Canada,North America,Northern Hemisphere,0
2,BNI9906,21,Female,324,174/99,72,1,0,0,0,...,9.463426,235282,28.176571,587,4,4,France,Europe,Northern Hemisphere,0
3,JLN3497,84,Male,383,163/100,73,1,1,1,0,...,7.648981,125640,36.464704,378,3,4,Canada,North America,Northern Hemisphere,0
4,GFO8847,66,Male,318,91/88,93,1,1,1,1,...,1.514821,160555,21.809144,231,1,5,Thailand,Asia,Northern Hemisphere,0


In [18]:
raw_data.shape

(8763, 26)

In [19]:
raw_data.describe()

,Age,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Medication Use,Stress Level,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Heart Attack Risk
count,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0,8763.0
mean,53.707977,259.877211,75.021682,0.652288,0.492982,0.896839,0.501426,0.598083,10.014284,0.495835,0.498345,5.469702,5.99369,158263.181901,28.891446,417.677051,3.489672,7.023508,0.358211
std,21.249509,80.863276,20.550948,0.476271,0.499979,0.304186,0.500026,0.490313,5.783745,0.500011,0.500026,2.859622,3.466359,80575.190806,6.319181,223.748137,2.282687,1.988473,0.479502
min,18.0,120.0,40.0,0.0,0.0,0.0,0.0,0.0,0.002442,0.0,0.0,1.0,0.001263,20062.0,18.002337,30.0,0.0,4.0,0.0
25%,35.0,192.0,57.0,0.0,0.0,1.0,0.0,0.0,4.981579,0.0,0.0,3.0,2.998794,88310.0,23.422985,225.5,2.0,5.0,0.0
50%,54.0,259.0,75.0,1.0,0.0,1.0,1.0,1.0,10.069559,0.0,0.0,5.0,5.933622,157866.0,28.768999,417.0,3.0,7.0,0.0
75%,72.0,330.0,93.0,1.0,1.0,1.0,1.0,1.0,15.050018,1.0,1.0,8.0,9.019124,227749.0,34.324594,612.0,5.0,9.0,1.0
max,90.0,400.0,110.0,1.0,1.0,1.0,1.0,1.0,19.998709,1.0,1.0,10.0,11.999313,299954.0,39.997211,800.0,7.0,10.0,1.0


#### Check for null values

In [20]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 26 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Patient ID                       8763 non-null   object 
 1   Age                              8763 non-null   int64  
 2   Sex                              8763 non-null   object 
 3   Cholesterol                      8763 non-null   int64  
 4   Blood Pressure                   8763 non-null   object 
 5   Heart Rate                       8763 non-null   int64  
 6   Diabetes                         8763 non-null   int64  
 7   Family History                   8763 non-null   int64  
 8   Smoking                          8763 non-null   int64  
 9   Obesity                          8763 non-null   int64  
 10  Alcohol Consumption              8763 non-null   int64  
 11  Exercise Hours Per Week          8763 non-null   float64
 12  Diet                

In [21]:
raw_data.isnull().sum()

Patient ID                         0
Age                                0
Sex                                0
Cholesterol                        0
Blood Pressure                     0
Heart Rate                         0
Diabetes                           0
Family History                     0
Smoking                            0
Obesity                            0
Alcohol Consumption                0
Exercise Hours Per Week            0
Diet                               0
Previous Heart Problems            0
Medication Use                     0
Stress Level                       0
Sedentary Hours Per Day            0
Income                             0
BMI                                0
Triglycerides                      0
Physical Activity Days Per Week    0
Sleep Hours Per Day                0
Country                            0
Continent                          0
Hemisphere                         0
Heart Attack Risk                  0
dtype: int64
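
Because every count above is zero, the output does not show what a positive result would look like. The following is a minimal sketch on a toy frame with made-up values (the names `toy`, `null_counts`, `null_share` and `has_any_null` are illustrative and not part of the notebook), showing the same check plus the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and a few deliberate gaps.
toy = pd.DataFrame({
    "Age": [67, np.nan, 21, 84],
    "Diet": ["Healthy", "Average", None, None],
})

# Number of missing values per column (same call as in the cell above).
null_counts = toy.isnull().sum()

# Fraction of missing values per column: the mean of a boolean mask.
null_share = toy.isnull().mean()

# Single yes/no answer for the whole frame.
has_any_null = bool(toy.isnull().values.any())

print(null_counts)
print(null_share)
print(has_any_null)
```

The per-column share is often easier to act on than raw counts when deciding whether to impute or drop a column.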

#### Check for duplicate rows

In [22]:
raw_data.drop_duplicates(keep=False).shape

(8763, 26)
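
The shape is unchanged, so the data contains no fully duplicated rows. Note that `keep=False` drops every copy of a duplicated row rather than keeping one, which matters if the result is reused. A small sketch on a toy frame with made-up values (the variable names are illustrative) compares it with `duplicated()` and the default `keep="first"`:

```python
import pandas as pd

# Toy frame (made-up values) where row 2 repeats row 0.
toy = pd.DataFrame({
    "Age": [67, 21, 67],
    "Cholesterol": [208, 389, 208],
})

# duplicated() flags each repeat after the first occurrence.
n_duplicates = int(toy.duplicated().sum())

# keep=False removes all copies of a duplicated row, including the first.
shape_keep_false = toy.drop_duplicates(keep=False).shape

# The default keep="first" retains one copy of each distinct row.
shape_keep_first = toy.drop_duplicates().shape

print(n_duplicates, shape_keep_false, shape_keep_first)
```

For a pure check, counting with `duplicated().sum()` states the intent more directly than comparing shapes.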

#### Split the blood pressure column

In [23]:
raw_data[['Systolic', 'Diastolic']] = raw_data['Blood Pressure'].str.split('/', expand=True)
raw_data[['Systolic', 'Diastolic']] = raw_data[['Systolic', 'Diastolic']].astype(int)
raw_data.drop('Blood Pressure', axis=1, inplace=True)
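
The cast with `astype(int)` raises if any reading is not a clean `systolic/diastolic` string. As a hedged alternative, the sketch below works on a toy frame that reuses a few readings from the preview above and counts malformed entries with `pd.to_numeric(..., errors="coerce")` before casting. The names `toy`, `parts` and `n_malformed` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame reusing the "systolic/diastolic" string format.
toy = pd.DataFrame({"Blood Pressure": ["158/88", "165/93", "91/88"]})

# Split once on "/" into two string columns labelled 0 and 1.
parts = toy["Blood Pressure"].str.split("/", n=1, expand=True)

# errors="coerce" turns malformed readings into NaN instead of raising,
# so they can be counted before the integer cast.
systolic = pd.to_numeric(parts[0], errors="coerce")
diastolic = pd.to_numeric(parts[1], errors="coerce")
n_malformed = int(systolic.isna().sum() + diastolic.isna().sum())

# Cast only when every reading parsed cleanly.
if n_malformed == 0:
    toy["Systolic"] = systolic.astype(int)
    toy["Diastolic"] = diastolic.astype(int)
    toy = toy.drop(columns="Blood Pressure")

print(toy)
```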

#### Drop the ID column

In [24]:
raw_data.drop('Patient ID', axis=1, inplace=True)

#### Class balance of the response

In [25]:
raw_data["Heart Attack Risk"].value_counts()

Heart Attack Risk
0    5624
1    3139
Name: count, dtype: int64
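
The raw counts are easier to interpret as proportions. This short sketch copies the two counts from the output above into a Series (the names `counts`, `proportions` and `imbalance_ratio` are illustrative) and derives the class shares and the majority-to-minority ratio:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 5624, 1: 3139}, name="count")

# Share of each class in the full dataset.
proportions = counts / counts.sum()

# How many majority-class rows exist per minority-class row.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(4).to_dict())
print(round(imbalance_ratio, 2))
```

The share of class 1 agrees with the mean of `Heart Attack Risk` reported by `describe()` (0.358211), since the mean of a 0/1 column is the fraction of ones. On the DataFrame itself, `value_counts(normalize=True)` returns the same proportions directly.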

#### Undersample the majority class to balance the dataset

In [26]:
class_1 = raw_data[raw_data['Heart Attack Risk'] == 1]
class_0 = raw_data[raw_data['Heart Attack Risk'] == 0]

# Undersample the majority class (here, class 0) down to the size of class 1
class_0_sampled = class_0.sample(n=len(class_1), random_state=42)

In [27]:
balanced_data = pd.concat([class_1, class_0_sampled])
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)
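
The two cells above hard-code which class is the majority. A more general variant, offered only as a sketch, downsamples every class to the size of the smallest one with `DataFrameGroupBy.sample`. The helper `undersample` and the toy frame are hypothetical, and with the same seed this variant is not guaranteed to select the same rows as the notebook's cells.

```python
import pandas as pd


def undersample(df, target, random_state=42):
    """Downsample every class in `target` to the size of the smallest class."""
    n_min = int(df[target].value_counts().min())
    sampled = df.groupby(target).sample(n=n_min, random_state=random_state)
    # Shuffle so the classes are not stored in contiguous blocks.
    return sampled.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy frame (made-up values): seven rows of class 0, three of class 1.
toy = pd.DataFrame({
    "Age": range(10),
    "Heart Attack Risk": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

balanced_toy = undersample(toy, "Heart Attack Risk")
print(balanced_toy["Heart Attack Risk"].value_counts())
```

Undersampling discards majority-class rows, so it trades data volume for balance. If a held-out test set is needed, splitting before resampling keeps the test distribution representative.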

In [28]:
balanced_data

,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk,Systolic,Diastolic
0,19,Female,143,51,1,0,0,0,0,16.978907,...,39.348247,208,7,8,Spain,Europe,Southern Hemisphere,0,138,85
1,33,Male,170,46,1,1,1,0,1,2.210484,...,33.982475,680,1,9,United Kingdom,Europe,Northern Hemisphere,0,166,101
2,85,Male,201,105,0,0,1,1,1,12.466069,...,37.491882,382,1,10,Argentina,South America,Southern Hemisphere,1,107,96
3,37,Female,150,87,0,0,0,0,1,7.893431,...,34.952897,359,0,4,China,Asia,Northern Hemisphere,0,111,67
4,18,Female,233,74,0,1,0,0,1,18.571328,...,39.428916,704,4,5,Thailand,Asia,Northern Hemisphere,0,136,93
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6273,80,Male,151,108,1,0,1,1,0,1.617427,...,36.497506,651,0,7,Colombia,South America,Northern Hemisphere,0,160,74
6274,27,Female,396,79,1,0,0,0,0,4.601646,...,21.534604,754,0,10,Brazil,South America,Southern Hemisphere,0,129,108
6275,44,Female,339,85,1,0,1,1,1,14.725646,...,21.402346,269,1,8,Italy,Europe,Southern Hemisphere,0,177,94
6276,70,Male,123,76,0,1,1,0,1,2.800186,...,22.434137,540,2,8,New Zealand,Australia,Southern Hemisphere,0,157,98


In [29]:
balanced_data.to_csv('process_dataset/01_first_result_balanced.csv', index=False)

In [30]:
raw_data.to_csv('process_dataset/01_first_result.csv', index=False)
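
`to_csv` does not create missing directories, so the two calls above assume that `process_dataset/` already exists. The sketch below shows one way to create the folder first and to confirm that `index=False` leaves no extra index column when the file is read back. It writes to a temporary directory to avoid side effects, and the file name `toy_result.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({"Age": [67, 21], "Heart Attack Risk": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for 'process_dataset'; created if it does not exist yet.
    out_dir = Path(tmp) / "process_dataset"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "toy_result.csv"  # hypothetical file name
    toy.to_csv(out_path, index=False)

    # Read the file back to check what was actually written.
    round_trip = pd.read_csv(out_path)

# With index=False the saved file has exactly the original columns.
columns_match = list(round_trip.columns) == list(toy.columns)
print(columns_match)
```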