# Lab – DataFrames with Pandas (Healthcare Stroke Dataset)
In this lab you will use the **real** dataset loaded by the instructor to practice:
- Column selection
- Filtering
- Cleaning
- Groupby
- Creating columns
- Applying functions

**Dataset:** `healthcare-dataset-stroke-data.csv`

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('./data/healthcare-dataset-stroke-data.csv')
df.head()

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


## Exercise 1 – First analyses with Pandas
1. Show the dimensions of the dataset.
2. Show the first 10 rows.
3. Check the data types.
4. Count the null values.

In [None]:
# Item 1: Dimensions (rows, columns)
df.shape

(5110, 12)

In [5]:
# Item 2: First 10 rows
df.head(10)

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1


In [7]:
# Item 3: Data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
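
The `info()` output above lists five `object` columns that hold a small set of repeated labels. As an optional aside, here is a minimal sketch of converting such columns to the `category` dtype. The `toy` frame and its values are hypothetical stand-ins for the lab dataset, not data taken from it.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the lab dataset, values are made up.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes"],
    "age": [67.0, 61.0, 80.0],
})

# Convert the label columns explicitly; numeric columns are left untouched.
converted = toy.astype({"gender": "category", "ever_married": "category"})

# Each categorical column stores its distinct labels once.
gender_labels = converted["gender"].cat.categories.tolist()
print(converted.dtypes)
print(gender_labels)
```

Whether this is worth doing depends on the analysis; for a dataset of this size the main benefit is making the set of valid labels explicit.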


In [9]:
# Item 4: Null values per column
df.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
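
Only `bmi` has missing values: 201 of 5110 rows, roughly 3.9%. The lab lists cleaning as a topic but does not prescribe a strategy, so the following is only a sketch of two common options (dropping the rows, or filling with the median) on a hypothetical toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the lab dataset.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Option A: drop the rows whose bmi is missing.
dropped = toy.dropna(subset=["bmi"])

# Option B: fill missing bmi with the column median (NaN is ignored by median()).
bmi_median = toy["bmi"].median()
filled = toy.assign(bmi=toy["bmi"].fillna(bmi_median))

print(missing_pct)
print(len(dropped), filled["bmi"].tolist())
```

Dropping loses the other columns of those rows, while filling keeps them at the cost of inventing a value; which trade-off is acceptable depends on the later analysis.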

## Exercise 2 – Selection and indexing
1. Select the columns `gender`, `age`, `hypertension`, `heart_disease`, `stroke`.
2. Select the first 5 rows with `iloc`.
3. Get the rows where the patient has hypertension (`hypertension == 1`).

In [10]:
# Item 1: Subset with the columns gender, age, hypertension, heart_disease, stroke
df_subconjunto = df[['gender', 'age', 'hypertension', 'heart_disease', 'stroke']]
df_subconjunto

,gender,age,hypertension,heart_disease,stroke
0,Male,67.0,0,1,1
1,Female,61.0,0,0,1
2,Male,80.0,0,1,1
3,Female,49.0,0,0,1
4,Female,79.0,1,0,1
...,...,...,...,...,...
5105,Female,80.0,1,0,0
5106,Female,81.0,0,0,0
5107,Female,35.0,0,0,0
5108,Male,51.0,0,0,0


In [12]:
# Item 2: First 5 rows (iloc slices by position; the end position is excluded)
df_subconjunto.iloc[0:5]

,gender,age,hypertension,heart_disease,stroke
0,Male,67.0,0,1,1
1,Female,61.0,0,0,1
2,Male,80.0,0,1,1
3,Female,49.0,0,0,1
4,Female,79.0,1,0,1


In [14]:
# Item 3: Patients with hypertension
# reset_index() renumbers the rows and keeps the original labels in an 'index' column
df_pacientes_hypertension = df_subconjunto[df_subconjunto['hypertension'] == 1].reset_index()
df_pacientes_hypertension

,index,gender,age,hypertension,heart_disease,stroke
0,4,Female,79.0,1,0,1
1,6,Male,74.0,1,1,1
2,10,Female,81.0,1,0,1
3,15,Female,50.0,1,0,1
4,17,Male,75.0,1,0,1
...,...,...,...,...,...,...
493,5088,Female,64.0,1,0,0
494,5091,Male,59.0,1,0,0
495,5093,Female,45.0,1,0,0
496,5100,Male,82.0,1,0,0
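
The cells above mix positional slicing (`iloc`), boolean filtering, and `reset_index()`. The sketch below contrasts `iloc` with `loc` and shows `reset_index(drop=True)`, which discards the old labels instead of keeping them in an `index` column. The `toy` frame and its non-default index are hypothetical, chosen only to make the difference visible.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index so labels and positions differ.
toy = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female"],
        "age": [67.0, 61.0, 80.0, 79.0],
        "hypertension": [0, 0, 0, 1],
    },
    index=[10, 20, 30, 40],
)

# iloc: positions 0 and 1 (the end position is excluded).
first_two = toy.iloc[0:2]

# loc: labels 10 through 20 (the end label is included).
by_label = toy.loc[10:20]

# loc with a boolean mask and a column list filters rows and columns in one step.
hyper = toy.loc[toy["hypertension"] == 1, ["gender", "age"]]

# drop=True renumbers from 0 without adding an 'index' column.
hyper_reset = hyper.reset_index(drop=True)
print(hyper_reset)
```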


## Exercise 3 – Handling and processing
1. Sort patients by age (descending).
2. Group by gender and compute the mean age.
3. Count how many people have had a stroke.

In [17]:
# Item 1: Patients sorted by age in descending order
df_pacientes_ordenados_edad_desc = df_subconjunto.sort_values(by='age', ascending=False).reset_index()
df_pacientes_ordenados_edad_desc

,index,gender,age,hypertension,heart_disease,stroke
0,3108,Male,82.00,0,0,0
1,188,Male,82.00,0,0,1
2,1515,Female,82.00,0,0,0
3,1412,Male,82.00,1,0,0
4,1951,Female,82.00,0,0,0
...,...,...,...,...,...,...
5105,3618,Male,0.16,0,0,0
5106,3968,Male,0.16,0,0,0
5107,4021,Male,0.16,0,0,0
5108,1614,Female,0.08,0,0,0
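
Many patients share the same age (the first rows above are all 82.0), and sorting on `age` alone leaves the order within such ties unspecified. One way to make the result reproducible is to add a second sort key, sketched here on a hypothetical toy frame. The `ignore_index=True` argument renumbers the rows without adding an `index` column.

```python
import pandas as pd

# Hypothetical toy frame with a tie on age.
toy = pd.DataFrame({
    "id": [3, 1, 2, 4],
    "age": [82.0, 82.0, 0.16, 45.0],
})

# Sort by age descending, then by id ascending to break ties deterministically.
ordered = toy.sort_values(
    by=["age", "id"],
    ascending=[False, True],
    ignore_index=True,
)
print(ordered)
```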


In [18]:
# Item 2: Mean age by gender
promedio_edad_por_genero = df_subconjunto.groupby('gender')['age'].mean().reset_index()
promedio_edad_por_genero

,gender,age
0,Female,43.757395
1,Male,42.483385
2,Other,26.0


In [32]:
# Item 3: Number of people who have had a stroke
personas_stroke = (df_subconjunto['stroke'] == 1).sum()
personas_stroke

np.int64(249)
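
The count of 249 corresponds to about 4.9% of the 5110 patients. Because `stroke` is coded as 0/1, its mean is the stroke rate, so a single `groupby` can report the mean age, group size, and stroke rate together. The following is a sketch using named aggregation on a hypothetical toy frame. The output column names `mean_age`, `patients`, and `stroke_rate` are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical toy frame; values are made up.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [60.0, 40.0, 80.0, 50.0],
    "stroke": [1, 0, 1, 1],
})

# Named aggregation: output_column=(input_column, function).
summary = toy.groupby("gender").agg(
    mean_age=("age", "mean"),
    patients=("age", "size"),
    stroke_rate=("stroke", "mean"),  # mean of a 0/1 column is a proportion
)

# value_counts() gives the count for every distinct value at once.
stroke_counts = toy["stroke"].value_counts()
print(summary)
print(stroke_counts)
```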

## Exercise 4 – Creating columns and apply
1. Create the column `age_range`: 'Joven' (young) < 30, 'Adulto' (adult) 30–60, 'Mayor' (senior) > 60.
2. Create the column `is_risk`: True if the patient has hypertension or heart disease.
3. Create the column `bmi_norm`: normalized BMI.

In [33]:
# Copy of the DataFrame, so the original df stays unmodified
df_modificado = df.copy()
df_modificado.head()

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [34]:
# Item 1: New column age_range
def clasificar_edad(valor):
    # Missing ages get their own label
    if pd.isna(valor):
        return 'Desconocido'
    elif valor < 30:
        return 'Joven'
    # 60 is inclusive, so 'Adulto' covers 30–60 as the exercise states
    elif valor <= 60:
        return 'Adulto'
    else:
        return 'Mayor'

df_modificado['age_range'] = df_modificado['age'].apply(clasificar_edad)
df_modificado

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,age_range
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,Mayor
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,Mayor
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,Mayor
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,Adulto
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,Mayor
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0,Mayor
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0,Mayor
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0,Adulto
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0,Adulto
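
As an alternative to a row-by-row `apply`, the same ranges from the exercise ('Joven' < 30, 'Adulto' 30–60, 'Mayor' > 60) can be assigned with `np.select`, which evaluates each condition over the whole column and takes the first match. This is a sketch on a hypothetical `ages` Series that includes the boundary values 30 and 60 and a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering both boundaries and a missing value.
ages = pd.Series([0.08, 29.9, 30.0, 60.0, 60.5, np.nan])

# Conditions are checked in order; the first one that is True wins.
conditions = [
    ages.isna(),   # missing age
    ages < 30,     # 'Joven'
    ages <= 60,    # 'Adulto' (30 to 60 inclusive, since < 30 was already taken)
]
labels = ["Desconocido", "Joven", "Adulto"]

# Anything matching no condition (age > 60) falls through to the default.
age_range = pd.Series(
    np.select(conditions, labels, default="Mayor"),
    index=ages.index,
)
print(age_range.tolist())
```

The missing-value check has to come first: comparisons against NaN are False, so without it a missing age would fall through to 'Mayor'.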


In [43]:
# Item 2: New column is_risk
def clasificar_riesgo(valorHyp, valorHeartD):
    # Both flags missing: the risk cannot be determined
    if pd.isna(valorHyp) and pd.isna(valorHeartD):
        return 'Desconocido'
    # At risk when either condition is present
    elif valorHyp == 1 or valorHeartD == 1:
        return True
    else:
        return False

# axis=1 passes one row at a time; the lambda argument is that row, not the whole DataFrame
df_modificado['is_risk'] = df_modificado.apply(lambda row: clasificar_riesgo(row['hypertension'], row['heart_disease']), axis=1)
df_modificado

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,age_range,is_risk
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,Mayor,True
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,Mayor,False
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,Mayor,True
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,Adulto,False
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,Mayor,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0,Mayor,True
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0,Mayor,False
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0,Adulto,False
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0,Adulto,False
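
Since `hypertension` and `heart_disease` have no missing values in this dataset, the same flag can be computed without `apply` by combining two boolean columns with `|`. This is a sketch on a hypothetical toy frame. It assumes both columns are complete, so it has no 'Desconocido' branch.

```python
import pandas as pd

# Hypothetical toy frame; both columns are 0/1 flags with no missing values.
toy = pd.DataFrame({
    "hypertension": [0, 1, 0, 1],
    "heart_disease": [1, 0, 0, 1],
})

# Element-wise OR; the parentheses are required because | binds tighter than ==.
toy["is_risk"] = (toy["hypertension"] == 1) | (toy["heart_disease"] == 1)
print(toy)
```

The result is a proper boolean column, which can be used directly as a filter mask or summed to count at-risk patients.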


In [48]:
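# Item 3: New column bmi_norm (min-max normalization to the range [0, 1])
# min() and max() skip NaN, so rows with a missing bmi get a NaN bmi_norm
bmi = df_modificado['bmi']
df_modificado['bmi_norm'] = (bmi - bmi.min()) / (bmi.max() - bmi.min())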
df_modificado

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,age_range,is_risk,bmi_norm
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,Mayor,True,0.301260
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,Mayor,False,
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,Mayor,True,0.254296
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,Adulto,False,0.276060
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,Mayor,True,0.156930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0,Mayor,True,
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0,Mayor,False,0.340206
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0,Adulto,False,0.232532
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0,Adulto,False,0.175258
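
The normalization used here is min-max scaling, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, which maps the smallest value to 0 and the largest to 1. The sketch below wraps it in a hypothetical helper, `min_max`, which is not part of the lab. The toy values 10.3 and 97.6 were chosen because they are consistent with the `bmi_norm` results shown above, for example 24.0 mapping to about 0.156930.

```python
import numpy as np
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; NaN entries stay NaN."""
    lo, hi = series.min(), series.max()  # both skip NaN by default
    if hi == lo:
        # Constant column: avoid dividing by zero, map every value to 0.
        return series - lo
    return (series - lo) / (hi - lo)

# Hypothetical toy values, including one missing entry.
bmi = pd.Series([10.3, 24.0, np.nan, 97.6])
bmi_norm = min_max(bmi)
print(bmi_norm)
```

Min-max scaling is sensitive to outliers, because a single extreme value stretches the range for every other row.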
