# Data preprocessing

This notebook prepares the **Adult Income** dataset for training income classification models.  
It covers duplicate removal, handling of missing values, and encoding of categorical variables.


In [None]:
# --- Imports ---
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder


## 1. Load the dataset

We load the raw dataset and inspect its initial structure.


In [22]:
df = pd.read_csv('../data/raw/adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


## 2. Detecting and removing duplicates

We check for duplicate rows and remove them so that repeated records do not bias the analysis.


In [23]:
# Count fully duplicated rows
df.duplicated().sum()

np.int64(52)

In [24]:
# Drop duplicated rows, keeping the first occurrence
df.drop_duplicates(inplace=True)

# Verify that no duplicates remain
df.duplicated().sum()

np.int64(0)
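
As a small illustration of what `duplicated()` counts, here is a sketch on a toy frame. The rows are made up for the example and are not taken from the dataset. A row is flagged only when it matches an earlier row in every column.

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values, not real data)
toy = pd.DataFrame({
    "age": [25, 38, 25],
    "workclass": ["Private", "Private", "Private"],
    "income": ["<=50K", "<=50K", "<=50K"],
})

n_before = len(toy)
n_dupes = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates(ignore_index=True)  # also resets the index

print(f"{n_dupes} duplicate(s) removed: {n_before} -> {len(deduped)} rows")
```

Rows 0 and 1 share `workclass` and `income` but differ in `age`, so only the third row counts as a duplicate.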

## 3. Replacing missing values encoded as '?'

Some columns use '?' to mark unknown values.  
We replace them with `NaN` so that pandas treats them as missing.


In [25]:
df.replace('?', np.nan, inplace=True)
df.isnull().sum()

age                   0
workclass          2795
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2805
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      856
income                0
dtype: int64
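
An alternative, sketched here under the assumption that the placeholder is always exactly `?` with no surrounding whitespace, is to convert it while parsing through the `na_values` argument of `pd.read_csv`. The in-memory CSV below is an illustrative stand-in for the real file.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for adult.csv (illustrative rows)
csv_text = """age,workclass,occupation
25,Private,Machine-op-inspct
18,?,?
"""

# na_values="?" turns the placeholder into NaN at load time
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

This removes the need for a separate `replace` step, at the cost of hiding the placeholder from anyone inspecting the loaded frame.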

## 4. Handling missing values

We drop every row that contains at least one missing value.  
This keeps the preprocessing simple, although the missing values could be imputed instead to retain more data.


In [26]:
df.dropna(inplace=True, ignore_index=True)
df.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
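
If keeping more rows matters, one possible alternative to dropping is to fill each categorical column with its most frequent value. The sketch below uses a made-up toy frame; the column list `cat_cols_with_nan` is a name introduced for this example, and the notebook itself does not apply this step.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two categorical columns (illustrative values)
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "Local-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
    "age": [25, 38, 28, 44],
})

cat_cols_with_nan = ["workclass", "occupation"]  # hypothetical column list
imputed = toy.copy()
for col in cat_cols_with_nan:
    # mode() ignores NaN by default; take the first value if there are ties
    most_frequent = imputed[col].mode().iloc[0]
    imputed[col] = imputed[col].fillna(most_frequent)

print(imputed)
```

In a modeling pipeline the most frequent value would normally be computed on the training split only, so that information from the test split does not leak into the imputation.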

## 5. Encoding the target variable `income`

We convert the target variable to numeric form:  
- `<=50K` → 0  
- `>50K` → 1


In [27]:
le = LabelEncoder()
df['income'] = le.fit_transform(df['income'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,0
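
`LabelEncoder` assigns integer codes in sorted label order, so the mapping above holds because `<=50K` sorts before `>50K`. The sketch below shows one way to confirm this through `classes_`, together with an explicit `map` that does not depend on sort order. The label values come from the table above; the variable names are introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(["<=50K", ">50K", "<=50K", ">50K"], name="income")

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the original labels, positioned at their integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# Explicit alternative: the mapping is stated instead of inferred
explicit = labels.map({"<=50K": 0, ">50K": 1})
```

The explicit `map` returns `NaN` for any label missing from the dictionary, which makes unexpected label variants easy to spot.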


## 6. Encoding categorical variables

We convert the categorical variables to dummy variables (**one-hot encoding**) so that ML models can process them.


In [28]:
# One-hot encode the categorical variables (education is excluded; it is handled in section 7)
categorical_cols = ['workclass', 'marital-status', 'occupation',
                    'relationship', 'race', 'gender', 'native-country']

# drop_first=True drops the first category of each variable to avoid perfectly collinear dummies
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Convert boolean columns to 0/1
df = df.astype({col: int for col in df.columns if df[col].dtype == 'bool'})

df.head()


Unnamed: 0,age,fnlwgt,education,educational-num,capital-gain,capital-loss,hours-per-week,income,workclass_Local-gov,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,226802,11th,7,0,0,40,0,0,1,...,0,0,0,0,0,0,0,1,0,0
1,38,89814,HS-grad,9,0,0,50,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,28,336951,Assoc-acdm,12,0,0,40,1,1,0,...,0,0,0,0,0,0,0,1,0,0
3,44,160323,Some-college,10,7688,0,40,1,0,1,...,0,0,0,0,0,0,0,1,0,0
4,34,198693,10th,6,0,0,30,0,0,1,...,0,0,0,0,0,0,0,1,0,0


## 7. Removing the redundant education column

`education` and `educational-num` carry the same information, once as a text label and once as a number. To avoid redundancy and possible multicollinearity in the Machine Learning models, we take the following steps:

1. **Drop the `education` column.**  
2. **Rename `educational-num` to `education`** for clarity in the dataset.

This keeps the education information in a compact form that is useful for modeling.


In [29]:
df.drop(columns=['education'], inplace=True)

df.rename(columns={'educational-num': 'education'}, inplace=True)

df.head()


Unnamed: 0,age,fnlwgt,education,capital-gain,capital-loss,hours-per-week,income,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,226802,7,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,38,89814,9,0,0,50,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,28,336951,12,0,0,40,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,44,160323,10,7688,0,40,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,34,198693,6,0,0,30,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


## 8. Save the clean dataset

We save the preprocessed dataset to a CSV file for use in the later modeling stage.


In [30]:
df.to_csv("../data/processed/adult_clean.csv", index=False)
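
A quick sanity check after saving is to read the file back and compare it with the frame in memory. The sketch below does this with a toy frame and a temporary directory, so it does not touch the project's `data/` folder; the file name only mirrors the one used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"age": [25, 38], "education": [7, 9], "income": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "adult_clean.csv"
    toy.to_csv(out_path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.equals(toy)
print("round trip ok:", round_trip_ok)
```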