# Demo: Preprocessing Pipeline

This notebook demonstrates how to use the preprocessing pipeline built for the project.

In [1]:
import sys
sys.path.insert(0, '..')

from src.data_processing.pipeline import DataPipeline
import pandas as pd
import numpy as np
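
The relative entry `'..'` in `sys.path` only resolves correctly when the notebook is started from its own directory. As a possible alternative, the sketch below registers the parent directory as an absolute path and avoids duplicate entries. The helper name `add_project_root` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_project_root(levels_up: int = 1) -> Path:
    """Add an ancestor of the working directory to sys.path if it is missing."""
    root = Path.cwd().resolve()
    for _ in range(levels_up):
        root = root.parent  # at the filesystem root, .parent returns the root itself
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root


project_root = add_project_root()
print(project_root)
```

With the absolute path registered, `from src.data_processing.pipeline import DataPipeline` no longer depends on later changes to the working directory.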

## 1. Load and Prepare the Data

In [2]:
pipeline = DataPipeline()
# Unpack the features (X), the target (y), and the DataFrame (df) returned by the pipeline.
X, y, df = pipeline.load_and_prepare_data()

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
print("\nFirst rows of X:")
X.head()

Shape of X: (373, 6)
Shape of y: (373,)

First rows of X:

| | Age | Years of Experience | Gender | Education Level | Job Title | Description |
|---|---|---|---|---|---|---|
| 0 | 32.0 | 5.0 | Male | Bachelor's | Software Engineer | I am a 32-year-old male working as a Software ... |
| 1 | 28.0 | 3.0 | Female | Master's | Data Analyst | I am a 28-year-old data analyst with a Master'... |
| 2 | 45.0 | 15.0 | Male | PhD | Senior Manager | I am a 45-year-old Senior Manager with a PhD a... |
| 3 | 36.0 | 7.0 | Female | Bachelor's | Sales Associate | I am a 36-year-old female Sales Associate with... |
| 4 | 52.0 | 20.0 | Male | Master's | Director | I am a 52-year-old male with over two decades ... |
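
The internals of `load_and_prepare_data()` are not shown in this notebook. As a rough sketch of what a load-and-split step of this kind often looks like, the example below drops incomplete rows and separates the predictors from the target. The helper `split_features_target`, the column name `target`, and the decision to drop rows with missing values are all assumptions made for illustration.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target: str):
    """Drop rows with missing values, then separate predictors from the target."""
    clean = df.dropna().reset_index(drop=True)
    X = clean.drop(columns=[target])
    y = clean[target]
    return X, y, clean


# Tiny made-up frame; "target" is a placeholder column name.
toy = pd.DataFrame({
    "Age": [32.0, 28.0, None],
    "Job Title": ["Software Engineer", "Data Analyst", "Director"],
    "target": [1.0, 2.0, 3.0],
})
X_toy, y_toy, df_toy = split_features_target(toy, "target")
print(X_toy.shape, y_toy.shape)
```

Returning the cleaned frame alongside `X` and `y` mirrors the three-value return used in the cell above, so the rows of all three objects stay aligned.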


## 2. Apply Transformations

In [4]:
# Fit the preprocessing steps on X and transform it in a single call.
X_transformed = pipeline.fit_transform(X)

print(f"Shape after transformation: {X_transformed.shape}")
print(f"Data type: {type(X_transformed)}")

Shape after transformation: (373, 380)
Data type: <class 'pandas.core.frame.DataFrame'>


## 3. Feature Names

In [5]:
feature_names = pipeline.get_feature_names()

print(f"Total features: {len(feature_names)}")
print("\nFirst 20 features:")
# Number the names from 1 for readability.
for i, name in enumerate(feature_names[:20], 1):
    print(f"{i:2d}. {name}")

Total features: 380

First 20 features:
 1. Age
 2. Years of Experience
 3. Gender_Female
 4. Gender_Male
 5. Education Level_Bachelor's
 6. Education Level_Master's
 7. Education Level_PhD
 8. Job Title_Account Manager
 9. Job Title_Accountant
10. Job Title_Administrative Assistant
11. Job Title_Business Analyst
12. Job Title_Business Development Manager
13. Job Title_CEO
14. Job Title_Chief Data Officer
15. Job Title_Chief Technology Officer
16. Job Title_Content Marketing Manager
17. Job Title_Copywriter
18. Job Title_Creative Director
19. Job Title_Customer Service Manager
20. Job Title_Customer Service Rep


## 4. Feature Distribution by Type

In [6]:
# Group features by name prefix; anything without a known prefix counts as numeric.
categorical_prefixes = ('Gender', 'Education', 'Job')
numeric_count = sum(not f.startswith(categorical_prefixes + ('text',)) for f in feature_names)
categorical_count = sum(f.startswith(categorical_prefixes) for f in feature_names)
text_count = sum(f.startswith('text_') for f in feature_names)

print("Feature distribution:")
print(f"  Numeric:        {numeric_count:3d}")
print(f"  Categorical:    {categorical_count:3d}")
print(f"  Text (TF-IDF):  {text_count:3d}")
print(f"  {'='*30}")
print(f"  TOTAL:          {len(feature_names):3d}")

Feature distribution:
  Numeric:          2
  Categorical:    178
  Text (TF-IDF):  200
  ==============================
  TOTAL:          380
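
The cell above runs three separate prefix filters, so the groups add up to the total only if the filters happen not to overlap or leave gaps. A minimal alternative sketch assigns each name to exactly one group and counts the results. The first six sample names come from the feature list in section 3; `text_work` and `text_years` are hypothetical names that follow the `text_` prefix the cell filters on, and the helper `feature_group` is illustrative only.

```python
from collections import Counter

# Group name -> prefixes that identify it; full prefixes avoid matching
# an unrelated column that merely starts with "Job" or "Gender".
GROUP_PREFIXES = {
    "categorical": ("Gender_", "Education Level_", "Job Title_"),
    "text": ("text_",),
}


def feature_group(name: str) -> str:
    """Return the single group a feature name belongs to."""
    for group, prefixes in GROUP_PREFIXES.items():
        if name.startswith(prefixes):
            return group
    return "numeric"  # fallback for names without a known prefix


sample_names = [
    "Age", "Years of Experience",
    "Gender_Female", "Gender_Male",
    "Education Level_PhD", "Job Title_CEO",
    "text_work", "text_years",
]
counts = Counter(feature_group(n) for n in sample_names)
print(counts)
```

Because every name is classified once, the group counts always sum to `len(sample_names)`.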


## 5. Quality Checks
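
The checks in this section ask three things of the transformed matrix: whether it contains NaN values, whether it contains infinite values, and whether its row count matches the target. Applied to a DataFrame, `np.isnan(...).sum()` returns one count per column rather than a single number. The sketch below reduces each check to one value by converting to a NumPy array first. It runs on a small made-up DataFrame, not on the project's data, and the helper `quality_report` is hypothetical.

```python
import numpy as np
import pandas as pd


def quality_report(features: pd.DataFrame, target: pd.Series) -> dict:
    """Summarize NaN count, infinite count, and row alignment as single values."""
    values = features.to_numpy(dtype=float)
    return {
        "nan_total": int(np.isnan(values).sum()),
        "inf_total": int(np.isinf(values).sum()),
        "rows_match": len(features) == len(target),
    }


# Made-up data with one NaN and one infinite value planted on purpose.
toy_X = pd.DataFrame({"a": [0.5, np.nan, 1.0], "b": [np.inf, 2.0, -1.0]})
toy_y = pd.Series([1, 0, 1])
report = quality_report(toy_X, toy_y)
print(report)
```

A single total per check fits on one printed line, while the per-column view remains useful for locating which feature is affected.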

In [None]:
# np.isnan / np.isinf are element-wise on a DataFrame, so .sum() yields one count per column.
print("Quality checks:")
print(f"  ✓ Null values: {np.isnan(X_transformed).sum()}")
print(f"  ✓ Infinite values: {np.isinf(X_transformed).sum()}")
print(f"  ✓ Consistent shape: {X_transformed.shape[0] == len(y)}")

Quality checks:
  ✓ Null values: num__Age                           0
num__Years of Experience           0
cat__Gender_Female                 0
cat__Gender_Male                   0
cat__Education Level_Bachelor's    0
                                  ..
text__various                      0
text__work                         0
text__working                      0
text__year                         0
text__years                        0
Length: 380, dtype: int64
  ✓ Infinite values: num__Age                           0
num__Years of Experience           0
cat__Gender_Female                 0
cat__Gender_Male                   0
cat__Education Level_Bachelor's    0
                                  ..
text__various                      0
text__work                         0
text__working                      0
text__year                         0
text__years                        0
Length: 380, dtype: int64
  ✓ Consistent shape: True

Basic statistics:
  Min: -2.0481