# Session 1 – Section 3: Pipelines in Scikit-learn

**Objectives:**
- Understand how to use `Pipeline` and `ColumnTransformer`.
- Chain preprocessing steps and models into a single estimator.
- Make results easier to reproduce and deploy.


## 1) Load the dataset

In [17]:
import pandas as pd
df = pd.read_csv('../data/raw/mini_titanic.csv')
df.head()

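| Unnamed: 0 | sex | class | age | fare | survived |
|---|---|---|---|---|---|
| 0 | male | Third | 34.352706 | 119.0 | 0 |
| 1 | female | Second | 50.654987 | 66.77 | 0 |
| 2 | male | First | 42.007235 | 21.87 | 0 |
| 3 | male | Second | 27.760861 | NaN | 0 |
| 4 | male | Third | 29.733773 | 212.45 | 0 |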


## 2) Define numeric and categorical variables

In [18]:
num_cols = ['age', 'fare']
cat_cols = ['sex', 'class']
target = 'survived'

X = df[num_cols + cat_cols]
y = df[target]

## 3) ColumnTransformer

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Numeric pipeline: fill missing values with the column mean, then standardize
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: fill missing values with the mode, then one-hot encode
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# ColumnTransformer: route each pipeline to its own set of columns
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

preprocessor

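**ColumnTransformer**

| Parameter | Value |
|---|---|
| transformers | [('num', ...), ('cat', ...)] |
| remainder | 'drop' |
| sparse_threshold | 0.3 |
| n_jobs | |
| transformer_weights | |
| verbose | False |
| verbose_feature_names_out | True |
| force_int_remainder_cols | 'deprecated' |

**num: SimpleImputer**

| Parameter | Value |
|---|---|
| missing_values | |
| strategy | 'mean' |
| fill_value | |
| copy | True |
| add_indicator | False |
| keep_empty_features | False |

**num: StandardScaler**

| Parameter | Value |
|---|---|
| copy | True |
| with_mean | True |
| with_std | True |

**cat: SimpleImputer**

| Parameter | Value |
|---|---|
| missing_values | |
| strategy | 'most_frequent' |
| fill_value | |
| copy | True |
| add_indicator | False |
| keep_empty_features | False |

**cat: OneHotEncoder**

| Parameter | Value |
|---|---|
| categories | 'auto' |
| drop | |
| sparse_output | True |
| dtype | <class 'numpy.float64'> |
| handle_unknown | 'ignore' |
| min_frequency | |
| max_categories | |
| feature_name_combiner | 'concat' |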


## 4) Full pipeline with a model

In [20]:
from sklearn.model_selection import train_test_split

# Chain preprocessing and the classifier into a single estimator
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(max_iter=500))
])

# Stratified 80/20 split keeps the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))

Test accuracy: 0.9333333333333333
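Because preprocessing and model live in one estimator, the whole pipeline can be passed to `cross_val_score`. Imputation means and scaling statistics are then refitted on the training folds only, which avoids leaking information from the validation fold. The sketch below uses randomly generated data (`demo` and its arbitrary label `demo_y` are hypothetical stand-ins, not the course dataset), so the scores it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data with the same column names as the dataset
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    'age': rng.normal(35, 10, n),
    'fare': rng.uniform(10, 200, n),
    'sex': rng.choice(['male', 'female'], n),
    'class': rng.choice(['First', 'Second', 'Third'], n),
})
demo.loc[::10, 'fare'] = np.nan                  # inject some missing fares
demo_y = (demo['sex'] == 'female').astype(int)   # arbitrary label for the demo

demo_clf = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'fare']),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['sex', 'class']),
    ])),
    ('model', LogisticRegression(max_iter=500)),
])

# Each of the 5 folds refits the imputers, scaler, encoder and model
scores = cross_val_score(demo_clf, demo, demo_y, cv=5)
print('Accuracy per fold:', scores)
print('Mean accuracy:', scores.mean())
```

A single train/test split gives one estimate; the per-fold scores give a rough sense of how much that estimate varies with the split.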


## 5) Inspect the pipeline

In [21]:
# named_steps maps each step name to its fitted estimator
print(clf.named_steps)
print('\nTransformations applied to a batch of data:')
# Run only the preprocessing step to see the matrix the model receives
Xt = clf.named_steps['preprocessor'].transform(X_train)
print('Transformed shape:', Xt.shape)

{'preprocessor': ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer', SimpleImputer()),
                                                 ('scaler', StandardScaler())]),
                                 ['age', 'fare']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['sex', 'class'])]), 'model': LogisticRegression(max_iter=500)}

Transformations applied to a batch of data:
Transformed shape: (240, 7)
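Inspection can go one level deeper: after fitting, the `ColumnTransformer` exposes its fitted branches through `named_transformers_`, and each branch is itself a `Pipeline` with `named_steps`. The sketch below reads the learned imputation means and the encoder categories from a pipeline fitted on a small hypothetical frame (`toy` and `toy_y` are invented for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, chosen so the column means are easy to verify
toy = pd.DataFrame({
    'age': [30.0, 50.0, np.nan, 40.0],
    'fare': [100.0, 60.0, 20.0, np.nan],
    'sex': ['male', 'female', 'male', 'female'],
    'class': ['Third', 'Second', 'First', 'Second'],
})
toy_y = [0, 1, 0, 1]

toy_clf = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'fare']),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['sex', 'class']),
    ])),
    ('model', LogisticRegression(max_iter=500)),
]).fit(toy, toy_y)

fitted_pre = toy_clf.named_steps['preprocessor']
num_imputer = fitted_pre.named_transformers_['num'].named_steps['imputer']
encoder = fitted_pre.named_transformers_['cat'].named_steps['encoder']

# Values used to fill missing 'age' and 'fare' entries
print('Imputation means:', num_imputer.statistics_)
# Categories learned for 'sex' and 'class', in output-column order
print('Encoder categories:', encoder.categories_)
```

Note that `named_transformers_` (with the trailing underscore) holds the fitted clones; the objects originally passed to the `ColumnTransformer` constructor are left unfitted.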


## 6) Save and load a trained pipeline

In [22]:
import joblib

# Save the fitted pipeline (preprocessing + model) to disk
joblib.dump(clf, '../data/processed/titanic_pipeline.joblib')

# Load it back and check that it scores the same on the test set
clf_loaded = joblib.load('../data/processed/titanic_pipeline.joblib')
print('Loaded accuracy:', clf_loaded.score(X_test, y_test))

Loaded accuracy: 0.9333333333333333
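A reloaded pipeline accepts raw, unprocessed rows, since imputation, scaling and encoding travel with the saved object. The sketch below illustrates this with a pipeline fitted on a small hypothetical frame and saved to a temporary directory; `toy`, `toy_y`, `new_passengers` and the file name are all invented for the example.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data with the dataset's column names
toy = pd.DataFrame({
    'age': [30.0, 50.0, np.nan, 40.0],
    'fare': [100.0, 60.0, 20.0, np.nan],
    'sex': ['male', 'female', 'male', 'female'],
    'class': ['Third', 'Second', 'First', 'Second'],
})
toy_y = [0, 1, 0, 1]

toy_clf = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'fare']),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['sex', 'class']),
    ])),
    ('model', LogisticRegression(max_iter=500)),
]).fit(toy, toy_y)

# Round-trip through disk using a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.joblib')
    joblib.dump(toy_clf, path)
    restored = joblib.load(path)

# Raw new row: a missing fare and a class value never seen during training
new_passengers = pd.DataFrame({
    'age': [29.0],
    'fare': [np.nan],
    'sex': ['female'],
    'class': ['Crew'],
})
pred = restored.predict(new_passengers)
proba = restored.predict_proba(new_passengers)
print('Prediction:', pred)
print('Class probabilities:', proba)
```

The missing fare is filled with the mean learned at training time, and the unseen class is encoded as all zeros because the encoder was built with `handle_unknown='ignore'`.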
