---
title: "Building a tabular pipeline"
format:
    revealjs:
        slide-number: true
        toc: true
        code-fold: false
        code-tools: true

---

We can now put data cleaning and feature engineering together to build a full
machine learning pipeline. We will do this in two ways:

1. a pipeline built around `TableVectorizer`
2. a pipeline built with `tabular_pipeline`

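## Exercise: comparing pipelines

Build four classifiers for the adult census dataset, from a hand-written
scikit-learn baseline to skrub's `tabular_pipeline`, and compare their
cross-validated scores.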

In [None]:
from sklearn.datasets import fetch_openml

# Load the adult census dataset from OpenML as a pandas DataFrame.
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = adult.target

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

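# Split the columns by dtype so each group gets its own preprocessing.
categorical_columns = selector(dtype_include="category")(X)
numerical_columns = selector(dtype_include="number")(X)

# Scale the numeric columns and one-hot encode the categorical ones.
ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)

# Baseline: manual preprocessing, imputation, then a linear classifier.
model_base = make_pipeline(ct, SimpleImputer(), LogisticRegression())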
model_base
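
To make the column selection step concrete, here is a minimal sketch on a small, made-up DataFrame (the `toy` data and the `toy_*` names are illustrative assumptions, not part of the exercise). It shows that calling a selector on a DataFrame returns a list of column names, which the column transformer then uses to route each group to its own preprocessing.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "workclass": pd.Categorical(
            ["Private", "State-gov", "Private", "Private"]
        ),
    }
)

# Calling a selector on a DataFrame returns the matching column names.
toy_categorical = selector(dtype_include="category")(toy)
toy_numerical = selector(dtype_include="number")(toy)
print(toy_categorical)
print(toy_numerical)

# Same layout as the baseline: scale numbers, one-hot encode categories.
toy_ct = make_column_transformer(
    (StandardScaler(), toy_numerical),
    (OneHotEncoder(handle_unknown="ignore"), toy_categorical),
)
toy_encoded = toy_ct.fit_transform(toy)

# One scaled column plus one column per observed category.
print(toy_encoded.shape)
```

Because the selectors are evaluated eagerly here, the column lists are fixed at the time the cell runs; passing the selector objects themselves to the column transformer would instead defer the selection to `fit`.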

In [None]:
from skrub import TableVectorizer

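# TableVectorizer inspects each column and picks a suitable encoder for it.
tv = TableVectorizer()

# Impute and scale the vectorized features before the linear classifier.
model_tv = make_pipeline(tv, SimpleImputer(), StandardScaler(), LogisticRegression())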
model_tv

In [None]:
from skrub import tabular_pipeline

# Wrap a given estimator in preprocessing steps that skrub chooses for it.
model_tp = tabular_pipeline(LogisticRegression())
model_tp

In [None]:
# Passing "classification" uses a HistGradientBoostingClassifier as the final estimator.
model_hgb = tabular_pipeline("classification")
model_hgb

In [None]:
from sklearn.model_selection import cross_val_score

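# cross_val_score uses 5-fold cross-validation by default and, for
# classifiers, reports accuracy on each fold.
results_base = cross_val_score(model_base, X, y)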
print(f"Base model: {results_base.mean():.4f}")

results_tv = cross_val_score(model_tv, X, y)
print(f"TableVectorizer: {results_tv.mean():.4f}")

results_tp = cross_val_score(model_tp, X, y)
print(f"Tabular pipeline: {results_tp.mean():.4f}")

results_hgb = cross_val_score(model_hgb, X, y)
print(f"HGB model: {results_hgb.mean():.4f}")
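
The mean alone hides how much the score varies from fold to fold, which matters when the models are close. A minimal sketch of reporting the spread alongside the mean follows; it uses synthetic data from `make_classification` so that it runs without downloading anything, and the `demo_*` and `X_demo`/`y_demo` names are illustrative assumptions rather than part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data, so the sketch runs offline.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, random_state=0
)

demo_model = make_pipeline(StandardScaler(), LogisticRegression())

# One accuracy value per fold.
demo_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5)

# Report the mean together with the standard deviation across folds.
print(f"Demo model: {demo_scores.mean():.4f} +/- {demo_scores.std():.4f}")
```

The same `mean()` and `std()` calls apply unchanged to `results_base`, `results_tv`, `results_tp`, and `results_hgb`.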