---
title: "A world without skrub"
format:
    revealjs:
        slide-number: true
        toc: true
        code-fold: false
        code-tools: true

---

Let's begin the lesson by imagining a world without skrub, where we can use 
only pandas and scikit-learn to clean data and prepare a machine learning model. 

In [1]:
import pandas as pd
import numpy as np
from skrub.datasets import fetch_employee_salaries

data = fetch_employee_salaries()
X, y = data.X, data.y

Let's take a look at the target:

In [2]:
y

0        69222.18
1        97392.47
2       104717.28
3        52734.57
4        93396.00
          ...    
9223     72094.53
9224    169543.85
9225    102736.52
9226    153747.50
9227     75484.08
Name: current_annual_salary, Length: 9228, dtype: float64

This is a numerical column, and our task is predicting the value of `current_annual_salary`.

## Strategizing
We can begin by exploring the dataframe with `.describe`, and then think of a 
plan for pre-processing our data. 

In [3]:
X.describe(include="all")

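|        | gender | department | department_name      | division               | assignment_category | employee_position_title | date_first_hired | year_first_hired |
|--------|--------|------------|----------------------|------------------------|---------------------|-------------------------|------------------|------------------|
| count  | 9211   | 9228       | 9228                 | 9228                   | 9228                | 9228                    | 9228             | 9228.0           |
| unique | 2      | 37         | 37                   | 694                    | 2                   | 443                     | 2264             |                  |
| top    | M      | POL        | Department of Police | School Health Services | Fulltime-Regular    | Bus Operator            | 12/12/2016       |                  |
| freq   | 5481   | 1844       | 1844                 | 300                    | 8394                | 638                     | 87               |                  |
| mean   |        |            |                      |                        |                     |                         |                  | 2003.597529      |
| std    |        |            |                      |                        |                     |                         |                  | 9.327078         |
| min    |        |            |                      |                        |                     |                         |                  | 1965.0           |
| 25%    |        |            |                      |                        |                     |                         |                  | 1998.0           |
| 50%    |        |            |                      |                        |                     |                         |                  | 2005.0           |
| 75%    |        |            |                      |                        |                     |                         |                  | 2012.0           |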


We need to:

- Impute some missing values in the `gender` column.
- Encode categorical features as numerical features.
- Convert the column `date_first_hired` into numerical features.

Once we have processed the data, we can train a machine learning model. For the sake
of the example, we will use a linear model (`Ridge`). Linear models are sensitive to
feature scale and cannot handle missing values, so we need to scale the numerical
features in addition to imputing missing values.

Finally, we want to evaluate the performance of the method across multiple 
cross-validation splits.

## Building a traditional pipeline
Let's build a traditional predictive pipeline following the steps we just discussed. 

### Step 1: Convert date features to numerical

Extract numerical features from the `date_first_hired` column.

In [4]:
# Create a copy to work with
X_processed = X.copy()

# Parse the date column
X_processed['date_first_hired'] = pd.to_datetime(X_processed['date_first_hired'])

# Extract numerical features from date
X_processed['years_since_hired'] = (pd.Timestamp.now() - X_processed['date_first_hired']).dt.days / 365.25
X_processed['hired_month'] = X_processed['date_first_hired'].dt.month
X_processed['hired_year'] = X_processed['date_first_hired'].dt.year

# Drop original date column
X_processed = X_processed.drop('date_first_hired', axis=1)

print("Features after date transformation:")
print("\nShape:", X_processed.shape)

Features after date transformation:

Shape: (9228, 10)
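
The cell above lets pandas infer the date format, and `pd.Timestamp.now()` makes `years_since_hired` change every time the notebook is run. As a minimal sketch on a toy column, assuming the dates use the month/day/year layout suggested by the `12/12/2016` value in the summary table, and using a hypothetical fixed reference date, the same three features could be computed reproducibly like this:

```python
import pandas as pd

# Toy stand-in for `date_first_hired`; these values are made up.
dates = pd.Series(["12/12/2016", "01/05/1998", "07/30/2005"], name="date_first_hired")

# Assumption: month/day/year strings. An explicit format avoids ambiguous parsing.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# Hypothetical fixed reference date, chosen only so the feature is reproducible.
reference = pd.Timestamp("2025-01-01")

features = pd.DataFrame({
    "years_since_hired": (reference - parsed).dt.days / 365.25,
    "hired_month": parsed.dt.month,
    "hired_year": parsed.dt.year,
})
print(features)
```

Whether a fixed reference date is appropriate depends on the use case; it is shown here only to keep the feature stable between runs.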


### Step 2: Encode categorical features

Encode only the non-numerical categorical features using one-hot encoding.

In [5]:
# Identify only the non-numerical (truly categorical) columns
categorical_cols = X_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Apply one-hot encoding only to categorical columns
X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
print("\nShape after encoding:", X_encoded.shape)

Categorical columns to encode: ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title']

Shape after encoding: (9228, 1219)
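
One detail worth checking at this step is how `pd.get_dummies` treats missing values. By default, a missing entry becomes a row with no active dummy column rather than a `NaN`, so the missing values in `gender` are no longer visible after encoding. The toy frame below (made-up values, not the real dataset) illustrates this, along with the `dummy_na=True` option that keeps an explicit indicator:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; the data is made up for illustration.
toy = pd.DataFrame({"gender": ["F", "M", np.nan, "M"]})

# Default behaviour: the missing row gets no active dummy, and no NaN remains.
encoded = pd.get_dummies(toy, columns=["gender"])
print(encoded)
print("NaN left after encoding:", int(encoded.isna().sum().sum()))

# With dummy_na=True, missingness is kept as its own indicator column.
with_na = pd.get_dummies(toy, columns=["gender"], dummy_na=True)
print(with_na.columns.tolist())
```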


### Step 3: Impute missing values

We'll impute missing values in the `gender` column using the most frequent strategy.

In [6]:
from sklearn.impute import SimpleImputer

# Impute missing values with most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_encoded_imputed = pd.DataFrame(
    imputer.fit_transform(X_encoded),
    columns=X_encoded.columns
)
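
Because the encoding step already turned the missing `gender` entries into all-zero dummy rows, the imputer above appears to have nothing left to fill. One possible alternative, sketched here on a made-up toy column rather than the real data, is to impute the raw categorical column first and encode afterwards:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value; the data is made up for illustration.
toy = pd.DataFrame({"gender": ["F", "M", np.nan, "M"]}, dtype=object)

# Step 1: fill the missing category with the most frequent one ("M" here).
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Step 2: one-hot encode the now-complete column.
encoded = pd.get_dummies(imputed, columns=["gender"])
print(imputed["gender"].tolist())
print(encoded)
```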

### Step 4: Scale numerical features

Scale numerical features for the Ridge regression model.

In [7]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_encoded_imputed)
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
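
Fitting the scaler on the full table before cross-validation lets the held-out rows influence the mean and variance used for scaling. A minimal sketch of the usual alternative, on synthetic data with an arbitrary train/test split, fits the scaler on the training part only and reuses its statistics on the test part:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numerical features; the values carry no meaning.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X_toy, test_size=0.25, random_state=0)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)

# Apply the same statistics to both parts.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train means:", X_train_scaled.mean(axis=0).round(6))
print("Test means: ", X_test_scaled.mean(axis=0).round(3))
```

The test means are generally close to zero but not exactly zero, which is expected: the test rows did not take part in estimating the statistics.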

### Step 5: Train Ridge model with cross-validation

Train a Ridge regression model and evaluate with cross-validation.

In [8]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Initialize Ridge model
ridge = Ridge(alpha=1.0)

# Perform cross-validation (5-fold)
cv_results = cross_validate(ridge, X_scaled, y, cv=5, 
                            scoring=['r2', 'neg_mean_squared_error'],
                            return_train_score=True)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")

Cross-Validation Results:
Mean test R²: 0.8722 (+/- 0.0274)
Mean test RMSE: 10366.9520 (+/- 1403.5225)
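
The manual conversion from negated MSE to RMSE can be avoided by requesting scikit-learn's `neg_root_mean_squared_error` scorer. The sketch below uses synthetic data from `make_regression` (not the salary dataset) to show that both routes give the same per-fold RMSE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression problem, used only to illustrate the scorers.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_results = cross_validate(
    Ridge(alpha=1.0), X_toy, y_toy, cv=5,
    scoring=["r2", "neg_mean_squared_error", "neg_root_mean_squared_error"],
)

# Route 1: square root of the sign-flipped MSE, as in the cell above.
rmse_from_mse = np.sqrt(-cv_results["test_neg_mean_squared_error"])
# Route 2: sign-flip the RMSE scorer directly.
rmse_direct = -cv_results["test_neg_root_mean_squared_error"]

print("Same per-fold RMSE:", np.allclose(rmse_from_mse, rmse_direct))
```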


### "Just ask an agent to write the code"
It's what I did. Here are some of the issues I noticed: 

- Operations in the wrong order.
- Trying to impute categorical features without encoding them as numerical values.
- The datetime feature was encoded as a categorical (i.e., with dummies).
- Too many print statements.
- Cells could not be executed in order without proper debugging and re-prompting.
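
Still without skrub, some of these ordering problems can be reduced by declaring the steps in a scikit-learn `Pipeline` with a `ColumnTransformer`, so that imputation, encoding and scaling are fitted inside each cross-validation fold. This is a rough sketch under stated assumptions: the toy frame, its column values and the target formula are made up, and only the column names echo the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data: two categorical columns and one numerical column.
rng = np.random.default_rng(0)
n = 120
toy_X = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n).astype(object),
    "department": rng.choice(["A", "B", "C"], size=n).astype(object),
    "year_first_hired": rng.integers(1990, 2020, size=n).astype(float),
})
toy_X.loc[toy_X.index % 15 == 0, "gender"] = np.nan  # inject missing values
toy_y = 40_000 + 1_000 * (2020 - toy_X["year_first_hired"]) + rng.normal(0, 2_000, size=n)

categorical = ["gender", "department"]
numerical = ["year_first_hired"]

# Each branch states its own order: impute first, then encode or scale.
preprocess = ColumnTransformer([
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
    ("num", make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler()), numerical),
])
model = make_pipeline(preprocess, Ridge(alpha=1.0))

# All preprocessing is refitted on the training part of every fold.
cv_results = cross_validate(model, toy_X, toy_y, cv=5, scoring="r2")
print("Mean test R²:", cv_results["test_score"].mean().round(4))
```

Even in this compact form, the column lists and the choice of transformer per column type still have to be maintained by hand.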


## Waking up from a nightmare
Thankfully, we live in a world where we can `import skrub`. Let's see what we can
get if we use `skrub.tabular_pipeline`. 

In [9]:
from skrub import tabular_pipeline

# Perform cross-validation (5-fold)
cv_results = cross_validate(tabular_pipeline("regression"), X, y, cv=5, 
                            scoring=['r2', 'neg_mean_squared_error'],
                            return_train_score=True)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")

Cross-Validation Results:
Mean test R²: 0.9083 (+/- 0.0166)
Mean test RMSE: 8802.3781 (+/- 1060.9218)


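All the code from before, the tokens and the debugging are replaced by a single 
import that gives better results: the mean test R² rises from 0.8722 to 0.9083, and
the mean test RMSE drops from about 10,367 to about 8,802.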

Throughout the tutorial, we will see how each step can be simplified, replaced, or
improved using skrub features.