# Tutorial 2: Data Preprocessing with Transform Classes

This tutorial demonstrates how to use TabCamel's comprehensive data preprocessing capabilities through various `Transform` classes. We'll cover:

1. **Data Imputation** - Handling missing values in both categorical and numerical features
2. **Numerical Transformations** - Standardization, min-max scaling, and quantile transformation
3. **Categorical Encoding** - One-hot encoding and ordinal encoding
4. **Target Transformation** - Label encoding for classification and standardization for regression
5. **Best Practices** - Proper fitting and transformation workflow

Each transform follows the scikit-learn pattern: `fit()` on training data, then `transform()` on both training and test data.
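To make the pattern concrete before we touch TabCamel itself, here is a minimal, dependency-free sketch. `StandardScalerSketch` is a hypothetical stand-in written for this illustration, not a TabCamel class; it only shows how the statistics learned in `fit()` are reused by `transform()` on every split.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Hypothetical stand-in for a transform; not part of TabCamel."""

    def fit(self, values):
        # Learn statistics from the training data only.
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        # Reuse the stored training statistics on any split.
        return [(v - self.mean_) / self.std_ for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

scaler = StandardScalerSketch().fit(train)  # fit once, on training data
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # test data never influences mean_ or std_
```

Because the test split is scaled with the training mean and standard deviation, its values are not guaranteed to have mean 0 and standard deviation 1. That is expected.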


## Environment Setup

First, let's configure our Jupyter environment with autoreload to automatically reload modules when they change, and set up matplotlib for inline plotting.


In [1]:
# Load the autoreload extension - automatically reloads modules when they change
%load_ext autoreload
# Set autoreload to mode 2 - reload all modules (except those excluded by %aimport) every time before executing Python code
%autoreload 2
# Enable inline matplotlib plots
%matplotlib inline

In [2]:
# Import the main TabularDataset class for loading and managing tabular data
from tabcamel.data.dataset import TabularDataset

In [3]:
dataset_openml = TabularDataset(
    dataset_name="adult",  # Name of the dataset to load from OpenML
    task_type="classification",  # Specify the task type (e.g., classification, regression)
)

# Split the dataset into training and testing sets
# Using stratified split to maintain class distribution in both sets
split_dict = dataset_openml.split(
    split_mode="stratified",  # Ensures both sets have similar target class proportions
    train_size=0.8,  # 80% for training, 20% for testing
)

# Extract the training and testing sets from the split dictionary
train_set = split_dict["train_set"]
test_set = split_dict["test_set"]

# Display information about both sets
print("Training Set:")
print(train_set)
print("\nTesting Set:")
print(test_set)

Training Set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 39073
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7607299157986334, '>50K': 0.23927008420136667}

Testing Set:
Dataset: adult
Task type: classification
Status (is_tensor): False
Number of samples: 9769
Number of features: 14 (Numerical: 2, Categorical: 12)
Number of classes: 2
Class distribution: {'<=50K': 0.7606715119254785, '>50K': 0.23932848807452145}


## Data Imputation

Data imputation is the process of replacing missing values with substituted values. TabCamel's `SimpleImputeTransform` provides flexible imputation strategies:

- **Categorical features**: `most_frequent` (mode), `constant` (fill with a specific value)
- **Numerical features**: `mean`, `median`, `most_frequent`, `constant`

This transform handles categorical and numerical features separately with different strategies, which is often more appropriate than applying the same strategy to all features.
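As a rough illustration of what per-type imputation computes, the sketch below learns one fill value per column (mode for categorical, mean for numerical) and applies it. The helper names `fit_imputer` and `apply_imputer` are made up for this example and say nothing about how `SimpleImputeTransform` is implemented internally.

```python
from collections import Counter
from statistics import mean


def fit_imputer(rows, categorical, numerical):
    """Learn one fill value per column from the non-missing entries."""
    fill = {}
    for col in categorical:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]  # mode
    for col in numerical:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = mean(observed)
    return fill


def apply_imputer(rows, fill):
    """Return copies of the rows with missing entries replaced."""
    return [
        {col: (fill[col] if val is None and col in fill else val)
         for col, val in r.items()}
        for r in rows
    ]


rows = [
    {"workclass": "Private", "fnlwgt": 100.0},
    {"workclass": "Private", "fnlwgt": 300.0},
    {"workclass": "Local-gov", "fnlwgt": 200.0},
    {"workclass": None, "fnlwgt": None},
]
fill = fit_imputer(rows, categorical=["workclass"], numerical=["fnlwgt"])
imputed = apply_imputer(rows, fill)
# The last row becomes {"workclass": "Private", "fnlwgt": 200.0}
```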


In [4]:
# Import the SimpleImputeTransform class for handling missing values
from tabcamel.data.transform import SimpleImputeTransform

In [5]:
# Create a fresh dataset instance for demonstration
dataset_openml = TabularDataset(
    dataset_name="adult",
    task_type="classification",
)

# Create a small sample for demonstration purposes
# Using stratified sampling to maintain class distribution
temp_set = dataset_openml.sample(
    sample_mode="stratified",  # Preserve target class proportions
    sample_size=5,  # Small sample for easy visualization
)["dataset_sampled"]

# Display the sampled data to see its structure
temp_set.data_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,target
20745,0,Local-gov,159032,7th-8th,4,Never-married,Farming-fishing,Own-child,White,Male,0,0,2,United-States,<=50K
1127,4,Federal-gov,124244,HS-grad,9,Widowed,Handlers-cleaners,Not-in-family,Black,Male,0,0,2,United-States,<=50K
14826,4,Private,343849,Some-college,10,Married-civ-spouse,Transport-moving,Husband,Black,Male,0,0,2,United-States,<=50K
30318,2,Private,58343,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,2,United-States,>50K
3466,2,Private,96452,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,3,United-States,>50K


In [6]:
# Introduce missing values to demonstrate imputation
# Note: label 92 is not in the sampled index, so `.at` appends a new row
# in which every column is missing (see the last row of the output below)
temp_set.data_df.at[92, "workclass"] = None  # Missing categorical value
temp_set.data_df.at[92, "fnlwgt"] = None  # Missing numerical value

# Display the data with missing values
temp_set.data_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,target
20745,0.0,Local-gov,159032.0,7th-8th,4.0,Never-married,Farming-fishing,Own-child,White,Male,0.0,0.0,2.0,United-States,<=50K
1127,4.0,Federal-gov,124244.0,HS-grad,9.0,Widowed,Handlers-cleaners,Not-in-family,Black,Male,0.0,0.0,2.0,United-States,<=50K
14826,4.0,Private,343849.0,Some-college,10.0,Married-civ-spouse,Transport-moving,Husband,Black,Male,0.0,0.0,2.0,United-States,<=50K
30318,2.0,Private,58343.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,2.0,United-States,>50K
3466,2.0,Private,96452.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,3.0,United-States,>50K
92,,,,,,,,,,,,,,,


In [7]:
# Create and configure the imputation transform
imputer = SimpleImputeTransform(
    categorical_feature_list=temp_set.categorical_feature_list,  # List of categorical columns
    numerical_feature_list=temp_set.numerical_feature_list,  # List of numerical columns
    strategy_categorical="most_frequent",  # Use mode for categorical features
    strategy_numerical="mean",  # Use mean for numerical features
)

# Fit the imputer on the data to learn the imputation values
# This calculates the mode for categorical features and mean for numerical features
imputer.fit(temp_set.data_df)

# Transform the data to fill in missing values
# Every feature in row 92 is missing: categorical features are filled with the most frequent category
# and numerical features with the column mean; the target is not a feature, so it stays missing
imputed_data = imputer.transform(temp_set.data_df)
imputed_data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,target
20745,0,Local-gov,159032.0,7th-8th,4.0,Never-married,Farming-fishing,Own-child,White,Male,0,0,2,United-States,<=50K
1127,4,Federal-gov,124244.0,HS-grad,9.0,Widowed,Handlers-cleaners,Not-in-family,Black,Male,0,0,2,United-States,<=50K
14826,4,Private,343849.0,Some-college,10.0,Married-civ-spouse,Transport-moving,Husband,Black,Male,0,0,2,United-States,<=50K
30318,2,Private,58343.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,2,United-States,>50K
3466,2,Private,96452.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,3,United-States,>50K
92,2,Private,156384.0,Some-college,9.4,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,2,United-States,


## Numerical Transformations

Many machine learning algorithms, especially distance-based and gradient-based ones, are sensitive to the scale of numerical features. TabCamel's `NumericTransform` supports three scaling strategies:

- `"standard"` - Z-score normalization (mean=0, std=1)
- `"minmax"` - Min-max scaling to the [0, 1] range
- `"quantile"` - Quantile transformation for non-linear scaling

The cells below demonstrate the `"minmax"` and `"quantile"` strategies; `"standard"` is used in the complete pipeline at the end of this tutorial.
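For reference, the arithmetic behind the standard and min-max strategies can be sketched in a few lines of plain Python. This is only an illustration of the formulas; the function names are hypothetical, and the use of the population standard deviation is an assumption, not a statement about what `NumericTransform` does internally.

```python
from statistics import mean, pstdev


def standardize(train, values):
    """z = (x - mean) / std, with statistics taken from the training data."""
    mu, sigma = mean(train), pstdev(train)  # population std: an assumption
    return [(v - mu) / sigma for v in values]


def minmax(train, values):
    """x' = (x - min) / (max - min), with min and max from the training data."""
    lo, hi = min(train), max(train)
    return [(v - lo) / (hi - lo) for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

train_minmax_sketch = minmax(train, train)  # spans exactly 0.0 to 1.0
test_minmax_sketch = minmax(train, test)    # 10.0 lies above the training max
test_standard_sketch = standardize(train, test)
```

Note that a test value outside the training range maps outside [0, 1] under min-max scaling, since the bounds come from the training data only.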


In [8]:
# Import numerical transformation classes
from tabcamel.data.transform import NumericTransform

In [9]:
# Min-Max Scaling: Transform features to a fixed range [0, 1]
# This is useful when you need bounded values or when features have different scales
minmax_scaler = NumericTransform(
    numerical_feature_list=train_set.numerical_feature_list,
    strategy="minmax",  # Scale to [0, 1] range
    include_categorical=False,  # Only transform numerical features
)

# Fit on training data and transform both sets
minmax_scaler.fit(train_set.X_df)
train_minmax = minmax_scaler.transform(train_set.X_df)
test_minmax = minmax_scaler.transform(test_set.X_df)

print("Min-Max scaled training data (numerical features only):")
print(train_minmax[train_set.numerical_feature_list].head())

Min-Max scaled training data (numerical features only):
         fnlwgt  education-num
34495  0.121450       0.800000
18591  0.137385       0.466667
12562  0.076666       0.533333
552    0.129737       0.400000
3479   0.062938       0.866667


In [10]:
# Quantile Transformation: Maps each feature to a uniform or normal distribution via its empirical quantiles
# The mapping is non-linear and reduces the influence of outliers
quantile_scaler = NumericTransform(
    numerical_feature_list=train_set.numerical_feature_list,
    strategy="quantile",  # Quantile transformation
    include_categorical=False,  # Only transform numerical features
    train_num_samples=len(train_set.X_df),  # Required for quantile strategy
)

# Fit on training data and transform both sets
quantile_scaler.fit(train_set.X_df)
train_quantile = quantile_scaler.transform(train_set.X_df)
test_quantile = quantile_scaler.transform(test_set.X_df)

print("Quantile transformed training data (numerical features only):")
print(train_quantile[train_set.numerical_feature_list].head())

Quantile transformed training data (numerical features only):
         fnlwgt  education-num
34495  0.579295       0.834835
18591  0.689399       0.124625
12562  0.282853       0.287788
552    0.641397       0.099099
3479   0.197188       0.943944
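To build intuition for why a quantile transform is robust to outliers, here is a simplified sketch that maps each value to its empirical cumulative rank in the training data (a uniform output). It is a deliberately reduced version of the idea, with hypothetical helper names, and should not be read as the algorithm `NumericTransform` uses.

```python
from bisect import bisect_left, bisect_right


def fit_quantiles(train):
    """The fitted state is simply the sorted training values."""
    return sorted(train)


def to_uniform(sorted_train, values):
    """Map each value to its average rank in the training data, scaled to [0, 1]."""
    n = len(sorted_train)
    out = []
    for v in values:
        # Averaging the left and right insertion points treats ties symmetrically.
        rank = (bisect_left(sorted_train, v) + bisect_right(sorted_train, v)) / 2
        out.append(rank / n)
    return out


sorted_train = fit_quantiles([20, 1000, 10, 20])  # 1000 is an outlier
mapped = to_uniform(sorted_train, [10, 20, 1000, 5000])
```

Only the rank of a value matters here, so the outlier 1000 ends up close to the other points instead of compressing them toward zero as min-max scaling would.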


## Categorical Encoding

Most machine learning algorithms require numerical input, so categorical features need to be encoded. TabCamel's `CategoryTransform` provides two main encoding strategies:

1. **One-Hot Encoding** - Creates binary columns for each category (good for nominal data)
2. **Ordinal Encoding** - Maps categories to integers (good for ordinal data or when memory is a concern)

Let's demonstrate both approaches:
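Before running the TabCamel cells, it may help to see both encodings on a toy column. The sketch below is plain Python with hypothetical helper names; the category order (alphabetical) and the handling of unseen categories are choices made for this illustration only.

```python
def fit_categories(train_values):
    """Learn the category vocabulary from the training data."""
    return sorted(set(train_values))


def ordinal_encode(categories, values):
    """Replace each category with its position in the learned vocabulary."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


def onehot_encode(categories, values):
    """Emit one 0/1 indicator per learned category."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


train = ["Private", "Local-gov", "Private", "Federal-gov"]
categories = fit_categories(train)  # ['Federal-gov', 'Local-gov', 'Private']

train_ordinal_sketch = ordinal_encode(categories, train)
train_onehot_sketch = onehot_encode(categories, train)
```

One-hot encoding widens the data by one column per category, while ordinal encoding keeps a single column but imposes an integer order that may not be meaningful for nominal features. In this sketch an unseen test category yields an all-zero one-hot row and a `KeyError` for ordinal encoding.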


In [11]:
# Import the CategoryTransform class for encoding categorical features
from tabcamel.data.transform import CategoryTransform

In [12]:
# One-Hot Encoding: Create binary columns for each category
# This is ideal for nominal categorical data where categories have no inherent order
onehot_encoder = CategoryTransform(
    categorical_feature_list=train_set.categorical_feature_list,
    strategy="onehot",  # Create binary columns for each category
)

# Fit the encoder on training data to learn all possible categories
onehot_encoder.fit(train_set.X_df)

# Transform both training and test data
train_onehot = onehot_encoder.transform(train_set.X_df)
test_onehot = onehot_encoder.transform(test_set.X_df)

print(f"Original shape: {train_set.X_df.shape}")
print(f"After one-hot encoding: {train_onehot.shape}")
print(f"New columns created: {train_onehot.shape[1] - train_set.X_df.shape[1]}")
print("\nFirst few columns of encoded data:")
print(train_onehot.head())

Original shape: (39073, 14)
After one-hot encoding: (39073, 124)
New columns created: 110

First few columns of encoded data:
       fnlwgt  education-num  age_0  age_1  age_2  age_3  age_4  \
34495  193106             13    0.0    0.0    1.0    0.0    0.0   
18591  216636              8    0.0    0.0    0.0    0.0    1.0   
12562  126977              9    0.0    0.0    0.0    1.0    0.0   
552    205343              7    0.0    0.0    0.0    0.0    1.0   
3479   106705             14    0.0    0.0    0.0    1.0    0.0   

       workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked  \
34495                    0.0                  0.0                     0.0   
18591                    0.0                  0.0                     0.0   
12562                    0.0                  0.0                     0.0   
552                      0.0                  0.0                     0.0   
3479                     0.0                  0.0                     0.0   

       .

In [13]:
# Ordinal Encoding: Map categories to integer values
# This is more memory-efficient and suitable when categories have natural ordering
# or when dealing with high-cardinality categorical features
ordinal_encoder = CategoryTransform(
    categorical_feature_list=train_set.categorical_feature_list,
    strategy="ordinal"  # Map categories to integers
)

# Fit the encoder on training data
ordinal_encoder.fit(train_set.X_df)

# Transform both training and test data
train_ordinal = ordinal_encoder.transform(train_set.X_df)
test_ordinal = ordinal_encoder.transform(test_set.X_df)

print(f"Shape remains the same: {train_ordinal.shape}")
print("Categorical features are now encoded as integers:")
print(train_ordinal[train_set.categorical_feature_list].head())

# Show the category mappings
print("\nCategory mappings:")
for i, feature in enumerate(train_set.categorical_feature_list):
    print(f"{feature}: {dict(enumerate(ordinal_encoder.categories_[i]))}")
    if i >= 2:  # Limit output for readability
        print("... (and more)")
        break

Shape remains the same: (39073, 14)
Categorical features are now encoded as integers:
       age  workclass  education  marital-status  occupation  relationship  \
34495  2.0        3.0        9.0             4.0        11.0           1.0   
18591  4.0        4.0        2.0             2.0         3.0           0.0   
12562  3.0        3.0       11.0             5.0         2.0           1.0   
552    4.0        3.0        1.0             6.0         0.0           4.0   
3479   3.0        6.0       12.0             4.0         3.0           1.0   

       race  sex  capitalgain  capitalloss  hoursperweek  native-country  
34495   4.0  0.0          0.0          0.0           1.0            38.0  
18591   4.0  1.0          0.0          2.0           2.0            38.0  
12562   4.0  1.0          0.0          0.0           1.0            38.0  
552     4.0  0.0          0.0          0.0           2.0            38.0  
3479    4.0  0.0          0.0          0.0           2.0            38.0

## Target Transformation

The target variable often needs preprocessing too. TabCamel's `TargetTransform` handles both classification and regression targets:

- **Classification**: Label encoding to convert class names to integers
- **Regression**: Standardization to normalize continuous target values

Let's demonstrate target transformation for our classification task:
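Label encoding itself is a small bidirectional lookup. The sketch below shows the idea in plain Python, assuming classes are numbered in sorted order; `fit_labels` and `class2encoded` are names invented for this example.

```python
def fit_labels(train_targets):
    """Build class -> integer and integer -> class lookups from training targets."""
    classes = sorted(set(train_targets))
    class2encoded = {c: i for i, c in enumerate(classes)}
    encoded2class = {i: c for c, i in class2encoded.items()}
    return class2encoded, encoded2class


train_targets = ["<=50K", ">50K", "<=50K"]
class2encoded, encoded2class = fit_labels(train_targets)

encoded = [class2encoded[t] for t in train_targets]  # forward transform
decoded = [encoded2class[e] for e in encoded]        # inverse transform
```

Keeping the inverse lookup around is what lets you turn integer predictions back into readable class names.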


In [14]:
# Import the TargetTransform class for preprocessing target variables
from tabcamel.data.transform import TargetTransform

In [15]:
# Target transformation for classification task
# This will encode class labels (e.g., '<=50K', '>50K') to integers (0, 1)
target_transformer = TargetTransform(
    task="classification",  # Task type determines the transformation method
    target_feature=train_set.target_col,  # Name of the target column
)

# Fit the transformer on training data to learn the class mappings
target_transformer.fit(train_set.data_df)

# Transform both training and test targets
train_targets_encoded = target_transformer.transform(train_set.data_df)
test_targets_encoded = target_transformer.transform(test_set.data_df)

print("Original target values:")
train_set.data_df[train_set.target_col].head()

Original target values:


34495    <=50K
18591    <=50K
12562    <=50K
552      <=50K
3479     <=50K
Name: target, dtype: object

In [16]:

print("\nEncoded target values:")
train_targets_encoded.head()


Encoded target values:


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,target
34495,2,Private,193106,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,1,United-States,0
18591,4,Self-emp-inc,216636,12th,8,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,2,2,United-States,0
12562,3,Private,126977,HS-grad,9,Separated,Craft-repair,Not-in-family,White,Male,0,0,1,United-States,0
552,4,Private,205343,11th,7,Widowed,Adm-clerical,Unmarried,White,Female,0,0,2,United-States,0
3479,3,State-gov,106705,Masters,14,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,2,United-States,0


In [17]:
# Show the class-to-integer mapping
print(f"\nClass mapping: {target_transformer.encoded2class}")

# Demonstrate inverse transformation
print("\nInverse transformation (encoded back to original):")
reconstructed = target_transformer.inverse_transform(train_targets_encoded.head())
reconstructed


Class mapping: {0: '<=50K', 1: '>50K'}

Inverse transformation (encoded back to original):


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,target
34495,2,Private,193106,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,1,United-States,<=50K
18591,4,Self-emp-inc,216636,12th,8,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,2,2,United-States,<=50K
12562,3,Private,126977,HS-grad,9,Separated,Craft-repair,Not-in-family,White,Male,0,0,1,United-States,<=50K
552,4,Private,205343,11th,7,Widowed,Adm-clerical,Unmarried,White,Female,0,0,2,United-States,<=50K
3479,3,State-gov,106705,Masters,14,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,2,United-States,<=50K


## Best Practices and Transform Pipeline

When applying multiple transformations, it's important to follow these best practices:

1. **Always fit on training data only** - Never fit transformers on test data
2. **Apply transformations in a sensible order** - Imputation → Scaling → Encoding is a common default, because scalers and encoders generally expect complete data
3. **Use the same fitted transformers** - Apply the same transformation parameters to test data
4. **Handle inverse transformations** - All TabCamel transforms support `inverse_transform()`

Let's demonstrate a complete preprocessing pipeline:
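The chaining logic can be summarized in a small sketch: each step is fitted on the training output of the previous step and then applied to both splits. `FillMissing`, `MinMax`, and `run_pipeline` are hypothetical helpers written for this illustration; TabCamel's own classes are used in the cell that follows.

```python
class FillMissing:
    """Toy imputer: replace None with the training mean."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_ = sum(observed) / len(observed)
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]


class MinMax:
    """Toy scaler: map the training range onto [0, 1]."""

    def fit(self, values):
        self.lo_, self.hi_ = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.lo_) / (self.hi_ - self.lo_) for v in values]


def run_pipeline(steps, train, test):
    """Fit each step on the current training data, then transform both splits."""
    for step in steps:
        step.fit(train)  # the test split is never used for fitting
        train, test = step.transform(train), step.transform(test)
    return train, test


train = [0.0, None, 10.0, 20.0]
test = [None, 5.0]
train_out, test_out = run_pipeline([FillMissing(), MinMax()], train, test)
```

Swapping the two steps in this sketch fails with a `TypeError`, because `min()` cannot compare `None` with a float. That is one reason imputation usually comes first.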


In [18]:
# Complete preprocessing pipeline example
# Let's create a fresh dataset and apply multiple transformations in sequence

# Step 1: Load fresh data and split
fresh_dataset = TabularDataset(dataset_name="adult", task_type="classification")
split_dict = fresh_dataset.split(split_mode="stratified", train_size=0.8)
train_data = split_dict["train_set"]
test_data = split_dict["test_set"]

print("Starting preprocessing pipeline...")
print(f"Training data shape: {train_data.X_df.shape}")
print(f"Test data shape: {test_data.X_df.shape}")

# Step 2: Imputation (handle missing values first)
pipeline_imputer = SimpleImputeTransform(
    categorical_feature_list=train_data.categorical_feature_list,
    numerical_feature_list=train_data.numerical_feature_list,
    strategy_categorical="most_frequent",
    strategy_numerical="mean",
)
pipeline_imputer.fit(train_data.X_df)
train_imputed = pipeline_imputer.transform(train_data.X_df)
test_imputed = pipeline_imputer.transform(test_data.X_df)
print("✓ Step 2: Imputation completed")

# Step 3: Numerical scaling
pipeline_scaler = NumericTransform(
    numerical_feature_list=train_data.numerical_feature_list,
    strategy="standard",
    include_categorical=False,
)
pipeline_scaler.fit(train_imputed)
train_scaled = pipeline_scaler.transform(train_imputed)
test_scaled = pipeline_scaler.transform(test_imputed)
print("✓ Step 3: Numerical scaling completed")

# Step 4: Categorical encoding
pipeline_encoder = CategoryTransform(categorical_feature_list=train_data.categorical_feature_list, strategy="onehot")
pipeline_encoder.fit(train_scaled)
train_final = pipeline_encoder.transform(train_scaled)
test_final = pipeline_encoder.transform(test_scaled)
print("✓ Step 4: Categorical encoding completed")

# Step 5: Target transformation
pipeline_target = TargetTransform(task="classification", target_feature=train_data.target_col)
pipeline_target.fit(train_data.data_df)
train_targets_final = pipeline_target.transform(train_data.data_df)
test_targets_final = pipeline_target.transform(test_data.data_df)
print("✓ Step 5: Target transformation completed")

print(f"\nFinal preprocessed data shape: {train_final.shape}")
print("✅ Complete preprocessing pipeline finished!")
print("\nThis preprocessed data is now ready for machine learning algorithms!")

Starting preprocessing pipeline...
Training data shape: (39073, 14)
Test data shape: (9769, 14)
✓ Step 2: Imputation completed
✓ Step 3: Numerical scaling completed
✓ Step 4: Categorical encoding completed
✓ Step 5: Target transformation completed

Final preprocessed data shape: (39073, 121)
✅ Complete preprocessing pipeline finished!

This preprocessed data is now ready for machine learning algorithms!


## Summary

In this tutorial, we've covered TabCamel's comprehensive data preprocessing capabilities:

### Transform Classes Covered:

1. **SimpleImputeTransform** - Handle missing values with different strategies for categorical and numerical features
2. **NumericTransform** - Multiple scaling strategies (standard, minmax, quantile)
3. **CategoryTransform** - Categorical encoding (one-hot, ordinal)
4. **TargetTransform** - Target variable preprocessing for classification and regression

### Key Takeaways:

- **Fit-Transform Pattern**: Always fit on training data, then transform both training and test data
- **Typical Order** (not universal): Imputation → Numerical Scaling → Categorical Encoding → Target Transformation
- **Inverse Transforms**: All transforms support `inverse_transform()` for reversibility
- **Flexibility**: Different strategies available for different types of data and use cases

### Next Steps:

- Explore Tutorial 3 for model training with preprocessed data
- Try different transformation strategies for your specific datasets
- Combine transforms in different orders to see their effects
- Use the inverse transforms to interpret model predictions in the original scale
