# Advanced transforms

## How to Transform Numerical and Categorical Data

🔧 **Why Data Transformation Is Needed**
- Essential for preparing raw data before model training.
- Ensures machine learning algorithms can understand and learn from the data structure.
- Different types of data (numerical vs. categorical) require different preprocessing steps.

⚠️ **Challenges with Mixed Data Types**
- Mixed-type datasets need tailored preprocessing **per column type**.
- Manual separation and transformation of columns is inefficient and error-prone.

🧰 **Solution: ColumnTransformer (scikit-learn)**
- Enables selective transformation of specific columns.
- Applies different transforms to numerical and categorical columns in a single step.
- Avoids manual column management and recombination.

🔨 **How to Use ColumnTransformer**
- Define a list of transformers: **Each as a tuple: ("name", transform_object, columns)**
- **Example usage**:<br>
o	Apply **OneHotEncoder** to **categorical columns** (e.g., columns 0 & 1).<br>
o	Apply **SimpleImputer(strategy='median')** to **numerical** columns.<br>
- **Remainder options**:<br>
o	**drop** (default): drops unspecified columns.<br>
o	**passthrough**: passes through untouched columns.
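
The transformer list described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the toy DataFrame, its values, and the variable names are hypothetical, and the column positions (0 and 1 categorical, 2 numerical, 3 unlisted) are chosen only to mirror the bullets.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data: columns 0 and 1 are categorical, column 2 is
# numerical with one missing value, column 3 is not listed in any transformer
X = pd.DataFrame([
    ['red', 'S', 1.0, 10],
    ['blue', 'M', np.nan, 20],
    ['red', 'L', 3.0, 30],
    ['blue', 'S', 5.0, 40],
])

# each transformer is a ("name", transform_object, columns) tuple
transformers = [
    ('cat', OneHotEncoder(), [0, 1]),
    ('num', SimpleImputer(strategy='median'), [2]),
]

# remainder='passthrough' keeps column 3; the default 'drop' would discard it
ct = ColumnTransformer(transformers=transformers, remainder='passthrough')
X_t = ct.fit_transform(X)

# 2 + 3 one-hot columns, 1 imputed column, 1 passthrough column
print(X_t.shape)
```

The output columns follow the order of the transformer list, with any passthrough columns appended at the end.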

🔁 **Integrating ColumnTransformer in a Pipeline**
- Combine ColumnTransformer with model in a Pipeline.
- Ensures consistent preprocessing for:<br>
o	**Training**<br>
o	**Cross-validation**<br>
o	**Future predictions**<br>
- Enables automation and reproducibility.

🧠 **Key Takeaways for Practice**
- Mixed-type preprocessing is streamlined with **ColumnTransformer**.
- Always evaluate with cross-validation for reliable metrics.
- Modular, reusable pipelines cut duplicated preprocessing code and keep transforms from being fit on held-out data during evaluation.

#### Data Preparation for the Abalone Regression Dataset

The abalone dataset is a standard machine learning problem that involves predicting the age of an abalone from its physical measurements. The dataset has 4,177 examples and 8 input variables, and the target variable is an integer. A naive model that always predicts the mean value achieves a mean absolute error (MAE) of about 2.363 (std 0.092), evaluated via 10-fold cross-validation.
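
A predict-the-mean baseline of this kind can be evaluated along the following lines. This is a sketch under stated assumptions: the text does not say which tool produced the 2.363 figure, so `DummyRegressor` is an assumption, `naive_mae` is a hypothetical helper name, and the data is a synthetic stand-in so that the block runs without a download. The number it prints is not the abalone result.

```python
from numpy import absolute, mean, std
from numpy.random import default_rng
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

def naive_mae(X, y):
    """Return mean and std of the 10-fold MAE for a predict-the-mean baseline."""
    model = DummyRegressor(strategy='mean')
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
    # scikit-learn reports negated MAE, so take the absolute value
    scores = absolute(scores)
    return mean(scores), std(scores)

# synthetic stand-in data (hypothetical, NOT the abalone dataset)
rng = default_rng(1)
X_demo = rng.random((200, 8))
y_demo = rng.integers(1, 30, size=200)

mae, spread = naive_mae(X_demo, y_demo)
print('Naive MAE: %.3f (%.3f)' % (mae, spread))
```

Calling `naive_mae(X, y)` on the loaded abalone inputs and target would give the baseline against which the pipeline below can be compared.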

We can see that the first column is categorical and the remaining columns are numerical. We may want to one-hot encode the first column and normalize the remaining numerical columns, and this can be achieved using the ColumnTransformer. We can model this as a regression predictive modeling problem with a support vector machine model (SVR). First, we need to load the dataset. We can load the dataset directly from the file using the read_csv() Pandas function, then split the data into two data frames: one for input and one for the output. The complete example of loading the dataset is listed below.

In [1]:
# load the dataset
from pandas import read_csv 

# load dataset
path_abalone_data = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv"
dataframe = read_csv(path_abalone_data, header=None) 

# split into inputs and outputs
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix] 

print(X.shape, y.shape)

(4177, 8) (4177,)


Next, we can use the **select_dtypes()** function to select the column indexes that match different data types. <br>We are interested in a list of numerical columns, marked as float64 or int64 in Pandas, and a list of categorical columns, marked as object or bool type in Pandas.
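
The column selection can be tried in isolation on a small frame first. This is a minimal sketch: the three-row DataFrame and its values are made up to mimic the abalone layout. The text column is created with an explicit `object` dtype because, depending on the pandas version, text may otherwise be stored with a dedicated string dtype that the `'object'` filter would not match.

```python
import pandas as pd

# hypothetical frame: one text column followed by numerical columns
df = pd.DataFrame({
    0: pd.Series(['M', 'F', 'I'], dtype=object),
    1: [0.455, 0.530, 0.330],
    2: [15, 9, 7],
})

# column labels whose dtype matches each group
numerical_ix = df.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = df.select_dtypes(include=['object', 'bool']).columns

print(list(numerical_ix))
print(list(categorical_ix))
```

The resulting index objects can be passed directly as the `columns` element of a ColumnTransformer tuple.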

In [2]:
# example of using the ColumnTransformer for the Abalone dataset 
from numpy import mean
from numpy import std
from numpy import absolute 
from pandas import read_csv
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder 
from sklearn.preprocessing import MinMaxScaler 
from sklearn.svm import SVR

# load dataset
path_abalone_data = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv"
dataframe = read_csv(path_abalone_data, header=None) 

# split into inputs and outputs
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix] 
print(X.shape, y.shape)

# determine categorical and numerical features
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns 
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns 

# define the data preparation for the columns
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)] 
col_transform = ColumnTransformer(transformers=t)

# define the model
model = SVR(kernel='rbf', gamma='scale', C=100)

# define the data preparation and modeling pipeline
pipeline = Pipeline(steps=[('prep',col_transform), ('m', model)]) 

# define the model cross-validation configuration
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# evaluate the pipeline using cross validation and calculate MAE
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

# convert MAE scores to positive values 
scores = absolute(scores)

# summarize the model performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

(4177, 8) (4177,)
MAE: 1.465 (0.047)


You now have a template for using the ColumnTransformer on a dataset with mixed data types that you can use and adapt for your own projects in the future.

## How to Transform the Target in Regression

📌 **Why Target Transformation Matters**
- Can substantially **improve model performance** in regression tasks.
- Transforms like **scaling** improve models sensitive to value range (e.g., linear models, SVMs).
- Often overlooked: **Target variable** (output) **needs scaling too**—not just input features.

📏 **Importance of Data Scaling**
- Helps algorithms that rely on:<br>
o	Weighted sums (Linear models, Neural Nets)<br>
o	Distance measures (SVM, KNN)<br>
- Common techniques:<br>
o	**MinMaxScaler** → Normalize values to range [0,1]<br>
o	**PowerTransformer** → Make data more Gaussian-like<br>
- Improves both input and target representations.

⚙️ **Two Ways to Scale Target Variables**<br><br>
1️⃣ Manual Scaling
- Requires multiple steps:
1.	Create and fit scaler (e.g., MinMaxScaler).
2.	Transform training and test target values.
3.	Invert transforms for predictions.
- ⚠️ **Downside**:<br>
o	Tedious<br>
o	Doesn’t integrate well with cross_val_score() or pipelines.<br><br>

2️⃣ Automatic Scaling with **TransformedTargetRegressor**
- Scikit-learn wrapper that:<br>
o	Automatically transforms and inverse-transforms target variable.<br>
o	Integrates with models and pipelines.<br>
- Defined by:
- **TransformedTargetRegressor(regressor=model, transformer=scaler)**
- ✅ Allows use of model evaluation functions like **cross_val_score()**.
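
The two approaches above can be compared side by side. This is a minimal sketch on hypothetical synthetic data; `LinearRegression` and all variable names are assumptions made for illustration, not part of the worked example that follows.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# hypothetical synthetic regression data
rng = np.random.default_rng(1)
X = rng.random((100, 3))
y = 50.0 + X @ np.array([10.0, -5.0, 2.0]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# 1) manual: fit the scaler on the training target, fit the model on the
#    scaled target, then invert the transform on the predictions
target_scaler = MinMaxScaler()
y_train_scaled = target_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
manual_model = LinearRegression().fit(X_train, y_train_scaled)
yhat_scaled = manual_model.predict(X_test)
yhat_manual = target_scaler.inverse_transform(yhat_scaled.reshape(-1, 1)).ravel()

# 2) automatic: the wrapper performs the same steps internally
wrapped = TransformedTargetRegressor(regressor=LinearRegression(), transformer=MinMaxScaler())
wrapped.fit(X_train, y_train)
yhat_auto = wrapped.predict(X_test)

# both routes should give the same predictions in the original units
print(np.allclose(yhat_manual, yhat_auto))
```

Because the wrapper is itself an estimator, it can be handed to `cross_val_score()`, which refits the target transform on each training fold.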

🧪 **Real Example: Boston Housing Dataset**<br><br>
📊 **Dataset Overview**<br>
- Predict house prices based on 13 suburb features.
- Regression task with 506 samples.<br><br>
🛠️ **Approach**<br>
1.	Normalize inputs using a Pipeline.
2.	Wrap pipeline in a TransformedTargetRegressor to scale the target.
3.	Evaluate using repeated 10-fold cross-validation.<br><br>
📉 **Results**<br>
- Naive model MAE: ~6.6
- With scaling: ~3.2 MAE
- With PowerTransformer (on input & target): ~3.0 MAE → best performance

🔁 **PowerTransformer for Advanced Scaling**
- Applies **Yeo-Johnson transform** → Makes distribution more Gaussian-like.
- Often used with **MinMaxScaler** to:<br>
o	Ensure positive values<br>
o	Improve inverse transform stability<br>
- Ideal for improving **linear model performance**.
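
The effect of the transform and its inverse can be checked on a small skewed sample. This is a sketch only: the exponential sample is hypothetical data chosen because it is right-skewed, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# hypothetical right-skewed sample with a single feature
rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=(500, 1))

# defaults: method='yeo-johnson', standardize=True
pt = PowerTransformer()
data_t = pt.fit_transform(data)

# map the transformed values back to the original scale
data_back = pt.inverse_transform(data_t)

print('fitted lambda: %.3f' % pt.lambdas_[0])
print('transformed mean=%.3f, std=%.3f' % (data_t.mean(), data_t.std()))
print('round trip ok:', np.allclose(data, data_back))
```

With the default `standardize=True`, the transformed feature has zero mean and unit variance, which is why no separate standardization step is needed afterwards.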

🧠 **Key Takeaways**
- Always **consider transforming target variables** in regression tasks.
- Use TransformedTargetRegressor for:<br>
o	**Cleaner code**<br>
o	**Integration with cross-validation**<br>
o	**Avoiding manual transform/inverse headaches**<br>
- Experiment with different transforms (MinMaxScaler, PowerTransformer) to boost accuracy.

#### Example of Using the TransformedTargetRegressor

In this section, we will demonstrate how to use the TransformedTargetRegressor on a real dataset. We will use the Boston housing regression problem that has 13 inputs and one numerical target and requires learning the relationship between suburb characteristics and house prices. 

In [3]:
# load and summarize the dataset 
from numpy import loadtxt

# load data
boston_housing = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
dataset = loadtxt(boston_housing, delimiter=",") 

# split into inputs and outputs
X, y = dataset[:, :-1], dataset[:, -1]

# summarize dataset
print(X.shape, y.shape)

(506, 13) (506,)


In [4]:
# example of normalizing input and output variables for regression. 
from numpy import mean
from numpy import absolute 
from numpy import loadtxt
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import RepeatedKFold 
from sklearn.pipeline import Pipeline
from sklearn.linear_model import HuberRegressor 
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor 

# load data
boston_housing = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
dataset = loadtxt(boston_housing, delimiter=",") 

# split into inputs and outputs
X, y = dataset[:, :-1], dataset[:, -1] 

# prepare the model with input scaling
pipeline = Pipeline(steps=[('normalize', MinMaxScaler()), ('model', HuberRegressor())]) 

# prepare the model with target scaling
model = TransformedTargetRegressor(regressor=pipeline, transformer=MinMaxScaler())

# evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1) 

# convert scores to positive
scores = absolute(scores) 

# summarize the result 
s_mean = mean(scores)
print('Mean MAE: %.3f' % (s_mean))

Mean MAE: 3.203


Running the example evaluates the model with normalization of the input and output variables.

In this case, we achieve a MAE of about 3.2, much better than a naive model that achieved about 6.6.

We are not restricted to using scaling objects. We can also explore other data transforms on the target variable, such as the PowerTransformer, which can make each variable more Gaussian-like (using the Yeo-Johnson transform) and improve the performance of linear models (introduced in Chapter 20). By default, the PowerTransformer also standardizes each variable after performing the transform. It can also help to scale the values to the range 0-1 prior to applying a power transform, to avoid problems inverting the transform. We achieve this using the MinMaxScaler and defining a strictly positive range, here feature_range=(1e-5, 1) (introduced in Chapter 17). The complete example of using a PowerTransformer on the input and target variables of the housing dataset is listed below.

In [5]:
#example of power transform input and output variables for regression. 
from numpy import mean
from numpy import absolute 
from numpy import loadtxt
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import RepeatedKFold 
from sklearn.pipeline import Pipeline
from sklearn.linear_model import HuberRegressor 
from sklearn.preprocessing import PowerTransformer 
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor 

# load data
boston_housing = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
dataset = loadtxt(boston_housing, delimiter=",") 

# split into inputs and outputs
X, y = dataset[:, :-1], dataset[:, -1]

# prepare the model with input scaling and power transform 
steps = list()
steps.append(('scale', MinMaxScaler(feature_range=(1e-5,1)))) 
steps.append(('power', PowerTransformer())) 
steps.append(('model', HuberRegressor()))
pipeline = Pipeline(steps=steps)

# prepare the model with target scaling
model = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())

# evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1) 

# convert scores to positive
scores = absolute(scores) 

# summarize the result 
s_mean = mean(scores)
print('Mean MAE: %.3f' % (s_mean))

Mean MAE: 2.972


Running the example evaluates the model with a power transform of the input and output variables.

In this case, we see a further improvement to a MAE of about 2.97.

## How to Save and Load Data Transforms

🚧 **Why It Matters**
- ✅ **Consistency is key**: Any transform applied during training must also be applied to **test or future data**.
- ❌ Inconsistent data preparation → **invalid predictions or degraded model performance**.
- ✅ Essential for **model deployment** and **real-world inference**.

⚠️ **Challenge: Preparing New Data**
- Input variables can vary in scale or units (e.g., inches, days, miles).<br><br>
- **Models sensitive to input scale**:<br>
o	Logistic Regression<br>
o	Neural Networks<br>
o	KNN<br><br>
- **Scaling ensures**:<br>
o	Equal contribution of all features.<br>
o	Improved model stability and convergence.<br><br>
- 🎯 Scaling must be:<br>
o	Fit on **training data**<br>
o	Applied to **both train and test/new data**<br><br>
- ✅ Easy in-memory, ❌ Tricky when model is **saved for later use**

💡 **Solution: Save Data Preparation Objects**
- Save both: <br>
o	🧠 Trained model<br>
o	⚙️ Data preparation object (e.g., MinMaxScaler, StandardScaler)<br><br>
- 🔧 Use pickle (Python built-in) for serialization.
- Two options:
1.	Save **entire objects** (easiest, most common).
2.	Save **parameters only** (for more control, expert use).
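
The second option can be sketched for a MinMaxScaler, whose fitted state is fully described by its `min_` and `scale_` attributes. This is an illustration under assumptions: the text uses pickle for whole objects and does not say how parameters alone would be stored, so JSON is swapped in here, and the data and variable names are hypothetical.

```python
import json
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical data with two features on very different scales
rng = np.random.default_rng(1)
X_train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(50, 2))
X_new = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(5, 2))

scaler = MinMaxScaler().fit(X_train)

# keep only the learned parameters as plain lists
params = json.dumps({
    'min_': scaler.min_.tolist(),
    'scale_': scaler.scale_.tolist(),
})

# later: rebuild the transform from the parameters, without the scaler object
loaded = json.loads(params)
X_new_scaled = X_new * np.array(loaded['scale_']) + np.array(loaded['min_'])

# should match what the fitted scaler itself would produce
print(np.allclose(X_new_scaled, scaler.transform(X_new)))
```

Storing parameters this way avoids depending on a matching library version at load time, at the cost of re-implementing the transform by hand.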

🧪 **Worked Example: Step-by-Step**<br><br>
🧰 **1. Define and Split Dataset**
- Use make_blobs() for a simple binary classification example.
- Train/test split with fixed random state.
- Observed:<br>
o	Variables have **different scales** in train vs. test.<br>
    
📏 **2. Scale Dataset**
- Apply **MinMaxScaler**:<br>
o	Fit on training set<br>
o	Transform both train and test sets<br>
- Result: All features in [0,1] range for both sets.
                                      
💽 **3. Save Model & Scaler**
- Train **LogisticRegression** on scaled data.
- Save both:<br>
o	model.pkl (**trained model**)<br>
o	scaler.pkl (**fitted scaler**)<br>
                                      
📂 **4. Load and Use Later**
- Load model and scaler from disk.
- Apply saved scaler to **new/test data**.
- Use model to make predictions.
- ✅ Scaler ensures data is in the same format as during training.
- ✅ 100% accuracy achieved on this trivial test case.

🧠 **Key Takeaways**
- Always **save the exact transform** used during training.
- Use tools like **pickle** to:<br>
o	Serialize models and scalers.<br>
o	Load them later for consistent preprocessing and inference.<br>
- 🔁 Reuse this workflow in real-world machine learning projects to ensure **reliable and repeatable results**.

### Define a Dataset

First, we need a dataset. We will use a synthetic dataset, specifically a binary classification problem with two input variables created randomly via the make_blobs() function. The example below creates a test dataset with 100 examples, two input features, and two class labels (0 and 1). The dataset is then split into training and test sets, and the min and max values of each variable are reported.
Importantly, the random state is set when creating the dataset and when splitting the data so that the same dataset is created and the same split of data is performed each time that the code is run.


In [6]:
# example of creating a test dataset and splitting it into train and test sets 
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split 

# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) 

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) 

# summarize the scale of each input variable
for i in range(X_test.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' % (i, X_train[:, i].min(), X_train[:, i].max(),
    X_test[:, i].min(), X_test[:, i].max()))

>0, train: min=-11.856, max=0.526, test: min=-11.270, max=0.085
>1, train: min=-6.388, max=6.507, test: min=-5.581, max=5.926


Running the example reports the min and max values for each variable in both the train and test datasets. We can see that each variable has a different scale, and that the scales differ between the train and test datasets. This is a realistic scenario that we may encounter with a real dataset.

### Scale the Dataset

Next, we can scale the dataset. We will use the MinMaxScaler to scale each input variable to the range 0-1 (introduced in Chapter 17). The best practice use of this scaler is to fit it on the training dataset and then apply the transform to the training dataset, and other datasets: in this case, the test dataset. The complete example of scaling the data and summarizing the effects is listed below.

In [7]:
# example of scaling the dataset
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import MinMaxScaler

# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) 

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) 

# define scaler
scaler = MinMaxScaler()
# fit scaler on the training dataset 
scaler.fit(X_train)

# transform both datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# summarize the scale of each input variable 
for i in range(X_test.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' % (i, X_train_scaled[:, i].min(), X_train_scaled[:, i].max(),
    X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))

>0, train: min=0.000, max=1.000, test: min=0.047, max=0.964
>1, train: min=0.000, max=1.000, test: min=0.063, max=0.955


Running the example prints the effect of the scaled data showing the min and max values for each variable in the train and test datasets. We can see that all variables in both datasets now have values in the desired range of 0 to 1.

### Save Model and Data Scaler

Next, we can fit a model on the training dataset and save both the model and the scaler object to file. We will use a LogisticRegression model because the problem is a simple binary classification task. The training dataset is scaled as before, and in this case, we will assume the test dataset is currently not available. Once scaled, the dataset is used to fit a logistic regression model. We will use the pickle module to save the LogisticRegression model to one file, and the MinMaxScaler to another file. The complete example is listed below.

In [8]:
# example of fitting a model on the scaled dataset 
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression 
from pickle import dump

# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) 

# split data into train and test sets
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.33, random_state=1) 

# define scaler
scaler = MinMaxScaler()

# fit scaler on the training dataset 
scaler.fit(X_train)

# transform the training dataset 
X_train_scaled = scaler.transform(X_train) 

# define model
model = LogisticRegression(solver='lbfgs') 
model.fit(X_train_scaled, y_train)

# save the model (the with-block closes the file once it is written)
model_path = "../../data/models/model.pkl"
with open(model_path, 'wb') as f:
    dump(model, f)

# save the scaler
scaler_path = "../../data/scalers/scaler.pkl"
with open(scaler_path, 'wb') as f:
    dump(scaler, f)

Running the example scales the data, fits the model, and saves the model and scaler to files using pickle. <br>You should now have two files at the paths defined in the code (these directories must already exist):
- The model object: model.pkl
- The scaler object: scaler.pkl

### Load Model and Data Scaler

Finally, we can load the model and the scaler object and make use of them. In this case, we will assume that the training dataset is not available, and that only new data or the test dataset is available. We will load the model and the scaler, then use the scaler to prepare the new data and use the model to make predictions. Because it is a test dataset, we have the expected target values, so we will compare the predictions to the expected target values and calculate the accuracy of the model. The complete example is listed below.

In [9]:
# load model and scaler and make predictions on new data
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
from pickle import load 

# prepare dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) 

# split data into train and test sets
_, X_test, _, y_test = train_test_split(X, y, test_size=0.33, random_state=1) 

# load the model (the with-block closes the file once it is read)
model_path = "../../data/models/model.pkl"
with open(model_path, 'rb') as f:
    model = load(f)

# load the scaler
scaler_path = "../../data/scalers/scaler.pkl"
with open(scaler_path, 'rb') as f:
    scaler = load(f)

# check scale of the test set before scaling 
print('Raw test set range')
for i in range(X_test.shape[1]):
    print('>%d, min=%.3f, max=%.3f' % (i, X_test[:, i].min(), X_test[:, i].max())) 

# transform the test dataset
X_test_scaled = scaler.transform(X_test) 

print('\nScaled test set range')
for i in range(X_test_scaled.shape[1]):
    print('>%d, min=%.3f, max=%.3f' % (i, X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))
    
# make predictions on the test set 
yhat = model.predict(X_test_scaled) 

# evaluate accuracy
acc = accuracy_score(y_test, yhat) 

print('\nTest Accuracy:', acc)

Raw test set range
>0, min=-11.270, max=0.085
>1, min=-5.581, max=5.926

Scaled test set range
>0, min=0.047, max=0.964
>1, min=0.063, max=0.955

Test Accuracy: 1.0


Running the example loads the model and scaler, then uses the scaler to prepare the test dataset correctly for the model, meeting the expectations of the model when it was trained. To confirm the scaler is having the desired effect, we report the min and max value for each input feature both before and after applying the scaling. The model then makes a prediction for the examples in the test set and the classification accuracy is calculated. In this case, as expected, with the dataset correctly normalized, the model achieves 100 percent accuracy on the test set because the test problem is trivial.
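
A possible variation, not used in the walkthrough above, is to place the scaler and the model in a single Pipeline so that only one object has to be saved and loaded. The sketch below is an assumption-laden illustration: it reuses the same synthetic dataset and split, serializes to bytes in memory with `dumps()`/`loads()` instead of writing to the file paths used earlier, and the step names are hypothetical.

```python
from pickle import dumps, loads
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset and split as in the examples above
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# scaler and model fitted together as one object
pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression(solver='lbfgs')),
])
pipeline.fit(X_train, y_train)

# serialize to bytes and restore; writing the bytes to a file works the same way
payload = dumps(pipeline)
restored = loads(payload)

# the restored pipeline scales the raw test data before predicting
yhat = restored.predict(X_test)
print('Test Accuracy:', accuracy_score(y_test, yhat))
```

The trade-off is that the scaler can no longer be inspected or swapped independently without unpacking the pipeline steps.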