# Custom Transformers

While scikit-learn offers many built-in transformers (StandardScaler, OneHotEncoder, etc.), you’ll often need to **create your own custom transformers** to:
- Engineer new features
- Apply logic specific to your domain
- Integrate external processing (e.g., text, audio, or image operations)
- Post-process model predictions

To build a custom transformer, inherit from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin`. Your class must implement the `fit` and `transform` methods:

In [64]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self # Always return self
    
    def transform(self, X):
        X_transformed = X.copy() # Do some transformation
        return X_transformed

You get `fit_transform()` automatically by mixing in `TransformerMixin`. Its default implementation is roughly equivalent to:
```python
def fit_transform(self, X, y=None):
    return self.fit(X, y).transform(X)
```
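
As a quick check of the two base classes, the sketch below defines a hypothetical `ColumnScaler` (not part of scikit-learn; the name and its `factor` parameter are made up for illustration). It shows that `fit_transform()` matches `fit()` followed by `transform()`, and that `BaseEstimator` supplies `get_params()` based on the `__init__` signature.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: multiplies every value by a fixed factor."""

    def __init__(self, factor=1.0):
        self.factor = factor  # store constructor arguments unchanged

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return np.asarray(X) * self.factor


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler = ColumnScaler(factor=2.0)

# fit_transform is inherited from TransformerMixin
out_combined = scaler.fit_transform(X_demo)
out_separate = scaler.fit(X_demo).transform(X_demo)

# get_params is inherited from BaseEstimator
params = scaler.get_params()

print(out_combined)
print(params)  # {'factor': 2.0}
```

Storing each constructor argument under an attribute of the same name is what lets `get_params()` and `set_params()` work, which matters later when a transformer is tuned inside a grid search.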

In [65]:
# Example: extract the character length of a text column
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['text_length'] = X[self.column].apply(len)
        return X[['text_length']]
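
A minimal usage sketch for the extractor, assuming a toy DataFrame with a made-up `review` column. The class is repeated so the block runs on its own. Note that `apply(len)` would fail on missing values, so this assumes the column holds only strings.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X['text_length'] = X[self.column].apply(len)
        return X[['text_length']]


# Hypothetical toy data; the column name 'review' is an assumption
reviews = pd.DataFrame({'review': ['good', 'not bad', '']})

lengths = TextLengthExtractor(column='review').fit_transform(reviews)
print(lengths)
```

Because `transform` works on a copy, the original `reviews` frame keeps its single column.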

**Custom Text Transformer: Word Count**

In [72]:
class WordCountExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, text_column):
        self.text_column = text_column

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        return X[self.text_column].apply(lambda text: len(str(text).split())).values.reshape(-1, 1)
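
The sketch below runs the word counter on a small made-up DataFrame (the `body` column is an assumption). It also highlights one behavior worth knowing: because of the `str(text)` cast, a missing value becomes the string `'nan'` and is counted as one word.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class WordCountExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, text_column):
        self.text_column = text_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        counts = X[self.text_column].apply(lambda text: len(str(text).split()))
        return counts.values.reshape(-1, 1)  # 2-D column, as scikit-learn expects


# Hypothetical toy data with one missing entry
docs = pd.DataFrame({'body': ['one two three', 'single', np.nan]})

counts = WordCountExtractor(text_column='body').fit_transform(docs)
print(counts.ravel())  # [3 1 1] -> the NaN row is counted as the word 'nan'
```

If missing text should count as zero words, one option is to fill the column with empty strings before this step.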

**Custom Image Transformer: Resize Images**

In [74]:
from PIL import Image
import numpy as np

class ImageResizer(BaseEstimator, TransformerMixin):
    def __init__(self, size=(64, 64)):
        self.size = size

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.array([
            np.array(Image.fromarray(img).resize(self.size))
            for img in X
        ])
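
A short usage sketch with synthetic images, assuming 8-bit RGB arrays as input. One detail that is easy to miss: PIL's `resize` takes `(width, height)`, while the resulting NumPy arrays are shaped `(height, width, channels)`.

```python
import numpy as np
from PIL import Image
from sklearn.base import BaseEstimator, TransformerMixin


class ImageResizer(BaseEstimator, TransformerMixin):
    def __init__(self, size=(64, 64)):
        self.size = size  # (width, height), as PIL expects

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.array(Image.fromarray(img).resize(self.size))
            for img in X
        ])


# Two synthetic RGB images of different sizes: (height, width, channels)
rng = np.random.default_rng(0)
images = [
    rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8),
    rng.integers(0, 256, size=(80, 50, 3), dtype=np.uint8),
]

resized = ImageResizer(size=(32, 16)).fit_transform(images)
print(resized.shape)  # (2, 16, 32, 3)
```

Since every image ends up with the same shape, the outputs can be stacked into a single array for downstream steps.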

**Custom Audio Transformer: Extract Duration**

In [75]:
import librosa

class AudioDurationExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        durations = [librosa.get_duration(y=signal, sr=sr) for signal, sr in X]
        return np.array(durations).reshape(-1, 1)
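
The transformer above expects `X` to be an iterable of `(signal, sampling_rate)` pairs. To illustrate that input format without requiring librosa, the sketch below uses a hypothetical stand-in, `SampleCountDurationExtractor`, which computes the duration as number of samples divided by sampling rate. For a mono 1-D signal this should agree with `librosa.get_duration`, but treat that as an assumption rather than a guarantee.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SampleCountDurationExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: duration = n_samples / sampling_rate (no librosa)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        durations = [len(signal) / sr for signal, sr in X]
        return np.array(durations).reshape(-1, 1)


# Each item is a (signal, sampling_rate) pair; the signals here are silent placeholders
clips = [
    (np.zeros(22050), 22050),  # 1.0 second
    (np.zeros(8000), 16000),   # 0.5 seconds
]

durations = SampleCountDurationExtractor().fit_transform(clips)
print(durations.ravel())  # [1.  0.5]
```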

# Pipelines

A **Pipeline** in scikit-learn is a way to streamline machine learning workflows. It allows you to encapsulate a sequence of steps into a single object, simplifying code and preventing data leakage between training and testing sets. A pipeline can consist of multiple stages such as:

1. **Preprocessing** (e.g., scaling, imputation, encoding)
2. **Model training** (e.g., logistic regression, decision trees)
3. **Post-processing** (e.g., prediction transformation)

**Key Benefits of Pipelines:**
1. Modularity: You can split complex workflows into smaller, more manageable steps.
2. Avoid Data Leakage: Preprocessing steps are fit only on the training data (or on the training folds during cross-validation), so statistics from validation or test data never influence them.
3. Simplified Code: Helps in avoiding repetitive tasks such as data preprocessing for each model or validation step.
4. Reproducibility: All steps in the pipeline are stored together, making it easy to reproduce experiments.
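
The leakage point can be made concrete with a small sketch on made-up numbers. A scaler inside a pipeline learns its mean from the training rows only, whereas a scaler fit on the full dataset also absorbs the held-out row (here an extreme value of 100).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: the last row plays the role of the test set
X_all = np.array([[1.0], [2.0], [3.0], [100.0]])
y_all = np.array([1.0, 2.0, 3.0, 100.0])
X_tr, y_tr = X_all[:3], y_all[:3]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(X_tr, y_tr)  # the scaler only ever sees the training rows

train_mean = pipe.named_steps['scaler'].mean_[0]
leaky_mean = StandardScaler().fit(X_all).mean_[0]  # fit on train + test

print(train_mean)  # 2.0
print(leaky_mean)  # 26.5
```

The same separation applies inside cross-validation: when the whole pipeline is passed to a cross-validation routine, the scaler is refit on each training fold.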

Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final **predictor** for predictive modeling.

Intermediate steps of the pipeline must be transformers, that is, they must implement `fit` and `transform` methods. The final **estimator** only needs to implement `fit`.
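
A minimal sketch of these two cases, using made-up data: a pipeline that ends in a transformer exposes `transform` but not `predict`, while a pipeline that ends in a predictor exposes `predict`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0], [np.nan], [3.0], [5.0]])
y_demo = np.array([2.0, 6.0, 6.0, 10.0])

# Case 1: transformers only, so the pipeline itself behaves like a transformer
prep_only = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
X_scaled = prep_only.fit_transform(X_demo)
has_predict = hasattr(prep_only, 'predict')  # False: the last step is not a predictor

# Case 2: same preprocessing, ending with a predictor
with_model = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
with_model.fit(X_demo, y_demo)
preds = with_model.predict(X_demo)

print(X_scaled.ravel())
print(has_predict)
print(preds.shape)  # (4,)
```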

**Regression**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.base import BaseEstimator, TransformerMixin

df = pd.read_csv('data/housing.csv')

In [22]:
df['income_cat'] = pd.qcut(df['median_income'], 4, labels=['low', 'medium', 'high', 'very_high'])

X = df.drop(columns=['median_house_value'])
y = df['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
df.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | very_high |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | very_high |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | very_high |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | very_high |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | high |


In [25]:
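# Built before RoomPerHouseTransformer runs, so 'rooms_per_household' is not listed here and
# ColumnTransformer (remainder='drop' by default) discards it, along with 'ocean_proximity'.
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = ['income_cat']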

In [26]:
class RoomPerHouseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['rooms_per_household'] = X['total_rooms'] / X['households']
        return X
    
# Create Numerical Pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create Categorical Pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine all preprocessing
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)
])

# Full Pipeline
full_pipeline = Pipeline([
    ('feature_engineering', RoomPerHouseTransformer()), # Custom Transformer
    ('preprocessing', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])
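
One thing to check in a pipeline like this: `numerical_features` is computed from `X` before `RoomPerHouseTransformer` adds its column, and `ColumnTransformer` drops any column it is not told about (its default is `remainder='drop'`). The sketch below reproduces the effect on toy data with a hypothetical `RatioAdder`; the column names `a`, `b`, and `ratio` are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical feature-engineering step: adds a 'ratio' column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['ratio'] = X['a'] / X['b']
        return X


def build(columns):
    # Same layout as the housing pipeline: engineer features, then select and scale
    return Pipeline([
        ('add', RatioAdder()),
        ('prep', ColumnTransformer([('num', StandardScaler(), columns)])),
    ])


toy = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [1.0, 2.0, 3.0]})

without_ratio = build(['a', 'b']).fit_transform(toy)
with_ratio = build(['a', 'b', 'ratio']).fit_transform(toy)

print(without_ratio.shape)  # (3, 2) -> the engineered column was dropped
print(with_ratio.shape)     # (3, 3) -> listed explicitly, so it is kept
```

Listing the engineered column name explicitly, or setting `remainder='passthrough'`, are two ways to keep such a feature.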

In [28]:
full_pipeline.fit(X_train, y_train)

In [29]:
y_pred = full_pipeline.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")

RMSE: 50835.94


In [30]:
# Create a parameter grid for fine-tuning
param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [10, 20, None]
}

# Wrap pipeline with GridSearch
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3, scoring='neg_root_mean_squared_error')
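
The grid keys follow the `<step name>__<parameter>` convention: `model__n_estimators` addresses the `n_estimators` parameter of the step named `model`. A small sketch on a simplified two-step pipeline (not the housing pipeline itself) shows how to list the addressable names and set them directly:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42)),
])

# Every tunable name that belongs to the 'model' step
tunable = sorted(k for k in pipe.get_params() if k.startswith('model__'))
print(tunable[:5])

# set_params uses the same naming scheme that GridSearchCV relies on
pipe.set_params(model__n_estimators=50, model__max_depth=10)
print(pipe.named_steps['model'].n_estimators)  # 50
```

Nested steps chain the same way, for example `preprocessing__num__imputer__strategy` for an imputer inside a `ColumnTransformer` entry named `num`.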

In [31]:
grid_search.fit(X_train, y_train)

In [32]:
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"Best RMSE after tuning: {rmse:.2f}")
print("Best Parameters:", grid_search.best_params_)

Best RMSE after tuning: 50835.94
Best Parameters: {'model__max_depth': None, 'model__n_estimators': 100}
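
The scorer is named `neg_root_mean_squared_error` because `GridSearchCV` always maximizes its score, so error metrics are negated. The sketch below, on synthetic data from `make_regression` with a deliberately small forest, shows how to turn `best_score_` back into a cross-validated RMSE. This value comes from the training folds and is separate from the test-set RMSE computed above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the housing dataset
X_toy, y_toy = make_regression(n_samples=90, n_features=4, noise=5.0, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=10, random_state=42)),
])

search = GridSearchCV(
    pipe,
    {'model__max_depth': [2, None]},
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_toy, y_toy)

cv_rmse = -search.best_score_  # flip the sign to get a positive RMSE
print(f"Cross-validated RMSE: {cv_rmse:.2f}")
print("Best Parameters:", search.best_params_)
```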


**Classification**

In [33]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin

iris = load_iris(as_frame=True)
df = iris.frame

In [34]:
class PetalRatioadder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['PetalRatio'] = X['petal length (cm)'] / (X['petal width (cm)'] + 1e-5)
        return X

In [36]:
X = df.drop(columns=['target'])
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)
numerical_features = X.columns.tolist() + ['PetalRatio']

preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

full_pipeline = Pipeline([
    ('add_features', PetalRatioadder()),
    ('preprocessing', preprocessor),
    ('model', LogisticRegression(solver='lbfgs', max_iter=500))
])

full_pipeline.fit(X_train, y_train)
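
To see what actually reaches the classifier, a fitted pipeline can be sliced. The sketch below rebuilds the same pipeline on the full iris data (a simplification for illustration, with no train/test split) and uses `pipe[:-1]` to run every step except the model.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class PetalRatioadder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['PetalRatio'] = X['petal length (cm)'] / (X['petal width (cm)'] + 1e-5)
        return X


iris = load_iris(as_frame=True)
X_iris = iris.frame.drop(columns=['target'])
y_iris = iris.frame['target']

pipe = Pipeline([
    ('add_features', PetalRatioadder()),
    ('preprocessing', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ])),
    ('model', LogisticRegression(max_iter=500)),
])
pipe.fit(X_iris, y_iris)

# All steps except the final model, reusing the already fitted transformers
features_seen_by_model = pipe[:-1].transform(X_iris)
print(features_seen_by_model.shape)  # (150, 5): four original columns plus PetalRatio
```

Here the engineered column does reach the model, because the imputer and scaler operate on every column they receive.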

In [39]:
y_pred = full_pipeline.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 0.93
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.90      0.90      0.90        10
   virginica       0.90      0.90      0.90        10

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30



In [61]:
param_grid = {
    'model__C': [0.1, 1, 10],
    'model__penalty': ['l2']
}

grid_search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
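
Beyond the single best candidate, `cv_results_` holds the scores of every candidate. A minimal sketch, using a simplified scaler-plus-classifier pipeline on the full iris data rather than the exact pipeline above:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=500)),
])

search = GridSearchCV(pipe, {'model__C': [0.1, 1, 10]}, cv=5, scoring='accuracy')
search.fit(X_iris, y_iris)

# One row per candidate, with mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)[
    ['param_model__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between candidates is larger than the fold-to-fold variation.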


In [62]:
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

In [63]:
print("Best Accuracy:", accuracy_score(y_test, y_pred_best))
print("Best Params:", grid_search.best_params_)
print(classification_report(y_test, y_pred_best, target_names=iris.target_names))

Best Accuracy: 1.0
Best Params: {'model__C': 10, 'model__penalty': 'l2'}
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

