# sklearn preprocessing transformations

This notebook walks through four preprocessing transformers available in scikit-learn and then combines them with `ColumnTransformer` and `Pipeline`. Each transformer reshapes or encodes raw columns into a form that machine learning models can consume.

We will look at:
- `QuantileTransformer`: Non-linear transformation that maps a feature to a uniform or normal distribution.
- `TargetEncoder`: Encode categorical features using target statistics.
- `KBinsDiscretizer`: Bin continuous data into discrete intervals.
- `OneHotEncoder`: Encode categorical features as a one-hot numeric array.



In [19]:
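# Import required libraries
# Note: TargetEncoder requires scikit-learn >= 1.3, and the `sparse_output`
# argument of OneHotEncoder requires scikit-learn >= 1.2
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer, KBinsDiscretizer, OneHotEncoder, TargetEncoder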
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

In [6]:
# Create a sample dataset
data = {
    'numerical_col': [1.0, 2.5, 3.1, 4.5, 5.9, 6.2, 7.8, 8.1, 9.5, 10.0,
                      11.3, 12.7, 13.0, 14.1, 15.5, 16.8, 17.0, 18.4, 19.9, 20.0],
    'categorical_col_low_cardinality': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C',
                                        'A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
    'categorical_col_high_cardinality': ['X', 'Y', 'Z', 'X', 'W', 'Y', 'Z', 'X', 'W', 'Y',
                                         'Z', 'X', 'W', 'Y', 'Z', 'X', 'W', 'Y', 'Z', 'X'],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
               0, 1, 0, 1, 0, 1, 0, 1, 0, 1] # Binary target for TargetEncoder
}
df = pd.DataFrame(data)

print("Original DataFrame:")
df.head()

Original DataFrame:


,numerical_col,categorical_col_low_cardinality,categorical_col_high_cardinality,target
0,1.0,A,X,0
1,2.5,B,Y,1
2,3.1,A,Z,0
3,4.5,C,X,1
4,5.9,B,W,0


In [7]:
print("\nDataFrame Info:")
df.info()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   numerical_col                     20 non-null     float64
 1   categorical_col_low_cardinality   20 non-null     object 
 2   categorical_col_high_cardinality  20 non-null     object 
 3   target                            20 non-null     int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 772.0+ bytes


In [15]:
df.describe()

,numerical_col,target
count,20.0,20.0
mean,10.865,0.5
std,5.933604,0.512989
min,1.0,0.0
25%,6.125,0.0
50%,10.65,0.5
75%,15.825,1.0
max,20.0,1.0


In [16]:
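X = df.drop('target', axis=1)
y = df['target']

# Hold out a test set so that transformers can be fitted on the training rows only
# (this matters most for TargetEncoder, which uses the target).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Explicit copies avoid pandas SettingWithCopyWarning when columns are added later
X_train, X_test = X_train.copy(), X_test.copy()

# print("\nTraining data shape:", X_train.shape)
# print("Testing data shape:", X_test.shape)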


In [17]:
"""## QuantileTransformer

Transforms features using quantile information. The transformation is non-linear and monotonic: it maps each feature to a uniform or normal distribution through its estimated quantiles. This makes it robust to outliers and useful for skewed or otherwise unusual distributions.
"""

# Initialize QuantileTransformer
# output_distribution='uniform' or 'normal'
quantile_transformer = QuantileTransformer(output_distribution='normal', n_quantiles=4)

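# Apply the transformation. For illustration it is fitted on the full dataset here;
# with a train/test split, call fit_transform on the training data and transform on the test data.
# With output_distribution='normal', the smallest and largest values are clipped
# to about -5.2 and +5.2 instead of being mapped to -inf and +inf.
X['numerical_col_transformed_qt'] = quantile_transformer.fit_transform(X[['numerical_col']])

print("\nDataFrame after QuantileTransformer (head):")
X.head()



DataFrame after QuantileTransformer (head):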


,numerical_col,categorical_col_low_cardinality,categorical_col_high_cardinality,numerical_col_transformed_qt
0,1.0,A,X,-5.199338
1,2.5,B,Y,-1.457684
2,3.1,A,Z,-1.273337
3,4.5,C,X,-0.957799
4,5.9,B,W,-0.716909
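
The cell above fits on every row for brevity. As a hedged sketch of the split-aware workflow, the snippet below fits a `QuantileTransformer` on a training subset only and reuses the learned quantiles for held-out values. The names `values`, `train_vals`, and `test_vals`, as well as the `random_state`, are assumptions made for this example and are not part of the executed cells.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

# Same numbers as `numerical_col`, reshaped to (n_samples, 1) as scikit-learn expects
values = np.array([1.0, 2.5, 3.1, 4.5, 5.9, 6.2, 7.8, 8.1, 9.5, 10.0,
                   11.3, 12.7, 13.0, 14.1, 15.5, 16.8, 17.0, 18.4, 19.9, 20.0]).reshape(-1, 1)

# Assumed split for illustration: 14 training rows, 6 test rows
train_vals, test_vals = train_test_split(values, test_size=0.3, random_state=42)

qt = QuantileTransformer(output_distribution='normal', n_quantiles=4)
train_qt = qt.fit_transform(train_vals)  # quantiles are estimated from the training rows only
test_qt = qt.transform(test_vals)        # test rows reuse those quantiles, no refitting

print("train range:", train_qt.min(), train_qt.max())
print("test range: ", test_qt.min(), test_qt.max())
```

The training minimum and maximum land near -5.2 and +5.2 because the normal output is clipped at the extremes, which is the same effect visible in the first row of the table above.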


In [26]:
"""## TargetEncoder

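Encodes each category as an estimate of the target mean for that category, shrunk toward the global target mean. This is often used for high-cardinality categorical features. Because the encoding uses the target, fit the encoder on the training data only to prevent data leakage. Note that `fit_transform` applies internal cross-fitting, so `fit(X, y).transform(X)` does not give the same result as `fit_transform(X, y)`.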
"""

# Initialize TargetEncoder
# The `smooth` parameter (default 'auto') controls how strongly category means are
# shrunk toward the global target mean, which limits overfitting on small categories
target_encoder = TargetEncoder()

# Apply the transformation. For illustration it is fitted on the full dataset here;
# with a train/test split, call fit_transform(X_train, y_train) and then transform(X_test).
X["target_encoded"] = target_encoder.fit_transform(X[["categorical_col_high_cardinality"]], y)

print("\nDataFrame after TargetEncoder (head):")
X.head()


DataFrame after TargetEncoder (head):


,numerical_col,categorical_col_low_cardinality,categorical_col_high_cardinality,numerical_col_transformed_qt,target_encoded
0,1.0,A,X,-5.199338,1.0
1,2.5,B,Y,-1.457684,1.0
2,3.1,A,Z,-1.273337,0.0
3,4.5,C,X,-0.957799,1.0
4,5.9,B,W,-0.716909,0.0


In [None]:
"""## KBinsDiscretizer

Bins continuous data into discrete intervals. This can be useful for handling non-linear relationships or for models that work better with discrete features.

- `strategy`: Defines the binning strategy ('uniform', 'quantile', 'kmeans').
- `n_bins`: The number of bins.
"""

# Initialize KBinsDiscretizer
# strategy: 'uniform' (equal-width bins), 'quantile' (equal-frequency bins), 'kmeans' (bins from 1-D k-means clustering)
# encode: 'ordinal' (integer bin index), 'onehot' (sparse one-hot), 'onehot-dense' (dense one-hot)
kbins_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')

# Apply the transformation (fit and transform on training, only transform on test)
# .ravel() flattens the (n_samples, 1) result so it can be assigned as a single column
X_train['numerical_col_binned'] = kbins_discretizer.fit_transform(X_train[['numerical_col']]).ravel()
X_test['numerical_col_binned'] = kbins_discretizer.transform(X_test[['numerical_col']]).ravel()

print("\nDataFrame after KBinsDiscretizer (Training data head):")
print(X_train[['numerical_col', 'numerical_col_binned']].head())
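
To see what the `strategy` option changes, the following sketch fits two discretizers on the full numerical column and prints the learned bin edges. It is an illustration under assumptions: it uses all 20 values rather than the training split, and the `fitted` dictionary is a helper introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Same numbers as `numerical_col`, reshaped to (n_samples, 1)
values = np.array([1.0, 2.5, 3.1, 4.5, 5.9, 6.2, 7.8, 8.1, 9.5, 10.0,
                   11.3, 12.7, 13.0, 14.1, 15.5, 16.8, 17.0, 18.4, 19.9, 20.0]).reshape(-1, 1)

fitted = {}  # strategy name -> (bin edges, ordinal codes)
for strategy in ('uniform', 'quantile'):
    disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy=strategy)
    codes = disc.fit_transform(values).ravel()
    edges = disc.bin_edges_[0]  # one array of edges per input feature
    fitted[strategy] = (edges, codes)
    print(strategy, "edges:", np.round(edges, 2))
    print(strategy, "rows per bin:", np.bincount(codes.astype(int), minlength=5))
```

With `strategy='uniform'` the edges are evenly spaced between the minimum and maximum, while `strategy='quantile'` places them so that the bins hold roughly the same number of rows.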



In [None]:
"""## OneHotEncoder

Encodes categorical features as a one-hot numeric array. Each category becomes a binary feature (0 or 1). This is suitable for categorical features with low cardinality.

- `handle_unknown`: How to handle unseen categories ('ignore', 'error').
- `sparse_output`: Whether to return a sparse matrix (memory efficiency).
"""

# Initialize OneHotEncoder
# handle_unknown='ignore' allows the transformer to handle categories not seen during fit
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply the transformation (fit and transform on training, only transform on test)
# With sparse_output=False the result is a dense NumPy array, not a pandas DataFrame
ohe_train_array = one_hot_encoder.fit_transform(X_train[['categorical_col_low_cardinality']])
ohe_test_array = one_hot_encoder.transform(X_test[['categorical_col_low_cardinality']])

# To see the result clearly, let's convert the array back to a DataFrame
ohe_feature_names = one_hot_encoder.get_feature_names_out(['categorical_col_low_cardinality'])
ohe_train_df = pd.DataFrame(ohe_train_array, columns=ohe_feature_names, index=X_train.index)
ohe_test_df = pd.DataFrame(ohe_test_array, columns=ohe_feature_names, index=X_test.index)

print("\nDataFrame after OneHotEncoder (Training data head - new columns):")
print(ohe_train_df.head())

# To use this in a model, you would typically concatenate these new columns
# with the other features.



In [None]:
"""## Combining Transformations using ColumnTransformer and Pipeline

Often, you need to apply different transformations to different columns. `ColumnTransformer` is ideal for this. You can then combine `ColumnTransformer` with a model in a `Pipeline`.
"""

# Define which columns get which transformation
preprocessor = ColumnTransformer(
    transformers=[
        ('num_qt', QuantileTransformer(output_distribution='normal', n_quantiles=10), ['numerical_col']),
        ('cat_target', TargetEncoder(smooth=1.0), ['categorical_col_high_cardinality']),  # the parameter is `smooth`, not `smoothing`
        ('cat_ohe', OneHotEncoder(handle_unknown='ignore'), ['categorical_col_low_cardinality'])
        # You could add KBinsDiscretizer here too if needed on another or the same numerical column
    ],
    remainder='passthrough' # Keep other columns that aren't explicitly transformed
)

# Create a simple pipeline: the preprocessing step followed by a simple model
from sklearn.linear_model import LogisticRegression

# Example pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression())])

# Fit the pipeline on the training data
# This applies all transformations defined in the preprocessor to X_train
# and then fits the LogisticRegression model.
pipeline.fit(X_train, y_train)

print("\nPipeline fitted successfully.")

# Transform the test data using the fitted pipeline
# This applies the same transformations learned from the training data
X_test_transformed = pipeline.named_steps['preprocessor'].transform(X_test)

print("\nTransformed Test data shape:", X_test_transformed.shape)
# Note: The output format depends on the transformers used (sparse/dense matrix)

# You can now use the fitted pipeline to make predictions
# y_pred = pipeline.predict(X_test)
# print("\nSample predictions:", y_pred[:5])
