**Column Transformer**

ColumnTransformer is useful when you want to apply different preprocessing steps to different columns of a dataset.

In a typical dataset, you might have:

(i) Numerical columns (e.g., "age", "salary") that need scaling or imputation of missing values.

(ii) Categorical columns (e.g., "gender", "occupation") that require an encoding such as one-hot encoding or ordinal encoding.


**With ColumnTransformer, you can apply transformations to different subsets of columns in one step rather than applying them separately for each column.**

In [9]:
import numpy as np
import pandas as pd

In [10]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

In [11]:
df = pd.read_csv('adult.csv')

In [20]:
df.head()

,age,workclass,education,marital.status,occupation,sex,native.country,income
0,90,?,HS-grad,Widowed,?,Female,United-States,<=50K
1,82,Private,HS-grad,Widowed,Exec-managerial,Female,United-States,<=50K
2,66,?,Some-college,Widowed,?,Female,United-States,<=50K
3,54,Private,7th-8th,Divorced,Machine-op-inspct,Female,United-States,<=50K
4,41,Private,Some-college,Separated,Prof-specialty,Female,United-States,<=50K


In [17]:
print(df['age'].isnull().sum())

932


In [18]:
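# Fill missing ages with the median of the full dataset (computed before the split),
# so the SimpleImputer applied to 'age' later has nothing left to impute.
df['age'] = df['age'].fillna(df['age'].median())
df['age'] = df['age'].astype(int)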

In [19]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; 'income' is the target
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['income']), df['income'], test_size=0.2
)
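The median fill above is computed from every row, including rows that end up in the test set. One possible alternative, sketched here on made-up data, is to split first and let `SimpleImputer(strategy="median")` learn the median from the training rows only. The `toy` DataFrame and `random_state=0` are assumptions for illustration; this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up ages with two missing values (illustration only)
toy = pd.DataFrame(
    {"age": [20.0, 30.0, np.nan, 50.0, 60.0, np.nan, 40.0, 25.0, 35.0, 45.0]}
)

# Split first, so the test rows play no part in computing the fill value
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(toy_train)  # learns the training median
test_filled = imputer.transform(toy_test)        # reuses that same median

# The learned fill value is stored on the fitted imputer
print(imputer.statistics_)
```

Placed inside a ColumnTransformer, the same imputer would be fitted by `fit_transform(X_train)` and reused by `transform(X_test)`.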

In [21]:
X_train

,age,workclass,education,marital.status,occupation,sex,native.country
7799,24,Private,9th,Divorced,Other-service,Female,United-States
28231,42,Private,HS-grad,Married-civ-spouse,Craft-repair,Male,United-States
10885,18,Private,11th,Never-married,Sales,Female,United-States
6455,47,Private,HS-grad,Married-civ-spouse,Transport-moving,Male,United-States
12529,39,Local-gov,Some-college,Divorced,Prof-specialty,Female,United-States
...,...,...,...,...,...,...,...
13488,30,Private,Assoc-voc,Married-civ-spouse,Machine-op-inspct,Male,United-States
27084,39,Self-emp-not-inc,Prof-school,Married-civ-spouse,Prof-specialty,Male,United-States
26159,29,Private,HS-grad,Never-married,Exec-managerial,Male,United-States
24490,19,Private,HS-grad,Never-married,Handlers-cleaners,Male,United-States


In [24]:
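transformer = ColumnTransformer(transformers=[
    # Impute missing ages (SimpleImputer defaults to the column mean)
    ('tnf1', SimpleImputer(), ['age']),
    # Encode education as integers following the listed low-to-high order
    ('tnf2', OrdinalEncoder(categories=[[
        'Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th',
        '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-acdm',
        'Assoc-voc', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']]), ['education']),
    # One-hot encode, dropping each column's first category to avoid redundancy
    ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['workclass', 'occupation'])
], remainder='passthrough')  # leave all other columns unchanged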

In [25]:
transformer.fit_transform(X_train).shape

(26048, 27)

In [26]:
transformer.transform(X_test).shape

(6513, 27)