# How to Use the ColumnTransformer for Data Preparation

Author: Jason Brownlee

Article from [machinelearningmastery](https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/).

> Note: This notebook works through the article linked above. The code may differ slightly from the original.

# Libraries

In [30]:
import pandas as pd
from numpy import absolute
from numpy import mean
from numpy import std
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Data Preparation for the Abalone Regression Dataset

## Load the dataset

In [5]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = pd.read_csv(url, header=None)
dataframe.head()

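|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |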


## Split into inputs and outputs

In [8]:
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
print(X.shape, y.shape)

(4177, 8) (4177,)


## Determine categorical and numerical features

In [12]:
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
print(numerical_ix, categorical_ix)

Int64Index([1, 2, 3, 4, 5, 6, 7], dtype='int64') Int64Index([0], dtype='int64')


## Define the data preparation for the columns

In [15]:
# One-hot encode the categorical column and scale the numerical columns to [0, 1]
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)
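
To see what the transformer produces, here is a minimal sketch on a toy frame. The column names `sex` and `length` and the values are made up for illustration and are not the abalone data. The three one-hot columns come first, followed by the scaled numerical column, matching the order of the transformers list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: one categorical and one numerical column
toy = pd.DataFrame({
    'sex': ['M', 'F', 'I', 'M'],
    'length': [0.2, 0.4, 0.6, 0.8],
})

# Same structure as the transformer above, but with explicit column names
toy_transform = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['sex']),
    ('num', MinMaxScaler(), ['length']),
])

encoded = toy_transform.fit_transform(toy)
# The result can be a sparse matrix; convert to a dense array for printing
if hasattr(encoded, 'toarray'):
    encoded = encoded.toarray()

print(encoded.shape)  # 3 one-hot columns + 1 scaled column
print(encoded)
```

Columns that are not listed in any transformer are dropped by default (`remainder='drop'`), so every input column needs to be assigned to a transformer or passed through explicitly.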

## Define the model

In [17]:
model = SVR(kernel='rbf', gamma='scale', C=100)

## Define the data preparation and modeling pipeline

In [32]:
pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)], verbose=True)
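
Wrapping the preparation and the model in one pipeline means the encoder and scaler are fitted only on the data passed to `fit`, and then reused unchanged on new data. The sketch below illustrates this on made-up toy data; the names `train`, `y_train`, `new` and `pipe` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVR

# Hypothetical training data and a single new row to predict
train = pd.DataFrame({
    'sex': ['M', 'F', 'I', 'M', 'F', 'I'],
    'length': [0.2, 0.3, 0.1, 0.5, 0.6, 0.15],
})
y_train = [9, 10, 5, 12, 14, 6]
new = pd.DataFrame({'sex': ['F'], 'length': [0.4]})

pipe = Pipeline(steps=[
    ('prep', ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(), ['sex']),
        ('num', MinMaxScaler(), ['length']),
    ])),
    ('m', SVR(kernel='rbf', gamma='scale', C=100)),
])

# fit() learns the categories and min/max from the training rows only
pipe.fit(train, y_train)
# predict() applies the already-fitted preparation before calling the model
pred = pipe.predict(new)
print(pred)

# The fitted scaler remembers the training range, not the range of new data
scaler = pipe.named_steps['prep'].named_transformers_['num']
print(scaler.data_min_, scaler.data_max_)
```

During cross-validation the same thing happens once per fold, which keeps statistics from the held-out rows from leaking into the preparation step.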

## Define the model cross-validation configuration

In [33]:
cv = KFold(n_splits=10, shuffle=True, random_state=1)

## Evaluate the pipeline using cross-validation and calculate MAE

In [35]:
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, verbose=True)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    6.7s finished


## Convert MAE scores to positive values

In [36]:
scores = absolute(scores)
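
The sign flip is needed because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_absolute_error` reports MAE multiplied by -1. The small sketch below shows this on made-up data with a `DummyRegressor`; the toy arrays and the choice of model are assumptions for illustration only.

```python
from numpy import absolute, arange
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy regression data
X_toy = arange(20).reshape(-1, 1)
y_toy = arange(20) * 2.0

toy_scores = cross_val_score(
    DummyRegressor(strategy='mean'),  # always predicts the training mean
    X_toy,
    y_toy,
    scoring='neg_mean_absolute_error',
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)

print(toy_scores)            # all values are zero or negative
print(absolute(toy_scores))  # the same errors reported as positive MAE
```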

## Summarize the model performance

In [37]:
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

MAE: 1.465 (0.047)
