## Create ColumnTransformer & FeatureUnion in Pipelines with Code Example

- https://medium.com/mlearning-ai/create-columntransformer-featureunion-in-pipelines-with-code-example-c1270dc0d225

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023. 7.10</div>
<div style="text-align: right"> Last update: 2023. 7. 10</div>

In [1]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
#plt.style.use('ggplot')
plt.style.use('seaborn-whitegrid')  # renamed 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6
%matplotlib inline

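Both ColumnTransformer and FeatureUnion are estimators that can be fit on a DataFrame and used to transform it. They make it easier to apply several transformers to the same input data.   
They are typically used in the data preprocessing stage.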

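- ColumnTransformer: transforms different columns, or subsets of columns, of the input separately. The features produced by each transformer are concatenated into a single feature matrix. It is useful for transforming and combining data of different types (categorical, numeric).  
- FeatureUnion: chains several transformer objects side by side. The transformers are applied in parallel to a single input and their results are concatenated. It is useful for combining several feature extraction mechanisms into a single transformer.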

- ColumnTransformer: applies different transforms to different columns, then combines the results.
- FeatureUnion: applies different transforms to the same input, then combines the results.

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

In [3]:
X_train = pd.DataFrame(data={'review': ['Good food good service.', 
                                        'Good food with friendly service.',
                                        'Average food and bad service'],
                             'star': [5,4,2],
                             'meal_time': ['dinner','lunch','breakfast'],
                             'tip_%': [0.25, 0.18, np.nan]})
X_train.head()

| | review | star | meal_time | tip_% |
|---|---|---|---|---|
| 0 | Good food good service. | 5 | dinner | 0.25 |
| 1 | Good food with friendly service. | 4 | lunch | 0.18 |
| 2 | Average food and bad service | 2 | breakfast | NaN |


### 1. Construct a ColumnTransformer

- Step 1: For the categorical column meal_time, encode it using OneHotEncoder
- Step 2: For the numeric columns star and tip_%, create a Pipeline that first imputes NaN values using SimpleImputer and then scales the result using MinMaxScaler
- Step 3: Pass through the remaining columns (review)
- Step 4: Combine the two results together using ColumnTransformer

In [4]:
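cat_col = ['meal_time']
num_col = ['star','tip_%']

# Numeric columns: replace NaN with 0, then scale each column to [0, 1].
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', MinMaxScaler())
])

ColumnTransformation = ColumnTransformer(
    transformers=[
        # `sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions.
        ('cat_ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), cat_col),
        ('num_pipe', num_pipe, num_col)
    ],
    remainder='passthrough'  # keep the remaining columns (review) unchanged
)
ColumnTransformation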

In [5]:
ColumnTransformation.fit_transform(X_train)

array([[0.0, 1.0, 0.0, 0.9999999999999999, 1.0,
        'Good food good service.'],
       [0.0, 0.0, 1.0, 0.6666666666666666, 0.72,
        'Good food with friendly service.'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 'Average food and bad service']],
      dtype=object)

- The data has been expanded to shape (3, 6).

### 2. Construct a FeatureUnion

The following steps are applied to the single column `review`.   
- Step 1: Get the word count of each review by creating a WordCounter class
- Step 2: Vectorize each review using TfidfVectorizer
- Step 3: Combine the two parallel results together using FeatureUnion

In [6]:
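# Mixins go to the left of BaseEstimator, as the scikit-learn developer guide recommends.
class WordCounter(TransformerMixin, BaseEstimator):
    """Stateless transformer that counts the whitespace-separated words in each text."""

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        # X is expected to be a pandas Series of strings; returns an (n_samples, 1) array.
        num_word = X.apply(lambda x: len(x.split())).values.reshape(-1, 1)
        return num_word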
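
Because the word counter learns nothing in `fit`, the same step could also be written without a custom class. The sketch below uses `FunctionTransformer` as an alternative to the class above, not as a replacement for it. The helper `count_words` and the variable names are hypothetical and introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_words(texts):
    """Return an (n_samples, 1) array with the word count of each text."""
    return pd.Series(texts).str.split().str.len().to_numpy().reshape(-1, 1)


# validate=False (the default) passes the Series of strings through untouched.
word_counter_ft = FunctionTransformer(count_words)

reviews = pd.Series(['Good food good service.',
                     'Good food with friendly service.',
                     'Average food and bad service'])

counts = word_counter_ft.fit_transform(reviews)
print(counts.ravel())  # [4 5 5]
```

A custom class is still the better fit once the transformer needs parameters or has to remember something from `fit`. For a one-line stateless function, `FunctionTransformer` is usually enough.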

In [7]:
vectorizer = TfidfVectorizer(stop_words="english")

word_cnt_pipe = Pipeline([
    ('word_counter', WordCounter()),
    ('scaler', MinMaxScaler())
])

# construct the FeatureUnion 
FeatureUnionTransformation = FeatureUnion([
    ('vectorizer', vectorizer),
    ('word_counter', word_cnt_pipe)
])

FeatureUnionTransformation

In [8]:
FeatureUnionTransformation.fit_transform(X_train['review'])

<3x7 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>
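
The sparse matrix summary above hides the actual values. For a toy dataset, it can be converted to a dense array to check what the 7 columns contain. The sketch below rebuilds an equivalent FeatureUnion with a compact copy of `WordCounter`. The names `union_demo`, `dense`, and `vocab` are introduced only for this illustration. Calling `.toarray()` is an assumption that holds for small data and should be avoided on a large corpus.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler


class WordCounter(TransformerMixin, BaseEstimator):
    """Compact copy of the WordCounter above: words per review, shape (n, 1)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.apply(lambda text: len(text.split())).values.reshape(-1, 1)


reviews = pd.Series(['Good food good service.',
                     'Good food with friendly service.',
                     'Average food and bad service'])

union_demo = FeatureUnion([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('word_counter', Pipeline([
        ('word_counter', WordCounter()),
        ('scaler', MinMaxScaler()),
    ])),
])

features = union_demo.fit_transform(reviews)
dense = features.toarray()  # fine for toy data only

# Recover the TF-IDF column order from the fitted vectorizer's vocabulary.
fitted_vectorizer = dict(union_demo.transformer_list)['vectorizer']
vocab = sorted(fitted_vectorizer.vocabulary_, key=fitted_vectorizer.vocabulary_.get)

print(dense.shape)   # (3, 7)
print(vocab)         # the 6 TF-IDF columns, in order
print(dense[:, -1])  # last column: scaled word count
```

The first six columns are the TF-IDF weights of the vocabulary terms that survive stop-word removal, and the last column is the min-max scaled word count. The two scaled-count zeros and TF-IDF zeros are not stored, which is why the summary reports only 13 stored elements.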

### 3. Construct a Mix of ColumnTransformer & FeatureUnion & Pipeline

With both parts complete, combine them using a ColumnTransformer.

In [9]:
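preprocessor = ColumnTransformer([
    ('col_trans', ColumnTransformation, ['meal_time','star','tip_%']),
    # 'review' is given as a plain string (not a list) so that a 1-D Series is passed on,
    # which is what TfidfVectorizer and WordCounter expect.
    ('feature_union', FeatureUnionTransformation, 'review')
])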

preprocessor

In [10]:
preprocessor.fit_transform(X_train)

array([[0.        , 1.        , 0.        , 1.        , 1.        ,
        0.        , 0.        , 0.34035465, 0.        , 0.87653717,
        0.34035465, 0.        ],
       [0.        , 0.        , 1.        , 0.66666667, 0.72      ,
        0.        , 0.        , 0.39148397, 0.66283998, 0.50410689,
        0.39148397, 1.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ,
        0.6088451 , 0.6088451 , 0.35959372, 0.        , 0.        ,
        0.35959372, 1.        ]])

In [11]:
preprocessor.fit_transform(X_train).shape

(3, 12)
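
Once fitted, the same preprocessor can transform rows it has not seen before. The sketch below rebuilds an equivalent preprocessor and applies it to one hypothetical new review. The row in `X_new` is made up for this example, and the names `tabular`, `text`, and `preprocessor_demo` are not part of the original notebook. As an assumption to keep the sketch independent of the scikit-learn version, `sparse_threshold=0` is used to force dense output.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


class WordCounter(TransformerMixin, BaseEstimator):
    """Compact copy of the WordCounter above: words per review, shape (n, 1)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.apply(lambda text: len(text.split())).values.reshape(-1, 1)


X_fit = pd.DataFrame({
    'review': ['Good food good service.',
               'Good food with friendly service.',
               'Average food and bad service'],
    'star': [5, 4, 2],
    'meal_time': ['dinner', 'lunch', 'breakfast'],
    'tip_%': [0.25, 0.18, np.nan],
})

# Hypothetical unseen row: 'brunch' never appears in the training data.
X_new = pd.DataFrame({
    'review': ['Bad food but friendly service'],
    'star': [3],
    'meal_time': ['brunch'],
    'tip_%': [np.nan],
})

tabular = ColumnTransformer(
    [
        ('cat_ohe', OneHotEncoder(handle_unknown='ignore'), ['meal_time']),
        ('num_pipe', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
            ('scaler', MinMaxScaler()),
        ]), ['star', 'tip_%']),
    ],
    sparse_threshold=0,  # always dense
)

text = FeatureUnion([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('word_counter', Pipeline([
        ('word_counter', WordCounter()),
        ('scaler', MinMaxScaler()),
    ])),
])

preprocessor_demo = ColumnTransformer(
    [
        ('col_trans', tabular, ['meal_time', 'star', 'tip_%']),
        ('feature_union', text, 'review'),  # scalar column name -> 1-D Series
    ],
    sparse_threshold=0,  # always dense
)

preprocessor_demo.fit(X_fit)
new_features = preprocessor_demo.transform(X_new)
print(new_features.shape)   # (1, 12)
print(new_features[0, :5])  # one-hot block, then scaled star and tip_%
```

Because the encoder uses `handle_unknown='ignore'`, the unseen category `brunch` becomes an all-zero one-hot block instead of raising an error. The missing tip is imputed with 0 before scaling, and the scalers reuse the minimum and maximum learned during `fit`.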