## Creating Custom Transformers for sklearn Pipelines
- https://towardsdatascience.com/creating-custom-transformers-for-sklearn-pipelines-d3d51852ecc1

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023. 8. 7</div>
<div style="text-align: right"> Last update: 2023. 8. 7</div>

In [1]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
#plt.style.use('ggplot')
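# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid';
# the old name below was removed in Matplotlib 3.8.
plt.style.use('seaborn-whitegrid')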
%matplotlib inline

### Defining our custom transformer

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

In [3]:
class ColumnsSelector(BaseEstimator, TransformerMixin):
    # initializer 
    def __init__(self, columns):
        # save the features list internally in the class
        self.columns = columns
        
    def fit(self, X, y = None):
        return self
    def transform(self, X, y = None):
        # return the dataframe with the specified features
        return X[self.columns]

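- `__init__`: defines the arguments accepted when the instance is created.
- `fit`: learns whatever the transformer needs from the data. If there is nothing to learn, as with this column filter, it simply returns `self`; a transformer that must learn statistics, such as a standard scaler, computes them here.
- `transform`: defines the actual transformation.
- Both `fit` and `transform` take two arguments, `X` and `y`; when `y` is not used, default it to `y=None`.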
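
The points above can be checked on a tiny example. The sketch below repeats the `ColumnsSelector` class so it runs on its own, and uses a made-up three-column DataFrame (`toy`, not the Titanic data) to show what the two base classes contribute: `get_params()` comes from `BaseEstimator` and `fit_transform()` comes from `TransformerMixin`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnsSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        # store the constructor argument under the same name
        self.columns = columns

    def fit(self, X, y=None):
        # nothing to learn for a column filter
        return self

    def transform(self, X, y=None):
        return X[self.columns]


# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

selector = ColumnsSelector(columns=["a", "c"])

# Inherited from BaseEstimator: reads back the __init__ arguments
params = selector.get_params()

# Inherited from TransformerMixin: fit followed by transform
selected = selector.fit_transform(toy)
print(params)
print(selected)
```

Storing each `__init__` argument under an attribute of the same name is what lets `get_params()` find it.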

### Trying it out

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
df = pd.read_csv("../data/titanic/train.csv")
df = df[['Survived','Pclass','Sex','Age','Fare','Embarked']]
X = df.iloc[:,1:]
y = df.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                  test_size = 0.3, 
                                                  stratify = y, 
                                                  random_state = 0)
X_train


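| | Pclass | Sex | Age | Fare | Embarked |
|---|---|---|---|---|---|
| 231 | 3 | male | 29.0 | 7.7750 | S |
| 836 | 3 | male | 21.0 | 8.6625 | S |
| 639 | 3 | male | NaN | 16.1000 | S |
| 389 | 2 | female | 17.0 | 12.0000 | C |
| 597 | 3 | male | 49.0 | 0.0000 | S |
| ... | ... | ... | ... | ... | ... |
| 131 | 3 | male | 20.0 | 7.0500 | S |
| 490 | 3 | male | NaN | 19.9667 | S |
| 838 | 3 | male | 32.0 | 56.4958 | S |
| 48 | 3 | male | NaN | 21.6792 | C |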


In [6]:
from sklearn.pipeline import Pipeline

In [7]:
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(columns=['Age', 'Fare']))
])

In [8]:
numeric_transformer.fit(X_train)

In [9]:
numeric_transformer.transform(X_train)

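| | Age | Fare |
|---|---|---|
| 231 | 29.0 | 7.7750 |
| 836 | 21.0 | 8.6625 |
| 639 | NaN | 16.1000 |
| 389 | 17.0 | 12.0000 |
| 597 | 49.0 | 0.0000 |
| ... | ... | ... |
| 131 | 20.0 | 7.0500 |
| 490 | NaN | 19.9667 |
| 838 | 32.0 | 56.4958 |
| 48 | NaN | 21.6792 |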


In [10]:
numeric_transformer.fit_transform(X_train, y_train)

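| | Age | Fare |
|---|---|---|
| 231 | 29.0 | 7.7750 |
| 836 | 21.0 | 8.6625 |
| 639 | NaN | 16.1000 |
| 389 | 17.0 | 12.0000 |
| 597 | 49.0 | 0.0000 |
| ... | ... | ... |
| 131 | 20.0 | 7.0500 |
| 490 | NaN | 19.9667 |
| 838 | 32.0 | 56.4958 |
| 48 | NaN | 21.6792 |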


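`fit_transform` calls `fit` and then `transform`, in that order.
Here, `fit` does nothing.

`fit_transform` is handy in the following situation:
- The training data needs both `fit` and `transform` (a single `fit_transform` call), while the test data needs only `transform`.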
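
A minimal sketch of that train/test pattern, using `SimpleImputer` on two made-up single-column frames (`train` and `test` are illustrative, not the Titanic split). The median is learned from the training data only and then reused for the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only for illustration
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 80.0]})

imputer = SimpleImputer(strategy="median")

# Training data: learn the median (fit) and fill the gaps (transform)
train_filled = imputer.fit_transform(train)

# Test data: reuse the median learned from the training data
test_filled = imputer.transform(test)

print(imputer.statistics_)  # median learned from the training data
print(test_filled)
```

Calling `fit_transform` on the test data instead would compute a new median from the test set, which leaks test information into preprocessing.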

Consider the following case.
Calling `transform` before the pipeline has been fitted raises an error.
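
The same behavior can be seen on a single transformer. The sketch below uses a made-up one-column frame (`toy`) and catches the `NotFittedError` that `SimpleImputer` raises when `transform` is called first. The `raised` flag is only there to make the outcome visible.

```python
import numpy as np
import pandas as pd
from sklearn.exceptions import NotFittedError
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
imputer = SimpleImputer(strategy="median")

try:
    # transform before fit: the median has not been learned yet
    imputer.transform(toy)
    raised = False
except NotFittedError:
    raised = True

# After fit, the same call succeeds
filled = imputer.fit(toy).transform(toy)
print(raised)
print(filled)
```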

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(['Age','Fare'])),
    ('imputer', SimpleImputer(strategy='median')),
])

In [12]:
# Uncommenting the next line raises NotFittedError: the pipeline has not been fitted yet.
# numeric_transformer.transform(X_train)

In [13]:
numeric_transformer.fit_transform(X_train)

array([[29.    ,  7.775 ],
       [21.    ,  8.6625],
       [28.75  , 16.1   ],
       ...,
       [32.    , 56.4958],
       [28.75  , 21.6792],
       [22.    ,  9.    ]])

Let's keep extending the pipeline.

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(['Age','Fare'])),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
numeric_transformer.fit_transform(X_train)

array([[-0.02863633, -0.47911875],
       [-0.65142052, -0.46270324],
       [-0.04809833, -0.32513665],
       ...,
       [ 0.20490775,  0.42203815],
       [-0.04809833, -0.22194182],
       [-0.5735725 , -0.45646073]])

### Creating Our Own Custom StandardScaler Transformer

In [15]:
class MyStandardScaler(BaseEstimator, TransformerMixin): 
    def __init__(self):
        pass  # no hyperparameters to store
    
    def fit(self, X, y = None):
        print(type(X))
        # the type of X might be a DataFrame or a NumPy array
        # depending on the previous transformer object that 
        # you use in the pipeline
        self.means = np.mean(X, axis=0)    # calculate the mean
        self.stds = np.std(X, axis=0)      # calculate the 
                                           # standard deviation
        return self
    def transform(self, X, y = None):
        return (X - self.means) / self.stds

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [17]:
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(['Age','Fare'])),
    ('my scaler', MyStandardScaler())  
])
numeric_transformer.fit_transform(X_train)

<class 'pandas.core.frame.DataFrame'>


| | Age | Fare |
|---|---|---|
| 231 | -0.036551 | -0.479119 |
| 836 | -0.592408 | -0.462703 |
| 639 | NaN | -0.325137 |
| 389 | -0.870336 | -0.400972 |
| 597 | 1.353091 | -0.622928 |
| ... | ... | ... |
| 131 | -0.661890 | -0.492529 |
| 490 | NaN | -0.253617 |
| 838 | 0.171895 | 0.422038 |
| 48 | NaN | -0.221942 |


In [18]:
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(['Age','Fare'])),
    ('scaler', StandardScaler())  # sklearn's built-in scaler, for comparison
])
numeric_transformer.fit_transform(X_train)

array([[-0.03655096, -0.47911875],
       [-0.59240794, -0.46270324],
       [        nan, -0.32513665],
       ...,
       [ 0.17189541,  0.42203815],
       [        nan, -0.22194182],
       [-0.52292581, -0.45646073]])

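- With the custom standard scaler, the pipeline returned a DataFrame; with sklearn's `StandardScaler`, it returned a NumPy array.

What happens if we add a `SimpleImputer` in front of it?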

In [19]:
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(['Age','Fare'])),
    ('imputer', SimpleImputer(strategy='median')),
    ('my scaler', MyStandardScaler())  
])
numeric_transformer.fit_transform(X_train)

<class 'numpy.ndarray'>


array([[-0.02863633, -0.47911875],
       [-0.65142052, -0.46270324],
       [-0.04809833, -0.32513665],
       ...,
       [ 0.20490775,  0.42203815],
       [-0.04809833, -0.22194182],
       [-0.5735725 , -0.45646073]])
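
The two runs show that the type `MyStandardScaler` receives, and therefore returns, depends on the preceding step: a DataFrame after `ColumnsSelector`, a NumPy array after `SimpleImputer`. If a consistent output type is wanted, one option is to convert the input with `np.asarray` first. The sketch below is a hypothetical variant, named `ArrayStandardScaler` here, and not part of the original article. It assumes `np.nanmean` and `np.nanstd` are an acceptable way to skip missing values, which is what pandas does by default when the input is a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ArrayStandardScaler(BaseEstimator, TransformerMixin):
    """Hypothetical variant of MyStandardScaler that always returns an ndarray."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)       # DataFrame or ndarray -> ndarray
        self.means = np.nanmean(X, axis=0)   # ignore NaN, as pandas does
        self.stds = np.nanstd(X, axis=0)     # population std (ddof=0)
        return self

    def transform(self, X, y=None):
        X = np.asarray(X, dtype=float)
        return (X - self.means) / self.stds


# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Age": [10.0, 20.0, 30.0], "Fare": [1.0, 2.0, 3.0]})

from_frame = ArrayStandardScaler().fit_transform(toy)
from_array = ArrayStandardScaler().fit_transform(toy.to_numpy())
print(type(from_frame), type(from_array))
```

The trade-off is that the index and column names of a DataFrame input are dropped.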

### Ensuring that the Transformer Has Been Fitted

In [20]:
from sklearn.utils.validation import check_is_fitted

class MyStandardScaler(BaseEstimator, TransformerMixin): 
    def __init__(self):
        pass  # no hyperparameters to store
    
    def fit(self, X, y = None):
        print(type(X))
        # the type of X might be a DataFrame or a NumPy array
        # it depends on the previous transformer object that 
        # you use in the pipeline
        self.means = np.mean(X, axis=0)
        self.stds = np.std(X, axis=0)
        return self
    def transform( self, X, y = None ):
        check_is_fitted(self, ['means','stds'])
        return (X - self.means) / self.stds

- `check_is_fitted()` only needs the names (a string or a list of strings) of the attributes the instance must have. If the user skips `fit()`, the `means` and `stds` attributes are never created, so the function raises a `NotFittedError`.

In [21]:
numeric_transformer = Pipeline(steps=[
    ('columns selector', ColumnsSelector(['Age','Fare'])),
    ('imputer', SimpleImputer(strategy='median')),
    ('my scaler', MyStandardScaler())  
])
numeric_transformer.transform(X_train)

NotFittedError: This SimpleImputer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
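
Note that the message above names `SimpleImputer`, not `MyStandardScaler`: in this run the pipeline applied its steps in order, and the imputer was the first step to check whether it had been fitted. To see the check inside the custom scaler itself, it can be called on its own. The sketch below repeats the class without the `print` call and uses a made-up 2x2 array (`toy`).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class MyStandardScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.means = np.mean(X, axis=0)
        self.stds = np.std(X, axis=0)
        return self

    def transform(self, X, y=None):
        # raises NotFittedError unless both attributes exist
        check_is_fitted(self, ['means', 'stds'])
        return (X - self.means) / self.stds


# Hypothetical toy data, used only for illustration
toy = np.array([[1.0, 10.0], [3.0, 30.0]])
scaler = MyStandardScaler()

try:
    scaler.transform(toy)          # fit() was skipped on purpose
    message = None
except NotFittedError as err:
    message = str(err)

scaled = scaler.fit(toy).transform(toy)
print(message)
print(scaled)
```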