## Yes, You Can Build Your Own Custom Sklearn Transformers. Here Is How
- https://pub.towardsai.net/yes-you-can-build-your-own-custom-sklearn-transformers-here-is-how-2508b71cf107

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023. 8. 7</div>
<div style="text-align: right"> Last update: 2023. 8. 7</div>

In [1]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
#plt.style.use('ggplot')
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
%matplotlib inline

### Integrating simple functions with FunctionTransformer

In [2]:
df = pd.read_csv('../data/titanic/train.csv')
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [3]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [4]:
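def num_missing_row(X: pd.DataFrame, y=None):
    # Row-wise count and standard deviation of the missing-value indicators
    num_missing = X.isnull().sum(axis=1)
    num_missing_std = X.isnull().std(axis=1)

    # Add the above series as new features.
    # Note: this writes into X in place, so the caller's DataFrame changes too.
    X["#missing"] = num_missing
    X["num_missing_std"] = num_missing_std

    return X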

In [5]:
from sklearn.preprocessing import FunctionTransformer

num_missing_estimator = FunctionTransformer(num_missing_row)

In [6]:
# Check number of columns before
print(f"Number of features before preprocessing: {len(df.columns)}")

# Apply the custom estimator
tps_df = num_missing_estimator.transform(df)
print(f"Number of features after preprocessing: {len(tps_df.columns)}")

Number of features before preprocessing: 12
Number of features after preprocessing: 14
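
Note that `num_missing_row` writes the new columns into the DataFrame it receives, so `df` itself now carries the two extra columns. If that side effect is unwanted, a non-mutating variant could look like the sketch below. The function name `num_missing_row_copy` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def num_missing_row_copy(X: pd.DataFrame, y=None) -> pd.DataFrame:
    """Return a new DataFrame with row-wise missing-value features added."""
    is_missing = X.isnull()
    # assign() returns a new DataFrame, so the caller's X is left untouched
    return X.assign(**{
        "#missing": is_missing.sum(axis=1),
        "num_missing_std": is_missing.astype(float).std(axis=1),
    })


# Toy data (hypothetical) with a few missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})
toy_out = FunctionTransformer(num_missing_row_copy).fit_transform(toy)

print(toy.shape)      # shape of the untouched input
print(toy_out.shape)  # shape of the transformed copy
```

Whether to mutate or copy is a design choice; copying is usually the safer default inside a pipeline, because the same input may be reused for cross-validation folds.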


In short, to use `FunctionTransformer`, define a function with the signature below, wrap it, and use the result like any other estimator.


```python
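# FunctionTransformer signature
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
import xgboost as xgb

def custom_function(X, y=None):
    ...

estimator = FunctionTransformer(custom_function)  # no errors

# X and y stand for your own feature matrix and target
custom_pipeline = make_pipeline(StandardScaler(), estimator, xgb.XGBRegressor())
custom_pipeline.fit(X, y)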
```
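
To make the template concrete, here is a minimal runnable sketch on synthetic data. It assumes `np.log1p` as the custom function and swaps in scikit-learn's `LinearRegression` for the XGBoost model so that it runs without extra dependencies; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic, non-negative features and a target that is linear in log1p(X)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(200, 3))
y = np.log1p(X).sum(axis=1) + rng.normal(0.0, 0.01, size=200)

# Wrap a plain function so it behaves like a stateless transformer
log_step = FunctionTransformer(np.log1p)

pipe = make_pipeline(log_step, StandardScaler(), LinearRegression())
pipe.fit(X, y)

r2 = pipe.score(X, y)
print(f"R^2 on training data: {r2:.3f}")
```

The log step is placed before scaling here because `np.log1p` is undefined for values below -1, which standardized data can contain.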

If the function has an inverse, pass it as `inverse_func`:

```python
def custom_function(X, y=None):
    ...

def inverse_of_custom(X, y=None):
    ...

estimator = FunctionTransformer(func=custom_function, inverse_func=inverse_of_custom)
```
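
As a small illustration of the `func` / `inverse_func` pair, the sketch below uses `np.log1p` and `np.expm1`, which are exact inverses of each other. The choice of functions and the sample array are assumptions for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(
    func=np.log1p,
    inverse_func=np.expm1,
    check_inverse=True,  # fit() warns if inverse_func does not undo func
)

X = np.array([[0.0, 1.0], [10.0, 100.0], [1000.0, 5.0]])
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

# The round trip should reproduce the input up to floating-point error
print(np.allclose(X, X_back))
```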

### Integrating more complex preprocessing steps with custom transformers

Let's build a custom transformer on top of `PowerTransformer`.

In [7]:
from sklearn.preprocessing import PowerTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    """Shift the data by +1, then apply PowerTransformer (Yeo-Johnson by default)."""

    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        # Shift by +1 on a copy of the input
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)

        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1

        return self._estimator.transform(X_copy)

    def inverse_transform(self, X):
        X_reversed = self._estimator.inverse_transform(np.copy(X))

        # Undo the +1 shift applied in fit/transform
        return X_reversed - 1

In [8]:
custom_log = CustomLogTransformer()

In [9]:
custom_log.fit(df.select_dtypes(int))

In [10]:
df_transformed = custom_log.transform(df.select_dtypes(int))
df_transformed

array([[-2.13529912, -0.78927234,  0.86703775,  1.3518121 , -0.56010901,
         0.05530165],
       [-2.12061766,  1.2669898 , -1.44707807,  1.3518121 , -0.56010901,
        -1.58746898],
       [-2.10696133,  1.2669898 ,  0.86703775, -0.67868344, -0.56010901,
         0.05530165],
       ...,
       [ 1.55262029, -0.78927234,  0.86703775,  1.3518121 ,  1.8638087 ,
         1.64614505],
       [ 1.55562662,  1.2669898 , -1.44707807, -0.67868344, -0.56010901,
        -1.58746898],
       [ 1.55863197, -0.78927234,  0.86703775, -0.67868344, -0.56010901,
         0.05530165]])

In [11]:
df_inversed = custom_log.inverse_transform(df_transformed)
df_inversed

array([[ 1.00000000e+00,  8.88178420e-16,  3.00000000e+00,
         1.00000000e+00,  2.66453526e-15,  1.00000000e+00],
       [ 2.00000000e+00,  1.00000000e+00,  1.00000000e+00,
         1.00000000e+00,  2.66453526e-15, -2.22044605e-16],
       [ 3.00000000e+00,  1.00000000e+00,  3.00000000e+00,
         4.44089210e-16,  2.66453526e-15,  1.00000000e+00],
       ...,
       [ 8.89000000e+02,  8.88178420e-16,  3.00000000e+00,
         1.00000000e+00,  2.00000000e+00,  2.00000000e+00],
       [ 8.90000000e+02,  1.00000000e+00,  1.00000000e+00,
         4.44089210e-16,  2.66453526e-15, -2.22044605e-16],
       [ 8.91000000e+02,  8.88178420e-16,  3.00000000e+00,
         4.44089210e-16,  2.66453526e-15,  1.00000000e+00]])

In [12]:

df.select_dtypes(int).values

array([[  1,   0,   3,   1,   0,   1],
       [  2,   1,   1,   1,   0,   0],
       [  3,   1,   3,   0,   0,   1],
       ...,
       [889,   0,   3,   1,   2,   2],
       [890,   1,   1,   0,   0,   0],
       [891,   0,   3,   0,   0,   1]])
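
Comparing the two arrays by eye, the inverse transform recovers the original values up to floating-point error. A check like this can be automated with `np.allclose`. The sketch below is self-contained, so it restates a compact version of the transformer under the hypothetical name `ShiftedPowerTransformer`, which creates the inner `PowerTransformer` in `fit`, and it uses a small random integer array in place of the Titanic columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class ShiftedPowerTransformer(BaseEstimator, TransformerMixin):
    """Shift inputs by +1, then apply PowerTransformer (Yeo-Johnson by default)."""

    def fit(self, X, y=None):
        # Fitted state gets a trailing underscore, following scikit-learn convention
        self.power_ = PowerTransformer().fit(np.asarray(X, dtype=float) + 1)
        return self

    def transform(self, X):
        return self.power_.transform(np.asarray(X, dtype=float) + 1)

    def inverse_transform(self, X):
        return self.power_.inverse_transform(np.asarray(X, dtype=float)) - 1


# Made-up integer data standing in for the integer columns of df
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(50, 3))

tf = ShiftedPowerTransformer().fit(X)
X_back = tf.inverse_transform(tf.transform(X))

# True when the round trip matches the input within floating-point tolerance
print(np.allclose(X, X_back))
```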

Now let's try training with a pipeline.

In [13]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
import xgboost as xgb
from sklearn.impute import SimpleImputer

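xgb_pipe = make_pipeline(
    FunctionTransformer(num_missing_row),
    SimpleImputer(strategy="constant", fill_value=-99999),
    CustomLogTransformer(),
    # "gpu_hist" needs a CUDA-enabled XGBoost build; XGBoost 2.0+ deprecates it
    # in favor of tree_method="hist" together with device="cuda"
    xgb.XGBClassifier(
        n_estimators=1000, tree_method="gpu_hist", objective="binary:logistic"
    ),
)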

In [14]:
xgb_pipe
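
The cell above only builds the pipeline. A sketch of how such a pipeline might be fitted and scored with `train_test_split` and `roc_auc_score` follows. It runs on synthetic numeric data rather than the Titanic file, and it makes several substitutions that are assumptions for illustration: scikit-learn's `GradientBoostingClassifier` instead of `XGBClassifier`, plain `PowerTransformer` instead of `CustomLogTransformer`, median imputation instead of the -99999 constant, and a hypothetical helper `add_missing_count`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer


def add_missing_count(X: pd.DataFrame) -> pd.DataFrame:
    # Non-mutating: returns a copy with a row-wise missing-value count
    return X.assign(n_missing=X.isnull().sum(axis=1))


# Synthetic stand-in data: 3 numeric features, roughly 10% of entries missing
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f1", "f2", "f3"])
y = (X["f1"] + X["f2"] > 0).astype(int)
X = X.mask(rng.random(X.shape) < 0.1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = make_pipeline(
    FunctionTransformer(add_missing_count),
    SimpleImputer(strategy="median"),
    PowerTransformer(),
    GradientBoostingClassifier(random_state=0),
)
pipe.fit(X_train, y_train)

auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
print(f"Test ROC AUC: {auc:.3f}")
```

When adapting this to the Titanic data, keep in mind that the steps after imputation expect numeric input, so text columns such as `Name` or `Sex` would likely need to be dropped or encoded first.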