
### Reference
- [Housing Prices: Using XGBoost and Pipeline](https://www.kaggle.com/kaggledroid/housing-prices-using-xgboost-and-pipeline)
- [XGBoost Pipeline](https://www.kaggle.com/byrolew/xgboost-pipeline)
- [House Prices: Advanced Regression Techniques](https://github.com/data-doctors/kaggle-house-prices-advanced-regression-techniques)
- [XGBoost with Scikit-Learn Pipeline & GridSearchCV](https://www.kaggle.com/carlosdg/xgboost-with-scikit-learn-pipeline-gridsearchcv)
- [SageMaker built-in LinearLearner](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab/blob/master/notebooks/MLA-TAB-Lecture2-SageMaker.ipynb)

### Sample Data

In [167]:
import string

import numpy as np
import pandas as pd
from scipy import sparse as sp
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBRegressor

In [158]:
df = pd.DataFrame(np.random.normal(size=(400, 4)), columns=list('abcd'))
# One random alphanumeric character per row (public NumPy replacement for the private pandas._testing.rands_array)
df['e'] = np.random.choice(list(string.ascii_letters + string.digits), size=df.shape[0])

numerical_cols = df.select_dtypes(exclude=['object']).columns
categorical_cols = df.select_dtypes(include=['object']).columns
target_col = 'y'

df[target_col] = df['a']*3 + df['b']*np.random.rand(df.shape[0]) + 3 + df['e'].map(lambda x: int(x) if x.isdigit() else 0 )

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

### Pipelines

#### Hello World Example

In [159]:
#Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')

#Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

#Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

#Defining model
model = XGBRegressor(objective='reg:squarederror', n_estimators=200, learning_rate=0.03)

#Bundle preprocessing and modeling code in a pipeline
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

In [160]:
pipe.fit(X_train, y_train)
preds = pipe.predict(X_valid)

# sklearn metrics take (y_true, y_pred); MAE is symmetric, so the value is unchanged
print(mean_absolute_error(y_valid, preds))

0.8131575948461615


#### What happens
- The model was fit on the data transformed by the preprocessor.
- The result of fitting is stored on the `model` object itself.

In [161]:
model.feature_importances_

array([0.35271233, 0.0230296 , 0.00160313, 0.00346912, 0.        ,
       0.        , 0.02174103, 0.04681395, 0.06139186, 0.067981  ,
       0.07269321, 0.10697489, 0.13098437, 0.08292631, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.01311071, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00049586, 0.        , 0.        , 0.        , 0.        ,
       0.01407257, 0.        , 0.        , 0.        , 0.        ,
       0.        ], dtype=float32)
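
The fitted state ends up on the original `model` object because `Pipeline` fits the step objects it was given rather than copies of them (at least with the default `memory=None`). A minimal sketch of that behavior, using `StandardScaler` and `LinearRegression` as stand-ins for the preprocessor and XGBoost model above; the `demo_*` names and the toy data are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

reg = LinearRegression()
demo_pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model', reg)])
demo_pipe.fit(X_demo, y_demo)

# The step inside the pipeline is the very same object we passed in,
# so fitted attributes are visible on `reg` directly.
same_object = demo_pipe.named_steps['model'] is reg
has_coef = hasattr(reg, 'coef_')
print(same_object, has_coef)
```

This is also why `model.feature_importances_` is available above without going through `pipe`.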

- Running the preprocessing and the model as separate steps gives the same result.

In [162]:
prep_train = preprocessor.transform(X_train)
prep_valid = preprocessor.transform(X_valid)
print('type :\t\t\t', type(prep_train))
print('example of item :\t', prep_train.toarray()[0])

type :			 <class 'scipy.sparse.csr.csr_matrix'>
example of item :	 [ 1.02506078 -2.34833459  0.0219236  -0.98661037  0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          1.          0.          0.        ]


In [163]:
model = XGBRegressor(objective='reg:squarederror', n_estimators=200, learning_rate=0.03)
model.fit(prep_train, y_train)
preds = model.predict(prep_valid)
# sklearn metrics take (y_true, y_pred); MAE is symmetric, so the value is unchanged
print(mean_absolute_error(y_valid, preds))

0.8131575948461615


- This is the same as running a pipeline that contains only the preprocessor.

In [164]:
pipe = Pipeline(steps=[('prep', preprocessor)])

prep_train = pipe.transform(X_train)
print('type :\t\t\t', type(prep_train))
print('example of item :\t', prep_train.toarray()[0])

type :			 <class 'scipy.sparse.csr.csr_matrix'>
example of item :	 [ 1.02506078 -2.34833459  0.0219236  -0.98661037  0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          1.          0.          0.        ]
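
The equivalence between calling the preprocessor directly and wrapping it in a one-step pipeline can be checked numerically. The sketch below uses a tiny hypothetical DataFrame (`demo`) and a helper `to_dense` that is not part of the notebook; it only assumes that a `Pipeline` without a final estimator exposes `fit_transform`:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small illustrative frame with one numeric and one categorical column
demo = pd.DataFrame({'num': [1.0, 2.0, np.nan, 4.0],
                     'cat': ['a', 'b', 'a', 'c']})

def make_ct():
    # Fresh, unfitted preprocessor each time so the two runs are independent
    return ColumnTransformer(transformers=[
        ('num', SimpleImputer(strategy='most_frequent'), ['num']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['cat']),
    ])

def to_dense(m):
    # ColumnTransformer may return sparse or dense output depending on density
    return m.toarray() if sparse.issparse(m) else np.asarray(m)

direct = to_dense(make_ct().fit_transform(demo))
via_pipe = to_dense(Pipeline(steps=[('prep', make_ct())]).fit_transform(demo))

print(np.allclose(direct, via_pipe))
```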


#### Make a new pipeline component

##### A transformer that modifies a DataFrame

In [188]:
class AddRand(BaseEstimator, TransformerMixin):
    def fit(self, df, *args):
        return self
    
    def transform(self, df):
        _df = df.copy()
        _df['f'] = np.random.normal(size=df.shape[0])
        return _df


pipe = Pipeline(steps=[
    ('add_rand', AddRand()),
])

prep_train = pipe.fit_transform(X_train)
print('type :\t\t\t', type(prep_train))
print('example of item :\t',)
display(prep_train.head())

type :			 <class 'pandas.core.frame.DataFrame'>
example of item :	


Unnamed: 0,a,b,c,d,e,f
336,1.025061,-2.348335,0.021924,-0.98661,x,-0.660819
64,0.222432,-1.278728,0.456675,0.930438,t,0.098729
55,0.00384,0.273327,-0.352015,-0.415909,5,0.235445
106,-1.066128,0.27998,0.611953,-1.313808,m,0.759435
300,0.866429,-0.647406,-1.16552,-0.74394,w,-0.609945
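
Custom transformers often need settings of their own. A common pattern, sketched here with a hypothetical `AddScaledColumn` class that is not part of the notebook, is to store each setting in `__init__` under the same attribute name. `BaseEstimator` then provides `get_params`/`set_params`, and `TransformerMixin` provides `fit_transform`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class AddScaledColumn(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds '<source>_scaled' = source * factor."""

    def __init__(self, source='a', factor=2.0):
        # Store constructor arguments unchanged so get_params() can find them
        self.source = source
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained
        return self

    def transform(self, X):
        out = X.copy()  # never mutate the caller's DataFrame
        out[f'{self.source}_scaled'] = out[self.source] * self.factor
        return out

demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'e': ['x', 'y', 'z']})
step = AddScaledColumn(source='a', factor=10.0)
result = step.fit_transform(demo)
params = step.get_params()
print(list(result.columns), params)
```

Because the parameters are discoverable, a step like this could be tuned from a grid search with a key such as `stepname__factor`.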


##### How to use FeatureUnion
- [Custom transformer mixin with FeatureUnion in scikit-learn](https://stackoverflow.com/questions/52116786/custom-transformer-mixin-with-featureunion-in-scikit-learn)
- [Pipelines & Custom Transformers in scikit-learn: The step-by-step guide (with Python code)](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156)

In [247]:
class AddRand(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Append a random column 'f' and return a NumPy array so FeatureUnion can stack it
        _X = X.copy()
        _X['f'] = np.random.normal(size=X.shape[0])
        return _X.values

class AddConstant(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Despite the name, this returns one column of random normal values, shape (n_samples, 1)
        return np.random.normal(size=X.shape[0]).reshape(X.shape[0], 1)


p1 = Pipeline([('add_rand', AddRand())])
p2 = Pipeline([('add_constant', AddConstant())])

pipe = FeatureUnion([
    ('p1', p1),
    ('p2', p2),
])

prep_train = pipe.fit_transform(X_train)

print('type :\t\t\t', type(prep_train))
print('example of item :\t',)
display(prep_train[0])

type :			 <class 'numpy.ndarray'>
example of item :	


array([1.0250607820437239, -2.3483345949230676, 0.02192360106126882,
       -0.9866103659303194, 'x', 0.5389890191402996, 0.11735763205253363],
      dtype=object)
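
`FeatureUnion` runs each branch on the same input and concatenates the outputs column-wise. A small sketch with built-in transformers (`StandardScaler` and `PCA` are chosen only for illustration, and `X_demo` is synthetic) makes the resulting shape explicit:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))

union = FeatureUnion([
    ('scaled', StandardScaler()),   # contributes 3 columns
    ('pca', PCA(n_components=1)),   # contributes 1 column
])

combined = union.fit_transform(X_demo)
print(combined.shape)  # 3 + 1 columns, same number of rows
```

The first three columns are the scaler's output and the last one is the PCA component, in the order the branches were listed.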

##### Note: after sklearn transformers such as OneHotEncoder, the data is a sparse CSR matrix

In [134]:
class AddConstant(BaseEstimator, TransformerMixin):
    def fit(self, X, *args):
        return self
    
    def transform(self, X):
        return sp.hstack([X, sp.csr_matrix(np.ones((X.shape[0], 1)))])


pipe = Pipeline(steps=[
    ('prep', preprocessor),
    ('add_dummy', AddConstant())
])

prep_train = pipe.fit_transform(X_train)
print('type :\t\t\t', type(prep_train))
print('example of item :\t', prep_train.toarray()[0])

type :			 <class 'scipy.sparse.coo.coo_matrix'>
example of item :	 [ 1.59835191 -0.67823669 -0.35101521  0.25688665  0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          1.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  1.        ]
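
In the output above the stacked result is a COO matrix rather than CSR. The format `scipy.sparse.hstack` picks by default can depend on the SciPy version, so if a later step needs CSR (for row slicing, for example) it may be safer to request it explicitly through the `format` argument. A minimal sketch with a made-up 3x2 matrix:

```python
import numpy as np
from scipy import sparse as sp

# Illustrative sparse input standing in for the preprocessor output
X_sparse = sp.csr_matrix(np.array([[1.0, 0.0],
                                   [0.0, 2.0],
                                   [3.0, 0.0]]))
ones = sp.csr_matrix(np.ones((X_sparse.shape[0], 1)))

# Ask for CSR explicitly instead of relying on the default output format
stacked = sp.hstack([X_sparse, ones], format='csr')
print(stacked.format, stacked.shape)
```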


##### [make_pipeline vs Pipeline](https://stackoverflow.com/questions/40708077/what-is-the-difference-between-pipeline-and-make-pipeline-in-scikit)

> Pipeline:
> - names are explicit, you don't have to figure them out if you need them;
> - name doesn't change if you change estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C.
>
> make_pipeline:
> - shorter and arguably more readable notation;
> - names are auto-generated using a straightforward rule (lowercase name of an estimator).

- `make_pipeline` looks more convenient to write, but it becomes more cumbersome when you need to refer to a step by name.
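
To make the naming difference concrete, the sketch below builds the same two-step pipeline both ways and sets a parameter on the final step. `StandardScaler` and `LinearRegression` are stand-ins chosen for illustration, and the step names `'scale'` and `'model'` are arbitrary:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

auto = make_pipeline(StandardScaler(), LinearRegression())
named = Pipeline(steps=[('scale', StandardScaler()),
                        ('model', LinearRegression())])

# make_pipeline derives step names from the lowercased class names
auto_names = [name for name, _ in auto.steps]
named_names = [name for name, _ in named.steps]
print(auto_names, named_names)

# Parameter keys follow '<step name>__<parameter>'
auto.set_params(linearregression__fit_intercept=False)
named.set_params(model__fit_intercept=False)
```

With explicit names, swapping `LinearRegression` for another estimator keeps the `model__...` keys valid, whereas the auto-generated prefix would change.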


In [136]:
pipe = make_pipeline(
    preprocessor,
    AddConstant()
)

prep_train = pipe.fit_transform(X_train)
print('type :\t\t\t', type(prep_train))
print('example of item :\t', prep_train.toarray()[0])

type :			 <class 'scipy.sparse.coo.coo_matrix'>
example of item :	 [ 1.59835191 -0.67823669 -0.35101521  0.25688665  0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          1.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  1.        ]


#### Grid search

In [None]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])
param_grid = {
    'model__max_depth': [2, 3, 5, 7, 10],
    'model__n_estimators': [10, 100, 500],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)



GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('num',
                                                                         SimpleImputer(add_indicator=False,
                                                                                       copy=True,
                                                                                       fill_value=None,
                                                                                       missing_values=nan,
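
After fitting, the search object exposes the winning parameters and score directly, which is usually easier to read than the full estimator repr. A self-contained sketch on synthetic data, using `Ridge` in place of the XGBoost model and `scoring='neg_mean_absolute_error'` to match the MAE metric used earlier (both are choices made for this example, not taken from the notebook):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([3.0, 1.0, 0.0]) + rng.normal(scale=0.1, size=60)

demo_pipe = Pipeline(steps=[('scale', StandardScaler()), ('model', Ridge())])
# Keys are '<step name>__<parameter>', as in the grid above
demo_grid = {'model__alpha': [0.01, 1.0, 100.0]}

search = GridSearchCV(demo_pipe, demo_grid, cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_mae = -search.best_score_  # the scorer is negated so that higher is better
print(best_params, best_mae)
```

`search.best_estimator_` is the whole pipeline refit on all of the training data, so it can be used for prediction without a separate preprocessing call.
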
                                                       

In [None]:
grid.best_estimator_

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='most_frequent',
                                                                verbose=0),
                                                  ['MSSubClass', 'LotFrontage',
                                                   'LotArea', 'OverallQual',
                                                   'Over