## Pipeline
- class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)

### 1. Pipeline usage
1. Put (step name, estimator instance) tuples in a list and pass the list to Pipeline
    - Pipeline([('step1', estimator1), ('step2', estimator2)])
2. Call fit on the Pipeline
3. Call predict on the Pipeline
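
The three steps above can be sketched next to the equivalent manual code, to show what the Pipeline does internally: `fit` calls `fit_transform` on each preprocessing step and then `fit` on the final estimator, while `predict` calls `transform` and then `predict`. This is a minimal sketch; the names `X_demo`, `y_demo`, `pipe`, `scaler`, and `clf` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Illustrative data (assumed for this sketch)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=10, n_informative=2, random_state=42
)

# Steps 1-3: build the pipeline, fit it, predict with it
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_demo, y_demo)
pipe_preds = pipe.predict(X_demo)

# The same work done by hand, step by step
scaler = StandardScaler()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(scaler.fit_transform(X_demo), y_demo)          # fit: fit_transform, then fit
manual_preds = clf.predict(scaler.transform(X_demo))   # predict: transform, then predict

# Both routes should give identical predictions
same = np.array_equal(pipe_preds, manual_preds)
step_names = [name for name, _ in pipe.steps]
```

The step names given in the tuples ('scaler', 'clf') are also how individual steps are looked up later, for example through `pipe.named_steps["scaler"]`.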

In [1]:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Generate a synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=2, random_state=42)
# Train/test split with 33% of the data held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Build the pipeline (scaling step, then the classifier)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=112))
])

pipeline.fit(X_train,y_train)
y_preds=pipeline.predict(X_test)
accuracy_score(y_test,y_preds)

0.8787878787878788

In [2]:
from sklearn.linear_model import LogisticRegression

pipeline2 = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipeline2.fit(X_train,y_train)
y_preds2=pipeline2.predict(X_test)
accuracy_score(y_test,y_preds2)


0.9393939393939394

- Using the iris dataset

In [3]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=112))
])

pipeline.fit(X_train,y_train)
y_preds = pipeline.predict(X_test)
accuracy_score(y_test,y_preds)

0.98

In [4]:
pipeline2 = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipeline2.fit(X_train,y_train)
y_preds2 = pipeline2.predict(X_test)
accuracy_score(y_test,y_preds2)

0.98

### 2. Pipeline and cross-validation (k-fold, stratified k-fold)
- use sklearn.model_selection.cross_val_score

![Alt text](image.png)

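- If preprocessing is done outside cross-validation, a step such as the scaler is fit on all of the data, training folds and validation folds alike, and the model is then scored on validation data that the preprocessing has already seen
- This causes a data leakage problem
- A pipeline solves this, because inside cross-validation every step is refit on the training folds only for each split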
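
The leakage described above can be sketched by putting the two approaches side by side. This is a minimal illustration, not a benchmark: the variable names (`X_iris`, `leaky_scores`, `safe_pipeline`, `safe_scores`) and the choice of `LogisticRegression(max_iter=1000)` are assumptions made for the sketch, and on a small, well-behaved dataset like iris the two sets of scores may differ very little or not at all.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Leaky: the scaler computes its mean/std from every row,
# including rows that are later used as validation folds.
X_scaled_all = StandardScaler().fit_transform(X_iris)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled_all, y_iris, cv=5
)

# Leak-free: cross_val_score clones the whole pipeline for each split,
# so the scaler is fit on the training folds only.
safe_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipeline, X_iris, y_iris, cv=5)

print("scaled before CV :", leaky_scores.mean())
print("scaled inside CV :", safe_scores.mean())
```

The point is structural rather than numerical: in the second version the validation fold never contributes to the statistics used to transform it, so the score is an honest estimate.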

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split,cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier())
])
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
scores


array([0.96666667, 0.96666667, 0.9       , 1.        , 1.        ])

In [9]:
import numpy as np

np.mean(scores)

0.9666666666666668
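
The section heading mentions both k-fold and stratified k-fold, while the cell above only passes `cv=5`. As a hedged sketch: when the estimator is a classifier and `cv` is an integer, scikit-learn uses `StratifiedKFold`, so passing the splitter explicitly should give the same scores. A plain `KFold` can be passed the same way. The names `pipe_cv`, `skf`, and `kf` are illustrative, and `random_state=0` on the tree is an assumption added so the runs are comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)

pipe_cv = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=0)),  # fixed seed: an assumption for reproducibility
])

# Explicit stratified k-fold: each fold keeps the class proportions of y
skf = StratifiedKFold(n_splits=5)
skf_scores = cross_val_score(pipe_cv, X_cv, y_cv, scoring="accuracy", cv=skf)

# Integer cv with a classifier falls back to the same splitter
int_scores = cross_val_score(pipe_cv, X_cv, y_cv, scoring="accuracy", cv=5)

# Plain k-fold: shuffle because the iris rows are ordered by class
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kf_scores = cross_val_score(pipe_cv, X_cv, y_cv, scoring="accuracy", cv=kf)

print("stratified k-fold :", skf_scores.mean())
print("k-fold (shuffled) :", kf_scores.mean())
```

Shuffling matters for plain `KFold` here because the iris samples are stored grouped by class, so unshuffled contiguous folds would each be dominated by a single class.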

### 3. Pipeline, cross-validation, and GridSearchCV

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split,KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)

rfc = RandomForestClassifier(random_state=112)
sts = StandardScaler()

pipeline=Pipeline([
    ('scaler',sts),
    ('clf',rfc)
])

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# Keys follow the '<step name>__<parameter name>' convention
params = {
    "clf__max_depth" : [3,5,7,9],
    "clf__min_samples_split" : [4,6,8]
}

grid_model = GridSearchCV(estimator=pipeline,
                          param_grid=params,
                          cv=kfold,
                          scoring='accuracy',
                          refit=True)

grid_model.fit(X_train,y_train)

print('Cross-validation score : ', grid_model.best_score_)
print('Best hyperparameter combination :', grid_model.best_params_)
print('Train score : ', grid_model.score(X_train, y_train))
print('Test score : ', grid_model.score(X_test, y_test))

Cross-validation score :  0.9400000000000001
Best hyperparameter combination : {'clf__max_depth': 3, 'clf__min_samples_split': 4}
Train score :  0.96
Test score :  0.98
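
The parameter grid above relies on the `<step name>__<parameter name>` naming convention: the prefix before the double underscore is the name given to the step in the Pipeline, and the rest is a parameter of that step's estimator. A minimal sketch of how to discover and set those keys follows; `pipe_rf`, `clf_keys`, and `depth` are illustrative names, not part of the original notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_rf = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=112)),
])

# get_params() lists every key that can appear in a param_grid;
# keep only the ones that belong to the 'clf' step.
clf_keys = sorted(k for k in pipe_rf.get_params() if k.startswith("clf__"))
print(clf_keys)

# set_params() uses the same keys, which is what GridSearchCV does
# internally for each candidate combination.
pipe_rf.set_params(clf__max_depth=3, clf__min_samples_split=4)
depth = pipe_rf.named_steps["clf"].max_depth
```

Because the notebook uses `refit=True`, the fitted `grid_model.best_estimator_` is itself a Pipeline refit on all of `X_train` with the best combination, so its steps can be inspected in the same way through `named_steps`.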
