## Boston Dataset - Second Coding Session
- Using sklearn's pipeline module
- Using hyperopt

### 0. Setup: Imports and Environment

In [37]:
## Setup
print('Ran on Jupyter Notebook')

import sys
print(sys.version) # Python version
print(sys.executable) # path to the Python interpreter
import os
print(os.getcwd()) # current working directory

import inspect
import pixiedust
from datetime import datetime

from IPython.display import display, HTML # make notebook cells span the full page width
display(HTML("<style>.container { width:100% !important; }</style>"))

Ran on Jupyter Notebook
3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 14:00:49) [MSC v.1915 64 bit (AMD64)]
C:\ProgramData\Anaconda3\python.exe
C:\Users\Administrator\Desktop\D0123_make_lecture_for_tobigs\D0210_실제로_코드_짜기\notebooks


In [38]:
## Frequently used modules
import pandas as pd
import numpy as np

import sklearn
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

### 1. Import Core Modules and Data

In [49]:
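# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release
from sklearn.datasets import load_boston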
from sklearn.pipeline import Pipeline
from joblib import Memory
from shutil import rmtree

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVR

In [40]:
boston = load_boston()
# Build one DataFrame holding the features plus the target column PRICE
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target

### 2. Data Exploration

In [41]:
display(df.head())

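| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |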


### 3. Data Preprocessing and Model Selection with sklearn.pipeline

In [42]:
X = df.drop(['PRICE'], axis = 1)
y = df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4) #:D fix the random_state seed so that rerunning gives the same split
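
To illustrate why the seed is fixed, here is a minimal sketch on toy arrays (`X_demo` and `y_demo` are hypothetical stand-ins, not the Boston data). It checks that two calls to `train_test_split` with the same `random_state` return identical partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (assumption: any array-like works the same way)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Without a fixed `random_state`, each run shuffles differently, so scores computed on the held-out part are not directly comparable between runs.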

#### 3.1. First, without Hyperparameter Tuning, Using Pipeline

In [43]:
svr_pipe = Pipeline(steps=[
    ('sc', StandardScaler()), #:D the string at the front of each tuple is the name given to that step
    ('svr', SVR())
])

In [44]:
## First, without hyperparameter tuning
svr_pipe.fit(X_train, y_train) #:D the intermediate preprocessing steps are fit_transform-ed automatically
cross_val_score(svr_pipe, X_train, y_train, cv=5, scoring = 'r2').mean()

0.5823120249781474

In [45]:
## Confirm that X_train keeps its original state
pd.DataFrame(X_train).head(3)

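| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | 0.05059 | 0.0 | 4.49 | 0.0 | 0.449 | 6.389 | 48.0 | 4.7794 | 3.0 | 247.0 | 18.5 | 396.9 | 9.62 |
| 354 | 0.04301 | 80.0 | 1.91 | 0.0 | 0.413 | 5.663 | 21.9 | 10.5857 | 4.0 | 334.0 | 22.0 | 382.8 | 8.05 |
| 221 | 0.40771 | 0.0 | 6.2 | 1.0 | 0.507 | 6.164 | 91.3 | 3.048 | 8.0 | 307.0 | 17.4 | 395.24 | 21.46 |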
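
The two points in 3.1 (the pipeline fits and applies its preprocessing steps by itself, and the input data is left untouched) can be checked with a small sketch. It uses synthetic data from `make_regression` rather than the Boston data, and the names `X_demo`, `y_demo`, and `X_original` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
X_original = X_demo.copy()

# Pipeline version: scaling is fit and applied inside fit() and predict()
pipe = Pipeline(steps=[('sc', StandardScaler()), ('svr', SVR())])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

# Manual version: the same steps written out by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
svr = SVR().fit(X_scaled, y_demo)
manual_pred = svr.predict(scaler.transform(X_demo))

predictions_match = np.allclose(pipe_pred, manual_pred)
input_unchanged = np.array_equal(X_demo, X_original)
print(predictions_match, input_unchanged)
```

Because `cross_val_score` clones the estimator and refits it on each fold, the scaler is fit only on each fold's training part, which is the main practical reason to wrap preprocessing in a `Pipeline`.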


#### 3.2. PCA, RandomizedSearchCV
- https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

##### Supplement: Dimensionality Reduction
- Why it matters
- Dimensionality reduction methods that require scaling
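
PCA is one method where scaling matters, because it picks directions of largest variance and variance depends on the units of each feature. The sketch below uses two synthetic, independent features on very different scales (an assumption made purely for illustration) and compares the explained variance ratio with and without `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Two independent features: one with unit scale, one roughly 1000 times larger
small = rng.normal(scale=1.0, size=200)
large = rng.normal(scale=1000.0, size=200)
X_demo = np.column_stack([small, large])

# Without scaling, the first component is essentially the large-scale feature
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# After standardization, both features contribute on comparable terms
X_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

This is why the pipelines in this section place `StandardScaler` before `PCA`.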

In [50]:
location = 'cachedir'
memory = Memory(location=location, verbose=10) #:D cache the fit results of the transformers at each pipeline step so they can be reused
# https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html
# https://tensorflow.blog/2017/12/08/pipeline%EC%97%90%EC%84%9C-%EC%BA%90%EC%8B%B1%EC%9D%84-%EC%82%AC%EC%9A%A9%ED%95%98%EA%B8%B0/
# The more numerous and complex the preprocessing steps, the more time caching saves

svr_pipe = Pipeline(steps=[
    ('sc', StandardScaler()),
    ('pca', PCA()),
    ('svr',SVR()) 
], memory = memory)
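
As a self-contained sketch of the caching idea, the example below builds the same three-step pipeline with and without `memory`, refits after changing only an `SVR` parameter, and checks that both give the same predictions. The synthetic data, the helper `make_steps`, and the use of a temporary directory from `mkdtemp` are assumptions for illustration, not part of the original notebook.

```python
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from joblib import Memory
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=0)

def make_steps():
    # Hypothetical helper: returns fresh, unfitted steps for each pipeline
    return [('sc', StandardScaler()), ('pca', PCA(n_components=3)), ('svr', SVR())]

cache_dir = mkdtemp()  # temporary cache directory, removed at the end
try:
    cached_pipe = Pipeline(steps=make_steps(), memory=Memory(location=cache_dir, verbose=0))
    plain_pipe = Pipeline(steps=make_steps())

    # First fit computes the transformers and stores their results in the cache
    cached_pipe.fit(X_demo, y_demo)
    # Only the final estimator changes, so the transformer results can be reused
    cached_pipe.set_params(svr__C=10).fit(X_demo, y_demo)
    plain_pipe.set_params(svr__C=10).fit(X_demo, y_demo)

    same_predictions = np.allclose(cached_pipe.predict(X_demo), plain_pipe.predict(X_demo))
finally:
    rmtree(cache_dir)

print(same_predictions)
```

Caching changes only how much work is repeated, not the fitted result, so it pays off mainly in searches where many candidates share the same preprocessing settings.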

In [51]:
parameters = {
    'pca__n_components': [3, 5, 10], #:D preprocessing hyperparameters can be searched too

    'svr__gamma': [1e-7, 1e-4],
    'svr__epsilon': [0.1, 0.2, 0.3, 0.5],
    'svr__kernel': ('rbf', 'poly', 'sigmoid'),
    'svr__C': [1, 5, 10]
}

randomizedsearch = RandomizedSearchCV(svr_pipe, param_distributions=parameters, cv=3, scoring='r2', n_jobs=-1, verbose=10)
randomizedsearch.fit(X_train, y_train)

memory.clear(warn=False)
rmtree(location) # remove the temporary cache directory once the work is done

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done  19 out of  30 | elapsed:    1.7s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:    1.7s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:    1.8s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    1.8s finished


________________________________________________________________________________
[Memory] Calling sklearn.pipeline._fit_transform_one...
_fit_transform_one(StandardScaler(copy=True, with_mean=True, with_std=True),          CRIM    ZN  INDUS  CHAS    NOX     RM   AGE      DIS   RAD    TAX  \
84    0.05059   0.0   4.49   0.0  0.449  6.389  48.0   4.7794   3.0  247.0   
354   0.04301  80.0   1.91   0.0  0.413  5.663  21.9  10.5857   4.0  334.0   
221   0.40771   0.0   6.20   1.0  0.507  6.164  91.3   3.0480   8.0  307.0   
34    1.61282   0.0   8.14   0.0  0.538  6.096  96.9   3.7598   4.0  307.0   
267   0.57834  20.0   3.97   0.0  0.575  8.297  67.0   2.4216   5.0  264.0   
..        ...   ...    ...   ...    ...    ...   ...      ...   ...    ...   
385  16.81180   0.0  18.10   0.0  0.700  5.277  98.1   1.4261  24.0  666.0   
197   0.04666  80.0   1.52   0.0  0.404  7.107  36.6   7.3090   2.0  329.0   
439   9.39063   0.0  18.10   0.0  0.740  5.627  93.9   1.8172  24.0  666.0   
174   

In [53]:
print('--- best_params_ ---')
print(randomizedsearch.best_params_)
print('--------------------')

print('--- best_score_ ----')
print(randomizedsearch.best_score_)
print('--------------------')

--- best_params_ ---
{'svr__kernel': 'sigmoid', 'svr__gamma': 0.0001, 'svr__epsilon': 0.3, 'svr__C': 10, 'pca__n_components': 10}
--------------------
--- best_score_ ----
0.09512827627588427
--------------------
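
The best cross-validated R² found here is well below the untuned score in 3.1. One plausible reason is that the candidate `gamma` values are very small and the search samples only 10 combinations. As a hedged sketch of an alternative, `RandomizedSearchCV` also accepts distributions, so continuous hyperparameters can be sampled on a log scale. The ranges below and the synthetic data are assumptions chosen for illustration, not tuned values for the Boston data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with 13 features (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=150, n_features=13, noise=10.0, random_state=0)

pipe = Pipeline(steps=[('sc', StandardScaler()), ('pca', PCA()), ('svr', SVR())])

# Lists are sampled uniformly; distributions are sampled via their rvs() method
param_distributions = {
    'pca__n_components': [3, 5, 10],
    'svr__kernel': ['rbf'],
    'svr__gamma': loguniform(1e-4, 1e0),   # illustrative range
    'svr__C': loguniform(1e-1, 1e3),       # illustrative range
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='r2',
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Setting `random_state` on the search itself makes the sampled candidates repeatable, which helps when comparing runs.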


##### Supplement: ColumnTransformer
- Columns of different types can each be processed in their own way
- FeatureUnion
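
A minimal sketch of both tools follows, using a small hypothetical DataFrame (`df_demo`, with made-up column names `rooms`, `tax`, and `zone`). `ColumnTransformer` sends different columns to different transformers, while `FeatureUnion` applies several transformers to the same input and concatenates their outputs side by side.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data: two numeric columns and one categorical column
df_demo = pd.DataFrame({
    'rooms': [6.5, 6.4, 7.1, 6.9],
    'tax': [296.0, 242.0, 242.0, 222.0],
    'zone': ['a', 'b', 'a', 'c'],
})

# ColumnTransformer: scale the numeric columns, one-hot encode the categorical one
ct = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['rooms', 'tax']),
    ('cat', OneHotEncoder(), ['zone']),
])
ct_out = ct.fit_transform(df_demo)
print(ct_out.shape)  # 2 scaled columns + 3 one-hot columns

# FeatureUnion: run two transformers on the same numeric input and stack the results
union = FeatureUnion(transformer_list=[
    ('pca', PCA(n_components=1)),
    ('scaled', StandardScaler()),
])
union_out = union.fit_transform(df_demo[['rooms', 'tax']])
print(union_out.shape)  # 1 PCA component + 2 scaled columns
```

Either object can be used as a step inside a `Pipeline`, so its parameters can be searched with the same `step__param` naming used earlier.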