##### 1. Wrapper Method
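> A wrapper method uses a greedy search algorithm to find the feature combination that gives the best result for a specific machine learning algorithm.
> Because it retrains and evaluates the model on every candidate feature subset, the computational cost becomes very high when the training data is large.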

Wrapper-method feature selection falls into three approaches: Step Forward, Step Backward, and Exhaustive.
We will implement the wrapper method with mlxtend, a package of utilities for data science.
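To make the greedy idea concrete, here is a minimal sketch of step forward selection in plain Python. It is not mlxtend's implementation. The `score_subset` callable and the toy scoring function are hypothetical stand-ins for what would normally be a cross-validated model score.

```python
def step_forward_select(n_features, k, score_subset):
    """Greedily grow a feature subset until it holds k features.

    score_subset is a hypothetical callable that maps a tuple of feature
    indices to a score (higher is better). Returns the selected indices
    and the number of subsets that were scored.
    """
    selected = []
    remaining = list(range(n_features))
    evaluations = 0
    while len(selected) < k and remaining:
        best_score, best_feature = None, None
        # Try adding each remaining feature to the current subset.
        for feature in remaining:
            score = score_subset(tuple(selected + [feature]))
            evaluations += 1
            if best_score is None or score > best_score:
                best_score, best_feature = score, feature
        # Keep the single best addition and move to the next round.
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, evaluations


def toy_score(subset):
    """Made-up score: features 1 and 3 help, every feature costs 0.01."""
    useful = {1: 0.3, 3: 0.2}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


selected, evaluations = step_forward_select(n_features=5, k=2, score_subset=toy_score)
print(selected, evaluations)
```

Step Backward works the same way in reverse: it starts from all features and removes the least useful one per round. Exhaustive scores every subset instead of committing to one choice per round.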

Let's perform wrapper-method feature selection on the [BNP Paribas Cardif Claims Management](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data) data provided on Kaggle.

###### *Description*
This is the BNP Paribas Cardif claims management data set. The goal is to build a model that classifies each claim so that customer claims can be handled quickly.

In [37]:
# Preprocess the data before running feature selection
# 0. Data Preprocessing

import pandas
import numpy
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold



paribas_data = pandas.read_csv("../../../data/3.FeatureSelection/01.Wrapper/train.csv", nrows=20000)
print("----Raw Data Shape----")
print(paribas_data.shape)
# Select the numeric columns
num_columns = ['int16','int32', 'int64', 'float16', 'float32','float64']
numerical_columns = list(paribas_data.select_dtypes(include=num_columns).columns)
# Keep only the 114 numeric columns
paribas_data = paribas_data[numerical_columns]
print("----Numerical Data Shape----")
print(paribas_data.shape)

# Split Train, Test Data Set
train_x, test_x, train_y, test_y = train_test_split(paribas_data.drop(labels=["ID","target"], axis=1),
                                                    paribas_data["target"],
                                                    test_size=0.2,
                                                    random_state=42)

# Drop columns with |correlation| > 0.8 (filter method)
correlated_features = set()
correlation_matrix = paribas_data.corr()

# Compare each column with the columns before it and flag the later
# column of every highly correlated pair
for idx, colname in enumerate(correlation_matrix):
    for j in range(idx):
        if abs(correlation_matrix.iloc[idx, j]) > 0.8:
            correlated_features.add(colname)

print(correlated_features)
print(len(correlated_features))

train_x.drop(labels=correlated_features, axis=1, inplace=True)
test_x.drop(labels=correlated_features, axis=1, inplace=True)

print("Train Shape : {0}, Test Shape : {1}".format(train_x.shape, test_x.shape))



----Raw Data Shape----
(20000, 133)
----Numerical Data Shape----
(20000, 114)
{'v63', 'v83', 'v123', 'v64', 'v108', 'v84', 'v122', 'v118', 'v101', 'v130', 'v21', 'v54', 'v67', 'v106', 'v115', 'v44', 'v121', 'v49', 'v40', 'v95', 'v105', 'v46', 'v104', 'v116', 'v48', 'v41', 'v12', 'v86', 'v87', 'v100', 'v111', 'v73', 'v103', 'v65', 'v32', 'v53', 'v77', 'v109', 'v126', 'v68', 'v81', 'v124', 'v98', 'v60', 'v128', 'v25', 'v89', 'v114', 'v76', 'v37', 'v78', 'v43', 'v96', 'v93', 'v55'}
55
Train Shape : (16000, 57), Test Shape : (4000, 57)
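The cell above computes the correlation matrix on the full `paribas_data` frame, which still contains `ID`, `target`, and the test rows. A possible variant, sketched below on synthetic data, estimates the correlations on the training split only and then drops the same columns from both splits. The helper name `find_correlated_columns` and the demo frames are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(frame, threshold=0.8):
    """Return columns whose |correlation| with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair is checked once and
    # only the later column of a pair can be flagged.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic stand-in for the real feature frames (hypothetical data).
rng = np.random.default_rng(0)
a = rng.normal(size=250)
demo = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=250),  # nearly collinear with "a"
    "b": rng.normal(size=250),
})
demo_train, demo_test = demo.iloc[:200], demo.iloc[200:]

# Estimate correlations on the training rows only, then apply to both splits.
to_drop = find_correlated_columns(demo_train)
demo_train = demo_train.drop(columns=to_drop)
demo_test = demo_test.drop(columns=to_drop)
print(to_drop)
```

Using `drop(columns=...)` without `inplace=True` returns new frames, which also avoids modifying the frames returned by `train_test_split` in place.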


In [41]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


# mlxtend
# A Python library with interesting tools for data science
from mlxtend.feature_selection import SequentialFeatureSelector
# 1. Step Forward Feature Selection
# Feature selection with SequentialFeatureSelector
# The model to fit is a RandomForestClassifier
# k_features : number of features to select (int)
# forward : True for step forward selection, False for step backward (bool)
# verbose : logging verbosity level (int)
# scoring : performance metric used to evaluate each subset (string)
# cv : number of cross-validation folds (int)
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=1),
                                            k_features=5,
                                            forward=True,
                                            verbose=2,
                                            scoring='roc_auc',
                                            cv=4)

# Step Forward Feature Selection
features = feature_selector.fit(numpy.array(train_x.fillna(0)), train_y)

# Selected Feature
filtered_features = train_x.columns[list(features.k_feature_idx_)]
print(filtered_features)




[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  57 out of  57 | elapsed:  7.9min finished

[2021-07-05 16:37:57] Features: 1/5 -- score: 0.6135197528020672[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  56 out of  56 | elapsed:  5.5min finished

[2021-07-05 16:43:27] Features: 2/5 -- score: 0.6497732434609347[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  55 out of  55 | elapsed:  6.7min finished

[2021-07-05 16:50:09] Features: 3/5 -- score: 0.6757750744810179[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 

Index(['v10', 'v14', 'v23', 'v34', 'v50'], dtype='object')


[Parallel(n_jobs=1)]: Done  53 out of  53 | elapsed: 10.3min finished

[2021-07-05 17:10:19] Features: 5/5 -- score: 0.6775732152494603
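The log shows 57, 56, and 55 candidates in the first three rounds and 53 in the last. That is what step forward selection over 57 features predicts, and it explains the minutes spent per round. The rough count below compares the three wrapper approaches for this setup. It assumes one model fit per cross-validation fold per candidate subset and ignores any other overhead. The fourth round (54 candidates) is inferred from the pattern because that part of the log is cut off.

```python
from math import comb

n_features = 57  # columns left after the correlation filter
k_features = 5   # k_features passed to SequentialFeatureSelector
cv_folds = 4     # cv passed to SequentialFeatureSelector

# Step forward: round r tries every feature that is not yet selected.
forward_candidates = sum(n_features - r for r in range(k_features))
# Step backward: start with all features and remove one per round
# until k_features remain.
backward_candidates = sum(range(k_features + 1, n_features + 1))
# Exhaustive search restricted to subsets of exactly k_features columns.
exhaustive_candidates = comb(n_features, k_features)

for name, count in [("forward", forward_candidates),
                    ("backward", backward_candidates),
                    ("exhaustive", exhaustive_candidates)]:
    print(f"{name}: {count} candidate subsets, {count * cv_folds} model fits")
```

Under these assumptions, forward selection scores 275 subsets (1,100 fits) and backward selection scores 1,638 (6,552 fits). An exhaustive search over 5-feature subsets would score 4,187,106 (16,748,424 fits), which is why it is only practical for small feature sets.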