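- Feature Engineering
        Use domain knowledge to transform existing features or create new ones from the data.
- Feature Extraction
        Derive new, informative features, for example through dimensionality reduction.
- Feature Selection
        Choose the desired features from the existing ones without modifying them.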
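
The difference between the three terms can be sketched on a tiny table. The data, the column names and the BMI feature below are assumptions made purely for illustration; they are not part of the survey dataset used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data, invented for this illustration only.
people = pd.DataFrame({
    "height_cm": [160.0, 170.0, 180.0, 175.0],
    "weight_kg": [55.0, 70.0, 85.0, 68.0],
    "shoe_mm": [240.0, 260.0, 280.0, 270.0],
})

# Feature engineering: build a new column from domain knowledge (BMI).
engineered = people.assign(
    bmi=people["weight_kg"] / (people["height_cm"] / 100) ** 2
)

# Feature extraction: project the columns onto 2 new axes with PCA.
extracted = PCA(n_components=2).fit_transform(people)

# Feature selection: keep a subset of the original columns, unchanged.
selected = people[["height_cm", "weight_kg"]]

print(engineered.shape, extracted.shape, selected.shape)
```

Only the last step leaves the retained columns exactly as they were, which is the property the rest of this notebook relies on.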

#### **SIMPLE IMPORT**

In [None]:
!pip install --upgrade category_encoders

In [29]:
# ignore warnings

import warnings
warnings.filterwarnings('ignore')

In [30]:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sklearn
import os
import io
import pandas as pd
import graphviz
from sklearn.tree import export_graphviz

from sklearn.model_selection import train_test_split
from category_encoders import OrdinalEncoder
from category_encoders import TargetEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer 
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestRegressor

In [31]:
from google.colab import drive 
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


### **Feature Selection**

- **What is feature selection?**
    - Feature selection (also called variable selection) is the process of choosing a relevant subset of the variables in a dataset to use when building a machine learning model.

- **Benefits of feature selection**
    - Feature selection can bring several benefits:

      - Improved accuracy
      - Simpler models that are easier to interpret
      - Shorter training time
      - Better generalization through reduced overfitting
      - Easier implementation for software developers
      - Lower risk of data errors when the model is in use
      - Removal of redundant variables
      - Avoiding poor learning behaviour in high-dimensional spaces

- **Feature selection techniques**
   - There are three families of feature selection techniques.


  1. Filter methods
          Filter method: looks for relevance between each feature and the target.
    - Filter methods include the following:
      - Basic methods
      - Univariate methods
      - Information gain
      - Fisher score
      - Correlation Matrix with Heatmap
  
  2. Wrapper methods
          Wrapper method: measures how useful a feature subset is to a model.
    - Wrapper methods include the following:
      - Forward Selection
      - Backward Elimination
      - Exhaustive Feature Selection
      - Recursive Feature Elimination
      - Recursive Feature Elimination with Cross-Validation

  3. Embedded methods
          Embedded method: also measures usefulness, but with a metric built into the model itself.
    - Embedded methods include the following:
      - LASSO: constrains the coefficients with an L1-norm penalty
      - RIDGE: constrains the coefficients with an L2-norm penalty
      - Elastic Net: a linear combination of the two penalties above
      - SelectFromModel
        - Tree Importance: takes feature importances from decision-tree-based algorithms.
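
As a rough sketch of how the three families differ in code, the block below applies one representative of each to the same data. The synthetic dataset from `make_classification`, the choice of estimators and all variable names are assumptions for illustration; the notebook itself only works through the filter methods.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative assumption): 10 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Filter: rank features with a univariate statistic; no model involved.
filter_sel = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)

# Wrapper: refit a model repeatedly and drop the weakest feature each time.
wrapper_sel = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=3
).fit(X_demo, y_demo)

# Embedded: use importances that the model computes while training.
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)

for name, fitted in [("filter", filter_sel),
                     ("wrapper", wrapper_sel),
                     ("embedded", embedded_sel)]:
    print(name, np.flatnonzero(fitted.get_support()))
```

All three objects expose `get_support()`, so their selections can be compared directly. `SelectFromModel` keeps every feature whose importance is at least the mean importance by default, so the number of features it returns is not fixed in advance.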




#### **Filter methods**




---

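- Filter methods are generally used as a preprocessing step.

    - The selection is independent of any ML algorithm.
    - Instead, features are selected based on their scores in statistical tests that measure their relationship with the outcome variable.
      
- Characteristics of filter methods
  - They rely on the characteristics of the data (the features themselves).
  - They do not use a machine learning algorithm.
  - They are model-agnostic.
  - They tend to be computationally inexpensive.
  - They usually give lower predictive performance than wrapper methods.
  - They are well suited to a quick screen and to removing irrelevant variables.

- Filter methods comprise the following techniques.

1. Basic methods
2. Univariate feature selection
3. Information gain
4. Fisher score
5. ANOVA F-Value for Feature Selection
6. Correlation Matrix with Heatmap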


### **Basic methods**

#### Remove Constant Features

1. Basic methods

  - The basic methods remove constant and quasi-constant features.
  - Constant features show the same single value for every observation (every row) in the dataset. Such features give a machine learning model no information with which to discriminate or predict the target.

  - To identify constant features, we can use sklearn's `VarianceThreshold` transformer.

In [32]:
df = pd.read_excel('/content/gdrive/MyDrive/data/sc231/1. MD_2016년 어르신 여가생활 설문조사_개별면접.xlsx')
# Recode the target variable into two classes: 1-2 -> 1, 3-5 -> 2.
df = df.replace({'DQ63': {1: 1, 2:1 , 3:2, 4:2, 5:2}})
target = 'DQ63'
df

Unnamed: 0,ID,SQ0,SQ1,SQ2,SQ3,Q01,Q021,Q022,Q023,Q024,Q03,Q041,Q04101,Q04102,Q04103,Q04104,Q04105,Q04106,Q042,Q04201,Q04202,Q04203,Q04204,Q04205,Q04206,Q04207,Q043,Q04301,Q04302,Q04303,Q04304,Q04305,Q04306,Q04307,Q04308,Q04309,Q04310,Q05,Q051,Q05101,...,Q3201,Q3202,Q3203,Q3204,Q3205,Q3206,Q33,Q34,Q3401,Q3402,Q3403,Q3404,Q3405,Q35,Q3501,Q36,Q37,Q381,Q382,Q383,Q391,Q392,Q393,Q394,Q395,Q40,Q401,Q411,Q412,Q42,DQ1,DQ2,DQ3,DQ4,DQ5,DQ61,DQ62,DQ63,DQ64,DQ65
0,1,1,1,2,1,2,2,12,2,16,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,2,3,4,5,#NULL!,#NULL!,#NULL!,5,7,3,5,2,2,2,1,1,2,2,2,1,3,1,0,15000,1,1949,1,4,3,3,2,1,2,1
1,2,1,1,2,1,2,2,13,2,16,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,3,5,#NULL!,#NULL!,#NULL!,#NULL!,6,7,3,4,2,2,2,1,1,2,2,2,1,13,0,15,10000,1,1948,2,1,2,3,3,1,2,3
2,3,1,1,2,1,2,1,11,2,15,1,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,2,8,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,1,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,3,2,3,5,#NULL!,#NULL!,#NULL!,6,7,3,3,2,2,2,2,1,2,2,2,1,3,1,10,25000,1,1950,2,1,2,3,2,2,2,3
3,4,1,1,2,1,2,2,12,2,16,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,3,3,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,5,7,3,4,2,2,2,1,1,2,2,2,1,13,0,10,10000,1,1949,2,1,2,3,3,1,2,3
4,5,1,1,2,1,2,2,15,2,17,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,2,3,5,#NULL!,#NULL!,#NULL!,#NULL!,5,7,3,4,2,2,2,1,1,2,2,2,1,5,0,20,15000,1,1948,1,2,3,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
518,620,3,1,2,1,2,1,10,2,16,1,3,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,4,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,6,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,1,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,3,3,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,6,3,3,2,2,2,2,1,2,2,1,1,16,1,0,12000,1,1948,2,1,2,2,2,2,3,3
519,621,5,1,2,1,3,2,12,2,17,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,1,2,#NULL!,#NULL!,#NULL!,#NULL!,3,4,2,2,1,1,1,2,1,2,2,2,1,3,1,0,35000,1,1946,2,1,4,2,4,2,3,3
520,622,5,1,2,1,2,2,13,2,18,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,4,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,1,1,2,#NULL!,#NULL!,#NULL!,#NULL!,3,4,2,2,2,2,1,2,1,1,2,2,1,21,1,10,20000,1,1944,2,1,4,3,4,2,3,2
521,623,5,1,2,1,3,2,12,2,17,2,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,3,1,2,#NULL!,#NULL!,#NULL!,#NULL!,6,7,2,3,2,2,2,2,1,2,2,2,1,1,1,0,10000,1,1940,1,2,3,2,3,2,2,3


In [33]:
encoder = OrdinalEncoder()
df = encoder.fit_transform(df)

In [34]:
df

Unnamed: 0,ID,SQ0,SQ1,SQ2,SQ3,Q01,Q021,Q022,Q023,Q024,Q03,Q041,Q04101,Q04102,Q04103,Q04104,Q04105,Q04106,Q042,Q04201,Q04202,Q04203,Q04204,Q04205,Q04206,Q04207,Q043,Q04301,Q04302,Q04303,Q04304,Q04305,Q04306,Q04307,Q04308,Q04309,Q04310,Q05,Q051,Q05101,...,Q3201,Q3202,Q3203,Q3204,Q3205,Q3206,Q33,Q34,Q3401,Q3402,Q3403,Q3404,Q3405,Q35,Q3501,Q36,Q37,Q381,Q382,Q383,Q391,Q392,Q393,Q394,Q395,Q40,Q401,Q411,Q412,Q42,DQ1,DQ2,DQ3,DQ4,DQ5,DQ61,DQ62,DQ63,DQ64,DQ65
0,1,1,1,2,1,2,2,12,2,16,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,2,3,1,1,1,1,1,5,1,3,5,2,2,2,1,1,2,2,2,1,1,1,0,15000,1,1949,1,4,3,3,2,1,2,1
1,2,1,1,2,1,2,2,13,2,16,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,3,2,2,1,1,1,6,1,3,4,2,2,2,1,1,2,2,2,1,2,0,15,10000,1,1948,2,1,2,3,3,1,2,3
2,3,1,1,2,1,2,1,11,2,15,1,2,1,1,1,1,1,1,2,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,2,2,1,...,1,1,1,1,1,1,3,2,3,1,1,1,1,6,1,3,3,2,2,2,2,1,2,2,2,1,1,1,10,25000,1,1950,2,1,2,3,2,2,2,3
3,4,1,1,2,1,2,2,12,2,16,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,3,3,4,2,1,1,1,5,1,3,4,2,2,2,1,1,2,2,2,1,2,0,10,10000,1,1949,2,1,2,3,3,1,2,3
4,5,1,1,2,1,2,2,15,2,17,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,2,3,2,2,1,1,1,5,1,3,4,2,2,2,1,1,2,2,2,1,3,0,20,15000,1,1948,1,2,3,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
518,620,3,1,2,1,2,1,10,2,16,1,4,1,1,1,1,1,1,5,1,1,1,1,1,1,1,6,1,1,1,1,1,1,1,1,1,1,2,2,1,...,1,1,1,1,1,1,3,3,4,2,1,1,1,1,2,3,3,2,2,2,2,1,2,2,1,1,26,1,0,12000,1,1948,2,1,2,2,2,2,3,3
519,621,5,1,2,1,3,2,12,2,17,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,5,2,1,1,1,3,3,2,2,1,1,1,2,1,2,2,2,1,1,1,0,35000,1,1946,2,1,4,2,4,2,3,3
520,622,5,1,2,1,2,2,13,2,18,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,6,1,1,1,1,1,1,1,5,2,1,1,1,3,3,2,2,2,2,1,2,1,1,2,2,1,4,1,10,20000,1,1944,2,1,4,3,4,2,3,2
521,623,5,1,2,1,3,2,12,2,17,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,3,1,5,2,1,1,1,6,1,2,3,2,2,2,2,1,2,2,2,1,16,1,0,10000,1,1940,1,2,3,2,3,2,2,3


In [35]:
# Split the data into training and test sets
target = 'DQ63'

train, test = train_test_split(df, train_size=0.80, test_size=0.20, 
                              stratify=df[target], random_state=2)
train.shape, test.shape

((418, 236), (105, 236))

In [36]:
features = train.drop(columns=[target]).columns

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]


X_train.shape, y_train.shape,  X_test.shape ,y_test.shape

((418, 235), (418,), (105, 235), (105,))

In [37]:
# using sklearn variancethreshold to find constant features

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0)
sel.fit(X_train)

VarianceThreshold(threshold=0)

In [38]:
# get_support() returns a boolean mask of the features that are KEPT,
# i.e. the non-constant ones; summing it gives the number of non-constant features.
sum(sel.get_support())

177

In [39]:
# print the constant features
print(
    len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

58


['SQ1',
 'SQ2',
 'SQ3',
 'Q023',
 'Q04104',
 'Q04105',
 'Q04106',
 'Q04204',
 'Q04205',
 'Q04206',
 'Q04207',
 'Q04307',
 'Q04308',
 'Q04309',
 'Q04310',
 'Q05103',
 'Q05104',
 'Q05105',
 'Q07104',
 'Q07105',
 'Q07106',
 'Q07204',
 'Q07205',
 'Q07206',
 'Q07207',
 'Q07306',
 'Q07307',
 'Q07308',
 'Q07309',
 'Q07310',
 'Q09104',
 'Q09105',
 'Q09106',
 'Q09205',
 'Q09206',
 'Q09207',
 'Q09306',
 'Q09307',
 'Q09308',
 'Q09309',
 'Q09310',
 'Q1107',
 'Q1108',
 'Q1504',
 'Q1505',
 'Q1506',
 'Q1507',
 'Q1508',
 'Q2004',
 'Q2005',
 'Q2006',
 'Q2105',
 'Q2106',
 'Q3004',
 'Q3204',
 'Q3205',
 'Q3206',
 'Q3405']




*   We can see that 58 columns/variables are constant.
*   This means that 58 variables show the same single value for every observation in the training set.
*   We then use the `transform` method to reduce the training and test sets.

In [40]:
# we can then drop these columns from the train and test sets
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

In [41]:
# check the shape of training and test set

X_train.shape, X_test.shape

((418, 177), (105, 177))
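
Note that `sel.transform` returns a NumPy array, so the column labels are lost (the correlation section further down shows integer column names for this reason). One possible way to keep them, sketched here on a hypothetical toy frame rather than on the survey data, is to index the DataFrame with the mask from `get_support()`:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame with made-up column names; "b" is constant.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0, 1, 0, 1],
})

selector = VarianceThreshold(threshold=0).fit(toy)

# Index the original columns with the boolean mask of retained features
# so the labels survive instead of becoming 0..n-1.
kept_cols = toy.columns[selector.get_support()]
toy_reduced = toy[kept_cols]

print(list(kept_cols))
```

The same pattern would apply to both the training and test frames, always using the mask fitted on the training set.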

#### Remove Quasi-Constant Features

- Quasi-constant features show the same value for the great majority of the observations in the dataset.
- In general, these features provide little, if any, information that lets a machine learning model discriminate or predict the target.
- There can be exceptions, however, so care is needed when removing this type of feature.
- Identifying and removing quasi-constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

- sklearn's `VarianceThreshold` is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default it removes all zero-variance features, i.e. features that have the same value in every sample.

To remove almost-constant (quasi-constant) features, we change the default threshold.
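
How a variance threshold relates to "share of identical values" is easiest to see for a 0/1 feature, whose population variance is p(1 - p). The sketch below uses two hypothetical binary columns; for ordinal-encoded features with more levels, as in this dataset, the variance also depends on the code values, so the correspondence is only approximate.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Two made-up binary features with 1000 rows each.
n_rows = 1000
almost_constant = np.r_[np.ones(5), np.zeros(n_rows - 5)]    # 99.5% zeros
more_balanced = np.r_[np.ones(50), np.zeros(n_rows - 50)]    # 95% zeros
X_binary = np.column_stack([almost_constant, more_balanced])

# Variance of a 0/1 feature is p * (1 - p):
# 0.005 * 0.995 = 0.004975 and 0.05 * 0.95 = 0.0475.
variances = X_binary.var(axis=0)

# Features whose variance does not exceed the threshold are removed.
keep_mask = VarianceThreshold(threshold=0.01).fit(X_binary).get_support()

print(variances, keep_mask)
```

With `threshold=0.01`, the first column is dropped and the second is kept.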

In [42]:
df = pd.read_excel('/content/gdrive/MyDrive/data/sc231/1. MD_2016년 어르신 여가생활 설문조사_개별면접.xlsx')
# Recode the target variable into two classes: 1-2 -> 1, 3-5 -> 2.
df = df.replace({'DQ63': {1: 1, 2:1 , 3:2, 4:2, 5:2}})
target = 'DQ63'                                           

encoder = OrdinalEncoder()
df = encoder.fit_transform(df)

# Split the data into training and test sets
target = 'DQ63'

train, test = train_test_split(df, train_size=0.80, test_size=0.20, 
                              stratify=df[target], random_state=2)
print(train.shape, test.shape)

features = train.drop(columns=[target]).columns

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]


print(X_train.shape, y_train.shape,  X_test.shape ,y_test.shape)


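# use sklearn's VarianceThreshold to find quasi-constant features

from sklearn.feature_selection import VarianceThreshold

# threshold=0.01: for a binary (0/1) feature this roughly means that about
# 99% of the observations share the same value (variance 0.99 * 0.01 ≈ 0.01).
sel = VarianceThreshold(threshold=0.01)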
sel.fit(X_train)

print(len(X_train.columns[sel.get_support()]))

# finally we can print the quasi-constant features
print(len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

print('------------------')
print(X_train.shape, X_test.shape)

(418, 236) (105, 236)
(418, 235) (418,) (105, 235) (105,)
161
74
------------------
(418, 161) (105, 161)


### **Univariate selection methods**

- Univariate feature selection methods work by selecting the best features based on univariate statistical tests such as ANOVA.
- They can be seen as a preprocessing step before fitting an estimator.
- Scikit-learn exposes its feature selection routines as objects that implement the `transform` method.

- Methods based on the F-test estimate the degree of linear dependency between two random variables.
- They assume a linear relationship between the feature and the target.
- They also assume that the variables follow a Gaussian distribution.

- There are four groups of methods.

1. SelectKBest
2. SelectPercentile
3. SelectFpr, SelectFdr, or family-wise error SelectFwe
4. GenericUnivariateSelect

- We limit ourselves to SelectKBest and SelectPercentile here.
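
For completeness, here is a minimal sketch of the two groups that are not used below. It runs on synthetic data from `make_classification`, which is an assumption for illustration and not the survey data. `GenericUnivariateSelect` takes the strategy as a `mode` argument, and `SelectFpr` keeps features whose p-value is below `alpha`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    GenericUnivariateSelect, SelectFpr, SelectKBest, f_classif,
)

# Synthetic data (illustrative assumption): 8 features, 2 informative.
X_uni, y_uni = make_classification(
    n_samples=300, n_features=8, n_informative=2,
    n_redundant=0, random_state=1,
)

# SelectKBest and the generic selector in "k_best" mode pick the same columns.
kbest = SelectKBest(f_classif, k=2).fit(X_uni, y_uni)
generic = GenericUnivariateSelect(f_classif, mode="k_best", param=2).fit(X_uni, y_uni)

# SelectFpr keeps every feature whose p-value is below alpha.
fpr = SelectFpr(f_classif, alpha=0.05).fit(X_uni, y_uni)

print(np.flatnonzero(kbest.get_support()))
print(np.flatnonzero(generic.get_support()))
print(np.flatnonzero(fpr.get_support()))
```

Unlike `SelectKBest`, the number of features that `SelectFpr` returns depends on the data, since it is driven by a significance level rather than a count.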


#### **SelectKBest**

- This method selects features according to the k highest scores.
- For example, we can run a chi-square test on the samples to retrieve only the two best features from the dataset, as follows.



> Chi-square test: https://github.com/syh0397/Statistics_python/blob/main/4_Statistic_python_Chi_square_Test.ipynb



In [43]:
X_train.shape, y_train.shape

((418, 161), (418,))

In [44]:
# select the two best features
from sklearn.feature_selection import SelectKBest, chi2

X_new = SelectKBest(chi2, k=2).fit_transform(X_train, y_train)
X_new.shape

(418, 2)

#### **SelectPercentile**

- Selects features according to a percentile of the highest scores.

In [45]:
# now select features based on top 10 percentile
from sklearn.feature_selection import SelectPercentile, chi2

X_new = SelectPercentile(chi2, percentile=10).fit_transform(X_train, y_train)
X_new.shape

(418, 16)

- Note that the score function depends on the task:
    - For regression: f_regression, mutual_info_regression

    - For classification: chi2, f_classif, mutual_info_classif

  Make sure to choose the one that matches the task.
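
As a small sketch of the regression case, the block below uses `f_regression` with `SelectKBest` on synthetic data from `make_regression`. The data and variable names are assumptions for illustration; the survey target in this notebook is a class label, so the classification score functions apply there.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data (illustrative assumption): 6 features, 2 informative.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=6, n_informative=2,
    noise=1.0, random_state=0,
)

# A continuous target calls for a regression score function.
reg_selector = SelectKBest(f_regression, k=2).fit(X_reg, y_reg)
X_reg_new = reg_selector.transform(X_reg)

print(X_reg_new.shape)
```

One further constraint worth remembering is that `chi2` only accepts non-negative feature values, which holds for the ordinal-encoded survey columns used above.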
  

#### Information Gain

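- Information gain, or mutual information
    - Mutual information measures the information that X and Y share.
    - It measures how much knowing one of these variables reduces uncertainty about the other.
    - For example, if X and Y are independent, knowing X gives no information about Y and vice versa, so their mutual information is 0.
    - At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all the information conveyed by X is shared with Y.
    - Knowing X determines the value of Y, and vice versa.
    - As a result, in this case the mutual information equals the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X).
    - Moreover, this mutual information equals both the entropy of X and the entropy of Y.
    - (A very special case of this is when X and Y are the same random variable.)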
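
The two extremes described above can be checked numerically. This is a minimal sketch on hand-made discrete arrays (an assumption for illustration), using `sklearn.metrics.mutual_info_score` and `scipy.stats.entropy`, both of which work in nats.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

# Independent pair: every (a, b) combination occurs equally often.
var_a = np.array([0, 0, 1, 1])
var_b = np.array([0, 1, 0, 1])
mi_independent = mutual_info_score(var_a, var_b)

# Deterministic pair: var_c is a one-to-one relabelling of var_a.
var_c = 1 - var_a
mi_deterministic = mutual_info_score(var_a, var_c)

# Entropy of var_a, computed from its value counts.
h_a = entropy(np.bincount(var_a))

print(mi_independent, mi_deterministic, h_a)
```

The independent pair gives a mutual information of 0, while the deterministic pair gives the entropy of `var_a`, which is ln 2 here. For feature selection, `mutual_info_classif` or `mutual_info_regression` can be passed to `SelectKBest` as the score function in the same way as `chi2` above.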

### **ANOVA F-value For Feature Selection**

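- Computes the ANOVA F-value for the provided samples.

- If the features are categorical, we compute the chi-square statistic between each feature and the target vector.
- If the features are quantitative, we compute the ANOVA F-value between each feature and the target vector.

- The F-value score examines whether the group means differ significantly when a numerical feature is grouped by the target vector.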
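
To make the "grouped by the target" idea concrete, the sketch below compares `f_classif` with a one-way ANOVA computed per feature by `scipy.stats.f_oneway`. The random data are an assumption for illustration: one column is shifted by class and the other is pure noise.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Made-up data: column 0 shifts with the class, column 1 is noise.
y_anova = np.repeat([0, 1], 50)
X_anova = np.column_stack([
    rng.normal(loc=y_anova * 2.0, scale=1.0),
    rng.normal(size=100),
])

f_values, p_values = f_classif(X_anova, y_anova)

# The same statistic by hand: split each column by class and run one-way ANOVA.
f_manual = [
    f_oneway(col[y_anova == 0], col[y_anova == 1]).statistic
    for col in X_anova.T
]

print(f_values, f_manual)
```

The class-dependent column receives a much larger F-value than the noise column, which is why `SelectKBest(f_classif, ...)` would rank it first.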

In [47]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

fvalue_selector = SelectKBest(f_classif, k=2)

# Apply the SelectKBest object to the features and target
X_kbest = fvalue_selector.fit_transform(X_train, y_train)

print('Original number of features:', X_train.shape[1])
print('Reduced number of features:', X_kbest.shape[1])

Original number of features: 161
Reduced number of features: 2


### **Correlation-Matrix with Heatmap**

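- Correlation measures the linear relationship between two or more variables. Through correlation, we can predict one variable from another.
- Good variables are highly correlated with the target.

- Predictors that are correlated with each other provide redundant information.
- Variables should be correlated with the target but uncorrelated among themselves.
- This section shows how to select features based on the correlation between pairs of features.
- We can find the features that are correlated with each other.
- By examining these features, we can decide which ones to keep and which ones to remove.

- The Pearson correlation coefficient takes values between -1 and 1.
- If the correlation between two features is 0, there is no linear relationship between them (a nonlinear dependence may still exist).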
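
A short sketch of what a Pearson coefficient of 0 does and does not mean, using hand-made arrays (an assumption for illustration): a variable can be fully determined by another and still have zero linear correlation with it.

```python
import numpy as np

x_vals = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_quad = x_vals ** 2          # fully determined by x_vals, but not linearly
y_line = 3.0 * x_vals + 1.0   # exact linear function of x_vals

r_quad = np.corrcoef(x_vals, y_quad)[0, 1]
r_line = np.corrcoef(x_vals, y_line)[0, 1]

print(r_quad, r_line)
```

Here `r_quad` is 0 and `r_line` is 1, so a correlation filter only detects linear redundancy between features.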

In [53]:
# Convert feature matrix into DataFrame
df = pd.DataFrame(X_train)

# View the data frame
print(df)

# Create correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

     0    1    2    3    4    5    6    ...  154  155  156  157  158  159  160
0    350    4    3    1   11   17    1  ...    1    2    2    3    1    2    1
1    352    4    3    1   11   16    1  ...    2    1    4    3    3    2    2
2    611    1    2    2   13   16    2  ...    3    2    3    2    2    3    2
3     14    1    2    1    9   16    1  ...    2    1    4    2    1    2    2
4    102    2    3    2   12   17    2  ...    1    2    3    3    3    2    2
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
413  378    5    1    2   12   17    2  ...    1    2    3    2    1    2    2
414  454    5    3    2   14   17    2  ...    2    1    3    3    2    3    3
415   13    1    3    1    9   14    1  ...    2    1    4    2    2    2    2
416   94    2    2    1   10   15    1  ...    1    2    2    1    1    1    1
417   17    1    1    1    9   15    1  ...    3    2    2    2    2    3    3

[418 rows x 161 columns]
          0         1     

In [54]:
# Select the upper triangle of the correlation matrix
# (the built-in bool replaces the np.bool alias, which was removed in NumPy 1.24)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160
0,,0.792391,0.093185,0.268724,0.363623,0.231987,0.265563,0.035766,-0.000242,0.029868,-0.103615,-0.076691,-0.063837,0.006135,-0.089354,-0.075394,0.037243,0.040102,0.052371,-0.220753,-0.120745,-0.052721,0.023889,0.524585,0.204865,0.062397,0.325966,0.186114,0.101189,0.103769,0.353169,0.243951,0.087117,0.105929,0.065831,0.025660,0.194588,0.436718,0.093658,0.054293,...,-0.138199,-0.004675,0.108921,0.041017,-0.001138,0.101330,0.041911,0.026122,-0.002431,0.005906,-0.087366,0.154710,-0.009519,0.097489,0.143461,-0.061007,-0.164176,-0.322779,-0.096155,-0.128314,-0.164692,0.151778,0.078604,-0.058898,-0.070072,-0.085265,-0.040571,0.177324,0.047923,0.021883,0.216327,0.113892,0.043923,0.002395,0.014420,0.063002,0.298625,0.218367,0.274405,0.178189
1,,,0.071204,0.300132,0.391799,0.292644,0.293858,0.063560,0.056175,0.069303,-0.103720,-0.086882,-0.034403,0.005751,-0.065392,-0.093294,0.093140,0.081115,0.066558,-0.233066,-0.124254,-0.077046,0.015525,0.608101,0.266779,0.122474,0.408489,0.230311,0.183557,0.138763,0.422588,0.286223,0.204530,0.192129,0.118218,0.055014,0.269527,0.509978,0.148315,0.086929,...,-0.081943,0.125066,0.017795,0.080040,0.036920,0.194392,-0.024123,0.042743,-0.008521,0.027545,-0.157548,0.227193,0.076506,0.161267,0.163567,-0.059327,-0.184021,-0.388650,-0.119508,-0.163977,-0.228115,0.175476,0.140443,-0.129672,-0.095747,-0.059989,0.010610,0.203888,0.057865,0.037584,0.196288,0.224005,-0.011881,0.053035,0.043855,0.122637,0.344990,0.252529,0.323600,0.226526
2,,,,0.151570,0.220506,-0.084731,0.158713,-0.021093,-0.018872,-0.042299,0.082816,-0.091598,-0.061187,-0.036274,0.084667,-0.157227,-0.076963,-0.092479,-0.023016,-0.103882,-0.259085,-0.162338,0.096452,-0.079997,-0.016417,-0.003055,0.112060,0.092280,0.039513,-0.019735,0.135195,0.211584,0.117174,0.013790,0.017353,-0.039975,-0.049254,-0.043517,-0.089086,-0.085517,...,0.052971,-0.097323,-0.054833,-0.079179,0.050279,-0.144033,0.116122,0.027846,0.034374,-0.071269,-0.170633,0.084160,0.235090,0.088880,0.014004,-0.023144,-0.071911,-0.121329,-0.118865,-0.137660,-0.092074,0.169823,0.187745,-0.150734,-0.100840,0.074675,-0.039821,0.081606,0.039170,-0.033089,0.160591,0.252688,0.172964,0.024262,-0.110098,0.264832,0.159862,0.167009,0.188461,0.098850
3,,,,,0.839518,0.382687,0.995226,-0.752197,-0.310421,-0.170526,-0.748186,-0.448251,-0.184069,-0.074210,-0.692392,-0.559543,-0.360967,-0.234761,-0.087542,-0.933080,-0.624776,-0.228715,-0.068509,0.110875,0.098524,0.098274,0.384458,0.121218,0.002044,0.079747,0.351652,0.176959,0.004491,-0.054039,0.019301,0.036822,0.251245,0.330275,0.087033,-0.033777,...,-0.006427,0.052891,0.097862,0.024930,0.046887,0.038735,-0.001248,-0.102616,-0.065137,0.108169,-0.124227,0.081058,-0.053252,0.028108,0.054049,0.008662,-0.054357,-0.098706,-0.073061,-0.055551,-0.109883,-0.082778,-0.021433,-0.063218,0.011016,-0.091817,0.026927,0.063210,0.038636,0.021108,0.060316,0.085380,0.031931,0.034972,0.106943,0.103048,0.129795,0.097058,0.157954,0.101236
4,,,,,,0.417028,0.838182,-0.591311,-0.290887,-0.210493,-0.598864,-0.384169,-0.143606,-0.071280,-0.545616,-0.456587,-0.277623,-0.237558,-0.105643,-0.780476,-0.524950,-0.235017,-0.021804,0.202659,0.093718,0.075444,0.375650,0.130511,0.025969,0.116865,0.370943,0.206158,0.010742,-0.053396,0.015366,0.070914,0.234441,0.400445,0.167685,-0.004668,...,0.031675,0.053910,0.061093,0.039505,0.099599,0.096945,-0.011030,-0.070471,-0.054198,0.061972,-0.137228,0.097878,-0.033010,0.069919,0.105772,-0.004763,-0.086250,-0.170528,-0.093616,-0.104057,-0.154104,-0.027572,0.031771,-0.093205,-0.064401,-0.102284,0.026179,0.091718,0.040950,0.044533,0.128219,0.125188,0.015356,0.049389,0.055561,0.144707,0.185187,0.150461,0.222406,0.124440
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.245477,0.352709,0.338924,0.262683
157,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.702264,0.694960,0.682390
158,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.636613,0.665968
159,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.755859


In [55]:
# Find index of feature columns with correlation greater than 0.7

to_drop = [column for column in upper.columns if any(upper[column] > 0.7)]
print(to_drop)

[1, 4, 6, 10, 14, 19, 40, 44, 48, 72, 94, 99, 105, 140, 158, 160]
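
The filter above only catches positive correlations above 0.7. Strongly negative pairs are equally redundant; for instance, the output above shows -0.752197 between columns 3 and 7, yet 7 is not in `to_drop`. A possible variant that compares absolute values is sketched below; the helper name `correlated_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold=0.7):
    """Columns whose absolute correlation with an earlier column exceeds threshold."""
    corr_abs = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    upper_abs = corr_abs.where(mask)
    return [col for col in upper_abs.columns if (upper_abs[col] > threshold).any()]

# Toy frame with made-up values: "b" is the exact negative of "a",
# and "c" is uncorrelated with both.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo["b"] = -demo["a"]
demo["c"] = [1.0, -1.0, 0.0, -1.0, 1.0]

print(correlated_columns(demo))
```

On the toy frame this returns `['b']`, whereas a positive-only comparison would return an empty list. Whether to use it here is a judgment call, since it would change which of the 161 columns are dropped.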


In [56]:
df1 = df.drop(df.columns[to_drop], axis=1)
print(df1)

     0    2    3    5    7    8    9    ...  152   153  154  155  156  157  159
0    350    3    1   17    3    1    1  ...    1  1952    1    2    2    3    2
1    352    3    1   16    3    1    1  ...    1  1948    2    1    4    3    2
2    611    2    2   16    1    1    1  ...    1  1949    3    2    3    2    3
3     14    2    1   16    2    1    1  ...    1  1948    2    1    4    2    2
4    102    3    2   17    1    1    1  ...    1  1943    1    2    3    3    2
..   ...  ...  ...  ...  ...  ...  ...  ...  ...   ...  ...  ...  ...  ...  ...
413  378    1    2   17    1    1    1  ...    2  1947    1    2    3    2    2
414  454    3    2   17    1    1    1  ...    2  1942    2    1    3    3    3
415   13    3    1   14    2    1    1  ...    1  1949    2    1    4    2    2
416   94    2    1   15    2    2    1  ...    1  1950    1    2    2    1    1
417   17    1    1   15    2    2    1  ...    1  1936    3    2    2    2    3

[418 rows x 145 columns]
