# Tuning RandomForest with GridSearchCV
## Preprocessing
| Step                 | Columns                                                                                    |
|:---------------------|:-------------------------------------------------------------------------------------------|
| Drop columns         | "RowNumber", "CustomerId", "Surname"                                                       |
| Label-encode columns | "Geography", "Gender"                                                                      |
| Scale columns        | "CreditScore", "Geography", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary" |

### Scaling: StandardScaler

## Hyperparameters
- RandomForest
    - param_grid = {
          'n_estimators'      : [100, 300, 500],
          'max_depth'         : [5, 8, 10, None],
          'min_samples_split' : [10, 30, 50],
          'max_features'      : ['sqrt', 'log2', None],
          'criterion'         : ["gini", "entropy", "log_loss"],
          'min_samples_leaf'  : [10, 30, 50, 80]
      }
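
As a rough sanity check on the cost of this grid, the number of candidate combinations can be counted directly. This is a minimal sketch using only the standard library; `cv_folds = 3` is taken from the `cv=3` argument passed to `GridSearchCV` later in the notebook.

```python
import math

# Same grid as listed above.
param_grid = {
    'n_estimators'      : [100, 300, 500],
    'max_depth'         : [5, 8, 10, None],
    'min_samples_split' : [10, 30, 50],
    'max_features'      : ['sqrt', 'log2', None],
    'criterion'         : ["gini", "entropy", "log_loss"],
    'min_samples_leaf'  : [10, 30, 50, 80]
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

# One candidate per combination of values; each candidate is fitted once per fold.
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 1296 3888
```

With the default `refit=True`, one more fit on the full training set follows. In scikit-learn, `"entropy"` and `"log_loss"` both select the Shannon information gain, so roughly a third of these candidates probably repeat another third.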

## Conclusion: no meaningful change in performance (best CV accuracy 0.8596 vs. 0.86 with hand-picked parameters)

In [2]:
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn           as sns

import matplotlib
import matplotlib.font_manager as fm

import re

from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [3]:
def encoding(df:pd.DataFrame, columns:list[str]):
    """Label-encode the given categorical columns; return the encoded copy and the fitted encoders."""

    encoder_list = {}
    result_df    = df.copy(deep=True)

    for col_nm in columns:
        encoder           = LabelEncoder()
        result_df[col_nm] = encoder.fit_transform(result_df[col_nm])

        encoder_list[col_nm] = encoder

    return result_df, encoder_list


def scaling(df:pd.DataFrame, columns:list[str]):
    """Standardize the given columns of a DataFrame and return the scaled copy."""

    scaler    = StandardScaler()
    result_df = df.copy(deep=True)

    result_df[columns] = scaler.fit_transform(result_df[columns])

    return result_df

## Data loading and preprocessing

In [4]:
######################################### Load data
df     = pd.read_csv("data/Churn_Modelling.csv")
inputs = df.drop(columns=["Exited"])
labels = df["Exited"]


######################################### Preprocess data
_input = inputs.drop(columns=["RowNumber", "CustomerId", "Surname"])     # drop identifier columns (RowNumber, CustomerId, Surname)
_input, encoders = encoding(_input, ["Geography", "Gender"])            # label-encode the categorical string columns
_input = scaling(_input, ["CreditScore", "Geography", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary"])  # standardize the listed columns
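
The `scaling` helper fits `StandardScaler` on every row before the train/test split, so test rows contribute to the scaling statistics. The sketch below shows one possible alternative, assuming one wanted the scaler fitted on training rows only. It wraps the scaler and the model in a scikit-learn `Pipeline`. The `demo_*` data is synthetic and hypothetical, not the churn dataset, and only a subset of the column names is reused for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); the notebook itself reads Churn_Modelling.csv.
rng = np.random.default_rng(0)
demo_x = pd.DataFrame({
    "CreditScore": rng.integers(350, 850, size=200),
    "Age": rng.integers(18, 90, size=200),
    "HasCrCard": rng.integers(0, 2, size=200),
})
demo_y = pd.Series(rng.integers(0, 2, size=200), name="Exited")

# Split first, so the scaler never sees the test rows.
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_x, demo_y, stratify=demo_y, random_state=0
)

scale_cols = ["CreditScore", "Age"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), scale_cols)],
    remainder="passthrough",  # leave the remaining columns unchanged
)
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# fit() learns the scaling statistics from the training rows only;
# predict() reuses those statistics on the test rows.
pipe.fit(demo_train_x, demo_train_y)
demo_pred = pipe.predict(demo_test_x)
```

Tree-based models are largely insensitive to monotonic feature scaling, so this is unlikely to change the scores reported here. It matters more if the same preprocessing is later reused with a scale-sensitive model.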

In [5]:
_input.info()    # info() prints directly and returns None, so it is not wrapped in print()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  float64
 1   Geography        10000 non-null  float64
 2   Gender           10000 non-null  int64  
 3   Age              10000 non-null  float64
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  float64
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
dtypes: float64(7), int64(3)
memory usage: 781.4 KB

In [6]:
print(_input.value_counts(), "\n")

CreditScore  Geography  Gender  Age        Tenure     Balance    NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
 2.063884     1.515067  1        3.058772   1.724464  -0.110230  -0.911583      1          0               -0.038201          1
-3.109504    -0.901886  0        0.102810  -1.733315   0.554746  -0.911583      1          1                1.256024          1
                                 2.009882  -0.695982  -1.225848  -0.911583      0          0                0.238332          1
                        1        1.151700   1.724464  -1.225848  -0.911583      1          1                0.447481          1
              0.306591  1        0.007457  -1.733315   0.532858   0.807737      0          0                0.408848          1
                                                                                                                             ..
-2.581819     0.306591  1        0.865639   1.032908   0.827869  -0.911583      1          0                1

## Train/test split

In [7]:
######################################### Split data. First check/improve performance with random_state fixed, then also check with it unset.
train_x, test_x, train_y, test_y = train_test_split(_input, labels, stratify=labels)
print("Train data shape : ", train_x.shape, train_y.shape)
print("Test data shape : ",  test_x.shape,  test_y.shape, "\n")

Train data shape :  (7500, 10) (7500,)
Test data shape :  (2500, 10) (2500,) 
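
The comment in the cell above mentions fixing `random_state`, while the call itself leaves it unset, so each run produces a different split. The following is a minimal sketch of what fixing the seed does, using small synthetic arrays (hypothetical, not the churn data) and an arbitrary seed of 42.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny synthetic data set (hypothetical): 20 rows, 15 of class 0 and 5 of class 1.
demo_x = np.arange(40).reshape(20, 2)
demo_y = np.array([0] * 15 + [1] * 5)

def split(seed):
    """Stratified split with a fixed seed; returns train_x, test_x, train_y, test_y."""
    return train_test_split(demo_x, demo_y, stratify=demo_y, random_state=seed)

first = split(42)
second = split(42)

# With the same seed, the same rows end up in the test set on every call.
same_split = np.array_equal(first[1], second[1])
print(same_split)  # True
```

Passing the same `random_state` to `RandomForestClassifier` as well would make the whole run repeatable, which makes it easier to attribute a score change to a parameter change rather than to a different split.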



## RandomForestClassifier with hand-picked parameters
max_depth=10, max_features='sqrt', min_samples_split=2, n_estimators=300

In [9]:
######################################### Train the model
model = RandomForestClassifier(max_depth=10, max_features='sqrt', min_samples_split=2, n_estimators=300)
model.fit(train_x, train_y)


######################################### Evaluate the model
predicted = model.predict(test_x)
print(classification_report(test_y, predicted, target_names=["Stayed", "Exited"]))

              precision    recall  f1-score   support

      Stayed       0.87      0.98      0.92      1991
      Exited       0.82      0.42      0.56       509

    accuracy                           0.86      2500
   macro avg       0.84      0.70      0.74      2500
weighted avg       0.86      0.86      0.85      2500
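
To put the report in context, a few figures can be rederived from it. This sketch only reuses the rounded numbers printed above, so its results are approximate.

```python
# Figures copied from the classification report above (already rounded to two decimals).
precision_exited = 0.82
recall_exited = 0.42
support_stayed = 1991
support_exited = 509

# F1 is the harmonic mean of precision and recall.
f1_exited = 2 * precision_exited * recall_exited / (precision_exited + recall_exited)

# Accuracy of a trivial model that always predicts "Stayed".
majority_baseline = support_stayed / (support_stayed + support_exited)

# Approximate number of actual churners the model failed to flag.
missed_exited = round((1 - recall_exited) * support_exited)

print(f"F1 (Exited): {f1_exited:.2f}")                      # 0.56
print(f"Majority-class accuracy: {majority_baseline:.4f}")  # 0.7964
print(f"Approx. churners missed: {missed_exited}")          # 295
```

The 0.86 accuracy therefore sits about six points above always predicting "Stayed", while well over half of the customers who actually exited are missed.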



## Attempting hyperparameter tuning with GridSearchCV

In [8]:
# Tune RandomForestClassifier hyperparameters with GridSearchCV
param_grid = {
    'n_estimators'      : [100, 300, 500],
    'max_depth'         : [5, 8, 10, None],
    'min_samples_split' : [10, 30, 50],
    'max_features'      : ['sqrt', 'log2', None],
    'criterion'         : ["gini", "entropy", "log_loss"],
    'min_samples_leaf'  : [10, 30, 50, 80]
}
gridSCV = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=3)
gridSCV.fit(train_x, train_y)

print(gridSCV.best_score_, gridSCV.best_params_)

0.8596 {'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 10, 'min_samples_split': 30, 'n_estimators': 300}
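
The `best_score_` above is a mean cross-validated accuracy on the training folds, whereas the hand-picked model was scored on the held-out test set. A like-for-like comparison would evaluate the tuned model on that same test set. The sketch below shows the pattern on synthetic, imbalanced data (hypothetical, generated with `make_classification`) and a deliberately small grid. Using `scoring="f1"` is an assumption made here to target the minority class; the notebook itself uses the default accuracy scoring.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical): about 80% class 0 and 20% class 1.
demo_x, demo_y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
tr_x, te_x, tr_y, te_y = train_test_split(
    demo_x, demo_y, stratify=demo_y, random_state=0
)

# Small illustrative grid, far smaller than the one used in the notebook.
small_grid = {"max_depth": [5, None], "min_samples_leaf": [1, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="f1",  # assumption: optimise F1 of the positive class, not accuracy
)
search.fit(tr_x, tr_y)

# best_estimator_ is refitted on the whole training set (refit=True by default),
# so it can be scored directly on the held-out test rows.
test_pred = search.best_estimator_.predict(te_x)
print(search.best_params_)
print(classification_report(te_y, test_pred, target_names=["Stayed", "Exited"]))
```

On the real data, the equivalent step would be calling `gridSCV.best_estimator_.predict(test_x)` and printing the same report as for the hand-picked model.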
