<style>
    rd { color:red; }
    bl { color:blue; }
</style>

# Balancing the data by downsampling the stayed (non-churned) customers
## Preprocessing
| Step           | Columns                                                                                    |
|:---------------|:-------------------------------------------------------------------------------------------|
| Drop columns   | "RowNumber", "CustomerId", "Surname"                                                       |
| Encode columns | "Geography", "Gender"                                                                      |
| Scale columns  | "CreditScore", "Geography", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary" |

### Scaling: StandardScaler

## Hyperparameters
- RandomForest
    - baseline settings

## Conclusion: accuracy got worse
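
One caveat when reading this conclusion: accuracy on the balanced sample is not directly comparable to accuracy on the original data, because the trivial "always predict Stayed" baseline changes with the class ratio. The sketch below works through that arithmetic. The 2,037 churned rows follow from the 4,074-row balanced frame shown later in this notebook; the 10,000-row total for the full file is an assumption, and `majority_baseline` is a hypothetical helper.

```python
# Majority-class baseline accuracy before and after downsampling.
n_exited = 4074 // 2      # the balanced frame has 4074 rows, half of them churned
n_total_full = 10_000     # ASSUMPTION: row count of the full Churn_Modelling.csv


def majority_baseline(n_pos: int, n_neg: int) -> float:
    """Accuracy of a model that always predicts the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


full_baseline = majority_baseline(n_exited, n_total_full - n_exited)
balanced_baseline = majority_baseline(n_exited, n_exited)

print(f"full data baseline     : {full_baseline:.4f}")
print(f"balanced data baseline : {balanced_baseline:.4f}")
```

Under that assumption, a lower accuracy on the balanced data does not by itself mean the model separates the classes worse; per-class precision and recall are the safer comparison.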

In [49]:
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn           as sns

import matplotlib
import matplotlib.font_manager as fm

import re

from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [50]:
def encoding(df:pd.DataFrame, columns:list[str]):
    """Label-encode the given categorical columns."""

    encoder_list = {}
    result_df    = df.copy(deep=True)

    for col_nm in columns:
        encoder           = LabelEncoder()
        result_df[col_nm] = encoder.fit_transform(result_df[col_nm])

        encoder_list[col_nm] = encoder

    return result_df, encoder_list


def scaling(df:pd.DataFrame, columns:list[str]):
    """Standardize the given columns of a DataFrame."""

    scaler    = StandardScaler()
    result_df = df.copy(deep=True)

    result_df[columns] = scaler.fit_transform(result_df[columns])

    return result_df
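
The `scaling` helper fits the scaler on whatever frame it receives, so calling it before the train/test split lets test-set statistics influence the transform. A minimal sketch of a split-aware variant is shown here; `scaling_train_test` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scaling_train_test(train_df: pd.DataFrame, test_df: pd.DataFrame, columns: list[str]):
    """Hypothetical variant of scaling(): fit on train only, apply to both."""
    scaler = StandardScaler()
    train_out = train_df.copy(deep=True)
    test_out = test_df.copy(deep=True)

    # Mean and standard deviation come from the training rows only.
    train_out[columns] = scaler.fit_transform(train_out[columns])
    # The test rows reuse those statistics instead of contributing their own.
    test_out[columns] = scaler.transform(test_out[columns])

    return train_out, test_out, scaler


# Toy data, made up for illustration.
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "HasCrCard": [1, 0, 1]})
test = pd.DataFrame({"Age": [50.0], "HasCrCard": [0]})

train_s, test_s, fitted = scaling_train_test(train, test, ["Age"])
print(train_s)
print(test_s)
```

Returning the fitted scaler also makes it possible to apply the same transform to new data later.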

## Data loading and preprocessing

In [51]:
######################################### Load data
df     = pd.read_csv("../data/Churn_Modelling.csv")
# inputs = df.drop(columns=["Exited"], axis=1)
# labels = df["Exited"]

exited = df[df["Exited"] == 1]
stayed = df[df["Exited"] == 0]
stayed = stayed.sample(n=len(exited), replace=False)

inputs = pd.concat([stayed, exited]).drop(columns=["Exited"], axis=1)
labels = pd.concat([stayed, exited])["Exited"]

####################################### Preprocess data
_input = inputs.drop(columns=["RowNumber", "CustomerId", "Surname"], axis=1)     # drop columns (RowNumber, CustomerId, Surname)
_input, encoders = encoding(_input, ["Geography", "Gender"])            # encode categorical string columns
_input = scaling(_input, ["CreditScore", "Geography", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary"])   # standardize numeric columns
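
The `sample` call in this cell has no seed, so a different subset of stayed customers is drawn on every run and the scores below shift with it. A minimal sketch of a seeded version follows; `downsample_majority`, its `seed` argument, and the toy frame are hypothetical.

```python
import pandas as pd


def downsample_majority(df: pd.DataFrame, label_col: str, seed: int) -> pd.DataFrame:
    """Hypothetical helper: cut every class down to the size of the smallest one."""
    n_min = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_min, replace=False, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


# Toy data, made up for illustration: 8 stayed rows, 2 exited rows.
toy = pd.DataFrame({"Age": range(10), "Exited": [0] * 8 + [1] * 2})

balanced_a = downsample_majority(toy, "Exited", seed=0)
balanced_b = downsample_majority(toy, "Exited", seed=0)

print(balanced_a["Exited"].value_counts().to_dict())
print(balanced_a.index.equals(balanced_b.index))
```

With the seed fixed, repeated runs pick the same rows, which makes it easier to tell a hyperparameter effect from sampling noise.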

In [52]:
print(_input.info(), "\n")

<class 'pandas.core.frame.DataFrame'>
Index: 4074 entries, 2067 to 9998
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      4074 non-null   float64
 1   Geography        4074 non-null   float64
 2   Gender           4074 non-null   int64  
 3   Age              4074 non-null   float64
 4   Tenure           4074 non-null   float64
 5   Balance          4074 non-null   float64
 6   NumOfProducts    4074 non-null   float64
 7   HasCrCard        4074 non-null   int64  
 8   IsActiveMember   4074 non-null   int64  
 9   EstimatedSalary  4074 non-null   float64
dtypes: float64(7), int64(3)
memory usage: 350.1 KB
None 



In [53]:
print(_input.value_counts(), "\n")

CreditScore  Geography  Gender  Age        Tenure     Balance    NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
 2.067609     1.531443  1        2.823300   1.735964  -0.193963  -0.759500      1          0               -0.054465          1
-3.038817    -0.959215  0       -0.097288  -1.715631   0.478729  -0.759500      1          1                1.241588          1
                                 1.786962  -0.680152  -1.322525  -0.759500      0          0                0.222458          1
                        1        0.939049   1.735964  -1.322525  -0.759500      1          1                0.431902          1
              0.286114  1       -0.191501  -1.715631   0.456587   0.731682      0          0                0.393215          1
                                                                                                                             ..
-2.773283     0.286114  0       -1.133626  -0.334993   0.542736   3.714045      1          0                0

## Train/test split

In [42]:
######################################### Split the data. Check/improve performance with random_state fixed first, then also look at results with it unset.
train_x, test_x, train_y, test_y = train_test_split(_input, labels, stratify=labels)
print("Train data shape : ", train_x.shape, train_y.shape)
print("Test data shape  : ",  test_x.shape,  test_y.shape, "\n")

Train data shape :  (3055, 10) (3055,)
Test data shape  :  (1019, 10) (1019,) 
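
The comment in this cell mentions fixing `random_state`, but the call itself leaves it unset. A minimal sketch of the seeded, stratified split is shown here on synthetic arrays with the same row count as the balanced frame; the seed value 42 and the synthetic `X` and `y` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 4074 rows, two balanced classes of 2037 each.
X = np.arange(4074 * 2).reshape(4074, 2)
y = np.array([0] * 2037 + [1] * 2037)

SEED = 42  # arbitrary value, chosen only for this sketch

# Same seed, same stratification: both calls should return the same partition.
split_a = train_test_split(X, y, stratify=y, random_state=SEED)
split_b = train_test_split(X, y, stratify=y, random_state=SEED)

train_x, test_x, train_y, test_y = split_a
print("train:", train_x.shape, "test:", test_x.shape)
print("test class counts:", np.bincount(test_y))
```

The default `test_size` of 0.25 is what produces the 3055/1019 split seen in the output above.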



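## Model training and evaluation - RandomForestClassifier (n_estimators=300, max_depth=10, min_samples_leaf=10)
Compared to the baseline, precision <rd>went up</rd>, recall <bl>went down</bl>, and the F1 score <bl>went down</bl>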

In [54]:
######################################### Train the model
model = RandomForestClassifier(n_estimators=300, max_depth=10, min_samples_leaf=10)
model.fit(train_x, train_y)


######################################### Evaluate the model
predicted = model.predict(test_x)
print(classification_report(test_y, predicted, target_names=["Stayed", "Exited"]))

              precision    recall  f1-score   support

      Stayed       0.75      0.77      0.76       510
      Exited       0.76      0.74      0.75       509

    accuracy                           0.76      1019
   macro avg       0.76      0.76      0.76      1019
weighted avg       0.76      0.76      0.76      1019
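
As a sanity check on the report above, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded values printed in the table; `f1_from` is a hypothetical helper, and because the inputs are already rounded to two decimals the recomputed value only approximates what scikit-learn calculated from raw counts.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the RandomForest report above.
reported = {
    "Stayed": (0.75, 0.77, 0.76),
    "Exited": (0.76, 0.74, 0.75),
}

recomputed = {name: round(f1_from(p, r), 2) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    print(f"{name}: recomputed={recomputed[name]:.2f} reported={f1:.2f}")
```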



## Model training and evaluation - XGBClassifier (n_estimators=500, learning_rate=0.005, max_depth=5)
No meaningful change compared to the baseline

In [55]:
from xgboost import XGBClassifier

######################################### Train the model
model = XGBClassifier(n_estimators=500, learning_rate=0.005, max_depth=5)             # 0.85
model.fit(train_x, train_y)


######################################### Evaluate the model
predicted = model.predict(test_x)
print(classification_report(test_y, predicted, target_names=["Stayed", "Exited"]))

              precision    recall  f1-score   support

      Stayed       0.74      0.76      0.75       510
      Exited       0.76      0.73      0.75       509

    accuracy                           0.75      1019
   macro avg       0.75      0.75      0.75      1019
weighted avg       0.75      0.75      0.75      1019

