# 2020707035 박시언 lab6

### introduction
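1. Preprocess the data in the same way as in the previous assignments.
2. Tune the hyperparameters of LogisticRegression, SVC, RandomForestClassifier, and GradientBoostingClassifier with Bayesian optimization to find the best parameters for each model.
3. Evaluate each model on the test set using its best parameters.
4. Use ROC-AUC as the scoring metric during the search.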
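
Since the search is scored with ROC-AUC, it may help to recall what the metric measures: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal sketch of that definition in plain Python. The helper name `pairwise_roc_auc` and the toy labels and scores are illustrative assumptions and are not part of the lab code, which relies on `sklearn.metrics.roc_auc_score`.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration; a tie counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (made up): two negatives followed by two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(labels, scores)
print(f"ROC-AUC: {auc:.2f}")
```

The quadratic loop is only meant to make the definition visible; it is not an efficient way to compute the metric on real data.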

----------------------------------------
### conclusion
| Model | Accuracy | F1-Score | ROC-AUC |
|---|---|---|---|
| LogisticRegression | 0.8827 | 0.8205 | 0.8975 |
| SVC | 0.8324 | 0.7541 | 0.8996 |
| RandomForestClassifier | 0.8547 | 0.7719 | 0.9032 |
| GradientBoostingClassifier | 0.8492 | 0.7692 | 0.9007 |

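Looking at overall performance, I would recommend LogisticRegression: it has the highest accuracy (0.8827) and F1-score (0.8205), while the ROC-AUC values of all four models lie within about 0.006 of one another.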
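
One way to make the comparison explicit is to pick the best model per metric. The sketch below copies the test-set figures reported in this section into a dictionary (the variable names are illustrative assumptions) and prints the leader for each metric. It only restates the reported numbers and adds no new results.

```python
# Test-set metrics copied from the results listed in this section.
reported = {
    "LogisticRegression": {"Accuracy": 0.8827, "F1-Score": 0.8205, "ROC-AUC": 0.8975},
    "SVC": {"Accuracy": 0.8324, "F1-Score": 0.7541, "ROC-AUC": 0.8996},
    "RandomForestClassifier": {"Accuracy": 0.8547, "F1-Score": 0.7719, "ROC-AUC": 0.9032},
    "GradientBoostingClassifier": {"Accuracy": 0.8492, "F1-Score": 0.7692, "ROC-AUC": 0.9007},
}

# For every metric, find the model with the highest reported value.
best_per_metric = {}
for metric in ("Accuracy", "F1-Score", "ROC-AUC"):
    best_per_metric[metric] = max(reported, key=lambda name: reported[name][metric])
    winner = best_per_metric[metric]
    print(f"{metric}: {winner} ({reported[winner][metric]:.4f})")
```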


In [1]:
'''

given

Logistic Regression, SVM, Random Forest, Gradient Boosting
'''
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Cabin'] = data['Cabin'].fillna('N')
data['Embarked'] = data['Embarked'].fillna('N')

# Numeric copy of Sex (0 = male, 1 = female); it is not dropped later, so it stays in X
data.loc[data["Sex"] == "male", "Sex_encode"] = 0
data.loc[data["Sex"] == "female", "Sex_encode"] = 1
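
Chained calls of the form `df[col].fillna(value, inplace=True)` act on an intermediate object, and pandas states that this pattern stops working in pandas 3.0. As a minimal sketch on a made-up toy frame (the frame and its values are assumptions, not the Titanic data), the two replacement patterns suggested by pandas look like this:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps similar to the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Cabin": ["C85", None, None],
})

# Option 1: assign the filled column back to the frame.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Option 2: fill one or more columns with a column -> value mapping.
toy = toy.fillna({"Cabin": "N"})

print(toy)
```

Both options write the result back to the frame explicitly, so they do not depend on whether the intermediate column object is a view or a copy.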

In [3]:
from sklearn.preprocessing import LabelEncoder

# Fill missing values
def fillna(df):
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df

# Drop features the models do not need
def drop_features(df):
    df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
    return df

# Label-encode the categorical features
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]  # keep only the first letter of the cabin code
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])
    return df

# Apply the preprocessing functions defined above
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df


Y = data['Survived']
X = data.drop('Survived', axis=1)
X = transform_features(X)
# log(1 + x) transform of Fare and Age
X['Fare'] = np.log(X['Fare'] + 1)
X['Age'] = np.log(X['Age'] + 1)
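
To illustrate what the preprocessing cell does to a single column, the sketch below runs the same two steps on made-up values: it keeps the first letter of each cabin code, label-encodes it, and applies the log(1 + x) transform. The sample cabins and fares are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up cabin values; "N" stands for a missing cabin, as in the cell above.
cabins = pd.Series(["C85", "N", "E46", "C123"])
deck = cabins.str[:1]  # keep only the deck letter: C, N, E, C

encoder = LabelEncoder()
codes = encoder.fit_transform(deck)

print(list(encoder.classes_))  # classes are stored in sorted order
print(codes.tolist())          # integer code assigned to each row

# np.log1p(x) computes log(1 + x), the same transform applied to Fare and Age.
fares = np.array([0.0, 7.25, 512.3292])
same = np.allclose(np.log1p(fares), np.log(fares + 1))
print(same)
```

Label encoding assigns integers by sorted class order, so the codes imply an ordering that the categories themselves do not have. Whether that matters depends on the model being trained.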

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=11
)

from skopt import BayesSearchCV
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
import warnings

warnings.filterwarnings("ignore")  # suppress all warnings

# Define the search space for each model
param_spaces = {
    'LogisticRegression': {
        'C': (1e-6, 1e+6, 'log-uniform'),  # inverse regularization strength
        'penalty': ['l2'],  # L2 regularization
        'solver': ['lbfgs'],  # solver
    },
    'SVC': {
        'C': (1e-3, 1e+3, 'log-uniform'),  # inverse regularization strength
        'kernel': ['linear', 'rbf'],  # kernel type
        'gamma': (1e-4, 1e+1, 'log-uniform'),  # kernel coefficient (ignored by the linear kernel)
    },
    'RandomForestClassifier': {
        'n_estimators': (10, 500),  # number of trees
        'max_depth': (3, 20),  # maximum tree depth
        'min_samples_split': (2, 10),  # minimum samples required to split a node
    },
    'GradientBoostingClassifier': {
        'learning_rate': (0.01, 1.0, 'log-uniform'),  # learning rate
        'n_estimators': (10, 500),  # number of boosting stages
        'max_depth': (3, 20),  # maximum tree depth
    }
}

models = {
    'LogisticRegression': LogisticRegression(),
    'SVC': SVC(probability=True),
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier()
}

# Run Bayesian optimization and store the results
best_models = {}
results = {}
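
The `'log-uniform'` entries in `param_spaces` ask for values that are spread evenly across orders of magnitude rather than evenly across the raw range. As a rough sketch of that idea, assuming nothing about how scikit-optimize implements its own sampler, a log-uniform draw can be written with the standard library alone (the helper name `sample_log_uniform` is hypothetical):

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
# Same bounds as the LogisticRegression 'C' range in the search space above.
samples = [sample_log_uniform(1e-6, 1e6, rng) for _ in range(10_000)]

# 1.0 is the midpoint of [1e-6, 1e6] in log space, so about half the draws fall below it.
below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"min={min(samples):.2e}, max={max(samples):.2e}")
print(f"share below 1.0: {below_one:.2f}")
```

With a plain uniform prior on the same range, almost every draw would land above 1.0, which is why a log scale is the usual choice for parameters such as `C`, `gamma`, and `learning_rate`.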

In [5]:
for model_name, model in models.items():
    print(f"Optimizing {model_name}...")
    search = BayesSearchCV(
        estimator=model,
        search_spaces=param_spaces[model_name],
        n_iter=30,  # number of search iterations
        cv=3,  # 3-fold cross-validation
        scoring='roc_auc',  # optimize for ROC-AUC
        n_jobs=-1,  # use all CPU cores
        random_state=42
    )
    search.fit(X_train, y_train)
    best_models[model_name] = search.best_estimator_
    results[model_name] = {
        'Best Params': search.best_params_,
        'Best Score': search.best_score_
    }
    print(f"Best Params for {model_name}: {search.best_params_}")
    print(f"Best ROC-AUC Score for {model_name}: {search.best_score_}")

# Print the optimization results
for model_name, result in results.items():
    print(f"\nModel: {model_name}")
    print(f"Best Params: {result['Best Params']}")
    print(f"Best ROC-AUC: {result['Best Score']:.4f}")

# Evaluate the best models on the test set
print("\nEvaluating best models on test set...\n")
for model_name, model in best_models.items():
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else "N/A"

    print(f"Model: {model_name}")
    print(f"Accuracy: {acc:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}" if roc_auc != "N/A" else "ROC-AUC: Not available")
    print("-" * 40)


Optimizing LogisticRegression...
Best Params for LogisticRegression: OrderedDict({'C': 0.344330407655008, 'penalty': 'l2', 'solver': 'lbfgs'})
Best ROC-AUC Score for LogisticRegression: 0.8517023376788018
Optimizing SVC...
Best Params for SVC: OrderedDict({'C': 86.95535347355771, 'gamma': 0.0008240929829517187, 'kernel': 'rbf'})
Best ROC-AUC Score for SVC: 0.8458517726956014
Optimizing RandomForestClassifier...
Best Params for RandomForestClassifier: OrderedDict({'max_depth': 16, 'min_samples_split': 10, 'n_estimators': 500})
Best ROC-AUC Score for RandomForestClassifier: 0.8681626307348397
Optimizing GradientBoostingClassifier...
Best Params for GradientBoostingClassifier: OrderedDict({'learning_rate': 0.19860463029029454, 'max_depth': 11, 'n_estimators': 362})
Best ROC-AUC Score for GradientBoostingClassifier: 0.8433717385580518

Model: LogisticRegression
Best Params: OrderedDict({'C': 0.344330407655008, 'penalty': 'l2', 'solver': 'lbfgs'})
Best ROC-AUC: 0.8517

Model: SVC
Best Param