<a href="https://colab.research.google.com/github/seungmin-son/ML_Practice/blob/main/%EA%B8%B0%EA%B3%84%ED%95%99%EC%8A%B5%EB%A1%A0_8%EC%A3%BC%EC%B0%A8%EA%B3%BC%EC%A0%9C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Required Libraries and Combining the Datasets

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.plotting import scatterplotmatrix

In [28]:
red_wine = pd.read_csv('/content/winequality-red.csv', sep=';')
white_wine = pd.read_csv('/content/winequality-white.csv', sep=';')

In [29]:
# Label each row with its wine type: 1 = red, 0 = white
red_wine['color'] = 1
white_wine['color'] = 0

In [30]:
red_wine.shape, white_wine.shape

((1599, 13), (4898, 13))

In [31]:
# Stack the red and white rows; the original row labels are kept, so index values repeat
wine = pd.concat([red_wine, white_wine])
wine.shape

(6497, 13)

In [32]:
wine.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 | 0.246114 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 | 0.430779 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 | 0.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 | 0.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 | 0.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 | 0.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 | 1.0 |


## Data Preprocessing

In [33]:
y = wine['color']                 # target: 1 = red, 0 = white
X = wine.drop(['color'], axis=1)  # features, including 'quality'
X.shape

(6497, 12)

# Applying the Scaler

In [34]:
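from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance.
# Note: the scaler is fit on all rows here, before the train/test split.
scaler = StandardScaler()
scaler.fit(X)

X_scaled = scaler.transform(X)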

# Splitting the Dataset

In [35]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=13
)
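
The scaler in this notebook is fit on all rows before the split, so test-set statistics feed into the preprocessing. Tree splits do not depend on feature scale, so the effect on this particular model is probably negligible, but it matters for scale-sensitive models. A minimal sketch of a leakage-free variant follows. It assumes synthetic stand-in data from `make_classification` (not the wine data), and the names `X_demo`, `y_demo`, and `pipe` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, weights=[0.75, 0.25], random_state=13
)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=13
)

# Inside a pipeline, the scaler is fit on the training rows only.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=13))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.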

# Model Selection

In [36]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score


# Baseline: a decision tree with default hyperparameters
tree = DecisionTreeClassifier(random_state=13)
tree.fit(X_train, y_train)
y_pred_initial = tree.predict(X_test)


print("Initial Model Accuracy: ", accuracy_score(y_test, y_pred_initial))
print("Initial Model Recall: ", recall_score(y_test, y_pred_initial))
print("Initial Model Precision: ", precision_score(y_test, y_pred_initial))
print("Initial Model AUC Score: ", roc_auc_score(y_test, y_pred_initial))
print("Initial Model F1 Score: ", f1_score(y_test, y_pred_initial))

Initial Model Accuracy:  0.9853846153846154
Initial Model Recall:  0.9746031746031746
Initial Model Precision:  0.9654088050314465
Initial Model AUC Score:  0.9817178309564097
Initial Model F1 Score:  0.9699842022116903
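
`roc_auc_score` above receives hard 0/1 predictions, which reduces the ROC curve to a single operating point; the result then equals the mean of recall and specificity. Passing the positive-class probability from `predict_proba` is the more common usage. The sketch below compares the two on synthetic stand-in data. The data and the names `X_demo`, `clf`, `auc_from_labels`, and `auc_from_proba` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=12, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13
)

# A depth limit usually leaves some impure leaves, so the predicted
# probabilities are not all exactly 0 or 1.
clf = DecisionTreeClassifier(max_depth=4, random_state=13).fit(X_tr, y_tr)

auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```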


# Hyperparameter Tuning

In [37]:
param_grid = {
    'criterion': ["gini", "entropy", "log_loss"],
    'max_depth': list(range(1, 10)),
    'min_samples_split': list(range(2, 20))
}
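
To get a feel for the cost of this grid, the number of candidates can be counted with `ParameterGrid` before running the search. The sketch below assumes 3-fold cross-validation, matching the `cv=3` used in the search; `n_candidates` and `n_folds` are illustrative names. In scikit-learn, `"entropy"` and `"log_loss"` both select Shannon information gain for `DecisionTreeClassifier`, so roughly a third of these candidates are likely redundant.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": list(range(1, 10)),          # 9 values: 1..9
    "min_samples_split": list(range(2, 20)),  # 18 values: 2..19
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 9 * 18
n_folds = 3  # assumption: same as cv=3 in the grid search

print(n_candidates, "candidates,", n_candidates * n_folds, "cross-validation fits")
```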

In [38]:
from sklearn.model_selection import GridSearchCV

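# No scoring argument is given, so candidates are ranked by the
# classifier's default score (mean accuracy) over 3 folds.
grid_search = GridSearchCV(tree, param_grid, cv=3)
grid_search.fit(X_train, y_train)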

In [39]:
print("Best Parameters: ", grid_search.best_params_)
print("Best Accuracy: ", grid_search.best_score_)

Best Parameters:  {'criterion': 'entropy', 'max_depth': 8, 'min_samples_split': 5}
Best Accuracy:  0.9869161195060162


# Retraining

In [40]:
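# The metric functions were already imported in the model selection cell.
# GridSearchCV refits the best estimator on the full training set by default
# (refit=True), so the explicit fit below repeats that step.
best_tree = grid_search.best_estimator_
best_tree.fit(X_train, y_train)
y_pred = best_tree.predict(X_test)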

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Recall: ", recall_score(y_test, y_pred))
print("Precision: ", precision_score(y_test, y_pred))
print("AUC Score: ", roc_auc_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))

Accuracy:  0.99
Recall:  0.9714285714285714
Precision:  0.9870967741935484
AUC Score:  0.983683828861494
F1 Score:  0.9792000000000001


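Applying the tuned hyperparameters improved the model's performance on most metrics:

- Accuracy improved from about 0.985 to 0.990
- Precision improved from about 0.965 to 0.987
- AUC Score improved from about 0.982 to 0.984
- F1 Score improved from about 0.970 to 0.979

Recall decreased slightly, from about 0.975 to 0.971.
The tuned settings ('max_depth': 8, 'min_samples_split': 5) constrain the tree more than the defaults (unlimited depth, 'min_samples_split': 2), so the tuned tree is simpler rather than more complex. It probably misclassifies slightly more Positive-class samples in exchange for fewer false positives, which would be consistent with the higher precision.
The two recall values correspond to 307/315 and 306/315, a difference of one test sample, so the drop may also be within noise.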
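
The trade-off between recall and precision can be checked by working backward from the printed scores to confusion-matrix counts. The sketch below is an illustration under assumptions: the counts are inferred from the reported metrics and a 1300-row test set (20% of 6497, rounded up), and are not printed by the notebook. `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Compute the five reported scores from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "precision": precision,
        # AUC computed from hard labels equals the mean of recall and specificity.
        "auc_hard_labels": (recall + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Counts inferred from the printed scores (assumption: not notebook output).
initial = metrics_from_counts(tp=307, fn=8, fp=11, tn=974)
tuned = metrics_from_counts(tp=306, fn=9, fp=4, tn=981)

for name in initial:
    print(f"{name:16s} {initial[name]:.4f} -> {tuned[name]:.4f}")
```

Under these inferred counts, tuning adds one false negative and removes seven false positives, which is why precision rises noticeably while recall barely moves.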
