## NLP06

### Node 19
- Vary the vocabulary size
- Experimental conditions (on the TF-IDF matrix; Node 18 used num_words=10,000)
  - 10,000 words
  - 5,000 words
  - What if all words are used?
- Note
  - LightGBM performed very poorly in this experiment.

### Reflection
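- Models whose performance rose as the vocabulary grew
  - LogisticRegression, SVM
- Models whose performance fell as the vocabulary grew
  - RandomForest, NaiveBayes, and KNN (KNN peaked at 10,000 words and was lowest with all words)
- LogisticRegression scored highest, but XGBoost was close and, above all, was barely affected by vocabulary size
  - XGBoost was therefore used as the baseline for comparison with the deep learning models
- ML/DL comparison
  - The Dense net (MLP) fell short of XGBoost, even after reducing the batch size from 32 to 16 and raising epochs from 10 to 20
  - However, the Dense net (MLP) improved when its input was changed to a per-sentence mean vector
  - The RNN outperformed the MLP but still trailed XGBoost and LogisticRegression (at num_words=10000)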

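### Summary of experimental results
1. Trained eight ML models at three vocabulary sizes (10,000, 5,000, and all words, i.e. num_words=None) and compared accuracy and F1-score (vectorization fixed to TF-IDF)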


#### ML model performance by vocabulary size (Accuracy / F1-score)

| Vocabulary Size | Model              | Accuracy | F1-Score | Time (s) |
|-----------------|--------------------|----------|----------|----------|
| 10000           | LogisticRegression | 0.8108   | 0.8057   | 747      |
|                 | SVM                | 0.7850   | 0.7818   | 136      |
|                 | RandomForest       | 0.6741   | 0.6429   | 2        |
|                 | XGBoost            | 0.7907   | 0.7841   | 217      |
|                 | NaiveBayes         | 0.6567   | 0.5764   | < 1      |
|                 | KNN                | 0.7894   | 0.7891   | < 1      |
|                 | LightGBM           | 0.0614   | 0.0462   | 55       |
|                 | DecisionTree       | 0.6923   | 0.6895   | 8        |
| 5000            | LogisticRegression | 0.8037   | 0.798    | 502      |
|                 | SVM                | 0.7685   | 0.7647   | 149      |
|                 | RandomForest       | 0.7012   | 0.6770   | 2        |
|                 | XGBoost            | 0.7947   | 0.7847   | 206      |
|                 | NaiveBayes         | 0.6732   | 0.6013   | < 1      |
|                 | KNN                | 0.7823   | 0.7709   | < 1      |
|                 | LightGBM           | 0.2364   | 0.2919   | 51       |
|                 | DecisionTree       | 0.6981   | 0.6933   | 7        |
| None (all words)| LogisticRegression | 0.8166   | 0.8114   | 915      |
|                 | SVM                | 0.7916   | 0.7873   | 126      |
|                 | RandomForest       | 0.6545   | 0.6226   | 3        |
|                 | XGBoost            | 0.7947   | 0.7883   | 233      |
|                 | NaiveBayes         | 0.5997   | 0.5046   | < 1      |
|                 | KNN                | 0.7720   | 0.7639   | < 1      |
|                 | LightGBM           | 0.3353   | 0.3037   | 49       |
|                 | DecisionTree       | 0.7057   | 0.7021   | 8        |
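
The trend of each model across vocabulary sizes can be read off the table mechanically. The sketch below copies the accuracy values from the table (ordered 5,000 words, 10,000 words, all words) and labels each model as rising, falling, or mixed; the function name `trend` and the `spread` measure are illustrative choices, not part of the original notebook.

```python
# Accuracy at (5,000 words, 10,000 words, all words), copied from the table above.
accuracy = {
    "LogisticRegression": (0.8037, 0.8108, 0.8166),
    "SVM": (0.7685, 0.7850, 0.7916),
    "RandomForest": (0.7012, 0.6741, 0.6545),
    "XGBoost": (0.7947, 0.7907, 0.7947),
    "NaiveBayes": (0.6732, 0.6567, 0.5997),
    "KNN": (0.7823, 0.7894, 0.7720),
    "LightGBM": (0.2364, 0.0614, 0.3353),
    "DecisionTree": (0.6981, 0.6923, 0.7057),
}


def trend(scores):
    """Label a score sequence as strictly rising, strictly falling, or mixed."""
    pairs = list(zip(scores, scores[1:]))
    if all(a < b for a, b in pairs):
        return "rises"
    if all(a > b for a, b in pairs):
        return "falls"
    return "mixed"


trends = {model: trend(scores) for model, scores in accuracy.items()}
# Spread = best accuracy minus worst accuracy across the three vocabulary sizes.
spread = {model: round(max(s) - min(s), 4) for model, s in accuracy.items()}

for model in accuracy:
    print(f"{model:<20}{trends[model]:<8}spread={spread[model]:.4f}")
```

A small spread, as for XGBoost, indicates a model that is largely insensitive to vocabulary size in these runs.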
    

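2. Comparing deep learning with machine learning
- Compared one ML model (the strongest candidate) against two deep learning models (Dense, RNN) while changing the vectorization method (DTM/TF-IDF, Word2Vec)
  Metrics: accuracy and F1-score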

#### ML/DL model performance by vectorization method (Accuracy / F1-score)

| Vectorization | Model                  | Accuracy | F1-Score | Time (s) |
|---------------|------------------------|----------|----------|----------|
| TF-IDF        | XGBoost                | 0.7907   | 0.7841   | 217      |
| Word2Vec      | Dense NN               | 0.6665   | 0.6363   | 50       |
|               | Dense NN (mean vector) | 0.7569   | 0.7396   | 37       |
|               | RNN                    | 0.7841   | 0.7752   | 68       |
|               | RNN (modified)         | 0.7787   | 0.7622   | 68       |

- For the RNN, adding BatchNormalization and ReduceLROnPlateau made little difference (accuracy 0.7787 versus 0.7841 without them)

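#### 19-01 Machine learning
- Restore the preprocessed (integer-encoded) data to text form for vectorization (same as Node 18)
- Originally set up for 10,000 words; the run recorded below loads all words (num_words=None), and its results match the "None (all words)" rows of the table above

##### Data preprocessing: TF-IDF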

In [1]:
from tensorflow.keras.datasets import reuters
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split = 0.2)

In [3]:
print('Number of training samples: {}'.format(len(x_train)))
print('Number of test samples: {}'.format(len(x_test)))

Number of training samples: 8982
Number of test samples: 2246


In [4]:
word_index = reuters.get_word_index(path="reuters_word_index.json")

In [5]:
index_to_word = { index+3 : word for word, index in word_index.items() }
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
  index_to_word[index]=token

In [6]:
decoded = []
for i in range(len(x_train)):
    t = ' '.join([index_to_word[index] for index in x_train[i]])
    decoded.append(t)

x_train = decoded
print(len(x_train))

8982


In [7]:
decoded_test = []
for i in range(len(x_test)):
    t = ' '.join([index_to_word[index] for index in x_test[i]])
    decoded_test.append(t)

x_test = decoded_test
print(len(x_test))

2246
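
The decoding cells above rely on the Keras convention that indices 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown tokens, so every word index is shifted by 3. The following is a minimal sketch with a toy dictionary standing in for `reuters.get_word_index()`; the names `toy_word_index` and `decode` are illustrative only.

```python
# Toy stand-in for reuters.get_word_index(); the real mapping is much larger.
toy_word_index = {"the": 1, "of": 2, "oil": 3}

# Shift every word index by 3 to make room for the reserved tokens.
index_to_word = {index + 3: word for word, index in toy_word_index.items()}
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
    index_to_word[index] = token


def decode(sequence):
    """Turn a list of integer indices back into a space-joined string."""
    return " ".join(index_to_word[i] for i in sequence)


encoded = [1, 6, 2, 5, 4]  # <sos>, oil, <unk>, of, the
decoded_text = decode(encoded)
print(decoded_text)
```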


In [8]:
# Vectorization: build the DTM with CountVectorizer, then apply TF-IDF weighting
dtmvector = CountVectorizer()

tfidf_transformer = TfidfTransformer()

x_train_dtm = dtmvector.fit_transform(x_train)
x_test_dtm= dtmvector.transform(x_test)

x_train_tfidf = tfidf_transformer.fit_transform(x_train_dtm)
x_test_tfidf = tfidf_transformer.transform(x_test_dtm)

##### Logistic regression
- Using SGDClassifier instead of LogisticRegression is much faster
- from sklearn.linear_model import SGDClassifier
- loss = 'log' or 'log_loss' (the name depends on the scikit-learn version, so take care)
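
One way to handle the version-dependent loss name is to pick it from the installed version string. This is a sketch under the assumption that the version has a plain numeric major.minor prefix (release candidates such as "1.1rc1" would need extra parsing); `logistic_loss_name` is a hypothetical helper, not a scikit-learn function.

```python
def logistic_loss_name(sklearn_version):
    """Return the SGDClassifier loss name for logistic regression."""
    major, minor = (int(part) for part in sklearn_version.split(".")[:2])
    # 'log_loss' was introduced in scikit-learn 1.1; the older 'log' name
    # was deprecated at that point and later removed.
    return "log_loss" if (major, minor) >= (1, 1) else "log"


print(logistic_loss_name("0.24.2"))
print(logistic_loss_name("1.3.0"))
```

In a notebook it could be used as `SGDClassifier(loss=logistic_loss_name(sklearn.__version__))` after `import sklearn`.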

In [9]:
import time
from sklearn.linear_model import LogisticRegression

st = time.time()
lr = LogisticRegression(C=10000, penalty='l2', max_iter=3000)
lr.fit(x_train_tfidf, y_train)
ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 915.2 s
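
The start/stop timing pattern is repeated for every model in this notebook. A small context manager could factor it out; the sketch below uses only the standard library, and the names `timed` and `timings` are illustrative rather than taken from the original code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the elapsed seconds, optionally store them."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")


timings = {}
with timed("toy workload", timings):
    # Stand-in for a call such as model.fit(x_train_tfidf, y_train).
    total = sum(i * i for i in range(100_000))
```

Collecting the timings in one dictionary also makes it easier to build the summary table afterwards.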


In [10]:
predicted = lr.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.8165627782724845


In [11]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.8166
 F1-score : 0.8114
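
Every model in this notebook is scored with `f1_score(..., average='weighted')`. As a sketch of what that average means, the pure-Python function below computes the per-class F1 and weights each class by its number of true samples; `weighted_f1` is a hypothetical re-implementation for illustration, not the scikit-learn code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


toy_score = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
print(round(toy_score, 4))
```

Because large classes dominate the weighted average, a model that ignores rare Reuters topics can still post a weighted F1 close to its accuracy.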


###### Note: SGD
- Performance drops slightly, but the result comes back far faster

In [12]:
import time
from sklearn.linear_model import SGDClassifier

st = time.time()
sgd_lr = SGDClassifier(loss='log', max_iter=3000, tol=1e-3)  # logistic loss; named 'log_loss' in scikit-learn 1.1 and later
sgd_lr.fit(x_train_tfidf, y_train)
ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 3.63 s


In [13]:
predicted = sgd_lr.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.7943009795191451


In [14]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.7943
 F1-score : 0.7707


##### SVM

In [15]:
import time
from sklearn.svm import LinearSVC

st = time.time()

lsvc = LinearSVC(C=1000, penalty='l1', max_iter=3000, dual=False)
lsvc.fit(x_train_tfidf, y_train)

ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 125.87 s


In [16]:
predicted = lsvc.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.7916295636687445


In [17]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.7916
 F1-score : 0.7873


##### RandomForest

In [18]:
from sklearn.ensemble import RandomForestClassifier
import time

st = time.time()

rf_clf = RandomForestClassifier(n_estimators=5, random_state=0)
rf_clf.fit(x_train_tfidf, y_train)

ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 2.46 s


In [19]:
predicted = rf_clf.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.6544968833481746


In [20]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.6545
 F1-score : 0.6226


##### XGBoost

In [21]:
from xgboost import XGBClassifier
import time

st = time.time()

xgb_clf = XGBClassifier(n_estimators=100, max_depth=5, eval_metric='mlogloss')
xgb_clf.fit(x_train_tfidf, y_train)

ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 232.37 s


In [22]:
predicted = xgb_clf.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.794746215494212


In [23]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.7947
 F1-score : 0.7883


##### NaiveBayes

In [24]:
from sklearn.naive_bayes import MultinomialNB
import time

st = time.time()

nb_clf = MultinomialNB()
nb_clf.fit(x_train_tfidf, y_train)

ed = time.time()
print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 0.06 s


In [25]:
predicted = nb_clf.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.5997328584149599


In [26]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.5997
 F1-score : 0.5046


##### KNN
- n_neighbors set to 5

In [27]:
from sklearn.neighbors import KNeighborsClassifier
import time

st = time.time()

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(x_train_tfidf, y_train)

ed = time.time()
print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 0.0 s


In [28]:
predicted = knn_clf.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

Accuracy: 0.7720391807658059


In [29]:
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

 Accuracy : 0.7720
 F1-score : 0.7639


##### LightGBM

In [30]:
from lightgbm import LGBMClassifier
import time

st = time.time()

# Convert the sparse TF-IDF matrices to dense arrays before fitting LightGBM (as done in the original run)

x_train_dense = x_train_tfidf.toarray()
x_test_dense = x_test_tfidf.toarray()

lgbm_clf = LGBMClassifier(objective='multiclass', num_class=46, n_estimators=200, learning_rate=0.1, max_depth=6)
lgbm_clf.fit(x_train_dense, y_train)

ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 49.79 s


In [31]:
predicted = lgbm_clf.predict(x_test_dense)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

Accuracy: 0.3352626892252894
 Accuracy : 0.3353
 F1-score : 0.3037


##### DecisionTree

In [32]:
from sklearn.tree import DecisionTreeClassifier
import time

st = time.time()

dt_clf = DecisionTreeClassifier()
dt_clf.fit(x_train_tfidf, y_train)

ed = time.time()
print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 8.01 s


In [33]:
predicted = dt_clf.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels

# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

Accuracy: 0.7056990204808549
 Accuracy : 0.7057
 F1-score : 0.7021


#### 19-02 ML/DL comparison

##### Data preparation (TF-IDF)

In [35]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=10000, test_split=0.2)

In [36]:
print('Number of training samples: {}'.format(len(x_train)))
print('Number of test samples: {}'.format(len(x_test)))

Number of training samples: 8982
Number of test samples: 2246


In [37]:
word_index = reuters.get_word_index(path="reuters_word_index.json")

In [38]:
index_to_word = { index+3 : word for word, index in word_index.items() }
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
  index_to_word[index]=token

In [39]:
decoded = []
for i in range(len(x_train)):
    t = ' '.join([index_to_word[index] for index in x_train[i]])
    decoded.append(t)

x_train = decoded
print(len(x_train))

8982


In [40]:
decoded_test = []
for i in range(len(x_test)):
    t = ' '.join([index_to_word[index] for index in x_test[i]])
    decoded_test.append(t)

x_test = decoded_test
print(len(x_test))

2246


In [41]:
# Vectorization: build the DTM with CountVectorizer, then apply TF-IDF weighting
dtmvector = CountVectorizer()

tfidf_transformer = TfidfTransformer()

x_train_dtm = dtmvector.fit_transform(x_train)
x_test_dtm= dtmvector.transform(x_test)

x_train_tfidf = tfidf_transformer.fit_transform(x_train_dtm)
x_test_tfidf = tfidf_transformer.transform(x_test_dtm)

##### Data preparation (Word2Vec)

In [46]:
# Vectorization with Word2Vec
from gensim.models import Word2Vec

# Tokenize each sentence on whitespace first. This was not needed for the DTM above
# because CountVectorizer has tokenization built in.
x_train_tokenized = [sentence.split() for sentence in x_train]
x_test_tokenized = [sentence.split() for sentence in x_test]

# Try larger or smaller vector_size values (the original note suggests 512); 256 is used here
model = Word2Vec(sentences = x_train_tokenized, vector_size = 256, window = 5, min_count = 5, workers = 4, sg = 0)
print("Model training complete!")

Model training complete!


In [47]:
# Trained Word2Vec model
w2v_model = model

# Turn one tokenized sentence into a (max_len, vector_size) matrix
def vectorize_sentence(sentence, model, max_len):
    vecs = []
    for word in sentence:
        if word in model.wv:
            vecs.append(model.wv[word])
        else:
            # Out-of-vocabulary words (below min_count) become zero vectors
            vecs.append(np.zeros(model.vector_size))
    # Pad short sentences with zero vectors, truncate long ones
    if len(vecs) < max_len:
        vecs += [np.zeros(model.vector_size)] * (max_len - len(vecs))
    else:
        vecs = vecs[:max_len]
    return np.array(vecs)


# Choose the maximum sentence length with care
x_train_w2v = np.array([vectorize_sentence(s, w2v_model, max_len=100) for s in x_train_tokenized])
x_test_w2v = np.array([vectorize_sentence(s, w2v_model, max_len=100) for s in x_test_tokenized])
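
Since the choice of max_len decides how much text is cut off, it can help to look at the length distribution before fixing it at 100. The sketch below is one possible check, assuming tokenized documents are lists of tokens as in the cell above; `coverage_length` and `truncated_fraction` are hypothetical helpers, and the toy data is made up for illustration.

```python
import math


def coverage_length(token_lists, coverage=0.9):
    """Smallest length that fully covers the given fraction of documents."""
    lengths = sorted(len(tokens) for tokens in token_lists)
    idx = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[idx]


def truncated_fraction(token_lists, max_len):
    """Fraction of documents that would lose tokens at this max_len."""
    return sum(len(tokens) > max_len for tokens in token_lists) / len(token_lists)


# Toy corpus: nine short documents and one very long one.
toy_docs = [["w"] * n for n in (10, 20, 30, 40, 50, 60, 70, 80, 90, 500)]
half_cover = coverage_length(toy_docs, coverage=0.5)
cut_share = truncated_fraction(toy_docs, max_len=100)
print(half_cover, cut_share)
```

On the real data this would be called with `x_train_tokenized` to see how many articles exceed 100 tokens.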


##### XGBoost

In [42]:
from xgboost import XGBClassifier
import time

st = time.time()

xgb_clf = XGBClassifier(n_estimators=100, max_depth=5, eval_metric='mlogloss')
xgb_clf.fit(x_train_tfidf, y_train)

ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Elapsed time: 217.14 s


In [43]:
predicted = xgb_clf.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels
# Evaluation metrics
# predicted holds the predicted labels (y_pred)
acc = accuracy_score(y_test, predicted)
f1 = f1_score(y_test, predicted, average='weighted')

print(f" Accuracy : {acc:.4f}")
print(f" F1-score : {f1:.4f}")

Accuracy: 0.7907390917186109
 Accuracy : 0.7907
 F1-score : 0.7841


##### MLP
- dense layer

In [51]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, LSTM


dense_model = Sequential([
    Flatten(input_shape=(100, 256)),  # (seq_len, embedding_dim)
    Dense(512, activation='relu'),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(46, activation='softmax')   # output size must match the number of classes (46)
])

dense_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
dense_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 25600)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 512)               13107712  
_________________________________________________________________
dropout_4 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 128)               65664     
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 46)                5934      
Total params: 13,179,310
Trainable params: 13,179,310
Non-trainable params: 0
__________________________________________

In [52]:
# This can take a while (originally estimated at about 20 minutes with epochs=10; the run recorded below used epochs=20 and took about 51 s)
import time
st = time.time()
dense_model.fit(x_train_w2v, y_train, epochs=20, batch_size=16, validation_split=0.2)
ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Epoch 1/20
...
Epoch 20/20
Elapsed time: 50.8 s


In [54]:
y_pred_proba = dense_model.predict(x_test_w2v)
y_pred = np.argmax(y_pred_proba, axis=1)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f" Accuracy: {acc:.4f}")
print(f" F1-score: {f1:.4f}")

 Accuracy: 0.6665
 F1-score: 0.6363


##### RNN

In [56]:
# RNN: a model suited to sequential (time-series-like) data

rnn_model = Sequential([
    LSTM(128, input_shape=(100, 256)),  # (seq_len, embedding_dim)
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(46, activation='softmax')   # output size must match the number of classes (46)
])

rnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               197120    
_________________________________________________________________
dropout_6 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 46)                2990      
Total params: 208,366
Trainable params: 208,366
Non-trainable params: 0
_________________________________________________________________
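
The parameter counts in the two model summaries above can be checked by hand, which also shows why the flattened MLP is so much larger than the LSTM model. The sketch below uses the standard formulas for a fully connected layer and an LSTM layer; the helper names are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


# MLP: Flatten(100 x 256) -> Dense(512) -> Dense(128) -> Dense(46)
mlp_total = dense_params(100 * 256, 512) + dense_params(512, 128) + dense_params(128, 46)

# RNN: LSTM(128) on 256-dim inputs -> Dense(64) -> Dense(46)
rnn_total = lstm_params(256, 128) + dense_params(128, 64) + dense_params(64, 46)

print(mlp_total, rnn_total)
```

Almost all of the MLP's parameters sit in the first Dense layer, which has to connect all 25,600 flattened inputs to 512 units.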


In [59]:
# This can take a while (originally estimated at about 20 minutes; the run recorded below took about 68 s)
import time
st = time.time()
rnn_model.fit(x_train_w2v, y_train, epochs=20, batch_size=16, validation_split=0.2)
ed = time.time()

print(f'Elapsed time: {round(ed-st,2)} s')

Epoch 1/20
...
Epoch 20/20
Elapsed time: 68.05 s


In [62]:

y_pred_proba = rnn_model.predict(x_test_w2v)
y_pred = np.argmax(y_pred_proba, axis=1)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f" Accuracy: {acc:.4f}")
print(f" F1-score: {f1:.4f}")

 Accuracy: 0.7841
 F1-score: 0.7752


##### MLP improvement
- Input transformation: per-sentence mean vector

In [65]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import time

# 1. Input transformation: per-sentence mean vector
x_train_mean = x_train_w2v.mean(axis=1)  # shape: (n_samples, 256)
x_test_mean = x_test_w2v.mean(axis=1)    # shape: (n_samples, 256)

# 2. Model definition
dense_model = Sequential([
    Dense(512, activation='relu', input_shape=(256,)),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(46, activation='softmax')
])

# 3. Compile
dense_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
dense_model.summary()

# 4. Train
st = time.time()
dense_model.fit(x_train_mean, y_train, epochs=30, batch_size=16, validation_split=0.2)
ed = time.time()

print(f'Elapsed time: {round(ed - st, 2)} s')

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, 512)               131584    
_________________________________________________________________
dropout_10 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 128)               65664     
_________________________________________________________________
dropout_11 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_16 (Dense)             (None, 46)                5934      
Total params: 203,182
Trainable params: 203,182
Non-trainable params: 0
_________________________________________________________________
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epo

In [66]:
y_pred_proba = dense_model.predict(x_test_mean)
y_pred = np.argmax(y_pred_proba, axis=1)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f" Accuracy: {acc:.4f}")
print(f" F1-score: {f1:.4f}")

 Accuracy: 0.7569
 F1-score: 0.7396
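
Note that the plain mean over axis 1 used in this section also averages in the zero vectors added as padding (and for out-of-vocabulary words), so short articles get scaled-down vectors. A possible variant, which was not run in this notebook, is to average only over non-zero rows. The sketch below assumes NumPy arrays shaped (n_samples, max_len, dim); `masked_mean` is a hypothetical helper.

```python
import numpy as np


def masked_mean(padded):
    """Average over timesteps, ignoring rows that are entirely zero."""
    mask = np.any(padded != 0, axis=-1, keepdims=True)  # (n, T, 1)
    counts = np.maximum(mask.sum(axis=1), 1)            # (n, 1), avoid divide-by-zero
    return (padded * mask).sum(axis=1) / counts


# One toy "sentence": two real word vectors followed by two padding rows.
batch = np.zeros((1, 4, 2))
batch[0, 0] = [2.0, 4.0]
batch[0, 1] = [4.0, 8.0]

plain = batch.mean(axis=1)    # divides by 4, padding included
masked = masked_mean(batch)   # divides by 2, padding ignored
print(plain, masked)
```

Whether this variant would change the reported accuracy is untested here.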


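##### RNN improvement
- BatchNormalization: stabilizes the hidden activations after the LSTM, which can speed up convergence
- EarlyStopping: stops training automatically once validation loss stops improving
- ReduceLROnPlateau: lowers the learning rate when validation loss stalls, so the optimizer can take finer steps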
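
How the two callbacks interact can be traced on a made-up validation-loss curve. The sketch below is a simplified pure-Python simulation, not the Keras implementation: it ignores min_delta, cooldown, and min_lr, and the function name `simulate_callbacks` and the loss values are assumptions for illustration. The default patience values mirror the settings used in the cell that follows.

```python
def simulate_callbacks(val_losses, lr=1e-3, stop_patience=3, lr_patience=2, factor=0.5):
    """Return (epoch, val_loss, lr) per epoch until early stopping triggers."""
    best = float("inf")
    stop_wait = 0  # epochs without improvement, for early stopping
    lr_wait = 0    # epochs without improvement, for the LR schedule
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stop_wait = 0
            lr_wait = 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor  # reduce the learning rate on a plateau
                lr_wait = 0
        history.append((epoch, loss, lr))
        if stop_wait >= stop_patience:
            break  # early stopping
    return history


history = simulate_callbacks([1.0, 0.8, 0.9, 0.85, 0.82, 0.7])
for epoch, loss, lr in history:
    print(epoch, loss, lr)
```

In this toy run the learning rate is halved once at epoch 4 and training stops at epoch 5, so the later improvement at epoch 6 is never reached; with a short patience, early stopping can end training before a lowered learning rate has had time to help.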

In [67]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import time

# Model definition
rnn_model = Sequential([
    LSTM(128, input_shape=(100, 256)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(46, activation='softmax')  # number of classes = 46
])

rnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()

# Callbacks
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2)

# Training
st = time.time()
rnn_model.fit(
    x_train_w2v, y_train,
    epochs=30,
    batch_size=16,
    validation_split=0.2,
    callbacks=[early_stop, lr_scheduler]
)
ed = time.time()

print(f'Elapsed time: {round(ed - st, 2)} s')

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               197120    
_________________________________________________________________
batch_normalization (BatchNo (None, 128)               512       
_________________________________________________________________
dropout_12 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_17 (Dense)             (None, 64)                8256      
_________________________________________________________________
batch_normalization_1 (Batch (None, 64)                256       
_________________________________________________________________
dropout_13 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_18 (Dense)             (None, 46)               

In [68]:

y_pred_proba = rnn_model.predict(x_test_w2v)
y_pred = np.argmax(y_pred_proba, axis=1)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f" Accuracy: {acc:.4f}")
print(f" F1-score: {f1:.4f}")

 Accuracy: 0.7787
 F1-score: 0.7622
