<a href="https://colab.research.google.com/github/wheemin-2/25-1-ESAA/blob/main/OB1_mini_project2_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
import nltk

# Load data
- Load `train.csv` and `test_x.csv`

In [None]:
train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ESAA/25-1 OB/mini project 2/train.csv", encoding='utf-8')
test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ESAA/25-1 OB/mini project 2/test_x.csv", encoding='utf-8')

In [None]:
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True



---



In [None]:
print(train.columns)

Index(['index', 'text', 'author'], dtype='object')


In [None]:
print(train['author'].unique())

[3 2 1 4 0]


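## Feature engineering

1. **`num_words`**  
   - Split the text on whitespace and count the resulting words  

2. **`num_unique_words`**  
   - Count the distinct words in the text after removing duplicates  

3. **`num_chars`**  
   - Total length of the text string (number of characters)

4. **`num_punctuations`**  
   - Count every character that appears in `string.punctuation`

5. **`num_words_upper`**  
   - Split the text into words and count the words written entirely in upper case

6. **`num_words_title`**  
   - Split the text into words and count the title-case words (only the first letter capitalized)

7. **`mean_word_len`**  
   - Split the text into words and compute the mean word length

8. **`num_stopwords`**  
   - Count the English stopwords in the text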
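
As a small worked example of the count-based features listed above, the sketch below computes them for a single sentence using only the standard library. The helper name `basic_text_stats` and the sample sentence are illustrative assumptions and do not come from the notebook. Returning `0.0` for an empty text is also a choice made here for the sketch.

```python
import string

def basic_text_stats(text):
    """Compute simple count-based features for one text (illustrative helper)."""
    words = text.split()  # whitespace split, so punctuation stays attached to words
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_punctuations": sum(ch in string.punctuation for ch in text),
        "num_words_upper": sum(w.isupper() for w in words),
        "num_words_title": sum(w.istitle() for w in words),
        # Assumption of this sketch: an empty text gets a mean word length of 0.0
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

stats = basic_text_stats("Hello WORLD, hello again!")
print(stats)
```

Because the split is on whitespace, tokens such as `WORLD,` keep their punctuation. That makes `Hello` and `hello` distinct words and adds the comma to the length of `WORLD,`.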

In [None]:
## Number of words in the text ##
train["num_words"] = train["text"].apply(lambda x: len(str(x).split()))
test["num_words"] = test["text"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train["num_unique_words"] = train["text"].apply(lambda x: len(set(str(x).split())))
test["num_unique_words"] = test["text"].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train["num_chars"] = train["text"].apply(lambda x: len(str(x)))
test["num_chars"] = test["text"].apply(lambda x: len(str(x)))

## Number of punctuations in the text ##
import string
train["num_punctuations"] =train['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test["num_punctuations"] =test['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Number of upper case words in the text ##
train["num_words_upper"] = train["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test["num_words_upper"] = test["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

## Number of title case words in the text ##
train["num_words_title"] = train["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test["num_words_title"] = test["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

## Average length of the words in the text ##
train["mean_word_len"] = train["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test["mean_word_len"] = test["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

In [None]:
## Number of stopwords in the text ##
eng_stopwords = [
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    "around", "as", "at", "back", "be", "became", "because", "become",
    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill", "both",
    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill",
    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    "something", "sometime", "sometimes", "somewhere", "still", "such",
    "system", "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter", "thereby",
    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    "third", "this", "those", "though", "three", "through", "throughout",
    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself",
    "yourselves"]
train["num_stopwords"] = train["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
test["num_stopwords"] = test["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

In [None]:
# Clean text
from tqdm import tqdm
tqdm.pandas()
punctuation = ['.', '..', '...', ',', ':', ';', '-', '*', '"', '!', '?']
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def clean_text(x):
    # str methods return new strings, so the results must be reassigned
    x = x.lower()
    for p in punctuation:
        x = x.replace(p, '')
    return x

train['text_cleaned'] = train['text'].apply(lambda x: clean_text(x))
test['text_cleaned'] = test['text'].apply(lambda x: clean_text(x))

def extract_features(df):
    df['len'] = df['text'].apply(lambda x: len(x))
    df['n_words'] = df['text'].apply(lambda x: len(x.split(' ')))
    # str.count takes a regex, so metacharacters are escaped in raw strings
    df['n_.'] = df['text'].str.count(r'\.')
    df['n_...'] = df['text'].str.count(r'\.\.\.')  # literal "..."
    df['n_,'] = df['text'].str.count(',')
    df['n_:'] = df['text'].str.count(':')
    df['n_;'] = df['text'].str.count(';')
    df['n_-'] = df['text'].str.count('-')
    df['n_?'] = df['text'].str.count(r'\?')
    df['n_!'] = df['text'].str.count('!')
    df['n_\''] = df['text'].str.count("'")
    df['n_"'] = df['text'].str.count('"')

    # First words in a sentence
    df['n_The '] = df['text'].str.count('The ')
    df['n_I '] = df['text'].str.count('I ')
    df['n_It '] = df['text'].str.count('It ')
    df['n_He '] = df['text'].str.count('He ')
    df['n_Me '] = df['text'].str.count('Me ')
    df['n_She '] = df['text'].str.count('She ')
    df['n_We '] = df['text'].str.count('We ')
    df['n_They '] = df['text'].str.count('They ')
    df['n_You '] = df['text'].str.count('You ')
    df['n_the'] = df['text_cleaned'].str.count('the ')
    df['n_ a '] = df['text_cleaned'].str.count(' a ')
    df['n_appear'] = df['text_cleaned'].str.count('appear')
    df['n_little'] = df['text_cleaned'].str.count('little')
    df['n_was '] = df['text_cleaned'].str.count('was ')
    df['n_one '] = df['text_cleaned'].str.count('one ')
    df['n_two '] = df['text_cleaned'].str.count('two ')
    df['n_three '] = df['text_cleaned'].str.count('three ')
    df['n_ten '] = df['text_cleaned'].str.count('ten ')
    df['n_is '] = df['text_cleaned'].str.count('is ')
    df['n_are '] = df['text_cleaned'].str.count('are ')
    df['n_ed'] = df['text_cleaned'].str.count('ed ')
    df['n_however'] = df['text_cleaned'].str.count('however')
    df['n_ to '] = df['text_cleaned'].str.count(' to ')
    df['n_into'] = df['text_cleaned'].str.count('into')
    df['n_about '] = df['text_cleaned'].str.count('about ')
    df['n_th'] = df['text_cleaned'].str.count('th')
    df['n_er'] = df['text_cleaned'].str.count('er')
    df['n_ex'] = df['text_cleaned'].str.count('ex')
    df['n_an '] = df['text_cleaned'].str.count('an ')
    df['n_ground'] = df['text_cleaned'].str.count('ground')
    df['n_any'] = df['text_cleaned'].str.count('any')
    df['n_silence'] = df['text_cleaned'].str.count('silence')
    df['n_wall'] = df['text_cleaned'].str.count('wall')

    df.drop(['text_cleaned'], axis=1, inplace=True)

print('Processing train...')
extract_features(train)
print('Processing test...')
extract_features(test)

Processing train...
Processing test...


Extracting proper-noun usage patterns -> takes too long to run

If time permits later, it would be worth checking whether these features improve performance

---
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')



---


import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')

def pos_tag_text(s):
    sents = nltk.sent_tokenize(s)
    res = []
    for sent in sents:
        words = nltk.word_tokenize(sent)
        tag_res = [a[1] for a in nltk.pos_tag(words)]
        res.append(' '.join(tag_res))
    return '. '.join(res)

def ne_text(s):
    sents = nltk.sent_tokenize(s)
    res = []
    for sent in sents:
        words = nltk.word_tokenize(sent)
        tag_res = nltk.pos_tag(words)
        ne_tree = nltk.ne_chunk(tag_res)
        list_res = nltk.tree2conlltags(ne_tree)
        ne_res = [a[2] for a in list_res]
        res.append(' '.join(ne_res))
    return '. '.join(res)

train['tag_txt'] = train["text"].apply(pos_tag_text)
train['ne_txt'] = train["text"].apply(ne_text)
test['tag_txt'] = test["text"].apply(pos_tag_text)
test['ne_txt'] = test["text"].apply(ne_text)

c_vec3 = CountVectorizer(lowercase=False)
c_vec3.fit(train['tag_txt'].values.tolist())
train_cvec3 = c_vec3.transform(train['tag_txt'].values.tolist()).toarray()
test_cvec3 = c_vec3.transform(test['tag_txt'].values.tolist()).toarray()
print(train_cvec3.shape,test_cvec3.shape)

c_vec4 = CountVectorizer(lowercase=False)
c_vec4.fit(train['ne_txt'].values.tolist())
train_cvec4 = c_vec4.transform(train['ne_txt'].values.tolist()).toarray()
test_cvec4 = c_vec4.transform(test['ne_txt'].values.tolist()).toarray()
print(train_cvec4.shape,test_cvec4.shape)

tf_vec5 = TfidfVectorizer(lowercase=False)
tf_vec5.fit(train['tag_txt'].values.tolist())
train_tf5 = tf_vec5.transform(train['tag_txt'].values.tolist()).toarray()
test_tf5 = tf_vec5.transform(test['tag_txt'].values.tolist()).toarray()
print(train_tf5.shape,test_tf5.shape)

tf_vec6 = TfidfVectorizer(lowercase=False)
tf_vec6.fit(train['ne_txt'].values.tolist())
train_tf6 = tf_vec6.transform(train['ne_txt'].values.tolist()).toarray()
test_tf6 = tf_vec6.transform(test['ne_txt'].values.tolist()).toarray()
print(train_tf6.shape,test_tf6.shape)



---



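## TF-IDF & SVD meta-features

- **`svd_word_0`~`svd_word_29`**  
  Components obtained by reducing the word n-gram (1-3) TF-IDF vectors to 30 dimensions with TruncatedSVD  

- **`svd_char_0`~`svd_char_29`**  
  Components obtained by reducing the character n-gram (3-7) TF-IDF vectors to 30 dimensions with TruncatedSVD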

In [None]:
## For reference: takes about 10 minutes

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition         import TruncatedSVD

# 1) WORD n-gram TF-IDF
tfidf_vec = TfidfVectorizer(ngram_range=(1,3), max_df=0.8, sublinear_tf=True)
tfidf_vec.fit(train['text'])
train_tfidf = tfidf_vec.transform(train['text'])
test_tfidf  = tfidf_vec.transform(test['text'])

# 2) SVD
svd = TruncatedSVD(n_components=30, random_state=42)
svd.fit(train_tfidf)
train_svd = pd.DataFrame(svd.transform(train_tfidf),
                         columns=[f'svd_word_{i}' for i in range(30)])
test_svd  = pd.DataFrame(svd.transform(test_tfidf),
                         columns=[f'svd_word_{i}' for i in range(30)])

# 3) CHAR n-gram TF-IDF
tfidf_char = TfidfVectorizer(analyzer='char', ngram_range=(3,7),
                             max_df=0.8, sublinear_tf=True)
tfidf_char.fit(train['text'])
train_tfidf_char = tfidf_char.transform(train['text'])
test_tfidf_char  = tfidf_char.transform(test['text'])

# 4) SVD (char)
svd2 = TruncatedSVD(n_components=30, random_state=42)
svd2.fit(train_tfidf_char)
train_svd2 = pd.DataFrame(svd2.transform(train_tfidf_char),
                          columns=[f'svd_char_{i}' for i in range(30)])
test_svd2  = pd.DataFrame(svd2.transform(test_tfidf_char),
                          columns=[f'svd_char_{i}' for i in range(30)])

## Naive Bayes OOF meta-features

- **Vectorization**: word n-gram (1-3) TF-IDF (max_features=5000, sublinear_tf=True)  
- **CV setup**: 5-fold StratifiedKFold (shuffle=True, random_state=42)  
- **Model training**: fit MultinomialNB(alpha=0.1) in each fold  
- **Meta-features**:  
  - `meta_train` receives the `predict_proba` output for each validation split (out-of-fold, OOF)  
  - `meta_test` stores the `predict_proba` output on the test set, averaged over the folds  
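
The out-of-fold bookkeeping described above can be sketched without any ML library. This is a minimal illustration under stated assumptions, not the notebook's implementation. `class_prior_model` is a hypothetical stand-in for MultinomialNB, the folds are plain shuffled folds rather than StratifiedKFold, and the labels are toy data.

```python
import random
from collections import Counter

def class_prior_model(labels, n_classes):
    """Stand-in 'model': predicts the training-fold class frequencies for any row."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(n_classes)]

def oof_features(y, n_test, n_classes, n_splits=5, seed=42):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]  # plain folds, not stratified
    meta_train = [None] * len(y)
    meta_test = [[0.0] * n_classes for _ in range(n_test)]
    for val_idx in folds:
        held_out = set(val_idx)
        train_labels = [y[i] for i in idx if i not in held_out]
        proba = class_prior_model(train_labels, n_classes)
        for i in val_idx:
            meta_train[i] = proba  # row i is predicted by a model that never saw it
        for row in meta_test:
            for c in range(n_classes):
                row[c] += proba[c] / n_splits  # average test predictions over folds
    return meta_train, meta_test

toy_labels = [0, 1, 2, 3, 4] * 4  # toy labels, not the real data
meta_train_demo, meta_test_demo = oof_features(toy_labels, n_test=3, n_classes=5)
print(meta_train_demo[0], meta_test_demo[0])
```

Every training row is filled exactly once, by the fold in which it was held out. Each test row is the mean of the per-fold predictions, so it remains a valid probability vector.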


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Number of classes
n_classes = len(train['author'].unique())

# 1. TF-IDF (word-based)
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000, sublinear_tf=True)
X = tfidf.fit_transform(train['text'])
X_test = tfidf.transform(test['text'])
y = train['author'].values

# 2. Naive Bayes + Stratified KFold
meta_train = np.zeros((X.shape[0], n_classes))
meta_test = np.zeros((X_test.shape[0], n_classes))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold+1}")

    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model = MultinomialNB(alpha=0.1)
    model.fit(X_train, y_train)

    meta_train[val_idx] = model.predict_proba(X_val)
    meta_test += model.predict_proba(X_test) / skf.n_splits

Fold 1
Fold 2
Fold 3
Fold 4
Fold 5


In [None]:
import pandas as pd

meta_train_df = pd.DataFrame(meta_train, columns=[f'nb_class_{i}' for i in range(n_classes)])
meta_test_df = pd.DataFrame(meta_test, columns=[f'nb_class_{i}' for i in range(n_classes)])

train = pd.concat([train.reset_index(drop=True), meta_train_df], axis=1)
test = pd.concat([test.reset_index(drop=True), meta_test_df], axis=1)

# Drop duplicated nb_class_* columns (e.g. if this cell is run more than once)
train = train.loc[:, ~train.columns.duplicated()]
test = test.loc[:, ~test.columns.duplicated()]

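## Character-type ratio meta-features

- Capture each author's character-usage habits through the share of digits and English letters
- Each ratio is the count of that character type divided by the total text length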


In [None]:
import re

def char_type_features(df, text_col='text'):
    ct = pd.DataFrame(index=df.index)
    # Total number of characters
    ct['total_chars'] = df[text_col].str.len()
    # Digits
    ct['digit_count'] = df[text_col].str.count(r'\d')
    ct['digit_ratio'] = ct['digit_count'] / (ct['total_chars'] + 1e-9)
    # English letters
    ct['eng_count'] = df[text_col].str.count(r'[A-Za-z]')
    ct['eng_ratio'] = ct['eng_count'] / (ct['total_chars'] + 1e-9)
    return ct.fillna(0)

# Apply to train and test separately
train_ct = char_type_features(train, text_col='text')
test_ct  = char_type_features(test,  text_col='text')

train = pd.concat([train, train_ct], axis=1).fillna(0)
test  = pd.concat([test,  test_ct ], axis=1).fillna(0)

# Check the results (shapes of the NB meta-feature arrays)
print("meta_train.shape:", meta_train.shape)
print("meta_test.shape: ", meta_test.shape)
train.head()

meta_train.shape: (54879, 5)
meta_test.shape:  (19617, 5)


Unnamed: 0,index,text,author,num_words,num_unique_words,num_chars,num_punctuations,num_words_upper,num_words_title,mean_word_len,...,nb_class_0,nb_class_1,nb_class_2,nb_class_3,nb_class_4,total_chars,digit_count,digit_ratio,eng_count,eng_ratio
0,0,"He was almost choking. There was so much, so m...",3,46,39,240,8,0,4,4.23913,...,0.068573,0.011556,0.052782,0.853063,0.014027,240,0,0.0,187,0.779167
1,1,"“Your sister asked for it, I suppose?”",2,7,7,38,2,1,2,4.571429,...,0.282354,0.386041,0.074021,0.237655,0.019929,38,0,0.0,28,0.736842
2,2,"She was engaged one day as she walked, in per...",1,57,50,320,9,0,4,4.614035,...,0.105101,0.865957,0.003361,0.021768,0.003813,320,0,0.0,253,0.790625
3,3,"The captain was in the porch, keeping himself ...",4,58,49,319,18,0,7,4.517241,...,0.013569,0.00381,0.276248,0.007511,0.698863,319,0,0.0,242,0.758621
4,4,"“Have mercy, gentlemen!” odin flung up his han...",3,39,36,228,13,0,4,4.871795,...,0.072644,0.002546,0.071292,0.777851,0.075667,228,0,0.0,171,0.75


In [None]:
print(train.columns)
print(train.shape[1])  # total number of columns (including index, text, author)

Index(['index', 'text', 'author', 'num_words', 'num_unique_words', 'num_chars',
       'num_punctuations', 'num_words_upper', 'num_words_title',
       'mean_word_len', 'num_stopwords', 'len', 'n_words', 'n_.', 'n_...',
       'n_,', 'n_:', 'n_;', 'n_-', 'n_?', 'n_!', 'n_'', 'n_"', 'n_The ',
       'n_I ', 'n_It ', 'n_He ', 'n_Me ', 'n_She ', 'n_We ', 'n_They ',
       'n_You ', 'n_the', 'n_ a ', 'n_appear', 'n_little', 'n_was ', 'n_one ',
       'n_two ', 'n_three ', 'n_ten ', 'n_is ', 'n_are ', 'n_ed', 'n_however',
       'n_ to ', 'n_into', 'n_about ', 'n_th', 'n_er', 'n_ex', 'n_an ',
       'n_ground', 'n_any', 'n_silence', 'n_wall', 'nb_class_0', 'nb_class_1',
       'nb_class_2', 'nb_class_3', 'nb_class_4', 'total_chars', 'digit_count',
       'digit_ratio', 'eng_count', 'eng_ratio'],
      dtype='object')
66


# **Generating features with models**

## **Generating features with LogisticRegression**

In [None]:
# Vectorize the text
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

tfidf2 = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 3), min_df=50)
X = tfidf2.fit_transform(train['text'])
X_test = tfidf2.transform(test['text'])



In [None]:
n_fold = 5
seed = 42
cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)

In [None]:
from sklearn.linear_model import LogisticRegression

y = train['author'].values
n_class = len(np.unique(y))

# Arrays for storing predictions
p = np.zeros((X.shape[0], n_class))
p_tst = np.zeros((X_test.shape[0], n_class))

# Stratified K-Fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i_cv, (i_trn, i_val) in enumerate(cv.split(X, y), 1):
    print(f"Fold {i_cv}")

    clf = LogisticRegression(max_iter=1000, C=1.0, solver='liblinear', multi_class='ovr')
    clf.fit(X[i_trn], y[i_trn])

    # Store the out-of-fold predictions for the validation split
    p[i_val, :] = clf.predict_proba(X[i_val])

    # Accumulate the fold-averaged test-set predictions
    p_tst += clf.predict_proba(X_test) / cv.n_splits

Fold 1
Fold 2
Fold 3
Fold 4
Fold 5

In [None]:
# Inspect the logistic regression predictions
print(p)
print(p_tst)

[[0.06189655 0.03873073 0.02612818 0.85686559 0.01637895]
 [0.4160363  0.34474796 0.05481215 0.15570115 0.02870244]
 [0.26537439 0.67465412 0.00167294 0.05461705 0.00368151]
 ...
 [0.07944283 0.70219475 0.01780089 0.13582431 0.06473721]
 [0.07170358 0.02215117 0.27386091 0.60806033 0.02422401]
 [0.37257146 0.04392143 0.11611735 0.44741359 0.01997616]]
[[0.07791363 0.45494121 0.38854873 0.06687426 0.01172217]
 [0.12064811 0.70770804 0.01609679 0.06491018 0.09063687]
 [0.75956974 0.05041887 0.10398966 0.06875483 0.0172669 ]
 ...
 [0.23679087 0.67850783 0.00359074 0.07235674 0.00875382]
 [0.01828719 0.825399   0.08690073 0.03792794 0.03148514]
 [0.39121226 0.02044296 0.17828803 0.13872753 0.27132922]]


In [None]:
# Export to CSV files
pd.DataFrame(p).to_csv('logistic_train.csv', index=False)
pd.DataFrame(p_tst).to_csv('logistic_test.csv', index=False)

## **Generating features with a neural network**

Tokenizing > Embedding > Pooling > Hidden Layer(2) > Output
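
To make the Embedding > Pooling step concrete, the sketch below shows what an embedding lookup followed by global average pooling computes for one padded sequence. It uses plain Python lists. The tiny `embedding_table` and the token ids are made-up values for illustration, not weights or data from the notebook.

```python
# Hypothetical embedding table: row i is the vector for token id i (id 0 = padding).
embedding_table = [
    [0.5, 0.5],  # id 0 (padding also has its own learned vector)
    [1.0, 2.0],  # id 1
    [3.0, 4.0],  # id 2
]

def embed(sequence):
    """Look up one vector per token id: shape (seq_len, embed_dim)."""
    return [embedding_table[token_id] for token_id in sequence]

def global_average_pool(vectors):
    """Average over the time axis: shape (embed_dim,)."""
    return [sum(column) / len(column) for column in zip(*vectors)]

padded_sequence = [0, 0, 1, 2]  # pre-padded with zeros, as pad_sequences does by default
pooled = global_average_pool(embed(padded_sequence))
print(pooled)
```

When no mask is applied, the padding positions are included in the average. A longer padding length therefore pulls short texts toward the padding vector, which is one reason the choice of padding length can affect results.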

In [None]:
import tensorflow
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Dense,Input,Conv1D,Embedding, GlobalMaxPooling1D, GlobalAveragePooling1D, Dropout, Concatenate
from tensorflow.keras.models import Model,Sequential, load_model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
import gc

 **Finding an appropriate padding length**

In [None]:
tokenizer = Tokenizer(num_words=30000)
tokenizer.fit_on_texts(train['text'])
train_x_tmp = tokenizer.texts_to_sequences(train['text'])

In [None]:
sequence_lengths = [len(seq) for seq in train_x_tmp]
print("Max length:", max(sequence_lengths))
print("95 percentile:", np.percentile(sequence_lengths, 95))
print("Median length:", np.median(sequence_lengths))

Max length: 473
95 percentile: 160.0
Median length: 23.0


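(After converting each sentence to a sequence of integer indices) half of the data is very short, at 23 tokens or fewer.

95% of the sequences are 160 tokens or fewer, so longer sequences are rare.

Only a handful of sentences reach 473 tokens, and padding every sequence to keep them intact would make training inefficient.

✅ Recommended MAX_LEN
- 160: covers 95% of the sentences → ✅ the most balanced choice

- 200: truncates with a little headroom → viable

- 256 or more: unnecessary, and likely to introduce excessive padding

> In practice, 100 gave better performance than 160, so we proceed with 100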
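
One way to compare candidate padding lengths is to measure the share of sequences each would leave untruncated. The sketch below assumes a hypothetical helper `coverage` and uses toy lengths rather than the real data. In the notebook, the `sequence_lengths` list computed earlier could be passed in instead.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit within max_len tokens without truncation."""
    return sum(n <= max_len for n in lengths) / len(lengths)

# Toy sequence lengths for illustration only (not the real distribution)
toy_lengths = [5, 12, 23, 23, 40, 75, 98, 120, 160, 473]

for candidate in (100, 160, 200, 256):
    print(candidate, coverage(toy_lengths, candidate))
```

Coverage only measures how much text is truncated. It says nothing about how much padding the shorter texts receive, so a lower-coverage length can still score better in validation.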

In [None]:
# Uses average pooling
# Two hidden layers

import json


def get_nn_feats2(rnd=1):
    train_pred, test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    best_val_train_pred, best_val_test_pred = np.zeros((54879,5)),np.zeros((19617,5))

    best_fold_loss = float('inf')
    best_fold_idx = -1
    best_fold_metrics = {}

    # Hyperparameters
    FEAT_CNT = 10
    NUM_WORDS = 30000
    N = 10
    MAX_LEN = 100
    NUM_CLASSES = 5
    MODEL_P = 'nn_model.h5'

    tmp_X = train['text']
    tmp_Y = train['author']
    tmp_X_test = test['text']
    # Tokenizing
    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)
    # text to sequence
    # padding
    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)

    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    # label one-hot encoding
    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)
    ttrain_y = lb.transform(tmp_Y)

    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=233*rnd)
    for fold_idx, (train_index, test_index) in enumerate(skf.split(ttrain_x, tmp_Y)):
        # Input layer
        input_layer = Input(shape=(MAX_LEN,))

        # Embedding layer
        embedding = Embedding(input_dim=NUM_WORDS, output_dim=N, input_length=MAX_LEN)(input_layer)

        # Average pooling (the max-pooling branch is disabled)
        avg_pool = GlobalAveragePooling1D()(embedding)
        #max_pool = GlobalMaxPooling1D()(embedding)

        # Concatenate the two pooling outputs (disabled)
        #concat = Concatenate()([avg_pool, max_pool])  # shape: (None, 2 * EMBEDDING_DIM)

        # Dense layers
        #x = Dense(64, activation='relu')(concat)
        x = Dense(64, activation='relu')(avg_pool)
        x = Dropout(0.1)(x)
        x = Dense(32, activation='relu')(x)
        x = Dropout(0.1)(x)
        output = Dense(NUM_CLASSES, activation='softmax')(x)

        # Define the model
        model = Model(inputs=input_layer, outputs=output)
        model.compile(optimizer='adam',
                      loss='categorical_crossentropy', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es=EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        history = model.fit(ttrain_x[train_index], ttrain_y[train_index],
                  validation_split=0.3,
                  batch_size=64, epochs=20,
                  verbose=1,
                  callbacks=[mc,es],
                  shuffle=False
                 )

        # Best val_loss and best val_accuracy of the current fold
        # (the two may come from different epochs)
        current_best_val_loss = min(history.history['val_loss'])
        current_best_val_acc = max(history.history['val_accuracy'])

        if current_best_val_loss < best_fold_loss:
            best_fold_loss = current_best_val_loss
            best_fold_idx = fold_idx
            best_fold_metrics = {'fold': fold_idx, 'val_loss': round(current_best_val_loss, 4),
                                 'val_accuracy': round(current_best_val_acc, 4) }

        # Save the best-fold info as JSON (rewritten after every fold)
        with open('best_val_result.json', 'w') as f:
            json.dump(best_fold_metrics, f)

        print("✅ Best fold info saved to best_val_result.json")

        # Feature set 1: predictions from the model at the last trained epoch
        train_pred[test_index] = model.predict(ttrain_x[test_index])
        test_pred += model.predict(ttest_x)/FEAT_CNT

        # Feature set 2: predictions from the checkpoint with the best val_loss
        model = load_model(MODEL_P)
        best_val_train_pred[test_index] = model.predict(ttrain_x[test_index])
        best_val_test_pred += model.predict(ttest_x)/FEAT_CNT

        del model
        gc.collect()
        print('------------------')

    return train_pred,test_pred,best_val_train_pred,best_val_test_pred

nn_train1,nn_test1,nn_train2,nn_test2 = get_nn_feats2(1)



Epoch 1/20
Epoch 1: val_loss improved from inf to 1.16502, saving model to nn_model.h5
541/541 7s 9ms/step - accuracy: 0.2847 - loss: 1.5482 - val_accuracy: 0.5347 - val_loss: 1.1650
Epoch 2/20
Epoch 2: val_loss improved from 1.16502 to 0.99614, saving model to nn_model.h5
541/541 3s 6ms/step - accuracy: 0.5453 - loss: 1.1235 - val_accuracy: 0.6107 - val_loss: 0.9961
Epoch 3/20
Epoch 3: val_loss improved from 0.99614 to 0.90425, saving model to nn_model.h5
541/541 3s 6ms/step - accuracy: 0.6430 - loss: 0.9132 - val_accuracy: 0.6594 - val_loss: 0.9043
Epoch 4/20
Epoch 4: val_loss improved from 0.90425 to 0.87923, saving model to nn_model.h5
541/541 6s 9ms/step - accuracy: 0.7073 - loss: 0.7738 - val_accuracy: 0.6759 - val_loss: 0.8792
Epoch 5/20
Epoch 5: val_loss improved from 0.87923 to 0.83030, saving model to nn_model.h5
541/541 3s 6ms/step - accuracy: 0.7530 - loss: 0.6730 - val_accuracy: 0.6993 - val_loss: 0.8303
Epoch 6/20
Epoch 6: val_loss improved from 0.83030 to 0.73532, saving model to nn_model.h5
541/541 3s 6ms/step - accuracy: 0.7858 - loss: 0.5910 - val_accuracy: 0.7352 - val_loss: 0.7353
Epoch 7/20
Epoch 7: val_loss did not improve from 0.73532
541/541 6s 10ms/step - accuracy: 0.8152 - loss: 0.5212 - val_accuracy: 0.7272 - val_loss: 0.7686
Epoch 8/20
Epoch 8: val_loss improved from 0.73532 to 0.72399, saving model to nn_model.h5
541/541 8s 6ms/step - accuracy: 0.8149 - loss: 0.5117 - val_accuracy: 0.7482 - val_loss: 0.7240
Epoch 9/20
Epoch 9: val_loss did not improve from 0.72399
541/541 7s 9ms/step - accuracy: 0.8466 - loss: 0.4432 - val_accuracy: 0.7205 - val_loss: 0.8006
Epoch 10/20
Epoch 10: val_loss improved from 0.72399 to 0.71059, saving model to nn_model.h5
541/541 4s 7ms/step - accuracy: 0.8588 - loss: 0.4088 - val_accuracy: 0.7581 - val_loss: 0.7106
Epoch 11/20
Epoch 11: val_loss did not improve from 0.71059
541/541 5s 6ms/step - accuracy: 0.8607 - loss: 0.4000 - val_accuracy: 0.7596 - val_loss: 0.7294
Epoch 12/20
Epoch 12: val_loss did not improve from 0.71059
541/541 5s 6ms/step - accuracy: 0.8724 - loss: 0.3741 - val_accuracy: 0.7565 - val_loss: 0.7513
✅ Best fold info saved to best_val_result.json
172/172 0s 2ms/step
614/614
172/172 0s 2ms/step
614/614 1s 1ms/step
------------------
Epoch 1/20
Epoch 1: val_loss improved from inf to 1.13821, saving model to nn_model.h5
541/541 8s 11ms/step - accuracy: 0.2989 - loss: 1.5455 - val_accuracy: 0.5694 - val_loss: 1.1382
Epoch 2/20
Epoch 2: val_loss improved from 1.13821 to 0.94326, saving model to nn_model.h5
541/541 4s 8ms/step - accuracy: 0.5636 - loss: 1.1052 - val_accuracy: 0.6399 - val_loss: 0.9433
Epoch 3/20
Epoch 3: val_loss improved from 0.94326 to 0.87526, saving model to nn_model.h5
541/541 3s 6ms/step - accuracy: 0.6582 - loss: 0.8906 - val_accuracy: 0.6686 - val_loss: 0.8753
Epoch 4/20
Epoch 4: val_loss improved from 0.87526 to 0.81655, saving model to nn_model.h5
541/541 5s 10ms/step - accuracy: 0.7094 - loss: 0.7659 - val_accuracy: 0.6908 - val_loss: 0.8166
Epoch 5/20
Epoch 5: val_loss did not improve from 0.81655
541/541 8s 6ms/step - accuracy: 0.7486 - loss: 0.6802 - val_accuracy: 0.6805 - val_loss: 0.8377
Epoch 6/20
Epoch 6: val_loss improved from 0.81655 to 0.76231, saving model to nn_model.h5
541/541 5s 9ms/step - accuracy: 0.7760 - loss: 0.6137 - val_accuracy: 0.7191 - val_loss: 0.7623
Epoch 7/20
Epoch 7: val_loss did not improve from 0.76231
541/541 3s 6ms/step - accuracy: 0.7969 - loss: 0.5664 - val_accuracy: 0.7197 - val_loss: 0.7711
Epoch 8/20
Epoch 8: val_loss improved from 0.76231 to 0.75482, saving model to nn_model.h5
541/541 4s 6ms/step - accuracy: 0.8217 - loss: 0.5048 - val_accuracy: 0.7354 - val_loss: 0.7548
Epoch 9/20
540/541 0s 6ms/step - accuracy: 0.8220 - loss: 0.4979
Epoch 9: val_loss did not improve from 0.75482
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 9ms/step - accuracy: 0.8220 - loss: 0.4979 - val_accuracy: 0.7294 - val_loss: 0.7921
Epoch 10/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8321 - loss: 0.4688
Epoch 10: val_loss did not improve from 0.75482
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.8321 - loss: 0.4687 - val_accuracy: 0.7301 - val_loss: 0.8260
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.2913 - loss: 1.5496
Epoch 1: val_loss improved from inf to 1.28054, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.2918 - loss: 1.5490 - val_accuracy: 0.5198 - val_loss: 1.2805
Epoch 2/20
[1m533/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.5015 - loss: 1.2125
Epoch 2: val_loss improved from 1.28054 to 1.05009, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 6ms/step - accuracy: 0.5022 - loss: 1.2110 - val_accuracy: 0.5898 - val_loss: 1.0501
Epoch 3/20
[1m534/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.6280 - loss: 0.9367
Epoch 3: val_loss improved from 1.05009 to 0.92131, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.6283 - loss: 0.9361 - val_accuracy: 0.6454 - val_loss: 0.9213
Epoch 4/20
[1m540/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.6934 - loss: 0.7984
Epoch 4: val_loss improved from 0.92131 to 0.79609, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.6935 - loss: 0.7983 - val_accuracy: 0.7017 - val_loss: 0.7961
Epoch 5/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7302 - loss: 0.7115
Epoch 5: val_loss did not improve from 0.79609
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.7303 - loss: 0.7112 - val_accuracy: 0.6821 - val_loss: 0.8231
Epoch 6/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.7678 - loss: 0.6274
Epoch 6: val_loss improved from 0.79609 to 0.78864, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 9ms/step - accuracy: 0.7679 - loss: 0.6273 - val_accuracy: 0.7037 - val_loss: 0.7886
Epoch 7/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7934 - loss: 0.5689
Epoch 7: val_loss improved from 0.78864 to 0.74753, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.7935 - loss: 0.5688 - val_accuracy: 0.7285 - val_loss: 0.7475
Epoch 8/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8157 - loss: 0.5162
Epoch 8: val_loss did not improve from 0.74753
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 6ms/step - accuracy: 0.8157 - loss: 0.5162 - val_accuracy: 0.7040 - val_loss: 0.8067
Epoch 9/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.8267 - loss: 0.4878
Epoch 9: val_loss did not improve from 0.74753
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 10ms/step - accuracy: 0.8267 - loss: 0.4878 - val_accuracy: 0.6944 - val_loss: 0.8553
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.2894 - loss: 1.5497
Epoch 1: val_loss improved from inf to 1.23587, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 8ms/step - accuracy: 0.2899 - loss: 1.5492 - val_accuracy: 0.5007 - val_loss: 1.2359
Epoch 2/20
[1m540/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.5122 - loss: 1.1929
Epoch 2: val_loss improved from 1.23587 to 1.10041, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.5124 - loss: 1.1926 - val_accuracy: 0.5424 - val_loss: 1.1004
Epoch 3/20
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.6220 - loss: 0.9625
Epoch 3: val_loss improved from 1.10041 to 0.98415, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.6221 - loss: 0.9624 - val_accuracy: 0.6022 - val_loss: 0.9841
Epoch 4/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.6818 - loss: 0.8240
Epoch 4: val_loss improved from 0.98415 to 0.87199, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.6819 - loss: 0.8237 - val_accuracy: 0.6618 - val_loss: 0.8720
Epoch 5/20
[1m533/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7269 - loss: 0.7277
Epoch 5: val_loss improved from 0.87199 to 0.77261, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.7271 - loss: 0.7272 - val_accuracy: 0.7124 - val_loss: 0.7726
Epoch 6/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7630 - loss: 0.6428
Epoch 6: val_loss improved from 0.77261 to 0.73869, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.7631 - loss: 0.6426 - val_accuracy: 0.7317 - val_loss: 0.7387
Epoch 7/20
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7967 - loss: 0.5659
Epoch 7: val_loss did not improve from 0.73869
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.7967 - loss: 0.5659 - val_accuracy: 0.7158 - val_loss: 0.7750
Epoch 8/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8144 - loss: 0.5219
Epoch 8: val_loss did not improve from 0.73869
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 6ms/step - accuracy: 0.8144 - loss: 0.5217 - val_accuracy: 0.7201 - val_loss: 0.7871
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
------------------
Epoch 1/20
[1m533/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.2763 - loss: 1.5534
Epoch 1: val_loss improved from inf to 1.37878, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.2769 - loss: 1.5527 - val_accuracy: 0.4007 - val_loss: 1.3788
Epoch 2/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.4750 - loss: 1.2725
Epoch 2: val_loss improved from 1.37878 to 0.96541, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.4753 - loss: 1.2719 - val_accuracy: 0.6253 - val_loss: 0.9654
Epoch 3/20
[1m534/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.6206 - loss: 0.9420
Epoch 3: val_loss improved from 0.96541 to 0.86837, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.6208 - loss: 0.9414 - val_accuracy: 0.6585 - val_loss: 0.8684
Epoch 4/20
[1m531/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.6761 - loss: 0.8050
Epoch 4: val_loss improved from 0.86837 to 0.83048, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.6763 - loss: 0.8047 - val_accuracy: 0.6793 - val_loss: 0.8305
Epoch 5/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.7233 - loss: 0.7185
Epoch 5: val_loss improved from 0.83048 to 0.81884, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.7233 - loss: 0.7183 - val_accuracy: 0.6850 - val_loss: 0.8188
Epoch 6/20
[1m535/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7562 - loss: 0.6455
Epoch 6: val_loss improved from 0.81884 to 0.76690, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.7563 - loss: 0.6452 - val_accuracy: 0.7170 - val_loss: 0.7669
Epoch 7/20
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7918 - loss: 0.5748
Epoch 7: val_loss did not improve from 0.76690
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.7918 - loss: 0.5748 - val_accuracy: 0.6691 - val_loss: 0.9400
Epoch 8/20
[1m532/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.8133 - loss: 0.5228
Epoch 8: val_loss did not improve from 0.76690
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.8134 - loss: 0.5225 - val_accuracy: 0.7056 - val_loss: 0.8562
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.2932 - loss: 1.5471
Epoch 1: val_loss improved from inf to 1.26844, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.2937 - loss: 1.5465 - val_accuracy: 0.4439 - val_loss: 1.2684
Epoch 2/20
[1m532/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.5281 - loss: 1.1584
Epoch 2: val_loss improved from 1.26844 to 0.94773, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.5288 - loss: 1.1569 - val_accuracy: 0.6424 - val_loss: 0.9477
Epoch 3/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.6476 - loss: 0.9091
Epoch 3: val_loss improved from 0.94773 to 0.86582, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.6478 - loss: 0.9087 - val_accuracy: 0.6726 - val_loss: 0.8658
Epoch 4/20
[1m540/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7140 - loss: 0.7648
Epoch 4: val_loss improved from 0.86582 to 0.77171, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.7140 - loss: 0.7647 - val_accuracy: 0.7163 - val_loss: 0.7717
Epoch 5/20
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.7576 - loss: 0.6577
Epoch 5: val_loss did not improve from 0.77171
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 7ms/step - accuracy: 0.7576 - loss: 0.6577 - val_accuracy: 0.6903 - val_loss: 0.8070
Epoch 6/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7927 - loss: 0.5746
Epoch 6: val_loss improved from 0.77171 to 0.74861, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.7928 - loss: 0.5745 - val_accuracy: 0.7241 - val_loss: 0.7486
Epoch 7/20
[1m540/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8035 - loss: 0.5393
Epoch 7: val_loss improved from 0.74861 to 0.72378, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.8036 - loss: 0.5392 - val_accuracy: 0.7384 - val_loss: 0.7238
Epoch 8/20
[1m534/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.8259 - loss: 0.4891
Epoch 8: val_loss improved from 0.72378 to 0.69742, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 10ms/step - accuracy: 0.8260 - loss: 0.4889 - val_accuracy: 0.7531 - val_loss: 0.6974
Epoch 9/20
[1m531/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8376 - loss: 0.4596
Epoch 9: val_loss improved from 0.69742 to 0.68965, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 6ms/step - accuracy: 0.8377 - loss: 0.4592 - val_accuracy: 0.7620 - val_loss: 0.6896
Epoch 10/20
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.8439 - loss: 0.4356
Epoch 10: val_loss did not improve from 0.68965
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.8440 - loss: 0.4356 - val_accuracy: 0.7629 - val_loss: 0.7028
Epoch 11/20
[1m534/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8499 - loss: 0.4236
Epoch 11: val_loss did not improve from 0.68965
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.8500 - loss: 0.4232 - val_accuracy: 0.7571 - val_loss: 0.7283
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m533/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.2947 - loss: 1.5489
Epoch 1: val_loss improved from inf to 1.17466, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.2957 - loss: 1.5477 - val_accuracy: 0.5283 - val_loss: 1.1747
Epoch 2/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.5495 - loss: 1.1212
Epoch 2: val_loss improved from 1.17466 to 0.92319, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.5497 - loss: 1.1206 - val_accuracy: 0.6357 - val_loss: 0.9232
Epoch 3/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.6513 - loss: 0.8836
Epoch 3: val_loss improved from 0.92319 to 0.86906, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.6514 - loss: 0.8834 - val_accuracy: 0.6612 - val_loss: 0.8691
Epoch 4/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.7117 - loss: 0.7623
Epoch 4: val_loss improved from 0.86906 to 0.79133, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 9ms/step - accuracy: 0.7117 - loss: 0.7621 - val_accuracy: 0.7021 - val_loss: 0.7913
Epoch 5/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7443 - loss: 0.6813
Epoch 5: val_loss improved from 0.79133 to 0.75394, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.7443 - loss: 0.6811 - val_accuracy: 0.7164 - val_loss: 0.7539
Epoch 6/20
[1m532/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7731 - loss: 0.6060
Epoch 6: val_loss did not improve from 0.75394
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.7733 - loss: 0.6058 - val_accuracy: 0.7119 - val_loss: 0.7722
Epoch 7/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.7984 - loss: 0.5535
Epoch 7: val_loss did not improve from 0.75394
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 9ms/step - accuracy: 0.7985 - loss: 0.5535 - val_accuracy: 0.7247 - val_loss: 0.7606
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.2900 - loss: 1.5461
Epoch 1: val_loss improved from inf to 1.31969, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.2901 - loss: 1.5459 - val_accuracy: 0.4439 - val_loss: 1.3197
Epoch 2/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.4815 - loss: 1.2525
Epoch 2: val_loss improved from 1.31969 to 1.05239, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 11ms/step - accuracy: 0.4818 - loss: 1.2519 - val_accuracy: 0.5754 - val_loss: 1.0524
Epoch 3/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.6190 - loss: 0.9645
Epoch 3: val_loss improved from 1.05239 to 0.87553, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 7ms/step - accuracy: 0.6191 - loss: 0.9642 - val_accuracy: 0.6682 - val_loss: 0.8755
Epoch 4/20
[1m537/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.6863 - loss: 0.8156
Epoch 4: val_loss improved from 0.87553 to 0.81562, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - accuracy: 0.6864 - loss: 0.8153 - val_accuracy: 0.6917 - val_loss: 0.8156
Epoch 5/20
[1m540/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7335 - loss: 0.7095
Epoch 5: val_loss improved from 0.81562 to 0.77296, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.7336 - loss: 0.7095 - val_accuracy: 0.7090 - val_loss: 0.7730
Epoch 6/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7690 - loss: 0.6323
Epoch 6: val_loss improved from 0.77296 to 0.74646, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.7690 - loss: 0.6321 - val_accuracy: 0.7213 - val_loss: 0.7465
Epoch 7/20
[1m532/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.7943 - loss: 0.5695
Epoch 7: val_loss did not improve from 0.74646
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.7943 - loss: 0.5694 - val_accuracy: 0.7136 - val_loss: 0.7847
Epoch 8/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8072 - loss: 0.5339
Epoch 8: val_loss did not improve from 0.74646
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.8073 - loss: 0.5338 - val_accuracy: 0.7131 - val_loss: 0.7947
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.2848 - loss: 1.5516
Epoch 1: val_loss improved from inf to 1.30534, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.2851 - loss: 1.5513 - val_accuracy: 0.4920 - val_loss: 1.3053
Epoch 2/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.5492 - loss: 1.1598
Epoch 2: val_loss improved from 1.30534 to 0.92220, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 7ms/step - accuracy: 0.5496 - loss: 1.1587 - val_accuracy: 0.6396 - val_loss: 0.9222
Epoch 3/20
[1m534/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.6646 - loss: 0.8791
Epoch 3: val_loss did not improve from 0.92220
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.6648 - loss: 0.8786 - val_accuracy: 0.6498 - val_loss: 0.9544
Epoch 4/20
[1m534/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.7221 - loss: 0.7556
Epoch 4: val_loss improved from 0.92220 to 0.91325, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 6ms/step - accuracy: 0.7223 - loss: 0.7552 - val_accuracy: 0.6709 - val_loss: 0.9133
Epoch 5/20
[1m540/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.7609 - loss: 0.6541
Epoch 5: val_loss improved from 0.91325 to 0.86183, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - accuracy: 0.7610 - loss: 0.6541 - val_accuracy: 0.6917 - val_loss: 0.8618
Epoch 6/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 6ms/step - accuracy: 0.7916 - loss: 0.5889
Epoch 6: val_loss improved from 0.86183 to 0.76575, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.7916 - loss: 0.5889 - val_accuracy: 0.7180 - val_loss: 0.7658
Epoch 7/20
[1m539/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8042 - loss: 0.5505
Epoch 7: val_loss improved from 0.76575 to 0.76349, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 6ms/step - accuracy: 0.8043 - loss: 0.5505 - val_accuracy: 0.7250 - val_loss: 0.7635
Epoch 8/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.8258 - loss: 0.5020
Epoch 8: val_loss did not improve from 0.76349
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 9ms/step - accuracy: 0.8258 - loss: 0.5019 - val_accuracy: 0.7265 - val_loss: 0.7761
Epoch 9/20
[1m532/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.8402 - loss: 0.4659
Epoch 9: val_loss did not improve from 0.76349
[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.8402 - loss: 0.4660 - val_accuracy: 0.7207 - val_loss: 0.8242
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
------------------
Epoch 1/20
[1m538/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 8ms/step - accuracy: 0.2968 - loss: 1.5519
Epoch 1: val_loss improved from inf to 1.18990, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.2973 - loss: 1.5513 - val_accuracy: 0.5515 - val_loss: 1.1899
Epoch 2/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 5ms/step - accuracy: 0.5563 - loss: 1.1277
Epoch 2: val_loss improved from 1.18990 to 0.94320, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.5566 - loss: 1.1268 - val_accuracy: 0.6373 - val_loss: 0.9432
Epoch 3/20
[1m536/541[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 7ms/step - accuracy: 0.6540 - loss: 0.9015
Epoch 3: val_loss improved from 0.94320 to 0.88905, saving model to nn_model.h5




[1m541/541[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - accuracy: 0.6541 - loss: 0.9012 - val_accuracy: 0.6573 - val_loss: 0.8891
Epoch 4/20
Epoch 4: val_loss improved from 0.88905 to 0.77622, saving model to nn_model.h5
541/541 ━━━━━━━━━━━━━━━━━━━━ 7s 11ms/step - accuracy: 0.7014 - loss: 0.7906 - val_accuracy: 0.7118 - val_loss: 0.7762
Epoch 5/20
Epoch 5: val_loss improved from 0.77622 to 0.73733, saving model to nn_model.h5
541/541 ━━━━━━━━━━━━━━━━━━━━ 4s 7ms/step - accuracy: 0.7426 - loss: 0.6883 - val_accuracy: 0.7286 - val_loss: 0.7373
Epoch 6/20
Epoch 6: val_loss improved from 0.73733 to 0.70763, saving model to nn_model.h5
541/541 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.7803 - loss: 0.5992 - val_accuracy: 0.7407 - val_loss: 0.7076
Epoch 7/20
Epoch 7: val_loss did not improve from 0.70763
541/541 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - accuracy: 0.8067 - loss: 0.5462 - val_accuracy: 0.7443 - val_loss: 0.7126
Epoch 8/20
Epoch 8: val_loss did not improve from 0.70763
541/541 ━━━━━━━━━━━━━━━━━━━━ 8s 6ms/step - accuracy: 0.8070 - loss: 0.5371 - val_accuracy: 0.7191 - val_loss: 0.8054
✅ Best fold info saved to best_val_result.json
172/172 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
614/614
172/172 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
------------------


In [None]:
# Check the saved result (nn_feat2 only)
with open('best_val_result.json', 'r') as f:
    result = json.load(f)

print(f"Best fold: {result['fold']}, val_loss: {result['val_loss']}, val_accuracy: {result['val_accuracy']}")

Best fold: 6, val_loss: 0.8095, val_accuracy: 0.714


In [None]:
# Save the NN-generated features to CSV files
pd.DataFrame(nn_train1).to_csv('nn_train1.csv', index=False)
pd.DataFrame(nn_test1).to_csv('nn_test1.csv', index=False)
pd.DataFrame(nn_train2).to_csv('nn_train2.csv', index=False)
pd.DataFrame(nn_test2).to_csv('nn_test2.csv', index=False)

In [None]:
# Verify that the file was saved correctly
chk = pd.read_csv('/content/nn_test1.csv')
chk.head()

Unnamed: 0,0,1,2,3,4
0,0.000194,0.915968,0.067935,0.015842,6.1e-05
1,0.342317,0.425324,0.145968,0.034048,0.052343
2,0.941147,0.01492,0.00123,0.000218,0.042485
3,6e-06,3.9e-05,0.969542,1.3e-05,0.030399
4,0.992519,0.002278,3.7e-05,0.00437,0.000795
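Each row above is a softmax output over the five authors (averaged across folds for the test set), so it should be non-negative and sum to roughly 1. A minimal sanity-check sketch follows; `check_probability_rows` is a hypothetical helper name, and the sample rows are copied by hand from the preview above rather than read from the file.

```python
import numpy as np
import pandas as pd

def check_probability_rows(df, atol=1e-3):
    """Return True if every row is non-negative and sums to ~1 (hypothetical helper)."""
    values = df.to_numpy(dtype=float)
    return bool((values >= 0).all() and np.allclose(values.sum(axis=1), 1.0, atol=atol))

# The five rows shown in the preview above, typed in for illustration only.
sample = pd.DataFrame([
    [0.000194, 0.915968, 0.067935, 0.015842, 6.1e-05],
    [0.342317, 0.425324, 0.145968, 0.034048, 0.052343],
    [0.941147, 0.01492, 0.00123, 0.000218, 0.042485],
    [6e-06, 3.9e-05, 0.969542, 1.3e-05, 0.030399],
    [0.992519, 0.002278, 3.7e-05, 0.00437, 0.000795],
])

print(check_probability_rows(sample))
```

In the notebook, the same check could be applied to `chk` directly, assuming it holds only the five probability columns.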


## **Generating Features with a CNN**

In [None]:
def get_cnn_feats(rnd=1):
    train_pred, test_pred = np.zeros((54879, 5)), np.zeros((19617, 5))
    best_val_train_pred, best_val_test_pred = np.zeros((54879, 5)), np.zeros((19617, 5))

    FEAT_CNT = 5
    NUM_WORDS = 16000
    EMBED_DIM = 64
    MAX_LEN = 300
    NUM_CLASSES = 5
    MODEL_P = 'cnn_model.h5'

    tmp_X = train['text']
    tmp_Y = train['author']
    tmp_X_test = test['text']

    # Tokenizing
    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)
    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)
    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    # Label one-hot encoding
    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)
    ttrain_y = lb.transform(tmp_Y)

    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=2333 * rnd)
    for fold, (train_index, val_index) in enumerate(skf.split(ttrain_x, tmp_Y)):
        print(f"🌊 Fold {fold+1}/{FEAT_CNT}")

        model = Sequential()
        model.add(Embedding(NUM_WORDS, EMBED_DIM, input_length=MAX_LEN))
        model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
        model.add(GlobalMaxPooling1D())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(NUM_CLASSES, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es = EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        model.fit(
            ttrain_x[train_index], ttrain_y[train_index],
            validation_split=0.1,
            batch_size=256, epochs=10,
            verbose=1,
            callbacks=[mc, es],
            shuffle=False
        )

        # Feature set 1 (current model, i.e. weights after the last epoch)
        train_pred[val_index] = model.predict(ttrain_x[val_index])
        test_pred += model.predict(ttest_x) / FEAT_CNT

        # Feature set 2 (best checkpoint by val_loss)
        model = load_model(MODEL_P)
        best_val_train_pred[val_index] = model.predict(ttrain_x[val_index])
        best_val_test_pred += model.predict(ttest_x) / FEAT_CNT

        del model
        gc.collect()
        print('------------------')

    return train_pred, test_pred, best_val_train_pred, best_val_test_pred


In [None]:
cnn_train1, cnn_test1, cnn_train2, cnn_test2 = get_cnn_feats(1)

🌊 Fold 1/5




Epoch 1/10
Epoch 1: val_loss improved from inf to 0.92947, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 104s 598ms/step - accuracy: 0.3567 - loss: 1.4656 - val_accuracy: 0.6402 - val_loss: 0.9295
Epoch 2/10
Epoch 2: val_loss improved from 0.92947 to 0.68358, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 91s 588ms/step - accuracy: 0.6899 - loss: 0.8251 - val_accuracy: 0.7511 - val_loss: 0.6836
Epoch 3/10
Epoch 3: val_loss improved from 0.68358 to 0.65290, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 138s 562ms/step - accuracy: 0.8265 - loss: 0.5027 - val_accuracy: 0.7673 - val_loss: 0.6529
Epoch 4/10
Epoch 4: val_loss did not improve from 0.65290
155/155 ━━━━━━━━━━━━━━━━━━━━ 142s 562ms/step - accuracy: 0.8926 - loss: 0.3256 - val_accuracy: 0.7716 - val_loss: 0.6894
Epoch 5/10
Epoch 5: val_loss did not improve from 0.65290
155/155 ━━━━━━━━━━━━━━━━━━━━ 152s 625ms/step - accuracy: 0.9279 - loss: 0.2212 - val_accuracy: 0.7606 - val_loss: 0.7807
343/343 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 13s 21ms/step
343/343 ━━━━━━━━━━━━━━━━━━━━ 8s 24ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 13s 21ms/step
------------------
🌊 Fold 2/5
Epoch 1/10
Epoch 1: val_loss improved from inf to 0.94023, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 95s 593ms/step - accuracy: 0.3603 - loss: 1.4609 - val_accuracy: 0.6329 - val_loss: 0.9402
Epoch 2/10
Epoch 2: val_loss improved from 0.94023 to 0.68742, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 142s 595ms/step - accuracy: 0.6914 - loss: 0.8214 - val_accuracy: 0.7424 - val_loss: 0.6874
Epoch 3/10
Epoch 3: val_loss improved from 0.68742 to 0.65476, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 141s 590ms/step - accuracy: 0.8260 - loss: 0.4981 - val_accuracy: 0.7716 - val_loss: 0.6548
Epoch 4/10
Epoch 4: val_loss did not improve from 0.65476
155/155 ━━━━━━━━━━━━━━━━━━━━ 142s 591ms/step - accuracy: 0.8914 - loss: 0.3251 - val_accuracy: 0.7770 - val_loss: 0.6844
Epoch 5/10
Epoch 5: val_loss did not improve from 0.65476
155/155 ━━━━━━━━━━━━━━━━━━━━ 135s 545ms/step - accuracy: 0.9294 - loss: 0.2180 - val_accuracy: 0.7732 - val_loss: 0.7497
343/343 ━━━━━━━━━━━━━━━━━━━━ 8s 22ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 13s 21ms/step
343/343 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 12s 20ms/step
------------------
🌊 Fold 3/5
Epoch 1/10
Epoch 1: val_loss improved from inf to 0.91199, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 93s 587ms/step - accuracy: 0.3682 - loss: 1.4594 - val_accuracy: 0.6509 - val_loss: 0.9120
Epoch 2/10
Epoch 2: val_loss improved from 0.91199 to 0.67330, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 143s 596ms/step - accuracy: 0.7060 - loss: 0.7963 - val_accuracy: 0.7518 - val_loss: 0.6733
Epoch 3/10
Epoch 3: val_loss improved from 0.67330 to 0.64525, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 139s 580ms/step - accuracy: 0.8308 - loss: 0.4839 - val_accuracy: 0.7700 - val_loss: 0.6453
Epoch 4/10
Epoch 4: val_loss did not improve from 0.64525
155/155 ━━━━━━━━━━━━━━━━━━━━ 144s 597ms/step - accuracy: 0.8980 - loss: 0.3121 - val_accuracy: 0.7677 - val_loss: 0.7000
Epoch 5/10
Epoch 5: val_loss did not improve from 0.64525
155/155 ━━━━━━━━━━━━━━━━━━━━ 88s 566ms/step - accuracy: 0.9338 - loss: 0.2101 - val_accuracy: 0.7602 - val_loss: 0.7677
343/343 ━━━━━━━━━━━━━━━━━━━━ 7s 20ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 11s 18ms/step
343/343 ━━━━━━━━━━━━━━━━━━━━ 8s 23ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 13s 21ms/step
------------------
🌊 Fold 4/5
Epoch 1/10
Epoch 1: val_loss improved from inf to 0.94892, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 93s 590ms/step - accuracy: 0.3584 - loss: 1.4666 - val_accuracy: 0.6327 - val_loss: 0.9489
Epoch 2/10
Epoch 2: val_loss improved from 0.94892 to 0.68532, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 142s 590ms/step - accuracy: 0.6905 - loss: 0.8237 - val_accuracy: 0.7474 - val_loss: 0.6853
Epoch 3/10
Epoch 3: val_loss improved from 0.68532 to 0.65303, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 147s 620ms/step - accuracy: 0.8283 - loss: 0.4915 - val_accuracy: 0.7750 - val_loss: 0.6530
Epoch 4/10
Epoch 4: val_loss did not improve from 0.65303
155/155 ━━━━━━━━━━━━━━━━━━━━ 134s 570ms/step - accuracy: 0.8970 - loss: 0.3132 - val_accuracy: 0.7711 - val_loss: 0.7035
Epoch 5/10
Epoch 5: val_loss did not improve from 0.65303
155/155 ━━━━━━━━━━━━━━━━━━━━ 143s 575ms/step - accuracy: 0.9331 - loss: 0.2103 - val_accuracy: 0.7679 - val_loss: 0.7661
343/343 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 15s 24ms/step
343/343 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 12s 20ms/step
------------------
🌊 Fold 5/5
Epoch 1/10
Epoch 1: val_loss improved from inf to 0.95215, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 87s 548ms/step - accuracy: 0.3530 - loss: 1.4733 - val_accuracy: 0.6263 - val_loss: 0.9522
Epoch 2/10
Epoch 2: val_loss improved from 0.95215 to 0.67154, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 145s 573ms/step - accuracy: 0.6815 - loss: 0.8409 - val_accuracy: 0.7563 - val_loss: 0.6715
Epoch 3/10
Epoch 3: val_loss improved from 0.67154 to 0.64155, saving model to cnn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 91s 587ms/step - accuracy: 0.8232 - loss: 0.5105 - val_accuracy: 0.7716 - val_loss: 0.6415
Epoch 4/10
Epoch 4: val_loss did not improve from 0.64155
155/155 ━━━━━━━━━━━━━━━━━━━━ 138s 563ms/step - accuracy: 0.8887 - loss: 0.3363 - val_accuracy: 0.7682 - val_loss: 0.6860
Epoch 5/10
Epoch 5: val_loss did not improve from 0.64155
155/155 ━━━━━━━━━━━━━━━━━━━━ 143s 570ms/step - accuracy: 0.9269 - loss: 0.2280 - val_accuracy: 0.7700 - val_loss: 0.7277
343/343 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 13s 21ms/step
343/343 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 12s 19ms/step
------------------
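The step counts in the progress bars above (155, 343, 614) can be reproduced from the constants in `get_cnn_feats`. The sketch below is only back-of-the-envelope arithmetic: it assumes the largest held-out fold, that `validation_split=0.1` removes roughly 10% of each fold's training rows, and that `predict()` runs with the Keras default batch size of 32, since the notebook passes none.

```python
import math

N_TRAIN, N_TEST = 54879, 19617   # array sizes hard-coded in get_cnn_feats
N_SPLITS = 5                     # FEAT_CNT
VAL_SPLIT = 0.1                  # validation_split passed to model.fit
FIT_BATCH = 256                  # batch_size passed to model.fit
PREDICT_BATCH = 32               # Keras default when predict() gets no batch_size

held_out = math.ceil(N_TRAIN / N_SPLITS)             # rows in the largest held-out fold
fold_train = N_TRAIN - held_out                      # rows passed to model.fit
fit_rows = fold_train - int(fold_train * VAL_SPLIT)  # rows left after the 10% validation split

fit_steps = math.ceil(fit_rows / FIT_BATCH)
oof_steps = math.ceil(held_out / PREDICT_BATCH)
test_steps = math.ceil(N_TEST / PREDICT_BATCH)

print(fit_steps, oof_steps, test_steps)
```

Because `shuffle=False` is passed to `fit`, the batches are also drawn in the same order every epoch.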


In [None]:
def save_cnn_feats_to_csv(cnn_train1, cnn_test1, cnn_train2, cnn_test2, base_filename="cnn_features"):
    """
    Save the NumPy arrays returned by get_cnn_feats to separate CSV files.

    Args:
        cnn_train1, cnn_test1, cnn_train2, cnn_test2 (np.ndarray): meta-features
        base_filename (str): prefix for the output file names (default: 'cnn_features')
    """
    arrays = {'train1': cnn_train1, 'test1': cnn_test1, 'train2': cnn_train2, 'test2': cnn_test2}
    for name, arr in arrays.items():
        pd.DataFrame(arr).to_csv(f"{base_filename}_{name}.csv", index=False)

# Save the CNN-generated features to CSV files (cnn_train1.csv, cnn_test1.csv, ...)
save_cnn_feats_to_csv(cnn_train1, cnn_test1, cnn_train2, cnn_test2, base_filename="cnn")

## **Generating Features with a GRU**

In [None]:
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import GRU

In [None]:
def get_gru_feats(rnd=1):
    train_pred, test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    best_val_train_pred, best_val_test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    FEAT_CNT = 5
    NUM_WORDS = 16000
    N = 12
    MAX_LEN = 300
    NUM_CLASSES = 5
    MODEL_P = 'nn_model.h5'

    tmp_X = train['text']
    tmp_Y = train['author']
    tmp_X_test = test['text']

    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)

    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)

    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)

    ttrain_y = lb.transform(tmp_Y)
    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=2333*rnd)
    for train_index, test_index in skf.split(ttrain_x,tmp_Y):
        model = Sequential()
        model.add(Embedding(NUM_WORDS, N, input_length=MAX_LEN))
        model.add(GRU(N, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
        model.add(Flatten())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(NUM_CLASSES, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es=EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        model.fit(ttrain_x[train_index], ttrain_y[train_index],
                  validation_split=0.1,
                  batch_size=256, epochs=10,
                  verbose=1,
                  callbacks=[mc,es],
                  shuffle=False
                 )

        # Feature set 1 (current model, i.e. weights after the last epoch)
        train_pred[test_index] = model.predict(ttrain_x[test_index])
        test_pred += model.predict(ttest_x)/FEAT_CNT

        # Feature set 2 (best checkpoint by val_loss)
        model = load_model(MODEL_P)
        best_val_train_pred[test_index] = model.predict(ttrain_x[test_index])
        best_val_test_pred += model.predict(ttest_x)/FEAT_CNT

        del model
        gc.collect()
        print('------------------')

    return train_pred,test_pred,best_val_train_pred,best_val_test_pred

In [None]:
gru_train1,gru_test1,gru_train2,gru_test2 = get_gru_feats(1)

Epoch 1/10
Epoch 1: val_loss improved from inf to 1.02439, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 54s 305ms/step - accuracy: 0.3433 - loss: 1.4734 - val_accuracy: 0.5862 - val_loss: 1.0244
Epoch 2/10
Epoch 2: val_loss improved from 1.02439 to 0.79951, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 83s 311ms/step - accuracy: 0.6277 - loss: 0.9467 - val_accuracy: 0.6967 - val_loss: 0.7995
Epoch 3/10
Epoch 3: val_loss improved from 0.79951 to 0.74352, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 47s 302ms/step - accuracy: 0.7339 - loss: 0.7094 - val_accuracy: 0.7185 - val_loss: 0.7435
Epoch 4/10
Epoch 4: val_loss improved from 0.74352 to 0.72682, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 47s 303ms/step - accuracy: 0.7854 - loss: 0.5831 - val_accuracy: 0.7244 - val_loss: 0.7268
Epoch 5/10
Epoch 5: val_loss did not improve from 0.72682
155/155 ━━━━━━━━━━━━━━━━━━━━ 85s 320ms/step - accuracy: 0.8228 - loss: 0.4916 - val_accuracy: 0.7294 - val_loss: 0.7645
Epoch 6/10
Epoch 6: val_loss did not improve from 0.72682
155/155 ━━━━━━━━━━━━━━━━━━━━ 79s 301ms/step - accuracy: 0.8429 - loss: 0.4377 - val_accuracy: 0.7331 - val_loss: 0.7696
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 39ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 39ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 23s 38ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.04844, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 52s 305ms/step - accuracy: 0.3430 - loss: 1.4896 - val_accuracy: 0.5798 - val_loss: 1.0484
Epoch 2/10
Epoch 2: val_loss improved from 1.04844 to 0.84767, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 85s 326ms/step - accuracy: 0.6129 - loss: 0.9759 - val_accuracy: 0.6707 - val_loss: 0.8477
Epoch 3/10
Epoch 3: val_loss improved from 0.84767 to 0.76516, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 79s 309ms/step - accuracy: 0.7196 - loss: 0.7389 - val_accuracy: 0.7083 - val_loss: 0.7652
Epoch 4/10
Epoch 4: val_loss improved from 0.76516 to 0.74391, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 322ms/step - accuracy: 0.7816 - loss: 0.5879 - val_accuracy: 0.7192 - val_loss: 0.7439
Epoch 5/10
Epoch 5: val_loss did not improve from 0.74391
155/155 ━━━━━━━━━━━━━━━━━━━━ 81s 314ms/step - accuracy: 0.8215 - loss: 0.4906 - val_accuracy: 0.7335 - val_loss: 0.7524
Epoch 6/10
Epoch 6: val_loss did not improve from 0.74391
155/155 ━━━━━━━━━━━━━━━━━━━━ 83s 320ms/step - accuracy: 0.8509 - loss: 0.4211 - val_accuracy: 0.7345 - val_loss: 0.7956
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 40ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 38ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 25s 41ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.02099, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 54s 314ms/step - accuracy: 0.3486 - loss: 1.4859 - val_accuracy: 0.6001 - val_loss: 1.0210
Epoch 2/10
Epoch 2: val_loss improved from 1.02099 to 0.81779, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 48s 312ms/step - accuracy: 0.6239 - loss: 0.9619 - val_accuracy: 0.6809 - val_loss: 0.8178
Epoch 3/10
Epoch 3: val_loss improved from 0.81779 to 0.73659, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 48s 310ms/step - accuracy: 0.7198 - loss: 0.7361 - val_accuracy: 0.7140 - val_loss: 0.7366
Epoch 4/10
Epoch 4: val_loss improved from 0.73659 to 0.73509, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 47s 303ms/step - accuracy: 0.7827 - loss: 0.5876 - val_accuracy: 0.7283 - val_loss: 0.7351
Epoch 5/10
Epoch 5: val_loss improved from 0.73509 to 0.70752, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 302ms/step - accuracy: 0.8198 - loss: 0.4950 - val_accuracy: 0.7429 - val_loss: 0.7075
Epoch 6/10
Epoch 6: val_loss did not improve from 0.70752
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 301ms/step - accuracy: 0.8487 - loss: 0.4185 - val_accuracy: 0.7502 - val_loss: 0.7184
Epoch 7/10
Epoch 7: val_loss did not improve from 0.70752
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 313ms/step - accuracy: 0.8672 - loss: 0.3759 - val_accuracy: 0.7522 - val_loss: 0.7448
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 40ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 23s 38ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.03435, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 55s 320ms/step - accuracy: 0.3438 - loss: 1.4872 - val_accuracy: 0.5958 - val_loss: 1.0343
Epoch 2/10
Epoch 2: val_loss improved from 1.03435 to 0.80153, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 47s 301ms/step - accuracy: 0.6247 - loss: 0.9637 - val_accuracy: 0.6946 - val_loss: 0.8015
Epoch 3/10
Epoch 3: val_loss improved from 0.80153 to 0.73840, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 49s 312ms/step - accuracy: 0.7335 - loss: 0.7099 - val_accuracy: 0.7156 - val_loss: 0.7384
Epoch 4/10
Epoch 4: val_loss improved from 0.73840 to 0.72576, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 49s 314ms/step - accuracy: 0.7871 - loss: 0.5774 - val_accuracy: 0.7310 - val_loss: 0.7258
Epoch 5/10
Epoch 5: val_loss did not improve from 0.72576
155/155 ━━━━━━━━━━━━━━━━━━━━ 80s 300ms/step - accuracy: 0.8188 - loss: 0.4968 - val_accuracy: 0.7324 - val_loss: 0.7431
Epoch 6/10
Epoch 6: val_loss did not improve from 0.72576
155/155 ━━━━━━━━━━━━━━━━━━━━ 83s 309ms/step - accuracy: 0.8434 - loss: 0.4402 - val_accuracy: 0.7274 - val_loss: 0.7958
343/343 ━━━━━━━━━━━━━━━━━━━━ 15s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 38ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 25s 41ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.05287, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 54s 318ms/step - accuracy: 0.3444 - loss: 1.4845 - val_accuracy: 0.5762 - val_loss: 1.0529
Epoch 2/10
Epoch 2: val_loss improved from 1.05287 to 0.86542, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 80s 305ms/step - accuracy: 0.6131 - loss: 0.9763 - val_accuracy: 0.6607 - val_loss: 0.8654
Epoch 3/10
Epoch 3: val_loss improved from 0.86542 to 0.74883, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 49s 312ms/step - accuracy: 0.7185 - loss: 0.7468 - val_accuracy: 0.7153 - val_loss: 0.7488
Epoch 4/10
Epoch 4: val_loss improved from 0.74883 to 0.72383, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 83s 318ms/step - accuracy: 0.7784 - loss: 0.5954 - val_accuracy: 0.7326 - val_loss: 0.7238
Epoch 5/10
Epoch 5: val_loss did not improve from 0.72383
155/155 ━━━━━━━━━━━━━━━━━━━━ 79s 302ms/step - accuracy: 0.8136 - loss: 0.5087 - val_accuracy: 0.7324 - val_loss: 0.7405
Epoch 6/10
Epoch 6: val_loss did not improve from 0.72383
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 318ms/step - accuracy: 0.8406 - loss: 0.4450 - val_accuracy: 0.7383 - val_loss: 0.7659
343/343 ━━━━━━━━━━━━━━━━━━━━ 15s 42ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 15s 42ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 25s 41ms/step
------------------
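`get_cnn_feats` and `get_gru_feats` share one pattern: every training row receives a prediction from a model that never saw it (out-of-fold), while test predictions are averaged over the folds. The sketch below isolates that pattern on synthetic data. It is an illustration under stated assumptions, not the notebook's model: a scikit-learn `LogisticRegression` stands in for the Keras network, `oof_features` is a hypothetical helper name, and array sizes come from `len()` instead of the hard-coded 54879 and 19617.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_features(X, y, X_test, n_classes, n_splits=5, seed=2333):
    """Out-of-fold class probabilities for X, fold-averaged probabilities for X_test."""
    oof = np.zeros((len(X), n_classes))
    test_avg = np.zeros((len(X_test), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000)   # stand-in for the Keras model
        clf.fit(X[train_idx], y[train_idx])
        oof[val_idx] = clf.predict_proba(X[val_idx])           # rows this model never saw
        test_avg += clf.predict_proba(X_test) / n_splits       # average over folds
    return oof, test_avg

# Synthetic data: 200 training rows, 5 balanced classes, 50 test rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.arange(200) % 5
X_test = rng.normal(size=(50, 4))

oof, test_avg = oof_features(X, y, X_test, n_classes=5)
print(oof.shape, test_avg.shape)
```

Since each fold's probabilities sum to 1 per row, the fold-averaged test rows do too, which is why the saved feature files can be checked with a simple row-sum test.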


In [None]:
import pandas as pd
import numpy as np

def save_gru_feats_to_csv(gru_train1, gru_test1, gru_train2, gru_test2, base_filename="gru_features"):
    """
    Save the NumPy arrays returned by get_gru_feats to separate CSV files.

    Args:
        gru_train1 (numpy.ndarray): First set of GRU features for the training data.
        gru_test1 (numpy.ndarray): First set of GRU features for the test data.
        gru_train2 (numpy.ndarray): Second set of GRU features for the training data.
        gru_test2 (numpy.ndarray): Second set of GRU features for the test data.
        base_filename (str, optional): Base name for the output files. Defaults to "gru_features".
    """
    # Convert each array to a DataFrame and write it to "<base_filename>_<suffix>.csv"
    arrays = {"train1": gru_train1, "test1": gru_test1, "train2": gru_train2, "test2": gru_test2}
    for suffix, array in arrays.items():
        filename = f"{base_filename}_{suffix}.csv"
        pd.DataFrame(array).to_csv(filename, index=False)
        print(f"Saved to '{filename}'.")

# Call the function to save the features as CSV files.
save_gru_feats_to_csv(gru_train1, gru_test1, gru_train2, gru_test2)

Saved to 'gru_features_train1.csv'.
Saved to 'gru_features_test1.csv'.
Saved to 'gru_features_train2.csv'.
Saved to 'gru_features_test2.csv'.
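
Since these CSV files are read back later for stacking, it can be worth confirming that a feature array survives the write/read round trip unchanged. The snippet is a minimal sketch under stated assumptions: the array `feats` and the file name `demo_features_train1.csv` are hypothetical stand-ins, not the notebook's real GRU outputs.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for a (n_samples, 5) array of class probabilities.
rng = np.random.default_rng(0)
feats = rng.random((6, 5))
feats /= feats.sum(axis=1, keepdims=True)  # make each row sum to 1

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_features_train1.csv")
    # Same call pattern as the save function: no index column is written.
    pd.DataFrame(feats).to_csv(path, index=False)
    restored = pd.read_csv(path).to_numpy()

print(restored.shape)
print(np.allclose(feats, restored))
```

Writing with `index=False` matters here: with the index included, `read_csv` would bring back an extra column and the stacked feature matrix would be one column wider per file.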


## **Feature generation with LSTM**

In [None]:
import gc
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import load_model, Sequential
from tensorflow.keras.layers import Embedding, LSTM, Flatten, Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

In [None]:
def get_lstm_feats(rnd=1):
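    FEAT_CNT = 5             # number of cross-validation folds
    NUM_WORDS = 16000        # vocabulary size kept by the tokenizer
    N = 12                   # embedding dimension and number of LSTM units
    MAX_LEN = 300            # sequences are padded/truncated to this length
    NUM_CLASSES = 5          # number of authors
    MODEL_P = 'nn_model.h5'  # checkpoint path for the best-val_loss model

    # Out-of-fold train predictions and fold-averaged test predictions
    # (first pair: final-epoch model, second pair: best checkpoint)
    train_pred, test_pred = np.zeros((len(train), NUM_CLASSES)), np.zeros((len(test), NUM_CLASSES))
    best_val_train_pred, best_val_test_pred = np.zeros((len(train), NUM_CLASSES)), np.zeros((len(test), NUM_CLASSES))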

    tmp_X = train['text']
    tmp_Y = train['author']
    tmp_X_test = test['text']

    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)

    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)

    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)

    ttrain_y = lb.transform(tmp_Y)
    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=2333*rnd)
    for train_index, test_index in skf.split(ttrain_x,tmp_Y):
        model = Sequential()
        model.add(Embedding(NUM_WORDS, N, input_length=MAX_LEN))
        model.add(LSTM(N, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
        model.add(Flatten())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(NUM_CLASSES, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es=EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        model.fit(ttrain_x[train_index], ttrain_y[train_index],
                  validation_split=0.1,
                  batch_size=256, epochs=10,
                  verbose=1,
                  callbacks=[mc,es],
                  shuffle=False
                 )

        # Feature set 1: predictions from the model as of the last trained epoch
        train_pred[test_index] = model.predict(ttrain_x[test_index])
        test_pred += model.predict(ttest_x)/FEAT_CNT

        # Feature set 2: predictions from the checkpoint with the lowest val_loss
        model = load_model(MODEL_P)
        best_val_train_pred[test_index] = model.predict(ttrain_x[test_index])
        best_val_test_pred += model.predict(ttest_x)/FEAT_CNT

        del model
        gc.collect()
        print('------------------')

    return train_pred,test_pred,best_val_train_pred,best_val_test_pred
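
The function above fills the training features fold by fold, so each training row is predicted by a model that never saw it, while the test features are averaged over the fold models. The sketch below shows the same out-of-fold pattern on synthetic data. It is an illustration under stated assumptions: `LogisticRegression` stands in for the Keras LSTM so that it runs quickly, and every name in it (`oof_pred`, `fit_idx`, `holdout_idx`, and so on) is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class data; the first 200 rows play the role of "train".
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

n_splits, n_classes = 5, 3
oof_pred = np.zeros((len(X_train), n_classes))   # out-of-fold train features
test_pred = np.zeros((len(X_test), n_classes))   # fold-averaged test features

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
for fit_idx, holdout_idx in skf.split(X_train, y_train):
    # Stand-in model: the notebook trains an LSTM at this point.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[fit_idx], y_train[fit_idx])

    # Held-out rows are predicted by a model that did not train on them.
    oof_pred[holdout_idx] = model.predict_proba(X_train[holdout_idx])
    # Each fold contributes 1/n_splits of the final test prediction.
    test_pred += model.predict_proba(X_test) / n_splits

print(oof_pred.shape, test_pred.shape)
```

Because every training row receives exactly one held-out prediction, the resulting columns can be fed to a second-stage model without leaking the first-stage model's training labels.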

In [None]:
lstm_train1,lstm_test1,lstm_train2,lstm_test2 = get_lstm_feats(1)



Epoch 1/10
Epoch 1: val_loss improved from inf to 1.05573, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 62s 362ms/step - accuracy: 0.3223 - loss: 1.5054 - val_accuracy: 0.5958 - val_loss: 1.0557
Epoch 2/10
Epoch 2: val_loss improved from 1.05573 to 0.82943, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 55s 357ms/step - accuracy: 0.5999 - loss: 1.0193 - val_accuracy: 0.6812 - val_loss: 0.8294
Epoch 3/10
Epoch 3: val_loss improved from 0.82943 to 0.71616, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 56s 358ms/step - accuracy: 0.7162 - loss: 0.7512 - val_accuracy: 0.7304 - val_loss: 0.7162
Epoch 4/10
Epoch 4: val_loss did not improve from 0.71616
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 369ms/step - accuracy: 0.7875 - loss: 0.5876 - val_accuracy: 0.7238 - val_loss: 0.7283
Epoch 5/10
Epoch 5: val_loss improved from 0.71616 to 0.68854, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 81s 361ms/step - accuracy: 0.8278 - loss: 0.4841 - val_accuracy: 0.7493 - val_loss: 0.6885
Epoch 6/10
Epoch 6: val_loss did not improve from 0.68854
155/155 ━━━━━━━━━━━━━━━━━━━━ 57s 366ms/step - accuracy: 0.8527 - loss: 0.4142 - val_accuracy: 0.7556 - val_loss: 0.7162
Epoch 7/10
Epoch 7: val_loss did not improve from 0.68854
155/155 ━━━━━━━━━━━━━━━━━━━━ 83s 370ms/step - accuracy: 0.8706 - loss: 0.3641 - val_accuracy: 0.7449 - val_loss: 0.7951
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 26s 42ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.02663, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 60s 355ms/step - accuracy: 0.3298 - loss: 1.4929 - val_accuracy: 0.5930 - val_loss: 1.0266
Epoch 2/10
Epoch 2: val_loss improved from 1.02663 to 0.84276, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 81s 350ms/step - accuracy: 0.6210 - loss: 0.9606 - val_accuracy: 0.6645 - val_loss: 0.8428
Epoch 3/10
Epoch 3: val_loss improved from 0.84276 to 0.78109, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 81s 346ms/step - accuracy: 0.7214 - loss: 0.7341 - val_accuracy: 0.7026 - val_loss: 0.7811
Epoch 4/10
Epoch 4: val_loss did not improve from 0.78109
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 363ms/step - accuracy: 0.7735 - loss: 0.6050 - val_accuracy: 0.7078 - val_loss: 0.7834
Epoch 5/10
Epoch 5: val_loss did not improve from 0.78109
155/155 ━━━━━━━━━━━━━━━━━━━━ 79s 347ms/step - accuracy: 0.8079 - loss: 0.5238 - val_accuracy: 0.7140 - val_loss: 0.8025
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 40ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 25s 41ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.11499, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 62s 371ms/step - accuracy: 0.3339 - loss: 1.4988 - val_accuracy: 0.5299 - val_loss: 1.1150
Epoch 2/10
Epoch 2: val_loss improved from 1.11499 to 0.87212, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 80s 357ms/step - accuracy: 0.5871 - loss: 1.0262 - val_accuracy: 0.6575 - val_loss: 0.8721
Epoch 3/10
Epoch 3: val_loss improved from 0.87212 to 0.76084, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 356ms/step - accuracy: 0.6992 - loss: 0.7849 - val_accuracy: 0.7121 - val_loss: 0.7608
Epoch 4/10
Epoch 4: val_loss improved from 0.76084 to 0.71178, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 57s 369ms/step - accuracy: 0.7672 - loss: 0.6260 - val_accuracy: 0.7381 - val_loss: 0.7118
Epoch 5/10
Epoch 5: val_loss did not improve from 0.71178
155/155 ━━━━━━━━━━━━━━━━━━━━ 55s 355ms/step - accuracy: 0.8103 - loss: 0.5274 - val_accuracy: 0.7335 - val_loss: 0.7220
Epoch 6/10
Epoch 6: val_loss improved from 0.71178 to 0.70776, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 83s 364ms/step - accuracy: 0.8361 - loss: 0.4599 - val_accuracy: 0.7454 - val_loss: 0.7078
Epoch 7/10
Epoch 7: val_loss did not improve from 0.70776
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 374ms/step - accuracy: 0.8545 - loss: 0.4094 - val_accuracy: 0.7481 - val_loss: 0.7461
Epoch 8/10
Epoch 8: val_loss did not improve from 0.70776
155/155 ━━━━━━━━━━━━━━━━━━━━ 78s 346ms/step - accuracy: 0.8662 - loss: 0.3734 - val_accuracy: 0.7524 - val_loss: 0.7482
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 40ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 24s 40ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.06543, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 65s 380ms/step - accuracy: 0.3330 - loss: 1.4858 - val_accuracy: 0.5550 - val_loss: 1.0654
Epoch 2/10
Epoch 2: val_loss improved from 1.06543 to 0.88230, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 78s 355ms/step - accuracy: 0.5894 - loss: 1.0088 - val_accuracy: 0.6488 - val_loss: 0.8823
Epoch 3/10
Epoch 3: val_loss improved from 0.88230 to 0.77179, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 357ms/step - accuracy: 0.6966 - loss: 0.7833 - val_accuracy: 0.7046 - val_loss: 0.7718
Epoch 4/10
Epoch 4: val_loss improved from 0.77179 to 0.74534, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 84s 369ms/step - accuracy: 0.7631 - loss: 0.6344 - val_accuracy: 0.7235 - val_loss: 0.7453
Epoch 5/10
Epoch 5: val_loss did not improve from 0.74534
155/155 ━━━━━━━━━━━━━━━━━━━━ 81s 362ms/step - accuracy: 0.7989 - loss: 0.5466 - val_accuracy: 0.7304 - val_loss: 0.7536
Epoch 6/10
Epoch 6: val_loss did not improve from 0.74534
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 359ms/step - accuracy: 0.8262 - loss: 0.4810 - val_accuracy: 0.7256 - val_loss: 0.7891
343/343 ━━━━━━━━━━━━━━━━━━━━ 15s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 14s 41ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 26s 42ms/step
------------------
Epoch 1/10
Epoch 1: val_loss improved from inf to 1.04392, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 62s 367ms/step - accuracy: 0.3313 - loss: 1.4923 - val_accuracy: 0.5723 - val_loss: 1.0439
Epoch 2/10
Epoch 2: val_loss improved from 1.04392 to 0.82492, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 371ms/step - accuracy: 0.6294 - loss: 0.9511 - val_accuracy: 0.6727 - val_loss: 0.8249
Epoch 3/10
Epoch 3: val_loss improved from 0.82492 to 0.71815, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 56s 362ms/step - accuracy: 0.7275 - loss: 0.7263 - val_accuracy: 0.7267 - val_loss: 0.7182
Epoch 4/10
Epoch 4: val_loss improved from 0.71815 to 0.69820, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 56s 361ms/step - accuracy: 0.7885 - loss: 0.5798 - val_accuracy: 0.7404 - val_loss: 0.6982
Epoch 5/10
Epoch 5: val_loss improved from 0.69820 to 0.69454, saving model to nn_model.h5
155/155 ━━━━━━━━━━━━━━━━━━━━ 82s 361ms/step - accuracy: 0.8268 - loss: 0.4858 - val_accuracy: 0.7495 - val_loss: 0.6945
Epoch 6/10
Epoch 6: val_loss did not improve from 0.69454
155/155 ━━━━━━━━━━━━━━━━━━━━ 58s 372ms/step - accuracy: 0.8516 - loss: 0.4142 - val_accuracy: 0.7490 - val_loss: 0.7269
Epoch 7/10
Epoch 7: val_loss did not improve from 0.69454
155/155 ━━━━━━━━━━━━━━━━━━━━ 81s 364ms/step - accuracy: 0.8703 - loss: 0.3652 - val_accuracy: 0.7513 - val_loss: 0.7851
343/343 ━━━━━━━━━━━━━━━━━━━━ 15s 43ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━
343/343 ━━━━━━━━━━━━━━━━━━━━ 15s 43ms/step
614/614 ━━━━━━━━━━━━━━━━━━━━ 24s 39ms/step
------------------
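
The training logs show why the function keeps two feature sets: with `EarlyStopping(patience=2)` and no weight restoration, the in-memory model is the one from the last epoch, while `nn_model.h5` holds the epoch with the lowest `val_loss`. The snippet below is a simplified re-implementation of the patience rule (an assumption for illustration, not Keras's actual code), applied to the `val_loss` values logged for the first two LSTM folds. The function name `simulate_early_stopping` is hypothetical.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (best_epoch, last_epoch), both 1-based, for a list of val_loss values."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember this epoch and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # `patience` epochs without improvement: training stops here.
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# val_loss per epoch, copied from the logs of LSTM folds 1 and 2.
fold1_val_loss = [1.0557, 0.8294, 0.7162, 0.7283, 0.6885, 0.7162, 0.7951]
fold2_val_loss = [1.0266, 0.8428, 0.7811, 0.7834, 0.8025]

fold1_result = simulate_early_stopping(fold1_val_loss)
fold2_result = simulate_early_stopping(fold2_val_loss)
print(fold1_result, fold2_result)
```

In both folds the last trained epoch is two epochs past the best one, so feature set 1 comes from a slightly overfit model and feature set 2 from the checkpointed one.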


In [None]:
import pandas as pd
import numpy as np

def save_lstm_feats_to_csv(lstm_train1, lstm_test1, lstm_train2, lstm_test2, base_filename="lstm_features"):
    """Save the NumPy arrays returned by get_lstm_feats to separate CSV files."""
    # Convert each array to a DataFrame and write it to "<base_filename>_<suffix>.csv"
    arrays = {"train1": lstm_train1, "test1": lstm_test1, "train2": lstm_train2, "test2": lstm_test2}
    for suffix, array in arrays.items():
        filename = f"{base_filename}_{suffix}.csv"
        pd.DataFrame(array).to_csv(filename, index=False)
        print(f"Saved to '{filename}'.")

# Call the function to save the features as CSV files.
save_lstm_feats_to_csv(lstm_train1, lstm_test1, lstm_train2, lstm_test2)

Saved to 'lstm_features_train1.csv'.
Saved to 'lstm_features_test1.csv'.
Saved to 'lstm_features_train2.csv'.
Saved to 'lstm_features_test2.csv'.


# **Ensemble learning**

In [None]:
# Load the Logistic, CNN, GRU, and LSTM features
import os

# Path to the features folder
folder_path = '/content/drive/MyDrive/Colab Notebooks/ESAA/25-1 OB/mini project 2/features'

# List the CSV files in the folder (train and test separately).
# os.listdir returns names in arbitrary order, so both lists are sorted; this keeps
# train and test columns aligned, assuming paired file names differ only in "train"/"test".
train_csv = sorted(f for f in os.listdir(folder_path) if f.endswith('.csv') and 'train' in f.lower())
test_csv = sorted(f for f in os.listdir(folder_path) if f.endswith('.csv') and 'test' in f.lower())

# Read each file into a DataFrame and collect them in a list
train_features = [pd.read_csv(os.path.join(folder_path, file)) for file in train_csv]
test_features = [pd.read_csv(os.path.join(folder_path, file)) for file in test_csv]
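
The train and test lists are built by two separate filters over the folder, so the stacked columns only line up if the k-th train file and the k-th test file come from the same model. One way to make that pairing explicit is sketched here. It rests on an assumption: that each test file is named like its train counterpart with `train` replaced by `test`, as in the `gru_features_*` and `lstm_features_*` files saved earlier. The helper `pair_feature_files` is hypothetical.

```python
def pair_feature_files(filenames):
    """Pair each '*train*.csv' name with its '*test*.csv' counterpart."""
    available = set(filenames)
    train_files = sorted(
        f for f in filenames if f.endswith(".csv") and "train" in f.lower()
    )
    pairs = []
    for train_file in train_files:
        # Assumed naming convention: same name with "train" swapped for "test".
        test_file = train_file.replace("train", "test")
        if test_file not in available:
            raise FileNotFoundError(f"no test counterpart for {train_file}")
        pairs.append((train_file, test_file))
    return pairs


# Deliberately shuffled names, as a directory listing might return them.
names = [
    "lstm_features_test1.csv",
    "gru_features_train1.csv",
    "gru_features_test1.csv",
    "lstm_features_train1.csv",
]
pairs = pair_feature_files(names)
print(pairs)
```

Reading the files in the order given by `pairs` guarantees that column k of the train matrix and column k of the test matrix were produced by the same model.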

In [None]:
# Concatenate everything into a single DataFrame
train_features_df = pd.concat(train_features, axis=1, ignore_index=True)
test_features_df = pd.concat(test_features, axis=1, ignore_index=True)

In [None]:
# Check that the concatenation worked (there should be 45 columns)
print(train_features_df.shape)
train_features_df

(54879, 45)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,35,36,37,38,39,40,41,42,43,44
0,0.000210,0.000020,2.864701e-04,9.994677e-01,1.615322e-05,0.000648,0.000148,1.133970e-03,9.979519e-01,1.175307e-04,...,8.535597e-07,2.546287e-07,4.260121e-06,9.999942e-01,4.076763e-07,3.674807e-05,0.000016,1.006387e-04,9.998255e-01,2.096128e-05
1,0.451201,0.091890,1.271218e-01,1.295207e-01,2.002665e-01,0.334110,0.226153,1.921324e-01,8.953078e-02,1.580736e-01,...,5.169988e-01,3.564796e-01,8.244631e-02,2.229021e-02,2.178517e-02,5.287315e-01,0.321753,7.781456e-02,3.155240e-02,4.014891e-02
2,0.000011,0.999989,6.469904e-08,8.570756e-08,1.150365e-10,0.000044,0.999955,8.213282e-07,8.945881e-07,1.068220e-09,...,9.531045e-08,9.999998e-01,8.174168e-11,1.341125e-10,3.036066e-12,7.465615e-07,0.999999,3.755135e-09,2.336929e-08,1.410955e-10
3,0.000214,0.000006,2.037654e-03,2.363783e-06,9.977409e-01,0.000489,0.000015,8.124831e-03,1.245251e-05,9.913588e-01,...,2.545457e-04,4.342698e-07,9.573391e-03,9.239227e-08,9.901716e-01,2.138570e-03,0.000059,9.552764e-02,1.255973e-05,9.022618e-01
4,0.000273,0.000104,8.549729e-04,9.987586e-01,9.355747e-06,0.000514,0.000191,1.234133e-03,9.980285e-01,3.225793e-05,...,1.215439e-06,5.598211e-06,4.018148e-04,9.995870e-01,4.393047e-06,1.438705e-05,0.000049,1.128671e-03,9.987735e-01,3.421135e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54874,0.437367,0.402492,8.113391e-02,5.575850e-02,2.324907e-02,0.287389,0.346206,1.833528e-01,3.958485e-02,1.434680e-01,...,2.288683e-02,9.002097e-01,6.451757e-02,1.040818e-02,1.977726e-03,7.099365e-02,0.838530,6.947847e-02,1.397646e-02,7.021556e-03
54875,0.080707,0.197824,5.063964e-01,3.188758e-02,1.831856e-01,0.077907,0.233811,4.856631e-01,5.396711e-02,1.486522e-01,...,1.164779e-01,9.962648e-02,2.557128e-01,1.595149e-01,3.686680e-01,2.229027e-01,0.142703,2.398252e-01,1.589911e-01,2.355776e-01
54876,0.015265,0.796027,1.289320e-02,1.754256e-01,3.890761e-04,0.021388,0.799632,2.821165e-02,1.491129e-01,1.655510e-03,...,6.541275e-04,9.827801e-01,6.157595e-04,1.592025e-02,2.975902e-05,1.647313e-03,0.969716,1.402360e-03,2.703138e-02,2.027040e-04
54877,0.010402,0.005244,4.179246e-02,9.366803e-01,5.881325e-03,0.033132,0.014365,1.578333e-01,7.544321e-01,4.023805e-02,...,2.279359e-02,1.825370e-03,1.411782e-02,9.409635e-01,2.029973e-02,9.171255e-02,0.005213,3.389686e-02,7.882127e-01,8.096502e-02


In [None]:
all_nn_train = train_features_df.copy()
all_nn_test = test_features_df.copy()

# Final ensemble data
cols_to_drop = ['index', 'text']
train_X = train.drop(cols_to_drop+['author'], axis=1).values
test_X = test.drop(cols_to_drop, axis=1).values
train_X = np.hstack([train_X,train_svd,train_svd2])
test_X = np.hstack([test_X,test_svd,test_svd2])

f_train_X = np.hstack([train_X, all_nn_train])
f_train_X = np.round(f_train_X,4)
f_test_X = np.hstack([test_X, all_nn_test])
f_test_X = np.round(f_test_X,4)
print(f_train_X.shape, f_test_X.shape)

(54879, 168) (19617, 168)
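
The printed shape follows from the column counts: the 45 neural-network feature columns are appended to the hand-crafted and SVD features, which must therefore account for 168 - 45 = 123 columns. The toy example below is a sketch with hypothetical placeholder arrays (`base_demo`, `nn_demo`) and only four rows; it reproduces the stacking and the rounding to four decimals.

```python
import numpy as np

n_rows = 4  # small placeholder; the notebook has 54879 train rows
base_demo = np.zeros((n_rows, 123))        # stands in for hand-crafted + SVD features
nn_demo = np.full((n_rows, 45), 0.123456)  # stands in for the 45 NN probability columns

# Column-wise stack, then round to 4 decimals as in the cell above.
stacked = np.round(np.hstack([base_demo, nn_demo]), 4)
print(stacked.shape)
print(stacked[0, -1])
```

Rounding to four decimals discards very small probabilities (anything below 0.00005 becomes 0), which is worth keeping in mind when the neural-network columns contain values such as 1e-10.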


## **Stacking with Logistic Regression as the final model**

- Standardize the features with StandardScaler
- Train Logistic Regression on f_train_X and predict on f_test_X


In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics import log_loss, accuracy_score

In [None]:
# Standardize the features with StandardScaler (fit on train only, then apply to test)
scaler = StandardScaler()
f_train_X_scaled = scaler.fit_transform(f_train_X)
f_test_X_scaled = scaler.transform(f_test_X)
train_Y = train['author']
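
The scaler is fit on the training matrix only and then applied to the test matrix, so the test rows are shifted and scaled by training statistics. The sketch below checks that behavior on random data; the arrays `X_train_demo` and `X_test_demo` are hypothetical placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test_demo = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from train
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the train mean/std

# The same transform written by hand (StandardScaler uses the population std, ddof=0).
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
print(np.allclose(X_test_scaled, manual))
```

The scaled test columns will generally not have exactly zero mean or unit variance, which is expected: only the training statistics are used.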

In [None]:
# Train a baseline Logistic Regression model
clf = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    penalty='l2',
    C=1.0,
    max_iter=1000
)
clf.fit(f_train_X_scaled, train_Y)



In [None]:
def cv_logreg_submit(k_cnt=3, s_flag=False):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(f_train_X)
    X_test_scaled = scaler.transform(f_test_X)
    y = train['author'].values

    if s_flag:
        kf = StratifiedKFold(n_splits=k_cnt, shuffle=True, random_state=42)
    else:
        kf = KFold(n_splits=k_cnt, shuffle=True, random_state=42)

    preds_per_fold = []
    weights = []
    org_train_pred = np.zeros((X_scaled.shape[0], 5))

    for fold, (train_idx, val_idx) in enumerate(kf.split(X_scaled, y)):
        X_train, X_val = X_scaled[train_idx], X_scaled[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model = LogisticRegression(
            multi_class='multinomial',
            solver='lbfgs',
            penalty='l2',
            C=1.0,
            max_iter=1000
        )
        model.fit(X_train, y_train)

        val_pred = model.predict_proba(X_val)
        full_train_pred = model.predict_proba(X_scaled)
        test_fold_pred = model.predict_proba(X_test_scaled)

        val_loss = log_loss(y_val, val_pred)
        print(f"[Fold {fold}] Log loss: {val_loss:.5f}")
        preds_per_fold.append((test_fold_pred, val_loss))
        weights.append(1.0 / val_loss)
        org_train_pred += full_train_pred

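    # Average of the fold models' predictions on the full training set.
    # Each model was fit on (k_cnt - 1) / k_cnt of these rows, so the log loss
    # computed from it at the end is mostly in-sample, not truly out-of-fold.
    org_train_pred /= k_cnt
    avg_k_score = np.mean([v for _, v in preds_per_fold])

    # Simple average ensemble
    test_pred_avg = np.mean([p for p, _ in preds_per_fold], axis=0)

    # Weighted ensemble (weights are the inverse of each fold's validation loss)
    weight_sum = sum(weights)
    test_pred_weighted = sum(p * w for (p, _), w in zip(preds_per_fold, weights)) / weight_sum

    # Prediction from the fold with the lowest validation loss
    best_fold_pred = min(preds_per_fold, key=lambda x: x[1])[0]

    # Helper that writes a prediction array to a submission file
    def save_pred(pred, filename):
        submiss = pd.read_csv("/content/drive/MyDrive/ESAA_OB/Dataset/novel_sample_submission.csv")
        pred = np.round(pred, 4)
        for i in range(5):
            submiss[str(i)] = pred[:, i]
        submiss.to_csv(filename, index=False)

    # Save the averaged prediction (the weighted and best-fold predictions above are not written out)
    save_pred(test_pred_avg, f"logreg_avg_{k_cnt}.csv")

    # Report the scores
    print("✅ Local average valid log loss:", avg_k_score)
    print("✅ Full train OOF log loss:", log_loss(y, org_train_pred))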

In [None]:
cv_logreg_submit(k_cnt=5, s_flag=True)



[Fold 0] Log loss: 0.48223
[Fold 1] Log loss: 0.47525
[Fold 2] Log loss: 0.47266
[Fold 3] Log loss: 0.47424
[Fold 4] Log loss: 0.46914
✅ Local average valid log loss: 0.4747055800557282
✅ Full train OOF log loss: 0.46091900674167563
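
The function also computes an inverse-loss weighted ensemble. Plugging in the five fold losses printed above shows how much that weighting can differ from a plain average. This is a small worked sketch; the variable names are hypothetical and the losses are the rounded values from the output.

```python
import numpy as np

# Validation log loss per fold, as printed (rounded to 5 decimals).
fold_losses = np.array([0.48223, 0.47525, 0.47266, 0.47424, 0.46914])

# Inverse-loss weights, normalized to sum to 1 (same rule as in cv_logreg_submit).
inv = 1.0 / fold_losses
weights = inv / inv.sum()

mean_loss = fold_losses.mean()
best_fold = int(fold_losses.argmin())

print(mean_loss)
print(weights.round(4))
print(best_fold)
```

With losses this close together, every weight lands within about 0.003 of the uniform value 0.2, so the weighted ensemble is nearly identical to the simple average that gets saved.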


In [None]:
# # Hyperparameter tuning
# param_grid = {
#     'C': [0.01, 0.1, 1, 10],
#     'solver': ['lbfgs', 'newton-cg'],
#     'max_iter': [500, 1000]
# }

# grid_clf = GridSearchCV(
#     LogisticRegression(multi_class='multinomial', penalty='l2'),
#     param_grid,
#     cv=5,
#     scoring='neg_log_loss',
#     verbose=1,
#     n_jobs=-1
# )

# grid_clf.fit(f_train_X_scaled, train_Y)

# print("Best parameters:", grid_clf.best_params_)
# print("Best log loss:", -grid_clf.best_score_)

# # Predict with the best model
# best_clf = grid_clf.best_estimator_
# test_pred_best = best_clf.predict_proba(f_test_X_scaled)
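
The tuning cell above is left commented out in the notebook. For reference, here is a small runnable sketch of the same `GridSearchCV` pattern on synthetic data. It is an illustration only: the data comes from `make_classification`, the grid is reduced to `C`, and names such as `X_demo` and `grid_demo` are hypothetical. It says nothing about which parameters would be best on the real features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the stacked features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

grid_demo = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    cv=3,
    scoring="neg_log_loss",  # scikit-learn maximizes scores, hence the negated loss
)
grid_demo.fit(X_demo, y_demo)

best_log_loss = -grid_demo.best_score_  # flip the sign back to a log loss
print(grid_demo.best_params_, best_log_loss)
```

Because the scorer is negated, `best_score_` is at most zero and has to be sign-flipped before comparing it with the fold log losses reported earlier.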

In [None]:
from google.colab import files
k_cnt = 5
# Download the submission file
files.download(f'logreg_avg_{k_cnt}.csv')


## **Stacking with XGB as the final model**

In [None]:
# Concatenate everything into a single DataFrame
train_features_df = pd.concat(train_features, axis=1, ignore_index=True)
test_features_df = pd.concat(test_features, axis=1, ignore_index=True)

In [None]:
# Check that the concatenation worked (there should be 45 columns)
print(train_features_df.shape)
train_features_df

(54879, 45)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,35,36,37,38,39,40,41,42,43,44
0,0.000210,0.000020,2.864701e-04,9.994677e-01,1.615322e-05,0.000648,0.000148,1.133970e-03,9.979519e-01,1.175307e-04,...,8.535597e-07,2.546287e-07,4.260121e-06,9.999942e-01,4.076763e-07,3.674807e-05,0.000016,1.006387e-04,9.998255e-01,2.096128e-05
1,0.451201,0.091890,1.271218e-01,1.295207e-01,2.002665e-01,0.334110,0.226153,1.921324e-01,8.953078e-02,1.580736e-01,...,5.169988e-01,3.564796e-01,8.244631e-02,2.229021e-02,2.178517e-02,5.287315e-01,0.321753,7.781456e-02,3.155240e-02,4.014891e-02
2,0.000011,0.999989,6.469904e-08,8.570756e-08,1.150365e-10,0.000044,0.999955,8.213282e-07,8.945881e-07,1.068220e-09,...,9.531045e-08,9.999998e-01,8.174168e-11,1.341125e-10,3.036066e-12,7.465615e-07,0.999999,3.755135e-09,2.336929e-08,1.410955e-10
3,0.000214,0.000006,2.037654e-03,2.363783e-06,9.977409e-01,0.000489,0.000015,8.124831e-03,1.245251e-05,9.913588e-01,...,2.545457e-04,4.342698e-07,9.573391e-03,9.239227e-08,9.901716e-01,2.138570e-03,0.000059,9.552764e-02,1.255973e-05,9.022618e-01
4,0.000273,0.000104,8.549729e-04,9.987586e-01,9.355747e-06,0.000514,0.000191,1.234133e-03,9.980285e-01,3.225793e-05,...,1.215439e-06,5.598211e-06,4.018148e-04,9.995870e-01,4.393047e-06,1.438705e-05,0.000049,1.128671e-03,9.987735e-01,3.421135e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54874,0.437367,0.402492,8.113391e-02,5.575850e-02,2.324907e-02,0.287389,0.346206,1.833528e-01,3.958485e-02,1.434680e-01,...,2.288683e-02,9.002097e-01,6.451757e-02,1.040818e-02,1.977726e-03,7.099365e-02,0.838530,6.947847e-02,1.397646e-02,7.021556e-03
54875,0.080707,0.197824,5.063964e-01,3.188758e-02,1.831856e-01,0.077907,0.233811,4.856631e-01,5.396711e-02,1.486522e-01,...,1.164779e-01,9.962648e-02,2.557128e-01,1.595149e-01,3.686680e-01,2.229027e-01,0.142703,2.398252e-01,1.589911e-01,2.355776e-01
54876,0.015265,0.796027,1.289320e-02,1.754256e-01,3.890761e-04,0.021388,0.799632,2.821165e-02,1.491129e-01,1.655510e-03,...,6.541275e-04,9.827801e-01,6.157595e-04,1.592025e-02,2.975902e-05,1.647313e-03,0.969716,1.402360e-03,2.703138e-02,2.027040e-04
54877,0.010402,0.005244,4.179246e-02,9.366803e-01,5.881325e-03,0.033132,0.014365,1.578333e-01,7.544321e-01,4.023805e-02,...,2.279359e-02,1.825370e-03,1.411782e-02,9.409635e-01,2.029973e-02,9.171255e-02,0.005213,3.389686e-02,7.882127e-01,8.096502e-02


In [None]:
all_nn_train = train_features_df.copy()
all_nn_test = test_features_df.copy()

# Final ensemble data
cols_to_drop = ['index', 'text']
train_X = train.drop(cols_to_drop+['author'], axis=1).values
test_X = test.drop(cols_to_drop, axis=1).values
train_X = np.hstack([train_X,train_svd,train_svd2])
test_X = np.hstack([test_X,test_svd,test_svd2])

f_train_X = np.hstack([train_X, all_nn_train])
f_train_X = np.round(f_train_X,4)
f_test_X = np.hstack([test_X, all_nn_test])
f_test_X = np.round(f_test_X,4)
print(f_train_X.shape, f_test_X.shape)

(54879, 168) (19617, 168)


In [None]:
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, StratifiedKFold

train_Y = train['author']

# Final ensemble.
def cv_test(k_cnt=3, s_flag = False):
    rnd = 42
    if s_flag:
        kf = StratifiedKFold(n_splits=k_cnt, shuffle=True, random_state=rnd)
    else:
        kf = KFold(n_splits=k_cnt, shuffle=True, random_state=rnd)
    test_pred = None
    weighted_test_pred = None
    org_train_pred = None
    avg_k_score = 0
    reverse_score = 0
    best_loss = 100
    best_single_pred = None
    for train_index, test_index in kf.split(f_train_X,train_Y):
        X_train, X_test = f_train_X[train_index], f_train_X[test_index]
        y_train, y_test = train_Y[train_index], train_Y[test_index]
        params = {
                'colsample_bytree': 0.7,
                'subsample': 0.8,
                'eta': 0.04,
                'max_depth': 3,
                'eval_metric':'mlogloss',
                'objective':'multi:softprob',
                'num_class':5,
                'tree_method':'hist'
        }

        d_train = xgb.DMatrix(X_train, y_train)
        d_valid = xgb.DMatrix(X_test, y_test)
        d_test = xgb.DMatrix(f_test_X)

        watchlist = [(d_train, 'train'), (d_valid, 'valid')]
        m = xgb.train(params, d_train, 2000, watchlist,
                        early_stopping_rounds=50,
                        verbose_eval=200)

        train_pred = m.predict(d_train)
        valid_pred = m.predict(d_valid)
        tmp_train_pred = m.predict(xgb.DMatrix(f_train_X))

        train_score = log_loss(y_train,train_pred)
        valid_score = log_loss(y_test,valid_pred)
        print('train log loss',train_score,'valid log loss',valid_score)
        avg_k_score += valid_score
        rev_valid_score = 1.0/valid_score
        reverse_score += rev_valid_score
        print('rev',rev_valid_score)

        if test_pred is None:
            test_pred = m.predict(d_test)
            weighted_test_pred = test_pred*rev_valid_score
            org_train_pred = tmp_train_pred
            best_loss = valid_score
            best_single_pred = test_pred.copy()  # copy: test_pred is accumulated in place below
        else:
            curr_pred = m.predict(d_test)
            test_pred += curr_pred
            weighted_test_pred += curr_pred*rev_valid_score
            org_train_pred += tmp_train_pred

            if valid_score < best_loss:
                print('BETTER')
                best_loss = valid_score
                best_single_pred = curr_pred

    test_pred = test_pred / k_cnt
    test_pred = np.round(test_pred,4)
    org_train_pred = org_train_pred / k_cnt
    avg_k_score = avg_k_score/k_cnt

    submiss=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ESAA/25-1 OB/mini project 2/sample_submission.csv")
    submiss['0']=test_pred[:,0]
    submiss['1']=test_pred[:,1]
    submiss['2']=test_pred[:,2]
    submiss['3']=test_pred[:,3]
    submiss['4']=test_pred[:,4]
    submiss.to_csv("xgb_{}.csv".format(k_cnt),index=False)
    print(reverse_score)
    # weighted
    submiss=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ESAA/25-1 OB/mini project 2/sample_submission.csv")
    weighted_test_pred = weighted_test_pred / reverse_score
    weighted_test_pred = np.round(weighted_test_pred,4)
    submiss['0']=weighted_test_pred[:,0]
    submiss['1']=weighted_test_pred[:,1]
    submiss['2']=weighted_test_pred[:,2]
    submiss['3']=weighted_test_pred[:,3]
    submiss['4']=weighted_test_pred[:,4]
    submiss.to_csv("weighted_{}.csv".format(k_cnt),index=False)
    # best single
    submiss=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ESAA/25-1 OB/mini project 2/sample_submission.csv")
    best_single_pred = np.round(best_single_pred,4)
    submiss['0']=best_single_pred[:,0]
    submiss['1']=best_single_pred[:,1]
    submiss['2']=best_single_pred[:,2]
    submiss['3']=best_single_pred[:,3]
    submiss['4']=best_single_pred[:,4]
    submiss.to_csv("single_{}.csv".format(k_cnt),index=False)

    # train log loss
    print('local average valid loss',avg_k_score)
    print('train log loss', log_loss(train_Y,org_train_pred))
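
The three submission blocks in the function differ only in which prediction array they write and in the output file name. The sketch below shows one way to factor that step into a helper. The name `write_submission`, its parameters, and the toy sample frame are hypothetical. The only thing taken from the code above is that the sample submission has class columns `'0'` to `'4'`; the `index` column is assumed.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(sample, pred, path, n_classes=5, decimals=4):
    """Fill class columns '0'..'n_classes-1' with rounded probabilities and save as CSV."""
    out = sample.copy()
    pred = np.round(np.asarray(pred, dtype=float), decimals)
    for c in range(n_classes):
        out[str(c)] = pred[:, c]
    out.to_csv(path, index=False)
    return out


# Toy stand-in for sample_submission.csv (layout assumed, see above).
toy_sample = pd.DataFrame({"index": [0, 1]})
for c in range(5):
    toy_sample[str(c)] = np.zeros(2)

toy_pred = np.full((2, 5), 0.2)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_submission.csv")
    write_submission(toy_sample, toy_pred, toy_path)
    written = pd.read_csv(toy_path)

print(written)
```

With a helper like this, the plain mean, weighted mean and best-single-fold files could each be written with one call.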

In [None]:
cv_test(5, True)



[0]	train-mlogloss:1.54568	valid-mlogloss:1.54587
[200]	train-mlogloss:0.43438	valid-mlogloss:0.45828
[400]	train-mlogloss:0.39884	valid-mlogloss:0.44602
[600]	train-mlogloss:0.37243	valid-mlogloss:0.44139
[800]	train-mlogloss:0.35013	valid-mlogloss:0.43885
[1000]	train-mlogloss:0.32967	valid-mlogloss:0.43702
[1200]	train-mlogloss:0.31151	valid-mlogloss:0.43610
[1282]	train-mlogloss:0.30451	valid-mlogloss:0.43597
train log loss 0.3045114176673922 valid log loss 0.4359743203816037
rev 2.293713077239757




[0]	train-mlogloss:1.54562	valid-mlogloss:1.54595
[200]	train-mlogloss:0.43420	valid-mlogloss:0.45623
[400]	train-mlogloss:0.39901	valid-mlogloss:0.44523
[600]	train-mlogloss:0.37265	valid-mlogloss:0.44061
[800]	train-mlogloss:0.35005	valid-mlogloss:0.43839
[1000]	train-mlogloss:0.33003	valid-mlogloss:0.43722
[1130]	train-mlogloss:0.31789	valid-mlogloss:0.43692
train log loss 0.3178864752714442 valid log loss 0.43692326301304124
rev 2.2887314195723016




[0]	train-mlogloss:1.54563	valid-mlogloss:1.54569
[200]	train-mlogloss:0.43519	valid-mlogloss:0.45203
[400]	train-mlogloss:0.40010	valid-mlogloss:0.44000
[600]	train-mlogloss:0.37410	valid-mlogloss:0.43545
[800]	train-mlogloss:0.35174	valid-mlogloss:0.43282
[1000]	train-mlogloss:0.33161	valid-mlogloss:0.43090
[1124]	train-mlogloss:0.31999	valid-mlogloss:0.43066
train log loss 0.31989001674991835 valid log loss 0.4306436305102449
rev 2.3221056324812177
BETTER




[0]	train-mlogloss:1.54557	valid-mlogloss:1.54614
[200]	train-mlogloss:0.43481	valid-mlogloss:0.45534
[400]	train-mlogloss:0.39931	valid-mlogloss:0.44363
[600]	train-mlogloss:0.37314	valid-mlogloss:0.43960
[800]	train-mlogloss:0.35024	valid-mlogloss:0.43709
[1000]	train-mlogloss:0.32978	valid-mlogloss:0.43579
[1175]	train-mlogloss:0.31350	valid-mlogloss:0.43514
train log loss 0.313419920561722 valid log loss 0.43513855779399474
rev 2.298118569564742




[0]	train-mlogloss:1.54549	valid-mlogloss:1.54566
[200]	train-mlogloss:0.43494	valid-mlogloss:0.45615
[400]	train-mlogloss:0.39971	valid-mlogloss:0.44314
[600]	train-mlogloss:0.37310	valid-mlogloss:0.43774
[800]	train-mlogloss:0.35050	valid-mlogloss:0.43543
[1000]	train-mlogloss:0.33056	valid-mlogloss:0.43390
[1200]	train-mlogloss:0.31201	valid-mlogloss:0.43274
[1400]	train-mlogloss:0.29534	valid-mlogloss:0.43195
[1545]	train-mlogloss:0.28373	valid-mlogloss:0.43171
train log loss 0.28373377787923365 valid log loss 0.43170858996423583
rev 2.3163773509413916
11.51904604979941
local average valid loss 0.43407767233262406
train log loss 0.3237289203387155
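
The `rev` values in the log are the inverse validation losses that `cv_test` uses as fold weights. The sketch below recomputes them from the five validation losses printed above and normalises them into weights. The helper `weighted_mean` is a hypothetical illustration of the same weighting, not code from the notebook.

```python
import numpy as np

# Validation log losses printed by cv_test(5, True) above.
valid_losses = np.array([
    0.4359743203816037,
    0.43692326301304124,
    0.4306436305102449,
    0.43513855779399474,
    0.43170858996423583,
])

inv = 1.0 / valid_losses    # the per-fold 'rev' values
weights = inv / inv.sum()   # normalised so the weights sum to 1


def weighted_mean(fold_preds, w):
    """Weighted average over the fold axis of an array shaped (folds, rows, classes)."""
    return np.tensordot(np.asarray(w), np.asarray(fold_preds), axes=1)


print(inv.sum())                      # the 'reverse_score' total, up to float rounding
print(weights)                        # one weight per fold
print(weights.max() - weights.min())  # spread between the largest and smallest weight
```

Because the fold losses are so close together, each weight comes out close to 1/5, so the weighted mean and the plain mean of the fold predictions should differ only slightly.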


## **Stacking with LGBM as the final model**

In [None]:
train_Y = train['author']
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import log_loss
import lightgbm as lgb
import numpy as np
import pandas as pd

# Final ensemble.
def cv_test_lgbm(k_cnt=3, s_flag = False):
    rnd = 42
    if s_flag:
        kf = StratifiedKFold(n_splits=k_cnt, shuffle=True, random_state=rnd)
    else:
        kf = KFold(n_splits=k_cnt, shuffle=True, random_state=rnd)
    test_pred = None
    weighted_test_pred = None
    org_train_pred = None
    avg_k_score = 0
    reverse_score = 0
    best_loss = 100
    best_single_pred = None
    for train_index, test_index in kf.split(f_train_X,train_Y):
        X_train, X_test = f_train_X[train_index], f_train_X[test_index]
        y_train, y_test = train_Y[train_index], train_Y[test_index]
        params = {
            'objective': 'multiclass',
            'metric': 'multi_logloss',
            'num_class': 5,
            'boosting_type': 'gbdt',
            'n_estimators': 2000,
            'learning_rate': 0.04,
            'num_leaves': 31,
            'max_depth': 3,
            'subsample': 0.8,  # note: LightGBM only applies bagging when subsample_freq (bagging_freq) > 0
            'colsample_bytree': 0.7,
            'random_state': rnd,
            'n_jobs': -1,
            'verbose': -1,
        }

        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_valid = lgb.Dataset(X_test, y_test, reference=lgb_train)

        m = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_valid],
                        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=True)])  # verbose is a boolean flag, not a logging period

        train_pred = m.predict(X_train)
        valid_pred = m.predict(X_test)
        tmp_train_pred = m.predict(f_train_X)
        test_pred_fold = m.predict(f_test_X)

        train_score = log_loss(y_train,train_pred)
        valid_score = log_loss(y_test,valid_pred)
        print('train log loss',train_score,'valid log loss',valid_score)
        avg_k_score += valid_score
        rev_valid_score = 1.0/valid_score
        reverse_score += rev_valid_score
        print('rev',rev_valid_score)

        if test_pred is None:
            test_pred = test_pred_fold.copy()  # copy: test_pred is accumulated in place below
            weighted_test_pred = test_pred_fold * rev_valid_score
            org_train_pred = tmp_train_pred
            best_loss = valid_score
            best_single_pred = test_pred_fold
        else:
            test_pred += test_pred_fold
            weighted_test_pred += test_pred_fold * rev_valid_score
            org_train_pred += tmp_train_pred

            if valid_score < best_loss:
                print('BETTER')
                best_loss = valid_score
                best_single_pred = test_pred_fold

    test_pred = test_pred / k_cnt
    test_pred = np.round(test_pred,4)
    org_train_pred = org_train_pred / k_cnt
    avg_k_score = avg_k_score/k_cnt

    submiss=pd.read_csv("/content/drive/MyDrive/ESAA/OB/csv/sample_submission.csv")
    submiss['0']=test_pred[:,0]
    submiss['1']=test_pred[:,1]
    submiss['2']=test_pred[:,2]
    submiss['3']=test_pred[:,3]
    submiss['4']=test_pred[:,4]
    submiss.to_csv("lgbm_{}.csv".format(k_cnt),index=False)
    print(reverse_score)
    # weighted
    submiss=pd.read_csv("/content/drive/MyDrive/ESAA/OB/csv/sample_submission.csv")
    weighted_test_pred = weighted_test_pred / reverse_score
    weighted_test_pred = np.round(weighted_test_pred,4)
    submiss['0']=weighted_test_pred[:,0]
    submiss['1']=weighted_test_pred[:,1]
    submiss['2']=weighted_test_pred[:,2]
    submiss['3']=weighted_test_pred[:,3]
    submiss['4']=weighted_test_pred[:,4]
    submiss.to_csv("weighted_lgbm_{}.csv".format(k_cnt),index=False)
    # best single
    submiss=pd.read_csv("/content/drive/MyDrive/ESAA/OB/csv/sample_submission.csv")
    best_single_pred = np.round(best_single_pred,4)
    submiss['0']=best_single_pred[:,0]
    submiss['1']=best_single_pred[:,1]
    submiss['2']=best_single_pred[:,2]
    submiss['3']=best_single_pred[:,3]
    submiss['4']=best_single_pred[:,4]
    submiss.to_csv("single_lgbm_{}.csv".format(k_cnt),index=False)

    # train log loss
    print('local average valid loss',avg_k_score)
    print('train log loss', log_loss(train_Y,org_train_pred))

In [None]:
cv_test_lgbm(5, True)



Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1680]	training's multi_logloss: 0.305251	valid_1's multi_logloss: 0.51683
train log loss 0.3052510867872835 valid log loss 0.5168300801714009
rev 1.9348719015510112




Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1419]	training's multi_logloss: 0.330204	valid_1's multi_logloss: 0.525018
train log loss 0.3302041404046566 valid log loss 0.5250182437943189
rev 1.9046957164249703




Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1287]	training's multi_logloss: 0.346853	valid_1's multi_logloss: 0.514922
train log loss 0.34685301360667514 valid log loss 0.5149224476552055
rev 1.9420400189459304
BETTER




Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1337]	training's multi_logloss: 0.339366	valid_1's multi_logloss: 0.521077
train log loss 0.3393655217235693 valid log loss 0.5210769205472182
rev 1.9191024598629933




Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1760]	training's multi_logloss: 0.299594	valid_1's multi_logloss: 0.506956
train log loss 0.2995939159354713 valid log loss 0.50695595136555
rev 1.9725579654531589
BETTER
9.673268062238064
local average valid loss 0.5169607287067387
train log loss 0.3457859420647463


# **Leaderboard results**

- Meta model: XGB
- Final prediction: the weighted mean of the fold predictions gave the best result
    - public : 0.2195617319 / private : 0.2226967441
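
The local cross-validation logs above point the same way as the choice of XGB as the meta model. The sketch below summarises the per-fold validation losses printed by `cv_test(5, True)` and `cv_test_lgbm(5, True)`. The variable names are hypothetical and the numbers are copied from the cell outputs.

```python
import numpy as np

# Per-fold validation log losses copied from the two training logs above.
xgb_losses = np.array([
    0.4359743203816037,
    0.43692326301304124,
    0.4306436305102449,
    0.43513855779399474,
    0.43170858996423583,
])
lgbm_losses = np.array([
    0.5168300801714009,
    0.5250182437943189,
    0.5149224476552055,
    0.5210769205472182,
    0.50695595136555,
])

summary = {}
for name, losses in (("XGB", xgb_losses), ("LGBM", lgbm_losses)):
    summary[name] = (losses.mean(), losses.std(ddof=1))
    print(f"{name}: mean={summary[name][0]:.4f} std={summary[name][1]:.4f}")
```

The means reproduce the `local average valid loss` lines in the logs, and the gap between the two models is much larger than the fold-to-fold spread of either.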
