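# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 3: Text classification

### Problem definition
> Classifying 1:1 inquiries by type<br>
> 1. Analyze the inquiry text
> 2. Evaluate the performance of inquiry classification models
### Training data
> * 1:1 inquiry data: train.csv

### Variables
> * text: inquiry text
> * label: inquiry type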

### References
> * Machine Learning
>> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
> * Deep Learning
>> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
>> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
>> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

## 1. Development environment setup

### 1-1. Installing libraries

In [1]:
# Install the required libraries first.
! pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
     --------------------------------------- 19.4/19.4 MB 10.7 MB/s eta 0:00:00
Collecting wordcloud
  Downloading wordcloud-1.8.2.2-cp39-cp39-win_amd64.whl (153 kB)
     -------------------------------------- 153.1/153.1 kB 8.9 MB/s eta 0:00:00
Collecting python-mecab-ko
  Downloading python_mecab_ko-1.3.3-cp39-cp39-win_amd64.whl (810 kB)
     ------------------------------------- 810.6/810.6 kB 10.3 MB/s eta 0:00:00
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-win_amd64.whl (345 kB)
     ------------------------------------- 345.2/345.2 kB 10.5 MB/s eta 0:00:00
Collecting python-mecab-ko-dic
  Downloading python_mecab_ko_dic-2.1.1.post2-py3-none-any.whl (34.5 MB)
     --------------------------------------- 34.5/34.5 MB 10.1 MB/s eta 0:00:00


### 1-2. Importing libraries

In [4]:
import os

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import wget
from IPython.display import display
from mecab import MeCab
# import tensorflow as tf  # imported later, in the deep learning section

### 1-3. Korean font setup (Windows)

In [None]:
# if not os.path.exists("malgun.ttf"): 
#     wget.download("https://www.wfonts.com/download/data/2016/06/13/malgun-gothic/malgun.ttf")
# if 'malgun' not in fm.fontManager.findfont("Malgun Gothic"):
#     fm.fontManager.addfont("malgun.ttf")
# if plt.rcParams['font.family']!= ["Malgun Gothic"]:
#     plt.rcParams['font.family']= [font for font in fm.fontManager.ttflist if 'malgun.ttf' in font.fname][-1].name
# plt.rcParams['axes.unicode_minus'] = False # keeps the minus sign from rendering as a broken glyph with a Korean font
# assert plt.rcParams['font.family'] == ["Malgun Gothic"], "Korean font is not set."
# FONT_PATH = "malgun.ttf"

In [None]:
# !sudo apt-get install -y fonts-nanum

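### 1-4. Java path setup (Windows)

In [12]:
# Raw string, so the backslashes in the Windows path are not read as escape sequences.
os.environ['JAVA_HOME'] = r"C:\Program Files\jdk-20"  # alternative: r"C:\Program Files\Java\jdk-20"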

### 1-3. Korean font setup (Colab)

In [None]:
!sudo apt-get install -y fonts-nanum

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-525
Use 'sudo apt autoremove' to remove it.
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 23 not upgraded.
Need to get 9,599 kB of archives.
After this operation, 29.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 fonts-nanum all 20180306-3 [9,599 kB]
Fetched 9,599 kB in 1s (8,268 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

In [None]:
FONT_PATH = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font_name = fm.FontProperties(fname=FONT_PATH, size=10).get_name()
print(font_name)
plt.rcParams['font.family']=font_name
assert plt.rcParams['font.family'] == [font_name], "Korean font is not set."

NanumGothic


### 1-4. Mounting Google Drive (Colab)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Loading the preprocessed data
* Load the data preprocessed on days 1 and 2.
* For sparse data, use scipy.sparse.load_npz.
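
A minimal sketch of the `scipy.sparse.save_npz` / `scipy.sparse.load_npz` round trip mentioned above. The toy matrix and the file name `X_train_sparse.npz` are assumptions for illustration; the notebook does not say where its own `.npz` files are stored.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy stand-in for a TF-IDF matrix; the values are illustrative only.
dense = np.array([[0.0, 0.5, 0.0],
                  [0.3, 0.0, 0.0],
                  [0.0, 0.0, 0.9]])
X_sparse = sparse.csr_matrix(dense)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "X_train_sparse.npz")
sparse.save_npz(path, X_sparse)

# load_npz restores the matrix in the sparse format it was saved in.
X_loaded = sparse.load_npz(path)
print(X_loaded.shape, X_loaded.nnz)
```

Only the non-zero entries are written, so a vectorized corpus can be stored without ever being converted to a dense array.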

In [4]:
pwd

'c:\\Users\\User\\Desktop\\에이블\\4월\\0407\\mp'

In [72]:
train = pd.read_csv('c:\\Users\\User\\Desktop\\에이블\\4월\\0407\\mp\\train.csv')
test = pd.read_csv('c:\\Users\\User\\Desktop\\에이블\\4월\\0407\\mp\\test.csv')
submission = pd.read_csv('c:\\Users\\User\\Desktop\\에이블\\4월\\0407\\mp\\random_submission.csv')

In [73]:
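# Map the Korean label strings in train.csv to integer class IDs.
# '코드1' and '코드2' (Code 1, Code 2) are merged into class 0.
label_dict = {
    '코드1': 0,       # Code 1
    '코드2': 0,       # Code 2
    '웹': 1,          # Web
    '이론': 2,        # Theory
    '시스템 운영': 3,  # System operation
    '원격': 4         # Remote
}

train = train.replace({'label' : label_dict}).copy()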

In [214]:
# from sklearn.model_selection import train_test_split

# X_train, X_val, y_train, y_val = train_test_split(preprocessed_df['text'], preprocessed_df['label'], test_size=0.25, random_state=42)

In [7]:
import re

def clean_text(texts):
    corpus = []
    for text in texts:
        # NOTE: the original cell first stripped punctuation and digits, but that
        # result was overwritten by the lower() call below, so neither step ever
        # took effect. Both are left disabled here so that the outputs shown in
        # this notebook stay reproducible.
        # review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\>\<]', '', text)
        # review = re.sub(r'\d+', '', review)
        review = text.lower()                      # lowercase
        review = re.sub(r'\s+', ' ', review)       # collapse whitespace runs (including newlines)
        review = re.sub(r'<[^>]+>', '', review)    # remove HTML tags
        review = re.sub(r'\s+', ' ', review)       # collapse whitespace left behind by tag removal
        review = re.sub(r'^\s+', '', review)       # strip leading whitespace
        review = re.sub(r'\s+$', '', review)       # strip trailing whitespace
        review = re.sub(r'_', ' ', review)         # replace underscores with spaces
        corpus.append(review)

    return corpus
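
In the `clean_text` cell above, the punctuation and digit substitutions never reach the output, which is why the cleaned frame printed below still contains characters such as `=`, `(` and digits. The sketch below is a hypothetical variant, `clean_text_strict`, in which every step feeds the next one. The simplified punctuation pattern `[^\w\s]` is an assumption, not the author's original character class.

```python
import re

def clean_text_strict(texts):
    """Hypothetical variant of clean_text in which every substitution is kept."""
    cleaned = []
    for text in texts:
        review = text.lower()                          # lowercase
        review = re.sub(r'<[^>]+>', '', review)        # HTML tags, before '<' and '>' are stripped
        review = re.sub(r'[^\w\s]', ' ', review)       # punctuation (simplified pattern, an assumption)
        review = re.sub(r'_', ' ', review)             # '_' counts as a word character, so handle it separately
        review = re.sub(r'\d+', ' ', review)           # digits
        review = re.sub(r'\s+', ' ', review).strip()   # collapse and trim whitespace
        cleaned.append(review)
    return cleaned

sample = ["inplace =True 해도 값이 <b>변경</b>이 안됩니다!!  frame_sec 123"]
print(clean_text_strict(sample))
```

Whether stripping punctuation helps here is an empirical question: many inquiries contain source code, where symbols may carry signal. The scores reported in this notebook were obtained without the stripping steps.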

In [8]:
train

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0
...,...,...
3701,"토큰화 이후 train val 를 분리하고 각 train set, val set에 ...",0
3702,올린 값들 중 최고점인 건가요? 아니면 최근에 올린 파일로 무조건 갱신인가요?\n최...,3
3703,수업에서 cacoo랑 packet tracer를 배우는 이유가\n\n1. IT 인프...,2
3704,inplace =True 해도 값이 변경이 안되고 none으로 뜹니다. 혹시 원격지...,4


In [9]:
train['text'] = clean_text(train['text'])
test['text'] = clean_text(test['text'])

In [10]:
train

Unnamed: 0,text,label
0,"self.convs1 = nn.modulelist([nn.conv2d(1, co, ...",0
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1
2,glob.glob(path) 를 사용할 때 질문입니다. path에 [ ] 가 포함되...,0
3,"tmpp = tmp.groupby(by = 'addr1', as index=fals...",0
4,filename = test image + str(round(frame sec)) ...,0
...,...,...
3701,"토큰화 이후 train val 를 분리하고 각 train set, val set에 ...",0
3702,올린 값들 중 최고점인 건가요? 아니면 최근에 올린 파일로 무조건 갱신인가요? 최고...,3
3703,수업에서 cacoo랑 packet tracer를 배우는 이유가 1. it 인프라 구...,2
3704,inplace =true 해도 값이 변경이 안되고 none으로 뜹니다. 혹시 원격지...,4


In [112]:
# X_train = clean_text(X_train.values)
# X_val = clean_text(X_val.values)


In [74]:
from konlpy.tag import Okt 

han_sentence = "오늘도 열심히 코딩을 해볼까요? 같이 힘내서 자연어 처리 고수가 됩시다! ㅎㅎ"
okt = Okt() # create the Okt morphological analyzer
print("Korean morphemes (stem=False) ==>", okt.morphs(han_sentence, stem = False)) # split into morphemes
print("Korean morphemes (stem=True) ==>", okt.morphs(han_sentence, stem = True)) # split into morphemes, then reduce each to its stem

Korean morphemes (stem=False) ==> ['오늘', '도', '열심히', '코딩', '을', '해볼까', '요', '?', '같이', '힘내서', '자연어', '처리', '고수', '가', '됩시다', '!', 'ㅎㅎ']
Korean morphemes (stem=True) ==> ['오늘', '도', '열심히', '코딩', '을', '해보다', '요', '?', '같이', '힘내다', '자연어', '처리', '고수', '가', '되다', '!', 'ㅎㅎ']


In [75]:
tokenized = [] # one space-joined morpheme string per document
for sentence in train['text']: # iterate over the cleaned inquiry texts
    tokens = okt.morphs(sentence, stem = True) # morphological analysis with stemming (stem = True)
    tokenize = " ".join(tokens) # join the morphemes into a single space-separated string
    tokenized.append(tokenize) # collect the tokenized string
X_train = pd.DataFrame(tokenized) # single-column DataFrame (column 0)
X_train = X_train[0] # keep that column as a Series
#train.head() # inspect the data

In [76]:
tokenized = [] # one space-joined morpheme string per document
for sentence in test['text']: # iterate over the cleaned inquiry texts
    tokens = okt.morphs(sentence, stem = True) # morphological analysis with stemming (stem = True)
    tokenize = " ".join(tokens) # join the morphemes into a single space-separated string
    tokenized.append(tokenize) # collect the tokenized string
X_test = pd.DataFrame(tokenized) # single-column DataFrame (column 0)
X_test = X_test[0] # keep that column as a Series
#train.head() # inspect the data

In [77]:
# X = X_train.copy()
y_train = train['label']

## 3. Machine Learning(N-grams)
* Train at least three machine learning models on the N-gram features and compare their performance.
> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a character-level TF-IDF vectorizer (1- to 4-grams) on the training text,
# then transform both the training and test sets with it.
# min_df=2 and max_df=500 are absolute document counts, not proportions.
tfidf_vect = TfidfVectorizer(ngram_range=(1,4),  min_df = 2, max_df=500, analyzer = 'char', sublinear_tf = True)
tfidf_vect.fit(X_train)

X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test) # reuse the vectorizer fitted on the training set
print('TfidfVectorizer shape (train, test):',X_train_tfidf_vect.shape, X_test_tfidf_vect.shape)

TfidfVectorizer shape (train, test): (3706, 81045) (929, 81045)


### 3-1. Model 1

In [118]:
X_train_tfidf_vect

<3706x8883 sparse matrix of type '<class 'numpy.float64'>'
	with 82541 stored elements in Compressed Sparse Row format>

In [79]:
from imblearn.over_sampling import SMOTE

#X_train_tfidf_vect = pd.DataFrame(X_train_tfidf_vect)
X_resample, y_resampled = SMOTE().fit_resample(X_train_tfidf_vect, y_train)

In [80]:
from sklearn.linear_model import LogisticRegression # import the model
from sklearn import metrics
from sklearn.model_selection import cross_val_score


# Train and evaluate with LogisticRegression.
lr_clf = LogisticRegression(solver='liblinear', C = 12, penalty = 'l2', max_iter = 500) 
# Metric: accuracy; 5-fold cross-validation
scores = cross_val_score(lr_clf , X_resample, y_resampled, scoring='accuracy',cv=5)

print('Accuracy per CV fold:',np.round(scores, 4))
print('Mean CV accuracy:', np.round(np.mean(scores),4))

#print('TF-IDF Logistic Regression accuracy: {0:.3f}'.format(metrics.accuracy_score(y_val ,pred)))

Accuracy per CV fold: [0.9426 0.9438 0.9666 0.9722 0.976 ]
Mean CV accuracy: 0.9603
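
The 0.9603 above is measured on data that was oversampled before the folds were split. Synthetic points derived from a validation sample can therefore land in the training folds, and the validation folds themselves are artificially balanced. The score is likely optimistic and is not directly comparable to the 0.8546 of Model 1-2. The sketch below resamples inside each fold instead. It uses synthetic data and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and the `oversample` helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the TF-IDF features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)

def oversample(X_part, y_part, seed):
    """Randomly oversample every class up to the majority-class size."""
    target = np.bincount(y_part).max()
    X_out, y_out = [], []
    for cls in np.unique(y_part):
        X_cls = X_part[y_part == cls]
        X_out.append(resample(X_cls, replace=True, n_samples=target, random_state=seed))
        y_out.append(np.full(target, cls))
    return np.vstack(X_out), np.concatenate(y_out)

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Resample the training fold only; the validation fold stays untouched.
    X_res, y_res = oversample(X[train_idx], y[train_idx], seed=fold)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_res, y_res)
    scores.append(clf.score(X[val_idx], y[val_idx]))

print(np.round(scores, 4), round(float(np.mean(scores)), 4))
```

With imbalanced-learn, a similar effect can be had by placing `SMOTE()` and the classifier in an `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_val_score`, so that resampling is applied only when fitting on each training fold.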


In [81]:
lr_clf.fit(X_resample , y_resampled)
pred = lr_clf.predict(X_test_tfidf_vect)

In [82]:
submission['label'] = pred

In [83]:
submission.head()

Unnamed: 0,id,label
0,0,3
1,1,3
2,2,0
3,3,0
4,4,0


In [84]:
submission.to_csv('world_submission_14.csv', index = False)

### 3-1. Model 1-2 (submitted)

In [67]:
from sklearn.linear_model import LogisticRegression # import the model
from sklearn import metrics
from sklearn.model_selection import cross_val_score


# Train and evaluate with LogisticRegression.
lr_clf = LogisticRegression(solver='liblinear', C = 12, penalty = 'l2', max_iter = 500) 
# Metric: accuracy; 5-fold cross-validation
scores = cross_val_score(lr_clf , X_train_tfidf_vect, y_train, scoring='accuracy',cv=5)

print('Accuracy per CV fold:',np.round(scores, 4))
print('Mean CV accuracy:', np.round(np.mean(scores),4))

#print('TF-IDF Logistic Regression accuracy: {0:.3f}'.format(metrics.accuracy_score(y_val ,pred)))

Accuracy per CV fold: [0.8666 0.861  0.8367 0.857  0.8516]
Mean CV accuracy: 0.8546


In [68]:
lr_clf.fit(X_train_tfidf_vect , y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

In [69]:
submission['label'] = pred

In [70]:
submission

Unnamed: 0,id,label
0,0,3
1,1,3
2,2,0
3,3,0
4,4,0
...,...,...
924,924,3
925,925,0
926,926,3
927,927,1


In [71]:
submission.to_csv('world_submission_13.csv', index = False)

### Model 1-3

In [224]:
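from sklearn.model_selection import GridSearchCV
## Helper for hyperparameter tuning
def logistic_tuning(train_sprs, y, params):
    # Grid search with 5-fold CV over the full training data.
    # solver='saga' supports both the 'l1' and 'l2' penalties in the grid below
    # (the default 'lbfgs' solver rejects 'l1').
    model = LogisticRegression(solver = 'saga', random_state = 99)
    # 'roc_auc' only handles binary targets; 'roc_auc_ovr' is the one-vs-rest multiclass variant.
    grid = GridSearchCV(model, params, scoring ='roc_auc_ovr', cv = 5)
    grid.fit(train_sprs, y)

    print(grid.best_params_)
    print(grid.best_score_)

    return grid.best_estimator_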


In [None]:
%%time
param1 = {'penalty':['l2', 'l1'], 'C':[0.01, 0.1, 1, 5, 10], 'max_iter': [100, 500]}
logistic_tuning(X_train_tfidf_vect, y_train,  params = param1)

### 3-2. Model 2

In [81]:
train['label'].value_counts()[0] / train['label'].value_counts()[1]

2.1653005464480874

In [None]:
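# XGBoost training parameter: ratio of class 0 count to class 1 count.
# Note: scale_pos_weight is meant for binary classification, so it is not
# passed to the 5-class model below.
scale_pos_weight = train['label'].value_counts()[0] / train['label'].value_counts()[1]
print(scale_pos_weight)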

In [36]:
import xgboost as xgb 
from sklearn.model_selection import cross_val_score

xg = xgb.XGBClassifier()

# Metric: accuracy; 5-fold cross-validation
scores = cross_val_score(xg , X_train_tfidf_vect, y_train, scoring='accuracy',cv=5)

print('Accuracy per CV fold:',np.round(scores, 4))
print('Mean CV accuracy:', np.round(np.mean(scores),4))

Accuracy per CV fold: [0.7992 0.7881 0.7935 0.7935 0.7841]
Mean CV accuracy: 0.7917


### 3-3. Model 3

In [16]:
!pip install catboost



In [18]:
from sklearn.utils.class_weight import compute_class_weight

# The classes are imbalanced, so compute 'balanced' class weights.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
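
As a worked example of what `class_weight='balanced'` computes in the cell above: each class receives the weight `n_samples / (n_classes * class_count)`. The toy label vector below is an assumption chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 6 samples of class 0, 3 of class 1, 1 of class 2.
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# Same quantity computed by hand: n_samples / (n_classes * count per class).
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), np.round(weights, 3).tolist())))
print(np.allclose(weights, manual))
```

The rarest class gets the largest weight (10 / (3 * 1) here), so its errors count more in the training loss.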

In [19]:
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier

cb = CatBoostClassifier(class_weights= class_weights, bootstrap_type= 'MVS', #['Bayesian', 'Bernoulli', 'MVS']
                     )
# cb = CatBoostClassifier(learning_rate= 0.03, max_depth= 10, n_estimators= 1000, class_weights= class_weights, bootstrap_type= 'MVS', #['Bayesian', 'Bernoulli', 'MVS']
#                      subsample = 0.8, colsample_bylevel=1.0, random_state=42, verbose =0)

# Metric: accuracy; 2-fold cross-validation (cv=2)
scores = cross_val_score(cb , X_train_tfidf_vect, y_train, scoring='accuracy',cv=2)

print('Accuracy per CV fold:',np.round(scores, 4))
print('Mean CV accuracy:', np.round(np.mean(scores),4))

Learning rate set to 0.081655
0:	learn: 1.5324484	total: 258ms	remaining: 4m 17s
1:	learn: 1.4824278	total: 451ms	remaining: 3m 45s
2:	learn: 1.4325725	total: 623ms	remaining: 3m 27s
3:	learn: 1.3851768	total: 793ms	remaining: 3m 17s
4:	learn: 1.3599460	total: 961ms	remaining: 3m 11s
5:	learn: 1.3262484	total: 1.28s	remaining: 3m 32s
6:	learn: 1.3043470	total: 1.47s	remaining: 3m 29s
7:	learn: 1.2916342	total: 1.67s	remaining: 3m 26s
8:	learn: 1.2712087	total: 1.98s	remaining: 3m 37s
9:	learn: 1.2497579	total: 2.2s	remaining: 3m 38s
10:	learn: 1.2322129	total: 2.42s	remaining: 3m 37s
11:	learn: 1.2192273	total: 2.62s	remaining: 3m 35s
12:	learn: 1.2007566	total: 2.83s	remaining: 3m 34s
13:	learn: 1.1892686	total: 2.98s	remaining: 3m 29s
14:	learn: 1.1742389	total: 3.13s	remaining: 3m 25s
15:	learn: 1.1629466	total: 3.29s	remaining: 3m 22s
16:	learn: 1.1499170	total: 3.44s	remaining: 3m 18s
17:	learn: 1.1352493	total: 3.64s	remaining: 3m 18s
18:	learn: 1.1237722	total: 3.83s	remaining: 

### Model 4

In [125]:
from lightgbm import LGBMClassifier


lgbm = LGBMClassifier()


# Metric: accuracy; 5-fold cross-validation
scores = cross_val_score(lgbm , X_train_tfidf_vect, y_train, scoring='accuracy',cv=5)

print('Accuracy per CV fold:',np.round(scores, 4))
print('Mean CV accuracy:', np.round(np.mean(scores),4))


Accuracy per CV fold: [0.7884 0.7692 0.7395 0.7787 0.7638]
Mean CV accuracy: 0.7679


### Model 5

In [29]:
from sklearn import svm

svc = svm.SVC(kernel = 'linear', C= 10) #{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}

# Metric: accuracy; 5-fold cross-validation
scores = cross_val_score(svc , X_train_tfidf_vect, y_train, scoring='accuracy',cv=5)

print('Accuracy per CV fold:',np.round(scores, 4))
print('Mean CV accuracy:', np.round(np.mean(scores),4))


Accuracy per CV fold: [0.8423 0.8205 0.8043 0.8165 0.8273]
Mean CV accuracy: 0.8222


### 3-4. Hyperparameter Tuning(Optional) 
* Manual Search, Grid search, Bayesian Optimization, TPE...
> * [grid search tutorial sklearn](https://scikit-learn.org/stable/modules/grid_search.html)
> * [optuna tutorial](https://optuna.org/#code_examples)
> * [ray-tune tutorial](https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html)

## 4. Deep Learning(Sequence)
* Train at least three deep learning models (e.g., DNN, 1-D CNN, LSTM) on the sequence-encoded data and compare their performance.
> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)

### 4-1. DNN

In [167]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

from tensorflow.keras.preprocessing.sequence import pad_sequences

In [168]:
str_len_max = np.max(X_train.str.len()) # maximum text length, in characters
print('Max length:',round(str_len_max))

Max length: 6675


In [169]:
str_len_mean = np.mean(X_train.str.len()) # mean text length, in characters
print('Mean length:',round(str_len_mean))

Mean length: 218


In [170]:
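max_words = 6675 #3987 ## vocabulary size cap: used as Tokenizer(num_words=...) and as the Embedding input dimension
embedding_dim = 128 ## embedding size: number of features learned per word
max_len = 80 ## maximum sequence length, in tokens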

In [171]:
### Tokenizer here
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=max_words, lower=False)  # turn off lowercasing (lower=False) for Korean text

In [172]:
%%time
# Text --> Sequence (the %%time cell magic has to be the first line of the cell)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

CPU times: user 367 ms, sys: 37 µs, total: 367 ms
Wall time: 366 ms


In [173]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
#### Pad Sequences here
X_train = pad_sequences(X_train, maxlen = max_len)
X_test = pad_sequences(X_test, maxlen = max_len)

In [174]:
X_train.shape

(3706, 80)
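
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The sketch below is a NumPy re-implementation of that default behavior rather than the Keras code itself; `pad_pre` is a hypothetical helper name.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic pad_sequences defaults: pad in front, truncate from the front."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                    # an empty sequence stays all padding
        trunc = seq[-maxlen:]           # 'pre' truncation keeps the LAST maxlen tokens
        out[i, -len(trunc):] = trunc    # 'pre' padding puts the fill value in FRONT
    return out

toy = [[5, 3], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

With `max_len = 80`, inquiries longer than 80 tokens therefore keep only their last 80 tokens. If the start of an inquiry is more informative, `truncating='post'` is worth trying.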

In [175]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.backend import clear_session
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense, Conv1D, Bidirectional, LSTM, GRU, RNN, MaxPool1D, Flatten

In [176]:
X_train.shape[1]

80

In [189]:
#####################
# 1. Reset the session
clear_session()

# 2. Chain the layers together
il = Input(shape=(X_train.shape[1],))

# 1. Embedding layer: embedding dimension 128
el = Embedding(max_words,
               embedding_dim,
               input_length=max_len)(il)

# 2. Conv1D block: 64 filters, window size 5
hl1 = Conv1D(filters=64,
             kernel_size=5,
             activation='swish')(el)
# 3. Bidirectional layer:
#     * forward: LSTM, 32 hidden units
#     * backward: LSTM, 32 hidden units
lstm32 = LSTM(32, return_sequences=True)
hl2 = Bidirectional(lstm32)(hl1)
# 4. Bidirectional layer:
#     * forward: GRU, 32 hidden units
#     * backward: LSTM, 16 hidden units
forward_gru32 = GRU(32, return_sequences=True)
backward_lstm16 = LSTM(16, return_sequences=True, go_backwards=True)
hl3 = Bidirectional(forward_gru32, backward_layer=backward_lstm16)(hl2)
# 5. Conv1D block: 32 filters, window size 5
hl4 = Conv1D(filters=32,
             kernel_size=5,
             activation='swish')(hl3)
# 6. MaxPool1D block: pool size 2
hl5 = MaxPool1D(pool_size=2)(hl4)
# 7. Flatten
hl6 = Flatten()(hl5)
# 8. FC layer: 1024 units
hl7 = Dense(1024, activation='relu')(hl6)
# 9. Softmax output layer: 5 classes
ol = Dense(5, activation='softmax')(hl7)

# 3. Specify the model's input and output
model = Model(il, ol)

# 4. Compile
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Summary
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 80)]              0         
                                                                 
 embedding (Embedding)       (None, 80, 128)           854400    
                                                                 
 conv1d (Conv1D)             (None, 76, 64)            41024     
                                                                 
 bidirectional (Bidirectiona  (None, 76, 64)           24832     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 76, 48)           14592     
 nal)                                                            
                                                                 
 conv1d_1 (Conv1D)           (None, 72, 32)            7712  

* Using pre-trained word embeddings
* Initialize the embedding layer with vectors trained by others
* GloVe, Word2Vec, etc.
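
One common way to use pre-trained vectors is to copy them into an embedding matrix whose row index matches the tokenizer's word index. The sketch below uses NumPy only. The `pretrained` dictionary, its made-up 4-dimensional vectors and the tiny `word_index` are all assumptions standing in for a real GloVe or Word2Vec lookup and for `tokenizer.word_index`.

```python
import numpy as np

embedding_dim = 4   # toy size; the notebook uses 128
max_words = 4       # toy vocabulary cap, including the padding index 0

# Hypothetical pre-trained vectors (made-up values).
pretrained = {
    "서버": np.array([0.1, 0.2, 0.3, 0.4]),
    "오류": np.array([0.5, 0.6, 0.7, 0.8]),
}

# Hypothetical word index, 1-based like tokenizer.word_index.
word_index = {"서버": 1, "오류": 2, "pandas": 3, "rare": 4}

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, idx in word_index.items():
    if idx >= max_words:
        continue                        # beyond the vocabulary cap
    vec = pretrained.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec     # words without a vector stay all-zero

print(embedding_matrix)
```

In Keras, a matrix like this is typically handed to the `Embedding` layer through `embeddings_initializer=keras.initializers.Constant(embedding_matrix)`, often with `trainable=False` to keep the vectors frozen.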

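# Training with EarlyStopping

1. Hold out 20% of the training data as the validation set.
2. Stop early when validation accuracy has not improved for `patience` epochs. The original note called for comparing against 4 epochs earlier; the callback below uses `patience=10`.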

In [190]:
from tensorflow.keras.callbacks import EarlyStopping

In [191]:
#####################
es = EarlyStopping(monitor='val_accuracy',
                   min_delta=0,
                   patience=10,
                   verbose=1,
                   restore_best_weights=True)
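
The patience rule used by the callback above can be sketched in plain Python. `early_stopping_epoch` is a hypothetical helper, and the accuracy values are invented. It assumes a higher-is-better metric and `min_delta=0`, so only a strict improvement resets the counter.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best = float('-inf')
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:                 # strict improvement resets the counter
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:         # no improvement for `patience` epochs
                return epoch, best_epoch
    return None, best_epoch

# Invented validation accuracies, for illustration only.
val_acc = [0.50, 0.60, 0.62, 0.61, 0.62, 0.60, 0.59]
print(early_stopping_epoch(val_acc, patience=4))
```

Under this rule, a run with `patience=10` that stops at epoch 20 suggests the best validation accuracy was reached at epoch 10. `restore_best_weights=True` then rolls the model back to that epoch.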

In [192]:
from tensorflow.keras.utils import to_categorical

y_train_ctg = to_categorical(y_train)
#y_val_ctg = to_categorical(y_val)

In [193]:
y_train_ctg.shape

(3706, 5)

In [194]:
X_train.shape

(3706, 80)

In [195]:
y_train_ctg

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]], dtype=float32)

In [196]:
#####################
model.fit(X_train, y_train_ctg, epochs=30, verbose=1, callbacks=[es], validation_split=0.2)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 20: early stopping


<keras.callbacks.History at 0x7fbad1b91610>

In [197]:
y_pred = model.predict(X_test)



In [199]:
np.argmax(y_pred, axis=1)

array([3, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 2,
       0, 0, 3, 0, 0, 1, 2, 0, 1, 0, 2, 0, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0,
       0, 3, 0, 2, 3, 0, 0, 2, 0, 0, 4, 3, 0, 2, 0, 0, 2, 0, 0, 0, 0, 3,
       3, 2, 0, 3, 0, 0, 2, 0, 2, 0, 2, 0, 2, 3, 0, 3, 3, 0, 0, 0, 2, 0,
       0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 3,
       3, 3, 0, 0, 0, 1, 1, 3, 0, 0, 3, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3,
       0, 1, 0, 2, 2, 2, 0, 0, 0, 2, 2, 0, 0, 1, 2, 2, 0, 0, 0, 0, 3, 0,
       2, 2, 2, 2, 0, 2, 2, 0, 0, 1, 2, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 3, 2, 2, 3, 0, 2, 0, 3, 3, 1, 2, 2, 0, 0, 3, 2, 3, 0, 3,
       2, 0, 2, 1, 2, 0, 2, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 0, 2, 3, 0,
       4, 1, 0, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 4, 3, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 3, 3, 2, 0, 3, 0, 2, 3, 0, 3, 0, 0,

In [202]:
submission['label'] = np.argmax(y_pred, axis=1)

In [203]:
submission

Unnamed: 0,id,label
0,0,3
1,1,3
2,2,0
3,3,0
4,4,2
...,...,...
924,924,3
925,925,0
926,926,3
927,927,1


In [204]:
submission.to_csv('world_submission_3.csv', index = False)

In [None]:
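from sklearn.metrics import classification_report

# NOTE: this cell comes from a run with a hold-out split. y_val must be the
# validation labels (see the commented-out train_test_split in Section 2), and
# y_pred must hold the model's predictions on that validation set, not on X_test.
y = y_val
p = np.argmax(y_pred, axis=1)

target_names = ['Code 1,2 (0)', 'Web (1)', 'Theory (2)', 'System operation (3)', 'Remote (4)']

# Same mapping as in Section 2; the keys are the label strings found in train.csv.
label_dict = {
    '코드1': 0,
    '코드2': 0,
    '웹': 1,
    '이론': 2,
    '시스템 운영': 3,
    '원격': 4
}

preprocessed_df = train.replace({'label' : label_dict}).copy()

print(classification_report(y, p, 
                            target_names=target_names))

                      precision    recall  f1-score   support

        Code 1,2 (0)       0.78      0.85      0.81       462
             Web (1)       0.50      0.54      0.52       217
          Theory (2)       0.52      0.59      0.55       239
System operation (3)       0.78      0.43      0.55       162
          Remote (4)       0.56      0.31      0.40        32

            accuracy                           0.66      1112
           macro avg       0.63      0.54      0.57      1112
        weighted avg       0.66      0.66      0.65      1112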



In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

## performance metrics
accuracy = accuracy_score(y, p)

# With multiclass labels, each returned array holds one value per class.
precision, recall, fscore, support = \
    precision_recall_fscore_support(y, p)

print('Accuracy           : %.3f' %accuracy)
print('Precision (class 0): %.3f' %precision[0])
print('Recall (class 0)   : %.3f' %recall[0])
print('Recall (class 1)   : %.3f' %recall[1]) # originally labeled "Specificity", which only holds for binary labels
print('F1-score (class 0) : %.3f' %fscore[0])


Accuracy           : 0.656
Precision (class 0): 0.781
Recall (class 0)   : 0.848
Recall (class 1)   : 0.544
F1-score (class 0) : 0.813
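
The `recall[1]` value printed above is the recall of class 1, not a specificity. With more than two classes, specificity is defined per class (one-vs-rest) and can be read off the confusion matrix. The sketch below uses invented toy labels, not the notebook's validation data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented toy labels and predictions for three classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_hat)   # rows: true class, columns: predicted class
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp               # predicted as the class, but actually another
fn = cm.sum(axis=1) - tp               # actually the class, but predicted as another
tn = cm.sum() - tp - fp - fn

recall = tp / (tp + fn)                # per-class sensitivity
specificity = tn / (tn + fp)           # per-class one-vs-rest specificity

print(np.round(recall, 3), np.round(specificity, 3))
```

Applied to `y` and `p` from the cells above, the same few lines would give one recall and one specificity per inquiry type.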


### 4-2. 1-D CNN

### 4-3. LSTM

## 5. Using pre-trained model(Optional)
* Fine-tune a Korean pre-trained model and analyze its performance
> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
> * [HuggingFace-Korean](https://huggingface.co/models?language=korean)