Classifying BBC Articles with a Keras RNN

1. Importing packages and setting parameters

In [None]:
# Block 1
# Import packages
import csv
import numpy as np

# Natural Language Toolkit
import nltk 

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from time import time

# Used to split the data into four parts
from sklearn.model_selection import train_test_split

# Packages used to build the neural network
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from keras.layers import LSTM, Dropout, Embedding
from keras.layers import Bidirectional

# For evaluating the neural network
from sklearn.metrics import confusion_matrix, f1_score

In [None]:
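# Block 2
# Parameter settings
MY_VOCAB = 5000   # vocabulary size: keep the most frequent words
MY_EMBED = 64     # embedding dimension
MY_HIDDEN = 100   # number of units in the RNN cell
MY_LEN = 200      # article length in tokens

MY_SPLIT = 0.8    # fraction of the data used for training
MY_SAMPLE = 123   # index of the sample article
MY_EPOCH = 10     # number of training epochs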


2. Data processing

In [None]:
# Block 3
# Set up stopwords
nltk.download('stopwords')

# Use the English stopword list
MY_STOP = set(nltk.corpus.stopwords.words('english'))

print('English stopwords')
print(MY_STOP)
print(type(MY_STOP))
print('Number of stopwords:', len(MY_STOP))

print('the' in MY_STOP)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
English stopwords
{'did', 'i', 'into', 'there', 'with', 'too', 'mightn', "hadn't", 'be', "she's", 'once', 'myself', 'further', 'should', 'he', 'am', 'was', 'few', 'don', 're', "doesn't", 'has', 'then', 'hasn', "mightn't", 'and', 'by', 'yours', 'here', 'not', 'won', 'are', 'it', 'down', 'a', 'under', 'just', 'doesn', "mustn't", 'me', 'in', 'whom', 'own', "aren't", 'above', 'below', "you'll", 'o', "won't", 'yourselves', 'why', 'some', "weren't", 'is', 'needn', 'haven', 'having', 'only', 'what', 'over', 'we', 'of', 'during', 'have', 'can', 'no', "wasn't", "it's", 'wasn', 'herself', 'doing', 'but', "isn't", 'how', 'as', 'her', 'theirs', 'ours', 'them', 'all', 'before', "shouldn't", 'that', 'being', 'him', 'if', 'couldn', 'than', 'on', 'y', 'm', 'been', "didn't", 'those', 'itself', 'when', 't', 'were', 'until', 'do', 'hadn', 'most', 'themselves', 'yourself', 'which', 'nor', 'our', 'an

In [None]:
# Block 4
# Containers for the data
original = []
articles = []
labels = []

print(type(original))


<class 'list'>


In [None]:
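# Block 5
# Read and process the BBC file
with open('/content/drive/MyDrive/dataset/bbc-text.csv', 'r') as file:
    # Read the column names
    reader = csv.reader(file)
    header = next(reader)
    print(header)

    # Process the articles one row at a time
    for row in reader:
        labels.append(row[0])
        original.append(row[1])

        # Remove stopwords
        # Note: matching ' word ' misses a stopword at the very start or
        # end of the text and one of two identical adjacent stopwords,
        # so a few stopwords survive this pass.
        news = row[1]
        #print('before:', news)
        for word in MY_STOP:
            mask = ' ' + word + ' '
            news = news.replace(mask, ' ')
        #print('after:', news)

        articles.append(news)
print('Number of articles processed:', len(articles))

['category', 'text']
Number of articles processed: 2225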
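
The substring replacement in Block 5 only matches a stopword that has a space on both sides, which is presumably why 'the' still appears in the vocabulary later on. The sketch below contrasts that approach with a token-based filter. It is a minimal illustration, not part of the original notebook: the one-word stopword list and the function names strip_by_replace and strip_by_tokens are made up for the example.

```python
# Hypothetical one-word stopword list; the notebook uses NLTK's English list.
STOP = ['the']

def strip_by_replace(text, stop):
    """The notebook's approach: replace ' word ' with a single space."""
    for word in stop:
        text = text.replace(' ' + word + ' ', ' ')
    return text

def strip_by_tokens(text, stop):
    """Alternative: split on whitespace and drop every stopword token."""
    stop = set(stop)
    return ' '.join(w for w in text.split() if w not in stop)

sample = 'the cat saw the the dog'
print(strip_by_replace(sample, STOP))  # the cat saw the dog
print(strip_by_tokens(sample, STOP))   # cat saw dog
```

The token-based version also collapses repeated spaces, so switching to it would change the tokenized output and the vocabulary indices shown in the later blocks.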


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Block 6
# Print the original sample article
print('Original sample article')
print(original[MY_SAMPLE])
print(labels[MY_SAMPLE])
print(type(original[MY_SAMPLE]))
print('Total word count:', len(original[MY_SAMPLE].split()))

Original sample article
screensaver tackles spam websites net users are getting the chance to fight back against spam websites  internet portal lycos has made a screensaver that endlessly requests data from sites that sell the goods and services mentioned in spam e-mail. lycos hopes it will make the monthly bandwidth bills of spammers soar by keeping their servers running flat out. the net firm estimates that if enough people sign up and download the tool  spammers could end up paying to send out terabytes of data.   we ve never really solved the big problem of spam which is that its so damn cheap and easy to do   said malte pollmann  spokesman for lycos europe.  in the past we have built up the spam filtering systems for our users   he said   but now we are going to go one step further.    we ve found a way to make it much higher cost for spammers by putting a load on their servers.  by getting thousands of people to download and use the screensaver  lycos hopes to get spamming websites constantly r

In [None]:
# Block 7
# Print the sample article with stopwords removed
print('Stopwords removed')
print(articles[MY_SAMPLE])
print('Total word count:', len(articles[MY_SAMPLE].split()))
print(type(articles[MY_SAMPLE]))


Stopwords removed
screensaver tackles spam websites net users getting chance fight back spam websites  internet portal lycos made screensaver endlessly requests data sites sell goods services mentioned spam e-mail. lycos hopes make monthly bandwidth bills spammers soar keeping servers running flat out. net firm estimates enough people sign download tool  spammers could end paying send terabytes data.   never really solved big problem spam damn cheap easy   said malte pollmann  spokesman lycos europe.  past built spam filtering systems users   said   going go one step further.    found way make much higher cost spammers putting load servers.  getting thousands people download use screensaver  lycos hopes get spamming websites constantly running almost full capacity. mr pollmann said intention stop spam websites working subjecting much data cope with. said screensaver carefully written ensure amount traffic generated user overload web.  every single user contribute three four megabytes per day   s

In [None]:
# # Block 7.5
# # Check which articles still contain ' the ' after stopword removal
# for i in range(len(articles)):
#     if ' the ' in articles[i]:
#         print('original: ', original[i])
#         print('processed: ', articles[i])


In [None]:
# Block 8
# Tokenize the articles
A_token = Tokenizer(num_words=MY_VOCAB,
                    oov_token='OOV')
# word_index is empty until the tokenizer has been fitted
print(A_token.word_index)
A_token.fit_on_texts(articles)
print(A_token.word_index)

# Convert numbers to words
# sequence: numbers, text: words
# OOV: special out-of-vocabulary token
print(A_token.sequences_to_texts([[1]]))
print(A_token.sequences_to_texts([[1012]]))

# Convert words to numbers
print(A_token.texts_to_sequences(['the']))

# Tokenize all of the stopword-stripped articles
A_tokenized = A_token.texts_to_sequences(articles)


{}
['OOV']
['the']
[[1012]]
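
To make the index assignment in Block 8 concrete, here is a simplified emulation in plain Python. It is a sketch under assumptions, not Keras's implementation: the helper names build_index and encode are hypothetical, and it ignores the punctuation filtering the real Tokenizer performs. It shows the two points that matter here: the OOV token takes index 1 with words ranked by frequency after it, and any word whose index is num_words or higher is mapped to OOV at encoding time.

```python
from collections import Counter

def build_index(texts, oov='OOV'):
    """Rank words by frequency; the OOV token takes index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    index = {oov: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(text, index, num_words, oov='OOV'):
    """Map words to indices; unknown or too-rare words become OOV."""
    oov_id = index[oov]
    ids = []
    for w in text.lower().split():
        i = index.get(w)
        ids.append(i if i is not None and i < num_words else oov_id)
    return ids

texts = ['spam spam spam net', 'net users fight spam']
index = build_index(texts)
print(index)
print(encode('spam users fight lycos', index, num_words=5))  # [2, 4, 1, 1]
```

In the last line, 'fight' has index 5, which is not below num_words, and 'lycos' was never seen, so both come out as 1.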


In [None]:
# Block 9
# Print the tokenized result
sample = A_tokenized[MY_SAMPLE]
print(sample)

# Article statistics
longest = max([len(x) for x in A_tokenized])
print('Word count of the longest article:', longest)

shortest = min([len(x) for x in A_tokenized])
print('Word count of the shortest article:', shortest)


[3170, 1, 816, 878, 115, 136, 382, 347, 716, 28, 816, 878, 228, 1, 3171, 27, 3170, 1, 4867, 203, 569, 733, 1771, 126, 4024, 816, 260, 395, 3171, 700, 21, 1650, 3629, 2848, 2607, 1, 2325, 2551, 453, 2918, 570, 115, 63, 2290, 381, 7, 1161, 780, 1860, 2607, 11, 92, 1572, 1052, 1, 203, 281, 154, 1, 138, 364, 816, 1, 2224, 847, 2, 1, 1, 178, 3171, 139, 255, 1110, 816, 1, 726, 136, 2, 52, 60, 10, 818, 3921, 195, 41, 21, 56, 495, 245, 2607, 1362, 1, 2551, 382, 1022, 7, 780, 70, 3170, 3171, 700, 23, 1, 878, 3992, 453, 343, 322, 1393, 3, 1, 2, 3428, 583, 816, 878, 297, 1, 56, 203, 2296, 2403, 2, 3170, 2708, 1070, 660, 812, 1287, 3883, 1540, 1, 466, 224, 504, 1540, 1, 31, 96, 1, 681, 111, 2, 10, 1899, 912, 2, 381, 7, 1161, 1, 878, 11, 722, 256, 1, 1287, 224, 504, 111, 3171, 79, 70, 260, 395, 716, 28, 2, 3, 1, 4, 1606, 10, 823, 455, 158, 823, 455, 2, 569, 2178, 4024, 816, 260, 395, 891, 733, 1771, 126, 220, 3677, 569, 316, 86, 1052, 816, 260, 395, 3677, 23, 1, 1453, 681, 111, 415, 569, 3170, 760,

In [None]:
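# Block 10
# Make every article the same length
# padding: fills articles shorter than 200 tokens
# truncating: cuts articles longer than 200 tokens
# pad_sequences returns a numpy array
A_tokenized = pad_sequences(A_tokenized, 
                            maxlen=MY_LEN,
                            padding='post',
                            truncating='post')

# Article statistics
longest = max([len(x) for x in A_tokenized])
print('Word count of the longest article:', longest)

shortest = min([len(x) for x in A_tokenized])
print('Word count of the shortest article:', shortest)

print('Total unique words in the word-count map:', len(A_token.word_counts))

print('Final processed sample article')
print(A_tokenized[MY_SAMPLE])


Word count of the longest article: 200
Word count of the shortest article: 200
Total unique words in the word-count map: 29698
Final processed sample article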
[3170    1  816  878  115  136  382  347  716   28  816  878  228    1
 3171   27 3170    1 4867  203  569  733 1771  126 4024  816  260  395
 3171  700   21 1650 3629 2848 2607    1 2325 2551  453 2918  570  115
   63 2290  381    7 1161  780 1860 2607   11   92 1572 1052    1  203
  281  154    1  138  364  816    1 2224  847    2    1    1  178 3171
  139  255 1110  816    1  726  136    2   52   60   10  818 3921  195
   41   21   56  495  245 2607 1362    1 2551  382 1022    7  780   70
 3170 3171  700   23    1  878 3992  453  343  322 1393    3    1    2
 3428  583  816  878  297    1   56  203 2296 2403    2 3170 2708 1070
  660  812 1287 3883 1540    1  466  224  504 1540    1   31   96    1
  681  111    2   10 1899  912    2  381    7 1161    1  878   11  722
  256    1 1287  224  504  111 3171   79   70  260  395  716   28    2
    3    1    4 1606   10  823  455  158  823  455    2  569 2178 4024
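
The effect of padding='post' and truncating='post' in Block 10 can be sketched in a few lines of plain Python. The helper name pad_post is hypothetical and only mirrors the behavior for a single sequence: keep the first maxlen tokens and fill the remainder with zeros at the end.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the first maxlen items, then right-pad with value."""
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([3, 1, 4], 5))              # [3, 1, 4, 0, 0]
print(pad_post([3, 1, 4, 1, 5, 9, 2], 5))  # [3, 1, 4, 1, 5]
```

Index 0 is free to serve as the padding value because the tokenizer starts numbering at 1.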

In [None]:
# Block 11
# Tokenize the labels
C_token = Tokenizer()
C_token.fit_on_texts(labels)
print(labels)

# Convert words to numbers
# texts_to_sequences returns its result as a list
C_tokenized = C_token.texts_to_sequences(labels)
print(C_tokenized)
print(C_token.word_index)


['tech', 'business', 'sport', 'sport', 'entertainment', 'politics', 'politics', 'sport', 'sport', 'entertainment', 'entertainment', 'business', 'business', 'politics', 'sport', 'business', 'politics', 'sport', 'business', 'tech', 'tech', 'tech', 'sport', 'sport', 'tech', 'sport', 'entertainment', 'tech', 'politics', 'entertainment', 'politics', 'tech', 'entertainment', 'entertainment', 'business', 'politics', 'tech', 'entertainment', 'politics', 'business', 'politics', 'sport', 'business', 'sport', 'tech', 'entertainment', 'politics', 'politics', 'politics', 'business', 'sport', 'politics', 'business', 'business', 'sport', 'politics', 'business', 'sport', 'sport', 'business', 'business', 'sport', 'business', 'sport', 'business', 'tech', 'business', 'entertainment', 'tech', 'business', 'politics', 'business', 'politics', 'sport', 'business', 'tech', 'business', 'sport', 'sport', 'business', 'business', 'sport', 'politics', 'business', 'entertainment', 'politics', 'politics', 'business',

In [None]:
# Block 12
# Split the data into four parts
print(type(A_tokenized))
print(type(C_tokenized))

# Convert the list to a numpy array
C_tokenized = np.array(C_tokenized)

# Four-way split: training/test inputs and training/test labels
X_train, X_test, Y_train, Y_test = train_test_split(A_tokenized,
                                                    C_tokenized,
                                                    train_size=MY_SPLIT,
                                                    shuffle=False)

# Check the data shapes
print('Training input shape:', X_train.shape)
print('Training output shape:', Y_train.shape)

print('Test input shape:', X_test.shape)
print('Test output shape:', Y_test.shape)

# Print samples
print(X_train[MY_SAMPLE])
print(Y_train[MY_SAMPLE])
print(X_test[MY_SAMPLE])
print(Y_test[MY_SAMPLE])


<class 'numpy.ndarray'>
<class 'list'>
Training input shape: (1780, 200)
Training output shape: (1780, 1)
Test input shape: (445, 200)
Test output shape: (445, 1)
[3170    1  816  878  115  136  382  347  716   28  816  878  228    1
 3171   27 3170    1 4867  203  569  733 1771  126 4024  816  260  395
 3171  700   21 1650 3629 2848 2607    1 2325 2551  453 2918  570  115
   63 2290  381    7 1161  780 1860 2607   11   92 1572 1052    1  203
  281  154    1  138  364  816    1 2224  847    2    1    1  178 3171
  139  255 1110  816    1  726  136    2   52   60   10  818 3921  195
   41   21   56  495  245 2607 1362    1 2551  382 1022    7  780   70
 3170 3171  700   23    1  878 3992  453  343  322 1393    3    1    2
 3428  583  816  878  297    1   56  203 2296 2403    2 3170 2708 1070
  660  812 1287 3883 1540    1  466  224  504 1540    1   31   96    1
  681  111    2   10 1899  912    2  381    7 1161    1  878   11  722
  256    1 1287  224  504  111 3171   79   70  260  395  716   28    2
    3  
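
With shuffle=False, the split in Block 12 amounts to cutting the arrays at a fixed position. The sketch below reproduces the 1780/445 sizes using plain list slicing. It is an illustration only: the list of integers stands in for the articles.

```python
MY_SPLIT = 0.8
n_samples = 2225

# Number of training rows: the train fraction rounded down
n_train = int(MY_SPLIT * n_samples)

data = list(range(n_samples))   # stand-in for the 2225 articles
train, test = data[:n_train], data[n_train:]

print(len(train), len(test))    # 1780 445
```

Because nothing is shuffled, the test set is simply the last 445 rows of the file, so its class balance depends on how the CSV happens to be ordered.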

3. Building the neural network

In [None]:
# Block 13
# Build the RNN
model = Sequential()

model.add(Embedding(input_dim=MY_VOCAB,
                    output_dim=MY_EMBED))

model.add(Dropout(rate=0.5))

# this is the secret sauce!
model.add(Bidirectional(LSTM(units=MY_HIDDEN)))

# 6 outputs for 5 categories: label indices run from 1 to 5,
# so output 0 is never used as a target
model.add(Dense(units=6,
                activation='softmax'))

print('RNN summary')
model.summary()

RNN summary
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          320000    
_________________________________________________________________
dropout (Dropout)            (None, None, 64)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               132000    
_________________________________________________________________
dense (Dense)                (None, 6)                 1206      
Total params: 453,206
Trainable params: 453,206
Non-trainable params: 0
_________________________________________________________________
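
The parameter counts in the summary can be checked by hand. The arithmetic below uses the standard formulas for an embedding table, an LSTM (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the variable names are chosen for this example.

```python
vocab, embed, hidden, classes = 5000, 64, 100, 6

# Embedding: one vector of size `embed` per vocabulary entry
embedding_params = vocab * embed

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (hidden * (embed + hidden) + hidden)

# Bidirectional wraps two independent LSTMs and concatenates their outputs
bidirectional_params = 2 * lstm_params

# Dense: input is the concatenated 2 * hidden vector, plus one bias per class
dense_params = (2 * hidden) * classes + classes

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
# 320000 132000 1206 453206
```

The concatenation of the forward and backward states is also why the Bidirectional layer reports an output shape of (None, 200) for 100 hidden units.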


4. Training the neural network

In [None]:
# Block 14
# Configure RNN training
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['acc'])

print('Training started')
begin = time()

model.fit(X_train,
          Y_train,
          epochs=MY_EPOCH,
          verbose=1)

end = time()
print('Total training time: {:.2f}'.format(end-begin))


Training started
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Total training time: 17.28
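
The loss chosen in Block 14 takes integer class labels directly, which is why the labels never had to be one-hot encoded. For a single example it reduces to the negative log of the probability the model assigned to the true class. The sketch below shows that calculation in plain Python; the probability vector is invented for illustration and the function name is hypothetical.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: -log of the probability of the true class."""
    return -math.log(probs[label])

# Invented softmax output over 6 units (index 0 is the unused slot)
probs = [0.05, 0.05, 0.10, 0.10, 0.60, 0.10]

print(sparse_categorical_crossentropy(probs, 4))  # true class 4 ('tech')
print(sparse_categorical_crossentropy(probs, 2))  # larger loss for a wrong guess
```

Keras averages this quantity over the examples in each batch.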


5. Evaluating the neural network

In [None]:
# Block 15
# Evaluate the RNN
# adam
# total training time: 29.94
# final accuracy: 0.93
# sgd
# total training time: 26.59
# final accuracy: 0.22

score = model.evaluate(X_test,
                       Y_test,
                       verbose=1)

print('Final loss: {:.2f}'.format(score[0]))
print('Final accuracy: {:.2f}'.format(score[1]))


Final loss: 0.15
Final accuracy: 0.95


In [None]:
# Block 16
# Predict with the RNN
# Keras model functions: compile, fit, evaluate, predict
# {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
pred = model.predict(X_test)
#print(pred)
# argmax picks the class with the highest probability for each article
pred = pred.argmax(axis=1)
print('Predicted:', pred)
print('Actual:', Y_test.flatten())


Predicted: [5 4 3 1 1 4 2 4 5 5 3 3 2 5 1 5 5 2 1 3 4 2 1 5 4 3 3 1 1 3 2 2 2 2 5 2 3
 3 4 4 5 3 5 2 3 1 1 3 4 2 4 1 2 2 3 1 1 3 3 5 5 3 2 3 3 2 4 3 3 3 3 3 5 5
 4 3 1 3 1 4 1 1 1 5 4 5 4 1 4 1 1 5 5 2 5 5 3 2 1 4 4 3 2 1 2 5 1 3 5 1 1
 2 3 4 4 2 2 1 3 5 1 1 3 5 4 4 5 2 3 1 3 4 5 1 3 2 5 3 5 3 1 3 2 2 3 2 4 1
 2 5 2 1 1 5 4 3 4 3 3 1 1 1 2 4 5 2 1 2 3 2 4 2 2 2 2 1 1 1 2 2 5 2 2 2 1
 3 1 4 2 1 1 1 2 5 4 4 4 3 2 2 4 2 4 1 1 3 3 3 1 1 3 3 4 2 1 1 1 1 2 1 2 2
 2 2 1 3 1 4 4 1 4 2 5 2 1 2 4 4 3 5 2 4 2 4 3 5 2 5 5 4 3 4 4 2 3 1 5 2 3
 5 2 4 1 4 3 1 3 2 3 3 2 2 2 4 3 2 3 2 4 3 1 3 3 1 5 4 4 2 4 1 2 2 2 1 4 4
 4 1 5 1 3 2 3 3 5 4 2 4 1 5 5 1 2 5 4 4 1 5 2 3 3 3 4 4 2 3 2 4 3 5 5 2 2
 4 5 4 4 1 3 1 1 3 5 5 2 3 3 1 2 2 4 2 4 4 1 2 3 1 2 2 1 4 1 4 5 1 5 5 2 4
 1 1 3 4 2 3 1 1 3 2 4 4 2 2 1 5 4 4 2 3 4 1 1 4 4 3 2 1 5 5 1 5 4 1 2 2 2
 1 1 4 1 2 4 2 2 1 2 3 2 2 5 3 4 3 4 5 3 4 5 3 3 5 1 4 2 4 5 4 1 2 2 3 5 3
 1]
Actual: [5 4 3 1 1 4 2 4 5 5 3 3 2 5 1 5 5 2 1 3 4 2 1 5 4 3 3 1 1 2 2 2 2 2 5 2 3
 3 4 4 5 3 5

In [None]:
# Block 17
# Rows are the true classes, columns are the predicted classes
print('Confusion matrix')
print(confusion_matrix(Y_test,
                       pred))


Confusion matrix
[[ 95   0   3   1   2]
 [  1 101   4   0   0]
 [  0   2  83   1   0]
 [  1   2   0  82   1]
 [  0   1   0   3  62]]
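
The confusion matrix carries enough information to recover the accuracy reported in Block 15 and to compute per-class F1 scores, which the notebook imports f1_score for but never calls. The sketch below does both by hand in plain Python, copying the matrix from the output above; the variable names are chosen for this example.

```python
# Confusion matrix copied from the Block 17 output
# (rows: true class 1..5, columns: predicted class 1..5)
CM = [
    [95, 0, 3, 1, 2],
    [1, 101, 4, 0, 0],
    [0, 2, 83, 1, 0],
    [1, 2, 0, 82, 1],
    [0, 1, 0, 3, 62],
]
n = len(CM)

total = sum(sum(row) for row in CM)
correct = sum(CM[i][i] for i in range(n))
accuracy = correct / total

f1_scores = []
for i in range(n):
    tp = CM[i][i]
    actual = sum(CM[i])                          # row sum: true members of class i
    predicted = sum(CM[r][i] for r in range(n))  # column sum: predictions of class i
    precision = tp / predicted
    recall = tp / actual
    f1_scores.append(2 * precision * recall / (precision + recall))

macro_f1 = sum(f1_scores) / n
print('accuracy: {:.2f}'.format(accuracy))
print('macro F1: {:.2f}'.format(macro_f1))
```

The diagonal sums to 423 out of 445 articles, which matches the 0.95 accuracy from model.evaluate.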


In [None]:
# Block 18
# Try the model on a real article
news = ['Google and Microsoft have pledged to support India as it tackles record numbers of coronavirus cases that have overwhelmed the countrys hospitals. Sundar Pichai, the head of Google and parent company Alphabet, said he was devastated by events and the firm would provide $18m (£13m) in funding. Microsoft boss Satya Nadella said he was heartbroken and would help India with its shortage of oxygen supplies. Both chief executives of the technology giants were born in India. Mr Pichai, who was born and schooled in the southern Indian city of Chennai, announced Googles funding move in a tweet on Sunday, linking to a statement by vice-president of Google India, Sanjay Gupta. Right now India is going through our most difficult moment in the pandemic thus far, the statement said.']

# Tokenize
news = A_token.texts_to_sequences(news)
print(news)
print(type(news))
print('Total word count:', len(news[0]))

news = pad_sequences(news,
                     maxlen=MY_LEN,
                     padding='post',
                     truncating='post')
print('Length after padding:', len(news[0]))

pred = model.predict(news)
pred = pred.argmax(axis=1)
# {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
print('RNN prediction:', pred)



[[837, 1887, 283, 3250, 1714, 487, 290, 426, 1, 223, 1, 128, 738, 1489, 1, 1021, 745, 3250, 1, 1012, 1, 2734, 1, 1, 1012, 392, 1489, 837, 1887, 2958, 47, 1, 2, 1, 1, 1, 3075, 1270, 1887, 1012, 63, 4, 694, 1, 1, 619, 1493, 283, 596, 1, 1, 2, 1, 1, 1, 1887, 4, 131, 426, 2403, 1, 1, 1489, 1, 2903, 1, 113, 1556, 1489, 1012, 75, 3552, 1, 1452, 619, 426, 3, 1, 1, 1, 1452, 1887, 1, 619, 1012, 2369, 916, 510, 1489, 1, 331, 1, 1493, 156, 619, 1385, 1, 579, 332, 1, 487, 1385, 374, 3075, 1299, 194, 1489, 837, 426, 1, 1, 127, 1530, 426, 2283, 52, 4189, 1, 1, 528, 804, 619, 1012, 1, 1, 196, 1012, 374, 2]]
<class 'list'>
Total word count: 129
Length after padding: 200
RNN prediction: [4]
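
The prediction comes back as a class index, so a small lookup makes the result readable. The sketch below inverts the label mapping quoted in the comments of Blocks 16 and 18; the names LABEL_INDEX, INDEX_LABEL, and decode are hypothetical. In the notebook itself, C_token.index_word should provide the same inverse mapping.

```python
# Mapping copied from the comment in Blocks 16 and 18
LABEL_INDEX = {'sport': 1, 'business': 2, 'politics': 3,
               'tech': 4, 'entertainment': 5}
INDEX_LABEL = {i: name for name, i in LABEL_INDEX.items()}

def decode(pred_ids):
    """Turn predicted class indices into category names."""
    # Index 0 has no label: it is the unused sixth output of the Dense layer
    return [INDEX_LABEL.get(int(i), 'unknown') for i in pred_ids]

print(decode([4]))  # ['tech']
```

Note that the article in Block 18 was tokenized without the stopword removal applied to the training articles, so its input differs slightly from what the model saw during training.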
