# Practice 02 - Applying Transfer Learning to Text Document Classification

- Use transfer learning to classify text such as emails and ratings
- Download the IMDB movie review dataset and extract it into the data directory
    - Download: http://ai.stanford.edu/~amaas/data/sentiment/
    - Save path: data/practice02/aclImdb
    - Folder structure
        - data/aclImdb/train/pos/...txt
        - data/aclImdb/train/neg/...txt
- Download the pretrained embedding file
    - Load pretrained GloVe word embeddings
    - Download link: http://nlp.stanford.edu/data/glove.6B.zip
    - Folder structure
        - data/glove.6B/
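
Before running the notebook it can help to confirm that both downloads were extracted where the loaders expect them. The following is a minimal sketch that assumes the `data/aclImdb` and `data/glove.6B` layout from the folder listings above; `missing_paths` and `EXPECTED` are hypothetical names, not part of the project code.

```python
from pathlib import Path

# Relative paths taken from the folder listings above (assumed layout).
EXPECTED = ["aclImdb/train/pos", "aclImdb/train/neg", "glove.6B"]

def missing_paths(root, relative_paths):
    """Return the relative paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]

# An empty list means every expected directory is present.
print(missing_paths("data", EXPECTED))
```

If the list is not empty, the listed directories still need to be downloaded or moved before the data loading cells will work.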

In [1]:
import os
import config
from dataloader.loader import Loader
from preprocessing.utils import Preprocess, remove_empty_docs
from dataloader.embeddings import GloVe
from model.cnn_document_model import DocumentModel, TrainingParameters
from keras.callbacks import ModelCheckpoint, EarlyStopping
import numpy as np

# 1. Set training parameters

In [2]:
# Create the directory where the trained model will be saved
os.makedirs(os.path.join(config.MODEL_DIR, 'imdb'), exist_ok=True)

# Set training parameters
train_params = TrainingParameters('imdb_transfer_tanh_activation', 
                                  model_file_path = config.MODEL_DIR+ '/imdb/transfer_model_10.hdf5',
                                  model_hyper_parameters = config.MODEL_DIR+ '/imdb/transfer_model_10.json',
                                  model_train_parameters = config.MODEL_DIR+ '/imdb/transfer_model_10_meta.json',
                                  num_epochs=30,
                                  batch_size=128)

# 2. Load the data

In [3]:
# Load the downloaded IMDB data: keep only 5% of the training set (25,000 reviews in total)
train_df = Loader.load_imdb_data(directory = 'train')
train_df = train_df.sample(frac=0.05, random_state = train_params.seed)
print(f'train_df.shape : {train_df.shape}')

test_df = Loader.load_imdb_data(directory = 'test')
print(f'test_df.shape : {test_df.shape}')

# Extract the review texts and labels
corpus = train_df['review'].tolist()
target = train_df['sentiment'].tolist()
corpus, target = remove_empty_docs(corpus, target)
print(f'corpus size : {len(corpus)}')
print(f'target size : {len(target)}')

train_df.shape : (1250, 2)
test_df.shape : (25000, 2)
corpus size : 1250
target size : 1250
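
Because only 5% of the training set is sampled, it is worth checking that both classes are still represented in similar proportions. This is a small sketch using only the standard library; `label_shares` is a hypothetical helper and the toy list stands in for `target`.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Toy labels standing in for `target` (1 = positive, 0 = negative).
toy_target = [1, 1, 1, 0, 0, 1, 0, 0]
print(label_shares(toy_target))
```

In the notebook the same call would be `label_shares(target)`; a share far from 0.5 would suggest resampling with a different seed or stratifying the sample.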


## > Inspect the review data

In [4]:
# Result after preprocessing
train_df

|  | review | sentiment |
|---|---|---|
| 1152 | A man and his wife get in a horrible car accid... | 1 |
| 3058 | Well, what can I say, this movie really got to... | 1 |
| 12016 | This early version of the tale 'The Student of... | 1 |
| 12239 | To a certain extent, I actually liked this fil... | 1 |
| 21127 | I watched this film, along with every other ad... | 0 |
| ... | ... | ... |
| 24667 | I'm a huge Steven Seagal fan. Hell, I probably... | 0 |
| 14460 | It's not a terrible movie, really, and Glenn a... | 0 |
| 12692 | There is no doubt that this film has an impres... | 0 |
| 6400 | Most of the feedback I've heard concerning Mea... | 1 |


In [5]:
import random
for i in random.sample(range(len(corpus)), 3):
    print('-'*100)
    print(f">> target: {target[i]}")
    print('>> review')
    print(corpus[i])

----------------------------------------------------------------------------------------------------
>> target: 0
>> review
Autobiography of founder of zoo in NYC starts out by being very cute and would be great family movie if it stayed there. however we get more and more involved with reality as gorilla grows up to be a wild thing not easily amenable to his "mother's" wishes - this might scare younger children, esp. scenes where Buddy tries to injure Gertrude. rather quick resolution at the end. below average.
----------------------------------------------------------------------------------------------------
>> target: 1
>> review
Last November, I had a chance to see this film at the Reno Film Festival. I have to say that it was a lot of fun. A few tech errors aside, it was a great experience. I loved the writing and acting, especially from the guy that played the lead role. There is a lot of heart in this movie, a lot of wit to. I got a chance to speak with a few of the filmmakers 

# 3. Preprocessing

## 3.1. Text -> sequence conversion (index sequences)

- Note: the custom preprocessing helpers use NLTK's sent_tokenize and wordpunct_tokenize
    - sent_tokenize needs the Punkt model, so run the code below once: nltk.download('punkt')
    - Reference: from preprocessing.utils import Preprocess, remove_empty_docs
        - utils.py contains: from nltk.tokenize import sent_tokenize, wordpunct_tokenize

In [6]:
# import nltk
# nltk.download('punkt')

In [7]:
# Convert the training set to index sequences
preprocessor = Preprocess(corpus=corpus)
corpus_to_seq = preprocessor.fit()

Found 4990 unique tokens.
All documents processed.

In [8]:
len(corpus_to_seq)

1250

## > Inspect the sequence conversion result

In [9]:
corpus[0]

'A man and his wife get in a horrible car accident. When the wife is left in a persistent vegetative state, the man must choose between pulling the plug and letting her live. The decision is made even harder when he realizes her ghost wants to extract revenge on him and those around him.This comes to us from director Rob Schmidt, who made "Wrong Turn" (a film I have not seen). With only one horror film under his belt, and not a particularly notorious one at that, I was a bit reluctant to watch this episode, expecting Schmidt to be a "Master of Horror" in only the most liberal sense. My apologies to him for my underestimation. As of episode 10 in a 13 episode season, this was actually the best one yet.The issue of the "right to die" is dealt with and covered in enough detail to be a solid plot device. However, this is only the foundation on which the story revolves. Once the horror elements show up, the film goes from "decent" to "spectacular". Great acting, great plot, great dialogue, 

In [10]:
corpus_to_seq[0]

array([  2,   3,   4,   5,   6,   7,   8,   2,   9,  10,  11,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,  12,  13,   6,  14,  15,   8,   2,  16,  13,
         3,  17,  18,  19,  13,   4,  20,  21,  22,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,  13,  23,  14,  24,  25,
        26,  12,  27,  28,  21,  29,  30,  31,  32,  33,  34,   4,  35,
        36,  34,  37,  38,  31,  39,  40,  41,  42,  43,  24,  44,  51,
        52,  53,  54,  46,  55,   5,  56,   4,  49,   2,  57,  58,  53,
        59,  60,  47,  61,   2,  62,  31,  63,  37,  64,  65,  31,  66,
         2,  67,  68,  72,  31,  34,  73,  72,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,  74,  68,  64,   8,   2,  64,
        75,  37,  61,  76,  13,  77,  53,  78,  13,  79,  68,  13,  80,
        31,  81,  14,  82,  51,   4,  83,   8,  84,  85,  31,  8
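
The output above suggests that each review becomes a fixed block of 300 indices: sentences are laid out one after another, each padded with zeros to the same length, and word indices start at 2. The sketch below reproduces that idea under stated assumptions. The 10 x 30 layout, the reserved indices 0 (padding) and 1 (unknown word), and the regex-based splitting are inferred or simplified for illustration; this is not the actual `Preprocess` implementation, which uses NLTK tokenizers.

```python
import re

PAD, OOV = 0, 1  # assumed reserved indices: 0 pads, 1 marks unknown words

def build_word_index(docs):
    """Assign indices starting at 2, in order of first appearance."""
    word_index = {}
    for doc in docs:
        for token in re.findall(r"\w+", doc.lower()):
            word_index.setdefault(token, len(word_index) + 2)
    return word_index

def encode_document(doc, word_index, num_sentences=10, sentence_len=30):
    """Encode a document as num_sentences * sentence_len zero-padded indices."""
    # Simplified sentence splitting on ., ! and ? (NLTK is not used here).
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    encoded = []
    for sentence in sentences[:num_sentences]:
        tokens = re.findall(r"\w+", sentence.lower())
        ids = [word_index.get(t, OOV) for t in tokens][:sentence_len]
        encoded.extend(ids + [PAD] * (sentence_len - len(ids)))
    # Pad missing sentences so every document has the same total length.
    encoded.extend([PAD] * (num_sentences * sentence_len - len(encoded)))
    return encoded

docs = ["A man and his wife get in a horrible car accident. When the wife is left."]
word_index = build_word_index(docs)
seq = encode_document(docs[0], word_index)
print(len(seq), seq[:12], seq[30:35])
```

The first sentence fills positions 0 to 10 and is padded up to position 29, and the second sentence starts at position 30, which mirrors the pattern visible in `corpus_to_seq[0]`.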

In [11]:
# Convert the test set to index sequences
test_corpus = test_df['review'].tolist()
test_target = test_df['sentiment'].tolist()
test_corpus, test_target = remove_empty_docs(test_corpus, test_target)
test_corpus_to_seq = preprocessor.transform(test_corpus)

All documents processed.

In [12]:
print(f'test_corpus_to_seq size : {len(test_corpus_to_seq)}')
print(f'test_corpus_to_seq[0] size : {len(test_corpus_to_seq[0])}')

test_corpus_to_seq size : 25000
test_corpus_to_seq[0] size : 300


## 3.2. Training set & test set

In [13]:
# Prepare the training and test sets
x_train = np.array(corpus_to_seq)
x_test = np.array(test_corpus_to_seq)
y_train = np.array(target)
y_test = np.array(test_target)

print(f'x_train.shape : {x_train.shape}')
print(f'y_train.shape : {y_train.shape}')
print(f'x_test.shape : {x_test.shape}')
print(f'y_test.shape : {y_test.shape}')

x_train.shape : (1250, 300)
y_train.shape : (1250,)
x_test.shape : (25000, 300)
y_test.shape : (25000,)


In [14]:
x_train

array([[  2,   3,   4, ...,   0,   0,   0],
       [229, 129, 146, ...,   0,   0,   0],
       [ 37, 272, 273, ...,   0,   0,   0],
       ...,
       [119,  14, 127, ...,   0,   0,   0],
       [ 69,  68,  13, ...,   0,   0,   0],
       [ 47, 581, 163, ...,   0,   0,   0]], dtype=int32)

In [15]:
y_train

array([1, 1, 1, ..., 0, 1, 0])

## 3.3. Word embedding - GloVe

- Load pretrained GloVe word embeddings
    - Download link: http://nlp.stanford.edu/data/glove.6B.zip
- Note: GloVe loading is wrapped in a custom class
    - from dataloader.embeddings import GloVe
    
        ```python
        import os

        import numpy as np

        import config  # project-local settings module (provides GLOVE_DIR)

        class GloVe:
            def __init__(self, embd_dim=50):
                if embd_dim not in [50, 100, 200, 300]:
                    raise ValueError('embedding dim should be one of [50, 100, 200, 300]')
                self.EMBEDDING_DIM = embd_dim
                self.embedding_matrix = None
                
            def _load(self):
                print('Reading {} dim GloVe vectors'.format(self.EMBEDDING_DIM))
                self.embeddings_index = {}
                # Read the pretrained embeddings (data/glove.6B/glove.6B.50d.txt)
                # e.g. the line for the word "of": of 0.70853 0.57088 -0.4716 0.18048 ...
                # Store it in embeddings_index as {"of": [0.70853, 0.57088, -0.4716, 0.18048, ...]}
                with open(os.path.join(config.GLOVE_DIR, 'glove.6B.'+str(self.EMBEDDING_DIM)+'d.txt'),encoding="utf8") as fin:
                    for line in fin:
                        try:
                            values = line.split()
                            coefs = np.asarray(values[1:], dtype='float32')
                            word = values[0]
                            self.embeddings_index[word] = coefs
                        except (ValueError, IndexError):
                            # Print lines that cannot be parsed instead of aborting the load
                            print(line)

                print('Found %s word vectors.' % len(self.embeddings_index))

            def _init_embedding_matrix(self, word_index_dict, oov_words_file='OOV-Words.txt'):
                # Zero matrix to be filled with embeddings
                # +2 rows: index 0 is the padding index and index 1 holds the OOV average vector
                self.embedding_matrix = np.zeros((len(word_index_dict) + 2, self.EMBEDDING_DIM))
                not_found_words=0
                missing_word_index = []
                
                with open(oov_words_file, 'w') as f: 
                    for word, i in word_index_dict.items():
                        # embeddings_index: dictionary parsed from the pretrained glove.6B.50d.txt file
                        embedding_vector = self.embeddings_index.get(word)
                        if embedding_vector is not None:
                            # Words found in GloVe get their pretrained vector
                            self.embedding_matrix[i] = embedding_vector
                        else:
                            not_found_words+=1
                            f.write(word + ','+str(i)+'\n')
                            missing_word_index.append(i)

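                    # OOV handling: row 1 stores the mean of all rows, and each missing word
                    # gets that mean plus uniform random noise in [0, 1)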
                    self.embedding_matrix[1] = np.mean(self.embedding_matrix, axis=0)
                    for indx in missing_word_index:
                        self.embedding_matrix[indx] = np.random.rand(self.EMBEDDING_DIM)+ self.embedding_matrix[1]
                print("words not found in embeddings: {}".format(not_found_words))
                
                
            def get_embedding(self, word_index_dict):  # input: dictionary mapping words to indices
                if self.embedding_matrix is None:
                    self._load()
                    self._init_embedding_matrix(word_index_dict) 
                return self.embedding_matrix
        ```
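
To see the same steps without the 400,000-word GloVe file, the sketch below runs them on a two-line toy file. It follows the logic of the class above (zero row for padding, row 1 as the mean vector, missing words set to the mean plus random noise), but `load_vectors` and `build_matrix` are hypothetical helper names and the toy vectors are made up for illustration.

```python
import io

import numpy as np

def load_vectors(lines):
    """Parse GloVe-style lines: a word followed by its coefficients."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_matrix(word_index, vectors, dim, seed=0):
    """Fill an embedding matrix; rows 0 and 1 are reserved (padding, OOV mean)."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index) + 2, dim))
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(idx)
        else:
            matrix[idx] = vec
    # As in the class above, the mean is taken over all rows, zeros included.
    matrix[1] = matrix.mean(axis=0)
    for idx in missing:
        matrix[idx] = rng.random(dim) + matrix[1]
    return matrix, missing

toy_file = io.StringIO("a 1.0 2.0\nman 3.0 4.0\n")
toy_vectors = load_vectors(toy_file)
toy_index = {"a": 2, "man": 3, "zzz": 4}  # "zzz" is not in the toy file
matrix, missing = build_matrix(toy_index, toy_vectors, dim=2)
print(matrix.shape, missing)
```

With three words the matrix has five rows, which matches the `len(word_index_dict) + 2` sizing used by the class.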

In [16]:
preprocessor.word_index

{'a': 2,
 'man': 3,
 'and': 4,
 'his': 5,
 'wife': 6,
 'get': 7,
 'in': 8,
 'horrible': 9,
 'car': 10,
 'accident': 11,
 'when': 12,
 'the': 13,
 'is': 14,
 'left': 15,
 'state': 16,
 'must': 17,
 'choose': 18,
 'between': 19,
 'letting': 20,
 'her': 21,
 'live': 22,
 'decision': 23,
 'made': 24,
 'even': 25,
 'harder': 26,
 'he': 27,
 'realizes': 28,
 'ghost': 29,
 'wants': 30,
 'to': 31,
 'revenge': 32,
 'on': 33,
 'him': 34,
 'those': 35,
 'around': 36,
 'this': 37,
 'comes': 38,
 'us': 39,
 'from': 40,
 'director': 41,
 'rob': 42,
 'who': 43,
 'wrong': 44,
 'turn': 45,
 'film': 46,
 'i': 47,
 'have': 48,
 'not': 49,
 'seen': 50,
 'with': 51,
 'only': 52,
 'one': 53,
 'horror': 54,
 'under': 55,
 'belt': 56,
 'particularly': 57,
 'notorious': 58,
 'at': 59,
 'that': 60,
 'was': 61,
 'bit': 62,
 'watch': 63,
 'episode': 64,
 'expecting': 65,
 'be': 66,
 'master': 67,
 'of': 68,
 'most': 69,
 'liberal': 70,
 'sense': 71,
 'my': 72,
 'for': 73,
 'as': 74,
 'season': 75,
 'actually': 76

In [17]:
# Initialize the GloVe embeddings using the pretrained glove.6B.50d.txt vectors
glove = GloVe(50)
initial_embeddings = glove.get_embedding(preprocessor.word_index)
print(f'initial_embeddings.shape : {initial_embeddings.shape}')

Reading 50 dim GloVe vectors
Found 400000 word vectors.
words not found in embeddings: 16
initial_embeddings.shape : (4992, 50)


In [18]:
# Embedding values for each word in the IMDB text, taken from the pretrained GloVe vectors
initial_embeddings

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.12868069,  0.10964358, -0.10515184, ..., -0.13136516,
        -0.02682301,  0.11120054],
       [ 0.21705   ,  0.46515   , -0.46757001, ..., -0.043782  ,
         0.41012999,  0.1796    ],
       ...,
       [-0.77372003,  0.13817   , -1.18710005, ...,  0.061871  ,
         0.39048001, -1.12639999],
       [-1.06140006, -0.94668001,  0.019802  , ...,  0.65306997,
         0.28865001,  0.031796  ],
       [ 0.9833132 ,  0.89334826,  0.7481463 , ...,  0.49181514,
         0.91040518,  0.59357639]])

# 4. Load the transfer learning model - HandsOn-03

In [19]:
# Load the model hyperparameters
# The Amazon review model was trained in HandsOn-03_Movie_Review.ipynb and saved to the checkpoint folder
model_json_path = os.path.join(config.MODEL_DIR, 'amazonreviews/model_06.json')
amazon_review_model = DocumentModel.load_model(model_json_path)

# Load the model weights
model_hdf5_path = os.path.join(config.MODEL_DIR, 'amazonreviews/model_06.hdf5')
amazon_review_model.load_model_weights(model_hdf5_path)


Vocab Size = 43197  and the index of vocabulary words passed has 43195 words


In [36]:
# Extract the embedding layer weights from the model
learned_embeddings = amazon_review_model.get_classification_model().get_layer('imdb_embedding').get_weights()[0]
print(f'learned_embeddings size : {len(learned_embeddings)}')

# Update the existing GloVe embeddings with the learned embedding matrix
# params: word_index_dict, other_embedding, other_word_index
glove.update_embeddings(preprocessor.word_index, 
                        np.array(learned_embeddings), 
                        amazon_review_model.word_index)

# Get the updated embeddings
initial_embeddings = glove.get_embedding(preprocessor.word_index)


learned_embeddings size : 43197
4895 words are updated out of 4990
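
The source of `update_embeddings` is not shown in this notebook. Judging from its parameters and the "4895 words are updated out of 4990" message, it likely copies rows for words that exist in both vocabularies. The sketch below illustrates that idea only; `copy_shared_rows` is a hypothetical function and the toy matrices are made up.

```python
import numpy as np

def copy_shared_rows(target_matrix, target_index, source_matrix, source_index):
    """Overwrite target rows with source rows for words present in both indices."""
    updated = 0
    for word, i in target_index.items():
        j = source_index.get(word)
        if j is not None:
            target_matrix[i] = source_matrix[j]
            updated += 1
    return updated

# Toy example: only "a" appears in both vocabularies.
target = np.zeros((4, 2))
target_index = {"a": 2, "rare": 3}
source = np.arange(10.0).reshape(5, 2)
source_index = {"a": 4, "other": 2}

updated = copy_shared_rows(target, target_index, source, source_index)
print(f"{updated} words are updated out of {len(target_index)}")
```

Words that exist only in the IMDB vocabulary keep whatever vector they had before, which in the notebook would be their GloVe (or OOV) initialization.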


- Reference: structure of the pretrained model

In [34]:
amazon_review_model.get_classification_model().summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 300)]                0         []                            
                                                                                                  
 imdb_embedding (Embedding)  (None, 300, 50)              2159850   ['input_1[0][0]']             
                                                                                                  
 dropout (Dropout)           (None, 300, 50)              0         ['imdb_embedding[0][0]']      
                                                                                                  
 lambda (Lambda)             (None, 30, 50)               0         ['dropout[0][0]']             
                                                                                              

- Layer names

In [38]:
[layer.name for layer in amazon_review_model.get_classification_model().layers]

['input_1',
 'imdb_embedding',
 'dropout',
 'lambda',
 'lambda_1',
 'lambda_2',
 'lambda_3',
 'lambda_4',
 'lambda_5',
 'lambda_6',
 'lambda_7',
 'lambda_8',
 'lambda_9',
 'word_conv',
 'k_max_pooling',
 'k_max_pooling_1',
 'k_max_pooling_2',
 'k_max_pooling_3',
 'k_max_pooling_4',
 'k_max_pooling_5',
 'k_max_pooling_6',
 'k_max_pooling_7',
 'k_max_pooling_8',
 'k_max_pooling_9',
 'reshape',
 'reshape_1',
 'reshape_2',
 'reshape_3',
 'reshape_4',
 'reshape_5',
 'reshape_6',
 'reshape_7',
 'reshape_8',
 'reshape_9',
 'concatenate',
 'sentence_embeddings',
 'sentence_conv',
 'k_max_pooling_10',
 'document_embedding',
 'gaussian_noise',
 'hidden_0',
 'final']

# 5. Build the IMDB transfer learning model

In [21]:
# Build the classifier: a model that takes IMDB reviews as input and performs binary classification
imdb_model = DocumentModel(vocab_size=preprocessor.get_vocab_size(),
                           word_index = preprocessor.word_index,
                           num_sentences=Preprocess.NUM_SENTENCES,     
                           embedding_weights=initial_embeddings,
                           embedding_regularizer_l2 = 0.0,
                           conv_activation = 'tanh',
                           train_embedding = True,   # train the embedding layer weights
                           learn_word_conv = False,  # freeze the word-level conv layer weights
                           learn_sent_conv = False,  # freeze the sentence-level conv layer weights
                           hidden_dims=64,                                        
                           input_dropout=0.1, 
                           hidden_layer_kernel_regularizer=0.01,
                           final_layer_kernel_regularizer=0.01)

# Weight transfer: overwrite the weights of the following layers in imdb_model with the weights loaded above
for l_name in ['word_conv','sentence_conv','hidden_0', 'final']:
    new_weights = amazon_review_model.get_classification_model().get_layer(l_name).get_weights()
    imdb_model.get_classification_model().get_layer(l_name).set_weights(weights=new_weights)

Vocab Size = 4992  and the index of vocabulary words passed has 4990 words
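
Keras `set_weights` expects a list of arrays whose shapes match the layer's current weights, so the transfer above only works for layers that have the same shape in both models. The sketch below shows a shape check on `get_weights()`-style lists. `shape_mismatches` is a hypothetical helper and the conv kernel shape is made up; only the two embedding shapes (43197 x 50 and 4992 x 50) come from the outputs in this notebook.

```python
import numpy as np

def shape_mismatches(source_weights, target_weights):
    """Compare two get_weights()-style lists and return mismatching positions."""
    if len(source_weights) != len(target_weights):
        return [("count", len(source_weights), len(target_weights))]
    return [
        (i, src.shape, dst.shape)
        for i, (src, dst) in enumerate(zip(source_weights, target_weights))
        if src.shape != dst.shape
    ]

# Toy stand-ins for layer weights (kernel, bias); the conv shapes are hypothetical.
amazon_conv = [np.zeros((5, 50, 100)), np.zeros(100)]
imdb_conv = [np.ones((5, 50, 100)), np.ones(100)]

# Embedding shapes taken from the vocabulary sizes printed in this notebook.
amazon_embedding = [np.zeros((43197, 50))]
imdb_embedding = [np.zeros((4992, 50))]

print(shape_mismatches(amazon_conv, imdb_conv))            # compatible layers
print(shape_mismatches(amazon_embedding, imdb_embedding))  # vocabulary sizes differ
```

This is one way to see why the embedding layer is handled separately through the word-by-word update in section 4, while the convolution and dense layers can be copied directly.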


# 6. Train and evaluate the model

In [22]:
# Compile the model
imdb_model.get_classification_model().compile(loss="binary_crossentropy", 
                                              optimizer='rmsprop',
                                              metrics=["accuracy"])

# callback (1) - checkpoint
checkpointer = ModelCheckpoint(filepath=train_params.model_file_path,
                                verbose=1,
                                save_best_only=True,
                                save_weights_only=True)

# callback (2) - early stopping (defined here, but not passed to fit() below)
early_stop = EarlyStopping(patience=2)

# Start training
imdb_model.get_classification_model().fit(x_train, 
                                          y_train, 
                                          batch_size=train_params.batch_size,
                                          epochs=train_params.num_epochs,
                                          verbose=2,
                                          validation_split=0.01,
                                          callbacks=[checkpointer])

# Save the model
imdb_model._save_model(train_params.model_hyper_parameters)
train_params.save()

Epoch 1/30

Epoch 1: val_loss improved from inf to 1.66842, saving model to ./checkpoint/imdb/transfer_model_10.hdf5
10/10 - 1s - loss: 1.6150 - accuracy: 0.5861 - val_loss: 1.6684 - val_accuracy: 0.3077 - 1s/epoch - 118ms/step
Epoch 2/30

Epoch 2: val_loss improved from 1.66842 to 1.51567, saving model to ./checkpoint/imdb/transfer_model_10.hdf5
10/10 - 0s - loss: 1.4795 - accuracy: 0.5772 - val_loss: 1.5157 - val_accuracy: 0.3077 - 247ms/epoch - 25ms/step
Epoch 3/30

Epoch 3: val_loss improved from 1.51567 to 1.43445, saving model to ./checkpoint/imdb/transfer_model_10.hdf5
10/10 - 0s - loss: 1.3918 - accuracy: 0.5812 - val_loss: 1.4344 - val_accuracy: 0.5385 - 245ms/epoch - 24ms/step
Epoch 4/30

Epoch 4: val_loss improved from 1.43445 to 1.41991, saving model to ./checkpoint/imdb/transfer_model_10.hdf5
10/10 - 0s - loss: 1.3125 - accuracy: 0.5643 - val_loss: 1.4199 - val_accuracy: 0.3846 - 260ms/epoch - 26ms/step
Epoch 5/30

Epoch 5: val_loss improved from 1.41991 to 1.33160, saving
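
The `EarlyStopping(patience=2)` callback monitors `val_loss` by default and stops training once it has failed to improve for two epochs in a row. The sketch below is a simplified illustration of that patience rule in plain Python, not the Keras implementation; `early_stop_epoch` is a hypothetical name. The first list uses the validation losses printed in the log above.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

logged = [1.66842, 1.51567, 1.43445, 1.41991, 1.33160]  # epochs 1 to 5 above
print(early_stop_epoch(logged))

plateau = [1.0, 0.9, 0.95, 0.96, 0.8]  # made-up losses with a two-epoch plateau
print(early_stop_epoch(plateau))
```

For the five logged epochs the rule never triggers because the loss improves every time, while the made-up plateau sequence stops before its final improvement is reached.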

In [23]:
# Evaluate the model
imdb_model.get_classification_model().evaluate(x_test, 
                                               y_test, 
                                               batch_size=train_params.batch_size*10,
                                               verbose=2)

20/20 - 1s - loss: 0.7217 - accuracy: 0.5697 - 1s/epoch - 57ms/step


[0.7216610908508301, 0.5697199702262878]

In [24]:
pred_test = imdb_model.get_classification_model().predict(x_test)



In [25]:
test_df = test_df.reset_index()

## 6.1. Negative-sentiment review & prediction

In [26]:
# i = (test_df[test_df['sentiment']==1]).index[25]
i = 10
print(f"> sentiment: {test_df.loc[i,'sentiment']}")
print(test_df.loc[i,'review'])


> sentiment: 0
Robin Williams is excellent in this movie and it is a pity the material is not enough of a match for him. This may work if you buy into the "U-S-A! Number One!" mentality but story wise nothing much happens. Quite a shame really since the movie is really trying to say something, and says it sincerely. It just doesn't pack enough emotional punch.


In [27]:
# Prediction
pred_test[i]

array([0.5160389], dtype=float32)

## 6.2. A second negative-sentiment review & prediction

In [29]:
i = (test_df[test_df['sentiment']==0]).index[20]
# i = np.argmax(pred_test)
i  = 27
print(f"> sentiment: {test_df.loc[i,'sentiment']}")
print(test_df.loc[i,'review'])


> sentiment: 0
Catherine Zeta-Jones and Aaron Eckhart star in a "romantic" drama about an uptight chef played by Zeta-Jones, who ends up carrying for her niece when her sister is killed in a car crash. While she's out taking care of family matters she's replaced by Eckhart.Unfunny maudlin tale with no chemistry between the leads (she's a dead fish and he's okay, but not much of anything). Watching this I was wondering why anyone would want to see this since Zeta-Jones' character is so unlikable. Come on she's so obsessed with cooking and being the best all she does is cook for her therapist or talk about food. Ugh. I won't use any of the numerous puns that come to mind. I couldn't finish it.


In [30]:
# Prediction
pred_test[i]

array([0.5160389], dtype=float32)
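
Both reviews above receive the same score, so looking at single predictions says little. A summary over all predictions can show whether the sigmoid outputs vary at all and how they split at the 0.5 threshold. The sketch below assumes predictions shaped like `pred_test` (one probability per row); `summarize_predictions` is a hypothetical helper and the toy values are made up.

```python
import numpy as np

def summarize_predictions(probs, labels, threshold=0.5):
    """Threshold sigmoid outputs and report accuracy, confusion counts, and range."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels).ravel()
    preds = (probs > threshold).astype(int)
    return {
        "accuracy": float((preds == labels).mean()),
        "true_pos": int(((preds == 1) & (labels == 1)).sum()),
        "false_pos": int(((preds == 1) & (labels == 0)).sum()),
        "true_neg": int(((preds == 0) & (labels == 0)).sum()),
        "false_neg": int(((preds == 0) & (labels == 1)).sum()),
        "prob_min": float(probs.min()),
        "prob_max": float(probs.max()),
    }

# Toy predictions shaped like pred_test, with matching toy labels.
toy_probs = [[0.516], [0.516], [0.9], [0.1]]
toy_labels = [0, 0, 1, 0]
summary = summarize_predictions(toy_probs, toy_labels)
print(summary)
```

In the notebook the call would be `summarize_predictions(pred_test, y_test)`. A very narrow range between `prob_min` and `prob_max` would indicate that the model outputs nearly the same score for every review.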

In [31]:
test_df.loc[[27,28],:]

|  | index | review | sentiment |
|---|---|---|---|
| 27 | 23881 | Catherine Zeta-Jones and Aaron Eckhart star in... | 0 |
| 28 | 24360 | This son of a son of a sequel was terrible to ... | 0 |
