# Classifying movie review text with Keras and TensorFlow Hub
- https://www.tensorflow.org/tutorials/keras/text_classification_with_hub?hl=ko
- Classify movie review text as positive or negative
- A basic transfer learning application built with TensorFlow Hub and Keras

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
!pip install -q tensorflow-hub

In [4]:
!pip install -q tfds-nightly

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.4.0 requires h5py~=2.10.0, but you have h5py 3.1.0 which is incompatible.


In [5]:
import tensorflow_hub as hub
import tensorflow_datasets as tfds

In [7]:
print(tf.__version__)
print(tf.executing_eagerly())
print(hub.__version__)
print("GPU available" if tf.config.list_physical_devices("GPU") else "GPU unavailable")

2.5.0-dev20201230
True
0.10.0
GPU unavailable


In [9]:
train_data, validation_data, test_data = tfds.load(
    name='imdb_reviews',
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True
)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /Users/wonji/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /Users/wonji/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
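
The split strings passed to `tfds.load` carve the training set into a 60% training part and a 40% validation part, and keep the test set whole. A rough sketch of the resulting sizes follows, assuming the usual IMDB layout of 25,000 labelled training reviews and 25,000 labelled test reviews. The helper `percent_slice_sizes` is a hypothetical name used only for this illustration, and it ignores any rounding rules TFDS applies when a percentage does not divide evenly.

```python
# Sizes implied by the split strings, assuming 25,000 labelled training
# reviews and 25,000 labelled test reviews (an assumption, not read from TFDS).
TRAIN_TOTAL = 25_000
TEST_TOTAL = 25_000


def percent_slice_sizes(total, percent):
    """Return the sizes of the '[:percent%]' and '[percent%:]' slices."""
    head = total * percent // 100
    return head, total - head


n_train, n_validation = percent_slice_sizes(TRAIN_TOTAL, 60)
n_test = TEST_TOTAL

print(n_train, n_validation, n_test)
```

Under these assumptions, 15,000 reviews are used for training, 10,000 for validation, and 25,000 for the final test.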


In [10]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))  # iter() builds an iterator over the batches; next() pulls the first batch of 10
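
The `next(iter(...))` idiom is plain Python rather than anything specific to TensorFlow. The sketch below mimics it with an ordinary generator. The `batch` function is a simplified stand-in written for this note, not the `tf.data` implementation, and the `reviews` list is made-up data.

```python
def batch(items, size):
    """Yield consecutive lists of `size` items (a simplified Dataset.batch)."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == size:
            yield current
            current = []
    if current:  # final, possibly smaller batch
        yield current


reviews = [f"review {i}" for i in range(25)]  # made-up stand-in data

batches = iter(batch(reviews, 10))  # iter(): obtain an iterator over the batches
first_batch = next(batches)         # next(): pull exactly one batch from it
second_batch = next(batches)        # calling next() again continues where it left off

print(len(first_batch), first_batch[0], second_batch[0])
```

Writing `next(iter(...))` in one expression creates a fresh iterator and takes only its first element, which is a convenient way to peek at a single batch.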

In [12]:
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

In [13]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])>

Building the model
- How should the text be represented?
- How many layers should the model use?
- How many hidden units should each layer use?

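Pre-trained text embedding
- One way to turn text into numbers: convert it into an embedding vector
- Benefits of a pre-trained embedding: no need to worry about text preprocessing, the advantages of transfer learning, and a fixed-size vector that keeps downstream processing simple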
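
To make the "fixed-size vector" point concrete, here is a toy sketch that maps sentences of any length to a vector of the same size by averaging word vectors. Everything in it is an assumption for illustration: the vocabulary, the 3-dimensional values, and the averaging rule are made up, and the actual TensorFlow Hub module may tokenize and combine words differently.

```python
# Toy 3-dimensional word vectors; the vocabulary and values are made up.
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "terrible": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.5],
}
UNKNOWN = [0.0, 0.0, 0.0]  # stand-in for out-of-vocabulary words
DIM = 3


def embed(text):
    """Map a sentence of any length to one DIM-sized vector by averaging."""
    words = text.lower().split()
    if not words:
        return [0.0] * DIM
    vectors = [TOY_VECTORS.get(word, UNKNOWN) for word in words]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


short = embed("great movie")
long = embed("a terrible terrible movie that nobody should watch")

print(len(short), len(long))
```

Both sentences come out with the same length, which is what lets a dense layer with a fixed input size sit directly on top of the embedding.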

https://tfhub.dev/
- TensorFlow Hub: a repository of pre-trained models that can be reused
- This notebook uses one of them, google/tf2-preview/gnews-swivel-20dim/1

In [14]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)
# hub.KerasLayer wraps a SavedModel (or a legacy TF1 Hub format module) as a Keras layer

In [19]:
trn_sample = train_examples_batch[:2]
print(trn_sample)
print(hub_layer(trn_sample))
print(hub_layer(trn_sample).shape) # each review becomes a fixed-size (20-dimensional) embedding vector

tf.Tensor(
[b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
 b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot de

In [20]:
model = tf.keras.Sequential()
model.add(hub_layer) # output: n x 20 (num_examples x embedding_dimension)
model.add(tf.keras.layers.Dense(16, activation='relu')) # 20 -> 16 dimensions (16 hidden units)
model.add(tf.keras.layers.Dense(1)) # 16 -> 1 dimension (a single output logit)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
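
The parameter counts in the summary can be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit. The sketch below recomputes the two dense layers and the total, taking the embedding layer's 400,020 from the summary as given. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


embedding_params = 400_020            # taken from the summary output
hidden_params = dense_params(20, 16)  # 20-dim embedding -> 16 hidden units
output_params = dense_params(16, 1)   # 16 hidden units -> 1 output
total = embedding_params + hidden_params + output_params

print(hidden_params, output_params, total)
```

The embedding count equals 20,001 x 20, which would be consistent with a 20-dimensional table of 20,001 rows. That reading of the module's vocabulary size is an assumption, not something the notebook states.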


In [21]:
# binary classification -> binary_crossentropy (loss function)
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])
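
The final `Dense(1)` layer has no activation, so the model outputs a raw score (a logit) rather than a probability, which is why the loss is created with `from_logits=True`. The sketch below uses only the standard library to show that applying a sigmoid and then cross-entropy gives the same value as a formula that works on the logit directly. The function names are hypothetical and this is not the Keras implementation.

```python
import math


def sigmoid(logit):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_probability(label, p):
    """Binary cross-entropy for one example, given a probability."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def bce_from_logit(label, logit):
    """Algebraically equal form computed from the logit itself."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logit, label = 1.5, 1
via_probability = bce_from_probability(label, sigmoid(logit))
via_logit = bce_from_logit(label, logit)

print(via_probability, via_logit)
```

Working from the logit avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1, which is one common reason to prefer this form.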

In [22]:
history = model.fit(train_data.shuffle(10000).batch(512), # shuffle the examples, then group them into mini-batches of 512
                    epochs=20,
                    validation_data=validation_data.batch(512), verbose=1) # the validation split is passed separately

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [23]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print(name, value)

49/49 - 1s - loss: 0.3155 - accuracy: 0.8629
loss 0.3154655396938324
accuracy 0.8629199862480164
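
The `49/49` at the start of the evaluation log is a batch count, not an example count. A quick check, assuming the split sizes of 15,000 / 10,000 / 25,000 examples and that the last partial batch is kept (the default for `batch`):

```python
import math

BATCH_SIZE = 512

# Assumed split sizes: 60% and 40% of 25,000 training reviews, plus the test set.
split_sizes = {"train": 15_000, "validation": 10_000, "test": 25_000}

# Each split needs enough batches to cover every example, so round up.
steps = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}

print(steps)
```

The test split works out to 49 batches, matching the log, and under the same assumptions each training epoch runs 30 steps.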


In [25]:
results

[0.3154655396938324, 0.8629199862480164]