In [21]:
import tensorflow as tf
import os
import re
import shutil
import string  # used by custom_standardization
import matplotlib.pyplot as plt
from tensorflow.keras import layers
from tensorflow.keras import losses


### Data description
Classification requires a positive/negative label for each review, so the labeled reviews are separated into two folders: aclImdb/train/pos and aclImdb/train/neg.

In [9]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("data", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')


Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [12]:
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
dataset_dir

'./aclImdb'

### Taking a look at the data

In [13]:
train_dir = os.path.join(dataset_dir, 'train')

sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_file) as f:
  print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


### Splitting the data

In [15]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 75000 files belonging to 3 classes.
Using 60000 files for training.
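
The output above reports 3 classes rather than 2. `text_dataset_from_directory` treats every subfolder as a class, and the extracted archive also ships an `unsup` folder of unlabeled reviews next to `pos` and `neg`, which presumably accounts for the extra 50,000 files. For binary sentiment classification that folder should be removed before the dataset is built. A minimal sketch, assuming the archive was extracted to `./aclImdb` as above (`remove_unlabeled_dir` is a hypothetical helper, not part of the notebook):

```python
import os
import shutil


def remove_unlabeled_dir(train_dir, name='unsup'):
    """Delete train_dir/name if present; return True when something was removed."""
    target = os.path.join(train_dir, name)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False


# Assumption: the data lives in ./aclImdb/train, as in the cells above.
removed = remove_unlabeled_dir(os.path.join('aclImdb', 'train'))
print("unsup removed:", removed)
```

After this runs, the dataset cells need to be re-executed so that only `neg` and `pos` are picked up as classes.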


In [16]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Review b'There is this father-son conversation in the climax of \'KALPURUSH\'. I quote the English DVD-subtitle version. Shumonto tells his father: "I may not have become someone, but when I see two people in love, I smile. And when I see someone eating alone, I cry." Ashvini, his father, replies wistfully: "I wish I could\'ve lived my life like you did." These 2 lines, perhaps, comprise the gist of this new film by Buddhadev Dasgupta - director of teeny-weeny gems like \'Tahader Katha\', \'Bagh Bahadur\', \'Uttara\' & \'Mondo Meyer Upakhyan\' - which took nearly 3 years to reach the cinemas in India.<br /><br />The film opens with a man called Ashvini following a younger man called Shumonto, who, we are told, is his son. It seems that the father is stalking - or haunting, rather - his son. As the film progresses and we meet Shumonto\'s ambitious wife, Supriya, and his mother, Koyel, who seems to be tied up with something in her past, we realise that the son is, indeed, haunted by his 

In [17]:
print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos


In [18]:
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 75000 files belonging to 3 classes.
Using 15000 files for validation.
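
The training and validation calls share the same `seed` and `validation_split`, which is what keeps the two subsets from overlapping. The idea can be sketched in plain Python. This is only an illustration of a seeded shuffle followed by a cut, not Keras's actual implementation, and `split_indices` is a hypothetical name:

```python
import random


def split_indices(num_samples, validation_split, seed):
    """Shuffle indices deterministically, then cut off the last fraction for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)          # same seed -> same order every call
    num_val = int(validation_split * num_samples)
    cut = num_samples - num_val
    return indices[:cut], indices[cut:]


train_idx, val_idx = split_indices(75000, 0.2, seed=42)
print(len(train_idx), len(val_idx))
```

With a different seed in the second call, the shuffle order would differ and some files could land in both subsets.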


In [19]:
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)

Found 25000 files belonging to 2 classes.


### Function and layer for preprocessing the text data
- Lowercase the text, replace tags such as \<br /\> with a space, and remove punctuation

In [30]:
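def custom_standardization(input_data):
  # Lowercase everything so that "Good" and "good" map to the same token.
  lowercase = tf.strings.lower(input_data)
  # Replace HTML line breaks with a space so adjacent words do not merge.
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # Remove all ASCII punctuation characters.
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')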
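
To see what these three steps do to a single review, here is a plain-Python mirror of the same logic. It is only an illustration: `standardize_py` is a hypothetical helper that works on ordinary `str` values, whereas the function above works on string tensors.

```python
import re
import string


def standardize_py(text):
    """Plain-Python equivalent: lowercase, '<br />' -> space, strip punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)


cleaned = standardize_py("Great movie!<br />Loved it.")
print(cleaned)  # great movie loved it
```

The order matters: the `<br />` tag is replaced first, because `<`, `/` and `>` are themselves punctuation and would otherwise be stripped individually, leaving a stray `br` token.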

In [31]:

max_features = 10000  # vocabulary size
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
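
With `output_mode='int'`, the layer maps each token to an integer index, reserving index 0 for padding and index 1 for out-of-vocabulary tokens, and pads or truncates every review to `output_sequence_length`. The following plain-Python sketch illustrates that behavior on a toy corpus. `build_vocab` and `to_int_sequence` are hypothetical helpers, and the alphabetical tie-breaking is an assumption of this sketch rather than a documented property of the layer:

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Most frequent tokens first; slots 0 ('') and 1 ('[UNK]') are reserved."""
    counts = Counter(tok for text in texts for tok in text.split())
    # Assumption: ties are broken alphabetically to keep the sketch deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ['', '[UNK]'] + ordered[:max_tokens - 2]


def to_int_sequence(text, vocab, sequence_length):
    """Map tokens to indices, truncate to sequence_length, then pad with 0."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


vocab = build_vocab(["good movie", "good plot", "bad movie"], max_tokens=4)
encoded = to_int_sequence("good bad movie", vocab, sequence_length=5)
print(vocab)    # ['', '[UNK]', 'good', 'movie']
print(encoded)  # [2, 1, 3, 0, 0]
```

Here `bad` falls outside the 4-entry vocabulary, so it is encoded as 1, and the two trailing zeros are padding.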

In [23]:
vectorize_layer

<keras.src.layers.preprocessing.text_vectorization.TextVectorization at 0x1751045b0>

In [25]:
import string

# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)



In [27]:
# Function that vectorizes (preprocesses) a text and returns it together with its label

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

### The vectorized result: text converted entirely to integers

In [28]:
# retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b"Okay I must say that before the revealing of the 'monster'. saying that he really didn't fit into that category, just some weird thing that had an annoying screech! And personally I think a granny could have ran away from that thing, but anyway. I actually was getting into this film, although having the main character a drunk and a heroine addict didn't come as an appeal. But such scenes as when she runs away from the train, and you can see the figure at the door was kind of creepy, also where the guard had just been killed and the 'monster' put his hand on the screen.<br /><br />But then disaster stuck form the moment the monster was revealed it just became your average horror, with limited thrills or scares. Slowly I became more bored, and wanted to shut the thing off. I like most people have said was rooting for the homeless people to make it, specially the guy, he gave me a few cheap laughs here and there. I think this film could have really been something specia

In [29]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

1287 --->  charlie
 313 --->  simply
Vocabulary size: 10000


### Use the tf.data Dataset API to feed data efficiently

In [32]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [33]:
type(raw_train_ds)  # this is a tf.data Dataset object

tensorflow.python.data.ops.batch_op._BatchDataset

In [34]:
type(train_ds)

tensorflow.python.data.ops.map_op._MapDataset
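
Mapping alone does not make input loading efficient. A common next step with the tf.data API is to cache the dataset after its first pass and prefetch batches while the model is training. The sketch below shows the pattern on a toy dataset so that it does not alter `train_ds` above. `configure_for_performance` is a hypothetical helper name, and whether caching in memory is appropriate depends on the dataset fitting in RAM:

```python
import tensorflow as tf


def configure_for_performance(ds):
    """Cache elements after the first pass and overlap loading with training."""
    return ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)


# Toy stand-in for a batched dataset such as train_ds.
toy_ds = tf.data.Dataset.range(6).batch(2)
tuned_ds = configure_for_performance(toy_ds)
batches = [batch.numpy().tolist() for batch in tuned_ds]
print(batches)  # [[0, 1], [2, 3], [4, 5]]
```

The same helper could be applied to `train_ds`, `val_ds`, and `test_ds`; the contents of each batch are unchanged, only the way they are delivered.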