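## Deep learning for text #1

### Text vectorization

Converting raw text into numeric tensors so that it can be used as model input.

- Text splitting (tokenization): split the text into tokens (characters, words, or groups of words)
- Text standardization: convert to lowercase, remove punctuation, handle stop words, and so on
- Token indexing: map each token to an integer index, which is then encoded as a numeric vector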

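### Two ways to represent groups of words

- BoW (bag-of-words) model
    - Ignores order and treats the text as an (unordered) set of words
    - Count-based or TF-IDF-based
    - keras.layers.TextVectorization()
- Sequence model
    - Processes words one at a time in the order they appear, like the timesteps of a time series
    - Recurrent neural networks, Transformers
    - keras.layers.Embedding()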
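
The difference between the two representations can be sketched in plain Python. This is a minimal illustration, not Keras code: the toy `vocabulary` and the helper names `to_sequence` and `to_bag_of_words` are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical toy vocabulary: index 0 is reserved for padding, 1 for unknown words.
vocabulary = {"": 0, "[UNK]": 1, "i": 2, "love": 3, "you": 4}

def to_sequence(tokens):
    """Sequence view: one integer index per token, in order of appearance."""
    return [vocabulary.get(token, 1) for token in tokens]

def to_bag_of_words(tokens):
    """BoW view: one count per vocabulary entry; word order is discarded."""
    counts = Counter(vocabulary.get(token, 1) for token in tokens)
    return [counts.get(index, 0) for index in range(len(vocabulary))]

a = "you love i".split()
b = "i love you".split()
print(to_sequence(a), to_sequence(b))                 # [4, 3, 2] [2, 3, 4]
print(to_bag_of_words(a) == to_bag_of_words(b))       # True
```

The two sentences differ as sequences but collapse to the same bag-of-words vector, which is exactly the information a BoW model gives up.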

## BoW (set) model

### Code for text vectorization

In [10]:
import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text):
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}   # [UNK] : Out Of Vocabulary
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

In [11]:
vect = Vectorizer()

In [12]:
dataset = ['I love you',
           'A popy blooms',
           'Woo Woo!!',
           'I like apple']

In [13]:
vect.make_vocabulary(dataset)
vect.vocabulary

{'': 0,
 '[UNK]': 1,
 'i': 2,
 'love': 3,
 'you': 4,
 'a': 5,
 'popy': 6,
 'blooms': 7,
 'woo': 8,
 'like': 9,
 'apple': 10}

In [14]:
test_sentence = 'I write, you love,'
encoded = vect.encode(test_sentence)
encoded

[2, 1, 4, 3]

In [15]:
# Out-of-vocabulary words are decoded as [UNK] (index 1)
print(vect.decode(encoded))

i [UNK] you love


### The Keras `TextVectorization` layer

**tf.keras.layers.TextVectorization()**

https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization

```python
tf.keras.layers.TextVectorization(max_tokens=None,
                                  standardize='lower_and_strip_punctuation',
                                  split='whitespace',
                                  ngrams=None,
                                  output_mode='int',
                                  output_sequence_length=None,
                                  pad_to_max_tokens=False,
                                  vocabulary=None,
                                  idf_weights=None,
                                  sparse=False,
                                  ragged=False,
                                  encoding='utf-8',
                                  name=None,
                                  **kwargs
                                 )
```

#### Using the TextVectorization layer
- Returns word sequences encoded as integer indices

In [1]:
from tensorflow import keras
from keras.layers import TextVectorization
text_vector = TextVectorization(output_mode='int')

- Standardization and tokenization with user-defined functions

In [8]:
import re
import string
import tensorflow as tf

def my_standardize(string_tensor):
    lowercase = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(string.punctuation)}]", '')

def my_tokenize(string_tensor):
    return tf.strings.split(string_tensor)
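
To check what the regular expression above removes without running TensorFlow, a plain-Python analogue can help. This is a sketch under assumptions: `standardize_py` and `tokenize_py` are hypothetical names, and Python's `re` module is not guaranteed to match the regex engine used by `tf.strings.regex_replace` in every edge case.

```python
import re
import string

def standardize_py(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    pattern = f"[{re.escape(string.punctuation)}]"
    return re.sub(pattern, "", text.lower())

def tokenize_py(text):
    """Split on runs of whitespace."""
    return text.split()

print(tokenize_py(standardize_py("Woo Woo!!")))   # ['woo', 'woo']
```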

#### Indexing and printing the corpus vocabulary

- Vocabulary indexing: the adapt() method (the layer is first created with the custom functions)

In [5]:
text_vector = TextVectorization(
    standardize=my_standardize,
    split=my_tokenize,
    output_mode='int')

- Printing the vocabulary: the get_vocabulary() method (after calling adapt() on the dataset)

In [16]:
text_vector.adapt(dataset)

In [18]:
print(text_vector.get_vocabulary())

['', '[UNK]', 'woo', 'i', 'you', 'popy', 'love', 'like', 'blooms', 'apple', 'a']
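
The order of the printed vocabulary is consistent with sorting tokens by descending frequency and breaking ties in reverse alphabetical order. That tie-breaking rule is only an observation from this output, not a documented guarantee. The sketch below reproduces the list in plain Python; `build_vocabulary` is a hypothetical helper, and the texts are assumed to be already standardized.

```python
from collections import Counter

# The toy dataset from above, already lowercased and stripped of punctuation.
dataset = ["i love you", "a popy blooms", "woo woo", "i like apple"]

def build_vocabulary(texts):
    """Rank tokens by count (descending), ties in reverse alphabetical order."""
    counts = Counter(token for text in texts for token in text.split())
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]), reverse=True)
    # Index 0 is the padding token and index 1 the out-of-vocabulary token.
    return ["", "[UNK]"] + [token for token, _ in ranked]

print(build_vocabulary(dataset))
```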


#### Encoding and decoding a sentence

In [19]:
vocabs = text_vector.get_vocabulary()
test_sentence = 'I love you too'
encoded_sentence = text_vector(test_sentence)
print(encoded_sentence)

tf.Tensor([3 6 4 1], shape=(4,), dtype=int64)


In [21]:
inverse_vocabs = dict(enumerate(vocabs))
decoded_sentence = ' '.join([inverse_vocabs[int(i)] for i in encoded_sentence])
print(decoded_sentence)

i love you [UNK]


---

## BoW-based modeling exercise

### Example data: IMDB movie reviews

#### Downloading the data

In [22]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2024-06-19 04:54:15--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-06-19 04:54:28 (6.33 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [23]:
!tar -xf aclImdb_v1.tar.gz

In [24]:
!rm -rf aclImdb/train/unsup

#### Inspecting the data

In [25]:
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

#### Preparing the training, validation, and test data

In [27]:
import os, pathlib, random, shutil

base_dir = pathlib.Path('aclImdb')
train_dir = base_dir / 'train'
val_dir = base_dir / 'val'

for category in ('neg','pos'):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1237).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
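
The split logic can be tried on an in-memory list before any files are moved. A minimal sketch, assuming a hypothetical helper `split_validation` and made-up file names; it shuffles with the same fixed seed and holds out the last 20%.

```python
import random

def split_validation(files, val_fraction=0.2, seed=1237):
    """Return (train_files, val_files) after a seeded shuffle."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    num_val = int(val_fraction * len(shuffled))
    split_at = len(shuffled) - num_val   # also correct when num_val is 0
    return shuffled[:split_at], shuffled[split_at:]

# 12,500 reviews per category, as in the IMDB training set.
files = [f"{i}.txt" for i in range(12500)]
train_files, val_files = split_validation(files)
print(len(train_files), len(val_files))   # 10000 2500
```

With two categories this gives 20,000 training and 5,000 validation files.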

In [28]:
batch_size= 32
train_ds = keras.utils.text_dataset_from_directory('aclImdb/train',batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory('aclImdb/val',batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory('aclImdb/test',batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


- Print the shape and dtype of the first batch

In [29]:
for inputs, targets in train_ds:
    print(f'inputs.shape = {inputs.shape}, inputs.dtype = {inputs.dtype}')
    print(f'targets.shape = {targets.shape}, targets.dtype = {targets.dtype}')
    print(f'inputs[1] = {inputs[1]}')
    print(f'targets[1] = {targets[1]}')
    break

inputs.shape = (32,), inputs.dtype = <dtype: 'string'>
targets.shape = (32,), targets.dtype = <dtype: 'int32'>
inputs[1] = b"Of course if you are reading my review you have seen this film already. 'Raja Babu' is one of my most favorite characters. I just love the concept of a spoiled brat with a 24*7 servant on his motorcycle. Watch movies and emulate characters etc etc. I love the scene when a stone cracks in Kader khans mouth while eating. Also where Shakti Kapoor narrates a corny story of Raja Babu's affairs on a dinner table and Govinda wearing 'dharam-veer' uniform makes sentimental remarks. Thats my favorite scene of the film. 'Achcha Pitaji To Main Chalta Hoon' scene is just chemistry between two great Indian actors doing a comical scene with no dialogs. Its brilliant. It's a cat mouse film. Just watch these actors helping each other and still taking away the scene from each other. Its total entertainment. If you like Govinda and Kader Khan chemistry then its a must. I think RB 

#### 1) Unigrams with binary encoding
: unigram = single words

**Preprocessing the data with the `TextVectorization` layer**

In [30]:
text_vect = keras.layers.TextVectorization(max_tokens=20000, output_mode='multi_hot')

text_only_train_ds = train_ds.map(lambda x, y: x)
text_vect.adapt(text_only_train_ds)

# num_parallel_calls=4: lets map() process elements in parallel to use multiple CPU cores
bin_1gram_train_ds = train_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
bin_1gram_val_ds = val_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
bin_1gram_test_ds = test_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)

**Checking the output of the binary unigram dataset**

In [31]:
for inputs, targets in bin_1gram_train_ds:
    print(f'inputs.shape = {inputs.shape}, inputs.dtype = {inputs.dtype}')
    print(f'targets.shape = {targets.shape}, targets.dtype = {targets.dtype}')
    print(f'inputs[1] = {inputs[1]}')
    print(f'targets[1] = {targets[1]}')
    break

inputs.shape = (32, 20000), inputs.dtype = <dtype: 'float32'>
targets.shape = (32,), targets.dtype = <dtype: 'int32'>
inputs[1] = [1. 1. 1. ... 0. 0. 0.]
targets[1] = 1
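
Each row of `inputs` is a multi-hot vector with one slot per vocabulary entry: 1 if that token occurs in the review and 0 otherwise. The sketch below shows the idea on a tiny vocabulary. `multi_hot` is a hypothetical helper for illustration, not the layer's actual implementation.

```python
def multi_hot(token_indices, vocab_size):
    """Mark every index that occurs at least once; repeats do not add up."""
    vector = [0.0] * vocab_size
    for index in token_indices:
        vector[index] = 1.0
    return vector

# Index 2 appears twice but still yields a single 1.0.
print(multi_hot([2, 1, 4, 3, 2], vocab_size=6))   # [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```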


**Building the model**

In [34]:
from keras import layers, Input, Model

max_tokens =20000

def build_model(max_tokens= 20000, hidden_dim=16):
    inputs = Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation= 'relu')(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    model = Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [35]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [36]:
model_path = '/content/drive/MyDrive/Colab Notebooks/model/'

In [37]:
model = build_model()
model.summary()
model_name = model_path + 'aclImdb_binary_1gram.h5'
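# Note: save_best_only is not set, so the checkpoint is overwritten after every epoch
# and the saved file holds the final-epoch model.
callbacks = [keras.callbacks.ModelCheckpoint(model_name)]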

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
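
The parameter counts in the summary can be verified by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. A small check, with `dense_params` as a hypothetical helper name:

```python
def dense_params(input_dim, units):
    """Weights (input_dim * units) plus biases (units) of a fully connected layer."""
    return input_dim * units + units

hidden = dense_params(20000, 16)   # first Dense layer
output = dense_params(16, 1)       # sigmoid output layer
print(hidden, output, hidden + output)   # 320016 17 320033
```

The `Dropout` layer has no trainable parameters, so the total matches the 320033 reported above.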


**Training and testing the model**

In [38]:
history = model.fit(bin_1gram_train_ds.cache(),
                    validation_data= bin_1gram_val_ds.cache(),
                    epochs=10,
                    callbacks= callbacks)
best_model = keras.models.load_model(model_name)
print(f'Test accuracy: {best_model.evaluate(bin_1gram_test_ds)[1]:.4f}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8747


#### 2) Bigrams with binary encoding

**Creating a `TextVectorization` layer that returns bigrams**

In [39]:
text_vect = keras.layers.TextVectorization(max_tokens=20000, output_mode='multi_hot', ngrams=2)
text_vect.adapt(text_only_train_ds)

# num_parallel_calls=4: lets map() process elements in parallel to use multiple CPU cores
bin_2gram_train_ds = train_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
bin_2gram_val_ds = val_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
bin_2gram_test_ds = test_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
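
According to the Keras documentation, an integer `ngrams` value creates n-grams up to that size, so `ngrams=2` should produce unigrams as well as bigrams. The plain-Python sketch below illustrates that token expansion. `ngrams_up_to` is a hypothetical helper, and the order in which the layer emits its n-grams may differ.

```python
def ngrams_up_to(tokens, n):
    """Return all contiguous k-grams for k = 1..n, joined by spaces."""
    grams = []
    for k in range(1, n + 1):
        for start in range(len(tokens) - k + 1):
            grams.append(" ".join(tokens[start:start + k]))
    return grams

print(ngrams_up_to("the cat sat".split(), 2))
# ['the', 'cat', 'sat', 'the cat', 'cat sat']
```

Bigrams restore a little local word order that a pure unigram bag discards.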

**Training and testing the binary bigram model**

In [40]:
model = build_model()
model_name = model_path + 'aclImdb_binary_2gram.h5'
callbacks = [keras.callbacks.ModelCheckpoint(model_name)]
history = model.fit(bin_2gram_train_ds.cache(),
                    validation_data= bin_2gram_val_ds.cache(),
                    epochs=10,
                    callbacks= callbacks)
best_model = keras.models.load_model(model_name)
print(f'Test accuracy: {best_model.evaluate(bin_2gram_test_ds)[1]:.4f}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8887


- Classification performance with binary unigram vectorization
    - Test accuracy: 0.8747
- Classification performance with binary bigram vectorization
    - Test accuracy: 0.8887

#### 3) Bigrams with TF-IDF encoding

**A `TextVectorization` layer that returns token counts**

In [41]:
text_vect = keras.layers.TextVectorization(max_tokens=20000, output_mode='count', ngrams=2)

**A `TextVectorization` layer that returns TF-IDF-weighted outputs**

In [42]:
text_vect = keras.layers.TextVectorization(max_tokens=20000, output_mode='tf_idf', ngrams=2)
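
TF-IDF scales each token count by how rare the token is across documents. The sketch below uses one smoothed variant, `idf = log(1 + N / (1 + df))`. That formula is an assumption for illustration, and the exact weighting inside `TextVectorization` may differ between Keras versions. The toy `docs` and the helper names are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

def document_frequency(documents):
    """Count in how many documents each token appears."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return df

def tf_idf(tokens, df, num_docs):
    """Term count times a smoothed inverse document frequency (assumed formula)."""
    counts = Counter(tokens)
    return {t: c * math.log(1 + num_docs / (1 + df[t])) for t, c in counts.items()}

df = document_frequency(docs)
weights = tf_idf(docs[2], df, num_docs=len(docs))
print(weights)
```

Here "plot" occurs in one document and "good" in two, so each occurrence of "plot" gets the larger weight, while "good" still scores higher overall because it appears twice in the document.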

**Training and testing the TF-IDF bigram model**

In [43]:
text_vect.adapt(text_only_train_ds)


tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vect(x), y),num_parallel_calls=4)

In [44]:
model = build_model()
model_name = model_path + 'aclImdb_tf_idf_2gram.h5'
callbacks = [keras.callbacks.ModelCheckpoint(model_name)]
history = model.fit(tfidf_2gram_train_ds.cache(),
                    validation_data= tfidf_2gram_val_ds.cache(),
                    epochs=10,
                    callbacks= callbacks)
best_model = keras.models.load_model(model_name)
print(f'Test accuracy: {best_model.evaluate(tfidf_2gram_test_ds)[1]:.4f}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8870


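- Classification performance with binary unigram vectorization
    - Test accuracy: 0.8747
- Classification performance with binary bigram vectorization
    - Test accuracy: 0.8887
- Classification performance with TF-IDF bigram vectorization
    - Test accuracy: 0.8870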

---