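#### Text vectorization pipeline
* Text standardization: convert letters to lowercase, remove punctuation, apply stemming, and so on.
* Tokenization: word-level, N-gram, or character-level. Models that take word order into account are called sequence models and are built on word-level tokens. Models that treat the input words as a set and ignore their original order are called bag-of-words models and are built on N-gram tokens.
* Token indexing: assign a unique integer to every token in the vocabulary.
* Index encoding: turn each index into a vector (one-hot encoding or an embedding).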
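
The four steps above can be sketched in plain Python. This is a minimal illustration, not what `TextVectorization` does internally: the helper names (`standardize`, `tokenize`, `build_vocabulary`, `encode`, `multi_hot`) and the toy corpus are made up for this example, stemming is skipped, and indices are assigned in order of first appearance rather than by frequency. The final step uses a multi-hot vector, the bag-of-words counterpart of one-hot encoding.

```python
import string


def standardize(text):
    # Step 1: lowercase and strip ASCII punctuation (no stemming in this sketch).
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)


def tokenize(text):
    # Step 2: word-level tokenization by splitting on whitespace.
    return text.split()


def build_vocabulary(corpus):
    # Step 3: index 0 is reserved for unknown tokens; every other token
    # gets the next free integer in order of first appearance.
    vocab = {"[UNK]": 0}
    for text in corpus:
        for token in tokenize(standardize(text)):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    # Map each token to its index; unseen tokens map to 0 ([UNK]).
    return [vocab.get(token, 0) for token in tokenize(standardize(text))]


def multi_hot(indices, vocab_size):
    # Step 4: one slot per vocabulary entry, set to 1 if the token occurs.
    vector = [0] * vocab_size
    for index in indices:
        vector[index] = 1
    return vector


corpus = ["The cat sat on the mat.", "The dog ate my homework!"]
vocab = build_vocabulary(corpus)
ids = encode("The cat ate the fish", vocab)
vector = multi_hot(ids, len(vocab))
print(ids)     # [1, 2, 7, 1, 0]
print(vector)  # [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
```

The word "fish" is not in the vocabulary, so it maps to index 0, and the repeated "the" collapses into a single 1 in the multi-hot vector, which is exactly the order and count information a bag-of-words model discards.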

In [74]:
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np


#### Loading the IMDB movie-review training, validation, and test data

In [75]:
batch_size=32
train_ds=keras.utils.text_dataset_from_directory("../DL/data/aclImdb/train",batch_size=batch_size)
val_ds=keras.utils.text_dataset_from_directory("../DL/data/aclImdb/val",batch_size=batch_size)
test_ds=keras.utils.text_dataset_from_directory("../DL/data/aclImdb/test",batch_size=batch_size)


Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
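
`text_dataset_from_directory` expects one subdirectory per class, and the counts above (20000 / 5000) suggest that the `val` directory was carved out of the original 25000 training files. The notebook does not show how that was done, so the following is only a sketch of one possible approach. The function name `make_validation_split`, the 20% fraction, the seed, and the `neg`/`pos` category names are assumptions for illustration.

```python
import os
import random
import shutil


def make_validation_split(base_dir, val_fraction=0.2, seed=1337,
                          categories=("neg", "pos")):
    """Move a random fraction of each class from train/ to val/.

    Returns the total number of files moved.
    """
    rng = random.Random(seed)
    moved = 0
    for category in categories:
        train_dir = os.path.join(base_dir, "train", category)
        val_dir = os.path.join(base_dir, "val", category)
        os.makedirs(val_dir, exist_ok=True)
        # Sort first so the shuffle is reproducible for a given seed.
        files = sorted(os.listdir(train_dir))
        rng.shuffle(files)
        num_val = int(val_fraction * len(files))
        for fname in files[:num_val]:
            shutil.move(os.path.join(train_dir, fname),
                        os.path.join(val_dir, fname))
            moved += 1
    return moved


# Hypothetical usage, to be run once before building the datasets:
# make_validation_split("../DL/data/aclImdb")
```

Any extra subdirectory under `train` would be counted as an additional class, so only the class folders should remain there before the datasets are loaded.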


In [76]:
print(len(train_ds))
for inputs,targets in train_ds:
    print(inputs[0])
    print("inputs shape:",inputs.shape)
    print("inputs dtype:",inputs.dtype)
    print("targets shape:",targets.shape)
    print("targets dtype:",targets.dtype)
    break


625
tf.Tensor(b"This movie doesn't even deserve a one. This was an utter waste of time. It was a waste of film and money. It was not offensive but everything was provocative and disgusting. My spoiler is one that I think should be read by everyone. There is full frontal nudity and disgusting language. But not only that, there is NO plot line, the actors are terrible, the accents are horrible, the actors are small time and I was even EXCITED to watch this movie! <br /><br />The only reason I rented it was for Brian van Holt (who got only a fifteen second part, by the way). I think this might have been a mistake on the directors and editors parts but they repeated the same segments two or three times, adding only a new sentence.<br /><br />A film similar to this is Eraser Head, possibly the most disturbing movie in existence. There is no plot line, and is not funny. Although it isn't trying to be funny. DO NOT WATCH EITHER MOVIE.", shape=(), dtype=string)
inputs shape: (32,)
inputs dtype

#### Preprocessing the datasets with the TextVectorization layer
> Apply the text vectorization pipeline to the text.

In [77]:
from tensorflow.keras.layers import TextVectorization

In [78]:
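# Limit the vocabulary to the 20,000 most frequent tokens. output_mode selects the encoding;
# without the ngrams argument the layer uses unigrams. Alternative configurations are kept commented out.
# text_vectorization=TextVectorization(max_tokens=20000,output_mode="multi_hot") # unigrams, multi-hot binary encoding
# text_vectorization=TextVectorization(max_tokens=20000,output_mode="multi_hot",ngrams=2) # bigrams, multi-hot binary encoding
# text_vectorization=TextVectorization(max_tokens=20000,output_mode="count",ngrams=2) # bigrams, token counts
text_vectorization=TextVectorization(max_tokens=20000,output_mode="tf_idf",ngrams=2) # bigrams, TF-IDF weighting
# Build a text-only dataset (drop the labels)
text_only_train_ds=train_ds.map(lambda x,y:x)
# Index the vocabulary from the training text
text_vectorization.adapt(text_only_train_ds)
# Show the vocabulary, ordered by decreasing frequency. The first entry is the unknown-token placeholder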
print(text_vectorization.get_vocabulary()[0:10])


['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i']
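
To make the `ngrams=2` and `tf_idf` settings concrete, here is a small pure-Python sketch of bigram extraction and TF-IDF weighting. It assumes that unigrams and bigrams are both kept, and it uses one common smoothed IDF form, `log(1 + N / (1 + df))`. The exact weighting inside `TextVectorization` may differ, and the helper names and toy documents are invented for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # Join every run of n consecutive tokens into one string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(text):
    # Lowercase, split on whitespace, then append the bigrams.
    tokens = text.lower().split()
    return tokens + ngrams(tokens, 2)


def tf_idf(docs):
    token_lists = [unigrams_and_bigrams(doc) for doc in docs]
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    num_docs = len(docs)
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Term frequency (raw count) times a smoothed inverse document frequency.
        weights.append({
            token: count * math.log(1 + num_docs / (1 + doc_freq[token]))
            for token, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the cat ran"]
print(unigrams_and_bigrams(docs[0]))  # ['the', 'cat', 'sat', 'the cat', 'cat sat']
weights = tf_idf(docs)
# "sat" appears in one document, "the" in both, so "sat" gets the larger weight.
print(weights[0]["sat"] > weights[0]["the"])  # True
```

Tokens that occur in every document, such as "the", are down-weighted, while rarer tokens and bigrams carry more of the signal.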


In [79]:
# Vectorize the training, validation, and test datasets with the bigram TF-IDF layer; num_parallel_calls sets how many elements are processed in parallel
tfidf_2gram_train_ds=train_ds.map(lambda x,y:(text_vectorization(x),y),num_parallel_calls=4)
tfidf_2gram_val_ds=val_ds.map(lambda x,y:(text_vectorization(x),y),num_parallel_calls=4)
tfidf_2gram_test_ds=test_ds.map(lambda x,y:(text_vectorization(x),y),num_parallel_calls=4)


In [80]:
print(len(tfidf_2gram_train_ds))
for inputs,targets in tfidf_2gram_train_ds:
    print("inputs shape:",inputs.shape)
    print("inputs dtype:",inputs.dtype)
    print("targets shape:",targets.shape)
    print("targets dtype:",targets.dtype)
    print("inputs[0]:",inputs[0])
    print("targets[0]:",targets[0])
    break

625
inputs shape: (32, 20000)
inputs dtype: <dtype: 'float32'>
targets shape: (32,)
targets dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(
[506.4039     10.459088    5.6893773 ...   0.          0.
   0.       ], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


#### Model-building function

In [81]:
def get_model(max_tokens=20000,hidden_dim=16):
    inputs=keras.Input(shape=(max_tokens,))
    x=layers.Dense(hidden_dim,activation="relu")(inputs)
    x=layers.Dropout(0.5)(x)
    outputs=layers.Dense(1,activation="sigmoid")(x)
    model=keras.Model(inputs,outputs)
    model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
    return model


In [82]:
model=get_model()
model.summary()
callbacks=[keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",save_best_only=True)]

model.fit(tfidf_2gram_train_ds,validation_data=tfidf_2gram_val_ds,epochs=10,callbacks=callbacks)
model=keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc:{model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_12 (Dense)            (None, 16)                320016    
                                                                 
 dropout_6 (Dropout)         (None, 16)                0         
                                                                 
 dense_13 (Dense)            (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc:0.892
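
The parameter counts in the model summary above can be checked by hand: a `Dense` layer with bias has `input_dim * units + units` parameters. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units


max_tokens, hidden_dim = 20000, 16
hidden = dense_params(max_tokens, hidden_dim)  # first Dense layer
output = dense_params(hidden_dim, 1)           # sigmoid output layer
total = hidden + output                        # Dropout adds no parameters
print(hidden, output, total)  # 320016 17 320033
```

Almost all of the parameters sit in the first layer, because its input width equals the vocabulary size (`max_tokens`).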
