## II. Sentiment Analysis: Model Implementation

### 1. Data analysis and preprocessing

#### (1) Import the required modules

In [1]:
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

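# `from tf.keras.models import Sequential` fails: `tf` is only a local alias, not an importable package name.
# The imports below use the path that worked in the original environment; newer TensorFlow releases expose the same classes under `tensorflow.keras`.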
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

#### (2) Load the data

In [2]:
import imdb 

In [3]:
imdb.maybe_download_and_extract()     # Download and extract the IMDB dataset

Data has apparently already been downloaded and unpacked.


In [4]:
input_text_train, target_train = imdb.load_data(train=True)
input_text_test,  target_test  = imdb.load_data(train=False)

print("Size of the training set: ", len(input_text_train))
print("Size of the testing set:  ", len(input_text_test))

Size of the training set:  25000
Size of the testing set:   25000


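From this output we can see that the training set and the test set each contain 25,000 reviews.

Next, let's look at one example to see what the dataset's inputs and outputs look like.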

In [5]:
text_data = input_text_train + input_text_test

input_text_train[1]

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou

In [6]:
target_train[1]

1.0

The target here is 1, which means the sentiment is positive. So whatever the movie is, this is a positive review.

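#### (3) Build the vocabulary

Now we turn to the tokenizer, the first step in processing the raw data, because a neural network cannot work on text directly. Keras provides a `Tokenizer` that builds a vocabulary and maps words to integers.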

In [7]:
num_top_words = 10000
tokenizer_obj = Tokenizer(num_words=num_top_words)     # Build a vocabulary limited to the 10,000 most frequent words.

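Now we take all the text in the dataset and call `fit_on_texts` on it. This ranks every word by how often it appears in the reviews. With `num_words=10000`, only the most frequent words (roughly the top 10,000) are kept when texts are later converted to integers.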

In [8]:
tokenizer_obj.fit_on_texts(text_data)

In [9]:
tokenizer_obj.word_index  # A dict that maps each word to its frequency rank across all texts.

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

Now every word is associated with an integer.

Here, the word `the` is number 1:

In [10]:
tokenizer_obj.word_index['the']

1

Here, `and` is number 2:

In [11]:
tokenizer_obj.word_index['and']

2

The word `a` is number 3:

In [12]:
tokenizer_obj.word_index['a']

3

We see that `movie` is number 17:

In [13]:
tokenizer_obj.word_index['movie']

17

`film` is number 19:

In [14]:
tokenizer_obj.word_index['film']

19

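This means that `the` is the most frequently used word in the dataset and `and` is the second most frequent. Whenever we map words to integer tokens, these are the numbers we get.
As another example, the word `romantic` is number 743: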


In [15]:
tokenizer_obj.word_index['romantic']

743

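So whenever the word `romantic` appears in the input text, we map it to the integer token 743.

Next, we use the tokenizer to convert all the words of a training review into integer tokens. The example is the review at index 1, that is, the second review in the training set: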

In [16]:
input_text_train[1]   # The review at index 1 of the training set

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou

In [17]:
input_train_tokens = tokenizer_obj.texts_to_sequences(input_text_train)   # Each text becomes a list of integer tokens

In [18]:
np.array(input_train_tokens, dtype=object)   # dtype=object because the token lists have different lengths

array([list([299, 6, 3, 1059, 202, 9, 2119, 30, 1, 167, 55, 14, 47, 79, 6274, 42, 368, 114, 138, 14, 5103, 56, 4515, 153, 8, 1, 4233, 5799, 469, 68, 5, 262, 12, 2072, 6, 72, 2556, 5, 614, 71, 6, 5103, 1, 5, 1897, 1, 5540, 1469, 35, 67, 63, 203, 140, 65, 1151, 1, 4, 1, 223, 871, 29, 3195, 68, 4, 1, 5510, 10, 677, 2, 65, 1469, 50, 10, 210, 1, 398, 8, 60, 3, 1425, 3345, 762, 5, 3491, 175, 1, 368, 10, 1220, 30, 299, 3, 360, 347, 3471, 145, 133, 5, 8306, 27, 4, 125, 5103, 1425, 2563, 5, 299, 10, 525, 12, 106, 1540, 4, 56, 599, 101, 12, 299, 6, 225, 3994, 48, 3, 2244, 12, 9, 213]),
       list([38, 14, 744, 3506, 45, 75, 32, 1771, 15, 153, 18, 110, 3, 1344, 5, 343, 143, 20, 1, 920, 12, 70, 281, 1228, 395, 35, 115, 267, 36, 166, 5, 368, 158, 38, 2058, 15, 1, 504, 88, 83, 101, 4, 1, 4339, 14, 39, 3, 432, 1148, 136, 8697, 42, 177, 138, 14, 2791, 1, 295, 20, 5276, 351, 5, 3029, 2310, 1, 38, 8697, 43, 3611, 26, 365, 5, 127, 53, 20, 1, 2032, 7, 7, 18, 48, 43, 22, 70, 358, 3, 2343, 5, 420, 20, 1, 2

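The review shown above corresponds to the second list in this output (index 1). The word `or` becomes 38, `as` becomes 14, and so on. Words that fall outside the 10,000-word vocabulary, such as `Homelessness` here, are dropped, which is why the list starts with 38.

We also need to convert the test set in the same way: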

In [19]:
input_test_tokens = tokenizer_obj.texts_to_sequences(input_text_test)  # Convert the test texts to lists of integers
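
The two tokenizer steps used so far, ranking words by frequency and converting texts into integer sequences, can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: the helper names `split_words`, `build_word_index` and `texts_to_ids` are hypothetical, the word splitting is simplified, and the rule that only ranks below `num_words` are kept is our understanding of the Keras behaviour.

```python
from collections import Counter
import re


def split_words(text):
    """Lowercase the text and split on anything that is not a letter, digit or apostrophe."""
    return [w for w in re.split(r"[^a-z0-9']+", text.lower()) if w]


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index, num_words):
    """Map words to their ranks, dropping unknown words and ranks >= num_words."""
    return [[word_index[w] for w in split_words(text)
             if w in word_index and word_index[w] < num_words]
            for text in texts]


toy_texts = ["the movie was good", "the film was bad", "the end"]
toy_index = build_word_index(toy_texts)
toy_ids = texts_to_ids(["the movie was truly good"], toy_index, num_words=4)
print(toy_index)
print(toy_ids)
```

In the toy run only `the` and `was` survive the conversion: `truly` was never seen during fitting, and `movie` and `good` rank outside the cutoff. This is the same effect that removes rare words from the IMDB reviews.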

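#### (4) Pad and truncate the token sequences

There is another problem: the length of each token sequence depends on the length of the original text. Recurrent units can process sequences of arbitrary length, but TensorFlow requires all the data in a batch to have the same length.

So we either make every sequence in the whole dataset the same length, or write a custom data generator that ensures the sequences within each batch have the same length. Making every sequence in the dataset the same length is simpler, but there are outliers: some reviews contain more than 2,200 words, and padding every sequence to that length would hurt memory usage badly. So we have to compromise.

First, we count the tokens in every input sequence. The result below shows that the average sequence has about 221 tokens: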

In [20]:
total_num_tokens = [len(tokens) for tokens in input_train_tokens + input_test_tokens]
total_num_tokens = np.array(total_num_tokens)

np.mean(total_num_tokens)     # Mean number of tokens per sequence

221.27716

The next result shows that the longest sequence has more than 2,200 tokens:

In [21]:
np.max(total_num_tokens)

2209

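There is a huge gap between the mean of 221 and the maximum of 2209. If we simply padded every sequence in the dataset to 2209 tokens, we would waste a lot of memory. For a dataset with millions of text sequences this would be a serious problem.

So we compromise: we pad the short sequences and truncate the ones that are too long, so that every sequence has 544 tokens. This length is the mean number of tokens over all sequences plus two standard deviations: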

In [22]:
max_num_tokens = np.mean(total_num_tokens) + 2 * np.std(total_num_tokens)   # mean plus two standard deviations
max_num_tokens = int(max_num_tokens)
max_num_tokens

544

With the mean plus two standard deviations as the limit, every sequence will be padded or truncated to 544 tokens.
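
To put the memory argument into numbers, here is a small back-of-the-envelope calculation. It assumes each token is stored as a 32-bit integer (4 bytes); the function name `padded_size_bytes` is ours and only serves the illustration.

```python
num_sequences = 50_000      # 25,000 training + 25,000 test reviews
bytes_per_token = 4         # assumption: tokens stored as 32-bit integers


def padded_size_bytes(num_seqs, seq_len, item_bytes):
    """Size of a dense (num_seqs x seq_len) integer matrix in bytes."""
    return num_seqs * seq_len * item_bytes


size_max = padded_size_bytes(num_sequences, 2209, bytes_per_token)
size_cut = padded_size_bytes(num_sequences, 544, bytes_per_token)

print(f"padded to 2209 tokens: {size_max / 1e6:.1f} MB")
print(f"padded to  544 tokens: {size_cut / 1e6:.1f} MB")
print(f"ratio: {size_max / size_cut:.2f}x")
```

For this dataset either size fits in memory, but the roughly fourfold difference is what becomes painful with millions of sequences.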

In [23]:
np.sum(total_num_tokens < max_num_tokens)/len(total_num_tokens)  # Fraction of sequences shorter than 544 tokens

0.9453

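From this we can see that about 95% of the texts are shorter than 544 tokens; only about 5% are longer.

Keras provides a function, `pad_sequences`, for this. It either pads sequences that are too short (by adding zeros) or truncates sequences that are too long (by cutting off some of the words).
An important question, however, is whether to pad and truncate at the start of the sequence ('pre') or at the end ('post').

Suppose we have a sequence of integer tokens that is too short and needs padding. We can put the zeros at the beginning, so that the actual tokens come at the end, or do it the other way round, with the data first and the zeros at the end. Recalling the RNN flowchart from earlier, the network processes the sequence one step at a time. If it starts by processing zeros, they probably carry no meaning and the internal state probably just stays near zero. Then, once it sees the integer token of an actual word, it knows that the real data has started.

If the zeros are at the end, however, the network first processes all the data and builds up some internal state in the recurrent units, and then sees a long run of zeros, which may destroy the state it has just computed. This is why padding with zeros at the beginning may be a good idea.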

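The other question is how to truncate. If a text is very long, we truncate it to fit into 544 tokens, or whatever length we chose. Now imagine that the cut lands in the middle of a sentence such as "this very good movie" or "this is not". We only do this for very long sequences, but we may still lose information that is needed to classify the text correctly. This is the compromise we make when truncating the input. A better approach is to build batches and pad the texts within each batch: when a batch contains a very long sequence, the other sequences in that batch are padded to the same length. That way we do not have to keep the whole dataset padded to the maximum length in memory, most of which would be wasted.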
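
The per-batch padding idea is not used in the rest of this section, but a minimal sketch may make it concrete. The generator name `batches_padded_per_batch` is hypothetical, and a real data generator would also have to yield the matching targets.

```python
def batches_padded_per_batch(sequences, batch_size):
    """Yield batches in which each batch is pre-padded only to its own longest sequence."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        # Zeros go in front of the real tokens, as with 'pre' padding.
        yield [[0] * (width - len(seq)) + list(seq) for seq in batch]


toy_sequences = [[1, 2], [3], [4, 5, 6, 7, 8, 9], [1, 2, 3]]
toy_batches = list(batches_padded_per_batch(toy_sequences, batch_size=2))
for batch in toy_batches:
    print(batch)
```

The first toy batch is only two tokens wide and the second is six tokens wide, so the long sequence costs extra padding only in the batch that contains it.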

Now let's go back and convert the whole dataset so that it is truncated and padded; the result is one large matrix:

In [24]:
seq_pad = 'pre'        # 'pre' pads and truncates at the start of the sequence

input_train_pad = pad_sequences(input_train_tokens, maxlen=max_num_tokens,
                            padding=seq_pad, truncating=seq_pad)           # padding: where zeros are added; truncating: where tokens are cut

input_test_pad = pad_sequences(input_test_tokens, maxlen=max_num_tokens,
                           padding=seq_pad, truncating=seq_pad)
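
What `padding='pre'` and `truncating='pre'` do to a single sequence can be sketched in plain Python. The function `pad_or_truncate` is a hypothetical stand-in written for this illustration; it is not how Keras implements `pad_sequences`, and it uses one mode for both padding and truncating, as the cell above does.

```python
def pad_or_truncate(tokens, maxlen, mode="pre"):
    """Return a list of exactly `maxlen` tokens.

    mode="pre":  add zeros at the front, or drop tokens from the front.
    mode="post": add zeros at the end, or drop tokens from the end.
    """
    if mode not in ("pre", "post"):
        raise ValueError("mode must be 'pre' or 'post'")
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    if len(tokens) >= maxlen:
        return tokens[-maxlen:] if mode == "pre" else tokens[:maxlen]
    zeros = [0] * (maxlen - len(tokens))
    return zeros + tokens if mode == "pre" else tokens + zeros


short_seq = [5, 6, 7]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_seq, 5, "pre"))    # [0, 0, 5, 6, 7]
print(pad_or_truncate(short_seq, 5, "post"))   # [5, 6, 7, 0, 0]
print(pad_or_truncate(long_seq, 4, "pre"))     # [3, 4, 5, 6]
print(pad_or_truncate(long_seq, 4, "post"))    # [1, 2, 3, 4]
```

With 'pre' truncation the end of a long review is kept and its beginning is cut off.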

We check the shape of this matrix:

In [25]:
input_train_pad.shape

(25000, 544)

In [26]:
input_test_pad.shape

(25000, 544)

Now let's look at the tokens of one example before and after padding.

Before padding, the token array looks like this:

In [27]:
np.array(input_train_tokens[1])

array([  38,   14,  744, 3506,   45,   75,   32, 1771,   15,  153,   18,
        110,    3, 1344,    5,  343,  143,   20,    1,  920,   12,   70,
        281, 1228,  395,   35,  115,  267,   36,  166,    5,  368,  158,
         38, 2058,   15,    1,  504,   88,   83,  101,    4,    1, 4339,
         14,   39,    3,  432, 1148,  136, 8697,   42,  177,  138,   14,
       2791,    1,  295,   20, 5276,  351,    5, 3029, 2310,    1,   38,
       8697,   43, 3611,   26,  365,    5,  127,   53,   20,    1, 2032,
          7,    7,   18,   48,   43,   22,   70,  358,    3, 2343,    5,
        420,   20,    1, 2032,   15,    3, 3346,  208,    1,   22,  281,
         66,   36,    3,  344,    1,  728,  730,    3, 3864, 1320,   20,
          1, 1543,    3, 1293,    2,  267,   22,  281, 2734,    5,   63,
         48,   44,   37,    5,   26, 4339,   12,    6, 2079,    7,    7,
       3425, 2891,   35, 4446,   35,  405,   14,  297,    3,  986,  128,
         35,   45,  267,    8,    1,  181,  366, 69

After padding, the same example looks like this:

In [28]:
input_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
         38,   14,  744, 3506,   45,   75,   32, 17

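Now that we have seen how text is converted into lists of integers, let's look at the reverse mapping, from integer tokens back to words. A very simple helper function is enough: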

In [29]:
index = tokenizer_obj.word_index      # word -> integer mapping
index_inverse_map = dict(zip(index.values(), index.keys()))    # swap keys and values: integer -> word


def convert_tokens_to_string(input_tokens):          
    # Convert the tokens back to words
    input_words = [index_inverse_map[token] for token in input_tokens if token != 0]   # look up each token, skipping padding zeros

    # Join all the words into one string.
    combined_text = " ".join(input_words)

    return combined_text

For example, the original text from the dataset is:

In [30]:
input_text_train[1]

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou

If we use the helper function to convert the tokens back into words, we get the following text:

In [31]:
convert_tokens_to_string(input_train_tokens[1])

"or as george stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq kids to succeed technology the or worrying if they'll be next to end up on the streets br br but what if you were given a bet to live on the streets for a month without the you once had from a home the entertainment sets a bathroom pictures on the wall a computer and everything you once treasure to see what it's like to be homeless that is lesson br br mel brooks who directs who stars as plays a rich man who has everything in the world until deciding to make a bet with a sissy rival to see if he can live in the streets for thirty days without the if succeeds he can do what he wants with a future project of making more buildings the on where is thrown on the street with a on his leg t

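As we can see, the text is essentially the same apart from punctuation, capitalization, and the words that fall outside the vocabulary.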

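### 2. Build the model

Now we need to create the RNN, which we implement in Keras using the so-called sequential model.

The first layer of this architecture is the embedding layer. Looking back at the flowchart in Figure 1, what we have done so far is convert the raw input text into integer tokens. We still cannot feed these into the RNN, so we have to convert them into embedding vectors, whose values typically lie between -1 and 1. They can fall somewhat outside this range, but they usually stay within it, which is data a neural network can work with.

We need to decide the length of each vector. For example, token 11 is converted into a real-valued vector whose length we could set to 10 (this is actually very short; typical lengths are between 100 and 300).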
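
Conceptually, an embedding layer is a lookup table with one row of real values per token id. The sketch below uses toy sizes and random values in place of learned weights; the names `embedding_matrix` and `embed` are ours, and the initial value range is an arbitrary choice for the illustration.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 20, 4   # toy sizes, much smaller than a real model

# One row per token id. In a real model these values are learned during training.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]


def embed(token_ids):
    """Look up one vector per token: seq_len ids -> seq_len vectors of embedding_dim values."""
    return [embedding_matrix[token] for token in token_ids]


vectors = embed([0, 0, 11, 3])
print(len(vectors), "vectors of length", len(vectors[0]))
print(vectors[2])   # the vector for token 11
```

The same token id always maps to the same vector, so the two padding zeros at the front get identical embeddings.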

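#### (1) Add the embedding layer

Here we set the embedding size to 8 and use Keras to add the embedding layer to the RNN. It must be the first layer of the network: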

In [32]:
rnn_type_model = Sequential()

In [33]:
embedding_layer_size = 8 #typical value for this should be between 200 and 300

rnn_type_model.add(Embedding(input_dim=num_top_words,
                    output_dim=embedding_layer_size,
                    input_length=max_num_tokens,
                    name='embedding_layer'))

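#### (2) Build the RNN model

Next we add the first recurrent layer, using a gated recurrent unit (GRU). People often use the LSTM, but others argue that the GRU is better because the LSTM has redundant gates inside, and that a simpler unit with fewer gates can work just as well or better. We therefore use GRUs here. The first GRU layer has an output dimension of 16 and returns the full sequence, since another recurrent layer follows it: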

In [34]:
rnn_type_model.add(GRU(units=16, return_sequences=True))

rnn_type_model.add(GRU(units=8, return_sequences=True))

rnn_type_model.add(GRU(units=4))

rnn_type_model.add(Dense(1, activation='sigmoid'))

model_optimizer = Adam(lr=1e-3)

rnn_type_model.compile(loss='binary_crossentropy',
              optimizer=model_optimizer,
              metrics=['accuracy'])

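Here we added three recurrent layers. The third GRU layer returns only its final output rather than a full output sequence. That output is fed into a fully connected (dense) layer, which produces a single value for each input sequence. Because the dense layer uses the sigmoid activation function, this value lies between 0 and 1. We use the Adam optimizer, and the loss function is the binary cross-entropy between the RNN's output and the true class from the training set, which is either 0 or 1.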
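
To make the output activation and the loss concrete, here is a small sketch of the sigmoid function and the binary cross-entropy for a single sample. It is written from the standard definitions; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the numbers are toy values.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one sample: small when y_prob is close to y_true, large otherwise."""
    p = min(max(y_prob, eps), 1.0 - eps)   # keep log() away from 0
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


print(sigmoid(0.0))                      # 0.5
print(binary_crossentropy(1.0, 0.9))     # confident and correct: small loss
print(binary_crossentropy(1.0, 0.1))     # confident and wrong: large loss
```

Training lowers the average of this loss over the batch, which pushes the output towards 1 for positive reviews and towards 0 for negative ones.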

Now let's look at a summary of the model:

In [35]:
rnn_type_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru (GRU)                    (None, 544, 16)           1200      
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________


The summary shows one embedding layer, three recurrent layers, and one dense layer. Note that the model does not have many parameters.
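
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the classic GRU formulation with three weight blocks (update gate, reset gate, candidate state), each with input weights, recurrent weights and a single bias vector. This matches the numbers above; some newer Keras versions default to a GRU variant with an extra bias vector, which gives slightly larger counts. The function names are ours.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def gru_params(input_dim, units):
    """Three blocks, each with input weights, recurrent weights and one bias."""
    return 3 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    """Weights plus one bias per output unit."""
    return input_dim * units + units


layer_params = [
    embedding_params(10000, 8),   # embedding_layer
    gru_params(8, 16),            # gru
    gru_params(16, 8),            # gru_1
    gru_params(8, 4),             # gru_2
    dense_params(4, 1),           # dense
]
total_params = sum(layer_params)
print(layer_params)    # [80000, 1200, 600, 156, 5]
print(total_params)    # 81961
```

Almost all of the parameters sit in the embedding layer; the three GRU layers together have fewer than 2,000.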

### 3. Model training and result analysis

#### (1) Train the model

Now we train the model:

In [36]:
rnn_type_model.fit(input_train_pad, target_train,
          validation_split=0.05, epochs=3, batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 23750 samples, validate on 1250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x19c4d1a17b8>

Training ran for 3 epochs, with the loss decreasing and the accuracy increasing from epoch to epoch.

#### (2) Evaluate the model accuracy

In [39]:
model_result = rnn_type_model.evaluate(input_test_pad, target_test)
print("accuracy:{0:.2%}".format(model_result[1]))   # model_result is [loss, accuracy]



accuracy:85.26%

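#### (3) Make predictions

Now let's look at an example of a misclassified text.

First we compute the predicted classes for the first 1,000 sequences in the test set, then take the true classes. We compare the two and obtain the list of indices where they do not match: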

In [40]:
target_predicted = rnn_type_model.predict(x=input_test_pad[0:1000])
target_predicted = target_predicted.T[0]

We apply a cutoff threshold of 0.5: every value above it is treated as positive, and every other value as negative:

In [41]:
class_predicted = np.array([1.0 if prob>0.5 else 0.0 for prob in target_predicted])

Now let's get the true classes of these 1,000 sequences:

In [42]:
class_actual = np.array(target_test[0:1000])

Next we collect the incorrectly classified samples:

In [43]:
incorrect_samples = np.where(class_predicted != class_actual)
incorrect_samples = incorrect_samples[0]
len(incorrect_samples)

82

So 82 of these texts were misclassified, which is 8.2% of the 1,000 texts evaluated here. Let's look at the first misclassified text:

In [44]:
index = incorrect_samples[0]
index

35

In [45]:
incorrect_predicted_text = input_text_test[index]
incorrect_predicted_text

'BEING Warner Brothers\' second historical drama featuring Civil War and Battle of the Little Big Horn, General George Armstrong Custer, THEY DIED WITH THEIR BOOTS ON (Warner Brothers, 1941) was the far more accurate of the two; especially when contrasted with SANTA FE TRAIL (Warner Brothers, 1940), which really didn\'t set the bar very high.<br /><br />ALTHOUGH both pictures were starring vehicles for Errol Flynn, there was a change in the casting the part of General Custer. Whereas it was "Dutch", himself, Ronald Reagan portraying the flamboyant, egomaniacal Cavalryman in the earlier picture, with Mr. Flynn playing Virginian and later Confederate Hero General, J.E.B. (or Jeb) Stuart; Errol took on the Custer part for THEY DIED WITH THEIR BOOTS ON.<br /><br />ONCE again, the Warner Brothers\' propensity for using a large number of reliable character actors from the "Warner\'s Repertory Company" are employed in giving the film a sort of authenticity, and all is really happening right b

Let's look at the model output and the true class for this example:

In [46]:
target_predicted[index]

0.11293286

In [47]:
class_actual[index]

1.0

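The predicted sentiment (about 0.11, which is negative) differs from the true sentiment (1.0, which is positive).

Now let's test the trained model on a set of new samples and look at the results. There are 8 samples: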
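
The example above is a false negative: a positive review that the model scored as negative. If we want to know how the errors split into false positives and false negatives, a small helper like the following would do. The function `error_breakdown` and the toy values are hypothetical; with the notebook's arrays one would pass `class_predicted` and `class_actual` instead.

```python
def error_breakdown(predicted, actual):
    """Return (false_positive_indices, false_negative_indices)."""
    false_pos, false_neg = [], []
    for i, (pred, true) in enumerate(zip(predicted, actual)):
        if pred == 1.0 and true == 0.0:
            false_pos.append(i)       # predicted positive, actually negative
        elif pred == 0.0 and true == 1.0:
            false_neg.append(i)       # predicted negative, actually positive
    return false_pos, false_neg


# Toy values for illustration only, not model output.
toy_probs = [0.91, 0.11, 0.40, 0.75, 0.05]
toy_actual = [1.0, 1.0, 0.0, 0.0, 0.0]
toy_predicted = [1.0 if prob > 0.5 else 0.0 for prob in toy_probs]

fp, fn = error_breakdown(toy_predicted, toy_actual)
print("false positives:", fp)   # [3]
print("false negatives:", fn)   # [1]
```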

In [48]:
test_sample_1 = "This movie is fantastic! I really like it because it is so good!"
test_sample_2 = "Good movie!"
test_sample_3 = "Maybe I like this movie"
test_sample_4 = "Meh ..."
test_sample_5 = "If I were a drunk teenager then this movie "
test_sample_6 = "Bad movie"
test_sample_7 = "Not a good movie"
test_sample_8 = "This movie really sucks! Can I get my money back please？"
test_samples = [test_sample_1, test_sample_2, test_sample_3, test_sample_4, test_sample_5, test_sample_6, test_sample_7, test_sample_8]

Now we convert them into integer tokens and pad or truncate the resulting sequences:

In [49]:
test_samples_tokens = tokenizer_obj.texts_to_sequences(test_samples)
test_samples_tokens_pad = pad_sequences(test_samples_tokens, maxlen = max_num_tokens, padding = seq_pad, truncating = seq_pad)

Then we check the shape of the padded array:

In [50]:
test_samples_tokens_pad.shape

(8, 544)

Finally, we run the model on them and get the following results:

In [51]:
rnn_type_model.predict(test_samples_tokens_pad)

array([[0.96684575],
       [0.96288127],
       [0.9389076 ],
       [0.96079415],
       [0.927937  ],
       [0.88653195],
       [0.95413595],
       [0.8341638 ]], dtype=float32)

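Values close to 0 indicate negative sentiment and values close to 1 indicate positive sentiment. In the output above all eight samples score above 0.8, including clearly negative ones such as "Bad movie" and "Not a good movie", so this small model does not separate these short texts well.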