# Analyzing IMDB Data in Keras

In [1]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer  # Tokenizer is the text-tokenization utility
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

Using TensorFlow backend.


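The input has already been preprocessed for us. Each review is encoded as a sequence of indices, one per word in the review. Words are ranked by frequency, so the integer 1 corresponds to the most frequent word ("the"), the integer 2 to the second most frequent word, and so on. By convention, the integer 0 stands for any unknown word.

A sentence then becomes a vector by simply concatenating these integers. Take the sentence "To be or not to be", with the following word indices:

- "to": 5
- "be": 8
- "or": 21
- "not": 3

The sentence is then encoded as the vector [5, 8, 21, 3, 5, 8].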
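As a minimal sketch of the encoding described above, the snippet below maps each word to its index with a plain dictionary. The `word_index` dictionary and the `encode` helper are hypothetical and contain only the four example words; the real dataset uses its own, much larger index.

```python
# Hypothetical toy vocabulary: only the four words from the example above.
word_index = {"to": 5, "be": 8, "or": 21, "not": 3}

def encode(sentence, word_index, unknown=0):
    """Map each lowercased word to its index; words not in the index map to `unknown`."""
    return [word_index.get(word, unknown) for word in sentence.lower().split()]

encoded = encode("To be or not to be", word_index)
print(encoded)  # [5, 8, 21, 3, 5, 8]
```

A word that is missing from the dictionary falls back to 0, matching the convention for unknown words stated above.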

## 1. Loading the data
This dataset comes bundled with Keras, so one simple command gives us the training and testing data. The num_words parameter sets how many of the most frequent words we want to keep. We've set it to 1000, but feel free to experiment.

In [2]:
# Loading the data (it's preloaded in Keras)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000)  # num_words=1000 keeps only the 1000 most frequent words

print(x_train.shape)
print(x_test.shape)

(25000,)
(25000,)
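To illustrate what limiting the vocabulary in this section means, here is a sketch in plain Python. It assumes that every index at or above `num_words` is replaced by a single out-of-vocabulary marker. The helper `cap_vocabulary`, the toy `review`, and the marker value 2 are assumptions made for illustration, not code taken from Keras.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every index >= num_words with oov_index (an assumed marker value)."""
    return [i if i < num_words else oov_index for i in sequence]

# Hypothetical review containing two rare words (indices 1622 and 4468).
review = [1, 14, 22, 1622, 973, 4468]
capped = cap_vocabulary(review, num_words=1000)
print(capped)       # [1, 14, 22, 2, 973, 2]
print(max(capped))  # 973
```

Under this assumption no index in the capped data can reach `num_words`, so the largest possible index is `num_words - 1`.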


## 2. Examining the data
Notice that the data has already been preprocessed: every word has been replaced by a number, and each review arrives as a sequence of the indices of the words it contains. For example, if the word 'the' is the first one in our dictionary, then every occurrence of 'the' in a review shows up as a 1 in that review's sequence.

The output comes as a vector of 1's and 0's, where 1 is a positive sentiment for the review, and 0 is negative.

In [3]:
print(np.shape(x_train[1]))  # number of word indices in the second review
print(y_train[0])  # label of the first review

(189,)
1


In [4]:
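# x_train holds vocabulary indices: one index for each word of each review
l_max = [max(x) for x in x_train]  # largest index in each review; with num_words=1000 the overall maximum is 999
print(l_max[0], max(l_max))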

973 999


In [5]:
l_min = [min(x) for x in x_train]  # smallest index in each review; the overall minimum is 1
print(l_min[0],min(l_min))

1 1


## 3. One-hot encoding the input and output
Here, we'll turn the input vectors into (0,1)-vectors. For example, if the pre-processed vector contains the number 14, then in the processed vector, the entry at index 14 will be 1.
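The conversion to (0,1)-vectors can be sketched in plain NumPy. This is an illustrative stand-in, not the Keras implementation: the helper `sequences_to_binary_matrix` and the toy sequences are hypothetical, and the sketch assumes that repeated indices still produce a single 1.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words):
    """Return a (len(sequences), num_words) matrix with 1.0 wherever an index occurs."""
    matrix = np.zeros((len(sequences), num_words))
    for row, sequence in enumerate(sequences):
        for index in sequence:
            if index < num_words:  # ignore indices outside the vocabulary
                matrix[row, index] = 1.0
    return matrix

# Two toy reviews; the second one repeats index 2.
toy_sequences = [[1, 14, 3], [2, 2, 5]]
encoded = sequences_to_binary_matrix(toy_sequences, num_words=16)
print(encoded.shape)     # (2, 16)
print(encoded[0, 14])    # 1.0
print(encoded[1].sum())  # 2.0
```

Word order and word counts are discarded by this representation; only the presence or absence of each word survives.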

In [6]:
# One-hot encoding the input into vector mode, each of length 1000
# Tokenizer is a class for vectorizing text, or for converting text into sequences
# (lists of each word's index in the dictionary, counting from 1).
tokenizer = Tokenizer(num_words=1000)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])  # e.g. the index 973 seen above sets position 973 to 1


[0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0.
 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

In [7]:
print(np.shape(x_train[2]),np.shape(x_train))

(1000,) (25000, 1000)


And we'll also one-hot encode the output.

In [8]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)


In [9]:
y_train

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [1., 0.]])
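For intuition, the same label encoding can be reproduced with NumPy alone. This is a sketch on four made-up labels, not a replacement for `to_categorical`: row `i` of the identity matrix is the one-hot vector for class `i`.

```python
import numpy as np

labels = np.array([1, 0, 0, 1])  # hypothetical labels: 1 = positive, 0 = negative
num_classes = 2

# Indexing the identity matrix by label picks one one-hot row per sample.
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# argmax along each row recovers the original labels.
recovered = np.argmax(one_hot, axis=1)
print(recovered)  # [1 0 0 1]
```

A positive review (label 1) therefore becomes `[0., 1.]` and a negative review (label 0) becomes `[1., 0.]`, which matches the rows of `y_train` shown above.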

## 4. Building the model architecture
Build a model here using Sequential. Feel free to experiment with different layers and sizes! Also, experiment with adding dropout to reduce overfitting.

In [10]:
from keras import optimizers

model = Sequential()  # network layout: 1000 -> 100 -> 60 -> 10 -> 2
model.add(Dense(100, activation='sigmoid', input_dim=1000))
model.add(Dropout(0.5))
model.add(Dense(60,activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(10,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
# Alternative optimizer, kept for experimentation (not used by the compile call below)
sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)

# model.compile(loss='categorical_crossentropy',
#               optimizer=sgd,
#               metrics=['accuracy'])
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 100)               100100    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 60)                6060      
_________________________________________________________________
dropout_2 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                610       
_________________________________________________________________
dropout_3 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 22        
Total para
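The per-layer parameter counts in the summary can be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

# Input size followed by the width of each Dense layer in the model above.
layer_sizes = [1000, 100, 60, 10, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [100100, 6060, 610, 22]
print(sum(counts))  # 106792
```

Almost all of the parameters sit in the first layer, because that is where the 1000-dimensional input is connected.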

## 5. Training the model
Train the model here. Experiment with different values for batch_size and the number of epochs!

In [11]:
hist = model.fit(x_train, y_train,
                 batch_size=100,
                 epochs=40,
                 validation_data=(x_test, y_test),
                 verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/40
 - 4s - loss: 0.7072 - acc: 0.5008 - val_loss: 0.6932 - val_acc: 0.5000
Epoch 2/40
 - 2s - loss: 0.6953 - acc: 0.5068 - val_loss: 0.6932 - val_acc: 0.5000
Epoch 3/40
 - 2s - loss: 0.6933 - acc: 0.5095 - val_loss: 0.6932 - val_acc: 0.5000
Epoch 4/40
 - 2s - loss: 0.6928 - acc: 0.5073 - val_loss: 0.6930 - val_acc: 0.5234
Epoch 5/40
 - 2s - loss: 0.6925 - acc: 0.5161 - val_loss: 0.6918 - val_acc: 0.6189
Epoch 6/40
 - 2s - loss: 0.6904 - acc: 0.5307 - val_loss: 0.6865 - val_acc: 0.7434
Epoch 7/40
 - 2s - loss: 0.6743 - acc: 0.5706 - val_loss: 0.6117 - val_acc: 0.7926
Epoch 8/40
 - 2s - loss: 0.6075 - acc: 0.6644 - val_loss: 0.4991 - val_acc: 0.8155
Epoch 9/40
 - 2s - loss: 0.5469 - acc: 0.7298 - val_loss: 0.4373 - val_acc: 0.8300
Epoch 10/40
 - 2s - loss: 0.5052 - acc: 0.7637 - val_loss: 0.3985 - val_acc: 0.8369
Epoch 11/40
 - 2s - loss: 0.4848 - acc: 0.7809 - val_loss: 0.3829 - val_acc: 0.8435
Epoch 12/40
 - 2s - loss: 0.4628 - 
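One simple way to read a training log like this is to look for the epoch with the highest validation accuracy. The sketch below copies the `val_acc` values of the 11 complete epochs shown above into a list; with a live model, the same values would presumably be available as `hist.history['val_acc']`, where the key name is an assumption based on the metric names in the log. The helper `best_epoch` is hypothetical.

```python
# Validation accuracy for the 11 complete epochs shown in the log above.
val_acc = [0.5000, 0.5000, 0.5000, 0.5234, 0.6189, 0.7434,
           0.7926, 0.8155, 0.8300, 0.8369, 0.8435]

def best_epoch(values):
    """Return (1-based epoch number, value) for the highest value in the list."""
    best_index = max(range(len(values)), key=lambda i: values[i])
    return best_index + 1, values[best_index]

epoch, value = best_epoch(val_acc)
print(epoch, value)  # 11 0.8435
```

In this excerpt the validation accuracy is still rising at the last complete epoch, so the best epoch is simply the latest one.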

## 6. Evaluating the model
This will give you the accuracy of the model, as evaluated on the testing set. Can you get something over 85%?

In [12]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.8622
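To make the accuracy figure concrete, here is a sketch of how such a number can be computed from one-hot labels and softmax outputs: take the most probable class for each sample and compare it with the true class. The helper `categorical_accuracy` and the toy arrays are made up for illustration; this is not the Keras source.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the one-hot label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# Four hypothetical samples: one-hot labels and predicted class probabilities.
y_true = np.array([[0., 1.], [1., 0.], [1., 0.], [0., 1.]])
y_pred = np.array([[0.2, 0.8], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
print(categorical_accuracy(y_true, y_pred))  # 0.75
```

The third sample is the only one misclassified, so three of four predictions are correct.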
