[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zhujisheng/learn_python/blob/master/12.TensorFlow从入门到熟练/7.文本分析——Embedding.ipynb)

[Video course: Python Applications in Practice (《Python应用实战》)](https://study.163.com/course/courseMain.htm?courseId=1209533804&share=2&shareId=400000000624093)

# Text Analysis: Embedding

Difficulty: ★★★★☆


## What is an embedding?

*A word embedding converts each word into a vector*

![word_embedding](images/word_embedding.JPG)

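- An embedding keeps each word's position (order) in the text: the text becomes a sequence of vectors, one per word

- Vectors can carry richer information:

    - Words with similar meanings are represented by vectors that lie close together

    - Relationships between words can be expressed as differences between vectors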
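
The two properties in the list above can be illustrated with a small NumPy sketch. The 3-dimensional vectors here are hand-picked for illustration and are an assumption, not values from a trained model; real embeddings usually have far more dimensions, and the helper name `cosine_similarity` is also introduced only for this example.

```python
import numpy as np

# Hand-picked toy vectors (illustrative values, NOT from a trained embedding).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar words lie closer together: "queen" is nearer to "woman" than to "man".
sim_queen_woman = cosine_similarity(vectors["queen"], vectors["woman"])
sim_queen_man = cosine_similarity(vectors["queen"], vectors["man"])

# A relationship as a vector difference: king - man + woman lands next to queen.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max(vectors, key=lambda word: cosine_similarity(analogy, vectors[word]))
print(sim_queen_woman > sim_queen_man, nearest)
```

With trained embeddings the analogy only holds approximately, and the input words are normally excluded when searching for the nearest neighbour.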

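## Two ways to handle embeddings in a neural network

- Use a ready-made embedding table: look up the vector for each word of the text in the table, and convert the text into embeddings that way
- Train the embedding inside the neural network, so that the vectors are learned together with the rest of the model (the approach used in this notebook)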
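
Both approaches rest on the same operation: a table lookup by word index. The sketch below shows that lookup with plain NumPy. The table is filled with random numbers as a hypothetical stand-in for a pre-trained table, and the word indices are made up.

```python
import numpy as np

vocab_size = 10000    # same vocabulary size as used later in this notebook
embedding_dim = 16    # same vector length as the Embedding layer used later

# Hypothetical stand-in for an embedding table: one row per word index.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A short text encoded as word indices (made-up values).
sequence = np.array([1, 14, 22, 16, 43])

# The lookup: each index selects its row, and word order is preserved.
embedded = embedding_table[sequence]
print(embedded.shape)  # (5, 16)
```

In the first approach the table stays fixed. In the second, the table is the weight matrix of the embedding layer and is updated during training like any other weight.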

## Preparing the data

In [None]:
import numpy as np
from tensorflow import keras
imdb = keras.datasets.imdb
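vocab_size = 10000  # keep only the most frequent words (indices below 10000)

# Each review is a list of integer word indices; each label is 0 (negative) or 1 (positive)
(train_sequences, train_labels), (test_sequences, test_labels) = imdb.load_data(num_words=vocab_size)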
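
To see what the integer sequences mean, they can be decoded back into words. The sketch below assumes the default settings of `imdb.load_data`: index 0 is padding, 1 marks the start of a review, 2 stands for out-of-vocabulary words, and every word's frequency rank is shifted by 3. The function `decode_review` and the tiny `toy_word_index` are hypothetical and used only for illustration; the real mapping comes from `imdb.get_word_index()`.

```python
def decode_review(sequence, word_index, index_from=3):
    """Turn a list of word indices back into a readable string."""
    index_to_word = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}
    # word_index maps word -> frequency rank; encoded reviews store rank + index_from
    index_to_word.update({rank + index_from: word for word, rank in word_index.items()})
    return " ".join(index_to_word.get(i, "<UNK>") for i in sequence)

# Toy stand-in for imdb.get_word_index() (hypothetical ranks).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
decoded = decode_review([1, 4, 5, 6, 7, 2], toy_word_index)
print(decoded)  # <START> the movie was great <UNK>
```

With the real data, the call would look like `decode_review(train_sequences[0], imdb.get_word_index())`.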

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

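# Make every review exactly 256 indices long: shorter reviews get zeros
# appended at the end, longer reviews lose their tail
train_sequences_padded = pad_sequences(train_sequences,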
                                       padding='post',
                                       truncating='post',
                                       maxlen=256)
test_sequences_padded = pad_sequences(test_sequences,
                                      padding='post',
                                      truncating='post',
                                      maxlen=256)
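
The effect of `padding='post'` and `truncating='post'` is easier to see on a tiny input. The function `pad_post` below is a simplified NumPy re-implementation written for illustration, not the Keras function itself, and it covers only the `'post'` variants used above.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post', truncating='post')."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # truncating='post': drop the tail
        padded[row, :len(kept)] = kept   # padding='post': fill values stay at the end
    return padded

toy_sequences = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
result = pad_post(toy_sequences, maxlen=4)
print(result)
# [[5 8 2 0]
#  [7 1 9 4]]
```

Note that `pad_sequences` defaults to `'pre'` for both options, so the two arguments have to be set explicitly to get this behaviour.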

## Building the model

In [None]:
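model = keras.Sequential()
# Map each of the 256 word indices to a trainable 16-dimensional vector: output shape (256, 16)
model.add(keras.layers.Embedding(vocab_size, 16, input_length=256))
#model.add(keras.layers.Dropout(0.5))
# Flatten the (256, 16) matrix into a single vector of 4096 values
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(16, activation='relu'))
# A single sigmoid unit outputs the probability that the review is positive
model.add(keras.layers.Dense(1, activation='sigmoid'))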

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

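- Embedding

    - input_dim: size of the vocabulary (the number of distinct word indices)

    - output_dim: dimension of the vector each word is converted to (the length of the array)

    - input_length: number of words in each text (here 256, after padding). Newer Keras versions (Keras 3) mark this argument as deprecated, and it can be omitted there.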
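
The parameter counts reported by `model.summary()` follow directly from these three arguments and the layer sizes. The arithmetic below is a sketch for the model defined above; the variable names are introduced only for this example.

```python
vocab_size, embedding_dim, sequence_length = 10000, 16, 256
hidden_units = 16

# Embedding: one 16-dimensional vector per vocabulary entry, no bias
embedding_params = vocab_size * embedding_dim                 # 160000

# Flatten has no parameters; it reshapes (256, 16) into 4096 values
flatten_size = sequence_length * embedding_dim                # 4096

# Dense layers: weights plus one bias per unit
dense_params = flatten_size * hidden_units + hidden_units     # 65552
output_params = hidden_units * 1 + 1                          # 17

total_params = embedding_params + dense_params + output_params
print(total_params)  # 225569
```

Most of the parameters sit in the embedding table, which is why `vocab_size` and `output_dim` dominate the size of this model.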

## Training the model

In [None]:
model.fit(train_sequences_padded,
          train_labels,
          epochs=6,
          batch_size=512,
          validation_data=(test_sequences_padded, test_labels),
          verbose=1)
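
The `loss` and `accuracy` values printed during training come from the settings passed to `model.compile`. As a rough NumPy sketch of what they measure (the function names, the clipping constant `eps`, and the toy labels and predictions are assumptions made for illustration):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predictions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of predictions on the correct side of the threshold."""
    return float(np.mean((y_pred > threshold) == (y_true == 1)))

# Made-up labels and sigmoid outputs for four reviews.
toy_labels = np.array([1, 0, 1, 0])
toy_predictions = np.array([0.9, 0.2, 0.4, 0.6])

toy_loss = binary_crossentropy(toy_labels, toy_predictions)
toy_accuracy = binary_accuracy(toy_labels, toy_predictions)
print(toy_loss, toy_accuracy)
```

The first two toy predictions are on the correct side of 0.5 and the last two are not, so the accuracy is 0.5, while the loss also reflects how confident each prediction was.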

## Convolution and pooling on text

In [None]:
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16, input_length=256))
#model.add(keras.layers.Dropout(0.5))
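# 16 filters, each sliding over windows of 3 consecutive word vectors
model.add(keras.layers.Conv1D(16, 3, activation='relu'))
# Keep the larger value of every 2 neighbouring steps, halving the sequence length
model.add(keras.layers.MaxPooling1D(pool_size=2))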
#model.add(keras.layers.Conv1D(16, 3, activation='relu'))
#model.add(keras.layers.MaxPooling1D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
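
To make the output shapes in the summary easier to follow, here is a plain NumPy sketch of what the `Conv1D` and `MaxPooling1D` layers compute for a single review. It assumes the Keras defaults (no padding, stride 1 for the convolution, pooling stride equal to `pool_size`) and uses random numbers in place of trained weights; `conv1d_valid` and `max_pool1d` are names introduced for this example.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (steps, channels_in); kernels: (kernel_size, channels_in, filters)."""
    kernel_size, _, filters = kernels.shape
    steps_out = x.shape[0] - kernel_size + 1
    out = np.empty((steps_out, filters))
    for t in range(steps_out):
        window = x[t:t + kernel_size]   # kernel_size consecutive word vectors
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)         # relu activation

def max_pool1d(x, pool_size=2):
    """Keep the maximum of every pool_size consecutive steps, per channel."""
    steps_out = x.shape[0] // pool_size
    trimmed = x[:steps_out * pool_size]
    return trimmed.reshape(steps_out, pool_size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
embedded_review = rng.normal(size=(256, 16))   # stand-in for the Embedding output
kernels = rng.normal(size=(3, 16, 16))         # kernel size 3, 16 channels in, 16 filters
bias = np.zeros(16)

conv_out = conv1d_valid(embedded_review, kernels, bias)
pooled = max_pool1d(conv_out, pool_size=2)
flattened = pooled.reshape(-1)
print(conv_out.shape, pooled.shape, flattened.shape)  # (254, 16) (127, 16) (2032,)
```

The flattened vector has 2032 values instead of the 4096 in the first model, so the following `Dense(16)` layer needs about half as many weights.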

In [None]:
model.fit(train_sequences_padded,
          train_labels,
          epochs=6,
          batch_size=512,
          validation_data=(test_sequences_padded, test_labels),
          verbose=1)