[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zhujisheng/learn_python/blob/master/12.TensorFlow从入门到熟练/5.文本分析——tokenization.ipynb)

[Video course: 《Python应用实战》 (Python in Practice)](https://study.163.com/course/courseMain.htm?courseId=1209533804&share=2&shareId=400000000624093)

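# Text Analysis: Tokenization Preprocessing

Difficulty: ★★★★☆

*Before text is fed into a neural network, it must be preprocessed into a numeric representation. This step, which splits the text into tokens and maps each token to a number, is referred to here as tokenization.*

## Vocabulary

- Text is usually analyzed word by word rather than character by character (letter by letter).

- Before text can be converted into a numeric representation, a vocabulary must be built.

- The vocabulary is a mapping from each word to an integer.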

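In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat very much!'
]

# num_words caps the word indices used when encoding text later;
# oov_token is the placeholder for words missing from the vocabulary
tokenizer = Tokenizer(num_words=10, oov_token='xxxx')
# Build the vocabulary from the sample sentences
tokenizer.fit_on_texts(sentences)

# word_index maps each word to its integer index
print(tokenizer.word_index)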
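
To make the word-to-integer mapping concrete, here is a minimal plain-Python sketch that approximates what `fit_on_texts` does for these two sentences. The helper `build_word_index` and the name `toy_index` are hypothetical and exist only for illustration. The sketch assumes the usual Keras defaults (lowercasing, punctuation treated as a separator, more frequent words getting smaller indices, index 1 given to the OOV token) and is not the library's actual implementation.

```python
import string
from collections import Counter

def build_word_index(texts, oov_token="xxxx"):
    """Build a word -> integer mapping, most frequent words first."""
    # Simplification: treat every ASCII punctuation character as a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # Index 0 is left unused (conventionally padding); index 1 is the OOV token.
    word_index = {oov_token: 1}
    # most_common() sorts by descending count; ties keep first-seen order.
    for idx, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = idx
    return word_index

sentences = ['I love my dog', 'I love my cat very much!']
toy_index = build_word_index(sentences)
print(toy_index)
# {'xxxx': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5, 'cat': 6, 'very': 7, 'much': 8}
```

Note that the exclamation mark in the second sentence never becomes part of a word, and that "I" is stored in lowercase.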

## Sequence Encoding

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

#### oov_token

OOV: out of vocabulary

When text is converted into a sequence, any word that is not in the vocabulary is replaced by `oov_token`.

In [None]:
tokenizer.texts_to_sequences(['you love your dog'])
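
The OOV substitution can be sketched in plain Python as a dictionary lookup with a fallback. The function `encode` is a hypothetical helper, and `toy_index` is assumed to be the vocabulary built from the two sample sentences above. The sketch ignores punctuation handling and the `num_words` cutoff, so it only illustrates the idea.

```python
def encode(text, word_index, oov_token="xxxx"):
    """Map each word to its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]

# Assumed vocabulary for the two sample sentences.
toy_index = {'xxxx': 1, 'i': 2, 'love': 3, 'my': 4,
             'dog': 5, 'cat': 6, 'very': 7, 'much': 8}

encoded = encode('you love your dog', toy_index)
print(encoded)  # [1, 3, 1, 5]
```

Here "you" and "your" are not in the vocabulary, so both become 1, the index of the OOV token.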

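#### Truncation and Padding

Truncation and padding bring every text to the same length, which makes the sequences usable as input to a neural network.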

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences_padded = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
print(sequences_padded)
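
As a rough illustration of what `padding='post'` and `truncating='post'` mean, the following plain-Python sketch cuts extra items off the end and appends zeros at the end. The helper `pad_post` is hypothetical, and unlike `pad_sequences` it returns nested lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate at the end and pad at the end so every row has maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # truncating='post': drop the tail
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post'
    return padded

# Sequences for the two sample sentences under the vocabulary above.
toy_sequences = [[2, 3, 4, 5], [2, 3, 4, 6, 7, 8]]
toy_padded = pad_post(toy_sequences, maxlen=5)
print(toy_padded)  # [[2, 3, 4, 5, 0], [2, 3, 4, 6, 7]]
```

The first sentence is one word short, so it gains a trailing 0. The second is one word too long, so its last index is dropped.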

## One-Hot Encoding

In [None]:
one_hot = tokenizer.texts_to_matrix(sentences, mode='binary')
print(one_hot)

In [None]:
# One-hot encoding discards the order of the words in the text

tokenizer.texts_to_matrix(["love my dog i"], mode='binary')
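
Why the word order is lost can be seen in a small plain-Python sketch. With `mode='binary'`, each text becomes a single vector with one slot per vocabulary index (strictly speaking a multi-hot, bag-of-words vector). The helper `multi_hot` is hypothetical, the input sequences are assumed from the vocabulary above, and the sketch uses integers where `texts_to_matrix` returns floats.

```python
def multi_hot(seq, num_words):
    """Mark which word indices occur in seq; positions are not recorded."""
    row = [0] * num_words
    for idx in seq:
        if 0 < idx < num_words:  # indices outside the range are ignored
            row[idx] = 1
    return row

vec_a = multi_hot([2, 3, 4, 5], num_words=10)  # "i love my dog"
vec_b = multi_hot([3, 4, 5, 2], num_words=10)  # "love my dog i"
print(vec_a)           # [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
print(vec_a == vec_b)  # True
```

Both word orders produce the same vector, because the encoding only records whether a word is present.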

## Example: Classifying Movie Reviews as Positive or Negative

#### Loading the review sequence data

In [None]:
import numpy as np
from tensorflow import keras
imdb = keras.datasets.imdb

(train_sequences, train_labels), (test_sequences, test_labels) = imdb.load_data()

In [None]:
print("Training set size: ", len(train_sequences))
print("Test set size: ", len(test_sequences))
print("First training sample: ", train_sequences[0])
print("Label of the first training sample: ", train_labels[0])

#### Converting sequences back to text

In [None]:
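# The IMDB vocabulary (word -> index)
word_index = imdb.get_word_index()

# Shift every index by 3: with the load_data() defaults (index_from=3),
# indices 0-3 are reserved for special tokens
word_index = {k:(v+3) for k,v in word_index.items()}
# Left commented out, so the special indices decode to empty strings
#word_index["<PAD>"] = 0
#word_index["<START>"] = 1
#word_index["<UNK>"] = 2  # unknown
#word_index["<UNUSED>"] = 3

# Invert the mapping (index -> word) for decoding
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])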

def decode_review(sequences):
    ret = []
    for seq in sequences:
        ret.append(' '.join([reverse_word_index.get(i,'') for i in seq]).strip())
    return ret

train_text = np.array(decode_review(train_sequences))
test_text = np.array(decode_review(test_sequences))

In [None]:
print(train_text[0])
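
The effect of the `+3` shift is easier to see on a tiny made-up vocabulary. The sketch below uses a hypothetical four-word `raw_index` in the same shape as the result of `imdb.get_word_index()` (ranks starting at 1) and, unlike the cell above, registers the special tokens so they show up in the decoded text. All names here are illustrative assumptions.

```python
# Hypothetical toy vocabulary: words ranked from 1 (most frequent) upward.
raw_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}

# Shift by 3 to make room for the special indices 0-3.
shifted = {word: rank + 3 for word, rank in raw_index.items()}
specials = {'<PAD>': 0, '<START>': 1, '<UNK>': 2, '<UNUSED>': 3}

# Invert the combined mapping: index -> word.
reverse_index = {idx: word for word, idx in {**shifted, **specials}.items()}

def decode(seq):
    """Turn a list of indices back into a space-separated string."""
    return ' '.join(reverse_index.get(i, '?') for i in seq)

decoded = decode([1, 4, 5, 6, 7, 2])
print(decoded)  # <START> the movie was great <UNK>
```

Without the shift, index 1 would decode to "the" instead of the start marker, and every word in the review would be off by three positions in the vocabulary.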

#### Task

Build and train a neural network that decides whether a movie review is positive (praise) or negative (criticism).