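Several ways to vectorize text:   
* Split the text into words and convert each word into a vector
* Split the text into characters and convert each character into a vector   
* Extract n-grams of words or characters and convert each n-gram into a vector. This approach suits lightweight, shallow text-processing models (logistic regression and random forests)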
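
To make the n-gram option concrete, here is a minimal sketch in plain Python. The helper name `word_ngrams` and the lowercase, whitespace-based tokenization are assumptions made for illustration; they are not part of the original notebook.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) found in text."""
    words = text.lower().split()
    # Slide a window of width n over the word list
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = 'The cat sat on the mat'
bigrams = word_ngrams(sentence, 2)
print(bigrams)

# A "bag of 2-grams" keeps the unigrams and bigrams as an unordered set
bag_of_2grams = set(word_ngrams(sentence, 1)) | set(bigrams)
print(len(bag_of_2grams))
```

Because the result is a set, word order beyond the n-gram window is discarded, which is why this kind of feature is usually paired with shallow models rather than sequence models.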

# One-hot encoding of words and characters

## Word-level one-hot encoding

In [1]:
import numpy as np


In [2]:
samples = ['The cat sat on the mat.', 'The dog ate my homework' ]

In [3]:
# Build an index of every word in the samples
token_index = {}

for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1             # index 0 is not assigned to any word

In [4]:
token_index

{'The': 1,
 'ate': 8,
 'cat': 2,
 'dog': 7,
 'homework': 10,
 'mat.': 6,
 'my': 9,
 'on': 4,
 'sat': 3,
 'the': 5}

In [10]:
token_index['The']

1

In [11]:
token_index.get('The')

1

In [7]:
# Tokenize the samples, keeping only the first max_length words of each sample
max_length = 10

results = np.zeros(shape=(len(samples), max_length, len(token_index)+1))

In [8]:
results.shape

(2, 10, 11)

In [12]:
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

In [13]:
results[0]

array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
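
The same idea works at the character level. The sketch below assumes the vocabulary is the set of printable ASCII characters from `string.printable` and an arbitrary `max_length` of 50; both are illustrative choices rather than something fixed by the notebook.

```python
import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework']

# All printable ASCII characters; index 0 is left unassigned
characters = string.printable
token_index = {char: i + 1 for i, char in enumerate(characters)}

max_length = 50  # only the first max_length characters of each sample are encoded
results = np.zeros((len(samples), max_length, len(token_index) + 1))

for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        results[i, j, token_index[character]] = 1.

print(results.shape)
```

Each row of `results[i]` holds a single 1 at the position of the corresponding character, and rows past the end of the sample stay all zeros.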

## Word-level one-hot encoding with Keras

In [14]:
from keras.preprocessing.text import Tokenizer


In [15]:
samples = ['The cat sat on the mat.', 'The dog ate my homework' ]

In [16]:
# Build the word index
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)

In [17]:
type(tokenizer)

keras_preprocessing.text.Tokenizer

In [18]:
# Convert the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
type(sequences)

list

In [19]:
sequences

[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

In [20]:
# The one-hot binary representation can also be obtained directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
type(one_hot_results)

numpy.ndarray

In [22]:
one_hot_results.shape

(2, 1000)

In [23]:
word_index = tokenizer.word_index
type(word_index)

dict

In [24]:
word_index

{'ate': 7,
 'cat': 2,
 'dog': 6,
 'homework': 9,
 'mat': 5,
 'my': 8,
 'on': 4,
 'sat': 3,
 'the': 1}
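
This word index differs from the hand-built one: `The` and `the` are merged, and `mat.` loses its period. The pure-Python sketch below approximates that behaviour, assuming the `Tokenizer` defaults (lowercasing, a punctuation filter, whitespace splitting, ranking by frequency). The names `FILTERS`, `simple_tokenize`, `build_word_index`, and `texts_to_binary_matrix` are hypothetical, and this is not the Keras implementation.

```python
from collections import Counter

# Characters replaced by spaces before splitting; assumed to mirror the Keras default filter set
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by descending frequency; index 0 stays unassigned."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    # most_common() orders ties by first occurrence
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_binary_matrix(texts, word_index, num_words):
    """One row per text; column i is 1 if the word with index i occurs in it."""
    matrix = [[0] * num_words for _ in texts]
    for row, text in zip(matrix, texts):
        for word in simple_tokenize(text):
            index = word_index.get(word)
            if index is not None and index < num_words:
                row[index] = 1
    return matrix


samples = ['The cat sat on the mat.', 'The dog ate my homework']
word_index = build_word_index(samples)
sequences = [[word_index[w] for w in simple_tokenize(s)] for s in samples]
matrix = texts_to_binary_matrix(samples, word_index, num_words=1000)
print(word_index)
print(sequences)
```

On these two samples the sketch yields the same word index and integer sequences as the notebook outputs, and a binary matrix with 1000 columns per sample.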

# Using word embeddings

![](./images/编码方式.png) 

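Two ways to obtain word embeddings:    
* Learn the word embeddings jointly with the main task. The word vectors start out random and are then learned in the same way as the weights of a neural network.   
* Precompute the word embeddings on a machine learning task different from the one being solved, then load them into the model. Such embeddings are called **pretrained word embeddings**

Learning an embedding space with the Embedding layer   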
![Embedding](./images/Embedding.png)
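
Conceptually, an embedding layer is a lookup table from integer indices to dense vectors. The NumPy sketch below shows only that forward lookup, under assumed sizes (10000 tokens, 8 dimensions, sequences of length 20) and a random table; in a real model the rows of the table would be updated by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 10000, 8
# Randomly initialized table: one row (vector) per token index
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each holding 20 integer token indices
batch = rng.integers(low=0, high=vocab_size, size=(2, 20))

# The lookup is plain integer indexing: (2, 20) -> (2, 20, 8)
embedded = embedding_matrix[batch]
print(embedded.shape)
```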

In [26]:
from keras.datasets import imdb
from keras import preprocessing

In [30]:
max_features = 10000  # number of words to consider as features
maxien = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)



In [31]:
x_train.shape

(25000,)

In [32]:
type(x_train)

numpy.ndarray

In [33]:
# Convert the lists of integers into a 2D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxien)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxien)

In [34]:
x_train.shape

(25000, 20)

In [35]:
type(x_train)

numpy.ndarray
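
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` left-pads short sequences with zeros and keeps the last `maxlen` entries of long ones. The helper below, `pad_or_truncate`, is a hypothetical pure-Python sketch of that behaviour for a single sequence, assuming `maxlen` is positive.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last maxlen items, left-padding shorter sequences with value."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


print(pad_or_truncate([5, 6, 7], maxlen=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

Note that truncation drops the beginning of a long sequence, so with these defaults a model sees the end of each review rather than its start.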

In [36]:
from keras.models import Sequential
from keras.layers import Flatten,Dense,Embedding

In [41]:
model = Sequential()
model.add(Embedding(10000, 8, input_length=maxien))
model.add(Flatten())

model.add(Dense(1, activation='sigmoid'))


In [42]:
model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
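
The parameter counts in this summary can be checked by hand. The short calculation below reproduces them from the layer sizes used above.

```python
vocab_size, embedding_dim, sequence_length = 10000, 8, 20

embedding_params = vocab_size * embedding_dim     # one 8-dimensional vector per word index
flattened_size = sequence_length * embedding_dim  # Flatten turns (20, 8) into 160 values
dense_params = flattened_size * 1 + 1             # 160 weights plus 1 bias

print(embedding_params, flattened_size, dense_params, embedding_params + dense_params)
```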


In [43]:
history = model.fit(x_train, y_train,
                   epochs=10,
                   batch_size=32,
                   validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Putting it together: from raw text to word embeddings

In [44]:
import os

In [45]:
imdb_dir = r'D:\python\deep learning\data\aclImdb'
train_dir = os.path.join(imdb_dir, 'train')


In [46]:
train_dir

'D:\\python\\deep learning\\data\\aclImdb\\train'

In [48]:
labels = []
texts = []

# Read every review; the 'neg' folder maps to label 0 and 'pos' to label 1
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # Opened in binary mode, so each entry in texts is a bytes object
            with open(os.path.join(dir_name, fname), 'rb') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)

In [49]:
len(labels)

25000

In [50]:
len(texts)

25000

In [51]:
texts[0]

b"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [52]:
labels[0]

0
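
Two properties of the loaded data matter for the next steps: the reviews are `bytes` objects, and the labels are ordered (every negative review comes before every positive one). A common follow-up, sketched here under the assumption that the files are UTF-8 encoded, is to decode the texts and shuffle texts and labels with one shared permutation before splitting off validation data. The helper `decode_and_shuffle` and the toy data are hypothetical.

```python
import random


def decode_and_shuffle(texts, labels, seed=0):
    """Decode bytes to str and shuffle texts and labels with one shared permutation."""
    decoded = [t.decode('utf-8') if isinstance(t, bytes) else t for t in texts]
    order = list(range(len(decoded)))
    random.Random(seed).shuffle(order)
    return [decoded[i] for i in order], [labels[i] for i in order]


# Tiny stand-in data; the real lists hold 25000 reviews
toy_texts = [b'bad film', b'dull plot', b'great cast', b'loved it']
toy_labels = [0, 0, 1, 1]
shuffled_texts, shuffled_labels = decode_and_shuffle(toy_texts, toy_labels)
print(list(zip(shuffled_texts, shuffled_labels)))
```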

# RNN

![RNN](./images/RNN.png)   
   RNN: iterates over all the elements of a sequence while maintaining a state

In [None]:
## Pseudocode for an RNN
state_t = 0
for input_t in input_sequence:
    #output_t = f(input_t, state_t)
    output_t = activation( dot(W, input_t) + dot(U, state_t) + b )
    state_t = output_t
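
The pseudocode can be turned into a runnable NumPy forward pass. This is a sketch under assumptions: `tanh` is used as the activation, the input and the weights `W`, `U`, `b` are random stand-ins, and the sizes (100 timesteps, 32 input features, 64 output features) are arbitrary.

```python
import numpy as np

timesteps = 100        # number of timesteps in the input sequence
input_features = 32    # dimensionality of the input feature space
output_features = 64   # dimensionality of the output feature space

rng = np.random.default_rng(0)
inputs = rng.random((timesteps, input_features))  # random stand-in for real data
state_t = np.zeros((output_features,))            # initial state: all zeros

W = rng.random((output_features, input_features))
U = rng.random((output_features, output_features))
b = rng.random((output_features,))

successive_outputs = []
for input_t in inputs:
    # Combine the current input with the previous state
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t  # the output becomes the state for the next timestep

final_output_sequence = np.stack(successive_outputs, axis=0)
print(final_output_sequence.shape)
```

Stacking every `output_t` gives the full output sequence; keeping only the last one gives the final state.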

In [3]:
# Build a network with stacked RNN layers
from keras.models import Sequential
from keras.layers import Embedding,SimpleRNN

In [4]:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 32)                2080      
Total params: 328,320
Trainable params: 328,320
Non-trainable params: 0
_________________________________________________________________
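
In this summary, the layers built with `return_sequences=True` output the full sequence, shape `(None, None, 32)`, so the next recurrent layer has a sequence to consume; the last layer returns only its final output, shape `(None, 32)`. The parameter counts can be reproduced with a short calculation; the helper name `simple_rnn_params` is hypothetical.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias of one SimpleRNN layer."""
    return input_dim * units + units * units + units


embedding_params = 10000 * 32
rnn_params = simple_rnn_params(32, 32)
print(rnn_params, embedding_params + 4 * rnn_params)
```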


## Applying such a model to IMDB movie review classification


### Preprocessing the data

In [5]:
from keras.datasets import imdb
from keras.preprocessing import sequence


max_features = 10000           # number of words to consider as features
maxlen = 500                   # keep at most 500 words of each review
batch_size = 32

print('loading data...')
(input_train, y_train),(input_test, y_test) = imdb.load_data(num_words = max_features)

print(len(input_train), 'train sequence')
print(len(input_test), 'test sequence')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)

print('input_train shape', input_train.shape)
print('input_test shape', input_test.shape)


loading data...
25000 train sequence
25000 test sequence
Pad sequences (samples x time)
input_train shape (25000, 500)
input_test shape (25000, 500)


### Training a simple recurrent network with an Embedding layer and a SimpleRNN layer

In [6]:
from keras.layers import Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_5 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 322,113
Trainable params: 322,113
Non-trainable params: 0
_________________________________________________________________


In [7]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# LSTM

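The main problem with SimpleRNN: at time t it should in theory be able to retain information about inputs seen many timesteps earlier, but in practice it cannot learn such long-term dependencies. The cause is the **vanishing gradient problem**.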
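
A scalar toy example gives a feel for the effect. The sketch below is not SimpleRNN itself: it assumes the one-dimensional recurrence `h_t = tanh(w * h_{t-1} + x_t)` with an arbitrary weight `w = 0.5` and a constant input, and applies the chain rule to measure how strongly the final state depends on the initial one. The helper name `gradient_through_time` is hypothetical.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
    return grad


for steps in (1, 10, 50, 100):
    print(steps, gradient_through_time(w=0.5, inputs=[0.1] * steps))
```

Every step multiplies the gradient by a factor smaller than 0.5 in this setup, so the influence of early timesteps shrinks exponentially with sequence length.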

LSTM is a variant of RNN: it saves information for later use, which prevents older signals from gradually vanishing during processing.   
![SimpleRNN](./images/simplernn.png)   
![LSTM](./images/LSTM.png)   
![Anatomy of an LSTM](./images/LSTM解剖.png)   

LSTM pseudocode:   
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)

i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)   
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)   
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)   
   
c_t+1 = i_t \* k_t + c_t \* f_t
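
The pseudocode can be written as a single NumPy timestep. This is a sketch under assumptions: `sigmoid` is used for `i_t` and `f_t`, `tanh` for `k_t` and the output, the weights are random, and the sizes (4 input features, 3 units) are arbitrary. The function name `lstm_step` and the parameter dictionary are hypothetical, and the code follows the pseudocode above rather than the internals of any library.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(input_t, state_t, c_t, p):
    """One timestep following the pseudocode; p maps weight names to arrays."""
    output_t = np.tanh(state_t @ p['Uo'] + input_t @ p['Wo'] + c_t @ p['Vo'] + p['bo'])
    i_t = sigmoid(state_t @ p['Ui'] + input_t @ p['Wi'] + p['bi'])  # scales the candidate k_t
    f_t = sigmoid(state_t @ p['Uf'] + input_t @ p['Wf'] + p['bf'])  # scales the previous carry c_t
    k_t = np.tanh(state_t @ p['Uk'] + input_t @ p['Wk'] + p['bk'])  # candidate values
    c_next = i_t * k_t + c_t * f_t
    return output_t, c_next


input_dim, units = 4, 3
rng = np.random.default_rng(0)
params = {}
for gate in 'oifk':
    params['W' + gate] = rng.normal(size=(input_dim, units))
    params['U' + gate] = rng.normal(size=(units, units))
    params['b' + gate] = np.zeros(units)
params['Vo'] = rng.normal(size=(units, units))

state_t = np.zeros(units)
c_t = np.zeros(units)
for input_t in rng.normal(size=(5, input_dim)):
    # The output becomes the next state; the carry c_t travels alongside it
    state_t, c_t = lstm_step(input_t, state_t, c_t, params)
print(state_t.shape, c_t.shape)
```

The carry `c_t` is updated additively (`i_t * k_t + c_t * f_t`), which is what lets information from early timesteps persist instead of being repeatedly squashed.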