Code for splitting a sentence into tokens

In [1]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [2]:
import numpy as np
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))  # the vocabulary lists every unique token we want to keep track of
','.join(vocab)

'26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the'

In [4]:
num_tokens = len(token_sequence)
print(num_tokens)
vocab_size = len(vocab)
print(vocab_size)

10
10


In [6]:
onehot_vector = np.zeros((num_tokens, vocab_size), int)
onehot_vector

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [7]:
for i, word in enumerate(token_sequence):
    onehot_vector[i, vocab.index(word)] = 1
print(onehot_vector)

[[0 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 0 0 0]]
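
The loop above calls `vocab.index(word)` for every token, which scans the vocabulary list each time and gets slow as the vocabulary grows. The following is a minimal sketch of the same encoding that looks up each column in a dictionary instead. The helper name `one_hot_encode` is hypothetical and is not part of the notebook or of NumPy.

```python
import numpy as np


def one_hot_encode(tokens):
    """Hypothetical helper: return (vocab, one-hot matrix) for a list of tokens."""
    vocab = sorted(set(tokens))
    # Build the token -> column mapping once instead of searching the list per token
    column = {tok: j for j, tok in enumerate(vocab)}
    onehot = np.zeros((len(tokens), len(vocab)), dtype=int)
    # Row i gets a single 1 in the column of the i-th token
    onehot[np.arange(len(tokens)), [column[tok] for tok in tokens]] = 1
    return vocab, onehot


sentence = "Thomas Jefferson began building Monticello at the age of 26."
vocab, onehot = one_hot_encode(sentence.split())
print(onehot)
```

For a ten-word sentence the difference is negligible; the dictionary lookup only starts to matter for vocabularies with many thousands of entries.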


In [8]:
' '.join(vocab)

'26. Jefferson Monticello Thomas age at began building of the'

One-hot vector sequence for the Monticello sentence

In [9]:
import pandas as pd
pd.DataFrame(onehot_vector, columns=vocab)

   26.  Jefferson  Monticello  Thomas  age  at  began  building  of  the
0    0          0           0       1    0   0      0         0   0    0
1    0          1           0       0    0   0      0         0   0    0
2    0          0           0       0    0   0      1         0   0    0
3    0          0           0       0    0   0      0         1   0    0
4    0          0           1       0    0   0      0         0   0    0
5    0          0           0       0    0   1      0         0   0    0
6    0          0           0       0    0   0      0         0   0    1
7    0          0           0       0    1   0      0         0   0    0
8    0          0           0       0    0   0      0         0   1    0
9    1          0           0       0    0   0      0         0   0    0


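One-hot vectors are very sparse: each row vector contains exactly one nonzero value. Replacing all the zeros with blanks makes the table easier to read.

A prettier display of the one-hot vectors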

In [10]:
df = pd.DataFrame(onehot_vector, columns=vocab)
df[df == 0] = ''
df

   26.  Jefferson  Monticello  Thomas  age  at  began  building  of  the
0                                   1
1               1
2                                                   1
3                                                             1
4                           1
5                                            1
6                                                                      1
7                                        1
8                                                                 1
9    1


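Each row of the table above is a binary row vector, that is, a one-hot vector. No information is lost: the sequence of vectors preserves the word order, and with it the grammar of the sentence.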
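
To make the "nothing is lost" claim concrete, the sketch below rebuilds the one-hot matrix and then decodes it back into the original sentence. It assumes tokens were separated by single spaces, so joining on a space restores the text exactly; the names `decoded` and `reconstructed` are introduced here only for illustration.

```python
import numpy as np

sentence = "Thomas Jefferson began building Monticello at the age of 26."
tokens = sentence.split()
vocab = sorted(set(tokens))

# Rebuild the one-hot matrix the same way as in the cells above
onehot = np.zeros((len(tokens), len(vocab)), dtype=int)
for i, tok in enumerate(tokens):
    onehot[i, vocab.index(tok)] = 1

# argmax along each row returns the column that holds the single 1
decoded = [vocab[j] for j in onehot.argmax(axis=1)]
reconstructed = " ".join(decoded)
print(reconstructed == sentence)
```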

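A single vector that represents the whole sentence can be obtained by counting its tokens. Every token in this sentence occurs only once, so the code below simply records each one with a 1.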

In [11]:
sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())

[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]

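Python's sorted() places digits before letters and capitalized words before lowercase ones, because that is the order of the characters in ASCII and Unicode.  
In the ASCII table, uppercase letters come before lowercase letters.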
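
The ordering can be checked directly with `ord()`, which returns a character's Unicode code point. The short sketch below uses a few tokens from the sentence; the `key=str.lower` variant is shown only as one possible way to get a case-insensitive order if that is preferred.

```python
words = ["the", "Thomas", "26.", "age", "Jefferson"]

# Default sort compares code points: digits < uppercase letters < lowercase letters
default_order = sorted(words)
print(default_order)

# Code points of one digit, one uppercase letter and one lowercase letter
code_points = [ord(ch) for ch in "2Tt"]
print(code_points)

# Case-insensitive alternative: compare lowercased copies of the words
folded_order = sorted(words, key=str.lower)
print(folded_order)
```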

## Building a DataFrame of bag-of-words vectors
A more efficient form of dictionary is the Pandas Series, which can in turn be wrapped in a Pandas DataFrame.

In [13]:
import pandas as pd
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()] )), columns=['sent']).T
df

      Thomas  Jefferson  began  building  Monticello  at  the  age  of  26.
sent       1          1      1         1           1   1    1    1   1    1


Add a few more sentences to the corpus

In [14]:
sentence = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentence += """Construction was done mostly by local masons and carpenters.\n"""
sentence += "He moved into the South Pavilion in 1770.\n"
sentence += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

corpus = {}
for i, sent in enumerate(sentence.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df[df.columns[:10]]

       Thomas  Jefferson  began  building  Monticello  at  the  age  of  26.
sent0       1          1      1         1           1   1    1    1   1    1
sent1       0          0      0         0           0   0    0    0   0    0
sent2       0          0      0         0           0   0    1    0   0    0
sent3       0          0      0         0           1   0    0    0   0    0
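
The tables above only record whether a token is present (1) or absent (0). If actual occurrence counts are wanted, `collections.Counter` from the standard library is one option. The sketch below counts tokens across the whole four-sentence text rather than per sentence, because no token repeats within a single sentence here. The names `counts`, `bow` and `df_counts` are illustrative.

```python
from collections import Counter

import pandas as pd

text = (
    "Thomas Jefferson began building Monticello at the age of 26.\n"
    "Construction was done mostly by local masons and carpenters.\n"
    "He moved into the South Pavilion in 1770.\n"
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."
)

# Counter maps each token to the number of times it occurs
counts = Counter(text.split())

# Wrap the counts in a Series, then in a one-row DataFrame as in the cells above
bow = pd.Series(dict(counts), name="sent")
df_counts = bow.to_frame().T

# Tokens that occur more than once in the text
print({tok: n for tok, n in counts.items() if n > 1})
```

Note that `Jefferson` and `Jefferson's` are counted as different tokens, because `str.split()` only splits on whitespace.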


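## Dot product
The dot product is also called the inner product, because the "inner" dimensions of the two operands must match before they can be multiplied: two vectors need the same number of elements, and for two matrices the number of columns of the first must equal the number of rows of the second.  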

In [15]:
v1 = np.array([1, 2, 3])  # pd.np is deprecated; use NumPy (imported above as np) directly
v2 = np.array([2, 3, 4])
v1.dot(v2)

20

In [16]:
(v1 * v2).sum()  # element-wise multiplication of NumPy arrays is an efficient vectorized operation

20

In [17]:
sum([x1 * x2 for x1, x2 in zip(v1, v2)])

20
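
The requirement that the inner dimensions match is easiest to see with matrices. The sketch below first confirms that three ways of computing the dot product agree for the vectors used above, then multiplies two small matrices whose shapes are chosen purely for illustration.

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([2, 3, 4])

# Three equivalent ways to compute the dot product of two 1-D arrays
results = [int(np.dot(v1, v2)), int(v1 @ v2), int((v1 * v2).sum())]
print(results)

# Inner dimensions match: (2, 3) @ (3, 2) gives a (2, 2) result
a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)
product_shape = (a @ b).shape
print(product_shape)

# Inner dimensions differ: (2, 3) @ (2, 3) raises a ValueError
try:
    a @ a
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
print(mismatch_raises)
```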