Code for splitting a sentence into tokens

In [1]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [2]:
import numpy as np
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))  # the vocabulary lists every unique token we want to track
','.join(vocab)

'26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the'

In [4]:
num_tokens = len(token_sequence)
print(num_tokens)
vocab_size = len(vocab)
print(vocab_size)

10
10


In [6]:
onehot_vector = np.zeros((num_tokens, vocab_size), int)
onehot_vector

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [7]:
for i, word in enumerate(token_sequence):
    onehot_vector[i, vocab.index(word)] = 1
print(onehot_vector)

[[0 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 0 0 0]]


In [8]:
' '.join(vocab)

'26. Jefferson Monticello Thomas age at began building of the'

One-hot vector sequence for the Monticello sentence

In [9]:
import pandas as pd
pd.DataFrame(onehot_vector, columns=vocab)

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,0,0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0
6,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,0
9,1,0,0,0,0,0,0,0,0,0


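One-hot vectors are very sparse: each row vector contains exactly one nonzero value. Replacing all the zeros with blanks makes the table easier to read.

A prettier display of the one-hot vectors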

In [10]:
df = pd.DataFrame(onehot_vector, columns=vocab)
df[df == 0] = ''
df

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,,,,1.0,,,,,,
1,,1.0,,,,,,,,
2,,,,,,,1.0,,,
3,,,,,,,,1.0,,
4,,,1.0,,,,,,,
5,,,,,,1.0,,,,
6,,,,,,,,,,1.0
7,,,,,1.0,,,,,
8,,,,,,,,,1.0,
9,1.0,,,,,,,,,


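Each row of the table above is a binary row vector, that is, a one-hot vector. No information is lost: the grammar and the word order of the sentence are both preserved.

Counting tokens instead gives a single vector that represents the whole sentence.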
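To see that nothing is lost, the sentence can be rebuilt from the one-hot matrix alone. The following is a minimal sketch under the same setup as the cells above; the names `onehot_vectors`, `recovered_tokens`, and `recovered_sentence` are introduced here for illustration only.

```python
import numpy as np

sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# One row per token, one column per vocabulary entry.
onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# Each row holds a single 1; its column index points back into the vocabulary,
# and the row order is the original word order.
recovered_tokens = [vocab[j] for j in onehot_vectors.argmax(axis=1)]
recovered_sentence = " ".join(recovered_tokens)
print(recovered_sentence == sentence)
```

The round trip works here because the sentence uses single spaces between tokens; irregular whitespace would not be restored by `" ".join(...)`.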

In [11]:
sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())

[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]

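Python's sorted() places digits before letters and capitalized words before lowercase ones, because that is the order of the characters in ASCII and Unicode.  
In the ASCII table, uppercase letters come before lowercase letters.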
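The ordering can be checked directly with code points. This is a small illustrative sketch; the token list and the `case_insensitive` variable are made up for the example.

```python
tokens = ["the", "Thomas", "26.", "age"]

# Default sort compares code points: digits < uppercase < lowercase.
default_order = sorted(tokens)
print(default_order)

# Code points of one digit, one uppercase and one lowercase character.
code_points = [ord(c) for c in "2Ta"]
print(code_points)

# Passing a key function gives a case-insensitive ordering instead.
case_insensitive = sorted(tokens, key=str.lower)
print(case_insensitive)
```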

## Building a DataFrame of bag-of-words vectors
A more efficient form of dictionary is the pandas Series, which can in turn be wrapped in a pandas DataFrame.

In [2]:
import pandas as pd
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()] )), columns=['sent']).T
df

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.,...,South,Pavilion,in,1770.,Turning,a,neoclassical,masterpiece,Jefferson's,obsession.
sent,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


Add a few more samples to the corpus

In [3]:
sentence = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentence += """Construction was done mostly by local masons and carpenters.\n"""
sentence += "He moved into the South Pavilion in 1770.\n"
sentence += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

corpus = {}
for i, sent in enumerate(sentence.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df[df.columns[:10]]

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.
sent0,1,1,1,1,1,1,1,1,1,1
sent1,0,0,0,0,0,0,0,0,0,0
sent2,0,0,0,0,0,0,1,0,0,0
sent3,0,0,0,0,1,0,0,0,0,0


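## Dot product
The dot product is also called the inner product, because the inner dimensions of the two vectors or matrices must match before they can be multiplied.  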

In [15]:
import numpy as np
v1 = np.array([1, 2, 3])  # pd.np is deprecated and was removed in pandas 2.0; use numpy directly
v2 = np.array([2, 3, 4])
v1.dot(v2)

20

In [16]:
(v1 * v2).sum()  # element-wise multiplication of NumPy arrays is an efficient vectorized operation

20

In [17]:
sum([x1 * x2 for x1, x2 in zip(v1, v2)])

20
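The requirement that the dimensions match can be seen by trying a mismatched pair. A minimal sketch, where `v_short` and `mismatch_raised` are names introduced only for this example:

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([2, 3, 4])
v_short = np.array([1, 2])  # deliberately one element shorter

# Same length: 1*2 + 2*3 + 3*4
same_length = int(v1.dot(v2))
print(same_length)

# Different lengths: NumPy refuses to compute the dot product.
try:
    v1.dot(v_short)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("cannot multiply:", err)
```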

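## Measuring bag-of-words overlap
If we can measure the overlap between two bag-of-words vectors, we get a good estimate of how similar their word usage is, and from that a reasonable estimate of how much their meanings overlap.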

In [4]:
df = df.T
df

Unnamed: 0,sent0,sent1,sent2,sent3
Thomas,1,0,0,0
Jefferson,1,0,0,0
began,1,0,0,0
building,1,0,0,0
Monticello,1,0,0,1
at,1,0,0,0
the,1,0,1,0
age,1,0,0,0
of,1,0,0,0
26.,1,0,0,0


In [5]:
df.sent0.dot(df.sent1)

0

In [6]:
df.sent0.dot(df.sent2)

1

In [7]:
df.sent0.dot(df.sent3)

1

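The results above show that one word appears in both sent0 and sent2. Likewise, one word appears in both sent0 and sent3.  
This word overlap can serve as a measure of sentence similarity.  
The code below finds the shared word.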

In [8]:
[(k, v) for (k, v) in (df.sent0 & df.sent3).items() if v]

[('Monticello', 1)]
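For binary bag-of-words vectors, the dot product is the size of the intersection of the two token sets. The sketch below checks this with plain Python sets, without pandas; the variable names are illustrative.

```python
sent0 = "Thomas Jefferson began building Monticello at the age of 26.".split()
sent3 = "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.".split()

# Tokens that occur in both sentences.
shared = set(sent0) & set(sent3)
overlap = len(shared)

print(shared)
print(overlap)
```

Note that `Jefferson` and `Jefferson's` count as different tokens here, because whitespace splitting leaves the possessive attached.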

## Handling punctuation
Split the Monticello sentence with a regular expression.

In [9]:
import re
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
tokens = re.split(r'[-\s.,;!?]+', sentence)
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '']

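[ ]: denotes a character class, that is, a set of characters.  
+: requires a match of one or more of the characters inside the square brackets.  
\s: a shortcut for a predefined character class that contains all whitespace characters.  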
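The effect of each piece can be seen by varying the pattern on short inputs. This is a small sketch with made-up example strings:

```python
import re

# \s alone splits on whitespace, so the period stays attached to "26".
whitespace_only = re.split(r"\s", "age of 26.")
print(whitespace_only)

# A character class without + matches one delimiter at a time,
# so ", " produces an empty string between the comma and the space.
single_delimiter = re.split(r"[-\s.,;!?]", "a, b")
print(single_delimiter)

# With +, a run of delimiters is treated as a single separator.
delimiter_run = re.split(r"[-\s.,;!?]+", "a, b")
print(delimiter_run)
```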

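## An improved regular expression for tokenization
Precompiling the regular expression with re.compile() lets you pass it as an argument to a tokenizer function or class, which can speed up the tokenizer.  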

In [11]:
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
tokens[-10:]

[' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']

Remove the whitespace and punctuation

In [12]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
tokens = pattern.split(sentence)
tokens

['Thomas',
 ' ',
 'Jefferson',
 ' ',
 'began',
 ' ',
 'building',
 ' ',
 'Monticello',
 ' ',
 'at',
 ' ',
 'the',
 ' ',
 'age',
 ' ',
 'of',
 ' ',
 '26',
 '.',
 '']

In [13]:
[x for x in tokens if x and x not in '- \t\n.,;!?']

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']
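Passing a precompiled pattern into a tokenizer function might look like the sketch below. The function `tokenize`, its `skip` parameter, and the constant `DELIMITERS` are hypothetical names chosen for this example.

```python
import re

def tokenize(text, pattern, skip="- \t\n.,;!?"):
    """Split text with a precompiled pattern and drop empty or delimiter pieces."""
    return [tok for tok in pattern.split(text) if tok and tok not in skip]

# Compile once, reuse for every call.
DELIMITERS = re.compile(r"([-\s.,;!?])+")

tokens = tokenize(
    "Thomas Jefferson began building Monticello at the age of 26.",
    DELIMITERS,
)
print(tokens)
```

Because the pattern contains a capturing group, `split()` returns the delimiters as well, which is why the filtering step is still needed.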

## Tokenizing with the NLTK library

In [14]:
from nltk.tokenize import RegexpTokenizer

In [15]:
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')  # \$ matches a literal dollar sign; a bare $ would anchor to the end of the string
tokenizer.tokenize(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']
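The three alternatives in the pattern can be tried without NLTK by using `re.findall`, which, as far as I understand, is roughly what `RegexpTokenizer` does in its default (non-gap) mode. The example sentence below is made up for illustration.

```python
import re

# words | dollar amounts | any other run of non-whitespace characters
pattern = re.compile(r"\w+|\$[0-9.]+|\S+")

price_tokens = pattern.findall("It costs $3.50 today.")
print(price_tokens)
```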

## A better tokenizer in NLTK

In [16]:
from nltk.tokenize import TreebankWordTokenizer

In [18]:
sentence = """Monticello wasn't designated as UNESCO World Heritage Site until 1987."""
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)

['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '.']

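NLTK also provides another tokenizer, casual_tokenize, designed for short texts from social networks. It can cope with emoticons, nonstandard language, and the like.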

In [19]:
from nltk.tokenize.casual import casual_tokenize

In [20]:
message = """RT @TJMonticello Best dat everrrrr as Monticello. Awesommmmmmeee day :*)"""

In [21]:
casual_tokenize(message)

['RT',
 '@TJMonticello',
 'Best',
 'dat',
 'everrrrr',
 'as',
 'Monticello',
 '.',
 'Awesommmmmmeee',
 'day',
 ':*)']

In [22]:
casual_tokenize(message, reduce_len=True, strip_handles=True)

['RT',
 'Best',
 'dat',
 'everrr',
 'as',
 'Monticello',
 '.',
 'Awesommmeee',
 'day',
 ':*)']
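The shortening seen in the output above is consistent with collapsing any run of three or more identical characters down to exactly three. The sketch below reproduces that behavior with a plain regular expression; `shorten_repeats` is a hypothetical helper written for this example, not NLTK's own function.

```python
import re

def shorten_repeats(word):
    """Collapse runs of 3+ identical characters to exactly 3 (assumed behavior)."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", word)

shortened = [shorten_repeats(w) for w in ["everrrrr", "Awesommmmmmeee", "day"]]
print(shortened)
```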

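# Extending the vocabulary to n-grams
## What is an n-gram
An n-gram is a sequence of at most n elements, extracted from a longer sequence made up of those elements.
## Why use n-grams
When a sequence of tokens is vectorized into a bag-of-words vector, the word-order information is lost.  
The notion of a single token therefore needs to be extended to n-grams made up of several tokens.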
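A short illustration of the lost word order: two sentences with opposite meanings can share the same bag of words and still differ in their 2-grams. This is a sketch in plain Python; the helper `ngrams_of` and the example sentences are assumptions made for illustration.

```python
def ngrams_of(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

a = "dog bites man".split()
b = "man bites dog".split()

# The bags of words are identical...
same_bag = sorted(a) == sorted(b)

# ...but the 2-grams keep enough order to tell the sentences apart.
bigrams_a = ngrams_of(a, 2)
bigrams_b = ngrams_of(b, 2)
same_bigrams = set(bigrams_a) == set(bigrams_b)

print(same_bag, same_bigrams)
print(bigrams_a)
print(bigrams_b)
```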

In [23]:
# 1-gram tokenizer
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

In [24]:
# n-gram tokenizer from nltk
from nltk.util import ngrams
list(ngrams(tokens, 2))

[('Thomas', 'Jefferson'),
 ('Jefferson', 'began'),
 ('began', 'building'),
 ('building', 'Monticello'),
 ('Monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', '26')]

In [25]:
list(ngrams(tokens, 3))

[('Thomas', 'Jefferson', 'began'),
 ('Jefferson', 'began', 'building'),
 ('began', 'building', 'Monticello'),
 ('building', 'Monticello', 'at'),
 ('Monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', '26')]

In [26]:
two_grams = list(ngrams(tokens, 2))
[" ".join(x) for x in two_grams]

['Thomas Jefferson',
 'Jefferson began',
 'began building',
 'building Monticello',
 'Monticello at',
 'at the',
 'the age',
 'age of',
 'of 26']