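# Classifying Text: Sentence Segmentation
Obtain some data that has already been segmented into sentences, and convert it into a form suitable for feature extraction.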

In [1]:
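import nltk

# Sentences from the raw Treebank sample; each sentence is a list of tokens.
sents = nltk.corpus.treebank_raw.sents()
tokens = []         # all tokens, concatenated in corpus order
boundaries = set()  # indices into `tokens` of each sentence-final token
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)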
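
As a quick check on the index bookkeeping in the loop above, here is a minimal sketch that runs the same logic on a tiny hand-written corpus. The names `toy_sents`, `toy_tokens`, and `toy_boundaries` are made up for illustration and are not part of the notebook.

```python
# Two invented sentences standing in for the Treebank data.
toy_sents = [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)
    offset += len(sent)
    # offset now points one past the sentence, so offset - 1 is its last token.
    toy_boundaries.add(offset - 1)

print(toy_tokens)
print(sorted(toy_boundaries))  # [2, 6]
```

Each recorded index points at the final token of a sentence, here the `.` at position 2 and the `?` at position 6.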

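Here `tokens` is a single list made by concatenating the tokens of the individual sentences, and `boundaries` is a set containing the indices of all sentence-boundary tokens. Next, we need to specify the features of the data that will be used to decide whether a punctuation mark indicates a sentence boundary.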

In [2]:
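def punct_features(tokens, i):
    """Features for the punctuation token at index i.

    Assumes tokens[i-1] and tokens[i+1] both exist, i.e. 0 < i < len(tokens) - 1.
    """
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}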
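
To see what the extractor produces, the sketch below applies a copy of the same feature function to a hand-written token list. The token list `demo_tokens` is invented for illustration; the function body mirrors the one defined above.

```python
def punct_features(tokens, i):
    # Same features as in the notebook cell above.
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# Invented example: the first period follows an abbreviation,
# the second one ends a sentence.
demo_tokens = ['Mr', '.', 'Smith', 'left', '.', 'He', 'returned', '.']

abbrev_features = punct_features(demo_tokens, 1)
final_features = punct_features(demo_tokens, 4)
print(abbrev_features)
print(final_features)
```

Both periods are followed by a capitalized word, so in this example only `prevword` (`'mr'` versus `'left'`) distinguishes the abbreviation from the real sentence end.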

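Using this feature extractor, we can create a list of labeled feature sets by selecting all the punctuation tokens and tagging each one according to whether or not it is a boundary token.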

In [3]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']
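
One detail worth knowing about the filter above: `tokens[i] in '.?!'` is a substring test on a string, not a membership test on a collection of punctuation marks. The sketch below contrasts the two on a few invented candidate tokens. Whether tokens such as `'?!'` or the empty string actually occur in the corpus is not checked here, and the set-based variant is only an alternative to consider, not what the notebook uses.

```python
PUNCT_CHARS = '.?!'           # what the notebook tests against
PUNCT_SET = {'.', '?', '!'}   # hypothetical stricter alternative

# Invented candidates to illustrate the difference.
candidates = ['.', '?', '!', '?!', '', '...', 'Mr']

by_substring = [tok for tok in candidates if tok in PUNCT_CHARS]
by_set = [tok for tok in candidates if tok in PUNCT_SET]

print(by_substring)  # ['.', '?', '!', '?!', '']
print(by_set)        # ['.', '?', '!']
```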

Train and evaluate a punctuation classifier.

In [4]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.936026936026936
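
Accuracy here means the fraction of held-out punctuation tokens whose predicted label matches the gold label. A common sanity check is to compare it against a majority-class baseline. The following is a minimal pure-Python sketch of both ideas; `accuracy` and `majority_baseline` are hypothetical helper names, and the toy data is invented, so the number it prints says nothing about the Treebank result.

```python
from collections import Counter

def accuracy(predict, labeled):
    """Fraction of (features, label) pairs where predict(features) == label."""
    correct = sum(1 for feats, label in labeled if predict(feats) == label)
    return correct / len(labeled)

def majority_baseline(train):
    """Build a predictor that always returns the most common training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda feats: most_common_label

# Invented labeled data: three boundaries and one non-boundary for training.
toy_train = [({'punct': '.'}, True), ({'punct': '.'}, True),
             ({'punct': '?'}, True), ({'punct': '.'}, False)]
toy_test = [({'punct': '.'}, True), ({'punct': '.'}, False)]

baseline = majority_baseline(toy_train)
baseline_accuracy = accuracy(baseline, toy_test)
print(baseline_accuracy)  # 0.5
```

Running the same kind of baseline on the real `train_set` and `test_set` would show how much of the reported accuracy comes from the features rather than from the label distribution.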

Use the classifier to segment text into sentences.

In [5]:
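def segment_sentences(words):
    """Split a flat list of tokens into sentences using the trained classifier."""
    start = 0
    sents = []
    for i, word in enumerate(words):
        # punct_features reads words[i-1] and words[i+1], so only classify
        # punctuation that has a neighbor on both sides.
        if (word in '.?!' and 0 < i < len(words) - 1
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i+1
    # Whatever follows the last detected boundary forms the final sentence.
    if start < len(words):
        sents.append(words[start:])
    return sents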
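
The sketch below shows how a segmenter of this shape behaves on a short token list. To keep it self-contained it replaces the trained Naive Bayes model with a hypothetical stand-in, `RuleClassifier`, whose hand-written rule is an assumption made purely for illustration. It also skips punctuation at the very start or end of the input, since the feature extractor needs a token on each side.

```python
def punct_features(tokens, i):
    # Same features as the notebook's extractor.
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

class RuleClassifier:
    """Hypothetical stand-in for the trained classifier (not an NLTK class)."""
    def classify(self, features):
        # Toy rule: a boundary needs a capitalized next word and must not
        # follow a known title abbreviation.
        return (features['next-word-capitalized']
                and features['prevword'] not in {'mr', 'mrs', 'dr'})

classifier = RuleClassifier()

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Only classify punctuation that has a neighbor on both sides.
        if (word in '.?!' and 0 < i < len(words) - 1
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    return sents

words = ['Mr', '.', 'Smith', 'left', '.', 'He', 'came', 'back', '.']
segmented = segment_sentences(words)
for sent in segmented:
    print(' '.join(sent))
```

With this toy rule the period after `Mr` is not treated as a boundary, so the input is split into two sentences, and the trailing period is picked up by the final "remainder" step rather than by the classifier.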