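# Classifying Movie Reviews as Positive or Negative
The following example uses the movie review corpus and labels each review as positive or negative.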

In [2]:
from nltk.corpus import movie_reviews
import random

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

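Build a list of the 2,000 most frequent words in the whole corpus. Then define a feature extractor that checks whether each of these words occurs in a given document.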

In [10]:
import nltk

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
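# most_common() returns (word, count) pairs sorted by frequency;
# dict views such as keys() cannot be sliced in Python 3.
word_features = [w for (w, _) in all_words.most_common(2000)]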

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

```py
>>> print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False, 'contains(lot)': False, ...}
```
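
To see what the extractor does without loading the full corpus, here is a minimal sketch on a hypothetical toy corpus. It uses `collections.Counter` as a stand-in for `nltk.FreqDist`, and passes the vocabulary in as an argument instead of reading a global. The names `toy_docs` and `top_words` are illustrative and not part of the original example.

```python
from collections import Counter

# Hypothetical toy corpus standing in for movie_reviews (illustration only).
toy_docs = [
    ["great", "film", "great"],
    ["dull", "film", "great"],
    ["boring", "plot"],
]

# Count every token and keep the 2 most frequent words as the vocabulary.
counts = Counter(w.lower() for doc in toy_docs for w in doc)
top_words = [w for w, _ in counts.most_common(2)]

def document_features(document, word_features):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)  # set membership is faster than list membership
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

features = document_features(["a", "dull", "film"], top_words)
print(features)
```

Each document becomes a dictionary with one boolean entry per vocabulary word, so every feature set has the same keys regardless of the document's length.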

Train and test a classifier for document classification. Afterwards, show_most_informative_features() reveals which features the classifier found most informative.

In [12]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.64
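
The accuracy figure is the fraction of held-out reviews whose predicted label matches the true label. A minimal sketch of that computation follows, assuming a hypothetical `toy_classify` function in place of the trained classifier and a made-up four-item test set.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs that classify() labels correctly."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical rule-based stand-in for a trained classifier (illustration only).
def toy_classify(features):
    return 'neg' if features.get('contains(dull)') else 'pos'

# Made-up test set: the third item is misclassified as 'pos'.
toy_test_set = [
    ({'contains(dull)': True}, 'neg'),
    ({'contains(dull)': False}, 'pos'),
    ({'contains(dull)': False}, 'neg'),
    ({'contains(dull)': True}, 'neg'),
]

toy_accuracy = accuracy(toy_classify, toy_test_set)
print(toy_accuracy)  # 3 of 4 correct
```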

In [13]:
classifier.show_most_informative_features(5)

Most Informative Features
           contains(ugh) = True              neg : pos    =      9.7 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0
          contains(sans) = True              neg : pos    =      7.7 : 1.0
     contains(dismissed) = True              pos : neg    =      7.0 : 1.0
         contains(wires) = True              neg : pos    =      6.4 : 1.0


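The word "ugh" is about 9.7 times more common in negative reviews than in positive ones, and "wires" about 6.4 times more common in negative reviews. Of the five features shown, only "dismissed" points toward positive reviews, where it is about 7 times more common.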
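
The ratios in the table compare how often a feature is true under one label versus the other. The sketch below shows one way such a ratio could be computed from document counts. It assumes add-0.5 smoothing of the per-label probabilities, which is an assumption made here for illustration and is not stated in the original. The counts are hypothetical as well.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, gamma=0.5):
    """Ratio P(feature=True | a) / P(feature=True | b) with add-gamma smoothing.

    The feature is binary, so gamma is added to each of its 2 possible values.
    Smoothing keeps the ratio finite when a word never occurs under one label.
    """
    p_a = (count_a + gamma) / (total_a + 2 * gamma)
    p_b = (count_b + gamma) / (total_b + 2 * gamma)
    return p_a / p_b

# Hypothetical counts: the word appears in 9 of 99 negative reviews
# and in 0 of 99 positive reviews.
ratio = likelihood_ratio(9, 99, 0, 99)
print(round(ratio, 1))
```

Rare words can produce large ratios from very few occurrences, so a high ratio does not by itself mean the feature is useful on most documents.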