# Stop words

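Another way to get rid of uninformative words is to discard words that occur too frequently to be informative. There are two main approaches: using a language-specific list of stop words, or discarding words that appear too frequently in the data. scikit-learn ships a built-in list of English stop words in the `feature_extraction.text` module: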

In [1]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['yours', 'afterwards', 'a', 'almost', 'around', 'when', 'via', 'nine', 'mostly', 'off', 'him', 'de', 'take', 'get', 'us', 'couldnt', 'full', 'thereupon', 'who', 'me', 'that', 'eight', 'become', 'even', 'several', 'being', 'no', 'else', 'anyhow', 'across', 'my', 'serious']
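
To see what the list does in practice, here is a minimal sketch on a two-sentence toy corpus. The sentences and the variable names (`toy_docs`, `plain_vect`, `filtered_vect`) are made up for illustration and are not part of the IMDb data used in this section.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "the movie was a delight and the cast was great",
    "the plot was dull and the acting was worse",
]

# One vectorizer keeps every token, the other drops the built-in English stop words
plain_vect = CountVectorizer().fit(toy_docs)
filtered_vect = CountVectorizer(stop_words="english").fit(toy_docs)

plain_vocab = set(plain_vect.vocabulary_)
filtered_vocab = set(filtered_vect.vocabulary_)

# Words that were in the full vocabulary but are gone after filtering
removed = sorted(plain_vocab - filtered_vocab)

print("Vocabulary size without filtering: {}".format(len(plain_vocab)))
print("Vocabulary size with stop_words='english': {}".format(len(filtered_vocab)))
print("Removed words: {}".format(removed))
```

Note that the default tokenizer of `CountVectorizer` already ignores single-character tokens such as "a", so those never reach the vocabulary regardless of the stop word list.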


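Clearly, removing the stop words in this list can reduce the number of features by at most 318 (the length of the list), but it might still improve performance. Let's give it a try: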

In [2]:
from sklearn.datasets import load_files
import numpy as np
reviews_train = load_files("E:/clone/machine-learning/data/aclImdb/train/")

# load_files returns a Bunch object containing the training texts and training labels
data_array = np.array(reviews_train.data)
target_array = np.array(reviews_train.target)

# Keep only the documents whose target is not 2
# (this drops the third class, presumably the unlabeled "unsup" folder)
labeled_indices = np.where(target_array != 2)[0]
text_train, y_train = data_array[labeled_indices], target_array[labeled_indices]
text_train = [doc.replace(b'<br />',b' ') for doc in text_train]

reviews_test = load_files("E:/clone/machine-learning/data/aclImdb/test/")
data_array = np.array(reviews_test.data)
target_array = np.array(reviews_test.target)
text_test, y_test = data_array, target_array
text_test = [doc.replace(b'<br />',b' ') for doc in text_test]

In [3]:
# Specifying stop_words="english" uses the built-in list.
# We could also extend this list and pass in our own.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>
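
The comment in the cell above mentions extending the built-in list. A minimal sketch of how that might look is shown here; the extra words (`"movie"`, `"film"`) and the toy sentences are hypothetical choices for illustration, not something evaluated in this section.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific additions: words that appear in almost every film review
extra_stop_words = {"movie", "film"}

# ENGLISH_STOP_WORDS is a frozenset; build a plain list to pass to stop_words
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "This movie was a wonderful film",
    "A dull film with a dull plot",
]

custom_vect = CountVectorizer(stop_words=custom_stop_words).fit(toy_docs)
custom_vocab = set(custom_vect.vocabulary_)
print("Vocabulary with custom stop words: {}".format(sorted(custom_vocab)))
```

On the real data, the same `custom_stop_words` list could be passed together with `min_df=5` in place of `stop_words="english"`.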


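With the built-in list, `X_train` now has 305 fewer features (27,271 − 26,966), which means that most, but not all, of the stop words appeared in the vocabulary. Let's run the grid search again: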

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=800), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.3f}".format(grid.best_score_))

Best cross-validation score: 0.884


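The grid search performance decreased slightly with the stop word list: not enough to worry about, but given that removing 305 features out of more than 27,000 is unlikely to change performance or interpretability much, using this list does not seem worthwhile. Fixed lists are mostly helpful for small datasets, which might not contain enough information for the model to work out from the data itself which words are stop words.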
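
The second approach mentioned at the start of this section, discarding words that appear too frequently, can be tried with the `max_df` option of `CountVectorizer`. The sketch below uses a hypothetical four-sentence corpus and an arbitrary threshold of 0.7; neither was tuned or evaluated on the IMDb data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "the plot of the movie was thin",
    "the cast of the movie was superb",
    "the soundtrack of the movie was loud",
    "the ending of the sequel felt rushed",
]

# Baseline vocabulary without any frequency cut-off
all_terms = set(CountVectorizer().fit(toy_docs).vocabulary_)

# max_df=0.7 ignores terms that occur in more than 70% of the documents
kept_terms = set(CountVectorizer(max_df=0.7).fit(toy_docs).vocabulary_)

# Terms discarded purely because of their document frequency
dropped_terms = sorted(all_terms - kept_terms)

print("Kept terms: {}".format(sorted(kept_terms)))
print("Dropped as too frequent: {}".format(dropped_terms))
```

Unlike a fixed list, this cut-off is learned from the data, so it can also drop corpus-specific words such as "movie" that no general-purpose stop word list contains.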