比较四种贝叶斯分类器在文档分类上的表现

MultinomialNB 是文本分类的首选分类器，适用于离散特征。
GaussianNB 假设特征符合正态分布，适用于连续特征，因此需要将稀疏矩阵转为密集矩阵。
ComplementNB 是 MultinomialNB 的变体，专为不平衡数据设计。
BernoulliNB 假设特征是二元的（0/1），适合特定文本分类任务。

In [48]:
# 导入所需的库
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import jieba
import os


In [50]:
# 停用词表地址
stop_words_path = 'text classification/stop/stopword.txt'


In [51]:
# 训练集和测试集的根目录
train_base_path = 'text classification/train/'
test_base_path = 'text classification/test/'


In [54]:
# 文档类别
train_labels = ['体育', '女性', '文学', '校园']
test_labels = ['体育', '女性', '文学', '校园']


In [55]:
# 定义函数以读取文件夹中各个类别下的文档数据
def get_data(base_path, labels):
    contents = []
    for label in labels:
        files = {fileName for fileName in os.listdir(base_path + label)}
        try:
            for fileName in files:
                file = open(base_path + label + '/' + fileName, encoding='gb18030')
                word = jieba.cut(file.read())
                contents.append(" ".join(word))
        except Exception as e:
            print(fileName + ' 文件读取失败: ', e)
    return contents

In [56]:
# 1. 对文档进行分词
# 获取训练集与测试集
train_contents = get_data(train_base_path, train_labels)
test_contents = get_data(test_base_path, test_labels)


In [57]:
# 2. 加载停用词表
stop_words = [line.strip() for line in open(stop_words_path, encoding='utf-8-sig').readlines()]


In [58]:
# 3. 计算单词权重
tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
train_features = tf.fit_transform(train_contents)

生成多项式朴素贝叶斯分类器

In [60]:
train_labels_full = ['体育'] * 1337 + ['女性'] * 954 + ['文学'] * 766 + ['校园'] * 249
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels_full)



用生成的分类器做预测

In [64]:
test_tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=tf.vocabulary_)
test_features = test_tf.fit_transform(test_contents)
predicted_labels = clf.predict(test_features)



In [70]:
test_labels_full = ['体育'] * 115 + ['女性'] * 38 + ['文学'] * 31 + ['校园'] * 16
print('准确率:', metrics.accuracy_score(test_labels_full, predicted_labels))

准确率: 0.91


In [72]:
from sklearn.naive_bayes import GaussianNB, ComplementNB, BernoulliNB
from sklearn.metrics import accuracy_score

# 将稀疏矩阵转为密集矩阵以适配 GaussianNB
train_features_dense = train_features.toarray()
test_features_dense = test_features.toarray()

# 定义分类器及其名称
classifiers = {
    "MultinomialNB": MultinomialNB(alpha=0.001),
    "GaussianNB": GaussianNB(),
    "ComplementNB": ComplementNB(alpha=0.001),
    "BernoulliNB": BernoulliNB(alpha=0.001)
}

# 比较分类器性能
results = {}
for name, clf in classifiers.items():
    # 针对 GaussianNB 使用密集特征，其余使用稀疏矩阵
    if name == "GaussianNB":
        clf.fit(train_features_dense, train_labels_full)
        predicted_labels = clf.predict(test_features_dense)
    else:
        clf.fit(train_features, train_labels_full)
        predicted_labels = clf.predict(test_features)
    
    # 计算准确率
    accuracy = accuracy_score(test_labels_full, predicted_labels)
    results[name] = accuracy
    print(f"{name} 的准确率: {accuracy:.4f}")

# 输出结果
print("\n分类器准确率比较结果:")
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.4f}")

MultinomialNB 的准确率: 0.9100
GaussianNB 的准确率: 0.8750
ComplementNB 的准确率: 0.9100
BernoulliNB 的准确率: 0.9050

分类器准确率比较结果:
MultinomialNB: 0.9100
GaussianNB: 0.8750
ComplementNB: 0.9100
BernoulliNB: 0.9050
