<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#朴素贝叶斯实现新闻分类" data-toc-modified-id="朴素贝叶斯实现新闻分类-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>News Classification with Naive Bayes</a></span></li><li><span><a href="#参考文献" data-toc-modified-id="参考文献-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# News Classification with Naive Bayes

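This experiment classifies news articles from the given dataset. The steps are:  
1. Read the files and segment them with the `jieba` tokenizer, producing a frequency-sorted word list, the training data, and the test data.
    * The word list is built from the training split and is used to select the feature words.
    
2. Build a stop word set so that uninformative tokens are excluded when the feature vector is created.
    * Without this filter the features carry too much noise. Chinese function words and interjections such as '啊', '的', and '嗯' occur very frequently but contribute little to classification. Text processing also introduces spaces, line breaks, and similar characters, which hurt classification as well.

3. Vectorize the training and test data so they can be passed to `MultinomialNB`, which completes the classifier.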
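
The effect described in step 2 can be seen on a toy corpus. The sketch below is illustrative only: the documents and the stop word set are hypothetical, and in the notebook the token lists come from `jieba` and the stop words from a file.

```python
from collections import Counter

# Hypothetical pre-segmented documents (the notebook gets these from jieba).
docs = [
    ["的", "股票", "市场", "的", "上涨", "\n"],
    ["的", "球队", "比赛", "啊", "的", "胜利", " "],
    ["股票", "的", "基金", "的", "\n", "下跌"],
]
# Hypothetical stop word set; the notebook loads its own from a file.
stopwords = {"的", "啊", "嗯"}

# Count every token across all documents.
counts = Counter(word for doc in docs for word in doc)
most_common_word, most_common_count = counts.most_common(1)[0]

# Keep words that are neither stop words nor pure whitespace,
# in descending order of frequency.
kept = [w for w, _ in counts.most_common() if w not in stopwords and w.strip()]

print(most_common_word, most_common_count)
print(kept)
```

The most frequent token here is a function word that says nothing about the topic, which is why the notebook filters such tokens before choosing features.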


The `jieba` tokenizer

In [1]:
import os
import random
import jieba # Chinese word segmentation library
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt

In [2]:
def TextProcessing(folder_path, test_size=0.2):
    '''
    Preprocess the text corpus.

    Parameters:
        folder_path: root folder holding one subfolder per news class
        test_size: fraction of the samples held out as the test set
    Returns:
        all_words_list: training-set words sorted by descending frequency
        train_data_list: training samples, each a list of segmented words
        test_data_list: test samples, each a list of segmented words
        train_class_list: class labels of the training samples
        test_class_list: class labels of the test samples
    '''
    # Sort the folder names so class indices are stable across runs
    folder_list = sorted(os.listdir(folder_path))
    data_list = []
    class_list = []

    for index, folder in enumerate(folder_list):
        new_folder_path = os.path.join(folder_path, folder) # path of the class subfolder
        files = os.listdir(new_folder_path) # names of all files in the subfolder

        for file in files[:100]: # use at most 100 files per class
            # Read the file
            with open(os.path.join(new_folder_path, file), 'r', encoding='utf-8') as f:
                raw = f.read()

            # cut_all=False selects precise mode; cut() returns a generator
            word_list = list(jieba.cut(raw, cut_all=False))

            data_list.append(word_list) # add the sample
            class_list.append(index)  # add its class label

    data_class_list = list(zip(data_list, class_list)) # pair each sample with its label
    random.shuffle(data_class_list) # shuffle in place
    index = int(len(data_class_list) * test_size) + 1
    train_list = data_class_list[index:] # training split
    test_list = data_class_list[:index] # test split
    train_data_list, train_class_list = zip(*train_list)  # unzip the training split
    test_data_list, test_class_list = zip(*test_list) # unzip the test split

    all_words_dict = {} # word frequencies over the training split
    for word_list in train_data_list:
        for word in word_list:
            all_words_dict[word] = all_words_dict.get(word, 0) + 1

    # Sort the (word, count) pairs by count, descending
    all_words_tuple_list = sorted(all_words_dict.items(),
                                  key=lambda x: x[1], reverse=True)
    all_words_list = [word for word, _ in all_words_tuple_list]
    return all_words_list, train_data_list, test_data_list,\
                                    train_class_list, test_class_list
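
The split arithmetic in `TextProcessing` can be checked in isolation. This is a minimal sketch, assuming 90 hypothetical samples in nine classes of ten; the sample names, the helper `shuffle_split`, and the fixed seed are illustrative and not part of the notebook.

```python
import random

def shuffle_split(samples, labels, test_size=0.2, seed=0):
    """Shuffle (sample, label) pairs together, then cut off the test split."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # seeded only to make the sketch repeatable
    cut = int(len(pairs) * test_size) + 1  # same formula as TextProcessing
    return pairs[cut:], pairs[:cut]  # (train, test)

# Hypothetical data: doc0..doc89, with label i // 10 for doc i.
samples = [f"doc{i}" for i in range(90)]
labels = [i // 10 for i in range(90)]

train, test = shuffle_split(samples, labels)
print(len(train), len(test))  # 71 19
```

Because samples and labels are zipped before shuffling, every document keeps its own label. Note that the `+ 1` makes the test split slightly larger than `test_size` alone would give.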

In [3]:
def words_dict(all_words_list, deleteN, stopwords_set=frozenset()):
    '''
    Select the feature words.

    Parameters:
        all_words_list: training-set words sorted by descending frequency
        deleteN: number of most frequent words to skip
        stopwords_set: stop words to exclude
    Returns:
        feature_words: the selected feature words
    '''
    feature_words = []
    # Examine at most 4500 candidates after skipping the deleteN most frequent
    # words; the cap counts candidates examined, not features kept.
    for word in all_words_list[deleteN:deleteN + 4500]:
        # Keep words that are not numbers, not stop words, and 2 to 4 characters long
        if not word.isdigit() \
            and word not in stopwords_set \
            and 1 < len(word) < 5:
            feature_words.append(word)
    return feature_words
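
To make the filtering rules concrete, here is a compact stand-in for `words_dict` applied to a toy word list. The name `select_features`, the `max_scanned` parameter, and the sample words are assumptions made for this sketch.

```python
def select_features(sorted_words, delete_n, stopwords=frozenset(), max_scanned=4500):
    """Apply the same filters as words_dict to a frequency-sorted word list."""
    # Skip the delete_n most frequent words, then look at max_scanned candidates.
    candidates = sorted_words[delete_n:delete_n + max_scanned]
    return [w for w in candidates
            if not w.isdigit() and w not in stopwords and 1 < len(w) < 5]

# Hypothetical frequency-sorted vocabulary.
sorted_words = ["的", "我们", "2019", "股票", "市场", "啊", "中华人民共和国", "比赛"]

features = select_features(sorted_words, delete_n=1, stopwords={"我们"})
print(features)  # ['股票', '市场', '比赛']

# The cap limits how many candidates are examined, not how many are kept.
print(select_features(sorted_words, delete_n=1, stopwords={"我们"}, max_scanned=3))
```

The stop word, the number, the single-character word, and the seven-character word are all rejected.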

In [4]:
def TextFeatures(train_data_list, test_data_list, feature_words):
    '''
    Vectorize the texts against feature_words.

    Parameters:
        train_data_list: training samples (lists of words)
        test_data_list: test samples (lists of words)
        feature_words: the feature words
    Returns:
        train_feature_list: vectorized training data
        test_feature_list: vectorized test data
    '''
    def text_features(text, feature_words):
        text_words = set(text)
        # 1 if the feature word occurs in the text, otherwise 0
        features = [1 if word in text_words else 0 for word in feature_words]
        return features

    train_feature_list = [text_features(text, feature_words)
                              for text in train_data_list]
    test_feature_list = [text_features(text, feature_words)
                             for text in test_data_list]
    return train_feature_list, test_feature_list
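
A small example shows what one vector looks like. This sketch assumes a hypothetical three-word feature list; `binary_vector` mirrors the inner `text_features` helper.

```python
def binary_vector(tokens, feature_words):
    """Return 1 for each feature word present in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in feature_words]

feature_words = ["股票", "市场", "比赛"]   # hypothetical feature words
doc = ["股票", "上涨", "股票", "市场"]     # hypothetical segmented document

vec = binary_vector(doc, feature_words)
print(vec)  # [1, 1, 0]
```

The vector records presence only: "股票" appears twice in the document but still maps to 1, and words outside the feature list are ignored.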

In [5]:
def TextClassifier(train_feature_list, test_feature_list, \
                   train_class_list, test_class_list):
    '''
    Train the text classifier and compute its accuracy.

    Parameters:
        train_feature_list: training feature vectors
        test_feature_list: test feature vectors
        train_class_list: training labels
        test_class_list: test labels
    Returns:
        test_accuracy: accuracy on the test set
    '''
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    print(test_class_list)  # true labels
    print(classifier.predict(test_feature_list))  # predicted labels
    return test_accuracy
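
For intuition about what the classifier computes, the following is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the default of `MultinomialNB`). It is not scikit-learn's implementation; the function names and the toy data are assumptions made for illustration.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    n_features = len(X[0])
    classes = sorted(set(y))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c.
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict(model, x):
    """Pick the class with the highest log posterior (up to a constant)."""
    classes, log_prior, log_likelihood = model
    def score(c):
        return log_prior[c] + sum(xi * ll for xi, ll in zip(x, log_likelihood[c]))
    return max(classes, key=score)

# Hypothetical binary feature vectors for two classes.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [0, 0, 1, 1]
model = fit_multinomial_nb(X, y)
print(predict(model, [1, 0, 0]), predict(model, [0, 0, 1]))  # 0 1
```

Smoothing keeps a feature that never appears in a class from driving that class's probability to zero.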

In [6]:
def MakeWordsSet(words_file):
    '''
    Read a word file and remove duplicates.

    Parameters:
        words_file: path of the file, one word per line
    Returns:
        words_set: set of the words read
    '''
    words_set = set()
    with open(words_file, 'r', encoding='utf-8') as f:
        for line in f:
            word = line.strip()
            if len(word) > 0: # skip blank lines
                words_set.add(word)
    return words_set
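
The expected file format is one word per line. The sketch below writes a throwaway file and reads it back; the file name and contents are hypothetical, and `load_word_set` is a compact equivalent of `MakeWordsSet` defined here so the example is self-contained.

```python
import os
import tempfile

def load_word_set(path):
    """Read one word per line, stripping whitespace and skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords_demo.txt')  # hypothetical file name
    with open(path, 'w', encoding='utf-8') as f:
        # A duplicate, a blank line, and padded whitespace are all handled.
        f.write('的\n啊\n\n  的  \n')
    words = load_word_set(path)

print(sorted(words))
```

Using a set makes the later `word not in stopwords_set` check a constant-time lookup.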

In [7]:
# Preprocess the text
folder_path = './Naive_Bayes-master/SogouC/Sample'
all_words_list, train_data_list, test_data_list, train_class_list, \
    test_class_list = TextProcessing(folder_path, test_size=0.2)
# Build stopwords_set
stopwords_file = './Naive_Bayes-master/stopwords_cn.txt'
stopwords_set = MakeWordsSet(stopwords_file)

test_accuracy_list = []
# Skip the 450 most frequent words when selecting features
feature_words = words_dict(all_words_list, 450, stopwords_set)

train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)
test_accuracy = TextClassifier(train_feature_list, test_feature_list,
                               train_class_list, test_class_list)
test_accuracy_list.append(test_accuracy)
# Mean accuracy over the recorded runs (a single run here)
print(sum(test_accuracy_list) / len(test_accuracy_list))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Peak\AppData\Local\Temp\jieba.cache
Loading model cost 0.897 seconds.
Prefix dict has been built successfully.


(4, 7, 1, 2, 8, 3, 1, 1, 1, 6, 2, 5, 5, 7, 3, 7, 0, 0, 3)
[4 1 0 1 8 3 1 2 2 6 2 5 5 7 3 2 0 0 3]
0.6842105263157895
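
The reported accuracy can be reproduced from the two label sequences printed above. This short check copies them verbatim and also tallies which (true, predicted) pairs were confused; the variable names are illustrative.

```python
from collections import Counter

# Labels copied from the output above: true labels, then predictions.
y_true = (4, 7, 1, 2, 8, 3, 1, 1, 1, 6, 2, 5, 5, 7, 3, 7, 0, 0, 3)
y_pred = [4, 1, 0, 1, 8, 3, 1, 2, 2, 6, 2, 5, 5, 7, 3, 2, 0, 0, 3]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Count each (true, predicted) pair among the misclassified samples.
errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

print(correct, len(y_true), accuracy)
print(errors.most_common())
```

13 of the 19 test samples are classified correctly, and all six errors involve classes 1, 2, and 7. With only 19 test samples and a random split, the accuracy will vary noticeably from run to run.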


# References
[1] https://www.cnblogs.com/asialee/p/9417659.html  
[2] https://www.cnblogs.com/pinard/p/6069267.html  
[3] Li Hang, 《统计学习方法》 (Statistical Learning Methods), 2nd ed.  
[4] https://www.jianshu.com/p/4b67141e474e  
[5] https://www.lagou.com/lgeduarticle/66914.html  
[6] https://blog.csdn.net/codejas/article/details/80356544  