<h1>文档目录<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#目标需求" data-toc-modified-id="目标需求-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>目标需求</a></span></li><li><span><a href="#数据整理" data-toc-modified-id="数据整理-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>数据整理</a></span><ul class="toc-item"><li><span><a href="#清洗策略函数" data-toc-modified-id="清洗策略函数-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>清洗策略函数</a></span></li><li><span><a href="#定义数据提取函数" data-toc-modified-id="定义数据提取函数-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>定义数据提取函数</a></span></li><li><span><a href="#获得数据" data-toc-modified-id="获得数据-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>获得数据</a></span></li></ul></li><li><span><a href="#特征提取" data-toc-modified-id="特征提取-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>特征提取</a></span><ul class="toc-item"><li><span><a href="#文本分词" data-toc-modified-id="文本分词-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>文本分词</a></span></li><li><span><a href="#TF-IDF特征提取" data-toc-modified-id="TF-IDF特征提取-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>TF-IDF特征提取</a></span></li></ul></li><li><span><a href="#模型训练与验证" data-toc-modified-id="模型训练与验证-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>模型训练与验证</a></span><ul class="toc-item"><li><span><a href="#采用多项式朴素贝叶斯模型" data-toc-modified-id="采用多项式朴素贝叶斯模型-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>采用多项式朴素贝叶斯模型</a></span></li><li><span><a href="#K折交叉验证评估" data-toc-modified-id="K折交叉验证评估-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>K折交叉验证评估</a></span></li></ul></li><li><span><a href="#测试集评估" data-toc-modified-id="测试集评估-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>测试集评估</a></span></li></ul></div>

# 目标需求
1. 根据文章文字内容对文章的类型进行分类
2. 使用朴素贝叶斯分类对训练集进行训练，并对测试集进行验证，并给出测试集的准确率
3. 文档共有 4 种类型：女性、体育、文学、校园
4. 训练集放到 train 文件夹里，测试集放到 test 文件夹里，停用词放到 stop 文件夹里

# 数据整理

## 清洗策略函数

In [17]:
import re

def clean(text):
    # 过滤每个文档尾的标识符
    text = re.sub('\$LOTOzf\$','',text)
    # 过滤掉网页连接
    text = re.sub(r'http.+[a-zA-Z]\b','',text)
    # 过滤制表符
    text = re.sub('\\t','',text)
    # 过滤空格
    text = re.sub(' ','',text)
    # 过滤奇怪符号
    text = re.sub('[★|●]','',text)
    return text

## 定义数据提取函数

In [18]:
import numpy as np
import os

# 文件读取
def read_text(path, islist=False):
    if islist:
        word_list = []
        with open (path,'r',encoding='utf-8',errors='ignore') as f:
            lines = f.readlines()
            for line in lines:
                word_list.append(line.strip())
            return word_list
    else:
        with open (path,'r',encoding='gb18030') as f:
            return f.read()

# 获取取文本数据进数组
def load_data(file_path):
    for dirpath, dirnames_s, filesnames in os.walk(file_path):
        if dirnames_s:
            dirnames = dirnames_s
        else:
            break
    
    data = []
    lable = []
    for dirname in dirnames:
        for i,j,filesnames in os.walk('{}/{}'.format(file_path,dirname)):
            for filename in filesnames:
                text = clean(read_text('{}/{}/{}'.format(file_path,dirname,filename)))
                data.append(text)
                lable.append(dirname)
    return np.array(data),np.array(lable)

## 获得数据
1. train_data_original：训练集数据
2. train_lable：训练集的标签
3. test_data_original：测试集数据
4. test_lable：测试集的标签
5. stop_words：停用词列表

In [19]:
train_data_original, train_lable = load_data('./train')
test_data_original, test_lable = load_data('./test')
stop_words = read_text('./stop/stopword.txt',islist=True)

# 特征提取
1. 采用jieba包分词
2. 使用词频TF-IDF值作为特征值

## 文本分词

In [20]:
import jieba

# 分词函数
def word_split(datas):
    result = []
    for data in datas:
        text = '/'.join(jieba.cut(data))
        result.append(text)
    return result

# 获取分词文本
train_data_split = word_split(train_data_original)
test_data_split = word_split(test_data_original)

## TF-IDF特征提取
1. train_features 训练集特征数据
2. test_features 测试集特征数据

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 分别生成训练集转化器、特征空间
tfidf_vec_train = TfidfVectorizer(stop_words=stop_words)
train_features = tfidf_vec_train.fit_transform(train_data_split) 

# 分别生成测试集转化器、特征空间（词组空间用训练集的）
tfidf_vec_test = TfidfVectorizer(stop_words=stop_words, vocabulary=tfidf_vec_train.vocabulary_)
test_features = tfidf_vec_test.fit_transform(test_data_split) 

# 模型训练与验证

## 采用多项式朴素贝叶斯模型

In [22]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.01).fit(train_features, train_lable)

## K折交叉验证评估

In [33]:
from sklearn.model_selection import cross_val_score

score = cross_val_score(clf, train_features, train_lable, cv=10)
print('k折验证评估分数为：', '{}%'.format(round(score.mean(),2)*100))

k折验证评估分数为： 90.0%


# 测试集评估

In [34]:
from sklearn import metrics

predict_lable = clf.predict(test_features)
score = metrics.accuracy_score(test_lable, predict_lable)
print('测试集上的准确率：','{}%'.format(round(score,2)*100))

测试集上的准确率： 91.0%
