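# Naive Bayes

Advantages: remains effective with small amounts of data and can handle multi-class problems.

Disadvantages: sensitive to how the input data is prepared.

Applicable data type: nominal data.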

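## Basic Concepts

**The melon-buying problem**: an introduction to prior and posterior probability

The weather has been hot lately, and 红色石头 (Red Stone) goes to the supermarket to buy a watermelon. Lacking experience, Red Stone does not know how to pick a ripe one. As a science student, Red Stone reasons as follows:

Suppose I know nothing about this particular watermelon: not its color, its shape, or whether its stem has detached. By common sense, the probability that a watermelon is ripe is roughly 60%. This probability, P(ripe), is called the **prior probability**.

In other words, a prior probability is **obtained from past experience and analysis; it requires no sample data and is not conditioned on any observation**. Red Stone judging ripeness from common sense alone, without looking at the state of the watermelon, is an example of a prior probability.

Now suppose Red Stone has learned a rule of thumb for judging ripeness: check whether the stem has detached. In general, a watermelon whose stem has detached is more likely to be ripe, with a probability of about 75%. If we treat the detached stem as an observed outcome and infer from it the probability that the watermelon is ripe, that probability, P(ripe | stem detached), is called the **posterior probability**. A posterior probability is **a form of conditional probability**: the probability of a cause given the observed evidence.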
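
The link between the 60% prior and the 75% posterior is Bayes' theorem. The sketch below works through the arithmetic. Only the prior comes from the text; the two likelihood values are assumptions chosen for illustration so that the result reproduces the 75% figure.

```python
# Prior from the text: P(ripe) = 0.6.
p_ripe = 0.6
p_unripe = 1.0 - p_ripe

# ASSUMED likelihoods (not from the original text), chosen for illustration.
p_off_given_ripe = 0.5     # P(stem detached | ripe)
p_off_given_unripe = 0.25  # P(stem detached | unripe)

# Law of total probability: P(stem detached).
p_off = p_off_given_ripe * p_ripe + p_off_given_unripe * p_unripe

# Bayes' theorem: P(ripe | stem detached).
p_ripe_given_off = p_off_given_ripe * p_ripe / p_off
print(round(p_ripe_given_off, 2))
```

Under these assumed likelihoods, observing a detached stem raises the probability of ripeness from 0.6 to 0.75.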

## Text Classification with Naive Bayes

### Converting Words to Vectors

In [14]:
import numpy as np

# Create the sample data set
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 = abusive post, 0 = not abusive; one label per post in the list above
    return postingList, classVec

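**set:**

A Python set, like sets in other languages, is an unordered collection of unique elements. Its basic uses are membership testing and removing duplicates. Set objects also support the mathematical operations union, intersection, difference, and symmetric difference.

Sets support x in set, len(set), and for x in set. Because a set is unordered, it does not record element positions or insertion order, so it does not support indexing, slicing, or other sequence-like operations.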
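
A short sketch of the set operations described above, using two small word sets made up for illustration:

```python
a = {'my', 'dog', 'has', 'flea'}
b = {'dog', 'park', 'stupid'}

print(sorted(a | b))   # union
print(sorted(a & b))   # intersection
print(sorted(a - b))   # difference
print(sorted(a ^ b))   # symmetric difference

# Membership tests and len() are supported...
print('dog' in a, len(a))

# ...but indexing is not, because a set has no element order.
try:
    a[0]
    indexing_works = True
except TypeError:
    indexing_works = False
print(indexing_works)
```

The results are passed through sorted() only to make the printed order stable; the sets themselves have no order.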

In [15]:
# Build a list of the unique words that appear across all documents
def createVocabList(dataSet):
    vocabSet = set()  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    # vocabSet now holds every word exactly once
    return list(vocabSet)  # convert to a list (the order is arbitrary)

# setOfWords2Vec takes the vocabulary list and one document.
# It returns a 0/1 vector with one entry per vocabulary word,
# marking whether that word appears in the document.
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

In [16]:
listOPosts, ListClasses = loadDataSet()
listOPosts, ListClasses

([['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
  ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
  ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
  ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
  ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
  ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']],
 [0, 1, 0, 1, 0, 1])

In [17]:
myVocabList = createVocabList(listOPosts)
myVocabList

['has',
 'love',
 'dalmation',
 'so',
 'him',
 'I',
 'help',
 'take',
 'food',
 'ate',
 'park',
 'my',
 'worthless',
 'dog',
 'stop',
 'posting',
 'mr',
 'not',
 'how',
 'stupid',
 'quit',
 'to',
 'please',
 'maybe',
 'cute',
 'steak',
 'licks',
 'is',
 'problems',
 'buying',
 'flea',
 'garbage']

In [18]:
setOfWords2Vec(myVocabList, listOPosts[0])

[1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0]
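
Because createVocabList returns list(vocabSet), the position of each word in the vector depends on the set's iteration order, which can differ between Python sessions. The sketch below is one possible variant, not the notebook's method: it sorts the vocabulary for a reproducible order and uses a dictionary for position lookup. The names create_sorted_vocab and set_of_words_to_vec are hypothetical.

```python
def create_sorted_vocab(data_set):
    """Return the unique words of all documents in sorted order."""
    vocab = set()
    for document in data_set:
        vocab |= set(document)
    return sorted(vocab)


def set_of_words_to_vec(vocab_index, input_words):
    """Return a 0/1 presence vector; unknown words are silently skipped."""
    vec = [0] * len(vocab_index)
    for word in input_words:
        position = vocab_index.get(word)
        if position is not None:
            vec[position] = 1
    return vec


docs = [['my', 'dog', 'has', 'flea'], ['stop', 'my', 'dog']]
vocab = create_sorted_vocab(docs)
vocab_index = {word: i for i, word in enumerate(vocab)}

print(vocab)
print(set_of_words_to_vec(vocab_index, ['dog', 'stop', 'unknown']))
```

The dictionary lookup also avoids the repeated linear scans that vocabList.index(word) performs, which matters once the vocabulary grows to hundreds of words.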

### Training and Computing Probabilities

In [19]:
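def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # number of training documents
    numWords = len(trainMatrix[0])  # length of each word vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior P(class=1): class-1 documents / all documents
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)  # numerators: one count per vocabulary word

    p0Denom = 0
    p1Denom = 0  # denominators: total word count per class (no smoothing yet)

    for i in range(numTrainDocs):  # iterate over every training document
        if trainCategory[i] == 1:  # the document belongs to class 1
            p1Num += trainMatrix[i]  # accumulate per-word counts for class 1
            p1Denom += sum(trainMatrix[i])  # accumulate the total word count of class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])  # accumulate the total word count of class 0
    p1Vect = (p1Num/p1Denom)  # P(word | class 1) for every vocabulary word
    p0Vect = (p0Num/p0Denom)  # P(word | class 0) for every vocabulary word
    return p0Vect, p1Vect, pAbusive  # two conditional-probability vectors and P(class=1); P(class=0) = 1 - pAbusive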

In [20]:
listOPosts, listClasses = loadDataSet()

In [21]:
listOPosts

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

In [22]:
listClasses

[0, 1, 0, 1, 0, 1]

In [23]:
myVocabList = createVocabList(listOPosts)
myVocabList 

['has',
 'love',
 'dalmation',
 'so',
 'him',
 'I',
 'help',
 'take',
 'food',
 'ate',
 'park',
 'my',
 'worthless',
 'dog',
 'stop',
 'posting',
 'mr',
 'not',
 'how',
 'stupid',
 'quit',
 'to',
 'please',
 'maybe',
 'cute',
 'steak',
 'licks',
 'is',
 'problems',
 'buying',
 'flea',
 'garbage']

In [24]:
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))

In [25]:
len(trainMat)
len(trainMat[0])

32

In [26]:
p0V, p1V, pAb = trainNB0(trainMat, listClasses)

In [27]:
p0V, p1V, pAb

(array([0.04166667, 0.04166667, 0.04166667, 0.04166667, 0.08333333,
        0.04166667, 0.04166667, 0.        , 0.        , 0.04166667,
        0.        , 0.125     , 0.        , 0.04166667, 0.04166667,
        0.        , 0.04166667, 0.        , 0.04166667, 0.        ,
        0.        , 0.04166667, 0.04166667, 0.        , 0.04166667,
        0.04166667, 0.04166667, 0.04166667, 0.04166667, 0.        ,
        0.04166667, 0.        ]),
 array([0.        , 0.        , 0.        , 0.        , 0.05263158,
        0.        , 0.        , 0.05263158, 0.05263158, 0.        ,
        0.05263158, 0.        , 0.10526316, 0.10526316, 0.05263158,
        0.05263158, 0.        , 0.05263158, 0.        , 0.15789474,
        0.05263158, 0.05263158, 0.        , 0.05263158, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.05263158,
        0.        , 0.05263158]),
 0.5)

**Maximum likelihood estimation in naive Bayes can produce probability estimates of 0, so we switch to Bayesian estimation and add smoothing**

In [28]:
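# Redefine the trainer, now with smoothing
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # number of training documents
    numWords = len(trainMatrix[0])  # length of each word vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior P(class=1): class-1 documents / all documents
    lamba = 1  # smoothing constant
    p0Num = np.zeros(numWords) + lamba
    p1Num = np.zeros(numWords) + lamba  # numerators start at lamba instead of 0 (smoothing)

    p0Denom = lamba*2
    p1Denom = lamba*2  # denominators start at 2*lamba = 2 (smoothing)

    for i in range(numTrainDocs):  # iterate over every training document
        if trainCategory[i] == 1:  # the document belongs to class 1
            p1Num += trainMatrix[i]  # accumulate per-word counts for class 1
            p1Denom += sum(trainMatrix[i])  # accumulate the total word count of class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])  # accumulate the total word count of class 0
    p1Vect = (p1Num/p1Denom)  # smoothed P(word | class 1) for every vocabulary word
    p0Vect = (p0Num/p0Denom)  # smoothed P(word | class 0) for every vocabulary word
    return p0Vect, p1Vect, pAbusive  # two conditional-probability vectors and P(class=1); P(class=0) = 1 - pAbusive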

In [29]:
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
p0V, p1V, pAb

(array([0.07692308, 0.07692308, 0.07692308, 0.07692308, 0.11538462,
        0.07692308, 0.07692308, 0.03846154, 0.03846154, 0.07692308,
        0.03846154, 0.15384615, 0.03846154, 0.07692308, 0.07692308,
        0.03846154, 0.07692308, 0.03846154, 0.07692308, 0.03846154,
        0.03846154, 0.07692308, 0.07692308, 0.03846154, 0.07692308,
        0.07692308, 0.07692308, 0.07692308, 0.07692308, 0.03846154,
        0.07692308, 0.03846154]),
 array([0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.0952381 ,
        0.04761905, 0.04761905, 0.0952381 , 0.0952381 , 0.04761905,
        0.0952381 , 0.04761905, 0.14285714, 0.14285714, 0.0952381 ,
        0.0952381 , 0.04761905, 0.0952381 , 0.04761905, 0.19047619,
        0.0952381 , 0.0952381 , 0.04761905, 0.0952381 , 0.04761905,
        0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.0952381 ,
        0.04761905, 0.0952381 ]),
 0.5)
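
To see why the zeros in the unsmoothed estimates are a problem, the sketch below uses made-up word counts for a single class over a four-word vocabulary. The counts and the test document are assumptions for illustration; the smoothing form (add lam to each count, add 2 * lam to the denominator) mirrors the cell above.

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary.
counts = np.array([3.0, 0.0, 1.0, 0.0])
total = counts.sum()

# Maximum likelihood: words never seen in the class get probability 0.
p_mle = counts / total

# Additive smoothing in the same form as the notebook cell:
# add lam to every count and 2 * lam to the denominator.
lam = 1.0
p_smooth = (counts + lam) / (total + 2 * lam)

doc = np.array([1, 1, 0, 0])  # a document containing word 0 and word 1

prod_mle = np.prod(p_mle[doc == 1])        # one unseen word zeroes the product
prod_smooth = np.prod(p_smooth[doc == 1])  # stays strictly positive

print(prod_mle)
print(prod_smooth)
```

With this denominator the smoothed values need not sum to 1 across the vocabulary. Textbook Laplace smoothing instead adds lam times the number of possible values to the denominator; the ranking of the two classes is what matters for the classifier here.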

**Fixing the underflow caused by multiplying many very small numbers**

In [30]:
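# Redefine the trainer, now returning log-probabilities
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # number of training documents
    numWords = len(trainMatrix[0])  # length of each word vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior P(class=1): class-1 documents / all documents
    lamba = 1  # smoothing constant
    p0Num = np.zeros(numWords) + lamba
    p1Num = np.zeros(numWords) + lamba  # numerators start at lamba instead of 0 (smoothing)

    p0Denom = lamba*2
    p1Denom = lamba*2  # denominators start at 2*lamba = 2 (smoothing)

    for i in range(numTrainDocs):  # iterate over every training document
        if trainCategory[i] == 1:  # the document belongs to class 1
            p1Num += trainMatrix[i]  # accumulate per-word counts for class 1
            p1Denom += sum(trainMatrix[i])  # accumulate the total word count of class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])  # accumulate the total word count of class 0
    p1Vect = np.log(p1Num/p1Denom)  # log P(word | class 1) for every vocabulary word
    p0Vect = np.log(p0Num/p0Denom)  # log P(word | class 0) for every vocabulary word
    return p0Vect, p1Vect, pAbusive  # two log-probability vectors and P(class=1); P(class=0) = 1 - pAbusive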

In [31]:
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
p0V, p1V, pAb

(array([-2.56494936, -2.56494936, -2.56494936, -2.56494936, -2.15948425,
        -2.56494936, -2.56494936, -3.25809654, -3.25809654, -2.56494936,
        -3.25809654, -1.87180218, -3.25809654, -2.56494936, -2.56494936,
        -3.25809654, -2.56494936, -3.25809654, -2.56494936, -3.25809654,
        -3.25809654, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
        -2.56494936, -2.56494936, -2.56494936, -2.56494936, -3.25809654,
        -2.56494936, -3.25809654]),
 array([-3.04452244, -3.04452244, -3.04452244, -3.04452244, -2.35137526,
        -3.04452244, -3.04452244, -2.35137526, -2.35137526, -3.04452244,
        -2.35137526, -3.04452244, -1.94591015, -1.94591015, -2.35137526,
        -2.35137526, -3.04452244, -2.35137526, -3.04452244, -1.65822808,
        -2.35137526, -2.35137526, -3.04452244, -2.35137526, -3.04452244,
        -3.04452244, -3.04452244, -3.04452244, -3.04452244, -2.35137526,
        -3.04452244, -2.35137526]),
 0.5)
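
The underflow that motivates the logarithm is easy to reproduce. The sketch below multiplies 400 hypothetical probabilities of 0.001 each (the number and value are arbitrary choices for illustration) and compares the result with the sum of their logarithms.

```python
import numpy as np

probs = np.full(400, 1e-3)       # 400 hypothetical small probabilities

product = np.prod(probs)         # true value 1e-1200 is below the float64 range
log_sum = np.sum(np.log(probs))  # equals 400 * log(1e-3)

print(product)
print(log_sum)
```

The product collapses to 0.0, so two classes could no longer be compared, while the log-sum remains an ordinary finite number. Since the logarithm is monotonically increasing, comparing log-sums picks the same class as comparing the products would.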

### Naive Bayes Classification Function

In [32]:
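# Classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # In theory the per-word probabilities are multiplied with the class prior.
    # p1Vec and p0Vec already hold log-probabilities, so the product becomes a sum.
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # summing per-word terms relies on the conditional-independence assumption
    # Return the class with the larger log-score
    if p1 > p0:
        return 1
    else:
        return 0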
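
Written as a formula, the rule that classifyNB appears to implement is the following. The symbols are notation introduced here for illustration: x_i is entry i of vec2Classify, P(w_i | c) is the smoothed word probability whose logarithm is stored in p0Vec or p1Vec, P(c = 1) is pClass1, and n is the vocabulary size.

```latex
\hat{c} = \arg\max_{c \in \{0, 1\}} \left[ \log P(c) + \sum_{i=1}^{n} x_i \, \log P(w_i \mid c) \right]
```

Words absent from the document have x_i = 0 and contribute nothing to the sum. In the code, a tie between the two scores is resolved in favor of class 0.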

In [33]:
listOPosts, listClasses = loadDataSet()  # load the sample data
myVocabList = createVocabList(listOPosts)  # build the vocabulary list
trainMat = []  # initialize the training matrix
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
# train on the assembled training matrix
p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))

In [34]:
p0V, p1V, pAb

(array([-2.56494936, -2.56494936, -2.56494936, -2.56494936, -2.15948425,
        -2.56494936, -2.56494936, -3.25809654, -3.25809654, -2.56494936,
        -3.25809654, -1.87180218, -3.25809654, -2.56494936, -2.56494936,
        -3.25809654, -2.56494936, -3.25809654, -2.56494936, -3.25809654,
        -3.25809654, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
        -2.56494936, -2.56494936, -2.56494936, -2.56494936, -3.25809654,
        -2.56494936, -3.25809654]),
 array([-3.04452244, -3.04452244, -3.04452244, -3.04452244, -2.35137526,
        -3.04452244, -3.04452244, -2.35137526, -2.35137526, -3.04452244,
        -2.35137526, -3.04452244, -1.94591015, -1.94591015, -2.35137526,
        -2.35137526, -3.04452244, -2.35137526, -3.04452244, -1.65822808,
        -2.35137526, -2.35137526, -3.04452244, -2.35137526, -3.04452244,
        -3.04452244, -3.04452244, -3.04452244, -3.04452244, -2.35137526,
        -3.04452244, -2.35137526]),
 0.5)

In [35]:
testEntry = ['love', 'my', 'dalmation']

In [36]:
thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))  # build the word vector
thisDoc

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [37]:
classifyNB(thisDoc, p0V, p1V, pAb)

0

In [38]:
print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

['love', 'my', 'dalmation'] classified as:  0


In [39]:
testEntry = ['stupid', 'garbage']
thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

['stupid', 'garbage'] classified as:  1


**Improving the setOfWords2Vec model: a word may occur more than once in a document**

In [40]:
# Turn the set-of-words model (0/1) into a bag-of-words model (counts 0, 1, 2, ...); adapted from setOfWords2Vec()
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
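
The difference between the two models only shows up when a word repeats. The sketch below compares them on a made-up three-word vocabulary and a document in which one word appears three times; it uses collections.Counter rather than the notebook's functions so that it runs on its own.

```python
from collections import Counter

vocab = ['dog', 'garbage', 'stupid']         # hypothetical vocabulary
doc = ['stupid', 'dog', 'stupid', 'stupid']  # 'stupid' appears 3 times

counts = Counter(doc)
set_vec = [1 if counts[w] > 0 else 0 for w in vocab]  # set-of-words: presence only
bag_vec = [counts[w] for w in vocab]                  # bag-of-words: occurrence counts

print(set_vec)
print(bag_vec)
```

The set-of-words vector records only that 'stupid' is present, while the bag-of-words vector keeps the count of 3, which gives repeated words more weight in the classifier's score.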

In [41]:
# Test
listOPosts, listClasses = loadDataSet()  # load the sample data
myVocabList = createVocabList(listOPosts)  # build the vocabulary list
trainMat = []  # initialize the training matrix
for postinDoc in listOPosts:
    trainMat.append(bagOfWords2VecMN(myVocabList, postinDoc))
# train on the assembled training matrix
p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))


In [42]:
testEntry = ['stupid', 'garbage']
thisDoc = np.array(bagOfWords2VecMN(myVocabList, testEntry))
print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

['stupid', 'garbage'] classified as:  1


In [43]:
testEntry = ['love', 'my', 'dalmation']
thisDoc = np.array(bagOfWords2VecMN(myVocabList, testEntry))
print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

['love', 'my', 'dalmation'] classified as:  0


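## Filtering Spam Email with Naive Bayes

Python's **re module** (regular expressions) provides a range of pattern-matching operations and is a very useful tool for text parsing, complex string analysis, and information extraction. The commonly used re functions are summarized below.

1. Introduction to re
    The re module cannot handle every complex matching case, but in the vast majority of cases it is enough to analyze complex strings and extract the relevant information. Python compiles a regular expression to bytecode and runs it on a matching engine written in C, which performs a depth-first (backtracking) search.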


In [44]:
import re
print(re.__doc__)

Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of t

In [62]:
print(re.match('www', 'www.runoob.com'))  # pattern is at the start of the string: match
print(re.match('com', 'www.runoob.com'))  # pattern is not at the start of the string: no match

<_sre.SRE_Match object; span=(0, 3), match='www'>
None


In [63]:
print(re.match('www', 'www.runoob.com').span())  # span of the match at the start of the string

(0, 3)
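
re.match only tries the pattern at the start of the string, which is why 'com' returned None earlier. When the pattern may occur anywhere, re.search is the usual choice. A minimal sketch on the same string:

```python
import re

text = 'www.runoob.com'

at_start = re.match('com', text)   # None: 'com' is not at position 0
anywhere = re.search('com', text)  # finds the first occurrence anywhere

print(at_start)
print(anywhere.span())
```

Both functions return a match object on success and None on failure, so the result should be checked before calling span() or group().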


In [74]:
import re
 
line = "Cats are smarter than dogs"
 
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)
 
if matchObj:
    print( "matchObj.group() : ", matchObj.group())
    print( "matchObj.group(1) : ", matchObj.group(1))
    print( "matchObj.group(2) : ", matchObj.group(2))
else:
    print( "No match!!")

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter


In [82]:
matchObj.group(1)

'Cats'

In [45]:
import re  # regular expressions

def textParse(bigString):
    # Split the input string into a list of word tokens
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # keep tokens longer than 2 characters, lowercased

In [46]:
textParse('my name is wtt, I like studying')

['name', 'wtt', 'like', 'studying']

In [47]:
re.split(r'\W+', 'my name is wtt, I like studying')

['my', 'name', 'is', 'wtt', 'I', 'like', 'studying']

In [48]:
re.split(r'\W', 'my name is wtt, I like studying')

['my', 'name', 'is', 'wtt', '', 'I', 'like', 'studying']
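
The empty string in the last output comes from splitting on each single non-word character: the comma and the following space are two separate delimiters with nothing between them. Using \W+ merges adjacent delimiters, but a delimiter at the very end of the string still leaves an empty token. The sketch below shows this on a made-up string and compares it with re.findall, which collects the word runs directly.

```python
import re

text = 'Hello, world!'

split_tokens = re.split(r'\W+', text)    # the trailing '!' leaves an empty string
found_tokens = re.findall(r'\w+', text)  # only the word runs themselves

print(split_tokens)
print(found_tokens)
```

In textParse the len(tok) > 2 filter already discards empty tokens, so either approach would work there.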

In [52]:
def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        with open('email/spam/%d.txt' % i, encoding="ISO-8859-1") as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)  # 1 = spam
        with open('email/ham/%d.txt' % i, encoding="ISO-8859-1") as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)  # 0 = ham
    vocabList = createVocabList(docList)  # create vocabulary
    trainingSet = list(range(50)); testSet = []  # a real list, so items can be deleted
    for i in range(10):  # move 10 random documents into the test set
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]  # remove from the training set itself, not from a temporary copy
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:  # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:  # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount)/len(testSet))

In [83]:
docList = []
classList = []
fullText = []
for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i, encoding="ISO-8859-1").read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i, encoding="ISO-8859-1").read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)

In [91]:
len(docList)  # 25 positive (spam) and 25 negative (ham) documents

50

In [93]:
len(fullText)  # total number of tokens across all documents

1762

In [96]:
vocabList = createVocabList(docList)  # build the vocabulary
vocabList 

['had',
 'there',
 'might',
 'betterejacu1ation',
 'assigning',
 'certified',
 'have',
 'courier',
 'incredib1e',
 'herbal',
 'away',
 'what',
 'changes',
 'tokyo',
 'town',
 'inconvenience',
 'far',
 'required',
 'important',
 '492',
 'save',
 'development',
 'prepared',
 'mailing',
 'dusty',
 'cheers',
 'bettererections',
 'discussions',
 'cs5',
 'knocking',
 'cats',
 '100mg',
 '156',
 'fundamental',
 'since',
 '396',
 'yeah',
 'when',
 'butt',
 'exhibit',
 'sky',
 'jewerly',
 'cat',
 'logged',
 'germany',
 'and',
 'then',
 'financial',
 'strategy',
 'try',
 'ones',
 'arolexbvlgari',
 'quantitative',
 'like',
 'behind',
 '25mg',
 'door',
 'talked',
 'bike',
 'through',
 'borders',
 'plugin',
 'experts',
 'softwares',
 'got',
 'come',
 'series',
 'reliever',
 'placed',
 'message',
 'service',
 'modelling',
 'release',
 'watchesstore',
 'parallel',
 'python',
 'effective',
 '430',
 'changing',
 'generates',
 'per',
 'jpgs',
 'model',
 'recieve',
 'delivery',
 'gain',
 'ideas',
 'wasn',

In [97]:
len(vocabList)

692

In [139]:
trainingSet = list(range(50))  # a list, so that entries can be deleted

# Randomly pick 10 documents as the test set
testSet = []
for i in range(10):
    randIndex = int(np.random.uniform(0, len(trainingSet)))  # sample uniformly from [low, high): low is included, high is excluded
    testSet.append(trainingSet[randIndex])  # add the randomly chosen document index to the test set
    del trainingSet[randIndex]  # and remove that index from the training set


In [140]:
trainingSet

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 11,
 12,
 13,
 14,
 16,
 18,
 19,
 20,
 21,
 23,
 24,
 25,
 27,
 28,
 29,
 30,
 33,
 34,
 35,
 36,
 38,
 39,
 40,
 42,
 43,
 44,
 45,
 46,
 47,
 48]

In [141]:
testSet

[22, 17, 49, 31, 15, 10, 37, 41, 32, 26]
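
Because the split is random, the test set above changes on every run. A reproducible alternative, sketched here under the assumption that a fixed seed is acceptable, shuffles the 50 indices once with a seeded NumPy generator and slices off the first 10. The seed value 0 and the variable names are arbitrary choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice
indices = rng.permutation(50)   # shuffled copy of 0..49

test_idx = sorted(indices[:10].tolist())   # 10 held-out document indices
train_idx = sorted(indices[10:].tolist())  # remaining 40 indices for training

print(test_idx)
print(len(train_idx))
```

Slicing one permutation guarantees that the two index sets are disjoint and together cover all 50 documents.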

In [142]:
trainMat = []
trainClasses = []
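for docIndex in trainingSet:  # iterate over the training-set document indices
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))  # convert the document to a word vector and add it to the training matrix
    trainClasses.append(classList[docIndex])  # class label of that document
p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # fit the naive Bayes parameters on the training set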

In [147]:
p0V, p1V, pSpam

(array([-5.56068163, -5.04985601, -6.65929392, -6.65929392, -5.96614674,
        -6.65929392, -4.86753445, -6.65929392, -6.65929392, -6.65929392,
        -5.96614674, -5.56068163, -5.96614674, -5.96614674, -5.96614674,
        -5.96614674, -5.96614674, -5.96614674, -5.96614674, -6.65929392,
        -6.65929392, -5.96614674, -5.96614674, -5.96614674, -5.96614674,
        -5.96614674, -6.65929392, -5.96614674, -6.65929392, -6.65929392,
        -5.96614674, -6.65929392, -6.65929392, -5.96614674, -5.96614674,
        -6.65929392, -6.65929392, -5.96614674, -5.96614674, -6.65929392,
        -6.65929392, -6.65929392, -5.96614674, -5.96614674, -6.65929392,
        -3.44041809, -5.96614674, -6.65929392, -5.96614674, -5.96614674,
        -5.96614674, -6.65929392, -5.96614674, -5.27299956, -5.96614674,
        -6.65929392, -5.96614674, -5.56068163, -5.96614674, -5.96614674,
        -6.65929392, -5.96614674, -6.65929392, -6.65929392, -5.56068163,
        -5.04985601, -5.96614674, -6.65929392, -6.6

In [148]:
errorCount = 0
for docIndex in testSet:  # iterate over the test-set document indices
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])  # convert the document to a word vector
    if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:  # prediction differs from the true label
        errorCount += 1
        print("classification error", docList[docIndex])
print('the error rate is: ', float(errorCount)/len(testSet))

classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
classification error ['oem', 'adobe', 'microsoft', 'softwares', 'fast', 'order', 'and', 'download', 'microsoft', 'office', 'professional', 'plus', '2007', '2010', '129', 'microsoft', 'windows', 'ultimate', '119', 'adobe', 'photoshop', 'cs5', 'extended', 'adobe', 'acrobat', 'pro', 'extended', 'windows', 'professional', 'thousand', 'more', 'titles']
classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
the error rate is:  0.3
