## Conditional probability 

$$p(c|x)=\frac{p(x|c)p(c)}{p(x)}$$

$p(c|x)$ is the probability of class $c$ given that $x$ has been observed.
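A quick numeric check of the formula can make it concrete. The sketch below uses made-up numbers (a hypothetical spam prior and word frequencies, none of which come from the data used later) and computes $p(x)$ with the law of total probability.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' rule.
p_c = 0.3               # prior p(c): 30% of all messages are spam
p_x_given_c = 0.6       # likelihood p(x|c): 60% of spam contains the word
p_x_given_not_c = 0.1   # p(x|not c): 10% of non-spam contains the word

# Evidence p(x) via the law of total probability.
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Posterior p(c|x) = p(x|c) p(c) / p(x).
p_c_given_x = p_x_given_c * p_c / p_x
print(round(p_c_given_x, 3))
```

With these assumed inputs, seeing the word raises the probability of spam from the prior 0.3 to about 0.72.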

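### Naïve Bayes  
With two features, if each feature needs 10 samples on its own, estimating them jointly needs 10\*10 samples.  
Why 10\*10? Because the features are not independent: they are correlated, so every combination of their values has to be observed. If the features are assumed independent, each can be estimated separately and about 10+10 samples are enough.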

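What does feature independence mean?  
It means that features have no direct connection to one another: knowing the value of one tells you nothing about another, and the order in which they occur has little effect on the result.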
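Stated as a formula, the independence assumption says that the class-conditional likelihood factorizes, which is what allows each feature to be estimated separately. For two features $x$ and $y$ and a class $c$:

$$p(x,y|c)=p(x|c)\,p(y|c)$$

This is an assumption made for convenience, not a property of real data; it is the "naïve" part of naïve Bayes.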

## Classifying with conditional probabilities

$$p(c_i|x,y)=\frac{p(x,y|c_i)p(c_i)}{p(x,y)}$$  

if $p(c_1|x,y)>p(c_2|x,y)$, the class is $c_1$  
if $p(c_1|x,y)<p(c_2|x,y)$, the class is $c_2$
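Because the denominator $p(x,y)$ is the same for every class, the comparison only needs the numerators $p(x,y|c_i)p(c_i)$. A minimal sketch of that rule, with hypothetical likelihoods and priors and a helper name (`classify`) introduced here only for illustration:

```python
def classify(likelihoods, priors):
    """Return the index of the class with the largest p(x,y|c_i) * p(c_i)."""
    scores = [lik * prior for lik, prior in zip(likelihoods, priors)]
    return scores.index(max(scores))

# Hypothetical values for two classes c_1 (index 0) and c_2 (index 1).
likelihoods = [0.02, 0.05]   # p(x,y|c_1), p(x,y|c_2)
priors = [0.7, 0.3]          # p(c_1), p(c_2)

best = classify(likelihoods, priors)
print("predicted class: c_%d" % (best + 1))
```

Here $c_2$ wins (0.015 against 0.014) even though $c_1$ has the larger prior, because its likelihood is high enough to compensate.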

## Document classification with naïve Bayes

## Classifying text with Python
### Prepare: making word vectors from text
- Word list to vector function

In [59]:
# Sample data: 6 tokenized posts of varying length; returns the dataset and the list of labels
def loadDataSet():
    postingList = [
        ['my', 'dog', 'has', 'fela', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive, 0 = not abusive
    return postingList, classVec

In [60]:
# Collect every word that occurs in the sample data into a set (removing duplicates), then return it as a list
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)  # return unique words list

In [61]:
# inputSet is the collection of input words; for each word found in the vocabulary list, set the matching position of the output vector to 1.
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary" % word)
    return returnVec

In [62]:
# Test: load the data first, then build the vocabulary list and convert two posts to vectors
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
Vec0 = setOfWords2Vec(myVocabList, listOPosts[0])
Vec3 = setOfWords2Vec(myVocabList, listOPosts[3])
print(myVocabList)
print(Vec0)
print(Vec3)

['problems', 'garbage', 'food', 'has', 'not', 'buying', 'worthless', 'posting', 'him', 'love', 'stop', 'steak', 'licks', 'quit', 'dalmation', 'mr', 'is', 'I', 'stupid', 'please', 'help', 'ate', 'how', 'dog', 'maybe', 'fela', 'park', 'my', 'take', 'cute', 'to', 'so']
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
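Two practical details are worth noting about the output above. The vocabulary order comes from iterating a `set` of strings, which can differ between Python runs, and `vocabList.index(word)` scans the list for every word. The sketch below is one possible variant, not the notebook's method: it sorts the vocabulary for a stable order and uses a dictionary for lookups. The function name and the tiny vocabulary are illustrative assumptions.

```python
def set_of_words_to_vec(vocab_list, input_words):
    """Binary presence vector; unknown words are skipped silently."""
    # Map each vocabulary word to its position once, instead of calling
    # vocab_list.index(word) (a linear scan) for every input word.
    index_of = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_words:
        i = index_of.get(word)
        if i is not None:
            vec[i] = 1
    return vec

# Sorting gives the same column order on every run.
vocab = sorted({'my', 'dog', 'has', 'stupid', 'garbage'})
vec = set_of_words_to_vec(vocab, ['my', 'dog', 'unknown'])
print(vocab)
print(vec)
```

For a 32-word vocabulary the speed difference is negligible; the stable ordering is the more useful part when comparing outputs between runs.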


In [63]:
import numpy as np


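# trainMatrix: document-by-word matrix; trainCategory: the label of each document
def trainNB0(trainMatrix, trainCategory):
    # number of training documents (rows of the matrix)
    numTrainDocs = len(trainMatrix)
    # number of vocabulary words (columns of the matrix)
    numWords = len(trainMatrix[0])
    # calculate the prior probability of an abusive document
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Unsmoothed initialization (kept for reference): a word never seen in a
    # class would get probability 0 and zero out the whole product.
    # p0Num = np.zeros(numWords)
    # p1Num = np.zeros(numWords)
    # p0Denom = 0.0
    # p1Denom = 0.0
    # Start the counts at 1 and the denominators at 2 so that no word
    # probability is exactly 0.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0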

    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
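    # per-word conditional probabilities for class 1 (p1) and class 0 (p0)
    # Without logs this would be:
    # p1Vect = p1Num / p1Denom
    # p0Vect = p0Num / p0Denom
    # Store logs instead: ln(a*b) = ln(a) + ln(b), so the product of many
    # small probabilities becomes a sum and does not underflow.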
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)

    return p0Vect, p1Vect, pAbusive

In [64]:
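# Load the sample data: 6 posts and their 6 labels.
listOPosts, listClasses = loadDataSet()
# Build the vocabulary: every word in the 6 posts, without duplicates.
myVocabList = createVocabList(listOPosts)
# Training matrix: one row per post, with 1 marking each vocabulary word that occurs in it.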
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
print('p0V:', p0V)
print('p1V:', p1V)
print('pAb:', pAb)

p0V: [-2.56494936 -3.25809654 -3.25809654 -2.56494936 -3.25809654 -3.25809654
 -3.25809654 -3.25809654 -2.15948425 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -3.25809654 -2.56494936 -2.56494936 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -2.56494936 -2.56494936 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -3.25809654 -1.87180218 -3.25809654 -2.56494936
 -2.56494936 -2.56494936]
p1V: [-3.04452244 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -2.35137526
 -1.94591015 -2.35137526 -2.35137526 -3.04452244 -2.35137526 -3.04452244
 -3.04452244 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -1.65822808 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -1.94591015
 -2.35137526 -3.04452244 -2.35137526 -3.04452244 -2.35137526 -3.04452244
 -2.35137526 -3.04452244]
pAb: 0.5
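The values above are logarithms because multiplying many small probabilities quickly underflows in floating point. A small demonstration, using 400 hypothetical word probabilities of 0.01 each (numbers picked only to trigger the effect):

```python
import math

# 400 hypothetical per-word probabilities, each equal to 0.01.
probs = [0.01] * 400

# Direct product: the true value is 1e-800, far below what a float can hold.
direct_product = 1.0
for p in probs:
    direct_product *= p

# Sum of logs: the same quantity on a log scale stays finite.
log_sum = sum(math.log(p) for p in probs)

print(direct_product)
print(log_sum)
```

The product collapses to `0.0`, so every class would score the same, while the log sum (roughly -1842) can still be compared between classes.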


### Test: modifying the classifier for real-world conditions 
- Naïve Bayes classify function

In [65]:
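def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log-score of each class: sum of the log word probabilities for the words
    # present in the document, plus the log prior of the class
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    # pick the class with the larger score (ties go to class 0)
    if p1 > p0:
        return 1
    else:
        return 0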
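The score compared in `classifyNB` is the logarithm of the numerator of Bayes' rule under the independence assumption. Writing $v_j$ for the $j$-th entry of `vec2Classify` and $w_j$ for the $j$-th vocabulary word (symbols introduced here for illustration), the quantity computed for class $c_i$ is:

$$\ln p(c_i)+\sum_j v_j \ln p(w_j|c_i)$$

The denominator of Bayes' rule is the same for both classes, so leaving it out does not change which score is larger.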

In [66]:
def testingNB():
    # use local names consistently instead of relying on notebook globals
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as : ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as : ', classifyNB(thisDoc, p0V, p1V, pAb))

In [67]:
testingNB()

['love', 'my', 'dalmation'] classified as :  0
['stupid', 'garbage'] classified as :  1


### Prepare: the bag-of-words document model
- Naïve Bayes bag-of-words model

In [68]:
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
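The only difference from the set-of-words model is `+= 1` in place of `= 1`, so repeated words are counted rather than just flagged. The sketch below puts both behaviours in one hypothetical helper (`words_to_vec`, with an assumed `count` switch) to show the difference on a tiny made-up document.

```python
def words_to_vec(vocab_list, words, count=False):
    """Presence vector (count=False) or count vector (count=True)."""
    vec = [0] * len(vocab_list)
    for word in words:
        if word in vocab_list:
            i = vocab_list.index(word)
            # bag-of-words increments; set-of-words just marks presence
            vec[i] = vec[i] + 1 if count else 1
    return vec

vocab = ['dog', 'stupid', 'my']        # hypothetical tiny vocabulary
doc = ['stupid', 'dog', 'stupid']      # 'stupid' occurs twice

set_vec = words_to_vec(vocab, doc)
bag_vec = words_to_vec(vocab, doc, count=True)
print(set_vec)   # [1, 1, 0]
print(bag_vec)   # [1, 2, 0]
```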

## Example: classifying spam email with naïve Bayes
- Prepare: tokenizing text 

In [72]:
mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'
print(mySent.split())

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']


In [76]:
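import re
# Earlier approach, kept for reference. On Python 3.7+ the pattern '\\W*' can
# match the empty string, so split() would also split between every character.
# r = re.compile('\\W*')
# listOfTokens = r.split(mySent)
r = re.compile('\\w+')
listOfTokens = r.findall(mySent)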
print(listOfTokens)

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']


In [78]:
allWords = [tok for tok in listOfTokens if len(tok) > 0]
print(allWords)

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
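The token list still mixes upper and lower case and contains fragments such as `'M'` and `'L'`. One common cleanup, sketched here as an assumption and not as a step the notebook has taken, is to lowercase every token and drop very short ones. The function name `text_parse` and the threshold `min_len=3` are illustrative choices.

```python
import re

def text_parse(text, min_len=3):
    """Split text into lowercase word tokens of at least min_len characters."""
    tokens = re.findall(r'\w+', text)
    return [tok.lower() for tok in tokens if len(tok) >= min_len]

my_sent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
parsed = text_parse(my_sent)
print(parsed)
```

Lowercasing makes `'This'` and `'this'` share one vocabulary entry; the length filter removes fragments but also drops real short words such as `'is'` and `'on'`, so the threshold is a trade-off.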


In [81]:
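with open('email/ham/6.txt') as f:
    emailText = f.read()
# '\\W+' (one or more non-word characters) avoids the empty matches that '\\W*' allows
r = re.compile('\\W+')
listOfTokens = r.split(emailText)
print(listOfTokens)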

['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one', 'Google', 'Groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'or', 'files', 'we', 'are', 'writing', 'to', 'inform', 'you', 'that', 'we', 'will', 'no', 'longer', 'be', 'supporting', 'these', 'features', 'starting', 'February', '2011', 'We', 'made', 'this', 'decision', 'so', 'that', 'we', 'can', 'focus', 'on', 'improving', 'the', 'core', 'functionalities', 'of', 'Google', 'Groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'Instead', 'of', 'these', 'features', 'we', 'encourage', 'you', 'to', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'as', 'Google', 'Docs', 'and', 'Google', 'Sites', 'For', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'on', 'Google', 'Sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en',

