# Naive Bayesian
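* Naive Bayes classification (Naïve Bayes Classification) is a type of probabilistic classifier that applies Bayes' theorem under the assumption that the features are independent of one another.
* Naive Bayes is a technique for constructing classifiers. It is not trained with a single algorithm; instead, it covers a family of algorithms based on a common principle.
* Every naive Bayes classifier assumes that all feature values are independent of each other given the class. For example, the features that allow a fruit to be classified as an apple (round, red, 10 cm in diameter) are assumed to have no correlation with one another, and each feature is treated as contributing independently to the probability that the fruit is an apple.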

### Bayes' theorem
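* A theorem that describes the relationship between the prior and posterior probabilities of two random variables.
* Under the Bayesian interpretation of probability, Bayes' theorem describes how a prior probability is updated into a posterior probability when new evidence is presented.
* P(A|B) = P(B|A)P(A)/P(B)
    * P(A|B) - the conditional probability that event A occurs given that event B has occurred
    * P(B|A) - the conditional probability that event B occurs given that event A has occurred
    * P(A) - the probability that event A occurs, with no information about B
    * P(B) - the probability that event B occurs, with no information about A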
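
As a small worked example of the formula above, the sketch below computes a posterior from a prior and two likelihoods. The events and all the numbers are hypothetical and chosen only for illustration; `posterior` is a helper name introduced here, not part of the notebook's `bayes` module.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A) and P(B) using Bayes' theorem."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, for illustration only:
# A = "message is spam", B = "message contains the word 'free'".
p_a = 0.2                # P(A): prior probability of spam
p_b_given_a = 0.5        # P(B|A): 'free' appears in half of the spam
p_b_given_not_a = 0.05   # P(B|not A): 'free' appears in 5% of non-spam

# The law of total probability gives P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(round(p_a_given_b, 4))  # 0.7143
```

With these assumed values, seeing the word raises the probability of spam from the prior 0.2 to a posterior of about 0.71.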

### Advantages
* Predicting the class of a test data set is easy and fast. Naive Bayes also performs well in multi-class prediction.
* When the independence assumption holds, a naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
* It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution (bell curve) is assumed, which is a strong assumption.

### Disadvantages
* If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a probability of 0 (zero) and cannot make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
* Naive Bayes is also known to be a poor probability estimator, so the probability outputs from predict_proba should not be taken too seriously.
* Another limitation of naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
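
The zero-frequency problem and Laplace smoothing can be illustrated with a minimal sketch. The feature values below are made up for the example, and `category_likelihoods` and its `alpha` parameter are hypothetical names introduced here; this is one simple way to write add-one smoothing, not the only one.

```python
from collections import Counter

def category_likelihoods(values, categories, alpha=1.0):
    """Estimate P(value | class) for one categorical feature.

    alpha=0 gives raw relative frequencies; alpha=1 is Laplace (add-one) smoothing.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical training values of a "color" feature for the class "apple".
train_colors = ["red", "red", "green", "red"]
all_colors = ["red", "green", "yellow"]   # "yellow" never appears in training

raw = category_likelihoods(train_colors, all_colors, alpha=0.0)
smoothed = category_likelihoods(train_colors, all_colors, alpha=1.0)

print(raw["yellow"])       # 0.0, which zeroes the whole product of likelihoods
print(smoothed["yellow"])  # 1/7, small but non-zero
```

The smoothed estimates still sum to 1 over the categories, so they remain a valid probability distribution.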

### General approach to naive Bayes classification
1. Collect: Any method. We’ll use RSS feeds in this chapter.
2. Prepare: Numeric or Boolean values are needed.
3. Analyze: With many features, plotting features isn’t helpful. Looking at histograms is a better idea.
4. Train: Calculate the conditional probabilities of the independent features.
5. Test: Calculate the error rate.
6. Use: One common application of naïve Bayes is document classification. You can use naïve Bayes in any classification setting.

In [1]:
# directory setup
import os
myhome=os.path.expanduser('~')
mywd=os.path.join(myhome,'Desktop/S_ParkMinJi/src/')
mytxt=os.path.join(myhome,'Desktop/S_ParkMinJi/doc/')
print myhome, mywd, mytxt

C:\Users\PARKMINJI C:\Users\PARKMINJI\Desktop/S_ParkMinJi/src/ C:\Users\PARKMINJI\Desktop/S_ParkMinJi/doc/


## 1. Machine Learning in Action examples

In [2]:
%cd {mywd}

C:\Users\PARKMINJI\Desktop\S_ParkMinJi\src


### 1-1. Building word vectors from text

In [3]:
from numpy import *

# Create some example data to experiment with
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

# Create a list of the unique words that appear across all documents
def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets - or
    return list(vocabSet)

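# Mark whether each vocabulary word is present in the given document: takes the vocabulary list and a document, returns a vector of 1s and 0s
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList) # create a vector the same length as the vocabulary, filled with 0s
    for word in inputSet: # check each word in the document one by one
        if word in vocabList: # if the word is in the vocabulary
            returnVec[vocabList.index(word)] = 1 # set the entry for that word in the output vector to 1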
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

In [4]:
import bayes
from numpy import *

listoPosts, listclasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listoPosts)
print myVocabList
print listoPosts[0]
print bayes.setOfWords2Vec(myVocabList, listoPosts[0])
print listoPosts[3]
print bayes.setOfWords2Vec(myVocabList, listoPosts[3])

['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
['stop', 'posting', 'stupid', 'worthless', 'garbage']
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


In [27]:
#Count the number of documents in each class
#for every training document: 
#    for each class: if a token appears in the document → increment the count for that token 
#    increment the count for tokens
#for each class: 
#    for each token: 
#        divide the token count by the total token count to get conditional probabilities 
#return conditional probabilities for each class

In [75]:
#training
def trainNB0(trainMatrix,trainCategory):
    # initialize the probabilities
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs) # prior probability (= probability that a document is abusive); Base Rate, Frame theory, Anchor
    p0Num = zeros(numWords); p1Num = zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    # add up the word vectors for each class
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom   
    return p0Vect,p1Vect,pAbusive

In [77]:
from numpy import *
reload(bayes)
listoPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listoPosts)

trainMat = []
for postinDoc in listoPosts:
    trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))

p0V,p1V,pAb=trainNB0(trainMat, listClasses)
print pAb
print p0V
print p1V

0.5
[ 0.04166667  0.04166667  0.04166667  0.          0.          0.04166667
  0.04166667  0.04166667  0.          0.04166667  0.04166667  0.04166667
  0.04166667  0.          0.          0.08333333  0.          0.
  0.04166667  0.          0.04166667  0.04166667  0.          0.04166667
  0.04166667  0.04166667  0.          0.04166667  0.          0.04166667
  0.04166667  0.125     ]
[ 0.          0.          0.          0.05263158  0.05263158  0.          0.
  0.          0.05263158  0.05263158  0.          0.          0.
  0.05263158  0.05263158  0.05263158  0.05263158  0.05263158  0.
  0.10526316  0.          0.05263158  0.05263158  0.          0.10526316
  0.          0.15789474  0.          0.05263158  0.          0.          0.        ]


In [79]:
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
    
def testingNB():
    listOPosts,listClasses = bayes.loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print thisDoc
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

In [81]:
reload(bayes)
bayes.testingNB()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
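
The revised `trainNB0` initializes the counts to 1 (so that no word gets a zero probability) and stores log probabilities, and `classifyNB` adds logs instead of multiplying probabilities. The sketch below shows why the log form matters numerically. The 400 likelihood values are hypothetical and chosen only to trigger floating-point underflow.

```python
import math

# 400 hypothetical word likelihoods, each equal to 0.01 (illustrative values).
probs = [0.01] * 400

# Multiplying directly: 0.01**400 = 1e-800 is far below the smallest
# positive double, so the running product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1842.07
```

Because the logarithm is monotonic, comparing the two log scores selects the same class as comparing the two products would.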


In [82]:
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
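
To make the difference between the set-of-words model (`setOfWords2Vec`) and the bag-of-words model (`bagOfWords2VecMN`) concrete, here is a minimal self-contained sketch. The helper names `set_of_words` and `bag_of_words`, the three-word vocabulary, and the sample document are illustrative assumptions, not part of the notebook's `bayes` module.

```python
def set_of_words(vocab, words):
    """Binary vector: 1 if the vocabulary word occurs at least once."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

def bag_of_words(vocab, words):
    """Count vector: number of occurrences of each vocabulary word."""
    return [words.count(w) for w in vocab]

vocab = ["dog", "stupid", "my"]              # tiny illustrative vocabulary
doc = ["stupid", "dog", "stupid", "park"]    # "park" is out of vocabulary

print(set_of_words(vocab, doc))  # [1, 1, 0]
print(bag_of_words(vocab, doc))  # [1, 2, 0]
```

The repeated word is recorded once in the set model and twice in the bag model. Out-of-vocabulary words are ignored in both.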

In [85]:
mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
print mySent.split()

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']


In [86]:
import re
regEx = re.compile(r'\W+')  # split on runs of non-word characters; \W+ never matches the empty string
listofTokens = regEx.split(mySent)
print listofTokens

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']


In [88]:
print [tok.lower() for tok in listofTokens if len(tok)>0]

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
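
The tokenization steps above can be collected into one helper. This is a sketch written for Python 3, where `print` is a function. Since Python 3.7, `re.split` also splits on empty matches, so a pattern that can match the empty string (such as `\W*`) would break every word into single characters. The sketch therefore uses `\W+`. The name `tokenize` and its `min_len` parameter are hypothetical.

```python
import re

def tokenize(text, min_len=1):
    """Split on runs of non-word characters and lower-case the tokens."""
    tokens = re.split(r"\W+", text)
    # Dropping tokens shorter than min_len also removes the empty string
    # that re.split leaves after trailing punctuation.
    return [t.lower() for t in tokens if len(t) >= min_len]

sentence = "This book is the best book on Python or M.L. I have ever laid eyes upon."
print(tokenize(sentence))
print(tokenize(sentence, min_len=3))
```

With `min_len=3` the helper keeps only tokens longer than two characters, which is the same filter `textParse` applies.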


In [95]:
emailText = open('data/email/spam/6.txt').read()
listofTokens = regEx.split(emailText)
listofTokens

['OEM',
 'Adobe',
 'Microsoft',
 'softwares',
 'Fast',
 'order',
 'and',
 'download',
 'Microsoft',
 'Office',
 'Professional',
 'Plus',
 '2007',
 '2010',
 '129',
 'Microsoft',
 'Windows',
 '7',
 'Ultimate',
 '119',
 'Adobe',
 'Photoshop',
 'CS5',
 'Extended',
 'Adobe',
 'Acrobat',
 '9',
 'Pro',
 'Extended',
 'Windows',
 'XP',
 'Professional',
 'thousand',
 'more',
 'titles']

In [108]:
%cd {mywd}

C:\Users\PARKMINJI\Desktop\S_ParkMinJi\src


In [113]:
def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W+', bigString)  # \W+ never matches the empty string
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 
    
def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('data/email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('data/email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

In [117]:
reload(bayes)
bayes.spamTest()

classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.1
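
`spamTest` holds out 10 of the 50 emails at random, so the reported error rate changes from run to run. The hold-out step on its own can be sketched as below, using the standard-library `random` module rather than NumPy. `holdout_split` and its `seed` parameter are hypothetical names introduced for this example.

```python
import random

def holdout_split(n_items, n_test, seed=None):
    """Return (train_indices, test_indices) for a random hold-out split."""
    rng = random.Random(seed)
    test = rng.sample(range(n_items), n_test)   # draw without replacement
    test_set = set(test)
    train = [i for i in range(n_items) if i not in test_set]
    return train, test

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
```

Repeating the split with different seeds and averaging the error rates gives a steadier estimate than a single run.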


## 2. Naive Bayesian examples

In [7]:
%cd {mywd}

C:\Users\PARKMINJI\Desktop\S_ParkMinJi\src


In [8]:
import feedparser
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')

ny['entries']
len(ny['entries'])

25

In [9]:
def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = range(2*minLen); testSet=[]           #create test set
    for i in range(20):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V
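
`calcMostFreq` calls `fullText.count(token)` once per vocabulary word, which rescans the full text each time. One possible alternative, sketched here under the assumption that the vocabulary is built from the same tokens as the full text, is `collections.Counter` from the standard library. `most_frequent` is a hypothetical helper name, and the token list is illustrative.

```python
from collections import Counter

def most_frequent(tokens, n=30):
    """Return the n most common (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common(n)

tokens = ["the", "cat", "the", "dog", "the", "cat"]   # illustrative input
print(most_frequent(tokens, n=2))  # [('the', 3), ('cat', 2)]
```

The result has the same shape as the sorted list `calcMostFreq` returns, so the loop that removes the top words from the vocabulary would not need to change.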

In [15]:
reload(bayes)
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')

vocabList, pSF, pNY = bayes.localWords(ny,sf)
vocabList, pSF, pNY = bayes.localWords(ny,sf)

the error rate is:  0.25
the error rate is:  0.5


In [17]:
reload(bayes)
bayes.getTopWords(ny,sf)

the error rate is:  0.6
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
really
massage
how
since
girls
friend
need
relationship
married
there
woman
near
current
body
let
alone
change
today
more
male
times
needs
things
hands
kayaking
each
skate
guy
already
open
guess
bet
take
situations
should
going
hopefully
stuff
asleep
reading
both
been
search
cant
well
yes
interested
know
works
down
long
lot
happy
trying
nice
friends
time
relieve
month
woods
fan
fall
drinking
cool
skills
reciprocation
havent
cmt
heading
enjoy
says
click
drinks
abd
indian
visting
here
slow
mild
boy
involved
thats
crime
from
angry
destination
wondering
knows
company
mellow
car
work
learn
meet
give
preferably
watching
toxic
gaming
lesbian
man
stress
short
childhood
help
trade
playful
still
hayward
fit
acti
willing
break
therapy
now
good
goofy
frriends
went
status
everyone
house
year
our
saturday
caucasian
little
teach
care
could
skateboard
thing
first
platonic
relate
lol
playof
hunt
slept
patna
boring
