# Naive Bayes (the easy way)

We'll take a shortcut and use sklearn.naive_bayes to train a spam classifier. Most of the code below just loads the training emails into a pandas DataFrame that we can work with:

In [1]:
import os
import io
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for each email.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 maps every byte to a character, so decoding never fails.
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body.
                        inBody = True
            # Each line keeps its own trailing newline, so joining with '\n'
            # doubles the line breaks; the word counts are unaffected.
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    # Build one row per email, indexed by file path, tagged with its class.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead.
data = pd.concat([
    dataFrameFromDirectory('./emails/spam', 'spam'),
    dataFrameFromDirectory('./emails/ham', 'ham'),
])
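To see what the header-skipping loop in readFiles does without touching the file system, here is a minimal sketch that applies the same logic to an in-memory string. The helper name `extract_body` and the sample email are made up for illustration. Unlike readFiles, this version joins the lines with an empty string, since each line already ends in a newline.

```python
import io

def extract_body(raw_email):
    """Return the text after the first blank line (the header/body separator)."""
    in_body = False
    lines = []
    for line in io.StringIO(raw_email):
        if in_body:
            lines.append(line)
        elif line == '\n':
            in_body = True
    # Lines keep their trailing newline, so no extra separator is needed.
    return ''.join(lines)

# Hypothetical sample email: two header lines, a blank line, then the body.
sample = (
    "From: alice@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "Are you free at noon?\n"
    "See you then.\n"
)
body = extract_body(sample)
print(repr(body))  # 'Are you free at noon?\nSee you then.\n'
```

A message with no blank line at all yields an empty body, which is also how readFiles behaves.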


Let's have a look at that DataFrame:

In [32]:
data

| | message | class |
|---|---|---|
| `./emails/spam\00001.7848dde101aa985090474a91ec93fcf0` | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` | spam |
| `./emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| `./emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| `./emails/spam\00004.eac8de8d759b7e74154f142194282724` | `##############################################...` | spam |
| `./emails/spam\00005.57696a39d7d84318ce497886896bf90d` | `I thought you might like these:\n\n1) Slim Dow...` | spam |
| ... | ... | ... |
| `./emails/ham\02496.aae0c81581895acfe65323f344340856` | `Man killed 'trying to surf' on Tube train \n\n...` | ham |
| `./emails/ham\02497.60497db0a06c2132ec2374b2898084d3` | `Hi Gianni,\n\n\n\nA very good resource for thi...` | ham |
| `./emails/ham\02498.09835f512f156da210efb99fcc523e21` | `Gianni Ponzi wrote:\n\n> I have a prob when tr...` | ham |
| `./emails/ham\02499.b4af165650f138b10f9941f6cc5bce3c` | `Neale Pickett <neale@woozle.org> writes:\n\n\n...` | ham |


Now we will use a CountVectorizer to split each message into words and count how often each word occurs, then feed those counts into a MultinomialNB classifier. Its fit() method takes two inputs: the training data (the word counts) and the corresponding targets (the class label of each email). Call fit() and we've got a trained spam filter ready to go.

In [3]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)   # one row per email, one column per word, values are word counts

classifier = MultinomialNB()
targets = data['class'].values       # class label ('spam' or 'ham') for each email
classifier.fit(counts, targets)

MultinomialNB()
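To make the word-count matrix concrete, here is a small sketch on a two-message toy corpus. The messages and the `toy_` variable names are invented for illustration and are not part of the email data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, just to show the shape of the output.
toy_messages = ["free money now", "free lunch now now"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                 # ['free', 'lunch', 'money', 'now']
print(toy_counts.toarray())  # [[1 0 1 1]
                             #  [1 1 0 2]]
```

Each row is one message and each column is one word of the learned vocabulary, so the second row records that "now" appears twice in the second message.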

Let's try it out on two made-up emails (spam = junk mail, ham = legitimate mail):

In [6]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions       

array(['spam', 'ham'], dtype='<U4')
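predict() only returns the winning label. If you also want to know how confident the model is, MultinomialNB offers predict_proba(). The sketch below trains on a tiny invented corpus so that it runs on its own; the messages, labels, and `toy_` names are assumptions for illustration, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical four-message training set.
toy_messages = [
    "free money now",
    "win free prize",
    "lunch meeting tomorrow",
    "golf game tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# One row per input message, one column per class (ordered as in classes_).
probs = toy_classifier.predict_proba(toy_vectorizer.transform(["free prize now"]))
for label, p in zip(toy_classifier.classes_, probs[0]):
    print(label, round(p, 3))
```

In the notebook itself, the equivalent call would be classifier.predict_proba(example_counts), with the columns ordered as in classifier.classes_.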

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see whether you get the results you expect.

If you really want to challenge yourself, try applying train/test to this spam classifier: hold out a subset of the ham and spam emails, train on the rest, and see how well the classifier predicts the held-out emails.

In [25]:
from sklearn.model_selection import train_test_split

In [39]:
import pandas as pd
# Encode the labels as integers; categories are sorted alphabetically, so ham=0 and spam=1.
data['class_ord'] = pd.Categorical(data['class']).codes
data

| | message | class | class_ord |
|---|---|---|---|
| `./emails/spam\00001.7848dde101aa985090474a91ec93fcf0` | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` | spam | 1 |
| `./emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam | 1 |
| `./emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam | 1 |
| `./emails/spam\00004.eac8de8d759b7e74154f142194282724` | `##############################################...` | spam | 1 |
| `./emails/spam\00005.57696a39d7d84318ce497886896bf90d` | `I thought you might like these:\n\n1) Slim Dow...` | spam | 1 |
| ... | ... | ... | ... |
| `./emails/ham\02496.aae0c81581895acfe65323f344340856` | `Man killed 'trying to surf' on Tube train \n\n...` | ham | 0 |
| `./emails/ham\02497.60497db0a06c2132ec2374b2898084d3` | `Hi Gianni,\n\n\n\nA very good resource for thi...` | ham | 0 |
| `./emails/ham\02498.09835f512f156da210efb99fcc523e21` | `Gianni Ponzi wrote:\n\n> I have a prob when tr...` | ham | 0 |
| `./emails/ham\02499.b4af165650f138b10f9941f6cc5bce3c` | `Neale Pickett <neale@woozle.org> writes:\n\n\n...` | ham | 0 |
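One possible way to finish the train/test activity is sketched below. It is not the notebook author's solution: the helper name `holdout_accuracy`, the 25% test size, the fixed seed, and the toy corpus are all assumptions made so that the example runs on its own. The vectorizer is fitted on the training messages only, so the held-out emails do not leak into the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def holdout_accuracy(messages, labels, test_size=0.25, seed=0):
    """Train on one part of the data and return accuracy on the held-out part."""
    train_msgs, test_msgs, train_y, test_y = train_test_split(
        messages, labels,
        test_size=test_size,
        random_state=seed,
        stratify=labels,  # keep the spam/ham ratio the same in both parts
    )
    vec = CountVectorizer()
    clf = MultinomialNB()
    clf.fit(vec.fit_transform(train_msgs), train_y)  # fit on training data only
    return clf.score(vec.transform(test_msgs), test_y)

# Hypothetical toy corpus; with the real data you would pass
# data['message'].values and data['class'].values instead.
toy_messages = [
    "free money now", "win a free prize", "cheap pills online", "claim your free cash",
    "lunch meeting tomorrow", "golf game on sunday", "project report attached", "see you at dinner",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

accuracy = holdout_accuracy(toy_messages, toy_labels)
print("held-out accuracy:", accuracy)
```

With only eight messages the score is not meaningful; the point is the workflow. On the full DataFrame, the same function gives an estimate of how well the classifier generalizes to emails it has never seen.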
