
# Chapter 4: Naive Bayes

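1. Naive Bayes is a typical generative learning method. A generative method learns the joint probability distribution $P(X,Y)$ from the training data and then derives the posterior distribution $P(Y|X)$. Concretely, the training data are used to estimate $P(X|Y)$ and $P(Y)$, which together give the joint distribution:

$$P(X,Y)=P(Y)P(X|Y)$$

The probabilities can be estimated by maximum likelihood estimation or by Bayesian estimation.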
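As a minimal sketch of the maximum likelihood case, the snippet below estimates $P(Y)$ and $P(X|Y)$ as relative frequencies and multiplies them to get the joint distribution. The toy sample and the names `samples`, `p_y`, `p_x_given_y`, and `p_joint` are made up for illustration and do not come from the textbook.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample of (x, y) pairs with a single categorical feature.
samples = [("S", -1), ("M", -1), ("M", 1), ("S", 1), ("L", 1), ("L", 1)]
n = len(samples)

# Maximum likelihood estimates are relative frequencies.
y_count = Counter(y for _, y in samples)
xy_count = Counter(samples)

p_y = {y: Fraction(c, n) for y, c in y_count.items()}
p_x_given_y = {(x, y): Fraction(c, y_count[y]) for (x, y), c in xy_count.items()}

# Joint distribution from the factorization P(X, Y) = P(Y) P(X | Y).
p_joint = {(x, y): p_y[y] * p for (x, y), p in p_x_given_y.items()}

for (x, y), p in sorted(p_joint.items()):
    print("P(X={}, Y={}) = {}".format(x, y, p))
```

Using `Fraction` keeps the estimates exact, so it is easy to confirm that the joint probabilities sum to 1.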

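2. The basic assumption of naive Bayes is conditional independence:

$$\begin{aligned} P(X=x | Y=c_{k}) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right) \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \end{aligned}$$

This is a strong assumption. It greatly reduces the number of conditional probabilities in the model, which in turn simplifies both learning and prediction. Naive Bayes is therefore efficient and easy to implement; the drawback is that its classification performance is not necessarily high.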
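To give a sense of the reduction, here is a rough count, assuming $K$ classes and $n$ discrete features where feature $j$ takes $S_j$ values. Without the assumption the model specifies $K\prod_j S_j$ conditional probabilities; with it, $K\sum_j S_j$. The function name and the example sizes are hypothetical, and sum-to-one constraints are ignored.

```python
def count_conditional_probabilities(num_classes, values_per_feature):
    """Return (full, naive): how many conditional probabilities each model specifies."""
    full = num_classes
    for s in values_per_feature:
        full *= s                                   # K * S_1 * S_2 * ... * S_n
    naive = num_classes * sum(values_per_feature)   # K * (S_1 + ... + S_n)
    return full, naive

# Hypothetical setting: 2 classes, 10 features with 3 possible values each.
full, naive = count_conditional_probabilities(2, [3] * 10)
print("without the assumption: {}, with it: {}".format(full, naive))
```

The first count grows exponentially with the number of features, the second only linearly.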

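3. Naive Bayes classifies by applying Bayes' theorem to the learned joint probability model:

$$P(Y | X)=\frac{P(X, Y)}{P(X)}=\frac{P(Y) P(X | Y)}{\sum_{Y} P(Y) P(X | Y)}$$

The input $x$ is assigned to the class $y$ with the largest posterior probability:

$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)$$

Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.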
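The equivalence follows in a few lines. This is a sketch of the standard argument, assuming the 0-1 loss $L(Y, f(X))$ equals 1 when $Y \neq f(X)$ and 0 otherwise, and writing $K$ for the number of classes:

$$\begin{aligned} f(x) &=\arg \min _{y} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(Y=c_{k} | X=x\right) \\ &=\arg \min _{y} \sum_{c_{k} \neq y} P\left(Y=c_{k} | X=x\right) \\ &=\arg \min _{y}\left(1-P\left(Y=y | X=x\right)\right) \\ &=\arg \max _{y} P\left(Y=y | X=x\right) \end{aligned}$$

The first line minimizes the conditional expected loss separately at each $x$, which is enough to minimize the expected risk overall.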


Models:

- Gaussian model
- Multinomial model
- Bernoulli model

## Naive Bayes classification algorithm

In [None]:
import numpy as np
import pandas as pd

In [33]:
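class NB:

    def __init__(self, lambda_=1):
        # Additive smoothing strength: 0 gives maximum likelihood estimates,
        # 1 gives Laplace smoothing.
        self.lambda_ = lambda_
        self.classes_ = None
        self.prior_ = None         # P(X^j=a | Y=c), keyed by (feature, value, class)
        self.class_prior_ = None   # P(Y=c)
        self.class_count_ = None   # number of training samples per class
        self.n_values_ = None      # number of distinct values per feature (S_j)

    def fit(self, X, y, debug=False):
        self.classes_ = np.unique(y)
        # Use pandas objects so that value_counts and boolean masks are available.
        X = pd.DataFrame(X)
        y = pd.Series(np.asarray(y).ravel())

        n, k = y.shape[0], len(self.classes_)
        self.class_count_ = y.value_counts()
        # Smoothed prior: (count + lambda) / (N + K * lambda)
        self.class_prior_ = (self.class_count_ + self.lambda_) / (n + k * self.lambda_)
        if debug:
            # Index by label: value_counts() orders by frequency, not by label.
            for c in self.classes_:
                print("P(Y={})={}/{}".format(
                    c, self.class_count_[c] + self.lambda_, n + k * self.lambda_))

        # Conditional probabilities: (count + lambda) / (class count + S_j * lambda)
        self.prior_ = dict()
        self.n_values_ = dict()
        for idx in X.columns:
            self.n_values_[idx] = X[idx].nunique()
            for j in self.classes_:
                p_x_y = X[(y == j).values][idx].value_counts()
                denom = self.class_count_[j] + self.n_values_[idx] * self.lambda_
                for i in p_x_y.index:
                    num = p_x_y[i] + self.lambda_
                    self.prior_[(idx, i, j)] = num / denom
                    if debug:
                        print("P(X^{}={}|Y={})={}/{}".format(idx, i, j, num, denom))

    def predict(self, X, debug=False):
        # X is a single sample, given as a sequence of feature values.
        rst = []
        for class_ in self.classes_:
            if debug:
                msg = "P(Y={})".format(class_)
            py = self.class_prior_[class_]
            pxy = 1
            for idx, x in enumerate(X):
                # A value never seen together with this class has count 0.
                denom = self.class_count_[class_] + self.n_values_[idx] * self.lambda_
                pxy *= self.prior_.get((idx, x, class_), self.lambda_ / denom)
                if debug:
                    msg += "P(X^{}={}|Y={})".format(idx, x, class_)

            rst.append(py * pxy)
            if debug:
                msg += "={:.2f}".format(rst[-1])
                print(msg)

        return self.classes_[np.argmax(rst)]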

### Textbook Example 4.1

In [10]:
data = pd.read_csv("data_4-1.txt", header=None, sep=",")
X = data[data.columns[0:2]]
y = data[data.columns[2]]
clf = NB(lambda_=1)
clf.fit(X, y)
rst = clf.predict([2, "S"])
print(rst)

-1


In [35]:
data = pd.read_csv("data_4-2.txt", header=None, sep=",")
X = data[data.columns[0:2]]
y = data[data.columns[2]]
clf = NB(lambda_=0)
clf.fit(X, y, debug=True)
rst = clf.predict([2, "S"], debug=True)
print(rst)

P(Y=ab)=6/15
P(Y=ok)=9/15
P(X^0=1|Y=ab)=3/6
P(X^0=2|Y=ab)=2/6
P(X^0=3|Y=ab)=1/6
P(X^0=3|Y=ok)=4/9
P(X^0=2|Y=ok)=3/9
P(X^0=1|Y=ok)=2/9
P(X^1=S|Y=ab)=3/6
P(X^1=M|Y=ab)=2/6
P(X^1=L|Y=ab)=1/6
P(X^1=M|Y=ok)=4/9
P(X^1=L|Y=ok)=4/9
P(X^1=S|Y=ok)=1/9
P(Y=ab)P(X^0=2|Y=ab)P(X^1=S|Y=ab)=0.07
P(Y=ok)P(X^0=2|Y=ok)P(X^1=S|Y=ok)=0.02
ab
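The debug output above lists every count needed to redo the calculation by hand. The sketch below copies those counts into plain dictionaries and evaluates $P(Y=c)\prod_j P(X^{(j)}=x^{(j)}|Y=c)$ exactly with fractions, for both $\lambda=0$ and $\lambda=1$. The names `class_count`, `feature_count`, and `score` are illustrative, and the smoothed estimate $(\text{count}+\lambda)/(\text{total}+S_j\lambda)$ is the usual additive-smoothing form, assumed here.

```python
from fractions import Fraction

# Counts copied from the debug output above: 6 samples of class "ab", 9 of class "ok".
class_count = {"ab": 6, "ok": 9}
feature_count = {
    0: {"ab": {1: 3, 2: 2, 3: 1}, "ok": {1: 2, 2: 3, 3: 4}},
    1: {"ab": {"S": 3, "M": 2, "L": 1}, "ok": {"S": 1, "M": 4, "L": 4}},
}

def score(x, label, lam=0):
    """Unnormalized posterior P(Y=label) * prod_j P(X^j=x_j | Y=label) with additive smoothing."""
    n = sum(class_count.values())
    k = len(class_count)
    result = Fraction(class_count[label] + lam, n + k * lam)
    for j, value in enumerate(x):
        counts = feature_count[j][label]
        s_j = len(counts)  # distinct values of feature j (every value occurs in both classes here)
        result *= Fraction(counts.get(value, 0) + lam, class_count[label] + s_j * lam)
    return result

x_new = [2, "S"]
for lam in (0, 1):
    scores = {label: score(x_new, label, lam) for label in class_count}
    total = sum(scores.values())
    for label, s in scores.items():
        print("lambda={} Y={}: score={} posterior={:.3f}".format(
            lam, label, s, float(s / total)))
```

With $\lambda=0$ the two scores are 1/15 and 1/45, which round to the 0.07 and 0.02 printed above. With $\lambda=1$ the scores move closer together, but `ab` is still the predicted class.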


## GaussianNB: Gaussian naive Bayes

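### The iris dataset
Only the first two classes of the iris dataset (the first 100 samples) are used, with all four measurements (sepal length, sepal width, petal length, petal width) as features.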
![Sepal vs. Petal](images/sepal_petal.jpg)

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from collections import Counter
import math

In [38]:
# data
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, :])
    # print(data)
    return data[:,:-1], data[:,-1]

In [59]:
X, y = create_data()
t_size=0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_size)
print("Training set: {}%, test set: {}%".format(int((1-t_size)*100), int(t_size*100)))
print("Test sample: x={} y={}".format(X_test[0], y_test[0]))

Training set: 70%, test set: 30%
Test sample: x=[5.5 3.5 1.3 0.2] y=0.0


Reference: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

### GaussianNB: the Gaussian model

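The class-conditional likelihood of each feature is assumed to be Gaussian.

Probability density function:
$$P(x_i | y_k)=\frac{1}{\sqrt{2\pi\sigma^2_{yk}}}\exp\left(-\frac{(x_i-\mu_{yk})^2}{2\sigma^2_{yk}}\right)$$

Mean (expectation): $\mu$

Variance: $\sigma^2=\frac{\sum(X-\mu)^2}{N}$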
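As a quick numerical check of these formulas, the sketch below estimates $\mu$ and $\sigma^2$ from a handful of made-up feature values and evaluates the density at one point. The function `gaussian_pdf` and the list `values` are illustrative only.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical values of one feature within one class.
values = [4.9, 5.1, 5.0, 5.4, 4.6]
mu = sum(values) / len(values)
var = sum((v - mu) ** 2 for v in values) / len(values)  # divide by N, as in the formula

print("mu={:.3f} var={:.4f} p(5.0)={:.4f}".format(mu, var, gaussian_pdf(5.0, mu, var)))
```

Note that the result is a density rather than a probability, so it can exceed 1 when the variance is small.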

In [36]:
class NaiveBayes:
    def __init__(self):
        self.model = None    # {label: [(mean, stdev), ...]}, one pair per feature
        self.priors = None   # {label: P(Y=label)}

    # Mean (expectation) of one feature column
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # Standard deviation: square root of the variance (dividing by N)
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # Per-feature (mean, stdev) for the training samples of one class
    def summarize(self, train_data):
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # Estimate the prior and the per-feature mean and standard deviation of each class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {
            label: self.summarize(value)
            for label, value in data.items()
        }
        # Class priors P(Y=label), needed for the posterior in the decision rule
        self.priors = {
            label: len(value) / float(len(X))
            for label, value in data.items()
        }
        return 'gaussianNB train done!'

    # Unnormalized posterior P(Y=c) * prod_i p(x_i | Y=c) for each class c
    def calculate_probabilities(self, input_data):
        # self.model: {0.0: [(5.0, 0.37),(3.42, 0.40)], 1.0: [(5.8, 0.449),(2.7, 0.27)]}
        # input_data: [1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = self.priors[label]
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # Predict the class with the largest score for a single sample
    def predict(self, X_test):
        # scores have the form {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        probabilities = self.calculate_probabilities(X_test)
        return max(probabilities, key=probabilities.get)

    # Accuracy on a labelled test set
    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1

        return right / float(len(X_test))

In [58]:
model = NaiveBayes()
model.fit(X_train, y_train)
print("Predicted class: {:.1f} Score: {:.1f}".format(model.predict([4.4,  3.2,  1.3,  0.2]), model.score(X_test, y_test)))

Predicted class: 0.0 Score: 1.0
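The products computed in `calculate_probabilities` can become very small, as the sample values in its comments suggest. A common remedy, sketched here and not part of the class above, is to add log-densities instead of multiplying densities. The `summaries` dictionary is copied from the comment in `calculate_probabilities`; the equal `priors`, the input point, and the helper names are assumptions made for illustration.

```python
import math

def log_gaussian(x, mean, stdev):
    """Log of the Gaussian density at x, computed without calling exp."""
    return -0.5 * math.log(2 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2 * stdev ** 2)

def log_scores(x, summaries, priors):
    """log P(Y=c) + sum_i log p(x_i | Y=c) for each class c."""
    return {
        label: math.log(priors[label]) + sum(
            log_gaussian(xi, mean, stdev) for xi, (mean, stdev) in zip(x, stats))
        for label, stats in summaries.items()
    }

summaries = {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
priors = {0.0: 0.5, 1.0: 0.5}  # assumed equal priors

scores = log_scores([5.1, 3.5], summaries, priors)
print(max(scores, key=scores.get))
```

Because the logarithm is monotonic, the class with the largest log-score is also the class with the largest product, so the prediction is unchanged.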


### scikit-learn example

In [60]:
from sklearn.naive_bayes import GaussianNB

In [61]:
clf = GaussianNB()
clf.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [69]:
print("Predicted class: {} Score: {}".format(clf.predict([[4.4,  3.2,  1.3,  0.2]]), clf.score(X_test, y_test)))

Predicted class: [0.] Score: 1.0


In [14]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB  # Bernoulli and multinomial models
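The notebook imports these two classes without using them. As a small sketch of how they might be applied, the example below fits both to a made-up matrix of word counts. The data and the names `X_counts`, `y_toy`, `mnb`, and `bnb` are hypothetical; `alpha` is the additive smoothing parameter, and `binarize=0.0` makes `BernoulliNB` treat any positive count as "present".

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts for 4 documents over a 3-word vocabulary.
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 2, 1],
                     [0, 3, 0]])
y_toy = np.array([0, 0, 1, 1])

# Multinomial model: uses the counts themselves.
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y_toy)
# Bernoulli model: uses only presence or absence of each word.
bnb = BernoulliNB(alpha=1.0, binarize=0.0).fit(X_counts, y_toy)

x_doc = np.array([[1, 0, 0]])
print("MultinomialNB:", mnb.predict(x_doc), mnb.predict_proba(x_doc))
print("BernoulliNB:  ", bnb.predict(x_doc), bnb.predict_proba(x_doc))
```

The multinomial model suits count features, while the Bernoulli model suits binary features and also penalizes the absence of a word.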

----
Reference code: https://github.com/wzyonggege/statistical-learning-method

Chinese annotations by: 机器学习初学者

WeChat official account ID: ai-start-com

Environment: Python 3.5+

All code has been tested.