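# Naive Bayes

For continuous data, the likelihood of each feature given the class is assumed to be a Gaussian probability density:
$$P(x_i | y_k)=\frac{1}{\sqrt{2\pi}\,\sigma_{yk}}\exp\left(-\frac{(x_i-\mu_{yk})^2}{2\sigma^2_{yk}}\right)$$

Here $\mu$ is the mean (expected value) and $\sigma^2=\frac{\sum(X-\mu)^2}{N}$ is the variance, both estimated for that feature from the samples of class $y_k$.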
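
As a quick check of the two formulas above, the following minimal sketch computes the mean, the population variance, and the Gaussian density in plain Python. The function name `gaussian_pdf` and the sample values are assumptions made for illustration only.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical values of one feature within one class
values = [4.0, 5.0, 6.0]
mu = sum(values) / len(values)                           # mean
var = sum((v - mu) ** 2 for v in values) / len(values)   # population variance (divide by N)
sigma = math.sqrt(var)

print(mu, var)
print(gaussian_pdf(5.0, mu, sigma))  # density evaluated at the mean
```

The density is symmetric around the mean, so `gaussian_pdf(4.0, mu, sigma)` and `gaussian_pdf(6.0, mu, sigma)` agree.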

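## Algorithm
1. Compute the prior probabilities $$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}  , k = 1,2,...,K$$

2. Compute the conditional probabilities: for each class label, the probability that the $j$-th feature takes the value $x^{(j)}$ $$P(X^{(j)} =x ^ {(j)} | Y=c_k), j = 1,2,...,n, k = 1,2,...,K$$

3. Compute the posterior probabilities, up to the normalizing constant $P(X=x)$, which is the same for every class ($n$ is the number of features): $$P(Y=c_k|X=x) \propto P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)}|Y=c_k)$$

4. Choose the $c_k$ with the largest posterior probability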
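
The four steps can be sketched on a small categorical dataset, where the conditional probabilities are plain frequency counts. This is only an illustrative sketch: the data, the names `posterior_scores` and `predict`, and the absence of smoothing are all assumptions, not part of the notebook's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: two categorical features per sample, binary label
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot"), ("sunny", "hot")]
y = ["no", "no", "yes", "yes", "no"]

# Step 1: priors P(Y = c_k)
class_counts = Counter(y)
priors = {c: n / len(y) for c, n in class_counts.items()}

# Step 2: conditionals P(X^(j) = v | Y = c_k), estimated by counting (no smoothing)
cond_counts = defaultdict(Counter)  # (class, feature index) -> value counts
for xi, c in zip(X, y):
    for j, v in enumerate(xi):
        cond_counts[(c, j)][v] += 1

def posterior_scores(x):
    # Step 3: unnormalized posterior = prior * product of conditionals
    scores = {}
    for c in priors:
        score = priors[c]
        for j, v in enumerate(x):
            score *= cond_counts[(c, j)][v] / class_counts[c]
        scores[c] = score
    return scores

def predict(x):
    # Step 4: pick the class with the largest score
    scores = posterior_scores(x)
    return max(scores, key=scores.get)

print(posterior_scores(("sunny", "hot")))
print(predict(("rainy", "mild")))
```

Because no smoothing is applied, a feature value never seen with a class drives that class's score to zero; practical implementations usually add a smoothing term.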

-----

In the **Gaussian naive Bayes classifier**, the conditional probability is $$P(x^{(j)} | y_k)=\frac{1}{\sqrt{2\pi}\,\sigma_{yk}}\exp\left(-\frac{(x^{(j)}-\mu_{yk})^2}{2\sigma^2_{yk}}\right)$$


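That is, first group the data by label. Then, for each label $c_k$, estimate the mean $\mu_{yk}$ and standard deviation $\sigma_{yk}$ of every feature. These parameters are used later to compute the posterior probabilities.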
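
The grouping step can also be written with NumPy boolean masks. The sketch below is an assumption-laden illustration (toy data, a hypothetical `params` dictionary), not the implementation used in this notebook.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features, 2 classes
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
              [10.0, 0.0], [12.0, 1.0], [14.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

params = {}
for c in np.unique(y):
    Xc = X[y == c]                  # rows belonging to class c
    params[c] = (Xc.mean(axis=0),   # per-feature mean
                 Xc.std(axis=0))    # per-feature population std (ddof=0)

for c, (mu, sigma) in params.items():
    print(c, mu, sigma)
```

`np.std` divides by $N$ by default, which matches the variance formula given at the top of this notebook.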

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter

In [2]:
class GaussNaiveBayes:
    def __init__(self):
        self.model = None
        self.prior_probability = {}  # prior probability of each class
        
    def mean(self,x):
        return sum(x)/ float(len(x))
    
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x-avg, 2) for x in X]) / float(len(X)))
        
    
    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exp = math.exp(-(math.pow(x - mean, 2)/(2* math.pow(stdev,2))))
        return exp/(math.sqrt(2* math.pi) * stdev)
    
    
    
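    # Estimate the per-class mean and standard deviation of each feature, plus the class priors
    def fit(self,X,Y):
        label_num = list(set(Y))  # list of distinct class labels
        data = {label:[] for label in label_num}  # dict: label -> samples belonging to that label
        self.model = {label:[] for label in label_num}
        
        for x, label in zip(X,Y):
            data[label].append(x)
        
        for label, value in data.items():
            for i in zip(*value):    # transpose, so that i is one original column: all values of one feature
                self.model[label].append((self.mean(i),self.stdev(i)))     # (mean, standard deviation) of each feature
            self.prior_probability[label] = len(value)/len(X)  # prior: fraction of training samples in this class
        
        return 'GaussNaiveBayes model train done'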
                                  
    # Compute the (unnormalized) posterior probability of each class
    def calculate_probability(self, x_test):
        post_probability = { key: 1 for key in self.model.keys()}
        
        for label, para in self.model.items():
            for j in range(len(para)): 
                mean, stdev = para[j]  # mean and standard deviation of feature j given class c_k
                post_probability[label] *= self.gaussian_probability(x_test[j],mean ,stdev)  # product of conditional densities
            post_probability[label] *= self.prior_probability[label]  # multiply by the class prior to get the unnormalized posterior
        
        return post_probability
            
                
    def predict(self, x):
        probability = self.calculate_probability(x)
        return (max(probability.items(), key=lambda y:y[1]))[0]  # class with the largest posterior
        
    
    def score(self, X_test, Y_test):
        right_count = 0
        for x, y in zip(X_test,Y_test):
            y_predict = self.predict(x)
            if y_predict == y:
                right_count += 1
        return right_count / len(X_test)
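
Multiplying many small densities can underflow to zero when there are many features. A common variant, sketched here as an alternative rather than as what the class above does, sums log-densities instead. The function name `log_posterior_scores` and the parameter values are assumptions; the dictionaries only mimic the layout of `self.model` and `self.prior_probability`.

```python
import math

def log_posterior_scores(x, model, priors):
    """Unnormalized log-posteriors; model maps label -> [(mean, stdev), ...]."""
    scores = {}
    for label, params in model.items():
        score = math.log(priors[label])
        for xj, (mean, stdev) in zip(x, params):
            # log of the Gaussian density of feature value xj
            score += -0.5 * math.log(2 * math.pi * stdev ** 2) - (xj - mean) ** 2 / (2 * stdev ** 2)
        scores[label] = score
    return scores

# Hypothetical parameters, laid out like self.model and self.prior_probability
model = {0: [(5.0, 0.35), (3.4, 0.38)], 1: [(5.9, 0.5), (2.8, 0.31)]}
priors = {0: 0.5, 1: 0.5}

scores = log_posterior_scores([4.4, 3.2], model, priors)
print(max(scores, key=scores.get))
```

Since the logarithm is monotonic, the class with the largest log-score is the same class that maximizes the original product.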
    

In [3]:
# Load data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
data = np.array(df.iloc[:100,[0,1,-1]])
X, Y = data[:,:-1], data[:,-1]
# Hold out a random 33% of the data as the test set and train on the rest, so the score is measured on unseen samples
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=0) 

In [4]:
nb = GaussNaiveBayes()
nb.fit(X_train,Y_train)

'GaussNaiveBayes model train done'

In [5]:
print(nb.predict([4.4,  3.2]))
print(nb.predict([1.3,  0.2]))

0.0
1.0


In [6]:
nb.score(X_test,Y_test)

1.0

# sklearn.naive_bayes

In [9]:
from sklearn.naive_bayes import GaussianNB

In [10]:
clf = GaussianNB()
clf.fit(X_train, Y_train)

GaussianNB(priors=None)

In [11]:
clf.score(X_test, Y_test)

1.0

In [12]:
clf.predict([[4.4,  3.2],[1.3,  0.2]])

array([0., 1.])
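
To relate the fitted scikit-learn model to the hand-written class, its learned parameters can be inspected directly. The sketch below uses hypothetical toy data rather than the iris split above, and reads only the `classes_`, `class_prior_`, and `theta_` attributes and the `predict_proba` method of `GaussianNB`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 6 samples, 2 features, 2 classes
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
              [10.0, 0.0], [12.0, 1.0], [14.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
print(clf.classes_)       # class labels, in the row order used below
print(clf.class_prior_)   # estimated P(Y = c_k)
print(clf.theta_)         # per-class, per-feature means
print(clf.predict_proba([[2.0, 3.0]]))  # normalized posteriors, one column per class
```

Unlike `calculate_probability` above, `predict_proba` returns normalized posteriors, so each row sums to 1.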