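## Bayesian Classification
### 1. Bayesian Classifiers
Bayesian classifiers are a family of classification algorithms that are all built on Bayes' theorem: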
$$
P(y|x) = \frac{P(x|y)P(y)}{P(x)}
$$
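Here $P(y)$ is the prior, $P(x|y)$ the likelihood, and $P(y|x)$ the posterior. The denominator $P(x)$, often called the evidence, can be expanded with the law of total probability by summing over all classes $y'$:
$$
P(x) = \sum_{y'} P(x|y')P(y')
$$
It acts as a normalizing constant that makes the posteriors sum to 1 over the classes.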
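The naive Bayes classifier additionally **assumes that the features are conditionally independent given the class $y$**, so the likelihood factorizes into a product of per-feature terms: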
$$
P(x|y) = P(x_{1},x_{2},\ldots,x_{n}|y) = P(x_{1}|y)P(x_{2}|y)\cdots P(x_{n}|y)
$$
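As a minimal sketch of this factorization, the snippet below multiplies per-feature conditional probabilities to obtain $P(x|y)$ for one class. The probability table and the helper name `naive_likelihood` are hypothetical and chosen only for illustration; they are not taken from the dataset used later.

```python
import math

# Hypothetical per-feature conditional probabilities P(x_i | y) for ONE class y.
# The numbers are made up for illustration.
cond_probs = {
    "Outlook": {"Sunny": 0.5, "Rain": 0.5},
    "Wind": {"Weak": 0.25, "Strong": 0.75},
}


def naive_likelihood(cond_probs, sample):
    """Return P(x | y) as the product of the per-feature terms P(x_i | y)."""
    return math.prod(cond_probs[feature][value] for feature, value in sample.items())


likelihood = naive_likelihood(cond_probs, {"Outlook": "Sunny", "Wind": "Strong"})
print(likelihood)  # 0.5 * 0.75 = 0.375
```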


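Given an input $x$, the classifier computes the posterior probability $P(y|x)$ for each class and assigns $x$ to the class with the largest posterior. Since the evidence $P(x)$ is the same for every class, it does not affect which class attains the maximum:
$$
\hat{y} = \arg\max_{y_{i}} P(y_{i}|x) = \arg\max_{y_{i}} \frac{P(x|y_{i})P(y_{i})}{P(x)} = \arg\max_{y_{i}} P(x|y_{i})P(y_{i})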
$$
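The decision rule can be sketched in a few lines. The priors and likelihoods below are hypothetical numbers for a single fixed input, and the variable names are illustrative. The point is only that dividing by $P(x)$ rescales every score by the same factor and so leaves the arg max unchanged.

```python
# Hypothetical values for a two-class problem and one fixed input x.
priors = {"No": 0.4, "Yes": 0.6}          # P(y)
likelihoods = {"No": 0.05, "Yes": 0.01}   # P(x | y)

# Unnormalized scores P(x | y) * P(y).
scores = {y: likelihoods[y] * priors[y] for y in priors}

# Evidence P(x) as the sum of the scores, then the posteriors P(y | x).
evidence = sum(scores.values())
posterior = {y: s / evidence for y, s in scores.items()}

best_from_scores = max(scores, key=scores.get)
best_from_posterior = max(posterior, key=posterior.get)
print(best_from_scores, best_from_posterior)
```

Both selections agree, which is why an implementation may skip the normalization when only the predicted class is needed.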


In [7]:
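import pandas as pd
import numpy as np
# Load the dataset from a local Excel file.
df = pd.read_excel(
    r'E:\python\python\machine_learning\data\bayes_decision.xlsx')
# Drop the 'Day' column so that only the feature columns and the label remain.
del df['Day']
df.head()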

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Play Tennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [8]:
# Map each column position to its column name.
num2feature = dict(enumerate(df.columns))
num2feature

{0: 'Outlook', 1: 'Temperature', 2: 'Humidity', 3: 'Wind', 4: 'Play Tennis'}

In [9]:
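# Collect the distinct values of every column, in order of first appearance.
unique_values = [df[col].unique().tolist() for col in df.columns]
# Number the values within each column and merge everything into one dict.
# A single shared dict only works because no value name occurs in two columns.
featureEncoder = {value: code for values in unique_values for code, value in enumerate(values)}
labelEncoder = {label: code for code, label in enumerate(df[df.columns[-1]].unique())}
featureEncoder, labelEncoder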

({'Sunny': 0,
  'Overcast': 1,
  'Rain': 2,
  'Hot': 0,
  'Mild': 1,
  'Cool': 2,
  'High': 0,
  'Normal': 1,
  'Weak': 0,
  'Strong': 1,
  'No': 0,
  'Yes': 1},
 {'No': 0, 'Yes': 1})
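
The single shared `featureEncoder` works here because no category name appears in more than one column. If two columns shared a name (say, both had a value `High`), the later column would overwrite the earlier code. One possible alternative, sketched below in plain Python on a hypothetical two-column table, keeps a separate mapping per column. The names `rows`, `column_encoders`, and `encoded` are illustrative and not part of the notebook.

```python
# Hypothetical toy table: both columns contain the value "High".
rows = [
    {"Temperature": "Low", "Humidity": "High"},
    {"Temperature": "High", "Humidity": "Normal"},
]

# One encoder per column; codes follow the order of first appearance.
column_encoders = {}
for row in rows:
    for col, value in row.items():
        mapping = column_encoders.setdefault(col, {})
        # Assign the next free code only if this value has not been seen yet.
        mapping.setdefault(value, len(mapping))

encoded = [
    {col: column_encoders[col][value] for col, value in row.items()}
    for row in rows
]
print(column_encoders)
print(encoded)
```

With per-column mappings, `High` can be code 1 for `Temperature` and code 0 for `Humidity` without conflict.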

In [10]:
# Replace every category name with its integer code.
for col in df.columns:
    df[col] = df[col].map(featureEncoder)

df

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Play Tennis
0,0,0,0,0,0
1,0,0,0,1,0
2,1,0,0,0,1
3,2,1,0,0,1
4,2,2,1,0,1
5,2,2,1,1,0
6,1,2,1,1,1
7,0,1,0,0,0
8,0,2,1,0,1
9,2,1,1,0,1


In [11]:
x=df[df.columns[:-1]]
y=df[df.columns[-1]]
x

Unnamed: 0,Outlook,Temperature,Humidity,Wind
0,0,0,0,0
1,0,0,0,1
2,1,0,0,0
3,2,1,0,0
4,2,2,1,0
5,2,2,1,1
6,1,2,1,1
7,0,1,0,0
8,0,2,1,0
9,2,1,1,0


In [12]:
class NaiveBayesClassifier:
    def __init__(self, df):
        self.df = df
        self.prior_prob = {}
        self.likelihood = {}

    def naive_bayes(self):
        # Estimate the prior P(y) from the relative frequency of each label.
        label_counts = self.df['Play Tennis'].value_counts()
        total_count = len(self.df)
        self.prior_prob = {label: count / total_count
                           for label, count in label_counts.items()}
        # Estimate the conditional probabilities (likelihoods) P(x_i | y).
        for label in label_counts.index:
            self.likelihood[label] = {}
            # Rows whose label column equals `label`.
            label_df = self.df[self.df['Play Tennis'] == label]
            for col in self.df.columns[:-1]:  # every column except the label
                self.likelihood[label][col] = {}
                # Count how often each value of `col` occurs within this class.
                for value, count in label_df[col].value_counts().items():
                    self.likelihood[label][col][value] = count / label_counts[label]

    def predict(self, features):
        results = {}
        for label in self.prior_prob:
            prob = self.prior_prob[label]
            for col, value in features.items():
                # A value never observed with this label has an estimated
                # likelihood of 0 (no smoothing is applied).
                prob *= self.likelihood[label].get(col, {}).get(value, 0)
            results[label] = prob
        total_prob = sum(results.values())
        if total_prob == 0:
            # Every class scored 0, so the scores cannot be normalized.
            return results
        return {label: prob / total_prob for label, prob in results.items()}

bayes_classifier = NaiveBayesClassifier(df)
bayes_classifier.naive_bayes()
# Encode the test sample with the same mappings used for the training data.
test = ['Sunny', 'Cool', 'High', 'Strong']
test_data = {}
for i, j in enumerate(test):
    test_data[num2feature[i]] = featureEncoder[j]
res = bayes_classifier.predict(test_data)
max_class = max(res, key=res.get)
labelDecoder = {v: k for k, v in labelEncoder.items()}
print(labelDecoder[max_class])
# bayes_classifier.prior_prob, bayes_classifier.likelihood

No
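
The classifier above estimates probabilities from raw counts, so it has no principled estimate for a feature value that never occurs together with a class. A common remedy is Laplace (add-one) smoothing, and summing log-probabilities avoids underflow when many features are multiplied. The sketch below is one possible variant, not the notebook's implementation: it uses plain Python, the ten rows displayed earlier decoded back to their category names, and hypothetical helper names (`fit`, `predict`, `alpha`).

```python
import math
from collections import Counter, defaultdict

FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]
# The ten rows shown in the notebook output, decoded back to category names.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
]


def fit(rows, alpha=1.0):
    """Estimate log-priors and Laplace-smoothed log-likelihoods from rows."""
    label_counts = Counter(row[-1] for row in rows)
    # Distinct values per feature, needed for the smoothing denominator.
    domains = [sorted({row[i] for row in rows}) for i in range(len(FEATURES))]
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for row in rows:
        for i, value in enumerate(row[:-1]):
            value_counts[(row[-1], i)][value] += 1
    # Priors are plain relative frequencies (not smoothed in this sketch).
    log_prior = {y: math.log(n / len(rows)) for y, n in label_counts.items()}
    log_like = {
        (y, i): {
            v: math.log((value_counts[(y, i)][v] + alpha) / (n + alpha * len(domain)))
            for v in domain
        }
        for y, n in label_counts.items()
        for i, domain in enumerate(domains)
    }
    return log_prior, log_like


def predict(model, sample):
    """Return the label with the largest log-posterior score for `sample`."""
    log_prior, log_like = model
    scores = {
        y: log_prior[y] + sum(log_like[(y, i)][v] for i, v in enumerate(sample))
        for y in log_prior
    }
    return max(scores, key=scores.get)


model = fit(ROWS)
print(predict(model, ("Sunny", "Cool", "High", "Strong")))  # No
```

With `alpha=1.0`, a value unseen for some class (for example `Overcast` with `No` in these rows) still receives a small nonzero probability instead of wiping out the whole product.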
