## Bayesian Classification
### 1. Bayes Classifiers
Bayes classifiers are a family of classification algorithms that are all built on Bayes' theorem:
$$
P(y|x) = \frac{P(x|y)P(y)}{P(x)}
$$
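The naive Bayes classifier implemented below makes one additional assumption: **the features are conditionally independent of one another given the class**.

For a given input $x$, the classifier computes the posterior probability $P(y|x)$ of each class and assigns $x$ to the class $y$ that maximizes it:
$$
\hat{y} = \arg\max_{y_{i}} P(y_{i}|x) = \arg\max_{y_{i}} \frac{P(x|y_{i})P(y_{i})}{P(x)}
$$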
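
Since $P(x)$ is the same for every class, it does not affect which class wins and can be dropped. Writing $x = (x_1, \dots, x_n)$ for the individual feature values (notation assumed here for illustration), the conditional independence assumption turns the rule into a product of per-feature terms, which appears to be the quantity the code below evaluates:

$$
P(x|y_{i}) = \prod_{j=1}^{n} P(x_{j}|y_{i})
\quad\Longrightarrow\quad
\hat{y} = \arg\max_{y_{i}} P(y_{i}) \prod_{j=1}^{n} P(x_{j}|y_{i})
$$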

In [53]:
import numpy as np
import pandas as pd

# Define the training data: three binary features (A, B, C) and a binary label y
data = [
    [1, 0, 0, 1], [0, 1, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1],
    [0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 1, 0],
    [1, 0, 0, 1], [1, 1, 0, 1]
]
test = [0, 0, 1]  # sample to classify later: A=0, B=0, C=1
# Build the DataFrame
Data = pd.DataFrame(data, columns=['A', 'B', 'C', 'y'])
print("Initial Data:")
print(Data.head())

Initial Data:
   A  B  C  y
0  1  0  0  1
1  0  1  0  0
2  0  1  0  1
3  1  0  0  1
4  0  1  1  0



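### Defining the naive Bayes classifier
Next, we define a function that estimates the prior and conditional probabilities. Because naive Bayes assumes the features are conditionally independent given the class, each feature's conditional probability can be estimated separately from simple frequency counts: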


In [54]:
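def naive_bayes(X_data, Y_data):
    """Estimate priors P(y) and conditionals P(feature=value | y) by relative frequency."""
    y = Y_data.values
    y_unique = np.unique(y)
    # Prior: fraction of training rows that belong to each class
    prior_prob = {yu: np.mean(y == yu) for yu in y_unique}

    condition_prob = {}
    for feature in X_data.columns:
        condition_prob[feature] = {}
        for yu in y_unique:
            subset = X_data[y == yu]
            # Relative frequency of each value within the class; values that never
            # occur in the class are simply absent from the dict (no smoothing)
            feature_counts = subset[feature].value_counts(normalize=True)
            condition_prob[feature][yu] = feature_counts.to_dict()

    return prior_prob, condition_prob


# Split features and labels
X_data = Data[['A', 'B', 'C']]
Y_data = Data['y']

# Estimate the prior and conditional probabilities
prior_prob, condition_prob = naive_bayes(X_data, Y_data)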

### Inspecting the estimates
Display the estimated prior and conditional probabilities.
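
The notebook does not include a cell for this step, so here is a minimal sketch of one way to tabulate the same quantities. It rebuilds the training data and uses `pd.crosstab` instead of the `naive_bayes` function above; the names `prior` and `cond_tables` are introduced only for this illustration.

```python
import pandas as pd

# Same ten training rows as above
data = [
    [1, 0, 0, 1], [0, 1, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1],
    [0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 1, 0],
    [1, 0, 0, 1], [1, 1, 0, 1]
]
Data = pd.DataFrame(data, columns=['A', 'B', 'C', 'y'])

# Prior P(y): relative frequency of each class
prior = Data['y'].value_counts(normalize=True).sort_index()
print("Prior probabilities:")
print(prior)

# Conditional P(feature = value | y): each row of a table sums to 1
cond_tables = {}
for feature in ['A', 'B', 'C']:
    cond_tables[feature] = pd.crosstab(Data['y'], Data[feature], normalize='index')
    print(f"\nP({feature} | y):")
    print(cond_tables[feature])
```

In the table for `C`, the entry for `y = 1`, `C = 1` is zero, because no training row with label 1 has `C = 1`.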

### Predicting a new sample
Use a prediction function to classify a new input sample.

In [55]:
def predict(test_data, prior_prob, condition_prob):
    results = []
    for _, row in test_data.iterrows():
        # Start each class score at its prior
        class_probs = dict(prior_prob)
        for feature in test_data.columns:
            for yu in class_probs:
                # Multiply in P(feature=value | class); a value never seen in this
                # class falls back to machine epsilon rather than exactly zero
                class_probs[yu] *= condition_prob[feature][yu].get(
                    row[feature], np.finfo(float).eps)

        # Normalize the scores so they sum to 1, then pick the most probable class
        total_prob = sum(class_probs.values())
        normalized_probs = {k: v / total_prob for k, v in class_probs.items()}
        results.append(max(normalized_probs, key=normalized_probs.get))
    return results


# Test sample
test_data = pd.DataFrame([[0, 0, 1]], columns=['A', 'B', 'C'])
predictions = predict(test_data, prior_prob, condition_prob)

# Print the predictions
print("Predictions:")
print(predictions)

Predictions:
[0]


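### Interpreting the result
This implementation does not use Laplace smoothing. In the training data, every row with C = 1 belongs to class 0, so the estimate of $P(C=1|y=1)$ is zero (the code substitutes machine epsilon for it). For the new sample (A=0, B=0, C=1), the score for class 1 is therefore effectively zero, and the classifier assigns the sample to class 0 regardless of the values of A and B.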
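
The arithmetic behind this prediction can be checked by hand. The sketch below uses exact fractions with counts read off the ten training rows (4 rows with y = 0, 6 rows with y = 1). It uses a true zero where the notebook code substitutes machine epsilon, and the names `score_0` and `score_1` are introduced for illustration only.

```python
from fractions import Fraction as F

# Unnormalized scores P(y) * P(A=0|y) * P(B=0|y) * P(C=1|y) for the sample (0, 0, 1)
score_0 = F(4, 10) * F(3, 4) * F(1, 4) * F(3, 4)  # class 0
score_1 = F(6, 10) * F(2, 6) * F(4, 6) * F(0, 6)  # class 1: C=1 never occurs with y=1

print(score_0, score_1)  # 9/160 0
```

The single zero factor for `C = 1` removes class 1 from consideration, whatever the other factors are.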

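This exposes a limitation of unsmoothed naive Bayes: when a feature value never occurs with a particular class in the training data, that single zero estimate wipes out the class's entire score, and the decision is made by one feature alone. It also shows why Laplace smoothing matters: adding a small constant (usually 1) to every count keeps a missing feature value from producing a zero probability, so the model generalizes better to feature combinations it has not seen.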
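
Written out, add-one smoothing is usually stated as follows. The symbols are assumptions introduced here: $N$ is the number of training rows, $K$ the number of classes, $N_{c}$ the number of rows in class $c$, $N_{c,v}^{(j)}$ the number of rows in class $c$ whose feature $j$ equals $v$, and $K_{j}$ the number of distinct values feature $j$ can take.

$$
P(y = c) = \frac{N_{c} + 1}{N + K},
\qquad
P(x_{j} = v \mid y = c) = \frac{N_{c,v}^{(j)} + 1}{N_{c} + K_{j}}
$$

With these estimates, a value that never occurs in a class gets probability $1 / (N_{c} + K_{j})$ instead of zero, and the probabilities over all values of a feature still sum to 1.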

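For real datasets, especially ones likely to contain unobserved feature-value combinations, Laplace smoothing is therefore recommended to make the naive Bayes model more robust.

The version below adds Laplace smoothing: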

In [56]:
def naive_bayes(X_data, Y_data):
    """Estimate priors and conditionals with Laplace (add-one) smoothing."""
    y = Y_data.values
    y_unique = np.unique(y).tolist()
    # Smoothed prior: (class count + 1) / (N + number of classes)
    prior_prob = {yu: float((np.sum(y == yu) + 1) / (len(y) + len(y_unique)))
                  for yu in y_unique}

    condition_prob = {}
    for feature in X_data.columns:
        condition_prob[feature] = {}
        # All values the feature takes anywhere in the training set, so a value
        # never seen within a class still receives a nonzero probability
        feature_values = np.unique(X_data[feature]).tolist()
        for yu in y_unique:
            subset = X_data[y == yu]
            # Add-one smoothing applies to raw counts, not to proportions
            feature_counts = subset[feature].value_counts()
            condition_prob[feature][yu] = {
                v: float((feature_counts.get(v, 0) + 1)
                         / (len(subset) + len(feature_values)))
                for v in feature_values
            }

    return prior_prob, condition_prob


# Split features and labels
X_data = Data[['A', 'B', 'C']]
Y_data = Data['y']

# Estimate the smoothed prior and conditional probabilities
prior_prob, condition_prob = naive_bayes(X_data, Y_data)

In [57]:
print("Prior probabilities:")
print(prior_prob)
print("\nConditional probabilities:")
for feature, probs in condition_prob.items():
    print(f"Feature {feature}:")
    for class_val, prob in probs.items():
        print(f"  {class_val}: {prob}")


Prior probabilities:
{0: 0.4166666666666667, 1: 0.5833333333333334}

Conditional probabilities:
Feature A:
  0: {0: 0.6666666666666666, 1: 0.3333333333333333}
  1: {0: 0.375, 1: 0.625}
Feature B:
  0: {0: 0.3333333333333333, 1: 0.6666666666666666}
  1: {0: 0.625, 1: 0.375}
Feature C:
  0: {0: 0.3333333333333333, 1: 0.6666666666666666}
  1: {0: 0.875, 1: 0.125}


In [58]:
def predict_with_laplace(test_data, prior_prob, condition_prob):
    results = []
    for _, row in test_data.iterrows():
        # Start each class score at its prior
        class_probs = dict(prior_prob)
        for feature in test_data.columns:
            for yu in class_probs:
                # Smoothing is applied when the probabilities are estimated;
                # prediction only looks them up
                feature_prob = condition_prob[feature][yu].get(row[feature], 0)
                class_probs[yu] *= feature_prob

        # Plain normalization: divide each score by the sum of all scores
        total_prob = sum(class_probs.values())
        normalized_probs = {k: v / total_prob for k, v in class_probs.items()}
        results.append(max(normalized_probs, key=normalized_probs.get))
    return results

# Predict the test sample with the Laplace-smoothed estimates
predictions_with_laplace = predict_with_laplace(test_data, prior_prob, condition_prob)
predictions_with_laplace

[0]

### With Laplace smoothing, the prediction is still 0
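
A hand check suggests why the answer does not change. The sketch below assumes standard add-one smoothing over the two possible values of each binary feature, with counts read off the ten training rows; `score_0`, `score_1`, and `posterior_0` are names introduced for this illustration.

```python
from fractions import Fraction as F

# Smoothed scores for the sample (A=0, B=0, C=1):
# prior = (class count + 1) / (10 + 2), conditional = (count + 1) / (class size + 2)
score_0 = F(4 + 1, 12) * F(3 + 1, 4 + 2) * F(1 + 1, 4 + 2) * F(3 + 1, 4 + 2)
score_1 = F(6 + 1, 12) * F(2 + 1, 6 + 2) * F(4 + 1, 6 + 2) * F(0 + 1, 6 + 2)

posterior_0 = score_0 / (score_0 + score_1)
print(score_0, score_1)               # 5/81 35/2048
print(round(float(posterior_0), 4))   # 0.7832
```

Under these assumptions, smoothing lifts class 1 from an effectively zero score to roughly 0.22 of the posterior mass, but class 0 remains the more probable class.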

In [59]:
import numpy as np
import pandas as pd

# Build the example dataset
data = [
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No']
]
columns = ['Outlook', 'Temperature', 'Humidity', 'Wind', 'Play Tennis']
df = pd.DataFrame(data, columns=columns)
df.head()

| | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | Weak | No |
| 1 | Sunny | Hot | High | Strong | No |
| 2 | Overcast | Hot | High | Weak | Yes |
| 3 | Rain | Mild | High | Weak | Yes |
| 4 | Rain | Cool | Normal | Weak | Yes |


Data preprocessing: convert the string labels to numeric codes

In [60]:
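# Map every distinct string to an integer code, numbering each column's values
# from 0 in order of first appearance. A single flat dict works here only
# because no string occurs in more than one column.
feature2NUM = {}
for col in columns:
    for idx, val in enumerate(df[col].unique()):
        feature2NUM[val] = idx

# Encode column by column; Series.map accepts a dict
df_encoded = df.apply(lambda col: col.map(feature2NUM))
df_encoded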

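| | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 |
| 3 | 2 | 1 | 0 | 0 | 1 |
| 4 | 2 | 2 | 1 | 0 | 1 |
| 5 | 2 | 2 | 1 | 1 | 0 |
| 6 | 1 | 2 | 1 | 1 | 1 |
| 7 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 2 | 1 | 0 | 1 |
| 9 | 2 | 1 | 1 | 0 | 1 |
| 10 | 0 | 1 | 1 | 1 | 1 |
| 11 | 1 | 1 | 0 | 1 | 1 |
| 12 | 1 | 0 | 1 | 0 | 1 |
| 13 | 2 | 1 | 0 | 1 | 0 |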
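
The flat `feature2NUM` dict relies on every string being unique across columns. As an alternative sketch (not what the notebook does), `pd.factorize` can encode each column independently while keeping a per-column list for decoding. The toy frame and the names `toy`, `encoded`, and `encoders` are hypothetical.

```python
import pandas as pd

# Hypothetical four-row excerpt with two of the columns
toy = pd.DataFrame({
    'Outlook': ['Sunny', 'Overcast', 'Rain', 'Sunny'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak'],
})

encoders = {}                           # column name -> list of original values
encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    # factorize numbers the values from 0 in order of first appearance
    codes, uniques = pd.factorize(toy[col])
    encoded[col] = codes
    encoders[col] = list(uniques)

print(encoded)
print(encoders)
```

Decoding a code back to its string is then `encoders[col][code]`, and two columns that happen to share a string no longer overwrite each other's codes.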


Split features and labels

In [61]:
x = df_encoded.iloc[:, :-1]
y = df_encoded.iloc[:, -1]
y


0     0
1     0
2     1
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    1
11    1
12    1
13    0
Name: Play Tennis, dtype: int64

### Defining the naive Bayes classifier class

In [62]:
class NaiveBayesClassifier:
    def __init__(self, x, y):
        self.features = x
        self.labels = y
        self.prior_prob = {}
        self.likelihood = {}

    def train(self):
        label_counts = self.labels.value_counts()
        total_count = len(self.labels)

        # Prior probability of each class
        self.prior_prob = {label: count / total_count for label, count in label_counts.items()}

        # Conditional probability (likelihood) of each feature value given the class
        for label in label_counts.index:
            self.likelihood[label] = {}
            label_df = self.features[self.labels == label]
            for col in self.features.columns:
                self.likelihood[label][col] = {}
                value_counts = label_df[col].value_counts()
                for value, count in value_counts.items():
                    self.likelihood[label][col][value] = count / label_counts[label]

    def predict(self, test_data):
        """Return the normalized posterior of each class for one sample (dict of column -> value)."""
        results = {}
        for label in self.prior_prob:
            prob = self.prior_prob[label]
            for col, value in test_data.items():
                # A value never seen with this class gets a tiny probability;
                # a fallback of 1 would silently ignore the feature
                prob *= self.likelihood[label].get(col, {}).get(value, np.finfo(float).eps)
            results[label] = prob
        total_prob = sum(results.values())
        return {label: prob / total_prob for label, prob in results.items()}


In [63]:
classifier = NaiveBayesClassifier(x, y)
classifier.train()

### Predicting new data

In [64]:
test = ['Sunny', 'Cool', 'High', 'Strong']
test_data = {columns[i]: feature2NUM[val] for i, val in enumerate(test)}
prediction = classifier.predict(test_data)
prediction

{1: 0.20458265139116202, 0: 0.795417348608838}

### Printing the prediction

In [65]:
label_decoder = {v: k for k, v in feature2NUM.items() if k in df['Play Tennis'].unique()}
predicted_label = max(prediction, key=prediction.get)
print(f"Predicted decision for playing tennis: {label_decoder[predicted_label]}")

Predicted decision for playing tennis: No
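
The posterior reported above can be cross-checked by hand. The sketch below uses exact fractions with counts read off the 14 training rows (5 "No", 9 "Yes") for the sample (Sunny, Cool, High, Strong); `score_no`, `score_yes`, and `p_no` are names introduced for this illustration.

```python
from fractions import Fraction as F

# P(class) * P(Sunny|class) * P(Cool|class) * P(High|class) * P(Strong|class)
score_no = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)
score_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)

p_no = score_no / (score_no + score_yes)
print(p_no)                    # 486/611
print(round(float(p_no), 4))   # 0.7954
```

The result agrees with the classifier's normalized probability for label 0 ("No") to the digits shown.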
