# Naive Bayes (Exercise 1)


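Naive Bayes is an algorithm that relies heavily on statistics of the training set. The classifier trains very quickly. Gaussian Naive Bayes, for example, only needs to compute a few statistics: the mean and variance of each feature within each class, and the prior probability of each class in the labels. On the surface it looks as if there is no training process at all. An algorithm such as linear regression arrives at its coefficients through an optimization procedure, and those coefficients make up the model. Naive Bayes has no such step: it computes the statistics and is then ready to predict. At prediction time it combines the class priors with the conditional probabilities of the features given each class to obtain a posterior probability for each class, and predicts the class with the highest posterior.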
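To make the "training is just computing statistics" idea concrete, here is a minimal from-scratch sketch of Gaussian Naive Bayes on the iris data. The helper `predict` and the variable names are hypothetical and only for illustration. The sketch assumes plain per-class means and variances; sklearn's `GaussianNB` additionally adds a small `var_smoothing` term to the variances, so tiny numerical differences are possible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# "Training": class priors plus per-class feature means and variances
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def predict(X_new):
    """Hypothetical helper: pick the class with the highest log-posterior."""
    log_post = []
    for k in range(len(classes)):
        # log P(c) + sum_j log N(x_j | mean_cj, var_cj)
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * variances[k])
            + (X_new - means[k]) ** 2 / variances[k], axis=1)
        log_post.append(np.log(priors[k]) + log_lik)
    return classes[np.argmax(np.array(log_post), axis=0)]

manual_pred = predict(X)
sklearn_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sklearn_pred)
print("Agreement with sklearn's GaussianNB: {:.3f}".format(agreement))
```

The "naive" assumption shows up in the sum over features: each feature contributes its own one-dimensional Gaussian term, as if the features were independent given the class.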

## 1. Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from scipy import signal

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#for accuracy_score, classification_report and confusion_matrix
from sklearn import metrics
from sklearn.metrics import accuracy_score

# to make this notebook's output stable across runs
np.random.seed(42)

## 2. Introduction

In this tutorial we will first learn how to create a Naive Bayes (NB) classifier in sklearn. Then we will learn how to obtain different performance metrics (precision, recall, F1-score and confusion matrix), how to apply different procedures for evaluating the performance of classifiers (cross-validation and leave-one-out) and, finally, how to use grid search with cross-validation for model selection.

## 3. Creating an NB classifier

There are four main types of NB classifiers in sklearn: <b>GaussianNB</b>, <b>CategoricalNB</b>, <b>MultinomialNB</b> and <b>BernoulliNB</b>.

- The first two are the ones we discussed in the lecture - <b>GaussianNB</b> is applicable to numeric data, while <b>CategoricalNB</b> is applicable to categorical data.

- <b>BernoulliNB</b> and <b>MultinomialNB</b> are mostly used for text classification; they assume binary data and count data respectively, e.g. how many times a word appears in a document.

In this tutorial we will create an NB classifier for the iris data, which is a numeric dataset, so we will use the <b>GaussianNB</b> class to create the classifier.

Let's load the iris data and create the training and test splits:

In [3]:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# create the training and test splits
# stratify=iris.target makes the split stratified. A purely random split can leave a class
# missing or under-represented in the training or test set; if every example of a class
# landed in the test set, the model could not learn to predict that class at all.
# Stratification keeps the class proportions the same in both parts: if the full dataset
# is 60% one class and 40% another, the training and test sets are also 60%/40%.
# random_state is the random seed, so the split is identical on every run.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42)
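As a quick check of what stratification does, the sketch below compares the class counts of a stratified split with those of a plain random split. It is illustrative only; the variable names (`strat_counts`, `plain_counts` and so on) are not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Stratified split: class proportions are preserved in both parts
_, _, y_train_strat, y_test_strat = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42)

# Plain random split: class proportions are allowed to drift
_, _, y_train_plain, y_test_plain = train_test_split(
    iris.data, iris.target, random_state=42)

strat_counts = np.bincount(y_train_strat, minlength=3)
plain_counts = np.bincount(y_train_plain, minlength=3)
print("Stratified training class counts:  ", strat_counts)
print("Unstratified training class counts:", plain_counts)
print("Stratified test class counts:      ", np.bincount(y_test_strat, minlength=3))
```

Because iris has 50 examples per class, the stratified counts differ by at most one between classes, while the unstratified counts can be noticeably less balanced.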

In [18]:
X_train

array([[7.4, 2.8, 6.1, 1.9],
       [7.7, 2.8, 6.7, 2. ],
       [5.5, 2.4, 3.7, 1. ],
       [6.1, 2.8, 4. , 1.3],
       [5.5, 2.5, 4. , 1.3],
       [6.3, 3.3, 6. , 2.5],
       [4.6, 3.4, 1.4, 0.3],
       [6.3, 2.7, 4.9, 1.8],
       [4.8, 3.1, 1.6, 0.2],
       [6.2, 2.8, 4.8, 1.8],
       [5.1, 3.8, 1.9, 0.4],
       [6.4, 2.7, 5.3, 1.9],
       [5. , 2. , 3.5, 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [5.2, 4.1, 1.5, 0.1],
       [5.8, 2.7, 3.9, 1.2],
       [7.2, 3. , 5.8, 1.6],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 4.4, 1.5, 0.4],
       [5.2, 2.7, 3.9, 1.4],
       [7. , 3.2, 4.7, 1.4],
       [6.9, 3.1, 4.9, 1.5],
       [5.3, 3.7, 1.5, 0.2],
       [5.6, 2.9, 3.6, 1.3],
       [6.7, 3. , 5.2, 2.3],
       [4.9, 3. , 1.4, 0.2],
       [6.8, 3.2, 5.9, 2.3],
       [6.3, 2.3, 4.4, 1.3],
       [6.9, 3.2, 5.7, 2.3],
       [5. , 3.6, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.2],
       [6.7, 3.1, 4.7, 1.5],
       [4.9, 3.1, 1.5, 0.1],
       [6.3, 2.8, 5.1, 1.5],
       [4.9, 3

In [5]:
iris.target # the class labels (targets) of the dataset

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### Task:
Use GaussianNB from sklearn.naive_bayes to create an NB classifier on the training data and evaluate its accuracy on the test data.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [10]:
# import the classifier
from sklearn.naive_bayes import GaussianNB

In [11]:
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
y_pred

array([0, 1, 1, 1, 0, 1, 1, 2, 2, 2, 1, 2, 1, 1, 0, 0, 0, 1, 0, 1, 2, 1,
       2, 1, 2, 1, 0, 2, 0, 2, 2, 2, 0, 0, 0, 0, 2, 1])

In [20]:
accuracy_score(y_test, y_pred) # compute the accuracy with sklearn.metrics.accuracy_score

0.9210526315789473

## 3. More performance measures: precision, recall and F1 score. Confusion matrix.

In addition to accuracy, we can calculate other performance measures - e.g. precision, recall and their combination, the F1-score. In sklearn this can be conveniently done using the <b>classification_report</b> function, which also shows the accuracy. The confusion matrix can be calculated using the <b>confusion_matrix</b> function.

### Task: 
1) Continuing on the previous exercise (NB classifier on the iris data), write the Python code to calculate precision, recall, F1 measure and confusion matrix on the test set by using the functions <b>classification_report</b> and <b>confusion_matrix</b>.

See:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

2) Examine the results:
- How are the precision, recall and F1-score calculated - per class or overall for the test set?
- Making sense of the confusion matrix: Where are the correctly classified examples? How many examples from class 1 are incorrectly classified as class 2?


In [21]:
# compute the confusion matrix
metrics.confusion_matrix(y_test, y_pred)

array([[12,  0,  0],
       [ 0, 12,  1],
       [ 0,  2, 11]])

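Entry $C_{ij}$ of the confusion matrix is the number of examples that actually belong to class $i$ but were predicted as class $j$. For example, $C_{12}$ is the number of examples whose true class is $1$ but which were predicted as class $2$. The rows and columns are indexed by the class labels 0, 1 and 2, so in the matrix above $C_{12} = 1$, and the correctly classified examples lie on the diagonal.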

In [23]:
# get the classification report
metrics.classification_report(y_test, y_pred, output_dict=True) # the accuracy in the report matches the value we computed above

{'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12.0},
 '1': {'precision': 0.8571428571428571,
  'recall': 0.9230769230769231,
  'f1-score': 0.8888888888888888,
  'support': 13.0},
 '2': {'precision': 0.9166666666666666,
  'recall': 0.8461538461538461,
  'f1-score': 0.88,
  'support': 13.0},
 'accuracy': 0.9210526315789473,
 'macro avg': {'precision': 0.9246031746031745,
  'recall': 0.923076923076923,
  'f1-score': 0.9229629629629629,
  'support': 38.0},
 'weighted avg': {'precision': 0.9226190476190477,
  'recall': 0.9210526315789473,
  'f1-score': 0.9209356725146198,
  'support': 38.0}}
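The per-class numbers in the report can be recovered directly from the confusion matrix shown earlier. The sketch below hard-codes that matrix (copied from the output above) and recomputes precision, recall, F1 and accuracy with NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[12, 0, 0],
               [0, 12, 1],
               [0, 2, 11]])

true_pos = np.diag(cm)                 # correctly classified examples per class
precision = true_pos / cm.sum(axis=0)  # divide by column sums (how often each class was predicted)
recall = true_pos / cm.sum(axis=1)     # divide by row sums (how many examples each class has)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for c in range(3):
    print("class {}: precision={:.3f}, recall={:.3f}, f1={:.3f}".format(
        c, precision[c], recall[c], f1[c]))
print("accuracy={:.3f}, macro avg f1={:.3f}".format(accuracy, f1.mean()))
```

Precision, recall and F1 are computed per class; the `macro avg` row of the report is the unweighted mean of the per-class values, and `weighted avg` weights each class by its support.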

## 4. Cross-validation for evaluating performance

Cross-validation, in particular <b>10-fold stratified cross-validation</b>, is the standard method in machine learning for evaluating the performance of classification and prediction models. Recall that we are interested in the generalization performance, i.e. how well a classifier will perform on new, previously unseen data.

To perform cross-validation in sklearn, we can use the <b>cross_val_score</b> function. It takes as parameters the classifier we would like to evaluate and the data: the feature vectors and the target classes (also called ground-truth labels). The parameter <b>cv</b> specifies the number of folds; the default is 5 in scikit-learn 0.22 and later (it was 3 in earlier versions), so we need to set it to 10 for 10-fold cross-validation. Note that this function performs <b>stratified</b> cross-validation for classification tasks when <b>cv</b> is an integer.

Let's evaluate our NB classifier using 10-fold cross-validation:

In [3]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

from sklearn.model_selection import cross_val_score
scores = cross_val_score(nb, iris.data, iris.target, cv=10)
print("Cross-validation scores: {}".format(scores)) #accuracy for each fold
print("Average cross-validation score: {:.2f}".format(scores.mean())) #average accuracy over all folds

Cross-validation scores: [0.93333333 0.93333333 1.         0.93333333 0.93333333 0.93333333
 0.86666667 1.         1.         1.        ]
Average cross-validation score: 0.95


The most important result is the average cross-validation score (the average accuracy over the 10 folds), but it is also useful to look at the accuracy for each fold. In our case, we can see that there is a relatively high variation between the 10 folds - from about 87% to 100% accuracy. With 150 examples each fold holds only 15, so a single misclassification changes a fold's accuracy by almost 7 percentage points.
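To see what `cross_val_score` does internally, the following sketch writes the 10-fold loop out by hand. It assumes, as the scikit-learn documentation describes, that an integer `cv` with a classifier corresponds to `StratifiedKFold` without shuffling; the names `manual_scores` and `auto_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB()

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    # train a fresh copy of the classifier on 9 folds ...
    model = clone(nb).fit(X[train_idx], y[train_idx])
    # ... and measure its accuracy on the held-out fold
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(nb, X, y, cv=10)
print("Per-fold accuracy:", manual_scores)
print("Mean: {:.2f}, std: {:.2f}".format(manual_scores.mean(), manual_scores.std()))
print("Same as cross_val_score:", np.allclose(manual_scores, auto_scores))
```

Reporting the standard deviation next to the mean is one simple way to summarise the fold-to-fold variation.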

### Task: 
Compare NB's cross-validation accuracy with NB's accuracy when we used a single training/test split. How can you explain the difference? Which is the more reliable measure?

### Leave-one-out cross-validation

Leave-one-out is a special case of cross-validation where each fold is a single example:

In [4]:
from sklearn.model_selection import LeaveOneOut
one_out = LeaveOneOut()
scores = cross_val_score(nb, iris.data, iris.target, cv=one_out)
print("Number of evaluations: ", len(scores))
print("Mean accuracy: {:.2f}".format(scores.mean()))

Number of evaluations:  150
Mean accuracy: 0.95
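The sketch below illustrates two properties of leave-one-out: every evaluation is on a single example, so each score is either 0 or 1, and the procedure matches k-fold cross-validation with k equal to the number of examples. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB()

loo_scores = cross_val_score(nb, X, y, cv=LeaveOneOut())
# k-fold with one fold per example (no shuffling) visits the same splits
kfold_scores = cross_val_score(nb, X, y, cv=KFold(n_splits=len(X)))

print("Distinct per-example scores:", np.unique(loo_scores))
print("Misclassified examples:", int((loo_scores == 0).sum()))
print("Same as n-fold CV:", np.array_equal(loo_scores, kfold_scores))
```

Each of the 150 evaluations trains a model on 149 examples, which hints at the main cost of leave-one-out on larger datasets.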


### Task:
What are the advantages and disadvantages of leave-one-out?

## 5. Grid search with cross-validation for parameter selection

We can improve the generalization performance of machine learning algorithms by tuning their parameters. NB doesn't have many parameters, but in the previous weeks we saw that the performance of the k-Nearest Neighbor algorithm depends on the number of neighbors and the distance measure, and that the performance of regression models depends on the values of <b>alpha</b> and <b>C</b>.

In sklearn we can use grid search with cross-validation to search through different parameter combinations and select the best one. 

Let's consider k-nearest neighbor (k-NN) as an example and tune two of its parameters by considering the following values:
- number of nearest neighbors <b>n_neighbors</b> = 1, 3, 5, 11 and 15
- distance measure - Manhattan and Euclidean, which can be controlled by the value of parameter <b>p</b>, 1 or 2 respectively

This gives us 5 x 2 = 10 combinations of parameter values. We would like to find the best combination - the one that we expect to generalise well on new examples.

We will use the following procedure, called <b>grid-search with cross-validation for parameter tuning</b>:

Pseudocode:

Create the parameter grid (i.e. the parameter combinations)
Split the data into training set and test set
For each parameter combination
    Train a k-NN classifier on the training data using 10-fold cross-validation as an evaluation procedure
    Compute the cross-validation accuracy cv_acc
    If cv_acc > best_cv_acc 
       best_cv_acc = cv_acc
       best_parameters = current parameters
Rebuild the k-NN model using the whole training data and best_parameters
Evaluate it on the test data and report the results

- The data is split into training set and test set
- The cross-validation loop uses the training data. It is performed for every parameter combination. Its purpose is to select the best parameter combination - the one with the highest cross-validation accuracy. This involves, for every parameter combination, building 10 models on 90% of the training data (9 folds) and evaluating them on the remaining 10% (1 fold).
- Once this is done, a new model is trained using the selected best parameter combination on the <b>whole training set</b> and evaluated on the test set.
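The pseudocode above can be sketched as an explicit loop. This is a hand-written illustration of the procedure under the same split and parameter values as this tutorial, not how `GridSearchCV` is implemented; names such as `best_cv_acc`, `final_knn` and `test_acc` simply follow the pseudocode or are made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42)

param_grid = {'n_neighbors': [1, 3, 5, 11, 15], 'p': [1, 2]}

best_cv_acc, best_parameters = -1.0, None
for params in ParameterGrid(param_grid):  # 5 x 2 = 10 combinations
    knn = KNeighborsClassifier(**params)
    # 10-fold cross-validation on the training data only
    cv_acc = cross_val_score(knn, X_train, y_train, cv=10).mean()
    if cv_acc > best_cv_acc:
        best_cv_acc, best_parameters = cv_acc, params

# Rebuild the model on the whole training set with the selected parameters
final_knn = KNeighborsClassifier(**best_parameters).fit(X_train, y_train)
test_acc = final_knn.score(X_test, y_test)
print("Best parameters:", best_parameters)
print("Best cross-validation accuracy: {:.2f}".format(best_cv_acc))
print("Test set accuracy: {:.2f}".format(test_acc))
```

The test set is touched only once, after the parameters have been chosen, so the final accuracy is not biased by the selection step.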

Code for our example:

In [5]:
param_grid = {'n_neighbors': [1, 3, 5, 11, 15],
              'p': [1, 2]}
print("Parameter grid:\n{}".format(param_grid))

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10,
                          return_train_score=True)


grid_search.fit(X_train, y_train)

print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Parameter grid:
{'n_neighbors': [1, 3, 5, 11, 15], 'p': [1, 2]}
Test set score: 0.95
Best parameters: {'n_neighbors': 11, 'p': 1}
Best cross-validation score: 0.98
Best estimator:
KNeighborsClassifier(n_neighbors=11, p=1)


Explanation: We create an object of type GridSearchCV, which is then fitted to the <b>training</b> data. This fitting includes 2 things: 
1. Searching for and determining the best parameter combination - the one with the best cross-validation accuracy <b>and</b> 
2. Building a new model on the whole training set with the best parameter combination from 1.

It is important to understand the difference between <b>best cross-validation score</b> and <b>test set score </b>:
- <b>best cross-validation score</b> is the mean cross-validation accuracy of the selected parameter combination, with cross-validation performed on the <b>training set</b>. For every parameter combination, this step involves building 10 models on the training data, each time using 9 folds together (90% of the training data) to create the model and testing this model on the remaining 10th fold (10% of the training data). The purpose of this step is to select the best parameter combination, which is the one with the highest cross-validation accuracy.
- <b>test set score</b> - this is the accuracy on the test set of a model that was created using the <b>whole training set</b> (100%) with the selected parameters. This is the result that we report as a measure of generalization performance.
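One way to see where the best cross-validation score comes from is to inspect the `cv_results_` attribute of the fitted `GridSearchCV` object. The sketch below repeats the search with the same split and grid as this tutorial and prints one row per parameter combination; the `results` and `table` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42)

param_grid = {'n_neighbors': [1, 3, 5, 11, 15], 'p': [1, 2]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid_search.fit(X_train, y_train)

# One row per parameter combination, with its mean accuracy over the 10 folds
results = pd.DataFrame(grid_search.cv_results_)
table = results[['param_n_neighbors', 'param_p', 'mean_test_score',
                 'std_test_score', 'rank_test_score']]
print(table.sort_values('rank_test_score'))

# Despite its name, mean_test_score is measured on held-out folds of the
# training set; best_score_ is the largest value in that column.
print("best_score_: {:.2f}".format(grid_search.best_score_))
print("Held-out test accuracy: {:.2f}".format(grid_search.score(X_test, y_test)))
```

The column name `mean_test_score` is a common source of confusion: it never involves `X_test`, which is used only in the final `score` call.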

## Summary

In [6]:
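nb = GaussianNB().fit(X_train, y_train)
# cross_val_score refits a clone of nb on every fold, so the fit above does not affect these scores
scores = cross_val_score(nb, iris.data, iris.target, cv=10)       # 10-fold stratified cross-validation
scores = cross_val_score(nb, iris.data, iris.target, cv=one_out)  # leave-one-out cross-validation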

param_grid = {'n_neighbors': [1, 3, 5, 11, 15], 'p': [1, 2]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))

Test set score: 0.95


## Acknowledgements

This tutorial is based on:

Aurelien Geron (2022). Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow, O'Reilly.

Andreas C. Mueller and Sarah Guido (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly.
