#  <span style="color:orange">Anomaly Detection Tutorial (ANO101) - Level Beginner</span>

**Version used: PyCaret 2.2** <br />
**Date updated: November 25, 2020**

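# 1.0 Tutorial Objective
Welcome to the Anomaly Detection Tutorial **(ANO101)**, Level Beginner. This tutorial introduces the basics of anomaly detection with the `pycaret.anomaly` module and is aimed at readers who are new to PyCaret.

In this tutorial you will learn:


* **Getting Data:** How to import data from PyCaret?
* **Setting up Environment:** How to set up the environment required for anomaly detection?
* **Create Model:** How to create a model and assign anomaly labels to the original data?
* **Plot Model:** How to analyze model performance using various plots?
* **Predict Model:** How to use a trained model to make predictions on new data?
* **Save / Load Model:** How to save and load a model for later use?

Read Time : Approx. 25 Minutes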


# 1.1 Installing PyCaret
Installing PyCaret usually takes only a few minutes. Follow the instructions below.

# Installing PyCaret in a Local Jupyter Notebook
`pip install pycaret`  <br />

# Installing PyCaret on Google Colab or Azure Notebooks
`!pip install pycaret`


# 1.2 Prerequisites
- Python 3.6 or greater
- PyCaret 2.0 or greater
- Internet connection to load data from PyCaret
- Basic knowledge of anomaly detection

# 1.3 For Google Colab Users:
If you are running this notebook on Google Colab, run the following code to display interactive visuals.<br/>
<br/>
`from pycaret.utils import enable_colab` <br/>
`enable_colab()`

# 1.4 See Also:
- __[Anomaly Detection Tutorial (ANO102) - Level Intermediate](https://github.com/pycaret/pycaret/blob/master/tutorials/Anomaly%20Detection%20Tutorial%20Level%20Intermediate%20-%20ANO102.ipynb)__
- __[Anomaly Detection Tutorial (ANO103) - Level Expert](https://github.com/pycaret/pycaret/blob/master/tutorials/Anomaly%20Detection%20Tutorial%20Level%20Expert%20-%20ANO103.ipynb)__

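# 2.0 What is Anomaly Detection?

Anomaly detection identifies observations that differ significantly from the majority of the data. Typical applications include bank fraud detection, structural defects, medical problems, and errors in text. Anomaly detection techniques fall into three broad categories:

- **Unsupervised anomaly detection:** Detects anomalies in an unlabeled test data set by assuming that the majority of instances are normal and looking for the instances that fit the rest of the data least well.
<br/>
<br/>
- **Supervised anomaly detection:** Requires a labeled data set in which every instance is marked as "abnormal" or "normal". <br/>
<br/>
- **Semi-supervised anomaly detection:** Builds a model of the normal behavior seen in a training set, then uses it to estimate how likely each test instance is to be an anomaly.<br/>

The `pycaret.anomaly` module supports unsupervised and supervised anomaly detection. This tutorial covers only unsupervised anomaly detection.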
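
To make the unsupervised idea concrete, here is a minimal sketch that flags the values fitting the rest of a small sample least well. It uses a modified z-score based on the median, a simple univariate technique chosen purely for illustration; it is not one of the algorithms in `pycaret.anomaly`, and the function name `mad_outliers`, the threshold of 3.5, and the readings are all assumptions made for this example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all values (nearly) identical, nothing stands out
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up readings: most are close to 10, one is far away.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
print(mad_outliers(readings))  # [5]
```

No labels are needed: the rule only assumes that most readings are normal, which is the same assumption described in the first bullet above.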

__[Learn more about anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection)__

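# 3.0  Overview of the Anomaly Detection Module in PyCaret
PyCaret's anomaly detection module (`pycaret.anomaly`) is based on unsupervised machine learning and identifies anomalous observations by how much they differ from the majority of the data.

The module provides several preprocessing features that are configured through the `setup()` function. It offers more than 12 algorithms, along with plots for analyzing the results of anomaly detection. Its `tune_model()` function tunes a model's hyperparameters to optimize a supervised metric such as `AUC` (classification) or `R2` (regression).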

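# 4.0 Dataset for the Tutorial

This tutorial uses the **Mice Protein Expression** dataset. It consists of the expression levels of 77 proteins and protein modifications that produced detectable signals in the nuclear fraction of the cortex. __[Read more](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression)__ 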


# Dataset Authors and Contact:
Clara Higuera Department of Software Engineering and Artificial Intelligence, Faculty of Informatics and the Department of Biochemistry and Molecular Biology, Faculty of Chemistry, University Complutense, Madrid, Spain.
Email: clarahiguera@ucm.es

Katheleen J. Gardiner, creator and owner of the protein expression data, is currently with the Linda Crnic Institute for Down Syndrome, Department of Pediatrics, Department of Biochemistry and Molecular Genetics, Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Aurora, Colorado, USA.
Email: katheleen.gardiner@ucdenver.edu

Krzysztof J. Cios is currently with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA, and IITiS Polish Academy of Sciences, Poland.
Email: kcios@vcu.edu

__[Link to the original dataset](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression)__ 

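# 5.0 Getting the Data

You can download the data from the original source (__[download link](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression)__) and load it with pandas __[(pandas documentation)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)__, or you can use PyCaret's `get_data()` function to fetch it (requires an Internet connection).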

In [1]:
from pycaret.datasets import get_data
dataset = get_data('mice')

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
0,309_1,0.503644,0.747193,0.430175,2.816329,5.990152,0.21883,0.177565,2.373744,0.232224,...,0.108336,0.427099,0.114783,0.13179,0.128186,1.675652,Control,Memantine,C/S,c-CS-m
1,309_2,0.514617,0.689064,0.41177,2.789514,5.685038,0.211636,0.172817,2.29215,0.226972,...,0.104315,0.441581,0.111974,0.135103,0.131119,1.74361,Control,Memantine,C/S,c-CS-m
2,309_3,0.509183,0.730247,0.418309,2.687201,5.622059,0.209011,0.175722,2.283337,0.230247,...,0.106219,0.435777,0.111883,0.133362,0.127431,1.926427,Control,Memantine,C/S,c-CS-m
3,309_4,0.442107,0.617076,0.358626,2.466947,4.979503,0.222886,0.176463,2.152301,0.207004,...,0.111262,0.391691,0.130405,0.147444,0.146901,1.700563,Control,Memantine,C/S,c-CS-m
4,309_5,0.43494,0.61743,0.358802,2.365785,4.718679,0.213106,0.173627,2.134014,0.192158,...,0.110694,0.434154,0.118481,0.140314,0.14838,1.83973,Control,Memantine,C/S,c-CS-m
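
If you take the download route instead, loading a local copy with pandas might look like the sketch below. To stay self-contained it reads from an in-memory string rather than a file; the file name mentioned in the comment is hypothetical, and only a few of the columns shown above are reproduced.

```python
import io

import pandas as pd

# Stand-in for a downloaded file. With a real file you would pass its path,
# e.g. pd.read_csv("mice.csv"), where "mice.csv" is a hypothetical name.
csv_text = io.StringIO(
    "MouseID,DYRK1A_N,ITSN1_N,Genotype\n"
    "309_1,0.503644,0.747193,Control\n"
    "309_2,0.514617,0.689064,Control\n"
)

local_dataset = pd.read_csv(csv_text)
print(local_dataset.shape)   # (2, 4)
print(local_dataset.dtypes)  # numeric columns are parsed as float64
```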


In [2]:
# check the shape of the data
dataset.shape

(1080, 82)

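We hold out 5% of the samples to demonstrate the `predict_model()` function on unseen data. This is not the same as a train/test split: the held-out 5% simulates a real-life scenario in which new data arrives after the model has been trained.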

In [3]:
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
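
The `sample()` / `drop()` pattern used above can be checked on a toy frame. The sketch below (the 20-row frame and its `value` column are made up for illustration) confirms that the two pieces are disjoint and together cover every original row.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(20)})

# Keep 95% of the rows for modeling; the fixed seed makes the draw repeatable.
modeling = toy.sample(frac=0.95, random_state=786)
# Everything that was not sampled becomes the unseen hold-out set.
unseen = toy.drop(modeling.index)

print(len(modeling), len(unseen))                    # 19 1
print(set(modeling.index).isdisjoint(unseen.index))  # True
```

Because `drop()` works on index labels, the split has to happen before `reset_index()` is called, which is the order the cell above follows.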


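# 6.0 Setting up the Environment in PyCaret

The `setup()` function initializes the PyCaret environment and creates the preprocessing pipeline that prepares the data for model training. `setup()` must be called before any other function. It requires only one parameter: a pandas dataframe. All other parameters are optional, and we will use them in the more advanced tutorials.

When `setup()` runs, PyCaret automatically infers the data type of every feature. The inference is usually correct, but not always, so PyCaret displays a table of the inferred types for you to confirm. If the types are correct, press `enter` to continue; otherwise type `quit` to exit. Getting the data types right is critical in PyCaret because all the preprocessing steps depend on them. Later tutorials show how to override the inferred types with the `numeric_features` and `categorical_features` parameters of `setup()`.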

In [4]:
from pycaret.anomaly import *

exp_ano101 = setup(data, normalize = True, 
                   ignore_features = ['MouseID'],
                   session_id = 123)

Unnamed: 0,Description,Value
0,session_id,123
1,Original Data,"(1026, 82)"
2,Missing Values,True
3,Numeric Features,77
4,Categorical Features,4
5,Ordinal Features,False
6,High Cardinality Features,False
7,High Cardinality Method,
8,Transformed Data,"(1026, 91)"
9,CPU Jobs,-1


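Once `setup()` has run successfully, it prints a table with several important pieces of information, most of them related to preprocessing. For this tutorial, the entries to note are:

- **session_id:**  A user-defined number used as the random seed so that results can be reproduced later. If no `session_id` is passed, a random number is generated. In this experiment the seed is `123`.<br/>
<br/>
- **Missing Values:**  Shows `True` when the original data contains missing values. PyCaret imputes them automatically, using `mean` for numeric features and `constant` for categorical features. Both methods can be changed with the `numeric_imputation` and `categorical_imputation` parameters of `setup()`. <br/>
<br/>
- **Original Data:**  The shape of the original data. Here (1026, 82) means 1026 samples and 82 features. <br/>
<br/>
- **Transformed Data:** The shape of the data after transformation. It now has 91 features because the categorical features were encoded. <br/>
<br/>
- **Numeric Features:**  The number of numeric features. Of the 82 features, 77 are numeric. <br/>
<br/>
- **Categorical Features:**  The number of categorical features. Of the 82 features, 5 are categorical. The table reports 4 because one of them, `MouseID`, was excluded with the `ignore_features` parameter. <br/>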
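
The sketch below imitates, in plain pandas, the kind of steps the table describes: mean imputation of numeric features, normalization, and encoding of categorical features. It is a rough approximation for intuition only, not PyCaret's actual pipeline; the column names `protein_a` and `protein_b` and all values are made up, and z-score scaling is assumed as the normalization method.

```python
import pandas as pd

raw = pd.DataFrame({
    "protein_a": [0.50, None, 0.44, 0.43],   # hypothetical numeric feature
    "protein_b": [2.8, 2.7, None, 2.4],      # hypothetical numeric feature
    "Genotype": ["Control", "Ts65Dn", "Control", "Ts65Dn"],
    "Treatment": ["Memantine", "Saline", "Saline", "Memantine"],
})
numeric_cols = ["protein_a", "protein_b"]
categorical_cols = ["Genotype", "Treatment"]

# 1. Fill missing numeric values with the column mean.
numeric = raw[numeric_cols].fillna(raw[numeric_cols].mean())
# 2. Normalize each numeric column to zero mean and unit variance.
numeric = (numeric - numeric.mean()) / numeric.std(ddof=0)
# 3. One-hot encode the categorical columns (one new column per level).
encoded = pd.get_dummies(raw[categorical_cols])

transformed = pd.concat([numeric, encoded], axis=1)
print(raw.shape, "->", transformed.shape)  # (4, 4) -> (4, 6)
```

Encoding is what makes the column count grow: each categorical column is replaced by one column per level, which is why the transformed data has more features than the original.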

# 7.0 Create a Model

Creating a model in PyCaret is simple. The `create_model()` function takes one mandatory parameter, the name of the model, and returns a trained model object.

In [5]:
iforest = create_model('iforest')

In [6]:
print(iforest)

IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)


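We have created an Isolation Forest model. Note that the `contamination` parameter is set to `0.05`, the default used when no `fraction` is passed to `create_model()`. The `fraction` parameter sets the proportion of outliers expected in the data. Cell `In [7]` creates a `One Class Support Vector Machine` model with `fraction` set to 0.025.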
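
One way to picture what the fraction does is as a cut-off on the anomaly scores: the highest-scoring share of the samples is labeled as outliers. The sketch below illustrates that idea with NumPy on made-up scores. The percentile rule and the helper `label_by_fraction` are assumptions for illustration, not a description of PyCaret's internals.

```python
import numpy as np

rng = np.random.default_rng(123)
# Made-up anomaly scores for 1000 samples; higher means more anomalous.
scores = rng.normal(size=1000)


def label_by_fraction(scores, fraction):
    """Label the top `fraction` of scores as outliers (1), the rest as inliers (0)."""
    threshold = np.percentile(scores, 100 * (1 - fraction))
    return (scores > threshold).astype(int)


for fraction in (0.05, 0.025):
    labels = label_by_fraction(scores, fraction)
    print(fraction, int(labels.sum()))  # 0.05 -> 50 outliers, 0.025 -> 25 outliers
```

Halving the fraction halves the number of flagged samples without changing the scores themselves, so the fraction is best read as a statement about how many outliers you expect.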

In [7]:
svm = create_model('svm', fraction = 0.025)

In [None]:
print(svm)

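By replacing `iforest` with `svm`, we have created a second anomaly detection model. The module currently offers 12 ready-to-use models; call `models()` to list them.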

In [8]:
models()

ID,Name,Reference
abod,Angle-base Outlier Detection,pyod.models.abod.ABOD
cluster,Clustering-Based Local Outlier,pyod.models.cblof.CBLOF
cof,Connectivity-Based Local Outlier,pyod.models.cof.COF
iforest,Isolation Forest,pyod.models.iforest.IForest
histogram,Histogram-based Outlier Detection,pyod.models.hbos.HBOS
knn,K-Nearest Neighbors Detector,pyod.models.knn.KNN
lof,Local Outlier Factor,pyod.models.lof.LOF
svm,One-class SVM detector,pyod.models.ocsvm.OCSVM
pca,Principal Component Analysis,pyod.models.pca.PCA
mcd,Minimum Covariance Determinant,pyod.models.mcd.MCD


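# 8.0 Assign a Model

Now we use the `assign_model()` function to attach anomaly labels to the 1026 samples passed to `setup()` so that we can analyze the results.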

In [9]:
iforest_results = assign_model(iforest)
iforest_results.head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
0,3501_12,0.34493,0.626194,0.383583,2.534561,4.097317,0.303547,0.222829,4.592769,0.239427,...,0.2527,0.218868,0.249187,1.139493,Ts65Dn,Memantine,S/C,t-SC-m,0,-0.014462
1,3520_5,0.630001,0.839187,0.357777,2.651229,4.261675,0.253184,0.185257,3.816673,0.20494,...,0.155008,0.153219,,1.642886,Control,Memantine,C/S,c-CS-m,0,-0.070193
2,3414_13,0.555122,0.726229,0.278319,2.097249,2.897553,0.222222,0.174356,1.86788,0.203379,...,0.136109,0.15553,0.185484,1.65767,Ts65Dn,Memantine,C/S,t-CS-m,0,-0.070143
3,3488_8,0.275849,0.430764,0.285166,2.265254,3.250091,0.189258,0.157837,2.917611,0.202594,...,0.127944,0.207671,0.175357,0.893598,Control,Saline,S/C,c-SC-s,0,-0.080521
4,3501_7,0.304788,0.617299,0.335164,2.638236,4.876609,0.28059,0.199417,4.835421,0.236314,...,0.245277,0.202171,0.240372,0.795637,Ts65Dn,Memantine,S/C,t-SC-m,0,-0.064749


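Note that two columns, `Anomaly` and `Anomaly_Score`, have been appended to the data. In `Anomaly`, 0 stands for an inlier and 1 for an outlier. `Anomaly_Score` holds the value computed by the algorithm. Note also that `MouseID`, which was ignored during `setup()`, is back in the results. 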
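
Since the result is an ordinary pandas dataframe, the two new columns can be used to pull out or rank the flagged rows. The sketch below works on a small stand-in frame with the same column names; the rows are illustrative, and the ID `hypothetical_1` is invented for this example.

```python
import pandas as pd

# Stand-in for a frame returned by assign_model(); values are illustrative.
results = pd.DataFrame({
    "MouseID": ["3501_12", "3520_5", "3414_13", "hypothetical_1"],
    "Anomaly": [0, 0, 0, 1],
    "Anomaly_Score": [-0.014462, -0.070193, -0.070143, 0.037436],
})

# Keep only the rows labeled as outliers.
outliers = results[results["Anomaly"] == 1]
# Rank every row from most to least anomalous.
ranked = results.sort_values("Anomaly_Score", ascending=False)

print(len(outliers), "of", len(results), "rows flagged")  # 1 of 4 rows flagged
print(ranked.head(2))
```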

# 9.0 Plot a Model

The `plot_model()` function is used to analyze a model. It takes a trained model object and returns a plot.

# 9.1 T-distributed Stochastic Neighbor Embedding (t-SNE)

In [10]:
plot_model(iforest)

# 9.2 Uniform Manifold Approximation and Projection (UMAP)

In [11]:
plot_model(iforest, plot = 'umap')

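# 10.0 Predict on Unseen Data

The `predict_model()` function makes predictions on data the model has not seen. We now use the `iforest` model to predict on `data_unseen`, the samples we set aside earlier.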

In [12]:
unseen_predictions = predict_model(iforest, data=data_unseen)
unseen_predictions.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
0,0.447506,0.628176,0.367388,2.385939,4.807635,0.218578,0.176233,2.141282,0.195188,1.442398,...,0.116657,0.140766,0.14218,1.816389,Control,Memantine,C/S,c-CS-m,0,-0.077131
1,0.704633,0.802537,0.35011,2.467733,5.5484,0.205323,0.165058,2.107281,0.171401,1.938913,...,0.111089,0.157731,0.158543,1.404481,Control,Memantine,C/S,c-CS-m,0,-0.060165
2,0.505093,0.695549,0.376029,2.915585,5.917957,0.226734,0.174271,2.663039,0.190038,1.535091,...,0.131515,0.188391,,1.69926,Control,Memantine,C/S,c-CS-m,0,-0.052132
3,0.429133,0.563175,0.258429,2.028151,3.542553,0.214075,0.176759,3.165139,0.16743,1.217676,...,0.118223,0.171071,0.173702,1.405727,Control,Memantine,C/S,c-CS-m,0,-0.09161
4,0.373648,0.471165,0.257909,1.860032,2.938526,0.218262,0.15038,2.610132,0.142571,1.020024,...,0.086785,0.126537,0.11269,0.790975,Control,Memantine,C/S,c-CS-m,1,0.037436


Note that outliers receive higher `Anomaly_Score` values. You can also use the `predict_model()` function to label the training data. 

In [13]:
data_predictions = predict_model(iforest, data = data)
data_predictions.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
0,0.34493,0.626194,0.383583,2.534561,4.097317,0.303547,0.222829,4.592769,0.239427,1.360164,...,0.2527,0.218868,0.249187,1.139493,Ts65Dn,Memantine,S/C,t-SC-m,0,-0.014462
1,0.630001,0.839187,0.357777,2.651229,4.261675,0.253184,0.185257,3.816673,0.20494,1.716583,...,0.155008,0.153219,,1.642886,Control,Memantine,C/S,c-CS-m,0,-0.070193
2,0.555122,0.726229,0.278319,2.097249,2.897553,0.222222,0.174356,1.86788,0.203379,1.610136,...,0.136109,0.15553,0.185484,1.65767,Ts65Dn,Memantine,C/S,t-CS-m,0,-0.070143
3,0.275849,0.430764,0.285166,2.265254,3.250091,0.189258,0.157837,2.917611,0.202594,1.734746,...,0.127944,0.207671,0.175357,0.893598,Control,Saline,S/C,c-SC-s,0,-0.080521
4,0.304788,0.617299,0.335164,2.638236,4.876609,0.28059,0.199417,4.835421,0.236314,1.226532,...,0.245277,0.202171,0.240372,0.795637,Ts65Dn,Memantine,S/C,t-SC-m,0,-0.064749


# 11.0 Saving the Model

We can now use the `save_model()` function to save the model for later use. 

In [14]:
save_model(iforest,'Final IForest Model 25Nov2020')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True,
                                       features_todrop=['MouseID'],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[],
                                       target='UNSUPERVISED_DUMMY_TARGET',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='most frequent',
                                 fill_value_categorical=None,
                                 fill_value_n...
                 ('fix_perfect', 'passthrough'),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('

# 12.0 Loading the Saved Model

We use the `load_model()` function to load a saved model.

In [15]:
saved_iforest = load_model('Final IForest Model 25Nov2020')

Transformation Pipeline and Model Successfully Loaded


Once the model has been loaded, we can pass it directly to the `predict_model()` function to make predictions.

In [16]:
new_prediction = predict_model(saved_iforest, data=data_unseen)

In [17]:
new_prediction.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
0,0.447506,0.628176,0.367388,2.385939,4.807635,0.218578,0.176233,2.141282,0.195188,1.442398,...,0.116657,0.140766,0.14218,1.816389,Control,Memantine,C/S,c-CS-m,0,-0.077131
1,0.704633,0.802537,0.35011,2.467733,5.5484,0.205323,0.165058,2.107281,0.171401,1.938913,...,0.111089,0.157731,0.158543,1.404481,Control,Memantine,C/S,c-CS-m,0,-0.060165
2,0.505093,0.695549,0.376029,2.915585,5.917957,0.226734,0.174271,2.663039,0.190038,1.535091,...,0.131515,0.188391,,1.69926,Control,Memantine,C/S,c-CS-m,0,-0.052132
3,0.429133,0.563175,0.258429,2.028151,3.542553,0.214075,0.176759,3.165139,0.16743,1.217676,...,0.118223,0.171071,0.173702,1.405727,Control,Memantine,C/S,c-CS-m,0,-0.09161
4,0.373648,0.471165,0.257909,1.860032,2.938526,0.218262,0.15038,2.610132,0.142571,1.020024,...,0.086785,0.126537,0.11269,0.790975,Control,Memantine,C/S,c-CS-m,1,0.037436


Note that the results of `unseen_predictions` and `new_prediction` are identical, because both come from the same model.
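
The same round-trip property can be shown with a much simpler stand-in. The sketch below saves a toy "model" (just the parameters of a threshold rule) with the standard-library `pickle` module and checks that the restored copy predicts identically. The dictionary, the `predict` helper, and the file name are all hypothetical; this is an analogy for the save/load behavior, not a description of how `save_model()` is implemented.

```python
import os
import pickle
import tempfile

# Toy "model": the parameters of a simple distance-from-center rule.
model = {"center": 10.0, "max_distance": 2.5}


def predict(model, values):
    """Label a value 1 (outlier) when it lies too far from the center."""
    return [int(abs(v - model["center"]) > model["max_distance"]) for v in values]


new_values = [9.5, 13.2, 10.4]
before = predict(model, new_values)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(model, handle)       # save to disk
    with open(path, "rb") as handle:
        restored = pickle.load(handle)   # load it back

after = predict(restored, new_values)
print(before, after, before == after)  # [0, 1, 0] [0, 1, 0] True
```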

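# 13.0 Wrap-up / Next Steps

This tutorial has covered the basics of `pycaret.anomaly`. Later tutorials go deeper into more advanced concepts that give you full control over your models. 
Follow the link to the intermediate tutorial: __[Anomaly Detection Tutorial (ANO102) - Level Intermediate](https://github.com/pycaret/pycaret/blob/master/tutorials/Anomaly%20Detection%20Tutorial%20Level%20Intermediate%20-%20ANO102.ipynb)__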