From e60b522ad426e85ff812c47255098aaca37f7ae9 Mon Sep 17 00:00:00 2001 From: Yue Zhao Date: Mon, 26 Nov 2018 14:14:26 -0500 Subject: [PATCH] add README.rst for the transformation --- README.rst | 535 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 535 insertions(+) create mode 100644 README.rst diff --git a/README.rst b/README.rst new file mode 100644 index 000000000..3f52ef74f --- /dev/null +++ b/README.rst @@ -0,0 +1,535 @@ +.. role:: raw-html-m2r(raw) + :format: html + + +Python Outlier Detection (PyOD) +=============================== + + +.. image:: https://badge.fury.io/py/pyod.svg + :target: https://badge.fury.io/py/pyod + :alt: PyPI version + + +.. image:: https://mybinder.org/badge_logo.svg + :target: https://mybinder.org/v2/gh/yzhao062/pyod/master + :alt: Binder + + +.. image:: https://readthedocs.org/projects/pyod/badge/?version=latest + :target: https://pyod.readthedocs.io/en/latest/?badge=latest + :alt: Documentation Status + +| + +.. image:: https://ci.appveyor.com/api/projects/status/1kupdy87etks5n3r/branch/master?svg=true + :target: https://ci.appveyor.com/project/yzhao062/pyod/branch/master + :alt: Build status + + +.. image:: https://travis-ci.org/yzhao062/pyod.svg?branch=master + :target: https://travis-ci.org/yzhao062/pyod + :alt: Build Status + + +.. image:: https://coveralls.io/repos/github/yzhao062/pyod/badge.svg + :target: https://coveralls.io/github/yzhao062/pyod + :alt: Coverage Status + + +.. image:: https://api.codeclimate.com/v1/badges/bdc3d8d0454274c753c4/maintainability + :target: https://codeclimate.com/github/yzhao062/Pyod/maintainability + :alt: Maintainability + + +.. image:: https://img.shields.io/github/stars/yzhao062/pyod.svg + :target: https://github.com/yzhao062/Pyod/stargazers + :alt: GitHub stars + + +.. image:: https://img.shields.io/github/forks/yzhao062/pyod.svg + :target: https://github.com/yzhao062/Pyod/network + :alt: GitHub forks + + +.. image:: https://pepy.tech/badge/pyod + :target: https://pepy.tech/project/pyod + :alt: Downloads + + + +.. image:: https://pepy.tech/badge/pyod/month + :target: https://pepy.tech/project/pyod + :alt: Downloads + +------------------------------------------------------------------------------------------------------------- + +PyOD is a comprehensive and scalable **Python toolkit** for **detecting outlying objects** in +multivariate data. This exciting yet challenging field is commonly referred as +**\ *\ `Outlier Detection `_\ *\ ** +or **\ *\ `Anomaly Detection `_\ *\ ** . +Since 2017, PyOD has been successfully used in various academic researches [4, 8, 17] and commercial products. +PyOD is featured for: + + +* **Unified APIs, detailed documentation, and interactive examples** across various algorithms. +* **Advanced models**\ , including **Neural Networks/Deep Learning** and **Outlier Ensembles**. +* **Optimized performance with JIT and parallelization** when possible, using numba and parallelization. +* **Compatible with both Python 2 & 3** (scikit-learn compatible as well). + +**Important Notes**\ : +PyOD contains some neural network based models, e.g., AutoEncoders, which are +implemented in keras. However, PyOD would **NOT** install **keras** and/or **tensorflow** automatically. This +reduces the risk of damaging your local installations. +So you should install keras and a back-end lib like tensorflow, if you want +to use neural net based models. An instruction is provided `here `_. + +**Table of Contents**\ : +:raw-html-m2r:`` + + +* `Key Links & Resources <#key-links-and-resources>`_ +* `Quick Introduction <#quick-introduction>`_ +* `Installation <#installation>`_ +* `API Cheatsheet & Reference <#api-cheatsheet--reference>`_ +* `Quick Start for Outlier Detection <#quick-start-for-outlier-detection>`_ +* `Quick Start for Combining Outlier Scores from Various Base Detectors <#quick-start-for-combining-outlier-scores-from-various-base-detectors>`_ +* `How to Contribute and Collaborate <#how-to-contribute-and-collaborate>`_ +* `Algorithm Benchmark <#algorithm-benchmark>`_ + + +.. raw:: html + + + + + +---- + +Key Links and Resources +^^^^^^^^^^^^^^^^^^^^^^^ + + +* + **\ `Documentation & API Reference `_\ ** + .. image:: https://readthedocs.org/projects/pyod/badge/?version=latest + :target: https://pyod.readthedocs.io/en/latest/?badge=latest + :alt: Documentation Status + + +* + **\ `Examples `_\ ** **&** **\ `Example Documentation `_\ ** + +* + **\ `Interactive Jupyter Notebooks `_\ ** + .. image:: https://mybinder.org/badge_logo.svg + :target: https://mybinder.org/v2/gh/yzhao062/pyod/master + :alt: Binder + + +* + **\ `Anomaly detection resources `_\ **\ , e.g., courses, books, papers and videos. + +---- + +Quick Introduction +^^^^^^^^^^^^^^^^^^ + +PyOD toolkit consists of three major groups of functionalities: (i) outlier +detection algorithms; (ii) outlier ensemble frameworks and (iii) outlier +detection utility functions. + +**\ *Individual Detection Algorithms*\ **\ : + + +#. + Linear Models for Outlier Detection: + + + #. **PCA: Principal Component Analysis** (use the sum of + weighted projected distances to the eigenvector hyperplane) [10] + #. **MCD: Minimum Covariance Determinant** (use the mahalanobis distances + as the outlier scores) [11, 12] + #. **One-Class Support Vector Machines** [3] + +#. + Proximity-Based Outlier Detection Models: + + + #. **LOF: Local Outlier Factor** [1] + #. **CBLOF: Clustering-Based Local Outlier Factor** [15] + #. **HBOS: Histogram-based Outlier Score** [5] + #. **kNN: k Nearest Neighbors** (use the distance to the kth nearest + neighbor as the outlier score) [13] + #. **Average kNN or kNN Sum** (use the average distance to k + nearest neighbors as the outlier score or sum all k distances) [14] + #. **Median kNN** Outlier Detection (use the median distance to k nearest + neighbors as the outlier score) + +#. + Probabilistic Models for Outlier Detection: + + + #. **ABOD: Angle-Based Outlier Detection** [7] + #. **FastABOD: Fast Angle-Based Outlier Detection using approximation** [7] + +#. + Outlier Ensembles and Combination Frameworks + + + #. **Isolation Forest** [2] + #. **Feature Bagging** [9] + +#. + Neural Networks and Deep Learning Models (implemented in Keras) + + + #. + **AutoEncoder with Fully Connected NN** [16, Chapter 3] + + FAQ regarding AutoEncoder in PyOD and debugging advice: + `known issues `_ + +**\ *Outlier Detector/Scores Combination Frameworks*\ **\ : + + +#. **Feature Bagging**\ : build various detectors on random selected features [9] +#. **Average** & **Weighted Average**\ : simply combine scores by averaging [6] +#. **Maximization**\ : simply combine scores by taking the maximum across all + base detectors [6] +#. **Average of Maximum (AOM)** [6] +#. **Maximum of Average (MOA)** [6] +#. **Threshold Sum (Thresh)** [6] + +**Comparison of all implemented models** are made available below: + (\ `Figure `_\ , + `Code `_\ , + `Jupyter Notebooks `_\ ): + +For Jupyter Notebooks, please navigate to **"/notebooks/Compare All Models.ipynb"** + + +.. image:: https://raw.githubusercontent.com/yzhao062/Pyod/master/examples/ALL.png + :target: https://raw.githubusercontent.com/yzhao062/Pyod/master/examples/ALL.png + :alt: Comparision_of_All + + +---- + +Installation +^^^^^^^^^^^^ + +It is recommended to use **pip** for installation. Please make sure +**the latest version** is installed, as PyOD is updated frequently: + +.. code-block:: cmd + + pip install pyod + pip install --upgrade pyod # make sure the latest version is installed! + +Alternatively, install from github directly (\ **NOT Recommended**\ ) + +.. code-block:: cmd + + git clone https://github.com/yzhao062/pyod.git + python setup.py install + +**Required Dependencies**\ : + + +* Python 2.7, 3.5, 3.6, or 3.7 +* nose +* numpy>=1.13 +* numba>=0.35 +* scipy>=0.19.1 +* scikit_learn>=0.19.1 + +**Optional Dependencies (required for running examples and AutoEncoder)**\ : + + +* keras (optional, required if calling AutoEncoder, other backend works) +* matplotlib (optional, required for running examples) +* tensorflow (optional, required if calling AutoEncoder, other backend works) + +**Known Issue 1**\ : PyOD depends on matplotlib, which would throw errors in conda +virtual environment on mac OS. See reasons and solutions `here `_. + +**Known Issue 2**\ : PyOD builds on various packages, which most of them you should have +already installed. If you are installing PyOD in a fresh state (virtualenv), +downloading and installing the dependencies, e.g., TensorFlow, may take +**3-5 mins**. + +**Known Issue 3**\ : If you are willing to run examples, matplotlib is required. +PyOD does not list it as a required package for eliminating the dependency. +Similarly, Keras and TensorFlow are listed as optional. However, they are +both required if you want to use neural network based models, such as +AutoEncoder. See reasons and solutions `here `_ + +---- + +API Cheatsheet & Reference +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Full API Reference: (https://pyod.readthedocs.io/en/latest/pyod.html). API cheatsheet for all detectors: + + +* **fit(X)**\ : Fit detector. +* **fit_predict(X)**\ : Fit detector and predict if a particular sample is an outlier or not. +* **fit_predict_score(X, y)**\ : Fit, predict and then evaluate with predefined metrics (ROC and precision @ rank n). +* **decision_function(X)**\ : Predict anomaly score of X of the base classifiers. +* **predict(X)**\ : Predict if a particular sample is an outlier or not. The model must be fitted first. +* **predict_proba(X)**\ : Predict the probability of a sample being outlier. The model must be fitted first. + +Key Attributes of a fitted model: + + +* **decision\ *scores*\ **\ : The outlier scores of the training data. The higher, the more abnormal. + Outliers tend to have higher scores. +* **labels_**\ : The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. + +Full package structure can be found below: + + +* http://pyod.readthedocs.io/en/latest/genindex.html +* http://pyod.readthedocs.io/en/latest/py-modindex.html + +---- + +Quick Start for Outlier Detection +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +See **examples directory** for more demos. `"examples/knn_example.py" `_ +demonstrates the basic APIs of PyOD using kNN detector. **It is noted the APIs for other detectors are similar**. + +More detailed instruction of running examples can be found `here. `_ + + +#. + Initialize a kNN detector, fit the model, and make the prediction. + + .. code-block:: python + + + from pyod.models.knn import KNN # kNN detector + + # train kNN detector + clf_name = 'KNN' + clf = KNN() + clf.fit(X_train) + + # get the prediction label and outlier scores of the training data + y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers) + y_train_scores = clf.decision_scores_ # raw outlier scores + + # get the prediction on the test data + y_test_pred = clf.predict(X_test) # outlier labels (0 or 1) + y_test_scores = clf.decision_function(X_test) # outlier scores + +#. + Evaluate the prediction by ROC and Precision@rank *n* (p@n): + + .. code-block:: python + + # evaluate and print the results + print("\nOn Training Data:") + evaluate_print(clf_name, y_train, y_train_scores) + print("\nOn Test Data:") + evaluate_print(clf_name, y_test, y_test_scores) + + + #. + See a sample output & visualization + + .. code-block:: python + + On Training Data: + KNN ROC:1.0, precision @ rank n:1.0 + + On Test Data: + KNN ROC:0.9989, precision @ rank n:0.9 + + .. code-block:: python + + visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, + y_test_pred, show_figure=True, save_figure=False) + +Visualization (\ `knn_figure `_\ ): + +.. image:: https://raw.githubusercontent.com/yzhao062/Pyod/master/examples/KNN.png + :target: https://raw.githubusercontent.com/yzhao062/Pyod/master/examples/KNN.png + :alt: kNN example figure + + +---- + +Quick Start for Combining Outlier Scores from Various Base Detectors +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +"examples/comb_example.py" illustrates the APIs for combining multiple base detectors +(\ `Code `_\ , +`Jupyter Notebooks `_\ ). + +For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"** + +Given we have *n* individual outlier detectors, each of them generates an individual score for all samples. +The task is to combine the outputs from these detectors effectively +**Key Step: conducting Z-score normalization on raw scores before the combination.** +Four combination mechanisms are shown in this demo: + + +#. Average: take the average of all base detectors. +#. maximization : take the maximum score across all detectors as the score. +#. Average of Maximum (AOM): first randomly split n detectors in to p groups. For each group, use the maximum within the group as the group output. Use the average of all group outputs as the final output. +#. Maximum of Average (MOA): similarly to AOM, the same grouping is introduced. However, we use the average of a group as the group output, and use maximum of all group outputs as the final output. + To better understand the merging techniques, refer to [6]. + +The walkthrough of the code example is provided: + + +#. + Import models and generate sample data + + .. code-block:: python + + + from pyod.models.knn import KNN + from pyod.models.combination import aom, moa, average, maximization + from pyod.utils.data import generate_data + + X, y = generate_data(train_only=True) # load data + +#. + First initialize 20 kNN outlier detectors with different k (10 to 200), and get the outlier scores: + + .. code-block:: python + + # initialize 20 base detectors for combination + k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, + 150, 160, 170, 180, 190, 200] + + train_scores = np.zeros([X_train.shape[0], n_clf]) + test_scores = np.zeros([X_test.shape[0], n_clf]) + + for i in range(n_clf): + k = k_list[i] + + clf = KNN(n_neighbors=k, method='largest') + clf.fit(X_train_norm) + + train_scores[:, i] = clf.decision_scores_ + test_scores[:, i] = clf.decision_function(X_test_norm) + +#. Then the output codes are standardized into zero mean and unit variance before combination. + .. code-block:: python + + from pyod.utils.utility import standardizer + train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores) + +#. Then four different combination algorithms are applied as described above: + .. code-block:: python + + comb_by_average = average(test_scores_norm) + comb_by_maximization = maximization(test_scores_norm) + comb_by_aom = aom(test_scores_norm, 5) # 5 groups + comb_by_moa = moa(test_scores_norm, 5)) # 5 groups + +#. Finally, all four combination methods are evaluated with ROC and Precision + @ Rank n: + .. code-block:: bash + + Combining 20 kNN detectors + Combination by Average ROC:0.9194, precision @ rank n:0.4531 + Combination by Maximization ROC:0.9198, precision @ rank n:0.4688 + Combination by AOM ROC:0.9257, precision @ rank n:0.4844 + Combination by MOA ROC:0.9263, precision @ rank n:0.4688 + +---- + +How to Contribute and Collaborate +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +You are welcome to contribute to this exciting project, and we are preparing +a manuscript at `JMLR `_ (Track for open-source software). + +If you are interested in contributing: + + +* + Please first check Issue lists for "help wanted" tag and comment the one + you are interested + +* + Fork the repository and add your improvement/modification/fix + +* + Create a pull request + +To make sure the code has the same style and standard, please refer to models, +such as abod.py, hbos.py, or feature bagging for example. + +You are also welcome to share your ideas by opening an issue or dropping me an email +at yuezhao@cs.toronto.edu :) + +---- + +Algorithm Benchmark +^^^^^^^^^^^^^^^^^^^ + +To provide an overview and quick guidance of the implemented models, benchmark +is supplied. + +In total, 17 benchmark data are used for comparision, all datasets could be +downloaded at `ODDS `_. + +For each dataset, it is first split into 60% for training and 40% for testing. +All experiments are repeated 20 times independently with different samplings. +The mean of 20 trials are taken as the final result. Three evaluation metrics +are provided: + + +* The area under receiver operating characteristic (ROC) curve +* Precision @ rank n (P@N) +* Execution time + +Check the latest result `here `_. +You are welcome to replicate this process by running +`benchmark.py `_. + +---- + +Reference +^^^^^^^^^ + +[1] Breunig, M.M., Kriegel, H.P., Ng, R.T. and Sander, J., 2000, May. LOF: identifying density-based local outliers. In *ACM SIGMOD Record*\ , pp. 93-104. ACM. + +[2] Liu, F.T., Ting, K.M. and Zhou, Z.H., 2008, December. Isolation forest. In *ICDM '08*\ , pp. 413-422. IEEE. + +[3] Ma, J. and Perkins, S., 2003, July. Time-series novelty detection using one-class support vector machines. In *IJCNN' 03*\ , pp. 1741-1745. IEEE. + +[4] Y. Zhao and M.K. Hryniewicki, "DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles," *ACM SIGKDD Workshop on Outlier Detection De-constructed (ODD v5.0)*\ , 2018. + +[5] Goldstein, M. and Dengel, A., 2012. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm. In *KI-2012: Poster and Demo Track*\ , pp.59-63. + +[6] Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles.\ *ACM SIGKDD Explorations Newsletter*\ , 17(1), pp.24-47. + +[7] Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In *KDD '08*\ , pp. 444-452. ACM. + +[8] Y. Zhao and M.K. Hryniewicki, "XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning," *IEEE International Joint Conference on Neural Networks*\ , 2018. + +[9] Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In *KDD '05*. 2005. + +[10] Shyu, M.L., Chen, S.C., Sarinnapakorn, K. and Chang, L., 2003. A novel anomaly detection scheme based on principal component classifier. *MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING*. + +[11] Rousseeuw, P.J. and Driessen, K.V., 1999. A fast algorithm for the minimum covariance determinant estimator. *Technometrics*\ , 41(3), pp.212-223. + +[12] Hardin, J. and Rocke, D.M., 2004. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. *Computational Statistics & Data Analysis*\ , 44(4), pp.625-638. + +[13] Ramaswamy, S., Rastogi, R. and Shim, K., 2000, May. Efficient algorithms for mining outliers from large data sets. *ACM Sigmod Record*\ , 29(2), pp. 427-438). + +[14] Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In *European Conference on Principles of Data Mining and Knowledge Discovery* pp. 15-27. + +[15] He, Z., Xu, X. and Deng, S., 2003. Discovering cluster-based local outliers. *Pattern Recognition Letters*\ , 24(9-10), pp.1641-1650. + +[16] Aggarwal, C.C., 2015. Outlier analysis. In Data mining (pp. 237-263). Springer, Cham. + +[17] Zhao, Y., Hryniewicki, M.K., Nasrullah, Z., and Li, Z. SCP: Selective Combination in Parallel Outlier Ensembles. *SIAM International Conference on Data Mining (SDM)*. **Currently Under Review**.