
Commit

Merge 001971a into 810d394
yzhao062 committed Apr 10, 2019
2 parents 810d394 + 001971a commit a7d8cc3
Showing 71 changed files with 1,723 additions and 2,996 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
__pycache__/
pyod.egg-info/
.cache/
.pytest_cache
__pycache__
8 changes: 8 additions & 0 deletions CHANGES.txt
@@ -58,6 +58,14 @@ v<0.6.8>, <01/31/2019> -- Optimize unit tests for faster execution.
v<0.6.8>, <02/08/2019> -- Update docs with media coverage.
v<0.6.8>, <02/10/2019> -- Fix issue in CBLOF for n_cluster discrepancy.
v<0.6.8>, <02/10/2019> -- Minor doc improvement and stability enhancement.
v<0.6.9>, <03/12/2019> -- Major documentation update for JMLR.
v<0.6.9>, <03/12/2019> -- Change CI tool env variable setting.
v<0.6.9>, <03/18/2019> -- Update SOS default parameter setting and documentation.
v<0.6.9>, <03/29/2019> -- Refactor visualize function (moved to utils).
v<0.6.9>, <03/30/2019> -- Add License info and show support to 996.ICU!
v<0.6.9>, <04/08/2019> -- Redesign ReadMe for clarity.
v<0.6.9>, <04/08/2019> -- Deprecate fit_predict and fit_predict_score function.
v<0.6.9>, <04/10/2019> -- Add inclusion criteria and Python 2.7 retirement notice.



342 changes: 200 additions & 142 deletions README.rst

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/README_11262018.md
@@ -325,7 +325,7 @@ To make sure the code has the same style and standard, please refer to models,
such as abod.py, hbos.py, or feature bagging for example.

You are also welcome to share your ideas by opening an issue or dropping me an email
at yuezhao@cs.toronto.edu :)
at zhaoy@cmu.edu :)

---

7 changes: 3 additions & 4 deletions docs/about.rst
@@ -11,7 +11,7 @@ Zain Nasrullah (joined in 2018):
`LinkedIn (Zain Nasrullah) <https://www.linkedin.com/in/zain-nasrullah-097a2b85>`_

Winston (Zheng) Li (joined in 2018):
`LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl/>`_
`LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl>`_

----

@@ -33,9 +33,8 @@ or::

Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. arXiv preprint arXiv:1901.01588.

PyOD paper is **accepted** at `JMLR <http://www.jmlr.org/mloss/>`_
`PyOD paper <https://arxiv.org/abs/1901.01588>`_ is **accepted** at `JMLR <http://www.jmlr.org/mloss/>`_
(machine learning open-source software track) **with minor revisions (to appear)**.
See `arxiv preprint <https://arxiv.org/abs/1901.01588>`_.


----
@@ -56,7 +55,7 @@ PyOD has been well acknowledged by the machine learning community with a few fea

**GitHub Python Trending**:

- 2019: Feb 10th-11th, Jan 23rd-24th, Jan 10th-14th
- 2019: Apr 5th-6th, Feb 10th-11th, Jan 23rd-24th, Jan 10th-14th
- 2018: Jun 15, Dec 8th-9th

**Miscellaneous**:
20 changes: 18 additions & 2 deletions docs/api_cc.rst
@@ -1,12 +1,28 @@
API CheatSheet
==============

The following APIs are applicable to all detector models for easy use.

* :func:`pyod.models.base.BaseDetector.fit`: Fit detector. y is optional for unsupervised methods.
* :func:`pyod.models.base.BaseDetector.fit_predict`: Fit detector first and then predict whether a particular sample is an outlier or not.
* :func:`pyod.models.base.BaseDetector.fit_predict_score`: Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
* :func:`pyod.models.base.BaseDetector.decision_function`: Predict raw anomaly score of X using the fitted detector.
* :func:`pyod.models.base.BaseDetector.predict`: Predict if a particular sample is an outlier or not using the fitted detector.
* :func:`pyod.models.base.BaseDetector.predict_proba`: Predict the probability of a sample being an outlier using the fitted detector.
* :func:`pyod.models.base.BaseDetector.fit_predict`: **[Deprecated in V0.6.9]** Fit detector first and then predict whether a particular sample is an outlier or not.
* :func:`pyod.models.base.BaseDetector.fit_predict_score`: **[Deprecated in V0.6.9]** Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.


Key Attributes of a fitted model:

* :attr:`pyod.models.base.BaseDetector.decision_scores_`: The outlier scores of the training data. The higher, the more abnormal.
* :attr:`pyod.models.base.BaseDetector.labels_`: The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies.


**Note** \ : fit_predict() and fit_predict_score() are deprecated in V0.6.9 due
to a consistency issue and will be removed in V0.7.2. To get the binary labels
of the training data X_train, call clf.fit(X_train) and use
:attr:`pyod.models.base.BaseDetector.labels_`, instead of calling clf.predict(X_train).


See base class definition below:

42 changes: 23 additions & 19 deletions docs/benchmark.rst
@@ -4,27 +4,37 @@ Benchmarks
Introduction
------------

To provide an overview and guidance of the implemented models, benchmark
is supplied below.

In total, 16 benchmark data are used for comparison, all datasets could be
downloaded at `ODDS <http://odds.cs.stonybrook.edu/#table1>`_.
A benchmark is supplied for select algorithms to provide an overview of the implemented models.
In total, 17 benchmark datasets are used for comparison, which
can be downloaded at `ODDS <http://odds.cs.stonybrook.edu/#table1>`_.

For each dataset, it is first split into 60% for training and 40% for testing.
All experiments are repeated 10 times independently with different splits.
The mean of 20 trials is regarded as the final result. Three evaluation metrics
All experiments are repeated 10 times independently with random splits.
The mean of 10 trials is regarded as the final result. Three evaluation metrics
are provided:

- The area under receiver operating characteristic (ROC) curve
- Precision @ rank n (P@N)
- Execution time

**Note**: LSCP is a combination framework. In this benchmark it is based on 5
LOF detectors (n_neighbors=[10,...,50]), so it is only meaningful to compare
LSCP with LOF, not with the other detection algorithms.
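A single trial of this protocol can be sketched as below. The synthetic data and the distance-based scorer are stand-ins for illustration only; the actual benchmark evaluates each PyOD model on the ODDS datasets via benchmark.py:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def knn_score(X_train, X_test, k=5):
    """Stand-in scorer: mean distance to the k nearest training points."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(180, 2),                 # inliers
               rng.uniform(4, 8, size=(20, 2))])  # planted outliers
y = np.r_[np.zeros(180), np.ones(20)]

rocs = []
for trial in range(10):                           # 10 independent trials
    X_tr, X_te, y_tr, y_te = train_test_split(    # fresh 60/40 split each time
        X, y, test_size=0.4, random_state=trial)
    rocs.append(roc_auc_score(y_te, knn_score(X_tr, X_te)))

print(round(float(np.mean(rocs)), 4))             # mean ROC over the 10 trials
```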

You are welcome to replicate this process by running:
`benchmark.py <https://github.com/yzhao062/Pyod/blob/master/notebooks/benchmark.py>`_
You can replicate this process by running
`benchmark.py <https://github.com/yzhao062/pyod/blob/master/notebooks/benchmark.py>`_.

We also provide the hardware specification for reference.

=============== =======================================
Specification Value
=============== =======================================
Platform PC
OS Microsoft Windows 10 Enterprise
CPU Intel i7-6820HQ @ 2.70GHz
RAM 32GB
Software PyCharm 2018.02
Python Python 3.6.2
Core Single core (no parallelization)
=============== =======================================


ROC Performance
---------------
@@ -44,13 +54,7 @@ P@N Performance
Execution Time
--------------

.. csv-table:: Time Complexity in Seconds (average of 10 independent trials)
.. csv-table:: Time Elapsed in Seconds (average of 10 independent trials)
:file: tables/time.csv
:header-rows: 1

Conclusion
----------

TO ADD


49 changes: 35 additions & 14 deletions docs/example.rst
@@ -48,7 +48,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
n_train=n_train, n_test=n_test, contamination=contamination)
3. Initialize a :class:`pyod.models.knn.KNN` detector, fit the model, and make
the prediction:
the prediction.

.. code-block:: python
@@ -65,7 +65,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
y_test_pred = clf.predict(X_test) # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test) # outlier scores
4. Evaluate the prediction using ROC and Precision\@rank n :func:`pyod.utils.data.evaluate_print`:
4. Evaluate the prediction using ROC and Precision @ Rank n :func:`pyod.utils.data.evaluate_print`.

.. code-block:: python
@@ -75,7 +75,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)
5. See sample outputs on both training and test data:
5. See sample outputs on both training and test data.

.. code-block:: bash
@@ -85,7 +85,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
On Test Data:
KNN ROC:0.9989, precision @ rank n:0.9
6. Generate the visualizations by visualize function included in all examples:
6. Generate the visualizations with the visualize function included in all examples.

.. code-block:: python
@@ -102,15 +102,28 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
Model Combination Example
-------------------------

`comb_example.py <https://github.com/yzhao062/Pyod/blob/master/examples/comb_example.py>`_ is a quick demo for showing the API for combining multiple algorithms.
Given we have *n* individual outlier detectors, each of them generates an individual score for all samples. The task is to combine the outputs from these detectors effectively.
Outlier detection often suffers from model instability due to its unsupervised
nature. Thus, it is recommended to combine the outputs of various detectors, e.g., by averaging,
to improve robustness. Detector combination is a subfield of outlier ensembles;
refer to :cite:`b-kalayci2018anomaly` for more information.

**Model combination example** is made available below
(`Code <https://github.com/yzhao062/Pyod/blob/master/examples/comb_example.py>`_, `Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/Pyod/master>`_):

For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"**
Four score combination mechanisms are shown in this demo:

1. Import models and generate sample data:

#. **Average**: the average of all detector scores.
#. **Maximization**: the maximum score across all detectors.
#. **Average of Maximum (AOM)**: divide base detectors into subgroups and take the maximum score of each subgroup. The final score is the average of all subgroup scores.
#. **Maximum of Average (MOA)**: divide base detectors into subgroups and take the average score of each subgroup. The final score is the maximum of all subgroup scores.
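The four mechanisms are available as average, maximization, aom and moa in pyod.models.combination; their logic can be sketched in plain NumPy, assuming a standardized (n_samples, n_detectors) score matrix (the random scores below are illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.randn(6, 20)    # standardized scores: 6 samples x 20 detectors

avg = scores.mean(axis=1)    # Average
mx = scores.max(axis=1)      # Maximization

groups = np.split(np.arange(20), 5)   # 5 subgroups of 4 detectors each

# AOM: maximum within each subgroup, then average across subgroups
aom = np.mean([scores[:, g].max(axis=1) for g in groups], axis=0)

# MOA: average within each subgroup, then maximum across subgroups
moa = np.max([scores[:, g].mean(axis=1) for g in groups], axis=0)
```

Note that for any split into equal-size subgroups, both AOM and MOA always fall between the plain average and the plain maximum.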


"examples/comb_example.py" illustrates the API for combining the output of multiple base detectors
(\ `comb_example.py <https://github.com/yzhao062/pyod/blob/master/examples/comb_example.py>`_\ ,
`Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/pyod/master>`_\ ). For Jupyter Notebooks,
please navigate to **"/notebooks/Model Combination.ipynb"**


1. Import models and generate sample data.

.. code-block:: python
@@ -121,7 +134,7 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
X, y= generate_data(train_only=True) # load data
2. First initialize 20 kNN outlier detectors with different k (10 to 200), and get the outlier scores:
2. Initialize 20 kNN outlier detectors with different k (10 to 200), and get the outlier scores.

.. code-block:: python
@@ -141,7 +154,8 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
train_scores[:, i] = clf.decision_scores_
test_scores[:, i] = clf.decision_function(X_test_norm)
3. Then the output scores are standardized into zero average and unit std before combination:
3. Then the output scores are standardized to zero mean and unit standard deviation before combination.
This step is crucial for bringing the detector outputs onto the same scale.

.. code-block:: python
Expand All @@ -150,7 +164,7 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
# scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)
4. Then four different combination algorithms are applied as described above:
4. Four different combination algorithms are applied as described above:

.. code-block:: python
Expand All @@ -159,7 +173,7 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
comb_by_aom = aom(test_scores_norm, 5) # 5 groups
comb_by_moa = moa(test_scores_norm, 5) # 5 groups
5. Finally, all four combination methods are evaluated with ROC and Precision
5. Finally, all four combination methods are evaluated by ROC and Precision
@ Rank n:

.. code-block:: bash
@@ -169,3 +183,10 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
Combination by Maximization ROC:0.9198, precision @ rank n:0.4688
Combination by AOM ROC:0.9257, precision @ rank n:0.4844
Combination by MOA ROC:0.9263, precision @ rank n:0.4688
.. rubric:: References

.. bibliography:: zreferences.bib
:cited:
:labelprefix: B
:keyprefix: b-
84 changes: 0 additions & 84 deletions docs/examples.rst

This file was deleted.
