
Commit

Merge 001971a into 810d394
yzhao062 committed Apr 10, 2019
2 parents 810d394 + 001971a commit a7d8cc3
Showing 71 changed files with 1,723 additions and 2,996 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
__pycache__/
pyod.egg-info/
.cache/
.pytest_cache
__pycache__
8 changes: 8 additions & 0 deletions CHANGES.txt
@@ -58,6 +58,14 @@ v<0.6.8>, <01/31/2019> -- Optimize unit tests for faster execution.
v<0.6.8>, <02/08/2019> -- Update docs with media coverage.
v<0.6.8>, <02/10/2019> -- Fix issue in CBLOF for n_cluster discrepancy.
v<0.6.8>, <02/10/2019> -- Minor doc improvement and stability enhancement.
v<0.6.9>, <03/12/2019> -- Major documentation update for JMLR.
v<0.6.9>, <03/12/2019> -- Change CI tool env variable setting.
v<0.6.9>, <03/18/2019> -- Update SOS default parameter setting and documentation.
v<0.6.9>, <03/29/2019> -- Refactor visualize function (moved to utils).
v<0.6.9>, <03/30/2019> -- Add License info and show support to 996.ICU!
v<0.6.9>, <04/08/2019> -- Redesign ReadMe for clarity.
v<0.6.9>, <04/08/2019> -- Deprecate fit_predict and fit_predict_score function.
v<0.6.9>, <04/10/2019> -- Add inclusion criteria and Python 2.7 retirement notice.



342 changes: 200 additions & 142 deletions README.rst

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/README_11262018.md
@@ -325,7 +325,7 @@ To make sure the code has the same style and standard, please refer to models,
such as abod.py, hbos.py, or feature bagging for example.

You are also welcome to share your ideas by opening an issue or dropping me an email
at yuezhao@cs.toronto.edu :)
at zhaoy@cmu.edu :)

---

7 changes: 3 additions & 4 deletions docs/about.rst
@@ -11,7 +11,7 @@ Zain Nasrullah (joined in 2018):
`LinkedIn (Zain Nasrullah) <https://www.linkedin.com/in/zain-nasrullah-097a2b85>`_

Winston (Zheng) Li (joined in 2018):
`LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl/>`_
`LinkedIn (Winston Li) <https://www.linkedin.com/in/winstonl>`_

----

@@ -33,9 +33,8 @@ or::

Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. arXiv preprint arXiv:1901.01588.

PyOD paper is **accepted** at `JMLR <http://www.jmlr.org/mloss/>`_
`PyOD paper <https://arxiv.org/abs/1901.01588>`_ is **accepted** at `JMLR <http://www.jmlr.org/mloss/>`_
(machine learning open-source software track) **with minor revisions (to appear)**.
See `arxiv preprint <https://arxiv.org/abs/1901.01588>`_.


----
@@ -56,7 +55,7 @@ PyOD has been well acknowledged by the machine learning community with a few fea

**GitHub Python Trending**:

- 2019: Feb 10th-11th, Jan 23rd-24th, Jan 10th-14th
- 2019: Apr 5th-6th, Feb 10th-11th, Jan 23rd-24th, Jan 10th-14th
- 2018: Jun 15, Dec 8th-9th

**Miscellaneous**:
20 changes: 18 additions & 2 deletions docs/api_cc.rst
@@ -1,12 +1,28 @@
API CheatSheet
==============

The following APIs are applicable to all detector models for easy use.

* :func:`pyod.models.base.BaseDetector.fit`: Fit detector. y is optional for unsupervised methods.
* :func:`pyod.models.base.BaseDetector.fit_predict`: Fit detector first and then predict whether a particular sample is an outlier or not.
* :func:`pyod.models.base.BaseDetector.fit_predict_score`: Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
* :func:`pyod.models.base.BaseDetector.decision_function`: Predict raw anomaly score of X using the fitted detector.
* :func:`pyod.models.base.BaseDetector.predict`: Predict if a particular sample is an outlier or not using the fitted detector.
* :func:`pyod.models.base.BaseDetector.predict_proba`: Predict the probability of a sample being an outlier using the fitted detector.
* :func:`pyod.models.base.BaseDetector.fit_predict`: **[Deprecated in V0.6.9]** Fit detector first and then predict whether a particular sample is an outlier or not.
* :func:`pyod.models.base.BaseDetector.fit_predict_score`: **[Deprecated in V0.6.9]** Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.


Key Attributes of a fitted model:

* :attr:`pyod.models.base.BaseDetector.decision_scores_`: The outlier scores of the training data. The higher, the more abnormal.
* :attr:`pyod.models.base.BaseDetector.labels_`: The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies.


**Note** \ : fit_predict() and fit_predict_score() are deprecated in V0.6.9 due
to a consistency issue and will be removed in V0.7.2. To get the binary labels
of the training data X_train, call clf.fit(X_train) and use
:attr:`pyod.models.base.BaseDetector.labels_`, instead of calling clf.predict(X_train).


See base class definition below:

42 changes: 23 additions & 19 deletions docs/benchmark.rst
@@ -4,27 +4,37 @@ Benchmarks
Introduction
------------

To provide an overview and guidance of the implemented models, benchmark
is supplied below.

In total, 16 benchmark data are used for comparison, all datasets could be
downloaded at `ODDS <http://odds.cs.stonybrook.edu/#table1>`_.
A benchmark is supplied for select algorithms to provide an overview of the implemented models.
In total, 17 benchmark datasets are used for comparison, which
can be downloaded at `ODDS <http://odds.cs.stonybrook.edu/#table1>`_.

For each dataset, it is first split into 60% for training and 40% for testing.
All experiments are repeated 10 times independently with different splits.
The mean of 20 trials is regarded as the final result. Three evaluation metrics
All experiments are repeated 10 times independently with random splits.
The mean of 10 trials is regarded as the final result. Three evaluation metrics
are provided:

- The area under receiver operating characteristic (ROC) curve
- Precision @ rank n (P@N)
- Execution time

**Note**: LSCP is a combination framework. In this benchmark it is based on 5
LOF detectors (n_neighbors=[10,...,50]), so it is only meaningful to compare
LSCP with LOF, not with the other detection algorithms.
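A single trial of this protocol can be sketched as below. The synthetic data and the distance-based scorer are stand-ins for illustration only; the actual benchmark evaluates each PyOD model on the ODDS datasets via benchmark.py:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def knn_score(X_train, X_test, k=5):
    """Stand-in scorer: mean distance to the k nearest training points."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(180, 2),                 # inliers
               rng.uniform(4, 8, size=(20, 2))])  # planted outliers
y = np.r_[np.zeros(180), np.ones(20)]

rocs = []
for trial in range(10):                           # 10 independent trials
    X_tr, X_te, y_tr, y_te = train_test_split(    # fresh 60/40 split each time
        X, y, test_size=0.4, random_state=trial)
    rocs.append(roc_auc_score(y_te, knn_score(X_tr, X_te)))

print(round(float(np.mean(rocs)), 4))             # mean ROC over the 10 trials
```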

You are welcome to replicate this process by running:
`benchmark.py <https://github.com/yzhao062/Pyod/blob/master/notebooks/benchmark.py>`_
You can replicate this process by running
`benchmark.py <https://github.com/yzhao062/pyod/blob/master/notebooks/benchmark.py>`_.

We also provide the hardware specification for reference.

=============== =======================================
Specification Value
=============== =======================================
Platform PC
OS Microsoft Windows 10 Enterprise
CPU Intel i7-6820HQ @ 2.70GHz
RAM 32GB
Software PyCharm 2018.02
Python Python 3.6.2
Core Single core (no parallelization)
=============== =======================================


ROC Performance
---------------
@@ -44,13 +54,7 @@ P@N Performance
Execution Time
--------------

.. csv-table:: Time Complexity in Seconds (average of 10 independent trials)
.. csv-table:: Time Elapsed in Seconds (average of 10 independent trials)
:file: tables/time.csv
:header-rows: 1

Conclusion
----------

TO ADD


49 changes: 35 additions & 14 deletions docs/example.rst
@@ -48,7 +48,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
n_train=n_train, n_test=n_test, contamination=contamination)
3. Initialize a :class:`pyod.models.knn.KNN` detector, fit the model, and make
the prediction:
the prediction.

.. code-block:: python
@@ -65,7 +65,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
y_test_pred = clf.predict(X_test) # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test) # outlier scores
4. Evaluate the prediction using ROC and Precision\@rank n :func:`pyod.utils.data.evaluate_print`:
4. Evaluate the prediction using ROC and Precision @ Rank n :func:`pyod.utils.data.evaluate_print`.

.. code-block:: python
@@ -75,7 +75,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)
5. See sample outputs on both training and test data:
5. See sample outputs on both training and test data.

.. code-block:: bash
@@ -85,7 +85,7 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
On Test Data:
KNN ROC:0.9989, precision @ rank n:0.9
6. Generate the visualizations by visualize function included in all examples:
6. Generate the visualizations with the visualize function included in all examples.

.. code-block:: python
@@ -102,15 +102,28 @@ Full example: `knn_example.py <https://github.com/yzhao062/Pyod/blob/master/exam
Model Combination Example
-------------------------

`comb_example.py <https://github.com/yzhao062/Pyod/blob/master/examples/comb_example.py>`_ is a quick demo for showing the API for combining multiple algorithms.
Given we have *n* individual outlier detectors, each of them generates an individual score for all samples. The task is to combine the outputs from these detectors effectively.
Outlier detection often suffers from model instability due to its unsupervised
nature. Thus, it is recommended to combine the outputs of various detectors, e.g., by averaging,
to improve robustness. Detector combination is a subfield of outlier ensembles;
refer to :cite:`b-kalayci2018anomaly` for more information.

**Model combination example** is made available below
(`Code <https://github.com/yzhao062/Pyod/blob/master/examples/comb_example.py>`_, `Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/Pyod/master>`_):

For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"**
Four score combination mechanisms are shown in this demo:

1. Import models and generate sample data:

#. **Average**: the average of all detector scores.
#. **Maximization**: the maximum score across all detectors.
#. **Average of Maximum (AOM)**: divide base detectors into subgroups and take the maximum score of each subgroup. The final score is the average of all subgroup scores.
#. **Maximum of Average (MOA)**: divide base detectors into subgroups and take the average score of each subgroup. The final score is the maximum of all subgroup scores.
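The four mechanisms are available as average, maximization, aom and moa in pyod.models.combination; their logic can be sketched in plain NumPy, assuming a standardized (n_samples, n_detectors) score matrix (the random scores below are illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.randn(6, 20)    # standardized scores: 6 samples x 20 detectors

avg = scores.mean(axis=1)    # Average
mx = scores.max(axis=1)      # Maximization

groups = np.split(np.arange(20), 5)   # 5 subgroups of 4 detectors each

# AOM: maximum within each subgroup, then average across subgroups
aom = np.mean([scores[:, g].max(axis=1) for g in groups], axis=0)

# MOA: average within each subgroup, then maximum across subgroups
moa = np.max([scores[:, g].mean(axis=1) for g in groups], axis=0)
```

Note that for any split into equal-size subgroups, both AOM and MOA always fall between the plain average and the plain maximum.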


"examples/comb_example.py" illustrates the API for combining the output of multiple base detectors
(\ `comb_example.py <https://github.com/yzhao062/pyod/blob/master/examples/comb_example.py>`_\ ,
`Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/pyod/master>`_\ ). For Jupyter Notebooks,
please navigate to **"/notebooks/Model Combination.ipynb"**


1. Import models and generate sample data.

.. code-block:: python
@@ -121,7 +134,7 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
X, y= generate_data(train_only=True) # load data
2. First initialize 20 kNN outlier detectors with different k (10 to 200), and get the outlier scores:
2. Initialize 20 kNN outlier detectors with different k (10 to 200), and get the outlier scores.

.. code-block:: python
@@ -141,7 +154,8 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
train_scores[:, i] = clf.decision_scores_
test_scores[:, i] = clf.decision_function(X_test_norm)
3. Then the output scores are standardized into zero average and unit std before combination:
3. Then the output scores are standardized to zero mean and unit standard deviation before combination.
This step is crucial for bringing the detector outputs onto the same scale.

.. code-block:: python
Expand All @@ -150,7 +164,7 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
# scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)
4. Then four different combination algorithms are applied as described above:
4. Four different combination algorithms are applied as described above:

.. code-block:: python
Expand All @@ -159,7 +173,7 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
comb_by_aom = aom(test_scores_norm, 5) # 5 groups
comb_by_moa = moa(test_scores_norm, 5) # 5 groups
5. Finally, all four combination methods are evaluated with ROC and Precision
5. Finally, all four combination methods are evaluated by ROC and Precision
@ Rank n:

.. code-block:: bash
@@ -169,3 +183,10 @@ For Jupyter Notebooks, please navigate to **"/notebooks/Model Combination.ipynb"
Combination by Maximization ROC:0.9198, precision @ rank n:0.4688
Combination by AOM ROC:0.9257, precision @ rank n:0.4844
Combination by MOA ROC:0.9263, precision @ rank n:0.4688
.. rubric:: References

.. bibliography:: zreferences.bib
:cited:
:labelprefix: B
:keyprefix: b-
84 changes: 0 additions & 84 deletions docs/examples.rst

This file was deleted.
