
[MRG+1] Support Vector Data Description #7910

Open
wants to merge 41 commits into
base: main

Conversation

ivannz

@ivannz ivannz commented Nov 19, 2016

Reference Issue

This PR covers previous issues concerning SVDD:

What does this implement/fix? Explain your changes.

This PR offers the following:

  • SVDD-L1 implementation based on this libsvm implementation. The model was extended to support a per-sample penalty cost vector (in line with the other support vector models implemented in scikit-learn);
  • documentation outlining the SVDD-L1 model;
  • an example showing the difference between the One-Class SVM and SVDD models.

Any other comments?

The original model was proposed by Tax and Duin (2004) and later reformulated by Chang, Lee, and Lin (2013). This PR implements the latter reformulation and extends it to the case of different observation weights. I can provide proofs of the key theorems if necessary.
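For reference, the SVDD primal of Tax and Duin (2004) seeks the smallest soft-margin ball enclosing the mapped data:

```latex
\min_{R,\,a,\,\xi}\; R^2 + C \sum_{i=1}^{n} \xi_i
\quad\text{s.t.}\quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,
\qquad \xi_i \ge 0 .
```

The Chang, Lee, and Lin (2013) reformulation addresses degenerate parameter ranges of this problem; under the $\nu$-parametrization adopted later in this PR, $C$ corresponds to $1/(\nu n)$, analogous to the One-Class SVM.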

@ivannz
Author

ivannz commented Nov 19, 2016

I tried twice to correct the E402 (module level import not at top of file) PEP8 complaint in examples/svm/plot_oneclass_vs_svdd.py.

I just do not know where to move the imports. Besides, in the examples/svm/plot_oneclass.py example, the imports come just after the docstring and no complaint is issued.

@nelson-liu
Contributor

There were issues with E402 here as well; perhaps we should ignore E402? Ping @lesteve

@@ -302,3 +318,25 @@ multiple modes and :class:`ensemble.IsolationForest` and
an outlier detection method), the :class:`ensemble.IsolationForest`,
the :class:`neighbors.LocalOutlierFactor`
and a covariance-based outlier detection :class:`covariance.EllipticEnvelope`.

One-class SVM versus SVDD-L1
Contributor


One-Class is used in the rest of the docs instead of One-class; perhaps you should change it to maintain consistency?

One-class SVM versus SVDD-L1
----------------------------

The One-class SVM and SVDD models though apparently different both try to construct
Contributor


`models though apparently different both` → `models, though apparently different, both`

@lesteve
Member

lesteve commented Nov 20, 2016

there were issues with E402 here as well, perhaps we should ignore E402? ping @lesteve

Move print(__doc__) after the imports and you'll get rid of the errors. Having said that, the list of flake8 errors we ignore is up for discussion. The "import not at top of file" aka E402 could definitely be one on the list, especially for examples.
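To illustrate the suggested layout, here is a hedged sketch of an example script that keeps `print(__doc__)` after the imports and stays E402-clean. This is not the actual plot_oneclass_vs_svdd.py file; the data and parameter values are made up for illustration.

```python
"""
A minimal sketch of the example-script layout under discussion:
module docstring first, then imports, then the rest of the code.
"""
import numpy as np
from sklearn.svm import OneClassSVM

# print(__doc__) sits after the imports, which keeps flake8's E402
# ("module level import not at top of file") quiet.
print(__doc__)

rng = np.random.RandomState(0)
X = 0.3 * rng.randn(100, 2)

clf = OneClassSVM(nu=0.1, gamma=0.5).fit(X)
pred = clf.predict(X)  # +1 for inliers, -1 for outliers
```

The only change relative to the usual svm example layout is the position of the `print(__doc__)` line.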

@ivannz
Author

ivannz commented Nov 20, 2016

I decided to change the parametrization of the SVDD from C to nu for the following reasons:

  1. it is much more intuitive for a user to set the expected fraction of outliers (nu) than the reciprocal of the average number of outliers (C);
  2. a parametrization uniform with the One-Class SVM makes it easy to switch between the models.
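A quick sketch of reason 1, using the existing OneClassSVM as a stand-in (it shares the nu-parametrization; the data and gamma value below are illustrative only): nu upper-bounds the fraction of training points flagged as outliers, up to optimization tolerance, so the user directly dials in the expected contamination level.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)

# With the nu-parametrization, nu upper-bounds the fraction of training
# points treated as outliers (up to optimization tolerance).
fracs = {}
for nu in (0.05, 0.2):
    clf = OneClassSVM(nu=nu, gamma=0.5).fit(X)
    fracs[nu] = float(np.mean(clf.predict(X) == -1))
```

With a C-parametrization, obtaining a target outlier fraction would instead require solving for C indirectly.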

@ivannz
Author

ivannz commented Nov 21, 2016

It seems that the CircleCI has failed due to reasons unrelated to the latest updates of the PR. The last log message is:

`$ /home/ubuntu/miniconda/bin/conda create -n testenv --yes --quiet python numpy scipy cython nose coverage matplotlib sphinx pillow`

    Traceback (most recent call last):
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/exceptions.py", line 479, in conda_exception_handler
        return_value = func(*args, **kwargs)
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/cli/main.py", line 145, in _main
        exit_code = args.func(args, p)
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/cli/main_create.py", line 68, in execute
        install(args, parser, 'create')
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/cli/install.py", line 420, in install
        raise CondaRuntimeError('RuntimeError: %s' % e)
    CondaRuntimeError: Runtime error: RuntimeError: Runtime error: HTTPError: 530 Server Error:  for url: https://repo.continuum.io/pkgs/free/linux-64/sphinx-1.4.8-py27_0.tar.bz2: https://repo.continuum.io/pkgs/free/linux-64/sphinx-1.4.8-py27_0.tar.bz2

Maybe the test should be rerun.

@lesteve
Member

lesteve commented Nov 21, 2016

For some reason Travis and AppVeyor haven't run, so maybe try `git commit --amend` followed by a `git push -f` to get all of them to rerun.

@ivannz
Author

ivannz commented Nov 21, 2016

These checks probably didn't run due to the explicit [ci skip] flag in the commit message. I put that there because the commit contained only a minor edit to the documentation.

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 21, 2016 via email

@lesteve
Member

lesteve commented Nov 21, 2016

These checks didn't run probably due to the explicit [ci skip] flag in the commit message. I put that in there, because the commit contained a minor edit in the documentation.

Fair enough, I did not know about this feature. I am not sure I would encourage this to be honest, because when the status comes back green, it's easy to miss that only part of the CIs did run.

@ivannz
Author

ivannz commented Nov 21, 2016

A third party suggested that I rebase the commits in my branch to keep it up to date. I did so a couple of hours ago, but only now noticed that it botched the branch by pulling some commits from master and duplicating all my commits on top. Do you have any suggestions on how to fix this, or do I have to open a new PR altogether?

@jnothman
Member

You should be fine without a new PR. Probably easiest is `git reset --hard OLD_SHA`, where OLD_SHA is some previous commit hash you're okay with going back to. Then fetch the latest master and just `git merge` it into your branch.

One-Class SVM versus SVDD-L1
----------------------------

The One-Class SVM and SVDD models, though apparently different, both try to construct
Member


This is not a very clear description.

Author


This subsection is just an illustration of the difference between the models. The actual description is in modules/svm.rst. I will add references to the detailed descriptions.

@ivannz
Author

ivannz commented Nov 23, 2016

I have given the SVDD libsvm code another thorough check and updated its description as well as the comparison to the One-Class SVM.

@ivannz changed the title from "Support Vector Data Description" to "[MRG] Support Vector Data Description" on Nov 24, 2016
@ivannz
Author

ivannz commented Nov 27, 2016

Sorry for the long delay with implementing unit tests for this feature.

I added the following tests to test_svm.py:

  • test_svdd() (4 assertions) validates the output of the libsvm solver for the SVDD problem;
  • test_oneclass_and_svdd() (2 assertions) numerically tests the claim that SVDD and One-Class SVM coincide for a stationary kernel (rbf);
  • test_svdd_decision_function() (4 assertions) tests the coverage of the decision boundary for a non-stationary kernel (poly with degree=2 and coef0=1.0);
  • Updated test_immutable_coef_property() (+2 assertions) for consistency with other models.
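The equivalence claim behind test_oneclass_and_svdd() hinges on the kernel diagonal K(x, x) being constant for stationary kernels. A quick numerical check of that premise (not the PR's test itself; the data and kernel parameters are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(50, 3)

# Stationary (rbf) kernel: K(x, x) = 1 for every x, which is why the SVDD
# ball and the One-Class SVM hyperplane induce the same decision ordering.
K_rbf = rbf_kernel(X, X, gamma=0.5)

# Non-stationary (poly, degree=2, coef0=1.0) kernel: K(x, x) varies with x,
# so the two models can genuinely differ.
K_poly = polynomial_kernel(X, X, degree=2, coef0=1.0)
```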

Excerpts from make test-coverage:

current master

Name                                                Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------------------------
sklearn/svm.py                                          4      0      0      0   100%
sklearn/svm/base.py                                   326     11    118     10    95%
sklearn/svm/bounds.py                                  20      0      8      0   100%
sklearn/svm/classes.py                                 87      1     12      1    98%

svdd_l1 branch

Name                                                Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------------------------
sklearn/svm.py                                          4      0      0      0   100%
sklearn/svm/base.py                                   326     11    118     10    95%
sklearn/svm/bounds.py                                  20      0      8      0   100%
sklearn/svm/classes.py                                 96      1     12      1    98%

nosetests sklearn/svm/tests/ ran 82 tests and returned OK

@ivannz
Author

ivannz commented Nov 27, 2016

As usual, all tests ran successfully except on Travis. Besides the usual E402 PEP8 compliance failure, it also failed on a new test unrelated to the SVM submodule: sklearn.model_selection.tests.test_split.test_cv_iterable_wrapper at lines 1031-1032.

At first glance there should be no error, since

from sklearn.model_selection import KFold, check_cv

# X and y as defined in the test
kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)

print(list(kf_iter_wrapped.split(X, y)))
print(list(kf_iter_wrapped.split(X, y)))

prints identical CV split lists (as it should)

[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

@lesteve, do you have any suggestions?

@lesteve
Member

lesteve commented Nov 27, 2016

Don't worry about the failing test that is not related to your PR. It is actually the numpy-dev build, and it is allowed to fail, so it will not influence the green or red status of your PR.

For the record, I can reproduce it locally with numpy from master, so I'll investigate.

@ivannz
Author

ivannz commented Nov 28, 2016

@nelson-liu , @amueller , do you have any remarks/suggestions regarding the description of the One-Class SVM and the SVDD?

@ivannz
Author

ivannz commented Dec 1, 2016

I rebased the branch to keep it up to date with the master.

@amueller
Member

amueller commented Dec 1, 2016

Travis is failing because of pep8.
This looks good but will probably need a while for us to review, sorry about that...

@ivannz
Author

ivannz commented Dec 2, 2016

I improved the documentation of the SVDD and the One-Class SVM, and added a sparse vs. dense test, similar to the one for svm.OneClassSVM, to test_sparse.py.

Should I resolve the E402 PEP8 issue in the plot_oneclass_vs_svdd.py example? I can do that by moving the print after the imports, but at the cost of breaking the uniform structure of the example scripts in the svm example folder.

@ivannz
Author

ivannz commented Dec 2, 2016

AppVeyor tests exited with code 1, failing on test_multinomial_logistic_regression_string_inputs in linear_model.tests.test_logistic. All SVM-related tests passed successfully though.

Update:
For the record, the failed test is related to #7966:

======================================================================
FAIL: sklearn.linear_model.tests.test_logistic.test_multinomial_logistic_regression_string_inputs
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\tests\test_logistic.py", line 431, in test_multinomial_logistic_regression_string_inputs
    ['bar', 'baz', 'foo'])
AssertionError: Lists differ: ['bar', 'baz'] != ['bar', 'baz', 'foo']

Second list contains 1 additional elements.
First extra element 2:
'foo'

- ['bar', 'baz']
+ ['bar', 'baz', 'foo']
?              +++++++

@amueller
Member

amueller commented Dec 6, 2016

@ivannz the test failure in logistic was fixed in master in the meantime. I'm not sure how we are handling the print(__doc__) right now.

@amueller
Member

amueller commented Dec 6, 2016

I think recently we might have just not used the print, assuming that people mostly look at the examples on the website.

  • see-also cross-reference in ocSVM and SVDD (reflecting scikit-learn#18332)
  • ensure SVDD passes numpydoc validation (scikit-learn#20463)
  • check for svdd in `test_sparse.py:check_svm_model_equal` to avoid calling `.predict_proba` …_new
  • add backticks (scikit-learn#20914), deprecate **params in fit (scikit-learn#20843), add feature_names_in_ (scikit-learn#20787)
  • uncompromisingly reformat plot_oneclass_vs_svdd with black
  • …from ocSVM, and bumped versionadded; added SVDD to tests which involved ocSVM
  • …cSVM scikit-learn#24001)
  • finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843)
  • TST ensure SVDD passes param-validation test_common.py due to scikit-learn#23462 (scikit-learn#22722)
  • add SVDD announcement to svm.cpp, fix stray trailing spaces (#r374671161)
@ivannz
Author

ivannz commented Sep 5, 2022

Thank you @cmarmo for pointing out these unresolved comments! I have gone through the thread once more and compiled a summary of the key issues, comments, and requests (the simplest of them have already been addressed):

  • Clarify the space, in which the hypersphere is computed doc/modules/svm.rst#L852-855

  • Add a short recap of the modifications of LIBSVM related to SVDD at the top of sklearn/svm/src/libsvm/svm.cpp (see also this recap)

  • Decide what to do with the defect related to precomputed kernels. Briefly, inference with SVDD in this case requires the K_XZ block (as is standard for all current svm models) AND the diagonal of the K_XX block (which is not available), where X and Z are the test and train datasets, respectively. However, only in the case of precomputed kernel matrices, the current (and upstream LIBSVM) implementation fetches, essentially, the diagonal of the K_XZ block, which results in completely incorrect scores.

    • a clarification of the differences between the SVDD and the OneClassSVM in the case of precomputed kernels.
    • possible resolutions: restricting parameter settings as a way to resolve the defect without breaking the API, or fusing fit and predict (as in LocalOutlierFactor), which would mean that SVDD performs only outlier detection in the precomputed-kernel case
    • a pre v1.0 summary of the PR's status before considering for merging
  • make the inductive bias of each model a bit more explicit without going deep into mathematical details: add some intuition for picking a linear soft-margin hyperplane (OneClassSVM) or a spherical soft-margin hyper-surface (SVDD); hint at what kind of data / use case is likely to make SVDD-L1 perform better than OC-SVM, especially with non-stationary kernels, e.g. polynomial or precomputed; and, finally, briefly explain the impact of the choice on the shape of the decision function

  • provide convincing arguments in favour of SVDD-L1 against OneClass SVM. Below is the digest of the key concerns raised by the maintainers:

    • The main bottleneck is the equivalence of both models when applied with the Gaussian kernel, which is the case in most applications of the One-Class SVM and SVDD. A simple baseline could be to fit a hyperplane and a hyperball in the input space with a linear kernel, possibly on a high-dimensional anomaly detection dataset. Is there a concrete business case where SVDD-L1 with a linear or polynomial kernel would yield interesting results (better than OC-SVM)? If the SVDD is always used with a Gaussian kernel, then one can use the One-Class SVM instead. How is SVDD better than EllipticEnvelope, which optimizes over ellipsoids?
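To make the precomputed-kernel point above concrete, here is a sketch of what a user passes at fit and predict time with kernel='precomputed'. It uses the existing OneClassSVM (which does not need the test-test diagonal); an SVDD score would additionally require diag(K_XX), which never enters this API. Data sizes and nu are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
Z = rng.randn(80, 2)   # train set
X = rng.randn(20, 2)   # test set

# fit receives the train-train Gram matrix K_ZZ; decision_function only
# receives the test-train block K_XZ. The test-test diagonal diag(K_XX),
# which the SVDD decision function needs, is never passed in.
clf = OneClassSVM(kernel="precomputed", nu=0.1)
clf.fit(rbf_kernel(Z, Z))
scores = clf.decision_function(rbf_kernel(X, Z))
```

This is why the defect only manifests with precomputed kernels: for callable or built-in kernels the estimator can evaluate K(x, x) itself.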

@cmarmo
Member

cmarmo commented Sep 7, 2022

I have gone through the thread once more and compiled a summary of the key issues, comments, and requests (the simplest of them have already been addressed):

Thanks @ivannz for the summary! Just to be sure ... that means we wait for you to address those last points before a new review, right?

@jeremiedbb
Member

We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3

@jeremiedbb jeremiedbb modified the milestones: 1.2, 1.3 Nov 24, 2022
@ivannz
Author

ivannz commented Nov 24, 2022

@jeremiedbb @cmarmo I am sorry for not being able to pay due attention to the PR since mid-September due to personal events. As soon as I get everything settled (approximately mid-January 2023), I will get back to this PR.

The plan is to fix the errors in the oc-SVM and SVDD docs and to address the three points in my last comment.

@jeremiedbb jeremiedbb modified the milestones: 1.3, 1.4 Jul 6, 2023
@glemaitre glemaitre removed this from the 1.4 milestone Dec 7, 2023