
[MRG+1] Support Vector Data Description #7910

Open
wants to merge 41 commits into
base: main

Conversation

ivannz

@ivannz ivannz commented Nov 19, 2016

Reference Issue

This PR covers previous issues concerning SVDD:

What does this implement/fix? Explain your changes.

This PR offers the following:

  • SVDD-L1 implementation based on this libsvm implementation. The model was extended to support a per-sample penalty cost vector (in line with the other support vector models implemented in scikit-learn);
  • documentation outlining the SVDD-L1 model;
  • an example showing the difference between the One-Class SVM and SVDD models.

Any other comments?

The original model was proposed by Tax and Duin (2004) and later reformulated by Chang, Lee, and Lin (2013). This PR implements the latter reformulation and extends it to the case of different observation weights. I can provide proofs of the key theorems if necessary.
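For reference, the SVDD primal of Tax and Duin (2004) seeks the smallest soft-margin ball enclosing the mapped data:

```latex
\min_{R,\,a,\,\xi}\; R^2 + C \sum_{i=1}^{n} \xi_i
\quad\text{s.t.}\quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,
\qquad \xi_i \ge 0 .
```

The Chang, Lee, and Lin (2013) reformulation addresses degenerate parameter ranges of this problem; under the $\nu$-parametrization adopted later in this PR, $C$ corresponds to $1/(\nu n)$, analogous to the One-Class SVM.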

@ivannz
Author

ivannz commented Nov 19, 2016

I tried twice to correct the E402 (module level import not at top of file) PEP8 complaint in examples/svm/plot_oneclass_vs_svdd.py.

I just do not know where to move the imports. Besides, in the examples/svm/plot_oneclass.py example, the imports come just after the docstring and no complaint is issued.

@nelson-liu
Contributor

There were issues with E402 here as well; perhaps we should ignore E402? Ping @lesteve

@@ -302,3 +318,25 @@ multiple modes and :class:`ensemble.IsolationForest` and
an outlier detection method), the :class:`ensemble.IsolationForest`,
the :class:`neighbors.LocalOutlierFactor`
and a covariance-based outlier detection :class:`covariance.EllipticEnvelope`.

One-class SVM versus SVDD-L1
Contributor


One-Class is used in the rest of the docs instead of One-class; perhaps you should change it to maintain consistency?

One-class SVM versus SVDD-L1
----------------------------

The One-class SVM and SVDD models though apparently different both try to construct
Contributor


`models though apparently different both` → `models, though apparently different, both`

@lesteve
Member

lesteve commented Nov 20, 2016

there were issues with E402 here as well, perhaps we should ignore E402? ping @lesteve

Move print(__doc__) after the imports and you'll get rid of the errors. Having said that, the list of flake8 errors we ignore is up for discussion. The "import not at top of file" aka E402 could definitely be one on the list, especially for examples.
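To illustrate the suggested layout, here is a hedged sketch of an example script that keeps `print(__doc__)` after the imports and stays E402-clean. This is not the actual plot_oneclass_vs_svdd.py file; the data and parameter values are made up for illustration.

```python
"""
A minimal sketch of the example-script layout under discussion:
module docstring first, then imports, then the rest of the code.
"""
import numpy as np
from sklearn.svm import OneClassSVM

# print(__doc__) sits after the imports, which keeps flake8's E402
# ("module level import not at top of file") quiet.
print(__doc__)

rng = np.random.RandomState(0)
X = 0.3 * rng.randn(100, 2)

clf = OneClassSVM(nu=0.1, gamma=0.5).fit(X)
pred = clf.predict(X)  # +1 for inliers, -1 for outliers
```

The only change relative to the usual svm example layout is the position of the `print(__doc__)` line.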

@ivannz
Author

ivannz commented Nov 20, 2016

I decided to change the parametrization of the SVDD from C to nu for the following reasons:

  1. it is much more intuitive for a user to set the expected fraction of outliers (nu) than the reciprocal of the average number of outliers (C);
  2. a parametrization uniform with the One-Class SVM makes it easy to switch between the models.
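A quick sketch of reason 1, using the existing OneClassSVM as a stand-in (it shares the nu-parametrization; the data and gamma value below are illustrative only): nu upper-bounds the fraction of training points flagged as outliers, up to optimization tolerance, so the user directly dials in the expected contamination level.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)

# With the nu-parametrization, nu upper-bounds the fraction of training
# points treated as outliers (up to optimization tolerance).
fracs = {}
for nu in (0.05, 0.2):
    clf = OneClassSVM(nu=nu, gamma=0.5).fit(X)
    fracs[nu] = float(np.mean(clf.predict(X) == -1))
```

With a C-parametrization, obtaining a target outlier fraction would instead require solving for C indirectly.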

@ivannz
Author

ivannz commented Nov 21, 2016

It seems that the CircleCI has failed due to reasons unrelated to the latest updates of the PR. The last log message is:

`$ /home/ubuntu/miniconda/bin/conda create -n testenv --yes --quiet python numpy scipy cython nose coverage matplotlib sphinx pillow`

    Traceback (most recent call last):
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/exceptions.py", line 479, in conda_exception_handler
        return_value = func(*args, **kwargs)
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/cli/main.py", line 145, in _main
        exit_code = args.func(args, p)
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/cli/main_create.py", line 68, in execute
        install(args, parser, 'create')
      File "/home/ubuntu/miniconda/lib/python2.7/site-packages/conda/cli/install.py", line 420, in install
        raise CondaRuntimeError('RuntimeError: %s' % e)
    CondaRuntimeError: Runtime error: RuntimeError: Runtime error: HTTPError: 530 Server Error:  for url: https://repo.continuum.io/pkgs/free/linux-64/sphinx-1.4.8-py27_0.tar.bz2: https://repo.continuum.io/pkgs/free/linux-64/sphinx-1.4.8-py27_0.tar.bz2

Maybe the test should be rerun.

@lesteve
Member

lesteve commented Nov 21, 2016

For some reason Travis and AppVeyor haven't run, so maybe try `git commit --amend` followed by a `git push -f` to get all of them to rerun.

@ivannz
Author

ivannz commented Nov 21, 2016

These checks probably didn't run due to the explicit [ci skip] flag in the commit message. I put that there because the commit contained only a minor edit to the documentation.

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 21, 2016 via email

@lesteve
Member

lesteve commented Nov 21, 2016

These checks didn't run probably due to the explicit [ci skip] flag in the commit message. I put that in there, because the commit contained a minor edit in the documentation.

Fair enough, I did not know about this feature. I am not sure I would encourage this to be honest, because when the status comes back green, it's easy to miss that only part of the CIs did run.

@ivannz
Author

ivannz commented Nov 21, 2016

A third party suggested that I rebase the commits in my branch to keep it up to date. I did so a couple of hours ago, but only now noticed that it botched the branch by pulling some commits from master and duplicating all my commits on top. Do you have any suggestions on how to fix this, or do I have to open a new PR altogether?

@jnothman
Member

You should be fine without a new PR. Probably easiest is `git reset --hard OLD_SHA`, where OLD_SHA is some previous commit hash you're okay with going back to. Then fetch the latest master and just `git merge` it into your branch.

One-Class SVM versus SVDD-L1
----------------------------

The One-Class SVM and SVDD models, though apparently different, both try to construct
Member


This is not a very clear description.

Author


This subsection is just an illustration of the difference between the models. The actual description is in modules/svm.rst. I will add references to the detailed descriptions.

@ivannz
Author

ivannz commented Nov 23, 2016

I have given the SVDD libsvm code another thorough check and updated its description as well as the comparison to the One-Class SVM.

@ivannz changed the title from "Support Vector Data Description" to "[MRG] Support Vector Data Description" on Nov 24, 2016
@ivannz
Author

ivannz commented Nov 27, 2016

Sorry for the long delay with implementing unit tests for this feature.

I added the following tests to test_svm.py:

  • test_svdd() (4 assertions) validates the output of the libsvm solver for the SVDD problem;
  • test_oneclass_and_svdd() (2 assertions) numerically tests the claim that SVDD and One-Class SVM coincide for a stationary kernel (rbf);
  • test_svdd_decision_function() (4 assertions) tests the coverage of the decision boundary for a non-stationary kernel (poly with degree=2 and coef0=1.0);
  • Updated test_immutable_coef_property() (+2 assertions) for consistency with other models.
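The equivalence claim behind test_oneclass_and_svdd() hinges on the kernel diagonal K(x, x) being constant for stationary kernels. A quick numerical check of that premise (not the PR's test itself; the data and kernel parameters are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(50, 3)

# Stationary (rbf) kernel: K(x, x) = 1 for every x, which is why the SVDD
# ball and the One-Class SVM hyperplane induce the same decision ordering.
K_rbf = rbf_kernel(X, X, gamma=0.5)

# Non-stationary (poly, degree=2, coef0=1.0) kernel: K(x, x) varies with x,
# so the two models can genuinely differ.
K_poly = polynomial_kernel(X, X, degree=2, coef0=1.0)
```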

Excerpts from make test-coverage:

current master

Name                                                Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------------------------
sklearn/svm.py                                          4      0      0      0   100%
sklearn/svm/base.py                                   326     11    118     10    95%
sklearn/svm/bounds.py                                  20      0      8      0   100%
sklearn/svm/classes.py                                 87      1     12      1    98%

svdd_l1 branch

Name                                                Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------------------------
sklearn/svm.py                                          4      0      0      0   100%
sklearn/svm/base.py                                   326     11    118     10    95%
sklearn/svm/bounds.py                                  20      0      8      0   100%
sklearn/svm/classes.py                                 96      1     12      1    98%

nosetests sklearn/svm/tests/ ran 82 tests and returned OK

@ivannz
Author

ivannz commented Nov 27, 2016

As usual, all tests ran successfully except on Travis. Besides the usual E402 PEP8 compliance failure, it also failed on a new test unrelated to the SVM submodule: sklearn.model_selection.tests.test_split.test_cv_iterable_wrapper at lines 1031-1032.

At first glance there should be no error, since

from sklearn.model_selection import KFold, check_cv

# X and y as defined in the test
kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)

print(list(kf_iter_wrapped.split(X, y)))
print(list(kf_iter_wrapped.split(X, y)))

prints identical CV split lists (as it should)

[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

@lesteve, do you have any suggestions?

@lesteve
Member

lesteve commented Nov 27, 2016

Don't worry about the failing test that is not related to your PR. It is actually the numpy-dev build, and it is allowed to fail, so it will not influence the green or red status of your PR.

For the record, I can reproduce it locally with numpy from master, so I'll investigate.

@ivannz
Author

ivannz commented Nov 28, 2016

@nelson-liu , @amueller , do you have any remarks/suggestions regarding the description of the One-Class SVM and the SVDD?

@ivannz
Author

ivannz commented Dec 1, 2016

I rebased the branch to keep it up to date with the master.

@amueller
Member

amueller commented Dec 1, 2016

Travis is failing because of pep8.
This looks good but will probably need a while for us to review, sorry about that...

@ivannz
Author

ivannz commented Dec 2, 2016

I improved the documentation of the SVDD and the One-Class SVM, and added a sparse vs. dense test, similar to the one for svm.OneClassSVM, to test_sparse.py.

Should I resolve the E402 PEP8 issue in the plot_oneclass_vs_svdd.py example? I can do that by moving the print after the imports, but at the cost of breaking the uniform structure of the example scripts in the svm example folder.

@ivannz
Author

ivannz commented Dec 2, 2016

AppVeyor tests exited with code 1, failing on test_multinomial_logistic_regression_string_inputs in linear_model.tests.test_logistic. All SVM-related tests passed successfully though.

Update:
For the record, the failed test is related to #7966:

======================================================================
FAIL: sklearn.linear_model.tests.test_logistic.test_multinomial_logistic_regression_string_inputs
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\tests\test_logistic.py", line 431, in test_multinomial_logistic_regression_string_inputs
    ['bar', 'baz', 'foo'])
AssertionError: Lists differ: ['bar', 'baz'] != ['bar', 'baz', 'foo']

Second list contains 1 additional elements.
First extra element 2:
'foo'

- ['bar', 'baz']
+ ['bar', 'baz', 'foo']
?              +++++++

@amueller
Member

amueller commented Dec 6, 2016

@ivannz the test failure in logistic was fixed in master in the meantime. I'm not sure how we are handling the print(__doc__) right now.

@amueller
Member

amueller commented Dec 6, 2016

I think recently we might have just not used the print, assuming that people mostly look at the examples on the website.

  • see-also cross-reference in ocSVM and SVDD (reflecting scikit-learn#18332)
  • ensure SVDD passes numpydoc validation (scikit-learn#20463)
  • check for svdd in `test_sparse.py:check_svm_model_equal` to avoid calling `.predict_proba` …_new
  • add backticks (scikit-learn#20914), deprecate **params in fit (scikit-learn#20843), add feature_names_in_ (scikit-learn#20787)
  • uncompromisingly reformat plot_oneclass_vs_svdd with black
  • …from ocSVM, and bumped versionadded; added SVDD to tests which involved ocSVM
  • …cSVM scikit-learn#24001)
  • finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843)
  • TST ensure SVDD passes param-validation test_common.py due to scikit-learn#23462 (scikit-learn#22722)
  • add SVDD announcement to svm.cpp, fix stray trailing spaces (#r374671161)
@ivannz
Author

ivannz commented Sep 5, 2022

Thank you @cmarmo for pointing out these unresolved comments! I have gone through the thread once more and compiled a summary of the key issues, comments, and requests (the simplest of them have already been addressed):

  • Clarify the space, in which the hypersphere is computed doc/modules/svm.rst#L852-855

  • Add a short recap of the modifications of LIBSVM related to SVDD at the top of sklearn/svm/src/libsvm/svm.cpp (see also this recap)

  • Decide what to do with the defect related to precomputed kernels. Briefly, inference with SVDD in this case requires the K_XZ block (as is standard for all current svm models) AND the diagonal of the K_XX block (which is not available), where X and Z are the test and train datasets, respectively. However, only in the case of precomputed kernel matrices, the current (and upstream LIBSVM) implementation fetches, essentially, the diagonal of the K_XZ block, which results in completely incorrect scores.

    • a clarification of the differences between the SVDD and the OneClassSVM in the case of precomputed kernels.
    • possible resolutions: restricting parameter settings as a way to resolve the defect without breaking the API, or fusing fit and predict (as in LocalOutlierFactor), which would mean that SVDD performs only outlier detection in the precomputed-kernel case
    • a pre v1.0 summary of the PR's status before considering for merging
  • make the inductive bias of each model a bit more explicit without going deep into mathematical details: add some intuition for picking a linear soft-margin hyperplane (OneClassSVM) or a spherical soft-margin hyper-surface (SVDD); hint at what kind of data / use case is likely to make SVDD-L1 perform better than OC-SVM, especially with non-stationary kernels, e.g. polynomial or precomputed; and, finally, briefly explain the impact of the choice on the shape of the decision function

  • provide convincing arguments in favour of SVDD-L1 against OneClass SVM. Below is the digest of the key concerns raised by the maintainers:

    • The main bottleneck is the equivalence of both models when applied with the Gaussian kernel, which is the case in most applications of the One-Class SVM and SVDD. A simple baseline could be to fit a hyperplane and a hyperball in the input space with a linear kernel, possibly on a high-dimensional anomaly detection dataset. Is there a concrete business case where SVDD-L1 with a linear or polynomial kernel would yield interesting results (better than OC-SVM)? If the SVDD is always used with a Gaussian kernel, then one can use the One-Class SVM instead. How is SVDD better than EllipticEnvelope, which optimizes over ellipsoids?
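To make the precomputed-kernel point above concrete, here is a sketch of what a user passes at fit and predict time with kernel='precomputed'. It uses the existing OneClassSVM (which does not need the test-test diagonal); an SVDD score would additionally require diag(K_XX), which never enters this API. Data sizes and nu are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
Z = rng.randn(80, 2)   # train set
X = rng.randn(20, 2)   # test set

# fit receives the train-train Gram matrix K_ZZ; decision_function only
# receives the test-train block K_XZ. The test-test diagonal diag(K_XX),
# which the SVDD decision function needs, is never passed in.
clf = OneClassSVM(kernel="precomputed", nu=0.1)
clf.fit(rbf_kernel(Z, Z))
scores = clf.decision_function(rbf_kernel(X, Z))
```

This is why the defect only manifests with precomputed kernels: for callable or built-in kernels the estimator can evaluate K(x, x) itself.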

@cmarmo
Member

cmarmo commented Sep 7, 2022

I have gone through the thread once more and compiled a summary of the key issues, comments, and requests (the simplest of them have already been addressed):

Thanks @ivannz for the summary! Just to be sure ... that means we wait for you to address those last points before a new review, right?

@jeremiedbb
Member

We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3

@jeremiedbb jeremiedbb modified the milestones: 1.2, 1.3 Nov 24, 2022
@ivannz
Author

ivannz commented Nov 24, 2022

@jeremiedbb @cmarmo I am sorry for not being able to pay due attention to the PR since mid-September due to personal events. As soon as I get everything settled (approximately mid-January 2023), I will get back to this PR.

The plan is to fix the errors in the oc-SVM and SVDD docs and to address the three points in my last comment.

@jeremiedbb jeremiedbb modified the milestones: 1.3, 1.4 Jul 6, 2023
@glemaitre glemaitre removed this from the 1.4 milestone Dec 7, 2023