Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Fix iforest average path length #13251

Merged
merged 13 commits into from Feb 26, 2019

Conversation

Projects
None yet
5 participants
@albertcthomas
Copy link
Contributor

albertcthomas commented Feb 25, 2019

Reference Issues/PRs

Taking over #12085. Besides fixing the average path length for IsolationForest this PR also improves the checks for the predicted number of outliers in the common tests.
Closes #12085, Fixes #11839

Only modifications in the tests were required.

cc @joshuakennethjones

joshuakennethjones and others added some commits Sep 14, 2018

Fix issue #11839
Fix Issue #11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
Add non-regression test.
Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.
Improve comment & test.
Made recommended enhancements to comments, and change assert_almost_equal to assert_equal where constants should be returned.
Fix tests in tie cases & match conventions.
Change assert_equal to assert ... == to adhere to latest conventions, and change test to properly deal with anomaly score ties in critical regions if 'decision_function' method is supported by the estimator in question, or default to the old behavior if not.
Add tests of tests.
Refactoring and adding more tests to try and get coverage to an acceptable level.
assert_almost_equal(_average_path_length(1), 1., decimal=10)
assert _average_path_length(0) == 0.
assert _average_path_length(1) == 0.
assert _average_path_length(2) == 1.

This comment has been minimized.

Copy link
@ngoix

ngoix Feb 25, 2019

Contributor

can we also test that _average_path_length is increasing for more values? I guess _average_path_length(2) < _average_path_length(3) would be enough

@albertcthomas

This comment has been minimized.

Copy link
Contributor Author

albertcthomas commented Feb 25, 2019

ping @agramfort for a review when you have time

@glemaitre
Copy link
Contributor

glemaitre left a comment

Could you add an entry inside in what's new

Show resolved Hide resolved sklearn/ensemble/iforest.py
Show resolved Hide resolved sklearn/ensemble/iforest.py Outdated
Show resolved Hide resolved sklearn/ensemble/iforest.py Outdated
Show resolved Hide resolved sklearn/ensemble/iforest.py Outdated
Show resolved Hide resolved sklearn/ensemble/tests/test_iforest.py Outdated
Show resolved Hide resolved sklearn/ensemble/tests/test_iforest.py Outdated
Show resolved Hide resolved sklearn/ensemble/tests/test_iforest.py Outdated
assert_almost_equal(_average_path_length(5), result_one, decimal=10)
assert_almost_equal(_average_path_length(999), result_two, decimal=10)
assert_array_almost_equal(_average_path_length(np.array([1, 5, 999])),
[1., result_one, result_two], decimal=10)
assert_array_almost_equal(_average_path_length(np.array([1, 2, 5, 999])),

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 26, 2019

Contributor

Since we are changing this line, could we use assert_allclose

This comment has been minimized.

Copy link
@albertcthomas

albertcthomas Feb 26, 2019

Author Contributor

Yes. What's the difference? assert_allclose does not check the shapes are the same?

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 26, 2019

Contributor

all_close use rtol atol instead of decimal. It is just recommended by numpy for consistency:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.testing.assert_array_almost_equal.html

@@ -1548,16 +1568,16 @@ def check_outliers_train(name, estimator_orig, readonly_memmap=True):

decision = estimator.decision_function(X)

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 26, 2019

Contributor
for func in ['decision_function', 'score_samples']:
    output = getattr(estimator, func)(X)
    assert output.dtype == np.dtype('float')
    assert output.shape == (n_samples,)

This comment has been minimized.

Copy link
@albertcthomas

albertcthomas Feb 26, 2019

Author Contributor

I now use a for loop but a bit different to your suggestion as we need the outputs for other checks.

Show resolved Hide resolved sklearn/utils/estimator_checks.py Outdated
@ngoix

ngoix approved these changes Feb 26, 2019

Copy link
Contributor

ngoix left a comment

appart from my small comment on _average_path_length monotonic testing and guillaume formatting comments LGTM

glemaitre and others added some commits Feb 26, 2019

apply suggestions from code review
Co-Authored-By: albertcthomas <albertthomas88@gmail.com>
sc
@albertcthomas

This comment has been minimized.

Copy link
Contributor Author

albertcthomas commented Feb 26, 2019

Thanks for the reviews @glemaitre and @ngoix

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Feb 26, 2019

+1 for MRG

@agramfort agramfort changed the title [MRG] Fix iforest average path length [MRG+1] Fix iforest average path length Feb 26, 2019

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Feb 26, 2019

ok to merge when green @ngoix and @glemaitre ?

@glemaitre

This comment has been minimized.

Copy link
Contributor

glemaitre commented Feb 26, 2019

Merging. Azure pipeline is green.

@glemaitre glemaitre merged commit bcdeadd into scikit-learn:master Feb 26, 2019

7 of 9 checks passed

ci/circleci: deploy Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@albertcthomas

This comment has been minimized.

Copy link
Contributor Author

albertcthomas commented Feb 26, 2019

Thanks for the reviews @ngoix, @glemaitre and @agramfort. Thanks for most of the work @joshuakennethjones and sorry for delaying the original PR.

@joshuakennethjones

This comment has been minimized.

Copy link

joshuakennethjones commented Feb 26, 2019

Thanks for pushing this one across the finish line @albertcthomas! Glad to be able to help out a little bit -- I appreciate the efforts of everyone involved in maintaining and improving what is obviously a very useful package.

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.