-
-
Notifications
You must be signed in to change notification settings - Fork 26.4k
[MRG+1] Fix multi-label issues in IsolationForest benchmark #8638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks! I agree that the described fix is appropriate, but now that there are other changes you will need to be patient for a full review. I hope to look soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
benchmarks/bench_isolation_forest.py
Outdated
| ax[2].hist(scoring[y_test == 1], bins, color='r', | ||
| label='outliers') | ||
| ax[2].legend(loc="lower right") | ||
| y_pred = model.predict(X_test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to call predict before decision_function, y_pred is unused,
and it artificially increases the predict time.
benchmarks/bench_isolation_forest.py
Outdated
| bins = np.linspace(-0.5, 0.5, 200) | ||
| ax[0].hist(scoring, bins, color='black') | ||
| ax[0].set_title('Decision function for %s dataset' % dat) | ||
| ax[0].legend(loc="lower right") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this line, since there is no label in this plot.
It will remove the matplotlib warning
benchmarks/bench_isolation_forest.py
Outdated
| ========================================== | ||
| A test of IsolationForest on classical anomaly detection datasets. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you put the description on the top of the file
benchmarks/bench_isolation_forest.py
Outdated
| A test of IsolationForest on classical anomaly detection datasets. | ||
| """ | ||
| print(__doc__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put the print(__doc__) just below the import and before print_outlier_ratio(y)
benchmarks/bench_isolation_forest.py
Outdated
| fig_roc, ax_roc = plt.subplots(1, 1, figsize=(8, 5)) | ||
|
|
||
| # Set this to true for plotting score histograms for each dataset: | ||
| with_scoring_hists = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the previous example, shouldn't this be set to True by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it clearer with just the ROC curves, hence the False by default (but can be changed if required).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please rename it "with_decision_functions_histograms"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
LGTM |
…ges (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif
|
Thanks @hrjn ! |
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
Reference Issue
Fixes #8637
What does this implement/fix? Explain your changes.
In previous version, using Python 3 and
LabelBinarizerto encode categorical features from SA & SF datasets fromkddcup99led to the error described in #8637. Replacing it withMultipleLabelBinarizerfixes the problem and allows the code to run on both Python 2 and 3.Output obtained with new version:

Any other comments?
Additional minor cleaning/refactoring:
with_scoring_histsflag (set toFalseby default) to avoid plotting all score histograms (might potentially clog the screen with figure windows)