F1 and F1-PA #34

Open
carrtesy opened this issue Nov 11, 2022 · 1 comment

Comments

@carrtesy

Hello,
I've encountered exactly the same issue as in the previous issue, and would like to ask your opinion.

To provide more detail: running the code without point adjustment (commenting out the PA part, as in the previous issue), I got the following results:

| Dataset | F1-PA  | F1-PA (paper) | F1     |
|---------|--------|---------------|--------|
| MSL     | 0.9500 | 0.9359        | 0.0209 |
| PSM     | 0.9750 | 0.9789        | 0.0217 |
| SMAP    | 0.9636 | 0.9669        | 0.0189 |
| SMD     | 0.8944 | 0.9233        | 0.0201 |

Although I agree that the F1-PA algorithm has practical justification (an abnormal time point will trigger an alert and bring the whole segment to attention in real-world applications), (1) the unadjusted F1 seems to degrade too much, and (2) the AAAI paper raises a concern about the F1-PA metric: even random guessing can achieve a high F1-PA depending on the data distribution.
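For concreteness, here is a minimal sketch of the point-adjustment step being discussed, assuming binary label/prediction lists and the common PA convention (if any point inside a ground-truth anomaly segment is flagged, the whole segment counts as detected); the toy data below is made up for illustration:

```python
def f1(labels, preds):
    """Plain point-wise F1 score for binary 0/1 sequences."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def point_adjust(labels, preds):
    """PA: if any point in a ground-truth anomaly segment is flagged,
    mark every point of that segment as detected."""
    adjusted = list(preds)
    i, n = 0, len(labels)
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:  # find end of segment
                j += 1
            if any(adjusted[i:j]):           # one hit suffices
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

# One 6-point anomaly segment; the detector fires at a single point inside it.
labels = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
preds  = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

print(f1(labels, preds))                        # 2/7 ≈ 0.286 point-wise
print(f1(labels, point_adjust(labels, preds)))  # 1.0 after adjustment
```

This illustrates the gap in the table: a single lucky hit per segment is enough for PA to report a perfect score.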

I want to ask your opinion on these results.
Thanks in advance.

@GPla

GPla commented Jan 13, 2023

Another paper that might be relevant is the one by Wu and Keogh, which discusses the problems of current time series benchmarks and also touches on problems with the evaluation.

There are also range-based metrics that view both the anomalies and the predictions as ranges.
A recent paper by Hwang et al. proposes eTaPR (an implementation can be found here).

I agree that PA should not be used because of its apparent flaws. The community should go back to reporting the unadjusted scores plus a new range-based metric (e.g., eTaPR).
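The random-guessing concern can be reproduced with a short, self-contained simulation (the series length, segment placement, flip probability, and PA implementation here are illustrative assumptions, not taken from any of the cited papers): a detector that flags points uniformly at random scores very highly under PA on a series with one long anomaly segment, while its point-wise F1 stays near zero.

```python
import random

def f1(labels, preds):
    """Plain point-wise F1 score for binary 0/1 sequences."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def point_adjust(labels, preds):
    """PA: one hit anywhere in a ground-truth segment marks the
    entire segment as detected."""
    adjusted = list(preds)
    i, n = 0, len(labels)
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1
            if any(adjusted[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

random.seed(0)
# 1000 points with one 200-point anomaly segment.
labels = [0] * 400 + [1] * 200 + [0] * 400
# "Detector" that fires at random on ~5% of points.
preds = [1 if random.random() < 0.05 else 0 for _ in labels]

print(f"point-wise F1: {f1(labels, preds):.3f}")
print(f"F1-PA:         {f1(labels, point_adjust(labels, preds)):.3f}")
```

With a long enough segment, a few random hits are almost guaranteed, so PA converts them into 200 true positives at once while the false-positive count stays modest, which is exactly the distribution-dependent inflation the AAAI paper warns about.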

I also ran a comparison with other SOTA unsupervised approaches on SMD.
In this experiment, DIF achieved better-than-SOTA results (it had not been applied to SMD in the original paper). NCAD shows an issue that might be similar to the one in this paper.

| Method | F1   | F1-PA | F1-eTaPR |
|--------|------|-------|----------|
| NCAD   |  7.1 | 81.5  |  7.4     |
| GDN    | 42.0 | 93.0  | 36.3     |
| DGHL   | 44.2 | 89.9  | 47.5     |
| DIF    | 38.8 | 96.2  | 39.1     |
| VAE    | 28.6 | 80.7  | 32.8     |

However, looking at the anomaly scores, DIF does not look like the best method. The following image shows the anomaly scores for host 2-5 of SMD. Some anomaly scores are scaled for illustration purposes.

[Image: anomaly scores for host 2-5 of SMD]
