Anomaly detection / multiple jobs comparison #132
Using only two snapshots might be hard, because the whole idea of anomaly detection is to find an event whose value is more than 2 standard deviations below or above the mean. BTW, pandas-profiling seems like a great tool.
Yep, I mentioned 5, but it could be as many as we want, since we're going to get the data from job stats (which is fast), not from the real data, at least for now.
@ejulio Thinking about implementation:
Or let it be completely separate:
If there's a use case for the first API, then it looks more familiar.
You'll have to provide arguments anyway. But it can be as
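From the fragments above, the "completely separate" option might look roughly like the sketch below. Every name and signature here is hypothetical, not Arche's real API:

```python
from typing import Dict, List

def anomalies(target: str, sample: List[str]) -> Dict[str, object]:
    """Compare the `target` job's stats against a `sample` of previous jobs.

    Hypothetical signature: a real implementation would fetch Scrapy Cloud
    job stats and compare their distributions; this stub only echoes input.
    """
    return {"target": target, "sample": sample}

# The first option would instead extend the familiar entry point, e.g.
# Arche(source="<target job>", expand=["<job 1>", "<job 2>"])  # hypothetical
print(anomalies("<target job>", ["<job 1>", "<job 2>"]))
```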
How do we treat a 100% decrease?
@ejulio @raphapassini I made an example finding the coverage with z-score: https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/Arche/anomalies.ipynb Is that what you meant?
I think that it is fine to consider only the latest job. The standard approach for anomaly detection is to assume a normal distribution and then flag a value as an anomaly if it falls more than 2 standard deviations from the mean.
That would cover both tails of the distribution. Another approach would be:
Probably this is not something we are looking at right now, but it could be an idea to check differences in behavior over larger periods of time (say, between years).
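The two-tails, 2-standard-deviations rule discussed above can be sketched with a z-score (a toy sketch with made-up data, not the notebook's actual code):

```python
import numpy as np

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return []  # all jobs identical, nothing stands out
    z = (values - mean) / std
    # abs() catches both tails of the distribution
    return [(i, float(v)) for i, (v, s) in enumerate(zip(values, z))
            if abs(s) > threshold]

# Field coverage across 6 jobs; the last one dropped sharply.
print(find_anomalies([0.98, 0.97, 0.99, 0.98, 0.97, 0.60]))  # → [(5, 0.6)]
```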
Looks like I included the 2-standard-deviations method, and it makes me wonder: what if the deviation is too small compared to the coverage?
It seems that z-score deals with such cases.
So, it is not necessarily 0.01%.
Let me get it right: say we have a 60% coverage average with a 0.005% deviation. Then everything lower than 59.3% or higher than 60.7% is outside the norm. That makes sense.
@alexander-matsievsky What are your thoughts? Especially from practical observations?
Maybe the percentages are making things worse. If you want, 60 and 10 can be percentages: then mean = 60%, and any other percentage < 40% or > 80% is an anomaly. They can be item counts as well.
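As a quick check of the 60/10 numbers above (plain arithmetic, nothing Arche-specific):

```python
def is_anomaly(value, mean, std, k=2):
    """True if `value` is more than `k` standard deviations from `mean`."""
    return abs(value - mean) > k * std

mean, std = 60, 10  # e.g. mean coverage of 60% with a std of 10 points
print(mean - 2 * std, mean + 2 * std)  # normal-range bounds → 40 80
print(is_anomaly(35, mean, std))       # → True, below 40%
print(is_anomaly(75, mean, std))       # → False, within 40%-80%
```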
I put the code in the notebook.
Another question: the API.
@manycoding @ejulio Hi, guys! Checking the variables independently is less accurate; one should take the correlation into account. DM me for an example.
@manycoding, I think the API should be as close to the current one as possible. @alexander-matsievsky, I agree that correlation can give more information, but checking the distribution of stats should be enough to detect some anomalies.
For the results, we decided to make a grid of box plots, only for suspicious data.
* Drop _type from cov difference, add to Source mock
* Add anomalies & tests, #132
* Display anomalies with box plots
* Update to plotly 4
Implemented in #138
I still don't have any particular idea of how best to do it.
Outliers by deviation? Percentiles? pandas-profiling?
1st approach:
The data is from Scrapy Cloud stats.
@alexander-matsievsky Did I get everything?
@ejulio https://pandas-profiling.github.io/pandas-profiling/examples/nza/nza_report.html