
Anomaly detections/multiple jobs comparison #132

Closed
manycoding opened this issue Jul 4, 2019 · 19 comments

@manycoding
Contributor

manycoding commented Jul 4, 2019

I still don't have a clear idea of the best way to do this.
Outliers by deviation? Percentiles?
Pandas-profiling.

1st approach:

  • plot field coverage for multiple jobs (5).
  • notify about suspicious decrease, e.g.:
> Check 265529/50/33.
< `USER_RATING_AVERAGE` has suspicious coverage decrease.
< `USER_RATING_COUNT` has suspicious coverage decrease.

The data is from Scrapy Cloud stats.
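
A rough sketch of the first approach, just to make it concrete. The job keys and coverage numbers below are invented; nothing is actually fetched from Scrapy Cloud here.

# Hypothetical sketch of "plot field coverage for multiple jobs";
# the coverage values are made up rather than pulled from Scrapy Cloud stats.
# Uses the default pandas plotting backend (matplotlib).
import pandas as pd

coverage = pd.DataFrame(
    {
        "USER_RATING_AVERAGE": [98, 97, 98, 96, 60],
        "USER_RATING_COUNT": [99, 99, 98, 97, 55],
    },
    index=["0/0/1", "0/0/2", "0/0/3", "0/0/4", "0/0/5"],  # job keys, oldest first
)
coverage.plot(marker="o", title="Field coverage (%) across jobs")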
@alexander-matsievsky Did I get everything?
@ejulio

https://pandas-profiling.github.io/pandas-profiling/examples/nza/nza_report.html

@manycoding manycoding added Type: Feature New feature or request help wanted Extra attention is needed labels Jul 4, 2019
@manycoding manycoding changed the title Anomaly detections Anomaly detections/multiple jobs comparison Jul 4, 2019
@ejulio

ejulio commented Jul 5, 2019

Using only two snapshots might be hard, because the whole idea of anomaly detection is to find an event with a value lower/greater than 2 standard deviations from the mean.
So, the first part is to be able to work with a set of N snapshots.

BTW, pandas-profiling seems to be a great tool.

@manycoding
Contributor Author

> So, the first part is to be able to work with a set of N snapshots.

Yep, I mentioned 5, but it could be as many as we want, since we're going to get the data from job stats, which is fast. Not from the real data, at least for now.

@manycoding
Contributor Author

manycoding commented Jul 6, 2019

@ejulio Thinking about implementation:
Either we do this for any number of jobs (downloading data only for the first two):

a = Arche(source="0/0/1", targets=["0/0/2", "0/0/3", ...])
a.report_all()

Or let it be completely separate:

arche.rules.comparison.anomalies(["0/0/1", "0/0/2", "0/0/3", ...]).show()

@ejulio

ejulio commented Jul 8, 2019

If there's a use case for the first API, then it looks more familiar.
Otherwise, the second one might be better.
What will happen if I instantiate a = Arche(source='job id', target='job id') and then follow to the second API?

@manycoding
Contributor Author

> What will happen if I instantiate a = Arche(source='job id', target='job id') and then follow to the second API?

You'll have to provide arguments anyway. But it could work like glance():

a = Arche(source='job id', target='job id')
a.anomalies()

@manycoding
Contributor Author

How do we treat a 100% decrease?
And the most important question: what do we count as a suspicious change (I reckon an increase counts too)? I remember somebody told me about 2 standard deviations, but I cannot find more details.

@manycoding manycoding added this to the 0.3.7 milestone Jul 8, 2019
@manycoding
Contributor Author

manycoding commented Jul 8, 2019

@ejulio @raphapassini I made an example finding the coverage with zscore https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/Arche/anomalies.ipynb

Is that what you meant?
Do we care only for the latest job?

@ejulio

ejulio commented Jul 9, 2019

I think that it is fine to consider only the latest job.
Mostly because this monitoring might be continuous, so no need to check jobs in the past.

The standard approach for anomaly detection is to assume a normal distribution and then set as an anomaly if:

  • value < mean - 2 * std
  • value > mean + 2 * std

It would be in both tails of the distribution.
I think it is fine to make this assumption of normality because these values are counts, and they should be somewhat similar given a small time interval between jobs (say a week or so).

Another approach would be:

  • Input: a set of control jobs
  • Input: a set of jobs to be evaluated
  • Perform a statistical t-test
  • Report the p-value (reject/accept the null hypothesis)

Probably this is not something we are looking at right now, but it could be an idea to check differences in behavior over larger periods of time (say between years).
For example, we could identify if we are getting more bans/retries/bad responses, which is also informative and might be a trigger for action.
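
A minimal sketch of both ideas, assuming per-field coverage values have already been extracted from job stats. The helper names below are made up; this is not Arche's API.

# Hypothetical helpers, not part of Arche.
import numpy as np
from scipy import stats

def outside_two_std(history, latest):
    """Flag the latest coverage value if it is more than 2 std from the historical mean."""
    history = np.asarray(history, dtype=float)
    return abs(latest - history.mean()) > 2 * history.std()

def periods_differ(control_jobs, evaluated_jobs, alpha=0.05):
    """Two-sample t-test between a control period and an evaluated period."""
    _, p_value = stats.ttest_ind(control_jobs, evaluated_jobs)
    return p_value < alpha  # True -> reject the null hypothesis of equal means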

@manycoding
Contributor Author

manycoding commented Jul 9, 2019

Looks like -2 < z-score < 2 should be the same as 2 standard deviations?

I included the 2 standard deviations method, and it makes me wonder: what if the deviation is too small compared to the coverage?
E.g. a 0.005% deviation means anything further than 0.01% away is bad, but should we worry?

  1. Compared to 100% coverage, it doesn't matter.
  2. Compared to the field coverage, it could matter, since we deal with historical data.

It seems that the z-score deals with such cases.
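
A quick check of that equivalence, with made-up numbers: |z| > 2 is exactly the same condition as falling outside mean ± 2 * std.

# Sanity check that the z-score rule and the 2 std rule flag the same values.
import numpy as np

coverage = np.array([60.1, 59.8, 60.0, 60.2, 59.9, 60.3, 57.0, 60.1])
mean, std = coverage.mean(), coverage.std()
z = (coverage - mean) / std
outside = (coverage < mean - 2 * std) | (coverage > mean + 2 * std)
assert np.array_equal(np.abs(z) > 2, outside)  # identical masks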

@ejulio

ejulio commented Jul 10, 2019

So, it is not necessarily 0.01%.
It depends on the std itself. If the std is small, a 0.01% difference is significant. If the std is large, it means nothing.

@manycoding
Contributor Author

Let me get it right. Say we have a 60% average coverage with a 0.005% deviation. Then everything lower than 59.99% or higher than 60.01% is out of the normal. That makes sense.
Or we have 60% coverage with a 0.0005% deviation. Now it's 59.999% and 60.001% that we look for.
So you mean all of these numbers make complete sense for scraped data? That we shouldn't check against an absolute deviation size (like only looking at deviations > 1%)?

@manycoding
Contributor Author

@alexander-matsievsky What are your thoughts? Especially from practical observations?

@ejulio

ejulio commented Jul 10, 2019

Maybe the percentages are making things worse.
Let's ignore the units (no matter if they are absolute values, percentages, or counts).
Say that on average, coverage is 60 and std is 10.
Then, any value < 40 or value > 80 is an anomaly.

If you want, 60 and 10 can be percentages. Then mean = 60%, and any other percentage < 40% or > 80% is an anomaly. They can be item counts as well.
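
The same rule written out with those numbers (just arithmetic, regardless of units):

mean, std = 60, 10
lower, upper = mean - 2 * std, mean + 2 * std  # 40 and 80
# anything below 40 or above 80 is flagged as an anomaly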

@manycoding
Contributor Author

I put the code in the notebook

manycoding added a commit that referenced this issue Jul 10, 2019
@manycoding
Contributor Author

Another question: the API.
So far you need to feed jobs, but the input could be (thanks to @raphapassini):

  • keys you choose
  • or a time range (e.g. everything for the last week)
  • or a number of last jobs

@alexander-matsievsky
Member

@manycoding @ejulio Hi, guys! Checking the variables independently is less accurate; one should take the correlation into account. DM me for an example.

@ejulio

ejulio commented Jul 11, 2019

@manycoding , I think the API should be as close to the current one as possible.
Maybe we can think of wrappers, but it would also be a matter of thinking about modules then.

@alexander-matsievsky, I agree that correlation can give more information, but checking the distribution of stats should be enough to detect some anomalies.
Then we'd start talking about multivariate stuff :)

@manycoding
Contributor Author

For results, we decided to make a grid of box plots, only for suspicious data.
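
For illustration, a rough sketch of such a grid with plotly 4. The field names and values are made up, and this is not the code that landed in #138.

# Hypothetical box plot grid for suspicious fields only, using plotly 4.
import plotly.graph_objects as go
from plotly.subplots import make_subplots

suspicious = {
    "USER_RATING_AVERAGE": [98.0, 97.5, 98.2, 97.9, 60.0],
    "USER_RATING_COUNT": [99.0, 98.7, 99.1, 98.9, 55.0],
}
fig = make_subplots(rows=1, cols=len(suspicious), subplot_titles=list(suspicious))
for i, (field, values) in enumerate(suspicious.items(), start=1):
    fig.add_trace(go.Box(y=values, name=field), row=1, col=i)
fig.show()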

manycoding added a commit that referenced this issue Jul 23, 2019
* Drop _type from cov difference, add to Source mock

* Add anomalies & tests, #132

* Display anomalies with box plots

* Update to plotly 4
@manycoding
Contributor Author

Implemented in #138
