
Anomaly detections/multiple jobs comparison #132

Closed
manycoding opened this issue Jul 4, 2019 · 19 comments

@manycoding
Contributor

manycoding commented Jul 4, 2019

I still don't have a clear idea of the best way to do this.
Outliers by deviation? Percentiles?
Pandas-profiling.

1st approach:

  • plot field coverage for multiple jobs (5).
  • notify about suspicious decrease, e.g.:
> Check 265529/50/33.
< `USER_RATING_AVERAGE` has suspicious coverage decrease.
< `USER_RATING_COUNT` has suspicious coverage decrease.

The data is from Scrapy Cloud stats.
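
A rough sketch of the first approach, just to make it concrete. The job keys and coverage numbers below are invented; nothing is actually fetched from Scrapy Cloud here.

# Hypothetical sketch of "plot field coverage for multiple jobs";
# the coverage values are made up rather than pulled from Scrapy Cloud stats.
# Uses the default pandas plotting backend (matplotlib).
import pandas as pd

coverage = pd.DataFrame(
    {
        "USER_RATING_AVERAGE": [98, 97, 98, 96, 60],
        "USER_RATING_COUNT": [99, 99, 98, 97, 55],
    },
    index=["0/0/1", "0/0/2", "0/0/3", "0/0/4", "0/0/5"],  # job keys, oldest first
)
coverage.plot(marker="o", title="Field coverage (%) across jobs")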
@alexander-matsievsky Did I get everything?
@ejulio

https://pandas-profiling.github.io/pandas-profiling/examples/nza/nza_report.html

@manycoding manycoding added Type: Feature New feature or request help wanted Extra attention is needed labels Jul 4, 2019
@manycoding manycoding changed the title Anomaly detections Anomaly detections/multiple jobs comparison Jul 4, 2019
@ejulio

ejulio commented Jul 5, 2019

Using only two snapshots might be hard, because the whole idea of anomaly detection is to find an event with a value lower/greater than 2 standard deviations from the mean.
So, the first part is to be able to work with a set of N snapshots.

BTW, pandas-profiling seems to be a great tool.

@manycoding
Contributor Author

> So, the first part is to be able to work with a set of N snapshots.

Yep, I mentioned 5, but it could be as many as we want, since we're going to get the data from job stats, which is fast. Not from the real data, at least for now.

@manycoding
Contributor Author

manycoding commented Jul 6, 2019

@ejulio Thinking about implementation:
Either we do this for any number of jobs (downloading data only for the first two):

a = Arche(source="0/0/1", targets=["0/0/2", "0/0/3", ...])
a.report_all()

Or let it be completely separate:

arche.rules.comparison.anomalies(["0/0/1", "0/0/2", "0/0/3", ...]).show()

@ejulio

ejulio commented Jul 8, 2019

If there's a use case for the first API, then it looks more familiar.
Otherwise, the second one might be better.
What will happen if I instantiate a = Arche(source='job id', target='job id') and then follow to the second API?

@manycoding
Contributor Author

> What will happen if I instantiate a = Arche(source='job id', target='job id') and then follow to the second API?

You'll have to provide arguments anyway. But it could work like glance():

a = Arche(source='job id', target='job id')
a.anomalies()

@manycoding
Contributor Author

How do we treat a 100% decrease?
And the most important question: what do we count as a suspicious change (I reckon an increase counts too)? I remember somebody told me about 2 standard deviations, but I cannot find more details.

@manycoding manycoding added this to the 0.3.7 milestone Jul 8, 2019
@manycoding
Contributor Author

manycoding commented Jul 8, 2019

@ejulio @raphapassini I made an example finding the coverage with zscore https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/Arche/anomalies.ipynb

Is that what you meant?
Do we care only for the latest job?

@ejulio

ejulio commented Jul 9, 2019

I think that it is fine to consider only the latest job.
Mostly because this monitoring might be continuous, so no need to check jobs in the past.

The standard approach for anomaly detection is to assume a normal distribution and then set as an anomaly if:

  • value < mean - 2 * std
  • value > mean + 2 * std

It would be in both tails of the distribution.
I think it is fine to make this assumption of normality because these values are counts, and they should be somewhat similar given a small time interval between jobs (say a week or so).

Another approach would be:

  • Input: a set of control jobs
  • Input: a set of jobs to be evaluated
  • Perform a statistical t-test
  • Report the p-value (reject/accept the null hypothesis)

Probably this is not something we are looking at right now, but it could be an idea to check differences in behavior over larger periods of time (say between years).
For example, we could identify if we are getting more bans/retries/bad responses, which is also informative and might be a trigger for action.
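
A minimal sketch of both ideas, assuming per-field coverage values have already been extracted from job stats. The helper names below are made up; this is not Arche's API.

# Hypothetical helpers, not part of Arche.
import numpy as np
from scipy import stats

def outside_two_std(history, latest):
    """Flag the latest coverage value if it is more than 2 std from the historical mean."""
    history = np.asarray(history, dtype=float)
    return abs(latest - history.mean()) > 2 * history.std()

def periods_differ(control_jobs, evaluated_jobs, alpha=0.05):
    """Two-sample t-test between a control period and an evaluated period."""
    _, p_value = stats.ttest_ind(control_jobs, evaluated_jobs)
    return p_value < alpha  # True -> reject the null hypothesis of equal means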

@manycoding
Contributor Author

manycoding commented Jul 9, 2019

Looks like -2 < z-score < 2 should be the same as 2 standard deviations?

I included the 2 standard deviations method, and it makes me wonder: what if the deviation is too small compared to the coverage?
E.g. a 0.005% deviation means anything further than 0.01% away is bad, but should we worry?

  1. Compared to 100% coverage, it doesn't matter.
  2. Compared to the field coverage, it could matter, since we deal with historical data.

It seems that the z-score deals with such cases.
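
A quick check of that equivalence, with made-up numbers: |z| > 2 is exactly the same condition as falling outside mean ± 2 * std.

# Sanity check that the z-score rule and the 2 std rule flag the same values.
import numpy as np

coverage = np.array([60.1, 59.8, 60.0, 60.2, 59.9, 60.3, 57.0, 60.1])
mean, std = coverage.mean(), coverage.std()
z = (coverage - mean) / std
outside = (coverage < mean - 2 * std) | (coverage > mean + 2 * std)
assert np.array_equal(np.abs(z) > 2, outside)  # identical masks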

@ejulio

ejulio commented Jul 10, 2019

So, it is not necessarily 0.01%.
It depends on the std itself. If the std is small, a 0.01% difference is significant. If the std is large, it means nothing.

@manycoding
Contributor Author

Let me get it right. Say we have a 60% average coverage with a 0.005% deviation. Then everything lower than 59.99% or higher than 60.01% is out of the normal. That makes sense.
Or we have 60% coverage with a 0.0005% deviation. Now it's 59.999% and 60.001% that we look for.
So you mean all of these numbers make complete sense for scraped data? That we shouldn't check against an absolute deviation size (like only looking at deviations > 1%)?

@manycoding
Contributor Author

@alexander-matsievsky What are your thoughts? Especially from practical observations?

@ejulio

ejulio commented Jul 10, 2019

Maybe the percentages are making things worse.
Let's ignore the units (no matter if they are absolute values, percentages, or counts).
Say that on average, coverage is 60 and std is 10.
Then, any value < 40 or value > 80 is an anomaly.

If you want, 60 and 10 can be percentages. Then mean = 60%, and any other percentage < 40% or > 80% is an anomaly. They can be item counts as well.
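
The same rule written out with those numbers (just arithmetic, regardless of units):

mean, std = 60, 10
lower, upper = mean - 2 * std, mean + 2 * std  # 40 and 80
# anything below 40 or above 80 is flagged as an anomaly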

@manycoding
Contributor Author

I put the code in the notebook

manycoding added a commit that referenced this issue Jul 10, 2019
@manycoding
Contributor Author

Another question: the API.
So far you need to feed jobs, but the input could be (thanks to @raphapassini):

  • keys you choose
  • or a time range (e.g. everything for the last week)
  • or a number of last jobs

@alexander-matsievsky
Member

@manycoding @ejulio Hi, guys! Checking the variables independently is less accurate; one should take the correlation into account. DM me for an example.

@ejulio

ejulio commented Jul 11, 2019

@manycoding , I think the API should be as close to the current one as possible.
Maybe we can think of wrappers, but it would also be a matter of thinking about modules then.

@alexander-matsievsky, I agree that correlation can give more information, but checking the distribution of stats should be enough to detect some anomalies.
Then we'd start talking about multivariate stuff :)

@manycoding
Contributor Author

For results, we decided to make a grid of box plots, only for suspicious data.
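
For illustration, a rough sketch of such a grid with plotly 4. The field names and values are made up, and this is not the code that landed in #138.

# Hypothetical box plot grid for suspicious fields only, using plotly 4.
import plotly.graph_objects as go
from plotly.subplots import make_subplots

suspicious = {
    "USER_RATING_AVERAGE": [98.0, 97.5, 98.2, 97.9, 60.0],
    "USER_RATING_COUNT": [99.0, 98.7, 99.1, 98.9, 55.0],
}
fig = make_subplots(rows=1, cols=len(suspicious), subplot_titles=list(suspicious))
for i, (field, values) in enumerate(suspicious.items(), start=1):
    fig.add_trace(go.Box(y=values, name=field), row=1, col=i)
fig.show()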

manycoding added a commit that referenced this issue Jul 23, 2019
* Drop _type from cov difference, add to Source mock

* Add anomalies & tests, #132

* Display anomalies with box plots

* Update to plotly 4
@manycoding
Contributor Author

Implemented in #138
