Only report on persistent changes #22
The filter mechanism is designed to cut out parts of pages that always change. Do you have an example URL where this happens? It might make sense to cache the last two versions, but that should probably be per-URL and configurable (e.g. it might make sense to cache the last 7 versions for a page that rotates its content daily, with content repeating every week).
Yes, I am already using the filter mechanism for a few pages, and it works brilliantly, by the way. However, these changes appear to affect almost the whole of this page. (In this case, I should probably just watch the Mercurial project instead.) I have urlwatch running hourly via cron. This page has been reported as changing to nonsense five times in the last six days, then reverting on the next run, or sometimes within the same run (I'm not sure how that works). I've reproduced the most recent nonsense version below. To be honest, this is really just a bug in the watched page, so I understand if it's not a priority to address.
Maybe a feature that keeps the last N versions of the page and, when diffing, checks whether the new content matches any of them could do the trick here? As a per-URL override, that is.
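The "keep the last N versions" idea could be sketched roughly like this. Everything here is illustrative (the class and function names are not part of urlwatch): a change is only reported if the new content does not match any recently cached version.

```python
# Hypothetical sketch: skip reporting when the newly fetched content matches
# any of the last N cached versions of the page. Names are illustrative,
# not urlwatch's actual API.
from collections import deque

class VersionCache:
    def __init__(self, max_versions=2):
        # Keep only the most recent `max_versions` distinct snapshots
        self.versions = deque(maxlen=max_versions)

    def is_known(self, content):
        """Return True if this exact content was seen recently."""
        return content in self.versions

    def add(self, content):
        if content not in self.versions:
            self.versions.append(content)

def should_report(cache, new_content):
    # Report only genuinely new content, not a flip back to a
    # recently seen state.
    known = cache.is_known(new_content)
    cache.add(new_content)
    return not known
```

With `max_versions=2`, a page flapping between two states A and B would trigger at most two reports and then go quiet until a third state appears.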
Yes, that might work. One part that confuses me is what to do about the "false positive" page. Consider the following.
However, what happens to the nonsense page in the cache? If, in the future, the page changes to the nonsense version permanently, urlwatch should notify. So I guess the cached nonsense version should be discarded when the page reverts? Another thing… perhaps users could specify how many unique versions to keep, and a maximum age (measured chronologically or in "checks ago") for cached pages. Then, if the nonsense page persists for up to n hours, or m checking cycles, urlwatch can still ignore it.
I would assume that urlwatch would report the first change but not report the change back, and any subsequent change would then get reported again. So, assuming we have the states A, B, C, D (with B being the intermediate trash state), it would: report A->B, skip B->A (A is already known), skip A->B (B is already known), report B->C, skip C->B (B is already known), report B->D.

This is not ideal, as we want to report the B-to-something changes against the state before B (assuming B contains garbage): with A->B->C we want to report a diff of A->C when the actual change is B->C, right?

Would a filter that can mark a result as "garbage" work here, so that we simply filter out garbage? I do have a page that sometimes prints "No connection to the database" instead of a list of items, so that would solve that problem as well (the filter would then be: "if the text contains No connection to the database, ignore for now").
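The garbage-filter idea above can be sketched as follows. This is a hypothetical illustration, not urlwatch code: when a result matches a garbage rule, the last good snapshot is kept as the diff base, so an A->B->C sequence is reported as a single A->C diff.

```python
# Hypothetical sketch of a "garbage filter": if a result matches a garbage
# pattern, keep diffing against the last good snapshot instead of the
# garbage state. All names here are illustrative.
def is_garbage(content):
    # Per-URL configurable rule, e.g. a known backend error message
    return "No connection to the database" in content

def process(state, new_content):
    """state holds 'last_good'; returns (diff_to_report_or_None, state)."""
    if is_garbage(new_content):
        # Ignore for now; the diff base stays at the last good version
        return None, state
    last_good = state.get("last_good")
    state["last_good"] = new_content
    if last_good is None or last_good == new_content:
        return None, state
    # Diff A -> C directly, skipping the intermediate garbage state B
    return (last_good, new_content), state
```

Reverting from garbage back to A produces no report at all, since the last good version is still A.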
Yes, this makes sense overall, for both the reporting and the diffing. However, if the "junk" results in a different page each time, which I expect is happening in my example above, then this would not solve the problem. The only solution I can think of is to report the junk page only after it has been present across multiple checks. I appreciate that this may be confusing to the end user and will delay reporting, so even though it would solve my problem in theory, I'm not sure it's ideal.
The new SQLite3/minidb-based storage method in urlwatch 2 does keep old versions around until manually purged, so it might be possible to implement something like this now by looking at the last N versions of a page. |
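If each fetched snapshot is stored as a row in SQLite, checking whether new content matches any of the last N versions becomes a single query. The table name and columns below are purely illustrative; they are not urlwatch's actual schema.

```python
# Hypothetical sketch: with per-URL snapshots in SQLite, "does the new
# content match any of the last N versions?" is one ORDER BY ... LIMIT query.
# Schema is illustrative, not urlwatch's real storage layout.
import sqlite3

def matches_recent(conn, url, content, n=2):
    """True if `content` equals any of the last n stored snapshots."""
    rows = conn.execute(
        "SELECT data FROM history WHERE url = ? ORDER BY ts DESC LIMIT ?",
        (url, n),
    ).fetchall()
    return any(row[0] == content for row in rows)

# Example setup with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (url TEXT, ts INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [("http://example.com", 1, "A"), ("http://example.com", 2, "B")],
)
```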
I'm getting a lot of false positives: urlwatch detects changes to a web page in one run, then in the next run detects a complete reversion of those changes. I see this on multiple websites, affecting different sections of the page. I'm not sure what is causing it, but it might be due to corrupted incoming files.
Perhaps urlwatch could have the option of caching the last two versions when it detects changes, and only report when changes persist for two runs?
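The "persist for two runs" rule could be sketched like this. This is a hypothetical illustration (not urlwatch code): a changed snapshot is held back as a pending candidate and only reported when the very next run sees the same content again; a reversion silently drops the candidate.

```python
# Hypothetical sketch of "only report persistent changes": hold a candidate
# change until it is seen in two consecutive runs. Names are illustrative.
def check(state, new_content):
    """Return the content to report, or None. `state` is a mutable dict
    with keys 'current' (last reported version) and 'pending'."""
    current = state.get("current")
    pending = state.get("pending")
    if new_content == current:
        # Reverted (or unchanged): drop any pending candidate
        state["pending"] = None
        return None
    if new_content == pending:
        # Same change seen twice in a row: accept and report it
        state["current"] = new_content
        state["pending"] = None
        return new_content
    # First sighting of this change: remember it but stay quiet
    state["pending"] = new_content
    return None
```

The trade-off discussed above applies: every genuine change is reported one check interval late, which is the price of suppressing one-run flaps.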