--id option for de-duplicating objects over time #2
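This means I need to be able to tell if the item has changed since being spotted in a previous version. I'm going to reuse this hash function:

    import hashlib
    import json

    def _hash(record):
        # Serialize to canonical JSON (sorted keys, compact separators) so that
        # equal records always produce the same SHA-1 digest.
        return hashlib.sha1(
            json.dumps(record, separators=(",", ":"), sort_keys=True, default=repr).encode(
                "utf8"
            )
        ).hexdigest()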
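As a rough sketch of how that hash could be used for change detection: the helper below keeps a mapping of item ID to last-seen hash and reports only items that are new or whose content differs. The name `changed_items`, the `id_key` parameter and the in-memory dictionary are assumptions made for illustration, not part of the actual tool.

```python
import hashlib
import json


def _hash(record):
    # Same approach as the function quoted above: canonical JSON, then SHA-1.
    return hashlib.sha1(
        json.dumps(record, separators=(",", ":"), sort_keys=True, default=repr).encode(
            "utf8"
        )
    ).hexdigest()


def changed_items(previous_hashes, items, id_key="id"):
    """Return the items that are new or changed since they were last seen.

    previous_hashes maps item ID -> content hash and is updated in place.
    (Hypothetical helper, for illustration only.)
    """
    changed = []
    for item in items:
        item_id = item[id_key]
        new_hash = _hash(item)
        if previous_hashes.get(item_id) != new_hash:
            changed.append(item)
            previous_hashes[item_id] = new_hash
    return changed
```

Because `sort_keys=True` is used, two versions of an item that differ only in key order hash identically and are not reported as changed.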
Problem: I want items in the There's a PR but it's a bit out-of-date. One workaround could be to synthesize an item ID from the sha256 hash of the values in those ID columns.
I'm going to do that - but I'll actually abuse the
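A minimal sketch of that workaround, assuming the ID columns are passed as a list of key names. The function name `synthesize_id` and the choice to serialize the values as a JSON list before hashing are illustrative assumptions, not what the tool necessarily does.

```python
import hashlib
import json


def synthesize_id(record, id_columns):
    # Hypothetical helper: hash only the values of the chosen ID columns,
    # in the order the columns are given, so other fields can change freely.
    values = [record[column] for column in id_columns]
    payload = json.dumps(values, separators=(",", ":"), default=repr)
    return hashlib.sha256(payload.encode("utf8")).hexdigest()
```

Two records that agree on every ID column get the same synthesized ID even if their other fields differ, which is what allows later versions of an item to be matched to earlier ones.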
New schema design:
I was testing this against https://github.com/simonw/sfpublicworks-tree-removal-notifications/blob/main/tree-list.json like so:
And getting really confusing results. Turns out simonw/sfpublicworks-tree-removal-notifications@0413e61
Maybe try to detect if a single page has multiple items with the same ID in it?
Added this error:
OK this is working:
The duplicate IDs were for an incident that appeared to be filed twice, with slightly different categories.
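For illustration, a check for duplicate IDs within a single version of the file could look something like the sketch below. The function names, the `id_key` parameter and the error message are assumptions; the actual error added to the tool is not shown in this thread.

```python
from collections import Counter


def find_duplicate_ids(items, id_key="id"):
    # Count how often each ID occurs in one version of the file and
    # return those that appear more than once, in first-seen order.
    counts = Counter(item[id_key] for item in items)
    return [item_id for item_id, count in counts.items() if count > 1]


def check_unique_ids(items, id_key="id"):
    # Hypothetical guard: fail loudly rather than silently merging items.
    duplicates = find_duplicate_ids(items, id_key)
    if duplicates:
        raise ValueError(f"Duplicate IDs in a single version: {duplicates}")
```

Raising an error here, rather than keeping the first or last duplicate, surfaces source data problems such as an incident that was filed twice.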
Spun off from #1.