Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--id option for de-duplicating objects over time #2

Closed
simonw opened this issue Nov 11, 2021 · 8 comments
Closed

--id option for de-duplicating objects over time #2

simonw opened this issue Nov 11, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Nov 11, 2021

Spun off from #1.

@simonw simonw added the enhancement New feature or request label Nov 11, 2021
@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

More interesting is if you specify columns to be treated as IDs within that data, using the --id option one or more times.

If you do this, two columns will be created - item and item_history.

The item table will contain just the most recent version of each row.

The item_history table will contain a row for each captured differing version of that item, plus git_commit_at, git_hashandgit_version` columns (where version starts at 1 and increments for each captured version).

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

This means I need to be able to tell if the item has changed since being spotted in a previous version.

I'm going to reuse the _hash() function from sqlite-utils (used for hash_id there) for this:

def _hash(record):
    return hashlib.sha1(
        json.dumps(record, separators=(",", ":"), sort_keys=True, default=repr).encode(
            "utf8"
        )
    ).hexdigest()

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

Problem: sqlite-utils still doesn't support compound foreign keys:

I want items in the versions table to relate back to the row in the items table - but that table was going to have a compound primary key.

There's a PR but it's a bit out-of-date.

One workaround could be to synthesize an item ID from the sha256 hash of the values in those ID columns.

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

I'm going to do that - but I'll actually abuse the _hash() function by passing in a dictionary of just the ID => value columns.

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

New schema design:

  • commits: hash, commit_at
  • item: id, columns
  • item_versions: item, version, commit, columns

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

I was testing this against https://github.com/simonw/sfpublicworks-tree-removal-notifications/blob/main/tree-list.json like so:

git-history file db.db ../sfpublicworks-tree-removal-notifications/tree-list.json \
  --repo ../sfpublicworks-tree-removal-notifications \
  --id title --id district \
  --ignore img_src

And getting really confusing results. Turns out title, district is not unique across commits!

simonw/sfpublicworks-tree-removal-notifications@0413e61

Banners_and_Alerts_and_Scraped_at_Wed_Feb_17_01_24_56_UTC_2021_·_simonw_sfpublicworks-tree-removal-notifications_0413e61

Maybe try to detect if a single page has multiple items with the same ID in it?

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

Added this error:

(git-history) git-history % git-history file db.db ../sfpublicworks-tree-removal-notifications/tree-list.json --repo ../sfpublicworks-tree-removal-notifications --id title --id district --ignore img_src                   
Error: Commit: c98322bcf14655c606e6ab8c383ea5079fde5781 - found multiple items with the same ID:
[
    {
        "title": "183 Westwood Drive",
        "district": "District 7",
        "Site number": "1",
        "Reason(s) for removal": "Has arbutus fungal canker with 100% mortality rate, there is no cure or treatment and canker will spread and girdle cambium then kill tree as it forms cracks, prompt removal critical for preventing spread to other arbutus; replaceable with species appropriate for narrow sidewalk",
        "Posting period": "6/2/2020 - 7/17/2020"
    },
    {
        "title": "183 Westwood Drive",
        "district": "District 7",
        "Site number": "2",
        "Reason(s) for removal": "Has arbutus fungal canker with 100% mortality rate, there is no cure or treatment and canker will spread and girdle cambium then kill tree as it forms cracks, prompt removal critical for preventing spread to other arbutus; replaceable with species appropriate for narrow sidewalk",
        "Posting period": "6/2/2020 - 7/17/2020"
    }
]

@simonw
Copy link
Owner Author

simonw commented Nov 11, 2021

OK this is working:

git-history file db.db /tmp/getIncidentsGit/incidents.json --repo /tmp/getIncidentsGit \
    --id CallTimeOpened --id Address --id CrossStreet \
    --convert 'json.loads(content)["ActiveIncidentDataTable"]' \
    --ignore-duplicate-ids

The duplicate IDs were for an incident that appeared to be filed twice, with slightly different categories.

@simonw simonw closed this as completed in 84663ee Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant