bigdiffy with datasets that contain duplicates #721

Open
AndersonReyes opened this issue Mar 4, 2024 · 4 comments

Comments

AndersonReyes commented Mar 4, 2024

There are plenty of cases where I've needed to diff a dataset that contains duplicates. The current process is to output both sides to BQ and manually run a query that counts by key.

The only way I could think of to compare this data is to count by a custom key built from the data itself. Would it make sense to add a diff option that just counts by key: count on both sides and check that all keys match and that the counts match? This is not meant to solve all cases.
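
Roughly, I'm imagining something like this (sketch only, using Scio directly; `countDiff` and `keyFn` are made-up names, not an existing BigDiffy option):

```scala
import com.spotify.scio.values.SCollection

// Hypothetical helper (not part of BigDiffy today): count records per key on
// each side, then keep only the keys whose counts disagree or that are
// missing from one side.
def countDiff[T](lhs: SCollection[T],
                 rhs: SCollection[T],
                 keyFn: T => String): SCollection[(String, Long, Long)] =
  lhs.map(keyFn).countByValue
    .fullOuterJoin(rhs.map(keyFn).countByValue)
    .collect { case (k, (l, r)) if l != r =>
      (k, l.getOrElse(0L), r.getOrElse(0L))
    }
```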

idreeskhan (Contributor) commented Mar 5, 2024

There needs to be a way to disambiguate between different rows. We could automate generating unique keys but I also think this is a thing that can be manually done upstream, either by generating a new dataset, or importing the BigDiffy library and pre-processing in code.

This is also why we allow generating concatenated keys from multiple columns. Is there a reason that doesn't work for your case?
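
For example, when importing the library, a concatenated key could be built from multiple columns along these lines (field names are just placeholders, and the exact diff call is omitted since it varies by version):

```scala
import org.apache.avro.generic.GenericRecord

// Placeholder field names; concatenate whichever columns identify a row.
val keyFn: GenericRecord => String =
  r => s"${r.get("user_id")}_${r.get("event_time")}"
```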

@monzalo14 (Member)

Yeah, pre-aggregation was also one of the first things on my mind because the suggested approach has very little in common with the current features in ratatool-diffy. Two ideas around pre-aggregation:

  1. Pre-aggregate rows by key with some function (e.g. mean) and then diff (see the sketch after this list)
    • This should have the same result for non-duplicates
    • There’s some thought needed for nested records (and for choosing the function in non-numeric cases)
    • Sounds like it would ignore too many corner cases
  2. Pre-aggregate rows by key by creating one additional level of nesting and then diff
    • We support arbitrary levels of nesting, so this in theory would work (need to take a closer look at how we handle record types)
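
A minimal sketch of option 1 for a flat record with a single numeric field (the `Row` schema is hypothetical; nested records and non-numeric fields would need more thought, as noted above):

```scala
import com.spotify.scio.values.SCollection

// Hypothetical flat schema; real datasets would have many more fields.
final case class Row(key: String, amount: Double)

// Collapse duplicate keys by averaging the numeric field before diffing.
// Keys without duplicates pass through unchanged (mean of a single value).
def preAggregate(rows: SCollection[Row]): SCollection[Row] =
  rows
    .keyBy(_.key)
    .mapValues(r => (r.amount, 1L))
    .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
    .map { case (k, (sum, n)) => Row(k, sum / n) }
```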

AndersonReyes (Author) commented Mar 6, 2024

> There needs to be a way to disambiguate between different rows. We could automate generating unique keys but I also think this is a thing that can be manually done upstream, either by generating a new dataset, or importing the BigDiffy library and pre-processing in code.
>
> This is also why we allow generating concatenated keys from multiple columns. Is there a reason that doesn't work for your case?

The specific case is that there can be multiple records that look exactly the same across all field values (they're still valid events). Any combination of columns would yield duplicates here.

The pre-processing does sound like an option, though. I could just set up a custom diffy job for these rare cases, salt the unique key with a counter or something ({uniquekey}_dup1, {uniquekey}_dup2), and then pass that to the diffy library.
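
Something along these lines (sketch only; the key function and coder bound are just for illustration, and the index assignment within a group isn't deterministic across runs):

```scala
import com.spotify.scio.coders.Coder
import com.spotify.scio.values.SCollection

// Salt duplicate keys with a per-key counter so every row ends up with a
// unique key ("<key>_dup1", "<key>_dup2", ...) before diffing.
def saltDuplicates[T: Coder](rows: SCollection[T],
                             keyFn: T => String): SCollection[(String, T)] =
  rows
    .keyBy(keyFn)
    .groupByKey
    .flatMap { case (k, ts) =>
      ts.zipWithIndex.map { case (t, i) => (s"${k}_dup${i + 1}", t) }
    }
```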

@idreeskhan (Contributor)

Yeah, I believe preprocessing will still be effective there: if all the values are the same, it doesn't matter whether dup1 and dup2 get assigned in different ways on each run.

The main issue is that we don't want to bake too much logic into the jobs; instead we provide the flexibility of importing BigDiffy as a library for highly specific use cases. If this winds up being very common then the discussion is a bit different, but we haven't really seen that so far. The jobs are already quite unwieldy as is, and we'd like not to bloat them any further unless needed.
