bigdiffy with datasets that contain duplicates #721
There needs to be a way to disambiguate between different rows. We could automate generating unique keys, but I also think this is something that can be done manually upstream, either by generating a new dataset or by importing the BigDiffy library and pre-processing in code. This is also why we allow generating concatenated keys from multiple columns. Is there a reason that doesn't work for your case?
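As a rough illustration of the concatenated-key idea mentioned above, the sketch below joins several column values into a single row key. The function name, separator, and row shape are all hypothetical; this is not BigDiffy's actual API.

```python
def composite_key(row, key_columns, sep="_"):
    """Concatenate the values of key_columns to form a single row key.
    Illustrative only; BigDiffy's real key handling may differ."""
    return sep.join(str(row[c]) for c in key_columns)

rows = [
    {"user": "alice", "event": "click", "ts": 1},
    {"user": "alice", "event": "view", "ts": 2},
]
keys = [composite_key(r, ["user", "event"]) for r in rows]
# keys == ["alice_click", "alice_view"]
```

Note that this still fails when two rows agree on every column, which is exactly the duplicate case discussed below.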
Yeah, pre-aggregation was also one of the first things on my mind, because the suggested approach has very little in common with the current features in
The specific case is that there can be multiple records that look exactly the same across all field values (still valid events), so any combination of columns would yield duplicates here. Pre-processing does sound like an option, though: set up a custom diffy job for these rare cases where I could salt the unique key with something like a counter ({uniquekey}_dup1, {uniquekey}_dup2) and then pass that to the diffy library.
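The salting step described above could be sketched like this: within each group of rows sharing a key, append a running counter so every row ends up with a unique key before being handed to the diff. The function name and row shape are hypothetical; only the `_dupN` suffix scheme comes from the comment.

```python
from collections import defaultdict

def salt_duplicate_keys(rows, key_fn):
    """Append a _dupN suffix per occurrence of each key so every row
    gets a unique key before diffing. Hypothetical helper, not part
    of the BigDiffy API."""
    counts = defaultdict(int)
    out = []
    for row in rows:
        k = key_fn(row)
        counts[k] += 1  # running counter per key group
        out.append((f"{k}_dup{counts[k]}", row))
    return out

rows = [{"id": "a", "v": 1}, {"id": "a", "v": 1}, {"id": "b", "v": 2}]
salted = salt_duplicate_keys(rows, lambda r: r["id"])
# salted keys == ["a_dup1", "a_dup2", "b_dup1"]
```

Since fully identical rows are interchangeable, it does not matter which physical row receives `_dup1` versus `_dup2` on a given run.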
Yeah, I believe pre-processing will still be effective there: if all the values are the same, it doesn't matter whether dup1 and dup2 are assigned differently on each run. The main issue is that we don't want to bake too much logic into the jobs; instead we provide the flexibility of importing BigDiffy as a library for highly specific use cases. If this winds up being very common then the discussion is a bit different, but we haven't really seen that so far. The jobs are already quite unwieldy as is, and I'd like not to bloat them any further unless needed.
There are plenty of cases where I've needed to diff a dataset that contains duplicates. The current process is to output to BQ and manually run a query that counts by key.
The only way I can think of to compare this data is to count by a custom key built from the data itself. Does it make sense to add a diff option that just counts by key: count on both sides, then check that all keys match and the counts match? This is not meant to solve all cases.
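The count-by-key comparison proposed above could be sketched as follows: tally rows per key on each side and report every key whose counts disagree, including keys present on only one side. The function name is hypothetical; the logic mirrors what the BQ count query would do.

```python
from collections import Counter

def diff_key_counts(left_keys, right_keys):
    """Compare per-key row counts on both sides; return a mapping of
    key -> (left_count, right_count) for keys whose counts differ.
    Hypothetical sketch of the proposed count-by-key diff option."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    return {
        k: (lc.get(k, 0), rc.get(k, 0))
        for k in set(lc) | set(rc)
        if lc.get(k, 0) != rc.get(k, 0)
    }

# "a" appears twice on the left but once on the right, and vice versa for "b"
mismatches = diff_key_counts(["a", "a", "b"], ["a", "b", "b"])
# mismatches == {"a": (2, 1), "b": (1, 2)}
```

An empty result would mean the two datasets agree at the count-by-key level, which is the coarse match this option is meant to check.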