Identity changes unexpectedly #1160
Comments
This is because we use Location (i.e., the line/column and file path) when creating the check identity. It seems that using the line number is excessively specific and we should skip it, but what is the current state of the identity calculation discussions, @tombaeyens @vijaykiran? I know there were some loose ends on the topic.
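To illustrate the point about line numbers, here is a minimal sketch of an identity that hashes only the file path, the table, and the check text. It is not the actual Soda implementation: `Location`, `check_identity`, the `orders` table, the thresholds, and the 8-character truncation are all assumptions made up for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    # Hypothetical stand-in for the location attached to a parsed check.
    file_path: str
    line: int
    col: int


def check_identity(location: Location, table: str, check_text: str) -> str:
    """Hash only the stable parts of a check; line and col are ignored."""
    hasher = hashlib.sha256()
    for part in (location.file_path, table, check_text):
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\0")  # separator so adjacent parts cannot run together
    return hasher.hexdigest()[:8]


# The same check moved from line 3 to line 9 keeps its identity.
before = check_identity(Location("checks.yml", 3, 5), "orders", "min(usd) > 0")
after = check_identity(Location("checks.yml", 9, 5), "orders", "min(usd) > 0")
print(before == after)
```

With line and column left out of the hash, editing one check no longer shifts the identity of the checks below it in the same file.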
@m1n0 The solution is indeed to remove the location line/col and keep only the file path. But then we should ideally also produce an error log if 2 checks in the same file produce the same identity. In fact, that should hold for all checks in a scan: as Soda Cloud triggers generation of identities, verify that no duplicate identities are created within 1 scan. If there are duplicates, print an error message pointing to all checks / locations that result in the same check identity.
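The duplicate verification described above could look roughly like the sketch below. The function names, the `(identity, location)` pair shape, and the message wording are hypothetical and not taken from the Soda code base.

```python
from collections import defaultdict


def find_duplicate_identities(checks):
    """Group locations by identity and keep only identities seen more than once.

    `checks` is an iterable of (identity, location) string pairs; this shape
    is an assumption for the example.
    """
    locations_by_identity = defaultdict(list)
    for identity, location in checks:
        locations_by_identity[identity].append(location)
    return {
        identity: locations
        for identity, locations in locations_by_identity.items()
        if len(locations) > 1
    }


def duplicate_identity_errors(checks):
    """Build one error message per duplicated identity, listing every location."""
    return [
        f"Duplicate check identity {identity!r} at: {', '.join(locations)}"
        for identity, locations in find_duplicate_identities(checks).items()
    ]


checks = [("aaa", "checks.yml:3"), ("bbb", "checks.yml:7"), ("aaa", "other.yml:2")]
for message in duplicate_identity_errors(checks):
    print(message)
```

The messages are collected rather than raised, so the caller can log them and decide separately whether the scan should continue.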
Is it possible to have two checks with the same identity? Shouldn't that fail even earlier, during the parsing stage?
Yes: when there are 2 checks with the same identity, it makes sense to have it fail. I do want to raise awareness of the background story of identity, as it is a tricky matter. It's not just a mechanism like programmatic identity in Python or Java. This identity mechanism was created to enable seamless syncing between checks in source code and the check entities on Soda Cloud that build up history over time. So the goal is that if users make changes to the same check, we should ideally be able to figure that out and keep the link with the previously existing check. So identity has the ability to remain the same, even if some detail of the check changes. Though in practice we even had to include the threshold. Anyway, that's just to sketch the background around identity.
Thanks for writing up the background. My suggestion then is that after parsing the checks and creating the identity for each of them, we should make sure that there are no duplicate identities before we proceed with execution. Would you agree?
As agreed before, we remove line and col information from the identity.

Important to note is that duplicate check identities only pose a problem for Soda Cloud sync, so duplicate identities should only be handled when pushing the scan results to Soda Cloud. All the rest should keep working as is. When duplicate identities occur, we should ensure that only 1 of the check results (probably the first is most convenient) is sent to Soda Cloud. All subsequent check results with the same check identity should not be sent to Soda Cloud. When handling duplicate check identities, we should proceed with as much of the scan as possible and not apply the fail-fast principle.

Impl note: it's probably simplest to add this check inside the Soda Cloud method at the end, when sending the scan results, as that's where the identities are generated.

@vijaykiran Review what happens if a for each generates checks similar to ones that are already directly in the check file, and update the identity generation if needed. If that logic is changed, post a note here and tag me to make me aware.
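As a rough sketch of the "send only the first result per identity" rule, assuming each check result is a dict with an `identity` key (an assumption for illustration; `split_results_for_cloud` is a hypothetical helper, not an existing Soda method):

```python
def split_results_for_cloud(check_results):
    """Keep the first result per identity; set aside the later duplicates.

    Nothing is raised, so the rest of the scan is unaffected (no fail fast).
    """
    seen = set()
    kept, skipped = [], []
    for result in check_results:
        identity = result["identity"]
        if identity in seen:
            skipped.append(result)  # duplicate identity: not sent to Soda Cloud
        else:
            seen.add(identity)
            kept.append(result)
    return kept, skipped


results = [
    {"identity": "aaa", "outcome": "pass"},
    {"identity": "bbb", "outcome": "fail"},
    {"identity": "aaa", "outcome": "fail"},
]
kept, skipped = split_results_for_cloud(results)
print(len(kept), len(skipped))
```

Returning the skipped results alongside the kept ones leaves room to log which checks were dropped from the Soda Cloud payload.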
@tombaeyens this issue and your last comment above now seem to be solved.
One scenario is not covered yet: checking that for each produces a unique identity for each table, and that those identities are also unique compared to the same check created specifically for one of the tables covered by the for each. A PR for that is coming soon.
* Test for 'for each' check identity. Fix #1160 Co-authored-by: Vijay Kiran <mail@vijaykiran.com>
Given the `checks.yml` configuration, execution of `soda scan` will create 3 new checks.

After changing the threshold of the first check and leaving the other 2 checks unchanged in `checks.yml`, all of a sudden `min(usd)` and `max(usd)` get a new identity, resulting in the creation of new checks in `soda-cloud`, and you end up with 6 checks instead of 4.