Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identity changes unexpectedly #1160

Closed
jzalucki opened this issue Mar 23, 2022 · 8 comments · Fixed by #1317
Closed

Identity changes unexpectedly #1160

jzalucki opened this issue Mar 23, 2022 · 8 comments · Fixed by #1317
Assignees
Labels
bug Something isn't working

Comments

@jzalucki
Copy link

jzalucki commented Mar 23, 2022

Given the checks.yml configuration

checks for bitcoin:
  - row_count between 0 and 1000000
  - min(usd):
      warn: when < 20000
      fail: when < 10000
  - max(usd):
      warn: when > 70000
      fail: when > 100000

execution of soda scan will create 3 new checks.

After changing the threshold of the first check and leaving the other 2 checks unchanged in checks.yml:

checks for bitcoin:
  - row_count:
      warn: when between 20 and 100
  - min(usd):
      warn: when < 20000
      fail: when < 10000
  - max(usd):
      warn: when > 70000
      fail: when > 100000

all of a sudden min(usd) and max(usd) get new identity resulting in creation of new checks in soda-cloud and you end up with 6 checks instead of 4.

@jzalucki jzalucki changed the title Check identity wrongly calculated when changing from simple check to warning threshold Check identity calculated incorrectly when changing from simple check to warning threshold Mar 23, 2022
@jzalucki jzalucki added the bug Something isn't working label Mar 23, 2022
@vijaykiran vijaykiran added this to To do in Soda Core Roadmap Mar 28, 2022
@tombaeyens tombaeyens moved this from To do to Bugs in Soda Core Roadmap Mar 28, 2022
@tombaeyens tombaeyens changed the title Check identity calculated incorrectly when changing from simple check to warning threshold Identity changes unexpectedly Mar 28, 2022
@m1n0
Copy link
Contributor

m1n0 commented Mar 28, 2022

This is because we use Location (i.e the line/column and file path) when creating the check identity. It seems that using the line number is excessively specific and we should skip it, but what is the current state of the identity calculation discussions @tombaeyens @vijaykiran please? I know there were some loose ends on the topic.

@tombaeyens
Copy link
Contributor

@m1n0 The solution is indeed to remove the location line/col. Only keep the file path.

But then we should ideally also produce an error log if 2 checks in the same file produce the same identity. In fact, that should be for all checks in a scan:

As soda cloud triggers generation of identity, ensure verification that no duplicate identities are created within 1 scan. If that is the case, print an error message pointing to all checks / locations that result in same check identities.

@m1n0
Copy link
Contributor

m1n0 commented Mar 29, 2022

is it possible to have two checks with the same identity? Shouldnt that fail even earlier, during parsing stage?

@tombaeyens
Copy link
Contributor

Yes: When there are 2 checks with the same identity, it make sense to have it fail.

I do want to raise awareness for the background story of identity as it is a tricky matter. It's not just a mechanism like programmatic identity in python or java. This identity mechanism is created to try and create seamless syncing between checks in sourcecode and the check entities on Soda Cloud that build up history over time. So the goal is that if users make changes to the same check, that we should ideally be able to figure that out and keep the link with the previous existing check. So identity has the ability to remain the same, even if some detail of the check changes. Though in practice we even had to include the threshold. Anyways that just to sketch the background around identity.

@m1n0
Copy link
Contributor

m1n0 commented Apr 4, 2022

thanks for writing up the background, my suggestion then is that after parsing the checks and creating the identity for each of them we should make sure that there are no duplicate identities before we proceed with execution, would you agree?

@vijaykiran vijaykiran moved this from Bugs to Sprint Backlog in Soda Core Roadmap Apr 19, 2022
@vijaykiran vijaykiran moved this from Sprint Backlog to Bugs in Soda Core Roadmap Apr 19, 2022
@tombaeyens
Copy link
Contributor

As agreed before we remove line and col information from the identity.
This might lead to duplicate check identities.

Important to note is that duplicate check identities only pose a problem for Soda Cloud sync. And duplicate identities should only be handled when pushing the scan results to Soda Cloud. All the rest should keep working as is.

So when duplicate identities occur, we should ensure that only 1 (probably the first is most convenient) of the check results is sent to Soda Cloud. All subsequent check results with the same check identity should not be sent to Soda Cloud.

When handling duplicate check identities we should proceed as much of the scan as possible, and not apply the fail fast principle.

Impl note: it's probably the simplest to add this check inside the Soda Cloud method at the end when sending the scan results. As that's where the identities are generated.

@vijaykiran Review what happens if similar checks are generated by a for each as there are already directly in the check file. And potentially update the identity generation if needed. If that logic is changed. Post a note here and tag me to make me aware.

@tombaeyens tombaeyens assigned vijaykiran and unassigned m1n0 Apr 21, 2022
@tombaeyens tombaeyens moved this from Bugs to Sprint Backlog in Soda Core Roadmap Apr 21, 2022
@m1n0
Copy link
Contributor

m1n0 commented Apr 28, 2022

@tombaeyens this issue and your last comment above now seem to be solved.
In summary:

  • current identity implementation takes the whole check definition except the Name property and creates a hash of that.
  • line number and file are not used in the hash at all
  • for each checks create their own identity with a special non-sodacl syntax, table name is included so uniqueness is guaranteed.
  • we have quite extensive test coverage for these scenarios but we are missing a test for for each construct. I will add one and then we should be able to close this issue

@m1n0 m1n0 assigned m1n0 and unassigned vijaykiran Apr 28, 2022
@m1n0 m1n0 moved this from Sprint Backlog to In progress in Soda Core Roadmap Apr 28, 2022
@m1n0
Copy link
Contributor

m1n0 commented May 3, 2022

One scenario is not covered yet - checking for each unique identity between each table, but also unique compared to a same check created specifically for one of the tables covered by for each. PR coming for that soon.

m1n0 added a commit that referenced this issue May 4, 2022
@m1n0 m1n0 closed this as completed in #1317 May 4, 2022
Soda Core Roadmap automation moved this from In progress to Done May 4, 2022
m1n0 added a commit that referenced this issue May 4, 2022
* Test for 'for each' check identity.

Fix #1160

Co-authored-by: Vijay Kiran <mail@vijaykiran.com>
linderttobias pushed a commit to linderttobias/soda-core that referenced this issue Sep 14, 2022
* Test for 'for each' check identity.

Fix sodadata#1160

Co-authored-by: Vijay Kiran <mail@vijaykiran.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

4 participants