Accessing StreamingIngestResponseCode from the SFException #601

Open
joshtree41 opened this issue Oct 10, 2023 · 9 comments

@joshtree41

Hi team,

I think it might be useful if there were a way to access the StreamingIngestResponseCode from the SFException.

Currently this exposes the ErrorCode message - but this is a bit less useful, I think.

Some motivation behind this can be found in #547 (trying to diagnose and troubleshoot channel invalidations).

Let me know what you all think! I can provide more details if needed.

@sfc-gh-lsembera
Contributor

sfc-gh-lsembera commented Oct 10, 2023

Hi @joshtree41, thanks for creating the issue! We are now planning some error reporting improvements. Could you please describe your use case and why you need the value of StreamingIngestResponseCode?

@joshtree41
Author

Hey @sfc-gh-lsembera, sweet sounds good. And yeah of course, here's some details:

Currently in our pipelines - we'll use the following methods to insert data, check for validity, and/or check commit progress.

  • SnowflakeStreamingIngestChannel#close
  • SnowflakeStreamingIngestChannel#isValid
  • SnowflakeStreamingIngestChannel#getLatestCommittedOffsetToken
  • SnowflakeStreamingIngestChannel#insertRows

For any of these methods, we can get an SFException that suggests that the channel has been invalidated, or has failures, etc. For such cases we've built in a mechanism for re-opening the channel.

Oftentimes, in a warning or error log that is fired asynchronously and immediately precedes the caught error, there will be a message with the StreamingIngestResponseCode. This is really useful for us because depending on the code, we have to respond differently. We often get status codes 10 and 20, which don't really require any remediation and are often transient. Other times we might get the less frequent 21, which almost always suggests that some data are being lost and we have to rewind some of our message queues.

Currently we have to go into the job logs and check the status codes that were fired in the warnings, but ideally we'd be able to access that information in the SFException.

If we had access to the streaming response code, we could build out more actionable custom alerts and metrics into our pipelines.

@sfc-gh-lsembera
Contributor

@joshtree41 It is not safe to assume that no data can be lost when status codes like 10 are returned. The safe way is to reopen the channel, check the last committed offset token and continue ingesting from there. These server-side response codes, while useful for debugging, are too burdensome to work with and we don't want our users to have to interpret dozens of server-side status codes.
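The reopen-and-resume flow described above could be sketched roughly as follows. This is only an illustration: `OffsetChannel` is a hypothetical stand-in interface rather than the SDK's channel type, the `reopen` supplier stands for whatever code the pipeline uses to open the channel again, and the sketch assumes offset tokens are decimal row offsets, which is a convention of this example and not something the SDK requires.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the SDK channel; only the one call this sketch needs. */
interface OffsetChannel {
    /** Returns the last committed offset token, or null if nothing was committed yet. */
    String getLatestCommittedOffsetToken();
}

final class ResumeSketch {
    /**
     * Reopens the channel and returns the offset of the first row to send again.
     * Assumption: offset tokens are decimal row offsets chosen by the pipeline.
     */
    static long nextOffsetAfterReopen(Supplier<OffsetChannel> reopen) {
        // Never trust local state after an invalidation: ask the reopened channel.
        OffsetChannel fresh = reopen.get();
        String token = fresh.getLatestCommittedOffsetToken();
        // No token means nothing is known to be committed, so start from the beginning.
        return token == null ? 0L : Long.parseLong(token) + 1;
    }
}
```

A pipeline would then replay its source (for example a message queue) starting at the returned offset, regardless of which status code triggered the reopen.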

We are thinking about improving the error handling by categorising all errors into two groups:

  • Transient, which users can safely fix by reopening the channel, checking the last committed offset and continuing ingestion from there.
    • Examples would be error codes 10, 21 or 35
  • Non-transient errors or errors indicating something is wrong with the user flow
    • Examples would be:
      • wrong configuration
      • invalid table schema
      • status code 20, which indicates that the same channel is being used by some other streaming ingest client
        • It is not always safe to just reopen on status code 20, because if two clients were trying to ingest via the same channel, they would keep stealing the channel from each other, constantly receiving status code 20 and possibly livelocking themselves
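As a rough illustration of the two groups, the grouping could be expressed as a lookup like the one below. The names `ErrorCategory` and `ResponseCodeClassifier` are hypothetical and not part of the SDK; only the codes mentioned in the list above (10, 21, 35 and 20) are mapped, and the `UNCLASSIFIED` value is an assumption of this sketch for every code the list does not cover.

```java
import java.util.Set;

/** Hypothetical categories; UNCLASSIFIED is this sketch's fallback for unlisted codes. */
enum ErrorCategory { TRANSIENT, NON_TRANSIENT, UNCLASSIFIED }

final class ResponseCodeClassifier {
    // Codes named as transient examples in the proposal.
    private static final Set<Integer> TRANSIENT_CODES = Set.of(10, 21, 35);
    // Code 20: the channel is being used by another streaming ingest client.
    private static final Set<Integer> NON_TRANSIENT_CODES = Set.of(20);

    static ErrorCategory classify(int responseCode) {
        if (TRANSIENT_CODES.contains(responseCode)) {
            return ErrorCategory.TRANSIENT;
        }
        if (NON_TRANSIENT_CODES.contains(responseCode)) {
            return ErrorCategory.NON_TRANSIENT;
        }
        return ErrorCategory.UNCLASSIFIED;
    }

    /** Only transient errors are handled by reopening and resuming from the committed offset. */
    static boolean shouldReopenAndResume(int responseCode) {
        return classify(responseCode) == ErrorCategory.TRANSIENT;
    }
}
```

Keeping code 20 out of the automatic reopen path is what avoids the livelock scenario: a client that sees it would surface the error instead of taking the channel back.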

What do you think, would this approach solve your issues with the SDK error codes?

@joshtree41
Author

@sfc-gh-lsembera

> It is not safe to assume that no data can be lost when status codes like 10 are returned

Haha I should have known you'd respond with that! Not to worry, we have a replay mechanism on our end for that scenario. What I have noticed though is that invalidations w/ status code 10 have never resulted in lost data on our end. Our rewind has to happen externally from our pipelines currently, and we have some checks built in to see if there are any missing time windows of data. We've always been in the clear w/ status code 10. I'm not really sure why, could be happenstance.

QQ: What is status code 35? Don't see a description in the enum.

> These server-side response codes, while useful for debugging, are too burdensome to work with and we don't want our users to have to interpret dozens of server-side status codes.

IMO - I'd rather just have the code attached - having a categorization into Transient/Non-transient invalidations/errors would be really useful as well, but you might as well just throw the code in there too.

I don't think it would be at all too cumbersome to have a method on the error that returns the actual code, in addition to a method or mechanism for getting the categorization. Having more information is better I think, it could help with debugging issues (even transient issues).
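To make the request concrete, here is a minimal sketch of an exception that carries both pieces of information. `IngestFailure` and `AlertingSketch` are hypothetical names invented for this example; the SDK's SFException does not expose these methods today, and the shape shown is just one possible design.

```java
/** Hypothetical exception type; not part of the SDK. */
final class IngestFailure extends RuntimeException {
    private final int responseCode;
    private final boolean transientError;

    IngestFailure(String message, int responseCode, boolean transientError) {
        super(message);
        this.responseCode = responseCode;
        this.transientError = transientError;
    }

    /** The raw server-side response code, kept for debugging and metrics. */
    int getResponseCode() {
        return responseCode;
    }

    /** The coarse categorization that drives the recovery decision. */
    boolean isTransient() {
        return transientError;
    }
}

final class AlertingSketch {
    /** Builds an alert message from the category and the raw code. */
    static String describe(IngestFailure e) {
        String category = e.isTransient() ? "transient" : "non-transient";
        return category + " failure, response code " + e.getResponseCode();
    }
}
```

With something like this, recovery logic could branch on `isTransient()` alone, while alerts and metrics could still be tagged with the raw code.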

@joshtree41
Author

Hey @sfc-gh-lsembera - any update here? Want me to provide more info for you guys?

@sfc-gh-lsembera
Contributor

Hi @joshtree41, no need to provide any more info. I don't have an ETA on this yet.

@joshtree41
Author

Hey @sfc-gh-lsembera - any update here?

@sfc-gh-lsembera
Contributor

sfc-gh-lsembera commented Dec 19, 2023

Hi @joshtree41, this issue hasn't been prioritized yet. There is no decision yet on whether we are going to expose these internal codes.

@joshtree41
Author

Okay cool, thanks @sfc-gh-lsembera.


2 participants