Questions about Fault-Tolerance and Exactly-Once Delivery mechanisms #58
So it's absolutely not enough for a stream process where file size and content are not deterministic. Far from exactly-once!
After a first look at the code: there is a huge amount of really inelegant concurrent code (locks and shared data structures to track "current work") and wrong concurrent code (values shared between threads that are neither atomic nor even volatile).
Conclusion: this connector is really weakly written and tested.
To be exactly-once, the flush of events has to be deterministic (if the connector restarts, it always recreates the same files, with the same number of events from the same offsets, e.g. msg 5144 to 5155). But with a non-deterministic flush-time rule it's impossible to be exactly-once -> Line 599 in 1f793ad
And even this logic is buggy -> the time-based flush should be based on the timestamps of the events (see the Confluent docs -> https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#exactly-once-delivery-on-top-of-eventual-consistency). With that done, and if Snowpipe really never re-ingests the same file (based on file name and MD5), then it will be exactly-once.
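The deterministic-flush point above can be sketched as follows. This is a minimal illustration, not the connector's actual code: the class, the fixed `recordsPerFile` flush rule, and the `topic/partition/start_end` naming scheme are all assumptions. The idea is that when file boundaries are a pure function of offsets, a restarted task recreates exactly the same files, so a downstream dedup on file name works.

```java
// Sketch: deterministic file boundaries as a pure function of the offset.
// A restarted task computes identical names for identical offsets, so a
// file-name + MD5 dedup downstream can reject the duplicate upload.
public class DeterministicFileNamer {
    private final int recordsPerFile; // hypothetical fixed flush size

    public DeterministicFileNamer(int recordsPerFile) {
        this.recordsPerFile = recordsPerFile;
    }

    // First offset of the file that contains the given offset.
    public long fileStartOffset(long offset) {
        return (offset / recordsPerFile) * recordsPerFile;
    }

    // Deterministic name: topic/partition/startOffset_endOffset.
    public String fileNameFor(String topic, int partition, long offset) {
        long start = fileStartOffset(offset);
        long end = start + recordsPerFile - 1;
        return topic + "/" + partition + "/" + start + "_" + end;
    }
}
```

A time-based flush can stay deterministic the same way, as long as the boundary is computed from the event timestamps carried in the records (which replay identically) rather than from the wall clock of the task.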
The connector does not wait for a Snowpipe confirmation that the files were ingested before internally updating the offsets to commit at the next flush call from the Kafka Connect framework. -> Line 700 in 1f793ad
So if anything happens to a file before Snowpipe ingests it, or if the file is impossible to ingest, the data could be lost (because not every topic has unlimited retention). From my understanding the connector is not even at-least-once.
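The fix the comment asks for can be sketched like this. It is an assumed design, not the connector's code (class and method names are hypothetical): each flushed file is tracked with its end offset, and the offset handed back to Kafka Connect only advances once Snowpipe has confirmed ingestion of every file up to that offset.

```java
import java.util.TreeMap;

// Sketch: only report an offset as committable once every flushed file
// covering offsets up to it has a confirmed Snowpipe ingestion.
public class CommitTracker {
    // end offset of each flushed file -> ingestion confirmed?
    private final TreeMap<Long, Boolean> flushedFiles = new TreeMap<>();
    private long committable = -1L; // last offset safe to commit

    public synchronized void fileFlushed(long endOffset) {
        flushedFiles.put(endOffset, false);
    }

    public synchronized void ingestConfirmed(long endOffset) {
        flushedFiles.put(endOffset, true);
        // Advance past every leading confirmed file, in offset order, so a
        // gap (an unconfirmed earlier file) blocks the commit.
        while (!flushedFiles.isEmpty() && flushedFiles.firstEntry().getValue()) {
            committable = flushedFiles.pollFirstEntry().getKey();
        }
    }

    // Offset to report back to the framework; -1 means nothing is safe yet.
    public synchronized long committableOffset() {
        return committable;
    }
}
```

With this in place, a file that Snowpipe never manages to ingest keeps the commit offset pinned, so the records stay uncommitted in Kafka and can be re-delivered, which is what at-least-once requires.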
Hi @raphaelauv, thank you for the feedback and for looking into the code in detail.
CC: @sfc-gh-zli
Hello @sfc-gh-japatel
About 1 -> the documentation needs corrections.
About 2 -> the connector is not exactly-once; what you do to LIMIT the duplication is of no interest.
About 3 -> it's a Snowflake connector, not an S3 connector. I expect my data to be available in a Snowflake table. The connector should not commit offsets if some data is not ingested, because having a corrupted file in a bucket is not what I expect. I already see 2 cases that make the connector not at-least-once:
Conclusion: by default the connector should be at-least-once rather than a best-effort connector, or say so in your documentation.
Snowpipe does not guarantee exactly-once. Snowpipe Streaming does guarantee exactly-once. Closing this issue out due to age - please reopen if further discussion is needed.
@sfc-gh-japatel it's about
The current docs for this connector state
I have several questions about this:
Later in the same section you call attention to