
DO NOT MERGE: Make binary string encoding configurable, add support for base64 #770

Open

sfc-gh-kgaputis wants to merge 20 commits into base: master
Conversation

@sfc-gh-kgaputis (Author)

Addresses an issue where the streaming client receives base64-encoded strings for binary columns.

Adds a new parameter, BINARY_STRING_ENCODING, which supports the values hex and base64. The default value is hex.
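For illustration, a client-wide parameter like this would be supplied through the client's parameter overrides. A minimal sketch, assuming the key string "BINARY_STRING_ENCODING" from the description above and the factory builder's setParameterOverrides hook; connection properties are elided:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

public class Base64ClientExample {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Account, user, private key, etc. elided; fill in your connection settings.

    // Hypothetical: opt the whole client into base64-encoded binary strings.
    Map<String, Object> overrides = new HashMap<>();
    overrides.put("BINARY_STRING_ENCODING", "base64");

    try (SnowflakeStreamingIngestClient client =
        SnowflakeStreamingIngestClientFactory.builder("MY_CLIENT")
            .setProperties(props)
            .setParameterOverrides(overrides)
            .build()) {
      // Binary column values passed as strings would now be parsed as base64.
    }
  }
}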

@sfc-gh-tzhang (Contributor)

@sfc-gh-lsembera what do you think about supporting base64 as well?

Also, could we do it automatically instead of adding a parameter?

@sfc-gh-lsembera (Contributor) left a comment

> @sfc-gh-lsembera what do you think about supporting base64 as well?

I am fine with supporting base64 as well. I just don't like the removal of surrounding quotes: we don't do it for any other data type, and I think we shouldn't do it for base64 either.

> Also, could we do it automatically instead of adding a parameter?

We cannot do it automatically. Without additional information, we cannot tell if a string is base64- or hex-encoded.
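To make the ambiguity concrete: the string "deadbeef" is valid under both encodings but decodes to different bytes. A minimal sketch (HexFormat requires Java 17+; any other hex decoder shows the same thing):

import java.util.Base64;
import java.util.HexFormat; // Java 17+

public class EncodingAmbiguity {
  public static void main(String[] args) {
    String s = "deadbeef";
    byte[] asHex = HexFormat.of().parseHex(s);       // 4 bytes: DE AD BE EF
    byte[] asBase64 = Base64.getDecoder().decode(s); // 6 entirely different bytes
    System.out.printf("hex: %d bytes, base64: %d bytes%n", asHex.length, asBase64.length);
  }
}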

if (binaryStringEncoding == Constants.BinaryStringEncoding.BASE64) {
  try {
    String stringInput = ((String) input).trim();
    // Remove double quotes if present
@sfc-gh-lsembera (Contributor) commented on the diff:

We should not strip surrounding quotes. We don't do it for hex, and it doesn't work in Snowflake Worksheets either.

@@ -888,76 +890,76 @@ public void testValidateAndParseBinary() throws DecoderException {
     assertArrayEquals(
         "honk".getBytes(StandardCharsets.UTF_8),
         validateAndParseBinary(
-            "COL", "honk".getBytes(StandardCharsets.UTF_8), Optional.empty(), 0));
+            "COL", "honk".getBytes(StandardCharsets.UTF_8), Optional.empty(), 0, Constants.BinaryStringEncoding.HEX));
@sfc-gh-lsembera (Contributor) commented on the diff:

We need the same test coverage for base64 as we have for hex.
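For illustration, a base64 counterpart to the hex test above might look like the following sketch; the parameter order follows the diff above, and java.util.Base64 produces the encoded input:

// Hypothetical base64 twin of the hex test shown in the diff above.
String base64Input =
    Base64.getEncoder().encodeToString("honk".getBytes(StandardCharsets.UTF_8)); // "aG9uaw=="
assertArrayEquals(
    "honk".getBytes(StandardCharsets.UTF_8),
    validateAndParseBinary(
        "COL", base64Input, Optional.empty(), 0, Constants.BinaryStringEncoding.BASE64));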

-import java.util.Map;
-import java.util.Optional;
-import java.util.Set;
+import java.util.*;
@sfc-gh-lsembera (Contributor) commented on the diff:

No wildcard imports.

@@ -64,6 +67,9 @@ public class ParameterProvider {
   public static final Constants.BdecParquetCompression BDEC_PARQUET_COMPRESSION_ALGORITHM_DEFAULT =
       Constants.BdecParquetCompression.GZIP;

+  public static final Constants.BinaryStringEncoding BINARY_STRING_ENCODING_DEFAULT =
+      Constants.BinaryStringEncoding.HEX;
@sfc-gh-lsembera (Contributor) commented on the diff:

Please add tests to ParameterProviderTest.
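A possible shape for such a test, under the assumption that ParameterProvider takes a map of parameter overrides plus client Properties and exposes a getBinaryStringEncoding() getter like its other parameters (both assumptions, not confirmed by this diff):

@Test
public void testBinaryStringEncoding() {
  // No override supplied: expect the HEX default added in this PR.
  ParameterProvider defaults = new ParameterProvider(new HashMap<>(), null);
  Assert.assertEquals(
      Constants.BinaryStringEncoding.HEX, defaults.getBinaryStringEncoding());

  // Explicit override: the key string follows the PR description.
  Map<String, Object> overrides = new HashMap<>();
  overrides.put("BINARY_STRING_ENCODING", "base64");
  ParameterProvider overridden = new ParameterProvider(overrides, null);
  Assert.assertEquals(
      Constants.BinaryStringEncoding.BASE64, overridden.getBinaryStringEncoding());
}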

@sfc-gh-tzhang (Contributor)

> Also, could we do it automatically instead of adding a parameter?
>
> We cannot do it automatically. Without additional information, we cannot tell if a string is base64- or hex-encoded.

Thanks for confirming. In that case, I think this should be a channel-level configuration instead of applying to the whole client; take a look at what we did for setDefaultTimezone.
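For context, setDefaultTimezone is a per-channel option on OpenChannelRequest; the suggestion is to expose the binary encoding the same way. A sketch, with the binary-encoding setter shown as a hypothetical analogue (it is not part of this PR as written):

import java.time.ZoneOffset;
import net.snowflake.ingest.streaming.OpenChannelRequest;

OpenChannelRequest request =
    OpenChannelRequest.builder("MY_CHANNEL")
        .setDBName("MY_DB")
        .setSchemaName("MY_SCHEMA")
        .setTableName("MY_TABLE")
        .setDefaultTimezone(ZoneOffset.UTC) // existing per-channel setting
        // Hypothetical analogue for the binary encoding, per this suggestion:
        // .setBinaryStringEncoding(Constants.BinaryStringEncoding.BASE64)
        .build();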

@sfc-gh-lsembera (Contributor)

> Thanks for confirming. In that case, I think this should be a channel-level configuration instead of applying to the whole client; take a look at what we did for setDefaultTimezone.

The problem with having the configuration per-channel is that it is not as easily configurable from KC (the Kafka connector).

@sfc-gh-kgaputis (Author)

@sfc-gh-tzhang @sfc-gh-lsembera, based on the requirement to work with the Kafka connector, I think this config should apply to the whole client.

Also, I suspect that most streaming use cases with binary embedded in Avro messages will be either 100% hex-encoded or 100% base64-encoded, depending on the tooling used in the customer's environment, so the need to support mixed encodings is less likely.

sfc-gh-japatel and others added 18 commits September 12, 2024 05:03
SNOW-1357377 Add request Id in all streaming ingest APIs (#759)

* SNOW-1357377 Add request Id in all streaming ingest APIs

* Fix tests and log requestId

* Format

* Remove request_id and no star imports
SNOW-1457523: Fix CVE for snowflake-ingest-java io.airlift:aircompressor 0.21 (snowflakedb#774)

SNOW-1457523: Fix CVE for snowflake-ingest-java io.airlift:aircompressor 0.21
… (snowflakedb#782)

* Add Microbenchmark for Insert rows

* Various performance improvements in the insertRows path

* Fix tests and format

* Make flush threads daemon to allow the JVM to exit if these threads are active

* Remove commented out line

* Make memory info provider a singleton because we don't want multiple instances of it

* Address review comments

* Lower benchmark row count to make gh actions happy

* Review comments

* Mark these deps as test-only and check memory every 100ms
SNOW-1457523: Upgrade Parquet dependencies version to fix CVE (snowflakedb#779)

SNOW-1457523: Upgrade Parquet dependencies version to fix CVE
…exception from JDBC (snowflakedb#780)

We have seen many complaints from customers about false-positive ERROR logs from JDBC, since we refresh the token only after it expires. This change switches to actively refreshing the token before it expires to avoid the ERROR log.

Refresh stage metadata automatically in the background to avoid exceptions encountered during file upload in uploadWithoutConnection (snowflakedb#753)

Testing is covered by the existing long-running end-to-end test
…e client was created with (snowflakedb#794)

* Reject new stage metadata if the deployment id does not match what the client was created with

* Review comments
* SNOW-1545879 Reduce the max channel/chunk sizes

* Fix flaky test

* Trigger build
Co-authored-by: Hitesh Madan <hitesh.madan@snowflake.com>