chore: categorise semi structured datatype for databricks during missing datatype #4771
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4771      +/-   ##
==========================================
+ Coverage   74.61%   74.67%   +0.06%
==========================================
  Files         414      414
  Lines       48654    48665      +11
==========================================
+ Hits        36304    36343      +39
+ Misses       9966     9947      -19
+ Partials     2384     2375       -9

☔ View full report in Codecov by Sentry.
Force-pushed fbeb114 to 362ac77
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- warehouse/integrations/deltalake/deltalake.go (2 hunks)
Additional comments not posted (1)
warehouse/integrations/deltalake/deltalake.go (1)
363-372: Ensure proper handling of semi-structured data types.
The logic to categorize semi-structured data types (`ARRAY`, `MAP`, `STRUCT`) into uppercase for statistics tracking is implemented correctly. This aligns with the PR's objective to handle semi-structured data types effectively.
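As a rough sketch of what that categorization looks like (function and package names here are illustrative, not the actual deltalake.go code):

```go
package main

import (
	"fmt"
	"strings"
)

// categorizeDatatype is a hypothetical sketch of the behaviour the review
// describes: collapse a semi-structured Databricks type such as ARRAY<...>,
// MAP<...> or STRUCT<...> to its uppercase base type before emitting the
// missing-datatype stat, so the full nested schema never leaks into stat tags.
func categorizeDatatype(datatype string) string {
	upper := strings.ToUpper(datatype)
	for _, base := range []string{"ARRAY", "MAP", "STRUCT"} {
		if strings.HasPrefix(upper, base) {
			return base
		}
	}
	return datatype // non-semi-structured types pass through unchanged
}

func main() {
	fmt.Println(categorizeDatatype("struct<logmessage string>"))
	fmt.Println(categorizeDatatype("map<string,string>"))
	fmt.Println(categorizeDatatype("decimal(10,2)"))
}
```

The passthrough case matters: only the three semi-structured families are collapsed, so other unrecognized types keep their original name in the stat.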
Force-pushed 4155fa2 to b124ce4
* chore: update flaky test for error reporting aggregation test
* chore: use ElementsMatch func in require for assertion
* chore: update the logic of forming errors property
* chore: update for better readability

Co-authored-by: Sai Sankeerth <sanpj2292@github.com>
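The "logic of forming errors property" change can be sketched like this (types and data are illustrative; the actual code in enterprise/reporting/error_reporting.go iterates `reportGrpKeys` against `reportsCountMap`):

```go
package main

import "fmt"

// errDetail is an illustrative stand-in for the error-detail struct; the
// real types in error_reporting.go may differ.
type errDetail struct {
	Message string
	Count   int64
}

// buildErrs sketches the aggregation pattern: iterate over the ordered group
// keys and look each one up in the counts map, so the resulting slice has a
// deterministic order (ranging over the map directly would not).
func buildErrs(reportGrpKeys []string, reportsCountMap map[string]int64) []errDetail {
	errs := make([]errDetail, 0, len(reportGrpKeys))
	for _, k := range reportGrpKeys {
		if c, ok := reportsCountMap[k]; ok {
			errs = append(errs, errDetail{Message: k, Count: c})
		}
	}
	return errs
}

func main() {
	errs := buildErrs(
		[]string{"payload too large", "invalid write key"},
		map[string]int64{"payload too large": 3, "invalid write key": 1},
	)
	fmt.Println(errs)
}
```

Deterministic ordering is also what lets the test assert with `require.ElementsMatch` without flaking on map iteration order.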
Force-pushed 362ac77 to 2545612
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (12)
- enterprise/reporting/error_reporting.go (1 hunks)
- warehouse/integrations/azure-synapse/azure-synapse.go (2 hunks)
- warehouse/integrations/bigquery/bigquery.go (2 hunks)
- warehouse/integrations/clickhouse/clickhouse.go (1 hunks)
- warehouse/integrations/datalake/schema-repository/glue.go (2 hunks)
- warehouse/integrations/deltalake/deltalake.go (2 hunks)
- warehouse/integrations/deltalake/deltalake_test.go (19 hunks)
- warehouse/integrations/mssql/mssql.go (3 hunks)
- warehouse/integrations/postgres/postgres.go (2 hunks)
- warehouse/integrations/redshift/redshift.go (1 hunks)
- warehouse/integrations/snowflake/snowflake.go (1 hunks)
- warehouse/utils/utils.go (2 hunks)
Files not reviewed due to errors (1)
- warehouse/utils/utils.go (no review received)
Files skipped from review as they are similar to previous changes (1)
- warehouse/integrations/deltalake/deltalake.go
Additional comments not posted (12)
warehouse/integrations/datalake/schema-repository/glue.go (1)
106-106: LGTM! Ensure that the `stats.Default` parameter is consistently used in similar function calls across the codebase.

warehouse/integrations/postgres/postgres.go (1)
471-471: LGTM! Ensure that the `pg.stats` parameter is consistently used in similar function calls across the codebase.

Verification successful
The search results show that the `WHCounterStat` function is consistently used with a `stats` parameter across various files in the codebase. Here are the relevant occurrences:
- warehouse/integrations/deltalake/deltalake.go
- warehouse/integrations/snowflake/snowflake.go
- warehouse/integrations/postgres/postgres.go
- warehouse/integrations/mssql/mssql.go
- warehouse/integrations/redshift/redshift.go
- warehouse/integrations/datalake/schema-repository/glue.go
- warehouse/integrations/clickhouse/clickhouse.go
- warehouse/integrations/bigquery/bigquery.go
- warehouse/integrations/azure-synapse/azure-synapse.go

Each instance includes a `stats` parameter, confirming that the change is consistently applied.

Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify consistent use of `pg.stats` in `WHCounterStat` function calls.
# Test: Search for inconsistent usage. Expect: No occurrences of `WHCounterStat` without `pg.stats`.
rg --type go $'WHCounterStat\([^,]*,[^,]*,[^,]*\)'
```
Length of output: 51

Script:
```shell
#!/bin/bash
# Description: Search for all occurrences of `WHCounterStat` and verify the parameters used.
# Find all occurrences of `WHCounterStat` and display the surrounding lines for context.
rg --type go 'WHCounterStat' -A 2 -B 2
```
Length of output: 4612
enterprise/reporting/error_reporting.go (1)
543-553: LGTM! The manual iteration over `reportGrpKeys` and `reportsCountMap` to populate `errs` is implemented correctly.

warehouse/integrations/mssql/mssql.go (1)
940-940: Ensure proper handling of unrecognized data types.
The addition of `ms.stats` as an argument in the `WHCounterStat` function call is consistent with the changes across other warehouse integrations. This change enhances the tracking of missing data types, aligning with the PR's objectives.

warehouse/integrations/azure-synapse/azure-synapse.go (1)
932-932: Ensure proper handling of unrecognized data types.

Verification successful

The code snippet provided shows that unrecognized data types are being handled by adding them to the `unrecognizedSchema` map with the value `warehouseutils.MissingDatatype`. Additionally, a counter statistic is incremented to log the occurrence of a missing data type.
- The unrecognized data types are stored in a separate schema (`unrecognizedSchema`).
- A counter statistic is incremented to log the missing data type.

This indicates that unrecognized data types are indeed being handled appropriately by logging and tracking them.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that all data types are recognized and properly handled. # Test: Search for unrecognized data types in the schema. Expect: No occurrences of unrecognized data types. rg --type go $'warehouseutils.MissingDatatype' warehouse/integrations/azure-synapse/Length of output: 215
Script:
#!/bin/bash # Extract the relevant code snippet around the identified line to analyze the handling of unrecognized data types. sed -n '920,940p' warehouse/integrations/azure-synapse/azure-synapse.goLength of output: 793
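A minimal sketch of that handling pattern, with assumed names (`missingDatatype` stands in for `warehouseutils.MissingDatatype`; the real azure-synapse code differs in detail):

```go
package main

import "fmt"

// missingDatatype stands in for warehouseutils.MissingDatatype; the value is
// assumed for illustration.
const missingDatatype = "<missing_datatype>"

// splitSchema sketches the pattern the verification describes: columns whose
// type the integration does not recognize go into a separate unrecognized
// map under the missing-datatype marker, and a counter callback is invoked
// once per occurrence so the gap shows up in stats.
func splitSchema(raw map[string]string, known map[string]bool, incr func()) (schema, unrecognized map[string]string) {
	schema = map[string]string{}
	unrecognized = map[string]string{}
	for col, typ := range raw {
		if known[typ] {
			schema[col] = typ
			continue
		}
		unrecognized[col] = missingDatatype
		incr() // in the real code this increments the missing-datatype counter stat
	}
	return schema, unrecognized
}

func main() {
	missing := 0
	s, u := splitSchema(
		map[string]string{"id": "bigint", "payload": "geography"},
		map[string]bool{"bigint": true},
		func() { missing++ },
	)
	fmt.Println(s, u, missing)
}
```

Keeping unrecognized columns in a separate map (rather than dropping them) is what makes the later reporting and debugging possible.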
warehouse/integrations/bigquery/bigquery.go (2)
20-20: Added import for `stats` from `github.com/rudderlabs/rudder-go-kit/stats`.
This change is necessary to support the new functionality of tracking missing data types as described in the PR summary.

825-825: Updated the `WHCounterStat` function call to include `stats.Default` as an argument.
This change aligns with the PR's objective to enhance statistics tracking by including a default stats instance when reporting missing data types.

warehouse/integrations/redshift/redshift.go (1)
warehouse/integrations/redshift/redshift.go (1)
1197-1197: Ensure proper handling of missing data types.
The addition of `rs.stats` as an argument in the `WHCounterStat` function call is consistent with the PR's objective to enhance statistics tracking for missing data types. This change should help in better monitoring and reporting of issues related to unrecognized data types in Redshift schemas.

warehouse/integrations/clickhouse/clickhouse.go (1)
1022-1022: Ensure proper handling of unrecognized data types.
This change correctly logs missing data types as specified in the PR summary. It's a good practice to track these occurrences for debugging and data integrity checks.
warehouse/integrations/snowflake/snowflake.go (1)
1419-1419: Update the `WHCounterStat` function call to include `sf.stats`.
This change aligns with the PR's objective to enhance statistics tracking by including the `stats` object in the `WHCounterStat` function calls. It correctly implements the new signature of `WHCounterStat`, which now requires a `stats.Stats` parameter.

warehouse/integrations/deltalake/deltalake_test.go (2)
17-17: Added import for `memstats` from `rudder-go-kit`.
This import is necessary for the new statistics tracking functionality as per the PR description.

1222-1236: The `cleanup` function efficiently handles schema deletion with retries.
This is a robust approach to ensure that resources are cleaned up even if transient errors occur during the cleanup process.
Force-pushed 2545612 to b810d2b
Force-pushed b810d2b to 9426bf2
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (13)
- enterprise/reporting/error_reporting.go (1 hunks)
- warehouse/integrations/azure-synapse/azure-synapse.go (2 hunks)
- warehouse/integrations/bigquery/bigquery.go (2 hunks)
- warehouse/integrations/clickhouse/clickhouse.go (1 hunks)
- warehouse/integrations/datalake/datalake.go (1 hunks)
- warehouse/integrations/datalake/schema-repository/glue.go (2 hunks)
- warehouse/integrations/deltalake/deltalake.go (2 hunks)
- warehouse/integrations/deltalake/deltalake_test.go (19 hunks)
- warehouse/integrations/mssql/mssql.go (4 hunks)
- warehouse/integrations/postgres/postgres.go (2 hunks)
- warehouse/integrations/redshift/redshift.go (1 hunks)
- warehouse/integrations/snowflake/snowflake.go (1 hunks)
- warehouse/utils/utils.go (2 hunks)
Additional comments not posted (18)
warehouse/integrations/datalake/datalake.go (1)
16-16: The addition of the `stats` import is appropriate for enhanced statistics tracking.

warehouse/integrations/datalake/schema-repository/glue.go (1)
106-106: The update to include `stats.Default` in the `WHCounterStat` function call aligns with the PR's objective to standardize statistics tracking across the platform.

warehouse/integrations/postgres/postgres.go (1)
471-471: The update to include `pg.stats` in the `WHCounterStat` function call aligns with the PR's objective to standardize statistics tracking across the platform.

warehouse/utils/utils.go (1)
Line range hint 744-755: The modification to include a `stats.Stats` parameter in the `WHCounterStat` function enhances flexibility and configurability for statistics tracking.

Verification successful

The modification to include a `stats.Stats` parameter in the `WHCounterStat` function has been consistently applied across various warehouse integration files:
- warehouse/integrations/snowflake/snowflake.go
- warehouse/integrations/postgres/postgres.go
- warehouse/integrations/deltalake/deltalake.go
- warehouse/integrations/mssql/mssql.go
- warehouse/integrations/redshift/redshift.go
- warehouse/integrations/clickhouse/clickhouse.go
- warehouse/integrations/bigquery/bigquery.go
- warehouse/integrations/datalake/schema-repository/glue.go
- warehouse/integrations/azure-synapse/azure-synapse.go

Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify the integration of the new `WHCounterStat` function across different warehouse integrations.
# Test: Search for the function usage across warehouse integrations. Expect: Consistent usage of the new signature.
rg --type go $'WHCounterStat'
```
Length of output: 2030
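The shape of that signature change can be sketched as follows; the interfaces are simplified stand-ins for the rudder-go-kit `stats` types, not their real definitions:

```go
package main

import "fmt"

// Counter and Stats are simplified stand-ins for the rudder-go-kit stats
// interfaces; the real signatures differ.
type Counter interface{ Increment() }

type Stats interface {
	NewTaggedStat(name string, tags map[string]string) Counter
}

// WHCounterStat sketches the signature change the review describes: the
// helper now takes the stats handle as an explicit first argument instead of
// reaching for a package-level default, so production callers pass their own
// instance (pg.stats, ms.stats, ...) and tests can pass an in-memory one.
func WHCounterStat(s Stats, name string, tags map[string]string) Counter {
	return s.NewTaggedStat(name, tags)
}

// memStats is a tiny in-memory implementation for demonstration.
type memStats struct{ counts map[string]int }
type memCounter struct {
	m   *memStats
	key string
}

func (c memCounter) Increment() { c.m.counts[c.key]++ }

func (m *memStats) NewTaggedStat(name string, tags map[string]string) Counter {
	return memCounter{m: m, key: name}
}

func main() {
	m := &memStats{counts: map[string]int{}}
	WHCounterStat(m, "missing_datatype", map[string]string{"datatype": "STRUCT"}).Increment()
	fmt.Println(m.counts["missing_datatype"])
}
```

Passing the handle explicitly is what lets the deltalake tests swap in `memstats` instead of the global default.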
warehouse/integrations/mssql/mssql.go (1)
940-940: Good use of the updated `WHCounterStat` function to track missing data types.
This change aligns with the PR's objective to enhance error reporting by categorizing semi-structured data types and tracking them appropriately.
warehouse/integrations/azure-synapse/azure-synapse.go (1)
932-932: The update to include the `stats` parameter in the `WHCounterStat` function call is correctly implemented and aligns with the new function signature. Good job ensuring consistency across the codebase.

warehouse/integrations/bigquery/bigquery.go (1)
825-825: The addition of `stats.Default` to the `WHCounterStat` function call aligns with the broader updates across the codebase to standardize statistics tracking. This ensures consistent usage of the `stats` parameter and likely enhances the configurability and default handling of statistics.

warehouse/integrations/redshift/redshift.go (1)
1197-1197: The update to include `stats` in the `WHCounterStat` function call aligns with the PR's objectives to enhance statistics tracking. Ensure that the `stats` object is properly initialized and integrated across all uses in the system.

Verification successful

The verification confirms that the `stats` parameter is properly integrated in all `WHCounterStat` function calls across the system:
- warehouse/integrations/snowflake/snowflake.go
- warehouse/integrations/postgres/postgres.go
- warehouse/integrations/mssql/mssql.go
- warehouse/integrations/redshift/redshift.go
- warehouse/integrations/datalake/schema-repository/glue.go
- warehouse/integrations/deltalake/deltalake.go
- warehouse/integrations/clickhouse/clickhouse.go
- warehouse/integrations/bigquery/bigquery.go
- warehouse/integrations/azure-synapse/azure-synapse.go

Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify the integration of the `stats` parameter in all `WHCounterStat` function calls across the system.
# Test: Search for the function usage. Expect: All occurrences should include the `stats` parameter.
rg --type go $'WHCounterStat\('
```
Length of output: 2034
warehouse/integrations/clickhouse/clickhouse.go (1)
1022-1022: The update to include `ch.stats` in the `WHCounterStat` function call aligns with the broader initiative to standardize statistics tracking across the platform.

warehouse/integrations/deltalake/deltalake_test.go (9)
1222-1236: The `dropSchema` function implements robust error handling and retry logic, which is crucial for ensuring clean test environments. Good job on this implementation.

1120-1145: The test case for schema creation when it does not exist is well-structured and effectively validates the functionality of fetching schemas. It's good to see comprehensive checks before and after the operation.

1147-1218: The test case for handling missing datatypes is comprehensive and uses mocks effectively to simulate system behavior. This is crucial for ensuring robust error handling and reporting in production environments.

Line range hint 929-976: The test case for loading data when the schema does not exist is well-implemented, covering crucial error handling paths and ensuring that the system behaves as expected in this failure scenario.

Line range hint 798-844: The test case for handling mismatches in the number of columns during data loading is well-crafted, ensuring that the system can handle and report schema mismatches effectively.

Line range hint 842-888: The test case for handling mismatches in schema during data loading is thorough and ensures that the system can correctly identify and handle schema discrepancies.

Line range hint 886-931: The test case for handling discards during data loading is well-implemented, ensuring that discarded records are handled appropriately and reported correctly.

Line range hint 974-1037: The test case for partition pruning is well-structured and effectively tests the system's ability to handle partition pruning, which is crucial for performance optimization in large datasets.

1085-1220: The test case for fetching schema is comprehensive and effectively tests the system's ability to fetch and recognize schemas correctly, ensuring robust functionality in production environments.
Force-pushed 037a79c to 1e342f9
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (13)
- enterprise/reporting/error_reporting.go (1 hunks)
- warehouse/integrations/azure-synapse/azure-synapse.go (2 hunks)
- warehouse/integrations/bigquery/bigquery.go (2 hunks)
- warehouse/integrations/clickhouse/clickhouse.go (1 hunks)
- warehouse/integrations/datalake/datalake.go (1 hunks)
- warehouse/integrations/datalake/schema-repository/glue.go (2 hunks)
- warehouse/integrations/deltalake/deltalake.go (3 hunks)
- warehouse/integrations/deltalake/deltalake_test.go (19 hunks)
- warehouse/integrations/mssql/mssql.go (4 hunks)
- warehouse/integrations/postgres/postgres.go (2 hunks)
- warehouse/integrations/redshift/redshift.go (1 hunks)
- warehouse/integrations/snowflake/snowflake.go (1 hunks)
- warehouse/utils/utils.go (2 hunks)
Files skipped from review as they are similar to previous changes (11)
- enterprise/reporting/error_reporting.go
- warehouse/integrations/azure-synapse/azure-synapse.go
- warehouse/integrations/bigquery/bigquery.go
- warehouse/integrations/clickhouse/clickhouse.go
- warehouse/integrations/datalake/schema-repository/glue.go
- warehouse/integrations/deltalake/deltalake.go
- warehouse/integrations/mssql/mssql.go
- warehouse/integrations/postgres/postgres.go
- warehouse/integrations/redshift/redshift.go
- warehouse/integrations/snowflake/snowflake.go
- warehouse/utils/utils.go
Additional comments not posted (4)
warehouse/integrations/datalake/datalake.go (1)
16-16: Ensure the newly imported `logger` is utilized effectively throughout the file.
This import enhances logging capabilities, which is crucial for debugging and monitoring in production environments. Make sure that the logging statements cover all critical operations and error handling paths to maximize the utility of this addition.
warehouse/integrations/deltalake/deltalake_test.go (3)
17-17: Added import for `github.com/rudderlabs/rudder-go-kit/stats/memstats`.
This import is necessary for the new functionality to track memory statistics as part of the testing framework.

29-29: Added multiple imports for backend configurations and test helpers.
These imports are essential for the expanded testing capabilities and integration with backend configurations.

1222-1236: Function `dropSchema` is well-implemented for schema cleanup.
This utility function is crucial for ensuring that schemas are properly cleaned up after tests, preventing potential conflicts or resource leaks in subsequent tests.
Force-pushed 1e342f9 to 4e38b50
Actionable comments posted: 3
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (13)
- enterprise/reporting/error_reporting.go (1 hunks)
- warehouse/integrations/azure-synapse/azure-synapse.go (2 hunks)
- warehouse/integrations/bigquery/bigquery.go (2 hunks)
- warehouse/integrations/clickhouse/clickhouse.go (1 hunks)
- warehouse/integrations/datalake/datalake.go (1 hunks)
- warehouse/integrations/datalake/schema-repository/glue.go (2 hunks)
- warehouse/integrations/deltalake/deltalake.go (3 hunks)
- warehouse/integrations/deltalake/deltalake_test.go (19 hunks)
- warehouse/integrations/mssql/mssql.go (4 hunks)
- warehouse/integrations/postgres/postgres.go (2 hunks)
- warehouse/integrations/redshift/redshift.go (1 hunks)
- warehouse/integrations/snowflake/snowflake.go (1 hunks)
- warehouse/utils/utils.go (2 hunks)
Files skipped from review as they are similar to previous changes (12)
- enterprise/reporting/error_reporting.go
- warehouse/integrations/azure-synapse/azure-synapse.go
- warehouse/integrations/bigquery/bigquery.go
- warehouse/integrations/clickhouse/clickhouse.go
- warehouse/integrations/datalake/datalake.go
- warehouse/integrations/datalake/schema-repository/glue.go
- warehouse/integrations/deltalake/deltalake.go
- warehouse/integrations/mssql/mssql.go
- warehouse/integrations/postgres/postgres.go
- warehouse/integrations/redshift/redshift.go
- warehouse/integrations/snowflake/snowflake.go
- warehouse/utils/utils.go
Additional comments not posted (1)
warehouse/integrations/deltalake/deltalake_test.go (1)
491-491: Standardize cleanup operations. [refactor suggestion]
The repeated use of `dropSchema` in the cleanup blocks across multiple test cases suggests a pattern that could be abstracted into a helper function or set up in a more centralized manner to reduce redundancy and improve maintainability.

```diff
- t.Cleanup(func() {
-     dropSchema(t, d.DB.DB, namespace)
- })
+ setupCleanup(t, d, namespace)
```

And define `setupCleanup` as:

```go
func setupCleanup(t *testing.T, d *deltalake.Deltalake, namespace string) {
	t.Cleanup(func() {
		dropSchema(t, d.DB.DB, namespace)
	})
}
```

Also applies to: 516-516, 564-564, 617-617, 668-668, 721-721, 773-773, 798-798, 842-842, 886-886, 929-929, 974-974, 1035-1035
```go
	time.Minute,
	time.Second,
)
dropSchema(t, db, namespace)
```
Ensure proper cleanup of resources in tests.
The `dropSchema` function is called in a cleanup block, which is good practice. However, consider adding error handling within the `dropSchema` function to log or handle potential failures during the cleanup process.
```go
	time.Minute,
	time.Second,
)
dropSchema(t, db, namespace)
```
Consistency in test cleanup.
Similar to the previous comment, ensure that the `dropSchema` function used in the cleanup process handles potential errors effectively. This consistency in error handling will make the tests more robust.
```go
func dropSchema(t *testing.T, db *sql.DB, namespace string) {
	t.Helper()

	require.Eventually(t,
		func() bool {
			_, err := db.Exec(fmt.Sprintf(`DROP SCHEMA %s CASCADE;`, namespace))
			if err != nil {
				t.Logf("error deleting schema %q: %v", namespace, err)
				return false
			}
			return true
		},
		time.Minute,
		time.Second,
	)
}
```
Enhance the `dropSchema` function with error handling.
The `dropSchema` function currently logs errors but does not return them, which could mask issues during test cleanup. Modify it to return errors so they can be handled or logged by the caller.
```diff
- func dropSchema(t *testing.T, db *sql.DB, namespace string) {
+ func dropSchema(t *testing.T, db *sql.DB, namespace string) error {
-	require.Eventually(t,
+	ok := assert.Eventually(t,
 		func() bool {
 			_, err := db.Exec(fmt.Sprintf(`DROP SCHEMA %s CASCADE;`, namespace))
 			if err != nil {
 				t.Logf("error deleting schema %q: %v", namespace, err)
 				return false
 			}
 			return true
 		},
 		time.Minute,
 		time.Second,
 	)
+	if !ok {
+		return fmt.Errorf("failed to drop schema %q", namespace)
+	}
+	return nil
 }
```

(`assert.Eventually` is used here because it returns a `bool`; `require.Eventually` fails the test and returns nothing.)
Committable suggestion was skipped due to low confidence.
Walkthrough

The recent updates span multiple files, primarily enhancing error reporting, statistics tracking, and schema handling in various data warehouse integrations. Key changes include modifications to function signatures, the addition of new methods, and updates to import statements. These enhancements aim to improve the robustness and accuracy of error and statistics handling across different warehouse integrations.

Sequence Diagram(s)

No sequence diagrams are necessary for these changes, as they primarily involve updates to function parameters, imports, and minor logic adjustments without introducing new features or significantly altering control flow.
Description
For semi-structured datatypes (`MAP`, `ARRAY`, `STRUCT`) we are getting the entire schema, e.g.:
`STRUCT<activity_type_name STRING, logmessage STRING, clouderrorreporting_googleapis_com_notification_trigger_error_ingestion_time STRING, notificationtype STRING, errorgroupid STRING>`
Linear Ticket
Security
Summary by CodeRabbit
New Features
Refactor
Tests