-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(warehouse): glue partitions #2899
Conversation
Codecov ReportBase: 51.33% // Head: 51.73% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #2899 +/- ##
==========================================
+ Coverage 51.33% 51.73% +0.40%
==========================================
Files 321 321
Lines 50646 50791 +145
==========================================
+ Hits 25997 26279 +282
+ Misses 23079 22931 -148
- Partials 1570 1581 +11
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
if err != nil { | ||
return err | ||
} | ||
schema, _, _ := ls.FetchSchema(ls.warehouse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FetchSchema always returns nil, ignoring it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the err from FetchSchema
then? We can re-introduce it once error is possible.
I see the schema needs to satisfy an interface.
One possible way to work around this:
func (ls *LocalSchemaRepository) localFetchSchema() warehouseutils.SchemaT {
return ls.uploader.GetLocalSchema()
}
func (ls *LocalSchemaRepository) FetchSchema(_ warehouseutils.Warehouse) (warehouseutils.SchemaT, warehouseutils.SchemaT, error) {
s := ls.localFetchSchema()
return s, s, nil
}
func (ls *LocalSchemaRepository) CreateTable(tableName string, columnMap map[string]string) (err error) {
schema:= ls.localFetchSchema(ls.warehouse)
if err != nil { | ||
return err | ||
} | ||
schema, _, _ := ls.FetchSchema(ls.warehouse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FetchSchema always returns nil, ignoring it.
if err != nil { | ||
return err | ||
} | ||
schema, _, _ := ls.FetchSchema(ls.warehouse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FetchSchema always returns nil, ignoring it.
warehouse/integrations/datalake/schema-repository/schema_repository.go
Outdated
Show resolved
Hide resolved
warehouse/upload.go
Outdated
@@ -482,6 +484,10 @@ func (job *UploadJobT) run() (err error) { | |||
job.matchRowsInStagingAndLoadFiles() | |||
job.recordLoadFileGenerationTimeStat(startLoadFileID, endLoadFileID) | |||
|
|||
if err = job.RefreshPartitions(startLoadFileID, endLoadFileID); err != nil { | |||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we log the error? As an Warn
possibly?
warehouse/upload.go
Outdated
// This is best done every 100 files, since it's a batch request for updates in Glue | ||
partitionBatchSize := config.GetInt("Warehouse.refreshPartitionBatchSize", 99) | ||
for i := 0; i < len(loadFiles); i += partitionBatchSize { | ||
end := i + partitionBatchSize | ||
if end > len(loadFiles) { | ||
end = len(loadFiles) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to move the batching logic to the GlueSchemaRepository
or GlueClient ?
It seems destination-specific logic.
Also why GetInt("Warehouse.refreshPartitionBatchSize", 99)
is not a 100 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a limit set by the SDK for the batching to be 100. Since these changes were contributed from an open source user who kept it to 99.
5c8a8c9
uploadOutput, err := fm.Upload(context.TODO(), f, fmt.Sprintf("rudder-test-payload/s3-datalake/%s/dt=2006-01-02/", warehouseutils.RandHex())) | ||
require.NoError(t, err) | ||
|
||
err = g.RefreshPartitions(testTable, []warehouseutils.LoadFileT{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we checking the side effects of refresh partitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we have added only the round-trip flow for Glue. Since this requires an external service call to Glue. Should we mock the API call and tests this?
fileBatches := make([][]warehouseutils.LoadFileT, 0, len(files)/batchSize+1) | ||
|
||
for len(files) > 0 { | ||
cut := int(math.Min(float64(len(files)), float64(batchSize))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general would avoid using float64 for anything that requires precision: https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/#example-1-the-odometer-that-stopped
However, I can not think of an example where this would fail. So take this as a minor suggestion:
cut := int(math.Min(float64(len(files)), float64(batchSize))) | |
cut := batchSize | |
if len(files) > cut { | |
cut = len(files) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. @lvrach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description
Glue partitions
Reference: #2476
Notion Ticket
https://www.notion.so/rudderstacks/AWS-Glue-partitioning-b80301553aed40958df49c3a9beeba40
Security