
Parallel fetch of batches in Snowflake connector #4070

Merged — 3 commits merged into main on Feb 19, 2024

Conversation

esevastyanov (Contributor) commented Feb 16, 2024

Closes #4069

rill start --verbose --env connector.snowflake.parallel_fetch_limit=20 ./projects/snowflake

if err != nil {
return nil, err
}
fetchLimitPtr, exists := dsnConfig.Params["parallelFetchLimit"]
Contributor:

Is this a standard Snowflake property? If not, I think it would be better to set it as a separate connector property – i.e. configured using --var connector.snowflake.parallel_fetch_limit=N.

esevastyanov (Author):

No, it isn't a Snowflake property. Moved to an env var:

-env connector.snowflake.parallel_fetch_limit=20
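The env-var approach could be wired up roughly as below. This is a minimal sketch, not the actual Rill connector API: the map key, the function name `parallelFetchLimit`, the default of 10, and the validation rules are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// parallelFetchLimit reads a hypothetical connector variable map.
// The key name, default value, and validation are assumptions for
// illustration, not the connector's real configuration surface.
func parallelFetchLimit(vars map[string]string) (int, error) {
	const defaultLimit = 10 // assumed default
	s, ok := vars["parallel_fetch_limit"]
	if !ok || s == "" {
		return defaultLimit, nil
	}
	n, err := strconv.Atoi(s)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid parallel_fetch_limit %q", s)
	}
	return n, nil
}

func main() {
	n, err := parallelFetchLimit(map[string]string{"parallel_fetch_limit": "20"})
	fmt.Println(n, err)
}
```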


// Fetch batches async as it takes most of the time
var wg sync.WaitGroup
fetchResultChan := make(chan fetchResult, len(f.batches))
Contributor:

What is the max number of batches? Wondering if this might take a lot of memory. If the writing phase is fast, I think we should avoid a buffered channel here. See: https://github.com/uber-go/guide/blob/master/style.md#channel-size-is-one-or-none

esevastyanov (Author):

There is no way to control the number of batches or the number of records per batch in Snowflake. There might be ~1K batches for 100M rows.

Comment on lines 215 to 217
for _, batch := range f.batches {
records, err := batch.Fetch()
if err != nil {
return nil, err
wg.Add(1)
go func(b *sf.ArrowBatch) {
Contributor:

Same question – how many batches might there be? Starting 10s-100s of Goroutines is okay, but probably worth optimizing if looking at thousands.

Contributor:

I wonder if it might be simpler to use an errgroup with WithContext and SetLimit. And then doing the writing from a separate goroutine. It would avoid the semaphore, avoid potentially 100s of concurrent goroutines, and make error propagation / cancellation easier. Something like this:

grp, ctx := errgroup.WithContext(ctx)
grp.SetLimit(f.parallelFetchLimit)

go func() {
  for {
    select {
    case res, ok := <-resultChan:
      // Write...
    case <-ctx.Done():
      // Either finished or errored
      return
    }
  }
}()

for _, batch := range f.batches {
  grp.Go(...)
}

err := grp.Wait()
// done

esevastyanov (Author):

Converted to use errgroup. The writing also requires error propagation, so an errgroup is used for the writing too.

Comment on lines 219 to 224
defer sem.Release(1)
err := sem.Acquire(f.ctx, 1)
if err != nil {
fetchResultChan <- fetchResult{Records: nil, Batch: nil, Err: err}
return
}
Contributor:

I think Release should come after a successful Acquire (otherwise it might panic).

esevastyanov (Author):

Removed the use of a semaphore.

Comment on lines 242 to 245
case <-ctx.Done():
if ctx.Err() != nil {
return nil
}
Contributor:

The if check seems redundant since ctx.Done will never be sent without ctx.Err being non-nil. Also, I think we should return ctx.Err() since if the iterator's ctx is cancelled, it's expected that the ctx error is returned (this doesn't matter in case another goroutine returns an error, since errgroup will only return the first error to Wait()).

}

err = fetchGrp.Wait()
ctx.Done()
Contributor:

This statement doesn't do anything

@begelundmuller begelundmuller added the blocker A release blocker issue that should be resolved before a new release label Feb 19, 2024
@begelundmuller begelundmuller merged commit b9ccaf8 into main Feb 19, 2024
4 checks passed
@begelundmuller begelundmuller deleted the 4069-parallel-fetch branch February 19, 2024 13:49
mindspank pushed a commit that referenced this pull request Feb 23, 2024
* Parallel fetch of batches in Snowflake connector

* Errgroups for fetching and writing

* Fixed context cancellation case
Linked issue: #4069 — Introduce a parallel fetch of batches into Snowflake connector