add boltDB batching for Put operation, add benchmark test #1865

Merged
merged 13 commits into master from jg/bolt-benchmark on May 6, 2019

Conversation

JessicaGreben
Contributor

@JessicaGreben commented Apr 30, 2019

Jira V3-1493

What:
This PR modifies the storage/boltdb/client.go client.Put() method to use boltDB db.Batch() method instead of boltDB db.Update().

This PR also adds a benchmark test that compares boltDB Put operations in the following 4 scenarios:

  1. using the current boltDB db.Update operation
  2. turning fsync off and running boltDB db.Update operation
  3. using boltDB db.Batch instead of db.Update
  4. using boltDB db.Batch and turning off fsync

Why:
The boltDB method db.Update creates a new transaction and commits it right away, which writes to disk and fsyncs. The Kademlia RoutingTable calls db.Update frequently, so this creates many disk writes and slows things down. Here we test whether using boltDB db.Batch can improve performance. We chose not to handle the fsync ourselves since it added complicated logic without a big enough increase in performance. Benchmark test results:

~/storj/storage/boltdb - (jg/bolt-benchmark) $ go test -bench=.
goos: darwin
goarch: amd64
pkg: storj.io/storj/storage/boltdb
BenchmarkClientWrite-12               	      10	 157192126 ns/op
BenchmarkClientNoSyncWrite-12         	      30	  44419786 ns/op
BenchmarkClientBatchWrite-12          	    1000	   2011001 ns/op
BenchmarkClientBatchNoSyncWrite-12    	    1000	   1849665 ns/op
PASS
ok  	storj.io/storj/storage/boltdb	5.202s

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Could the PR be broken into smaller PRs?
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?

@@ -67,6 +68,7 @@ func NewShared(path string, buckets ...string) ([]*Client, error) {
if err != nil {
return nil, Error.Wrap(err)
}
db.NoSync = true
Member

There are other things that may use bolt.

Contributor Author

I mostly added db.NoSync here to start a discussion about whether we should be handling the syncing logic ourselves. Personally I think we should avoid this setting and keep letting BoltDB handle the syncing. While this setting does seem to improve performance a little, I'm not convinced it is worth the increased risk and complication added to the code base.

Anyhow, about your comment: it looks like NewShared is currently only used for routing table creation. So if we decide that we want to handle syncing ourselves, then syncing logic needs to be added to the routing table code for bootstrap and SA (as it is for SN below).

If we do choose to keep db.NoSync = true here, I don't know how we can ensure that future uses of this function handle syncing correctly. I could add a comment, but it still seems risky.

@@ -136,6 +136,18 @@ func cmdRun(cmd *cobra.Command, args []string) (err error) {
return errs.New("Error starting master database on storagenode: %+v", err)
}

// Sync routing table database every 1s
ticker := time.NewTicker(time.Second)
Member
@egonelbre May 1, 2019

Do we have handling for corruption? Also this part here looks out of place.

Contributor Author

I agree this part looks out of place. I would prefer this syncing code live in the same place where db.NoSync = true is set, so that it is clear why we need it.

I don't know the best way to add the NoSync setting and its related logic. It seems messy here and I'm not sure there is a way to avoid that, other than removing the NoSync code altogether and letting BoltDB handle the syncing. That is my preferred option: let BoltDB do the sync.

Also, what do you mean by "handling for corruption"?

go func() {
for {
<-ticker.C
kdb.Sync()
Member

This logically races with closing the database.


func BenchmarkClientBatchWrite(b *testing.B) {
// setup db
tempdir, err := ioutil.TempDir("", "storj-bolt")
Member

Use testcontext to create directories.

if err != nil {
fmt.Println("err:", err)
}
defer func() { _ = os.RemoveAll(tempdir) }()
Member

Don't ignore deletion errors, they help us find bugs.

// Sync writes data to disk
func (client *Client) Sync() error {
// TODO: this satisfies the storage.KeyValueStore interface, implement if needed.
return Error.New("Sync not implemented")
Member

So when someone uses a different backend implementation, then it will fail? The approach here should be to return nil because the implementation is always synced.

@JessicaGreben JessicaGreben requested a review from thepaul May 3, 2019 17:59
@JessicaGreben JessicaGreben added Request Code Review Code review requested Reviewer Can Merge If all checks have passed, non-owner can merge PR labels May 3, 2019
// to disk every 1000 operations or 10ms, whichever comes first.
// MaxBatchDelay uses the default setting and can be changed if need be.
// Ref: https://github.com/boltdb/bolt/blob/master/db.go#L160
// Note: when using this method, make sure it's being executed asynchronously since
Contributor

a nit: change this comment to say "make sure its being executed asynchronously if needed", because there may be many cases where it's entirely correct to block for that duration, and people might get confused and think it's disallowed.

bucket := tx.Bucket(client.Bucket)
return bucket.Put(key, value)
})
mon.IntVal("boltDB Batch time elapsed").Observe(int64(time.Since(start)))
Contributor

I believe our current practice is to use underscores in monkit names, like 'boltdb_batch_time_elapsed'.

return storage.ErrEmptyKey.New("")
}

return client.db.Update(func(tx *bolt.Tx) error {
Contributor

Is there a particular reason not to stick with client.update() here, as the other Client methods do? If no, probably best to switch back to avoid confusion. If yes, this deserves an explanatory comment, and if the same reason applies to the other methods, maybe those should be changed as well.

Contributor Author

I thought the way it was written before was confusing. For example, this code was hard for me to read because both PutAndCommit and update return calls to functions that call anonymous functions; it was hard to tell what was going on.

// PutAndCommit adds a value to the provided key in boltdb, returning an error on failure.
func (client *Client) PutAndCommit(key storage.Key, value storage.Value) error {
	if key.IsZero() {
		return storage.ErrEmptyKey.New("")
	}

	return client.update(func(bucket *bolt.Bucket) error {
		return bucket.Put(key, value)
	})
}

func (client *Client) update(fn func(*bolt.Bucket) error) error {
	return Error.Wrap(client.db.Update(func(tx *bolt.Tx) error {
		return fn(tx.Bucket(client.Bucket))
	}))
}

I felt like we could remove the update method altogether, but now that I'm looking at it more, I guess it's written like that to abstract away the boltDB method calls? I'm not sure. Anyhow, I can revert it to keep it consistent if that is preferred?

Contributor

Yeah, I'm not sure what the reasoning is behind the update() method. Probably to abstract those particular parts of Bolt world. But yeah, consistency helps readability a lot on its own. We should either use update() here or take it out in the other methods (and if you want to do that, a separate commit would probably be preferable).

Contributor Author

I think I will revert the changes in this PR, then submit a different PR to remove that layer of abstraction, i.e. the update, etc. methods.

@@ -28,6 +30,7 @@ func TestSuite(t *testing.T) {
if err != nil {
t.Fatalf("failed to create db: %v", err)
}
store.db.MaxBatchDelay = 1 * time.Millisecond
Contributor

It's a little worrying anytime we have to run the test environment with a configuration that is so different from what is in prod. Are there any ways to avoid it without entirely breaking test performance? Maybe doing more test operations at the same time, so that batches would naturally tend to complete more quickly?

Contributor Author

Yeah, I can remove this if it's not ideal for testing.

if err != nil {
fmt.Printf("failed to create db: %v\n", err)
}
kdb := dbs[0]
Contributor

Why are the deferred dbs[x].Close() calls not needed here, as they are in the two previous functions?

Contributor Author

Oops, I will add them.

dbfile := ctx.File("testbolt.db")
dbs, err := NewShared(dbfile, "kbuckets", "nodes")
if err != nil {
fmt.Printf("failed to create db: %v\n", err)
Contributor

This (and all following fmt.Printf() calls in the benchmark functions) should be a b.Fatalf() call. The benchmark results are not valid if we have failed to create, close, or modify the DBs in the expected way. Plus if these are happening during automated testing, we really want to know about it.

key := storage.Key(fmt.Sprintf("testkey%d", i))
value := storage.Value("testvalue")

err := kdb.PutAndCommit(key, value)
Contributor

BenchmarkClientWrite() and BenchmarkClientNoSyncWrite() would be more meaningful comparisons if they also used the same WaitGroup/goroutine pattern as BenchmarkClientBatchWrite() and BenchmarkClientBatchNoSyncWrite(). It might even be worth pulling out the commonalities between them all into a single runner function that takes a callback argument for doing the actual appropriate operation.

Contributor Author

hm ok let me give that a shot

Contributor
@stefanbenten left a comment

LGTM, nice job @JessicaGreben

@JessicaGreben JessicaGreben merged commit a9b8b50 into master May 6, 2019
@JessicaGreben JessicaGreben deleted the jg/bolt-benchmark branch May 6, 2019 20:47
defer wg.Done()
err := kdb.PutAndCommit(key, value)
if err != nil {
b.Fatal("PutAndCommit Nosync err:", err)
Member

Benchmark methods cannot be called from a goroutine; it's a race.

Contributor Author

Oops, I will fix that in another PR since this is merged. Thanks, Egon.

bryanchriswhite pushed a commit that referenced this pull request Jun 4, 2019
* add boltDB batching for Put operation, add benchmark test

* add batchPut method to kademlia routingTable

* add BatchPut method for other KeyValueStore to satisfy interface

* return err not implemented

* add noSync to boltdb client

* rm boltDB noSync

* make batch block and fix tests

* changes per CR

* rm test setting so it matches prod code behavior

* fix lint errs
Labels
cla-signed Request Code Review Code review requested Reviewer Can Merge If all checks have passed, non-owner can merge PR
4 participants