add boltDB batching for Put operation, add benchmark test #1865

Merged
merged 13 commits into master from jg/bolt-benchmark on May 6, 2019

Conversation

JessicaGreben
Contributor

@JessicaGreben commented Apr 30, 2019

Jira V3-1493

What:
This PR modifies the storage/boltdb/client.go client.Put() method to use boltDB db.Batch() method instead of boltDB db.Update().

This PR also adds a benchmark test that compares boltDB Put operations in the following 4 scenarios:

  1. using the current boltDB db.Update operation
  2. turning fsync off and running boltDB db.Update operation
  3. using boltDB db.Batch instead of db.Update
  4. using boltDB db.Batch and turning off fsync

Why:
The boltDB method db.Update creates a new transaction and commits it right away, which writes to disk and fsyncs. The Kademlia RoutingTable calls db.Update frequently, so this creates many disk writes and slows things down. Here we test whether using boltDB db.Batch can improve performance. We chose not to handle the fsync ourselves since it added complicated logic without a big enough increase in performance. Benchmark test results:

~/storj/storage/boltdb - (jg/bolt-benchmark) $ go test -bench=.
goos: darwin
goarch: amd64
pkg: storj.io/storj/storage/boltdb
BenchmarkClientWrite-12               	      10	 157192126 ns/op
BenchmarkClientNoSyncWrite-12         	      30	  44419786 ns/op
BenchmarkClientBatchWrite-12          	    1000	   2011001 ns/op
BenchmarkClientBatchNoSyncWrite-12    	    1000	   1849665 ns/op
PASS
ok  	storj.io/storj/storage/boltdb	5.202s

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Could the PR be broken into smaller PRs?
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?

@@ -67,6 +68,7 @@ func NewShared(path string, buckets ...string) ([]*Client, error) {
if err != nil {
return nil, Error.Wrap(err)
}
db.NoSync = true
Member

There are other things that may use bolt.

Contributor Author

I mostly added db.NoSync here to start a discussion about whether we should be handling the syncing logic ourselves. Personally I think we should avoid this setting and keep letting BoltDB handle the syncing. While this setting does seem to improve performance a little, I'm not convinced it is worth the increased risk and complication added to the code base.

Anyhow, about your comment: it looks like NewShared is currently only used for routing table creation. So if we decide that we want to handle syncing ourselves, then syncing logic needs to be added to the routing table code for bootstrap and SA (as it is for SN below).

If we do choose to keep db.NoSync = true here, I don't know how we can ensure that future uses of this function handle syncing correctly. I could add a comment, but it still seems risky.

@@ -136,6 +136,18 @@ func cmdRun(cmd *cobra.Command, args []string) (err error) {
return errs.New("Error starting master database on storagenode: %+v", err)
}

// Sync routing table database every 1s
ticker := time.NewTicker(time.Second)
Member
@egonelbre May 1, 2019

Do we have handling for corruption? Also this part here looks out of place.

Contributor Author

I agree this part looks out of place. I would prefer this syncing code live in the same place where db.NoSync = true is set, so that it is clear why we need it.

I don't know the best way to add the NoSync setting and its related logic. It seems messy here and I'm not sure there is a way to avoid that, other than removing the NoSync code altogether and letting BoltDB handle the syncing. That is my preferred option: let BoltDB do the sync.

Also, what do you mean by "handling for corruption"?

go func() {
for {
<-ticker.C
kdb.Sync()
Member

This logically races with closing the database.


func BenchmarkClientBatchWrite(b *testing.B) {
// setup db
tempdir, err := ioutil.TempDir("", "storj-bolt")
Member

Use testcontext to create directories.

if err != nil {
fmt.Println("err:", err)
}
defer func() { _ = os.RemoveAll(tempdir) }()
Member

Don't ignore deletion errors, they help us find bugs.

// Sync writes data to disk
func (client *Client) Sync() error {
// TODO: this satisfies the storage.KeyValueStore interface, implement if needed.
return Error.New("Sync not implemented")
Member

So when someone uses a different backend implementation, then it will fail? The approach here should be to return nil because the implementation is always synced.

@JessicaGreben JessicaGreben requested a review from thepaul May 3, 2019 17:59
@JessicaGreben JessicaGreben added Request Code Review Code review requested Reviewer Can Merge If all checks have passed, non-owner can merge PR labels May 3, 2019
// to disk every 1000 operations or 10ms, whichever comes first.
// MaxBatchDelay uses the default setting and can be changed if need be.
// Ref: https://github.com/boltdb/bolt/blob/master/db.go#L160
// Note: when using this method, make sure it's being executed asynchronously since
Contributor

a nit: change this comment to say "make sure its being executed asynchronously if needed", because there may be many cases where it's entirely correct to block for that duration, and people might get confused and think it's disallowed.

bucket := tx.Bucket(client.Bucket)
return bucket.Put(key, value)
})
mon.IntVal("boltDB Batch time elapsed").Observe(int64(time.Since(start)))
Contributor

I believe our current practice is to use underscores in monkit names, like 'boltdb_batch_time_elapsed'.

return storage.ErrEmptyKey.New("")
}

return client.db.Update(func(tx *bolt.Tx) error {
Contributor

Is there a particular reason not to stick with client.update() here, as the other Client methods do? If no, probably best to switch back to avoid confusion. If yes, this deserves an explanatory comment, and if the same reason applies to the other methods, maybe those should be changed as well.

Contributor Author

I thought the way it was written before was confusing. For example, this code was hard for me to read because both PutAndCommit and update return calls to functions that call anonymous functions; it was hard to tell what was going on.

// PutAndCommit adds a value to the provided key in boltdb, returning an error on failure.
func (client *Client) PutAndCommit(key storage.Key, value storage.Value) error {
	if key.IsZero() {
		return storage.ErrEmptyKey.New("")
	}

	return client.update(func(bucket *bolt.Bucket) error {
		return bucket.Put(key, value)
	})
}

func (client *Client) update(fn func(*bolt.Bucket) error) error {
	return Error.Wrap(client.db.Update(func(tx *bolt.Tx) error {
		return fn(tx.Bucket(client.Bucket))
	}))
}

I felt like we could remove the update method altogether, but now that I'm looking at it more, I guess it's written like that to abstract away the boltDB method calls? I'm not sure. Anyhow, I can revert it to keep it consistent if that is preferred?

Contributor

Yeah, I'm not sure what the reasoning is behind the update() method. Probably to abstract those particular parts of Bolt world. But yeah, consistency helps readability a lot on its own. We should either use update() here or take it out in the other methods (and if you want to do that, a separate commit would probably be preferable).

Contributor Author

I think I will revert the changes in this PR, then submit a different PR to remove that layer of abstraction, i.e. the update, etc. methods.

@@ -28,6 +30,7 @@ func TestSuite(t *testing.T) {
if err != nil {
t.Fatalf("failed to create db: %v", err)
}
store.db.MaxBatchDelay = 1 * time.Millisecond
Contributor

It's a little worrying anytime we have to run the test environment with a configuration that is so different from what is in prod. Are there any ways to avoid it without entirely breaking test performance? Maybe doing more test operations at the same time, so that batches would naturally tend to complete more quickly?

Contributor Author

Yeah, I can remove this if it's not ideal for testing.

if err != nil {
fmt.Printf("failed to create db: %v\n", err)
}
kdb := dbs[0]
Contributor

Why are the deferred dbs[x].Close() calls not needed here, as they are in the two previous functions?

Contributor Author

Oops, I will add them.

dbfile := ctx.File("testbolt.db")
dbs, err := NewShared(dbfile, "kbuckets", "nodes")
if err != nil {
fmt.Printf("failed to create db: %v\n", err)
Contributor

This (and all following fmt.Printf() calls in the benchmark functions) should be a b.Fatalf() call. The benchmark results are not valid if we have failed to create, close, or modify the DBs in the expected way. Plus if these are happening during automated testing, we really want to know about it.

key := storage.Key(fmt.Sprintf("testkey%d", i))
value := storage.Value("testvalue")

err := kdb.PutAndCommit(key, value)
Contributor

BenchmarkClientWrite() and BenchmarkClientNoSyncWrite() would be more meaningful comparisons if they also used the same WaitGroup/goroutine pattern as BenchmarkClientBatchWrite() and BenchmarkClientBatchNoSyncWrite(). It might even be worth pulling out the commonalities between them all into a single runner function that takes a callback argument for doing the actual appropriate operation.

Contributor Author

hm ok let me give that a shot

Contributor
@stefanbenten left a comment

LGTM, nice job @JessicaGreben

@JessicaGreben JessicaGreben merged commit a9b8b50 into master May 6, 2019
@JessicaGreben JessicaGreben deleted the jg/bolt-benchmark branch May 6, 2019 20:47
defer wg.Done()
err := kdb.PutAndCommit(key, value)
if err != nil {
b.Fatal("PutAndCommit Nosync err:", err)
Member

Benchmark methods cannot be called from a goroutine; it's a race.

Contributor Author

Oops, I will fix that in another PR since this is merged. Thanks, Egon.

bryanchriswhite pushed a commit that referenced this pull request Jun 4, 2019
* add boltDB batching for Put operation, add benchmark test

* add batchPut method to kademlia routingTable

* add BatchPut method for other KeyValueStore to satisfy interface

* return err not implemented

* add noSync to boltdb client

* rm boltDB noSync

* make batch block and fix tests

* changes per CR

* rm test setting so it matches prod code behavior

* fix lint errs
Labels
cla-signed Request Code Review Code review requested Reviewer Can Merge If all checks have passed, non-owner can merge PR
4 participants