New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add boltDB batching for Put operation, add benchmark test #1865
Conversation
storage/boltdb/client.go
Outdated
@@ -67,6 +68,7 @@ func NewShared(path string, buckets ...string) ([]*Client, error) { | |||
if err != nil { | |||
return nil, Error.Wrap(err) | |||
} | |||
db.NoSync = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other things that may use bolt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly added this db.NoSync
here to start a discussion about if we should be handling the syncing logic ourselves. Personally I think we should avoid this setting and instead keep letting BoltDB handling the syncing. While this setting does seem to improve performance a little bit, I'm not convinced it is worth the increased risk and complication added to the code base.
Anyhow, about your comment...it looks like this function NewShared
is currently only used for the Routing table creation. So if we decide that we want to handle syncing ourselves, then syncing logic needs to be added to the routing table code for bootstrap and SA (as it is for SN below).
If we do choose to keep this db.NoSync = true
in here, I don't know how we can ensure that future uses of this function handle syncing correctly. I could add comment, but it still seems risky.
cmd/storagenode/main.go
Outdated
@@ -136,6 +136,18 @@ func cmdRun(cmd *cobra.Command, args []string) (err error) { | |||
return errs.New("Error starting master database on storagenode: %+v", err) | |||
} | |||
|
|||
// Sync routing table database every 1s | |||
ticker := time.NewTicker(time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have handling for corruption? Also this part here looks out of place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this part looks out of place. I would prefer this syncing code be in the same place where db.NoSync = true
is set so that it makes sense why we need this.
I don't know the best way to add the NoSync
setting and its related logic. Seems like its messy here and Im not sure if there is a way to avoid that, other than removing the NoSync code altogether and letting BoltDB handle the syncing. That is my preferred option, let BoltDB do the sync.
Also what do you mean "handling for corruption"?
cmd/storagenode/main.go
Outdated
go func() { | ||
for { | ||
<-ticker.C | ||
kdb.Sync() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logically races with closing the database.
storage/boltdb/client_test.go
Outdated
|
||
func BenchmarkClientBatchWrite(b *testing.B) { | ||
// setup db | ||
tempdir, err := ioutil.TempDir("", "storj-bolt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use testcontext
to create directories.
storage/boltdb/client_test.go
Outdated
if err != nil { | ||
fmt.Println("err:", err) | ||
} | ||
defer func() { _ = os.RemoveAll(tempdir) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't ignore deletion errors, they help us find bugs.
storage/redis/client.go
Outdated
// Sync writes data to disk | ||
func (client *Client) Sync() error { | ||
// TODO: this satisfies the storage.KeyValueStore interface, implement if needed. | ||
return Error.New("Sync not implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when someone uses a different backend implementation, then it will fail? The approach here should be to return nil
because the implementation is always synced.
storage/boltdb/client.go
Outdated
// to disk every 1000 operations or 10ms, whichever is first. | ||
// The MaxBatchDelay are using default settings and be changed if need be. | ||
// Ref: https://github.com/boltdb/bolt/blob/master/db.go#L160 | ||
// Note: when using this method, make sure its being executed asynchronously since |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit: change this comment to say "make sure its being executed asynchronously if needed", because there may be many cases where it's entirely correct to block for that duration, and people might get confused and think it's disallowed.
storage/boltdb/client.go
Outdated
bucket := tx.Bucket(client.Bucket) | ||
return bucket.Put(key, value) | ||
}) | ||
mon.IntVal("boltDB Batch time elapsed").Observe(int64(time.Since(start))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe our current practice is to use underscores in monkit names, like 'boltdb_batch_time_elapsed'.
storage/boltdb/client.go
Outdated
return storage.ErrEmptyKey.New("") | ||
} | ||
|
||
return client.db.Update(func(tx *bolt.Tx) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason not to stick with client.update()
here, as the other Client methods do? If no, probably best to switch back to avoid confusion. If yes, this deserves an explanatory comment, and if the same reason applies to the other methods, maybe those should be changed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought the way it was written before was confusing. For example this code is hard to read to me because both PutAndCommit
and update
are returning calls to functions that call anonymous functions, it was hard to tell what was going on.
// PutAndCommit adds a value to the provided key in boltdb, returning an error on failure.
func (client *Client) PutAndCommit(key storage.Key, value storage.Value) error {
if key.IsZero() {
return storage.ErrEmptyKey.New("")
}
return client.update(func(bucket *bolt.Bucket) error {
return bucket.Put(key, value)
})
}
func (client *Client) update(fn func(*bolt.Bucket) error) error {
return Error.Wrap(client.db.Update(func(tx *bolt.Tx) error {
return fn(tx.Bucket(client.Bucket))
}))
}
I felt like we could remove the update
method all together, but now that Im looking at it more, I guess its written like that to abstract away the boltDB method calls? I'm not sure. Anyhow, I can revert it to keep it consistent if that is preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure what the reasoning is behind the update()
method. Probably to abstract those particular parts of Bolt world. But yeah, consistency helps readability a lot on its own. We should either use update()
here or take it out in the other methods (and if you want to do that, a separate commit would probably be preferable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i will revert the changes in this PR, then submit a different PR to remove that layer of abstraction i.e update
, etc methods
storage/boltdb/client_test.go
Outdated
@@ -28,6 +30,7 @@ func TestSuite(t *testing.T) { | |||
if err != nil { | |||
t.Fatalf("failed to create db: %v", err) | |||
} | |||
store.db.MaxBatchDelay = 1 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little worrying anytime we have to run the test environment with a configuration that is so different from what is in prod. Are there any ways to avoid it without entirely breaking test performance? Maybe doing more test operations at the same time, so that batches would naturally tend to complete more quickly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i can remove this if its not ideal for testing.
if err != nil { | ||
fmt.Printf("failed to create db: %v\n", err) | ||
} | ||
kdb := dbs[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the deferred dbs[x].Close()
calls not needed here, as they are in the two previous functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops I will add
storage/boltdb/client_test.go
Outdated
dbfile := ctx.File("testbolt.db") | ||
dbs, err := NewShared(dbfile, "kbuckets", "nodes") | ||
if err != nil { | ||
fmt.Printf("failed to create db: %v\n", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and all following fmt.Printf()
calls in the benchmark functions) should be a b.Fatalf()
call. The benchmark results are not valid if we have failed to create, close, or modify the DBs in the expected way. Plus if these are happening during automated testing, we really want to know about it.
storage/boltdb/client_test.go
Outdated
key := storage.Key(fmt.Sprintf("testkey%d", i)) | ||
value := storage.Value("testvalue") | ||
|
||
err := kdb.PutAndCommit(key, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BenchmarkClientWrite() and BenchmarkClientNoSyncWrite() would be more meaningful comparisons if they also used the same WaitGroup/goroutine pattern as BenchmarkClientBatchWrite() and BenchmarkClientBatchNoSyncWrite(). It might even be worth pulling out the commonalities between them all into a single runner function that takes a callback argument for doing the actual appropriate operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm ok let me give that a shot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice job @JessicaGreben
defer wg.Done() | ||
err := kdb.PutAndCommit(key, value) | ||
if err != nil { | ||
b.Fatal("PutAndCommit Nosync err:", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
benchmark methods cannot be called from a goroutine it's a race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops I will fix that in another PR since this is merged. thanks egon.
* add boltDB batching for Put operation, add benchmark test * add batchPut method to kademlia routingTable * add BatchPut method for other KeyValueStore to satisfy interface * return err not implemented * add noSync to boltdb client * rm boltDB noSync * make batch block and fix tests * changes per CR * rm test setting so it matches prod code behavior * fix lint errs
Jira V3-1493
What:
This PR modifies the
storage/boltdb/client.go
client.Put()
method to use boltDBdb.Batch()
method instead of boltDBdb.Update()
.This PR also adds a benchmark test that compares boltDB Put operations in the following 3 scenarios:
db.Update
operationdb.Update
operationdb.Batch
instead ofdb.Update
db.Batch
and turning off fsyncWhy:
The boltDB method
db.Update
creates a new transaction and commits the transaction right away which writes to disk and fsyncs. Kademlia RoutingTable callsdb.Update
frequently so this creates a lot of writes to disk slowing things down. Here we test if using boltDBdb.Batch
can help us improve performance. We chose to not handle the fsync ourselves since it caused additional complicated logic without a big enough increase in performance. Benchmark test results:Code Review Checklist (to be filled out by reviewer)