
V3-1987: Optimize audits stats persistence #2632

Merged
merged 27 commits into from Jul 31, 2019

Conversation

ethanadams (Contributor) commented Jul 25, 2019

What:
Update reporter and overlay to update audit stats in batched transactions.

Why:
Traces show significant time being spent looping through individual stats updates.

https://storjlabs.atlassian.net/browse/V3-1987

Simple benchmark tests show a significant improvement between single update/transaction model vs batching within a transaction.

##
## sqlite (file)
##

# 100 updates
BenchmarkBulk/100_updates_and_commits         	     300	   4066550 ns/op	  125079 B/op	    4100 allocs/op
BenchmarkBulk/100_updates_1_commit           	    2000	    760509 ns/op	   45066 B/op	    1527 allocs/op
BenchmarkBulk/100_in_one_db_call             	    2000	    652836 ns/op	  403846 B/op	    1329 allocs/op


# 500 updates
BenchmarkBulk/500_updates_and_commits         	      50	  20456170 ns/op	  627740 B/op	   20516 allocs/op
BenchmarkBulk/500_updates_1_commit            	     500	   3689338 ns/op	  221219 B/op	    7527 allocs/op
BenchmarkBulk/500_in_one_db_call              	     300	   5042053 ns/op	 9825272 B/op	    6546 allocs/op

##
## postgres (localhost)
##

# 100 updates
BenchmarkBulk/100_updates_and_commits         	      30	  41845984 ns/op	   92011 B/op	    1907 allocs/op
BenchmarkBulk/100_updates_1_commit            	     100	  13995990 ns/op	   38376 B/op	    1010 allocs/op
BenchmarkBulk/100_in_one_db_call              	     200	   8571144 ns/op	  210926 B/op	     616 allocs/op

# 500 updates
BenchmarkBulk/500_updates_and_commits         	       5	 211364240 ns/op	  455345 B/op	    9516 allocs/op
BenchmarkBulk/500_updates_1_commit            	      20	  71476219 ns/op	  188978 B/op	    5011 allocs/op
BenchmarkBulk/500_in_one_db_call              	      30	  45730926 ns/op	 4931225 B/op	    3030 allocs/op

Still considering the single SQL call, but that will require a lot of business logic in the SQL, which would duplicate logic in Go code.

Please describe the tests:

  • Test 1: Updated existing tests to use BatchUpdateStats
  • Test 2:

Please describe the performance impact:

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?
_, err := reporter.overlay.UpdateStats(ctx, &overlay.UpdateRequest{

updateRequests := make([]*overlay.UpdateRequest, len(successNodeIDs))
for i, nodeID := range successNodeIDs {

egonelbre (Member), Jul 25, 2019

Is there a reason we cannot do just a single query for the whole batch?

egonelbre (Member), Jul 25, 2019

Most of these updates should be possible to condense into a single update.

ethanadams (Author, Contributor), Jul 25, 2019

Yes, I agree. I was looking for opinions as to whether I should put all the business logic in SQL vs using the existing Go logic.

I will work on the SQL model next.

egonelbre (Member), Jul 25, 2019

You shouldn't need a SQL model either. An add/subtract with set to DQ should do it.

ethanadams (Author, Contributor), Jul 26, 2019

Updated to execute updates/deletes in one call to the DB.

}

// updateStats updates a single storage node's stats in the db
func updateStats(ctx context.Context, updateReq *overlay.UpdateRequest, tx *dbx.Tx) (stats *overlay.NodeStats, err error) {

egonelbre (Member), Jul 25, 2019

I wonder whether this would end up problematic with multiple servers updating at the same time.

simongui (Contributor), Jul 30, 2019

LWW (last write wins) would be the guarantee this provides wouldn't it?

ethanadams and others added 6 commits Jul 25, 2019
…tched into 1 call to the DB
return failed, errs.Combine(Error.New("failed to record some audit fail statuses in overlay"), errlist.Err())

failed, err = reporter.overlay.BatchUpdateStats(ctx, updateRequests)
if err != nil && len(failed) > 0 {

coyle (Contributor), Jul 30, 2019

should this be an or, not an and?

ethanadams (Author, Contributor), Jul 30, 2019

I agree. Updated

aligeti (Contributor), Jul 30, 2019

I think the conditions should be separated. BatchUpdateStats returns nil for the error if the passed slice length is zero; with nothing to update, returning nil is OK, but changing the condition to OR will then return an error. As suggested above, checking the length of the slice before calling BatchUpdateStats will work together with if err != nil || len(failed) > 0.
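The suggested guard can be sketched as follows. recordOutcome and its callback are hypothetical stand-ins for the reporter methods and overlay.BatchUpdateStats; the point is the early return on an empty batch plus the OR condition afterwards:

```go
package main

import "fmt"

// recordOutcome returns early when there is nothing to update, and treats
// an error OR leftover failures as a problem. The callback is a stand-in
// for overlay.BatchUpdateStats, which returns the requests that failed.
func recordOutcome(requests []string, batchUpdateStats func([]string) (failed []string, err error)) error {
	if len(requests) == 0 {
		return nil // nothing to update; not an error
	}
	failed, err := batchUpdateStats(requests)
	if err != nil || len(failed) > 0 {
		return fmt.Errorf("failed to record %d of %d statuses: %v", len(failed), len(requests), err)
	}
	return nil
}

func main() {
	// Empty batch: succeeds without calling the DB at all.
	err := recordOutcome(nil, func([]string) ([]string, error) { panic("should not be called") })
	fmt.Println("empty batch error:", err)

	// A partial failure with a nil error is still reported under ||.
	err = recordOutcome([]string{"a", "b"}, func(ids []string) ([]string, error) {
		return ids[:1], nil // one request failed, but no transport error
	})
	fmt.Println("partial failure reported:", err != nil)
}
```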

}
for _, nodeID := range failedBatch {
for _, pendingAudit := range pendingAudits {
if nodeID == pendingAudit.NodeID {

coyle (Contributor), Jul 30, 2019

maybe a better algorithm here would be to build a map of [NodeID]pendingAudit, then check whether any nodeIDs from failedBatch are present in the map; that would decrease the time complexity from O(n·m) to O(n+m)
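The suggested map lookup might look like this. failedPending and the simplified PendingAudit are illustrative stand-ins for the reporter code, replacing the nested loops in the snippet above with a single map build plus one pass over failedBatch:

```go
package main

import "fmt"

// PendingAudit is a simplified stand-in for the audit package's type.
type PendingAudit struct {
	NodeID string
}

// failedPending builds a map keyed by NodeID once, then looks up each
// failed node in constant time, instead of scanning pendingAudits for
// every entry in failedBatch.
func failedPending(failedBatch []string, pendingAudits []*PendingAudit) []*PendingAudit {
	byNode := make(map[string]*PendingAudit, len(pendingAudits))
	for _, pending := range pendingAudits {
		byNode[pending.NodeID] = pending
	}
	var failed []*PendingAudit
	for _, nodeID := range failedBatch {
		if pending, ok := byNode[nodeID]; ok {
			failed = append(failed, pending)
		}
	}
	return failed
}

func main() {
	pending := []*PendingAudit{{NodeID: "a"}, {NodeID: "b"}, {NodeID: "c"}}
	for _, p := range failedPending([]string{"c", "a", "x"}, pending) {
		fmt.Println(p.NodeID)
	}
}
```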

ethanadams (Author, Contributor), Jul 30, 2019

Added map for lookup

coyle approved these changes Jul 30, 2019
var errlist errs.Group
for _, nodeID := range successNodeIDs {
_, err := reporter.overlay.UpdateStats(ctx, &overlay.UpdateRequest{

aligeti (Contributor), Jul 30, 2019

shouldn't you add a check that the length of successNodeIDs is > 0, and otherwise return nil with a warning?

@@ -181,6 +181,8 @@ func (reporter *Reporter) recordAuditSuccessStatus(ctx context.Context, successN
func (reporter *Reporter) recordPendingAudits(ctx context.Context, pendingAudits []*PendingAudit) (failed []*PendingAudit, err error) {
defer mon.Task()(&ctx)(&err)
var errlist errs.Group

var updateRequests []*overlay.UpdateRequest

aligeti (Contributor), Jul 30, 2019

shouldn't you add a check that the length of pendingAudits is > 0, and otherwise return nil with a warning? The same applies at other similar usages.


ethanadams added 6 commits Jul 30, 2019
…to ethan/V3-1987-dbx
@ethanadams ethanadams merged commit c9b46f2 into master Jul 31, 2019
2 checks passed
continuous-integration/jenkins/pr-head: This commit looks good
verification/cla-signed
@ethanadams ethanadams deleted the ethan/V3-1987-dbx branch Jul 31, 2019
VitaliiShpital added a commit to VitaliiShpital/storj that referenced this pull request Aug 29, 2019
* Added batch update stats for recordAuditSuccessStatus
* Added batch update stats to recordAuditFailStatus
* added configurable batch size
* build individual update/delete statements so the statements can be batched into 1 call to the DB
* notified #config-changes channel and ran make update-satellite-config-lock
* updated tests to use batch update stats
5 participants