-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lib/db: Slightly improve indirection (ref #6372) #6373
Conversation
I was working on indirecting version vectors, and that resulted in some refactoring and improving the existing block indirection stuff. We may or may not end up doing the version vector indirection, but I think these changes are reasonable anyhow and will simplify the diff significantly if we do go there. The main points are: - A bunch of renaming to make the indirection and GC not about "blocks" but about "indirection". - Adding a cutoff so that we don't actually indirect for small block lists. This gets us better performance when handling small files as it cuts out the indirection for quite small loss in space efficiency. - Being paranoid and always recalculating the hash on put. This costs some CPU, but the consequences if a buggy or malicious implementation silently substituted the block list by lying about the hash would be bad.
lib/db/transactions.go
Outdated
return err | ||
} | ||
} else if err != nil { | ||
return err | ||
} | ||
fi.Blocks = nil | ||
} else { | ||
fi.BlocksHash = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this. I think this is a generally useful piece of information to have. Perhaps we should check if the block list is empty when loading instead to decide if we need to take an indirection or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, avoid adding to bloom filter if it has a blockhash but also a blocklist in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could. You mean that we should always calculate, set and send the hash, even when we don't need it internally? It's not expensive to do it for the small lists that this avoids, so sure we could do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sort of agree, even though the review comment below suggests the opposite: I feel like BlocksHash
should either be a db internal implementation detail or always be there if exposed towards protocol.
|
||
func init() { | ||
// deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lost the plot a bit, but is this actually released already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in the RC. So if we delay the 1.4.0 release and put this into another RC this can go away, otherwise it kinda needs to be there.
lib/db/transactions.go
Outdated
func (t readWriteTransaction) putFile(fkey []byte, fi protocol.FileInfo) error { | ||
var bkey []byte | ||
|
||
if len(fi.Blocks) > blocksIndirectionCutoff { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does idxchk need handling of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. It would, if we did your change to always have the hash, but as is now the idxck checks the hash if it's present and otherwise not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In hindsight it may have been better to not expose BlocksHash
on the protocol and consider it an implementation detail, but I guess that ship has sailed at this point (too big a refactor for the RC).
blocksKey := t.keyer.GenerateBlockListKey(nil, fi.BlocksHash) | ||
if _, err := t.Get(blocksKey); backend.IsNotFound(err) { | ||
func (t readWriteTransaction) putFile(fkey []byte, fi protocol.FileInfo) error { | ||
var bkey []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this declared out here? Isn't (re-) used outside of the if scope below as far as I can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was in my original which also indirected the version vector... I figured it could stay there for clarity, but it has no practical reason now, no.
// we need to copy it. | ||
err := f.Unmarshal(append([]byte{}, dbi.Value()...)) | ||
|
||
intf, err := t.unmarshalTrunc(dbi.Value(), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the commend above valid (anymore)? unmarshalTrunc
doesn't do any buffer copying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't valid at all since we switched to protobuf, which doesn't keep references (or we'd be fucked in so many places...).
This change in general isn't strictly needed right now, but I think all unmarshalling should use the unmarshalTrunc function. (Again, when indirecting version vectors that also affects the truncated version.)
Changed to always set the hash regardless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's left to address: To RC, or not to RC?
I am in favor of RC: Gets rid of the additional env var and if there is something wrong with it (don't think so), I'd rather have it in the same release cycle as the original change (same for not having a huge diff in the next release cycle concerning code from this one). That seems worth delaying the release (or even skipping the next one, letting the current cycle take 2 months).
I'd say delay it, or release the rc with the env var changed so we're not releasing legacy straight away. |
I was working on indirecting version vectors, and that resulted in some refactoring and improving the existing block indirection stuff. We may or may not end up doing the version vector indirection, but I think these changes are reasonable anyhow and will simplify the diff significantly if we do go there. The main points are: - A bunch of renaming to make the indirection and GC not about "blocks" but about "indirection". - Adding a cutoff so that we don't actually indirect for small block lists. This gets us better performance when handling small files as it cuts out the indirection for quite small loss in space efficiency. - Being paranoid and always recalculating the hash on put. This costs some CPU, but the consequences if a buggy or malicious implementation silently substituted the block list by lying about the hash would be bad.
* master: lib/db: Remove reference to env var that never existed lib/db: Slightly improve indirection (ref syncthing#6372) (syncthing#6373) lib/db: Remove reference to env var that never existed lib/db: Slightly improve indirection (ref syncthing#6372) (syncthing#6373)
* master: (64 commits) lib/db: Be more lenient during migration (fixes syncthing#6397) (syncthing#6398) lib/db: Be more lenient during migration (fixes syncthing#6397) (syncthing#6398) cmd/ursrv: Analytics for Synology dist build: Build image should use Go 1.13 for now gui, lib/api: Remove CPU & RAM measurements (fixes syncthing#6249) (syncthing#6393) gui, man, authors: Update docs, translations, and contributors all: Tweak error creation (syncthing#6391) authors: Cleanup on request build: We can now use Go 1.13 lib/db: Prevent GC concurrently with migration (fixes syncthing#6389) (syncthing#6390) lib/db: Prevent GC concurrently with migration (fixes syncthing#6389) (syncthing#6390) build: Fix syso creation (fixes syncthing#6386) (syncthing#6387) build: Fix syso creation (fixes syncthing#6386) (syncthing#6387) lib/db: Correct metadata recalculation (fixes syncthing#6381) (syncthing#6382) lib/db: Correct metadata recalculation (fixes syncthing#6381) (syncthing#6382) lib/db: Remove reference to env var that never existed lib/db: Slightly improve indirection (ref syncthing#6372) (syncthing#6373) lib/db: Remove reference to env var that never existed lib/db: Slightly improve indirection (ref syncthing#6372) (syncthing#6373) build: Forked github.com/spaolacci/murmur3 for unsafe (ref syncthing#6371) ...
* master: (64 commits) lib/db: Be more lenient during migration (fixes syncthing#6397) (syncthing#6398) lib/db: Be more lenient during migration (fixes syncthing#6397) (syncthing#6398) cmd/ursrv: Analytics for Synology dist build: Build image should use Go 1.13 for now gui, lib/api: Remove CPU & RAM measurements (fixes syncthing#6249) (syncthing#6393) gui, man, authors: Update docs, translations, and contributors all: Tweak error creation (syncthing#6391) authors: Cleanup on request build: We can now use Go 1.13 lib/db: Prevent GC concurrently with migration (fixes syncthing#6389) (syncthing#6390) lib/db: Prevent GC concurrently with migration (fixes syncthing#6389) (syncthing#6390) build: Fix syso creation (fixes syncthing#6386) (syncthing#6387) build: Fix syso creation (fixes syncthing#6386) (syncthing#6387) lib/db: Correct metadata recalculation (fixes syncthing#6381) (syncthing#6382) lib/db: Correct metadata recalculation (fixes syncthing#6381) (syncthing#6382) lib/db: Remove reference to env var that never existed lib/db: Slightly improve indirection (ref syncthing#6372) (syncthing#6373) lib/db: Remove reference to env var that never existed lib/db: Slightly improve indirection (ref syncthing#6372) (syncthing#6373) build: Forked github.com/spaolacci/murmur3 for unsafe (ref syncthing#6371) ...
Purpose
I was working on indirecting version vectors, and that resulted in some
refactoring and improving the existing block indirection stuff. We may
or may not end up doing the version vector indirection, but I think
these changes are reasonable anyhow and will simplify the diff
significantly if we do go there. The main points are:
A bunch of renaming to make the indirection and GC not about "blocks"
but about "indirection".
Adding a cutoff so that we don't actually indirect for small block
lists. This gets us better performance when handling small files as it
cuts out the indirection for quite small loss in space efficiency.
Being paranoid and always recalculating the hash on put. This costs
some CPU, but the consequences if a buggy or malicious implementation
silently substituted the block list by lying about the hash would be bad.
This may or may not be RC material depending on how we think about the change to indirecting small block lists and letting the old env var into the wild...