Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase go-git cache size #440

Closed
jfontan opened this issue Aug 27, 2018 · 16 comments · Fixed by #464
Closed

Increase go-git cache size #440

jfontan opened this issue Aug 27, 2018 · 16 comments · Fixed by #464
Assignees
Labels
performance Performance improvements proposal proposal for new additions or changes

Comments

@jfontan
Copy link
Contributor

jfontan commented Aug 27, 2018

Right now there is no way of changing the default cache size in go-git and its size too small (96 MiB). I've been doing tests changing this value and its performance improved a lot.

Repositories: linux (2013), numpy, tensorflow
Number of rows: 395709
Query: SELECT count(*) FROM commits c NATURAL JOIN ref_commits r WHERE r.ref_name = 'HEAD';

Default cache: 1 row in set (54 min 22.17 sec)
Cache size * 8: 1 row in set (20 min 43.69 sec)

Memory consumption is also not too big. gitbase used 1.3 GiB in this query.

We should add an option to go-git Open to select cache size.

@jfontan jfontan added proposal proposal for new additions or changes performance Performance improvements labels Aug 27, 2018
@eiso
Copy link
Member

eiso commented Aug 28, 2018

@jfontan how does this scale with increased memory in relation to repo size? The reason I am asking is that I much rather have a significantly faster response and have a customer use 100GB of ram for a query. High memory instances are cheap these days.

@kuba-- kuba-- self-assigned this Aug 30, 2018
@jfontan
Copy link
Contributor Author

jfontan commented Aug 30, 2018

@eiso, maybe it could heuristically assign a size but the fist step will be adding an option to change it.

@kuba--
Copy link
Contributor

kuba-- commented Aug 30, 2018

...or maybe we can start from something like 20% of RAM as default for gitbase

@kuba--
Copy link
Contributor

kuba-- commented Aug 31, 2018

Maybe we should be able to configure cache size not only for PlainOpen, but everywhere where we use decoders/encoders (packfiles.go). For instance when we're creating a new repo decoder, we call:
packfile := packfile.NewPackfile(idx, fs, packf) what under the hood creates a default object LRU cache. Maybe we can create there NewPackfileWithCache and pass the same cache option from Storage - WDYT?

@eiso
Copy link
Member

eiso commented Aug 31, 2018

@jfontan my question was more about if the memory improvement stays the same dependent on size of the repository (which I am assuming is a yes)

@kuba--
Copy link
Contributor

kuba-- commented Aug 31, 2018

I'll try to implement it by adding support for couple environment variables, e.g.:
GO_GIT_OBJECT_MAXCACHE etc.
So everybody will be able to adjust go-git cache, doesn't matter if it's used directly or via apps like gitbase.
Any thoughts/concerns?

I think it's more flexible (can skip couple levels of passing options from one constructor to another) and it doesn't require any extra integration.
For instance you can just upgrade go-git in gitbase without any other changes in implementation and by setting env vars, benefit from bigger cache.

@jfontan
Copy link
Contributor Author

jfontan commented Aug 31, 2018

@eiso unfortunately I've seen changes in speedup from repo to repo but I don't know the cause. Maybe is the repo that has more complex commit tree and makes iterating it more expensive. We would have to measure more to know for sure.

@jfontan
Copy link
Contributor Author

jfontan commented Aug 31, 2018

@kuba-- Where do you want to add the environment variable read, in go-git itself? This will be an easier change but I'm not sure about the env vars policy in go-git library. @smola ?

@kuba--
Copy link
Contributor

kuba-- commented Aug 31, 2018

yeah, I thought about the lowest level. So, in this case go-git

@jfontan
Copy link
Contributor Author

jfontan commented Aug 31, 2018

@kuba-- You are totally right about NewPackfileWithCache. I have only tested with opening normal repos in repository_pool but packfile opening should be changed. I believe that is used for indexes so definitely needs to have a proper cache.

If we go the env variable way maybe changing the default value is enough to change both normal repository opens or direct packfile open.

@kuba--
Copy link
Contributor

kuba-- commented Aug 31, 2018

exactly. this is my intention- avoid tons of silly changes just by passing yet another static option.
if we do it once in go-git it will work for everybody almost immediately

@smola
Copy link
Contributor

smola commented Sep 4, 2018

Maybe we should allow passing a cache instead of just cache size. That would allow:

  • Sharing the same cache across repositories, useful to have a max size cap instead of per-repository.
  • Experiment with different cache types.

@kuba--
Copy link
Contributor

kuba-- commented Sep 4, 2018

Right now we have just one type of cache (LRU): plumbing/cache/object_lru, buffer_lru.
Both use the same default max size (96MB).
It makes sense to allow pass own cache implementation, but:

  • it won't avoid passing down the cache to many objects
  • I think, it will require to make a cache high level interface, because right now we have something what looks like a cache but under the hood uses own implementation caches:
// Object is an interface to a object cache.
type Object interface {
	// Put puts the given object into the cache. Whether this object will
	// actually be put into the cache or not is implementation specific.
	Put(o plumbing.EncodedObject)
	// Get gets an object from the cache given its hash. The second return value
	// is true if the object was returned, and false otherwise.
	Get(k plumbing.Hash) (plumbing.EncodedObject, bool)
	// Clear clears every object from the cache.
	Clear()
}

// Buffer is an interface to a buffer cache.
type Buffer interface {
	// Put puts a buffer into the cache. If the buffer is already in the cache,
	// it will be marked as used. Otherwise, it will be inserted. Buffer might
	// be evicted to make room for the new one.
	Put(key int64, slice []byte)
	// Get returns a buffer by its key. It marks the buffer as used. If the
	// buffer is not in the cache, (nil, false) will be returned.
	Get(key int64) ([]byte, bool)
	// Clear clears every object from the cache.
	Clear()
}

Moreover passing down go-git plumbing/cache/Object in apps like gitbase may not be good practice. So I see 3 options/possibilities:

  • Expose cache interface and let people pass own caches
  • Pass down cache size and/or cache type (+ implement different cache types in go-git)
  • Independently let adjust size a type by env. vars by apps and in the future implement different cache types in go-git.

@jfontan
Copy link
Contributor Author

jfontan commented Sep 5, 2018

Only Object cache is interesting for us. Buffer is only used when parsing the packfile on clone to generate the index.

I believe that for now we can benefit being able to pass a cache to the storer as @smola says. We already have packfile.NewPackfileWithCache so that is solved. Being able to pass cache when creating the storer (filesystem.Storage) would be enough. In gitbase I'm already changing repository open to use the new options so I'm always creating the storer, like the siva version:

https://github.com/src-d/gitbase/blob/master/repository_pool.go#L68

So at the end we will only have to pass cache in three places, two in repository_pool.go and one in packfiles.go.

We may need a better cache interface in the future but I would stick with the one that go-git has for now.

@kuba--
Copy link
Contributor

kuba-- commented Sep 5, 2018

I also don't want to over-engineer it, so that's why I preferred env vars (but I see some mental no to env vars :))
Ok, so lets pass just cache to Storage, as a first step.

@kuba-- kuba-- added the blocked Some other issue is blocking this label Sep 6, 2018
@kuba--
Copy link
Contributor

kuba-- commented Sep 6, 2018

So far, this issue is blocked till
src-d/go-git#947 is merged.

@kuba-- kuba-- removed the blocked Some other issue is blocking this label Sep 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance improvements proposal proposal for new additions or changes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants