Skip to content

internal/function: use cache for expensive language operations #457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 7, 2018
Merged

internal/function: use cache for expensive language operations #457

merged 3 commits into from
Sep 7, 2018

Conversation

erizocosmico
Copy link
Contributor

Closes #442

This adds a two queue lru cache to language UDF. Why a 2Q cache instead of a regular LRU? Because we do not only want the most recent but the most frequent as well to maximize the cache usage.

Since running enry on a file_path only is just marginally slower than using the cache, I decided to not use the cache unless the blob_content is being provided (we could use two caches, though) to not pollute the cache and use it only for the really expensive operations: reading stuff from blobs.

Note that the cache is per gitbase instance.

Results are pretty much the ones @ajnavarro describes in the issue.

Before:

mysql> select count(*) from (select file_path, language(file_path, blob_content) as language FROM files WHERE language(file_path, blob_content) = 'Go') t;
+----------+
| COUNT(*) |
+----------+
|    39781 |
+----------+
1 row in set (20.85 sec)

Now:

mysql> select count(*) from (select file_path, language(file_path, blob_content) as language FROM files WHERE language(file_path, blob_content) = 'Go') t;
+----------+
| COUNT(*) |
+----------+
|    39781 |
+----------+
1 row in set (17.14 sec)

@erizocosmico erizocosmico requested a review from a team September 5, 2018 09:48
enry "gopkg.in/src-d/enry.v1"
"gopkg.in/src-d/go-mysql-server.v0/sql"
)

const defaultLanguageCacheSize = 10000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, we are trying to let apps adjust cache sizes (and pass them down to packages like go-git), so maybe in this case we can also think about how easily we can modify this cache in the future (at least size).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this should be configurable, right @ajnavarro? Environment variable or CLI flag that passes down to here? WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, it should be configurable. it should be a env var o cli flag, yep.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
@erizocosmico
Copy link
Contributor Author

Added configuration to the cache size via env var

@@ -12,6 +12,7 @@
| `GITBASE_INDEX_DIR` | directory to save indexes, default `/var/lib/gitbase/index` |
| `GITBASE_TRACE` | enable jaeger tracing, default disabled |
| `GITBASE_READONLY` | allow read queries only, disabling creating and deleting indexes, default disabled |
| `GITBASE_LANGUAGE_CACHE_SIZE` | size of the cache for the `language` UDF. The size is the maximum number of elements kept in the cache |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -12,6 +12,7 @@
| `GITBASE_INDEX_DIR` | directory to save indexes, default `/var/lib/gitbase/index` |
| `GITBASE_TRACE` | enable jaeger tracing, default disabled |
| `GITBASE_READONLY` | allow read queries only, disabling creating and deleting indexes, default disabled |
| `GITBASE_LANGUAGE_CACHE_SIZE` | size of the cache for the `language` UDF. The size is the maximum number of elements kept in the cache |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add the default value here?

enry "gopkg.in/src-d/enry.v1"
"gopkg.in/src-d/go-mysql-server.v0/sql"
)

const (
languageCacheSizeKey = "GITBASE_LANGUAGE_CACHE_SIZE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add that env variable as we are doing with others? check https://github.com/src-d/gitbase/blob/master/cmd/gitbase/command/server.go#L50

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add it there we will have to pass it all the way down until the infinite. Besides, there are other examples of environment variables such as blob max size that are not there but where they're used

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
@erizocosmico
Copy link
Contributor Author

Added default value to docs

@ajnavarro ajnavarro merged commit f2081de into src-d:master Sep 7, 2018
@erizocosmico erizocosmico deleted the feature/lang-cache branch September 18, 2018 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants