Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

surprising performance issue #443

Closed
campoy opened this issue Aug 28, 2018 · 13 comments
Closed

surprising performance issue #443

campoy opened this issue Aug 28, 2018 · 13 comments
Assignees
Labels
performance Performance improvements

Comments

@campoy
Copy link
Contributor

campoy commented Aug 28, 2018

I just ran this query on top of github.com/golang/go:

  SELECT
  	LANGUAGE(t.tree_entry_name, b.blob_content) as lang,
	t.tree_entry_name as name,
       b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash

This finishes in 0.65s 🎉

Unfortunately this other request takes forever:

  SELECT
	t.tree_entry_name as name,
        b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
   WHERE LANGUAGE(t.tree_entry_name, b.blob_content) = 'go'

Trying to see whether I could find a workaround I wrote this second query:

SELECT name, code
FROM 
(
  SELECT
	LANGUAGE(t.tree_entry_name, b.blob_content) = 'go' as lang,
	t.tree_entry_name as name,    
        b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
) as blobs
WHERE lang = 'go'

Both of these requests take too long for me to wait.

@campoy campoy added the performance Performance improvements label Aug 28, 2018
@ajnavarro
Copy link
Contributor

@campoy could you send the result of the queries but adding at the beginning EXPLAIN FORMAT=tree please? It will give us really valuable info about query performance.

@jfontan
Copy link
Contributor

jfontan commented Oct 3, 2018

With current release these queries take around 1 minute in my machine. Have you been using indexes to go that fast?

The last two queries return 0 rows as the string returned by LANGUAGE is Go.

@kuba--
Copy link
Contributor

kuba-- commented Oct 4, 2018

SELECT  	t.tree_entry_name AS NAME,  b.blob_content AS CODE
FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
WHERE LANGUAGE(t.tree_entry_name, b.blob_content) = 'Go';

screen shot 2018-10-04 at 09 54 35

@kuba--
Copy link
Contributor

kuba-- commented Oct 4, 2018

There is no index involved, because so far we cannot create an index across multiple columns from different tables and LANGUAGE(t.tree_entry_name, b.blob_content) will require that.

@erizocosmico erizocosmico self-assigned this Oct 4, 2018
@erizocosmico
Copy link
Contributor

erizocosmico commented Oct 4, 2018

I have executed all three queries and the results in terms of time are quite similar:

  • First query: 1069809 rows in 1m34.779148952s
  • Second query: 760122 rows in 1m26.175958486s
  • Third query: 760122 rows in 1m36.166300878s

Most likely, what happened is that you saw no output because every single row was being filtered due to LANGUAGE(..., ...) = 'go'. It should be Go instead of go, which is the correct enry result.

I think for the sake of user friendliness we should either output lowercased language names or specify in the documentation the correct names for all languages. WDYT @ajnavarro?

Correct second query:

  SELECT
	t.tree_entry_name as name,
        b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
   WHERE LANGUAGE(t.tree_entry_name, b.blob_content) = 'go'

Correct third query:

SELECT name, code
FROM 
(
  SELECT
	LANGUAGE(t.tree_entry_name, b.blob_content) as lang,
	t.tree_entry_name as name,    
        b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
) as blobs
WHERE lang = 'Go'

That said, if what you were aiming for was getting all Go code from the repo, this is a more concise and less expensive query to to that:

SELECT tree_entry_name as name, blob_content as code
FROM tree_entries NATURAL JOIN blobs
WHERE LANGUAGE(tree_entry_name, blob_content) = 'Go'

@kuba--
Copy link
Contributor

kuba-- commented Oct 4, 2018

personally I think LANGUAGE should return all lowercases. There is no point why we need to capitalise lang's name.

@erizocosmico
Copy link
Contributor

@kuba-- totally agree

@ajnavarro
Copy link
Contributor

@erizocosmico @kuba-- I cannot think in any other downside than old queries that are using correct Enry strings will fail. Maybe is totally OK. WDYT @smola ?

@smola
Copy link
Contributor

smola commented Oct 4, 2018

UAST might break if you change LANGUAGE to lowercase. You should check that. Other than that, it should be ok.

@smola
Copy link
Contributor

smola commented Oct 4, 2018

Another option is adding support for LOWER and UPPER

@erizocosmico
Copy link
Contributor

@smola UAST already lowercases everything, afaik. But lower and upper + adding to the docs the list of languages returned by enry sounds good too.

@smola
Copy link
Contributor

smola commented Oct 8, 2018

👍 to adding a list of languages by enry. bblfsh has the same issue. We could document that list in src-d/enry README.md and link it from both projects.
See bblfsh/documentation#191

@ajnavarro
Copy link
Contributor

Opening new issue to handle the real problem: #658

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance improvements
Projects
None yet
Development

No branches or pull requests

6 participants