
Expose full repository languages/file types inventory via GraphQL #2587

Open
sqs opened this Issue Mar 5, 2019 · 2 comments

sqs commented Mar 5, 2019

Each repository has a full "inventory", which is the result of processing all of its files, determining their languages, and counting the bytes of each language/file type. This is not exposed in GraphQL; only the top language is exposed (via Repository.language). We should expose the full inventory so that API consumers can see all of the languages in use in the repository.

Also, it should include line counts, not just byte counts.

Details: For every Git tree and blob, there should be a GraphQL field inventory that exposes the following (a schema sketch follows the list):

  • All languages found in the tree/file (files obviously only have one language right now), in an array of the following for each detected language:
    • Language name and other metadata (e.g. "is this markup or code?") that the inventory package already produces
    • Number of bytes of this language
    • Number of lines of code of this language (can be naive lines, no need to use SLOC or anything)
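
A minimal sketch of one possible schema shape, assuming the data comes straight from the existing inventory package. Every type and field name here (Inventory, LanguageStatistics, totalBytes, totalLines) is illustrative, not a settled API:

```graphql
# Hypothetical schema sketch; names are illustrative, not the final API.
type Inventory {
  # One entry per language detected in the tree or blob.
  languages: [LanguageStatistics!]!
}

type LanguageStatistics {
  # Language name as reported by the inventory package, e.g. "Go".
  name: String!
  # Metadata the inventory package already produces, e.g. markup vs. code.
  type: String!
  # Total bytes of this language. Float avoids 32-bit Int overflow on
  # very large repositories.
  totalBytes: Float!
  # Naive newline count, not SLOC.
  totalLines: Int!
}
```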

The most common use case is to get this info for a repository at its HEAD commit, so there should be a Repository.inventory field that returns this info for the HEAD commit's root Git tree.
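
For example, fetching the breakdown for a single repository's HEAD commit could then look like this. repository(name:) exists in the Sourcegraph GraphQL API today; inventory and its subfields are the hypothetical names sketched above:

```graphql
# Hypothetical query: the language breakdown of one repository's HEAD
# commit, using the illustrative Inventory shape sketched above.
query RepositoryInventory {
  repository(name: "github.com/gorilla/mux") {
    inventory {
      languages {
        name
        totalBytes
        totalLines
      }
    }
  }
}
```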

The language determination from the existing inventory package is sufficient. No need to make it more advanced.

The expected usage patterns are mainly:

  • Seeing this in the UI (e.g. on a tree page), like the GitHub languages breakdown on the repo page
    • So caching this computation is probably a good idea. It doesn't need to be precomputed; it can be computed on-the-fly if not cached.
  • Querying across many/all repositories (e.g. up to 30k repositories); a sketch of this pattern follows the list
    • This should be possible, but it is OK if it takes minutes or hours and requires repeated calls until the data is computed and cached for every repository. Basically, there should be a way to get this info for that many repositories, but it doesn't need to be instant or even very fast.
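
A sketch of what the bulk pattern could look like. repositories(first:) and nodes exist in the Sourcegraph GraphQL API today; inventory is the hypothetical field from the sketch above, and the batching/retry behavior (how many repositories per request, how callers poll until everything is cached) is left open:

```graphql
# Hypothetical bulk query: request the inventory of many repositories at
# once. Callers would repeat this (tolerating slow or partial results)
# until every repository's inventory has been computed and cached.
query AllRepositoryInventories {
  repositories(first: 1000) {
    nodes {
      name
      inventory {
        languages {
          name
          totalBytes
          totalLines
        }
      }
    }
  }
}
```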

Related: #2586

Customers:

@sqs sqs added the feature-request label Mar 5, 2019

@sqs sqs self-assigned this Mar 5, 2019

@sqs sqs added this to the 3.2 milestone Mar 6, 2019

@sqs sqs modified the milestones: 3.2, Backlog Mar 19, 2019

@sqs sqs added the customer label Mar 19, 2019

@sqs sqs assigned tsenart and keegancsmith and unassigned sqs Apr 17, 2019

@sqs sqs added the repos label Apr 17, 2019

@tsenart tsenart added team/core-services and removed repos labels Apr 17, 2019

keegancsmith commented Apr 17, 2019

> Querying across many/all repositories (e.g. up to 30k repositories)

Is there more specific information on why a customer wants to do this and how important it is?

sqs commented Apr 17, 2019

@keegancsmith:

> Is there more specific information on why a customer wants to do this and how important it is?

https://app.hubspot.com/contacts/2762526/company/464956351 specifically asked for this, and they have that scale. They would like to be able to compute these stats over all repositories to know how their usage of languages is changing over time (e.g. "are we using more or less of $LANG?").

Based on demos and conversations with the other companies mentioned in the top comment (where the decision maker for deploying Sourcegraph would have seen immediate product value from stats about language usage), it would also be an effective element of the initial demo. It directly gives the decision maker value, instead of only giving value to their team. For example, in the demo we would ask: "Do you even have a way to know which languages are in use?" "No." "Sourcegraph can tell you - here's a Python script [and in the future, a UI screen] that hits our API and tells you. Run it!" "Wow, this is a new capability that lets me plan hiring/training better...and there were these other questions I wanted answered: ..." (which leads to more good high-level conversations about product value).

But the constraint is really that this feature shouldn't totally fail at 30k-repo scale, not that it needs to be fast at that scale. It is OK if it takes minutes or hours, and multiple requests, to compute the inventory for 30k repos on demand. For example, an implementation that computed everything on-the-fly under a 60-second request timeout would very likely not satisfy that customer's need.
