⚡ GitHub GraphQL API #33

Merged
merged 21 commits into from
Mar 31, 2023
cmdoret
Member

@cmdoret cmdoret commented Feb 7, 2023

This PR replaces queries to GitHub's REST API with a single GraphQL query.

The REST API required additional queries for nested attributes. In particular, we needed one query per contributor to extract user information. This resulted in extremely slow performance for large open-source projects.

Improvements:

When using the GraphQL endpoint, we can specify the desired schema of the response with a single query. This provides a major speedup proportional to the number of contributors in the target repository (e.g., query time for Renku: 1.04s instead of 38.7s with REST).

Caveats (so far):

  • Contributors are not available in GraphQL. I used mentionableUsers instead. In cases where the repo belongs to an organization, this will include all organization members.
  • mentionableUsers, user organizations (affiliations) and repositoryTopics are paginated (note this is also the case in the REST API). The page sizes are currently set to arbitrary thresholds (100 for contributors and affiliations, 10 for topics).
  • Token authentication is mandatory (while REST allowed 60 unauthenticated requests per hour).
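
To illustrate the idea of requesting nested attributes in one round trip, here is a minimal sketch of such a request. It is not the query used in this PR: the selected fields, the `REPO_QUERY` name and the `build_request` helper are assumptions chosen for illustration, and only the page sizes (100, 100, 10) come from the caveats above.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query: nested user and organization fields are fetched
# together with the repository, instead of one REST call per contributor.
REPO_QUERY = """
query repo($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    url
    description
    createdAt
    mentionableUsers(first: 100) {
      nodes {
        login
        name
        organizations(first: 100) {
          nodes { login name }
        }
      }
    }
    repositoryTopics(first: 10) {
      nodes { topic { name } }
    }
  }
}
"""


def build_request(owner: str, name: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the query above."""
    payload = json.dumps(
        {"query": REPO_QUERY, "variables": {"owner": owner, "name": name}}
    ).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload,
        # The GraphQL endpoint rejects unauthenticated requests.
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(build_request("SDSC-ORD", "gimie", token))` would return a JSON document whose shape mirrors the query.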
Example pre-PR output
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://github.com/SDSC-ORD/gimie> a schema:SoftwareSourceCode ;
    schema:CodeRepository "https://github.com/SDSC-ORD/gimie" ;
    schema:author <https://api.github.com/orgs/SDSC-ORD> ;
    schema:contributor <https://api.github.com/users/cmdoret>,
        <https://api.github.com/users/martinfontanet>,
        <https://api.github.com/users/rmfranken>,
        <https://api.github.com/users/sabinem>,
        <https://api.github.com/users/sabrinaossey> ;
    schema:dateCreated "2022-12-07"^^xsd:date ;
    schema:dateModified "2023-01-31"^^xsd:date ;
    schema:description "Extract linked metadata from repositories"^^xsd:string ;
    schema:downloadUrl "https://api.github.com/repos/SDSC-ORD/gimie/tarball" ;
    schema:keywords "fair-data"^^xsd:string,
        "git"^^xsd:string,
        "linked-open-data"^^xsd:string,
        "metadata-extraction"^^xsd:string,
        "python"^^xsd:string,
        "scientific-software"^^xsd:string ;
    schema:license "https://spdx.org/licenses/Apache-2.0" ;
    schema:name "SDSC-ORD/gimie"^^xsd:string ;
    schema:programmingLanguage "Makefile"^^xsd:string,
        "Python"^^xsd:string ;
    schema:version "v0.2.0"^^xsd:string .

<https://api.github.com/orgs/SwissDataScienceCenter> a schema:Organization ;
    schema:description "An ETH Domain initiative for accelerating the adoption of data science" ;
    schema:name "SwissDataScienceCenter" .

<https://api.github.com/orgs/koszullab> a schema:Organization ;
    schema:description "" ;
    schema:name "koszullab" .

<https://api.github.com/users/cmdoret> a schema:Person ;
    schema:affiliation <https://api.github.com/orgs/SDSC-ORD>,
        <https://api.github.com/orgs/SwissDataScienceCenter>,
        <https://api.github.com/orgs/koszullab> ;
    schema:identifier "cmdoret" ;
    schema:name "Cyril Matthey-Doret" .

<https://api.github.com/users/martinfontanet> a schema:Person ;
    schema:affiliation <https://api.github.com/orgs/SDSC-ORD> ;
    schema:identifier "martinfontanet" .

<https://api.github.com/users/rmfranken> a schema:Person ;
    schema:identifier "rmfranken" .

<https://api.github.com/users/sabinem> a schema:Person ;
    schema:affiliation <https://api.github.com/orgs/SDSC-ORD> ;
    schema:identifier "sabinem" ;
    schema:name "Sabine Maennel" .

<https://api.github.com/users/sabrinaossey> a schema:Person ;
    schema:identifier "sabrinaossey" ;
    schema:name "sabrinaossey" .

<https://api.github.com/orgs/SDSC-ORD> a schema:Organization ;
    schema:description "" ;
    schema:legalName "Swiss Data Science Center - ORD" ;
    schema:logo <https://avatars.githubusercontent.com/u/114115753?v=4> ;
    schema:name "SDSC-ORD" .
Example post-PR output
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://github.com/SDSC-ORD/gimie> a schema:SoftwareSourceCode ;
    schema:CodeRepository "https://github.com/SDSC-ORD/gimie" ;
    schema:author <https://github.com/SDSC-ORD> ;
    schema:contributor <https://github.com/caviri>,
        <https://github.com/cmdoret>,
        <https://github.com/ksanao>,
        <https://github.com/martinfontanet>,
        <https://github.com/rmfranken>,
        <https://github.com/sabinem>,
        <https://github.com/sabrinaossey>,
        <https://github.com/supermaxiste> ;
    schema:dateCreated "2022-12-07"^^xsd:date ;
    schema:dateModified "2023-01-31"^^xsd:date ;
    schema:description "Extract linked metadata from repositories"^^xsd:string ;
    schema:downloadUrl "https://github.com/SDSC-ORD/gimie/archive/refs/tags/v0.2.0.tar.gz" ;
    schema:keywords "fair-data"^^xsd:string,
        "git"^^xsd:string,
        "linked-open-data"^^xsd:string,
        "metadata-extraction"^^xsd:string,
        "python"^^xsd:string,
        "scientific-software"^^xsd:string ;
    schema:license "https://spdx.org/licenses/Apache-2.0" ;
    schema:name "SDSC-ORD/gimie"^^xsd:string ;
    schema:programmingLanguage "Python"^^xsd:string ;
    schema:version "v0.2.0"^^xsd:string .

<https://github.com/EPFL-Data-Champions> a schema:Organization ;
    schema:description "Cross-disciplinary community around research data, voluntary EPFL's researchers and staff with keen interest in research data." ;
    schema:legalName "EPFL Data Champions" ;
    schema:logo <https://avatars.githubusercontent.com/u/78474394?v=4> ;
    schema:name "EPFL-Data-Champions" .

<https://github.com/caviri> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "caviri" ;
    schema:name "Carlos Vivar Rios" .

<https://github.com/cmdoret> a schema:Person ;
    schema:affiliation <https://github.com/EPFL-Data-Champions>,
        <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter>,
        <https://github.com/koszullab> ;
    schema:identifier "cmdoret" ;
    schema:name "Cyril Matthey-Doret" .

<https://github.com/koszullab> a schema:Organization ;
    schema:description "" ;
    schema:legalName "Romain Koszul Laboratory" ;
    schema:logo <https://avatars.githubusercontent.com/u/9391430?v=4> ;
    schema:name "koszullab" .

<https://github.com/ksanao> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter> ;
    schema:identifier "ksanao" ;
    schema:name "Oksana Riba Grognuz" .

<https://github.com/martinfontanet> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter> ;
    schema:identifier "martinfontanet" .

<https://github.com/rmfranken> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "rmfranken" .

<https://github.com/sabinem> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "sabinem" ;
    schema:name "Sabine Maennel" .

<https://github.com/sabrinaossey> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter> ;
    schema:identifier "sabrinaossey" ;
    schema:name "sabrinaossey" .

<https://github.com/supermaxiste> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "supermaxiste" .

<https://github.com/SwissDataScienceCenter> a schema:Organization ;
    schema:description "An ETH Domain initiative for accelerating the adoption of data science" ;
    schema:legalName "Swiss Data Science Center" ;
    schema:logo <https://avatars.githubusercontent.com/u/25008760?v=4> ;
    schema:name "SwissDataScienceCenter" .

<https://github.com/SDSC-ORD> a schema:Organization ;
    schema:description "" ;
    schema:legalName "Swiss Data Science Center - ORD" ;
    schema:logo <https://avatars.githubusercontent.com/u/114115753?v=4> ;
    schema:name "SDSC-ORD" .

@cmdoret cmdoret linked an issue Feb 7, 2023 that may be closed by this pull request
3 tasks
@cmdoret cmdoret added the enhancement New feature or request label Feb 8, 2023
@cmdoret cmdoret requested a review from caviri February 8, 2023 15:16
@caviri

caviri commented Feb 23, 2023

Great, it looks wonderful. The only problem I faced was related to the scope of the Token.

The token needs the read:org scope, whereas the current gimie version runs with the default token. I recommend documenting this in README.md.

I also noticed that the GraphQL query limits the number of results: mentionableUsers(first: 100), organizations(first: 100), and repositoryTopics(first: 10). Maybe these should be parameters, or at least be mentioned in the README? Does the current gimie version have a similar limit?

The speed achieved is impressive. Great work.

@caviri

caviri commented Feb 23, 2023

I've been doing some research on mentionableUsers, and its definition in the GitHub documentation is quite vague: "A list of Users that can be mentioned in the context of the repository." Mixing all organization members in with the contributors of an organization's repos is undesirable. What do you think about supporting both methods, depending on whether the repo belongs to a user or an organization?

@cmdoret
Member Author

cmdoret commented Feb 23, 2023

> Great, it looks wonderful. The only problem I faced was related to the scope of the Token.
>
> It's necessary to add read:org to the Token, while the actual gimie version can run with the default token. I would recommend that this should be added to README.md.
>
> I also saw in the graphQL query a limit on the number of mentionableUsers(first: 100), organizations(first: 100), and repositoryTopics(first: 10). Maybe this should be a parameter or mentioned in the README? Is there any similar limit in the actual gimie version?
>
> The speed achieved is impressive. Great work.

Yes, the REST API is paginated so the current version of gimie would also have this issue. We could work around this by adding logic for pagination (multiple queries with limit + offset) but this would be a story for another day ;)

Adding it + the required token scope in the README is a good idea!
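
For reference, a pagination loop could be sketched as follows. Note that GitHub's GraphQL connections paginate with cursors (an `after` argument plus `pageInfo { hasNextPage endCursor }`) rather than a numeric offset. The `paginate` function and its `fetch_page` callable are hypothetical names, not part of gimie; `fetch_page` stands for whatever code runs the query for a given cursor.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield the nodes of every page of a GraphQL connection.

    `fetch_page(cursor)` is assumed to run the query with `after: cursor`
    and return the connection object, shaped like:
    {"nodes": [...], "pageInfo": {"hasNextPage": bool, "endCursor": str}}
    """
    cursor = None  # None requests the first page
    while True:
        connection = fetch_page(cursor)
        yield from connection["nodes"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        # The next request resumes right after the last node received.
        cursor = page_info["endCursor"]
```

Each iteration costs one request, so a repository with more than 100 mentionable users would need at least two round trips with the current page size.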

@cmdoret cmdoret linked an issue Feb 23, 2023 that may be closed by this pull request
3 tasks
@cmdoret
Member Author

cmdoret commented Mar 17, 2023

@caviri OK it looks like we're now getting the actual contributors. 🎉
Main changes:

  • Get the list of commits (in batches, due to pagination of the GraphQL API)
  • Retrieve the set of committers (i.e., contributors)
  • Fetch metadata about each contributor
  • Added tests/test_github.py for integration test on a few repositories with different setups
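
The first two steps above could look roughly like the sketch below. This is an assumption about the approach, not the code merged in this PR: the `COMMITS_QUERY` string and the `commit_authors` helper are illustrative names, and the sketch reads the commit `author` field, whereas the actual implementation may use `committer` instead.

```python
from typing import Iterable, Set

# One page of commit history on the default branch; $cursor is null for
# the first page and pageInfo.endCursor for the following ones.
COMMITS_QUERY = """
query commits($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    defaultBranchRef {
      target {
        ... on Commit {
          history(first: 100, after: $cursor) {
            nodes { author { user { login } } }
            pageInfo { hasNextPage endCursor }
          }
        }
      }
    }
  }
}
"""


def commit_authors(commit_nodes: Iterable[dict]) -> Set[str]:
    """Collect the unique GitHub logins found in a batch of commit nodes."""
    logins = set()
    for node in commit_nodes:
        # `user` is null when the commit email is not linked to an account.
        user = (node.get("author") or {}).get("user")
        if user:
            logins.add(user["login"])
    return logins
```

The set returned for each page can be merged with `set.union` before fetching metadata for each login.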

What do you think?

@cmdoret
Member Author

cmdoret commented Mar 20, 2023

Using the paginated list of commits becomes slow when there are many commits, so we run into the original issue: execution takes a long time because it waits on many requests (querying Renku now takes 19s vs. 38s for the original REST version). I see two solutions:

  • Asynchronously query commit pages with multiple threads to speed it up
    • This may not be easy to implement.
  • Cap the number of queries (e.g. last 500 commits)
    • May miss early contributors.
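
The second option (capping) could be sketched as below, assuming commit pages are produced lazily so that stopping early also stops issuing requests. `iter_commits` and `latest_commits` are hypothetical helpers, and `pages` stands for any lazy source of commit batches, with one API request per batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_commits(pages: Iterable[List[dict]]) -> Iterator[dict]:
    """Flatten commit pages lazily; each page stands for one API request."""
    for page in pages:
        yield from page


def latest_commits(pages: Iterable[List[dict]], max_commits: int = 500) -> List[dict]:
    """Return at most `max_commits` commits, fetching no more pages than needed."""
    return list(islice(iter_commits(pages), max_commits))
```

With pages of 100 commits, a cap of 500 bounds the cost at five requests, at the price of missing contributors whose commits are all older than the cap.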

Maybe we can just ignore it for now and leave performance optimization for another PR. What do you think, @caviri?

@cmdoret
Member Author

cmdoret commented Mar 20, 2023

🚀 Managed to restore full speed using a combination of GraphQL and REST:

  • Query list of contributor IDs with REST (1 query) -> select node_id from the response
  • Query GraphQL for the list of nodes with ids: $node_ids. Fortunately REST's node_id matches GraphQL ids!

So in two queries we get deep metadata about all contributors! It takes 2.5s for Renku (vs 38s with REST).
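
A minimal sketch of this two-query approach, under stated assumptions: the helper names, the selected GraphQL fields and the `per_page=100` parameter are illustrative and not taken from the merged code. The REST contributors endpoint and the root `nodes(ids:)` GraphQL field are the two pieces described above.

```python
import json
import urllib.request
from typing import List

# Illustrative selection of user fields; the real query may request more.
CONTRIBUTORS_QUERY = """
query contributors($ids: [ID!]!) {
  nodes(ids: $ids) {
    ... on User {
      login
      name
      url
      organizations(first: 100) {
        nodes { login name description avatarUrl }
      }
    }
  }
}
"""


def contributors_url(owner: str, repo: str) -> str:
    """REST endpoint listing contributors (assumes one page of 100 suffices)."""
    return f"https://api.github.com/repos/{owner}/{repo}/contributors?per_page=100"


def extract_node_ids(contributors: List[dict]) -> List[str]:
    """Select node_id from each entry of the REST response."""
    return [c["node_id"] for c in contributors if "node_id" in c]


def nodes_payload(node_ids: List[str]) -> bytes:
    """Encode the GraphQL request body that resolves all IDs at once."""
    return json.dumps(
        {"query": CONTRIBUTORS_QUERY, "variables": {"ids": node_ids}}
    ).encode("utf-8")


def fetch_contributors(owner: str, repo: str, token: str) -> List[dict]:
    """Run both queries: REST for the IDs, then GraphQL for the metadata."""
    headers = {"Authorization": f"Bearer {token}"}
    rest_req = urllib.request.Request(contributors_url(owner, repo), headers=headers)
    with urllib.request.urlopen(rest_req) as resp:
        node_ids = extract_node_ids(json.load(resp))
    gql_req = urllib.request.Request(
        "https://api.github.com/graphql",
        data=nodes_payload(node_ids),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(gql_req) as resp:
        return json.load(resp)["data"]["nodes"]
```

The cost stays at two requests regardless of the number of commits, as long as the contributors fit in a single REST page.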

@cmdoret cmdoret requested a review from vancauwe March 29, 2023 15:57
@cmdoret cmdoret merged commit 61ac947 into main Mar 31, 2023
@cmdoret cmdoret deleted the gh-graphql branch March 31, 2023 16:20
@cmdoret cmdoret mentioned this pull request May 30, 2023
Successfully merging this pull request may close these issues.

[gimie] retrieve contributors from GitHub GraphQL API [gimie] Use Github GraphQL API
2 participants