⚡ GitHub GraphQL API #33

cmdoret · 2023-02-07T20:27:28Z

This PR replaces queries to GitHub's REST API with a single GraphQL query.

The REST API required additional queries for nested attributes. In particular, we needed one query per contributor to extract user information. This resulted in very slow performance for large open-source projects.

Improvements:

When using the GraphQL endpoint, we can specify the desired schema of the response with a single query. This provides a major speedup proportional to the number of contributors in the target repository (e.g., query time for Renku: 1.04s instead of 38.7s with REST).
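
To illustrate the point above, here is a minimal sketch of what a single nested query could look like. The field selection and the helper names (`build_payload`, `run_query`) are assumptions made for illustration and are not taken from this PR; the endpoint and the bearer-token header are the ones GitHub documents for its GraphQL API.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query (assumed field selection): nested attributes such as
# topics, users and their organizations come back in one round trip.
REPO_QUERY = """
query repo($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    name
    description
    createdAt
    licenseInfo { spdxId }
    repositoryTopics(first: 10) { nodes { topic { name } } }
    mentionableUsers(first: 100) {
      nodes {
        login
        name
        organizations(first: 100) { nodes { login name description } }
      }
    }
  }
}
"""


def build_payload(owner: str, name: str) -> dict:
    """Build the JSON body for the GraphQL request (hypothetical helper)."""
    return {"query": REPO_QUERY, "variables": {"owner": owner, "name": name}}


def run_query(payload: dict, token: str) -> dict:
    """POST the payload to GitHub's GraphQL endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_query(build_payload("SDSC-ORD", "gimie"), token)` would issue one request, whatever the number of users in the response.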

Caveats (so far):

Contributors are not available in GraphQL. I used mentionableUsers instead. In cases where the repo belongs to an organization, this will include all organization members.
mentionableUsers, user organizations (affiliations) and repositoryTopics are paginated (note this is also the case in the REST API). The page sizes are currently set to arbitrary thresholds (100 for contributors and affiliations, 10 for topics).
Token authentication is mandatory (while REST allowed for 60 unauthenticated requests per hour).

Example pre-PR output

@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://github.com/SDSC-ORD/gimie> a schema:SoftwareSourceCode ;
    schema:CodeRepository "https://github.com/SDSC-ORD/gimie" ;
    schema:author <https://api.github.com/orgs/SDSC-ORD> ;
    schema:contributor <https://api.github.com/users/cmdoret>,
        <https://api.github.com/users/martinfontanet>,
        <https://api.github.com/users/rmfranken>,
        <https://api.github.com/users/sabinem>,
        <https://api.github.com/users/sabrinaossey> ;
    schema:dateCreated "2022-12-07"^^xsd:date ;
    schema:dateModified "2023-01-31"^^xsd:date ;
    schema:description "Extract linked metadata from repositories"^^xsd:string ;
    schema:downloadUrl "https://api.github.com/repos/SDSC-ORD/gimie/tarball" ;
    schema:keywords "fair-data"^^xsd:string,
        "git"^^xsd:string,
        "linked-open-data"^^xsd:string,
        "metadata-extraction"^^xsd:string,
        "python"^^xsd:string,
        "scientific-software"^^xsd:string ;
    schema:license "https://spdx.org/licenses/Apache-2.0" ;
    schema:name "SDSC-ORD/gimie"^^xsd:string ;
    schema:programmingLanguage "Makefile"^^xsd:string,
        "Python"^^xsd:string ;
    schema:version "v0.2.0"^^xsd:string .

<https://api.github.com/orgs/SwissDataScienceCenter> a schema:Organization ;
    schema:description "An ETH Domain initiative for accelerating the adoption of data science" ;
    schema:name "SwissDataScienceCenter" .

<https://api.github.com/orgs/koszullab> a schema:Organization ;
    schema:description "" ;
    schema:name "koszullab" .

<https://api.github.com/users/cmdoret> a schema:Person ;
    schema:affiliation <https://api.github.com/orgs/SDSC-ORD>,
        <https://api.github.com/orgs/SwissDataScienceCenter>,
        <https://api.github.com/orgs/koszullab> ;
    schema:identifier "cmdoret" ;
    schema:name "Cyril Matthey-Doret" .

<https://api.github.com/users/martinfontanet> a schema:Person ;
    schema:affiliation <https://api.github.com/orgs/SDSC-ORD> ;
    schema:identifier "martinfontanet" .

<https://api.github.com/users/rmfranken> a schema:Person ;
    schema:identifier "rmfranken" .

<https://api.github.com/users/sabinem> a schema:Person ;
    schema:affiliation <https://api.github.com/orgs/SDSC-ORD> ;
    schema:identifier "sabinem" ;
    schema:name "Sabine Maennel" .

<https://api.github.com/users/sabrinaossey> a schema:Person ;
    schema:identifier "sabrinaossey" ;
    schema:name "sabrinaossey" .

<https://api.github.com/orgs/SDSC-ORD> a schema:Organization ;
    schema:description "" ;
    schema:legalName "Swiss Data Science Center - ORD" ;
    schema:logo <https://avatars.githubusercontent.com/u/114115753?v=4> ;
    schema:name "SDSC-ORD" .

Example post-PR output

@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://github.com/SDSC-ORD/gimie> a schema:SoftwareSourceCode ;
    schema:CodeRepository "https://github.com/SDSC-ORD/gimie" ;
    schema:author <https://github.com/SDSC-ORD> ;
    schema:contributor <https://github.com/caviri>,
        <https://github.com/cmdoret>,
        <https://github.com/ksanao>,
        <https://github.com/martinfontanet>,
        <https://github.com/rmfranken>,
        <https://github.com/sabinem>,
        <https://github.com/sabrinaossey>,
        <https://github.com/supermaxiste> ;
    schema:dateCreated "2022-12-07"^^xsd:date ;
    schema:dateModified "2023-01-31"^^xsd:date ;
    schema:description "Extract linked metadata from repositories"^^xsd:string ;
    schema:downloadUrl "https://github.com/SDSC-ORD/gimie/archive/refs/tags/v0.2.0.tar.gz" ;
    schema:keywords "fair-data"^^xsd:string,
        "git"^^xsd:string,
        "linked-open-data"^^xsd:string,
        "metadata-extraction"^^xsd:string,
        "python"^^xsd:string,
        "scientific-software"^^xsd:string ;
    schema:license "https://spdx.org/licenses/Apache-2.0" ;
    schema:name "SDSC-ORD/gimie"^^xsd:string ;
    schema:programmingLanguage "Python"^^xsd:string ;
    schema:version "v0.2.0"^^xsd:string .

<https://github.com/EPFL-Data-Champions> a schema:Organization ;
    schema:description "Cross-disciplinary community around research data, voluntary EPFL's researchers and staff with keen interest in research data." ;
    schema:legalName "EPFL Data Champions" ;
    schema:logo <https://avatars.githubusercontent.com/u/78474394?v=4> ;
    schema:name "EPFL-Data-Champions" .

<https://github.com/caviri> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "caviri" ;
    schema:name "Carlos Vivar Rios" .

<https://github.com/cmdoret> a schema:Person ;
    schema:affiliation <https://github.com/EPFL-Data-Champions>,
        <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter>,
        <https://github.com/koszullab> ;
    schema:identifier "cmdoret" ;
    schema:name "Cyril Matthey-Doret" .

<https://github.com/koszullab> a schema:Organization ;
    schema:description "" ;
    schema:legalName "Romain Koszul Laboratory" ;
    schema:logo <https://avatars.githubusercontent.com/u/9391430?v=4> ;
    schema:name "koszullab" .

<https://github.com/ksanao> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter> ;
    schema:identifier "ksanao" ;
    schema:name "Oksana Riba Grognuz" .

<https://github.com/martinfontanet> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter> ;
    schema:identifier "martinfontanet" .

<https://github.com/rmfranken> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "rmfranken" .

<https://github.com/sabinem> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "sabinem" ;
    schema:name "Sabine Maennel" .

<https://github.com/sabrinaossey> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD>,
        <https://github.com/SwissDataScienceCenter> ;
    schema:identifier "sabrinaossey" ;
    schema:name "sabrinaossey" .

<https://github.com/supermaxiste> a schema:Person ;
    schema:affiliation <https://github.com/SDSC-ORD> ;
    schema:identifier "supermaxiste" .

<https://github.com/SwissDataScienceCenter> a schema:Organization ;
    schema:description "An ETH Domain initiative for accelerating the adoption of data science" ;
    schema:legalName "Swiss Data Science Center" ;
    schema:logo <https://avatars.githubusercontent.com/u/25008760?v=4> ;
    schema:name "SwissDataScienceCenter" .

<https://github.com/SDSC-ORD> a schema:Organization ;
    schema:description "" ;
    schema:legalName "Swiss Data Science Center - ORD" ;
    schema:logo <https://avatars.githubusercontent.com/u/114115753?v=4> ;
    schema:name "SDSC-ORD" .

caviri · 2023-02-23T12:16:55Z

Great, it looks wonderful. The only problem I faced was related to the scope of the Token.

It's necessary to add read:org to the Token, while the current gimie version can run with the default token. I would recommend adding this to README.md.

I also saw in the GraphQL query a limit on the number of mentionableUsers(first: 100), organizations(first: 100), and repositoryTopics(first: 10). Maybe this should be a parameter or mentioned in the README? Is there any similar limit in the current gimie version?

The speed achieved is impressive. Great work.
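
As a possible aid for the token-scope issue described above, here is a small sketch that checks whether a token can read organization data. It assumes a classic personal access token, for which GitHub's REST API reports the granted scopes in the `X-OAuth-Scopes` response header; the function names are hypothetical and nothing like this is part of the PR.

```python
# Scopes that grant read access to organization data. On GitHub,
# write:org and admin:org both include read:org.
ORG_READ_SCOPES = {"read:org", "write:org", "admin:org"}


def parse_scopes(header_value: str) -> set[str]:
    """Split a comma-separated X-OAuth-Scopes header value into a set."""
    return {scope.strip() for scope in header_value.split(",") if scope.strip()}


def can_read_orgs(header_value: str) -> bool:
    """Return True if any granted scope allows reading organization data."""
    return bool(parse_scopes(header_value) & ORG_READ_SCOPES)
```

For example, `can_read_orgs("repo, read:org")` is true, while a token with only `repo` would fail the check and could trigger a clearer error message than a malformed GraphQL response.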

caviri · 2023-02-23T12:37:08Z

I've been doing some research on mentionableUsers, and its definition in the GH documentation is quite vague: "A list of Users that can be mentioned in the context of the repository." Having organization members mixed in with contributors for an organization's repos is undesirable. What do you think about having both methods, depending on whether the repo belongs to a user or an organization?

cmdoret · 2023-02-23T13:22:07Z

Great, it looks wonderful. The only problem I faced was related to the scope of the Token.

It's necessary to add read:org to the Token, while the current gimie version can run with the default token. I would recommend adding this to README.md.

I also saw in the GraphQL query a limit on the number of mentionableUsers(first: 100), organizations(first: 100), and repositoryTopics(first: 10). Maybe this should be a parameter or mentioned in the README? Is there any similar limit in the current gimie version?

The speed achieved is impressive. Great work.

Yes, the REST API is paginated so the current version of gimie would also have this issue. We could work around this by adding logic for pagination (multiple queries with limit + offset) but this would be a story for another day ;)

Adding it + the required token scope in the README is a good idea!
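
For reference, the pagination logic mentioned above could be sketched as follows. Note that GitHub's GraphQL connections paginate with cursors (`pageInfo { hasNextPage endCursor }` plus an `after` argument) rather than a numeric offset. The `fetch_page` callable is a hypothetical stand-in for whatever function sends the query and returns one connection object; it is an assumption, not gimie's code.

```python
from typing import Callable, Iterator, Optional


def iter_nodes(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every node of a GraphQL connection, following cursors.

    fetch_page(cursor) must return a dict shaped like a GraphQL connection:
    {"nodes": [...], "pageInfo": {"hasNextPage": bool, "endCursor": str}}.
    """
    cursor = None  # None requests the first page
    while True:
        connection = fetch_page(cursor)
        yield from connection["nodes"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]
```

Each page costs one request, so the total number of requests grows with the size of the list being walked.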


cmdoret · 2023-03-17T17:28:11Z

@caviri OK it looks like we're now getting the actual contributors. 🎉
Main changes:

Get the list of commits (in batches, due to pagination of the GraphQL API)
Retrieve the set of committers (aka contributors)
Fetch metadata about each contributor
Added tests/test_github.py for integration test on a few repositories with different setups

What do you think?
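
The first two steps listed above might look roughly like the sketch below. It assumes commit nodes shaped like the result of selecting `author { user { login } }` on a commit `history` connection in GitHub's GraphQL API; the function name is hypothetical, and commits whose author is not linked to a GitHub account are simply skipped.

```python
from typing import Iterable


def committer_logins(commit_nodes: Iterable[dict]) -> set[str]:
    """Collect the unique GitHub logins found in a list of commit nodes."""
    logins = set()
    for commit in commit_nodes:
        # "user" is null when the commit author has no linked GitHub account.
        user = (commit.get("author") or {}).get("user")
        if user:
            logins.add(user["login"])
    return logins
```

The resulting set is then the input for the third step, fetching metadata about each contributor.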

cmdoret · 2023-03-20T10:47:28Z

Paginating through the list of commits becomes slow when there are many commits, so we run into the original issue: execution takes a long time because we wait on many requests (querying Renku now takes 19s vs 38s for the original REST version). I see two solutions:

Asynchronously query commit pages with multiple threads to speed it up
- This may not be easy to implement.
Cap the number of queries (e.g. last 500 commits)
- May miss early contributors.

Maybe we can just ignore it for now and keep performance optimization for another PR. What do you think, @caviri?
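
To make the trade-off of the second option concrete, here is a tiny sketch of capping the number of commits inspected. It assumes commits arrive newest first and are already reduced to dicts with a `login` key; both the function name and that input shape are made up for illustration.

```python
from itertools import islice
from typing import Iterable


def recent_committers(commits: Iterable[dict], max_commits: int = 500) -> set[str]:
    """Return the logins seen in the first max_commits commits only."""
    # islice stops consuming the iterable once the cap is reached, so a lazy
    # paginated source would not be asked for further pages.
    return {commit["login"] for commit in islice(commits, max_commits)}
```

With a cap of 3 on a history whose only commit by an early contributor is the fourth one, that contributor is missed, which is the drawback noted above.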

cmdoret · 2023-03-20T17:29:23Z

🚀 Managed to restore full speed using a combination of GraphQL and REST:

Query list of contributor IDs with REST (1 query) -> select node_id from the response
Query GraphQL for the list of nodes with ids: $node_ids. Fortunately REST's node_id matches GraphQL ids!

So in two queries we get deep metadata about all contributors! It takes 2.5s for Renku (vs 38s with REST).
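
A rough sketch of the second step, under stated assumptions: the input is the decoded JSON list returned by the REST contributors endpoint (`GET /repos/{owner}/{repo}/contributors`), whose entries carry `node_id` and `type` fields, and the GraphQL side uses the top-level `nodes(ids: ...)` field. The selected user fields, the helper name and the bot filter based on `type` are illustrative guesses, not necessarily what the PR does. The REST endpoint is itself paginated, so very large repositories may need more than one REST call.

```python
# Illustrative GraphQL query: look up several users at once by node ID.
NODES_QUERY = """
query contributors($node_ids: [ID!]!) {
  nodes(ids: $node_ids) {
    ... on User {
      login
      name
      organizations(first: 100) { nodes { login name description } }
    }
  }
}
"""


def contributors_payload(rest_contributors: list[dict]) -> dict:
    """Turn a REST contributors response into a GraphQL request body."""
    # Keep regular users only; REST marks bot accounts with type "Bot".
    node_ids = [
        contributor["node_id"]
        for contributor in rest_contributors
        if contributor.get("type") == "User"
    ]
    return {"query": NODES_QUERY, "variables": {"node_ids": node_ids}}
```

The returned dict can be sent to the GraphQL endpoint as a single request, which is what keeps the total at two queries.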

cmdoret added 11 commits February 5, 2023 12:59

feat: add fields in org and person schemas

86da818

feat: add gh extraction logic for new fields (language, version, ...)

b45d2ff

feat: add helper to build spdx url from license name

16f6aec

fix: drop release notes, use standard field for version

dc44350

fix: rm missing attribute from affiliation response

7d7fea7

fix: enable value types in GithubExtractorSchema

2a95066

feat: add GraphQL query function

f3eae3b

refactor: gh extractor methods for GraphQL

9588b23

ci: debug malformed graphql response in actions

d972730

feat: error handling for graphql queries

8cc7b60

ci: update gh token in ci tests

73dade2

cmdoret linked an issue Feb 7, 2023 that may be closed by this pull request

[gimie] Use Github GraphQL API #32

Closed

3 tasks

cmdoret added the enhancement New feature or request label Feb 8, 2023

cmdoret requested a review from caviri February 8, 2023 15:16

cmdoret added 2 commits February 23, 2023 14:53

fix: crash when gh repo has no release

64ce8de

doc: mention token scope and gh pagination in readme

8f62ffa

cmdoret mentioned this pull request Feb 23, 2023

[gimie] retrieve contributors from GitHub GraphQL API #37

Closed

3 tasks

cmdoret linked an issue Feb 23, 2023 that may be closed by this pull request

[gimie] retrieve contributors from GitHub GraphQL API #37

Closed

3 tasks

cmdoret and others added 5 commits February 24, 2023 11:22

Merge branch 'main' into gh-graphql

4e9f0b9

feat: get gh contributors from graphql using commit list

1e069dc

Merge branch 'gh-graphql' of github-cmdoret:SDSC-ORD/gimie into gh-gr…

dbd418d

…aphql

test: add integration test with various github repos

db20753

fix: handle pagination of gh commit list in GraphQL API

0441b74

cmdoret added 2 commits March 20, 2023 11:52

fix: skip bot users

f3bda2f

feat: FAST hybrid REST/GraphQL query for gh contributors

8bf59fa

fix: gh crash when no programming language detected

18760b5

cmdoret requested a review from vancauwe March 29, 2023 15:57

cmdoret merged commit 61ac947 into main Mar 31, 2023

cmdoret deleted the gh-graphql branch March 31, 2023 16:20

cmdoret mentioned this pull request May 30, 2023

Readme rework #56

Merged
