Pull index of all public repositories using the repositories endpoint #111

Open
aaronsteers opened this issue Apr 24, 2021 · 0 comments

I'd like to pull a list of all public repositories using the repositories endpoint described here: https://docs.github.com/en/rest/reference/repos#list-public-repositories

API info

I've tested the endpoint with some success. The incremental replication key is 'since', which accepts not a datetime but an incrementing integer identity column that is assigned to each repository as it is created. Apparently, this key can be used to paginate through all public repos on GitHub.
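As a rough illustration of the 'since'-based pagination described above, here is a minimal sketch using only the Python standard library. The function names (`fetch_page`, `iter_public_repos`), the `max_pages` guard, and the choice to treat an empty page as the end of the index are my own assumptions for illustration, not code from this tap.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional

API_URL = "https://api.github.com/repositories"


def fetch_page(since: int, token: Optional[str] = None) -> List[Dict]:
    """Fetch one page of public repositories with an id greater than `since`."""
    request = urllib.request.Request(f"{API_URL}?since={since}")
    request.add_header("Accept", "application/vnd.github.v3+json")
    if token:
        # Authenticated requests get a higher rate limit.
        request.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_public_repos(
    start_since: int = 0,
    fetch: Callable[[int], List[Dict]] = fetch_page,
    max_pages: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield repositories page by page, advancing `since` to the last id seen."""
    since = start_since
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch(since)
        if not page:
            # Assumption: an empty page means the end of the index was reached.
            return
        for repo in page:
            yield repo
        # The highest id on this page becomes the bookmark for the next request.
        since = max(repo["id"] for repo in page)
        pages += 1
```

Passing `fetch` as a parameter keeps the loop testable without network access; for a quick manual check, something like `next(iter_public_repos(max_pages=1))` should return the first repository in the index.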

Use case

The use case here would be to collect repo IDs that we could then use to collect follow-up metrics, specifically for repos that match the naming conventions for Singer plugins and forks: tap-<something>, target-<something>, pipelinewise-tap-<something>, etc. Once we collect the repo names and IDs, we would collect and aggregate additional GitHub metrics on usage, commits, etc.
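To make the naming-convention filter concrete, here is a small sketch of how repo names might be matched. The pattern and the `classify_repo_name` helper are hypothetical; in particular, accepting `pipelinewise-target-<something>` and lowercasing names before matching are assumptions on my part that go beyond the conventions listed above.

```python
import re
from typing import Optional, Tuple

# Matches tap-<something>, target-<something>, and the pipelinewise- variants.
SINGER_NAME_PATTERN = re.compile(r"^(?:pipelinewise-)?(tap|target)-(.+)$")


def classify_repo_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (plugin_type, plugin_name) for Singer-style names, else None."""
    match = SINGER_NAME_PATTERN.match(name.lower())
    if match is None:
        return None
    return match.group(1), match.group(2)


for candidate in ["tap-github", "pipelinewise-tap-mysql", "grit"]:
    print(candidate, classify_repo_name(candidate))
```

Only the `name` field of each record would be needed for this check, so the filter could run while paging through the index rather than after the fact.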

More info here: https://gitlab.com/meltano/singerhub/-/issues/3 and https://gitlab.com/meltano/singerhub/-/issues/11

This is part of our initiative to make taps more discoverable and to help Singer/Stitch/Meltano community members more quickly locate and evaluate options from the large (and growing) list of available taps and targets.

New vs existing tap

I know the paradigm we have here in this tap expects a set of specific repos to extract, and this application would break with that paradigm. If it is preferable to spin this off as a separate tap, I would understand that argument, and in that case I would likely try to spin off a fork specifically for the purpose of parsing the GitHub index (maybe tap-github-index?).

Expected volume of data

The volume of data is large, but not prohibitively so.

Looks like approximately 48 million public repos, according to a quick GitHub search:

https://github.com/search?q=is:public

This is up from approximately 28 million a year ago.
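For a back-of-the-envelope sense of what "large, but not prohibitively so" means, the sketch below estimates the number of requests and the wall-clock time for a full initial crawl. The page size of 100 and the limit of 5,000 requests per hour are assumptions (my understanding of typical values for an authenticated client), so treat the result as an order-of-magnitude estimate only.

```python
import math
from typing import Tuple


def estimate_full_crawl(
    total_repos: int,
    page_size: int = 100,           # assumption: repositories returned per request
    requests_per_hour: int = 5000,  # assumption: authenticated rate limit
) -> Tuple[int, float]:
    """Return (number of requests, hours needed) for one full pass."""
    requests = math.ceil(total_repos / page_size)
    hours = requests / requests_per_hour
    return requests, hours


requests, hours = estimate_full_crawl(48_000_000)
print(f"{requests:,} requests, about {hours:.0f} hours")  # 480,000 requests, about 96 hours
```

Under those assumptions the first full pull would take roughly four days on a single token, while later incremental runs would only need to fetch repositories created after the stored 'since' value.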

Sample record:

[
  {
    "id": 1,
    "node_id": "MDEwOlJlcG9zaXRvcnkx",
    "name": "grit",
    "full_name": "mojombo/grit",
    "private": false,
    "owner": {
      "login": "mojombo",
      "id": 1,
      "node_id": "MDQ6VXNlcjE=",
      "avatar_url": "https://avatars.githubusercontent.com/u/1?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/mojombo",
      "html_url": "https://github.com/mojombo",
      "followers_url": "https://api.github.com/users/mojombo/followers",
      "following_url": "https://api.github.com/users/mojombo/following{/other_user}",
      "gists_url": "https://api.github.com/users/mojombo/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/mojombo/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/mojombo/subscriptions",
      "organizations_url": "https://api.github.com/users/mojombo/orgs",
      "repos_url": "https://api.github.com/users/mojombo/repos",
      "events_url": "https://api.github.com/users/mojombo/events{/privacy}",
      "received_events_url": "https://api.github.com/users/mojombo/received_events",
      "type": "User",
      "site_admin": false
    },
    //...
  }
]
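As a sketch of how records like the sample above might be turned into Singer output, the snippet below builds RECORD messages plus a final STATE message carrying the last seen id as the 'since' bookmark. The stream name `public_repositories`, the selection of fields, and the `bookmarks` layout inside the STATE value are all assumptions for illustration; the actual tap might structure these differently.

```python
import json
from typing import Dict, Iterable, Iterator


def singer_messages(repos: Iterable[Dict], stream: str = "public_repositories") -> Iterator[Dict]:
    """Yield Singer RECORD messages, then one STATE message with the bookmark."""
    last_id = None
    for repo in repos:
        yield {
            "type": "RECORD",
            "stream": stream,
            "record": {
                "id": repo["id"],
                "name": repo["name"],
                "full_name": repo["full_name"],
                "owner_login": repo["owner"]["login"],
            },
        }
        last_id = repo["id"]
    if last_id is not None:
        # The bookmark is the integer id to pass as 'since' on the next run.
        yield {"type": "STATE", "value": {"bookmarks": {stream: {"since": last_id}}}}


sample = [{"id": 1, "name": "grit", "full_name": "mojombo/grit", "owner": {"login": "mojombo"}}]
for message in singer_messages(sample):
    print(json.dumps(message))
```

Keeping only a handful of fields per record would also help with the data volume, since most of the owner URLs in the raw payload can be derived from the login.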
aaronsteers changed the title from "Pull list of _all_ public repositories using the repositories endpoint" to "Pull index of all public repositories using the repositories endpoint" on Apr 24, 2021