This repository has been archived by the owner on Mar 14, 2023. It is now read-only.

Do something with the contributor list #41

Open
nrc opened this issue Jan 20, 2015 · 5 comments

Comments

@nrc
Member

nrc commented Jan 20, 2015

We currently maintain a list of contributors per repo which supplements the list we pull from GitHub. We could be more efficient (albeit more stateful) by updating this list with contributors as we go along. We could either do this manually, or use a DB instead. See some discussion in #38

@davidalber
Collaborator

Apologies. This is a long comment. There's a lot of context that goes with it.

So... way back in #38 it was identified that some users were, oddly, not appearing in the contributors response. This was fixed by adding a short list of the two users known to be missing from that response. The problem goes well beyond those two users. Let's take a look at the main Rust repo. Here's what GitHub has to say about the number of contributors:

[Image: screenshot of the contributor count GitHub displays for rust-lang/rust]

Let's take a look at the contributors request (specifically the header in the response):

$ http --print=h "https://api.github.com/repos/rust-lang/rust/contributors?per_page=100"
HTTP/1.1 200 OK
...
Link: <https://api.github.com/repositories/724712/contributors?per_page=100&page=2>; rel="next", <https://api.github.com/repositories/724712/contributors?per_page=100&page=5>; rel="last"
...

That shows there are only five pages of 100 contributors each, so at most 500 contributors! That must mean that Highfive is greeting repeat contributors all of the time in rust-lang/rust. (You probably already see this, but I don't pay very close attention there right now.)
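For anyone who wants to read the page count programmatically rather than by eye, here is a minimal sketch using only the Python standard library. `last_page` is a hypothetical helper name and is not part of Highfive; it assumes the header has the simple comma-separated shape shown above.

```python
import re
from urllib.parse import urlparse, parse_qs


def last_page(link_header):
    """Return the page number of the rel="last" link, or None if absent."""
    # Assumes no commas inside the URLs, which holds for the headers above.
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if match and match.group(2) == "last":
            query = parse_qs(urlparse(match.group(1)).query)
            return int(query["page"][0])
    return None
```

Multiplying the result by the `per_page` value gives an upper bound on the number of entries the endpoint returns.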

What is going on?! Well, let's try the anon parameter described in the contributors request documentation.

$ http --print=h "https://api.github.com/repos/rust-lang/rust/contributors?per_page=100&anon=1"
HTTP/1.1 200 OK
...
Link: <https://api.github.com/repositories/724712/contributors?per_page=100&anon=1&page=2>; rel="next", <https://api.github.com/repositories/724712/contributors?per_page=100&anon=1&page=24>; rel="last"
...

That shows there are twenty-four pages of 100 contributors each. It also lets us finally see user Aatch (I'm not @-ing him since I would guess he's not inclined to participate in this discussion).

$ http "https://api.github.com/repos/rust-lang/rust/contributors?per_page=100&anon=1" | jq 'map(select(.name == "James Miller"))'
[
  {
    "email": "bladeon@gmail.com",
    "name": "James Miller",
    "type": "Anonymous",
    "contributions": 159
  }
]

It seems like GitHub has collapsed all of his contributions under that one email address, because I didn't find his other email address in any of the responses. The numbers don't quite add up, though.
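The jq filter above can be mirrored in Python when the response has already been decoded from JSON. This is only a sketch: `find_by_name` is a hypothetical helper, and it relies on the observation from the output above that anonymous entries carry a `name` key, while assuming that other entries may not have one.

```python
def find_by_name(contributors, name):
    """Return the entries whose "name" field equals the given name.

    Entries without a "name" key are skipped, like jq's `.name == ...`
    comparison against a missing (null) field.
    """
    return [entry for entry in contributors if entry.get("name") == name]
```

Note that this only searches one decoded page; covering all twenty-four pages would mean repeating it per page.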

Anyway, twenty-four pages is closer, but now it's higher than expected. Here's the critical part from the documentation:

GitHub identifies contributors by author email address. This endpoint groups contribution counts by GitHub user, which includes all associated email addresses. To improve performance, only the first 500 author email addresses in the repository link to GitHub users. The rest will appear as anonymous contributors without associated GitHub user information.

This seems like a critical issue with the way Highfive is currently identifying first-time contributors in popular repositories since Highfive is attempting to match usernames.
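To make the failure mode concrete, here is a small sketch of username matching against a contributors response. It is not Highfive's actual code; `known_logins` and `looks_new_by_username` are hypothetical names, and the sample entry shapes follow the anonymous entry shown earlier plus an assumed `login` key on linked users.

```python
def known_logins(contributors):
    """Collect the logins of entries that are linked to a GitHub user."""
    # Anonymous entries have "name" and "email" but no "login".
    return {entry["login"] for entry in contributors if "login" in entry}


def looks_new_by_username(username, contributors):
    """True if the username is absent from the linked contributor entries."""
    return username not in known_logins(contributors)


sample = [
    {"login": "alice", "type": "User", "contributions": 10},
    {"email": "bladeon@gmail.com", "name": "James Miller",
     "type": "Anonymous", "contributions": 159},
]
# Aatch has 159 commits, but they sit in an anonymous entry, so a
# username check reports a new contributor (a false positive).
aatch_looks_new = looks_new_by_username("Aatch", sample)
```

Any contributor whose email falls outside the first 500 linked addresses ends up in the same position as the anonymous entry here.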

Here are a couple alternatives we could look at:

  • Use the commit search API. That appears to allow searching the master branch for commits by email address, for instance. This sounds pretty perfect, except the API is a preview and is subject to change without notice.
  • The payloads Highfive receives via the webhook contain an attribute called author_association. Its possible values are listed in GitHub's documentation and include things like CONTRIBUTOR, FIRST_TIMER, and FIRST_TIME_CONTRIBUTOR. That looks pretty nice, but I haven't been able to fully grok its behavior yet. For instance, it behaves strangely in this repository. I'm guessing it's because it's a fork, but you'll note I don't have the little "Contributor" badge next to my name at the top of this comment, so at least the weirdness is consistent with the UI.
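The second alternative could look roughly like the sketch below. It assumes a pull request webhook payload that carries the field at `pull_request.author_association`, and it assumes that FIRST_TIMER and FIRST_TIME_CONTRIBUTOR are the values worth greeting; `is_first_time_author` is a hypothetical helper, not existing Highfive code.

```python
# Assumption: these two values are the ones that indicate a newcomer.
FIRST_TIME_VALUES = {"FIRST_TIMER", "FIRST_TIME_CONTRIBUTOR"}


def is_first_time_author(payload):
    """Check the author_association of a pull request webhook payload."""
    association = payload.get("pull_request", {}).get("author_association")
    return association in FIRST_TIME_VALUES
```

Given the odd behavior on forks mentioned above, a result of False here would not by itself prove that the author has contributed before.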

@davidalber
Collaborator

> That must mean that Highfive is greeting repeat contributors all of the time in rust-lang/rust. (You probably already see this, but I don't pay very close attention there right now.)

There's an example of this in rust-lang/rust#49329. An earlier, already merged PR from that user is rust-lang/rust#48076.

@davidalber
Collaborator

The commit search API looks like it does what Highfive really wants (easily determine if a given user has commits on the default branch). Like the author_association field, however, it does not work for forks. I've submitted a question to figure out if there's a way around this.
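As a rough sketch of what such a lookup might involve, the snippet below only builds the request and does not send it. The `repo:` and `author:` search qualifiers and the `application/vnd.github.cloak-preview` media type are my understanding of the preview API at the time and should be treated as assumptions; `commit_search_request` is a hypothetical helper.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/commits"
# Assumption: media type used to opt in to the commit search preview.
PREVIEW_ACCEPT = "application/vnd.github.cloak-preview"


def commit_search_request(owner, repo, username):
    """Build the URL and headers for a commit search by author."""
    query = "repo:{}/{} author:{}".format(owner, repo, username)
    url = "{}?{}".format(SEARCH_URL, urlencode({"q": query}))
    headers = {"Accept": PREVIEW_ACCEPT}
    return url, headers
```

The returned pair can be passed to whatever HTTP client is in use; authentication and rate limiting are left out of this sketch.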

@nrc
Member Author

nrc commented Mar 25, 2018

Given that GH can see this because it gives you the option of highlighting PRs from first-time contributors, I assume there must be an API for it. Probably commit search is it, but there might also be something in the GraphQL API?

I don't understand the fork issue. Does that just mean it doesn't work for Rust Highfive, for example?

davidalber added a commit to davidalber/highfive that referenced this issue Mar 25, 2018
The current approach to checking for new contributors does not work
in repositories with more than 500 contributors because the
contributors list API endpoint only associates the first 500
committer emails with GitHub usernames. See rust-lang#41 for more
explanation.

This commit modifies the new contributor check to use the commit
search API endpoint. A drawback of this approach is that (at least
currently) that endpoint does not provide useful information for
fork repositories, which is why the new contributor check has a
condition on whether the repository is a fork.
@davidalber
Collaborator

davidalber commented Mar 25, 2018

> Given that GH can see this because it gives you the option of highlighting PRs from first-time contributors, I assume there must be an API for it. Probably commit search is it, but there might also be something in the GraphQL API?

There might be. I can take a look.

> I don't understand the fork issue. Does that just mean it doesn't work for Rust Highfive, for example?

That's right. Commit searches on fork repositories like rust-lang-nursery/highfive say that everyone has made zero commits.
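A sketch of how the fork condition could be handled when interpreting a commit search response follows. It assumes the decoded response has a `total_count` field, and it chooses to return None for forks to mean "cannot tell"; that choice and the name `is_new_contributor` are illustrative and not taken from the referenced commit.

```python
def is_new_contributor(repo_is_fork, search_response):
    """Interpret a decoded commit search response for one author.

    Returns None for forks, where the search reports zero commits for
    everyone and therefore says nothing useful.
    """
    if repo_is_fork:
        return None
    return search_response.get("total_count", 0) == 0
```

Treating None as "skip the first-time greeting" avoids welcoming every author on a fork as a newcomer.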
