# Exploring the GitHub GraphQL API 

Like most open source projects, NumPy has many open issues and pull requests that require developer attention, and not nearly enough developer-hours to address them all. Even simply reading through the issues and PRs would take hours, as there are on the order of 1000 open issues and ~200 open PRs for NumPy (as of 1/2020). It might therefore be worthwhile to attempt some sort of analysis to determine the relative *importance* of the open issues.

A simple idea would be to use the number of times a given issue has been cross-referenced as a weak proxy for *importance*, i.e., how often a particular concept is referenced by other developers. A question like this seems like a good match for GitHub's new [GraphQL API](https://developer.github.com/v4/guides/intro-to-graphql/). Note that the API is not really about building graphs; rather, it is a different way of formulating queries of GitHub's data. Nevertheless, the GraphQL API seems well-suited to the type of question we are interested in asking, as it is natural to think of issues as *nodes* and cross-references as *edges*.

## Getting started with GraphQL

I had never heard of GraphQL before (and am certainly no expert in database querying), but I found the [Intro to GitHub's GraphQL API](https://developer.github.com/v4/guides/intro-to-graphql/) to be very accessible.

The [GitHub GraphQL Explorer](https://developer.github.com/v4/explorer/) is super useful for building your own queries.

## The question

It helps to have a clear statement of the question we want answered. The simplest statement of our question would be

> *How many times has each open issue in the numpy repo been cross-referenced?*

If we cast this question in terms of our simple graph model (open issues as *nodes* and cross-references as *edges*), it is clear that we can find the answer by counting the number of edges connected to each node!
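
To make the graph model concrete, here is a minimal Python sketch of "counting the edges connected to each node". The issue numbers and cross-references are made up for illustration; they are not real NumPy data.

```python
from collections import Counter

# Hypothetical open issues (nodes), identified by issue number.
open_issues = [7, 12, 30]

# Hypothetical cross-references (edges): (referencing item, referenced issue).
cross_references = [(101, 7), (102, 7), (103, 12), (104, 7)]

# Count how many edges point at each referenced issue.
edge_counts = Counter(target for _, target in cross_references)

# Report a count for every open issue, including those never referenced.
reference_counts = {issue: edge_counts[issue] for issue in open_issues}
print(reference_counts)  # {7: 3, 12: 1, 30: 0}
```

The GraphQL query developed below asks GitHub to do this counting for us, so we never have to build the edge list ourselves.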

Of course there are a ton of ways that we could modify this question to make it more specific, necessitating a more complicated query to answer it. We'll get there, but let's start simple.

## Formulating a query to address the question

I started by loading the [sample query](https://developer.github.com/v4/guides/forming-calls/#example-query) into the explorer and slowly modifying the query by looking up features in the [schema reference](https://developer.github.com/v4/). This eventually crystallized into the following query:

```
query {
  repository(owner:"numpy", name:"numpy") {
    issues(first:100, states:OPEN) {
      edges {
        node {
          timelineItems(first:100, itemTypes:CROSS_REFERENCED_EVENT) {
            totalCount
          }
        }
      }
    }
  }
}
```
Paste the above into the [explorer](https://developer.github.com/v4/explorer/) to get proper syntax highlighting.
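
Outside the explorer, the same query can be sent as an HTTP POST to GitHub's GraphQL endpoint. The following is a rough sketch using only Python's standard library, not a polished client. The helper names `build_request` and `run_query` are my own, and the sketch assumes you have a personal access token to authenticate with.

```python
import json
import urllib.request

# GitHub's GraphQL endpoint (v4 API).
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# The query developed above.
QUERY = """
query {
  repository(owner:"numpy", name:"numpy") {
    issues(first:100, states:OPEN) {
      edges {
        node {
          timelineItems(first:100, itemTypes:CROSS_REFERENCED_EVENT) {
            totalCount
          }
        }
      }
    }
  }
}
"""


def build_request(query, token):
    """Build (but do not send) a POST request carrying the query as JSON."""
    payload = json.dumps({"query": query}).encode("utf-8")
    headers = {
        "Authorization": f"bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(GITHUB_GRAPHQL_URL, data=payload, headers=headers)


def run_query(query, token):
    """Send the query and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(query, token)) as response:
        return json.load(response)
```

With a token stored in an environment variable (the name `GITHUB_TOKEN` is just an assumption here), calling `run_query(QUERY, os.environ["GITHUB_TOKEN"])` should return the response as a nested dictionary.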

Building the graph of issues is [covered in the original example](https://developer.github.com/v4/guides/forming-calls/#example-query), so I'll only elaborate on the part that's different: the timelineItems.

Note that `issues` is of type [IssueConnection](https://developer.github.com/v4/object/issueconnection/), whose edges terminate on nodes of type [Issue](https://developer.github.com/v4/object/issue/). The Issue object has a [timelineItems](https://developer.github.com/v4/object/issuetimelineitemsconnection/) connection that lists all the edges to events (commits, comments, references, etc.) associated with that issue. Furthermore, the `timelineItems` connection accepts an `itemTypes` argument that filters by [item type](https://developer.github.com/v4/enum/issuetimelineitemsitemtype/). This allows us to follow only the connections to nodes that represent cross-reference events (as opposed to commits, comments, etc.).

Finally, a GraphQL query must terminate in scalar values. Fortunately, the `timelineItems` connection provides a `totalCount` field that gives the total number of connections as an integer. In conjunction with the `itemTypes:CROSS_REFERENCED_EVENT` filter, this value represents the total number of cross-reference events for each issue.
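
Since the JSON response mirrors the shape of the query, pulling out the counts is a matter of walking the same nesting. Here is a small Python sketch run against a hand-written response. The counts are invented and the function name `cross_reference_counts` is hypothetical.

```python
# Hypothetical response, shaped like the query; the counts are made up.
sample_response = {
    "data": {
        "repository": {
            "issues": {
                "edges": [
                    {"node": {"timelineItems": {"totalCount": 3}}},
                    {"node": {"timelineItems": {"totalCount": 0}}},
                    {"node": {"timelineItems": {"totalCount": 12}}},
                ]
            }
        }
    }
}


def cross_reference_counts(response):
    """Return the totalCount of each issue, in the order the issues were returned."""
    edges = response["data"]["repository"]["issues"]["edges"]
    return [edge["node"]["timelineItems"]["totalCount"] for edge in edges]


counts = cross_reference_counts(sample_response)
print(counts)  # [3, 0, 12]
```

Note that the query as written only asks for `totalCount`, so the counts are tied to issues by position alone. Requesting an identifying field such as the issue's `number` next to `timelineItems` would presumably be the natural way to label them.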

## Pagination