# Mining Software Repositories: Data Extraction

**Note**: This file contains blank code cells. Code will be provided and discussed during the Thursday, October 8 class meeting (the lecture labeled _"Mining Software Repositories (Practicum)"_). However, solutions will not be posted on Canvas.

This notebook guides you through collecting and synthesizing software repository information from GitHub. To structure the process, we will attempt to answer a data-driven question: _Who reviews my changes?_

## Accessing the GitHub API

GitHub provides two different APIs for accessing their data:
  * **GitHub API v3** - A REST-based interface for interacting with GitHub data using HTTP methods such as `GET`, `PUT`, `POST`, `PATCH`, `DELETE`. ([spec](https://docs.github.com/en/rest))
  * **GitHub API v4** - A GraphQL-based interface for interacting with GitHub data using a schema-based syntax for querying graph data structures. ([spec](https://docs.github.com/en/graphql))

For all in-class demonstrations and assignments, we will use the GitHub REST API (v3). Unless absolutely necessary, refrain from submitting code that uses the GitHub GraphQL API (v4).

The number of calls to the GitHub API is rate-limited: 60 calls/hour for unauthenticated users and 5000 calls/hour for authenticated users. To authenticate, we must first create an API access token.

To start that process, go to [https://github.com/settings/tokens/new](https://github.com/settings/tokens/new). GitHub might prompt you to log in or authenticate as part of this process. Once authenticated, select `Generate new token` and obtain a personal access token (PAT) with no scopes selected (i.e., leave the scope fields unchecked), which allows read-only access to public information. Further documentation is available at ["Creating a personal access token"](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token).

> **WARNING**: Treat your tokens like passwords and keep them secret. When using the GitHub API Mining Utility, set the token during instantiation, but do not publish the token in any Python programs or IPython/Jupyter notebooks. 

Once the token is created, copy/paste it as the value for the `token` variable and add your GitHub username to the `userName` variable below:

In [4]:
# removed username/token for privacy
userName = 'INSERT-USERNAME'
token = 'INSERT-TOKEN'
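
One way to follow the warning above is to keep the token out of the notebook entirely and read it from the environment. The sketch below is only an illustration: the variable names `GITHUB_USER` and `GITHUB_TOKEN` and the helper `load_credentials` are assumptions made for this example, not names defined by GitHub or by the mining utility.

```python
import os

def load_credentials(env=os.environ):
    """Read the GitHub username and token from environment variables.

    GITHUB_USER and GITHUB_TOKEN are names chosen for this sketch (an assumption).
    """
    user = env.get("GITHUB_USER")
    tok = env.get("GITHUB_TOKEN")
    if not user or not tok:
        raise RuntimeError("Set GITHUB_USER and GITHUB_TOKEN before running the notebook")
    return user, tok

# Example with an explicit mapping instead of the real environment
print(load_credentials({"GITHUB_USER": "INSERT-USERNAME", "GITHUB_TOKEN": "INSERT-TOKEN"}))
```

In the notebook, `userName, token = load_credentials()` would then replace the hard-coded assignments.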

## Interacting with the GitHub API

In order to reduce some of the friction of developing a repository mining pipeline, we are providing a simplified Python utility library that handles the authentication, pagination, and formatting of `GET` requests against the GitHub REST API (v3). You can find installation instructions and documentation on the project GitHub page: https://github.com/EPICLab/miner-utils

In order to pull that `miner-utils` Python library into this JupyterLab notebook, we can install it (along with some other useful libraries) by executing the following commands:

In [None]:
# These commands install required dependencies into your environment
!pip install git+https://github.com/EPICLab/miner-utils
!pip install gitpython
!pip install pandas

In [2]:
#useful imports (especially when restarting)
from minerutils import GitHub
import pandas as pd
from git import Repo
import json

The first step for any GitHub repository mining using the `miner-utils` library is to create a `GitHub` _object_ and initialize it with our authentication details:

In [5]:
gh = GitHub(userName, token)
#print(gh)

## Downloading GitHub repository data

Before beginning to compose any GitHub API requests, it is important to know what REST endpoints are available for the GitHub API: https://docs.github.com/en/free-pro-team@latest/rest/reference/git

We will use the `get(url, params={}, headers={}, perPage=100)` command provided by `miner-utils` to format `GET` requests that properly query the GitHub REST API (v3). As an example, we would use the following command to determine our current API rate limit:

In [6]:
limit = gh.get('rate_limit')
#parsed = json.loads(limit)
#print(json.dumps(parsed, indent = 4, sort_keys=True))

#print(limit)
print(limit['rate']['limit'])

5000
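
The `rate` object in the response also reports how many calls remain and when the quota resets (`reset` is a UTC timestamp in epoch seconds). The following sketch works on a hand-written sample shaped like that response, so it runs without network access; the numbers in `sample` are made up, and `seconds_until_reset` is a hypothetical helper.

```python
import time

def seconds_until_reset(rate, now=None):
    """Seconds left until the rate-limit window resets (never negative)."""
    now = time.time() if now is None else now
    return max(0, int(rate["reset"] - now))

# Hand-written sample; in the notebook this would be gh.get('rate_limit')
sample = {"rate": {"limit": 5000, "remaining": 4987, "reset": 1602180000}}
rate = sample["rate"]

print(rate["remaining"], "of", rate["limit"], "calls left")
print("resets in", seconds_until_reset(rate, now=1602178200), "seconds")
```

Checking `remaining` before a long download gives an early indication of whether the download can finish within the current window.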


For the rest of this demonstration, we will use the [`scala/scala`](https://github.com/scala/scala) repository as our example target. To answer the question _Who reviews my changes?_, we need to collect information about all pull requests made in this project. To do this, we need to use the `pulls` REST endpoint (see the [Pull Requests API](https://docs.github.com/en/rest/reference/pulls)).

Write your code to download _all_ the pull requests for the `github.com/scala/scala` project in the code box below. Please note that it will take a few minutes to download the whole dataset.

In [7]:
# download all the pull requests for github.com/scala/scala
pulls = gh.get("/repos/scala/scala/pulls", params={"state": "all"})

# how many pull requests are there?
print("length:", len(pulls))


length: 9187
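
List endpoints such as `pulls` return results in pages of at most 100 items, so 9187 pull requests take roughly 92 requests. `miner-utils` handles this for us; the sketch below only illustrates the general idea of finding the next page from the HTTP `Link` response header and is not the library's implementation. The header value is a made-up example in the documented format, and `next_page_url` is a hypothetical helper.

```python
import re

def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None on the last page."""
    if not link_header:
        return None
    for match in re.finditer(r'<([^>]+)>;\s*rel="([^"]+)"', link_header):
        url, rel = match.groups()
        if rel == "next":
            return url
    return None

# Illustrative header value (not captured from a real response)
header = (
    '<https://api.github.com/repos/scala/scala/pulls?state=all&per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repos/scala/scala/pulls?state=all&per_page=100&page=92>; rel="last"'
)
print(next_page_url(header))
```

A paginating client keeps requesting the `next` URL until the helper returns `None`.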


To reduce the number of GitHub API requests that we make, we can save this data to a file so that we don't have to rerun the above commands every time we want to work with it.

In [8]:
# write the data to a file so we don't have to download it again
# (use either the miner-utils helper or the plain-json variant, not both at once)
gh.writeData("pulls.json", pulls)

# plain-json alternative for writing:
# with open("pulls.json", "w") as f:
#     json.dump(pulls, f)

# if you already have the data extracted, you can simply load it into the environment
pulls = gh.readData("pulls.json")

# plain-json alternative for reading:
# with open("pulls.json", "r") as f:
#     pulls = json.load(f)
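
The save-then-load pattern can be wrapped in a small caching helper so the download only happens when no saved file exists. This is a minimal sketch using only the standard library; `load_or_fetch` and `fake_fetch` are hypothetical names, and the stand-in fetch function replaces the real API call so the example runs offline.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached JSON from `path`; on a miss, call `fetch()` and save its result."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    data = fetch()
    with open(path, "w") as f:
        json.dump(data, f)
    return data

# Demonstration with a stand-in fetch function (no network access needed)
calls = []
def fake_fetch():
    calls.append("request")
    return [{"number": 1}, {"number": 2}]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "pulls.json")
    first = load_or_fetch(cache, fake_fetch)   # cache miss: calls fake_fetch
    second = load_or_fetch(cache, fake_fetch)  # cache hit: reads the file

print(len(calls))
```

In the notebook, the second argument could be something like `lambda: gh.get("/repos/scala/scala/pulls", params={"state": "all"})`.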

## Processing the GitHub API data

Simply retrieving the data from a GitHub API endpoint is no more powerful than using a browser and pointing it at those same endpoints (e.g. [https://api.github.com/rate_limit](https://api.github.com/rate_limit)). Therefore, we need to refine our queries into something more meaningful.

To determine who makes pull requests in the `scala/scala` project, we can use the `pulls` data that we previously retrieved. As a simplified example, let's examine all of the pull requests made by a particular developer (we will use `lrytz` as our example developer). Since we already retrieved all of the data necessary to answer this question, we don't need to formulate another API request; regular Python functionality is enough.

In [9]:
# how many pull requests did the user `lrytz` submit?
byAuthor = []
for pull in pulls:
    if(pull['user']['login'] == 'lrytz'):
        byAuthor.append(pull)
        
# number of PRs
print("Number of lrytz's PR's: ", len(byAuthor))

# # getting the PID and PR state of lrytz's pull requests
# for pull in byAuthor[1:10]:
#     print("ID: ", pull['id'])
# #     print("Username: ", pull['user']['login'])
#     print("PR State: ", pull['state'])
#     print("\n")

    
# getting all the comments on the second pull request in byAuthor (index 1)
comments_url = byAuthor[1]['comments_url']
comments = gh.get(comments_url)

# printing the first comment associated with `lrytz`
print("\n***First comment**: \n", comments[0]['body'])

# printing all the comments in byAuthor[1]
print("Number of comments: ", len(comments))

for index, comment in enumerate(comments):
    print("\ncomment #", index+1)
    print(comment['body'])

Number of lrytz's PR's:  522

***First comment**: 
 I'm personally in favor of the simpler solution, always keep the field. Even dotty agrees

```
➜  sandbox git:(t12002b) dotc ../test/files/run/t12002.scala && dotr Test
good boy!
List()
List()
List(private int C.x)
List(private int D.x)
List(private int E.x)
List(private int F.x)
List(private int G.x)
List(private final int H.x)
List(private int I.x)
```

Number of comments:  4

comment # 1
I'm personally in favor of the simpler solution, always keep the field. Even dotty agrees

```
➜  sandbox git:(t12002b) dotc ../test/files/run/t12002.scala && dotr Test
good boy!
List()
List()
List(private int C.x)
List(private int D.x)
List(private int E.x)
List(private int F.x)
List(private int G.x)
List(private final int H.x)
List(private int I.x)
```


comment # 2
On one of the PRs, I mention that doti doesn't do it, but it wasn't clear what "inferred `private[this]`" would entail in doti.

comment # 3
I agree there is not much benefit in this 
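
The filtering loop in the cell above can also be written as a list comprehension, and `collections.Counter` gives the pull request count for every author at once. The records in `sample_pulls` are invented for illustration and contain only the fields used here; `pulls_by` is a hypothetical helper.

```python
from collections import Counter

# Invented sample records with the same nesting as the `pulls` data
sample_pulls = [
    {"number": 1, "state": "closed", "user": {"login": "alice"}},
    {"number": 2, "state": "open", "user": {"login": "bob"}},
    {"number": 3, "state": "closed", "user": {"login": "alice"}},
]

def pulls_by(pulls, login):
    """Return the pull requests whose author has the given login."""
    return [p for p in pulls if p["user"]["login"] == login]

# Count pull requests per author in a single pass
authors = Counter(p["user"]["login"] for p in sample_pulls)

print(len(pulls_by(sample_pulls, "alice")))
print(authors.most_common(1))
```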

## Retrieving additional details from GitHub

The `pulls` data that we previously retrieved contains all of the basic information about pull requests made within the `scala/scala` project, but details for any particular pull request are missing (this is a design consideration of the API to reduce data size and possible response timeouts).

To fill in all of the details of each pull request, we need to call the Pull Requests API using the PR number (per the ["Get a pull request"](https://docs.github.com/en/rest/reference/pulls#get-a-pull-request) documentation).

In [10]:
# creating new array to hold PR specific details.
# byAuthor only has a list of PR, not details for EACH PR
completePRByAuthor = []
for pull in byAuthor:
    pr = gh.get(pull['url'])
    completePRByAuthor.append(pr)
    
# print(completePRByAuthor[0]['user'])


We can now use the complete `pulls` data to dive further into data analysis and answer our original question, _Who reviews my changes?_

However, we first need to clarify our nomenclature so that we can map our data to the question. Pull requests contain changes made by a developer, so we will equate a pull request with _changes_. The person that merges a pull request is typically the person that _reviewed_ and accepted the changes contained within it. Therefore, to answer our question, we need to determine the user that most often merges pull requests created by our example user (`lrytz`).

In [11]:
# which user most often merged (or reviewed) PRs made by 'lrytz'?

# getting the authors who merged lrytz's PR's
mergedBy = []
for pull in completePRByAuthor:
    if(pull['merged_by'] is not None):
        mergedBy.append(pull['merged_by']['login'])
        
# print(mergedBy)
print("Total merges:", len(mergedBy))

# getting the number of PR's lrytz self merged
# own_merged = []
lrytz_count = 0
for pull in completePRByAuthor:
    if(pull['merged_by'] is not None and pull['merged_by']['login'] == 'lrytz'):
#         own_merged.append(pull['merged_by']['login'])
        lrytz_count += 1
            
# print(len(own_merged))
print("Lrytz count: ", lrytz_count)

# find the users who merged the most PR's (python way)
max_merge = max(mergedBy, key = mergedBy.count)
print("User who merged the most of `lrytz` PR's: ", max_merge)

Total merges: 455
Lrytz count:  198
User who merged the most of `lrytz` PR's:  lrytz
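
Because the top merger here is the author, it can be useful to rank all mergers and optionally leave the author out. The sketch below does this with `collections.Counter`; the logins in `sample` are invented, and `top_mergers` is a hypothetical helper that would take the `mergedBy` list from the cell above.

```python
from collections import Counter

def top_mergers(merged_by, author=None):
    """Rank logins by merge count, optionally leaving out the PR author."""
    counts = Counter(login for login in merged_by if login != author)
    return counts.most_common()

# Invented logins standing in for the mergedBy list
sample = ["alice", "bob", "alice", "carol", "bob", "alice"]

print(top_mergers(sample))
print(top_mergers(sample, author="alice"))
```

`Counter` counts the list in one pass, whereas `max(mergedBy, key=mergedBy.count)` rescans the list for every element.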


In [54]:
# which of those users merged the most of those PRs? (pandas way)
import pandas

# put mergedBy array into a df
mergedByDF = pandas.DataFrame(mergedBy, columns=["merged_by"])

# use groupby to get the counts for each user and to_frame to add column to the df
countedDF = mergedByDF.groupby("merged_by").size().to_frame('size')

# sort the df by descending order to get who merged `lrytz` PR's the most
sortedDF = countedDF.sort_values(by=["size"], ascending=False)

# print!! lrytz merged their own PRs the most!
sortedDF

| merged_by | size |
| --- | --- |
| lrytz | 198 |
| adriaanm | 94 |
| retronym | 70 |
| SethTisue | 42 |
| gkossakowski | 14 |
| paulp | 13 |
| jsuereth | 9 |
| dwijnand | 6 |
| szeiger | 6 |
| JamesIry | 2 |


## Extracting information from the Git repository

Although we have answered our original question, we might also want to explore datapoints that exist within the structure of Git repositories themselves. The following examples indicate how to use the `GitPython` library to begin formulating queries that return information about the files and code contained within a repository.

Although the documentation for `GitPython` is extensive ([documentation](https://gitpython.readthedocs.io/en/stable/intro.html)), it is also difficult to parse. Therefore, I recommend DevDungeon's [Working with Git Repositories in Python](https://www.devdungeon.com/content/working-git-repositories-python) for beginners; most of the following commands can be found in this guide.

To begin with, you must have a working version of the Git repository cloned onto your system.

In [1]:
import git

# Clone via HTTPS (this might take a while, but it only has to be done once)
repo = git.Repo.clone_from('https://github.com/scala/scala.git', '../scala')

GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v https://github.com/scala/scala.git ../scala
  stderr: 'fatal: destination path '../scala' already exists and is not an empty directory.
'

In [3]:
# load an existing local repository
my_repo = git.Repo('../scala')
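
The `GitCommandError` shown earlier appears whenever the clone cell is rerun after the repository already exists locally. A minimal sketch of a guard that clones only when needed follows; `needs_clone` and `open_or_clone` are hypothetical helpers, and only `git.Repo.clone_from` and `git.Repo` come from `GitPython`.

```python
import os
import tempfile

def needs_clone(path):
    """True when `path` is missing or empty, i.e. cloning into it would succeed."""
    return not os.path.isdir(path) or not os.listdir(path)

def open_or_clone(url, path):
    """Clone `url` into `path` on first use; otherwise open the existing repository."""
    import git  # GitPython, imported here so needs_clone() works without it
    if needs_clone(path):
        return git.Repo.clone_from(url, path)
    return git.Repo(path)

# Offline demonstration of the directory check only
with tempfile.TemporaryDirectory() as tmp:
    empty_dir = needs_clone(tmp)
    missing_dir = needs_clone(os.path.join(tmp, "scala"))
    open(os.path.join(tmp, "README"), "w").close()
    populated_dir = needs_clone(tmp)

print(empty_dir, missing_dir, populated_dir)
```

In the notebook this would be called as `my_repo = open_or_clone('https://github.com/scala/scala.git', '../scala')`. Note that a non-empty directory that is not a Git repository would still make `git.Repo(path)` raise an error.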

To begin answering data-driven questions about the repository, we must first determine which part of the Git history needs to be examined for a given question. For example, to find out which files were changed in a specific commit, we can diff that commit against its parent (`<hash>~1..<hash>`) and list only the file names.

In [5]:
commit_hash = '1fb249c635d5748d5de5c96b9c7eb93a2c29f830'
diff = my_repo.git.diff(commit_hash+'~1..'+commit_hash, name_only=True)
print(diff)

build.sbt
src/compiler/scala/tools/nsc/backend/jvm/BCodeHelpers.scala
src/compiler/scala/tools/nsc/backend/jvm/analysis/AliasingFrame.scala
src/compiler/scala/tools/nsc/backend/jvm/analysis/BackendUtils.scala
src/compiler/scala/tools/nsc/backend/jvm/opt/LocalOpt.scala
src/library/scala/Array.scala
src/library/scala/collection/concurrent/TrieMap.scala
src/library/scala/collection/mutable/ArrayBuilder.scala
src/library/scala/collection/mutable/BitSet.scala
src/library/scala/collection/mutable/WrappedArrayBuilder.scala
src/library/scala/collection/parallel/mutable/package.scala
src/reflect/scala/reflect/internal/pickling/PickleBuffer.scala
src/repl-jline/scala/tools/nsc/interpreter/jline/JLineDelimiter.scala
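
The diff is returned as a single newline-separated string, so it usually needs to be split into paths before further analysis. The sketch below groups changed files by their leading directories; `diff_output` holds a few of the paths printed above, and `changed_by_area` is a hypothetical helper.

```python
from collections import Counter

# A few of the paths from the output above; in the notebook, use the `diff` string
diff_output = """build.sbt
src/compiler/scala/tools/nsc/backend/jvm/BCodeHelpers.scala
src/library/scala/Array.scala
src/library/scala/collection/mutable/BitSet.scala"""

def changed_by_area(diff_text, depth=2):
    """Count changed files per leading path prefix of `depth` components."""
    paths = [line for line in diff_text.splitlines() if line.strip()]
    return Counter("/".join(p.split("/")[:depth]) for p in paths)

areas = changed_by_area(diff_output)
for area, count in areas.most_common():
    print(area, count)
```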
