
GHPR Tools

Tools for the GHPR dataset.

GHPR contains data about GitHub pull requests that have fixed one or more issues. Each instance of GHPR describes one issue and the pull request that fixed it.

Requirements

Install the requirements from requirements.txt:

pip3 install -r requirements.txt

GHPR Crawler

GHPR Crawler uses the GitHub REST API to find pull requests that have fixed one or more issues on GitHub. It saves such issues and pull requests as JSON files.

The raw data for GHPR is an example of data generated by the GHPR Crawler.

The flow is as follows (a Python sketch of the same flow appears after the list):

  • For repository R:
    • For each page G in the list of closed pull requests in R, from oldest to newest:
      • Optionally save G.
      • For each simple pull request p in G:
        • If p is merged:
          • Let L be the list of issue numbers that are linked by p using a GitHub keyword and are in R.
          • If L is not empty:
            • Fetch pull request P with the pull request number of p.
            • Set the linked_issue_numbers property of P to L.
            • Save P.
            • For each issue number i in L:
              • Fetch issue I with the issue number i.
              • Save I.
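
Below is a minimal Python sketch of this flow, written against the GitHub REST API with the requests library. It illustrates the flow only and is not the crawler's actual implementation: the keyword pattern, error handling, and the save_json helper are simplified assumptions, retries are omitted, and pages of pull requests are not saved.

import json
import os
import re

import requests

API = 'https://api.github.com'
# Simplified pattern for GitHub closing keywords followed by "#<number>"
# (cross-repository references are ignored, so matched issues are in R).
LINK_RE = re.compile(
    r'\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)', re.IGNORECASE)

def save_json(obj, path):
    # Hypothetical helper: write obj as JSON, creating parent directories.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        json.dump(obj, f, indent=2)

def crawl_sketch(owner, repo, token=None, per_page=100, start_page=1):
    headers = {'Accept': 'application/vnd.github.v3+json'}
    if token:
        headers['Authorization'] = 'token ' + token
    page = start_page
    while True:
        # Page G of closed pull requests in R, oldest first.
        resp = requests.get(
            f'{API}/repos/{owner}/{repo}/pulls',
            headers=headers,
            params={'state': 'closed', 'sort': 'created', 'direction': 'asc',
                    'per_page': per_page, 'page': page})
        resp.raise_for_status()
        simple_pulls = resp.json()
        if not simple_pulls:
            break
        for p in simple_pulls:
            if p.get('merged_at') is None:
                continue  # p is not merged
            # L: issue numbers linked by p using a GitHub keyword.
            linked = sorted({int(n) for n in LINK_RE.findall(p.get('body') or '')})
            if not linked:
                continue
            # Fetch the full pull request P and attach L.
            pull = requests.get(f'{API}/repos/{owner}/{repo}/pulls/{p["number"]}',
                                headers=headers).json()
            pull['linked_issue_numbers'] = linked
            save_json(pull, f'repos/{owner}/{repo}/pull-{p["number"]}.json')
            for i in linked:
                # Fetch and save each linked issue I.
                issue = requests.get(f'{API}/repos/{owner}/{repo}/issues/{i}',
                                     headers=headers).json()
                save_json(issue, f'repos/{owner}/{repo}/issue-{i}.json')
        page += 1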

Crawler CLI

Run python3 crawler.py --help for usage.

$ python3 crawler.py --help
usage: crawler.py [-h] [-t TOKEN] [-d DST_DIR] [-s START_PAGE] [-p PER_PAGE]
                  [-a] [-m MAX_REQUEST_TRIES] [-r REQUEST_RETRY_WAIT_SECS]
                  [-l LOG_FILE]
                  repo [repo ...]

Crawl GitHub repositories to find and save issues and pull requests that have
fixed them. The crawler goes through the pages of closed pull requests, from
oldest to newest. If a pull request is merged and links one or more issues in
its description, the pull request and its linked issue(s) will be fetched and
saved as JSON files. The list of linked issue numbers is added to the fetched
pull request JSON object with the key "linked_issue_numbers". The JSON files
will be saved in DEST_DIR/owner/repo. The directories will be created if they
do not already exist. The naming pattern for files is issue-N.json for issues,
pull-N.json for pull requests, and pulls-page-N.json for pages of pull
requests. Any existing file will be overwritten. The GitHub API limits
unauthenticated clients to 60 requests per hour. The rate limit is 5,000
requests per hour for authenticated clients. For this reason, you should
provide a GitHub OAuth token if you want to crawl a large repository. You can
create a personal access token at https://github.com/settings/tokens.

positional arguments:
  repo                  full repository name, e.g., "octocat/Hello-World" for
                        the https://github.com/octocat/Hello-World repository

optional arguments:
  -h, --help            show this help message and exit
  -t TOKEN, --token TOKEN
                        your GitHub OAuth token, can also be provided via a
                        GITHUB_OAUTH_TOKEN environment variable (default:
                        None)
  -d DST_DIR, --dst-dir DST_DIR
                        directory for saving JSON files (default: repos)
  -s START_PAGE, --start-page START_PAGE
                        page to start crawling from (default: 1)
  -p PER_PAGE, --per-page PER_PAGE
                        pull requests per page, between 1 and 100 (default:
                        100)
  -a, --save-pull-pages
                        save the pages of pull requests (default: False)
  -m MAX_REQUEST_TRIES, --max-request-tries MAX_REQUEST_TRIES
                        number of times to try a request before terminating
                        (default: 100)
  -r REQUEST_RETRY_WAIT_SECS, --request-retry-wait-secs REQUEST_RETRY_WAIT_SECS
                        seconds to wait before retrying a failed request
                        (default: 10)
  -l LOG_FILE, --log-file LOG_FILE
                        file to write logs to (default: None)
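
For example, a hypothetical invocation that crawls the octocat/Hello-World repository with an authenticated client and writes logs to a file (the token value is a placeholder):

$ export GITHUB_OAUTH_TOKEN=<your-token>
$ python3 crawler.py --dst-dir repos --log-file crawler.log octocat/Hello-World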

Crawler API

See crawler.py.

class Crawler(object):
    """Crawl GitHub repositories to find and save merged pull requests and the issues
    they have fixed.

    The crawler goes through the pages of closed pull requests, from oldest to
    newest. If a pull request is merged and links one or more issues in its
    description, the pull request and its linked issue(s) will be fetched and
    saved as JSON files. The list of linked issue numbers is added to the fetched
    pull request JSON object with the key "linked_issue_numbers". The JSON files
    will be saved in DEST_DIR/owner/repo. The directories will be created if they
    do not already exist. The naming pattern for files is issue-N.json for issues,
    pull-N.json for pull requests, and pulls-page-N.json for pages of pull
    requests. Any existing file will be overwritten. The GitHub API limits
    unauthenticated clients to 60 requests per hour. The rate limit is 5,000
    requests per hour for authenticated clients. For this reason, you should
    provide a GitHub OAuth token if you want to crawl a large repository. You can
    create a personal access token at https://github.com/settings/tokens.

    Attributes:
        dst_dir (str): Directory for saving JSON files.
        per_page (int): Pull requests per page, between 1 and 100.
        save_pull_pages (bool): Save the pages of pull requests.
        max_request_tries (int): Number of times to try a request before
            terminating.
        request_retry_wait_secs (int): Seconds to wait before retrying a failed request.
    """

    def __init__(self,
                 token=None,
                 dst_dir='repos',
                 per_page=100,
                 save_pull_pages=False,
                 max_request_tries=100,
                 request_retry_wait_secs=10):
        """Initializes Crawler.

        The GitHub API limits unauthenticated clients to 60 requests per hour. The
        rate limit is 5,000 requests per hour for authenticated clients. For this
        reason, you should provide a GitHub OAuth token if you want to crawl a large
        repository. You can create a personal access token at
        https://github.com/settings/tokens.

        Args:
            token (str): Your GitHub OAuth token. If None, the crawler will be
                unauthenticated.
            dst_dir (str): Directory for saving JSON files.
            per_page (int): Pull requests per page, between 1 and 100.
            save_pull_pages (bool): Save the pages of pull requests.
            max_request_tries (int): Number of times to try a request before
                terminating.
            request_retry_wait_secs (int): Seconds to wait before retrying a failed request.
        """

    def crawl(self, owner, repo, start_page=1):
        """Crawls a GitHub repository, finds and saves merged pull requests and the issues
        they have fixed.

        The crawler goes through the pages of closed pull requests, from oldest to
        newest. If a pull request is merged and links one or more issues in its
        description, the pull request and its linked issue(s) will be fetched and
        saved as JSON files. The list of linked issue numbers is added to the fetched
        pull request JSON object with the key "linked_issue_numbers". The JSON files
        will be saved in DEST_DIR/owner/repo. The directories will be created if they
        do not already exist. The naming pattern for files is issue-N.json for issues,
        pull-N.json for pull requests, and pulls-page-N.json for pages of pull
        requests. Any existing file will be overwritten.

        Args:
            owner (str): The username of the repository owner, e.g., "octocat" for the
                https://github.com/octocat/Hello-World repository.
            repo (str): The name of the repository, e.g., "Hello-World" for the
                https://github.com/octocat/Hello-World repository.
            start_page (int): Page to start crawling from.

        Raises:
            TooManyRequestFailures: A request failed max_request_tries times.
        """

GHPR Writer

GHPR Writer reads JSON files downloaded by the GHPR Crawler and writes a CSV file from their data.

The GHPR dataset is an example of data generated by the GHPR Writer.

Writer CLI

Run python3 writer.py --help for usage.

$ python3 writer.py --help
usage: writer.py [-h] [-l LIMIT_ROWS] src_dir dst_file

Read JSON files downloaded by the Crawler and write a CSV file from their
data. The source directory must contain owner/repo/issue-N.json and
owner/repo/pull-N.json files. The destination directory of Crawler should
normally be used as the source directory of Writer. The destination file will
be overwritten if it already exists.

positional arguments:
  src_dir               source directory
  dst_file              destination CSV file

optional arguments:
  -h, --help            show this help message and exit
  -l LIMIT_ROWS, --limit-rows LIMIT_ROWS
                        limit number of rows to write, ignored if non-positive
                        (default: 0)
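
For example, a hypothetical invocation that reads the crawler's default output directory and writes the full dataset to ghpr.csv:

$ python3 writer.py repos ghpr.csv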

Writer API

See writer.py.

def write_dataset(src_dir, dst_file, limit_rows=0):
    """Reads JSON files downloaded by the Crawler and writes a CSV file from their
    data.

    The CSV file will have the following columns:
    - repo_id: Integer
    - issue_number: Integer
    - issue_title: Text
    - issue_body_md: Text, in Markdown format, can be empty
    - issue_body_plain: Text, in plain text, can be empty
    - issue_created_at: Integer, in Unix time
    - issue_author_id: Integer
    - issue_author_association: Integer enum (see values below)
    - issue_label_ids: Comma-separated integers, can be empty
    - pull_number: Integer
    - pull_created_at: Integer, in Unix time
    - pull_merged_at: Integer, in Unix time
    - pull_comments: Integer
    - pull_review_comments: Integer
    - pull_commits: Integer
    - pull_additions: Integer
    - pull_deletions: Integer
    - pull_changed_files: Integer
    The value of issue_body_plain is converted from issue_body_md. The conversion is
    not always perfect. In some cases, issue_body_plain still contains some Markdown
    tags.
    The value of issue_author_association can be one of the following:
    - 0: Collaborator
    - 1: Contributor
    - 2: First-timer
    - 3: First-time contributor
    - 4: Mannequin
    - 5: Member
    - 6: None
    - 7: Owner
    Rows are sorted by repository owner username, repository name, pull request
    number, and then issue number.
    The source directory must contain owner/repo/issue-N.json and
    owner/repo/pull-N.json files. The destination directory of Crawler should
    normally be used as the source directory of Writer. The destination file will be
    overwritten if it already exists.

    Args:
        src_dir (str): Source directory.
        dst_file (str): Destination CSV file.
        limit_rows (int): Maximum number of rows to write.
    """