## GAP Data Analytics, Basic Repo Data Retrieval

This Jupyter Notebook automates the extraction of basic, introductory data on GAP repositories hosted on GitHub, relevant to the redistribution of the programming language. The data is retrieved from the GitHub API through PyGithub, a Python wrapper for the API, so the library must be installed first. The required packages are installed through the pip shell command below.

In [None]:
# Import sys module for various system-specific parameters and functions
# Exclude lines that are already satisfied using the grep search command
import sys
!{sys.executable} -m pip install PyGithub numpy pandas matplotlib seaborn requests beautifulsoup4 ydata-profiling | grep -v 'already satisfied'

# Import required modules and libraries
import os
import json
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
from ydata_profiling import ProfileReport
from github.Repository import Repository

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *


### Managing GitHub API Calls

Connecting to GitHub requires an access token, which is stored as an environment variable so that it is not exposed in the script. The function for getting the token is imported from the utils file in the project. As the API has a limit of 5000 calls per hour for authenticated requests, the limit, the remaining calls and the reset time are tracked below.
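
The token lookup itself is not shown in this notebook. A minimal sketch of what such a helper might look like follows. The function name `get_github_token` and the environment variable name `GITHUB_TOKEN` are assumptions for illustration and are not taken from the project's utils file.

```python
import os


def get_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Read the access token from the environment (hypothetical helper)."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of making unauthenticated calls
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token
```

The returned string would then be passed to PyGithub when creating the client `g` used in the cells below, for instance through `Github(auth=Auth.Token(token))` in recent PyGithub versions.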

In [None]:
# Track the rate limit for GitHub compared to calls used, and see when the limit will reset
remaining_requests, request_limit = g.rate_limiting
print(f"Request limit for API Calls: {request_limit}")
print(f"Remaining requests for API Calls: {remaining_requests}")

limit_reset_time = g.rate_limiting_resettime
reset_time = datetime.fromtimestamp(limit_reset_time).strftime('%Y-%m-%d %H:%M:%S')
print(f"Reset time for API Calls: {reset_time}")
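
When many repositories are processed in one run, it can help to know how long to wait once the budget is used up. The sketch below uses only the standard library. The function name `seconds_until_reset` and the sample timestamps are illustrative assumptions, and the reset time is taken to be a Unix timestamp, as in the cell above.

```python
import time
from typing import Optional


def seconds_until_reset(reset_timestamp: float, now: Optional[float] = None) -> int:
    """Return the whole seconds left until the rate limit resets (never negative)."""
    current = time.time() if now is None else now
    return max(0, int(reset_timestamp - current))


# Fixed, made-up timestamps so that the result is reproducible
print(seconds_until_reset(reset_timestamp=1_700_003_600, now=1_700_000_000))  # 3600
```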


### General Statistics on GAP Packages and Distribution

Core metrics on the overall state of the GAP packages managed on GitHub are provided below. They give a foundational understanding of the current state of the packages in terms of development, distribution and redistribution. The data is collected at a collective level, not per individual package.

In [None]:
# Define global variables for the Jupyter Notebook
# Get the number of repositories that are public for gap-packages organisation on GitHub
org = g.get_organization(ORG_NAME_PACKAGES)
repos = org.get_repos(type="public")
total_packages = repos.totalCount
print(f"Number of GAP packages from the gap-packages organisation: {total_packages}")


In [None]:
# Number of GAP packages hosted elsewhere on GitHub
# The information is gathered by web scraping with Beautiful Soup
# NB: These numbers are only indicative, since the webpage groups packages under parent list items and only those parent items are counted
url = "https://gap-packages.github.io/"
response = requests.get(url, timeout=30)
# Stop early if the page could not be fetched
response.raise_for_status()

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Find the section of the webpage with the packages stored elsewhere on GitHub
section = soup.find("section", id="main-content")
heading = section.find(id="packages-hosted-elsewhere-on-github")
ul = heading.find_next("ul")

# Count only direct li children, so that nested ul or li elements do not increase the count
packages = ul.find_all("li", recursive=False)
count = len(packages)

print(f"Number of GAP entities or packages hosted elsewhere on GitHub: {count}")
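
To illustrate why only direct children are counted, the sketch below runs the same idea on a made-up HTML fragment. It uses the standard library's `html.parser` instead of Beautiful Soup, so that it runs without extra packages. The class name `TopLevelItemCounter` and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class TopLevelItemCounter(HTMLParser):
    """Count <li> elements that belong to the outermost <ul> only."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0         # number of <ul> elements currently open
        self.top_level_items = 0  # <li> elements seen at depth 1

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li" and self.ul_depth == 1:
            # Items of nested lists sit at depth 2 or more and are skipped
            self.top_level_items += 1

    def handle_endtag(self, tag):
        if tag == "ul":
            self.ul_depth -= 1


# Hypothetical fragment: two parent entries, one of which lists two sub-packages
sample_html = """
<ul>
  <li>alice/pkg-one</li>
  <li>bob
    <ul>
      <li>bob/pkg-two</li>
      <li>bob/pkg-three</li>
    </ul>
  </li>
</ul>
"""

counter = TopLevelItemCounter()
counter.feed(sample_html)
print(counter.top_level_items)
```

The fragment contains three packages but only two parent items, which is why the scraped number is indicative rather than exact.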


### Individual Statistics per GitHub GAP Package

Statistical metrics for each GAP repository managed by the gap-packages organisation on GitHub are provided below. The process has two parts: functions for retrieving the data are defined first, and the data is then retrieved for each repository. Running the script exports the results per package to a 'repo_data.json' file in the 'collected_data' folder.

##### Functions to Retrieve Individual Repo Statistics

In [None]:
def get_total_releases(repo: Repository) -> tuple:
    """Get total number of releases and the latest release for a given repository.
    
    Args:
        repo (Repository): The GitHub repository.
    
    Returns:
        tuple: The repository name, total releases count and latest release date.
    """
    releases = repo.get_releases()
    total_releases = releases.totalCount

    if total_releases > 0:
        latest_release = releases[0]
        latest_release_date = latest_release.published_at.date()
        return repo.name, total_releases, latest_release_date
    else:
        return repo.name, total_releases, None
    

In [None]:
def get_repository_age(repo: Repository) -> tuple:
    """Get the age of a repository, measured in days.
    
    Args:
        repo (Repository): The GitHub repository.
    
    Returns:
        tuple: The repository name and the age of the repository measured in days.
    """
    age = (datetime.now().date() - repo.created_at.date())
    return repo.name, age.days


In [None]:
def get_last_event(repo: Repository) -> tuple:
    """Get the last activity event for a repository.

    Args:
        repo (Repository): The GitHub repository.

    Returns:
        tuple: The repository name, last event time as a date object (or None), and last event type (or None).
    
    Note:
        - Watch events are not listed in EVENT_TYPES, so that they do not hide other events of greater significance.
        - Only events from the past 90 days are available, per API limitations.
    """
    EVENT_TYPES = {
        "CommitCommentEvent": "Comment was made on a commit",
        "CreateEvent": "New branch or tag in repository",
        "DeleteEvent": "Branch or tag was deleted from the repository",
        "ForkEvent": "Repository was forked",
        # The GitHub API names this event type "IssuesEvent" (plural)
        "IssuesEvent": "An issue was opened, closed or edited",
        "IssueCommentEvent": "Comment made on an issue",
        "PullRequestEvent": "Pull request was opened, closed, merged or synchronised",
        "PullRequestReviewEvent": "Pull request review was submitted",
        "PullRequestReviewCommentEvent": "Comment was made on a pull request review",
        "PushEvent": "Push to the repository",
        "ReleaseEvent": "Release was published for the repository",
    }
    
    # Events are listed newest first; report the first one with a tracked type,
    # so that untracked types such as Watch events are skipped
    for event in repo.get_events():
        if event.type in EVENT_TYPES:
            return repo.name, event.created_at.date(), EVENT_TYPES[event.type]
    return repo.name, None, None
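
The note about Watch events can be illustrated with stand-in data: when untracked types are skipped, the newest tracked event is reported instead of a star. The `Event` tuple, the `TRACKED_TYPES` set and the sample events below are hypothetical and only mimic the shape of the API objects.

```python
from collections import namedtuple
from datetime import date

# Stand-in for the event objects returned by the API (hypothetical sample data)
Event = namedtuple("Event", ["type", "created_on"])

TRACKED_TYPES = {"PushEvent", "IssuesEvent", "ReleaseEvent"}


def first_tracked_event(events):
    """Return the first event whose type is tracked, or None if there is none."""
    return next((e for e in events if e.type in TRACKED_TYPES), None)


# Events are assumed to be ordered newest first
sample_events = [
    Event("WatchEvent", date(2023, 5, 3)),
    Event("PushEvent", date(2023, 5, 2)),
    Event("IssuesEvent", date(2023, 5, 1)),
]

print(first_tracked_event(sample_events))
```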
    

In [None]:
def check_release_status(repo: Repository) -> tuple:
    """Get information on bugs and enhancement opportunities for a repository.

    Args:
        repo (Repository): The GitHub repository.
    
    Returns:
        tuple: The repository name, total number of open issues, number of bug issues and
        number of enhancement issues.
    """
    # Note: GitHub's issues endpoint also returns pull requests, so they are part of this count
    open_issues = repo.get_issues(state='open')
    
    open_issues_count = open_issues.totalCount
    bug_count = 0
    enhancement_count = 0

    for issue in open_issues:
        labels = [label.name for label in issue.labels]
        if 'bug' in labels:
            bug_count += 1
        if 'enhancement' in labels:
            enhancement_count += 1

    # Both counters default to 0, so no separate branch is needed
    return repo.name, open_issues_count, bug_count, enhancement_count
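
The label tally can also be expressed with `collections.Counter`, which scales to more labels than the two tracked here. This is a sketch on made-up label lists, not data from the gap-packages repositories.

```python
from collections import Counter

# Hypothetical label lists, one per open issue
issue_labels = [
    ["bug"],
    ["enhancement", "help wanted"],
    ["bug", "enhancement"],
    [],
]

# Count each label at most once per issue
label_counts = Counter(label for labels in issue_labels for label in set(labels))

print(label_counts["bug"], label_counts["enhancement"])
```

A label that never occurs simply yields 0, so missing labels need no special handling.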
    

In [None]:
def pull_request_status(repo: Repository) -> tuple:
    """Get information on PRs for a repository.
    
    Args:
        repo (Repository): The GitHub repository.
    
    Returns:
        tuple: The repository name, total number of PRs, number of open PRs and number of closed PRs.
    """
    pull_requests = repo.get_pulls(state='all')
    
    total_pull_requests = pull_requests.totalCount
    open_pull_requests = repo.get_pulls(state='open').totalCount
    closed_pull_requests = repo.get_pulls(state='closed').totalCount

    return repo.name, total_pull_requests, open_pull_requests, closed_pull_requests


##### Get and Display Individual Repo Statistics

In [None]:
# Display alternative 1: Printing out the information
# Generate relevant statistics for all repositories managed by the gap-packages organisation on GitHub
# for repo in repos:
#     # Call function for total releases for the repository
#     repo_name, total_releases, latest_release_date = get_total_releases(repo)
#     if total_releases > 0:
#         print(f"Total Releases for {repo_name}: {total_releases}")
#         print(f"Latest Release Date: {latest_release_date}")
#     else:
#         print(f"No releases for {repo_name}")
    
#     # Call function for repository age
#     repo_name, repo_age = get_repository_age(repo)
#     print(f"Repository: {repo_name}, Age: {repo_age} days")

#     # Call function for last event type for the repository
#     repo_name, last_event_time, last_event_type = get_last_event(repo)
#     if last_event_time is not None:
#         print(f"Last Activity Time for {repo_name}: {last_event_time}")
#         print(f"Last Activity Type for {repo_name}: {last_event_type}")
#     else:
#         print(f"No activity in {repo_name} within the past 90 days")

#     # Call function for total issues information
#     repo_name, open_issues_count, bug_count, enhancement_count = check_release_status(repo)
#     if bug_count > 0 or enhancement_count > 0:
#         print(f"The repository {repo_name} has open bug and enhancement issues.")
#         print(f"Total open issues: {open_issues_count}")
#         print(f"Open bug issues: {bug_count}")
#         print(f"Open enhancement issues: {enhancement_count}")
#     else:
#         print(f"The repository {repo_name} has no open bug or enhancement issues.")
#         print(f"Total open issues for {repo_name}: {open_issues_count}")
    
#     # Call function for total PRs information
#     repo_name, total_pull_requests, open_pull_requests, closed_pull_requests = pull_request_status(repo)
#     print(f"Total Pull Requests for {repo_name}: {total_pull_requests}")
#     print(f"Open Pull Requests for {repo_name}: {open_pull_requests}")
#     print(f"Closed Pull Requests for {repo_name}: {closed_pull_requests}")


In [None]:
# Display alternative 2: Creating a ProfileReport for more statistical analysis
# Use profiling library to see other, generalised statistics
# profile = ProfileReport(df, title="Statistics for packages by gap-packages on GitHub")
# profile.to_widgets()


In [None]:
# Export collected data to JSON file to store them for later use and better overview
data_folder = 'collected_data'
all_data = []

# Iterate over the information for each repository
for repo in repos:
    repo_name = repo.name
    _, total_releases, latest_release_date = get_total_releases(repo)
    _, repo_age = get_repository_age(repo)
    _, last_event_time, last_event_type = get_last_event(repo)
    _, open_issues_count, bug_count, enhancement_count = check_release_status(repo)
    _, total_pull_requests, open_pull_requests, closed_pull_requests = pull_request_status(repo)

    # Create a dictionary for the data, converting dates to strings in order to store them
    data = {
        'repo': repo_name,
        'total_releases': total_releases,
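        # Keep missing dates as None (JSON null) rather than the string "None"
        'latest_release': str(latest_release_date) if latest_release_date else None,
        'age_in_days': repo_age,
        'last_activity_time': str(last_event_time) if last_event_time else None,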
        'last_event_type': last_event_type,
        'open_issues_count': open_issues_count,
        'bug_count': bug_count,
        'enhancement_count': enhancement_count,
        'total_pull_requests': total_pull_requests,
        'open_pull_requests': open_pull_requests,
        'closed_pull_requests': closed_pull_requests
    }

    all_data.append(data)

# Create the data folder if needed, and build the file path for the JSON file
os.makedirs(data_folder, exist_ok=True)
file_path = os.path.join(data_folder, 'repo_data.json')

with open(file_path, 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)

print("Repository data has been exported to the 'repo_data.json' file in the 'collected_data' folder.")
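
Dates have to be converted before they can be written to JSON. As an alternative to calling `str()` on each field, the sketch below passes `default=str` to `json.dumps`, which is only invoked for values the encoder cannot handle itself. The records are made up for illustration.

```python
import json
from datetime import date

# Hypothetical records: one repository with a release, one without
records = [
    {"repo": "example-a", "latest_release": date(2023, 4, 1)},
    {"repo": "example-b", "latest_release": None},
]

# default=str is called for the date object only, so None is written
# as null instead of the string "None"
text = json.dumps(records, default=str)
restored = json.loads(text)

print(restored[0]["latest_release"])
print(restored[1]["latest_release"])
```

Reading the file back then yields ISO-formatted date strings for existing releases and `None` where no release exists, which is easier to filter on in later analysis.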
