### GAP Data Analytics, Data Retrieval

This Jupyter Notebook automates the extraction of GitHub statistics relevant to the redistribution of the GAP programming language. To extract data from the GitHub REST API, it is first necessary to install the PyGithub library, which provides a Python wrapper for that API.

In [None]:
# Import sys so that pip installs into the environment of the running kernel
# Filter out lines for packages that are already installed using grep
# requests and beautifulsoup4 are included because they are imported below
import sys
!{sys.executable} -m pip install numpy pandas matplotlib seaborn PyGithub requests beautifulsoup4 | grep -v 'already satisfied'

# Import required libraries and packages
import requests
from datetime import datetime
from github import Github
from bs4 import BeautifulSoup

# Import modules from other project scripts
from utils import get_github_token

### Managing GitHub API Connection

The connection to GitHub is authenticated with a personal access token, which is stored as an environment variable so that it is not exposed in the script. The function for getting the token is imported from the utils file in the project. The API has a limit of 5000 calls per hour for authenticated requests, so the usage and the remaining calls need to be tracked.
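
The helper in the utils file is not shown in this notebook. As a minimal sketch of what such a helper could look like, assuming the token is stored in an environment variable named `GITHUB_TOKEN` (a hypothetical name, the project may use a different one):

```python
import os

def get_github_token(env_var="GITHUB_TOKEN"):
    """Return the token stored in env_var, or None if it is unset or empty.

    The variable name GITHUB_TOKEN is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    return token or None
```

Returning `None` for a missing token lets the caller decide how to react, for example by stopping the notebook with a clear message.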

In [None]:
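# Get the GitHub access token and create an instance of the Github class
# Stop early if no token is found, since g would otherwise be undefined in later cells
github_token = get_github_token()
if not github_token:
    raise RuntimeError("No GitHub access token found")
g = Github(github_token)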

In [None]:
# Track the rate limit for GitHub compared to calls used, and see when the limit will reset
remaining_requests, request_limit = g.rate_limiting
print(f"Request Limit for API Calls: {request_limit}")
print(f"Remaining Requests for API Calls: {remaining_requests}")

limit_reset_time = g.rate_limiting_resettime
reset_time = datetime.fromtimestamp(limit_reset_time).strftime('%Y-%m-%d %H:%M:%S')
print(f"Reset Time for API Calls: {reset_time}")
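
When many repositories are processed in one run, the remaining calls can run low. The two helpers below are a sketch of how the values printed above could be used to decide whether to pause and for how long. The function names and the threshold of 100 calls are assumptions made for illustration and are not part of PyGithub.

```python
import time

def seconds_until_reset(reset_epoch, now_epoch=None):
    """Seconds left until the rate limit resets, never negative."""
    if now_epoch is None:
        now_epoch = time.time()
    return max(0, int(reset_epoch - now_epoch))

def should_pause(remaining_requests, min_remaining=100):
    """True when fewer than min_remaining calls are left (threshold is an assumption)."""
    return remaining_requests < min_remaining
```

With the objects from the cell above, a pause could then be written as `if should_pause(g.rate_limiting[0]): time.sleep(seconds_until_reset(g.rate_limiting_resettime))`.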

### General GAP Package and Distribution Statistics

Core statistical metrics on the general state of GAP packages on GitHub are provided below. These numbers give a foundational understanding of the current situation of the programming language in terms of development, distribution and redistribution.

In [None]:
# Number of GAP packages hosted in the gap-packages organisation on GitHub
org_name = "gap-packages"
org = g.get_organization(org_name)

# Get the number of repositories that are public
repos = org.get_repos(type="public")
total_packages = repos.totalCount
print(f"Number of GAP packages in the gap-packages organisation: {total_packages}")

In [None]:
# Number of GAP packages hosted elsewhere on GitHub
# The information is gathered by web scraping with Beautiful Soup
# NB: These numbers are only indicative, since the webpage nests some entries and only top-level list items are counted
url = "https://gap-packages.github.io/"
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Find the section of the webpage with the packages stored elsewhere on GitHub
section = soup.find("section", id="main-content")
heading = section.find(id="packages-hosted-elsewhere-on-github")
ul = heading.find_next("ul")

# Count only the direct li children, so that nested ul or li elements do not inflate the count
packages = ul.find_all("li", recursive=False)
count = len(packages)

print(f"Number of GAP entities or packages hosted elsewhere on GitHub: {count}")

### Individual Statistics per GitHub GAP Package

Individual statistical metrics for each GAP repository managed by the gap-packages organisation on GitHub are provided below. These numbers support decisions on how to handle the respective packages, both in terms of their need for redistribution and their readiness for it.

In [None]:
# Function to get the number of releases for a repository
# Also provide the latest release date, to indicate whether a package was released relatively recently
def get_total_releases(repo):
    releases = repo.get_releases()
    total_releases = releases.totalCount

    if total_releases > 0:
        latest_release = releases[0]
        latest_release_date = latest_release.published_at.date()
        return repo.name, total_releases, latest_release_date
    else:
        return repo.name, total_releases, None

In [None]:
# Function to get the age of a repository, measured in days
def get_repository_age(repo):
    age = (datetime.now().date() - repo.created_at.date())
    return repo.name, age.days
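
The function above compares the local date with the creation date. Depending on the PyGithub version, `created_at` may be a naive or a timezone-aware datetime, so the two dates can refer to different time zones. The following is a sketch of the same calculation with both values normalised to UTC. The name `age_in_days` is hypothetical, and treating naive values as UTC is an assumption.

```python
from datetime import datetime, timezone

def _as_utc(dt):
    """Treat naive datetimes as UTC (an assumption) and convert aware ones to UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def age_in_days(created_at, now=None):
    """Whole days between the creation date and now, both taken in UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (_as_utc(now).date() - _as_utc(created_at).date()).days
```

Passing `now` explicitly makes the result reproducible, which helps when the same figures are regenerated later.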

In [None]:
# Function to get the last activity event for a repository
# Watch events are excluded, so that they do not hide other events of greater significance
# NB: Only events within the past 90 days are included in the search, per API limitations
def get_last_event(repo):
    # Define dictionary for events to be considered
    EVENT_TYPES = {
        "CommitCommentEvent": "Comment was made on a commit",
        "CreateEvent": "New branch or tag in repository",
        "DeleteEvent": "Branch or tag was deleted from the repository",
        "ForkEvent": "Repository was forked",
        # The GitHub API names this event type IssuesEvent, not IssueEvent
        "IssuesEvent": "An issue was opened, closed or edited",
        "IssueCommentEvent": "Comment made on an issue",
        "PullRequestEvent": "Pull request was opened, closed, merged or synchronised",
        "PullRequestReviewEvent": "Pull request review was submitted",
        "PullRequestReviewCommentEvent": "Comment was made on a pull request review",
        "PushEvent": "Push to the repository",
        "ReleaseEvent": "Release was published for the repository",
    }

    # Events are returned newest first, so return the first one of a tracked type
    # Untracked types such as WatchEvent are skipped instead of being reported as None
    for event in repo.get_events():
        if event.type in EVENT_TYPES:
            return repo.name, event.created_at.date(), EVENT_TYPES[event.type]
    return repo.name, None, None

In [None]:
# Define function to print repository dictionary
def print_repo_info(repo_info):
    print(f"Repository: {repo_info['name']}")
    print(f"Total Releases: {repo_info['total_releases']}")
    if repo_info['latest_release_date']:
        print(f"Latest Release Date: {repo_info['latest_release_date']}")
    print(f"Age: {repo_info['age']} days")
    if repo_info['last_event_time']:
        print(f"Last Activity Time: {repo_info['last_event_time']}")
        print(f"Last Activity Type: {repo_info['last_event_type']}")
    else:
        print("No activity within the past 90 days")

In [None]:
# Generate relevant statistics for all repositories managed by the gap-packages organisation on GitHub
for repo in repos:
    # Call function for total releases for the repository
    repo_name, total_releases, latest_release_date = get_total_releases(repo)
    if total_releases > 0:
        print(f"Total Releases for {repo_name}: {total_releases}")
        print(f"Latest Release Date: {latest_release_date}")
    else:
        print(f"No releases for {repo_name}")
    
    # Call function for the age of the repository
    repo_name, repo_age = get_repository_age(repo)
    print(f"Repository: {repo_name}, Age: {repo_age} days")

    # Call function for last event type for the repository
    repo_name, last_event_time, last_event_type = get_last_event(repo)
    if last_event_time is not None:
        print(f"Last Activity Time for {repo_name}: {last_event_time}")
        print(f"Last Activity Type for {repo_name}: {last_event_type}")
    else:
        print(f"No activity in {repo_name} within the past 90 days")
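
The `print_repo_info` function defined earlier expects a single dictionary per repository, while the loop above prints the three results directly. As a sketch of how the two could be connected, the hypothetical helper below merges the tuples returned by `get_total_releases`, `get_repository_age` and `get_last_event` into a dictionary with the keys that `print_repo_info` reads. The helper name is an assumption and not part of the original project.

```python
from datetime import date

def build_repo_info(releases, age, last_event):
    """Merge the three result tuples into one dictionary per repository."""
    name, total_releases, latest_release_date = releases
    _, age_days = age
    _, last_event_time, last_event_type = last_event
    return {
        "name": name,
        "total_releases": total_releases,
        "latest_release_date": latest_release_date,
        "age": age_days,
        "last_event_time": last_event_time,
        "last_event_type": last_event_type,
    }

# Example with made-up values in the shape returned by the three functions
example = build_repo_info(
    ("example-package", 3, date(2023, 5, 1)),
    ("example-package", 400),
    ("example-package", None, None),
)
```

Inside the loop, this would be called as `print_repo_info(build_repo_info(get_total_releases(repo), get_repository_age(repo), get_last_event(repo)))`. Collecting the dictionaries in a list would also make it straightforward to load them into a pandas DataFrame later.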