## Analysing Retrieved Data

This Jupyter Notebook is intended for closer examination and subsequent validation of the data generated by the previous notebooks. Beyond having access to the data, the real value from a management perspective lies in analysing, comparing and contrasting it, and in automating the process of pointing out findings that may be significant to the redistribution process. By combining different data types from various sources, this notebook highlights the noteworthy findings of the data extraction process.

In [None]:
# Import required modules and libraries
import os
import sys
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from ydata_profiling import ProfileReport
from packaging import version
from github import Repository
from datetime import datetime, timedelta
from packaging.version import Version, parse

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *

### Managing GitHub API Calls

Connecting to GitHub requires a verified GitHub token, which is stored as an environment variable. The function for getting the token is imported from the utils file in the project. As the API has a limit of 5000 calls per hour, the capacity, the remaining calls and the reset time are tracked below. Each function that retrieves and exports data in the data retrieval files can pause the workflow when the user runs out of API calls. The program then automatically sleeps until the limit is renewed and resumes the job it was completing when the limit ran out.

In [None]:
# Track the GitHub rate limit against the calls used, and see when the limit will reset
# This cell only reports the values; the data retrieval functions sleep until the limit renews when fewer than 100 API calls are left
remaining_requests, request_limit = g.rate_limiting
print(f"Request limit for API Calls: {request_limit}")
print(f"Remaining requests for API Calls: {remaining_requests}")

limit_reset_time = g.rate_limiting_resettime
reset_time = datetime.fromtimestamp(limit_reset_time).strftime('%Y-%m-%d %H:%M:%S')
print(f"Reset time for API Calls: {reset_time}")
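
The pause-and-resume behaviour described above can be sketched as follows. This is a minimal illustration and not the project's actual implementation: the function names `seconds_until_reset` and `wait_for_rate_limit` are hypothetical, and the threshold of 100 calls is taken from the comment in the cell above.

```python
import time


def seconds_until_reset(remaining: int, reset_timestamp: float,
                        now: float, threshold: int = 100) -> float:
    """Return how long to wait before making further API calls.

    Hypothetical helper: returns 0.0 while at least `threshold` calls
    remain, otherwise the seconds left until the reset (never negative).
    """
    if remaining >= threshold:
        return 0.0
    return max(0.0, reset_timestamp - now)


def wait_for_rate_limit(remaining: int, reset_timestamp: float,
                        threshold: int = 100) -> None:
    """Sleep until the rate limit resets if too few calls remain."""
    delay = seconds_until_reset(remaining, reset_timestamp, time.time(), threshold)
    if delay > 0:
        print(f"Rate limit nearly exhausted; sleeping for {delay:.0f} seconds.")
        time.sleep(delay + 1)  # one extra second as a safety margin
```

With the values from the cell above, a call would look like `wait_for_rate_limit(remaining_requests, limit_reset_time)`. Keeping the calculation separate from the sleep makes the waiting logic easy to check without actually pausing.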

### Processing and Evaluating the Data Output

As this framework intends to separate code and data, and because the results of running a block of code will differ over time as the input itself changes, the validation process focuses on providing the user of the framework with an executive summary of the metrics, outliers and findings that are significant to the redistribution process at the time the script is executed. As such, the findings generated in this notebook point to relationships and deviations that could be of importance at this moment in time.

##### Functions to Analyse Testing Information

In [None]:
def load_data(file_path: str) -> dict:
    """Load data from a JSON file.

    Args:
        file_path (str): The path to the JSON file.

    Returns:
        dict: The loaded data as a Python dictionary.
    """
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data


In [None]:
def check_versions(data: dict) -> tuple:
    """Check the versions provided in the data.

    Args:
        data (dict): The data containing package versions.

    Returns:
        tuple:
            - ci_only_packages (list): Packages with versions specified only in the CI file.
            - package_info_only_packages (list): Packages with versions specified only in the PackageInfo file.
            - both_versions_packages (list): Packages with versions specified in both CI file and PackageInfo file.
    """
    ci_only_packages = []
    package_info_only_packages = []
    both_versions_packages = []

    for package, versions in data.items():
        ci_versions = versions.get('tested_ci_versions', [])
        package_info_versions = versions.get('required_pkginfo_version', [])

        if ci_versions and not package_info_versions:
            ci_only_packages.append(package)

        elif package_info_versions and not ci_versions:
            package_info_only_packages.append(package)

        elif ci_versions and package_info_versions:
            both_versions_packages.append(package)

    return ci_only_packages, package_info_only_packages, both_versions_packages
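
To make the three categories concrete, the sketch below classifies a small, made-up data set in the shape that `check_versions` appears to expect. The package names, the sample values and the helper `classify` are hypothetical and serve only as an illustration.

```python
# Hypothetical sample: package name -> dict with optional version lists
sample_testing_data = {
    "pkg_a": {"tested_ci_versions": ["4.11", "4.12"], "required_pkginfo_version": ["4.10"]},
    "pkg_b": {"tested_ci_versions": ["4.12"]},
    "pkg_c": {"required_pkginfo_version": ["4.9"]},
    "pkg_d": {},
}


def classify(entry: dict) -> str:
    """Name the category a single package entry falls into."""
    has_ci = bool(entry.get("tested_ci_versions"))
    has_pkginfo = bool(entry.get("required_pkginfo_version"))
    if has_ci and has_pkginfo:
        return "both"
    if has_ci:
        return "ci_only"
    if has_pkginfo:
        return "pkginfo_only"
    return "neither"


categories = {name: classify(entry) for name, entry in sample_testing_data.items()}
print(categories)
```

Note the fourth case: a package with neither kind of version information matches none of the branches in `check_versions` and therefore appears in none of the three returned lists.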


In [None]:
def compare_ci_and_pkg_versions(data: dict) -> list:
    """Compare the CI and PackageInfo testing data to see if all versions tested in the 
    CI file are equal to or greater than the required version from the PackageInfo file.

    Args:
        data (dict): The data containing package versions.

    Returns:
        list: Packages where not all versions in the CI file are above that in the PackageInfo file.
    """
    packages_with_mismatch = []

    for package, versions in data.items():
        ci_versions = versions.get("tested_ci_versions", [])
        package_version = versions.get("required_pkginfo_version")

        if ci_versions and package_version:
            if not all(version.parse(ci) >= version.parse(package_version[0]) for ci in ci_versions):
                packages_with_mismatch.append(package)

    return packages_with_mismatch


In [None]:
def find_next_version(version: Version) -> Version:
    """
    Find the next version after the given version.

    Args:
        version (Version): The version for which the next version needs to be found.

    Returns:
        Version: The next version after the given version.
    """
    version_info = list(version.release)
    if version_info[-1] == 0:
        version_info[-2] = version_info[-2] + 1
    else:
        version_info[-1] = version_info[-1] + 1
    return Version(".".join(str(comp) for comp in version_info))

In [None]:
def version_tuple(version_str: str) -> tuple:
    """Convert a version string to a tuple representation.

    Args:
        version_str (str): The version string to be converted.

    Returns:
        tuple: A tuple representation of the version with three components:
        major version, minor version and patch version.
    """
    version = parse(version_str)
    version_info = list(version.release)
    if len(version_info) == 2:
        version_info.append(0)
    return tuple(version_info)
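
The reason for converting versions to tuples before comparing them is that plain string comparison orders version numbers incorrectly. The sketch below shows this with a standard-library stand-in, `release_tuple`, which is a hypothetical helper that only handles plain dotted numeric versions and is not a replacement for the `packaging`-based function above.

```python
def release_tuple(version_str: str) -> tuple:
    """Split a dotted numeric version into integers, padded to three parts."""
    parts = [int(part) for part in version_str.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


# Strings compare character by character, so "9" sorts after "1"
string_says_older = "4.9" < "4.10"

# Tuples compare component by component as integers
tuple_says_older = release_tuple("4.9") < release_tuple("4.10")

print(string_says_older, tuple_says_older)
```

The string comparison wrongly treats 4.9 as newer than 4.10, whereas the tuple comparison gets the order right. Padding to three components also lets `4.10` and `4.10.0` compare as equal.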

In [None]:
def check_version_gaps(data: dict) -> dict:
    """Check for version gaps in testing, where there is a gap greater than 0.1 between the
    tested versions and the required version from the PackageInfo file.

    Args:
        data (dict): The data containing package versions.

    Returns:
        dict: A dict mapping package names to True if there are version gaps, False otherwise.
    """
    version_gaps = {}
    for package, package_data in data.items():
        required_versions = package_data.get("required_pkginfo_version", [])
        if not required_versions:
            continue

        ci_versions = package_data.get("tested_ci_versions", [])
        if not ci_versions:
            continue

        required_version_str = required_versions[0]
        required_version = version_tuple(required_version_str)
        lowest_ci_version = min(version_tuple(v) for v in ci_versions)

        # Check if the lowest tested CI version is exactly 0.1 higher than the required version
        if (
            lowest_ci_version[:-2] == required_version[:-2]
            and lowest_ci_version[-2] == required_version[-2] + 1
            and lowest_ci_version[-1] == 0
        ):
            continue

        next_version = find_next_version(parse(required_version_str))

        has_gap = not (
            (lowest_ci_version[:-1] >= required_version[:-1] and lowest_ci_version <= version_tuple(str(next_version)))
            or (lowest_ci_version[:-2] == required_version[:-2] and lowest_ci_version[-1] == required_version[-1] + 1)
        )

        if has_gap:
            version_gaps[package] = has_gap

    return version_gaps


##### Analyse and Display Testing Information

##### General Statistics on GAP Packages and Distribution

Core metrics based on the general state of packages, relevant to the management of GAP from GitHub, are provided below. These numbers are helpful in providing some foundational understanding of the current state of the programming language packages, in terms of development, distribution and redistribution. This data is on a collective level, and not per individual package.

In [None]:
# Executive summary on the general overview of GAP and GAP packages hosted by the GAP organisation on GitHub
org = g.get_organization(ORG_NAME_PACKAGES)
repos = org.get_repos(type="public")

# Get the total number of GAP repositories hosted by the GAP organisation on GitHub
total_packages = repos.totalCount
print(f"Number of GAP packages from GAP Repository: {total_packages}")

In [None]:
# Get the latest release and version of GAP
repo_url = "https://api.github.com/repos/gap-system/PackageDistro/releases/latest"
response = requests.get(repo_url)
latest_release = response.json()
latest_version = latest_release.get("tag_name")
print(f"The latest version of GAP is: {latest_version}.")


In [None]:
# Number of GAP packages hosted elsewhere on GitHub
# The information is gathered by web scraping with Beautiful Soup
# NB: These numbers are only indicative and not completely accurate, as the webpage's listing style means the count is per parent list item
url = "https://gap-packages.github.io/"
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Find the section of the webpage with the packages stored elsewhere on GitHub
section = soup.find("section", id="main-content")
heading = section.find(id="packages-hosted-elsewhere-on-github")
ul = heading.find_next("ul")

# Only count direct li children, so that nested ul or li elements do not inflate the count
packages = ul.find_all("li", recursive=False)
count = len(packages)

print(f"Number of GAP entities or packages hosted elsewhere on GitHub: {count}")


##### All Individual Statistics

Core metrics based on the state of each package repository, relevant to the management of GAP from GitHub, are provided below. By looking into the characteristics of individual packages, it is possible not only to compare and contrast packages to get some indication of their relative activity, but also to point to individual problems, to contributors who might need some help, and to packages that are more extensive in terms of collaboration than others.

In [None]:
# Load the repo data from the JSON file
data_folder = "collected_data"
repo_file_path = os.path.join(data_folder, "repo_data.json")
repo_data = load_data(repo_file_path)

# Load monitoring data from the JSON file
monitoring_file_path = os.path.join(data_folder, "monitoring_data.json")
monitoring_data = load_data(monitoring_file_path)

# Load testing data from the JSON file
testing_file_path = os.path.join(data_folder, "testing_data.json")
testing_data = load_data(testing_file_path)

# Load community data from the JSON file
community_file_path = os.path.join(data_folder, "community_data.json")
community_data = load_data(community_file_path)


##### Repo Data: Key Metrics and Notable Statistics

In [None]:
# Calculate overall statistics and generate relevant insights for each repository
total_repos = len(repo_data)
total_releases = sum(repo['total_releases'] for repo in repo_data)
total_open_issues = sum(repo['open_issues_count'] for repo in repo_data)
total_open_pull_requests = sum(repo['open_pull_requests'] for repo in repo_data if repo['open_pull_requests'])
total_bug_count = sum(repo['bug_count'] for repo in repo_data if repo['bug_count'])
total_enhancement_count = sum(repo['enhancement_count'] for repo in repo_data if repo['enhancement_count'])

# Display the calculated overall metrics
print(f"Total Repositories: {total_repos}")
print(f"Total Releases: {total_releases}")
print(f"Total Open Issues: {total_open_issues}")
print(f"Total Open Pull Requests: {total_open_pull_requests}")
print(f"Total Bug Count: {total_bug_count}")
print(f"Total Enhancement Count: {total_enhancement_count}")

# Display inactive repositories where there has been no activity in the last 90 days
print("\nRepositories that had no activity in the last 90 days:")
inactive_repositories = [repo['repo'] for repo in repo_data if repo['last_activity_time'] is None]
if inactive_repositories:
    for repo_name in inactive_repositories:
        print(repo_name)
else:
    print("All repositories had activity within the past 90 days.")
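
The cell above relies on `last_activity_time` being `None` for inactive repositories. How that value is produced is not shown here, so the following is only a sketch of how a 90-day window could be checked, assuming activity is recorded as an ISO 8601 timestamp with a UTC offset. The function name `is_inactive` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_inactive(last_activity_iso: Optional[str], now: datetime,
                window_days: int = 90) -> bool:
    """Return True if there is no recorded activity within the window."""
    if last_activity_iso is None:
        return True
    last_activity = datetime.fromisoformat(last_activity_iso)
    return now - last_activity > timedelta(days=window_days)


# Fixed reference time so that the example is reproducible
reference_time = datetime(2024, 6, 1, tzinfo=timezone.utc)

recent = is_inactive("2024-05-01T12:00:00+00:00", reference_time)
stale = is_inactive("2024-01-01T00:00:00+00:00", reference_time)
missing = is_inactive(None, reference_time)
print(recent, stale, missing)
```

Passing the reference time in as an argument, rather than calling `datetime.now()` inside the function, keeps the check deterministic.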


In [None]:
# Display some other information in a ProfileReport, for more statistical analysis
data_list = [
    {
        'total_releases': repo['total_releases'],
        'age_in_days': repo['age_in_days'],
        'open_issues_count': repo['open_issues_count'],
        'total_pull_requests': repo['total_pull_requests'],
        'open_pull_requests': repo['open_pull_requests'],
        'closed_pull_requests': repo['closed_pull_requests'],
    }
    for repo in repo_data
]

# Create a DataFrame from the list of dictionaries
repo_df = pd.DataFrame(data_list)

# Generate the ProfileReport based on the selected columns
profile = ProfileReport(repo_df, title="Statistics for selected columns in repositories managed by gap-packages on GitHub")
profile.to_widgets()

##### Monitoring Data: Key Metrics and Notable Statistics

In [None]:
# Get the relevant information from the loaded data
packages_with_different_versions = monitoring_data['packages_with_different_versions']
all_previous_and_maybe_next = monitoring_data['all_previous_and_maybe_next']
previous_and_maybe_next_labels = monitoring_data['previous_and_maybe_next_labels']

# Compare the latest released version number to the one on the main branch in GAP PackageDistro
# Packages with different version numbers will be in the next GAP release
print("Packages with different versions in the latest GAP release and in the GAP PackageDistro:")
for package_data in packages_with_different_versions:
    package_name = package_data['package_name']
    latest_package_version = package_data['latest_version']
    main_branch_version = package_data['main_branch_version']
    print(f"{package_name}, Latest Version: {latest_package_version}, Main Branch Version: {main_branch_version}")

# Find the packages in unmerged PRs, as these may be in the next release but have not yet been merged
print("\nAll packages that were in the previous release and look to also be in the next:")
for package in all_previous_and_maybe_next:
    print(package)

# Only retrieve packages with unmerged PRs that have specific labels, as these labels indicate a relation to the release
print("\nPackages with release-related labels that were in the previous release and look to also be in the next:")
for package in previous_and_maybe_next_labels:
    print(package)

##### Testing Data: Key Metrics and Notable Statistics

In [None]:
# Get the total number of test directories and test files for all the repositories
tst_dirs_with_files = len(testing_data)
total_test_files = sum(data.get("tst_file_count", 0) for data in testing_data.values())
tst_files_info = {package: data.get("tst_file_count", 0) for package, data in testing_data.items()}
print(f"Repositories with test directories containing files: {tst_dirs_with_files}")
print(f"Total number of test files for all packages: {total_test_files}")

# Get the number of repositories with a CI file, and the names of those that do not have one
repos_with_ci_file = [package for package, data in testing_data.items() if "ci_file_version" in data]
ci_tested_version = {package: data["ci_file_version"] for package, data in testing_data.items() if "ci_file_version" in data}
repos_without_ci_tests = [package for package, data in testing_data.items() if "ci_file_version" not in data]
print(f"Number of repositories with CI file: {len(repos_with_ci_file)}")
num_packages_without_tests = len(repos_without_ci_tests)
if num_packages_without_tests > 0:
    print(f"Packages without any test related data in their 'CI' files: {', '.join(repos_without_ci_tests)}")

# Get the number of repositories with a PackageInfo.g file, and the names of those that do not have one
repos_with_pkginfo_file = [package for package, data in testing_data.items() if "required_pkginfo_version" in data]
repos_without_pkginfo_file = [package for package in testing_data.keys() if package not in repos_with_pkginfo_file]
print(f"Number of repositories with 'PackageInfo.g' file: {len(repos_with_pkginfo_file)}")
if len(repos_without_pkginfo_file) > 0:
    print(f"Packages without a 'PackageInfo.g' file: {', '.join(repos_without_pkginfo_file)}")


In [None]:
# Analyse the version information for individual GAP packages
ci_only_packages, package_info_only_packages, both_versions_packages = check_versions(testing_data)

print("Number of packages with version testing in CI file but no required version in PackageInfo file:", len(ci_only_packages))
if ci_only_packages:
    print("Packages with version testing for CI file but no required version in PackageInfo file:")
    print(", ".join(ci_only_packages))

print("Number of packages with required version in PackageInfo file but no version testing in CI file:", len(package_info_only_packages))
if package_info_only_packages:
    print("Packages with required version in PackageInfo file but no version testing in CI file:")
    print(", ".join(package_info_only_packages))
    
print("Number of packages with tested versions in CI file and required version in PackageInfo file:", len(both_versions_packages))
if both_versions_packages:
    print("Packages with tested versions in CI file and required version in PackageInfo file:")
    print(", ".join(both_versions_packages))


In [None]:
# Get the repositories with the latest version of GAP in their tested versions from the CI file
# Remove any prefix or additional text after the version number, if provided for the current version
latest_package_version = latest_version.lstrip("v")
latest_version_parts = latest_package_version.split("-")
latest_package_version = latest_version_parts[0]

latest_version_obj = version.parse(latest_package_version)
repos_with_latest_gap_version = [
    package for package, data in testing_data.items() if any(latest_version_obj == version.parse(ci) for ci in data.get("tested_ci_versions", []))
]

# Keep the count separate from the list of names, so that the names can still be joined below
latest_gap_version_count = len(repos_with_latest_gap_version)
print(f"Number of repositories with the latest version of GAP in their tested versions: {latest_gap_version_count}")

if latest_gap_version_count > 0:
    print("Repositories:")
    print(", ".join(repos_with_latest_gap_version))
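
The tag clean-up at the top of this cell can be isolated into a small helper, which makes it easier to see what happens to different tag styles. The function name `normalise_release_tag` and the example tags are hypothetical; the actual tag format used for the releases is not confirmed here.

```python
def normalise_release_tag(tag: str) -> str:
    """Drop one leading "v" and anything from the first hyphen onwards."""
    if tag.startswith("v"):
        tag = tag[1:]
    return tag.split("-", 1)[0]


# Hypothetical tag styles
for example_tag in ["v4.12.2", "v4.13.0-beta1", "4.12.2"]:
    print(example_tag, "->", normalise_release_tag(example_tag))
```

Unlike `lstrip("v")`, which removes every leading `v` character, this variant removes at most one. For ordinary tags the result is the same.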


In [None]:
# Check if all version tests in the CI file are equal to or greater than the number listed in the PackageInfo file
packages_with_mismatch = compare_ci_and_pkg_versions(testing_data)
if packages_with_mismatch:
    print(f"CI versions are not all greater than or equal to PackageInfo version for package(s): {', '.join(packages_with_mismatch)}")


In [None]:
# Check for gaps in the tested versions and the required version for each package, or higher required than tested version
version_gaps = check_version_gaps(testing_data)
if version_gaps:
    print("Packages with gaps in tested and required version, or where required version is higher than tested version:")
    for package, has_gaps in version_gaps.items():
        if has_gaps:
            print(package)

##### Community Data: Key Metrics and Notable Statistics

In [None]:
# Unpack the data from the loaded JSON file
# Currently, inactive contributors are those with no commits in the past 24 months
all_authors = community_data['authors']
all_submitters = community_data['submitters']
author_repo_counts = community_data['author_repo_counts']
author_submitters = community_data['author_submitters']
inactive_contributors = community_data['inactive_contributors']
authors_contributed_together = community_data['interactions']

print(f"Total number of authors for all GAP packages: {len(all_authors)}")
print(f"Total number of submitters for all GAP packages: {len(all_submitters)}")
print(f"Total number of authors who were also submitters for all GAP packages: {len(author_submitters)}")
print(f"Total number of inactive contributors for all GAP packages: {len(inactive_contributors)}")
