## Testing and Package Actions Data Retrieval

This Jupyter Notebook investigates testing, actions and workflows for the GAP packages hosted on GitHub. When GAP and its packages are redistributed, package authors often test new versions of a package for compatibility with existing versions of GAP. This can be done in several ways, most frequently through test files in a tst directory in the repository. Testing can also be run through GitHub Actions, a tool for automating workflows, which is configured here in CI.yml files. Information on GAP versions can also be found in PackageInfo.g files. In the context of redistribution, it is valuable to know how testing is performed, especially which GAP versions are tested and whether testing is consistent. This notebook generates the testing data, which will later be used for data validation.

In [None]:
# Import required modules and libraries
import os
import sys
import re
import requests
import json
from datetime import datetime
from github import RateLimitExceededException
from github.Repository import Repository

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *


### Information on Testing and Testing Consistency

Although most GAP repositories have a tst directory, it may be empty. Test files are also stored in different ways: some sit directly in the tst directory, while others are nested in subdirectories. Comparing the use of tst directories, GitHub Actions CI.yml files and PackageInfo.g files gives useful indicators of how popular and how consistent each testing approach is. Running the notebook exports the results, per package, to a 'testing_data.json' file in the 'collected_data' folder.
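
As a rough illustration of the per-package records this notebook produces, the sketch below builds a small dictionary with the same keys that the export step writes (`tst_file_count`, `total_lines_in_tst_files`, `ci_file_version`, `pkginfo_version`) and flags packages whose PackageInfo.g version does not appear among the versions found in CI.yml. The package names and values are invented for illustration, and `major_minor` and `find_untested_pkginfo_versions` are hypothetical helpers that are not part of the notebook.

```python
# Hypothetical sample shaped like the entries in testing_data.json
sample_testing_data = {
    "pkg_a": {
        "tst_file_count": 3,
        "total_lines_in_tst_files": 120,
        "ci_file_version": ["4.11", "4.12"],
        "pkginfo_version": ["4.11"],
    },
    "pkg_b": {
        "ci_file_version": ["4.12"],
        "pkginfo_version": ["4.9.3"],
    },
    "pkg_c": {},  # no testing information found
}


def major_minor(version: str) -> str:
    """Reduce a version such as '4.9.3' to its 'major.minor' part ('4.9')."""
    return ".".join(version.split(".")[:2])


def find_untested_pkginfo_versions(data: dict) -> list:
    """Return packages whose PackageInfo.g GAP version does not appear in CI.yml."""
    flagged = []
    for package, info in data.items():
        ci_versions = {major_minor(v) for v in info.get("ci_file_version", [])}
        pkg_versions = {major_minor(v) for v in info.get("pkginfo_version", [])}
        # Only compare packages that provide both sources of version information
        if ci_versions and pkg_versions and not pkg_versions & ci_versions:
            flagged.append(package)
    return flagged


print(find_untested_pkginfo_versions(sample_testing_data))
```

Comparing only the major and minor components is one possible design choice; a stricter check could compare full version strings instead.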

In [None]:
# Define global variables for the Jupyter Notebook
# Note: `g` (the GitHub client) and ORG_NAME_PACKAGES are assumed to be provided by data_constants
org = g.get_organization(ORG_NAME_PACKAGES)
repos = org.get_repos(type="public")


##### Functions to Retrieve Testing Information

In [None]:
def check_for_tst_dir(repo: Repository) -> tuple:
    """Check if a 'tst' directory exists and if it is empty for a given repository.

    Args:
        repo (Repository): The GitHub repository object.

    Returns:
        tuple:
            - tst_dir_exists (bool): True if 'tst' directory exists, False otherwise.
            - tst_dir_empty (bool): True if 'tst' directory is empty, False otherwise.
            - repositories_with_tests (int): 1 if this repository has a 'tst' directory, 0 otherwise.
    """
    tst_dir_exists = False
    tst_dir_empty = False
    repositories_with_tests = 0

    contents = repo.get_contents("")

    for item in contents:
        if item.type == "dir" and item.name == "tst":
            tst_dir_exists = True
            repositories_with_tests += 1
            test_contents = repo.get_contents(item.path)
            if len(test_contents) == 0:
                tst_dir_empty = True
            break
    
    return tst_dir_exists, tst_dir_empty, repositories_with_tests


In [None]:
def process_tst_directory(repo: Repository, directory_path: str, tst_file_info: dict) -> tuple:
    """Recursively count '.tst' files in 'tst' directories and subdirectories of a repository.

    Args:
        repo (Repository): The GitHub repository object.
        directory_path (str): The path of the directory to process.
        tst_file_info (dict): Summary for the repository (name, number of tst files, total tst
            file lines). It is passed through the recursion but not modified by this function.

    Returns:
        tuple:
            - num_tst_files (int): The total count of '.tst' files found.
            - total_lines (int): The total number of lines in all '.tst' files.
    """
    contents = repo.get_contents(directory_path)
    num_tst_files = 0
    total_lines = 0

    for item in contents:
        if item.type == "file" and item.name.endswith(".tst"):
            tst_file_content = requests.get(item.download_url).text
            lines = tst_file_content.splitlines()
            num_tst_files += 1
            total_lines += len(lines)
        
        elif item.type == "dir":
            subdirectory_path = f"{directory_path}/{item.name}"
            subdir_num_tst_files, subdir_total_lines = process_tst_directory(repo, subdirectory_path, tst_file_info)
            num_tst_files += subdir_num_tst_files
            total_lines += subdir_total_lines
    
    return num_tst_files, total_lines
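
The recursive traversal above can be tried without GitHub access. The following is a minimal local sketch, assuming the repository has been cloned to disk: `count_tst_files` is a hypothetical helper that walks a directory with `pathlib` instead of `repo.get_contents`, and counts '.tst' files and their lines in the same way. The file names and contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def count_tst_files(directory: Path) -> tuple:
    """Recursively count '.tst' files and their total lines under a local directory."""
    num_tst_files = 0
    total_lines = 0
    for item in directory.iterdir():
        if item.is_file() and item.name.endswith(".tst"):
            num_tst_files += 1
            total_lines += len(item.read_text(encoding="utf-8").splitlines())
        elif item.is_dir():
            # Descend into subdirectories and add their counts
            sub_files, sub_lines = count_tst_files(item)
            num_tst_files += sub_files
            total_lines += sub_lines
    return num_tst_files, total_lines


# Build a throwaway 'tst' tree to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    tst_dir = Path(tmp) / "tst"
    (tst_dir / "nested").mkdir(parents=True)
    (tst_dir / "testall.g").write_text("TestDirectory();\n", encoding="utf-8")
    (tst_dir / "basic.tst").write_text("gap> 1+1;\n2\n", encoding="utf-8")
    (tst_dir / "nested" / "extra.tst").write_text(
        "gap> 2*3;\n6\ngap> 1;\n1\n", encoding="utf-8"
    )
    result = count_tst_files(tst_dir)

print(result)
```

Files that do not end in '.tst', such as the `testall.g` file in this sample tree, are ignored, which mirrors the behaviour of the GitHub-based function.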


In [None]:
def analyse_tst_files(repos: Repository) -> tuple:
    """Analyse the contents of the tst directories for all repositories.

    Args:
        repos (iterable of Repository): The GitHub repositories to analyse.

    Returns:
        tuple:
            - total_test_files (int): The total number of test files across all packages.
            - tst_files_info (list): A list of dictionaries, where each dictionary represents
                information about a test directory with files.
    """
    tst_files_info = []
    total_test_files = 0
    total_lines = 0

    for repo in repos:
        test_exists, tst_dir_empty, _ = check_for_tst_dir(repo)

        if test_exists and not tst_dir_empty:
            tst_file_info = {
                "repository": repo.name,
                "num_tst_files": 0,
                "total_lines": 0
            }
            num_tst_files, lines = process_tst_directory(repo, "tst", tst_file_info)

            if num_tst_files > 0:
                tst_file_info["num_tst_files"] = num_tst_files
                tst_file_info["total_lines"] = lines
                tst_files_info.append(tst_file_info)

            total_test_files += num_tst_files
            total_lines += lines

    return total_test_files, tst_files_info


In [None]:
def ci_version_testing(repos: Repository) -> tuple:
    """Retrieve version information from CI.yml files in multiple repositories.

    Args:
        repos (iterable of Repository): The GitHub repositories to analyse.

    Returns:
        tuple:
            - repos_with_ci_file (int): The number of repositories that have a CI.yml file in their workflows.
            - ci_tested_version (dict): A dictionary where the keys are repository names and the values
                are lists of tested versions extracted from the CI.yml file.
            - repos_without_ci_tests (list): Names of repositories that have a CI.yml file in which
                no 'stable-X.Y' GAP version was found.

    Raises:
        RateLimitExceededException: If the GitHub API rate limit is exceeded. Any other
            error is caught and printed so that the remaining repositories are still analysed.
    """
    repos_with_ci_file = 0
    ci_tested_version = {}
    repos_without_ci_tests = []
    for repo in repos:
        repo_name = repo.name
        try:
            contents = repo.get_contents("")
            has_workflows = any(content.name == ".github" and content.type == "dir" for content in contents)
            if has_workflows:
                workflows_contents = repo.get_contents(".github/workflows")
                if isinstance(workflows_contents, list):
                    if any(file.name.lower() == "ci.yml" for file in workflows_contents):
                        repos_with_ci_file += 1
                        ci_file = next(file for file in workflows_contents if file.name.lower() == "ci.yml")
                        # Capture the version number from GAP branch names such as 'stable-4.12'
                        pattern = r"stable-(\d+\.\d+)"
                        ci_file_contents = requests.get(ci_file.download_url).text
                        matches = re.findall(pattern, ci_file_contents)
                        if matches:
                            ci_tested_version[repo_name] = matches
                        else:
                            repos_without_ci_tests.append(repo_name)
        except RateLimitExceededException:
            # Re-raise so that export_testing_data can wait for the reset and retry
            raise
        except Exception as e:
            print(f"Error occurred while analysing repository '{repo_name}': {e}")
    return repos_with_ci_file, ci_tested_version, repos_without_ci_tests
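
To see what the `stable-(\d+\.\d+)` pattern extracts, the sketch below applies it to a small workflow fragment. The YAML text is invented for illustration and real CI.yml files may be structured differently. Only branch names of the form `stable-X.Y` are captured, so an entry such as `master` is not recorded as a tested version.

```python
import re

# Invented CI.yml fragment; real workflow files may be structured differently
sample_ci_yml = """
jobs:
  test:
    strategy:
      matrix:
        gap-branch:
          - master
          - stable-4.12
          - stable-4.11
"""

pattern = r"stable-(\d+\.\d+)"
# findall returns the captured groups in order of appearance, duplicates included
tested_versions = re.findall(pattern, sample_ci_yml)
print(tested_versions)
```

Because `re.findall` keeps duplicates, a version that is mentioned several times in a workflow file appears several times in the resulting list.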


In [None]:
def pkginfo_version_testing(repos: Repository) -> tuple:
    """Retrieve version information from PackageInfo.g files in multiple repositories.

    Args:
        repos (iterable of Repository): The GitHub repositories to analyse.

    Returns:
        tuple:
            - repos_with_pkginfo_file (int): The number of repositories that have a PackageInfo.g file.
            - pkg_tested_version (list): A list of tuples where each tuple contains the repository name
                and the corresponding GAP version extracted from PackageInfo.g.

    Raises:
        Exception: If an error occurs while retrieving version information.
    """
    repos_with_pkginfo_file = 0
    pkg_tested_version = []
    for repo in repos:
        repo_name = repo.name
        try:
            contents = repo.get_contents("", ref="HEAD")
            pkginfo_file = next((file for file in contents if file.name.lower() == "packageinfo.g"), None)
            if pkginfo_file:
                repos_with_pkginfo_file += 1
                pkginfo_content = pkginfo_file.decoded_content.decode("utf-8")
                version_pattern = r'GAP\s+:=\s+"[^"]*?([\d.]+)"'
                version_match = re.search(version_pattern, pkginfo_content)
                if version_match:
                    gap_version = version_match.group(1)
                    pkg_tested_version.append((repo_name, gap_version))
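        except RateLimitExceededException:
            # Propagate unchanged so that export_testing_data can wait for the reset and retry
            raise
        except Exception as e:
            raise Exception(f"Error occurred while analysing repository '{repo_name}': {str(e)}") from e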
    return repos_with_pkginfo_file, pkg_tested_version
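
The version pattern used above can be checked against a small PackageInfo.g fragment. The fragment below is an invented example and actual files may be laid out differently. The pattern skips any comparison operator inside the quotes and captures only the trailing version number.

```python
import re

# Invented PackageInfo.g fragment for illustration
sample_pkginfo = '''
Dependencies := rec(
  GAP := ">= 4.10",
  NeededOtherPackages := [ [ "GAPDoc", ">= 1.6.1" ] ],
),
'''

version_pattern = r'GAP\s+:=\s+"[^"]*?([\d.]+)"'
match = re.search(version_pattern, sample_pkginfo)
gap_version = match.group(1) if match else None
print(gap_version)

# The pattern requires whitespace around ':=', so a compact form is not matched
compact_match = re.search(version_pattern, 'GAP:=">=4.10",')
print(compact_match)
```

A package that writes the entry without spaces around `:=` is therefore counted as having a PackageInfo.g file but yields no version.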


In [None]:
def export_testing_data() -> None:
    """Export the testing data to a JSON file.

    If the GitHub API rate limit is exceeded, wait until the limit resets and then retry.

    Returns:
        None.
    """
    while True:
        try:
            # Export collected data to JSON file to store them for later use and better overview
            data_folder = "collected_data"
            version_testing_data = {}

            total_test_files, tst_files_info = analyse_tst_files(repos)
            repos_with_ci_file, ci_tested_version, repos_without_ci_tests = ci_version_testing(repos)
            repos_with_pkginfo_file, pkg_tested_version = pkginfo_version_testing(repos)

            # Add all repositories as keys to version_testing_data
            for repo in repos:
                package = repo.name
                version_testing_data[package] = {}

            for tst_file_info in tst_files_info:
                package = tst_file_info["repository"]
                if "num_tst_files" in tst_file_info and "total_lines" in tst_file_info:
                    version_testing_data[package]["tst_file_count"] = tst_file_info["num_tst_files"]
                    version_testing_data[package]["total_lines_in_tst_files"] = tst_file_info["total_lines"]

            # Add version info from CI.yml files to the dictionary
            for package, versions in ci_tested_version.items():
                if versions:
                    version_testing_data[package]["ci_file_version"] = versions

            # Add GAP version info from PackageInfo.g files to the dictionary
            for package, version in pkg_tested_version:
                if version:
                    version_testing_data[package]["pkginfo_version"] = [version]

            # Create the output folder if it does not exist yet
            os.makedirs(data_folder, exist_ok=True)
            file_path = os.path.join(data_folder, "testing_data.json")

            with open(file_path, "w") as f:
                json.dump(version_testing_data, f, indent=4)

            print("Version testing data has been exported to the 'testing_data.json' file in the 'collected_data' folder.")
            break

        except RateLimitExceededException:
            remaining_requests, _ = g.rate_limiting
            reset_time = g.rate_limiting_resettime
            if remaining_requests < 100:
                wait_until_reset(reset_time)
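
The `wait_until_reset` helper is not defined in this notebook and is presumably imported from another project module. A minimal sketch of what such a helper could look like is given below, assuming the reset time is a Unix timestamp in seconds. `seconds_until_reset` and `wait_until_reset_sketch` are hypothetical names, and the project's own implementation may differ.

```python
import time
from typing import Optional


def seconds_until_reset(reset_epoch: float, now: Optional[float] = None, buffer: int = 5) -> float:
    """Return how long to sleep until the rate limit resets (never negative)."""
    if now is None:
        now = time.time()
    # Add a small buffer so the retry does not land just before the reset
    return max(0.0, float(reset_epoch - now)) + buffer


def wait_until_reset_sketch(reset_epoch: float) -> None:
    """Hypothetical stand-in for wait_until_reset: sleep until the reset time has passed."""
    delay = seconds_until_reset(reset_epoch)
    print(f"Rate limit reached, sleeping for {delay:.0f} seconds.")
    time.sleep(delay)


# Example with fixed timestamps so the result does not depend on the current time
print(seconds_until_reset(1_700_000_600, now=1_700_000_000))
```

Separating the delay calculation from the call to `time.sleep` keeps the arithmetic easy to check without actually waiting.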


##### Get and Export Testing Information

In [None]:
# Call the function to export the data
export_testing_data()
