## Analysing Retrieved Data

This Jupyter Notebook examines and validates the data generated by the previous data collection notebooks. From a management perspective, the real value of the framework lies in analysing, comparing and contrasting the data, and in automatically pointing out findings that may be significant to the redistribution process.

In [None]:
# Import required modules and libraries
import os
import sys
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from ydata_profiling import ProfileReport
from packaging import version
from packaging.version import Version, parse
from IPython.display import display, Markdown

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *


### Managing GitHub API Calls

The connection to GitHub and the verification of the user token rely on an access token stored as an environment variable. The function for getting the token is imported from the utils file in the project. As the API has a limit of 5000 calls per hour, the remaining calls and the reset time are tracked below. Each function that retrieves and exports data in the data retrieval files can pause the workflow when the user runs out of API calls: the program sleeps until the limit is renewed and then resumes the job it was completing when the limit ran out.

In [None]:
# Check the total and remaining GitHub API calls, and see when the limit will reset
remaining_requests, request_limit = g.rate_limiting
print(f"Request limit for API Calls: {request_limit}")
print(f"Remaining requests for API Calls: {remaining_requests}")

limit_reset_time = g.rate_limiting_resettime
reset_time = datetime.fromtimestamp(limit_reset_time).strftime('%Y-%m-%d %H:%M:%S')
print(f"Reset time for API Calls: {reset_time}")
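
The pause-and-resume behaviour described above can be sketched in a few lines. This is a minimal illustration, not the project's own implementation: the function names `seconds_until_reset` and `wait_if_exhausted`, the one-second safety margin and the example numbers are all assumptions made for this sketch.

```python
import time


def seconds_until_reset(reset_epoch: float, now_epoch: float) -> float:
    """Return how long to wait until the rate limit resets (never negative)."""
    return max(0.0, reset_epoch - now_epoch)


def wait_if_exhausted(remaining: int, reset_epoch: float, sleep=time.sleep, now=time.time) -> float:
    """Sleep until the reset time when no calls remain; return the seconds waited."""
    if remaining > 0:
        return 0.0
    # Hypothetical one-second margin so the job resumes just after the reset
    wait = seconds_until_reset(reset_epoch, now()) + 1
    sleep(wait)
    return wait


# Made-up numbers: no calls left and a reset 90 seconds in the future.
# The sleep function is replaced by a no-op so the example returns immediately.
waited = wait_if_exhausted(remaining=0, reset_epoch=1_000_090, sleep=lambda s: None, now=lambda: 1_000_000)
print(f"Would have slept for {waited:.0f} seconds")
```

In the notebook, the two inputs would correspond to `remaining_requests` and `limit_reset_time` from the cell above.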


### Processing and Evaluating the Data

As this framework intends to separate code and data, the validation process uses the data generated by the user during data extraction as the basis for analysis. No information is provided other than that created by the user in the 'collected_data' folder. The focus of the analysis and validation process is to provide an executive summary of the data while pointing to any outliers and significant findings. As such, this notebook highlights relationships and deviations that could be of importance to the redistribution process.

##### Functions to Analyse Retrieved Data

In [None]:
def check_versions(data: dict) -> tuple:
    """Check the versions provided in the data.

    Args:
        data (dict): The data containing package versions.

    Returns:
        tuple:
            - ci_only_packages (list): Packages with versions specified only in the CI file.
            - package_info_only_packages (list): Packages with versions specified only in the PackageInfo file.
            - both_versions_packages (list): Packages with versions specified in both CI file and PackageInfo file.
    """
    ci_only_packages = []
    package_info_only_packages = []
    both_versions_packages = []

    for package, versions in data.items():
        ci_versions = versions.get('tested_ci_versions', [])
        package_info_versions = versions.get('required_pkginfo_version', [])

        if ci_versions and not package_info_versions:
            ci_only_packages.append(package)

        elif package_info_versions and not ci_versions:
            package_info_only_packages.append(package)

        elif ci_versions and package_info_versions:
            both_versions_packages.append(package)

    return ci_only_packages, package_info_only_packages, both_versions_packages
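
To make the three-way split concrete, the sketch below restates the same classification in compact form and applies it to a small, made-up data set. The function name `classify` and the package names are hypothetical; only the keys `tested_ci_versions` and `required_pkginfo_version` come from the notebook.

```python
def classify(data: dict) -> dict:
    """Compact restatement of the three-way split, for illustration only."""
    groups = {"ci_only": [], "pkginfo_only": [], "both": []}
    for package, versions in data.items():
        has_ci = bool(versions.get("tested_ci_versions"))
        has_required = bool(versions.get("required_pkginfo_version"))
        if has_ci and has_required:
            groups["both"].append(package)
        elif has_ci:
            groups["ci_only"].append(package)
        elif has_required:
            groups["pkginfo_only"].append(package)
        # Packages with neither kind of version land in no group
    return groups


# Made-up sample in the same shape as the testing data
sample = {
    "alpha": {"tested_ci_versions": ["4.11", "4.12"]},
    "beta": {"required_pkginfo_version": ["4.10"]},
    "gamma": {"tested_ci_versions": ["4.12"], "required_pkginfo_version": ["4.11"]},
    "delta": {"tst_file_count": 3},
}
groups = classify(sample)
print(groups)
```

Note that a package such as `delta`, which has neither a CI version list nor a required version, is not reported in any of the three lists.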


In [None]:
def compare_ci_and_pkg_versions(data: dict) -> list:
    """Compare the CI and PackageInfo data to see if all versions tested in the 
    CI file are equal to or greater than the required version from the PackageInfo file.

    Args:
        data (dict): The data containing package versions.

    Returns:
        list: Packages where at least one version tested in the CI file is below the required version in the PackageInfo file.
    """
    packages_with_mismatch = []

    for package, versions in data.items():
        ci_versions = versions.get("tested_ci_versions", [])
        package_version = versions.get("required_pkginfo_version")

        if ci_versions and package_version:
            if not all(version.parse(ci) >= version.parse(package_version[0]) for ci in ci_versions):
                packages_with_mismatch.append(package)

    return packages_with_mismatch
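
The comparison relies on parsed versions rather than raw strings, because string comparison orders "4.10" before "4.9". The dependency-free sketch below illustrates the difference with integer tuples. It is a simplified stand-in for `packaging.version.parse`, and the helper names `release_tuple` and `all_tested_at_least` are assumptions made for this example.

```python
def release_tuple(version_str: str) -> tuple:
    """Split a purely numeric version string such as '4.10' into (4, 10)."""
    return tuple(int(part) for part in version_str.split("."))


def all_tested_at_least(ci_versions: list, required: str) -> bool:
    """True if every tested version is greater than or equal to the required one."""
    return all(release_tuple(ci) >= release_tuple(required) for ci in ci_versions)


# Character-by-character comparison gets the order wrong
print("4.10" >= "4.9")  # False
# Comparing the numeric components gets it right
print(release_tuple("4.10") >= release_tuple("4.9"))  # True

print(all_tested_at_least(["4.9", "4.10", "4.11"], "4.9"))  # True
print(all_tested_at_least(["4.8", "4.10"], "4.9"))  # False
```

Unlike `packaging`, this sketch does not treat "4.9" and "4.9.0" as equal and cannot handle non-numeric components, so it is only meant to show why parsing matters.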


In [None]:
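def find_next_version(version: Version) -> Version:
    """Find the next version after the given version, to check for gaps in testing patterns.

    If the last release component is 0, the second-to-last component is incremented
    (4.10.0 becomes 4.11.0); otherwise the last component is incremented (4.10 becomes 4.11).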

    Args:
        version (Version): The version for which to find the next version.

    Returns:
        Version: The next version that should be after the given version.
    """
    version_info = list(version.release)
    if version_info[-1] == 0:
        version_info[-2] = version_info[-2] + 1
    else:
        version_info[-1] = version_info[-1] + 1
    return Version(".".join(str(comp) for comp in version_info))
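
The increment rule is easiest to see on plain tuples. The sketch below mirrors the rule using integer tuples instead of `Version` objects. The name `next_release` is hypothetical, and the length check for single-component versions is an addition made for this sketch that the function above does not include.

```python
def next_release(release: tuple) -> tuple:
    """Apply the same increment rule to a release tuple such as (4, 10, 0)."""
    parts = list(release)
    if parts[-1] == 0 and len(parts) > 1:
        # A trailing zero means the release is a fresh minor (or major) version,
        # so the component before it is bumped
        parts[-2] += 1
    else:
        parts[-1] += 1
    return tuple(parts)


for release in [(4, 10, 0), (4, 10, 2), (4, 10), (4, 0)]:
    print(release, "->", next_release(release))
```

The last case is worth noting: because the final component of `(4, 0)` is zero, the rule bumps the major component and yields `(5, 0)`.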


In [None]:
def version_tuple(version_str: str) -> tuple:
    """Convert a version string to a tuple representation.

    Args:
        version_str (str): The version string to be converted.

    Returns:
        tuple: A tuple representation of the version, consisting of
        major version, minor version and patch version.
    """
    version = parse(version_str)
    version_info = list(version.release)
    if len(version_info) == 2:
        version_info.append(0)
    return tuple(version_info)
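
Padding versions to a common length matters because Python compares tuples element by element, which makes `min` pick the lowest version reliably. The following sketch shows this on made-up version strings. The helper `pad_release` is hypothetical and, unlike the function above, pads any version shorter than three components and only accepts numeric parts.

```python
def pad_release(version_str: str, width: int = 3) -> tuple:
    """Turn '4.10' into (4, 10, 0) so all versions have the same length."""
    parts = [int(part) for part in version_str.split(".")]
    parts += [0] * (width - len(parts))  # no effect when already long enough
    return tuple(parts)


tested = ["4.11", "4.9.1", "4.10"]
lowest = min(pad_release(v) for v in tested)
print(lowest)
```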


In [None]:
def check_version_gaps(data: dict) -> dict:
    """Check for gaps in version testing between the tested versions and the required version,
    ignoring differences in the patch component. A package is reported when its lowest tested
    version is below the required version or more than one minor release above it.

    Args:
        data (dict): The data containing package versions.

    Returns:
        dict: A dict mapping package names to True if there are version gaps, False otherwise.
    """
    version_gaps = {}
    for package, package_data in data.items():
        required_versions = package_data.get("required_pkginfo_version", [])
        if not required_versions:
            continue

        ci_versions = package_data.get("tested_ci_versions", [])
        if not ci_versions:
            continue

        required_version_str = required_versions[0]
        required_version = version_tuple(required_version_str)
        lowest_ci_version = min(version_tuple(v) for v in ci_versions)

        if (
            lowest_ci_version[:-2] == required_version[:-2]
            and lowest_ci_version[-2] == required_version[-2] + 1
            and lowest_ci_version[-1] == 0
        ):
            continue

        next_version = find_next_version(parse(required_version_str))

        has_gap = not (
            (lowest_ci_version[:-1] >= required_version[:-1] and lowest_ci_version <= version_tuple(str(next_version)))
            or (lowest_ci_version[:-2] == required_version[:-2] and lowest_ci_version[-1] == required_version[-1] + 1)
        )

        if has_gap:
            version_gaps[package] = has_gap

    return version_gaps
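
The core idea of the gap check can be reduced to a comparison of major and minor components. The sketch below is a deliberately simplified version under stated assumptions: it ignores patch components entirely, flags any difference in major version, and uses the hypothetical name `has_minor_gap`. It does not reproduce every edge case of the function above.

```python
def has_minor_gap(required: tuple, lowest_tested: tuple) -> bool:
    """Flag a gap unless the lowest tested minor is the required minor or the one after it."""
    req_major, req_minor = required[:2]
    low_major, low_minor = lowest_tested[:2]
    if low_major != req_major:
        return True  # simplification: a different major version is always flagged
    return not (req_minor <= low_minor <= req_minor + 1)


required = (4, 9, 0)
for lowest in [(4, 9, 5), (4, 10, 0), (4, 11, 0), (4, 8, 0)]:
    print(lowest, "gap" if has_minor_gap(required, lowest) else "ok")
```

With a required version of 4.9, testing from 4.9.5 or 4.10 is accepted, while starting at 4.11 skips a minor release and starting at 4.8 tests below the requirement.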


##### Analyse and Display Extracted Data

In [None]:
# Define organisation and repositories
org = g.get_organization(ORG_NAME_PACKAGES)
repos = org.get_repos(type="public")

# Load the repository data from the JSON file
data_folder = "collected_data"
repo_file_path = os.path.join(data_folder, "repo_data.json")
repo_data = load_data(repo_file_path)

# Load monitoring data from the JSON file
monitoring_file_path = os.path.join(data_folder, "monitoring_data.json")
monitoring_data = load_data(monitoring_file_path)

# Load testing data from the JSON file
testing_file_path = os.path.join(data_folder, "testing_data.json")
testing_data = load_data(testing_file_path)

# Load community data from the JSON file
community_file_path = os.path.join(data_folder, "community_data.json")
community_data = load_data(community_file_path)
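
The `load_data` helper is defined elsewhere in the project and is not shown in this notebook. For readers who want to experiment with the cells in isolation, a stand-in could look like the sketch below. The name `load_json_file` and the assumption that the helper simply parses a JSON file are both hypothetical.

```python
import json
import os
import tempfile


def load_json_file(path: str):
    """Hypothetical stand-in for load_data: read a JSON file and return its contents."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary folder that mimics 'collected_data'
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "testing_data.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump({"example_pkg": {"tst_file_count": 2}}, handle)
    loaded = load_json_file(sample_path)

print(loaded["example_pkg"]["tst_file_count"])
```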


##### General Statistics: GAP Packages and Distribution

In [None]:
# Get the total number of GAP repositories hosted by the GAP organisation on GitHub
total_packages = repos.totalCount
print(f"Number of GAP packages from GAP Repository: {total_packages}")


In [None]:
# Get the latest release and version of GAP
repo_url = f"https://api.github.com/repos/{ORG_NAME_SYSTEM}/gap/releases/latest"
headers = {'Authorization': f'token {github_token}'}
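response = requests.get(repo_url, headers=headers)
response.raise_for_status()  # Stop early if the request failed, e.g. invalid token or exhausted rate limit
latest_release = response.json()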
latest_version = latest_release.get("tag_name")
print(f"The latest version of GAP is: {latest_version}.")


In [None]:
# Get the number of GAP packages hosted elsewhere on GitHub
# The count is obtained by scraping the web page with Beautiful Soup
# NB: These numbers are only indicative, as the listing style of the page means that only parent list items are counted
url = "https://gap-packages.github.io/"
response = requests.get(url)

# Find the section of the webpage with the packages stored elsewhere on GitHub
soup = BeautifulSoup(response.text, "html.parser")
section = soup.find("section", id="main-content")
ul = section.find_next("ul")

# Count only the direct li children, so that nested ul or li elements do not increase the count
packages = ul.find_all("li", recursive=False)
count = len(packages)

print(f"Number of GAP packages hosted elsewhere on GitHub: {count}")
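
The counting rule used above (only direct children of the first list, no nested items) can be illustrated without Beautiful Soup. The sketch below applies the same rule with Python's built-in `html.parser` to a made-up HTML fragment. The class name `TopLevelItemCounter` is hypothetical and this is not how the notebook performs the scrape.

```python
from html.parser import HTMLParser


class TopLevelItemCounter(HTMLParser):
    """Count li elements that are direct children of the first ul in the document."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0
        self.count = 0
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li" and self.ul_depth == 1:
            self.count += 1  # nested items sit at depth 2 or more and are skipped

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1
            if self.ul_depth == 0:
                self.done = True  # ignore every list after the first one


html = "<ul><li>a<ul><li>a1</li><li>a2</li></ul></li><li>b</li></ul><ul><li>c</li></ul>"
counter = TopLevelItemCounter()
counter.feed(html)
print(counter.count)
```

The fragment contains five list items in total, but only the two parent items of the first list are counted, which is why the figure is indicative rather than exact.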


##### Repository Data: Key Metrics and Notable Statistics

In [None]:
# Calculate overall statistics and generate relevant insights for each repository
total_repos = len(repo_data)
total_releases = sum(repo['total_releases'] for repo in repo_data)
total_open_issues = sum(repo['open_issues_count'] for repo in repo_data)
total_open_pull_requests = sum(repo['open_pull_requests'] for repo in repo_data if repo['open_pull_requests'])
total_bug_count = sum(repo['bug_count'] for repo in repo_data if repo['bug_count'])
total_enhancement_count = sum(repo['enhancement_count'] for repo in repo_data if repo['enhancement_count'])

# Display the calculated overall metrics
print(f"Total Repositories: {total_repos}")
print(f"Total Releases: {total_releases}")
print(f"Total Open Issues: {total_open_issues}")
print(f"Total Open Pull Requests: {total_open_pull_requests}")
print(f"Total Bug Count: {total_bug_count}")
print(f"Total Enhancement Count: {total_enhancement_count}")

# Display inactive repositories where there has been no activity in the last 90 days
pd.set_option('display.max_rows', None)
inactive_repositories = [repo['repo'] for repo in repo_data if repo['last_activity_time'] is None]

if inactive_repositories:
    # Create a DataFrame from the list of inactive repositories
    df = pd.DataFrame(inactive_repositories, columns=['Inactive Repositories'])
    display(df)
else:
    print("All repositories had activity within the past 90 days.")
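
The `if repo[...]` filters in the sums above skip entries whose value is missing. The sketch below shows the effect on made-up records: the field names match the notebook, while the values and the timestamp format are assumptions.

```python
# Made-up records in the same shape as the repository data
sample_repos = [
    {"repo": "alpha", "open_pull_requests": 3, "last_activity_time": "2023-05-01T10:00:00Z"},
    {"repo": "beta", "open_pull_requests": None, "last_activity_time": None},
    {"repo": "gamma", "open_pull_requests": 0, "last_activity_time": "2023-04-12T08:30:00Z"},
]

# The filter drops None (and 0, which does not change the sum)
total_open_prs = sum(r["open_pull_requests"] for r in sample_repos if r["open_pull_requests"])

# A missing activity timestamp marks the repository as inactive
inactive = [r["repo"] for r in sample_repos if r["last_activity_time"] is None]

print(total_open_prs)
print(inactive)
```

Without the filter, the `None` value for `beta` would make `sum` raise a `TypeError`.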


In [None]:
# Generate statistical analysis for other variables, using a ProfileReport
data_list = [
    {
        'total_releases': repo['total_releases'],
        'age_in_days': repo['age_in_days'],
        'open_issues_count': repo['open_issues_count'],
        'total_pull_requests': repo['total_pull_requests'],
        'open_pull_requests': repo['open_pull_requests'],
        'closed_pull_requests': repo['closed_pull_requests'],
    }
    for repo in repo_data
]

# Create a DataFrame from the list of dicts
repo_df = pd.DataFrame(data_list)

# Generate the ProfileReport based on selected columns
profile = ProfileReport(repo_df, title="Statistical analysis for repositories managed by gap-system organisation on GitHub")
profile.to_widgets()


##### Monitoring Data: Key Metrics and Notable Statistics

In [None]:
# Get the relevant information from the loaded data
packages_with_different_versions = monitoring_data['packages_with_different_versions']
all_previous_and_maybe_next = monitoring_data['all_previous_and_maybe_next']
previous_and_maybe_next_labels = monitoring_data['previous_and_maybe_next_labels']

# Compare the latest released version number to the one on the main branch in GAP PackageDistro
# Packages with different version numbers will be in the next GAP release
df_versions = pd.DataFrame(packages_with_different_versions)
df_versions = df_versions[['package_name', 'latest_version', 'main_branch_version']]
print("Packages with different versions in the latest GAP release and in the GAP PackageDistro: ")
display(df_versions)

# Find the packages in unmerged pull requests, as these may be in the next release but have not yet been merged
df_previous_next = pd.DataFrame(all_previous_and_maybe_next, columns=['Package'])
print("\nAll packages that were in the previous release and have unmerged pull requests: ")
display(df_previous_next)

# Only retrieve packages with unmerged pull requests that have certain labels indicating release relation
df_labels = pd.DataFrame(previous_and_maybe_next_labels, columns=['Package'])
print("\nPackages with release related labels that were in the previous release and have unmerged pull requests: ")
display(df_labels)


##### Testing Data: Key Metrics and Notable Statistics

In [None]:
# Get the total number of test directories and test files for all the repositories
tst_dirs_with_files = len(testing_data)
total_test_files = sum(data.get("tst_file_count", 0) for data in testing_data.values())
display(Markdown(f"**Repositories with test directories containing files:** {tst_dirs_with_files}"))
display(Markdown(f"**Total number of test files for all packages:** {total_test_files}"))

# Get the number of repositories with a CI file, and the number of those that do not have one
repos_with_ci_file = [package for package, data in testing_data.items() if "tested_ci_versions" in data]
num_packages_without_tests = len([package for package, data in testing_data.items() if "tested_ci_versions" not in data])
display(Markdown(f"**Number of repositories with CI file:** {len(repos_with_ci_file)}"))

# Get the number of repositories with a PackageInfo file, and the names of those that do not have one
repos_with_pkginfo_file = [package for package, data in testing_data.items() if "required_pkginfo_version" in data]
repos_without_pkginfo_file = [package for package in testing_data.keys() if package not in repos_with_pkginfo_file]
display(Markdown(f"**Number of repositories with PackageInfo.g file:** {len(repos_with_pkginfo_file)}"))

# Compare version testing in tested versions and required version for packages
ci_only_packages, package_info_only_packages, both_versions_packages = check_versions(testing_data)
display(Markdown(f"**Number of packages with CI version testing but no required version:** {len(ci_only_packages)}"))
display(Markdown(f"**Number of packages with required version but no CI version testing:** {len(package_info_only_packages)}"))
display(Markdown(f"**Number of packages with both CI version testing and required version:** {len(both_versions_packages)}"))

# Create a DataFrame with more detailed information on the testing for each repository
df_detailed_info = pd.DataFrame({
    "Repository": list(testing_data.keys()),
    "Count Test Files": [data.get("tst_file_count", 0) for data in testing_data.values()],
    "Tested CI Versions": [', '.join(data.get("tested_ci_versions", ["None"])) for data in testing_data.values()],
    "Required PkgInfo Version": [', '.join(data.get("required_pkginfo_version", ["None"])) for data in testing_data.values()],
    "CI file has test data in file": ['Yes' if "tested_ci_versions" in data else 'No' for package, data in testing_data.items()],
    "Package has PackageInfo file": ['Yes' if package in repos_with_pkginfo_file else 'No' for package in testing_data.keys()]
})

display(df_detailed_info)


In [None]:
# Get the repositories with the latest version of GAP in their tested versions from the CI file
# Strip the leading "v" from the release tag and keep only the major and minor components, so that patch versions are ignored
latest_package_version = latest_version.lstrip("v")
latest_package_version = ".".join(latest_package_version.split(".")[:2])

latest_version_obj = version.parse(latest_package_version)
repos_with_latest_gap_version = [
    package for package, data in testing_data.items() if any(latest_version_obj == version.parse(ci) for ci in data.get("tested_ci_versions", []))
]

count_repos_with_latest_gap_version = len(repos_with_latest_gap_version)
print(f"Number of repositories with the latest version of GAP in their tested versions: {count_repos_with_latest_gap_version}")

if count_repos_with_latest_gap_version > 0:
    df_repos = pd.DataFrame(repos_with_latest_gap_version, columns=['Package'])
    print("\nRepositories with the latest version of GAP in their tested versions:")
    display(df_repos)
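
The normalisation of the release tag can be checked in isolation. The sketch below wraps the same two string operations in a hypothetical helper, `major_minor`, and applies it to made-up tags. The membership test at the end compares plain strings, whereas the cell above compares parsed versions.

```python
def major_minor(tag: str) -> str:
    """Reduce a release tag such as 'v4.12.2' to its major and minor part, '4.12'."""
    return ".".join(tag.lstrip("v").split(".")[:2])


for tag in ["v4.12.2", "4.13.0", "v4.12"]:
    print(tag, "->", major_minor(tag))

# Made-up list of versions tested by one package
tested_versions = ["4.11", "4.12"]
print(major_minor("v4.12.2") in tested_versions)
```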


In [None]:
# Check if package tested versions are equal to or greater than the required version
packages_with_mismatch = compare_ci_and_pkg_versions(testing_data)
df_packages_with_mismatch = pd.DataFrame(packages_with_mismatch, columns=['Package'])

if not df_packages_with_mismatch.empty:
    print("Tested versions are not all greater than or equal to required version for the following packages:")
    display(df_packages_with_mismatch)
else:
    print("All packages have tested versions that are greater than or equal to the required version.")


In [None]:
# Check for gaps in the tested versions and the required version for each package, or higher required than some tested version
version_gaps = check_version_gaps(testing_data)
df_version_gaps = pd.DataFrame(
    [
        (package, ', '.join(data.get("tested_ci_versions", ["None"])), ', '.join(data.get("required_pkginfo_version", ["None"])))
        for package, data in testing_data.items() if version_gaps.get(package)
    ],
    columns=['Package', 'CI File Version', 'Required PkgInfo Version'],
)

if not df_version_gaps.empty:
    print("Packages with gaps in tested and required version, or where required version is higher than some tested version: ")
    display(df_version_gaps)
else:
    print("No packages with gaps in tested and required version, or higher required version than some tested version.")


##### Community Data: Key Metrics and Notable Statistics

In [None]:
# Get key numbers for contributors of GAP on GitHub
# Currently, inactive contributors are those with no commits in the past 12 months
all_authors = community_data['authors']
all_submitters = community_data['submitters']
author_submitters = community_data['author_submitters']
inactive_contributors = community_data['inactive_contributors']

print(f"Total number of authors for all GAP packages: {len(all_authors)}")
print(f"Total number of submitters for all GAP packages: {len(all_submitters)}")
print(f"Total number of authors who were also submitters for all GAP packages: {len(author_submitters)}")
print(f"Total number of inactive contributors for all GAP packages: {len(inactive_contributors)}")


##### Save Notebook For Dashboard

After the script has been executed once and the outputs have been generated, it is very important to **save the file** before starting the Streamlit dashboard. If not, the outputs will not be available on the dashboard.