## GAP Data Analytics, Data Validation

This Jupyter Notebook is intended for closer examination and subsequent validation of the data generated by the previous notebooks. Beyond having access to the data, the real value from a management perspective lies in analysing, comparing and contrasting it, and in automating the detection of findings that may be significant for the redistribution process. By combining different data types from various sources, this notebook highlights the noteworthy findings of the data extraction process.

In [None]:
# Import required modules and libraries
import os
import sys
import json
from packaging import version
from github import Repository

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *

### Processing and Evaluating the Data Output

Because this framework intends to separate code from data, and because the results of running a block of code will differ over time as the input changes, the validation process focuses on providing the user of the framework with an executive summary of the metrics, outliers and findings that are significant for the redistribution process at the time the script is executed. As such, the findings generated in this notebook point to relationships and deviations that could be of importance at this moment in time.

##### Functions to Analyse Testing Information

In [None]:
def load_data(file_path: str) -> dict:
    """Load data from a JSON file.

    Args:
        file_path (str): The path to the JSON file.

    Returns:
        dict: The loaded data as a Python dictionary.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data


In [None]:
def check_versions(data: dict) -> tuple:
    """Check the versions provided in the data.

    Args:
        data (dict): The data containing package versions.

    Returns:
        tuple:
            - ci_only_packages (list): Packages with versions specified only in the CI file.
            - package_info_only_packages (list): Packages with versions specified only in the PackageInfo file.
            - both_versions_packages (list): Packages with versions specified in both CI file and PackageInfo file.
    """
    ci_only_packages = []
    package_info_only_packages = []
    both_versions_packages = []

    for package, versions in data.items():
        ci_versions = versions.get('ci_file_version', [])
        package_info_versions = versions.get('pkginfo_version', [])

        if ci_versions and not package_info_versions:
            ci_only_packages.append(package)

        elif package_info_versions and not ci_versions:
            package_info_only_packages.append(package)

        elif ci_versions and package_info_versions:
            both_versions_packages.append(package)

    return ci_only_packages, package_info_only_packages, both_versions_packages

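To make the expected input shape concrete, here is a minimal sketch that runs on invented sample data. The package names, the version strings and the helper `classify_version_sources` are hypothetical and not part of the framework. The sketch also tracks a fourth group, packages with neither version source, which the function above leaves out of all three of its lists.

```python
# Hypothetical sample in the shape check_versions appears to expect:
# package name -> {"ci_file_version": [...], "pkginfo_version": [...]}
sample_testing_data = {
    "pkg_a": {"ci_file_version": ["4.11", "4.12"], "pkginfo_version": []},
    "pkg_b": {"ci_file_version": [], "pkginfo_version": ["4.10"]},
    "pkg_c": {"ci_file_version": ["4.12"], "pkginfo_version": ["4.11"]},
    "pkg_d": {},  # no version information at all
}


def classify_version_sources(data: dict) -> dict:
    """Group package names by which version sources are present."""
    groups = {"ci_only": [], "pkginfo_only": [], "both": [], "neither": []}
    for package, versions in data.items():
        has_ci = bool(versions.get("ci_file_version"))
        has_pkginfo = bool(versions.get("pkginfo_version"))
        if has_ci and has_pkginfo:
            groups["both"].append(package)
        elif has_ci:
            groups["ci_only"].append(package)
        elif has_pkginfo:
            groups["pkginfo_only"].append(package)
        else:
            groups["neither"].append(package)
    return groups


groups = classify_version_sources(sample_testing_data)
for name, packages in groups.items():
    print(f"{name}: {', '.join(packages) or '-'}")
```

Reporting the "neither" group separately can help when the counts of the three lists do not add up to the total number of packages in the input file.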

In [None]:
def compare_ci_and_pkg_versions(data: dict) -> list:
    """Compare the CI and PackageInfo testing data specifically.

    Args:
        data (dict): The data containing package versions.

    Returns:
        list: Packages where at least one version in the CI file is below the version in the PackageInfo file.
    """
    packages_with_mismatch = []

    for package, versions in data.items():
        ci_versions = versions.get("ci_file_version", [])
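        package_version = versions.get("pkginfo_version")

        # Only compare when both sources list at least one version
        if ci_versions and package_version:
            # Every CI version must be >= the first PackageInfo version entry
            if not all(version.parse(ci) >= version.parse(package_version[0]) for ci in ci_versions):
                packages_with_mismatch.append(package)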

    return packages_with_mismatch

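The comparison relies on parsed version objects rather than raw strings. The short sketch below illustrates why, and shows one possible way to guard against entries that are not version numbers. It assumes that a CI file might list something like a branch name, which is a guess about the data; `safe_parse` is a hypothetical helper, not part of the framework.

```python
from packaging.version import InvalidVersion, Version

# Plain string comparison is lexicographic, so "4.9" sorts after "4.10".
string_result = "4.9" < "4.10"
parsed_result = Version("4.9") < Version("4.10")
print(string_result, parsed_result)


def safe_parse(raw: str):
    """Return a Version, or None if raw is not a valid version string."""
    try:
        return Version(raw)
    except InvalidVersion:
        return None


# "master" is an assumed example of a non-numeric entry.
print(safe_parse("4.12.2"), safe_parse("master"))
```

Whether invalid entries should be skipped, reported, or treated as a mismatch is a design decision this sketch leaves open.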

##### Analyse and Display Testing Information

In [None]:
# Load the repo data from the JSON file
data_folder = "collected_data"
repo_file_path = os.path.join(data_folder, "repo_data.json")
repo_data = load_data(repo_file_path)

# Load monitoring data from the JSON file
monitoring_file_path = os.path.join(data_folder, "monitoring_data.json")
monitoring_data = load_data(monitoring_file_path)

# Load testing data from the JSON file
testing_file_path = os.path.join(data_folder, "testing_data.json")
testing_data = load_data(testing_file_path)

# Load community data from the JSON file
community_file_path = os.path.join(data_folder, "community_data.json")
community_data = load_data(community_file_path)

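The four load steps above assume that every JSON file already exists in `collected_data`. If one of the earlier notebooks has not been run, `open` raises `FileNotFoundError`. The following is a minimal sketch of a more forgiving loader; the name `load_data_or_empty` and the choice to fall back to an empty dictionary are assumptions, not part of the framework.

```python
import json
from pathlib import Path


def load_data_or_empty(file_path: str) -> dict:
    """Load a JSON object from file_path, returning {} if the file is missing."""
    path = Path(file_path)
    if not path.is_file():
        print(f"Warning: {path} not found, continuing with empty data")
        return {}
    with path.open("r", encoding="utf-8") as file:
        data = json.load(file)
    # The analysis functions iterate over data.items(), so require a JSON object.
    if not isinstance(data, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return data
```

Falling back to an empty dictionary keeps the remaining cells runnable, at the cost of hiding a missing input, so the warning message is the only signal that a data set was skipped.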

In [None]:
# Analyse the version information for individual GAP packages
ci_only_packages, package_info_only_packages, both_versions_packages = check_versions(testing_data)

# Printing the results of the analysis
print("Number of packages with CI_Version but not PackageInfo_Version:", len(ci_only_packages))
if ci_only_packages:
    print("Packages with CI_Version but not PackageInfo_Version:")
    print(", ".join(ci_only_packages))

print("Number of packages with PackageInfo_Version but not CI_Version:", len(package_info_only_packages))
if package_info_only_packages:
    print("Packages with PackageInfo_Version but not CI_Version:")
    print(", ".join(package_info_only_packages))
    
print("Number of packages with both CI_Version and PackageInfo_Version:", len(both_versions_packages))
if both_versions_packages:
    print("Packages with both CI_Version and PackageInfo_Version:")
    print(", ".join(both_versions_packages))


In [None]:
# Check whether all CI versions are equal to or greater than the version listed in the PackageInfo file
packages_with_mismatch = compare_ci_and_pkg_versions(testing_data)
if packages_with_mismatch:
    print(f"CI versions are not all greater than or equal to PackageInfo version for package(s): {', '.join(packages_with_mismatch)}")

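The check above reports only the names of the affected packages. For follow-up it may help to see which CI versions fall below the PackageInfo version. The sketch below is one possible way to do that; `find_lagging_ci_versions` and the sample data are hypothetical, and the data shape is assumed to match `testing_data`.

```python
from packaging.version import InvalidVersion, Version


def find_lagging_ci_versions(data: dict) -> dict:
    """Map each package to the CI versions that are below its PackageInfo version."""
    lagging = {}
    for package, versions in data.items():
        ci_versions = versions.get("ci_file_version") or []
        pkginfo_versions = versions.get("pkginfo_version") or []
        if not ci_versions or not pkginfo_versions:
            continue
        try:
            required = Version(pkginfo_versions[0])
            below = [ci for ci in ci_versions if Version(ci) < required]
        except InvalidVersion:
            # Skip the whole package if any entry is not a valid version number.
            continue
        if below:
            lagging[package] = below
    return lagging


# Invented sample data for illustration only.
sample = {
    "pkg_a": {"ci_file_version": ["4.10", "4.12"], "pkginfo_version": ["4.11"]},
    "pkg_b": {"ci_file_version": ["4.12"], "pkginfo_version": ["4.11"]},
}
lagging = find_lagging_ci_versions(sample)
print(lagging)
```

With the sample above, only `pkg_a` is reported, together with the single CI version `4.10` that is below `4.11`.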