## GAP Data Analytics, Community Study

This Jupyter Notebook provides a deeper understanding of the community behind the GAP packages distributed through GitHub. It studies the members who develop, release and collaborate on these packages in order to gather information on their collaboration trends and patterns. In the interest of privacy, contributor usernames are hashed upon extraction, and the hash value is the variable used to compute the statistical analysis.

In [None]:
# Import required modules and libraries
import os
import sys
import json
import hashlib
from github.Repository import Repository  # PyGithub class, used for type hints

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *


### Studying the community

Several variables related to authors and collaborations can provide valuable input on how the community behind GAP functions and what dependencies might exist. Investigating the frequency of contributions, who contributes to what, and where connections are made yields an understanding of who the people behind the GAP packages are, how they collaborate, and what the trends point to.

In [None]:
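# Define global variables for the Jupyter Notebook
# Note: `g` is not defined in this notebook; it is assumed to be the authenticated
# PyGithub client made available by the wildcard import from data_constants,
# together with ORG_NAME_PACKAGES
org = g.get_organization(ORG_NAME_PACKAGES)
repos = org.get_repos(type="public")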


##### Functions to Retrieve Community Metrics

In [None]:
def hash_username(author_name: str) -> str:
    """Hashes the author name upon retrieval, using the SHA-256 algorithm.

    Args:
        author_name (str): The author name to be hashed.

    Returns:
        str: The hash value of the author name.
    """
    return hashlib.sha256(author_name.encode()).hexdigest()
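
As a quick illustration of why the hash can stand in for the username in the later statistics, the following minimal sketch uses only the standard library. The usernames are invented for the example, and the function is repeated here only so that the sketch runs on its own. It shows that SHA-256 maps equal names to equal digests and always yields a 64-character hexadecimal string.

```python
import hashlib


def hash_username(author_name: str) -> str:
    """Same hashing step as in the cell above, repeated so the sketch is self-contained."""
    return hashlib.sha256(author_name.encode()).hexdigest()


# Hypothetical usernames, invented for illustration only
first = hash_username("alice")
second = hash_username("alice")
other = hash_username("Alice")

print(first == second)  # True: the same name always gives the same hash
print(first == other)   # False: hashing is case-sensitive
print(len(first))       # 64: hex digest of a 256-bit hash
```

Because hashing is case-sensitive, names that differ only in capitalization are counted as separate contributors.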


In [None]:
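def community_contributors(repos: Repository) -> tuple:
    """Get the GitHub GAP repository authors, authors who are also submitters,
    authors who are both submitters and commenters, and the number of
    repositories each author contributed to.

    Args:
        repos (Repository): Iterable of GitHub repositories.

    Returns:
        tuple: A set of hash values for all users that are authors (hashed commit author names),
               a set of hash values for users who are authors and submitters
               (hashed GitHub logins that differ from the hashed commit author name),
               a set of hash values for users who are authors, submitters and commenters
               (hashed GitHub logins that already appear among the authors),
               and a dict mapping each author hash to the number of repositories
               that author committed to.
    """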
    all_authors = set()
    authors_and_submitters = set()
    authors_submitters_commenters = set()
    author_repo_counts = {}

    for repo in repos:
        # Keep track of authors already counted for this repository
        repos_by_author = set()
        for commit in repo.get_commits():
            try:
                author = hash_username(commit.commit.author.name)
                all_authors.add(author)

                if commit.author is not None and commit.author.login is not None:
                    submitter = hash_username(commit.author.login)

                    if submitter != author:
                        authors_and_submitters.add(submitter)

                    if submitter in all_authors:
                        authors_submitters_commenters.add(submitter)

                # Count unique repositories for each author
                if author not in repos_by_author:
                    author_repo_counts[author] = author_repo_counts.get(author, 0) + 1
                    repos_by_author.add(author)

            except Exception:
                # Skip commits whose author data is missing or cannot be read
                pass

    return all_authors, authors_and_submitters, authors_submitters_commenters, author_repo_counts
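
The per-author repository count can be checked without calling the GitHub API. The sketch below is an illustration on invented data, not the notebook's own function: `toy_history` and `count_repos_per_author` are hypothetical names, and it uses an equivalent formulation that collects a set of repository names per author instead of a per-repository set of already counted authors.

```python
import hashlib
from collections import defaultdict

# Hypothetical commit history: repository name -> commit author names
toy_history = {
    "repo_a": ["alice", "bob", "alice"],
    "repo_b": ["alice"],
    "repo_c": ["bob", "carol", "carol"],
}


def count_repos_per_author(history: dict) -> dict:
    """Count the distinct repositories each hashed author committed to."""
    repos_per_author = defaultdict(set)
    for repo_name, author_names in history.items():
        for name in author_names:
            digest = hashlib.sha256(name.encode()).hexdigest()
            # A set ignores repeated commits by the same author in one repository
            repos_per_author[digest].add(repo_name)
    return {digest: len(names) for digest, names in repos_per_author.items()}


toy_counts = count_repos_per_author(toy_history)
alice = hashlib.sha256(b"alice").hexdigest()
print(toy_counts[alice])  # 2
```

Repeated commits by the same author within one repository do not inflate the count, which is the property the counting step in `community_contributors` relies on.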


##### Get and Display Community Metrics

In [None]:
# Get information on the total number of contributors
total_authors, author_submitters, author_submitter_commenter, author_repo_counts = community_contributors(repos)
print(f"Total number of unique authors for all GAP packages: {len(total_authors)}")
print(f"Total number of unique submitters for all GAP packages: {len(author_submitters)}")
print(f"Total number of unique authors who are also submitters and commenters for all GAP packages: {len(author_submitter_commenter)}")

# Get information on how many repositories an author contributed to
# Sort the contributors by repository count in descending order
sorted_contributors = sorted(author_repo_counts.items(), key=lambda x: x[1], reverse=True)
for value, count in sorted_contributors:
    print(f"Author Hash Value: {value}\tRepo Contribution Count: {count}")
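
The sorted listing can become long, so a compact summary may be easier to read. The following sketch, using invented values shaped like `author_repo_counts`, groups authors by the number of repositories they contributed to. The names `example_counts`, `distribution` and `single_repo_share` are hypothetical and not part of the notebook.

```python
from collections import Counter

# Hypothetical data shaped like author_repo_counts (author hash -> repository count)
example_counts = {"hash_1": 5, "hash_2": 1, "hash_3": 1, "hash_4": 2}

# Map each repository count to the number of authors with that count
distribution = Counter(example_counts.values())
for repo_count, n_authors in sorted(distribution.items()):
    print(f"Repositories: {repo_count}\tAuthors: {n_authors}")

# Share of authors who contributed to exactly one repository
single_repo_share = distribution[1] / len(example_counts)
print(f"Share of single-repository authors: {single_repo_share:.0%}")
```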


In [None]:
# Export the collected data to a JSON file for later use and a better overview
data_folder = "collected_data"
data = {
    'total_authors': list(total_authors),
    'author_submitters': list(author_submitters),
    'author_submitter_commenter': list(author_submitter_commenter),
    'author_repo_counts': author_repo_counts,
}

# Create the data folder if it does not exist yet, then build the JSON file path
os.makedirs(data_folder, exist_ok=True)
file_path = os.path.join(data_folder, "community_data.json")

# Write the data to the JSON file
with open(file_path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print("Community data has been exported to the 'community_data.json' file in the 'collected_data' folder.")
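
When the exported file is read back later, the sets arrive as lists, because JSON has no set type. The sketch below illustrates the round trip on invented data written to a temporary directory, so it does not touch the `collected_data` folder. The variable names and values are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical data shaped like the export above
example = {
    "total_authors": ["hash_1", "hash_2"],
    "author_submitters": ["hash_2"],
    "author_submitter_commenter": [],
    "author_repo_counts": {"hash_1": 3, "hash_2": 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "community_data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f, ensure_ascii=False, indent=4)

    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# Convert the list back to a set to restore set semantics
restored_authors = set(loaded["total_authors"])
print(restored_authors == set(example["total_authors"]))  # True
print(loaded["author_repo_counts"])  # {'hash_1': 3, 'hash_2': 1}
```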
