## Community Study Data Retrieval

This Jupyter Notebook studies the community behind the GAP packages distributed through GitHub. It examines the members who develop, release and collaborate on GAP packages, in order to gather information on their collaboration trends and patterns. In the interest of privacy, contributor usernames are hashed upon extraction, and the hash values are the identifiers used in all subsequent statistical analysis.
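
Hashed values can stand in for usernames because a hash is deterministic: the same username always maps to the same value, so contributions can still be counted per person across repositories. The following is a minimal sketch of that property; the name `pseudonymize` and the sample usernames are made up for illustration.

```python
import hashlib


def pseudonymize(username: str) -> str:
    """Return the SHA-256 hex digest of a username (illustrative helper)."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


# Made-up usernames, not real contributors
first = pseudonymize("example-user")
second = pseudonymize("example-user")
other = pseudonymize("another-user")

print(first == second)  # True: the same input always gives the same digest
print(first == other)   # False: different inputs give different digests
print(len(first))       # 64 hexadecimal characters
```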

In [None]:
# Import required modules and libraries
import os
import sys
import json
import hashlib
from datetime import datetime, timedelta
from github.Repository import Repository  # The class used in type hints (github.Repository is the module)

# Get current working directory and append parent directory for module imports
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
sys.path.append(parent_dir)

# Import modules from other project scripts
from data_constants import *


### Managing GitHub API Calls

The connection to GitHub is authenticated with the user's GitHub access token, which is stored as an environment variable. The function for getting the token is imported from the utils file in the project. As the API has a call limit of 5000 calls per hour, the limit, the remaining calls and the reset time are tracked below. When the remaining calls run low, the program sleeps until the limit is renewed, and then resumes the job it was working on when it paused.
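
The waiting logic described above can be sketched as follows. This is only an illustration under assumptions: the project's own `wait_until_reset` helper is not shown in this notebook, so the function names, the `buffer_seconds` margin and the injectable `sleep` and `clock` parameters below are hypothetical.

```python
import time


def seconds_until_reset(reset_timestamp: float, now: float, buffer_seconds: int = 5) -> float:
    """Return how long to sleep until the rate limit resets (never negative)."""
    return max(0.0, reset_timestamp - now + buffer_seconds)


def wait_if_exhausted(remaining: int, reset_timestamp: float, threshold: int = 100,
                      sleep=time.sleep, clock=time.time) -> float:
    """Sleep until the reset time when `remaining` is at or below `threshold`.

    Returns the number of seconds slept (0.0 if no wait was needed).
    """
    if remaining > threshold:
        return 0.0
    delay = seconds_until_reset(reset_timestamp, clock())
    sleep(delay)
    return delay
```

With PyGithub, the inputs would plausibly come from `g.rate_limiting[0]` (remaining calls) and `g.rate_limiting_resettime` (a Unix timestamp). Passing `sleep` and `clock` as parameters keeps the function testable without actually waiting.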

In [None]:
# Track the GitHub rate limit against the calls used, and see when the limit will reset
# If 100 or fewer API calls remain, the program will sleep until the API limit renews.
remaining_requests, request_limit = g.rate_limiting
print(f"Request limit for API Calls: {request_limit}")
print(f"Remaining requests for API Calls: {remaining_requests}")

limit_reset_time = g.rate_limiting_resettime
reset_time = datetime.fromtimestamp(limit_reset_time).strftime('%Y-%m-%d %H:%M:%S')
print(f"Reset time for API Calls: {reset_time}")

threshold = 100
if remaining_requests <= threshold:
    wait_until_reset(reset_time)


### Studying the community

Several variables related to authors and collaborations can provide valuable input on how the community behind GAP functions, and what dependencies might exist. Investigating the frequency of contributions, who contributes to what, and where connections are made yields an understanding of who the people behind the GAP packages are, how they collaborate and what the trends point to.

In [None]:
# Define global variables for the Jupyter Notebook
org = g.get_organization(ORG_NAME_PACKAGES)
repos = org.get_repos(type="public")


##### Functions to Retrieve Community Metrics

In [None]:
def hash_username(author_name: str) -> str:
    """Hashes the author name upon retrieval, using the SHA-256 algorithm.

    Args:
        author_name (str): The author name to be hashed.

    Returns:
        str: The hash value of the author name.
    """
    return hashlib.sha256(author_name.encode()).hexdigest()


In [None]:
def get_commits_by_contributor(repo: Repository, contributors_set: set, threshold_date: datetime, inactive_contributors: dict) -> None:
    """Find each contributor's latest commit in the repository and record the contributors whose latest commit predates the threshold date.

    Args:
        repo (Repository): The GitHub repository to get the commits from.
        contributors_set (set): Hash values representing the contributors.
        threshold_date (datetime): The threshold date used to identify inactive contributors.
        inactive_contributors (dict): Inactive contributors and their latest contribution date (ISO 8601 string), updated in place.
    """
    latest_commit = {}
    try:
        # The usernames are hashed, so the API's `author` filter cannot be used;
        # walk the commit history once and hash each author login instead
        for commit in repo.get_commits():
            if commit.author is None:  # Commit is not linked to a GitHub account
                continue
            contributor_hash = hash_username(commit.author.login)
            if contributor_hash not in contributors_set:
                continue
            # Use the git author date; tzinfo is dropped so it compares with the naive threshold date
            commit_timestamp = commit.commit.author.date.replace(tzinfo=None)
            if contributor_hash not in latest_commit or commit_timestamp > latest_commit[contributor_hash]:
                latest_commit[contributor_hash] = commit_timestamp
    except Exception as e:
        print(f"Error while processing {repo.name}: {e}")
        return

    for contributor_hash, commit_timestamp in latest_commit.items():
        if commit_timestamp < threshold_date:
            # Stored as a string so the result can be exported to JSON
            inactive_contributors[contributor_hash] = commit_timestamp.isoformat()


In [None]:
def community_contributors(repos, threshold_months: int = 24) -> tuple:
    """Collect community metrics for the GAP repositories on GitHub: the authors, the issue submitters, the number of repos each
    author contributed to, the authors who are also submitters, the inactive contributors, and which issue submitters interacted with each author's repos.

    Args:
        repos (iterable of Repository): The GitHub repositories to study.
        threshold_months (int, optional): Threshold in months to identify inactive contributors. Defaults to 24.

    Returns:
        tuple: A set of hash values for all users that are authors,
            a set of hash values for all users that are issue submitters,
            a dict showing how many repositories each author contributed to,
            a set of hash values for users who are both authors and submitters,
            a dict containing inactive contributors and their latest contribution date,
            and a dict mapping each author to the issue submitters who interacted with their repos.
    """
    all_authors = set()
    all_submitters = set()
    authors_submitters = set()
    author_repo_counts = {}
    authors_contributed_together = {}
    inactive_contributors = {}
    today = datetime.today()
    threshold_date = today - timedelta(days=threshold_months * 30)

    for repo in repos:
        # Get all authors and their contribution count
        contributors = repo.get_contributors()
        contributors_set = set(hash_username(contributor.login) for contributor in contributors)
        all_authors.update(contributors_set)

        for contributor_hash in contributors_set:
            author_repo_counts[contributor_hash] = author_repo_counts.get(contributor_hash, 0) + 1

        # Get inactive contributors based on threshold
        get_commits_by_contributor(repo, contributors_set, threshold_date, inactive_contributors)

        # Get all submitters for the repo (the issues endpoint also returns pull requests)
        issues = repo.get_issues(state="all")
        submitters_in_repo = set(hash_username(issue.user.login) for issue in issues)
        all_submitters.update(submitters_in_repo)

        # Get all interactions
        for submitter in submitters_in_repo:
            for contributor_hash in contributors_set:
                if submitter != contributor_hash:
                    if contributor_hash not in authors_contributed_together:
                        authors_contributed_together[contributor_hash] = []
                    if submitter not in authors_contributed_together[contributor_hash]:
                        authors_contributed_together[contributor_hash].append(submitter)

    # Get all authors and submitters
    authors_submitters = all_submitters.intersection(all_authors)

    return all_authors, all_submitters, author_repo_counts, authors_submitters, inactive_contributors, authors_contributed_together
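
The aggregation steps can be tried out without any API calls. The sketch below mirrors the set and dict logic of `community_contributors` on made-up data; the function name `aggregate`, the toy repositories and the short placeholder hashes (`h1`, `h2`, ...) are all hypothetical, and inactivity tracking is left out.

```python
from collections import defaultdict

# Hypothetical toy data: repo name -> (contributor hashes, issue submitter hashes)
toy_repos = {
    "repo_a": ({"h1", "h2"}, {"h2", "h3"}),
    "repo_b": ({"h1"}, {"h3", "h4"}),
}


def aggregate(repos_data: dict) -> tuple:
    """Aggregate authors, submitters, repo counts and interactions over all repos."""
    all_authors, all_submitters = set(), set()
    author_repo_counts = defaultdict(int)
    interactions = defaultdict(set)

    for contributors, submitters in repos_data.values():
        all_authors |= contributors
        all_submitters |= submitters
        for author in contributors:
            author_repo_counts[author] += 1
            # Every submitter in the repo except the author themselves
            interactions[author] |= submitters - {author}

    # Users who appear both as authors and as issue submitters
    authors_submitters = all_authors & all_submitters
    # Sorted lists keep the interactions deterministic and JSON-serializable
    interactions = {a: sorted(s) for a, s in interactions.items() if s}
    return all_authors, all_submitters, dict(author_repo_counts), authors_submitters, interactions


authors, submitters, counts, both, interactions = aggregate(toy_repos)
print(counts)        # {'h1': 2, 'h2': 1} (key order may vary)
print(both)          # {'h2'}
print(interactions)  # h1 -> ['h2', 'h3', 'h4'], h2 -> ['h3']
```

Using sets for the interactions avoids the repeated membership checks on lists, at the cost of a final conversion before export.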




##### Get and Export Community Metrics

In [None]:
# Call the functions and unpack the tuples to store the data
all_authors, all_submitters, author_repo_counts, author_submitters, inactive_contributors, authors_contributed_together = community_contributors(repos)

# Export collected data to JSON file to store them for later use and better overview
data_folder = "collected_data"
data = {
    'authors': list(all_authors),
    'submitters': list(all_submitters),
    'author_repo_counts': author_repo_counts,
    'author_submitters': list(author_submitters),
    'inactive_contributors': inactive_contributors,
    'interactions': authors_contributed_together
}

# Create the data folder if it does not exist, and build the file path for the JSON file
os.makedirs(data_folder, exist_ok=True)
file_path = os.path.join(data_folder, "community_data.json")

# Write the data to the JSON file
with open(file_path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print("Community data has been exported to the 'community_data.json' file in the 'collected_data' folder.")
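
For later analysis, the exported file can be read back into Python. The following is a minimal sketch, assuming the key names used in the export above; the helper `load_community_data` is hypothetical, and the round trip runs on made-up data in a temporary folder rather than on the real `collected_data` folder.

```python
import json
import os
import tempfile


def load_community_data(path: str) -> dict:
    """Load the exported community data and restore the list-valued fields to sets."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # JSON has no set type, so these fields were exported as lists
    for key in ("authors", "submitters", "author_submitters"):
        data[key] = set(data.get(key, []))
    return data


# Round trip on made-up data (placeholder hashes, not real contributors)
sample = {
    "authors": ["h1", "h2"],
    "submitters": ["h2", "h3"],
    "author_repo_counts": {"h1": 2, "h2": 1},
    "author_submitters": ["h2"],
    "inactive_contributors": {},
    "interactions": {"h1": ["h2", "h3"]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "community_data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)
    loaded = load_community_data(path)

print(len(loaded["authors"]), len(loaded["author_submitters"]))  # 2 1
```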
