# GitHub Actions Information Needs taxonomy

## 1. Dataset preparation

### 1.1 Data Filtering

The data consists of three subsets, one per filter: title, body, and tags.
1. posts_by_title: Posts containing "Github Actions" or one of its variants in the question title (github actions, github-action, Github actions, Github-actions, Github Actions, Github-Actions).
2. posts_by_body: Posts containing "Github Actions" or one of its variants in the question body.
3. posts_by_tags: Posts tagged with any of the following GitHub Actions tags: 'github-actions', 'building-github-actions', 'github-actions-self-hosted-runners', 'github-actions-runners', 'github-actions-services', 'github-actions-artifacts', 'github-actions-reusable-workflows', 'github-actions-workflows', 'github-actions-marketplace'.
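
The three filters above can be illustrated with a small sketch. The original filtering was done upstream of this notebook, so the regular expression and the helper names `mentions_gha` and `has_gha_tag` below are assumptions made for illustration, not the query that produced the JSON files.

```python
import re

# Hypothetical pattern covering the listed variants: "github", then a space
# or a hyphen, then "action" or "actions", ignoring case.
GHA_PATTERN = re.compile(r"github[ -]actions?", re.IGNORECASE)

# Tag list copied from the description of posts_by_tags.
GHA_TAGS = {
    "github-actions",
    "building-github-actions",
    "github-actions-self-hosted-runners",
    "github-actions-runners",
    "github-actions-services",
    "github-actions-artifacts",
    "github-actions-reusable-workflows",
    "github-actions-workflows",
    "github-actions-marketplace",
}


def mentions_gha(text):
    """Return True if a title or body mentions GitHub Actions or a variant."""
    return GHA_PATTERN.search(text) is not None


def has_gha_tag(tags):
    """Return True if at least one tag of the post is in the GHA tag list."""
    return not GHA_TAGS.isdisjoint(tags)
```

A post would belong to the intersection used later only if its title and body both satisfy `mentions_gha` and its tags satisfy `has_gha_tag`.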

In [30]:
import pandas as pd
import numpy as np

In [31]:
# Posts filtered by three filters: Title, Body, and tags.
posts_by_title = pd.read_json('../data/raw_data/posts_by_title.json')
posts_by_body = pd.read_json('../data/raw_data/posts_by_body.json')
posts_by_tags = pd.read_json('../data/raw_data/posts_by_tags.json')
posts_by_title.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_body.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_tags.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)

In [32]:
posts_by_body = posts_by_body[posts_by_body['post_type_id']==1] # Keep only question bodies (post_type_id == 1); answer bodies are dropped.
posts_by_tags.describe()

,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,9873.0,9873.0,3670.0,0.0,9873.0,9873.0,9801.0,4271.0,9873.0,9873.0
mean,71260800.0,1.0,70227080.0,,2.747392,2217.544718,8707827.0,6605726.0,0.981363,1.657753
std,5184317.0,0.0,5424942.0,,11.666862,8988.151347,6554179.0,6039375.0,1.119418,2.394376
min,54176760.0,1.0,54177630.0,,-6.0,5.0,91.0,-1.0,0.0,0.0
25%,67964110.0,1.0,66227630.0,,0.0,136.0,2628868.0,1623876.0,0.0,0.0
50%,72633060.0,1.0,71291860.0,,1.0,459.0,7764329.0,4290962.0,1.0,1.0
75%,75623190.0,1.0,74934410.0,,2.0,1385.0,13836210.0,10234950.0,1.0,2.0
max,77593690.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22965690.0,35.0,26.0


In [33]:
# Number of different datasets
print("Number of posts filtered by title:", len(posts_by_title))
print("Number of posts filtered by body:", len(posts_by_body))
print("Number of posts filtered by tags :", len(posts_by_tags))
df_union = pd.concat([posts_by_tags, posts_by_title, posts_by_body]).drop_duplicates().reset_index(drop=True)
print("Number of posts of the union:", len(df_union))
df_merged_1 = pd.merge(posts_by_tags, posts_by_title, how='inner')
df_intersection = pd.merge(df_merged_1, posts_by_body, how='inner')
print("Number of posts of the intersection:", len(df_intersection))

Number of posts filtered by title: 4538
Number of posts filtered by body: 6297
Number of posts filtered by tags : 9873
Number of posts of the union: 11323
Number of posts of the intersection: 2903


In [34]:
# save intersection dataset
df_intersection.to_csv('../data/processed_data/intersection.csv')

### 1.2 Sampling

- We use df_intersection because a post that carries a GitHub Actions tag and also mentions GitHub Actions in both its title and its body is very likely to be about GitHub Actions.

- From this dataset we draw a simple random sample.

- At least 340 posts are needed for a 95% confidence level with a ±5% margin of error.
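
The notebook does not state how the 340 figure was obtained. One common way to arrive at it, assumed here, is Cochran's sample size formula with a finite population correction, using the most conservative proportion p = 0.5 and the intersection size of 2903 posts. The function name `required_sample_size` is hypothetical.

```python
import math


def required_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula with finite population correction (assumed method).

    z      -- z-score of the confidence level (1.96 for 95%)
    margin -- accepted margin of error (0.05 for +/-5%)
    p      -- assumed proportion; 0.5 gives the largest sample size
    """
    # Sample size for an infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correction for a finite population
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(required_sample_size(2903))  # 340
```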

In [35]:
# Draw a random sample of 400 posts, more than the 340 required, so that the first 340 posts judged to be on-topic during manual examination can be kept.
sample_size = 400
seed = 0
df_sample = df_intersection.sample(n=sample_size, random_state=seed)
print("Sample length: ", df_sample.shape[0])
df_sample.head();

Sample length:  400


### 1.3 Manual examination

- Two authors manually examined the sampled posts in order to remove those that were not related to GitHub Actions (GA) or whose links did not work. In the end, 340 posts were accepted.

In [36]:
# A post is accepted only if both evaluators accepted it; the first 340 such posts are kept.
sample_reviewed=pd.read_excel('../data/raw_data/sample_eval1_reviewed.xlsx')
sample_reviewed['evaluator2']=pd.read_excel('../data/raw_data/sample_eval2_reviewed.xlsx')['evaluator2']
sample_accepted=sample_reviewed[(sample_reviewed['evaluator1']==1)&(sample_reviewed['evaluator2']==1)].head(340)
sample_accepted = pd.merge(df_intersection, sample_accepted[['id']], on='id', how='inner')
sample_accepted.to_excel('../data/processed_data/sample_accepted.xlsx')
sample_accepted.head(3);

### 1.4 Coding in sentences

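Title and body text was converted from HTML to plain text. Code, blockquotes, and links were replaced with the placeholders -CODE-, -BLOCK-, and -LINK- (images are usually wrapped in links, so they become -LINK- as well). After that, the text was split into sentences.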

In [37]:
from bs4 import BeautifulSoup
import re

def html_to_sentences_df(html_content):
    """
    Parses the provided HTML content, replaces specific tags with placeholders, adjusts paragraph endings, and splits the content into sentences. 
    Returns a pandas DataFrame with each sentence in a separate row.
    
    Parameters:
    - html_content: String containing HTML content.
    
    Returns:
    - DataFrame with each sentence as a separate row.
    """
    
    # Parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Replace code blocks, blockquotes, and links with placeholders
    for code in soup.find_all('code'):
        code.replace_with("-CODE-")
    for blockquote in soup.find_all('blockquote'):
        blockquote.replace_with("-BLOCK-")
    for a in soup.find_all('a'):
        a.replace_with("-LINK-")

    # Extract text and replace newline entities
    text = soup.get_text()
    text = text.replace('&#xA;', '\n').strip()

    # Pre-process text to handle ':\\n-CODE-' pattern
    text = re.sub(r':\s*\n-CODE-', ': -CODE-', text)
    text = re.sub(r':\s*\n-BLOCK-', ': -BLOCK-', text)
    text = re.sub(r':\s*\n-LINK-', ': -LINK-', text)
    
    # Adjust paragraph endings where necessary
    pattern = r'(?<![\.\!\?\s])\s*\n'
    text = re.sub(pattern, '.\n', text)

    # Replace ':.' with ':'
    text = text.replace(':.', ':')

    # Split text into sentences
    sentences = re.split(r'(?<=[.!?]) +', text.replace('\n', ' '))

    # Create DataFrame
    df = pd.DataFrame(sentences, columns=['sentence'])
    
    return df
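
To make the text-processing steps of `html_to_sentences_df` easier to follow, here is a minimal sketch that applies the same regular expressions to a plain string in which the placeholders have already been inserted. It leaves out the BeautifulSoup parsing and the `-BLOCK-`/`-LINK-` variants of the first substitution; the helper name `split_plain_text` is hypothetical.

```python
import re


def split_plain_text(text):
    """Apply the sentence-splitting steps to already-extracted plain text."""
    # Keep a placeholder on the same line as the colon that introduces it
    text = re.sub(r':\s*\n-CODE-', ': -CODE-', text)
    # Add a period to lines that end without sentence punctuation
    text = re.sub(r'(?<![\.\!\?\s])\s*\n', '.\n', text)
    # Undo the period if it was added right after a colon
    text = text.replace(':.', ':')
    # Split on spaces that follow '.', '!' or '?'
    return re.split(r'(?<=[.!?]) +', text.replace('\n', ' '))


example = "I get this error:\n-CODE-\nAny idea why\nThanks!"
print(split_plain_text(example))
# ['I get this error: -CODE-.', 'Any idea why.', 'Thanks!']
```

Because lines without closing punctuation receive a period, every paragraph or placeholder on its own line ends up as a separate sentence.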

In [38]:
def process_df(sample_df):
    # Initialize an empty DataFrame to hold all sentences
    all_sentences_df = pd.DataFrame()

    # Iterate over each row in the DataFrame
    for index, row in sample_df.iterrows():

        # Process the title and body, assuming they are HTML content
        title_sentences_df = html_to_sentences_df(row['post_title'])
        body_sentences_df = html_to_sentences_df(row['post_body'])

        # Add a column with the position of the post within the sample
        title_sentences_df['id'] = index
        body_sentences_df['id'] = index

        # Add a column with the Stack Overflow id of the post
        title_sentences_df['post_id'] = row['id']
        body_sentences_df['post_id'] = row['id']

        # Add a column to indicate the source of the sentences
        title_sentences_df['source'] = 'title'
        body_sentences_df['source'] = 'body'
        
        # Combine title and body sentences
        combined_sentences_df = pd.concat([title_sentences_df, body_sentences_df], ignore_index=True)
        
        # Add the combined sentences to the overall DataFrame
        all_sentences_df = pd.concat([all_sentences_df, combined_sentences_df], ignore_index=True)

        all_sentences_df = all_sentences_df[['id', 'post_id', 'source', 'sentence']]

    return all_sentences_df

all_sentences_df = process_df(sample_accepted) # this df contains all the sentences from the sample of 340 posts.

all_sentences_df


,id,post_id,source,sentence
0,0,57503578,title,Making pull requests to a GitHub repository au...
1,0,57503578,body,I have a file in a GitHub repository that need...
2,0,57503578,body,"As part of a -LINK-, I want to have a bot runn..."
3,0,57503578,body,I have a suspicion that the -LINK- can help me...
4,0,57503578,body,I see some official automation workflows that ...
...,...,...,...,...
3171,339,77519735,body,-CODE-.
3172,339,77519735,body,Below is an error message.
3173,339,77519735,body,-CODE-.
3174,339,77519735,body,"I searched for an error message, but I couldn'..."


----------------------------------------------------------------

## 2. Manual classification

After manually building the taxonomy, we identified 8 Information Needs (DN) and 24 Relevant Information (RI) groups.
The 8 DN are: Error Handling (EH), Incompatibility (IN), Insufficient Implementation (II), Migration (MI), Functionality Implementation (FI), Orientation (OR), Alternative Solution (AS), and GHA Learning (LE).

In [39]:
DN_list = ['EH', 'IN', 'II', 'MI', 'FI', 'OR', 'AS', 'LE']
sentences_classified = pd.read_excel("../data/processed_data/sentences_taxonomy.xlsx", sheet_name=DN_list)
sample_sentences = pd.DataFrame()
for k in sentences_classified.keys():
    sentences_classified[k].drop_duplicates(inplace=True)
    sample_sentences = pd.concat([sample_sentences, sentences_classified[k]], ignore_index=True)
sample_sentences = pd.concat([sample_sentences, pd.get_dummies(sample_sentences['RI_id'], dtype=int)], axis=1)

In [40]:
sample_sentences.sort_values(by=['sentence_id'], inplace=True)
sample_sentences.drop(['RI_id'], axis=1, inplace=True)

In [41]:
# List of Relevant Information id's
RI_list = sample_sentences.columns[5:]

In [42]:
# Merging duplicated sentences with different RI categories.
agg_dict = dict()
agg_dict['id'] = 'first'
agg_dict['post_id'] = 'first'
agg_dict['source'] = 'first'
agg_dict['sentence'] = 'first'
agg_dict.update({col: 'sum' for col in RI_list})
sample_sentences = sample_sentences.groupby('sentence_id').agg(agg_dict).reset_index()
sample_sentences;

There are 1005 sentences that contain one or more types of Relevant Information.

In [43]:
# Creating a DF with all the sentences of the sample and their categorization

# Add each new column filled with zeros to the DataFrame
for column in RI_list:
    all_sentences_df[column] = 0
all_sentences_df.index.name = 'sentence_id'

for i in range(len(all_sentences_df)):
    for j in range(len(sample_sentences)):
        if i == sample_sentences.loc[j, 'sentence_id']:
            all_sentences_df.loc[i, RI_list] = sample_sentences.loc[j, RI_list]
all_sentences_df

sentence_id,id,post_id,source,sentence,AS1,EH1,EH2,EH3,EH4,EH5,...,II1,IN1,LE1,LE2,MI1,OR1,OR2,OR3,OR4,OR5
0,0,57503578,title,Making pull requests to a GitHub repository au...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,57503578,body,I have a file in a GitHub repository that need...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,57503578,body,"As part of a -LINK-, I want to have a bot runn...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,57503578,body,I have a suspicion that the -LINK- can help me...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,57503578,body,I see some official automation workflows that ...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3171,339,77519735,body,-CODE-.,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3172,339,77519735,body,Below is an error message.,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3173,339,77519735,body,-CODE-.,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3174,339,77519735,body,"I searched for an error message, but I couldn'...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
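
The nested loop in cell 43 compares every sentence with every classified sentence. As a sketch of an alternative, the same alignment can be expressed with index operations. The toy data and the names `all_df`, `labels`, and `ri_cols` are made up for illustration, and, unlike the cell above, the RI columns are not pre-created with zeros before joining.

```python
import pandas as pd

# Toy stand-in for all_sentences_df: one row per sentence, indexed by sentence_id
all_df = pd.DataFrame({"sentence": ["a", "b", "c", "d"]})
all_df.index.name = "sentence_id"

# Toy stand-in for sample_sentences: only the classified sentences
labels = pd.DataFrame({"sentence_id": [1, 3], "EH1": [1, 0], "FI1": [0, 2]})
ri_cols = ["EH1", "FI1"]

# Align the labels with the full sentence index; unclassified sentences get 0
aligned = labels.set_index("sentence_id")[ri_cols].reindex(all_df.index, fill_value=0)

# Attach the RI columns to the sentences
result = all_df.join(aligned)
print(result)
```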


In [44]:
# Convert the 'tags' column from a '<tag1><tag2>' string to a list, unless it has already been converted
if isinstance(sample_accepted['tags'].iloc[0], str):
    def extract_tags(tags_str):
        return tags_str.strip('<>').split('><')
    sample_accepted['tags'] = sample_accepted['tags'].apply(extract_tags)
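
As a quick illustration of the conversion above, the sketch below applies the same `strip`/`split` logic to a sample string. The `'<tag1><tag2>'` layout is inferred from the code, and the guard for an empty string is an addition that the original `extract_tags` does not have.

```python
def extract_tags_safe(tags_str):
    """Split a '<tag1><tag2>' string into a list of tag names."""
    inner = tags_str.strip('<>')
    # Without this guard, an empty string would yield [''] instead of []
    return inner.split('><') if inner else []


print(extract_tags_safe('<github><github-actions>'))  # ['github', 'github-actions']
print(extract_tags_safe(''))                          # []
```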

In [45]:
# Merge the DataFrames on 'post_id' (from all_sentences_df) and 'id' (from sample_accepted)
sentence_analysis_df = pd.merge(all_sentences_df, sample_accepted, left_on='post_id', right_on='id', how='left')
# Drop the redundant 'id_y' column
sentence_analysis_df.drop(['id_y'], axis=1, inplace=True)
# Rename 'id_x' to 'sample_id'
sentence_analysis_df.rename(columns={'id_x': 'sample_id'}, inplace=True)
# Save the merged DataFrame to an Excel file
sentence_analysis_df.to_excel('../results/sentences_classified.xlsx', index=False)

----------------------------------------------------------------

## 3. Results and Analysis

In [46]:
# Define a mapping from RI columns to DN categories
dn_mapping = {
    'EH': ['EH1', 'EH2', 'EH3', 'EH4', 'EH5', 'EH6', 'EH7', 'EH8', 'EH9'],
    'IN': ['IN1'],
    'II': ['II1'],
    'MI': ['MI1'],
    'FI': ['FI1'],
    'OR': ['OR1', 'OR2', 'OR3', 'OR4', 'OR5'],
    'AS': ['AS1'],
    'LE': ['LE1', 'LE2']
}

In [47]:
# Calculate the total number of sentences
total_sentences = all_sentences_df.shape[0]

# Calculate the number of sentences without any RI
ri_columns = [col for sublist in dn_mapping.values() for col in sublist]
sentences_without_ri = all_sentences_df[ri_columns].sum(axis=1) == 0
num_sentences_without_ri = sentences_without_ri.sum()

# Display the results
print(f"Total number of sentences: {total_sentences}")
print(f"Number of sentences with any RI: {total_sentences - num_sentences_without_ri}")
print(f"Number of sentences without any RI: {num_sentences_without_ri}")


Total number of sentences: 3176
Number of sentences with any RI: 1005
Number of sentences without any RI: 2171


In [48]:
# Calculate the counts for each RI category
ri_columns = all_sentences_df.columns[4:]  # Assuming RI columns start from the 5th column onwards
ri_counts = all_sentences_df[ri_columns].sum()

# Convert the counts to a DataFrame for better readability
ri_counts_df = ri_counts.reset_index()
ri_counts_df.columns = ['RI_Category', 'Count']

ri_counts_df

,RI_Category,Count
0,AS1,9
1,EH1,98
2,EH2,8
3,EH3,4
4,EH4,14
5,EH5,80
6,EH6,12
7,EH7,67
8,EH8,63
9,EH9,61


In [49]:
# Calculate the number of posts that have each RI category
ri_post_counts = all_sentences_df.groupby('post_id')[ri_columns].sum()
ri_post_counts = (ri_post_counts > 0).sum().sort_values(ascending=False)
ri_post_counts

FI1    154
LE1    118
II1     86
EH1     78
EH5     70
OR1     61
EH7     61
EH8     50
EH9     50
IN1     48
OR2     34
OR5     33
OR3     19
EH4     14
MI1     12
EH6     11
OR4     10
AS1      9
LE2      8
EH2      6
EH3      4
dtype: int64

In [50]:
# Calculate the total number of unique posts
total_posts = all_sentences_df['post_id'].nunique()

# Initialize a dictionary to store the count of unique posts for each DN
dn_unique_post_counts = {dn: 0 for dn in dn_mapping.keys()}

# Calculate the number of unique posts that have each DN category
for dn, ri_list in dn_mapping.items():
    dn_unique_post_counts[dn] = (all_sentences_df.groupby('post_id')[ri_list].sum() > 0).any(axis=1).sum()

# Convert the dictionary to a DataFrame
dn_unique_post_counts_df = pd.DataFrame.from_dict(dn_unique_post_counts, orient='index', columns=['Unique Post Count'])

# Calculate the percentage of total posts
dn_unique_post_counts_df['Percentage'] = (dn_unique_post_counts_df['Unique Post Count'] / total_posts) * 100

# Sort the DataFrame by the count of unique posts in descending order
dn_unique_post_counts_df = dn_unique_post_counts_df.sort_values(by='Unique Post Count', ascending=False)

# Display the DataFrame
print(dn_unique_post_counts_df)

    Unique Post Count  Percentage
EH                182   53.529412
FI                154   45.294118
OR                134   39.411765
LE                121   35.588235
II                 86   25.294118
IN                 48   14.117647
MI                 12    3.529412
AS                  9    2.647059


In [51]:
# Calculate the number of unique sentences per RI category
ri_unique_sentence_counts = {ri: (all_sentences_df[ri] > 0).sum() for ri in ri_columns}

# Convert the dictionary to a DataFrame for display
ri_unique_sentence_counts_df = pd.DataFrame.from_dict(ri_unique_sentence_counts, orient='index', columns=['Unique Sentence Count']).sort_values(by='Unique Sentence Count', ascending=False)

# Display the DataFrame
print(ri_unique_sentence_counts_df)

ri_unique_sentence_counts_df.sum()

     Unique Sentence Count
FI1                    196
LE1                    150
II1                    102
EH1                     98
EH5                     80
OR1                     70
EH7                     67
IN1                     63
EH8                     63
EH9                     61
OR5                     44
OR2                     41
OR3                     20
EH4                     14
MI1                     14
EH6                     12
OR4                     10
AS1                      9
LE2                      8
EH2                      8
EH3                      4


Unique Sentence Count    1134
dtype: int64
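
The total of 1134 is larger than the 1005 sentences with any RI reported in cell 47, which is expected because a single sentence can carry more than one RI category. The sketch below shows the distinction on made-up data; the toy values and variable names are illustrative only.

```python
import pandas as pd

# Toy RI indicator columns for four sentences
toy = pd.DataFrame({
    "EH1": [1, 0, 1, 0],
    "FI1": [1, 0, 0, 0],
    "OR1": [0, 0, 1, 0],
})

# Number of RI categories assigned to each sentence
labels_per_sentence = (toy > 0).sum(axis=1)

sentences_with_ri = int((labels_per_sentence > 0).sum())   # sentences with at least one RI
total_assignments = int(labels_per_sentence.sum())         # sum of the per-category counts
multi_label_sentences = int((labels_per_sentence > 1).sum())

print(sentences_with_ri, total_assignments, multi_label_sentences)  # 2 4 2
```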