# Taxonomy

## Dataset preparation

The data is split into three subsets, obtained by filtering questions on their title, body, and tags:
1. posts_by_title: Posts containing "Github Actions" or its variants in the question title (github actions, github-action, Github actions, Github-actions, Github Actions, Github-Actions).
2. posts_by_body: Posts containing "Github Actions" or its variants in the question body.
3. posts_by_tags: Posts tagged with any of the following GitHub Actions tags: 'github-actions', 'building-github-actions', 'github-actions-self-hosted-runners', 'github-actions-runners', 'github-actions-services', 'github-actions-artifacts', 'github-actions-reusable-workflows', 'github-actions-workflows', 'github-actions-marketplace'.
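
The variants listed for the title and body filters can be captured by a single case-insensitive pattern. The following is only a sketch: the notebook loads subsets that were already filtered elsewhere, so `GHA_PATTERN` and `mentions_github_actions` are hypothetical names and this regular expression is an assumption, not the filter that produced the data. Note that it would also accept "githubactions" written without a separator.

```python
import re

# Hypothetical pattern: "github", an optional space or hyphen, then "action" or "actions".
GHA_PATTERN = re.compile(r"github[ -]?actions?", re.IGNORECASE)


def mentions_github_actions(text: str) -> bool:
    """Return True if the text contains "Github Actions" or a close variant."""
    return GHA_PATTERN.search(text) is not None


# The variants listed for posts_by_title
variants = [
    "github actions", "github-action", "Github actions",
    "Github-actions", "Github Actions", "Github-Actions",
]
print(all(mentions_github_actions(v) for v in variants))  # True
```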

In [1]:
import pandas as pd
import numpy as np

In [2]:
posts_by_title = pd.read_json('data/posts_by_title.json')
posts_by_body = pd.read_json('data/posts_by_body.json')
posts_by_tags = pd.read_json('data/posts_by_tags.json')
posts_by_title.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_body.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_tags.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)

In [3]:
posts_by_body = posts_by_body[posts_by_body['post_type_id'] == 1]  # Keep only question bodies (post_type_id == 1)
# posts_by_tags.drop(columns=['DeletionDate', 'OwnerDisplayName', 'LastEditorDisplayName', 'FavoriteCount', 'ClosedDate', 'CommunityOwnedDate'], inplace=True);

In [4]:
# posts_by_tags.rename(columns={'Id':'id', 'PostTypeId':'post_type_id', 'AcceptedAnswerId': 'accepted_answer_id', 'ParentId': 'parent_id', 'CreationDate':'creation_date',
#                                 'Score':'score', 'ViewCount':'view_count', 'Body':'post_body', 'OwnerUserId':'owner_user_id', 'LastEditorUserId':'last_editor_user_id',
#                                 'LastEditDate':'last_edit_date', 'LastActivityDate':'last_activity_date', 'Title':'post_title', 'Tags':'tags', 'AnswerCount':'answer_count',
#                                 'CommentCount':'comment_count', 'ContentLicense':'content_license', 'Unnamed: 23':'link'}, inplace=True)


In [5]:
posts_by_tags.describe()
#posts_by_title.describe()
# posts_by_body.describe()

Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,9873.0,9873.0,3670.0,0.0,9873.0,9873.0,9801.0,4271.0,9873.0,9873.0
mean,71260800.0,1.0,70227080.0,,2.747392,2217.544718,8707827.0,6605726.0,0.981363,1.657753
std,5184317.0,0.0,5424942.0,,11.666862,8988.151347,6554179.0,6039375.0,1.119418,2.394376
min,54176760.0,1.0,54177630.0,,-6.0,5.0,91.0,-1.0,0.0,0.0
25%,67964110.0,1.0,66227630.0,,0.0,136.0,2628868.0,1623876.0,0.0,0.0
50%,72633060.0,1.0,71291860.0,,1.0,459.0,7764329.0,4290962.0,1.0,1.0
75%,75623190.0,1.0,74934410.0,,2.0,1385.0,13836210.0,10234950.0,1.0,2.0
max,77593690.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22965690.0,35.0,26.0


In [6]:
df_union = pd.concat([posts_by_tags, posts_by_title, posts_by_body]).drop_duplicates().reset_index(drop=True)
df_union.describe()

Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,11323.0,11323.0,4140.0,0.0,11323.0,11323.0,11245.0,4915.0,11323.0,11323.0
mean,71330220.0,1.0,70285560.0,,2.563543,2062.098296,8737486.0,6706649.0,0.963967,1.648238
std,5145350.0,0.0,5374074.0,,11.054158,8502.726944,6573781.0,6117352.0,1.085275,2.396634
min,54176760.0,1.0,54177630.0,,-6.0,5.0,91.0,-1.0,0.0,0.0
25%,68052830.0,1.0,66373780.0,,0.0,132.0,2628868.0,1611459.0,0.0,0.0
50%,72676530.0,1.0,71299040.0,,1.0,438.0,7793017.0,4420967.0,1.0,1.0
75%,75652710.0,1.0,74953000.0,,2.0,1299.0,13917820.0,10606950.0,1.0,2.0
max,77593690.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22965690.0,35.0,26.0


In [7]:
df_merged_1 = pd.merge(posts_by_tags, posts_by_title, how='inner')
df_intersection = pd.merge(df_merged_1, posts_by_body, how='inner')
df_intersection.describe()

Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,2903.0,2903.0,1140.0,0.0,2903.0,2903.0,2882.0,1272.0,2903.0,2903.0
mean,70504730.0,1.0,69380520.0,,3.831898,2881.829142,8590357.0,6978242.0,1.053049,1.714433
std,5605103.0,0.0,5787363.0,,16.139532,11830.760186,6426077.0,6081297.0,1.377158,2.484013
min,54176760.0,1.0,54177630.0,,-3.0,6.0,91.0,-1.0,0.0,0.0
25%,66371500.0,1.0,64725360.0,,0.0,183.0,2628868.0,2087967.0,0.0,0.0
50%,71865960.0,1.0,70416120.0,,1.0,570.0,7648607.0,4900238.0,1.0,1.0
75%,75420590.0,1.0,74501440.0,,3.0,1650.0,13524240.0,11065870.0,1.0,3.0
max,77591540.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22613770.0,35.0,26.0


### Number of questions filtered:
- posts_by_title: 4538
- posts_by_body: 6297
- posts_by_tags: 9873
- df_intersection: 2903
- df_union: 11323

In [8]:
df_intersection['tags'].str.contains('<github-actions>').value_counts()

tags
True     2894
False       9
Name: count, dtype: int64

- We use df_intersection because a post that carries a GitHub Actions tag and also mentions GitHub Actions in both its title and its body is very likely to be about GitHub Actions.

- From df_intersection we draw a random sample.

- Given the 2903 posts in df_intersection, at least 340 posts are needed for a 95% confidence level with a ±5% margin of error.
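
The notebook does not say how the figure of 340 was obtained. One common calculation that reproduces it is Cochran's sample-size formula with a finite population correction, assuming a z-score of 1.96 for 95% confidence and the most conservative proportion p = 0.5. The function name and defaults below are illustrative assumptions.

```python
import math


def required_sample_size(population: int, z: float = 1.96,
                         margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula with finite population correction (assumed method)."""
    # Sample size for an infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correct for the finite population size
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(required_sample_size(2903))  # 340
```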

In [9]:
# Draw a random sample of 400 posts, more than the 340 required, so that after manual
# review we can keep the first 340 posts that the evaluators consider related to the topic.
sample_size = 400
seed = 0
df_sample = df_intersection.sample(n=sample_size, random_state=seed)
print("Sample length: ", df_sample.shape[0])
df_sample.head()

Sample length:  400


Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,creation_date,score,view_count,post_body,owner_user_id,last_editor_user_id,last_edit_date,last_activity_date,post_title,tags,answer_count,comment_count,content_license,link
582,65011370,1,,,2020-11-25 19:17:14,0,637,<p>I have this go.yml for github actions</p>&#...,12215821.0,,,2020-11-26 08:31:42,"Github actions, problem with dep installing",<go><github-actions><dep>,1,0,CC BY-SA 4.0,https://stackoverflow.com/q/65011370
1811,73822327,1,73822678.0,,2022-09-23 02:09:56,1,132,<p>I can locally create and use a docker conta...,7628816.0,7628816.0,2022-09-23 03:25:16,2022-09-23 03:25:37,ory/dockertest not working on GitHub Actions,<docker><github-actions><clickhouse><ory>,1,0,CC BY-SA 4.0,https://stackoverflow.com/q/73822327
2249,75607496,1,,,2023-03-01 18:12:01,1,107,<p>We have a legacy Maven application which is...,3065868.0,3065868.0,2023-03-03 08:57:38,2023-03-03 08:57:38,GitHub Actions: Generate Deployment sources fo...,<maven><github-actions><websphere><websphere-8...,0,9,CC BY-SA 4.0,https://stackoverflow.com/q/75607496
1652,73018888,1,,,2022-07-18 07:48:42,2,1437,<p>GitHub Actions concurrency broke my process...,119790.0,,,2022-10-03 23:31:28,How to disable GitHub Actions Concurrency,<github><github-actions>,1,2,CC BY-SA 4.0,https://stackoverflow.com/q/73018888
667,65855054,1,,,2021-01-23 02:10:44,3,5595,<p>So I have a repo with multiple directories ...,3681199.0,,,2021-01-23 08:11:48,Can you have multiple working directories with...,<go><github-actions>,1,2,CC BY-SA 4.0,https://stackoverflow.com/q/65855054


In [10]:
df_sample[['id', 'link']].assign(evaluator1=np.nan).to_excel('data/sample_eval1.xlsx')
df_sample[['id', 'link']].assign(evaluator2=np.nan).to_excel('data/sample_eval2.xlsx')

In [11]:
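# Load both reviewed spreadsheets; the evaluator2 column is aligned by row position,
# so both files are assumed to list the posts in the same order.
sample_reviewed = pd.read_excel('data/sample_eval1_reviewed.xlsx')
sample_reviewed['evaluator2'] = pd.read_excel('data/sample_eval2_reviewed.xlsx')['evaluator2']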

In [12]:
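# Keep the first 340 posts that both evaluators marked as relevant (value 1)
sample_accepted = sample_reviewed[(sample_reviewed['evaluator1'] == 1) & (sample_reviewed['evaluator2'] == 1)].head(340)
# Recover the full post data for the accepted ids
sample_accepted = pd.merge(df_intersection, sample_accepted[['id']], on='id', how='inner')
sample_accepted.to_excel('data/sample_accepted.xlsx')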

In [13]:
sample_accepted

Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,creation_date,score,view_count,post_body,owner_user_id,last_editor_user_id,last_edit_date,last_activity_date,post_title,tags,answer_count,comment_count,content_license,link
0,57503578,1,,,2019-08-15 00:30:14,13,7788,<p>I have a file in a GitHub repository that n...,54929.0,,,2023-04-05 01:42:41,Making pull requests to a GitHub repository au...,<github><github-actions>,1,0,CC BY-SA 4.0,https://stackoverflow.com/q/57503578
1,57509118,1,57521953.0,,2019-08-15 11:42:14,4,2302,"<p>I am working with new GitHub actions, idea ...",911930.0,,,2019-08-16 09:15:48,New GitHub actions run in empty folders,<github><github-actions><github-ci>,1,0,CC BY-SA 4.0,https://stackoverflow.com/q/57509118
2,57639507,1,57639714.0,,2019-08-24 16:02:24,4,819,<p>I'm trying to set up a CI/CD pipeline in Gi...,1341734.0,,,2022-04-06 11:40:39,How to access a service in Github Actions CI/CD?,<github><elixir><github-actions>,1,0,CC BY-SA 4.0,https://stackoverflow.com/q/57639507
3,57806624,1,57806894.0,,2019-09-05 13:29:41,80,35589,<p>I'm using GitHub Actions to build my projec...,3231778.0,,,2022-03-13 19:24:54,GitHub Actions - How to build project in sub-d...,<github-actions>,2,0,CC BY-SA 4.0,https://stackoverflow.com/q/57806624
4,57808152,1,59022667.0,,2019-09-05 14:56:14,11,4658,<p>I'm trying out GitHub Actions to build my F...,3231778.0,3231778.0,2019-09-05 15:48:37,2022-09-14 10:01:40,How to build Flutter in GitHub Actions CI/CD,<flutter><github-actions>,4,1,CC BY-SA 4.0,https://stackoverflow.com/q/57808152
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
335,77395114,1,77395417.0,,2023-10-31 10:30:28,0,62,<p>I wanted to experiment with GitHub Actions:...,11186407.0,,,2023-10-31 13:15:41,GitHub Actions generated docker image results ...,<docker><rust><github-actions>,1,0,CC BY-SA 4.0,https://stackoverflow.com/q/77395114
336,77431693,1,,,2023-11-06 14:04:11,0,32,<p>Basically as it says in the title.</p>&#xA;...,22436104.0,9938317.0,2023-11-06 15:35:14,2023-11-06 15:35:14,GitHub Actions VS code extension gives access ...,<github><visual-studio-code><github-actions><s...,0,3,CC BY-SA 4.0,https://stackoverflow.com/q/77431693
337,77446605,1,77446960.0,,2023-11-08 14:49:09,1,95,<p>I have my unittests in a top level <code>te...,471136.0,,,2023-11-08 15:41:57,Running Python Poetry unit test in Github actions,<python><github-actions><python-poetry>,1,1,CC BY-SA 4.0,https://stackoverflow.com/q/77446605
338,77469265,1,,,2023-11-12 14:11:32,0,28,<p>I'm facing an issue with accessing metadata...,22902806.0,,,2023-11-12 14:11:32,How to Retrieve Metadata of AWS-Stored Audio F...,<amazon-web-services><amazon-s3><github-action...,0,1,CC BY-SA 4.0,https://stackoverflow.com/q/77469265


In [14]:
from bs4 import BeautifulSoup
import re

In [21]:
def html_to_sentences_df(html_content):
    """
    Parses the provided HTML content, replaces specific tags with placeholders,
    adjusts paragraph endings, and splits the content into sentences.
    Returns a pandas DataFrame with each sentence in a separate row.
    
    Parameters:
    - html_content: String containing HTML content.
    
    Returns:
    - DataFrame with each sentence as a separate row.
    """
    
    # Parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Replace code blocks, blockquotes, and links with placeholders
    for code in soup.find_all('code'):
        code.replace_with("-CODE-")
    for blockquote in soup.find_all('blockquote'):
        blockquote.replace_with("-BLOCK-")
    for a in soup.find_all('a'):
        a.replace_with("-LINK-")

    # Extract text and replace newline entities
    text = soup.get_text()
    text = text.replace('&#xA;', '\n').strip()

    # Pre-process text to handle ':\\n-CODE-' pattern
    text = re.sub(r':\s*\n-CODE-', ': -CODE-', text)
    text = re.sub(r':\s*\n-BLOCK-', ': -BLOCK-', text)
    text = re.sub(r':\s*\n-LINK-', ': -LINK-', text)
    
    # Adjust paragraph endings where necessary
    pattern = r'(?<![\.\!\?\s])\s*\n'
    text = re.sub(pattern, '.\n', text)

    # Replace ':.' with ':'
    text = text.replace(':.', ':')

    # Split text into sentences
    sentences = re.split(r'(?<=[.!?]) +', text.replace('\n', ' '))

    # Create DataFrame
    df = pd.DataFrame(sentences, columns=['sentence'])
    
    return df
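
To see what the text-processing steps of `html_to_sentences_df` do, the sketch below applies the same regular expressions to a plain string in which the placeholder has already been inserted. It skips the HTML parsing and the DataFrame construction, and `split_sentences` is a hypothetical helper used only for illustration.

```python
import re


def split_sentences(text: str) -> list:
    """Apply the same regex steps as html_to_sentences_df to plain text."""
    # Keep a placeholder on the same line as the colon that introduces it
    for placeholder in ("-CODE-", "-BLOCK-", "-LINK-"):
        text = re.sub(r':\s*\n' + placeholder, ': ' + placeholder, text)
    # End every line that lacks terminal punctuation with a period
    text = re.sub(r'(?<![\.\!\?\s])\s*\n', '.\n', text)
    # Undo periods that were added right after a colon
    text = text.replace(':.', ':')
    # Split on spaces that follow sentence-ending punctuation
    return re.split(r'(?<=[.!?]) +', text.replace('\n', ' '))


example = "I get this error:\n-CODE-\nAny idea why\nThanks!"
print(split_sentences(example))
# ['I get this error: -CODE-.', 'Any idea why.', 'Thanks!']
```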

In [23]:
def process_df(sample_df):
    # Initialize an empty DataFrame to hold all sentences
    all_sentences_df = pd.DataFrame()

    # Iterate over each row in the DataFrame
    for index, row in sample_df.iterrows():

        # Process the title and body, assuming they are HTML content
        title_sentences_df = html_to_sentences_df(row['post_title'])
        body_sentences_df = html_to_sentences_df(row['post_body'])

        # Add a column with the row index of the post within sample_df
        title_sentences_df['id'] = index
        body_sentences_df['id'] = index

        # Add a column with the Stack Overflow id of the post
        title_sentences_df['post_id'] = row['id']
        body_sentences_df['post_id'] = row['id']

        # Add a column to indicate the source of the sentences
        title_sentences_df['source'] = 'title'
        body_sentences_df['source'] = 'body'
        
        # Combine title and body sentences
        combined_sentences_df = pd.concat([title_sentences_df, body_sentences_df], ignore_index=True)
        
        # Add the combined sentences to the overall DataFrame
        all_sentences_df = pd.concat([all_sentences_df, combined_sentences_df], ignore_index=True)

        all_sentences_df = all_sentences_df[['id', 'post_id', 'source', 'sentence']]
    
    return all_sentences_df

all_sentences_df = process_df(sample_accepted)


In [17]:
categories = [1, 4, 5, 7, 2, 9, 13, 11, 14, 20, 22, 19, 23]
# Add each new column filled with zeros to the DataFrame
for column in categories:
    all_sentences_df[column] = 0

In [18]:
all_sentences_df.to_excel('data/sentences_eval1.xlsx')
all_sentences_df.to_excel('data/sentences_eval2.xlsx')

In [45]:
classified_sentences_eval1 = pd.read_excel('data/sentences_eval1_reviewed.xlsx')
classified_sentences_eval2 = pd.read_excel('data/sentences_eval2_reviewed.xlsx')

In [46]:
classified_sentences_eval1.columns.drop(['Unnamed: 0', 'id', 'post_id', 'source', 'sentence'])

Index([1, 4, 5, 7, 2, 9, 13, 11, 14, 20, 22, 19, 23], dtype='object')

In [48]:
from sklearn.metrics import cohen_kappa_score

# Compute Cohen's kappa between the two evaluators, one value per category
kappa = pd.DataFrame()
for category in classified_sentences_eval1.columns.drop(['Unnamed: 0', 'id', 'post_id', 'source', 'sentence']):
    kappa[category] = [cohen_kappa_score(classified_sentences_eval1[category], classified_sentences_eval2[category])]

kappa

Unnamed: 0,1,4,5,7,2,9,13,11,14,20,22,19,23
0,0.439652,0.238585,0.240824,0.219126,0.0,0.405172,0.0,0.05899,0.0,0.498816,0.362615,0.0,0.0
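
Several categories (2, 13, 14, 19, 23) have a kappa of exactly 0.0. One possible cause, which is an assumption and has not been checked against the reviewed spreadsheets, is that one evaluator never assigned that category: observed agreement then equals chance agreement and kappa collapses to zero even when the evaluators agree on most sentences. The pure-Python sketch below computes Cohen's kappa from its definition to illustrate this; `cohen_kappa` is a hypothetical helper and the labels are made up for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / n ** 2
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels: evaluator 2 never assigns the category
evaluator1 = [0, 0, 1, 0, 0, 0]
evaluator2 = [0, 0, 0, 0, 0, 0]
print(cohen_kappa(evaluator1, evaluator2))  # 0.0, despite agreement on 5 of 6 items
```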
