# Taxonomy

## Dataset preparation

The data consists of three subsets, obtained by filtering Stack Overflow questions by title, body, and tags:
1. posts_by_title: Posts containing "Github Actions" or its variants in the question title (github actions, github-action, Github actions, Github-actions, Github Actions, Github-Actions).
2. posts_by_body: Posts containing "Github Actions" or its variants in the question body.
3. posts_by_tags: Posts tagged with any tag from the GitHub Actions tag list ('github-actions', 'building-github-actions', 'github-actions-self-hosted-runners', 'github-actions-runners', 'github-actions-services', 'github-actions-artifacts', 'github-actions-reusable-workflows', 'github-actions-workflows', 'github-actions-marketplace').
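The title and body filters listed above can be approximated with a single case-insensitive pattern. This is only a sketch: the notebook does not show the query that produced the JSON files, so `GHA_PATTERN` and `mentions_github_actions` are hypothetical names, and the pattern matches a slightly broader set than the six listed variants (for example, it also accepts "GitHub Actions" and "github action").

```python
import re

# Assumed pattern (not taken from the original query): "github", then a space
# or a hyphen, then "action" or "actions", ignoring case.
GHA_PATTERN = re.compile(r"github[ -]actions?", re.IGNORECASE)


def mentions_github_actions(text: str) -> bool:
    """Return True if the text contains one of the 'Github Actions' variants."""
    return GHA_PATTERN.search(text) is not None


# The variants listed for posts_by_title.
variants = [
    "github actions",
    "github-action",
    "Github actions",
    "Github-actions",
    "Github Actions",
    "Github-Actions",
]

for variant in variants:
    print(variant, mentions_github_actions(variant))
```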

In [25]:
import pandas as pd
import numpy as np

In [26]:
posts_by_title = pd.read_json('data/posts_by_title.json')
posts_by_body = pd.read_json('data/posts_by_body.json')
posts_by_tags = pd.read_json('data/posts_by_tags.json')
posts_by_title.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_body.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_tags.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)

In [27]:
posts_by_body = posts_by_body[posts_by_body['post_type_id']==1] # Keep only questions (post_type_id == 1); matches found in answer bodies are dropped
# posts_by_tags.drop(columns=['DeletionDate', 'OwnerDisplayName', 'LastEditorDisplayName', 'FavoriteCount', 'ClosedDate', 'CommunityOwnedDate'], inplace=True);

In [28]:
# posts_by_tags.rename(columns={'Id':'id', 'PostTypeId':'post_type_id', 'AcceptedAnswerId': 'accepted_answer_id', 'ParentId': 'parent_id', 'CreationDate':'creation_date',
#                                 'Score':'score', 'ViewCount':'view_count', 'Body':'post_body', 'OwnerUserId':'owner_user_id', 'LastEditorUserId':'last_editor_user_id',
#                                 'LastEditDate':'last_edit_date', 'LastActivityDate':'last_activity_date', 'Title':'post_title', 'Tags':'tags', 'AnswerCount':'answer_count',
#                                 'CommentCount':'comment_count', 'ContentLicense':'content_license', 'Unnamed: 23':'link'}, inplace=True)


In [29]:
posts_by_tags.describe()
#posts_by_title.describe()
# posts_by_body.describe()

Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,9873.0,9873.0,3670.0,0.0,9873.0,9873.0,9801.0,4271.0,9873.0,9873.0
mean,71260800.0,1.0,70227080.0,,2.747392,2217.544718,8707827.0,6605726.0,0.981363,1.657753
std,5184317.0,0.0,5424942.0,,11.666862,8988.151347,6554179.0,6039375.0,1.119418,2.394376
min,54176760.0,1.0,54177630.0,,-6.0,5.0,91.0,-1.0,0.0,0.0
25%,67964110.0,1.0,66227630.0,,0.0,136.0,2628868.0,1623876.0,0.0,0.0
50%,72633060.0,1.0,71291860.0,,1.0,459.0,7764329.0,4290962.0,1.0,1.0
75%,75623190.0,1.0,74934410.0,,2.0,1385.0,13836210.0,10234950.0,1.0,2.0
max,77593690.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22965690.0,35.0,26.0


In [30]:
df_union = pd.concat([posts_by_tags, posts_by_title, posts_by_body]).drop_duplicates().reset_index(drop=True)
df_union.describe()

Unnamed: 0,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,11323.0,11323.0,4140.0,0.0,11323.0,11323.0,11245.0,4915.0,11323.0,11323.0
mean,71330220.0,1.0,70285560.0,,2.563543,2062.098296,8737486.0,6706649.0,0.963967,1.648238
std,5145350.0,0.0,5374074.0,,11.054158,8502.726944,6573781.0,6117352.0,1.085275,2.396634
min,54176760.0,1.0,54177630.0,,-6.0,5.0,91.0,-1.0,0.0,0.0
25%,68052830.0,1.0,66373780.0,,0.0,132.0,2628868.0,1611459.0,0.0,0.0
50%,72676530.0,1.0,71299040.0,,1.0,438.0,7793017.0,4420967.0,1.0,1.0
75%,75652710.0,1.0,74953000.0,,2.0,1299.0,13917820.0,10606950.0,1.0,2.0
max,77593690.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22965690.0,35.0,26.0


In [31]:
df_merged_1 = pd.merge(posts_by_tags, posts_by_title, how='inner', on='id')
df_intersection = pd.merge(df_merged_1, posts_by_body, how='inner', on='id')
df_intersection.describe()

Unnamed: 0,id,post_type_id_x,accepted_answer_id_x,parent_id_x,score_x,view_count_x,owner_user_id_x,last_editor_user_id_x,answer_count_x,comment_count_x,...,comment_count_y,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,2903.0,2903.0,1140.0,0.0,2903.0,2903.0,2882.0,1272.0,2903.0,2903.0,...,2903.0,2903.0,1140.0,0.0,2903.0,2903.0,2882.0,1272.0,2903.0,2903.0
mean,70504730.0,1.0,69380520.0,,3.831898,2881.829142,8590357.0,6978242.0,1.053049,1.714433,...,1.714433,1.0,69380520.0,,3.831898,2881.829142,8590357.0,6978242.0,1.053049,1.714433
std,5605103.0,0.0,5787363.0,,16.139532,11830.760186,6426077.0,6081297.0,1.377158,2.484013,...,2.484013,0.0,5787363.0,,16.139532,11830.760186,6426077.0,6081297.0,1.377158,2.484013
min,54176760.0,1.0,54177630.0,,-3.0,6.0,91.0,-1.0,0.0,0.0,...,0.0,1.0,54177630.0,,-3.0,6.0,91.0,-1.0,0.0,0.0
25%,66371500.0,1.0,64725360.0,,0.0,183.0,2628868.0,2087967.0,0.0,0.0,...,0.0,1.0,64725360.0,,0.0,183.0,2628868.0,2087967.0,0.0,0.0
50%,71865960.0,1.0,70416120.0,,1.0,570.0,7648607.0,4900238.0,1.0,1.0,...,1.0,1.0,70416120.0,,1.0,570.0,7648607.0,4900238.0,1.0,1.0
75%,75420590.0,1.0,74501440.0,,3.0,1650.0,13524240.0,11065870.0,1.0,3.0,...,3.0,1.0,74501440.0,,3.0,1650.0,13524240.0,11065870.0,1.0,3.0
max,77591540.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22613770.0,35.0,26.0,...,26.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22613770.0,35.0,26.0
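Merging on `id` duplicates every other column with `_x` and `_y` suffixes, as the output above shows. A possible alternative, sketched here on toy data, is to intersect the `id` sets and filter one frame with `isin`, which keeps a single copy of each column. The three small frames and the name `df_intersection_ids` are made up for illustration. The row count matches the merge-based result only if `id` is unique within each subset, which is assumed here.

```python
import pandas as pd

# Toy stand-ins for the three subsets (hypothetical data sharing the 'id' key).
posts_by_tags = pd.DataFrame({"id": [1, 2, 3, 4], "tags": ["<a>", "<b>", "<c>", "<d>"]})
posts_by_title = pd.DataFrame({"id": [2, 3, 4, 5]})
posts_by_body = pd.DataFrame({"id": [3, 4, 6]})

# Ids present in all three subsets.
common_ids = (
    set(posts_by_tags["id"])
    & set(posts_by_title["id"])
    & set(posts_by_body["id"])
)

# Keep the rows of one frame whose id is in the intersection; no suffixed columns.
df_intersection_ids = (
    posts_by_tags[posts_by_tags["id"].isin(common_ids)].reset_index(drop=True)
)
print(df_intersection_ids)
```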


### Number of questions filtered:
- posts_by_title: 4538
- posts_by_body: 6297
- posts_by_tags: 9873
- df_intersection: 2903
- df_union: 11323

In [32]:
df_intersection['tags'].str.contains('<github-actions>').value_counts()

tags
True     2894
False       9
Name: count, dtype: int64

- We use df_intersection because a question that carries a GitHub Actions tag and also mentions GitHub Actions in both its title and its body is very likely to be about GitHub Actions.

- We draw a random sample from df_intersection.

- At least 340 posts are needed for a 95% confidence level with a ±5% margin of error, given the 2903 posts in df_intersection.
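The figure of 340 can be reproduced with Cochran's sample-size formula plus a finite population correction. This is a sketch under assumptions: the notebook does not say which formula was used, the function name `sample_size` is hypothetical, and the proportion `p = 0.5` is the usual worst-case choice rather than a value from the source.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula with finite population correction, rounded up."""
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Correction for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(2903))  # prints 340
```

Without the correction the same settings give 385 posts, so the population size of df_intersection matters for the result.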

In [33]:
# Draw a random sample of 400 posts (more than the 340 required) and accept the first 340 that we consider related to the topic.
sample_size = 400
seed = 0
df_sample = df_intersection.sample(n=sample_size, random_state=seed)
print("Sample length: ", df_sample.shape[0])
df_sample.head()

Sample length:  400


Unnamed: 0,id,post_type_id_x,accepted_answer_id_x,parent_id_x,creation_date_x,score_x,view_count_x,post_body_x,owner_user_id_x,last_editor_user_id_x,...,owner_user_id,last_editor_user_id,last_edit_date,last_activity_date,post_title,tags,answer_count,comment_count,content_license,link
582,65011370,1,,,2020-11-25 19:17:14,0,637,<p>I have this go.yml for github actions</p>&#...,12215821.0,,...,12215821.0,,,2020-11-26 08:31:42,"Github actions, problem with dep installing",<go><github-actions><dep>,1.0,0,CC BY-SA 4.0,https://stackoverflow.com/q/65011370
1811,73822327,1,73822678.0,,2022-09-23 02:09:56,1,132,<p>I can locally create and use a docker conta...,7628816.0,7628816.0,...,7628816.0,7628816.0,2022-09-23 03:25:16,2022-09-23 03:25:37,ory/dockertest not working on GitHub Actions,<docker><github-actions><clickhouse><ory>,1.0,0,CC BY-SA 4.0,https://stackoverflow.com/q/73822327
2249,75607496,1,,,2023-03-01 18:12:01,1,107,<p>We have a legacy Maven application which is...,3065868.0,3065868.0,...,3065868.0,3065868.0,2023-03-03 08:57:38,2023-03-03 08:57:38,GitHub Actions: Generate Deployment sources fo...,<maven><github-actions><websphere><websphere-8...,0.0,9,CC BY-SA 4.0,https://stackoverflow.com/q/75607496
1652,73018888,1,,,2022-07-18 07:48:42,2,1437,<p>GitHub Actions concurrency broke my process...,119790.0,,...,119790.0,,,2022-10-03 23:31:28,How to disable GitHub Actions Concurrency,<github><github-actions>,1.0,2,CC BY-SA 4.0,https://stackoverflow.com/q/73018888
667,65855054,1,,,2021-01-23 02:10:44,3,5595,<p>So I have a repo with multiple directories ...,3681199.0,,...,3681199.0,,,2021-01-23 08:11:48,Can you have multiple working directories with...,<go><github-actions>,1.0,2,CC BY-SA 4.0,https://stackoverflow.com/q/65855054


In [34]:
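# Export one sheet per evaluator. Writing .xlsx with to_excel requires the openpyxl package (pip install openpyxl).
df_sample[['id', 'link']].assign(evaluator1=np.nan).to_excel('data/sample_eval1.xlsx')
df_sample[['id', 'link']].assign(evaluator2=np.nan).to_excel('data/sample_eval2.xlsx')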

ModuleNotFoundError: No module named 'openpyxl'
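
The `ModuleNotFoundError` above occurs because `DataFrame.to_excel` relies on an optional Excel engine, here openpyxl, which is not installed in this environment. Installing it (`pip install openpyxl`) is the direct fix. The sketch below shows one way to make the export robust in the meantime: it checks for openpyxl and falls back to CSV. The helper `export_for_evaluator`, the CSV fallback, and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd


def export_for_evaluator(df: pd.DataFrame, column: str, path_stem: str) -> str:
    """Write id/link plus an empty evaluator column; return the path written."""
    out = df[["id", "link"]].assign(**{column: np.nan})
    if importlib.util.find_spec("openpyxl") is not None:
        # openpyxl is available: write an Excel workbook as in the notebook.
        path = path_stem + ".xlsx"
        out.to_excel(path, engine="openpyxl")
    else:
        # Assumed fallback: CSV needs no extra dependency.
        path = path_stem + ".csv"
        out.to_csv(path)
    return path


# Toy frame standing in for df_sample (ids and links taken from the sample above).
toy_sample = pd.DataFrame({
    "id": [65011370, 73822327],
    "link": ["https://stackoverflow.com/q/65011370",
             "https://stackoverflow.com/q/73822327"],
})

out_dir = tempfile.mkdtemp()
written = export_for_evaluator(toy_sample, "evaluator1", os.path.join(out_dir, "sample_eval1"))
print(written)
```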