# Taxonomy

## Dataset preparation

The dataset consists of three subsets of Stack Overflow posts, obtained by filtering on the question title, the question body, and the tags, respectively.
1. posts_by_title: Posts containing "Github Actions" or its variants in the question title (github actions, github-action, Github actions, Github-actions, Github Actions, Github-Actions).
2. posts_by_body: Posts containing "Github Actions" or its variants in the question body.
3. posts_by_tags: Posts tagged with any tag from the GitHub Actions tag list ('github-actions', 'building-github-actions', 'github-actions-self-hosted-runners', 'github-actions-runners', 'github-actions-services', 'github-actions-artifacts', 'github-actions-reusable-workflows', 'github-actions-workflows', 'github-actions-marketplace').
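
The three filters above can be approximated in pandas as follows. This is only a sketch: the original subsets appear to have been exported from a database query, and the regular expression, the helper names `mentions_gha` and `has_gha_tag`, and the toy rows are assumptions made for illustration.

```python
import re
import pandas as pd

# Assumed pattern for the listed variants: "github actions", "github-action",
# "Github-Actions", ... (space or hyphen, optional trailing "s").
GHA_PATTERN = r"github[ -]actions?"

# The tag list quoted in the text above.
GHA_TAGS = {
    "github-actions", "building-github-actions",
    "github-actions-self-hosted-runners", "github-actions-runners",
    "github-actions-services", "github-actions-artifacts",
    "github-actions-reusable-workflows", "github-actions-workflows",
    "github-actions-marketplace",
}

def mentions_gha(text: pd.Series) -> pd.Series:
    """Boolean mask: True where the text contains a 'GitHub Actions' variant."""
    return text.fillna("").str.contains(GHA_PATTERN, case=False, regex=True)

def has_gha_tag(tags: pd.Series) -> pd.Series:
    """Boolean mask: True where a '<tag1><tag2>' string includes a listed tag."""
    return tags.fillna("").apply(
        lambda s: bool(GHA_TAGS.intersection(re.findall(r"<([^<>]+)>", s)))
    )

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "post_title": ["Cache steps in Github-Actions", "Jenkins pipeline fails"],
    "tags": ["<github><github-actions>", "<jenkins>"],
})
title_mask = mentions_gha(toy["post_title"])
tag_mask = has_gha_tag(toy["tags"])
```

The same `mentions_gha` mask applies to the body column for the second subset.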

In [31]:
import pandas as pd

In [47]:
posts_by_title = pd.read_json('data/posts_by_title.json')
posts_by_body = pd.read_json('data/posts_by_body.json')
posts_by_tags = pd.read_csv('data/posts_by_tags.csv')
posts_by_title.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)
posts_by_body.rename(columns={"CONCAT('https://stackoverflow.com/q/', p.Id)": 'link'}, inplace=True)

In [48]:
# Keep only questions (post_type_id == 1), dropping bodies of other post types
posts_by_body = posts_by_body[posts_by_body['post_type_id'] == 1].copy()
# Drop columns that are absent from the title and body subsets
posts_by_tags.drop(columns=['DeletionDate', 'OwnerDisplayName', 'LastEditorDisplayName', 'FavoriteCount', 'ClosedDate', 'CommunityOwnedDate'], inplace=True)

In [49]:
posts_by_tags.rename(columns={'Id':'id', 'PostTypeId':'post_type_id', 'AcceptedAnswerId': 'accepted_answer_id', 'ParentId': 'parent_id', 'CreationDate':'creation_date',
                                'Score':'score', 'ViewCount':'view_count', 'Body':'post_body', 'OwnerUserId':'owner_user_id', 'LastEditorUserId':'last_editor_user_id',
                                'LastEditDate':'last_edit_date', 'LastActivityDate':'last_activity_date', 'Title':'post_title', 'Tags':'tags', 'AnswerCount':'answer_count',
                                'CommentCount':'comment_count', 'ContentLicense':'content_license', 'Unnamed: 23':'link'}, inplace=True)
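
Most of the mapping above is a mechanical CamelCase to snake_case conversion, so it could also be derived programmatically. The sketch below assumes the same target names; `to_snake_case`, `OVERRIDES`, and `normalize_columns` are hypothetical helpers, not part of the original notebook.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Convert e.g. 'AcceptedAnswerId' to 'accepted_answer_id'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Columns whose target name is not a plain snake_case conversion.
OVERRIDES = {"Body": "post_body", "Title": "post_title", "Unnamed: 23": "link"}

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with columns renamed to the snake_case scheme."""
    mapping = {col: OVERRIDES.get(col, to_snake_case(col)) for col in df.columns}
    return df.rename(columns=mapping)

# Empty toy frame carrying a few of the original column names.
toy = pd.DataFrame(columns=["Id", "PostTypeId", "Body", "LastEditDate", "Unnamed: 23"])
normalized = normalize_columns(toy)
```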


In [59]:
#posts_by_tags.describe()
#posts_by_title.describe()
posts_by_body.describe()

,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,6297.0,6297.0,2327.0,0.0,6297.0,6297.0,6255.0,2837.0,6297.0,6297.0
mean,71065780.0,1.0,69950680.0,,2.836748,2213.280769,8619572.0,6780544.0,0.978402,1.686835
std,5347366.0,0.0,5549951.0,,12.542027,9546.348354,6515283.0,6089129.0,1.159121,2.453132
min,54176760.0,1.0,54177630.0,,-3.0,6.0,91.0,-1.0,0.0,0.0
25%,67385570.0,1.0,65782610.0,,0.0,142.0,2628868.0,1778005.0,0.0,0.0
50%,72394940.0,1.0,71030110.0,,1.0,457.0,7615872.0,4621513.0,1.0,1.0
75%,75611740.0,1.0,74814100.0,,2.0,1350.0,13705800.0,10817280.0,1.0,3.0
max,77591540.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22900690.0,35.0,26.0


In [65]:
df_union = pd.concat([posts_by_tags, posts_by_title, posts_by_body]).drop_duplicates().reset_index(drop=True)
df_union.describe()

,id,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,17582.0,17582.0,6643.0,0.0,17582.0,17582.0,17455.0,7731.0,17582.0,17582.0
mean,71197170.0,1.0,70193670.0,,2.859629,2332.955295,8661308.0,6650777.0,0.99636,1.668297
std,5244839.0,0.0,5474512.0,,12.111609,9413.060583,6520149.0,6033402.0,1.13754,2.416215
min,54176760.0,1.0,54177630.0,,-6.0,6.0,91.0,-1.0,0.0,0.0
25%,67762100.0,1.0,66134180.0,,0.0,157.0,2628868.0,1688639.0,0.0,0.0
50%,72534350.0,1.0,71224240.0,,1.0,503.0,7728038.0,4417586.0,1.0,1.0
75%,75649200.0,1.0,74972540.0,,2.0,1473.0,13733460.0,10348970.0,1.0,2.0
max,77740410.0,1.0,77888180.0,,355.0,337310.0,23178880.0,23161610.0,35.0,26.0
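
Note that `drop_duplicates()` without arguments compares entire rows. If the same question appears in two subsets with different values in any column (for example a `view_count` taken from a later export), both copies are kept. A possible alternative, sketched here with a hypothetical helper `union_by_id` and toy data, deduplicates on the `id` column only.

```python
import pandas as pd

def union_by_id(frames, key="id"):
    """Concatenate frames and keep the first row seen for each key value."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset=key, keep="first").reset_index(drop=True)

# Toy data: question 2 appears in both frames with a different view_count.
a = pd.DataFrame({"id": [1, 2], "view_count": [10, 20]})
b = pd.DataFrame({"id": [2, 3], "view_count": [25, 30]})

row_wise = pd.concat([a, b]).drop_duplicates().reset_index(drop=True)  # keeps both copies of id 2
by_id = union_by_id([a, b])                                            # keeps one row per id
```

Comparing `df_union['id'].nunique()` with `len(df_union)` would show whether this matters for the real data.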


In [64]:
df_merged_1 = pd.merge(posts_by_tags, posts_by_title, how='inner', on='id')
df_intersection = pd.merge(df_merged_1, posts_by_body, how='inner', on='id')
df_intersection.describe()

,id,post_type_id_x,accepted_answer_id_x,parent_id_x,score_x,view_count_x,owner_user_id_x,last_editor_user_id_x,answer_count_x,comment_count_x,...,comment_count_y,post_type_id,accepted_answer_id,parent_id,score,view_count,owner_user_id,last_editor_user_id,answer_count,comment_count
count,2857.0,2857.0,1146.0,0.0,2857.0,2857.0,2836.0,1257.0,2857.0,2857.0,...,2857.0,2857.0,1140.0,0.0,2857.0,2857.0,2836.0,1256.0,2857.0,2857.0
mean,70417480.0,1.0,69421010.0,,4.052503,3139.711936,8541469.0,6970519.0,1.083304,1.728036,...,1.728386,1.0,69380520.0,,3.901995,2926.654533,8541469.0,6957873.0,1.070004,1.728386
std,5605049.0,0.0,5799237.0,,16.773162,12653.223765,6387419.0,6085586.0,1.387491,2.492825,...,2.493214,0.0,5787363.0,,16.259177,11920.330407,6387419.0,6071022.0,1.381652,2.493214
min,54176760.0,1.0,54177630.0,,-2.0,8.0,91.0,-1.0,0.0,0.0,...,0.0,1.0,54177630.0,,-2.0,6.0,91.0,-1.0,0.0,0.0
25%,66303180.0,1.0,64728880.0,,0.0,236.0,2628868.0,2031033.0,0.0,0.0,...,0.0,1.0,64725360.0,,0.0,194.0,2628868.0,2028900.0,0.0,0.0
50%,71750840.0,1.0,70471600.0,,1.0,647.0,7615872.0,4904027.0,1.0,1.0,...,1.0,1.0,70416120.0,,1.0,587.0,7615872.0,4900238.0,1.0,1.0
75%,75405250.0,1.0,74549800.0,,3.0,1829.0,13410900.0,11065870.0,1.0,3.0,...,3.0,1.0,74501440.0,,3.0,1666.0,13410900.0,11056840.0,1.0,3.0
max,77586940.0,1.0,77649390.0,,355.0,337310.0,23022570.0,23022570.0,35.0,26.0,...,26.0,1.0,77585670.0,,348.0,321529.0,23022570.0,22613770.0,35.0,26.0
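
The two chained merges triple every column (`_x` from posts_by_tags, `_y` from posts_by_title, unsuffixed from posts_by_body). If only the set of shared questions is needed, one option is to intersect the `id` values and filter a single subset. This is a sketch with a hypothetical helper `intersect_on_id` and toy data; it keeps the columns of the first frame only.

```python
import pandas as pd

def intersect_on_id(base, *others, key="id"):
    """Rows of `base` whose key value also occurs in every frame in `others`."""
    common = set(base[key])
    for other in others:
        common &= set(other[key])
    return base[base[key].isin(common)].reset_index(drop=True)

# Toy subsets, invented for illustration only.
tags_df = pd.DataFrame({"id": [1, 2, 3], "score": [0, 5, 7]})
title_df = pd.DataFrame({"id": [2, 3, 4], "score": [5, 7, 1]})
body_df = pd.DataFrame({"id": [3, 2, 5], "score": [7, 5, 2]})

shared = intersect_on_id(tags_df, title_df, body_df)
```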


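### Number of questions per subset
- posts_by_title: 4538
- posts_by_body: 6297
- posts_by_tags: 9828
- df_intersection (questions present in all three subsets): 2857
- df_union (rows left after concatenating the three subsets and dropping duplicate rows): 17582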
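
Counts like these, together with the pairwise overlaps, can be derived from the `id` columns alone. The sketch below uses a hypothetical helper `overlap_counts` and toy frames; the numbers it produces are unrelated to the figures listed above.

```python
from itertools import combinations
import pandas as pd

def overlap_counts(subsets, key="id"):
    """Sizes of each subset, each pairwise overlap, the intersection, and the union."""
    ids = {name: set(df[key]) for name, df in subsets.items()}
    counts = {name: len(values) for name, values in ids.items()}
    for left, right in combinations(ids, 2):
        counts[f"{left} & {right}"] = len(ids[left] & ids[right])
    counts["all"] = len(set.intersection(*ids.values()))
    counts["any"] = len(set.union(*ids.values()))
    return pd.Series(counts)

# Toy subsets, invented for illustration only.
toy_subsets = {
    "title": pd.DataFrame({"id": [1, 2, 3]}),
    "body": pd.DataFrame({"id": [2, 3, 4]}),
    "tags": pd.DataFrame({"id": [3, 5]}),
}
summary = overlap_counts(toy_subsets)
```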

In [75]:
df_intersection['tags'].str.contains('<github-actions>').value_counts()

tags
True     2848
False       9
Name: count, dtype: int64
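
Because the check matches the literal substring `<github-actions>`, a question tagged only with a related tag such as `github-actions-runners` counts as `False`. To inspect such rows, the tag string can be split into a list first. This is a sketch: `parse_tags` is a hypothetical helper, and the toy rows are invented.

```python
import re
import pandas as pd

def parse_tags(tag_string):
    """Split a '<tag1><tag2>' string into ['tag1', 'tag2']; non-strings give []."""
    if not isinstance(tag_string, str):
        return []
    return re.findall(r"<([^<>]+)>", tag_string)

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2],
    "tags": ["<github><github-actions>", "<github><github-actions-runners>"],
})
tag_lists = toy["tags"].apply(parse_tags)

# Rows that lack the core 'github-actions' tag.
missing_core_tag = toy[~tag_lists.apply(lambda tags: "github-actions" in tags)]
```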

In [77]:
df_intersection.columns

Index(['id', 'post_type_id_x', 'accepted_answer_id_x', 'parent_id_x',
       'creation_date_x', 'score_x', 'view_count_x', 'post_body_x',
       'owner_user_id_x', 'last_editor_user_id_x', 'last_edit_date_x',
       'last_activity_date_x', 'post_title_x', 'tags_x', 'answer_count_x',
       'comment_count_x', 'content_license_x', 'link_x', 'post_type_id_y',
       'accepted_answer_id_y', 'parent_id_y', 'creation_date_y', 'score_y',
       'view_count_y', 'post_body_y', 'owner_user_id_y',
       'last_editor_user_id_y', 'last_edit_date_y', 'last_activity_date_y',
       'post_title_y', 'tags_y', 'answer_count_y', 'comment_count_y',
       'content_license_y', 'link_y', 'post_type_id', 'accepted_answer_id',
       'parent_id', 'creation_date', 'score', 'view_count', 'post_body',
       'owner_user_id', 'last_editor_user_id', 'last_edit_date',
       'last_activity_date', 'post_title', 'tags', 'answer_count',
       'comment_count', 'content_license', 'link'],
      dtype='object')

In [76]:
df_intersection

,id,post_type_id_x,accepted_answer_id_x,parent_id_x,creation_date_x,score_x,view_count_x,post_body_x,owner_user_id_x,last_editor_user_id_x,...,owner_user_id,last_editor_user_id,last_edit_date,last_activity_date,post_title,tags,answer_count,comment_count,content_license,link
0,54176763,1,54177628.0,,2019-01-14 06:40:49,0,808,"<p>I'm trying out Filters in GitHub Actions, h...",7110144.0,1421222.0,...,7110144.0,1421222.0,2019-02-06 19:49:10,2019-02-06 19:49:10,"GitHub Actions: Filter returns ""jq: error Cann...",<github><github-actions>,1.0,0,CC BY-SA 4.0,https://stackoverflow.com/q/54176763
1,54310050,1,60067489.0,,2019-01-22 14:06:45,62,56073,<p>My use case is I want to have a unique vers...,774437.0,1421222.0,...,774437.0,1421222.0,2019-02-06 19:45:45,2023-09-26 17:25:18,How to version build artifacts using GitHub Ac...,<github><automation><versioning><continuous-de...,7.0,3,CC BY-SA 4.0,https://stackoverflow.com/q/54310050
2,54483260,1,54492739.0,,2019-02-01 16:18:10,16,10307,<p>I woke up to my GitHub Actions BETA invite ...,1418711.0,1421222.0,...,1418711.0,1421222.0,2019-02-06 19:44:44,2023-09-11 11:02:11,Is it possible to persist a WORKDIR between Ac...,<github><github-actions>,4.0,1,CC BY-SA 4.0,https://stackoverflow.com/q/54483260
3,55110729,1,57958803.0,,2019-03-11 21:32:03,82,61091,<p>Say I have a GitHub actions workflow with 2...,4449679.0,226526.0,...,4449679.0,226526.0,2019-03-11 23:35:57,2022-08-13 11:27:44,How do I cache steps in GitHub actions?,<github><github-actions>,6.0,3,CC BY-SA 4.0,https://stackoverflow.com/q/55110729
4,56030316,1,,,2019-05-07 20:51:08,5,1750,"<p>So far, I was under the impression as per t...",3403196.0,3403196.0,...,3403196.0,3403196.0,2019-05-07 20:59:16,2019-08-23 12:15:33,does /github/home persist between github actions?,<github><github-actions>,2.0,1,CC BY-SA 4.0,https://stackoverflow.com/q/56030316
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2852,77564875,1,,,2023-11-28 14:55:03,0,45,<p>I have a private repo which has all GitHub ...,14900339.0,3001761.0,...,14900339.0,3001761.0,2023-11-28 16:29:42,2023-11-28 16:29:42,Unable to call private repo in GitHub Actions,<github><github-actions>,1.0,0,CC BY-SA 4.0,https://stackoverflow.com/q/77564875
2853,77566308,1,,,2023-11-28 18:27:14,1,143,"<p>I have a GitHub repository where I build, p...",19488382.0,2123530.0,...,19488382.0,2123530.0,2023-11-29 22:34:55,2023-11-29 22:34:55,Github Actions logs says .env: permission deni...,<git><docker><github><github-actions>,1.0,3,CC BY-SA 4.0,https://stackoverflow.com/q/77566308
2854,77572403,1,77585666.0,,2023-11-29 15:17:24,1,85,<p>I have a private npm package inside an orga...,13203366.0,3001761.0,...,13203366.0,3001761.0,2023-11-29 18:04:03,2023-12-01 13:40:00,Unable to install a private npm package inside...,<node.js><github><npm><github-actions>,1.0,2,CC BY-SA 4.0,https://stackoverflow.com/q/77572403
2855,77585640,1,,,2023-12-01 13:35:34,0,44,<p>I have a git repository where some JUnit-Te...,519334.0,,...,519334.0,,,2023-12-01 13:35:34,Github Actions JUnittest Failing: How to get t...,<java><gradle><junit><github-actions><error-lo...,0.0,2,CC BY-SA 4.0,https://stackoverflow.com/q/77585640
