### Combining the four data files from the four annotators

The combined file keeps the original feature columns and ends with the following columns:
content, label_1, label_2, label_3, label_4, label

Here label_1, label_2, label_3 and label_4 are the labels assigned by the individual annotators, whereas label is the final label chosen by majority voting.

In [21]:
import pandas as pd
import os

annotation_dir = '../data/Annotation_instances'
file_paths = {
    "label_1": "linkedIn_data_Kartik.csv",
    "label_2": "linkedIn_data_Muhammad.csv",
    "label_3": "linkedIn_data_Timothy.csv",
    "label_4": "linkedIn_data_Zhengyi.csv"
}
for key, fpath in file_paths.items():
    file_paths[key] = os.path.join(annotation_dir, fpath)

# Load each annotator's file, keyed by the column name its labels will get
dfs = {label: pd.read_csv(path, index_col=0)
       for label, path in file_paths.items()}
# The keys of file_paths are already label_1..label_4, so reuse them as column names
labels = {label: df["label"] for label, df in dfs.items()}

# Combine the non-label columns of all files so every annotated row appears once
frames = [df.drop(columns=["label"], errors="ignore") for df in dfs.values()]
final_df = frames[0]
for frame in frames[1:]:
    final_df = final_df.combine_first(frame)
# Attach each annotator's labels; rows an annotator did not label become NaN
for label_name, label_series in labels.items():
    final_df[label_name] = label_series

# defining the order of columns
final_columns = ['content'] + list(file_paths.keys())
column_order = [column for column in final_df if column not in final_columns]
column_order = column_order + final_columns
final_df = final_df[column_order]
df = final_df
df.head()

index,followers,connections,time_spent,content_links,media_type,num_hashtags,hashtag_followers,hashtags,reactions,comments,views,content,label_1,label_2,label_3,label_4
4,6484.0,500+,2 months ago,[['https://www.linkedin.com/in/ACoAABhNxDUB9IX...,article,3,0,"[['#verifiedresumes', 'https://www.linkedin.co...",22,2,,I count myself fortunate to have spent time wi...,4.0,6.0,6.0,
23,6484.0,500+,10 months ago,"[['https://lnkd.in/exKRtb6', 'https://lnkd.in/...",image,0,0,[],22,1,,No-one can be sure how America will ‘snap back...,,5.0,4.0,6.0
28,6484.0,500+,11 months ago,"[['https://lnkd.in/evGsZSH', 'https://lnkd.in/...",article,5,0,"[['#apprenticeships', 'https://www.linkedin.co...",10,0,,We've known since the Great Depression that si...,6.0,6.0,5.0,
37,6484.0,500+,1 year ago,[['https://www.linkedin.com/feed/hashtag/?keyw...,video,1,0,"[['#apprenticeship', 'https://www.linkedin.com...",31,4,,Great to talk with Fox Business today on why c...,,5.0,5.0,6.0
51,6484.0,500+,2 years ago,[['https://www.linkedin.com/feed/hashtag/?keyw...,article,1,0,"[['#apprenticeship', 'https://www.linkedin.com...",27,1,,Where can an #apprenticeship take you ? Grea...,6.0,6.0,6.0,
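
To make the combining step easier to follow, here is a minimal sketch of what `combine_first` and index-aligned column assignment do on two small frames. The frames `a` and `b` and their contents are hypothetical stand-ins for two annotators' files, not the real data.

```python
import pandas as pd

# Hypothetical toy frames: each annotator saw a different, overlapping set of posts.
a = pd.DataFrame({"content": ["post 4", "post 28"], "label": [4.0, 6.0]}, index=[4, 28])
b = pd.DataFrame({"content": ["post 4", "post 23"], "label": [6.0, 5.0]}, index=[4, 23])

# Union of the feature columns: values from `a` win, rows found only in `b` are added.
features = a.drop(columns=["label"]).combine_first(b.drop(columns=["label"]))

# Assigning a Series aligns on the index, so posts an annotator skipped become NaN.
features["label_1"] = a["label"]
features["label_2"] = b["label"]
print(features)
```

This is why each row in the output above has a NaN in at least one label column: it appears that not every annotator labeled every post.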


### Getting the most frequent label by majority voting


In [17]:
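def most_frequent_label(row):
    # Most frequent non-null label in the row; None if no annotator labeled it.
    # mode() returns tied values in ascending order, so a tie resolves to the
    # smallest label number.
    votes = row.dropna()
    return votes.mode().iloc[0] if not votes.empty else None

label_columns = ["label_1", "label_2", "label_3", "label_4"]
df["label"] = df[label_columns].apply(most_frequent_label, axis=1)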
df.head()

index,followers,connections,time_spent,content_links,media_type,num_hashtags,hashtag_followers,hashtags,reactions,comments,views,content,label_1,label_2,label_3,label_4,label
4,6484.0,500+,2 months ago,[['https://www.linkedin.com/in/ACoAABhNxDUB9IX...,article,3,0,"[['#verifiedresumes', 'https://www.linkedin.co...",22,2,,I count myself fortunate to have spent time wi...,4.0,6.0,6.0,,6.0
23,6484.0,500+,10 months ago,"[['https://lnkd.in/exKRtb6', 'https://lnkd.in/...",image,0,0,[],22,1,,No-one can be sure how America will ‘snap back...,,5.0,4.0,6.0,4.0
28,6484.0,500+,11 months ago,"[['https://lnkd.in/evGsZSH', 'https://lnkd.in/...",article,5,0,"[['#apprenticeships', 'https://www.linkedin.co...",10,0,,We've known since the Great Depression that si...,6.0,6.0,5.0,,6.0
37,6484.0,500+,1 year ago,[['https://www.linkedin.com/feed/hashtag/?keyw...,video,1,0,"[['#apprenticeship', 'https://www.linkedin.com...",31,4,,Great to talk with Fox Business today on why c...,,5.0,5.0,6.0,5.0
51,6484.0,500+,2 years ago,[['https://www.linkedin.com/feed/hashtag/?keyw...,article,1,0,"[['#apprenticeship', 'https://www.linkedin.com...",27,1,,Where can an #apprenticeship take you ? Grea...,6.0,6.0,6.0,,6.0
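
Row 23 above has three different labels (5.0, 4.0, 6.0), so there is no majority and the result 4.0 is simply the smallest of the tied values. The sketch below flags such rows; the `votes` frame copies the labels of rows 4 and 23 from the output, while `count_top_labels` and `tied` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Labels of rows 4 and 23 from the output above.
votes = pd.DataFrame(
    {
        "label_1": [4.0, None],
        "label_2": [6.0, 5.0],
        "label_3": [6.0, 4.0],
        "label_4": [None, 6.0],
    },
    index=[4, 23],
)

def count_top_labels(row):
    """Number of distinct labels that share the highest vote count in a row."""
    return len(row.dropna().mode())

# More than one top label means the vote is tied.
tied = votes.apply(count_top_labels, axis=1) > 1
print(tied)
```

Whether tied rows should be dropped, re-annotated, or kept with the smallest label is a design decision the notebook does not state, so the flag is only a starting point.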


### Save the file for inter-annotator agreement analysis

In [18]:
df.to_csv('../data/Annotation_instances/linkedin_combined_annotation.csv')
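
As a rough idea of what an agreement analysis on this file could look like, the sketch below computes raw pairwise agreement over the posts that two annotators both labeled. The notebook does not say which statistic the actual analysis uses, so this is an assumption; raw agreement is not corrected for chance the way Cohen's kappa is. The `ratings` frame copies the five rows shown earlier, and `pairwise` is a hypothetical name.

```python
from itertools import combinations

import pandas as pd

# Labels of the five rows displayed earlier in the notebook.
ratings = pd.DataFrame(
    {
        "label_1": [4.0, None, 6.0, None, 6.0],
        "label_2": [6.0, 5.0, 6.0, 5.0, 6.0],
        "label_3": [6.0, 4.0, 5.0, 5.0, 6.0],
        "label_4": [None, 6.0, None, 6.0, None],
    }
)

rows = []
for a, b in combinations(ratings.columns, 2):
    both = ratings[[a, b]].dropna()  # posts labeled by both annotators
    # Share of shared posts with identical labels; NaN when there is no overlap.
    agreement = (both[a] == both[b]).mean() if len(both) else float("nan")
    rows.append({"pair": f"{a} vs {b}", "n_shared": len(both), "agreement": agreement})

pairwise = pd.DataFrame(rows)
print(pairwise)
```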

### Mapping label numbers to titles

In [19]:
label_mapping = {
    1.0: 'Professional Growth',
    2.0: 'Events',
    3.0: 'Interactive Promotions',
    4.0: 'Educational Resources',
    5.0: 'Trends',
    6.0: 'Others'
}

df['label'] = df['label'].map(label_mapping)

df.head()

index,followers,connections,time_spent,content_links,media_type,num_hashtags,hashtag_followers,hashtags,reactions,comments,views,content,label_1,label_2,label_3,label_4,label
4,6484.0,500+,2 months ago,[['https://www.linkedin.com/in/ACoAABhNxDUB9IX...,article,3,0,"[['#verifiedresumes', 'https://www.linkedin.co...",22,2,,I count myself fortunate to have spent time wi...,4.0,6.0,6.0,,Others
23,6484.0,500+,10 months ago,"[['https://lnkd.in/exKRtb6', 'https://lnkd.in/...",image,0,0,[],22,1,,No-one can be sure how America will ‘snap back...,,5.0,4.0,6.0,Educational Resources
28,6484.0,500+,11 months ago,"[['https://lnkd.in/evGsZSH', 'https://lnkd.in/...",article,5,0,"[['#apprenticeships', 'https://www.linkedin.co...",10,0,,We've known since the Great Depression that si...,6.0,6.0,5.0,,Others
37,6484.0,500+,1 year ago,[['https://www.linkedin.com/feed/hashtag/?keyw...,video,1,0,"[['#apprenticeship', 'https://www.linkedin.com...",31,4,,Great to talk with Fox Business today on why c...,,5.0,5.0,6.0,Trends
51,6484.0,500+,2 years ago,[['https://www.linkedin.com/feed/hashtag/?keyw...,article,1,0,"[['#apprenticeship', 'https://www.linkedin.com...",27,1,,Where can an #apprenticeship take you ? Grea...,6.0,6.0,6.0,,Others
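
`Series.map` with a dict silently turns any value missing from the dict into NaN, so a mistyped label number would disappear without an error. A minimal sketch of a check for that, using a made-up `labels` series (the value 7.0 is a deliberately invalid example, not something observed in the data):

```python
import pandas as pd

label_mapping = {
    1.0: 'Professional Growth',
    2.0: 'Events',
    3.0: 'Interactive Promotions',
    4.0: 'Educational Resources',
    5.0: 'Trends',
    6.0: 'Others'
}

# Hypothetical labels: two valid values, one out-of-range value, one missing value.
labels = pd.Series([6.0, 4.0, 7.0, None], name="label")
mapped = labels.map(label_mapping)

# Present before mapping but missing afterwards: the value had no entry in the mapping.
unmapped = labels[labels.notna() & mapped.isna()]
print(unmapped)
```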


### Save the final annotation file

In [20]:
df = df.drop(columns=[col for col in df.columns if col.startswith("label_")])
df.to_csv('../data/annotated_data.csv')