# Annotation Setup
Generates a sample of 100 emails containing `crime` and creates the initial spreadsheet for each annotator. Each annotator is assigned a different subset of the sample so that we can calculate multiple inter-annotator agreement metrics: every email is labeled by one first annotator (Mark or Katie) and one second annotator (Matt or Serah), and each of those four pairs overlaps on 25 emails. Also generates 10 random keyword-matching emails not included in the 100-email sample as a 'practice round'.

In [1]:
import pandas as pd

In [2]:
corpus = pd.read_csv('~/../data/princeton_emails/corpus_v1.0.csv', index_col=0, usecols=['body_text', 'uid_email']).reset_index()

In [3]:
corpus.head(10)

Unnamed: 0,body_text,uid_email
0,Thanks for joining the team! My name is Kathle...,7182e4e604717330ecaf2699be61b200
1,We’re just 5 days away from our June 30th FEC ...,00768081c0a2487180314475ed1121d1
2,Thanks for joining the team! My name is Kathle...,54f56022dcd037ccb583f65a5668a073
3,Today we remember and honor the legacy of Dr. ...,ff3fc8ba9b209b771a73ef831a4117b5
4,"Here’s the truth,summer is the most difficult ...",36e237928f238bf5fab8d5a5462d9a04
5,This has been a whirlwind of a year for Kathle...,a7536957771135b29d8bc52057c7a461
6,"Friends, we’re not on track to meet our end-of...",94a9d5119f4bde190b07e0d43efed0b6
7,"Yesterday, we launched a brand new ad -- focus...",907edec244529241380876b5ec3ab49f
8,We’re the closest we’ve ever been to reaching ...,81046454397b4299893ff7f66049ff94
9,The hours we have left to reach every voter be...,52d40b84bfd96f8386c3b70fd387d143


In [4]:
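keyword = "crime"
keyword_emails = []

# Collect every row whose body contains the keyword (case-sensitive substring match).
for _, row in corpus.iterrows():
    # str() guards against missing (NaN) bodies, which are not strings.
    if keyword in str(row['body_text']):
        keyword_emails.append(row)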
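
The row-by-row loop above can also be written as a vectorized filter. The following is a minimal sketch on a small, made-up frame (`toy` and its contents are hypothetical, not rows from the corpus), assuming the same case-sensitive substring match as the loop:

```python
import pandas as pd

# Hypothetical stand-in for the corpus; only the column names match the notebook.
toy = pd.DataFrame({
    "body_text": ["Crime is up", "tough on crime", None, "donate today"],
    "uid_email": ["a", "b", "c", "d"],
})

keyword = "crime"

# fillna("") plays the role of str() in the loop: missing bodies never match.
# regex=False makes this a plain substring test, like the `in` operator.
mask = toy["body_text"].fillna("").str.contains(keyword, regex=False)
matches = toy[mask]

print(matches["uid_email"].to_list())
```

Because the match is case-sensitive, a body that only contains "Crime" is not selected; passing `case=False` to `str.contains` would change that, which may or may not be what the sample was meant to capture.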

In [5]:
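# Build one DataFrame from the list of matching rows (each Series becomes a row).
full_subset = pd.DataFrame(keyword_emails)
# No random_state is passed, so every run draws a different 100-email sample.
labeling_set = full_subset.sample(100, ignore_index=True)
labeling_set.to_csv('email_labeling/labeling_set.csv')

# Draw the 10 practice emails from keyword matches that are not in the labeling set.
chosen_ids = labeling_set['uid_email'].to_list()
remaining_subset = full_subset.loc[~full_subset['uid_email'].isin(chosen_ids)]
practice_set = remaining_subset.sample(10, ignore_index=True)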

In [6]:
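# Rows 0-49 go to Mark and rows 50-99 to Katie; Matt takes rows 0-24 and 75-99, Serah rows 25-74.
# Each (annotator1, annotator2) pair therefore shares exactly 25 emails.
labeling_set['annotator1'] = pd.Series(['Mark'] * 50 + ['Katie'] * 50)
labeling_set['annotator2'] = pd.Series(['Matt'] * 25 + ['Serah'] * 50 + ['Matt'] * 25)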
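
A quick way to confirm that this assignment gives every first/second annotator pair the same overlap is to cross-tabulate the two columns. The sketch below rebuilds only the assignment columns in a standalone frame (`assignment` is a hypothetical name), so it runs without the corpus:

```python
import pandas as pd

# Same assignment pattern as the cell above, without the email columns.
assignment = pd.DataFrame({
    "annotator1": ["Mark"] * 50 + ["Katie"] * 50,
    "annotator2": ["Matt"] * 25 + ["Serah"] * 50 + ["Matt"] * 25,
})

# Rows are first annotators, columns are second annotators;
# each cell counts the emails that pair labels in common.
overlap = pd.crosstab(assignment["annotator1"], assignment["annotator2"])
print(overlap)

# Total number of emails each person labels, across both roles.
load = pd.concat([assignment["annotator1"], assignment["annotator2"]]).value_counts()
print(load)
```

Every cell of `overlap` is 25 and every annotator labels 50 emails. Mark and Katie never share an email, and neither do Matt and Serah, so agreement can only be computed for the four cross pairs.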

In [7]:
mark = labeling_set.loc[labeling_set['annotator1'] == 'Mark']
katie = labeling_set.loc[labeling_set['annotator1'] == 'Katie']
matt = labeling_set.loc[labeling_set['annotator2'] == 'Matt']
serah = labeling_set.loc[labeling_set['annotator2'] == 'Serah']

In [8]:
mark = mark.drop(columns=['annotator1', 'annotator2'])
mark['A'] = ""
mark['B'] = ""
mark['C'] = ""
mark['notes'] = ""
mark.to_csv('email_labeling/mark.csv')

In [9]:
serah = serah.drop(columns=['annotator1', 'annotator2'])
serah['A'] = ""
serah['B'] = ""
serah['C'] = ""
serah['notes'] = ""
serah.to_csv('email_labeling/serah.csv')

In [10]:
matt = matt.drop(columns=['annotator1', 'annotator2'])
matt['A'] = ""
matt['B'] = ""
matt['C'] = ""
matt['notes'] = ""
matt.to_csv('email_labeling/matt.csv')

In [11]:
katie = katie.drop(columns=['annotator1', 'annotator2'])
katie['A'] = ""
katie['B'] = ""
katie['C'] = ""
katie['notes'] = ""
katie.to_csv('email_labeling/katie.csv')
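
The four cells above repeat the same steps for each annotator. One possible way to factor them is a small helper; `make_sheet` and `LABEL_COLUMNS` are hypothetical names, and the frame below is made-up data used only to show the shape of the result:

```python
import pandas as pd

# Empty columns each annotator fills in, in the order used by the notebook.
LABEL_COLUMNS = ["A", "B", "C", "notes"]


def make_sheet(frame):
    """Return a copy of `frame` without the assignment columns and with blank label columns."""
    sheet = frame.drop(columns=["annotator1", "annotator2"])
    for col in LABEL_COLUMNS:
        sheet[col] = ""
    return sheet


# Hypothetical four-email labeling set with the notebook's column names.
toy = pd.DataFrame({
    "body_text": ["text 1", "text 2", "text 3", "text 4"],
    "uid_email": ["u1", "u2", "u3", "u4"],
    "annotator1": ["Mark", "Mark", "Katie", "Katie"],
    "annotator2": ["Matt", "Serah", "Serah", "Matt"],
})

sheets = {}
for name in ["Mark", "Katie", "Matt", "Serah"]:
    # An annotator's rows are those where they appear in either role.
    rows = toy[(toy["annotator1"] == name) | (toy["annotator2"] == name)]
    sheets[name] = make_sheet(rows)
    # In the notebook this would be followed by something like
    # sheets[name].to_csv(f"email_labeling/{name.lower()}.csv")

print(sheets["Mark"])
```

`DataFrame.drop` returns a new frame, so adding the blank columns does not modify the original labeling set.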

In [12]:
practice_set['A'] = ""
practice_set['B'] = ""
practice_set['C'] = ""
practice_set['notes'] = ""
practice_set.to_csv('email_labeling/practice_set.csv')
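
Once the sheets are filled in, the 25 shared emails for a pair can be recovered by joining two sheets on `uid_email`. The sketch below computes Cohen's kappa, one common inter-annotator agreement metric, for a single label column. The label values (0/1), the toy sheets, and the `cohen_kappa` helper are all assumptions for illustration; the notebook does not say how columns `A`, `B`, and `C` are coded or which metrics were used.

```python
import pandas as pd


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    a = pd.Series(list(a))
    b = pd.Series(list(b))
    observed = (a == b).mean()  # share of items with identical labels
    pa = a.value_counts(normalize=True).to_dict()
    pb = b.value_counts(normalize=True).to_dict()
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(pa.get(k, 0.0) * pb.get(k, 0.0) for k in set(pa) | set(pb))
    if expected == 1.0:
        return float("nan")  # undefined when both annotators use one identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical completed sheets; only u1-u3 are shared by both annotators.
mark_done = pd.DataFrame({"uid_email": ["u1", "u2", "u3", "u4"], "A": [1, 0, 1, 0]})
matt_done = pd.DataFrame({"uid_email": ["u1", "u2", "u3", "u5"], "A": [1, 0, 0, 1]})

# An inner join keeps only the emails both annotators labeled.
shared = mark_done.merge(matt_done, on="uid_email", suffixes=("_mark", "_matt"))
kappa = cohen_kappa(shared["A_mark"], shared["A_matt"])
print(len(shared), round(kappa, 3))
```

In this toy case the pair agrees on 2 of 3 shared emails, chance agreement is 4/9, and kappa comes out to 0.4.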