# Gendered Language Annotation in brat
Annotation workflow:
1. Load metadata descriptions for annotation into [brat](https://brat.nlplab.org/) [completed]
2. Specify annotation schema in brat
3. Set up multiple annotator accounts [4 users as of Nov. 27 2020]
4. Annotate descriptions
5. Export annotations
6. Evaluate inter-annotator agreement (IAA)

### Reviewing Annotations
Import libraries:

In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
import string
import csv
import re

### Developing Workflow
**Step 1:** Load my dataset after annotating in brat

In [3]:
filename = "training1-0"
txt = ".txt"
ann = ".ann"
# Read the file as UTF-8 (assumed encoding) so characters such as "£" display correctly
pilot_txt = open('pilots/bratTxts/'+filename+txt, 'r', encoding='utf-8')
print(pilot_txt.read())

ID:Coll-1149
Collection, Sub-collection, and Item IDs
'Coll-1149'

Scope and Contents
'This key research resource is an important survival, being a manuscript account book detailing transactions - debits and credits - relating to the lead-ore company at Leadhills, operated by Sir John Hope of Craighall. Many important people are mentioned in this book, including Alexander Hope of London, Archibald Hope of Craighall, the Earl of Wigtown, the Duke of Hamilton, the Lord of Inglestone, Charles Erskine of Alba, Alexander Tait, Lady Marie Keith, the Earl of Crawford, Lord Mordington, Lord Cardcross, and Alexander Ross. The amounts involved are huge, with the account of revenues in hand running to over £70,000 towards the end of the period. The manuscript volume itself is composed of a short alphabetic table of names, then from folio 1, accounts dating from 1 August 1662, Edinburgh, to 7 September 1671, Edinburgh, at folio 221. Towards the rear of the volume are another set of accounts and r

In [5]:
pilot_ann = open('pilots/bratAnnots/'+filename+ann, 'r')
print(pilot_ann.read())

T2	Gendered-Role 445 449	Duke
T3	Gendered-Role 467 471	Lord
T5	Gendered-Role 568 572	Lord
T6	Gendered-Role 585 589	Lord
T8	Gendered-Role 1099 1103	Lord
T9	Gendered-Role 1162 1165	Sir
T12	Gendered-Pronoun 1207 1209	He
T14	Gendered-Role 1400 1404	King
T15	Gendered-Role 1529 1536	brother
T16	Gendered-Pronoun 1525 1528	his
T17	Gendered-Role 1537 1540	Sir
T13	Gendered-Role 2025 2028	Sir
T1	Gendered-Role 277 280	Sir
T4	Gendered-Role 424 428	Earl
T18	Gendered-Role 528 532	Lady
T19	Gendered-Role 550 554	Earl
T20	Gendered-Role 1084 1087	Sir
T10	Gendered-Role 1155 1158	son
T11	Omission 1148 1191	eldest son of Sir Thomas Hope of Craighall,
T21	Gendered-Role 1198 1205	baronet
T22	Gendered-Pronoun 1298 1300	he
T23	Gendered-Role 1689 1692	Sir
T24	Gendered-Role 1614 1618	King
T25	Gendered-Role 1741 1745	King
T7	Gendered-Role 1835 1838	Sir
T26	Gendered-Role 2122 2125	Sir
T27	Gendered-Role 2137 2141	Lord
T28	Gendered-Role 2201 2204	Sir
T29	Gendered-Role 2219 2222	Sir
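
Each line of the output above is one text-bound annotation in brat's standoff format: an ID, a tab, the label with its start and end character offsets, another tab, and the annotated text. As a minimal sketch, the hypothetical helper `parseAnnLine` below (not part of the notebook) turns one such line into typed fields. It assumes a single contiguous span per annotation and returns `None` for any line that does not fit that shape or whose ID does not begin with `T`.

```python
from typing import NamedTuple, Optional

class TextBound(NamedTuple):
    entity: str
    label: str
    start: int
    end: int
    text: str

def parseAnnLine(line: str) -> Optional[TextBound]:
    """Parse one tab-separated line of a brat .ann file (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    # Only text-bound annotations (IDs starting with "T") are handled here
    if len(fields) != 3 or not fields[0].startswith("T"):
        return None
    entity, label_and_offsets, text = fields
    # Assumes one contiguous span, i.e. exactly "label start end"
    parts = label_and_offsets.split(" ")
    if len(parts) != 3:
        return None
    label, start, end = parts
    return TextBound(entity, label, int(start), int(end), text)

record = parseAnnLine("T2\tGendered-Role 445 449\tDuke")
print(record)
```

Because the end offset points one past the last character, `record.end - record.start` equals `len(record.text)` for this line.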



In [7]:
df_ann = pd.read_csv("pilots/bratAnnots/"+filename+ann, sep="\t", names=["entity","label-start-end","text"])
df_ann['file'] = [filename for x in range(df_ann.shape[0])]

In [8]:
label_and_offsets = list(df_ann["label-start-end"])
list_of_landos = []
for row in label_and_offsets:
    row_list = row.split(" ")
    list_of_landos += [row_list]
# print(list_of_landos)
label = []
start = []
end = []
for l in list_of_landos:
    label += [l[0]]
    start += [l[1]]
    end += [l[2]]
# print(label)
# print(start)

In [9]:
df_ann['label'] = label
df_ann['start'] = start
df_ann['end'] = end
df_ann['annotator'] = ['Lucy' for x in range(df_ann.shape[0])]
df_ann

Unnamed: 0,entity,label-start-end,text,file,label,start,end,annotator
0,T2,Gendered-Role 445 449,Duke,training1-0,Gendered-Role,445,449,Lucy
1,T3,Gendered-Role 467 471,Lord,training1-0,Gendered-Role,467,471,Lucy
2,T5,Gendered-Role 568 572,Lord,training1-0,Gendered-Role,568,572,Lucy
3,T6,Gendered-Role 585 589,Lord,training1-0,Gendered-Role,585,589,Lucy
4,T8,Gendered-Role 1099 1103,Lord,training1-0,Gendered-Role,1099,1103,Lucy
5,T9,Gendered-Role 1162 1165,Sir,training1-0,Gendered-Role,1162,1165,Lucy
6,T12,Gendered-Pronoun 1207 1209,He,training1-0,Gendered-Pronoun,1207,1209,Lucy
7,T14,Gendered-Role 1400 1404,King,training1-0,Gendered-Role,1400,1404,Lucy
8,T15,Gendered-Role 1529 1536,brother,training1-0,Gendered-Role,1529,1536,Lucy
9,T16,Gendered-Pronoun 1525 1528,his,training1-0,Gendered-Pronoun,1525,1528,Lucy
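
The `label`, `start`, and `end` columns built in the cells above can also be derived directly from the `label-start-end` column with pandas string methods. The following is a sketch on a small demonstration DataFrame (`df_demo`, a hypothetical name) that copies three rows from the output above. It assumes every row holds exactly one label followed by two offsets.

```python
import pandas as pd

# Three rows copied from the annotation output above
df_demo = pd.DataFrame({
    "entity": ["T2", "T3", "T12"],
    "label-start-end": [
        "Gendered-Role 445 449",
        "Gendered-Role 467 471",
        "Gendered-Pronoun 1207 1209",
    ],
    "text": ["Duke", "Lord", "He"],
})

# Split on spaces into three new columns in one step
df_demo[["label", "start", "end"]] = df_demo["label-start-end"].str.split(" ", expand=True)
# Store offsets as integers so they compare numerically rather than as text
df_demo[["start", "end"]] = df_demo[["start", "end"]].astype(int)
print(df_demo[["entity", "label", "start", "end"]])
```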


**Step 2:** Load Melissa's dataset after annotating in brat

In [10]:
filename = "training1-0"
txt = ".txt"
ann = ".ann"
mel_txt = open('pilots/melissaAnnots/'+filename+txt, 'r')
df_mann = pd.read_csv("pilots/melissaAnnots/"+filename+ann, sep="\t", names=["entity","label-start-end","text"])
df_mann['file'] = [filename for x in range(df_mann.shape[0])]
label_and_offsets = list(df_mann["label-start-end"])
list_of_landos = []
for row in label_and_offsets:
    row_list = row.split(" ")
    list_of_landos += [row_list]
label = []
start = []
end = []
for l in list_of_landos:
    label += [l[0]]
    start += [l[1]]
    end += [l[2]]
df_mann['label'] = label
df_mann['start'] = start
df_mann['end'] = end
df_mann['annotator'] = ['Melissa' for x in range(df_mann.shape[0])]
df_mann

Unnamed: 0,entity,label-start-end,text,file,label,start,end,annotator
0,T1,Gendered-Role 277 280,Sir,training1-0,Gendered-Role,277,280,Melissa
1,T2,Gendered-Role 424 428,Earl,training1-0,Gendered-Role,424,428,Melissa
2,T3,Gendered-Role 445 449,Duke,training1-0,Gendered-Role,445,449,Melissa
3,T4,Gendered-Role 467 471,Lord,training1-0,Gendered-Role,467,471,Melissa
4,T5,Gendered-Role 528 532,Lady,training1-0,Gendered-Role,528,532,Melissa
5,T6,Gendered-Role 550 554,Earl,training1-0,Gendered-Role,550,554,Melissa
6,T7,Gendered-Role 568 572,Lord,training1-0,Gendered-Role,568,572,Melissa
7,T8,Gendered-Role 585 589,Lord,training1-0,Gendered-Role,585,589,Melissa
8,T9,Gendered-Role 1084 1087,Sir,training1-0,Gendered-Role,1084,1087,Melissa
9,T10,Gendered-Role 1155 1158,son,training1-0,Gendered-Role,1155,1158,Melissa


Measure inter-annotator agreement (IAA) to determine how similarly the two annotators labeled the text, considering annotations that:
 * Are exact matches (same start and end offsets for the same label type)
 * Are overlapping (the start or end of one annotator's span falls within another annotator's span of the same label type)
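
The two criteria above can be illustrated with a minimal sketch. It assumes brat's character offsets, where the end offset is one past the last character (as in `Duke 445 449`, four characters), and it represents each annotation as a `(label, start, end)` tuple. The function names `isExactMatch` and `isOverlap` are hypothetical and not part of the notebook.

```python
def isExactMatch(a, b):
    """True when label, start, and end all agree."""
    return a == b

def isOverlap(a, b):
    """True when two same-label spans share at least one character but are not identical."""
    (label_a, start_a, end_a), (label_b, start_b, end_b) = a, b
    return label_a == label_b and a != b and start_a < end_b and start_b < end_a

brother = ("Gendered-Role", 1529, 1536)      # taken from the annotations above
his_brother = ("Gendered-Role", 1525, 1536)  # hypothetical span from a second annotator
his = ("Gendered-Pronoun", 1525, 1528)       # taken from the annotations above, different label

print(isExactMatch(brother, brother))
print(isOverlap(brother, his_brother))
print(isOverlap(brother, his))
```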

In [11]:
lucy = list(df_ann["label-start-end"])
mel = list(df_mann["label-start-end"])

In [12]:
def longerShorterList(list1, list2):
    if len(list1) >= len(list2):
        return list1, list2
    else:
        return list2, list1        

In [44]:
def findExactMatches(ann1, ann2):
    long, short = longerShorterList(ann1, ann2)
    matched = []
    unmatched = []
    for label in long:
        if label in short:
            matched += [label]
        else:
            unmatched += [label]
    percentage_long = len(matched)/len(long)*100
    percentage_short = len(matched)/len(short)*100
    return matched, unmatched, percentage_long, percentage_short

In [46]:
exact_matches, unmatched, overlap_long, overlap_short = findExactMatches(lucy, mel)
print("Exact matches:\n", exact_matches)
print("\nPercentage of longer list:", str(overlap_long)+"%")
print("\nPercentage of shorter list:", str(overlap_short)+"%")
print("\nCheck for overlap:\n", unmatched)

Exact matches:
 ['Gendered-Role 445 449', 'Gendered-Role 467 471', 'Gendered-Role 568 572', 'Gendered-Role 585 589', 'Gendered-Role 1162 1165', 'Gendered-Pronoun 1207 1209', 'Gendered-Role 1400 1404', 'Gendered-Role 1537 1540', 'Gendered-Role 2025 2028', 'Gendered-Role 277 280', 'Gendered-Role 424 428', 'Gendered-Role 528 532', 'Gendered-Role 550 554', 'Gendered-Role 1084 1087', 'Gendered-Role 1155 1158', 'Gendered-Pronoun 1298 1300', 'Gendered-Role 1689 1692', 'Gendered-Role 1614 1618', 'Gendered-Role 1741 1745', 'Gendered-Role 1835 1838', 'Gendered-Role 2122 2125', 'Gendered-Role 2137 2141', 'Gendered-Role 2201 2204', 'Gendered-Role 2219 2222']

Percentage of longer list: 82.75862068965517%

Percentage of shorter list: 82.75862068965517%

Check for overlap:
 ['Gendered-Role 1099 1103', 'Gendered-Role 1529 1536', 'Gendered-Pronoun 1525 1528', 'Omission 1148 1191', 'Gendered-Role 1198 1205']
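
In the output above, 24 of Lucy's 29 annotations have an exact counterpart, which gives 24 / 29 ≈ 82.76%. `findExactMatches` reports unmatched annotations from the longer list only. As a sketch of one way to see both sides, the hypothetical `exactMatchSummary` below uses set operations. It assumes both lists are non-empty and contain no duplicate entries, and the second toy list is made up for illustration.

```python
def exactMatchSummary(ann1, ann2):
    """Compare two lists of 'label start end' strings in both directions."""
    set1, set2 = set(ann1), set(ann2)
    shared = set1 & set2
    return {
        "matched": sorted(shared),
        "only_in_first": sorted(set1 - set2),
        "only_in_second": sorted(set2 - set1),
        "percent_of_first": len(shared) / len(set1) * 100,
        "percent_of_second": len(shared) / len(set2) * 100,
    }

# Toy lists in the same "label start end" format (the second list is made up)
first = ["Gendered-Role 445 449", "Gendered-Role 467 471", "Gendered-Role 1099 1103"]
second = ["Gendered-Role 445 449", "Gendered-Role 467 471",
          "Gendered-Role 1084 1087", "Gendered-Pronoun 1207 1209"]

summary = exactMatchSummary(first, second)
print(summary["only_in_first"])
print(summary["only_in_second"])
print(summary["percent_of_first"], summary["percent_of_second"])
```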


In [28]:
def splitAnnotation(annotation_list):
    list_of_landos = []
    for item in annotation_list:
        item_list = item.split(" ")
        list_of_landos += [item_list]
    label = []
    start = []
    end = []
    for l in list_of_landos:
        label += [l[0]]
        start += [l[1]]
        end += [l[2]]
    return label, start, end

In [38]:
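def getLabelOffsets(label_type, df):
    rows = df[df.label == label_type]
    # Offsets parsed from the .ann file are strings; convert them to integers
    # so that later comparisons are numeric rather than lexicographic
    start = [int(s) for s in rows.start]
    end = [int(e) for e in rows.end]
    return dict(zip(start, end))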

In [39]:
labels = ["Person-Name", "Woman", "Man", "Nonbinary", "Unknown", 
          "Linguistic", "Gendered-Role", "Gendered-Pronoun",  "Generalization", 
          "Contextual", "Occupation", "Stereotype", "Omission", "Empowering"]

# Index into `labels` (the schema list above), not `label` (the per-row column values)
person = getLabelOffsets(labels[0], df_ann)
woman = getLabelOffsets(labels[1], df_ann)
man = getLabelOffsets(labels[2], df_ann)
nonbinary = getLabelOffsets(labels[3], df_ann)
unknown = getLabelOffsets(labels[4], df_ann)
linguistic = getLabelOffsets(labels[5], df_ann)
role = getLabelOffsets(labels[6], df_ann)
pronoun = getLabelOffsets(labels[7], df_ann)
gen = getLabelOffsets(labels[8], df_ann)
cont = getLabelOffsets(labels[9], df_ann)
occ  = getLabelOffsets(labels[10], df_ann)
ster = getLabelOffsets(labels[11], df_ann)
omis = getLabelOffsets(labels[12], df_ann)
emp = getLabelOffsets(labels[13], df_ann)

In [41]:
# person

In [42]:
# Pass Melissa's DataFrame directly instead of overwriting df_ann
person_m = getLabelOffsets(labels[0], df_mann)
woman_m = getLabelOffsets(labels[1], df_mann)
man_m = getLabelOffsets(labels[2], df_mann)
nonbinary_m = getLabelOffsets(labels[3], df_mann)
unknown_m = getLabelOffsets(labels[4], df_mann)
linguistic_m = getLabelOffsets(labels[5], df_mann)
role_m = getLabelOffsets(labels[6], df_mann)
pronoun_m = getLabelOffsets(labels[7], df_mann)
gen_m = getLabelOffsets(labels[8], df_mann)
cont_m = getLabelOffsets(labels[9], df_mann)
occ_m  = getLabelOffsets(labels[10], df_mann)
ster_m = getLabelOffsets(labels[11], df_mann)
omis_m = getLabelOffsets(labels[12], df_mann)
emp_m = getLabelOffsets(labels[13], df_mann)

In [59]:
def findMatches(label_offsets1, label_offsets2):
    # Each argument maps start offset -> end offset for one annotator's spans
    # of a single label type. brat end offsets are exclusive, so spans that
    # only touch (end == start_m or start == end_m) are not counted as overlapping.
    left_overlap_keys = []
    right_overlap_keys = []
    exact_match_keys = []
    enclosure_keys = []
    containment_keys = []

    for start, end in label_offsets1.items():
        for start_m, end_m in label_offsets2.items():
            # Check for left overlap (span 1 starts first or together, ends inside span 2)
            if start <= start_m and start_m < end < end_m:
                left_overlap_keys += [start]
            # Check for right overlap (span 1 starts inside span 2, ends with or after it)
            elif start_m < start < end_m and end >= end_m:
                right_overlap_keys += [start]
            # Check for exact match
            elif start == start_m and end == end_m:
                exact_match_keys += [start]
            # Check for enclosure (span 1 covers span 2, including a shared boundary)
            elif start <= start_m and end >= end_m:
                enclosure_keys += [start]
            # Check for containment (span 1 lies strictly inside span 2)
            elif start > start_m and end < end_m:
                containment_keys += [start]
    
    return left_overlap_keys, right_overlap_keys, exact_match_keys, enclosure_keys, containment_keys

In [60]:
left_overlap_keys, right_overlap_keys, exact_match_keys, enclosure_keys, containment_keys = findMatches(role, role_m)
print("Left:",len(left_overlap_keys))
print("Right:",len(right_overlap_keys))
print("Exact:",len(exact_match_keys))
print("Enclosed:",len(enclosure_keys))
print("Contained:",len(containment_keys))

Left: 0
Right: 0
Exact: 22
Enclosed: 0
Contained: 0
