# Merge Annotated Datasets for a Gold Standard, part 3

#### To continue reconciling the differences in the five annotatated archival metadata descriptions datasets to create one merged dataset:

  [4.](#4) Review overlapping annotations with the *same* label and add the longest annotation to the gold standard DataFrame, and then remove the others from the old DataFrame.  Be sure to record both annotators as contributing to gold standard in `annotator` column.

  [5.](#5) For any files annotator 1 didn't label, add all annotator 0's `Gendered-Role` labels.  For any files annotator 0 didn't label, add all annotator 2's `Gendered-Role` labels.

  [6.](#6) Add all the remaining `Generalization` labels to the gold standard DataFrame.  Change any text spans that end in -ess, -boy, -girl, or -man labeled as Gendered-Role to Generalization.

  [7.](#7) Add all the remaining Person-Name labels from annotators 0 and 2 to the gold standard DataFrame.
  
  [8.](#8) Add the `Omission` and `Stereotype` labels from annotators 0, 3, and 4 to the gold standard DataFrame.
  
  [9.](#9) Count the number (and calculate the proportion) of labels each annotator contributed to the gold standard DataFrame.
  
  [10.](#10) Create a copy of the gold standard, drop the "annotator" column, and then drop duplicate rows from the gold standard DataFrame.  This is the final gold standard on which to train and evaluate classifiers.
  
***

I. [Person-Name Overlaps](#p)

* [Unknown](#p-u)
* [Feminine](#p-f)
* [Masculine](#p-m)

II. [Linguistic Overlaps](#l)

* [Gendered-Role](#l-gr)
* [Generalization](#l-g)
* [Gendered-Pronoun](#l-gp)

III. [Contextual Overlaps](#c)

* [Stereotype](#c-s)
* [Omission](#c-om)
* [Occupation](#c-o)

IV. [Add to Gold](#g)
* [Matches](#g-m)
* [Overlaps](#g-o)
* [Remainder (steps 5-8)](#g-r)

In [1]:
import pandas as pd
import numpy as np
import string
import csv
import re
import os
from intervaltree import Interval, IntervalTree

Find overlapping annotations among annotators 0, 1, & 2, and among annotators 0, 3, & 4, one label at a time, one file at a time.  Add the longest annotation (the largest difference between offsets) from among the overlapping annotations to the gold standard.

**Functions:**

In [4]:
# Separate the input DataFrame's offsets column into 'start' offset and 'end' offset columns of type int
def splitOffsets(df):
    offsets = list(df.offsets)
    start, end = [], []
    for o in offsets:
        pair = o[1:-1]
        pair_list = pair.split(",")
        start += [pair_list[0]]
        end += [pair_list[1]]
    df["start"] = start
    df["end"] = end
    df = df.astype({"start":int, "end":int})
    return df

# Find the files both input annotators labeled
def findCommonFiles(df_a, df_b):
    common = []
    files_a = set(list(df_a.file))
    files_b = set(list(df_b.file))
    for f in files_a:
        if f in files_b:
            common += [f]
    return common

# Create an interval tree for one annotator for a specified file and specified label
def createIntervalTree(df, filename, labelname):
    subdf = df[df.file == filename]                                       # Get only rows for the input file
    subdf = subdf[subdf.label == labelname]                               # Get only rows for that file with the input label
    offsets = list(zip(list(subdf.start), list(subdf.end)))
    return IntervalTree.from_tuples(offsets)

# Find strict agreements (exact matches)
def findMatches(tree_exp, tree_pred):
    matches = []
    for annotation in tree_exp:
        if annotation in tree_pred:
            matches += [annotation]
    return matches

# Find annotations that overlap one another (including enveloping but excluding exactly matching annotations)
# Note: exp = A, pred = B
def findOverlaps(tree_exp, tree_pred):
    overlaps_pred, overlaps_exp = [], []
    for annotation in tree_exp:
        overlapping_intervals = tree_pred.overlap(annotation)
        if len(overlapping_intervals) > 0:
            for oi in overlapping_intervals:
                overlaps_pred += [oi]
    for annotation in tree_pred:
        overlapping_intervals = tree_exp.overlap(annotation)
        if len(overlapping_intervals) > 0:
            for oi in overlapping_intervals:
                overlaps_exp += [oi]
    return overlaps_exp, overlaps_pred

# Record the type of agreement in the DataFrame
def recordAgreements(df, labelname, filename, agreements, agreement_col_name, agreement_type):
    df_label = df.loc[df.label == labelname]
    df_file = df_label.loc[df_label.file == filename]
    for a in agreements:
        offset = "("+str(a.begin)+", "+str(a.end)+")"
        row = df_file.loc[df_file.offsets == offset].index
        df.at[row, agreement_col_name] = agreement_type
    return df

# Record exact matches between two annotators' DataFrames in the DF's agreement columns 
def getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files):
    for f in common_files:
        # Create interval trees of the offset data for each annotator for file f
        treeA = createIntervalTree(sub_annA, f, labelname)
        treeB = createIntervalTree(sub_annB, f, labelname)
        # Find exact matches between the annotators
        matches = findMatches(treeA,treeB)
        annA = recordAgreements(annA,labelname,f,matches,agreement_B,"Match")
        annB = recordAgreements(annB,labelname,f,matches,agreement_A,"Match")
    return annA, annB

# Record overlapping annotations between two annotators' DataFrames in the DF's agreement columns
def getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files):
    for f in common_files:
        # Create interval trees of the offset data for each annotator for file f
        treeA = createIntervalTree(sub_annA, f, labelname)
        treeB = createIntervalTree(sub_annB, f, labelname)
        # Find overlaps between the annotators
        overlapsA, overlapsB = findOverlaps(treeA,treeB)
        annA = recordAgreements(annA,labelname,f,overlapsA,agreement_B,"Overlap")
        annB = recordAgreements(annB,labelname,f,overlapsB,agreement_A,"Overlap")
    return annA, annB

<a id="p"></a>
### Person-Name Overlaps: 
* Annotators: 0, 2 (excluding 1's data because has too many inconsistencies)
* Labels: Unknown, Masculine, Feminine (Nonbinary not used)

In [44]:
# Load data
ann0 = pd.read_csv("labels0PL.csv")
# ann0.head()
ann2 = pd.read_csv("labels2.csv")
# ann2.head()

In [45]:
# Add a column to each annotator's DataFrame to record the type of agreement (or None if file not common between the annotators)
ann0 = splitOffsets(ann0)
ann0["agreement_2"] = [None]*ann0.shape[0]
# ann0.head()  # Looks good 
ann2 = splitOffsets(ann2)
ann2["agreement_0"] = [None]*ann2.shape[0]
# ann2.tail()  # Looks good

In [46]:
# Find files both annotators 0 and 2 labeled
common_files = findCommonFiles(ann0, ann2)
print(len(common_files))

170


<a id="p-u"></a>
#### UNKNOWN
Find agreements (exact matches and overlaps) between annotator 0 and annotator 2's 'Unknown' annotations:

In [47]:
labelname = "Unknown"

Find exact matches:

In [48]:
# Create subsets of the annotators' DataFrames that include only the files they BOTH labeled
sub_ann0 = ann0.loc[ann0.file.isin(common_files)]
print(sub_ann0.shape)
sub_ann2 = ann2.loc[ann2.file.isin(common_files)]
print(sub_ann2.shape)

(6790, 12)
(7025, 12)


In [49]:
annA = ann0
sub_annA = sub_ann0
agreement_B = "agreement_2"
annB = ann2
sub_annB = sub_ann2
agreement_A = "agreement_0"
ann0, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Exact matches in Annotator 0's DataFrame:", (ann0.loc[ann0.agreement_2 == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (ann2.loc[ann2.agreement_0 == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 2630
Exact matches in Annotator 2's DataFrame: 2630


In [56]:
# # Check by finding intersection of DataFrames - shouldn't be less than numbers of exact matches above
# ann0_multiindex = ann0.set_index(['file', 'offsets', 'text'], inplace=False)
# ann0_multiindex = ann0_multiindex.loc[ann0_multiindex.label == 'Unknown']
# ann0_multiindex.drop(columns=['id', 'entity', 'category'],inplace=True)
# ann2_multiindex = ann2.set_index(['file', 'offsets', 'text'], inplace=False)
# ann2_multiindex.drop(columns=['id', 'entity', 'category'],inplace=True)
# ann2_multiindex = ann2_multiindex.loc[ann2_multiindex.label == 'Unknown']
# ann2_multiindex.tail()
# intersection02 = ann0_multiindex.join(ann2_multiindex, on=['file', 'offsets', 'text'], how='inner', lsuffix='_0', rsuffix='_2')
# # intersection02.head(10)
# print(intersection02.shape)  # (2630, 12)
# intersection20 = ann2_multiindex.join(ann0_multiindex, on=['file', 'offsets', 'text'], how='inner', lsuffix='_2', rsuffix='_0')
# print(intersection20.shape)  # (2630, 12) - Looks good

(2630, 12)
(2630, 12)


Find overlaps:

In [50]:
annA = ann0
sub_ann0 = ann0.loc[ann0.file.isin(common_files)]        # Look only at files both annotators labeled
sub_annA = sub_ann0.loc[sub_ann0.agreement_2 != "Match"] # Filter out rows already recorded as having a "Match"
agreement_B = "agreement_2"
annB = ann2
sub_ann2 = ann2.loc[ann2.file.isin(common_files)]        # Look only at files both annotators labeled
sub_annB = sub_ann2.loc[sub_ann2.agreement_0 != "Match"] # Filter out rows already recorded as having a "Match"
agreement_A = "agreement_0"
ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Overlaps in Annotator 0's DataFrame:", ann0.loc[ann0.agreement_2 == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", ann2.loc[ann2.agreement_0 == "Overlap"].shape[0])
# ann0.loc[ann0.agreement_2 == "Overlap"] - looks good
# ann2.loc[ann2.agreement_0 == "Overlap"] - looks good

Overlaps in Annotator 0's DataFrame: 143
Overlaps in Annotator 2's DataFrame: 142


In [16]:
# # Check by finding overlaps before matches - should be same as or greater than the number of matches
# labelname = 'Unknown'
# ann0 = pd.read_csv("labels0PL.csv")
# ann2 = pd.read_csv("labels2.csv")
# ann0 = splitOffsets(ann0)
# ann0["agreement_2"] = [None]*ann0.shape[0]
# ann2 = splitOffsets(ann2)
# ann2["agreement_0"] = [None]*ann2.shape[0]

# common_files = findCommonFiles(ann0, ann2)
# sub_ann0 = ann0.loc[ann0.file.isin(common_files)]  # Look only at files both annotators labeled
# sub_ann2 = ann2.loc[ann2.file.isin(common_files)]  # Look only at files both annotators labeled

# annA = ann0
# sub_annA = sub_ann0
# agreement_B = "agreement_2"
# annB = ann2
# sub_annB = sub_ann2
# agreement_A = "agreement_0"
# ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

# print("Overlaps in Annotator 0's DataFrame:", ann0.loc[ann0.agreement_2 == "Overlap"].shape[0]) # 2630
# print("Overlaps in Annotator 2's DataFrame:", ann2.loc[ann2.agreement_0 == "Overlap"].shape[0]) # 2772 - looks good!

Overlaps in Annotator 0's DataFrame: 2630
Overlaps in Annotator 2's DataFrame: 2772


The various sums of matches and overlaps all seem to be making sense and align with our double checks! Let's write the data to CSVs for safe-keeping and then move on to the other labels used in the annotation process.

In [51]:
ann0.to_csv("Agreements/ann0_agreement2_Unknown.csv")
ann2.to_csv("Agreements/ann2_agreement0_Unknown.csv")

<a id="p-f"></a>
#### FEMININE
Find agreements (exact matches and overlaps) between annotator 0 and annotator 2's 'Feminine' annotations:

In [52]:
labelname = "Feminine"

In [53]:
# Find exact matches
sub_ann0 = ann0.loc[ann0.file.isin(common_files)]
sub_ann2 = ann2.loc[ann2.file.isin(common_files)]
annA = ann0
sub_annA = sub_ann0
agreement_B = "agreement_2"
annB = ann2
sub_annB = sub_ann2
agreement_A = "agreement_0"
ann0, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Exact matches in Annotator 0's DataFrame:", (ann0.loc[ann0.agreement_2 == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (ann2.loc[ann2.agreement_0 == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 3275
Exact matches in Annotator 2's DataFrame: 3275


In [54]:
# Find overlaps
annA = ann0
sub_ann0 = ann0.loc[ann0.file.isin(common_files)]        # Look only at files both annotators labeled
sub_annA = sub_ann0.loc[sub_ann0.agreement_2 != "Match"] # Filter out rows already recorded as having a "Match"
agreement_B = "agreement_2"
annB = ann2
sub_ann2 = ann2.loc[ann2.file.isin(common_files)]        # Look only at files both annotators labeled
sub_annB = sub_ann2.loc[sub_ann2.agreement_0 != "Match"] # Filter out rows already recorded as having a "Match"
agreement_A = "agreement_0"
ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Overlaps in Annotator 0's DataFrame:", ann0.loc[ann0.agreement_2 == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", ann2.loc[ann2.agreement_0 == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 215
Overlaps in Annotator 2's DataFrame: 217


In [55]:
ann0.to_csv("Agreements/ann0_agreement2_UnknownFeminine.csv")
ann2.to_csv("Agreements/ann2_agreement0_UnknownFeminine.csv")

<a id="p-m"></a>
#### MASCULINE
Find agreements (exact matches and overlaps) between annotator 0 and annotator 2's 'Masculine' annotations:

In [56]:
labelname = "Masculine"

In [57]:
# Find exact matches
sub_ann0 = ann0.loc[ann0.file.isin(common_files)]
sub_ann2 = ann2.loc[ann2.file.isin(common_files)]
annA = ann0
sub_annA = sub_ann0
agreement_B = "agreement_2"
annB = ann2
sub_annB = sub_ann2
agreement_A = "agreement_0"
ann0, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Exact matches in Annotator 0's DataFrame:", (ann0.loc[ann0.agreement_2 == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (ann2.loc[ann2.agreement_0 == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 3968
Exact matches in Annotator 2's DataFrame: 3968


In [58]:
# Find overlaps
annA = ann0
sub_ann0 = ann0.loc[ann0.file.isin(common_files)]        # Look only at files both annotators labeled
sub_annA = sub_ann0.loc[sub_ann0.agreement_2 != "Match"] # Filter out rows already recorded as having a "Match"
agreement_B = "agreement_2"
annB = ann2
sub_ann2 = ann2.loc[ann2.file.isin(common_files)]        # Look only at files both annotators labeled
sub_annB = sub_ann2.loc[sub_ann2.agreement_0 != "Match"] # Filter out rows already recorded as having a "Match"
agreement_A = "agreement_0"
ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Overlaps in Annotator 0's DataFrame:", ann0.loc[ann0.agreement_2 == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", ann2.loc[ann2.agreement_0 == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 250
Overlaps in Annotator 2's DataFrame: 251


In [59]:
ann0.to_csv("Agreements/ann0_agreement2_UnknownFeminineMasculine.csv")
ann2.to_csv("Agreements/ann2_agreement0_UnknownFeminineMasculine.csv")

<a id="l"></a>
### Linguistic Overlaps: 
* Annotators: 0, 1, 2
* Labels: Gendered-Role, Generalization, Gendered-Pronoun* 

**already added some to gold standard, so find matches and overlaps there too!*

In [64]:
# Load data
ann0 = pd.read_csv("labels0PL.csv")
ann1 = pd.read_csv("labels1.csv", index_col=0)
ann1.set_index(["file","offsets","text"],inplace=True)
ann1 = ann1.reset_index()
ann2 = pd.read_csv("labels2.csv")
# ann0.head()
ann1.head()  # looks good - organized like the other two DataFrames now
# ann2.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove
0,Coll-1326_00100.ann,"(1132, 1135)",his,0,T0,Gendered-Pronoun,Annotator 1,Linguistic,
1,Coll-1326_00100.ann,"(1142, 1144)",He,1,T1,Gendered-Pronoun,Annotator 1,Linguistic,
2,Coll-1326_00100.ann,"(1532, 1535)",his,2,T2,Gendered-Pronoun,Annotator 1,Linguistic,
3,Coll-1326_00100.ann,"(1548, 1550)",He,3,T3,Gendered-Pronoun,Annotator 1,Linguistic,
4,Coll-1326_00100.ann,"(48, 62)",Dr. Rutherford,4,T4,Unknown,Annotator 1,Person-Name,


In [65]:
# Split the offsets into two columns and add columns to each annotator's DataFrame to record agreements
ann0 = splitOffsets(ann0)
ann0["agreement_1"] = [None]*ann0.shape[0]
ann0["agreement_2"] = [None]*ann0.shape[0]
ann1 = splitOffsets(ann1)
ann1["agreement_0"] = [None]*ann1.shape[0]
ann1["agreement_2"] = [None]*ann1.shape[0]
ann2 = splitOffsets(ann2)
ann2["agreement_0"] = [None]*ann2.shape[0]
ann2["agreement_1"] = [None]*ann2.shape[0]
ann1.tail()  # Looks good

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end,agreement_0,agreement_2
16510,Coll-1434_11300.ann,"(4867, 4870)",man,16657,T22,Gendered-Role,Annotator 1,Linguistic,,4867,4870,,
16511,Coll-1434_09000.ann,"(2117, 2120)",His,16658,T0,Gendered-Pronoun,Annotator 1,Linguistic,,2117,2120,,
16512,Coll-1434_09000.ann,"(2117, 2176)","His Excellency, the Kushbegi, Vice Emir of Bok...",16659,T1,Masculine,Annotator 1,Person-Name,,2117,2176,,
16513,Coll-1434_09000.ann,"(2137, 2145)",Kushbegi,16660,T2,Gendered-Role,Annotator 1,Linguistic,,2137,2145,,
16514,Coll-1434_09000.ann,"(2147, 2156)",Vice Emir,16661,T3,Gendered-Role,Annotator 1,Linguistic,,2147,2156,,


In [66]:
common_files02 = findCommonFiles(ann0, ann2)  # Find files both annotators 0 and 2 labeled
common_files12 = findCommonFiles(ann1, ann2)  # Find files both annotators 1 and 2 labeled
common_files01 = findCommonFiles(ann0, ann1)  # Find files both annotators 0 and 1 labeled

<a id="l-gr"></a>
#### GENDERED-ROLE
Find agreements (exact matches and overlaps) between annotators 0, 1, and 2's 'Gendered-Role' annotations:

In [67]:
labelname = "Gendered-Role"

##### Annotators 0 & 2

In [71]:
# Find exact matches
common_files = common_files02
annA = ann0
annB = ann2
agreement_A = "agreement_0"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Exact matches in Annotator 0's DataFrame:", (ann0.loc[ann0.agreement_2 == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (ann2.loc[ann2.agreement_0 == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1350
Exact matches in Annotator 2's DataFrame: 1350


In [72]:
# Find overlaps
common_files = common_files02
annA = ann0
annB = ann2
agreement_A = "agreement_0"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann2
print("Overlaps in Annotator 0's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 46
Overlaps in Annotator 2's DataFrame: 46


##### Annotators 1 & 2

In [77]:
# Find exact matches
common_files = common_files12
annA = ann1
annB = ann2
agreement_A = "agreement_1"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann1, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann1
annB = ann2
print("Exact matches in Annotator 1's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 1's DataFrame: 414
Exact matches in Annotator 2's DataFrame: 414


In [78]:
# Find overlaps
common_files = common_files12
annA = ann1
annB = ann2
agreement_A = "agreement_1"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann1, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann1
annB = ann2
print("Overlaps in Annotator 1's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 1's DataFrame: 21
Overlaps in Annotator 2's DataFrame: 21


##### Annotators 0 & 1

In [79]:
# Find exact matches
common_files = common_files01
annA = ann0
annB = ann1
agreement_A = "agreement_0"
agreement_B = "agreement_1"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann1 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann1
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 1's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1734
Exact matches in Annotator 1's DataFrame: 1734


In [81]:
# Find overlaps
common_files = common_files01
annA = ann0
annB = ann1
agreement_A = "agreement_0"
agreement_B = "agreement_1"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann1 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann1
print("Overlaps in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Overlap"]).shape[0])
print("Overlaps in Annotator 1's DataFrame:", (annB.loc[annB[agreement_A] == "Overlap"]).shape[0])

Overlaps in Annotator 0's DataFrame: 62
Overlaps in Annotator 1's DataFrame: 62


In [87]:
# ann0.loc[ann0.label == "Gendered-Role"].head(20)
# Write the data to files for safe-keeping
ann0.to_csv("Agreements/ann0_agreement12_GenderedRole.csv")
ann1.to_csv("Agreements/ann1_agreement02_GenderedRole.csv")
ann2.to_csv("Agreements/ann2_agreement01_GenderedRole.csv")

<a id="l-g"></a>
#### GENERALIZATION
Find agreements (exact matches and overlaps) between annotators 0, 1, and 2's 'Generalization' annotations:

In [88]:
labelname = "Generalization"

##### Annotators 0 & 2

In [89]:
# Find exact matches
common_files = common_files02
annA = ann0
annB = ann2
agreement_A = "agreement_0"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Exact matches in Annotator 0's DataFrame:", (ann0.loc[ann0.agreement_2 == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (ann2.loc[ann2.agreement_0 == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1397
Exact matches in Annotator 2's DataFrame: 1397


In [90]:
# Find overlaps
common_files = common_files02
annA = ann0
annB = ann2
agreement_A = "agreement_0"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann2
print("Overlaps in Annotator 0's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 71
Overlaps in Annotator 2's DataFrame: 72


##### Annotators 1 & 2

In [91]:
# Find exact matches
common_files = common_files12
annA = ann1
annB = ann2
agreement_A = "agreement_1"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann1, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann1
annB = ann2
print("Exact matches in Annotator 1's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 1's DataFrame: 415
Exact matches in Annotator 2's DataFrame: 415


In [92]:
# Find overlaps
common_files = common_files12
annA = ann1
annB = ann2
agreement_A = "agreement_1"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann1, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann1
annB = ann2
print("Overlaps in Annotator 1's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 1's DataFrame: 21
Overlaps in Annotator 2's DataFrame: 21


##### Annotators 0 & 1

In [93]:
# Find exact matches
common_files = common_files01
annA = ann0
annB = ann1
agreement_A = "agreement_0"
agreement_B = "agreement_1"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann1 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann1
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 1's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1766
Exact matches in Annotator 1's DataFrame: 1766


In [94]:
# Find overlaps
common_files = common_files01
annA = ann0
annB = ann1
agreement_A = "agreement_0"
agreement_B = "agreement_1"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann1 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann1
print("Overlaps in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Overlap"]).shape[0])
print("Overlaps in Annotator 1's DataFrame:", (annB.loc[annB[agreement_A] == "Overlap"]).shape[0])

Overlaps in Annotator 0's DataFrame: 67
Overlaps in Annotator 1's DataFrame: 67


In [95]:
# Write the data to files for safe-keeping
ann0.to_csv("Agreements/ann0_agreement12_Generalization.csv")
ann1.to_csv("Agreements/ann1_agreement02_Generalization.csv")
ann2.to_csv("Agreements/ann2_agreement01_Generalization.csv")

<a id="l-gp"></a>
#### GENDERED-PRONOUN
Find agreements (exact matches and overlaps) between annotators 0, 1, and 2's 'Gendered-Pronoun' annotations:

In [96]:
labelname = "Gendered-Pronoun"

##### Annotators 0 & 2

In [97]:
# Find exact matches
common_files = common_files02
annA = ann0
annB = ann2
agreement_A = "agreement_0"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)

print("Exact matches in Annotator 0's DataFrame:", (ann0.loc[ann0.agreement_2 == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (ann2.loc[ann2.agreement_0 == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1918
Exact matches in Annotator 2's DataFrame: 1918


In [98]:
# Find overlaps
common_files = common_files02
annA = ann0
annB = ann2
agreement_A = "agreement_0"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann2
print("Overlaps in Annotator 0's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 71
Overlaps in Annotator 2's DataFrame: 72


##### Annotators 1 & 2

In [99]:
# Find exact matches
common_files = common_files12
annA = ann1
annB = ann2
agreement_A = "agreement_1"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann1, ann2 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann1
annB = ann2
print("Exact matches in Annotator 1's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 2's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 1's DataFrame: 933
Exact matches in Annotator 2's DataFrame: 933


In [100]:
# Find overlaps
common_files = common_files12
annA = ann1
annB = ann2
agreement_A = "agreement_1"
agreement_B = "agreement_2"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann1, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann1
annB = ann2
print("Overlaps in Annotator 1's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 2's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 1's DataFrame: 21
Overlaps in Annotator 2's DataFrame: 21


##### Annotators 0 & 1

In [101]:
# Find exact matches
common_files = common_files01
annA = ann0
annB = ann1
agreement_A = "agreement_0"
agreement_B = "agreement_1"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann1 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann1
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 1's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 5161
Exact matches in Annotator 1's DataFrame: 5161


In [102]:
# Find overlaps
common_files = common_files01
annA = ann0
annB = ann1
agreement_A = "agreement_0"
agreement_B = "agreement_1"

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann1 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann1
print("Overlaps in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Overlap"]).shape[0])
print("Overlaps in Annotator 1's DataFrame:", (annB.loc[annB[agreement_A] == "Overlap"]).shape[0])

Overlaps in Annotator 0's DataFrame: 68
Overlaps in Annotator 1's DataFrame: 68


In [103]:
# # Write the data to files for safe-keeping
ann0.to_csv("Agreements/ann0_agreement12_GenderedPronoun.csv")
ann1.to_csv("Agreements/ann1_agreement02_GenderedPronoun.csv")
ann2.to_csv("Agreements/ann2_agreement01_GenderedPronoun.csv")

<a id="c"></a>
### Contextual Overlaps: 
* Annotators: 0, 3, 4
* Labels: Stereotype, Omission, Occupation* 

**already added some to gold standard, so find matches and overlaps there too!*

In [110]:
# Load data
ann0 = pd.read_csv("labels0C.csv", index_col=0)
ann0.set_index(["file","offsets","text"],inplace=True)
ann0 = ann0.reset_index()
ann0["remove"] = [None]*ann0.shape[0]
ann3 = pd.read_csv("labels3.csv", index_col=0)
ann3.set_index(["file","offsets","text"],inplace=True)
ann3 = ann3.reset_index()
ann3["remove"] = [None]*ann3.shape[0]
ann4 = pd.read_csv("labels4.csv", index_col=0)
ann4.set_index(["file","offsets","text"],inplace=True)
ann4 = ann4.reset_index()
ann4["remove"] = [None]*ann4.shape[0]
# ann0.head()
# ann3.head()
ann4.head()   # Looks good - all matching columns and column order

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove
0,BAI_01200.ann,"(2381, 2397)",Duke of Montrose,22,T20,Omission,Annotator 4,Contextual,
1,BAI_01200.ann,"(2578, 2592)",women's groups,24,T22,Stereotype,Annotator 4,Contextual,
2,BAI_01200.ann,"(5450, 5476)",Fowler and Pearse families,27,T26,Omission,Annotator 4,Contextual,
3,Coll-1434_15200.ann,"(3822, 3836)",President Taft,36,T11,Omission,Annotator 4,Contextual,
4,Coll-1434_15200.ann,"(5109, 5123)",A group of men,44,T16,Stereotype,Annotator 4,Contextual,


In [111]:
# Split the offsets into two columns and add columns to each annotator's DataFrame to record agreements
ann0 = splitOffsets(ann0)
ann0["agreement_3"] = [None]*ann0.shape[0]
ann0["agreement_4"] = [None]*ann0.shape[0]
ann3 = splitOffsets(ann3)
ann3["agreement_0"] = [None]*ann3.shape[0]
ann3["agreement_4"] = [None]*ann3.shape[0]
ann4 = splitOffsets(ann4)
ann4["agreement_0"] = [None]*ann4.shape[0]
ann4["agreement_3"] = [None]*ann4.shape[0]
ann4.tail()  # Looks good

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end,agreement_0,agreement_3
2355,Coll-146_16400.ann,"(2134, 2139)",a boy,6586,T2,Stereotype,Annotator 4,Contextual,,2134,2139,,
2356,Coll-1490_00300.ann,"(1094, 1101)",friends,6588,T1,Omission,Annotator 4,Contextual,,1094,1101,,
2357,Coll-1490_00300.ann,"(1000, 1012)",Lady Jackson,6589,T2,Omission,Annotator 4,Contextual,,1000,1012,,
2358,Coll-1490_00300.ann,"(386, 393)",wedding,6590,T3,Omission,Annotator 4,Contextual,,386,393,,
2359,Coll-1490_00300.ann,"(7, 12)",Kitty,6591,T5,Omission,Annotator 4,Contextual,,7,12,,


In [112]:
common_files03 = findCommonFiles(ann0, ann3)  # Find files both annotators 0 and 3 labeled
common_files34 = findCommonFiles(ann3, ann4)  # Find files both annotators 3 and 4 labeled
common_files04 = findCommonFiles(ann0, ann4)  # Find files both annotators 0 and 4 labeled

<a id="c-s"></a>
#### STEREOTYPE
Find agreements (exact matches and overlaps) between annotators 0, 3, and 4's 'Stereotype' annotations:

In [113]:
labelname = "Stereotype"

##### Annotators 0 & 3

In [114]:
# Find exact matches
common_files = common_files03
annA = ann0
annB = ann3
agreement_A = "agreement_0"
agreement_B = "agreement_3"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann3 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann3
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 3's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 444
Exact matches in Annotator 3's DataFrame: 444


In [115]:
# Find overlaps
annA = ann0
annB = ann3

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann3
print("Overlaps in Annotator 0's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 3's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 60
Overlaps in Annotator 3's DataFrame: 55


##### Annotators 3 & 4

In [116]:
# Find exact matches
common_files = common_files34
annA = ann3
annB = ann4
agreement_A = "agreement_3"
agreement_B = "agreement_4"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann3, ann4 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann3
annB = ann4
print("Exact matches in Annotator 3's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 3's DataFrame: 4
Exact matches in Annotator 4's DataFrame: 4


In [117]:
# Find overlaps
annA = ann3
annB = ann4

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann3, ann4 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann3
annB = ann4
print("Overlaps in Annotator 3's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 4's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 3's DataFrame: 17
Overlaps in Annotator 4's DataFrame: 19


##### Annotators 0 & 4

In [118]:
# Find exact matches
common_files = common_files04
annA = ann0
annB = ann4
agreement_A = "agreement_0"
agreement_B = "agreement_4"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann4 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann4
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 43
Exact matches in Annotator 4's DataFrame: 43


In [119]:
# Find overlaps
annA = ann0
annB = ann4

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann4 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann4
print("Overlaps in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Overlap"]).shape[0])
print("Overlaps in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Overlap"]).shape[0])

Overlaps in Annotator 0's DataFrame: 117
Overlaps in Annotator 4's DataFrame: 97


In [120]:
# # Write the data to files for safe-keeping
ann0.to_csv("Agreements/ann0_agreement34_Stereotype.csv")
ann3.to_csv("Agreements/ann3_agreement04_Stereotype.csv")
ann4.to_csv("Agreements/ann4_agreement03_Stereotype.csv")

<a id="c-om"></a>
#### OMISSION
Find agreements (exact matches and overlaps) between annotators 0, 3, and 4's 'Omission' annotations:

In [121]:
labelname = "Omission"

##### Annotators 0 & 3

In [122]:
# Find exact matches
common_files = common_files03
annA = ann0
annB = ann3
agreement_A = "agreement_0"
agreement_B = "agreement_3"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann3 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann3
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 3's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1228
Exact matches in Annotator 3's DataFrame: 1228


In [123]:
# Find overlaps
annA = ann0
annB = ann3

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann3
print("Overlaps in Annotator 0's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 3's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 628
Overlaps in Annotator 3's DataFrame: 573


##### Annotators 3 & 4

In [124]:
# Find exact matches
common_files = common_files34
annA = ann3
annB = ann4
agreement_A = "agreement_3"
agreement_B = "agreement_4"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann3, ann4 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann3
annB = ann4
print("Exact matches in Annotator 3's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 3's DataFrame: 191
Exact matches in Annotator 4's DataFrame: 191


In [125]:
# Find overlaps
annA = ann3
annB = ann4

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann3, ann4 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann3
annB = ann4
print("Overlaps in Annotator 3's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 4's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 3's DataFrame: 39
Overlaps in Annotator 4's DataFrame: 44


##### Annotators 0 & 4

In [126]:
# Find exact matches
common_files = common_files04
annA = ann0
annB = ann4
agreement_A = "agreement_0"
agreement_B = "agreement_4"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann4 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann4
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 396
Exact matches in Annotator 4's DataFrame: 396


In [127]:
# Find overlaps
annA = ann0
annB = ann4

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann4 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann4
print("Overlaps in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Overlap"]).shape[0])
print("Overlaps in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Overlap"]).shape[0])

Overlaps in Annotator 0's DataFrame: 175
Overlaps in Annotator 4's DataFrame: 152


In [128]:
# # Write the data to files for safe-keeping
ann0.to_csv("Agreements/ann0_agreement34_Omission.csv")
ann3.to_csv("Agreements/ann3_agreement04_Omission.csv")
ann4.to_csv("Agreements/ann4_agreement03_Omission.csv")

<a id="c-o"></a>
#### OCCUPATION
Find agreements (exact matches and overlaps) between annotators 0, 3, and 4's 'Occupation' annotations:

In [129]:
labelname = "Occupation"

##### Annotators 0 & 3

In [130]:
# Find exact matches
common_files = common_files03
annA = ann0
annB = ann3
agreement_A = "agreement_0"
agreement_B = "agreement_3"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann3 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann3
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 3's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 1228
Exact matches in Annotator 3's DataFrame: 1228


In [131]:
# Find overlaps
annA = ann0
annB = ann3

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann2 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann3
print("Overlaps in Annotator 0's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 3's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 0's DataFrame: 628
Overlaps in Annotator 3's DataFrame: 573


##### Annotators 3 & 4

In [132]:
# Find exact matches
common_files = common_files34
annA = ann3
annB = ann4
agreement_A = "agreement_3"
agreement_B = "agreement_4"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann3, ann4 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann3
annB = ann4
print("Exact matches in Annotator 3's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 3's DataFrame: 191
Exact matches in Annotator 4's DataFrame: 191


In [133]:
# Find overlaps
annA = ann3
annB = ann4

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann3, ann4 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann3
annB = ann4
print("Overlaps in Annotator 3's DataFrame:", annA.loc[annA[agreement_B] == "Overlap"].shape[0])
print("Overlaps in Annotator 4's DataFrame:", annB.loc[annB[agreement_A] == "Overlap"].shape[0])

Overlaps in Annotator 3's DataFrame: 39
Overlaps in Annotator 4's DataFrame: 44


##### Annotators 0 & 4

In [134]:
# Find exact matches
common_files = common_files04
annA = ann0
annB = ann4
agreement_A = "agreement_0"
agreement_B = "agreement_4"

sub_annA = annA.loc[annA.file.isin(common_files)]  # Look only at files both annotators labeled
sub_annB = annB.loc[annB.file.isin(common_files)]  # Look only at files both annotators labeled

ann0, ann4 = getAllExactMatchesPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann4
print("Exact matches in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Match"]).shape[0])
print("Exact matches in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Match"]).shape[0])

Exact matches in Annotator 0's DataFrame: 396
Exact matches in Annotator 4's DataFrame: 396


In [135]:
# Find overlaps
annA = ann0
annB = ann4

sub_annA = annA.loc[annA.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annA = sub_annA.loc[sub_annA[agreement_B] != "Match"]  # Filter out rows already recorded as having a "Match"
sub_annB = annB.loc[annB.file.isin(common_files)]          # Look only at files both annotators labeled
sub_annB = sub_annB.loc[sub_annB[agreement_A] != "Match"]  # Filter out rows already recorded as having a "Match"

ann0, ann4 = getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files)
annA = ann0
annB = ann4
print("Overlaps in Annotator 0's DataFrame:", (annA.loc[annA[agreement_B] == "Overlap"]).shape[0])
print("Overlaps in Annotator 4's DataFrame:", (annB.loc[annB[agreement_A] == "Overlap"]).shape[0])

Overlaps in Annotator 0's DataFrame: 175
Overlaps in Annotator 4's DataFrame: 152


In [136]:
# # Write the data to files for safe-keeping
ann0.to_csv("Agreements/ann0_agreement34_Occupation.csv")
ann3.to_csv("Agreements/ann3_agreement04_Occupation.csv")
ann4.to_csv("Agreements/ann4_agreement03_Occupation.csv")

<a id="g"></a>
### Add to Gold

* Add one version of matching rows to gold standard
* Add longest row among rows of overlapping annotations to gold standard
* Pick one from each group of overlapping annotations in gold standard for Gendered Pronoun and Occupation

In [9]:
# # Clean up annotator column values
# gold = pd.read_csv("gold_standard.csv", dtype={"file":str, "offsets":str, "text":str, "id":str, "entity":str, "label":str,
#                                               "annotator":str, "category":str})
# gold.replace("Annotator 0", "0", inplace=True)
# gold.replace("Annotator 1", "1", inplace=True)
# gold.replace("Annotator 2", "2", inplace=True)
# gold.replace("Annotator 3", "3", inplace=True)
# gold.replace("Annotator 4", "4", inplace=True)
# gold.annotator.unique()
# gold.to_csv("gold_standard.csv")

In [4]:
gold = pd.read_csv("gold_standard.csv", index_col=0)
gold.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
0,Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
1,Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
2,Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
3,Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
4,BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


<a id="g-m"></a>
#### MATCHES

In [6]:
def useAnnotatorNumber(df):
    df.replace("Annotator 0", "0", inplace=True)
    df.replace("Annotator 1", "1", inplace=True)
    df.replace("Annotator 2", "2", inplace=True)
    df.replace("Annotator 3", "3", inplace=True)
    df.replace("Annotator 4", "4", inplace=True)
    return df

def addMatchedRows(filename, merged_df):
    # Load and format the DataFrames
    filepath = "Agreements/"+filename
    df = pd.read_csv(filepath,index_col=0)
    agmt_colA = df.columns[-1]
    agmt_colB = df.columns[-2]
    matchesA = df.loc[df[agmt_colA] == "Match"]
    if "agreement" in agmt_colB:
        matchesB = df.loc[df[agmt_colB] == "Match"]
        # Make sure the columns are the same and in the same order
        matchesB = matchesB[["file","offsets","text","id","entity","label","annotator","category"]]
    else:
        matchesB = pd.DataFrame()
    # Make sure the columns are the same and in the same order
    matchesA = matchesA[["file","offsets","text","id","entity","label","annotator","category"]]
    merged_df = merged_df[["file","offsets","text","id","entity","label","annotator","category"]]
    
    # Append the rows of matches to the merged DataFrame 
    old_row_count = gold.shape[0]
    new_row_count = matchesA.shape[0] + matchesB.shape[0]
    added = (gold.append(matchesA)).append(matchesB)
    assert(added.shape[0] == old_row_count + new_row_count)
    
    # Clean up annotator columns so values are all a single digit string
    added = useAnnotatorNumber(added)
    
    return added

In [6]:
agreement_files = os.listdir("./Agreements")
merged = gold
print(merged.shape[0])
for f in agreement_files:
    if f != ".ipynb_checkpoints":
        merged = addMatchedRows(f, merged)
print(merged.shape[0])  # 65770, 66357

65770
66357


In [7]:
merged.annotator.unique()
merged.replace('4',4,inplace=True)
merged.annotator.unique()
merged.to_csv("merged.csv")

<a id="g-o"></a>
#### OVERLAPS

In [17]:
# Input a dataframe and the labelname under consideration
# Output two DataFrames that are the subset of the input DataFrame, the first being all annotations that
# overlap with one other annotator and the second being all annotations that overlap with two other annotators,
# both with rows whose label is the input labelname
def getOverlaps(df, labelname):
    df = df.loc[df.label == labelname]
    agmt_colA = df.columns[-1]
    agmt_colB = df.columns[-2]
    overlaps = df.loc[df[agmt_colA] == "Overlap"]
    overlaps = overlaps.append(df.loc[df[agmt_colB] == "Overlap"])
    # Remove any matched rows from the overlap DataFrame (those rows have already been added to the merged dataset)
    overlaps = overlaps.loc[df[agmt_colA] != "Match"]
    overlaps = overlaps.loc[df[agmt_colB] != "Match"]
#     triple_overlaps = (overlaps.dropna())
#     double_overlaps = overlaps.drop(index=list(triple_overlaps.index))
    return overlaps #double_overlaps,triple_overlaps

# Create an interval tree for one annotator's agreement DataFrame for a specified file and label
def createIntervalTree(df, filename):
    subdf = df.loc[df.file == filename]
    offsets = list(zip(list(subdf.start), list(subdf.end)))
    return IntervalTree.from_tuples(offsets)

# Find the longest text span (interval) annotated for each group of overlapping annotations,
# comparing two or three trees (two or three annotators' labels) for one file at a time and
# for one label at a time
def findBiggestIntervals(tree_exp, tree_pred, tree_pred2=False):
    interval_list = []
    for annotation in tree_exp:
        biggest_interval = annotation
        overlapping_intervals = list(tree_pred.overlap(annotation))
        if tree_pred2:
            overlapping_intervals += list(tree_pred2.overlap(annotation))
        if len(overlapping_intervals) > 0:
            for oi in overlapping_intervals:
                if (annotation.end - annotation.begin) <= (oi.begin - oi.end):
                    biggest_interval = oi
        interval_list += [biggest_interval]
    return interval_list

In [20]:
def getRowsToAdd(df0, df1, df2=pd.DataFrame()):
    if df2.empty:
            file_list = list(set(list(df0.file)+list(df1.file)))
    else:
        file_list = list(set(list(df0.file)+list(df1.file)+list(df2.file)))
    for_merged = pd.DataFrame(columns=['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end'])
    for f in file_list:
        tree0 = createIntervalTree(df0, f)
        tree1 = createIntervalTree(df1, f)
        # Find biggest intervals for file f, label Gendered Role
        if df2.empty:
            biggest_intervals = findBiggestIntervals(tree0, tree1)
        else:
            tree2 = createIntervalTree(df2, f)
            biggest_intervals = findBiggestIntervals(tree0, tree1, tree2)
        # Get rows for merged dataset
        df0 = df0[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
        df1 = df1[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
        if df2.empty:
            df = df0.append(df1)
        else:
            df2 = df2[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
            df = (df0.append(df1)).append(df2)
        for iv in biggest_intervals:
            row = df.loc[df.start == iv.begin]
            row = row.loc[row.end == iv.end]
            for_merged = for_merged.append(row)

    return for_merged

In [21]:
print(os.listdir("./Agreements"))

['.ipynb_checkpoints', 'ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']


##### Gendered Role
Among annotators 0, 1, and 2

In [264]:
labelname = "Gendered-Role"
df0 = pd.read_csv("Agreements/ann0_agreement12_GenderedRole.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
# df0.head()  # Looks good
df1 = pd.read_csv('Agreements/ann1_agreement02_GenderedRole.csv',index_col=0)
df1 = getOverlaps(df1, labelname)
df2 = pd.read_csv('Agreements/ann2_agreement01_GenderedRole.csv',index_col=0)
df2 = getOverlaps(df2, labelname)

In [265]:
# double_overlaps0,triple_overlaps0 = getOverlaps(df0)
# double_overlaps1,triple_overlaps1 = getOverlaps(df1)
# double_overlaps2,triple_overlaps2 = getOverlaps(df2)
# print("Double and Triple Overlaps for Gendered Role Annotations")
# print("- Annotator 0:",double_overlaps0.shape[0], triple_overlaps0.shape[0])  # Makes sense
# print("- Annotator 1:",double_overlaps1.shape[0], triple_overlaps1.shape[0])
# print("- Annotator 2:",double_overlaps2.shape[0], triple_overlaps2.shape[0])
# double_overlaps1.head() # Looks good
# triple_overlaps2.tail()  # Looks good

In [266]:
for_merged = getRowsToAdd(df0, df1, df2)
for_merged = for_merged.append(getRowsToAdd(df1, df0, df2))
for_merged = for_merged.append(getRowsToAdd(df2, df0, df1))
for_merged.shape

(208, 11)

In [267]:
# Clean up annotator columns so values are all a single digit string
for_merged = useAnnotatorNumber(for_merged)
for_merged.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
2260,Coll-1462_00100.ann,"(1143, 1156)",granddaughter,3339,T3,Gendered-Role,0,Linguistic,,1143,1156
10439,Coll-1434_16000.ann,"(2393, 2397)",boys,15310,T34,Gendered-Role,0,Linguistic,,2393,2397
18422,Coll-1109_00100.ann,"(480, 492)",Lord Provost,26879,T0,Gendered-Role,0,Linguistic,,480,492
13530,Coll-1383_00600.ann,"(8394, 8400)",Master,19689,T137,Gendered-Role,0,Linguistic,,8394,8400
11139,Coll-1434_09700.ann,"(559, 565)",Prince,16369,T14,Gendered-Role,0,Linguistic,,559,565


In [268]:
# Write to CSV file for safe-keeping
for_merged.to_csv("overlaps_for_merged.csv")

##### Gendered Pronoun
Among annotators 0, 1, and 2

In [269]:
labelname = "Gendered-Pronoun"
df0 = pd.read_csv("Agreements/ann0_agreement12_GenderedPronoun.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df1 = pd.read_csv('Agreements/ann1_agreement02_GenderedPronoun.csv',index_col=0)
df1 = getOverlaps(df1, labelname)
df2 = pd.read_csv('Agreements/ann2_agreement01_GenderedPronoun.csv',index_col=0)
df2 = getOverlaps(df2, labelname)

In [270]:
for_merged = getRowsToAdd(df0, df1, df2)
for_merged = for_merged.append(getRowsToAdd(df1, df0, df2))
for_merged = for_merged.append(getRowsToAdd(df2, df0, df1))
for_merged.shape

(2, 11)

In [271]:
# Clean up annotator columns so values are all a single digit string
for_merged = useAnnotatorNumber(for_merged)
for_merged.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
21262,Coll-1036_00500.ann,"(16401, 16403)",He,31028,T221,Gendered-Pronoun,0,Linguistic,,16401,16403
16059,Coll-1036_00500.ann,"(16401, 16405)",He's,16204,T157,Gendered-Pronoun,1,Linguistic,,16401,16405


In [272]:
# Write to CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
# already_added.shape
to_merge = already_added.append(for_merged)
# to_merge.shape
to_merge.to_csv("overlaps_for_merged.csv")

##### Generalization
Among annotators 0, 1, and 2

In [273]:
# ['.ipynb_checkpoints', 'ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Generalization"
df0 = pd.read_csv("Agreements/ann0_agreement12_Generalization.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df1 = pd.read_csv('Agreements/ann1_agreement02_Generalization.csv',index_col=0)
df1 = getOverlaps(df1, labelname)
df2 = pd.read_csv('Agreements/ann2_agreement01_Generalization.csv',index_col=0)
df2 = getOverlaps(df2, labelname)

In [274]:
for_merged = getRowsToAdd(df0, df1, df2)
for_merged = for_merged.append(getRowsToAdd(df1, df0, df2))
for_merged = for_merged.append(getRowsToAdd(df2, df0, df1))
for_merged.shape

(61, 11)

In [275]:
# Clean up annotator columns so values are all a single digit string
for_merged = useAnnotatorNumber(for_merged)
for_merged.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
4063,Coll-1443_00100.ann,"(7684, 7689)",B.Ed.,6028,T116,Generalization,0,Linguistic,,7684,7689
9242,Coll-100_00100.ann,"(557, 581)",C.M. (Master in Surgery),13501,T18,Generalization,0,Linguistic,,557,581
9240,Coll-100_00100.ann,"(514, 519)",B.Sc.,13499,T16,Generalization,0,Linguistic,,514,519
9241,Coll-100_00100.ann,"(548, 552)",M.B.,13500,T17,Generalization,0,Linguistic,,548,552
14267,BAI_02300.ann,"(544, 548)",M.B.,20796,T38,Generalization,0,Linguistic,,544,548


In [276]:
# Add the new rows with rows already added and write them to a CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
already_added.shape  # Looks good

(210, 11)

In [277]:
to_merge = already_added.append(for_merged)
to_merge.shape  # Looks good

(271, 11)

In [278]:
to_merge.to_csv("overlaps_for_merged.csv")

##### Stereotype
Among annotators 0, 3, and 4

In [279]:
# ['ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Stereotype"
df0 = pd.read_csv("Agreements/ann0_agreement34_Stereotype.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df3 = pd.read_csv('Agreements/ann3_agreement04_Stereotype.csv',index_col=0)
df3 = getOverlaps(df3, labelname)
df4 = pd.read_csv('Agreements/ann4_agreement03_Stereotype.csv',index_col=0)
df4 = getOverlaps(df4, labelname)

In [280]:
for_merged = getRowsToAdd(df0, df3, df4)
for_merged = for_merged.append(getRowsToAdd(df3, df0, df4))
for_merged = for_merged.append(getRowsToAdd(df4, df0, df3))
for_merged.shape

(349, 11)

In [281]:
# Clean up annotator columns so values are all a single digit string
for_merged = useAnnotatorNumber(for_merged)
for_merged.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
5713,Coll-1310_02500.ann,"(8949, 8956)",problem,29634,T216,Stereotype,0,Contextual,,8949,8956
3917,Coll-1404_00100.ann,"(969, 992)",honorary degree of D.D.,20610,T16,Stereotype,0,Contextual,,969,992
5728,Coll-1373_00100.ann,"(5140, 5149)",Wernerian,29706,T50,Stereotype,0,Contextual,,5140,5149
1601,Coll-1434_19100.ann,"(493, 496)",men,7708,T9,Stereotype,0,Contextual,,493,496
4700,Coll-1310_00400.ann,"(1208, 1284)",awarded the declaration of the Third Class of ...,24107,T47,Stereotype,0,Contextual,,1208,1284


In [282]:
# Add the new rows with rows already added and write them to a CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
already_added.shape  # Looks good

(271, 11)

In [283]:
to_merge = already_added.append(for_merged)
to_merge.shape  # Looks good

(620, 11)

In [284]:
to_merge.to_csv("overlaps_for_merged.csv")

##### Omission
Among annotators 0, 3, and 4

In [285]:
# ['ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Omission"
df0 = pd.read_csv("Agreements/ann0_agreement34_Omission.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df3 = pd.read_csv('Agreements/ann3_agreement04_Omission.csv',index_col=0)
df3 = getOverlaps(df3, labelname)
df4 = pd.read_csv('Agreements/ann4_agreement03_Omission.csv',index_col=0)
df4 = getOverlaps(df4, labelname)

In [286]:
for_merged = getRowsToAdd(df0, df3, df4)
for_merged = for_merged.append(getRowsToAdd(df3, df0, df4))
for_merged = for_merged.append(getRowsToAdd(df4, df0, df3))
for_merged.shape

(1206, 11)

In [287]:
# Clean up annotator columns so values are all a single digit string
for_merged = useAnnotatorNumber(for_merged)
for_merged.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
5256,Coll-1022_00100.ann,"(5698, 5704)",Corson,27269,T110,Omission,0,Contextual,,5698,5704
938,Coll-1434_02200.ann,"(57, 63)",Family,4624,T23,Omission,0,Contextual,,57,63
935,Coll-1434_02200.ann,"(749, 761)",Fruit Seller,4617,T16,Omission,0,Contextual,,749,761
3800,Coll-1434_02900.ann,"(792, 800)",Students,20210,T1,Omission,0,Contextual,,792,800
3919,Coll-1404_00100.ann,"(80, 83)",Sir,20620,T26,Omission,0,Contextual,,80,83


In [288]:
# Add the new rows with rows already added and write them to a CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
already_added.shape  # Looks good

(620, 11)

In [289]:
to_merge = already_added.append(for_merged)
to_merge.shape  # Looks good

(1826, 11)

In [290]:
to_merge.to_csv("overlaps_for_merged.csv")

##### Occupation
Among annotators 0, 3, and 4

In [291]:
# ['ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Occupation"
df0 = pd.read_csv("Agreements/ann0_agreement34_Occupation.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df3 = pd.read_csv('Agreements/ann3_agreement04_Occupation.csv',index_col=0)
df3 = getOverlaps(df3, labelname)
df4 = pd.read_csv('Agreements/ann4_agreement03_Occupation.csv',index_col=0)
df4 = getOverlaps(df4, labelname)

In [292]:
for_merged = getRowsToAdd(df0, df3, df4)
for_merged = for_merged.append(getRowsToAdd(df3, df0, df4))
for_merged = for_merged.append(getRowsToAdd(df4, df0, df3))
for_merged.shape

(0, 11)

None to add!

##### Unknown
Among annotators 0 and 2

In [299]:
# ['.ipynb_checkpoints', 'ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Unknown"
df0 = pd.read_csv("Agreements/ann0_agreement2_UnknownFeminineMasculine.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df2 = pd.read_csv('Agreements/ann2_agreement0_UnknownFeminineMasculine.csv',index_col=0)
df2 = getOverlaps(df2, labelname)

In [300]:
for_merged = getRowsToAdd(df0, df2)
for_merged = for_merged.append(getRowsToAdd(df2, df0))
for_merged.shape

(285, 11)

In [302]:
# Clean up annotator columns so values are all a single digit string
for_merged.replace("Annotator 0", "0", inplace=True)
for_merged.replace("Annotator 2", "2", inplace=True)
for_merged.tail()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
18109,Coll-1434_17400.ann,"(3622, 3632)",Fred] Heal,18405,T16,Unknown,2,Person-Name,,3622,3632
3256,Coll-1052_00100.ann,"(5752, 5759)",athleen,3375,T34,Unknown,2,Person-Name,,5752,5759
114,BAI_00800.ann,"(636, 654)",Mary Sarah Fowler,116,T6,Unknown,2,Person-Name,,636,654
148,BAI_01100.ann,"(294, 306)",Prof McIntyr,150,T4,Unknown,2,Person-Name,,294,306
1660,Coll-1034_00100.ann,"(510, 516)",Hegel;,1700,T10,Unknown,2,Person-Name,,510,516


In [303]:
# Add the new rows with rows already added and write them to a CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
already_added.shape  # Looks good

(1826, 11)

In [304]:
to_merge = already_added.append(for_merged)
to_merge.shape  # Looks good

(2111, 11)

In [305]:
to_merge.to_csv("overlaps_for_merged.csv")

##### Feminine
Among annotators 0 and 2

In [306]:
# ['.ipynb_checkpoints', 'ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Feminine"
df0 = pd.read_csv("Agreements/ann0_agreement2_UnknownFeminineMasculine.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df2 = pd.read_csv('Agreements/ann2_agreement0_UnknownFeminineMasculine.csv',index_col=0)
df2 = getOverlaps(df2, labelname)

In [307]:
for_merged = getRowsToAdd(df0, df2)
for_merged = for_merged.append(getRowsToAdd(df2, df0))
for_merged.shape

(147, 11)

In [308]:
# Clean up annotator columns so values are all a single digit string
for_merged.replace("Annotator 0", "0", inplace=True)
for_merged.replace("Annotator 2", "2", inplace=True)
for_merged.tail()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
1803,Coll-1036_00400.ann,"(13481, 13496)",Marjory Kennedy,1845,T124,Feminine,2,Person-Name,,13481,13496
2020,Coll-1036_00400.ann,"(34507, 34525)","Skye, Mrs Mathison",2069,T348,Feminine,2,Person-Name,,34507,34525
2125,Coll-1036_00400.ann,"(43949, 43973)","Fraser, Marjory Kennedy-",2179,T458,Feminine,2,Person-Name,,43949,43973
2563,Coll-1036_00400.ann,"(102334, 102356)",Marjory Kennedy-Fraser,2647,T926,Feminine,2,Person-Name,,102334,102356
18814,Coll-1434_20700.ann,"(4017, 4020)",J.G,19144,T24,Feminine,2,Person-Name,,4017,4020


In [309]:
# Add the new rows with rows already added and write them to a CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
already_added.shape  # Looks good

(2111, 11)

In [310]:
to_merge = already_added.append(for_merged)
to_merge.shape  # Looks good

(2258, 11)

In [311]:
to_merge.to_csv("overlaps_for_merged.csv")

##### Masculine
Among annotators 0 and 2

In [312]:
# ['.ipynb_checkpoints', 'ann0_agreement2_UnknownFeminineMasculine.csv', 'ann2_agreement0_UnknownFeminineMasculine.csv', 'ann0_agreement12_GenderedRole.csv', 'ann1_agreement02_GenderedRole.csv', 'ann2_agreement01_GenderedRole.csv', 'ann0_agreement12_Generalization.csv', 'ann1_agreement02_Generalization.csv', 'ann2_agreement01_Generalization.csv', 'ann0_agreement12_GenderedPronoun.csv', 'ann1_agreement02_GenderedPronoun.csv', 'ann2_agreement01_GenderedPronoun.csv', 'ann0_agreement34_Stereotype.csv', 'ann3_agreement04_Stereotype.csv', 'ann4_agreement03_Stereotype.csv', 'ann0_agreement34_Omission.csv', 'ann3_agreement04_Omission.csv', 'ann4_agreement03_Omission.csv', 'ann0_agreement34_Occupation.csv', 'ann3_agreement04_Occupation.csv', 'ann4_agreement03_Occupation.csv']
labelname = "Masculine"
df0 = pd.read_csv("Agreements/ann0_agreement2_UnknownFeminineMasculine.csv",index_col=0)
df0 = getOverlaps(df0, labelname)
df2 = pd.read_csv('Agreements/ann2_agreement0_UnknownFeminineMasculine.csv',index_col=0)
df2 = getOverlaps(df2, labelname)

In [313]:
for_merged = getRowsToAdd(df0, df2)
for_merged = for_merged.append(getRowsToAdd(df2, df0))
for_merged.shape

(69, 11)

In [314]:
# Clean up annotator columns so values are all a single digit string
for_merged.replace("Annotator 0", "0", inplace=True)
for_merged.replace("Annotator 2", "2", inplace=True)
for_merged.tail()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove,start,end
4041,Coll-1036_00500.ann,"(4036, 4052)","Kennedy, Robert,",4196,T42,Masculine,2,Person-Name,,4036,4052
3403,Coll-1057_00300.ann,"(644, 648)",Alan,3527,T6,Masculine,2,Person-Name,,644,648
3404,Coll-1057_00300.ann,"(659, 662)",Ken,3528,T7,Masculine,2,Person-Name,,659,662
18941,Coll-1438_00100.ann,"(4065, 4075)",Henderson',19272,T79,Masculine,2,Person-Name,,4065,4075
18770,Coll-1434_20500.ann,"(2199, 2205)",Oarlot,19098,T14,Masculine,2,Person-Name,,2199,2205


In [315]:
# Add the new rows with rows already added and write them to a CSV file for safe-keeping
already_added = pd.read_csv("overlaps_for_merged.csv", index_col=0)
already_added.shape  # Looks good

(2258, 11)

In [316]:
to_merge = already_added.append(for_merged)
to_merge.shape  # Looks good

(2327, 11)

In [317]:
to_merge.to_csv("overlaps_for_merged.csv")

#### Remove reviewed rows

Record which rows have been reviewed because they have matches or overlaps, and can thus be removed from the annotator's DataFrames.  Use the unique IDs for each annotator's annotation to determine which rows to remove, one annotator at a time.

In [49]:
contextual_labels = ["Occupation","Omission","Stereotype"]
linguistic_labels = ["Gendered-Role","Generalization","Gendered-Pronoun"]

def getMatchAndOverlapRows(labels_list,agmt_cols,files):
    df1 = pd.DataFrame()
    i = 1
    while i < 3:
        df = pd.read_csv(files[i], index_col=0)
        df = df.loc[df.label == labels_list[(i)]]
        df = df.loc[df[agmt_cols[0]].isin(["Overlap","Match"])]
        df = df.loc[df[agmt_cols[1]].isin(["Overlap","Match"])]
        df1 = df1.append(df)
        i += 1
    matches_or_overlaps = df1[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
    assert len(set(matches_or_overlaps.id)) == len(list(matches_or_overlaps.id)), "Each IDs should be unique"
    return matches_or_overlaps

##### Annotator 0

In [53]:
# Get all annotator 0's matched or overlapped annotations for the Person-Name and Linguistic categories of labels
files0 = ['Agreements/ann0_agreement2_UnknownFeminineMasculine.csv', 'Agreements/ann0_agreement12_GenderedRole.csv',
          'Agreements/ann0_agreement12_Generalization.csv','Agreements/ann0_agreement12_GenderedPronoun.csv']
# Get only the Person-Name and Linguistic Rows with agreements recorded for them (matches or overlaps)
# df0p = pd.read_csv(files0[0], index_col=0)
# df0p = df0p.loc[df0p.category == "Person-Name"]
# df0p = df0p.loc[df0p.agreement_2.isin(["Overlap","Match"])]
# df0p.tail(10) # Looks good

labels_list = linguistic_labels
df0l = pd.DataFrame()
i = 1
while i < 4:
    df0l_i = pd.read_csv(files0[i], index_col=0)
    df0l_i = df0l_i.loc[df0l_i.label == labels_list[(i-1)]]
#     df0l_i = df0l_i.loc[df0l_i.agreement_2.isin(["Overlap","Match"])]
    df0l_i = df0l_i.loc[df0l_i.agreement_1.isin(["Overlap","Match"])]
    df0l = df0l.append(df0l_i)
    i += 1
# df0l.tail(20)  # Looks good

# df0p = df0p[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
# df0l = 
matches_or_overlaps0PL = df0l[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
# matches_or_overlaps0PL = df0p.append(df0l)
assert len(set(matches_or_overlaps0PL.id)) == len(list(matches_or_overlaps0PL.id)), "Each IDs should be unique"

In [56]:
# Remove the matched or overlapped annotations from annotator 0's Person-Name and Linguistic annotation data
indeces_to_remove = list(matches_or_overlaps0PL.index)
# print(indeces_to_remove)  # Looks good
ann0PL = pd.read_csv("labels0PL.csv")
remaining_i = list(ann0PL.index)
i_to_drop = [i for i in indeces_to_remove if i in remaining_i]
# print(i_to_drop)
print(ann0PL.shape)
ann0PL = ann0PL.drop(index=i_to_drop)  #indeces_to_remove
print(ann0PL.shape)
# (21794, 9)
# (15587, 9)  # After removing overlaps and agreements with annotator 2

(15587, 10)
(11935, 10)


In [57]:
# Update the annotator 0 data file
ann0PL.to_csv("labels0PL.csv")
# Write the agreed upon annotations to a file for annotator 0 (rows that overlapped or matched with data from annotators 1 or 2)
# matches_or_overlaps0PL.to_csv("labels0PL_agreed2.csv")
matches_or_overlaps0PL.to_csv("labels0PL_agreed1.csv")

In [36]:
# Get all annotator 0's matched or overlapped annotations for the Contextual categories of labels
files0 = ['Agreements/ann0_agreement34_Occupation.csv', 'Agreements/ann0_agreement34_Omission.csv',
          'Agreements/ann0_agreement34_Stereotype.csv']

labels_list = contextual_labels
agmt_cols = ["agreement_3", "agreement_4"]
matches_or_overlaps0C = getMatchAndOverlapRows(labels_list,agmt_cols,files0)

In [37]:
# Remove the matched or overlapped annotations from annotator 0's Person-Name and Linguistic annotation data
indeces_to_remove = list(matches_or_overlaps0C.index)
ann0C = pd.read_csv("labels0C.csv")
print(ann0C.shape)
ann0C = ann0C.drop(index=indeces_to_remove)
print(ann0C.shape)

(6157, 9)
(5973, 9)


In [38]:
# Update the annotator 0 data file
ann0C.to_csv("labels0C.csv")
# Write the agreed upon annotations to a file for annotator 0 (rows that overlapped or matched with data from anns 3 or 4)
matches_or_overlaps0C.to_csv("labels0C_agreed.csv")

##### Annotator 1

In [50]:
# Get all annotator 1's matched or overlapped annotations for the Contextual categories of labels
files1 = ['Agreements/ann1_agreement02_GenderedRole.csv', 'Agreements/ann1_agreement02_Generalization.csv',
          'Agreements/ann1_agreement02_GenderedPronoun.csv']
labels_list = linguistic_labels
agmt_cols = ["agreement_0", "agreement_2"]
matches_or_overlaps1 = getMatchAndOverlapRows(labels_list,agmt_cols,files1)

In [51]:
# Remove the matched or overlapped annotations from annotator 1's Person-Name and Linguistic annotation data
indeces_to_remove = list(matches_or_overlaps1.index)
ann1 = pd.read_csv("labels1.csv")
print(ann1.shape)
ann1 = ann1.drop(index=indeces_to_remove)
print(ann1.shape)

(16515, 10)
(15999, 10)


In [52]:
# Update the annotator 1 data file
ann1.to_csv("labels1.csv")
# Write the agreed upon annotations to a file for annotator 1
matches_or_overlaps1.to_csv("labels1_agreed.csv")

##### Annotator 2

In [61]:
# Get all annotator 2's matched or overlapped annotations
files2 = ['Agreements/ann2_agreement0_UnknownFeminineMasculine.csv', 'Agreements/ann2_agreement01_GenderedRole.csv',
           'Agreements/ann2_agreement01_Generalization.csv', 'Agreements/ann2_agreement01_GenderedPronoun.csv']

# Get only the Person-Name and Linguistic Rows with agreements recorded for them (matches or overlaps)
df2p = pd.read_csv(files2[0], index_col=0)
df2p = df2p.loc[df2p.category == "Person-Name"]
df2p = df2p.loc[df2p.agreement_0.isin(["Overlap","Match"])]
# df2p.tail(10) # Looks good

labels_list = linguistic_labels
df2l = pd.DataFrame()
i = 1
while i < 4:
    df2l_i = pd.read_csv(files2[i], index_col=0)
    df2l_i = df2l_i.loc[df2l_i.label == labels_list[(i-1)]]
    df2l_i = df2l_i.loc[df2l_i.agreement_0.isin(["Overlap","Match"])]
    df2l_i = df2l_i.loc[df2l_i.agreement_1.isin(["Overlap","Match"])]
    df2l = df2l.append(df2l_i)
    i += 1

# df2p = df2p[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
# df2l = df2l[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category', 'remove', 'start', 'end']]
matches_or_overlaps2 = df2p.append(df2l)
assert len(set(matches_or_overlaps2.id)) == len(list(matches_or_overlaps2.id)), "Each IDs should be unique"

In [62]:
# Remove the matched or overlapped annotations from annotator 2's Person-Name and Linguistic annotation data
indeces_to_remove = list(matches_or_overlaps2.index)
ann2 = pd.read_csv("labels2.csv")
print(ann2.shape)
ann2 = ann2.drop(index=indeces_to_remove)
print(ann2.shape)

(19532, 9)
(14384, 9)


In [63]:
# Update the annotator 2 data file
ann2.to_csv("labels2.csv")
# Write the agreed upon annotations to a file for annotator 2
matches_or_overlaps2.to_csv("labels2_agreed.csv")

##### Annotator 3

In [64]:
# Get all annotator 3's matched or overlapped annotations for the Contextual categories of labels
files = ['Agreements/ann3_agreement04_Occupation.csv', 'Agreements/ann3_agreement04_Omission.csv',
          'Agreements/ann3_agreement04_Stereotype.csv']
labels_list = contextual_labels
agmt_cols = ["agreement_0", "agreement_4"]
matches_or_overlaps = getMatchAndOverlapRows(labels_list,agmt_cols,files)

In [65]:
# Remove the matched or overlapped annotations from annotator 3's annotation data
indeces_to_remove = list(matches_or_overlaps.index)
ann = pd.read_csv("labels3.csv")
print(ann.shape)
ann = ann.drop(index=indeces_to_remove)
print(ann.shape)

(2767, 9)
(2585, 9)


In [66]:
# Update the annotator 3 data file
ann.to_csv("labels3.csv")
# Write the agreed upon annotations to a file for annotator 3
matches_or_overlaps.to_csv("labels3_agreed.csv")

##### Annotator 4

In [67]:
# Get all annotator 4's matched or overlapped annotations for the Contextual categories of labels
files = ['Agreements/ann4_agreement03_Occupation.csv', 'Agreements/ann4_agreement03_Omission.csv',
          'Agreements/ann4_agreement03_Stereotype.csv']
labels_list = contextual_labels
agmt_cols = ["agreement_0", "agreement_3"]
matches_or_overlaps = getMatchAndOverlapRows(labels_list,agmt_cols,files)

In [68]:
# Remove the matched or overlapped annotations from annotator 3's annotation data
indeces_to_remove = list(matches_or_overlaps.index)
ann = pd.read_csv("labels4.csv")
print(ann.shape)
ann = ann.drop(index=indeces_to_remove)
print(ann.shape)

(2360, 9)
(2176, 9)


In [69]:
# Update the annotator 3 data file
ann.to_csv("labels4.csv")
# Write the agreed upon annotations to a file for annotator 3
matches_or_overlaps.to_csv("labels4_agreed.csv")

<a id="g-r"></a>
#### REMAINING STEPS

In [7]:
# 5. For any files annotator 1 didn't label, add all annotator 0's `Gendered-Role` labels. 
merged = pd.read_csv("merged.csv",index_col=0)  # Contains disagreements + GP and O + matched
merged = merged.astype({'file':str, 'offsets':str,'text':str, 'id':int, 'entity':str, 'label':str, 'annotator':int, 'category':str})
merged = merged.drop_duplicates()
# # 5. For any files annotator 1 didn't label, add all annotator 0's Gendered-Role labels. 
ann0 = pd.read_csv("labels0PL.csv",index_col=0)
ann0 = ann0[['file', 'offsets','text', 'id', 'entity', 'label', 'annotator', 'category']]  # align with merged's columns
ann1 = pd.read_csv("labels1.csv",index_col=0)
ann1 = ann1[['file', 'offsets','text', 'id', 'entity', 'label', 'annotator', 'category']]
ann0_gr = ann0.loc[ann0.label == "Gendered-Role"]
ann1_gr = ann1.loc[ann1.label == "Gendered-Role"]
common01 = findCommonFiles(ann0_gr, ann1_gr)
to_drop = list((ann0_gr.loc[ann0_gr.file.isin(common01)]).index)
print(ann0_gr.shape)
ann0_gr = ann0_gr.drop(index=to_drop)
print(ann0_gr.shape)
ann0_gr = useAnnotatorNumber(ann0_gr)
merged = merged.append(ann0_gr)
print(merged.shape)
merged.head()

(1696, 8)
(290, 8)
(66479, 8)


Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
0,Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
1,Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
2,Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
3,Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
4,BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


In [8]:
# 5. (continued) For any files annotator 0 didn't label, add all annotator 2's Gendered-Role labels.
ann2 = pd.read_csv("labels2.csv",index_col=0)
ann2 = ann2[['file', 'offsets','text', 'id', 'entity', 'label', 'annotator', 'category']]
ann0_gr = ann0.loc[ann0.label == "Gendered-Role"]
ann2_gr = ann2.loc[ann2.label == "Gendered-Role"]
common02 = findCommonFiles(ann0_gr, ann2_gr)
to_drop = list((ann2_gr.loc[ann2_gr.file.isin(common02)]).index)
print(ann2_gr.shape)
ann2_gr = ann2_gr.drop(index=to_drop)
print(ann2_gr.shape)
ann2_gr = useAnnotatorNumber(ann2_gr)
merged = merged.append(ann2_gr)
print(merged.shape)
merged.tail()

(1622, 8)
(981, 8)
(67460, 8)


Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
19397,Coll-1460_00100.ann,"(506, 508)",Mr,19760,T0,Gendered-Role,2,Linguistic
19398,Coll-1460_00100.ann,"(911, 915)",Miss,19761,T1,Gendered-Role,2,Linguistic
19461,Coll-1463_00100.ann,"(607, 611)",King,19825,T0,Gendered-Role,2,Linguistic
19465,Coll-1463_00100.ann,"(649, 662)",Prince Regent,19829,T4,Gendered-Role,2,Linguistic
19470,Coll-1465_00100.ann,"(957, 967)",Englishman,19834,T3,Gendered-Role,2,Linguistic


In [9]:
# 6. Add all the remaining Generalization labels to the merged DataFrame. 
print(merged.shape)
anns = [ann0, ann1, ann2]
for ann in anns:
    ann_g = ann.loc[ann.label == "Generalization"]
    ann_g.head()
    merged = merged.append(ann_g)
print(merged.shape)
merged.tail()

(67460, 8)
(67817, 8)


Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
19230,Coll-1448_00100.ann,"(959, 963)",B.Ed,19584,T22,Generalization,Annotator 2,Linguistic
19250,Coll-1451_00100.ann,"(855, 860)",B.Sc.,19608,T12,Generalization,Annotator 2,Linguistic
19287,Coll-1453_00100.ann,"(1081, 1084)",M.A,19645,T4,Generalization,Annotator 2,Linguistic
19347,Coll-1455_00100.ann,"(1112, 1117)",B.Sc.,19706,T11,Generalization,Annotator 2,Linguistic
19445,Coll-1462_00100.ann,"(729, 739)",countrymen,19808,T23,Generalization,Annotator 2,Linguistic


In [10]:
# 6. (continued) Change any text spans that end in -ess, -boy, -girl, or -man labeled as Gendered-Role to Generalization.
merged_gr = merged.loc[merged.label == "Gendered-Role"]
merged_gr_text = list(merged_gr.text)
to_change = []
for t in merged_gr_text:
    if ("ess" in t) or ("boy" in t) or ("girl" in t) or ("man" in t):
        to_change += [t]
to_change = set(to_change)
print(to_change)

{'Scotsman', 'Duchess', 'Yeoman', 'Knight Commander', 'Kirkman', 'heiress', 'Empresses', 'girls', 'boy', 'Sealwoman', 'Baroness', 'girl', 'Cornishman', 'tribesman', 'Fatherless', 'milkman', "Spearman's", 'Messrs', 'Shepherdess', 'sportsman', 'cowboys', "woman's", 'governesses', 'statesman', 'Tribesman', "man's", 'Mitherless', 'authoress', 'Wasp-Woman', 'Grand Duchess', 'managing Director', "Woman's", 'cowboy', 'Messrs.', 'Vice-Chairman', 'man', 'warehouseman', 'policeman', 'Heiress', 'schoolboy', 'Husbandman', 'Seal-Woman', 'horsewoman', 'Manageress', 'Englishman', 'clergyman', 'Empress', 'Empress Dowager', 'Woman', 'boys', 'co-chairman', 'land-girls', 'Schoolboys', 'day-girl', 'Seal-woman', 'man-of-war', 'Princess', "Seal-Woman'", 'Milkman', 'Messers', 'actress', 'yes-man', 'Shepherdesses', 'chairman', 'Marquess', 'Chairman', 'Knight Commander of the Indian Empire', 'Scotchman', 'Clegyman', 'Gentleman', 'Bardess', 'mess-boy', 'mistress', 'choirboy', 'Knight Commander of the Order of t

Some of these we don't actually want to change, so let's manually edit the set to create our final list of text spans whose label should be Generalization:

In [11]:
to_change = ['boys', 'man', 'yes-man', 'clergyman', 'woman', 'girls', 'Chairman', 'Englishman', "woman's", 'Tribesman', 'schoolboy', 'actress', 'Duchess', 'Princess', 'Empress Dowager', 'statesman', 'Vice-Chairman', 'cowboys', "Seal-Woman'", 'Cornishman', 'horsewoman', 'Shepherdesses', 'Countess', 'Schoolboys', 'Baroness', 'Wasp-Woman', 'heiress', 'Yeoman', 'gentleman', 'Sealwoman', 'horesman', 'Marquess', 'cowboy', 'Manageress', 'chairman', 'boy', 'choirboy', 'co-chairman', 'Empress', 'milkman', 'land-girls', 'mess-boy', 'Scotsman', 'Seal-woman', 'Seal-Woman', 'policeman', 'Clegyman', 'authoress', 'Shepherdess', 'Milkman', 'Heiress', 'Kirkman', 'Empresses', 'day-girl', 'mistress', 'warehouseman', 'Bardess', 'Grand Duchess', 'Cowboy', 'Husbandman', 'sportsman', 'Gentleman', 'tribesman', 'governesses', 'Scotchman']
# removed: 'Knight Commander of the Order of the British Empire','Knight Commander of the Indian Empire', 'Mitherless', 'Fatherless', "Woman's", 'Messrs','Messers', 'Messrs.', 'Knight Commander', 'man-of-war', 'girl', "man's", "Spearman's",'managing Director', 'Woman'

In [12]:
merged_gr_to_change = merged_gr.loc[merged_gr.text.isin(to_change)]
i_to_change = list(merged_gr_to_change.index)
for i in i_to_change:
    merged.at[i,"label"] = "Generalization"
merged = merged.drop_duplicates(subset=['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator', 'category'])

In [13]:
merged.loc[merged.id == 19631]  # "Empress" should be Generalization

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
14651,Coll-1383_00600.ann,"(5374, 5378)",Lady,19631,T79,Gendered-Role,0,Linguistic
58746,Coll-1452_00100.ann,"(1726, 1733)",Empress,19631,T18,Generalization,2,Linguistic
19273,Coll-1452_00100.ann,"(1726, 1733)",Empress,19631,T18,Generalization,2,Linguistic


In [14]:
# WHY ARE THERE STILL DUPLICATES?

In [15]:
# 7. Add all the remaining Person-Name labels from annotators 0 and 2 to the gold standard DataFrame.
ann0_p = ann0.loc[ann0.category == "Person-Name"]
print(ann0_p.shape)
ann2_p = ann2.loc[ann2.category == "Person-Name"]
print(ann2_p.shape)

(7606, 8)
(12255, 8)


In [16]:
merged = (merged.append(ann0_p)).append(ann2_p)
print(merged.shape)

(87678, 8)


In [18]:
# 8. Add the Omission and Stereotype labels from annotators 0, 3, and 4 to the merged DataFrame.
ann0 = pd.read_csv("labels0C.csv",index_col=0)
ann0 = ann0[['file', 'offsets','text', 'id', 'entity', 'label', 'annotator', 'category']]
# ann0.head()  # Looks good
ann0_o = ann0.loc[ann0.label == "Omission"]
ann0_s = ann0.loc[ann0.label == "Stereotype"]

ann3 = pd.read_csv("labels3.csv",index_col=0)
ann3 = ann3[['file', 'offsets','text', 'id', 'entity', 'label', 'annotator', 'category']]
ann3_o = ann3.loc[ann3.label == "Omission"]
ann3_s = ann3.loc[ann3.label == "Stereotype"]

ann4 = pd.read_csv("labels4.csv",index_col=0)
ann4 = ann4[['file', 'offsets','text', 'id', 'entity', 'label', 'annotator', 'category']]
ann4_o = ann4.loc[ann4.label == "Omission"]
ann4_s = ann4.loc[ann4.label == "Stereotype"]

merged = (merged.append(ann0_o)).append(ann0_s)
merged = (merged.append(ann3_o)).append(ann3_s)
merged = (merged.append(ann4_o)).append(ann4_s)
merged.shape

(98332, 8)

In [19]:
# Make sure all annotators are listed as integers in the 'annotator' column of the gold DataFrame
merged = useAnnotatorNumber(merged)
merged.head()  # Looks good

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
0,Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
1,Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
2,Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
3,Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
4,BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


In [20]:
merged.tail()  # Looks good

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
2340,Coll-1434_20400.ann,"(5497, 5524)",a group of young Khond boys,6559,T25,Stereotype,4,Contextual
2351,Coll-146_30100.ann,"(2242, 2247)",Woman,6575,T8,Stereotype,4,Contextual
2353,Coll-1434_19600.ann,"(4129, 4134)",a man,6582,T6,Stereotype,4,Contextual
2354,Coll-1434_19600.ann,"(528, 544)",a group of women,6584,T8,Stereotype,4,Contextual
2355,Coll-146_16400.ann,"(2134, 2139)",a boy,6586,T2,Stereotype,4,Contextual


In [21]:
# Write merged with non-disagreed and non-agreed rows to CSV
merged.to_csv("merged.csv")

#### Remove Duplicates from the Merged Dataset
* Make sure there aren't any overlapping or matching Occupations or Gendered-Pronoun labels in the gold data
* Combine the `merged` and `overlaps_fo_merged` CSVs into one DataFrame
* Count total rows per annotator in the resulting dataset
* Remove `annotator`, `id`, `entity`, and `remove` rows and drop duplicates to get only unique annotations in the aggregated (a.k.a. merged, formerly gold) dataset
* Write the final aggregated dataset

Let's double check that there aren't any overlapping Occupation or Gendered-Pronoun labels.  Any matching labels would have been removed when we dropped duplicate rows from the data, so we don't have to worry about those anymore.

We'll change our functions from the start of this Notebook slightly to do so, and add one more at the end...

In [22]:
# Separate the input DataFrame's offsets column into 'start' offset and 'end' offset columns of type int
def splitOffsets(df):
    offsets = list(df.offsets)
    start_list, end_list = [], []
    for o in offsets:
        pair = o[1:-1]
        pair_list = pair.split(",")
        start_list += [int(pair_list[0])]
        end_list += [int(pair_list[1])]
    df = df.assign(start = start_list)
    df = df.assign(end = end_list)
    return df

# Find the files both input annotators labeled
def findCommonFiles(df_a, df_b):
    common = []
    files_a = set(list(df_a.file))
    files_b = set(list(df_b.file))
    for f in files_a:
        if f in files_b:
            common += [f]
    return common

# Create an interval tree for one annotator for a specified file and specified label
def createIntervalTree(df, filename, labelname):
    subdf = df[df.file == filename]                                       # Get only rows for the input file
    subdf = subdf[subdf.label == labelname]                               # Get only rows for that file with the input label
    offsets = list(zip(list(subdf.start), list(subdf.end)))
    return IntervalTree.from_tuples(offsets)

# Find strict agreements (exact matches)
def findMatches(tree_exp, tree_pred):
    matches = []
    for annotation in tree_exp:
        if annotation in tree_pred:
            matches += [annotation]
    return matches

# Find annotations that overlap one another (including enveloping but excluding exactly matching annotations)
# Note: exp = A, pred = B
def findOverlaps(tree_exp, tree_pred):
    overlaps_pred, overlaps_exp = [], []
    for annotation in tree_exp:
        overlapping_intervals = tree_pred.overlap(annotation)
        if len(overlapping_intervals) > 0:
            for oi in overlapping_intervals:
                overlaps_pred += [oi]
    for annotation in tree_pred:
        overlapping_intervals = tree_exp.overlap(annotation)
        if len(overlapping_intervals) > 0:
            for oi in overlapping_intervals:
                overlaps_exp += [oi]
    return overlaps_exp, overlaps_pred

# Record the type of agreement in the DataFrame
def recordAgreements(df, labelname, filename, agreements, agreement_col_name, agreement_type):
    df_label = df.loc[df.label == labelname]
    df_file = df_label.loc[df_label.file == filename]
    for a in agreements:
        offset = "("+str(a.begin)+", "+str(a.end)+")"
        row = df_file.loc[df_file.offsets == offset].index
        df.at[row, agreement_col_name] = agreement_type
    return df

# Record overlapping annotations between two annotators' DataFrames in the DF's agreement columns
def getAllOverlapsPerLabel(annA, annB, sub_annA, sub_annB, agreement_A, agreement_B, labelname, common_files):
    sub_annA = splitOffsets(sub_annA)
    sub_annB = splitOffsets(sub_annB)
    for f in common_files:
        # Create interval trees of the offset data for each annotator for file f
        treeA = createIntervalTree(sub_annA, f, labelname)
        treeB = createIntervalTree(sub_annB, f, labelname)
        # Find overlaps between the annotators
        overlapsA, overlapsB = findOverlaps(treeA,treeB)
        annA = recordAgreements(annA,labelname,f,overlapsA,agreement_B,"Overlap")
        annB = recordAgreements(annB,labelname,f,overlapsB,agreement_A,"Overlap")
    return annA, annB

# Create an interval tree for one annotator's agreement DataFrame for a specified file
def createSubIntervalTree(df, filename):
    subdf = df.loc[df.file == filename]
    offsets = list(zip(list(subdf.start), list(subdf.end)))
    return IntervalTree.from_tuples(offsets)

# SAME AS ABOVE - COPIED FOR REFERENCE
# Find the longest text span (interval) annotated for each group of overlapping annotations,
# comparing two or three trees (two or three annotators' labels) for one file at a time and
# for one label at a time
def findBiggestIntervals(tree_exp, tree_pred, tree_pred2=False):
    interval_list = []
    for annotation in tree_exp:
        biggest_interval = annotation
        overlapping_intervals = list(tree_pred.overlap(annotation))
        if tree_pred2:
            overlapping_intervals += list(tree_pred2.overlap(annotation))
        if len(overlapping_intervals) > 0:
            for oi in overlapping_intervals:
                if (annotation.end - annotation.begin) <= (oi.begin - oi.end):
                    biggest_interval = oi
        interval_list += [biggest_interval]
    return interval_list

def removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged):
    dfa, dfb = getAllOverlapsPerLabel(dfa, dfb, dfagp, dfbgp, agmta, agmtb, labelname, common)

    # Drop duplicates from the overlaps so only one from among a group of matching annotations remains
    dfa = dfa.loc[dfa[agmtb] == "Overlap"]
    dfb = dfb.loc[dfb[agmta] == "Overlap"]
    dfa = dfa[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator','category']]
    dfb = dfb[['file', 'offsets', 'text', 'id', 'entity', 'label', 'annotator','category']]
    df = dfa.append(dfb)
    print(df.shape)
    df = df.drop_duplicates(subset=["file","offsets","text","label"])
    print(df.shape)
    
    dfa = splitOffsets(dfa)  # Get start and end columns
    dfb = splitOffsets(dfb)  # Get start and end columns
    iv_dict = {}
    for f in common:
        treea = createSubIntervalTree(dfa, f)
        treeb = createSubIntervalTree(dfb, f)
        biggest_ivs = findBiggestIntervals(treea, treeb)
        iv_dict[f] = ["("+str(iv.begin)+", "+str(iv.end)+")" for iv in biggest_ivs]
    # print(iv_dict) # Looks good

    # Get the indeces of the overlaps that should be dropped (not the biggest interval)
    indeces_to_drop = []
    for f in common:
        suba = dfa.loc[dfa.file == f]  # subset of rows for the given file and label and annotator
        subb = dfb.loc[dfb.file == f]  # subset of rows for the given file and label and annotator
        intervals = iv_dict[f]
        i_a = (suba.loc[~suba.offsets.isin(intervals)]).index
        indeces_to_drop += list(i_a)
        i_b = (subb.loc[~subb.offsets.isin(intervals)]).index
        indeces_to_drop += list(i_b)
    print(indeces_to_drop)

    # Drop rows with indeces found above from the merged dataset so only one annotation from each group of overlaps remains
    if len(indeces_to_drop) > 0:
        print(merged.shape)
        for i in indeces_to_drop:
            try:
                merged = merged.drop(index=[i])
            except KeyError:                      # Some indeces may have been removed if they were in other overlap groups
                continue
        print(merged.shape)
    else:
        print("None to drop!")
    
    return merged

In [23]:
merged = pd.read_csv("merged.csv",index_col=0)  # Contains disagreements + GP and O + matched + remaining
merged0 = merged.loc[merged.annotator == 0]
merged1 = merged.loc[merged.annotator == 1]
merged2 = merged.loc[merged.annotator == 2]
merged3 = merged.loc[merged.annotator == 3]
merged4 = merged.loc[merged.annotator == 4]

##### Gendered Pronoun
Keep only one annotation from each group of overlapping (excluding matching) annotations between annotators 0 and 1

In [24]:
labelname = "Gendered-Pronoun"
dfa, dfb = merged0, merged1
agmta, agmtb = "agreement_0", "agreement_1"
common = findCommonFiles(dfa,dfb)
dfagp = dfa.loc[dfa.label == labelname]
dfbgp = dfb.loc[dfb.label == labelname]
dfa = dfa.assign(agreement_1 = [None]*(dfa.shape[0]))  # agmtb
dfb = dfb.assign(agreement_0 = [None]*(dfb.shape[0]))  # agmta
merged = removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged)

(8542, 8)
(5245, 8)
[39017, 24138]
(98332, 8)
(98330, 8)


Keep only one annotation from each group of overlapping (excluding matching) annotations between annotators 0 and 2

In [25]:
labelname = "Gendered-Pronoun"
dfa, dfb = merged0, merged2
agmta, agmtb = "agreement_0", "agreement_2"
common = findCommonFiles(dfa,dfb)
dfagp = dfa.loc[dfa.label == labelname]
dfbgp = dfb.loc[dfb.label == labelname]
dfa = dfa.assign(agreement_2 = [None]*(dfa.shape[0]))  # agmtb
dfb = dfb.assign(agreement_0 = [None]*(dfb.shape[0]))  # agmta
merged = removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged)

(1303, 8)
(793, 8)
[]
None to drop!


In [26]:
# G-P overlaps between 1 and 2
labelname = "Gendered-Pronoun"
dfa, dfb = merged1, merged2
agmta, agmtb = "agreement_1", "agreement_2"
common = findCommonFiles(dfa,dfb)
dfagp = dfa.loc[dfa.label == labelname]
dfbgp = dfb.loc[dfb.label == labelname]
dfa = dfa.assign(agreement_2 = [None]*(dfa.shape[0]))  # agmtb
dfb = dfb.assign(agreement_1 = [None]*(dfb.shape[0]))  # agmta
merged = removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged)

(1036, 8)
(518, 8)
[]
None to drop!


##### Occupation
Keep only one annotation from each group of overlapping (excluding matching) annotations between annotators 0 and 3

In [27]:
# Occupation overlaps between 0 and 3
labelname = "Occupation"
dfa, dfb = merged0, merged3
agmta, agmtb = "agreement_0", "agreement_3"
common = findCommonFiles(dfa,dfb)
dfagp = dfa.loc[dfa.label == labelname]
dfbgp = dfb.loc[dfb.label == labelname]
dfa = dfa.assign(agreement_3 = [None]*(dfa.shape[0]))  # agmtb
dfb = dfb.assign(agreement_0 = [None]*(dfb.shape[0]))  # agmta
merged = removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged)

(3941, 8)
(2244, 8)
[62048, 62049, 62057, 63349, 63351, 62977, 63160, 63163, 63415, 62073, 63524, 63525, 63526, 63533, 63013, 62082, 62083, 63841, 63856, 62013, 62014, 62225, 62006, 62699, 62705, 63914, 62666, 62667, 62551, 62851, 62856, 63106, 63107, 63112, 63114, 63174, 63479, 62256, 63605, 62211, 62212, 62213, 62221, 61867, 61740, 63794, 62605, 62608, 62613, 62618, 62623, 62901, 62904, 62905, 62908, 62291, 62295, 63624, 63625, 63626, 63631, 63634, 62253, 63500, 63502, 63503, 63504, 63505, 61735, 62259, 62260, 62261, 62263, 62265, 63082, 62815, 63071, 63073, 63311, 62176, 63127, 63139, 63772, 62848, 63386, 63387, 63381, 61807, 62686, 62687, 62688, 63862, 63863, 63866, 63539, 63729, 63732, 63735, 64015, 63020, 63021, 63887, 62079, 63120, 63121, 62781, 62115, 62119, 62120, 63519, 63324, 62003, 62004, 62894, 62530, 62133, 62134, 61819, 61822, 61824, 63090, 63091, 63939, 63941, 61939, 63289, 63290, 63142, 61770, 61774, 61775, 63045, 63048, 63049, 63060, 61991, 61992, 63357, 63665, 63666,

In [28]:
# Occupation overlaps between 0 and 4
labelname = "Occupation"
dfa, dfb = merged0, merged4
agmta, agmtb = "agreement_0", "agreement_4"
common = findCommonFiles(dfa,dfb)
dfagp = dfa.loc[dfa.label == labelname]
dfbgp = dfb.loc[dfb.label == labelname]
dfa = dfa.assign(agreement_4 = [None]*(dfa.shape[0]))  # agmtb
dfb = dfb.assign(agreement_0 = [None]*(dfb.shape[0]))  # agmta
merged = removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged)

(1442, 8)
(828, 8)
[65102, 65261, 65262, 65091, 64645, 64649, 64058, 64059, 64388, 64389, 64405, 64411, 64416, 64419, 65043, 64492, 64791, 65112, 65391, 65393, 65056, 65018, 65114, 65118, 65122, 65123, 65124, 65127, 65128, 65131, 65134, 65135, 65137, 65138, 65140, 65146, 65149, 65150, 65151, 64540, 64035, 64036, 64037, 64760, 65363, 64045, 64766, 64771, 64773, 65317, 65318, 65322, 65437, 65438, 65441, 65442, 65583, 64121, 64174, 64178, 64209, 64267, 64301, 64314, 64325, 64362, 64365, 64603, 64680, 65065, 65679, 64703, 64704, 64705, 64499, 64913, 64688, 64762, 64876, 64899, 64903, 65607, 64880, 64888, 64891, 65513, 65516, 65520, 65510, 65088, 64783, 64784, 64526, 64852, 65713, 65301, 65302, 65167, 65168, 65685, 65689, 64612, 64552, 64553, 64554, 64555, 64556]
(98057, 8)
(97950, 8)


In [29]:
# Occupation overlaps between 3 and 4
labelname = "Occupation"
dfa, dfb = merged3, merged4
agmta, agmtb = "agreement_3", "agreement_4"
common = findCommonFiles(dfa,dfb)
dfagp = dfa.loc[dfa.label == labelname]
dfbgp = dfb.loc[dfb.label == labelname]
dfa = dfa.assign(agreement_4 = [None]*(dfa.shape[0]))  # agmtb
dfb = dfb.assign(agreement_3 = [None]*(dfb.shape[0]))  # agmta
merged = removeOverlapsFromMerged(labelname, dfa, dfb, agmta, agmtb, dfagp, dfbgp, merged)

(838, 8)
(472, 8)
[65261, 64649, 64058, 64404, 64416, 64417, 65114, 65118, 65122, 65123, 65124, 65127, 65128, 65131, 65135, 65140, 65146, 65149, 65150, 64760, 64773, 65317, 65318, 65437, 65438, 65440, 65583, 65305, 64121, 64174, 64254, 64267, 64314, 64325, 64362, 64381, 64899, 64903, 64884, 64888, 65513, 65516, 65520, 65521, 64529, 65277, 64851, 64852, 64853, 65302, 65685, 64612, 64552]
(97950, 8)
(97938, 8)


In [30]:
merged.tail()  # Looks good!

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
2340,Coll-1434_20400.ann,"(5497, 5524)",a group of young Khond boys,6559,T25,Stereotype,4,Contextual
2351,Coll-146_30100.ann,"(2242, 2247)",Woman,6575,T8,Stereotype,4,Contextual
2353,Coll-1434_19600.ann,"(4129, 4134)",a man,6582,T6,Stereotype,4,Contextual
2354,Coll-1434_19600.ann,"(528, 544)",a group of women,6584,T8,Stereotype,4,Contextual
2355,Coll-146_16400.ann,"(2134, 2139)",a boy,6586,T2,Stereotype,4,Contextual


Now combine the data in the `merged` DataFrame, which contains annotation data chosen from disagreements as well as gendered pronouns and occupations determined to be valid and not overlapping (matches may still be present), with annotation data in the `overlaps_for_merged.csv` file, which contains annotations chosen from agreements (matched, enveloped, and overlapped among two or more annotators).

In [31]:
# Drop remove, start, end cols from merged
overlaps_for_merged = pd.read_csv("overlaps_for_merged.csv", index_col=0)
overlaps_for_merged = overlaps_for_merged.drop(labels=["remove","start","end"], axis=1)
overlaps_for_merged.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
2260,Coll-1462_00100.ann,"(1143, 1156)",granddaughter,3339,T3,Gendered-Role,0,Linguistic
10439,Coll-1434_16000.ann,"(2393, 2397)",boys,15310,T34,Gendered-Role,0,Linguistic
18422,Coll-1109_00100.ann,"(480, 492)",Lord Provost,26879,T0,Gendered-Role,0,Linguistic
13530,Coll-1383_00600.ann,"(8394, 8400)",Master,19689,T137,Gendered-Role,0,Linguistic
11139,Coll-1434_09700.ann,"(559, 565)",Prince,16369,T14,Gendered-Role,0,Linguistic


In [32]:
# Combine the DataFrames
aggregated = merged.append(overlaps_for_merged)
assert aggregated.shape[0] == merged.shape[0] + overlaps_for_merged.shape[0]
# Remove rows where all column values are duplicated (e.g., the same annotation by the same annotator appears more than once)
aggregated = aggregated.astype({'file':str, 'offsets':str,'text':str, 'id':int, 'entity':str, 'label':str, 'annotator':int, 'category':str})
aggregated = aggregated.drop_duplicates()
print(aggregated.shape)  # (Number of rows, Number of columns)

(76543, 8)


In [33]:
print("Rows per Annotator")
print("- 0:",aggregated.loc[aggregated.annotator == 0].shape[0])
print("- 1:",aggregated.loc[aggregated.annotator == 1].shape[0])
print("- 2:",aggregated.loc[aggregated.annotator == 2].shape[0])
print("- 3:",aggregated.loc[aggregated.annotator == 3].shape[0])
print("- 4:",aggregated.loc[aggregated.annotator == 4].shape[0])

Rows per Annotator
- 0: 31291
- 1: 16595
- 2: 19733
- 3: 4587
- 4: 4337


In [34]:
# Write non-unique aggregated dataset to CSV
aggregated.to_csv("aggregated_with_annotator_col.csv")

In [35]:
# Remove annotator, entity, and id columns
unique_aggregated = aggregated.drop(labels=["id","entity","annotator"], axis=1)
unique_aggregated.head()

Unnamed: 0,file,offsets,text,label,category
0,Coll-1434_11900.ann,"(1954, 1957)",his,Generalization,Linguistic
1,Coll-1397_00100.ann,"(2633, 2638)",Lords,Generalization,Linguistic
2,Coll-1310_00800.ann,"(3703, 3706)",Man,Generalization,Linguistic
3,Coll-1434_14500.ann,"(5782, 5788)",cowboy,Generalization,Linguistic
4,BAI_02300.ann,"(1586, 1596)",shipmaster,Generalization,Linguistic


In [36]:
before = unique_aggregated.shape[0]
print(before)
# Drop duplicate rows (e.g., the same annotation made by different annotators)
unique_aggregated = unique_aggregated.drop_duplicates()
after = unique_aggregated.shape[0]
print(after)
print("Rows Dropped:", before-after)

76543
55260
Rows Dropped: 21283


21,283 rows that were identical annotations made by more than one annotator have been removed from the aggregated dataset, leaving us with 55,260 rows of unique annotations.

In [37]:
# Write final aggregated dataset without duplicates to CSV
unique_aggregated.to_csv("aggregated_final.csv")