# Merge Annotated Datasets for a Gold Standard, part 1

#### To reconcile the differences among the five annotated archival metadata description datasets and create one merged dataset:

  [1.](#1) For the labels listed below, review overlapping annotations with *different* labels and manually determine which label is correct. Then add the correct annotations to a new DataFrame for the gold standard and remove the others from the original annotator DataFrames.
  
**[Linguistic Labels](#linguistic)** - for this category, only one label on a single text span is allowed
* [Investigating Text Spans](#l-spans)
* [Generalization](#gen)
* [Gendered-Role](#g-r)

**[Contextual Labels](#contextual)** - for this category, more than one label on a single text span is allowed
* [Investigating Text Spans](#c-spans)
* [Stereotype](#ste)
* [Omission](#omi)

**[Person-Name Labels](#name)** - for this category, only one label on a single text span is allowed; *only compare annotators 0 and 2 (too many inconsistencies with annotator 1's Person-Name labels)*
* [Investigating Text Spans](#n-spans)
* [Unknown](#unk)
* [Masculine](#mas)
* [Feminine](#fem)
  
***

Import useful libraries:

In [1]:
import pandas as pd
import numpy as np
import string
import csv
import re
import os

Load the data:

In [6]:
# Add an identifier column for ease of dropping rows that have been reviewed
def getIdentifiers(df):
    df = df.reset_index()
    df.rename(columns={"index":"id"}, inplace=True)
    return df

In [38]:
# Person-Name and Linguistic label data
annPL0 = pd.read_csv("OriginalAnnotatorData/labels0PL-Copy1.csv", index_col=0)
annPL0 = getIdentifiers(annPL0)
ann1 = pd.read_csv("OriginalAnnotatorData/labels1-Copy1.csv", index_col=0)
ann1 = getIdentifiers(ann1)
ann2 = pd.read_csv("OriginalAnnotatorData/labels2-Copy1.csv", index_col=0)
ann2 = getIdentifiers(ann2)

# Contextual label data
annC0 = pd.read_csv("OriginalAnnotatorData/labels0C-Copy1.csv", index_col=0)
annC0 = getIdentifiers(annC0)
ann3 = pd.read_csv("OriginalAnnotatorData/labels3-Copy1.csv", index_col=0)
ann3 = getIdentifiers(ann3)
ann4 = pd.read_csv("OriginalAnnotatorData/labels4-Copy1.csv", index_col=0)
ann4 = getIdentifiers(ann4)

# Preview the data
ann4.head()

Unnamed: 0,id,file,entity,label,start,end,text,annotator,category
0,0,Coll-1444_00100.ann,T1,Occupation,715,740,Educational Psychologists,Annotator 4,Contextual
1,1,Coll-1444_00100.ann,T2,Occupation,1664,1676,Psychologist,Annotator 4,Contextual
2,2,Coll-1444_00100.ann,T3,Occupation,2312,2375,researcher at the Godfrey Thomson Unit for Edu...,Annotator 4,Contextual
3,21,BAI_01200.ann,T19,Occupation,2347,2371,Archbishop of Canterbury,Annotator 4,Contextual
4,22,BAI_01200.ann,T20,Omission,2381,2397,Duke of Montrose,Annotator 4,Contextual


Add a `remove` column to each DataFrame for ease of removing annotations to exclude from the gold standard:

In [39]:
dfs = [annC0, ann3, ann4, annPL0, ann1, ann2] 
for df in dfs:
    df["remove"] = ["None"]*df.shape[0]
ann3.tail()  # Looks good!

Unnamed: 0,id,file,entity,label,start,end,text,annotator,category,remove
5120,9383,Coll-1028_00100.ann,T14,Omission,598,605,Baillie,Annotator 3,Contextual,
5121,9384,Coll-1028_00100.ann,T33,Occupation,2572,2580,printers,Annotator 3,Contextual,
5122,9385,Coll-1028_00100.ann,T34,Occupation,2559,2570,translators,Annotator 3,Contextual,
5123,9386,Coll-1028_00100.ann,T35,Omission,2597,2603,Calvin,Annotator 3,Contextual,
5124,9387,Coll-1028_00100.ann,T36,Omission,2536,2542,Calvin,Annotator 3,Contextual,


Turn the `start` and `end` columns into a single `offsets` column for each DataFrame:

In [40]:
for df in dfs:
    start = list(df.start)
    end = list(df.end)
    offsets = list(zip(start,end))
    df["offsets"] = offsets
    df.drop(["start", "end"], axis=1, inplace=True)
ann3.head()  # Looks good!

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
0,4,Coll-1326_00100.ann,T4,Occupation,physicians,Annotator 3,Contextual,,"(403, 413)"
1,5,Coll-1326_00100.ann,T5,Occupation,physicians,Annotator 3,Contextual,,"(538, 548)"
2,6,Coll-1326_00100.ann,T6,Occupation,physician,Annotator 3,Contextual,,"(876, 885)"
3,7,Coll-1326_00100.ann,T7,Occupation,Professor of the Practice of Physic,Annotator 3,Contextual,,"(925, 960)"
4,8,Coll-1326_00100.ann,T8,Occupation,physician,Annotator 3,Contextual,,"(1355, 1364)"


Create an empty DataFrame for the gold standard, to which the correct annotations from the five annotated datasets will be added:

In [48]:
gold = pd.DataFrame(columns=["file", "offsets", "text", "id", "entity", "label", "category", "annotator"])
gold = gold.set_index(["file", "offsets", "text"])
gold

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,category,annotator
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1


<a id="1"></a>
## 1. Overlapping Text Spans Annotated with Different Labels

In [2]:
def investigateTextSpans(df, annotator_name, label_to_review):
    df = df[df.label == label_to_review]
    texts = list(df.text.unique())
    word_counts = []
    for t in texts:
        t_split = t.split(" ")
        word_counts += [len(t_split)]
    print(annotator_name)
    print(" - Average word count in "+label_to_review+" text spans:", np.mean(word_counts))
    print(" - Longest word count in "+label_to_review+" text spans:", np.max(word_counts))
    print(" - Shortest word count in "+label_to_review+" text spans:", np.min(word_counts))
    print(" - Standard deviation for word count in "+label_to_review+" text spans:", np.std(word_counts))

# Create a subset of the first DataFrame with only rows containing the input label, and create
# subsets of the second and third DataFrames with rows that do not contain the input label 
# FOR LINGUISTIC AND PERSON-NAME LABELS ONLY
def createSubsetsToReview(dfA, dfB, dfC, label_to_review):
    sub_dfA = dfA[dfA.label == label_to_review]
    sub_dfA = sub_dfA.set_index(["file", "offsets", "text"])
    sub_dfB = dfB[dfB.label != label_to_review]
    sub_dfB = sub_dfB.set_index(["file", "offsets", "text"])
    # Compare 3 DataFrames for Linguistic labels but only 2 DFs for Person-Name labels
    if dfC is not None:
        sub_dfC = dfC[dfC.label != label_to_review]
        sub_dfC = sub_dfC.set_index(["file", "offsets", "text"])
        return sub_dfA, sub_dfB, sub_dfC
    else:
        return sub_dfA, sub_dfB

def rowsToKeepAndRemove(joined, remove_col_name, id_col_name):
    all_ids = []
    keep = joined[joined[remove_col_name] == "No"] 
    all_ids += list(keep[id_col_name])
    remove = joined[joined[remove_col_name] == "Yes"]
    all_ids += list(remove[id_col_name])
    return keep, remove, all_ids

# Add the rows to keep to the gold DataFrame
def addToGold(sub_df_list, keep_list, gold, annotator_list):
    for sub_df, keep_index, annotator in zip(sub_df_list, keep_list, annotator_list):
        # Drop the helper column in place; ignore it if an earlier call already dropped it
        sub_df.drop("remove", axis=1, inplace=True, errors="ignore")
        for j in list(keep_index):
            # Copy the row so that setting the annotator does not touch sub_df
            to_append = sub_df.loc[[j]].copy()
            to_append["annotator"] = annotator
            # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
            gold = pd.concat([gold, to_append], sort=False)
    return gold

# Drop reviewed rows from original annotator DataFrame
def dropReviewedRows(df, ids_to_drop):
    df = df.set_index("id")
    for identifier in ids_to_drop:
        df.drop(identifier, inplace=True)
    df = df.reset_index()
    return df
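
To illustrate the idea behind `createSubsetsToReview` followed by an inner join, here is a minimal, self-contained sketch on invented data. The file name, offsets, and identifiers are hypothetical and do not come from the annotated datasets. For simplicity it keeps `start` and `end` as separate columns and uses `merge` on columns instead of an index join, so it is an approximation of the pattern rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical toy annotations from two annotators (not from the real datasets)
toy_a = pd.DataFrame({
    "id": [0, 1],
    "file": ["doc1.ann", "doc1.ann"],
    "start": [10, 40],
    "end": [13, 46],
    "text": ["men", "sailor"],
    "label": ["Generalization", "Generalization"],
})
toy_b = pd.DataFrame({
    "id": [7, 8],
    "file": ["doc1.ann", "doc1.ann"],
    "start": [10, 40],
    "end": [13, 46],
    "text": ["men", "sailor"],
    "label": ["Gendered-Role", "Generalization"],
})

label_under_review = "Generalization"
span_key = ["file", "start", "end", "text"]

# Annotator A's rows with the label under review vs. annotator B's rows with any other label
sub_a = toy_a[toy_a.label == label_under_review]
sub_b = toy_b[toy_b.label != label_under_review]

# The inner merge keeps only spans that both annotators marked, but with different labels
disagreements = sub_a.merge(sub_b, on=span_key, how="inner", suffixes=("_a", "_b"))
print(disagreements[["text", "id_a", "label_a", "id_b", "label_b"]])
```

Only the span `men` appears in the result, because the annotators agree on `sailor`. Spans like `men` are the ones that need a manual decision.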

In [3]:
labels = {"Person-Name": ["Unknown", "Masculine", "Feminine", "Nonbinary"], 
          "Linguistic":["Gendered-Role", "Gendered-Pronoun", "Generalization"], 
          "Contextual": ["Occupation", "Omission", "Stereotype", "Empowering"]}

<a id="linguistic"></a>
### Linguistic Labels

<a id="l-spans"></a>
#### Investigating Text Spans

Let's take a look at the text spans labeled with `Generalization`:

In [63]:
# Pass the label explicitly, since label_to_review is not assigned until a later cell
investigateTextSpans(annPL0, "Annotator 0", "Generalization")
investigateTextSpans(ann1, "Annotator 1", "Generalization")
investigateTextSpans(ann2, "Annotator 2", "Generalization")

Annotator 0
 - Average word count in Generalization text spans: 1.1570247933884297
 - Longest word count in Generalization text spans: 4
 - Shortest word count in Generalization text spans: 1
 - Standard deviation for word count in Generalization text spans: 0.5605231391012618
Annotator 1
 - Average word count in Generalization text spans: 1.131578947368421
 - Longest word count in Generalization text spans: 3
 - Shortest word count in Generalization text spans: 1
 - Standard deviation for word count in Generalization text spans: 0.46853931091486456
Annotator 2
 - Average word count in Generalization text spans: 1.3111111111111111
 - Longest word count in Generalization text spans: 3
 - Shortest word count in Generalization text spans: 1
 - Standard deviation for word count in Generalization text spans: 0.7248669524577819


Let's take a look at the text spans annotated with the label `Gendered-Role`:

In [61]:
# Pass the label explicitly, since label_to_review is not assigned until a later cell
investigateTextSpans(annPL0, "Annotator 0", "Gendered-Role")
investigateTextSpans(ann1, "Annotator 1", "Gendered-Role")
investigateTextSpans(ann2, "Annotator 2", "Gendered-Role")

Annotator 0
 - Average word count in Gendered-Role text spans: 1.1346801346801347
 - Longest word count in Gendered-Role text spans: 10
 - Shortest word count in Gendered-Role text spans: 1
 - Standard deviation for word count in Gendered-Role text spans: 0.8136313301841448
Annotator 1
 - Average word count in Gendered-Role text spans: 1.1470588235294117
 - Longest word count in Gendered-Role text spans: 12
 - Shortest word count in Gendered-Role text spans: 1
 - Standard deviation for word count in Gendered-Role text spans: 0.9372159315878841
Annotator 2
 - Average word count in Gendered-Role text spans: 1.056338028169014
 - Longest word count in Gendered-Role text spans: 4
 - Shortest word count in Gendered-Role text spans: 1
 - Standard deviation for word count in Gendered-Role text spans: 0.3088973549219902


<a id="gen"></a>
#### GENERALIZATION

In [13]:
label_to_review = labels["Linguistic"][2]  # "Generalization"

**Review annotator 0's data vs. annotator 1's and annotator 2's data:**

In [6]:
### ANN0 VS. ANN1/ANN2
sub0, sub1, sub2 = createSubsetsToReview(annPL0, ann1, ann2, label_to_review)
# sub0.head() # Looks good

In [4]:
joined01 = sub0.join(sub1, how='inner', lsuffix='_0', rsuffix='_1')
# joined01

In [61]:
joined01["remove_1"] = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"]
joined01["remove_0"] = ["No", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No"]

### Get indices of rows to remove and rows to keep for each annotator
joined = joined01
keep0 = joined[joined.remove_0 == "No"] 
ids0 = list(keep0.id_0)
# print(ids0)
remove0 = joined[joined.remove_0 == "Yes"]
ids0 += list(remove0.id_0)
# print(ids0)
keep1 = joined[joined.remove_1 == "No"]
ids1 = list(keep1.id_1)
remove1 = joined[joined.remove_1 == "Yes"]
ids1 += list(remove1.id_1)

Add the rows to keep (marked as `No` in the `remove` column) to the gold DataFrame:

In [7]:
# Add the rows to keep to the gold DataFrame
annotators = [keep0.index, keep1.index]
sub_dfs = [sub0, sub1]
gold = addToGold(sub_dfs, annotators, gold, [0,1])
gold

In [5]:
joined02 = sub0.join(sub2, how='inner', lsuffix='_0', rsuffix='_2')
# joined02

In [65]:
joined02["remove_2"] = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]
joined02["remove_0"] = ["No", "No", "No", "No", "No", "No", "No", "No", "No", "No"]

### Get indices of rows to remove and rows to keep for each annotator
joined = joined02
keep0 = joined[joined.remove_0 == "No"] 
ids0 += list(keep0.id_0)
remove0 = joined[joined.remove_0 == "Yes"]
ids0 += list(remove0.id_0)
keep2 = joined[joined.remove_2 == "No"]
ids2 = list(keep2.id_2)
remove2 = joined[joined.remove_2 == "Yes"]
ids2 += list(remove2.id_2)

Add the rows to keep (marked as `No` in the `remove` column) to the gold DataFrame:

In [66]:
# Add the rows to keep to the gold DataFrame
annotators = [keep0.index, keep2.index]
sub_dfs = [sub0, sub2]
gold = addToGold(sub_dfs, annotators, gold, [0,2])

Drop all the rows reviewed (`remove[#]` and `keep[#]` variables) from the original annotator DataFrames:

In [67]:
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann1 = dropReviewedRows(ann1, ids1)
ann2 = dropReviewedRows(ann2, ids2)
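
As an aside, the same drop can be expressed without a loop. This is a sketch of an alternative, not the method used above, and `drop_reviewed_rows_vectorized` is a hypothetical name. Unlike `DataFrame.drop`, filtering with `isin` does not raise a `KeyError` when an identifier is already gone, which may or may not be desirable depending on how strictly the review should be tracked.

```python
import pandas as pd

def drop_reviewed_rows_vectorized(df, ids_to_drop):
    """Return a copy of df without the rows whose `id` is in ids_to_drop."""
    return df[~df["id"].isin(ids_to_drop)].reset_index(drop=True)

# Hypothetical annotator DataFrame with an `id` column, as created by getIdentifiers
toy = pd.DataFrame({"id": [5, 6, 7, 8], "label": ["a", "b", "c", "d"]})

# 99 is not present in the DataFrame and is simply ignored
remaining = drop_reviewed_rows_vectorized(toy, [6, 8, 99])
print(remaining)
```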

Write the gold DataFrame to a CSV and rewrite the annotators' CSV files (copies of the originals saved already) so the above steps can be re-run for the remaining labels:

In [68]:
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann1.to_csv("labels1.csv")
ann2.to_csv("labels2.csv")

**Review annotator 1's data vs. annotator 0's and annotator 2's data:**

In [36]:
ann1 = pd.read_csv("labels1.csv", index_col=0)
# ann1.head()
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
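
One thing to keep in mind after this round trip: CSV has no tuple type, so an `offsets` column written as tuples is read back as strings such as `"(746, 757)"`. Joins on `offsets` should still match as long as every DataFrame involved has been through the same round trip, since strings are then compared with strings. If numeric offsets are needed again, a sketch like the following, which uses the standard library's `ast.literal_eval` on invented data, can convert them back.

```python
import ast
import io
import pandas as pd

# Hypothetical single-row DataFrame with a tuple-valued offsets column
toy = pd.DataFrame({"file": ["doc1.ann"], "offsets": [(746, 757)], "text": ["men-at-arms"]})

# Write to an in-memory CSV and read it back, as a stand-in for to_csv/read_csv on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the offsets are plain strings rather than tuples
print(type(reloaded.loc[0, "offsets"]))

# Parse the string form back into a tuple of ints
reloaded["offsets"] = reloaded["offsets"].apply(ast.literal_eval)
start, end = reloaded.loc[0, "offsets"]
print(end - start)
```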

In [25]:
sub1, sub0, sub2 = createSubsetsToReview(ann1, annPL0, ann2, label_to_review)
# sub1.head() # Looks good

In [27]:
joined10 = sub1.join(sub0, how='inner', lsuffix='_1', rsuffix='_0')
joined10

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_1,entity_1,label_1,annotator_1,category_1,remove_1,id_0,entity_0,label_0,annotator_0,category_0,remove_0
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


In [28]:
joined12 = sub1.join(sub2, how='inner', lsuffix='_1', rsuffix='_2')
joined12

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_1,entity_1,label_1,annotator_1,category_1,remove_1,id_2,entity_2,label_2,annotator_2,category_2,remove_2
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Coll-1014_00100.ann,"(746, 757)",men-at-arms,5995,T3,Generalization,Annotator 1,Linguistic,,1399,T1,Gendered-Role,Annotator 2,Linguistic,


In [30]:
joined12["remove_1"] = ["No"]
joined12["remove_2"] = ["Yes"]
### Get indices of rows to remove and rows to keep for each annotator
joined = joined12
keep1 = joined[joined.remove_1 == "No"] 
ids1 = list(keep1.id_1)
remove1 = joined[joined.remove_1 == "Yes"]
ids1 += list(remove1.id_1)
keep2 = joined[joined.remove_2 == "No"]
ids2 = list(keep2.id_2)
remove2 = joined[joined.remove_2 == "Yes"]
ids2 += list(remove2.id_2)

In [37]:
# Add the rows to keep to the gold DataFrame
annotators = [keep1.index, keep2.index]
sub_dfs = [sub1, sub2]
gold = addToGold(sub_dfs, annotators, gold, [1,2])

In [38]:
# Drop reviewed rows from the original annotator DataFrames
ann1 = dropReviewedRows(ann1, ids1)
ann2 = dropReviewedRows(ann2, ids2)

**Review annotator 2's data vs. annotator 0's and annotator 1's data:**

In [39]:
sub2, sub0, sub1 = createSubsetsToReview(ann2, annPL0, ann1, label_to_review)
# sub2.head() # Looks good

In [41]:
joined20 = sub2.join(sub0, how='inner', lsuffix='_2', rsuffix='_0')
joined20

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2,id_0,entity_0,label_0,annotator_0,category_0,remove_0
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


In [42]:
joined21 = sub2.join(sub1, how='inner', lsuffix='_2', rsuffix='_1')
joined21

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2,id_1,entity_1,label_1,annotator_1,category_1,remove_1
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


Write the gold standard and annotator DataFrames to CSVs:

In [43]:
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann1.to_csv("labels1.csv")
ann2.to_csv("labels2.csv")

<a id="g-r"></a>
#### GENDERED-ROLE

In [50]:
label_to_review = labels["Linguistic"][0]
label_to_review

'Gendered-Role'

Load the latest data for the relevant annotators:

In [51]:
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann1 = pd.read_csv("labels1.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)

**Review annotator 0's data vs. annotator 1's and annotator 2's data:**

In [6]:
# Get relevant subsets of data
sub0, sub1, sub2 = createSubsetsToReview(annPL0, ann1, ann2, label_to_review)
# sub2.head() # Looks good

In [9]:
joined01 = sub0.join(sub1, how='inner', lsuffix='_0', rsuffix='_1')
joined01["remove_0"] = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]
joined01["remove_1"] = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No"]
# joined01
### Get indices of rows to remove and rows to keep for each annotator
keep01, remove01, ids01 = rowsToKeepAndRemove(joined01, "remove_0", "id_0")
keep1, remove1, ids1 = rowsToKeepAndRemove(joined01, "remove_1", "id_1")

Add the rows to keep (marked as `No` in the `remove` column) to the gold DataFrame:

In [13]:
# Add the rows to keep to the gold DataFrame
annotators = [keep01.index, keep1.index]
sub_dfs = [sub0, sub1]
gold = addToGold(sub_dfs, annotators, gold, [0,1])
# gold

In [86]:
joined02 = sub0.join(sub2, how='inner', lsuffix='_0', rsuffix='_2')
joined02["remove_0"] = ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes",
                       "Yes", "Yes", "Yes", "No", "Yes", "Yes"]
joined02["remove_2"] = ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No",
                       "No", "No", "No", "Yes", "No", "No"]
# joined02
### Get indices of rows to remove and rows to keep for each annotator
keep02, remove02, ids02 = rowsToKeepAndRemove(joined02, "remove_0", "id_0")
keep2, remove2, ids2 = rowsToKeepAndRemove(joined02, "remove_2", "id_2")

Add the rows to keep (marked as `No` in the `remove` column) to the gold DataFrame:

In [87]:
# Add the rows to keep to the gold DataFrame
annotators = [keep02.index, keep2.index]
sub_dfs = [sub0, sub2]
gold = addToGold(sub_dfs, annotators, gold, [0,2])
gold.shape

(90, 5)

Drop all the rows reviewed (`remove[#]` and `keep[#]` variables) from the original annotator DataFrames:

In [89]:
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids02+ids01)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann1 = dropReviewedRows(ann1, ids1)
ann2 = dropReviewedRows(ann2, ids2)

Write the gold DataFrame to a CSV and rewrite the annotators' CSV files (copies of the originals saved already) so the above steps can be re-run for the remaining labels.

In [90]:
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann1.to_csv("labels1.csv")
ann2.to_csv("labels2.csv")

**Review annotator 1's data vs. annotator 0's and annotator 2's data:**

In [52]:
sub1, sub0, sub2 = createSubsetsToReview(ann1, annPL0, ann2, label_to_review)
# sub1.head() # Looks good

In [53]:
joined10 = sub1.join(sub0, how='inner', lsuffix='_1', rsuffix='_0')
joined10.sort_values(["file", "offsets", "text"], inplace=True)
# joined10
joined10["remove_1"] = ["Yes", "No", "No", "No", "No", "No", "No", "No"]
joined10["remove_0"] = ["No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]
### Get indices of rows to remove and rows to keep for each annotator
joined = joined10
keep1 = joined[joined.remove_1 == "No"] 
ids1 = list(keep1.id_1)
remove1 = joined[joined.remove_1 == "Yes"]
ids1 += list(remove1.id_1)
keep0 = joined[joined.remove_0 == "No"]
ids0 = list(keep0.id_0)
remove0 = joined[joined.remove_0 == "Yes"]
ids0 += list(remove0.id_0)

In [54]:
# Add the rows to keep to the gold DataFrame
annotators = [keep1.index, keep0.index]
sub_dfs = [sub1, sub0]
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
gold = addToGold(sub_dfs, annotators, gold, [1,0])

In [56]:
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann1 = dropReviewedRows(ann1, ids1)

In [58]:
sub1, sub0, sub2 = createSubsetsToReview(ann1, annPL0, ann2, label_to_review)
joined12 = sub1.join(sub2, how='inner', lsuffix='_1', rsuffix='_2')
joined12

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_1,entity_1,label_1,annotator_1,category_1,remove_1,id_2,entity_2,label_2,annotator_2,category_2,remove_2
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


**Review annotator 2's data vs. annotator 0's and annotator 1's data:**

In [59]:
sub2, sub0, sub1 = createSubsetsToReview(ann2, annPL0, ann1, label_to_review)
# sub2.head() # Looks good

In [61]:
joined20 = sub2.join(sub0, how='inner', lsuffix='_2', rsuffix='_0')
# joined20

In [62]:
joined20["remove_2"] = ["No","Yes","Yes","No"]
joined20["remove_0"] = ["Yes","No","No","Yes"]
### Get indices of rows to remove and rows to keep for each annotator
joined = joined20
keep2 = joined[joined.remove_2 == "No"] 
ids2 = list(keep2.id_2)
remove2 = joined[joined.remove_2 == "Yes"]
ids2 += list(remove2.id_2)
keep0 = joined[joined.remove_0 == "No"]
ids0 = list(keep0.id_0)
remove0 = joined[joined.remove_0 == "Yes"]
ids0 += list(remove0.id_0)

In [63]:
# Add the rows to keep to the gold DataFrame
annotators = [keep2.index, keep0.index]
sub_dfs = [sub2, sub0]
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
gold = addToGold(sub_dfs, annotators, gold, [2,0])
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann2 = dropReviewedRows(ann2, ids2)

In [64]:
sub2, sub0, sub1 = createSubsetsToReview(ann2, annPL0, ann1, label_to_review)
joined21 = sub2.join(sub1, how='inner', lsuffix='_2', rsuffix='_1')
joined21

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2,id_1,entity_1,label_1,annotator_1,category_1,remove_1
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


Write the gold standard and annotator DataFrames to CSVs:

In [65]:
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann1.to_csv("labels1.csv")
ann2.to_csv("labels2.csv")

<a id="contextual"></a>
### Contextual Labels
Some words and phrases may have more than one correct Contextual label, so we need new functions that avoid removing annotations where annotators agree on one label but differ on another. We only want to review annotations with the same offsets where no two annotators agree on any label for that offset.

In [4]:
# Create a copy of the input DataFrame that doesn't have rows with the input file/offset combination 
def dropMatchingRows(df, rows_to_drop):
    df = df.set_index(["file", "offsets"])
    sub_df = df
    for row in rows_to_drop:
        sub_df.drop(row, inplace=True)
    sub_df = sub_df.reset_index()
    return sub_df

# Find offsets that match, look for labels that are the same within those matched offsets, and
# then get a list of their file/offset combinations and remove all rows with those file/offset combos. 
def removeOffsetsWithMatchingLabels(dfA, dfB, dfC, label_to_review):
    sub_dfA = dfA.set_index(["file", "offsets", "text", "label"], inplace=False)
    sub_dfB = dfB.set_index(["file", "offsets", "text", "label"], inplace=False)
    sub_dfC = dfC.set_index(["file", "offsets", "text", "label"], inplace=False)
    # Get the rows with matching offsets and labels
    joinedAB = sub_dfA.join(sub_dfB, how='inner', lsuffix='_A', rsuffix='_B')
    joinedAB = joinedAB.reset_index()
    joinedAB = joinedAB.set_index(["file", "offsets"])
    joinedAC = sub_dfA.join(sub_dfC, how='inner', lsuffix='_A', rsuffix='_C')
    joinedAC = joinedAC.reset_index()
    joinedAC = joinedAC.set_index(["file", "offsets"])
    joinedBC = sub_dfB.join(sub_dfC, how='inner', lsuffix='_B', rsuffix='_C')
    joinedBC = joinedBC.reset_index()
    joinedBC = joinedBC.set_index(["file", "offsets"])
    # Create unique lists of the indices to remove
    rowsA_to_remove = list(set(list(joinedAB.index) + list(joinedAC.index)))
    rowsB_to_remove = list(set(list(joinedAB.index) + list(joinedBC.index)))
    rowsC_to_remove = list(set(list(joinedAC.index) + list(joinedBC.index)))
    # Drop those indices to create subsets of the input DataFrames without rows
    # that have these file/offset combinations (meaning at least 2 annotators
    # agreed on a label for those combinations - majority voting)
    sub_dfA = dropMatchingRows(dfA, rowsA_to_remove)
    sub_dfB = dropMatchingRows(dfB, rowsB_to_remove)
    sub_dfC = dropMatchingRows(dfC, rowsC_to_remove)
    return sub_dfA, sub_dfB, sub_dfC

# Create subsets of the DataFrames with matched file/offset combos removed,
# including in the first DataFrame's subset only rows containing the input label and in
# the second and third DataFrames' subsets only rows that do not contain the input label 
def getMismatchedLabels(sub_dfA, sub_dfB, sub_dfC, label_to_review):
    mis_dfA = sub_dfA[sub_dfA.label == label_to_review]
    mis_dfB = sub_dfB[sub_dfB.label != label_to_review]
    mis_dfC = sub_dfC[sub_dfC.label != label_to_review]
    mis_dfA = mis_dfA.set_index(["file", "offsets", "text"])
    mis_dfB = mis_dfB.set_index(["file", "offsets", "text"])
    mis_dfC = mis_dfC.set_index(["file", "offsets", "text"])
    return mis_dfA, mis_dfB, mis_dfC

def createContextualSubsets(dfA, dfB, dfC, label_to_review):
    sub_dfA, sub_dfB, sub_dfC = removeOffsetsWithMatchingLabels(dfA, dfB, dfC, label_to_review)
    sub_dfA_mis, sub_dfB_mis, sub_dfC_mis = getMismatchedLabels(sub_dfA, sub_dfB, sub_dfC, label_to_review)
    return sub_dfA_mis, sub_dfB_mis, sub_dfC_mis
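
The filtering rule these functions implement can be seen on a small invented example. The sketch below is a simplified stand-in, not the notebook's code: it uses two annotators instead of three, `merge` instead of index joins, and hypothetical file names and spans. A span where the annotators share at least one label is excluded from review, even if one of them also applied a second, different label. Only spans with no shared label remain.

```python
import pandas as pd

# Hypothetical Contextual annotations; annotator A put two labels on the first span
toy_a = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann", "doc1.ann"],
    "offsets": ["(5, 12)", "(5, 12)", "(30, 38)"],
    "label": ["Stereotype", "Omission", "Stereotype"],
})
toy_b = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann"],
    "offsets": ["(5, 12)", "(30, 38)"],
    "label": ["Stereotype", "Omission"],
})

span = ["file", "offsets"]

# Spans where both annotators applied the same label at least once
agreed = toy_a.merge(toy_b, on=span + ["label"], how="inner")[span].drop_duplicates()

def without_agreed(df):
    """Keep only rows whose span is absent from the agreed spans."""
    flagged = df.merge(agreed, on=span, how="left", indicator=True)
    return flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

to_review_a = without_agreed(toy_a)
to_review_b = without_agreed(toy_b)
print(to_review_a)
print(to_review_b)
```

Here the span `(5, 12)` drops out entirely because both annotators used `Stereotype` on it, so annotator A's extra `Omission` label is not treated as a disagreement. The span `(30, 38)` stays for manual review.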

<a id="c-spans"></a>
#### Investigating Text Spans
Let's take a look at the text spans that were annotated with Contextual labels:

In [10]:
investigateTextSpans(annC0, "Annotator 0", "Stereotype")
investigateTextSpans(ann3, "Annotator 3", "Stereotype")
investigateTextSpans(ann4, "Annotator 4", "Stereotype")

Annotator 0
 - Average word count in Stereotype text spans: 5.191780821917808
 - Longest word count in Stereotype text spans: 43
 - Shortest word count in Stereotype text spans: 1
 - Standard deviation for word count in Stereotype text spans: 6.640798628005098
Annotator 3
 - Average word count in Stereotype text spans: 6.437070938215103
 - Longest word count in Stereotype text spans: 52
 - Shortest word count in Stereotype text spans: 1
 - Standard deviation for word count in Stereotype text spans: 6.5885135902617575
Annotator 4
 - Average word count in Stereotype text spans: 4.4033613445378155
 - Longest word count in Stereotype text spans: 41
 - Shortest word count in Stereotype text spans: 1
 - Standard deviation for word count in Stereotype text spans: 3.8179569699057603


It looks like the text spans vary in length quite a bit: they can be a single word or up to 52 words! Depending on the annotator, a `Stereotype` text span includes about 4 to 6 words on average.

In [11]:
investigateTextSpans(annC0, "Annotator 0", "Omission")
investigateTextSpans(ann3, "Annotator 3", "Omission")
investigateTextSpans(ann4, "Annotator 4", "Omission")

Annotator 0
 - Average word count in Omission text spans: 2.257348530293941
 - Longest word count in Omission text spans: 21
 - Shortest word count in Omission text spans: 1
 - Standard deviation for word count in Omission text spans: 2.0445836798276504
Annotator 3
 - Average word count in Omission text spans: 3.92688679245283
 - Longest word count in Omission text spans: 45
 - Shortest word count in Omission text spans: 1
 - Standard deviation for word count in Omission text spans: 3.5112353888250656
Annotator 4
 - Average word count in Omission text spans: 1.7441424554826617
 - Longest word count in Omission text spans: 13
 - Shortest word count in Omission text spans: 1
 - Standard deviation for word count in Omission text spans: 1.2544768529731893


The text spans for `Omission` also vary quite a bit, ranging from 1 word to 45 words. Depending on the annotator, the average is about 2 to 4 words, fewer than the average for the `Stereotype` text spans.

In [12]:
investigateTextSpans(annC0, "Annotator 0", "Occupation")
investigateTextSpans(ann3, "Annotator 3", "Occupation")
investigateTextSpans(ann4, "Annotator 4", "Occupation")

Annotator 0
 - Average word count in Occupation text spans: 2.5032
 - Longest word count in Occupation text spans: 14
 - Shortest word count in Occupation text spans: 1
 - Standard deviation for word count in Occupation text spans: 2.0342049454270827
Annotator 3
 - Average word count in Occupation text spans: 2.1424272818455368
 - Longest word count in Occupation text spans: 14
 - Shortest word count in Occupation text spans: 1
 - Standard deviation for word count in Occupation text spans: 1.7683768818798313
Annotator 4
 - Average word count in Occupation text spans: 2.6064814814814814
 - Longest word count in Occupation text spans: 20
 - Shortest word count in Occupation text spans: 1
 - Standard deviation for word count in Occupation text spans: 2.399412472381526


Even the `Occupation` text spans can vary quite a bit, from 1 word up to 20 words! On average they are short, about 2 to 3 words, which is comparable to the `Omission` text spans and shorter than the `Stereotype` text spans.
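
For a side-by-side comparison, the per-label statistics could also be gathered into one table. The following is a sketch on invented rows, assuming the same `annotator`, `label`, and `text` columns as in the annotator DataFrames. Like `investigateTextSpans`, it counts words by splitting each unique text on single spaces. The standard deviation is left out because pandas and NumPy use different defaults for it.

```python
import pandas as pd

# Hypothetical annotations from two annotators (not the real data)
toy = pd.DataFrame({
    "annotator": ["Annotator 0"] * 3 + ["Annotator 3"] * 2,
    "label": ["Occupation", "Occupation", "Omission", "Occupation", "Omission"],
    "text": ["physician", "Professor of Physic", "Duke of Montrose", "printers", "Calvin"],
})

# One row per unique text for each annotator and label
unique_spans = toy.drop_duplicates(subset=["annotator", "label", "text"]).copy()
unique_spans["word_count"] = unique_spans["text"].str.split(" ").str.len()

# Mean, longest, and shortest word count for each label and annotator
summary = unique_spans.groupby(["label", "annotator"])["word_count"].agg(["mean", "max", "min"])
print(summary)
```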

The `Empowering` label was only used by annotator 3.  Let's investigate that annotator's text spans with this label:

In [13]:
investigateTextSpans(ann3, "Annotator 3", "Empowering")

Annotator 3
 - Average word count in Empowering text spans: 9.176470588235293
 - Longest word count in Empowering text spans: 51
 - Shortest word count in Empowering text spans: 1
 - Standard deviation for word count in Empowering text spans: 10.144458319795895


As with the other Contextual labels, `Empowering` labels can vary quite a bit in length, from 1 word to 51 words.

How many instances of `Empowering` annotations are there total?

In [14]:
ann3[ann3.label == "Empowering"].shape[0]

80

I don't think that will be enough to use for training a classifier this time around.

<a id="ste"></a>
#### STEREOTYPE

In [88]:
gold = pd.read_csv("gold_standard.csv", index_col=0)
gold = gold[["file","offsets","text","id","entity","label","annotator","category"]]
gold.set_index(["file","offsets","text"],inplace=True)
gold.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


In [5]:
label_to_review = labels["Contextual"][2]
label_to_review

'Stereotype'

**Review annotator 0 vs. annotator 3 and annotator 4's data:**

In [90]:
sub0, sub3, sub4 = createContextualSubsets(annC0, ann3, ann4, label_to_review)
sub0.head() # Looks good

file,offsets,text,id,entity,label,annotator,category,remove
Coll-1326_00100.ann,"(1484, 1528)",considered to be one of the leading chemists,57,T31,Stereotype,Annotator 0,Contextual,
Coll-1434_12300.ann,"(3373, 3380)",cowboys,224,T25,Stereotype,Annotator 0,Contextual,
Coll-1442_00100.ann,"(1558, 1588)",he married Judith Ann Horrocks,350,T33,Stereotype,Annotator 0,Contextual,
Coll-1434_19900.ann,"(254, 257)",man,369,T12,Stereotype,Annotator 0,Contextual,
Coll-1434_19900.ann,"(369, 372)",man,371,T14,Stereotype,Annotator 0,Contextual,


Manually review the remaining annotations that have matching offsets but mismatched labels. Add the rows to keep to the gold DataFrame and remove the mistaken annotations from the original annotators' DataFrames:

In [91]:
joined03 = sub0.join(sub3, how='inner', lsuffix='_0', rsuffix='_3')
joined03 = joined03.loc[joined03.label_3 != "Occupation"]  # Will clean up occupation labels in later step
joined03  # Keep all labels from both annotators

file,offsets,text,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_3,entity_3,label_3,annotator_3,category_3,remove_3
Coll-1036_00600.ann,"(18688, 18706)","two boys, one girl",17762,T227,Stereotype,Annotator 0,Contextual,,5011,T19,Omission,Annotator 3,Contextual,
Coll-1434_12800.ann,"(4762, 4775)",farmer's wife,25991,T38,Stereotype,Annotator 0,Contextual,,7522,T21,Omission,Annotator 3,Contextual,
Coll-1434_12800.ann,"(4721, 4734)",farmer's wife,25990,T37,Stereotype,Annotator 0,Contextual,,7521,T20,Omission,Annotator 3,Contextual,
Coll-1308_00100.ann,"(637, 661)",daughter of Isaac Taylor,14791,T22,Stereotype,Annotator 0,Contextual,,4271,T15,Omission,Annotator 3,Contextual,
Coll-1357_00100.ann,"(4252, 4284)",daughter of the Rev. J. S. Whale,23058,T55,Stereotype,Annotator 0,Contextual,,6625,T32,Omission,Annotator 3,Contextual,
Coll-1434_12800.ann,"(4822, 4830)",his wife,25999,T45,Stereotype,Annotator 0,Contextual,,7520,T19,Omission,Annotator 3,Contextual,
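The join used above can be tried out on toy data. This is a minimal sketch, assuming both subsets are indexed by `file`, `offsets`, and `text` as in the tables shown here; the file names and rows are made up for illustration:

```python
import pandas as pd

index_cols = ["file", "offsets", "text"]

# Toy subset for the annotator under review
a = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann", "doc2.ann"],
    "offsets": ["(0, 8)", "(10, 13)", "(5, 11)"],
    "text": ["his wife", "man", "cowboy"],
    "label": ["Stereotype", "Stereotype", "Stereotype"],
}).set_index(index_cols)

# Toy subset for a second annotator
b = pd.DataFrame({
    "file": ["doc1.ann", "doc2.ann", "doc3.ann"],
    "offsets": ["(0, 8)", "(5, 11)", "(1, 4)"],
    "text": ["his wife", "cowboy", "men"],
    "label": ["Omission", "Occupation", "Omission"],
}).set_index(index_cols)

# An inner join on the shared index keeps only spans both annotators labeled;
# the suffixes tell the two annotators' columns apart
joined = a.join(b, how="inner", lsuffix="_a", rsuffix="_b")

# Set Occupation overlaps aside, as the notebook does
joined = joined.loc[joined["label_b"] != "Occupation"]
print(joined)
```

Only the `his wife` span survives: `man` and `men` were each labeled by one annotator only, and the `cowboy` overlap is filtered out as an `Occupation` label.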


In [92]:
joined03["remove_0"] = ["No"] * (joined03.shape[0])
joined03["remove_3"] = ["No"] * (joined03.shape[0])
## Get indices of rows to remove and rows to keep for each annotator
keep0_a, remove0_a, ids0_a = rowsToKeepAndRemove(joined03, "remove_0", "id_0")
keep3, remove3, ids3 = rowsToKeepAndRemove(joined03, "remove_3", "id_3")
# print(ids0) # Looks good
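`rowsToKeepAndRemove` is defined earlier in this notebook and is not repeated here. As a rough illustration of the idea only, the hypothetical `split_keep_remove` below splits a joined DataFrame on its `remove` column. It assumes that rows to discard are marked `"Yes"` and that the returned identifiers cover every reviewed row, kept or removed; both are assumptions rather than the notebook's confirmed behavior:

```python
import pandas as pd

def split_keep_remove(joined, remove_col, id_col):
    """Hypothetical stand-in for rowsToKeepAndRemove (assumed behavior)."""
    remove_mask = joined[remove_col] == "Yes"  # assumed marker for rows to discard
    keep = joined.loc[~remove_mask]
    remove = joined.loc[remove_mask]
    reviewed_ids = list(joined[id_col])  # every reviewed row, kept or removed
    return keep, remove, reviewed_ids

# Toy joined DataFrame with one row marked for removal
toy_joined = pd.DataFrame({
    "id_0": [101, 102, 103],
    "label_0": ["Stereotype"] * 3,
    "remove_0": ["No", "Yes", "No"],
})
keep, remove, reviewed_ids = split_keep_remove(toy_joined, "remove_0", "id_0")
print(keep.shape, remove.shape, reviewed_ids)
```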

In [93]:
print(gold.shape)
# Add the rows to keep to the gold DataFrame, making sure the indices are aligned
annotators = [keep0_a.index, keep3.index]
sub_dfs = [sub0, sub3]
gold = addToGold(sub_dfs, annotators, gold, [0,3])
print(gold.shape)

(357, 5)
(369, 5)
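`addToGold` is also defined earlier in the notebook. The hypothetical `add_rows` below sketches one way such a step could work, assuming it selects the kept rows from each annotator's subset by index, drops the `remove` column, records the annotator number, and appends the rows to the gold DataFrame:

```python
import pandas as pd

def add_rows(gold, sub_dfs, keep_indexes, annotator_ids):
    """Hypothetical stand-in for addToGold (assumed behavior)."""
    pieces = [gold]
    for sub, idx, ann in zip(sub_dfs, keep_indexes, annotator_ids):
        # Select the kept spans by their (file, offsets, text) index
        rows = sub.loc[sub.index.isin(list(idx))]
        rows = rows.drop(columns=["remove"], errors="ignore").copy()
        rows["annotator"] = ann
        pieces.append(rows)
    return pd.concat(pieces, sort=False)

idx_cols = ["file", "offsets", "text"]
gold_toy = pd.DataFrame({
    "file": ["doc0.ann"], "offsets": ["(0, 3)"], "text": ["his"],
    "label": ["Generalization"], "annotator": [0],
}).set_index(idx_cols)
sub_a = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann"], "offsets": ["(0, 8)", "(10, 13)"],
    "text": ["his wife", "man"], "label": ["Stereotype", "Stereotype"],
    "remove": [None, None],
}).set_index(idx_cols)
sub_b = pd.DataFrame({
    "file": ["doc1.ann"], "offsets": ["(0, 8)"], "text": ["his wife"],
    "label": ["Omission"], "remove": [None],
}).set_index(idx_cols)

# Spans labeled by both annotators
shared = sub_a.index.intersection(sub_b.index)
new_gold = add_rows(gold_toy, [sub_a, sub_b], [shared, shared], [0, 3])
print(gold_toy.shape, new_gold.shape)
```

Printing the shape before and after, as the cell above does, is a cheap check that the expected number of rows was added (here, one row per annotator).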


In [94]:
gold.tail()

file,offsets,text,id,entity,label,annotator,category
Coll-1434_12800.ann,"(4762, 4775)",farmer's wife,7522,T21,Omission,3,Contextual
Coll-1434_12800.ann,"(4721, 4734)",farmer's wife,7521,T20,Omission,3,Contextual
Coll-1308_00100.ann,"(637, 661)",daughter of Isaac Taylor,4271,T15,Omission,3,Contextual
Coll-1357_00100.ann,"(4252, 4284)",daughter of the Rev. J. S. Whale,6625,T32,Omission,3,Contextual
Coll-1434_12800.ann,"(4822, 4830)",his wife,7520,T19,Omission,3,Contextual


In [95]:
sub0["remove"] = None  # add 'remove' column back to sub0 DataFrame
sub0.head()  # Looks good

file,offsets,text,id,entity,label,annotator,category,remove
Coll-1326_00100.ann,"(1484, 1528)",considered to be one of the leading chemists,57,T31,Stereotype,Annotator 0,Contextual,
Coll-1434_12300.ann,"(3373, 3380)",cowboys,224,T25,Stereotype,Annotator 0,Contextual,
Coll-1442_00100.ann,"(1558, 1588)",he married Judith Ann Horrocks,350,T33,Stereotype,Annotator 0,Contextual,
Coll-1434_19900.ann,"(254, 257)",man,369,T12,Stereotype,Annotator 0,Contextual,
Coll-1434_19900.ann,"(369, 372)",man,371,T14,Stereotype,Annotator 0,Contextual,


In [96]:
sub4.head()  # Looks good

file,offsets,text,id,entity,label,annotator,category,remove
BAI_01200.ann,"(2381, 2397)",Duke of Montrose,22,T20,Omission,Annotator 4,Contextual,
BAI_01200.ann,"(5450, 5476)",Fowler and Pearse families,27,T26,Omission,Annotator 4,Contextual,
Coll-1434_15200.ann,"(3287, 3299)",office staff,35,T9,Occupation,Annotator 4,Contextual,
Coll-1434_15200.ann,"(2521, 2557)","Premier of Cape Colony, South Africa",37,T7,Occupation,Annotator 4,Contextual,
Coll-1460_00100.ann,"(506, 514)",Mr. Muir,56,T11,Omission,Annotator 4,Contextual,


In [97]:
joined04 = sub0.join(sub4, how='inner', lsuffix='_0', rsuffix='_4')
joined04 = joined04.loc[joined04.label_4 != "Occupation"]  # Will clean up occupations in later step
joined04  # keep label from both annotators

file,offsets,text,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_4,entity_4,label_4,annotator_4,category_4,remove_4
Coll-1434_16700.ann,"(2438, 2446)",his wife,15020,T20,Stereotype,Annotator 0,Contextual,,3131,T2,Omission,Annotator 4,Contextual,


In [98]:
joined04["remove_0"] = ["No"] * (joined04.shape[0])
joined04["remove_4"] = ["No"] * (joined04.shape[0])
# Get indices of rows to remove and rows to keep for each annotator
keep0_b, remove0_b, ids0_b = rowsToKeepAndRemove(joined04, "remove_0", "id_0")
keep4, remove4, ids4 = rowsToKeepAndRemove(joined04, "remove_4", "id_4")

In [99]:
# Add the rows to keep to the gold DataFrame
print(gold.shape)
annotators = [keep0_b.index, keep4.index]
sub_dfs = [sub0, sub4]
gold = addToGold(sub_dfs, annotators, gold, [0,4])
print(gold.shape) # Looks good!

(369, 5)
(371, 5)


In [100]:
gold.tail()

file,offsets,text,id,entity,label,annotator,category
Coll-1308_00100.ann,"(637, 661)",daughter of Isaac Taylor,4271,T15,Omission,3,Contextual
Coll-1357_00100.ann,"(4252, 4284)",daughter of the Rev. J. S. Whale,6625,T32,Omission,3,Contextual
Coll-1434_12800.ann,"(4822, 4830)",his wife,7520,T19,Omission,3,Contextual
Coll-1434_16700.ann,"(2438, 2446)",his wife,15020,T20,Stereotype,0,Contextual
Coll-1434_16700.ann,"(2438, 2446)",his wife,3131,T2,Omission,4,Contextual


In [101]:
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids0_a+ids0_b)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annC0 = dropReviewedRows(annC0, ids0)
ann3 = dropReviewedRows(ann3, ids3)
ann4 = dropReviewedRows(ann4, ids4)
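Dropping reviewed rows by identifier can be done with a boolean mask. The hypothetical `drop_by_id` below is a minimal sketch of that idea, assuming the annotator DataFrame has an `id` column as created by `getIdentifiers`; it is not the notebook's actual `dropReviewedRows`:

```python
import pandas as pd

def drop_by_id(df, reviewed_ids):
    """Hypothetical stand-in for dropReviewedRows (assumed behavior)."""
    # Keep only rows whose id is not among the reviewed ids
    return df.loc[~df["id"].isin(reviewed_ids)]

ann_toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "label": ["Stereotype", "Omission", "Occupation", "Omission"],
})

# Combine ids from two review passes; set() removes duplicates, as in the cell above
ids_a, ids_b = [2, 4], [4]
reviewed = list(set(ids_a + ids_b))

remaining = drop_by_id(ann_toy, reviewed)
print(remaining)
```

`isin` ignores repeated values, so the `set()` step is not strictly needed for the mask, but it keeps the list of identifiers tidy.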

Update the data files:

In [102]:
gold.to_csv("gold_standard.csv")
annC0.to_csv("labels0C.csv")
ann3.to_csv("labels3.csv")
ann4.to_csv("labels4.csv")

**Review annotator 3 vs. annotator 0 and annotator 4's data:**

In [103]:
# gold = pd.read_csv("gold_standard.csv", index_col=0)
# gold = gold[["file","offsets","text","id","entity","label","annotator","category"]]
# gold.set_index(["file","offsets","text"],inplace=True)
# gold.head()

In [104]:
# annC0 = pd.read_csv("labels0C.csv", index_col=0)
# ann3 = pd.read_csv("labels3.csv", index_col=0)
# ann4 = pd.read_csv("labels4.csv", index_col=0)
label_to_review = labels["Contextual"][2]
label_to_review

'Stereotype'

In [105]:
sub3, sub0, sub4 = createContextualSubsets(ann3, annC0, ann4, label_to_review)
sub3.head() # Looks good

file,offsets,text,id,entity,label,annotator,category,remove
Coll-1326_00100.ann,"(1555, 1599)",President of the Royal College of Physicians,13,T13,Stereotype,Annotator 3,Contextual,
Coll-1143_00100.ann,"(1359, 1391)",Faculty of Law Class Merit Lists,22,T5,Stereotype,Annotator 3,Contextual,
Coll-1318_00100.ann,"(2191, 2217)",First Class Honours degree,133,T26,Stereotype,Annotator 3,Contextual,
Coll-1434_13500.ann,"(2714, 2717)",man,173,T30,Stereotype,Annotator 3,Contextual,
Coll-1260_00100.ann,"(1446, 1476)",A parish minister and his wife,183,T7,Stereotype,Annotator 3,Contextual,


In [106]:
joined30 = sub3.join(sub0, how='inner', lsuffix='_3', rsuffix='_0')
joined30 = joined30.loc[joined30.label_0 != "Occupation"]  # Don't review occupations - will do this in another step
joined30  # Keep all annotators' labels

file,offsets,text,id_3,entity_3,label_3,annotator_3,category_3,remove_3,id_0,entity_0,label_0,annotator_0,category_0,remove_0
Coll-1068_00100.ann,"(1224, 1232)",his wife,8491,T23,Stereotype,Annotator 3,Contextual,,28715,T35,Omission,Annotator 0,Contextual,
Coll-1434_01000.ann,"(194, 197)",men,6754,T3,Stereotype,Annotator 3,Contextual,,23521,T12,Omission,Annotator 0,Contextual,
Coll-1434_03100.ann,"(792, 795)",Man,4712,T13,Stereotype,Annotator 3,Contextual,,16768,T6,Omission,Annotator 0,Contextual,
Coll-1434_05800.ann,"(124, 127)",Man,8090,T2,Stereotype,Annotator 3,Contextual,,27512,T3,Omission,Annotator 0,Contextual,
Coll-1434_06600.ann,"(712, 717)",Woman,2614,T15,Stereotype,Annotator 3,Contextual,,8720,T16,Omission,Annotator 0,Contextual,
Coll-1434_11300.ann,"(4977, 4983)",mother,6586,T32,Stereotype,Annotator 3,Contextual,,22845,T46,Omission,Annotator 0,Contextual,
Coll-1434_11300.ann,"(4985, 4991)",father,6587,T33,Stereotype,Annotator 3,Contextual,,22846,T47,Omission,Annotator 0,Contextual,
Coll-1434_11300.ann,"(4996, 4999)",son,6588,T34,Stereotype,Annotator 3,Contextual,,22847,T48,Omission,Annotator 0,Contextual,


In [107]:
joined30["remove_3"] = ["No"]*(joined30.shape[0])
joined30["remove_0"] = ["No"]*(joined30.shape[0])
# Get indices of rows to remove and rows to keep for each annotator
keep3_a, remove3_a, ids3_a = rowsToKeepAndRemove(joined30, "remove_3", "id_3")
keep0, remove0, ids0 = rowsToKeepAndRemove(joined30, "remove_0", "id_0")

In [108]:
print(gold.shape)
# Add the rows to keep to the gold DataFrame
annotators = [keep3_a.index, keep0.index]
sub_dfs = [sub3, sub0]
gold = addToGold(sub_dfs, annotators, gold, [3,0])
print(gold.shape)  # Looks good

(371, 5)
(387, 5)


In [109]:
gold.tail() # Looks good

file,offsets,text,id,entity,label,annotator,category
Coll-1434_05800.ann,"(124, 127)",Man,27512,T3,Omission,0,Contextual
Coll-1434_06600.ann,"(712, 717)",Woman,8720,T16,Omission,0,Contextual
Coll-1434_11300.ann,"(4977, 4983)",mother,22845,T46,Omission,0,Contextual
Coll-1434_11300.ann,"(4985, 4991)",father,22846,T47,Omission,0,Contextual
Coll-1434_11300.ann,"(4996, 4999)",son,22847,T48,Omission,0,Contextual


In [112]:
sub3["remove"] = None  # add remove column back to sub3 DataFrame
sub3.head()

file,offsets,text,id,entity,label,annotator,category,remove
Coll-1326_00100.ann,"(1555, 1599)",President of the Royal College of Physicians,13,T13,Stereotype,Annotator 3,Contextual,
Coll-1143_00100.ann,"(1359, 1391)",Faculty of Law Class Merit Lists,22,T5,Stereotype,Annotator 3,Contextual,
Coll-1318_00100.ann,"(2191, 2217)",First Class Honours degree,133,T26,Stereotype,Annotator 3,Contextual,
Coll-1434_13500.ann,"(2714, 2717)",man,173,T30,Stereotype,Annotator 3,Contextual,
Coll-1260_00100.ann,"(1446, 1476)",A parish minister and his wife,183,T7,Stereotype,Annotator 3,Contextual,


In [113]:
joined34 = sub3.join(sub4, how='inner', lsuffix='_3', rsuffix='_4')
joined34 = joined34.loc[joined34.label_4 != "Occupation"]  # Don't review occupations yet - will do this in another step
joined34

file,offsets,text,id_3,entity_3,label_3,annotator_3,category_3,remove_3,id_4,entity_4,label_4,annotator_4,category_4,remove_4
Coll-1054_00100.ann,"(3805, 3820)",His second wife,2366,T32,Stereotype,Annotator 3,Contextual,,2040,T33,Omission,Annotator 4,Contextual,
Coll-1028_00100.ann,"(1837, 1862)",David H. Stam was married,9374,T23,Stereotype,Annotator 3,Contextual,,5467,T23,Omission,Annotator 4,Contextual,
Coll-1054_00100.ann,"(3013, 3021)",His wife,2363,T29,Stereotype,Annotator 3,Contextual,,2035,T28,Omission,Annotator 4,Contextual,
Coll-1057_00300.ann,"(962, 980)",his daughter Maria,1118,T4,Stereotype,Annotator 3,Contextual,,1190,T7,Omission,Annotator 4,Contextual,
Coll-1054_00100.ann,"(2460, 2468)",His wife,2361,T27,Stereotype,Annotator 3,Contextual,,2033,T26,Omission,Annotator 4,Contextual,


In [115]:
joined34["remove_3"] = ["No"]*(joined34.shape[0])
joined34["remove_4"] = ["No"]*(joined34.shape[0])
# Get indices of rows to remove and rows to keep for each annotator
keep3_b, remove3_b, ids3_b = rowsToKeepAndRemove(joined34, "remove_3", "id_3")
keep4, remove4, ids4 = rowsToKeepAndRemove(joined34, "remove_4", "id_4")
print(keep3_b.shape)   # Looks good
print(remove4.shape) # Looks good

(5, 12)
(0, 12)


In [116]:
print(gold.shape)
# Add the rows to keep to the gold DataFrame
annotators = [keep3_b.index, keep4.index]
sub_dfs = [sub3, sub4]
gold = addToGold(sub_dfs, annotators, gold, [3,4])
print(gold.shape)  # Looks good

(387, 5)
(397, 5)


In [117]:
gold.loc[gold.annotator == 4].head()  # Looks good

file,offsets,text,id,entity,label,annotator,category
Coll-1434_16700.ann,"(2438, 2446)",his wife,3131,T2,Omission,4,Contextual
Coll-1054_00100.ann,"(3805, 3820)",His second wife,2040,T33,Omission,4,Contextual
Coll-1028_00100.ann,"(1837, 1862)",David H. Stam was married,5467,T23,Omission,4,Contextual
Coll-1054_00100.ann,"(3013, 3021)",His wife,2035,T28,Omission,4,Contextual
Coll-1057_00300.ann,"(962, 980)",his daughter Maria,1190,T7,Omission,4,Contextual


In [119]:
gold.loc[gold.annotator == 3].tail()  # Looks good

file,offsets,text,id,entity,label,annotator,category
Coll-1054_00100.ann,"(3805, 3820)",His second wife,2366,T32,Stereotype,3,Contextual
Coll-1028_00100.ann,"(1837, 1862)",David H. Stam was married,9374,T23,Stereotype,3,Contextual
Coll-1054_00100.ann,"(3013, 3021)",His wife,2363,T29,Stereotype,3,Contextual
Coll-1057_00300.ann,"(962, 980)",his daughter Maria,1118,T4,Stereotype,3,Contextual
Coll-1054_00100.ann,"(2460, 2468)",His wife,2361,T27,Stereotype,3,Contextual


In [120]:
# Drop reviewed rows from the original annotator DataFrames
ids3 = list(set(ids3_a+ids3_b)) # Make sure there aren't any duplicated identifiers in the list of annotator 3's identifiers 
annC0 = dropReviewedRows(annC0, ids0)
ann3 = dropReviewedRows(ann3, ids3)
ann4 = dropReviewedRows(ann4, ids4)

In [121]:
# Update the data files
gold.to_csv("gold_standard.csv")
annC0.to_csv("labels0C.csv")
ann3.to_csv("labels3.csv")
ann4.to_csv("labels4.csv")

**Review annotator 4 vs. annotator 0 and annotator 3's data:**

In [8]:
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
# gold = gold[["file","offsets","text","id","entity","label","annotator","category"]]
# gold.set_index(["file","offsets","text"],inplace=True)
gold.head()

file,offsets,text,id,entity,label,annotator,category
Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic
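The two ways of reloading the gold standard used in this notebook (`index_col=0` followed by `set_index`, versus `index_col=[0,1,2]`) can be compared on toy data. This sketch writes to an in-memory string instead of a file, and it assumes the offsets are stored as strings such as `"(0, 8)"` once they have been through a CSV file:

```python
import ast
import io
import pandas as pd

toy_gold = pd.DataFrame({
    "file": ["doc1.ann", "doc2.ann"],
    "offsets": ["(0, 8)", "(5, 11)"],
    "text": ["his wife", "cowboy"],
    "label": ["Omission", "Generalization"],
}).set_index(["file", "offsets", "text"])

csv_text = toy_gold.to_csv()  # to_csv returns a string when no path is given

# index_col=0 restores only the first level; the other two come back as columns
partial = pd.read_csv(io.StringIO(csv_text), index_col=0)

# index_col=[0, 1, 2] restores the full three-level index in one step
full = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2])

# Offsets are read back as strings; ast.literal_eval turns one into a tuple of ints
first_offsets = ast.literal_eval(full.index[0][1])
print(list(partial.columns), list(full.index.names), first_offsets)
```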


In [11]:
annC0 = pd.read_csv("labels0C.csv", index_col=0)
ann3 = pd.read_csv("labels3.csv", index_col=0)
ann4 = pd.read_csv("labels4.csv", index_col=0)
annC0.head()  # All look good

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
0,6,Coll-1444_00100.ann,T7,Omission,M.Ed,Annotator 0,Contextual,,"(444, 448)"
1,8,Coll-1444_00100.ann,T9,Occupation,Educational Psychologists,Annotator 0,Contextual,,"(715, 740)"
2,16,Coll-1444_00100.ann,T17,Occupation,Psychologist,Annotator 0,Contextual,,"(1664, 1676)"
3,20,Coll-1444_00100.ann,T21,Omission,Bell,Annotator 0,Contextual,,"(2065, 2069)"
4,23,Coll-1444_00100.ann,T24,Occupation,researcher at the Godfrey Thomson Unit for Edu...,Annotator 0,Contextual,,"(2312, 2375)"


In [12]:
label_to_review = labels["Contextual"][2]
label_to_review

'Stereotype'

In [15]:
sub4, sub0, sub3 = createContextualSubsets(ann4, annC0, ann3, label_to_review)
sub3.head() # All look good

file,offsets,text,id,entity,label,annotator,category,remove
Coll-1326_00100.ann,"(925, 960)",Professor of the Practice of Physic,7,T7,Occupation,Annotator 3,Contextual,
Coll-1320_01900.ann,"(657, 670)",Embryologists,15,T1,Occupation,Annotator 3,Contextual,
Coll-1287_00100.ann,"(172, 180)",Clegyman,25,T2,Occupation,Annotator 3,Contextual,
Coll-1434_12300.ann,"(34, 71)",a man leading a Belgian Gelding horse,43,T3,Omission,Annotator 3,Contextual,
Coll-1434_12300.ann,"(315, 329)",a group of men,44,T4,Omission,Annotator 3,Contextual,


Manually review the remaining annotations that have matching offsets but mismatched labels. Add the rows to keep to the gold DataFrame and remove the mistaken annotations from the original annotators' DataFrames:

In [16]:
joined40 = sub4.join(sub0, how='inner', lsuffix='_4', rsuffix='_0')
joined40 = joined40.loc[joined40.label_0 != "Occupation"]  # Will clean up occupation labels in later step
print(joined40.shape)

(336, 12)


That's a lot of rows to review, so let's export these to review in MS Excel.

In [28]:
# joined40.to_csv("stereotype40.csv")

In [18]:
joined40 = pd.read_csv("stereotype40.csv")
# joined40.head()  # Looks good
# Get indices of rows to remove and rows to keep for each annotator
keep4_a, remove4_a, ids4_a = rowsToKeepAndRemove(joined40, "remove_4", "id_4")
keep0, remove0, ids0 = rowsToKeepAndRemove(joined40, "remove_0", "id_0")
print(len(ids0))  # Looks good

336
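Because the `remove` columns were filled in by hand in a spreadsheet, it may be worth checking them before using the reloaded file. The sketch below is one possible sanity check on toy data; it assumes the reviewer marked each row with `Yes` or `No`, which is an assumption about the review convention rather than something this notebook states:

```python
import io
import pandas as pd

# Toy stand-in for the manually reviewed CSV file
reviewed_csv = io.StringIO(
    "file,offsets,text,id_4,label_4,remove_4,id_0,label_0,remove_0\n"
    'doc1.ann,"(0, 8)",his wife,11,Stereotype,No,21,Omission,No\n'
    'doc1.ann,"(10, 13)",man,12,Stereotype,Yes,22,Omission,No\n'
)
reviewed = pd.read_csv(reviewed_csv)

allowed = {"Yes", "No"}  # assumed review markers
for col in ["remove_4", "remove_0"]:
    # Every row should have a decision, and only the expected markers should appear
    assert reviewed[col].notna().all(), f"Unreviewed rows remain in {col}"
    unexpected = set(reviewed[col].unique()) - allowed
    assert not unexpected, f"Unexpected values in {col}: {unexpected}"

n_removed_4 = int((reviewed["remove_4"] == "Yes").sum())
print(n_removed_4)
```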


Add rows to the gold standard and save the ids reviewed to drop from the annotators' DataFrames:

In [35]:
keep4_a.drop(labels=['id_0','entity_0', 'label_0', 'annotator_0', 'category_0', 'remove_0'], axis=1, inplace=True)
keep4_a.set_index(["file","offsets","text"], inplace=True)
keep4_a.rename(columns={"id_4":"id","entity_4":"entity","label_4":"label","annotator_4":"annotator","category_4":"category","remove_4":"remove"},inplace=True)
keep4_a.drop(labels=["remove"],axis=1,inplace=True)
keep4_a.tail()
new_gold = pd.concat([gold, keep4_a], sort=False)
new_gold.shape # Looks good!
new_gold.tail() # Looks good!
new_gold.to_csv("gold_standard.csv")
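The same drop, rename, and append steps are needed for each annotator's half of a joined DataFrame. A small helper could do this in one place. The hypothetical `select_annotator` below is a sketch, assuming the suffixed column names produced by the joins in this notebook (`id_4`, `label_4`, and so on):

```python
import pandas as pd

def select_annotator(joined, suffix):
    """Keep one annotator's columns from a joined DataFrame and strip the suffix."""
    cols = [c for c in joined.columns if c.endswith(suffix)]
    out = joined[cols].rename(columns=lambda c: c[: -len(suffix)])
    # The remove column is only needed during review
    return out.drop(columns=["remove"], errors="ignore")

idx_cols = ["file", "offsets", "text"]
joined_toy = pd.DataFrame({
    "file": ["doc1.ann"], "offsets": ["(0, 8)"], "text": ["his wife"],
    "id_4": [11], "label_4": ["Stereotype"], "remove_4": ["No"],
    "id_0": [21], "label_0": ["Omission"], "remove_0": ["No"],
}).set_index(idx_cols)
gold_toy = pd.DataFrame({
    "file": ["doc0.ann"], "offsets": ["(0, 3)"], "text": ["his"],
    "id": [1], "label": ["Generalization"],
}).set_index(idx_cols)

rows_4 = select_annotator(joined_toy, "_4")
rows_0 = select_annotator(joined_toy, "_0")

# Append both annotators' rows to the gold standard in a single step
new_gold_toy = pd.concat([gold_toy, rows_4, rows_0], sort=False)
print(new_gold_toy)
```

Concatenating both annotators' rows in one call avoids having to track which intermediate `new_gold` has already been assigned back to `gold`.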

In [37]:
ids4_a = list(keep4_a.id)

In [19]:
print(gold.shape)

(732, 5)


In [30]:
keep0.drop(labels=['id_4', 'entity_4', 'label_4', 'annotator_4','category_4', 'remove_4'], axis=1, inplace=True)
keep0.set_index(["file","offsets","text"], inplace=True)
keep0.rename(columns={"id_0":"id","entity_0":"entity","label_0":"label","annotator_0":"annotator","category_0":"category","remove_0":"remove"},inplace=True)
keep0.drop(labels=["remove"],axis=1,inplace=True)
keep0.head()  # Looks good!
new_gold = pd.concat([gold, keep0], sort=False)
new_gold.shape # Looks good!
new_gold.tail() # Looks good!
new_gold.to_csv("gold_standard.csv")

In [45]:
gold = new_gold

In [53]:
ids0 = list((set(keep0.id)))
# print(ids0)

[22529, 8195, 22532, 22541, 22544, 28764, 28767, 28769, 22627, 28773, 22630, 28792, 22649, 22655, 6336, 6342, 6348, 6355, 6358, 6361, 6368, 8419, 8424, 8426, 6383, 6395, 6399, 6402, 6404, 6408, 6412, 6415, 26906, 26907, 26917, 22864, 22871, 22881, 22888, 367, 370, 22907, 6526, 6534, 393, 6538, 22923, 6543, 8595, 6550, 8598, 6552, 406, 410, 8603, 8609, 8612, 8621, 8629, 22966, 8632, 22970, 8634, 22973, 458, 467, 22998, 23000, 472, 505, 512, 515, 534, 21019, 540, 542, 21024, 16931, 548, 16935, 16938, 16941, 21038, 21040, 16944, 16949, 21046, 14916, 14918, 14923, 14926, 14930, 23125, 14934, 23131, 14941, 14951, 23149, 14959, 14963, 23157, 14969, 23161, 14987, 31371, 14990, 14995, 31379, 14998, 31388, 15005, 31391, 15008, 31396, 15015, 22523, 23211, 23213, 15022, 15027, 23221, 15033, 4819, 4822, 4826, 4836, 4848, 4851, 13045, 4854, 13050, 13052, 8958, 13055, 8962, 13070, 13072, 8977, 13078, 8985, 6938, 8987, 6941, 8991, 6944, 6948, 6950, 6953, 6957, 6959, 6962, 6966, 6971, 25409, 6989, 699

In [25]:
# sub4.head()  # Looks good

In [42]:
joined43 = sub4.join(sub3, how='inner', lsuffix='_4', rsuffix='_3')
joined43 = joined43.loc[joined43.label_3 != "Occupation"]  # Will clean up occupations in later step
# joined43.shape
joined43  # Keep all labels from both annotators

file,offsets,text,id_4,entity_4,label_4,annotator_4,category_4,remove_4,id_3,entity_3,label_3,annotator_3,category_3,remove_3
Coll-1057_00700.ann,"(204, 222)",two female workers,2454,T22,Stereotype,Annotator 4,Contextual,,3144,T11,Omission,Annotator 3,Contextual,
Coll-1057_00700.ann,"(9926, 9945)",unidentified female,2451,T19,Stereotype,Annotator 4,Contextual,,3163,T30,Omission,Annotator 3,Contextual,
Coll-1057_00800.ann,"(10750, 10770)",two unidentified men,2780,T32,Stereotype,Annotator 4,Contextual,,3654,T16,Omission,Annotator 3,Contextual,
Coll-1057_00800.ann,"(1331, 1345)",two young boys,2773,T25,Stereotype,Annotator 4,Contextual,,3642,T2,Omission,Annotator 3,Contextual,
Coll-1057_00800.ann,"(2169, 2189)",two unidentified men,2775,T27,Stereotype,Annotator 4,Contextual,,3645,T7,Omission,Annotator 3,Contextual,
Coll-1057_01000.ann,"(10051, 10070)",an unidentified man,751,T58,Stereotype,Annotator 4,Contextual,,7333,T37,Omission,Annotator 3,Contextual,
Coll-1057_01000.ann,"(6256, 6276)",two unidentified men,738,T48,Stereotype,Annotator 4,Contextual,,7321,T24,Omission,Annotator 3,Contextual,
Coll-1057_01000.ann,"(9909, 9928)",an unidentified man,749,T56,Stereotype,Annotator 4,Contextual,,7331,T35,Omission,Annotator 3,Contextual,
Coll-1057_01000.ann,"(9995, 10015)",two unidentified men,750,T57,Stereotype,Annotator 4,Contextual,,7332,T36,Omission,Annotator 3,Contextual,


In [43]:
joined43["remove_4"] = ["No"]*(joined43.shape[0])
joined43["remove_3"] = ["No"]*(joined43.shape[0])
## Get indices of rows to remove and rows to keep for each annotator
keep4_b, remove4_b, ids4_b = rowsToKeepAndRemove(joined43, "remove_4", "id_4")
keep3, remove3, ids3 = rowsToKeepAndRemove(joined43, "remove_3", "id_3")

In [48]:
# Add the rows to keep to the gold DataFrame
annotators = [keep4_b.index, keep3.index]
sub_dfs = [sub4, sub3]
gold = addToGold(sub_dfs, annotators, gold, [4,3])
print(gold.shape) # Looks good!

(1086, 5)


In [49]:
gold.tail()  # Looks good!

file,offsets,text,id,entity,label,annotator,category
Coll-1057_00800.ann,"(2169, 2189)",two unidentified men,3645,T7,Omission,3,Contextual
Coll-1057_01000.ann,"(10051, 10070)",an unidentified man,7333,T37,Omission,3,Contextual
Coll-1057_01000.ann,"(6256, 6276)",two unidentified men,7321,T24,Omission,3,Contextual
Coll-1057_01000.ann,"(9909, 9928)",an unidentified man,7331,T35,Omission,3,Contextual
Coll-1057_01000.ann,"(9995, 10015)",two unidentified men,7332,T36,Omission,3,Contextual


In [54]:
# Drop reviewed rows from the original annotator DataFrames
ids4 = list(set(ids4_a+ids4_b)) # Make sure there aren't any duplicated identifiers in the list of annotator 4's identifiers 
annC0 = dropReviewedRows(annC0, ids0)
ann3 = dropReviewedRows(ann3, ids3)
ann4 = dropReviewedRows(ann4, ids4)

Update the data files:

In [55]:
gold.to_csv("gold_standard.csv")
annC0.to_csv("labels0C.csv")
ann3.to_csv("labels3.csv")
ann4.to_csv("labels4.csv")

<a id="omi"></a>
#### OMISSION

In [62]:
# Read the data files
# gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
# annC0 = pd.read_csv("labels0C.csv", index_col=0)
# ann3 = pd.read_csv("labels3.csv", index_col=0)
# ann4 = pd.read_csv("labels4.csv", index_col=0)

In [56]:
label_to_review = labels["Contextual"][1]
label_to_review

'Omission'

**Review annotator 0 vs. annotator 3 and annotator 4's data:**

In [60]:
sub0, sub3, sub4 = createContextualSubsets(annC0, ann3, ann4, label_to_review)
# # sub4.head() # All look good

Manually review the remaining annotations that have matching offsets but mismatched labels. Add the rows to keep to the gold DataFrame and remove the mistaken annotations from the original annotators' DataFrames:

In [62]:
joined03 = sub0.join(sub3, how='inner', lsuffix='_0', rsuffix='_3')
joined03 = joined03.loc[joined03.label_3 != "Occupation"]
joined03

file,offsets,text,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_3,entity_3,label_3,annotator_3,category_3,remove_3


Nothing to review!

All of the annotators' annotations are correct here, so we won't remove any rows and will add them all to the gold standard.

In [64]:
joined04 = sub0.join(sub4, how='inner', lsuffix='_0', rsuffix='_4')
joined04 = joined04.loc[joined04.label_4 != "Occupation"]
joined04

file,offsets,text,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_4,entity_4,label_4,annotator_4,category_4,remove_4


Nothing to review!

**Review annotator 3 vs. annotator 0 and annotator 4's data:**

In [65]:
sub3, sub0, sub4 = createContextualSubsets(ann3, annC0, ann4, label_to_review)
sub3.head() # Looks good

file,offsets,text,id,entity,label,annotator,category,remove
Coll-1434_12300.ann,"(34, 71)",a man leading a Belgian Gelding horse,43,T3,Omission,Annotator 3,Contextual,
Coll-1434_12300.ann,"(315, 329)",a group of men,44,T4,Omission,Annotator 3,Contextual,
Coll-1434_12300.ann,"(1123, 1160)",two men standing on the lefthand side,47,T7,Omission,Annotator 3,Contextual,
Coll-1434_12300.ann,"(1441, 1463)",men standing around it,48,T8,Omission,Annotator 3,Contextual,
Coll-1434_12300.ann,"(2356, 2396)","a group of Khond men, women and children",50,T10,Omission,Annotator 3,Contextual,


In [67]:
joined30 = sub3.join(sub0, how='inner', lsuffix='_3', rsuffix='_0')
joined30 = joined30.loc[joined30.label_0 != "Occupation"]
joined30

file,offsets,text,id_3,entity_3,label_3,annotator_3,category_3,remove_3,id_0,entity_0,label_0,annotator_0,category_0,remove_0


In [68]:
joined34 = sub3.join(sub4, how='inner', lsuffix='_3', rsuffix='_4')
joined34 = joined34.loc[joined34.label_4 != "Occupation"]
joined34

file,offsets,text,id_3,entity_3,label_3,annotator_3,category_3,remove_3,id_4,entity_4,label_4,annotator_4,category_4,remove_4


Nothing to review!

**Review annotator 4 vs. annotator 0 and annotator 3's data:**

In [69]:
sub4, sub0, sub3 = createContextualSubsets(ann4, annC0, ann3, label_to_review)
sub4.head() # All look good

file,offsets,text,id,entity,label,annotator,category,remove
BAI_01200.ann,"(2381, 2397)",Duke of Montrose,22,T20,Omission,Annotator 4,Contextual,
BAI_01200.ann,"(5450, 5476)",Fowler and Pearse families,27,T26,Omission,Annotator 4,Contextual,
Coll-1460_00100.ann,"(506, 514)",Mr. Muir,56,T11,Omission,Annotator 4,Contextual,
Coll-1460_00100.ann,"(538, 544)",Isobel,57,T12,Omission,Annotator 4,Contextual,
Coll-1487_00100.ann,"(1099, 1121)","his son, Adam Ferguson",81,T18,Omission,Annotator 4,Contextual,


In [70]:
joined43 = sub4.join(sub3, how='inner', lsuffix='_4', rsuffix='_3')
joined43 = joined43.loc[joined43.label_3 != "Occupation"]
joined43

file,offsets,text,id_4,entity_4,label_4,annotator_4,category_4,remove_4,id_3,entity_3,label_3,annotator_3,category_3,remove_3


In [72]:
joined40 = sub4.join(sub0, how='inner', lsuffix='_4', rsuffix='_0')
joined40 = joined40.loc[joined40.label_0 != "Occupation"]
joined40

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_4,entity_4,label_4,annotator_4,category_4,remove_4,id_0,entity_0,label_0,annotator_0,category_0,remove_0
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


Nothing to review!

<a id="name"></a>
### Person-Name
Load the data for annotators 0 and 2 (annotator 1's Person-Name labels had too many mistakes, so theirs are not included in the gold standard).

In [77]:
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)

<a id="n-spans"></a>
#### Investigating Text Spans

In [78]:
investigateTextSpans(annPL0, "Annotator 0", "Unknown")
investigateTextSpans(ann2, "Annotator 2", "Unknown")

Annotator 0
 - Average word count in Unknown text spans: 2.205065158593558
 - Longest word count in Unknown text spans: 7
 - Shortest word count in Unknown text spans: 1
 - Standard deviation for word count in Unknown text spans: 0.8275106952529653
Annotator 2
 - Average word count in Unknown text spans: 2.1633475050346833
 - Longest word count in Unknown text spans: 10
 - Shortest word count in Unknown text spans: 1
 - Standard deviation for word count in Unknown text spans: 0.6950189305259009


In [79]:
investigateTextSpans(annPL0, "Annotator 0", "Feminine")
investigateTextSpans(ann2, "Annotator 2", "Feminine")

Annotator 0
 - Average word count in Feminine text spans: 2.4794520547945207
 - Longest word count in Feminine text spans: 7
 - Shortest word count in Feminine text spans: 1
 - Standard deviation for word count in Feminine text spans: 1.0888759531317438
Annotator 2
 - Average word count in Feminine text spans: 2.5734767025089607
 - Longest word count in Feminine text spans: 10
 - Shortest word count in Feminine text spans: 1
 - Standard deviation for word count in Feminine text spans: 1.1457318764053608


In [80]:
investigateTextSpans(annPL0, "Annotator 0", "Masculine")
investigateTextSpans(ann2, "Annotator 2", "Masculine")

Annotator 0
 - Average word count in Masculine text spans: 2.511111111111111
 - Longest word count in Masculine text spans: 12
 - Shortest word count in Masculine text spans: 1
 - Standard deviation for word count in Masculine text spans: 1.2286411530192272
Annotator 2
 - Average word count in Masculine text spans: 2.546875
 - Longest word count in Masculine text spans: 12
 - Shortest word count in Masculine text spans: 1
 - Standard deviation for word count in Masculine text spans: 1.2274898393891378


In [82]:
investigateTextSpans(annPL0, "Annotator 0", "Nonbinary")
# investigateTextSpans(ann2, "Annotator 2", "Nonbinary") # ann2 didn't use this label

Annotator 0
 - Average word count in Nonbinary text spans: 1.0
 - Longest word count in Nonbinary text spans: 1
 - Shortest word count in Nonbinary text spans: 1
 - Standard deviation for word count in Nonbinary text spans: 0.0
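`investigateTextSpans` is defined earlier in the notebook and is not reproduced here. As a rough illustration of the statistics it reports, here is a minimal sketch under two assumptions: the annotation DataFrame has `text` and `label` columns, and the standard deviation is the population form (`ddof=0`). The helper name `span_word_stats` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def span_word_stats(df, label):
    """Word-count statistics for all text spans carrying `label` (hypothetical helper)."""
    texts = df.loc[df["label"] == label, "text"].astype(str)
    # Split each span on whitespace and count the resulting tokens
    counts = texts.str.split().str.len().to_numpy()
    return {
        "mean": float(np.mean(counts)),
        "longest": int(np.max(counts)),
        "shortest": int(np.min(counts)),
        "std": float(np.std(counts)),  # population std (ddof=0): an assumption
    }

# Toy rows for illustration only, not taken from the annotation files
toy = pd.DataFrame({
    "text": ["Jane Moore", "Agnes", "Rev Tom Allan", "Mr. Muir"],
    "label": ["Unknown", "Unknown", "Unknown", "Omission"],
})
stats = span_word_stats(toy, "Unknown")
print(stats)
```

Comparing these figures across annotators is a quick way to spot someone who habitually selects longer or shorter spans for the same label.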


<a id="unk"></a>
#### UNKNOWN

In [47]:
# Read the data files
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)

In [48]:
label_to_review = labels["Person-Name"][0]
label_to_review

'Unknown'

**Review annotator 0 vs. annotator 2's data:**

In [27]:
sub0, sub2 = createSubsetsToReview(annPL0, ann2, None, label_to_review)
sub0.head() # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category,remove
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Coll-1444_00100.ann,"(52, 66)",Robert E. Bell,0,T1,Unknown,Annotator 0,Person-Name,
Coll-1444_00100.ann,"(825, 837)",G K Gardiner,9,T10,Unknown,Annotator 0,Person-Name,
Coll-1444_00100.ann,"(1310, 1327)",Alexander Darroch,12,T13,Unknown,Annotator 0,Person-Name,
Coll-1444_00100.ann,"(1525, 1539)",Boris Seminoff,14,T15,Unknown,Annotator 0,Person-Name,
Coll-1444_00100.ann,"(1620, 1638)",Gillian Sutherland,15,T16,Unknown,Annotator 0,Person-Name,


In [28]:
joined02 = sub0.join(sub2, how='inner', lsuffix='_0', rsuffix='_2')
# joined02 # 81 rows to review - that's a lot, so let's focus on one type of disagreement at a time
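The inner join is what surfaces the disagreements: only spans present in both subsets survive, and the suffixes show which annotator each column came from. A minimal sketch with hypothetical toy data, assuming both subsets are indexed by `file`, `offsets`, and `text` as in the outputs above:

```python
import pandas as pd

index_cols = ["file", "offsets", "text"]

# Hypothetical toy annotations, not rows from the real data
a0 = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann"],
    "offsets": ["(0, 8)", "(20, 25)"],
    "text": ["J. Smith", "Agnes"],
    "id": [1, 2],
    "label": ["Unknown", "Unknown"],
}).set_index(index_cols)

a2 = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann"],
    "offsets": ["(0, 8)", "(40, 47)"],
    "text": ["J. Smith", "William"],
    "id": [7, 8],
    "label": ["Masculine", "Masculine"],
}).set_index(index_cols)

# Only the span annotated by both annotators remains after the inner join
joined = a0.join(a2, how="inner", lsuffix="_0", rsuffix="_2")
print(joined)
```

Here only the `J. Smith` span remains, with `label_0` and `label_2` side by side for review.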

Unknown vs. Feminine

In [33]:
joined02_unk_fem = pd.DataFrame(joined02[joined02.label_2 == "Feminine"])
joined02_unk_fem.sort_values(by=["file", "offsets", "text"],inplace=True)
joined02_unk_fem

In [32]:
joined02_unk_fem["remove_0"] = [True,True,True,False,False,False,False,True,False,False,True,True,False,False,False,True,False,True,True,True,False,True,True,True,False,False,True,True,True,False,True,True,True,True]
joined02_unk_fem["remove_2"] = [False, False, False, True, True, True, True, False, True, True, False, False, True, True, True, False, True, False, False, False, True, False, False, False, True, True, False, False, False, True, False, False, False, False]

### Get indices of rows to remove and rows to keep for each annotator
joined = joined02_unk_fem
keep0 = joined[joined.remove_0 == False] 
remove0 = joined[joined.remove_0 == True]
keep2 = joined[joined.remove_2 == False]
remove2 = joined[joined.remove_2 == True]
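Typing two long boolean lists by hand is easy to get wrong. The sketch below assumes that exactly one annotator's label is kept per row, which holds for the lists above, so the second list can be derived from the first. `mark_decisions` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def mark_decisions(joined, remove_first, first="0", second="2"):
    """Attach complementary remove flags for two annotators (hypothetical helper)."""
    if len(remove_first) != len(joined):
        raise ValueError("need exactly one decision per reviewed row")
    out = joined.copy()
    out[f"remove_{first}"] = list(remove_first)
    # Assumption: exactly one label is kept per row, so the flags are opposites
    out[f"remove_{second}"] = [not r for r in remove_first]
    return out

# Toy review table for illustration only
toy = pd.DataFrame({"id_0": [1, 2, 3], "id_2": [7, 8, 9]})
marked = mark_decisions(toy, [True, False, True])
keep0 = marked[~marked["remove_0"]]
keep2 = marked[~marked["remove_2"]]
print(list(keep0.id_0), list(keep2.id_2))
```

If a review ever decides that both labels on a span are wrong, this shortcut no longer applies and both lists would have to be supplied explicitly.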

Unknown vs. Masculine

In [35]:
joined02_unk_mas = pd.DataFrame(joined02[joined02.label_2 == "Masculine"])
joined02_unk_mas.sort_values(by=["file", "offsets", "text"],inplace=True)
joined02_unk_mas

Unknown vs. Nonbinary

In [13]:
joined02_unk_nb = pd.DataFrame(joined02[joined02.label_2 == "Nonbinary"])
joined02_unk_nb

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


*There are no "Nonbinary" labels to compare to "Unknown" labels*

Record the keep/remove decisions for the Unknown vs. Masculine rows and combine them with the Unknown vs. Feminine decisions:

In [31]:
joined02_unk_mas["remove_0"] = [True,True,False,True,True,True,True,False,False,True,True,True,True,True,False,True,True,True,True,True,True,False,True,True,True,False,False,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True]
joined02_unk_mas["remove_2"] = [False, False, True, False, False, False, False, True, True, False, False, False, False, False, True, False, False, False, False, False, False, True, False, False, False, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]

### Get indices of rows to remove and rows to keep for each annotator
joined = joined02_unk_mas
# DataFrame.append was removed in pandas 2.0, so combine with pd.concat instead
keep0_mas = joined[joined.remove_0 == False]
keep0 = pd.concat([keep0, keep0_mas])
ids0 = list(keep0.id_0)
remove0_mas = joined[joined.remove_0 == True]
remove0 = pd.concat([remove0, remove0_mas])
ids0 += list(remove0.id_0)
keep2_mas = joined[joined.remove_2 == False]
keep2 = pd.concat([keep2, keep2_mas])
ids2 = list(keep2.id_2)
remove2_mas = joined[joined.remove_2 == True]
remove2 = pd.concat([remove2, remove2_mas])
ids2 += list(remove2.id_2)

Add the rows to keep (marked as `False` in the `remove` column) to the gold DataFrame:

In [42]:
# Add the rows to keep to the gold DataFrame
annotators = [keep0.index, keep2.index]
sub_dfs = [sub0, sub2]
gold = addToGold(sub_dfs, annotators, gold, [0,2])

Drop all the rows reviewed (`remove[#]` and `keep[#]` variables) from the original annotator DataFrames:

In [35]:
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann2 = dropReviewedRows(ann2, ids2)
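`dropReviewedRows` is defined earlier in the notebook. To show the idea, a minimal sketch follows, assuming the annotator DataFrame carries the identifier in an `id` column; `drop_reviewed_rows` is a hypothetical stand-in rather than the notebook's own function.

```python
import pandas as pd

def drop_reviewed_rows(df, reviewed_ids, id_col="id"):
    """Return a copy of `df` without the rows whose identifier was reviewed."""
    return df[~df[id_col].isin(reviewed_ids)].copy()

# Toy annotator data for illustration only
ann = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "label": ["Unknown", "Masculine", "Unknown", "Feminine"],
})
remaining = drop_reviewed_rows(ann, [11, 13, 13])  # duplicate ids are harmless here
print(remaining)
```

Because `isin` ignores repeated values, de-duplicating the identifier list beforehand is optional in this version.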

Write the gold DataFrame to a CSV and rewrite the annotators' CSV files (copies of the originals saved already) so the above steps can be re-run for the remaining labels:

In [80]:
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann2.to_csv("labels2.csv")

In [45]:
gold.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,category,annotator
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1448_00100.ann,"(1204, 1211)",Thomson,19588,T26,Masculine,Person-Name,2
Coll-1448_00100.ann,"(1319, 1326)",Thomson,19590,T28,Masculine,Person-Name,2
Coll-1448_00100.ann,"(1392, 1399)",Thomson,19591,T29,Masculine,Person-Name,2
Coll-1448_00100.ann,"(951, 958)",Thomson,19583,T21,Masculine,Person-Name,2
Coll-1454_00100.ann,"(687, 695)",Brewster,19682,T35,Masculine,Person-Name,2
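In the `gold.tail()` output above, the span at offset 951 is listed after the spans at 1204 to 1392. One possible explanation is that `offsets` is held as text, as it would be after a CSV round trip, so sorting compares characters rather than numbers. A minimal sketch, assuming the offsets are stored as strings such as `"(951, 958)"`:

```python
import ast
import pandas as pd

# Offsets copied from the gold.tail() output above, held as strings
offsets = pd.Series(["(1204, 1211)", "(1319, 1326)", "(1392, 1399)", "(951, 958)"])

# Sorting the strings compares character by character, so "(9..." sorts last
as_text = offsets.sort_values().tolist()

# Parsing each string into a tuple of ints gives the numeric order instead
as_numbers = sorted(offsets, key=ast.literal_eval)

print(as_text)
print(as_numbers)
```

The order does not change which rows end up in the gold standard, but parsing the offsets first makes a manual review follow the order of the text in each file.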


**Review annotator 2 vs. annotator 0's data:**

In [23]:
gold = pd.read_csv("gold_standard.csv", index_col = [0,1,2])
annPL0 = pd.read_csv("labels0PL.csv", index_col = 0)
ann2 = pd.read_csv("labels2.csv", index_col = 0)
label_to_review = labels["Person-Name"][0]
label_to_review

'Unknown'

In [49]:
sub2, sub0 = createSubsetsToReview(ann2, annPL0, None, label_to_review)
sub2.head() # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category,remove
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AA5_00100.ann,"(43, 63)",Rev Prof James Whyte,7,T10,Unknown,Annotator 2,Person-Name,
AA6_00100.ann,"(658, 665)",William,21,T11,Unknown,Annotator 2,Person-Name,
AA6_00100.ann,"(670, 675)",Agnes,22,T12,Unknown,Annotator 2,Person-Name,
AA6_00100.ann,"(34, 47)",Rev Tom Allan,28,T10,Unknown,Annotator 2,Person-Name,
AA6_00100.ann,"(1057, 1067)",Jane Moore,29,T15,Unknown,Annotator 2,Person-Name,


In [50]:
joined20 = sub2.join(sub0, how='inner', lsuffix='_2', rsuffix='_0')
joined20.shape

(149, 12)

149 rows is a lot! Since that many rows are difficult to review in a Jupyter Notebook, let's export them to a CSV file, review it in MS Excel, and then reload it:

In [51]:
# joined20.to_csv("joined20_person-names.csv")
# After noting which annotator's labels to remove and to keep for each row in the CSV, load the latest version of it:
joined20 = pd.read_csv("joined20_person-names.csv", index_col=["file","offsets","text"])
joined20.sort_values(by=["file","offsets","text"], inplace=True)
joined20.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2,id_0,entity_0,label_0,annotator_0,category_0,remove_0
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Coll-1461_00100.ann,"(2618, 2633)",C.H. Waddington,5772,T41,Unknown,Annotator 2,Person-Name,Yes,2820,T52,Masculine,Annotator 0,Person-Name,No
Coll-1462_00100.ann,"(1178, 1195)",Archibald Kennedy,19812,T26,Unknown,Annotator 2,Person-Name,Yes,3365,T28,Masculine,Annotator 0,Person-Name,No
Coll-1469_00100.ann,"(251, 257)",Ballie,19894,T12,Unknown,Annotator 2,Person-Name,Yes,26465,T12,Masculine,Annotator 0,Person-Name,No
Coll-1469_00100.ann,"(301, 308)",Baillie,19895,T13,Unknown,Annotator 2,Person-Name,Yes,26466,T13,Masculine,Annotator 0,Person-Name,No
Coll-1469_00100.ann,"(697, 704)",Baillie,19896,T14,Unknown,Annotator 2,Person-Name,Yes,26469,T16,Masculine,Annotator 0,Person-Name,No


In [57]:
### Get indices of rows to remove and rows to keep for each annotator
joined = joined20
keep0, remove0 = joined[joined.remove_0 == "No"], joined[joined.remove_0 == "Yes"]
ids0 = list(keep0.id_0) + list(remove0.id_0)
keep2, remove2 = joined[joined.remove_2 == "No"], joined[joined.remove_2 == "Yes"]
ids2 = list(keep2.id_2) + list(remove2.id_2)
# print(ids0)
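Decisions typed into a spreadsheet can come back with stray spaces, mixed case, or blank cells, and a comparison such as `== "No"` silently skips those rows. A minimal sketch of a check that could run before splitting into keep and remove sets; `normalise_decisions` is a hypothetical helper, and the `Yes`/`No` vocabulary is taken from the table above.

```python
import pandas as pd

def normalise_decisions(joined, cols=("remove_0", "remove_2")):
    """Convert Yes/No review columns to booleans, rejecting anything else."""
    mapping = {"yes": True, "no": False}
    out = joined.copy()
    for col in cols:
        cleaned = out[col].astype(str).str.strip().str.lower()
        flags = cleaned.map(mapping)  # unexpected or blank values become NaN
        if flags.isna().any():
            raise ValueError(f"{col} has blank or unexpected values")
        out[col] = flags.astype(bool)
    return out

# Toy review table for illustration only
toy = pd.DataFrame({"remove_2": [" Yes", "no"], "remove_0": ["No", "YES "]})
clean = normalise_decisions(toy)
print(clean)
```

A row left blank during the review then raises an error instead of being dropped from both the keep and remove sets.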

Add the rows to keep (marked as `No` in the `remove` columns) to the gold DataFrame:

In [58]:
# Add the rows to keep to the gold DataFrame
annotators = [keep2.index, keep0.index]
sub_dfs = [sub2, sub0]
gold = addToGold(sub_dfs, annotators, gold, [2,0])

FutureWarning (truncated in this export): "... of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'."


Drop all the rows reviewed (`remove[#]` and `keep[#]` variables) from the original annotator DataFrames:

In [59]:
# Drop reviewed rows from the original annotator DataFrames
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann2 = dropReviewedRows(ann2, ids2)

Write the gold DataFrame to a CSV and rewrite the annotators' CSV files (copies of the originals saved already) so the above steps can be re-run for the remaining labels:

In [60]:
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann2.to_csv("labels2.csv")

In [61]:
gold.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,annotator,category,entity,id,label
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1461_00100.ann,"(2618, 2633)",C.H. Waddington,0,Person-Name,T52,2820,Masculine
Coll-1462_00100.ann,"(1178, 1195)",Archibald Kennedy,0,Person-Name,T28,3365,Masculine
Coll-1469_00100.ann,"(251, 257)",Ballie,0,Person-Name,T12,26465,Masculine
Coll-1469_00100.ann,"(301, 308)",Baillie,0,Person-Name,T13,26466,Masculine
Coll-1469_00100.ann,"(697, 704)",Baillie,0,Person-Name,T16,26469,Masculine
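The columns in this `gold.tail()` output are in alphabetical order, unlike the earlier output (`id`, `entity`, `label`, `category`, `annotator`). That is consistent with a concatenation that sorts columns when the two frames do not share the same column layout. `addToGold` is not shown here, so the sketch below only demonstrates the `sort` parameter of `pd.concat` on hypothetical toy frames:

```python
import pandas as pd

# Toy frames with the same kind of columns in a different layout (illustration only)
gold_toy = pd.DataFrame({"id": [1], "entity": ["T1"], "label": ["Unknown"],
                         "category": ["Person-Name"], "annotator": [0]})
new_rows = pd.DataFrame({"id": [2], "entity": ["T2"], "label": ["Masculine"],
                         "annotator": [2], "category": ["Person-Name"],
                         "remove": [False]})

# sort=True alphabetises the columns when the frames are not aligned
sorted_cols = pd.concat([gold_toy, new_rows], sort=True).columns.tolist()

# sort=False keeps the first frame's order and appends any new columns
stable_cols = pd.concat([gold_toy, new_rows], sort=False).columns.tolist()

print(sorted_cols)
print(stable_cols)
```

Passing `sort=False` keeps the gold standard's column order stable between runs, which makes successive `gold.tail()` outputs easier to compare.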


<a id="mas"></a>
#### MASCULINE

In [62]:
# Read the data files
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)

In [63]:
label_to_review = labels["Person-Name"][1]
label_to_review

'Masculine'

**Review annotator 0 vs. annotator 2's data:**

In [64]:
sub0, sub2 = createSubsetsToReview(annPL0, ann2, None, label_to_review)
sub0.head() # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category,remove
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Coll-1326_00100.ann,"(549, 562)",James Gregory,38,T12,Masculine,Annotator 0,Person-Name,
Coll-1326_00100.ann,"(579, 596)",Daniel Rutherford,39,T13,Masculine,Annotator 0,Person-Name,
Coll-1326_00100.ann,"(886, 899)",James Gregory,42,T16,Masculine,Annotator 0,Person-Name,
Coll-1326_00100.ann,"(1076, 1083)",Gregory,46,T20,Masculine,Annotator 0,Person-Name,
Coll-1326_00100.ann,"(1365, 1382)",Daniel Rutherford,51,T25,Masculine,Annotator 0,Person-Name,


In [66]:
joined02 = sub0.join(sub2, how='inner', lsuffix='_0', rsuffix='_2')
# joined02.shape
joined02

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Coll-1036_00400.ann,"(37638, 37660)","Bantock, Sir Granville",9560,T524,Masculine,Annotator 0,Person-Name,,2125,T404,Feminine,Annotator 2,Person-Name,


In [67]:
joined02["remove_0"] = ["No"]
joined02["remove_2"] = ["Yes"]
### Get indices of rows to remove and rows to keep for each annotator
joined = joined02
keep0, remove0 = joined[joined.remove_0 == "No"], joined[joined.remove_0 == "Yes"]
ids0 = list(keep0.id_0) + list(remove0.id_0)
keep2, remove2 = joined[joined.remove_2 == "No"], joined[joined.remove_2 == "Yes"]
ids2 = list(keep2.id_2) + list(remove2.id_2)
print(ids0)

[9560]


Add the rows to keep (marked as `No` in the `remove` columns) to the gold DataFrame, drop all the reviewed rows (the `remove[#]` and `keep[#]` variables) from the original annotator DataFrames, then write the gold DataFrame to a CSV file and rewrite the annotators' CSV files (copies of the originals have already been saved) so the above steps can be re-run for the remaining labels:

In [68]:
# Add the rows to keep to the gold DataFrame
annotators = [keep0.index, keep2.index]
sub_dfs = [sub0, sub2]
gold = addToGold(sub_dfs, annotators, gold, [0,2])
# Drop reviewed rows
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann2 = dropReviewedRows(ann2, ids2)
# Write data files
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann2.to_csv("labels2.csv")

In [69]:
gold.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,annotator,category,entity,id,label
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1462_00100.ann,"(1178, 1195)",Archibald Kennedy,0,Person-Name,T28,3365,Masculine
Coll-1469_00100.ann,"(251, 257)",Ballie,0,Person-Name,T12,26465,Masculine
Coll-1469_00100.ann,"(301, 308)",Baillie,0,Person-Name,T13,26466,Masculine
Coll-1469_00100.ann,"(697, 704)",Baillie,0,Person-Name,T16,26469,Masculine
Coll-1036_00400.ann,"(37638, 37660)","Bantock, Sir Granville",0,Person-Name,T524,9560,Masculine


**Review annotator 2 vs. annotator 0's data:**

In [71]:
sub2, sub0 = createSubsetsToReview(ann2, annPL0, None, label_to_review)
sub2.head() # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category,remove
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AA5_00100.ann,"(661, 689)",Professor James Aitken White,8,T7,Masculine,Annotator 2,Person-Name,
AA5_00100.ann,"(1032, 1043)",James Whyte,9,T8,Masculine,Annotator 2,Person-Name,
AA5_00100.ann,"(1350, 1361)",James Whyte,10,T9,Masculine,Annotator 2,Person-Name,
AA6_00100.ann,"(1150, 1163)",Rev Tom Allan,23,T13,Masculine,Annotator 2,Person-Name,
AA6_00100.ann,"(1884, 1897)",Rev Tom Allan,25,T17,Masculine,Annotator 2,Person-Name,


In [73]:
joined20 = sub2.join(sub0, how='inner', lsuffix='_2', rsuffix='_0')
# joined20.shape
joined20

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2,id_0,entity_0,label_0,annotator_0,category_0,remove_0
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Coll-1445_00100.ann,"(358, 363)",Cooke,19516,T8,Masculine,Annotator 2,Person-Name,,28252,T8,Feminine,Annotator 0,Person-Name,


In [76]:
joined20["remove_0"] = ["No"]
joined20["remove_2"] = ["Yes"]
### Get indices of rows to remove and rows to keep for each annotator
joined = joined20
keep0, remove0 = joined[joined.remove_0 == "No"], joined[joined.remove_0 == "Yes"]
ids0 = list(keep0.id_0) + list(remove0.id_0)
keep2, remove2 = joined[joined.remove_2 == "No"], joined[joined.remove_2 == "Yes"]
ids2 = list(keep2.id_2) + list(remove2.id_2)
print(ids0,ids2)

[28252] [19516]


Add the rows to keep (marked as `No` in the `remove` columns) to the gold DataFrame, drop all the reviewed rows (the `remove[#]` and `keep[#]` variables) from the original annotator DataFrames, then write the gold DataFrame to a CSV file and rewrite the annotators' CSV files (copies of the originals have already been saved) so the above steps can be re-run for the remaining labels:

In [77]:
# Add the rows to keep to the gold DataFrame
annotators = [keep2.index, keep0.index]
sub_dfs = [sub2, sub0]
gold = addToGold(sub_dfs, annotators, gold, [2,0])
# Drop reviewed rows
ids0 = list(set(ids0)) # Make sure there aren't any duplicated identifiers in the list of annotator 0's identifiers 
annPL0 = dropReviewedRows(annPL0, ids0)
ann2 = dropReviewedRows(ann2, ids2)
# Write data files
gold.to_csv("gold_standard.csv")
annPL0.to_csv("labels0PL.csv")
ann2.to_csv("labels2.csv")
gold.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,annotator,category,entity,id,label
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1469_00100.ann,"(251, 257)",Ballie,0,Person-Name,T12,26465,Masculine
Coll-1469_00100.ann,"(301, 308)",Baillie,0,Person-Name,T13,26466,Masculine
Coll-1469_00100.ann,"(697, 704)",Baillie,0,Person-Name,T16,26469,Masculine
Coll-1036_00400.ann,"(37638, 37660)","Bantock, Sir Granville",0,Person-Name,T524,9560,Masculine
Coll-1445_00100.ann,"(358, 363)",Cooke,0,Person-Name,T8,28252,Feminine


<a id="fem"></a>
#### FEMININE

In [78]:
# Read the data files
gold = pd.read_csv("gold_standard.csv", index_col=[0,1,2])
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)

In [79]:
label_to_review = labels["Person-Name"][2]
label_to_review

'Feminine'

**Review annotator 0 vs. annotator 2's data:**

In [80]:
sub0, sub2 = createSubsetsToReview(annPL0, ann2, None, label_to_review)
sub0.head() # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category,remove
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Coll-1460_00100.ann,"(234, 243)",Katherine,161,T8,Feminine,Annotator 0,Person-Name,
Coll-1460_00100.ann,"(294, 303)",Katherine,163,T10,Feminine,Annotator 0,Person-Name,
Coll-1460_00100.ann,"(402, 411)",Katherine,164,T11,Feminine,Annotator 0,Person-Name,
Coll-1460_00100.ann,"(601, 610)",Katherine,175,T22,Feminine,Annotator 0,Person-Name,
Coll-1460_00100.ann,"(710, 713)",MMM,177,T24,Feminine,Annotator 0,Person-Name,


In [81]:
joined02 = sub0.join(sub2, how='inner', lsuffix='_0', rsuffix='_2')
joined02

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_0,entity_0,label_0,annotator_0,category_0,remove_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


No rows of disagreement remaining!

**Review annotator 2 vs. annotator 0's data:**

In [82]:
sub2, sub0 = createSubsetsToReview(ann2, annPL0, None, label_to_review)
sub2.head() # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category,remove
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AA7_00100.ann,"(621, 625)",Mona,52,T16,Feminine,Annotator 2,Person-Name,
BAI_01200.ann,"(308, 329)",Florence Jewel Fowler,218,T39,Feminine,Annotator 2,Person-Name,
BAI_01200.ann,"(344, 348)",Mary,219,T40,Feminine,Annotator 2,Person-Name,
BAI_01200.ann,"(351, 363)",Sarah Fowler,220,T41,Feminine,Annotator 2,Person-Name,
BAI_01200.ann,"(1357, 1379)",Florence Jewel Baillie,229,T49,Feminine,Annotator 2,Person-Name,


In [83]:
joined20 = sub2.join(sub0, how='inner', lsuffix='_2', rsuffix='_0')
joined20

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id_2,entity_2,label_2,annotator_2,category_2,remove_2,id_0,entity_0,label_0,annotator_0,category_0,remove_0
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


No rows of disagreement remaining!
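Since Person-Name allows only one label per text span, a final check on the gold standard can confirm that no span ended up with two different labels. A minimal sketch, assuming `gold` is indexed by `file`, `offsets`, and `text` and has `category` and `label` columns as in the outputs above; `conflicting_spans` is a hypothetical helper.

```python
import pandas as pd

def conflicting_spans(gold, category="Person-Name"):
    """Return spans in `category` that carry more than one distinct label."""
    subset = gold[gold["category"] == category]
    labels_per_span = subset.groupby(level=["file", "offsets", "text"])["label"].nunique()
    return labels_per_span[labels_per_span > 1]

# Toy gold standard for illustration only
toy_gold = pd.DataFrame({
    "file": ["doc1.ann", "doc1.ann", "doc1.ann"],
    "offsets": ["(0, 8)", "(0, 8)", "(20, 25)"],
    "text": ["J. Smith", "J. Smith", "Agnes"],
    "label": ["Unknown", "Masculine", "Feminine"],
    "category": ["Person-Name"] * 3,
}).set_index(["file", "offsets", "text"])

conflicts = conflicting_spans(toy_gold)
print(conflicts)
```

An empty result means every Person-Name span in the gold standard has a single label.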