# Merge Annotated Datasets for a Gold Standard, part 2

#### To continue reconciling the differences in the five annotated archival metadata description datasets to create one merged dataset:

  [2.](#2) Remove any `Gendered-Pronoun` labels that are not actually singular pronouns. For any files annotator 1 didn't label, add all annotator 0's `Gendered-Pronoun` labels.  For any files annotator 0 didn't label, add all annotator 2's `Gendered-Pronoun` labels.

  [3.](#3) For the remaining rows of the old DataFrame, for each `Occupation` label, remove any found to be incorrect during the manual review of those labels (see grep scripts/instructions) from that DataFrame, and add those that are correct to the gold standard DataFrame.  Be sure to exclude occupation annotations with the following terms in their text spans: fellow, honorary, emeritus, knight commander.
  
***

Import required libraries:

In [1]:
import pandas as pd
import numpy as np
import string
import csv
import re
import os

## 2. Gendered Pronouns

### Correcting Mistakes

In [5]:
# Person-Name and Linguistic label data
annPL0 = pd.read_csv("labels0PL.csv", index_col=0)
ann1 = pd.read_csv("labels1.csv", index_col=0)
ann2 = pd.read_csv("labels2.csv", index_col=0)

# Preview the data
annPL0.head()
# ann2.head()

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
0,0,Coll-1444_00100.ann,T1,Unknown,Robert E. Bell,Annotator 0,Person-Name,,"(52, 66)"
1,1,Coll-1444_00100.ann,T2,Generalization,Bachelors,Annotator 0,Linguistic,,"(219, 228)"
2,2,Coll-1444_00100.ann,T3,Generalization,Bachelors,Annotator 0,Linguistic,,"(301, 310)"
3,3,Coll-1444_00100.ann,T4,Generalization,Ed.B,Annotator 0,Linguistic,,"(368, 372)"
4,4,Coll-1444_00100.ann,T5,Generalization,M.Ed,Annotator 0,Linguistic,,"(377, 381)"


I want to review the text spans given a Gendered Pronoun label to make sure they fit the annotation instructions as being one of the following: he, him, his, her, or she.

In [7]:
def getGenderedPronouns(df):
    return df.loc[df.label == "Gendered-Pronoun"]

In [29]:
def reviewText(df, annotator):
    df_text = list(set(df.text))
    print("{annotator_no}'s Gendered Pronoun Text: {unique_text}".format(annotator_no=annotator, unique_text=df_text))
    print("Total:",len(df_text))
    return df_text

In [15]:
gp0 = getGenderedPronouns(annPL0)
gp1 = getGenderedPronouns(ann1)
gp2 = getGenderedPronouns(ann2)
gp2.tail()  # Looks good

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
19601,19887,Coll-1469_00100.ann,T5,Gendered-Pronoun,He,Annotator 2,Linguistic,,"(516, 518)"
19602,19888,Coll-1469_00100.ann,T6,Gendered-Pronoun,He,Annotator 2,Linguistic,,"(619, 621)"
19603,19889,Coll-1469_00100.ann,T7,Gendered-Pronoun,him,Annotator 2,Linguistic,,"(735, 738)"
19604,19890,Coll-1469_00100.ann,T8,Gendered-Pronoun,he,Annotator 2,Linguistic,,"(739, 741)"
19605,19891,Coll-1469_00100.ann,T9,Gendered-Pronoun,he,Annotator 2,Linguistic,,"(764, 766)"


In [30]:
gp0_text = reviewText(gp0, "Annotator 0")

Annotator 0's Gendered Pronoun Text: ['she', 'him', 'he', 'himself', 'His', 'He', 'H', 'She', 'Himself', '.', 'H[is]', 'his', 'her', 'herself', 'Her']
Total: 15


The `'H'` and `'.'` spans are probably mistakes: either too little text was annotated, or text was annotated that shouldn't have been.  I'll investigate those and manually correct them.  The rest are clearly valid gendered pronouns.

In [31]:
gp1_text = reviewText(gp1, "Annotator 1")

Annotator 1's Gendered Pronoun Text: ['she', 'him', 'Him', 'he', 'himself', 'His', 'He', 'She', "He's", 'his', 'her', 'Her']
Total: 12


In [32]:
gp2_text = reviewText(gp2, "Annotator 2")

Annotator 2's Gendered Pronoun Text: ['she', 'him', 'he', 'himself', 'His', 'He', 'She', 'his', 'her', 'Her']
Total: 10


Annotators 1 and 2's gendered pronouns are valid!
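The same review can be approximated programmatically. The sketch below uses a hypothetical helper, not part of the original workflow: it lowercases each unique span and reports those outside an allowed set. The allowed set is an assumption that combines the five pronouns named in the annotation instructions with the reflexive forms accepted above, and the toy DataFrame stands in for an annotator's Gendered-Pronoun rows.

```python
import pandas as pd

# Assumed allow-list: pronouns from the annotation instructions plus reflexive forms
ALLOWED_PRONOUNS = {"he", "him", "his", "she", "her", "himself", "herself"}

def flag_unexpected_spans(df, allowed=ALLOWED_PRONOUNS):
    """Return the sorted unique text spans whose lowercase form is not in `allowed`."""
    unique_spans = set(df["text"])
    return sorted(span for span in unique_spans if span.lower() not in allowed)

# Toy data standing in for an annotator's Gendered-Pronoun rows
toy = pd.DataFrame({"text": ["He", "his", ".", "H", "H[is]", "she", "He's"]})
unexpected = flag_unexpected_spans(toy)
print(unexpected)
```

A flagged span is a prompt to inspect the annotation by hand, not proof of an error.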

In [45]:
annPL0.loc[annPL0.text == "."]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
11582,16570,Coll-1310_01900.ann,T145,Gendered-Pronoun,.,Annotator 0,Linguistic,,"(8472, 8473)"


This should be `him` with offsets `(8469,8472)`.  Let's change that in the annotator's data file:

In [61]:
annPL0 = annPL0.astype({"id":int,"file":str,"entity":str,"label":str,"text":str,"annotator":str,"category":str,"offsets":str})
row_to_replace = annPL0.loc[annPL0.text == "."]
index_to_drop = row_to_replace.index[0]
type(list(row_to_replace.offsets)[0])

str

In [58]:
new_row = pd.DataFrame({"id":[16570],"file":["Coll-1310_01900.ann"],"entity":"T145","label":"Gendered-Pronoun","text":"him",
                        "annotator":"Annotator 0","category":"Linguistic","remove":None,"offsets":"(8469,8472)"})
new_row

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
0,16570,Coll-1310_01900.ann,T145,Gendered-Pronoun,him,Annotator 0,Linguistic,,"(8469,8472)"


In [63]:
print(annPL0.shape)
annPL0.drop(index=index_to_drop,inplace=True)
print(annPL0.shape)

(22296, 9)
(22295, 9)


In [65]:
annPL0 = pd.concat([annPL0, new_row])  # DataFrame.append was removed in pandas 2.0
print(annPL0.shape)

(22296, 9)


Looks good!
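As an alternative to dropping the row and adding a replacement, the same kind of fix can be sketched as an in-place assignment with `.loc`, which keeps the row's index label. The toy DataFrame below is illustrative only: its second row is made up, and the corrected offsets are written with a space after the comma to match the format of the other rows.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "file": ["Coll-1310_01900.ann", "Coll-1310_01900.ann"],
        "label": ["Gendered-Pronoun", "Gendered-Pronoun"],
        "text": [".", "his"],
        "offsets": ["(8472, 8473)", "(100, 103)"],  # second row is invented
    },
    index=[11582, 11583],
)

# Boolean mask selecting the mislabeled span
mask = (toy["label"] == "Gendered-Pronoun") & (toy["text"] == ".")
assert mask.sum() == 1  # guard against overwriting more rows than intended

# Overwrite both columns in place; the index label and row count are unchanged
toy.loc[mask, ["text", "offsets"]] = ["him", "(8469, 8472)"]
print(toy)
```

Because nothing is dropped or added, the shape check before and after the correction is no longer needed.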

In [35]:
gp0.loc[gp0.text == "H"]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
19654,28053,Coll-1296_00100.ann,T60,Gendered-Pronoun,H,Annotator 0,Linguistic,,"(4646, 4647)"


This is an abbreviation for `His` (in other cases it could be `Her`), as in `HM [His Majesty] the king`, so I'll keep this as is!

Now let's update Annotator 0's data file with the correction of the labeled text span `'.'`:

In [66]:
annPL0.to_csv("labels0PL.csv")

### Adding Unique Files' Labels to Gold

For now, we'll add any of annotator 0's Gendered-Pronoun labels on files that annotator 1 didn't label, and we'll add any of annotator 2's Gendered-Pronoun labels on files that annotator 0 didn't label to the merged dataset:

In [70]:
files0 = (set(annPL0.file))
files1 = (set(ann1.file))
files2 = (set(ann2.file))

In [72]:
unique_to_0 = files0.difference(files1)
unique_to_2 = files2.difference(files0)
print(len(unique_to_0))
print(len(unique_to_2))

130
274


In [79]:
annPL0_gp = annPL0.loc[annPL0.label == "Gendered-Pronoun"]
gp0_for_gold = annPL0_gp.loc[annPL0_gp.file.isin(unique_to_0)].copy()  # copy so the in-place edits below don't act on a view
gp0_for_gold.shape

(502, 9)

In [89]:
gp0_for_gold.drop(labels=["remove"],axis=1,inplace=True)
gp0_for_gold.set_index(["file","offsets","text"],inplace=True)
gp0_for_gold.annotator = 0
gp0_for_gold.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1460_00100.ann,"(485, 488)",her,155,T2,Gendered-Pronoun,0,Linguistic
Coll-1460_00100.ann,"(633, 636)",her,156,T3,Gendered-Pronoun,0,Linguistic
Coll-1460_00100.ann,"(780, 783)",She,157,T4,Gendered-Pronoun,0,Linguistic
Coll-1460_00100.ann,"(1223, 1226)",she,158,T5,Gendered-Pronoun,0,Linguistic
Coll-1442_00100.ann,"(796, 798)",He,317,T0,Gendered-Pronoun,0,Linguistic


In [91]:
gold = pd.read_csv("gold_standard.csv",index_col=[0,1,2])
print(gold.shape)
gold.head()

(1086, 5)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


In [92]:
new_gold = pd.concat([gold, gp0_for_gold])  # DataFrame.append was removed in pandas 2.0
new_gold.shape  # Looks good!

(1588, 5)

In [95]:
ann2_gp = ann2.loc[ann2.label == "Gendered-Pronoun"]
gp2_for_gold = ann2_gp.loc[ann2_gp.file.isin(unique_to_2)].copy()  # copy so the in-place edits below don't act on a view
gp2_for_gold.shape
gp2_for_gold.drop(labels=["remove"],axis=1,inplace=True)
gp2_for_gold.set_index(["file","offsets","text"],inplace=True)
gp2_for_gold.annotator = 2
print(gp2_for_gold.shape)
gp2_for_gold.head()

(78, 5)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-146_15400.ann,"(1373, 1376)",his,5582,T0,Gendered-Pronoun,2,Linguistic
Coll-146_15400.ann,"(1482, 1485)",his,5583,T1,Gendered-Pronoun,2,Linguistic
Coll-146_15400.ann,"(1594, 1597)",his,5584,T2,Gendered-Pronoun,2,Linguistic
Coll-146_15400.ann,"(1706, 1709)",his,5585,T3,Gendered-Pronoun,2,Linguistic
Coll-146_15400.ann,"(1816, 1819)",his,5586,T4,Gendered-Pronoun,2,Linguistic


In [96]:
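# Note: this adds to `gold` (1,086 rows), not to `new_gold` from In [92],
# so annotator 0's 502 rows are not in this result (1086 + 78 = 1164)
new_gold = pd.concat([gold, gp2_for_gold])
new_gold.shape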

(1164, 5)

In [97]:
new_gold.tail()  # Looks good!

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-146_18900.ann,"(1171, 1174)",his,11736,T1,Gendered-Pronoun,2,Linguistic
Coll-146_19100.ann,"(2052, 2055)",his,11814,T2,Gendered-Pronoun,2,Linguistic
Coll-146_19200.ann,"(702, 705)",his,11851,T0,Gendered-Pronoun,2,Linguistic
Coll-146_21100.ann,"(3402, 3405)",him,12921,T1,Gendered-Pronoun,2,Linguistic
Coll-146_22800.ann,"(3568, 3571)",his,14096,T3,Gendered-Pronoun,2,Linguistic


Let's update the gold standard data file:

In [98]:
new_gold.to_csv("gold_standard.csv")

Let's remove the added rows from the annotators' DataFrames:

In [106]:
# print(gp0_for_gold.index)
annPL0.set_index(["file","offsets","text"],inplace=True)
# print(annPL0.index)
to_drop0 = gp0_for_gold.index
updated0 = annPL0.drop(index=to_drop0)
assert annPL0.shape[0] - updated0.shape[0] == len(to_drop0)

In [110]:
updated0.to_csv("labels0PL.csv")

In [109]:
ann2.set_index(["file","offsets","text"],inplace=True)
to_drop2 = gp2_for_gold.index
updated2 = ann2.drop(index=to_drop2)
assert ann2.shape[0] - updated2.shape[0] == len(to_drop2)

In [113]:
updated2.to_csv("labels2.csv")

### Adding Remaining Labels (Without Duplicates)

In [125]:
# Load the latest version of the data
annPL0 = pd.read_csv("labels0PL.csv")
ann1 = pd.read_csv("labels1.csv",index_col=0)
ann2 = pd.read_csv("labels2.csv")

In [129]:
annPL0.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove
0,Coll-1444_00100.ann,"(52, 66)",Robert E. Bell,0,T1,Unknown,Annotator 0,Person-Name,
1,Coll-1444_00100.ann,"(219, 228)",Bachelors,1,T2,Generalization,Annotator 0,Linguistic,
2,Coll-1444_00100.ann,"(301, 310)",Bachelors,2,T3,Generalization,Annotator 0,Linguistic,
3,Coll-1444_00100.ann,"(368, 372)",Ed.B,3,T4,Generalization,Annotator 0,Linguistic,
4,Coll-1444_00100.ann,"(377, 381)",M.Ed,4,T5,Generalization,Annotator 0,Linguistic,


In [132]:
# ann1.head()
ann1.set_index(["file","offsets","text"],inplace=True)
ann1.reset_index(inplace=True)
ann1.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove
0,Coll-1326_00100.ann,"(1132, 1135)",his,0,T0,Gendered-Pronoun,Annotator 1,Linguistic,
1,Coll-1326_00100.ann,"(1142, 1144)",He,1,T1,Gendered-Pronoun,Annotator 1,Linguistic,
2,Coll-1326_00100.ann,"(1532, 1535)",his,2,T2,Gendered-Pronoun,Annotator 1,Linguistic,
3,Coll-1326_00100.ann,"(1548, 1550)",He,3,T3,Gendered-Pronoun,Annotator 1,Linguistic,
4,Coll-1326_00100.ann,"(48, 62)",Dr. Rutherford,4,T4,Unknown,Annotator 1,Person-Name,


In [131]:
ann2.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove
0,AA5_00100.ann,"(789, 791)",He,0,T0,Gendered-Pronoun,Annotator 2,Linguistic,
1,AA5_00100.ann,"(871, 873)",he,1,T1,Gendered-Pronoun,Annotator 2,Linguistic,
2,AA5_00100.ann,"(913, 916)",his,2,T2,Gendered-Pronoun,Annotator 2,Linguistic,
3,AA5_00100.ann,"(928, 930)",he,3,T3,Gendered-Pronoun,Annotator 2,Linguistic,
4,AA5_00100.ann,"(1217, 1219)",he,4,T4,Gendered-Pronoun,Annotator 2,Linguistic,


Now that the columns are all in the same order, let's append the DataFrames together:

In [133]:
combo = pd.concat([annPL0, ann1, ann2])  # DataFrame.append was removed in pandas 2.0
assert combo.shape[0] == annPL0.shape[0] + ann1.shape[0] + ann2.shape[0]
combo.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category,remove
0,Coll-1444_00100.ann,"(52, 66)",Robert E. Bell,0,T1,Unknown,Annotator 0,Person-Name,
1,Coll-1444_00100.ann,"(219, 228)",Bachelors,1,T2,Generalization,Annotator 0,Linguistic,
2,Coll-1444_00100.ann,"(301, 310)",Bachelors,2,T3,Generalization,Annotator 0,Linguistic,
3,Coll-1444_00100.ann,"(368, 372)",Ed.B,3,T4,Generalization,Annotator 0,Linguistic,
4,Coll-1444_00100.ann,"(377, 381)",M.Ed,4,T5,Generalization,Annotator 0,Linguistic,


In [135]:
combo = combo.drop_duplicates()
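Called without arguments, `drop_duplicates()` compares every column, so two rows for the same span from different annotators differ in `annotator` and are both kept. If the goal were one row per labeled span, a `subset` of key columns could be passed instead. The sketch below uses invented toy rows, and the choice of key columns is an assumption rather than the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "file": ["A.ann", "A.ann", "A.ann"],
        "offsets": ["(10, 12)", "(10, 12)", "(30, 33)"],
        "text": ["He", "He", "his"],
        "label": ["Gendered-Pronoun"] * 3,
        "annotator": ["Annotator 1", "Annotator 2", "Annotator 1"],
    }
)

# Default: all columns must match, so both annotators' rows for "He" survive
all_columns = toy.drop_duplicates()

# Assumed key: a span counts as the same annotation if these four columns agree
key = ["file", "offsets", "text", "label"]
per_span = toy.drop_duplicates(subset=key, keep="first")

print(len(all_columns), len(per_span))
```

With `keep="first"`, the surviving row is whichever annotator's row comes first in the concatenation order.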

In [139]:
combo.drop(labels=["remove"],axis=1,inplace=True)
combo.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
0,Coll-1444_00100.ann,"(52, 66)",Robert E. Bell,0,T1,Unknown,Annotator 0,Person-Name
1,Coll-1444_00100.ann,"(219, 228)",Bachelors,1,T2,Generalization,Annotator 0,Linguistic
2,Coll-1444_00100.ann,"(301, 310)",Bachelors,2,T3,Generalization,Annotator 0,Linguistic
3,Coll-1444_00100.ann,"(368, 372)",Ed.B,3,T4,Generalization,Annotator 0,Linguistic
4,Coll-1444_00100.ann,"(377, 381)",M.Ed,4,T5,Generalization,Annotator 0,Linguistic


In [140]:
gold = pd.read_csv("gold_standard.csv")
gold.head()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
0,Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
1,Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
2,Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
3,Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
4,BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


In [141]:
new_gold = pd.concat([gold, combo])  # DataFrame.append was removed in pandas 2.0
new_gold.tail()

Unnamed: 0,file,offsets,text,id,entity,label,annotator,category
19527,Coll-1469_00100.ann,"(764, 766)",he,19891,T9,Gendered-Pronoun,Annotator 2,Linguistic
19528,Coll-1469_00100.ann,"(42, 54)",Andrew Young,19892,T10,Unknown,Annotator 2,Person-Name
19529,Coll-1469_00100.ann,"(58, 69)",D M Baillie,19893,T11,Unknown,Annotator 2,Person-Name
19530,Coll-1469_00100.ann,"(817, 822)",Dunne,19897,T15,Unknown,Annotator 2,Person-Name
19531,Coll-1469_00100.ann,"(913, 928)",Graeme D. Eddie,19898,T16,Unknown,Annotator 2,Person-Name


In [142]:
print(gold.shape)
print(new_gold.shape)

(1164, 8)
(59005, 8)


In [143]:
new_gold.to_csv("gold_standard.csv")

## 3. Occupations

In [146]:
# Contextual label data
annC0 = pd.read_csv("labels0C.csv", index_col=0)
ann3 = pd.read_csv("labels3.csv", index_col=0)
ann4 = pd.read_csv("labels4.csv", index_col=0)
annC0.head()  # Looks good

In [155]:
def getOccupationDF(df):
    return df.loc[df.label == "Occupation"]

def correctRow(df, mistake, correct_text, correct_offsets):
    df = df.astype({"id":int,"file":str,"entity":str,"label":str,"text":str,"annotator":str,"category":str,"offsets":str})
    row_to_replace = df.loc[df.text == mistake]
    index_to_drop = row_to_replace.index[0]
    # Create a correct version of the DataFrame row
    new_row = pd.DataFrame({"id":row_to_replace.values[0][0],"file":row_to_replace.values[0][1],
                            "entity":row_to_replace.values[0][2],"label":row_to_replace.values[0][3],
                            "text":correct_text,"annotator":row_to_replace.values[0][5],
                            "category":row_to_replace.values[0][6],"remove":None,"offsets":correct_offsets},
                          index=row_to_replace.index)
    # Drop the incorrect version of the DataFrame row
    df.drop(index=index_to_drop,inplace=True)
    # Add the correct version of the DataFrame row
    df = pd.concat([df, new_row])  # DataFrame.append was removed in pandas 2.0
    return df

In [5]:
occ0 = getOccupationDF(annC0)
print(occ0.shape)
occ0.head()

(2736, 9)


Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
1,8,Coll-1444_00100.ann,T9,Occupation,Educational Psychologists,Annotator 0,Contextual,,"(715, 740)"
2,16,Coll-1444_00100.ann,T17,Occupation,Psychologist,Annotator 0,Contextual,,"(1664, 1676)"
4,23,Coll-1444_00100.ann,T24,Occupation,researcher at the Godfrey Thomson Unit for Edu...,Annotator 0,Contextual,,"(2312, 2375)"
7,34,Coll-1326_00100.ann,T8,Occupation,physicians,Annotator 0,Contextual,,"(403, 413)"
8,37,Coll-1326_00100.ann,T11,Occupation,physicians,Annotator 0,Contextual,,"(538, 548)"


In [8]:
occ3 = getOccupationDF(ann3)
print(occ3.shape)
occ3.head()

(2330, 9)


Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
0,4,Coll-1326_00100.ann,T4,Occupation,physicians,Annotator 3,Contextual,,"(403, 413)"
1,5,Coll-1326_00100.ann,T5,Occupation,physicians,Annotator 3,Contextual,,"(538, 548)"
2,6,Coll-1326_00100.ann,T6,Occupation,physician,Annotator 3,Contextual,,"(876, 885)"
3,7,Coll-1326_00100.ann,T7,Occupation,Professor of the Practice of Physic,Annotator 3,Contextual,,"(925, 960)"
4,8,Coll-1326_00100.ann,T8,Occupation,physician,Annotator 3,Contextual,,"(1355, 1364)"


In [9]:
occ4 = getOccupationDF(ann4)
print(occ4.shape)
occ4.head()

(1776, 9)


Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
0,0,Coll-1444_00100.ann,T1,Occupation,Educational Psychologists,Annotator 4,Contextual,,"(715, 740)"
1,1,Coll-1444_00100.ann,T2,Occupation,Psychologist,Annotator 4,Contextual,,"(1664, 1676)"
2,2,Coll-1444_00100.ann,T3,Occupation,researcher at the Godfrey Thomson Unit for Edu...,Annotator 4,Contextual,,"(2312, 2375)"
3,21,BAI_01200.ann,T19,Occupation,Archbishop of Canterbury,Annotator 4,Contextual,,"(2347, 2371)"
5,23,BAI_01200.ann,T21,Occupation,Moderator Designate of the General Assembly of...,Annotator 4,Contextual,,"(2467, 2537)"


In [13]:
occ_text = list(set(list(occ0.text)+list(occ3.text)+list(occ4.text)))
print(len(occ_text))

1724


In [14]:
lower_occ_text = [o.lower() for o in occ_text]
lower_occ_text = list(set(lower_occ_text))
print(len(lower_occ_text))

1528


In [72]:
# to_remove list from ../IAA/AnnotatedOccupationsReview.ipynb
to_remove = [
    'and', 'authored', 'bailie', 'baillie', 'baillies', 'dog guardian', 'Emeritus Professor', 'Fellow', 
    'Fellow Emeritus', 'Fellow of the Poultry Science Association', 
    'Fellow of the Royal Scottish Society of Arts', 'Fellow of the Royal Society', 'Fellow of the Royal Society of Edinburgh', 
    'Fellow of the Society of Antiquaries', 'Fulbright Scholar', 'Fulbright Scholar at the Educational Testing Service',
    'Gifford Lecturer', 'honorary conductor', 'Honorary Librarian of Abbotsford', 
    'Honorary Librarian of Abbotsford House', 'Honorary Secretary', 'l', 'New Gifford Lecturer',
    'P', 'part of a British Council of Churches delegation',
    'President of the American Friends of the University of Edinburgh', 
    'President of the Dumfriesshire and Galloway Natural History and Antiquarian Society', 
    'president of the International Union of the History of Science and Medicine', 
    'President of the Ontario Creameries Association', 'President of the Royal Society', 'president of the Royal Statistical Society', 
    'President Ontario Creameries Association', 'Presidents of the Royal Medical Society', 'Professor Emeritus of Neurophysiology', 
    'Professor Emeritus of Nursing Studies', 'regular candidate for the ministry', 'representative', 'Saint', 
    'The New Gifford Lecturer', 'traveller', 'u', 'University Librarian Emeritus', 'knight commander'
]
to_remove_lower = [t.lower() for t in to_remove]

In [73]:
occ_list = [o for o in lower_occ_text if o not in to_remove_lower]
print(len(occ_list))

1488
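The filter in In [73] removes only exact, case-insensitive matches against `to_remove`. Step 3 also asks to exclude spans that contain the terms fellow, honorary, emeritus, or knight commander, and a substring check is one way to sketch that. The helper name and the whole-word matching rule below are assumptions; whether a compound title such as 'tutorial fellow' should be excluded remains a decision for the manual review.

```python
import re

# Terms named in step 3 of the instructions
EXCLUDE_TERMS = ["fellow", "honorary", "emeritus", "knight commander"]

# Word-boundary pattern so that a term matches only as a whole word or phrase
EXCLUDE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in EXCLUDE_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def contains_excluded_term(span):
    """Return True if the span contains any excluded term as a whole word."""
    return EXCLUDE_PATTERN.search(span) is not None

sample = ["tutorial fellow", "professor of mathematics", "seismologist",
          "Honorary Conductor", "professor emeritus of nursing studies"]
kept = [s for s in sample if not contains_excluded_term(s)]
print(kept)
```

Because of the word boundaries, plural or derived forms such as "fellows" or "fellowship" are not matched; the pattern would need extending if those should be excluded too.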


In [74]:
print(occ_list)  
# Mistakes: publisher:, editors:, corresponding member of the general assembly of the presbyterian church in ireland.,
#           ,moderator designate of the general assembly of the church of scotland., church men.

['chief commissioner', 'professor of moral philosophy', 'folksong collector', 'sculptors', 'consultant for the national foundation for educational research', 'contractor', 'justice clerks', 'book publishers', 'professor of politics', 'professors of public law and the law of nature and nations', 'guard', 'lecturer in petrology', 'seismologist', 'farmers', 'professor of logic and metaphysics', 'seller', 'professor of rhetoric and english literature', 'senator of the college of justice', 'professor of mathematics', 'co-founding editor', 'tutorial fellow', 'cheif justices', 'minister of bervie united free church', "composer'", 'bailiff', 'medical practitioner', 'member of the council', 'visiting writer in residence', 'knights', 'art critic', 'fish curer', 'chemist', 'minister of town and country planning', 'confectioner', 'sales director', 'principal of the university of aberdeen', 'lecturer in botany and materia medica', 'theologian', 'vendors', 'lecturer in the department of physiology',

In [76]:
occ0.loc[occ0.text == "Moderator Designate of the General Assembly of the Church of Scotland."]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
5990,21562,BAI_01200.ann,T35,Occupation,Moderator Designate of the General Assembly of...,Annotator 0,Contextual,,"(2467, 2537)"


In [78]:
# occ3.loc[occ3.text == "Moderator Designate of the General Assembly of the Church of Scotland."]  # No match
occ4.loc[occ4.text == "Moderator Designate of the General Assembly of the Church of Scotland."]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
5,23,BAI_01200.ann,T21,Occupation,Moderator Designate of the General Assembly of...,Annotator 4,Contextual,,"(2467, 2537)"


In [79]:
occ3.loc[occ3.text == "Moderator Designate of the General Assembly of the Church of Scotland"]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
1139,2215,BAI_01600.ann,T16,Occupation,Moderator Designate of the General Assembly of...,Annotator 3,Contextual,,"(1656, 1725)"
3316,6229,BAI_01200.ann,T21,Occupation,Moderator Designate of the General Assembly of...,Annotator 3,Contextual,,"(2467, 2536)"


Rather than changing the annotator DataFrame, we can simply keep annotator 3's label for BAI_01200, as it correctly excludes the period at the end of the occupation, which annotators 0 and 4 accidentally included!
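Spans that end in stray punctuation, like the trailing period here, can also be trimmed mechanically. The following is a minimal sketch with a hypothetical helper. It assumes offsets are stored as strings of the form `"(start, end)"`, as in the previews above, and that only a trailing `.`, `:`, `,` or `;` should be trimmed.

```python
import ast

TRAILING = ".:,;"  # assumed set of punctuation to trim from the end of a span

def trim_trailing_punctuation(text, offsets):
    """Strip trailing punctuation from `text` and shrink the end offset to match."""
    start, end = ast.literal_eval(offsets)  # "(2467, 2537)" -> (2467, 2537)
    trimmed = text.rstrip(TRAILING)
    removed = len(text) - len(trimmed)
    return trimmed, "({}, {})".format(start, end - removed)

fixed = trim_trailing_punctuation(
    "Moderator Designate of the General Assembly of the Church of Scotland.",
    "(2467, 2537)",
)
print(fixed)
```

For this span the trimmed offsets come out as `(2467, 2536)`, the same offsets annotator 3 recorded. A span that legitimately ends in a period, such as an abbreviation, would be trimmed as well, so the results still need review.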

In [80]:
occ_list.remove("moderator designate of the general assembly of the church of scotland.")
print(len(occ_list))

1487


In [147]:
# occ0.loc[occ0.text == "Publisher:"] # no match
# occ3.loc[occ3.text == "publisher:"] # no match
occ4.loc[occ4.text == "Publisher:"]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
2394,3677,Coll-146_30900.ann,T20,Occupation,Publisher:,Annotator 4,Contextual,,"(5163, 5173)"


In [86]:
occ0.loc[occ0.text == "Publisher"] # different file

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
1753,5996,Coll-1443_00100.ann,T84,Occupation,Publisher,Annotator 0,Contextual,,"(4680, 4689)"


In [96]:
occ3.loc[occ3.text == "Publisher"]  # No match

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets


Let's correct annotator 4's DataFrame since the other annotators don't have the correct version of this annotation:

In [148]:
ann4 = correctRow(ann4,"Publisher:","Publisher","(5163, 5172)")
ann4.loc[ann4.text == "Publisher"]  # Looks good

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
115,202,Coll-146_20100.ann,T19,Occupation,Publisher,Annotator 4,Contextual,,"(1017, 1026)"
1297,1879,Coll-146_19700.ann,T2,Occupation,Publisher,Annotator 4,Contextual,,"(1023, 1032)"
1400,2096,Coll-146_26700.ann,T4,Occupation,Publisher,Annotator 4,Contextual,,"(4388, 4397)"
1423,2135,Coll-146_34400.ann,T1,Occupation,Publisher,Annotator 4,Contextual,,"(3647, 3656)"
1447,2175,Coll-146_22400.ann,T4,Occupation,Publisher,Annotator 4,Contextual,,"(4094, 4103)"
1494,2256,Coll-146_25200.ann,T7,Occupation,Publisher,Annotator 4,Contextual,,"(2068, 2077)"
1503,2266,Coll-146_24300.ann,T2,Occupation,Publisher,Annotator 4,Contextual,,"(3738, 3747)"
1545,2321,Coll-146_20000.ann,T11,Occupation,Publisher,Annotator 4,Contextual,,"(4157, 4166)"
1578,2375,Coll-146_26100.ann,T6,Occupation,Publisher,Annotator 4,Contextual,,"(2412, 2421)"
1808,2690,Coll-146_25600.ann,T4,Occupation,Publisher,Annotator 4,Contextual,,"(886, 895)"


In [112]:
# occ0.loc[occ0.text == "Editors:"] # no match
# occ3.loc[occ3.text == "Editors:"] # no match
occ4.loc[occ4.text == "Editors:"]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
1982,2974,Coll-146_32000.ann,T4,Occupation,Editors:,Annotator 4,Contextual,,"(3264, 3272)"


In [115]:
occ3.loc[occ3.text == "Editors"] # no match
occ0.loc[occ0.text == "Editors"] # different file

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
7310,26133,Coll-1362_06400.ann,T10,Occupation,Editors,Annotator 0,Contextual,,"(1845, 1852)"


Again, we'll correct annotator 4's DataFrame since the other annotators don't have the correct version of this annotation:

In [149]:
ann4 = correctRow(ann4, "Editors:","Editors","(3264, 3271)")
ann4.loc[ann4.text == "Editors"]  # Looks good

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
1216,1750,Coll-146_28000.ann,T7,Occupation,Editors,Annotator 4,Contextual,,"(259, 266)"
1834,2727,Coll-146_30300.ann,T4,Occupation,Editors,Annotator 4,Contextual,,"(3777, 3784)"
1982,2974,Coll-146_32000.ann,T4,Occupation,Editors,Annotator 4,Contextual,,"(3264, 3271)"


In [150]:
mistake = "Corresponding Member of the General Assembly of the Presbyterian Church in Ireland."
occ0.loc[occ0.text == mistake]
# occ3.loc[occ3.text == mistake] # no match
# occ4.loc[occ4.text == mistake] # no match

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
2111,7229,BAI_01600.ann,T16,Occupation,Corresponding Member of the General Assembly o...,Annotator 0,Contextual,,"(1733, 1816)"


Let's correct annotator 0's DataFrame since the other annotators don't have the correct version of this annotation:

In [162]:
annC0 = correctRow(annC0, mistake, mistake[:-1], "(1733, 1815)")  # look up the span with the period, replace it with the span without
print(list(annC0.loc[annC0.text == mistake[:-1]].text))  # Looks good

['Corresponding Member of the General Assembly of the Presbyterian Church in Ireland']


In [165]:
mistake = "church men."
# occ0.loc[occ0.text == mistake] # no match
# occ3.loc[occ3.text == mistake] # no match
occ4.loc[occ4.text == mistake]

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
2592,4103,BAI_02200.ann,T147,Occupation,church men.,Annotator 4,Contextual,,"(5016, 5027)"


Let's correct annotator 4's DataFrame since the other annotators don't have the correct version of this annotation:

In [166]:
ann4 = correctRow(ann4, "church men.","church men","(5016, 5026)")
ann4.loc[ann4.text == "church men"]  # Looks good

Unnamed: 0,id,file,entity,label,text,annotator,category,remove,offsets
2592,4103,BAI_02200.ann,T147,Occupation,church men,Annotator 4,Contextual,,"(5016, 5026)"


Let's update the annotators' data files with the corrections:

In [167]:
annC0.to_csv("labels0C.csv")
ann3.to_csv("labels3.csv")
ann4.to_csv("labels4.csv")

Now we can remove the rest of the mistakes from the list of unique occupations:

In [169]:
before = len(occ_list)
mistakes = ["publisher:", "editors:", "corresponding member of the general assembly of the presbyterian church in ireland.",
            "church men."]
for m in mistakes:
    occ_list.remove(m)
after = len(occ_list)
assert before - after == len(mistakes)

In [170]:
print(len(occ_list))

1483


Great!  Now we have a list of 1,483 occupations that should be included in the merged dataset (`gold_standard.csv`).

Let's mark which occupations to remove and keep in each of the three annotators' DataFrames:

In [214]:
annC0 = pd.read_csv("labels0C.csv", index_col=0)
annC0.drop(labels=["remove"],axis=1,inplace=True)
ann3 = pd.read_csv("labels3.csv", index_col=0)
ann3.drop(labels=["remove"],axis=1,inplace=True)
ann4 = pd.read_csv("labels4.csv", index_col=0)
ann4.drop(labels=["remove"],axis=1,inplace=True)
ann3.tail()  # Looks good

Unnamed: 0,id,file,entity,label,text,annotator,category,offsets
5092,9383,Coll-1028_00100.ann,T14,Omission,Baillie,Annotator 3,Contextual,"(598, 605)"
5093,9384,Coll-1028_00100.ann,T33,Occupation,printers,Annotator 3,Contextual,"(2572, 2580)"
5094,9385,Coll-1028_00100.ann,T34,Occupation,translators,Annotator 3,Contextual,"(2559, 2570)"
5095,9386,Coll-1028_00100.ann,T35,Omission,Calvin,Annotator 3,Contextual,"(2597, 2603)"
5096,9387,Coll-1028_00100.ann,T36,Omission,Calvin,Annotator 3,Contextual,"(2536, 2542)"


In [215]:
def whichToRemove(df, occ_list):
    # Select the Occupation rows, then keep only those whose lowercased
    # text span is in the reviewed list of occupations
    occ_df = df.loc[df.label == "Occupation"]
    keep = occ_df.text.str.lower().isin(occ_list)
    return occ_df.loc[keep].copy()

In [216]:
occ0 = whichToRemove(annC0, occ_list)
occ0.annotator = 0
occ3 = whichToRemove(ann3, occ_list)
occ3.annotator = 3
occ4 = whichToRemove(ann4, occ_list)
occ4.annotator = 4

In [218]:
for_gold = pd.concat([occ0, occ3, occ4])  # DataFrame.append was removed in pandas 2.0
for_gold.set_index(["file","offsets","text"],inplace=True)
for_gold.head()  # Looks good

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1444_00100.ann,"(715, 740)",Educational Psychologists,8,T9,Occupation,0,Contextual
Coll-1444_00100.ann,"(1664, 1676)",Psychologist,16,T17,Occupation,0,Contextual
Coll-1444_00100.ann,"(2312, 2375)",researcher at the Godfrey Thomson Unit for Educational Research,23,T24,Occupation,0,Contextual
Coll-1326_00100.ann,"(403, 413)",physicians,34,T8,Occupation,0,Contextual
Coll-1326_00100.ann,"(538, 548)",physicians,37,T11,Occupation,0,Contextual


In [219]:
gold = pd.read_csv("gold_standard.csv",index_col=0)
gold.set_index(["file","offsets","text"],inplace=True)
gold.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1434_11900.ann,"(1954, 1957)",his,22593,T1,Generalization,0,Linguistic
Coll-1397_00100.ann,"(2633, 2638)",Lords,29349,T58,Generalization,0,Linguistic
Coll-1310_00800.ann,"(3703, 3706)",Man,15451,T54,Generalization,0,Linguistic
Coll-1434_14500.ann,"(5782, 5788)",cowboy,8005,T76,Generalization,0,Linguistic
BAI_02300.ann,"(1586, 1596)",shipmaster,20810,T53,Generalization,0,Linguistic


In [220]:
new_gold = pd.concat([gold, for_gold])  # DataFrame.append was removed in pandas 2.0
new_gold.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,id,entity,label,annotator,category
file,offsets,text,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Coll-1434_20400.ann,"(2804, 2811)",cowboys,6549,T13,Occupation,4,Contextual
Coll-1434_20400.ann,"(354, 361)",soldier,6552,T16,Occupation,4,Contextual
Coll-146_16400.ann,"(1973, 1985)",photographer,6585,T1,Occupation,4,Contextual
Coll-146_30900.ann,"(5163, 5172)",Publisher,3677,T20,Occupation,4,Contextual
Coll-146_32000.ann,"(3264, 3271)",Editors,2974,T4,Occupation,4,Contextual


In [221]:
assert new_gold.shape[0] == gold.shape[0] + for_gold.shape[0]

In [222]:
new_gold.to_csv("gold_standard.csv")

In [223]:
new_gold.shape

(65770, 5)

Remove rows with the "Occupation" label from the annotators' data files:

In [224]:
annC0 = annC0.loc[annC0.label != "Occupation"]
ann3 = ann3.loc[ann3.label != "Occupation"]
ann4 = ann4.loc[ann4.label != "Occupation"]
annC0.to_csv("labels0C.csv")
ann3.to_csv("labels3.csv")
ann4.to_csv("labels4.csv")