# Preparing Metadata Descriptions for Annotation in brat

The text in this Jupyter Notebook is organized for uploading into [brat](https://brat.nlplab.org/index.html), where the text will be annotated for instances of gender bias.  The aim of the annotation is to create a gold standard dataset on which a classifier can be trained to identify gender bias in archival metadata descriptions.  

This project is focused on the English language and archival institutions in the United Kingdom.

* Author: Lucy Havens
* Date: November 17, 2020 - TBD
* Project: PhD Case Study 1
* Data Source: Files of select metadata descriptions extracted and exported in [the GitHub repo, annot-prep](https://github.com/thegoose20/annot-prep)
* Data Provider: [ArchivesSpace](https://archives.collections.ed.ac.uk/), Centre for Research Collections, University of Edinburgh

### Import Libraries

In [1]:
# Libraries for data analysis and visualization
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
from scipy.stats import mode
from collections import Counter

# To avoid SSL error when downloading NLTK packages...
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
# nltk.download()

# Libraries for Natural Language Processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.text import Text
nltk.download('punkt')
from nltk.probability import FreqDist
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.corpus import PlaintextCorpusReader
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag

# Other useful libraries
import string
import csv
import re

[nltk_data] Downloading package punkt to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Load the Data
The annotation dataset will be created from TXT files of extracted metadata descriptions designated for training and development.

In [10]:
dataset1 = open('DatasetExports/UoEArchivesMetadata_ID-SC-BH-PI_trainingset1.txt', 'r')
dataset2 = open('DatasetExports/UoEArchivesMetadata_ID-SC-BH-PI_trainingset2.txt', 'r')
dataset3 = open('DatasetExports/UoEArchivesMetadata_ID-SC-BH-PI_trainingset3.txt', 'r')
dataset4 = open('DatasetExports/UoEArchivesMetadata_ID-SC-BH-PI_devset.txt', 'r')

dataset1 = dataset1.read()
dataset2 = dataset2.read()
dataset3 = dataset3.read()
dataset4 = dataset4.read()

print(dataset1[:5000])

Fonds ID:Coll-1149
unitids
{'Coll-1149'}
scopecontent
{'This key research resource is an important survival, being a manuscript account book detailing transactions - debits and credits - relating to the lead-ore company at Leadhills, operated by Sir John Hope of Craighall. Many important people are mentioned in this book, including Alexander Hope of London, Archibald Hope of Craighall, the Earl of Wigtown, the Duke of Hamilton, the Lord of Inglestone, Charles Erskine of Alba, Alexander Tait, Lady Marie Keith, the Earl of Crawford, Lord Mordington, Lord Cardcross, and Alexander Ross. The amounts involved are huge, with the account of revenues in hand running to over £70,000 towards the end of the period. The manuscript volume itself is composed of a short alphabetic table of names, then from folio 1, accounts dating from 1 August 1662, Edinburgh, to 7 September 1671, Edinburgh, at folio 221. Towards the rear of the volume are another set of accounts and revenues and interests on 87 foli

### Reformat the Data
Remove extraneous characters and add new lines to make the text more readable, and split the text into one file per collection ("fonds" in archive-speak). In brat, annotations will be made on collection descriptions (including descriptions for the collection, its subcollections, and its items) so that the text annotators read isn't taken out of context (in the [ArchivesSpace](https://archives.collections.ed.ac.uk/) catalog, metadata descriptions are organized hierarchically with items in subcollections and subcollections in collections).

**Note:** Collections are of vastly different sizes and their descriptions of varying lengths, so the amount of text in one collection should not be used as a proxy for all the collections.  The longest collection is Coll-41, The Papers of Conrad Hal Waddington, so that can be used as the maximum text that may appear in a single collection and single annotation task.

In [11]:
def makeReadable(f):
    # Remove curly braces and empty sets, and add empty
    # lines between descriptions and field names
    f = f.replace('}', '\n')
    f = f.replace('{', '')
    f = f.replace('set()', 'No description provided \n')

    # Add space after 'Fonds ID:'
    f.replace('Fonds ID:', 'Fonds ID: ')

    # Replace metadata field names with their
    # corresponding headings on ArchivesSpace
    f = f.replace('unitids', 'Collection, Sub-collection, and Item IDs')
    f = f.replace('scopecontent', 'Scope and Contents')
    f = f.replace('bioghist', 'Biographical / Historical')
    f = f.replace('processinfo', 'Processing Information')
    
    return f

In [12]:
dataset1 = makeReadable(dataset1)
dataset2 = makeReadable(dataset2)
dataset3 = makeReadable(dataset3)
dataset4 = makeReadable(dataset4)

print("1\n",dataset1[:500])
print("2\n",dataset2[500:1000])
print("3\n",dataset3[1000:1500])
print("4\n",dataset4[1500:2000])

1
 Fonds ID:Coll-1149
Collection, Sub-collection, and Item IDs
'Coll-1149'

Scope and Contents
'This key research resource is an important survival, being a manuscript account book detailing transactions - debits and credits - relating to the lead-ore company at Leadhills, operated by Sir John Hope of Craighall. Many important people are mentioned in this book, including Alexander Hope of London, Archibald Hope of Craighall, the Earl of Wigtown, the Duke of Hamilton, the Lord of Inglestone, Charles
2
 nner - 'Empress of Britain' - Sunday 27 August - A crofters cottage on the island of Harris / Sir Alexander Mackenzie 1 x menu card - with abstract of log - diner au revoir - 'Empress of Britain' - Monday 28 August - Westerham / Maj-Gen. James Wolfe"

Biographical / Historical
"For the Canadian Pacific Steamships Ltd., Atlantic passenger carrying would last barely four decades from 1921. In the 1960s when air travel and cargo containerisation started to compete with North Atlantic shippin

Split the string in each dataset file into several strings, one for each fonds (collection) ID.

In [13]:
dataset1 = dataset1.split('Fonds ')[1:]
dataset2 = dataset2.split('Fonds ')[1:]
dataset3 = dataset3.split('Fonds ')[1:]
dataset4 = dataset4.split('Fonds ')[1:]

In [15]:
#dataset1
#dataset2
#dataset3
# dataset4

In [16]:
print(dataset1[0])

ID:Coll-1149
Collection, Sub-collection, and Item IDs
'Coll-1149'

Scope and Contents
'This key research resource is an important survival, being a manuscript account book detailing transactions - debits and credits - relating to the lead-ore company at Leadhills, operated by Sir John Hope of Craighall. Many important people are mentioned in this book, including Alexander Hope of London, Archibald Hope of Craighall, the Earl of Wigtown, the Duke of Hamilton, the Lord of Inglestone, Charles Erskine of Alba, Alexander Tait, Lady Marie Keith, the Earl of Crawford, Lord Mordington, Lord Cardcross, and Alexander Ross. The amounts involved are huge, with the account of revenues in hand running to over £70,000 towards the end of the period. The manuscript volume itself is composed of a short alphabetic table of names, then from folio 1, accounts dating from 1 August 1662, Edinburgh, to 7 September 1671, Edinburgh, at folio 221. Towards the rear of the volume are another set of accounts and re

Looks as expected!

Next, write each collection's descriptions to a separate file.

In [20]:
# Training Data
####################
datasets = [dataset1, dataset2, dataset3]
fileCount = 0
i = 1                                             # identifier for subset of training data
for d in datasets:
    j = 0                                         # index of collection string in subset of training data
    for coll in d:
        filename = 'training' + str(i) + '-'+ str(j) + '.txt'
        filepath = 'bratTxts/' + filename
        f = open(filepath, 'x')
        f.write(coll)
        f.close()
        j += 1
        fileCount += 1
    i += 1
print("Total Training Data Files:",fileCount)

Total Training Data Files: 592


In [21]:
# Development Data
####################
j = 0                                         # index of collection string in subset of development data
for coll in dataset4:
    filename = 'dev' + str(j) + '.txt'
    filepath = 'bratTxts/' + filename
    f = open(filepath, 'x')
    f.write(coll)
    f.close()
    j += 1
print("Total Development Data Files:",j)

Total Development Data Files: 197


In [66]:
print("Total brat files:",fileCount+j)

Total brat files: 789


Load the resulting files into brat for annotating!

### Summary Statistics of Descriptions to be Annotated

In [82]:
datasets = [dataset1, dataset2, dataset3, dataset4]
headings = ["ID:", "Collection, Sub-collection, and Item IDs", "Scope and Contents", "Biographical / Historical", "Processing Information", "No description provided"]
descs = []
for data in datasets:
    for s in data:
        coll_ids = re.findall("Coll-\d{4}", s)
        for coll_id in coll_ids:
            s = s.replace(coll_id, "")
        for heading in headings:
            s = s.replace(heading, "")
        s = s.strip()
        descs += [s]

In [83]:
print(type(descs[1]))
print(descs[1])

<class 'str'>
''


"The manuscript materialLe Thresor des Divines et Celestes Consolations'(London, 1643) was bound by 'Lord Herbert's Binder'. It contains 21 chapters on the nature and benefits of Afflictions. A rough translation of the introduction gives the flavour: 'Friendly Reader, this book, to which I have given light, shows how tribulations tear us away from sin, which is the source and origin of all pain; it brings us to virtue, to good, and to God, who is the means, the Principle, indeed who is in Himself all the Sovereign good. And afterwards it produces the means to keep always on the right path of virtue, eases our path towards Heaven, and forces us through a secret violence and voluntary constraints, despising that which is of the world (holding its voluptuousness, its delights and vanities against one's will and in disgust) and to breathe towards Heaven, with tears in the eyes, sighs on the lips and sobs in the heart.' The volume is dedicated to Edward Montagu, 2nd. Earl

In [85]:
fileCount = 0
i = 0   # identifier for subset of data
for desc in descs:
    filename = 'desc' + str(i) + '.txt'
    filepath = 'DatasetExports/DescriptionsOnly/' + filename
    f = open(filepath, 'x')
    f.write(desc)
    f.close()
    fileCount += 1
    i += 1
print(fileCount) # File count should be 789

789


In [95]:
wordlists = PlaintextCorpusReader("DatasetExports/DescriptionsOnly/", '\w+\d{1}.txt', encoding='utf-8')
fileids = wordlists.fileids()
tokens = wordlists.words()
# tokens[:100]

In [96]:
sentences = sent_tokenize(wordlists.raw())
# sentences[:10]

In [88]:
print("Total sentences:",len(sentences))
alpha_tokens = [t for t in tokens if t.isalpha()]
print("Total words:",len(alpha_tokens))

Total sentences: 77093
Total words: 1279713


In [89]:
print("Average sentences per file:", len(sentences)/fileCount)

Average sentences per file: 97.70975918884665


In [92]:
# Sources:
# https://www.wikiwand.com/en/Courtesy_titles_in_the_United_Kingdom#/Scottish_courtesy_titles
# https://en.wikipedia.org/wiki/English_honorifics

fem_titles = ["Madam", "Madame", "Ma'am", "Lady", "Queen", "Dame", "Duchess", "Miss", "Ms", "Mrs", "Missus", "Mx", "Marchioness", "Countess", "Viscountess", "Baroness", "Maid"]
masc_titles = ["Sir", "Lord", "King", "Duke", "Mr", "Sire", "Gentleman", "Marquess", "Viscount", "Baron", "Laird"]
fem_pronouns = ["she", "her"]
masc_pronouns = ["him", "his", "he"]
both_pronouns = ["they", "their", "them"]

In [93]:
fem_tokens = [t for t in alpha_tokens if t in fem_titles]
fem_pronouns = [t for t in alpha_tokens if t in fem_pronouns]
masc_pronouns = [t for t in alpha_tokens if t in masc_pronouns]
masc_tokens = [t for t in alpha_tokens if t in masc_titles]
both_tokens = [t for t in alpha_tokens if t in both_pronouns]

In [100]:
print("Feminine Titles:", len(fem_tokens), "("+str((len(fem_tokens)/(len(masc_tokens)+len(fem_tokens)))*100)+"%)")
print("Feminine Pronouns:", len(fem_pronouns), "("+str((len(fem_pronouns)/(len(masc_pronouns)+len(fem_pronouns)+len(both_pronouns)))*100)+"%)")
print("Masculine Titles:", len(masc_tokens), "("+str((len(masc_tokens)/(len(masc_tokens)+len(fem_tokens)))*100)+"%)")
print("Masculine Pronouns:", len(masc_pronouns), "("+str((len(masc_pronouns)/(len(masc_pronouns)+len(fem_pronouns)+len(both_pronouns)))*100)+"%)")
print("Both Pronouns:", len(both_pronouns), "("+str((len(both_pronouns)/(len(masc_pronouns)+len(fem_pronouns)+len(both_pronouns)))*100)+"%)")

Feminine Titles: 1631 (28.887708111937656%)
Feminine Pronouns: 4131 (25.792957042957042%)
Masculine Titles: 4015 (71.11229188806234%)
Masculine Pronouns: 11882 (74.18831168831169%)
Both Pronouns: 3 (0.018731268731268732%)


### Estimating Time Needed for Annotation
Pilot 1 (myself):
* Files visited: training1-0.txt - training1-121.txt
* Files annotated: 26 out of 27 files
* Total time: 1 hour, 30 minutes

Pilot 2 (three participants):
* Files visited: training1-122.txt - training1-130.txt (10); training2-0.txt - training2-103.txt (7); training3-0.txt - training3-108.txt (11)
* Files annotated: 10 out of 10 files; 6 out of 7 files; 10 out of 11 files
* Total time: 30 minutes

Estimates by file:

In [10]:
total_files = 789 + 197
est_total_hrs = (total_files/27)*1.5
print("Estimated time for me to annotate:",est_total_hrs)
print("Estimated total weeks at 9 hours per week:", est_total_hrs/9.0)

Estimated time for me to annotate: 54.77777777777778
Estimated total weeks at 9 hours per week: 6.08641975308642


In [12]:
est2_total_hrs = (total_files/7)*0.5
print("Maximum estimated time to annotate:", est2_total_hrs)
print("Maximum estimated weeks at 9 hours per week:",est2_total_hrs/9.0)

Maximum estimated time to annotate: 70.42857142857143
Maximum estimated weeks at 9 hours per week: 7.825396825396826


Estimates by token:

In [15]:
wordlists = PlaintextCorpusReader("bratTxts/", '.+(.txt)', encoding='utf-8')
fileids = wordlists.fileids()
# print(fileids)

In [25]:
tokens = wordlists.words()
print(len(tokens))

2082296


In [17]:
pilot1 = ["training1-0.txt", "training1-1.txt", "training1-10.txt", "training1-100.txt", "training1-101.txt", "training1-102.txt", 
          "training1-103.txt", "training1-104.txt", "training1-105.txt", "training1-106.txt", "training1-107.txt", "training1-108.txt",
          "training1-109.txt", "training1-11.txt", "training1-110.txt", "training1-111.txt", "training1-112.txt", "training1-113.txt",
          "training1-114.txt", "training1-115.txt", "training1-116.txt", "training1-117.txt", "training1-118.txt", "training1-119.txt",
          "training1-12.txt", "training1-120.txt", "training1-121.txt"
         ]
pilot2a = ["training1-122.txt", "training1-123.txt", "training1-124.txt", "training1-125.txt", "training1-126.txt", "training1-127.txt",
          "training1-128.txt", "training1-129.txt", "training1-13.txt", "training1-130.txt"]
pilot2b = ["training2-0.txt", "training2-1.txt", "training2-10.txt", "training2-100.txt", "training2-101.txt", "training2-102.txt",
          "training2-103.txt"
          ]
pilot2c = ["training3-0.txt", "training3-1.txt", "training3-10.txt", "training3-100.txt", "training3-101.txt", "training3-102.txt",
          "training3-103.txt", "training3-104.txt", "training3-105.txt", "training3-106.txt", "training3-107.txt", "training3-108.txt"
          ]

In [34]:
pilot1_tokens = 0
for filename in pilot1:
    pilot1_tokens += len(wordlists.words(filename))
print("Pilot 1 Total Tokens:", pilot1_tokens)
print("Tokens per hour:", ((pilot1_tokens/3)*2))
print("Total hours:", (len(tokens))/((pilot1_tokens/3)*2))
print("Total weeks at 20 hours per week:", (len(tokens))/((pilot1_tokens/3)*2)/20)

Pilot 1 Total Tokens: 13419
Tokens per hour: 8946.0
Total hours: 232.76279901632014
Total weeks at 20 hours per week: 11.638139950816008


In [37]:
pilot2a_tokens = 0
for filename in pilot2a:
    pilot2a_tokens += len(wordlists.words(filename))
print("Pilot 2a Total Tokens:", pilot2a_tokens)
print("Tokens per hour:", ((pilot2a_tokens/3)*2))
print("Total hours:", (len(tokens))/((pilot2a_tokens/3)*2))
print("Total weeks at 9 hours per week:", (len(tokens))/((pilot2a_tokens/3)*2)/9)
print("Total tokens in 8 weeks at 9 hours per week:", ((pilot2a_tokens/3)*2)*(9*8), " (", ((pilot2a_tokens/3)*2)*(9*8)/(len(tokens)), "% of data)")

Pilot 2a Total Tokens: 3977
Tokens per hour: 2651.3333333333335
Total hours: 785.3769172743273
Total weeks at 9 hours per week: 87.2641019193697
Total tokens in 8 weeks at 9 hours per week: 190896.0  ( 0.09167572717807651 % of data)


In [43]:
pilot2b_tokens = 0
for filename in pilot2b:
    pilot2b_tokens += len(wordlists.words(filename))
print("Pilot 2b Total Tokens:", pilot2b_tokens)
print("Tokens per hour:", ((pilot2b_tokens/3)*2))
print("Total hours:", (len(tokens))/((pilot2b_tokens/3)*2))
print("Total weeks at 9 hours per week:", (len(tokens))/((pilot2b_tokens/3)*2)/9)
print("***Note that this pilot annotator was a non-native English speaker!")

Pilot 2b Total Tokens: 3465
Tokens per hour: 2310.0
Total hours: 901.4268398268398
Total weeks at 9 hours per week: 100.15853775853776
***Note that this pilot annotator was a non-native English speaker!


In [38]:
pilot2c_tokens = 0
for filename in pilot2c:
    pilot2c_tokens += len(wordlists.words(filename))
print("Pilot 2b Total Tokens:", pilot2c_tokens)
print("Tokens per hour:", ((pilot2c_tokens/3)*2))
print("Total hours:", (len(tokens))/((pilot2c_tokens/3)*2))
print("Total weeks at 9 hours per week:", (len(tokens))/((pilot2c_tokens/3)*2)/9)
print("Total weeks at 9 hours per week:", (len(tokens))/((pilot2c_tokens/3)*2)/9)
print("Total tokens in 8 weeks at 9 hours per week:", ((pilot2c_tokens/3)*2)*(9*8), " (", ((pilot2c_tokens/3)*2)*(9*8)/(len(tokens)), "% of data)")

Pilot 2b Total Tokens: 5387
Tokens per hour: 3591.3333333333335
Total hours: 579.8113978095415
Total weeks at 9 hours per week: 64.42348864550462
Total weeks at 9 hours per week: 64.42348864550462
Total tokens in 8 weeks at 9 hours per week: 258576.0  ( 0.12417831086454567 % of data)


***

**Considering I had not implemented keyboard shortcuts, it seems likely that I'll be able to hire an annotator to annotate about 10% of the data (calculated by tokens) in 8 weeks at 9 hours per week.**

**I should be able to annotate the data in 11-12 weeks at 20 hours per week.**