# Creating Data for Final Annotation
Harvesting, transforming, and exporting metadata descriptions for annotation of gendered language in [brat](https://brat.nlplab.org/).

This Jupyter Notebook prepares text for upload into [brat](https://brat.nlplab.org/index.html), where it will be annotated for instances of gender bias.  The aim of the annotation is to create a gold standard dataset on which a classifier can be trained to identify gender bias in archival metadata descriptions.  

This project is focused on the English language and archival institutions in the United Kingdom.

* Creator: Lucy Havens
* Date: February 2021 - April 2021
* Project: PhD research at the School of Informatics, University of Edinburgh
* Data Source: Centre for Research Collections' (CRC) [online archival catalog](https://archives.collections.ed.ac.uk/)

***
**Table of Contents**

  [I. Harvesting](#harvesting)

  [II. Transforming](#transforming)

  [III. Preparing](#preparing)

  [IV. Renaming Pre-Annotated Files](#renaming)
  
  [V. Splitting Files Among Annotators](#splitting)
  
  ***

<a id="harvesting"></a>
## I. Harvesting
Obtain metadata from the CRC's online archival catalog using the Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH).  The CRC provides its metadata in Encoded Archival Description (EAD) format as XML data.  Harvest metadata descriptions from the following metadata fields in the Centre for Research Collections' online catalog:
  * Scope and Contents
  * Biographical / Historical
  * Processing Information
  * Title
  * Date
  * Language of Material
  * Geography Name
  * Unit ID
  * Encoded Archival Description Identifier

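OAI-PMH returns long record lists one page at a time: each response carries a `resumptionToken`, and the next request sends only the verb and that token. The sketch below shows this step in isolation, without touching the network. The helper names, the `example.org` base URL, and the sample response are assumptions made for illustration; they are not part of the CRC's service or of the cells that follow.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_list_records_url(base_url, resumption_token=None, metadata_prefix="oai_ead"):
    """Build a ListRecords request URL for the first page or a follow-up page."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def extract_resumption_token(xml_text):
    """Return the resumption token in a response, or None on the last page."""
    root = ET.fromstring(xml_text)
    token_elem = root.find(".//" + OAI_NS + "resumptionToken")
    # A missing or empty token element means there are no more pages
    if token_elem is None or not (token_elem.text or "").strip():
        return None
    return token_elem.text.strip()

# Made-up response for illustration only (not real CRC data)
sample_page = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>...</metadata></record>
    <resumptionToken>page-2-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

token = extract_resumption_token(sample_page)
print(build_list_records_url("http://example.org/oai", token))
```

Treating an empty token the same as a missing one lets a harvesting loop stop cleanly on the final page.
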
In [7]:
# Import libraries for harvesting
import xml.dom.minidom
import urllib.request
import urllib
import xml.etree.ElementTree as ET
from lxml import etree

In [8]:
archiveMetadataUrl = "http://lac-archives-live.is.ed.ac.uk:8082/?verb=ListRecords&metadataPrefix=oai_ead"

def getRootFromUrl(url):
    content = urllib.request.urlopen(url)

    #tree = ET.parse(content)
    parser = etree.XMLParser(recover=True)  # Use recover to try to fix broken XML
    tree = etree.parse(content, parser)
    
    root = tree.getroot()
    return root

root = getRootFromUrl(archiveMetadataUrl)
print(root)

<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 0x7fc04a4950c0>


In [9]:
# Input: an XML root element, part of or the entirety of a tag name below which you want to get text,
#        and a header string whose text should be left out (use "" to keep all text)
# Output: a list of text between tags contained within the inputted tagName, 
#         with one list element per tagName instance
def getTextBeneathTag(root, tagName, header):
    text_list = []
    for child in root.iter():
        tag = child.tag
        if tagName in tag:
            text_elem = ""
            for subchild_text in child.itertext():
                if header:
                    if header not in subchild_text:
                        text_elem = text_elem + subchild_text
                else:
                    text_elem = text_elem + subchild_text
            text_list.append(text_elem)
    return text_list

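To see what the `header` argument does, here is a small standalone sketch. It restates the function compactly with the standard library's `xml.etree.ElementTree` instead of `lxml`, and runs it on a made-up fragment; the function name `get_text_beneath_tag` and the sample descriptions are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def get_text_beneath_tag(root, tag_name, header):
    """Compact restatement of getTextBeneathTag for a standalone demo."""
    text_list = []
    for child in root.iter():
        # Substring match, so namespaced tags such as "{ns}scopecontent" also match
        if tag_name in child.tag:
            # Skip any text node containing the header; keep everything if header is ""
            parts = [t for t in child.itertext() if not (header and header in t)]
            text_list.append("".join(parts))
    return text_list

# Made-up EAD-like fragment, not real catalog data
sample = (
    "<archdesc>"
    "<scopecontent><head>Scope and Contents</head><p>Letters to her sister.</p></scopecontent>"
    "<scopecontent><head>Scope and Contents</head><p>Two photographs.</p></scopecontent>"
    "</archdesc>"
)
root = ET.fromstring(sample)
with_header_removed = get_text_beneath_tag(root, "scopecontent", "Scope and Contents")
with_header_kept = get_text_beneath_tag(root, "scopecontent", "")
print(with_header_removed)  # ['Letters to her sister.', 'Two photographs.']
print(with_header_kept)
```

Note that the header check drops the whole text node that contains the header, so a description that happens to repeat the header phrase inside a sentence would lose that sentence too.
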
In [10]:
# Input: boolean flag (True to keep requesting pages), url for harvesting metadata, starting prefix for the end of the url, and lists of metadata fields to gather
# Output: lists of strings of the gathered metadata fields' descriptions, with one string per fonds, series, and item in the catalog
def getDescriptiveMetadata(more, archiveMetadataUrlShort, startingPrefix, eadid, ut, ui, ud, gn, lm, sc, bh, pi):    
   
    archiveMetadataUrlWithPrefix = archiveMetadataUrlShort + startingPrefix
    root = getRootFromUrl(archiveMetadataUrlWithPrefix)
    eadid.append(getTextBeneathTag(root, "eadid", "Encoded Archival Description Identifier"))
    ut.append(getTextBeneathTag(root, "unittitle", "Unit Title"))
    ui.append(getTextBeneathTag(root, "unitid", "Unit Identifier"))
    ud.append(getTextBeneathTag(root, "unitdate", "Unit Date"))
    gn.append(getTextBeneathTag(root, "geogname", "Geography Name"))
    lm.append(getTextBeneathTag(root, "langmaterial", "Language of Materials"))
    sc.append(getTextBeneathTag(root, "scopecontent", "Scope and Contents"))
    bh.append(getTextBeneathTag(root, "bioghist", "Biographical / Historical"))
    pi.append(getTextBeneathTag(root, "processinfo", "Processing Information"))
    resumptionToken = getTextBeneathTag(root, "resumptionToken", "")
    
    if len(resumptionToken) == 0:
        more = False
    i = 1
    
    while more:
        archiveMetadataUrlWithToken = archiveMetadataUrlShort + "resumptionToken=" + resumptionToken[0]
        root = getRootFromUrl(archiveMetadataUrlWithToken)
        eadid.append(getTextBeneathTag(root, "eadid", "Encoded Archival Description Identifier"))
        ut.append(getTextBeneathTag(root, "unittitle", "Unit Title"))
        ui.append(getTextBeneathTag(root, "unitid", "Unit Identifier"))
        ud.append(getTextBeneathTag(root, "unitdate", "Unit Date"))
        gn.append(getTextBeneathTag(root, "geogname", "Geography Name"))
        lm.append(getTextBeneathTag(root, "langmaterial", "Language of Materials"))
        sc.append(getTextBeneathTag(root, "scopecontent", "Scope and Contents"))
        bh.append(getTextBeneathTag(root, "bioghist", "Biographical / Historical"))
        pi.append(getTextBeneathTag(root, "processinfo", "Processing Information"))
        resumptionToken = getTextBeneathTag(root, "resumptionToken", "")
        if len(resumptionToken) == 0:
            more = False
        i += 1
    
    print(str(i) + " resumption tokens")
    return eadid, ut, ui, ud, gn, lm, sc, bh, pi

In [11]:
url = "http://lac-archives-live.is.ed.ac.uk:8082/?verb=ListRecords&"
startPrefix = "metadataPrefix=oai_ead"
eadid = [] # List of fonds-level identifiers
ut = [] # List of fonds, series, and item titles
ui = [] # List of fonds, series, and item identifiers
ud = [] # List of fonds, series, and item dates
gn = [] # List of fonds, series, and item associated geographic locations 
lm = [] # List of fonds, series, and item material languages
sc = [] # List of fonds, series, and item "Scope and Contents" descriptions
bh = [] # List of fonds, series, and item "Biographical / Historical" descriptions
pi = []  # List of fonds, series, and item "Processing Information" descriptions

eadid, ut, ui, ud, gn, lm, sc, bh, pi = getDescriptiveMetadata(True, url, startPrefix, eadid, ut, ui, ud, gn, lm, sc, bh, pi)

1081 resumption tokens


In [12]:
print(len(eadid))
print(len(ut))
print(len(ui))
print(len(ud))
print(len(gn))
print(len(lm))
print(len(sc))
print(len(bh))
print(len(pi))

1081
1081
1081
1081
1081
1081
1081
1081
1081


In [14]:
i = 0
print(len(eadid[i]))
print(len(ut[i]))
print(len(ui[i]))
print(len(ud[i]))
print(len(gn[i]))
print(len(lm[i]))
print(len(sc[i]))
print(len(bh[i]))
print(len(pi[i]))

1
124
124
124
116
125
119
2
1


In [18]:
def deduplicateDescriptions(metadata_field_list):
    unique_descs = []
    for fonds in metadata_field_list:
        unique_descs += [list(set(fonds))]
    assert len(metadata_field_list) == len(unique_descs)
    return unique_descs
unique_sc = deduplicateDescriptions(sc)
unique_bh = deduplicateDescriptions(bh)
unique_pi = deduplicateDescriptions(pi)

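A Python `set` does not guarantee any particular order, so `list(set(fonds))` can return a fonds' descriptions in a different order from the catalog. If the original order matters for reading, one possible alternative is sketched below. The function name and the example descriptions are made up for illustration; this is not the function used in the rest of the notebook.

```python
def deduplicate_preserving_order(metadata_field_list):
    """Drop repeated descriptions within each fonds, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return [list(dict.fromkeys(fonds)) for fonds in metadata_field_list]

# Made-up descriptions for two fonds
example = [["A letter.", "A diary.", "A letter."], ["Maps.", "Maps."]]
deduplicated = deduplicate_preserving_order(example)
print(deduplicated)  # [['A letter.', 'A diary.'], ['Maps.']]
```
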
<a id="transforming"></a>
## II. Transforming
Create a table (pandas DataFrame) of the metadata fields that do not contain multi-sentence descriptions, and write the descriptive metadata to plain text files.

In [19]:
import pandas as pd
import re
import string
import csv

In [23]:
flatten = []
for sublist in eadid:
    for item in sublist:
        flatten += [item]
print(flatten[0])

Coll-1064

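The nested loop above flattens a list of lists by one level. As a sketch of an equivalent approach, the standard library's `itertools.chain.from_iterable` does the same thing in one call; the short input list here is a hand-picked sample, including one empty identifier.

```python
from itertools import chain

# Each inner list holds the identifier(s) harvested from one page
nested = [["Coll-1064"], ["Coll-31"], [""]]
flat = list(chain.from_iterable(nested))
print(flat)  # ['Coll-1064', 'Coll-31', '']
```
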

In [24]:
df = pd.DataFrame.from_dict({"eadid":flatten,"unit_title":ut, "unit_identifier":ui, "unit_date":ud, "geography":gn, "language":lm})
# df = pd.read_csv("CRC_units-grouped-by-fonds.csv")
df.head()

Unnamed: 0,eadid,unit_title,unit_identifier,unit_date,geography,language
0,Coll-1064,"[Papers of Professor Walter Ledermann, 1 (37),...","[Coll-1064, Coll-1064/1, Coll-1064/2, Coll-106...","[1937-1954, 2 Feb 1937, 10 Feb 1937, 16 Feb 19...","[Edinburgh (Scotland), Edinburgh (Scotland), E...","[\n English\n , English, English, Engl..."
1,Coll-31,[Drawings from the Office of Sir Rowand Anders...,"[Coll-31, Coll-31/1, Coll-31/1/1, Coll-31/1/1/...","[1814-1924, 1874-1905, 1874-1879, 1874-1875, 1...",[],"[\n English\n , English, English, Engl..."
2,Coll-51,[Papers of Sir Roderick Impey Murchison and hi...,"[Coll-51, Coll-51/1, Coll-51/2, Coll-51/2/1, C...","[1771-1935, 1723-1935, 1770-1938, 1770-1938, 1...","[Calcutta (India), Europe, Scotland, Tarradale...","[\n English\n , English, English, Engl..."
3,Coll-204,"[Lecture Notes of John Robison, Introductions,...","[Coll-204, Coll-204/1, Coll-204/2, Coll-204/3,...","[c1779-c1801, c1779-c1801, c1804, c1802, c1780...","[Edinburgh (Scotland), Glasgow Lanarkshire Sco...","[\n English\n , English., English Lati..."
4,Coll-206,[Records of the Wernerian Natural History Soci...,"[Coll-206, Coll-206/1, Coll-206/1/1, Coll-206/...","[1808-1858, 12 January 1808-16 April 1858, 12 ...","[Edinburgh (Scotland), Freiburg im Breisgau (G...","[\n English\n , English, English, Engl..."


In [25]:
print(df.shape)
df.to_csv("CRC_units-grouped-by-fonds.csv")

(1081, 6)


In [29]:
list(df["eadid"])  # Some of these are empty strings!

['Coll-1064',
 'Coll-31',
 'Coll-51',
 'Coll-204',
 'Coll-206',
 'Coll 205',
 'Coll-1443',
 'Coll-1444',
 'Coll-1391',
 'Coll-1371',
 'Coll-1373',
 'Coll-96',
 'Coll-891',
 'Coll-623',
 'Coll-58',
 'Coll-60',
 'Coll-61',
 'Coll-56',
 'Coll-550',
 'Coll-55',
 'Coll-548',
 'Coll-549',
 'Coll-89',
 'Coll-540',
 'Coll-542',
 'Coll-547',
 'Coll-53',
 'Coll-522',
 'Coll-523',
 'Coll-516',
 'Coll-518',
 'Coll-521',
 'Coll-507',
 'Coll-890',
 'Coll-509',
 'Coll-513',
 'Coll-499',
 'Coll-500',
 'Coll-49',
 'Coll-487',
 'Coll-494',
 'Coll-48',
 'Coll-475',
 'Coll-478',
 'Coll-887',
 'Coll-454',
 'Coll-461',
 'Coll-462',
 'Coll-468',
 'Coll-439',
 'Coll-443',
 'Coll-444',
 'Coll-453',
 'Coll-42',
 'Coll-43',
 'Coll-88',
 'Coll-417',
 'Coll-426',
 'Coll-405',
 'Coll-39',
 'Coll-40',
 'Coll-392',
 'Coll-35',
 'Coll-36',
 'Coll-38',
 'Coll-344',
 'Coll-84',
 'Coll-34',
 'Coll-333',
 'Coll-335',
 'Coll-341',
 'Coll-283',
 'Coll-302',
 'Coll-307',
 'Coll-311',
 'Coll-275',
 'Coll-276',
 '',
 'Coll-282

In [32]:
indeces = []
for ui_list in ui:
    indeces += [ui_list[0]]
print(len(indeces))
print(indeces[:10])

1081
['Coll-1064', 'Coll-31', 'Coll-51', 'Coll-204', 'Coll-206', 'Coll-205', 'Coll-1443', 'Coll-1444', 'Coll-1391', 'Coll-1371']


In [32]:
# def flattenTwoDimensionalList(two_d_list):
#     flattened = []
#     for listoflists in two_d_list:
#         for unit in listoflists:
#             flattened += [unit]
#     return flattened

In [33]:
# titles = flattenTwoDimensionalList(ut)
# # print(titles[0:30])
# identifiers = flattenTwoDimensionalList(ui)
# dates = flattenTwoDimensionalList(ud)
# geogs = flattenTwoDimensionalList(gn)
# lang = flattenTwoDimensionalList(lm)
# scopecont = flattenTwoDimensionalList(sc)
# bioghist = flattenTwoDimensionalList(bh)
# procinfo = flattenTwoDimensionalList(pi)

In [34]:
# print(len(titles))
# print(len(identifiers))
# print(len(dates))
# print(len(geogs))
# print(len(lang))
# print(len(scopecont))
# print(len(bioghist))
# print(len(procinfo))

In [34]:
def writeListsToFilesPerFonds(indeces, titles, scopeconts, bioghists, procinfo):
    # Write one plain text file per fonds, named after the fonds' identifier
    maxI = len(indeces)
    for i in range(maxI):
        # Make the identifier safe to use as a file name
        filename = indeces[i].strip().replace(" ", "_").replace("/", "_")
        filepath = "descriptions_by_fonds/"+filename+".txt"
        # The with statement closes the file, so no explicit f.close() is needed
        with open(filepath, 'w') as f:
            f.write("Identifier: ")
            f.write(filename + "\n")
            for t in titles[i]:
                f.write("\nTitle:\n")
                f.write(t.strip() + "\n")
            for s in scopeconts[i]:
                f.write("\nScope and Contents:\n")
                f.write(s.strip() + "\n")
            for b in bioghists[i]:
                f.write("\nBiographical / Historical:\n")
                f.write(b.strip() + "\n")
            for p in procinfo[i]:
                f.write("\nProcessing Information:\n")
                f.write(p.strip() + "\n")
    return str(maxI) + " files written"

In [35]:
writeListsToFilesPerFonds(indeces, ut, unique_sc, unique_bh, unique_pi)

'1081 files written'

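An empty identifier produces a file named just `.txt`, and one such file does show up in the file list printed in the next section. A minimal sketch of one way to guard against this follows; `make_filename` and its fallback name are hypothetical and are not used elsewhere in this notebook.

```python
def make_filename(identifier, fallback="unknown_identifier"):
    """Turn a unit identifier into a file name, substituting a fallback if it is empty."""
    # Same cleaning steps as writeListsToFilesPerFonds
    name = identifier.strip().replace(" ", "_").replace("/", "_")
    if not name:
        name = fallback  # hypothetical placeholder for empty identifiers
    return name + ".txt"

print(make_filename("Coll 205"))     # Coll_205.txt
print(make_filename(" Coll-31/1 "))  # Coll-31_1.txt
print(make_filename(""))             # unknown_identifier.txt
```

If more than one identifier were empty, the fallback would need a counter or similar suffix so the files do not overwrite each other.
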
<a id="preparing"></a>
## III. Preparing
Prepare the files for annotation: make sure they are easy to read, and split any excessively long files.

In [1]:
import string
import re
import csv

# Libraries for Natural Language Processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.text import Text
# nltk.download('punkt')
# from nltk.probability import FreqDist
# nltk.download('stopwords')
# from nltk.corpus import stopwords
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
# from nltk.tag import pos_tag

In [2]:
directory = 'descriptions_by_fonds/'
files = PlaintextCorpusReader(directory, '.+')
tokens = files.words()

In [5]:
print(tokens[:20])

['Identifier', ':', 'Title', ':', 'Papers', 'of', 'Professor', 'Sir', 'Kenneth', 'Murray', 'Title', ':', 'Awards', 'and', 'honours', 'Title', ':', 'Bronze', 'medal', 'from']


In [3]:
token_totals = []
filenames = files.fileids()
for f in filenames:
        token_totals += [len(files.words(f))]
file_lengths = dict(zip(filenames,token_totals))
print(filenames[0:10])

['.txt', 'AA4.txt', 'AA5.txt', 'AA6.txt', 'AA7.txt', 'BAI.txt', 'Coll-100.txt', 'Coll-1000.txt', 'Coll-1001.txt', 'Coll-1004.txt']


In [6]:
token_totals.sort()
print(set(token_totals))

{8, 9, 14, 2063, 15, 24595, 4115, 2068, 19, 23, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 12333, 48, 47, 50, 51, 10290, 8242, 54, 55, 56, 24632, 52, 59, 57, 61, 58, 63, 64, 65, 66, 67, 68, 60, 70, 71, 72, 73, 74, 75, 8269, 78, 79, 77, 80, 82, 83, 81, 85, 86, 89, 91, 95, 97, 99, 100, 103, 104, 105, 106, 107, 108, 109, 110, 114, 116, 118, 119, 120, 121, 24, 124, 125, 126, 127, 128, 129, 131, 2180, 133, 134, 136, 137, 138, 139, 140, 142, 143, 144, 145, 147, 148, 149, 150, 151, 153, 154, 156, 157, 158, 159, 161, 163, 164, 45221, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 2223, 184, 186, 187, 185, 190, 192, 196, 198, 199, 200, 202, 203, 204, 205, 207, 208, 209, 210, 211, 212, 213, 215, 216, 217, 218, 219, 222, 223, 224, 225, 226, 227, 230, 231, 233, 234, 235, 28908, 236, 237, 28911, 240, 241, 242, 239, 244, 245, 243, 247, 248, 249, 250, 251, 252, 253, 255, 256, 257, 258, 259, 260, 261, 262, 263,

In [60]:
file_lengths["Coll-1250.txt"]

826

In [61]:
too_long = []
for key,value in file_lengths.items():
    if value > 1000:
        too_long += [key]

In [62]:
print(too_long)
print(len(too_long))

['BAI.txt', 'Coll-1022.txt', 'Coll-1036.txt', 'Coll-1052.txt', 'Coll-1057.txt', 'Coll-1059.txt', 'Coll-1060.txt', 'Coll-1061.txt', 'Coll-1062.txt', 'Coll-1064.txt', 'Coll-1066.txt', 'Coll-1142.txt', 'Coll-1146.txt', 'Coll-1156.txt', 'Coll-1162.txt', 'Coll-1167.txt', 'Coll-1242.txt', 'Coll-1243.txt', 'Coll-1247.txt', 'Coll-1255.txt', 'Coll-1257.txt', 'Coll-1260.txt', 'Coll-1266.txt', 'Coll-1294.txt', 'Coll-13.txt', 'Coll-1310.txt', 'Coll-1320.txt', 'Coll-1329.txt', 'Coll-1357.txt', 'Coll-1362.txt', 'Coll-1363.txt', 'Coll-1364.txt', 'Coll-1373.txt', 'Coll-1383.txt', 'Coll-1385.txt', 'Coll-14.txt', 'Coll-1434.txt', 'Coll-1443.txt', 'Coll-146.txt', 'Coll-1461.txt', 'Coll-1489.txt', 'Coll-1490.txt', 'Coll-1492.txt', 'Coll-1496.txt', 'Coll-1497.txt', 'Coll-1499.txt', 'Coll-1527.txt', 'Coll-1528.txt', 'Coll-1541.txt', 'Coll-1549.txt', 'Coll-1557.txt', 'Coll-1574.txt', 'Coll-1577.txt', 'Coll-1580.txt', 'Coll-1583.txt', 'Coll-1586.txt', 'Coll-1593.txt', 'Coll-16.txt', 'Coll-1613.txt', 'Coll-162

That's a lot of files to break up manually, so let's use Python to divide the files into smaller files with a maximum of 100 lines each. Every file is passed through the splitter; a file of 100 lines or fewer simply comes out as a single file.

In [5]:
# Code in this cell based on:
# https://stackoverflow.com/questions/16289859/splitting-large-text-file-into-smaller-text-files-by-line-numbers-using-python
def splitLargeFile(f, max_lines, old_dir, new_dir):
    short = None
    file_path = old_dir+f
    # Turn a name such as "Coll-100.txt" into the prefix "Coll-100_" once, before reading any lines
    prefix = f.replace(".txt","_")
    with open(file_path) as long:
        for line_no, line in enumerate(long):
            # Start a new output file every max_lines lines
            if line_no % max_lines == 0:
                if short:
                    short.close()
                # Name each piece after the line number at which a full piece would end
                short_name = prefix+'{}.txt'.format(line_no + max_lines)
                new_path = new_dir+short_name
                short = open(new_path, "w")
            short.write(line)
        if short:
            short.close()

In [7]:
for f in filenames:
    splitLargeFile(f, 100, "descriptions_by_fonds/", "descriptions_by_fonds_split/")

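To make the naming scheme concrete, here is a minimal in-memory sketch of the same chunking logic. It assumes a made-up identifier, `Coll-0000`, and 250 dummy lines; `chunk_lines` is a hypothetical helper that writes no files.

```python
def chunk_lines(lines, max_lines):
    """Yield (end_line_number, chunk) pairs, each chunk holding up to max_lines lines."""
    for start in range(0, len(lines), max_lines):
        yield start + max_lines, lines[start:start + max_lines]

# 250 dummy lines split into pieces of at most 100 lines
lines = ["line {}\n".format(n) for n in range(250)]
for end_line, chunk in chunk_lines(lines, 100):
    print("Coll-0000_{}.txt".format(end_line), len(chunk))
```

The last piece is still named after the line number at which a full piece would end (300 here), even though it holds only 50 lines.
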
In [8]:
directory = 'descriptions_by_fonds_split/'
files = PlaintextCorpusReader(directory, '.+')
print("Total files to annotate:",len(files.fileids()))

Total files to annotate: 3649


<a id="renaming"></a>
## IV. Renaming Pre-Annotated Files
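Rename the files so that they sort in the correct order: zero-pad the number at the end of each file name so that alphabetical order matches numeric order.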

In [2]:
import os
import re

In [10]:
datadir = "descriptions_by_fonds_split"
filenames = os.listdir(datadir)

In [11]:
max_digits = 0
for f in filenames:
    fend = re.findall(r"\d+\.ann|\d+\.txt",f)
    if fend:
        fdigits = len(re.findall(r"\d",fend[0])) 
        if fdigits > max_digits:
            max_digits = fdigits
print(max_digits)

5


In [12]:
# Pad file names with zeros so all are 5 digits long
for f in filenames:
    oldpath = os.path.join(datadir,f)
    end_list = re.findall(r"\d+\.ann|\d+\.txt",f)
    if len(end_list) > 0:
        start = f.replace(end_list[0],"")
        digits = len(re.findall(r"\d",end_list[0]))
        new_f = start + "0" * (max_digits - digits) + end_list[0]
        newpath = os.path.join(datadir,new_f)
        os.rename(oldpath, newpath)
        
# Note: code in this cell based on: https://stackoverflow.com/questions/2491222/how-to-rename-a-file-using-python

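Why pad at all? File names sort character by character, so `_1000` sorts before `_200`. The sketch below shows the effect on three file names in the style of the split files, using `re.sub` and `str.zfill`; `pad_suffix` is a hypothetical helper that works on strings only and renames nothing on disk.

```python
import re

def pad_suffix(filename, width):
    """Zero-pad the digits just before .txt or .ann to the given width."""
    return re.sub(
        r"(\d+)(\.(?:ann|txt))$",
        lambda m: m.group(1).zfill(width) + m.group(2),
        filename,
    )

names = ["Coll-1434_1000.txt", "Coll-1434_200.txt", "Coll-1434_14700.txt"]
unpadded_order = sorted(names)
padded_order = sorted(pad_suffix(n, 5) for n in names)
print(unpadded_order)  # _1000 comes before _14700, which comes before _200
print(padded_order)    # _00200, _01000, _14700
```
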
In [14]:
# # Put files in subfolders based on fonds (collection) identifier
# filenames = os.listdir(datadir)
# for f in filenames:
#     oldpath = os.path.join(datadir,f)
#     end_list = re.findall("\d+\.ann|\d+\.txt",f)
#     if len(end_list) > 0:
#         identifier = f.replace(end_list[0],"")
#         new_dir = os.path.join(datadir,identifier)
#         try:
#             os.makedirs(new_dir)
#         except FileExistsError:
#             # directory already exists
#             pass
#         newpath = os.path.join(new_dir,f)
#         os.rename(oldpath, newpath)

<a id="splitting"></a>
## V. Splitting Files Among Annotators
* Total annotators: 5 (four hired annotators, A1 to A4, plus me)
    * A1 and A2 to annotate with Person Name and Linguistic labels
    * A3 and A4 to annotate with Contextual labels
* Inter-annotator agreement (IAA) will be evaluated for:
    * A1 and A2
    * Me and A1
    * Me and A2
    * A3 and A4
    * Me and A3
    * Me and A4

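This notebook does not say which agreement measure will be used. As a sketch only, Cohen's kappa is one common choice for a pair of annotators who each assign one label per item. The function below and the labels "Gendered" and "Neutral" are made up for illustration and simplify the span-based annotation done in brat.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired, non-empty labels"
    n = len(labels_a)
    # Share of items on which the two annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for four items
annotator_1 = ["Gendered", "Gendered", "Neutral", "Neutral"]
annotator_2 = ["Gendered", "Neutral", "Neutral", "Neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.
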
In [4]:
# My pilot with the finalized instructions began at 16:05 and ended at 16:35. Many of the 15 files I 
# annotated were short on description, so let's estimate an average of 10 files per hour, to be safe
hired_hours = 9*8  # each annotator working 9 hours per week for 8 weeks
est_files_annotated = hired_hours * 10
print("Estimated files each annotator will label:", est_files_annotated)
print("Estimated files both pairs of annotators will label in total:", est_files_annotated * 2)

Estimated files each annotator will label: 720
Estimated files both pairs of annotators will label in total: 1440


In [8]:
print("Allow 10% overlap for each pair of annotators, totaling", str(int(720*0.1)), "files")
print("Estimated files both pairs of annotators will label in total:", (est_files_annotated * 2) - int(720*0.1))

Allow 10% overlap for each pair of annotators, totaling 72 files
Estimated files both pairs of annotators will label in total: 1368


In [10]:
import os
from shutil import copyfile

In [11]:
directory = "descriptions_by_fonds_split_with_ann/descriptions_by_fonds_split_with_ann"
descs_split = list(os.listdir(directory))
descs_split.sort()
descs_split.pop(0)
print(len(descs_split))
print(descs_split[0])
print(descs_split[1])
print(descs_split[2])
print(descs_split[3])
print(".txt" in descs_split[1])
# print(descs_split)

7298
AA4_00100.ann
AA4_00100.txt
AA5_00100.ann
AA5_00100.txt
True


In [34]:
print(73*2)
print(1461+1460-73)

146
2848


Select approximately 10% of the total number of files for the hired annotators, and select approximately 10% of that number of files to be doubly annotated by the hired annotators.
* 730 txt files per annotator (including ann files, 1460 files total)
* First 73 txt files annotated by everyone

In [35]:
fonds1 = descs_split[0:1460]                           # 730 * 2 to account for .ann files
fonds2 = descs_split[0:146] + descs_split[1461:2848]   # only first 73 txt files (146 with ann files) should overlap
# Note: the second slice starts at index 1461, so the file at index 1460 is in neither list
        
print(len(fonds1))
print(len(fonds2))
print(fonds1[-10:])
print(fonds2[-10:])

1460
1533
['Coll-1434_14300.ann', 'Coll-1434_14300.txt', 'Coll-1434_14400.ann', 'Coll-1434_14400.txt', 'Coll-1434_14500.ann', 'Coll-1434_14500.txt', 'Coll-1434_14600.ann', 'Coll-1434_14600.txt', 'Coll-1434_14700.ann', 'Coll-1434_14700.txt']
['Coll-1584_00100.ann', 'Coll-1584_00100.txt', 'Coll-1585_00100.ann', 'Coll-1585_00100.txt', 'Coll-1586_00100.ann', 'Coll-1586_00100.txt', 'Coll-1586_00200.ann', 'Coll-1586_00200.txt', 'Coll-1586_00300.ann', 'Coll-1586_00300.txt']

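Slicing the sorted list by raw index works only if every slice boundary falls between a `.txt` file and the next document's `.ann` file. As a sketch of an alternative, the helper below allocates whole documents by their shared file stem and then expands each stem back into its two files. `allocate_pairs`, its parameters, and the `Doc-` file names are hypothetical, and the sketch assumes every document has both a `.ann` and a `.txt` file. It is not the allocation actually used above.

```python
def allocate_pairs(filenames, per_annotator, shared):
    """Return two file lists that overlap on the first `shared` documents.

    Hypothetical helper: assumes every document has both a .ann and a .txt file.
    """
    # Work with document stems so a .txt is never separated from its .ann
    stems = sorted({name.rsplit(".", 1)[0] for name in filenames})

    def expand(chosen):
        return [stem + ext for stem in chosen for ext in (".ann", ".txt")]

    first = stems[:per_annotator]
    second = stems[:shared] + stems[per_annotator:2 * per_annotator - shared]
    return expand(first), expand(second)

# Made-up file names for ten documents
demo = [f"Doc-{n:02d}_00100{ext}" for n in range(10) for ext in (".ann", ".txt")]
list1, list2 = allocate_pairs(demo, per_annotator=4, shared=1)
print(len(list1), len(list2), len(set(list1) & set(list2)))  # 8 8 2
```

Both annotators get four documents (eight files), and exactly one document (two files) is shared.
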

Copy the selected files of metadata descriptions into folders for each annotator.

In [30]:
for f in fonds1:
    oldpath = os.path.join(directory,f)
    newpath1 = os.path.join("ann1",f)  # paired with ann2
    newpath2 = os.path.join("ann3",f)  # paired with ann4
    copyfile(oldpath, newpath1)
    copyfile(oldpath, newpath2)

In [36]:
for f in fonds2:
    oldpath = os.path.join(directory,f)
    newpath3 = os.path.join("ann2",f)  # paired with ann1
    newpath4 = os.path.join("ann4",f)  # paired with ann3
    copyfile(oldpath, newpath3)
    copyfile(oldpath, newpath4)

In [1]:
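from nltk.corpus import PlaintextCorpusReader

def countWordsInDirectory(directory):
    files = PlaintextCorpusReader(directory, r'.+\.txt')
    tokens = files.words()
    print(str(directory)+": " + str(len(tokens)))
    return len(tokens)  # return the count so it can be used in the percentages below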

pair1_words = countWordsInDirectory("ann1/")                                   # 486880
pair2_words = countWordsInDirectory("ann3/")                                   # 595018
total_words = countWordsInDirectory("descriptions_by_fonds_split_with_ann")    # 2754044

In [8]:
print("Percentage of dataset annotated by annotator pair 1:",(pair1_words/total_words)*100)

Percentage of dataset annotated by annotator pair 1: 17.678729896835343


In [9]:
print("Percentage of dataset annotated by annotator pair 2:",(pair2_words/total_words)*100)

Percentage of dataset annotated by annotator pair 2: 21.60524668451194


If each pair of annotators annotates about half of the files allocated to them (roughly half of the 18% and 22% shares computed above), then about 10% of my entire dataset will be doubly annotated in total, because I'm labeling all files with all categories.

In [32]:
overlap = descs_split[0:146]
files = PlaintextCorpusReader(directory, r'.+\.txt')
fileids = files.fileids()
overlap_token_count = 0
for f in fileids:
    if f in overlap:
        overlap_token_count += len(files.words(f))
print(overlap_token_count)

89273


In [33]:
total_tokens = 2754044
print((overlap_token_count/total_tokens)*100)

3.2415241005590323


The files that EVERYONE annotates represent about 3% of the total dataset, meaning about 3% of the data will be triply annotated.