# Preparing Data for Analysis (Aggregated and Non-aggregated Datasets)

The datasets of archival metadata descriptions were annotated for gendered and gender-biased language. To understand patterns and outliers in these datasets, and to inform feature engineering for classification models, the datasets must be formatted and organized into files and directories through the following steps:

[1.](#1) In the non-aggregated datasets and `aggregated_with_annotator_col.csv`, associate each annotation with the annotator's notes explaining the annotation.
  
[2.](#2) Associate each annotation with its fonds-level (a.k.a. collection-level) metadata (see `annot-prep/CRC_units-grouped-by-fonds.csv`) for: 
* language of material
* date description written
* date(s) of material
* associated geographic locations of material
* fonds identifiers
* unit identifiers (at the series, sub-series, and item levels; the item is the lowest level of the archival hierarchy)

[3.](#3) Split the files of grouped archival metadata descriptions into files each with one description and associate each description with its annotations (create description ID to add in column to annotation datasets).

***

In [1]:
import pandas as pd
import numpy as np
import os
import string
import re

### Data and Functions

Data locations:

In [6]:
# Files of non-aggregated annotation data (one per annotator)
ann0PL = "data/OriginalAnnotatorData/labels0PL-Copy1.csv"
ann1 = "data/OriginalAnnotatorData/labels1-Copy1.csv"
ann2 = "data/OriginalAnnotatorData/labels2-Copy1.csv"
ann0C = "data/OriginalAnnotatorData/labels0C-Copy1.csv"
ann3 = "data/OriginalAnnotatorData/labels3-Copy1.csv"
ann4 = "data/OriginalAnnotatorData/labels4-Copy1.csv"
annpaths = {0:[ann0PL, ann0C], 1:ann1, 2:ann2, 3:ann3, 4:ann4}

# Files of annotation notes data
notespaths = ["data/data_annot/annot_notes/notes0.csv", "data/data_annot/annot_notes/notes1.csv", 
              "data/data_annot/annot_notes/notes2.csv", "data/data_annot/annot_notes/notes3.csv", 
              "data/data_annot/annot_notes/notes4.csv"]

# Files of aggregated annotation data
aggpath = "data/data_annot/aggregated_final.csv"
aggannpath = "data/data_annot/aggregated_with_annotator_col.csv"

# Additional archival metadata to associate with labels and notes data, 
# cleaned in another notebook (MetadataCleaning.ipynb)
metadata = "../annot-prep/CRC_units-grouped-by-fonds_clean.csv"

# Directories of metadata descriptions per annotator
datapath0 = "../IAA_DataAndAnalysis/annotator-0"
datapath1 = "../IAA_DataAndAnalysis/Linguistic/annotator-1"
datapath2 = "../IAA_DataAndAnalysis/Linguistic/annotator-2"
datapath3 = "../IAA_DataAndAnalysis/Contextual/annotator-3"
datapath4 = "../IAA_DataAndAnalysis/Contextual/annotator-4"
datapaths = [datapath0,datapath1,datapath2,datapath3,datapath4]

# Path to plaintext files of descriptions that each annotator labeled
descs_path = "data/desc_txts"

Functions:

In [24]:
def useAnnotatorNumber(df):
    df.replace("Annotator 0", "0", inplace=True)
    df.replace("Annotator 1", "1", inplace=True)
    df.replace("Annotator 2", "2", inplace=True)
    df.replace("Annotator 3", "3", inplace=True)
    df.replace("Annotator 4", "4", inplace=True)
    return df

# Find index in input list closest to input offset
# Reference: https://www.geeksforgeeks.org/python-find-closest-number-to-k-in-given-list/#:~:text=Given%20a%20list%20of%20numbers%20and%20a%20variable,K%2C%20and%20returns%20the%20element%20having%20minimum%20difference.
def findClosest(i, j, f_string, position):
    if position == "start":
        offset = j
        bh = f_string.rfind("Biographical / Historical:", i, j)
        sc = f_string.rfind("Scope and Contents:", i, j)
        ti = f_string.rfind("Title:", i, j)
        pi = f_string.rfind("Processing Information:", i, j)
    elif position == "end":
        offset = i
        bh = f_string.find("Biographical / Historical:", i, j)
        sc = f_string.find("Scope and Contents:", i, j)
        ti = f_string.find("Title:", i, j)
        pi = f_string.find("Processing Information:", i, j)
    indeces = [bh, sc, ti, pi]
    while -1 in indeces:
        indeces.remove(-1)     # -1 indicates input string wasn't found
    if len(indeces) == 1:      # If only one index was found, return it
        return indeces[0]
    elif len(indeces) == 0:
        return None
    else:
        indeces = np.asarray(indeces)
        k = (np.abs(indeces - offset)).argmin()
        return indeces[k]
            
# INPUT: list of file paths to metadata descriptions (str), annotator number (0-4; int), 
#       filename of description to review (str), labeled text span offsets (both int)
# OUTPUT: the metadata description that contains the labeled text
def getDescription(datapaths, ann_no, filename, start_offset, end_offset):
    datapath = datapaths[ann_no]                                   # Get the directory of data for the input annotator
    with open(os.path.join(datapath,filename),'r') as f:           # Get a string of the file's text (metadata description)
        f_string = f.read()

    # Find the index of the beginning of the metadata description with the labeled text
    begin_desc_i = findClosest(0, start_offset, f_string, "start")
    if begin_desc_i is None:
        begin_desc_i = 0
    # Find the index of the end of the metadata description with the labeled text
    end_file_i = len(f_string) - 1
    end_desc_i = findClosest(end_offset, end_file_i, f_string, "end")
    if end_desc_i is None:
        end_desc_i = end_file_i

    return f_string[begin_desc_i:end_desc_i]


# Get Notes
# notespaths = ["notes0.csv", "notes1.csv", "notes2.csv", "notes3.csv", "notes4.csv"]
# notes4 = pd.read_csv(notespaths[4], index_col = 0)
# # notes4.head()
# note = notes4.loc[(notes4.file == "Coll-1434_16400.ann") & (notes4.entity == "T15")]
# note

def joinLabelsAndNotes(df_labels, df_notes):
    df_labels.set_index(["annotator","file","entity"], inplace=True)
    df_notes.set_index(["annotator","file","entity"], inplace=True)
    return df_labels.join(df_notes, on=["annotator","file","entity"], how="left")
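
Note that `joinLabelsAndNotes` sets the index of both input DataFrames in place. As a minimal sketch of an alternative, the same left join can be written with `DataFrame.merge` so that the inputs are left unchanged. The helper name `merge_labels_and_notes` and the toy rows are hypothetical, and `validate="many_to_one"` assumes each annotation has at most one note, which has not been checked against the real notes files.

```python
import pandas as pd

# Toy stand-ins for one annotator's labels and notes (illustrative values only)
labels = pd.DataFrame({
    "annotator": ["Annotator 1", "Annotator 1"],
    "file": ["a.ann", "a.ann"],
    "entity": ["T1", "T2"],
    "label": ["Gendered-Role", "Gendered-Pronoun"],
})
notes = pd.DataFrame({
    "annotator": ["Annotator 1"],
    "file": ["a.ann"],
    "entity": ["T1"],
    "note": ["gendered role given"],
})

def merge_labels_and_notes(df_labels, df_notes):
    """Left-join notes onto labels without modifying either input (hypothetical helper)."""
    keys = ["annotator", "file", "entity"]
    # validate raises an error if a key appears more than once in df_notes
    return df_labels.merge(df_notes, on=keys, how="left", validate="many_to_one")

joined = merge_labels_and_notes(labels, notes)
```

Annotations without a note keep a missing value in the `note` column, as in the outputs of the join shown in Step 1.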

<a id="1"></a>
## Step 1: Associate Annotation Labels and Notes

For each annotator's individual dataset (non-aggregated data):

In [22]:
# Annotator 0 - Contextual, Person-Name, and Linguistic categories of labels
ann0PL_labels = pd.read_csv(ann0PL, index_col=0)
ann0C_labels = pd.read_csv(ann0C, index_col=0)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
ann0_labels = pd.concat([ann0PL_labels, ann0C_labels])
ann0_notes = pd.read_csv(notespaths[0], index_col=0)
ann0_labels_notes = joinLabelsAndNotes(ann0_labels, ann0_notes)
ann0_labels_notes.head()  # Looks good

annotator,file,entity,label,start,end,text,category,note
Annotator 0,Coll-1444_00100.ann,T1,Unknown,52,66,Robert E. Bell,Person-Name,
Annotator 0,Coll-1444_00100.ann,T2,Generalization,219,228,Bachelors,Linguistic,A masculine term for a degree that can be awar...
Annotator 0,Coll-1444_00100.ann,T3,Generalization,301,310,Bachelors,Linguistic,A masculine term for a degree that can be awar...
Annotator 0,Coll-1444_00100.ann,T4,Generalization,368,372,Ed.B,Linguistic,"B for bachelor, a masculine term for a degree ..."
Annotator 0,Coll-1444_00100.ann,T5,Generalization,377,381,M.Ed,Linguistic,"M for master, a masculine term for a degree th..."


In [31]:
# Annotator 1 - Person-Name, and Linguistic categories of labels
a = 1
labels = pd.read_csv(annpaths[a], index_col=0)
notes = pd.read_csv(notespaths[a], index_col=0)
ann1_labels_notes = joinLabelsAndNotes(labels, notes)
ann1_labels_notes.tail()  # Looks good

annotator,file,entity,label,start,end,text,category,note
Annotator 1,Coll-1434_11300.ann,T22,Gendered-Role,4867,4870,man,Linguistic,
Annotator 1,Coll-1434_09000.ann,T0,Gendered-Pronoun,2117,2120,His,Linguistic,
Annotator 1,Coll-1434_09000.ann,T1,Masculine,2117,2176,"His Excellency, the Kushbegi, Vice Emir of Bok...",Person-Name,gendered role given\n
Annotator 1,Coll-1434_09000.ann,T2,Gendered-Role,2137,2145,Kushbegi,Linguistic,Unclear if this is a gendered title but previo...
Annotator 1,Coll-1434_09000.ann,T3,Gendered-Role,2147,2156,Vice Emir,Linguistic,Unclear if this is a gendered title but previo...


In [36]:
# Annotator 2 - Person-Name, and Linguistic categories of labels
a = 2
labels = pd.read_csv(annpaths[a], index_col=0)
notes = pd.read_csv(notespaths[a], index_col=0)
ann2_labels_notes = joinLabelsAndNotes(labels, notes)
ann2_labels_notes.loc[ann2_labels_notes.note.notnull()].head()  # Looks good

annotator,file,entity,label,start,end,text,category,note
Annotator 2,BAI_01600.ann,T22,Unknown,5234,5253,Hector Hetherington,Person-Name,In file BAI_01900 it notes that Hector Hetheri...
Annotator 2,BAI_01800.ann,T26,Unknown,5065,5090,William Paterson Paterson,Person-Name,"Does this appear to be a mistake? ""William Pat..."
Annotator 2,Coll-1014_00100.ann,T4,Gendered-Role,988,992,Pope,Linguistic,"Technically being ""pope"" is a job, but only me..."
Annotator 2,Coll-146_22900.ann,T40,Unknown,3068,3082,"Rosoff, Israel",Person-Name,Format makes me think that this is a name and ...
Annotator 2,Coll-146_27200.ann,T22,Unknown,1299,1312,"olton, Gerald",Person-Name,I'm assuming this is a typo\n


In [38]:
# Annotator 3 - Contextual categories of labels
a = 3
labels = pd.read_csv(annpaths[a], index_col=0)
notes = pd.read_csv(notespaths[a], index_col=0)
ann3_labels_notes = joinLabelsAndNotes(labels, notes)
ann3_labels_notes.tail() # Looks good

annotator,file,entity,label,start,end,text,category,note
Annotator 3,Coll-1028_00100.ann,T14,Omission,598,605,Baillie,Contextual,only family name given\n
Annotator 3,Coll-1028_00100.ann,T33,Occupation,2572,2580,printers,Contextual,
Annotator 3,Coll-1028_00100.ann,T34,Occupation,2559,2570,translators,Contextual,
Annotator 3,Coll-1028_00100.ann,T35,Omission,2597,2603,Calvin,Contextual,only family name given\n
Annotator 3,Coll-1028_00100.ann,T36,Omission,2536,2542,Calvin,Contextual,only family name given\n


In [39]:
# Annotator 4 - Contextual categories of labels
a = 4
labels = pd.read_csv(annpaths[a], index_col=0)
notes = pd.read_csv(notespaths[a], index_col=0)
ann4_labels_notes = joinLabelsAndNotes(labels, notes)
ann4_labels_notes.tail() # Looks good

annotator,file,entity,label,start,end,text,category,note
Annotator 4,Coll-146_16400.ann,T2,Stereotype,2134,2139,a boy,Contextual,photograph description assumes unknown person'...
Annotator 4,Coll-1490_00300.ann,T1,Omission,1094,1101,friends,Contextual,friends not named. possible omission of women.\n
Annotator 4,Coll-1490_00300.ann,T2,Omission,1000,1012,Lady Jackson,Contextual,woman identified with title and married name o...
Annotator 4,Coll-1490_00300.ann,T3,Omission,386,393,wedding,Contextual,Married couple not named. Missed opportunity t...
Annotator 4,Coll-1490_00300.ann,T5,Omission,7,12,Kitty,Contextual,woman identified with nickname (diminutive). s...


Write the joined label-notes DataFrames to CSVs:

In [43]:
joined_list = [ann0_labels_notes, ann1_labels_notes, ann2_labels_notes, ann3_labels_notes, ann4_labels_notes]
for a, df in enumerate(joined_list):
    df.to_csv("data/data_annot/ann{number}_labels_notes.csv".format(number=a))

For the aggregated dataset with the annotator column (duplicate annotations made by different annotators exist): 

In [30]:
df_labels = pd.read_csv(aggannpath, index_col=0)
df_labels = df_labels.astype({'annotator':'int', 'file':'str', 'entity':'str', 'offsets':'str', 'text':'str', 'id':'int', 'label':'str', 'category':'str'})
df_labels.set_index(["annotator","file","entity"], inplace=True)
# Stack the notes of all annotators (DataFrame.append was removed in pandas 2.0)
df_notes = pd.concat([pd.read_csv(notes, index_col=0) for notes in notespaths])
df_notes = useAnnotatorNumber(df_notes)
df_notes = df_notes.astype({'annotator':'int', 'file':'str', 'entity':'str', 'note':'str'})
df_notes.set_index(["annotator","file","entity"], inplace=True)

In [31]:
df_labels.head()  # Looks good

annotator,file,entity,offsets,text,id,label,category
0,Coll-1434_11900.ann,T1,"(1954, 1957)",his,22593,Generalization,Linguistic
0,Coll-1397_00100.ann,T58,"(2633, 2638)",Lords,29349,Generalization,Linguistic
0,Coll-1310_00800.ann,T54,"(3703, 3706)",Man,15451,Generalization,Linguistic
0,Coll-1434_14500.ann,T76,"(5782, 5788)",cowboy,8005,Generalization,Linguistic
0,BAI_02300.ann,T53,"(1586, 1596)",shipmaster,20810,Generalization,Linguistic


In [32]:
df_notes.tail()  # Looks good

annotator,file,entity,note
4,Coll-146_16400.ann,T2,photograph description assumes unknown person'...
4,Coll-1490_00300.ann,T1,friends not named. possible omission of women.\n
4,Coll-1490_00300.ann,T2,woman identified with title and married name o...
4,Coll-1490_00300.ann,T3,Married couple not named. Missed opportunity t...
4,Coll-1490_00300.ann,T5,woman identified with nickname (diminutive). s...


In [34]:
agg_labels_notes = df_labels.join(df_notes, on=["annotator","file","entity"], how="left")
agg_labels_notes.tail(10)  # Looks good!

annotator,file,entity,offsets,text,id,label,category,note
4,Coll-1057_00600.ann,T23,"(6382, 6390)",and wife,4918,Omission,Contextual,wife not named\n
4,Coll-1054_00100.ann,T29,"(3100, 3142)",son of the Rev. Dr. William Logie (Junior),2036,Omission,Contextual,"mother not mentioned or named, only the father\n"
4,Coll-1054_00100.ann,T24,"(2000, 2047)","the son of Alexander Logie, a Kirkwall merchant",2031,Omission,Contextual,"mother not mentioned or named, only the father\n"
4,Coll-1054_00100.ann,T27,"(2527, 2569)",son of the Rev. Dr. William Logie (Senior),2034,Omission,Contextual,"mother not mentioned or named, only the father\n"
4,Coll-1036_00400.ann,T244,"(89368, 89376)",Jensen's,495,Omission,Contextual,man identified with surname. first name missin...
4,BAI_01300.ann,T11,"(1598, 1664)",the descendants of Rev John Baillie of Gairloc...,99,Omission,Contextual,Family's lineage doesn't mention women (wives ...
4,Coll-1057_00300.ann,T10,"(644, 648)",Alan,1192,Omission,Contextual,"man identified with first name, surname missing\n"
4,Coll-1057_00300.ann,T11,"(659, 662)",Ken,1193,Omission,Contextual,"man identified with first name, surname missing\n"
0,Coll-1036_00400.ann,T296,"(24219, 24233)",Thom. Campbell,9332,Unknown,Person-Name,
0,Coll-1036_00600.ann,T13,"(1079, 1092)",M. H. Sturgis,17548,Feminine,Person-Name,


Write the joined labels and notes for the aggregated dataset with the annotator column to a CSV file:

In [35]:
agg_labels_notes.to_csv("data/data_annot/aggregated_with_annotator_col_labels_notes.csv")

<a id="2"></a>
## Step 2: Associate the Metadata to the Annotation Data

Add an `eadid` column (based on the prefix of the file names in the `file` column) to the annotation data so it can be associated with the fonds-level metadata.

For the aggregated datasets (one with and one without an annotator column):

In [56]:
def addEadid(df, has_id=True):
    file_list = list(df.file)
    eadid_list = []
    for f in file_list:
        f_split = f.split("_")
        eadid_list += [f_split[0]]
    df["eadid"] = eadid_list
    df["eadid"] = df["eadid"].astype("str")
    if has_id:
        df = df.sort_values(by=["eadid","file","id"])
    else:
        df = df.sort_values(by=["eadid","file"])
    return df
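
The prefix extraction in `addEadid` can also be expressed with pandas string methods instead of a Python loop. The following is a minimal sketch on a toy DataFrame (`df_example` is a hypothetical name), assuming, as in the file names shown in this notebook, that the `eadid` is everything before the first underscore.

```python
import pandas as pd

# Toy file names in the same format as the `file` column
df_example = pd.DataFrame(
    {"file": ["Coll-1434_11900.ann", "BAI_02300.ann", "AA5_00100.ann"]}
)

# Split once on the first underscore and keep the part before it
df_example["eadid"] = df_example["file"].str.split("_", n=1).str[0]
```

The resulting column holds the same string values that the loop in `addEadid` would assign for these file names.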

In [48]:
# aggann: aggregated dataset with annotator column and notes for annotations
aggann = pd.read_csv("data/data_annot/aggregated_with_annotator_col_labels_notes.csv")
df = aggann
df = addEadid(df)
df.tail()  # Looks good

Unnamed: 0,annotator,file,entity,offsets,text,id,label,category,note,eadid
64444,4,Coll-1497_00400.ann,T7,"(3499, 3510)",magistrates,2622,Occupation,Contextual,,Coll-1497
75086,4,Coll-1497_00400.ann,T8,"(3681, 3690)",Wolfenden,2623,Omission,Contextual,person (unknown gender) identified with surnam...,Coll-1497
64445,4,Coll-1497_00400.ann,T9,"(5194, 5207)",psychologists,2624,Occupation,Contextual,,Coll-1497
64446,4,Coll-1497_00400.ann,T10,"(5212, 5225)",psychiatrists,2625,Occupation,Contextual,,Coll-1497
64447,4,Coll-1497_00400.ann,T11,"(6062, 6069)",doctors,2626,Occupation,Contextual,,Coll-1497


In [49]:
df.to_csv("data/data_annot/aggregated_with_annotator_eadid_note_cols.csv")

In [59]:
df = pd.read_csv(aggpath, index_col=0)
df = addEadid(df, has_id=False)
df.head()  # Looks good
df.to_csv("data/data_annot/aggregated_with_eadid_col.csv")

For each annotator's individual dataset (non-aggregated data):

In [39]:
# Load the data
ann0 = pd.read_csv("data/data_annot/ann0_labels_notes.csv")
ann0.head()

Unnamed: 0,annotator,file,entity,label,start,end,text,category,note
0,Annotator 0,Coll-1444_00100.ann,T1,Unknown,52,66,Robert E. Bell,Person-Name,
1,Annotator 0,Coll-1444_00100.ann,T2,Generalization,219,228,Bachelors,Linguistic,A masculine term for a degree that can be awar...
2,Annotator 0,Coll-1444_00100.ann,T3,Generalization,301,310,Bachelors,Linguistic,A masculine term for a degree that can be awar...
3,Annotator 0,Coll-1444_00100.ann,T4,Generalization,368,372,Ed.B,Linguistic,"B for bachelor, a masculine term for a degree ..."
4,Annotator 0,Coll-1444_00100.ann,T5,Generalization,377,381,M.Ed,Linguistic,"M for master, a masculine term for a degree th..."


In [58]:
# Add eadid column to annotator DataFrame
files = list(ann0.file)
eadid_list = []
for f in files:
    eadid_list += [re.search(r'[A-Za-z]*-*\d*', f)[0]]  # raw string avoids an invalid escape warning for \d
ann0 = ann0.assign(eadid=eadid_list)
ann0.tail()  # Looks good

Unnamed: 0,annotator,file,entity,label,start,end,text,category,note,eadid
31781,Annotator 0,Coll-1246_00100.ann,T8,Occupation,554,570,Chief Classifier,Contextual,,Coll-1246
31782,Annotator 0,Coll-1246_00100.ann,T9,Occupation,692,705,Sub-Librarian,Contextual,,Coll-1246
31783,Annotator 0,Coll-1172_00100.ann,T12,Occupation,287,293,priest,Contextual,,Coll-1172
31784,Annotator 0,Coll-1172_00100.ann,T19,Occupation,311,327,musical director,Contextual,,Coll-1172
31785,Annotator 0,Coll-1172_00100.ann,T20,Occupation,414,420,priest,Contextual,,Coll-1172


In [55]:
# Get unique list of eadids    
eadids = set(eadid_list)
# print(eadids)

In [None]:
# Clean up metadata (strip whitespace, make lists in cells sets so have unique values) and associate with annotator data
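
The cell above is still a placeholder. As a rough sketch of what the cleanup and association could look like, the example below strips whitespace, reduces a delimited cell to its unique values, and left-joins the metadata onto annotations by `eadid`. The `language` column, the `;` separator, the helper `unique_values`, and all values are assumptions for illustration; the actual columns of `CRC_units-grouped-by-fonds_clean.csv` may differ.

```python
import pandas as pd

# Hypothetical fonds-level metadata (column names and values are assumed)
meta = pd.DataFrame({
    "eadid": [" Coll-1246", "Coll-1172 "],
    "language": ["English; English; Latin", "English"],
})
annotations = pd.DataFrame({
    "eadid": ["Coll-1246", "Coll-1172", "Coll-1172"],
    "label": ["Occupation", "Occupation", "Occupation"],
})

def unique_values(cell, sep=";"):
    """Split a delimited cell, strip whitespace, and return the unique values sorted."""
    return sorted({part.strip() for part in str(cell).split(sep) if part.strip()})

meta["eadid"] = meta["eadid"].str.strip()
meta["language"] = meta["language"].apply(unique_values)

# Every annotation row receives the metadata of its fonds
annotated = annotations.merge(meta, on="eadid", how="left")
```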


<a id="3"></a>
## Step 3: Split annotated files of text into one file per description

Split the files of grouped archival metadata descriptions into files each with one description and associate each description with its annotations.  Create a description ID to add in as a column to the annotation datasets, so each annotation can be associated with the description in which it was made.

In [57]:
# Find index in input file (as data type str) closest to input offsets i and j
# Reference: https://www.geeksforgeeks.org/python-find-closest-number-to-k-in-given-list/#:~:text=Given%20a%20list%20of%20numbers%20and%20a%20variable,K%2C%20and%20returns%20the%20element%20having%20minimum%20difference.
def findClosest(i, j, f_string, position):
    if position == "start":
        offset = j
        bh = f_string.rfind("Biographical / Historical:", i, j)
        sc = f_string.rfind("Scope and Contents:", i, j)
        ti = f_string.rfind("Title:", i, j)
        pi = f_string.rfind("Processing Information:", i, j)
    elif position == "end":
        offset = i
        bh = f_string.find("Biographical / Historical:", i, j)
        sc = f_string.find("Scope and Contents:", i, j)
        ti = f_string.find("Title:", i, j)
        pi = f_string.find("Processing Information:", i, j)
    indeces = [bh, sc, ti, pi]
    while -1 in indeces:
        indeces.remove(-1)     # -1 indicates input string wasn't found
    if len(indeces) == 1:      # If only one index was found, return it
        return indeces[0]
    elif len(indeces) == 0:
        return None
    else:
        indeces = np.asarray(indeces)
        k = (np.abs(indeces - offset)).argmin()
        return indeces[k]
            
# INPUT: file path to metadata descriptions (str), filename of description to review (str),
#        labeled text span start offset and end offset (int)
# OUTPUT: the metadata description from a single text file that contains the labeled text
#         (descriptions will be incomplete if split across two files)
def getDescription(datapath, filename, start_offset, end_offset):
    with open(os.path.join(datapath,filename),'r') as f:
        f_string = f.read()
    # Find the index of the beginning of the metadata description with the labeled text
    begin_desc_i = findClosest(0, start_offset, f_string, "start")
    if begin_desc_i is None:
        begin_desc_i = 0
    # Find the index of the end of the metadata description with the labeled text
    end_file_i = len(f_string) - 1
    end_desc_i = findClosest(end_offset, end_file_i, f_string, "end")
    if end_desc_i is None:
        end_desc_i = end_file_i

    return f_string[begin_desc_i:end_desc_i]

# INPUT: file path to text files of metadata descriptions
# OUTPUT: two parallel lists, one of field names (type=str) and one of the complete
#         descriptions (type=str) that follow those field names
archives_fields =["Identifier","Title","Scope and Contents","Biographical / Historical","Processing Information"]
def getCompleteDescriptions(datapath, field_names=archives_fields):
    # os.listdir returns names in arbitrary order, so sort them; this assumes the
    # file names sort in reading order, so that split descriptions are rejoined correctly
    descs_files = sorted(os.listdir(datapath))
    # Store all the descriptions across all files in a single string
    descriptions_string = "" 
    for filename in descs_files:
        with open(os.path.join(datapath,filename), "r") as f:
            descriptions_string += f.read()  # Looks good
    
    # Create two lists, one of every field name and one of their corresponding descriptions,
    # being sure to combine descriptions for the same field split across two files
    fields, descs = [], []
    pattern = "(Identifier|Title|Scope and Contents|Biographical / Historical|Processing Information):"
    d_list = re.split(pattern, descriptions_string)
    previous_was_field = False
    i, maxI = 0, len(d_list)
    while i < maxI:
        s = d_list[i]
        if len(s) > 0:
            if s in field_names:
                fields += [s]
                previous_was_field = True
            else:
                if previous_was_field:
                    s = s.strip()
                    descs += [s]
                    previous_was_field = False
                else:
                    descs = descs[:-1] + [descs[-1]+s]
        i += 1
        
    return fields, descs
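
`getCompleteDescriptions` relies on the fact that `re.split` keeps the text matched by a capturing group in its output, so field names and descriptions alternate in the resulting list. A minimal sketch on an invented string (the text and the variable names `text`, `parts`, and `pairs` are illustrative only):

```python
import re

# Invented text in the same "Field name:" layout as the description files
text = (
    "Title:\nPapers of Jane Doe\n"
    "Scope and Contents:\nLetters and diaries.\n"
    "Biographical / Historical:\nJane Doe was a botanist.\n"
)
pattern = r"(Identifier|Title|Scope and Contents|Biographical / Historical|Processing Information):"
parts = re.split(pattern, text)

# parts[0] is the text before the first field name (empty here); after that,
# field names and their descriptions alternate
pairs = [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]
```

One caveat of this approach: the pattern also matches a field name followed by a colon inside the body of a description, which would split that description in two.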

Classification and analysis work will begin with the aggregated dataset of unique labels, which has neither an annotator column nor a notes column, so for now let's focus only on associating rows in that dataset with their descriptions.

First, load the aggregated dataset. An `id` column, which gives each row of annotation data a unique identifier, is added further below once the descriptions have been attached:

In [115]:
df = pd.read_csv("data/data_annot/aggregated_with_eadid_col.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()  

Unnamed: 0,file,offsets,text,label,category,eadid
0,AA5_00100.ann,"(789, 791)",He,Gendered-Pronoun,Linguistic,AA5
1,AA5_00100.ann,"(871, 873)",he,Gendered-Pronoun,Linguistic,AA5
2,AA5_00100.ann,"(913, 916)",his,Gendered-Pronoun,Linguistic,AA5
3,AA5_00100.ann,"(928, 930)",he,Gendered-Pronoun,Linguistic,AA5
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5


Next, create a version of the aggregated data that has the description id for each label, and another version that has the description id *and* the actual description.

In [7]:
# Check that all annotation files of `df` have associated plaintext files in `descs_files` list
descs_files = os.listdir(descs_path)
df_files = np.array(df.file)
missing = []
for f in df_files:
    f = f.replace(".ann",".txt")
    if f not in descs_files:
        missing += [f]
assert len(missing) == 0  # Great!

In [89]:
descs, fields = [], []  # store the metadata descriptions and metadata field names in lists
field_names = ["Identifier:\n", "Title:\n", "Biographical / Historical:\n", 
               "Scope and Contents:\n", "Processing Information:\n"]
for index,row in df.iterrows():
    f = row.file.replace(".ann",".txt")
    offsets = re.findall(r"\d+",row.offsets)  # raw string avoids an invalid escape warning for \d
    desc = getDescription(descs_path, f, int(offsets[0]), int(offsets[1]))
    desc = desc.lstrip()  # remove leading whitespace
    descs += [desc]
    # Store metadata field names at the start of each description
    found = False
    for field in field_names:
        if not found and (field in desc[0:28]):  
            fields += [field[:-2]]
            found = True
    if not found:
        fields += ["Missing"]

assert len(descs) == len(fields), "There should be one field name for each description."
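
The `offsets` column stores each span as a string such as `"(1032, 1043)"`. The loop above extracts the two integers with a regular expression; since the string has the form of a Python tuple literal, `ast.literal_eval` is another option. A small sketch comparing the two on one value taken from the output above (the variable names are illustrative):

```python
import ast
import re

offsets_str = "(1032, 1043)"

# Option 1: parse the string as a Python tuple literal
start, end = ast.literal_eval(offsets_str)

# Option 2: pull out the digit runs, as in the loop above
regex_start, regex_end = (int(n) for n in re.findall(r"\d+", offsets_str))
```

Both options give the same pair of integers for well-formed values; `ast.literal_eval` raises an error on malformed strings, whereas the regular expression returns whatever digit runs it finds.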

In [90]:
print("Annotations with incomplete descriptions:",fields.count("Missing"))
indeces_with_missing = [i for i, value in enumerate(fields) if value == "Missing"]
assert len(indeces_with_missing) == fields.count("Missing")

Annotations with incomplete descriptions: 730


In [91]:
# Get complete descriptions and replace any in the lists of fields and descriptions for each annotation 
# (`descs` and `fields` created above) with the complete description
complete_fields, complete_descs = getCompleteDescriptions(descs_path)
assert len(complete_descs) == len(complete_fields), "There should be a field for each description"
# print(complete_descs[:3], "\n", complete_fields[:3])  # Looks good

In [106]:
# Replace incomplete descriptions in `descs` and "Missing" values in `fields` with the complete
# description and corresponding field name
for i in indeces_with_missing:
    incomplete_desc = (descs[i]).strip()
    missing_field = fields[i]
    j, maxJ = 0, len(complete_descs)
    while j < maxJ:
        complete_desc = complete_fields[j]+":\n"+complete_descs[j]
        if incomplete_desc in complete_desc:
            descs[i] = complete_desc
            fields[i] = complete_fields[j]
            break
        j += 1
# assert len(replace_with) == len(indeces_with_missing)
# assert len(descs) == len(fields)
print(fields.count("Missing"))

3


In [110]:
indeces_still_missing = [i for i, value in enumerate(fields) if value == "Missing"]
for i in indeces_still_missing:
#     print(str(i)+": "+descs[i]+"\n")  # These aren't actually missing!  Field should be "Title"
    fields[i] = "Title"
print(fields.count("Missing"))

0


In [116]:
df["description"] = descs
df["field"] = fields
# df.field.unique() # Looks good
df = df.sort_values(by=["eadid", "file", "offsets", "category", "label", "text"])
df["id"] = [i for i in range(0, df.shape[0])]
df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,1
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,2
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,3
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,4
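
Because `offsets` is a string column, `sort_values` orders it lexicographically, so a span starting at 5194 sorts before one starting at 842. The `id` values assigned above reflect that string order. If numeric order were wanted instead, the offsets could first be parsed into integer columns. The sketch below uses a toy DataFrame (`toy`, `start`, and `end` are hypothetical names) and is not what the cell above does.

```python
import ast
import pandas as pd

toy = pd.DataFrame({"offsets": ["(5194, 5207)", "(433, 442)", "(842, 855)"]})

# Parse each "(start, end)" string into two integer columns
parsed = toy["offsets"].apply(ast.literal_eval)
toy[["start", "end"]] = pd.DataFrame(parsed.tolist(), index=toy.index)

by_string = toy.sort_values("offsets")["offsets"].tolist()
by_number = toy.sort_values(["start", "end"])["offsets"].tolist()
```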


Lastly, create a DataFrame of all the descriptions with columns for the associated eadid, the metadata field, and a unique identifier.  Add the description identifiers to the DataFrame of annotation data above.

In [127]:
unique_descs = df.description.unique()
total_descs = len(unique_descs)
print("Total unique descriptions:", total_descs)

df_desc = df.drop(columns=["file", "offsets", "text", "label", "category", "id"])
df_desc = df_desc.drop_duplicates()
rows = df_desc.shape[0]
desc_ids = [i for i in range(rows)]
df_desc["desc_id"] = desc_ids
print(df_desc.shape)

Total unique descriptions: 11745
(11888, 4)


The 143 additional rows (11,888 rows for 11,745 unique description texts) are descriptions repeated across collections (a.k.a. "fonds," which are identified with the `eadid` column).
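
One way to inspect such repeats is to count the distinct `eadid` values per description text. The following is a minimal sketch on toy rows (`toy_desc` and its values are invented), assuming a DataFrame with the same `eadid`, `description`, and `field` columns as `df_desc`.

```python
import pandas as pd

toy_desc = pd.DataFrame({
    "eadid": ["Coll-1", "Coll-2", "Coll-2"],
    "description": ["Title:\nLetters", "Title:\nLetters", "Title:\nDiaries"],
    "field": ["Title", "Title", "Title"],
})

# Number of distinct collections in which each description text occurs
collections_per_desc = toy_desc.groupby("description")["eadid"].nunique()
repeated = collections_per_desc[collections_per_desc > 1]
```

Running the same grouping on `df_desc` would list the description texts that occur in more than one collection.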

In [129]:
annot_desc_ids = []
desc_dict = dict(zip(list(df_desc.description),list(df_desc.desc_id)))
annot_descs = list(df.description)
for d in annot_descs:
    annot_desc_ids += [desc_dict[d]]
df["desc_id"] = annot_desc_ids
df.tail()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id,desc_id
55257,Coll-1497_00400.ann,"(433, 442)",Wolfenden,Omission,Contextual,Coll-1497,Scope and Contents:\nPolicy Files 1927-57 incl...,Scope and Contents,55255,11884
55254,Coll-1497_00400.ann,"(5194, 5207)",psychologists,Occupation,Contextual,Coll-1497,Scope and Contents:\nFile contents: F8: Materi...,Scope and Contents,55256,11885
55255,Coll-1497_00400.ann,"(5212, 5225)",psychiatrists,Occupation,Contextual,Coll-1497,Scope and Contents:\nFile contents: F8: Materi...,Scope and Contents,55257,11885
55256,Coll-1497_00400.ann,"(6062, 6069)",doctors,Occupation,Contextual,Coll-1497,Scope and Contents:\nFile contents: F1: Notes ...,Scope and Contents,55258,11886
55250,Coll-1497_00400.ann,"(842, 855)",Lord Advocate,Occupation,Contextual,Coll-1497,Scope and Contents:\nNotes and material relati...,Scope and Contents,55259,11887
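
A point to be aware of: `desc_dict` is keyed on the description text alone, so when the same text occurs in more than one collection the dictionary keeps only the last `desc_id` for it. If each repeated description should keep the identifier of its own collection, a merge on all of the columns that define a row of `df_desc` is one option. The sketch below uses toy data with invented names and values; it illustrates the difference and is not the method used above.

```python
import pandas as pd

toy_ann = pd.DataFrame({
    "eadid": ["Coll-1", "Coll-2", "Coll-2"],
    "description": ["Title:\nLetters", "Title:\nLetters", "Title:\nDiaries"],
    "field": ["Title", "Title", "Title"],
    "text": ["he", "his", "Mrs"],
})
keys = ["eadid", "description", "field"]
toy_descs = toy_ann[keys].drop_duplicates().reset_index(drop=True)
toy_descs["desc_id"] = range(len(toy_descs))

# A dict keyed on description alone keeps only the last desc_id for repeated text
by_text = dict(zip(toy_descs["description"], toy_descs["desc_id"]))

# Merging on all three columns keeps rows from different collections distinct
with_ids = toy_ann.merge(toy_descs, on=keys, how="left")
```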


Write the newly-created DataFrames to CSVs:

In [130]:
df.to_csv("data/data_annot/aggregated_with_eadid_descid_desc_cols.csv")
df_desc.to_csv("data/OriginalAnnotatorData/descriptions.csv")
df_without_desc = df.drop("description",axis=1)
df_without_desc.to_csv("data/data_annot/aggregated_with_eadid_descid_cols.csv")

### Get a sample of the aggregated data

In [131]:
# Write sample of data
df = pd.read_csv("data/data_annot/aggregated_with_eadid_descid_desc_cols.csv", index_col=0)
df_sample = df.head(100)
df_sample.shape

(100, 10)

In [132]:
df_sample.to_csv("data/data_annot/sample_aggregated_with_eadid_descid_desc_cols.csv")
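
Because the data is sorted by `eadid`, `head(100)` returns the first 100 rows, which are likely to come from only the first collection or two. If a sample spread across collections were preferred, `DataFrame.sample` with a fixed `random_state` gives a reproducible random draw. A toy comparison follows (the DataFrame `toy_df` and its values are invented):

```python
import pandas as pd

# Toy data sorted by eadid, like the aggregated dataset
toy_df = pd.DataFrame({
    "id": range(1000),
    "eadid": ["AA5"] * 500 + ["Coll-1497"] * 500,
})

first_rows = toy_df.head(100)                         # first 100 rows only
random_rows = toy_df.sample(n=100, random_state=0)    # reproducible random draw
```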