# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs or updates the files:
  * `../data/crc_metadata/annot_descs.csv`: adds columns for description offsets, word counts, and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `../data/analysis_data/descs_stats.csv`: contains the count, minimum, maximum, mean, and standard deviation of the word and sentence counts for all descriptions and for each metadata field
  * `../data/crc_metadata/descriptions_annotated/`: contains one file for every description in the annotated datasets with file names as zero-padded description IDs

***

**Table of Contents**

[0.](#0) Annotated Descriptions Data

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Description and Annotation Linking

[3.](#3) Offsets of Sentences

[4.](#4) Offsets of Tokens
    
  * TO REMOVE: [BIO Tags](#3.1)

***

Begin by loading the Python libraries and the dataset to be analyzed.

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

<a id="0"></a>
## 0. Annotated Descriptions Data

Create a CSV dataset of the descriptions that were annotated in brat, including the descriptions' file name and offsets. 

In [2]:
filenames = os.listdir(config.doc_path)
print(filenames[:10])
print(len(filenames))

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt', 'La_07300.txt', 'Coll-1434_07400.txt', 'Coll-1434_03100.txt', 'MS_BOX_25.5_00100.txt', 'EUA_IN1_38300.txt', 'Coll-14_05900.txt', 'Coll-1694_00100.txt']
3649


Check that every annotation's file is in the directory:

In [3]:
agg_df = pd.read_csv("../data/aggregated_data/aggregated_final.csv", index_col=0)
print(agg_df.shape)
agg_df.head()

(55260, 7)


agg_ann_id,file,text,ann_offsets,label,category,associated_genders,description_id
0,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear,2364
1,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear,4542
2,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear,3660
3,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear,4678
4,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear,4732


In [4]:
agg_files = list(agg_df.file)
agg_files = [f.replace(".ann", ".txt") for f in agg_files]
agg_files = list(set(agg_files))
agg_files.sort()
# Collect any annotated file that is absent from the directory listing
missing = [f for f in agg_files if f not in filenames]
assert len(missing) == 0
print(len(agg_files))

1112
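
The same check can be sketched with set arithmetic, which avoids scanning the `filenames` list once per annotated file. The helper name `find_missing_files` and the short sample lists are illustrative and are not part of `utils`.

```python
def find_missing_files(expected, available):
    """Return the expected file names that are absent from the available ones."""
    return sorted(set(expected) - set(available))

# Short sample lists standing in for agg_files and filenames
expected = ["AA5_00100.txt", "AA6_00100.txt"]
available = ["AA5_00100.txt", "AA6_00100.txt", "BAI_00100.txt"]
missing = find_missing_files(expected, available)
print(missing)
```

An empty result means every annotated file has a matching text file.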


`agg_files` is a list of the file names in `descriptions/brat` that should be included in the CSV of annotated description data.

In [5]:
### TESTING FUNCTIONS ###
# test_file = open(os.path.join(config.doc_path, "Coll-1036_00300.txt"), "r").read()
# desc_dict, did, field_for_next_file = getFieldDescriptions(dict(), test_file, "Scope and Contents", ["Title", "Biographical / Historical", "Processing Information"], 0, "Coll-1036_00300.txt")
# test_files = agg_files[:5]
desc_dict = utils.getDescriptionsInFiles(config.doc_path, agg_files)

In [6]:
# desc_dict = utils.getDescriptionsInFiles(config.doc_path, ["Coll-1036_00300.txt", "Coll-1036_00400.txt", "Coll-1036_00500.txt"])
# desc_dict # Looks good - 16 for _00400!

In [7]:
print("Total Descriptions:", len(desc_dict.keys()))

Total Descriptions: 27908


In [8]:
ann_desc_df = pd.DataFrame.from_dict(desc_dict, orient="index")
# Give the descriptions a unique identifier
ann_desc_df = ann_desc_df.reset_index()
ann_desc_df = ann_desc_df.rename(columns={"index":"description_id"})
ann_desc_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset
0,0,Identifier: AA5,AA5_00100.txt,0,16
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725
4,4,Identifier: AA6,AA6_00100.txt,0,16


Make sure all the description values have text:

In [9]:
assert ann_desc_df.loc[ann_desc_df.description.isnull() == True].shape[0] == 0
assert ann_desc_df.loc[ann_desc_df.description.isna() == True].shape[0] == 0
assert ann_desc_df.loc[ann_desc_df.description == ""].shape[0] == 0

In [10]:
ann_desc_df.loc[ann_desc_df.description_id == 108]

Unnamed: 0,description_id,description,file,start_offset,end_offset
108,108,Title:\nThe Trinity,BAI_00300.txt,1191,1210


In [11]:
ann_desc_df.loc[ann_desc_df.file == "Coll-1036_00400.txt"]  # Should have 16 rows 

Unnamed: 0,description_id,description,file,start_offset,end_offset
1040,1040,"Miscellaneous working material, Part 1, 2, 3, ...",Coll-1036_00400.txt,0,116
1041,1041,Scope and Contents:\nMiscellaneous music.Sever...,Coll-1036_00400.txt,117,1264
1042,1042,"Scope and Contents:\nMiscellaneous items, Part...",Coll-1036_00400.txt,1265,1563
1043,1043,Scope and Contents:\n'Loose Leaf M.S.S. [manus...,Coll-1036_00400.txt,1564,1837
1044,1044,"Scope and Contents:\n'News Cuttings', note boo...",Coll-1036_00400.txt,1838,2033
1045,1045,Scope and Contents:\nVarious music collections...,Coll-1036_00400.txt,2034,5254
1046,1046,Scope and Contents:\n'Proofs M.S.S.[manuscrip...,Coll-1036_00400.txt,5255,5552
1047,1047,Scope and Contents:\n'Kennedy-Fraser MSS. [man...,Coll-1036_00400.txt,5553,5983
1048,1048,Scope and Contents:\n'Tolmie Gesto'. \nBundle...,Coll-1036_00400.txt,5984,6821
1049,1049,Scope and Contents:\nProofs of A Life of Song ...,Coll-1036_00400.txt,6822,6902


In [12]:
ann_desc_df.loc[ann_desc_df.file == "Coll-146_15400.txt"]

Unnamed: 0,description_id,description,file,start_offset,end_offset
20928,20928,"Portrait of Arthur Koestler :: Geiger, Gretl: ...",Coll-146_15400.txt,0,59
20929,20929,Title:\nPhotograph of Arthur Koestler with a w...,Coll-146_15400.txt,60,205
20930,20930,Title:\nPhotograph of Arthur Koestler with a w...,Coll-146_15400.txt,206,351
20931,20931,Title:\nArthur Koestler with dog Attila at Lon...,Coll-146_15400.txt,352,429
20932,20932,Title:\nPortrait of Arthur Koestler at Long Ba...,Coll-146_15400.txt,430,481
20933,20933,Title:\nPhotograph of Arthur Koestler with dog...,Coll-146_15400.txt,482,545
20934,20934,Title:\nPortrait of Arthur Koestler :: Yevonde...,Coll-146_15400.txt,546,606
20935,20935,Title:\nPortrait of Arthur Koestler :: Yevonde...,Coll-146_15400.txt,607,667
20936,20936,Title:\nPhotograph of Arthur Koestler :: Forga...,Coll-146_15400.txt,668,738
20937,20937,Title:\nSeries of three photographs of Arthur ...,Coll-146_15400.txt,739,796


In [30]:
descriptions = list(ann_desc_df.description)
fields = ["Identifier: ", "Title:\n", "Scope and Contents:\n", "Biographical / Historical:\n", "Processing Information:\n"]
fields_col, descs_col = [], []
last_field = ""  # field inherited by descriptions that do not name a field themselves
for d in descriptions:
    # Find the first known field name that occurs in the description
    field = next((f for f in fields if f in d), None)
    if field is None:
        # No field name found: reuse the field of the preceding description
        clean_field = last_field
        descs_col += [d]
    else:
        # Remove the field name from the text and drop its trailing colon
        clean_field = field.strip()[:-1]
        descs_col += [d.replace(field, "")]
    fields_col += [clean_field]
    last_field = clean_field
assert len(descriptions) == len(descs_col)
assert len(fields_col) == len(descs_col)

In [33]:
ann_desc_df.insert(len(ann_desc_df.columns), "field", fields_col)
ann_desc_df.insert(len(ann_desc_df.columns), "clean_desc", descs_col)
ann_desc_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6


The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (equivalently, the number of characters in the document that precede the span)
* The **end offset** is the index of the character *immediately after* the annotated text span, so the length of the span is the end offset minus the start offset

This means that the start offset of the first description of each document will be 0 and the end offset of the last description of each document will equal the length (number of characters) of the document.  There are multiple descriptions for each document, so we have calculated the intermediate start and end offsets as well, which are all in the DataFrame above.
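
As a small illustration of these conventions: Python slicing is also half-open, so `text[start:end]` recovers a span directly from its offsets. The document text below is made up for the example and the offsets are computed from it, not taken from the dataset.

```python
# Made-up document text containing several descriptions
document = "Identifier: AA5\nTitle:\nPapers of James Whyte"

span_text = "James Whyte"
start = document.find(span_text)  # index of the first character of the span
end = start + len(span_text)      # index of the character after the span

# Slicing with the offsets returns exactly the annotated text
assert document[start:end] == span_text
# A span that closes the document ends at the document's length
assert end == len(document)
print(start, end)
```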

Write the annotated descriptions with their start and end offsets to a CSV file:

In [34]:
annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
ann_desc_df.to_csv(annot_desc_filepath)

Write each description to a TXT file for later analysis with NLTK:

In [35]:
dir_path = config.crc_meta_path+"descriptions_annotated/"
# Make sure the directory exists
Path(dir_path).mkdir(parents=True, exist_ok=True)

# Write one TXT file per description (UTF-8 encoded), with the description ID as the file name
description_list = list(ann_desc_df.description)
id_list = list(ann_desc_df.description_id)
# For zero padding so files are ordered correctly
max_digits = len(str(max(id_list)))
counter = 0
for d, did in zip(description_list, id_list):
    filename = str(did).zfill(max_digits)+".txt"
    # Set the encoding explicitly because open()'s default depends on the platform
    with open(dir_path+filename, "w", encoding="utf-8") as f:
        f.write(d)
    counter += 1
    if counter % 1000 == 0:
        print("1000 new files written")
print("{} files finished writing!".format(counter))

1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
27908 files finished writing!


<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `description_id`, a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [5]:
# # Uncomment if need to reload data
# # --------------------------------
# annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
# ann_desc_df = pd.read_csv(annot_desc_filepath)
# ann_desc_df = ann_desc_df.drop(columns=["Unnamed: 0"])
# dir_path = config.crc_meta_path+"descriptions_annotated/"

In [6]:
corpus = PlaintextCorpusReader(dir_path, r"\w*\.txt", encoding="utf8")  # raw string so the regex backslashes are kept literally
print(corpus.fileids()[:10]) # Looks good
print(corpus.fileids()[-10:]) # Looks good

['00000.txt', '00001.txt', '00002.txt', '00003.txt', '00004.txt', '00005.txt', '00006.txt', '00007.txt', '00008.txt', '00009.txt']
['27898.txt', '27899.txt', '27900.txt', '27901.txt', '27902.txt', '27903.txt', '27904.txt', '27905.txt', '27906.txt', '27907.txt']


In [7]:
print(len(corpus.fileids()))

27908


In [8]:
ann_desc_df.shape

(27908, 9)

#### Length per Description: Sentences and Words

In [9]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

['Identifier']
['identifier']
['Identifier: AA5']


In [10]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good

[1, 10] [1, 1, 1, 8]


In [43]:
ann_desc_df.insert(len(ann_desc_df.columns), "word_count", word_count)
ann_desc_df.insert(len(ann_desc_df.columns), "sent_count", sent_count)
ann_desc_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [44]:
ann_desc_df.to_csv(annot_desc_filepath)  # add the counts to the existing CSV file

Write a file with the sentences for each description:

In [11]:
sent_df = ann_desc_df[["description_id", "description", "file", "start_offset", "end_offset", "field"]].copy()  # copy so the new column is not added to a slice of ann_desc_df
sent_df.insert(len(sent_df.columns), "sentences", desc_sents)
sent_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,sentences
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,[Identifier: AA5]
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,[Title:\nPapers of The Very Rev Prof James Why...
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"[Scope and Contents:\nSermons and addresses, 1..."
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,[Biographical / Historical:\nProfessor James A...
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,[Identifier: AA6]


In [14]:
sent_df = sent_df.drop(columns=["description"])
sent_df = sent_df.rename(columns={"start_offset":"desc_start_offset", "end_offset":"desc_end_offset"})
sent_df_exploded = sent_df.apply(pd.Series.explode)
sent_df_exploded.head()

Unnamed: 0,description_id,file,desc_start_offset,desc_end_offset,field,sentences
0,0,AA5_00100.txt,0,16,Identifier,Identifier: AA5
1,1,AA5_00100.txt,17,76,Title,Title:\nPapers of The Very Rev Prof James Whyt...
2,2,AA5_00100.txt,77,633,Scope and Contents,"Scope and Contents:\nSermons and addresses, 19..."
3,3,AA5_00100.txt,634,1725,Biographical / Historical,Biographical / Historical:\nProfessor James Ai...
3,3,AA5_00100.txt,634,1725,Biographical / Historical,He was educated at Daniel Stewart's College an...


In [16]:
sent_df_exploded.to_csv(config.crc_meta_path+"description_sentences.csv")

#### Calculate summary stats for word and sentence counts

In [45]:
desc_df_stats = utils.makeDescribeDf("All", ann_desc_df)
bh_stats = utils.makeDescribeDf("Biographical / Historical", ann_desc_df)
sc_stats = utils.makeDescribeDf("Scope and Contents", ann_desc_df)
pi_stats = utils.makeDescribeDf("Processing Information", ann_desc_df)
t_stats = utils.makeDescribeDf("Title", ann_desc_df)

In [46]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

metadata_field,by,total_descriptions,mean,std,min,max
All,word_count,27908.0,20.006342,87.010297,0.0,12343.0
All,sent_count,27908.0,1.50602,5.363265,1.0,742.0
Title,word_count,15135.0,8.177007,5.653681,0.0,62.0
Title,sent_count,15135.0,1.114635,0.494993,1.0,15.0
Scope and Contents,word_count,11470.0,30.731386,128.28867,1.0,12343.0
Scope and Contents,sent_count,11470.0,1.793636,8.109776,1.0,742.0
Biographical / Historical,word_count,661.0,118.476551,134.910761,2.0,1112.0
Biographical / Historical,sent_count,661.0,5.931921,6.552115,1.0,45.0
Processing Information,word_count,304.0,11.305921,10.516143,2.0,179.0
Processing Information,sent_count,304.0,1.078947,0.345195,1.0,4.0
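
`utils.makeDescribeDf` is defined outside this notebook. As a rough sketch of the kind of summary shown above, and assuming it reports the count, mean, sample standard deviation, minimum, and maximum of a count column, the same figures can be produced with the standard library. The function name `summarize_counts` and the sample rows are hypothetical.

```python
from statistics import mean, stdev

def summarize_counts(rows, field=None, key="word_count"):
    """Summarize a count column, optionally restricted to one metadata field."""
    values = [r[key] for r in rows if field is None or r["field"] == field]
    return {
        "total_descriptions": len(values),
        "mean": mean(values),
        # Sample standard deviation (n - 1 denominator); undefined for one value
        "std": stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical rows standing in for ann_desc_df
rows = [
    {"field": "Title", "word_count": 10},
    {"field": "Title", "word_count": 6},
    {"field": "Scope and Contents", "word_count": 65},
]
print(summarize_counts(rows))
print(summarize_counts(rows, field="Title"))
```

Passing `key="sent_count"` would summarize sentence counts in the same way, provided the rows carry that column.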


In [47]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>
## 2. Description and Annotation Linking

**Assign a description ID to every annotation, using the file names and offsets to determine within which description each annotated text span appears.**

**STEP 1:** Convert all offsets to tuples of integers and create a `filename` column to match up the descriptions' .txt files and annotated text spans' .ann files. 

In [56]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)

# Remove file extensions
desc_filenames = list(df_descs.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_descs.insert(1, "filename", desc_filenames)

# Get offsets as tuples of ints
start_offsets = list(df_descs.start_offset)
end_offsets = list(df_descs.end_offset)
offsets_strs = list(zip(start_offsets, end_offsets))
desc_offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_descs = df_descs.drop(columns=["start_offset", "end_offset", "word_count", "sent_count"])
df_descs.insert(4, "desc_offsets", desc_offsets_int_tuples)

df_descs.head()

Unnamed: 0,description_id,filename,description,file,desc_offsets,field,clean_desc
0,0,AA5_00100,Identifier: AA5,AA5_00100.txt,"(0, 16)",Identifier,AA5
1,1,AA5_00100,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,AA5_00100,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,"(77, 633)",Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,AA5_00100,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...
4,4,AA6_00100,Identifier: AA6,AA6_00100.txt,"(0, 16)",Identifier,AA6


In [57]:
df_ann = pd.read_csv("../data/aggregated_data/aggregated_final.csv", index_col=0)

# Remove file extensions
ann_filenames = list(df_ann.file)
ann_filenames = [f[:-4] for f in ann_filenames]
df_ann.insert(1, "filename", ann_filenames)

# Get offsets as tuples of ints
ann_offsets_strs = list(df_ann.offsets)
ann_offsets_strs = [pair[1:-1].split(",") for pair in ann_offsets_strs]
ann_offsets_ints = [tuple((int(pair[0].strip()), int(pair[1].strip()))) for pair in ann_offsets_strs]
df_ann = df_ann.drop(columns=["offsets"])
df_ann.insert(4, "ann_offsets", ann_offsets_ints)

df_ann.head()

Unnamed: 0,agg_ann_id,filename,file,text,ann_offsets,label,category,associated_genders
12,0,Coll-1157_00100,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear
22,1,Coll-1310_02300,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear
23,2,Coll-1281_00100,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear
24,3,Coll-1310_02700,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear
25,4,Coll-1310_02900,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear
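
Offsets read from a CSV arrive as strings such as `"(1407, 1415)"`. Besides slicing and splitting the string as in the cell above, one option is `ast.literal_eval`, which parses Python literals without executing arbitrary code. A minimal sketch, with the helper name `parse_offsets` chosen for illustration:

```python
import ast

def parse_offsets(offsets_str):
    """Convert a string like "(1407, 1415)" into a tuple of two ints."""
    start, end = ast.literal_eval(offsets_str)
    return (int(start), int(end))

print(parse_offsets("(1407, 1415)"))
```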


**STEP 2:** Group the IDs and offsets by file, so that each annotation's offsets can be compared with the offsets of the descriptions in the same file to determine which description ID to assign to the annotation.

In [58]:
subdf_descs = df_descs.drop(columns=["file"]) #"field",
df_descs_imploded = utils.implodeDataFrame(subdf_descs, ["filename"])
df_descs_imploded.head()

filename,description_id,description,desc_offsets,field,clean_desc
AA5_00100,"[0, 1, 2, 3]","[Identifier: AA5, Title:\nPapers of The Very R...","[(0, 16), (17, 76), (77, 633), (634, 1725)]","[Identifier, Title, Scope and Contents, Biogra...","[AA5, Papers of The Very Rev Prof James Whyte ..."
AA6_00100,"[4, 5, 6, 7]","[Identifier: AA6, Title:\nPapers of Rev Tom Al...","[(0, 16), (17, 60), (61, 560), (561, 2513)]","[Identifier, Title, Scope and Contents, Biogra...","[AA6, Papers of Rev Tom Allan (1916-1965), Ser..."
AA7_00100,"[8, 9, 10, 11]","[Identifier: AA7, Title:\nPapers of Rev Prof A...","[(0, 16), (17, 76), (77, 417), (418, 2442)]","[Identifier, Title, Scope and Contents, Biogra...","[AA7, Papers of Rev Prof Alec Campbell Cheyne ..."
BAI_00100,"[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2...","[Identifier: BAI, Title:\nPapers of Professor ...","[(0, 16), (17, 84), (85, 115), (116, 143), (14...","[Identifier, Title, Title, Title, Title, Title...","[BAI, Papers of Professor John Baillie, and Ba..."
BAI_00200,"[46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 5...","[\nTitle:\nNew Testament (senior), Title:\nApo...","[(0, 31), (32, 60), (61, 97), (98, 134), (135,...","[Title, Title, Title, Title, Title, Title, Tit...","[\nNew Testament (senior), Apologetics (senior..."
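
`utils.implodeDataFrame` is defined outside this notebook. Judging only from the output above, it collects the values of each column into one list per `filename`. A plain-Python sketch of that grouping follows; the helper name `implode_rows` is hypothetical and the rows are abbreviated from the table above.

```python
from collections import defaultdict

def implode_rows(rows, key):
    """Group rows (dicts) by `key`, collecting every other column into lists."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, value in row.items():
            if column != key:
                grouped[row[key]][column].append(value)
    # Convert the nested defaultdicts to plain dicts for easier inspection
    return {k: dict(v) for k, v in grouped.items()}

rows = [
    {"filename": "AA5_00100", "description_id": 0, "desc_offsets": (0, 16)},
    {"filename": "AA5_00100", "description_id": 1, "desc_offsets": (17, 76)},
    {"filename": "AA6_00100", "description_id": 4, "desc_offsets": (0, 16)},
]
imploded = implode_rows(rows, "filename")
print(imploded["AA5_00100"])
```

Because values are appended in row order, the lists for one file stay aligned: the description ID at position `i` belongs to the offsets at position `i`.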


In [59]:
descs_dict = df_descs_imploded.to_dict(orient="index")
print(descs_dict["AA5_00100"])

{'description_id': [0, 1, 2, 3], 'description': ['Identifier: AA5', 'Title:\nPapers of The Very Rev Prof James Whyte (1920-2005)', 'Scope and Contents:\nSermons and addresses, 1948-1996; lectures, 1949-1982; class notes and lecture notes, 1949-1982; correspondence, 1988-1989 and 1964-1970; newspaper cuttings, 1988-1989 and 1964-1969; publications and articles, 1902-1970; church magazines, 1929-1993; conference papers, 1978; moderatorial papers, 1988-1989; University Christian Consultative Group papers, 1970-1972; Church of Scotland and the Congregational Union of Scotland papers, 1959-1967; personal papers, 1848-1983; photographs 1911 and 1960.See also External Documents (below).', "Biographical / Historical:\nProfessor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent thre

In [60]:
subdf_ann = df_ann.drop(columns=["file","category","associated_genders"])
df_ann_imploded = utils.implodeDataFrame(subdf_ann, ["filename"])
df_ann_imploded.head()

filename,agg_ann_id,text,ann_offsets,label
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","[He, he, his, he, he, His, he, The Very Rev Pr...","[(789, 791), (871, 873), (913, 916), (928, 930...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","[Billy Graham, He, he, he, he, He, his, his, h...","[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Masculine, Gendered-Pronoun, Gendered-Pronoun..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","[Professor Cheyne, son, sister, brother, His, ...","[(2399, 2415), (505, 508), (614, 620), (647, 6...","[Masculine, Gendered-Role, Gendered-Role, Gend..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","[Jacques Chevalier, Lloyd Morgan, Professor Jo...","[(371, 388), (393, 405), (34, 56), (102, 114),...","[Unknown, Unknown, Unknown, Unknown, Unknown, ..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","[Barker, Garvie, Busch, Adolf Jülicher, Johann...","[(215, 221), (226, 232), (250, 255), (285, 299...","[Unknown, Unknown, Unknown, Unknown, Unknown, ..."


In [61]:
anns_dict = df_ann_imploded.to_dict(orient="index")
print(anns_dict["AA5_00100"])

{'agg_ann_id': [14377, 14378, 14379, 14380, 14381, 14382, 14383, 14384, 14385, 14386, 14387, 24275, 26233, 41260, 41261, 41262, 41263, 52952, 52953], 'text': ['He', 'he', 'his', 'he', 'he', 'His', 'he', 'The Very Rev Prof James Whyte', 'Professor James Aitken White', 'James Whyte', 'James Whyte', 'The Very Rev Prof James Whyte', 'Rev Prof James Whyte', 'Scottish Theologian', 'Moderator of the General Assembly of the Church of Scotland', 'army Chaplain', 'chair of practical theology and Christian ethics', 'The Very Rev Prof James Whyte', 'leading Scottish Theologian'], 'ann_offsets': [(789, 791), (871, 873), (913, 916), (928, 930), (1217, 1219), (1241, 1244), (1315, 1317), (34, 63), (661, 689), (1032, 1043), (1350, 1361), (34, 63), (43, 63), (704, 723), (728, 787), (955, 968), (1129, 1177), (34, 63), (696, 723)], 'label': ['Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Unknown', 'Masculine', 'M

**STEP 3:** File by file, determine which description's offsets each annotation occurs within and associate the corresponding description IDs and annotation IDs.
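
The containment test at the heart of this step can be sketched as a small function. Because both kinds of offsets describe half-open ranges, an annotation lies inside a description when its start is not before the description's start and its end is not after the description's end. The function names are illustrative; the sample offsets are those listed for `AA5_00100`. The sketch returns the first matching description, which gives the same result as any other match as long as descriptions in a file do not overlap.

```python
def contains(desc_offsets, ann_offsets):
    """Return True if the annotation span lies within the description span."""
    desc_start, desc_end = desc_offsets
    ann_start, ann_end = ann_offsets
    return desc_start <= ann_start and ann_end <= desc_end

def find_description_id(ann_offsets, desc_ids, desc_offsets_list):
    """Return the ID of the first description containing the annotation, else None."""
    for desc_id, desc_offsets in zip(desc_ids, desc_offsets_list):
        if contains(desc_offsets, ann_offsets):
            return desc_id
    return None

desc_ids = [0, 1, 2, 3]
desc_offsets_list = [(0, 16), (17, 76), (77, 633), (634, 1725)]
print(find_description_id((789, 791), desc_ids, desc_offsets_list))
```

Returning `None` for an annotation that fits no description makes unmatched annotations easy to find afterwards with a null check.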

In [62]:
annid_to_descid = dict.fromkeys(list(df_ann.agg_ann_id))

In [63]:
files = list(df_ann_imploded.index)
# Compare sorted copies (list.sort() sorts in place and returns None, so it cannot be used in a comparison)
assert sorted(df_ann_imploded.index) == sorted(df_descs_imploded.index)

In [64]:
df_descs.head()

Unnamed: 0,description_id,filename,description,file,desc_offsets,field,clean_desc
0,0,AA5_00100,Identifier: AA5,AA5_00100.txt,"(0, 16)",Identifier,AA5
1,1,AA5_00100,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,AA5_00100,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,"(77, 633)",Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,AA5_00100,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...
4,4,AA6_00100,Identifier: AA6,AA6_00100.txt,"(0, 16)",Identifier,AA6


In [65]:
for f in files:  #sample
    ann_ids, ann_offsets = anns_dict[f]["agg_ann_id"], anns_dict[f]["ann_offsets"]
    desc_ids, desc_offsets = descs_dict[f]["description_id"], descs_dict[f]["desc_offsets"]
    for i,ann_id in enumerate(ann_ids):
        ann_offset_pair = ann_offsets[i]
        for j,desc_id in enumerate(desc_ids):
            desc_offset_pair = desc_offsets[j]
            # If the annotation offsets are within the description offsets, assign that description ID to that annotation 
            if (ann_offset_pair[0] >= desc_offset_pair[0]) and (ann_offset_pair[0] <= desc_offset_pair[1]):
                if (ann_offset_pair[1] >= desc_offset_pair[0]) and (ann_offset_pair[1] <= desc_offset_pair[1]):
                    annid_to_descid[ann_id] = desc_id  #sample_annid_to_descid[ann_id] = desc_id

In [66]:
df_ids = pd.DataFrame({"agg_ann_id":list(annid_to_descid.keys()), "description_id":list(annid_to_descid.values())})
df_ids.head()

Unnamed: 0,agg_ann_id,description_id
0,0,2364
1,1,4542
2,2,3660
3,3,4678
4,4,4732


In [67]:
print(df_ids.loc[df_ids.agg_ann_id.isna() == True].shape)
print(df_ids.loc[df_ids.description_id.isna() == True].shape)

(0, 2)
(0, 2)


In [70]:
anns_without_desc = list(df_ids.loc[df_ids.description_id.isna() == True].agg_ann_id)
df_anns_without_desc = df_ann.loc[df_ann.agg_ann_id.isin(anns_without_desc)]
assert len(set(list(df_anns_without_desc.file))) == 0

**STEP 4:** Add annotations' corresponding description IDs to the annotation DataFrame: 

In [71]:
df_ann = df_ann.set_index("agg_ann_id")

In [73]:
df_ann_joined = df_ann.join(df_ids.set_index("agg_ann_id"), on="agg_ann_id", how="outer")
df_ann_joined.head()

agg_ann_id,filename,file,text,ann_offsets,label,category,associated_genders,description_id
0,Coll-1157_00100,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear,2364
1,Coll-1310_02300,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear,4542
2,Coll-1281_00100,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear,3660
3,Coll-1310_02700,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear,4678
4,Coll-1310_02900,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear,4732


In [74]:
assert df_ann_joined.loc[df_ann_joined.description_id.isna() == True].shape[0] == 0

In [76]:
assert df_ann_joined.shape[0] == df_ann.shape[0]
assert df_ann_joined.shape[0] == df_ids.shape[0]

Update the CSV file:

In [77]:
df_ann_joined = df_ann_joined.drop(columns=["filename"])
df_ann_joined.to_csv(config.agg_path+"aggregated_final.csv")

<a id="3"></a>
## 3. Offsets of Sentences

In [10]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
df_descs.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [11]:
print(df_descs.loc[df_descs.description.isna()].shape)
print(df_descs.loc[df_descs.clean_desc.isna()].shape)

(0, 9)
(258, 9)

pandas reads empty CSV fields back as NaN, so the 258 missing `clean_desc` values are most likely descriptions that had no text left after the field name was removed. Fill them with empty strings and update the CSV file:


In [16]:
df_descs[["clean_desc"]] = df_descs[["clean_desc"]].fillna("")
df_descs.to_csv(config.crc_meta_path+"annot_descs.csv")

In [13]:
sents_dict, offsets_dict = utils.getSentsAndOffsetsFromStrings(list(df_descs.description), list(df_descs.description_id), list(df_descs.start_offset), list(df_descs.end_offset))

In [14]:
desc_id_col = list(sents_dict.keys())
sents_col = list(sents_dict.values())
offsets_col = list(offsets_dict.values())
df_sents = pd.DataFrame({"description_id":desc_id_col, "sentences":sents_col, "sent_offsets":offsets_col})
df_sents.head()

Unnamed: 0,description_id,sentences,sent_offsets
0,0,[Identifier: AA5],"[(0, 16)]"
1,1,[Title:\nPapers of The Very Rev Prof James Why...,"[(0, 59)]"
2,2,"[Scope and Contents:\nSermons and addresses, 1...","[(0, 556)]"
3,3,[Biographical / Historical:\nProfessor James A...,"[(0, 155), (155, 273), (273, 398), (398, 607),..."
4,4,[Identifier: AA6],"[(0, 16)]"
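
`utils.getSentsAndOffsetsFromStrings` is not shown in this notebook. The offsets in the table above start at 0 for each description, which suggests they are relative to the description rather than to the whole document. The sketch below shows one way sentence offsets could be located and, if needed, shifted to document-level offsets by adding the description's start offset. The function name, the pre-split sentences, and the sample text are assumptions made for the example.

```python
def sentence_offsets(description, sentences, desc_start=0):
    """Locate each sentence in the description and return (start, end) pairs.

    Offsets are shifted by desc_start, so desc_start=0 gives offsets
    relative to the description itself.
    """
    offsets, cursor = [], 0
    for sent in sentences:
        start = description.find(sent, cursor)
        if start == -1:
            raise ValueError("Sentence not found: {!r}".format(sent))
        end = start + len(sent)
        offsets.append((desc_start + start, desc_start + end))
        cursor = end  # continue searching after this sentence
    return offsets

# Made-up description text, already split into sentences
description = "He studied philosophy. He was ordained in 1945."
sentences = ["He studied philosophy.", "He was ordained in 1945."]
print(sentence_offsets(description, sentences))
print(sentence_offsets(description, sentences, desc_start=634))
```

Searching from a moving cursor keeps repeated sentences from all resolving to the first occurrence.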


In [15]:
df_sents_exploded = df_sents.apply(pd.Series.explode)
df_sents_exploded = df_sents_exploded.reset_index()
df_sents_exploded = df_sents_exploded.rename(columns={"index":"sentence_id"})
df_sents_exploded.head()

Unnamed: 0,sentence_id,description_id,sentences,sent_offsets
0,0,0,Identifier: AA5,"(0, 16)"
1,1,1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)"
2,2,2,"Scope and Contents:\nSermons and addresses, 19...","(0, 556)"
3,3,3,Biographical / Historical:\nProfessor James Ai...,"(0, 155)"
4,3,3,He was educated at Daniel Stewart's College an...,"(155, 273)"
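
In the output above, both sentences from description 3 share `sentence_id` 3. Exploding repeats the original row index, so `reset_index()` turns the description-level index into the ID column. If one identifier per sentence is wanted, resetting the index a second time after exploding would provide it. A minimal sketch with toy data (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Two toy descriptions: the second one has two sentences
toy = pd.DataFrame({
    "description_id": [10, 11],
    "sentences": [["One sentence."], ["First.", "Second."]],
})

exploded = toy.explode("sentences")

# The exploded index repeats for sentences of the same description
repeated_ids = exploded.reset_index().rename(columns={"index": "sentence_id"})
print(repeated_ids.sentence_id.tolist())

# Dropping the repeated index first yields one ID per sentence row
unique_ids = exploded.reset_index(drop=True).rename_axis("sentence_id").reset_index()
print(unique_ids.sentence_id.tolist())
```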


Save the sentence data to a CSV file:

In [33]:
df_sents_exploded.to_csv(config.tokc_path+"sentences.csv")

<a id="4"></a>

## 4. Offsets of Tokens

In [28]:
assert type(list(df_sents_exploded.sentences)[0]) == str, "Each sentence should be a string"
assert type(list(df_sents_exploded.sent_offsets)[0]) == tuple, "Each offset pair should be a tuple"
assert type(list(df_sents_exploded.sent_offsets)[0][0]) == int, "Each tuple's item should be an int"
assert type(list(df_sents_exploded.sent_offsets)[0][1]) == int, "Each tuple's item should be an int"

Get the tokens in each sentence:

In [46]:
sents = list(df_sents_exploded.sentences)
sent_ids = list(df_sents_exploded.sentence_id)
sent_to_tokens = dict.fromkeys(sent_ids)  # {sent_id: {"token": [...]}}
for sent_id, sent in zip(sent_ids, sents):
    # Tokenize each sentence with NLTK; token offsets are calculated in a later step
    sent_to_tokens[sent_id] = {"token": word_tokenize(sent)}

print(sent_to_tokens[0])

{'token': ['Identifier', ':', 'AA5']}


In [47]:
sent_to_tokens_df = pd.DataFrame.from_dict(sent_to_tokens, orient="index")
sent_to_tokens_df = sent_to_tokens_df.reset_index()
sent_to_tokens_df = sent_to_tokens_df.rename(columns={"index":"sentence_id"})
sent_to_tokens_df.head()

Unnamed: 0,sentence_id,token
0,0,"[Identifier, :, AA5]"
1,1,"[Title, :, Papers, of, The, Very, Rev, Prof, J..."
2,2,"[Scope, and, Contents, :, Sermons, and, addres..."
3,3,"[The, full, text, of, this, sermon, was, publi..."
4,4,"[Identifier, :, AA6]"


Join the sentence-tokens data to the rest of the sentence data:

In [48]:
# join() matches on the "sentence_id" column of the left DataFrame, so only the right DataFrame needs it as its index
df_sents_tokens = df_sents_exploded.join(sent_to_tokens_df.set_index("sentence_id"), on="sentence_id", how="outer")
df_sents_tokens.head()

Unnamed: 0,sentence_id,description_id,sentences,sent_offsets,token
0,0,0,Identifier: AA5,"(0, 16)","[Identifier, :, AA5]"
1,1,1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)","[Title, :, Papers, of, The, Very, Rev, Prof, J..."
2,2,2,"Scope and Contents:\nSermons and addresses, 19...","(0, 556)","[Scope, and, Contents, :, Sermons, and, addres..."
3,3,3,Biographical / Historical:\nProfessor James Ai...,"(0, 155)","[The, full, text, of, this, sermon, was, publi..."
4,3,3,He was educated at Daniel Stewart's College an...,"(155, 273)","[The, full, text, of, this, sermon, was, publi..."


In [49]:
print(df_sents_tokens.loc[df_sents_tokens.token.isna()].shape)
print(df_sents_tokens.loc[df_sents_tokens.description_id.isna()].shape)
print(df_sents_tokens.loc[df_sents_tokens.sentence_id.isna()].shape)
print(df_sents_tokens.loc[df_sents_tokens.sentences.isna()].shape)
print(df_sents_tokens.loc[df_sents_tokens.sent_offsets.isna()].shape)

(0, 5)
(0, 5)
(0, 5)
(0, 5)
(0, 5)


Explode the DataFrame to have one token per row:

In [51]:
df_sents_tokens_exploded = df_sents_tokens.explode("token")
df_sents_tokens_exploded.head()

Unnamed: 0,sentence_id,description_id,sentences,sent_offsets,token
0,0,0,Identifier: AA5,"(0, 16)",Identifier
0,0,0,Identifier: AA5,"(0, 16)",:
0,0,0,Identifier: AA5,"(0, 16)",AA5
1,1,1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)",Title
1,1,1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)",:


Calculate token offsets from the sentence strings, making sure that whitespace is counted when calculating offset positions:

In [58]:
sents = list(df_sents_tokens_exploded.sentences)
sent_ids = list(df_sents_tokens_exploded.sentence_id)
sent_offsets = list(df_sents_tokens_exploded.sent_offsets)
sent_start_offsets = [offsets[0] for offsets in sent_offsets]
sent_end_offsets = [offsets[1] for offsets in sent_offsets]

In [59]:
tokens_dict, offsets_dict = utils.getTokensAndOffsetsFromStrings(sents, sent_ids, sent_start_offsets, sent_end_offsets)
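
The helper `utils.getTokensAndOffsetsFromStrings` is defined outside this notebook, so its implementation is not shown here. One way such offsets could be computed is to search for each token in the sentence string starting from the end of the previous token, so that any whitespace between tokens is counted. The sketch below is an assumption about the general technique, not the helper's actual code; `token_offsets` is a hypothetical name, and a simple regular expression stands in for `word_tokenize`:

```python
import re

def token_offsets(sentence, sent_start=0):
    """Return (token, (start, end)) pairs, with offsets shifted by sent_start."""
    # Stand-in tokenizer (an assumption): runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    pairs, cursor = [], 0
    for tok in tokens:
        # Searching from the cursor skips whitespace that precedes the token
        start = sentence.find(tok, cursor)
        end = start + len(tok)
        pairs.append((tok, (sent_start + start, sent_start + end)))
        cursor = end
    return pairs

print(token_offsets("Identifier: AA5"))
```

With NLTK's `word_tokenize`, this search would need extra handling, because that tokenizer rewrites some characters (for example, double quotation marks), so a token may not appear verbatim in the sentence string.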

In [62]:
assert len(tokens_dict.keys()) == len(offsets_dict.keys())

In [63]:
tokens_col, offsets_col, sent_ids_col = [], [], []
for sent_id,token_list in tokens_dict.items():
    tokens_col += token_list
    offsets_list = offsets_dict[sent_id]
    offsets_col += offsets_list
    assert len(token_list) == len(offsets_list)
    sent_ids_col += [sent_id]*len(token_list)

assert len(tokens_col) == len(offsets_col)
assert len(tokens_col) == len(sent_ids_col)

In [64]:
for col_list in [tokens_col, offsets_col, sent_ids_col]:
    print(col_list[0:5])

['Identifier', ':', 'AA5', 'Title', ':']
[(0, 10), (10, 11), (12, 15), (0, 5), (5, 6)]
[0, 0, 0, 1, 1]


Looks good!  Now create a DataFrame with these lists as columns:

In [65]:
df_tokens = pd.DataFrame({"sentence_id":sent_ids_col, "token":tokens_col, "offsets":offsets_col})
df_tokens = df_tokens.reset_index()
df_tokens = df_tokens.rename(columns={"index":"token_id"})
df_tokens.head()

Unnamed: 0,token_id,sentence_id,token,offsets
0,0,0,Identifier,"(0, 10)"
1,1,0,:,"(10, 11)"
2,2,0,AA5,"(12, 15)"
3,3,1,Title,"(0, 5)"
4,4,1,:,"(5, 6)"


Join the token data to the remaining sentence and description data:

In [69]:
df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["sentence_id"])
df_tsd = df_tokens_imploded.join(df_sents_exploded.set_index("sentence_id"), on="sentence_id", how="outer")
df_tsd = df_tsd.rename(columns={"offsets":"token_offsets", "sentences":"sentence"})
df_tsd.head()

sentence_id,token_id,token,token_offsets,description_id,sentence,sent_offsets
0,"[0, 1, 2]","[Identifier, :, AA5]","[(0, 10), (10, 11), (12, 15)]",0,Identifier: AA5,"(0, 16)"
1,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[(0, 5), (5, 6), (7, 13), (14, 16), (17, 20), ...",1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)"
2,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[(0, 5), (6, 9), (10, 18), (18, 19), (20, 27),...",2,"Scope and Contents:\nSermons and addresses, 19...","(0, 556)"
3,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[The, full, text, of, this, sermon, was, publi...","[(943, 946), (947, 951), (952, 956), (957, 959...",3,Biographical / Historical:\nProfessor James Ai...,"(0, 155)"
3,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[The, full, text, of, this, sermon, was, publi...","[(943, 946), (947, 951), (952, 956), (957, 959...",3,He was educated at Daniel Stewart's College an...,"(155, 273)"
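
The custom function `utils.implodeDataFrame` is not shown in this notebook. Judging from the output above, it groups rows by the given columns and collects the remaining columns into lists. A comparable result could be obtained with plain pandas, as in this sketch (toy data; this is an assumption about the helper's behavior, not its implementation):

```python
import pandas as pd

# Toy token table with one row per token
toy_tokens = pd.DataFrame({
    "token_id": [0, 1, 2, 3, 4],
    "sentence_id": [0, 0, 0, 1, 1],
    "token": ["Identifier", ":", "AA5", "Title", ":"],
})

# Collect every non-grouping column into one list per sentence
imploded = toy_tokens.groupby("sentence_id").agg(list)
print(imploded)
```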


In [72]:
print(df_tsd.loc[df_tsd.token_id.isna()].shape)
print(df_tsd.loc[df_tsd.token.isna()].shape)
print(df_tsd.loc[df_tsd.token_offsets.isna()].shape)
print(df_tsd.loc[df_tsd.description_id.isna()].shape)
print(df_tsd.loc[df_tsd.sentence.isna()].shape)
print(df_tsd.loc[df_tsd.sent_offsets.isna()].shape)

(0, 6)
(0, 6)
(0, 6)
(0, 6)
(0, 6)
(0, 6)
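
The column-by-column checks above could also be written as a single call that counts missing values per column. A minimal sketch with toy data (the DataFrame here is illustrative only):

```python
import pandas as pd

# Toy frame with one missing token list
toy = pd.DataFrame({
    "sentence_id": [0, 1],
    "token": [["Identifier", ":", "AA5"], None],
})

# One count of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)
```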


In [77]:
subdf_tsd = df_tsd.drop(columns=["sentence", "sent_offsets"])
subdf_tsd = subdf_tsd.reset_index()
df_tokens = subdf_tsd.apply(pd.Series.explode)
df_tokens.head()

Unnamed: 0,sentence_id,token_id,token,token_offsets,description_id
0,0,0,Identifier,"(0, 10)",0
0,0,1,:,"(10, 11)",0
0,0,2,AA5,"(12, 15)",0
1,1,3,Title,"(0, 5)",1
1,1,4,:,"(5, 6)",1


In [78]:
print(df_tokens.shape)

(692568, 5)


Write the data to a file:

In [79]:
df_tokens.to_csv(config.tokc_path+"tokens_sents_descs.csv")