# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs or updates the files:
  * `../data/crc_metadata/annot_descs.csv`: adds columns for description offsets, word counts, and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `../data/analysis_data/descs_stats.csv`: contains the count, minimum, maximum, mean, and standard deviation of the word and sentence counts for all descriptions and for each type of description (metadata field)
  * `../data/crc_metadata/descriptions_annotated/`: contains one file for every description in the annotated datasets with file names as zero-padded description IDs

***

**Table of Contents**

[0.](#0) Annotated Descriptions Data

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Tokens

[3.](#3) Description and Annotation Linking

[4.](#4) Token and Sentence Linking
    
  * TO REMOVE: [BIO Tags](#3.1)

***

First, begin by loading Python programming libraries and the dataset to be analyzed.

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

## 0. Annotated Descriptions Data

Create a CSV dataset of the descriptions that were annotated in brat, including the descriptions' file name and offsets. 

In [2]:
filenames = os.listdir(config.doc_path)
print(filenames[:10])
print(len(filenames))

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt', 'La_07300.txt', 'Coll-1434_07400.txt', 'Coll-1434_03100.txt', 'MS_BOX_25.5_00100.txt', 'EUA_IN1_38300.txt', 'Coll-14_05900.txt', 'Coll-1694_00100.txt']
3649


Check that every annotation's file is in the directory:

In [3]:
agg_df = pd.read_csv("../data/aggregated_data/aggregated_final.csv", index_col=0)
print(agg_df.shape)
agg_df.head()

(55260, 7)


Unnamed: 0_level_0,file,text,ann_offsets,label,category,associated_genders,description_id
agg_ann_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear,2364
1,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear,4542
2,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear,3660
3,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear,4678
4,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear,4732


In [4]:
agg_files = list(agg_df.file)
agg_files = [f.replace(".ann", ".txt") for f in agg_files]
agg_files = list(set(agg_files))
agg_files.sort()
missing = []
for f in agg_files:
    if f not in filenames:
        missing.append(f)
assert len(missing) == 0
print(len(agg_files))

1112


`agg_files` is a list of the file names in `descriptions/brat` that should be included in the CSV of the annotated description data.

In [5]:
### TESTING FUNCTIONS ###
# test_file = open(os.path.join(config.doc_path, "Coll-1036_00300.txt"), "r").read()
# desc_dict, did, field_for_next_file = getFieldDescriptions(dict(), test_file, "Scope and Contents", ["Title", "Biographical / Historical", "Processing Information"], 0, "Coll-1036_00300.txt")
# test_files = agg_files[:5]
desc_dict = utils.getDescriptionsInFiles(config.doc_path, agg_files)

In [6]:
# desc_dict = utils.getDescriptionsInFiles(config.doc_path, ["Coll-1036_00300.txt", "Coll-1036_00400.txt", "Coll-1036_00500.txt"])
# desc_dict # Looks good - 16 for _00400!

In [7]:
print("Total Descriptions:", len(desc_dict.keys()))

Total Descriptions: 27908


In [8]:
ann_desc_df = pd.DataFrame.from_dict(desc_dict, orient="index")
# Give the descriptions a unique identifier
ann_desc_df = ann_desc_df.reset_index()
ann_desc_df = ann_desc_df.rename(columns={"index":"description_id"})
ann_desc_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset
0,0,Identifier: AA5,AA5_00100.txt,0,16
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725
4,4,Identifier: AA6,AA6_00100.txt,0,16


Make sure all the description values have text:

In [9]:
assert ann_desc_df.loc[ann_desc_df.description.isnull() == True].shape[0] == 0
assert ann_desc_df.loc[ann_desc_df.description.isna() == True].shape[0] == 0
assert ann_desc_df.loc[ann_desc_df.description == ""].shape[0] == 0

In [10]:
ann_desc_df.loc[ann_desc_df.description_id == 108]

Unnamed: 0,description_id,description,file,start_offset,end_offset
108,108,Title:\nThe Trinity,BAI_00300.txt,1191,1210


In [11]:
ann_desc_df.loc[ann_desc_df.file == "Coll-1036_00400.txt"]  # Should have 16 rows 

Unnamed: 0,description_id,description,file,start_offset,end_offset
1040,1040,"Miscellaneous working material, Part 1, 2, 3, ...",Coll-1036_00400.txt,0,116
1041,1041,Scope and Contents:\nMiscellaneous music.Sever...,Coll-1036_00400.txt,117,1264
1042,1042,"Scope and Contents:\nMiscellaneous items, Part...",Coll-1036_00400.txt,1265,1563
1043,1043,Scope and Contents:\n'Loose Leaf M.S.S. [manus...,Coll-1036_00400.txt,1564,1837
1044,1044,"Scope and Contents:\n'News Cuttings', note boo...",Coll-1036_00400.txt,1838,2033
1045,1045,Scope and Contents:\nVarious music collections...,Coll-1036_00400.txt,2034,5254
1046,1046,Scope and Contents:\n'Proofs M.S.S.[manuscrip...,Coll-1036_00400.txt,5255,5552
1047,1047,Scope and Contents:\n'Kennedy-Fraser MSS. [man...,Coll-1036_00400.txt,5553,5983
1048,1048,Scope and Contents:\n'Tolmie Gesto'. \nBundle...,Coll-1036_00400.txt,5984,6821
1049,1049,Scope and Contents:\nProofs of A Life of Song ...,Coll-1036_00400.txt,6822,6902


In [12]:
ann_desc_df.loc[ann_desc_df.file == "Coll-146_15400.txt"]

Unnamed: 0,description_id,description,file,start_offset,end_offset
20928,20928,"Portrait of Arthur Koestler :: Geiger, Gretl: ...",Coll-146_15400.txt,0,59
20929,20929,Title:\nPhotograph of Arthur Koestler with a w...,Coll-146_15400.txt,60,205
20930,20930,Title:\nPhotograph of Arthur Koestler with a w...,Coll-146_15400.txt,206,351
20931,20931,Title:\nArthur Koestler with dog Attila at Lon...,Coll-146_15400.txt,352,429
20932,20932,Title:\nPortrait of Arthur Koestler at Long Ba...,Coll-146_15400.txt,430,481
20933,20933,Title:\nPhotograph of Arthur Koestler with dog...,Coll-146_15400.txt,482,545
20934,20934,Title:\nPortrait of Arthur Koestler :: Yevonde...,Coll-146_15400.txt,546,606
20935,20935,Title:\nPortrait of Arthur Koestler :: Yevonde...,Coll-146_15400.txt,607,667
20936,20936,Title:\nPhotograph of Arthur Koestler :: Forga...,Coll-146_15400.txt,668,738
20937,20937,Title:\nSeries of three photographs of Arthur ...,Coll-146_15400.txt,739,796


In [30]:
descriptions = list(ann_desc_df.description)
fields = ["Identifier: ", "Title:\n", "Scope and Contents:\n", "Biographical / Historical:\n", "Processing Information:\n"]
fields_col, descs_col = [], []
last_field = None  # field carried over when a description has no field label of its own
for d in descriptions:
    # Find the first known field label that appears in the description
    field = next((f for f in fields if f in d), None)
    if field is None:
        # No label found, so the description continues the previous field
        clean_field = last_field
        descs_col.append(d)
    else:
        descs_col.append(d.replace(field, ""))
        clean_field = field.strip()[:-1]  # drop the trailing colon
    fields_col.append(clean_field)
    last_field = clean_field
assert len(descriptions) == len(descs_col)
assert len(fields_col) == len(descs_col)

In [33]:
ann_desc_df.insert(len(ann_desc_df.columns), "field", fields_col)
ann_desc_df.insert(len(ann_desc_df.columns), "clean_desc", descs_col)
ann_desc_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6


The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (which is also the number of characters in the document preceding the beginning of the annotated text span)
* The **end offset** is the index of the character *immediately after* the annotated text span (so the length of the span equals the end offset minus the start offset)

This means that the start offset of the first description of each document will be 0 and the end offset of the last description of each document will equal the length (number of characters) of the document.  There are multiple descriptions for each document, so we have calculated the intermediate start and end offsets as well, which are all in the DataFrame above.
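As a minimal sketch of this convention, Python slicing uses the same half-open indexing, so a span can be recovered directly from its offsets. The `document` string and the `span_text` helper below are hypothetical and only serve as an illustration:

```python
# Hypothetical document text (made up for illustration, not read from the dataset)
document = "Identifier: AA5\nTitle:\nPapers of James Whyte"

def span_text(text, start, end):
    """Return the text covered by brat-style offsets (start inclusive, end exclusive)."""
    return text[start:end]

# Offsets of the word "Papers" in the hypothetical document
start = document.index("Papers")
end = start + len("Papers")

print(start, end)
print(span_text(document, start, end))
# The span covering the whole document runs from 0 to len(document)
print(span_text(document, 0, len(document)) == document)
```

Because the end offset is exclusive, adjacent spans never share a character and `end - start` is always the span's length.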

Write the annotated descriptions with their start and end offsets to a CSV file:

In [34]:
annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
ann_desc_df.to_csv(annot_desc_filepath)

Write each description to a TXT file for later analysis with NLTK:

In [35]:
dir_path = config.crc_meta_path+"descriptions_annotated/"
# Make sure the directory exists
Path(dir_path).mkdir(parents=True, exist_ok=True)

# Write one TXT file per description (UTF-8 encoded), with the description ID as the file name
description_list = list(ann_desc_df.description)
id_list = list(ann_desc_df.description_id)
# For zero padding so files are ordered correctly
max_digits = len(str(max(id_list)))
counter = 0
for d, did in zip(description_list, id_list):
    filename = str(did).zfill(max_digits)+".txt"
    # Set the encoding explicitly because open()'s default depends on the platform
    with open(dir_path+filename, "w", encoding="utf-8") as f:
        f.write(d)
    counter += 1
    if counter % 1000 == 0:
        print("1000 new files written")
print("{} files finished writing!".format(counter))

1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
27908 files finished writing!
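
The zero padding in the cell above matters because file names are sorted as strings, not as numbers. A small sketch with made-up IDs shows the difference:

```python
# Made-up description IDs, for illustration only
ids = [2, 10, 1]

# Without padding, string sorting puts "10.txt" before "2.txt"
unpadded = sorted("{}.txt".format(i) for i in ids)

# With padding to the width of the largest ID, string order matches numeric order
width = len(str(max(ids)))
padded = sorted(str(i).zfill(width) + ".txt" for i in ids)

print(unpadded)
print(padded)
```

With padded names, the order of `corpus.fileids()` lines up with the order of the rows in the DataFrame of descriptions, which the later column insertions appear to rely on.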


<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)
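
The implementation of `utils.makeDescribeDf` is not shown in this notebook. As a rough sketch of the same idea, the per-field statistics could be computed with a pandas `groupby`, assuming a DataFrame with `field`, `word_count`, and `sent_count` columns. The toy values below are made up:

```python
import pandas as pd

# Toy data using the same column names as ann_desc_df; the values are invented
toy_df = pd.DataFrame({
    "field": ["Title", "Title", "Scope and Contents", "Scope and Contents"],
    "word_count": [4, 8, 30, 50],
    "sent_count": [1, 1, 2, 4],
})

# One row per metadata field, one column per (count column, statistic) pair
stats = toy_df.groupby("field")[["word_count", "sent_count"]].agg(
    ["count", "min", "max", "mean", "std"]
)
print(stats)
```

Grouping by a different column (for example a collection identifier or an annotation label) would give the other breakdowns in the list above.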

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [5]:
# # Uncomment if need to reload data
# # --------------------------------
# annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
# ann_desc_df = pd.read_csv(annot_desc_filepath)
# ann_desc_df = ann_desc_df.drop(columns=["Unnamed: 0"])
# dir_path = config.crc_meta_path+"descriptions_annotated/"

In [6]:
corpus = PlaintextCorpusReader(dir_path, r"\w*\.txt", encoding="utf8")  # raw string avoids an invalid "\w" escape
print(corpus.fileids()[:10]) # Looks good
print(corpus.fileids()[-10:]) # Looks good

['00000.txt', '00001.txt', '00002.txt', '00003.txt', '00004.txt', '00005.txt', '00006.txt', '00007.txt', '00008.txt', '00009.txt']
['27898.txt', '27899.txt', '27900.txt', '27901.txt', '27902.txt', '27903.txt', '27904.txt', '27905.txt', '27906.txt', '27907.txt']


In [7]:
print(len(corpus.fileids()))

27908


In [8]:
ann_desc_df.shape

(27908, 9)

#### Length per Description

In [9]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

['Identifier']
['identifier']
['Identifier: AA5']


In [10]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good

[1, 10] [1, 1, 1, 8]


In [43]:
ann_desc_df.insert(len(ann_desc_df.columns), "word_count", word_count)
ann_desc_df.insert(len(ann_desc_df.columns), "sent_count", sent_count)
ann_desc_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [44]:
ann_desc_df.to_csv(annot_desc_filepath)  # add the counts to the existing CSV file

Write a file with the sentences for each description:

In [11]:
sent_df = ann_desc_df[["description_id", "description", "file", "start_offset", "end_offset", "field"]].copy()  # copy so insert() does not act on a view
sent_df.insert(len(sent_df.columns), "sentences", desc_sents)
sent_df.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,sentences
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,[Identifier: AA5]
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,[Title:\nPapers of The Very Rev Prof James Why...
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"[Scope and Contents:\nSermons and addresses, 1..."
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,[Biographical / Historical:\nProfessor James A...
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,[Identifier: AA6]


In [14]:
sent_df = sent_df.drop(columns=["description"])
sent_df = sent_df.rename(columns={"start_offset":"desc_start_offset", "end_offset":"desc_end_offset"})
sent_df_exploded = sent_df.apply(pd.Series.explode)
sent_df_exploded.head()

Unnamed: 0,description_id,file,desc_start_offset,desc_end_offset,field,sentences
0,0,AA5_00100.txt,0,16,Identifier,Identifier: AA5
1,1,AA5_00100.txt,17,76,Title,Title:\nPapers of The Very Rev Prof James Whyt...
2,2,AA5_00100.txt,77,633,Scope and Contents,"Scope and Contents:\nSermons and addresses, 19..."
3,3,AA5_00100.txt,634,1725,Biographical / Historical,Biographical / Historical:\nProfessor James Ai...
3,3,AA5_00100.txt,634,1725,Biographical / Historical,He was educated at Daniel Stewart's College an...


In [16]:
sent_df_exploded.to_csv(config.crc_meta_path+"description_sentences.csv")

#### Calculate summary stats for word and sentence counts

In [45]:
desc_df_stats = utils.makeDescribeDf("All", ann_desc_df)
bh_stats = utils.makeDescribeDf("Biographical / Historical", ann_desc_df)
sc_stats = utils.makeDescribeDf("Scope and Contents", ann_desc_df)
pi_stats = utils.makeDescribeDf("Processing Information", ann_desc_df)
t_stats = utils.makeDescribeDf("Title", ann_desc_df)

In [46]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
All,word_count,27908.0,20.006342,87.010297,0.0,12343.0
All,sent_count,27908.0,1.50602,5.363265,1.0,742.0
Title,word_count,15135.0,8.177007,5.653681,0.0,62.0
Title,sent_count,15135.0,1.114635,0.494993,1.0,15.0
Scope and Contents,word_count,11470.0,30.731386,128.28867,1.0,12343.0
Scope and Contents,sent_count,11470.0,1.793636,8.109776,1.0,742.0
Biographical / Historical,word_count,661.0,118.476551,134.910761,2.0,1112.0
Biographical / Historical,sent_count,661.0,5.931921,6.552115,1.0,45.0
Processing Information,word_count,304.0,11.305921,10.516143,2.0,179.0
Processing Information,sent_count,304.0,1.078947,0.345195,1.0,4.0


In [47]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>

## 2. Offsets of Tokens

**Get the offsets of the tokens in every description.**
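
The implementation of `utils.getTokensAndOffsetsFromStrings` is not shown here. As a minimal sketch of the idea, assuming a simple regular-expression tokenizer stands in for whatever tokenizer the helper actually uses, token offsets can be shifted by the description's start offset so that they refer to positions in the whole document. The function name `tokens_with_offsets` is hypothetical:

```python
import re

def tokens_with_offsets(description, desc_start=0):
    """Return (tokens, offsets), where offsets are document-level and end-exclusive."""
    tokens, offsets = [], []
    # Stand-in tokenizer: runs of word characters, or single punctuation marks
    for match in re.finditer(r"\w+|[^\w\s]", description):
        tokens.append(match.group())
        # Shift description-level positions by the description's start offset
        offsets.append((desc_start + match.start(), desc_start + match.end()))
    return tokens, offsets

tokens, offsets = tokens_with_offsets("Identifier: AA5", desc_start=0)
print(tokens)
print(offsets)

title_tokens, title_offsets = tokens_with_offsets("Title:\nPapers", desc_start=17)
print(title_tokens)
print(title_offsets)
```

Keeping the offsets end-exclusive makes them directly comparable with the brat annotation offsets.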

In [48]:
annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
df_desc = pd.read_csv(annot_desc_filepath, index_col=0)
df_desc.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [49]:
descs = list(df_desc.description)
desc_ids = list(df_desc.description_id)
desc_start_offsets = list(df_desc.start_offset)
desc_end_offsets = list(df_desc.end_offset)

In [50]:
tokens_dict, offsets_dict = utils.getTokensAndOffsetsFromStrings(descs, desc_ids, desc_start_offsets, desc_end_offsets)

In [51]:
tokens_col, offsets_col, desc_ids_col = [], [], []
for desc_id,token_list in tokens_dict.items():
    tokens_col += token_list
    offsets_list = offsets_dict[desc_id]
    offsets_col += offsets_list
    assert len(token_list) == len(offsets_list)
    desc_ids_col += [desc_id]*len(token_list)

assert len(tokens_col) == len(offsets_col)
assert len(tokens_col) == len(desc_ids_col)

In [52]:
for col_list in [tokens_col, offsets_col, desc_ids_col]:
    print(col_list[0:5])

['Identifier', ':', 'AA5', 'Title', ':']
[(0, 10), (10, 11), (12, 15), (17, 22), (22, 23)]
[0, 0, 0, 1, 1]


Looks good!  Now create a DataFrame with these lists as columns:

In [53]:
df_tokens = pd.DataFrame({"desc_id":desc_ids_col, "token":tokens_col, "offsets":offsets_col})
df_tokens.head()

Unnamed: 0,desc_id,token,offsets
0,0,Identifier,"(0, 10)"
1,0,:,"(10, 11)"
2,0,AA5,"(12, 15)"
3,1,Title,"(17, 22)"
4,1,:,"(22, 23)"


In [54]:
df_tokens.shape

(753911, 3)

Great!  Now write the DataFrame to a file:

In [55]:
df_tokens.to_csv(config.crc_meta_path+"descid_token_offsets.csv")

<a id="3"></a>
## 3. Description and Annotation Linking

**Assign a description ID to every annotation, using the file names and offsets to determine within which description each annotated text span appears.**

**STEP 1:** Convert all offsets to tuples of integers and create a `filename` column to match up the descriptions' .txt files and annotated text spans' .ann files. 

In [56]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)

# Remove file extensions
desc_filenames = list(df_descs.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_descs.insert(1, "filename", desc_filenames)

# Get offsets as tuples of ints
start_offsets = list(df_descs.start_offset)
end_offsets = list(df_descs.end_offset)
offsets_strs = list(zip(list(df_descs.start_offset),list(df_descs.end_offset)))
desc_offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_descs = df_descs.drop(columns=["start_offset", "end_offset", "word_count", "sent_count"])
df_descs.insert(4, "desc_offsets", desc_offsets_int_tuples)

df_descs.head()

Unnamed: 0,description_id,filename,description,file,desc_offsets,field,clean_desc
0,0,AA5_00100,Identifier: AA5,AA5_00100.txt,"(0, 16)",Identifier,AA5
1,1,AA5_00100,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,AA5_00100,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,"(77, 633)",Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,AA5_00100,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...
4,4,AA6_00100,Identifier: AA6,AA6_00100.txt,"(0, 16)",Identifier,AA6


In [57]:
df_ann = pd.read_csv("../data/aggregated_data/aggregated_final.csv", index_col=0)

# Remove file extensions
ann_filenames = list(df_ann.file)
ann_filenames = [f[:-4] for f in ann_filenames]
df_ann.insert(1, "filename", ann_filenames)

# Get offsets as tuples of ints
ann_offsets_strs = list(df_ann.offsets)
ann_offsets_strs = [pair[1:-1].split(",") for pair in ann_offsets_strs]
ann_offsets_ints = [tuple((int(pair[0].strip()), int(pair[1].strip()))) for pair in ann_offsets_strs]
df_ann = df_ann.drop(columns=["offsets"])
df_ann.insert(4, "ann_offsets", ann_offsets_ints)

df_ann.head()

Unnamed: 0,agg_ann_id,filename,file,text,ann_offsets,label,category,associated_genders
12,0,Coll-1157_00100,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear
22,1,Coll-1310_02300,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear
23,2,Coll-1281_00100,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear
24,3,Coll-1310_02700,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear
25,4,Coll-1310_02900,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear
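
Offsets are read back from the CSV as strings such as `"(1407, 1415)"`. As an alternative sketch to the manual string splitting above, `ast.literal_eval` from the standard library can parse such a string safely. The helper name `parse_offsets` is hypothetical:

```python
import ast

def parse_offsets(offsets_str):
    """Parse a string like "(1407, 1415)" into a tuple of two ints."""
    start, end = ast.literal_eval(offsets_str)
    return int(start), int(end)

parsed = parse_offsets("(1407, 1415)")
print(parsed)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is a reasonable choice for values read from a file.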


**STEP 2:** Group the IDs and offsets by file, so that each annotation's offsets can be compared with the offsets of the descriptions in the same file to determine which description ID to assign to the annotation.

In [58]:
subdf_descs = df_descs.drop(columns=["file"]) #"field",
df_descs_imploded = utils.implodeDataFrame(subdf_descs, ["filename"])
df_descs_imploded.head()

Unnamed: 0_level_0,description_id,description,desc_offsets,field,clean_desc
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AA5_00100,"[0, 1, 2, 3]","[Identifier: AA5, Title:\nPapers of The Very R...","[(0, 16), (17, 76), (77, 633), (634, 1725)]","[Identifier, Title, Scope and Contents, Biogra...","[AA5, Papers of The Very Rev Prof James Whyte ..."
AA6_00100,"[4, 5, 6, 7]","[Identifier: AA6, Title:\nPapers of Rev Tom Al...","[(0, 16), (17, 60), (61, 560), (561, 2513)]","[Identifier, Title, Scope and Contents, Biogra...","[AA6, Papers of Rev Tom Allan (1916-1965), Ser..."
AA7_00100,"[8, 9, 10, 11]","[Identifier: AA7, Title:\nPapers of Rev Prof A...","[(0, 16), (17, 76), (77, 417), (418, 2442)]","[Identifier, Title, Scope and Contents, Biogra...","[AA7, Papers of Rev Prof Alec Campbell Cheyne ..."
BAI_00100,"[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2...","[Identifier: BAI, Title:\nPapers of Professor ...","[(0, 16), (17, 84), (85, 115), (116, 143), (14...","[Identifier, Title, Title, Title, Title, Title...","[BAI, Papers of Professor John Baillie, and Ba..."
BAI_00200,"[46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 5...","[\nTitle:\nNew Testament (senior), Title:\nApo...","[(0, 31), (32, 60), (61, 97), (98, 134), (135,...","[Title, Title, Title, Title, Title, Title, Tit...","[\nNew Testament (senior), Apologetics (senior..."
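
The implementation of `utils.implodeDataFrame` is not shown here. A rough equivalent, assuming it collects each column's values into one list per file name, is a pandas `groupby` followed by `agg(list)`. The toy rows below reuse values from the table above:

```python
import pandas as pd

# A few rows in the shape of subdf_descs (a subset of its columns)
toy = pd.DataFrame({
    "filename": ["AA5_00100", "AA5_00100", "AA6_00100"],
    "description_id": [0, 1, 4],
    "desc_offsets": [(0, 16), (17, 76), (0, 16)],
})

# One row per file, with every other column collected into a list
imploded = toy.groupby("filename").agg(list)

# A dictionary keyed by file name makes the per-file lookups straightforward
by_file = imploded.to_dict(orient="index")
print(by_file["AA5_00100"])
```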


In [59]:
descs_dict = df_descs_imploded.to_dict(orient="index")
print(descs_dict["AA5_00100"])

{'description_id': [0, 1, 2, 3], 'description': ['Identifier: AA5', 'Title:\nPapers of The Very Rev Prof James Whyte (1920-2005)', 'Scope and Contents:\nSermons and addresses, 1948-1996; lectures, 1949-1982; class notes and lecture notes, 1949-1982; correspondence, 1988-1989 and 1964-1970; newspaper cuttings, 1988-1989 and 1964-1969; publications and articles, 1902-1970; church magazines, 1929-1993; conference papers, 1978; moderatorial papers, 1988-1989; University Christian Consultative Group papers, 1970-1972; Church of Scotland and the Congregational Union of Scotland papers, 1959-1967; personal papers, 1848-1983; photographs 1911 and 1960.See also External Documents (below).', "Biographical / Historical:\nProfessor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent thre

In [60]:
subdf_ann = df_ann.drop(columns=["file","category","associated_genders"])
df_ann_imploded = utils.implodeDataFrame(subdf_ann, ["filename"])
df_ann_imploded.head()

Unnamed: 0_level_0,agg_ann_id,text,ann_offsets,label
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","[He, he, his, he, he, His, he, The Very Rev Pr...","[(789, 791), (871, 873), (913, 916), (928, 930...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","[Billy Graham, He, he, he, he, He, his, his, h...","[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Masculine, Gendered-Pronoun, Gendered-Pronoun..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","[Professor Cheyne, son, sister, brother, His, ...","[(2399, 2415), (505, 508), (614, 620), (647, 6...","[Masculine, Gendered-Role, Gendered-Role, Gend..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","[Jacques Chevalier, Lloyd Morgan, Professor Jo...","[(371, 388), (393, 405), (34, 56), (102, 114),...","[Unknown, Unknown, Unknown, Unknown, Unknown, ..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","[Barker, Garvie, Busch, Adolf Jülicher, Johann...","[(215, 221), (226, 232), (250, 255), (285, 299...","[Unknown, Unknown, Unknown, Unknown, Unknown, ..."


In [61]:
anns_dict = df_ann_imploded.to_dict(orient="index")
print(anns_dict["AA5_00100"])

{'agg_ann_id': [14377, 14378, 14379, 14380, 14381, 14382, 14383, 14384, 14385, 14386, 14387, 24275, 26233, 41260, 41261, 41262, 41263, 52952, 52953], 'text': ['He', 'he', 'his', 'he', 'he', 'His', 'he', 'The Very Rev Prof James Whyte', 'Professor James Aitken White', 'James Whyte', 'James Whyte', 'The Very Rev Prof James Whyte', 'Rev Prof James Whyte', 'Scottish Theologian', 'Moderator of the General Assembly of the Church of Scotland', 'army Chaplain', 'chair of practical theology and Christian ethics', 'The Very Rev Prof James Whyte', 'leading Scottish Theologian'], 'ann_offsets': [(789, 791), (871, 873), (913, 916), (928, 930), (1217, 1219), (1241, 1244), (1315, 1317), (34, 63), (661, 689), (1032, 1043), (1350, 1361), (34, 63), (43, 63), (704, 723), (728, 787), (955, 968), (1129, 1177), (34, 63), (696, 723)], 'label': ['Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Unknown', 'Masculine', 'M

**STEP 3:** File by file, determine which description's offsets each annotation occurs within and associate the corresponding description IDs and annotation IDs.

In [62]:
annid_to_descid = dict.fromkeys(list(df_ann.agg_ann_id))

In [63]:
files = list(df_ann_imploded.index)
# list.sort() returns None, so compare sorted copies of the indexes instead
assert sorted(df_ann_imploded.index) == sorted(df_descs_imploded.index)

In [64]:
df_descs.head()

Unnamed: 0,description_id,filename,description,file,desc_offsets,field,clean_desc
0,0,AA5_00100,Identifier: AA5,AA5_00100.txt,"(0, 16)",Identifier,AA5
1,1,AA5_00100,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,AA5_00100,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,"(77, 633)",Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,AA5_00100,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...
4,4,AA6_00100,Identifier: AA6,AA6_00100.txt,"(0, 16)",Identifier,AA6


In [65]:
for f in files:  #sample
    ann_ids, ann_offsets = anns_dict[f]["agg_ann_id"], anns_dict[f]["ann_offsets"]
    desc_ids, desc_offsets = descs_dict[f]["description_id"], descs_dict[f]["desc_offsets"]
    for i,ann_id in enumerate(ann_ids):
        ann_offset_pair = ann_offsets[i]
        for j,desc_id in enumerate(desc_ids):
            desc_offset_pair = desc_offsets[j]
            # If the annotation offsets are within the description offsets, assign that description ID to that annotation 
            if (ann_offset_pair[0] >= desc_offset_pair[0]) and (ann_offset_pair[0] <= desc_offset_pair[1]):
                if (ann_offset_pair[1] >= desc_offset_pair[0]) and (ann_offset_pair[1] <= desc_offset_pair[1]):
                    annid_to_descid[ann_id] = desc_id  #sample_annid_to_descid[ann_id] = desc_id
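
The containment test in the loop above can be sketched as a small function, assuming every span has a start offset no greater than its end offset. The name `find_description` is hypothetical. Unlike the loop, it returns the first matching description rather than the last:

```python
def find_description(ann_span, desc_ids, desc_spans):
    """Return the ID of the first description whose span contains ann_span, else None."""
    ann_start, ann_end = ann_span
    for desc_id, (desc_start, desc_end) in zip(desc_ids, desc_spans):
        # The annotation lies within the description if both of its ends do
        if desc_start <= ann_start and ann_end <= desc_end:
            return desc_id
    return None

# Description IDs and offsets for AA5_00100, taken from the table above
desc_ids = [0, 1, 2, 3]
desc_spans = [(0, 16), (17, 76), (77, 633), (634, 1725)]

inside = find_description((789, 791), desc_ids, desc_spans)
straddling = find_description((70, 80), desc_ids, desc_spans)
print(inside, straddling)
```

An annotation that straddles two descriptions gets no ID, which is the case the `isna()` checks below are designed to catch.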

In [66]:
df_ids = pd.DataFrame({"agg_ann_id":list(annid_to_descid.keys()), "description_id":list(annid_to_descid.values())})
df_ids.head()

Unnamed: 0,agg_ann_id,description_id
0,0,2364
1,1,4542
2,2,3660
3,3,4678
4,4,4732


In [67]:
print(df_ids.loc[df_ids.agg_ann_id.isna() == True].shape)
print(df_ids.loc[df_ids.description_id.isna() == True].shape)

(0, 2)
(0, 2)


In [70]:
anns_without_desc = list(df_ids.loc[df_ids.description_id.isna() == True].agg_ann_id)
df_anns_without_desc = df_ann.loc[df_ann.agg_ann_id.isin(anns_without_desc)]
assert len(set(list(df_anns_without_desc.file))) == 0

**STEP 4:** Add annotations' corresponding description IDs to the annotation DataFrame: 

In [71]:
df_ann = df_ann.set_index("agg_ann_id")

In [73]:
df_ann_joined = df_ann.join(df_ids.set_index("agg_ann_id"), on="agg_ann_id", how="outer")
df_ann_joined.head()

Unnamed: 0_level_0,filename,file,text,ann_offsets,label,category,associated_genders,description_id
agg_ann_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,Coll-1157_00100,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear,2364
1,Coll-1310_02300,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear,4542
2,Coll-1281_00100,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear,3660
3,Coll-1310_02700,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear,4678
4,Coll-1310_02900,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear,4732


In [74]:
assert df_ann_joined.loc[df_ann_joined.description_id.isna() == True].shape[0] == 0

In [76]:
assert df_ann_joined.shape[0] == df_ann.shape[0]
assert df_ann_joined.shape[0] == df_ids.shape[0]

Update the CSV file:

In [77]:
df_ann_joined = df_ann_joined.drop(columns=["filename"])
df_ann_joined.to_csv(config.agg_path+"aggregated_final.csv")

<a id="4"></a>
## 4. Token and Sentence Linking

### Preprocess the Data

**Perform sentence tokenization of the descriptions, associate each sentence with a description ID, and then associate every token with a sentence ID.**

#### Sentence Tokenization

In [30]:
# Ignore descriptions that weren't annotated
# subdf_descs = df_descs.loc[df_descs.field != "Identifier"]
# print(subdf_descs.shape)
# print(subdf_descs.loc[subdf_descs.clean_desc.isna()].shape)

In [26]:
# Remove any empty clean_description values (NaN if description for metadata field at end of file appears in next file)
# subdf_descs = subdf_descs.loc[~subdf_descs.clean_desc.isna()]
# Fill NaN with empty string
# subdf_descs = subdf_descs.fillna("")
# subdf_descs.head()
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
df_descs.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [27]:
print(df_descs.loc[df_descs.description.isna()].shape)
print(df_descs.loc[df_descs.clean_desc.isna()].shape)

(0, 9)
(258, 9)


In [28]:
df_descs[["clean_desc"]] = df_descs[["clean_desc"]].fillna("")

#### Associate Tokens to Sentences

In [30]:
sents_dict, offsets_dict = utils.getSentsAndOffsetsFromStrings(list(df_descs.description), list(df_descs.description_id), list(df_descs.start_offset), list(df_descs.end_offset))

In [31]:
desc_id_col = list(sents_dict.keys())
sents_col = list(sents_dict.values())
offsets_col = list(offsets_dict.values())
df_sents = pd.DataFrame({"description_id":desc_id_col, "sentences":sents_col, "sent_offsets":offsets_col})
df_sents.head()

Unnamed: 0,description_id,sentences,sent_offsets
0,0,[Identifier: AA5],"[(0, 16)]"
1,1,[Title:\nPapers of The Very Rev Prof James Why...,"[(0, 59)]"
2,2,"[Scope and Contents:\nSermons and addresses, 1...","[(0, 556)]"
3,3,[Biographical / Historical:\nProfessor James A...,"[(0, 155), (155, 273), (273, 398), (398, 607),..."
4,4,[Identifier: AA6],"[(0, 16)]"
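
The helper `utils.getSentsAndOffsetsFromStrings` is defined outside this notebook, so its implementation is not shown. As a rough illustration only, contiguous sentence offsets like those in the table above could be produced as follows. This sketch swaps in a naive regular-expression splitter for NLTK's sentence tokenizer (it will wrongly split after abbreviations such as "Rev."), and the function name `sentence_spans` and the sample text are hypothetical.

```python
import re

def sentence_spans(text):
    """Split after '.', '!' or '?' followed by whitespace.

    Returns contiguous (start, end) offsets relative to the start of the text,
    so each sentence span includes the whitespace that follows it.
    """
    boundaries = [m.end() for m in re.finditer(r"[.!?]\s+", text)]
    starts = [0] + boundaries
    ends = boundaries + [len(text)]
    # Drop an empty final span when the text ends with trailing whitespace
    return [(s, e) for s, e in zip(starts, ends) if s < e]

text = "He was a minister. He taught theology. He retired"  # made-up description
spans = sentence_spans(text)
sentences = [text[s:e] for s, e in spans]
print(spans)
print(sentences)
```

Because the spans are contiguous, concatenating the sentences reproduces the original description, which makes it easy to check that no characters were lost.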


In [32]:
df_sents_exploded = df_sents.apply(pd.Series.explode)
# Number the sentences after exploding so that every sentence has a unique ID
# (reusing the original index would repeat one ID for all sentences of a description)
df_sents_exploded = df_sents_exploded.reset_index(drop=True)
df_sents_exploded.insert(0, "sentence_id", df_sents_exploded.index)
df_sents_exploded.head()

Unnamed: 0,sentence_id,description_id,sentences,sent_offsets
0,0,0,Identifier: AA5,"(0, 16)"
1,1,1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)"
2,2,2,"Scope and Contents:\nSermons and addresses, 19...","(0, 556)"
3,3,3,Biographical / Historical:\nProfessor James Ai...,"(0, 155)"
4,4,3,He was educated at Daniel Stewart's College an...,"(155, 273)"


Save the sentence data to a CSV file:

In [33]:
df_sents_exploded.to_csv(config.tokc_path+"sentences.csv")

Determine which sentences each token belongs to by comparing their description IDs and offsets.
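
A minimal sketch of this comparison on made-up data. It assumes, as the outputs in this notebook suggest, that sentence offsets are relative to the start of a description while token offsets are relative to the start of the file, so each token's start offset is shifted by the description's start offset before the lookup. The function name `assign_tokens_to_sentences` and all sample values are hypothetical.

```python
from bisect import bisect_right

def assign_tokens_to_sentences(token_offsets, sent_offsets, sent_ids, desc_start=0):
    """Return one sentence ID (or None) per token.

    token_offsets: (start, end) pairs relative to the file
    sent_offsets:  sorted, non-overlapping (start, end) pairs relative to the description
    sent_ids:      one ID per sentence, in the same order as sent_offsets
    desc_start:    offset of the description's first character within the file
    """
    sent_starts = [start for start, _ in sent_offsets]
    assigned = []
    for token_start, _ in token_offsets:
        rel_start = token_start - desc_start
        # Index of the last sentence that starts at or before the token
        j = bisect_right(sent_starts, rel_start) - 1
        if j >= 0 and rel_start < sent_offsets[j][1]:
            assigned.append(sent_ids[j])
        else:
            assigned.append(None)  # the token lies outside every sentence
    return assigned

# Hypothetical description that starts at file offset 100 and has two sentences
sent_offsets = [(0, 12), (12, 30)]
sent_ids = [7, 8]
token_offsets = [(100, 104), (105, 111), (112, 116), (125, 129), (140, 145)]
result = assign_tokens_to_sentences(token_offsets, sent_offsets, sent_ids, desc_start=100)
print(result)
```

A binary search over the sorted sentence starts avoids comparing every token against every sentence, which matters for long descriptions.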

In [36]:
df_tags = pd.read_csv(config.tokc_path+"tagged_tokens.csv")
df_tags = df_tags.drop(columns=["Unnamed: 0"])
df_tags.head()

Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
0,,0,"(0, 10)",O,,Identifier,0.0
1,,0,"(10, 11)",O,,:,1.0
2,,0,"(12, 15)",O,,AA5,2.0
3,,1,"(17, 22)",O,,Title,3.0
4,,1,"(22, 23)",O,,:,4.0


In [40]:
print(df_tags.loc[df_tags.token.isna()].shape)
df_tags.loc[df_tags.token.isna()].head()

(328, 7)


Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
12471,41306.0,639,,,poet,,
12472,,639,,O,,,
14785,41490.0,705,,,editor,,
14786,41495.0,705,,,author,,
16722,12923.0,733,,,he,,


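**TO DO:** Investigate why these 328 rows have no `token` value (the rows shown above are also missing `offsets` and `token_id` values). For example: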

In [45]:
df_tags.loc[(df_tags.text == "editor") & (df_tags.ann_id == 41490.0)]

Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
14785,41490.0,705,,,editor,,


In [43]:
df_tags_imploded = utils.implodeDataFrame(df_tags[["description_id", "offsets", "token_id", "token"]], ["description_id"])
df_tags_imploded.head()

Unnamed: 0_level_0,offsets,token_id,token
description_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"[(0, 10), (10, 11), (12, 15)]","[0.0, 1.0, 2.0]","[Identifier, :, AA5]"
1,"[(17, 22), (22, 23), (24, 30), (31, 33), (34, ...","[3.0, 4.0, 5.0, 6.0, 7.0, 7.0, 7.0, 8.0, 8.0, ...","[Title, :, Papers, of, The, The, The, Very, Ve..."
2,"[(77, 82), (83, 86), (87, 95), (95, 96), (97, ...","[16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23....","[Scope, and, Contents, :, Sermons, and, addres..."
3,"[(634, 646), (647, 648), (649, 659), (659, 660...","[109.0, 110.0, 111.0, 112.0, 113.0, 114.0, 115...","[Biographical, /, Historical, :, Professor, Ja..."
4,"[(0, 10), (10, 11), (12, 15)]","[308.0, 309.0, 310.0]","[Identifier, :, AA6]"


In [44]:
token_dict = df_tags_imploded.to_dict(orient="index")  # keys are description_id values
token_dict[0]

{'offsets': ['(0, 10)', '(10, 11)', '(12, 15)'],
 'token_id': [0.0, 1.0, 2.0],
 'token': ['Identifier', ':', 'AA5']}

In [47]:
df_sents = utils.implodeDataFrame(df_sents_exploded, ["description_id"])
sents_dict = df_sents.to_dict(orient="index")  # keys are description_id values
sents_dict[1]

{'sentence_id': [1],
 'sentences': ['Title:\nPapers of The Very Rev Prof James Whyte (1920-2005)'],
 'sent_offsets': [(0, 59)]}

In [42]:
desc_ids = list(sents_dict.keys())
desc_ids_check = list(token_dict.keys())
assert desc_ids == desc_ids_check, "The dictionaries should have the same order of keys (description IDs)"

In [None]:
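from ast import literal_eval

# Sentence offsets are relative to the start of each description, while token offsets
# appear to be relative to the start of the file, so shift tokens by the description's start offset
desc_starts = dict(zip(df_descs.description_id, df_descs.start_offset))
token_sentences = dict()  # token_id -> sentence_id
for desc_id in desc_ids:
    # Get the sentence data for the description
    sent_offsets_list = sents_dict[desc_id]["sent_offsets"]
    sent_ids = sents_dict[desc_id]["sentence_id"]
    # Get the token data for the description
    token_offsets_list = token_dict[desc_id]["offsets"]
    token_ids = token_dict[desc_id]["token_id"]
    # Assign each token to the sentence that contains the token's start offset
    for i,token_id in enumerate(token_ids):
        if pd.isna(token_offsets_list[i]):
            continue  # Skip rows without a token (see the rows with missing values above)
        # Offsets loaded from the CSV file are strings, such as "(0, 10)"
        token_start,token_end = literal_eval(token_offsets_list[i])
        token_start = token_start - desc_starts[desc_id]
        for j,(sent_start,sent_end) in enumerate(sent_offsets_list):
            if (token_start >= sent_start) and (token_start < sent_end):
                token_sentences[token_id] = sent_ids[j]
                break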

In [54]:
# t = IntervalTree()
s_tree = IntervalTree.from_tuples(sents_dict[3]["sent_offsets"])
print(s_tree)
t_tree = IntervalTree.from_tuples(token_dict[3]["offsets"])
print(t_tree)

IntervalTree([Interval(0, 155), Interval(155, 273), Interval(273, 398), Interval(398, 607), Interval(607, 716), Interval(716, 841), Interval(841, 943), Interval(943, 1090)])


TypeError: __new__() takes from 3 to 4 positional arguments but 11 were given
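
The `TypeError` above is likely caused by the token offsets being stored as strings: after loading `tagged_tokens.csv`, each offset pair is a string such as `'(634, 646)'` rather than a tuple, and unpacking a ten-character string would pass ten arguments (eleven, counting the class itself) to the interval constructor. Below is a minimal sketch of converting such strings to integer tuples first, using only the standard library. The helper name `parse_offsets` is hypothetical, and the commented-out last line assumes that `IntervalTree` has been imported from the `intervaltree` package.

```python
from ast import literal_eval

def parse_offsets(offset_strs):
    """Convert strings such as '(634, 646)' into (start, end) tuples of ints."""
    pairs = []
    for offset_str in offset_strs:
        start, end = literal_eval(offset_str)
        pairs.append((int(start), int(end)))
    return pairs

# First three token offsets shown above for description 3
raw_offsets = ['(634, 646)', '(647, 648)', '(649, 659)']

# Unpacking a string yields its characters, not two numbers
chars = tuple(raw_offsets[0])
print(len(chars))

token_offset_pairs = parse_offsets(raw_offsets)
print(token_offset_pairs)
# t_tree = IntervalTree.from_tuples(token_offset_pairs)
```

The sentence offsets did not trigger the error because they were still tuples in memory rather than strings read back from a CSV file.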

***
***
***
# DELETE CODE BELOW (MOVING TO NEW NB)

<a id="3.1"></a>
### 3.1 BIO Tags

**Compare the offsets of the descriptions' tokens to the offsets of the annotated text spans to determine which tokens to mark as the beginning of an annotation (`B-[LABELNAME]`), inside an annotation (`I-[LABELNAME]`), or outside of any annotation and thus unannotated (`O`).**
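
A minimal sketch of this tagging scheme on made-up data, assuming token and annotation offsets share the same reference point. The function name `bio_tags`, the sample text, and its offsets are hypothetical; the label `Gendered-Role` is one of the labels in the annotated data. Each token receives a set of tags because overlapping annotations can give one token several labels.

```python
def bio_tags(token_offsets, annotations):
    """Return a set of tags for each token.

    token_offsets: list of (start, end) pairs, one per token
    annotations:   list of ((start, end), label) pairs for the annotated text spans
    """
    all_tags = []
    for token_start, token_end in token_offsets:
        tags = set()
        for (ann_start, ann_end), label in annotations:
            # Only tokens that lie fully inside the annotated span are tagged
            if ann_start <= token_start and token_end <= ann_end:
                prefix = "B-" if token_start == ann_start else "I-"
                tags.add(prefix + label)
        # Tokens outside of every annotation get the O tag
        all_tags.append(tags if tags else {"O"})
    return all_tags

# Made-up offsets for the text "the Prince Regent visited"
tokens = ["the", "Prince", "Regent", "visited"]
token_offsets = [(0, 3), (4, 10), (11, 17), (18, 25)]
annotations = [((4, 17), "Gendered-Role")]
tagged = bio_tags(token_offsets, annotations)
for token, tags in zip(tokens, tagged):
    print(token, sorted(tags))
```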

In [None]:
# TO DO: convert the three DataFrames to dictionaries,
#        for each filename, check whether each token_offset pair is contained within each ann_offset pair and desc_,
#        recording which description (using indices) each annotation appears within

In [12]:
df_tokens = pd.read_csv(config.tokc_path+"descid_token_offsets.csv", index_col=0)
token_desc_ids = list(df_tokens.desc_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.offsets)
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
token_offsets_tuples[:5]  # Looks good

[(29, 36), (37, 39), (40, 43), (44, 57), (58, 65)]

Associate the text and offsets of the description tokens and of the annotated text spans with description IDs.

In [31]:
df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["desc_id"])
df_tokens_imploded = df_tokens_imploded.rename(columns={"offsets":"token_offsets"})
df_tokens_imploded.head()

Unnamed: 0_level_0,token,token_offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ..."
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141..."


In [35]:
df_tokens_imploded.to_csv(config.tokc_path+"token_data_imploded.csv")

Load the data associating description and annotation IDs to offsets.

In [32]:
df_descs_imploded = pd.read_csv(config.agg_path+"description_data_imploded.csv", index_col=0)
df_descs_imploded.head()

Unnamed: 0_level_0,eadid,desc_id,desc_offsets
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BAI_01000,['BAI'],[68],"[(1290, 1315)]"
BAI_01300,['BAI'],[143],"[(5853, 5983)]"
BAI_01600,['BAI'],[221],"[(5967, 6202)]"
BAI_01900,['BAI'],[292],"[(5297, 5506)]"
BAI_02200,['BAI'],[361],"[(15180, 15419)]"


In [33]:
df_anns_imploded = pd.read_csv(config.agg_path+"annotation_data_imploded.csv", index_col=0)
df_anns_imploded.head()

Unnamed: 0_level_0,agg_ann_id,ann_offsets
filename,Unnamed: 1_level_1,Unnamed: 2_level_1
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","['(789, 791)', '(871, 873)', '(913, 916)', '(9..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","['(1778, 1790)', '(677, 679)', '(920, 922)', '..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","['(2399, 2415)', '(505, 508)', '(614, 620)', '..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","['(371, 388)', '(393, 405)', '(34, 56)', '(102..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","['(215, 221)', '(226, 232)', '(250, 255)', '(2..."


**Step 1: O tags**

Compare description IDs in the two DataFrames above to determine which descriptions (from `df_tokens_imploded`) do not have annotations, and assign all those descriptions' tokens an `O` tag (for *outside* of an annotation).

In [40]:
all_desc_ids = list(df_tokens_imploded.index)
ann_desc_ids = set(df_merged_imploded.index)  # a set makes the membership tests fast
unannotated = [desc_id for desc_id in all_desc_ids if desc_id not in ann_desc_ids]
print("Rows to assign tag 'O':", len(unannotated))

Rows to assign tag 'O': 86742


In [48]:
o_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(unannotated)].copy()  # copy so a column can be inserted safely
assert o_df.shape[0] == len(unannotated)

In [50]:
tokens_list = list(o_df.token)
tags = [["O"]*len(tokens) for tokens in tokens_list]
assert len(tags) == len(tokens_list)
o_df.insert(len(o_df.columns), "ann_tag", tags)
o_df.head()

Unnamed: 0_level_0,token,offsets,ann_tag
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ...","[O, O, O, O, O, O, O]"
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [53]:
assert len(o_df.token[100]) == len(o_df.ann_tag[100])
assert len(o_df.token[488]) == len(o_df.ann_tag[488])
assert len(o_df.token[0]) == len(o_df.ann_tag[0])

**Step 2: B- and I- tags**

For description IDs that do have annotations (and thus are in `df_merged_imploded`), assign their tokens tags of `B-[LABELNAME]` and `I-[LABELNAME]` for *beginning* and *inside* of an annotation, replacing `[LABELNAME]` with the name of the annotation's label.

In [41]:
annotated = [desc_id for desc_id in all_desc_ids if desc_id in ann_desc_ids]
print("Rows to assign 'B-' or 'I-':", len(annotated))

Rows to assign 'B-' or 'I-': 1855


In [54]:
bi_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(annotated)]
assert bi_df.shape[0] == len(annotated)

In [55]:
bi_df.head()

Unnamed: 0_level_0,token,offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
167,"[Brick, Burning, ,, Beardman, 's]","[(1421, 1426), (1427, 1434), (1434, 1435), (14..."
508,"[Interpreting, sequence, motifs, [, Letter, to...","[(3064, 3076), (3077, 3085), (3086, 3092), (30..."
610,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
611,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
640,"[Lady, Luck, :, the, theory, of, probability, ...","[(2118, 2122), (2123, 2127), (2127, 2128), (21..."


In [57]:
bi_dict = bi_df.to_dict('index')
print(bi_dict[167])

{'token': ['Brick', 'Burning', ',', 'Beardman', "'s"], 'offsets': ['(1421, 1426)', '(1427, 1434)', '(1434, 1435)', '(1436, 1444)', '(1444, 1446)']}


In [59]:
ann_dict = df_merged_imploded.to_dict('index')
print(ann_dict[167])

{'offsets_ann': ['(1436, 1444)', '(1436, 1444)'], 'text_ann': ['Beardman', 'Beardman'], 'label': ['Omission', 'Unknown'], 'id': [31928, 31929]}


In [78]:
# Turn a string of offsets into a tuple with each offset of type int
# "(1436, 1444)" --> (1436, 1444)
def offsetsStrToTuple(offsets_str):
    offsets_list = offsets_str[1:-1].split(", ")
    offsets_ints = [int(o) for o in offsets_list]
    return tuple((offsets_ints))

assert type(offsetsStrToTuple('(1436, 1444)')) == tuple
assert type(offsetsStrToTuple('(1436, 1444)')[0]) == int
assert type(offsetsStrToTuple('(1436, 1444)')[1]) == int

In [101]:
desc_ids = list(bi_dict.keys())[:100]  # START WITH SAMPLE
assert len(set(desc_ids)) == len(desc_ids)  # Make sure every description ID is unique
log = 0
descid_to_tag = dict.fromkeys(desc_ids)
for desc_id in desc_ids:
    text_spans = ann_dict[desc_id]["text_ann"]
    span_offsets = [offsetsStrToTuple(o) for o in ann_dict[desc_id]["offsets_ann"]]
    span_labels = ann_dict[desc_id]["label"]
    desc_tokens = bi_dict[desc_id]['token']
    desc_tokens_offsets = bi_dict[desc_id]['offsets']
    desc_tags = []
    for i,desc_token in enumerate(desc_tokens):
        token_offset_pair = offsetsStrToTuple(desc_tokens_offsets[i])
        tags = []  # Note: one token may have multiple tags
        
        # Compare the token to every annotated text span in the description
        for j,text_span in enumerate(text_spans):
            # Use the offsets of the span being compared (not of the last span in the list)
            span_offset_pair = span_offsets[j]
            # Be sure a matching token's offsets are within the annotated text span
            if (desc_token in text_span
               ) and (
                token_offset_pair[0] >= span_offset_pair[0]
                ) and (
                token_offset_pair[1] <= span_offset_pair[1]):
                # If the start offsets are the same, assign a 'B-' tag
                if token_offset_pair[0] == span_offset_pair[0]:
                    tags += ['B-'+span_labels[j]]
                # Otherwise, assign an 'I-' tag
                else:
                    tags += ['I-'+span_labels[j]]
        # If no annotated text span contains the description token, assign it an O tag
        if len(tags) == 0:
            tags = ["O"]
        
        desc_tags += [set(tags)]
    
    assert len(desc_tokens) == len(desc_tags)
    descid_to_tag[desc_id] = desc_tags
    
    log += 1
    if log % 100 == 0:
        print("Assigned tags for {} descriptions".format(log))

Assigned tags for 100 descriptions


In [109]:
did = 610 #508 #167
# print(ann_dict[did])
print(bi_dict[did])
# print(descid_to_tag[did])

# spans = ['Beardman', 'Beardman']
# spans2 = ["Brick Burning"]
# tokens = ['Brick', 'Burning', ',', 'Beardman', "'s"]
# # print(spans.count('Beardman'))
# # # print(spans.index('Beardman'))
# # # print(tokens.index('Beardman'))
# # for k in range(0,3):
# #     print(k)
# indeces = [index for index in range(len(spans)) if spans[index] == 'Beardman']
# print(indeces)

{'token': ['Letter', ':', ':', 'Koestler', ',', 'Arthur'], 'offsets': ['(127, 133)', '(134, 135)', '(135, 136)', '(137, 145)', '(145, 146)', '(147, 153)']}


In [4]:
# is_annotated_col = []
# annotated_id = []
# i, maxI = 0, len(token_desc_ids)  #1188478, 1189478
# while i < maxI:
#     desc_id = token_desc_ids[i]
#     token = tokens[i]
#     token_start, token_end = token_offsets_tuples[i][0], token_offsets_tuples[i][1] 
    
#     ann_df = df_merged.loc[df_merged.desc_id == desc_id]
#     ann_id_list = list(ann_df.id)
#     ann_offsets_list = list(ann_df.offsets_ann)
#     ann_offsets_clean = [ann_offsets[1:-1].split(", ") for ann_offsets in ann_offsets_list]
#     ann_offsets_tuples = [tuple((int(ann_offsets[0]), int(ann_offsets[1]))) for ann_offsets in ann_offsets_clean]
    
#     for j,ann_offsets in enumerate(ann_offsets_tuples):
#         ann_start = ann_offsets[0]
#         ann_end = ann_offsets[1]
#         if token_start == ann_start:
#             is_annotated_col += ["B"]
#             annotated_id += [ann_id_list[j]]
#         elif (token_start > ann_start) and (token_start <= ann_end):
#             is_annotated_col += ["I"]
#             annotated_id += [ann_id_list[j]]
#         else:
#             is_annotated_col += ["O"]
#             annotated_id += ["None"]
    
#     i += 1

# assert len(is_annotated_col) == len(token_desc_ids)
# assert len(is_annotated_col) == len(annotated_id)

KeyboardInterrupt: 

In [None]:
df_tokens.insert(len(df_tokens.columns),"is_annotated",is_annotated_col)
df_tokens.insert(len(df_tokens.columns),"ann_id",annotated_id)
df_tokens.head()

In [5]:
print(len(is_annotated_col))

48022031


In [8]:
print(len(annotated_id))

48022031
