# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs or updates the files:
  * `../data/crc_metadata/annot_descs.csv`: adds columns for description offsets, word counts, and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `../data/analysis_data/descs_stats.csv`: contains the count, minimum, maximum, average, and standard deviation of all descriptions and each type of description
  * `../data/crc_metadata/descriptions_annotated/`: contains one file for every description in the annotated datasets with file names as zero-padded description IDs

***

**Table of Contents**

[0.](#0) Annotated Descriptions Data

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Tokens

[3.](#3) Description and Annotation Linking
    
  * TO REMOVE: [BIO Tags](#3.1)

***

First, begin by loading Python programming libraries and the dataset to be analyzed.

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Annotated Descriptions Data

Create a CSV dataset of the descriptions that were annotated in brat, including the descriptions' file name and offsets. 

In [2]:
filenames = os.listdir(config.doc_path)
print(filenames[:10])
print(len(filenames))  # descs_with_offsets has 3645, not 3649

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt', 'La_07300.txt', 'Coll-1434_07400.txt', 'Coll-1434_03100.txt', 'MS_BOX_25.5_00100.txt', 'EUA_IN1_38300.txt', 'Coll-14_05900.txt', 'Coll-1694_00100.txt']
3649


Check that every annotation's file is in the directory:

In [3]:
agg_df = pd.read_csv("../data/aggregated_data/aggregated_final.csv", index_col=0)
print(agg_df.shape)
agg_df.head()

(55260, 7)


Unnamed: 0,agg_ann_id,file,offsets,text,label,category,associated_genders
12,0,Coll-1157_00100.ann,"(1407, 1415)",knighted,Gendered-Role,Linguistic,Unclear
22,1,Coll-1310_02300.ann,"(9625, 9635)",knighthood,Gendered-Role,Linguistic,Unclear
23,2,Coll-1281_00100.ann,"(2426, 2439)",Prince Regent,Gendered-Role,Linguistic,Unclear
24,3,Coll-1310_02700.ann,"(9993, 10003)",knighthood,Gendered-Role,Linguistic,Unclear
25,4,Coll-1310_02900.ann,"(7192, 7195)",Sir,Gendered-Role,Linguistic,Unclear


In [4]:
agg_files = list(agg_df.file)
agg_files = [f.replace(".ann", ".txt") for f in agg_files]
agg_files = list(set(agg_files))
agg_files.sort()
# Collect any annotation files that have no matching description file
missing = [f for f in agg_files if f not in filenames]
assert len(missing) == 0

`agg_files` is a list of the file names in `descriptions/brat` that should be included in the CSV of the annotated description data.

In [6]:
### TESTING FUNCTIONS ###
# test_file = open(os.path.join(config.doc_path, "Coll-1036_00300.txt"), "r").read()
# desc_dict, did, field_for_next_file = getFieldDescriptions(dict(), test_file, "Scope and Contents", ["Title", "Biographical / Historical", "Processing Information"], 0, "Coll-1036_00300.txt")
# test_files = agg_files[:5]
desc_dict = utils.getDescriptionsInFiles(config.doc_path, agg_files)

In [108]:
# print(open(config.doc_path+"Coll-1057_00600.txt").read())

Title:
Page mounted with three photographs of the site of the Poultry Research Centre sub-station at the Easter Bush estate

Title:
Page mounted with four photographs

Title:
Page mounted with two photographs of the Poultry Research Centre sub-station at the Easter Bush estate

Title:
Page mounted with programme of the Scottish Poultry Conference, Stirling (16 November 1955) and group photograph from the conference

Title:
Page mounted with five photographs

Title:
Page mounted with four items

Title:
Page mounted with seven photographs of Alan Greenwood at the Massachusetts Institute of Technology, February 1961

Title:
Group photograph of staff and students outside the Institute of Animal Genetics

Title:
Four items found loose among photographs

Title:
Page mounted with two photographs of Alan Greenwood

Title:
Album of postcards from Greenwood's trip to Canada, the USA and Mexico

Scope and Contents:
The article is concerned with the death of a particularly high-producing hen, L164

In [7]:
print("Total Descriptions:", len(desc_dict.keys()))

Total Descriptions: 26875


In [8]:
ann_desc_df = pd.DataFrame.from_dict(desc_dict, orient="index")
# Give the descriptions a unique identifier
ann_desc_df = ann_desc_df.reset_index()
ann_desc_df = ann_desc_df.rename(columns={"index":"description_id"})
ann_desc_df.head()

Unnamed: 0,description_id,description,field,file,start_offset,end_offset
0,0,Papers of The Very Rev Prof James Whyte (1920-...,Title,AA5_00100.txt,24,76
1,1,"Sermons and addresses, 1948-1996; lectures, 19...",Scope and Contents,AA5_00100.txt,97,633
2,2,Professor James Aitken White was a leading Sco...,Biographical / Historical,AA5_00100.txt,661,1724
3,3,Papers of Rev Tom Allan (1916-1965),Title,AA6_00100.txt,24,60
4,4,"Sermons and addresses, 1947-1963; essays and l...",Scope and Contents,AA6_00100.txt,81,560


Make sure all the description values have text:

In [9]:
assert ann_desc_df.loc[ann_desc_df.description.isnull() == True].shape[0] == 0
assert ann_desc_df.loc[ann_desc_df.description.isna() == True].shape[0] == 0
assert ann_desc_df.loc[ann_desc_df.description == ""].shape[0] == 0

In [10]:
ann_desc_df.loc[ann_desc_df.description_id == 108]

Unnamed: 0,description_id,description,field,file,start_offset,end_offset
108,108,General addresses,Title,BAI_00400.txt,28,46


In [11]:
ann_desc_df.loc[ann_desc_df.file == "Coll-1057_00600.txt"]  #Coll-1036_00400.txt   # Looks good!

Unnamed: 0,description_id,description,field,file,start_offset,end_offset
1286,1286,Page mounted with three photographs of the sit...,Title,Coll-1057_00600.txt,7,124
1287,1287,Page mounted with four photographs,Title,Coll-1057_00600.txt,132,167
1288,1288,Page mounted with two photographs of the Poult...,Title,Coll-1057_00600.txt,175,278
1289,1289,Page mounted with programme of the Scottish Po...,Title,Coll-1057_00600.txt,286,419
1290,1290,Page mounted with five photographs,Title,Coll-1057_00600.txt,427,462
1291,1291,Page mounted with four items,Title,Coll-1057_00600.txt,470,499
1292,1292,Page mounted with seven photographs of Alan Gr...,Title,Coll-1057_00600.txt,507,621
1293,1293,Group photograph of staff and students outside...,Title,Coll-1057_00600.txt,629,709
1294,1294,Four items found loose among photographs,Title,Coll-1057_00600.txt,717,758
1295,1295,Page mounted with two photographs of Alan Gree...,Title,Coll-1057_00600.txt,766,818


The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (which is also the number of characters in the document preceding the beginning of the annotated text span)
* The **end offset** is the index of the character *immediately after the annotated text span*, so the end offset is exclusive and a span's length equals its end offset minus its start offset

Each description's offsets are measured from the start of its document. Because the field names (for example, `Title:`) are not part of the descriptions, the first description in a document starts after offset 0: in `Coll-1057_00600.txt` above, the seven characters of `Title:` and its line break precede the first description, which starts at offset 7. Each document contains multiple descriptions, so the DataFrame above records the start and end offsets of every one of them.
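
As a minimal sketch of how start and end offsets index into a document: the document string below is invented for illustration (it is not taken from the dataset), and the offsets are located with `str.find` rather than read from a brat file.

```python
# Hypothetical document text, laid out like the .txt files shown above.
document = (
    "Title:\nPage mounted with four photographs\n\n"
    "Title:\nPage mounted with five photographs\n"
)
description = "Page mounted with five photographs"

# brat-style offsets: the start is inclusive and the end is exclusive.
start = document.find(description)
end = start + len(description)

# Slicing the document with the offsets recovers exactly the description text.
span_text = document[start:end]
print((start, end), repr(span_text))
```

Because the end offset is exclusive, `end - start` is the number of characters in the span.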

Write the annotated descriptions with their start and end offsets to a CSV file:

In [12]:
annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
ann_desc_df.to_csv(annot_desc_filepath)

Write each description to a TXT file for later analysis with NLTK:

In [13]:
dir_path = config.crc_meta_path+"descriptions_annotated/"
# Make sure the directory exists
Path(dir_path).mkdir(parents=True, exist_ok=True)

# Write one TXT file per description (UTF-8 encoded), with the description ID as the file name
description_list = list(ann_desc_df.description)
id_list = list(ann_desc_df.description_id)
# For zero padding so files are ordered correctly
max_digits = len(str(max(id_list)))
counter = 0
for i in range(len(description_list)):
    d = description_list[i]
    did = id_list[i]
    zeros = max_digits - len(str(did))
    filename = ("0"*zeros)+str(did)+".txt"
    # Set the encoding explicitly because the default depends on the platform
    with open(dir_path+filename, "w", encoding="utf-8") as f:
        f.write(d)
    counter += 1
    if counter % 1000 == 0:
        print("1000 new files written")
print("{} files finished writing!".format(counter))

1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
1000 new files written
26875 files finished writing!
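
The zero padding matters because file names sort as strings, not as numbers. A small sketch of the idea, using a made-up sample of description IDs and `str.zfill` as an alternative way to pad:

```python
# Hypothetical sample of description IDs (the real IDs run from 0 to 26874).
description_ids = [0, 7, 108, 26874]

# Pad every ID to the width of the largest ID.
width = len(str(max(description_ids)))
file_names = [str(did).zfill(width) + ".txt" for did in description_ids]
print(file_names)

# Without padding, string sorting puts "108.txt" before "7.txt".
unpadded = sorted(str(did) + ".txt" for did in description_ids)
print(unpadded)
```

With padding, the alphabetical order of the file names matches the numeric order of the IDs, which is what a corpus reader listing files alphabetically would rely on.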


<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)
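
To make the grouping concrete, here is a rough sketch of per-field summary statistics using only the standard library. The rows are a tiny invented sample, and `describe` is a hypothetical helper for illustration; it is not the notebook's `utils.makeDescribeDf`.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (field, word_count) rows standing in for the descriptions DataFrame.
rows = [
    ("Title", 9), ("Title", 6), ("Title", 5),
    ("Scope and Contents", 62), ("Scope and Contents", 59),
    ("Biographical / Historical", 179),
]

def describe(values):
    """Count, mean, sample standard deviation, minimum, and maximum of a list."""
    return {
        "total": len(values),
        "mean": mean(values),
        # The sample standard deviation is undefined for a single value.
        "std": stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }

# Group the word counts by metadata field.
by_field = defaultdict(list)
for field, count in rows:
    by_field[field].append(count)

# Summarize all rows together, then each field separately.
stats = {"All": describe([count for _, count in rows])}
stats.update({field: describe(values) for field, values in by_field.items()})
for name, summary in stats.items():
    print(name, summary)
```

The same pattern would apply to sentence counts, or to grouping by collection, label, or category instead of field.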

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [2]:
# # Uncomment if need to reload data
# # --------------------------------
# annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
# ann_desc_df = pd.read_csv(annot_desc_filepath)
# ann_desc_df = ann_desc_df.drop(columns=["Unnamed: 0"])
# dir_path = config.crc_meta_path+"descriptions_annotated/"

In [14]:
corpus = PlaintextCorpusReader(dir_path, r"\w*\.txt", encoding="utf8")
print(corpus.fileids()[:10]) # Looks good
print(corpus.fileids()[-10:]) # Looks good

['00000.txt', '00001.txt', '00002.txt', '00003.txt', '00004.txt', '00005.txt', '00006.txt', '00007.txt', '00008.txt', '00009.txt']
['26865.txt', '26866.txt', '26867.txt', '26868.txt', '26869.txt', '26870.txt', '26871.txt', '26872.txt', '26873.txt', '26874.txt']


#### Length per Description

In [15]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

['Papers', 'of', 'The', 'Very', 'Rev', 'Prof', 'James', 'Whyte', '1920-2005']
['papers', 'of', 'the', 'very', 'rev', 'prof', 'james', 'whyte', '1920-2005']
['Papers of The Very Rev Prof James Whyte (1920-2005)']
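
The implementation of `utils.getWordsSents` is not shown in this notebook. As a rough, regex-based approximation of the counting rule (alphanumeric tokens, punctuation excluded), the sketch below reproduces the nine words of the first description; the function name and both patterns are assumptions, and the sentence splitter is deliberately naive.

```python
import re

def count_words_and_sentences(text):
    """Split text into word tokens and sentences with simple regular expressions."""
    # Words: runs of letters or digits, optionally joined by hyphens,
    # so that a date range such as "1920-2005" stays one token.
    words = re.findall(r"\w+(?:-\w+)*", text)
    # Sentences: split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return words, sentences

words, sentences = count_words_and_sentences(
    "Papers of The Very Rev Prof James Whyte (1920-2005)"
)
print(words)
print(len(words), len(sentences))
```

A tokenizer such as NLTK's handles abbreviations and other edge cases that this naive splitter would get wrong, so counts may differ on longer descriptions.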


In [16]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good

[9, 62] [1, 1, 8, 1]


In [17]:
ann_desc_df.insert(len(ann_desc_df.columns), "word_count", word_count)
ann_desc_df.insert(len(ann_desc_df.columns), "sent_count", sent_count)
ann_desc_df.head()

Unnamed: 0,description_id,description,field,file,start_offset,end_offset,word_count,sent_count
0,0,Papers of The Very Rev Prof James Whyte (1920-...,Title,AA5_00100.txt,24,76,9,1
1,1,"Sermons and addresses, 1948-1996; lectures, 19...",Scope and Contents,AA5_00100.txt,97,633,62,1
2,2,Professor James Aitken White was a leading Sco...,Biographical / Historical,AA5_00100.txt,661,1724,179,8
3,3,Papers of Rev Tom Allan (1916-1965),Title,AA6_00100.txt,24,60,6,1
4,4,"Sermons and addresses, 1947-1963; essays and l...",Scope and Contents,AA6_00100.txt,81,560,59,2


In [18]:
ann_desc_df.to_csv(annot_desc_filepath)  # add the counts to the existing CSV file

#### Calculate summary stats for word and sentence counts

In [19]:
desc_df_stats = utils.makeDescribeDf("All", ann_desc_df)
bh_stats = utils.makeDescribeDf("Biographical / Historical", ann_desc_df)
sc_stats = utils.makeDescribeDf("Scope and Contents", ann_desc_df)
pi_stats = utils.makeDescribeDf("Processing Information", ann_desc_df)
t_stats = utils.makeDescribeDf("Title", ann_desc_df)

In [20]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

metadata_field,by,total_descriptions,mean,std,min,max
All,word_count,26875.0,18.672298,87.842174,0.0,12340.0
All,sent_count,26875.0,1.519479,5.422002,1.0,742.0
Title,word_count,14862.0,7.253061,5.623145,0.0,51.0
Title,sent_count,14862.0,1.116001,0.498569,1.0,15.0
Scope and Contents,word_count,11056.0,28.425018,129.541254,0.0,12340.0
Scope and Contents,sent_count,11056.0,1.809877,8.190972,1.0,742.0
Biographical / Historical,word_count,655.0,117.453435,135.133575,6.0,1110.0
Biographical / Historical,sent_count,655.0,5.975573,6.566015,1.0,45.0
Processing Information,word_count,302.0,9.350993,10.53436,4.0,177.0
Processing Information,sent_count,302.0,1.07947,0.346279,1.0,4.0


In [21]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>

## 2. Offsets of Tokens

**Get the offsets of the tokens in every description.**
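
The idea is to find each token's position inside its description and then shift it by the description's start offset, so that token offsets are relative to the whole document. A minimal sketch, assuming a simple regex tokenizer (`token_offsets` is a hypothetical helper, not `utils.getTokensAndOffsetsFromStrings`):

```python
import re

def token_offsets(description, desc_start):
    """Return tokens and their document-level (start, end) offsets."""
    tokens, offsets = [], []
    # Assumed tokenization: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", description):
        tokens.append(match.group())
        # Shift description-relative positions by the description's start offset.
        offsets.append((desc_start + match.start(), desc_start + match.end()))
    return tokens, offsets

tokens, offsets = token_offsets(
    "Papers of The Very Rev Prof James Whyte (1920-2005)", 24
)
print(tokens[:5])
print(offsets[:5])
```

For the first description (start offset 24), the first five tokens get the offsets `(24, 30)`, `(31, 33)`, `(34, 37)`, `(38, 42)`, and `(43, 46)`.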

In [22]:
annot_desc_filepath = config.crc_meta_path+"annot_descs.csv"
df_desc = pd.read_csv(annot_desc_filepath, index_col=0)
df_desc.head()

Unnamed: 0,description_id,description,field,file,start_offset,end_offset,word_count,sent_count
0,0,Papers of The Very Rev Prof James Whyte (1920-...,Title,AA5_00100.txt,24,76,9,1
1,1,"Sermons and addresses, 1948-1996; lectures, 19...",Scope and Contents,AA5_00100.txt,97,633,62,1
2,2,Professor James Aitken White was a leading Sco...,Biographical / Historical,AA5_00100.txt,661,1724,179,8
3,3,Papers of Rev Tom Allan (1916-1965),Title,AA6_00100.txt,24,60,6,1
4,4,"Sermons and addresses, 1947-1963; essays and l...",Scope and Contents,AA6_00100.txt,81,560,59,2


In [23]:
descs = list(df_desc.description)
desc_ids = list(df_desc.description_id)
desc_start_offsets = list(df_desc.start_offset)
desc_end_offsets = list(df_desc.end_offset)

In [24]:
tokens_dict, offsets_dict = utils.getTokensAndOffsetsFromStrings(descs, desc_ids, desc_start_offsets, desc_end_offsets)

In [25]:
tokens_col, offsets_col, desc_ids_col = [], [], []
for desc_id,token_list in tokens_dict.items():
    tokens_col += token_list
    offsets_list = offsets_dict[desc_id]
    offsets_col += offsets_list
    assert len(token_list) == len(offsets_list)
    desc_ids_col += [desc_id]*len(token_list)

assert len(tokens_col) == len(offsets_col)
assert len(tokens_col) == len(desc_ids_col)

In [26]:
for col_list in [tokens_col, offsets_col, desc_ids_col]:
    print(col_list[0:5])

['Papers', 'of', 'The', 'Very', 'Rev']
[(24, 30), (31, 33), (34, 37), (38, 42), (43, 46)]
[0, 0, 0, 0, 0]


Looks good!  Now create a DataFrame with these lists as columns:

In [27]:
df_tokens = pd.DataFrame({"desc_id":desc_ids_col, "token":tokens_col, "offsets":offsets_col})
df_tokens.head()

Unnamed: 0,desc_id,token,offsets
0,0,Papers,"(24, 30)"
1,0,of,"(31, 33)"
2,0,The,"(34, 37)"
3,0,Very,"(38, 42)"
4,0,Rev,"(43, 46)"


In [28]:
df_tokens.shape

(666242, 3)

Great!  Now write the DataFrame to a file:

In [29]:
df_tokens.to_csv(config.crc_meta_path+"descid_token_offsets.csv")

<a id="3"></a>
## 3. Description and Annotation Linking

**Assign a description ID to every annotation, using the file names and offsets to determine within which description each annotated text span appears.**

**STEP 1:** Convert all offsets to tuples of integers and create a `filename` column to match up the descriptions' .txt files and annotated text spans' .ann files. 
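
As a small illustration of this step, offsets stored as strings such as `"(1407, 1415)"` can be parsed with `ast.literal_eval`, and extensions can be removed by splitting on the last period. Both helper names below are hypothetical.

```python
import ast

def parse_offsets(offsets_str):
    """Turn a string such as "(1407, 1415)" into a tuple of two ints."""
    start, end = ast.literal_eval(offsets_str)
    return int(start), int(end)

def strip_extension(file_name):
    """Drop everything after the last period, e.g. ".ann" or ".txt"."""
    return file_name.rsplit(".", 1)[0]

print(parse_offsets("(1407, 1415)"))
print(strip_extension("Coll-1157_00100.ann"))
print(strip_extension("MS_BOX_25.5_00100.txt"))
```

Splitting on the last period only keeps file names that contain other periods, such as `MS_BOX_25.5_00100.txt`, intact.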

In [40]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)

# Remove file extensions
desc_filenames = list(df_descs.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_descs.insert(1, "filename", desc_filenames)

# Get offsets as tuples of ints
start_offsets = list(df_descs.start_offset)
end_offsets = list(df_descs.end_offset)
offsets_strs = list(zip(start_offsets, end_offsets))
desc_offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_descs = df_descs.drop(columns=["start_offset", "end_offset", "word_count", "sent_count"])
# Insert the offsets after the description column, matching the output below
df_descs.insert(3, "desc_offsets", desc_offsets_int_tuples)

df_descs.head()

Unnamed: 0,description_id,filename,description,desc_offsets,field,file
0,0,AA5_00100,Papers of The Very Rev Prof James Whyte (1920-...,"(24, 76)",Title,AA5_00100.txt
1,1,AA5_00100,"Sermons and addresses, 1948-1996; lectures, 19...","(97, 633)",Scope and Contents,AA5_00100.txt
2,2,AA5_00100,Professor James Aitken White was a leading Sco...,"(661, 1724)",Biographical / Historical,AA5_00100.txt
3,3,AA6_00100,Papers of Rev Tom Allan (1916-1965),"(24, 60)",Title,AA6_00100.txt
4,4,AA6_00100,"Sermons and addresses, 1947-1963; essays and l...","(81, 560)",Scope and Contents,AA6_00100.txt


In [47]:
df_ann = pd.read_csv("../data/aggregated_data/aggregated_final.csv", index_col=0)

# Remove file extensions
ann_filenames = list(df_ann.file)
ann_filenames = [f[:-4] for f in ann_filenames]
df_ann.insert(1, "filename", ann_filenames)

# Get offsets as tuples of ints
ann_offsets_strs = list(df_ann.offsets)
ann_offsets_strs = [pair[1:-1].split(",") for pair in ann_offsets_strs]
ann_offsets_ints = [tuple((int(pair[0].strip()), int(pair[1].strip()))) for pair in ann_offsets_strs]
df_ann = df_ann.drop(columns=["offsets"])
df_ann.insert(4, "ann_offsets", ann_offsets_ints)

df_ann.head()

Unnamed: 0,agg_ann_id,filename,file,text,ann_offsets,label,category,associated_genders
12,0,Coll-1157_00100,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear
22,1,Coll-1310_02300,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear
23,2,Coll-1281_00100,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear
24,3,Coll-1310_02700,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear
25,4,Coll-1310_02900,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear


**STEP 2:** Associate each file to IDs and offsets, for ease of comparison of the annotations' files and offsets to the descriptions' files and offsets to determine which description ID to assign to each annotation.
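
The "implode" operation groups rows by file name and collects each column's values into lists. The sketch below shows the idea with plain dictionaries; it is an illustration under that assumption, not the code of `utils.implodeDataFrame`, and the rows are copied from the first descriptions shown above.

```python
from collections import defaultdict

# (filename, description_id, desc_offsets) rows for the first four descriptions.
rows = [
    ("AA5_00100", 0, (24, 76)),
    ("AA5_00100", 1, (97, 633)),
    ("AA5_00100", 2, (661, 1724)),
    ("AA6_00100", 3, (24, 60)),
]

def implode(rows):
    """Group rows by file name, collecting IDs and offsets into parallel lists."""
    grouped = defaultdict(lambda: {"description_id": [], "desc_offsets": []})
    for filename, description_id, offsets in rows:
        grouped[filename]["description_id"].append(description_id)
        grouped[filename]["desc_offsets"].append(offsets)
    return dict(grouped)

descs_by_file = implode(rows)
print(descs_by_file["AA5_00100"])
```

Keeping the lists parallel means the ID at position `i` always belongs to the offsets at position `i`, which is what the matching loop in the next step depends on.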

In [62]:
subdf_descs = df_descs.drop(columns=["field","file"])
df_descs_imploded = utils.implodeDataFrame(subdf_descs, ["filename"])
df_descs_imploded.head()

filename,description_id,description,desc_offsets
AA5_00100,"[0, 1, 2]",[Papers of The Very Rev Prof James Whyte (1920...,"[(24, 76), (97, 633), (661, 1724)]"
AA6_00100,"[3, 4, 5]","[Papers of Rev Tom Allan (1916-1965), Sermons ...","[(24, 60), (81, 560), (588, 2512)]"
AA7_00100,"[6, 7, 8]",[Papers of Rev Prof Alec Campbell Cheyne (1924...,"[(24, 76), (97, 417), (445, 2441)]"
BAI_00100,"[9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20...","[Papers of Professor John Baillie, and Baillie...","[(24, 84), (92, 115), (123, 143), (151, 210), ..."
BAI_00200,"[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5...","[New Testament (senior), Apologetics (senior),...","[(8, 31), (39, 60), (68, 97), (105, 134), (142..."


In [63]:
descs_dict = df_descs_imploded.to_dict(orient="index")
print(descs_dict["AA5_00100"])

{'description_id': [0, 1, 2], 'description': ['Papers of The Very Rev Prof James Whyte (1920-2005)', 'Sermons and addresses, 1948-1996; lectures, 1949-1982; class notes and lecture notes, 1949-1982; correspondence, 1988-1989 and 1964-1970; newspaper cuttings, 1988-1989 and 1964-1969; publications and articles, 1902-1970; church magazines, 1929-1993; conference papers, 1978; moderatorial papers, 1988-1989; University Christian Consultative Group papers, 1970-1972; Church of Scotland and the Congregational Union of Scotland papers, 1959-1967; personal papers, 1848-1983; photographs 1911 and 1960.See also External Documents (below).', "Professor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent three years as an army Chaplain and then in 1948 was inducted to Dunollie Road Chur

In [64]:
subdf_ann = df_ann.drop(columns=["file","category","associated_genders"])
df_ann_imploded = utils.implodeDataFrame(subdf_ann, ["filename"])
df_ann_imploded.head()

filename,agg_ann_id,text,ann_offsets,label
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","[He, he, his, he, he, His, he, The Very Rev Pr...","[(789, 791), (871, 873), (913, 916), (928, 930...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","[Billy Graham, He, he, he, he, He, his, his, h...","[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Masculine, Gendered-Pronoun, Gendered-Pronoun..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","[Professor Cheyne, son, sister, brother, His, ...","[(2399, 2415), (505, 508), (614, 620), (647, 6...","[Masculine, Gendered-Role, Gendered-Role, Gend..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","[Jacques Chevalier, Lloyd Morgan, Professor Jo...","[(371, 388), (393, 405), (34, 56), (102, 114),...","[Unknown, Unknown, Unknown, Unknown, Unknown, ..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","[Barker, Garvie, Busch, Adolf Jülicher, Johann...","[(215, 221), (226, 232), (250, 255), (285, 299...","[Unknown, Unknown, Unknown, Unknown, Unknown, ..."


In [67]:
anns_dict = df_ann_imploded.to_dict(orient="index")
print(anns_dict["AA5_00100"])

{'agg_ann_id': [14377, 14378, 14379, 14380, 14381, 14382, 14383, 14384, 14385, 14386, 14387, 24275, 26233, 41260, 41261, 41262, 41263, 52952, 52953], 'text': ['He', 'he', 'his', 'he', 'he', 'His', 'he', 'The Very Rev Prof James Whyte', 'Professor James Aitken White', 'James Whyte', 'James Whyte', 'The Very Rev Prof James Whyte', 'Rev Prof James Whyte', 'Scottish Theologian', 'Moderator of the General Assembly of the Church of Scotland', 'army Chaplain', 'chair of practical theology and Christian ethics', 'The Very Rev Prof James Whyte', 'leading Scottish Theologian'], 'ann_offsets': [(789, 791), (871, 873), (913, 916), (928, 930), (1217, 1219), (1241, 1244), (1315, 1317), (34, 63), (661, 689), (1032, 1043), (1350, 1361), (34, 63), (43, 63), (704, 723), (728, 787), (955, 968), (1129, 1177), (34, 63), (696, 723)], 'label': ['Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Gendered-Pronoun', 'Unknown', 'Masculine', 'M

**STEP 3:** File by file, determine which description's offsets each annotation occurs within and associate the corresponding description IDs and annotation IDs.
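
An annotation belongs to a description when its span lies entirely inside the description's span. As a sketch of that test, the hypothetical helper below uses binary search instead of a nested loop; it assumes the description spans within a file are sorted by start offset and do not overlap.

```python
from bisect import bisect_right

def find_description(ann_span, desc_ids, desc_spans):
    """Return the ID of the description whose span contains ann_span, else None.

    Assumes desc_spans is sorted by start offset and non-overlapping.
    """
    ann_start, ann_end = ann_span
    starts = [start for start, _ in desc_spans]
    # Index of the last description that starts at or before the annotation.
    i = bisect_right(starts, ann_start) - 1
    if i >= 0 and ann_end <= desc_spans[i][1]:
        return desc_ids[i]
    return None

# Description spans of AA5_00100, taken from the data above.
ids, spans = [0, 1, 2], [(24, 76), (97, 633), (661, 1724)]
print(find_description((789, 791), ids, spans))    # pronoun inside description 2
print(find_description((34, 63), ids, spans))      # name inside the title
print(find_description((5963, 5973), ids, spans))  # beyond every description
```

An annotation that falls outside every description span gets `None`, which corresponds to the missing `description_id` values examined later in this section.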

In [90]:
annid_to_descid = dict.fromkeys(list(df_ann.agg_ann_id))

In [91]:
files = list(df_ann_imploded.index)
# list.sort() returns None, so compare sorted copies of the file names instead
assert sorted(df_ann_imploded.index) == sorted(df_descs_imploded.index)

In [92]:
# sample = files[:10]
# sample_annids = list(df_ann.loc[df_ann.file.isin(sample)].agg_ann_id)
# sample_annid_to_descid = dict.fromkeys(sample_annids)
for f in files:  #sample
    ann_ids, ann_offsets = anns_dict[f]["agg_ann_id"], anns_dict[f]["ann_offsets"]
    desc_ids, desc_offsets = descs_dict[f]["description_id"], descs_dict[f]["desc_offsets"]
    for i,ann_id in enumerate(ann_ids):
        ann_offset_pair = ann_offsets[i]
        for j,desc_id in enumerate(desc_ids):
            desc_offset_pair = desc_offsets[j]
            # If the annotation offsets are within the description offsets, assign that description ID to that annotation 
            if (ann_offset_pair[0] >= desc_offset_pair[0]) and (ann_offset_pair[0] <= desc_offset_pair[1]):
                if (ann_offset_pair[1] >= desc_offset_pair[0]) and (ann_offset_pair[1] <= desc_offset_pair[1]):
                    annid_to_descid[ann_id] = desc_id  #sample_annid_to_descid[ann_id] = desc_id
# print(sample_annid_to_descid)

In [95]:
# sample_df = pd.DataFrame({"agg_ann_id":list(sample_annid_to_descid.keys()), "description_id":list(sample_annid_to_descid.values())})
# sample_df.head()
df_ids = pd.DataFrame({"agg_ann_id":list(annid_to_descid.keys()), "description_id":list(annid_to_descid.values())})
df_ids.head()

Unnamed: 0,agg_ann_id,description_id
0,0,2112.0
1,1,4163.0
2,2,3320.0
3,3,4297.0
4,4,4349.0


In [97]:
print(df_ids.loc[df_ids.agg_ann_id.isna() == True].shape)
print(df_ids.loc[df_ids.description_id.isna() == True].shape)

(0, 2)
(4659, 2)


In [99]:
anns_without_desc = list(df_ids.loc[df_ids.description_id.isna() == True].agg_ann_id)
df_anns_without_desc = df_ann.loc[df_ann.agg_ann_id.isin(anns_without_desc)]
df_anns_without_desc.head()

Unnamed: 0,agg_ann_id,filename,file,text,ann_offsets,label,category,associated_genders
73,34,Coll-1057_00600,Coll-1057_00600.ann,J.E Wilson,"(5963, 5973)",Unknown,Person-Name,Unclear
74,35,Coll-1057_00600,Coll-1057_00600.ann,Major MacDougall,"(5975, 5991)",Unknown,Person-Name,Unclear
75,36,Coll-1057_00700,Coll-1057_00700.ann,Professor I. Michael Lerner,"(9683, 9710)",Unknown,Person-Name,Unclear
91,51,Coll-1057_00900,Coll-1057_00900.ann,Mrs Campbell,"(5511, 5523)",Feminine,Person-Name,Unclear
116,75,Coll-1057_00800,Coll-1057_00800.ann,Duncan Weatherstone,"(11529, 11548)",Masculine,Person-Name,Multiple


In [106]:
df_desc.loc[df_desc.file == "Coll-1057_00600.txt"]

Unnamed: 0,description_id,description,field,file,start_offset,end_offset,word_count,sent_count
1286,1286,Page mounted with three photographs of the sit...,Title,Coll-1057_00600.txt,7,124,18,1
1287,1287,Page mounted with four photographs,Title,Coll-1057_00600.txt,132,167,5,1
1288,1288,Page mounted with two photographs of the Poult...,Title,Coll-1057_00600.txt,175,278,15,1
1289,1289,Page mounted with programme of the Scottish Po...,Title,Coll-1057_00600.txt,286,419,19,1
1290,1290,Page mounted with five photographs,Title,Coll-1057_00600.txt,427,462,5,1
1291,1291,Page mounted with four items,Title,Coll-1057_00600.txt,470,499,5,1
1292,1292,Page mounted with seven photographs of Alan Gr...,Title,Coll-1057_00600.txt,507,621,16,1
1293,1293,Group photograph of staff and students outside...,Title,Coll-1057_00600.txt,629,709,12,1
1294,1294,Four items found loose among photographs,Title,Coll-1057_00600.txt,717,758,6,1
1295,1295,Page mounted with two photographs of Alan Gree...,Title,Coll-1057_00600.txt,766,818,8,1


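***Parts of the description files still aren't being included. Why?***

For example, the unmatched annotations in `Coll-1057_00600` start at offset 5963, far beyond the end offset (818) of the last description extracted from that file.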
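
One way to investigate is to list the character ranges of a document that no extracted description covers. The helper below is a hypothetical diagnostic sketch; the document length of 6000 is an invented value, and only the first two description spans of `Coll-1057_00600.txt` are used.

```python
def uncovered_ranges(doc_length, desc_spans):
    """Return the (start, end) ranges of a document not covered by any description."""
    gaps, cursor = [], 0
    for start, end in sorted(desc_spans):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Anything after the last description is also uncovered.
    if cursor < doc_length:
        gaps.append((cursor, doc_length))
    return gaps

# First two description spans from the table above; 6000 is an assumed length.
gaps = uncovered_ranges(6000, [(7, 124), (132, 167)])
print(gaps)
```

Short gaps between descriptions are expected, because field names such as `Title:` sit there. A long final gap would suggest that the extraction stopped before the end of the file.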

In [88]:
sample_df = sample_df.set_index("agg_ann_id")
sample_df_joined = sample_df.join(df_ann.set_index("agg_ann_id"), on="agg_ann_id", how="inner")
assert sample_df_joined.loc[sample_df_joined.description_id.isna() == True].shape[0] == 0
sample_df_joined.tail()

agg_ann_id,description_id,filename,file,text,ann_offsets,label,category,associated_genders
44635,194,BAI_00600,BAI_00600.ann,Moderator Designate,"(655, 674)",Omission,Contextual,Unclear
17768,214,BAI_00700,BAI_00700.ann,Elizabeth II,"(246, 258)",Unknown,Person-Name,Unclear
17769,234,BAI_00700,BAI_00700.ann,Jesus Christ,"(959, 971)",Unknown,Person-Name,Unclear
17770,221,BAI_00700,BAI_00700.ann,Woman,"(472, 477)",Gendered-Role,Linguistic,Feminine
51135,223,BAI_00700,BAI_00700.ann,Companion of Honour,"(516, 535)",Stereotype,Contextual,Unclear


Looks good!

***
***
***
# DELETE CODE BELOW (MOVING TO NEW NB)

<a id="3.1"></a>
### 3.1 BIO Tags

**Compare the descriptions' tokens' offsets to the annotated text spans' offsets to determine which tokens to mark as the beginning of an annotation (`B-[LABELNAME]`), inside an annotation (`I-[LABELNAME]`), and unannotated, or outside of an annotation (`O`).**
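
A minimal sketch of this tagging for a single annotation is shown below. The function name is hypothetical, and how to handle overlapping annotations (which this dataset contains) is left open here.

```python
def bio_tags(token_offsets, ann_span, label):
    """Tag each token B-, I-, or O relative to one annotated span."""
    ann_start, ann_end = ann_span
    tags, inside = [], False
    for tok_start, tok_end in token_offsets:
        if ann_start <= tok_start and tok_end <= ann_end:
            # The first token inside the span gets B-, later ones get I-.
            tags.append(("I-" if inside else "B-") + label)
            inside = True
        else:
            tags.append("O")
            inside = False
    return tags

# First five tokens of description 0 and the annotation at (34, 63) from AA5_00100.
offsets = [(24, 30), (31, 33), (34, 37), (38, 42), (43, 46)]
tags = bio_tags(offsets, (34, 63), "Unknown")
print(tags)
```

Here `Papers` and `of` fall outside the span, `The` begins it, and `Very` and `Rev` continue it.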

In [None]:
# TO DO: convert the three dataframes to dictionaries, 
#        for each filename, check whether each token_offset pair is contained within each ann_offset pair and desc_,
#        recording which description (using indices) the annotation appears within

In [12]:
df_tokens = pd.read_csv(config.tokc_path+"descid_token_offsets.csv", index_col=0)
token_desc_ids = list(df_tokens.desc_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.offsets)
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
token_offsets_tuples[:5]  # Looks good

[(29, 36), (37, 39), (40, 43), (44, 57), (58, 65)]

Associate description tokens and annotated text spans' text and offsets to description IDs.

In [31]:
# df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["desc_id"])
df_tokens_imploded = df_tokens_imploded.rename(columns={"offsets":"token_offsets"})
df_tokens_imploded.head()

desc_id,token,token_offsets
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ..."
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141..."


In [35]:
df_tokens_imploded.to_csv(config.tokc_path+"token_data_imploded.csv")

Load the data associating description and annotation IDs to offsets.

In [32]:
df_descs_imploded = pd.read_csv(config.agg_path+"description_data_imploded.csv", index_col=0)
df_descs_imploded.head()

filename,eadid,desc_id,desc_offsets
BAI_01000,['BAI'],[68],"[(1290, 1315)]"
BAI_01300,['BAI'],[143],"[(5853, 5983)]"
BAI_01600,['BAI'],[221],"[(5967, 6202)]"
BAI_01900,['BAI'],[292],"[(5297, 5506)]"
BAI_02200,['BAI'],[361],"[(15180, 15419)]"


In [33]:
df_anns_imploded = pd.read_csv(config.agg_path+"annotation_data_imploded.csv", index_col=0)
df_anns_imploded.head()

filename,agg_ann_id,ann_offsets
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","['(789, 791)', '(871, 873)', '(913, 916)', '(9..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","['(1778, 1790)', '(677, 679)', '(920, 922)', '..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","['(2399, 2415)', '(505, 508)', '(614, 620)', '..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","['(371, 388)', '(393, 405)', '(34, 56)', '(102..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","['(215, 221)', '(226, 232)', '(250, 255)', '(2..."
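
Because these DataFrames are read back from CSV, the list-valued columns are likely to arrive as plain strings rather than Python lists. The following is a minimal sketch of one way to parse such cells with the standard library's `ast.literal_eval`. The helper name `parse_offsets_cell` is hypothetical, and the sample cells are shortened from the rows displayed above.

```python
import ast

def parse_offsets_cell(cell):
    """Parse a CSV cell holding a list of offsets into a list of (start, end) int tuples."""
    items = ast.literal_eval(cell)  # safely evaluates a Python literal; raises on anything else
    pairs = []
    for item in items:
        # Items may themselves be strings such as '(789, 791)'
        if isinstance(item, str):
            item = ast.literal_eval(item)
        start, end = item
        pairs.append((int(start), int(end)))
    return pairs

# Sample cells (assumed to be strings, as they would be after pd.read_csv)
desc_cell = "[(1290, 1315)]"
ann_cell = "['(789, 791)', '(871, 873)', '(913, 916)']"
print(parse_offsets_cell(desc_cell))
print(parse_offsets_cell(ann_cell))
```

A function like this could be applied to a column with `Series.apply`, which would avoid slicing and splitting the strings by hand later on.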


**Step 1: O tags**

Compare the description IDs in `df_tokens_imploded` with those in `df_merged_imploded` to determine which descriptions have no annotations, and assign every token in those descriptions an `O` tag (for *outside* of an annotation).

In [40]:
all_desc_ids = list(df_tokens_imploded.index)
ann_desc_ids = set(df_merged_imploded.index)  # a set gives constant-time membership tests
unannotated = [desc_id for desc_id in all_desc_ids if desc_id not in ann_desc_ids]
print("Rows to assign tag 'O':", len(unannotated))

Rows to assign tag 'O': 86742


In [48]:
o_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(unannotated)]
assert o_df.shape[0] == len(unannotated)

In [50]:
tokens_list = list(o_df.token)
tags = [["O"]*len(tokens) for tokens in tokens_list]
assert len(tags) == len(tokens_list)
o_df.insert(len(o_df.columns), "ann_tag", tags)
o_df.head()

desc_id,token,offsets,ann_tag
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ...","[O, O, O, O, O, O, O]"
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [53]:
assert len(o_df.token[100]) == len(o_df.ann_tag[100])
assert len(o_df.token[488]) == len(o_df.ann_tag[488])
assert len(o_df.token[0]) == len(o_df.ann_tag[0])
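
Step 1 can be illustrated without pandas. The sketch below uses toy data (three descriptions with shortened token lists) and hypothetical helper names `split_by_annotation` and `o_tags`. It shows the ID partition and the one-`O`-per-token assignment, and is not the notebook's actual pipeline.

```python
def split_by_annotation(all_ids, annotated_ids):
    """Partition description IDs into (unannotated, annotated), preserving order."""
    annotated_set = set(annotated_ids)  # set lookup keeps the comparison fast for large inputs
    unannotated = [i for i in all_ids if i not in annotated_set]
    annotated = [i for i in all_ids if i in annotated_set]
    return unannotated, annotated

def o_tags(tokens):
    """Return one 'O' tag per token."""
    return ["O"] * len(tokens)

# Toy data: description ID -> tokens (illustrative only)
toy_tokens = {
    0: ["Records", "of", "the"],
    167: ["Brick", "Burning", ",", "Beardman", "'s"],
    610: ["Letter", ":", ":", "Koestler", ",", "Arthur"],
}
toy_annotated = [167, 610]

unannotated, annotated = split_by_annotation(list(toy_tokens), toy_annotated)
o_tagged = {desc_id: o_tags(toy_tokens[desc_id]) for desc_id in unannotated}
print(o_tagged)
```

Converting the annotated IDs to a set matters at this scale: with tens of thousands of descriptions, testing membership against a list repeats a linear scan for every ID.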

**Step 2: B- and I- tags**

For description IDs that do have annotations (and thus are in `df_merged_imploded`), assign their tokens tags of `B-[LABELNAME]` and `I-[LABELNAME]` for *beginning* and *inside* of an annotation, replacing `[LABELNAME]` with the name of the annotation's label.

In [41]:
annotated = [desc_id for desc_id in all_desc_ids if desc_id in ann_desc_ids]
print("Rows to assign 'B-' or 'I-':", len(annotated))

Rows to assign 'B-' or 'I-': 1855


In [54]:
bi_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(annotated)]
assert bi_df.shape[0] == len(annotated)

In [55]:
bi_df.head()

desc_id,token,offsets
167,"[Brick, Burning, ,, Beardman, 's]","[(1421, 1426), (1427, 1434), (1434, 1435), (14..."
508,"[Interpreting, sequence, motifs, [, Letter, to...","[(3064, 3076), (3077, 3085), (3086, 3092), (30..."
610,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
611,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
640,"[Lady, Luck, :, the, theory, of, probability, ...","[(2118, 2122), (2123, 2127), (2127, 2128), (21..."


In [57]:
bi_dict = bi_df.to_dict('index')
print(bi_dict[167])

{'token': ['Brick', 'Burning', ',', 'Beardman', "'s"], 'offsets': ['(1421, 1426)', '(1427, 1434)', '(1434, 1435)', '(1436, 1444)', '(1444, 1446)']}


In [59]:
ann_dict = df_merged_imploded.to_dict('index')
print(ann_dict[167])

{'offsets_ann': ['(1436, 1444)', '(1436, 1444)'], 'text_ann': ['Beardman', 'Beardman'], 'label': ['Omission', 'Unknown'], 'id': [31928, 31929]}


In [78]:
# Turn a string of offsets into a tuple with each offset of type int
# "(1436, 1444)" --> (1436, 1444)
def offsetsStrToTuple(offsets_str):
    # Strip the surrounding parentheses, then split on the ", " separator
    offsets_list = offsets_str[1:-1].split(", ")
    return tuple(int(o) for o in offsets_list)

assert offsetsStrToTuple('(1436, 1444)') == (1436, 1444)
assert isinstance(offsetsStrToTuple('(1436, 1444)'), tuple)
assert all(isinstance(o, int) for o in offsetsStrToTuple('(1436, 1444)'))

In [101]:
desc_ids = list(bi_dict.keys())[:100]  # START WITH SAMPLE
assert len(set(desc_ids)) == len(desc_ids)  # Make sure every description ID is unique
log = 0
descid_to_tag = dict.fromkeys(desc_ids)
for desc_id in desc_ids:
    text_spans = ann_dict[desc_id]["text_ann"]
    span_labels = ann_dict[desc_id]["label"]
    # Parse every annotation's offsets once per description
    span_offset_pairs = [offsetsStrToTuple(o) for o in ann_dict[desc_id]["offsets_ann"]]
    desc_tokens = bi_dict[desc_id]['token']
    desc_tokens_offsets = bi_dict[desc_id]['offsets']
    desc_tags = []
    for i, desc_token in enumerate(desc_tokens):
        token_offset_pair = offsetsStrToTuple(desc_tokens_offsets[i])
        tags = []  # Note: one token may have multiple tags
        
        for j, text_span in enumerate(text_spans):
            # Compare against the offsets of this span (index j), not the last span parsed
            span_offset_pair = span_offset_pairs[j]
            # Be sure a matching token's offsets are within the annotated text span
            if (desc_token in text_span
                    and token_offset_pair[0] >= span_offset_pair[0]
                    and token_offset_pair[1] <= span_offset_pair[1]):
                # If the start offsets are the same, assign a 'B-' tag
                if token_offset_pair[0] == span_offset_pair[0]:
                    tags += ['B-' + span_labels[j]]
                # Otherwise, assign an 'I-' tag
                else:
                    tags += ['I-' + span_labels[j]]
        
        # If no annotation covers the token, assign it an O tag
        if not tags:
            tags = ["O"]
        desc_tags += [set(tags)]
    
    assert len(desc_tokens) == len(desc_tags)
    descid_to_tag[desc_id] = desc_tags
    
    log += 1
    if log % 100 == 0:
        print("Assigned tags for {} descriptions".format(log))

Assigned tags for 100 descriptions
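
To make the intended tagging rule concrete, here is a small self-contained sketch applied to description 167 from the output above. The function name `bio_tags` and its input shapes are assumptions made for illustration. Unlike the cell above, it matches tokens to annotations by offsets alone, without also checking the token text.

```python
def bio_tags(token_offsets, annotations):
    """Assign a set of BIO tags to each token.

    token_offsets: list of (start, end) tuples, one per token
    annotations:   list of ((start, end), label) pairs
    """
    all_tags = []
    for tok_start, tok_end in token_offsets:
        tags = set()
        for (ann_start, ann_end), label in annotations:
            # The token must lie entirely within the annotated span
            if ann_start <= tok_start and tok_end <= ann_end:
                prefix = "B-" if tok_start == ann_start else "I-"
                tags.add(prefix + label)
        # A token covered by no annotation is outside: tag it 'O'
        all_tags.append(tags or {"O"})
    return all_tags

# Description 167, with offsets already converted to int tuples
tokens = ['Brick', 'Burning', ',', 'Beardman', "'s"]
token_offsets = [(1421, 1426), (1427, 1434), (1434, 1435), (1436, 1444), (1444, 1446)]
annotations = [((1436, 1444), 'Omission'), ((1436, 1444), 'Unknown')]

for token, tags in zip(tokens, bio_tags(token_offsets, annotations)):
    print(token, sorted(tags))
```

Under these assumptions only `Beardman` receives `B-Omission` and `B-Unknown`. The possessive `'s` ends at offset 1446, beyond the annotated span, so it is tagged `O`.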


In [109]:
did = 610 #508 #167
# print(ann_dict[did])
print(bi_dict[did])
# print(descid_to_tag[did])

# spans = ['Beardman', 'Beardman']
# spans2 = ["Brick Burning"]
# tokens = ['Brick', 'Burning', ',', 'Beardman', "'s"]
# # print(spans.count('Beardman'))
# # # print(spans.index('Beardman'))
# # # print(tokens.index('Beardman'))
# # for k in range(0,3):
# #     print(k)
# indeces = [index for index in range(len(spans)) if spans[index] == 'Beardman']
# print(indeces)

{'token': ['Letter', ':', ':', 'Koestler', ',', 'Arthur'], 'offsets': ['(127, 133)', '(134, 135)', '(135, 136)', '(137, 145)', '(145, 146)', '(147, 153)']}


In [4]:
# is_annotated_col = []
# annotated_id = []
# i, maxI = 0, len(token_desc_ids)  #1188478, 1189478
# while i < maxI:
#     desc_id = token_desc_ids[i]
#     token = tokens[i]
#     token_start, token_end = token_offsets_tuples[i][0], token_offsets_tuples[i][1] 
    
#     ann_df = df_merged.loc[df_merged.desc_id == desc_id]
#     ann_id_list = list(ann_df.id)
#     ann_offsets_list = list(ann_df.offsets_ann)
#     ann_offsets_clean = [ann_offsets[1:-1].split(", ") for ann_offsets in ann_offsets_list]
#     ann_offsets_tuples = [tuple((int(ann_offsets[0]), int(ann_offsets[1]))) for ann_offsets in ann_offsets_clean]
    
#     for j,ann_offsets in enumerate(ann_offsets_tuples):
#         ann_start = ann_offsets[0]
#         ann_end = ann_offsets[1]
#         if token_start == ann_start:
#             is_annotated_col += ["B"]
#             annotated_id += [ann_id_list[j]]
#         elif (token_start > ann_start) and (token_start <= ann_end):
#             is_annotated_col += ["I"]
#             annotated_id += [ann_id_list[j]]
#         else:
#             is_annotated_col += ["O"]
#             annotated_id += ["None"]
    
#     i += 1

# assert len(is_annotated_col) == len(token_desc_ids)
# assert len(is_annotated_col) == len(annotated_id)

KeyboardInterrupt: 

In [None]:
df_tokens.insert(len(df_tokens.columns),"is_annotated",is_annotated_col)
df_tokens.insert(len(df_tokens.columns),"ann_id",annotated_id)
df_tokens.head()

In [5]:
print(len(is_annotated_col))

48022031


In [8]:
print(len(annotated_id))

48022031
