# Analysis: Token BIO Tags

## Post Annotation and Aggregation

Determine which description each annotated text span occurs in, then determine which tokens fall within an annotated text span.

***

**Table of Contents**

[0](#0). Load Libraries

[1](#1). Load and Transform Data

[2](#2). Associate Annotations to a Description

[3](#3). Assign BIO Tags

***

### 0. Load Libraries

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

from intervaltree import Interval, IntervalTree

### 1. Load and Transform Data

**Load description and annotation data and transform the datasets to more easily associate description IDs to annotation IDs.**

In [7]:
# Load the token data: one row per token, with its description ID and offsets
df_tokens = pd.read_csv(config.tokc_path+"descid_token_offsets.csv")
df_tokens = df_tokens.drop(columns=["Unnamed: 0"])
token_desc_ids = list(df_tokens.desc_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.offsets)
# Offsets are read in as strings; strip the surrounding parentheses and split on ", "
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
# Convert each pair of strings to a tuple of ints
token_offsets_tuples = [(int(offsets[0]), int(offsets[1])) for offsets in token_offsets_clean]
token_offsets_tuples[:5]  # Looks good

[(29, 36), (37, 39), (40, 43), (44, 57), (58, 65)]
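
The string slicing above assumes every offsets value is formatted exactly as `(start, end)`, with a comma followed by a single space. As an alternative sketch (not the method this notebook uses), `ast.literal_eval` from the standard library parses such strings regardless of spacing. The helper name `parse_offsets` and the sample strings are illustrative assumptions.

```python
import ast

def parse_offsets(offsets_str):
    """Parse a string such as "(29, 36)" into a tuple of two ints."""
    start, end = ast.literal_eval(offsets_str.strip())
    return (int(start), int(end))

# Illustrative strings; the CSV values are assumed to look like the first one
sample = ["(29, 36)", "(37,39)", " (40, 43) "]
parsed = [parse_offsets(s) for s in sample]
print(parsed)  # [(29, 36), (37, 39), (40, 43)]
```

Unlike `eval`, `literal_eval` only accepts Python literals, so a malformed cell raises an error instead of executing arbitrary code.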

Associate description tokens and annotated text spans' text and offsets to description IDs.

In [8]:
df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["desc_id"])
df_tokens_imploded = df_tokens_imploded.rename(columns={"offsets":"token_offsets"})
df_tokens_imploded.head()

Unnamed: 0_level_0,token,token_offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ..."
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141..."


In [9]:
df_tokens_imploded.to_csv(config.tokc_path+"token_data_imploded.csv")

Load the data associating description and annotation IDs to offsets.

In [10]:
df_descs = pd.read_csv(config.crc_meta_path+"descs_with_offsets.csv", index_col=0)

# Remove file extensions
desc_filenames = list(df_descs.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_descs.insert(1, "filename", desc_filenames)

# Make sure offsets are in one column as tuples of ints
start_offsets = list(df_descs.desc_start_offset)
end_offsets = list(df_descs.desc_end_offset)
offsets_strs = list(zip(start_offsets, end_offsets))
offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_descs = df_descs.drop(columns=["desc_start_offset", "desc_end_offset"])
df_descs.insert(3, "desc_offsets", offsets_int_tuples)

df_descs.head()

Unnamed: 0,desc_id,filename,eadid,desc_offsets,field,file,description
0,0,Coll-227_00100,Coll-227,"(29, 79)",Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh
1,1,Coll-227_00100,Coll-227,"(100, 610)",Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...
2,2,Coll-227_00100,Coll-227,"(638, 2277)",Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...
3,3,La_03600,La,"(7, 117)",Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ..."
4,4,La_03600,La,"(125, 223)",Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [..."


In [11]:
df_descs_imploded = utils.implodeDataFrame(df_descs, ["filename"])
df_descs_imploded.head()

Unnamed: 0_level_0,desc_id,eadid,desc_offsets,field,file,description
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AA4_00100,"[19714, 19715, 19716]","[AA4, AA4, AA4]","[(24, 69), (90, 505), (533, 985)]","[Title, Scope and Contents, Biographical / His...","[AA4_00100.txt, AA4_00100.txt, AA4_00100.txt]","[Papers of Rev Prof John McIntyre (1916-2005),..."
AA5_00100,"[39870, 39871, 39872]","[AA5, AA5, AA5]","[(24, 76), (97, 633), (661, 1350)]","[Title, Scope and Contents, Biographical / His...","[AA5_00100.txt, AA5_00100.txt, AA5_00100.txt]",[Papers of The Very Rev Prof James Whyte (1920...
AA6_00100,"[65922, 65923, 65924]","[AA6, AA6, AA6]","[(24, 60), (81, 523), (588, 1031)]","[Title, Scope and Contents, Biographical / His...","[AA6_00100.txt, AA6_00100.txt, AA6_00100.txt]","[Papers of Rev Tom Allan (1916-1965), Sermons ..."
AA7_00100,"[86719, 86720, 86721]","[AA7, AA7, AA7]","[(24, 76), (97, 417), (445, 934)]","[Title, Scope and Contents, Biographical / His...","[AA7_00100.txt, AA7_00100.txt, AA7_00100.txt]",[Papers of Rev Prof Alec Campbell Cheyne (1924...
BAI_00100,"[278, 279, 280, 281, 282, 283, 284, 285, 286, ...","[BAI, BAI, BAI, BAI, BAI, BAI, BAI, BAI, BAI, ...","[(24, 84), (92, 115), (123, 143), (151, 210), ...","[Title, Title, Title, Title, Title, Title, Tit...","[BAI_00100.txt, BAI_00100.txt, BAI_00100.txt, ...","[Papers of Professor John Baillie, and Baillie..."


In [12]:
descs_dict = (df_descs_imploded[["desc_id", "desc_offsets"]]).to_dict(orient="index")
# print(descs_dict) # Looks good

In [13]:
df_anns = pd.read_csv(config.agg_path+"aggregated_final.csv", index_col=0)

# Remove file extensions
desc_filenames = list(df_anns.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_anns.insert(1, "filename", desc_filenames)

# Make sure offsets are in one column as tuples of ints
offsets_strs = list(df_anns.offsets)
offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_anns = df_anns.drop(columns=["offsets"])
df_anns.insert(3, "ann_offsets", offsets_int_tuples)

df_anns.head()

Unnamed: 0,agg_ann_id,filename,file,ann_offsets,text,label,category,associated_genders
12,0,Coll-1157_00100,Coll-1157_00100.ann,"(1407, 1415)",knighted,Gendered-Role,Linguistic,Unclear
22,1,Coll-1310_02300,Coll-1310_02300.ann,"(9625, 9635)",knighthood,Gendered-Role,Linguistic,Unclear
23,2,Coll-1281_00100,Coll-1281_00100.ann,"(2426, 2439)",Prince Regent,Gendered-Role,Linguistic,Unclear
24,3,Coll-1310_02700,Coll-1310_02700.ann,"(9993, 10003)",knighthood,Gendered-Role,Linguistic,Unclear
25,4,Coll-1310_02900,Coll-1310_02900.ann,"(7192, 7195)",Sir,Gendered-Role,Linguistic,Unclear


In [14]:
df_anns_imploded = utils.implodeDataFrame(df_anns, ["filename"])
df_anns_imploded.head()

Unnamed: 0_level_0,agg_ann_id,file,ann_offsets,text,label,category,associated_genders
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","[AA5_00100.ann, AA5_00100.ann, AA5_00100.ann, ...","[(789, 791), (871, 873), (913, 916), (928, 930...","[He, he, his, he, he, His, he, The Very Rev Pr...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-...","[Linguistic, Linguistic, Linguistic, Linguisti...","[Masculine, Masculine, Masculine, Masculine, M..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","[AA6_00100.ann, AA6_00100.ann, AA6_00100.ann, ...","[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Billy Graham, He, he, he, he, He, his, his, h...","[Masculine, Gendered-Pronoun, Gendered-Pronoun...","[Person-Name, Linguistic, Linguistic, Linguist...","[Unclear, Masculine, Masculine, Masculine, Mas..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","[AA7_00100.ann, AA7_00100.ann, AA7_00100.ann, ...","[(2399, 2415), (505, 508), (614, 620), (647, 6...","[Professor Cheyne, son, sister, brother, His, ...","[Masculine, Gendered-Role, Gendered-Role, Gend...","[Person-Name, Linguistic, Linguistic, Linguist...","[Masculine, Unclear, Unclear, Multiple, Mascul..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","[BAI_00100.ann, BAI_00100.ann, BAI_00100.ann, ...","[(371, 388), (393, 405), (34, 56), (102, 114),...","[Jacques Chevalier, Lloyd Morgan, Professor Jo...","[Unknown, Unknown, Unknown, Unknown, Unknown, ...","[Person-Name, Person-Name, Person-Name, Person...","[Masculine, Unclear, Unclear, Unclear, Unclear..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","[BAI_00200.ann, BAI_00200.ann, BAI_00200.ann, ...","[(215, 221), (226, 232), (250, 255), (285, 299...","[Barker, Garvie, Busch, Adolf Jülicher, Johann...","[Unknown, Unknown, Unknown, Unknown, Unknown, ...","[Person-Name, Person-Name, Person-Name, Person...","[Unclear, Unclear, Unclear, Multiple, Multiple..."


In [15]:
anns_dict = (df_anns_imploded[["agg_ann_id", "ann_offsets"]]).to_dict(orient="index")
# print(anns_dict) # Looks good

### 2. Associate Annotations to a Description

**Using offsets, determine which annotation ID to match to which description ID.**

In [16]:
annid_to_descid = dict()  # maps each agg_ann_id to the desc_id of the description containing it
files = list(anns_dict.keys())
for f in files:
    # Get the IDs and offsets of annotations in that file
    f_ann_ids = anns_dict[f]["agg_ann_id"]
    f_ann_offsets = anns_dict[f]["ann_offsets"]
    
    # Get the IDs and offsets of descriptions in that file
    f_desc_ids = descs_dict[f]["desc_id"]
    f_desc_offsets = descs_dict[f]["desc_offsets"]
    
    # For every annotation, find the description it appears within
    for i in range(len(f_ann_ids)):
        start_ann_offset, end_ann_offset = f_ann_offsets[i][0], f_ann_offsets[i][1]
        for j in range(len(f_desc_ids)):
            start_desc_offset, end_desc_offset = f_desc_offsets[j][0], f_desc_offsets[j][1]
            # Keep the pair only if both annotation offsets fall inside the description span
            if (start_desc_offset <= start_ann_offset <= end_desc_offset
                    and start_desc_offset <= end_ann_offset <= end_desc_offset):
                ann_id, desc_id = f_ann_ids[i], f_desc_ids[j]
                annid_to_descid[ann_id] = desc_id
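
The nested loop above compares every annotation with every description in a file. A possible alternative, assuming the description spans within a file do not overlap and each annotation's start offset is not greater than its end offset, is to sort the spans by start offset once and binary-search for the single candidate span. This is a sketch rather than the notebook's implementation: `build_desc_index` and `find_containing_desc` are hypothetical helpers, and the sample spans are copied from the `Coll-227_00100` rows shown earlier.

```python
from bisect import bisect_right

def build_desc_index(desc_ids, desc_offsets):
    """Sort description spans by start offset and return (starts, spans)."""
    spans = sorted(zip(desc_offsets, desc_ids))
    starts = [span[0] for span, _ in spans]
    return starts, spans

def find_containing_desc(index, ann_offsets):
    """Return the ID of the description whose span contains the annotation, or None."""
    starts, spans = index
    ann_start, ann_end = ann_offsets
    # Position of the last span that starts at or before the annotation start
    i = bisect_right(starts, ann_start) - 1
    if i < 0:
        return None
    (desc_start, desc_end), desc_id = spans[i]
    # Inclusive containment, as in the nested loop
    if desc_start <= ann_start and ann_end <= desc_end:
        return desc_id
    return None

index = build_desc_index([0, 1, 2], [(29, 79), (100, 610), (638, 2277)])
print(find_containing_desc(index, (104, 111)))  # 1
print(find_containing_desc(index, (80, 90)))    # None
```

Returning `None` for an uncontained annotation makes the unmatched cases explicit instead of leaving them out of the mapping.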

In [17]:
ann_to_desc_df = pd.DataFrame({"agg_ann_id":list(annid_to_descid.keys()), "desc_id":list(annid_to_descid.values())})
ann_to_desc_df.head()

Unnamed: 0,agg_ann_id,desc_id
0,14377,39872
1,14378,39872
2,14379,39872
3,14380,39872
4,14381,39872


In [18]:
df_anns = df_anns.set_index("agg_ann_id")
df_anns_with_desc = df_anns.join(ann_to_desc_df.set_index("agg_ann_id"), on="agg_ann_id", how="outer")
df_anns_with_desc.head()

Unnamed: 0_level_0,filename,file,ann_offsets,text,label,category,associated_genders,desc_id
agg_ann_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,Coll-1157_00100,Coll-1157_00100.ann,"(1407, 1415)",knighted,Gendered-Role,Linguistic,Unclear,49896.0
1,Coll-1310_02300,Coll-1310_02300.ann,"(9625, 9635)",knighthood,Gendered-Role,Linguistic,Unclear,54748.0
2,Coll-1281_00100,Coll-1281_00100.ann,"(2426, 2439)",Prince Regent,Gendered-Role,Linguistic,Unclear,71394.0
3,Coll-1310_02700,Coll-1310_02700.ann,"(9993, 10003)",knighthood,Gendered-Role,Linguistic,Unclear,58054.0
4,Coll-1310_02900,Coll-1310_02900.ann,"(7192, 7195)",Sir,Gendered-Role,Linguistic,Unclear,25014.0


In [19]:
df_anns_with_desc.loc[df_anns_with_desc.desc_id.isnull()]

Unnamed: 0_level_0,filename,file,ann_offsets,text,label,category,associated_genders,desc_id
agg_ann_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
11,Coll-1036_00400,Coll-1036_00400.ann,"(86191, 86196)",Women,Gendered-Role,Linguistic,Feminine,
13,Coll-1036_00400,Coll-1036_00400.ann,"(37968, 37973)",Woman,Gendered-Role,Linguistic,Feminine,
19,Coll-1036_00400,Coll-1036_00400.ann,"(104833, 104840)",McNeill,Unknown,Person-Name,Unclear,
20,Coll-1036_00400,Coll-1036_00400.ann,"(45180, 45187)",Kennedy,Unknown,Person-Name,Unclear,
21,Coll-1036_00400,Coll-1036_00400.ann,"(49859, 49878)","Hood, Helen Patuffa",Unknown,Person-Name,Masculine,
...,...,...,...,...,...,...,...,...
55235,Coll-1234_00100,Coll-1234_00100.ann,"(1260, 1264)",wife,Generalization,Linguistic,Feminine,
55236,Coll-1234_00100,Coll-1234_00100.ann,"(1350, 1356)",father,Generalization,Linguistic,Multiple,
55237,Coll-1234_00100,Coll-1234_00100.ann,"(1384, 1387)",Mrs,Generalization,Linguistic,Unclear,
55252,Coll-1434_18400,Coll-1434_18400.ann,"(64, 67)",men,Generalization,Linguistic,Masculine,


Why do the annotations in the DataFrame above have no corresponding description?

In [20]:
df_descs.loc[df_descs["filename"] == "Coll-1036_00400"]

Unnamed: 0,desc_id,filename,eadid,desc_offsets,field,file,description
68307,68307,Coll-1036_00400,Coll-1036,"(137, 1264)",Scope and Contents,Coll-1036_00400.txt,Miscellaneous music.Several Marjory Kennedy-Fr...
68308,68308,Coll-1036_00400,Coll-1036,"(1285, 1361)",Scope and Contents,Coll-1036_00400.txt,"Miscellaneous items, Part 1, 2, 3.'Burns as a..."
68309,68309,Coll-1036_00400,Coll-1036,"(1584, 1837)",Scope and Contents,Coll-1036_00400.txt,"'Loose Leaf M.S.S. [manuscripts] of ""Book""'. S..."
68310,68310,Coll-1036_00400,Coll-1036,"(1858, 2033)",Scope and Contents,Coll-1036_00400.txt,"'News Cuttings', note book. Softbound, charcoa..."
68311,68311,Coll-1036_00400,Coll-1036,"(2054, 5254)",Scope and Contents,Coll-1036_00400.txt,"Various music collections, Part 1 2. Part 1: ..."
68312,68312,Coll-1036_00400,Coll-1036,"(5275, 5552)",Scope and Contents,Coll-1036_00400.txt,"'Proofs M.S.S.[manuscripts], More Songs of [t..."
68313,68313,Coll-1036_00400,Coll-1036,"(5573, 5941)",Scope and Contents,Coll-1036_00400.txt,"'Kennedy-Fraser MSS. [manuscripts], D. 18377 [..."
68314,68314,Coll-1036_00400,Coll-1036,"(6004, 6022)",Scope and Contents,Coll-1036_00400.txt,'Tolmie Gesto'.
68315,68315,Coll-1036_00400,Coll-1036,"(6842, 6902)",Scope and Contents,Coll-1036_00400.txt,Proofs of A Life of Song by Marjory Kennedy-Fr...
68316,68316,Coll-1036_00400,Coll-1036,"(6923, 6939)",Scope and Contents,Coll-1036_00400.txt,Breton songs.


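It looks like the description offsets need to be regenerated, because some descriptions are missing from the `df_descs` DataFrame (`descs_with_offsets.csv`). For example, the descriptions listed for `Coll-1036_00400` end at offset 6939, while unmatched annotations in that file have offsets such as (37968, 37973) and (86191, 86196).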
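
One way to check this, sketched below in plain Python, is to compare a file's unmatched annotation offsets with the furthest end offset covered by its descriptions. The function name `uncovered_annotations` is hypothetical; the sample values are copied from the `Coll-1036_00400` rows shown above.

```python
def uncovered_annotations(desc_offsets, ann_offsets):
    """Return annotation spans that start beyond the last description's end offset."""
    max_desc_end = max(end for _, end in desc_offsets)
    return [span for span in ann_offsets if span[0] > max_desc_end]

# Description spans listed for Coll-1036_00400 in df_descs
desc_offsets = [(137, 1264), (1285, 1361), (1584, 1837), (1858, 2033), (2054, 5254),
                (5275, 5552), (5573, 5941), (6004, 6022), (6842, 6902), (6923, 6939)]
# Some of the unmatched annotation spans from the same file
ann_offsets = [(86191, 86196), (37968, 37973), (104833, 104840),
               (45180, 45187), (49859, 49878)]

missing = uncovered_annotations(desc_offsets, ann_offsets)
print(len(missing), "of", len(ann_offsets), "annotations lie past offset",
      max(end for _, end in desc_offsets))
# 5 of 5 annotations lie past offset 6939
```

If most unmatched annotations in a file lie past its last description, that would suggest the description data for the file is incomplete rather than misaligned.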

### 3. Assign BIO Tags

**Compare the offsets of each description's tokens to the offsets of the annotated text spans to determine which tokens mark the beginning of an annotation (`B-[LABELNAME]`), which are inside an annotation (`I-[LABELNAME]`), and which are unannotated, that is, outside of any annotation (`O`).**
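
A minimal sketch of this comparison follows. It assumes each token and each annotation is represented by a `(start, end)` offset tuple, and that a token belongs to an annotation when its span lies entirely inside the annotation's span. The function `assign_bio_tags` and the sample sentence are illustrative assumptions, not the notebook's implementation. Overlapping annotations, which would give one token several tags, are not handled: the first matching annotation wins.

```python
def assign_bio_tags(token_offsets, annotations):
    """Return one BIO tag per token.

    token_offsets: list of (start, end) tuples in document order
    annotations: list of ((start, end), label) pairs
    """
    tags = []
    for tok_start, tok_end in token_offsets:
        tag = "O"  # default: outside any annotation
        for (ann_start, ann_end), label in annotations:
            if ann_start <= tok_start and tok_end <= ann_end:
                # A token starting exactly where the annotation starts begins it
                prefix = "B" if tok_start == ann_start else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return tags

# Hypothetical description text: "the Prince Regent was knighted"
tokens = ["the", "Prince", "Regent", "was", "knighted"]
token_offsets = [(0, 3), (4, 10), (11, 17), (18, 21), (22, 30)]
annotations = [((4, 17), "Gendered-Role"), ((22, 30), "Gendered-Role")]

for token, tag in zip(tokens, assign_bio_tags(token_offsets, annotations)):
    print(token, tag)
```

With this input, "Prince" receives `B-Gendered-Role`, "Regent" receives `I-Gendered-Role`, "knighted" receives `B-Gendered-Role`, and the remaining tokens receive `O`.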