# Analysis: Token BIO Tags

## Post Annotation and Aggregation

Determine which description each annotated text span occurs in, then determine which tokens fall within an annotated text span so that every token can be assigned a BIO tag.

***

**Table of Contents**

[0](#0). Load libraries

[1](#1). Load and Transform Data

[2](#2). Assign BIO Tags

***

<a id="0"></a>
### 0. Load libraries

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys

<a id="1"></a>
### 1. Load and Transform Data

**Load description and annotation data and transform the datasets to more easily associate description IDs to annotation IDs.**

In [67]:
df_tokens = pd.read_csv(config.crc_meta_path+"descid_token_offsets.csv")
df_tokens = df_tokens.rename(columns={"Unnamed: 0":"token_id","desc_id":"description_id"})
df_tokens.head()

Unnamed: 0,token_id,description_id,token,offsets
0,0,0,Identifier,"(0, 10)"
1,1,0,:,"(10, 11)"
2,2,0,AA5,"(12, 15)"
3,3,1,Title,"(17, 22)"
4,4,1,:,"(22, 23)"


Transform the offsets column's string values to tuples of ints.

In [68]:
token_desc_ids = list(df_tokens.description_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.offsets)
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
df_tokens = df_tokens.drop(columns=["offsets"])
df_tokens.insert(len(df_tokens.columns), "offsets", token_offsets_tuples)
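
As an alternative to slicing and splitting the strings by hand, the standard library's `ast.literal_eval` can parse each offsets string directly. The following is a minimal sketch on made-up sample strings; the variable names are illustrative and not part of the notebook.

```python
from ast import literal_eval

# Sample values in the same "(start, end)" string format as the offsets column
offset_strs = ["(0, 10)", "(10, 11)", "(12, 15)"]

# literal_eval safely parses Python literals, so each string becomes a tuple of ints
offset_tuples = [literal_eval(s) for s in offset_strs]
print(offset_tuples)
```

This avoids assumptions about the exact separator (`", "`) but expects every value to be a well-formed tuple literal.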

Group the tokens and their offsets by description ID, so that each description has a single row listing all of its tokens.

In [69]:
df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["description_id"])
df_tokens_imploded = df_tokens_imploded.rename(columns={"offsets":"token_offsets"})
df_tokens_imploded.head()

description_id,token_id,token,token_offsets
0,"[0, 1, 2]","[Identifier, :, AA5]","[(0, 10), (10, 11), (12, 15)]"
1,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ..."
2,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[(77, 82), (83, 86), (87, 95), (95, 96), (97, ..."
3,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[(634, 646), (647, 648), (649, 659), (659, 660..."
4,"[308, 309, 310]","[Identifier, :, AA6]","[(0, 10), (10, 11), (12, 15)]"
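
`utils.implodeDataFrame` is a custom helper whose source is not shown here. Judging only from the output above, it appears to collect each group's values into lists. A rough sketch of that behavior, with a hypothetical `implode` function standing in for the real helper, might look like this:

```python
import pandas as pd

def implode(df, group_cols):
    """Hypothetical stand-in for utils.implodeDataFrame: gather each
    group's values into one list per remaining column."""
    return df.groupby(group_cols).agg(list)

# Toy data modeled on the first rows of the token table
toy = pd.DataFrame({
    "token_id": [0, 1, 2, 3, 4],
    "description_id": [0, 0, 0, 1, 1],
    "token": ["Identifier", ":", "AA5", "Title", ":"],
})
imploded = implode(toy, ["description_id"])
print(imploded)
```

The grouping column becomes the index, which matches the `description_id` index in the table above.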


Load the description data.

In [70]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
# Remove columns not needed for linking
df_descs = df_descs.drop(columns=["description", "word_count", "sent_count"])
# Ignore rows for Identifier fields (the text of this field wasn't annotated)
df_descs = df_descs.loc[df_descs.field != "Identifier"]

# Remove file extensions
desc_filenames = list(df_descs.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_descs.insert(1, "filename", desc_filenames)

# Make sure offsets are in one column as tuples of ints
start_offsets = list(df_descs.start_offset)
end_offsets = list(df_descs.end_offset)
offsets_strs = list(zip(start_offsets, end_offsets))
offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_descs = df_descs.drop(columns=["start_offset", "end_offset"])
df_descs.insert(3, "desc_offsets", offsets_int_tuples)

df_descs.head()

Unnamed: 0,description_id,filename,file,desc_offsets,field,clean_desc
1,1,AA5_00100,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,AA5_00100,AA5_00100.txt,"(77, 633)",Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,AA5_00100,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...
5,5,AA6_00100,AA6_00100.txt,"(17, 60)",Title,Papers of Rev Tom Allan (1916-1965)
6,6,AA6_00100,AA6_00100.txt,"(61, 560)",Scope and Contents,"Sermons and addresses, 1947-1963; essays and l..."


In [71]:
print(df_descs.shape)
print(df_tokens_imploded.shape)

(27570, 6)
(27908, 3)


Associate the imploded token data to the description data (using the `description_id` columns).

In [72]:
df_descs = df_descs.set_index("description_id")
descs_to_tokens = df_descs.join(df_tokens_imploded, on="description_id", how="left")
print(descs_to_tokens.shape)
descs_to_tokens = descs_to_tokens.rename(columns={"file":"desc_file"})
descs_to_tokens.head()

(27570, 8)


description_id,filename,desc_file,desc_offsets,field,clean_desc,token_id,token,token_offsets
1,AA5_00100,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ..."
2,AA5_00100,AA5_00100.txt,"(77, 633)",Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...","[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[(77, 82), (83, 86), (87, 95), (95, 96), (97, ..."
3,AA5_00100,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[(634, 646), (647, 648), (649, 659), (659, 660..."
5,AA6_00100,AA6_00100.txt,"(17, 60)",Title,Papers of Rev Tom Allan (1916-1965),"[311, 312, 313, 314, 315, 316, 317, 318, 319, ...","[Title, :, Papers, of, Rev, Tom, Allan, (, 191...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ..."
6,AA6_00100,AA6_00100.txt,"(61, 560)",Scope and Contents,"Sermons and addresses, 1947-1963; essays and l...","[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[Scope, and, Contents, :, Sermons, and, addres...","[(61, 66), (67, 70), (71, 79), (79, 80), (81, ..."


Load the annotation data and associate it to the description-token data joined above.

In [73]:
df_anns = pd.read_csv(config.agg_path+"aggregated_final.csv")
# Remove unnecessary columns
df_anns = df_anns.drop(columns=["category", "associated_genders"])
# Rename column to specify annotation association
df_anns = df_anns.rename(columns={"file":"ann_file"})

# Remove file extensions
ann_filenames = list(df_anns.ann_file)
ann_filenames = [f[:-4] for f in ann_filenames]
df_anns.insert(1, "filename", ann_filenames)

# Make sure offsets are in one column as tuples of ints
offsets_strs = list(df_anns.ann_offsets)
offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_anns = df_anns.drop(columns=["ann_offsets"])
df_anns.insert(3, "ann_offsets", offsets_int_tuples)

df_anns.head()

Unnamed: 0,agg_ann_id,filename,ann_file,ann_offsets,text,label,description_id
0,0,Coll-1157_00100,Coll-1157_00100.ann,"(1407, 1415)",knighted,Gendered-Role,2364
1,1,Coll-1310_02300,Coll-1310_02300.ann,"(9625, 9635)",knighthood,Gendered-Role,4542
2,2,Coll-1281_00100,Coll-1281_00100.ann,"(2426, 2439)",Prince Regent,Gendered-Role,3660
3,3,Coll-1310_02700,Coll-1310_02700.ann,"(9993, 10003)",knighthood,Gendered-Role,4678
4,4,Coll-1310_02900,Coll-1310_02900.ann,"(7192, 7195)",Sir,Gendered-Role,4732


In [74]:
# Group the annotations by description ID
df_anns = df_anns.drop(columns=["filename"])
df_anns_imploded = utils.implodeDataFrame(df_anns, ["description_id"])
# Each description should come from a single annotation file, so collapse each file list to one value
ann_file_col = df_anns_imploded.ann_file
new_col = []
for file_list in ann_file_col:
    assert len(set(file_list)) == 1, "File lists should only have one unique value"
    new_col += [file_list[0]]
df_anns_imploded = df_anns_imploded.drop(columns=["ann_file"])
df_anns_imploded.insert(1, "ann_file", new_col)
print(df_anns_imploded.shape)
df_anns_imploded.head()

(14779, 5)


description_id,agg_ann_id,ann_file,ann_offsets,text,label
1,"[14384, 24275, 26233, 52952]",AA5_00100.ann,"[(34, 63), (34, 63), (43, 63), (34, 63)]","[The Very Rev Prof James Whyte, The Very Rev P...","[Unknown, Masculine, Unknown, Stereotype]"
3,"[14377, 14378, 14379, 14380, 14381, 14382, 143...",AA5_00100.ann,"[(789, 791), (871, 873), (913, 916), (928, 930...","[He, he, his, he, he, His, he, Professor James...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
5,"[9531, 23084]",AA6_00100.ann,"[(34, 47), (34, 47)]","[Rev Tom Allan, Rev Tom Allan]","[Unknown, Masculine]"
7,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...",AA6_00100.ann,"[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Billy Graham, He, he, he, he, He, his, his, h...","[Masculine, Gendered-Pronoun, Gendered-Pronoun..."
9,"[14000, 24207]",AA7_00100.ann,"[(34, 63), (34, 63)]","[Rev Prof Alec Campbell Cheyne, Rev Prof Alec ...","[Unknown, Masculine]"


In [75]:
# Join the data, keeping only the rows with annotation data (right join)
descs_anns_tokens = descs_to_tokens.join(df_anns_imploded, on=["description_id"], how="right")
print(descs_anns_tokens.shape)
descs_anns_tokens.head()

(14779, 13)


description_id,filename,desc_file,desc_offsets,field,clean_desc,token_id,token,token_offsets,agg_ann_id,ann_file,ann_offsets,text,label
1,AA5_00100,AA5_00100.txt,"(17, 76)",Title,Papers of The Very Rev Prof James Whyte (1920-...,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ...","[14384, 24275, 26233, 52952]",AA5_00100.ann,"[(34, 63), (34, 63), (43, 63), (34, 63)]","[The Very Rev Prof James Whyte, The Very Rev P...","[Unknown, Masculine, Unknown, Stereotype]"
3,AA5_00100,AA5_00100.txt,"(634, 1725)",Biographical / Historical,Professor James Aitken White was a leading Sco...,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[(634, 646), (647, 648), (649, 659), (659, 660...","[14377, 14378, 14379, 14380, 14381, 14382, 143...",AA5_00100.ann,"[(789, 791), (871, 873), (913, 916), (928, 930...","[He, he, his, he, he, His, he, Professor James...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
5,AA6_00100,AA6_00100.txt,"(17, 60)",Title,Papers of Rev Tom Allan (1916-1965),"[311, 312, 313, 314, 315, 316, 317, 318, 319, ...","[Title, :, Papers, of, Rev, Tom, Allan, (, 191...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ...","[9531, 23084]",AA6_00100.ann,"[(34, 47), (34, 47)]","[Rev Tom Allan, Rev Tom Allan]","[Unknown, Masculine]"
7,AA6_00100,AA6_00100.txt,"(561, 2513)",Biographical / Historical,Rev Thomas Allan was born on 16 August 1916 in...,"[413, 414, 415, 416, 417, 418, 419, 420, 421, ...","[Biographical, /, Historical, :, Rev, Thomas, ...","[(561, 573), (574, 575), (576, 586), (586, 587...","[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...",AA6_00100.ann,"[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Billy Graham, He, he, he, he, He, his, his, h...","[Masculine, Gendered-Pronoun, Gendered-Pronoun..."
9,AA7_00100,AA7_00100.txt,"(17, 76)",Title,Papers of Rev Prof Alec Campbell Cheyne (1924-...,"[768, 769, 770, 771, 772, 773, 774, 775, 776, ...","[Title, :, Papers, of, Rev, Prof, Alec, Campbe...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ...","[14000, 24207]",AA7_00100.ann,"[(34, 63), (34, 63)]","[Rev Prof Alec Campbell Cheyne, Rev Prof Alec ...","[Unknown, Masculine]"


Write the data to a file:

In [76]:
descs_anns_tokens.to_csv(config.agg_path+"descs_anns_tokens.csv")

In [77]:
# anns_dict = (df_anns_imploded[["agg_ann_id", "ann_offsets"]]).to_dict(orient="index")
# print(anns_dict) # Looks good

<a id="2"></a>
### 2. Assign BIO Tags

**Compare each description's token offsets to the annotated text spans' offsets to determine which tokens to mark as the beginning of an annotation (`B-[LABELNAME]`), inside an annotation (`I-[LABELNAME]`), or outside of any annotation (`O`).**

In [78]:
# Remove columns without IDs and offsets
subdf = descs_anns_tokens.drop(columns=["field", "clean_desc", "token", "text", "label", "ann_file", "desc_file"])
# Create a dictionary of the remaining offsets, filename, and ID data
dta_dict = subdf.to_dict(orient="index")
print(dta_dict[1])

{'filename': 'AA5_00100', 'desc_offsets': (17, 76), 'token_id': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'token_offsets': [(17, 22), (22, 23), (24, 30), (31, 33), (34, 37), (38, 42), (43, 46), (47, 51), (52, 57), (58, 63), (64, 65), (65, 74), (74, 75)], 'agg_ann_id': [14384, 24275, 26233, 52952], 'ann_offsets': [(34, 63), (34, 63), (43, 63), (34, 63)]}


#### 2.1 Review Tokens in Annotated Descriptions

For description IDs that do have annotations, assign their tokens in annotated text spans tags of `B` and `I` for *beginning* and *inside* of an annotation, and assign tokens outside of annotated text spans a tag of `O`.

In [119]:
desc_ids = list(dta_dict.keys()) #[:10]
# One token can be in multiple annotations, so give each token a list of tag values
# Use a comprehension so that each description gets its own dict (dict.fromkeys would share one)
desc_to_anntokentags = {did: {} for did in desc_ids}
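
A side note on initializing a dictionary of per-key containers: `dict.fromkeys(keys, dict())` evaluates the default once, so every key refers to the same dict object, whereas a comprehension creates one dict per key. The loop in the next cell reassigns every key, so the results here do not depend on this, but the difference matters whenever the values are mutated in place. A small illustration with made-up keys:

```python
keys = [1, 3, 5]

# dict.fromkeys evaluates the default once, so all keys point to the same dict
shared = dict.fromkeys(keys, dict())
shared[1]["tags"] = ["B"]
print(shared[3])  # the change made through key 1 is visible under key 3

# A dict comprehension creates a separate dict per key
separate = {k: {} for k in keys}
separate[1]["tags"] = ["B"]
print(separate[3])  # still empty
```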

In [120]:
for did in desc_ids:
    # Get the description's token data
    token_ids = dta_dict[did]["token_id"]
    token_offsets = dta_dict[did]["token_offsets"]
    # Get the description's annotation data
    ann_ids = dta_dict[did]["agg_ann_id"]
    ann_offsets = dta_dict[did]["ann_offsets"]
    
    # Determine which tokens begin or are inside of annotated text spans
    tagged_token_ids, tagged_ann_ids, tags = [],[],[]
    for i in range(len(token_ids)):
        token_id = token_ids[i]
        token_offset_pair = token_offsets[i]
        for j in range(len(ann_offsets)):
            ann_offset_pair = ann_offsets[j]
            # If the token's start offset equals the annotation's start offset, give it a B
            if (token_offset_pair[0] == ann_offset_pair[0]):
                tagged_token_ids += [token_id]
                tagged_ann_ids += [ann_ids[j]]
                tags += ["B"]
            # If the token's start offset is in between the annotation's offsets, give it an I
            # (note: `<=` also matches a token that starts exactly at the annotation's end offset)
            elif (token_offset_pair[0] > ann_offset_pair[0]) and (token_offset_pair[0] <= ann_offset_pair[1]):
                tagged_token_ids += [token_id]
                tagged_ann_ids += [ann_ids[j]]
                tags += ["I"]

    desc_to_anntokentags[did] = {"token_ids":tagged_token_ids, "ann_ids":tagged_ann_ids,"tags":tags}
    
print(desc_to_anntokentags[1])

{'token_ids': [7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12], 'ann_ids': [14384, 24275, 52952, 14384, 24275, 52952, 14384, 24275, 26233, 52952, 14384, 24275, 26233, 52952, 14384, 24275, 26233, 52952, 14384, 24275, 26233, 52952], 'tags': ['B', 'B', 'B', 'I', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I']}
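
To make the comparison rule easier to follow, here is a minimal sketch that tags the tokens of description 1 against a single annotated span. The `bio_tags` function is hypothetical, not part of `utils`. It also assumes end offsets are exclusive and therefore uses a strict `<` for the end comparison, unlike the loop above, which uses `<=`.

```python
def bio_tags(token_offsets, ann_span):
    """Return one tag per token for a single annotated span.

    Assumes end offsets are exclusive (an assumption made for this sketch).
    """
    ann_start, ann_end = ann_span
    tags = []
    for tok_start, _tok_end in token_offsets:
        if tok_start == ann_start:
            tags.append("B")   # token starts where the annotation starts
        elif ann_start < tok_start < ann_end:
            tags.append("I")   # token starts inside the annotation
        else:
            tags.append("O")   # token starts outside the annotation
    return tags

# Token offsets for description 1, copied from the dictionary printed earlier
token_offsets = [(17, 22), (22, 23), (24, 30), (31, 33), (34, 37), (38, 42),
                 (43, 46), (47, 51), (52, 57), (58, 63), (64, 65), (65, 74), (74, 75)]

# Annotation 26233 spans (43, 63)
print(bio_tags(token_offsets, (43, 63)))
```

For this span, the token starting at offset 43 receives `B` and the three tokens starting at 47, 52, and 58 receive `I`, which agrees with the tags recorded for annotation 26233 in the output above.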


In [121]:
df = pd.DataFrame.from_dict(desc_to_anntokentags, orient="index").reset_index()
df = df.rename(columns={"index":"description_id"})
df.head()

Unnamed: 0,description_id,token_ids,ann_ids,tags
0,1,"[7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10,...","[14384, 24275, 52952, 14384, 24275, 52952, 143...","[B, B, B, I, I, I, I, I, B, I, I, I, I, I, I, ..."
1,3,"[113, 114, 115, 116, 119, 120, 120, 121, 121, ...","[14385, 14385, 14385, 14385, 52953, 41260, 529...","[B, I, I, I, B, B, I, I, I, B, I, I, I, I, I, ..."
2,5,"[315, 315, 316, 316, 317, 317]","[9531, 23084, 9531, 23084, 9531, 23084]","[B, B, I, I, I, I]"
3,7,"[417, 418, 419, 430, 431, 431, 431, 432, 433, ...","[9526, 9526, 9526, 50889, 9532, 50889, 51789, ...","[B, I, I, B, B, I, B, I, B, I, B, I, I, I, B, ..."
4,9,"[772, 772, 773, 773, 774, 774, 775, 775, 776, ...","[14000, 24207, 14000, 24207, 14000, 24207, 140...","[B, B, I, I, I, I, I, I, I, I]"


In [122]:
df.shape

(14779, 4)

In [123]:
df_exploded = df.apply(pd.Series.explode)
print(df_exploded.shape)
df_exploded.head()

(133975, 4)


Unnamed: 0,description_id,token_ids,ann_ids,tags
0,1,7,14384,B
0,1,7,24275,B
0,1,7,52952,B
0,1,8,14384,I
0,1,8,24275,I


In [124]:
df_exploded = df_exploded.drop_duplicates()
print(df_exploded.shape)

(133975, 4)


#### 2.2 Assign O Tags to All Remaining Tokens

Join the B and I tag data to the entire token dataset and assign all tokens without tags an O.

In [132]:
df_exploded = df_exploded.rename(columns={"token_ids":"token_id", "ann_ids":"ann_id", "tags":"tag"})
# Join on the shared token and description ID columns; the outer join keeps tokens that have no tags
all_tokens = df_exploded.join(df_tokens.set_index(["token_id","description_id"]), on=["token_id","description_id"], how="outer")
all_tokens.head()

Unnamed: 0,description_id,token_id,ann_id,tag,token,offsets
0,1,7.0,14384,B,The,"(34, 37)"
0,1,7.0,24275,B,The,"(34, 37)"
0,1,7.0,52952,B,The,"(34, 37)"
0,1,8.0,14384,I,Very,"(38, 42)"
0,1,8.0,24275,I,Very,"(38, 42)"


In [134]:
print(all_tokens.loc[all_tokens.tag.isna()].shape) # assign these tag `O`
all_tokens[["tag"]] = all_tokens[["tag"]].fillna("O")
print(all_tokens.loc[all_tokens.tag.isna()].shape)

(650420, 6)
(0, 6)
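
The same idea can be shown on a toy example. This sketch uses a left `merge` from the full token table instead of the outer `join` above, so it is an illustration of the technique rather than a reproduction of the cell; the data is made up.

```python
import pandas as pd

# Full token table (every token in every description)
tokens = pd.DataFrame({
    "token_id": [3, 4, 7, 8],
    "description_id": [1, 1, 1, 1],
    "token": ["Title", ":", "The", "Very"],
})
# Only the tokens that received a B or I tag
tagged = pd.DataFrame({
    "token_id": [7, 8],
    "description_id": [1, 1],
    "tag": ["B", "I"],
})

# A left merge from the full token table keeps every token; untagged ones get NaN
merged = tokens.merge(tagged, on=["token_id", "description_id"], how="left")
# Tokens with no tag are outside every annotation
merged["tag"] = merged["tag"].fillna("O")
print(merged)
```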


In [136]:
print(all_tokens.shape)
all_tokens = all_tokens.drop_duplicates()
print(all_tokens.shape)

(784375, 6)
(784375, 6)


#### 2.3 Add Label Names to B and I Tags

Join the annotation data to the token data to get the label associated with each B and I tag.

In [142]:
# df_anns.head()
subdf_anns = df_anns[["agg_ann_id","text","description_id","label"]]
subdf_anns = subdf_anns.rename(columns={"agg_ann_id":"ann_id"})
subdf_anns.head()

Unnamed: 0,ann_id,text,description_id,label
0,0,knighted,2364,Gendered-Role
1,1,knighthood,4542,Gendered-Role
2,2,Prince Regent,3660,Gendered-Role
3,3,knighthood,4678,Gendered-Role
4,4,Sir,4732,Gendered-Role


In [156]:
# Join on the shared annotation and description ID columns to attach each annotation's text and label
all_tokens_labeled = all_tokens.join(subdf_anns.set_index(["ann_id", "description_id"]), on=["ann_id","description_id"], how="outer")
all_tokens_labeled.head()

Unnamed: 0,description_id,token_id,ann_id,tag,token,offsets,text,label
0,1,7.0,14384.0,B,The,"(34, 37)",The Very Rev Prof James Whyte,Unknown
0,1,8.0,14384.0,I,Very,"(38, 42)",The Very Rev Prof James Whyte,Unknown
0,1,9.0,14384.0,I,Rev,"(43, 46)",The Very Rev Prof James Whyte,Unknown
0,1,10.0,14384.0,I,Prof,"(47, 51)",The Very Rev Prof James Whyte,Unknown
0,1,11.0,14384.0,I,James,"(52, 57)",The Very Rev Prof James Whyte,Unknown


In [162]:
o_tags = all_tokens_labeled.loc[all_tokens_labeled.tag == "O"]
o_tags = o_tags.drop(columns=["label"])
o_tags.head()

Unnamed: 0,description_id,token_id,ann_id,tag,token,offsets,text
285,639,,,O,,,
14778,639,11466.0,,O,Scope,"(3303, 3308)",
14778,639,11467.0,,O,and,"(3309, 3312)",
14778,639,11468.0,,O,Contents,"(3313, 3321)",
14778,639,11469.0,,O,:,"(3321, 3322)",


In [163]:
bi_tags = all_tokens_labeled.loc[all_tokens_labeled.tag != "O"]
complete_tags = bi_tags["tag"] +"-"+bi_tags["label"]
bi_tags = bi_tags.drop(columns=["tag","label"])
bi_tags.insert(3, "tag", complete_tags)
bi_tags.head()

Unnamed: 0,description_id,token_id,ann_id,tag,token,offsets,text
0,1,7.0,14384.0,B-Unknown,The,"(34, 37)",The Very Rev Prof James Whyte
0,1,8.0,14384.0,I-Unknown,Very,"(38, 42)",The Very Rev Prof James Whyte
0,1,9.0,14384.0,I-Unknown,Rev,"(43, 46)",The Very Rev Prof James Whyte
0,1,10.0,14384.0,I-Unknown,Prof,"(47, 51)",The Very Rev Prof James Whyte
0,1,11.0,14384.0,I-Unknown,James,"(52, 57)",The Very Rev Prof James Whyte


In [164]:
df = pd.concat([bi_tags,o_tags], sort=True)
df.head()

Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
0,14384.0,1,"(34, 37)",B-Unknown,The Very Rev Prof James Whyte,The,7.0
0,14384.0,1,"(38, 42)",I-Unknown,The Very Rev Prof James Whyte,Very,8.0
0,14384.0,1,"(43, 46)",I-Unknown,The Very Rev Prof James Whyte,Rev,9.0
0,14384.0,1,"(47, 51)",I-Unknown,The Very Rev Prof James Whyte,Prof,10.0
0,14384.0,1,"(52, 57)",I-Unknown,The Very Rev Prof James Whyte,James,11.0


In [165]:
df = df.sort_values(by=["description_id","token_id"])
df.head()

Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
14778,,0,"(0, 10)",O,,Identifier,0.0
14778,,0,"(10, 11)",O,,:,1.0
14778,,0,"(12, 15)",O,,AA5,2.0
14778,,1,"(17, 22)",O,,Title,3.0
14778,,1,"(22, 23)",O,,:,4.0


Write the resulting data for token classification:

In [167]:
df.to_csv(config.tokc_path+"tagged_tokens.csv")