# Analysis: Token BIO Tags

## Post Annotation and Aggregation

Determine which description each annotated text span occurs in and then determine which tokens are in an annotated text span.

***

**Table of Contents**

[0](#0). Load libraries

[1](#1). Load and Transform Data

[2](#2). Assign BIO Tags

[3](#3). Export Tags' Data for Visualization

***

<a id="0"></a>
### 0. Load libraries

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys

<a id="1"></a>
### 1. Load and Transform Data

**Load description and annotation data and transform the datasets to more easily associate description IDs to annotation IDs.**

In [2]:
df_tokens = pd.read_csv(config.tokc_path+"tokens_sents_descs.csv", index_col=0)
df_tokens.head()

Unnamed: 0,sentence_id,token_id,token,token_offsets,description_id
0,0,0,Identifier,"(0, 10)",0
0,0,1,:,"(10, 11)",0
0,0,2,AA5,"(12, 15)",0
1,1,3,Title,"(17, 22)",1
1,1,4,:,"(22, 23)",1


In [3]:
df_tokens.loc[df_tokens.token.isna()] #.shape

Unnamed: 0,sentence_id,token_id,token,token_offsets,description_id


In [4]:
df_tokens.loc[df_tokens.token == "nan"] #.shape

Unnamed: 0,sentence_id,token_id,token,token_offsets,description_id


Transform the offsets column's string values to tuples of ints.

In [5]:
token_desc_ids = list(df_tokens.description_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.token_offsets)
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
df_tokens = df_tokens.drop(columns=["token_offsets"])
df_tokens.insert(len(df_tokens.columns), "token_offsets", token_offsets_tuples)
df_tokens.tail()

Unnamed: 0,sentence_id,token_id,token,description_id,token_offsets
42029,42029,753927,cases,27907,"(6332, 6337)"
42029,42029,753928,involving,27907,"(6338, 6347)"
42029,42029,753929,homosexual,27907,"(6348, 6358)"
42029,42029,753930,offences,27907,"(6359, 6367)"
42029,42029,753931,.,27907,"(6367, 6368)"
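
The slicing approach above assumes every value looks exactly like `(start, end)`. As an alternative sketch (not the method used in this notebook), `ast.literal_eval` from the standard library can parse such strings. The helper name `parse_offsets` is hypothetical, and the sample values are copied from the first rows of the token table.

```python
import ast

def parse_offsets(offsets_str):
    """Parse a string such as "(0, 10)" into a tuple of two ints."""
    start, end = ast.literal_eval(offsets_str)
    return (int(start), int(end))

# Sample values taken from the head of the token table
sample = ["(0, 10)", "(10, 11)", "(12, 15)"]
parsed = [parse_offsets(s) for s in sample]
print(parsed)
```

Unpacking into `start, end` makes the function raise an error on malformed values instead of silently producing a bad tuple.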


Associate each description's tokens, and each annotated text span's text and offsets, with description IDs.

In [6]:
df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["description_id"])
df_tokens_imploded.head()

Unnamed: 0_level_0,sentence_id,token_id,token,token_offsets
description_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,"[0, 0, 0]","[0, 1, 2]","[Identifier, :, AA5]","[(0, 10), (10, 11), (12, 15)]"
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ..."
2,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[(77, 82), (83, 86), (87, 95), (95, 96), (97, ..."
3,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[(634, 646), (647, 648), (649, 659), (659, 660..."
4,"[11, 11, 11]","[308, 309, 310]","[Identifier, :, AA6]","[(0, 10), (10, 11), (12, 15)]"
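
`utils.implodeDataFrame` is a custom helper whose source is not shown here. Judging only from the output above, it appears to group rows by the given columns and collect every other column into a list per group. Under that assumption, a rough pandas equivalent (the name `implode` is hypothetical) might look like this:

```python
import pandas as pd

def implode(df, group_cols):
    """Group by group_cols and gather every other column into one list per group."""
    return df.groupby(group_cols).agg(list)

# Toy rows modelled on the head of the token table
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "token_id": [0, 1, 2, 3, 4],
    "token": ["Identifier", ":", "AA5", "Title", ":"],
    "description_id": [0, 0, 0, 1, 1],
})
toy_imploded = implode(toy, ["description_id"])
print(toy_imploded)
```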


Load the description data.

In [7]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
# Remove columns not needed for linking
df_descs = df_descs.drop(columns=["clean_desc", "word_count", "sent_count"])
# # Ignore rows for Identifier fields (the text of this field wasn't annotated)
# df_descs = df_descs.loc[df_descs.field != "Identifier"]

# Remove file extensions
desc_filenames = list(df_descs.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_descs.insert(1, "filename", desc_filenames)

# Make sure offsets are in one column as tuples of ints
start_offsets = list(df_descs.start_offset)
end_offsets = list(df_descs.end_offset)
offsets_strs = list(zip(start_offsets, end_offsets))
offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_descs = df_descs.drop(columns=["start_offset", "end_offset"])
df_descs.insert(3, "desc_offsets", offsets_int_tuples)

# # Remove rows with a NaN clean description (their description is in another row under the next file)
# df_descs = df_descs.loc[~df_descs.clean_desc.isna()]

df_descs.head()

Unnamed: 0,description_id,filename,description,desc_offsets,file,field
0,0,AA5_00100,Identifier: AA5,"(0, 16)",AA5_00100.txt,Identifier
1,1,AA5_00100,Title:\nPapers of The Very Rev Prof James Whyt...,"(17, 76)",AA5_00100.txt,Title
2,2,AA5_00100,"Scope and Contents:\nSermons and addresses, 19...","(77, 633)",AA5_00100.txt,Scope and Contents
3,3,AA5_00100,Biographical / Historical:\nProfessor James Ai...,"(634, 1725)",AA5_00100.txt,Biographical / Historical
4,4,AA6_00100,Identifier: AA6,"(0, 16)",AA6_00100.txt,Identifier


In [8]:
assert df_descs.loc[df_descs.description.isna()].shape[0] == 0
assert df_descs.shape[0] == df_tokens_imploded.shape[0]

Associate the imploded token data to the description data (using the `description_id` columns).

In [9]:
df_descs = df_descs.set_index("description_id")
descs_to_tokens = df_descs.join(df_tokens_imploded, on="description_id", how="left")
print(descs_to_tokens.shape)
descs_to_tokens = descs_to_tokens.drop(columns=["file"])
descs_to_tokens.head()

(27908, 9)


Unnamed: 0_level_0,filename,description,desc_offsets,field,sentence_id,token_id,token,token_offsets
description_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,AA5_00100,Identifier: AA5,"(0, 16)",Identifier,"[0, 0, 0]","[0, 1, 2]","[Identifier, :, AA5]","[(0, 10), (10, 11), (12, 15)]"
1,AA5_00100,Title:\nPapers of The Very Rev Prof James Whyt...,"(17, 76)",Title,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ..."
2,AA5_00100,"Scope and Contents:\nSermons and addresses, 19...","(77, 633)",Scope and Contents,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[(77, 82), (83, 86), (87, 95), (95, 96), (97, ..."
3,AA5_00100,Biographical / Historical:\nProfessor James Ai...,"(634, 1725)",Biographical / Historical,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[(634, 646), (647, 648), (649, 659), (659, 660..."
4,AA6_00100,Identifier: AA6,"(0, 16)",Identifier,"[11, 11, 11]","[308, 309, 310]","[Identifier, :, AA6]","[(0, 10), (10, 11), (12, 15)]"


Load the annotation data and associate it to the description-token data joined above.

In [10]:
df_anns = pd.read_csv(config.agg_path+"aggregated_final.csv")
# Remove unnecessary columns
df_anns = df_anns.drop(columns=["category", "associated_genders"])

# Remove file extensions
desc_filenames = list(df_anns.file)
desc_filenames = [f[:-4] for f in desc_filenames]
df_anns.insert(1, "filename", desc_filenames)
df_anns = df_anns.drop(columns=["file"])

# Make sure offsets are in one column as tuples of ints
offsets_strs = list(df_anns.ann_offsets)
offsets_int_tuples = utils.turnStrTuplesToIntTuples(offsets_strs)
df_anns = df_anns.drop(columns=["ann_offsets"])
df_anns.insert(3, "ann_offsets", offsets_int_tuples)

df_anns.head()

Unnamed: 0,agg_ann_id,filename,text,ann_offsets,label,description_id
0,0,Coll-1157_00100,knighted,"(1407, 1415)",Gendered-Role,2364
1,1,Coll-1310_02300,knighthood,"(9625, 9635)",Gendered-Role,4542
2,2,Coll-1281_00100,Prince Regent,"(2426, 2439)",Gendered-Role,3660
3,3,Coll-1310_02700,knighthood,"(9993, 10003)",Gendered-Role,4678
4,4,Coll-1310_02900,Sir,"(7192, 7195)",Gendered-Role,4732


In [11]:
df_anns_imploded = utils.implodeDataFrame(df_anns, ["description_id"])
ann_file_col = (df_anns_imploded.filename)
new_col = []
for file_list in ann_file_col:
    assert len(set(file_list)) == 1, "File lists should only have one unique value"
    new_col += [file_list[0]]
df_anns_imploded = df_anns_imploded.drop(columns=["filename"])
df_anns_imploded.insert(1, "filename", new_col)
print(df_anns_imploded.shape)
df_anns_imploded.head()

(14779, 5)


Unnamed: 0_level_0,agg_ann_id,filename,text,ann_offsets,label
description_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,"[14384, 24275, 26233, 52952]",AA5_00100,"[The Very Rev Prof James Whyte, The Very Rev P...","[(34, 63), (34, 63), (43, 63), (34, 63)]","[Unknown, Masculine, Unknown, Stereotype]"
3,"[14377, 14378, 14379, 14380, 14381, 14382, 143...",AA5_00100,"[He, he, his, he, he, His, he, Professor James...","[(789, 791), (871, 873), (913, 916), (928, 930...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
5,"[9531, 23084]",AA6_00100,"[Rev Tom Allan, Rev Tom Allan]","[(34, 47), (34, 47)]","[Unknown, Masculine]"
7,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...",AA6_00100,"[Billy Graham, He, he, he, he, He, his, his, h...","[(1778, 1790), (677, 679), (920, 922), (1222, ...","[Masculine, Gendered-Pronoun, Gendered-Pronoun..."
9,"[14000, 24207]",AA7_00100,"[Rev Prof Alec Campbell Cheyne, Rev Prof Alec ...","[(34, 63), (34, 63)]","[Unknown, Masculine]"


In [12]:
df_anns_imploded.loc[df_anns_imploded.index == 639]

Unnamed: 0_level_0,agg_ann_id,filename,text,ann_offsets,label
description_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
639,[41306],BAI_01900,[poet],"[(3400, 3404)]",[Occupation]


In [13]:
# Join the data, keeping all rows (including those without annotations)
sub_descs_to_tokens = descs_to_tokens[["sentence_id", "token_id", "token_offsets"]]
descs_anns_tokens = sub_descs_to_tokens.join(df_anns_imploded, on=["description_id"], how="outer")
print(descs_anns_tokens.shape)
descs_anns_tokens.head()

(27908, 8)


Unnamed: 0_level_0,sentence_id,token_id,token_offsets,agg_ann_id,filename,text,ann_offsets,label
description_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,"[0, 0, 0]","[0, 1, 2]","[(0, 10), (10, 11), (12, 15)]",,,,,
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[(17, 22), (22, 23), (24, 30), (31, 33), (34, ...","[14384, 24275, 26233, 52952]",AA5_00100,"[The Very Rev Prof James Whyte, The Very Rev P...","[(34, 63), (34, 63), (43, 63), (34, 63)]","[Unknown, Masculine, Unknown, Stereotype]"
2,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[(77, 82), (83, 86), (87, 95), (95, 96), (97, ...",,,,,
3,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[(634, 646), (647, 648), (649, 659), (659, 660...","[14377, 14378, 14379, 14380, 14381, 14382, 143...",AA5_00100,"[He, he, his, he, he, His, he, Professor James...","[(789, 791), (871, 873), (913, 916), (928, 930...","[Gendered-Pronoun, Gendered-Pronoun, Gendered-..."
4,"[11, 11, 11]","[308, 309, 310]","[(0, 10), (10, 11), (12, 15)]",,,,,
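
The join keeps one row per description and leaves `NaN` in the annotation columns wherever a description has no annotations. The sketch below shows the same pattern on two made-up frames. It uses a plain left join on the index, which is a simplification of the call above.

```python
import pandas as pd

# Three descriptions with their token IDs (toy values)
tokens_by_desc = pd.DataFrame(
    {"token_id": [[0, 1, 2], [3, 4, 5], [16, 17]]},
    index=pd.Index([0, 1, 2], name="description_id"),
)
# Only description 1 has annotations
anns_by_desc = pd.DataFrame(
    {"agg_ann_id": [[14384, 24275]]},
    index=pd.Index([1], name="description_id"),
)
joined = tokens_by_desc.join(anns_by_desc, how="left")
print(joined)
# Flag the descriptions without annotations
unannotated = joined["agg_ann_id"].isna().tolist()
print(unannotated)
```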


Write the data to a file, replacing `NaN` values with empty strings:

In [15]:
descs_anns_tokens = descs_anns_tokens.fillna("")
descs_anns_tokens.to_csv(config.agg_path+"descs_sents_tokens_anns.csv")

<a id="2"></a>
### 2. Assign BIO Tags

**Compare the descriptions' tokens' offsets to the annotated text spans' offsets to determine which tokens to mark as the beginning of an annotation (`B-[LABELNAME]`), inside an annotation (`I-[LABELNAME]`), or unannotated, that is, outside of any annotation (`O`).**

In [16]:
# Remove columns without IDs and offsets
subdf = descs_anns_tokens.drop(columns=["text", "filename", "label"])
print(subdf.shape)

(27908, 5)


#### 2.1 Review Tokens in Annotated Descriptions

For descriptions that have annotations, tag each token in an annotated text span with `B` (*beginning* of an annotation) or `I` (*inside* an annotation). Tokens outside every annotated text span receive a tag of `O` later, when the tags are joined back to the full token table.

In [17]:
# Get only the descriptions with annotations
subdf_withann = subdf.loc[subdf.agg_ann_id != ""]
print(subdf_withann.shape)
# Create a dictionary of the remaining offsets and ID data
withann_dict = subdf_withann.to_dict(orient="index")
print(withann_dict[639])

(14779, 5)
{'sentence_id': [720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720, 720], 'token_id': [11466, 11467, 11468, 11469, 11470, 11471, 11472, 11473, 11474, 11475, 11476, 11477, 11478, 11479, 11480, 11481, 11482, 11483, 11484], 'token_offsets': [(3303, 3308), (3309, 3312), (3313, 3321), (3321, 3322), (3323, 3324), (3325, 3335), (3336, 3342), (3343, 3345), (3346, 3352), (3353, 3355), (3356, 3362), (3363, 3364), (3364, 3368), (3369, 3379), (3379, 3380), (3381, 3384), (3385, 3390), (3391, 3393), (3394, 3405)], 'agg_ann_id': [41306], 'ann_offsets': [(3400, 3404)]}


In [18]:
desc_ids = list(withann_dict.keys())
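# One token can be in multiple annotations, so give each description its own dictionary of tag lists
# (dict.fromkeys with a mutable default would share one dict object across all keys)
desc_to_anntokentags = {did: {} for did in desc_ids}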

In [19]:
for did in desc_ids:
    # Get the description's token data
    token_ids = withann_dict[did]["token_id"]
    token_offsets = withann_dict[did]["token_offsets"]
    # Get the description's annotation data
    ann_ids = withann_dict[did]["agg_ann_id"]
    ann_offsets = withann_dict[did]["ann_offsets"]

    # Determine which tokens begin or are inside of annotated text spans
    tagged_token_ids, tagged_ann_ids, tags = [],[],[]
    for j in range(len(ann_offsets)):
        ann_offset_pair = ann_offsets[j]
        ann_id = ann_ids[j]
        for i in range(len(token_ids)):
            token_id = token_ids[i]
            token_offset_pair = token_offsets[i]
            #print(ann_offset_pair, token_offset_pair)
            # If the token's start offset equals the annotation's start offset, give it a B
            if (token_offset_pair[0] == ann_offset_pair[0]):
                tagged_token_ids += [token_id]
                tagged_ann_ids += [ann_id]
                tags += ["B"]
            # If the token's start offset is in between the annotation's offsets, give it an I
            elif (token_offset_pair[0] > ann_offset_pair[0]) and (token_offset_pair[0] <= ann_offset_pair[1]):
                tagged_token_ids += [token_id]
                tagged_ann_ids += [ann_id]
                tags += ["I"]
            # If the annotation's offsets are in between the token's offsets, give it a B
            elif (ann_offset_pair[0] > token_offset_pair[0]) and (ann_offset_pair[1] <= token_offset_pair[1]):
                tagged_token_ids += [token_id]
                tagged_ann_ids += [ann_id]
                tags += ["B"]

    desc_to_anntokentags[did] = {"token_ids":tagged_token_ids, "ann_ids":tagged_ann_ids,"tags":tags}

In [20]:
print(desc_to_anntokentags[639])  # looks good

{'token_ids': [11484], 'ann_ids': [41306], 'tags': ['B']}
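
The comparisons in the loop above can be condensed into a small function for trying out individual cases. This is a sketch rather than a replacement: the name `bio_tags` is hypothetical, and it assumes end offsets are exclusive, so a token that starts exactly where an annotation ends is tagged `O`. That may differ from the loop's `<=` comparison.

```python
def bio_tags(token_offsets, ann_offsets):
    """Return one tag ("B", "I" or "O") per token for a single annotation span.

    Offsets are (start, end) pairs; the end is assumed to be exclusive.
    """
    ann_start, ann_end = ann_offsets
    tags = []
    for tok_start, tok_end in token_offsets:
        if tok_start == ann_start:
            tags.append("B")  # token starts where the annotation starts
        elif ann_start < tok_start < ann_end:
            tags.append("I")  # token starts inside the annotation
        elif tok_start < ann_start and ann_end <= tok_end:
            tags.append("B")  # annotation lies within a single token
        else:
            tags.append("O")  # token is outside the annotation
    return tags

# Last three tokens of description 639 and its annotation, from the output above
tokens_639 = [(3385, 3390), (3391, 3393), (3394, 3405)]
tags_639 = bio_tags(tokens_639, (3400, 3404))
print(tags_639)  # ['O', 'O', 'B']

# Made-up tokens: the third token starts exactly at the annotation's end
tags_toy = bio_tags([(0, 3), (4, 8), (8, 9)], (0, 8))
print(tags_toy)  # ['B', 'I', 'O']
```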


In [21]:
df = pd.DataFrame.from_dict(desc_to_anntokentags, orient="index").reset_index()
df = df.rename(columns={"index":"description_id"})
df.head()

Unnamed: 0,description_id,token_ids,ann_ids,tags
0,1,"[7, 8, 9, 10, 11, 12, 7, 8, 9, 10, 11, 12, 9, ...","[14384, 14384, 14384, 14384, 14384, 14384, 242...","[B, I, I, I, I, I, B, I, I, I, I, I, B, I, I, ..."
1,3,"[134, 148, 155, 157, 211, 216, 226, 113, 114, ...","[14377, 14378, 14379, 14380, 14381, 14382, 143...","[B, B, B, B, B, B, B, B, I, I, I, B, I, B, I, ..."
2,5,"[315, 316, 317, 315, 316, 317]","[9531, 9531, 9531, 23084, 23084, 23084]","[B, I, I, B, I, I]"
3,7,"[631, 632, 435, 478, 533, 539, 598, 618, 634, ...","[55, 55, 9516, 9517, 9518, 9519, 9520, 9521, 9...","[B, I, B, B, B, B, B, B, B, B, B, B, B, I, I, ..."
4,9,"[772, 773, 774, 775, 776, 772, 773, 774, 775, ...","[14000, 14000, 14000, 14000, 14000, 24207, 242...","[B, I, I, I, I, B, I, I, I, I]"


In [22]:
df.shape

(14779, 4)

In [23]:
df_exploded = df.apply(pd.Series.explode)
print(df_exploded.shape)
df_exploded.head()

(134644, 4)


Unnamed: 0,description_id,token_ids,ann_ids,tags
0,1,7,14384,B
0,1,8,14384,I
0,1,9,14384,I
0,1,10,14384,I
0,1,11,14384,I


In [24]:
assert df_exploded.loc[df_exploded.token_ids.isna()].shape[0] == 0, "All tokens should have a value"
df_exploded_dedup = df_exploded.drop_duplicates()
assert df_exploded_dedup.shape[0] == df_exploded.shape[0], "Each row should be unique"

#### 2.2 Assign O Tags to All Tokens in Unannotated Descriptions

Select the descriptions without any annotations and assign every one of their tokens an `O` tag.

In [25]:
# Get the descriptions without any annotations
subdf_withoutann = subdf.loc[subdf.agg_ann_id == ""]
print(subdf_withoutann.shape)
# Create a dictionary of the remaining offsets and ID data
withoutann_dict = subdf_withoutann.to_dict(orient="index")
print(withoutann_dict[0])

(13129, 5)
{'sentence_id': [0, 0, 0], 'token_id': [0, 1, 2], 'token_offsets': [(0, 10), (10, 11), (12, 15)], 'agg_ann_id': '', 'ann_offsets': ''}


In [26]:
remaining_tokens = pd.DataFrame.from_dict(withoutann_dict, orient="index")
remaining_tokens = remaining_tokens.reset_index()
remaining_tokens = remaining_tokens.rename(columns={"index":"description_id", "agg_ann_id":"ann_id"})
remaining_tokens = remaining_tokens.drop(columns=["sentence_id","token_offsets", "ann_offsets"])
remaining_tokens_exploded = remaining_tokens.apply(pd.Series.explode)
tags = ["O"]*(remaining_tokens_exploded.shape[0])
remaining_tokens_exploded.insert(3, "tag", tags)
remaining_tokens_exploded.head()

Unnamed: 0,description_id,token_id,ann_id,tag
0,0,0,,O
0,0,1,,O
0,0,2,,O
1,2,16,,O
1,2,17,,O


#### 2.3 Add Label Names to B and I Tags

Join the annotation data to the token data to get the label associated with each B and I tag.

In [27]:
# df_anns.head()
subdf_anns = df_anns[["agg_ann_id","text","description_id","label"]]
subdf_anns = subdf_anns.rename(columns={"agg_ann_id":"ann_id"})
subdf_anns.head()

Unnamed: 0,ann_id,text,description_id,label
0,0,knighted,2364,Gendered-Role
1,1,knighthood,4542,Gendered-Role
2,2,Prince Regent,3660,Gendered-Role
3,3,knighthood,4678,Gendered-Role
4,4,Sir,4732,Gendered-Role


In [28]:
df_exploded = df_exploded.rename(columns={"token_ids":"token_id", "ann_ids":"ann_id", "tags":"tag"})
# Join on the shared ann_id and description_id columns to attach each annotation's text and label
all_tokens_labeled = df_exploded.join(subdf_anns.set_index(["ann_id", "description_id"]), on=["ann_id","description_id"], how="left")
all_tokens_labeled.head()

Unnamed: 0,description_id,token_id,ann_id,tag,text,label
0,1,7,14384,B,The Very Rev Prof James Whyte,Unknown
0,1,8,14384,I,The Very Rev Prof James Whyte,Unknown
0,1,9,14384,I,The Very Rev Prof James Whyte,Unknown
0,1,10,14384,I,The Very Rev Prof James Whyte,Unknown
0,1,11,14384,I,The Very Rev Prof James Whyte,Unknown


In [29]:
assert all_tokens_labeled.loc[all_tokens_labeled.token_id.isna()].shape[0] == 0

In [30]:
# o_tags = all_tokens_labeled.loc[all_tokens_labeled.tag == "O"]
# o_tags = o_tags.drop(columns=["label"])
# o_tags.head()

In [31]:
# bi_tags = all_tokens_labeled.loc[all_tokens_labeled.tag != "O"]
complete_tags = all_tokens_labeled["tag"] +"-"+all_tokens_labeled["label"]
all_tokens_labeled = all_tokens_labeled.drop(columns=["tag","label"])
all_tokens_labeled.insert(3, "tag", complete_tags)
all_tokens_labeled.head()

Unnamed: 0,description_id,token_id,ann_id,tag,text
0,1,7,14384,B-Unknown,The Very Rev Prof James Whyte
0,1,8,14384,I-Unknown,The Very Rev Prof James Whyte
0,1,9,14384,I-Unknown,The Very Rev Prof James Whyte
0,1,10,14384,I-Unknown,The Very Rev Prof James Whyte
0,1,11,14384,I-Unknown,The Very Rev Prof James Whyte


#### 2.4 Combine the Data

In [32]:
all_tokens_labeled = all_tokens_labeled.drop(columns=["text"])
all_tokens = pd.concat([remaining_tokens_exploded,all_tokens_labeled], sort=True)
print(all_tokens.shape)
all_tokens = all_tokens.sort_values(by=["description_id","token_id", "ann_id", "tag"])
all_tokens.head()

(355434, 4)


Unnamed: 0,ann_id,description_id,tag,token_id
0,,0,O,0
0,,0,O,1
0,,0,O,2
0,14384.0,1,B-Unknown,7
0,24275.0,1,B-Masculine,7


In [33]:
assert all_tokens.loc[all_tokens.tag.isna()].shape[0] == 0, "All tags should have a value"
all_tokens_dedup = all_tokens.drop_duplicates()
assert all_tokens_dedup.shape == all_tokens.shape, "Each row should be unique"

Write the data to a file:

In [34]:
all_tokens.to_csv(config.tokc_path+"tagged_tokens.csv")

In [101]:
df_tokens = df_tokens.reset_index()
df_tokens.head()

Unnamed: 0,token_id,description_id,sentence_id,token,token_offsets
0,0,0,0,Identifier,"(0, 10)"
1,1,0,0,:,"(10, 11)"
2,2,0,0,AA5,"(12, 15)"
3,3,1,1,Title,"(17, 22)"
4,4,1,1,:,"(22, 23)"


In [103]:
# Gather each token's annotation IDs and tags into lists (one row per token)
all_tokens_imploded = utils.implodeDataFrame(all_tokens, ["token_id"])
all_tokens_imploded = all_tokens_imploded.reset_index()
all_tokens_imploded = all_tokens_imploded.drop(columns=["description_id"])
all_tokens_imploded.head()

Unnamed: 0,token_id,ann_id,tag
0,0,[],[O]
1,1,[],[O]
2,2,[],[O]
3,7,"[14384, 24275, 52952]","[B-Unknown, B-Masculine, B-Stereotype]"
4,8,"[14384, 24275, 52952]","[I-Unknown, I-Masculine, I-Stereotype]"


In [104]:
df_tokens = df_tokens.set_index("token_id")
all_tokens_joined = df_tokens.join(all_tokens_imploded.set_index("token_id"), on="token_id", how="outer")
all_tokens_joined.head()

Unnamed: 0_level_0,description_id,sentence_id,token,token_offsets,ann_id,tag
token_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,0,Identifier,"(0, 10)",[],[O]
1,0,0,:,"(10, 11)",[],[O]
2,0,0,AA5,"(12, 15)",[],[O]
3,1,1,Title,"(17, 22)",,
4,1,1,:,"(22, 23)",,


In [107]:
all_tokens_joined[["ann_id"]] = all_tokens_joined[["ann_id"]].fillna("[]")

In [109]:
all_tokens_joined[["tag"]] = all_tokens_joined[["tag"]].fillna("[O]")

In [110]:
all_tokens_joined.head()

Unnamed: 0_level_0,description_id,sentence_id,token,token_offsets,ann_id,tag
token_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,0,Identifier,"(0, 10)",[],[O]
1,0,0,:,"(10, 11)",[],[O]
2,0,0,AA5,"(12, 15)",[],[O]
3,1,1,Title,"(17, 22)",[],[O]
4,1,1,:,"(22, 23)",[],[O]
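
The `fillna("[O]")` call stores the string `"[O]"`, whereas rows that already had tags appear to hold Python lists. If a uniform list type is wanted downstream (an assumption about later use, not something this notebook requires), one option is to replace missing values element-wise. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"tag": [["O"], None, ["B-Unknown", "B-Masculine"]]},
    index=pd.Index([2, 3, 7], name="token_id"),
)
# Keep existing lists and give untagged tokens a one-element list
toy["tag"] = toy["tag"].apply(lambda v: v if isinstance(v, list) else ["O"])
print(toy)
```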


Write the data for token classification:

In [111]:
all_tokens_joined.to_csv(config.tokc_path+"desc_sent_ann_token_tag.csv")

<a id="3"></a>
### 3. Export Tags' Data for Visualization

In [14]:
# Load the data from the token classification preprocessing Notebook
tag_totals = pd.read_csv(config.tokc_path+"token_tag_totals.csv", index_col=0)
tag_totals.head()

Unnamed: 0,tag,total
15,I-Nonbinary,1
5,B-Nonbinary,1
11,I-Gendered-Pronoun,67
12,I-Gendered-Role,726
13,I-Generalization,891


In [21]:
tag_totals[["tag", "label"]] = tag_totals.tag.str.split("-", n=1, expand=True)
tag_totals = tag_totals[["tag", "label", "total"]]
tag_totals

Unnamed: 0,tag,label,total
15,I,Nonbinary,1
5,B,Nonbinary,1
11,I,Gendered-Pronoun,67
12,I,Gendered-Role,726
13,I,Generalization,891
0,B,Feminine,1614
3,B,Generalization,2051
8,B,Stereotype,2614
2,B,Gendered-Role,3577
10,I,Feminine,3782
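
Passing `n=1` matters because some label names contain a hyphen themselves. A short sketch with three tag strings from the table above shows that only the first hyphen is used as the separator:

```python
import pandas as pd

tags = pd.Series(["B-Nonbinary", "I-Gendered-Pronoun", "I-Gendered-Role"])
# Split at the first hyphen only, so hyphenated label names stay intact
parts = tags.str.split("-", n=1, expand=True)
parts.columns = ["tag", "label"]
print(parts)
```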


In [22]:
b_tags = tag_totals.loc[tag_totals.tag == "B"]
i_tags = tag_totals.loc[tag_totals.tag == "I"]

In [24]:
# Sum only the numeric column so the string "tag" column is not aggregated
label_totals = tag_totals.groupby("label")["total"].sum().reset_index()
label_totals

Unnamed: 0,label,total
0,Feminine,5396
1,Gendered-Pronoun,4228
2,Gendered-Role,4303
3,Generalization,2942
4,Masculine,14417
5,Nonbinary,2
6,Occupation,8543
7,Omission,18812
8,Stereotype,8302
9,Unknown,67699
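
Each label's total is the sum of its `B` and `I` tag counts. As a quick check, the sketch below copies six rows from the tag table above and reproduces three of the label totals. Selecting the `total` column before summing keeps the result numeric regardless of pandas version.

```python
import pandas as pd

rows = pd.DataFrame({
    "tag": ["I", "B", "I", "B", "B", "I"],
    "label": ["Nonbinary", "Nonbinary", "Gendered-Role",
              "Gendered-Role", "Feminine", "Feminine"],
    "total": [1, 1, 726, 3577, 1614, 3782],
})
totals = rows.groupby("label")["total"].sum().reset_index()
print(totals)
```

The sums (5396 for Feminine, 4303 for Gendered-Role, 2 for Nonbinary) match the corresponding rows of the label totals table.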


In [28]:
tag_totals.to_csv(config.tokc_path+"token_tag_totals.csv")
label_totals.to_csv(config.tokc_path+"token_label_totals.csv")