# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs the files:
  * `../data/analysis_data/descriptions_with_counts.csv`: adds columns to `descriptions.csv` for word counts and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `../data/analysis_data/descs_stats.csv`: contains the count, minimum, maximum, average, and standard deviation of all descriptions and each type of description
  * `../data/doc_clf_data/desc_field_descid_label_eadid.csv`: contains one row per description with the description's labels from the manual annotation process
  * `../data/crc_metadata/descs_with_offsets.csv`: contains one row for every description in the annotated datasets with columns for the descriptions' corresponding id, eadid, file, start offset, and end offset

***

**Table of Contents**

[0.](#0) Loading

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Descriptions

[3.](#3) Offsets of Tokens
    
  * TO REMOVE: [BIO Tags](#3.1)

***

<a id="0"></a>
### 0. Loading
First, load the Python libraries and project modules used throughout the analysis. The datasets themselves are loaded in the sections below.

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

from intervaltree import Interval, IntervalTree

<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)
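
As a rough illustration of the statistics listed above, the sketch below computes the count, minimum, maximum, mean, and standard deviation of word counts for all rows and per metadata field, using only the standard library. The sample rows and the `summarize` helper are hypothetical; the notebook itself relies on `utils.makeDescribeDf`, whose implementation is not shown here.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (field, word_count) pairs standing in for rows of the descriptions table
rows = [
    ("Title", 5),
    ("Title", 7),
    ("Scope and Contents", 40),
    ("Scope and Contents", 60),
    ("Scope and Contents", 50),
]

def summarize(values):
    """Return count, min, max, mean, and sample standard deviation of a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        # stdev needs at least two values, so report 0.0 for a single value
        "std": stdev(values) if len(values) > 1 else 0.0,
    }

# Group the word counts by metadata field
by_field = defaultdict(list)
for field, word_count in rows:
    by_field[field].append(word_count)

# Summarize the whole set first, then each field
stats = {"All": summarize([count for _, count in rows])}
stats.update({field: summarize(counts) for field, counts in by_field.items()})
```

The same grouping idea extends to collections (`eadid`), labels, and categories by changing the key used to group the rows.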

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [37]:
descs_path = config.crc_meta_path+"all_descriptions.csv"     # descriptions in column of CSV file

In [38]:
desc_df = pd.read_csv(descs_path, index_col=0)
desc_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [39]:
# # Remove metadata field name from each description
# new_descs = []
# descs = list(desc_df.description)
# fields = list(desc_df.field)
# i = 0
# maxI = len(descs)
# while i < maxI:
#     d, f = descs[i], fields[i]
#     to_remove = f+":\n"
#     d = d.replace(to_remove,"")
#     new_descs += [d]
#     i += 1
# assert len(new_descs) == len(descs)
# # new_descs[:10]            # Looks good

In [40]:
# # Update the CSV file
# desc_df.description = new_descs
# desc_df.head()
# desc_df.to_csv(descs_path)

In [42]:
# Write each description to a txt file named with desc_id
ids = list(desc_df.desc_id)
descs = list(desc_df.description)
desc_txt_dir = config.crc_meta_path+"descriptions_brat/"
utils.strToTxt(ids, descs, "description", desc_txt_dir)
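
The body of `utils.strToTxt` is not shown in this notebook. As a hedged sketch of what such a helper might do, the hypothetical `write_descriptions` function below writes one text file per description, named with a prefix and the description ID so that the files match the `description<digits>.txt` pattern read by the corpus reader in the next cell.

```python
import tempfile
from pathlib import Path

def write_descriptions(ids, descriptions, prefix, out_dir):
    """Write each description to <out_dir>/<prefix><id>.txt and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for desc_id, text in zip(ids, descriptions):
        path = out_dir / f"{prefix}{desc_id}.txt"
        path.write_text(text, encoding="utf8")
        paths.append(path)
    return paths

# Hypothetical sample data written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    written = write_descriptions([0, 1], ["First description.", "Second one."], "description", tmp)
    names = sorted(p.name for p in written)
    first_text = written[0].read_text(encoding="utf8")
```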

In [None]:
corpus = PlaintextCorpusReader(desc_txt_dir, r"description\d+\.txt", encoding="utf8")
# print(len(corpus.fileids()), desc_df.shape[0])  # Looks good
print(corpus.fileids()[-20:]) # Looks good

#### Length per Description

In [None]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

In [None]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good
# len(desc_sents[2]) # 14
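
To make the counting rule concrete (words are alphanumeric tokens, punctuation is excluded), here is a minimal standard-library sketch. It is only an approximation of the NLTK-based `utils.getWordsSents`: the regular expressions and the sample text are assumptions, and the naive sentence splitter will miscount abbreviations such as "Jan." that NLTK's `sent_tokenize` handles better.

```python
import re

def count_words(text):
    """Count runs of letters and digits; punctuation and underscores are ignored."""
    return len(re.findall(r"[^\W_]+", text))

def count_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])

# Hypothetical description text
text = "The society was founded in 1820. Its records include 3 minute books!"
n_words = count_words(text)
n_sents = count_sentences(text)
```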

In [None]:
desc_df.insert(len(desc_df.columns), "word_count", word_count)
desc_df.insert(len(desc_df.columns), "sent_count", sent_count)
desc_df.head()

In [None]:
desc_df.to_csv("../data/analysis_data/descriptions_with_counts.csv")  # write a new CSV file with the word and sentence counts

In [None]:
desc_df = desc_df.reset_index()
desc_df.head(1)

In [None]:
desc_df_stats = utils.makeDescribeDf("All", desc_df)
desc_df_stats

#### Lengths per Metadata Field

In [None]:
field = "Biographical / Historical"
bh_stats = utils.makeDescribeDf(field, desc_df)
bh_stats

In [None]:
field = "Scope and Contents"
sc_stats = utils.makeDescribeDf(field, desc_df)
sc_stats

In [None]:
field = "Processing Information"
pi_stats = utils.makeDescribeDf(field, desc_df)
pi_stats

In [None]:
field = "Title"
t_stats = utils.makeDescribeDf(field, desc_df)
t_stats

#### Combine the Statistics

In [None]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

In [None]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

#### Prepare data for visualization in Observable

In [None]:
df_descs = pd.read_csv("../data/analysis_data/descriptions_with_counts.csv", index_col=0)
df_descs.head()

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>
## 2. Offsets of Descriptions

**Get the start and end offset of every description so that automated labels can be exported as .ann files for visualization with brat.**

The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (which is also the number of characters in the document preceding the beginning of the annotated text span)
* The **end offset** is the index of the character *after the annotated text span* (which means the end offset corresponds to the character immediately following the annotated text span)
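
These two definitions match Python's slice bounds, which the following sketch illustrates. The document text is a made-up example, not a file from the dataset.

```python
# Hypothetical document text with a field name on the first line
document = "Title:\nRecords of a society\n"
span_text = "Records"

start = document.index(span_text)   # index of the first character of the span
end = start + len(span_text)        # index of the character just after the span

# The end offset is exclusive, so slicing with (start, end) returns exactly the span
extracted = document[start:end]
```

Because the end offset is exclusive, `end - start` always equals the length of the annotated text.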

This means that the start offset of the first description of each document will be 0 and the end offset of the last description of each document will equal the length (number of characters) of the document. There are multiple descriptions for each document, so we need to determine the intermediate start and end offsets as well, which we'll add as columns to the file `../data/crc_metadata/all_descriptions.csv`.
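
One way to find the intermediate offsets is to search for each description in document order, resuming each search where the previous description ended. The sketch below is an assumption about the approach, not the code of `utils.getDescriptionsInFiles`; the `locate_descriptions` helper and the sample document layout are hypothetical.

```python
def locate_descriptions(document, descriptions):
    """Return the (start, end) offsets of each description, searched in document order."""
    offsets = []
    cursor = 0
    for desc in descriptions:
        start = document.find(desc, cursor)
        if start == -1:
            raise ValueError(f"Description not found: {desc!r}")
        end = start + len(desc)
        offsets.append((start, end))
        # Resume after this description so repeated text maps to a later occurrence
        cursor = end
    return offsets

# Hypothetical document in which two fields contain identical text
document = "Title:\nLetter\n\nScope and Contents:\nLetter\n"
offsets = locate_descriptions(document, ["Letter", "Letter"])
```

Moving the cursor forward matters whenever two descriptions in the same file contain identical text, since a plain `find` from position 0 would return the first occurrence both times.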

In [2]:
file_type = ".txt"  # Read in only the PlainText files

In [3]:
filenames = os.listdir(config.doc_path)
filenames = [f for f in filenames if f.endswith(file_type)] # the descriptions are in the txt files
# print(filenames[:3])

In [4]:
descs_details = utils.getDescriptionsInFiles(config.doc_path, filenames)

Great!  Now create a DataFrame of the description data:

In [5]:
ids_col = list(descs_details.keys())
desc_col, field_col, file_col, eadid_col, start_offset_col, end_offset_col = [], [], [], [], [], []
for desc_id in ids_col:
    desc_dict = descs_details[desc_id]
    
    eadid = (re.findall(r"^.*(?=_\d+\.txt)", desc_dict["file"]))[0]
    eadid_col += [eadid]
    
    field_col += [desc_dict["field"]]
    
    file_col += [desc_dict["file"]]
    
    desc_col += [desc_dict["description"]]
    
    start_offset_col += [desc_dict["start_offset"]]
    end_offset_col += [desc_dict["end_offset"]]

new_descs_df = pd.DataFrame({
    "desc_id":ids_col, "eadid":eadid_col, "field":field_col, "file":file_col, 
    "description":desc_col, "desc_start_offset":start_offset_col, "desc_end_offset":end_offset_col
})

new_descs_df.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223
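
The `eadid` values above come from a lookahead regular expression that keeps everything before the final `_<digits>.txt` suffix of a filename. The sketch below isolates that step in a hypothetical helper, `eadid_from_filename`, and adds an end-of-string anchor as an extra safeguard that the notebook's own pattern does not have.

```python
import re

def eadid_from_filename(filename):
    """Return the part of the filename before the final '_<digits>.txt' suffix."""
    match = re.search(r"^.*(?=_\d+\.txt$)", filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename!r}")
    return match.group(0)

# Filenames taken from the tables in this notebook
filenames = ["Coll-227_00100.txt", "La_03600.txt", "EUA_IN1_56700.txt"]
eadids = [eadid_from_filename(name) for name in filenames]
```

The greedy `.*` combined with the lookahead means an identifier that itself contains an underscore, such as `EUA_IN1`, is kept whole.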


Write the data to a CSV file:

In [6]:
new_descs_df.to_csv(config.crc_meta_path+"descs_with_offsets.csv")
# new_descs_df = pd.read_csv(config.crc_meta_path+"descs_with_offsets.csv", index_col=0)
# new_descs_df.head()

**Now assign description IDs from this DataFrame to the aggregated annotated datasets:**

In [9]:
df_grouped = pd.read_csv(config.docc_path+"desc_field_descid_label_eadid.csv", index_col=0)
# df_grouped.head()

In [11]:
merge_cols = ["description", "field", "eadid", "file", "desc_start_offset", "desc_end_offset"]
df_merged = df_grouped.merge(new_descs_df, on=merge_cols)
df_merged.head()

Unnamed: 0,eadid,file,desc_id_x,field,description,label,desc_start_offset,desc_end_offset,desc_id_y
0,Coll-1320,Coll-1320_00400.txt,3247,Title,'Effect of an inhibitor of 3ß-hydroxysteroid d...,"{'Masculine', 'Unknown'}",5040.0,5281.0,19370
1,Coll-146,Coll-146_28000.txt,11317,Scope and Contents,"3 photographs : negative, col.Sent from: [Cap ...",{'Unknown'},4731.0,4810.0,63842
2,Coll-146,Coll-146_20500.txt,9233,Scope and Contents,"4 photographs : negative, col.. 1 stripSent fr...",{'Unknown'},4387.0,4592.0,66124
3,Coll-1130,Coll-1130_00100.txt,1465,Biographical / Historical,"A collection of copied letters, mainly from th...",{'Unknown'},1262.0,1434.0,20367
4,Coll-1143,Coll-1143_00100.txt,1493,Biographical / Historical,Alexander Herbert Main studied Law at Edinburg...,"{'Gendered-Pronoun', 'Stereotype', 'Masculine'...",1170.0,1559.0,48433


Great!  Now we want to keep the *right* DataFrame's description IDs, so we'll drop `desc_id_x` and remove the `_y` from `desc_id_y`:

In [12]:
df_merged = df_merged.drop(columns=["desc_id_x"])
df_merged = df_merged.rename(columns={"desc_id_y":"desc_id"})
df_merged.head()

Unnamed: 0,eadid,file,field,description,label,desc_start_offset,desc_end_offset,desc_id
0,Coll-1320,Coll-1320_00400.txt,Title,'Effect of an inhibitor of 3ß-hydroxysteroid d...,"{'Masculine', 'Unknown'}",5040.0,5281.0,19370
1,Coll-146,Coll-146_28000.txt,Scope and Contents,"3 photographs : negative, col.Sent from: [Cap ...",{'Unknown'},4731.0,4810.0,63842
2,Coll-146,Coll-146_20500.txt,Scope and Contents,"4 photographs : negative, col.. 1 stripSent fr...",{'Unknown'},4387.0,4592.0,66124
3,Coll-1130,Coll-1130_00100.txt,Biographical / Historical,"A collection of copied letters, mainly from th...",{'Unknown'},1262.0,1434.0,20367
4,Coll-1143,Coll-1143_00100.txt,Biographical / Historical,Alexander Herbert Main studied Law at Edinburg...,"{'Gendered-Pronoun', 'Stereotype', 'Masculine'...",1170.0,1559.0,48433
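
Conceptually, the merge above is a lookup on a composite key: rows match when the file, field, and offsets all agree, and the ID from the offsets table replaces the old one. A minimal standard-library sketch of that idea follows; all rows and the `key` helper are hypothetical.

```python
# Hypothetical rows: an annotated description carrying a stale ID, and the new offsets table
annotated_rows = [
    {"file": "a_00100.txt", "field": "Title", "start": 7, "end": 13, "desc_id": 900},
    {"file": "b_00100.txt", "field": "Title", "start": 0, "end": 5, "desc_id": 901},
]
offset_rows = [
    {"file": "a_00100.txt", "field": "Title", "start": 7, "end": 13, "desc_id": 0},
    {"file": "a_00100.txt", "field": "Scope and Contents", "start": 35, "end": 41, "desc_id": 1},
]

def key(row):
    """Composite key identifying a description by file, field, and offsets."""
    return (row["file"], row["field"], row["start"], row["end"])

new_id_by_key = {key(row): row["desc_id"] for row in offset_rows}

merged = []
for row in annotated_rows:
    k = key(row)
    if k in new_id_by_key:  # inner join: rows without a match are dropped
        merged.append({**row, "desc_id": new_id_by_key[k]})
```

Like pandas' default inner merge, this silently drops annotated rows that have no match, so comparing row counts before and after the merge is a cheap sanity check.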


Update the file:

In [13]:
df_merged.to_csv(config.agg_path+"desc_field_descid_label_eadid.csv")

<a id="3"></a>

## 3. Offsets of Tokens

In [14]:
df_desc = pd.read_csv(config.crc_meta_path+"descs_with_offsets.csv", index_col=0)
df_desc.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223


In [15]:
df_desc.loc[df_desc.description.isna()]
# df_desc.loc[df_desc.description == "N/A"]
# df_desc.loc[df_desc.file == "EUA_IN1_56700.txt"]

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
57361,57361,EUA_IN1,Scope and Contents,EUA_IN1_56700.txt,,6137,6141


In [16]:
df_desc.description = df_desc.description.fillna("N/A")
df_desc.loc[df_desc.desc_id == 57361]

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
57361,57361,EUA_IN1,Scope and Contents,EUA_IN1_56700.txt,,6137,6141


Write the corrected description to the `descs_with_offsets.csv` file:

In [17]:
df_desc.to_csv(config.crc_meta_path+"descs_with_offsets.csv")

Get the offsets of the tokens in every description:

In [18]:
descs = list(df_desc.description)
desc_ids = list(df_desc.desc_id)
desc_start_offsets = list(df_desc.desc_start_offset)
desc_end_offsets = list(df_desc.desc_end_offset)

In [20]:
tokens_dict, offsets_dict = utils.getTokensAndOffsetsFromStrings(descs, desc_ids, desc_start_offsets, desc_end_offsets)
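
The implementation of `utils.getTokensAndOffsetsFromStrings` is not shown here. As a hedged sketch of the underlying idea, the hypothetical function below tokenizes a description with a simple regular expression and shifts each token's position by the description's start offset, so that the offsets refer to the whole document. The regex is a simplification: an NLTK tokenizer would treat cases such as `Jan.` or `'s` differently.

```python
import re

def tokens_with_offsets(description, desc_start_offset):
    """Return tokens and their (start, end) offsets relative to the whole document."""
    tokens, offsets = [], []
    # Match runs of word characters, or single punctuation marks
    for match in re.finditer(r"\w+|[^\w\s]", description):
        tokens.append(match.group(0))
        offsets.append((desc_start_offset + match.start(), desc_start_offset + match.end()))
    return tokens, offsets

# First five tokens of description 0, which starts at offset 29 in its file
tokens, offsets = tokens_with_offsets("Records of the Phrenological Society", 29)
```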

In [21]:
tokens_col, offsets_col, desc_ids_col = [], [], []
for desc_id,token_list in tokens_dict.items():
    tokens_col += token_list
    offsets_list = offsets_dict[desc_id]
    offsets_col += offsets_list
    assert len(token_list) == len(offsets_list)
    desc_ids_col += [desc_id]*len(token_list)

assert len(tokens_col) == len(offsets_col)
assert len(tokens_col) == len(desc_ids_col)

In [22]:
for col_list in [tokens_col, offsets_col, desc_ids_col]:
    print(col_list[0:5])

['Records', 'of', 'the', 'Phrenological', 'Society']
[(29, 36), (37, 39), (40, 43), (44, 57), (58, 65)]
[0, 0, 0, 0, 0]


Looks good!  Now create a DataFrame with these lists as columns:

In [23]:
df_tokens = pd.DataFrame({"desc_id":desc_ids_col, "token":tokens_col, "offsets":offsets_col})
df_tokens.head()

Unnamed: 0,desc_id,token,offsets
0,0,Records,"(29, 36)"
1,0,of,"(37, 39)"
2,0,the,"(40, 43)"
3,0,Phrenological,"(44, 57)"
4,0,Society,"(58, 65)"


Great!  Now write the DataFrame to a file:

In [24]:
df_tokens.to_csv(config.agg_path+"descid_token_offsets.csv")

***
***
***
# DELETE CODE BELOW (MOVING TO NEW NB)

<a id="3.1"></a>
### 3.1 BIO Tags

**Compare the descriptions' tokens' offsets to the annotated text spans' offsets to determine which tokens to mark as the beginning of an annotation (`B-[LABELNAME]`), inside an annotation (`I-[LABELNAME]`), and unannotated, or outside of an annotation (`O`).**
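
The comparison can be sketched with plain tuples, as below. This is an illustrative simplification and not the notebook's implementation: the `bio_tags` helper is hypothetical, a token gets `O` only when no span contains it, and the `Example` label is made up to show an `I-` tag. The first two spans mirror the annotations of description 167 shown later in this notebook.

```python
def bio_tags(token_offsets, spans):
    """Tag tokens given (start, end) token offsets and (start, end, label) spans.

    Returns one set of tags per token, because spans with different labels may overlap.
    """
    all_tags = []
    for tok_start, tok_end in token_offsets:
        tags = set()
        for span_start, span_end, label in spans:
            # The token must lie entirely within the annotated span
            if tok_start >= span_start and tok_end <= span_end:
                prefix = "B-" if tok_start == span_start else "I-"
                tags.add(prefix + label)
        all_tags.append(tags or {"O"})
    return all_tags

# Tokens: 'Brick', 'Burning', ',', 'Beardman', "'s"
token_offsets = [(1421, 1426), (1427, 1434), (1434, 1435), (1436, 1444), (1444, 1446)]
spans = [
    (1436, 1444, "Omission"),
    (1436, 1444, "Unknown"),
    (1421, 1434, "Example"),  # hypothetical multi-token span
]
tags = bio_tags(token_offsets, spans)
```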

In [None]:
# TO DO: convert the three dataframes to dictionaries, 
#        for each filename, check whether each token_offset pair is contained within each ann_offset pair and desc_,
#        recording which description (using indices) each annotation appears within

In [12]:
df_tokens = pd.read_csv(config.tokc_path+"descid_token_offsets.csv", index_col=0)
token_desc_ids = list(df_tokens.desc_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.offsets)
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
token_offsets_tuples[:5]  # Looks good

[(29, 36), (37, 39), (40, 43), (44, 57), (58, 65)]

Associate description tokens and annotated text spans' text and offsets to description IDs.

In [31]:
df_tokens_imploded = utils.implodeDataFrame(df_tokens, ["desc_id"])
df_tokens_imploded = df_tokens_imploded.rename(columns={"offsets":"token_offsets"})
df_tokens_imploded.head()

Unnamed: 0_level_0,token,token_offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ..."
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141..."


In [35]:
df_tokens_imploded.to_csv(config.tokc_path+"token_data_imploded.csv")

Load the data associating description and annotation IDs to offsets.

In [32]:
df_descs_imploded = pd.read_csv(config.agg_path+"description_data_imploded.csv", index_col=0)
df_descs_imploded.head()

Unnamed: 0_level_0,eadid,desc_id,desc_offsets
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BAI_01000,['BAI'],[68],"[(1290, 1315)]"
BAI_01300,['BAI'],[143],"[(5853, 5983)]"
BAI_01600,['BAI'],[221],"[(5967, 6202)]"
BAI_01900,['BAI'],[292],"[(5297, 5506)]"
BAI_02200,['BAI'],[361],"[(15180, 15419)]"


In [33]:
df_anns_imploded = pd.read_csv(config.agg_path+"annotation_data_imploded.csv", index_col=0)
df_anns_imploded.head()

Unnamed: 0_level_0,agg_ann_id,ann_offsets
filename,Unnamed: 1_level_1,Unnamed: 2_level_1
AA5_00100,"[14377, 14378, 14379, 14380, 14381, 14382, 143...","['(789, 791)', '(871, 873)', '(913, 916)', '(9..."
AA6_00100,"[55, 9516, 9517, 9518, 9519, 9520, 9521, 9522,...","['(1778, 1790)', '(677, 679)', '(920, 922)', '..."
AA7_00100,"[127, 13987, 13988, 13989, 13990, 13991, 13992...","['(2399, 2415)', '(505, 508)', '(614, 620)', '..."
BAI_00100,"[17473, 17474, 17475, 17476, 17477, 17478, 416...","['(371, 388)', '(393, 405)', '(34, 56)', '(102..."
BAI_00200,"[20496, 20497, 20498, 20499, 20500, 20501, 205...","['(215, 221)', '(226, 232)', '(250, 255)', '(2..."


**Step 1: O tags**

Compare description IDs in the two DataFrames above to determine which descriptions (from `df_tokens_imploded`) do not have annotations, and assign all those descriptions' tokens an `O` tag (for *outside* of an annotation).

In [40]:
all_desc_ids = list(df_tokens_imploded.index)
ann_desc_ids = list(df_merged_imploded.index)
unannotated = [desc_id for desc_id in all_desc_ids if desc_id not in ann_desc_ids]
print("Rows to assign tag 'O':", len(unannotated))

Rows to assign tag 'O': 86742
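
With tens of thousands of description IDs, testing membership against a list rescans the list for every ID. A hedged alternative, shown on hypothetical sample IDs and tokens, is to hold the annotated IDs in a set and then build the `O` tags for the unannotated descriptions.

```python
# Hypothetical IDs: every description, and the subset with at least one annotation
all_desc_ids = [0, 1, 2, 167, 508]
ann_desc_ids = {167, 508}  # a set makes each membership test constant time on average

unannotated = [d for d in all_desc_ids if d not in ann_desc_ids]
annotated = [d for d in all_desc_ids if d in ann_desc_ids]

# Every token of an unannotated description gets the tag "O"
tokens_by_id = {0: ["Records", "of", "the"], 1: ["The", "records"], 2: ["Letter"]}
o_tags = {d: ["O"] * len(tokens_by_id[d]) for d in unannotated}
```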


In [48]:
o_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(unannotated)]
assert o_df.shape[0] == len(unannotated)

In [50]:
tokens_list = list(o_df.token)
tags = [["O"]*len(tokens) for tokens in tokens_list]
assert len(tags) == len(tokens_list)
o_df.insert(len(o_df.columns), "ann_tag", tags)
o_df.head()

Unnamed: 0_level_0,token,offsets,ann_tag
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ...","[O, O, O, O, O, O, O]"
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [53]:
assert len(o_df.token[100]) == len(o_df.ann_tag[100])
assert len(o_df.token[488]) == len(o_df.ann_tag[488])
assert len(o_df.token[0]) == len(o_df.ann_tag[0])

**Step 2: B- and I- tags**

For description IDs that do have annotations (and thus are in `df_merged_imploded`), assign their tokens tags of `B-[LABELNAME]` and `I-[LABELNAME]` for *beginning* and *inside* of an annotation, replacing `[LABELNAME]` with the name of the annotation's label.

In [41]:
annotated = [desc_id for desc_id in all_desc_ids if desc_id in ann_desc_ids]
print("Rows to assign 'B-' or 'I-':", len(annotated))

Rows to assign 'B-' or 'I-': 1855


In [54]:
bi_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(annotated)]
assert bi_df.shape[0] == len(annotated)

In [55]:
bi_df.head()

Unnamed: 0_level_0,token,offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
167,"[Brick, Burning, ,, Beardman, 's]","[(1421, 1426), (1427, 1434), (1434, 1435), (14..."
508,"[Interpreting, sequence, motifs, [, Letter, to...","[(3064, 3076), (3077, 3085), (3086, 3092), (30..."
610,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
611,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
640,"[Lady, Luck, :, the, theory, of, probability, ...","[(2118, 2122), (2123, 2127), (2127, 2128), (21..."


In [57]:
bi_dict = bi_df.to_dict('index')
print(bi_dict[167])

{'token': ['Brick', 'Burning', ',', 'Beardman', "'s"], 'offsets': ['(1421, 1426)', '(1427, 1434)', '(1434, 1435)', '(1436, 1444)', '(1444, 1446)']}


In [59]:
ann_dict = df_merged_imploded.to_dict('index')
print(ann_dict[167])

{'offsets_ann': ['(1436, 1444)', '(1436, 1444)'], 'text_ann': ['Beardman', 'Beardman'], 'label': ['Omission', 'Unknown'], 'id': [31928, 31929]}


In [78]:
# Turn a string of offsets into a tuple with each offset of type int
# "(1436, 1444)" --> (1436, 1444)
def offsetsStrToTuple(offsets_str):
    offsets_list = offsets_str[1:-1].split(", ")
    offsets_ints = [int(o) for o in offsets_list]
    return tuple((offsets_ints))

assert type(offsetsStrToTuple('(1436, 1444)')) == tuple
assert type(offsetsStrToTuple('(1436, 1444)')[0]) == int
assert type(offsetsStrToTuple('(1436, 1444)')[1]) == int
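
An alternative to slicing and splitting the string by hand is `ast.literal_eval` from the standard library, which safely parses Python literals. The sketch below wraps it in a hypothetical `parse_offsets` helper that also validates the result; it is offered as an option, not as a replacement the notebook requires.

```python
import ast

def parse_offsets(offsets_str):
    """Parse a string such as '(1436, 1444)' into a tuple of two ints."""
    value = ast.literal_eval(offsets_str)
    is_int_pair = (
        isinstance(value, tuple)
        and len(value) == 2
        and all(isinstance(item, int) for item in value)
    )
    if not is_int_pair:
        raise ValueError(f"Expected a pair of ints, got: {offsets_str!r}")
    return value

pair = parse_offsets("(1436, 1444)")
```

Unlike the manual split, this version raises a clear error when the stored string is not a pair of integers.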

In [101]:
desc_ids = list(bi_dict.keys())[:100]  # START WITH SAMPLE
assert len(set(desc_ids)) == len(desc_ids)  # Make sure every description ID is unique
log = 0
descid_to_tag = dict.fromkeys(desc_ids)
for desc_id in desc_ids:
    text_spans = ann_dict[desc_id]["text_ann"]
    desc_tokens = bi_dict[desc_id]['token']
    desc_tokens_offsets = bi_dict[desc_id]['offsets']
    desc_tags = []
    for i,desc_token in enumerate(desc_tokens):
        token_offset_pair = offsetsStrToTuple(desc_tokens_offsets[i])
        span_indeces, tags = [], []  # Note: one token may have multiple tags
        
        # Record the indices of every item in text_spans that contains desc_token
        for j,text_span in enumerate(text_spans):
            span_offset_pair = offsetsStrToTuple(ann_dict[desc_id]["offsets_ann"][j])    
            # Be sure a matching token's offsets are within the annotated text span
            if (desc_token in text_span
               ) and (
                token_offset_pair[0] >= span_offset_pair[0]
                ) and (
                token_offset_pair[1] <= span_offset_pair[1]):
                    span_indeces += [j] 
            else:
                span_indeces += ["unannotated"]
        for j in span_indeces:
            # If the token is annotated, assign it a B- or I- tag with a label
            if type(j) == int:
                # Look up the offsets of span j (not those of the last span compared above)
                span_offset_pair = offsetsStrToTuple(ann_dict[desc_id]["offsets_ann"][j])
                # If the start offsets are the same, assign a 'B-' tag
                if token_offset_pair[0] == span_offset_pair[0]:
                    tags += ['B-'+ann_dict[desc_id]["label"][j]]
                # Otherwise, assign an 'I-' tag
                else:
                    tags += ['I-'+ann_dict[desc_id]["label"][j]]
            # If the description token isn't annotated, assign it an O tag
            elif j == "unannotated":
                tags += ["O"]
            else:
                raise ValueError("Invalid j value: {}".format(j))
        
        desc_tags += [set(tags)]
    
    assert len(desc_tokens) == len(desc_tags)
    descid_to_tag[desc_id] = desc_tags
    
    log += 1
    if log % 100 == 0:
        print("Assigned tags for {} descriptions".format(log))

Assigned tags for 100 descriptions


In [109]:
did = 610 #508 #167
# print(ann_dict[did])
print(bi_dict[did])
# print(descid_to_tag[did])

# spans = ['Beardman', 'Beardman']
# spans2 = ["Brick Burning"]
# tokens = ['Brick', 'Burning', ',', 'Beardman', "'s"]
# # print(spans.count('Beardman'))
# # # print(spans.index('Beardman'))
# # # print(tokens.index('Beardman'))
# # for k in range(0,3):
# #     print(k)
# indeces = [index for index in range(len(spans)) if spans[index] == 'Beardman']
# print(indeces)

{'token': ['Letter', ':', ':', 'Koestler', ',', 'Arthur'], 'offsets': ['(127, 133)', '(134, 135)', '(135, 136)', '(137, 145)', '(145, 146)', '(147, 153)']}


In [4]:
# is_annotated_col = []
# annotated_id = []
# i, maxI = 0, len(token_desc_ids)  #1188478, 1189478
# while i < maxI:
#     desc_id = token_desc_ids[i]
#     token = tokens[i]
#     token_start, token_end = token_offsets_tuples[i][0], token_offsets_tuples[i][1] 
    
#     ann_df = df_merged.loc[df_merged.desc_id == desc_id]
#     ann_id_list = list(ann_df.id)
#     ann_offsets_list = list(ann_df.offsets_ann)
#     ann_offsets_clean = [ann_offsets[1:-1].split(", ") for ann_offsets in ann_offsets_list]
#     ann_offsets_tuples = [tuple((int(ann_offsets[0]), int(ann_offsets[1]))) for ann_offsets in ann_offsets_clean]
    
#     for j,ann_offsets in enumerate(ann_offsets_tuples):
#         ann_start = ann_offsets[0]
#         ann_end = ann_offsets[1]
#         if token_start == ann_start:
#             is_annotated_col += ["B"]
#             annotated_id += [ann_id_list[j]]
#         elif (token_start > ann_start) and (token_start <= ann_end):
#             is_annotated_col += ["I"]
#             annotated_id += [ann_id_list[j]]
#         else:
#             is_annotated_col += ["O"]
#             annotated_id += ["None"]
    
#     i += 1

# assert len(is_annotated_col) == len(token_desc_ids)
# assert len(is_annotated_col) == len(annotated_id)

KeyboardInterrupt: 

In [None]:
df_tokens.insert(len(df_tokens.columns),"is_annotated",is_annotated_col)
df_tokens.insert(len(df_tokens.columns),"ann_id",annotated_id)
df_tokens.head()

In [5]:
print(len(is_annotated_col))

48022031


In [8]:
print(len(annotated_id))

48022031
